Children Book Dataset

Python scraper that builds a labeled dataset of Russian children's books — made to feed the GenreNeuro classifier


By the numbers

Labeled entries: 9,400+

Genre categories: 6

The Problem

What I was solving

GenreNeuro needed clean, labeled training data. Public children's book catalogs exist, but the data is messy: inconsistent encoding, special characters in titles, irregular HTML markup across pages, and genre labels spread across 200+ categories, far too granular to use.

My Approach

How I built it

Plain requests + BeautifulSoup4, no Scrapy overkill for a one-time job. Rate-limited to 1 request per second to be a polite guest. Handles Cyrillic encoding explicitly, strips special characters from titles, and collapses 200+ source genres into 6 training categories. Outputs clean JSON ready to feed into TensorFlow. Not architecturally interesting, and that's the point: do the boring data work well so the ML part can be simple.
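The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function names, the cp1251 fallback, the choice to replace every non-alphanumeric character with a space, and the entries in GENRE_MAP are all assumptions made for the example.

```python
import json
import re
import unicodedata
from typing import Optional

# Hypothetical mapping from source genre labels to training categories.
# The real scraper collapses 200+ labels into 6; these entries are illustrative.
GENRE_MAP = {
    "сказки": "fairy_tale",
    "народные сказки": "fairy_tale",
    "стихи": "poetry",
    "стихи для малышей": "poetry",
    "приключения": "adventure",
}


def decode_page(raw: bytes) -> str:
    """Decode page bytes explicitly: try UTF-8 first, then Windows-1251."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: legacy pages use cp1251, common on older Russian sites.
        return raw.decode("cp1251")


def clean_title(title: str) -> str:
    """Keep letters, digits, and single spaces; drop quotes and punctuation."""
    # NFKC composes characters such as "й" and "ё" so they count as letters.
    title = unicodedata.normalize("NFKC", title)
    # \w is Unicode-aware in Python 3, so Cyrillic letters survive.
    title = re.sub(r"[^\w\s]|_", " ", title)
    return re.sub(r"\s+", " ", title).strip()


def normalize_genre(label: str) -> Optional[str]:
    """Map a raw source label to a training category, or None if unmapped."""
    return GENRE_MAP.get(label.strip().lower())


def to_record(title: str, genre_label: str) -> Optional[dict]:
    """Build one dataset entry; skip books whose genre has no mapping."""
    genre = normalize_genre(genre_label)
    if genre is None:
        return None
    return {"title": clean_title(title), "genre": genre}


def dump_records(records: list) -> str:
    """Serialize entries as JSON, keeping Cyrillic readable instead of escaped."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Returning None for unmapped labels keeps the decision about what to do with unknown genres (drop, log, or review by hand) outside the cleaning code.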

Tech choices

requests + BeautifulSoup4: For a ~20k-page one-off crawl, Scrapy's framework overhead isn't worth it. Simple libraries = fewer moving parts = less to debug.
1 req/sec rate limit: Polite scraping. You don't want the site owners blocking your IP or banning your User-Agent next time.
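A 1 request per second limit needs very little code, which is part of the case for skipping a framework. The sketch below is one way to do it, assuming a hypothetical helper named fetch_politely. The fetch function is passed in so the loop stays independent of the HTTP library, and the clock and sleep parameters exist only to make the timing testable.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, bytes]]:
    """Yield (url, body) pairs, starting requests at least min_interval apart."""
    last_start = None
    for url in urls:
        if last_start is not None:
            # Sleep only for whatever part of the interval is still left.
            wait = min_interval - (clock() - last_start)
            if wait > 0:
                sleep(wait)
        last_start = clock()
        yield url, fetch(url)


# With requests, the fetch callable could look like this (the User-Agent
# string is a placeholder, not the one the scraper actually sends):
#
#   import requests
#   def fetch(url: str) -> bytes:
#       headers = {"User-Agent": "book-dataset-scraper (contact: you@example.com)"}
#       response = requests.get(url, headers=headers, timeout=10)
#       response.raise_for_status()
#       return response.content
```

At this pace a ~20k-page crawl takes about five and a half hours (20,000 seconds), which is acceptable for a job that runs once or only occasionally.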

Outcome

What came out of it

9,400+ labeled book entries in clean JSON: the training set that made GenreNeuro possible. 6 normalized genre categories instead of 200+ chaotic ones. Re-runnable when the source catalog updates. Boring, reliable, done.