Completed
data
Children Book Dataset
Python scraper that builds a labeled dataset of Russian children's books — made to feed the GenreNeuro classifier
By the numbers
9,400+
Labeled entries
6
Genre categories
The Problem
What I was solving
GenreNeuro needed clean, labeled training data. Public children's book catalogs exist, but the data is messy: inconsistent encoding, special characters in titles, irregular HTML markup across pages, and genre labels spread over 200+ categories, far too granular to use.
My Approach
How I built it
Plain requests + BeautifulSoup4, with no Scrapy overkill for a one-time job. Rate-limited to 1 request per second to be a polite guest. Handles Cyrillic encoding explicitly, strips special characters from titles, and collapses 200+ source genres into 6 training categories. Outputs clean JSON ready to feed into TensorFlow. Not architecturally interesting, and that's the point: do the boring data work well so the ML part can be simple.
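The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `windows-1251` default, the regex used for "special characters", the function names, and the entries in `GENRE_MAP` (including the category names) are all assumptions made for the example.

```python
import json
import re
import unicodedata

# Hypothetical mapping from source genre labels to training categories.
# The real project collapses 200+ labels into 6; these entries are invented.
GENRE_MAP = {
    "сказки народов мира": "fairy_tales",
    "русские народные сказки": "fairy_tales",
    "стихи для малышей": "poetry",
    "детские энциклопедии": "educational",
}


def decode_page(raw: bytes, encoding: str = "windows-1251") -> str:
    """Decode page bytes with an explicit codec (assumed here to be cp1251)."""
    return raw.decode(encoding, errors="replace")


def clean_title(title: str) -> str:
    """Normalize Unicode, drop special characters, collapse whitespace."""
    title = unicodedata.normalize("NFKC", title)
    # Keep word characters (Cyrillic included), whitespace and hyphens.
    title = re.sub(r"[^\w\s-]", " ", title)
    return re.sub(r"\s+", " ", title).strip()


def normalize_genre(source_genre: str):
    """Return the training category for a source label, or None if unmapped."""
    return GENRE_MAP.get(source_genre.strip().lower())


def build_entry(raw_title: str, source_genre: str):
    """Build one labeled record; unmapped genres are skipped (returns None)."""
    genre = normalize_genre(source_genre)
    if genre is None:
        return None
    return {"title": clean_title(raw_title), "genre": genre}


def to_json(entries) -> str:
    """Serialize records, keeping Cyrillic readable instead of \\u escapes."""
    return json.dumps(entries, ensure_ascii=False, indent=2)
```

Keeping each step a small pure function means the mapping table is the only part that needs editing when the source catalog adds or renames genres.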
Tech choices
- requests + BeautifulSoup4: For a ~20k-page one-off crawl, Scrapy's framework overhead isn't worth it. Simple libraries = fewer moving parts = less to debug.
- 1 req/sec rate limit: Polite scraping. You don't want the site owners to block your IP or ban your User-Agent next time.
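One way to enforce a 1 request per second limit is a small limiter that sleeps off whatever remains of the interval before each fetch. The sketch below is an assumption about how this could look, not the project's code: `RateLimiter`, `crawl`, and the injectable `fetch`, `clock`, and `sleep` parameters are hypothetical names chosen so the logic can be exercised without touching the network.

```python
import time


class RateLimiter:
    """Enforce a minimum interval (in seconds) between successive calls."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None        # time of the previous call, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl(urls, fetch, limiter):
    """Fetch each URL in order, pausing via the limiter before every request."""
    pages = []
    for url in urls:
        limiter.wait()
        pages.append(fetch(url))
    return pages
```

In a real run, `fetch` could be something like `lambda url: requests.get(url, timeout=10).content`, with the returned bytes decoded explicitly and then handed to BeautifulSoup for parsing.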
Outcome
What came out of it
9,400+ labeled book entries in clean JSON: the training set that made GenreNeuro possible. 6 normalized genre categories instead of 200+ chaotic source labels. Re-runnable when the source catalog updates. Boring, reliable, done.