Completed
data

Children Book Dataset

Python scraper that builds a labeled dataset of Russian children's books — made to feed the GenreNeuro classifier

By the numbers

9,400+

Labeled entries

6

Genre categories

The Problem

What I was solving

GenreNeuro needed clean, labeled training data. Public children's book catalogs exist, but the data is messy: inconsistent encoding, special characters in titles, irregular HTML markup across pages, and genre labels spread across 200+ categories, far too granular to use.

My Approach

How I built it

Plain requests + BeautifulSoup4, no Scrapy overkill for a one-time job. Rate-limited to 1 request per second to be a polite guest. Handles Cyrillic encoding explicitly, strips special characters from titles, and collapses 200+ source genres into 6 training categories. Outputs clean JSON ready to feed into TensorFlow. Not architecturally interesting, and that's the point: do the boring data work well so the ML part can be simple.
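A minimal sketch of the cleaning and genre-collapsing steps, not the project's actual code. The names `GENRE_MAP`, `FALLBACK`, `clean_title`, `collapse_genre`, `to_record`, and `dump_records` are hypothetical. The source genres and target categories shown are placeholders, since the real 200+ genres and the 6 categories are not listed here.

```python
import json
import re
import unicodedata

# Hypothetical mapping from source genres to training categories.
# Both the keys and the category names are placeholders for illustration.
GENRE_MAP = {
    "сказки народов мира": "fairy_tales",
    "волшебные сказки": "fairy_tales",
    "стихи для малышей": "poetry",
    "детские детективы": "adventure",
}
FALLBACK = "other"  # assumed bucket for genres missing from the map


def clean_title(raw: str) -> str:
    """Normalize Unicode, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    # Keep word characters (Cyrillic included), whitespace, and hyphens.
    text = re.sub(r"[^\w\s-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def collapse_genre(source_genre: str) -> str:
    """Map a fine-grained source genre onto one coarse training category."""
    return GENRE_MAP.get(source_genre.strip().lower(), FALLBACK)


def to_record(raw_title: str, source_genre: str) -> dict:
    """Build one labeled entry from a scraped title and genre."""
    return {"title": clean_title(raw_title), "genre": collapse_genre(source_genre)}


def dump_records(records: list) -> str:
    """Serialize entries as JSON, keeping Cyrillic readable instead of escaped."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

Passing `ensure_ascii=False` keeps Cyrillic titles human-readable in the output file. Without it, `json.dumps` writes them as `\uXXXX` escapes, which is still valid JSON but harder to inspect by eye.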

Tech choices

  • requests + BeautifulSoup4: For a ~20k-page one-off crawl, Scrapy's framework overhead isn't worth it. Simple libraries = fewer moving parts = less to debug.
  • 1 req/sec rate limit: Polite scraping. You don't want the site owners to block your IP or ban your User-Agent next time.

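The two choices above could be combined roughly like this. It is a sketch under assumptions, not the project's fetcher: `Throttle` and `fetch_html` are hypothetical names, the 10-second timeout is arbitrary, and the default `utf-8` encoding is a guess (older Russian sites often serve windows-1251 instead). The `session` argument is expected to behave like `requests.Session`.

```python
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # timestamp of the previous call, if any

    def wait(self):
        """Block until at least `min_interval` has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_html(session, url, throttle, encoding="utf-8"):
    """Fetch one page politely and decode it with an explicit encoding.

    `session` should offer the requests.Session interface; the default
    encoding is an assumption and may need to be "windows-1251".
    """
    throttle.wait()
    response = session.get(url, timeout=10)
    response.raise_for_status()
    response.encoding = encoding  # set explicitly rather than trusting headers
    return response.text
```

In real use, `session` would be a `requests.Session()` and the returned string would go to `BeautifulSoup(html, "html.parser")`. Injecting `clock` and `sleep` keeps the throttle testable without actually waiting.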
Outcome

What came out of it

9,400+ labeled book entries in clean JSON: the training set that made GenreNeuro possible. 6 normalized genre categories instead of 200+ chaotic source genres. Re-runnable when the source catalog updates. Boring, reliable, done.