Completed
data
Diafilm Parts Finder
Python tool for RGDB library: finds multi-part Soviet slidefilms in a messy Excel catalog — fuzzy matching, parallel processing, 5-sheet audit report
By the numbers
0
Records processed
0
Runtime (min)
0
Excel report sheets
The Problem
What I was solving
Soviet-era slidefilms (diafilms) come in multi-part series, but the library databases storing them have no standard for marking parts. Youll see "часть 2", "pt. II", "2", or nothing at all for the same structural concept. Before migrating 100k+ records into a proper database, someone had to untangle which films belong together — manually that would take months.
My Approach
How I built it
A pipeline in plain Python — no web interface, no framework. Stage one: parse out part markers (numeric, Cyrillic text, Roman numerals) with regex. Stage two: rapidfuzz for fuzzy string matching to cluster titles that look like they belong together. Stage three: multi-threaded processing to handle 100k records in a few minutes instead of an hour. Output is a 5-sheet Excel report: grouped series, detected duplicates, unpaired parts, and normalization steps — librarians can audit the groupings, not just trust the script.
Tech choices
- rapidfuzz— Chose over python-Levenshtein for speed — C++ core runs 5-10x faster on 100k×100k pairwise comparisons.
- Multi-threading— Fuzzy matching is CPU-bound and embarrassingly parallel. 8 threads turn an hour into minutes on a laptop.
- openpyxl 5-sheet report— Librarians live in Excel, not JSON. Separate sheets for groupings, duplicates, unpaired parts, and normalization let them audit without running code.
Outcome
What came out of it
100,000 records processed in 3-5 minutes. Multi-part series correctly grouped with high precision (the Excel audit sheet let librarians spot-check and confirm). Migration into the new database went from "months of manual work" to "run the script, review the Excel, approve". One-off tool that did its job and retired.