Diafilm Parts Finder

Python tool for RGDB library: finds multi-part Soviet slidefilms in a messy Excel catalog — fuzzy matching, parallel processing, 5-sheet audit report

View Source

By the numbers

Records processed

Runtime (min)

Excel report sheets

The Problem

What I was solving

Soviet-era slidefilms (diafilms) come in multi-part series, but the library databases storing them have no standard for marking parts. Youll see "часть 2", "pt. II", "2", or nothing at all for the same structural concept. Before migrating 100k+ records into a proper database, someone had to untangle which films belong together — manually that would take months.

My Approach

How I built it

A pipeline in plain Python — no web interface, no framework. Stage one: parse out part markers (numeric, Cyrillic text, Roman numerals) with regex. Stage two: rapidfuzz for fuzzy string matching to cluster titles that look like they belong together. Stage three: multi-threaded processing to handle 100k records in a few minutes instead of an hour. Output is a 5-sheet Excel report: grouped series, detected duplicates, unpaired parts, and normalization steps — librarians can audit the groupings, not just trust the script.

Tech choices

rapidfuzz— Chose over python-Levenshtein for speed — C++ core runs 5-10x faster on 100k×100k pairwise comparisons.
Multi-threading— Fuzzy matching is CPU-bound and embarrassingly parallel. 8 threads turn an hour into minutes on a laptop.
openpyxl 5-sheet report— Librarians live in Excel, not JSON. Separate sheets for groupings, duplicates, unpaired parts, and normalization let them audit without running code.

Outcome

What came out of it

100,000 records processed in 3-5 minutes. Multi-part series correctly grouped with high precision (the Excel audit sheet let librarians spot-check and confirm). Migration into the new database went from "months of manual work" to "run the script, review the Excel, approve". One-off tool that did its job and retired.