Completed
research
Heart Disease ML Benchmark
Multicenter ML benchmark on 1,904 patients from 6 heart-disease datasets — CatBoost wins at 0.948 ROC-AUC
By the numbers
0.948
ROC-AUC (CatBoost)
1,904
Patients in dataset
8
Algorithms compared
7
Features recovering 98% of performance
The Problem
What I was solving
Heart disease ML benchmarks in the wild usually cherry-pick one dataset, report a single metric, and skip statistical testing. When a paper claims "our XGBoost hits 92% accuracy," it's often on a 300-patient toy set with no baseline comparison. Clinicians can't compare methods apples-to-apples, and replication is nearly impossible.
My Approach
How I built it
Merged six open cardiovascular datasets into one unified 1,904-patient set. Ran eight algorithms through the same gauntlet: logistic regression, random forest, XGBoost, CatBoost, LightGBM, SVM, KNN, and a stacking ensemble. Every model got 100 Optuna trials scored with 5-fold cross-validation, and I kept the whole hyperparameter landscape, not just the final metrics. SHAP analysis on the winner. Statistical testing with DeLong tests (for AUC) and McNemar tests (for accuracy), Bonferroni-corrected for multiple comparisons. Everything runs from one Jupyter notebook.
Tech choices
- CatBoost: Won the benchmark and is also the most practical pick. It handles missing values natively and needs almost no feature engineering on mixed medical data.
- Optuna: The TPE sampler typically beats grid and random search on small-budget runs. 100 trials per model is reasonable, not astronomical, and Optuna's pruning kills bad trials early.
- SHAP: Model-agnostic explanations, so clinicians don't have to trust a black box. Global and per-patient views answer "why did the model say this?"
- DeLong + McNemar: Paired statistical tests for AUC and accuracy stop you from claiming a tiny improvement is real. Bonferroni correction because you're testing many pairs.
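To make the last point concrete, here is a small standard-library sketch of an exact (binomial) McNemar test with a Bonferroni-adjusted threshold. The toy labels and predictions are made up for illustration. The choice of the exact variant and the assumption that all pairs among the eight models are compared are mine, not confirmed by the notebook. The DeLong test for AUC is not shown.

```python
from itertools import combinations
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two classifiers on the same patients."""
    # Only discordant pairs matter:
    # b = A correct and B wrong, c = A wrong and B correct.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree on correctness
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bonferroni_alpha(n_models, alpha=0.05):
    """Per-comparison threshold when every pair of models is tested."""
    n_pairs = len(list(combinations(range(n_models), 2)))
    return alpha / n_pairs


# Toy data (assumption): model A is always right, model B misses 6 of 12 cases.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
pred_a = list(y_true)
pred_b = [0, 1, 0, 0, 1, 1] + y_true[6:]

p_value = mcnemar_exact(y_true, pred_a, pred_b)
alpha_corrected = bonferroni_alpha(n_models=8)  # 28 pairs among 8 models

print(p_value, alpha_corrected, p_value < alpha_corrected)
```

In this toy case the p-value is 0.03125, which clears the usual 0.05 bar but not the corrected threshold of 0.05 / 28. That is exactly the kind of "tiny improvement" the correction is meant to filter out.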
Outcome
What came out of it
CatBoost won with 0.948 ROC-AUC and 88.3% accuracy, but that's not the headline. The real result: SHAP showed that 7 of 50+ features recover 98% of model performance. A clinician can collect 7 measurements instead of 50+ and lose almost nothing. The statistical tests also show that the top three models aren't distinguishable from one another, so don't claim you beat the baseline when you haven't. Every result is reproducible from the notebook.
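The feature-reduction check can be sketched as: rank features, retrain on the top 7, and compare scores. This sketch makes several assumptions. It uses scikit-learn's permutation importance as a stand-in for SHAP rankings, a gradient-boosting model instead of CatBoost, and synthetic data. It also reads "performance recovered" as the ratio of reduced-model to full-model ROC-AUC, which the write-up does not spell out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in (assumption): 50 features, only 7 of them informative.
X, y = make_classification(
    n_samples=600, n_features=50, n_informative=7, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)


def fit_and_score(cols):
    """Train on the given feature columns and return (model, test ROC-AUC)."""
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_train[:, cols], y_train)
    scores = model.predict_proba(X_test[:, cols])[:, 1]
    return model, roc_auc_score(y_test, scores)


all_cols = np.arange(X.shape[1])
full_model, full_auc = fit_and_score(all_cols)

# Rank features on the training split only, so the test split stays untouched.
# Permutation importance is a swapped-in technique here; mean |SHAP| values
# would take the place of `importances_mean` in a SHAP-based version.
ranking = permutation_importance(
    full_model, X_train, y_train, scoring="roc_auc", n_repeats=5, random_state=0
)
top7 = np.argsort(ranking.importances_mean)[::-1][:7]

_, subset_auc = fit_and_score(top7)
retained = subset_auc / full_auc  # share of full-model ROC-AUC kept by 7 features

print(f"full={full_auc:.3f} top7={subset_auc:.3f} retained={retained:.1%}")
```

Ranking on the training split and scoring on a held-out split keeps the retained-performance figure from being inflated by selecting features on the same data used to evaluate them.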