Heart Disease ML Benchmark

Multicenter ML benchmark on 1,904 patients from 6 heart-disease datasets — CatBoost wins at 0.948 ROC-AUC


By the numbers

ROC-AUC (CatBoost): 0.948

Patients in dataset: 1,904

Algorithms compared: 8

Features recovering 98% of model performance: 7

The Problem

What I was solving

Heart disease ML benchmarks in the wild usually cherry-pick one dataset, report a single metric, and skip statistical testing. When papers claim "our XGBoost hits 92% accuracy," it's often on a 300-patient toy set with no baseline comparison. Clinicians can't compare methods apples-to-apples, and replication is nearly impossible.

My Approach

How I built it

Merged six open cardiovascular datasets into one unified 1,904-patient set. Ran eight algorithms through the same gauntlet: logistic regression, random forest, XGBoost, CatBoost, LightGBM, SVM, KNN, and a stacking ensemble. Every model got 100 Optuna trials with 5-fold cross-validation, so the run captures the whole hyperparameter landscape and not just the final metrics. SHAP analysis on the winner. Statistical testing with DeLong (for AUC) and McNemar (for accuracy) tests, Bonferroni-corrected for multiple comparisons. Everything runs from one Jupyter notebook.
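
To make the evaluation loop concrete, here is a minimal, dependency-free sketch of two pieces it relies on: a stratified 5-fold split and ROC-AUC computed as a rank statistic. The notebook most likely uses library implementations instead; the function names `stratified_folds` and `roc_auc` and the toy labels are assumptions made for illustration, not code from the project.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign every sample index to one of k folds with similar class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share per class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def roc_auc(labels, scores):
    """ROC-AUC as the chance a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy labels (hypothetical, not the benchmark data): 60 negatives, 40 positives.
toy_labels = [0] * 60 + [1] * 40
toy_folds = stratified_folds(toy_labels, k=5)
```

Each fold serves once as the validation set while the other four train the model, and the per-fold AUCs are averaged into the score a tuning trial reports.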

Tech choices

CatBoost: Won the benchmark and is also the most practical pick. It handles missing values natively and needs almost no feature engineering on mixed medical data.
Optuna: The TPE sampler tends to beat grid/random search on small-budget runs. 100 trials per model is reasonable, not astronomical, and Optuna's pruning kills bad trials early.
SHAP: Model-agnostic explanations so clinicians don't have to trust a black box. Global + per-patient views answer "why did the model say this?"
DeLong + McNemar: Paired statistical tests for AUC and accuracy stop you from claiming a tiny improvement is real. Bonferroni correction because you're testing many pairs.
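
As an illustration of the paired testing described above, the following sketch implements an exact two-sided McNemar test and a Bonferroni adjustment using only the standard library. It is not the notebook's code: the function names and the toy predictions are hypothetical, and the DeLong test is omitted because it needs considerably more machinery.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (a_only, b_only, p_value), where a_only counts cases only
    model A got right and b_only counts cases only model B got right.
    """
    a_only = b_only = 0
    for truth, a, b in zip(y_true, pred_a, pred_b):
        if a == truth and b != truth:
            a_only += 1
        elif a != truth and b == truth:
            b_only += 1
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree on correctness
    # Under H0 each discordant case favors A or B with probability 0.5.
    tail = sum(comb(n, i) for i in range(min(a_only, b_only) + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)


def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

If all pairs of the eight models were compared (an assumption; the text does not say which pairs were tested), that is `comb(8, 2) = 28` comparisons, so a raw p-value would need to fall below roughly 0.05 / 28 ≈ 0.0018 to stay significant at a 0.05 level after correction.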

Outcome

What came out of it

CatBoost won with 0.948 ROC-AUC and 88.3% accuracy, but that's not the headline. The real result: SHAP showed 7 of 50+ features recover 98% of model performance. A clinician can collect 7 measurements instead of 50+ and lose almost nothing. Statistical tests found no significant difference among the top-3 models, so don't claim you beat the baseline when you haven't. Every result is reproducible from the notebook.
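
One way a "k features recover 98% of performance" figure can be derived is sketched below: rank features by importance (for example mean absolute SHAP value), then grow the subset until its score reaches 98% of the full-model score. How the notebook actually defines "recovery" is not stated, so the ratio used here is an assumption, and the feature names, importances, and scoring function are all hypothetical stand-ins rather than benchmark results.

```python
def smallest_subset(importance, score_fn, target=0.98):
    """Return the shortest top-ranked feature list scoring >= target * full score."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    full_score = score_fn(ranked)
    for k in range(1, len(ranked) + 1):
        subset = ranked[:k]
        if score_fn(subset) >= target * full_score:
            return subset
    return ranked


# Hypothetical importances for six made-up features (not from the benchmark).
toy_importance = {"f1": 0.40, "f2": 0.25, "f3": 0.15,
                  "f4": 0.10, "f5": 0.06, "f6": 0.04}


def toy_score(features):
    """Toy stand-in for 'retrain on these features and return the CV ROC-AUC'."""
    return 0.5 + 0.4 * sum(toy_importance[f] for f in features)


toy_subset = smallest_subset(toy_importance, toy_score)
```

In a real run `score_fn` would retrain the model on the candidate subset and return its cross-validated ROC-AUC, which is far more expensive than the toy function here.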