• Built a leakage-safe, time-aware ML pipeline to predict competitive fencing match outcomes using only pre-match data
• Logistic Regression baseline achieves ~0.79 ROC-AUC and ~0.71 accuracy on held-out future matches
• Best calibrated model: Logistic Regression (ECE = 0.0173) → highly reliable probability estimates
• Best discriminating model: LightGBM (ROC-AUC = 0.8455) → strongest at ranking winners vs losers
• Both models outperform the official website’s win probability system in calibration and overall predictive quality
• Feature ablation confirms that rating differential dominates prediction, with experience and inactivity providing secondary signal
👉 Takeaway: This project demonstrates end-to-end applied ML skills across data collection, temporal feature engineering, modeling, evaluation, and probability calibration.
This project builds a pre-match win probability model for competitive HEMA (Historical European Martial Arts) tournaments using publicly available fighter ratings and match histories.
The goal is to estimate:
P(fighter wins | information available before the match)
The project emphasizes temporal correctness, leakage prevention, and interpretability, mirroring real-world applied machine learning constraints rather than benchmark-style modeling.
Given two fighters scheduled to compete in a tournament bout, predict the probability that the focal fighter wins using only pre-match information.
Key challenges addressed:
- Ratings are updated monthly, not per match
- Fighters may have long periods of inactivity
- Many competitors appear with little or no prior history (cold start)
Match and rating data were collected programmatically from publicly accessible sources.
To ensure reliable and respectful data acquisition, the scraping pipeline was designed with:
- Rate-limited HTTP requests
- Exponential backoff retry logic for transient failures
- Idempotent requests to allow safe restarts
This allowed the data pipeline to run unattended and consistently without overwhelming the source.
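The exponential-backoff retry logic can be sketched as a small wrapper. `with_backoff` and its parameters are illustrative, not the scraper's actual API; the `sleep` argument is injectable so the backoff schedule can be verified without real waiting. Rate limiting and idempotency would sit on top of this.

```python
import time


def with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a transient failure, wait base_delay * 2**attempt
    and retry, giving up after max_retries attempts."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Injecting `sleep` keeps the retry policy deterministic and unit-testable, the same property the project's idempotent-restart design relies on.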
- Tournament match history: individual bouts with fighter IDs, opponents, divisions, stages, and outcomes
- Rating history: monthly snapshots of fighter ratings and confidence values published by a third-party rating system
All rating joins are performed backward in time to ensure no future information is used.
The dataset is built through a reproducible pipeline:
- Load raw match and rating history data
- Normalize and sort all data chronologically
- Join ratings to matches using backward-in-time temporal joins
- Track per-fighter state (experience and recency) in strict match order
- Generate pre-match features
- Drop matches without valid pre-match information
- Freeze the dataset for modeling
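The backward-in-time rating join can be sketched with `pandas.merge_asof`. Column names and values here are hypothetical stand-ins for the real schema; `direction="backward"` attaches the most recent rating published at or before each match date, so no future information leaks in, and fighters with no prior snapshot come out as NaN rather than a fabricated rating.

```python
import pandas as pd

# Hypothetical schemas: matches keyed by fighter and date; ratings are
# monthly snapshots keyed by fighter and snapshot date.
matches = pd.DataFrame({
    "fighter_id": [1, 1, 2],
    "match_date": pd.to_datetime(["2023-03-15", "2023-06-10", "2023-03-20"]),
})
ratings = pd.DataFrame({
    "fighter_id": [1, 1, 2],
    "rating_date": pd.to_datetime(["2023-03-01", "2023-06-01", "2023-04-01"]),
    "rating": [1500.0, 1560.0, 1420.0],
})

# Backward temporal join: each match gets the latest rating published
# at or before its date; fighter 2's only snapshot is in the future,
# so that match correctly receives NaN (a cold-start case).
joined = pd.merge_asof(
    matches.sort_values("match_date"),
    ratings.sort_values("rating_date"),
    left_on="match_date",
    right_on="rating_date",
    by="fighter_id",
    direction="backward",
)
```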
Final dataset:
- 28,684 matches
- Temporal train/test split based on match date (no random shuffling)
- Only information available before the match is used
- Temporal causality is strictly enforced
- Missing history is handled explicitly (never encoded as zero)
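The temporal split amounts to a date cutoff rather than a random shuffle. A minimal sketch, with hypothetical dates and a hypothetical cutoff:

```python
import pandas as pd

# Hypothetical match frame; the real dataset has 28,684 rows.
df = pd.DataFrame({
    "match_date": pd.to_datetime(
        ["2021-05-01", "2022-02-10", "2022-11-03", "2023-04-20", "2023-09-15"]
    ),
    "label": [1, 0, 1, 1, 0],
}).sort_values("match_date")

# Everything before the cutoff trains the model; everything at or after
# it is held out, so evaluation is always on strictly future matches.
cutoff = pd.Timestamp("2023-01-01")
train = df[df["match_date"] < cutoff]
test = df[df["match_date"] >= cutoff]
```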
| Feature | Description |
|---|---|
| ratings_diff | Rating difference between fighter and opponent at match time |
| experience_diff | Difference in number of prior matches |
| fighter_days_since_last_fought | Days since fighter’s previous match |
| opponent_days_since_last_fought | Days since opponent’s previous match |
| days_since_last_fought_diff | Relative recency advantage |
| fighter_first_match | Fighter has no prior recorded matches |
| opponent_first_match | Opponent has no prior recorded matches |
Cold-start cases are handled via explicit flags rather than misleading numeric encodings.
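The explicit-flag approach can be sketched as follows. `prematch_features` and its arguments are illustrative names, not the project's actual API; the point is that a missing rating becomes NaN plus a boolean flag, never a misleading zero.

```python
import math


def prematch_features(fighter_rating, opponent_rating,
                      fighter_prior_matches, opponent_prior_matches):
    """Build pre-match features with explicit cold-start handling
    (illustrative sketch, not the project's real feature builder)."""
    feats = {
        "fighter_first_match": fighter_prior_matches == 0,
        "opponent_first_match": opponent_prior_matches == 0,
        "experience_diff": fighter_prior_matches - opponent_prior_matches,
    }
    if fighter_rating is None or opponent_rating is None:
        feats["ratings_diff"] = math.nan  # explicit missingness, not 0
    else:
        feats["ratings_diff"] = fighter_rating - opponent_rating
    return feats
```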
A logistic regression model was used as an interpretable baseline.
Results (held-out future data):
- ROC-AUC: ~0.79
- Accuracy: ~71%
Coefficient inspection shows:
- Rating difference is the dominant predictor
- Experience provides a modest secondary advantage
- Cold-start effects are asymmetric but meaningful
- Recency effects are present but smaller, consistent with rating decay already encoding inactivity
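A minimal sketch of this baseline setup on synthetic data. The features, effect sizes, and 80/20 split below are illustrative stand-ins for the real dataset, constructed only to show the shape of the pipeline: fit on the earlier portion, evaluate ROC-AUC on the later "future" portion, inspect coefficients.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in: outcome driven mostly by a rating differential,
# with a weaker experience signal (effect sizes are made up).
n = 2000
ratings_diff = rng.normal(0, 100, n)
experience_diff = rng.normal(0, 10, n)
logit = 0.02 * ratings_diff + 0.03 * experience_diff
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X = np.column_stack([ratings_diff, experience_diff])

# Stand-in for the temporal split: earlier 80% train, later 20% test.
split = int(0.8 * n)
clf = LogisticRegression(max_iter=1000).fit(X[:split], y[:split])
auc = roc_auc_score(y[split:], clf.predict_proba(X[split:])[:, 1])
```

Coefficient inspection (`clf.coef_`) recovers the sign and relative magnitude of each feature's contribution, which is what makes this baseline interpretable.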
A tree-based LightGBM model was trained using the same features and temporal split.
Outcome:
- Performance (ROC-AUC and accuracy) closely matched logistic regression
Interpretation: This indicates that the engineered features capture most of the predictive signal in a near-linear form. Additional nonlinear modeling does not materially improve performance, validating the feature design and confirming the suitability of a simple, interpretable model for this problem.
- Logistic regression and LightGBM achieve comparable performance
- No evidence of strong nonlinear interactions beyond engineered features
- Feature engineering quality dominates model choice
This comparison was used as a validation step rather than a performance-chasing exercise.
HEMARATINGSANALYSIS/
├── data/
│ ├── raw/
│ └── processed/
├── src/
│ ├── fighter_state.py
│ ├── build_dataset.py
│ ├── scraper.py
│ └── models/
│ ├── logistic_regression.py
│ └── lightgbm.py
├── tests/
│ ├── build_dataset_test.py
│ └── scraper_test.py
│
├── notebooks/
│ └── feature_sanity_check.ipynb
└── README.md
This project demonstrates:
- Leakage-safe temporal feature engineering
- Cold-start handling in real competition data
- End-to-end ML pipeline construction
- Model comparison driven by insight, not metrics chasing
- Interpretable results aligned with domain expectations
It reflects production-style applied ML rather than benchmark optimization.
- Dataset construction complete (v1 frozen)
- Logistic regression baseline evaluated
- Nonlinear model comparison completed
- Probability calibration analysis completed
- Feature ablation study completed
- Division- and stage-specific modeling (planned)
- Model monitoring across eras (planned)
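The calibration analysis reports expected calibration error (ECE, cited in the highlights above). A minimal sketch of a standard equal-width-bin ECE, assuming a binary win/loss label and predicted win probabilities (this is the common definition, not necessarily the exact variant used in the project):

```python
import numpy as np


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: the bin-weight-averaged gap between mean
    predicted probability (confidence) and empirical win rate (accuracy)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Assign each prediction to a bin; p == 1.0 lands in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # |accuracy - confidence| weighted by the bin's share of data.
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece
```

A low ECE (such as the ~0.017 reported for logistic regression) means predicted probabilities closely track observed win frequencies, which is what makes the probabilities usable rather than merely rank-correct.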
This project is intended as a demonstration of applied machine learning methodology and engineering judgment, not as a commercial betting or forecasting system.