This project implements an end-to-end machine learning system for predicting loan default risk using real-world financial application data.
The system is designed with production ML engineering principles, including:
- Modular feature engineering pipelines
- Reusable preprocessing and model training modules
- Multi-model benchmarking
- Cross-validation based performance validation
- Business-aware decision optimization using profit simulation
- Model interpretability via feature importance analysis
- Inference-ready model serialization
This project emphasizes a real-world ML workflow, not just model accuracy.
Financial institutions must estimate the probability that a customer will default on a loan.
Accurate credit risk prediction enables:
- Risk-adjusted loan approvals
- Portfolio loss reduction
- Interest rate optimization
- Regulatory-compliant risk modeling
Predict the probability of default (PD) for each applicant and optimize the approval threshold for maximum portfolio profit.
Dataset: Home Credit Default Risk
Primary Table Used: application_train.csv
Future extensions could include:
- Credit bureau history
- Previous loan performance
- Installment payment behavior
```
Raw Data
   ↓
Feature Engineering (src/features)
   ↓
Preprocessing Pipeline (ColumnTransformer)
   ↓
Model Training Modules (src/models)
   ↓
Cross-Validation Evaluation
   ↓
Model Comparison & Selection
   ↓
Business Decision Optimization (Profit Simulation)
   ↓
Model Serialization (Deployment Ready)
```
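The stages above can be composed into a single scikit-learn Pipeline so that feature engineering, preprocessing, and the model travel together through training and inference. A minimal sketch — the function body, column names, and synthetic data below are illustrative, not the project's actual src/features code:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_ratio_features(df):
    # Illustrative stand-in for src/features/build_features.py
    df = df.copy()
    df["CREDIT_INCOME_RATIO"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
    return df

preprocessor = ColumnTransformer(
    [("num", StandardScaler(),
      ["AMT_CREDIT", "AMT_INCOME_TOTAL", "CREDIT_INCOME_RATIO"])]
)

pipeline = Pipeline([
    ("features", FunctionTransformer(add_ratio_features)),
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

# Tiny synthetic frame to show the pipeline runs end to end
X = pd.DataFrame({
    "AMT_CREDIT": [200000.0, 500000.0, 150000.0, 800000.0],
    "AMT_INCOME_TOTAL": [100000.0, 120000.0, 90000.0, 200000.0],
})
y = [0, 1, 0, 1]
pipeline.fit(X, y)
proba = pipeline.predict_proba(X)[:, 1]
```

Keeping all stages in one Pipeline object is what later makes the artifact deployment-ready: serializing it captures every step, not just the model.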
```
src/
├── features/
│   ├── build_features.py
│   └── pipeline.py
├── models/
│   ├── train_model.py
│   ├── train_tree_model.py
│   ├── train_histgb_model.py
│   ├── train_xgb_model.py
│   ├── evaluate_model.py
│   └── save_model.py
└── decision/
    └── profit_simulation.py
notebooks/
├── 01_eda.ipynb
└── 03_modeling.ipynb
artifacts/
└── xgb_credit_model.joblib
```
Engineered features include:
- Credit-to-Income Ratio
- Annuity-to-Income Ratio
- Employment anomaly detection
- Registration duration
- Phone activity recency
- Age conversion from raw birth date encoding
- Sentinel missing value handling
- Identifier column removal
- Redundant feature removal
Preprocessing is implemented with scikit-learn's ColumnTransformer:
- Numeric features: median imputation and standard scaling
- Categorical features: most-frequent imputation and one-hot encoding with unknown-category safety
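A minimal sketch of that preprocessor; the column lists are illustrative placeholders, since the project derives them from the dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative column lists
numeric_cols = ["AMT_CREDIT", "AMT_INCOME_TOTAL"]
categorical_cols = ["NAME_CONTRACT_TYPE"]

numeric_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
categorical_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    # handle_unknown="ignore" keeps inference safe on unseen categories
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

# Tiny frame with a missing value in every column
X = pd.DataFrame({
    "AMT_CREDIT": [1.0, np.nan, 3.0],
    "AMT_INCOME_TOTAL": [2.0, 2.0, np.nan],
    "NAME_CONTRACT_TYPE": ["Cash loans", np.nan, "Revolving loans"],
})
Xt = preprocessor.fit_transform(X)
```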
| Model | Purpose |
|---|---|
| Logistic Regression | Linear baseline |
| Random Forest | Nonlinear bagging baseline |
| Gradient Boosting | Sequential boosting baseline |
| HistGradientBoosting | Modern histogram boosting |
| XGBoost | Final production candidate |
| Model | ROC AUC |
|---|---|
| Logistic Regression | ~0.749 |
| Random Forest | ~0.726 |
| Gradient Boosting | ~0.753 |
| HistGradientBoosting | ~0.759 |
| XGBoost | ~0.762 |
Logistic Baseline Cross Validation:
- Mean ROC AUC: ~0.746
- Std Dev: ~0.0026
The small standard deviation across folds indicates stable generalization.
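The cross-validation setup can be sketched as follows — here on a synthetic, imbalanced stand-in for the application data rather than the real table, with the logistic baseline in place of the full pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with ~8% positives to mimic a low default rate
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.92], random_state=42)

# Stratified folds preserve the class ratio in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         scoring="roc_auc", cv=cv)
mean_auc, std_auc = scores.mean(), scores.std()
```

Stratified splits matter here because with a low default rate, unstratified folds can end up with too few positives to score ROC AUC reliably.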
XGBoost was selected as the final model because of its:
- Highest validation ROC AUC
- Strong tabular feature interaction modeling
- Industry standard for structured financial ML
- Stable training behavior
Feature importance analysis confirms dominant signals from:
- External credit risk score features (EXT_SOURCE variables)
- Financial stress ratio features
- Customer stability indicators
- Age / lifecycle features
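Importances can be read off the fitted estimator. The sketch below uses a RandomForestClassifier on synthetic data as a stand-in, since XGBClassifier exposes the same feature_importances_ attribute:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in; feature names are placeholders
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(5)])

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
# Pair importances with column names and rank them
importances = (pd.Series(model.feature_importances_, index=X.columns)
               .sort_values(ascending=False))
```

When the model sits at the end of a pipeline, the matching post-transformation feature names can be recovered from the preprocessor's get_feature_names_out().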
Instead of using the default classification threshold of 0.5, a profit simulation layer was implemented to optimize loan approval decisions.
- Interest revenue modeling
- Loss given default modeling
- Operational cost modeling
Optimal Approval Threshold ≈ 0.20
This reflects real-world credit risk asymmetry: Default losses are much larger than interest gains.
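The threshold sweep behind the simulation can be sketched as below. The economic parameters (interest gain, loss given default, operating cost) and the synthetic scores are illustrative placeholders, not the project's calibrated values:

```python
import numpy as np

def portfolio_profit(y_true, p_default, threshold,
                     interest_gain=0.15, loss_given_default=0.60, op_cost=0.01):
    """Profit per unit loan amount under an approve/deny rule (illustrative)."""
    approved = p_default < threshold          # approve low-risk applicants
    repaid = approved & (y_true == 0)         # repaid loans earn interest
    defaulted = approved & (y_true == 1)      # defaults lose part of principal
    return (interest_gain * repaid.sum()
            - loss_given_default * defaulted.sum()
            - op_cost * approved.sum())

rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.08, size=5000)     # ~8% default rate
# Noisy synthetic scores loosely correlated with the label
p_default = np.clip(0.1 * y_true + rng.beta(2, 10, size=5000), 0.0, 1.0)

thresholds = np.linspace(0.01, 0.99, 99)
profits = [portfolio_profit(y_true, p_default, t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(profits))]
```

Because a single default wipes out the interest from several good loans, the profit-maximizing threshold lands well below 0.5.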
The final model is saved as a serialized pipeline artifact:
artifacts/xgb_credit_model.joblib
The artifact bundles:
- Feature engineering
- Preprocessing
- Model inference
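The save/reload round trip can be sketched with joblib; the tiny pipeline below is a stand-in for the full feature-engineering + XGBoost artifact:

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Minimal stand-in pipeline fitted on toy data
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())])
pipe.fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "credit_model.joblib")
    joblib.dump(pipe, path)        # one artifact: preprocessing + model
    restored = joblib.load(path)   # reload for inference, no retraining
    proba = restored.predict_proba([[1.5]])[:, 1]
```

Persisting the whole fitted pipeline, rather than the bare model, is what keeps inference consistent with training-time preprocessing.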
Install dependencies:

```shell
pip install -r requirements.txt
```

Run:

notebooks/03_modeling.ipynb
```python
import joblib

model = joblib.load("artifacts/xgb_credit_model.joblib")
# predict_proba returns [P(no default), P(default)]; column 1 is the PD
preds = model.predict_proba(X_new)[:, 1]
```

Key takeaways:
- Feature engineering dominates tabular ML performance
- Boosting models outperform bagging on structured financial data
- Histogram boosting improves training efficiency significantly
- Cross-validation is critical for stable evaluation
- Business-aligned metrics outperform pure accuracy metrics
Potential next enhancements:
- Multi-table feature aggregation
- Probability calibration for financial risk pricing
- Model monitoring and drift detection
- Real-time inference pipeline
Built as a production-style machine learning system demonstrating:
- End-to-end ML pipeline engineering
- Financial tabular modeling best practices
- Business-aligned ML decision making


