This project focuses on predicting whether a loan will be paid back using structured tabular data. The solution was developed entirely with Python scripts (.py), without notebooks, following a clean and reproducible end‑to‑end machine learning pipeline.
The project is based on the Kaggle competition Playground Series – Season 5, Episode 11 and is designed as a portfolio‑ready Data Science / Machine Learning project.
Given customer financial and demographic data, predict the probability that a loan will be paid back (loan_paid_back).
This is a binary classification problem, evaluated using ROC AUC.
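As a quick illustration of the metric, ROC AUC compares predicted probabilities against true labels; a minimal sketch with scikit-learn on toy data (not the competition data):

```python
from sklearn.metrics import roc_auc_score

# Toy example: true labels and predicted probabilities of loan_paid_back = 1
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.8, 0.35, 0.4, 0.7, 0.3, 0.1]

# One positive (0.35) ranks below one negative (0.4): 15 of 16 pairs ordered correctly
auc = roc_auc_score(y_true, y_prob)
print(f"ROC AUC: {auc:.4f}")  # 0.9375
```

ROC AUC is threshold-free: it scores the ranking of predictions, which is why the pipeline submits raw probabilities rather than hard 0/1 labels.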
```
playground-series-s5e11/
│
├── data/                  # Raw competition data (train.csv, test.csv)
├── outputs/               # EDA reports, metrics, trained models, submission
│   ├── eda_report.txt
│   ├── metrics.txt
│   ├── catboost_model.cbm
│   └── submission.csv
│
├── src/                   # Source code (pure Python, no notebooks)
│   ├── config.py
│   ├── load_data.py
│   ├── eda.py
│   ├── train_catboost.py
│   └── inference.py
│
├── requirements.txt
└── README.md
```
- Python 3.11+
- Pandas, NumPy
- Scikit‑learn
- CatBoost (native categorical feature handling)
EDA is performed via a standalone Python script and saved as a text report:
- dataset shapes
- column overview
- target distribution
- missing values check
- data types
Output: `outputs/eda_report.txt`
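The report-writing step can be sketched as follows. This is a hypothetical simplification of what `src/eda.py` might do (the function name `write_eda_report` and the demo frame are illustrative, not taken from the repo):

```python
from pathlib import Path

import pandas as pd


def write_eda_report(train: pd.DataFrame, out_path: str = "outputs/eda_report.txt") -> None:
    """Write a plain-text EDA summary: shape, columns, target balance, missing values, dtypes."""
    sections = [
        f"shape: {train.shape}",
        f"columns: {list(train.columns)}",
        "target distribution:\n" + train["loan_paid_back"].value_counts(normalize=True).to_string(),
        "missing values:\n" + train.isna().sum().to_string(),
        "dtypes:\n" + train.dtypes.to_string(),
    ]
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    Path(out_path).write_text("\n\n".join(sections))


# Tiny synthetic frame just to demonstrate the report format
demo = pd.DataFrame({"loan_paid_back": [1, 0, 1], "grade_subgrade": ["A1", "B2", "A1"]})
write_eda_report(demo, out_path="eda_report_demo.txt")
```

Keeping EDA as a script that writes a text artifact (rather than a notebook) makes the report diffable and reproducible in CI.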
CatBoostClassifier was selected due to:
- native handling of categorical features
- strong performance on tabular data
- minimal preprocessing requirements
- robustness and stability
```python
['gender', 'marital_status', 'education_level',
 'employment_status', 'loan_purpose', 'grade_subgrade']
```
| Metric | Score |
|---|---|
| Public Leaderboard AUC | 0.92293 |
| Private Leaderboard AUC | 0.92385 |
| OOF AUC | 0.92338 ± 0.00069 |
These results indicate a stable and well‑generalizing model.
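The OOF (out-of-fold) figure comes from stratified cross-validation: each sample is scored by the fold model that never saw it. A generic sketch of that loop, using scikit-learn's `LogisticRegression` on synthetic data as a stand-in for the actual CatBoost training:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=500, random_state=42)

oof = np.zeros(len(y))          # out-of-fold predictions, one per sample
fold_aucs = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for tr_idx, va_idx in skf.split(X, y):
    clf = LogisticRegression(max_iter=1000).fit(X[tr_idx], y[tr_idx])
    oof[va_idx] = clf.predict_proba(X[va_idx])[:, 1]
    fold_aucs.append(roc_auc_score(y[va_idx], oof[va_idx]))

# Overall OOF AUC plus fold-to-fold spread, as reported in the table above
print(f"OOF AUC: {roc_auc_score(y, oof):.5f} ± {np.std(fold_aucs):.5f}")
```

A small standard deviation across folds, as in the table above, is what supports the claim that the model generalizes stably.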
```bash
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt
python -m src.eda
python -m src.train_catboost
python -m src.inference
```

- ❌ No Jupyter notebooks
- ✅ Script‑based, reproducible pipeline
- ✅ Clear separation of concerns (EDA / training / inference)
- ✅ Local development (VSC‑friendly)
- ✅ Ready for extension (SHAP, feature engineering, hyperparameter tuning)
- Feature engineering (ratio & interaction features)
- SHAP‑based model interpretability
- Hyperparameter optimization
- Model ensembling
Competition: Playground Series S5E11
Submitted via Kaggle's Late Submission option (for learning and portfolio purposes).
Grzegorz
Focused on Data Science and Machine Learning with emphasis on clean pipelines, reproducibility, and production‑ready code.