Production-ready machine learning pipeline for loan repayment prediction using CatBoost with cross-validation and model evaluation.

Coltrane35/predicting-loan-payback

Predicting Loan Payback (Kaggle Playground Series S5E11)

📌 Project Overview

This project predicts whether a loan will be paid back, using structured tabular data. The solution is implemented entirely as Python scripts (.py), with no notebooks, and is organized as a reproducible end‑to‑end machine learning pipeline.

The project is based on the Kaggle competition Playground Series – Season 5, Episode 11 and is designed as a portfolio‑ready Data Science / Machine Learning project.


🧠 Problem Statement

Given customer financial and demographic data, predict the probability that a loan will be paid back (loan_paid_back).

This is a binary classification problem, evaluated using ROC AUC.
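
ROC AUC can be read as the probability that a randomly chosen repaid loan gets a higher predicted score than a randomly chosen unpaid one. The sketch below makes that definition concrete by brute-force pair counting. The function name roc_auc_pairwise is hypothetical and not part of this repository; in practice a library routine such as scikit-learn's roc_auc_score would normally be used.

```python
def roc_auc_pairwise(y_true, y_score):
    """ROC AUC by counting correctly ordered (positive, negative) pairs.

    A tie between a positive and a negative score counts as half a win.
    Quadratic in the number of rows, so this is for illustration only.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and scores (illustrative values, not competition data).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of scores matters, the metric is unaffected by any monotonic rescaling of the predicted probabilities.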


🗂️ Project Structure

playground-series-s5e11/
│
├── data/                 # Raw competition data (train.csv, test.csv)
├── outputs/              # EDA reports, metrics, trained models, submission
│   ├── eda_report.txt
│   ├── metrics.txt
│   ├── catboost_model.cbm
│   └── submission.csv
│
├── src/                  # Source code (pure Python, no notebooks)
│   ├── config.py
│   ├── load_data.py
│   ├── eda.py
│   ├── train_catboost.py
│   └── inference.py
│
├── requirements.txt
└── README.md
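
A central configuration module usually keeps the paths from the tree above in one place. The following is a minimal sketch of what src/config.py could contain. The class name ProjectPaths and its attribute names are assumptions made for illustration; only the file and folder names come from the layout shown above.

```python
from dataclasses import dataclass
from pathlib import Path

TARGET = "loan_paid_back"  # target column named in the problem statement


@dataclass(frozen=True)
class ProjectPaths:
    """File locations derived from a project root (hypothetical helper)."""

    root: Path

    @property
    def train_csv(self) -> Path:
        return self.root / "data" / "train.csv"

    @property
    def test_csv(self) -> Path:
        return self.root / "data" / "test.csv"

    @property
    def outputs_dir(self) -> Path:
        return self.root / "outputs"

    @property
    def eda_report(self) -> Path:
        return self.outputs_dir / "eda_report.txt"

    @property
    def metrics_file(self) -> Path:
        return self.outputs_dir / "metrics.txt"

    @property
    def model_file(self) -> Path:
        return self.outputs_dir / "catboost_model.cbm"

    @property
    def submission_csv(self) -> Path:
        return self.outputs_dir / "submission.csv"


paths = ProjectPaths(root=Path("playground-series-s5e11"))
print(paths.train_csv)
print(paths.model_file)
```

Inside an actual src/config.py, the root could instead be derived from the module location, for example Path(__file__).resolve().parents[1], so the scripts work regardless of the current working directory.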

⚙️ Tech Stack

  • Python 3.11+
  • Pandas, NumPy
  • Scikit‑learn
  • CatBoost (native categorical feature handling)

🔍 Exploratory Data Analysis (EDA)

EDA is performed by a standalone Python script, which saves a text report covering:

  • dataset shapes
  • column overview
  • target distribution
  • missing values check
  • data types

Output:

outputs/eda_report.txt
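
The report items listed above map onto a handful of pandas calls. The sketch below shows one way to assemble such a report. The function names build_eda_report and save_report are hypothetical, and the repository's src/eda.py may be organized differently.

```python
from pathlib import Path

import pandas as pd


def build_eda_report(train: pd.DataFrame, test: pd.DataFrame,
                     target: str = "loan_paid_back") -> str:
    """Return a plain-text EDA summary (hypothetical helper, not the repo's code)."""
    sections = [
        "== Dataset shapes ==",
        f"train: {train.shape}",
        f"test: {test.shape}",
        "",
        "== Columns and data types (train) ==",
        train.dtypes.to_string(),
        "",
        "== Target distribution (share of rows per class) ==",
        train[target].value_counts(normalize=True).to_string(),
        "",
        "== Missing values per column (train) ==",
        train.isna().sum().to_string(),
    ]
    return "\n".join(sections)


def save_report(report: str, path) -> Path:
    """Write the report to disk, creating the output folder if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(report, encoding="utf-8")
    return path
```

A typical call would read data/train.csv and data/test.csv with pd.read_csv, pass both frames to build_eda_report, and save the result to outputs/eda_report.txt.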

🤖 Model

CatBoostClassifier was selected due to:

  • native handling of categorical features
  • strong performance on tabular data
  • minimal preprocessing requirements
  • robustness and stability

Categorical Features

['gender', 'marital_status', 'education_level',
 'employment_status', 'loan_purpose', 'grade_subgrade']
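
CatBoost accepts the categorical columns by name through its cat_features argument, so no manual encoding step is needed. The sketch below shows how the list above could be wired into a classifier. The helper names catboost_params and build_model, the seed, and the chosen parameter values are assumptions; the repository's actual hyperparameters are not stated in this README.

```python
CAT_FEATURES = [
    "gender", "marital_status", "education_level",
    "employment_status", "loan_purpose", "grade_subgrade",
]


def catboost_params(cat_features=CAT_FEATURES, seed=42):
    """Keyword arguments for CatBoostClassifier (illustrative values only)."""
    return {
        "cat_features": list(cat_features),  # columns handled natively by CatBoost
        "loss_function": "Logloss",          # binary classification objective
        "eval_metric": "AUC",                # matches the competition metric
        "random_seed": seed,
        "verbose": 0,
    }


def build_model(params=None):
    """Create the classifier; catboost is imported lazily so that the
    parameter helper above can be used without the package installed."""
    from catboost import CatBoostClassifier

    return CatBoostClassifier(**(params or catboost_params()))


print(catboost_params()["cat_features"])
```

When cat_features is given as column names, the training data should be passed as a pandas DataFrame. CatBoost expects categorical values to be strings or integers, so missing values in those columns generally need to be filled with a placeholder string first.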

📊 Results

Metric                     Score
Public Leaderboard AUC     0.92293
Private Leaderboard AUC    0.92385
OOF AUC                    0.92338 ± 0.00069

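The out‑of‑fold (OOF) AUC lies between the public and private leaderboard scores, and its reported spread (± 0.00069) is small, which indicates a stable and well‑generalizing model.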
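
An out-of-fold score of this kind is typically produced by stratified K-fold cross-validation: each row is predicted by a model that never saw it during training, and the per-fold AUC values are summarized as mean ± standard deviation. The sketch below illustrates the idea under stated assumptions. The function names, the five folds, the seed, and the use of the population standard deviation are all assumptions, since this README does not state how the ± figure was computed.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def oof_auc(make_model, X, y, n_splits=5, seed=42):
    """Return (out-of-fold probabilities, list of per-fold AUCs).

    make_model is any zero-argument callable returning a fresh classifier
    with fit and predict_proba methods (scikit-learn style).
    """
    y = np.asarray(y)
    rows = X.iloc if hasattr(X, "iloc") else X  # DataFrame or NumPy array
    oof = np.zeros(len(y), dtype=float)
    fold_scores = []

    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, valid_idx in skf.split(X, y):
        model = make_model()
        model.fit(rows[train_idx], y[train_idx])
        # Probability of the positive class for rows held out of this fold.
        oof[valid_idx] = model.predict_proba(rows[valid_idx])[:, 1]
        fold_scores.append(roc_auc_score(y[valid_idx], oof[valid_idx]))
    return oof, fold_scores


def summarize(fold_scores):
    """Format fold scores as 'mean ± std' (population standard deviation)."""
    return f"{np.mean(fold_scores):.5f} ± {np.std(fold_scores):.5f}"
```

With CatBoost, make_model could be a small function that returns a new CatBoostClassifier configured with the categorical feature list. Creating a fresh model per fold keeps the folds independent of each other.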


▶️ How to Run Locally

1️⃣ Create virtual environment

python -m venv venv
venv\Scripts\activate

The activation command above is for Windows. On macOS or Linux, use: source venv/bin/activate

2️⃣ Install dependencies

pip install -r requirements.txt

3️⃣ Run EDA

python -m src.eda

4️⃣ Train model

python -m src.train_catboost

5️⃣ Generate submission

python -m src.inference
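
The inference step ends by writing outputs/submission.csv. The sketch below shows one way to write such a file with the standard library. The function name write_submission is hypothetical, and the column layout (an id column followed by loan_paid_back) is an assumption based on the usual Kaggle Playground format rather than something this README states.

```python
import csv
from pathlib import Path


def write_submission(ids, probabilities, path, target="loan_paid_back"):
    """Write predicted probabilities to a two-column CSV (hypothetical helper)."""
    ids = list(ids)
    probabilities = [float(p) for p in probabilities]
    if len(ids) != len(probabilities):
        raise ValueError("ids and probabilities must have the same length")
    if any(not 0.0 <= p <= 1.0 for p in probabilities):
        raise ValueError("probabilities must lie in [0, 1]")

    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", target])  # assumed header layout
        writer.writerows(zip(ids, probabilities))
    return path
```

The probabilities would typically come from the saved model, for example by calling load_model on a CatBoostClassifier with outputs/catboost_model.cbm and taking column 1 of predict_proba on the test features.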

🧪 Key Design Decisions

  • ❌ No Jupyter notebooks
  • ✅ Script‑based, reproducible pipeline
  • ✅ Clear separation of concerns (EDA / training / inference)
  • ✅ Local development (VS Code‑friendly)
  • ✅ Ready for extension (SHAP, feature engineering, hyperparameter tuning)

🚀 Future Improvements

  • Feature engineering (ratio & interaction features)
  • SHAP‑based model interpretability
  • Hyperparameter optimization
  • Model ensembling

📎 Kaggle

Competition: Playground Series S5E11

The submission was made via Kaggle's Late Submission option, for learning and portfolio purposes.


👤 Author

Grzegorz

Focused on Data Science and Machine Learning with emphasis on clean pipelines, reproducibility, and production‑ready code.
