Skip to content

A machine learning project focused on predicting the Probability of Default (PD) on loans using historical financial and demographic data. The project includes data preprocessing, feature engineering, model training, hyperparameter tuning, and interpretability using SHAP values.

Notifications You must be signed in to change notification settings

divinesaun/claxon_loan_default

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

6 Commits
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

๐Ÿฆ Loan Default Prediction: Probability of Default (PD)

Financial institutions face significant risks due to loan defaults. Accurately predicting the Probability of Default (PD) is critical for effective risk management and strategic planning.

๐ŸŽฏ In this project, we develop a machine learning model to estimate the likelihood that a loan will default, based on historical loan data.


๐Ÿ“Š Problem Statement

Build a predictive model to estimate the probability of default for loan applicants using historical loan application data.


๐Ÿงญ My Approach

๐Ÿ“ฆ 1. Package Imports

  • Imported all necessary libraries including pandas, numpy, matplotlib, seaborn, sklearn, shap, etc.
  • Managed long list of packages to support EDA, preprocessing, modeling, and interpretation.

๐Ÿงฎ 2. Data Loading & Cleaning

  • Fixed index column issues in the dataset.
  • Dropped repeated columns that appeared due to merging or saving errors.

๐Ÿ”Ž 3. Exploratory Data Analysis (EDA)

โœ… Cleaned Key Columns:

  • remaining_term: Standardized inconsistent values.
  • job: Unified titles and standardized missing entries.
  • location: Cleaned and grouped underrepresented categories to avoid data sparsity.
  • currency and country: Dropped due to low variance and redundancy.
  • gender: Added an explicit NaN category to "Other" for clarity.

๐Ÿ—๏ธ 4. Feature Engineering

  • Created new features based on correlation analysis with the target (default).
  • Engineered interaction features where relevant.
  • Ensured no data leakage during feature creation.

๐Ÿ”ง 5. Preprocessing

  • ๐Ÿง  Gender Imputation: Used a Random Forest classifier to intelligently impute missing values.
  • ๐Ÿงน Simple Imputer: Applied for other missing categorical/numerical values.
  • Scaled numerical features and one-hot encoded categoricals.

๐Ÿงช 6. Model Selection

  • Evaluated multiple models using cross-validation:
    • Logistic Regression
    • Random Forest

๐Ÿงฌ 7. Feature Selection

  • Tried Recursive Feature Elimination (RFE) but results were unsatisfactory.
  • Opted for manual selection based on domain knowledge and model performance.

๐Ÿ› ๏ธ 8. Hyperparameter Tuning

  • Used RandomizedSearchCV for fast and efficient parameter optimization.

๐Ÿง  9. Interpretation

  • Applied SHAP (SHapley Additive exPlanations) to interpret model predictions.
  • Identified key features driving loan default risk:
    • Loan amount
    • Remaining term
    • Employment status
    • Income vs. loan ratio

๐Ÿ”’ Note: This project is for educational purposes and should not be used as financial advice or in production without further validation.

About

A machine learning project focused on predicting the Probability of Default (PD) on loans using historical financial and demographic data. The project includes data preprocessing, feature engineering, model training, hyperparameter tuning, and interpretability using SHAP values.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published