Financial institutions face significant risks due to loan defaults. Accurately predicting the Probability of Default (PD) is critical for effective risk management and strategic planning.
In this project, we develop a machine learning model to estimate the likelihood that a loan will default, based on historical loan data.
Build a predictive model to estimate the probability of default for loan applicants using historical loan application data.
- Imported all necessary libraries, including `pandas`, `numpy`, `matplotlib`, `seaborn`, `sklearn`, `shap`, etc.
- Managed a long list of packages to support EDA, preprocessing, modeling, and interpretation.
- Fixed index column issues in the dataset.
- Dropped repeated columns that appeared due to merging or saving errors.
- `remaining_term`: Standardized inconsistent values.
- `job`: Unified titles and standardized missing entries.
- `location`: Cleaned and grouped underrepresented categories to avoid data sparsity.
- `currency` and `country`: Dropped due to low variance and redundancy.
- `gender`: Added an explicit `NaN` category to "Other" for clarity.
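The cleaning steps above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the project's actual code: the stray `Unnamed: 0` index column, the `loan_amount.1` duplicate, the placeholder strings treated as missing, and the rarity threshold of 2 are all assumptions. The `gender` step is left out because the source does not spell out the exact mapping.

```python
import pandas as pd

# Toy frame standing in for the raw loan data (all values are invented).
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4, 5],             # stray index column (assumed name)
    "loan_amount": [1000, 2500, 1800, 4000, 1200, 3000],
    "loan_amount.1": [1000, 2500, 1800, 4000, 1200, 3000],  # repeated column
    "remaining_term": ["12", "12 months", "6", "24_", "6 months", "24"],
    "job": ["Teacher", "teacher ", "ENGINEER", "n/a", "Engineer", None],
    "location": ["North", "North", "South", "North", "East", "South"],
    "currency": ["USD"] * 6,
    "country": ["US"] * 6,
})

# Drop the stray index column, then any column whose content repeats an earlier one.
df = raw.drop(columns=["Unnamed: 0"])
df = df.loc[:, ~df.T.duplicated()].copy()

# Constant columns carry no signal for the model.
df = df.drop(columns=["currency", "country"])

# remaining_term: keep only the numeric part of each entry.
df["remaining_term"] = df["remaining_term"].str.extract(r"(\d+)")[0].astype(float)

# job: normalise case and whitespace, and map placeholders to a single label.
job = df["job"].str.strip().str.lower()
df["job"] = job.mask(job.isna() | job.isin(["n/a", ""]), "missing")

# location: fold categories seen fewer than 2 times into "other".
counts = df["location"].value_counts()
rare = counts[counts < 2].index
df["location"] = df["location"].where(~df["location"].isin(rare), "other")

print(df)
```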
- Created new features based on correlation analysis with the target (`default`).
- Engineered interaction features where relevant.
- Ensured no data leakage during feature creation.
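As an illustration of the feature engineering described above, here is a small sketch on invented data. The column names `income` and `loan_amount` and the two derived features are assumptions chosen for the example. Both features are computed row by row from applicant inputs only, so they do not use the target or statistics from other rows, which is one simple way to avoid leakage.

```python
import numpy as np
import pandas as pd

# Invented sample; `default` is the target column named in the project.
df = pd.DataFrame({
    "income": [3000, 1500, 4000, 1200, 5000, 900],
    "loan_amount": [1000, 2000, 1000, 2400, 2000, 1800],
    "remaining_term": [12, 24, 6, 36, 12, 24],
    "default": [0, 1, 0, 1, 0, 1],
})

def add_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Add row-wise features that never touch the target column."""
    out = frame.copy()
    # Ratio feature; a zero loan amount becomes NaN instead of dividing by zero.
    out["income_to_loan"] = out["income"] / out["loan_amount"].replace(0, np.nan)
    # Interaction feature: larger loans with longer remaining terms.
    out["amount_x_term"] = out["loan_amount"] * out["remaining_term"]
    return out

features = add_features(df)

# Rank candidate features by absolute correlation with the target.
corr = (
    features.corr()["default"]
    .drop("default")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr)
```

In a real workflow the correlation ranking would be computed on the training split only, so that the test set does not influence which features are kept.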
- Gender Imputation: Used a Random Forest classifier to intelligently impute missing values.
- Simple Imputer: Applied for other missing categorical/numerical values.
- Scaled numerical features and one-hot encoded categoricals.
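The imputation and preprocessing steps above might look something like the sketch below. The data is synthetic, and the predictor columns used for the gender model, the median and most-frequent strategies, and the column names other than `job` and `gender` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data with missing values knocked out at random.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "loan_amount": rng.uniform(500, 5000, n),
    "income": rng.uniform(1000, 8000, n),
    "job": rng.choice(["teacher", "engineer", "driver"], n),
    "gender": rng.choice(["female", "male"], n),
})
df.loc[rng.choice(n, 30, replace=False), "gender"] = np.nan
df.loc[rng.choice(n, 20, replace=False), "income"] = np.nan

# Step 1: predict missing gender from other columns with a Random Forest.
predictors = ["loan_amount", "income"]          # assumed predictor set
helper_X = df[predictors].fillna(df[predictors].median())
known = df["gender"].notna()
gender_model = RandomForestClassifier(n_estimators=100, random_state=0)
gender_model.fit(helper_X[known], df.loc[known, "gender"])
df.loc[~known, "gender"] = gender_model.predict(helper_X[~known])

# Step 2: impute the rest, scale numeric columns, one-hot encode categoricals.
numeric = ["loan_amount", "income"]
categorical = ["job", "gender"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical),
])
X = preprocess.fit_transform(df)
print(X.shape)
```

Wrapping the imputers and encoders in a `ColumnTransformer` means they can be fitted on the training fold only when placed inside a model pipeline, which keeps test data out of the fitted statistics.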
- Evaluated multiple models using cross-validation:
  - Logistic Regression
  - Random Forest
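A cross-validated comparison of the two models could be sketched as below. The dataset comes from `make_classification` as a stand-in for the loan data, and the choice of ROC AUC and 5 stratified folds is an assumption, since the source does not state the metric or fold count.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Imbalanced synthetic data: roughly 20% positives, mimicking rare defaults.
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

models = {
    # Scaling lives inside the pipeline so it is refitted on each training fold.
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Stratified folds keep the default rate similar across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    for name, model in models.items()
}

for name, fold_scores in scores.items():
    print(f"{name}: ROC AUC {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```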
- Tried Recursive Feature Elimination (RFE) but results were unsatisfactory.
- Opted for manual selection based on domain knowledge and model performance.
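For reference, the RFE attempt mentioned above would typically look like this with scikit-learn. The estimator, the synthetic data, and the target of 4 features are assumptions for illustration; the source does not say which settings were tried.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the prepared loan features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=4)
selector.fit(X, y)

print("kept feature mask:", selector.support_)
print("elimination ranking (1 = kept):", selector.ranking_)
```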
- Used `RandomizedSearchCV` for hyperparameter tuning, which samples a fixed number of parameter combinations instead of exhaustively searching a grid.
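A minimal sketch of such a randomized search is shown below. Tuning a Random Forest, the parameter ranges, `n_iter=5`, 3 folds, and ROC AUC scoring are all assumptions made to keep the example small; the source does not list the actual search space.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the prepared loan features.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Distributions are sampled; lists are drawn from uniformly.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": [None, 4, 8],
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of sampled configurations
    cv=3,
    scoring="roc_auc",
    random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV ROC AUC: {search.best_score_:.3f}")
```

The cost is controlled by `n_iter` times the number of folds, regardless of how large the search space is.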
- Applied SHAP (SHapley Additive exPlanations) to interpret model predictions.
- Identified key features driving loan default risk:
  - Loan amount
  - Remaining term
  - Employment status
  - Income vs. loan ratio
Note: This project is for educational purposes and should not be used as financial advice or in production without further validation.