This project builds a credit approval prediction system using machine learning models. It processes customer loan application data, performs feature selection, trains multiple ML models, and tunes hyperparameters using a custom Grid Search where the test set is used as a validation set.
The goal is to classify applicants into four approval categories (P1, P2, P3, P4) to support risk-based lending decisions. Two datasets are used: the CBIL dataset, with shape (51336, 54), and an internal bank dataset, with shape (51296, 26). Both share the "PPROSPECTID" column.
Null values are removed from both datasets, and columns with more than 10k null values are dropped.
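The cleaning step can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: it assumes nulls are stored as NaN (a sentinel value would need to be replaced first), that sparse columns are dropped before rows, and that the two tables are combined with an inner join on "PPROSPECTID". The function names `clean` and `clean_and_merge` are hypothetical.

```python
import pandas as pd

NULL_COLUMN_LIMIT = 10_000  # "more than 10k null values" from the text


def clean(df: pd.DataFrame, limit: int = NULL_COLUMN_LIMIT) -> pd.DataFrame:
    """Drop columns with more than `limit` nulls, then rows with any null left."""
    null_counts = df.isna().sum()
    keep = null_counts[null_counts <= limit].index
    return df[keep].dropna().reset_index(drop=True)


def clean_and_merge(
    bureau: pd.DataFrame,
    internal: pd.DataFrame,
    key: str = "PPROSPECTID",
    limit: int = NULL_COLUMN_LIMIT,
) -> pd.DataFrame:
    """Clean both tables, then keep only applicants present in both (assumed inner join)."""
    return pd.merge(clean(bureau, limit), clean(internal, limit), how="inner", on=key)
```

Dropping the sparse columns first avoids discarding most rows just because one rarely filled column is empty.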
The features are divided into categorical and numerical columns. The association of each categorical column with the target column is checked with a chi-square (chi2) test, keeping columns with p-value <= 0.05. For the numerical columns, a sequential VIF (Variance Inflation Factor) check with a threshold of 6 is used to remove multicollinearity. The remaining numerical columns are then tested against the target classes with ANOVA, again with a p-value threshold of 0.05. Label encoding is applied to EDUCATION and one-hot encoding to the other categorical columns (GENDER, MARITALSTATUS, etc.).
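The feature selection steps above could look roughly like the sketch below. It is one possible reading, not the project's code: the function names, the `target` argument, and the education mapping passed to `encode` are hypothetical, and "sequential VIF" is interpreted as walking through the columns in order and dropping any column whose VIF against the columns still kept exceeds 6. The VIF is computed directly with NumPy as 1 / (1 - R^2), including an intercept.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway


def chi2_select(df, cat_cols, target, alpha=0.05):
    """Keep categorical columns whose chi-square p-value against the target is <= alpha."""
    keep = []
    for col in cat_cols:
        _, p_value, _, _ = chi2_contingency(pd.crosstab(df[col], df[target]))
        if p_value <= alpha:
            keep.append(col)
    return keep


def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    y = X[:, j]
    A = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_tot = float(((y - y.mean()) ** 2).sum())
    if ss_tot == 0.0:  # constant column: treat as fully redundant
        return float("inf")
    r2 = 1.0 - float(resid @ resid) / ss_tot
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)


def sequential_vif_select(df, num_cols, threshold=6.0):
    """Drop columns one at a time while their VIF exceeds the threshold."""
    kept = list(num_cols)
    i = 0
    while i < len(kept):
        if vif(df[kept].to_numpy(dtype=float), i) > threshold:
            kept.pop(i)  # the next column slides into position i and is checked next
        else:
            i += 1
    return kept


def anova_select(df, num_cols, target, alpha=0.05):
    """Keep numerical columns whose one-way ANOVA p-value across classes is <= alpha."""
    keep = []
    for col in num_cols:
        groups = [g[col].to_numpy() for _, g in df.groupby(target)]
        _, p_value = f_oneway(*groups)
        if p_value <= alpha:
            keep.append(col)
    return keep


def encode(df, education_map, onehot_cols):
    """Map EDUCATION to integers (caller-supplied mapping) and one-hot encode the rest."""
    out = df.copy()
    out["EDUCATION"] = out["EDUCATION"].map(education_map)
    return pd.get_dummies(out, columns=list(onehot_cols))
```

A typical order would be `chi2_select` on the categorical columns, then `sequential_vif_select` followed by `anova_select` on the numerical ones, and finally `encode` on the surviving categorical columns.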
Among the models trained, XGBoost gave the highest test accuracy, approximately 78%.
param_grid = {
    'colsample_bytree': [0.1, 0.3, 0.5, 0.7, 0.9],
    'learning_rate': [0.001, 0.01, 0.1, 1],
    'max_depth': [3, 5, 8, 10],
    'alpha': [1, 10, 100],
    'n_estimators': [10, 50, 100]
}
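The grid above has 5 x 4 x 4 x 3 x 3 = 720 combinations. A custom grid search that scores each combination on the test set, as described earlier, might be sketched as below. This is an assumption about the approach, not the project's implementation: `custom_grid_search`, `make_model`, and the returned dictionary layout are hypothetical names, and the model only needs `fit` and `predict` methods.

```python
from itertools import product


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)


def custom_grid_search(make_model, grid, X_train, y_train, X_test, y_test):
    """Try every parameter combination and keep the one with the best test accuracy.

    The test set doubles as the validation set here, so the reported
    test accuracy is optimistic compared with a separate hold-out set.
    """
    names = list(grid)
    best = None
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        model = make_model(**params)
        model.fit(X_train, y_train)
        train_acc = accuracy(y_train, model.predict(X_train))
        test_acc = accuracy(y_test, model.predict(X_test))
        if best is None or test_acc > best["test_accuracy"]:
            best = {
                "params": params,
                "train_accuracy": train_acc,
                "test_accuracy": test_acc,
            }
    return best


# Hypothetical usage with XGBoost (assumes xgboost is installed and the
# P1..P4 labels have been encoded as integers 0..3):
#
#   from xgboost import XGBClassifier
#   best = custom_grid_search(
#       lambda **p: XGBClassifier(objective="multi:softmax", num_class=4, **p),
#       param_grid, X_train, y_train, X_test, y_test,
#   )
```

In XGBoost, `alpha` is the L1 regularization term, exposed as `reg_alpha` in the scikit-learn wrapper.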
The best parameters are colsample_bytree: 0.3, learning_rate: 1, max_depth: 3, alpha: 10, n_estimators: 100. With these, Train Accuracy: 0.8055927015541886 and Test Accuracy: 0.7801022227505052.