This super-lightweight Loan Default Prediction engine (about 20 KB, smaller than a typical image file) uses machine learning to predict loan defaults, a critical part of efficient loan management. The application produces approval or denial decisions by using a RandomForestClassifier to analyze loan features such as amount, interest rate, and applicant financials. The project aims to improve decision-making accuracy, reduce the risk of defaults, and streamline the loan evaluation process.
Lightweight
Machine Learning, Consumer Credit Application Processing Engine
Overview
The project is built in two main layers plus a GUI. "pipeline.py" builds "model.pkl". "predict.py" loads "model.pkl" and uses it to analyze a user's credit application. The GUI lives in "main.py", which runs "predict.py" through a subprocess to analyze an imported application CSV.
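How "main.py" hands a CSV to "predict.py" is not spelled out here, so the following is only a minimal sketch. It assumes that "predict.py" accepts the CSV path as its first command-line argument and prints its decision to stdout. Both of those, along with the helper names `build_predict_command` and `run_prediction`, are assumptions made for illustration.

```python
import subprocess
import sys


def build_predict_command(csv_path, script="predict.py"):
    """Build the argument list used to launch predict.py (hypothetical CLI)."""
    # sys.executable reuses the interpreter that runs the GUI,
    # so the subprocess sees the same installed packages.
    return [sys.executable, script, csv_path]


def run_prediction(csv_path, script="predict.py"):
    """Run predict.py on one application CSV and return its stdout."""
    result = subprocess.run(
        build_predict_command(csv_path, script),
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()
```

A GUI callback could then call `run_prediction(selected_path)` and display the returned text.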
Overview of the pipeline script:
The pipeline script trains and evaluates a RandomForestClassifier model for predicting loan defaults using data from Lending Club. It consists of the following steps:
-
Data Loading and Preprocessing: Loads and preprocesses data from a CSV file.
-
Data Splitting: Splits the data into training and testing sets.
-
Model Training: Trains a RandomForestClassifier model.
-
Feature Selection: Selects top features based on their importance.
-
Model Evaluation: Evaluates the trained model on the test set.
-
Model Saving: Finally, it saves the trained model and the top features.
Detailed Breakdown:
1. Data Loading and Preprocessing
-
Data Loading: Reads the CSV file into a pandas DataFrame.
-
Column Selection: Selects specific columns relevant to the analysis.
-
Missing Values: Drops rows with missing values.
-
Filtering: Filters the data to include only loans that are either "Fully Paid" or "Charged Off".
-
Binary Mapping: Maps "Fully Paid" to 0 and "Charged Off" to 1 to create a binary target variable (defaulted).
-
Categorical Mapping: Maps categorical values to numerical values for emp_length, grade, term, and home_ownership.
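The preprocessing steps above can be sketched with pandas as follows. The sample rows, the column list, and every numeric encoding in the mapping dictionaries are assumptions made for illustration; the actual values used by "pipeline.py" may differ.

```python
import io

import pandas as pd

# Tiny stand-in for the Lending Club CSV; the values are made up.
RAW_CSV = """loan_amnt,int_rate,term,grade,emp_length,home_ownership,loan_status
10000,11.5,36 months,B,10+ years,RENT,Fully Paid
20000,18.2,60 months,D,< 1 year,MORTGAGE,Charged Off
15000,13.0,36 months,C,3 years,OWN,Current
8000,,36 months,A,5 years,RENT,Fully Paid
"""

COLUMNS = ["loan_amnt", "int_rate", "term", "grade",
           "emp_length", "home_ownership", "loan_status"]

# Assumed encodings; any consistent numeric mapping would work.
TERM_MAP = {"36 months": 36, "60 months": 60}
GRADE_MAP = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7}
EMP_LENGTH_MAP = {"< 1 year": 0, "3 years": 3, "5 years": 5, "10+ years": 10}
HOME_MAP = {"RENT": 0, "MORTGAGE": 1, "OWN": 2}


def load_and_preprocess(source):
    """Load the CSV, keep finished loans, and encode categorical columns."""
    df = pd.read_csv(source)
    df = df[COLUMNS].dropna()  # column selection, then drop incomplete rows
    df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
    # Binary target: 0 = fully paid, 1 = charged off (defaulted).
    df["defaulted"] = df["loan_status"].map({"Fully Paid": 0, "Charged Off": 1})
    df["term"] = df["term"].map(TERM_MAP)
    df["grade"] = df["grade"].map(GRADE_MAP)
    df["emp_length"] = df["emp_length"].map(EMP_LENGTH_MAP)
    df["home_ownership"] = df["home_ownership"].map(HOME_MAP)
    return df.drop(columns=["loan_status"]).reset_index(drop=True)


df = load_and_preprocess(io.StringIO(RAW_CSV))
print(df)
```

In this sample, the "Current" loan is filtered out and the row with a missing interest rate is dropped, leaving two rows.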
2. Data Splitting
-
Feature and Target Separation: Separates the features (x) and target variable (y).
-
Train-Test Split: Splits the data into training (80%) and testing (20%) sets.
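A minimal sketch of the 80/20 split, using scikit-learn's `train_test_split` on a synthetic DataFrame. The `stratify=y` argument and the `random_state` value are assumptions; the original script may split without stratification.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed data: 50 loans, 10 of them defaulted.
df = pd.DataFrame({
    "loan_amnt": [5000 + 500 * i for i in range(50)],
    "int_rate": [8.0 + 0.2 * i for i in range(50)],
    "defaulted": [1 if i % 5 == 0 else 0 for i in range(50)],
})

# Feature and target separation.
x = df.drop(columns=["defaulted"])
y = df["defaulted"]

# 80% training, 20% testing; stratify keeps the default rate equal in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
print(len(x_train), len(x_test))
```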
3. Model Training
-
Feature Selection: If top_features is provided, it selects those features from the training data.
-
Model Initialization: Initializes a RandomForestClassifier with specific hyperparameters.
-
Model Training: Trains the model using the training data.
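The training step might look roughly like the sketch below. The function name `train_model`, the hyperparameter values, and the synthetic data are illustrative assumptions, not the project's actual settings.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def train_model(x_train, y_train, top_features=None):
    """Fit a random forest, optionally on a subset of columns."""
    if top_features is not None:
        x_train = x_train[top_features]  # keep only the selected features
    model = RandomForestClassifier(
        n_estimators=50,          # illustrative values, not the project's settings
        max_depth=8,
        class_weight="balanced",
        random_state=42,
    )
    model.fit(x_train, y_train)
    return model


# Synthetic stand-in for the preprocessed training data.
features, target = make_classification(n_samples=300, n_features=6, random_state=0)
x_train = pd.DataFrame(features, columns=[f"f{i}" for i in range(6)])

full_model = train_model(x_train, target)
reduced_model = train_model(x_train, target, top_features=["f0", "f3"])
```

Training once on all features and again on the selected subset mirrors the optional `top_features` argument described above.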
4. Feature Selection
-
Feature Importances: Extracts the feature importances from the trained model.
-
Weight Adjustments: Adjusts the importance of certain features based on predefined weight factors.
-
Top Features Selection: Selects the top_n highest-ranked features by adjusted importance.
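The weighting and ranking logic can be illustrated in plain Python. The importance scores and weight factors below are invented for the example; in the real pipeline the scores would come from the fitted model's `feature_importances_` attribute, and the function name `select_top_features` is hypothetical.

```python
def select_top_features(importances, weight_factors, top_n):
    """Scale importances by per-feature weights and return the top_n names."""
    adjusted = {
        name: score * weight_factors.get(name, 1.0)  # unlisted features keep weight 1.0
        for name, score in importances.items()
    }
    ranked = sorted(adjusted, key=adjusted.get, reverse=True)
    return ranked[:top_n]


# Invented importances and weights, for illustration only.
importances = {
    "int_rate": 0.30,
    "grade": 0.25,
    "loan_amnt": 0.20,
    "emp_length": 0.15,
    "home_ownership": 0.10,
}
weight_factors = {"emp_length": 1.5, "grade": 0.5}

top_features = select_top_features(importances, weight_factors, top_n=3)
print(top_features)
```

With these numbers, the weights push `emp_length` above `grade`, so the adjusted ranking differs from the raw one.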
5. Model Evaluation
-
Feature Selection: If top_features is provided, it selects those features from the test data.
-
Prediction: Uses the model to make predictions on the test data.
-
Metrics Calculation: Calculates accuracy, precision, and recall scores.
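To make the three metrics concrete, here is a small worked example in plain Python, treating a default (1) as the positive class. The labels are made up. scikit-learn's `accuracy_score`, `precision_score`, and `recall_score` compute the same quantities, and the pipeline presumably relies on those.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, positive class = 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # defaults caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # good loans passed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good loans flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # defaults missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 3 true positives, 2 true negatives, 2 false positives, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

Here accuracy is 5/8, precision is 3/5, and recall is 3/4. Recall exceeds precision because the example model over-predicts defaults.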
6. Model Saving
-
Pipeline Execution: Runs the entire pipeline, from loading data to saving the trained model.
-
Metrics Output: Prints the accuracy, precision, and recall scores.
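One way to save the model together with its top features is to pickle both in a single dictionary. Whether "pipeline.py" stores them in one file or separately is not stated, so the bundle layout and the helper names `save_bundle` and `load_bundle` are assumptions.

```python
import os
import pickle
import tempfile


def save_bundle(model, top_features, path="model.pkl"):
    """Pickle the model together with the feature list it was trained on."""
    with open(path, "wb") as fh:
        pickle.dump({"model": model, "top_features": list(top_features)}, fh)


def load_bundle(path="model.pkl"):
    """Return (model, top_features) from a file written by save_bundle."""
    with open(path, "rb") as fh:
        bundle = pickle.load(fh)
    return bundle["model"], bundle["top_features"]


# Round trip in a temporary directory. The dict below is only a stand-in
# for the fitted RandomForestClassifier so the sketch runs without training.
stand_in_model = {"kind": "placeholder for a fitted classifier"}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")
    save_bundle(stand_in_model, ["int_rate", "grade"], path)
    restored_model, restored_features = load_bundle(path)
```

Keeping the feature list next to the model lets the prediction side select the same columns in the same order. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.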
Overall, the pipeline script builds a RandomForestClassifier model to predict loan defaults, selects important features, evaluates the model, and saves it along with the top features. It handles the preprocessing steps, including converting categorical features to numerical ones and splitting the data into training and testing sets. The script emphasizes feature importance by adjusting weights for certain features and selecting the top ones for training and evaluation. The model isn't a direct replacement for credit scoring; rather, it is a secondary check meant to reduce the risk introduced by possibly skewed scores.
Benchmark Results
To benchmark and validate the model, I used StratifiedKFold, a variant of K-Fold cross-validation that keeps the class balance roughly the same in every fold. Before I added cross-validation, the model scored 92% accuracy, 77% precision, and 96% recall. Below are the results for 25, 35, and 75 splits (the colored area in each chart is the standard deviation):
25 Splits - 93% Accuracy, 81% Precision, 93% Recall
35 Splits - 93% Accuracy, 83% Precision, 94% Recall
75 Splits - 94% Accuracy, 85% Precision, 94% Recall
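The StratifiedKFold benchmark could be reproduced along these lines. This is a sketch on synthetic data with 5 splits to keep it fast; the dataset, the model settings, and the use of `cross_validate` are assumptions, and the numbers it prints are unrelated to the results reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the loan data (about 20% positives).
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

n_splits = 5  # the benchmark above used 25, 35, and 75
cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)

scores = cross_validate(
    RandomForestClassifier(n_estimators=25, random_state=42),
    X,
    y,
    cv=cv,
    scoring=["accuracy", "precision", "recall"],
)

# Mean and standard deviation per metric, i.e. the line and shaded band of a chart.
for metric in ("accuracy", "precision", "recall"):
    values = scores[f"test_{metric}"]
    print(f"{metric}: mean={values.mean():.3f} std={values.std():.3f}")
```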
Summary
Final Remarks and Improvements
-
Model Accuracy Chart:
-
The mean accuracy is stable around 0.94.
-
The red shaded area represents the standard deviation, indicating some variability in performance across different folds. However, the variability is relatively small, showing consistent performance.
-
Model Precision Chart:
-
The precision values fluctuate more compared to accuracy and recall, but the mean remains high at 0.86.
-
The variability shown by the shaded area indicates that in some folds, the model's precision is lower. This suggests potential issues with false positives in certain data splits.
-
Model Recall Chart:
-
The recall is consistently high, with a mean of 0.94.
-
The variability in recall is similar to that of accuracy, indicating consistent identification of true positives across different folds.
-
High-Performance Metrics: The high accuracy, precision, and recall values indicate that the model is effective at predicting loan defaults.
-
Consistency: The charts show that the model performs consistently across different data splits, with relatively low variability.
Areas for Improvement
Precision Variability: The precision chart shows more variability compared to accuracy and recall. This indicates that in some folds, the model may have higher false positive rates.
Hyperparameter Tuning: To further improve precision and overall model performance, a hyperparameter search such as grid search or random search could be used to find better parameters for the Random Forest model.
Accuracy: In several iterations the accuracy drops well below the mean. The model performs well overall, but on certain subsets of the data its performance is suboptimal.
Precision: Precision shows notable dips in some iterations, so the share of predicted defaults that are actual defaults varies across folds. High precision means fewer false positives, and the dips point to some inconsistency.
Recall: The model is generally good at identifying positive cases. However, as with accuracy and precision, some iterations have lower recall, which shows variability in capturing all relevant positive instances.
Overall: The model is generally good but shows a few inconsistencies. Additional hyperparameter tuning with GridSearchCV may help reduce them.
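The GridSearchCV tuning mentioned above could look roughly like this. The parameter grid, the synthetic data, and the choice of `scoring="precision"` (picked because precision is the least stable metric here) are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the loan data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=0
)

# Small illustrative grid: 2 x 2 x 2 = 8 candidate settings.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [4, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="precision",  # optimize the metric that varies most across folds
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X, y)

print(search.best_params_)
print(f"best mean precision: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds a model refit on all the data with the best parameters found.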