
This lightweight loan default prediction engine, smaller than the average image file at roughly 20 KB, uses machine learning to tackle a critical aspect of loan management: accurately predicting defaults. The application delivers approval or denial decisions by using a RandomForestClassifier to analyze loan features such as amount, interest rate, and applicant financials. This improves decision-making accuracy, reduces the risk of defaults, and streamlines the loan evaluation process.

Lightweight

Machine Learning, Consumer Credit Application Processing Engine

Overview

The project is built in two main layers plus a GUI: "pipeline.py" trains the model and saves it as "model.pkl"; "predict.py" loads "model.pkl" and uses it to analyze a user's credit application; and "main.py" provides the GUI, which runs "predict.py" through a subprocess to analyze an imported application CSV.
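
The GUI-to-subprocess hand-off can be sketched as follows. The script names come from the overview above, but the exact command-line interface of "predict.py" (CSV path as first argument, decision printed to stdout) is an assumption:

```python
import subprocess
import sys

# Hypothetical wrapper: main.py could call predict.py on an imported CSV
# and capture the decision it prints to stdout. The argument convention
# is an assumption, not the project's actual interface.
def score_application(csv_path: str, script: str = "predict.py") -> str:
    result = subprocess.run(
        [sys.executable, script, csv_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Running the prediction in a subprocess keeps the GUI responsive and cleanly separates the model code from the interface code.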

Overview of the pipeline script:

The pipeline script is designed to train and evaluate a RandomForestClassifier model for predicting loan defaults using data from Lending Club. The script consists of the following steps:

  1. Data Loading and Preprocessing: Loads and preprocesses data from a CSV file.

  2. Data Splitting: Splits the data into training and testing sets.

  3. Model Training: Trains a RandomForestClassifier model.

  4. Feature Selection: Selects top features based on their importance.

  5. Model Evaluation: Evaluates the trained model on the test set.

  6. Model Saving: Finally, it saves the trained model and the top features.

Detailed Breakdown:

1. Data Loading and Preprocessing

  • Data Loading: Reads the CSV file into a pandas DataFrame.

  • Column Selection: Selects specific columns relevant to the analysis.

  • Missing Values: Drops rows with missing values.

  • Filtering: Filters the data to include only loans that are either "Fully Paid" or "Charged Off".

  • Binary Mapping: Maps "Fully Paid" to 0 and "Charged Off" to 1 to create a binary target variable (defaulted).

  • Categorical Mapping: Maps categorical values to numerical values for emp_length, grade, term, and home_ownership.
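
Taken together, the preprocessing bullets above could look roughly like this. The column list and the mapping dictionaries are illustrative assumptions, not the exact values in pipeline.py:

```python
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only the columns relevant to the analysis and drop missing rows.
    cols = ["loan_amnt", "int_rate", "annual_inc", "emp_length",
            "grade", "term", "home_ownership", "loan_status"]
    df = df[cols].dropna()
    # Keep only completed loans and build the binary target.
    df = df[df["loan_status"].isin(["Fully Paid", "Charged Off"])].copy()
    df["defaulted"] = df["loan_status"].map({"Fully Paid": 0, "Charged Off": 1})
    # Map categorical values to numbers (example encodings only).
    df["grade"] = df["grade"].map({g: i for i, g in enumerate("ABCDEFG")})
    df["term"] = df["term"].map({" 36 months": 36, " 60 months": 60})
    df["home_ownership"] = df["home_ownership"].map(
        {"RENT": 0, "MORTGAGE": 1, "OWN": 2, "OTHER": 3})
    df["emp_length"] = df["emp_length"].str.extract(
        r"(\d+)", expand=False).astype(float)
    return df.drop(columns=["loan_status"]).dropna()
```

Filtering to "Fully Paid" and "Charged Off" excludes in-progress loans, so the target reflects only outcomes that are actually known.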

2. Data Splitting

  • Feature and Target Separation: Separates the features (x) and target variable (y).

  • Train-Test Split: Splits the data into training (80%) and testing (20%) sets.
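
A minimal sketch of the split, assuming the "defaulted" column produced during preprocessing is the target (the stratify argument is my addition for illustration; it keeps the default rate similar in both sets, which matters for an imbalanced target):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data(df: pd.DataFrame, target: str = "defaulted",
               test_size: float = 0.2, seed: int = 42):
    # Separate features (x) from the target (y), then split 80/20.
    x = df.drop(columns=[target])
    y = df[target]
    return train_test_split(x, y, test_size=test_size,
                            random_state=seed, stratify=y)
```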

3. Model Training

  • Feature Selection: If top_features is provided, it selects those features from the training data.

  • Model Initialization: Initializes a RandomForestClassifier with specific hyperparameters.

  • Model Training: Trains the model using the training data.
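
The training step might translate to something like the following. The hyperparameter values are placeholders, since this write-up doesn't list the ones actually used in pipeline.py:

```python
from sklearn.ensemble import RandomForestClassifier

def train_model(x_train, y_train, top_features=None):
    # Optionally restrict training to a previously selected feature subset.
    if top_features is not None:
        x_train = x_train[top_features]
    # Example hyperparameters only; the real values are project-specific.
    model = RandomForestClassifier(n_estimators=200, max_depth=10,
                                   random_state=42)
    model.fit(x_train, y_train)
    return model
```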

4. Feature Selection

  • Feature Importances: Extracts the feature importances from the trained model.

  • Weight Adjustments: Adjusts the importance of certain features based on predefined weight factors.

  • Top Features Selection: Selects the top top_n features based on their adjusted importance.
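
One way to implement the adjusted-importance selection described above. The weight_factors dictionary stands in for the predefined weight factors, whose actual values aren't given here:

```python
import pandas as pd

def select_top_features(model, feature_names, top_n=5, weight_factors=None):
    # Pair each feature with its importance from the trained forest.
    importances = pd.Series(model.feature_importances_, index=feature_names)
    # Scale selected importances by the predefined weight factors.
    for name, factor in (weight_factors or {}).items():
        if name in importances:
            importances[name] *= factor
    # Keep the top_n features by adjusted importance.
    return importances.nlargest(top_n).index.tolist()
```

Weighting lets domain knowledge nudge the ranking: a feature the model slightly undervalues can still make the cut if its factor is greater than 1.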

5. Model Evaluation

  • Feature Selection: If top_features is provided, it selects those features from the test data.

  • Prediction: Uses the model to make predictions on the test data.

  • Metrics Calculation: Calculates accuracy, precision, and recall scores.
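
The evaluation step in code form, reporting the same three metrics quoted in the benchmark section:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

def evaluate(model, x_test, y_test, top_features=None):
    # Use the same feature subset the model was trained on.
    if top_features is not None:
        x_test = x_test[top_features]
    preds = model.predict(x_test)
    return {
        "accuracy": accuracy_score(y_test, preds),
        "precision": precision_score(y_test, preds),
        "recall": recall_score(y_test, preds),
    }
```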

6. Model Saving

  • Pipeline Execution: Runs the entire pipeline, from loading data to saving the trained model and the selected top features.

  • Metrics Output: Prints the accuracy, precision, and recall scores.
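
Persisting the model might look like this. Bundling the model and its feature list into a single "model.pkl" is one reasonable layout, assumed here so predict.py can reload both together:

```python
import joblib

def save_model(model, top_features, path="model.pkl"):
    # Store the trained model and its feature list in a single file.
    joblib.dump({"model": model, "top_features": top_features}, path)

def load_model(path="model.pkl"):
    bundle = joblib.load(path)
    return bundle["model"], bundle["top_features"]
```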

Overall, the pipeline script builds a RandomForestClassifier model to predict loan defaults, selects important features, evaluates the model, and saves it along with the top features. It handles various preprocessing steps, including converting categorical features to numerical ones and splitting the data into training and testing sets. The script emphasizes feature importance by adjusting weights for certain features and selecting the top ones for training and evaluation. The model isn't a direct replacement for credit scoring; rather, it is a secondary rail that mitigates risk from potentially skewed scores.

Benchmark Results

To benchmark and validate the model, I used StratifiedKFold, a form of K-Fold cross-validation that preserves the class ratio in each fold. Before I added cross-validation, the model was producing 92% accuracy, 77% precision, and 96% recall. Below are the results for 25, 35, and 75 splits (the shaded area in each chart is the standard deviation):

  • 25 splits: 93% accuracy, 81% precision, 93% recall

  • 35 splits: 93% accuracy, 83% precision, 94% recall

  • 75 splits: 94% accuracy, 85% precision, 94% recall
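
The benchmark can be reproduced in outline with StratifiedKFold and cross_validate. The dataset below is synthetic for demonstration; the real numbers above came from the preprocessed Lending Club data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

def benchmark(model, x, y, n_splits=25):
    # StratifiedKFold keeps the Fully Paid / Charged Off ratio constant
    # in every fold, which matters for an imbalanced target.
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_validate(model, x, y, cv=cv,
                            scoring=["accuracy", "precision", "recall"])
    return {m: scores[f"test_{m}"].mean()
            for m in ("accuracy", "precision", "recall")}

# Demo on a synthetic, imbalanced dataset (roughly 80/20 class split).
x, y = make_classification(n_samples=500, weights=[0.8], random_state=0)
results = benchmark(RandomForestClassifier(n_estimators=50, random_state=0),
                    x, y, n_splits=5)
```

The per-fold score arrays in `scores` (rather than just the means) are what the standard-deviation bands in the charts are drawn from.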

Summary

Final Remarks and Improvements

  1. Model Accuracy Chart:

    • The mean accuracy is stable around 0.94.

    • The red shaded area represents the standard deviation, indicating some variability in performance across different folds. However, the variability is relatively small, showing consistent performance.
       

  2. Model Precision Chart:

    • The precision values fluctuate more compared to accuracy and recall, but the mean remains high at 0.86.

    • The variability shown by the shaded area indicates that in some folds, the model's precision is lower. This suggests potential issues with false positives in certain data splits.
       

  3. Model Recall Chart:

    • The recall is consistently high, with a mean of 0.94.

    • The variability in recall is similar to accuracy, indicating consistent identification of true positives across different folds.
       

  • High-Performance Metrics: The high accuracy, precision, and recall values indicate that the model is effective at predicting loan defaults.

  • Consistency: The charts show that the model performs consistently across different data splits, with relatively low variability.

     

Areas for Improvement

Precision Variability: The precision chart shows more variability compared to accuracy and recall. This indicates that in some folds, the model may have higher false positive rates.

Hyperparameter Tuning: To further improve precision and overall model performance, a hyperparameter search using techniques such as grid search or random search could find better parameters for the Random Forest model.
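
A minimal grid search along those lines, tuning for precision since that is the weakest metric here; the grid values are illustrative, and the dataset is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small illustrative grid; a real search would cover more values
# (and could use RandomizedSearchCV for larger spaces).
x, y = make_classification(n_samples=300, random_state=0)
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [5, None]},
    scoring="precision",  # optimize the metric with the most variability
    cv=3,
)
grid.fit(x, y)
best_params = grid.best_params_
```

After the search, `grid.best_estimator_` is already refit on the full data with the winning parameters and can be saved directly.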

 

Accuracy: there are several iterations where the accuracy drops significantly below the mean. This indicates that while the model performs well overall, there are certain subsets of the data where its performance is suboptimal.

Precision: there are notable dips in some iterations. This suggests that the model's ability to correctly identify positive instances varies significantly. High precision indicates fewer false positives, but the dips show some inconsistency.

Recall: the model is generally good at identifying positive cases. However, as with accuracy and precision, some iterations show lower recall, indicating variability in capturing all relevant positive instances.

Overall: the model is generally good but shows a few inconsistencies. These can be worked out with additional hyperparameter tuning using GridSearchCV.

