The following application is inspired by the empirical example in “Measuring the model risk-adjusted performance of machine learning algorithms in credit default prediction” by Alonso Robisco and Carbó Martínez (2022). However, since we are not interested in model risk-adjusted performance, the application will purely focus on the implementation of machine learning algorithms for loan default prediction.
6.1 Problem Setup
The dataset that we will be using was used in the Kaggle competition “Give Me Some Credit”. The description of the competition reads as follows:
Banks play a crucial role in market economies. They decide who can get finance and on what terms and can make or break investment decisions. For markets and society to function, individuals and companies need access to credit.
Credit scoring algorithms, which make a guess at the probability of default, are the method banks use to determine whether or not a loan should be granted. This competition requires participants to improve on the state of the art in credit scoring, by predicting the probability that somebody will experience financial distress in the next two years.
The goal of this competition is to build a model that borrowers can use to help make the best financial decisions.
Historical data are provided on 250,000 borrowers and the prize pool is $5,000 ($3,000 for first, $1,500 for second and $500 for third).
Unfortunately, there won’t be any prize money today. However, the experience that you can gain from working through an application like this can be invaluable. So, in a way, you are still winning!
6.2 Dataset
Let’s download the dataset automatically, unzip it, and place it in a folder called data if you haven’t done so already
from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile
import os.path

# Check if the files exist
if not os.path.isfile('data/Data Dictionary.xls') or not os.path.isfile('data/cs-training.csv'):
    print('Downloading dataset...')
    # Define the dataset to be downloaded
    zipurl = 'https://www.kaggle.com/api/v1/datasets/download/brycecf/give-me-some-credit-dataset'
    # Download and unzip the dataset in the data folder
    with urlopen(zipurl) as zipresp:
        with ZipFile(BytesIO(zipresp.read())) as zfile:
            zfile.extractall('data')
    print('DONE!')
else:
    print('Dataset already downloaded!')
Downloading dataset...
DONE!
Then, we can have a look at the data dictionary that is provided with the dataset. This will give us an idea of the variables that are available in the dataset and what they represent
import pandas as pd

data_dict = pd.read_excel('data/Data Dictionary.xls', header=1)
data_dict.style.hide()
| Variable Name | Description | Type |
|---|---|---|
| SeriousDlqin2yrs | Person experienced 90 days past due delinquency or worse | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal lines of credit except real estate and no installment debt like car loans divided by the sum of credit limits | percentage |
| age | Age of borrower in years | integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| DebtRatio | Monthly debt payments, alimony, living costs divided by monthly gross income | percentage |
| MonthlyIncome | Monthly income | real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans (installment like car loan or mortgage) and lines of credit (e.g. credit cards) | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans including home equity lines of credit | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
| NumberOfDependents | Number of dependents in family excluding themselves (spouse, children etc.) | integer |
The variable \(y\) that we want to predict is SeriousDlqin2yrs, which indicates whether a person has been 90 days past due on a loan payment (serious delinquency) in the past two years. This target variable is \(1\) if the loan defaults (i.e., serious delinquency occurred) and \(0\) if the loan does not default (i.e., no serious delinquency occurred). The other variables are features that we can use to predict this target variable, such as the age and the monthly income of the borrower.
6.3 Putting the Problem into the Context of the Course
Given the description of the competition and the dataset, we can see that this is a supervised learning problem. We have a target variable that we want to predict, and we have features that we can use to predict this target variable. The target variable is binary, i.e., it can take two values: 0 or 1. The value 0 indicates that the loan will not default, while the value 1 indicates that the loan will default. Thus, this is a binary classification problem.
6.4 Setting up the Environment
We will start by setting up the environment by importing the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
and loading the dataset
df = pd.read_csv('data/cs-training.csv')
Let’s also download some precomputed models that we will use later on
for file_name in ['clf_nn.joblib', 'clf_nn2.joblib']:
    if not os.path.isfile(file_name):
        print(f'Downloading {file_name}...')
        # Generate the download link
        url = f'https://github.com/jmarbet/data-science-course/raw/main/notebooks/{file_name}'
        # Download the file
        with urlopen(url) as response, open(file_name, 'wb') as out_file:
            data = response.read()
            out_file.write(data)
        print('DONE!')
    else:
        print(f'{file_name} already downloaded!')
Note that the columns MonthlyIncome and NumberOfDependents seem to have missing values. Before we drop these missing values or impute them, let's have a look at the distribution of our target variable SeriousDlqin2yrs
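The corresponding code cell is not shown above; a minimal sketch of how the missing values and the target distribution could be inspected, assuming df has been loaded as above:

```python
# Count missing values per column (MonthlyIncome and NumberOfDependents should show gaps)
print(df.isna().sum())

# Plot the share of each value of the target variable
df['SeriousDlqin2yrs'].value_counts(normalize=True).plot.bar()
plt.ylabel('Fraction')
plt.show()
```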
As with the example that we have seen during one of our previous lectures, the dataset seems to be quite imbalanced. Only about 6.7% of the loans have defaulted. This is something that we need to keep in mind when treating the missing values and when building our models.
Let’s see what happens to the distribution of the target variable if we drop the missing values
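A quick way to compare the two distributions, as a sketch:

```python
# Share of defaults before and after dropping rows with missing values
print(df['SeriousDlqin2yrs'].value_counts(normalize=True))
print(df.dropna()['SeriousDlqin2yrs'].value_counts(normalize=True))
```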
It seems to have almost no impact on the distribution of the target variable. This is good news. Let’s compare some other statistics of the dataset before and after dropping the missing values
df.describe().T
count
mean
std
min
25%
50%
75%
max
SeriousDlqin2yrs
150000.0
0.066840
0.249746
0.0
0.000000
0.000000
0.000000
1.0
age
150000.0
52.295207
14.771866
0.0
41.000000
52.000000
63.000000
109.0
NumberOfDependents
146076.0
0.757222
1.115086
0.0
0.000000
0.000000
1.000000
20.0
MonthlyIncome
120269.0
6670.221237
14384.674215
0.0
3400.000000
5400.000000
8249.000000
3008750.0
DebtRatio
150000.0
353.005076
2037.818523
0.0
0.175074
0.366508
0.868254
329664.0
RevolvingUtilizationOfUnsecuredLines
150000.0
6.048438
249.755371
0.0
0.029867
0.154181
0.559046
50708.0
NumberOfOpenCreditLinesAndLoans
150000.0
8.452760
5.145951
0.0
5.000000
8.000000
11.000000
58.0
NumberRealEstateLoansOrLines
150000.0
1.018240
1.129771
0.0
0.000000
1.000000
2.000000
54.0
NumberOfTime30-59DaysPastDueNotWorse
150000.0
0.421033
4.192781
0.0
0.000000
0.000000
0.000000
98.0
NumberOfTime60-89DaysPastDueNotWorse
150000.0
0.240387
4.155179
0.0
0.000000
0.000000
0.000000
98.0
NumberOfTimes90DaysLate
150000.0
0.265973
4.169304
0.0
0.000000
0.000000
0.000000
98.0
df.dropna().describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeriousDlqin2yrs | 120269.0 | 0.069486 | 0.254280 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| age | 120269.0 | 51.289792 | 14.426684 | 0.0 | 40.000000 | 51.000000 | 61.000000 | 103.0 |
| NumberOfDependents | 120269.0 | 0.851832 | 1.148391 | 0.0 | 0.000000 | 0.000000 | 2.000000 | 20.0 |
| MonthlyIncome | 120269.0 | 6670.221237 | 14384.674215 | 0.0 | 3400.000000 | 5400.000000 | 8249.000000 | 3008750.0 |
| DebtRatio | 120269.0 | 26.598777 | 424.446457 | 0.0 | 0.143388 | 0.296023 | 0.482559 | 61106.5 |
| RevolvingUtilizationOfUnsecuredLines | 120269.0 | 5.899873 | 257.040685 | 0.0 | 0.035084 | 0.177282 | 0.579428 | 50708.0 |
| NumberOfOpenCreditLinesAndLoans | 120269.0 | 8.758475 | 5.172835 | 0.0 | 5.000000 | 8.000000 | 11.000000 | 58.0 |
| NumberRealEstateLoansOrLines | 120269.0 | 1.054519 | 1.149273 | 0.0 | 0.000000 | 1.000000 | 2.000000 | 54.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 120269.0 | 0.381769 | 3.499234 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 120269.0 | 0.187829 | 3.447901 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| NumberOfTimes90DaysLate | 120269.0 | 0.211925 | 3.465276 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
It looks like the statistics before and after dropping the missing values are quite similar, except for the variable DebtRatio, which has a substantially lower mean and standard deviation after dropping. Let's also have a look at the distribution of the variables for the rows that we have dropped
df.loc[df.isna().any(axis=1)].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeriousDlqin2yrs | 29731.0 | 0.056137 | 0.230189 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| age | 29731.0 | 56.362349 | 15.438786 | 21.0 | 46.000000 | 57.000000 | 67.000000 | 109.0 |
| NumberOfDependents | 25807.0 | 0.316310 | 0.809944 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 9.0 |
| MonthlyIncome | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| DebtRatio | 29731.0 | 1673.396556 | 4248.372895 | 0.0 | 123.000000 | 1159.000000 | 2382.000000 | 329664.0 |
| RevolvingUtilizationOfUnsecuredLines | 29731.0 | 6.649421 | 217.814854 | 0.0 | 0.016027 | 0.081697 | 0.440549 | 22198.0 |
| NumberOfOpenCreditLinesAndLoans | 29731.0 | 7.216071 | 4.842720 | 0.0 | 4.000000 | 6.000000 | 10.000000 | 45.0 |
| NumberRealEstateLoansOrLines | 29731.0 | 0.871481 | 1.034291 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 23.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 29731.0 | 0.579866 | 6.255361 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 29731.0 | 0.452995 | 6.242076 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| NumberOfTimes90DaysLate | 29731.0 | 0.484612 | 6.250408 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
Again, the mean of the dropped rows is substantially higher for the variable DebtRatio, suggesting that the missing values are not missing completely at random. Note, however, that the standard deviation is lower, meaning that the dropped observations are more similar to each other in the DebtRatio dimension. From our data dictionary, we know that the DebtRatio is defined as

\[ \text{DebtRatio} = \frac{\text{Monthly debt payments} + \text{alimony} + \text{living costs}}{\text{Monthly gross income}} \]

It does not say whether this is in gross or net terms, though. Nevertheless, let's have a look at the relationship between the DebtRatio and the MonthlyIncome
ax = df.plot.scatter(x='MonthlyIncome', y='DebtRatio')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_xlabel('MonthlyIncome')
ax.set_ylabel('DebtRatio')
ax.set_title('DebtRatio vs. MonthlyIncome')
plt.show()
This looks rather odd. Note how there are a lot of monthly incomes that are close to zero. Furthermore, there is an unusual gap running through the scatter points. We can look at the descriptive statistics of the rows with MonthlyIncome of at most 100
df.query('MonthlyIncome <= 100').describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeriousDlqin2yrs | 2301.0 | 0.036506 | 0.187586 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
| age | 2301.0 | 47.740113 | 16.199176 | 21.0 | 35.000000 | 46.000000 | 60.000000 | 103.0 |
| NumberOfDependents | 2301.0 | 0.778792 | 1.192441 | 0.0 | 0.000000 | 0.000000 | 2.000000 | 10.0 |
| MonthlyIncome | 2301.0 | 1.837027 | 11.408271 | 0.0 | 0.000000 | 0.000000 | 1.000000 | 100.0 |
| DebtRatio | 2301.0 | 1370.529300 | 2752.843610 | 0.0 | 79.000000 | 732.000000 | 1850.500000 | 61106.5 |
| RevolvingUtilizationOfUnsecuredLines | 2301.0 | 3.604101 | 125.553453 | 0.0 | 0.022243 | 0.113149 | 0.486629 | 5893.0 |
| NumberOfOpenCreditLinesAndLoans | 2301.0 | 7.171230 | 4.869628 | 0.0 | 4.000000 | 6.000000 | 10.000000 | 31.0 |
| NumberRealEstateLoansOrLines | 2301.0 | 0.742721 | 0.904984 | 0.0 | 0.000000 | 1.000000 | 1.000000 | 9.0 |
| NumberOfTime30-59DaysPastDueNotWorse | 2301.0 | 0.549326 | 6.132226 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| NumberOfTime60-89DaysPastDueNotWorse | 2301.0 | 0.428509 | 6.124274 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
| NumberOfTimes90DaysLate | 2301.0 | 0.438505 | 6.128357 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 98.0 |
These observations have a higher DebtRatio than the rest of the dataset but are less likely to default on their loans (the mean of SeriousDlqin2yrs is equal to the fraction of defaulting loans). Given that they have no income (or essentially no income), this seems rather odd and is likely due to an error during data entry/collection. Since there are only a small number of observations with MonthlyIncome of at most 100, we can probably drop them. Let's look at the same figure for MonthlyIncome greater than 100
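The filtered scatter plot is not reproduced here; a minimal sketch of how it could be generated:

```python
# Same scatter plot as above, restricted to MonthlyIncome > 100
ax = df.query('MonthlyIncome > 100').plot.scatter(x='MonthlyIncome', y='DebtRatio')
ax.set_xscale('log')
ax.set_yscale('log')
ax.set_title('DebtRatio vs. MonthlyIncome (MonthlyIncome > 100)')
plt.show()
```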
This looks better, but note how the scatter points below the gap seem to line up with the curve \(\text{DebtRatio} = \frac{1}{\text{MonthlyIncome}}\). Thus, there seems to be another potential data entry/collection error, since the debt in the raw data has likely just been set to \(1\) for these observations. If this were a real dataset, we would need to investigate this further and perhaps talk to the people who sent us the data. However, given that this is just an example, we leave it as is.
Let's also have a look at the distribution of the DebtRatio variable
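The plot itself is not shown above; a sketch that compares the DebtRatio distribution for the full dataset and for the rows with missing values, mirroring the style of the grid of plots below:

```python
# Histogram of DebtRatio on a log scale, full dataset vs. rows with missing values
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
df['DebtRatio'].plot.hist(bins=1000, ax=axes[0], title='Whole Dataset')
df.loc[df.isna().any(axis=1), 'DebtRatio'].plot.hist(bins=1000, ax=axes[1], title='Rows with Missing Values')
for ax in axes:
    ax.set_xscale('log')
    ax.set_xlabel('DebtRatio')
fig.tight_layout()
plt.show()
```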
Here we can again see the high DebtRatio values for the rows with missing values. We can also have a look at the distribution of all the variables in the dataset
fig, ax = plt.subplots(df.shape[1], 3, figsize=(10, 20))

for ii, col in enumerate(df.columns):
    # Plot the distribution of the variable for the whole dataset, the dataset with missing values dropped, and the dropped rows
    if col in ('SeriousDlqin2yrs', 'age', 'NumberOfTime30-59DaysPastDueNotWorse', 'NumberOfOpenCreditLinesAndLoans', 'NumberOfTimes90DaysLate', 'NumberRealEstateLoansOrLines', 'NumberOfTime60-89DaysPastDueNotWorse', 'NumberOfDependents'):
        # Use a bar plot for discrete variables
        df[col].value_counts(normalize=True).sort_index().plot.bar(ax=ax[ii,0])
        df.dropna()[col].value_counts(normalize=True).sort_index().plot.bar(ax=ax[ii,1])
        df.loc[df.isna().any(axis=1), col].value_counts(normalize=True).sort_index().plot.bar(ax=ax[ii,2])
        # Set the y-axis label
        ax[ii,0].set_ylabel('Fraction')
        ax[ii,1].set_ylabel('')
        ax[ii,2].set_ylabel('')
    else:
        # Use a histogram for continuous variables
        df[col].plot.hist(bins=1000, ax=ax[ii,0])
        df.dropna()[col].plot.hist(bins=1000, ax=ax[ii,1])
        df.loc[df.isna().any(axis=1), col].plot.hist(bins=1000, ax=ax[ii,2])
        # Set the x-axis to a logarithmic scale for the continuous variables
        ax[ii,0].set_xscale('log')
        ax[ii,1].set_xscale('log')
        ax[ii,2].set_xscale('log')
        # Set the x-axis label
        ax[ii,0].set_xlabel(col)
        ax[ii,1].set_xlabel(col)
        ax[ii,2].set_xlabel(col)
        # Set the y-axis label
        ax[ii,0].set_ylabel('Frequency')
        ax[ii,1].set_ylabel('')
        ax[ii,2].set_ylabel('')

ax[0,0].set_title('Whole Dataset')
ax[0,1].set_title('Rows with Missing Values Dropped')
ax[0,2].set_title('Dropped Rows')
fig.tight_layout()
plt.show()
This reveals another potential issue with our dataset. Check out the variable NumberOfTime30-59DaysPastDueNotWorse: there are some observations with values greater than 90, which seems rather odd. Let's have a look at the data dictionary
| Variable Name | Description | Type |
|---|---|---|
| NumberOfTime30-59DaysPastDueNotWorse | Number of times borrower has been 30-59 days past due but no worse in the last 2 years. | integer |
| NumberOfTimes90DaysLate | Number of times borrower has been 90 days or more past due. | integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times borrower has been 60-89 days past due but no worse in the last 2 years. | integer |
The data dictionary does not mention anything about values above 90. These values may have a special meaning such as being a flag for missing values. Let’s have a look at the distribution of the target variable for the rows with values greater than 90
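The corresponding output is not shown here; a sketch of how these rows could be inspected (the variable name suspicious and the exact output format are assumptions):

```python
# Rows where any of the past-due counters exceeds 90
past_due_cols = ['NumberOfTime30-59DaysPastDueNotWorse',
                 'NumberOfTime60-89DaysPastDueNotWorse',
                 'NumberOfTimes90DaysLate']
suspicious = df.loc[(df[past_due_cols] > 90).any(axis=1)]

# Distribution of the target variable and of the counters for these rows
print(suspicious['SeriousDlqin2yrs'].value_counts(normalize=True))
print(suspicious[past_due_cols].describe())
```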
There is a very high number of defaults among these observations (more than half), which makes sense given the meaning of these variables. Furthermore, observations with a value above 90 in one of these categories tend to have values above 90 in the other categories as well. Thus, this might not be a data entry/collection error; these may simply be borrowers who repeatedly fail to make loan payments.
Given that Alonso Robisco and Carbó Martínez (2022) seem to be dropping the missing values, let’s do the same for our dataset
df = df.dropna()
and let's also drop the rows with MonthlyIncome less than or equal to 100
df = df.query('MonthlyIncome > 100')
to eliminate some of the potential data entry/collection errors.
Then double-check that we have no missing values left
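A quick sanity check, as a sketch:

```python
# Should print 0 for every column
print(df.isna().sum())
```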
All good! We should also check for duplicated rows with the duplicated() method
df.loc[df.duplicated()]
| | SeriousDlqin2yrs | age | NumberOfDependents | MonthlyIncome | DebtRatio | RevolvingUtilizationOfUnsecuredLines | NumberOfOpenCreditLinesAndLoans | NumberRealEstateLoansOrLines | NumberOfTime30-59DaysPastDueNotWorse | NumberOfTime60-89DaysPastDueNotWorse | NumberOfTimes90DaysLate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7920 | 0 | 22 | 0.0 | 820.0 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 |
| 8840 | 0 | 23 | 0.0 | 820.0 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 |
| 15546 | 0 | 22 | 0.0 | 929.0 | 0.0 | 0.0 | 2 | 0 | 0 | 0 | 0 |
| 17265 | 0 | 22 | 0.0 | 820.0 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 |
| 21190 | 0 | 22 | 0.0 | 820.0 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 143750 | 0 | 23 | 0.0 | 820.0 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 |
| 144153 | 0 | 28 | 0.0 | 2200.0 | 0.0 | 1.0 | 0 | 0 | 0 | 0 | 0 |
| 144922 | 0 | 40 | 0.0 | 3500.0 | 0.0 | 0.0 | 1 | 0 | 0 | 0 | 0 |
| 148419 | 0 | 22 | 0.0 | 1500.0 | 0.0 | 0.0 | 2 | 0 | 0 | 0 | 0 |
| 149993 | 0 | 22 | 0.0 | 820.0 | 0.0 | 1.0 | 1 | 0 | 0 | 0 | 0 |

72 rows × 11 columns
and look at the statistics of the duplicated rows
df.loc[df.duplicated()].describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| SeriousDlqin2yrs | 72.0 | 0.013889 | 0.117851 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 |
| age | 72.0 | 24.902778 | 8.868618 | 21.0 | 22.0 | 22.5 | 24.0 | 70.000000 |
| NumberOfDependents | 72.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| MonthlyIncome | 72.0 | 1031.527778 | 542.007873 | 764.0 | 820.0 | 820.0 | 929.0 | 3500.000000 |
| DebtRatio | 72.0 | 0.017594 | 0.104816 | 0.0 | 0.0 | 0.0 | 0.0 | 0.633374 |
| RevolvingUtilizationOfUnsecuredLines | 72.0 | 0.500000 | 0.503509 | 0.0 | 0.0 | 0.5 | 1.0 | 1.000000 |
| NumberOfOpenCreditLinesAndLoans | 72.0 | 1.458333 | 0.749413 | 0.0 | 1.0 | 1.0 | 2.0 | 4.000000 |
| NumberRealEstateLoansOrLines | 72.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| NumberOfTime30-59DaysPastDueNotWorse | 72.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| NumberOfTime60-89DaysPastDueNotWorse | 72.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000000 |
| NumberOfTimes90DaysLate | 72.0 | 0.013889 | 0.117851 | 0.0 | 0.0 | 0.0 | 0.0 | 1.000000 |
There are indeed 72 duplicated rows in the dataset. However, given that the variables in our dataset are mostly discrete and that monthly income appears to be generally rounded, it does not seem implausible that some rows appear multiple times simply because some observations share the same values for all variables. Thus, we will keep the duplicated rows in the dataset.
6.6 Data Exploration
Let’s start by looking at the distribution of the target variable SeriousDlqin2yrs in our preprocessed dataset
We have already looked at some variables selectively. To do it more broadly, we can look at the pair plot of the dataset. A pair plot shows the pairwise relationships between the variables in our dataset. On the diagonal, we are plotting the kernel density estimate
Note that we are plotting all variables in different colors based on whether our target variable SeriousDlqin2yrs is \(0\) or \(1\). Furthermore, since it is computationally quite demanding to create this plot, we have sampled only 1000 rows from the dataset. Since we have many variables, some of them with very skewed distributions, and also several discrete variables, it might make sense to look only at a subset
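The pair plots themselves are omitted here; a sketch of how they could be produced following the description above (the random_state and the particular subset of variables are assumptions):

```python
# Pair plot on a random sample of 1000 rows, colored by the target variable
sample = df.sample(1000, random_state=42)
sns.pairplot(sample, hue='SeriousDlqin2yrs', diag_kind='kde')
plt.show()

# Restrict to a subset of variables for readability
subset = ['SeriousDlqin2yrs', 'age', 'MonthlyIncome', 'DebtRatio', 'RevolvingUtilizationOfUnsecuredLines']
sns.pairplot(sample[subset], hue='SeriousDlqin2yrs', diag_kind='kde')
plt.show()
```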
We can continue with the analysis of our dataset by looking at the correlation matrix of the variables in the dataset. We will calculate both the Pearson correlation (linear relationship) and the Spearman correlation (monotonic relationship) and create a heatmap of both correlation matrices
corr = df.corr() # Calculate the Pearson correlation (linear relationship)
cmap = sns.diverging_palette(10, 255, as_cmap=True) # Create a color map
mask = np.triu(np.ones_like(corr, dtype=bool)) # Create a mask to only show the lower triangle of the matrix
sns.heatmap(corr, cmap=cmap, vmax=1, center=0, mask=mask) # Create a heatmap of the correlation matrix (Note: vmax=1 makes sure that the color map goes up to 1 and center=0 centers the color map at 0)
plt.show()
corr = df.corr('spearman') # Calculate the Spearman correlation (monotonic relationship)
cmap = sns.diverging_palette(10, 255, as_cmap=True) # Create a color map
mask = np.triu(np.ones_like(corr, dtype=bool)) # Create a mask to only show the lower triangle of the matrix
sns.heatmap(corr, cmap=cmap, vmax=1, center=0, mask=mask) # Create a heatmap of the correlation matrix (Note: vmax=1 makes sure that the color map goes up to 1 and center=0 centers the color map at 0)
plt.show()
It seems that age is negatively correlated with default (SeriousDlqin2yrs), which we can also see in the kernel density estimate of the age variable
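The KDE plot is not reproduced here; a minimal sketch, assuming the densities are split by the target variable:

```python
# Kernel density estimate of age, split by default status
sns.kdeplot(data=df, x='age', hue='SeriousDlqin2yrs', common_norm=False)
plt.show()
```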
However, MonthlyIncome is also negatively correlated with default and with age. Thus, the relationship between age and default is likely driven in part by MonthlyIncome.
Furthermore, the variables NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse, and NumberOfTimes90DaysLate are highly correlated with each other and with the target variable SeriousDlqin2yrs. This is not surprising given that these variables are all related to the number of times a borrower has been past due on a loan payment. RevolvingUtilizationOfUnsecuredLines is also highly correlated with the target variable and with the number of times a borrower has been past due on a loan payment. This is also not surprising given that the RevolvingUtilizationOfUnsecuredLines is the ratio of the amount of money owed to the amount of credit available.
6.7 Implementation of Loan Default Prediction Models
We have explored our dataset and are now ready to implement machine learning algorithms for loan default prediction. Let’s start by importing the required libraries
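The import cell is not shown above; a sketch of imports consistent with the classes and functions used in the remainder of this section (the exact imports in the original may differ):

```python
from joblib import load
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score, roc_curve
from xgboost import XGBClassifier
```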
6.7.1 Splitting the Data into Training and Test Sets
Before we can train a machine learning model, we need to split our dataset into a training set and a test set.
X = df.drop('SeriousDlqin2yrs', axis=1) # All variables except `SeriousDlqin2yrs`
y = df['SeriousDlqin2yrs'] # Only SeriousDlqin2yrs
We follow Alonso Robisco and Carbó Martínez (2022) and use 80% of the data for training and 20% for testing. We will also set the stratify argument to y to make sure that the distribution of the target variable is the same in the training and test sets. Otherwise, we might randomly not have any defaulted loans in the test set, which would make it impossible to correctly evaluate our model.
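The split itself is not shown above; a minimal sketch consistent with that description (the random_state is an assumption):

```python
# 80/20 split, stratified on the target variable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```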
To improve the performance of our machine learning model, we should scale the features. This is especially important for models that are sensitive to the scale of the features. We will use the MinMaxScaler class from the sklearn.preprocessing module to scale the features. The MinMaxScaler class scales the features so that they have a minimum of 0 and a maximum of 1.
def scale_features(scaler, df, col_names, only_transform=False):
    # Extract the features we want to scale
    features = df[col_names]
    # Fit the scaler to the features and transform them
    if only_transform:
        features = scaler.transform(features.values)
    else:
        features = scaler.fit_transform(features.values)
    # Replace the original features with the scaled features
    df[col_names] = features

scaler = MinMaxScaler()
scale_features(scaler, X_train, X_train.columns)
scale_features(scaler, X_test, X_test.columns, only_transform=True)
Note that we have very skewed distributions for some variables in our dataset. This might make the MinMaxScaler less effective and there might be gains from more carefully scaling different variables. However, for the sake of simplicity, we will use the MinMaxScaler for all variables.
We have fully preprocessed and explored our dataset. The next step will be our main task: the implementation of machine learning algorithms for loan default prediction.
6.7.3 Evaluation Criteria
We will evaluate the performance of our machine-learning models using the following metrics:
Accuracy: The proportion of correctly classified instances
Precision: The proportion of true positive predictions among all positive predictions
Recall: The proportion of true positive predictions among all actual positive instances
ROC AUC: The area under the receiver operating characteristic curve
Furthermore, we will plot the ROC curve for each model to visualize the trade-off between the true positive rate and the false positive rate. To make the evaluation of our models more convenient, we will define a function that computes these metrics and plots the ROC curve for a given model
While we compute all of these metrics, we will focus on the ROC AUC score as our main evaluation metric.
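The helper function itself is not listed above; a minimal sketch of what it could look like (the function name evaluate_model and the exact output format are assumptions; it also reports the training-set ROC AUC, since the text below compares training and test performance):

```python
def evaluate_model(clf, X_train, y_train, X_test, y_test):
    # Predicted classes and predicted default probabilities on the test set
    y_pred = clf.predict(X_test)
    y_prob = clf.predict_proba(X_test)[:, 1]

    # Compute the evaluation metrics
    print('Accuracy: ', accuracy_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall:   ', recall_score(y_test, y_pred))
    print('ROC AUC (train):', roc_auc_score(y_train, clf.predict_proba(X_train)[:, 1]))
    print('ROC AUC (test): ', roc_auc_score(y_test, y_prob))

    # Plot the ROC curve for the test set
    fpr, tpr, _ = roc_curve(y_test, y_prob)
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], linestyle='--', color='gray')  # Random classifier benchmark
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.show()
```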
6.7.4 Logistic Regression
Let’s start with a simple logistic regression model. We will use the LogisticRegression class from the sklearn.linear_model module to train a logistic regression model. We will use the lbfgs solver and set the max_iter parameter to 5000 to make sure that the optimization algorithm converges. We will also set the penalty parameter to None to avoid regularization.
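A sketch of the training step as described, using the evaluate_model helper sketched above (penalty=None requires a recent version of scikit-learn; older versions use the string 'none'):

```python
# Logistic regression without regularization
clf_lr = LogisticRegression(solver='lbfgs', max_iter=5000, penalty=None)
clf_lr.fit(X_train, y_train)
evaluate_model(clf_lr, X_train, y_train, X_test, y_test)
```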
The model does not perform as well as what we have seen in previous lectures. The ROC AUC score is only around 0.7. Note again that the accuracy score is quite high but this is due to the imbalanced nature of the dataset.
6.7.5 Decision Tree
Let’s now train a decision tree classifier. We will use the DecisionTreeClassifier class from the sklearn.tree module to train a decision tree classifier. We will set the max_depth parameter to 7 as in Alonso Robisco and Carbó Martínez (2022) to avoid overfitting.
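A sketch of the corresponding training step (the random_state is an assumption):

```python
# Decision tree with limited depth to avoid overfitting
clf_dt = DecisionTreeClassifier(max_depth=7, random_state=42)
clf_dt.fit(X_train, y_train)
evaluate_model(clf_dt, X_train, y_train, X_test, y_test)
```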
The decision tree classifier performs better than the logistic regression model with a ROC AUC score of around 0.77. This is not surprising given that decision trees are more flexible models that can capture non-linear relationships in the data.
6.7.6 Random Forest
Let’s now train a random forest classifier. We will use the RandomForestClassifier class from the sklearn.ensemble module to train a random forest classifier. We will set the max_depth parameter to 20 and the n_estimators parameter to 100 as in Alonso Robisco and Carbó Martínez (2022).
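A sketch of the corresponding training step (the random_state is an assumption):

```python
# Random forest with 100 trees of depth at most 20
clf_rf = RandomForestClassifier(max_depth=20, n_estimators=100, random_state=42)
clf_rf.fit(X_train, y_train)
evaluate_model(clf_rf, X_train, y_train, X_test, y_test)
```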
This is a good example of the dangers of not using a test set for the evaluation of a model. The random forest classifier performs very well on the training set with a ROC AUC score of close to 1.0. However, it performs much worse on the test set with a ROC AUC score of around 0.83. Nevertheless, the random forest classifier still outperforms the logistic regression and decision tree classifiers.
6.7.7 XGBoost
Let’s now train an XGBoost classifier. We will use the XGBClassifier class from the xgboost module to train an XGBoost classifier. We will set the max_depth parameter to 5 and the n_estimators parameter to 40 as in Alonso Robisco and Carbó Martínez (2022).
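A sketch of the corresponding training step:

```python
# Gradient-boosted trees via XGBoost
clf_xgb = XGBClassifier(max_depth=5, n_estimators=40)
clf_xgb.fit(X_train, y_train)
evaluate_model(clf_xgb, X_train, y_train, X_test, y_test)
```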
The XGBoost classifier performs quite well with an ROC AUC score of around 0.83. This is the best performance we have seen so far.
6.7.8 Neural Network
Finally, let’s train a neural network classifier. We will use the MLPClassifier class from the sklearn.neural_network module to train a neural network classifier. We will set the activation parameter to relu, the solver parameter to adam, and the hidden_layer_sizes parameter to (300,200,100) as in Alonso Robisco and Carbó Martínez (2022). We will also set the random_state parameter to 42 to make the results reproducible.
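A sketch of how the classifier would be set up; the actual training is skipped here in favor of the pretrained model loaded below:

```python
# Neural network with three hidden layers (training can take a while)
clf_nn = MLPClassifier(hidden_layer_sizes=(300, 200, 100), activation='relu',
                       solver='adam', random_state=42)
# clf_nn.fit(X_train, y_train)  # Skipped; a pretrained model is loaded below instead
```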
Since training the neural network classifier can take a long time, we have saved the trained model to a file called clf_nn.joblib. We can load the model from the file using the load function from the joblib module
clf_nn = load('clf_nn.joblib')
Let’s check the loss curve of the neural network classifier
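A sketch of such a plot, using the loss_curve_ attribute of the fitted MLPClassifier:

```python
# Training loss per iteration of the Adam optimizer
plt.plot(clf_nn.loss_curve_)
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.title('Neural Network Loss Curve')
plt.show()
```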
But can we do better? Alonso Robisco and Carbó Martínez (2022) have also applied feature engineering to the dataset. Let’s see if we can improve the performance of our models by adding some additional features.
6.9 Feature Engineering and Model Improvement
We will add the square of each feature to the dataset to create additional features as in Alonso Robisco and Carbó Martínez (2022). We will use the assign method of the pandas DataFrame to add the squared features to the dataset
X2 = df.drop('SeriousDlqin2yrs', axis=1) # All variables except `SeriousDlqin2yrs`
y2 = df['SeriousDlqin2yrs'] # Only SeriousDlqin2yrs
X2 = X2.assign(**X2.pow(2).add_suffix('_sq')) # Add the squared features to the dataset
Then, we will split the dataset into a training set and a test set and scale the features
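This step mirrors the earlier split and scaling; a sketch (variable names with the suffix 2 are assumptions):

```python
# Split the augmented dataset and scale the features as before
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.2, stratify=y2, random_state=42)

scaler2 = MinMaxScaler()
scale_features(scaler2, X2_train, X2_train.columns)
scale_features(scaler2, X2_test, X2_test.columns, only_transform=True)
```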
The models with the new features in the dataset perform better than the models without the new features. The random forest classifier and the XGBoost classifier have the best performance with ROC AUC scores of around 0.84 and 0.85, respectively.
6.10 Feature Importance
We can also look at the feature importance of the random forest classifier and the XGBoost classifier to see which features are most important for predicting loan defaults. We will use the feature_importances_ attribute of the random forest classifier and the XGBoost classifier to get the feature importances
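The plots are not reproduced here; a sketch, assuming clf_rf and clf_xgb are the fitted classifiers from the sketches above and that the index matches the feature columns they were trained on (the original may use the models retrained on the engineered feature set):

```python
# Feature importances of the fitted random forest and XGBoost classifiers
importances = pd.DataFrame({
    'Random Forest': clf_rf.feature_importances_,
    'XGBoost': clf_xgb.feature_importances_,
}, index=X_train.columns)

importances.sort_values('XGBoost').plot.barh(figsize=(8, 8))
plt.xlabel('Feature Importance')
plt.show()
```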
We have successfully implemented machine learning algorithms for loan default prediction: we explored and preprocessed the dataset, trained several machine learning models, and evaluated their performance. Feature engineering further improved the results, with the random forest and XGBoost classifiers performing best at ROC AUC scores of around 0.84 and 0.85, respectively. Finally, we examined the feature importances of these two classifiers to see which features matter most for predicting loan defaults.
Alonso Robisco, Andrés, and José Manuel Carbó Martínez. 2022. “Measuring the model risk-adjusted performance of machine learning algorithms in credit default prediction.” Financial Innovation 8 (1). https://doi.org/10.1186/s40854-022-00366-1.