- Introduction
- Load the data
- Summary of data
- Clean Data
- Exploratory Data Analysis
- Feature Engineering
- Training
- Create Scorecard
- Conclusions
Introduction
Credit risk modeling predicts whether a customer or applicant will default on a loan. These models include predictor variables that are categorical or numeric. One of the outputs of the modeling process is a credit scorecard, which allocates points to the attributes of each predictor.
The objectives of this post are as follows:
- Create models using logistic regression
- Compare manual coarse binning with automated optimal Weight-of-Evidence (WOE) and Information Value (IV) binning
- Understand the WOE calculations
- Create a credit scorecard
Here we will use a public dataset, the German Credit Data, with a binary response variable, good or bad risk. We will evaluate and compare the models with typical credit risk model measures, AUC and the Kolmogorov-Smirnov (KS) statistic.
library(tidyverse)
# Packages for loading data
library(httr)
# Packages for data manipulation
library(purrr)
# Packages for WOE/IV, cross-tabulation and data profiling
library(Information)
library(gmodels)
library(DataExplorer)
# Packages for modeling and evaluation (ROCR is also used later via the :: prefix)
library(pROC)
library(scorecard)
Load data
Load the Statlog (German Credit Data) data set with the httr and readr R packages. Note that readr is loaded and attached to the search path via library(tidyverse) above.
# Set the url and use the GET function from httr to download the csv file to a temporary local file, then read it with read_csv from the readr R package. The same data is available from the UCI repository, but without variable names; this url includes them, which saves us a step.
url <- "http://www.biz.uiowa.edu/faculty/jledolter/DataMining/germancredit.csv"
GET(url, write_disk(tf <- tempfile(fileext = ".data")))
## Response [https://www.biz.uiowa.edu/faculty/jledolter/DataMining/germancredit.csv]
## Date: 2018-09-03 22:10
## Status: 200
## Content-Type: text/csv
## Size: 81 kB
## <ON DISK> C:\Users\Home\AppData\Local\Temp\RtmpEREOML\file16080716649d9.data
creditdata <- read_csv(tf)
## Parsed with column specification:
## cols(
## .default = col_character(),
## Default = col_integer(),
## duration = col_integer(),
## amount = col_integer(),
## installment = col_integer(),
## residence = col_integer(),
## age = col_integer(),
## cards = col_integer(),
## liable = col_integer()
## )
## See spec(...) for full column specifications.
Summary of data
According to the metadata, this dataset represents approved loans and would therefore exclude rejected applications.
Let’s first take a look at a summary using the skim function from the skimr package.
# Plot the summary excluding the histograms as these do not print in R Markdown html yet
skimr::skim_with(integer = list(hist = NULL))
skimr::skim(creditdata)
## Skim summary statistics
## n obs: 1000
## n variables: 21
##
## -- Variable type:character ----------------------------------------------------------------------
## variable missing complete n min max empty n_unique
## checkingstatus1 0 1000 1000 3 3 0 4
## employ 0 1000 1000 3 3 0 5
## foreign 0 1000 1000 4 4 0 2
## history 0 1000 1000 3 3 0 5
## housing 0 1000 1000 4 4 0 3
## job 0 1000 1000 4 4 0 4
## otherplans 0 1000 1000 4 4 0 3
## others 0 1000 1000 4 4 0 3
## property 0 1000 1000 4 4 0 4
## purpose 0 1000 1000 3 4 0 10
## savings 0 1000 1000 3 3 0 5
## status 0 1000 1000 3 3 0 4
## tele 0 1000 1000 4 4 0 2
##
## -- Variable type:integer ------------------------------------------------------------------------
##     variable missing complete    n    mean      sd  p0    p25    p50     p75  p100
##          age       0     1000 1000   35.55   11.38  19     27     33      42    75
##       amount       0     1000 1000 3271.26 2822.74 250 1365.5 2319.5 3972.25 18424
##        cards       0     1000 1000    1.41    0.58   1      1      1       2     4
##      Default       0     1000 1000    0.3     0.46   0      0      0       1     1
##     duration       0     1000 1000   20.9    12.06   4     12     18      24    72
##  installment       0     1000 1000    2.97    1.12   1      2      3       4     4
##       liable       0     1000 1000    1.16    0.36   1      1      1       1     2
##    residence       0     1000 1000    2.85    1.1    1      2      3       4     4
Clean Data
Next we will clean the dataset using dplyr and purrr R packages.
From the metadata, the following variables are numeric:
- Attribute 2: duration (numerical) Duration in month
- Attribute 5: amount (numerical) Credit amount
- Attribute 8: instalment (numerical) Instalment rate in percentage of disposable income
- Attribute 11: residence (numerical) Present residence since
- Attribute 13: age (numerical) Age in years
- Attribute 18: liable (numerical) Number of people being liable to provide maintenance for
head(creditdata)
## # A tibble: 6 x 21
## Default checkingstatus1 duration history purpose amount savings employ
## <int> <chr> <int> <chr> <chr> <int> <chr> <chr>
## 1 0 A11 6 A34 A43 1169 A65 A75
## 2 1 A12 48 A32 A43 5951 A61 A73
## 3 0 A14 12 A34 A46 2096 A61 A74
## 4 0 A11 42 A32 A42 7882 A61 A74
## 5 1 A11 24 A33 A40 4870 A61 A73
## 6 0 A14 36 A32 A46 9055 A65 A73
## # ... with 13 more variables: installment <int>, status <chr>,
## # others <chr>, residence <int>, property <chr>, age <int>,
## # otherplans <chr>, housing <chr>, cards <int>, job <chr>, liable <int>,
## # tele <chr>, foreign <chr>
However, from the view of the first rows, installment, residence and liable appear to be categorical variables, as they are classification groupings. Therefore we will only convert duration, amount and age to numeric variables.
# Change the numerical variables to numeric using the mutate_at function from the dplyr R package
creditdata <- creditdata %>%
mutate_at(vars(duration,amount,age),as.numeric) %>%
data.frame(stringsAsFactors = FALSE)
# Convert remaining integer variables to character ie categorical variables using map_if function from purrr R package
creditdata <- creditdata %>%
map_if(is.integer,as.character) %>%
data.frame() # leave as factors with default argument
# Rename the default variable to lower case for naming consistency
names(creditdata) <- tolower(names(creditdata))
# Convert the default variable to numeric and binary for modeling
creditdata$default <- as.numeric(creditdata$default)
creditdata$default <- recode(creditdata$default,'1'=0L,'2'=1L)
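Note that default was converted to a factor by data.frame() earlier, so as.numeric() returns its level codes 1 and 2 rather than the original 0/1 values; the recode() call then maps these back to 0 = Good and 1 = Bad. A quick check (a sketch, not part of the original workflow) confirms the expected split:
# Tabulate the recoded default variable; expect 700 zeros (good) and 300 ones (bad)
table(creditdata$default)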
Next, recode the variable levels with the forcats R package so that the variables are easier to understand in the plots. We recode them as per the metadata.
# Recode with fct_recode function
creditdata$checkingstatus1 <- fct_recode(creditdata$checkingstatus1, less0='A11', low0to200DM='A12', highover200DM='A13', noaccount='A14')
creditdata$history <- fct_recode(creditdata$history, nocredit='A30', creditpaid='A31',creditpaidtilnow ='A32', pastdelays='A33',otherbankcredit='A34')
creditdata$purpose <- fct_recode(creditdata$purpose, car_new='A40', car_used='A41',furniture ='A42', radio_tv='A43',appliance='A44', repairs='A45', education='A46', retraining='A48', business='A49', others='A410') #vacation='A47'does not exist in data
creditdata$others <- fct_recode(creditdata$others, noguarantors='A101', coapplicant='A102',guarantor ='A103')
creditdata$property <- fct_recode(creditdata$property, realestate='A121', othersavings='A122',othercar ='A123',noproperty='A124')
creditdata$otherplans <- fct_recode(creditdata$otherplans, bankplans='A141', storeplans='A142',noplans ='A143')
creditdata$status <- fct_recode(creditdata$status, male_other='A91',female_other='A92', male_single='A93', male_other='A94', female_single='A95')
## Warning: Unknown levels in `f`: A95
creditdata$savings <- fct_recode(creditdata$savings, less100DM='A61',f100to500DM='A62',f500to1000DM='A63',greater1000DM='A64',unknown='A65')
creditdata$employ <- fct_recode(creditdata$employ, unemployed='A71',less1year='A72',f1to4years='A73',f4to7years='A74',over7years='A75')
creditdata$job <- fct_recode(creditdata$job, unemployed='A171',unskilled='A172',skilled='A173',management='A174')
creditdata$foreign <- fct_recode(creditdata$foreign, yes='A201',no='A202')
creditdata$housing <- fct_recode(creditdata$housing, rent='A151',own='A152',free='A153')
creditdata$tele <- fct_recode(creditdata$tele, none='A191',yes='A192')
We note a warning message, Unknown levels in `f`: A95, for the status variable. It appears that there are no observations with the A95 (single female) level in the data.
Missing Values
As per the metadata, there are no missing values. Let’s confirm with a missing values plot, plot_missing, from the DataExplorer package.
The DataExplorer package speeds up the data exploration process by automatically scanning each variable and profiling the data.
plot_missing(creditdata)
# Create a base dataset copy for the woe models
creditdata_base <- creditdata
With real data, we would also verify missing values coded other than NA.
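To complement the plot, here is a quick numeric check (a sketch, not part of the original workflow) for NA values and for empty strings that could encode missingness differently:
# Count NAs per column; all should be zero as per the metadata
colSums(is.na(creditdata))
# Count empty strings, which sometimes stand in for missing values in categorical columns
sapply(creditdata, function(x) sum(x == "", na.rm = TRUE))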
Exploratory Data Analysis
Target variable
The response variable is default. As per the metadata 1 = Good and 2 = Bad, however the variable has been coded to 0 = Good and 1 = Bad in the dataset.
Let’s take a look at the proportions of default and checkingstatus1 using the gmodels R package. We will use the checkingstatus1 variable as an example to understand the WOE calculations.
# Use the CrossTable function from gmodels package on default and checkingstatus1
gmodels::CrossTable(creditdata$default,
creditdata$checkingstatus1,prop.r= FALSE, prop.c =FALSE, prop.t=FALSE, chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## |-------------------------|
##
##
## Total Observations in Table: 1000
##
##
## | creditdata$checkingstatus1
## creditdata$default | less0 | low0to200DM | highover200DM | noaccount | Row Total |
## -------------------|---------------|---------------|---------------|---------------|---------------|
## 0 | 139 | 164 | 49 | 348 | 700 |
## | 14.535 | 3.136 | 0.544 | 18.901 | |
## -------------------|---------------|---------------|---------------|---------------|---------------|
## 1 | 135 | 105 | 14 | 46 | 300 |
## | 33.915 | 7.317 | 1.270 | 44.102 | |
## -------------------|---------------|---------------|---------------|---------------|---------------|
## Column Total | 274 | 269 | 63 | 394 | 1000 |
## -------------------|---------------|---------------|---------------|---------------|---------------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 123.7209 d.f. = 3 p = 1.218902e-26
##
##
##
From the table, 30% of the loans had a bad outcome in this dataset. If this is representative of the population, then 30% of applicants cannot repay their loans. Given the Chi-squared test result, we can reject the null hypothesis that default and checkingstatus1 are independent; there appears to be some relationship between these two variables.
The class balance of the dataset is worth noting, although real-life credit datasets are typically far more imbalanced than this 70:30 split.
Categorical Variables
To visualize distributions for the categorical features, let’s view frequency bar charts using DataExplorer.
# Barplot of categorical features using DataExplorer.
creditdata %>%
plot_bar()
From these bar plots it appears that the majority of loans are made to single males, and there is no data for single females. Most loans are used for purchasing a radio/television, followed by new cars, and the majority of borrowers have low savings or no account with this bank. The borrowers are mostly foreign, skilled workers with existing property.
Numeric Variables
Now let’s take a look at the default variable against the numeric variables as density plots.
# Select the numeric variables age, duration and amount, then use the purrr map function to plot a density plot for each one, filled by default value
creditdata %>%
select(age,duration,amount) %>%
map(~ggplot(creditdata, aes(x = .)) +
geom_density(aes(fill=as.factor(default)),alpha=0.3) +
scale_fill_viridis_d()) # New ggplot v3.0.0 includes viridis palette
## $age
##
## $duration
##
## $amount
From these density plots there do not appear to be outliers.
The duration, amount and age distributions and peaks appear to differ between default values, so these could be predictor variables, where higher duration is riskier, higher amount is riskier and lower age is riskier.
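To back up the visual impression with numbers, here is a quick grouped summary of the medians (a dplyr sketch, not part of the original workflow):
# Median age, duration and amount for each default value
creditdata %>%
  group_by(default) %>%
  summarise(median_age = median(age),
            median_duration = median(duration),
            median_amount = median(amount))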
Feature Engineering
Before we fit models, we will subjectively group the continuous numeric variables into binned categorical variables, based on cut-off points visualised in the EDA above. This is called coarse binning, as we group values with similar risk into as few bins as possible. With the new group variables, we will remove the original continuous variables.
# Create a new age_group variable, with buckets based on the density distribution above
creditdata <- creditdata %>%
mutate(age_group=factor(case_when(
.$age < 25 ~ "25less",
.$age >= 25 & .$age <= 29 ~ "25to29",
.$age >= 30 & .$age <= 39 ~ "30to39",
.$age >= 40 & .$age <= 49 ~ "40to49",
.$age >= 50 & .$age <= 59 ~ "50to59",
.$age >= 60 ~ "60over")))
# Remove the original age variable so that the model uses only the binned version
creditdata <- creditdata %>%
dplyr::select(-age)
# Create a new amount_group variable, with buckets based on the density distribution above
creditdata <- creditdata %>%
mutate(amount_group=factor(case_when(
.$amount < 1250 ~ "1250less",
.$amount >= 1250 & .$amount <= 5000 ~ "1250to5000",
.$amount >= 5000 ~ "5000over")))
# Remove the original amount variable so that the model uses only the binned version
creditdata <- creditdata %>%
dplyr::select(-amount)
# Create a new duration_group variable, with buckets based on the density distribution above. We will group the months into years, as we see peaks at the 12 month increments
creditdata <- creditdata %>%
mutate(duration_group=factor(case_when(
.$ duration < 12 ~ "1yearunder",
.$ duration >= 12 & .$ duration <= 24 ~ "1to2year",
.$ duration >= 24 & .$ duration <= 36 ~ "2to3year",
.$ duration >= 36 & .$ duration <= 48 ~ "3to4year",
.$ duration >= 48 ~ "4yearover")))
# Remove the original duration variable so that the model uses only the binned version
creditdata <- creditdata %>%
dplyr::select(- duration)
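As a quick sanity check on the coarse binning (not part of the original workflow), confirm that every observation landed in a group and inspect the group sizes:
# Any NA here would indicate a value that fell through the case_when conditions
sapply(creditdata[, c("age_group", "amount_group", "duration_group")], function(x) sum(is.na(x)))
# Distribution of the new duration groups
table(creditdata$duration_group)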
Training
Split the data into training and test sets, following the DataCamp course steps in Chapter 1. Here we will use a 70:30 split between training and test sets, creating one split for the coarse-binned dataset and another for the base dataset that will be WOE binned.
# Set seed of 123 for reproducibility
set.seed(123)
# Store row numbers for training set: index_train using randomly assigned observations
index_train <- sample(1:nrow(creditdata), 0.7 * nrow(creditdata))
# Create training set: training with the index
training <- creditdata[index_train, ]
# Create test set: test with the index
test <- creditdata[-index_train, ]
# Store row numbers for the base training set: index_train_base using randomly assigned observations
index_train_base <- sample(1:nrow(creditdata_base), 0.7 * nrow(creditdata_base))
# Create base training set: training_base with the index
training_base <- creditdata_base[index_train_base, ]
# Create base test set: test_base with the index
test_base <- creditdata_base[-index_train_base, ]
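Since sample() does not stratify by class, it is worth a quick check (a sketch, not part of the original workflow) that the default rate is similar across the splits:
# Default rate in each split; expect values close to the overall 30%
mean(training$default); mean(test$default)
mean(training_base$default); mean(test_base$default)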
Model 1 - Logistic Model with Coarse Binning
Fit the first logistic model on the training data set where the variables have been coarse binned manually.
# Fit a logistic model using the base glm function
glm_model <- glm(default ~ ., family = "binomial", data= training)
Make predictions on the test set and evaluate the model using the AUC and KS measures with the pROC, ROCR and scorecard R packages.
prediction_prob <- predict(glm_model, newdata = test, type = "response")
ROC <- pROC::roc(test$default,prediction_prob)
# AUC for the logistic model
AUC <- pROC::auc(ROC )
AUC
## Area under the curve: 0.7804
# KS for the logistic model: the maximum separation between the cumulative distributions of goods and bads, computed as max(TPR - FPR)
pred <- ROCR::prediction(prediction_prob,test$default)
perf <- ROCR::performance(pred,"tpr","fpr")
KS <- max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
KS
## [1] 0.4548609
Now use the scorecard R package to plot the KS and AUC values.
# performance
test_perf <- scorecard::perf_eva(test$default, prediction_prob, title = "Model 1")
Model 2 - Logistic Model with WOE Binning
Information Theory (WoE and IV)
We can use the Information R package to perform variable screening using weight-of-evidence (WOE) and information value (IV).
WOE and IV play two distinct roles when analyzing data:
- WOE describes the relationship between a predictive variable and a binary target variable.
- IV measures the strength of that relationship.
The WOE framework optimises the number of bins and the cut-off points based on risk. If the variable is numeric then the cut-offs will respect the ordering of its values, but for categorical variables the bins may not respect the original ordering of the levels.
Let’s view a summary of the information values.
# Create a WOE table
IV <- create_infotables(data=creditdata_base,
y="default")
# Print the summary of the IV table information values. These IVs are derived from the WOE.
IV$Summary %>%
knitr::kable()
 | Variable | IV |
---|---|---|
1 | checkingstatus1 | 0.6660115 |
3 | history | 0.2932335 |
2 | duration | 0.2778772 |
6 | savings | 0.1960096 |
4 | purpose | 0.1691951 |
13 | age | 0.1212277 |
12 | property | 0.1126383 |
5 | amount | 0.1117618 |
7 | employ | 0.0864336 |
15 | housing | 0.0832934 |
14 | otherplans | 0.0576145 |
20 | foreign | 0.0438774 |
9 | status | 0.0333416 |
10 | others | 0.0320193 |
8 | installment | 0.0263221 |
16 | cards | 0.0132665 |
17 | job | 0.0087628 |
19 | tele | 0.0063776 |
11 | residence | 0.0035888 |
18 | liable | 0.0000434 |
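As a worked example of how these IVs are derived from the WOE, here is a minimal manual calculation for checkingstatus1, using the counts from the CrossTable in the EDA section above. The convention assumed here is WOE = ln(share of bads / share of goods); some implementations flip the sign, but the IV is unaffected either way.
# Counts of good (default = 0) and bad (default = 1) outcomes by checkingstatus1 level
good <- c(less0 = 139, low0to200DM = 164, highover200DM = 49, noaccount = 348)
bad <- c(less0 = 135, low0to200DM = 105, highover200DM = 14, noaccount = 46)
distr_good <- good / sum(good)             # share of all goods in each level
distr_bad <- bad / sum(bad)                # share of all bads in each level
woe <- log(distr_bad / distr_good)         # WOE per level
iv <- sum((distr_bad - distr_good) * woe)  # information value for the variable
round(woe, 3)
round(iv, 3)                               # approximately 0.666, matching checkingstatus1 in the table above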
We will fit another logistic model on the training set. This time we will filter the variables and create WOE bins using the scorecard R package.
# Filter out variables via missing rate, information value, identical value rate
creditdata_filt <- var_filter(creditdata_base, y="default")
# Print the remaining number variables of the filtered dataset
dim(creditdata_filt)[2]
## [1] 15
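To see which predictors survived the filtering (a quick check, not part of the original workflow):
# Variables retained by var_filter: 14 predictors plus the default target
names(creditdata_filt)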
# Weight of Evidence (WOE) binning
bins <- woebin(creditdata_filt, y="default")
## Warning in e$fun(obj, substitute(ex), parent.frame(), e$data): already
## exporting variable(s): dt, xs, y, breaks_list, min_perc_fine_bin,
## stop_limit, max_num_bin
## Binning on 1000 rows and 15 columns in 0: 0:13
# Let's visualise the bins created for the numeric variables age, duration and amount
woebin_plot(bins$age)
## $age
woebin_plot(bins$duration)
## $duration
woebin_plot(bins$amount)
## $amount
# Let's visualise the bins created for a categorical variable checkingstatus1
woebin_plot(bins$checkingstatus1)
## $checkingstatus1
The WOE binning has grouped age similarly to the coarse bins up until 35, where it has added a group for 35 to 37. The duration WOE bins are grouped by 8 months rather than 12 months until 36 months. For amount there are 5 WOE bins as opposed to 3 coarse bins. For checkingstatus1, the WOE binning grouped the less than 0 and 0 to 200DM levels together.
The bad probability lines for age and amount appear contrary to our density plots, where it appeared that higher amount is riskier and lower age is riskier.
# Convert train and test sets original input data to WOE values based on WOE binning
training_woe <- woebin_ply(training_base, bins)
## Warning in e$fun(obj, substitute(ex), parent.frame(), e$data): already
## exporting variable(s): dt, xs
test_woe <- woebin_ply(test_base, bins)
## Warning in e$fun(obj, substitute(ex), parent.frame(), e$data): already
## exporting variable(s): dt, xs
# Fit a logistic model
glm_model2 <- glm(default ~ ., family = "binomial", data = training_woe)
prediction_prob2 <- predict(glm_model2, newdata = test_woe, type = "response")
# Calculate the ROC
ROC_2 <- pROC::roc(test_woe$default,prediction_prob2)
# AUC for the second model
AUC2 <- auc(ROC_2)
AUC2
## Area under the curve: 0.7951
# KS for the second model
prediction_2 <- ROCR::prediction(prediction_prob2,test_woe$default)
perf_2 <- ROCR::performance(prediction_2,"tpr","fpr")
KS2 <- max(attr(perf_2,'y.values')[[1]]-attr(perf_2,'x.values')[[1]])
KS2
## [1] 0.4846106
# Performance of the WOE model
test_perf <- perf_eva(test_woe$default, prediction_prob2, title = "Model 2")
Create Scorecard
Now create a scorecard from Model 2 using the scorecard package.
# Calculate score card
card <- scorecard(bins, glm_model2)
# Take a look at the scorecard for duration, which includes scorecard points
card$duration %>%
knitr::kable()
variable | bin | count | count_distr | good | bad | badprob | woe | bin_iv | total_iv | breaks | is_special_values | points |
---|---|---|---|---|---|---|---|---|---|---|---|---|
duration | [-Inf,8) | 87 | 0.087 | 78 | 9 | 0.1034483 | -1.3121864 | 0.1068495 | 0.2826181 | 8 | FALSE | 66 |
duration | [8,16) | 344 | 0.344 | 264 | 80 | 0.2325581 | -0.3466246 | 0.0382938 | 0.2826181 | 16 | FALSE | 18 |
duration | [16,36) | 399 | 0.399 | 270 | 129 | 0.3233083 | 0.1086883 | 0.0048133 | 0.2826181 | 36 | FALSE | -6 |
duration | [36,44) | 100 | 0.100 | 58 | 42 | 0.4200000 | 0.5245245 | 0.0299728 | 0.2826181 | 44 | FALSE | -27 |
duration | [44, Inf) | 70 | 0.070 | 30 | 40 | 0.5714286 | 1.1349799 | 0.1026887 | 0.2826181 | Inf | FALSE | -57 |
The scorecard calculates points based on the probabilities, WOE and IV. We observed in the EDA that bad outcomes increased with duration; in the scorecard, higher points are awarded for lower durations, which reflects the same relationship.
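To turn the scorecard into applicant-level scores, the scorecard package provides scorecard_ply, which maps each observation's attributes to points and sums them. A minimal sketch, assuming the card and test_base objects created above (the total score column is named score in recent versions of the package):
# Score the original (pre-WOE) test data with the scorecard
test_scores <- scorecard_ply(test_base, card, only_total_score = TRUE)
head(test_scores)
# Higher scores should correspond to the good (default = 0) group
summary(test_scores$score[test_base$default == 0])
summary(test_scores$score[test_base$default == 1])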
Conclusions
This dataset has no missing values, no outliers and is not heavily imbalanced, so we have not had to overcome these real-world challenges in this exercise. However, the class balance and size of the dataset are not representative of real-world data. In practice, other data, such as credit bureau and time series data, could be used in addition to approved loans data.
When we compare the two models, with their different manual and optimised binning methods, the AUCs are as follows:
# Model 1 - Logistic Model with Coarse Binning
AUC
## Area under the curve: 0.7804
# Model 2 - Logistic Model with WOE Binning
AUC2
## Area under the curve: 0.7951
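For a compact side-by-side view, the measures computed above can be assembled into a small table:
# Combine the AUC and KS measures for both models
data.frame(model = c("Model 1 - coarse binning", "Model 2 - WOE binning"),
           AUC = c(as.numeric(AUC), as.numeric(AUC2)),
           KS = c(KS, KS2))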
Model 2 with the WOE binning has the higher AUC.
Finally, this very simple model using the glm function is one option for credit risk modeling. In a real-life application, other algorithms and packages would be used as baselines and for comparison when building models for implementation.
Acknowledgements
- DataCamp Introduction to Credit Modeling course in R
- WoE and IV Variable Screening with {Information} in R
- Sharma CreditScoring
- Ross Gayler for review and discussion on credit risk modeling in practice.