Credit Risk Modeling and Scorecard Example

Introduction

Credit risk modeling predicts whether a customer or applicant will default on a loan. These models use predictor variables that can be categorical or numeric. One common output of the modeling process is a credit scorecard, which allocates points to each attribute of an applicant.

The objectives of this post are as follows:

  • Create models using logistic regression
  • Compare manual coarse binning with automated optimal Weight-of-Evidence (WOE) and Information Value (IV) binning
  • Understand the WOE calculations
  • Create a credit scorecard

Here we will use a public dataset, the German Credit Data, which has a binary response variable indicating good or bad risk. We will evaluate and compare the models with typical credit risk measures: the area under the ROC curve (AUC) and the Kolmogorov-Smirnov (KS) statistic.

library(tidyverse)
# Packages for loading data
library(httr)
# Packages for data manipulation
library(purrr)
# Packages for exploratory analysis and visualisation
library(Information)
library(gmodels)
library(DataExplorer)
# Packages for modeling
library(pROC)
library(scorecard)

Load data

Load the Statlog (German Credit Data) data set with the httr and readr R packages. Note that readr is already attached via library(tidyverse) above.

# Set the url and use the GET function from httr to download the csv file to a temporary local file, then read it with read_csv from readr. This csv already includes variable names; the UCI version does not, so this saves us a step.
url <- "http://www.biz.uiowa.edu/faculty/jledolter/DataMining/germancredit.csv"
GET(url, write_disk(tf <- tempfile(fileext = ".data")))
## Response [https://www.biz.uiowa.edu/faculty/jledolter/DataMining/germancredit.csv]
##   Date: 2018-09-03 22:10
##   Status: 200
##   Content-Type: text/csv
##   Size: 81 kB
## <ON DISK>  C:\Users\Home\AppData\Local\Temp\RtmpEREOML\file16080716649d9.data
creditdata <- read_csv(tf)
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Default = col_integer(),
##   duration = col_integer(),
##   amount = col_integer(),
##   installment = col_integer(),
##   residence = col_integer(),
##   age = col_integer(),
##   cards = col_integer(),
##   liable = col_integer()
## )
## See spec(...) for full column specifications.

Summary of data

The dataset metadata describes each of the variables in this data. The dataset appears to represent approved loans only, and would therefore exclude rejected applications.

Let’s first take a look at a summary using the skim function from the skimr package.

# Print the summary, excluding the inline histograms as these do not render in the R Markdown html output yet
skimr::skim_with(integer = list(hist = NULL))
skimr::skim(creditdata)
## Skim summary statistics
##  n obs: 1000 
##  n variables: 21 
## 
## -- Variable type:character ----------------------------------------------------------------------
##         variable missing complete    n min max empty n_unique
##  checkingstatus1       0     1000 1000   3   3     0        4
##           employ       0     1000 1000   3   3     0        5
##          foreign       0     1000 1000   4   4     0        2
##          history       0     1000 1000   3   3     0        5
##          housing       0     1000 1000   4   4     0        3
##              job       0     1000 1000   4   4     0        4
##       otherplans       0     1000 1000   4   4     0        3
##           others       0     1000 1000   4   4     0        3
##         property       0     1000 1000   4   4     0        4
##          purpose       0     1000 1000   3   4     0       10
##          savings       0     1000 1000   3   3     0        5
##           status       0     1000 1000   3   3     0        4
##             tele       0     1000 1000   4   4     0        2
## 
## -- Variable type:integer ------------------------------------------------------------------------
##     variable missing complete    n    mean      sd  p0    p25    p50
##          age       0     1000 1000   35.55   11.38  19   27     33  
##       amount       0     1000 1000 3271.26 2822.74 250 1365.5 2319.5
##        cards       0     1000 1000    1.41    0.58   1    1      1  
##      Default       0     1000 1000    0.3     0.46   0    0      0  
##     duration       0     1000 1000   20.9    12.06   4   12     18  
##  installment       0     1000 1000    2.97    1.12   1    2      3  
##       liable       0     1000 1000    1.16    0.36   1    1      1  
##    residence       0     1000 1000    2.85    1.1    1    2      3  
##      p75  p100
##    42       75
##  3972.25 18424
##     2        4
##     1        1
##    24       72
##     4        4
##     1        2
##     4        4

Clean Data

Next we will clean the dataset using dplyr and purrr R packages.

From the metadata, the following variables are numeric:

  • Attribute 2: duration (numerical) Duration in month
  • Attribute 5: amount (numerical) Credit amount
  • Attribute 8: instalment (numerical) Instalment rate in percentage of disposable income
  • Attribute 11: residence (numerical) Present residence since
  • Attribute 13: age (numerical) Age in years
  • Attribute 18: liable (numerical) Number of people being liable to provide maintenance for
head(creditdata)
## # A tibble: 6 x 21
##   Default checkingstatus1 duration history purpose amount savings employ
##     <int> <chr>              <int> <chr>   <chr>    <int> <chr>   <chr> 
## 1       0 A11                    6 A34     A43       1169 A65     A75   
## 2       1 A12                   48 A32     A43       5951 A61     A73   
## 3       0 A14                   12 A34     A46       2096 A61     A74   
## 4       0 A11                   42 A32     A42       7882 A61     A74   
## 5       1 A11                   24 A33     A40       4870 A61     A73   
## 6       0 A14                   36 A32     A46       9055 A65     A73   
## # ... with 13 more variables: installment <int>, status <chr>,
## #   others <chr>, residence <int>, property <chr>, age <int>,
## #   otherplans <chr>, housing <chr>, cards <int>, job <chr>, liable <int>,
## #   tele <chr>, foreign <chr>

However, from a view of the first rows, installment, residence and liable appear to be categorical variables, as they are classification groupings rather than true measurements. Therefore we will only convert duration, amount and age to numeric variables.

# Change the numerical variables to numeric using the mutate_at function from the dplyr R package
creditdata <- creditdata %>% 
      mutate_at(vars(duration,amount,age),as.numeric) %>% 
      data.frame(stringsAsFactors = FALSE)

# Convert remaining integer variables to character ie categorical variables using map_if function from purrr R package
creditdata <- creditdata %>% 
      map_if(is.integer,as.character) %>% 
      data.frame() # leave as factors with default argument

# Rename the default variable to lower case for naming consistency
names(creditdata) <- tolower(names(creditdata))

# Convert the default variable back to numeric 0/1 for modeling: as.numeric on the factor returns the level codes 1 and 2, so recode these back to 0 and 1
creditdata$default <- as.numeric(creditdata$default)
creditdata$default <- recode(creditdata$default,'1'=0L,'2'=1L)
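
As a quick check that the recode worked as intended, we expect 700 goods (0) and 300 bads (1), consistent with the 0.3 mean of Default in the skim summary above.

# Confirm the 0/1 coding of the response
table(creditdata$default)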

Next, recode the variable levels with the forcats R package so that the variables are easier to understand in the plots. The new level names follow the descriptions in the metadata.

# Recode with fct_recode function
creditdata$checkingstatus1 <- fct_recode(creditdata$checkingstatus1, less0='A11', low0to200DM='A12', highover200DM='A13', noaccount='A14')
creditdata$history <- fct_recode(creditdata$history, nocredit='A30', creditpaid='A31',creditpaidtilnow ='A32', pastdelays='A33',otherbankcredit='A34')
creditdata$purpose <- fct_recode(creditdata$purpose, car_new='A40', car_used='A41',furniture ='A42', radio_tv='A43',appliance='A44', repairs='A45', education='A46',  retraining='A48', business='A49', others='A410') #vacation='A47'does not exist in data
creditdata$others <- fct_recode(creditdata$others, noguarantors='A101', coapplicant='A102',guarantor ='A103')
creditdata$property <- fct_recode(creditdata$property, realestate='A121', othersavings='A122',othercar ='A123',noproperty='A124')
creditdata$otherplans <- fct_recode(creditdata$otherplans, bankplans='A141', storeplans='A142',noplans ='A143')
creditdata$status <- fct_recode(creditdata$status, male_other='A91',female_other='A92', male_single='A93', male_other='A94', female_single='A95')
## Warning: Unknown levels in `f`: A95
creditdata$savings <- fct_recode(creditdata$savings, less100DM='A61',f100to500DM='A62',f500to1000DM='A63',greater1000DM='A64',unknown='A65')
creditdata$employ <- fct_recode(creditdata$employ, unemployed='A71',less1year='A72',f1to4years='A73',f4to7years='A74',over7years='A75')
creditdata$job <- fct_recode(creditdata$job, unemployed='A171',unskilled='A172',skilled='A173',management='A174')
creditdata$foreign <- fct_recode(creditdata$foreign, yes='A201',no='A202')
creditdata$housing <- fct_recode(creditdata$housing, rent='A151',own='A152',free='A153')
creditdata$tele <- fct_recode(creditdata$tele, none='A191',yes='A192')

We note a warning message, Unknown levels in `f`: A95, for the status variable. Here `f` refers to the factor being recoded; the warning simply means there are no observations with level A95 (single female) in the data.
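
We can confirm this with a quick count of the recoded levels; the female_single level is simply never created because A95 has no observations.

# Count observations per recoded status level
forcats::fct_count(creditdata$status)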

Missing Values

As per the metadata, there are no missing values. Let’s confirm with a missing values plot using plot_missing from the DataExplorer package.

The DataExplorer package speeds up the data exploration process by automatically scanning through each variable and profiling the data.

plot_missing(creditdata)

# Create a base dataset copy for the woe models
creditdata_base <- creditdata

With real data, we would also check for missing values coded as something other than NA, such as sentinel values or placeholder strings.
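
As a minimal sketch of such a check, we could count occurrences of common placeholder codes in the categorical columns; the codes below are hypothetical examples rather than values found in this dataset.

# Count hypothetical placeholder codes in the factor columns (sketch only)
creditdata %>%
      select_if(is.factor) %>%
      map_int(~ sum(. %in% c("", "N/A", "?", "-999")))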

Exploratory Data Analysis

Target variable

The response variable is default (per the metadata, 1 = Good and 2 = Bad); in this dataset it is coded as 0 = Good and 1 = Bad.

Let’s take a look at the proportions of default and the checkingstatus1 using the gmodels R package. We will use the checkingstatus1 variable as an example to understand the WOE calculations.

# Use the CrossTable function from gmodels package on default and checkingstatus1 
gmodels::CrossTable(creditdata$default,
                    creditdata$checkingstatus1,prop.r= FALSE, prop.c =FALSE, prop.t=FALSE, chisq = TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |-------------------------|
## 
##  
## Total Observations in Table:  1000 
## 
##  
##                    | creditdata$checkingstatus1 
## creditdata$default |         less0 |   low0to200DM | highover200DM |     noaccount |     Row Total | 
## -------------------|---------------|---------------|---------------|---------------|---------------|
##                  0 |           139 |           164 |            49 |           348 |           700 | 
##                    |        14.535 |         3.136 |         0.544 |        18.901 |               | 
## -------------------|---------------|---------------|---------------|---------------|---------------|
##                  1 |           135 |           105 |            14 |            46 |           300 | 
##                    |        33.915 |         7.317 |         1.270 |        44.102 |               | 
## -------------------|---------------|---------------|---------------|---------------|---------------|
##       Column Total |           274 |           269 |            63 |           394 |          1000 | 
## -------------------|---------------|---------------|---------------|---------------|---------------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  123.7209     d.f. =  3     p =  1.218902e-26 
## 
## 
## 

From the table, 30% of the loans in this dataset had a bad outcome. If this were representative of the population, then 30% of applicants would be unable to repay their loans. Given the Chi-squared test (p ≈ 1.2e-26), we can reject the null hypothesis that default and checkingstatus1 are independent; there appears to be a relationship between the two variables.

The class imbalance in this dataset is worth noting, although real-life credit datasets are typically far more imbalanced, with default rates well below 30%.

Categorical Variables

To visualize distributions for the categorical features, let’s view frequency bar charts using DataExplorer.

# Barplot of categorical features using DataExplorer.
creditdata %>%
      plot_bar()

From these barplots it appears that the majority of the loans are made to single males; there is no data for single females. Most loans are used for purchasing a radio/television, followed by new cars, and the majority of borrowers have low savings or no savings account with this bank. The borrowers are mostly foreign, skilled workers with existing property.

Numeric Variables

Now let’s take a look at the default variable against the numeric variables as density plots.

# Select the numeric variables age, duration and amount, then use the purrr map function to plot a density plot for each variable, filled by default value
creditdata %>%
      select(age,duration,amount) %>% 
      map(~ggplot(creditdata, aes(x = .)) +
                geom_density(aes(fill=as.factor(default)),alpha=0.3) +
                scale_fill_viridis_d()) # New ggplot v3.0.0 includes viridis palette
## $age

## 
## $duration

## 
## $amount

From these density plots there do not appear to be outliers.

The duration, amount and age distributions and peaks appear to differ between default values, so these could be useful predictors: higher duration appears riskier, higher amount appears riskier, and lower age appears riskier.

Feature Engineering

Before we fit models, we will subjectively group the continuous numeric variables into binned categorical variables, based on cut-off points visualised in the EDA above. This is called coarse binning: values with similar risk are grouped into as few bins as possible. Once the new group variables are created, we will remove the original continuous variables.

# Create a new age_group variable, with buckets based on the density distribution above
creditdata <- creditdata %>% 
  mutate(age_group=factor(case_when(
  .$age <  25 ~ "25less",
  .$age >= 25 & .$age <= 29 ~ "25to29",
  .$age >= 30 & .$age <= 39 ~ "30to39",
  .$age >= 40 & .$age <= 49 ~ "40to49",
  .$age >= 50 & .$age <= 59 ~ "50to59",
  .$age >= 60 & .$age <= 70 ~ "60over",
  .$age >= 70 ~ "6")))
# Remove the original age variable to avoid feature interactions
creditdata <- creditdata %>% 
      dplyr::select(-age)
# Create a new amount_group variable, with buckets based on the density distribution above
creditdata <- creditdata %>% 
  mutate(amount_group=factor(case_when(
  .$amount <  1250 ~ "1250less",
  .$amount >= 1250 & .$amount <= 5000 ~ "1250to5000",
  .$amount >= 5000  ~ "5000over")))
# Remove the original amount variable to avoid feature interactions
creditdata <- creditdata %>% 
      dplyr::select(-amount)
# Create a new duration_group variable, with buckets based on the density distribution above. We group the months into years, since we see peaks at the 12-month increments
creditdata <- creditdata %>% 
  mutate(duration_group=factor(case_when(
  .$duration < 12 ~ "1yearunder",
  .$duration >= 12 & .$duration <= 24 ~ "1to2year",
  .$duration >= 24 & .$duration <= 36 ~ "2to3year",
  .$duration >= 36 & .$duration <= 48 ~ "3to4year",
  .$duration >= 48 ~ "4yearover")))
# Remove the original duration variable to avoid feature interactions
creditdata <- creditdata %>% 
      dplyr::select(-duration)

Training

Split the data into training and test sets, following the DataCamp course steps in Chapter 1. We will use a 70:30 split between training and test sets, creating one split for the coarse-binned dataset and another for the base dataset that will be used for WOE binning.

# Set seed of 123 for reproducibility
set.seed(123)
# Store row numbers for training set: index_train using randomly assigned observations
index_train <- sample(1:nrow(creditdata), 0.7 * nrow(creditdata))
# Create training set: training with the index
training <- creditdata[index_train, ]
# Create test set: test with the index
test <- creditdata[-index_train, ]

# Store row numbers for the base dataset training set: index_train_base using randomly assigned observations
index_train_base <- sample(1:nrow(creditdata_base), 0.7 * nrow(creditdata_base))
# Create base training set: training_base with the index
training_base <- creditdata_base[index_train_base, ]
# Create base test set: test_base with the index
test_base <- creditdata_base[-index_train_base, ]

Model 1 - Logistic Model with Coarse Binning

Fit the first logistic model on the training data set where the variables have been coarse binned manually.

# Fit a logistic model using the base glm function
glm_model <- glm(default ~ ., family = "binomial", data= training)

Make predictions on the test set and evaluate the model using the AUC and KS measures with the pROC, ROCR and scorecard R packages.

prediction_prob <- predict(glm_model, newdata = test, type = "response")
ROC <- pROC::roc(test$default,prediction_prob)
# AUC for the logistic model
AUC <- pROC::auc(ROC )
AUC
## Area under the curve: 0.7804
# KS for logistic
pred <- ROCR::prediction(prediction_prob,test$default)
perf <- ROCR::performance(pred,"tpr","fpr")
KS <- max(attr(perf,'y.values')[[1]]-attr(perf,'x.values')[[1]])
KS
## [1] 0.4548609
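
As a sanity check, note that the KS statistic is the maximum gap between the cumulative distributions of the predicted probabilities for the goods and the bads, so the two-sample Kolmogorov-Smirnov test should return approximately the same value. A minimal sketch (exact agreement assumes no tied scores across the two groups):

# KS as the maximum distance between the score distributions of goods and bads
ks.test(prediction_prob[test$default == 0],
        prediction_prob[test$default == 1])$statistic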

Now use the scorecard R package to plot the KS and AUC values.

# performance
test_perf <- scorecard::perf_eva(test$default, prediction_prob, title = "Model 1")

Model 2 - Logistic Model with WOE Binning

Information Theory (WoE and IV)

We can use the Information R package to perform variable screening using weight-of-evidence (WOE) and information value (IV).

WOE and IV play two distinct roles when analyzing data:
- WOE describes the relationship between a predictive variable and a binary target variable.
- IV measures the strength of that relationship.

The WOE binning framework chooses the number of bins and the cut-off points to best separate risk (the scorecard package, for example, offers tree-based and chi-square merge binning). If the variable is numeric, the cut-offs respect the natural ordering of its values; for a categorical variable, the bins may not respect the original ordering of its levels.

Let’s view a summary of the information values.

# Create a WOE table
IV <- create_infotables(data=creditdata_base,
                        y="default")
# Print the summary of the IV table information values. These IVs are derived from the WOE.
IV$Summary %>% 
      knitr::kable()
Variable IV
1 checkingstatus1 0.6660115
3 history 0.2932335
2 duration 0.2778772
6 savings 0.1960096
4 purpose 0.1691951
13 age 0.1212277
12 property 0.1126383
5 amount 0.1117618
7 employ 0.0864336
15 housing 0.0832934
14 otherplans 0.0576145
20 foreign 0.0438774
9 status 0.0333416
10 others 0.0320193
8 installment 0.0263221
16 cards 0.0132665
17 job 0.0087628
19 tele 0.0063776
11 residence 0.0035888
18 liable 0.0000434
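
To connect these IVs back to the WOE calculations, here is a quick manual check for checkingstatus1 using the good and bad counts from the CrossTable above. This is a sketch using the sign convention of the scorecard package, WOE = ln(bad share / good share); some references use the opposite sign, ln(good share / bad share), but the IV is identical either way.

# Good (default = 0) and bad (default = 1) counts per checkingstatus1 level,
# taken from the CrossTable above
goods <- c(less0 = 139, low0to200DM = 164, highover200DM = 49, noaccount = 348)
bads  <- c(less0 = 135, low0to200DM = 105, highover200DM = 14, noaccount = 46)

distr_good <- goods / sum(goods)   # share of all goods falling in each level
distr_bad  <- bads / sum(bads)     # share of all bads falling in each level

# WOE per level
woe <- log(distr_bad / distr_good)
woe

# IV is the sum over levels of (bad share - good share) * WOE
iv <- sum((distr_bad - distr_good) * woe)
iv   # approximately 0.666, matching the checkingstatus1 IV in the summary above

Under this convention, levels where bads are over-represented get a positive WOE (for example less0), while levels dominated by goods get a negative WOE (noaccount).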

Next we will fit another logistic model on the training set. This time we will filter variables and create WOE bins with the scorecard R package.

# Filter out variables via missing rate, information value, identical value rate
creditdata_filt <- var_filter(creditdata_base, y="default")
# Print the number of variables remaining in the filtered dataset
dim(creditdata_filt)[2]
## [1] 15
# Weight of evidence (WOE) binning 
bins <- woebin(creditdata_filt, y="default")
## Warning in e$fun(obj, substitute(ex), parent.frame(), e$data): already
## exporting variable(s): dt, xs, y, breaks_list, min_perc_fine_bin,
## stop_limit, max_num_bin
## Binning on 1000 rows and 15 columns in  0: 0:13
# Let's visualise the bins created for the numeric variables age, duration and amount
woebin_plot(bins$age)
## $age

woebin_plot(bins$duration)
## $duration

woebin_plot(bins$amount)
## $amount

# Let's visualise the bins created for a categorical variable checkingstatus1 
woebin_plot(bins$checkingstatus1)
## $checkingstatus1

The WOE binning has grouped age similarly to the manual coarse bins up until 35, where it adds a group for 35 to 37. The duration WOE bins are grouped by 8 months rather than 12 months up to 36 months. For amount there are 5 WOE bins as opposed to our 3 coarse bins. For checkingstatus1, the WOE binning has merged the less-than-0 and 0-to-200DM levels into one bin.

The bad probability lines for age and amount appear contrary to our density plots, where higher amount appeared riskier and lower age appeared riskier.

# Convert train and test sets original input data to WOE values based on WOE binning 
training_woe <- woebin_ply(training_base, bins)
## Warning in e$fun(obj, substitute(ex), parent.frame(), e$data): already
## exporting variable(s): dt, xs
test_woe <- woebin_ply(test_base, bins)
## Warning in e$fun(obj, substitute(ex), parent.frame(), e$data): already
## exporting variable(s): dt, xs
# Fit a logistic model 
glm_model2 <- glm(default ~ ., family = "binomial", data = training_woe)
prediction_prob2 <- predict(glm_model2, newdata = test_woe, type = "response")
# Calculate the ROC
ROC_2 <- pROC::roc(test_woe$default,prediction_prob2)
# AUC for the second model
AUC2 <- auc(ROC_2)
AUC2
## Area under the curve: 0.7951
# KS for the second model
prediction_2 <- ROCR::prediction(prediction_prob2,test_woe$default)
perf_2 <- ROCR::performance(prediction_2,"tpr","fpr")
KS2 <- max(attr(perf_2,'y.values')[[1]]-attr(perf_2,'x.values')[[1]])
KS2
## [1] 0.4846106
# Performance of the WOE model
test_perf <- perf_eva(test_woe$default, prediction_prob2, title = "Model 2")

Create Scorecard

Now create a scorecard from Model 2 using the scorecard package.

# Calculate score card
card <- scorecard(bins, glm_model2)
# Take a look at the scorecard for duration, which includes scorecard points
card$duration %>% 
      knitr::kable()
variable bin count count_distr good bad badprob woe bin_iv total_iv breaks is_special_values points
duration [-Inf,8) 87 0.087 78 9 0.1034483 -1.3121864 0.1068495 0.2826181 8 FALSE 66
duration [8,16) 344 0.344 264 80 0.2325581 -0.3466246 0.0382938 0.2826181 16 FALSE 18
duration [16,36) 399 0.399 270 129 0.3233083 0.1086883 0.0048133 0.2826181 36 FALSE -6
duration [36,44) 100 0.100 58 42 0.4200000 0.5245245 0.0299728 0.2826181 44 FALSE -27
duration [44, Inf) 70 0.070 30 40 0.5714286 1.1349799 0.1026887 0.2826181 Inf FALSE -57

The scorecard allocates points to each bin based on the bin's WOE and the fitted model coefficients. We observed in the EDA that bad outcomes increased with duration; in the scorecard, higher points are awarded for lower durations, which reflects the same relationship, since higher points indicate lower risk.
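
For reference, the points are essentially a linear rescaling of each bin's contribution to the model's log-odds. A minimal sketch of the usual calculation, assuming the scorecard package defaults (points0 = 600, odds0 = 1/19, pdo = 50) and a hypothetical coefficient of about 0.70 on the duration WOE term (in practice the fitted coefficient from glm_model2 is used):

# Sketch of the standard bin points calculation (hypothetical coefficient)
pdo <- 50                          # points to double the odds (assumed package default)
scale_factor <- pdo / log(2)       # ~72.13 points per unit of log-odds
beta_duration <- 0.70              # hypothetical coefficient on the duration WOE term
woe_bin <- -1.3121864              # WOE of the [-Inf,8) duration bin from the card above
round(-beta_duration * woe_bin * scale_factor)   # roughly the 66 points shown in the card

The intercept and the base points (points0 and odds0) are absorbed into the basepoints element of the card. Once the card is built, the scorecard package's scorecard_ply function can be used to apply it to new applicants and return their total scores.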

Conclusions

This dataset has no missing values, no extreme outliers, and is not heavily imbalanced, so we have not had to overcome these real-world challenges in this exercise. However, the balance and size of the dataset are not representative of real-world data. In practice, other data such as credit bureau and time series data could be used in addition to approved loan data.

When we compare the two models, built with manual and optimised binning respectively, the AUCs are as follows:

# Model 1 - Logistic Model with Coarse Binning
AUC
## Area under the curve: 0.7804
# Model 2 - Logistic Model with WOE Binning
AUC2
## Area under the curve: 0.7951

Model 2, with the WOE binning, has the higher AUC (0.7951 versus 0.7804) and also the higher KS (0.485 versus 0.455). Note, however, that the two models were evaluated on different random train/test splits (a separate sample call was used for each dataset), so small differences in these measures should be interpreted with care.

Finally, this very simple logistic model using the glm function is just one option for credit risk modeling. In a real-life application, other algorithms and packages would be used as baselines and for comparison when building models for implementation.

Acknowledgements