DATA MINING
Delinquent Loans
Author: Stephen Denham
Lecturer: Dr. Myra O'Regan
Prepared for: ST4003 Data Mining
Submitted: 24th February 2012
Contents
1. Background
 1.1 Dataset
2. Models
 2.1 Logistic Regression
 2.2 Principal Component Analysis
 2.3 Trees
 2.4 Random Forests
 2.5 Neural Networks
 2.6 Support Vector Machines
3. Further Analysis
 3.1 Ensemble Model
 3.2 Limitations and Further Work
 3.3 Variable Importance
4. Models Assessment
5. Conclusion
7. Appendix
1. BACKGROUND
This report evaluates predictive models applied to a dataset of over 21,000 rows of
customer information across 12 variables. The aim is to model which loans are most
likely to be deemed ‘delinquent’.
Norman Ralph Augustine said ‘it’s easy to get a loan unless you need it’. By
understanding customer trends, financial institutions can better allocate their
resources. The combination of data-driven predictive models and human intuition can
mean credit is given to those in the best position to repay.
Techniques used in this report include logistic regression, trees, random forests,
neural networks, support vector machines and the creation of an ensemble model. It
is as much a problem of soft intuition as mathematical skill.
1.1 Dataset
The original dataset contained 21,326 rows and 12 columns, each row representing a
customer described by the following 12 variables:
1. nid (integer): Simply a unique identifier that is of no use in our analysis. Such a variable may be necessary for other analysis tools.
2. dlq (0/1): Individuals who have been 90 days or more past their due date are termed 'delinquent', represented here by '1'. This is the target variable of the model.
3. revol (percentage): Total balance on credit cards and personal lines of credit except real estate and no instalment debt like car loans divided by the sum of credit limits. All revol values above 4 (40 cases) were deemed outliers and were removed.
4. age (integer): Age of borrower in years. This variable was not changed.
5. times1 (integer): Number of times borrower has been 30-59 days past due but no worse in the last 2 years. All times1 values above 14 (159 cases) were deemed outliers and were removed.
6. DebtRatio (percentage): Monthly debt payments, alimony and living costs divided by monthly gross income. All DebtRatio values above 14 (3,800 cases) were deemed outliers and were removed. This may seem like a large number; however, the vast majority of these cases were also missing MonthlyIncome or contained outliers in other variables.
7. MonthlyIncome (integer): Monthly income. 18.48% of cases have missing data for MonthlyIncome. All MonthlyIncome values above 60,000 (32 cases) were deemed outliers and were removed.
8. noloansetc (integer): Number of open loans (instalment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards). All noloansetc values above 22 (471 cases) were deemed outliers and were removed.
9. ntimeslate (integer): Number of times borrower has been 90 days or more past due. All ntimeslate values above 40 (159 cases) were deemed outliers and were removed.
10. norees (integer): Number of mortgage and real estate loans including home equity lines of credit. All norees values above 6 (205 cases) were deemed outliers and were removed.
11. ntimes2 (integer): Number of times borrower has been 60-89 days past due but no worse in the last 2 years. All ntimes2 values above 8 (262 cases) were deemed outliers and were removed.
12. depends (integer): Number of dependents in the family excluding the borrower themselves (spouse, children etc.). 2.2% of the original cases were missing this variable and were simply removed; all of these cases were also missing MonthlyIncome. All depends values above 10 (2 cases) were deemed outliers and were removed.
Overall, 22.8% of cases were removed between outlier and missing-data rejection.
Figure 1.1 shows boxplots of all variables after cleansing. They show how
revol, times1, DebtRatio, ntimeslate, and depends are all densely
distributed around 0, while age and noloansetc are more normally
distributed.
Figure 1.1: Cleansed Data Boxplots
Figure 1.2 shows a histogram of revol density for both delinquent and non-delinquent
customers. revol showed the most visible relationship with delinquency.
Unfortunately, from a simplicity point of view, revol is the exception: when
graphing other variables in this way, the relationship is far more subtle – not enough
to make a judgement on a purely univariate basis, and so multivariate analysis is
necessary.
Figure 1.2: revol Histogram
It is often beneficial to create combination variables – blending variables together to
create new ones that express the cases more effectively. This was considered but
deemed unnecessary: there are a small number of variables, some of which are
already combinations – DebtRatio and revol. Also, non-linear kernels would pick
up such relationships.
In some instances, missing data and outliers are just as insightful. To quickly
determine whether benefit could be gained from them in a model, a simplified version
of the dataset was created. The same technique can also be used to divide variables
into brackets, such as age brackets; through this, variables which have a non-linear
influence on the prediction can also be picked up in simple models using indicator
variables.
Variables with questionable values were coded: 0 replaced a 'normal' value, and
1 replaced an outlier or missing value. This new dataset was fed into a basic tree to
assess the value added by these cases. The binary variables did not prove to be
accurate predictors, being inferior to the original variables, and so it was deemed
satisfactory to omit outliers and missing data.
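As an illustration, this screening might look like the following sketch (the flag names are hypothetical, and the thresholds simply reuse those quoted above):

# Sketch: flag missing values and outliers as 0/1 indicators and screen them with a basic tree
library(rpart)
iData <- oData                                            # original data, -999 codes missing values
iData$inc.flag <- as.numeric(iData$MonthlyIncome == -999 | iData$MonthlyIncome > 60000)
iData$dep.flag <- as.numeric(iData$depends == -999 | iData$depends > 10)
iData$rev.flag <- as.numeric(iData$revol > 4)
flag.tree <- rpart(dlq ~ inc.flag + dep.flag + rev.flag, data = iData, method = "class")
printcp(flag.tree)   # compare the indicator splits against trees built on the raw variables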
2. MODELS
After cleansing, a large dataset of just under 16,500 cases remained. This was deemed more
than enough to warrant dividing the data into training and test subsets (2:1). These
subsets had representative distributions of delinquent loans, so stratified
sampling was not required. A third, validation, subset would also have been possible. Six
models were developed on the training set and tested.
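A minimal sketch of the 2:1 split, mirroring the appendix code (the seed is arbitrary):

set.seed(12345)
test_rows <- sample.int(nrow(cData), nrow(cData) / 3)   # one third held out for testing
test  <- cData[test_rows, ]
train <- cData[-test_rows, ]
prop.table(table(train$dlq))   # check that the delinquency proportion is similar
prop.table(table(test$dlq))    # in both subsets, so stratification is unnecessary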
2.1 Logistic Regression
The key element of this problem is the binary nature of the outcome variable. Cases
can only be 0 or 1, and the associated probability should only be between 0 and 1.
Furthermore, many of the variables were predominantly small values, though large
outliers existed. For these reasons, a logistic regression model was suitable for initial
testing. Table 2.1 shows the output produced by the first logistic regression. It
showed that, on this linear scale, the variables DebtRatio and depends could be
removed from the model.
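A call of the following form produces the output in Table 2.1 (a sketch consistent with the appendix code; fitLR.full is an illustrative name):

fitLR.full <- glm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                    ntimeslate + norees + ntimes2 + depends,
                  data = train, family = binomial())
summary(fitLR.full)                                        # coefficients shown in Table 2.1
pred.lr <- predict(fitLR.full, test, type = "response")    # predicted probabilities on the test set
table(actual = test$dlq, predicted = pred.lr > 0.5)        # counts behind Table 2.2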
Table 2.1: Logistic Regression Output

Coefficients:
                Estimate         Std. Error      z value    Pr(>|z|)
(Intercept)    -1.06498817259    0.11027763366   -9.65734   < 0.000000000000000222 ***
revol           1.94583262923    0.06703656789   29.02644   < 0.000000000000000222 ***
age            -0.01751873613    0.00184558679   -9.49223   < 0.000000000000000222 ***
times1          0.54663351405    0.03189642304   17.13777   < 0.000000000000000222 ***
DebtRatio       0.04952123774    0.04646285001    1.06582   0.286503
MonthlyIncome  -0.00003790262    0.00000642855   -5.89598   0.0000000037245671760 ***
noloansetc      0.04735721583    0.00604523194    7.83381   0.0000000000000047329 ***
ntimeslate      0.89797913290    0.05709272686   15.72843   < 0.000000000000000222 ***
norees          0.06882099509    0.02810557602    2.44866   0.014339 *
ntimes2         0.95544920500    0.07330640163   13.03364   < 0.000000000000000222 ***
depends         0.03064240612    0.02053475812    1.49222   0.135641
Further trial-and-error experimentation, plotting and ANOVA showed that the
norees and depends variables could also be removed. Removing these variables
brought the AIC down from 10,916 to 10,909. Table 2.2 shows the final logistic
regression misclassification rates.
Table 2.2: Logistic Regression Model

FALSE POSITIVE RATE      29.8%
FALSE NEGATIVE RATE      16.5%
MISCLASSIFICATION RATE   25.5%
2.2 Principal Component Analysis
A principal component analysis was done to gain further insight into the underlying
trends before further modelling. This method creates distinct linear combinations
of the variables – components – that account for multivariate patterns in the data. It
does not produce a predictive model directly; however, it can show underlying trends.
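A sketch of the analysis (variables scaled first, then princomp on the correlation matrix, as in the appendix):

sData <- scale(cData[, 2:12], center = TRUE, scale = TRUE)   # dlq and the ten predictors
sDataPCA <- princomp(sData, cor = TRUE)
summary(sDataPCA)                                 # proportion of variance per component
plot(sDataPCA, type = "l", main = "Scree Plot")   # Figure 2.1
loadings(sDataPCA)                                # Table 2.3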
Figure 2.1 shows a plot of the proportion of total variance accounted for by each
component. The 'elbow' at component 3 shows clearly decreasing marginal value, and
so only the first three components were analysed; together they explained a
disproportionate 23% of the data variance.
Figure 2.1: Principal Component Analysis Scree Plot
The first component showed high loadings for dlq, revol, times1, ntimeslate and
ntimes2. This suggests that these four variables might be good indicators for
predicting dlq. This component could be termed a 'financial competence' element. The
second component contains high loadings for the two variables that indicate numbers of
lines of credit, along with some loading on dlq. The third component could be interpreted
as a 'middle-aged' variable, as it contains loadings for age, number of dependents,
monthly income and debt ratio. This may be because those starting families typically
have large debt obligations, having bought a house relatively recently.
The fifth and ninth components showed high loadings for dlq, along with elements of all
other variables apart from norees (which had a strong presence in component two).
This suggests that all variables have some amount of correlation with dlq; however,
these relationships are still unclear from the PCA alone. The values relevant to this
discussion are shown in Table 2.3.
Table 2.3: Principal Component Analysis Loadings Output

Loadings (Comp.1 to Comp.11; near-zero loadings are suppressed by R):
dlq            -0.422 -0.214 -0.352 -0.249 -0.218  0.554  0.449 -0.154
revol          -0.457 -0.114 -0.293 -0.223 -0.204 -0.352 -0.304 -0.583  0.197
age             0.279  0.255  0.551  0.229 -0.608 -0.333
times1         -0.301 -0.340  0.243 -0.258  0.520  0.420 -0.427  0.166
DebtRatio      -0.386  0.490 -0.461  0.234 -0.104 -0.265 -0.502
MonthlyIncome   0.224 -0.269 -0.583  0.277 -0.204 -0.279  0.110 -0.121 -0.252 -0.499
noloansetc      0.279 -0.476 -0.166  0.154  0.217  0.497 -0.577
ntimeslate     -0.373  0.266  0.414 -0.551  0.534
norees          0.239 -0.532 -0.147 -0.306  0.145 -0.102  0.289  0.652
ntimes2        -0.333 -0.221  0.327  0.400  0.225  0.572 -0.429
depends        -0.193 -0.582 -0.362  0.464  0.330 -0.385
2.3 Trees
Trees are a relatively simple concept in machine learning. They classify predictions
based on a series of ‘if-statements’ that can easily be represented graphically and
are produced via a recursive-partitioning algorithm.
Possibly the greatest benefit of a default classification tree is its simplicity and thus
interpretability. It is clearly explained diagrammatically, and so a bank manager could
easily apply it to a loan applicant without even the use of a calculator.
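The tree in Figure 2.2 corresponds to a default rpart call of roughly this form (a sketch following the appendix code):

library(rpart)
fitTree.sim <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                       ntimeslate + norees + ntimes2 + depends,
                     data = train, method = "class",
                     parms = list(split = "gini"))          # default cp = 0.01
plot(fitTree.sim, compress = TRUE, uniform = TRUE, branch = 0.5)
text(fitTree.sim, use.n = TRUE, all = TRUE, cex = 0.7)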
Figure 2.2 shows a basic tree constructed. It shows ntimeslate, times1 and
ntimes2 to be the most important determinants of prediction. By this tree, if a
customer has a ntimeslate value of 1 or greater, they have an 88% chance of
being delinquent.
Figure 2.2: Simple Tree with Default Complexity Parameter (cp=0.01)
Unfortunately this tree is an oversimplified model. There exists a trade-off between
model accuracy and complexity that can be seen in Figure 2.3.
Figure 2.3: Complexity Parameter vs. Relative Error
[Figure 2.2: splits on revol < 0.495 (10,966 obs), times1 < 0.5 (5,736 obs) and ntimeslate < 0.5 (4,638 obs); total classified correct = 74.1%]
The rpart package chooses a default complexity parameter of 0.01; however,
mathematically, the optimum complexity parameter is 0.0017000378, as it has the
lowest corresponding cross-validated error. This tree contains 16 splits and is shown
in Figure 2.4.
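A sketch of how the more complex tree was grown and its cross-validated errors inspected (the cp value is taken from the appendix code; choosing the minimum-xerror row is one reasonable rule):

fitTree <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                   ntimeslate + norees + ntimes2 + depends,
                 data = train, method = "class", parms = list(split = "gini"),
                 control = rpart.control(cp = 0.0010406812))   # grow well past the default
plotcp(fitTree)     # Figure 2.3: cross-validated relative error against cp
printcp(fitTree)    # cp table: number of splits, relative error and xerror
best.cp <- fitTree$cptable[which.min(fitTree$cptable[, "xerror"]), "CP"]
fitTree.opt <- prune(fitTree, cp = best.cp)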
Figure 2.4: Optimized Complexity Parameter Tree (cp=0.0010406812)
The second problem with more complex models, after the loss of user accessibility, is
the risk of over-fitting. As Figure 2.3 shows, there is a diminishing marginal decrease
in the cross-validated error as the tree becomes more complex, and the error sometimes
rises again. With that in mind, another way to decide on the number of splits (complexity)
is to judge the complexity parameter curve. Figure 2.3 shows two 'elbows' where the
marginal improvement from adding splits diminishes. There is one at cp=0.1; however,
this is even more basic than our original tree. The next is roughly at cp=0.0054, and the
associated tree, shown in Figure 2.5, contains 8 splits.
Figure 2.5: Tree of Complexity Parameter at CP Plot Elbow
[Figure 2.4: the 16-split tree; its splits involve revol, times1, ntimeslate, ntimes2, DebtRatio, MonthlyIncome, noloansetc, norees and age]
[Figure 2.5: the 8-split tree; its splits involve revol, times1, ntimeslate and ntimes2]
Due to the categorical output variable, classification trees were used throughout. The
Gini measure of node purity, also known as the splitting index, was used for the trees
selected. Entropy was also tried; however, Gini has become the convention, and the
other methods appear equally competent. The party package was experimented with
as well: although its graphical output is slightly better, it lacked flexibility in adjusting
complexity and so was deemed unsuitable in this case.
Trees facilitate loss matrices to weight the relative cost of false negatives and false
positives in model evaluation (a sketch of such a call is shown below). This was
experimented with, but without sufficient background knowledge and client
communication it was deemed too arbitrary and was eventually removed. This issue is
discussed further in section 4.
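For reference, a loss matrix is passed to rpart through the parms argument; the weights below are purely illustrative placeholders, not values used in this report:

false.pos.weight <- 2   # hypothetical weights; meaningful values would need client input
false.neg.weight <- 1
fitTree.loss <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                        ntimeslate + norees + ntimes2 + depends,
                      data = train, method = "class",
                      parms = list(split = "gini",
                                   # rows = actual class (0, 1), columns = predicted class
                                   loss = matrix(c(0, false.pos.weight,
                                                   false.neg.weight, 0),
                                                 byrow = TRUE, nrow = 2)))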
Table 2.4 shows the occurrence of each variable as a splitter or surrogate splitter in the
model. The table was produced by counting occurrences of each variable name in the
output generated from the function important.rpart, created by Dr. Noel O'Boyle of
DCU (O'Boyle, 2011). The list coincided with the variable importance plot later generated
by the random forest, and to a lesser degree with the PCA output. Simple trees are
inferior at predicting linear relationships, which are evident in this dataset from the
logistic regression.
Table 2.4: Variable Importance via rpart.importance

Rank  Variable        Occurrences as Split or Surrogate
1     revol           63
2     times1          38
3     ntimeslate      32
4     ntimes2         31
5     DebtRatio       26
6     age             23
7     MonthlyIncome   22
8     noloansetc      22
9     norees          12
10    depends          5
Table 2.5: Classification Tree Misclassification Table

FALSE POSITIVE RATE      22.4%
FALSE NEGATIVE RATE      24.5%
MISCLASSIFICATION RATE   23.4%
2.4 Random Forests
The random forest method involves creating a large number of CART trees from a
dataset to form a democratic ensemble. The method has two 'random' elements.
First, the algorithm selects cases at random, with replacement (bootstrapping).
Secondly, at each split only a small random selection of variables (roughly the square
root of the total number of variables) is considered. Each tree is built without roughly
a third of the data, which is then used as the 'out of bag' (OOB) sample to evaluate that
tree. In this sense, the method creates internal training and test sets.
1,000 trees were created. This was enough to gather a large sample of possible variable
combinations without being too computationally expensive. At each split, 3 candidate
variables were considered, roughly the square root of the total number of variables.
As the target variable is binary, classification trees were best suited. As with the SVM,
this meant the training data had to be converted to type 'data.frame' and the target
variable dlq had to be converted to a factor.
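The forest was grown with a call of this form (a sketch following the appendix; the seed is arbitrary):

library(randomForest)
set.seed(12345)
fitrf <- randomForest(as.factor(dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                        noloansetc + ntimeslate + norees + ntimes2 + depends,
                      data = train.Frame,                  # training data as a data.frame
                      ntree = 1000, importance = TRUE)     # mtry defaults to ~sqrt(10) = 3
fitrf                                                      # out-of-bag confusion matrix and error rate
pred.rf <- predict(fitrf, test, type = "prob")[, 2]        # predicted probability of delinquency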
Variable Insight
Random forests give every variable a chance to 'vote' because of the large number
of trees created and the relatively low number of variables randomly considered at
each split. This allows them to provide deep insight into every variable. Two
expressions of this are a count of the number of times each variable is used in a split,
and partial dependency plots.
The bar chart in Figure 2.6 displays the number of times each variable was selected
for a split, as opposed to the two other candidate variables considered at that split
(mtry = 3). The chart shows times1, ntimeslate and ntimes2 to be used rarely,
while revol, age and MonthlyIncome were used often. Unfortunately, this
graph does not represent the relative value of each variable: it may be biased towards
continuous variables, as opposed to discrete ones, which is why times1,
ntimeslate and ntimes2 appear low.
Figure 2.6: Bar Chart of Times Each Variable Was Used in a Split
‘Partial dependence plot gives a graphical depiction of the marginal effect of a
variable on the class probability (classification) or response (regression).’
– randomForest Reference Manual.
The variables are listed here in descending importance as measured by the average
decrease in accuracy and gini.
Figure 2.7: Variable Importance Plot via varImpPlot
Partial dependency plots show the marginal effect a variable has on the predicted
probability as its values change. Figure 2.8 shows the partial dependency plot of
times1 (black), DebtRatio (dark blue), ntimes2 (pink), norees (red),
ntimeslate (green) and noloansetc (light blue). DebtRatio, times1,
ntimeslate and noloansetc all show early, rapid falls in credibility in going from 0
to 2. This graph clearly displays how being late on even one payment can have a dramatic
effect on a person's credit score.
[Figure 2.6 plots, for each of revol, age, times1, DebtRatio, MonthlyIncome, noloansetc, ntimeslate, norees, ntimes2 and depends, the number of times the variable was used in a split.]
[Figure 2.7 ranks the variables in descending importance: by mean decrease in accuracy, ntimeslate, times1, revol, ntimes2, DebtRatio, age, MonthlyIncome, norees, noloansetc, depends; by mean decrease in Gini, revol, DebtRatio, MonthlyIncome, age, times1, ntimeslate, noloansetc, ntimes2, depends, norees.]
The noloansetc and norees variables, which
both measure numbers of financial lines, rise from 0. This matches the assumption
that at least one line of credit is required to be in this dataset, while many lines of
credit may suggest financial instability.
Although having so many variables on a single graph can be inaccessible, it is the
best way to maintain relative perspective, by using the same scale. While analysing
these graphs, one must recall the density distributions of the variables (see Figure
1.1). Variables times1 and DebtRatio show unexpected recoveries; however, these
are caused by a low number of values that may have warranted removal as
outliers.
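Plots such as Figures 2.8 to 2.11 are produced with partialPlot; a sketch of the overlaid plot (colour names approximate those in the caption below):

partialPlot(fitrf, train.Frame, times1, main = "Partial Dependency Plot")
partialPlot(fitrf, train.Frame, DebtRatio,  add = TRUE, col = "darkblue")
partialPlot(fitrf, train.Frame, ntimes2,    add = TRUE, col = "pink")
partialPlot(fitrf, train.Frame, norees,     add = TRUE, col = "red")
partialPlot(fitrf, train.Frame, ntimeslate, add = TRUE, col = "green")
partialPlot(fitrf, train.Frame, noloansetc, add = TRUE, col = "lightblue")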
Figure 2.8: Partial Dependency Plot for times1 (black), DebtRatio (dark blue), ntimes2 (pink), norees (red), ntimeslate (green) and noloansetc (light blue)
The initial drop in the partial dependency plot for monthly income may be explained
by those with zero monthly income being students or retirees, while low-income
earners show the lowest probability of repayment. The large spike at 10,000 and the
fluctuations thereafter are more difficult to explain; however, these earners are a
minority. Figure 1.1 shows them to be in the tail of the boxplot, and so they may
be of less importance.
Figure 2.9: Partial Dependency Plot – MonthlyIncome
Figure 2.10 shows the effect age has on the predicted probability of delinquency.
It shows steep increases in financial competence through the late twenties as people
mature, and again leading towards retirement, followed by a decline after
retirement.
Figure 2.10: Partial Dependency Plot – Age
The revol plot decreases rapidly from 0 to 1, which is to be expected, as a low
value of revol suggests high financial control. This is followed by a slow increase
for which there is no obvious reason. Again, these are a 'tailed' minority and could
be deemed outliers. Values of revol over 1 would be due to interest. Another
possibility is that 'total balance' includes the sum of future interest payments due, in
which case those deemed satisfactory to obtain a longer-term loan would have been
deemed so due to some other measure of financial stability.
Figure 2.11: Partial Dependency Plot – revol
Figure 2.12 shows how the error rate on the out-of-bag sample falls as the
number of voting trees increases. It shows that 1,000 trees were more than
sufficient to reach a satisfactory error rate.
Figure 2.12: Error Rate
A forest of 250 trees was also created, which showed approximately the same results.
There was no major difference; however, the 1,000-tree forest was used for the
majority of the analysis.
Random forests can suffer from over-fitting; however, they handle large outliers well,
as the democratic voting reduces sensitivity to any single variable.
Random forests cannot incorporate variable costs; this was not considered here,
although a client might express a preference for certain variables. Possibly the greatest
benefit of the random forest analysis is the in-depth understanding of the variables it
provides.
Table 2.6: Random Forest Misclassification Table

FALSE POSITIVE RATE      22.6%
FALSE NEGATIVE RATE      23.2%
MISCLASSIFICATION RATE   23.7%
2.5 Neural Networks
Neural networks are a black-box modelling technique. Put simply, the model creates
a formula, similar to that of a linear regression, containing hidden layers. 'Neurons',
which represent the input variables, the output variables and optional units in the hidden
layer, are connected by weights that are algorithmically adjusted to minimise the
difference between the desired results and the actual results.
Both the nnet and neuralnet packages were tested in the analysis, but nnet was
chosen as the primary tool due to its speed and level of functionality. The model
was tuned to optimise the AIC, the error, positive Hessian-matrix eigenvalues and the
misclassification rates.
The data was scaled and a trial-and-error method was used: several
combinations of model inputs were created and tested with different values for size
(the number of units in the hidden layer) and different input vectors. To tailor the model
to the 0/1 classification target variable, softmax = TRUE was set as the activation
function for the nnet package and act.fct="logistic" was set for the
neuralnet package.
A decay value can be added to the model to improve its stability, mitigating the risk of
weights tending towards infinity. Figure 2.13 plots the error rate for decay values
between 0 and 1. It showed that adding a small decay value to the model vastly
reduced the error. Given the high possibility of outliers, the neural network
model could easily be over-fitted, so adding decay generalised it.
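A sketch of the kind of nnet call used (scaled inputs, one hidden unit, softmax outputs and weight decay), following the appendix code:

library(nnet)
set.seed(999)
fitnn <- nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                scale(ntimeslate) + scale(norees),
              data = train, size = 1, softmax = TRUE, decay = 0.1, Hess = TRUE)
summary(fitnn)             # the 8-1-2 network and its 13 weights, as reported below
eigen(fitnn$Hess)$values   # all-positive eigenvalues indicate a stable minimum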
Figure 2.13: Decay ~ Error Plot
One hidden layer was added and the depends variable was removed from the model,
reducing the AIC to -12291.02. This produced the following output for the optimum model:

a 8-1-2 network with 13 weights
options were - softmax modelling  decay=0.1
 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1
  1.34   0.47  -0.13   0.56   0.02  -0.09   0.11   1.20   0.04
 b->o1 h1->o1
  3.25  -4.53
 b->o2 h1->o2
 -3.25   4.53
Figure 2.14 shows the model as graphically produced by the neuralnet package.
On the left are the input variables – the values that are multiplied by the weights
marked along the arrows to the hidden-layer neuron. The hidden neuron's output,
combined with a bias weight at the output neuron, is then passed through an activation
function to give the probability associated with the given case.
Figure 2.14: Neural Network
In some ways the neuralnet package is more advanced than the nnet package; however,
it is poorly built in a number of ways. The naming conventions are often unclear and
conflicting. Its prediction function, which is crucial for developing receiver
operating characteristic (ROC) curves, shares its name with a function in the
ROCR package. The result is several variables and functions, all with similar
names and very similar tasks.
Table 2.7 shows the performance of the neural network model on the test set.
Table 2.7: Neural Network Misclassification

FALSE POSITIVE RATE      23.2%
FALSE NEGATIVE RATE      26.1%
MISCLASSIFICATION RATE   24.6%
[Figure 2.14: neuralnet network diagram with the eight scaled inputs (revol, age, times1, DebtRatio, MonthlyIncome, noloansetc, ntimeslate, norees) feeding a single hidden neuron; error 891.32 after 13,284 steps]
2.6 Support Vector Machines
In the SVM process, points are transformed/mapped to a 'hyper-space' so that they can
be split by a hyper-plane. The transformation is not explicitly calculated, which would
be computationally expensive; rather, a kernel is calculated, which is a function of the
inner product of the original multidimensional space. Consider a piece of string: in
one dimension, the distance between the two ends of the string is the string's length,
and they can never touch. However, in real, three-dimensional space, the string
can be bent around so that both ends easily touch. This may be an abstract example,
but it demonstrates the idea that more dimensions allow more
manipulation.
Optimising the support vector machine involved tuning the values of gamma and
cost on the training set. Gamma is a term in the radial basis transformation and cost
is the constant in the Lagrange formulation. They were tested between 0.5 and 1.5,
and 100 and 1,000, respectively. The tune function concluded that a gamma of 1.3 and a
cost of 130 were optimum, giving a best performance of 0.2118273. Figure 2.15
shows an early plot of the tuning function.
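The tuning was done with tune.svm from the e1071 package on a sub-sample of the training data (svmtrain in the appendix); a sketch, with the grid taken from the ranges quoted above:

library(e1071)
set.seed(12345)
obj <- tune.svm(as.factor(dlq) ~ . - nid, data = svmtrain,
                gamma = seq(0.5, 1.5, by = 0.1),
                cost  = seq(100, 1000, by = 100))
summary(obj)   # reported best in this analysis: gamma = 1.3, cost = 130, performance 0.2118
plot(obj)      # Figure 2.15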
Figure 2.15: SVM Tuning Plot
SVM Classification Plots
SVM classification plots can be useful in determining the relationship between two
variables and the predictive model. In this case, they are better for mapping more
scattered variables such as MonthlyIncome rather than ntimeslate. Figure 2.16
displays the relationship between age, MonthlyIncome and the delinquency
classification under the radial C-classification method. The colours represent the
classification determined by the radial kernel in the transformed space.
Figure 2.16: Radial SVM Plotted on MonthlyIncome ~ age
Figure 2.17 shows the same two variables being classified by a sigmoid kernel.
Clearly there is a large difference in classification using different kernels.
Figure 2.17: Sigmoid SVM Plotted on MonthlyIncome ~ age
Table 2.8 shows the performance of the support vector machine model on the test set.

Table 2.8: Support Vector Machine Misclassification

FALSE POSITIVE RATE      24.7%
FALSE NEGATIVE RATE      23.2%
MISCLASSIFICATION RATE   24.1%
3. FURTHER ANALYSIS
3.1 Ensemble Model
In an attempt to gain the best elements of all models, an ensemble was created. This
simply averages, for each individual case, the probabilities produced by the tree,
random forest, neural network and support vector machine: all models are run and the
probabilities they produce for the case are averaged. This creates a less diverse
array of probabilities; however, it is just as decisive. Table 3.1 shows its performance
on the test set. Interestingly, this model had the lowest false positive rate for the test
set.
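A sketch of the averaging, assuming each model's predicted probability of delinquency for the test cases is held in a vector (pred.tree, pred.rf, pred.nn and pred.svm are illustrative names):

ensemble.prob  <- (pred.tree + pred.rf + pred.nn + pred.svm) / 4
ensemble.class <- as.numeric(ensemble.prob > 0.5)
table(actual = test$dlq, predicted = ensemble.class)   # counts behind Table 3.1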
Table 3.1: Ensemble Model Misclassification

FALSE POSITIVE RATE      22.4%
FALSE NEGATIVE RATE      25.3%
MISCLASSIFICATION RATE   23.7%
3.2 Limitations and Further Work
Without the opportunity to talk with the client and properly understand the
data, the variable descriptions are limited. The extent to which outliers could be removed,
and the ease of collecting each variable, had to be estimated. The Rattle package
offers the incorporation of risk into the model; however, this could not be calculated
without a measure of risk associated with each case, such as the nominal value of
each loan.
The creation of this data must also be considered. These are loans which have
already been approved, and so these customers would already have gone through some
screening system to get to that stage. Any model taken from this empirical study
must be used in conjunction with the current structures that have already screened out
potential loan applicants.
Data is temporal, but this dataset gives no information on the time period from which
it was taken; consider the value of such a model in the midst of the financial crisis.
Models can expire and must be continually tested and updated. Unfortunately, it is
difficult to test a model once it has been implemented.
3.3 Variable Importance
Table 3.2 contains a simplified rating – good, poor, average (ave.) or inconclusive (--) –
for each variable under different criteria. Support vector machines were not included,
as SVM is a black-box technique.
Table 3.2: Variable Ratings

Variable        Missing Data  Outliers  Log.   PCA    Tree   R. Forest  N. Net
revol           good          good      good   good   good   good       good
age             good          good      good   ave.   ave.   good       good
times1          good          ave.      good   good   good   --         good
DebtRatio       good          poor      poor   ave.   ave.   good       good
MonthlyIncome   poor          good      good   ave.   ave.   good       good
noloansetc      good          ave.      good   ave.   ave.   ave.       good
ntimeslate      good          good      good   good   good   poor       good
norees          good          ave.      ave.   ave.   poor   ave.       good
ntimes2         good          ave.      good   good   good   --         --
depends         ave.          good      good   poor   poor   ave.       poor
4. MODELS ASSESSMENT
Models tend to produce different values for false positives and false negatives, which
makes the question of the 'best model' ambiguous, particularly without client contact.
Prospect Theory (Kahneman and Tversky, 1979) indicates that people's negative
response to a loss is more pronounced than their positive response to an equal gain;
consider one's feelings on losing a €50 note compared to finding one. Taking this
aspect of human behaviour into account, models with a lower false positive rate (a loan
would be given and money lost as it is not paid back as agreed) may be weighted
more favourably than those with a lower false negative rate (a loan would not be given to
a customer who would have paid it back, and so potential profit is foregone). To finalise
the 'best' model in this instance would require an analysis of the material profit/loss
attached to the loans in question and a cost quantity to represent the client's attitude to
risk. This would guide a weighting of user preference for minimising false negatives or
false positives. Such a weighting could be applied through the evaluateRisk function or
the loss matrices that some of these models facilitate. Due to the lack of information,
they could not be used here.
There is also no information on the ease with which managers can obtain the input
information, or how reliable these inputs tend to be. For example, revol may not be
practical to obtain for every customer. In the case of trees and random forests,
surrogates may be used instead.
Figure 4.1 shows the receiver operating characteristic (ROC) curves for the four models
plotted together, with true positive rate against false positive rate. A good model
arcs high and to the left. In this respect, the random forest, neural network and
support vector machine models are essentially the same, while the simple tree model is
clearly suboptimal. This trend can also be seen in a lift chart.
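The curves in Figure 4.1 can be produced with the ROCR package; a sketch for one model, using the illustrative probability vectors introduced above:

library(ROCR)
pred.obj <- prediction(pred.rf, test$dlq)          # random forest probabilities vs. actual dlq
perf.roc <- performance(pred.obj, "tpr", "fpr")    # true positive rate against false positive rate
plot(perf.roc, col = "red", main = "ROC Curves")   # one curve of Figure 4.1; repeat per model
# performance(pred.obj, "lift", "rpp") gives the corresponding lift chart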
Figure 4.1: ROC Curves
Figure 4.2 shows scatter plots of the four models' predictions for the test set, plotted
against each other. The graphs also show the actual value for each case by colour; the
visible correlations represent the similarity of each pair of models.
Cases plotted in the top left or bottom right are cases on which both models agreed;
if correctly classified, they will be blue for dlq=1 and green for dlq=0.
The simple tree can be clearly identified by the discrete nature of its predictions at each
node. The random forest and neural network appear to be the most correlated pair of
models, with the least variance.
These graphs alone should not be used to determine misclassification rates. They
give an idea; however, the density of cases at the corners of the graphs is crowded,
and so the number of cases there is difficult to read. The true misclassification rates
can be seen in the bar chart in Figure 4.3.
Figure 4.2: Model Comparison Scatter Plots
As explained earlier, the client's attitude towards false negatives and false positives
may be disproportionate due to the prospect effect. It is logical to assume that, to
some degree, a low false positive rate is more desirable than a low false negative rate.
Figure 4.3 displays a bar chart of misclassification rates on the test set for the logistic
regression, tree, random forest, neural network, support vector machine and ensemble
models. The ensemble model, arguably the most complex, performed the best for false
positives; however, it also performed poorly on false negatives. Interestingly, the simplest
model, logistic regression, performed inversely: well on false negatives and poorly on
false positives. The neural network consistently performed worse than its competitors.
The support vector machine was mediocre, as was the simple tree (which also performed
poorly in the ROC analysis).
Figure 4.3: Misclassification Rates on the Test Set
The random forest had the most consistently low misclassification rates. It
is relatively simple to implement, it is particularly well suited to classification, and, as
an ensemble method, it can often negate biases and handle missing data and
outliers well. It also provided a level of insight into variable importance, and it shows
relatively strong correlations with the other models, as seen in Figure 4.2. Although
random forests can be prone to over-fitting and cannot incorporate costs, purely based
on the information provided it would be deemed the most suitable model across
all measures of model quality.
5. CONCLUSION
This report detailed an in-depth analysis of a dataset towards the prediction of loans
becoming 'delinquent'. Principal component analysis, logistic regression,
classification trees, random forests, neural networks and support vector machines
were all implemented. An ensemble model of the latter four was also tested.
The best predictors, in descending order, are the percentage of credit limits utilised –
revol – and the three variables representing the number of times borrowers were late
repaying their loans: times1, ntimeslate and ntimes2. The variables
measuring the number of property loans – norees – and the number of dependents –
depends – appear to be the least valuable predictors.
Although the limitations of this study are clearly defined, based on basic intuition and the
information provided it was deemed that the random forest model was most suitable,
as it consistently performed well against all measures, most likely due to its ability to
represent non-linear relationships while not over-fitting. It is also best placed to deal
with outliers and missing data due to its democratic ensemble nature.
7. APPENDIX
7.1 References
KAHNEMAN, D. & TVERSKY, A. 1979. Prospect Theory: An Analysis of Decision Under Risk. Econometrica, 47, 263-291.
O'BOYLE, N. 2011. Supervised Classification: Variable Importance in CART (Classification and Regression Trees), DCU Redbrick, viewed 19 February 2012, <http://www.redbrick.dcu.ie/~noel/R_classification.html>.
7.2 R Code
# Adding Packages
library("rpart")
library(neuralnet)
library("ROCR")   # prediction object is masked from neuralnet
library(tools)
library("randomForest")
library("caTools")
library("colorspace")
library("Matrix")
library("nnet")
library("gtools")
library("e1071")
library("rattle")   # also loads sandwich, strucchange, vcd
library(car)
library(cluster)
library(maptree)
library("RColorBrewer")
library(modeltools)
library(coin)
library(party)
library(zoo)
search()
? ctree

oData = read.csv("/Users/stephendenham/Dropbox/College/Data Mining/directory/creditData.csv")
cData = oData
nrow(cData)                    # 21326 original cases
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 3943 missing
1-(nrow(cData)/nrow(oData))    # 18.49% missing for MonthlyIncome

cData = oData
nrow(cData)                    # 21326
cData[cData$depends==-999,"depends"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 480 missing
1-(nrow(cData)/nrow(oData))    # 2.25% missing for depends

cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 3943 missing
1-(nrow(cData)/nrow(oData))    # 18.5% missing for depends and MonthlyIncome together
# All missing values of depends are also missing for MonthlyIncome
# 3943 NA's/-999s
na.omited.oData = cData

# Outlier Removal Measuring
cData = oData
nrow(cData)                    # 21326 original
# Insert outlier removal code....................................
cData = subset(cData, cData$depends < 10)   # <5 removes 152; 10 is fine, good distribution
# cData = na.omit(cData)
nrow(oData)-nrow(cData)        # number removed
1-(nrow(cData)/nrow(oData))    # % removed

# Actual Outlier Removal
cData = oData
cData[cData$depends==-999,"depends"] <- NA
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
# Outliers picked from boxplots cData = subset(cData, cData$revol < 4) # 5000>>6. 50>>28. 9>>33. 4>>40 cData = subset(cData, cData$times1 < 80) # cData = subset(cData, cData$times1 < 14) # 4 removes 390. 10 removes 359. 14 removes 248 cData = subset(cData, cData$DebtRatio < 14) # 4 removes 390. 10 removes 359. 14 removes 248 cData = subset(cData, cData$MonthlyIncome < 60000) # 100000 removes 6. 50000 removes 37. 14 removes 248 cData = subset(cData, cData$noloansetc < 22) # <22 removes 400. No need to remove any I think, good distribution cData = subset(cData, cData$ntimeslate < 40) # removes 33 cData = subset(cData, cData$norees < 6) # <6 removes 137. No need to remove any I think, good distribution cData = subset(cData, cData$ntimes2 < 8) # 8 removes 1. 5 removes 28. No need to remove any I think, good distribution cData = subset(cData, cData$depends < 10) x = nrow(cData) y = nrow(oData) y-x 1-(x/y) # Percent Removeded # Creating smaller sets for testing #par(mfrow=c(2,5)) #boxplot(cData$nid, main="nid") #boxplot(cData$dlq, main="nid") boxplot(cData$revol, main="revol", col=10) boxplot(cData$age, main="age", col=7) boxplot(cData$times1, main="times1", col=3) boxplot(cData$DebtRatio, main="DebtRatio", col=4) boxplot(cData$MonthlyIncome, main="MonthlyIncome", col=5) boxplot(cData$noloansetc, main="noloansetc", col=6) boxplot(cData$ntimeslate, main="ntimeslate", col="light green") boxplot(cData$norees, main="norees", col="light blue") boxplot(cData$ntimes2, main="ntimes2", col=1) boxplot(cData$depends, main="depends", col="purple") names(cData) #dev.off() c1Data = subset(cData, cData$dlq == 1) #
c0Data = subset(cData, cData$dlq == 0) # set.seed(12345) test_rows = sample.int(nrow(cData), nrow(cData)/3) test = cData[test_rows,] train = cData[-test_rows,] set.seed(12345) otest_rows = sample.int(nrow(na.omited.oData), nrow(na.omited.oData)/3) otest = na.omited.oData[otest_rows,] otrain = na.omited.oData[-otest_rows,] # Creating Data Frames train.Frame = data.frame(train) test.Frame = data.frame(test) # Cleaning Done ################ # Linear model # ################ ? glm fitLR.2 = glm(dlq ~ revol + age + times1 + ntimeslate + norees + ntimes2 + depends, data=train ,family=binomial() ) fitLR.2 summary(fitLR.2) confint(fitLR.2) exp(coef(fitLR.2)) exp(confint(fitLR.2)) predict(fitLR.2, type="response") residuals(fitLR.2, type="deviance") plot(fitLR.2) # predict.glm predict(fitLR.2, test, type = "response") # Optimal Logistic Model fitLR = glm(dlq ~ revol + age + times1 + MonthlyIncome + noloansetc + ntimeslate # + norees + ntimes2, data=train ,family=binomial() ) summary(fitLR) plot(predict(fitLR, test, type = "response")~predict(fitLR.2, test, type = "response")) anova(fitLR,fitLR.2, test="Chisq") # Anova of regression with less data ? anova
####### # PCA # ####### boxplot(cData[2:12,]) # Scale Data names(cData) sData = scale(na.omit(cData[2:12]), center = TRUE, scale = TRUE) sData sData$age age boxplot(sData[2:12, -7]) boxplot(sData[2:12,]) names(cData) boxplot(cData$ntimeslate) sDataPCA = princomp(sData[, c(1:11)], cor=T) # Can't use NAs print(sDataPCA) summary(sDataPCA) round(sDataPCA$sdev,2) plot(sDataPCA, type='l', main="Scree Plot") #Simple PC variance plot. Elbows at PCs 2 & 9 loadings(sDataPCA) # biplot(sDataPCA, main="Biplot") # Difficult to run abline(h=0); abline(v=0) # pairs(cData[2:12], main = "Pairs", pch = 21, bg = c("red", "blue")[unclass(cData$dlq)]) #abline(lsfit(Sepal.Width,Sepal.Width)) #abline(lsfit((setosa$Petal.Length,setosa$Petal.Width), col="red", lwd = 2, lty = 2)) ######### # TREES # ######### # Simpler Tree - CP is fitTree.sim = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train, parms=list(split="gini") ,method = "class" ) draw.tree(fitTree.sim, cex=.8, pch=2,size=2.5, nodeinfo = FALSE, cases = "obs") plot(fitTree.sim,compress=TRUE,uniform=TRUE, branch=0.5) text(fitTree.sim,use.n=T,all=T,cex=.7,pretty=0,xpd=TRUE) draw.tree(fitTree.sim, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs")
plotcp(fitTree) # Main Tree fitTree = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train[,2:12] #,parms=list(split="gini") ,control=rpart.control(cp=0.0010406812 ,control=(maxsurrogate=100) ) ,method = "class" ,parms=list(split="gini" #,loss=matrix(c(0,false.pos.weight,false.neg.weight,0), byrow=TRUE, nrow=2) ) # Loss Matrix ) fitTree fitTree$cp # xError of: 0.4826868 - CP: 0.0010406812 ? plotcp printcp(fitTree) fitTree$parm fitTree$parm$loss # Plotcac draw.tree(fitTree, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs") ? draw.tree ? abline ? plot.rpart... type, extra, plotcp plot(fitTree,compress=TRUE,uniform=TRUE, branch=0.5) text(fitTree,use.n=T,all=T,cex=.7,pretty=0,xpd=TRUE) draw.tree(fitTree, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs") ? draw.tree fitTree.sim$cp fitTree.sim$splits[,1] plot(fitTree.sim$splits[,1]) fitTree.sim$splits false.pos.weight = 10 false.neg.weight = 10 #Min Error CP fitTree.elbow = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
ntimeslate + norees + ntimes2 + depends, data=train ,control=rpart.control(cp=0.0068) ,method = "class" ,parms=list(split="gini" #,loss=matrix(c(0,false.pos.weight,false.neg.weight,0), byrow=TRUE, nrow=2) ) # Loss Matrix ) fitTree.elbow plot(fitTree.elbow,compress=TRUE,uniform=TRUE, branch=0.5) text(fitTree.elbow,use.n=T,all=T,cex=.7,pretty=0,xpd=TRUE) draw.tree(fitTree.elbow, cex=.8, pch=2,size=2.5, nodeinfo = FALSE, cases = "") # CP at the elbow - 0.004701763719512 # Party ? ctree fitTree.party <- ctree(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train[,2:12] #, controls=ctree_control( #stump=TRUE, #maxdepth=3 #) ) plot(fitTree.party, type = "simple" ) fitTree asRules(fitTree) info.gain.rpart(fitTree) ? rpart # Function which determines Variable Importance in rpart. See below. a <- importance(fitTree) summary(a) # NOTE: a different CP (.01) had a better #? rattle # parms=list(prior=c(.5,.5)) ?? Priors? # control=rpart.control(cp=0.0018)) # RATTLE
# ? rattle.print.rpart # Must figure out which is false positive and which is false negetive newdata0 = subset(cData[2:12], dlq==0) newdata1 = subset(cData[2:12], dlq==1) # newdata1 = subset(cData, dlq==0) noPredictions0 = predict(fitTree, newdata0) noPredictions1 = predict(fitTree, newdata1) noPredictions = predict(fitTree, test) max(noPredictions) min(noPredictions) noPredictions0 noPredictions1 correct0 = (noPredictions0 < 0.5) correct0 correct1 = (noPredictions1 > 0.5) correct1 table(correct0) table(correct1) # Confusion matrix? # Still have to do miss class thing # 2nd Lab on Normal Trees # Gini/Information : seems to make no difference # To do # 1. Add loads of missing data and outliers to test for robustness # 2. Add maxsurrogates (end of lab 3) # ??? ################## # Random Forests # ################## train ? randomForest ? randomForest set.seed(12345) fitrf=randomForest(as.factor(train$dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train.Frame, # declared above ntree=1000, type="classification", predicted=TRUE, importance=TRUE, proximity=FALSE # Never run prox as is crashes computer
) fitrf # Var Importance Plot importance(fitrf) varImpPlot(fitrf, main = "Variable Importance", sort = TRUE) varImpPlot(fitrf, class=1, main = "Variable Importance", sort = TRUE) # Looking good ? varImpPlot boxplot(cData$depends) # Partial Dep Plots give graphical depiction of the marginal effect of a variable on the class response (regression) ? partialPlot partialPlot(fitrf, train, age, main="Age Partial Dependency Plot") partialPlot(fitrf, train, revol,main="revol - Partial Dependency Plot", col="red") partialPlot(fitrf, train,MonthlyIncome,main="MonthlyIncome - Partial Dependency Plot") partialPlot(fitrf, train, depends,main="depends - Partial Dependency Plot") # 6 in one... partialPlot(fitrf, train, times1 ,main="Partial Dependency Plot") partialPlot(fitrf, train, add=TRUE,DebtRatio,main="DebtRatio - Partial Dependency Plot", col="blue") partialPlot(fitrf, train,ntimes2, add=TRUE,main="ntimes2 - Partial Dependency Plot", col="pink") partialPlot(fitrf, train, ntimeslate, add=TRUE,main="ntimeslate - Partial Dependency Plot", col="green") partialPlot(fitrf, train, norees, add=TRUE,main="Partial Dependency Plot", col="red") partialPlot(fitrf, train, noloansetc, add=TRUE,main="noloansetc - Partial Dependency Plot", col="light blue") partialPlot(fitrf, train, noloansetc,main="noloansetc - Partial Dependency Plot", col="light blue") partialPlot(fitrf, train,DebtRatio,main="DebtRatio - Partial Dependency Plot", col="blue") # Var Used Barchart Ylabels = c("revol","age","times1","DebtRatio","MonthlyIncome","noloansetc","ntimeslate","norees","ntimes2","depends") Graph = barplot(varUsed(fitrf, count=TRUE), xlab="Times variable used", c(1:14),
col=c("red", "dark red"), horiz=FALSE, space=0.4, width=2, axis=FALSE, axisnames=FALSE) axis(1, at=Graph, las=1, adj=0, cex.axis=0.7, labels = Ylabels) # oposite order of Ylabels ? varUsed # getTree(fitrf, k=1, labelVar=TRUE) # View and individual tree set.seed(54321) fitrf.250=randomForest(as.factor(train$dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train.Frame, # declared above ntree=250, type="classification", predicted=TRUE, importance=TRUE, proximity=FALSE # Never run prox as is crashes computer ) fitrf.250 set.seed(12345) #fitrf.reg=randomForest(dlq ~ revol + age + times1 + DebtRatio # + MonthlyIncome + noloansetc + ntimeslate # + norees + ntimes2 + depends, # data=train, # ntree=1000, # type="regression", # predicted=TRUE, # importance=TRUE, # proximity=FALSE # ) # set1$similarity <- as.factor(set1$similarity) # Need to work out this prox stuff etc. fitrf names(fitrf) summary(fitrf) fitrf$importance hist(fitrf$importance) fitrf$mtry # mtry = 3 hist(fitrf$oob.times) # Normal Distrition fitrf$importanceSD hist(treesize(fitrf, terminal=TRUE))
boxplot(fitrf$oob.times) plot(fitrf, main = "") ? plot.randomForest getTree(fitrf, k=3, labelVar=FALSE) fitrf$votes margins.fitrf=margin(fitrf,churn) plot(margins.rf) hist(margins.rf,main="Margins of Random Forest for churn dataset") boxplot(margins.rf~data$churn, main="Margins of Random Forest for churn dataset by class") The error rate over the trees is obtained as follows: plot(fit, main="Error rate over trees") MDSplot(fit, data$churn, k=2) # Margins margins.fitrf=margin(fitrf,dlq) plot(margins.fitrf) hist(margins.fitrf, main="Margins of Random Forest for Credit Dataset") boxplot(margins.fitrf~train.Frame$dlq) plot(margins.fitrf~dlq) MDSplot(fitrf, cData$dlq, k=2) # can't do because missing proximity matrix # Rattle Random Forest Stuff treeset.randomForest(fitrf, format="R") # This takes forever # printRandomForests(fitrf) # Making Predictions #predict(fitrf, test[101,]) # outputs a value of either 1 or 0 predict(fitrf, test) # outputs a value of either 1 or 0 print(pred.fitrf <- predict(fitrf, test, votes = TRUE) ) pred.fitrf = predict(fitrf, test, type="prob")[,2] pred.fitrf ############### # Neural Nets # ############### names(cData) sData = scale(na.omit(cData[2:12]), center = TRUE, scale = TRUE) sData # i = inpur # h = hidden layer
# b = bias class.ind(train$dlq) set.seed(12345) nn.lin.values = vector("numeric", 10) for(i in 1:10) { fitnn.lin = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends), train, size=0, skip=TRUE, softmax=FALSE, Hess=TRUE) cat(fitnn.lin$value,"\n") nn.lin.values[i] = fitnn.lin$value } hist(nn.lin.values) plot(nn.lin.values) nn.lin.values eigen(fitnn.lin$Hess) eigen(fitnn.lin$Hess)$values # Main Nnet set.seed(999) fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE, decay=0.1) b=nrow(train) findAIC(fitnn, b, 10) # -12168 (these AICs are not consistent) eigen(fitnn$Hess)$values # usually # ? neuralnet set.seed(999) fitnn.2 = neuralnet( # Took forever to run as.numeric(dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), data = data.matrix(train),##-- act.fct = "logistic", hidden = 1, linear.output = FALSE #, rep = 4, err.fct="sse" ) # as.numeric(dlq) # Can we change this to class/int??? class.ind(train$dlq)
# dlq runs! fitnn.2.2 = neuralnet( as.factor(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), data = data.matrix(train), hidden = 1, linear.output = FALSE #, rep = 4, err.fct="sse" ) # fitnn.2 plot(fitnn.2) # gwplot(fitnn.2, rep="best") nn.CI=confidence.interval(fitnn.2, alpha=0.05) # CI of weights nn.CI$upper nn.CI$lower nn.CI$upper.ci # ? compute # Don't think this scaling is right t = test t = t[-1] t = t[-1] t = t[-10] t = t[-9] names(t) t$revol = scale(t$revol) t$age = scale(t$age) t$times1 = scale(t$times1) t$DebtRatio = scale(t$DebtRatio) t$MonthlyIncome = scale(t$MonthlyIncome) t$noloansetc = scale(t$noloansetc) t$ntimeslate = scale(t$ntimeslate) t$norees = scale(t$norees) t print(pr <- compute(fitnn.2, t)) fitnn.2.pred=pr$net.result #print(pr.2 <- compute(fitnn.2.2, t)) #fitnn.2.2.pred=pr.2$net.result fitnn.2.pred # # ##-- List of probabilites generated by the neuralnet package b = nrow(train) fitnn.0 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), train, size=0, skip=TRUE, softmax=FALSE, Hess=TRUE) c = 10 # No depends findAIC(fitnn.0, b, c) # -12013.49
? nnet confus.fun(fitnn) eigen(fitnn$Hess) eigen(fitnn$Hess)$values # eigen(fitnn.0$Hess) eigen(fitnn.0$Hess)$values # ooooh, not all positive) - measure stability ??? # Postive definite. All eigenvalues greater than 0 # This is also good fitnn.5 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(times1) + scale(MonthlyIncome) + scale(ntimeslate), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 5 # ?? findAIC(fitnn.5, b, c) # Worst AIC from here on fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 11 # ?? findAIC(fitnn.11, b, c) # -11595.36 fitnn.10.1 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(ntimes2) + scale(depends), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 10 # + scale(norees) findAIC(fitnn.10.1, b, c) # -12452.82 fitnn.9 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(norees) + scale(ntimes2), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 9 # No , depends, ntimeslate findAIC(fitnn.9, b, c) # -11872.63 fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) +
fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                  scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                  scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends),
                train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE)
c = 11 # ??
findAIC(fitnn.11, b, c)
#
# uses test!!!
# Softmax = TRUE requires at least two response categories
# Not always all negative; the seed must be set
# easy because it is already all in numbers. No need for class.ind etc...
# Task: manually fill in examples into a nnet
summary(fitnn)
names(fitnn)
fitnn$terms
fitnn$wts
# Hessian: Hess = TRUE
# Matrix

# AIC
p = nrow(train)
k = 8 # ncol(train)
SSE = sum(fitnn$residuals^2)
AIC = 2*k + p*log(SSE/p)
# SBC

# Different Decays
errorRate = vector("numeric", 100)
DecayRate = seq(.0001, 1, length.out=100)
for(i in 1:100) {
  # set.seed(12345)
  fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                 scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                 scale(ntimeslate) + scale(norees) + scale(ntimes2),
               train, size=0, skip=TRUE, decay=DecayRate[i])
  errorRate[i] = sum(fitnn$residuals^2)
  # Could add AIC here
  # Was inverse graph for size = 0
}
errorRate
plot(DecayRate, errorRate, xlab="Decay", ylab="Error", type="l", lwd="2")

# Actual??
fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
               scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
               scale(ntimeslate) + scale(norees) + scale(ntimes2),
             train, size=1, skip=FALSE, softmax=TRUE, decay=DecayRate[i])

#######
# SVM #
#######

# Small datasets
# plot(cData$age, cData$DebtRatio, col=(cData$dlq+3), pch=(cData$dlq+2))
set.seed(12345)
# attach(cData)

# Regression - Radial
svm.model.reg.rad <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                           noloansetc + ntimeslate + norees + ntimes2 + depends,
                         data = train.Frame,
                         type = "eps-regression", # original used type = "regression"; e1071 expects "eps-regression"
                         kernel = "radial",
                         cost = 100, gamma = 1)

# Regression - Linear
svm.model.reg.lin <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                           noloansetc + ntimeslate + norees + ntimes2 + depends,
                         data = train.Frame,
                         type = "eps-regression",
                         kernel = "linear",
                         cost = 100) # no gamma

# Polynomial
svm.model.pol <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "polynomial",
                     cost = 100, gamma = 1)
# Sigmoid
svm.model.sig <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "sigmoid",
                     probability = FALSE,
                     cost = 100, gamma = 1)

# linear started 5.23. Started 18.41 - 1848
svm.model.lin <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "linear",
                     probability = TRUE,
                     cost = 100) # no gamma for linear

# Radial
svm.model.rad <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "radial",
                     probability = TRUE,
                     cachesize = 1000,
                     cost = 130, gamma = 1.3)
# tuning results: gamma 1.3, cost 130 -- 0.2118273

# Smaller sample for tuning
svmtest_rows = sample.int(nrow(cData), 4000)
svmtrain = cData[svmtest_rows,]
svmtest = cData[-svmtest_rows,]
svmtrain = data.frame(svmtrain)
svmtest = data.frame(svmtest)

obj <- tune.svm(dlq ~ ., data = svmtrain,
                gamma = seq(.5, .9, by = .1),
                cost = seq(100, 1000, by = 100))
plot(obj) # obj 0.218031
obj2 <- tune.svm(dlq ~ ., data = svmtrain,
                 gamma = seq(1, 1.5, by = .1),
                 cost = seq(10, 150, by = 10))
plot(obj2)
obj2
# Gamma Cost Best Performance
# 1.1   100 -- 0.2132932
# 1.3   130 -- 0.2118273
# 1.3   80  -- 0.2118273
# 1.3   120 -- 0.2137122

obj3 <- tune.svm(dlq ~ ., data = svmtrain,
                 gamma = seq(.5, 1.5, by = .1),
                 cost = seq(0, 300, by = 10))
plot(obj3)
obj3

sm = svm.model

# change kernels
# svm.model.lin / svm.model... PROBS
x = svm.model.sig
x = svm.model.pol
x = svm.model.lin
x = svm.model

# PLOT.SVM (cData must be attached). maybe test[,3]
detach(test)
nrow(test)
nrow(age)                # NULL: age is a vector, not a data frame
nrow(test$MonthlyIncome) # NULL
attach(cData)
plot(x, data = test.Frame,
     MonthlyIncome ~ age, # age~revol, MonthlyIncome ~ DebtRatio
     svSymbol = 1, dataSymbol = 2, fill = TRUE)
detach(cData)
# attach(cData)

# Outputs
svm.model
names(svm.model)
str(svm.model)
summary(svm.model)

? predict
predict(svm.model, test[101,]) # outputs a value of either 1 or 0
predict(svm.model, test)       # outputs a value of either 1 or 0

# predsvm.lin / predsvm.pol / predsvm.reg.lin / predsvm.reg.rad / predsvm.
predsvm = svm.model.lin
predsvm <- predict(svm.model, test, probability = TRUE)
predsvm
# This bit is needed for classification, not for regression or class-radial
# predsvm = attr(predsvm, "probabilities")[,2] # Converts into probabilities...
plot(predsvm)
# ...for the purposes of comparison with other models and ensemble model creation
predsvm # Are these probabilities?

svs = svm.model$SV
svs
? attr

# Model Evaluation
predsvmlin = attr(predict(svm.model.lin, test, probability = TRUE), "probabilities")[,2]
predsvmrad = attr(predict(svm.model.rad, test, probability = TRUE), "probabilities")[,2]
predsvmsig = attr(predict(svm.model.sig, test, probability = TRUE), "probabilities")[,2]
# predsvmpol    = attr(predict(svm.model.pol, test, probability = TRUE), "probabilities")[,2]
# predsvmregrad = attr(predict(svm.model.reg.rad, test, probability = TRUE), "probabilities")[,2]
# predsvmreglin = attr(predict(svm.model.reg.lin, test, probability = TRUE), "probabilities")[,2]

########
# ROCR #
########

##############
# Evaluation #
##############

# MODELS
fitTree
fitTree.sim
fitTree.sml
fitrf.reg
fitrf
fitnn
fitnn.2
svm.model

# Probs
pred.fitrf
fitnn.2.pred

x1   = predict(fitTree, test)[,2]
x1.2 = predict(fitTree.sim, test)[,2]
x1.3 = predict(fitTree.sml, test)
x2   = pred.fitrf
##-- x2.2 = predict(fitrf.reg, test)
x3   = predict(fitnn, test)
x3   = x3[,2] # Needed for the regression NN, not class (see above); now also needed for softmax
x3.2 = fitnn.2.pred
x4.2 = attr(predict(svm.model.lin, test, probability = TRUE), "probabilities")[,2]
x4.3 = attr(predict(svm.model.rad, test, probability = TRUE), "probabilities")[,2]
x4.4 = attr(predict(svm.model.sig, test, probability = TRUE), "probabilities")[,2]
x4   = x4.3 # Radial C-classification performs best
##-- x4.5 = attr(predict(svm.model.pol, test, probability = TRUE), "probabilities")[,2]
x5   = predict(fitLR, test, type = "response") # Logistic Regression

plot(x1~x4, col=(test$dlq+3), pch=(1), ylab="", xlab="", main = "") # with linear c-classification
##-- x4.2 = predsvm.rad

ensem   = (x1+x2+x3+x4+x5)/4 # note: sums five probabilities but divides by 4; ensem.2 uses the correct divisor
ensem.2 = (x1+x2+x3+x4+x5)/5

dev.off()
par(mfrow=c(3, 2))

# Comparing Model Results
plot(x1~x2, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Random Forest",
     main = "Tree vs. Random Forest")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x3, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Neural Net",
     main = "Tree vs. Neural Net")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x4, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Support Vector Machine",
     main = "Tree vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x2~x3, col=(test$dlq+3), pch=(1), ylab="Random Forest", xlab="Neural Net",
     main = "Random Forest vs. Neural Net") # Correlated, but x3 has negative values
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x2~x4, col=(test$dlq+3), pch=(1), ylab="Random Forest", xlab="Support Vector Machine",
     main = "Random Forest vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x3~x4, col=(test$dlq+3), pch=(1), ylab="Neural Net", xlab="Support Vector Machine",
     main = "Neural Net vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x1.2, col=(test$dlq+3), pch=(1), main="Tree vs. Other Tree") # Class vs. Regression Random Forests
plot(x2~x2.2, col=(test$dlq+3), pch=(1),
     main="Class Random Forests vs. Regression Random Forests Predictions") # Class vs. Regression Random Forests

plot(x3~x3.2, col=(test$dlq+3), pch=(1), ylab="NN", xlab="NN", main = "Neural Net vs. Neural Net")
# The original call here was malformed (plot(fitnn.2, ..., pch=(1).pred~x3.2, ...)); presumably intended:
plot(fitnn.2.pred~x3.2, col=(test$dlq+3), pch=(1), ylab="NN", xlab="NN", main = "Neural Net vs. Neural Net")

plot(x1~x1.2, col=(test$dlq+3), pch=(1))
plot(x1~x1.3, col=(test$dlq+3), pch=(1))
plot(x1.2~x1.3, col=(test$dlq+3), pch=(1))
dev.off()

plot(x1~ensem, col=(test$dlq+3), pch=(1), main = "Tree vs. Ensemble")
plot(x2~ensem, col=(test$dlq+3), pch=(1), main = "Random Forest vs. Ensemble")
plot(x3~ensem, col=(test$dlq+3), pch=(1), main = "Neural Net vs. Ensemble")
plot(x4~ensem, col=(test$dlq+3), pch=(1), main = "Support Vector Machine vs. Ensemble")

# Confusion Matrices
table(data.frame(predicted=predict(fitTree, test) > 0.5, actual=test[,2]>0.5)) # this works
# that doesn't work, but this does
TREEmat = table(data.frame(predict(fitTree, test)[,2] > 0.5, actual=test[,2]>0.5))
RFmat   = table(data.frame(pred.fitrf > 0.5, actual=test[,2]>0.5))
NNmat   = table(data.frame(predicted=(predict(fitnn, test) > 0.5)[,2], actual=test[,2]>0.5))
SVMmat  = table(data.frame(predicted=predsvm > 0.5, actual=test[,2]>0.5)) # takes probabilities directly from above
LMmat   = table(data.frame(predict(fitLR, test) > 0.5, actual=test[,2]>0.5))
ENSmat  = table(data.frame(predicted=ensem.2 > 0.5, actual=test[,2]>0.5)) # takes probabilities directly from above

mat = ENSmat
fp.rate = mat[1,2]/(mat[1,1] + mat[1,2])
fn.rate = mat[2,1]/(mat[2,1] + mat[2,2])
mc.rate = (mat[1,2]+mat[2,1])/(mat[1,1]+mat[1,2]+mat[2,1]+mat[2,2])
rates = c(fp.rate, fn.rate, mc.rate)
mat
fp.rate
fn.rate
mc.rate
rates
# class.ind gives two columns

# Prediction Probabilities
predTree = predict(fitTree, newdata = test, prob = "class")[,2]
predrf   = predict(fitrf, newdata = test, type = "prob")[,2] # prob = "class" for class membership
prednn   = predict(fitnn, newdata = test, prob = "prob")
predsvm  = predict(svm.model, newdata = test)
##-- predsvm = attr(predict(svm.model, test, probability = TRUE), "probabilities")#[,2] # Converts into probabilities...

# Cannot plot risk without the nominal value of each loan
par(mfrow=c(2, 2))
evalTree = evaluateRisk(predTree, test$dlq)
plotRisk(evalTree$Caseload, evalTree$Precision, evalTree$Recall, show.legend=TRUE)
evalrf = evaluateRisk(predrf, test$dlq)
plotRisk(evalrf$Caseload, evalrf$Precision, evalrf$Recall)
evalnn = evaluateRisk(prednn, test$dlq)
plotRisk(evalnn$Caseload, evalnn$Precision, evalnn$Recall, show.legend=TRUE)
evalsvm = evaluateRisk(predsvm, test$dlq)
plotRisk(evalsvm$Caseload, evalsvm$Precision, evalsvm$Recall)
dev.off()

###########
par(mfrow=c(2, 2))
# Box Plots for Model Evaluation
boxplot(predTree~test$dlq, col = "red", main = "Simple Tree")
boxplot(predrf~test$dlq, col = "green", main = "Random Forest")
boxplot(prednn[,2]~test[,2], col = "purple", main = "Neural Net")          # original title said "Support Vector Machine"
boxplot(predsvm~test[,2], col = "orange", main = "Support Vector Machine") # original title said "Light Blue"
##-- The tree is not looking good here

# Prediction for ROCR
detach("package:neuralnet") # Needed because neuralnet also defines a prediction function
predTree = predict(fitTree.sim, newdata = test, prob = "class")
predicsTree = prediction(predTree[,2], test$dlq)
predicsRF = prediction(predrf, test$dlq)
predicsNN  = prediction(prednn[,2], test$dlq)
predicsSVM = prediction(predsvm, test$dlq)
predicsRF
predicsTree
predicsNN
predicsSVM
# str(predicsRF)
# predicsRF@fp

perfTree = performance(predicsTree, "tpr", "fpr") # ROC Curve
perfrf   = performance(predicsRF, "tpr", "fpr")   # ROC Curve
perfNN   = performance(predicsNN, "tpr", "fpr")   # ROC Curve
perfSVM  = performance(predicsSVM, "tpr", "fpr")  # ROC Curve

# QUICK ROCR CURVES
plot(perfTree, col="red")
plot(perfrf, add=TRUE, col="green")
plot(perfNN, add=TRUE, col="orange")
plot(perfSVM, add=TRUE, col="purple")
legend("bottomright", c("Tree","R. Forest","N. Net","SVM"), pch=1,
       col=c("red","green", "orange", "purple"))

predics = predicsTree # predicsRF, predicsTree, predicsNN, predicsSVM
perfo = performance(predics, "acc")
plot(perfo)
perfo = performance(predics, "tpr", "acc")
plot(perfo)
perfo = performance(predics, "err", "acc")
plot(perfo)
perfo = performance(predics, "lift", "rpp") # lift chart
plot(perfo)
perfo = performance(predics, "tpr", "rpp")
plot(perfo)

perfTree = performance(predicsTree, "lift", "rpp") # Lift chart
perfrf   = performance(predicsRF, "lift", "rpp")   # Lift chart
perfNN   = performance(predicsNN, "lift", "rpp")   # Lift chart
perfSVM  = performance(predicsSVM, "lift", "rpp")  # Lift chart
plot(performance(predics, "lift", "rpp"))
plot(perfTree, col="red")
plot(perfrf, add=TRUE, col="green")
plot(perfNN, add=TRUE, col="orange")
plot(perfSVM, add=TRUE, col="purple")
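# Editor's sketch (my addition, not in the original script): the ROC curves above can be
# summarised with a single AUC value per model using ROCR's "auc" performance measure,
# which makes the models easier to rank than by eye. 'auc.of' is a hypothetical helper name.
auc.of <- function(p) performance(p, "auc")@y.values[[1]]
auc.of(predicsTree)
auc.of(predicsRF)
auc.of(predicsNN)
auc.of(predicsSVM)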
#########################################################################################################
######
# CREATING BINARY DATA
######

# DebtRatio
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$DebtRatio[j] >= 3.5) {
    bData$DebtRatio[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$DebtRatio[j] = 1
    num_with_one = num_with_one + 1
  }
}
bData$DebtRatio
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
boxplot(bData)

# Monthly Income - 0/1
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$MonthlyIncome[j] == '-999') {
    bData$MonthlyIncome[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$MonthlyIncome[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$MonthlyIncome

# ntimeslate
boxplot(bData)
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$ntimeslate[j] == 0) {
    bData$ntimeslate[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$ntimeslate[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$ntimeslate

# revol
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$revol[j] == 0) {
    bData$revol[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$revol[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$revol

# norees
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$norees[j] >= 6) {
    bData$norees[j] = 1 # note: both branches assign 1; one of these was presumably meant to be 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$norees[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$norees

# Depends
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$depends[j] == '-999') {
    bData$depends[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$depends[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$depends

bData = bData[-6]
bData = bData[-5]
bData = bData[-2]
bData = bData[-6]

# times1
j = 0
pc = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$times1[j] < 80) {
    bData$times1[j] = 0
  } else {
    bData$times1[j] = 1
  }
}
bData$times1

# ntimes2
j = 0
pc = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$ntimes2[j] < 80) {
    bData$ntimes2[j] = 0
  } else {
    bData$ntimes2[j] = 1
  }
}
bData$ntimes2

boxplot(bData)
names(bData)

############################################# plot(cData$times1~cData$revol) #####
# bData
set.seed(12345)
Btest_rows = sample.int(nrow(bData), nrow(bData)/3)
Btest = bData[Btest_rows,]
Btrain = bData[-Btest_rows,]

# Simpler Tree - CP is
fitTree.binary = rpart(dlq ~ age + times1 + noloansetc + ntimeslate + ntimes2 + depends,
                       data = Btrain,
                       parms = list(split="gini"),
                       method = "class")
draw.tree(fitTree.binary, cex=.8, pch=2, size=2.5, nodeinfo = FALSE, cases = "obs")
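# Editor's sketch (my addition, not the original author's code): the per-row recoding loops
# above can be expressed as vectorised one-liners, which is the more idiomatic R form.
# 'bRaw' is a hypothetical fresh copy of the un-recoded data; thresholds mirror the loops above.
bRaw <- cData
bRaw$DebtRatio     <- as.integer(bRaw$DebtRatio < 3.5)       # 0 flags the DebtRatio outliers
bRaw$MonthlyIncome <- as.integer(bRaw$MonthlyIncome != -999) # 0 flags missing income
bRaw$ntimeslate    <- as.integer(bRaw$ntimeslate != 0)
bRaw$revol         <- as.integer(bRaw$revol != 0)
bRaw$times1        <- as.integer(bRaw$times1 >= 80)
bRaw$ntimes2       <- as.integer(bRaw$ntimes2 >= 80)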
#######################
# Univariate Graphing #
#######################
attach(cData)
par(mfrow=c(1,2))

# REVOL
hist(revol, col="red", main="revol", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(revol), col="black", lwd = 2, lty = 1)
lines(density(c1Data$revol), col="red", lwd = 2, lty = 2)
lines(density(c0Data$revol), col="green", lwd = 2, lty = 2)
# legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# norees
hist(norees, col="red", main="norees", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(norees), col="black", lwd = 2, lty = 1)
lines(density(c1Data$norees), col="red", lwd = 2, lty = 2)
lines(density(c0Data$norees), col="green", lwd = 2, lty = 2)
# legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))
dev.off()

attach(cData)
par(mfrow=c(1,2))
names(cData)

# REVOL
hist(revol, col="red", main="revol", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(revol), col="black", lwd = 2, lty = 1)
lines(density(c1Data$revol), col="red", lwd = 2, lty = 2)
lines(density(c0Data$revol), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0","All"), lwd = 3, col=c("red","green",1))

# AGE
hist(age, col="red", main="age", xlab="age", ylab="Density", freq=FALSE)
lines(density(age), col="black", lwd = 2, lty = 1)
lines(density(c1Data$age), col="red", lwd = 2, lty = 2)
lines(density(c0Data$age), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# times1
hist(times1, col="red", main="times1", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(times1), col="black", lwd = 2, lty = 1)
lines(density(c1Data$times1), col="red", lwd = 2, lty = 2)
lines(density(c0Data$times1), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# DebtRatio
hist(DebtRatio, col="red", main="DebtRatio", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(DebtRatio), col="black", lwd = 2, lty = 1)
lines(density(c1Data$DebtRatio), col="red", lwd = 2, lty = 2)
lines(density(c0Data$DebtRatio), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# MonthlyIncome
hist(MonthlyIncome, col="red", main="MonthlyIncome", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(MonthlyIncome), col="black", lwd = 2, lty = 1)
lines(density(c1Data$MonthlyIncome), col="red", lwd = 2, lty = 2)
lines(density(c0Data$MonthlyIncome), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# noloansetc
hist(noloansetc, col="red", main="noloansetc", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(noloansetc), col="black", lwd = 2, lty = 1)
lines(density(c1Data$noloansetc), col="red", lwd = 2, lty = 2)
lines(density(c0Data$noloansetc), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# ntimeslate
hist(ntimeslate, col="red", main="ntimeslate", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(ntimeslate), col="black", lwd = 2, lty = 1)
lines(density(c1Data$ntimeslate), col="red", lwd = 2, lty = 2)
lines(density(c0Data$ntimeslate), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# norees
hist(norees, col="red", main="norees", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(norees), col="black", lwd = 2, lty = 1)
lines(density(c1Data$norees), col="red", lwd = 2, lty = 2)
lines(density(c0Data$norees), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# ntimes2
hist(ntimes2, col="red", main="ntimes2", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(ntimes2), col="black", lwd = 2, lty = 1)
lines(density(c1Data$ntimes2), col="red", lwd = 2, lty = 2)
lines(density(c0Data$ntimes2), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# depends
hist(depends, col="red", main="depends", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(depends), col="black", lwd = 2, lty = 1)
lines(density(c1Data$depends), col="red", lwd = 2, lty = 2)
lines(density(c0Data$depends), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

detach(cData)

# Univariate summaries
summary(cData)
mean(cData) # recent R versions need sapply(cData, mean) for column means
sd(cData)   # likewise sapply(cData, sd)
var(cData)
summary(cData)
table(cData$dlq)

##############
# Univariate #
##############
par(mfrow=c(1, 1))
for(i in 2:12) {
  boxplot(cData[,i], xlab=names(cData)[i], main=names(cData)[i])
}

# Bivariate
plot(cData$MonthlyIncome~cData$age)
for(j in 2:11) {
  for(i in 3:12) {
    plot(cData[,j]~cData[,i], main="Plot", xlab=names(cData)[i], ylab=names(cData[j]))
  }
}
pairs(cData, col=as.integer(cData$dlq))
? pairs
# This could be creative. Surely older people are better at paying off loans and have higher monthly income.

########################
# Noel OBoyle Function #
########################
importance <- function(mytree) {
  # Calculate variable importance for an rpart classification tree
  # NOTE!! The tree *must* be based upon data that has the response (a factor)
  # in the *first* column
  # Returns an object of class 'importance.rpart'
  # You can use print() and summary() to find information on the result

  delta_i <- function(data, variable, value) {
    # Calculate the decrease in impurity at a particular node given:
    #   data     -- the subset of the data that 'reaches' a particular node
    #   variable -- the variable to be used to split the data
    #   value    -- the 'split value' for the variable
    current_gini <- gini(data[,1])
    size <- length(data[,1])
    left_dataset <- eval(parse(text=paste("subset(data,", paste(variable, "<", value), ")")))
    size_left <- length(left_dataset[,1])
    left_gini <- gini(left_dataset[,1])
    right_dataset <- eval(parse(text=paste("subset(data,", paste(variable, ">=", value), ")")))
    size_right <- length(right_dataset[,1])
    right_gini <- gini(right_dataset[,1])
    # print(paste(" Gini values: current=", current_gini, "(size=", size, ") left=", left_gini,
    #             "(size=", size_left, "), right=", right_gini, "(size=", size_right, ")"))
    current_gini*size - length(left_dataset[,1])*left_gini - length(right_dataset[,1])*right_gini
  }

  gini <- function(data) {
    # Calculate the gini value for a vector of categorical data
    numFactors = nlevels(data)
    nameFactors = levels(data)
    proportion = rep(0, numFactors)
    for (i in 1:numFactors) {
      proportion[i] = sum(data==nameFactors[i])/length(data)
    }
    1 - sum(proportion**2)
  }

  frame <- mytree$frame
  splits <- mytree$splits
  allData <- eval(mytree$call$data)
  output <- ""
  finalAnswer <- rep(0, length(names(allData)))
  names(finalAnswer) <- names(allData)
  d <- dimnames(frame)[[1]]

  # Make this vector of length = the max nodeID
  # It will be a lookup table from frame --> splits
  index <- rep(0, as.integer(d[length(d)]))
  total <- 1
  for (node in 1:length(frame[,1])) {
    if (frame[node,]$var != "<leaf>") {
      nodeID <- as.integer(d[node])
      index[nodeID] <- total
      total <- total + frame[node,]$ncompete + frame[node,]$nsurrogate + 1
    }
  }

  for (node in 1:length(frame[,1])) {
    if (frame[node,]$var != "<leaf>") {
      nodeID <- as.integer(d[node])
      output <- paste(output, "Looking at nodeID:", nodeID, "\n")
      output <- paste(output, " (1) Need to find subset", "\n")
      output <- paste(output, " Choices made to get here:...", "\n")
      data <- allData
      if (nodeID%%2==0) symbol <- "<" else symbol <- ">="
      i <- nodeID%/%2
      while (i>0) {
        output <- paste(output, " Came from nodeID:", i, "\n")
        variable <- dimnames(splits)[[1]][index[i]]
        value <- splits[index[i],4]
        command <- paste("subset(allData,", variable, symbol, value, ")")
        output <- paste(output, " Applying command", command, "\n")
        data <- eval(parse(text=command))
        if (i%%2==0) symbol <- "<" else symbol <- ">="
        i <- i%/%2
      }
      output <- paste(output, " Size of current subset:", length(data[,1]), "\n")

      output <- paste(output, " (2) Look at importance of chosen split", "\n")
      variable <- dimnames(splits)[[1]][index[nodeID]]
      value <- splits[index[nodeID],4]
      best_delta_i <- delta_i(data, variable, value)
      output <- paste(output, " The best delta_i is:", format(best_delta_i, digits=3),
                      "for", variable, "and", value, "\n")
      finalAnswer[variable] <- finalAnswer[variable] + best_delta_i
      output <- paste(output, " Final answer: ", paste(finalAnswer, collapse=" "), "\n")

      output <- paste(output, " (3) Look at importance of surrogate splits", "\n")
      ncompete <- frame[node,]$ncompete
      nsurrogate <- frame[node,]$nsurrogate
      if (nsurrogate>0) {
        start <- index[nodeID]
        for (i in seq(start+ncompete+1, start+ncompete+nsurrogate)) {
          variable <- dimnames(splits)[[1]][i]
          value <- splits[i,4]
          best_delta_i <- delta_i(data, variable, value)
          output <- paste(output, " The best delta_i is:", format(best_delta_i, digits=3),
                          "for", variable, "and", value, "and agreement of", splits[i,3], "\n")
          finalAnswer[variable] <- finalAnswer[variable] + best_delta_i*splits[i,3]
          output <- paste(output, " Final answer: ",
                          paste(finalAnswer[2:length(finalAnswer)], collapse=" "), "\n")
        }
      }
    }
  }
  result <- list(result=finalAnswer[2:length(finalAnswer)], info=output)
  class(result) <- "importance.rpart"
  result
}

print.importance.rpart <- function(self) {
  print(self$result)
}
summary.importance.rpart <- function(self) {
  cat(self$info)
}

# Confusion matrix helper
confus.fun = function(x) { # e.g. x = fitnn
  confus.mat = table(data.frame(predicted=predict(x, test) > 0.5, actual=test[,2]>0.5))
  false.neg = confus.mat[1,2] / (confus.mat[1,2] + confus.mat[1,1])
  false.pos = confus.mat[2,1] / (confus.mat[2,1] + confus.mat[2,2])
  confus.mat
  false.neg
  false.pos
  cat("Confusion Matrix: ", "\n",
      "FALSE NEGATIVE: ", false.neg, "\n",
      "FALSE POSITIVE: ", false.pos, "\n")
}
confus.fun(fitnn)
confus.fun(svm.model)

# AIC helper
findAIC = function(a, b, c) {
  p = b
  k = c # ncol(train)
  SSE = sum(a$residuals^2)
  AIC = 2*k + p*log(SSE/p)
  AIC
  # SBC = p*log(n) + n*log(SSE1/n)
  # SBC
}