DATA MINING
Delinquent Loans
Author: Stephen Denham
Lecturer: Dr. Myra O'Regan
Prepared for: ST4003 Data Mining
Submitted: 24th February 2012
Contents
1. Background
 1.1 Dataset
2. Models
 2.1 Logistic Regression
 2.2 Principal Component Analysis
 2.3 Trees
 2.4 Random Forests
 2.5 Neural Networks
 2.6 Support Vector Machines
3. Further Analysis
 3.1 Ensemble Model
 3.2 Limitations and Further Work
 3.3 Variable Importance
4. Models Assessment
5. Conclusion
7. Appendix
1. BACKGROUND
This report evaluates predictive models applied to a dataset of over 21,000 rows of
customer information across 12 variables. The aim is to model which loans are most
likely to be deemed ‘delinquent’.
Norman Ralph Augustine said ‘it’s easy to get a loan unless you need it’. By
understanding customer trends, financial institutions can better allocate their
resources. The combination of data-driven predictive models and human intuition can
mean credit is given to those in the best position to repay.
Techniques used in this report include logistic regression, trees, random forests,
neural networks, support vector machines and the creation of an ensemble model. It
is as much a problem of soft intuition as mathematical skill.
1.1 Dataset
The original dataset contained 21,326 rows and 12 columns, each row representing a
customer described by the following 12 variables:
1. nid (integer): Simply a unique identifier that is of no use in our analysis. Such a variable may be necessary for other analysis tools.
2. dlq (0/1): Individuals who have been 90 days or more past their due date are termed 'delinquent', represented here by '1'. This is the target variable of the model.
3. revol (percentage): Total balance on credit cards and personal lines of credit except real estate and no instalment debt like car loans divided by the sum of credit limits. All revol values above 4 (40 cases) were deemed outliers and were removed.
4. age (integer): Age of borrower in years. This variable was not changed.
5. times1 (integer): Number of times borrower has been 30-59 days past due but no worse in the last 2 years. All times1 values above 14 (159 cases) were deemed outliers and were removed.
6. DebtRatio (percentage): Monthly debt payments, alimony and living costs divided by monthly gross income. All DebtRatio values above 14 (3,800 cases) were deemed outliers and were removed. This may seem like a large number; however, the vast majority of these cases were also missing MonthlyIncome or contained outliers in other variables.
7. MonthlyIncome (integer): Monthly income. 18.48% of cases have missing data for MonthlyIncome. All MonthlyIncome values above 60,000 (32 cases) were deemed outliers and were removed.
8. noloansetc (integer): Number of open loans (instalment loans such as a car loan or mortgage) and lines of credit (e.g. credit cards). All noloansetc values above 22 (471 cases) were deemed outliers and were removed.
9. ntimeslate (integer): Number of times borrower has been 90 days or more past due. All ntimeslate values above 40 (159 cases) were deemed outliers and were removed.
10. norees (integer): Number of mortgage and real estate loans including home equity lines of credit. All norees values above 6 (205 cases) were deemed outliers and were removed.
11. ntimes2 (integer): Number of times borrower has been 60-89 days past due but no worse in the last 2 years. All ntimes2 values above 8 (262 cases) were deemed outliers and were removed.
12. depends (integer): Number of dependents in the family excluding the borrower themselves (spouse, children etc.). 2.2% of the original cases were missing this variable and were simply removed; all of these cases were also missing MonthlyIncome. All depends values above 10 (2 cases) were deemed outliers and were removed.
Overall, 22.8% of cases were removed between outlier and missing-data rejection.
Figure 1.1 shows boxplots of all variables after cleansing. They show how
revol, times1, DebtRatio, ntimeslate, and depends are all densely
distributed around 0, while age and noloansetc are more normally
distributed.
Figure 1.1: Cleansed Data Boxplots
Figure 1.2 shows a histogram of revol density for both delinquent and non-delinquent
customers. revol showed the most visible relationship with delinquency.
Unfortunately, from a simplicity point of view, revol is the exception: when
graphing other variables in this way, the relationship is far more subtle – not enough
to make a judgement on a purely univariate basis, and so multivariate analysis is
necessary.
Figure 1.2: revol Histogram
It is often beneficial to create combination variables – blending variables together to
create new ones that express the cases more effectively. This was considered but
deemed unnecessary: there are a small number of variables, some of which are
already combinations – DebtRatio and revol. Also, non-linear kernels would pick
up such relationships.
In some instances, missing data and outliers are just as insightful. To quickly
determine whether benefit could be gained from them in a model, a simplified version
of the dataset was created. The same technique can also be used to divide variables
into brackets, such as age brackets; through this, variables which have a non-linear
influence on the prediction can also be picked up in simple models using indicator
variables.
Variables with questionable values were coded: 0 replaced a 'normal' value, and
1 replaced an outlier or missing value. This new dataset was fed into a basic tree to
assess the value added by these cases. The binary variables did not prove to be
accurate predictors, being inferior to the original variables, and so it was deemed
satisfactory to omit outliers and missing data.
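As an illustration, this screening might look like the following sketch (the flag names are hypothetical, and the thresholds simply reuse those quoted above):

# Sketch: flag missing values and outliers as 0/1 indicators and screen them with a basic tree
library(rpart)
iData <- oData                                            # original data, -999 codes missing values
iData$inc.flag <- as.numeric(iData$MonthlyIncome == -999 | iData$MonthlyIncome > 60000)
iData$dep.flag <- as.numeric(iData$depends == -999 | iData$depends > 10)
iData$rev.flag <- as.numeric(iData$revol > 4)
flag.tree <- rpart(dlq ~ inc.flag + dep.flag + rev.flag, data = iData, method = "class")
printcp(flag.tree)   # compare the indicator splits against trees built on the raw variables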
2. MODELS
After cleansing, a large dataset of just under 16,500 cases remained. This was deemed more
than enough to warrant dividing the data into training and test subsets (2:1). These
subsets had representative distributions of delinquent loans, so stratified
sampling was not required. A third, validation, subset would also have been possible. Six
models were developed on the training set and tested.
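A minimal sketch of the 2:1 split, mirroring the appendix code (the seed is arbitrary):

set.seed(12345)
test_rows <- sample.int(nrow(cData), nrow(cData) / 3)   # one third held out for testing
test  <- cData[test_rows, ]
train <- cData[-test_rows, ]
prop.table(table(train$dlq))   # check that the delinquency proportion is similar
prop.table(table(test$dlq))    # in both subsets, so stratification is unnecessary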
2.1 Logistic Regression
The key element of this problem is the binary nature of the outcome variable. Cases
can only be 0 or 1, and the associated probability should only be between 0 and 1.
Furthermore, many of the variables were predominantly small values, though large
outliers existed. For these reasons, a logistic regression model was suitable for initial
testing. Table 2.1 shows the output produced by the first logistic regression. It
showed that, on this linear scale, the variables DebtRatio and depends could be
removed from the model.
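A call of the following form produces the output in Table 2.1 (a sketch consistent with the appendix code; fitLR.full is an illustrative name):

fitLR.full <- glm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                    ntimeslate + norees + ntimes2 + depends,
                  data = train, family = binomial())
summary(fitLR.full)                                        # coefficients shown in Table 2.1
pred.lr <- predict(fitLR.full, test, type = "response")    # predicted probabilities on the test set
table(actual = test$dlq, predicted = pred.lr > 0.5)        # counts behind Table 2.2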
Table 2.1: Logistic Regression Output

Coefficients:
                Estimate         Std. Error      z value    Pr(>|z|)
(Intercept)    -1.06498817259    0.11027763366   -9.65734   < 0.000000000000000222 ***
revol           1.94583262923    0.06703656789   29.02644   < 0.000000000000000222 ***
age            -0.01751873613    0.00184558679   -9.49223   < 0.000000000000000222 ***
times1          0.54663351405    0.03189642304   17.13777   < 0.000000000000000222 ***
DebtRatio       0.04952123774    0.04646285001    1.06582   0.286503
MonthlyIncome  -0.00003790262    0.00000642855   -5.89598   0.0000000037245671760 ***
noloansetc      0.04735721583    0.00604523194    7.83381   0.0000000000000047329 ***
ntimeslate      0.89797913290    0.05709272686   15.72843   < 0.000000000000000222 ***
norees          0.06882099509    0.02810557602    2.44866   0.014339 *
ntimes2         0.95544920500    0.07330640163   13.03364   < 0.000000000000000222 ***
depends         0.03064240612    0.02053475812    1.49222   0.135641
Further trial-and-error experimentation, plotting and ANOVA showed that the
norees and depends variables could also be removed. Removing these variables
brought the AIC down from 10,916 to 10,909. Table 2.2 shows the final logistic
regression misclassification rates.
Table 2.2: Logistic Regression Model

FALSE POSITIVE RATE      29.8%
FALSE NEGATIVE RATE      16.5%
MISCLASSIFICATION RATE   25.5%
2.2 Principal Component Analysis
A principal component analysis was done to gain further insight into the underlying
trends before further modelling. This method creates distinct linear combinations
of the variables – components – that account for multivariate patterns in the data. It
does not produce a predictive model directly; however, it can show underlying trends.
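A sketch of the analysis (variables scaled first, then princomp on the correlation matrix, as in the appendix):

sData <- scale(cData[, 2:12], center = TRUE, scale = TRUE)   # dlq and the ten predictors
sDataPCA <- princomp(sData, cor = TRUE)
summary(sDataPCA)                                 # proportion of variance per component
plot(sDataPCA, type = "l", main = "Scree Plot")   # Figure 2.1
loadings(sDataPCA)                                # Table 2.3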
Figure 2.1 shows a plot of the proportion of total variance accounted for by each
component. The 'elbow' at component 3 shows clearly decreasing marginal value, and
so only the first three components were analysed; together they explained a
disproportionate 23% of the data variance.
Figure 2.1: Principal Component Analysis Scree Plot
The first component showed high loadings for dlq, revol, times1, ntimeslate and
ntimes2. This suggests that these four variables might be good indicators for
predicting dlq. This component could be termed a 'financial competence' element. The
second component contains high loadings for the two variables that indicate numbers of
lines of credit, along with some loading on dlq. The third component could be interpreted
as a 'middle-aged' variable, as it contains loadings for age, number of dependents,
monthly income and debt ratio. This may be because those starting families typically
have large debt obligations, having bought a house relatively recently.
The fifth and ninth components showed high loadings for dlq, along with elements of all
other variables apart from norees (which had a strong presence in component two).
This suggests that all variables have some amount of correlation with dlq; however,
these relationships are still unclear from the PCA alone. The values relevant to this
discussion are shown in Table 2.3.
Table 2.3: Principal Component Analysis Loadings Output

Loadings (Comp.1 to Comp.11; near-zero loadings are suppressed by R):
dlq            -0.422 -0.214 -0.352 -0.249 -0.218  0.554  0.449 -0.154
revol          -0.457 -0.114 -0.293 -0.223 -0.204 -0.352 -0.304 -0.583  0.197
age             0.279  0.255  0.551  0.229 -0.608 -0.333
times1         -0.301 -0.340  0.243 -0.258  0.520  0.420 -0.427  0.166
DebtRatio      -0.386  0.490 -0.461  0.234 -0.104 -0.265 -0.502
MonthlyIncome   0.224 -0.269 -0.583  0.277 -0.204 -0.279  0.110 -0.121 -0.252 -0.499
noloansetc      0.279 -0.476 -0.166  0.154  0.217  0.497 -0.577
ntimeslate     -0.373  0.266  0.414 -0.551  0.534
norees          0.239 -0.532 -0.147 -0.306  0.145 -0.102  0.289  0.652
ntimes2        -0.333 -0.221  0.327  0.400  0.225  0.572 -0.429
depends        -0.193 -0.582 -0.362  0.464  0.330 -0.385
2.3 Trees
Trees are a relatively simple concept in machine learning. They classify predictions
based on a series of ‘if-statements’ that can easily be represented graphically and
are produced via a recursive-partitioning algorithm.
Possibly the greatest benefit of a default classification tree is its simplicity and thus
interpretability. It is clearly explained diagrammatically, and so a bank manager could
easily apply it to a loan applicant without even the use of a calculator.
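The tree in Figure 2.2 corresponds to a default rpart call of roughly this form (a sketch following the appendix code):

library(rpart)
fitTree.sim <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                       ntimeslate + norees + ntimes2 + depends,
                     data = train, method = "class",
                     parms = list(split = "gini"))          # default cp = 0.01
plot(fitTree.sim, compress = TRUE, uniform = TRUE, branch = 0.5)
text(fitTree.sim, use.n = TRUE, all = TRUE, cex = 0.7)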
Figure 2.2 shows a basic tree constructed. It shows ntimeslate, times1 and
ntimes2 to be the most important determinants of prediction. By this tree, if a
customer has a ntimeslate value of 1 or greater, they have an 88% chance of
being delinquent.
Figure 2.2: Simple Tree with Default Complexity Parameter (cp=0.01)
Unfortunately this tree is an oversimplified model. There exists a trade-off between
model accuracy and complexity that can be seen in Figure 2.3.
Figure 2.3: Complexity Parameter vs. Relative Error
[Figure 2.2: splits on revol < 0.495 (10,966 obs), times1 < 0.5 (5,736 obs) and ntimeslate < 0.5 (4,638 obs); total classified correct = 74.1%]
The rpart package chooses a default complexity parameter of 0.01; however,
mathematically, the optimum complexity parameter is 0.0017000378, as it has the
lowest corresponding cross-validated error. This tree contains 16 splits and is shown
in Figure 2.4.
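A sketch of how the more complex tree was grown and its cross-validated errors inspected (the cp value is taken from the appendix code; choosing the minimum-xerror row is one reasonable rule):

fitTree <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                   ntimeslate + norees + ntimes2 + depends,
                 data = train, method = "class", parms = list(split = "gini"),
                 control = rpart.control(cp = 0.0010406812))   # grow well past the default
plotcp(fitTree)     # Figure 2.3: cross-validated relative error against cp
printcp(fitTree)    # cp table: number of splits, relative error and xerror
best.cp <- fitTree$cptable[which.min(fitTree$cptable[, "xerror"]), "CP"]
fitTree.opt <- prune(fitTree, cp = best.cp)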
Figure 2.4: Optimized Complexity Parameter Tree (cp=0.0010406812)
The second problem with more complex models, after the loss of user accessibility, is
the risk of over-fitting. As Figure 2.3 shows, there is a diminishing marginal decrease
in the cross-validated error as the tree becomes more complex, and the error sometimes
rises again. With that in mind, another way to decide on the number of splits (complexity)
is to judge the complexity parameter curve. Figure 2.3 shows two 'elbows' where the
marginal improvement from adding splits diminishes. There is one at cp=0.1; however,
this is even more basic than our original tree. The next is roughly at cp=0.0054, and the
associated tree, shown in Figure 2.5, contains 8 splits.
Figure 2.5: Tree of Complexity Parameter at CP Plot Elbow
[Figure 2.4: the 16-split tree; its splits involve revol, times1, ntimeslate, ntimes2, DebtRatio, MonthlyIncome, noloansetc, norees and age]
[Figure 2.5: the 8-split tree; its splits involve revol, times1, ntimeslate and ntimes2]
Due to the categorical output variable, classification trees were used throughout. The
Gini measure of node purity, also known as the splitting index, was used for the trees
selected. Entropy was also tried; however, Gini has become the convention, and the
other methods appear equally competent. The party package was experimented with
as well: although its graphical output is slightly better, it lacked flexibility in adjusting
complexity and so was deemed unsuitable in this case.
Trees facilitate loss matrices to weight the relative cost of false negatives and false
positives in model evaluation (a sketch of such a call is shown below). This was
experimented with, but without sufficient background knowledge and client
communication it was deemed too arbitrary and was eventually removed. This issue is
discussed further in section 4.
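For reference, a loss matrix is passed to rpart through the parms argument; the weights below are purely illustrative placeholders, not values used in this report:

false.pos.weight <- 2   # hypothetical weights; meaningful values would need client input
false.neg.weight <- 1
fitTree.loss <- rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
                        ntimeslate + norees + ntimes2 + depends,
                      data = train, method = "class",
                      parms = list(split = "gini",
                                   # rows = actual class (0, 1), columns = predicted class
                                   loss = matrix(c(0, false.pos.weight,
                                                   false.neg.weight, 0),
                                                 byrow = TRUE, nrow = 2)))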
Table 2.4 shows the occurrence of each variable as a splitter or surrogate splitter in the
model. The table was produced by counting occurrences of each variable name in the
output generated from the function important.rpart, created by Dr. Noel O'Boyle of
DCU (O'Boyle, 2011). The list coincided with the variable importance plot later generated
by the random forest, and to a lesser degree with the PCA output. Simple trees are
inferior at predicting linear relationships, which are evident in this dataset from the
logistic regression.
Table 2.4: Variable Importance via rpart.importance

Rank  Variable        Occurrences as Split or Surrogate
1     revol           63
2     times1          38
3     ntimeslate      32
4     ntimes2         31
5     DebtRatio       26
6     age             23
7     MonthlyIncome   22
8     noloansetc      22
9     norees          12
10    depends          5
Table 2.5: Classification Tree Misclassification Table

FALSE POSITIVE RATE      22.4%
FALSE NEGATIVE RATE      24.5%
MISCLASSIFICATION RATE   23.4%
2.4 Random Forests
The random forest method involves creating a large number of CART trees from a
dataset to form a democratic ensemble. The method has two 'random' elements.
First, the algorithm selects cases at random, with replacement (bootstrapping).
Secondly, at each split only a small random selection of variables (roughly the square
root of the total number of variables) is considered. Each tree is built without roughly
a third of the data, which is then used as the 'out of bag' (OOB) sample to evaluate that
tree. In this sense, the method creates internal training and test sets.
1,000 trees were created. This was enough to gather a large sample of possible variable
combinations without being too computationally expensive. At each split, 3 candidate
variables were considered, roughly the square root of the total number of variables.
As the target variable is binary, classification trees were best suited. As with the SVM,
this meant the training data had to be converted to type 'data.frame' and the target
variable dlq had to be converted to a factor.
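The forest was grown with a call of this form (a sketch following the appendix; the seed is arbitrary):

library(randomForest)
set.seed(12345)
fitrf <- randomForest(as.factor(dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                        noloansetc + ntimeslate + norees + ntimes2 + depends,
                      data = train.Frame,                  # training data as a data.frame
                      ntree = 1000, importance = TRUE)     # mtry defaults to ~sqrt(10) = 3
fitrf                                                      # out-of-bag confusion matrix and error rate
pred.rf <- predict(fitrf, test, type = "prob")[, 2]        # predicted probability of delinquency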
Variable Insight
Random forests give every variable a chance to 'vote' because of the large number
of trees created and the relatively low number of variables randomly considered at
each split. This allows them to provide deep insight into every variable. Two
expressions of this are a count of the number of times each variable is used in a split,
and partial dependency plots.
The bar chart in Figure 2.6 displays the number of times each variable was selected
for a split, as opposed to the two other candidate variables considered at that split
(mtry = 3). The chart shows times1, ntimeslate and ntimes2 to be used rarely,
while revol, age and MonthlyIncome were used often. Unfortunately, this
graph does not represent the relative value of each variable: it may be biased towards
continuous variables, as opposed to discrete ones, which is why times1,
ntimeslate and ntimes2 appear low.
Figure 2.6: Bar Chart of Times Each Variable Was Used in a Split
‘Partial dependence plot gives a graphical depiction of the marginal effect of a
variable on the class probability (classification) or response (regression).’
– randomForest Reference Manual.
The variables are listed here in descending importance as measured by the average
decrease in accuracy and gini.
Figure 2.7: Variable Importance Plot via varImpPlot
Partial dependency plots show the marginal effect a variable has on the predicted
probability as its values change. Figure 2.8 shows the partial dependency plot of
times1 (black), DebtRatio (dark blue), ntimes2 (pink), norees (red),
ntimeslate (green) and noloansetc (light blue). DebtRatio, times1,
ntimeslate and noloansetc all show early, rapid falls in credibility in going from 0
to 2. This graph clearly displays how being late on even one payment can have a dramatic
effect on a person's credit score.
[Figure 2.6 plots, for each of revol, age, times1, DebtRatio, MonthlyIncome, noloansetc, ntimeslate, norees, ntimes2 and depends, the number of times the variable was used in a split.]
[Figure 2.7 ranks the variables in descending importance: by mean decrease in accuracy, ntimeslate, times1, revol, ntimes2, DebtRatio, age, MonthlyIncome, norees, noloansetc, depends; by mean decrease in Gini, revol, DebtRatio, MonthlyIncome, age, times1, ntimeslate, noloansetc, ntimes2, depends, norees.]
The noloansetc and norees variables, which
both measure numbers of financial lines, rise from 0. This matches the assumption
that at least one line of credit is required to be in this dataset, while many lines of
credit may suggest financial instability.
Although having so many variables on a single graph can be inaccessible, it is the
best way to maintain relative perspective, by using the same scale. While analysing
these graphs, one must recall the density distributions of the variables (see Figure
1.1). Variables times1 and DebtRatio show unexpected recoveries; however, these
are caused by a low number of values that may have warranted removal as
outliers.
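Plots such as Figures 2.8 to 2.11 are produced with partialPlot; a sketch of the overlaid plot (colour names approximate those in the caption below):

partialPlot(fitrf, train.Frame, times1, main = "Partial Dependency Plot")
partialPlot(fitrf, train.Frame, DebtRatio,  add = TRUE, col = "darkblue")
partialPlot(fitrf, train.Frame, ntimes2,    add = TRUE, col = "pink")
partialPlot(fitrf, train.Frame, norees,     add = TRUE, col = "red")
partialPlot(fitrf, train.Frame, ntimeslate, add = TRUE, col = "green")
partialPlot(fitrf, train.Frame, noloansetc, add = TRUE, col = "lightblue")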
Figure 2.8: Partial Dependency Plot for times1 (black), DebtRatio (dark blue), ntimes2 (pink), norees (red), ntimeslate (green) and noloansetc (light blue)
The initial drop in the partial dependency plot for monthly income may be explained
by those with zero monthly income being students or retirees, while low-income
earners show the lowest probability of repayment. The large spike at 10,000 and the
fluctuations thereafter are more difficult to explain; however, these earners are a
minority. Figure 1.1 shows them to be in the tail of the boxplot, and so they may
be of less importance.
Figure 2.9: Partial Dependency Plot – MonthlyIncome
Figure 2.10 shows the effect age has on the predicted probability of delinquency.
It shows steep increases in financial competence through the late twenties as people
mature, and again leading towards retirement, followed by a decline after
retirement.
Figure 2.10: Partial Dependency Plot – Age
The revol plot decreases rapidly from 0 to 1, which is to be expected, as a low
value of revol suggests high financial control. This is followed by a slow increase
for which there is no obvious reason. Again, these are a 'tailed' minority and could
be deemed outliers. Values of revol over 1 would be due to interest. Another
possibility is that 'total balance' includes the sum of future interest payments due, in
which case those deemed satisfactory to obtain a longer-term loan would have been
deemed so due to some other measure of financial stability.
Figure 2.11: Partial Dependency Plot – revol
Figure 2.12 shows how the error rate on the out-of-bag sample falls as the
number of voting trees increases. It shows that 1,000 trees were more than
sufficient to reach a satisfactory error rate.
Figure 2.12: Error Rate
A forest of 250 trees was also created, which showed approximately the same results.
There was no major difference; however, the 1,000-tree forest was used for the
majority of the analysis.
Random forests can suffer from over-fitting; however, they handle large outliers well,
as the democratic voting reduces sensitivity to any single variable.
Random forests cannot incorporate variable costs; this was not considered here,
although a client might express a preference for certain variables. Possibly the greatest
benefit of the random forest analysis is the in-depth understanding of the variables it
provides.
Table 2.6: Random Forest Misclassification Table

FALSE POSITIVE RATE      22.6%
FALSE NEGATIVE RATE      23.2%
MISCLASSIFICATION RATE   23.7%
2.5 Neural Networks
Neural networks are a black-box modelling technique. Put simply, the model creates
a formula, similar to that of a linear regression, containing hidden layers. 'Neurons',
which represent the input variables, the output variables and optional units in the hidden
layer, are connected by weights that are algorithmically adjusted to minimise the
difference between the desired results and the actual results.
Both the nnet and neuralnet packages were tested in the analysis, but nnet was
chosen as the primary tool due to its speed and level of functionality. The model
was tuned to optimise the AIC, the error, positive Hessian-matrix eigenvalues and the
misclassification rates.
The data was scaled and a trial-and-error method was used: several
combinations of model inputs were created and tested with different values for size
(the number of units in the hidden layer) and different input vectors. To tailor the model
to the 0/1 classification target variable, softmax = TRUE was set as the activation
function for the nnet package and act.fct="logistic" was set for the
neuralnet package.
A decay value can be added to the model to improve its stability, mitigating the risk of
weights tending towards infinity. Figure 2.13 plots the error rate for decay values
between 0 and 1. It showed that adding a small decay value to the model vastly
reduced the error. Given the high possibility of outliers, the neural network
model could easily be over-fitted, so adding decay generalised it.
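A sketch of the kind of nnet call used (scaled inputs, one hidden unit, softmax outputs and weight decay), following the appendix code:

library(nnet)
set.seed(999)
fitnn <- nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                scale(ntimeslate) + scale(norees),
              data = train, size = 1, softmax = TRUE, decay = 0.1, Hess = TRUE)
summary(fitnn)             # the 8-1-2 network and its 13 weights, as reported below
eigen(fitnn$Hess)$values   # all-positive eigenvalues indicate a stable minimum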
Figure 2.13: Decay ~ Error Plot
One hidden layer was added and the depends variable was removed from the model,
reducing the AIC to -12291.02. This produced the following output for the optimum model:

a 8-1-2 network with 13 weights
options were - softmax modelling  decay=0.1
 b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1 i8->h1
  1.34   0.47  -0.13   0.56   0.02  -0.09   0.11   1.20   0.04
 b->o1 h1->o1
  3.25  -4.53
 b->o2 h1->o2
 -3.25   4.53
Figure 2.14 shows the model as graphically produced by the neuralnet package.
On the left are the input variables – the values that are multiplied by the weights
marked along the arrows to the hidden-layer neuron. The hidden neuron's output,
combined with a bias weight at the output neuron, is then passed through an activation
function to give the probability associated with the given case.
Figure 2.14: Neural Network
In some ways the neuralnet package is more advanced than the nnet package; however,
it is poorly built in a number of ways. The naming conventions are often unclear and
conflicting. Its prediction function, which is crucial for developing receiver
operating characteristic (ROC) curves, shares its name with a function in the
ROCR package. The result is several variables and functions, all with similar
names and very similar tasks.
Table 2.7 shows the performance of the neural network model on the test set.
Table 2.7: Neural Network Misclassification

FALSE POSITIVE RATE      23.2%
FALSE NEGATIVE RATE      26.1%
MISCLASSIFICATION RATE   24.6%
[Figure 2.14: neuralnet network diagram with the eight scaled inputs (revol, age, times1, DebtRatio, MonthlyIncome, noloansetc, ntimeslate, norees) feeding a single hidden neuron; error 891.32 after 13,284 steps]
2.6 Support Vector Machines
In the SVM process, points are transformed/mapped to a 'hyper-space' so that they can
be split by a hyper-plane. The transformation is not explicitly calculated, which would
be computationally expensive; rather, a kernel is calculated, which is a function of the
inner product of the original multidimensional space. Consider a piece of string: in
one dimension, the distance between the two ends of the string is the string's length,
and they can never touch. However, in real, three-dimensional space, the string
can be bent around so that both ends easily touch. This may be an abstract example,
but it demonstrates the idea that more dimensions allow more
manipulation.
Optimising the support vector machine involved tuning the values of gamma and
cost on the training set. Gamma is a term in the radial basis transformation and cost
is the constant in the Lagrange formulation. They were tested between 0.5 and 1.5,
and 100 and 1,000, respectively. The tune function concluded that a gamma of 1.3 and a
cost of 130 were optimum, giving a best performance of 0.2118273. Figure 2.15
shows an early plot of the tuning function.
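The tuning was done with tune.svm from the e1071 package on a sub-sample of the training data (svmtrain in the appendix); a sketch, with the grid taken from the ranges quoted above:

library(e1071)
set.seed(12345)
obj <- tune.svm(as.factor(dlq) ~ . - nid, data = svmtrain,
                gamma = seq(0.5, 1.5, by = 0.1),
                cost  = seq(100, 1000, by = 100))
summary(obj)   # reported best in this analysis: gamma = 1.3, cost = 130, performance 0.2118
plot(obj)      # Figure 2.15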
Figure 2.15: SVM Tuning Plot
SVM Classification Plots
SVM classification plots can be useful in determining the relationship between two
variables and the predictive model. In this case, they are better for mapping more
scattered variables such as MonthlyIncome rather than ntimeslate. Figure 2.16
displays the relationship between age, MonthlyIncome and the delinquency
classification under the radial C-classification method. The colours represent the
classification determined by the radial kernel in the transformed space.
Figure 2.16: Radial SVM Plotted on MonthlyIncome ~ age
Figure 2.17 shows the same two variables being classified by a sigmoid kernel.
Clearly there is a large difference in classification using different kernels.
Figure 2.17: Sigmoid SVM Plotted on MonthlyIncome ~ age
Table 2.8 shows the performance of the support vector machine model on the test set.

Table 2.8: Support Vector Machine Misclassification

FALSE POSITIVE RATE      24.7%
FALSE NEGATIVE RATE      23.2%
MISCLASSIFICATION RATE   24.1%
3. FURTHER ANALYSIS
3.1 Ensemble Model
In an attempt to gain the best elements of all models, an ensemble was created. This
simply averages, for each individual case, the probabilities produced by the tree,
random forest, neural network and support vector machine: all models are run and the
probabilities they produce for the case are averaged. This creates a less diverse
array of probabilities; however, it is just as decisive. Table 3.1 shows its performance
on the test set. Interestingly, this model had the lowest false positive rate for the test
set.
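A sketch of the averaging, assuming each model's predicted probability of delinquency for the test cases is held in a vector (pred.tree, pred.rf, pred.nn and pred.svm are illustrative names):

ensemble.prob  <- (pred.tree + pred.rf + pred.nn + pred.svm) / 4
ensemble.class <- as.numeric(ensemble.prob > 0.5)
table(actual = test$dlq, predicted = ensemble.class)   # counts behind Table 3.1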
Table 3.1: Ensemble Model Misclassification

FALSE POSITIVE RATE      22.4%
FALSE NEGATIVE RATE      25.3%
MISCLASSIFICATION RATE   23.7%
3.2 Limitations and Further Work
Without the opportunity to talk with the client and properly understand the
data, the variable descriptions are limited. The extent to which outliers could be removed,
and the ease of collecting each variable, had to be estimated. The Rattle package
offers the incorporation of risk into the model; however, this could not be calculated
without a measure of risk associated with each case, such as the nominal value of
each loan.
The creation of this data must also be considered. These are loans which have
already been approved, and so these customers would already have gone through some
screening system to get to that stage. Any model taken from this empirical study
must be used in conjunction with the current structures that have already screened out
potential loan applicants.
Data is temporal, but this dataset gives no information on the time period from which
it was taken; consider the value of such a model in the midst of the financial crisis.
Models can expire and must be continually tested and updated. Unfortunately, it is
difficult to test a model once it has been implemented.
3.3 Variable Importance
Table 3.2 contains a simplified rating – good, poor, average (ave.) or inconclusive (--) –
for each variable under different criteria. Support vector machines were not included,
as SVM is a black-box technique.
Table 3.2: Variable Ratings

Variable        Missing Data  Outliers  Log.   PCA    Tree   R. Forest  N. Net
revol           good          good      good   good   good   good       good
age             good          good      good   ave.   ave.   good       good
times1          good          ave.      good   good   good   --         good
DebtRatio       good          poor      poor   ave.   ave.   good       good
MonthlyIncome   poor          good      good   ave.   ave.   good       good
noloansetc      good          ave.      good   ave.   ave.   ave.       good
ntimeslate      good          good      good   good   good   poor       good
norees          good          ave.      ave.   ave.   poor   ave.       good
ntimes2         good          ave.      good   good   good   --         --
depends         ave.          good      good   poor   poor   ave.       poor
4. MODELS ASSESSMENT
Models tend to produce different values for false positives and false negatives, which
makes the question of the 'best model' ambiguous, particularly without client contact.
Prospect Theory (Kahneman and Tversky, 1979) indicates that people's negative
response to a loss is more pronounced than their positive response to an equal gain;
consider one's feelings on losing a €50 note compared to finding one. Taking this
aspect of human behaviour into account, models with a lower false positive rate (a loan
would be given and money lost as it is not paid back as agreed) may be weighted
more favourably than those with a lower false negative rate (a loan would not be given to
a customer who would have paid it back, and so potential profit is foregone). To finalise
the 'best' model in this instance would require an analysis of the material profit/loss
attached to the loans in question and a cost quantity to represent the client's attitude to
risk. This would guide a weighting of user preference for minimising false negatives or
false positives. Such a weighting could be applied through the evaluateRisk function or
the loss matrices that some of these models facilitate. Due to the lack of information,
they could not be used here.
There is also no information on the ease with which managers can obtain the input
information, or how reliable these inputs tend to be. For example, revol may not be
practical to obtain for every customer. In the case of trees and random forests,
surrogates may be used instead.
Figure 4.1 shows the receiver operating characteristic (ROC) curves for the four models
plotted together, with true positive rate against false positive rate. A good model
arcs high and to the left. In this respect, the random forest, neural network and
support vector machine models are essentially the same, while the simple tree model is
clearly suboptimal. This trend can also be seen in a lift chart.
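The curves in Figure 4.1 can be produced with the ROCR package; a sketch for one model, using the illustrative probability vectors introduced above:

library(ROCR)
pred.obj <- prediction(pred.rf, test$dlq)          # random forest probabilities vs. actual dlq
perf.roc <- performance(pred.obj, "tpr", "fpr")    # true positive rate against false positive rate
plot(perf.roc, col = "red", main = "ROC Curves")   # one curve of Figure 4.1; repeat per model
# performance(pred.obj, "lift", "rpp") gives the corresponding lift chart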
Figure 4.1: ROC Curves
Figure 4.2 shows scatter plots of the four models' predictions for the test set, plotted
against each other. The graphs also show the actual value for each case by colour; the
visible correlations represent the similarity of each pair of models.
Cases plotted in the top left or bottom right are cases on which both models agreed;
if correctly classified, they will be blue for dlq=1 and green for dlq=0.
The simple tree can be clearly identified by the discrete nature of its predictions at each
node. The random forest and neural network appear to be the most correlated pair of
models, with the least variance.
These graphs alone should not be used to determine misclassification rates. They
give an idea; however, the density of cases at the corners of the graphs is crowded,
and so the number of cases there is difficult to read. The true misclassification rates
can be seen in the bar chart in Figure 4.3.
Figure 4.2: Model Comparison Scatter Plots
As explained earlier, the client's attitude towards false negatives and false positives
may be disproportionate due to the prospect effect. It is logical to assume that, to
some degree, a low false positive rate is more desirable than a low false negative rate.
Figure 4.3 displays a bar chart of misclassification rates on the test set for the logistic
regression, tree, random forest, neural network, support vector machine and ensemble
models. The ensemble model, arguably the most complex, performed the best for false
positives; however, it also performed poorly on false negatives. Interestingly, the simplest
model, logistic regression, performed inversely: well on false negatives and poorly on
false positives. The neural network consistently performed worse than its competitors.
The support vector machine was mediocre, as was the simple tree (which also performed
poorly in the ROC analysis).
Figure 4.3: Misclassification Rates on the Test Set
The random forest had the most consistently low misclassification rates. It
is relatively simple to implement, it is particularly well suited to classification, and, as
an ensemble method, it can often negate biases and handle missing data and
outliers well. It also provided a level of insight into variable importance, and it shows
relatively strong correlations with the other models, as seen in Figure 4.2. Although
random forests can be prone to over-fitting and cannot incorporate costs, purely based
on the information provided it would be deemed the most suitable model across
all measures of model quality.
5. CONCLUSION
This report detailed an in-depth analysis of a dataset towards the prediction of loans
becoming 'delinquent'. Principal component analysis, logistic regression,
classification trees, random forests, neural networks and support vector machines
were all implemented. An ensemble model of the latter four was also tested.
The best predictors, in descending order, are the percentage of credit limits utilised –
revol – and the three variables representing the number of times borrowers were late
repaying their loans: times1, ntimeslate and ntimes2. The variables
measuring the number of property loans – norees – and the number of dependents –
depends – appear to be the least valuable predictors.
Although the limitations of this study are clearly defined, based on basic intuition and the
information provided it was deemed that the random forest model was most suitable,
as it consistently performed well against all measures, most likely due to its ability to
represent non-linear relationships while not over-fitting. It is also best placed to deal
with outliers and missing data due to its democratic ensemble nature.
7. APPENDIX
7.1 References
KAHNEMAN, D. & TVERSKY, A. 1979. Prospect Theory: An Analysis of Decision Under Risk. Econometrica, 47, 263-291.
O'BOYLE, N. 2011. Supervised Classification: Variable Importance in CART (Classification and Regression Trees), DCU Redbrick, viewed 19 February 2012, <http://www.redbrick.dcu.ie/~noel/R_classification.html>.
7.2 R Code
# Adding Packages
library("rpart")
library(neuralnet)
library("ROCR")   # prediction object is masked from neuralnet
library(tools)
library("randomForest")
library("caTools")
library("colorspace")
library("Matrix")
library("nnet")
library("gtools")
library("e1071")
library("rattle")   # also loads sandwich, strucchange, vcd
library(car)
library(cluster)
library(maptree)
library("RColorBrewer")
library(modeltools)
library(coin)
library(party)
library(zoo)
search()
? ctree

oData = read.csv("/Users/stephendenham/Dropbox/College/Data Mining/directory/creditData.csv")
cData = oData
nrow(cData)                    # 21326 original cases
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 3943 missing
1-(nrow(cData)/nrow(oData))    # 18.49% missing for MonthlyIncome

cData = oData
nrow(cData)                    # 21326
cData[cData$depends==-999,"depends"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 480 missing
1-(nrow(cData)/nrow(oData))    # 2.25% missing for depends

cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
nrow(oData)-nrow(cData)        # 3943 missing
1-(nrow(cData)/nrow(oData))    # 18.5% missing for depends and MonthlyIncome together
# All missing values of depends are also missing for MonthlyIncome
# 3943 NA's/-999s
na.omited.oData = cData

# Outlier Removal Measuring
cData = oData
nrow(cData)                    # 21326 original
# Insert outlier removal code....................................
cData = subset(cData, cData$depends < 10)   # <5 removes 152; 10 is fine, good distribution
# cData = na.omit(cData)
nrow(oData)-nrow(cData)        # number removed
1-(nrow(cData)/nrow(oData))    # % removed

# Actual Outlier Removal
cData = oData
cData[cData$depends==-999,"depends"] <- NA
cData[cData$MonthlyIncome==-999,"MonthlyIncome"] <- NA
cData = na.omit(cData)
# Outliers picked from boxplots cData = subset(cData, cData$revol < 4) # 5000>>6. 50>>28. 9>>33. 4>>40 cData = subset(cData, cData$times1 < 80) # cData = subset(cData, cData$times1 < 14) # 4 removes 390. 10 removes 359. 14 removes 248 cData = subset(cData, cData$DebtRatio < 14) # 4 removes 390. 10 removes 359. 14 removes 248 cData = subset(cData, cData$MonthlyIncome < 60000) # 100000 removes 6. 50000 removes 37. 14 removes 248 cData = subset(cData, cData$noloansetc < 22) # <22 removes 400. No need to remove any I think, good distribution cData = subset(cData, cData$ntimeslate < 40) # removes 33 cData = subset(cData, cData$norees < 6) # <6 removes 137. No need to remove any I think, good distribution cData = subset(cData, cData$ntimes2 < 8) # 8 removes 1. 5 removes 28. No need to remove any I think, good distribution cData = subset(cData, cData$depends < 10) x = nrow(cData) y = nrow(oData) y-x 1-(x/y) # Percent Removeded # Creating smaller sets for testing #par(mfrow=c(2,5)) #boxplot(cData$nid, main="nid") #boxplot(cData$dlq, main="nid") boxplot(cData$revol, main="revol", col=10) boxplot(cData$age, main="age", col=7) boxplot(cData$times1, main="times1", col=3) boxplot(cData$DebtRatio, main="DebtRatio", col=4) boxplot(cData$MonthlyIncome, main="MonthlyIncome", col=5) boxplot(cData$noloansetc, main="noloansetc", col=6) boxplot(cData$ntimeslate, main="ntimeslate", col="light green") boxplot(cData$norees, main="norees", col="light blue") boxplot(cData$ntimes2, main="ntimes2", col=1) boxplot(cData$depends, main="depends", col="purple") names(cData) #dev.off() c1Data = subset(cData, cData$dlq == 1) #
c0Data = subset(cData, cData$dlq == 0) # set.seed(12345) test_rows = sample.int(nrow(cData), nrow(cData)/3) test = cData[test_rows,] train = cData[-test_rows,] set.seed(12345) otest_rows = sample.int(nrow(na.omited.oData), nrow(na.omited.oData)/3) otest = na.omited.oData[otest_rows,] otrain = na.omited.oData[-otest_rows,] # Creating Data Frames train.Frame = data.frame(train) test.Frame = data.frame(test) # Cleaning Done ################ # Linear model # ################ ? glm fitLR.2 = glm(dlq ~ revol + age + times1 + ntimeslate + norees + ntimes2 + depends, data=train ,family=binomial() ) fitLR.2 summary(fitLR.2) confint(fitLR.2) exp(coef(fitLR.2)) exp(confint(fitLR.2)) predict(fitLR.2, type="response") residuals(fitLR.2, type="deviance") plot(fitLR.2) # predict.glm predict(fitLR.2, test, type = "response") # Optimal Logistic Model fitLR = glm(dlq ~ revol + age + times1 + MonthlyIncome + noloansetc + ntimeslate # + norees + ntimes2, data=train ,family=binomial() ) summary(fitLR) plot(predict(fitLR, test, type = "response")~predict(fitLR.2, test, type = "response")) anova(fitLR,fitLR.2, test="Chisq") # Anova of regression with less data ? anova
####### # PCA # ####### boxplot(cData[2:12,]) # Scale Data names(cData) sData = scale(na.omit(cData[2:12]), center = TRUE, scale = TRUE) sData sData$age age boxplot(sData[2:12, -7]) boxplot(sData[2:12,]) names(cData) boxplot(cData$ntimeslate) sDataPCA = princomp(sData[, c(1:11)], cor=T) # Can't use NAs print(sDataPCA) summary(sDataPCA) round(sDataPCA$sdev,2) plot(sDataPCA, type='l', main="Scree Plot") #Simple PC variance plot. Elbows at PCs 2 & 9 loadings(sDataPCA) # biplot(sDataPCA, main="Biplot") # Difficult to run abline(h=0); abline(v=0) # pairs(cData[2:12], main = "Pairs", pch = 21, bg = c("red", "blue")[unclass(cData$dlq)]) #abline(lsfit(Sepal.Width,Sepal.Width)) #abline(lsfit((setosa$Petal.Length,setosa$Petal.Width), col="red", lwd = 2, lty = 2)) ######### # TREES # ######### # Simpler Tree - CP is fitTree.sim = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train, parms=list(split="gini") ,method = "class" ) draw.tree(fitTree.sim, cex=.8, pch=2,size=2.5, nodeinfo = FALSE, cases = "obs") plot(fitTree.sim,compress=TRUE,uniform=TRUE, branch=0.5) text(fitTree.sim,use.n=T,all=T,cex=.7,pretty=0,xpd=TRUE) draw.tree(fitTree.sim, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs")
plotcp(fitTree) # Main Tree fitTree = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train[,2:12] #,parms=list(split="gini") ,control=rpart.control(cp=0.0010406812 ,control=(maxsurrogate=100) ) ,method = "class" ,parms=list(split="gini" #,loss=matrix(c(0,false.pos.weight,false.neg.weight,0), byrow=TRUE, nrow=2) ) # Loss Matrix ) fitTree fitTree$cp # xError of: 0.4826868 - CP: 0.0010406812 ? plotcp printcp(fitTree) fitTree$parm fitTree$parm$loss # Plotcac draw.tree(fitTree, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs") ? draw.tree ? abline ? plot.rpart... type, extra, plotcp plot(fitTree,compress=TRUE,uniform=TRUE, branch=0.5) text(fitTree,use.n=T,all=T,cex=.7,pretty=0,xpd=TRUE) draw.tree(fitTree, cex=.8, pch=2,size=2.5, nodeinfo = TRUE, cases = "obs") ? draw.tree fitTree.sim$cp fitTree.sim$splits[,1] plot(fitTree.sim$splits[,1]) fitTree.sim$splits false.pos.weight = 10 false.neg.weight = 10 #Min Error CP fitTree.elbow = rpart(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc +
ntimeslate + norees + ntimes2 + depends, data=train ,control=rpart.control(cp=0.0068) ,method = "class" ,parms=list(split="gini" #,loss=matrix(c(0,false.pos.weight,false.neg.weight,0), byrow=TRUE, nrow=2) ) # Loss Matrix ) fitTree.elbow plot(fitTree.elbow,compress=TRUE,uniform=TRUE, branch=0.5) text(fitTree.elbow,use.n=T,all=T,cex=.7,pretty=0,xpd=TRUE) draw.tree(fitTree.elbow, cex=.8, pch=2,size=2.5, nodeinfo = FALSE, cases = "") # CP at the elbow - 0.004701763719512 # Party ? ctree fitTree.party <- ctree(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train[,2:12] #, controls=ctree_control( #stump=TRUE, #maxdepth=3 #) ) plot(fitTree.party, type = "simple" ) fitTree asRules(fitTree) info.gain.rpart(fitTree) ? rpart # Function which determines Variable Importance in rpart. See below. a <- importance(fitTree) summary(a) # NOTE: a different CP (.01) had a better #? rattle # parms=list(prior=c(.5,.5)) ?? Priors? # control=rpart.control(cp=0.0018)) # RATTLE
# ? rattle.print.rpart # Must figure out which is false positive and which is false negetive newdata0 = subset(cData[2:12], dlq==0) newdata1 = subset(cData[2:12], dlq==1) # newdata1 = subset(cData, dlq==0) noPredictions0 = predict(fitTree, newdata0) noPredictions1 = predict(fitTree, newdata1) noPredictions = predict(fitTree, test) max(noPredictions) min(noPredictions) noPredictions0 noPredictions1 correct0 = (noPredictions0 < 0.5) correct0 correct1 = (noPredictions1 > 0.5) correct1 table(correct0) table(correct1) # Confusion matrix? # Still have to do miss class thing # 2nd Lab on Normal Trees # Gini/Information : seems to make no difference # To do # 1. Add loads of missing data and outliers to test for robustness # 2. Add maxsurrogates (end of lab 3) # ??? ################## # Random Forests # ################## train ? randomForest ? randomForest set.seed(12345) fitrf=randomForest(as.factor(train$dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train.Frame, # declared above ntree=1000, type="classification", predicted=TRUE, importance=TRUE, proximity=FALSE # Never run prox as is crashes computer
) fitrf # Var Importance Plot importance(fitrf) varImpPlot(fitrf, main = "Variable Importance", sort = TRUE) varImpPlot(fitrf, class=1, main = "Variable Importance", sort = TRUE) # Looking good ? varImpPlot boxplot(cData$depends) # Partial Dep Plots give graphical depiction of the marginal effect of a variable on the class response (regression) ? partialPlot partialPlot(fitrf, train, age, main="Age Partial Dependency Plot") partialPlot(fitrf, train, revol,main="revol - Partial Dependency Plot", col="red") partialPlot(fitrf, train,MonthlyIncome,main="MonthlyIncome - Partial Dependency Plot") partialPlot(fitrf, train, depends,main="depends - Partial Dependency Plot") # 6 in one... partialPlot(fitrf, train, times1 ,main="Partial Dependency Plot") partialPlot(fitrf, train, add=TRUE,DebtRatio,main="DebtRatio - Partial Dependency Plot", col="blue") partialPlot(fitrf, train,ntimes2, add=TRUE,main="ntimes2 - Partial Dependency Plot", col="pink") partialPlot(fitrf, train, ntimeslate, add=TRUE,main="ntimeslate - Partial Dependency Plot", col="green") partialPlot(fitrf, train, norees, add=TRUE,main="Partial Dependency Plot", col="red") partialPlot(fitrf, train, noloansetc, add=TRUE,main="noloansetc - Partial Dependency Plot", col="light blue") partialPlot(fitrf, train, noloansetc,main="noloansetc - Partial Dependency Plot", col="light blue") partialPlot(fitrf, train,DebtRatio,main="DebtRatio - Partial Dependency Plot", col="blue") # Var Used Barchart Ylabels = c("revol","age","times1","DebtRatio","MonthlyIncome","noloansetc","ntimeslate","norees","ntimes2","depends") Graph = barplot(varUsed(fitrf, count=TRUE), xlab="Times variable used", c(1:14),
col=c("red", "dark red"), horiz=FALSE, space=0.4, width=2, axis=FALSE, axisnames=FALSE) axis(1, at=Graph, las=1, adj=0, cex.axis=0.7, labels = Ylabels) # oposite order of Ylabels ? varUsed # getTree(fitrf, k=1, labelVar=TRUE) # View and individual tree set.seed(54321) fitrf.250=randomForest(as.factor(train$dlq) ~ revol + age + times1 + DebtRatio + MonthlyIncome + noloansetc + ntimeslate + norees + ntimes2 + depends, data=train.Frame, # declared above ntree=250, type="classification", predicted=TRUE, importance=TRUE, proximity=FALSE # Never run prox as is crashes computer ) fitrf.250 set.seed(12345) #fitrf.reg=randomForest(dlq ~ revol + age + times1 + DebtRatio # + MonthlyIncome + noloansetc + ntimeslate # + norees + ntimes2 + depends, # data=train, # ntree=1000, # type="regression", # predicted=TRUE, # importance=TRUE, # proximity=FALSE # ) # set1$similarity <- as.factor(set1$similarity) # Need to work out this prox stuff etc. fitrf names(fitrf) summary(fitrf) fitrf$importance hist(fitrf$importance) fitrf$mtry # mtry = 3 hist(fitrf$oob.times) # Normal Distrition fitrf$importanceSD hist(treesize(fitrf, terminal=TRUE))
boxplot(fitrf$oob.times) plot(fitrf, main = "") ? plot.randomForest getTree(fitrf, k=3, labelVar=FALSE) fitrf$votes margins.fitrf=margin(fitrf,churn) plot(margins.rf) hist(margins.rf,main="Margins of Random Forest for churn dataset") boxplot(margins.rf~data$churn, main="Margins of Random Forest for churn dataset by class") The error rate over the trees is obtained as follows: plot(fit, main="Error rate over trees") MDSplot(fit, data$churn, k=2) # Margins margins.fitrf=margin(fitrf,dlq) plot(margins.fitrf) hist(margins.fitrf, main="Margins of Random Forest for Credit Dataset") boxplot(margins.fitrf~train.Frame$dlq) plot(margins.fitrf~dlq) MDSplot(fitrf, cData$dlq, k=2) # can't do because missing proximity matrix # Rattle Random Forest Stuff treeset.randomForest(fitrf, format="R") # This takes forever # printRandomForests(fitrf) # Making Predictions #predict(fitrf, test[101,]) # outputs a value of either 1 or 0 predict(fitrf, test) # outputs a value of either 1 or 0 print(pred.fitrf <- predict(fitrf, test, votes = TRUE) ) pred.fitrf = predict(fitrf, test, type="prob")[,2] pred.fitrf ############### # Neural Nets # ############### names(cData) sData = scale(na.omit(cData[2:12]), center = TRUE, scale = TRUE) sData # i = inpur # h = hidden layer
# b = bias class.ind(train$dlq) set.seed(12345) nn.lin.values = vector("numeric", 10) for(i in 1:10) { fitnn.lin = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends), train, size=0, skip=TRUE, softmax=FALSE, Hess=TRUE) cat(fitnn.lin$value,"\n") nn.lin.values[i] = fitnn.lin$value } hist(nn.lin.values) plot(nn.lin.values) nn.lin.values eigen(fitnn.lin$Hess) eigen(fitnn.lin$Hess)$values # Main Nnet set.seed(999) fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE, decay=0.1) b=nrow(train) findAIC(fitnn, b, 10) # -12168 (these AICs are not consistent) eigen(fitnn$Hess)$values # usually # ? neuralnet set.seed(999) fitnn.2 = neuralnet( # Took forever to run as.numeric(dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), data = data.matrix(train),##-- act.fct = "logistic", hidden = 1, linear.output = FALSE #, rep = 4, err.fct="sse" ) # as.numeric(dlq) # Can we change this to class/int??? class.ind(train$dlq)
# dlq runs! fitnn.2.2 = neuralnet( as.factor(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), data = data.matrix(train), hidden = 1, linear.output = FALSE #, rep = 4, err.fct="sse" ) # fitnn.2 plot(fitnn.2) # gwplot(fitnn.2, rep="best") nn.CI=confidence.interval(fitnn.2, alpha=0.05) # CI of weights nn.CI$upper nn.CI$lower nn.CI$upper.ci # ? compute # Don't think this scaling is right t = test t = t[-1] t = t[-1] t = t[-10] t = t[-9] names(t) t$revol = scale(t$revol) t$age = scale(t$age) t$times1 = scale(t$times1) t$DebtRatio = scale(t$DebtRatio) t$MonthlyIncome = scale(t$MonthlyIncome) t$noloansetc = scale(t$noloansetc) t$ntimeslate = scale(t$ntimeslate) t$norees = scale(t$norees) t print(pr <- compute(fitnn.2, t)) fitnn.2.pred=pr$net.result #print(pr.2 <- compute(fitnn.2.2, t)) #fitnn.2.2.pred=pr.2$net.result fitnn.2.pred # # ##-- List of probabilites generated by the neuralnet package b = nrow(train) fitnn.0 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees), train, size=0, skip=TRUE, softmax=FALSE, Hess=TRUE) c = 10 # No depends findAIC(fitnn.0, b, c) # -12013.49
? nnet confus.fun(fitnn) eigen(fitnn$Hess) eigen(fitnn$Hess)$values # eigen(fitnn.0$Hess) eigen(fitnn.0$Hess)$values # ooooh, not all positive) - measure stability ??? # Postive definite. All eigenvalues greater than 0 # This is also good fitnn.5 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(times1) + scale(MonthlyIncome) + scale(ntimeslate), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 5 # ?? findAIC(fitnn.5, b, c) # Worst AIC from here on fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 11 # ?? findAIC(fitnn.11, b, c) # -11595.36 fitnn.10.1 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(ntimeslate) + scale(ntimes2) + scale(depends), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 10 # + scale(norees) findAIC(fitnn.10.1, b, c) # -12452.82 fitnn.9 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) + scale(norees) + scale(ntimes2), train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE) c = 9 # No , depends, ntimeslate findAIC(fitnn.9, b, c) # -11872.63 fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) + scale(DebtRatio) +
fitnn.11 = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                  scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                  scale(ntimeslate) + scale(norees) + scale(ntimes2) + scale(depends),
                train, size=1, skip=FALSE, softmax=TRUE, Hess=TRUE)
c = 11 # ??
findAIC(fitnn.11, b, c)
#
# uses test!!!
# Softmax = TRUE requires at least two response categories
# Not always all negative; the seed must be set
# easy because it is already all in numbers. No need for class.ind etc...
# Task: manually fill in examples into a nnet
summary(fitnn)
names(fitnn)
fitnn$terms
fitnn$wts
# Hessian: Hess = TRUE
# Matrix

# AIC
p = nrow(train)
k = 8 # ncol(train)
SSE = sum(fitnn$residuals^2)
AIC = 2*k + p*log(SSE/p)
# SBC

# Different Decays
errorRate = vector("numeric", 100)
DecayRate = seq(.0001, 1, length.out=100)
for(i in 1:100) {
  # set.seed(12345)
  fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
                 scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
                 scale(ntimeslate) + scale(norees) + scale(ntimes2),
               train, size=0, skip=TRUE, decay=DecayRate[i])
  errorRate[i] = sum(fitnn$residuals^2)
  # Could add AIC here
  # Was inverse graph for size = 0
}
errorRate
plot(DecayRate, errorRate, xlab="Decay", ylab="Error", type="l", lwd="2")

# Actual??
fitnn = nnet(class.ind(train$dlq) ~ scale(revol) + scale(age) + scale(times1) +
               scale(DebtRatio) + scale(MonthlyIncome) + scale(noloansetc) +
               scale(ntimeslate) + scale(norees) + scale(ntimes2),
             train, size=1, skip=FALSE, softmax=TRUE, decay=DecayRate[i])

#######
# SVM #
#######

# Small datasets
# plot(cData$age, cData$DebtRatio, col=(cData$dlq+3), pch=(cData$dlq+2))
set.seed(12345)
# attach(cData)

# Regression - Radial
svm.model.reg.rad <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                           noloansetc + ntimeslate + norees + ntimes2 + depends,
                         data = train.Frame,
                         type = "eps-regression", # original used type = "regression"; e1071 expects "eps-regression"
                         kernel = "radial",
                         cost = 100, gamma = 1)

# Regression - Linear
svm.model.reg.lin <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                           noloansetc + ntimeslate + norees + ntimes2 + depends,
                         data = train.Frame,
                         type = "eps-regression",
                         kernel = "linear",
                         cost = 100) # no gamma

# Polynomial
svm.model.pol <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "polynomial",
                     cost = 100, gamma = 1)
# Sigmoid
svm.model.sig <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "sigmoid",
                     probability = FALSE,
                     cost = 100, gamma = 1)

# linear started 5.23. Started 18.41 - 1848
svm.model.lin <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "linear",
                     probability = TRUE,
                     cost = 100) # no gamma for linear

# Radial
svm.model.rad <- svm(dlq ~ revol + age + times1 + DebtRatio + MonthlyIncome +
                       noloansetc + ntimeslate + norees + ntimes2 + depends,
                     data = train.Frame,
                     type = "C-classification",
                     kernel = "radial",
                     probability = TRUE,
                     cachesize = 1000,
                     cost = 130, gamma = 1.3)
# tuning results: gamma 1.3, cost 130 -- 0.2118273

# Smaller sample for tuning
svmtest_rows = sample.int(nrow(cData), 4000)
svmtrain = cData[svmtest_rows,]
svmtest = cData[-svmtest_rows,]
svmtrain = data.frame(svmtrain)
svmtest = data.frame(svmtest)

obj <- tune.svm(dlq ~ ., data = svmtrain,
                gamma = seq(.5, .9, by = .1),
                cost = seq(100, 1000, by = 100))
plot(obj) # obj 0.218031
obj2 <- tune.svm(dlq ~ ., data = svmtrain,
                 gamma = seq(1, 1.5, by = .1),
                 cost = seq(10, 150, by = 10))
plot(obj2)
obj2
# Gamma Cost Best Performance
# 1.1   100 -- 0.2132932
# 1.3   130 -- 0.2118273
# 1.3   80  -- 0.2118273
# 1.3   120 -- 0.2137122

obj3 <- tune.svm(dlq ~ ., data = svmtrain,
                 gamma = seq(.5, 1.5, by = .1),
                 cost = seq(0, 300, by = 10))
plot(obj3)
obj3

sm = svm.model

# change kernels
# svm.model.lin / svm.model... PROBS
x = svm.model.sig
x = svm.model.pol
x = svm.model.lin
x = svm.model

# PLOT.SVM (cData must be attached). maybe test[,3]
detach(test)
nrow(test)
nrow(age)                # NULL: age is a vector, not a data frame
nrow(test$MonthlyIncome) # NULL
attach(cData)
plot(x, data = test.Frame,
     MonthlyIncome ~ age, # age~revol, MonthlyIncome ~ DebtRatio
     svSymbol = 1, dataSymbol = 2, fill = TRUE)
detach(cData)
# attach(cData)

# Outputs
svm.model
names(svm.model)
str(svm.model)
summary(svm.model)

? predict
predict(svm.model, test[101,]) # outputs a value of either 1 or 0
predict(svm.model, test)       # outputs a value of either 1 or 0

# predsvm.lin / predsvm.pol / predsvm.reg.lin / predsvm.reg.rad / predsvm.
predsvm = svm.model.lin
predsvm <- predict(svm.model, test, probability = TRUE)
predsvm
# This bit is needed for classification, not for regression or class-radial
# predsvm = attr(predsvm, "probabilities")[,2] # Converts into probabilities...
plot(predsvm)
# ...for the purposes of comparison with other models and ensemble model creation
predsvm # Are these probabilities?

svs = svm.model$SV
svs
? attr

# Model Evaluation
predsvmlin = attr(predict(svm.model.lin, test, probability = TRUE), "probabilities")[,2]
predsvmrad = attr(predict(svm.model.rad, test, probability = TRUE), "probabilities")[,2]
predsvmsig = attr(predict(svm.model.sig, test, probability = TRUE), "probabilities")[,2]
# predsvmpol    = attr(predict(svm.model.pol, test, probability = TRUE), "probabilities")[,2]
# predsvmregrad = attr(predict(svm.model.reg.rad, test, probability = TRUE), "probabilities")[,2]
# predsvmreglin = attr(predict(svm.model.reg.lin, test, probability = TRUE), "probabilities")[,2]

########
# ROCR #
########

##############
# Evaluation #
##############

# MODELS
fitTree
fitTree.sim
fitTree.sml
fitrf.reg
fitrf
fitnn
fitnn.2
svm.model

# Probs
pred.fitrf
fitnn.2.pred

x1   = predict(fitTree, test)[,2]
x1.2 = predict(fitTree.sim, test)[,2]
x1.3 = predict(fitTree.sml, test)
x2   = pred.fitrf
##-- x2.2 = predict(fitrf.reg, test)
x3   = predict(fitnn, test)
x3   = x3[,2] # Needed for the regression NN, not class (see above); now also needed for softmax
x3.2 = fitnn.2.pred
x4.2 = attr(predict(svm.model.lin, test, probability = TRUE), "probabilities")[,2]
x4.3 = attr(predict(svm.model.rad, test, probability = TRUE), "probabilities")[,2]
x4.4 = attr(predict(svm.model.sig, test, probability = TRUE), "probabilities")[,2]
x4   = x4.3 # Radial C-classification performs best
##-- x4.5 = attr(predict(svm.model.pol, test, probability = TRUE), "probabilities")[,2]
x5   = predict(fitLR, test, type = "response") # Logistic Regression

plot(x1~x4, col=(test$dlq+3), pch=(1), ylab="", xlab="", main = "") # with linear c-classification
##-- x4.2 = predsvm.rad

ensem   = (x1+x2+x3+x4+x5)/4 # note: sums five probabilities but divides by 4; ensem.2 uses the correct divisor
ensem.2 = (x1+x2+x3+x4+x5)/5

dev.off()
par(mfrow=c(3, 2))

# Comparing Model Results
plot(x1~x2, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Random Forest",
     main = "Tree vs. Random Forest")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x3, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Neural Net",
     main = "Tree vs. Neural Net")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x4, col=(test$dlq+3), pch=(1), ylab="Tree", xlab="Support Vector Machine",
     main = "Tree vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x2~x3, col=(test$dlq+3), pch=(1), ylab="Random Forest", xlab="Neural Net",
     main = "Random Forest vs. Neural Net") # Correlated, but x3 has negative values
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x2~x4, col=(test$dlq+3), pch=(1), ylab="Random Forest", xlab="Support Vector Machine",
     main = "Random Forest vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x3~x4, col=(test$dlq+3), pch=(1), ylab="Neural Net", xlab="Support Vector Machine",
     main = "Neural Net vs. Support Vector Machine")
legend("bottomright", c("dlq = 1","dlq = 0"), pch=1, col=c("blue","green"))
abline(a=0, b=0, h=NULL, v=0.5, col=4)
abline(a=0, b=0, h=0.5, v=NULL, col=12)

plot(x1~x1.2, col=(test$dlq+3), pch=(1), main="Tree vs. Other Tree") # Class vs. Regression Random Forests
plot(x2~x2.2, col=(test$dlq+3), pch=(1),
     main="Class Random Forests vs. Regression Random Forests Predictions") # Class vs. Regression Random Forests

plot(x3~x3.2, col=(test$dlq+3), pch=(1), ylab="NN", xlab="NN", main = "Neural Net vs. Neural Net")
# The original call here was malformed (plot(fitnn.2, ..., pch=(1).pred~x3.2, ...)); presumably intended:
plot(fitnn.2.pred~x3.2, col=(test$dlq+3), pch=(1), ylab="NN", xlab="NN", main = "Neural Net vs. Neural Net")

plot(x1~x1.2, col=(test$dlq+3), pch=(1))
plot(x1~x1.3, col=(test$dlq+3), pch=(1))
plot(x1.2~x1.3, col=(test$dlq+3), pch=(1))
dev.off()

plot(x1~ensem, col=(test$dlq+3), pch=(1), main = "Tree vs. Ensemble")
plot(x2~ensem, col=(test$dlq+3), pch=(1), main = "Random Forest vs. Ensemble")
plot(x3~ensem, col=(test$dlq+3), pch=(1), main = "Neural Net vs. Ensemble")
plot(x4~ensem, col=(test$dlq+3), pch=(1), main = "Support Vector Machine vs. Ensemble")

# Confusion Matrices
table(data.frame(predicted=predict(fitTree, test) > 0.5, actual=test[,2]>0.5)) # this works
# that doesn't work, but this does
TREEmat = table(data.frame(predict(fitTree, test)[,2] > 0.5, actual=test[,2]>0.5))
RFmat   = table(data.frame(pred.fitrf > 0.5, actual=test[,2]>0.5))
NNmat   = table(data.frame(predicted=(predict(fitnn, test) > 0.5)[,2], actual=test[,2]>0.5))
SVMmat  = table(data.frame(predicted=predsvm > 0.5, actual=test[,2]>0.5)) # takes probabilities directly from above
LMmat   = table(data.frame(predict(fitLR, test) > 0.5, actual=test[,2]>0.5))
ENSmat  = table(data.frame(predicted=ensem.2 > 0.5, actual=test[,2]>0.5)) # takes probabilities directly from above

mat = ENSmat
fp.rate = mat[1,2]/(mat[1,1] + mat[1,2])
fn.rate = mat[2,1]/(mat[2,1] + mat[2,2])
mc.rate = (mat[1,2]+mat[2,1])/(mat[1,1]+mat[1,2]+mat[2,1]+mat[2,2])
rates = c(fp.rate, fn.rate, mc.rate)
mat
fp.rate
fn.rate
mc.rate
rates
# class.ind gives two columns

# Prediction Probabilities
predTree = predict(fitTree, newdata = test, prob = "class")[,2]
predrf   = predict(fitrf, newdata = test, type = "prob")[,2] # prob = "class" for class membership
prednn   = predict(fitnn, newdata = test, prob = "prob")
predsvm  = predict(svm.model, newdata = test)
##-- predsvm = attr(predict(svm.model, test, probability = TRUE), "probabilities")#[,2] # Converts into probabilities...

# Cannot plot risk without the nominal value of each loan
par(mfrow=c(2, 2))
evalTree = evaluateRisk(predTree, test$dlq)
plotRisk(evalTree$Caseload, evalTree$Precision, evalTree$Recall, show.legend=TRUE)
evalrf = evaluateRisk(predrf, test$dlq)
plotRisk(evalrf$Caseload, evalrf$Precision, evalrf$Recall)
evalnn = evaluateRisk(prednn, test$dlq)
plotRisk(evalnn$Caseload, evalnn$Precision, evalnn$Recall, show.legend=TRUE)
evalsvm = evaluateRisk(predsvm, test$dlq)
plotRisk(evalsvm$Caseload, evalsvm$Precision, evalsvm$Recall)
dev.off()

###########
par(mfrow=c(2, 2))
# Box Plots for Model Evaluation
boxplot(predTree~test$dlq, col = "red", main = "Simple Tree")
boxplot(predrf~test$dlq, col = "green", main = "Random Forest")
boxplot(prednn[,2]~test[,2], col = "purple", main = "Neural Net")          # original title said "Support Vector Machine"
boxplot(predsvm~test[,2], col = "orange", main = "Support Vector Machine") # original title said "Light Blue"
##-- The tree is not looking good here

# Prediction for ROCR
detach("package:neuralnet") # Needed because neuralnet also defines a prediction function
predTree = predict(fitTree.sim, newdata = test, prob = "class")
predicsTree = prediction(predTree[,2], test$dlq)
predicsRF = prediction(predrf, test$dlq)
predicsNN  = prediction(prednn[,2], test$dlq)
predicsSVM = prediction(predsvm, test$dlq)
predicsRF
predicsTree
predicsNN
predicsSVM
# str(predicsRF)
# predicsRF@fp

perfTree = performance(predicsTree, "tpr", "fpr") # ROC Curve
perfrf   = performance(predicsRF, "tpr", "fpr")   # ROC Curve
perfNN   = performance(predicsNN, "tpr", "fpr")   # ROC Curve
perfSVM  = performance(predicsSVM, "tpr", "fpr")  # ROC Curve

# QUICK ROCR CURVES
plot(perfTree, col="red")
plot(perfrf, add=TRUE, col="green")
plot(perfNN, add=TRUE, col="orange")
plot(perfSVM, add=TRUE, col="purple")
legend("bottomright", c("Tree","R. Forest","N. Net","SVM"), pch=1,
       col=c("red","green", "orange", "purple"))

predics = predicsTree # predicsRF, predicsTree, predicsNN, predicsSVM
perfo = performance(predics, "acc")
plot(perfo)
perfo = performance(predics, "tpr", "acc")
plot(perfo)
perfo = performance(predics, "err", "acc")
plot(perfo)
perfo = performance(predics, "lift", "rpp") # lift chart
plot(perfo)
perfo = performance(predics, "tpr", "rpp")
plot(perfo)

perfTree = performance(predicsTree, "lift", "rpp") # Lift chart
perfrf   = performance(predicsRF, "lift", "rpp")   # Lift chart
perfNN   = performance(predicsNN, "lift", "rpp")   # Lift chart
perfSVM  = performance(predicsSVM, "lift", "rpp")  # Lift chart
plot(performance(predics, "lift", "rpp"))
plot(perfTree, col="red")
plot(perfrf, add=TRUE, col="green")
plot(perfNN, add=TRUE, col="orange")
plot(perfSVM, add=TRUE, col="purple")
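# Editor's sketch (my addition, not in the original script): the ROC curves above can be
# summarised with a single AUC value per model using ROCR's "auc" performance measure,
# which makes the models easier to rank than by eye. 'auc.of' is a hypothetical helper name.
auc.of <- function(p) performance(p, "auc")@y.values[[1]]
auc.of(predicsTree)
auc.of(predicsRF)
auc.of(predicsNN)
auc.of(predicsSVM)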
#########################################################################################################
######
# CREATING BINARY DATA
######

# DebtRatio
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$DebtRatio[j] >= 3.5) {
    bData$DebtRatio[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$DebtRatio[j] = 1
    num_with_one = num_with_one + 1
  }
}
bData$DebtRatio
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
boxplot(bData)

# Monthly Income - 0/1
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$MonthlyIncome[j] == '-999') {
    bData$MonthlyIncome[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$MonthlyIncome[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$MonthlyIncome

# ntimeslate
boxplot(bData)
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$ntimeslate[j] == 0) {
    bData$ntimeslate[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$ntimeslate[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$ntimeslate

# revol
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$revol[j] == 0) {
    bData$revol[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$revol[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$revol

# norees
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$norees[j] >= 6) {
    bData$norees[j] = 1 # note: both branches assign 1; one of these was presumably meant to be 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$norees[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$norees

# Depends
j = 0
pc = 0
num_with_zero = 0
num_with_one = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$depends[j] == '-999') {
    bData$depends[j] = 0
    num_with_zero = num_with_zero + 1
  } else {
    bData$depends[j] = 1
    num_with_one = num_with_one + 1
  }
}
num_with_zero
num_with_one
pc = num_with_zero/upto
pc
bData$depends

bData = bData[-6]
bData = bData[-5]
bData = bData[-2]
bData = bData[-6]

# times1
j = 0
pc = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$times1[j] < 80) {
    bData$times1[j] = 0
  } else {
    bData$times1[j] = 1
  }
}
bData$times1

# ntimes2
j = 0
pc = 0
upto = nrow(bData)
for(j in 1:upto) {
  if(bData$ntimes2[j] < 80) {
    bData$ntimes2[j] = 0
  } else {
    bData$ntimes2[j] = 1
  }
}
bData$ntimes2

boxplot(bData)
names(bData)

############################################# plot(cData$times1~cData$revol) #####
# bData
set.seed(12345)
Btest_rows = sample.int(nrow(bData), nrow(bData)/3)
Btest = bData[Btest_rows,]
Btrain = bData[-Btest_rows,]

# Simpler Tree - CP is
fitTree.binary = rpart(dlq ~ age + times1 + noloansetc + ntimeslate + ntimes2 + depends,
                       data = Btrain,
                       parms = list(split="gini"),
                       method = "class")
draw.tree(fitTree.binary, cex=.8, pch=2, size=2.5, nodeinfo = FALSE, cases = "obs")
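# Editor's sketch (my addition, not the original author's code): the per-row recoding loops
# above can be expressed as vectorised one-liners, which is the more idiomatic R form.
# 'bRaw' is a hypothetical fresh copy of the un-recoded data; thresholds mirror the loops above.
bRaw <- cData
bRaw$DebtRatio     <- as.integer(bRaw$DebtRatio < 3.5)       # 0 flags the DebtRatio outliers
bRaw$MonthlyIncome <- as.integer(bRaw$MonthlyIncome != -999) # 0 flags missing income
bRaw$ntimeslate    <- as.integer(bRaw$ntimeslate != 0)
bRaw$revol         <- as.integer(bRaw$revol != 0)
bRaw$times1        <- as.integer(bRaw$times1 >= 80)
bRaw$ntimes2       <- as.integer(bRaw$ntimes2 >= 80)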
#######################
# Univariate Graphing #
#######################
attach(cData)
par(mfrow=c(1,2))

# REVOL
hist(revol, col="red", main="revol", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(revol), col="black", lwd = 2, lty = 1)
lines(density(c1Data$revol), col="red", lwd = 2, lty = 2)
lines(density(c0Data$revol), col="green", lwd = 2, lty = 2)
# legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# norees
hist(norees, col="red", main="norees", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(norees), col="black", lwd = 2, lty = 1)
lines(density(c1Data$norees), col="red", lwd = 2, lty = 2)
lines(density(c0Data$norees), col="green", lwd = 2, lty = 2)
# legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))
dev.off()

attach(cData)
par(mfrow=c(1,2))
names(cData)

# REVOL
hist(revol, col="red", main="revol", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(revol), col="black", lwd = 2, lty = 1)
lines(density(c1Data$revol), col="red", lwd = 2, lty = 2)
lines(density(c0Data$revol), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0","All"), lwd = 3, col=c("red","green",1))

# AGE
hist(age, col="red", main="age", xlab="age", ylab="Density", freq=FALSE)
lines(density(age), col="black", lwd = 2, lty = 1)
lines(density(c1Data$age), col="red", lwd = 2, lty = 2)
lines(density(c0Data$age), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# times1
hist(times1, col="red", main="times1", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(times1), col="black", lwd = 2, lty = 1)
lines(density(c1Data$times1), col="red", lwd = 2, lty = 2)
lines(density(c0Data$times1), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# DebtRatio
hist(DebtRatio, col="red", main="DebtRatio", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(DebtRatio), col="black", lwd = 2, lty = 1)
lines(density(c1Data$DebtRatio), col="red", lwd = 2, lty = 2)
lines(density(c0Data$DebtRatio), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# MonthlyIncome
hist(MonthlyIncome, col="red", main="MonthlyIncome", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(MonthlyIncome), col="black", lwd = 2, lty = 1)
lines(density(c1Data$MonthlyIncome), col="red", lwd = 2, lty = 2)
lines(density(c0Data$MonthlyIncome), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# noloansetc
hist(noloansetc, col="red", main="noloansetc", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(noloansetc), col="black", lwd = 2, lty = 1)
lines(density(c1Data$noloansetc), col="red", lwd = 2, lty = 2)
lines(density(c0Data$noloansetc), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# ntimeslate
hist(ntimeslate, col="red", main="ntimeslate", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(ntimeslate), col="black", lwd = 2, lty = 1)
lines(density(c1Data$ntimeslate), col="red", lwd = 2, lty = 2)
lines(density(c0Data$ntimeslate), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# norees
hist(norees, col="red", main="norees", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(norees), col="black", lwd = 2, lty = 1)
lines(density(c1Data$norees), col="red", lwd = 2, lty = 2)
lines(density(c0Data$norees), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# ntimes2
hist(ntimes2, col="red", main="ntimes2", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(ntimes2), col="black", lwd = 2, lty = 1)
lines(density(c1Data$ntimes2), col="red", lwd = 2, lty = 2)
lines(density(c0Data$ntimes2), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

# depends
hist(depends, col="red", main="depends", xlab="dlq", ylab="Density", freq=FALSE)
lines(density(depends), col="black", lwd = 2, lty = 1)
lines(density(c1Data$depends), col="red", lwd = 2, lty = 2)
lines(density(c0Data$depends), col="green", lwd = 2, lty = 2)
legend("topright", c("dlq = 1","dlq = 0"), pch=1, col=c("red","green"))

detach(cData)

# Univariate summaries
summary(cData)
mean(cData) # recent R versions need sapply(cData, mean) for column means
sd(cData)   # likewise sapply(cData, sd)
var(cData)
summary(cData)
table(cData$dlq)

##############
# Univariate #
##############
par(mfrow=c(1, 1))
for(i in 2:12) {
  boxplot(cData[,i], xlab=names(cData)[i], main=names(cData)[i])
}

# Bivariate
plot(cData$MonthlyIncome~cData$age)
for(j in 2:11) {
  for(i in 3:12) {
    plot(cData[,j]~cData[,i], main="Plot", xlab=names(cData)[i], ylab=names(cData[j]))
  }
}
pairs(cData, col=as.integer(cData$dlq))
? pairs
# This could be creative. Surely older people are better at paying off loans and have higher monthly income.

########################
# Noel OBoyle Function #
########################
importance <- function(mytree) {
  # Calculate variable importance for an rpart classification tree
  # NOTE!! The tree *must* be based upon data that has the response (a factor)
  # in the *first* column
  # Returns an object of class 'importance.rpart'
  # You can use print() and summary() to find information on the result

  delta_i <- function(data, variable, value) {
    # Calculate the decrease in impurity at a particular node given:
    #   data     -- the subset of the data that 'reaches' a particular node
    #   variable -- the variable to be used to split the data
    #   value    -- the 'split value' for the variable
    current_gini <- gini(data[,1])
    size <- length(data[,1])
    left_dataset <- eval(parse(text=paste("subset(data,", paste(variable, "<", value), ")")))
    size_left <- length(left_dataset[,1])
    left_gini <- gini(left_dataset[,1])
    right_dataset <- eval(parse(text=paste("subset(data,", paste(variable, ">=", value), ")")))
    size_right <- length(right_dataset[,1])
    right_gini <- gini(right_dataset[,1])
    # print(paste(" Gini values: current=", current_gini, "(size=", size, ") left=", left_gini,
    #             "(size=", size_left, "), right=", right_gini, "(size=", size_right, ")"))
    current_gini*size - length(left_dataset[,1])*left_gini - length(right_dataset[,1])*right_gini
  }

  gini <- function(data) {
    # Calculate the gini value for a vector of categorical data
    numFactors = nlevels(data)
    nameFactors = levels(data)
    proportion = rep(0, numFactors)
    for (i in 1:numFactors) {
      proportion[i] = sum(data==nameFactors[i])/length(data)
    }
    1 - sum(proportion**2)
  }

  frame <- mytree$frame
  splits <- mytree$splits
  allData <- eval(mytree$call$data)
  output <- ""
  finalAnswer <- rep(0, length(names(allData)))
  names(finalAnswer) <- names(allData)
  d <- dimnames(frame)[[1]]

  # Make this vector of length = the max nodeID
  # It will be a lookup table from frame --> splits
  index <- rep(0, as.integer(d[length(d)]))
  total <- 1
  for (node in 1:length(frame[,1])) {
    if (frame[node,]$var != "<leaf>") {
      nodeID <- as.integer(d[node])
      index[nodeID] <- total
      total <- total + frame[node,]$ncompete + frame[node,]$nsurrogate + 1
    }
  }

  for (node in 1:length(frame[,1])) {
    if (frame[node,]$var != "<leaf>") {
      nodeID <- as.integer(d[node])
      output <- paste(output, "Looking at nodeID:", nodeID, "\n")
      output <- paste(output, " (1) Need to find subset", "\n")
      output <- paste(output, " Choices made to get here:...", "\n")
      data <- allData
      if (nodeID%%2==0) symbol <- "<" else symbol <- ">="
      i <- nodeID%/%2
      while (i>0) {
        output <- paste(output, " Came from nodeID:", i, "\n")
        variable <- dimnames(splits)[[1]][index[i]]
        value <- splits[index[i],4]
        command <- paste("subset(allData,", variable, symbol, value, ")")
        output <- paste(output, " Applying command", command, "\n")
        data <- eval(parse(text=command))
        if (i%%2==0) symbol <- "<" else symbol <- ">="
        i <- i%/%2
      }
      output <- paste(output, " Size of current subset:", length(data[,1]), "\n")

      output <- paste(output, " (2) Look at importance of chosen split", "\n")
      variable <- dimnames(splits)[[1]][index[nodeID]]
      value <- splits[index[nodeID],4]
      best_delta_i <- delta_i(data, variable, value)
      output <- paste(output, " The best delta_i is:", format(best_delta_i, digits=3),
                      "for", variable, "and", value, "\n")
      finalAnswer[variable] <- finalAnswer[variable] + best_delta_i
      output <- paste(output, " Final answer: ", paste(finalAnswer, collapse=" "), "\n")

      output <- paste(output, " (3) Look at importance of surrogate splits", "\n")
      ncompete <- frame[node,]$ncompete
      nsurrogate <- frame[node,]$nsurrogate
      if (nsurrogate>0) {
        start <- index[nodeID]
        for (i in seq(start+ncompete+1, start+ncompete+nsurrogate)) {
          variable <- dimnames(splits)[[1]][i]
          value <- splits[i,4]
          best_delta_i <- delta_i(data, variable, value)
          output <- paste(output, " The best delta_i is:", format(best_delta_i, digits=3),
                          "for", variable, "and", value, "and agreement of", splits[i,3], "\n")
          finalAnswer[variable] <- finalAnswer[variable] + best_delta_i*splits[i,3]
          output <- paste(output, " Final answer: ",
                          paste(finalAnswer[2:length(finalAnswer)], collapse=" "), "\n")
        }
      }
    }
  }
  result <- list(result=finalAnswer[2:length(finalAnswer)], info=output)
  class(result) <- "importance.rpart"
  result
}

print.importance.rpart <- function(self) {
  print(self$result)
}
summary.importance.rpart <- function(self) {
  cat(self$info)
}

# Confusion matrix helper
confus.fun = function(x) { # e.g. x = fitnn
  confus.mat = table(data.frame(predicted=predict(x, test) > 0.5, actual=test[,2]>0.5))
  false.neg = confus.mat[1,2] / (confus.mat[1,2] + confus.mat[1,1])
  false.pos = confus.mat[2,1] / (confus.mat[2,1] + confus.mat[2,2])
  confus.mat
  false.neg
  false.pos
  cat("Confusion Matrix: ", "\n",
      "FALSE NEGATIVE: ", false.neg, "\n",
      "FALSE POSITIVE: ", false.pos, "\n")
}
confus.fun(fitnn)
confus.fun(svm.model)

# AIC helper
findAIC = function(a, b, c) {
  p = b
  k = c # ncol(train)
  SSE = sum(a$residuals^2)
  AIC = 2*k + p*log(SSE/p)
  AIC
  # SBC = p*log(n) + n*log(SSE1/n)
  # SBC
}