Ph.D. Thesis
Credit Scoring Method and System Development for
Imbalanced Datasets
A thesis submitted in fulfillment of the requirements for the degree of
DOCTOR OF SCIENCE AND TECHNOLOGY
By
Xiying Hao
Graduate School of Symbiotic Systems Science and Technology
Fukushima University
2014
ACKNOWLEDGEMENTS
This thesis would not have been possible without the support and guidance of a number of important people, whom I would like to take this opportunity to acknowledge and thank.
First of all, I would like to thank my supervisor, Professor Yanwen Dong, who has been of unwavering support throughout my time at Fukushima University. Without his tutorials and expert knowledge in the field of credit risk modeling, I could not have achieved the work conducted in this thesis.
Special thanks also to Professor Katsushige Fujimoto and Professor Shoichi Nakamura for their helpful discussions and valuable review comments.
I would also like to thank my friends and my parents for their helpful suggestions and
encouragement.
CONTENTS
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 LITERATURE REVIEW
2.1 Credit Scoring
2.1.1 Credit scoring and methodologies
2.1.2 Current research
2.2 Class Imbalance Problem
2.2.1 The problem of imbalanced datasets
2.2.2 Methods for dealing with imbalanced data sets
2.3 Issues and Aims of This Study
CHAPTER 3 METHODOLOGIES
3.1 Classification Techniques
3.1.1 k-nearest neighbors (k-NN)
3.1.2 k-means algorithms
3.1.3 Decision tree (C4.5)
3.1.4 Artificial neural networks
3.2 Ensemble Learning
3.2.1 Bagging
3.2.2 Boosting
3.2.3 Stacking
3.2.4 Random forests
3.3 Learning from Class Imbalance Data Sets
3.3.1 Sampling methods
3.3.2 Cost-sensitive learning
CHAPTER 4 EVALUATION MEASURES
4.1 Sensitivity, Specificity and Geometric Mean
4.2 Type I and Type II Errors
4.3 Integrated Performance Measures
CHAPTER 5 DATA SETS
5.1 Credit Datasets in a Small Company
5.1.1 Credit assessment problem
5.1.2 Features of the customers
5.1.3 Data sets summary
5.2 German Credit Data Set
CHAPTER 6 A TWO-STAGE DATA RESAMPLING METHOD FOR CREDIT SCORING
6.1 Background and Purpose of This Study
6.2 System Design
6.2.1 Scheme system
6.2.2 Training data generating
6.2.3 Learning and classification
6.3 The Application for a Real Credit Scoring Problem
6.4 Performance Comparison
6.5 Concluding Remarks
CHAPTER 7 AN ADAPTIVE AND HIERARCHICAL SYSTEM FOR CREDIT SCORING
7.1 The Purpose of This Study
7.2 The Concept and Scheme of the System
7.3 Systematic Constructing Procedures
7.4 Application to Practical Problem
7.5 System Performance and Discussion
7.5.1 Ability for classification
7.5.2 Ability for prediction
7.6 Comparison with Other Methods
7.6.1 Comparison with neural networks and decision tree
7.6.2 Comparison with the parallel ensemble system
7.6.3 Type I and Type II errors
7.6.4 Comparison with other methods
7.7 Concluding Remarks and Discussion
CHAPTER 8 AN INVESTIGATION INTO THE RELATIONSHIP BETWEEN CLASSIFICATION PERFORMANCE AND DEGREE OF IMBALANCE
8.1 Background and Aims of This Study
8.2 Experimental Design
8.2.1 Training data set generating
8.2.2 Selection of classification techniques
8.2.3 Parameter tuning
8.2.4 Statistical comparison of classifiers
8.3 Experiment Results and Discussion
8.4 Concluding Remarks
CHAPTER 9 CONCLUSIONS
REFERENCES
CHAPTER 1
INTRODUCTION
In today’s increasingly competitive business environment, companies are exposed to many kinds of risk, but the most challenging, and the one most capable of causing a company to fail, is credit risk. Credit risk is most simply defined as the potential that a counterparty will fail to meet its obligations in accordance with agreed terms. Because there are many types of counterparties -- from individuals to sovereign governments -- and many different types of obligations -- from auto loans to derivatives transactions -- credit risk takes many forms.
Recently, credit scoring has emerged as a leading method to assess credit risk. The main
idea of credit scoring is to accurately and efficiently quantify the level of credit risk associated
with the counterparties. The credit scoring model’s objective is to predict future behavior in
terms of credit risk by relying on past experience of counterparties with similar characteristics.
The level of credit risk of a counterparty is associated with the probability that it will fail to meet its obligations. The main task of a credit scoring model is to discriminate between the counterparties who default and those who do not. Discrimination ability is the key indicator of a model's success: the higher the discrimination power, the more precise the credit scoring model will be.
A wide range of classification techniques has already been proposed in the credit scoring
literature, including statistical techniques, such as linear discriminant analysis and logistic
regression, and non-parametric models, such as k-nearest neighbors and decision trees. But
there are still several issues to be addressed.
(1) Availability of models
Current models and methods are applied mainly to companies in the financial community; little of the literature focuses on credit scoring for small and medium enterprises. However, small and medium enterprises (SMEs) play an important role in the economies of many countries all over the world, and they are financially weak: they are easily affected, and even bankrupted, by their partners and the partners' good or bad financial status. Monitoring SME counterparts/customers is therefore a new challenge.
(2) Class imbalance problem
In a credit scoring context, imbalanced data sets frequently occur when there are significantly fewer training instances of one class than of the others, making that class harder to learn correctly. Most importantly, this minority class is usually the one of highest interest, because in real business it represents the actual loss; identifying it correctly is therefore essential to minimizing credit risk.
(3) Strong relationship between performance of models and characteristic of dataset
There are often conflicting opinions when comparing the conclusions of studies promoting different techniques. This is because many empirical studies evaluate only a small number of classification techniques on a single credit scoring data set. The data sets used in these empirical studies are also often far smaller and less imbalanced than those used in practice. Hence, the issue of which classification technique to use for credit scoring, particularly with extremely imbalanced data sets, remains a challenging problem.
In this thesis, we will address these three problems and make some contributions as
follows:
(1) Different from existing research, our study contributes to the literature on small
and medium enterprise credit scoring.
The difficulty of credit assessment for small and medium enterprises is that they cannot obtain financial data and/or other information from their counterparts/customers. Hence, in this thesis we propose new approaches that assess customers' credit based only on daily transaction data. These data can be extracted from the database of a small-business management information system. The proposed approaches are suitable for many organizations whose customers do not disclose financial data, and they are easy to incorporate into existing information systems, so they can be used in a wide variety of firms.
(2) For the issue of imbalanced credit scoring data sets, the aim of our study is to
improve the ability for identifying the minority class.
Most learning algorithms obtain high predictive accuracy over the majority class but predict poorly over the minority class. Furthermore, examples in the minority class can be treated as noise and may be completely ignored by the classifier. How to improve the classification performance on the minority class therefore becomes a new challenge. This thesis presents a two-stage data resampling method and an adaptive and hierarchical system, each aimed at improving the performance on the minority class. In the resampling method, we use the k-means algorithm to under-sample the majority class of customers; then, to avoid losing information, we introduce a pre-classification step to pick up customers of the majority class whose information is not reflected in the under-sampling result. The adaptive and hierarchical system chooses the best method adaptively, based on the accuracy for identifying customers at every credit score.
(3) Carry on an investigation into the relationship between classification
performance and degree of imbalance.
Many studies indicate that the class imbalance problem is actually a relative problem that depends on the degree of class imbalance; however, how the performance of classification techniques is affected by different degrees of class imbalance has not been discussed. In this study, we focus on the performance of classification techniques on data sets with different degrees of imbalance. We set out to compare several techniques under varying degrees of class imbalance and, after comparing their effectiveness, to find the most appropriate technique for each scenario. For this study, the class of bad observations in each training data set was artificially reduced so as to create the different class imbalances; the class (good/bad) distributions ranged from 70/30 to 99/1.
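A sketch of how such artificially imbalanced training sets can be generated. The label coding (1 = good, 0 = bad) and the sampling scheme are our illustrative assumptions, not the thesis's exact procedure.

```python
import random

def make_imbalanced(goods, bads, bad_fraction, seed=0):
    """Keep all good examples and randomly subsample the bad class so
    that bads make up roughly `bad_fraction` of the final training set
    (e.g. 0.30 for a 70/30 split, 0.01 for 99/1)."""
    rng = random.Random(seed)
    # solve n_bad / (n_good + n_bad) = bad_fraction for n_bad
    n_bad = max(1, round(len(goods) * bad_fraction / (1.0 - bad_fraction)))
    kept = rng.sample(bads, min(n_bad, len(bads)))
    data = [(g, 1) for g in goods] + [(b, 0) for b in kept]
    rng.shuffle(data)
    return data

# 700 goods, 300 of the bads kept -> a 70/30 training set
train = make_imbalanced(list(range(700)), list(range(1000, 1300)), 0.30)
```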
The research objectives are addressed in nine chapters, with the current chapter presenting an introduction to the research. The following is the general overview and structure of the thesis.
In Chapter 2, a review of the literature related to credit scoring is given. Section 2.1 introduces some basic theory behind credit scoring, along with current applications of techniques in credit scoring models. Section 2.2 looks at and reviews the issue of imbalanced credit scoring data sets. Finally, based on the problems identified in sections 2.1 and 2.2, the major research aims and contributions of this thesis are given.
In Chapter 3, a brief explanation of each of the techniques applied in this thesis is presented, with citations given to their full derivations. We have summarized these methods and classified them into three categories: classification techniques, such as k-nearest neighbors and the k-means algorithm; ensemble methods, such as bagging and boosting; and methods for dealing with class imbalance problems, such as sampling and cost-sensitive learning.
In Chapter 4, we review several metrics that are commonly used to assess classifier performance. The most commonly used metric is the overall classification rate (i.e., accuracy). However, on an imbalanced data set the overall classification rate is no longer a suitable metric, since the minority class has less effect on accuracy than the majority class. Therefore, other metrics have been developed, such as sensitivity and the geometric mean.
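For instance, sensitivity, specificity and the geometric mean can be computed from the confusion-matrix counts as follows. Here we treat the minority "bad" class, coded 0, as the positive class; the label coding is an assumption for illustration.

```python
from math import sqrt

def imbalance_metrics(y_true, y_pred, positive=0):
    """Sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), and the
    geometric mean sqrt(sensitivity * specificity), which stays low
    unless BOTH classes are classified well."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec, sqrt(sens * spec)
```

Note that a classifier predicting "good" for everyone scores high accuracy on a 99/1 data set but has sensitivity 0, and hence a geometric mean of 0.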
In Chapter 5, the two datasets used in our study are described in detail. A widely used academic data set (the German credit data), obtained from the UCI Repository of Machine Learning Databases, is adopted in Chapter 8. The other data sets were collected from a small company whose main business is selling school uniforms and accessories at wholesale. The company has 20 employees, and its annual sales are about 600 million Japanese yen.
In Chapter 6, we present a new approach that uses both the k-nearest neighbor (k-NN) algorithm and the random forest method to deal with imbalanced data sets in small-business credit assessment. Two types of classifiers are designed. The first, called the preliminary classifier, is constructed using a k-means clustering algorithm based on the test data, in order to preserve as much useful information about the majority-class customers as possible. The second classifier is constructed using the random forest method; it reclassifies the customers that the preliminary classification predicted to belong to the non-majority class, so as to improve the classification performance on the minority class.
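The flow through the two classifiers can be sketched generically. The classifiers below are stand-in callables with made-up thresholds, not the thesis's actual k-means and random forest models.

```python
def two_stage_classify(samples, preliminary, second, majority_label=1):
    """Stage 1: the preliminary classifier screens every customer.
    Stage 2: anyone it does NOT place in the majority (good) class is
    re-scored by the second, stronger classifier."""
    labels = []
    for s in samples:
        label = preliminary(s)
        if label != majority_label:
            label = second(s)  # only doubtful cases reach stage 2
        labels.append(label)
    return labels

# toy stand-ins: a crude threshold screen, then a finer one
prelim = lambda x: 1 if x > 5 else 0
strong = lambda x: 1 if x > 3 else 0
result = two_stage_classify([6, 4, 2], prelim, strong)  # -> [1, 1, 0]
```

Only the samples the first stage flags are passed on, so the expensive second-stage model sees a much smaller, minority-enriched set.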
In Chapter 7, we propose an adaptive and hierarchical system to solve the credit assessment problem, with the emphasis on improving the accuracy for identifying the minority class. The proposed system adaptively chooses the better of neural networks and decision trees based on the accuracy for identifying customers at every credit score. The performance and effectiveness of the proposed system are demonstrated by applying it to the real problems of the company.
In Chapter 8, we set out to compare several techniques that can be used in the analysis of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets occur when the number of examples in one class significantly outnumbers that in the other class, and some techniques may not cope adequately with such data sets. The objective is therefore to compare the performance of a variety of techniques over differing class distributions.
In Chapter 9, we present the conclusions that can be drawn from the research undertaken in this thesis.
The relationships among the chapters are summarized in Figure 1.1.
[Figure 1.1 depicts the thesis as a flow through its chapters: Chapter 1 Introduction; Chapter 2 Literature Review; Chapter 3 Methodologies; Chapter 4 Evaluation Measures (theory introduction); Chapter 5 Data Sets (data description); Chapter 6 A Two-stage Data Resampling Method and Chapter 7 An Adaptive and Hierarchical System (methods for SME credit scoring); Chapter 8 An Investigation into the Relationship between Classification Performance and Degree of Imbalance (general methods); Chapter 9 Conclusions.]
Figure 1.1 The structure of the thesis
CHAPTER 2
LITERATURE REVIEW
2.1 Credit scoring
2.1.1 Credit scoring and methodologies
Credit scoring is a quantitative method for evaluating the credit risk of counterparties. Almost every day, individuals' and companies' records of past borrowing and repayment are stored and analyzed. This information is used to estimate the probability of default, bankruptcy or fraud associated with a company or an individual. When assessing risk, the different kinds of scoring can be roughly summarized, according to context, as follows [1]:
Application scoring: It refers to the assessment of the creditworthiness of new applicants. It quantifies the risks associated with credit requests by evaluating the social, demographic, financial, and other data collected at the time of the application.
Behavioral scoring: It involves principles that are similar to application scoring, with
the difference that it refers to existing customers. As a consequence, the analyst already
has evidence of the borrower’s behavior with the lender. Behavioral scoring models
analyze the consumers’ behavioral patterns to support dynamic portfolio management
processes.
Collection scoring: It is used to divide customers with different levels of insolvency into
groups, separating those who require more decisive actions from those who don’t need to
be attended to immediately. These models are distinguished according to the degree of
delinquency (early, middle, late recovery) and allow a better management of delinquent
customers, from the first signs of delinquency (30–60 days) to subsequent phases and
debt write-off.
Fraud detection: Fraud scoring models rank applicants according to the relative likelihood that an application is fraudulent.
The graphical conceptual framework shown in Figure 2.1 is used to classify credit scoring approaches. The framework is based on a review of current research and books in the credit scoring area [2].
As shown in Figure 2.1, the given framework consists of two levels.
The first level includes three types of credit scoring problem: enterprise credit scoring, individual credit scoring, and small and medium enterprise (SME) credit scoring.
Individual (consumer) credit scoring: It scores a person's credit using variables such as applicant age, marital status and income, and can also include credit bureau variables.
Enterprise credit scoring: The enterprise score is extracted using audited financial account variables and other internal or external, industrial or credit bureau variables.
SME credit scoring: For SMEs, and especially small companies, financial accounts are not reliable, since it is up to the owner to withdraw or retain cash. There are also other issues; for example, small companies are affected by their partners, whose good or bad financial status affects them in turn, so monitoring an SME's counterparts is another way of scoring it [2]. As a matter of fact, small businesses have a major and growing share of the world economy, so SME scoring is a major issue, and it is the one investigated in this thesis.
The second level comprises three types of solutions plus variable selection; they are presented below.
[Figure 2.1 shows the two-level framework: the first level divides credit scoring into individual, enterprise and SME scoring; the second level lists the solution types: single classifier, hybrid approach, ensemble learning and variable selection.]
Figure 2.1 Classification framework for intelligent techniques in credit scoring.
10
Single classifier: Credit scoring is a classification problem that mainly classifies applicants as good or bad. The tested models are mainly statistical methods and artificial intelligence techniques.
Hybrid approaches: The main idea behind hybrid approaches is that different methods have different strengths and weaknesses. Combining methods, to the extent this is possible, lets each cover the weaknesses of the others. There are five different hybrid methods [3].
- Hybrid Algorithms (HA): In these systems, two or more intelligent algorithms are tightly integrated to form a new classification device.
- Clustering and Classificatory devices (CC): These hybrid methods preprocess the financial information on failed and non-failed firms and identify groups based on similarities. The grouping information is used in the subsequent estimation of a classification model.
① Classification + Clustering
Clustering is an unsupervised learning technique, so it cannot distinguish data as accurately as supervised techniques. Therefore, a classifier can be trained first and its output used as input to the clustering step to improve the clustering results. In the case of credit scoring, one can thus cluster good applicants into different groups.
② Clustering + Classification
In this approach, clustering is done first to detect and filter outliers. The remaining, unfiltered data are then used to train the classifier, which may improve the classification result.
③ Classification + Classification
In this approach, the aim of the first classifier is to 'pre-process' the data set for data reduction. That is, the data correctly classified by the first classifier are collected and used to train the second classifier. It is assumed that, for a new testing set, the second classifier will provide better classification results than single classifiers trained on the original datasets.
④ Clustering + Clustering
For the combination of two clustering techniques, the first clustering step is likewise used for data reduction: the data it clusters correctly are used to train the second clustering model, which is assumed to provide better results on a new testing set.
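The outlier-filtering first stage of the ② Clustering + Classification approach can be illustrated with a deliberately simple one-cluster filter. The single-cluster assumption and the distance threshold are illustrative; a real system would cluster the data properly first.

```python
from math import dist
from statistics import mean

def filter_outliers(points, threshold=2.0):
    """Stage 1 of a Clustering + Classification hybrid: treat the data
    as one cluster and drop any point whose distance to the centroid
    exceeds `threshold` times the average distance. The survivors are
    then used to train the classifier."""
    centroid = tuple(mean(c) for c in zip(*points))
    dists = [dist(p, centroid) for p in points]
    cutoff = threshold * mean(dists)
    return [p for p, d in zip(points, dists) if d <= cutoff]

# the (50, 50) point sits far from the rest and gets filtered out
clean = filter_outliers([(0, 0), (1, 0), (0, 1), (1, 1), (50, 50)])
```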
11
Variable selection: Selecting appropriate and more predictive variables is fundamental for credit scoring [4]. Variable selection is the process of selecting the best predictive subset of variables from the original set of variables in a dataset [5]. Methods for selecting variables include stepwise regression, factor analysis, and partial least squares.
Ensemble learning: Ensemble learning aggregates the predictions made by multiple classifiers to improve overall accuracy. Ensemble methods construct a set of classifiers from the training data and predict the classes of test samples by combining the predictions of these classifiers [6]. Types of ensembles include bagging, boosting and stacking.
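As a minimal sketch of the bagging flavor of this idea (the 1-nearest-neighbor base learner and the toy data are illustrative assumptions):

```python
import random
from collections import Counter

def bagging_predict(train, x, base_fit, n_models=11, seed=0):
    """Bagging: fit each base learner on a bootstrap resample of the
    training data, then combine the learners' predictions for x by
    majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # bootstrap: sample len(train) items with replacement
        boot = [train[rng.randrange(len(train))] for _ in train]
        votes.append(base_fit(boot)(x))  # base_fit returns a predictor
    return Counter(votes).most_common(1)[0][0]

# base learner: 1-nearest-neighbor over (feature, label) pairs
def one_nn(data):
    return lambda x: min(data, key=lambda d: abs(d[0] - x))[1]

train = [(0.0, "good"), (0.5, "good"), (1.0, "good"), (10.0, "bad")]
label = bagging_predict(train, 0.2, one_nn)
```

Because each learner sees a different resample, their individual errors tend to differ, and the majority vote averages them out.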
2.1.2 Current research
In this section, a review of current applications of techniques in credit risk modeling is given. Ideas already present in the literature are explored with the aim of highlighting gaps that further research could fill. Table 2.1 provides a selection of techniques currently applied in a credit scoring context.
Table 2.1 Credit scoring techniques and their application.

Statistical methods:
- Linear Discriminant Analysis (LDA): Altman [7], Baesens et al. [8], Desai et al. [9], Karels and Prakash [10], Reichert et al. [11], West [12], Yobas et al. [13]
- Logistic Regression (LOG): Arminger et al. [14], Baesens et al. [8], Desai et al. [9], Steenackers and Goovaerts [15], West [12], Wiginton [16]
- Quadratic Discriminant Analysis (QDA): Altman [7], Baesens et al. [8]
- Multivariate Adaptive Regression Splines (MARS): Friedman [17]

Artificial intelligence techniques:
- Neural Networks (NNs): Altman [18], Arminger et al. [14], Baesens et al. [8], Desai et al. [9], West [12], Yobas et al. [13]
- Decision Tree (DT): Arminger et al. [14], Baesens et al. [8], West [12], Yobas et al. [13], Hung and Chen [19]
- Support Vector Machines (SVM, LS-SVM, etc.): Baesens et al. [8], Schebesch and Stecking [20]
- Case-Based Reasoning (CBR): Buta [21], Shin and Han [22], Dong [23]

Hybrid approach:
- Hybrid Algorithms (HA): Piramuthu [24], Tseng, Lin and Wang [25]
- Clustering and Classificatory devices (CC): Rafiei, Manzati and Bostanian [26]

Variable selection:
- Stepwise regression, factor analysis and partial least squares: Tsai [27], Danenas et al. [28]

Ensemble learning:
- Bagging, Boosting, Stacking: Kim and Kang [29], Tsai and Wu [30]
(1) Statistical methods
Many researchers have developed a variety of traditional statistical methods for credit
scoring, with utilization of linear discriminant analysis (LDA) and logistic regression (LOG)
being the two most commonly used statistical techniques in building credit scoring models.
However, Karels and Prakash [31] and Reichert et al. [32] pointed out that the application of linear discriminant analysis (LDA) has often been challenged owing to the frequently categorical nature of credit data, which conflicts with its distributional assumptions, and to the fact that the covariance matrices of the good and bad credit classes are unlikely to be equal.
In addition to the linear discriminant analysis (LDA) approach, logistic regression (LOG) is another commonly used technique for credit scoring tasks. Logistic regression is a model for predicting the probability of occurrence of an event, making use of several predictor variables that may be either numerical or categorical. The logistic regression model first appeared as a technique for predicting binary outcomes. Logistic regression does not require the multivariate normality assumption; however, it assumes that the dependent variable is related to a full linear combination of the independent variables in the exponent of the logistic function. Thomas [33] and West [12] indicated that both linear discriminant analysis (LDA) and logistic regression (LOG) are intended for the case when the underlying relationships between variables are linear, and hence both are reported to lack sufficient credit scoring accuracy.
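Concretely, the logistic regression score is the logistic function applied to that linear combination. A small sketch follows; the weights are made up for illustration, not estimates from any real credit data set.

```python
from math import exp

def default_probability(x, weights, bias):
    """Logistic regression scoring: p(default | x) = 1 / (1 + e^(-z)),
    where z = bias + w . x, so the linear exponent is mapped into a
    (0, 1) probability."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + exp(-z))

# with z = 0 the model is indifferent: probability exactly 0.5
p = default_probability([2.0, -1.0], [0.5, 1.0], 0.0)  # z = 0 -> p = 0.5
```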
Friedman [17] reported that multivariate adaptive regression splines (MARS) is another commonly discussed classification technique. MARS is widely accepted by researchers for the following reasons. Firstly, MARS can model complex nonlinear relationships among variables without strong model assumptions. Secondly, MARS can capture the relative importance of independent variables to the dependent variable when many potential independent variables are considered. Thirdly, the training process of MARS is simple and hence saves a lot of model-building time, especially when the amount of data is huge. Finally, the resulting MARS model is more easily interpreted than those of other classification techniques; this gives it important managerial and explanatory implications and can help in making appropriate decisions.
(2) Artificial intelligence techniques
Recent studies have revealed that emerging artificial intelligence techniques, such as
decision tree (DT), support vector machine (SVM), genetic algorithm (GA) and artificial
13
neural networks (ANNs) are advantageous to statistical models and optimization technique for
credit risk evaluation. In contrast with statistical methods, artificial intelligence methods do
not assume certain data distributions. These methods automatically extract knowledge from
training samples. According to previous studies, artificial intelligence methods are superior to
statistical methods in dealing with corporate credit risk evaluation problems, especially for
nonlinear pattern classification. Application of aforementioned techniques had been
investigated by several works. Baesens et al. [8] conducted a study for benchmarking of 17
different classification techniques on eight different real-life credit datasets. They used SVM
and least squares SVM (LS-SVM) with linear and radial basis function (RBF) kernels, and adopted a
grid-search mechanism to tune the hyperparameters in their study. Their experimental results
indicated that SVM has the highest average ranking on performance. Schebesch and Stecking
[20] used a standard SVM with linear and RBF kernel for applicant credit scoring and used a
linear-kernel-based SVM to divide a set of labeled credit applicants into the subsets of typical
and critical patterns, which can be used for rejecting applicants. In [34] SVMs were used for
bankruptcy prediction and better accuracy was generated by SVM when compared to other
methods. Gestel et al. [35] used LS-SVM for credit rating of banks, and compared the results
with ordinary least squares, logistic regression (LR) and multilayer perceptron (MLP). Min et
al. [36] proposed methods for improving SVM performance in two aspects: feature subset
selection and parameter optimization. Abdou et al. [37] investigated the ability of neural
networks (NNs), such as probabilistic neural networks (PNN) and multi-layer feed-forward
nets, and traditional techniques such as discriminant analysis, probit analysis and logistic
regression (LR) in evaluating credit risk in Egyptian banks by applying credit scoring models.
The results of their investigation have shown that neural networks (NNs) models have a more
accurate classification rate in comparison with other techniques. Pang and Gong [38] applied
the C5.0 algorithm to credit risk and stated that decision trees (DT) are a good technique for
this kind of problem.
(3) Ensemble learning
Ensemble learning is a machine learning paradigm where multiple learners are trained to
solve the same problem [39]. In contrast to ordinary machine learning approaches that try to
learn one hypothesis from the training data, ensemble methods try to construct a set of
hypotheses and combine them for use. The learners that compose an ensemble are usually
called base learners.
One of the earliest studies on ensemble learning is Dasarathy and Sheela’s research [40],
which discusses partitioning the feature space using two or more classifiers. In 1990, Hansen
and Salamon showed that the generalization performance of an ANNs can be improved using
an ensemble of similarly configured ANNs [41]. Meanwhile, Schapire proved that a strong
classifier in the probably approximately correct (PAC) sense can be generated by combining
weak classifiers through boosting [42], the predecessor of the AdaBoost family of algorithms.
Since these seminal works, studies in ensemble learning have expanded rapidly, appearing
often in the literature under many creative names and ideas [39].
The generalization ability of an ensemble is usually much stronger than that of a single
learner, which makes ensemble methods very attractive [43]. In practice, to achieve a good
ensemble, two necessary conditions should be satisfied: accuracy and diversity [44].
(4) Hybrid methods
At present, hybrid models that synthesize the advantages of various methods have become a
hot research topic. However, there is no clear consensus on how to classify hybrid
models. Generally, they are classified according to the methods used in
the feature selection and classification stages. Based on this idea, Tsai and Chen [3] divided
them into four types: clustering + classification, classification + classification, clustering +
clustering and classification + clustering. They compared four classification techniques
(C4.5, naive Bayes, logistic regression and artificial neural networks) as well as two
clustering methods (k-means and the expectation-maximization (EM) algorithm). The results
showed that EM + LR, LR + ANNs, EM + EM and LR + EM are the optimal combinations of
the four types, respectively.
In recent years, the imbalanced learning problem has received considerable attention in the
credit scoring context. In 2005, the Basel Committee on Banking Supervision [45] highlighted
the fact that calculations based on historical data for low-default assets may "not be sufficiently
reliable" for estimating the probability of default. The reason is that, when there are few
defaulted observations, the resulting estimation is likely to be inaccurate. There is therefore a
need for a better understanding of the appropriate modeling techniques for data sets that
contain a limited number of defaulted observations.
The class imbalance problem is discussed further in the next section.
2.2 Class Imbalance Problem
2.2.1 The problem of imbalanced datasets
In a data set with the class imbalance problem, the most obvious characteristic is the
skewed data distribution between classes. However, theoretical and experimental studies
presented in Refs. [46] [47] and [48] indicate that skewed data distribution is not the only
parameter that influences the modelling of a capable classifier in identifying rare events.
Other influential factors include lack of data and concept complexity.
(1) Imbalanced class distribution
The imbalance degree of a class distribution can be denoted by the ratio of the sample size
of the minority class to that of the majority class. In practical applications, the ratio can be as
drastic as 1:100, 1:1000, or even larger [49]. In Ref. [48], research was conducted to explore
the relationship between the class distribution of a training data set and the classification
performance of decision trees. The study indicates that a relatively balanced distribution
usually attains a better result. However, the imbalance degree at which the class distribution
starts to deteriorate classification performance cannot be stated explicitly, since other factors
such as sample size and separability also affect performance. In some applications, a ratio as low as
1:35 can make some methods inadequate for building a good model, but in some other cases,
1:10 is tough to deal with [50].
(2) Lack of data
One of the primary problems when learning from imbalanced data sets is the associated lack
of data, where the number of samples is small [47]. In a given classification task, the size of
the data set plays an important role in building a good classifier. Lack of examples, therefore,
makes it difficult to uncover regularities within the small classes. Figure 2.2 illustrates the
problem that can be caused by lack of data. Figure 2.2(a) shows the decision
boundary (dashed line) obtained when using sufficient data for training, whereas Figure 2.2(b)
shows the result when using a small number of samples. When there is sufficient data, the
estimated decision boundary (dashed line) approximates the true decision boundary
(solid line) well; whereas, if there is a lack of data, the estimated decision boundary can be very
far from the true boundary. In fact, it has been shown that as the size of the training set increases, the
error rate caused by imbalanced training data decreases [46]. Weiss and Provost conducted
experiments on twenty-six data sets taken from the UCI repository to investigate the
relationship between the degree of class imbalance and training set sizes [48]. They showed
that when more training data become available, the classifiers are less sensitive to the level of
imbalance between classes. This suggests that with a sufficient amount of training data, the
classification system may not be affected by the high imbalance ratio.
(3) Concept complexity
Concept complexity is an important factor in a classifier's ability to deal with imbalanced
problems. Concept complexity corresponds to the level of separability of the classes within
the data. Japkowicz and Stephen reported that for simple data sets that are linearly separable
(as Figure 2.3 shows), classifier performance is not susceptible to any amount of imbalance
[46].
Indeed, as the degree of data complexity increases, the class imbalance factor starts
impacting the classifier generalization ability. High complexity refers to inseparable data sets
with highly overlapped classes, complex boundaries and high noise level. When samples of
different classes overlap in the feature space, finding the optimum class boundary becomes
hard (see Figure 2.4). In fact, most accuracy-driven algorithms are biased toward the majority
class; that is, they improve the overall accuracy by assigning the overlapped area to the
majority class, and ignore the minority class or treat it as noise.
Figure 2.2 The effect of lack of data on the class imbalance problem; the solid line represents
the true decision boundary and the dashed line the estimated decision boundary:
(a) sufficient data; (b) insufficient data.
Figure 2.3 Linearly separable data
The class imbalance problem is more significant when the data sets have a high level of
noise. Noise in data sets can emerge from various sources: data samples may be poorly
acquired or incorrectly labeled, or the extracted features may not be sufficient for classification.
It is known that noisy data affect many machine learning algorithms; however, Weiss showed that
noise has an even more serious impact when learning from imbalanced data. The problem occurs
when samples from the minority class are mistakenly included in the training data for the
majority class, and vice versa. For the majority class it takes only a few noise samples to
challenge is how to train a classifier that correctly recognizes samples of different classes with
high accuracy.
2.2.2 Methods for dealing with imbalanced credit scoring data sets
A wide range of different classification techniques for scoring credit data sets has been
proposed in the literature, a non-exhaustive list of which was provided earlier. In addition,
some benchmarking studies have been undertaken to empirically compare the performance of
these various techniques [8], but they did not focus specifically on how these techniques
compare on heavily imbalanced data sets, or to what extent any such comparison is affected
by the issue of class imbalance. For example, in Baesens et al. [8], seventeen techniques,
including both well-known techniques such as logistic regression and discriminant analysis
and more advanced techniques such as least square support vector machines were compared
on eight real-life credit scoring data sets. Although more complicated techniques such as
radial basis function least square support vector machines (RBF LS-SVM) and neural
networks (NN) yielded good performances in terms of the area under the ROC curve (AUC),
simpler linear classifiers such as linear discriminant analysis (LDA) and logistic regression
(LOG) also gave very good performances.
Figure 2.4 Overlapping data
However, there are often conflicting opinions when comparing the conclusions of studies
promoting differing techniques. For example, Yobas et al. [13] found that linear discriminant
analysis (LDA) outperformed neural networks in the prediction of loan default, whereas Desai
et al. [9] reported that neural networks actually perform significantly better than LDA.
Furthermore, many empirical studies only evaluate a small number of classification
techniques on a single credit scoring data set. The data sets used in these empirical studies are
also often far smaller and less imbalanced than those data sets used in practice. Hence, the
issue of which classification technique to use for credit scoring, particularly with a small
number of bad observations, remains a challenging problem.
In more recent work on the effects of class distribution on the prediction of the probability
of default (PD), Crone and Finlay [51] found that under-sampled data sets are inferior to
unbalanced and over-sampled data sets. However, it was also found that the larger the sample
size used, the less significant the differences between the balancing methods were. Their
study also incorporated a variety of data mining techniques, including logistic
regression, classification and regression trees, linear discriminant analysis and neural
networks. From the application of these techniques over a variety of class balances it was
found that logistic regression was the least sensitive to balancing. This piece of work is
thorough in its empirical design; however, it does not assess more novel machine learning
techniques in the estimation of default.
In Yao [52], hybrid SVM-based credit scoring models are constructed to evaluate an
applicant's score from the applicant's input features. This paper shows the implications of
using machine learning based techniques in a credit scoring context on two widely used credit
scoring data sets (Australian credit and German credit) and compares the accuracy of this
model against other techniques (LDA, logistic regression and NN). Their findings suggest that
the SVM hybrid classifier has the best scoring capability when compared to traditional
techniques. Although this is a non-exhaustive study with a bias towards the use of
RBF-SVMs, it gives a clear basis for the hypothetical use of SVMs in a credit scoring context.
In Kennedy [53], the suitability of one-class and supervised two-class classification
algorithms as a solution to the low-default portfolio problem is evaluated. This study
compares a variety of well established credit scoring techniques (e.g. LDA, LOG and k-
nearest neighbor) against the use of a linear kernel SVM. Nine banking data sets are utilized
and class imbalance is artificially created by removing 10% of the defaulting observations
from the training set after each run. The only issue with this process is that the data sets are
comparatively small in size (ranging from 125 to 5397 observations), which leads this author
to believe that a process of k-fold cross-validation would have been more applicable,
considering the size of the data sets after the training, validation and test set splits are made.
As more class imbalance is
induced, it is shown that logistic regression performs significantly better than Lin-SVM, QDC
(quadratic discriminant classifier) and k-NN. It is also shown that over-sampling produces
no overall improvement in the best-performing two-class classifiers. The findings in this paper
lead into the work that will be conducted in this thesis, as several similar techniques and
datasets will be employed, alongside the determination of classifier performance on
imbalanced data sets.
The topic of which good/bad distribution is the most appropriate in classifying a data set
has been discussed in some detail in the machine learning and data mining literature. Weiss
and Provost [48] found that the naturally occurring class distribution in the twenty-six
data sets they examined often did not produce the best-performing classifiers. More specifically,
based on the AUC measure (which was preferred over the use of the error rate), it was shown
that the optimal class distribution should contain between 50% and 90% minority class
examples within the training set. Alternatively, a progressive adaptive sampling strategy for
selecting the optimal class distribution is proposed in Provost et al. [54]. Whilst this method
of class adjustment can be very effective for large data sets, with an adequate number of
observations in the minority class of defaulters, in some imbalanced data sets there are only a
very small number of loan defaults to begin with.
Various kinds of techniques have been compared in the literature to try and ascertain the
most effective way of overcoming a large class imbalance. Chawla et al. [55] proposed the
Synthetic Minority Over-sampling Technique (SMOTE), which was applied to example data
sets in fraud, telecommunications management, and the detection of oil spills in satellite images.
In Japkowicz [56] over-sampling and downsizing were compared to the author’s own method
of “learning by recognition” in order to determine the most effective techniques. The findings,
however, were inconclusive but demonstrated that both over-sampling the minority class and
downsizing the majority class can be very effective. Subsequently, Batista et al. [57] identified
ten alternative techniques to deal with class imbalance and trialled them on thirteen data sets.
The techniques chosen included a variety of under-sampling and over-sampling methods.
Their findings suggested that over-sampling methods generally provide more accurate results
than under-sampling methods. A combination of either SMOTE and Tomek links or
SMOTE and ENN (a nearest-neighbor cleaning rule) was also proposed.
2.3 Issues and the aims of this study
Although a lot of significant classification methods can be used to assess credit risk, there
are still several issues to be addressed.
(1) According to Sadatrasoul et al. [58], current credit scoring techniques are mostly applied
to individual credit scores, and there is inadequate research on credit scoring for
enterprises and small and medium-sized enterprises (SMEs); research on SME credit
scoring accounts for only about 2% of the literature.
(2) The stacking strategy of ensemble learning, which combines different kinds of
classification algorithms, not only inherits advantages from the different classifiers
but also inevitably suffers from their disadvantages (Hung et al. [59]; Witten &
Frank [60]). Therefore, the performance of this strategy is not always better
than that of an individual classifier. Moreover, most hybrid models that combine
different classifiers are structured in parallel, based on a voting strategy.
(3) Data sampling is an approach to producing a more balanced learning data set. Because
the under-sampling method extracts a smaller set of majority instances, some
information about the majority class is lost. It is also very difficult to determine the
correct class distribution for a learning algorithm, or a re-sampling strategy that avoids
losing information about the majority class under under-sampling and overfitting the
minority class under over-sampling.
(4) In the literature, relatively little attention has been paid to data sets that can be
considered very low risk, or imbalanced, in particular with regard to which
techniques are most appropriate for scoring them (Benjamin et al. [61]). The
underlying problem with imbalanced data sets is that there are significantly fewer
training instances of one class compared to other classes. A large class imbalance is
therefore present which some techniques may not be able to successfully handle
(Benjamin et al. [61]). In a recent FSA publication regarding conservative estimation
of imbalanced data sets, regulatory concerns were raised about whether companies can
adequately assess the risk of imbalanced credit scoring data sets. A wide range of
classification techniques has already been proposed in the credit scoring literature, but
it is currently unclear from the literature which techniques are the most appropriate for
improving discrimination on imbalanced credit scoring data sets.
In order to address the problems described above, the contributions of this study are
organized as follows:
(1) Because there is little literature on SME credit scoring, our study mainly focuses on
small companies. Two novel systems are proposed to solve a real small-business credit
assessment problem based on feature data such as sales, payments from customers, and
so on.
(2) Different from existing hybrid approaches structured in parallel, we propose an adaptive
and hierarchical system which can inherit the advantages and avoid the disadvantages
of different classification techniques.
(3) When using under-sampling methods to re-sample instances of the majority class, it is
unavoidable that some useful information about the majority class is lost. In order to
reduce this information loss, we propose a two-stage data re-sampling method to reduce
the sample size of the majority class.
(4) We address the issue of imbalanced data sets. Whereas other studies have benchmarked
several scoring techniques, our study explicitly looks at the problem of having to build
models on potentially highly imbalanced data sets. The data sets are collected from a
small company in which the number of insolvent customers is much lower than the
number of healthy ones. The other data sets used in our study, derived from the German
credit data set, are created with a range of class distributions by altering the percentage
of bad observations in the original training data.
CHAPTER 3
METHODOLOGIES
3.1 Classification Techniques
3.1.1 k-nearest neighbors
One common classification scheme based on the use of distance measures is that of the k-
nearest neighbor. The k-nearest neighbor technique assumes that the entire sampling set
includes not only the data in the set, but also the desired classification for each item. When a
classification is to be made for a new item, its distance to each item in the sampling set must
be computed. Only the k closest entries in the sampling set are considered further. The new
item is then classified to the class that contains the most items from this set of k closest items.
Figure 3.1 shows an example of a 5-NN classifier which consists of three categories w1, w2
and w3. xu is the new unlabeled input data point to be classified in the testing stage.
According to Figure 3.1, the value of parameter k is 5 and the Euclidean distance formula
has been used to calculate the distance between the training data points and the testing data
point xu. Among the five nearest neighbors of xu, four belong to category w1 and one
belongs to category w3. Hence, xu is classified as category w1 by the k-nearest neighbor
(k-NN) classifier [62].
Figure 3.1 Feature space of a 5-NN classifier with three categories (w1, w2 and w3); xu is
the point to be classified.
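The classification rule described above can be sketched in a few lines of NumPy. This is a minimal illustration only, not the implementation used in this thesis; the toy points and labels are invented to mirror the situation of Figure 3.1 (four of the five nearest neighbors belong to class w1, one to w3):

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest
    training points, using Euclidean distance."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]               # indices of the k closest samples
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # most frequent class among them

# Toy data mirroring Figure 3.1: three classes labelled 1 (w1), 2 (w2), 3 (w3)
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2],   # class w1
              [3.0, 3.0], [3.1, 2.9],                            # class w2
              [0.4, 0.0], [2.8, 0.1]])                           # class w3
y = np.array([1, 1, 1, 1, 2, 2, 3, 3])
print(knn_predict(X, y, np.array([0.2, 0.2]), k=5))   # -> 1, i.e. class w1
```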
3.1.2 k-means algorithm
k-means is one of the most popular clustering algorithms. The user needs to specify the
number k of clusters in advance. The algorithm randomly selects k objects as the initial
cluster means or centers. It works towards optimizing the square error criterion function, which is
defined as:
E = \sum_{i=1}^{k} \sum_{x \in C_i} \| x - mc_i \|^2 ,    (3.1)

where mc_i is the mean of cluster C_i (i = 1, 2, ..., k) and x represents a sample.
The main steps of the k-means algorithm are
(1) Assign initial means mci (i=1,2,…,k).
(2) Assign each data object x to the cluster Ci with the closest mean.
(3) Compute new mean for each cluster.
(4) Iterate until the criterion function converges; that is, there are no more new assignments.
The k-means algorithm has the advantages of fast clustering and easy implementation.
However, the number k of clusters must be fixed in advance, and the initial cluster centers
are chosen stochastically, which may bring instability to the result. Hence, improving the
quality and stability of the cluster analysis is of high value.
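The steps above can be sketched as follows. This is a minimal NumPy version for illustration; the toy data set and the `seed` argument are assumptions of the sketch, not part of the algorithm description:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: assign each point to its nearest mean (step 2),
    recompute each cluster mean (step 3), repeat until stable (step 4)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), k, replace=False)]   # step (1): random centers
    for _ in range(n_iter):
        # step (2): each x goes to the cluster with the closest mean
        labels = np.argmin(((X[:, None] - means[None]) ** 2).sum(-1), axis=1)
        # step (3): recompute the mean of each cluster
        new_means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_means, means):             # step (4): convergence
            break
        means = new_means
    sse = ((X - means[labels]) ** 2).sum()            # criterion (3.1)
    return labels, means, sse

X = np.array([[0., 0.], [0., 1.], [10., 10.], [10., 11.]])
labels, means, sse = kmeans(X, k=2)
print(labels, sse)   # the two obvious clusters; SSE = 1.0
```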
3.1.3 Decision tree (C4.5)
A decision tree consists of internal nodes that specify tests on individual input variables or
attributes that split the data into smaller subsets, and a series of leaf nodes assigning a class to
each of the observations in the resulting segments. For our study, we chose the popular
decision tree classifier C4.5, which builds decision trees using the concept of information
entropy [63]. The entropy of a sample S of classified observations is given by

Entropy(S) = -p_1 \log_2(p_1) - p_0 \log_2(p_0) ,    (3.2)
where p1(p0) are the proportions of the class values 1(0) in the sample S, respectively. C4.5
examines the normalised information gain (entropy difference) that results from choosing an
attribute for splitting the data. The attribute with the highest normalised information gain is
the one used to make the decision. The algorithm then recurs on the smaller subsets.
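Eq. (3.2) and the gain computation can be illustrated with a short example. The class counts below (a 9/5 parent split into branches of 8 and 6 samples) are hypothetical numbers chosen for illustration, not taken from the thesis data:

```python
import math

def entropy(p1):
    """Entropy of a binary sample with proportion p1 of class 1 (Eq. 3.2)."""
    p0 = 1.0 - p1
    return -sum(p * math.log2(p) for p in (p0, p1) if p > 0)

# A candidate split of 14 samples (9 positive, 5 negative) into two
# branches of 8 samples (6 positive) and 6 samples (3 positive):
parent = entropy(9 / 14)
children = (8 / 14) * entropy(6 / 8) + (6 / 14) * entropy(3 / 6)
gain = parent - children                 # information gain of the split
gain_ratio = gain / entropy(8 / 14)      # normalised by the split entropy, as C4.5 does
print(round(gain, 3), round(gain_ratio, 3))   # -> 0.048 0.049
```

C4.5 would evaluate this quantity for every candidate attribute and split on the one with the highest normalised gain.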
3.1.4 Artificial neural network
Artificial neural network (ANN) is a system based on the operation of biological neural
networks; in other words, it is an emulation of a biological neural system. The key element of this
paradigm is the novel structure of the information processing system [64]. It is composed of a
large number of highly interconnected processing elements working in unison to solve
specific problems.
The most common type of neural network consists of three layers of units: an input layer, a
hidden layer, and an output layer. It is called the multilayer perceptron (MLP) [65]. A layer of
“input” units is connected to a layer of “hidden” units, which is connected to a layer of
“output” units. The activity of each input layer represents the raw information that is fed into
the network. The activity of each hidden unit is determined by the activities of the input units
and the weights on the connections between the input and the hidden units. The behavior of
the output units depends on the activity of the hidden units and the weights between the
hidden and output units. Figure 3.2 shows an example of three-layer neural network including
input, output, and one hidden layer.
Advantages of neural networks include their strong learning ability and the absence of
assumptions about the relationships between input variables. However, they also have some
drawbacks. A major disadvantage of neural networks lies in their poor interpretability:
because of their "black box" nature, it is very difficult for an ANN to provide an explicit
knowledge representation. A second problem is how to design and optimize the network
topology, which is a very complex experimental process.
Figure 3.2 A three-layer neural network with input, hidden and output layers.
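The forward computation described above, in which the hidden activity is determined by the input activities and the input-hidden weights, can be sketched as follows. The random weights are for illustration only; a trained network would learn W1 and W2 from data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    """One forward pass through a three-layer network:
    hidden activity = f(W1 x + b1), output = f(W2 h + b2)."""
    h = sigmoid(W1 @ x + b1)       # hidden units: driven by inputs and W1
    return sigmoid(W2 @ h + b2)    # output units: driven by hidden units and W2

rng = np.random.default_rng(0)
x = rng.normal(size=4)                          # 4 input units
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # 3 hidden units
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)   # 1 output unit
score = mlp_forward(x, W1, b1, W2, b2)
print(score.item())   # a probability-like output in (0, 1)
```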
3.2 Ensemble learning
An ensemble of classifiers is a collection of several classifiers whose individual decisions
are combined in some way to classify the test examples [66]. It is known that an ensemble
often shows much better performance than the single classifiers that make it up.
3.2.1 Bagging
Bagging, short for bootstrap aggregating, is considered one of the earliest ensemble schemes
[67]. Bagging is intuitive but powerful, especially when the data size is limited. Bagging
generates a series of training subsets by random sampling with replacement from the original
training set. Then the different classifiers are trained by the same classification algorithm with
different training subsets. When a certain number of classifiers are generated, these
individuals are combined by the majority voting scheme. Given a testing instance, different
outputs will be given from the trained classifiers, and the majority will be considered as the
final decision.
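A minimal sketch of this procedure, using scikit-learn decision trees as the base classifiers (an assumed choice for illustration; any classification algorithm could fill this role):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner

def bagging_fit(X, y, n_estimators=15, seed=0):
    """Train n_estimators base classifiers, each on a bootstrap sample
    (random sampling with replacement) of the original training set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap indices
        models.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the individual outputs by majority voting (0/1 labels)."""
    votes = np.stack([m.predict(X) for m in models])   # (n_models, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)

X = np.array([[0.], [1.], [2.], [10.], [11.], [12.]])
y = np.array([0, 0, 0, 1, 1, 1])
models = bagging_fit(X, y)
print(bagging_predict(models, X))
```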
3.2.2 Boosting
The AdaBoost family of algorithms, also known as boosting, is another category of
powerful ensemble methods [68]. It explicitly alters the distribution of the training data fed to
each individual classifier, specifically by assigning a weight to each training sample. Initially
the weights are uniform across all training samples. During the boosting procedure, they are
adjusted after the training of each classifier is completed: the weights of misclassified samples
are increased, while those of correctly classified samples are decreased. The final ensemble is
constructed by combining the individual classifiers according to their own accuracies.
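The weight-update scheme can be sketched as follows. This is a simplified AdaBoost using decision stumps as weak learners and labels assumed to be -1/+1; it omits refinements found in production implementations:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # depth-1 trees as weak learners

def adaboost_fit(X, y, n_rounds=10):
    """Simplified AdaBoost: y in {-1, +1}. Weights start uniform;
    misclassified samples are up-weighted, correct ones down-weighted."""
    w = np.full(len(X), 1.0 / len(X))            # initially uniform weights
    models, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()                 # weighted error of this round
        if err == 0:                             # perfect weak learner: stop
            models.append(stump); alphas.append(1.0)
            break
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / err)    # classifier's accuracy weight
        w = w * np.exp(-alpha * y * pred)        # raise weights of mistakes
        w = w / w.sum()
        models.append(stump); alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    """Accuracy-weighted vote of the individual classifiers."""
    return np.sign(sum(a * m.predict(X) for m, a in zip(models, alphas)))

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([-1, -1, 1, 1])
models, alphas = adaboost_fit(X, y)
print(adaboost_predict(models, alphas, X))
```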
3.2.3 Stacking
Stacking is another popular and general ensemble learning method that uses a high-level
learner to combine lower-level base learners to achieve greater predictive accuracy [69].
It builds an ensemble by using different classification algorithms. The simplest way to
combine classification results from different classifiers is by voting. However, this policy may
inherit advantages from some classifiers and disadvantages from others simultaneously.
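The idea, using out-of-fold predictions of the base learners as inputs to a high-level logistic regression, can be sketched as follows. The choice of base learners and the synthetic data set are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=300, random_state=0)
base = [DecisionTreeClassifier(random_state=0), KNeighborsClassifier()]
# Level 0: out-of-fold class probabilities of each base learner become
# the meta-features (out-of-fold to avoid leaking training labels)
meta_X = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in base
])
# Level 1: a high-level learner combines the base learners' outputs
meta = LogisticRegression().fit(meta_X, y)
print(meta.score(meta_X, y))
```

For deployment, each base learner would also be refitted on the full training set so that new samples can be passed through both levels.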
3.2.4 Random forests
Random forests are defined as a group of un-pruned classification or regression trees,
trained on bootstrap samples of the training data using random feature selection in the process
of tree generation. After a large number of trees have been generated, each tree votes for the
most popular class. These tree voting procedures are collectively defined as random forests. A
more detailed explanation of how to train a random forest can be found in Breiman [70]. For
the random forests classification technique, two parameters require tuning: the number of
trees and the number of attributes used to grow each tree.
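A short sketch of tuning these two parameters by cross-validation (the data set and candidate values are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
# The two parameters highlighted in the text: number of trees (n_estimators)
# and number of attributes tried at each split (max_features)
best = None
for n_trees in (50, 200):
    for m_try in (2, 3):
        rf = RandomForestClassifier(n_estimators=n_trees, max_features=m_try,
                                    random_state=0)
        score = cross_val_score(rf, X, y, cv=5).mean()
        if best is None or score > best[0]:
            best = (score, n_trees, m_try)
print(best)   # (cv accuracy, n_trees, m_try) of the best configuration
```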
3.3 Learning from class imbalance data sets
A number of solutions to the class imbalance problem have been proposed at both the
data and algorithmic levels [71]. At the data level, these solutions include many different
forms of re-sampling such as random over-sampling with replacement, random under-
sampling, directed over-sampling (in which no new examples are created, but the choice of
samples to replace is informed rather than random), directed under-sampling (where, again,
the choice of examples to eliminate is informed), over-sampling with informed generation of
new samples, and combinations of the above techniques. At the algorithmic level, solutions
include adjusting the costs of the various classes so as to counter the class imbalance,
adjusting the probabilistic estimate at the tree leaf (when working with decision trees),
adjusting the decision threshold, and recognition-based (i.e., learning from one class) rather
than discrimination-based (two class) learning.
The most effective techniques to deal with imbalanced data sets include sampling and
cost-sensitive learning.
3.3.1 Sampling Methods
An easy data-level method for balancing the classes consists of re-sampling the original
data set, either by over-sampling the minority class or by under-sampling the majority class,
until the classes are approximately equally represented. Both strategies can be applied in any
learning system, since they act as a preprocessing phase, allowing the learning system to
receive the training instances as if they belonged to a well-balanced data set. Thus, any bias of
the system towards the majority class due to the different proportion of examples per class
would be expected to be suppressed.
28
Hulse et al. [72] suggest that the utility of re-sampling methods depends on a number of
factors, including the ratio between positive and negative examples, other characteristics of
the data, and the nature of the classifier. However, re-sampling methods have shown important
drawbacks. Under-sampling may throw out potentially useful data, while over-sampling
artificially increases the size of the data set and consequently, worsens the computational
burden of the learning algorithm.
(1) Over-sampling
The simplest method to increase the size of the minority class is random over-sampling,
that is, a non-heuristic method that balances the class distribution through the random
replication of positive examples. Nevertheless, since this method replicates existing
examples in the minority class, overfitting is more likely to occur.
(2) Under-sampling
Under-sampling is an efficient method for class imbalance learning. This method uses a
subset of the majority class to train the classifier. Since many majority class examples are
ignored, the training set becomes more balanced and the training process becomes faster. The
most common preprocessing technique is random majority under-sampling (RUS), in which
instances of the majority class are randomly discarded from the data set. The main drawback
of under-sampling is that potentially useful information contained in the discarded examples
is lost.
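Both strategies can be sketched in a few lines. These are minimal NumPy versions for illustration, assuming labels 0 for the majority class and 1 for the minority class:

```python
import numpy as np

def random_oversample(X, y, minority=1, seed=0):
    """Replicate randomly chosen minority examples until classes are equal."""
    rng = np.random.default_rng(seed)
    mino, majo = np.where(y == minority)[0], np.where(y != minority)[0]
    extra = rng.choice(mino, size=len(majo) - len(mino), replace=True)
    idx = np.concatenate([majo, mino, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority=1, seed=0):
    """Randomly discard majority examples down to the minority size (RUS)."""
    rng = np.random.default_rng(seed)
    mino, majo = np.where(y == minority)[0], np.where(y != minority)[0]
    keep = rng.choice(majo, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

y = np.array([0] * 8 + [1] * 2)                  # 8 majority, 2 minority
X = np.arange(10, dtype=float).reshape(-1, 1)
Xo, yo = random_oversample(X, y)
print(np.bincount(yo))   # -> [8 8]: classes equally represented
```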
(3) Advanced sampling
Although sampling methods are widely used for tackling class imbalance problems, there is
no established way to determine the suitable class distribution for a given data set. The
optimal class distribution depends on the performance measures and varies from one data set
to another. Recent variants of over-sampling and under-sampling overcome some of the
weaknesses. Among them, one popular over-sampling approach is SMOTE (Synthetic
Minority Over-sampling Technique), which adds information to the training set by
introducing new, non-replicated minority class examples.
SMOTE is an intelligent over-sampling method. In this approach, the minority class is
over-sampled by taking each minority class sample and introducing synthetic examples along
the line segments joining it to any or all of its k nearest minority class neighbors. Depending
upon the amount of over-sampling required, neighbors are randomly chosen from the k
nearest neighbors. This process is illustrated in Figure 3.3, where xi is the selected point, xi1
to xi4 are selected nearest neighbors, and r1 to r4 are the synthetic data points created by the
randomized interpolation [73].
This method has been investigated with C4.5 and gives better results than random
over-sampling. By interpolating between minority class examples to create new data, the
within-class imbalance is reduced and C4.5 achieves a better generalization of the minority
class, as opposed to the specialization effect obtained by randomly replicating minority class
examples.
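The interpolation step of SMOTE can be sketched as follows. This is a simplified brute-force illustration of the idea described above, not the reference implementation; the function name and parameters are our own.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples along segments to k nearest neighbours."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances within the minority class (brute force; fine for small data).
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    nn = np.argsort(d, axis=1)[:, :k]           # indices of the k nearest neighbours
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))            # pick a minority sample x_i
        j = nn[i, rng.integers(min(k, len(X_min) - 1))]  # pick one of its neighbours
        gap = rng.random()                      # random position on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

Because each synthetic point is a convex combination of two existing minority points, the new examples stay inside the region spanned by the minority class rather than duplicating it.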
3.3.2 Cost-sensitive learning: C4.5 decision tree
At the algorithmic level, solutions include adjusting the costs of the various classes so as to
counter the class imbalance, adjusting the probabilistic estimate at the tree leaf (when working
with decision trees), adjusting the decision threshold, and recognition-based (i.e., learning
from one class) rather than discrimination-based (two class) learning.
Cost-sensitive learning is a type of learning in data mining that takes misclassification
costs (and possibly other types of cost) into consideration. There are many ways to implement
cost-sensitive learning; Haibo He [74] categorizes them into three classes: the first applies
misclassification costs to the data set as a form of data space weighting, the second applies
cost-minimizing techniques to the combination schemes of ensemble methods, and the third
incorporates cost-sensitive features directly into classification paradigms to essentially fit the
cost-sensitive framework into these classifiers.
Cost can be incorporated into the decision tree algorithm, one of the most widely used and
simplest classifiers, in various ways: first, cost can be applied to adjust the decision threshold;
second, cost can be used in splitting-attribute selection during decision tree construction; and
third, cost-sensitive pruning schemes can be applied to the tree.

Figure 3.3 An illustration of how to create the synthetic data point in the SMOTE algorithm
(xi is the selected point, xi1 to xi4 its nearest neighbors, and r1 to r4 the synthetic points).
In this study, we make use of the cost-sensitive C4.5 decision tree (C4.5CS) proposed by
Ting (2002) [75]. This method changes the class distribution such that the induced tree
favors the class with a high weight/cost and is less likely to commit errors with high cost.
Specifically, the computation of the split criteria for C4.5 (normalized information gain) is
modified to take into account the a priori probability according to the number of samples for
each class.
C4.5CS modifies the weight of an instance in proportion to the cost of misclassifying the
class to which the instance belongs, leaving the sum of all training instance weights equal to
N. Let C(j) be the cost of misclassifying a class j instance; the weight of a class j instance can
then be computed as

\[ w(j) = \frac{C(j)\, N}{\sum_{i} C(i)\, N_{i}} \tag{3.3} \]

such that the sum of all instance weights is \( \sum_{j} w(j)\, N_{j} = N \).
The standard greedy divide-and-conquer procedure for inducing minimum-error trees can
then be used without modification, except that Wj(t) is used instead of Nj(t) (the number of
instances of class j) in the computation of the test selection criterion during tree growing and
of the error estimate during pruning. Wj(t) is the result of weighting the initial number of
instances of class j with the weight computed in Eq. (3.3): Wj(t) = w(j) · Nj(t). Thus, both
processes are affected by this change.
C4.5CS also introduces another optional modification that alters the usual classification
process after creating the decision tree. Instead of classifying using the minimum error criteria,
it is advisable to classify using the expected misclassification cost in the last part of the
classification procedure. The expected misclassification cost for predicting class i with respect
to the instance x is given by
\[ EC_{i}(x) \propto \sum_{j} W_{j}(t(x))\, cost(i, j) \tag{3.4} \]

where t(x) is the leaf of the tree that instance x falls into and Wj(t) is the total weight of the
class j training instances in node t.
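Equations (3.3) and (3.4) can be checked numerically with a small sketch. The helper names are ours, and the class counts and cost matrix below are made-up examples, not data from this study.

```python
import numpy as np

def c45cs_weights(n_per_class, cost):
    """Eq. (3.3): w(j) = C(j) * N / sum_i C(i) * N_i, so that sum_j w(j) * N_j = N."""
    n = np.asarray(n_per_class, dtype=float)
    c = np.asarray(cost, dtype=float)
    return c * n.sum() / (c * n).sum()

def expected_cost_prediction(leaf_weights, cost_matrix):
    """Eq. (3.4): predict the class i minimising sum_j W_j(t(x)) * cost(i, j)."""
    ec = cost_matrix @ np.asarray(leaf_weights, dtype=float)
    return int(np.argmin(ec))
```

For instance, with 90 majority and 10 minority instances and costs C = (1, 9), the weights satisfy the invariant w(0)·90 + w(1)·10 = 100, and a leaf dominated by minority weight is predicted as the minority class even when the raw counts would say otherwise.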
31
CHAPTER 4
EVALUATION MEASURES
Evaluation measures play a crucial role in both assessing the classification performance and
guiding the classifier modeling. Traditionally, accuracy is the most commonly used measure
for these purposes.
\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{4.1} \]
However, for classification with the class imbalance problem, accuracy is no longer a
proper measure since the minority class has very little impact on accuracy as compared to the
majority class. For example, in a problem where a minority class is represented by only 1% of
the training data, a simple strategy can be to predict the majority class label for every example.
It can achieve a high accuracy of 99%. However, this measurement is meaningless to some
applications where the learning concern is the identification of the minority cases. Therefore,
other metrics have been developed to assess classifier performance for imbalanced datasets. A
variety of common metrics are defined based on the confusion matrix. A two-by-two
confusion matrix is shown in Table 4.1.
The four counts, which constitute a confusion matrix (as seen in Table 4.1) for binary
classification are: the number of correctly recognized positive class examples (true positives),
the number of correctly recognized examples that belong to the negative class (true negatives),
and examples that either were incorrectly assigned to the positive class (false positives) or that
were not recognized as positive class examples (false negatives).
Table 4.1 Confusion matrix for performance evaluation.

                              Predicted class (expectation)
                              Positive               Negative
Actual class     Positive     True positive (TP)     False negative (FN)
(observation)    Negative     False positive (FP)    True negative (TN)
Among the various evaluation criteria, the measures most relevant to imbalanced data are
sensitivity, specificity, the geometric mean (G-mean), the ROC curve, AUC and MCC. These
metrics share a commonality in that they are all class-independent measures.
4.1 Sensitivity, Specificity and Geometric mean
These measures are utilized when the performance on both classes is of concern and expected
to be high simultaneously. The geometric mean (G-mean) metric was suggested by Kubat and
Matwin [76] and has been used by several researchers for evaluating classifiers on imbalanced
data sets [77] [78]. The G-mean indicates the balance between the classification performance
on the majority and minority classes. This metric takes into account both the sensitivity (the
accuracy on the positive examples) and the specificity (the accuracy on the negative examples):
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \tag{4.2} \]

\[ \text{Specificity} = \frac{TN}{TN + FP} \tag{4.3} \]

\[ \text{G-mean} = \sqrt{\text{Sensitivity} \times \text{Specificity}} \tag{4.4} \]
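Equations (4.2)–(4.4) translate directly into code. A minimal sketch (the function name and the illustrative counts in the usage are ours):

```python
import math

def gmean_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity and their geometric mean, Eqs. (4.2)-(4.4)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)
```

With hypothetical counts TP = 18, FN = 2, FP = 4, TN = 474, the sensitivity is 0.9 while the specificity stays near 0.99, and the G-mean summarizes both in one number.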
4.2 Type I and Type II errors
Ideally, a perfect system would have 100% sensitivity and 100% specificity. In practice,
however, two types of errors, type I and type II errors, often occur.

\[ \text{Type I error} = 1 - \text{Specificity} \tag{4.5} \]

\[ \text{Type II error} = 1 - \text{Sensitivity} \tag{4.6} \]

The type I error is the rate at which a model incorrectly classifies insolvent customers as
healthy ones; when this happens, the company is exposed to high credit risk. From a
theoretical point of view, it is better to utilize classification models with a lower type I error.
Opposed to the type I error, the type II error is the rate at which healthy customers are
classified as insolvent ones. In practice it is also of great importance to achieve an appropriate
balance between type I and type II errors so as not to lose potentially healthy customers.
4.3 Integrated performance measures
(1) ROC and AUC.
34
The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC)
are the two most common measures for assessing the overall classification performance [79]. The
ROC is a graph showing the relationship between benefits (correct detection rate or true
positive rate) and costs (false detection rate or false positive rate) as the decision threshold
varies. The ROC curve shows that for any classifier, the true positive rate cannot increase
without also increasing the false positive rate.
A ROC curve gives a visual indication if a classifier is superior to another classifier, over a
wide range of operating points. However, a single metric is sometimes preferred when
comparing different classifiers. The area under the ROC curve (AUC) is employed to
summarize the performance of a classifier into a single metric. The AUC does not place more
weight on one class over another. The larger the AUC, the better is the classifier performance.
It can be defined as the arithmetic average of the mean predictions for each class.
\[ \text{AUC} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{4.7} \]
(2) MCC
The Matthews correlation coefficient [20] is used in machine learning as a measure of the
quality of binary (two-class) classifications. It takes into account true and false positives and
negatives and is generally regarded as a balanced measure which can be used even if the
classes are of very different sizes.
\[ \text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{4.8} \]
If any of the four sums in the denominator is zero, the denominator can be arbitrarily set
to one, which results in an MCC of zero. There are situations, however, where the MCC is not a
reliable performance measure. For instance, the MCC will be relatively high in cases where a
classification model gives very few or no false positives, but at the same time very few true
positives.
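Equation (4.8), together with the zero-denominator convention just described, can be sketched as follows (the function name is ours):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, Eq. (4.8); a zero denominator is treated as 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / (denom if denom != 0 else 1.0)
```

A classifier that predicts everything as the majority class (e.g. TP = 0, FP = 0) gets an MCC of 0 under this convention, which matches the intuition that it carries no information about the minority class.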
CHAPTER 5
DATA SETS
5.1 The credit datasets in a small company
5.1.1 Credit assessment problem
The credit datasets are available from a small company whose main business is selling
school uniforms and accessories at wholesale. The company has 20 employees, and its annual
sales are about 600 million Japanese yen. Orders come from about 800 customers, who are
classified into three types: retailers, schools and others, as shown in Table 5.1.
The customers’ credit has been assessed through a four-grade credit score:
Score of one: a healthy customer for which all orders are accepted.
Score of two: a customer for which orders are accepted and limited to a given amount.
Score of three: a customer for which orders are accepted only in a cash sale.
Score of four: an insolvent customer for which all orders are rejected.
5.1.2 Features of the customers
For the company, most of the customers are small businesses that do not disclose financial
information, and it is almost impossible to obtain their financial data. It is also frequently
difficult to ask an agency to evaluate customers' credit due to a limited budget.

Table 5.1 Type of Customers

Type       Customers
Retailer   Co-ops or retailers to whom products are usually sold on credit.
School     Nominal customers used to handle the sales made directly to the students of each
           school at the beginning of a school year.
Other      Nominal customers used to handle over-the-counter sales or orders coming from
           the sales team, students' circles or clubs, and any other associations.

For these reasons, we collected the following seven features from the daily transactions,
which are available to small businesses for assessing customers' credit.
Type of customers.
Average amount of overdue payment in the year considered.
Maximum overdue days for all overdue payments in the year considered.
Number of times that overdue payment occurs in the year considered.
Total sales in the year considered.
Rate of the average amount of overdue payment of the total sales.
Number of transaction months in which any order from the customer is fulfilled in the
year considered.
This characteristic data can be extracted from the database of small-business management
information systems.
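As an illustration only, the features above might be derived from transaction records along the following lines. The table layout and column names (customer, month, sales, overdue_amount, overdue_days) are hypothetical and will differ from the company's actual database schema; the customer-type feature would come from a separate master table and is omitted here.

```python
import pandas as pd

# Hypothetical transaction records for one financial year.
tx = pd.DataFrame({
    "customer":       ["A", "A", "A", "B", "B"],
    "month":          [4, 5, 6, 4, 7],
    "sales":          [100, 150, 120, 80, 60],
    "overdue_amount": [0, 30, 20, 0, 0],
    "overdue_days":   [0, 15, 40, 0, 0],
})

# One row of features per customer, mirroring the seven features in the text.
features = tx.groupby("customer").agg(
    avg_overdue_amount=("overdue_amount", "mean"),          # average overdue payment
    max_overdue_days=("overdue_days", "max"),               # maximum overdue days
    n_overdue=("overdue_amount", lambda s: (s > 0).sum()),  # times overdue occurred
    total_sales=("sales", "sum"),                           # total sales in the year
    n_transaction_months=("month", "nunique"),              # months with transactions
)
features["overdue_rate"] = features["avg_overdue_amount"] / features["total_sales"]
```

The point of the sketch is that every feature is a simple aggregate over a customer's yearly transactions, which is why the approach needs no external financial data.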
5.1.3 Data sets summary
We collected the data from the financial years 2001 to 2003 and summarized the
distribution of customers over the credit scores, as Table 5.2 shows.

From Table 5.2, it can be seen that in the financial years 2001 and 2002 the customers with
a score of two or three each accounted for only 0.4% of the total, and in the financial year
2003 the customers with a score of two accounted for only 0.2% and those with a score of
three for 0.6%, far fewer than the customers with a score of one. According to previous
studies, traditional credit scoring methods do not work well in identifying the minority-class
customers with scores of two, three or four.
Table 5.2 Number of customers

Credit score   2001 financial year   2002 financial year   2003 financial year
1              474 (95.2%)           469 (95.1%)           450 (96.4%)
2                2 ( 0.4%)             2 ( 0.4%)             1 ( 0.2%)
3                2 ( 0.4%)             2 ( 0.4%)             3 ( 0.6%)
4               20 ( 4.0%)            20 ( 4.1%)            13 ( 2.8%)
Total          498 (100%)            493 (100%)            467 (100%)
5.2 German credit data set
The other data set chosen for our study is the German credit data set. This is an open data
set available from the UCI Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html)
and has been used in many previous studies as a benchmark problem.
The German Credit data set contains observations on 30 variables for 1000 past applicants
for credit. Each applicant was rated as “good credit” (700 cases) or “bad credit” (300 cases)
(encoded as 1 and 0 respectively in the Response variable). All the variables are explained in
Table 5.3.
(Note: The original data set had a number of categorical variables, some of which have been
transformed into a series of binary variables so that they can be appropriately handled by our
study).
Table 5.3 Variables for the German Credit data

Var. #   Variable Name   Description   Variable Type   Values
1. OBS# Observation No. Categorical
2. CHK_ACCT Checking account status Categorical 0 : < 0 DM
1: 0 < ...< 200 DM
2 : => 200 DM
3: no checking account
3. DURATION Duration of credit in
months
Numerical
4. HISTORY Credit history Categorical 0: no credits taken
1: all credits at this bank paid
back duly
2: existing credits paid back duly
till now
3: delay in paying off in the past
4: critical account
5. NEW_CAR Purpose of credit Binary car (new) 0: No, 1: Yes
6. USED_CAR Purpose of credit Binary car (used) 0: No, 1: Yes
7. FURNITURE Purpose of credit Binary furniture/equipment 0: No, 1:
Yes
8. RADIO/TV Purpose of credit Binary radio/television 0: No, 1: Yes
9. EDUCATION Purpose of credit Binary education 0: No, 1: Yes
10. RETRAINING Purpose of credit Binary retraining 0: No, 1: Yes
11. AMOUNT Credit amount Numerical
12. SAV_ACCT Average balance in savings
account
Categorical 0 : < 100 DM
1 : 100<= ... < 500 DM
2 : 500<= ... < 1000 DM
3 : =>1000 DM
4 : unknown/ no savings account
13. EMPLOYMENT Present employment since Categorical 0 : unemployed
1: < 1 year
2 : 1 <= ... < 4 years
3 : 4 <=... < 7 years
4 : >= 7 years
14. INSTALL_RATE Installment rate as % of
disposable income
Numerical
15. MALE_DIV Applicant is male and
divorced
Binary
0: No, 1: Yes
16. MALE_SINGLE Applicant is male and
single
Binary
0: No, 1: Yes
17. MALE_MAR_WID Applicant is male and
married or a widower
Binary
0: No, 1: Yes
18. CO-APPLICANT Application has a co-
applicant
Binary
0: No, 1: Yes
19. GUARANTOR Applicant has a guarantor Binary 0: No, 1: Yes
20. PRESENT_RESIDENT Present resident since - years Categorical 0: <= 1 year
1: 1 < ... <= 2 years
2: 2 < ... <= 3 years
3: > 4 years
21. REAL_ESTATE Applicant owns real estate Binary 0: No, 1: Yes
22. PROP_UNKN_NONE Applicant owns no
property (or unknown)
Binary
0: No, 1: Yes
23. AGE Age in years Numerical
24. OTHER_INSTALL Applicant has other
installment plan credit
Binary
0: No, 1: Yes
25. RENT Applicant rents Binary 0: No, 1: Yes
26. OWN_RES Applicant owns residence Binary 0: No, 1: Yes
27. NUM_CREDITS Number of existing credits
at this bank
Numerical
28. JOB Nature of job Categorical 0 : unemployed/ unskilled - non-
resident
1 : unskilled - resident
2 : skilled employee / official
3 : management/ self-
employed/highly qualified
employee/ officer
29. NUM_DEPENDENTS Number of people for
whom liable to provide
maintenance
Numerical
30. TELEPHONE Applicant has phone in his
or her name
Binary
0: No, 1: Yes
31. FOREIGN Foreign worker Binary 0: No, 1: Yes
32 RESPONSE Credit rating is good Binary 0: No, 1: Yes
CHAPTER 6
A TWO-STAGE DATA RESAMPLING METHOD FOR CREDIT
SCORING
6.1 Background and purpose of this study
The scenario of classification with imbalanced data sets has posed a serious challenge for
credit scoring researchers in recent years. The main difficulty is that the number of insolvent
customers is much smaller than the number of healthy ones. As a result, the classifier tends to
favor the healthy customers of the majority class. In other words, healthy customers may be
overlearned by the model and can therefore be identified with high accuracy, but insolvent
customers of the minority class cannot be identified correctly. However, in real business, it is
more important to identify insolvent customers in order to minimize credit risk. Thus,
improving the classification performance on insolvent customers in the minority class became
a new challenge for us.
Several researchers have tried to address these problems over past decades. In general,
there are two approaches used to tackle the problem of extremely imbalanced data.
(1) Data Sampling
The training samples are modified in such a way as to produce a more balanced class
distribution that allows classifiers to perform in a similar manner to standard classification.
Typical sampling methods include over-sampling and under-sampling [80] [81], which modify
the prior probabilities of the majority and minority classes in the training set to obtain a more
balanced number of instances in each class.
The under-sampling method extracts a smaller set of majority instances while preserving all
the minority instances. This method is suitable for large-scale applications where the number
of majority samples is huge, and lessening the training instances reduces the training time and
makes the learning problem more tractable. However, one problem associated with under-
sampling techniques is that we may lose information when instances are discarded.
In contrast to under-sampling, the over-sampling method increases the number of minority
instances by over-sampling them. The advantage is that no information is lost from the
training samples because all instances are employed. However, the minority instances become
over-represented in the training set, which moreover increases the training time.
(2) Algorithmic Modification
This approach is oriented towards the adaptation of base learning methods to be more
attuned to class imbalance data [82]. Substantial work has gone into making individual
algorithms cost-sensitive. Cost-sensitive approaches assign a high cost to misclassification of
the minority class, and try to minimize the overall cost [83][84]. Cost-sensitive learning plays
an important role in real-world data mining applications. Turney [85] provided a
comprehensive survey of a large variety of different types of costs in data mining and
machine learning, including misclassification costs, data acquisition costs, active learning
costs, computation costs, human-computer interaction costs, and so on. The misclassification
cost is singled out as the most important cost, and it has also been the most studied in recent
years.
Although much research about the class imbalance problem has been reported, some
challenging problems still remain.
(1) Data sampling is the approach of producing a more balanced learning data set. As the
under-sampling method extracts a smaller set of majority instances, some information
of the majority class will be lost. Furthermore, it is very difficult to determine the
correct class distribution for a learning algorithm, or the appropriate resampling
strategy that avoids losing information of the majority class in under-sampling and
over-feeding the minority class in over-sampling.
(2) Previous research focuses on either resampling techniques or algorithmic modifications.
However, the effectiveness of any learning algorithm is influenced by the construction
method of the learning data set, so it is necessary to consider learning algorithms and
resampling techniques simultaneously.
(3) Most papers published so far have used some benchmark data sets to confirm the
effectiveness: there are very few real-world applications that have been reported. As
algorithms which are effective for benchmark data sets are not necessarily effective in
real-world applications, it is important to make an attempt to solve practical class
imbalance problems and provide some insights or experiences about solving real-world
problems.
This study aims to solve a real small-business credit assessment problem and make some
new contributions for dealing with class imbalanced data sets from the following three
viewpoints.
(1) When using under-sampling methods to resample instances of the majority class, it is
unavoidable that some useful information of the majority class is lost. In order to avoid
this information loss, we propose a two-stage data resampling method to reduce the
sample size of the majority class whose information cannot be reflected in the under-
sampling results.
(2) Instead of focusing on either resampling techniques or algorithmic modifications, we
try to propose a new learning approach of performing algorithmic modification and
data resampling at the same time. That is, we use k-means algorithms and the k-nearest
neighbor method for resampling class imbalanced data sets and generate two training
data sets. Meanwhile, we classify healthy and insolvent customers through a hybrid
method of the k-nearest neighbor and random forest methods.
(3) This study has dealt with a credit scoring problem in a small-scale student dress
wholesale company and proposed some new approaches to assess the customers’ credit
only based on characteristic data that can be easily retrieved from daily transaction data
[86] [87]. These approaches are suitable to be applied to many organizations where the
customers do not disclose their financial data and have an advantage of a lower cost
data collection compared to other methods. However, we are having difficulties in
improving the accuracy of identifying insolvent customers. The emphasis of this study
is to provide a new approach based on class imbalance learning and construct a system
to identify the insolvent customers of the minority class with as high as possible
accuracy.
6.2 System design
Here, we propose a two-stage data resampling method to generate two balanced training
data sets. Similar to other under-sampling methods, at the first stage we perform under-
sampling through clustering the majority class of customers using a k-means algorithm. At the
second stage, in order to avoid information loss, we execute a pre-classification to pick up
customers of the majority class whose information cannot be reflected in the under-sampling
results of the first stage.
6.2.1 Scheme system
The proposed approach classifies imbalanced data sets in two steps, as Figure 6.1 shows.
The first is to generate two training data sets, and the second is to construct two classifiers
based on these training data sets to classify new customers.
6.2.2 Training data generating
Let T= {S, M} be an imbalanced data set, where S= {s1, s2, …, sL} is the set of customers in
the minority class and M={m1, m2 ,…, mN} is the set of customers in the majority class. L and
N are the number of the customers in minority and majority classes respectively: in addition L
< N.
[First stage]
(1) For the customer mi (i =1, 2, …, N) of the majority class, we use the k-means algorithm
to generate k cluster means or centers. These k cluster means are defined as the seeds
of the majority class and put into set E.
(2) Combining the customers belonging to minority class S and set E, we generate a new
training set T1: T1= {S, E}.
Through the operation of the first stage, the N customers of the majority class are clustered
into k clusters. By choosing an appropriate k close to the size of the minority class S, the
training set T1 becomes a well-balanced one.
[Second stage]
(1) Based on T1 = {S, E}, we classify the customers of the majority class M using the
1-nearest neighbor algorithm. If a customer mj of the majority class is wrongly classified into
the minority class S, we put it into set H.
(2) Combining the customers of the minority class and set H, another training set T2 can
be generated as T2 = {S, H}.

Figure 6.1 System scheme: two training data sets are generated by k-means and k-NN;
learning and classification are then performed by two classifiers.
The training data generating process is shown in Figure 6.2.
As the first stage aims at generating a balanced data set through under-sampling of the
majority class, it can discard data potentially important for the classification process. Hence,
we perform the second-stage operation so that the customers of the majority class whose
information is not reflected in training set T1 are picked up again in training set T2. It is clear
that our approach can not only perform under-sampling, but also avoid information loss.
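The two-stage training data generation can be sketched with scikit-learn. This is an illustrative reconstruction under our own parameter choices (function name, label encoding, random seeds), not the system's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def generate_training_sets(X_maj, X_min, k, seed=0):
    """Two-stage resampling: T1 = {S, E} from k-means seeds of the majority class,
    T2 = {S, H} from majority samples a 1-NN trained on T1 misclassifies into S."""
    # First stage: cluster the majority class; the k cluster centres are the seeds E.
    E = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X_maj).cluster_centers_
    X1 = np.vstack([X_min, E])
    y1 = np.concatenate([np.ones(len(X_min)), np.zeros(len(E))])  # 1 = minority class
    # Second stage: pick up majority customers whose information T1 cannot reflect.
    nn1 = KNeighborsClassifier(n_neighbors=1).fit(X1, y1)
    H = X_maj[nn1.predict(X_maj) == 1]
    X2 = np.vstack([X_min, H]) if len(H) else X_min.copy()
    y2 = np.concatenate([np.ones(len(X_min)), np.zeros(len(H))])
    return (X1, y1), (X2, y2)
```

Both returned sets contain all minority samples, so the minority class is fully represented twice while the majority class is compressed into seeds plus its hard-to-represent members.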
6.2.3 Learning and classification
As shown in Figure 6.3, the learning and classification are performed as follows.
[Step 1]
Firstly, based on training data set T1, we construct a preliminary classifier C1 where the
1-NN (k-nearest neighbor, k=1) algorithm was used as the classification method.
[Step 2]
Based on training set T2 and using the random forest method, we construct the second
classifier C2.
[Step 3]
When a new customer is given, the preliminary classifier C1 is first applied to classify it.
If the customer is classified into the majority class, it is determined to be a healthy
customer. On the contrary, if the customer is classified into the non-majority (minority)
class, the second classifier C2 is applied to reclassify it, so as to decide finally whether it
belongs to the minority class (an insolvent customer) or to the majority class.

Figure 6.2 Generating the training data: in the first stage, the k-means algorithm clusters the
majority class M into a set of seeds E, which is combined with the minority class S to form
training set T1; in the second stage, a 1-nearest-neighbor pre-classifier identifies the set H of
majority-class customers misclassified into S, which is combined with S to form training set T2.
Compared to other research, the proposed approach constructs and applies two types of
classifiers C1 and C2, and has the following characteristics.
(1) Since the insolvent customers belonging to the minority class S have been included in
both training data sets T1 and T2, and these two training data sets are balanced ones,
insolvent customers can be well represented in the two classifiers C1 and C2.
Furthermore, the second classifier C2 used the random forest method as the base
classifier, and the random forest method has been reported as being able to train the
imbalanced data effectively [88]. For these reasons, the performance of classifying
customers of the minority class can be improved.
(2) As the k cluster means of the majority class were included in the training data set T1,
new customers that are near to a seed of the majority class can be classified correctly
through preliminary classifier C1. In addition to this, other customers that could not be
represented in the seeds of the majority class have been included in the training data set T2,
and new customers that are not near to a seed of the majority class can be classified
correctly through the second classifier C2. Therefore, the customers of the majority class can
be expected to be identified with high accuracy.

Figure 6.3 Proposed approach: a new customer is first classified by the preliminary classifier
C1 (k-nearest neighbor, k = 1); if it is assigned to the majority class it is judged a healthy
customer, otherwise the second classifier C2 (random forest) makes the final decision.
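The two-classifier learning and classification procedure described above can be sketched as follows. This is an illustrative reconstruction; the class name and defaults are our assumptions (the tree count of 10 follows the setting reported later in this chapter), not the system's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

class TwoStageClassifier:
    """C1 (1-NN on T1) screens new customers; those it does not judge as majority
    are passed to C2 (random forest on T2) for the final decision."""
    def __init__(self, n_trees=10, seed=0):
        self.c1 = KNeighborsClassifier(n_neighbors=1)
        self.c2 = RandomForestClassifier(n_estimators=n_trees, random_state=seed)

    def fit(self, T1, T2):
        (X1, y1), (X2, y2) = T1, T2   # label encoding: 1 = minority, 0 = majority
        self.c1.fit(X1, y1)
        self.c2.fit(X2, y2)
        return self

    def predict(self, X):
        y = self.c1.predict(X)
        suspect = y == 1              # only suspected minority customers go to C2
        if suspect.any():
            y[suspect] = self.c2.predict(X[suspect])
        return y
```

A customer is labeled insolvent only when both classifiers agree it is not a typical majority case, which is what protects the specificity on the minority class.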
6.3 The application for a real credit scoring problem
In order to confirm the effectiveness of the proposed approach and give a real-world
application, this study applies the proposed approach to the credit scoring problem in a small-
scale student dress wholesale company.
As the original training data, we collected the feature data of the company's customers in
the 2001 financial year, and then generated the two training data sets T1 and T2 from the data
of the 2001 financial year. When we used the k-means algorithm to generate the seeds of the
majority class and training data set T1, k was set to 4.
Training data set T1 is used to construct the preliminary classifier C1 where the 1-NN (k-
nearest neighbor, k=1) algorithm was used. Training data set T2 is used to construct the
second classifier C2 where the number of trees was set at 10 and the number of features
selected at each node is 4.
As new customers, we choose every customer in the financial year of 2002 and 2003, and
decide a new credit score by applying the proposed approach shown in Figure 6.3. These new
credit scores were compared with that given by the financial managers of the company. The
predicted result for 2002 customers is shown in Table 6.1.
Table 6.1 Prediction results for 2002 customers.
(rows: credit scores given by the financial managers; columns: credit scores provided by our approach)

Score     1     2     3     4     Hit rate
1       467     1     0     1     99.6%
2         0     2     0     0     100.0%
3         0     0     2     0     100.0%
4         2     0     0    18     90.0%
From Table 6.1, the credit scores of the customers with scores of two and three are in 100%
agreement with the judgments of the financial managers of the company. For the customers
with a score of four, 18 out of 20 are correctly predicted. According to these results, it is clear
that our system has a very high ability to classify the minority class.

The prediction results for the 2003 financial year are shown in Table 6.2.
Table 6.2 Prediction results for 2003 customers.
(rows: credit scores given by the financial managers; columns: credit scores provided by our approach)

Score     1     2     3     4     Hit rate
1       444     6     0     0     99.0%
2         0     1     0     0     100.0%
3         0     2     1     0     33.0%
4         6     0     0     7     54.0%
From Table 6.2, the credit scores of healthy customers (score = 1) provided by the system
are in 99% agreement with the judgments of the financial managers of the company. The hit
rate of the customers with a score of 3 is 33% and that of the customers with a score of 4 is
54%.
6.4 Performance comparison
To clarify the performance and effectiveness of our approach, we compare the proposed
approach with the k-nearest neighbor algorithm and random forest method.
As described above, we also choose every customer in the 2001 financial year as the
training data and construct two single classifiers using the k-nearest neighbor algorithm and
the random forest method, respectively. The k-nearest neighbor algorithm was applied with
k = 1 (1-NN) using the Weka IBk classifier [88]. For the random forest method, the number of
trees was set to 10 and the number of features selected at each node to four, in both the
proposed method and the single random forest.
Data of every customer in financial year 2002 and 2003 were used as the test data and new
credit scores were decided by these two single classifiers respectively, and then compared
with that given by the financial managers of the company. The comparison results are shown
in Table 6.3 and Table 6.4.
Table 6.3 Classification results of the 2002 financial year using 1-NN and random forest.
(rows: credit score given by the financial managers; columns: credit score given by each classifier)

              1-NN            Random forest
Score         1     >1        1     >1
1           463      6      469      0
>1           13     11        6     18
Table 6.4 Classification results of the 2003 financial year using 1-NN and random forest.
(rows: credit score given by the financial managers; columns: credit score given by each classifier)

              1-NN            Random forest
Score         1     >1        1     >1
1           444      6      447      3
>1            8      9       13      4
(1) Comparison of Specificity
A main purpose of our study is to improve the performance of identifying insolvent
customers of the minority class. Thus the specificity, which reflects the ability to identify the
minority class, has been assessed based on the results shown in Table 6.3 and Table 6.4. The
comparison results are shown in Figure 6.4 and Figure 6.5.
As Figure 6.4 shows, the specificity obtained by the proposed approach is 92%, better than
the random forest (75%) and k-nearest neighbor (46%) methods. It is clear that the
performance of classifying customers of the minority class was improved significantly by our
approach.

Figure 6.4 Comparison of specificity in the financial year 2002 (k-nearest neighbor: 0.46;
random forest: 0.75; the proposed method: 0.92).

As Figure 6.5 shows, the specificity obtained by the proposed approach and the single
k-nearest neighbor classifier is the same (53%); both perform better than random forest (24%).

Figure 6.5 Comparison of specificity in the financial year 2003 (k-nearest neighbor: 0.53;
random forest: 0.24; the proposed method: 0.53).
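For reference, the specificity values above follow directly from the confusion counts of Table 6.3. The short sketch below is our own plain-Python illustration (not part of the original system) that recomputes them:

```python
# Specificity = correctly identified minority (score > 1) customers
#               divided by all minority customers.
# The counts are taken from Table 6.3 (financial year 2002).

def specificity(minority_correct, minority_missed):
    return minority_correct / (minority_correct + minority_missed)

knn_spec = specificity(11, 13)  # 1-NN: 11 of 24 insolvent customers identified
rf_spec = specificity(18, 6)    # random forest: 18 of 24 identified

print(round(knn_spec, 2), round(rf_spec, 2))  # 0.46 0.75, as in Figure 6.4
```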
(2) Comparison of Type I and Type II errors
In the field of credit scoring, type I and type II errors are important criteria for evaluating
the performance of credit scoring models. The type I and type II errors of the proposed
approach, the k-nearest neighbor algorithm and the random forest method can be calculated
from Tables 6.3 and 6.4. The comparison result for the financial year 2002 is shown in Table 6.5.
Table 6.5 Comparison of type I and type II errors in financial year of 2002.
Methods Type I Type II
k-nearest neighbor 54.0% 1.3%
Random forest 25.0% 0.0%
Proposed method 8.3% 0.4%
From Table 6.5, it is clear that:
・ The type II errors range from 0 to 1.3%: all three methods showed very low error rates in
identifying healthy customers of the majority class. This is because the number of healthy
customers is very large, so their features can be learned sufficiently by the models.
・ Among the three methods compared here, the proposed approach gave the lowest type I
error (8.3%), a large margin over the k-nearest neighbor (54%) and random forest (25%)
methods. This shows that the proposed approach is superior to the single classifiers based
on the k-nearest neighbor algorithm and the random forest method in controlling type I errors.
・ Unlike the single classifiers based on the k-nearest neighbor algorithm and the random
forest method, the proposed approach can control type I and type II errors at the same time.
In other words, it not only correctly identifies healthy customers of the majority class, but
also classifies customers of the minority class with a very low error rate.
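As an illustration (our own sketch, using only the counts of Table 6.3), the type I and type II error rates of the two single classifiers can be recomputed as follows:

```python
# Type I error: an insolvent (score > 1) customer classified as healthy (score = 1).
# Type II error: a healthy customer classified as insolvent.
# The counts are taken from Table 6.3 (financial year 2002).

def error_rates(healthy_ok, healthy_missed, insolvent_missed, insolvent_ok):
    type_i = insolvent_missed / (insolvent_missed + insolvent_ok)
    type_ii = healthy_missed / (healthy_ok + healthy_missed)
    return type_i, type_ii

knn_t1, knn_t2 = error_rates(463, 6, 13, 11)  # 1-NN
rf_t1, rf_t2 = error_rates(469, 0, 6, 18)     # random forest

print(f"1-NN: type I {knn_t1:.1%}, type II {knn_t2:.1%}")
print(f"RF:   type I {rf_t1:.1%}, type II {rf_t2:.1%}")
```

The recomputed values agree with Table 6.5 up to rounding (e.g. 13/24 ≈ 54.2% for the 1-NN type I error).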
In the same way, the type I and type II errors for the financial year 2003 can be calculated
from Table 6.4. The comparison result is shown in Table 6.6.
Table 6.6 Comparison of type I and type II errors in financial year of 2003.
Methods Type I Type II
k-nearest neighbor 47.1% 1.3%
Random forest 76.5% 0.7%
Proposed method 47.1% 1.3%
As shown in Table 6.6, the k-nearest neighbor and the proposed method gave the lowest
type I error, while random forest gave a comparatively high type I error. These high type I
errors arise from the fact that the number of insolvent customers is much smaller than that of
healthy ones, so the models overfit the healthy customers.
(3) Comparison of integrated performance
In order to clarify the integrated performance of the proposed approach, the G-mean, AUC
and MCC are calculated and compared with those of the k-nearest neighbor algorithm and the
random forest method. The comparison results are shown in Figure 6.6 and Figure 6.7.
From Figure 6.6, it is obvious that our method outperformed the k-nearest neighbor
algorithm and the random forest method in all three integrated performance measures
(G-mean, AUC and MCC), with values above 91%. This result shows that our method has a
higher ability to identify both healthy customers of the majority class and insolvent customers
of the minority class than 1-NN and the random forest method.

Figure 6.6 Comparison of G-mean, AUC and MCC in the financial year 2002 (k-nearest
neighbor: 67%, 72%, 53%; random forest: 87%, 88%, 86%; the proposed method: 96%, 96%,
91%).
From Figure 6.7, it is obvious that our method outperformed the random forest method in
all three integrated performance measures (G-mean, AUC and MCC), and has the same
performance as the k-nearest neighbor algorithm.
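For reference, the G-mean and MCC can be recomputed directly from the confusion counts; the sketch below is our own plain-Python illustration (AUC requires ranking scores, so it is not reproduced here). The 1-NN counts of Table 6.3 reproduce the 67% G-mean and 53% MCC of Figure 6.6:

```python
import math

# G-mean and MCC from a binary confusion matrix.
# tp/fn count healthy (majority) customers, tn/fp insolvent (minority) ones.

def g_mean(tp, fn, fp, tn):
    sensitivity = tp / (tp + fn)   # majority-class accuracy
    specificity = tn / (tn + fp)   # minority-class accuracy
    return math.sqrt(sensitivity * specificity)

def mcc(tp, fn, fp, tn):
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# 1-NN counts of Table 6.3 (2002): 463/6 healthy, 13/11 insolvent.
print(round(g_mean(463, 6, 13, 11), 2))  # 0.67, as in Figure 6.6
print(round(mcc(463, 6, 13, 11), 2))     # 0.53
```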
6.5 Concluding Remarks
Prior to this study, we have proposed some new approaches using statistical methods and
case-based reasoning (CBR) to deal with the customers’ credit assessment problem in this
company [86] [87]. Furthermore, in order to consider class imbalance problems and improve
the accuracy of identifying insolvent customers, we have also proposed a credit assessment
system using bagging [89] [90].
Although insolvent customers could be identified with very high accuracy, we need to
improve our study further in order to find more effective and efficient methods to solve the
customers’ credit assessment problem based on less characteristic data. Motivated by this,
this chapter makes an attempt to solve the customers’ credit assessment problem by
proposing a new approach based on class imbalance learning. Different from the existing
techniques, we make the following two contributions.
(1) As a unique resampling method, we used the k-means algorithm to perform under-
sampling on the majority class of customers. Then, in order to avoid losing information,
we introduced a pre-classification to pick up customers of the majority class whose
information could not be reflected in the previous under-sampling result. As a result,
we generated two training data sets.

Figure 6.7 Comparison of G-mean, AUC and MCC in the financial year 2003 (k-nearest
neighbor: 72%, 76%, 55%; random forest: 48%, 61%, 35%; the proposed method: 72%, 76%,
55%).
(2) The proposed method was applied to solve a credit assessment problem in a small
company. As demonstrated by the practical credit scoring problems of the company, it
was clarified that the proposed approach can identify insolvent customers more
effectively than single classifiers based only on either the k-nearest neighbor algorithm
or random forest method.
Through the discussion of this study, it was confirmed that approaches for dealing with
class-imbalanced data sets can be applied to solve practical credit assessment problems.
However, it is also important to decide the system’s structure and parameters so as to obtain
good performance. As our approach assesses small-business credit scores based only on daily
transaction data, it has the advantage over approaches using financial data that it can be
applied to assess a wide variety of businesses.
CHAPTER 7
AN ADAPTIVE AND HIERARCHICAL SYSTEM FOR CREDIT
SCORING
7.1 The purpose of this study
As stated in Chapter 6, the proposed system can identify insolvent customers more
effectively than single classifiers based only on either the k-nearest neighbor algorithm or the
random forest method. However, it remains a challenging issue to identify insolvent
customers with as high accuracy as possible.
In this chapter, we again deal with a credit scoring problem in a small-scale student dress
wholesale company and aim at proposing an adaptive and hierarchical system to solve the
credit assessment problem; our emphasis is put on how to improve the accuracy of
identifying insolvent customers as much as possible. The proposed system can choose the
best method adaptively from neural networks and decision trees based on the accuracy of
identifying customers of every credit score. The performance and effectiveness of the
proposed system have been demonstrated by applying it to the real problems of the company.
7.2 The concept and scheme of the system
Although a great many models and methods for credit assessment have been published so
far, each method has a different ability to identify customers with different credit scores. That
is, a method may correctly identify the healthy customers but, on the other hand, fail to
identify any insolvent ones. It is therefore reasonable to group the customers according to
their credit scores and choose the best method to identify the customers of each group.
Meanwhile, the performance of most credit scoring systems depends on the type and
quantity of the data needed for decision making. It is meaningful to extract the most important
variables or features according to the type of problems and/or customers. Furthermore, we
have found that the key feature data for identifying healthy customers differ from those for
identifying insolvent ones. However, almost all models or methods reported before apply the
same dataset and assign the same weight to each feature when assessing the credit scores of
all customers. It can be expected that accuracy will improve if different weights are assigned
to the feature data when identifying customers belonging to different groups.
Based on these considerations, we propose an adaptive and hierarchical system for small-
businesses’ credit assessment, as shown in Figure 7.1. The key points of this system can be
given as follows:
(1) The customers are divided into m groups according to their credit scores, where m is the
number of grades of the credit score.
(2) The credit score of a customer is assessed through a hierarchical system of m-1 layers,
where the classifiers are arranged hierarchically. In each layer, only one group of
customers is identified.
(3) In each layer, two kinds of classifiers are provided: one is a neural network classifier
and the other is a decision tree classifier. One of these two classifiers is chosen
adaptively, according to the expected probability or correctness of identification, to
identify a group of customers.
(4) A tuning algorithm is proposed in the following section to decide the best sequence, i.e.
the layer in which each group of customers should be identified with the highest correctness.
Figure 7.1 The adaptive and hierarchical system: the training dataset and the target customer
are fed into a cascade of m-1 layers; each layer holds a neural network (NN) and a decision
tree (DT), and the layers output the scores s1, s2, …, sm-1, sm in turn.
The proposed system of Figure 7.1 is similar to ensemble learning systems in that it uses
multiple classifiers. However, it differs from ensemble learning systems mainly in the
following aspects:
(1) When constructing an ensemble learning system, the number of classifiers is a
parameter to be decided optimally. In our proposed system, the number of classifiers is
defined by the number of customer groups and equals m-1.
(2) Typical bagging or stacking ensemble systems are constructed through a parallel
scheme where multiple classifiers of the same type are arranged in parallel. As
shown in Figure 7.1, the proposed system has a hierarchical structure where m-1
classifiers are arranged hierarchically.
(3) Each classifier in ensemble systems is used to classify all groups of customers, but in
our proposed system, one classifier is used to identify only one group of customers,
and therefore it is not necessary to introduce any weighting algorithm or meta-level
classifier to combine the predictions from an ensemble of diverse classifiers.
7.3 Systematic Constructing Procedure
In order to identify all the groups of the customers with the highest accuracy, the system
structure should be decided optimally.
Let the customers’ credit be assessed by an m-grade score s = 1, 2, …, m, and let the group
of customers with score s be denoted by Cs (s = 1, 2, …, m). We can then construct an
(m-1)-layer system through the following procedure, where at layer L (L = 1, 2, …, m-1) the
group of customers with score sL is identified.
[Step 1] Set the number of layers as L = 1 and the learning data set as C = C1 ∪ C2 ∪ … ∪ Cm.
[Step 2] Using the learning data set C, execute an appropriate learning program to train the
neural network and construct the decision tree. After the learning, the classification accuracy
for identifying the customers of each group Ct not yet identified (t ∈ {sL, sL+1, …, sm}) can
be calculated as
Rt = (the number of customers of group Ct which are also classified into group Ct)
     / (the total number of customers of group Ct)
To distinguish the accuracy obtained through the neural network from that obtained through
the decision tree, we denote the Rt obtained by the neural network (NN) as RtNN and that
obtained by the decision tree (DT) as RtDT.
[Step 3] Find the best accuracies RNN and RDT as
RNN = max{ RtNN, t ∈ {sL, sL+1, …, sm} }  (7.1)
RDT = max{ RtDT, t ∈ {sL, sL+1, …, sm} }  (7.2)
If RNN ≥ RDT, the neural network is chosen as the classifier of layer L and meanwhile the
customers’ group CsL is chosen as the group to be identified in layer L, where
sL = argmax t { RtNN, t ∈ {sL, sL+1, …, sm} }  (7.3)
Otherwise the decision tree is chosen as the classifier of layer L and, in the same way, the
customers of group CsL are identified, where
sL = argmax t { RtDT, t ∈ {sL, sL+1, …, sm} }  (7.4)
[Step 4] If L = m-1, the procedure is finished. Otherwise, set C = C − CsL, let L = L+1 and
go back to Step 2.
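The construction procedure above can be sketched as follows. This is our own illustrative implementation, in which the per-group accuracies play the roles of RtNN and RtDT; in the actual procedure they are re-measured at every layer after the identified group is removed, while this single-pass call reuses the Table 7.1 values throughout, so only the first layer's choice is meaningful.

```python
# Sketch of the layer-construction procedure of Section 7.3.

def build_layers(acc_nn, acc_dt, m):
    remaining = list(range(1, m + 1))   # groups C_1 ... C_m still to identify
    layers = []
    for _ in range(m - 1):
        best_nn = max(remaining, key=lambda t: acc_nn[t])   # Eq. (7.3)
        best_dt = max(remaining, key=lambda t: acc_dt[t])   # Eq. (7.4)
        if acc_nn[best_nn] >= acc_dt[best_dt]:
            classifier, score = "NN", best_nn
        else:
            classifier, score = "DT", best_dt
        layers.append((classifier, score))
        remaining.remove(score)         # Step 4: C = C - C_{s_L}
    return layers

# First-layer accuracies of Table 7.1.
layers = build_layers({1: 0.994, 2: 0.50, 3: 0.50, 4: 0.85},
                      {1: 0.998, 2: 0.00, 3: 0.00, 4: 0.60}, m=4)
print(layers[0])  # ('DT', 1): the decision tree is chosen at the first layer
```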
7.4 Application to a practical problem
To investigate the performance and effectiveness of the proposed system, we apply it to the
real credit assessment problem in the small company.
According to the procedure described in Section 7.3, in the first layer we choose every
customer in the 2001 financial year as the training data set and then calculate the
classification accuracy through a neural network (NN) and a decision tree (DT). Since there
are many decision tree algorithms, we use J4.8, which is a modification of C4.5 revision 8
[87]. For the neural network, we use a back-propagation neural network (BPN) with 7, 6 and
4 units in its input, hidden and output layers, respectively. The result is shown in Table 7.1.
Table 7.1 The classification accuracy at the first layer.

Layer   Credit score   NN      DT
L=1     1              99.4%   99.8%
        2              50.0%   0.0%
        3              50.0%   0.0%
        4              85.0%   60.0%
From Table 7.1, we see that the maximum accuracy is 99.8%, obtained by the decision tree.
So the decision tree is chosen at the first layer of the hierarchical system. If samples of the
target customers are classified by the decision tree as score 1, this is the final result, owing to
the maximum accuracy. Otherwise, the remaining samples are passed to the next layer.
At the second layer, we find the maximum accuracy among scores 2, 3 and 4 through the
neural network and the decision tree. The result is shown in Table 7.2.
From Table 7.2, we see that the accuracies of the neural network are all 100%, so we chose
the neural network at the second layer to identify score 2.
In the same way, according to Table 7.3, we use the neural network at the third layer to
identify the customers with scores of 3 and 4. The classification accuracy at the third layer is
shown in Table 7.3.
As described above, the classifier in each layer can thus be decided, as shown in Figure 7.2.
7.5 System performance and discussion
7.5.1 Ability for classification
First, we choose every customer in the 2001 financial year as the target customer and
decide its new credit score by applying the proposed system. These new credit scores
provided by the system are compared with those given by the financial managers of the
company. The comparison results are shown in Table 7.4.

Table 7.2 The classification accuracy at the second layer.

Layer   Credit score   NN       DT
L=2     2              100.0%   100.0%
        3              100.0%   50.0%
        4              100.0%   95.0%

Table 7.3 The classification accuracy at the third layer.

Layer   Credit score   NN       DT
L=3     3              100.0%   0.0%
        4              100.0%   100.0%

Figure 7.2 Classifier in each layer of the adaptive and hierarchical system (layer 1: DT,
outputs score 1; layer 2: NN, outputs score 2; layer 3: NN, outputs scores 3 and 4).
Table 7.4 Classification results by the system.

Number of customers        Credit score provided by the system   Hit rate
                             1     2     3     4
Credit score given by   1  473     0     0     0   99.8%
the financial managers  2    0     2     0     0   100%
                        3    0     0     2     0   100%
                        4    0     0     0    20   100%

From Table 7.4, the credit scores for insolvent customers (score = 2, 3, 4) provided by the
system are 100% in agreement with the judgments of the financial managers of the company.
According to this result, our system has a very high ability to classify insolvent customers.
7.5.2 Ability for prediction
As target customers, the feature data of 493 customers in the 2002 financial year and 467
customers in the 2003 financial year were collected. For every customer of the 2002 and 2003
financial years, a new credit score is predicted by the system based on the cases of the 2001
financial year. These prediction results are also compared with the credit scores given by the
financial managers of the company, and the hit rates of prediction are summarized in Table
7.5 and Table 7.6.
Table 7.5 Prediction results for 2002’s customers.

Number of customers        Credit score provided by the system   Hit rate
                             1     2     3     4
Credit score given by   1  469     0     0    24   100%
the financial managers  2    0     2     0     0   100%
                        3    0     0     2     0   100%
                        4    1     0     0    19   95%

From Table 7.5, the hit rates are more than 95% in agreement with the judgments of the
financial managers of the company. According to this result, the proposed system has the
ability to predict not only the insolvent customers but also the healthy ones.
Table 7.6 Prediction results for 2003’s customers.

Number of customers        Credit score provided by the system   Hit rate
                             1     2     3     4
Credit score given by   1  426     0     0    24   94.6%
the financial managers  2    0     1     0     0   100%
                        3    2     0     1     0   33%
                        4    0     0     0    13   100%

From Table 7.6, the credit scores of healthy customers (score = 1) provided by the system
are more than 94% in agreement with the judgments of the financial managers of the company.
Although the hit rate for healthy customers is about 5.4% lower than that for 2002 and the hit
rate for customers with a score of 3 is only 33%, the overall prediction performance is still good.
7.6 Comparison with other methods
7.6.1 Comparison with neural network and decision tree
In order to test the performance of our system, we compared it with single classifiers based
only on a neural network or a decision tree. As above, we first chose every customer in the
2001 financial year as the training data set to construct the single classifiers; then every
customer of the 2002 and 2003 financial years was used as the target customer, and new
credit scores were decided by the single classifiers and the proposed system, respectively. The
hit rates are shown in Figure 7.3 and Figure 7.4.
Figure 7.3 The hit rates for 2002’s customers (scores 1/2/3/4: NN 99%/0%/0%/65%; DT
100%/0%/0%/50%; the proposed system 100%/100%/100%/95%).
From Figure 7.3 and Figure 7.4, it is obvious that:
(1) For the customers of score = 1, both the single classifiers based only on a neural
network or a decision tree and the proposed system have a very high hit rate; even a
single classifier can perform well when the number of customers is large.
(2) Because the numbers of customers of score = 2 and score = 3 are very small, the single
classifiers based only on a neural network or a decision tree could not identify these
customers. However, the proposed system showed hit rates of 100% and 33% and
outperforms the single classifiers using neural networks or decision trees.
7.6.2 Comparison with parallel ensemble system
To compare the proposed system with ensemble systems of multiple classifiers, here a
parallel ensemble system is constructed, as shown in Figure 7.5.
Figure 7.4 The hit rates for 2003’s customers (scores 1/2/3/4: NN 95%/0%/0%/46%; DT
93%/0%/0%/24%; the proposed system 95%/100%/33%/100%).
Figure 7.5 The parallel ensemble system with voting: the training data D is bootstrapped
into datasets D1, D2, …, Dn; each dataset trains a neural network and a decision tree, and the
combined classifier produces the final result by majority voting.
Firstly, we use the bagging technique to make several data sets; for each dataset, two
kinds of single classifiers, based on a neural network and a decision tree, are trained in
parallel. As majority voting is the most commonly used method for combining different
classifiers, we also combine the classification results of the single classifiers by majority
voting. In our study, the number of single classifiers ranges from 5 to 15, and the best results
are selected for comparison with the proposed system.
After every single classifier was trained using the customers’ data of the 2001 financial year,
the new credit scores of the customers in the 2002 and 2003 financial years were predicted by
the parallel ensemble system of Figure 7.5 and the hit rates were compared with those of the
proposed system. The comparison result is shown in Table 7.7.
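The combination step of the parallel ensemble system can be sketched as a simple majority vote over the scores proposed by the single classifiers. The function below is our own minimal illustration, and the vote values in the example are hypothetical:

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent predicted credit score."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical scores proposed by seven single classifiers for one customer.
votes = [1, 1, 4, 1, 4, 1, 1]
print(majority_vote(votes))  # 1
```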
From Table 7.7, it is obvious that:
(1) When the number of single classifiers increases, the parallel ensemble system can also
give the same high accuracy as the proposed system in identifying the insolvent
customers of score 2 and score 3. Meanwhile, the proposed system outperformed the
parallel ensemble system in identifying the insolvent customers of score 4.
(2) While the proposed system has only three classifiers, the parallel ensemble system
consists of 5 to 15 single classifiers, and it is necessary to decide the optimal number of
single classifiers. From this viewpoint, the proposed system is very simple and has an
advantage over the parallel ensemble system.
7.6.3 Type I and Type II errors
Most previous studies only examine the average prediction performance of their models.
However, from Table 7.5 and Table 7.6, we have found that there are usually two types of
errors in the prediction results.
Table 7.7 The comparison with the parallel ensemble system.

The hit rates for 2002’s customers
Method                    score=1   score=2   score=3   score=4
Parallel ensemble system  96.1%     100%      100%      90%
Proposed system           100%      100%      100%      95%

The hit rates for 2003’s customers
Method                    score=1   score=2   score=3   score=4
Parallel ensemble system  94.2%     100%      33%       62%
Proposed system           94.6%     100%      33%       100%
Type I error: an actual bankrupt firm classified as non-bankrupt.
Type II error: an actual non-bankrupt firm classified as bankrupt.
Because type I errors represent real losses, we should improve our models or methods so as
to identify insolvent customers more accurately. Here we compare the results provided by our
system with those of the single classifiers and of the multiple classifiers combining neural
networks and decision trees by voting, examine their type I and type II errors, and show them
in Table 7.8.
Table 7.8 shows that the proposed system has lower errors than the other methods. In other
words, the proposed system is more effective at controlling type I and type II errors.
7.6.4 Comparison with other methods
We further compared the hit rates for classifying the 2002 customers with the following
three approaches:
・ CBR system: a case-based reasoning system developed by Dong [86].
・ CBR+Bagging: a credit assessment system using a hybrid method of bagging and
case-based reasoning [90].
・ TDR: the two-stage data resampling method proposed in Chapter 6 of this thesis [91].
The comparison result is shown in Table 7.9.
Table 7.8 Type I and Type II errors.

                 2002’s customers        2003’s customers
Method           Type I    Type II      Type I    Type II
Neural Network   30.0%     1.0%         46.0%     5.3%
Decision Tree    50.0%     0.0%         58.0%     6.8%
Voting           8.0%      3.8%         41.0%     5.8%
Proposed System  4.2%      0.0%         11.7%     5.3%
Table 7.9 Comparing hit rates for 2002 customers.

Customers’                         Hit rate
credit score    CBR      CBR+Bagging   TDR      Proposed approach
1               98.9%    94.4%         99.6%    100.0%
2               50.0%    100.0%        100.0%   100.0%
3               100.0%   100.0%        100.0%   100.0%
4               95.0%    90.0%         90.0%    95.0%
From Table 7.9, it is obvious that:
(1) Compared with the CBR system developed by Dong [86], the proposed approach
showed higher accuracy for classifying customers with scores of one or two. Although
the accuracy of identifying customers with a score of four is 95.0%, the same as the
CBR system, the proposed approach can classify the customers without using their
former credit scores. That is, the proposed approach uses less characteristic data than
the CBR system.
(2) Compared with the hybrid method of bagging and CBR (CBR+Bagging) [90], the
proposed approach showed higher accuracy for classifying customers with scores of
one and four, and the same accuracy for customers with scores of two and three.
Moreover, when using the CBR+Bagging approach, we have to build 10 bootstrapped
replicas and decide an appropriate sampling method, whereas the proposed approach
has a very simple structure.
(3) The proposed approach showed the same accuracy as the two-stage data resampling
method (TDR) [91] for identifying customers with scores of two or three, while its
accuracies for identifying customers with scores of one and four are higher than those
of TDR. These higher accuracies depend mainly on the adoption of the two neural
networks.
7.7 Concluding Remarks
In this chapter, we still dealt with customers’ credit scoring problems in a small company
and intended to assess the customer’s credit based only on daily transaction data such as sales,
payments by customers, amount of overdue payment, etc. The emphasis has been put on how
to improve the accuracy for identifying insolvent customers. An adaptive and hierarchical
system was proposed, where the best method can be chosen adaptively from neural networks
and decision tree in each level. It is similar to ensemble learning systems in that it uses
multiple classifiers. But it differs from ensemble learning systems mainly in that:
(1) The number of classifiers is decided by the number of customer groups and is usually
smaller than in ensemble learning systems.
(2) While typical bagging or stacking ensemble systems are constructed through a parallel
scheme, the proposed system has a hierarchical form where m-1 classifiers are
arranged hierarchically.
(3) In the proposed system, one classifier is used to identify only one group of customers
and it is not necessary to introduce any weighting algorithm or meta-level classifier to
combine the predictions from an ensemble of diverse classifiers.
The performance and effectiveness have been confirmed by applying the system to the real
problems of the company. The experimental results showed that the system can identify
insolvent customers more effectively than single classifiers based only on neural networks or
a decision tree. The system also has a higher ability to identify insolvent customers than the
parallel ensemble system based on neural networks and decision trees.
CHAPTER 8
AN INVESTIGATION INTO THE RELATIONSHIP BETWEEN
CLASSIFICATION PERFORMANCE AND DEGREE OF
IMBALANCE
8.1 Background and aims of this study
During recent years, the class imbalance problem has received much attention. A number
of solutions to the class imbalance problem have been proposed at both the data and
algorithmic levels. As summarized by López et al. [92], these solutions can be categorized
into three major groups:
(1) Data sampling: at the data level, these solutions include many different forms of re-
sampling such as random over-sampling with replacement, random under-sampling,
directed over-sampling (in which no new instances are created, but the choice of
samples to replace is informed rather than random), directed under-sampling (where,
again, the choice of instances to eliminate is informed), over-sampling with informed
generation of new samples, and combinations of the above techniques [93].
(2) Algorithmic modification: at the algorithmic level, this procedure is oriented towards
the adaptation of base learning methods, including both standard learning algorithms
and ensemble techniques, to be more attuned to class imbalance issues.
(3) Cost-sensitive learning: this type of solution incorporates approaches at the data level,
at the algorithmic level, or at both levels combined, considering higher costs for the
misclassification of instances of the positive class with respect to the negative class,
and therefore trying to minimize the higher-cost errors.
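The cost-sensitive idea in (3) can be illustrated with a toy misclassification-cost function; the cost values and labels below are our own illustrative choices, not taken from the thesis:

```python
# Misclassifying a positive ("bad" credit) instance as negative is charged more
# than the converse, so minimizing the total cost pushes a learner toward the
# minority class. Cost values are illustrative.

COST_FN = 5.0   # "bad" predicted as "good": a real loss
COST_FP = 1.0   # "good" predicted as "bad": a lost opportunity

def total_cost(y_true, y_pred):
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == "bad" and p == "good":
            cost += COST_FN
        elif t == "good" and p == "bad":
            cost += COST_FP
    return cost

y_true = ["good", "good", "bad", "bad"]
print(total_cost(y_true, ["good", "good", "good", "bad"]))  # 5.0: one costly miss
print(total_cost(y_true, ["bad", "good", "bad", "bad"]))    # 1.0: cheaper errors
```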
Some experimental studies have been carried out to compare the effectiveness of the
methods previously proposed to deal with the class imbalance problem [46][48]. Meanwhile,
the nature (concept complexity, size of the training set and class imbalance level, etc.) of the
class imbalance problem has been investigated by several researchers. Japkowicz and Stephen
[46] have argued that the class imbalance problem is a relative problem that depends on 1) the
degree of class imbalance; 2) the complexity of the concept represented by the data; 3) the
overall size of the training set; and 4) the classifier involved. They also found that the higher
the degree of class imbalance, the higher the complexity of the concept and the smaller the
overall size of the training set, the greater the effect of class imbalance on classifiers sensitive
to the problem. Furthermore, several researchers have considered how the class distribution
affects classifier performance. López et al. [92] have made a detailed analysis of the intrinsic
data characteristics and given a brief description of how they affect the performance of
classification algorithms. Although some researchers have argued that the level of class
imbalance has an effect on classifiers’ performance [93-96], López et al. [92] pointed out that
the imbalance ratio by itself does not have the most significant effect on classifiers’
performance, and that there are other issues that must be taken into account.
In this study, we carry out an experimental study to investigate the effect of different levels
of imbalanced class distribution on the performance of three techniques often used for solving
the class imbalance problem: the Synthetic Minority Over-sampling TEchnique (SMOTE),
cost-sensitive learning and ensemble learning. The purpose is to clarify the most effective
technique for different degrees of class imbalance. Our research differs from the others and
makes the following contributions:
(1) Although some researchers have studied the effects of different levels of imbalanced
class distribution on classifiers’ performance, their emphasis has been on how to
determine the best class distribution or decide the most appropriate sampling
algorithm for selecting training instances for a particular learning method such as
bagging, cost-sensitive learning, fuzzy classifiers and decision trees [93-95]. Here, we
aim at investigating the behavior and performance of the three techniques often used
for solving the class imbalance problem under varying levels of class imbalance, and
the emphasis is put on deciding the most effective technique.
(2) Almost all previous studies used several different kinds of benchmark data sets to
investigate the relationship between the level of class imbalance and classifiers’
performance. However, if the data sets are changed, the degree of class imbalance as
well as other characteristics change, and therefore it is difficult to distinguish the effect
of the degree of class imbalance from that caused by other characteristics. Here, we
design an experiment that generates training data sets with different levels of
imbalanced class distribution from one original data set.
8.2 Experimental Design
8.2.1 Training data set generating
This study aims at investigating how the degree of class imbalance affects the performance
of various classification techniques, while eliminating the influence of other factors as much
as possible. We selected the German credit data as the original training data set; it is a
widely used academic data set and can be obtained from the UCI Machine Learning Repository.
This data set has two classes, "good" and "bad" (credits), 7 numerical attributes and 13
categorical attributes; 700 instances belong to the "good" class and 300 instances belong to
the "bad" class.
First, we divide the German credit data set into two parts: the training data, consisting of
two-thirds of the instances, and the test data, consisting of the remaining one-third. Both are
randomly resampled from the original German credit data set with the ratio of bad/good
instances fixed at 3/7. The test data remains unchanged throughout the experiment.
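The split above can be sketched in a few lines; the thesis performed its resampling in Weka, so the following Python fragment is only an illustration, with a random matrix standing in for the German credit records.

```python
# Illustrative 2/3-1/3 stratified split; a random matrix stands in for the
# 1000 German credit records (300 "bad" = 1, 700 "good" = 0).
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # placeholder for the 20 attributes
y = np.array([1] * 300 + [0] * 700)    # 300 bad, 700 good

# test_size=1/3 and stratify=y keep the bad/good ratio at 3/7 in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, stratify=y, random_state=42)

print(len(y_train), len(y_test))       # roughly 666 and 334 instances
```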
Then, as shown in Figure 8.1, we fixed the imbalance ratio of bad/good instances at IR=30/70,
20/80, 15/85, 12/88, 10/90, 3/97, 2/98 and 1/99 respectively and resampled from the
original training data set. As a result, eight groups of training data, G30/70, G20/80, G15/85,
G12/88, G10/90, G3/97, G2/98 and G1/99, were generated; each group contains
thirteen training data sets with the same IR.
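The resampling scheme of Figure 8.1 can be sketched as follows; the pool sizes and the per-set size of 400 instances are illustrative assumptions, not values taken from the thesis.

```python
# Minimal sketch of the scheme in Figure 8.1: from one original training set,
# draw 13 training sets for each fixed bad/good ratio (IR). Index arrays
# stand in for the real German credit records (assumed pool sizes).
import numpy as np

rng = np.random.default_rng(0)
bad_idx = np.arange(200)          # assumed pool of "bad" training instances
good_idx = np.arange(200, 666)    # assumed pool of "good" training instances

ratios = [(30, 70), (20, 80), (15, 85), (12, 88),
          (10, 90), (3, 97), (2, 98), (1, 99)]
n_total = 400                     # assumed size of each resampled training set

groups = {}
for bad_pct, good_pct in ratios:
    n_bad = n_total * bad_pct // 100
    n_good = n_total - n_bad
    # 13 independent resamples with the same class ratio
    groups[f"G{bad_pct}/{good_pct}"] = [
        np.concatenate([rng.choice(bad_idx, n_bad, replace=True),
                        rng.choice(good_idx, n_good, replace=True)])
        for _ in range(13)
    ]

print(len(groups), len(groups["G1/99"]))   # 8 groups, 13 sets per group
```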
8.2.2 Selection of classification techniques
Current approaches to the problem of imbalanced datasets fall into two major
categories: data sampling and algorithmic modification. Nevertheless, no fully
exhaustive comparison between these models exists. In order to analyze the data sampling
methodologies against the cost-sensitive learning approach, we compare the Synthetic
Minority Over-sampling Technique (SMOTE) with the cost-sensitive C4.5 decision tree. In
addition, we include in the comparison a hybrid procedure that combines the base classifier
with boosting. Hence, SMOTE, cost-sensitive learning and ensemble learning techniques
have been selected.
As shown in Table 8.1, we use the C4.5 decision tree as the base classifier [63], firstly
because C4.5 has been widely used to deal with imbalanced data sets [78][98], and secondly
because it has been included among the top-ten data mining algorithms [99]. Combined with
the base classifier C4.5, three classification methods, C4.5+SMOTE, C4.5+Boost and
C4.5+CS, are compared in our experimental study.
Figure 8.1 Setting up training data sets. (The original training data set is resampled with
IR = 30/70, 20/80, 15/85, 12/88, 10/90, 3/97, 2/98 and 1/99; each resampling yields thirteen
training data sets, e.g. training data G30/70 with 30% bad and 70% good instances.)
Table 8.1 The classification techniques used in the experimental study.

Acronym      Base classifier   Combined algorithm                 Algorithm description
C4.5+SMOTE   C4.5              SMOTE                              The base classifier C4.5 applied to a dataset preprocessed with the SMOTE algorithm.
C4.5+Boost   C4.5              AdaBoost                           The base classifier C4.5 combined with the boosting algorithm.
C4.5+CS      C4.5              Cost-sensitive C4.5 decision tree  The cost-sensitive learning algorithm embedded into the base classifier C4.5.
8.2.3 Parameter tuning
The confidence level for the pruning strategy of C4.5 was 0.25. The tree was built using the
Weka [88] package.
The 5-nearest-neighbors scheme was applied to generate synthetic training data in SMOTE.
The over-sampling rate was set to 50%, 100%, 200%, 300%, 400% or 500%, and the most
appropriate value was selected for each dataset based on validation set performance. As
with the C4.5 algorithm, SMOTE was also run in Weka.
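The thesis ran SMOTE in Weka; purely as an illustration of the mechanism (5 nearest neighbours, an over-sampling rate given in percent), a minimal NumPy sketch might look like this. All names and the toy data are assumptions, not the thesis implementation.

```python
# Minimal SMOTE sketch: synthesise rate_pct% new minority samples by
# interpolating between each chosen minority point and one of its k nearest
# minority-class neighbours.
import numpy as np

def smote(X_min, rate_pct, k=5, rng=None):
    rng = rng or np.random.default_rng(0)
    n_new = int(len(X_min) * rate_pct / 100)
    # pairwise distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]      # k nearest neighbours of each point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))       # pick a minority sample
        j = nn[i, rng.integers(k)]         # and one of its neighbours
        gap = rng.random()                 # interpolate between the two
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.random.default_rng(1).normal(size=(30, 4))   # toy minority class
X_new = smote(X_min, rate_pct=200)                      # 200% over-sampling
print(X_new.shape)                                      # (60, 4)
```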
For the boosting classifier, the number of iterations was varied over the range
[10, 50, 100, 250, 500, 1000], and the most appropriate value was likewise selected for the
comparison with the other methods.
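An equivalent iteration sweep can be sketched with scikit-learn's AdaBoost; the thesis used Weka's boosting over C4.5, so the default tree-stump base learner and the synthetic imbalanced data set here are stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in: roughly 10% positive ("bad") instances.
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

# Sweep the iteration counts listed above; keep the best validation AUC.
best = None
for n_iter in [10, 50, 100, 250, 500, 1000]:
    clf = AdaBoostClassifier(n_estimators=n_iter, random_state=0).fit(X_tr, y_tr)
    auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
    if best is None or auc > best[1]:
        best = (n_iter, auc)

print("selected iterations:", best[0])
```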
Furthermore, we have to identify the misclassification costs associated with the positive
and negative classes for the cost-sensitive learning versions. If we misclassify a positive
sample as a negative one, the associated misclassification cost is the IR of the data set
(C(1,0) = IR), which is defined as:

IR = (the percentage of the good class) / (the percentage of the bad class)    (8.1)

The value of IR in each data set is presented in Table 8.2, which lists the data set number,
class distribution and IR.
If we misclassify a negative sample as a positive one, the associated cost is 1 (C(0,1) = 1).
The cost of classifying correctly is 0 (C(1,1) = C(0,0) = 0), because guessing the correct class
should not penalize the built model.
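This cost matrix can be approximated with per-class weights in a decision tree; the sketch below uses scikit-learn on a synthetic data set as a rough stand-in for the cost-sensitive C4.5 (Weka) used in the thesis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with roughly 10% positive ("bad") instances.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
ir = (y == 0).sum() / (y == 1).sum()   # IR = %good / %bad, as in Eq. (8.1)

# C(1,0)=IR, C(0,1)=1, C(1,1)=C(0,0)=0, expressed as class weights.
clf = DecisionTreeClassifier(class_weight={1: ir, 0: 1.0}, random_state=0)
clf.fit(X, y)
print("cost for missing a bad customer:", round(float(ir), 2))
```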
Table 8.2 The value of IR in each data set.

Data No.   Class distribution (bad/good)   IR
1          30/70                           2.33
2          20/80                           4
3          15/85                           5.7
4          12/88                           7.3
5          10/90                           9
6          3/97                            32.3
7          2/98                            49
8          1/99                            99
8.2.4 Statistical comparison of classifiers
Since this study compares three methods according to their performance in classifying
eight groups of training data sets, statistical analysis needs to be carried out in order to find
significant differences among the obtained results. Here, we use Friedman's test to compare
the AUCs obtained by the three methods; it is a well-known non-parametric statistical test
for multiple comparisons [100].
The Friedman test statistic is based on the average rank (AR) of the classification
techniques across the data sets. Let D be the number of data sets used in the study, K be the
total number of classifiers and r_i^j be the rank of classifier j on data set i; then the average
rank of classifier j is calculated as follows:

AR_j = \frac{1}{D} \sum_{i=1}^{D} r_i^j    (8.2)
The test statistic is given by

\chi_F^2 = \frac{12D}{K(K+1)} \left[ \sum_{j=1}^{K} AR_j^2 - \frac{K(K+1)^2}{4} \right]    (8.3)

\chi_F^2 is distributed according to the chi-square distribution with K-1 degrees of freedom. If
the value of \chi_F^2 is large enough, then the null hypothesis that there is no difference between
the techniques can be rejected. The Friedman statistic is well suited for this type of data
analysis as it is less susceptible to outliers [100].
Furthermore, we consider the average ranking of the classification methods in order to
show how good a method is at classifying imbalanced data sets. This ranking is obtained by
assigning a position to each method depending on its performance on each data set. The
method which achieves the best performance (AUC value) on a specific data set receives the
first rank (value 1); the method with the second-best performance is assigned rank 2, and so
forth.
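Equations (8.2) and (8.3) can be checked in a few lines of Python; the three AUC rows below are only a toy example (the first three data sets of group G30/70), not the full comparison.

```python
# Rank the K classifiers on each of the D data sets, average the ranks
# (Eq. 8.2), and form the chi-square statistic (Eq. 8.3).
import numpy as np
from scipy.stats import chi2, rankdata

auc = np.array([[0.60, 0.67, 0.73, 0.67],    # rows = data sets
                [0.61, 0.69, 0.68, 0.67],    # cols = C4.5, +SMOTE, +Boost, +CS
                [0.68, 0.71, 0.69, 0.68]])
D, K = auc.shape

# rank 1 = best AUC on each data set (ties get averaged ranks)
ranks = np.apply_along_axis(lambda r: rankdata(-r), 1, auc)
ar = ranks.mean(axis=0)                                          # Eq. (8.2)
chi2_f = 12 * D / (K * (K + 1)) * (np.sum(ar ** 2)
                                   - K * (K + 1) ** 2 / 4)       # Eq. (8.3)
p = chi2.sf(chi2_f, K - 1)       # chi-square with K-1 degrees of freedom
print(ar, round(chi2_f, 2))
```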
8.3 Experimental Results and Discussion
Table 8.3 shows the AUC values obtained by the four methods C4.5, C4.5+SMOTE,
C4.5+Boost and C4.5+CS: classifiers were trained on the data sets from the eight groups of
training data, and their AUCs were then evaluated on the test data [101]. For each level of
imbalance, the Friedman test statistic and the corresponding p-value are also shown in
Table 8.3. The average rank (AR) of the four methods on each group of training data is
shown in the rightmost column of Table 8.3, and a comparison of the ARs is given in
Figure 8.2. The maximum AUC on each data set as well as the best average rank among the
four methods is underlined.
From Table 8.3 and Figure 8.2, it is clear that:
Among the eight groups of training data sets, the minimum Friedman test statistic is
19.36, with a corresponding p-value of 0.023%. It is therefore clear that there are
significant differences in the classification performance of the four methods.
As all of the Friedman test statistics correspond to very low p-values (p<0.001), the
classification performance of the four methods C4.5, C4.5+SMOTE, C4.5+Boost
and C4.5+CS varies significantly with the degree of class imbalance, and no single
method is always effective at all degrees of class imbalance.
The base classifier C4.5 gave the lowest AUCs at all degrees of class imbalance. In
other words, combining a base classifier with SMOTE, cost-sensitive learning or
ensemble learning is a very effective approach to the class imbalance problem.
When classifiers were trained on data sets from groups G30/70 (bad/good=30/70)
and G20/80 (bad/good=20/80), C4.5+Boost was the best performing classification
method, with AR values of 1.23 and 1.54. The next best performing classifier was
C4.5+SMOTE.
Table 8.3 The AUCs of four classifiers on eight groups of training data.

Training data G30/70 (bad/good=30/70), Friedman test statistic = 26.82 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.60  0.61  0.68  0.62  0.62  0.67  0.58  0.63  0.61  0.60  0.63  0.63  0.72  3.46
C4.5+SMOTE    0.67  0.69  0.71  0.67  0.68  0.66  0.66  0.67  0.68  0.67  0.68  0.65  0.69  2.00
C4.5+Boost    0.73  0.68  0.69  0.70  0.70  0.67  0.71  0.69  0.69  0.73  0.72  0.67  0.70  1.23
C4.5+CS       0.67  0.67  0.68  0.66  0.67  0.65  0.62  0.61  0.67  0.67  0.68  0.61  0.68  3.31

Training data G20/80 (bad/good=20/80), Friedman test statistic = 27.83 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.64  0.61  0.61  0.59  0.61  0.58  0.56  0.58  0.59  0.57  0.54  0.62  0.56  3.85
C4.5+SMOTE    0.61  0.68  0.65  0.62  0.66  0.62  0.66  0.67  0.66  0.63  0.62  0.66  0.65  1.70
C4.5+Boost    0.67  0.65  0.64  0.66  0.65  0.65  0.64  0.66  0.63  0.64  0.65  0.67  0.66  1.54
C4.5+CS       0.64  0.59  0.62  0.61  0.62  0.61  0.60  0.60  0.66  0.59  0.58  0.62  0.60  2.92

Training data G15/85 (bad/good=15/85), Friedman test statistic = 19.98 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.56  0.57  0.56  0.58  0.57  0.56  0.57  0.60  0.53  0.60  0.57  0.59  0.56  3.85
C4.5+SMOTE    0.62  0.61  0.62  0.67  0.61  0.61  0.60  0.60  0.60  0.66  0.65  0.67  0.65  1.77
C4.5+Boost    0.64  0.58  0.62  0.61  0.60  0.59  0.61  0.61  0.63  0.62  0.62  0.61  0.61  2.31
C4.5+CS       0.63  0.63  0.61  0.64  0.60  0.62  0.66  0.60  0.65  0.63  0.63  0.61  0.59  2.08

Training data G12/88 (bad/good=12/88), Friedman test statistic = 21.76 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.57  0.56  0.53  0.58  0.52  0.60  0.52  0.53  0.57  0.60  0.55  0.57  0.58  3.54
C4.5+SMOTE    0.58  0.61  0.61  0.66  0.59  0.61  0.63  0.61  0.64  0.67  0.60  0.65  0.62  1.23
C4.5+Boost    0.58  0.57  0.61  0.60  0.56  0.58  0.61  0.60  0.61  0.62  0.58  0.63  0.58  2.62
C4.5+CS       0.57  0.64  0.59  0.62  0.59  0.58  0.58  0.64  0.56  0.60  0.59  0.61  0.61  3.31

Training data G10/90 (bad/good=10/90), Friedman test statistic = 26.82 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.50  0.51  0.57  0.52  0.50  0.51  0.50  0.56  0.51  0.50  0.50  0.54  0.53  4.00
C4.5+SMOTE    0.57  0.60  0.61  0.59  0.61  0.57  0.58  0.60  0.61  0.59  0.62  0.57  0.59  1.69
C4.5+Boost    0.57  0.56  0.58  0.57  0.59  0.58  0.56  0.59  0.55  0.57  0.57  0.59  0.59  2.54
C4.5+CS       0.58  0.63  0.64  0.56  0.60  0.61  0.62  0.59  0.58  0.58  0.59  0.61  0.57  1.77

Training data G3/97 (bad/good=3/97), Friedman test statistic = 34.57 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.52  0.52  0.50  0.50  0.50  3.92
C4.5+SMOTE    0.52  0.52  0.51  0.52  0.51  0.52  0.51  0.53  0.52  0.52  0.53  0.52  0.56  1.69
C4.5+Boost    0.55  0.55  0.51  0.54  0.55  0.53  0.54  0.58  0.54  0.54  0.55  0.56  0.58  3.08
C4.5+CS       0.55  0.57  0.54  0.56  0.57  0.53  0.52  0.53  0.55  0.55  0.57  0.53  0.58  1.31

Training data G2/98 (bad/good=2/98), Friedman test statistic = 24.72 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  3.92
C4.5+SMOTE    0.54  0.50  0.53  0.51  0.52  0.52  0.54  0.52  0.51  0.54  0.55  0.58  0.54  1.92
C4.5+Boost    0.52  0.51  0.55  0.50  0.53  0.51  0.52  0.52  0.51  0.51  0.52  0.54  0.52  2.46
C4.5+CS       0.56  0.52  0.55  0.51  0.55  0.52  0.56  0.54  0.50  0.48  0.53  0.60  0.57  1.54

Training data G1/99 (bad/good=1/99), Friedman test statistic = 19.36 (p<0.001)
Method        #1    #2    #3    #4    #5    #6    #7    #8    #9    #10   #11   #12   #13   AR
C4.5          0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  0.50  3.62
C4.5+SMOTE    0.54  0.53  0.53  0.51  0.50  0.52  0.52  0.54  0.52  0.53  0.52  0.50  0.51  1.77
C4.5+Boost    0.51  0.51  0.51  0.51  0.51  0.51  0.51  0.54  0.50  0.52  0.53  0.50  0.51  2.54
C4.5+CS       0.52  0.54  0.53  0.53  0.49  0.53  0.50  0.55  0.50  0.55  0.52  0.52  0.55  1.69
When the training data sets came from groups G15/85 (bad/good=15/85), G12/88
(bad/good=12/88) and G10/90 (bad/good=10/90), the best average rank (AR) was
achieved by the C4.5+SMOTE classifier.
When the imbalance degree increased further and the training data group became G3/97
(bad/good=3/97), G2/98 (bad/good=2/98) or G1/99 (bad/good=1/99), the effect of
cost-sensitive learning gradually became remarkable, and as a result C4.5+CS
provided the best average rank across these three groups of data sets.
In addition to the comparison of the average ranks described above, a comparison can also
be made from the viewpoint of the AUC values. Figure 8.3 shows the average of the AUCs
on each group of training data.
From Figure 8.3, it can be observed that:
As the degree of class imbalance increases, the average AUC decreases almost
monotonically for all four methods. Compared with the average AUC on training data
group G30/70, the average AUC on training data group G1/99 is reduced by
20%-26%. It is clear that the degree of imbalance has a strong effect on the
performance of the classification methods.
Comparing the AUCs obtained by C4.5+SMOTE, C4.5+Boost and C4.5+CS
respectively, the range (=maximum – minimum) of the average AUC on the same
group of training data is 4%-8% of the maximum. Therefore, the difference in the
average AUCs obtained by the three methods is very small.

Figure 8.2 AR comparison on eight groups of training data.
According to these observations, the following has been clarified:
(1) The degree of class imbalance significantly affects the performance of the four
classification methods: C4.5, C4.5+SMOTE, C4.5+Boost and C4.5+CS. When the
degree of class imbalance increases, i.e. the imbalance ratio decreases, the performance
of the classification methods deteriorates markedly. Meanwhile, no method is
always effective at all degrees of class imbalance.
(2) Although the difference in the average AUCs is very small, the most effective
classification method changes significantly with the degree of class
imbalance. When the degree of class imbalance is low (imbalance ratio of 30/70 or
20/80), C4.5+Boost is the best performing classification technique. At the middle level
of class imbalance, where the imbalance ratio is reduced to 15/85, 12/88 or 10/90, the
C4.5+SMOTE classifier outperforms the other methods. Moreover, when the imbalance
degree is high and the imbalance ratio is reduced to 3/97, 2/98 or 1/99, the cost-
sensitive learning based classifier C4.5+CS provides the best performance.
8.4 Concluding Remarks
In our study, we carried out an experimental study to investigate the effect of the degree of
class imbalance on the performance of three representative techniques often used for solving
the class imbalance problem: SMOTE, cost-sensitive learning and ensemble learning
(boosting). In order to isolate the effect of the degree of class imbalance and exclude the
Figure 8.3 Average of AUCs on eight groups of training data. (Legend: C4.5, C4.5+SMOTE,
C4.5+Boost, C4.5+CS; horizontal axis G30/70 through G1/99.)
influence of other factors, we designed an experiment that generates training data sets with
different levels of imbalanced class distribution from one original data set. Through our
experiment, it is clear that the degree of imbalance has a strong effect on the performance of
classification methods, and that their performance deteriorates markedly as the degree of
class imbalance increases.
It is also clear that no method is always effective at all degrees of class
imbalance. The cost-sensitive learning based classifiers perform well at very
high degrees of class imbalance (imbalance ratios of 3/97, 2/98 and 1/99), the SMOTE
classifiers work well when the class imbalance is at a middle level (imbalance ratios of
15/85, 12/88 and 10/90), and the boosting methods outperform the others at low
degrees of class imbalance (imbalance ratio of 30/70 or 20/80). According to these results,
we conclude that selecting the appropriate classification technique is very important when
dealing with class imbalance problems.
CHAPTER 9
CONCLUSIONS
In this PhD thesis, we mainly addressed three issues relating to credit scoring: building
credit scoring methods and systems for imbalanced credit scoring data sets, especially
extremely imbalanced ones; contributing to the literature on small and medium enterprise
credit scoring; and investigating the relationship between classification performance and the
degree of imbalance.
In this chapter, we present the conclusions that can be drawn from the research undertaken
in this thesis.
The main contributions of the thesis are summarized below:
(1) In this thesis, some effective approaches have been proposed to assess credit for small
and medium enterprise.
The two systems are suitable to be applied to many of organizations where the
customers do not disclose their financial data and they are also very easy to be
incorporated into existing information systems. Furthermore, the proposed systems are
more valuable and can be applied more widely than other existing credit assessment
systems.
(2) For the class imbalance problem, the emphasis has been put on improving the
ability to identify the minority class.
The two-stage resampling method uses k-means algorithms to perform under-
sampling on the majority class of customers. In order to avoid losing information,
we introduced a pre-classification step to pick up customers of the majority class
whose information was not reflected in the initial under-sampling result.
An adaptive and hierarchical system chooses the best method adaptively from
neural networks and decision trees, based on the accuracy of identifying customers
at every credit score.
The experimental results showed that the performance in classifying the minority class
was improved significantly by the two proposed approaches.
(3) We carried out an investigation into the relationship between classification performance
and degree of imbalance.
Techniques including the single C4.5 classifier and hybrids of C4.5 with boosting, the
Synthetic Minority Over-sampling Technique and cost-sensitive learning algorithms
were used in the analysis of different degrees of class imbalance. The results help to
select appropriate methods for problems with different degrees of imbalance. We also
found that, when faced with a large class imbalance, cost-sensitive learning performs
very well.
References
[1] Van Gestel, T. and B. Baesens, Credit Risk Management: Oxford University Press.
[2] Edelman, D.B. and J.N. Crook, Credit Scoring and its Applications. Society for Industrial
Mathematics: Philadelphia, 2002.
[3] Tsai, C.F. and M.-L. Chen, Credit Rating by Hybrid Machine Learning Techniques.
Applied Soft Computing, 10 (2), 374-380, 2010.
[4] Leung, K, Cheong, F, Cheong, C, O’ Farrell, S and Tissington, R, A Comparison of
Variable Selection Techniques for Credit Scoring. In Proceedings of the 7th
International
Conference on Computational Intelligence in Economics and Finance, Taiwan,
December 5-7, 2008.
[5] Cios, K.J., et al., Data mining methods for knowledge discovery. Kluwer Academic
Publishers, 1998.
[6] Tan, P.N., Steinbach, M. and Kumar, V., Introduction to data mining. Pearson Addison
Wesley Boston, 2006.
[7] Altman E., Financial Ratios, Discriminant Analysis and Prediction of Corporate
Bankruptcy. Journal of Finance, 23 (4), 589-609, 1968.
[8] Baesens B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J. and Vanthienen, J.,
Benchmarking State of Art Classification Algorithms for Credit Scoring. Journal of the
Operational Research Society, 54 (6), 627-635, 2003.
[9] Desai, V.S., Crook, J.N., Overstreet Jr G.A., A Comparison of Neural Networks and
Linear Scoring Models in the Credit Union Environment. European Journal of Operation
Research Society, 95 (1), 24-37, 1996.
[10] Karels, G.V., and Prakash, A.J., Multivariate Normality and Forecasting of Business
Bankruptcy. Journal of Business Finance and Accounting, 14 (4), 573-593,1987.
[11] Reichert, A.K., Cho, C.-C and Wagner, G.M., An Examination of the Conceptual Issues
Involved in Developing Credit Scoring models. Journal of Business and Economic
Statistics, 1 (2), 101-114, 1983.
[12] West, D., Neural Network Credit Scoring models. Computers & Operations Research,
27(11-12), 1131-1152, 2000.
[13] Yobas, M.B., Crook, J.N. and Ross, P., Credit Scoring using Neural and Evolutionary
Techniques. IMA Journal of Management Mathematics, 11 (2), 111-125, 2000.
[14] Arminger,G., Enache, D and Bonne, T., Analyzing Credit Risk Data: A Comparison of
Logistic Discrimination, Classification Tree Analysis, and Feedforward Networks.
Computational Statistics, 12 (2), 293-310, 1997.
[15] Steenackers, A. and Goovaerts M.J., A Credit Scoring Model for Personal Loans.
Insurance: Mathematics and Economics, 8 (1), 31-34, 1989.
[16] Wiginton, J.C., A Note on the Comparison of Logit and Discriminant Models of
Consumer Credit Behavior. Journal of Financial and Quantitative Analysis, 15, 757-770,
1980.
[17] Friedman, J.H., Multivariate Adaptive Regression Splines. The Annals of Statistics, 19 (1),
1-67,1991.
[18] Altman, E., Corporate Distress Diagnosis: Comparisons Using Linear Discriminant
Analysis and Neural Networks. Journal of Banking & Finance, 18 (3), 505-529, 1994.
[19] Hung, C. and Chen, J.H., A Selective Ensemble Based on Expected Probabilities for
Bankruptcy Prediction. Expert System with Applications, 36(3), 5297-5303, 2009.
[20] Schebesch, K.B., and Stecking, R., Support Vector Machines for Classifying and
Describing Credit Applicants: Detecting Typical and Critical Regions. The Journal of the
Operational Research Society, 56 (9), 1082-1088, 2005.
[21] Buta, Mining for financial knowledge with CBR. AI Expert, 9(2), 34-41,1994.
[22] Shin, K.S., and Han, I., A Case-Based Approach Using Inductive Indexing for Corporate
Bond Rating. Decision Support System, 32(1), 41-52, 2001.
[23] Yanwen, D., Development of a Customer Credit Evaluation System via Case-based
Reasoning Approach. Asia-Pacific Journal of Industrial Management, 1, 1-7, 2008.
[24] Piramuthu, S., Financial Credit Risk Evaluation with Neural and Neurofuzzy Systems.
European Journal of Operational Research, 112, 310-321, 1999.
[25] Tseng, F.M., Lin, L., A Quadratic Interval Logit Model for Forecasting Bankruptcy.
Omega, 33 (1), 85-91, 2005.
[26] Rafiei, F.M., Manzari, S.M., Bostanian, S., Financial Health Prediction Models Using
Artificial Neural Networks, Genetic Algorithms and Multivariate Discriminant Analysis:
Iranian Evidence. Expert Systems with Applications, 38(8), 10210-10217, 2011.
[27] Tsai, C.F., Feature selection in bankruptcy prediction. Knowledge-Based System, 22 (2),
120-127, 2009.
[28] Danenas, P., Garsva, G., Guda, S., Credit risk evaluation model development using
support vector based classifiers. Procedia Computer Science, 4, 1699-1707, 2011.
[29] Kim, M.J., Kang, D.K., Ensemble with neural networks for bankruptcy prediction. Expert
System with Applications, 37 (4), 3373-3379, 2010.
[30] Tsai, Ch.F., Wu, J.W., Using neural networks ensembles for bankruptcy prediction and
credit scoring. Expert Systems with Applications, 34 (4), 2639-2649,2008.
[31] Karels, G., Prakash, A., Multivariate Normality and Forecasting of Business Bankruptcy.
Journal of Business Finance Accounting, 14(4), 573-593, 1987.
[32] Reichert, A.K., Cho, C.C., Wagner, G.M., An Examination of the Conceptual Issues
Involved in Developing Credit-scoring Models. Journal of Business and Economic
Statistics, 101-114, 1983.
[33] Thomas, L.C., A Survey of Credit and Behavioral Scoring: Forecasting Financial Risks
of Lending to Customers. International Journal of Forecasting, 16, 147-172, 2000.
[34] Shin, K.S., Lee, T.S., and Kim, H., An Application of Support Vector Machines in
Bankruptcy Prediction Model. Expert Systems with Applications, 28, 127-135, 2005.
[35] Gestel, T.V., Baesens, B., Suykens, J.A., Van den Poel, D., Baestaens, D.E. and
Willekens, B., Bayesian Kernel based Classification for Financial Distress Detection.
European Journal of Operational Research, 172, 979-1003, 2004.
[36] Min, S.H., Lee, J., and Han, I., Hybrid Genetic Algorithms and Support Vector machines
for Bankruptcy Prediction. Expert Systems with Applications, 31, 652-660, 2006.
[37] Abdou, H., Pointon, J. and Elmasry, A., Neural Nets Versus Conventional Techniques in
Credit Scoring in Egyptian Banking. Expert Systems with Application, 35 (3), 1275-
1292,2008.
[38] Pang, S. and Gong, J., Classification Algorithms and Application on Individual Credit
Evaluation of Banks. Systems Engineering Theory & Practice, 29 (12), 94-104, 2009.
[39] Polikar, R., Ensemble Based Systems in Decision Making. IEEE Circuits and Systems
Magazine, 6 (3), 21-45, 2006.
[40] Dasarathy, B.V., Sheela, B.V., Composite Classifier System Design: Concepts and
Methodology. Proceedings of the IEEE, 67 (5), 708-713, 1979.
[41] Hansen, L.K., Salamon, P., Neural Networks Ensemble. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 12 (10), 993-1001, 1990.
[42] Schapire, R.E., The Strength of Weak Learnability. Machine Learning, 5 (2), 197-227,
1990.
[43] Dietterich, T.G., Machine Learning Research: Four Current Directions. AI Magazine, 18
(4), 97-136, 1997.
[44] Windeatt, T., Ardeshir, G., Decision Tree Simplification for Classifier Ensembles.
International Journal of Pattern Recognition, 18 (5), 749-776, 2004.
[45] Basel Committee on Banking Supervision, Basel Committee Newsletter No.6: Validation
of low-default portfolios in the Basel II Framework. Technical Report, Bank for
International Settlements.
[46] Japkowicz, N., Stephen, S., The Class Imbalance Problem: A Systematic Study. Journal
Intelligent Data Analysis, 6(5), 429-449, 2002.
[47] Weiss, G., Mining With Rarity: A Unifying Framework. SIGKDD Explorations Special
Issue on Learning from Imbalanced Datasets, 6 (1), 7-19, 2004.
[48] Weiss G., Provost F.J., Learning When Training Data are Costly: The Effect of Class
Distribution on Tree Induction. Journal of Artificial Intelligence Research, 19, 315-354,
2003.
[49] Chawla, N.V., Japkowicz, N., Kotcz, A., Editorial: Special Issue on Learning from
Imbalanced Datasets. SIGKDD Explorations, 6, 1-6, 2004.
[50] Joshi, M.V., Learning Classifier Models for Predicting Rare Phenomena, Ph.D.,
University of Minnesota, Twin Cites, MN, USA, 2002.
[51] Crone, S.F., Finlay, S., Instance Sampling in Credit Scoring: An empirical study of
sample size and balancing. International Journal of Forecasting, 28, 224-238, 2012.
[52] Yao, P., Hybrid Classifier Using Neighborhood Rough Set and SVM for Credit Scoring.
International Conference on Business Intelligence and Financial Engineering, 138-142,
2009.
[53] Kennedy, K., et al., Using semi-supervised Classifiers for Credit Scoring. The Journal of
the Operational Research Society, 64, 513-529, 2013.
[54] Provost, F., Jensen, D. and Oates, T., Efficient Progressive Sampling. In Proceedings of
the Fifth International Conference on Knowledge Discovery and Data Mining, 23-32,
1999.
[55] Chawla, N.W., et al., SMOTE: Synthetic Minority Over-sampling Techniques. Journal of
Artificial Intelligence Research, 16, 321-357, 2002.
[56] Japkowicz, N., Learning from Imbalanced Data sets: A Comparison of Various Strategies.
AAAI Workshop on Learning from Imbalanced Data Sets, 6, 10-15, 2000.
[57] Batista, G., A Study of the Behavior of Several Methods for Balancing Machine Learning
Training Data. ACM SIGKDD Explorations Newsletter, 6 (1), 20-29, 2004.
[58] Sadatrasoul, S.M., et al., Credit Scoring in Banks and Financial Institutions via Data
Mining Techniques: A Literature Review. Journal of Artificial Intelligence and Data
Mining, 1 (2), 119-129, 2013.
[59] Hung, C., Chen, J.-H., Wermter, S., Hybrid Probability-Based Ensemble for Bankruptcy
Prediction. In Proceedings of International Conference on Business and Information,
July 11-13, Tokyo, Japan.
[60] Witten, I.H. and Frank, E., Data Mining. Morgan Kaufmann Publishers: Elsevier, 2005.
[61] Benjamin, N. et al., Low Default Portfolios: A Proposal for Conservative Estimation of
Default Probabilities. Discussion Paper, Financial Services Authority, 2006.
[62] Osuna, R.G., Lecture Notes CS 790: Introduction to Pattern Recognition. Wright State
University, Dayton, Ohio,USA, 2002.
[63] Quinlan, J.R., C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo,
California, 1993.
[64] Haykin, S., Neural Networks: A Comprehensive Foundation, New Jersey: Prentice Hall,
1999.
[65] West, D., Dellana, S. and Qian, J., Neural network ensemble strategies for financial
decision applications. Computers & Operations Research, 32 (10), 2543-2559, 2005.
[66] Skurichina, M., Duin, R.P.W., Bagging, Boosting and The Random Subspace Method for
Linear Classifiers. Pattern Analysis and Applications, 5(2), 121-135, 2002.
[67] Breiman, L., Bagging Predictors. Machine learning, 24 (2), 123-140, 1996.
[68] Freund, Y., and Schapire, R.E., A Decision-theoretic generalization of On-line Learning
and an Application to Boosting. Journal of Computer and System Science, 55 (1), 119-
139, 1997.
[69] Wolpert, D.H., Stacked Generalization. Neural Networks, 5 (2), 241-259, 1992.
[70] Breiman, L., Random Forest. Machine Learning, 45 (1), 5-32, 2001.
[71] Chawla, N.V., Japkowicz, N. and Kolcz, A., Editorial: Special Issue On Learning From
Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, 6 (1), 1-6, 2004.
[72] Hulse, J., Khoshgoftaar, T., Napolitano, A., Experimental perspectives on Learning from
imbalanced data, In Proceedings of the 24th
International Conference on Machine
Learning, 935-942, 2007.
[73] López, V., Fernández, A., Moreno-Torres, J.G., Herrera, F., Analysis of preprocessing vs.
cost-sensitive learning for imbalanced classification. Open problems on intrinsic data
characteristics. Expert Systems with Applications, 39 (7), 6585-6608, 2012.
[74] Haibo, He, Edwardo, A., Garcia, Learning from Imbalanced Data, IEEE Transactions on
Knowledge and Data Engineering, 21 (9), 1263-1284, 2009.
[75] Ting, K.M., An instance-weighting method to induce cost-sensitive trees. IEEE
Transaction on Knowledge and Data Engineering, 14 (3), 659-665, 2002.
[76] Kubat, M and Matwin, S., Addressing the curse of imbalanced training sets: one-sided
selection. Proceedings of the Fourteenth International Conference on Machine Learning,
179-186, 1997.
[77] Ertekin, S., Huang, J., Bottou, L., and Giles, C.L., Learning on the Border: Active
Learning in Imbalanced Data Classification. In Proceedings of the Sixteenth ACM
Conference on Information and Knowledge Management, 127-136, Portugal, 2007.
[78] Su, C.T. and Hsiao, Y.H., An evaluation of the robustness of MTS for imbalanced data.
IEEE Transactions on Knowledge and Data Engineering, 19(10), 1321-1332,2007.
[79] Weiss, G.M., Mining with Rarity: a Unifying Framework. SIGKDD Explorations and
Newsletters, 6,7-19, 2004.
[80] Lu, Y., Guo, H. and Feldkamp, L., Robust Neural Learning from Unbalanced Data
Samples. In IEEE International Joint Conference on Neural Networks, IEEE World
Congress on Computational Intelligence, 3, 1816-1821, 1998.
[81] Yoon, K. and Kwek, S., A Data Reduction Approach for Resolving the Imbalanced Data
Issue in Functional Genomics. Neural Computing and Applications, 16 (3), 295-306,
2007.
[82] Zadrozny, B. and Elkan, C., Learning and Making Decisions When Costs and
Probabilities are Both Unknown. In Proceedings of the 7th International Conference on
Knowledge Discovery and Data Mining, 204-213, 2001.
[83] Domingos, P., Metacost: A General Method for Making Classifiers Cost-Sensitive, In
Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, 155-164, 1999.
[84] Pazzani, M., Murphy, P., Ali, K., Hume, T. and Brunk, C., Reducing Misclassification
Costs. In Proceedings of the 11th International Conference on Machine Learning, 217-
225, 1994.
[85] Turney, P., Types of Cost in Inductive Concept Learning. In Proceedings of the
Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on
Machine Learning, 15-21, 2000.
[86] Dong, Y., Development of a Customer Credit Evaluation System via Case-based
Reasoning Approach, ASIA-PACIFIC Journal of Industrial Management, 1 (1), 2008.
[87] Dong, Y., Hao, X. and Yu, C., Comparison of Statistical and Artificial Intelligence
Methodologies in Small-Businesses’ Credit Assessment Based On Daily Transaction
Data, ICIC Express Letters, An International Journal of Research and Surveys, 5 (5),
1725-1730, 2011.
[88] Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and
Techniques. Morgan Kaufmann, San Francisco, 2005.
[89] Dong, Y., Application of Bagging for Solving Small-Businesses’ Credit Assessment
Problems Based on Daily Transaction Data. China Management Information, 115-119,
2009.
[90] Dong, Y., Application of Hybrid Method of Bagging and Case-Based Reasoning to Solve
Small-Businesses’ Credit Assessment Problems. Information, 14 (2), 399-409, 2011.
[91] Hao, X. and Dong, Y., A New Approach Based on Class Imbalance Learning for Small-
business Credit Assessment. Journal of Japan Industrial Management Association,
64(2E), 325-335, 2013.
[92] López, V., Fernández, A., García, S., Palade, V. and Herrera, F., An insight into
classification with imbalanced data: Empirical results and current trends on using data
intrinsic characteristics. Information Sciences, 250, 113-141, 2013.
[93] Visa, S. and Ralescu, A., Issues in mining imbalanced data sets: a review paper. In
Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science
Conference, 67-73, 2005.
[94] Visa, S. and Ralescu, A., The effect of imbalanced data class distribution on fuzzy
classifiers: experimental study. IEEE International Conference on Fuzzy Systems,
FUZZ-IEEE 2005, 749-754, 2005.
[95] Chen, N., Chen, A. and Ribeiro, B., Influence of class distribution on cost-sensitive
learning: A case study of bankruptcy analysis. Intelligent Data Analysis, 17 (3),
423-437, 2013.
[96] Liang, G., Zhu, X., and Zhang, C., The effect of varying levels of class distribution on
bagging for different algorithms: An empirical study. International Journal of Machine
Learning and Cybernetics, 5(1), 63-71, 2014.
[97] Kotsiantis, S., Kanellopoulos, D. and Pintelas, P., Handling Imbalanced Datasets: A
Review. GESTS International Transactions on Computer Science and Engineering, 30 (1),
25-36, 2006.
[98] García, S., Fernández, A., Luengo, J. and Herrera, F., A study of statistical techniques
and performance measures for genetics-based machine learning: accuracy and
interpretability. Soft Computing, 13 (10), 959-977, 2009.
[99] Wu, X. and Kumar, V., The top ten algorithms in data mining. Data mining and
Knowledge Discovery Series. Chapman and Hall/CRC press, 2009.
[100] Friedman, M., A comparison of alternative tests of significance for the problem of m
rankings. Annals of Mathematical Statistics, 11(1), 86-92, 1940.
[101] Hao, X., Dong, Y. and Wu, S., Dealing with severely imbalanced credit scoring
dataset. Proceedings of 2012 Asian Conference of Management Science & Applications
(ACMSA2012), 69-74, 2012.