bura.brunel.ac.ukbura.brunel.ac.uk/bitstream/2438/12932/1/Fulltext.docx · Web viewThe utilisation of the different techniques in building credit-scoring models have varied over time,

A new hybrid ensemble credit scoring model based on classifiers consensus system approachMaher Ala’raj*, Maysam F. Abbod

Department of Electronic and Computer Engineering, College of Engineering, Design and Physical Sciences, Brunel University London, Kingston Lane, Uxbridge UB8 3PH, UK

ABSTRACT

During the last few years there has been marked attention towards hybrid and ensemble systems development, having proved their ability to be more accurate than single classifier models. However, among the hybrid and ensemble models developed in the literature there has been little consideration given to: 1) combining data filtering and feature selection methods 2) combining classifiers of different algorithms; and 3) exploring different classifier output combination techniques other than the traditional ones found in the literature. In this paper, the aim is to improve predictive performance by presenting a new hybrid ensemble credit scoring model through the combination of two data pre-processing methods based on Gabriel Neighbourhood Graph editing (GNG) and Multivariate Adaptive Regression Splines (MARS) in the hybrid modelling phase. In addition, a new classifier combination rule based on the consensus approach (ConsA) of different classification algorithms during the ensemble modelling phase is proposed. Several comparisons will be carried out in this paper, as follows: 1) Comparison of individual base classifiers with the GNG and MARS methods applied separately and combined in order to choose the best results for the ensemble modelling phase; 2) Comparison of the proposed method with all the base classifiers and ensemble classifiers with the traditional combination methods; and 3) Comparison of the proposed approach with recent related studies in the literature. Five of the well-known base classifiers are used, namely, neural networks (NN), support vector machines (SVM), random forests (RF), decision trees (DT), and naïve Bayes (NB). The experimental results, analysis and statistical tests prove the ability of the proposed method to improve prediction performance against all the base classifiers, hybrid and the traditional combination methods in terms of average accuracy, the area under the curve (AUC) H-measure and the Brier Score. The model was validated over seven real world credit datasets.

Keywords—credit scoring; consensus approach; classifier ensembles; hybrid models; data filtering; feature selection.

1. Introduction

1.1. Background

Managing customer credit is an important issue for commercial banks and hence, they take great care when dealing with customer loans to avoid any improper decisions that can lead to loss of opportunity or financial losses. The manual estimation of customer creditworthiness has become both time- and resource-consuming. Moreover, a manual approach is subjective (dependable on the bank employee who gives this estimation), which is why devising and implementing programming models that provide loan estimations is being engaged in, as it can eradicate the ‘human factor’ in this process. Such a model should be able to provide recommendations to the bank in terms of whether or not a loan should be given and/or give a probability in relation to whether the loan will be returned. Nowadays, a number of models have been designed, but there is no ideal classifier amongst them as each gives some percentage of incorrect outputs, which is a critical consideration when each percentage point of an incorrect answer can mean millions of dollars of losses for large banks. The area of credit-scoring has become a extensively researched topic by scholars and the financial industry (Kumar & Ravi, 2007; Lin et al., 2012), with many models having been proposed and developed using statistical approaches, such as logistic regression (LR) and linear discriminate analysis (LDA) (Desai et al., 1996; Baesens et al., 2003). As a result of the financial crises, the Basel Committee on Banking Supervision requested all banks to apply rigorous credit evaluation models in their systems when granting a loan to an individual client or a company. Accordingly, research studies have demonstrated that artificial intelligence (AI) techniques, (e.g. neural networks, support vector machines and random forests) can be a good replacement for statistical approaches in building credit scoring models (Atiya, 2000; Bellotti and Crook, 2009).

The utilisation of the different techniques in building credit-scoring models have varied over time, with researchers initially using each technique individually, and then later, in order to overcome shortcomings of applying them in this way, they started to customise the design of credit-scoring models. That is, they began to introduce complexity into their designs, with new approaches, such as hybrid and ensemble modelling that provided better performance than the use of individual techniques. Hybrid and ensemble approaches can be utilised independently or in combination. The basic idea behind the former is to conduct a pre-processing step for the data that are fed to the classifiers. Hybrid modelling can take many forms, including: 1) cascading different approaches (Lee et al., 2002; Garcia et al., 2012); 2) combining clustering and classification methods (Tsai and Chen, 2010); and using synergetic ways in combining different methods into single approaches such as fuzzy-based rules (Gorzałczany and Rudzinski, 2016); the focus of this paper is with regards to the first type of technique. Whilst ensemble modelling focuses on gathering the insight of a group of classifiers trained on the same problem and using their opinions to reach an effective classification decision. In spite of complex modelling being associated with significant financial and computational cost, we believe that complexity leads to better and universal classification models for credit-scoring and hence, the creation of a complex model for this purpose is the main aim of this paper.

1.2. Research motivations

Our The research motivation in this paper is based on three factorsolds: 1) data filtering, 2) feature selection and 3) ensemble combination. The general process in credit scoreing modelling is to use the credit history of former clients to compute and predict new applicants risk of default (Tsai and Wu 2008; Thomas 2009). The collected historical loans will beare used to build a credit scoring model that maps the attributes or features of each loan applicant to measure their probability of default. The number of the available features makes up the feature space and where the high dimensionality in this feature space has advantages but also some serious deficiencies (Yu and Liu, 2003). In practice, real historical datasets are used in order to develop credit-scoring models; these datasets might differ in size, nature, and the information or characteristics it holds, which may cause difficulties in classifiers

* Corresponding author. Tel.: +447466925096.E-mail addresses: maher.ala’[email protected] (M. Ala’raj), [email protected] (M.F. Abbod).

* Corresponding author. Tel.: +447466925096.E-mail addresses: maher.ala’[email protected] (M. Ala’raj), [email protected] (M.F. Abbod).

mailto:[email protected]




training and hence will not to be able to capture different relationships of these datasets characteristics. Such datasets may include noisy data, missing values, redundant or irrelevant features and complexity of distributions (Piramuthu 2006). Practically, the more features in a dataset is, the more computation time is required and this could lead to low model accuracy and inefficient scoring interpretation results (Liu and Schumann 2005). One solution is to perform a feature selection on the original data. Also however, another problem could rise with the original dataset is that that there may not be any data points located at the exact points that wcould make for the most accurate and concise concept description (Wilson and Martinez, 2000). Another solution is to reduce the original data by filtering or removing it.

The majority of credit scoring studies have used feature selection step as a pre-processing step to clean their data from any noise that can disturb the training process (Liu and Schumann, 2005; Huang et al., 2007; Bellotti and Crook, 2009; Tsai, 2009; Chen and Ma, 2009; Chen and Li, 2010l; Tsai & Chen, 2010; Akkoc, 2012; Tsai, 2014; Harris, 2014). On the other hand, up best to our knowledge; there are just 3 studies which have considered data filtering in their approaches (Tsai and Chou, 2011; Tsai and Cheng, 2012; Garcia et al., 2012). Very few studies considered multi stages modelling in their approach (e.g. Tsai and Chou, 2011) and in order to fill this gap, we have considered a multi-stage model is considered based on data filtering and feature selection combined in addition to our the proposed approach which is based on classifiers cooperation.

The purpose of data-filtering is to reduce the size of the original dataset and to produce a representative training dataset, whilst keeping its integrity (Wilson & Martinez, 2000). Data that are noisy or contain outliers can have strong effect on model performance as can redundant and irrelevant features. According to Tsai & Chou (2011), in some cases, removing outliers can increase the classifiers’ performance and accuracy by smoothing the decision boundaries between data points or feature space. In general, outliers in a dataset mean that a sample of the dataset appears to be inconsistent within other samples in the same dataset; these data can be atypical data, data without prior class or data that are mislabelled. If all this appears in a dataset, then these outliers must be eliminated by filtering those samples that hold such characteristics that could hinder the training process, since their occurrence can lead to inefficient data training by classifiers.

Another important step in data pre-processing is feature selection, which refers to choosing the most relevant and appropriate features and accordingly removing the unneeded ones. In other words, it pertains to the process of selection of a subset of representative features that can improve model performance. The first motivation for this paper is to develop hybrid classifiers using data filtering and feature selection approaches. After training these, the obtained results from the classifiers are compared comprehensively in terms of using feature selection and data-filtering separately as well as by combining them together. To the best of our knowledge, combining a data-filtering technique with a feature-selection technique has not been considered before in the area of credit-scoring. Nowadays, the research trend has been actively moving towards using single AI techniques in building ensemble models (Wang et al., 2011). According to Tsai (2014), the idea of ensemble classifiers is based on the combination of a pool of diversified classifiers, which leads to better performance as each complements the other classifier’s errors.

In the literature on credit-scoring, most of the classifier combination techniques adopt the form of homogenous and heterogeneous classifier ensembles, where the former combines the classifiers of the same algorithm, whilst the latter combines those of different algorithms (Partalas et al., 2010; Lessmann et al., 2015; Tsai, 2014). As Nanni & Lumini (2009) point out, an ensemble classifier is a group of classifiers, where the decisions of each are combined using the same approach. In the light of the above discussion, a recent paper was presented by the authors (Ala’raj and Abbod, 2016), where they developed a credit scoring model based on a combination of heterogeneous ensemble of classifiers and combined their rankings using a new combination rule called the consensus approach through which the classifiers work as a team to reach an agreement on all the data points’ final outputs. They carried out a comprehensive analysis, comparing their approach with several other classifiers and theirs was demonstrated having significantly superior predictive performance across various measures and datasets.

The second and main motivation of this paper, is to enhance the results of this approach by applying the following: Proposing a new way of evaluating the consensus between classifiers; Investigating the use of data pre-processing, including data filtering and feature selection, to see to what extent the

consensus approach results can be improved upon, as Ala’raj and Abbod (2016) just used the raw data for the training. Based on the above motivations, a new hybrid ensemble model is proposed that involves combining hybrid and ensemble models

together with data filtering with feature selection. For the next stage, the outcomes of the hybrid classifier are combined using the enhanced consensus approach, with the aim being to increase accuracy and have better predictive performance. Experimentally, the consensus approach comprises combining the decisions of the classifier ensembles after hybridising them. The generalisation ability of the proposed model is evaluated against four widely approved performance indicator measures across several real world financial credit datasets. Moreover, the new proposed model is compared to the developed models in this paper in addition to some literature results that are considered as being benchmarks.

The structure of the paper as follows: Section 2 provides an introduction to data filtering, feature selection and ensemble models, with comparison and analysis of the related literature in terms of the datasets, base models, hybrid and ensemble models, combination methods and evaluation performance measures used. Section 3 explains the data filtering method, feature selection method and the classifier consensus approach that is adopted in this paper. Section 4 describes the experimental setup of the paper, whilst section 5 presents the experimental results and analysis. Finally, in section 6 conclusions are drawn and future work possibilities are discussed.

2. Literature review

2.1. Feature selection

Datasets, in general, contains different attributes or features that make them up, and they vary from one to another. However, they can include irrelevant or redundant features that make it difficult to train models, thus leading to low performance and accuracy. As a result, analysing features in terms of their importance has become an essential task for data pre-processing in data-mining, in general and credit-scoring, in particular, in an effort to enhance the chosen model’s prediction performance (Tsai, 2009; Yao, 2009). In the literature, many studies have involved conducting data pre-processing to hybridise their models. Lee & Chen (2005) built a two stage hybrid model based on MARS to choose the most significant variables, which provided the best performance when compared to other techniques. Whilst Chen & Ma (2009) proposed a hybrid SVM model based on three strategies namely, CART, MARS and grid search, with the best results being delivered when using MARS with this model. Chen & Li (2010) proposed four methods to select the best feature subset for building a hybrid SVM model. The results showed that such a model is very robust and effective in finding the best feature subsets for credit scoring models.

The main reason for using feature selection is to choose an optimal subset of features for improving prediction accuracy or decreasing the size of the data structure without significantly decreasing the prediction accuracy of the classifier. This type of data

pre-processing is important for many classifiers as it increases learning speed and accuracy regarding the testing set . To fulfil this purpose, the MARS technique can be used to determine the most valuable variables or features in the input data. This is a commonly applied classification technique, nowadays, widely accepted by researchers and practitioners in the field of credit scoring for the following reasons:

It is capable of modelling complex non-linear relationships among variables without strong model assumptions; It can evaluate the relative importance of independent variables to the dependent variable when many potential

independent variables are considered; It does not require a long training process and hence, can save lots of model building time, especially when working with

a large dataset; It gives models that can be easily interpreted (Lee et al. 2002; Lee and Chen, 2005).

Moreover, according to Sholom and Indurhnya (1998), the MARS method is suitable for feature selection when the number of variables is not very large. Hence, selection of MARS for feature selection task is justified for the current enquiry.

2.2. Data filtering

Data filtering is used to improve the results of machine learning classifiers, which obviously should be trained on some training data before applying to the testing set. It improves the training set of data by removing the inaccurate samples from the set, which are those samples which stand out from the whole picture. For example, a loan in the data sample data that is labelled as bad amongst many good loans, but with similar characteristics, would need to be removed from the training set. Tsai and Cheng (2012) carried out an investigation into the removal of different percentages of outliers in the data using simple distance-based clustering and examined the performance of the prediction models that were developed using four different classification models. The results demonstrated different performance abilities due to the structure of the datasets used. Garcia et al. (2012) conducted a study using a wide range of filtering algorithms and applied these to a credit-scoring assessment problem. Specifically, they used a total of 20 filtering algorithms, all of which showed superiority over the original training set and of those, they reported that the RNG1 filtering algorithm, which is based on proximity graphs, was the most statistically significant when compared with the others. Consequently, the idea of proximity graphs is adopted in this paper as the filtering algorithm used to pre-process the training data for the collected datasets.

The motivation behind applying data filtering algorithm in this paper lies in the belief that training a classifier with the filtered dataset can have several benefits (Garcia et al., 2012), such as:

The decision boundaries are smooth and clear; It is easier for the classifiers to discriminate between the classes; Decreasing the size of the training; leaving in it the really important data; Improving the accuracy performance of the model; Computational costs can be reduced.

Regarding the particular algorithm employed for fulfilling the task of data filtering, this is the Gabriel Neighbourhood Graph editing (GNG), which is based on proximity graphs and it was chosen for the following reasons:

Proximity graphs are used to avoid incoherent data (when some places are full of points and some places have only a few);

Proximity graphs find neighbours in each direction, so if some point has two neighbours, one and another just behind, the second will not be counted. This feature is not available in k-NN filtering, where the directions of neighbours' do not count.

Proximity graphs describe the structure of data very well in that each point the algorithm finds the closest matches to it.

2.3. Ensemble models

Alongside the hybrid techniques, another method used by researchers is the ensemble learning method or multiple classifier system, which is the most recent to be introduced to credit-scoring evaluation (Lin & Zhong, 2012). It involves applying multi-classifiers rather than single ones in order to achieve higher accuracy in the results. The difference between the ensemble and hybrid methods is that for the former, the output of the multiple classifiers is pooled to give a decision, whilst for the latter, only one classifier gives the final output and the other classifiers’ results are processed as an input to this final classifier (Verikas et al., 2010). The ensemble method for building credit-scoring models is valued due to its ability to outperform the best single classifier’s performance (Kittler et al., 1998). A central key issue in building an ensemble classifier is to make each single classifier as different from the other classifiers as possible, in other words, the aim is to be as diverse as possible (Nanni & Lumini, 2009). Most of the works on ensemble studies in the domain of credit-scoring have focused on homogenous ensemble classifiers via simple combination rules and basic fusion methods, such as majority voting (MajVot), weighted average (WAVG), weighted voting (WVOT), reliability-based methods (i.e. MIN, MAX, PROD), stacking and fuzzy rules (Wang et al., 2012; Tsai, 2014; Yu et al., 2009; Tsai & Wu, 2008; West et al., 2005; Yu et al., 2008). A few researchers have employed heterogeneous ensemble classifiers in their studies, but still with the aforementioned combination rules (Ala’raj and Abbod, 2016; Lessmann et al., 2015; Wang et al., 2012; Hsieh & Hung, 2010; Tsai, 2014). In ensemble learning, all the classifiers are trained independently to produce their decisions, which are subsequently combined via a heuristic algorithm to produce one final decision (Zhang et al., 2014; Rokach, 2010). Table 1 summarises the extant ensemble studies in relation to whether they were homogenous or heterogeneous and as can be seen, most involved adopting a homogenous classifier ensemble with the traditional combination rules. Two studies solely applied heterogeneous ensembles, while three employed both homogenous and heterogeneous ones.

Table 1Ensemble models studies

Year Study Classifier ensembles Combination ruleHomogenous Heterogeneous

1 Relative Neighbourhood Graph editing

2005 West et al. x Majority vote, Weighted average 2006 Lai et al. x Majority vote, Reliability-based2008 Tsai and Wu x Majority vote

Yu et al. x Majority vote, Reliability-based2009 Nanni and Lumini x Sum rule

Yu et al. x Fuzzy GDM2

2010 Hseih and Hung x Confidence-weighted averageYu et al. x Majority vote, Weighted average, ALNN3

Zhang et al. x Majority vote

Zhou et al. xMajority vote, Reliability-based, Weights based on Tough

samplesPartalas et al. x Weighted voting

2011 Wang et al. x x Majority vote, Weighted average, stackingFinlay x Majority vote, Weighted average, mean

2012 Wang et al. Majority voteWang and Ma x Majority vote

Marques et al. (a) x Majority voteMarques et al. (b) x Majority vote

2014 Tsai x x Majority vote, Weighted voteAbellan and Manatas x Majority vote

2015 Lessmann et al. x x Majority vote, Weighted average, stacking2016 Zhou x Majority vote

Xiao et al. x Majority vote, Weighted voteAla’raj and Abbod x Consensus approach

3. New hybrid ensemble credit scoring model

3.1.Gabriel Neighbourhood Graph editing (GNG)

The idea behind data-filtering is the selection of the data outliers, data points with labels that are weakly associated with those of their neighbours, such that it is assumed that some mistake in data collection or representation was made and hence, this data point may contain an error. Hence, it is best not to include such data in the training process. The efficient way to reflect the structure of the data and the interconnection between training set entries is to represent the training data as a graphical structure. The simplest

way to this is to connect two data points when they are close enough, such that x i and x j are connected ifd (x i . x j )<ϵ , where ϵ

is chosen manually. Whilst this method is easy, in the case of non-uniform data distribution it is impossible to provide coherency of data representation: some graph areas are full of edges, whereas others have only a few. Moreover, it is not guaranteed that the obtained graph is connected. Consequently, it is better to use proximity graphs, which are built without using a fixed distanceϵ and instead, involves connecting two data points according to their neighbours’ location with respect to them. GNG (Garcia et al., 2012) is used as a special case of proximity graphs, which provides a list of neighbours for each point from the training set and is defined as follows:

(x i . x j )∈E⇔d2 ( xi . x j )≤d2 (x i . xk )+d2 ( x j . xk ) .∀ xk . k≠ i . j (1)

Figure 1 demonstrates the connection between 2 points in the GNG. In simple terms, the idea is to connect the points i and j , if

and only if, there is no point k inside the circle with segment [ i . j ] as a diameter, otherwise the two points will not be connected.

The Euclidian distance between two points x i=( xi1 . x i2. x i3 .…) and x j=(x j1 . x j2 . x j3 .…) is calculated as follows:

d ( xi . xj )=√ (x i1−x j1 )2+(x i 2−x j2 )2+…+(x i1−x j2 )2 (2)

Fig. 1. Illustration of GNG edge connection (Gabriel & Sokal, 1969)

2 GDM: group decision making

3 ALNN: adaptive linear neural network

Figure 2 illustrates the construction of a GNG on a 2-D training dataset, showing how points are connected and the process of filtering data points, which depends on meeting certain conditions set by the GNG algorithm.

Fig. 2. The construction of GNG for 2-D on training data (Gabriel & Sokal, 1969)

As can be seen from Figure 2, the GNG is constructed and training data points are connected, now for each sample x¿ the weighted

average for all neighbours’ labels to x¿ is evaluated. In this case, the weights chosen are proportional to the distance from eachx¿.

Also, for every data point two scalar values, T r 0and T r 1 are defined that can be interpreted as thresholds. Subsequently, two

conditions are checked:

If label of x¿ is equal to 0, and the weighted average of all the GNG is greater thanT r 0, then x¿ is removed or filtered from

the training set; If label of x¿ is equal to 1, and the weighted average of all GNG is less than T r 1, then x¿ is removed or filtered from the

training set; If neither condition is satisfied,then x¿ will remain in the training set.

In balanced datasets, where number of '0' labels is approximately equal to the number of ‘1’ labels, it is wise to use 0.5 for both the

T r 0 and T r 1 thresholds. However, in the case of imbalanced datasets, when number of bad loans is far less than number of good

ones, if the values of both thresholds are equal to 0.5 this leads to excessive filtration for entries with data labelled as ‘1’. For

instance, if the data set is imbalanced the goal is to tighten its conditions for good loans as they are the majority, so T r 0 is

decreased to remove more good loans (the advantage here is more than the disadvantage, so if a non-noisy loan is removed, its

guaranteed that the noisy ones are removed), and regarding the minority class, T r 1 is decreased to keep as many bad loans as

possible. However, the thresholds of the datasets when they are imbalanced are chosen based on the best accuracy of training set. Therefore, the proposed new enhancement on the GNG in this paper is using the following two modifications:

Using a weighted average for the GNG instead of a simple one, in order to account for far points less than close points for each

x¿, thereby more precisely finding outliers; check Using various thresholds so as to avoid excess filtering of bad loan entries.

For clearer understanding, the steps of the GNG filtering algorithm are summarised in the following pseudo-code:

1. Compute GNG for all entries of the training set:

For every pair x i x j of the training set:

- Check whether to connect them in the GNG using Equation (1).

End for

2. For each classifier optimal good loans and bad loans thresholds (T r 0, T r 1) are evaluated beforehand.

For each entry x i of training set:

- Compute the vector l which consists of actual labels of all x i Gabriel Graph neighbours;

- Compute the vector w which consists of distances from x¿ to its Gabriel Graph neighbours;

- Perform the subsequent operations: w=max (w )−w;

- Evaluatel¿=w ⋅ l, where ⟨⋅ ⟩a scalar product:

- If label of x¿ is equal to 0 and l¿ is greater than T r 0, then x¿ is removed from training set;

- If label of x¿ is equal to 1 and l¿ is less than T r 1, then x¿ is removed from training set.

End for

3. Perform training stage of selected classifier on reduced training set.

3.2. Multivariate Adaptive Regression Splines (MARS)

MARS is a non-parametric and non-linear regression technique developed by Friedman (1991), which models the complex relationships between the independent input variables and the dependent target variable. It is a member of the regression analysis methods family, and can be described as an extension of linear regression. The MARS model takes care of non-linearity, being constructed by assuming the result value on the unknown points and then converts the linear model to one that is non-linear. The conversion occurs by creating knots on t extremes of arguments using the hinge functions. The advantage of converting the model with these functions is that different combinations of them can form a complex model, which lies in the closest position possible to the real results. The hinge function looks like:

f (x)=C×max (0. x−c1) or,

f (x)=C×max (0. c1−x)

where, C and c1 are constants and c1 is called a knot. In the knot a hinge function changes its direction. In fact, the single hinge

function is a combination of two linear functions f (x)=0 and f (x)=±1(x−c1), where the second function constitutes

the first after x=c 1 point.After finishing the training stage, the obtained MARS is as a mathematical generalised linear model, which fits the data well.

However, this model can be used not only to classify entries, but also to analyse the input data and to find the most important features, which are highly correlated with the target labels. The aim of feature selection is to choose a subset of features for improving the prediction accuracy or decreasing the size of the structure without significantly decreasing the accuracy of the classifier built using only the selected features. ANOVA decomposition is used for the MARS mathematical model to determine the most valuable and important features of the input data. The main characteristic that distinguishes MARS from other classifiers is that its results are easily-interpreted and then ANOVA can be used with this model to make investigations into the input data structure, particularly features importance. ANOVA decomposition is the process of separation of the MARS model into groups of functions that depend on the different variables (features). Thus, when analysing this decomposition, one of the groups is removed to see how performance drops and the more it does so, the more important the feature is. The steps of the MARS feature selection process can be summarised in Pseudo-code as follows:

Suppose there are N iterations, for each of them the dataset is divided to training and testing parts.

For i from 1 to N do

Evaluate the MARS algorithm using the training set with the given parameters:

Maximum number of functions in the model allowed. The default value for this parameter is -1 in which case maxFuncs4 is

calculated automatically using the formula min(200.max (20.2d ))+1(Jekabsons, 2009), where d is the number of

input variables; Penalty value per knot. Larger values lead to fewer knots being placed (i.e. final model is simpler); Perform ANOVA decomposition on the obtained model using the aresanova function. Store the second column of this

table as (i) and thus, w k (i) denotes the importance of the k-th input feature during the i-th iteration.

End For

1. Assume N f is the total number of features of the data.

For s from 1 to N f do

w s=∑i=1

N

‍ws( i)

N (3)

End for

2. Return w=(w1 .w2 .…wN f)

3. Suppose that for each single classifier the optimal feature importance threshold t r imp was evaluated beforehand. Thus, for

training and testing only the features i are chosen, for which w i> t rimp.

3.3. The classifier consensus system approach

4 This is the maximum number of hinge functions allowed by the MARS model

The basic idea behind classifier combination decisions being taken together is that when making a decision, one should not rely only on a single classifier decision, but rather, classifiers need to participate jointly in the decision-making process by combining or fusing their individual opinions or decisions. Consequently, the core problem that needs to be addressed when combining different classifiers is resolving conflicts between them. In other words, the problem is how to combine the results of different classifiers to obtain a better result (Chitroub, 2010; Xu et al., 1992). In this section, a new combination method is introduced in the field of credit scoring based on classifier consensus, where those in the ensemble interact in a cooperative manner in order to reach an agreement on the final decision for each data sample. Tan (1993) stressed that classifiers working in collaboration can considerably outperform those working independently. The idea of the consensus approach is not new and it has been investigated in many studies in different fields, such as statistics, remote sensing, geography, classification, web information retrieval, multi-sensory data and the financial domain (Tan, 1993; DeGroot, 1975; Benediktsson and Swain, 1992; Shaban et al., 2002; Basir and Shen, 1993, Ala’raj and Abbod, 2016). Regarding which, we adopted the general guidelines of DeGroot. (1975) and Shaban et al. (2002), who proposed a framework that provides a comprehensive and practical set of procedures on the construction of the consensus theory, where the interactions between the classifiers are modelled when agreement between them is needed. It is believed that their guidelines can be useful in the domain of credit scoring and credit risk evaluation.

The goal of the consensus approach is to merge the rankings of the ensemble classifiers into one group ranking (answer of the group) and to do so, it should comprise the following main stages:

I. Calculating the rankings of all the ensemble classifiers and build a decision profile

Consider a group of N classifiers, denoted by the C1, C2,…,Cn. All the classifiers in the ensemble will be trained and tested on the same input data points. After training, each classifier will produce a ranking for the input data point and this ranking after applying a threshold will have a set of two possible answers, which is either good or bad loan. The set of the two possible answers can be

denoted byΓ=(γ1 . γ2), γ1=good loan , γ2=bad loan.For each classifier, consider a ranking functionRi, which

associates a non-negative number for every possible answer for Γ i . The result of the estimate function Ri is a value in the range of

[0, 1], which shows the desirability of the corresponding answer. Predictions of the classifiers can be found after finding Ri and

applying a threshold to it. Hence, the ranking of each classifier given an input data will be:

∑k=1

m

‍R i(γ k )=1∀ i∈ {1. .N } (4)

where, Ri is the ranking of the 2 classes for each classifier given an input data point.

Now, after calculating each classifier’s rankings, the decision profile can be represented in matrix form as follows:

DP=[R1 (e1 ) R1 (e2 ) R1 (e3 ) … R1 (en )R2 (e1 ) R2 (e2 ) R2 (e3 ) … R2 (en )R3 (e1 ) R3 (e2 ) R3 (e3 ) … R3 (en )R4 (e1 ) R4 (e2 ) R4 (e3 ) … R4 (en )R5 (e1 ) R5 (e2 ) R5 (e3 ) … R5 (en )

] (5)

where, n is the number of input data in the training/testing set, ei is the ith input data and Rj(ei); j ∈1..5 is the j-th classifier ranking for the i-th input data. So, to evaluate the uncertainty between the classifiers it is necessary to process n columns of matrix DP for

the testing set input by input. The main objective is to evaluate the common group ranking RG :Γ→[0,1] by aggregating the

expected rankings for all classifiers and hence, reach a consensus on the final ranking of each given input data point

II. Calculating of classifier uncertainty

After building the decision profile (DP) for the classifier rankings, the next stage is to find a function by which each classifier’s uncertainty about its own decision can be computed. The task here is to give more weight to those that are less uncertain about their decision, and vice versa. Moreover, the assigned weights should reflect the contrast in the classifiers' decisions. During this stage, classifier uncertainty can be divided into two types: local (self) and global (conditional). Self-uncertainty relates to the quality of the classifier's own decision, while conditional uncertainty refers to this quality after being exposed to the other classifiers’ decisions, which takes the form of a decision profile exchange between classifiers. At this stage a classifier will be able to review its uncertainty level and modify it in light of its decision as well as those of other classifiers. In other words, this shows how a classifier is able to

improve its decision when other classifiers’ decisions become available. Accordingly, Ri(γk ) is the i-th classifier’s ranking of

answerγk , and Ri(γk∨Γ j) is the i-th classifier’s ranking of answerγk , when it is exposed to the ranking of the jth classifier.

Consequently, the uncertainty matrix can be presented as follows:

U=[U11 U 12 U 13 … U 1 N

U21 U 22 U 23 … U 2 N

U31 U 32 U 33 … U 3 N

U41 U 42 U 43 … U 4 N

U51 U 52 U 53 … U 5 N

] (6)

Matrix U is evaluated using equations (7) and (8):

U ii=−∑k=1

M

‍R i(γ k ) log2 (Ri(γ k)) (7)

U ij=−∑k=1

M

‍Ri(γ k∨Γ j) log2 (Ri(γ k∨Γ j)) (8)

where, U ii is the self uncertainty and U ij is the conditional uncertainty of each classifier for each given input data point.

Now, knowing that equation (4) is fulfilled, equation (8) can be fulfilled as follows:

∑k=1

m

‍R i(γ k∨Γ j)=1∀ i∈{1. .N } (9)

In the case of two possible answers: "0"as good loans and "1" as bad loans, then, for simplicity, equations (4) and (9) can be converted into:

Ri (0 )+Ri (1 )=1.R i(0∨Γ j)+Ri(1∨Γ j)=1 (10)

where, Ri(1) is the i-th classifier ranking of answer "1" (bad loan) and Ri(0) is the i-th classifier ranking of answer "0" (good

loan). Denote Ri=R i(1) andRi(Γ j)=Ri(1∨Γ j), then Ri(0)=1−Ri and Ri(0∨Γ j)=1−Ri(Γ j) and, hence,

equations (7) and (8) can be converted into:

U ii=−Ri log2(R i)−(1−Ri) log2(1−Ri) (11)

U ij=−R i(Γ j) log2(Ri(Γ j))−(1−Ri(Γ j)) log2(1−R i(Γ j)) (12)

where, Uii; i ∈1…5 is the local uncertainty of the ith classifier, and Uij, I, j∈1…5; i ≠ j is the global uncertainty of the i-th classifier, when it knows the ranking of the jth classifier. It is important to explain that the reason why the uncertainties in equations (11) and (12)

are evaluated using a logarithm with base 2 (log2 ¿can be demonstrated by plotting equation (11), where Uii is a function of

parameter Ri:

Fig. 3. Uncertainty value U ii as a function of the parameter Ri

From the plot in Figure 3, it is clear that, if the value of the classifier’s ranking is close to the edges of the [0, 1] interval, uncertainty will be near zero (the classifier is certain about its decision). On the other hand, if the ranking is close to the 0.5, the classifiers uncertainty will be close to 1, which is the maximum value of uncertainty (the classifier is very uncertain about its decision). Looking

at U ii it is straight forward to calculate the self-uncertainty of each classifier, but when it comes to the conditional uncertainty, U ij ,there

is no information available about how to calculate the rankings of the classifiers after they are exposed to each other Ri(γk∨Γ j). In DeGroot (1974), Berger (1981) amd Shaban‍ et ‍al. (2002) convergence conditions for the optimal single decision of the group were investigated, but provided no information about how they calculated their conditional uncertainty. To evaluate the conditional rankings in U ij, an uncertainty approach is proposed here. In Ala’raj and Abbod (2016), they proposed a way for estimating the conditional rankings

Ri(γk∨Γ j) of the U ij, which was based on giving the rankings of the i th and jth classifiers a weight using the hyperbolic function tanh‍

and global accuracy. In this paper, another way is put forward for estimating the conditional rankings of U ij, which is based on: 1) Calculating how far the ith and jth classifiers’ rankings (distance) are from the threshold (0.5) and measuring how certain they are about their decisions;2) Calculating the local accuracy (the strength of a classifier of classifying loans around a given input test data) of both classifiers when computing their uncertainty (the more locally accurate classifier will have less uncertainty and vice versa).

The whole process of estimating Ri(Γ j) and U ij is described in Algorithm (1):

1) Calculated i=R i−0.5, d j=Ri−0.5

If d i>0 and d j>0 d¿=(d i+d j)×k1.

If d i<0 and d j<0 d¿=(d i+d j )×k2 . If d i and d j have opposite signs, then d¿=(d i+d j)×k3.

Ri(Γ j)=d¿+0.5.

The logic behind Algorithm (1) is to simulate the classifiers’ communication behaviour in order to generate conditional rankings so

that we can calculate U ij . In the first two conditions, k1 and k 2 should be greater than 0.5, as the effect of ith classifier certainty

is increasing due to the similar opinion of jth classifier. For example, if Ri=0.6, R j=0.7 and k=1, then

d¿=(0.2+0.1)×1=0.3, and the conditional ranking according to this is Ri(Γ j)=0.3+0.5=0.8. The logic behind it

is that if the two classifiers simultaneously consider a loan as ‘good’, after communicating their certainty in that decision will increase (thus, the ranking will decrease). On the other hand, if two experts simultaneously consider a loan as ‘bad’, after communicating their certainty, the decision will increase (thus, the ranking is increased).

Evaluate U ij according to equation (12)

After calculating all the values in the uncertainty matrix a symmetrical matrix will be produced, but because of the clear differences

in single classifier performance we do not want them to be symmetrical, which is why U ij is updated taking into consideration the

classifiers’ local accuracies, as in step (2).

2) UpdateU ij=k4 ⋅U ij /(L A i(q ,nn)−k5), where L A i(q ,nn) is the local accuracy of the ith classifier on input

data q , using the ith classifier answer error for exactly k-neighbour queries from the training set. In the current

implementation, the parameternn=4 . Parameters k1, k2, k3, k4 and k5 are chosen using gradient descent with the objective function being global accuracy on the training set. For each iteration these parameters are evaluated separately.

3)The reason for updating step (2) is to take into consideration the local accuracy of a classifier, for one with low local accuracy is

more uncertain about its decision (as local accuracy is in the denominator). Coefficient k5 is a normalising coefficient that picks

lower than the lowest local accuracy for the ith classifier and so the denominator stays positive. Local accuracy (LAi) is estimated by the accuracy of each classifier in the local region of the feature space surrounding an

unknown test point. Before applying the ensemble combiner, all single combiners are trained, with their predictions being evaluated on the training and testing sets. To combine the decisions of these classifiers the local accuracy is evaluated for each entry

x¿ from the testing set. In Xiao et al. (2012) it is proposed choosing a non-negative distance d as a local accuracy area and

evaluating the accuracy of all entries for the training set that are located at a distance from x¿ less than d . As an enhancement, rather than using a simple mean value, evaluating a weighted average for all points of the training set with weights that are inversely proportional to the distance from the training test entries to x¿ is proposed . For example, we have a training set with

x i entries with ranking values pi, and classifier's actual targets for training set ~pi , where i=1. ..N . Hence, local accuracy is

evaluated as: check

L(x¿)=∑i=1

N

‍wi∨p i−~pi∨,∑i=1

N

‍wi=1 (13)

III. Evaluating the weights of each classifier uncertainty

After having calculated the uncertainties of the classifiers and all values of uncertainties having been presented in the uncertainty matrix, at this stage the classifier can assign weights for itself and for other classifiers. The uncertainty weights are e valuated using the following equation, which can be presented in a matrix as with the uncertainty, which we call matrix W:

W ij=1

U ij2 ∑k∈ A

‍U ki−2 (14)

Equation (14) is the result of a set of minimisation problems (Shaban et al., 2002) (One problem for each i∈1.2.…N )

{T i=∑j=1

N

‍w ij2 ⋅U ji→min

∑j=1

N

‍w ij=1

, i∈1.2.…N (15)

These problems are stated in this form to ensure that each classifier will assign high weights to classifiers with low global uncertainties and low weights to those with high global uncertainties. These N problems are solved via the Lagrange method of undetermined coefficients, as illustrated in equation (14) and the detailed process of its derivation is described in (Shaban et al., 2002).

IV. Evaluating vector π (the weight of each classfier)

Vector π , which is the weight that is given for each classifier that reflects the confidence of each of its own decisions after being exposed to the results of other classifiers. It is evaluated as an approximate solution to the following equation:

(π ⋅W ¿π

∑i=1

N

‍ π=1 (16)

Weights of matrix W are assigned as the transition matrix of a Markovian chain with single classifiers, as stated in DeGroot

(1974). Then, the stationary distribution π of this chain can be evaluated using a system of equations. Sometimes there is no exact solution to this chain, because number of equations in it is one more than the number of variables. In this case, the Markovian chain does not converge to stationary distribution. Equation (16) can be converted to the form of:

~W ⋅ π=¿ (17)

where:

~W=¿ (18)

Matrix ~W is a rectangular N ×(N+1) matrix. The sum of the elements for each column of matrix¿ is equal to 0, so at least one

row of this matrix is redundant and can be removed. Therefore, if:

rank ¿ (19)

then, equation (17) has a single exact solution. To solve equation (16) using Matlab the least squares method could be used. It is also a new approach, when compared to those in articles by DeGroot (1974) and Berger (1981). Using the least squares method, it is not necessary to worry about vector π convergence, because the result of the approximate solution of equation (17) when equation

(19) is fulfilled, is the same as using DeGroot’s (1974) iterative method π i+1=π iW with normalisation at each step, until

‖π i+1−π i‖ becomes close to zero check In that scientific paper the final value of π is called the ‘equilibrium’, as this value does

not change again after reaching it. Generally, the equilibrium represents a balance between the single classifiers’ opinions, i.e. is a common denominator for all classifiers and they all agree with it.

V. Aggregating the consensus ranking of each ensemble classifier

When all the classifiers reach a consensus about their decisions and there is no room for decision updates, at this point, the aggregation of the consensus rankings is evaluated using the following equation:

RG(γ k)=∑i=1

N

‍Ri(γk )⋅ π i (20)

Vector π is considered as the weights importance of each single classifier and the sum of all the elements of it equals 1. So,

consensus rankings aggregation can be evaluated as a linear combination of single classifier rankings. The length of vector RG is

equal to the size of the set of the possible answers, and the sum of all the elements of RG is equal to 1. The final prediction of the

group, using ConsA, is the answer γ¿, for which RG(γ ¿) reaches the maximum value, which can be specified as:

γ¿=Arg maxa∈(γ 1 ..…γm)

RG(a) (21)

The pseudo-code below summarises the process of the classifiers ConsA adopted in this work.

The ConsA pseudo-code (generating the common group ranking for one input sample)

Input: Ri – ranking of answer ‘1’ for each agent, i=1. .5, Ai– Accuracy of each agent.

Output:

For i = 1 to N do

For j = 1 to N do

If (i == j) then U ii=¿(computed by equation (11))

Else

U ij=¿(Computed by algorithm (1) and equation (12))

End if

End for

End for

∀ i . j∈{1..5 }W ij=¿ (computed by equation (14))

Compute~W=¿ (computed by equation (18))

Compute π=¿(computed by system (17)) In Matlab the lsqnonneg function is used.

Compute aggregate ConsARG(γ k) using equation (20).

Define group aggregate answer using equation (21).

Figure 4 shows a flowchart for the ConsA, based on generating a common group ranking for one data sample or input:

Fig. 4. The process of ConsA

To make it clear we provide an example of how ConsA works:

Suppose that five classifiers have the rankings: R=¿0.8,0.3,0.4,0.7,0.6, and local accuracies (LA) (0.77,0.7,0.65,0.75,0.65):1) During the gradient descent the vector of parameters: k1=1, k2=2, k3=0.5, k4=1, k5=0.3 is obtained, which gives the

best accuracy for the training set.2) Calculate uncertainty matrix U (for diagonal elements equation 11 is used, for non-diagonal – Algorithm 1):

U=

¿0.72 2.11 2.07 0.00 1.00

¿

2.48 0.88 0.00 2.50 2.48

2.77 0.00 0.97 2.84 2.86

0.00 2.22 2.21 0.88 1.60

1.34 2.84 2.86 2.06 0.97

U 11 is calculated as follows:

U 11=−0 .8× log2 0.8−(1−0 , )× log2 (1−0.8 )=0 .72(0.8 is the first classifier ranking, R1) and the other diagonal elements are calculated in the same way. For non-

diagonal elements Algorithm (1) is used. U 12 is calculated as:

Calculate d1=R1−0.5=0.3, d2=R2−0.5=−0.2

As d1 and d2 have opposite signs, then

d¿=(d1+d2 )×k 3=(0.3−0.2 )×0.5=0.05 R1 (Γ2 )=d¿+0.5=0.55

Evaluate U12 according to equation (13)

U 12=0.55× log2 (0.55 )+ (1−0.55 )× log2 (1−0.55 )=0.9928 Update U 12=k4 ⋅U 12 /(L A i(q .nn)−k 5), where L A i(q .nn) is local accuracy.

U12=1×0.9928 /(0.77−0.3)=2.11

Other non-diagonal elements are calculated by the same algorithm

W=

(¿¿¿0.0

0 0.00 0.001.0

0 0.00¿¿¿)0.0

0 0.00 1.000.0

0 0.000.0

0 1.00 0.000.0

0 0.001.0

0 0.00 0.000.0

0 0.000.3

7 0.06 0.040.1

4 0.39

In this example, the four first rows all have zero elements, except one. This fact is because of the equation of w ij evaluation: sum of

inverse squares ∑k∈ A

‍U ik−2

is infinity for i∈(1.2 .3 .4 .5), because in these rows matrix U has zeros. The only element is equal

to one for these rows is where U ij=0 . The last row of matrix U has no zeros, so it can be evaluated, for example, w51. To do this,

firstly, the sum of the inverse squares of all the elements of matrix U for the last row is evaluated:

∑k∈{1.2.3.4.5 }

‍U 5k−2= 1

1.342 +1

2.842 +1

2.862 +1

2.062 +1

0.972 =2.74

w51=1

U 512 ∗∑

k∈ A‍U 5k

−2 =1

1.342⋅2.74=0.37 .

All other weights in this row are calculated in the same way.

3) Evaluate matrix ~W=¿

~W =

¿ -1.00 0.00 0.00 1.00 0.37 ¿0.00 -1.00 1.00 0.00 0.06

0.00 1.00 -1.00 0.00 0.04

1.00 0.00 0.00 -1.00 0.14

0.00 0.00 0.00 0.00 -0.61

1.00 1.00 1.00 1.00 1.00

4) Calculate vector π , such that ~W ⋅ π=¿ and to do this, the least squares method is used:

π=(~W T ⋅~W )−1⋅~W T ⋅¿.

π= (0.3,0.2,0.2,0.0)

5) Evaluate global final ranking as:

π ⋅R=¿(0.3,0.2,0.2,0.0)× (0.8,0.3,0.4,0.7,0.6) = 0.59

So, the ranking of each classifier is calculated as:

0.8×0.3+0.3×0.2+0.4×0.2+0.7×0.3+0.6×0.0

6) As the global final ranking is greater than 0.5, ConsA considers the loan as “bad”.

4.Experimental design

As in this paper the work of Ala’raj and Abbod (2016) is extended, in order to reach a fair comparison, the decision was taken to use the same experimental set-up that was used in their earlier study in terms of the credit datasets, base classifiers, traditional combination methods, performance indicator measures and significance test.

4.1. Credit datasets

A collection of public and private datasets with different characteristics is employed in the process of empirical model evaluation. In total, seven datasets were obtained, four public and three private. The public datasets are well-known real-world credit-scoring datasets that have been widely adopted by researchers in their studies, which are easily accessed and publicly available at the UCI machine-learning repository (Asuncion & Newman, 2007). The German5, Australian6 and Japanese7 datasets of this nature were employed, with the purpose being to provide extra validation. The Iranian dataset, which consists of corporate client data from a small private bank in Iran, has been used in several studies (Sabzevari et al., 2007; Kennedy, 2012; Marques et al., 2012a, 2012b), whilst the Polish dataset contains information on bankrupted Polish companies recorded over two years (Pietruszkiewicz, 2008; Kennedy, 2012; Marques et al., 2012a, 2012b). Moreover, to two extra datasets are used for the proposed model for extra validation. Firstly, a Jordanian dataset, based on a historical loan dataset, was gathered from one public commercial bank in Jordan. These data are confidential and sensitive, hence acquiring them involved a complex and time-consuming process. Secondly, we used the UCSD dataset that matches with a reduced version of a database employed in the 2007 Data Mining Contest organised by the University of California San Diego and Fair Isaac Corporation. A summary of all the datasets is illustrated in Table 2.

Table 2Description of the seven datasets used in the study

Dataset #Loans #Attributes Good/ BadGerman 1000 20 700/300

Australian 690 14 307/383Japanese 690 15 296/357Iranian 1000 27 950/50Polish 240 30 128/112

Jordanian 500 12 400/100UCSD 2435 38 1836/599

5 It is a function in ARESLab toolbox in Matlab that performs ANOVA decomposition and variable importance assessment.

6 https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

7 https://archive.ics.uci.edu/ml/datasets/Statlog+(Australian+Credit+Approval)

https://archive.ics.uci.edu/ml/datasets/Statlog+(Australian+Credit+Approval)

https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

4.2. Base classifiers development

The baseline models developed in this paper, chosen to be part of the multiple classifier systems, based on their wide application in credit scoring studies (Harris, 2015; Tomczak and Zieba, 2015; Xiao, 2016; Louzada and Fernandes, 2016), are neural networks (NN), support vector machines (SVM), decision tress (DT), random forests (RF) and naïve Bayes (NB). These methods as well as being well-known are easy to implement and hence, facilitate banks or credit card companies in quickly evaluating the creditworthiness of clients. Below is the theoretical background to the classifiers used.

Neural Networks (NNs)

NNs are machine-learning systems based on the concept of artificial intelligence (AI) inspired by the design of the biological neuron (Haykin, 1999). These are modelled in such a way as to be able to mimic the human brain functions in terms of capturing complex relationships between the inputs and outputs (Bhattacharyya and Maulik, 2013). One of the most common architectures for NNs is the multi-layer perceptron (MLP), which consists of one input layer, one or more hidden layers and one output layer. According to Angelini et al. (2008), one of the key issues needing to be addressed in building NNs is their topology, structure and learning algorithm. The most commonly utilised topology of an NN model in credit-scoring is the three-layer feed-forward back propagation. Consider the input of a credit-scoring training set x = {x1, x2,…, xn}, the NN model works in one direction, starting from feeding the data x to the input layer (x includes the customer’s attribute or characteristics). These inputs are then sent to a hidden layer through links or synapsis associated with the random initial weight for every input. The hidden layer will process what it has received from the input layer and accordingly, will apply it to an activation function. The result is served as a weighted input to the output layer, which will further process weighted inputs and apply the activation function, which will lead to a final decision (Malhotra and Malhotra, 2003).

Support Vector Machines (SVMs)

An SVM is another powerful machine-learning technique used in classification and credit-scoring problems. It is being widely used in the area of credit-scoring and other fields owing to its superior results (Huang et al., 2007; Lahsasna et al., 2010). SVMs first were proposed by Cortes & Vapnik (1995), adopting the form of a linear classifier. SVMs take a set of two classes of given inputs and predict them in order to determine which of the two classes, namely good or bad, has the output. SVMs are used for binary classification in order to make a finest hyperplane (line) that categorises the input data into two classes (good and bad credit) (Li et al., 2006). In cases where the data are non-linear, other types are proposed in order to improve the accuracy of the original model. The main difference of the new model, compared to the initial one, is the function used to map the data into a higher dimensional space. To achieve this, new functions were proposed, namely linear, polynomial, radial basis function (RBF) and sigmoid. SVMs map non-linear data of two classes to a high-dimensional feature space, with a linear model then being used to implement the non-linear classes. The linear model in the new space will denote the non-linear decision margin in the original space. Subsequently, the SVMs will construct an optimal line (hyperplane) that can perfectly separate the two classes in the space.

Decision Trees (DTs)

A DT is another commonly used approach for classification purposes in credit-scoring applications. DTs are non-parametric classification techniques used to analyse dependant variables as a function of independent variables (Lee et al., 2006). They can be represented using graphical tools; the node is shown in the box with lines to show the possible events and consequences, until reaching the optimal outcome. The idea behind the DT in credit-scoring is to provide a classification between two classes of credit, namely, ‘good’ and ‘bad’ loans. It begins with a root node that comprises the two types of classes, with the node then being split into two subsets with the possible events based on the chosen variable or attribute. Decision tree algorithm goes round all the splits to find the optimal one and then select the winning sub-tree that gives the most accurate ‘good’ and ‘bad loans based on its overall error rate and lowest misclassification cost (Biermann et al., 1984; Thomas, 2000).

Random Forests (RFs)

An RF is considered an advanced technique of DTs, as proposed by Biermann (2001), which consists of a bunch of DTs that are created by generating n subsets from the main dataset, with each subset a DT created based on randomly selected variables, which is why it is referred to as RF, since a very large number of trees are generated. After all the DTs are generated and trained, the final decision class is based on voting procedure, where the most popular class determined by each tree is selected as a final class for the RF.

Naïve Bayes (NB)

NB classifiers are statistical classifiers that predict a particular class (good or bad) of loan. Bayesian classification is based on Bayesian theory and is a valuable measure when the input feature space is high (Bishop, 2006). This is considered as being a very simple method for making classification rules that are more accurate than those made by other methods; however, it has received very little attention in relation to the credit-scoring domain (Antonakis & Sfakianakis, 2009). An NB classifier is calculated using the posterior probability of a class by multiplying the prior probability of a class before seeing any data with the likelihood of the data given its class. For example, in the credit-scoring context, the assumption can be made that the training sample set D = { x₁, x₂,….,xn}, where each x is made up of n characteristics or attributes { x11, x12,…., x1n} and assisted with a class label c either good or bad loan. The task of the NB classifier is centred on analysing these training set instances and determining a mapping function ƒ: (x11,….,x1n} -> (c), which can decide the label of an unknown example x = (x1,….., xn).

4.3. Data pre-processing and partitioning

Before building and training the models, data has to be prepared in terms of dealing with any missing values that could detract from the knowledge discovery process. The easiest way of dealing with missing values is to delete the instances containing the

missing value of the feature; however, there are other ways of handling missing values instead of deleting them, such as by adopting an imputation approach, which means replacing missing values with new ones based on some estimation (Acuna et al., 2004). Regarding the collected datasets, only the Japanese dataset was found to contain some missing values, and it was deciding to impute them via a simple imputation approach as follows (Acuna et al., 2004; Lessmann et al., 2015):

Replace missing categorical or nominal data with the most frequent category within the remaining entries, in other words, the mode;

Replace missing quantitative data with the mean value of the features that holds that missing value.

Some classifiers, such as NNs and SVMs, require input values that range from 0 to 1 and in vectors of real numbers. However, the datasets contain inputs that hold different values than those fed to these classifiers and each attribute in the dataset contains values that vary in range. In order to avoid bias and accordingly, feed the classifiers with data within the same interval, data should be transformed from a different scale of values to those of a common scale. In order to obtain this dataset, attributes should be normalised to values in the range of between 0 and 1 using an appropriate way for the datasets used in this paper. The data are normalised using the min-max normalisation procedure (Sustersic et al., 2009; Wang & Huang, 2009; Li & Sun, 2009), where the maximum value in an attribute is given a value of 1 (max_new) and the minimum value is given a value of 0 (min_new). Then, the values in between are scaled based on the equation below:

New_value = (original – min) / (max– min) * (max_new – min_new) + min_new

Regarding the data splitting technique, the k-fold cross validation was adopted. In this technique, the original dataset is partitioned in to k-subsets or folds of approximately equal size. For example, consider P1, P2, P3,…..,Pk are the number of partitions made from the original dataset (Louzada and Fernandes, 2016). Now, individually each partition must be trained and tested and the final accuracy is estimated by taking the average of all the partitions or folds that have been tested. An issue also could arise regarding the consideration of how many folds or partitions to create. Garcia et al. (2015) stated that 5 or 10 folds can be a good choice with data sets of different sizes, with repetitions of the process also being desirable in order to ensure switching between training and testing data as much as possible and also to avoid high variances. Consequently, a 5-fold cross-validation was adopted, repeated 50 times, in order to achieve reliable and robust conclusions relating to model performance. As a result, in this paper, a 10

× 5-fold cross-validation is applied on each dataset, and the process is repeated 10 times for each, giving a total of 50 test results that are averaged to give a final result for each.

4.4. Parameters tuning and setting

Practically, a few parameters needed to be set up before classifier construction, for such as NN, SVM, DT and RF. The intention was to make a unique model for all datasets and it is worth noting that all the parameters for all the classifiers across all the datasets are based on achieving the best results on the training set. Firstly, for the NN model, a feed forward back-propagation was constructed and for the German, Australian, Japanese, Iranian, Polish, UCSD and Jordanian datasets the chosen number of neurons in the hidden

layer were 4, 10, 3, 10, 10, 10, 10, respectively. Generally, the number of hidden neurons should be chosen relatively to the number and complexity of relations between the input features for each dataset. In the developed NN classifier, a grid search was carried out to find the optimal number of neurons in the hidden layer for each dataset. Furthermore, the learning rate was a default of 0.01 in the case of Australian, Japanese, Iranian, UCSD and Jordanian datasets, while in the German and Polish ones the values were set at 0.005 and 0.5, respectively. Moreover, the maximal number of epochs was set to 1,000. For each particular dataset, it was important to change the way in which the neural network was trained. Regarding which, the German, Australian, Polish, UCSD and Jordanian datasets were kept at default (trainlm), whereas in the Japanese and Iranian datasets were changed to

traingdx∧traingda , respectively. In addition, for the Japanese and Iranian datasets, a momentum default parameter of 0.9

was chosen. For all the other datasets, a momentum was not defined, as trainlm and traingda training methods do not require this parameter.

Secondly, regarding the SVM, an RBF kernel was used. For each dataset, different values of kernel scale parameters were provided (German – ‘1.37’, Australian, Japanese, UCSD, Polish and Jordanian: 'auto'8, Iranian: ‘1’). Thus, for the majority of datasets, the SVM function automatically chooses the appropriate kernel scale. However, for some datasets, the values for the kernel scale parameter that increased the SVM accuracy were found by grid search rather than the default ('auto') parameter. The ‘auto’ parameter gave the values 0.7812, 0.9795, 0.1836, 0.366, 0.243 for the Australian, Japanese, UCSD, Polish and Jordanian datasets, respectively.

With RF, the most important parameters are the number of trees and attributes used to build a tree. 60 9 trees were built and regarding the number of chosen attributes for growing each decision tree, the default value was selected (all attributes available in the dataset). Another important issue worth noting is the defining of the categorical variables for each analysed dataset. The number of the features that were define as categorical during the RF evaluation was less than the initial number of categorical features, because some are best considered as being numerical. According to Rhemtulla et al. (2012), when the categorical variables have many levels, there is a considerable advantage in treating them as continuous variables. By way of an example: for a particular dataset there is an ‘Education’ feature, where ‘0’ means ‘No education’, ‘1’ refers to ‘ordinary school’, ‘2’ pertains to MSc and ‘3’ means ‘PhD’. Hence, his feature can be considered as being numerical, whereby the bigger its value, the better educated is the loan applicant. So during leaf splitting, RF does not need to iterate for all values of this feature, for it can simply define the leaf threshold, which is thus more efficient.

Lastly the DT, the impurity evaluation is performed according to Gini’s diversity index in order to choose the best feature to start the tree with. Regarding which, the best categorical variables split is when 2 C−1 −1 combinations are considered, where C is the number of categories in each categorical variable. 4.5. Performance measure metrics

8 https://archive.ics.uci.edu/ml/datasets/Japanese0Credit0Screening

9 ‘auto’ is a parameter that automatically choose the best value for the kernel scale that increases SVM’s training accuracy. This is achieved by using the fitcsvm function in Matlab.

In order to reach a reliable and robust conclusion on the predictive accuracy of the proposed approach, four performance indicator measures are implemented, specifically: 1) accuracy, 2) area under the curve (AUC), 3) H-measure and 4) Brier Score. These were chosen because they are popular in credit scoring and they give a comprehensive view on all aspects of model performance. The accuracy stands for the proportion of correctly classified good and bad loans, which measures the predictive power of the model. As such, this is a criterion that measures the discriminating ability of the model (Lessmann et al., 2105). AUC is a tool used in binary classification analysis to determine which of the models used predicts the classes the best. According to Hand (2009), the AUC can be used to estimate the model’s performance without any prior information about the error costs. However, it assumes different costs distribution among classifiers depending on their actual score distribution, which prevents them from being compared effectively. As a result, Hand (2009) proposed the H-measure as an alternative measure to the AUC for measuring classification performance, which assumes different costs distribution between classifiers without depending on their scores. In other words, this measure finds a single threshold distribution for all classifiers. Finally, the Brier Score, which also known as the mean squared error (Brier, 1950), measures the accuracy of the probability predictions of the classifier, by taking the mean squared error of the probability. In other words, it shows the average quadratic possibility of a mistake, and the main difference between it and accuracy is that it directly takes the probabilities into the account, while accuracy transforms these probabilities into 0 or 1 based on a pre-determined threshold or cut-off score. The lower the Brier Score the better the classifier performance.

4.6. Statistical significance tests

Statistical tests can be categorised into parametric and non-parametric (Demšar, 2006). Demšar recommended that using non-parametric tests are preferable to parametric tests as the latter can be conceptually inappropriate and statistically unsafe. That is, non-parametric tests are more appropriate and safer than parametric tests since they do not assume normality of the data or homogeneity of the variance (Demšar, 2006). Friedman’s (1940) test is non-parametric test that ranks the classifiers for each dataset independently. The best ranking classifier is given a rank of one, the second best a rank of two and so on. Under the null hypothesis of Friedman, the test is that all classifiers from this group perform identically and all differences are only random

fluctuations. The Friedman statistic xF2 is distributed according to xF

2 with K - 1 degrees of freedom, when N (number of data sets)

and K (number of classifiers) are big enough (Demšar, 2006). If the null hypothesis of the Friedman test is rejected, then, a post-hoc test is carried out in order to find the particular pair wise comparisons that produce significant differences. For instance, the Bonferroni–Dunn (1961) test can be used when all the classifiers are compared with a control model (Demšar, 2006; Marques et al., 2012a; Marques et al., 2012b). With this test, the performance of two or more classifiers is significantly different if their average ranks differ by at least the critical difference (CD), as follows:

CD = q∝ √ k (k+1 )6N

(22)

where, q∝ is calculated as a studentised range statistic with a confidence level ∝/ (k-1) divided by √2. Also, k = number of classifiers to be compared to ConsA and N = number of datasets.

5. Experimental results

In this section, the results of the proposed model are presented along with comparison between the individual base classifiers, hybrid classifiers and ensemble classifiers with traditional combination methods. The model is validated over the above-described seven real-world credit datasets across four performance measures metrics. In addition, histograms of the loans ranking distribution of the proposed model are provided and discussed. All the experiments for this study were performed using Matlab 2014b version, on a PC with 3.4 GHz, Intel CORE i7 and 8 GB RAM, using the Microsoft Windows 7 operating system.

Table 3Classifier results for the for all individual classifiers for all the datasets for the different performance measures without GNG or MARS

Dataset Base classifiers

German

Perf. RF DT NB NN SVMAcc. 0.767 0.705 0.725 0.748 0.761AUC 0.792 0.679 0.762 0.764 0.783

H-measure 0.289 0.136 0.238 0.239 0.272Brier Score 0.162 0.252 0.199 0.172 0.166

Australian

Accuracy 0.867 0.826 0.803 0.859 0.852AUC 0.936 0.866 0.896 0.915 0.911


Japanese

Accuracy 0.867 0.817 0.797 0.858 0.858AUC 0.931 0.856 0.889 0.912 0.907


Iranian

Accuracy 0.951 0.924 0.926 0.950 0.948AUC 0.779 0.615 0.714 0.613 0.603


Polish

Accuracy 0.763 0.701 0.690 0.698 0.749AUC 0.837 0.728 0.740 0.767 0.821


Jordanian Accuracy 0.855 0.828 0.811 0.815 0.830AUC 0.909 0.795 0.707 0.746 0.789

H-measure 0.535 0.385 0.176 0.238 0.309

Brier Score 0.097 0.143 0.172 0.142 0.128

UCSD

Accuracy 0.862 0.820 0.614 0.831 0.831AUC 0.903 0.788 0.574 0.859 0.843


Table 4Classifier results for the for all individual classifiers for all the datasets for the different performance measures with GNG


German

Performancemeasure

RF DT NB NN SVM

Accuracy 0.770 0.745 0.759 0.751 0.768AUC 0.793 0.689 0.775 0.766 0.796

H-measure 0.294 0.182 0.268 0.247 0.296Brier Score 0,1608 0,2301 0.198 0.173 0.164

Australian Accuracy 0.868 0.868 0.865 0.859 0.863AUC 0.923 0.888 0.911 0.916 0.921


Japanese Accuracy 0.867 0.864 0.863 0.865 0.853AUC 0.929 0.882 0.909 0.908 0.911


Iranian Accuracy 0.951 0.951 0.931 0.949 0.946AUC 0.788 0.530 0.722 0.613 0.649

H-measure 0.301 0.036 0.211 0.070 0.117Brier Score

0.0430,049

10.071 0.048 0.051

Polish Accuracy 0.752 0.751 0.726 0.743 0.756AUC 0.834 0.770 0.773 0.806 0.810

H-measure 0.375 0.312 0.302 0.336 0.365Brier Score

0.1670,233

80.264 0.184 0.177

Jordanian Accuracy 0.864 0.853 0.813 0.820 0.834AUC 0.889 0.763 0.710 0.765 0.824


UCSD Accuracy 0.868 0.834 0.807 0.840 0.832AUC 0.916 0.780 0.823 0.857 0.844


Table 5Classifier results for the for all individual classifiers for all the datasets for the different performance measures with MARS


German

Performancemeasure

RF DT NB NN SVM

Accuracy 0.767 0.721 0.744 0.748 0.766AUC 0.787 0.692 0.768 0.767 0.783


Australian

Accuracy 0.868 0.828 0.785 0.865 0.853AUC 0.940 0.868 0.894 0.919 0.905


Japanese

Accuracy 0.866 0.817 0.797 0.862 0.858AUC 0.932 0.857 0.889 0.911 0.908


Iranian

Accuracy 0.951 0.927 0.945 0.949 0.948AUC 0.790 0.639 0.740 0.618 0.566

H-measure 0.278 0.151 0.218 0.061 0.038Brier Score 0.043 0.068 0.055 0.048 0,0508

Polish

Accuracy 0.768 0.725 0.707 0.689 0.754AUC 0.842 0.753 0.801 0.772 0.826


Jordanian

Accuracy 0.862 0.830 0.821 0.838 0.837AUC 0.920 0.809 0.709 0.827 0.796


UCSD

Accuracy 0.866 0.824 0.625 0.847 0.844AUC 0.914 0.801 0.593 0.889 0.874


Filt (-) FS (-) Filt (-) FS (+) Filt (+) FS (-) Filt (+) FS (+)0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Accuracy AUC Brier Score H-measure

Fig. 5 Comparisons of different set-ups of data filtering and feature selection on average for all classifiers across all the datasets

5.1. Classification results

In this section, three experimental results are presented in order to choose the best rankings for the proposed hybrid ensemble model: 1) Results of all base classifiers with all features and data; 2) Results of base classifiers with GNG data filtering and all features; 3) Results of base classifiers with MARS feature selection and all data; and 4) Results of base classifiers with MARS and GNG combined. All the obtained results are compared and hence, the method used for the proposed model is justified. Subsequently, the results of the proposed model are summarised and compared with those of the base and ensemble classifiers using the traditional combination methods.

Tables 3 to 6 report the results of base classifiers with each of the MARS and GNG options in addition to the results of the proposed model. Regarding each classifier in each table several key findings emerge.

RF: Shows a good example of how GNG+MARS work jointly better than separately. This is clear for the Japanese dataset, where accuracy decreases by 0.07% and 0.12% when GNG and MARS are used alone, respectively, but by combining them it increases by 0.43%. Also, an unusual case is found with the Polish dataset, whereby filtering decreases the accuracy and the feature selection increases it. The reason this happens is that the Polish dataset has fewer data samples and the large number of features compared to the other datasets. All the other datasets show increments ranging between 0.03% and 1.1%.

DT: Also demonstrates that GNG+MARS work well together. Both of them separately increase the accuracy, but combining them increases it for all datasets except the Japanese, where GNG alone is better than GNG + MARS, which might be because the latter’s performance alone was worse than with individual classification. Also, another substantial increment for both methods is found for the Polish dataset where the increment is 8.87%, which is much better than both methods separately.

NB: Shows that GNG +MARS works the same way as with DT, as for all the datasets the incremental change is more than when using them separately, with the exception being the Australian dataset, where GNG alone is better. In general, the performance of GNG alone is quite good with Naïve Bayes, but when combining it with MARS the performance gets better if the latter’s performance is superior to that of individual classification (e.g. Australian and Japanese datasets).

NN: GNG+MARS communicate in an impressive way, whereas separately, they both work differently. That is, both methods together improve the accuracy for all the datasets, ranging from 0.01% for the Iranian dataset to 5.46% for the Polish. All the other datasets show increments of between 0.01% and 1.76%.

SVM: Is the most controversial classifier, for whilst GNG+MARS should improve the accuracy, this is not so regarding the Japanese and Iranian datasets. Regarding which, applying MARS to the Japanese dataset decreases the accuracy more than when GNG is used, because the former dataset does not have many features, compared to the Polish dataset, for example. In sum, GNG and MARS work better together rather then they are applied separately with SV.

Regarding the AUC and H-measures, GNG+MARS, on average, is better than using them separately or without using them. In conclusion, the results re improved when applying the GNG and MARS methods together (as an average of all the dataset outcomes): NB: 6.07%, DT: 4.37%, NN: 2.41%, SVM: 0.90%, RF: 0.63%. It can be seen, that the worse results are when the classifiers are applied without filtering and feature selection, whereas the best improvement in accuracy occurs when these two pre-processing methods are applied together. The case where using filtering and feature selection is particularly useful are as follows: When the dataset is well-balanced (e.g. the Australian and Japanese) and even when imbalanced in some cases (e.g. Jordanian);

When the data have a lot of features and some of them are categorical; When any individual classifier without filtering and feature selection gives surprisingly low results, which cannot be explained

by any other reason than the existence of outliers in the data; When DT or NB are used as a part of the classification system. In fact, even if one of them is applied to the data being analysed,

using filtering and feature selection in combination is very desirable. From Figure 5, the main conclusion that can be drawn is that using filtering and feature selection in combination is justified, as the

experiments conducted with these two pre-processing techniques show improvement for all the tested classifiers when compared to no pre-processing or using just one pre-processing technique. It is worth noting, that filtering is more responsible for accuracy increases than feature selection. This can be seen when the results of experiments were compared for filtering and feature selection being applied separately. Having made these comparisons, for the ranking of classifiers GNG+MARS will be used in combination when building the proposed hybrid ensemble model. Table 6 demonstrates the results of the proposed model as well as those for the base classifiers and ensemble using traditional combination methods after implementing GNG+MARS. Regarding the ensemble classifiers with traditional combination methods, seven methods were adopted: Min Rule (MIN), Max Rule (MAX), Product Rule (PROD), Average Rule (AVG), Majority Voting (MajVot), Weighted Average (WAVG) and Weighted Voting (WVOT). The results reveal that the best combiner appears to be the MajVot. Its first place can simply be explained by the fact, that the classifiers have quite a high accuracy by themselves and hence, the probability that four will make a mistake regarding the same data point is low. This is also the reason for AVG being in second place. WVOT, which is a combination of WAVG and MajVot, is third, but for the Japanese and UCSD datasets it holds first place. The final decision about which traditional combination method to choose can be made by looking at the structure of the dataset. The worst combiner is PROD, which can be explained the fact that the result of multiplying the rankings of the five classifiers will be less than one, and this value is very small, which is why the threshold is very hard to choose. For example, if all five classifiers have a ranking of 0.6, then the ranking of this combiner is 0.65 = 0.078, which is an extremely small, so the threshold would have to be much lower than this value (0.078).

The other interesting thing about the combiners and classifiers here is that each of the latter works better when it has few features to rely on. That is, with many features this unnecessarily hinders the classifier training, which in turn reduces the accuracy and increases losses. The results obtained clearly show that most traditional combiners are behind the best of the classifiers (RF) for all the datasets. Of course, Random Forest stays the best, for it actually it is not a single classifier, but rather, a homogenous combiner of DTs. Finally, the traditional combiners could be used to improve the work of single classifiers, but these could not be used on every dataset with the same productivity, and hence, should be chosen independently for each dataset.

As a result of the above findings, a complex combiner is proposed (ConsA) to combine the rankings of all the base classifiers, which should result in a classifier that outperforms the best base classifier and the traditional ensemble combining methods developed in this paper. The last column of Table reports the results of the proposed approach across all the datasets using the three performance measures. It can be seen ConsA is superior for all measures across all the datasets when compared to the base classifiers and traditional combination methods. However, several distinct findings can be reported for each dataset.

German: During this experiment ConsA shows accuracy 1.65% higher than with RF (second best classifier in this case). Moreover, the accuracy of ConsA at 0.7903, is superior to the best of the traditional combiners by 1.7%. Standard deviation of ConsA Accuracy over all iterations is 0.029. The results clearly show that using GNG and MARS in conjunction is advisable, because this increases the performance of almost all the classifiers when compared to only one of these being used. The AUC value of ConsA is the highest amongst all the other classifiers and combiners.

Australian: Regarding the filtering and feature selection methods being applied in conjunction it can be observed that ConsA’s accuracy is the highest at 0.881, which is better than the best second classifier by 0.74%. The standard deviation of ConsA’s accuracy over all iterations is 0.0268. Moreover, ConsA’s AUC is the highest amongst the other classifiers indicating its efficiency across several thresholds. Moreover, the H-measure ConsA is the biggest amongst all the classifiers,

Japanese: ConsA’s accuracy is 0.8871, better than RF by 1.5%. The standard deviation of ConsA’s accuracy over all iterations is 0.0259. Moreover, the AUC of ConsA is the highest by 0.9330 and the H-measure of ConsA is almost 3.65% higher than that for RF. In general, ConsA is stable for balanced and unbalanced datasets so far.

Iranian: For this severely imbalanced dataset, it can be seen that the results of ConsA rises up to 95.75%, which is better than the best second classifier by 0.062%. The standard deviation of ConsA’s accuracy for all iterations is 0.015. so what? The results show that for this dataset ConsA has the highest H-measure and by far the best AUC.

Polish: The accuracy of ConsA with filtering and feature selection enabled is 81.33%, better than the second best classifier by 2.33%. Standard deviation of ConsA Accuracy over all iterations is 0.0514, which is the highest when compared to all the other datasets. The AUC value of ConsA remains greater than that for all the other classifiers. This is the only dataset so far, for which it can be seen that there is a big advantage of ConsA over the other classifiers and combiners. GNG and MARS helped to raise its accuracy by almost 2.3%, which proves the importance of these two pre-processing techniques in the classification procedure. Interestingly, for this dataset, RF shows a worse accuracy result than DT, with the latter being 79%. The reason of this good performance is that GNG helps this classifier to choose the right node splits, and thus, the obtained model becomes quite precise. The H-measure of ConsA is the best as is the AUC. In fact, for this dataset ConsA shows superiority for all the measures evaluated.

Jordanian: Reveals that with filtering and feature selection algorithms, ConsA can rightfully be called the best possible option. It surpasses RF’s accuracy by 0.78%, and is superior in all other measures, including AUC (almost a 3% increase). Traditional combiners show different results: NN and SVM provide 2% worse results than RF and about 4.5% worse than ConsA, NB delivers even worse results. This proves that increasing complexity the of the classifier will significantly increase the benefits of using it. For this dataset, the complexity of the classifier is highly positively correlated with its accuracy and other performance metrics. Finally, the H-measure of ConsA is better than the second placed WAVG by about 6%.

UCSD: The ConsA classifier is always better than any other class, but its accuracy enhancement of only 0.59% is not very big. However, in large real world datasets this figure may be crucial regarding losses and profits, which undoubtedly makes ConsA the number one classifier for this dataset. The distribution of testing set rankings for the proposed model is demonstrated in histograms in Figures 6 to 12. Each histogram represents the following:

f (R∨0) is the predicted values subset where the actual target is 0 (Red);

f (R∨1) is the predicted values subset where the actual target is 1 (Green);

f (R) is the predicted value set (Black).

From Figure 6, it can be concluded that ConsA for the German dataset is much more certain about good loans prediction, than regarding bad ones, the highest probability (22%) is the ranking of a random bad loan entry in the interval [0.1-0.2].

Table 6Classifier results including those for the proposed method for all the datasets for the different performance measures with GNG +

MARS Dataset Base classifiers Traditional combination methods Proposed

method

German

Performance

measure

RF DT NB NN SVM MIN MAX PROD AVG MajVot WAVG WVOT ConsA

Accuracy 0.773 0.753 0.764 0.758 0.773 0.764 0.753 0.736 0.773 0.778 0.746 0.773 0.790AUC 0.794 0.699 0.774 0.772 0.794 0.718 0.788 0.709 0.800 0.755 0.746 0.688 0.802

H-measure 0.297 0.197 0.267 0.258 0.299 0.225 0.288 0.225 0.306 0.286 0.223 0.247 0.325Brier Score 0.160 0.221 0.193 0.170 0.164 0.166 0.206 0.232 0.158 0.184 0.180 0.193 0.164

Australian

Accuracy 0.871 0.869 0.861 0.864 0.869 0.866 0.866 0.858 0.873 0.874 0.871 0.870 0.881AUC 0.929 0.887 0.909 0.920 0.921 0.913 0.908 0.912 0.929 0.903 0.920 0.891 0.935


Japanese

Accuracy 0.872 0.862 0.863 0.869 0.854 0.864 0.845 0.860 0.865 0.865 0.854 0.872 0.887AUC 0.929 0.880 0.909 0.907 0.911 0.913 0.903 0.910 0.926 0.908 0.909 0.849 0.933


Iranian

Accuracy 0.951 0.951 0.945 0.950 0.946 0.950 0.910 0.950 0.950 0.950 0.950 0.946 0.958AUC 0.779 0.536 0.747 0.629 0.612 0.553 0.740 0.538 0.777 0.578 0.776 0.572 0.842


Polish

Accuracy 0.774 0.790 0.730 0.752 0.757 0.720 0.768 0.719 0.782 0.788 0.736 0.774 0.813AUC 0.841 0.798 0.800 0.806 0.816 0.829 0.824 0.819 0.859 0.858 0.801 0.799 0.874


Jordanian

Accuracy 0.866 0.861 0.821 0.845 0.847 0.825 0.853 0.816 0.857 0.860 0.862 0.857 0.874AUC 0.886 0.781 0.774 0.835 0.830 0.806 0.861 0.773 0.882 0.803 0.879 0.795 0.913


UCSD

Accuracy 0.869 0.841 0.808 0.849 0.846 0.805 0.842 0.803 0.864 0.865 0.860 0.869 0.875AUC 0.916 0.793 0.831 0.883 0.868 0.883 0.836 0.893 0.908 0.877 0.901 0.809 0.924


However, the bad loans prediction performance of most of the other classifiers and combiners is even worse and the few that show higher accuracy in such prediction have poor good loan prediction as well as overall accuracy. This indicates that for German dataset due to its imbalanced structure it is very difficult to build combiner with over 85% accuracy combiner. With Figure 7, it can be concluded that ConsA for the Australian dataset is very often certain about its decisions (length of bars near the 0.4 - 0.6 points is much less than those on the edges of the [0 and 1] ranking interval). Moreover, ConsA often is very certain about good loans (if the loan is good, the probability that ConsA will give less than a 0.1 prediction value is more than 60%). Regarding the Japanese dataset, Figure 8 shows that ConsA provides a very good level of confidence for good and bad loan entries. Most of the rankings of ConsA lie either in [0, 0.1] interval or the [0.9, 1] one. When the input loan is good, the probability that ConsA will give the number near 0.1 or less is almost 80%, whilst when it is bad, the probability that the ConsA will give the number near 0.9 or more is 70%. In relation to the Iranian dataset, Figure 9 proves again the fact that ConsA is very good at good loan recognition, but demonstrates much worse results in terms of bad loan identification. Most of the time, when the input query has ‘bad loan’ label, ConsA treat this query as good, and its ranking is in the [0.1 - 0.3] interval. Regarding the Polish dataset, ConsA is not very certain about its answers, as is the case with some of the other datasets. The most likely ranking of a good loan entry lies in the [0.1 - 0.2] interval (35%) and that of bad loans entry is apparent in the [0.8 - 1] interval. However, for 10% of input entries ConsA is not certain at all, as the rankings lie in the [0.4 - 0.6] interval (see Figure 10).

As it can be seen in Figure 11, regarding the real world Jordanian dataset, the tall red bar on the right of the graph indicates that ConsA is very certain regarding good loans, whereas for bad loans this is not the case. However, ConsA very rarely shows uncertainty (ratings between 0.4 and 0.6), and in most of the cases if it makes incorrect prediction, its ranking is not completely wrong (so for bad loans, it can make a mistake on 0.2-0.3 ranking, but not on 0-0.1). In other words, even when ConsA is wrong and actual class is ‘1’, its ranking is not ‘0’ (completely wrong) but rather (0.2 - 0.3). So, in the case that a 100% guarantee that ConsA

will make a correct good loan prediction, a true good loans can only be accepted for 0-0.1 rankings. The same logic is applicable to bad loan predictions. In the case where it is crucial to be sure about classifier prediction, a two-threshold system can be recommended as follows:

If the prediction is less than the first threshold, it can be accepted that the loan is good with great certainty; If the prediction is greater than the second threshold, it can be accepted that the loan is bad with great certainty. If the prediction is situated between the thresholds, this could be interpreted as the ‘grey zone’ and any decision based on

cannot be made. From Figure 12 it can be said that ConsA for the UCSD dataset is certain about its good and bad loans predictions, whereby most of the good loans are scored by a prediction value of less than 0.2 and most of the bad ones by a prediction value greater than 0.8. This is a big advantage of ConsA, for if the classifier gives a ranking close to the boundary of the [0 and 1] interval, it can be said almost for sure that it is correct. However, very few bad loans gain a prediction value of ‘1’, which could be due to the fact, that UCSD dataset is skewed. In summary, ConsA shows the best performance and for some datasets, it superior performance over all the other classifiers is impressive. The ranking histograms demonstrate that, for almost all the focal datasets, ConsA is certain about its predictions, which shows that it can be successfully used with various range of thresholds without any significant drop in accuracy. The most impressive performance for ConsA is regards to the Polish dataset, which can be explained by the fact that this dataset is balanced. ConsA also deals well with imbalanced datasets and its high H-measure shows that it can be successfully used with different pairs of misclassifying costs (false-positive cost and true-negative cost), with its misclassifying error for all thresholds being lower than for the other classifiers. This means that in real life, losses caused by ConsA wrong decisions will be smaller than those as a result of the decisions of any of the other classifiers that have been considered here.

5.2. Significance tests results

5.2.1. Friedman test for best classifiers

In this section Friedman’s statistical test is conducted on all the implemented classifiers to prove that ConsA is better not only on the seven datasets that have been investigated, but also with very high probability on all datasets with a similar structure to those examined in this paper. After this, the Bonferroni-Dunn test is performed to rank all classifiers from the best to the worst and to divide them into two groups, 1) classifiers that under some conditions could rival ConsA and 2) classifiers that are undoubtedly worse than ConsA. So, to make the conclusions scientifically more solid, analysis of Friedman test is considered for the three best classifiers, including ConsA. That is, the test was performed on ConsA, the best single classifier and the best classical combined classifier. The null hypothesis in this case is that the difference between these five base classifier rankings is accidental and not caused by the level of significance of each classifier. The null-hypothesis is accepted with 95 % probability, if the Friedman statistic

S< χ 0.052 (4 )=9.488∧¿ it is accepted with 90% probability, if S< χ 0.1

2 (4)=7.779. At a significance level of0.05, the

null hypothesis for all the classifiers rejected, except for the Polish dataset. Moreover, at the significance level 0.1 the null hypothesis for all datasets is also rejected apart from for the Polish dataset. The reason why the Polish dataset is an exception, is owing to its small size, having only 60 entries and if it had more entries, the Friedman statistic would be much higher.

5.2.2. Bonferroni-Dunn test for all classifiers

The Friedman ranking test (accuracy rankings) was calculated for all single classifiers, all classical combiners and ConsA. To evaluate the critical values of significance levels α=0.05 and α=0.1, a Bonferroni-Dunn two-tailed test is evaluated as in

equation (22). qa is calculated as the Studentised range statistic with a confidence level α /(k−1)=α /12, divided by √2. So,

in such a case, the Studentised range statistic test is calculated with confidence levels α=0.00416 and α=0.00833 . The

obtained values are q0.05=2.8653, q0.1=2.6383 and the obtained results

for criticaldifferences areC D0.05=6.8494, C D0.1=6.3067. check Looking at Figure 13. the two horizontal

lines, which are at heights equal to the sum of the lowest rank and the critical difference computed by the Bonferroni–Dunn test,

represent the threshold for the best performing method at each significance level (α=0.05 and α=0.1 ,(CD+1)). The

obtained results clearly show us that ConsA is obviously the best when compared to all the other classifiers and classical combiners. RF shows good stable results, holding second position for all the datasets, whilst DT is good, but worse than some of the classical combiners. Based on the evaluated critical values, it can be concluded that PROD, LR, NB, Max and MIN, SVM, WAVG and NN are significantly worse than the ConsA approach at significance levels α=0.05 and α=0.1, whilst DT is worse only at α=0.1.

Fig. 6. Frequency histogram of conditional and absolute values of RG for the test set for the German dataset

Fig. 7. Frequency histogram of conditional and absolute values of RGfor the test set for the Australian dataset

Fig.8. Frequency histogram of conditional and absolute values of RG for the test set for the Japanese dataset

Fig. 9. Frequency histogram of conditional and absolute values of RG for the test set for the Iranian dataset

Fig. 10. Frequency histogram of conditional and absolute values of RG for the test set for the Polish dataset

Fig. 11. Frequency histogram of conditional and absolute values of RG for the test set for the Jordanian dataset

Fig. 12. Frequency histogram of conditional and absolute values of RG for the test set for the UCSD dataset

Fig. 13. Significance ranking for the Bonferroni–Dunn two-tailed test for the ConsA approach, benchmark classifier, base classifiers and traditional combination methods, with

∝ = 0.05 and ∝= 0.10

5.3. Benchmark studies

In this section, a comparison of the proposed approach ConsA with recent related studies in credit scoring and data classification (Gorzałczany and Rudzinski, 2016; Partalas et al., 2010) is provide. Gorzałczany and Rudzinski (2016) which employed three of the same benchmark datasets as in this paper, namely, the German, Australian and Japanese. In their paper they developed a hybrid based on combining fuzzy-rules based classifiers with evolutionary optimisation algorithms. In their modelling design they used k-fold cross validation for their dataset splitting for model training and validation. They used different values of k-folds with repetition in order to minimise any bias that could be associated to the random splitting of the datasets, Specifically, values of 2, 3, 5 and 10 folds were employed in their paper. Moreover, further comparisons with other studies can be found in Gorzałczany and Rudzinski (2016).

Partalas et al. (2010) have proposed a new metric based on uncertainty weighted accuracy (UWA) to measure heterogamous ensemble pruning via direct hill climbing. The search is based on forward selection and backward elimination, and based on these search methods UWA determine whether to remove classifier from the ensemble, leave it or add a new one. Their approach was evaluated on many datasets, one of the datasets evaluated was the German datasets which is the same dataset used in this paper. In their modelling design they used 2 data splitting techniques to evaluate their approach based on hold out sampling where results are based 80% training data and 20% testing data, and on 60% training data, 20% validation data and 20% testing data.

Table 7Comparison of our proposed approach (ConsA) results with recent approaches across three benchmark datasets

Studies Proposed approach

German Australian Japanese

Data splitting

technique

Accuracy Data splitting

technique

Accuracy Data splitting

technique

Accuracy

Partalas et al.

(2010)

BVUWA10 80%-20%-20% 0.7475 - - - -

BTUWA11 80%-20% 0.7535 - - - -

FVUWA12 80%-20%-20% 0.751 - - - -

FTUWA13 80%-20% 0.704 - - - -

Gorzałczany and FRB-MOEOAs14 2 k-folds 0.744 2 k-folds 0.868 2 k-folds 0.867

3 k-folds 0.754 3 k-folds 0.873 3 k-folds 0.872

10 We tried several parameters and with a value of 60 RF achieved a very good training accuracy with acceptable computational speed.

11 Forward uncertainty weighted accuracy using training set.

12 Forward uncertainty weighted accuracy using validation set.

13 Backward uncertainty weighted accuracy using training set.

Rudzinski (2016)5 k-folds 0.765 5 k-folds 0.880 5 k-folds 0.882

10 k-folds 0.785 10 k-folds 0.891 10 k-folds 0.890

This paper ConsA 5 k-folds 0.790 5 k-folds 0.881 5 k-folds 0.887

Table 7 summarises a comparison of ConsA approach results against the aforementioned recent credit scoring and related binary classification studies in the literature. Our approach, in general, outperforms Partalas et al. (2010) in all their proposed approaches regarding German dataset and Gorzałczany and Rudzinski (2016) for 5-fold cross validation. Considering the other values of k-folds, ConsA for the German dataset is better and regarding the Australian and Japanese datasets our approach is better except for the 10-fold cross validation. In this regard, Gorzałczany and Rudzinski (2016) method outperforms ours by 1% and 0.3% in relation to the Australian and Japanese datasets, respectively. To make a reasonable comparison, with our proposed approach, for 5-folds, it provides better accuracy performance than Gorzałczany and Rudzinski (2016) across all three data sets.

6. Conclusions

The main advantage of ConsA compared to traditional combiners is the creation of a group ranking as a fusion of individual classifier rankings, rather than merging these rankings using arithmetical, logical or other mathematical functions. It simulates the real experts’ group behaviour: they continuously interchange their opinions, and change their measurements of possible answers influenced by other experts. The process continues until they come up with group decision, with which they all agree. Sometimes, however, experts cannot come up with such a decision, i.e. ConsA does not converge. To prevent these situations, it has been decided to use the least squares method instead of an iterations procedure to obtain optimal group ranking. Another problem is unknown conditional ranking values, which has been evaluated as a linear combination of two classifier rankings. Moreover, the better the accuracy of classifier is the more impact it has on other classifiers. So, the two things new in the current investigation when compared with Shaban et al (2002) are:

Using a local accuracy algorithm to estimate the performance of single classifiers at a given point and then evaluating conditional rankings;

Using the least squares algorithm instead of iterations to solve equation (17). The ConsA algorithm was tested on seven datasets with the aim of predicting the loan quality of the client (0 – good loan, 1 – bad

loan). For every dataset, when compared to the single classifiers, hybrid classifiers and traditional combiners, ConsA delivered better performance. In relation to the direction of future work, the proposed model could be modified by:

Analysing other approaches to the conditional ranking Ri(γk∨Γ j) evaluation of ConsA;

Investigate combining homogenous classifiers or different numbers of heterogeneous classifiers to see to what extent ConsA results can change;

Investigate different pre-processing methods for the datasets, such as other feature-selection or data-filtering methods; Improve ConsA so it can output, rather than a single floating-point ranking, a fuzzy opinion using fuzzy logic, whereby a

fuzzy matrix with fuzzy opinions can be produced.

References

Abellán, J., & Mantas, C. J. (2014). Improving experimental studies about ensembles of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 41, 3825-3830.

Acuna, E., & Rodriguez, C. (2004). The treatment of missing values and its effect on classifier Accuracy. Classification, clustering, and data mining applications (pp. 639-647) Springer.Ala'raj, M., & Abbod, M. F. (2016). Classifiers consensus system approach for credit scoring. Knowledge-Based Systems, 104, 89-105.Angelini, E., di Tollo, G., & Roli, A. (2008). A neural network approach for credit risk evaluation. The quarterly review of economics and finance, 48, 733-755Antonakis, A., & Sfakianakis, M. (2009). Assessing naive Bayes as a method for screening credit applicants. Journal of applied Statistics, 36, 537-545. Asuncion, A., & Newman, D. (2007). UCI machine learning repository. Atiya, A. F., & Parlos, A. G. (2000). New results on recurrent network training: unifying the algorithms and accelerating convergence. Neural Networks, IEEE Transactions on, 11,

697-709. Baesens, B., Van Gestel, T., Viaene, S., Stepanova, M., Suykens, J., & Vanthienen, J. (2003). Benchmarking state-of-the-art classification algorithms for credit scoring. Journal of the

Operational Research Society, 54, 627-635. Basir, O. A., & Shen, H. C. (1993). New approach for aggregating multi‐sensory data. Journal of Robotic Systems, 10, 1075-1093. Bellotti, T., & Crook, J. (2009). Support vector machines for credit scoring and discovery of significant features. Expert Systems with Applications, 36, 3302-3308. Benediktsson, J. A., & Swain, P. H. (1992). Consensus theoretic classification methods. Systems, Man and Cybernetics, IEEE Transactions on, 22, 688-704. Berger, R. L. (1981). A necessary and sufficient condition for reaching a consensus using DeGroot's method. Journal of the American Statistical Association, 76, 415-418. Bhattacharyya, S., & Maulik, U. (2013). Soft computing for image and multimedia data processing. Springer.Bishop, C. M. (2006). Pattern recognition and machine learning. springer. Breiman, L. (2001). RFs. Machine-learning, 45(1), 5-32. Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Whether Review, 78(1), 1-3. Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and regression trees. Wadsworth. Belmont, CA. Brown, I., & Mues, C. (2012). An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39, 3446-3453. Chen, F., & Li, F. (2010). Combination of feature selection approaches with SVM in credit scoring. Expert Systems with Applications, 37, 4902-4909. Chen, W., Ma, C., & Ma, L. (2009). Mining the customer credit using hybrid support vector machine technique. Expert Systems with Applications, 36, 7611-7616. Chitroub, S. (2010). Classifier combination and score level fusion: concepts and practical aspects. International Journal of Image and Data Fusion, 1, 113-135. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.DeGroot, M. H. (1974). Reaching a consensus. Journal of the American Statistical Association, 69, 118-121. Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine Learning Research, 7, 1-30. Desai, V. S., Crook, J. N., & Overstreet, G. A. (1996). A comparison of neural networks and linear scoring models in the credit union environment. European Journal of Operational

Research, 95, 24-37. Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52-64. Finlay, S. (2011). Multiple classifier architectures and their application to credit risk assessment. European Journal of Operational Research, 210, 368-378. Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11, 86-92. Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, 1–67.

14 Backward uncertainty weighted accuracy using validation set.

García, V., Marqués, A., & Sánchez, J. S. (2012). On the use of data filtering techniques for credit risk prediction with instance-based models. Expert Systems with Applications, 39, 13267-13276.

García, V., Marqués, A. I., & Sánchez, J. S. (2015). An insight into the experimental design for credit risk and corporate bankruptcy prediction systems. Journal of Intelligent Information Systems, 44, 159-189.

Gorzałczany, M. B., & Rudziński, F. (2016). A multi-objective genetic optimization for fast, fuzzy rule-based credit classification with balanced accuracy and interpretability. Applied Soft Computing, 40, 206-220.

Hand, D. J. (2009). Measuring classifier performance: a coherent alternative to the area under the ROC curve. Machine Learning, 77, 103-123. Harris, T. (2015). Credit scoring using the clustered support vector machine. Expert Systems with Applications, 42, 741-750. Haykin, S. (1999). Adaptive filters. Signal Processing Magazine, 6. Hsieh, N., & Hung, L. (2010). A data driven ensemble classifier for credit scoring analysis. Expert Systems with Applications, 37, 534-545. Huang, C., Chen, M., & Wang, C. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems with Applications, 33, 847-856. Jekabsons, G. (2009). Adaptive Regression Splines Toolbox for Matlab. Ver, 1, 3-17. Kennedy, K., Mac Namee, B., & Delany, S. J. (2012). Using semi-supervised classifiers for credit scoring. Journal of the Operational Research Society, 64, 513-529. Kittler, J., Hatef, M., Duin, R. P., & Matas, J. (1998). On combining classifiers. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 20, 226-239. Kumar, P. R., & Ravi, V. (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques–A review. European Journal of Operational Research, 180, 1-28.

Lai, K. K., Yu, L., Zhou, L., & Wang, S. (2006). Credit risk evaluation with least square support vector machine. In Anonymous Rough Sets and Knowledge Technology (pp. 490-495). Springer.

Lee, T., & Chen, I. (2005). A two-stage hybrid credit scoring model using artificial neural networks and multivariate adaptive regression splines. Expert Systems with Applications, 28, 743-752.

Lee, T., Chiu, C., Chou, Y., & Lu, C. (2006). Mining the customer credit using classification and regression tree and multivariate adaptive regression splines. Computational Statistics & Data Analysis, 50, 1113-1130

Lessmann, S., Baesens, B., Seow, H., & Thomas, L. C. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research. European Journal of Operational Research.

Li, X., & Zhong, Y. (2012). An overview of personal credit scoring: techniques and future work. Li, S., Shiue, W., & Huang, M. (2006). The evaluation of consumer loans using support vector machines. Expert Systems with Applications, 30, 772-782.Li, H., & Sun, J. (2009). Majority voting combination of multiple case-based reasoning for financial distress prediction. Expert Systems with Applications, 36(3), 4363-4373.Lin, W., Hu, Y., & Tsai, C. (2012). Machine learning in financial crisis prediction: a survey. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on,

42, 421-436. Liu, Y., & Schumann, M. (2005). Data mining feature selection for credit-scoring models. Journal of the Operational Research Society, 56(9), 1099-1108.Louzada, F., Ara, A., & Fernandes, G. B. (2016). Classification methods applied to credit scoring: A systematic review and overall comparison. arXiv preprint arXiv:1602.02137.Marqués, A., García, V., & Sánchez, J. S. (2012). Exploring the behaviour of base classifiers in credit scoring ensembles. Expert Systems with Applications, 39, 10244-10250. Marqués, A., García, V., & Sánchez, J. S. (2012). Two-level classifier ensembles for credit risk assessment. Expert Systems with Applications, 39, 10916-10922. Nanni, L., & Lumini, A. (2009). An experimental comparison of ensemble of classifiers for bankruptcy prediction and credit scoring. Expert Systems with Applications, 36, 3028-

3033. Pietruszkiewicz, W. (2008). Dynamical systems and nonlinear Kalman filtering applied in classification. , 1-6. Partalas, I., Tsoumakas, G., & Vlahavas, I. (2010). An ensemble uncertainty aware measure for directed hill climbing ensemble pruning. Machine Learning, 81(3), 257-282.Rhemtulla, M., Brosseau-Liard, P. É., & Savalei, V. (2012). When can categorical variables be treated as continuous. A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological methods, 17, 354. Rokach, L. (2010). Ensemble-based classifiers. Artificial Intelligence Review, 33, 1-39. Sabzevari, H., Soleymani, M., & Noorbakhsh, E. (2007). A comparison between statistical and data mining methods for credit scoring in case of limited available data. Shaban, K., Basir, O., Kamel, M., & Hassanein, K. (2002). Intelligent information fusion approach in cooperative multiagent systems. , 13, 429-434. Šušteršič, M., Mramor, D., & Zupan, J. (2009). Consumer credit-scoring models with limited data.‍Expert‍Systems‍with‍Applications,‍36(3), 4736-4744. Tan, M. (1993). Multi-agent reinforcement learning: Independent vs. cooperative agents. , 330-337. Tomczak, J. M., & Zięba, M. (2015). Classification restricted Boltzmann machine for comprehensible credit scoring model. Expert Systems with Applications, 42, 1789-1796.Tsai, C. (2014). Combining cluster analysis with classifier ensembles to predict financial distress. Information Fusion, 16, 46-58. Tsai, C. (2009). Feature selection in bankruptcy prediction. Knowledge-Based Systems, 22, 120-127. Tsai, C., & Chen, M. (2010). Credit rating by hybrid machine learning techniques. Applied soft computing, 10, 374-380. Tsai, C., & Cheng, K. (2012). Simple instance selection for bankruptcy prediction. Knowledge-Based Systems, 27, 333-342. Tsai, C., & Chou, J. (2011). Data pre-processing by genetic algorithms for bankruptcy prediction. , 1780-1783. Tsai, C., & Wu, J. (2008). Using neural network ensembles for bankruptcy prediction and credit scoring. Expert Systems with Applications, 34, 2639-2649. Verikas, A., Kalsyte, Z., Bacauskiene, M., & Gelzinis, A. (2010). Hybrid and ensemble-based soft computing techniques in bankruptcy prediction: a survey. Soft Computing, 14, 995-

1010. Wang, G., Hao, J., Ma, J., & Jiang, H. (2011). A comparative assessment of ensemble learning for credit scoring. Expert Systems with Applications, 38, 223-230. Wang, C., & Huang, Y. (2009). Evolutionary-based feature selection approaches with new criteria for data mining: A case study of credit approval data. Expert Systems with Applications, 36(3), 5900-5908. Wang, G., Ma, J., Huang, L., & Xu, K. (2012). Two credit scoring models based on dual strategy ensemble trees. Knowledge-Based Systems, 26, 61-68. West, D., Dellana, S., & Qian, J. (2005). Neural network ensemble strategies for financial decision applications. Computers & Operations Research, 32, 2543-2559. Wilson, D. R., & Martinez, T. R. (2000). Reduction techniques for instance-based learning algorithms. Machine Learning, 38, 257-286. Woods, K., Bowyer, K., & Kegelmeyer Jr, W. P. (1996). Combination of multiple classifiers using local accuracy estimates. , 391-396. Xiao, H., Xiao, Z., & Wang, Y. (2016). Ensemble classification based on supervised clustering for credit scoring. Applied Soft Computing, 43, 73-86.Xiao, J., Xie, L., He, C., & Jiang, X. (2012). Dynamic classifier ensemble model for customer classification with imbalanced class distribution. Expert Systems with Applications, 39,

3668-3675. Xu, L., Krzyżak, A., & Suen, C. Y. (1992). Methods of combining multiple classifiers and their applications to handwriting recognition. Systems, man and cybernetics, IEEE

transactions on, 22, 418-435. Yao, P. (2009). Feature selection based on SVM for credit scoring. , 2, 44-47. Yu, L. &Liu, H. (2003). Feature selection for high-dimensional data: A fast correlation-based filter solution. In ICML, 3, 856-863Yu, L., Wang, S., & Lai, K. K. (2009). An intelligent-agent-based fuzzy group decision making model for financial multicriteria decision support: The case of credit scoring.

European Journal of Operational Research, 195, 942-959. Yu, L., Wang, S., & Lai, K. K. (2008). Credit risk assessment with a multistage neural network ensemble learning approach. Expert Systems with Applications, 34, 1434-1444. Yu, L., Yue, W., Wang, S., & Lai, K. K. (2010). Support vector machine based multiagent ensemble learning for credit risk evaluation. Expert Systems with Applications, 37, 1351-

1360. Zang, W., Zhang, P., Zhou, C., & Guo, L. (2014). Comparative study between incremental and ensemble learning on data streams: Case study. Journal of Big Data, 1, 1-16. Zhang, D., Zhou, X., Leung, S. C., & Zheng, J. (2010). Vertical bagging decision trees model for credit scoring. Expert Systems with Applications, 37, 7838-7843. Zhou, L., Lai, K. K., & Yu, L. (2010). Least squares support vector machines ensemble models for credit scoring. Expert Systems with Applications, 37, 127-133. Zhou, L., Tam, K. P., & Fujita, H. (2016). Predicting the listing status of Chinese listed companies with multi-class classification models. Information Sciences, 328, 222-236.

Documents

bura.brunel.ac.ukbura.brunel.ac.uk/bitstream/2438/12932/1/Fulltext.docx · Web viewThe utilisation of the different techniques in building credit-scoring models have varied over time,