A novel credit scoring model based on feature selection and PSO

(Source: Data Mining Lectures, Lecture 18: Credit Scoring, Padhraic Smyth, UC Irvine)

Variable Name   Description                     Codings
dob             Year of birth                   if unknown, the year will be 99
nkid            Number of children              number
dep             Number of other dependents      number
phon            Is there a home phone?          1 = yes, 0 = no
sinc            Spouse's income
aes             Applicant's employment status   V = government, W = housewife,
                                                M = military, P = private sector,
                                                B = public sector, R = retired,
                                                E = self-employed, T = student,
                                                U = unemployed, N = others,
                                                Z = no response

Outline

• What is classification? What is prediction?
• Classification by decision tree induction
• Prediction of continuous values

Classification vs. Prediction

• Classification
  – Predicts categorical class labels
  – Classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute; then uses the model to classify new data
• Prediction
  – Models continuous-valued functions, i.e., predicts unknown or missing values

Classification is prediction for discrete and nominal values: with classification, one can predict which bucket (red, green, gray, blue, pink, …) a ball belongs in, but one cannot predict the weight of the ball.

Supervised and Unsupervised

• Unsupervised classification is clustering.
  – The class labels are unknown.
  – The number of classes may also be unknown.
• Supervised classification is classification.
  – The class labels and the number of classes are known.

[Figure: labeled buckets (red, green, gray, blue, pink, …, numbered 1 to n) for the supervised case vs. unlabeled buckets ("?") with a possibly unknown count for the unsupervised case.]

Typical Applications

• Credit approval
• Target marketing
• Medical diagnosis
• Treatment effectiveness analysis

Classification Example: Credit Approval

• Credit scoring tries to assess the credit risk of a new customer. This can be transformed into a classification problem by:
  – creating two classes, good and bad customers;
  – generating a classification model from existing customer data and their credit behavior;
  – using the model to assign a new potential customer to one of the two classes, and hence accept or reject him.

Specific example:
• Banks generally have information on the payment behavior of their credit applicants.
• Combining this financial information with other information about the customers, such as sex, age, and income, it is possible to develop a system that classifies new customers as good or bad (i.e., the credit risk in accepting the customer is low or high, respectively). A minimal sketch of this setup follows.
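A minimal sketch of this setup, assuming scikit-learn and pandas; the six customer records are invented, and the column names follow the variable codings tabulated later in this document:

```python
# Sketch: credit approval as binary classification (all records invented).
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Existing customers; column codings follow the variable tables in this document.
customers = pd.DataFrame({
    "nkid": [0, 2, 1, 3, 0, 1],              # number of children
    "phon": [1, 0, 1, 0, 1, 1],              # home phone: 1 = yes, 0 = no
    "aes":  ["P", "U", "B", "U", "E", "P"],  # employment status code
    "bad":  [0, 1, 0, 1, 0, 0],              # credit behavior: 1 = bad, 0 = good
})
X = pd.get_dummies(customers[["nkid", "phon", "aes"]])  # one-hot encode aes
model = LogisticRegression().fit(X, customers["bad"])   # generate the model

# Assign a new potential customer to one of the two classes.
new = pd.get_dummies(pd.DataFrame({"nkid": [1], "phon": [1], "aes": ["P"]}))
new = new.reindex(columns=X.columns, fill_value=0)      # align dummy columns
print("reject" if model.predict(new)[0] else "accept")
```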


Classification Process

[Diagram: Data is split into Training Data and Test Data; the training data is used to derive a classifier (model), and the test data is used to estimate its accuracy.]

Classification: Two-Step Process

1. Construct the model
   • by describing a set of predetermined classes
2. Use the model in prediction
   • estimate the accuracy of the model
   • use the model to classify unseen objects or future data

Preparing Data Before Classification

Data transformation:
• Discretization of continuous data
• Normalization to [-1, 1] or [0, 1]
Data cleaning:
• Smoothing to reduce noise
Relevance analysis:
• Feature selection to eliminate irrelevant attributes
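A minimal sketch of the first two transforms with scikit-learn; the tiny two-column matrix (say, age and income) is invented for illustration:

```python
# Sketch: normalization and discretization before classification.
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, MinMaxScaler

X = np.array([[23.0, 1500.0],
              [45.0,  300.0],
              [31.0, 2200.0]])                                  # e.g., age, income

X_norm = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)    # normalize to [0, 1]
X_disc = KBinsDiscretizer(n_bins=3, encode="ordinal",
                          strategy="uniform").fit_transform(X)  # discretize
print(X_norm, X_disc, sep="\n")
```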

Training Data

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

Step 1: Model Construction

Training Data

NAME   RANK            YEARS   TENURED
Mike   Assistant Prof  3       no
Mary   Assistant Prof  7       yes
Bill   Professor       2       yes
Jim    Associate Prof  7       yes
Dave   Assistant Prof  6       no
Anne   Associate Prof  3       no

1-a. Extract a set of training data from the database. Each tuple/sample in the training set is assumed to belong to a predefined class, as determined by the class label attribute (here, TENURED).

1-b. Develop or adopt classification algorithms.
1-c. Use the training set to construct the model.

Classifier (Model):

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
(a classification rule)

The model is represented as:
• classification rules,
• decision trees, or
• mathematical formulae.
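A minimal sketch of step 1 on the toy table above. scikit-learn's decision tree stands in for the classification algorithm (the slides do not prescribe one), and export_text prints the constructed model in rule-like form:

```python
# Sketch: step 1, construct a model from the toy training table (rank, years -> tenured).
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.DataFrame({
    "rank":    ["Assistant Prof", "Assistant Prof", "Professor",
                "Associate Prof", "Assistant Prof", "Associate Prof"],
    "years":   [3, 7, 2, 7, 6, 3],
    "tenured": ["no", "yes", "yes", "yes", "no", "no"],   # class label attribute
})
X = pd.get_dummies(train[["rank", "years"]])              # one-hot encode rank
clf = DecisionTreeClassifier(random_state=0).fit(X, train["tenured"])
print(export_text(clf, feature_names=list(X.columns)))    # the model as readable rules
```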

Classification: Two-Step Process

2. Model evaluation (accuracy)
   • Estimate the accuracy rate of the model on a test set.
     – The known label of each test sample is compared with the classified result from the model.
     – The accuracy rate is the percentage of test set samples that are correctly classified by the model.
     – The test set is independent of the training set; otherwise over-fitting will occur.
   • The model is used to classify unseen objects.
     – Give a class label to a new tuple.
     – Predict the value of an actual attribute.

Step 2: Use the Model in Prediction

Testing Data

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       no
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

2-a. Extract a set of test data from the database. Note: the test set is independent of the training set; otherwise over-fitting will occur.
2-b. Use the classifier model to classify the test data.

Classifier:

IF rank = 'professor' OR years > 6
THEN tenured = 'yes'

Classified Data

NAME     RANK            YEARS   TENURED
Tom      Assistant Prof  2       no
Merlisa  Associate Prof  7       yes
George   Professor       5       yes
Joseph   Assistant Prof  7       yes

2-c. Compare the known labels of the test data with the classified results from the model (above, Merlisa's known label is "no" but the model classifies her as "yes").
2-d. Estimate the accuracy rate of the model: the percentage of test set samples that are correctly classified by the model. A minimal sketch of this evaluation follows.
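A minimal sketch of step 2 on the toy tables above, applying the rule learned in step 1 to the test data (pandas is the only dependency):

```python
# Sketch: step 2, classify the test data and estimate the accuracy rate.
import pandas as pd

test = pd.DataFrame({
    "name":    ["Tom", "Merlisa", "George", "Joseph"],
    "rank":    ["Assistant Prof", "Associate Prof", "Professor", "Assistant Prof"],
    "years":   [2, 7, 5, 7],
    "tenured": ["no", "no", "yes", "yes"],   # known labels from the database
})

# Learned model: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
pred = ((test["rank"] == "Professor") | (test["years"] > 6)).map(
    {True: "yes", False: "no"})

print(test.assign(classified=pred))
accuracy = (pred == test["tenured"]).mean()
print(f"accuracy rate: {accuracy:.0%}")      # Merlisa is misclassified, so 75%
```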


Classification Process: Use the Model in Prediction

2-e. Modify the model if need be.

Classifier (modified):

IF rank = 'professor'
THEN tenured = 'yes'

2-f. The model is used to classify unseen objects:
• give a class label to a new tuple
• predict the value of an actual attribute

Unseen Data

NAME    RANK            YEARS   TENURED
Maria   Assistant Prof  5       ?
Juan    Associate Prof  3       ?
Pedro   Professor       4       ?
Joseph  Assistant Prof  8       ?

Classified by the modified model:

NAME    RANK            YEARS   TENURED
Maria   Assistant Prof  5       no
Juan    Associate Prof  3       no
Pedro   Professor       4       yes
Joseph  Assistant Prof  8       no

Classification Methods

• Decision tree induction
• Neural networks
• Bayesian
• k-nearest neighbor classifier
• Case-based reasoning
• Genetic algorithms
• Rough set approach
• Fuzzy set approaches

Improving Accuracy: Composite Classifier

[Diagram: the data is used to train Classifier 1, Classifier 2, Classifier 3, …, Classifier n; their votes are combined to classify new data.]
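A minimal sketch of vote combination, assuming scikit-learn and synthetic data; the three base classifiers are arbitrary choices:

```python
# Sketch: a composite classifier that combines the votes of n base classifiers.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

composite = VotingClassifier(
    estimators=[("tree", DecisionTreeClassifier(random_state=0)),
                ("nb", GaussianNB()),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="hard",                       # each classifier votes; the majority wins
)
print("composite accuracy:", cross_val_score(composite, X, y, cv=5).mean())
```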

Evaluating Classification Methods

• Predictive accuracy
• Speed and scalability
  – time to construct the model
  – time to use the model
• Robustness
  – handling noise and missing values
• Scalability
  – efficiency on disk-resident databases
• Interpretability
  – understanding and insight provided by the model

Outline

• What is classification? What is prediction?
• Classification by decision tree induction
• Prediction of continuous values

Outline

• Introduction and Motivation
• Background and Related Work
• Preliminaries
  – Publications
  – Theoretical Framework
  – Empirical Framework: Margin-Based Instance Weighting
  – Empirical Study
• Planned Tasks

Introduction and Motivation: Feature Selection Applications

[Figure: three application domains for feature selection: text categorization (a documents-by-terms count matrix, documents D1…DM, terms T1…TN, with class labels such as Sports, Travel, Jobs), bioinformatics (a samples-by-features matrix whose features are genes or proteins), and image analysis (pixels vs. features).]

Introduction and Motivation: Feature Selection from High-Dimensional Data

p: number of features; n: number of samples. High-dimensional data: p >> n.

Curse of dimensionality:
• effects on distance functions
• effects in optimization and learning
• effects in Bayesian statistics

Feature selection:
• alleviates the effect of the curse of dimensionality
• enhances generalization capability
• speeds up the learning process
• improves model interpretability

Knowledge Discovery on High-Dimensional Data

[Diagram: high-dimensional data → feature selection algorithm (mRMR, SVM-RFE, Relief-F, F-statistics, etc.) → low-dimensional data → learning models (classification, clustering, etc.)]

Introduction and Motivation: Stability of Feature Selection

[Diagram: the same feature selection method applied to several variations of the training data yields several feature subsets. Consistent or not?]

Stability of feature selection: the insensitivity of the result of a feature selection algorithm to variations in the training set.

[Diagram: the analogous setting for a learning algorithm, where variations of the training data yield different learned models.]

The stability of learning algorithms was first examined by Turney in 1995. The stability of feature selection was relatively neglected before and has recently attracted interest from data mining researchers. A minimal sketch of how it can be measured follows.
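A minimal sketch of quantifying stability, assuming scikit-learn and synthetic data: the same filter selector is run on bootstrap variations of the training set and the selected subsets are compared by pairwise Jaccard similarity (one common choice of stability measure, not the only one):

```python
# Sketch: stability as agreement between subsets selected on resampled training data.
from itertools import combinations
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
rng = np.random.default_rng(0)

subsets = []
for _ in range(5):                                    # five variations of the training set
    idx = rng.choice(len(X), size=len(X), replace=True)
    sel = SelectKBest(f_classif, k=5).fit(X[idx], y[idx])
    subsets.append(set(sel.get_support(indices=True)))

jaccard = [len(a & b) / len(a | b) for a, b in combinations(subsets, 2)]
print("mean pairwise Jaccard similarity:", np.mean(jaccard))   # 1.0 = perfectly stable
```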

Variable codings (continued):

Variable Name   Description                     Codings
dainc           Applicant's income
res             Residential status              O = owner, F = tenant (furnished),
                                                U = tenant (unfurnished),
                                                P = with parents, N = other,
                                                Z = no response
dhval           Value of home                   0 = no response or not owner,
                                                000001 = zero value,
                                                blank = no response
dmort           Mortgage balance outstanding    0 = no response or not owner,
                                                000001 = zero balance,
                                                blank = no response
doutm           Outgoings on mortgage or rent
doutl           Outgoings on loans
douthp          Outgoings on hire purchase
doutcc          Outgoings on credit cards
bad             Good/bad indicator              1 = bad, 0 = good

The Feature Selection Process

• Evaluation criteria: filter model, wrapper model, embedded model
• Search strategies: complete search, sequential search, random search
• Representative algorithms: Relief, SFS, MDLM, etc.; FSBC, ELSA, LVW, etc.; BBHFS, Dash-Liu's, etc.

Evaluation Strategies

• Filter methods
  – Evaluation is independent of the classification algorithm.
  – The objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence, or information-theoretic measures (see the sketch below).
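A minimal sketch of a filter method, assuming scikit-learn and synthetic data; mutual information serves as the information-theoretic score:

```python
# Sketch: a filter method -- score each feature by mutual information with the
# class, independently of any classifier, and keep the k best.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = make_classification(n_samples=200, n_features=30, n_informative=5, random_state=0)
selector = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
print("selected feature indices:", selector.get_support(indices=True))
```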

Evaluation Strategies

• Wrapper methods
  – Evaluation uses criteria related to the classification algorithm.
  – The objective function is a pattern classifier, which evaluates feature subsets by their predictive accuracy (recognition rate on test data), estimated by statistical resampling or cross-validation (see the sketch below).
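A minimal sketch of a wrapper objective function, assuming scikit-learn and synthetic data; the k-NN classifier is an arbitrary choice:

```python
# Sketch: a wrapper objective function -- predictive accuracy of a classifier
# on the candidate subset, estimated by cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

def subset_score(features):
    """Cross-validated recognition rate of k-NN restricted to `features`."""
    return cross_val_score(KNeighborsClassifier(), X[:, features], y, cv=5).mean()

print(subset_score([0, 1, 2]), subset_score([3, 4, 5]))
```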

Naïve Search

• Sort the given n features in order of their probability of correct recognition.
• Select the top d features from this sorted list (see the sketch below).
• Disadvantage
  – Feature correlation is not considered.
  – The best pair of features may not even contain the best individual feature.
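A minimal sketch of naive search under the same assumptions (scikit-learn, synthetic data, k-NN as an arbitrary classifier): features are ranked by their individual recognition rate and the top d are kept:

```python
# Sketch: naive search -- rank features individually, keep the top d.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)

# Recognition rate of each feature on its own.
scores = [cross_val_score(KNeighborsClassifier(), X[:, [j]], y, cv=5).mean()
          for j in range(X.shape[1])]
d = 3
top_d = np.argsort(scores)[::-1][:d].tolist()   # the d individually best features
print("selected:", sorted(top_d))
```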

Sequential Forward Selection (SFS) (heuristic search)

• First, the best single feature is selected (i.e., using some criterion function).
• Then, pairs of features are formed using one of the remaining features and this best feature, and the best pair is selected.
• Next, triplets of features are formed using one of the remaining features and these two best features, and the best triplet is selected.
• This procedure continues until a predefined number of features are selected.

SFS performs best when the optimal subset is small.
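A minimal sketch of SFS, assuming scikit-learn and synthetic data; cross-validated k-NN accuracy plays the role of the criterion function J, but any criterion could be substituted:

```python
# Sketch: sequential forward selection (SFS).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=15, n_informative=5, random_state=0)

def J(feats):
    """Criterion function: cross-validated accuracy on the candidate subset."""
    return cross_val_score(KNeighborsClassifier(), X[:, list(feats)], y, cv=5).mean()

def sfs(n_features, d):
    selected = []                                       # start from the empty set
    while len(selected) < d:
        remaining = [j for j in range(n_features) if j not in selected]
        best = max(remaining, key=lambda j: J(selected + [j]))  # best single addition
        selected.append(best)
    return selected

print("SFS subset:", sfs(X.shape[1], d=4))
```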

Example

[Figure: results of sequential forward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%) and the y-axis shows the features added at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.]

Sequential Backward Selection (SBS) (heuristic search)

• First, the criterion function is computed for all n features.
• Then, each feature is deleted one at a time, the criterion function is computed for all subsets with n-1 features, and the worst feature is discarded.
• Next, each feature among the remaining n-1 is deleted one at a time, and the worst feature is discarded to form a subset with n-2 features.
• This procedure continues until a predefined number of features are left.

SBS performs best when the optimal subset is large.
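Rather than hand-rolling the mirror image of the SFS sketch above, this sketch uses scikit-learn's SequentialFeatureSelector, whose direction="backward" mode performs exactly this repeated deletion (synthetic data again):

```python
# Sketch: SBS via scikit-learn's SequentialFeatureSelector (direction="backward"
# repeatedly discards the worst feature until n_features_to_select remain).
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=15, n_informative=5, random_state=0)
sbs = SequentialFeatureSelector(KNeighborsClassifier(), n_features_to_select=4,
                                direction="backward", cv=5).fit(X, y)
print("SBS subset:", sbs.get_support(indices=True))
```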

Example

[Figure: results of sequential backward feature selection for classification of a satellite image using 28 features. The x-axis shows the classification accuracy (%) and the y-axis shows the features removed at each iteration (the first iteration is at the top). The highest accuracy value is shown with a star.]

Bidirectional Search (BDS)

• BDS applies SFS and SBS simultaneously:
  – SFS is performed from the empty set.
  – SBS is performed from the full set.
• To guarantee that SFS and SBS converge to the same solution:
  – Features already selected by SFS are not removed by SBS.
  – Features already removed by SBS are not selected by SFS.
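A minimal sketch of BDS under the same assumptions; the selected and removed sets enforce the two convergence conditions above:

```python
# Sketch: bidirectional search (BDS). SFS grows `selected` from the empty set while
# SBS shrinks the full set by adding features to `removed`; each side respects the other.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=12, n_informative=4, random_state=0)

def J(feats):
    return cross_val_score(KNeighborsClassifier(), X[:, sorted(feats)], y, cv=5).mean()

candidates = set(range(X.shape[1]))
selected, removed = set(), set()           # locked in by SFS / locked out by SBS
while selected | removed != candidates:
    free = candidates - selected - removed
    # SFS step: add the best free feature (never one removed by SBS)
    selected.add(max(free, key=lambda j: J(selected | {j})))
    free = candidates - selected - removed
    if not free:
        break
    # SBS step: discard the worst free feature (never one selected by SFS)
    sbs_side = candidates - removed
    removed.add(max(free, key=lambda j: J(sbs_side - {j})))

print("BDS subset:", sorted(selected))
```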

“Plus-L, minus-R” selection (LRS)

• A generalization of SFS and SBS.
  – If L > R, LRS starts from the empty set and:
    • repeatedly adds L features
    • repeatedly removes R features
  – If L < R, LRS starts from the full set and:
    • repeatedly removes R features
    • repeatedly adds L features
• LRS attempts to compensate for the weaknesses of SFS and SBS with some backtracking capability (see the sketch below).
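A minimal sketch of the L > R case under the same assumptions; L = 3 and R = 1 are arbitrary illustrative values:

```python
# Sketch: "plus-L, minus-R" selection for L > R, starting from the empty set.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=12, n_informative=4, random_state=0)

def J(feats):
    return cross_val_score(KNeighborsClassifier(), X[:, list(feats)], y, cv=5).mean()

def lrs(n_features, d, L=3, R=1):
    selected = []
    while len(selected) < d:
        for _ in range(L):                 # plus-L: forward steps
            pool = [j for j in range(n_features) if j not in selected]
            selected.append(max(pool, key=lambda j: J(selected + [j])))
        for _ in range(R):                 # minus-R: backward steps
            worst = max(selected, key=lambda j: J([f for f in selected if f != j]))
            selected.remove(worst)         # dropping `worst` hurts J the least
    return selected[:d]

print("LRS subset:", lrs(X.shape[1], d=4))
```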

Sequential floating selection (SFFS and SFBS)

• An extension of LRS with flexible backtracking capabilities.
  – Rather than fixing the values of L and R, floating methods determine these values from the data.
  – The dimensionality of the subset during the search can be thought of as "floating" up and down.
• There are two floating methods:
  – Sequential floating forward selection (SFFS)
  – Sequential floating backward selection (SFBS)

P. Pudil, J. Novovicova, J. Kittler, "Floating search methods in feature selection," Pattern Recognition Letters 15 (1994) 1119-1125.

Sequential floating selection (SFFS and SFBS)

• SFFS
  – Sequential floating forward selection (SFFS) starts from the empty set.
  – After each forward step, SFFS performs backward steps as long as the objective function increases.
• SFBS
  – Sequential floating backward selection (SFBS) starts from the full set.
  – After each backward step, SFBS performs forward steps as long as the objective function increases.
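A minimal sketch of SFFS under the same assumptions. The best_J bookkeeping applies a backward step only when it beats the best criterion value seen so far at that subset size, which is what keeps the floating search from cycling:

```python
# Sketch: sequential floating forward selection (SFFS).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=12, n_informative=4, random_state=0)

def J(feats):
    return cross_val_score(KNeighborsClassifier(), X[:, list(feats)], y, cv=5).mean()

def sffs(n_features, d):
    selected, best_J = [], {}                  # best criterion value per subset size
    while len(selected) < d:
        pool = [j for j in range(n_features) if j not in selected]
        selected.append(max(pool, key=lambda j: J(selected + [j])))   # forward step
        best_J[len(selected)] = max(best_J.get(len(selected), float("-inf")),
                                    J(selected))
        while len(selected) > 2:               # floating backward steps
            worst = max(selected, key=lambda f: J([g for g in selected if g != f]))
            reduced = [g for g in selected if g != worst]
            score = J(reduced)
            if score > best_J.get(len(reduced), float("-inf")):  # only if it improves
                selected, best_J[len(reduced)] = reduced, score
            else:
                break
    return selected

print("SFFS subset:", sffs(X.shape[1], d=4))
```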

Feature Selection using Genetic Algorithms (GAs) (randomized search)

[Diagram: pre-processing → feature extraction → feature selection (GA), which iterates between candidate feature subsets and a classifier that evaluates them, yielding the final feature subset.]

GAs provide a simple, general, and powerful framework for feature selection.
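A minimal sketch of GA-based feature selection under the same assumptions: individuals are feature-subset bitmasks, fitness is the classifier's cross-validated accuracy, and one-point crossover, bit-flip mutation, and truncation selection are simple illustrative operator choices (the slides do not prescribe specific operators or parameters):

```python
# Sketch: GA-based feature selection with bitmask individuals.
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=15, n_informative=5, random_state=0)
rng = random.Random(0)
N, POP, GENS = X.shape[1], 20, 15              # arbitrary illustrative parameters

def fitness(mask):
    feats = [j for j in range(N) if mask[j]]
    if not feats:                              # empty subsets are worthless
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, feats], y, cv=3).mean()

def crossover(a, b):                           # one-point crossover
    cut = rng.randrange(1, N)
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.1):                    # bit-flip mutation
    return [bit ^ (rng.random() < rate) for bit in mask]

pop = [[rng.randint(0, 1) for _ in range(N)] for _ in range(POP)]
for _ in range(GENS):
    pop.sort(key=fitness, reverse=True)        # rank by classifier accuracy
    parents = pop[:POP // 2]                   # truncation selection
    children = [mutate(crossover(rng.choice(parents), rng.choice(parents)))
                for _ in range(POP - len(parents))]
    pop = parents + children

best = max(pop, key=fitness)
print("GA subset:", [j for j in range(N) if best[j]])
```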