Learning from Imbalanced Data

Prof. Haibo He

Electrical Engineering University of Rhode Island, Kingston, RI 02881

Computational Intelligence and Self-Adaptive Systems (CISA) Laboratory

http://www.ele.uri.edu/faculty/he/ Email: he@ele.uri.edu

These lecture notes are based on the following paper: H. He and E. A. Garcia, Learning from Imbalanced Data, IEEE Trans. Knowledge and Data Engineering, vol. 21, issue 9, pp. 1263-1284, 2009

Learning from Imbalanced Data

1. The problem: Imbalanced Learning

2. The solutions: State-of-the-art

3. The evaluation: Assessment Metrics

4. The future: Opportunities and Challenges

The Nature of Imbalanced Learning Problem

The Nature of Imbalance Learning

The Problem

Explosive availability of raw data

Well-developed algorithms for data analysis

Requirement?

Balanced distribution of data

Equal costs of misclassification

What about data in reality?

Imbalance is Everywhere

Between-class

Within-class

Intrinsic and extrinsic

Relativity and rarity

Imbalance and small sample size

Growing interest

Mammography Data Set: An example of between-class imbalance

The Nature of Imbalance Learning

                      Negative/healthy   Positive/cancerous
Number of cases       10,923             260
Category              Majority           Minority
Imbalanced accuracy   100%               0-10%

Imbalance can be on the order of 100 : 1 up to 10,000 : 1!
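The mammography counts show why plain accuracy is misleading under imbalance. The arithmetic can be sketched as follows; the always-negative classifier is a hypothetical baseline introduced here only for illustration, not something taken from the slides.

```python
# Class counts from the mammography example.
n_negative = 10_923  # majority: negative/healthy
n_positive = 260     # minority: positive/cancerous

# Between-class imbalance ratio (majority : minority).
imbalance_ratio = n_negative / n_positive

# Hypothetical baseline: a classifier that labels every case negative.
# It is right on every negative case and wrong on every positive case.
baseline_accuracy = n_negative / (n_negative + n_positive)
minority_recall = 0 / n_positive

print(f"imbalance ratio   ~ {imbalance_ratio:.1f} : 1")
print(f"baseline accuracy = {baseline_accuracy:.2%}")
print(f"minority recall   = {minority_recall:.0%}")
```

With these counts the ratio is about 42 : 1, and the baseline reaches roughly 97.7% overall accuracy while detecting no cancerous case at all.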

Intrinsic and extrinsic imbalance

Intrinsic: Imbalance due to the nature of the dataspace

Extrinsic: Imbalance due to time, storage, and other factors

Example: Data transmission over a specific interval of time with interruption

[Figure: x(t) versus time t, with t from 0 to 100 and x(t) from -2 to 2]

Data Complexity

The Nature of Imbalance Learning

Relative imbalance and absolute rarity

The minority class may be outnumbered without being rare in absolute terms

In that case the minority concept can still be learned accurately, with little disturbance from the imbalance


Data with high dimensionality and small sample size

Face recognition, gene expression

Challenges with small sample size:

1. Embedded absolute rarity and within-class imbalances

2. Failure of generalizing inductive rules by learning algorithms

Difficulty in forming a good classification decision boundary over more features but fewer samples

Risk of overfitting

Imbalanced data with small sample size

The Solutions to Imbalanced Learning Problem

Solutions to imbalanced learning

Sampling methods

Cost-sensitive methods

Kernel and Active Learning methods

Sampling methods

If data is imbalanced

Modify data distribution

Create balanced dataset

Create balance through sampling

Sampling methods

Random Sampling

S: training data set; S_min: set of minority class samples; S_maj: set of majority class samples; E: generated samples
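Using the notation above, random oversampling and random undersampling can be sketched as below. This is a minimal illustration under assumptions: the function names, the seed parameter, and the toy data are hypothetical, and E is taken to be a randomly selected set that is added to the minority class (oversampling) or removed from the majority class (undersampling).

```python
import random


def random_oversample(s_min, s_maj, n_extra, seed=0):
    """Append a set E of n_extra samples drawn with replacement from S_min."""
    rng = random.Random(seed)
    e = [rng.choice(s_min) for _ in range(n_extra)]
    return s_min + e, s_maj


def random_undersample(s_min, s_maj, n_remove, seed=0):
    """Remove a randomly chosen set E of n_remove samples from S_maj."""
    rng = random.Random(seed)
    removed = set(rng.sample(range(len(s_maj)), n_remove))
    kept = [x for i, x in enumerate(s_maj) if i not in removed]
    return s_min, kept


# Toy data: 5 minority and 50 majority samples (hypothetical).
s_min = [("min", i) for i in range(5)]
s_maj = [("maj", i) for i in range(50)]

over_min, over_maj = random_oversample(s_min, s_maj, n_extra=45)
under_min, under_maj = random_undersample(s_min, s_maj, n_remove=45)
print(len(over_min), len(over_maj))
print(len(under_min), len(under_maj))
```

Oversampling only replicates existing minority samples, while undersampling discards majority samples, so each route to balance has its own cost.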

Sampling methods

Informed Undersampling

EasyEnsemble

Unsupervised: use random subsets of the majority class to create balance and form multiple classifiers

BalanceCascade

Supervised: iteratively create balance and pull out redundant samples in majority class to form a final classifier
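The sampling step of EasyEnsemble can be sketched as follows. This is only an illustration under assumptions: the function name and toy data are hypothetical, each subset is drawn at the size of the minority class, and the training of one classifier per subset (and their combination) is left out.

```python
import random


def easy_ensemble_subsets(s_min, s_maj, n_subsets, seed=0):
    """Build n_subsets balanced training sets from random majority subsets."""
    rng = random.Random(seed)
    subsets = []
    for _ in range(n_subsets):
        # Draw |S_min| majority samples, without replacement within a subset.
        maj_subset = rng.sample(s_maj, len(s_min))
        # Every subset keeps all minority samples.
        subsets.append(s_min + maj_subset)
    return subsets


# Toy data: 4 minority and 40 majority samples (hypothetical).
s_min = [("min", i) for i in range(4)]
s_maj = [("maj", i) for i in range(40)]

subsets = easy_ensemble_subsets(s_min, s_maj, n_subsets=5)
for subset in subsets:
    n_min = sum(1 for label, _ in subset if label == "min")
    print(len(subset), n_min)
```

Because each subset sees a different part of the majority class, the ensemble as a whole uses more majority data than a single undersampled set would.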

Sampling methods

Informed Undersampling

Sampling methods

Synthetic Sampling with Data Generation

Synthetic minority oversampling technique (SMOTE)

Sampling methods

Synthetic Sampling with Data Generation
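A minimal sketch of the SMOTE interpolation step: a synthetic point is placed on the line segment between a minority sample and one of its K nearest minority-class neighbors, x_new = x + delta * (x_hat - x) with delta drawn from [0, 1). The function name, parameters, and toy data are assumptions made for illustration. As a simplification, this sketch picks the seed sample at random for each new point rather than generating a fixed number of points per minority sample.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = minority[:i] + minority[i + 1:]
        # K nearest minority-class neighbors of x (Euclidean distance).
        neighbors = sorted(others, key=lambda p: math.dist(x, p))[:k]
        x_hat = rng.choice(neighbors)
        delta = rng.random()  # random number in [0, 1)
        # x_new = x + delta * (x_hat - x): a point between x and x_hat.
        synthetic.append(tuple(a + delta * (b - a) for a, b in zip(x, x_hat)))
    return synthetic


# Toy 2-D minority class (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3), (0.9, 0.7), (1.3, 1.0)]
new_points = smote(minority, n_new=10, k=3)
for point in new_points:
    print(point)
```

Every synthetic point lies between two existing minority samples, which is what distinguishes this from random oversampling's exact replicas.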

Sampling methods

Adaptive Synthetic Sampling

Overcomes overgeneralization in the SMOTE algorithm

Borderline-SMOTE

Sampling methods

Adaptive Synthetic Sampling
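Borderline-SMOTE restricts synthetic generation to minority samples near the class boundary. The selection step can be sketched as below, assuming the rule described in the cited paper: a minority sample is a borderline ("DANGER") sample when at least half, but not all, of its m nearest neighbors belong to the majority class. The function name and toy data are hypothetical.

```python
import math


def danger_set(minority, majority, m=5):
    """Return minority points whose m nearest neighbors are mostly,
    but not entirely, majority-class points."""
    labeled = [(p, "min") for p in minority] + [(p, "maj") for p in majority]
    danger = []
    for idx, x in enumerate(minority):
        # Minority points come first in `labeled`, so idx is x's own position.
        others = labeled[:idx] + labeled[idx + 1:]
        nearest = sorted(others, key=lambda item: math.dist(x, item[0]))[:m]
        n_maj = sum(1 for _, label in nearest if label == "maj")
        # n_maj == m would mark x as noise; n_maj < m / 2 marks it as safe.
        if m / 2 <= n_maj < m:
            danger.append(x)
    return danger


# Toy data (hypothetical): a minority cluster near the origin plus two
# minority points sitting next to the majority class.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
            (2.0, 0.0), (2.0, 0.1)]
majority = [(2.1, 0.0), (2.2, 0.1), (1.9, 0.1), (3.0, 0.0), (3.1, 0.1), (3.0, 0.2)]

print(danger_set(minority, majority, m=3))
```

Only the two minority points adjacent to the majority class are selected; the cluster near the origin is treated as safe and receives no synthetic samples.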

Sampling methods

Adaptive Synthetic Sampling

Sampling methods

Sampling with Data Cleaning

Sampling methods

Sampling with Data Cleaning

Sampling methods

Cluster-based oversampling (CBO) method

Sampling methods

CBO Method

Sampling methods

Integration of Sampling and Boosting

Sampling methods

Integration of Sampling and Boosting

Cost-Sensitive Methods

Instead of modifying data

Considering the cost of misclassification

Utilize cost-sensitive methods for imbalanced learning

Cost-Sensitive methods

Cost-Sensitive Learning Framework

Cost-Sensi)ve methods
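A cost-sensitive decision can be sketched with a cost matrix C(i, j), read here as the cost of predicting class i when the true class is j, and the conditional risk R(i|x) = sum_j P(j|x) C(i, j). The cost values, posterior probabilities, and function name below are hypothetical and chosen only to illustrate the rule.

```python
def min_risk_class(posteriors, cost):
    """Pick the class i that minimizes R(i|x) = sum_j P(j|x) * C(i, j)."""
    classes = list(posteriors)
    risks = {
        i: sum(posteriors[j] * cost[(i, j)] for j in classes)
        for i in classes
    }
    return min(risks, key=risks.get), risks


# Hypothetical costs: missing a positive (predict "neg", truth "pos")
# is ten times as costly as a false alarm.
cost = {
    ("neg", "neg"): 0.0, ("neg", "pos"): 10.0,
    ("pos", "neg"): 1.0, ("pos", "pos"): 0.0,
}

# Hypothetical posterior estimates P(j|x) for one sample x.
posteriors = {"neg": 0.8, "pos": 0.2}

decision, risks = min_risk_class(posteriors, cost)
print(decision, risks)
```

Here the risk of predicting "neg" is 0.2 * 10 = 2.0 and the risk of predicting "pos" is 0.8 * 1 = 0.8, so the minimum-risk decision is "pos" even though "neg" is the more probable class.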

Cost-Sensitive methods

Cost-Sensitive Dataspace Weighting with Adaptive Boosting

Cost-Sensitive methods

Cost-Sensitive Dataspace Weighting with Adaptive Boosting
