
Page 1: Borderline Smote

Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning
Synthetic Minority Over-sampling Technique (SMOTE)

Presented by Hector Franco, TCD

Topic: machine learning, imbalanced data sets

Page 2: Borderline Smote

Content:

0. Basic concepts
1. Introduction
2. Recent developments
3. Algorithm description
4. Evaluation
5. Discussion

Page 3: Borderline Smote

Section 0: Basic concepts

Page 4: Borderline Smote

Why is this paper important for us?

• Multi-class problems become imbalanced when we compare one class against all the others.
• In some cases the data set is too small to generalize well.
• Text classification is an example of imbalanced data.
• It can be used with tree kernels.

Page 5: Borderline Smote

Effect of SMOTE and DEC (SDC)

[Figure: class distributions after DEC alone vs. after SMOTE and DEC.]

Page 6: Borderline Smote

SMOTE’s Informed Oversampling Procedure I, k = 3

[Figure legend: minority samples, synthetic samples, majority samples.]

Page 7: Borderline Smote

SMOTE: each synthetic example is created by interpolating between a minority sample and one of its k nearest minority-class neighbors.
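
Below is a minimal sketch of that interpolation step, assuming NumPy; the helper name smote_sample and the toy coordinates are illustrative, not taken from the paper.

    import numpy as np

    def smote_sample(x, neighbors, rng):
        """Create one synthetic sample between x and a random neighbor."""
        nn = neighbors[rng.integers(len(neighbors))]  # pick one of the k nearest minority neighbors
        gap = rng.random()                            # random number in [0, 1)
        return x + gap * (nn - x)                     # interpolate along the segment from x to nn

    rng = np.random.default_rng(0)
    x = np.array([1.0, 2.0])                                    # a minority sample
    neighbors = np.array([[1.5, 2.5], [0.5, 1.5], [1.2, 2.2]])  # its k = 3 nearest minority neighbors
    print(smote_sample(x, neighbors, rng))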

Page 8: Borderline Smote

Section 1: Introduction

Page 9: Borderline Smote

Introduction

By convention, the class with the smaller number of examples is called the minority, or positive, class.

Page 10: Borderline Smote

Section 2: Recent developments in imbalanced data sets learning

Page 11: Borderline Smote

Types of imbalance in data sets:

• Between-class imbalance (our focus).
• Within-class imbalance.

This is important in text classification. We focus on the minority class: we want high prediction accuracy for it. The two-class setting also covers multi-class problems (one against all).

Page 12: Borderline Smote

Evaluation Metrics in Imbalanced Domains

Overall accuracy is not very good in unbalanced data. A popular evaluation metric for imbalance problems is the F-value:

F-value = ((1 + β²) · Recall · Precision) / (β² · Recall + Precision)

Usually β = 1, as in this paper.
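
As a worked example of the formula, here is a small sketch computing the F-value from raw confusion counts; the counts are made up for illustration.

    def f_value(tp, fp, fn, beta=1.0):
        """F-value = ((1 + b^2) * recall * precision) / (b^2 * recall + precision)."""
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return (1 + beta**2) * recall * precision / (beta**2 * recall + precision)

    # e.g. 30 true positives, 10 false positives, 20 false negatives:
    print(f_value(30, 10, 20))  # precision 0.75, recall 0.60 -> F-value 0.666...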

Page 13: Borderline Smote

TP rate = TP / (TP + FN)
FP rate = FP / (FP + TN)

AUC: area under the ROC curve.
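
A hedged sketch of these three quantities, using scikit-learn's roc_auc_score; the labels and scores are invented for illustration.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    y_true = np.array([1, 1, 0, 0, 1, 0])               # 1 = minority (positive) class
    y_score = np.array([0.9, 0.4, 0.3, 0.2, 0.8, 0.6])  # classifier scores
    pred = y_score >= 0.5                                # threshold at 0.5

    tp = np.sum(pred & (y_true == 1))
    fn = np.sum(~pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    tn = np.sum(~pred & (y_true == 0))
    print("TP rate:", tp / (tp + fn))              # 2/3
    print("FP rate:", fp / (fp + tn))              # 1/3
    print("AUC:", roc_auc_score(y_true, y_score))  # threshold-free ranking quality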

Page 14: Borderline Smote

2.2 Dealing with imbalanced data sets

• Data level: change the distribution to make the data balanced.
• Algorithm level: modify existing data mining algorithms, or design new ones.

Page 15: Borderline Smote

2.2.1 Methods at the data level: re-sampling methods

• Random over-sampling: duplicate minority examples (see the sketch below).
• Random under-sampling: can remove important data.
• Remove noise.
• SMOTE.
• Combine under-sampling and over-sampling.
• Find the hard examples and over-sample them.
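
A minimal sketch of the two random re-sampling methods above, assuming an index-based NumPy view of the two classes; the class sizes are made up.

    import numpy as np

    rng = np.random.default_rng(0)
    minority_idx = np.arange(0, 20)     # 20 minority samples
    majority_idx = np.arange(20, 120)   # 100 majority samples

    # Random over-sampling: duplicate minority indices until the classes balance.
    over = rng.choice(minority_idx, size=len(majority_idx), replace=True)

    # Random under-sampling: drop majority indices (this can remove important data).
    under = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    print(len(over), len(under))  # 100 20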

Page 16: Borderline Smote

2.2.2 Methods at the Algorithm Level:

• AdaBoost (increases the weights of misclassified examples) does not perform well on imbalanced data sets on its own; updating the weights based on TP and FP works better than weighting predictions based on TP and FP (see the sketch below).
• Use an SVM kernel.
• Use a BMPM: Biased Minimax Probability Machine.
• There are other cost-based learning methods.
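
For the AdaBoost remark above, here is a hedged sketch of the standard weight update that increases the weights of misclassified examples; the labels and predictions are invented, and this is the generic update, not the improved variant the slide alludes to.

    import numpy as np

    w = np.full(6, 1 / 6)                   # uniform initial example weights
    y = np.array([1, 1, 1, -1, -1, -1])     # true labels
    pred = np.array([1, -1, 1, -1, -1, 1])  # weak learner's predictions
    err = np.sum(w[pred != y])              # weighted error (here 1/3)
    alpha = 0.5 * np.log((1 - err) / err)   # weak learner's vote weight
    w *= np.exp(-alpha * y * pred)          # up-weight mistakes, down-weight hits
    w /= w.sum()                            # renormalize to a distribution
    print(w)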

Page 17: Borderline Smote

Section 3: A New Over-Sampling Method: Borderline-SMOTE

Page 18: Borderline Smote

SMOTE:

Algorithms usually try to learn the borderline as exactly as possible.

Page 19: Borderline Smote

New over-sampling methods:

• Borderline-SMOTE1
• Borderline-SMOTE2

Page 20: Borderline Smote

Borderline-SMOTE1 algorithm

Page 21: Borderline Smote

Borderline-SMOTE1 algorithm (continued)
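
A condensed sketch of the Borderline-SMOTE1 steps, assuming scikit-learn's NearestNeighbors; the parameter names m, k, and s follow the paper, while the function itself is an illustrative reconstruction, not the authors' code.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def borderline_smote1(X, y, minority=1, m=5, k=5, s=2, seed=0):
        rng = np.random.default_rng(seed)
        X_min = X[y == minority]

        # Step 1: m nearest neighbors of each minority sample in the whole set
        # (n_neighbors = m + 1 because each sample is its own nearest neighbor).
        _, idx = NearestNeighbors(n_neighbors=m + 1).fit(X).kneighbors(X_min)
        maj_count = (y[idx[:, 1:]] != minority).sum(axis=1)

        # Step 2: DANGER = borderline samples, i.e. more than half, but not
        # all, of the m neighbors belong to the majority class.
        danger = X_min[(maj_count >= m / 2) & (maj_count < m)]
        if len(danger) == 0:
            return np.empty((0, X.shape[1]))

        # Step 3: SMOTE interpolation between each DANGER sample and s of its
        # k nearest minority-class neighbors.
        _, min_idx = NearestNeighbors(n_neighbors=k + 1).fit(X_min).kneighbors(danger)
        synthetic = []
        for i, p in enumerate(danger):
            for j in rng.choice(min_idx[i, 1:], size=s, replace=False):
                gap = rng.random()  # in [0, 1), as in plain SMOTE
                synthetic.append(p + gap * (X_min[j] - p))
        return np.array(synthetic)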

Page 22: Borderline Smote

Borderline-SMOTE2

Borderline-SMOTE2 additionally interpolates toward nearest neighbors from the majority class. For those, the random numbers are drawn between 0 and 0.5, so the synthetic examples lie closer to the minority class.
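
A tiny self-contained sketch of that difference; p and nn_maj are an invented DANGER sample and one of its nearest majority-class neighbors.

    import numpy as np

    rng = np.random.default_rng(0)
    p = np.array([1.0, 2.0])        # a DANGER minority sample
    nn_maj = np.array([3.0, 4.0])   # one of its nearest majority-class neighbors
    gap = rng.random() * 0.5        # Borderline-SMOTE2: gap in [0, 0.5)
    print(p + gap * (nn_maj - p))   # lands on the minority half of the segment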

Page 23: Borderline Smote

Circle data set (artificial)

Page 24: Borderline Smote

DANGER samples:

Page 25: Borderline Smote

Borderline-SMOTE1 synthetic samples:

Page 26: Borderline Smote

Section 4: Experiments

Page 27: Borderline Smote

Data sets:

Page 28: Borderline Smote

Methods

Compared methods:
• Nothing (baseline)
• SMOTE
• Random over-sampling
• Borderline-SMOTE1
• Borderline-SMOTE2

Setup: k = 5, 10-fold cross-validation, C4.5 classifier. We only want to improve the prediction of the minority class (see the sketch below).
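
A hedged sketch of this protocol, using scikit-learn's decision tree (CART) as a stand-in for C4.5 and a synthetic imbalanced data set, since neither the paper's data files nor a C4.5 implementation ship with scikit-learn.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import make_scorer, f1_score
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Imbalanced toy data: roughly 10% minority class (label 1).
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    # Score only the minority class, as in the paper (F-value, beta = 1).
    scorer = make_scorer(f1_score, pos_label=1)
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                             scoring=scorer, cv=cv)
    print(scores.mean())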

Page 29: Borderline Smote

circle

Page 30: Borderline Smote

pima

Page 31: Borderline Smote

satimage

Page 32: Borderline Smote

haberman

Page 33: Borderline Smote

Section 5: Conclusion

Page 34: Borderline Smote

Conclusion

• Working with imbalanced data sets is a common problem.
• Borderline examples are more easily misclassified.
• Our methods perform better than traditional SMOTE.
• Open research questions:
◦ How to define DANGER examples.
◦ How to determine the number of examples in DANGER.
◦ How to combine with other data mining algorithms.

Page 35: Borderline Smote

Thank you for your time

Page 36: Borderline Smote

Creative Commons licence

You are free:
• to copy, distribute, display, and perform the work
• to make derivative works

Under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this licence impairs or restricts the author's moral rights.