
Class Imbalance in Text Classification

Project ID: 08

Elham Jebalbarezi, Nedjma Ousidhoum


Outline

• Class Imbalance
• Algorithms for Class Imbalance
• Text Classification
• Feature Selection for Text Classification
• Experiments
• Results
• Discussion


The Class Imbalance Problem (1)

• A common problem in machine learning: almost all instances belong to one majority class, and the rest belong to the minority class.
• Imbalance Level = |Majority Class| / |Minority Class|. It can be huge (on the order of 10^6).
• Applications: detecting oil spills, text classification, fraud detection, and many medical applications such as automatic diagnosis.
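The imbalance level defined above is just the ratio of class counts; a minimal sketch (the `imbalance_level` helper is an illustrative name, not from the slides):

```python
from collections import Counter

def imbalance_level(labels):
    """Ratio of the majority-class count to the minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# 9 majority instances vs. 1 minority instance -> imbalance level 9.0
labels = ["maj"] * 9 + ["min"]
print(imbalance_level(labels))  # 9.0
```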


The Class Imbalance Problem (2)

• Many classification algorithms are sensitive to an imbalanced class distribution.
• Class imbalance is taken into account in the design of new classifiers.
• Solutions: cost-sensitive learning, data resampling, feature selection.


Cost-Sensitive Algorithms

• Penalties are assigned to the mistakes made by the classification algorithm.
• Different, asymmetric misclassification costs are assigned to the classes: the penalty is higher when the mistake is made on the minority class, to emphasize the correct classification of minority instances.
• Cost-sensitive learning does not modify the class distribution.
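A minimal sketch of the idea; the function names and the inverse-frequency cost scheme below are illustrative assumptions, not the specific method from the slides:

```python
from collections import Counter

def class_weights(labels):
    """Weights inversely proportional to class frequency, so mistakes
    on the minority class are penalized more heavily."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: total / n for c, n in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Cost-sensitive error: each mistake costs the weight of its true class."""
    return sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)

y_true = ["neg"] * 8 + ["pos"] * 2
w = class_weights(y_true)  # {'neg': 1.25, 'pos': 5.0}
# One mistake on each class: the minority mistake costs 4x more.
y_pred = ["pos"] + ["neg"] * 7 + ["neg", "pos"]
print(weighted_error(y_true, y_pred, w))  # 6.25
```

Note that only the loss changes; the training data itself is left untouched, which is exactly the contrast with resampling on the next slides.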


Data Resampling

• Learning instances in the majority and minority classes are manipulated in order to balance the class distribution.
• Effective, but may introduce noise or remove useful information.


Data Resampling: Oversampling

• Duplicates minority-class instances so that they have more effect on the machine learning algorithm.
• Can be effective, but is prone to overfitting.
• Variants: SMOTE (Synthetic Minority Oversampling Technique), MSMOTE (Modified SMOTE), …
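SMOTE's core idea is to synthesize new minority points by interpolating between a minority sample and one of its k nearest minority neighbours. A simplified illustrative sketch (not a reference implementation):

```python
import math
import random

def smote(minority, n_new, k=2, rng=None):
    """Sketch of SMOTE: create synthetic minority points by interpolating
    between a minority sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(len(smote(minority, n_new=4)))  # 4
```

Because the new points are interpolated rather than duplicated, SMOTE tends to overfit less than plain random oversampling.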


Data Resampling: Undersampling

• Uses a subset of the majority class to train the classifier.
• Many majority-class examples are ignored, so the training set becomes more balanced and training becomes faster.
• Effective, but may discard useful information.
• There are variants of undersampling, e.g. one-sided undersampling.
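Random undersampling can be sketched in a few lines; the helper below is illustrative and assumes a fully balanced target (equal class sizes):

```python
import random

def random_undersample(majority, minority, rng=None):
    """Keep all minority instances plus a random majority subset of equal
    size, so the resulting training set is balanced."""
    rng = rng or random.Random(0)
    kept = rng.sample(majority, len(minority))
    return kept + minority

maj = list(range(100))      # 100 majority instances
mino = ["m1", "m2", "m3"]   # 3 minority instances
balanced = random_undersample(maj, mino)
print(len(balanced))  # 6
```

Here 97 majority examples are simply discarded, which is the "may discard useful information" risk noted above.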


Bagging/Boosting

• Bootstrapping is random sampling with replacement.
• Bagging aggregates classifiers induced over independently drawn bootstrap samples.
• Boosting focuses on difficult samples by giving them higher weights.
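Bootstrapping and bagging can be sketched with a toy base learner that always predicts the most frequent label it saw; all names below are illustrative:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Bootstrapping: random sampling with replacement, same size as the data."""
    return [rng.choice(data) for _ in data]

def train_majority(sample):
    """Toy base learner: always predicts the most frequent label in its sample."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label

def bagged_predict(data, x, n_models=5, seed=0):
    """Bagging: train one model per bootstrap sample, aggregate by majority vote."""
    rng = random.Random(seed)
    votes = [train_majority(bootstrap_sample(data, rng))(x)
             for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]

data = [(i, "a") for i in range(19)] + [(19, "b")]
print(bagged_predict(data, x=0))
```

Note how this toy example also shows why plain bagging inherits the imbalance problem: every bootstrap sample is dominated by the majority class, so the ensemble almost always votes for it.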


Feature Selection

• Feature selection can improve the performance of naive Bayes and regularized logistic regression on imbalanced data.
• The challenges of feature selection and imbalanced-data classification meet when the dataset to be analyzed is high-dimensional and has a highly imbalanced class distribution.


Text Classification

• Sorting natural language texts or documents into predefined categories based on their content.

• Applications: automatic indexing, document organization, text filtering, hierarchical categorization of web pages, spam filtering, …
• Class imbalance is common in text classification.


Feature Selection in Text Classification

• Commonly used in text classification because it can improve classification performance.
• Features are selected using different metrics (TF, chi-square, information gain) for a nearly optimal classification.
• We can use positive and/or negative features.
• Combining positive and negative features might be useful.
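Of the metrics listed above, term frequency is the simplest: rank terms by how often they occur in the corpus and keep the top k. A minimal illustrative sketch:

```python
from collections import Counter

def select_features_by_tf(documents, k):
    """Rank terms by total frequency across the corpus and keep the top k."""
    tf = Counter()
    for doc in documents:
        tf.update(doc.split())
    return [term for term, _ in tf.most_common(k)]

docs = ["oil spill oil rig", "oil price news", "spam spam offer"]
print(select_features_by_tf(docs, 2))  # ['oil', 'spam']
```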


Experiments

• We implemented random oversampling, random undersampling, SMOTE, MSMOTE, and one-sided undersampling.
• Our approach: we combined feature selection and resampling by
  1. calculating term frequency (TF),
  2. applying a resampling algorithm.
• Dataset: Reuters-21578.
• Chosen evaluation metrics:
  precision = tp / (tp + fp), recall = tp / (tp + fn), F-measure = 2 · precision · recall / (precision + recall)
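The three metrics follow directly from the raw counts; a small worked example with hypothetical counts:

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from true/false positive/negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# 3 true positives, 1 false positive, 3 false negatives
p, r, f = prf(tp=3, fp=1, fn=3)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.75 0.5 0.6
```

These metrics are the right choice for imbalanced data: plain accuracy would look excellent for a classifier that simply ignores the minority class.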


Experiments: Data
Experiments: Random Oversampling
Experiments: SMOTE
Experiments: MSMOTE
Experiments: Random Undersampling
Experiments: One-sided Undersampling

(These slides showed figures that are not reproduced in this transcript.)

Results (1)

Method                     Precision   Recall    F-measure
Without sampling           0.03191     0.14285   0.05217
Random oversampling        0.09259     0.23809   0.13333
Random undersampling       0.14772     0.61904   0.2385
One-sided undersampling    0.15957     0.71428   0.26086
SMOTE                      0.0909      0.2380    0.1315
MSMOTE                     0.0434      0.09523   0.0597

No feature selection


Results (2)

Method                     Precision   Recall    F-measure
Without sampling           1           0.0476    0.0909
Random oversampling        0.6111      0.5238    0.5641
Random undersampling       0.0884      0.6190    0.1547
One-sided undersampling    0.0851      0.7619    0.1531
SMOTE                      0.5         0.5238    0.5116
MSMOTE                     0.5384      0.3333    0.4117

100 features selected after using TF


Results (3)

Method                     Precision   Recall    F-measure
Without sampling           0.0476      0.0476    0.0476
Random oversampling        0.1777      0.38095   0.2424
Random undersampling       0.0937      0.2857    0.1411
One-sided undersampling    0.1666      0.5238    0.2528
SMOTE                      0.16666     0.3809    0.2318
MSMOTE                     0.4         0.5714    0.4705

500 features selected after using TF


Discussion

• Feature selection improves the results of oversampling.
• Feature selection also improves the recall of undersampling.

• Adding more features does not always improve the results.


Thank you!