
Class Imbalance in Text Classification

Project ID: 08

Elham Jebalbarezi, Nedjma Ousidhoum


Outline

• Class Imbalance

• Algorithms for Class Imbalance

• Text Classification

• Feature Selection for Text Classification

• Experiments

• Results

• Discussion


The Class Imbalance Problem (1)

• A common problem in machine learning: almost all instances belong to one majority class and the rest belong to the minority class.

• Imbalance level = |majority class| / |minority class|. It can be huge (on the order of 10^6); see the small sketch below this list.

• Applications: detecting oil spills, text classification, fraud detection, and many medical applications such as automatic diagnosis.
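
A minimal sketch of the imbalance level computed from label counts; the counts below are invented for illustration:

from collections import Counter

# Hypothetical label list: 0 = majority class, 1 = minority class
# (the counts are made up for illustration).
labels = [0] * 9500 + [1] * 500

counts = Counter(labels)
imbalance_level = max(counts.values()) / min(counts.values())
print(imbalance_level)  # 19.0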


The Class Imbalance Problem (2)

• Many classification algorithms are sensitive to the imbalanced class distribution

• Class imbalance is taken into account in the design of new classifiers

• Solutions: cost-sensitive learning, data resampling, feature selection.


Cost-Sensitive Algorithms

• Penalties are assigned to the mistakes made by the classification algorithm.

• Different, asymmetric misclassification costs are assigned to the classes. The penalty is higher when the mistake is made on the minority class, to emphasize the correct classification of minority instances.

• Cost-sensitive learning does not modify the class distribution (a minimal sketch follows below).
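
A minimal sketch of cost-sensitive learning, assuming scikit-learn is available; the toy data and the 1:10 cost ratio are illustrative assumptions, not values from the slides:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced data: roughly 95% majority (label 0), 5% minority (label 1).
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Misclassifying a minority instance costs ten times more than a majority
# instance; the class distribution itself is left unchanged.
clf = LogisticRegression(class_weight={0: 1, 1: 10}, max_iter=1000)
clf.fit(X, y)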


Data Resampling

• Learning instances in the majority class and minority class are manipulated in order to balance the class distribution.

• Effective but may introduce noise or remove useful information.


Data Resampling: Oversampling

• Duplicates minority-class instances so that they have more influence on the learning algorithm.

• Can be effective, but is prone to overfitting.

• Variants: SMOTE (Synthetic Minority Oversampling Technique), MSMOTE (Modified SMOTE), … (see the sketch below).
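
A sketch of random oversampling and SMOTE, assuming the imbalanced-learn package is installed; the toy data is illustrative:

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy imbalanced data for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Random oversampling: duplicate minority instances until the classes balance.
X_ros, y_ros = RandomOverSampler(random_state=0).fit_resample(X, y)

# SMOTE: synthesize new minority instances by interpolating between
# a minority instance and its nearest minority neighbors.
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)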


Data Resampling: Undersampling

• Using a subset of the majority class to train the classifier.

• Many majority class examples are ignored so that the training set becomes more balanced and the training process becomes faster.

• Effective, but may discard useful information.

• There are variants of undersampling, e.g. one-sided undersampling (see the sketch below).
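
A sketch of random and one-sided undersampling, again assuming imbalanced-learn; toy data only:

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler, OneSidedSelection

# Toy imbalanced data for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Random undersampling: train on a subset of the majority class.
X_rus, y_rus = RandomUnderSampler(random_state=0).fit_resample(X, y)

# One-sided selection: remove redundant and borderline majority instances
# instead of dropping them uniformly at random.
X_oss, y_oss = OneSidedSelection(random_state=0).fit_resample(X, y)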


Bagging/Boosting

• Bootstrapping is random sampling with replacement.

• Bagging aggregates classifiers induced over independently drawn bootstrap samples.

• Boosting focuses on difficult samples by giving them higher weights in later rounds (a sketch of both follows below).
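
A sketch of bagging and boosting with scikit-learn; the estimator choices and parameters are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Bagging: aggregate trees trained on independently drawn bootstrap samples.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)
bagging.fit(X, y)

# Boosting: each round increases the weights of the samples the previous
# classifiers got wrong, so later learners focus on the difficult ones.
boosting = AdaBoostClassifier(n_estimators=50, random_state=0)
boosting.fit(X, y)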


Feature Selection

• Feature selection is able to improve the performance of naive Bayes and regularized logistic regression on imbalanced data.

• The challenges of feature selection and imbalanced-data classification meet when the dataset to be analyzed is high-dimensional and has a highly imbalanced class distribution.


Text Classification

• Sorting natural language texts or documents into predefined categories based on their content.

• Applications: automatic indexing, document organization, text filtering, hierarchical categorization of web pages, spam filtering, …

• Class imbalance is common in text classification.



Feature Selection in Text Classification

• Common in text classification because it can improve classification performance.

• Features are selected using different metrics (term frequency, chi-square, information gain) to obtain a nearly optimal classification.

• We can use positive and/or negative features.

• Combining positive and negative features might be useful (a feature-selection sketch follows below).
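
A sketch of metric-based feature selection on text, assuming scikit-learn; the tiny corpus and the value of k are made up for illustration, and chi-square is used here as one of the metrics listed above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Tiny hypothetical corpus with an imbalanced label distribution.
docs = ["grain exports rose", "grain prices fell", "oil spill detected",
        "grain harvest report", "grain futures steady"]
labels = [0, 0, 1, 0, 0]

# Bag-of-words term frequencies, then keep the k terms that score highest
# under the chi-square test against the labels.
X = CountVectorizer().fit_transform(docs)
X_selected = SelectKBest(chi2, k=3).fit_transform(X, labels)
print(X_selected.shape)  # (5, 3)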


Experiments

• We implemented random oversampling, random undersampling, SMOTE, MSMOTE, and one-sided undersampling.

• Our approach: we combined feature selection and resampling by
  1. calculating term frequency (TF), and
  2. applying a resampling algorithm.
  (A sketch of this pipeline follows below.)

• Dataset: Reuters-21578.

• Chosen evaluation metrics:
  precision = TP / (TP + FP), recall = TP / (TP + FN), F-measure = 2 · precision · recall / (precision + recall).
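
A sketch of the pipeline described above: TF-based feature selection followed by resampling, evaluated with precision, recall, and F-measure. It assumes scikit-learn and imbalanced-learn; the documents, the number of features, and the classifier are illustrative assumptions rather than the exact setup used on Reuters-21578:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import RandomOverSampler

# Hypothetical stand-ins for one Reuters-21578 category vs. the rest.
train_docs = ["wheat crop up", "oil prices rise", "wheat exports fall"]
train_y = [1, 0, 1]
test_docs = ["wheat harvest news", "oil output steady"]
test_y = [1, 0]

# 1. Term-frequency features, keeping only the most frequent terms.
vec = CountVectorizer(max_features=100).fit(train_docs)
X_train, X_test = vec.transform(train_docs), vec.transform(test_docs)

# 2. Resample the training set (random oversampling here).
X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X_train, train_y)

# 3. Train a classifier and evaluate with the chosen metrics.
pred = MultinomialNB().fit(X_res, y_res).predict(X_test)
print(precision_score(test_y, pred), recall_score(test_y, pred), f1_score(test_y, pred))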


Experiments: Data


Experiments: Random Oversampling


Experiments: SMOTE


Experiments: MSMOTE


Experiments: Random Undersampling


Experiments: One-sided Undersampling


Results (1)

Method                    Precision   Recall    F-measure
Without sampling          0.03191     0.14285   0.05217
Random oversampling       0.09259     0.23809   0.13333
Random undersampling      0.14772     0.61904   0.2385
One-sided undersampling   0.15957     0.71428   0.26086
SMOTE                     0.0909      0.2380    0.1315
MSMOTE                    0.0434      0.09523   0.0597

(No feature selection)


Results (2)

Method                    Precision   Recall    F-measure
Without sampling          1           0.0476    0.0909
Random oversampling       0.6111      0.5238    0.5641
Random undersampling      0.0884      0.6190    0.1547
One-sided undersampling   0.0851      0.7619    0.1531
SMOTE                     0.5         0.5238    0.5116
MSMOTE                    0.5384      0.3333    0.4117

(100 features selected after using TF)


Results (3)

Method                    Precision   Recall    F-measure
Without sampling          0.0476      0.0476    0.0476
Random oversampling       0.1777      0.38095   0.2424
Random undersampling      0.0937      0.2857    0.1411
One-sided undersampling   0.1666      0.5238    0.2528
SMOTE                     0.16666     0.3809    0.2318
MSMOTE                    0.4         0.5714    0.4705

(500 features selected after using TF)


Discussion

• Feature selection improves oversampling.

• Feature selection also improves undersampling recall.

• Adding more features does not always improve the results.


Thank you!