Transcript
Page 1: Text Categorization - cs.virginia.edu

Text Categorization

Hongning Wang, CS@UVa

Page 2: Text Categorization - cs.virginia.edu

Today’s lecture

• Bayes decision theory
• Supervised text categorization
  – General steps for text categorization
  – Feature selection methods
  – Evaluation metrics

CS@UVa CS 6501: Text Mining 2

Page 3: Text Categorization - cs.virginia.edu

Text mining in general


[Diagram: the text mining landscape. Access (filter information), Mining (discover knowledge), Organization (add structure/annotations); serves IR applications, is based on NLP/ML techniques, and is a sub-area of DM research]

Page 4: Text Categorization - cs.virginia.edu

Applications of text categorization

• Automatically distinguishing political news from sports news

[Figure: example news documents labeled "political" and "sports"]

Page 5: Text Categorization - cs.virginia.edu

Applications of text categorization

• Recognizing spam emails

[Figure: example email classified as Spam = True/False]

Page 6: Text Categorization - cs.virginia.edu

Applications of text categorization

• Sentiment analysis

Page 7: Text Categorization - cs.virginia.edu

Basic notions about categorization

• Data points/instances
  – X: an m-dimensional feature vector (vector space representation)
• Labels
  – y: a categorical value from {1, ..., k}
• Classification hyper-plane
  – f(X) → y

Key question: how to find such a mapping?

Page 8: Text Categorization - cs.virginia.edu

Bayes decision theory

• If we know p(y) and p(X|y), the Bayes decision rule is

  ŷ = argmax_y p(y|X) = argmax_y p(X|y) p(y) / p(X)

  (p(X) is constant with respect to y, so it can be dropped)

  – Example in binary classification:
    • ŷ = 1, if p(X|y=1) p(y=1) > p(X|y=0) p(y=0)
    • ŷ = 0, otherwise

• This leads to the optimal classification result
  – Optimal in the sense of 'risk' minimization
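The rule above can be sketched in a few lines of Python (the priors and likelihoods below are toy numbers, not from the slides):

```python
def bayes_decide(p_x_given_y, p_y):
    """Bayes decision rule: pick argmax_y p(X|y) * p(y).

    p(X) is constant with respect to y, so it can be dropped."""
    return max(p_y, key=lambda y: p_x_given_y[y] * p_y[y])

# Hypothetical likelihoods of one observed document X under each class:
p_y = {0: 0.7, 1: 0.3}            # class priors
p_x_given_y = {0: 0.1, 1: 0.6}    # class-conditional likelihoods of X
print(bayes_decide(p_x_given_y, p_y))  # 0.6*0.3 = 0.18 > 0.1*0.7 = 0.07, prints 1
```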

Page 9: Text Categorization - cs.virginia.edu

Bayes risk

• Risk – assigning an instance to a wrong class
  – Type I error (false positive): (y* = 0, ŷ = 1)
    • p(X|y=0) p(y=0)
  – Type II error (false negative): (y* = 1, ŷ = 0)
    • p(X|y=1) p(y=1)
  – Risk under the Bayes decision rule:
    • r(X) = min{p(X|y=1) p(y=1), p(X|y=0) p(y=0)}
  – It can determine a 'reject region'

Page 10: Text Categorization - cs.virginia.edu

Bayes risk

• Risk – assigning an instance to a wrong class

[Figure: the two class densities p(X|y=1)p(y=1) and p(X|y=0)p(y=0) plotted over X; decision regions ŷ=0 and ŷ=1, with the false positive and false negative areas around the Bayes decision boundary]

Page 11: Text Categorization - cs.virginia.edu

Bayes risk

• Risk – assigning an instance to a wrong class

[Figure: the same two class densities, but with the decision boundary shifted away from the optimum; the shaded region shows the increased error]

Page 12: Text Categorization - cs.virginia.edu

Bayes risk

• Risk – assigning an instance to a wrong class

[Figure: the same two class densities with the *optimal* Bayes decision boundary, which minimizes the total false positive plus false negative area]

Page 13: Text Categorization - cs.virginia.edu

Bayes risk

• Expected risk

  E[r(X)] = ∫ min{p(x|y=1) p(y=1), p(x|y=0) p(y=0)} dx
          = p(y=1) ∫_{R0} p(x|y=1) dx + p(y=0) ∫_{R1} p(x|y=0) dx

  where R0 is the region in which we assign x to class 0, and R1 the region in which we assign x to class 1.

Will the error of assigning c1 to c0 always equal the error of assigning c0 to c1?

Page 14: Text Categorization - cs.virginia.edu

Loss function

• The penalty we pay when misclassifying instances

• Goal of classification in general
  – Minimize the expected loss

  E[L] = L_{1,0} p(y=1) ∫_{R0} p(x|y=1) dx + L_{0,1} p(y=0) ∫_{R1} p(x|y=0) dx

  – L_{1,0}: penalty for misclassifying c1 as c0; L_{0,1}: penalty for misclassifying c0 as c1
  – R0/R1: regions in which we assign x to class 0/1

Will the Bayes decision rule still be optimal?
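With asymmetric losses the decision rule shifts accordingly: predict ŷ = 1 only when the expected loss of predicting 0 exceeds that of predicting 1. A minimal sketch with hypothetical numbers:

```python
def loss_sensitive_decide(p_x_given_y, p_y, L10, L01):
    """Predict 1 iff L_{1,0} p(X|y=1) p(y=1) > L_{0,1} p(X|y=0) p(y=0)."""
    risk_if_0 = L10 * p_x_given_y[1] * p_y[1]  # loss of predicting 0 when truth is 1
    risk_if_1 = L01 * p_x_given_y[0] * p_y[0]  # loss of predicting 1 when truth is 0
    return 1 if risk_if_0 > risk_if_1 else 0

p_y = {0: 0.5, 1: 0.5}
p_x_given_y = {0: 0.2, 1: 0.5}
print(loss_sensitive_decide(p_x_given_y, p_y, 1, 1))  # equal losses: plain Bayes rule, prints 1
print(loss_sensitive_decide(p_x_given_y, p_y, 1, 5))  # false positives 5x costlier, prints 0
```

With L_{1,0} = L_{0,1} this reduces to the plain Bayes decision rule; raising one penalty moves the boundary toward the cheaper kind of mistake.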

Page 15: Text Categorization - cs.virginia.edu

Supervised text categorization

• Supervised learning
  – Estimate a model/method from labeled data
  – It can then be used to determine the labels of unobserved samples

  Training: {(x_i, y_i)}_{i=1}^n  →  Classifier f(x, θ) → y
  Testing:  {(x_i, ?)}_{i=1}^m   →  {(x_i, ŷ_i)}_{i=1}^m

[Figure: labeled documents (Sports, Business, Education, Science, ...) feed the training stage; the learned classifier then assigns labels to unlabeled test documents]

Page 16: Text Categorization - cs.virginia.edu

Type of classification methods

• Model-less
  – Instance-based classifiers
    • Use the observations directly
    • E.g., kNN

  Training: store {(x_i, y_i)}_{i=1}^n
  Testing:  label {(x_i, ?)}_{i=1}^m by instance lookup → {(x_i, ŷ_i)}_{i=1}^m

Key: assuming similar items have similar class labels!
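The instance-lookup idea can be sketched as a tiny kNN classifier (toy 2-D vectors and made-up labels; `math.dist` is Euclidean distance, Python 3.8+):

```python
import math
from collections import Counter

def knn_predict(train, x, k=3):
    """train: list of (feature_vector, label) pairs; classify x by majority
    vote among its k nearest training instances (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0, 0), "sports"), ((0, 1), "sports"),
         ((5, 5), "business"), ((6, 5), "business"), ((5, 6), "business")]
print(knn_predict(train, (5.5, 5.5)))  # prints business
```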

Page 17: Text Categorization - cs.virginia.edu

Type of classification methods

• Model-based
  – Generative models
    • Model the joint probability p(x, y)
    • E.g., Naïve Bayes
  – Discriminative models
    • Directly estimate a decision rule/boundary
    • E.g., SVM

  Training: {(x_i, y_i)}_{i=1}^n  →  Classifier f(x, θ) → y
  Testing:  {(x_i, ?)}_{i=1}^m   →  {(x_i, ŷ_i)}_{i=1}^m

Key: i.i.d. assumption!

Page 18: Text Categorization - cs.virginia.edu

Generative vs. discriminative models

• Binary classification as an example

[Figure: the generative model's view (class-conditional densities) vs. the discriminative model's view (decision boundary)]

Page 19: Text Categorization - cs.virginia.edu

Generative vs. discriminative models

Generative
• Specifies the joint distribution
  – Full probabilistic specification of all the random variables
• Dependence assumptions have to be specified for p(x|y) and p(y)
• Flexible; can also be used in unsupervised learning

Discriminative
• Specifies the conditional distribution
  – Only explains the target variable
• Arbitrary features can be incorporated for modeling p(y|x)
• Needs labeled data; only suitable for (semi-) supervised learning

Page 20: Text Categorization - cs.virginia.edu

General steps for text categorization

1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation

[Figure: example documents labeled Political News, Sports News, Entertainment News]

Page 21: Text Categorization - cs.virginia.edu

General steps for text categorization

1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation

Consider:
1.1 How to represent the text documents?
1.2 Do we need all those features?

Page 22: Text Categorization - cs.virginia.edu

Feature construction for text categorization

• Vector space representation
  – Standard procedure in document representation
  – Features
    • N-grams, POS tags, named entities, topics
  – Feature values
    • Binary (presence/absence)
    • TF-IDF (many variants)

Page 23: Text Categorization - cs.virginia.edu

Recall MP1

• How many unigrams+bigrams are there in our controlled vocabulary?
  – 130K on Yelp_small
• How many review documents do we have for training?
  – 629K on Yelp_small

A very sparse feature representation!

Page 24: Text Categorization - cs.virginia.edu

Feature selection for text categorization

• Select the most informative features for model training
  – Reduce noise in the feature representation
    • Improves final classification performance
  – Improve training/testing efficiency
    • Lower time complexity
    • Less training data needed

Page 25: Text Categorization - cs.virginia.edu

Feature selection methods

• Wrapper method
  – Find the best subset of features for a particular classification method, evaluated with the same classifier

R. Kohavi, G.H. John / Artificial Intelligence 97 (1997) 273-324

Page 26: Text Categorization - cs.virginia.edu

R. Kohavi, G.H. John / Artificial Intelligence 97 (1997) 273-324

Feature selection methods

• Wrapper method
  – Search the whole space of feature subsets
    • Sequential forward selection or genetic search can speed up the search

Page 27: Text Categorization - cs.virginia.edu

Feature selection methods

• Wrapper method
  – Considers all possible dependencies among the features
  – Impractical for text categorization
    • Cannot deal with large feature sets
    • An NP-complete problem
  – No direct relation between feature subset selection and evaluation

Page 28: Text Categorization - cs.virginia.edu

Feature selection methods

• Filter method
  – Evaluate each feature independently of the classifier and the other features
    • No indication of a classifier's performance on the selected features
    • No dependency among the features is considered
  – Feasible for very large feature sets
    • Usually used as a preprocessing step

R. Kohavi, G.H. John / Artificial Intelligence 97 (1997) 273-324

Page 29: Text Categorization - cs.virginia.edu

Feature scoring metrics

• Document frequency
  – Rare words are non-influential for global prediction; removing them reduces vocabulary size

[Figure: word frequency distribution, annotated "remove head words" and "remove rare words"]

Page 30: Text Categorization - cs.virginia.edu

Feature scoring metrics

• Information gain
  – Decrease in entropy of the categorical prediction when the feature is present vs. absent

[Figure: when the feature is informative, class uncertainty decreases; otherwise class uncertainty stays intact]

Page 31: Text Categorization - cs.virginia.edu

Feature scoring metrics

• Information gain
  – Decrease in entropy of the categorical prediction when the feature t is present or absent

  IG(t) = - Σ_c p(c) log p(c)
          + p(t) Σ_c p(c|t) log p(c|t)
          + p(t̄) Σ_c p(c|t̄) log p(c|t̄)

  – The first term is the entropy of the class label alone; the second and third are the (negated) entropies of the class label when t is present and when t is absent, weighted by p(t) and p(t̄)
  – p(c|t): probability of seeing class label c in documents where t occurs
  – p(c|t̄): probability of seeing class label c in documents where t does not occur
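The formula can be computed directly from the estimated probabilities; a sketch (the distributions below are made up for illustration):

```python
import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def info_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
    """IG(t) = H(C) - p(t) H(C|t) - p(t_bar) H(C|t_bar), base-2 logs."""
    return (entropy(p_c)
            - p_t * entropy(p_c_given_t)
            - (1 - p_t) * entropy(p_c_given_not_t))

# A perfectly informative feature removes all class uncertainty:
print(info_gain([0.5, 0.5], 0.5, [1.0, 0.0], [0.0, 1.0]))  # prints 1.0
# A feature independent of the class leaves it intact:
print(info_gain([0.5, 0.5], 0.5, [0.5, 0.5], [0.5, 0.5]))  # prints 0.0
```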

Page 32: Text Categorization - cs.virginia.edu

Feature scoring metrics

β€’ πœ’πœ’2 statistics– Test whether distributions of two categorical

variables are independent of one anotherβ€’ 𝐻𝐻0: they are independentβ€’ 𝐻𝐻1: they are dependent

𝑑𝑑 𝑑𝑑𝑐𝑐 𝐴𝐴 𝐡𝐡𝑐𝑐 𝐢𝐢 𝐷𝐷

πœ’πœ’2 𝑑𝑑, 𝑐𝑐 =(𝐴𝐴 + 𝐡𝐡 + 𝐢𝐢 + 𝐷𝐷) 𝐴𝐴𝐷𝐷 βˆ’ 𝐡𝐡𝐢𝐢 2

(𝐴𝐴 + 𝐢𝐢)(𝐡𝐡 + 𝐷𝐷)(𝐴𝐴 + 𝐡𝐡)(𝐢𝐢 + 𝐷𝐷)

𝐷𝐷𝐷𝐷(𝑑𝑑) 𝑁𝑁 βˆ’ 𝐷𝐷𝐷𝐷(𝑑𝑑) #𝑃𝑃𝑃𝑃𝑃𝑃 𝑑𝑑𝑃𝑃𝑐𝑐 #π‘π‘π‘π‘π‘Žπ‘Ž 𝑑𝑑𝑃𝑃𝑐𝑐CS@UVa CS 6501: Text Mining 32

Page 33: Text Categorization - cs.virginia.edu

Feature scoring metrics

β€’ πœ’πœ’2 statistics– Test whether distributions of two categorical

variables are independent of one anotherβ€’ Degree of freedom = (#col-1)X(#row-1)β€’ Significance level: 𝛼𝛼, i.e., 𝑝𝑝-value<𝛼𝛼

𝑑𝑑 𝑑𝑑𝑐𝑐 36 30

𝑐𝑐 14 25

πœ’πœ’2 𝑑𝑑, 𝑐𝑐 =105 36 Γ— 25 βˆ’ 14 Γ— 30 2

50 Γ— 55 Γ— 66 Γ— 39 = 3.418

Look into πœ’πœ’2 distribution table to find the threshold

DF=1, 𝛼𝛼 = 0.05 => threshold = 3.841

We cannot reject 𝐻𝐻0

𝑑𝑑 is not a good feature to choose

CS@UVa CS 6501: Text Mining 33
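The worked example above checks out numerically:

```python
def chi_square(A, B, C, D):
    """Chi-square statistic from the 2x2 contingency counts A, B, C, D."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

score = chi_square(36, 30, 14, 25)
print(round(score, 3))   # prints 3.418
print(score > 3.841)     # below the DF=1, alpha=0.05 threshold: prints False
```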

Page 34: Text Categorization - cs.virginia.edu

Feature scoring metrics

β€’ πœ’πœ’2 statistics– Test whether distributions of two categorical

variables are independent of one anotherβ€’ Degree of freedom = (#col-1)X(#row-1)β€’ Significance level: 𝛼𝛼, i.e., 𝑝𝑝-value<𝛼𝛼‒ For the features passing the threshold, rank them by

descending order of πœ’πœ’2 values and choose the top π‘˜π‘˜features

CS@UVa CS 6501: Text Mining 34

Page 35: Text Categorization - cs.virginia.edu

Feature scoring metrics

β€’ πœ’πœ’2 statistics with multiple categories– πœ’πœ’2 𝑑𝑑 = βˆ‘π‘π‘ 𝑝𝑝(𝑐𝑐)πœ’πœ’2 𝑐𝑐, 𝑑𝑑

β€’ Expectation of πœ’πœ’2 over all the categories

– πœ’πœ’2 𝑑𝑑 = max𝑐𝑐

πœ’πœ’2 𝑐𝑐, 𝑑𝑑‒ Strongest dependency between a category

β€’ Problem with πœ’πœ’2 statistics– Normalization breaks down for the very low frequency

termsβ€’ πœ’πœ’2 values become incomparable between high frequency

terms and very low frequency terms

Distribution assumption becomes inappropriate in this test

CS@UVa CS 6501: Text Mining 35

Page 36: Text Categorization - cs.virginia.edu

Feature scoring metrics

• Many other metrics
  – Mutual information
    • Relatedness between term t and class c

    PMI(t; c) = p(t, c) log( p(t, c) / (p(t) p(c)) )

  – Odds ratio
    • Odds of term t occurring with class c, normalized by the odds of t occurring without c

    Odds(t; c) = [ p(t, c) / (1 − p(t, c)) ] × [ (1 − p(t, c̄)) / p(t, c̄) ]

  – The same trick as in χ² statistics applies for multi-class cases
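Both metrics are straightforward to compute once the joint probabilities are estimated; a sketch with illustrative probabilities:

```python
import math

def pmi(p_tc, p_t, p_c):
    """PMI(t;c) = p(t,c) * log2( p(t,c) / (p(t) p(c)) )."""
    return p_tc * math.log2(p_tc / (p_t * p_c))

def odds_ratio(p_tc, p_t_cbar):
    """Odds of t occurring with c, normalized by the odds of t occurring
    with c-bar (i.e., without c)."""
    return (p_tc / (1 - p_tc)) * ((1 - p_t_cbar) / p_t_cbar)

print(pmi(0.25, 0.5, 0.5))   # term independent of the class: prints 0.0
print(odds_ratio(0.5, 0.5))  # same odds with and without c: prints 1.0
```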

Page 37: Text Categorization - cs.virginia.edu

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. JMLR, 3, 1289-1305.

A graphical analysis of feature selection

• Isoclines for each feature scoring metric
  – Machine learning papers vs. other CS papers

[Figure: features scattered by per-class counts with metric isoclines; a zoomed-in view and the effect of stopword removal]

Page 38: Text Categorization - cs.virginia.edu

A graphical analysis of feature selection

β€’ Isoclines for each feature scoring metric – Machine learning papers v.s. other CS papers

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. JMLR, 3, 1289-1305.

CS@UVa CS 6501: Text Mining 38

Page 39: Text Categorization - cs.virginia.edu

Effectiveness of feature selection methods

• On a multi-class classification data set
  – 229 documents, 19 classes
  – Binary features, SVM classifier

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. JMLR, 3, 1289-1305.

Page 40: Text Categorization - cs.virginia.edu

Effectiveness of feature selection methods

• On a multi-class classification data set
  – 229 documents, 19 classes
  – Binary features, SVM classifier

Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. JMLR, 3, 1289-1305.

Page 41: Text Categorization - cs.virginia.edu

Empirical analysis of feature selection methods

• Text corpus
  – Reuters-22173
    • 13,272 documents, 92 classes, 16,039 unique words
  – OHSUMED
    • 3,981 documents, 14,321 classes, 72,076 unique words
• Classifiers: kNN and LLSF

Page 42: Text Categorization - cs.virginia.edu

General steps for text categorization

1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation

Consider:
2.1 What is the unique property of this problem?
2.2 What type of classifier should we use?

Page 43: Text Categorization - cs.virginia.edu

Model specification

• Specify the dependency assumptions
  – Linear relation between x and y
    • wᵀx → y
    • Features are independent of each other
    • E.g., Naïve Bayes, linear SVM
  – Non-linear relation between x and y
    • f(x) → y, where f(·) is a non-linear function of x
    • Features are not independent of each other
    • E.g., decision tree, kernel SVM, mixture model
• Choose based on our domain knowledge of the problem

We will discuss these choices later

Page 44: Text Categorization - cs.virginia.edu

General steps for text categorization

1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation

Consider:
3.1 How to estimate the parameters of the selected model?
3.2 How to control the complexity of the estimated model?

Page 45: Text Categorization - cs.virginia.edu

Model estimation and selection

• General philosophy
  – Loss minimization

  E[L] = L_{1,0} p(y=1) ∫_{R0} p(x|y=1) dx + L_{0,1} p(y=0) ∫_{R1} p(x|y=0) dx

  – L_{1,0}: penalty for misclassifying c1 as c0; L_{0,1}: penalty for misclassifying c0 as c1
  – R0/R1: regions in which we assign x to class 0/1
  – Estimated empirically from the training set: the empirical loss!

Key assumption: the data are independent and identically distributed!

Page 46: Text Categorization - cs.virginia.edu

Empirical loss minimization

• Overfitting
  – Good empirical loss, terrible generalization loss
  – High model complexity → prone to fitting noise

[Figure: data generated by an underlying linear relation, overfit by an over-complicated polynomial model]

Page 47: Text Categorization - cs.virginia.edu

Generalization loss minimization

• Avoid overfitting
  – Measure model complexity as well
  – Model selection and regularization

[Figure: error vs. model complexity; error on the training set keeps decreasing while error on the testing set turns back up]

Page 48: Text Categorization - cs.virginia.edu

Generalization loss minimization

• Cross validation
  – Avoids depending on noise in a single train/test separation
  – k-fold cross-validation:
    1. Partition all training data into k equal-size disjoint subsets;
    2. Leave one subset out for validation and use the other k−1 for training;
    3. Repeat step (2) k times, with each of the k subsets used exactly once as the validation data.
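The three steps can be sketched directly:

```python
def k_fold_splits(n, k):
    """Partition indices 0..n-1 into k disjoint folds; yield (train, val)
    index pairs so each fold serves exactly once as the validation set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

for train, val in k_fold_splits(10, 5):
    print(sorted(val))  # each of the 5 validation folds holds 2 distinct indices
```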

Page 49: Text Categorization - cs.virginia.edu

Generalization loss minimization

• Cross validation
  – Avoid noise in train/test separation
  – k-fold cross-validation

Page 50: Text Categorization - cs.virginia.edu

Generalization loss minimization

• Cross validation
  – Avoid noise in train/test separation
  – k-fold cross-validation
    • Choose the model (among different models, or the same model with different settings) that has the best average performance on the validation sets
    • A statistical test is needed to decide whether one model is significantly better than another (covered shortly)

Page 51: Text Categorization - cs.virginia.edu

General steps for text categorization

1. Feature construction and selection
2. Model specification
3. Model estimation and selection
4. Evaluation

Consider:
4.1 How to judge the quality of the learned model?
4.2 How can you further improve the performance?

Page 52: Text Categorization - cs.virginia.edu

Classification evaluation

• Accuracy
  – Percentage of correct predictions over all predictions, i.e., p(y* = y)
  – Limitation
    • Highly skewed class distributions, e.g., p(y* = 1) = 0.99
      – Trivial solution: predict all testing cases as positive
      – Classifiers' capability is then only differentiated by 1% of the testing cases

Page 53: Text Categorization - cs.virginia.edu

Evaluation of binary classification

• Precision
  – Fraction of predicted-positive documents that are indeed positive, i.e., p(y* = 1 | ŷ = 1)
• Recall
  – Fraction of truly positive documents that are predicted positive, i.e., p(ŷ = 1 | y* = 1)

             y* = 1                y* = 0
  ŷ = 1      true positive (TP)    false positive (FP)
  ŷ = 0      false negative (FN)   true negative (TN)

  Precision = TP / (TP + FP)        Recall = TP / (TP + FN)
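From a list of gold labels and predictions, both metrics follow directly from the contingency counts (labels below are made up):

```python
def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for binary labels in {0, 1}."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp / (tp + fp), tp / (tp + fn)

# 3 true positives in the gold labels; the classifier finds 2 of them
# and raises 1 false alarm:
prec, rec = precision_recall([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(prec, rec)  # both 2/3
```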

Page 54: Text Categorization - cs.virginia.edu

Evaluation of binary classification


[Figure: the two class densities p(X|y=1)p(y=1) and p(X|y=0)p(y=0) over X with the *optimal* Bayes decision boundary; the regions ŷ=0 and ŷ=1 divide into true positive, true negative, false positive, and false negative areas]

Page 55: Text Categorization - cs.virginia.edu

Precision and recall trade off

• Precision decreases as the number of documents predicted positive increases (except under perfect classification), while recall keeps increasing
• These two metrics emphasize different perspectives of a classifier
  – Precision: prefers a classifier that recognizes fewer documents, but highly accurately
  – Recall: prefers a classifier that recognizes more documents

Page 56: Text Categorization - cs.virginia.edu

Summarizing precision and recall

• With a single value
  – In order to compare different classifiers
  – F-measure: weighted harmonic mean of precision and recall; α balances the trade-off

    F = 1 / ( α (1/P) + (1 − α) (1/R) )

    F1 = 2 / ( 1/P + 1/R )    (equal weight between precision and recall)

  – Why the harmonic mean?
    • Classifier 1: P = 0.53, R = 0.36
    • Classifier 2: P = 0.01, R = 0.99

                   Harmonic   Arithmetic
    Classifier 1     0.429      0.445
    Classifier 2     0.019      0.500
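The numbers in the table can be reproduced directly; the harmonic mean exposes the degenerate "predict everything positive" classifier that the arithmetic mean rewards:

```python
def f_measure(p, r, alpha=0.5):
    """Weighted harmonic mean of precision and recall; alpha=0.5 gives F1."""
    return 1.0 / (alpha / p + (1 - alpha) / r)

print(round(f_measure(0.53, 0.36), 3))  # prints 0.429
print(round(f_measure(0.01, 0.99), 3))  # prints 0.02
```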

Page 57: Text Categorization - cs.virginia.edu

Summarizing precision and recall

• With a curve
  1. Order all the testing cases by the classifier's prediction score (assuming the higher the score, the more likely the case is positive);
  2. Scan through the testing cases: at each case, treat all cases above it (inclusive) as positive and those below as negative; compute precision and recall;
  3. Plot the precision and recall computed for each testing case in step (2).
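The procedure above, as a sketch (scores and labels are made up):

```python
def pr_curve(scores, labels):
    """Return (recall, precision) points, scanning cases by descending score
    and treating everything at or above the cutoff as positive."""
    ranked = sorted(zip(scores, labels), reverse=True)
    total_pos = sum(labels)
    tp = fp = 0
    points = []
    for _, y in ranked:
        tp += y
        fp += 1 - y
        points.append((tp / total_pos, tp / (tp + fp)))
    return points

for recall, precision in pr_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]):
    print(recall, precision)
```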

Page 58: Text Categorization - cs.virginia.edu

Summarizing precision and recall

• With a curve
  – A.k.a. the precision-recall curve
  – Summarized by the Area Under the Curve (AUC)
  – At each recall level, we prefer a higher precision

Page 59: Text Categorization - cs.virginia.edu

Multi-class categorization

• Confusion matrix
  – A generalized contingency table for precision and recall

[Figure: example confusion matrix over multiple classes]

Page 60: Text Categorization - cs.virginia.edu

Statistical significance tests

• How confident are you that an observed difference doesn't simply result from the particular train/test separation you chose?

  Fold     Classifier 1   Classifier 2
  1        0.20           0.18
  2        0.21           0.19
  3        0.22           0.21
  4        0.19           0.20
  5        0.18           0.37
  Average  0.20           0.23

Page 61: Text Categorization - cs.virginia.edu

Background knowledge

• The p-value in a statistical test is the probability of obtaining data as extreme as was observed, if the null hypothesis were true (e.g., if the observation were totally random)
• If the p-value is smaller than the chosen significance level (α), we reject the null hypothesis (i.e., conclude the observation is not random)
• We seek to reject the null hypothesis (to show that the observation is not a random result), so small p-values are good

Page 62: Text Categorization - cs.virginia.edu

Paired t-test

• Paired t-test
  – Tests whether two sets of observations are significantly different from each other
    • In k-fold cross validation, the different classifiers (Algorithm 1 vs. Algorithm 2) are applied to the same train/test separations
  – Null hypothesis: the difference between the two responses measured on the same statistical unit has zero mean
• One-tailed vs. two-tailed?
  – If you aren't sure, use two-tailed

Page 63: Text Categorization - cs.virginia.edu

Statistical significance test

[Figure: t distribution centered at 0, with the middle 95% of outcomes shaded]

  Fold     Classifier A   Classifier B   Difference
  1        0.20           0.18           +0.02
  2        0.21           0.19           +0.02
  3        0.22           0.21           +0.01
  4        0.19           0.20           −0.01
  5        0.18           0.37           −0.19
  Average  0.20           0.23

  Paired t-test: p = 0.4987
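The t statistic for the fold-wise differences can be computed by hand (a plain-Python sketch; the exact p-value additionally needs the t distribution's CDF, e.g. from a table or a stats library):

```python
import math

def paired_t(a, b):
    """Paired t statistic: mean of the per-fold differences divided by
    its standard error."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

t = paired_t([0.20, 0.21, 0.22, 0.19, 0.18],
             [0.18, 0.19, 0.21, 0.20, 0.37])
print(round(t, 3))  # prints -0.743; |t| < 2.776 (critical value at DF=4, alpha=0.05)
```

Since |t| is far below the critical value, we cannot reject the null hypothesis that the two classifiers perform the same.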

Page 64: Text Categorization - cs.virginia.edu

What you should know

• Bayes decision theory
  – Bayes risk minimization
• General steps for text categorization
  – Text feature construction
  – Feature selection methods
  – Model specification and estimation
  – Evaluation metrics

Page 65: Text Categorization - cs.virginia.edu

Today’s reading

β€’ Introduction to Information Retrieval– Chapter 13: Text classification and Naive Bayes

β€’ 13.1 – Text classification problemβ€’ 13.5 – Feature selectionβ€’ 13.6 – Evaluation of text classification

CS@UVa CS 6501: Text Mining 65


Recommended