807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Page 1: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

807 - TEXT ANALYTICS

Massimo Poesio

Lecture 3: Text Classification

Page 2: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

TEXT CLASSIFICATION

From: "" <[email protected]>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down

Stop paying rent TODAY !

There is no need to spend hundreds or even thousands for similar courses

I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.

Change your life NOW !

=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Page 3: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

TAGS FOR TEXT MANAGEMENT

Page 4: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

HOW DOCUMENT TAGS ARE PRODUCED

• Traditionally: by HAND by EXPERTS (curators)
  – E.g., Yahoo!, Looksmart, about.com, ODP, Medline
• Also: Bridgeman, UK Data Archive
  – Very accurate when the job is done by experts
  – Consistent when the problem size and team is small
  – Difficult and expensive to scale
• Now:
  – AUTOMATICALLY
  – By CROWDS

Page 5: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

CLASSIFICATION AND CATEGORIZATION

• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
    • Issue: how to represent text documents.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
  – We want to know how to build categorization functions ("classifiers").

Page 6: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

AUTOMATIC TEXT CLASSIFICATION

• Supervised
  – Training set
  – Test set
• Clustering: next week

Page 7: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Text Categorization: attributes

• Representations of text are very high dimensional (one feature for each word).
• High-bias algorithms that prevent overfitting in high-dimensional space are best.
• For most text categorization tasks, there are many irrelevant and many relevant features.
• Methods that combine evidence from many or all features (e.g. naive Bayes, kNN, neural nets) tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction)*

*Although one can compensate by using many rules

Page 8: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Bayesian Methods

• Learning and classification methods based on probability theory.

• Bayes theorem plays a critical role in probabilistic learning and classification.

• Build a generative model that approximates how data is produced

• Uses prior probability of each category given no information about an item.

• Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Page 9: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Bayes’ Rule

$P(C, X) = P(C \mid X)\,P(X) = P(X \mid C)\,P(C)$

$P(C \mid X) = \dfrac{P(X \mid C)\,P(C)}{P(X)}$
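As a quick illustration (not from the original slides), here is a minimal Python sketch that applies Bayes' rule to made-up spam-filtering numbers; the prior and likelihood values are purely hypothetical.

```python
# Bayes' rule: P(C|X) = P(X|C) * P(C) / P(X), where the evidence
# P(X) = sum over classes C' of P(X|C') * P(C').
# All numbers below are invented for illustration.

prior = {"spam": 0.3, "ham": 0.7}              # P(C)
likelihood = {"spam": 0.02, "ham": 0.0005}     # P(X|C) for one observed message X

evidence = sum(likelihood[c] * prior[c] for c in prior)               # P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}   # P(C|X)

print(posterior)   # {'spam': 0.944..., 'ham': 0.055...}
```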

Page 10: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Maximum a posteriori Hypothesis

$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \dfrac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$

Page 11: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$h_{ML} = \arg\max_{h \in H} P(D \mid h)$

Page 12: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Naive Bayes Classifiers

Task: classify a new instance based on a tuple of attribute values $\langle x_1, x_2, \ldots, x_n \rangle$:

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)$

$c_{MAP} = \arg\max_{c_j \in C} \dfrac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}$

$c_{MAP} = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$

Page 13: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Naïve Bayes Classifier: Assumptions

• P(cj)
  – Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, …, xn | cj) – O(|X|^n · |C|) parameters
  – Could only be estimated if a very, very large number of training examples was available.
• Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

Page 14: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

The Naïve Bayes Classifier

[Figure: naive Bayes network with class node Flu and feature nodes X1–X5 (fever, sinus, cough, runny nose, muscle-ache)]

• Conditional Independence Assumption: features are independent of each other given the class:

$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_5 \mid C)$

Page 15: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Learning the Model

• Common practice: maximum likelihood
  – Simply use the frequencies in the data

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$

$\hat{P}(c_j) = \dfrac{N(C = c_j)}{N}$

[Figure: naive Bayes network with class node C and feature nodes X1 … X6]

Page 16: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Problem with Max Likelihood

• What if we have seen no training cases where the patient had no flu and muscle aches?

$\hat{P}(X_5 = t \mid C = \mathrm{nf}) = \dfrac{N(X_5 = t, C = \mathrm{nf})}{N(C = \mathrm{nf})} = 0$

• Zero probabilities cannot be conditioned away, no matter the other evidence!

$\arg\max_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)$

[Figure: the Flu naive Bayes network from the previous slide, with $P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_5 \mid C)$]

Page 17: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Smoothing to Avoid Overfitting

• Laplace (add-one) smoothing:

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$

where k is the number of values of $X_i$.

• Somewhat more subtle version:

$\hat{P}(x_{i,k} \mid c_j) = \dfrac{N(X_i = x_{i,k}, C = c_j) + m\,p_{i,k}}{N(C = c_j) + m}$

where $p_{i,k}$ is the overall fraction in the data where $X_i = x_{i,k}$, and m controls the extent of "smoothing".
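A minimal Python sketch of these two estimators, assuming the raw counts are already available; the function names and the example numbers are mine, not from the slides.

```python
def laplace_estimate(n_xi_cj, n_cj, k):
    """Add-one smoothed estimate of P(xi | cj).
    n_xi_cj: count of examples with X_i = x_i and C = c_j
    n_cj:    count of examples with C = c_j
    k:       number of distinct values X_i can take
    """
    return (n_xi_cj + 1) / (n_cj + k)


def m_estimate(n_xi_cj, n_cj, p_ik, m):
    """m-estimate of P(xi | cj): interpolates the raw frequency with a
    prior p_ik (the overall fraction of X_i = x_i,k in the data);
    m controls the extent of the smoothing."""
    return (n_xi_cj + m * p_ik) / (n_cj + m)


# Example: a word never seen in the 100 documents of a class,
# with 50,000 possible words and an overall frequency of 0.001.
print(laplace_estimate(0, 100, 50_000))     # ~2.0e-05 instead of 0
print(m_estimate(0, 100, 0.001, m=1.0))     # ~9.9e-06 instead of 0
```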

Page 18: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Using Naive Bayes Classifiers to Classify Text: Basic method

• Attributes are text positions, values are words.

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\,P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

• The Naive Bayes assumption is clearly violated. Example?
• Still too many possibilities
• Assume that classification is independent of the positions of the words
  – Use the same parameters for each position

Page 19: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Text Classification Algorithms: Learning

• From the training corpus, extract Vocabulary
• Calculate the required P(cj) and P(xk | cj) terms
  – For each cj in C do:
    • docsj ← subset of documents for which the target class is cj
    • P(cj) ← |docsj| / |total # documents|
    • Textj ← single document containing all docsj
    • For each word xk in Vocabulary:
      – nk ← number of occurrences of xk in Textj
      – P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word positions in Textj
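A compact Python sketch of this learning procedure (multinomial Naive Bayes with add-one smoothing), written from the pseudocode above; the function and variable names are mine, not from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: parallel list of class names.
    Returns log P(c_j) and log P(x_k | c_j) tables plus the vocabulary."""
    vocab = {w for doc in docs for w in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)        # counts of x_k in Text_j, per class
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior, log_cond = {}, defaultdict(dict)
    for c in docs_per_class:
        log_prior[c] = math.log(docs_per_class[c] / len(docs))   # P(c_j)
        n = sum(word_counts[c].values())                         # word positions in Text_j
        for w in vocab:                                          # (n_k + 1) / (n + |Vocabulary|)
            log_cond[c][w] = math.log((word_counts[c][w] + 1) / (n + len(vocab)))
    return log_prior, log_cond, vocab
```

Classification (next slide) then just sums log_prior[c] and log_cond[c][w] over the word positions of the new document and returns the class with the highest total.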

Page 20: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Text Classification Algorithms: Classifying

• positions ← all word positions in the current document which contain tokens found in Vocabulary
• Return cNB, where

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$

Page 21: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Underflow Prevention

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

• Class with highest final un-normalized log probability score is still the most probable.
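A small, self-contained Python illustration (not from the slides) of why this matters: multiplying a few thousand per-word probabilities underflows to 0.0, while summing their logs stays perfectly usable for comparing classes.

```python
import math

# 2,000 word positions, each with probability 1e-4 given the class.
probs = [1e-4] * 2000

product = 1.0
for p in probs:
    product *= p
print(product)       # 0.0 -- floating-point underflow

log_score = sum(math.log(p) for p in probs)
print(log_score)     # about -18420.7, still fine for argmax across classes
```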

Page 22: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  – Output probabilities are generally very close to 0 or 1.

Page 23: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection

• Text collections have a large number of features
  – 10,000 – 1,000,000 unique words – and more
• Make using a particular classifier feasible
  – Some classifiers can't deal with 100,000s of features
• Reduce training time
  – Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression)
• Improve generalization
  – Eliminate noise features
  – Avoid overfitting

Page 24: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Recap: Feature Reduction

• Standard ways of reducing the feature space for text
  – Stemming
    • Laugh, laughs, laughing, laughed -> laugh
  – Stop word removal
    • E.g., eliminate all prepositions
  – Conversion to lower case
  – Tokenization
    • Break on all special characters: fire-fighter -> fire, fighter
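A rough Python sketch of this preprocessing pipeline; the tiny stop-word list and the crude suffix stripper are placeholders for illustration only (a real system would use a proper stemmer such as Porter's and a full stop-word list).

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "for"}   # toy list

def crude_stem(word):
    # Very rough suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                        # conversion to lower case
    tokens = re.split(r"[^a-z0-9]+", text)                     # break on all special characters
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # stop word removal
    return [crude_stem(t) for t in tokens]                     # stemming

print(preprocess("The fire-fighter laughed at the fires"))
# ['fire', 'fighter', 'laugh', 'fire']
```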

Page 25: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection

• Yang and Pedersen 1997
• Comparison of different selection criteria
  – DF – document frequency
  – IG – information gain
  – MI – mutual information
  – CHI – chi-square
• Common strategy
  – Compute the statistic for each term
  – Keep the n terms with the highest value of this statistic

Page 26: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature selection via Mutual Information

• We might not want to use all words, but just reliable, good discriminators
• In the training set, choose the k words which best discriminate the categories.
• One way is in terms of mutual information:
  – For each word w and each category c:

$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \dfrac{p(e_w, e_c)}{p(e_w)\,p(e_c)}$
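A small Python sketch of this computation from document counts; n11, n10, n01, n00 are my names for the four cells of the word/category contingency table, and cells with zero count are skipped since 0 · log 0 is taken to be 0.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(w, c) over document counts:
    n11: docs containing w and in c       n10: docs containing w, not in c
    n01: docs in c without w              n00: docs with neither
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for joint, marg_w, marg_c in [
        (n11, n11 + n10, n11 + n01),   # e_w = 1, e_c = 1
        (n10, n11 + n10, n10 + n00),   # e_w = 1, e_c = 0
        (n01, n01 + n00, n11 + n01),   # e_w = 0, e_c = 1
        (n00, n01 + n00, n10 + n00),   # e_w = 0, e_c = 0
    ]:
        if joint == 0:
            continue
        p = joint / n
        mi += p * math.log(p / ((marg_w / n) * (marg_c / n)))
    return mi

# A term that occurs almost only in documents of the category scores high:
print(mutual_information(n11=49, n10=1, n01=1, n00=49))
```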

Page 27: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

(Pointwise) Mutual Information

Page 28: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature selection via MI (contd.)

• For each category we build a list of the k most discriminating terms.
• For example (on 20 Newsgroups):
  – sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  – rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
• Greedy: does not account for correlations between terms
• In general feature selection is necessary for binomial NB, but not for multinomial NB

Page 29: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Information Gain

Page 30: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Chi-Square

                                        Term present   Term absent
  Document belongs to category               A              B
  Document does not belong to category       C              D

$\chi^2 = \dfrac{N\,(AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}$

• Use either the maximum or the average $\chi^2$
• Value for complete independence?
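A direct Python transcription of this formula; the function name and the toy counts are mine.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category contingency table:
    a: term present, doc in category       b: term absent, doc in category
    c: term present, doc not in category   d: term absent, doc not in category
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

print(chi_square(40, 10, 10, 40))   # strongly associated term -> 36.0
print(chi_square(25, 25, 25, 25))   # complete independence -> 0.0
```

This also answers the question above: under complete independence AD = BC, so the statistic is 0.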

Page 31: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Document Frequency

• Number of documents a term occurs in
• Is sometimes used for eliminating both very frequent and very infrequent terms
• How is the document frequency measure different from the other 3 measures?

Page 32: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Yang&Pedersen: Experiments

• Two classification methods
  – kNN (k nearest neighbors; more later)
  – Linear Least Squares Fit
    • Regression method
• Collections
  – Reuters-22173
    • 92 categories
    • 16,000 unique terms
  – Ohsumed: subset of MEDLINE
    • 14,000 categories
    • 72,000 unique terms
• ltc term weighting

Page 33: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Yang&Pedersen: Experiments

• Choose feature set size
• Preprocess collection, discarding non-selected features / words
• Apply term weighting -> feature vector for each document
• Train classifier on training set
• Evaluate classifier on test set

Page 34: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification
Page 35: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Discussion

• You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
• In fact, performance increases with fewer features for IG, DF, and CHI.
• Mutual information is very sensitive to small counts.
• IG does best with the smallest number of features.
• Document frequency is close to optimal, and is by far the simplest feature selection method.
• Similar results for LLSF (regression).

Page 36: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Results

Why is selecting common terms a good strategy?

Page 37: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

IG, DF, CHI Are Correlated.

Page 38: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Information Gain vs Mutual Information

• Information gain is similar to MI for random variables
• Independence?
• In contrast, pointwise MI ignores non-occurrence of terms
  – E.g., for complete dependence, you get P(AB) / (P(A)P(B)) = 1/P(A), which is larger for rare terms than for frequent terms
• Yang & Pedersen: pointwise MI favors rare terms

Page 39: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection:Other Considerations

• Generic vs class-specific
  – Completely generic (class-independent)
  – Separate feature set for each class
  – Mixed (à la Yang & Pedersen)
• Maintainability over time
  – Is aggressive feature selection good or bad for robustness over time?
• Ideal: optimal features selected as part of training

Page 40: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Yang&Pedersen: Limitations

• Don't look at class-specific feature selection
• Don't look at methods that can't handle high-dimensional spaces
• Evaluate category ranking (as opposed to classification accuracy)

Page 41: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection: Other Methods

• Stepwise term selection
  – Forward
  – Backward
  – Expensive: need to do n^2 iterations of training
• Term clustering
• Dimension reduction: PCA / SVD

Page 42: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

FEATURES IN SPAM ASSASSIN

• SpamAssassin is a spam-filtering program written in Perl; it grew out of a filter begun by Mark Jeftovic in 1997, was entirely rewritten by Justin Mason, and was put on SourceForge in 2001; it is now maintained by the Apache Software Foundation

• It uses Naïve Bayes and a variety of other methods

• It uses features coming both from content and from other properties of the email message

Page 43: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

SpamAssassin Features

100   From: address is in the user's black-list
4.0   Sender is on www.habeas.com Habeas Infringer List
3.994 Invalid Date: header (timezone does not exist)
3.970 Written in an undesired language
3.910 Listed in Razor2, see http://razor.sf.net/
3.801 Subject is full of 8-bit characters
3.472 Claims compliance with Senate Bill 1618
3.437 exists:X-Precedence-Ref
3.371 Reverses Aging
3.350 Claims you can be removed from the list
3.284 'Hidden' assets
3.283 Claims to honor removal requests
3.261 Contains "Stop Snoring"
3.251 Received: contains a name with a faked IP-address
3.250 Received via a relay in list.dsbl.org
3.200 Character set indicates a foreign language

Page 44: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

SpamAssassin Features (continued)

3.198 Forged eudoramail.com 'Received:' header found
3.193 Free Investment
3.180 Received via SBLed relay, see http://www.spamhaus.org/sbl/
3.140 Character set doesn't exist
3.123 Dig up Dirt on Friends
3.090 No MX records for the From: domain
3.072 X-Mailer contains malformed Outlook Express version
3.044 Stock Disclaimer Statement
3.009 Apparently, NOT Multi Level Marketing
3.005 Bulk email software fingerprint (jpfree) found in headers
2.991 exists:Complain-To
2.975 Bulk email software fingerprint (VC_IPA) found in headers
2.968 Invalid Date: year begins with zero
2.932 Mentions Spam law "H.R. 3113"
2.900 Received forged, contains fake AOL relays
2.879 Asks for credit card details

Page 45: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Measuring Performance

• Classification accuracy: What % of messages were classified correctly?

• Is this what we care about?

             Overall accuracy   Accuracy on spam   Accuracy on gen
  System 1        95%               99.99%             90%
  System 2        95%               90%                99.99%

Which system do you prefer?

Page 46: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Measuring Performance

• Precision = (good messages kept) / (all messages kept)
• Recall = (good messages kept) / (all good messages)

• Trade off precision vs. recall by setting a threshold
• Measure the curve on annotated dev data (or test data)
• Choose a threshold where the user is comfortable

[Figure: precision vs. recall curve for good (non-spam) email; x-axis: recall (0–100%), y-axis: precision (0–100%)]

Page 47: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Measuring Performance

[Figure: the same precision vs. recall curve for good (non-spam) email, annotated with:
  – low threshold: keep all the good stuff, but a lot of the bad too
  – high threshold: all we keep is good, but we don't keep much
  – OK for spam filtering and legal search
  – OK for search engines (maybe)
  – would prefer to be here!
  – point where precision = recall (sometimes reported)]

F-measure = 1 / (average(1/precision, 1/recall))
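A minimal Python sketch of these measures computed over kept messages; the toy message ids are invented for the example.

```python
def precision_recall_f1(kept, good):
    """kept: set of message ids the filter let through; good: set of truly good ids."""
    true_pos = len(kept & good)
    precision = true_pos / len(kept) if kept else 0.0     # good kept / all kept
    recall = true_pos / len(good) if good else 0.0        # good kept / all good
    f1 = (2 * precision * recall / (precision + recall)   # harmonic mean = F-measure above
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: ids 0-7 are good mail; the filter keeps 0-5 plus one spam message (8).
kept = {0, 1, 2, 3, 4, 5, 8}
good = {0, 1, 2, 3, 4, 5, 6, 7}
print(precision_recall_f1(kept, good))   # (0.857..., 0.75, 0.8)
```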

Page 48: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Micro- vs. Macro-Averaging

• If we have more than one class, how do we combine multiple performance measures into one quantity?

• Macroaveraging: Compute performance for each class, then average.

• Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Page 49: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Micro- vs. Macro-Averaging: Example

Class 1:
                    Truth: yes   Truth: no
  Classifier: yes       10          10
  Classifier: no        10         970

Class 2:
                    Truth: yes   Truth: no
  Classifier: yes       90          10
  Classifier: no        10         890

Micro-averaged (pooled) table:
                    Truth: yes   Truth: no
  Classifier: yes      100          20
  Classifier: no        20        1860

Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
Microaveraged precision: 100 / 120 ≈ 0.83
Why this difference?
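A short Python sketch reproducing the two averages from per-class (TP, FP) counts; the counts are the ones in the tables above, the function names are mine.

```python
def macro_precision(class_counts):
    """class_counts: list of (true_positives, false_positives) per class."""
    return sum(tp / (tp + fp) for tp, fp in class_counts) / len(class_counts)

def micro_precision(class_counts):
    """Pool the contingency tables first, then compute a single precision."""
    total_tp = sum(tp for tp, _ in class_counts)
    total_fp = sum(fp for _, fp in class_counts)
    return total_tp / (total_tp + total_fp)

counts = [(10, 10), (90, 10)]       # (TP, FP) for Class 1 and Class 2 above
print(macro_precision(counts))      # 0.7
print(micro_precision(counts))      # 0.833...
```

Microaveraging weights every single decision equally, so large classes dominate; macroaveraging weights every class equally, which is why the two numbers differ.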

Page 50: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Confusion matrix

• Function of classifier, topics and test docs.
• For a perfect classifier, all off-diagonal entries should be zero.
• For a perfect classifier, if there are n docs in category j then entry (j, j) should be n.
• Straightforward when there is 1 category per document.
• Can be extended to n categories per document.

Page 51: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Decision Trees

• Tree with internal nodes labeled by terms
• Branches are labeled by tests on the weight that the term has
• Leaves are labeled by categories
• The classifier categorizes a document by descending the tree, following the tests, to a leaf
• The label of the leaf node is then assigned to the document
• Most decision trees are binary trees (never disadvantageous; may require extra internal nodes)

Page 52: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Decision Tree Example

Page 53: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Decision Tree Learning

• Learn a sequence of tests on features, typically using top-down, greedy search
  – At each stage choose the unused feature with the highest Information Gain (as in Lecture 5)
• Binary (yes/no) or continuous decisions

[Figure: a small tree that tests f1 and then f7, with leaf probabilities P(class) = .6, .9, .2]

Page 54: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

• Partition training examples into good “splits”, based on values of a single “good” feature:

(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +

Page 55: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

keys?
  yes -> Drive: 1, 5
  no  -> Walk: 2, 3, 4

Page 56: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

• Partition training examples into good “splits”, based on values of a single “good” feature

(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +

• No acceptable classification: proceed recursively

Page 57: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

t?
  cold -> Walk: 2, 4
  hot  -> Drive: 1, 5; Walk: 3

Page 58: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5

Page 59: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo, Thu, Fr, Su -> ? Drive

Page 60: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction: divide and conquer algorithm

• Pick a feature
• Split your examples into subsets based on the values of the feature
• For each subset, examine the examples:
  – Zero examples: assign the most popular class of the parent
  – All from the same class: assign this class
  – Otherwise, process recursively
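A compact Python sketch of this divide-and-conquer procedure on the walk/drive example from the previous slides; the order in which features are tried is left to the caller (information gain, discussed below, is the usual heuristic), and all names are mine.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, labels, features):
    """examples: list of dicts feature -> value; labels: parallel class list;
    features: feature names still available for splitting, in the order to try them."""
    if len(set(labels)) == 1:          # all from the same class: assign this class
        return labels[0]
    if not features:                   # nothing left to split on: majority class
        return majority(labels)
    f, rest = features[0], features[1:]          # pick a feature
    tree = {}
    for value in set(ex[f] for ex in examples):  # split on the values of the feature
        sub = [(ex, y) for ex, y in zip(examples, labels) if ex[f] == value]
        tree[value] = build_tree([e for e, _ in sub], [y for _, y in sub], rest)
    tree["<other>"] = majority(labels)           # unseen value: most popular class of the parent
    return (f, tree)

data = [{"day": "Sat", "t": "hot",  "keys": "yes"},
        {"day": "Mon", "t": "cold", "keys": "no"},
        {"day": "Tue", "t": "hot",  "keys": "no"},
        {"day": "Tue", "t": "cold", "keys": "no"},
        {"day": "Wed", "t": "hot",  "keys": "yes"}]
labels = ["Drive", "Walk", "Walk", "Walk", "Drive"]
print(build_tree(data, labels, ["t", "day"]))
# root splits on t; the hot branch splits further on day (Sat/Wed -> Drive, Tue -> Walk)
```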

Page 61: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-Down DT induction

Different trees can be built for the same data, depending on the order of features:

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo, Thu, Fr, Su -> ? Drive

Page 62: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

Different trees can be built for the same data, depending on the order of features:

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo  -> clothing?
                     casual    -> Drive: ?
                     halloween -> walk: ?

Page 63: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Selecting features

• Intuitively
  – We want more "informative" features to be higher in the tree:
    • Is it Monday? Is it raining? Good political news? No halloween clothes? Hat on? Coat on? Car keys? Yes?? -> Driving! (this doesn't look like a good learning job)
  – We want a nice compact tree.

Page 64: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Selecting features-2

• Formally
  – Define "tree size" (number of nodes, leaves, depth, …)
  – Try all the possible trees, find the smallest one
  – NP-hard
• Top-down DT induction – greedy search, depends on heuristics for feature ordering (=> no guarantee)
  – Information gain

Page 65: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Entropy

Information theory: entropy – the number of bits needed to encode some information.

S – a set of N examples: p·N positive ("Walk") and q·N negative ("Drive")

Entropy(S) = -p·lg(p) - q·lg(q)

p = 1, q = 0     => Entropy(S) = 0
p = 1/2, q = 1/2 => Entropy(S) = 1

Page 66: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Entropy and Decision Trees

keys?
  no  -> Walk: 2, 3, 4
  yes -> Drive: 1, 5

E(S) = -0.6·lg(0.6) - 0.4·lg(0.4) = 0.97
E(S_no) = 0,  E(S_keys) = 0

Page 67: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Entropy and Decision Trees

t?
  cold -> Walk: 2, 4
  hot  -> Drive: 1, 5; Walk: 3

E(S) = -0.6·lg(0.6) - 0.4·lg(0.4) = 0.97
E(S_cold) = 0
E(S_hot) = -0.33·lg(0.33) - 0.66·lg(0.66) = 0.92

Page 68: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Information gain

• For each feature f, compute the reduction in entropy on the split:

Gain(S, f) = E(S) - ∑ E(S_i)·|S_i| / |S|

f = keys?     : Gain(S, f) = 0.97
f = t?        : Gain(S, f) = 0.97 - 0·(2/5) - 0.92·(3/5) = 0.42
f = clothing? : Gain(S, f) = ?
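A short Python sketch that reproduces these numbers on the walk/drive example; the helper names are mine.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, feature_values):
    """Gain(S, f) = E(S) - sum over values v of |S_v|/|S| * E(S_v)."""
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for y, fv in zip(labels, feature_values) if fv == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = ["Drive", "Walk", "Walk", "Walk", "Drive"]   # examples 1-5
keys   = ["yes", "no", "no", "no", "yes"]
temp   = ["hot", "cold", "hot", "cold", "hot"]
print(information_gain(labels, keys))   # ~0.97, a perfect split
print(information_gain(labels, temp))   # ~0.42
```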

Page 69: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

TEXT CATEGORIZATION WITH DT

• Build a separate decision tree for each category

• Use WORD COUNTS as features

Page 70: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Reuters Data Set (21578 - ModApte split)

• 9603 training, 3299 test articles; ave. 200 words
• 118 categories
  – An article can be in more than one category
  – Learn 118 binary category distinctions
• Common categories (#train, #test):
  – Earn (2877, 1087)
  – Acquisitions (1650, 179)
  – Money-fx (538, 179)
  – Grain (433, 149)
  – Crude (389, 189)
  – Trade (369, 119)
  – Interest (347, 131)
  – Ship (197, 89)
  – Wheat (212, 71)
  – Corn (182, 56)

Page 71: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Foundations of Statistical Natural Language Processing, Manning and Schuetze

AN EXAMPLE OF REUTERS TEXT

Page 72: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Foundations of Statistical Natural Language Processing, Manning and Schuetze

Decision Tree for Reuter classification

Page 73: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Divide-and-conquer with Information Gain

• Batch learning (reads all the input data, computes information gain based on all the examples simultaneously)
• Greedy search, may find local optima
• Outputs a single solution
• Optimizes depth

Page 74: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Complexity

• Worst case: build a complete tree
  – Compute gains on all nodes: at level i, we have already examined i features; m-i remaining
• In practice: the tree is rarely complete; linear in the number of features and the number of examples (== very fast)

Page 75: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting

• Suppose we build a very complex tree… Is it good?
• Last lecture: we measure the quality ("goodness") of the prediction, not the performance on the training data
• Why complex trees can yield mistakes:
  – Noise in the data
  – Even without noise, decisions at the last levels are based on too few observations

Page 76: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting

Mo: Walk (50 observations), Drive (5)
Tue: Walk (40), Drive (3)
We: Drive (1)
Thu: Walk (42), Drive (14)
Fri: Walk (50)
Sa: Drive (20), Walk (20)
Su: Drive (10)

• Can we conclude that "We -> Drive"?

Page 77: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting

• A hypothesis H is said to overfit the training data if there exists another hypothesis H' such that:
  – Error(H, train data) <= Error(H', train data)
  – Error(H, unseen data) > Error(H', unseen data)
• Overfitting is related to hypothesis complexity: a more complex hypothesis (e.g., a larger decision tree) overfits more

Page 78: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting Prevention for DT: Pruning

• "Prune" a complex tree: produce a smaller tree that is less accurate on the training data
  – Original tree: … Mo: hot -> drive (2), cold -> walk (100)
  – Pruned tree: … Mo -> walk (100/2)
• Post-/pre-pruning

Page 79: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Pruning criteria

• Cross-validation
  – Reserve some training data to evaluate the utility of the subtrees
• Statistical tests: use a test to determine whether the observations at a given level could be random
• MDL (minimum description length): compare the added complexity against memorizing exceptions

Page 80: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

DT: issues

• Splitting criteria
  – Information gain favours splits on features with many values
• Non-discrete features
• Non-discrete outputs ("regression trees")
• Costs
• Missing values
• Incremental learning
• Memory issues

Page 81: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

REFERENCES

• Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
  – Chapter 6, Bayesian Learning: 6.1, 6.2, 6.9, 6.10
  – Chapter 3, Decision Trees: whole chapter
• Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, 2002.

Page 82: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

ACKNOWLEDGMENTS

• Many slides borrowed from Chris Manning, Stanford