807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Page 1: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

807 - TEXT ANALYTICS

Massimo Poesio

Lecture 3: Text Classification

Page 2: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

TEXT CLASSIFICATION

From: "" <[email protected]>
Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down

Stop paying rent TODAY !

There is no need to spend hundreds or even thousands for similar courses

I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.

Change your life NOW !

=================================================
Click Below to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================

Page 3: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

TAGS FOR TEXT MANAGEMENT

Page 4: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

HOW DOCUMENT TAGS ARE PRODUCED

• Traditionally: by HAND by EXPERTS (curators)
  – E.g., Yahoo!, Looksmart, about.com, ODP, Medline
• Also: Bridgeman, UK Data Archive
  – Very accurate when the job is done by experts
  – Consistent when the problem size and team is small
  – Difficult and expensive to scale
• Now:
  – AUTOMATICALLY
  – By CROWDS

Page 5: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

CLASSIFICATION AND CATEGORIZATION

• Given:
  – A description of an instance, x ∈ X, where X is the instance language or instance space.
    • Issue: how to represent text documents.
  – A fixed set of categories: C = {c1, c2, …, cn}
• Determine:
  – The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
  – We want to know how to build categorization functions ("classifiers").

Page 6: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

AUTOMATIC TEXT CLASSIFICATION

• Supervised
  – Training set
  – Test set
• Clustering: next week

Page 7: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Text Categorization: attributes

• Representations of text are very high dimensional (one feature for each word).
• High-bias algorithms that prevent overfitting in high-dimensional space are best.
• For most text categorization tasks, there are many irrelevant and many relevant features.
• Methods that combine evidence from many or all features (e.g. naive Bayes, kNN, neural nets) tend to work better than ones that try to isolate just a few relevant features (standard decision-tree or rule induction)*

*Although one can compensate by using many rules

Page 8: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Bayesian Methods

• Learning and classification methods based on probability theory.

• Bayes theorem plays a critical role in probabilistic learning and classification.

• Build a generative model that approximates how data is produced

• Uses prior probability of each category given no information about an item.

• Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Page 9: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Bayes’ Rule

$P(C, X) = P(C \mid X)\,P(X) = P(X \mid C)\,P(C)$

$P(C \mid X) = \dfrac{P(X \mid C)\,P(C)}{P(X)}$
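As a quick illustration (not from the original slides), here is a minimal Python sketch that applies Bayes' rule to made-up spam-filtering numbers; the prior and likelihood values are purely hypothetical.

```python
# Bayes' rule: P(C|X) = P(X|C) * P(C) / P(X), where the evidence
# P(X) = sum over classes C' of P(X|C') * P(C').
# All numbers below are invented for illustration.

prior = {"spam": 0.3, "ham": 0.7}              # P(C)
likelihood = {"spam": 0.02, "ham": 0.0005}     # P(X|C) for one observed message X

evidence = sum(likelihood[c] * prior[c] for c in prior)               # P(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}   # P(C|X)

print(posterior)   # {'spam': 0.944..., 'ham': 0.055...}
```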

Page 10: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Maximum a posteriori Hypothesis

$h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \dfrac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$

Page 11: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Maximum likelihood Hypothesis

If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

$h_{ML} = \arg\max_{h \in H} P(D \mid h)$

Page 12: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Naive Bayes Classifiers

Task: classify a new instance based on a tuple of attribute values $\langle x_1, x_2, \ldots, x_n \rangle$:

$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)$

$c_{MAP} = \arg\max_{c_j \in C} \dfrac{P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)}{P(x_1, x_2, \ldots, x_n)}$

$c_{MAP} = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\,P(c_j)$

Page 13: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Naïve Bayes Classifier: Assumptions

• P(cj)
  – Can be estimated from the frequency of classes in the training examples.
• P(x1, x2, …, xn | cj) – O(|X|^n · |C|) parameters
  – Could only be estimated if a very, very large number of training examples was available.
• Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities.

Page 14: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

The Naïve Bayes Classifier

[Figure: naive Bayes network with class node Flu and feature nodes X1–X5 (fever, sinus, cough, runny nose, muscle-ache)]

• Conditional Independence Assumption: features are independent of each other given the class:

$P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_5 \mid C)$

Page 15: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Learning the Model

• Common practice: maximum likelihood
  – Simply use the frequencies in the data

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$

$\hat{P}(c_j) = \dfrac{N(C = c_j)}{N}$

[Figure: naive Bayes network with class node C and feature nodes X1 … X6]

Page 16: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Problem with Max Likelihood

• What if we have seen no training cases where the patient had no flu and muscle aches?

$\hat{P}(X_5 = t \mid C = \mathrm{nf}) = \dfrac{N(X_5 = t, C = \mathrm{nf})}{N(C = \mathrm{nf})} = 0$

• Zero probabilities cannot be conditioned away, no matter the other evidence!

$\arg\max_c \hat{P}(c) \prod_i \hat{P}(x_i \mid c)$

[Figure: the Flu naive Bayes network from the previous slide, with $P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C)\,P(X_2 \mid C) \cdots P(X_5 \mid C)$]

Page 17: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Smoothing to Avoid Overfitting

• Laplace (add-one) smoothing:

$\hat{P}(x_i \mid c_j) = \dfrac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$

where k is the number of values of $X_i$.

• Somewhat more subtle version:

$\hat{P}(x_{i,k} \mid c_j) = \dfrac{N(X_i = x_{i,k}, C = c_j) + m\,p_{i,k}}{N(C = c_j) + m}$

where $p_{i,k}$ is the overall fraction in the data where $X_i = x_{i,k}$, and m controls the extent of "smoothing".
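A minimal Python sketch of these two estimators, assuming the raw counts are already available; the function names and the example numbers are mine, not from the slides.

```python
def laplace_estimate(n_xi_cj, n_cj, k):
    """Add-one smoothed estimate of P(xi | cj).
    n_xi_cj: count of examples with X_i = x_i and C = c_j
    n_cj:    count of examples with C = c_j
    k:       number of distinct values X_i can take
    """
    return (n_xi_cj + 1) / (n_cj + k)


def m_estimate(n_xi_cj, n_cj, p_ik, m):
    """m-estimate of P(xi | cj): interpolates the raw frequency with a
    prior p_ik (the overall fraction of X_i = x_i,k in the data);
    m controls the extent of the smoothing."""
    return (n_xi_cj + m * p_ik) / (n_cj + m)


# Example: a word never seen in the 100 documents of a class,
# with 50,000 possible words and an overall frequency of 0.001.
print(laplace_estimate(0, 100, 50_000))     # ~2.0e-05 instead of 0
print(m_estimate(0, 100, 0.001, m=1.0))     # ~9.9e-06 instead of 0
```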

Page 18: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Using Naive Bayes Classifiers to Classify Text: Basic method

• Attributes are text positions, values are words.

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_i P(x_i \mid c_j) = \arg\max_{c_j \in C} P(c_j)\,P(x_1 = \text{"our"} \mid c_j) \cdots P(x_n = \text{"text"} \mid c_j)$

• The Naive Bayes assumption is clearly violated. Example?
• Still too many possibilities
• Assume that classification is independent of the positions of the words
  – Use the same parameters for each position

Page 19: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Text Classification Algorithms: Learning

• From the training corpus, extract Vocabulary
• Calculate the required P(cj) and P(xk | cj) terms
  – For each cj in C do:
    • docsj ← subset of documents for which the target class is cj
    • P(cj) ← |docsj| / |total # documents|
    • Textj ← single document containing all docsj
    • For each word xk in Vocabulary:
      – nk ← number of occurrences of xk in Textj
      – P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word positions in Textj
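A compact Python sketch of this learning procedure (multinomial Naive Bayes with add-one smoothing), written from the pseudocode above; the function and variable names are mine, not from the lecture.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: parallel list of class names.
    Returns log P(c_j) and log P(x_k | c_j) tables plus the vocabulary."""
    vocab = {w for doc in docs for w in doc}
    docs_per_class = Counter(labels)
    word_counts = defaultdict(Counter)        # counts of x_k in Text_j, per class
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior, log_cond = {}, defaultdict(dict)
    for c in docs_per_class:
        log_prior[c] = math.log(docs_per_class[c] / len(docs))   # P(c_j)
        n = sum(word_counts[c].values())                         # word positions in Text_j
        for w in vocab:                                          # (n_k + 1) / (n + |Vocabulary|)
            log_cond[c][w] = math.log((word_counts[c][w] + 1) / (n + len(vocab)))
    return log_prior, log_cond, vocab
```

Classification (next slide) then just sums log_prior[c] and log_cond[c][w] over the word positions of the new document and returns the class with the highest total.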

Page 20: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Text Classification Algorithms: Classifying

• positions ← all word positions in the current document which contain tokens found in Vocabulary
• Return cNB, where

$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)$

Page 21: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Underflow Prevention

• Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.

• Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

• Class with highest final un-normalized log probability score is still the most probable.
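A small, self-contained Python illustration (not from the slides) of why this matters: multiplying a few thousand per-word probabilities underflows to 0.0, while summing their logs stays perfectly usable for comparing classes.

```python
import math

# 2,000 word positions, each with probability 1e-4 given the class.
probs = [1e-4] * 2000

product = 1.0
for p in probs:
    product *= p
print(product)       # 0.0 -- floating-point underflow

log_score = sum(math.log(p) for p in probs)
print(log_score)     # about -18420.7, still fine for argmax across classes
```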

Page 22: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Naïve Bayes Posterior Probabilities

• Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.

• However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not.
  – Output probabilities are generally very close to 0 or 1.

Page 23: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection

• Text collections have a large number of features
  – 10,000 – 1,000,000 unique words – and more
• Make using a particular classifier feasible
  – Some classifiers can't deal with 100,000s of features
• Reduce training time
  – Training time for some methods is quadratic or worse in the number of features (e.g., logistic regression)
• Improve generalization
  – Eliminate noise features
  – Avoid overfitting

Page 24: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Recap: Feature Reduction

• Standard ways of reducing the feature space for text
  – Stemming
    • Laugh, laughs, laughing, laughed -> laugh
  – Stop word removal
    • E.g., eliminate all prepositions
  – Conversion to lower case
  – Tokenization
    • Break on all special characters: fire-fighter -> fire, fighter
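A rough Python sketch of this preprocessing pipeline; the tiny stop-word list and the crude suffix stripper are placeholders for illustration only (a real system would use a proper stemmer such as Porter's and a full stop-word list).

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "at", "to", "for"}   # toy list

def crude_stem(word):
    # Very rough suffix stripping, for illustration only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = text.lower()                                        # conversion to lower case
    tokens = re.split(r"[^a-z0-9]+", text)                     # break on all special characters
    tokens = [t for t in tokens if t and t not in STOP_WORDS]  # stop word removal
    return [crude_stem(t) for t in tokens]                     # stemming

print(preprocess("The fire-fighter laughed at the fires"))
# ['fire', 'fighter', 'laugh', 'fire']
```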

Page 25: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection

• Yang and Pedersen 1997
• Comparison of different selection criteria
  – DF – document frequency
  – IG – information gain
  – MI – mutual information
  – CHI – chi-square
• Common strategy
  – Compute the statistic for each term
  – Keep the n terms with the highest value of this statistic

Page 26: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature selection via Mutual Information

• We might not want to use all words, but just reliable, good discriminators
• In the training set, choose the k words which best discriminate the categories.
• One way is in terms of mutual information:
  – For each word w and each category c:

$I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \dfrac{p(e_w, e_c)}{p(e_w)\,p(e_c)}$
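A small Python sketch of this computation from document counts; n11, n10, n01, n00 are my names for the four cells of the word/category contingency table, and cells with zero count are skipped since 0 · log 0 is taken to be 0.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(w, c) over document counts:
    n11: docs containing w and in c       n10: docs containing w, not in c
    n01: docs in c without w              n00: docs with neither
    """
    n = n11 + n10 + n01 + n00
    mi = 0.0
    for joint, marg_w, marg_c in [
        (n11, n11 + n10, n11 + n01),   # e_w = 1, e_c = 1
        (n10, n11 + n10, n10 + n00),   # e_w = 1, e_c = 0
        (n01, n01 + n00, n11 + n01),   # e_w = 0, e_c = 1
        (n00, n01 + n00, n10 + n00),   # e_w = 0, e_c = 0
    ]:
        if joint == 0:
            continue
        p = joint / n
        mi += p * math.log(p / ((marg_w / n) * (marg_c / n)))
    return mi

# A term that occurs almost only in documents of the category scores high:
print(mutual_information(n11=49, n10=1, n01=1, n00=49))
```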

Page 27: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

(Pointwise) Mutual Information

Page 28: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature selection via MI (contd.)

• For each category we build a list of the k most discriminating terms.
• For example (on 20 Newsgroups):
  – sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
  – rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
• Greedy: does not account for correlations between terms
• In general feature selection is necessary for binomial NB, but not for multinomial NB

Page 29: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Information Gain

Page 30: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Chi-Square

                                        Term present   Term absent
  Document belongs to category               A              B
  Document does not belong to category       C              D

$\chi^2 = \dfrac{N\,(AD - BC)^2}{(A+B)(A+C)(B+D)(C+D)}$

• Use either the maximum or the average $\chi^2$
• Value for complete independence?
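A direct Python transcription of this formula; the function name and the toy counts are mine.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for a term/category contingency table:
    a: term present, doc in category       b: term absent, doc in category
    c: term present, doc not in category   d: term absent, doc not in category
    """
    n = a + b + c + d
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

print(chi_square(40, 10, 10, 40))   # strongly associated term -> 36.0
print(chi_square(25, 25, 25, 25))   # complete independence -> 0.0
```

This also answers the question above: under complete independence AD = BC, so the statistic is 0.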

Page 31: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Document Frequency

• Number of documents a term occurs in
• Is sometimes used for eliminating both very frequent and very infrequent terms
• How is the document frequency measure different from the other 3 measures?

Page 32: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Yang&Pedersen: Experiments

• Two classification methods
  – kNN (k nearest neighbors; more later)
  – Linear Least Squares Fit
    • Regression method
• Collections
  – Reuters-22173
    • 92 categories
    • 16,000 unique terms
  – Ohsumed: subset of MEDLINE
    • 14,000 categories
    • 72,000 unique terms
• ltc term weighting

Page 33: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Yang&Pedersen: Experiments

• Choose feature set size
• Preprocess collection, discarding non-selected features / words
• Apply term weighting -> feature vector for each document
• Train classifier on training set
• Evaluate classifier on test set

Page 34: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification
Page 35: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Discussion

• You can eliminate 90% of features for IG, DF, and CHI without decreasing performance.
• In fact, performance increases with fewer features for IG, DF, and CHI.
• Mutual information is very sensitive to small counts.
• IG does best with the smallest number of features.
• Document frequency is close to optimal, and is by far the simplest feature selection method.
• Similar results for LLSF (regression).

Page 36: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Results

Why is selecting common terms a good strategy?

Page 37: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

IG, DF, CHI Are Correlated.

Page 38: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Information Gain vs Mutual Information

• Information gain is similar to MI for random variables
• Independence?
• In contrast, pointwise MI ignores non-occurrence of terms
  – E.g., for complete dependence, you get P(AB) / (P(A)P(B)) = 1/P(A), which is larger for rare terms than for frequent terms
• Yang & Pedersen: pointwise MI favors rare terms

Page 39: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection:Other Considerations

• Generic vs class-specific
  – Completely generic (class-independent)
  – Separate feature set for each class
  – Mixed (à la Yang & Pedersen)
• Maintainability over time
  – Is aggressive feature selection good or bad for robustness over time?
• Ideal: optimal features selected as part of training

Page 40: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Yang&Pedersen: Limitations

• Don't look at class-specific feature selection
• Don't look at methods that can't handle high-dimensional spaces
• Evaluate category ranking (as opposed to classification accuracy)

Page 41: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Feature Selection: Other Methods

• Stepwise term selection
  – Forward
  – Backward
  – Expensive: need to do n^2 iterations of training
• Term clustering
• Dimension reduction: PCA / SVD

Page 42: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

FEATURES IN SPAM ASSASSIN

• SpamAssassin is a spam-filtering program written in Perl; it grew out of a filter begun by Mark Jeftovic in 1997, was entirely rewritten by Justin Mason, and was put on SourceForge in 2001; it is now maintained by the Apache Software Foundation

• It uses Naïve Bayes and a variety of other methods

• It uses features coming both from content and from other properties of the email message

Page 43: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

SpamAssassin Features

100   From: address is in the user's black-list
4.0   Sender is on www.habeas.com Habeas Infringer List
3.994 Invalid Date: header (timezone does not exist)
3.970 Written in an undesired language
3.910 Listed in Razor2, see http://razor.sf.net/
3.801 Subject is full of 8-bit characters
3.472 Claims compliance with Senate Bill 1618
3.437 exists:X-Precedence-Ref
3.371 Reverses Aging
3.350 Claims you can be removed from the list
3.284 'Hidden' assets
3.283 Claims to honor removal requests
3.261 Contains "Stop Snoring"
3.251 Received: contains a name with a faked IP-address
3.250 Received via a relay in list.dsbl.org
3.200 Character set indicates a foreign language

Page 44: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

SpamAssassin Features (continued)

3.198 Forged eudoramail.com 'Received:' header found
3.193 Free Investment
3.180 Received via SBLed relay, see http://www.spamhaus.org/sbl/
3.140 Character set doesn't exist
3.123 Dig up Dirt on Friends
3.090 No MX records for the From: domain
3.072 X-Mailer contains malformed Outlook Express version
3.044 Stock Disclaimer Statement
3.009 Apparently, NOT Multi Level Marketing
3.005 Bulk email software fingerprint (jpfree) found in headers
2.991 exists:Complain-To
2.975 Bulk email software fingerprint (VC_IPA) found in headers
2.968 Invalid Date: year begins with zero
2.932 Mentions Spam law "H.R. 3113"
2.900 Received forged, contains fake AOL relays
2.879 Asks for credit card details

Page 45: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Measuring Performance

• Classification accuracy: What % of messages were classified correctly?

• Is this what we care about?

             Overall accuracy   Accuracy on spam   Accuracy on gen
  System 1        95%               99.99%             90%
  System 2        95%               90%                99.99%

Which system do you prefer?

Page 46: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Measuring Performance

• Precision = (good messages kept) / (all messages kept)
• Recall = (good messages kept) / (all good messages)

• Trade off precision vs. recall by setting a threshold
• Measure the curve on annotated dev data (or test data)
• Choose a threshold where the user is comfortable

[Figure: precision vs. recall curve for good (non-spam) email; x-axis: recall (0–100%), y-axis: precision (0–100%)]

Page 47: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Measuring Performance

[Figure: the same precision vs. recall curve for good (non-spam) email, annotated with:
  – low threshold: keep all the good stuff, but a lot of the bad too
  – high threshold: all we keep is good, but we don't keep much
  – OK for spam filtering and legal search
  – OK for search engines (maybe)
  – would prefer to be here!
  – point where precision = recall (sometimes reported)]

F-measure = 1 / (average(1/precision, 1/recall))
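A minimal Python sketch of these measures computed over kept messages; the toy message ids are invented for the example.

```python
def precision_recall_f1(kept, good):
    """kept: set of message ids the filter let through; good: set of truly good ids."""
    true_pos = len(kept & good)
    precision = true_pos / len(kept) if kept else 0.0     # good kept / all kept
    recall = true_pos / len(good) if good else 0.0        # good kept / all good
    f1 = (2 * precision * recall / (precision + recall)   # harmonic mean = F-measure above
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy example: ids 0-7 are good mail; the filter keeps 0-5 plus one spam message (8).
kept = {0, 1, 2, 3, 4, 5, 8}
good = {0, 1, 2, 3, 4, 5, 6, 7}
print(precision_recall_f1(kept, good))   # (0.857..., 0.75, 0.8)
```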

Page 48: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Micro- vs. Macro-Averaging

• If we have more than one class, how do we combine multiple performance measures into one quantity?

• Macroaveraging: Compute performance for each class, then average.

• Microaveraging: Collect decisions for all classes, compute contingency table, evaluate.

Page 49: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Micro- vs. Macro-Averaging: Example

Class 1:
                    Truth: yes   Truth: no
  Classifier: yes       10          10
  Classifier: no        10         970

Class 2:
                    Truth: yes   Truth: no
  Classifier: yes       90          10
  Classifier: no        10         890

Micro-averaged (pooled) table:
                    Truth: yes   Truth: no
  Classifier: yes      100          20
  Classifier: no        20        1860

Macroaveraged precision: (0.5 + 0.9) / 2 = 0.7
Microaveraged precision: 100 / 120 ≈ 0.83
Why this difference?
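A short Python sketch reproducing the two averages from per-class (TP, FP) counts; the counts are the ones in the tables above, the function names are mine.

```python
def macro_precision(class_counts):
    """class_counts: list of (true_positives, false_positives) per class."""
    return sum(tp / (tp + fp) for tp, fp in class_counts) / len(class_counts)

def micro_precision(class_counts):
    """Pool the contingency tables first, then compute a single precision."""
    total_tp = sum(tp for tp, _ in class_counts)
    total_fp = sum(fp for _, fp in class_counts)
    return total_tp / (total_tp + total_fp)

counts = [(10, 10), (90, 10)]       # (TP, FP) for Class 1 and Class 2 above
print(macro_precision(counts))      # 0.7
print(micro_precision(counts))      # 0.833...
```

Microaveraging weights every single decision equally, so large classes dominate; macroaveraging weights every class equally, which is why the two numbers differ.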

Page 50: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Confusion matrix

• Function of classifier, topics and test docs.
• For a perfect classifier, all off-diagonal entries should be zero.
• For a perfect classifier, if there are n docs in category j then entry (j, j) should be n.
• Straightforward when there is 1 category per document.
• Can be extended to n categories per document.

Page 51: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Decision Trees

• Tree with internal nodes labeled by terms
• Branches are labeled by tests on the weight that the term has
• Leaves are labeled by categories
• The classifier categorizes a document by descending the tree, following the tests, to a leaf
• The label of the leaf node is then assigned to the document
• Most decision trees are binary trees (never disadvantageous; may require extra internal nodes)

Page 52: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Decision Tree Example

Page 53: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Decision Tree Learning

• Learn a sequence of tests on features, typically using top-down, greedy search
  – At each stage choose the unused feature with the highest Information Gain (as in Lecture 5)
• Binary (yes/no) or continuous decisions

[Figure: a small tree that tests f1 and then f7, with leaf probabilities P(class) = .6, .9, .2]

Page 54: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

• Partition training examples into good “splits”, based on values of a single “good” feature:

(1) Sat, hot, no, casual, keys -> +
(2) Mon, cold, snow, casual, no-keys -> -
(3) Tue, hot, no, casual, no-keys -> -
(4) Tue, cold, rain, casual, no-keys -> -
(5) Wed, hot, rain, casual, keys -> +

Page 55: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

keys?
  yes -> Drive: 1, 5
  no  -> Walk: 2, 3, 4

Page 56: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

• Partition training examples into good “splits”, based on values of a single “good” feature

(1) Sat, hot, no, casual -> +
(2) Mon, cold, snow, casual -> -
(3) Tue, hot, no, casual -> -
(4) Tue, cold, rain, casual -> -
(5) Wed, hot, rain, casual -> +

• No acceptable classification: proceed recursively

Page 57: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

t?
  cold -> Walk: 2, 4
  hot  -> Drive: 1, 5; Walk: 3

Page 58: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5

Page 59: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo, Thu, Fr, Su -> ? Drive

Page 60: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction: divide and conquer algorithm

• Pick a feature
• Split your examples into subsets based on the values of the feature
• For each subset, examine the examples:
  – Zero examples: assign the most popular class of the parent
  – All from the same class: assign this class
  – Otherwise, process recursively
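A compact Python sketch of this divide-and-conquer procedure on the walk/drive example from the previous slides; the order in which features are tried is left to the caller (information gain, discussed below, is the usual heuristic), and all names are mine.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def build_tree(examples, labels, features):
    """examples: list of dicts feature -> value; labels: parallel class list;
    features: feature names still available for splitting, in the order to try them."""
    if len(set(labels)) == 1:          # all from the same class: assign this class
        return labels[0]
    if not features:                   # nothing left to split on: majority class
        return majority(labels)
    f, rest = features[0], features[1:]          # pick a feature
    tree = {}
    for value in set(ex[f] for ex in examples):  # split on the values of the feature
        sub = [(ex, y) for ex, y in zip(examples, labels) if ex[f] == value]
        tree[value] = build_tree([e for e, _ in sub], [y for _, y in sub], rest)
    tree["<other>"] = majority(labels)           # unseen value: most popular class of the parent
    return (f, tree)

data = [{"day": "Sat", "t": "hot",  "keys": "yes"},
        {"day": "Mon", "t": "cold", "keys": "no"},
        {"day": "Tue", "t": "hot",  "keys": "no"},
        {"day": "Tue", "t": "cold", "keys": "no"},
        {"day": "Wed", "t": "hot",  "keys": "yes"}]
labels = ["Drive", "Walk", "Walk", "Walk", "Drive"]
print(build_tree(data, labels, ["t", "day"]))
# root splits on t; the hot branch splits further on day (Sat/Wed -> Drive, Tue -> Walk)
```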

Page 61: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-Down DT induction

Different trees can be built for the same data, depending on the order of features:

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo, Thu, Fr, Su -> ? Drive

Page 62: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Top-down DT induction

Different trees can be built for the same data, depending on the order of features:

t?
  cold -> Walk: 2, 4
  hot  -> day?
            Sat -> Drive: 1
            Tue -> Walk: 3
            Wed -> Drive: 5
            Mo  -> clothing?
                     casual    -> Drive: ?
                     halloween -> walk: ?

Page 63: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Selecting features

• Intuitively
  – We want more "informative" features to be higher in the tree:
    • Is it Monday? Is it raining? Good political news? No halloween clothes? Hat on? Coat on? Car keys? Yes?? -> Driving! (this doesn't look like a good learning job)
  – We want a nice compact tree.

Page 64: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Selecting features-2

• Formally
  – Define "tree size" (number of nodes, leaves, depth, …)
  – Try all the possible trees, find the smallest one
  – NP-hard
• Top-down DT induction – greedy search, depends on heuristics for feature ordering (=> no guarantee)
  – Information gain

Page 65: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Entropy

Information theory: entropy – the number of bits needed to encode some information.

S – a set of N examples: p·N positive ("Walk") and q·N negative ("Drive")

Entropy(S) = -p·lg(p) - q·lg(q)

p = 1, q = 0     => Entropy(S) = 0
p = 1/2, q = 1/2 => Entropy(S) = 1

Page 66: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Entropy and Decision Trees

keys?
  no  -> Walk: 2, 3, 4
  yes -> Drive: 1, 5

E(S) = -0.6·lg(0.6) - 0.4·lg(0.4) = 0.97
E(S_no) = 0,  E(S_keys) = 0

Page 67: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Entropy and Decision Trees

t?
  cold -> Walk: 2, 4
  hot  -> Drive: 1, 5; Walk: 3

E(S) = -0.6·lg(0.6) - 0.4·lg(0.4) = 0.97
E(S_cold) = 0
E(S_hot) = -0.33·lg(0.33) - 0.66·lg(0.66) = 0.92

Page 68: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Information gain

• For each feature f, compute the reduction in entropy on the split:

Gain(S, f) = E(S) - ∑ E(S_i)·|S_i| / |S|

f = keys?     : Gain(S, f) = 0.97
f = t?        : Gain(S, f) = 0.97 - 0·(2/5) - 0.92·(3/5) = 0.42
f = clothing? : Gain(S, f) = ?
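A short Python sketch that reproduces these numbers on the walk/drive example; the helper names are mine.

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(labels, feature_values):
    """Gain(S, f) = E(S) - sum over values v of |S_v|/|S| * E(S_v)."""
    gain = entropy(labels)
    for v in set(feature_values):
        subset = [y for y, fv in zip(labels, feature_values) if fv == v]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

labels = ["Drive", "Walk", "Walk", "Walk", "Drive"]   # examples 1-5
keys   = ["yes", "no", "no", "no", "yes"]
temp   = ["hot", "cold", "hot", "cold", "hot"]
print(information_gain(labels, keys))   # ~0.97, a perfect split
print(information_gain(labels, temp))   # ~0.42
```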

Page 69: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

TEXT CATEGORIZATION WITH DT

• Build a separate decision tree for each category

• Use WORD COUNTS as features

Page 70: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification


Reuters Data Set (21578 - ModApte split)

• 9603 training, 3299 test articles; ave. 200 words
• 118 categories
  – An article can be in more than one category
  – Learn 118 binary category distinctions
• Common categories (#train, #test):
  – Earn (2877, 1087)
  – Acquisitions (1650, 179)
  – Money-fx (538, 179)
  – Grain (433, 149)
  – Crude (389, 189)
  – Trade (369, 119)
  – Interest (347, 131)
  – Ship (197, 89)
  – Wheat (212, 71)
  – Corn (182, 56)

Page 71: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Foundations of Statistical Natural Language Processing, Manning and Schuetze

AN EXAMPLE OF REUTERS TEXT

Page 72: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Foundations of Statistical Natural Language Processing, Manning and Schuetze

Decision Tree for Reuter classification

Page 73: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Divide-and-conquer with Information Gain

• Batch learning (reads all the input data, computes information gain based on all the examples simultaneously)
• Greedy search, may find local optima
• Outputs a single solution
• Optimizes depth

Page 74: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Complexity

• Worst case: build a complete tree
  – Compute gains on all nodes: at level i, we have already examined i features; m-i remaining
• In practice: the tree is rarely complete; linear in the number of features and the number of examples (== very fast)

Page 75: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting

• Suppose we build a very complex tree… Is it good?
• Last lecture: we measure the quality ("goodness") of the prediction, not the performance on the training data
• Why complex trees can yield mistakes:
  – Noise in the data
  – Even without noise, decisions at the last levels are based on too few observations

Page 76: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting

Mo: Walk (50 observations), Drive (5)
Tue: Walk (40), Drive (3)
We: Drive (1)
Thu: Walk (42), Drive (14)
Fri: Walk (50)
Sa: Drive (20), Walk (20)
Su: Drive (10)

• Can we conclude that "We -> Drive"?

Page 77: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting

• A hypothesis H is said to overfit the training data if there exists another hypothesis H' such that:
  – Error(H, train data) <= Error(H', train data)
  – Error(H, unseen data) > Error(H', unseen data)
• Overfitting is related to hypothesis complexity: a more complex hypothesis (e.g., a larger decision tree) overfits more

Page 78: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Overfitting Prevention for DT: Pruning

• "Prune" a complex tree: produce a smaller tree that is less accurate on the training data
  – Original tree: … Mo: hot -> drive (2), cold -> walk (100)
  – Pruned tree: … Mo -> walk (100/2)
• Post-/pre-pruning

Page 79: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

Pruning criteria

• Cross-validation
  – Reserve some training data to evaluate the utility of the subtrees
• Statistical tests: use a test to determine whether the observations at a given level could be random
• MDL (minimum description length): compare the added complexity against memorizing exceptions

Page 80: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

DT: issues

• Splitting criteria
  – Information gain favours splits on features with many values
• Non-discrete features
• Non-discrete outputs ("regression trees")
• Costs
• Missing values
• Incremental learning
• Memory issues

Page 81: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

REFERENCES

• Tom Mitchell, Machine Learning. McGraw-Hill, 1997.
  – Chapter 6, Bayesian Learning: 6.1, 6.2, 6.9, 6.10
  – Chapter 3, Decision Trees: whole chapter
• Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1–47, 2002.

Page 82: 807 - TEXT ANALYTICS Massimo Poesio Lecture 3: Text Classification

ACKNOWLEDGMENTS

• Many slides borrowed from Chris Manning, Stanford