Text Classification with Limited Labeled Data
Andrew McCallum  [email protected]
Just Research (formerly JPRC)
Center for Automated Learning and Discovery, Carnegie Mellon University
Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore, and Jason Rennie
The Task: Document Classification (also "Document Categorization", "Routing", or "Tagging")
Automatically placing documents in their correct categories.
[Figure: categories (Magnetism, Relativity, Evolution, Botany, Irrigation, Crops), each with example training documents such as "corn wheat silo farm grow…"; a test document "grow corn tractor…" is assigned to its correct category (Crops).]
6
A Probabilistic Approach to Document Classification
Pick the most probable class, given the evidence:

    c = argmax_{c_j} Pr(c_j | d)

where c_j is a class (like "Crops") and d is a document (like "grow corn tractor...").

Bayes Rule:

    Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d)

"Naïve Bayes": (1) one mixture component per class, (2) word independence assumption:

    Pr(c_j | d) = Pr(c_j) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_j)  /  Σ_k Pr(c_k) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_k)

where w_{d_i} is the i-th word in d (like "corn").
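The decision rule above can be sketched in a few lines of Python (a toy illustration with made-up classes and probabilities, not code from the talk); scoring in log space avoids floating-point underflow on long documents:

```python
import math

def classify(doc_words, priors, word_probs):
    """Pick argmax_c Pr(c) * prod_i Pr(w_i | c); log space avoids underflow."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up parameters for two classes over a three-word vocabulary
priors = {"Crops": 0.5, "Botany": 0.5}
word_probs = {
    "Crops":  {"grow": 0.4, "corn": 0.4, "tractor": 0.2},
    "Botany": {"grow": 0.5, "corn": 0.3, "tractor": 0.2},
}
print(classify(["grow", "corn", "tractor"], priors, word_probs))  # -> Crops
```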
7
Parameter Estimation in Naïve Bayes
Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior (AKA "Laplace smoothing"):

    P̂(w_i | c_j) = (1 + Σ_{d_k ∈ c_j} N(w_i, d_k))  /  (|V| + Σ_{t=1}^{|V|} Σ_{d_k ∈ c_j} N(w_t, d_k))

where N(w, d) is the number of times word w occurs in document d.

Naïve Bayes classification:

    c = argmax_{c_j} Pr(c_j) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_j)

Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. Pr(w|c).
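The smoothed estimator can be written out directly (a minimal sketch; `docs_by_class` and `vocab` are hypothetical names, not from the talk):

```python
from collections import Counter

def estimate_word_probs(docs_by_class, vocab):
    """Laplace-smoothed MAP estimate:
    P(w|c) = (1 + N(w,c)) / (|V| + sum_t N(t,c)),
    where N(w,c) counts occurrences of w over all documents labeled c."""
    probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return probs

vocab = {"corn", "wheat", "tractor"}
p = estimate_word_probs({"Crops": [["corn", "corn", "wheat"]]}, vocab)
# P(corn|Crops) = (1 + 2) / (3 + 3) = 0.5; "tractor", though unseen, gets 1/6
```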
8
The Rest of the Talk
(1) Borrow data from related classes in a hierarchy
(2) Use unlabeled data.
Two Methods for Improving Parameter Estimation when Labeled Data is Sparse
Improving Document Classification by Shrinkage in a Hierarchy
Andrew McCallum
Roni Rosenfeld
Tom Mitchell
Andrew Ng (Berkeley)
Larry Wasserman (CMU Statistics)
10
The Idea: “Shrinkage” / “Deleted Interpolation”
We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.
[Figure: a topic hierarchy with root Science; children Agriculture, Biology, Physics; leaves Irrigation, Crops, Botany, Evolution, Magnetism, Relativity. Training documents sit at the leaves; the test document "corn grow tractor…" is classified into Crops.]
11
“Shrinkage” / “Deleted Interpolation”
    P̂_SHRINKAGE("tractor" | Crops) = Σ_{j=0}^{#ancestors of Crops} λ_j P̂("tractor" | ancestor_j(Crops))

where ancestor_0(Crops) is Crops itself, the higher components are
P̂("tractor" | Agriculture) and P̂("tractor" | Science), and the last is the
uniform estimate P̂_UNIFORM("tractor") = 1 / |V|.

[James and Stein, 1961] / [Jelinek and Mercer, 1980]
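A sketch of the interpolation itself, with made-up weights and estimates (the real λ's are learned by EM, as the following slides describe):

```python
def shrinkage_estimate(path_estimates, lambdas):
    """Interpolated estimate: P_shrink = sum_j lambda_j * P_hat_j,
    where the weights sum to 1 and path_estimates runs leaf -> root -> uniform."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(lambdas, path_estimates))

# "tractor" was never seen in Crops training data, but its ancestors
# (Agriculture, Science) and the uniform distribution still give it mass.
estimates = [0.0, 0.01, 0.002, 1.0 / 50000]  # Crops, Agriculture, Science, uniform
lambdas = [0.4, 0.3, 0.2, 0.1]               # made-up mixture weights
p = shrinkage_estimate(estimates, lambdas)
```

This is the reliability/specificity tradeoff in code: the sparse leaf estimate contributes nothing here, yet the word is not assigned zero probability.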
12
Learning Mixture Weights
Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.

Each leaf class (here Crops) has one mixture weight per node on its path to the root, plus one for the uniform distribution: λ_Crops^child, λ_Crops^parent (Agriculture), λ_Crops^grandparent (Science), λ_Crops^uniform.

E-step: Use the current λ's to estimate the degree to which each node was likely to have generated the words in held-out documents.
M-step: Use the estimates to recalculate new values for the λ's.
14
Learning Mixture Weights
E-step (expected share of held-out words w_t ∈ H_j attributed to each ancestor m of class c_j):

    β_j^m = Σ_{w_t ∈ H_j} [ λ_j^m P̂^m(w_t | c_j)  /  Σ_{m'} λ_j^{m'} P̂^{m'}(w_t | c_j) ]

M-step (renormalize the shares into new weights):

    λ_j^m = β_j^m / Σ_{m'} β_j^{m'}
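The E/M updates above can be sketched as follows (a toy version for a single leaf class; `node_probs` holds fixed per-node word estimates, and the held-out set is a flat word list rather than true leave-one-out):

```python
def learn_lambdas(heldout_words, node_probs, n_iters=50):
    """EM for the shrinkage mixture weights of one leaf class.
    node_probs: per hierarchy level (leaf .. root, uniform), a dict
    word -> P_hat(w | that node); the P_hats stay fixed, only the
    weights move. Weights start uniform over the levels."""
    m = len(node_probs)
    lambdas = [1.0 / m] * m
    for _ in range(n_iters):
        # E-step: expected fraction of held-out words generated by each node
        betas = [0.0] * m
        for w in heldout_words:
            denom = sum(l * p[w] for l, p in zip(lambdas, node_probs))
            for a in range(m):
                betas[a] += lambdas[a] * node_probs[a][w] / denom
        # M-step: renormalize the expectations into new weights
        total = sum(betas)
        lambdas = [b / total for b in betas]
    return lambdas

# Held-out words that the leaf's estimates explain much better than its ancestor:
node_probs = [{"corn": 0.9, "law": 0.1},   # leaf
              {"corn": 0.5, "law": 0.5}]   # ancestor / near-uniform
lams = learn_lambdas(["corn", "corn", "corn"], node_probs)  # weight shifts to the leaf
```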
15
Newsgroups Data Set
15 classes, 15k documents, 1.7 million words, 52k-word vocabulary.
(Subset of Ken Lang's 20 Newsgroups set.)

[Figure: two-level hierarchy with parents computers (mac, ibm, graphics, windows, X), religion (atheism, christian, misc), sport (baseball, hockey), politics (guns, mideast, misc), motor (auto, motorcycle).]
16
Newsgroups Hierarchy Mixture Weights

  # training                                         Mixture Weights
  documents  Class                            child  parent  g'parent  uniform
  235        /politics/talk.politics.guns     0.368  0.092   0.017     0.522
  235        /politics/talk.politics.mideast  0.256  0.132   0.001     0.611
  235        /politics/talk.politics.misc     0.197  0.213   0.026     0.564
  7497       /politics/talk.politics.guns     0.801  0.089   0.048     0.061
  7497       /politics/talk.politics.mideast  0.859  0.061   0.010     0.071
  7497       /politics/talk.politics.misc     0.762  0.126   0.043     0.068

With little training data most weight shrinks toward the uniform estimate; with ample data the leaf (child) estimates dominate.
18
Industry Sector Data Set
71 classes, 6.5k documents, 1.2 million words, 30k-word vocabulary.
Data from www.marketguide.com.

[Figure: two-level hierarchy; parents shown include transportation (air, railroad, trucking, water, misc), utilities (electric, gas, water), consumer (appliance, furniture, film), energy (coal, oil&gas, integrated), services (communication), … (11).]
19
Industry Sector Classification Accuracy
20
Newsgroups Classification Accuracy
21
Yahoo Science Data Set
264 classes, 14k documents, 3 million words, 76k-word vocabulary.
Data from www.yahoo.com/Science.

[Figure: hierarchy; parents shown include agriculture (agronomy, forestry, dairy, crops), biology (botany, evolution, cell), physics (magnetism, relativity), CS (AI, HCI, courses), space (missions, craft), … (30).]
22
Yahoo Science Classification Accuracy
23
Related Work
• Shrinkage in Statistics:
– [Stein 1955], [James & Stein 1961]
• Deleted Interpolation in Language Modeling:
  – [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997]
• Bayesian Hierarchical Modeling for n-grams:
  – [MacKay & Peto 1994]
• Class hierarchies for text classification:
  – [Koller & Sahami 1997]
• Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning:
  – [Hofmann & Puzicha 1998]
24
Future Work
• Learning hierarchies that aid classification.
• Using more complex generative models.
  – Capturing word dependencies.
  – Clustering words in each ancestor.
25
Shrinkage Conclusions
• Shrinkage in a hierarchy of classes can dramatically improve classification accuracy.
• Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful.
• [The hierarchy can be pruned for exponential reduction in computation necessary for classification; only minimal loss in accuracy.]
26
The Rest of the Talk
(1) Borrow data from related classes in a hierarchy.
(2) Use unlabeled data.
Two Methods for Improving Parameter Estimation when Labeled Data is Sparse
Text Classification with Labeled and Unlabeled Documents
Kamal Nigam
Andrew McCallum
Sebastian Thrun
Tom Mitchell
28
The Scenario
Training data with class labels:
  – Web pages the user says are interesting.
  – Web pages the user says are uninteresting.
Data available at training time, but without class labels:
  – Web pages the user hasn't seen or said anything about.
Can we use the unlabeled documents to increase accuracy?
29
Using the Unlabeled Data
1. Build a classification model using the limited labeled data.
2. Use the model to estimate labels for the unlabeled documents.
3. Use all documents to build a new classification model, which is often more accurate because it is trained on more data.
30
An Example
Labeled Data
  Baseball: "The new hitter struck out...", "Pete Rose is not as good an athlete as Tara Lipinski...", "Struck out in last inning...", "Homerun in the first inning..."
  Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."

Unlabeled Data
  "Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal."
  "Tara Lipinski bought a new house for her parents."

Before EM:  Pr(Lipinski) = 0.01    Pr(Lipinski) = 0.001
After EM:   Pr(Lipinski | Ice Skating) = 0.02    Pr(Lipinski | Baseball) = 0.003

"Lipinski" appears in a Baseball-labeled document, so from the labeled data alone it looks like a baseball word; the unlabeled skating documents correct the estimate.
31
Filling in Missing Labels with EM
Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data.
[Dempster et al. '77], [Ghahramani & Jordan '95], [McLachlan & Krishnan '97]

• E-step: Use current estimates of the model parameters to "guess" the values of the missing labels.
• M-step: Use current "guesses" for the missing labels to calculate new estimates of the model parameters.
• Repeat E- and M-steps until convergence.

Finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
32
EM for Text Classification
Expectation-step (estimate the class labels):

    Pr(c_j | d) ∝ Pr(c_j) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_j)

Maximization-step (new parameters using the estimates):

    P̂(w_i | c_j) = (1 + Σ_k N(w_i, d_k) Pr(c_j | d_k))  /  (|V| + Σ_{t=1}^{|V|} Σ_k N(w_t, d_k) Pr(c_j | d_k))
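The two steps can be combined into a small end-to-end sketch (illustrative names and toy data, not the talk's code; Laplace smoothing as in the earlier estimation slide):

```python
import math
from collections import Counter

def train_nb(data, vocab, classes):
    """M-step: Laplace-smoothed Naive Bayes parameters from
    (words, {class: weight}) pairs; weights are (possibly soft) labels."""
    priors, word_probs = {}, {}
    for c in classes:
        counts, mass = Counter(), 0.0
        for words, dist in data:
            w_c = dist.get(c, 0.0)
            mass += w_c
            for w in words:
                counts[w] += w_c
        total = sum(counts.values())
        priors[c] = (1 + mass) / (len(classes) + len(data))
        word_probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs

def posterior(words, priors, word_probs):
    """E-step: Pr(c|d) proportional to Pr(c) * prod_i Pr(w_i|c)."""
    logs = {c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)
            for c in priors}
    mx = max(logs.values())
    exps = {c: math.exp(v - mx) for c, v in logs.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}

def em(labeled, unlabeled, classes, vocab, n_iters=10):
    """Alternate E- and M-steps over labeled plus unlabeled documents."""
    hard = [(words, {c: 1.0}) for words, c in labeled]
    priors, wp = train_nb(hard, vocab, classes)
    for _ in range(n_iters):
        guessed = [(words, posterior(words, priors, wp)) for words in unlabeled]
        priors, wp = train_nb(hard + guessed, vocab, classes)
    return priors, wp

vocab = {"hitter", "inning", "ice", "skates"}
labeled = [(["hitter"], "baseball"), (["ice"], "skating")]
unlabeled = [["ice", "skates"], ["hitter", "inning"]]
priors, wp = em(labeled, unlabeled, ["baseball", "skating"], vocab)
```

After EM, "skates" (seen only in unlabeled text) carries more probability under the skating class than under baseball.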
33
WebKB Data Set
Classes: student, faculty, course, project.
4 classes, 4199 documents from CS academic departments.
34
Word Vector Evolution with EM
Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog
Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec
Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript
(D is a digit)
35
EM as Clustering
[Figure: a few labeled examples (X) among many unlabeled points; EM treats each class as a cluster.]
36
EM as Clustering, Gone Wrong
[Figure: the same setting when the cluster structure does not match the class labels, so EM's estimates go astray.]
37
20 Newsgroups Data Set
20 class labels, 20,000 documents, 62k unique words.
Classes include: alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc, …
38
Newsgroups Classification Accuracyvarying # labeled documents
39
Newsgroups Classification Accuracy, varying # unlabeled documents
40
WebKB Classification Accuracyvarying # labeled documents
41
WebKB Classification Accuracyvarying weight of unlabeled data
42
WebKB Classification Accuracy, varying # labeled documents and selecting the unlabeled-data weight by cross-validation
43
Reuters 21578 Data Set
135 class labels, 12902 documents.
Classes include: acq, earn, interest, ship, crude, grain, corn, wheat, …
44
EM as Clustering, Salvageable
[Figure: modeling one class with multiple mixture components, so EM can still fit cluster structure that does not match the labels.]
45
Reuters 21578 Precision-Recall Breakeven
Category   NB 1   EM 1   EM 20  EM 40  Diff
acq        75.9   39.5   88.4   88.9   +13.0
corn       40.5   21.1   39.8   39.1    -0.7
crude      60.7   27.8   63.9   66.6    +5.9
earn       92.6   90.2   95.3   95.2    +2.7
grain      51.7   21.0   54.6   55.8    +4.1
interest   52.0   25.9   48.6   50.3    -1.7
money-fx   57.7   28.8   54.7   59.7    +2.0
ship       58.1    9.3   46.5   55.0    -3.1
trade      56.8   34.7   54.3   57.0    +0.2
wheat      48.9   13.0   42.1   44.2    -4.7

(The number after NB/EM is the number of mixture components for the negative class.)
46
Related Work
• Using EM to reduce the need for training examples:
– [Miller & Uyar 1997], [Shahshahani & Landgrebe 1994]
• Using EM to fill in missing values:
  – [Ghahramani & Jordan 1995]
• AutoClass, unsupervised EM with Naïve Bayes:
  – [Cheeseman et al. 1988]
• Co-Training:
  – [Blum & Mitchell COLT'98]
• Relevance Feedback for Information Retrieval:
  – [Salton & Buckley 1990]
47
Unlabeled Data Conclusions & Future Work
• Combining labeled and unlabeled data with EM can greatly reduce the need for labeled training data.
• Exercise caution: EM can sometimes hurt.– Weight the unlabeled data.– Choose parametric model carefully.
• Explore how the EM likelihood surface varies for different tasks.
• Use similar techniques for other text tasks, e.g. Information Extraction.
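A sketch of the first caution, weighting the unlabeled data (a hypothetical helper, not the talk's code; `lam` would be chosen by cross-validation, as in the WebKB experiments):

```python
def downweight(guessed, lam):
    """Scale each estimated label distribution by lam in [0, 1], so an
    unlabeled document contributes only lam 'virtual' documents of
    evidence when the Naive Bayes parameters are re-estimated."""
    return [(words, {c: lam * p for c, p in dist.items()})
            for words, dist in guessed]

guessed = [(["ice", "skates"], {"skating": 0.9, "baseball": 0.1})]
scaled = downweight(guessed, 0.5)  # skating weight becomes 0.45
```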
48
Cora Demo
Populating a hierarchy
• Naïve Bayes
  + Simple, robust document classification.
  + Many principled enhancements (e.g. shrinkage).
  – Requires a lot of labeled training data.
• Keyword matching
  + Requires no labeled training data.
  – Human effort to select keywords (accuracy vs. coverage).
  – Brittle; breaks easily.
53
Combine Naïve Bayes and Keywords for Best of Both
• Classify unlabeled documents with keyword matching.
• Pretend these category labels are correct, and use this data to train naïve Bayes.
• Naïve Bayes acts to temper and “round out” the keyword class definitions.
• Brings in new probabilistically-weighted keywords that are correlated with the few original keywords.
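The bootstrap in the bullets above can be sketched as follows (hypothetical keyword lists and documents; the matched documents would then be fed to naïve Bayes training as if their labels were correct):

```python
def keyword_label(docs, keywords):
    """Provisionally label each document by keyword matching; documents
    that match no keyword are left out. keywords: class -> set of words."""
    labeled = []
    for words in docs:
        hits = {c: len(set(words) & kws) for c, kws in keywords.items()}
        best = max(hits, key=hits.get)
        if hits[best] > 0:
            labeled.append((words, best))
    return labeled

# Hypothetical keyword lists for two research topics
keywords = {"NLP": {"language", "parsing"}, "Vision": {"image", "pixel"}}
docs = [["language", "models"], ["pixel", "image", "edges"], ["graphs"]]
provisional = keyword_label(docs, keywords)  # third doc matches nothing
```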
54
Cora Topic Hierarchy: Classification Accuracy

[Bar chart (accuracy, 0–70%) comparing three algorithms: Keyword Matching, Naïve Bayes, and Naïve Bayes with Shrinkage.]
55
Top words found by naïve Bayes and Shrinkage
ROOT: computer, university, science, system, paper
  HCI: computer, system, multimedia, university, paper
    GUI: interface, design, user, sketch, interfaces
    Cooperative: collaborative, CSCW, work, provide, group
    Multimedia: multimedia, real-time, data, media
  IR: information, text, documents, classification, retrieval
  AI: learning, university, computer, based, intelligence
    Planning: planning, temporal, reasoning, plan, problems
    Machine Learning: learning, algorithm, university, networks
    NLP: language, natural, processing, information, text
  Hardware: circuits, designs, computer, university, performance
  Programming: programming, language, logic, university, programs
    Semantics: semantics, denotational, language, construction, types
    Garbage Collection: garbage, collection, memory, optimization, region
56
Less Labeled Data, but with Unlabeled Data
[Bar chart (accuracy, 0–70%) comparing: Naïve Bayes, Naïve Bayes with Shrinkage, and Naïve Bayes with Shrinkage and Unlabeled Data.]
57
Next Cora Projects
• Improving existing components with further machine-learning research.
• Building a topic hierarchy automatically by clustering.
• Reference matching by machine learning.
• Active learning for improving performance interactively.
• Seminal-paper detection (à la Kleinberg).
• TDT ("What's new in research this month?")
58
Bibliography
For more details see http://www.cs.cmu.edu/~mccallum

McCallum, Rosenfeld, Mitchell & Ng. "Improving Text Classification by Shrinkage in a Hierarchy of Classes." ICML-98.
Nigam, McCallum, Thrun & Mitchell. "Learning to Classify Text from Labeled and Unlabeled Documents." AAAI-98.
McCallum, Nigam, Rennie & Seymore. "Building Domain-Specific Search Engines with Machine Learning." AAAI Spring Symposium, 1999 (submitted).