Text Classification with Limited Labeled Data
Andrew McCallum  [email protected]
Just Research (formerly JPRC)
Center for Automated Learning and Discovery, Carnegie Mellon University
Joint work with Kamal Nigam, Tom Mitchell, Sebastian Thrun, Roni Rosenfeld, Andrew Ng, Larry Wasserman, Kristie Seymore, and Jason Rennie
The Task: Document Classification (also "Document Categorization", "Routing", or "Tagging")
Automatically placing documents in their correct categories.
[Figure: categories (Magnetism, Relativity, Evolution, Botany, Irrigation, Crops), each with example training documents such as "corn wheat silo farm grow…"; a test document "grow corn tractor…" is assigned to its correct category (Crops).]
6
A Probabilistic Approach to Document Classification
Pick the most probable class, given the evidence:

    c = argmax_{c_j} Pr(c_j | d)

where c_j is a class (like "Crops") and d is a document (like "grow corn tractor...").

Bayes Rule:

    Pr(c_j | d) = Pr(c_j) Pr(d | c_j) / Pr(d)

"Naïve Bayes": (1) one mixture component per class, (2) word independence assumption:

    Pr(c_j | d) = Pr(c_j) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_j)  /  Σ_k Pr(c_k) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_k)

where w_{d_i} is the i-th word in d (like "corn").
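The decision rule above can be sketched in a few lines of Python (a toy illustration with made-up classes and probabilities, not code from the talk); scoring in log space avoids floating-point underflow on long documents:

```python
import math

def classify(doc_words, priors, word_probs):
    """Pick argmax_c Pr(c) * prod_i Pr(w_i | c); log space avoids underflow."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_words:
            score += math.log(word_probs[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Made-up parameters for two classes over a three-word vocabulary
priors = {"Crops": 0.5, "Botany": 0.5}
word_probs = {
    "Crops":  {"grow": 0.4, "corn": 0.4, "tractor": 0.2},
    "Botany": {"grow": 0.5, "corn": 0.3, "tractor": 0.2},
}
print(classify(["grow", "corn", "tractor"], priors, word_probs))  # -> Crops
```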
7
Parameter Estimation in Naïve Bayes
Maximum a posteriori estimate of Pr(w|c), with a Dirichlet prior (AKA "Laplace smoothing"):

    P̂(w_i | c_j) = (1 + Σ_{d_k ∈ c_j} N(w_i, d_k))  /  (|V| + Σ_{t=1}^{|V|} Σ_{d_k ∈ c_j} N(w_t, d_k))

where N(w, d) is the number of times word w occurs in document d.

Naïve Bayes classification:

    c = argmax_{c_j} Pr(c_j) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_j)

Two ways to improve this method:
(A) Make less restrictive assumptions about the model.
(B) Get better estimates of the model parameters, i.e. Pr(w|c).
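The smoothed estimator can be written out directly (a minimal sketch; `docs_by_class` and `vocab` are hypothetical names, not from the talk):

```python
from collections import Counter

def estimate_word_probs(docs_by_class, vocab):
    """Laplace-smoothed MAP estimate:
    P(w|c) = (1 + N(w,c)) / (|V| + sum_t N(t,c)),
    where N(w,c) counts occurrences of w over all documents labeled c."""
    probs = {}
    for c, docs in docs_by_class.items():
        counts = Counter(w for doc in docs for w in doc)
        total = sum(counts.values())
        probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return probs

vocab = {"corn", "wheat", "tractor"}
p = estimate_word_probs({"Crops": [["corn", "corn", "wheat"]]}, vocab)
# P(corn|Crops) = (1 + 2) / (3 + 3) = 0.5; "tractor", though unseen, gets 1/6
```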
8
The Rest of the Talk
(1) Borrow data from related classes in a hierarchy
(2) Use unlabeled data.
Two Methods for Improving Parameter Estimation when Labeled Data is Sparse
Improving Document Classification by Shrinkage in a Hierarchy
Andrew McCallum
Roni Rosenfeld
Tom Mitchell
Andrew Ng (Berkeley)
Larry Wasserman (CMU Statistics)
10
The Idea: “Shrinkage” / “Deleted Interpolation”
We can improve the parameter estimates in a leaf by averaging them with the estimates in its ancestors. This represents a tradeoff between reliability and specificity.
[Figure: a topic hierarchy with root Science; children Agriculture, Biology, Physics; leaves Irrigation, Crops, Botany, Evolution, Magnetism, Relativity. Training documents sit at the leaves; the test document "corn grow tractor…" is classified into Crops.]
11
“Shrinkage” / “Deleted Interpolation”
    P̂_SHRINKAGE("tractor" | Crops) = Σ_{j=0}^{#ancestors of Crops} λ_j P̂("tractor" | ancestor_j(Crops))

where ancestor_0(Crops) is Crops itself, the higher components are
P̂("tractor" | Agriculture) and P̂("tractor" | Science), and the last is the
uniform estimate P̂_UNIFORM("tractor") = 1 / |V|.

[James and Stein, 1961] / [Jelinek and Mercer, 1980]
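A sketch of the interpolation itself, with made-up weights and estimates (the real λ's are learned by EM, as the following slides describe):

```python
def shrinkage_estimate(path_estimates, lambdas):
    """Interpolated estimate: P_shrink = sum_j lambda_j * P_hat_j,
    where the weights sum to 1 and path_estimates runs leaf -> root -> uniform."""
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return sum(lam * p for lam, p in zip(lambdas, path_estimates))

# "tractor" was never seen in Crops training data, but its ancestors
# (Agriculture, Science) and the uniform distribution still give it mass.
estimates = [0.0, 0.01, 0.002, 1.0 / 50000]  # Crops, Agriculture, Science, uniform
lambdas = [0.4, 0.3, 0.2, 0.1]               # made-up mixture weights
p = shrinkage_estimate(estimates, lambdas)
```

This is the reliability/specificity tradeoff in code: the sparse leaf estimate contributes nothing here, yet the word is not assigned zero probability.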
12
Learning Mixture Weights
Learn the λ's via EM, performing the E-step with leave-one-out cross-validation.

Each leaf class (here Crops) has one mixture weight per node on its path to the root, plus one for the uniform distribution: λ_Crops^child, λ_Crops^parent (Agriculture), λ_Crops^grandparent (Science), λ_Crops^uniform.

E-step: Use the current λ's to estimate the degree to which each node was likely to have generated the words in held-out documents.
M-step: Use the estimates to recalculate new values for the λ's.
14
Learning Mixture Weights
E-step (expected share of held-out words w_t ∈ H_j attributed to each ancestor m of class c_j):

    β_j^m = Σ_{w_t ∈ H_j} [ λ_j^m P̂^m(w_t | c_j)  /  Σ_{m'} λ_j^{m'} P̂^{m'}(w_t | c_j) ]

M-step (renormalize the shares into new weights):

    λ_j^m = β_j^m / Σ_{m'} β_j^{m'}
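The E/M updates above can be sketched as follows (a toy version for a single leaf class; `node_probs` holds fixed per-node word estimates, and the held-out set is a flat word list rather than true leave-one-out):

```python
def learn_lambdas(heldout_words, node_probs, n_iters=50):
    """EM for the shrinkage mixture weights of one leaf class.
    node_probs: per hierarchy level (leaf .. root, uniform), a dict
    word -> P_hat(w | that node); the P_hats stay fixed, only the
    weights move. Weights start uniform over the levels."""
    m = len(node_probs)
    lambdas = [1.0 / m] * m
    for _ in range(n_iters):
        # E-step: expected fraction of held-out words generated by each node
        betas = [0.0] * m
        for w in heldout_words:
            denom = sum(l * p[w] for l, p in zip(lambdas, node_probs))
            for a in range(m):
                betas[a] += lambdas[a] * node_probs[a][w] / denom
        # M-step: renormalize the expectations into new weights
        total = sum(betas)
        lambdas = [b / total for b in betas]
    return lambdas

# Held-out words that the leaf's estimates explain much better than its ancestor:
node_probs = [{"corn": 0.9, "law": 0.1},   # leaf
              {"corn": 0.5, "law": 0.5}]   # ancestor / near-uniform
lams = learn_lambdas(["corn", "corn", "corn"], node_probs)  # weight shifts to the leaf
```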
15
Newsgroups Data Set
15 classes, 15k documents, 1.7 million words, 52k-word vocabulary.
(Subset of Ken Lang's 20 Newsgroups set.)

[Figure: two-level hierarchy with parents computers (mac, ibm, graphics, windows, X), religion (atheism, christian, misc), sport (baseball, hockey), politics (guns, mideast, misc), motor (auto, motorcycle).]
16
Newsgroups Hierarchy Mixture Weights

  # training                                         Mixture Weights
  documents  Class                            child  parent  g'parent  uniform
  235        /politics/talk.politics.guns     0.368  0.092   0.017     0.522
  235        /politics/talk.politics.mideast  0.256  0.132   0.001     0.611
  235        /politics/talk.politics.misc     0.197  0.213   0.026     0.564
  7497       /politics/talk.politics.guns     0.801  0.089   0.048     0.061
  7497       /politics/talk.politics.mideast  0.859  0.061   0.010     0.071
  7497       /politics/talk.politics.misc     0.762  0.126   0.043     0.068

With little training data most weight shrinks toward the uniform estimate; with ample data the leaf (child) estimates dominate.
18
Industry Sector Data Set
71 classes, 6.5k documents, 1.2 million words, 30k-word vocabulary.
Data from www.marketguide.com.

[Figure: two-level hierarchy; parents shown include transportation (air, railroad, trucking, water, misc), utilities (electric, gas, water), consumer (appliance, furniture, film), energy (coal, oil&gas, integrated), services (communication), … (11).]
19
Industry Sector Classification Accuracy
20
Newsgroups Classification Accuracy
21
Yahoo Science Data Set
264 classes, 14k documents, 3 million words, 76k-word vocabulary.
Data from www.yahoo.com/Science.

[Figure: hierarchy; parents shown include agriculture (agronomy, forestry, dairy, crops), biology (botany, evolution, cell), physics (magnetism, relativity), CS (AI, HCI, courses), space (missions, craft), … (30).]
22
Yahoo Science Classification Accuracy
23
Related Work
• Shrinkage in Statistics:
– [Stein 1955], [James & Stein 1961]
• Deleted Interpolation in Language Modeling:
  – [Jelinek & Mercer 1980], [Seymore & Rosenfeld 1997]
• Bayesian Hierarchical Modeling for n-grams:
  – [MacKay & Peto 1994]
• Class hierarchies for text classification:
  – [Koller & Sahami 1997]
• Using EM to set mixture weights in a hierarchical clustering model for unsupervised learning:
  – [Hofmann & Puzicha 1998]
24
Future Work
• Learning hierarchies that aid classification.
• Using more complex generative models.
  – Capturing word dependencies.
  – Clustering words in each ancestor.
25
Shrinkage Conclusions
• Shrinkage in a hierarchy of classes can dramatically improve classification accuracy.
• Shrinkage helps especially when training data is sparse. In models more complex than naïve Bayes, it should be even more helpful.
• [The hierarchy can be pruned for exponential reduction in computation necessary for classification; only minimal loss in accuracy.]
26
The Rest of the Talk
(1) Borrow data from related classes in a hierarchy.
(2) Use unlabeled data.
Two Methods for Improving Parameter Estimation when Labeled Data is Sparse
Text Classification with Labeled and Unlabeled Documents
Kamal Nigam
Andrew McCallum
Sebastian Thrun
Tom Mitchell
28
The Scenario
Training data with class labels:
  – Web pages the user says are interesting.
  – Web pages the user says are uninteresting.
Data available at training time, but without class labels:
  – Web pages the user hasn't seen or said anything about.
Can we use the unlabeled documents to increase accuracy?
29
Using the Unlabeled Data
1. Build a classification model using the limited labeled data.
2. Use the model to estimate labels for the unlabeled documents.
3. Use all documents to build a new classification model, which is often more accurate because it is trained on more data.
30
An Example
Labeled Data
  Baseball: "The new hitter struck out...", "Pete Rose is not as good an athlete as Tara Lipinski...", "Struck out in last inning...", "Homerun in the first inning..."
  Ice Skating: "Fell on the ice...", "Perfect triple jump...", "Katarina Witt's gold medal performance...", "New ice skates...", "Practice at the ice rink every day..."

Unlabeled Data
  "Tara Lipinski's substitute ice skates didn't hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal."
  "Tara Lipinski bought a new house for her parents."

Before EM:  Pr(Lipinski) = 0.01    Pr(Lipinski) = 0.001
After EM:   Pr(Lipinski | Ice Skating) = 0.02    Pr(Lipinski | Baseball) = 0.003

"Lipinski" appears in a Baseball-labeled document, so from the labeled data alone it looks like a baseball word; the unlabeled skating documents correct the estimate.
31
Filling in Missing Labels with EM
Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data.
[Dempster et al. '77], [Ghahramani & Jordan '95], [McLachlan & Krishnan '97]

• E-step: Use current estimates of the model parameters to "guess" the values of the missing labels.
• M-step: Use current "guesses" for the missing labels to calculate new estimates of the model parameters.
• Repeat E- and M-steps until convergence.

Finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.
32
EM for Text Classification
Expectation-step (estimate the class labels):

    Pr(c_j | d) ∝ Pr(c_j) ∏_{i=1}^{|d|} Pr(w_{d_i} | c_j)

Maximization-step (new parameters using the estimates):

    P̂(w_i | c_j) = (1 + Σ_k N(w_i, d_k) Pr(c_j | d_k))  /  (|V| + Σ_{t=1}^{|V|} Σ_k N(w_t, d_k) Pr(c_j | d_k))
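The two steps can be combined into a small end-to-end sketch (illustrative names and toy data, not the talk's code; Laplace smoothing as in the earlier estimation slide):

```python
import math
from collections import Counter

def train_nb(data, vocab, classes):
    """M-step: Laplace-smoothed Naive Bayes parameters from
    (words, {class: weight}) pairs; weights are (possibly soft) labels."""
    priors, word_probs = {}, {}
    for c in classes:
        counts, mass = Counter(), 0.0
        for words, dist in data:
            w_c = dist.get(c, 0.0)
            mass += w_c
            for w in words:
                counts[w] += w_c
        total = sum(counts.values())
        priors[c] = (1 + mass) / (len(classes) + len(data))
        word_probs[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
    return priors, word_probs

def posterior(words, priors, word_probs):
    """E-step: Pr(c|d) proportional to Pr(c) * prod_i Pr(w_i|c)."""
    logs = {c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in words)
            for c in priors}
    mx = max(logs.values())
    exps = {c: math.exp(v - mx) for c, v in logs.items()}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}

def em(labeled, unlabeled, classes, vocab, n_iters=10):
    """Alternate E- and M-steps over labeled plus unlabeled documents."""
    hard = [(words, {c: 1.0}) for words, c in labeled]
    priors, wp = train_nb(hard, vocab, classes)
    for _ in range(n_iters):
        guessed = [(words, posterior(words, priors, wp)) for words in unlabeled]
        priors, wp = train_nb(hard + guessed, vocab, classes)
    return priors, wp

vocab = {"hitter", "inning", "ice", "skates"}
labeled = [(["hitter"], "baseball"), (["ice"], "skating")]
unlabeled = [["ice", "skates"], ["hitter", "inning"]]
priors, wp = em(labeled, unlabeled, ["baseball", "skating"], vocab)
```

After EM, "skates" (seen only in unlabeled text) carries more probability under the skating class than under baseball.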
33
WebKB Data Set
Classes: student, faculty, course, project.
4 classes, 4199 documents from CS academic departments.
34
Word Vector Evolution with EM
Iteration 0: intelligence, DD, artificial, understanding, DDw, dist, identical, rus, arrange, games, dartmouth, natural, cognitive, logic, proving, prolog
Iteration 1: DD, D, lecture, cc, D*, DD:DD, handout, due, problem, set, tay, DDam, yurtas, homework, kfoury, sec
Iteration 2: D, DD, lecture, cc, DD:DD, due, D*, homework, assignment, handout, set, hw, exam, problem, DDam, postscript
(D is a digit)
35
EM as Clustering
[Figure: a few labeled examples (X) among many unlabeled points; EM treats each class as a cluster.]
36
EM as Clustering, Gone Wrong
[Figure: the same setting when the cluster structure does not match the class labels, so EM's estimates go astray.]
37
20 Newsgroups Data Set
20 class labels, 20,000 documents, 62k unique words.
Classes include: alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc, …
38
Newsgroups Classification Accuracyvarying # labeled documents
39
Newsgroups Classification Accuracy, varying # unlabeled documents
40
WebKB Classification Accuracyvarying # labeled documents
41
WebKB Classification Accuracyvarying weight of unlabeled data
42
WebKB Classification Accuracy, varying # labeled documents and selecting the unlabeled-data weight by cross-validation
43
Reuters 21578 Data Set
135 class labels, 12902 documents.
Classes include: acq, earn, interest, ship, crude, grain, corn, wheat, …
44
EM as Clustering, Salvageable
[Figure: modeling one class with multiple mixture components, so EM can still fit cluster structure that does not match the labels.]
45
Reuters 21578 Precision-Recall Breakeven
Category   NB 1   EM 1   EM 20  EM 40  Diff
acq        75.9   39.5   88.4   88.9   +13.0
corn       40.5   21.1   39.8   39.1    -0.7
crude      60.7   27.8   63.9   66.6    +5.9
earn       92.6   90.2   95.3   95.2    +2.7
grain      51.7   21.0   54.6   55.8    +4.1
interest   52.0   25.9   48.6   50.3    -1.7
money-fx   57.7   28.8   54.7   59.7    +2.0
ship       58.1    9.3   46.5   55.0    -3.1
trade      56.8   34.7   54.3   57.0    +0.2
wheat      48.9   13.0   42.1   44.2    -4.7

(The number after NB/EM is the number of mixture components for the negative class.)
46
Related Work
• Using EM to reduce the need for training examples:
– [Miller & Uyar 1997], [Shahshahani & Landgrebe 1994]
• Using EM to fill in missing values:
  – [Ghahramani & Jordan 1995]
• AutoClass, unsupervised EM with Naïve Bayes:
  – [Cheeseman et al. 1988]
• Co-Training:
  – [Blum & Mitchell COLT'98]
• Relevance Feedback for Information Retrieval:
  – [Salton & Buckley 1990]
47
Unlabeled Data Conclusions & Future Work
• Combining labeled and unlabeled data with EM can greatly reduce the need for labeled training data.
• Exercise caution: EM can sometimes hurt.– Weight the unlabeled data.– Choose parametric model carefully.
• Explore how the EM likelihood surface varies for different tasks.
• Use similar techniques for other text tasks, e.g. Information Extraction.
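A sketch of the first caution, weighting the unlabeled data (a hypothetical helper, not the talk's code; `lam` would be chosen by cross-validation, as in the WebKB experiments):

```python
def downweight(guessed, lam):
    """Scale each estimated label distribution by lam in [0, 1], so an
    unlabeled document contributes only lam 'virtual' documents of
    evidence when the Naive Bayes parameters are re-estimated."""
    return [(words, {c: lam * p for c, p in dist.items()})
            for words, dist in guessed]

guessed = [(["ice", "skates"], {"skating": 0.9, "baseball": 0.1})]
scaled = downweight(guessed, 0.5)  # skating weight becomes 0.45
```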
48
Cora Demo
Populating a hierarchy
• Naïve Bayes
  + Simple, robust document classification.
  + Many principled enhancements (e.g. shrinkage).
  – Requires a lot of labeled training data.
• Keyword matching
  + Requires no labeled training data.
  – Human effort to select keywords (accuracy vs. coverage).
  – Brittle; breaks easily.
53
Combine Naïve Bayes and Keywords for Best of Both
• Classify unlabeled documents with keyword matching.
• Pretend these category labels are correct, and use this data to train naïve Bayes.
• Naïve Bayes acts to temper and “round out” the keyword class definitions.
• Brings in new probabilistically-weighted keywords that are correlated with the few original keywords.
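The bootstrap in the bullets above can be sketched as follows (hypothetical keyword lists and documents; the matched documents would then be fed to naïve Bayes training as if their labels were correct):

```python
def keyword_label(docs, keywords):
    """Provisionally label each document by keyword matching; documents
    that match no keyword are left out. keywords: class -> set of words."""
    labeled = []
    for words in docs:
        hits = {c: len(set(words) & kws) for c, kws in keywords.items()}
        best = max(hits, key=hits.get)
        if hits[best] > 0:
            labeled.append((words, best))
    return labeled

# Hypothetical keyword lists for two research topics
keywords = {"NLP": {"language", "parsing"}, "Vision": {"image", "pixel"}}
docs = [["language", "models"], ["pixel", "image", "edges"], ["graphs"]]
provisional = keyword_label(docs, keywords)  # third doc matches nothing
```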
54
Cora Topic Hierarchy: Classification Accuracy

[Bar chart (accuracy, 0–70%) comparing three algorithms: Keyword Matching, Naïve Bayes, and Naïve Bayes with Shrinkage.]
55
Top words found by naïve Bayes and Shrinkage
ROOT: computer, university, science, system, paper
  HCI: computer, system, multimedia, university, paper
    GUI: interface, design, user, sketch, interfaces
    Cooperative: collaborative, CSCW, work, provide, group
    Multimedia: multimedia, real-time, data, media
  IR: information, text, documents, classification, retrieval
  AI: learning, university, computer, based, intelligence
    Planning: planning, temporal, reasoning, plan, problems
    Machine Learning: learning, algorithm, university, networks
    NLP: language, natural, processing, information, text
  Hardware: circuits, designs, computer, university, performance
  Programming: programming, language, logic, university, programs
    Semantics: semantics, denotational, language, construction, types
    Garbage Collection: garbage, collection, memory, optimization, region
56
Less Labeled Data, but with Unlabeled Data
[Bar chart (accuracy, 0–70%) comparing: Naïve Bayes, Naïve Bayes with Shrinkage, and Naïve Bayes with Shrinkage and Unlabeled Data.]
57
Next Cora Projects
• Improving existing components with further machine-learning research.
• Building a topic hierarchy automatically by clustering.
• Reference matching by machine learning.
• Active learning for improving performance interactively.
• Seminal-paper detection (à la Kleinberg).
• TDT ("What's new in research this month?")
58
Bibliography
For more details see http://www.cs.cmu.edu/~mccallum

McCallum, Rosenfeld, Mitchell & Ng. "Improving Text Classification by Shrinkage in a Hierarchy of Classes." ICML-98.
Nigam, McCallum, Thrun & Mitchell. "Learning to Classify Text from Labeled and Unlabeled Documents." AAAI-98.
McCallum, Nigam, Rennie & Seymore. "Building Domain-Specific Search Engines with Machine Learning." AAAI Spring Symposium, 1999 (submitted).