Semi-supervised Learning with Generalized Expectation Criteria
Andrew McCallum
Computer Science Department, University of Massachusetts Amherst
Joint work with Gideon Mann and Greg Druck.
Successful Applications of ML
• sentiment classification: >80% accuracy classifying positive and negative reviews
• sequence labeling: >99% accuracy labeling research paper references
• dependency parsing: >90% accuracy on English news text
Substantial Human Annotation Required
• sentiment classification: >80% accuracy classifying positive and negative reviews, with 2000 labeled reviews
• sequence labeling: >99% accuracy labeling research paper references, with 500 labeled references
• dependency parsing: >90% accuracy on English news text, with the Penn Treebank, which took more than 3 years to annotate
Goal
• Problem: To apply machine learning to a new problem, a substantial amount of human annotation effort is required.
• Goal:
  – Reduce the amount of human effort required to learn an accurate model for a new task.
  – Provide a natural way to inject human domain knowledge.

People have domain knowledge. They need tools to naturally and safely incorporate that knowledge.
Supervised Learning
Decision boundary
Creation of labeled instances requires extensive human effort
What if labeled data is limited?
Small amount of labeled data
Semi-Supervised Learning:Labeled & Unlabeled data
Large amount of unlabeled data
Small amount of labeled data
Augment limited labeled data by using unlabeled data
More Semi-Supervised Algorithms than Applications
[Chart: number of papers per year, 1998-2006, on semi-supervised learning; algorithms vs. applications]
Compiled from [Zhu, 2007]
Weakness of Many Semi-Supervised Algorithms

Difficult to Implement: significantly more complicated than supervised counterparts
Fragile: meta-parameters hard to tune
Lacking in Scalability: O(n²) or O(n³) in the number of unlabeled examples
“EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available.” [Merialdo, 1994]
“When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label Propagation with cosine distance] performs significantly worse than SVM.” [Niu, Ji, Tan, 2005]
Families of Semi-Supervised Learning

1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
Family 1: Expectation Maximization
[Dempster, Laird, Rubin, 1977]
Fragile -- often worse than supervised
Family 2: Graph-Based Methods
[Zhu, Ghahramani, 2002]
[Szummer, Jaakkola, 2002]
Lacking in scalability, Sensitive to choice of metric
Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]
Complicated to find appropriate auxiliary tasks
Family 4: Decision Boundary in Sparse Region

Transductive SVMs [Joachims, 1999]: sparsity measured by margin
Entropy Regularization [Grandvalet and Bengio, 2005]: ...by label entropy
Minimal Entropy Solution!
How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class!
[Bar chart: relative class sizes]

In fact we often have prior knowledge of the relative class proportions:
0.8 : Student
0.2 : Professor
How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class!
[Bar chart: relative class sizes]

In fact we often have prior knowledge of the relative class proportions:
0.1 : Gene Mention
0.9 : Background
How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class!
[Bar chart: relative class sizes]

In fact we often have prior knowledge of the relative class proportions:
0.6 : Person
0.4 : Organization
Families of Semi-Supervised Learning

1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Generalized Expectation
Family 5: Generalized Expectation

[Bar chart: prior class proportions; decision boundary placed in the low density region]

Favor decision boundaries that match the prior.
Generalized Expectation
Simple: Easy to implement
Robust: Meta-parameters need little or no tuning
Scalable: Linear in number of unlabeled examples
Generalized Expectation: Special Cases

• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation: E[ f(x,y) ] (general case)
Label Regularization (LR)

The objective combines the supervised log-likelihood with an LR term: the KL-divergence between a prior distribution and an expected distribution over the unlabeled data.
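Concretely, the label regularization objective can be sketched as follows (notation consistent with the GE formulas later in the deck; the regularization weight λ and the labeled/unlabeled set names L and U are illustrative assumptions):

```latex
% Supervised log-likelihood plus a KL penalty that pulls the model's
% expected label distribution on unlabeled data U toward the prior.
O(\theta) = \sum_{(x,y) \in L} \log p(y \mid x; \theta)
          \;-\; \lambda \, D\!\left( \tilde{p}(y) \,\middle\|\, \hat{p}_\theta(y) \right),
\qquad
\hat{p}_\theta(y) = \frac{1}{|U|} \sum_{x \in U} p(y \mid x; \theta)
```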
LR Results for Classification
Accuracy by number of labeled examples:

                                       2       100      1000
SVM (supervised)                       -      55.41%   66.29%
Cluster Kernel SVM                     -      57.05%   65.97%
QC Smartsub                            -      57.68%   59.16%
Naïve Bayes (supervised)            52.42%    57.12%   64.47%
Naïve Bayes EM                      50.79%    57.34%   57.60%
Logistic Regression (supervised)    52.42%    56.74%   65.43%
Logistic Regression + Entropy Reg.  48.56%    54.45%   58.28%
Logistic Regression + GE            57.08%    58.51%   65.44%
Secondary Structure Prediction
XR (Expectation Regularization) Results for Classification: Sliding Window Model
CoNLL03 Named Entity Recognition Shared Task
XR Results for Classification: Sliding Window Model 2
BioCreativeII 2007 Gene/Gene Product Extraction
Noise in Prior Knowledge
What happens when users’ estimates of the class proportions are in error?
Noisy Prior Distribution

20% change in the probability of the majority class
CoNLL03 Named Entity Recognition Shared Task
Generalized Expectation: Special Cases

• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation: E[ f(x,y) ] (general case)
p(BASEBALL | “homerun”) = 0.95
An Alternative Style of Supervision: Classifying Baseball versus Hockey

Traditional: human labeling effort → (semi-)supervised training via maximum likelihood

Generalized Expectation: brainstorm a few keywords → semi-supervised training via generalized expectation
  hockey: puck, ice, stick
  baseball: ball, field, bat
  p(HOCKEY | “puck”) = 0.9
Labeling Features
[Figure: starting from ~1000 unlabeled examples, labeling features such as “hockey” and “baseball” yields 85% accuracy; labeling additional features (team names such as Oilers, Penguins, Maple Leafs, and sport terms such as puck, goal, batting) raises accuracy through 92% and 94.5% to 96%.]
Human Annotation Experiments
• Three annotators labeled 100 features and 100 documents.
• baseball vs. hockey
[Plot: testing accuracy (0.4 to 1.0) vs. labeling time in seconds (0 to 800), comparing GE and ER]
~2 minutes, 100 features labeled or skipped: 82% accuracy
~15 minutes, 100 documents labeled (or skipped): 78% accuracy
Human Annotation Experiments
• words all annotators labeled
Generalized Expectation: Special Cases

• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation: E[ f(x,y) ] (general case)
Generalized Expectation (GE) criteria

• Definition: a parameter estimation objective function that expresses preferences on expectations of the model.
• Sometimes in the same equivalence class as:
  – Moment matching
  – Maximum likelihood
  – Maximum entropy

Objective = Score( E[ f(x,y) ] )

Not just moments; not necessarily matching a single target value.
Not necessarily p(data); preferences on a subset of model factors.
Based on constraints and expectations, but the parameterization is not required to match the constraints.

[McCallum, Mann, Druck 2007]
Generalized Expectation (GE)

• Generalized Expectation criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

\mathcal{G}(\theta) = S\left( E_{\tilde{p}(x)}\left[ E_{p(y|x;\theta)}[G(x,y)] \right] \right)

where G is the constraint function, p(y|x;\theta) is the model distribution (CRF), \tilde{p}(x) is the empirical distribution (unlabeled data), and S is the score function.
GE Score Functions

• Here, S measures some distance to a target expectation \tilde{G}.
• Model expectation: G_\theta = E_{\tilde{p}(x)}\left[ E_{p(y|x;\theta)}[G(x,y)] \right]
• Squared difference (L2): S_{sq}(G_\theta) = -(\tilde{G} - G_\theta)^2
• KL divergence: S_{kl}(G_\theta) = \tilde{G} \log G_\theta
• In some cases, a set of \tilde{G} values define a probability distribution.
• The sum of S_{kl} over all \tilde{G} then defines a negative cross-entropy (the entropy of \tilde{G} is constant).
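As a bridge back to label regularization (a sketch, not from the slides): take the constraint function to be a class indicator, so the model expectation is the mean predicted class distribution, and score each class against the prior proportion with S_{kl}:

```latex
G_c(x, y) = \mathbf{1}[y = c], \qquad
G_{\theta, c} = E_{\tilde{p}(x)}\big[ p(y = c \mid x; \theta) \big], \qquad
\sum_c S_{kl}(G_{\theta, c}) = \sum_c \tilde{p}(c) \log G_{\theta, c}
```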
Estimating Parameters with GE

• Objective function: O = \sum \mathcal{G}(\theta) + \log p(\theta)
• Maximize using gradient methods.
• Partial derivatives with respect to model parameters:

\frac{\partial}{\partial \theta_j} \mathcal{G}_{kl}(\theta) = \frac{\tilde{G}}{G_\theta}\, E_{\tilde{p}(x)}\Big[ E_{p(y|x;\theta)}[F_j(x,y)\,G(x,y)] - E_{p(y|x;\theta)}[F_j(x,y)]\, E_{p(y|x;\theta)}[G(x,y)] \Big]

\frac{\partial}{\partial \theta_j} \mathcal{G}_{sq}(\theta) = 2(\tilde{G} - G_\theta)\, E_{\tilde{p}(x)}\Big[ E_{p(y|x;\theta)}[F_j(x,y)\,G(x,y)] - E_{p(y|x;\theta)}[F_j(x,y)]\, E_{p(y|x;\theta)}[G(x,y)] \Big]

• The bracketed term is the predicted covariance between the constraint function G and the model feature function F_j.
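To make the gradient concrete, here is a minimal numpy sketch of the GE term and its gradient for label regularization with a multinomial logistic regression model and the squared-difference score. All names, shapes, and the model choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def label_reg_value_and_grad(theta, X, target):
    """GE term (squared-difference score) and gradient for label regularization.

    theta  : (d, k) logistic regression weights (hypothetical model)
    X      : (n, d) unlabeled instances
    target : (k,) prior class distribution G~, e.g. np.array([0.8, 0.2])

    The constraint function G(x, y) is the class indicator, so the model
    expectation G_theta is the mean predicted class distribution.
    """
    n = X.shape[0]
    p = softmax(X @ theta)            # p(y|x; theta), shape (n, k)
    g_theta = p.mean(axis=0)          # G_theta, shape (k,)
    value = -np.sum((target - g_theta) ** 2)

    # Chain rule: 2 (G~ - G_theta) times dG_theta/dtheta, where dG_theta/dtheta
    # is the model covariance between input features and label indicators.
    a = 2.0 * (target - g_theta)      # shape (k,)
    s = p @ a                         # per-instance expectation of a_y, shape (n,)
    grad = X.T @ (p * (a[None, :] - s[:, None])) / n   # shape (d, k)
    return value, grad
```

In practice this term would be added to the supervised log-likelihood and its gradient, then maximized with any gradient-based optimizer.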
GE in Practice
• Active Learning with GE
• GE for Dependency Parsing
Active Learning
Active Learning by Labeling Features

The algorithm queries for feature labels and allows skipping queries.
Query Selection: Expected Information Gain

• Ideal criterion: expected reduction in model uncertainty after labeling feature k:

\Phi_{EIG}(q_k) = E_{p(g|q_k)}\Big[ E_{\tilde{p}(x)}\big[ H(p(y|x;\theta)) - H(p(y|x;\theta_g)) \big] \Big]

• \theta_g are the new parameter estimates after receiving a particular labeling g.
• Drawback: this requires re-training the model with every possible labeling for every feature.
• Solution: approximate the expected reduction in uncertainty.
Marginal Uncertainty Query Selection

• Approximation: the reduction in uncertainty is ≈ the uncertainty of the marginal distributions at the positions where the feature occurs. (A code sketch follows this list.)

• Total uncertainty:
\Phi_{TU}(q_k) = \sum_i \sum_j q_k(x_i, j)\, H(p(y_j | x_i; \theta))

• Weighted uncertainty, which addresses bias towards features that are too frequent or too infrequent (C_k is the occurrence count of feature k):
\Phi_{WU}(q_k) = \log(C_k)\, \frac{\Phi_{TU}(q_k)}{C_k}

• Diverse uncertainty, which chooses uncertain features that appear in diverse contexts (C is the set of previously labeled feature queries):
\Phi_{DU}(q_k) = \Phi_{TU}(q_k)\, \frac{1}{|C|} \sum_{j \in C} \big(1 - \mathrm{sim}(q_k, q_j)\big)
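Here is a minimal sketch of the total and weighted uncertainty scores, assuming the reconstruction of Φ_WU above (log-count times mean uncertainty); the occurrence mask and marginal array are illustrative assumptions:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of an array of distributions."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def total_uncertainty(occurs, marginals):
    """Phi_TU: summed entropy of the model's marginal label distributions
    at the positions where candidate feature q_k occurs.

    occurs    : (num_positions,) boolean mask of occurrences of q_k
    marginals : (num_positions, num_labels) marginals p(y_j | x; theta)
    """
    return entropy(marginals)[occurs].sum()

def weighted_uncertainty(occurs, marginals):
    """Phi_WU = log(C_k) * Phi_TU / C_k: log-count times mean uncertainty,
    which keeps very frequent or very rare features from dominating."""
    c_k = int(occurs.sum())
    if c_k == 0:
        return 0.0
    return np.log(c_k) * total_uncertainty(occurs, marginals) / c_k
```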
Other Query Selection Baselines

• Methods that are active but do not require re-training:

• Coverage (dissimilarity from other labeled features):
\Phi_{cov}(q_k) = C_k\, \frac{1}{|C|} \sum_{j \in C} \big(1 - \mathrm{sim}(q_k, q_j)\big)

• Similarity (similarity to other labeled features):
\Phi_{sim}(q_k) = C_k\, \max_{j \in C} \mathrm{sim}(q_k, q_j)

• Passive baselines: random, frequency, LDA (top words in each topic [Druck et al. 08])
Related Work
• Tandem Learning [Raghavan & Allan 07]
• does not generalize to structured outputs in a straightforward way
• similarity query selection inspired by their method, but performs poorly
• Active Measurement Selection [Liang et al. 09]
• query selection closely related to EIG, but too slow for real experiments
• does not consider skipping queries; limited evaluation; no human experiments
• Dual Supervision [Sindhwani et al. 09]
• does not generalize to structured outputs in a straightforward way
• uses certainty query selection (because a method similar to expectation uncertainty does not work well)
“Simulated” Annotation Experiments
• Simulated annotator labeling instances:
  • provides the true labels
• Simulated annotator labeling features (see the sketch below):
  • skip or label?
    • labels a feature if the entropy of its distribution over labels is ≤ 0.7
    • skips otherwise
  • which labels to assign?
    • assigns the max-probability label, as well as any label whose probability is at least half as large [Druck et al. 08]
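The skip-or-label rule above is simple enough to state in a few lines; here is a sketch of the simulated feature annotator (the input representation is an illustrative assumption):

```python
import numpy as np

def simulated_feature_annotator(label_dist, entropy_threshold=0.7):
    """Return the label indices a simulated annotator assigns to a feature,
    or None to skip the query.

    label_dist : (k,) distribution over labels for this feature, estimated
                 from the fully labeled data (the oracle).
    """
    p = np.asarray(label_dist, dtype=float)
    h = -np.sum(p * np.log(p + 1e-12))
    if h > entropy_threshold:
        return None                    # too ambiguous: skip the query
    best = p.max()
    # Max-probability label, plus any label at least half as probable.
    return [i for i, pi in enumerate(p) if pi >= 0.5 * best]
```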
Experimental Setup
• instance active learning
• query selection: random (rand), uncertainty sampling [Lewis and Gale 94] (US), information density [Settles and Craven 08] (ID)
• training: maximum likelihood + entropy regularization [Jiao 06] (ER)
• feature active learning
• query selection: random, frequency, LDA, coverage (cov), similarity (sim), expectation uncertainty (EU), total uncertainty (TU), weighted uncertainty (WU), diverse uncertainty (DU)
• training: maximum marginal likelihood (MML), GE
• limit candidate set to 500 most frequent features
Experimental Setup
• data sets:
• apartments: 300 Craigslist apartment postings, 11 labels
• cora reference extraction: 500 research paper references, 13 labels
• setup:
• each experiment simulates 10 minutes of annotation time
• Measured annotation times for labeling actions, in seconds:
  [Chart: times for “label token”, “skip feature”, and “label feature”, on a 0-4 second scale]
Simulated Experiments Results
• Active learning with labeled features using GE training outperforms:
• passive or active learning with labeled features using MML
• passive learning with labeled features using GE
• active and passive learning with instances
• Uncertainty based query selection methods generally outperform others.
Simulated Experiments Results
• on cora, GE + weighted uncertainty outperforms random after 5 minutes of annotation
“Grid” Interface
• Feature queries are concise and easy to browse.
• Suggests new interfaces in which groups of queries are displayed, rather than the user being asked to answer one query at a time.
• “Grid” interface:
• displays small groups of related (distributional similarity) features
• may reduce annotation time because features in the same group are likely to have the same label
• within groups, features sorted by query selection metric
• groups that are more uncertain displayed closer to the top
Feature Active Learning Grid Interface
Human Active Learning Experiments
• human active learning experiments:
• labeling instances (with fast labeling interface)
• labeling features with the serial interface
• labeling features with the grid interface
• 5 two-minute sessions per annotator per experiment.
Human Annotation Experiments: User 1
Human Annotation Experiments: User 2
Human Annotation Experiments: User 3
Conclusions
• Developed an active learning method in which the user is asked to “label features” instead of labeling instances.
• Outperforms:
• active and passive learning with instances
• passive learning with labeled features
• Suggests new user interfaces that may allow more efficient annotation.
Dependency Parsing
Problem and Motivation
• Suppose we are given, for some language: raw text, and prior knowledge of dependency syntax.
• How do we efficiently learn a dependency parser?
• Why is this important?
• There are low-density languages and sub-domains of languages for which there are no syntactically annotated corpora.
Supervised Solution
• Supervised solution:
• Problem: Syntactic annotation is costly.
• Penn English Treebank: 6 years of development (1989 - 1995)
• Chinese Treebank: 9 years of development (1998 - 2007)
• Arabic Treebank: 2 years of development (2001-2003)
[Diagram: raw text → treebank annotation → supervised learning → parser]
Traditional Semi-Supervised Solution
• Semi-supervised solution:
• Problem: Costs of developing annotation guidelines may dominate total annotation time early in development.
[Diagram: raw text → seed treebank annotation → semi-supervised learning with additional raw text → parser]

Examples: entropy regularization [Smith & Eisner 07], Brown clustering [Koo et al. 08], self-training [McClosky et al. 06], convex loss on unlabeled data [Wang et al. 08]
Unsupervised Solution
• Unsupervised solution:
• Possible approaches:
• Dependency model with Valence (DMV) + EM [Klein & Manning 04]
• Contrastive Estimation (CE) [Smith & Eisner 05]
• Others: [Smith & Eisner 06], [Bod 06], [Seginer 07]
[Diagram: raw text + prior knowledge of dependency syntax → unsupervised learning → parser]
Unsupervised Solution
• Unsupervised solution:
• Problem: Requires some limited prior knowledge, but incorporates this information in a cumbersome way.
• developing model structure, tweaking learning algorithm, clever parameter initialization, hyperparameter tuning, devising neighborhood function [Smith & Eisner 05]
[Diagram: raw text + prior knowledge of dependency syntax → unsupervised learning → parser]
Our Approach

• Our approach: encode prior knowledge directly with model expectation constraints, and use them to learn a feature-rich parser.
• Example constraints:
  • A DT should attach to a NN directly to the right 90% of the time.
  • The parent of a VBD is the ROOT 75% of the time.
• How do we estimate model parameters with such expectation constraints?
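One way to picture such constraints (a hypothetical encoding, not the authors' format) is as pairs of a constraint function signature and a target expectation:

```python
# (child POS, head POS, attachment direction) -> target expectation
constraints = {
    ("DT", "NN", "right"): 0.90,   # a DT attaches to a NN directly to the right
    ("VBD", "ROOT", None): 0.75,   # the parent of a VBD is the ROOT
}
```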
Non-Projective Dependency Tree CRF

• x is the input sentence, i.e. x_i is the word at position i.
• y is a non-projective tree represented as a vector of parent indices, i.e. y_i is the index of the parent word of word i.
• A CRF models the probability of tree y given sentence x:

p(y | x; \theta) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{n} \sum_j \theta_j f_j(x_i, x_{y_i}, x) \right)

• \theta are the model parameters.
• f_j are edge-factored feature functions, i.e. they consider the entire input x and a single edge y_i → i.
• Z_x is the partition function: the sum of the scores of all possible trees for x.
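For edge-factored models like this, Z_x can be computed in O(n³) with the Matrix-Tree theorem. Below is a minimal numpy sketch of one common construction (root potentials substituted into the first row of the Laplacian); the potential-matrix representation is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def log_partition(edge_potentials, root_potentials):
    """log Z_x for a non-projective dependency tree CRF via the
    Matrix-Tree theorem (O(n^3)).

    edge_potentials : (n, n) array; entry [h, m] holds
                      exp(sum_j theta_j f_j(x_m, x_h, x)) for head h -> modifier m
    root_potentials : (n,) potentials for attaching each word to ROOT
    """
    A = edge_potentials.copy()
    np.fill_diagonal(A, 0.0)
    # Kirchhoff / Laplacian matrix: diagonal holds total incoming potential,
    # off-diagonal entries hold negated edge potentials.
    L = -A
    np.fill_diagonal(L, A.sum(axis=0))
    # Overwrite the first row with root potentials; the determinant then
    # sums the potentials of all spanning trees rooted at ROOT.
    L[0, :] = root_potentials
    sign, logdet = np.linalg.slogdet(L)
    return logdet
```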
Experiments
Simulated “Oracle” Constraints
• In some experiments, prior knowledge is simulated using an “oracle” that looks at labeled data.
• Oracle constraint selection uses a few simple statistics:
• Count: c(g) = \sum_{x \in D} \sum_i \sum_j g(x_i, x_j, x)
• Edge count: c_{edge}(g) = \sum_{(x,y) \in D} \sum_i g(x_i, x_{y_i}, x)
• Edge probability: p(\mathrm{edge} \mid g) = c_{edge}(g) / c(g)
• Target expectations are true edge probabilities, binned into the set: [0, 0.1, 0.25, 0.5, 0.75, 1]
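These statistics are straightforward to compute from a labeled corpus; a sketch follows, with the data representation (POS tag lists and parent index arrays) as an illustrative assumption:

```python
def oracle_edge_probability(corpus, g):
    """p(edge | g) = c_edge(g) / c(g) for a candidate constraint function g.

    corpus : list of (tags, parents) pairs; parents[i] is the head index of word i
    g      : g(tags, child, head) -> bool, e.g.
             lambda t, c, h: t[c] == "DT" and t[h] == "NN" and h == c + 1
    """
    count = edge_count = 0
    for tags, parents in corpus:
        for child in range(len(tags)):
            for head in range(len(tags)):
                if head == child or not g(tags, child, head):
                    continue
                count += 1                      # c(g): g fires on this pair
                if parents[child] == head:
                    edge_count += 1             # c_edge(g): pair is a true edge
    return edge_count / count if count else 0.0
```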
Comparison with Unsupervised Methods
Corpus:
• WSJ10: WSJ portion of Penn Treebank, stripped of punctuation, sentences of 10 words or fewer, only POS tags (unlexicalized)
Models:
• DMV [Klein and Manning 04]: does not model distance, can model arity and sibling relationships.
• Restricted CRF: only features of type (parent-POS ∧ child-POS ∧ direction). Weaker than DMV (ignoring projective vs non-projective).
• CRF: [McDonald et al. 05] features. models distance, but still unlexicalized.
• baseline: assigns target expectations as scores to edges with constraints, runs MST with the resulting scores
Comparison with Unsupervised Methods
• parameter estimation methods:
• DMV (results from [Smith 06]):
• expectation maximization (EM)
• contrastive estimation (CE)
• restricted CRF / CRF:
• supervised maximum likelihood (upper bound)
• generalized expectation (GE)
Oracle Expectation Constraints
• constraint selection: sort functions (parent-POS ∧ child-POS ∧ direction) with count ≥ 200 by edge probability
• first 20 constraints selected:
• POS tags in sentence order, head → modifier, grouped by head
Human Provided Expectation Constraints
• Constraints extracted from grammar fragments in [Haghighi & Klein 06]
• Target expectations provided using (limited!) prior knowledge
• Based on the model's output, refined target expectations and added new constraints
GE vs. Supervised & Baseline
• GE outperforms the baseline
• human constraints provide accuracy comparable to oracle
• GE performs much better in conjunction with feature-rich model
[Plot: accuracy (10 to 90) vs. number of constraints (10 to 60) for: constraint baseline, CRF restricted supervised, CRF supervised, CRF restricted GE, CRF GE, CRF GE human]
GE vs. Unsupervised
• GE outperforms DMV EM with 10 or 20 (restricted CRF) constraints.
• GE outperforms DMV CE with 50 (restricted CRF) or 20 constraints
[Plot: accuracy (10 to 80) vs. number of constraints (10 to 60) for: attach-right baseline, DMV EM, DMV CE, CRF restricted GE, CRF GE, CRF GE human]
Sensitivity of Unsupervised Methods
• sensitivity of DMV EM to initialization [Smith 06]:
• reported results use best of three parameter initialization methods, the method of [Klein & Manning 04]
• others give accuracies lower than 32%
• sensitivity of DMV CE to neighborhood function [Smith 06]:
• reported results use the best of eight neighborhood functions, DEL1ORTRANS1
• DEL1ORTRANS2 gives 51.2% accuracy
• the other six give accuracy of less than 50%
Generalized Expectation Criteria: Easy Communication with Domain Experts

• Inject domain knowledge into parameter estimation
• Like an “informative prior”...
• ...but rather than the “language of parameters” (difficult for humans to understand)
• ...use the “language of expectations” (natural for humans)
Use of Domain Knowledge
• “Expectations” are a natural language in which to express expertise.
• GE translates expectations into a parameter estimation objective.
• The expert has the knowledge. We must provide ML tools to integrate it safely.