Semi-supervised Learning with Generalized Expectation Criteria
Andrew McCallum
Computer Science Department, University of Massachusetts Amherst
Joint work with Gideon Mann and Greg Druck.
Successful Applications of ML
• sentiment classification: >80% accuracy classifying positive and negative reviews
• sequence labeling: >99% accuracy labeling research paper references
• dependency parsing: >90% accuracy on English news text
Substantial Human Annotation Required
• sentiment classification: >80% accuracy classifying positive and negative reviews, with 2000 labeled reviews
• sequence labeling: >99% accuracy labeling research paper references, with 500 labeled references
• dependency parsing: >90% accuracy on English news text, with the Penn Treebank, which took more than 3 years to annotate
Goal
• Problem: To apply machine learning to a new problem, a substantial amount of human annotation effort is required.
• Goal:
  – Reduce the amount of human effort required to learn an accurate model for a new task.
  – Provide a natural way to inject human domain knowledge.

People have domain knowledge. They need tools to naturally and safely incorporate that knowledge.
Supervised Learning
Decision boundary
Creation of labeled instances requires extensive human effort
What if labeled data is limited?
Small amount of labeled data
Semi-Supervised Learning:Labeled & Unlabeled data
Large amount of unlabeled data
Small amount of labeled data
Augment limited labeled data by using unlabeled data
More Semi-Supervised Algorithms than Applications
[Chart: number of papers per year, 1998-2006, on semi-supervised learning; algorithms vs. applications]
Compiled from [Zhu, 2007]
Weakness of Many Semi-Supervised Algorithms

Difficult to Implement: significantly more complicated than supervised counterparts
Fragile: meta-parameters hard to tune
Lacking in Scalability: O(n²) or O(n³) in the number of unlabeled examples
“EM will generally degrade [tagging] accuracy, except when only a limited amount of hand-tagged text is available.” [Merialdo, 1994]
“When the percentage of labeled data increases from 50% to 75%, the performance of [Label Propagation with Jensen-Shannon divergence] and SVM become almost same, while [Label Propagation with cosine distance] performs significantly worse than SVM.” [Niu, Ji, Tan, 2005]
Families of Semi-Supervised Learning

1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
Family 1: Expectation Maximization
[Dempster, Laird, Rubin, 1977]
Fragile -- often worse than supervised
Family 2: Graph-Based Methods
[Zhu, Ghahramani, 2002]
[Szummer, Jaakkola, 2002]
Lacking in scalability, Sensitive to choice of metric
Family 3: Auxiliary-Task Methods [Ando and Zhang, 2005]
Complicated to find appropriate auxiliary tasks
Family 4: Decision Boundary in Sparse Region

Transductive SVMs [Joachims, 1999]: sparsity measured by margin
Entropy Regularization [Grandvalet and Bengio, 2005]: ...by label entropy
Minimal Entropy Solution!
How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class!
[Bar chart: relative class sizes]

In fact we often have prior knowledge of the relative class proportions:
0.8 : Student
0.2 : Professor
How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class!
[Bar chart: relative class sizes]

In fact we often have prior knowledge of the relative class proportions:
0.1 : Gene Mention
0.9 : Background
How do we know the minimal entropy solution is wrong?
We suspect at least some of the data is in the second class!
[Bar chart: relative class sizes]

In fact we often have prior knowledge of the relative class proportions:
0.6 : Person
0.4 : Organization
Families of Semi-Supervised Learning

1. Expectation Maximization
2. Graph-Based Methods
3. Auxiliary Functions
4. Decision Boundaries in Sparse Regions
5. Generalized Expectation
Family 5: Generalized Expectation

[Bar chart: prior class proportions; decision boundary placed in the low density region]

Favor decision boundaries that match the prior.
Generalized Expectation
Simple: Easy to implement
Robust: Meta-parameters need little or no tuning
Scalable: Linear in number of unlabeled examples
Generalized Expectation: Special Cases

• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation: E[ f(x,y) ] (general case)
Label Regularization (LR)

The objective combines the supervised log-likelihood with an LR term: the KL-divergence between a prior distribution and an expected distribution over the unlabeled data.
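Concretely, the label regularization objective can be sketched as follows (notation consistent with the GE formulas later in the deck; the regularization weight λ and the labeled/unlabeled set names L and U are illustrative assumptions):

```latex
% Supervised log-likelihood plus a KL penalty that pulls the model's
% expected label distribution on unlabeled data U toward the prior.
O(\theta) = \sum_{(x,y) \in L} \log p(y \mid x; \theta)
          \;-\; \lambda \, D\!\left( \tilde{p}(y) \,\middle\|\, \hat{p}_\theta(y) \right),
\qquad
\hat{p}_\theta(y) = \frac{1}{|U|} \sum_{x \in U} p(y \mid x; \theta)
```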
LR Results for Classification
Accuracy by number of labeled examples:

                                       2       100      1000
SVM (supervised)                       -      55.41%   66.29%
Cluster Kernel SVM                     -      57.05%   65.97%
QC Smartsub                            -      57.68%   59.16%
Naïve Bayes (supervised)            52.42%    57.12%   64.47%
Naïve Bayes EM                      50.79%    57.34%   57.60%
Logistic Regression (supervised)    52.42%    56.74%   65.43%
Logistic Regression + Entropy Reg.  48.56%    54.45%   58.28%
Logistic Regression + GE            57.08%    58.51%   65.44%
Secondary Structure Prediction
XR (Expectation Regularization) Results for Classification: Sliding Window Model
CoNLL03 Named Entity Recognition Shared Task
XR Results for Classification: Sliding Window Model 2
BioCreativeII 2007 Gene/Gene Product Extraction
Noise in Prior Knowledge
What happens when users’ estimates of the class proportions are in error?
Noisy Prior Distribution

20% change in the probability of the majority class
CoNLL03 Named Entity Recognition Shared Task
Generalized Expectation: Special Cases

• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation: E[ f(x,y) ] (general case)
p(BASEBALL | “homerun”) = 0.95
An Alternative Style of Supervision: Classifying Baseball versus Hockey

Traditional: human labeling effort → (semi-)supervised training via maximum likelihood

Generalized Expectation: brainstorm a few keywords → semi-supervised training via generalized expectation
  hockey: puck, ice, stick
  baseball: ball, field, bat
  p(HOCKEY | “puck”) = 0.9
Labeling Features
[Figure: starting from ~1000 unlabeled examples, labeling features such as “hockey” and “baseball” yields 85% accuracy; labeling additional features (team names such as Oilers, Penguins, Maple Leafs, and sport terms such as puck, goal, batting) raises accuracy through 92% and 94.5% to 96%.]
Human Annotation Experiments
• Three annotators labeled 100 features and 100 documents.
• baseball vs. hockey
[Plot: testing accuracy (0.4 to 1.0) vs. labeling time in seconds (0 to 800), comparing GE and ER]
~2 minutes, 100 features labeled or skipped: 82% accuracy
~15 minutes, 100 documents labeled (or skipped): 78% accuracy
Human Annotation Experiments
• words all annotators labeled
Generalized Expectation: Special Cases

• Label Regularization: p(y)
• Expectation Regularization: p(y | feature)
• Generalized Expectation: E[ f(x,y) ] (general case)
Generalized Expectation (GE) criteria

• Definition: a parameter estimation objective function that expresses preferences on expectations of the model.
• Sometimes in the same equivalence class as:
  – Moment matching
  – Maximum likelihood
  – Maximum entropy

Objective = Score( E[ f(x,y) ] )

Not just moments; not necessarily matching a single target value.
Not necessarily p(data); preferences on a subset of model factors.
Based on constraints and expectations, but the parameterization is not required to match the constraints.

[McCallum, Mann, Druck 2007]
Generalized Expectation (GE)

• Generalized Expectation criteria are terms in a parameter estimation objective function that express a preference on the value of the model expectation of some function. [Mann and McCallum 07, Mann and McCallum 08, Druck et al. 08, Druck et al. 09]

\mathcal{G}(\theta) = S\left( E_{\tilde{p}(x)}\left[ E_{p(y|x;\theta)}[G(x,y)] \right] \right)

where G is the constraint function, p(y|x;\theta) is the model distribution (CRF), \tilde{p}(x) is the empirical distribution (unlabeled data), and S is the score function.
GE Score Functions

• Here, S measures some distance to a target expectation \tilde{G}.
• Model expectation: G_\theta = E_{\tilde{p}(x)}\left[ E_{p(y|x;\theta)}[G(x,y)] \right]
• Squared difference (L2): S_{sq}(G_\theta) = -(\tilde{G} - G_\theta)^2
• KL divergence: S_{kl}(G_\theta) = \tilde{G} \log G_\theta
• In some cases, a set of \tilde{G} values define a probability distribution.
• The sum of S_{kl} over all \tilde{G} then defines a negative cross-entropy (the entropy of \tilde{G} is constant).
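As a bridge back to label regularization (a sketch, not from the slides): take the constraint function to be a class indicator, so the model expectation is the mean predicted class distribution, and score each class against the prior proportion with S_{kl}:

```latex
G_c(x, y) = \mathbf{1}[y = c], \qquad
G_{\theta, c} = E_{\tilde{p}(x)}\big[ p(y = c \mid x; \theta) \big], \qquad
\sum_c S_{kl}(G_{\theta, c}) = \sum_c \tilde{p}(c) \log G_{\theta, c}
```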
Estimating Parameters with GE

• Objective function: O = \sum \mathcal{G}(\theta) + \log p(\theta)
• Maximize using gradient methods.
• Partial derivatives with respect to model parameters:

\frac{\partial}{\partial \theta_j} \mathcal{G}_{kl}(\theta) = \frac{\tilde{G}}{G_\theta}\, E_{\tilde{p}(x)}\Big[ E_{p(y|x;\theta)}[F_j(x,y)\,G(x,y)] - E_{p(y|x;\theta)}[F_j(x,y)]\, E_{p(y|x;\theta)}[G(x,y)] \Big]

\frac{\partial}{\partial \theta_j} \mathcal{G}_{sq}(\theta) = 2(\tilde{G} - G_\theta)\, E_{\tilde{p}(x)}\Big[ E_{p(y|x;\theta)}[F_j(x,y)\,G(x,y)] - E_{p(y|x;\theta)}[F_j(x,y)]\, E_{p(y|x;\theta)}[G(x,y)] \Big]

• The bracketed term is the predicted covariance between the constraint function G and the model feature function F_j.
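To make the gradient concrete, here is a minimal numpy sketch of the GE term and its gradient for label regularization with a multinomial logistic regression model and the squared-difference score. All names, shapes, and the model choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def label_reg_value_and_grad(theta, X, target):
    """GE term (squared-difference score) and gradient for label regularization.

    theta  : (d, k) logistic regression weights (hypothetical model)
    X      : (n, d) unlabeled instances
    target : (k,) prior class distribution G~, e.g. np.array([0.8, 0.2])

    The constraint function G(x, y) is the class indicator, so the model
    expectation G_theta is the mean predicted class distribution.
    """
    n = X.shape[0]
    p = softmax(X @ theta)            # p(y|x; theta), shape (n, k)
    g_theta = p.mean(axis=0)          # G_theta, shape (k,)
    value = -np.sum((target - g_theta) ** 2)

    # Chain rule: 2 (G~ - G_theta) times dG_theta/dtheta, where dG_theta/dtheta
    # is the model covariance between input features and label indicators.
    a = 2.0 * (target - g_theta)      # shape (k,)
    s = p @ a                         # per-instance expectation of a_y, shape (n,)
    grad = X.T @ (p * (a[None, :] - s[:, None])) / n   # shape (d, k)
    return value, grad
```

In practice this term would be added to the supervised log-likelihood and its gradient, then maximized with any gradient-based optimizer.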
GE in Practice
• Active Learning with GE
• GE for Dependency Parsing
Active Learning
Active Learning by Labeling Features

The algorithm queries for feature labels and allows skipping queries.
Query Selection: Expected Information Gain

• Ideal criterion: expected reduction in model uncertainty after labeling feature k:

\Phi_{EIG}(q_k) = E_{p(g|q_k)}\Big[ E_{\tilde{p}(x)}\big[ H(p(y|x;\theta)) - H(p(y|x;\theta_g)) \big] \Big]

• \theta_g are the new parameter estimates after receiving a particular labeling g.
• Drawback: this requires re-training the model with every possible labeling for every feature.
• Solution: approximate the expected reduction in uncertainty.
Marginal Uncertainty Query Selection

• Approximation: the reduction in uncertainty is ≈ the uncertainty of the marginal distributions at the positions where the feature occurs. (A code sketch follows this list.)

• Total uncertainty:
\Phi_{TU}(q_k) = \sum_i \sum_j q_k(x_i, j)\, H(p(y_j | x_i; \theta))

• Weighted uncertainty, which addresses bias towards features that are too frequent or too infrequent (C_k is the occurrence count of feature k):
\Phi_{WU}(q_k) = \log(C_k)\, \frac{\Phi_{TU}(q_k)}{C_k}

• Diverse uncertainty, which chooses uncertain features that appear in diverse contexts (C is the set of previously labeled feature queries):
\Phi_{DU}(q_k) = \Phi_{TU}(q_k)\, \frac{1}{|C|} \sum_{j \in C} \big(1 - \mathrm{sim}(q_k, q_j)\big)
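Here is a minimal sketch of the total and weighted uncertainty scores, assuming the reconstruction of Φ_WU above (log-count times mean uncertainty); the occurrence mask and marginal array are illustrative assumptions:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of an array of distributions."""
    return -np.sum(p * np.log(p + eps), axis=-1)

def total_uncertainty(occurs, marginals):
    """Phi_TU: summed entropy of the model's marginal label distributions
    at the positions where candidate feature q_k occurs.

    occurs    : (num_positions,) boolean mask of occurrences of q_k
    marginals : (num_positions, num_labels) marginals p(y_j | x; theta)
    """
    return entropy(marginals)[occurs].sum()

def weighted_uncertainty(occurs, marginals):
    """Phi_WU = log(C_k) * Phi_TU / C_k: log-count times mean uncertainty,
    which keeps very frequent or very rare features from dominating."""
    c_k = int(occurs.sum())
    if c_k == 0:
        return 0.0
    return np.log(c_k) * total_uncertainty(occurs, marginals) / c_k
```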
Other Query Selection Baselines

• Methods that are active but do not require re-training:

• Coverage (dissimilarity from other labeled features):
\Phi_{cov}(q_k) = C_k\, \frac{1}{|C|} \sum_{j \in C} \big(1 - \mathrm{sim}(q_k, q_j)\big)

• Similarity (similarity to other labeled features):
\Phi_{sim}(q_k) = C_k\, \max_{j \in C} \mathrm{sim}(q_k, q_j)

• Passive baselines: random, frequency, LDA (top words in each topic [Druck et al. 08])
Related Work
• Tandem Learning [Raghavan & Allan 07]
• does not generalize to structured outputs in a straightforward way
• similarity query selection inspired by their method, but performs poorly
• Active Measurement Selection [Liang et al. 09]
• query selection closely related to EIG, but too slow for real experiments
• does not consider skipping queries; limited evaluation; no human experiments
• Dual Supervision [Sindhwani et al. 09]
• does not generalize to structured outputs in a straightforward way
• uses certainty query selection (because a method similar to expectation uncertainty does not work well)
“Simulated” Annotation Experiments
• Simulated annotator labeling instances:
  • provides the true labels
• Simulated annotator labeling features (see the sketch below):
  • skip or label?
    • labels a feature if the entropy of its distribution over labels is ≤ 0.7
    • skips otherwise
  • which labels to assign?
    • assigns the max-probability label, as well as any label whose probability is at least half as large [Druck et al. 08]
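The skip-or-label rule above is simple enough to state in a few lines; here is a sketch of the simulated feature annotator (the input representation is an illustrative assumption):

```python
import numpy as np

def simulated_feature_annotator(label_dist, entropy_threshold=0.7):
    """Return the label indices a simulated annotator assigns to a feature,
    or None to skip the query.

    label_dist : (k,) distribution over labels for this feature, estimated
                 from the fully labeled data (the oracle).
    """
    p = np.asarray(label_dist, dtype=float)
    h = -np.sum(p * np.log(p + 1e-12))
    if h > entropy_threshold:
        return None                    # too ambiguous: skip the query
    best = p.max()
    # Max-probability label, plus any label at least half as probable.
    return [i for i, pi in enumerate(p) if pi >= 0.5 * best]
```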
Experimental Setup
• instance active learning
• query selection: random (rand), uncertainty sampling [Lewis and Gale 94] (US), information density [Settles and Craven 08] (ID)
• training: maximum likelihood + entropy regularization [Jiao 06] (ER)
• feature active learning
• query selection: random, frequency, LDA, coverage (cov), similarity (sim), expectation uncertainty (EU), total uncertainty (TU), weighted uncertainty (WU), diverse uncertainty (DU)
• training: maximum marginal likelihood (MML), GE
• limit candidate set to 500 most frequent features
Experimental Setup
• data sets:
• apartments: 300 Craigslist apartment postings, 11 labels
• cora reference extraction: 500 research paper references, 13 labels
• setup:
• each experiment simulates 10 minutes of annotation time
• Measured annotation times for labeling actions, in seconds:
  [Chart: times for “label token”, “skip feature”, and “label feature”, on a 0-4 second scale]
Simulated Experiments Results
• Active learning with labeled features using GE training outperforms:
• passive or active learning with labeled features using MML
• passive learning with labeled features using GE
• active and passive learning with instances
• Uncertainty based query selection methods generally outperform others.
Simulated Experiments Results
• on cora, GE + weighted uncertainty outperforms random after 5 minutes of annotation
“Grid” Interface
• Feature queries are concise and easy to browse.
• Suggests new interfaces in which groups of queries are displayed, rather than the user being asked to answer one query at a time.
• “Grid” interface:
• displays small groups of related (distributional similarity) features
• may reduce annotation time because features in the same group are likely to have the same label
• within groups, features sorted by query selection metric
• groups that are more uncertain displayed closer to the top
Feature Active Learning Grid Interface
Human Active Learning Experiments
• human active learning experiments:
• labeling instances (with fast labeling interface)
• labeling features with the serial interface
• labeling features with the grid interface
• 5 two-minute sessions per annotator per experiment.
Human Annotation Experiments: User 1
Human Annotation Experiments: User 2
Human Annotation Experiments: User 3
Conclusions
• Developed an active learning method in which the user is asked to “label features” instead of labeling instances.
• Outperforms:
• active and passive learning with instances
• passive learning with labeled features
• Suggests new user interfaces that may allow more efficient annotation.
Dependency Parsing
Problem and Motivation
• Suppose we are given, for some language: raw text, and prior knowledge of dependency syntax.
• How do we efficiently learn a dependency parser?
• Why is this important?
• There are low-density languages and sub-domains of languages for which there are no syntactically annotated corpora.
Supervised Solution
• Supervised solution:
• Problem: Syntactic annotation is costly.
• Penn English Treebank: 6 years of development (1989 - 1995)
• Chinese Treebank: 9 years of development (1998 - 2007)
• Arabic Treebank: 2 years of development (2001-2003)
[Diagram: raw text → treebank annotation → supervised learning → parser]
Traditional Semi-Supervised Solution
• Semi-supervised solution:
• Problem: Costs of developing annotation guidelines may dominate total annotation time early in development.
[Diagram: raw text → seed treebank annotation → semi-supervised learning with additional raw text → parser]

Examples: entropy regularization [Smith & Eisner 07], Brown clustering [Koo et al. 08], self-training [McClosky et al. 06], convex loss on unlabeled data [Wang et al. 08]
Unsupervised Solution
• Unsupervised solution:
• Possible approaches:
• Dependency model with Valence (DMV) + EM [Klein & Manning 04]
• Contrastive Estimation (CE) [Smith & Eisner 05]
• Others: [Smith & Eisner 06], [Bod 06], [Seginer 07]
[Diagram: raw text + prior knowledge of dependency syntax → unsupervised learning → parser]
Unsupervised Solution
• Unsupervised solution:
• Problem: Requires some limited prior knowledge, but incorporates this information in a cumbersome way.
• developing model structure, tweaking learning algorithm, clever parameter initialization, hyperparameter tuning, devising neighborhood function [Smith & Eisner 05]
[Diagram: raw text + prior knowledge of dependency syntax → unsupervised learning → parser]
Our Approach

• Our approach: encode prior knowledge directly with model expectation constraints, and use them to learn a feature-rich parser.
• Example constraints:
  • A DT should attach to a NN directly to the right 90% of the time.
  • The parent of a VBD is the ROOT 75% of the time.
• How do we estimate model parameters with such expectation constraints?
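One way to picture such constraints (a hypothetical encoding, not the authors' format) is as pairs of a constraint function signature and a target expectation:

```python
# (child POS, head POS, attachment direction) -> target expectation
constraints = {
    ("DT", "NN", "right"): 0.90,   # a DT attaches to a NN directly to the right
    ("VBD", "ROOT", None): 0.75,   # the parent of a VBD is the ROOT
}
```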
Non-Projective Dependency Tree CRF

• x is the input sentence, i.e. x_i is the word at position i.
• y is a non-projective tree represented as a vector of parent indices, i.e. y_i is the index of the parent word of word i.
• A CRF models the probability of tree y given sentence x:

p(y | x; \theta) = \frac{1}{Z_x} \exp\left( \sum_{i=1}^{n} \sum_j \theta_j f_j(x_i, x_{y_i}, x) \right)

• \theta are the model parameters.
• f_j are edge-factored feature functions, i.e. they consider the entire input x and a single edge y_i → i.
• Z_x is the partition function: the sum of the scores of all possible trees for x.
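For edge-factored models like this, Z_x can be computed in O(n³) with the Matrix-Tree theorem. Below is a minimal numpy sketch of one common construction (root potentials substituted into the first row of the Laplacian); the potential-matrix representation is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def log_partition(edge_potentials, root_potentials):
    """log Z_x for a non-projective dependency tree CRF via the
    Matrix-Tree theorem (O(n^3)).

    edge_potentials : (n, n) array; entry [h, m] holds
                      exp(sum_j theta_j f_j(x_m, x_h, x)) for head h -> modifier m
    root_potentials : (n,) potentials for attaching each word to ROOT
    """
    A = edge_potentials.copy()
    np.fill_diagonal(A, 0.0)
    # Kirchhoff / Laplacian matrix: diagonal holds total incoming potential,
    # off-diagonal entries hold negated edge potentials.
    L = -A
    np.fill_diagonal(L, A.sum(axis=0))
    # Overwrite the first row with root potentials; the determinant then
    # sums the potentials of all spanning trees rooted at ROOT.
    L[0, :] = root_potentials
    sign, logdet = np.linalg.slogdet(L)
    return logdet
```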
Experiments
Simulated “Oracle” Constraints
• In some experiments, prior knowledge is simulated using an “oracle” that looks at labeled data.
• Oracle constraint selection uses a few simple statistics:
• Count: c(g) = \sum_{x \in D} \sum_i \sum_j g(x_i, x_j, x)
• Edge count: c_{edge}(g) = \sum_{(x,y) \in D} \sum_i g(x_i, x_{y_i}, x)
• Edge probability: p(\mathrm{edge} \mid g) = c_{edge}(g) / c(g)
• Target expectations are true edge probabilities, binned into the set: [0, 0.1, 0.25, 0.5, 0.75, 1]
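These statistics are straightforward to compute from a labeled corpus; a sketch follows, with the data representation (POS tag lists and parent index arrays) as an illustrative assumption:

```python
def oracle_edge_probability(corpus, g):
    """p(edge | g) = c_edge(g) / c(g) for a candidate constraint function g.

    corpus : list of (tags, parents) pairs; parents[i] is the head index of word i
    g      : g(tags, child, head) -> bool, e.g.
             lambda t, c, h: t[c] == "DT" and t[h] == "NN" and h == c + 1
    """
    count = edge_count = 0
    for tags, parents in corpus:
        for child in range(len(tags)):
            for head in range(len(tags)):
                if head == child or not g(tags, child, head):
                    continue
                count += 1                      # c(g): g fires on this pair
                if parents[child] == head:
                    edge_count += 1             # c_edge(g): pair is a true edge
    return edge_count / count if count else 0.0
```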
Comparison with Unsupervised Methods
Corpus:
• WSJ10: WSJ portion of Penn Treebank, stripped of punctuation, sentences of 10 words or fewer, only POS tags (unlexicalized)
Models:
• DMV [Klein and Manning 04]: does not model distance, can model arity and sibling relationships.
• Restricted CRF: only features of type (parent-POS ∧ child-POS ∧ direction). Weaker than DMV (ignoring projective vs non-projective).
• CRF: [McDonald et al. 05] features. models distance, but still unlexicalized.
• baseline: assigns target expectations as scores to edges with constraints, runs MST with the resulting scores
Comparison with Unsupervised Methods
• parameter estimation methods:
• DMV (results from [Smith 06]):
• expectation maximization (EM)
• contrastive estimation (CE)
• restricted CRF / CRF:
• supervised maximum likelihood (upper bound)
• generalized expectation (GE)
Oracle Expectation Constraints
• constraint selection: sort functions (parent-POS ∧ child-POS ∧ direction) with count ≥ 200 by edge probability
• first 20 constraints selected:
• POS tags in sentence order, head → modifier, grouped by head
Human Provided Expectation Constraints
• Constraints extracted from grammar fragments in [Haghighi & Klein 06]
• Target expectations provided using (limited!) prior knowledge
• Based on the model's output, refined target expectations and added new constraints
GE vs. Supervised & Baseline
• GE outperforms the baseline
• human constraints provide accuracy comparable to oracle
• GE performs much better in conjunction with feature-rich model
[Plot: accuracy (10 to 90) vs. number of constraints (10 to 60) for: constraint baseline, CRF restricted supervised, CRF supervised, CRF restricted GE, CRF GE, CRF GE human]
GE vs. Unsupervised
• GE outperforms DMV EM with 10 or 20 (restricted CRF) constraints.
• GE outperforms DMV CE with 50 (restricted CRF) or 20 constraints
[Plot: accuracy (10 to 80) vs. number of constraints (10 to 60) for: attach-right baseline, DMV EM, DMV CE, CRF restricted GE, CRF GE, CRF GE human]
Sensitivity of Unsupervised Methods
• sensitivity of DMV EM to initialization [Smith 06]:
• reported results use best of three parameter initialization methods, the method of [Klein & Manning 04]
• others give accuracies lower than 32%
• sensitivity of DMV CE to neighborhood function [Smith 06]:
• reported results use the best of eight neighborhood functions, DEL1ORTRANS1
• DEL1ORTRANS2 gives 51.2% accuracy
• the other six give accuracy of less than 50%
Generalized Expectation Criteria: Easy Communication with Domain Experts

• Inject domain knowledge into parameter estimation
• Like an “informative prior”...
• ...but rather than the “language of parameters” (difficult for humans to understand)
• ...use the “language of expectations” (natural for humans)
Use of Domain Knowledge
• “Expectations” are a natural language in which to express expertise.
• GE translates expectations into a parameter estimation objective.
• The expert has the knowledge. We must provide ML tools to integrate it safely.