Efficient Decomposed Learning for Structured Prediction
Rajhans Samdani, joint work with Dan Roth
University of Illinois at Urbana-Champaign
Page 1
Structured Prediction
Structured prediction: predicting a structured output variable y based on the input variable x
y = {y1, y2, …, yn}: the variables form a structure
Structure comes from interactions between the output variables through mutual correlations and constraints
Such problems occur frequently in:
NLP – e.g. predicting the tree-structured parse of a sentence, predicting the entity-relation structure from a document
Computer vision – scene segmentation, body-part identification
Speech processing – capturing relations between phonemes
Computational biology – protein folding and interactions between different sub-structures
Etc.
Page 2
Example Problem: Information Extraction
Given citation text, extract author, booktitle, title, etc.
“Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages…”
Given ad text, extract features, size, neighborhood, etc.
“Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines …”
Structure introduced by correlations between words, e.g. if treated as sequence tagging
Structure is also introduced by declarative constraints that define the set of feasible assignments
E.g. the ‘author’ tokens are likely to appear together in a single block
A paper should have at most one ‘title’
Page 3
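Constraints like the two above are easy to state as code. A minimal sketch of checking them on a predicted tag sequence (illustrative only; the list-of-tags representation and function names are not from the talk):

```python
def satisfies_constraints(tags):
    """Check two declarative constraints on a citation tag sequence:
    (1) the 'author' tokens form a single contiguous block;
    (2) there is at most one 'title' segment."""
    def num_segments(label):
        # Count maximal contiguous runs of `label` in the sequence.
        runs, prev = 0, None
        for t in tags:
            if t == label and prev != label:
                runs += 1
            prev = t
        return runs
    return num_segments("author") <= 1 and num_segments("title") <= 1
```

At inference time, a check like this would mark an assignment as feasible or infeasible, carving the set Y of allowed structures out of all possible taggings.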
Example problem: Body Part Identification
Count the number of people
Predict the body parts
Correlations:
Position of shoulders and heads correlated
Position of torso and legs correlated
Page 4
Structured Prediction: Inference
Predict variables in y = {y1, y2, …, yn} ∈ Y together to leverage dependencies (e.g. entity-relation, shoulders-head, information fields, document labels, etc.) between these variables
Inference constitutes predicting the best scoring structure: y* = argmax over y ∈ Y of f(x, y)
f(x, y) = w · φ(x, y) is called the scoring function
φ(x, y): features on the input-output pair
w: weight parameters (to be estimated during learning)
Y: set of allowed structures, often specified by constraints
Page 5
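To make the argmax concrete, here is a brute-force inference sketch over binary outputs (illustrative, not the talk's algorithm: enumeration is exponential in n, which is exactly why real structured inference is hard):

```python
from itertools import product

def argmax_inference(w, phi, x, n, feasible):
    """Score every feasible assignment of the n binary output variables
    and return the best one: y* = argmax over y in Y of w . phi(x, y)."""
    best_y, best_score = None, float("-inf")
    for y in product([0, 1], repeat=n):
        if not feasible(y):
            continue
        score = sum(wi * fi for wi, fi in zip(w, phi(x, y)))  # w . phi(x, y)
        if score > best_score:
            best_y, best_score = y, score
    return best_y

# Toy example (hypothetical features): phi returns y itself,
# so inference picks y_i = 1 exactly where w_i > 0.
w = [1.0, -2.0, 0.5]
y_star = argmax_inference(w, lambda x, y: list(y), None, 3, lambda y: True)
```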
Structural Learning: Quick Overview
Consider a big monolithic structured prediction problem
Given labeled data pairs (xj, yj = {yj1, yj2, …, yjn}),
how do we learn w and perform inference?
Page 6
[Figure: a structured prediction problem over interacting output variables y1…y6]
Learning w: Two Extreme Styles
Global Learning (GL): consider all the variables together
(Collins '02; Taskar et al. '04; Tsochantaridis et al. '04)
Local Learning (LL): ignore hard-to-learn structural aspects, e.g. global constraints; consider variables in isolation
(Punyakanok et al. '05; Roth and Yih '05; Koo et al. '10; …)
Page 7
[Figure: GL learns over the full graph of y1…y6; LL learns over the variables in isolation]
LL+C: apply constraints, if available, only at test-time inference
Trade-off: GL is expensive; LL/LL+C is inconsistent with the global structure
Our Contribution: Decomposed Learning
We consider learning with subsets of variables at a time
We give conditions under which this decomposed learning is provably identical to global learning, and demonstrate the advantage of our learning paradigm experimentally.
Page 8
[Figure: decomposed learning varies small subsets of y1…y6, enumerating the 0/1 assignments within each subset while fixing the rest]
Related work: Pseudolikelihood (Besag '77); Piecewise Pseudolikelihood (Sutton and McCallum '07); Pseudomax (Sontag et al. '10)
Outline
Existing global structural learning algorithms
Decomposed Learning (DecL): efficient structural learning (intuition, formalization)
Theoretical properties of DecL
Experimental evaluation
Page 10
Supervised Structural Learning
We focus on structural-SVM-style algorithms, which learn w by minimizing a regularized structured hinge loss
Literature: Taskar et al. '04; Tsochantaridis et al. '04
min over w: (λ/2)‖w‖² + Σj max over y ∈ Y of [ f(xj, y; w) + Δ(yj, y) − f(xj, yj; w) ]
(structured hinge loss: score of a non-ground-truth y plus the loss-based margin Δ(yj, y), minus the score of the ground truth yj)
Page 11
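A sketch of this objective's per-example term, assuming an enumerable output space and Hamming distance as the loss Δ (names and the toy scoring function are illustrative, not from the talk):

```python
from itertools import product

def hamming(y_gold, y):
    """Delta(y_gold, y): number of positions where the labelings disagree."""
    return sum(a != b for a, b in zip(y_gold, y))

def structured_hinge(w, phi, x, y_gold, outputs):
    """max over y of [ f(x,y;w) + Delta(y_gold,y) ] - f(x,y_gold;w), floored at 0.
    The max is the loss-augmented inference step that GL runs over all of Y."""
    def f(y):
        return sum(wi * fi for wi, fi in zip(w, phi(x, y)))
    worst = max(f(y) + hamming(y_gold, y) for y in outputs)
    return max(0.0, worst - f(y_gold))

# Toy instance over 2 binary variables: phi ignores x and returns y itself.
all_y = list(product([0, 1], repeat=2))
loss = structured_hinge([2.0, 2.0], lambda x, y: list(y), None, (1, 1), all_y)
```

With w = [2, 2] the gold output (1, 1) beats every competitor by at least its Hamming distance, so the hinge loss is zero; a zero weight vector would instead pay the full margin.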
Limitations of Global Learning
Global inference over all the variables as an intermediate step of learning
Expressive models don't admit exact and efficient (poly-time) inference algorithms, e.g. HMMs with global constraints, arbitrary pairwise Markov networks
Hence Global Learning is expensive for expressive features φ(x, y) and constraints (y ∈ Y)
The problem is using inference as a black box during learning
Our proposal: change the inference-during-learning to inference over a smaller output space: decomposed inference for learning
Page 13
Outline
Existing structural learning algorithms
Decomposed Learning (DecL): efficient structural learning (intuition, formalization)
Theoretical properties of DecL
Experimental evaluation
Page 15
Decomposed Structural Learning (DecL)
GENERAL IDEA: for (xj, yj), reduce the argmax inference from the intractable output space Y to a “neighborhood” around yj: nbr(yj) ⊆ Y
Small and tractable nbr(yj) ⇒ efficient learning
Use domain knowledge to create neighborhoods which preserve the structure of the problem
Page 16
[Figure: nbr(y) is a small, tractable subset of Y, which itself sits inside the full space {0,1}^n of the n outputs in y]
Neighborhoods via Decompositions
Generate nbr(yj) by varying a subset of the output variables, while fixing the rest of them to their gold labels in yj…
…and repeat the same for different subsets of the output variables
A decomposition is a collection of different (non-inclusive, possibly overlapping) sets of variables which vary together:
Sj = {s1, …, sl | ∀ i, si ⊆ {1, …, n}; ∀ i, k, si ⊄ sk}
Inference could be exponential in the size of the sets: smaller set sizes yield efficient learning
Under some conditions, DecL with smaller set sizes is identical to Global Learning
Page 17
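The neighborhood construction above can be sketched directly for binary variables (illustrative code; names are not from the talk):

```python
from itertools import product

def neighborhood(y_gold, decomposition):
    """Generate nbr(y_j): for each set s in the decomposition, vary the
    variables indexed by s over all 0/1 assignments while clamping every
    other variable to its gold label. The union over all sets (which
    always contains the gold labeling itself) is the neighborhood."""
    nbr = {tuple(y_gold)}
    for s in decomposition:
        for vals in product([0, 1], repeat=len(s)):
            y = list(y_gold)
            for i, v in zip(s, vals):
                y[i] = v
            nbr.add(tuple(y))
    return nbr

# Varying {y0, y1} jointly and {y2} alone around gold (1, 0, 1):
nbr = neighborhood([1, 0, 1], [[0, 1], [2]])
```

Note the size: 2^|s| assignments per set instead of 2^n overall, which is what makes the inference-during-learning tractable.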
Creating Decompositions
Allow different decompositions Sj for different training instances yj
Aim to get results close to doing exact inference: we need decompositions which yield exactness (next few slides)
Example: learning with decompositions in which all subsets of size k are considered: DecL-k
DecL-1 is the same as Pseudomax (Sontag et al. '10), which is similar to Pseudolikelihood (Besag '77) learning
In practice, decompositions should be based on domain knowledge – put highly coupled variables in the same set
Page 19
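A DecL-k decomposition is then just the collection of size-k subsets of the variables (sketch; the function name is illustrative):

```python
from itertools import combinations

def decl_k_decomposition(n, k):
    """DecL-k: the decomposition containing every size-k subset of the n
    output variables. DecL-1 varies one variable at a time (Pseudomax)."""
    return [list(s) for s in combinations(range(n), k)]
```

E.g. `decl_k_decomposition(3, 1)` gives `[[0], [1], [2]]`, the one-variable-at-a-time decomposition.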
Outline
Existing structural learning algorithms
DecL: efficient decomposed structural learning (intuition, formalization)
Theoretical results: exactness
Experimental evaluation
Page 20
Theoretical Results: Assume Separability
Ideally we want Decomposed Learning with decompositions having small sets to give the same results as Global Learning
For analyzing the equivalence between DecL and GL, we assume that the training data is separable
Separability: existence of a set of weights W* that satisfy
W* = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀ y ∈ Y}
Separating weights for DecL:
Wdecl = {w* | f(xj, yj; w*) ≥ f(xj, y; w*) + Δ(yj, y), ∀ y ∈ nbr(yj)}
Naturally: W* ⊆ Wdecl
(each constraint compares the score of the ground truth yj against the score of a non-ground-truth y plus the loss-based margin Δ)
Page 21
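Stated as code, the two separating sets differ only in which candidate outputs they quantify over. A sketch (function names and the toy scoring function are illustrative):

```python
def separates(w, f, delta, y_gold, candidates):
    """Check the separability condition for one training example:
    f(y_gold; w) >= f(y; w) + delta(y_gold, y) for every candidate y.
    Passing all of Y checks membership in W*; passing only nbr(y_gold)
    checks membership in Wdecl (hence W* is always a subset of Wdecl)."""
    g = f(y_gold, w)
    return all(g >= f(y, w) + delta(y_gold, y)
               for y in candidates if y != y_gold)

# Toy check with dot-product scores and Hamming loss over 2 binary variables:
f = lambda y, w: sum(a * b for a, b in zip(y, w))
delta = lambda yg, y: sum(a != b for a, b in zip(yg, y))
all_y = [(0, 0), (0, 1), (1, 0), (1, 1)]
```

Exactness asks when shrinking `candidates` from Y to nbr(yj) does not admit any new weight vectors.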
Theoretical Results: Exactness
The property we desire is Exactness: Wdecl = W*
Exactness is a property of the constraints, the ground truth labelings yj, and the globally separating weight set W*
Different from asymptotic consistency results of Pseudolikelihood/Pseudomax!
Exactness much more useful – learning with DecL yields the same weights as GL
Main theorem in the paper provides a general exactness condition
Page 22
One Example of Exactness: Pairwise Markov Networks
Scoring function defined over a graph with edges E: f(x, y; w) = Σi φi(yi, x; w) + Σ(i,k)∈E φi,k(yi, yk, x; w), with singleton/vertex components φi and pairwise/edge components φi,k
Assume domain knowledge on W*: we know that for correct (separating) w, each φi,k(·; w) is either
Submodular: φi,k(0,0) + φi,k(1,1) > φi,k(0,1) + φi,k(1,0)
OR Supermodular: φi,k(0,0) + φi,k(1,1) < φi,k(0,1) + φi,k(1,0)
Page 26
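The two conditions are directly checkable for a given pairwise potential (sketch; names are illustrative):

```python
def is_submodular(phi):
    """phi(a, b) scores a pair of binary labels. Submodular (as used here):
    the agreeing assignments score strictly higher in total than the
    disagreeing ones, i.e. the edge prefers its endpoints to agree."""
    return phi(0, 0) + phi(1, 1) > phi(0, 1) + phi(1, 0)

def is_supermodular(phi):
    """The reverse strict inequality: the edge prefers disagreement."""
    return phi(0, 0) + phi(1, 1) < phi(0, 1) + phi(1, 0)

# Two extreme example potentials:
agree = lambda a, b: 1.0 if a == b else 0.0
disagree = lambda a, b: 1.0 if a != b else 0.0
```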
Decomposition for PMNs
Define Ej = {(i, k) ∈ E : yji = yjk if φi,k is submodular; yji ≠ yjk if φi,k is supermodular}, i.e. the edges whose gold labels agree with the potential's preference (for such an edge, flipping both endpoints together preserves the pairwise bonus, so the endpoints must be varied jointly)
Theorem: the Spair decomposition consisting of the connected components of Ej yields Exactness
Page 27
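Once Ej is fixed, its connected components are cheap to compute. A small union-find sketch of that step (illustrative, not the talk's code):

```python
def connected_components(n, edges):
    """Union-find over n nodes; returns the connected components of the
    graph (V, edges). Applied to Ej, these components are the variable
    sets of the S_pair decomposition."""
    parent = list(range(n))
    def find(a):
        # Walk to the root, halving the path as we go.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in edges:
        parent[find(u)] = find(v)  # union the two components
    comps = {}
    for i in range(n):
        comps.setdefault(find(i), []).append(i)
    return sorted(comps.values())
```

E.g. with edges {(0,1), (1,2), (3,4)} over 5 variables, the decomposition sets are {0,1,2} and {3,4}.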
Outline
Existing structural learning algorithms
DecL: efficient decomposed structural learning (intuition, formalization)
Theoretical properties of DecL
Experimental evaluation
Page 28
Experiments
Experimentally compare Decomposed Learning (DecL) to Global Learning (GL), Local Learning (LL), and Local Learning + Constraints (LL+C: constraints, if available, applied during test-time inference)
Study the robustness of DecL in conditions where our theoretical assumptions may not hold
Page 29
Synthetic Experiments
Experiments on random synthetic data with 10 binary variables
Labels assigned with random singleton scoring functions and random linear constraints
[Figure: avg. Hamming loss vs. no. of training examples, comparing the Local Learning (LL) baselines, Global Learning (GL) and Dec. Learning (DecL)-2,3, and DecL-1 (aka Pseudomax)]
Page 30
Multi-label Document Classification
Experiments on multi-label document classification
Documents with multiple labels: corn, crude, earn, grain, interest, …
Modeled as a Pairwise Markov Network over a complete graph over all the labels – singleton and pairwise components
LL – local learning baseline that ignores pairwise interactions
Page 31
Results: Per Instance F1 and training time (hours)
[Figure: per-instance F1 (axis 75 to 83) and training time (axis 0 to 140 hours) for Local Learning (LL), Global Learning (GL), Dec. Learning-2 (DecL-2), and Dec. Learning-3 (DecL-3)]
Page 32
Results: Per Instance F1 and training time (hours)
[Figure: per-instance F1 (axis 50 to 85) and training time (axis 0 to 140 hours) for LL, GL, DecL-2, DecL-3, and Dec. Learning-1 (DecL-1)]
Page 33
Example Problem: Information Extraction
Given citation text, extract author, booktitle, title, etc.Marc Shapiro and Susan Horwitz. Fast and accurate flow-insensitive points-to analysis. In Proceedings of the 24th Annual ACM Symposium on Principles of Programming Languages….
Given ad text, extract features, size, neighborhood, etc.
Spacious 1 bedroom apt. newly remodeled, includes dishwasher, gated community, near subway lines …
Constraints like: the ‘title’ tokens are likely to appear together in a single block; a paper should have at most one ‘title’
Page 34
Information Extraction: Modeling
Modeled as an HMM with additional constraints; the constraints make inference with the HMM hard
Local Learning (LL) in this case is HMM with no constraints
Domain knowledge: the HMM transition matrix is likely to be diagonal-heavy, a generalization of submodular pairwise potentials for Pairwise Markov Networks ⇒ use decomposition Spair
Bottom line: DecL is 2 to 8 times faster than GL and gives the same accuracies
Page 35
Citation Info. Extraction: Accuracy and Training Time
[Figure: F1 (axis 70 to 100) and training time (axis 0 to 45 hours) for Cit-HMM/LL, Cit-LL+C, Cit-GL, and Cit-DecL]
Page 36
Ads. Info. Extraction: Accuracy and Training Time
[Figure: F1 (axis 70 to 82) and training time (axis 0 to 80 hours) for Ads-HMM/LL, Ads-LL+C, Ads-GL, and Ads-DecL]
Page 37
Take Home: Efficient Structural Learning with DecL
We presented Decomposed Learning (DecL): efficient learning by reducing the inference to a small output space
Exactness: Provided conditions for when DecL is provably identical to global structural learning (GL)
Experiments: DecL performs as well as GL on real-world data, with a significant cost reduction (50% to 90% reduction in training time)
QUESTIONS?
Page 38