A Tutorial on
Graph-based Semi-Supervised Learning Algorithms for Speech and Spoken Language Processing
Partha Pratim Talukdar (Carnegie Mellon University)
Amarnag Subramanya (Google Research)
http://graph-ssl.wikidot.com/
ICASSP 2013, Vancouver
Supervised Learning
Labeled Data → Learning Algorithm → Model
Examples: Decision Trees, Support Vector Machine (SVM), Maximum Entropy (MaxEnt)
Semi-Supervised Learning (SSL)
Labeled Data + A Lot of Unlabeled Data → Learning Algorithm → Model
Examples: Self-Training, Co-Training
Why SSL?
How can unlabeled data be helpful?
Example from [Belkin et al., JMLR 2006]: with only the labeled instances, the learner commits to a decision boundary determined by those few points; adding unlabeled instances reveals the underlying structure of the data and yields a more accurate decision boundary.
Inductive vs Transductive

                                  Supervised (Labeled)    Semi-supervised (Labeled + Unlabeled)
  Inductive                       SVM, Maximum Entropy    Manifold Regularization
  (generalizes to unseen data)
  Transductive                    ✗                       Label Propagation, MAD, MP, TACO, ...
  (doesn't generalize                                     ← focus of this tutorial
   to unseen data)

Most graph SSL algorithms are non-parametric (i.e., the number of parameters grows with data size).
See Chapter 25 of the SSL Book: http://olivier.chapelle.cc/ssl-book/discussion.pdf
Why Graph-based SSL?
- Some datasets are naturally represented by a graph: the web, citation networks, social networks, ...
- Uniform representation for heterogeneous data
- Easily parallelizable, scalable to large data
- Effective in practice
[Figure: text classification accuracy, comparing Supervised, Non-Graph SSL, and Graph SSL]
Graph-based SSL
[Figure: nodes are instances, edges carry similarity weights (0.6, 0.2, 0.8); seed-labeled nodes (/m/, /f/) propagate their labels to unlabeled nodes]
Graph-based SSL
Smoothness assumption: if two instances are similar according to the graph, then their output labels should be similar.
Two stages:
- Graph construction (if not already present)
- Label inference
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Graph Construction
- Neighborhood methods: k-NN graph construction (k-NNG), ε-neighborhood method
- Metric learning
- Other approaches
Neighborhood Methods
- k-Nearest Neighbor Graph (k-NNG): add edges between an instance and its k nearest neighbors (e.g., k = 3)
- ε-neighborhood: add edges to all instances inside a ball of radius ε
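As a concrete sketch of k-NNG construction (not from the tutorial; the Gaussian weighting, the `sigma` parameter, and the max-symmetrization step are assumptions):

```python
import numpy as np

def knn_graph(X, k=3, sigma=1.0):
    """Build a symmetrized k-NN graph with Gaussian edge weights.

    X: (n, d) array of instances; returns an (n, n) weight matrix W.
    """
    n = X.shape[0]
    # All pairwise squared Euclidean distances: quadratic in n,
    # which is exactly the scalability issue noted on the next slide.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # k nearest, excluding i itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    # k-NN relations are asymmetric (see the issues below); one common
    # fix is to symmetrize, here by taking the elementwise max.
    return np.maximum(W, W.T)
```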
Issues with k-NNG
- Not scalable (quadratic in the number of instances)
- Results in an asymmetric graph: b may be the closest neighbor of a, but not the other way around
- Results in irregular graphs: some nodes may end up with much higher degree than other nodes (e.g., a node of degree 4 in a k-NNG with k = 1)
Issues with ε-Neighborhood
- Not scalable
- Sensitive to the value of ε: not invariant to scaling
- Fragmented graph: disconnected components
[Figure from [Jebara et al., ICML 2009]: data vs. the resulting disconnected ε-neighborhood graph]
Graph Construction using Metric Learning
Edge weights: w_ij ∝ exp(−D_A(x_i, x_j)), where D_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j)
The matrix A is estimated using Mahalanobis metric learning algorithms:
- Supervised metric learning: ITML [Kulis et al., ICML 2007], LMNN [Weinberger and Saul, JMLR 2009]
- Semi-supervised metric learning: IDML [Dhillon et al., UPenn TR 2010]
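A minimal sketch of turning a learned Mahalanobis matrix A into edge weights (the metric learner itself, e.g. ITML/LMNN/IDML, is not implemented here; A is assumed to be given):

```python
import numpy as np

def mahalanobis_weights(X, A):
    """Edge weights w_ij = exp(-D_A(x_i, x_j)) for a learned metric A.

    D_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j). A = I recovers the
    ordinary squared Euclidean distance.
    """
    diff = X[:, None, :] - X[None, :, :]             # (n, n, d) pairwise differences
    D = np.einsum('ijd,de,ije->ij', diff, A, diff)   # D_A for all pairs at once
    return np.exp(-D)
```

With A = I this reduces to a plain Gaussian-style affinity; a learned A stretches or shrinks directions of the feature space before weighting edges.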
Benefits of Metric Learning for Graph Construction
[Figure from [Dhillon et al., UPenn TR 2010]: error on the Amazon, Newsgroups, Reuters, Enron A, and Text datasets for graphs built from the Original features, RP, PCA, ITML (supervised metric learning), and IDML (semi-supervised metric learning); 100 seed and 1400 test instances, all inference using LP]
Careful graph construction is critical!
Other Graph Construction Approaches
- Local reconstruction: Linear Neighborhood [Wang and Zhang, ICML 2005]; regular graphs via b-matching [Jebara et al., ICML 2008]; fitting a graph to vector data [Daitch et al., ICML 2009]
- Graph kernels [Zhu et al., NIPS 2005]
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Graph Laplacian
The (un-normalized) Laplacian of a graph: L = D − W, where D_ii = Σ_j W_ij and D_ij = 0 for i ≠ j.
Example: nodes a, b, c, d with edge weights W_ab = 1, W_ac = 2, W_bc = 3, W_cd = 1 give the Laplacian

          a    b    c    d
    a     3   −1   −2    0
    b    −1    4   −3    0
    c    −2   −3    6   −1
    d     0    0   −1    1
Graph Laplacian (contd.)
- L is positive semi-definite (assuming non-negative weights).
- Smoothness of a prediction f over the graph, in terms of the Laplacian:

    f^T L f = Σ_{(i,j) ∈ E} W_ij (f_i − f_j)²

  Here f is the vector of scores for a single label on the nodes, and f^T L f is a measure of non-smoothness.
- Example (same graph as before): f^T = [1 10 5 25] gives f^T L f = 588 (not smooth), while f^T = [1 1 1 3] gives f^T L f = 4 (smooth).
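The two worked examples can be verified numerically; a short sketch using the slide's graph (W_ab = 1, W_ac = 2, W_bc = 3, W_cd = 1):

```python
import numpy as np

# Weight matrix of the example graph over nodes a, b, c, d.
W = np.array([[0., 1., 2., 0.],
              [1., 0., 3., 0.],
              [2., 3., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W   # un-normalized Laplacian, L = D - W

f_rough  = np.array([1., 10., 5., 25.])
f_smooth = np.array([1., 1., 1., 3.])
print(f_rough @ L @ f_rough)     # 588.0 -> not smooth
print(f_smooth @ L @ f_smooth)   # 4.0   -> smooth
```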
Relationship between Eigenvalues of the Laplacian and Smoothness

    L g = λ g               (g: eigenvector of L; λ: the corresponding eigenvalue)
    g^T L g = λ g^T g = λ   (g^T g = 1, as eigenvectors are orthonormal)

g^T L g is exactly the measure of non-smoothness from the previous slide. So if an eigenvector is used to classify the nodes, the corresponding eigenvalue gives its measure of non-smoothness.
Spectrum of the Graph Laplacian
[Figure from [Zhu et al., 2005]]
- The number of connected components equals the number of zero eigenvalues; the corresponding eigenvectors are constant within each component.
- Higher eigenvalues correspond to more irregular eigenvectors, i.e., less smoothness.
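These spectrum facts can be checked on a small example (a sketch; the two-component toy graph is made up for illustration):

```python
import numpy as np

# Two disconnected components: {a, b} and {c, d}.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 2.],
              [0., 0., 2., 0.]])
L = np.diag(W.sum(axis=1)) - W
vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order

# Number of (numerically) zero eigenvalues = number of connected components.
n_zero = int(np.sum(np.abs(vals) < 1e-9))
print(n_zero)  # 2

# Every zero-eigenvalue eigenvector is constant within each component,
# since the nullspace is spanned by the component indicator vectors.
for g in (vecs[:, 0], vecs[:, 1]):
    assert np.isclose(g[0], g[1]) and np.isclose(g[2], g[3])
```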
Notation
- Ŷ_vl : score of estimated label l on node v (estimated scores)
- Y_vl : score of seed label l on node v (seed scores)
- R_vl : regularization target for label l on node v (label priors)
- S : seed node indicator (diagonal matrix)
- W_uv : weight of edge (u, v) in the graph
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
LP-ZGL [Zhu et al., ICML 2003]

    argmin_Ŷ  Σ_{l=1}^{m} Σ_{(u,v)} W_uv (Ŷ_ul − Ŷ_vl)²  =  Σ_{l=1}^{m} Ŷ_l^T L Ŷ_l
    such that  Ŷ_ul = Y_ul  wherever S_uu = 1

- The objective enforces smoothness (L is the graph Laplacian); the constraint matches seeds exactly (a hard constraint).
- Smoothness: two nodes connected by an edge with high weight should be assigned similar labels.
- The solution satisfies the harmonic property: each unlabeled node's score is the weighted average of its neighbors' scores.
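An illustrative iterative solver for LP-ZGL (a sketch, not the authors' code; it repeatedly averages neighbors while clamping the seeds, converging to the harmonic solution):

```python
import numpy as np

def lp_zgl(W, Y_seed, is_seed, iters=500):
    """Label propagation [Zhu et al., ICML 2003] via neighbor averaging.

    W: (n, n) symmetric non-negative weights; Y_seed: (n, m) seed label
    scores; is_seed: boolean mask of seed nodes. Seeds are clamped (the
    hard constraint); unlabeled scores converge to the harmonic
    solution, the weighted average of each node's neighbors.
    """
    Y = Y_seed.astype(float).copy()
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized weights
    for _ in range(iters):
        Y = P @ Y                          # average the neighbors
        Y[is_seed] = Y_seed[is_seed]       # re-clamp the seeds
    return Y

# Chain a - b - c - d, with a seeded as label 0 and d as label 1:
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
Y_seed = np.array([[1., 0.], [0., 0.], [0., 0.], [0., 1.]])
Y = lp_zgl(W, Y_seed, np.array([True, False, False, True]))
# Interior nodes interpolate linearly along the chain:
# b -> [2/3, 1/3], c -> [1/3, 2/3]
```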
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Two Related Views
- Label diffusion: labels diffuse from labeled (seeded) nodes to unlabeled nodes over the graph.
- Random walk: start a walk at an unlabeled node and follow edges until a labeled node is reached.
Random Walk View
A random walk starts at node u and is currently at node v. What next?
- Continue the walk with probability p_v^cont
- Stop and assign v's seed label to u with probability p_v^inj
- Abandon the random walk with probability p_v^abnd, assigning u a dummy label
Discounting Nodes
- Certain nodes can be unreliable (e.g., high-degree nodes): do not allow propagation/walks through them.
- Solution: increase the abandon probability on such nodes: p_v^abnd ∝ degree(v)
Redefining Matrices
- New edge weight: W′_uv = p_u^cont × W_uv
- S_uu = p_u^inj
- R_u,dummy = p_u^abnd for the dummy label, and 0 for non-dummy labels
Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]

    argmin_Ŷ  Σ_{l=1}^{m+1} [ ‖S Y_l − S Ŷ_l‖² + μ₁ Σ_{u,v} M_uv (Ŷ_ul − Ŷ_vl)² + μ₂ ‖Ŷ_l − R_l‖² ]

The three terms: match seeds (soft), smoothness, and match priors (a regularizer, used for the none-of-the-above dummy label).
- m labels, +1 dummy label
- M = W′ + W′^T is the symmetrized weight matrix
- Ŷ_vl: weight of label l on node v (estimated scores); Y_vl: seed weight for label l on node v (seed scores)
- S: diagonal matrix, nonzero for seed nodes
- R_vl: regularization target for label l on node v (label priors)
- MAD has extra regularization compared to LP-ZGL [Zhu et al., ICML 2003]; similar to QC [Bengio et al., 2006]
- MAD's objective is convex
Solving the MAD Objective
- Can be solved using matrix inversion (as in LP), but matrix inversion is expensive.
- Instead, solve the equivalent system of linear equations (Ax = b) using Jacobi iterations, which yields iterative updates with guaranteed convergence; see [Bengio et al., 2006] and [Talukdar and Crammer, ECML 2009] for details.
Solving MAD using Iterative Updates
Inputs: Y, R : |V| × (|L| + 1); W : |V| × |V|; S : |V| × |V| diagonal

    Ŷ ← Y;  M = W′ + W′^T
    Z_v ← S_vv + μ₁ Σ_{u≠v} M_vu + μ₂,  for all v ∈ V
    repeat
      for all v ∈ V:
        Ŷ_v ← (1 / Z_v) [ (S Y)_v + μ₁ M_v Ŷ + μ₂ R_v ]
    until convergence

Each update combines the seed score, the prior, and the current label estimates on v's neighbors (e.g., neighbors a, b, c with current estimates 0.75, 0.05, 0.60) into a new label estimate on v.
- The importance of a node can be discounted.
- Easily parallelizable, hence scalable (more later).
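The MAD update loop can be sketched directly in Python (the hyper-parameter values μ₁, μ₂ and the toy graph are assumptions; M is built from an already-reweighted W′):

```python
import numpy as np

def mad(Y, R, W, S, mu1=1.0, mu2=0.01, iters=200):
    """Modified Adsorption [Talukdar and Crammer, ECML 2009], a sketch.

    Y, R: (n, m+1) seed scores and label priors (last column for the
    dummy label); W: (n, n) reweighted edge weights W'; S: (n,) diagonal
    of the seed-indicator matrix. Implements the Jacobi updates above.
    """
    M = W + W.T                           # symmetrized weight matrix
    Z = S + mu1 * M.sum(axis=1) + mu2     # per-node normalizer Z_v
    Yhat = Y.astype(float).copy()
    for _ in range(iters):
        Yhat = (S[:, None] * Y + mu1 * (M @ Yhat) + mu2 * R) / Z[:, None]
    return Yhat

# Chain a - b - c with only node a seeded (label 0):
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
Y = np.array([[1., 0., 0.], [0., 0., 0.], [0., 0., 0.]])
Yhat = mad(Y, np.zeros((3, 3)), W, S=np.array([1., 0., 0.]))
# Label 0 diffuses from a and decays along the chain: a > b > c > 0
```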
When is MAD most effective?
[Figure: relative increase in MRR by MAD over LP-ZGL (0 to 0.4), plotted against average degree (0 to 30)]
MAD is particularly effective in denser graphs, where there is greater need for regularization.
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Transduction Algorithm with Confidence (TACO) [Orbach and Crammer, ECML 2012]
- Main lesson from MAD: discount bad (high-degree) nodes.
- TACO generalizes the notion of "bad" and adds a per-node, per-class confidence.
- Each node carries label scores together with a confidence, which can be low.
Introducing Confidence
[Figure: a node whose neighbors carry scores 0.3, 0.5, and 0.9 is assigned the score 0.35]
- Neighborhood disagreement ⇒ low confidence.
- Lower the effect of poorly estimated (low-confidence) scores.
TACO Objective
- Convex objective
- Iterative solution
TACO: Iterative Algorithm
TACO: Score Update
[Figure: a node with neighbor scores 0.3, 0.3, and 0.9 receives the score 0.35]
- Average the neighbors
- Consider confidence
TACO: Confidence Update
[Figure: the same neighborhood, scores 0.3, 0.3, 0.9, estimate 0.35]
Confidence is monotonic in the neighbors' disagreement.
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Manifold Regularization [Belkin et al., JMLR 2006]

    f* = argmin_f  (1/l) Σ_{i=1}^{l} V(y_i, f(x_i)) + γ_I f^T L f + γ_A ‖f‖²_K

- (1/l) Σ V(y_i, f(x_i)): training-data loss; V is a loss function (e.g., soft margin)
- f^T L f: smoothness regularizer; L is the Laplacian of a graph over the labeled and unlabeled data
- ‖f‖²_K: regularizer (e.g., L2)
- Trains an inductive classifier which can generalize to unseen instances
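A minimal primal sketch of this objective, using a linear model f(x) = w·x and squared loss (the closed form below and the toy data are illustrative assumptions, not the paper's kernel formulation):

```python
import numpy as np

def manifold_reg_fit(X, y, labeled, L, gamma_I=0.1, gamma_A=0.01):
    """Linear manifold regularization with squared loss.

    Minimizes (1/l) sum_i (w.x_i - y_i)^2 + gamma_I (Xw)^T L (Xw)
    + gamma_A ||w||^2, where L is the Laplacian over labeled AND
    unlabeled instances. Solved in closed form via the normal equations.
    """
    l = len(labeled)
    Xl, yl = X[labeled], y[labeled]
    A = Xl.T @ Xl / l + gamma_A * np.eye(X.shape[1]) + gamma_I * X.T @ L @ X
    return np.linalg.solve(A, Xl.T @ yl / l)

# Two clusters, one labeled point each; the rest is unlabeled.
X = np.array([[-2., 0.], [-2.2, .1], [-1.8, -.1],
              [2., 0.], [2.2, .1], [1.8, -.1]])
y = np.array([-1., 0., 0., 1., 0., 0.])   # only entries 0 and 3 are used
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0               # within-cluster edges
L = np.diag(W.sum(axis=1)) - W
w = manifold_reg_fit(X, y, [0, 3], L)
# np.sign(X @ w) separates the two clusters; since w is an explicit
# parameter vector, the classifier also applies to unseen x (inductive).
```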
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Measure Propagation (MP) [Subramanya and Bilmes, EMNLP 2008; NIPS 2009; JMLR 2011]

    C_KL = Σ_{i=1}^{l} D_KL(r_i ‖ p_i) + μ Σ_{i,j} w_ij D_KL(p_i ‖ p_j) − ν Σ_{i=1}^{n} H(p_i)
    argmin_{p_i} C_KL   s.t.  Σ_y p_i(y) = 1,  p_i(y) ≥ 0,  ∀y, i

- r_i, p_i: seed and estimated label distributions (normalized) on node i
- First term: divergence on seed nodes; second: smoothness (divergence across edges); third: entropic regularizer
- KL divergence: D_KL(p_i ‖ p_j) = Σ_y p_i(y) log (p_i(y) / p_j(y)); entropy: H(p_i) = −Σ_y p_i(y) log p_i(y)
- C_KL is convex (with non-negative edge weights and hyper-parameters)
- MP is related to Information Regularization [Corduneanu and Jaakkola, 2003]
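A small sketch that simply evaluates C_KL for candidate distributions (the hyper-parameter names `mu` and `nu` are assumptions; the actual solver follows on the next slides):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def mp_objective(P, R, W, n_seeds, mu=1.0, nu=0.1):
    """Evaluate the MP objective C_KL.

    P: (n, m) estimated distributions p_i (rows); R: seed distributions
    r_i for the first n_seeds nodes; W: (n, n) edge weights.
    """
    n = len(P)
    seed = sum(kl(R[i], P[i]) for i in range(n_seeds))
    smooth = sum(W[i, j] * kl(P[i], P[j])
                 for i in range(n) for j in range(n) if W[i, j] > 0)
    ent = sum(entropy(p) for p in P)
    return seed + mu * smooth - nu * ent

# Two nodes joined by an edge; node 0 is seeded with a uniform r_0.
W = np.array([[0., 1.], [1., 0.]])
R = np.array([[0.5, 0.5]])
P_uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
P_skewed = np.array([[0.9, 0.1], [0.5, 0.5]])
# Agreeing, high-entropy distributions score lower (better):
print(mp_objective(P_uniform, R, W, 1) < mp_objective(P_skewed, R, W, 1))  # True
```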
Solving the MP Objective
For ease of optimization, reformulate the MP objective:

    C_MP = Σ_{i=1}^{l} D_KL(r_i ‖ q_i) + μ Σ_{i,j} w′_ij D_KL(p_i ‖ q_j) − ν Σ_{i=1}^{n} H(p_i)
    argmin_{p_i, q_i} C_MP,  where  w′_ij = w_ij + α δ(i, j)

- q_i: a new probability measure, one for each vertex, similar to p_i
- The α δ(i, j) self-edge term encourages agreement between p_i and q_i
- C_MP is also convex (with non-negative edge weights and hyper-parameters)
- C_MP can be solved using Alternating Minimization (AM)
Alternating Minimization
Fix Q and minimize over P, then fix P and minimize over Q, alternating: Q⁰ → P¹ → Q¹ → P² → Q² → P³ → ...
C_MP satisfies the necessary conditions for AM to converge [Subramanya and Bilmes, JMLR 2011].
Why AM?
Performance of SSL Algorithms
[Figure: comparison of accuracies for different numbers of labeled samples on the COIL (6 classes) and OPT (10 classes) datasets]
Graph SSL can be effective when the data satisfies the manifold assumption. More results and discussion in Chapter 21 of the SSL Book (Chapelle et al.).
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Manifold Regularization · Spectral Graph Transduction · Measure Propagation · Sparse Label Propagation
Background: Factor Graphs [Kschischang et al., 2001]
A factor graph is a bipartite graph with:
- variable nodes (e.g., the label distribution on a node)
- factor nodes, each a fitness function over the assignment of the variables connected to it
The distribution over all variables' values factorizes as a product over the factor nodes F, with each factor applied to the variables connected to it.
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
3-term Graph SSL Objective (common to many algorithms):
min  Seed Matching Loss (if any) + Edge Smoothness Loss + Regularization Loss
Smoothness factor on edge (1, 2): ψ(q1, q2) = exp{−w1,2 ||q1 − q2||²}
Seed matching factor (unary); regularization factor (unary)
53
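The three losses above admit simple closed-form coordinate updates, so the objective can be minimized with Jacobi-style iterations. A minimal sketch (the toy graph, seed format, and hyperparameters mu, nu are illustrative, not the tutorial's exact algorithm):

```python
import numpy as np

def graph_ssl(W, seeds, n_labels, mu=1.0, nu=0.01, iters=50):
    """Minimize  sum_seeds ||q_i - y_i||^2 + mu * sum_ij w_ij ||q_i - q_j||^2
    + nu * sum_i ||q_i - u||^2  (u = uniform prior) by Jacobi iterations."""
    n = W.shape[0]
    u = np.full(n_labels, 1.0 / n_labels)
    Y = np.zeros((n, n_labels))
    is_seed = np.zeros(n)
    for i, label in seeds.items():
        Y[i, label] = 1.0
        is_seed[i] = 1.0
    Q = np.tile(u, (n, 1))  # start every node at the uniform prior
    for _ in range(iters):
        # Closed-form update: each node is pulled toward its seed label,
        # its neighbors' current estimates, and the uniform prior.
        num = is_seed[:, None] * Y + mu * (W @ Q) + nu * u
        den = is_seed + mu * W.sum(axis=1) + nu
        Q = num / den[:, None]
    return Q

# Chain graph 0-1-2-3 with the endpoints seeded with opposite labels.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Q = graph_ssl(W, seeds={0: 0, 3: 1}, n_labels=2)
```

Each interior node ends up preferring the label of its nearer seed, and every row of Q stays a distribution because the update is a convex combination.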
Factor Graph Interpretation [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
[Figure: factor graph over variable nodes q1...q4 with seed distributions r1...r4]
1. Factors encouraging agreement on seed labels
2. Smoothness factors
3. Unary factors for regularization: log φt(qt)
54
Label Propagation with Sparsity
Enforce sparsity through a sparsity-inducing unary factor:
−log φt(qt) = ...  (Lasso [Tibshirani, 1996])
−log φt(qt) = ...  (Elitist Lasso [Kowalski and Torrésani, 2009])
For more details, see [Das and Smith, NAACL 2012]
55
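Sparsity-inducing factors of this kind are typically handled with a proximal step; for the Lasso penalty λ||q||₁ this is coordinate-wise soft-thresholding. A minimal sketch of that mechanism (the elitist-lasso variant is omitted):

```python
def soft_threshold(q, lam):
    """Proximal operator of lam * ||q||_1: shrink every coordinate toward zero
    and clip small ones to exactly zero -- this is how a Lasso-style unary
    factor drives most label weights on a node to zero."""
    return [max(abs(x) - lam, 0.0) * (1.0 if x >= 0 else -1.0) for x in q]

# The middle coordinate is below the threshold and becomes exactly zero.
sparse_q = soft_threshold([0.6, 0.05, -0.3], lam=0.1)
```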
Other Graph-SSL Methods
- Spectral Graph Transduction [Joachims, ICML 2003]
- SSL on Directed Graphs [Zhou et al., NIPS 2005], [Zhou et al., ICML 2005]
- Learning with dissimilarity edges [Goldberg et al., AISTATS 2007]
- Learning to order: GraphOrder [Talukdar et al., CIKM 2012]
- Graph Transduction using Alternating Minimization [Wang et al., ICML 2008]
- Graph as regularizer for a Multi-Layered Perceptron [Karlen et al., ICML 2008], [Malkin et al., Interspeech 2009]
56
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering
- MapReduce Parallelization
57
More (Unlabeled) Data is Better Data [Subramanya & Bilmes, JMLR 2011]
[Figure: graph with 120M vertices]
Challenges with large unlabeled data:
- Constructing a graph from large data
- Scalable inference over large graphs
58
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering
- MapReduce Parallelization
59
Scalability Issues (I): Graph Construction
- Brute-force (exact) k-NNG is too expensive (quadratic)
- Approximate nearest neighbors using kd-trees [Friedman et al., 1977; also see http://www.cs.umd.edu/mount/]
60
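A kd-tree answers nearest-neighbor queries in roughly logarithmic time per query on low-dimensional data, avoiding the quadratic brute-force scan. A minimal exact kd-tree sketch (a toy implementation, not the Friedman et al. algorithm's optimized form; in practice one would use a tuned library):

```python
import heapq
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on one coordinate axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def knn(node, query, k, heap=None):
    """Branch-and-bound search: a max-heap (negated distances) keeps the k
    closest points seen so far; a far subtree is pruned when the splitting
    plane is farther than the current k-th best distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = math.dist(query, node["point"])
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["point"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["point"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn(near, query, k, heap)
    if len(heap) < k or abs(diff) < -heap[0][0]:  # can the far side still matter?
        knn(far, query, k, heap)
    return heap

tree = build_kdtree([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0), (0.0, 3.0)])
nearest = sorted(p for _, p in knn(tree, (1.9, 0.0), k=2))
```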
Scalability Issues (II): Label Inference
- Sub-sample the data
  - Construct the graph over a subset of the unlabeled data [Delalleau et al., AISTATS 2005]
  - Sparse Grids [Garcke & Griebel, KDD 2001]
- How about using more computation? (next section)
  - Symmetric multi-processor (SMP)
  - Distributed computer
61
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering [Subramanya & Bilmes, JMLR 2011; Bilmes & Subramanya, 2011]
- MapReduce Parallelization
62
Parallel computation on an SMP
[Figure: graph nodes 1, 2, ..., k, k+1, ... (neighbors not shown) assigned to Processors 1 through k of an SMP with k processors]
Label Update using Message Passing
[Figure: node v with neighbors a, b, c (edge values 0.75, 0.05, 0.60); v computes its new label estimate from its neighbors' current estimates and its own seed prior]
64
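A single node's update touches only its own seed prior and the messages from its neighbors, which is exactly why disjoint sets of nodes can be updated on different processors. A minimal sketch using the figure's edge values (the weighted-average form is illustrative, not the exact MP/MAD update):

```python
def update_node(neighbor_msgs, seed_prior=None, mu=1.0):
    """One message-passing update for a single node.
    neighbor_msgs: list of (edge_weight, neighbor_label_distribution) pairs.
    seed_prior: the node's seed label distribution (None for unlabeled nodes).
    The update reads only local information, so disjoint node sets can be
    processed by different processors of an SMP."""
    n_labels = len(neighbor_msgs[0][1])
    num = [0.0] * n_labels
    den = 0.0
    for w, q in neighbor_msgs:
        for c in range(n_labels):
            num[c] += mu * w * q[c]
        den += mu * w
    if seed_prior is not None:
        for c in range(n_labels):
            num[c] += seed_prior[c]
        den += 1.0
    return [x / den for x in num]

# Node v from the figure: neighbors a, b, c with edge values 0.75, 0.05, 0.60.
new_v = update_node([(0.75, [1.0, 0.0]), (0.05, [0.0, 1.0]), (0.60, [0.5, 0.5])])
```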
Speed-up on SMP
[Figure: ratio of time spent using 1 processor to time spent using n processors; the speed-up flattens beyond a point. Cache miss?]
Graph with 1.4M nodes; SMP with 16 cores and 128GB of RAM
[Subramanya & Bilmes, JMLR 2011]
65
Node Reordering Algorithm
Input: Graph G = (V, E)
Result: Node-ordered graph
1. Select an arbitrary node v
2. while unselected nodes remain do
   2.1. select an unselected node v′ whose neighbor set has maximum overlap with the neighbors of v
   2.2. mark v′ as selected
   2.3. set v to v′
The search in step 2.1 can be done exhaustively for sparse (e.g., k-NN) graphs
[Subramanya & Bilmes, JMLR 2011]
66
Node Reordering Algorithm: Intuition
Which node should be placed after k to optimize cache performance?
[Figure: node k with neighbors a, b, c; N(a) = {e, j, c}, N(b) = {a, d, c}, N(c) = {l, m, n}]
Cardinality of intersection:
|N(k) ∩ N(a)| = 1
|N(k) ∩ N(b)| = 2
|N(k) ∩ N(c)| = 0
Best node: b
67
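The greedy reordering described above can be sketched directly; applied to the intuition example it picks b after k. A minimal sketch (the dict-of-sets adjacency format and the deterministic tie-breaking are illustrative):

```python
def reorder_nodes(adj, start):
    """Greedy node reordering: repeatedly pick the unselected node whose
    neighbor set overlaps most with the current node's neighbors, so that
    consecutive nodes share cached label data.
    adj: dict mapping node -> set of neighbor nodes."""
    remaining = set(adj)
    v = start
    order = [v]
    remaining.remove(v)
    while remaining:
        # Maximize |N(v) ∩ N(v')| over unselected nodes v' (sorted for
        # deterministic tie-breaking).
        v = max(sorted(remaining), key=lambda u: len(adj[v] & adj[u]))
        order.append(v)
        remaining.remove(v)
    return order

# The slide's example: N(k)={a,b,c}, N(a)={e,j,c}, N(b)={a,d,c}, N(c)={l,m,n}.
adj = {"k": {"a", "b", "c"}, "a": {"e", "j", "c"},
       "b": {"a", "d", "c"}, "c": {"l", "m", "n"}}
order = reorder_nodes(adj, start="k")
```

Starting at k, the overlaps are 1, 2, 0 for a, b, c, so b is placed next, as in the figure.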
Speed-up on SMP after Node Ordering
[Subramanya & Bilmes, JMLR 2011]
68
Distributed Processing
- Maximize overlap between consecutive nodes within the same machine
- Minimize overlap across machines (reduces inter-machine communication)
69
Distributed Processing [Bilmes & Subramanya, 2011]
[Figure: maximize intra-processor overlap; minimize inter-processor overlap]
70
Node reordering for a Distributed Computer
[Figure: node sequences on Processor #i and Processor #j; maximize overlap within a processor, minimize overlap across processors]
71
Distributed Processing Results
[Bilmes & Subramanya, 2011]
72
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering
- MapReduce Parallelization
73
MapReduce Implementation of MAD
- Map: each node sends its current label assignments to its neighbors
- Reduce: each node updates its own label assignment using the messages received from its neighbors and its own information (e.g., seed labels, regularization penalties, etc.)
- Repeat until convergence
[Figure: node v with neighbors a, b, c (edge values 0.75, 0.05, 0.60); the Reduce step produces v's new label estimate]
Code in the Junto Label Propagation Toolkit (includes a Hadoop-based implementation): http://code.google.com/p/junto/
Graph-based algorithms are amenable to distributed processing
74
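The Map and Reduce steps above can be simulated in memory; each Reduce output depends only on a node's incoming messages and its own seed information, which is what makes the computation shard cleanly. A toy sketch (a MAD-like weighted-average update; the real Junto implementation differs in details):

```python
from collections import defaultdict

def map_phase(graph, labels):
    """Map: every node emits its current label distribution to each neighbor."""
    for node, neighbors in graph.items():
        for nbr, weight in neighbors.items():
            yield nbr, (weight, labels[node])

def reduce_phase(messages, seeds, n_labels, mu=1.0):
    """Reduce: every node combines the messages it received with its own
    seed information (a weighted average, as in the LP-style updates)."""
    grouped = defaultdict(list)
    for node, msg in messages:
        grouped[node].append(msg)
    new_labels = {}
    for node, msgs in grouped.items():
        num = [0.0] * n_labels
        den = 0.0
        for w, q in msgs:
            for c in range(n_labels):
                num[c] += mu * w * q[c]
            den += mu * w
        if node in seeds:                 # the node's own information
            num[seeds[node]] += 1.0
            den += 1.0
        new_labels[node] = [x / den for x in num]
    return new_labels

# Two-node graph; 'a' is seeded with label 0. Iterate map -> reduce.
graph = {"v": {"a": 1.0}, "a": {"v": 1.0}}
labels = {"v": [0.5, 0.5], "a": [0.5, 0.5]}
for _ in range(2):
    labels = reduce_phase(map_phase(graph, labels), seeds={"a": 0}, n_labels=2)
```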
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation
- POS Tagging
- MultiLingual POS Tagging
75
Problem Description & Motivation
- Given a frame of speech, classify it into one of n phones
- Training supervised models requires large amounts of labeled data (a challenge for phone classification in resource-scarce languages)
76
TIMIT
- Corpus of read speech
- Broadband recordings of 630 speakers of 8 major dialects of American English
- Each speaker read 10 sentences
- Includes time-aligned phonetic transcriptions
- Phone set has 61 phones [Lee & Hon, 89]
  - mapped down to 48 phones for modeling
  - further mapped down to 39 phones for scoring
77
Feature Extraction
[Figure: 25ms Hamming window at 100 Hz → MFCCs → deltas; features from a context window of frames around frame ti]
78
Graph Construction
- TIMIT Training Set + TIMIT Development Set → k-NN Graph
- wij = exp{−(xi − xj)ᵀ Σ⁻¹ (xi − xj)}
- k = 10; graph with about 1.4M nodes
- Labels not used during graph construction
79
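The edge-weight formula can be sketched directly. Here Σ is taken to be diagonal, estimated from per-dimension feature variances (an assumption; the slides leave Σ's estimation unspecified), and the dense pairwise computation is only for small toy inputs:

```python
import numpy as np

def edge_weights(X, k=10):
    """k-NN graph with RBF weights w_ij = exp(-(x_i - x_j)^T Sigma^{-1} (x_i - x_j)).
    Sigma is assumed diagonal, estimated from per-dimension feature variances."""
    inv_var = 1.0 / np.var(X, axis=0)          # diagonal Sigma^{-1}
    diffs = X[:, None, :] - X[None, :, :]      # all pairwise differences
    d2 = (diffs ** 2 * inv_var).sum(axis=-1)   # Mahalanobis-style squared distances
    W = np.exp(-d2)
    np.fill_diagonal(W, 0.0)                   # no self-edges
    keep = np.argsort(-W, axis=1)[:, :k]       # each node's k strongest edges
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(X.shape[0])[:, None], keep] = True
    return np.where(mask | mask.T, W, 0.0)     # symmetrized k-NN graph

# Two tight pairs of points: each point's single neighbor is its partner.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W = edge_weights(X, k=1)
```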
Inductive Extension
[Figure: a test sample goes through feature extraction, its nearest neighbors in the graph (built over the TIMIT Training + Development Sets) are found, and its label is predicted with a Nadaraya-Watson estimator over those neighbors]
80
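The Nadaraya-Watson step admits a compact sketch: the unseen sample's label distribution is a kernel-weighted average of the distributions at its graph neighbors (the RBF kernel and bandwidth are illustrative choices):

```python
import math

def nadaraya_watson(x, neighbors, bandwidth=1.0):
    """Inductive extension: predict a label distribution for an unseen sample x
    as a kernel-weighted average of label distributions at its graph neighbors.
    neighbors: list of (feature_vector, label_distribution) pairs."""
    n_labels = len(neighbors[0][1])
    num = [0.0] * n_labels
    den = 0.0
    for xj, qj in neighbors:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xj))
        w = math.exp(-d2 / bandwidth)          # RBF kernel weight
        for c in range(n_labels):
            num[c] += w * qj[c]
        den += w
    return [v / den for v in num]

# A test point sitting on top of one neighbor inherits (mostly) its label.
q = nadaraya_watson((0.0, 0.0), [((0.0, 0.0), [1.0, 0.0]),
                                 ((3.0, 0.0), [0.0, 1.0])])
```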
Results (I)
81
Results (II)
82
Labeled Graph Construction [Alexandrescu & Kirchhoff, ASRU 2007]
- Extract features from the TIMIT Training Set and train an MLP on them
- Apply the MLP to the TIMIT Training + Development Sets
- Construct the graph using Jensen-Shannon divergence between MLP outputs
83
Results (III) [Liu & Kirchhoff, UW-EE TR 2012]
[Figure: phone accuracy (65–74%) vs. fraction of labeled data (10%, 50%, 100%) for MLP, MP, MAD, and p-MP]
- Take advantage of labeled data during graph construction
- Regularize towards a prior (when available)
84
Switchboard Phonetic Annotation
- The Switchboard corpus consists of about 300 hours of conversational speech
- Less reliable, automatically generated phone-level annotations [Deshmukh et al., 1998]
- Switchboard Transcription Project (STP) [Greenberg, 1995]
  - Manual phonetic annotation
  - Only about 75 minutes of data annotated
85
Graph Construction
- x% of SWB + STP → k-NN Graph
- kd-tree; k = 10; graph with about 120M nodes
86
Results [Subramanya & Bilmes, JMLR 2011]
[Figure: results on the graph with 120M vertices]
87
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation
- POS Tagging
- MultiLingual POS Tagging
88
Problem Description & Motivation
- Given a document (e.g., web page, news article), assign it to a fixed number of semantic categories (e.g., sports, politics, entertainment)
- Multi-label problem
- Training supervised models requires large amounts of labeled data [Dumais et al., 1998]
89
Corpora
- Reuters [Lewis, et al., 1978]: newswire. About 20K documents with 135 categories. Use the top 10 categories (e.g., earnings, acquisitions, wheat, interest) and label the remaining as other
- WebKB [Bekkerman, et al., 2003]: 8K webpages from 4 academic domains. Categories include course, department, faculty and project
90
Feature Extraction [Lewis, et al., 1978]
Document: "Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ..."
→ Stop-word removal: "Showers continued week Bahia cocoa zone alleviating drought early January improving prospects coming temporao, ..."
→ Stemming: "shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao, ..."
→ TFIDF-weighted bag-of-words (e.g., shower: 0.01, bahia: 0.34, cocoa: 0.33, ...)
91
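The pipeline above can be sketched end to end; here stemming is omitted for brevity and the stop-word list is a tiny illustrative stand-in for a real one:

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (a real system would use a full list).
STOPWORDS = {"the", "in", "and", "for", "throughout", "since"}

def tfidf(docs):
    """Stop-word removal followed by TFIDF weighting (stemming omitted).
    docs: list of token lists. Returns one {term: weight} dict per document."""
    cleaned = [[t.lower() for t in d if t.lower() not in STOPWORDS] for d in docs]
    n = len(docs)
    df = Counter(t for d in cleaned for t in set(d))    # document frequency
    out = []
    for d in cleaned:
        tf = Counter(d)
        # term frequency (normalized) times inverse document frequency
        out.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

vecs = tfidf([["The", "cocoa", "zone"], ["the", "cocoa", "market"]])
```

A term appearing in every document (here "cocoa") gets zero weight, while distinctive terms like "zone" keep positive weight, which is the point of the IDF factor.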
Results
Average PRBEP (precision-recall break-even point)

          SVM   TSVM  SGT   LP    MP    MAD
Reuters   48.9  59.3  60.3  59.7  66.3  -
WebKB     23.0  29.2  36.8  41.2  51.9  53.7

SVM: Support Vector Machine (supervised); TSVM: Transductive SVM [Joachims 1999]; SGT: Spectral Graph Transduction [Joachims 2003]; LP: Label Propagation [Zhu & Ghahramani 2002]; MP: Measure Propagation [Subramanya & Bilmes 2008]; MAD: Modified Adsorption [Talukdar & Crammer 2009]
92
Results on WebKB [Subramanya & Bilmes, EMNLP 2008]
[Figure: PRBEP vs. number of labeled examples; Modified Adsorption (MAD) [Talukdar & Crammer 2009] outperforms the other approaches]
- More labeled data => better performance
- Unnormalized distributions (scores) are more suitable for multi-label problems (MAD outperforms the other approaches)
93
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging [Subramanya & Bilmes, JMLR 2011]
- Statistical Machine Translation
- POS Tagging
- MultiLingual POS Tagging
94
Problem description
Dialog acts (DA) reflect the function that utterances serve in discourse
Applications in automatic speech recognition (ASR), machine translation (MT) & natural language processing (NLP)
95
Switchboard DA Corpus
- Switchboard dialog act (DA) tagging project [Jurafsky, et al., 1997]
- Manually labeled 1155 conversations (about 200k sentences)
- Example labels: question, answer, backchannel, agreement, ... (total of 42 different DAs)
- Use the top 18 DAs (about 185k sentences)
96
Graph Construction
- Bigram & trigram TFIDF
- Cosine similarity
- k-NN graph
97
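The construction above (TFIDF vectors compared with cosine similarity, linked to the k nearest neighbors) can be sketched as follows; the sparse dict vector format and function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse TFIDF vectors ({term: weight})."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_edges(vectors, k):
    """For each sentence, keep edges to its k most cosine-similar sentences."""
    edges = {}
    for i, u in enumerate(vectors):
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
                      reverse=True)
        edges[i] = [j for _, j in sims[:k]]
    return edges

# Sentences 0 and 1 share almost all terms; sentence 2 shares none.
vecs = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0, "c": 0.1}, {"x": 1.0}]
edges = knn_edges(vecs, k=1)
```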
SWB DA Tagging Results [Subramanya & Bilmes, JMLR 2011]
Baseline performance: 84.2% [Ji & Bilmes, 2005]
[Figure: accuracy (79–86%) for bigram and trigram features, SQ-Loss vs. MP]
98
Dihana Corpus
- Computer-human dialogues answering telephone queries about train services in Spain
  - about 900 dialogues; topics include timetables, fares and services
- 225 speakers
- Standard train/test split (16k/7.5k sentences)
- Number of labels = 72
99
Dihana Results [Subramanya & Bilmes, JMLR 2011]
Results for classifying user turns
Baseline performance: 76.4%
[Figure: accuracy (75–82%) for bigram and trigram features, SQ-Loss vs. MP]
100
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation [Alexandrescu & Kirchhoff, NAACL 2009]
- POS Tagging
- MultiLingual POS Tagging
101
Problem Description
- Phrase-based statistical machine translation (SMT)
- Sentences are translated in isolation
102
Motivation
Source: Asf lA ymknk *lk hnAk .....
1-best: im sorry you cant there is a cost
Source: E*rA lA ymknk t$gyl ......
1-best: excuse me i you turn tv ....
Correct: lA ymknk → you may not / you cannot
The same source phrase receives different translations in different sentences; use a graph to enforce a smoothness constraint across similar sentences
103
Issues
- What we want to do: exploit similarity between sentences
- Input consists of variable-length word strings
- Output space is structured (the number of possible labels is very large)
104
Graph Construction (I)
Labeled data: {(s1, t1), ..., (sl, tl)}; each (si, ti) encodes a source & target sentence
Unlabeled data (test set): {sl+1, ..., sn}
How do we compute the edge weight wij between (si, ti) and (sj, tj)? What about the test set, which has no targets?
105
Graph Construction (II)
Labeled data: {(s1, t1), ..., (sl, tl)}
Unlabeled data (test set): {sl+1, ..., sn}
Each unlabeled sentence si is run through a first-pass decoder to produce an N-best list {h(1)si, ..., h(N)si}, yielding candidate pairs {(si, h(1)si), ..., (si, h(N)si)}
106
Graph Construction (III)
- For an unlabeled sentence si, find the most similar sentences among all sentences (labeled + test): connect pairs whose sentence similarity exceeds a threshold
- Compare hypotheses h(j)si with targets & hypotheses tk via a label similarity, combined by averaging
107
Corpora
- IWSLT 2007 task: Italian-to-English (IE) & Arabic-to-English (AE) travel tasks
- Each task has train/dev/eval sets
- Baseline: standard phrase-based SMT with a log-linear model; yields state-of-the-art performance
- Results are measured using BLEU score [Papineni et al., ACL 2002] and phrase error rate (PER)
108
Results (I)
[Figure: IE results on the eval set; BLEU (29–33) for Baseline vs. LP, with BLEU-based and string-kernel sentence similarities]
Geometric-mean-based averaging worked the best
109
Results (II)
[Figure: IE results on the eval set (BLEU and PER, 37–40) and AE results on the eval set (37–42), Baseline vs. LP; higher is better for BLEU, lower is better for PER]
110
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation
- POS Tagging [Subramanya et al., EMNLP 2008]
- MultiLingual POS Tagging
111
Motivation: Domain Adaptation
- Small amounts of labeled source-domain data:
  ... bought a book detailing the ...  → ... VBD DT NN VBG DT ...
  ... wanted to book a flight to ...   → ... VBD TO VB DT NN TO ...
  ... the book is about the ...        → ... DT NN VBZ PP DT ...
- Large amounts of unlabeled target-domain data:
  ... how to book a band ...
  can you book a day room ...
  ... how to unrar a file ...
112
Graph Construction (I)

when do you book plane tickets?
WRB VBP PRP VB NN NNS

do you read a book on the plane?
VBP PRP VB DT NN IN DT NN

Similarity?
Graph Construction (II)

how much does it cost to book a band ?
what was the book that has no letter e ?
how to get a book agent ?
can you book a day room at hilton hawaiian village ?

Trigram types centered on "book":
to book a
the book that
a book agent
you book a

k-nearest neighbors?
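The slides ask whether k-nearest neighbors is the right way to connect trigram types. A minimal sketch of that construction, assuming each trigram type has already been mapped to a feature vector (the toy vectors and the `knn_graph` helper below are illustrative, not the tutorial's implementation); each node is linked to its k most cosine-similar nodes and edges are symmetrized:

```python
import numpy as np

def knn_graph(X, k):
    """Build a symmetrized k-nearest-neighbor graph from row feature vectors.

    X: (n, d) matrix, one feature vector per trigram type.
    Returns a dict mapping node index -> {neighbor index: similarity}.
    """
    # Cosine similarity between all pairs of nodes.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    np.fill_diagonal(S, -np.inf)  # exclude self-edges

    graph = {i: {} for i in range(len(X))}
    for i in range(len(X)):
        for j in np.argsort(S[i])[-k:]:  # k most similar nodes
            graph[i][int(j)] = float(S[i, j])
            graph[int(j)][i] = float(S[i, j])  # symmetrize
    return graph

# Toy feature vectors for the four trigram types on the slide.
nodes = ["to book a", "the book that", "a book agent", "you book a"]
X = np.array([[1.0, 0.2, 0.0],
              [0.1, 1.0, 0.3],
              [0.0, 0.3, 1.0],
              [0.9, 0.1, 0.1]])
g = knn_graph(X, k=1)
```

With these toy vectors, "you book a" links to "to book a" (they share the verb-like context), while "the book that" links to "a book agent".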
Graph Construction (III)

Graph over trigram types:
you book a, the book that, to book a, a book agent,
the city that, to run a, to become a, you start a,
a movie agent, you unrar a
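The point of connecting the unseen target-domain trigram "you unrar a" into the graph is that it can receive a POS label distribution from its labeled neighbors. A minimal iterative label-propagation sketch over a toy version of this graph (the edge weights and the clamping of seed nodes are illustrative assumptions, not the tutorial's exact objective):

```python
# Toy graph: node -> {neighbor: edge weight} (hypothetical weights).
graph = {
    "you book a":  {"you unrar a": 0.8, "to book a": 0.6},
    "to book a":   {"you book a": 0.6, "you unrar a": 0.5},
    "you unrar a": {"you book a": 0.8, "to book a": 0.5},
}
# Labeled source-domain seed nodes: P(tag | node), kept fixed.
seeds = {
    "you book a": {"VB": 1.0},
    "to book a":  {"VB": 1.0},
}

labels = {n: dict(d) for n, d in seeds.items()}
for _ in range(10):  # iterate to (approximate) convergence
    new = {}
    for node, nbrs in graph.items():
        if node in seeds:            # clamp labeled nodes
            new[node] = dict(seeds[node])
            continue
        scores = {}                  # weighted average of neighbor labels
        for nbr, w in nbrs.items():
            for tag, p in labels.get(nbr, {}).items():
                scores[tag] = scores.get(tag, 0.0) + w * p
        z = sum(scores.values()) or 1.0
        new[node] = {t: s / z for t, s in scores.items()}
    labels = new

print(labels["you unrar a"])  # -> {'VB': 1.0}
```

"you unrar a" never occurs in the labeled source domain, yet it ends up tagged as a verb context because its graph neighbors are verb contexts.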
Graph Construction - Features

how much does it cost to book a band ?

Feature                      Value
Trigram + Context            cost to book a band
Trigram                      to book a
Left Context                 cost to
Right Context                a band
Center Word                  book
Trigram - Center Word        to ____ a
Left Word + Right Context    to ____ a band
Left Context + Right Word    cost to ____ a
Suffix                       none
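The feature set above can be sketched as code. A minimal extraction function, assuming whitespace-tokenized input and the feature names shown on the slide (the suffix check at the end is a hypothetical placeholder; the slide only shows that "book" gets "none"):

```python
def trigram_features(tokens, i):
    """Extract the slide's graph-node features for the trigram centered at tokens[i].

    Assumes 2 <= i <= len(tokens) - 3 so the surrounding 5-gram exists.
    """
    l2, l1, c, r1, r2 = tokens[i - 2], tokens[i - 1], tokens[i], tokens[i + 1], tokens[i + 2]
    return {
        "trigram + context": f"{l2} {l1} {c} {r1} {r2}",
        "trigram": f"{l1} {c} {r1}",
        "left context": f"{l2} {l1}",
        "right context": f"{r1} {r2}",
        "center word": c,
        "trigram - center word": f"{l1} ____ {r1}",
        "left word + right context": f"{l1} ____ {r1} {r2}",
        "left context + right word": f"{l2} {l1} ____ {r1}",
        # Hypothetical suffix rule; "book" yields "none" as on the slide.
        "suffix": "ing" if c.endswith("ing") else "none",
    }

sent = "how much does it cost to book a band ?".split()
feats = trigram_features(sent, sent.index("book"))
print(feats["trigram"])       # -> to book a
print(feats["left context"])  # -> cost to
```

Each distinct trigram type in the corpus gets one node, and these features (aggregated over all of the type's occurrences) define the vector used to find its nearest neighbors.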