A Tutorial on
Graph-based Semi-Supervised Learning Algorithms for Speech and Spoken Language Processing
Partha Pratim Talukdar (Carnegie Mellon University)
Amarnag Subramanya (Google Research)
http://graph-ssl.wikidot.com/
ICASSP 2013, Vancouver
Supervised Learning
Labeled Data → Learning Algorithm → Model
Examples: Decision Trees, Support Vector Machine (SVM), Maximum Entropy (MaxEnt)
Semi-Supervised Learning (SSL)
Labeled Data + A Lot of Unlabeled Data → Learning Algorithm → Model
Examples: Self-Training, Co-Training
Why SSL?
How can unlabeled data be helpful?
Example from [Belkin et al., JMLR 2006]: with only the labeled instances, the learner commits to a decision boundary determined by those few points; adding unlabeled instances reveals the underlying structure of the data and yields a more accurate decision boundary.
Inductive vs Transductive

                                  Supervised (Labeled)    Semi-supervised (Labeled + Unlabeled)
  Inductive                       SVM, Maximum Entropy    Manifold Regularization
  (generalizes to unseen data)
  Transductive                    ✗                       Label Propagation, MAD, MP, TACO, ...
  (doesn't generalize                                     ← focus of this tutorial
   to unseen data)

Most graph SSL algorithms are non-parametric (i.e., the number of parameters grows with data size).
See Chapter 25 of the SSL Book: http://olivier.chapelle.cc/ssl-book/discussion.pdf
Why Graph-based SSL?
- Some datasets are naturally represented by a graph: the web, citation networks, social networks, ...
- Uniform representation for heterogeneous data
- Easily parallelizable, scalable to large data
- Effective in practice
[Figure: text classification accuracy, comparing Supervised, Non-Graph SSL, and Graph SSL]
Graph-based SSL
[Figure: nodes are instances, edges carry similarity weights (0.6, 0.2, 0.8); seed-labeled nodes (/m/, /f/) propagate their labels to unlabeled nodes]
Graph-based SSL
Smoothness assumption: if two instances are similar according to the graph, then their output labels should be similar.
Two stages:
- Graph construction (if not already present)
- Label inference
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Graph Construction
- Neighborhood methods: k-NN graph construction (k-NNG), ε-neighborhood method
- Metric learning
- Other approaches
Neighborhood Methods
- k-Nearest Neighbor Graph (k-NNG): add edges between an instance and its k nearest neighbors (e.g., k = 3)
- ε-neighborhood: add edges to all instances inside a ball of radius ε
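As a concrete sketch of k-NNG construction (not from the tutorial; the Gaussian weighting, the `sigma` parameter, and the max-symmetrization step are assumptions):

```python
import numpy as np

def knn_graph(X, k=3, sigma=1.0):
    """Build a symmetrized k-NN graph with Gaussian edge weights.

    X: (n, d) array of instances; returns an (n, n) weight matrix W.
    """
    n = X.shape[0]
    # All pairwise squared Euclidean distances: quadratic in n,
    # which is exactly the scalability issue noted on the next slide.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(d2[i])[1:k + 1]  # k nearest, excluding i itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2 * sigma ** 2))
    # k-NN relations are asymmetric (see the issues below); one common
    # fix is to symmetrize, here by taking the elementwise max.
    return np.maximum(W, W.T)
```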
Issues with k-NNG
- Not scalable (quadratic in the number of instances)
- Results in an asymmetric graph: b may be the closest neighbor of a, but not the other way around
- Results in irregular graphs: some nodes may end up with much higher degree than other nodes (e.g., a node of degree 4 in a k-NNG with k = 1)
Issues with ε-Neighborhood
- Not scalable
- Sensitive to the value of ε: not invariant to scaling
- Fragmented graph: disconnected components
[Figure from [Jebara et al., ICML 2009]: data vs. the resulting disconnected ε-neighborhood graph]
Graph Construction using Metric Learning
Edge weights: w_ij ∝ exp(−D_A(x_i, x_j)), where D_A(x_i, x_j) = (x_i − x_j)^T A (x_i − x_j)
The matrix A is estimated using Mahalanobis metric learning algorithms:
- Supervised metric learning: ITML [Kulis et al., ICML 2007], LMNN [Weinberger and Saul, JMLR 2009]
- Semi-supervised metric learning: IDML [Dhillon et al., UPenn TR 2010]
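A minimal sketch of turning a learned Mahalanobis matrix A into edge weights (the metric learner itself, e.g. ITML/LMNN/IDML, is not implemented here; A is assumed to be given):

```python
import numpy as np

def mahalanobis_weights(X, A):
    """Edge weights w_ij = exp(-D_A(x_i, x_j)) for a learned metric A.

    D_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j). A = I recovers the
    ordinary squared Euclidean distance.
    """
    diff = X[:, None, :] - X[None, :, :]             # (n, n, d) pairwise differences
    D = np.einsum('ijd,de,ije->ij', diff, A, diff)   # D_A for all pairs at once
    return np.exp(-D)
```

With A = I this reduces to a plain Gaussian-style affinity; a learned A stretches or shrinks directions of the feature space before weighting edges.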
Benefits of Metric Learning for Graph Construction
[Figure from [Dhillon et al., UPenn TR 2010]: error on the Amazon, Newsgroups, Reuters, Enron A, and Text datasets for graphs built from the Original features, RP, PCA, ITML (supervised metric learning), and IDML (semi-supervised metric learning); 100 seed and 1400 test instances, all inference using LP]
Careful graph construction is critical!
Other Graph Construction Approaches
- Local reconstruction: Linear Neighborhood [Wang and Zhang, ICML 2005]; regular graphs via b-matching [Jebara et al., ICML 2008]; fitting a graph to vector data [Daitch et al., ICML 2009]
- Graph kernels [Zhu et al., NIPS 2005]
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Graph Laplacian
The (un-normalized) Laplacian of a graph: L = D − W, where D_ii = Σ_j W_ij and D_ij = 0 for i ≠ j.
Example: nodes a, b, c, d with edge weights W_ab = 1, W_ac = 2, W_bc = 3, W_cd = 1 give the Laplacian

          a    b    c    d
    a     3   −1   −2    0
    b    −1    4   −3    0
    c    −2   −3    6   −1
    d     0    0   −1    1
Graph Laplacian (contd.)
- L is positive semi-definite (assuming non-negative weights).
- Smoothness of a prediction f over the graph, in terms of the Laplacian:

    f^T L f = Σ_{(i,j) ∈ E} W_ij (f_i − f_j)²

  Here f is the vector of scores for a single label on the nodes, and f^T L f is a measure of non-smoothness.
- Example (same graph as before): f^T = [1 10 5 25] gives f^T L f = 588 (not smooth), while f^T = [1 1 1 3] gives f^T L f = 4 (smooth).
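The two worked examples can be verified numerically; a short sketch using the slide's graph (W_ab = 1, W_ac = 2, W_bc = 3, W_cd = 1):

```python
import numpy as np

# Weight matrix of the example graph over nodes a, b, c, d.
W = np.array([[0., 1., 2., 0.],
              [1., 0., 3., 0.],
              [2., 3., 0., 1.],
              [0., 0., 1., 0.]])
L = np.diag(W.sum(axis=1)) - W   # un-normalized Laplacian, L = D - W

f_rough  = np.array([1., 10., 5., 25.])
f_smooth = np.array([1., 1., 1., 3.])
print(f_rough @ L @ f_rough)     # 588.0 -> not smooth
print(f_smooth @ L @ f_smooth)   # 4.0   -> smooth
```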
Relationship between Eigenvalues of the Laplacian and Smoothness

    L g = λ g               (g: eigenvector of L; λ: the corresponding eigenvalue)
    g^T L g = λ g^T g = λ   (g^T g = 1, as eigenvectors are orthonormal)

g^T L g is exactly the measure of non-smoothness from the previous slide. So if an eigenvector is used to classify the nodes, the corresponding eigenvalue gives its measure of non-smoothness.
Spectrum of the Graph Laplacian
[Figure from [Zhu et al., 2005]]
- The number of connected components equals the number of zero eigenvalues; the corresponding eigenvectors are constant within each component.
- Higher eigenvalues correspond to more irregular eigenvectors, i.e., less smoothness.
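These spectrum facts can be checked on a small example (a sketch; the two-component toy graph is made up for illustration):

```python
import numpy as np

# Two disconnected components: {a, b} and {c, d}.
W = np.array([[0., 1., 0., 0.],
              [1., 0., 0., 0.],
              [0., 0., 0., 2.],
              [0., 0., 2., 0.]])
L = np.diag(W.sum(axis=1)) - W
vals, vecs = np.linalg.eigh(L)   # eigenvalues in ascending order

# Number of (numerically) zero eigenvalues = number of connected components.
n_zero = int(np.sum(np.abs(vals) < 1e-9))
print(n_zero)  # 2

# Every zero-eigenvalue eigenvector is constant within each component,
# since the nullspace is spanned by the component indicator vectors.
for g in (vecs[:, 0], vecs[:, 1]):
    assert np.isclose(g[0], g[1]) and np.isclose(g[2], g[3])
```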
Notation
- Ŷ_vl : score of estimated label l on node v (estimated scores)
- Y_vl : score of seed label l on node v (seed scores)
- R_vl : regularization target for label l on node v (label priors)
- S : seed node indicator (diagonal matrix)
- W_uv : weight of edge (u, v) in the graph
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
LP-ZGL [Zhu et al., ICML 2003]

    argmin_Ŷ  Σ_{l=1}^{m} Σ_{(u,v)} W_uv (Ŷ_ul − Ŷ_vl)²  =  Σ_{l=1}^{m} Ŷ_l^T L Ŷ_l
    such that  Ŷ_ul = Y_ul  wherever S_uu = 1

- The objective enforces smoothness (L is the graph Laplacian); the constraint matches seeds exactly (a hard constraint).
- Smoothness: two nodes connected by an edge with high weight should be assigned similar labels.
- The solution satisfies the harmonic property: each unlabeled node's score is the weighted average of its neighbors' scores.
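An illustrative iterative solver for LP-ZGL (a sketch, not the authors' code; it repeatedly averages neighbors while clamping the seeds, converging to the harmonic solution):

```python
import numpy as np

def lp_zgl(W, Y_seed, is_seed, iters=500):
    """Label propagation [Zhu et al., ICML 2003] via neighbor averaging.

    W: (n, n) symmetric non-negative weights; Y_seed: (n, m) seed label
    scores; is_seed: boolean mask of seed nodes. Seeds are clamped (the
    hard constraint); unlabeled scores converge to the harmonic
    solution, the weighted average of each node's neighbors.
    """
    Y = Y_seed.astype(float).copy()
    P = W / W.sum(axis=1, keepdims=True)   # row-normalized weights
    for _ in range(iters):
        Y = P @ Y                          # average the neighbors
        Y[is_seed] = Y_seed[is_seed]       # re-clamp the seeds
    return Y

# Chain a - b - c - d, with a seeded as label 0 and d as label 1:
W = np.array([[0., 1., 0., 0.],
              [1., 0., 1., 0.],
              [0., 1., 0., 1.],
              [0., 0., 1., 0.]])
Y_seed = np.array([[1., 0.], [0., 0.], [0., 0.], [0., 1.]])
Y = lp_zgl(W, Y_seed, np.array([True, False, False, True]))
# Interior nodes interpolate linearly along the chain:
# b -> [2/3, 1/3], c -> [1/3, 2/3]
```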
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Two Related Views
- Label diffusion: labels diffuse from labeled (seeded) nodes to unlabeled nodes over the graph.
- Random walk: start a walk at an unlabeled node and follow edges until a labeled node is reached.
Random Walk View
A random walk starts at node u and is currently at node v. What next?
- Continue the walk with probability p_v^cont
- Stop and assign v's seed label to u with probability p_v^inj
- Abandon the random walk with probability p_v^abnd, assigning u a dummy label
Discounting Nodes
- Certain nodes can be unreliable (e.g., high-degree nodes): do not allow propagation/walks through them.
- Solution: increase the abandon probability on such nodes: p_v^abnd ∝ degree(v)
Redefining Matrices
- New edge weight: W′_uv = p_u^cont × W_uv
- S_uu = p_u^inj
- R_u,dummy = p_u^abnd for the dummy label, and 0 for non-dummy labels
Modified Adsorption (MAD) [Talukdar and Crammer, ECML 2009]

    argmin_Ŷ  Σ_{l=1}^{m+1} [ ‖S Y_l − S Ŷ_l‖² + μ₁ Σ_{u,v} M_uv (Ŷ_ul − Ŷ_vl)² + μ₂ ‖Ŷ_l − R_l‖² ]

The three terms: match seeds (soft), smoothness, and match priors (a regularizer, used for the none-of-the-above dummy label).
- m labels, +1 dummy label
- M = W′ + W′^T is the symmetrized weight matrix
- Ŷ_vl: weight of label l on node v (estimated scores); Y_vl: seed weight for label l on node v (seed scores)
- S: diagonal matrix, nonzero for seed nodes
- R_vl: regularization target for label l on node v (label priors)
- MAD has extra regularization compared to LP-ZGL [Zhu et al., ICML 2003]; similar to QC [Bengio et al., 2006]
- MAD's objective is convex
Solving the MAD Objective
- Can be solved using matrix inversion (as in LP), but matrix inversion is expensive.
- Instead, solve the equivalent system of linear equations (Ax = b) using Jacobi iterations, which yields iterative updates with guaranteed convergence; see [Bengio et al., 2006] and [Talukdar and Crammer, ECML 2009] for details.
Solving MAD using Iterative Updates
Inputs: Y, R : |V| × (|L| + 1); W : |V| × |V|; S : |V| × |V| diagonal

    Ŷ ← Y;  M = W′ + W′^T
    Z_v ← S_vv + μ₁ Σ_{u≠v} M_vu + μ₂,  for all v ∈ V
    repeat
      for all v ∈ V:
        Ŷ_v ← (1 / Z_v) [ (S Y)_v + μ₁ M_v Ŷ + μ₂ R_v ]
    until convergence

Each update combines the seed score, the prior, and the current label estimates on v's neighbors (e.g., neighbors a, b, c with current estimates 0.75, 0.05, 0.60) into a new label estimate on v.
- The importance of a node can be discounted.
- Easily parallelizable, hence scalable (more later).
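The MAD update loop can be sketched directly in Python (the hyper-parameter values μ₁, μ₂ and the toy graph are assumptions; M is built from an already-reweighted W′):

```python
import numpy as np

def mad(Y, R, W, S, mu1=1.0, mu2=0.01, iters=200):
    """Modified Adsorption [Talukdar and Crammer, ECML 2009], a sketch.

    Y, R: (n, m+1) seed scores and label priors (last column for the
    dummy label); W: (n, n) reweighted edge weights W'; S: (n,) diagonal
    of the seed-indicator matrix. Implements the Jacobi updates above.
    """
    M = W + W.T                           # symmetrized weight matrix
    Z = S + mu1 * M.sum(axis=1) + mu2     # per-node normalizer Z_v
    Yhat = Y.astype(float).copy()
    for _ in range(iters):
        Yhat = (S[:, None] * Y + mu1 * (M @ Yhat) + mu2 * R) / Z[:, None]
    return Yhat

# Chain a - b - c with only node a seeded (label 0):
W = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
Y = np.array([[1., 0., 0.], [0., 0., 0.], [0., 0., 0.]])
Yhat = mad(Y, np.zeros((3, 3)), W, S=np.array([1., 0., 0.]))
# Label 0 diffuses from a and decays along the chain: a > b > c > 0
```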
When is MAD most effective?
[Figure: relative increase in MRR by MAD over LP-ZGL (0 to 0.4), plotted against average degree (0 to 30)]
MAD is particularly effective in denser graphs, where there is greater need for regularization.
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Transduction Algorithm with Confidence (TACO) [Orbach and Crammer, ECML 2012]
- Main lesson from MAD: discount bad (high-degree) nodes.
- TACO generalizes the notion of "bad" and adds a per-node, per-class confidence.
- Each node carries label scores together with a confidence, which can be low.
Introducing Confidence
[Figure: a node whose neighbors carry scores 0.3, 0.5, and 0.9 is assigned the score 0.35]
- Neighborhood disagreement ⇒ low confidence.
- Lower the effect of poorly estimated (low-confidence) scores.
TACO Objective
- Convex objective
- Iterative solution
TACO: Iterative Algorithm
TACO: Score Update
[Figure: a node with neighbor scores 0.3, 0.3, and 0.9 receives the score 0.35]
- Average the neighbors
- Consider confidence
TACO: Confidence Update
[Figure: the same neighborhood, scores 0.3, 0.3, 0.9, estimate 0.35]
Confidence is monotonic in the neighbors' disagreement.
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Manifold Regularization [Belkin et al., JMLR 2006]

    f* = argmin_f  (1/l) Σ_{i=1}^{l} V(y_i, f(x_i)) + γ_I f^T L f + γ_A ‖f‖²_K

- (1/l) Σ V(y_i, f(x_i)): training-data loss; V is a loss function (e.g., soft margin)
- f^T L f: smoothness regularizer; L is the Laplacian of a graph over the labeled and unlabeled data
- ‖f‖²_K: regularizer (e.g., L2)
- Trains an inductive classifier which can generalize to unseen instances
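A minimal primal sketch of this objective, using a linear model f(x) = w·x and squared loss (the closed form below and the toy data are illustrative assumptions, not the paper's kernel formulation):

```python
import numpy as np

def manifold_reg_fit(X, y, labeled, L, gamma_I=0.1, gamma_A=0.01):
    """Linear manifold regularization with squared loss.

    Minimizes (1/l) sum_i (w.x_i - y_i)^2 + gamma_I (Xw)^T L (Xw)
    + gamma_A ||w||^2, where L is the Laplacian over labeled AND
    unlabeled instances. Solved in closed form via the normal equations.
    """
    l = len(labeled)
    Xl, yl = X[labeled], y[labeled]
    A = Xl.T @ Xl / l + gamma_A * np.eye(X.shape[1]) + gamma_I * X.T @ L @ X
    return np.linalg.solve(A, Xl.T @ yl / l)

# Two clusters, one labeled point each; the rest is unlabeled.
X = np.array([[-2., 0.], [-2.2, .1], [-1.8, -.1],
              [2., 0.], [2.2, .1], [1.8, -.1]])
y = np.array([-1., 0., 0., 1., 0., 0.])   # only entries 0 and 3 are used
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0               # within-cluster edges
L = np.diag(W.sum(axis=1)) - W
w = manifold_reg_fit(X, y, [0, 3], L)
# np.sign(X @ w) separates the two clusters; since w is an explicit
# parameter vector, the classifier also applies to unseen x (inductive).
```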
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Transduction with Confidence · Manifold Regularization · Measure Propagation · Sparse Label Propagation
Measure Propagation (MP) [Subramanya and Bilmes, EMNLP 2008; NIPS 2009; JMLR 2011]

    C_KL = Σ_{i=1}^{l} D_KL(r_i ‖ p_i) + μ Σ_{i,j} w_ij D_KL(p_i ‖ p_j) − ν Σ_{i=1}^{n} H(p_i)
    argmin_{p_i} C_KL   s.t.  Σ_y p_i(y) = 1,  p_i(y) ≥ 0,  ∀y, i

- r_i, p_i: seed and estimated label distributions (normalized) on node i
- First term: divergence on seed nodes; second: smoothness (divergence across edges); third: entropic regularizer
- KL divergence: D_KL(p_i ‖ p_j) = Σ_y p_i(y) log (p_i(y) / p_j(y)); entropy: H(p_i) = −Σ_y p_i(y) log p_i(y)
- C_KL is convex (with non-negative edge weights and hyper-parameters)
- MP is related to Information Regularization [Corduneanu and Jaakkola, 2003]
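A small sketch that simply evaluates C_KL for candidate distributions (the hyper-parameter names `mu` and `nu` are assumptions; the actual solver follows on the next slides):

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

def mp_objective(P, R, W, n_seeds, mu=1.0, nu=0.1):
    """Evaluate the MP objective C_KL.

    P: (n, m) estimated distributions p_i (rows); R: seed distributions
    r_i for the first n_seeds nodes; W: (n, n) edge weights.
    """
    n = len(P)
    seed = sum(kl(R[i], P[i]) for i in range(n_seeds))
    smooth = sum(W[i, j] * kl(P[i], P[j])
                 for i in range(n) for j in range(n) if W[i, j] > 0)
    ent = sum(entropy(p) for p in P)
    return seed + mu * smooth - nu * ent

# Two nodes joined by an edge; node 0 is seeded with a uniform r_0.
W = np.array([[0., 1.], [1., 0.]])
R = np.array([[0.5, 0.5]])
P_uniform = np.array([[0.5, 0.5], [0.5, 0.5]])
P_skewed = np.array([[0.9, 0.1], [0.5, 0.5]])
# Agreeing, high-entropy distributions score lower (better):
print(mp_objective(P_uniform, R, W, 1) < mp_objective(P_skewed, R, W, 1))  # True
```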
Solving the MP Objective
For ease of optimization, reformulate the MP objective:

    C_MP = Σ_{i=1}^{l} D_KL(r_i ‖ q_i) + μ Σ_{i,j} w′_ij D_KL(p_i ‖ q_j) − ν Σ_{i=1}^{n} H(p_i)
    argmin_{p_i, q_i} C_MP,  where  w′_ij = w_ij + α δ(i, j)

- q_i: a new probability measure, one for each vertex, similar to p_i
- The α δ(i, j) self-edge term encourages agreement between p_i and q_i
- C_MP is also convex (with non-negative edge weights and hyper-parameters)
- C_MP can be solved using Alternating Minimization (AM)
Alternating Minimization
Fix Q and minimize over P, then fix P and minimize over Q, alternating: Q⁰ → P¹ → Q¹ → P² → Q² → P³ → ...
C_MP satisfies the necessary conditions for AM to converge [Subramanya and Bilmes, JMLR 2011].
Why AM?
Performance of SSL Algorithms
[Figure: comparison of accuracies for different numbers of labeled samples on the COIL (6 classes) and OPT (10 classes) datasets]
Graph SSL can be effective when the data satisfies the manifold assumption. More results and discussion in Chapter 21 of the SSL Book (Chapelle et al.).
Outline
Motivation · Graph Construction · Inference Methods · Scalability · Applications · Conclusion & Future Work
Inference Methods: Label Propagation · Modified Adsorption · Manifold Regularization · Spectral Graph Transduction · Measure Propagation · Sparse Label Propagation
Background: Factor Graphs [Kschischang et al., 2001]
A factor graph is a bipartite graph with:
- variable nodes (e.g., the label distribution on a node)
- factor nodes, each a fitness function over the assignment of the variables connected to it
The distribution over all variables' values factorizes as a product over the factor nodes F, with each factor applied to the variables connected to it.
Factor Graph Interpretation of Graph SSL [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
3-term Graph SSL Objective (common to many algorithms):
min  Seed Matching Loss (if any) + Edge Smoothness Loss + Regularization Loss
Smoothness factor on edge (1, 2): ψ(q1, q2) = exp{−w1,2 ||q1 − q2||²}
Seed matching factor (unary); regularization factor (unary)
53
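The three losses above admit simple closed-form coordinate updates, so the objective can be minimized with Jacobi-style iterations. A minimal sketch (the toy graph, seed format, and hyperparameters mu, nu are illustrative, not the tutorial's exact algorithm):

```python
import numpy as np

def graph_ssl(W, seeds, n_labels, mu=1.0, nu=0.01, iters=50):
    """Minimize  sum_seeds ||q_i - y_i||^2 + mu * sum_ij w_ij ||q_i - q_j||^2
    + nu * sum_i ||q_i - u||^2  (u = uniform prior) by Jacobi iterations."""
    n = W.shape[0]
    u = np.full(n_labels, 1.0 / n_labels)
    Y = np.zeros((n, n_labels))
    is_seed = np.zeros(n)
    for i, label in seeds.items():
        Y[i, label] = 1.0
        is_seed[i] = 1.0
    Q = np.tile(u, (n, 1))  # start every node at the uniform prior
    for _ in range(iters):
        # Closed-form update: each node is pulled toward its seed label,
        # its neighbors' current estimates, and the uniform prior.
        num = is_seed[:, None] * Y + mu * (W @ Q) + nu * u
        den = is_seed + mu * W.sum(axis=1) + nu
        Q = num / den[:, None]
    return Q

# Chain graph 0-1-2-3 with the endpoints seeded with opposite labels.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
Q = graph_ssl(W, seeds={0: 0, 3: 1}, n_labels=2)
```

Each interior node ends up preferring the label of its nearer seed, and every row of Q stays a distribution because the update is a convex combination.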
Factor Graph Interpretation [Zhu et al., ICML 2003] [Das and Smith, NAACL 2012]
[Figure: factor graph over variable nodes q1...q4 with seed distributions r1...r4]
1. Factors encouraging agreement on seed labels
2. Smoothness factors
3. Unary factors for regularization: log φt(qt)
54
Label Propagation with Sparsity
Enforce sparsity through a sparsity-inducing unary factor:
−log φt(qt) = ...  (Lasso [Tibshirani, 1996])
−log φt(qt) = ...  (Elitist Lasso [Kowalski and Torrésani, 2009])
For more details, see [Das and Smith, NAACL 2012]
55
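Sparsity-inducing factors of this kind are typically handled with a proximal step; for the Lasso penalty λ||q||₁ this is coordinate-wise soft-thresholding. A minimal sketch of that mechanism (the elitist-lasso variant is omitted):

```python
def soft_threshold(q, lam):
    """Proximal operator of lam * ||q||_1: shrink every coordinate toward zero
    and clip small ones to exactly zero -- this is how a Lasso-style unary
    factor drives most label weights on a node to zero."""
    return [max(abs(x) - lam, 0.0) * (1.0 if x >= 0 else -1.0) for x in q]

# The middle coordinate is below the threshold and becomes exactly zero.
sparse_q = soft_threshold([0.6, 0.05, -0.3], lam=0.1)
```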
Other Graph-SSL Methods
- Spectral Graph Transduction [Joachims, ICML 2003]
- SSL on Directed Graphs [Zhou et al., NIPS 2005], [Zhou et al., ICML 2005]
- Learning with dissimilarity edges [Goldberg et al., AISTATS 2007]
- Learning to order: GraphOrder [Talukdar et al., CIKM 2012]
- Graph Transduction using Alternating Minimization [Wang et al., ICML 2008]
- Graph as regularizer for a Multi-Layered Perceptron [Karlen et al., ICML 2008], [Malkin et al., Interspeech 2009]
56
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering
- MapReduce Parallelization
57
More (Unlabeled) Data is Better Data [Subramanya & Bilmes, JMLR 2011]
[Figure: graph with 120M vertices]
Challenges with large unlabeled data:
- Constructing a graph from large data
- Scalable inference over large graphs
58
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering
- MapReduce Parallelization
59
Scalability Issues (I): Graph Construction
- Brute-force (exact) k-NNG is too expensive (quadratic)
- Approximate nearest neighbors using kd-trees [Friedman et al., 1977; also see http://www.cs.umd.edu/mount/]
60
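A kd-tree answers nearest-neighbor queries in roughly logarithmic time per query on low-dimensional data, avoiding the quadratic brute-force scan. A minimal exact kd-tree sketch (a toy implementation, not the Friedman et al. algorithm's optimized form; in practice one would use a tuned library):

```python
import heapq
import math

def build_kdtree(points, depth=0):
    """Recursively split the points on one coordinate axis per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def knn(node, query, k, heap=None):
    """Branch-and-bound search: a max-heap (negated distances) keeps the k
    closest points seen so far; a far subtree is pruned when the splitting
    plane is farther than the current k-th best distance."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d = math.dist(query, node["point"])
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["point"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["point"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    knn(near, query, k, heap)
    if len(heap) < k or abs(diff) < -heap[0][0]:  # can the far side still matter?
        knn(far, query, k, heap)
    return heap

tree = build_kdtree([(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (5.0, 0.0), (0.0, 3.0)])
nearest = sorted(p for _, p in knn(tree, (1.9, 0.0), k=2))
```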
Scalability Issues (II): Label Inference
- Sub-sample the data
  - Construct the graph over a subset of the unlabeled data [Delalleau et al., AISTATS 2005]
  - Sparse Grids [Garcke & Griebel, KDD 2001]
- How about using more computation? (next section)
  - Symmetric multi-processor (SMP)
  - Distributed computer
61
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering [Subramanya & Bilmes, JMLR 2011; Bilmes & Subramanya, 2011]
- MapReduce Parallelization
62
Parallel computation on an SMP
[Figure: graph nodes 1, 2, ..., k, k+1, ... (neighbors not shown) assigned to Processors 1 through k of an SMP with k processors]
Label Update using Message Passing
[Figure: node v with neighbors a, b, c (edge values 0.75, 0.05, 0.60); v computes its new label estimate from its neighbors' current estimates and its own seed prior]
64
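A single node's update touches only its own seed prior and the messages from its neighbors, which is exactly why disjoint sets of nodes can be updated on different processors. A minimal sketch using the figure's edge values (the weighted-average form is illustrative, not the exact MP/MAD update):

```python
def update_node(neighbor_msgs, seed_prior=None, mu=1.0):
    """One message-passing update for a single node.
    neighbor_msgs: list of (edge_weight, neighbor_label_distribution) pairs.
    seed_prior: the node's seed label distribution (None for unlabeled nodes).
    The update reads only local information, so disjoint node sets can be
    processed by different processors of an SMP."""
    n_labels = len(neighbor_msgs[0][1])
    num = [0.0] * n_labels
    den = 0.0
    for w, q in neighbor_msgs:
        for c in range(n_labels):
            num[c] += mu * w * q[c]
        den += mu * w
    if seed_prior is not None:
        for c in range(n_labels):
            num[c] += seed_prior[c]
        den += 1.0
    return [x / den for x in num]

# Node v from the figure: neighbors a, b, c with edge values 0.75, 0.05, 0.60.
new_v = update_node([(0.75, [1.0, 0.0]), (0.05, [0.0, 1.0]), (0.60, [0.5, 0.5])])
```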
Speed-up on SMP
[Figure: ratio of time spent using 1 processor to time spent using n processors; the speed-up flattens beyond a point. Cache miss?]
Graph with 1.4M nodes; SMP with 16 cores and 128GB of RAM
[Subramanya & Bilmes, JMLR 2011]
65
Node Reordering Algorithm
Input: Graph G = (V, E)
Result: Node-ordered graph
1. Select an arbitrary node v
2. while unselected nodes remain do
   2.1. select an unselected node v′ whose neighbor set has maximum overlap with the neighbors of v
   2.2. mark v′ as selected
   2.3. set v to v′
The search in step 2.1 can be done exhaustively for sparse (e.g., k-NN) graphs
[Subramanya & Bilmes, JMLR 2011]
66
Node Reordering Algorithm: Intuition
Which node should be placed after k to optimize cache performance?
[Figure: node k with neighbors a, b, c; N(a) = {e, j, c}, N(b) = {a, d, c}, N(c) = {l, m, n}]
Cardinality of intersection:
|N(k) ∩ N(a)| = 1
|N(k) ∩ N(b)| = 2
|N(k) ∩ N(c)| = 0
Best node: b
67
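The greedy reordering described above can be sketched directly; applied to the intuition example it picks b after k. A minimal sketch (the dict-of-sets adjacency format and the deterministic tie-breaking are illustrative):

```python
def reorder_nodes(adj, start):
    """Greedy node reordering: repeatedly pick the unselected node whose
    neighbor set overlaps most with the current node's neighbors, so that
    consecutive nodes share cached label data.
    adj: dict mapping node -> set of neighbor nodes."""
    remaining = set(adj)
    v = start
    order = [v]
    remaining.remove(v)
    while remaining:
        # Maximize |N(v) ∩ N(v')| over unselected nodes v' (sorted for
        # deterministic tie-breaking).
        v = max(sorted(remaining), key=lambda u: len(adj[v] & adj[u]))
        order.append(v)
        remaining.remove(v)
    return order

# The slide's example: N(k)={a,b,c}, N(a)={e,j,c}, N(b)={a,d,c}, N(c)={l,m,n}.
adj = {"k": {"a", "b", "c"}, "a": {"e", "j", "c"},
       "b": {"a", "d", "c"}, "c": {"l", "m", "n"}}
order = reorder_nodes(adj, start="k")
```

Starting at k, the overlaps are 1, 2, 0 for a, b, c, so b is placed next, as in the figure.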
Speed-up on SMP after Node Ordering
[Subramanya & Bilmes, JMLR 2011]
68
Distributed Processing
- Maximize overlap between consecutive nodes within the same machine
- Minimize overlap across machines (reduces inter-machine communication)
69
Distributed Processing [Bilmes & Subramanya, 2011]
[Figure: maximize intra-processor overlap; minimize inter-processor overlap]
70
Node reordering for a Distributed Computer
[Figure: node sequences on Processor #i and Processor #j; maximize overlap within a processor, minimize overlap across processors]
71
Distributed Processing Results
[Bilmes & Subramanya, 2011]
72
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Scalability Issues
- Node reordering
- MapReduce Parallelization
73
MapReduce Implementation of MAD
- Map: each node sends its current label assignments to its neighbors
- Reduce: each node updates its own label assignment using the messages received from its neighbors and its own information (e.g., seed labels, regularization penalties, etc.)
- Repeat until convergence
[Figure: node v with neighbors a, b, c (edge values 0.75, 0.05, 0.60); the Reduce step produces v's new label estimate]
Code in the Junto Label Propagation Toolkit (includes a Hadoop-based implementation): http://code.google.com/p/junto/
Graph-based algorithms are amenable to distributed processing
74
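The Map and Reduce steps above can be simulated in memory; each Reduce output depends only on a node's incoming messages and its own seed information, which is what makes the computation shard cleanly. A toy sketch (a MAD-like weighted-average update; the real Junto implementation differs in details):

```python
from collections import defaultdict

def map_phase(graph, labels):
    """Map: every node emits its current label distribution to each neighbor."""
    for node, neighbors in graph.items():
        for nbr, weight in neighbors.items():
            yield nbr, (weight, labels[node])

def reduce_phase(messages, seeds, n_labels, mu=1.0):
    """Reduce: every node combines the messages it received with its own
    seed information (a weighted average, as in the LP-style updates)."""
    grouped = defaultdict(list)
    for node, msg in messages:
        grouped[node].append(msg)
    new_labels = {}
    for node, msgs in grouped.items():
        num = [0.0] * n_labels
        den = 0.0
        for w, q in msgs:
            for c in range(n_labels):
                num[c] += mu * w * q[c]
            den += mu * w
        if node in seeds:                 # the node's own information
            num[seeds[node]] += 1.0
            den += 1.0
        new_labels[node] = [x / den for x in num]
    return new_labels

# Two-node graph; 'a' is seeded with label 0. Iterate map -> reduce.
graph = {"v": {"a": 1.0}, "a": {"v": 1.0}}
labels = {"v": [0.5, 0.5], "a": [0.5, 0.5]}
for _ in range(2):
    labels = reduce_phase(map_phase(graph, labels), seeds={"a": 0}, n_labels=2)
```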
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation
- POS Tagging
- MultiLingual POS Tagging
75
Problem Description & Motivation
- Given a frame of speech, classify it into one of n phones
- Training supervised models requires large amounts of labeled data (a challenge for phone classification in resource-scarce languages)
76
TIMIT
- Corpus of read speech
- Broadband recordings of 630 speakers of 8 major dialects of American English
- Each speaker read 10 sentences
- Includes time-aligned phonetic transcriptions
- Phone set has 61 phones [Lee & Hon, 89]
  - mapped down to 48 phones for modeling
  - further mapped down to 39 phones for scoring
77
Feature Extraction
[Figure: 25ms Hamming window at 100 Hz → MFCCs → deltas; features from a context window of frames around frame ti]
78
Graph Construction
- TIMIT Training Set + TIMIT Development Set → k-NN Graph
- wij = exp{−(xi − xj)ᵀ Σ⁻¹ (xi − xj)}
- k = 10; graph with about 1.4M nodes
- Labels not used during graph construction
79
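The edge-weight formula can be sketched directly. Here Σ is taken to be diagonal, estimated from per-dimension feature variances (an assumption; the slides leave Σ's estimation unspecified), and the dense pairwise computation is only for small toy inputs:

```python
import numpy as np

def edge_weights(X, k=10):
    """k-NN graph with RBF weights w_ij = exp(-(x_i - x_j)^T Sigma^{-1} (x_i - x_j)).
    Sigma is assumed diagonal, estimated from per-dimension feature variances."""
    inv_var = 1.0 / np.var(X, axis=0)          # diagonal Sigma^{-1}
    diffs = X[:, None, :] - X[None, :, :]      # all pairwise differences
    d2 = (diffs ** 2 * inv_var).sum(axis=-1)   # Mahalanobis-style squared distances
    W = np.exp(-d2)
    np.fill_diagonal(W, 0.0)                   # no self-edges
    keep = np.argsort(-W, axis=1)[:, :k]       # each node's k strongest edges
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(X.shape[0])[:, None], keep] = True
    return np.where(mask | mask.T, W, 0.0)     # symmetrized k-NN graph

# Two tight pairs of points: each point's single neighbor is its partner.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
W = edge_weights(X, k=1)
```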
Inductive Extension
[Figure: a test sample goes through feature extraction, its nearest neighbors in the graph (built over the TIMIT Training + Development Sets) are found, and its label is predicted with a Nadaraya-Watson estimator over those neighbors]
80
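The Nadaraya-Watson step admits a compact sketch: the unseen sample's label distribution is a kernel-weighted average of the distributions at its graph neighbors (the RBF kernel and bandwidth are illustrative choices):

```python
import math

def nadaraya_watson(x, neighbors, bandwidth=1.0):
    """Inductive extension: predict a label distribution for an unseen sample x
    as a kernel-weighted average of label distributions at its graph neighbors.
    neighbors: list of (feature_vector, label_distribution) pairs."""
    n_labels = len(neighbors[0][1])
    num = [0.0] * n_labels
    den = 0.0
    for xj, qj in neighbors:
        d2 = sum((a - b) ** 2 for a, b in zip(x, xj))
        w = math.exp(-d2 / bandwidth)          # RBF kernel weight
        for c in range(n_labels):
            num[c] += w * qj[c]
        den += w
    return [v / den for v in num]

# A test point sitting on top of one neighbor inherits (mostly) its label.
q = nadaraya_watson((0.0, 0.0), [((0.0, 0.0), [1.0, 0.0]),
                                 ((3.0, 0.0), [0.0, 1.0])])
```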
Results (I)
81
Results (II)
82
Labeled Graph Construction [Alexandrescu & Kirchhoff, ASRU 2007]
- Extract features from the TIMIT Training Set and train an MLP on them
- Apply the MLP to the TIMIT Training + Development Sets
- Construct the graph using Jensen-Shannon divergence between MLP outputs
83
Results (III) [Liu & Kirchhoff, UW-EE TR 2012]
[Figure: phone accuracy (65–74%) vs. fraction of labeled data (10%, 50%, 100%) for MLP, MP, MAD, and p-MP]
- Take advantage of labeled data during graph construction
- Regularize towards a prior (when available)
84
Switchboard Phonetic Annotation
- The Switchboard corpus consists of about 300 hours of conversational speech
- Less reliable, automatically generated phone-level annotations [Deshmukh et al., 1998]
- Switchboard Transcription Project (STP) [Greenberg, 1995]
  - Manual phonetic annotation
  - Only about 75 minutes of data annotated
85
Graph Construction
- x% of SWB + STP → k-NN Graph
- kd-tree; k = 10; graph with about 120M nodes
86
Results [Subramanya & Bilmes, JMLR 2011]
[Figure: results on the graph with 120M vertices]
87
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation
- POS Tagging
- MultiLingual POS Tagging
88
Problem Description & Motivation
- Given a document (e.g., web page, news article), assign it to a fixed number of semantic categories (e.g., sports, politics, entertainment)
- Multi-label problem
- Training supervised models requires large amounts of labeled data [Dumais et al., 1998]
89
Corpora
- Reuters [Lewis, et al., 1978]: newswire. About 20K documents with 135 categories. Use the top 10 categories (e.g., earnings, acquisitions, wheat, interest) and label the remaining as other
- WebKB [Bekkerman, et al., 2003]: 8K webpages from 4 academic domains. Categories include course, department, faculty and project
90
Feature Extraction [Lewis, et al., 1978]
Document: "Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, ..."
→ Stop-word removal: "Showers continued week Bahia cocoa zone alleviating drought early January improving prospects coming temporao, ..."
→ Stemming: "shower continu week bahia cocoa zone allevi drought earli januari improv prospect come temporao, ..."
→ TFIDF-weighted bag-of-words (e.g., shower: 0.01, bahia: 0.34, cocoa: 0.33, ...)
91
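The pipeline above can be sketched end to end; here stemming is omitted for brevity and the stop-word list is a tiny illustrative stand-in for a real one:

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (a real system would use a full list).
STOPWORDS = {"the", "in", "and", "for", "throughout", "since"}

def tfidf(docs):
    """Stop-word removal followed by TFIDF weighting (stemming omitted).
    docs: list of token lists. Returns one {term: weight} dict per document."""
    cleaned = [[t.lower() for t in d if t.lower() not in STOPWORDS] for d in docs]
    n = len(docs)
    df = Counter(t for d in cleaned for t in set(d))    # document frequency
    out = []
    for d in cleaned:
        tf = Counter(d)
        # term frequency (normalized) times inverse document frequency
        out.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

vecs = tfidf([["The", "cocoa", "zone"], ["the", "cocoa", "market"]])
```

A term appearing in every document (here "cocoa") gets zero weight, while distinctive terms like "zone" keep positive weight, which is the point of the IDF factor.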
Results
Average PRBEP (precision-recall break-even point)

          SVM   TSVM  SGT   LP    MP    MAD
Reuters   48.9  59.3  60.3  59.7  66.3  -
WebKB     23.0  29.2  36.8  41.2  51.9  53.7

SVM: Support Vector Machine (supervised); TSVM: Transductive SVM [Joachims 1999]; SGT: Spectral Graph Transduction [Joachims 2003]; LP: Label Propagation [Zhu & Ghahramani 2002]; MP: Measure Propagation [Subramanya & Bilmes 2008]; MAD: Modified Adsorption [Talukdar & Crammer 2009]
92
Results on WebKB [Subramanya & Bilmes, EMNLP 2008]
[Figure: PRBEP vs. number of labeled examples; Modified Adsorption (MAD) [Talukdar & Crammer 2009] outperforms the other approaches]
- More labeled data => better performance
- Unnormalized distributions (scores) are more suitable for multi-label problems (MAD outperforms the other approaches)
93
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging [Subramanya & Bilmes, JMLR 2011]
- Statistical Machine Translation
- POS Tagging
- MultiLingual POS Tagging
94
Problem description
Dialog acts (DA) reflect the function that utterances serve in discourse
Applications in automatic speech recognition (ASR), machine translation (MT) & natural language processing (NLP)
95
Switchboard DA Corpus
- Switchboard dialog act (DA) tagging project [Jurafsky, et al., 1997]
- Manually labeled 1155 conversations (about 200k sentences)
- Example labels: question, answer, backchannel, agreement, ... (total of 42 different DAs)
- Use the top 18 DAs (about 185k sentences)
96
Graph Construction
- Bigram & trigram TFIDF
- Cosine similarity
- k-NN graph
97
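The construction above (TFIDF vectors compared with cosine similarity, linked to the k nearest neighbors) can be sketched as follows; the sparse dict vector format and function names are illustrative:

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse TFIDF vectors ({term: weight})."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_edges(vectors, k):
    """For each sentence, keep edges to its k most cosine-similar sentences."""
    edges = {}
    for i, u in enumerate(vectors):
        sims = sorted(((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
                      reverse=True)
        edges[i] = [j for _, j in sims[:k]]
    return edges

# Sentences 0 and 1 share almost all terms; sentence 2 shares none.
vecs = [{"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0, "c": 0.1}, {"x": 1.0}]
edges = knn_edges(vecs, k=1)
```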
SWB DA Tagging Results [Subramanya & Bilmes, JMLR 2011]
Baseline performance: 84.2% [Ji & Bilmes, 2005]
[Figure: accuracy (79–86%) for bigram and trigram features, SQ-Loss vs. MP]
98
Dihana Corpus
- Computer-human dialogues answering telephone queries about train services in Spain
  - about 900 dialogues; topics include timetables, fares and services
- 225 speakers
- Standard train/test split (16k/7.5k sentences)
- Number of labels = 72
99
Dihana Results [Subramanya & Bilmes, JMLR 2011]
Results for classifying user turns
Baseline performance: 76.4%
[Figure: accuracy (75–82%) for bigram and trigram features, SQ-Loss vs. MP]
100
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation [Alexandrescu & Kirchhoff, NAACL 2009]
- POS Tagging
- MultiLingual POS Tagging
101
Problem Description
- Phrase-based statistical machine translation (SMT)
- Sentences are translated in isolation
102
Motivation
Source: Asf lA ymknk *lk hnAk .....
1-best: im sorry you cant there is a cost
Source: E*rA lA ymknk t$gyl ......
1-best: excuse me i you turn tv ....
Correct: lA ymknk → you may not / you cannot
The same source phrase receives different translations in different sentences; use a graph to enforce a smoothness constraint across similar sentences
103
Issues
- What we want to do: exploit similarity between sentences
- Input consists of variable-length word strings
- Output space is structured (the number of possible labels is very large)
104
Graph Construction (I)
Labeled data: {(s1, t1), ..., (sl, tl)}; each (si, ti) encodes a source & target sentence
Unlabeled data (test set): {sl+1, ..., sn}
How do we compute the edge weight wij between (si, ti) and (sj, tj)? What about the test set, which has no targets?
105
Graph Construction (II)
Labeled data: {(s1, t1), ..., (sl, tl)}
Unlabeled data (test set): {sl+1, ..., sn}
Each unlabeled sentence si is run through a first-pass decoder to produce an N-best list {h(1)si, ..., h(N)si}, yielding candidate pairs {(si, h(1)si), ..., (si, h(N)si)}
106
Graph Construction (III)
- For an unlabeled sentence si, find the most similar sentences among all sentences (labeled + test): connect pairs whose sentence similarity exceeds a threshold
- Compare hypotheses h(j)si with targets & hypotheses tk via a label similarity, combined by averaging
107
Corpora
- IWSLT 2007 task: Italian-to-English (IE) & Arabic-to-English (AE) travel tasks
- Each task has train/dev/eval sets
- Baseline: standard phrase-based SMT with a log-linear model; yields state-of-the-art performance
- Results are measured using BLEU score [Papineni et al., ACL 2002] and phrase error rate (PER)
108
Results (I)
[Figure: IE results on the eval set; BLEU (29–33) for Baseline vs. LP, with BLEU-based and string-kernel sentence similarities]
Geometric-mean-based averaging worked the best
109
Results (II)
[Figure: IE results on the eval set (BLEU and PER, 37–40) and AE results on the eval set (37–42), Baseline vs. LP; higher is better for BLEU, lower is better for PER]
110
Outline
Motivation Graph Construction Inference Methods Scalability Applications Conclusion & Future Work
- Phone Classification
- Text Categorization
- Dialog Act Tagging
- Statistical Machine Translation
- POS Tagging [Subramanya et al., EMNLP 2008]
- MultiLingual POS Tagging
111
Motivation: Domain Adaptation
- Small amounts of labeled source-domain data:
  ... bought a book detailing the ...  → ... VBD DT NN VBG DT ...
  ... wanted to book a flight to ...   → ... VBD TO VB DT NN TO ...
  ... the book is about the ...        → ... DT NN VBZ PP DT ...
- Large amounts of unlabeled target-domain data:
  ... how to book a band ...
  can you book a day room ...
  ... how to unrar a file ...
112
Graph Construction (I)

when do you book plane tickets?
WRB VBP PRP VB NN NNS

do you read a book on the plane?
VBP PRP VB DT NN IN DT NN

Similarity?
Graph Construction (II)

how much does it cost to book a band ?
what was the book that has no letter e ?
how to get a book agent ?
can you book a day room at hilton hawaiian village ?

Trigram types centered on "book":
to book a
the book that
a book agent
you book a

k-nearest neighbors?
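The slides ask whether k-nearest neighbors is the right way to connect trigram types. A minimal sketch of that construction, assuming each trigram type has already been mapped to a feature vector (the toy vectors and the `knn_graph` helper below are illustrative, not the tutorial's implementation); each node is linked to its k most cosine-similar nodes and edges are symmetrized:

```python
import numpy as np

def knn_graph(X, k):
    """Build a symmetrized k-nearest-neighbor graph from row feature vectors.

    X: (n, d) matrix, one feature vector per trigram type.
    Returns a dict mapping node index -> {neighbor index: similarity}.
    """
    # Cosine similarity between all pairs of nodes.
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    np.fill_diagonal(S, -np.inf)  # exclude self-edges

    graph = {i: {} for i in range(len(X))}
    for i in range(len(X)):
        for j in np.argsort(S[i])[-k:]:  # k most similar nodes
            graph[i][int(j)] = float(S[i, j])
            graph[int(j)][i] = float(S[i, j])  # symmetrize
    return graph

# Toy feature vectors for the four trigram types on the slide.
nodes = ["to book a", "the book that", "a book agent", "you book a"]
X = np.array([[1.0, 0.2, 0.0],
              [0.1, 1.0, 0.3],
              [0.0, 0.3, 1.0],
              [0.9, 0.1, 0.1]])
g = knn_graph(X, k=1)
```

With these toy vectors, "you book a" links to "to book a" (they share the verb-like context), while "the book that" links to "a book agent".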
Graph Construction (III)

Graph over trigram types:
you book a, the book that, to book a, a book agent,
the city that, to run a, to become a, you start a,
a movie agent, you unrar a
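The point of connecting the unseen target-domain trigram "you unrar a" into the graph is that it can receive a POS label distribution from its labeled neighbors. A minimal iterative label-propagation sketch over a toy version of this graph (the edge weights and the clamping of seed nodes are illustrative assumptions, not the tutorial's exact objective):

```python
# Toy graph: node -> {neighbor: edge weight} (hypothetical weights).
graph = {
    "you book a":  {"you unrar a": 0.8, "to book a": 0.6},
    "to book a":   {"you book a": 0.6, "you unrar a": 0.5},
    "you unrar a": {"you book a": 0.8, "to book a": 0.5},
}
# Labeled source-domain seed nodes: P(tag | node), kept fixed.
seeds = {
    "you book a": {"VB": 1.0},
    "to book a":  {"VB": 1.0},
}

labels = {n: dict(d) for n, d in seeds.items()}
for _ in range(10):  # iterate to (approximate) convergence
    new = {}
    for node, nbrs in graph.items():
        if node in seeds:            # clamp labeled nodes
            new[node] = dict(seeds[node])
            continue
        scores = {}                  # weighted average of neighbor labels
        for nbr, w in nbrs.items():
            for tag, p in labels.get(nbr, {}).items():
                scores[tag] = scores.get(tag, 0.0) + w * p
        z = sum(scores.values()) or 1.0
        new[node] = {t: s / z for t, s in scores.items()}
    labels = new

print(labels["you unrar a"])  # -> {'VB': 1.0}
```

"you unrar a" never occurs in the labeled source domain, yet it ends up tagged as a verb context because its graph neighbors are verb contexts.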
Graph Construction - Features

how much does it cost to book a band ?

Feature                      Value
Trigram + Context            cost to book a band
Trigram                      to book a
Left Context                 cost to
Right Context                a band
Center Word                  book
Trigram - Center Word        to ____ a
Left Word + Right Context    to ____ a band
Left Context + Right Word    cost to ____ a
Suffix                       none
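The feature set above can be sketched as code. A minimal extraction function, assuming whitespace-tokenized input and the feature names shown on the slide (the suffix check at the end is a hypothetical placeholder; the slide only shows that "book" gets "none"):

```python
def trigram_features(tokens, i):
    """Extract the slide's graph-node features for the trigram centered at tokens[i].

    Assumes 2 <= i <= len(tokens) - 3 so the surrounding 5-gram exists.
    """
    l2, l1, c, r1, r2 = tokens[i - 2], tokens[i - 1], tokens[i], tokens[i + 1], tokens[i + 2]
    return {
        "trigram + context": f"{l2} {l1} {c} {r1} {r2}",
        "trigram": f"{l1} {c} {r1}",
        "left context": f"{l2} {l1}",
        "right context": f"{r1} {r2}",
        "center word": c,
        "trigram - center word": f"{l1} ____ {r1}",
        "left word + right context": f"{l1} ____ {r1} {r2}",
        "left context + right word": f"{l2} {l1} ____ {r1}",
        # Hypothetical suffix rule; "book" yields "none" as on the slide.
        "suffix": "ing" if c.endswith("ing") else "none",
    }

sent = "how much does it cost to book a band ?".split()
feats = trigram_features(sent, sent.index("book"))
print(feats["trigram"])       # -> to book a
print(feats["left context"])  # -> cost to
```

Each distinct trigram type in the corpus gets one node, and these features (aggregated over all of the type's occurrences) define the vector used to find its nearest neighbors.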