
Page 1: Structured Perceptron

STRUCTURED PERCEPTRON
Alice Lai and Shi Zhi

Page 2: Structured Perceptron

Presentation Outline
• Introduction to Structured Perceptron
• ILP-CRF Model
• Averaged Perceptron
• Latent Variable Perceptron

Page 3: Structured Perceptron

Motivation
• An algorithm to learn weights for structured prediction
• Alternative to POS tagging with MEMM and CRF (Collins 2002)
• Convergence guarantees under certain conditions, even for inseparable data
• Generalizes to new examples and other sequence labeling problems

Page 4: Structured Perceptron

POS Tagging Example

Gold labels: the/D man/N saw/V the/D dog/N
Prediction: the/D man/N saw/N the/D dog/N
Parameter update:
• Add 1 to the weights of features that fire in the gold sequence but not in the prediction (here, the features for tagging "saw" as V)
• Subtract 1 from the weights of features that fire in the prediction but not in the gold sequence (here, the features for tagging "saw" as N)

[Figure: tagging lattice over the sentence "the man saw the dog", with candidate tags D, N, V, A at each position.]
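As a concrete sketch of this update (the feature templates below are illustrative assumptions, not taken from the slides), only the weights of features that touch the mistake on "saw" end up changed:

```python
from collections import defaultdict

def feature_counts(words, tags):
    """Count simple emission (word, tag) and transition (prev_tag, tag) features."""
    counts = defaultdict(int)
    prev = "<s>"
    for word, tag in zip(words, tags):
        counts[("emit", word, tag)] += 1
        counts[("trans", prev, tag)] += 1
        prev = tag
    return counts

words = ["the", "man", "saw", "the", "dog"]
gold  = ["D", "N", "V", "D", "N"]
pred  = ["D", "N", "N", "D", "N"]

weights = defaultdict(float)

# Add 1 for every gold feature, subtract 1 for every predicted feature;
# features shared by both sequences cancel out.
for f, c in feature_counts(words, gold).items():
    weights[f] += c
for f, c in feature_counts(words, pred).items():
    weights[f] -= c

# Only features involving the tag of "saw" keep a nonzero weight.
print({f: w for f, w in weights.items() if w != 0})
```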

Page 5: Structured Perceptron

MEMM Approach

• Conditional model: probability of the current state given the previous state and the current observation
• For the tagging problem, define local features for each tag in context
  • Features are often indicator functions
• Learn parameter vector α with Generalized Iterative Scaling or gradient descent
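A minimal sketch of the local log-linear model this describes, in generic notation (the exact feature arguments used on the slide are not preserved):

```latex
P(s \mid s', o) \;=\;
\frac{\exp\bigl(\alpha \cdot \phi(o, s', s)\bigr)}
     {\sum_{s''} \exp\bigl(\alpha \cdot \phi(o, s', s'')\bigr)}
```

Here s' is the previous state, o is the current observation, and φ is the local feature vector.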

Page 6: Structured Perceptron

Global Features
• Local features are defined only for a single label
• Global features are defined for an observed sequence and a possible label sequence
• Simple version: global features are local features summed over an observation-label sequence pair
• Compared to the original perceptron algorithm, we predict a vector of labels instead of a single label
• Which of the possible incorrect label vectors do we use as the negative example in training?
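In the "simple version" above, the global feature vector is just the sum of local features over positions (a sketch in generic notation, following Collins 2002):

```latex
\Phi(x, y) \;=\; \sum_{i=1}^{|x|} \phi(x, i, y_{i-1}, y_i)
```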

Page 7: Structured Perceptron

Structured Perceptron Algorithm
Input: training examples (x_i, y_i), i = 1…n
Initialize parameter vector α = 0
For t = 1…max_iter:
  For i = 1…n:
    z_i = argmax_{z ∈ GEN(x_i)} Φ(x_i, z) · α
    If z_i ≠ y_i then update: α = α + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: parameter vector α

GEN(x_i) enumerates the possible label sequences for observed sequence x_i.
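A minimal runnable sketch of this training loop, assuming simple emission/transition features and a brute-force search over GEN(x) in place of Viterbi (illustrative, not the authors' implementation):

```python
from collections import defaultdict
from itertools import product

TAGS = ["D", "N", "V", "A"]

def phi(words, tags):
    """Global feature vector: local emission/transition counts summed over the sequence."""
    feats = defaultdict(int)
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("emit", w, t)] += 1
        feats[("trans", prev, t)] += 1
        prev = t
    return feats

def score(weights, feats):
    return sum(weights[f] * c for f, c in feats.items())

def decode(weights, words):
    """argmax over GEN(x): brute force here, Viterbi in a real implementation."""
    return max(product(TAGS, repeat=len(words)),
               key=lambda tags: score(weights, phi(words, tags)))

def train(data, max_iter=10):
    weights = defaultdict(float)
    for _ in range(max_iter):
        for words, gold in data:
            pred = list(decode(weights, words))
            if pred != gold:  # mistake-driven update
                for f, c in phi(words, gold).items():
                    weights[f] += c
                for f, c in phi(words, pred).items():
                    weights[f] -= c
    return weights

data = [(["the", "man", "saw", "the", "dog"], ["D", "N", "V", "D", "N"])]
weights = train(data)
print(decode(weights, ["the", "dog", "saw", "the", "man"]))
```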

Page 8: Structured Perceptron

Properties
• Convergence
  • Data is separable with margin δ if there is some vector U with ||U|| = 1 such that U · Φ(x_i, y_i) − U · Φ(x_i, z) ≥ δ for every incorrect candidate z
  • For data that is separable with margin δ, the number of mistakes made in training is bounded by R²/δ², where R is a constant such that ||Φ(x_i, y_i) − Φ(x_i, z)|| ≤ R
• Inseparable case
  • The number of mistakes is still bounded, in terms of how close the data is to being separable with some margin
• Generalization
  • Few mistakes in training translate into guarantees on new examples

Theorems and proofs from Collins 2002
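The separable-case guarantee, stated in the notation above (this is Collins 2002, Theorem 1, reproduced here for completeness):

```latex
\text{If } \|U\| = 1,\quad
U \cdot \Phi(x_i, y_i) - U \cdot \Phi(x_i, z) \ge \delta
\;\; \forall i,\ \forall z \in \mathrm{GEN}(x_i) \setminus \{y_i\},
\quad \text{and} \quad
\|\Phi(x_i, y_i) - \Phi(x_i, z)\| \le R,
```
```latex
\text{then the number of training mistakes is at most } \frac{R^2}{\delta^2}.
```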

Page 9: Structured Perceptron

Global vs. Local Learning
• Global learning (IBT, inference-based training): constraints are used during training
• Local learning (L+I, learning plus inference): classifiers are trained without constraints; constraints are applied later to produce the global output
• Example: ILP-CRF model [Roth and Yih 2005]

Page 10: Structured Perceptron

Perceptron IBT
• This is structured perceptron!

Input: training examples (x_i, y_i), i = 1…n
Initialize parameter vector α = 0
For t = 1…max_iter:
  For i = 1…n:
    z_i = argmax_{z ∈ GEN(x_i)} F(x_i, z), where F(x, z) = Φ(x, z) · α
    If z_i ≠ y_i then update: α = α + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: parameter vector α

GEN(x_i) enumerates the possible label sequences for observed sequence x_i. F is the scoring function.

Page 11: Structured Perceptron

Perceptron I+L
• Decomposition: the global score decomposes into a sum of local, per-position scores
• Prediction: each local label is predicted independently by its local classifier
• If a local prediction is wrong, then update that local classifier
• Either learn a parameter vector for global features or do inference only at evaluation time (a sketch of the local training step follows below)
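For contrast with the IBT loop above, a minimal sketch of the L+I training step under these assumptions: each position is trained as an independent perceptron over local features only, and constraints/inference enter only at evaluation time (illustrative feature templates, not from the paper):

```python
from collections import defaultdict

TAGS = ["D", "N", "V", "A"]

def phi_local(words, i, tag):
    """Local features for position i only: the current word and the previous word."""
    feats = defaultdict(int)
    feats[("word", words[i], tag)] += 1
    feats[("prev_word", words[i - 1] if i > 0 else "<s>", tag)] += 1
    return feats

def train_local(data, max_iter=10):
    """L+I training sketch: per-position perceptron updates, no sequence-level inference."""
    weights = defaultdict(float)
    for _ in range(max_iter):
        for words, gold in data:
            for i, gold_tag in enumerate(gold):
                pred_tag = max(TAGS, key=lambda t: sum(
                    weights[f] * c for f, c in phi_local(words, i, t).items()))
                if pred_tag != gold_tag:  # local mistake-driven update
                    for f, c in phi_local(words, i, gold_tag).items():
                        weights[f] += c
                    for f, c in phi_local(words, i, pred_tag).items():
                        weights[f] -= c
    return weights
```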

Page 12: Structured Perceptron

ILP-CRF Introduction [Roth and Yih 2005]

• ILP-CRF model for Semantic Role Labeling as a sequence labeling problem

• Viterbi inference for CRFs can include constraints
  • Cannot handle long-range or general constraints
  • Viterbi is a shortest path problem that can be solved with ILP
• Use integer linear programming to express general constraints during inference
  • Allows incorporation of expressive constraints, including long-range constraints between distant tokens that cannot be handled by Viterbi

[Figure: Viterbi decoding as a shortest path from source node s to sink node t through a lattice with candidate labels A, B, C at each position.]
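A minimal sketch of that shortest-path view as an ILP, in generic notation (not necessarily the exact formulation of Roth and Yih 2005): binary edge variables z_{i,a,b} indicate that the path moves from label a at position i-1 to label b at position i.

```latex
\max_{z} \;\sum_{i,a,b} \theta_{i,a,b}\, z_{i,a,b}
\quad \text{s.t.} \quad
z_{i,a,b} \in \{0,1\}, \qquad
\sum_{a,b} z_{i,a,b} = 1 \;\;\forall i, \qquad
\sum_{a} z_{i,a,b} = \sum_{c} z_{i+1,b,c} \;\;\forall i, b
```

General constraints (for example, long-range constraints between distant tokens) are then added as extra linear inequalities over the same variables.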

Page 13: Structured Perceptron

ILP-CRF Models
• CRF trained with max log-likelihood
• CRF trained with voted perceptron
  • I+L
  • IBT
• Local training (L+I)
  • Perceptron, winnow, voted perceptron, voted winnow

Page 14: Structured Perceptron

ILP-CRF Results
[Results table comparing sequential models with local models under L+I and IBT; the scores themselves are not preserved in this transcript.]

Page 15: Structured Perceptron

ILP-CRF Conclusions
• Local learning models perform poorly on their own, but performance improves dramatically when constraints are added at evaluation
  • With constraints, performance is comparable to IBT methods
• The best models for global and local training show comparable results
• L+I vs. IBT: L+I requires fewer training examples, is more efficient, and outperforms IBT in most situations (unless the local problems are difficult to solve) [Punyakanok et al., IJCAI 2005]

Page 16: Structured Perceptron

Variations: Voted Perceptron
• For iteration t = 1,…,T
  • For example i = 1,…,n
    • Given parameters α^{t,i}, get the sequence labels for one example by Viterbi decoding:
      best_tags_i = argmax_{tags} Φ(words_i, tags) · α^{t,i}
• Each example defines a tagged sequence
• The voted perceptron takes the most frequently occurring output in the set {best_tags_1, …, best_tags_n}

Page 17: Structured Perceptron

Variations: Voted Perceptron
• Averaged algorithm (Collins '02): an approximation of the voted method. It uses the averaged parameters
  γ = (1 / nT) Σ_{t=1…T, i=1…n} α^{t,i}
  instead of the final parameters α^{T,n}
• Performance:
  • Higher F-measure, lower error rate
  • Greater stability (less variance in its scores)
• Variation: modified averaged algorithm for the latent perceptron
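A minimal sketch of the averaging computation, assuming generic phi/decode helpers like those in the earlier sketches (a real implementation would average lazily and decode with Viterbi):

```python
from collections import defaultdict

def train_averaged(data, phi, decode, max_iter=10):
    """Averaged perceptron: return the mean of all intermediate parameter vectors."""
    weights = defaultdict(float)   # current parameters, alpha^{t,i}
    totals = defaultdict(float)    # running sum of every intermediate alpha^{t,i}
    count = 0
    for _ in range(max_iter):
        for x, gold in data:
            pred = decode(weights, x)
            if list(pred) != list(gold):
                for f, c in phi(x, gold).items():
                    weights[f] += c
                for f, c in phi(x, pred).items():
                    weights[f] -= c
            # Accumulate the parameters after every example, not only after mistakes.
            for f, w in weights.items():
                totals[f] += w
            count += 1
    # gamma = (1 / nT) * sum over t, i of alpha^{t,i}
    return {f: s / count for f, s in totals.items()}
```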

Page 18: Structured Perceptron

Variations: Latent Structure Perceptron
• Model definition:
  y' = argmax_{y ∈ Y} max_{h ∈ H} w · Φ(x, h, y)
  where w is the parameter vector of the perceptron and Φ is the feature encoding function mapping (x, h, y) to a feature vector
• In the NER task, x is the word sequence, y is the named-entity type sequence, and h is the hidden latent variable sequence
• Features: unigram and bigram features over words, POS, and orthography (prefix, upper/lower case)
• Why latent variables?
  • Capture latent dependencies (i.e., hidden sub-structure)

Page 19: Structured Perceptron

Variations: Latent Structure Perceptron
• Purely latent structure perceptron (Connor's)
• Training (structured perceptron with margin):
  • C: margin
  • alpha: learning rate
• Variation: modified averaging-parameter method (Sun's): re-initialize the parameters with the averaged parameters every k iterations
• Advantage: reduces overfitting of the latent perceptron
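A minimal sketch of one margin-based update for the latent perceptron under these assumptions (decode_yh and decode_h are hypothetical helpers for the two latent argmax steps; C and alpha match the bullets above; this is not the authors' exact procedure):

```python
def latent_perceptron_update(w, x, y_gold, phi, decode_yh, decode_h, C=1.0, alpha=0.1):
    """One margin-based latent perceptron update (sketch).

    decode_yh(w, x)        -> (y*, h*): best label/latent pair under current weights
    decode_h(w, x, y_gold) -> h': best latent structure for the gold label
    """
    y_pred, h_pred = decode_yh(w, x)
    h_gold = decode_h(w, x, y_gold)
    score_gold = sum(w.get(f, 0.0) * c for f, c in phi(x, h_gold, y_gold).items())
    score_pred = sum(w.get(f, 0.0) * c for f, c in phi(x, h_pred, y_pred).items())
    # Update whenever the gold analysis fails to beat the prediction by margin C.
    if y_pred != y_gold or score_gold - score_pred < C:
        for f, c in phi(x, h_gold, y_gold).items():
            w[f] = w.get(f, 0.0) + alpha * c
        for f, c in phi(x, h_pred, y_pred).items():
            w[f] = w.get(f, 0.0) - alpha * c
    return w
```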

Page 20: Structured Perceptron

Variations: Latent Structure Perceptron
• Disadvantage of the purely latent perceptron: h* is found and then forgotten for each x
• Solution: Online Latent Classifier (Connor's)
• Two classifiers:
  • Latent classifier with parameters u
  • Label classifier with parameters w
• Joint prediction:
  (y*, h*) = argmax_{y ∈ Y, h ∈ H} [ w · Φ(x, h, y) + u · Φ(x, h) ]

Page 21: Structured Perceptron

Variations: Latent Structure Perceptron
• Online Latent Classifier training (Connor's)

Page 22: Structured Perceptron

Variations: Latent Structure Perceptron
• Experiments: Bio-NER with the purely latent perceptron
  • cc: feature cut-off
  • Odr: order of dependency
[Results figure comparing training time and F-measure for high-order models omitted.]

Page 23: Structured Perceptron

Variations: Latent Structure Perceptron
• Experiments: Semantic Role Labeling with argument/predicate as the latent structure
  • X: She likes yellow flowers (sentence)
  • Y: agent predicate ------ patient (roles)
  • H: predicate: only one; argument: at least one (latent structure)
• Optimization for (h*, y*): search all possible argument/predicate structures. For more complex data, other methods are needed.

On the test set:
[Results table omitted.]

Page 24: Structured Perceptron

Summary
• Structured Perceptron definition and motivation
• IBT vs. L+I
• Variations of Structured Perceptron

References:
• M. Collins. Discriminative Training for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. EMNLP 2002.
• X. Sun, T. Matsuzaki, D. Okanohara, and J. Tsujii. Latent Variable Perceptron Algorithm for Structured Classification. IJCAI 2009.
• D. Roth and W. Yih. Integer Linear Programming Inference for Conditional Random Fields. ICML 2005.
• M. Connor, C. Fisher, and D. Roth. Online Latent Structure Training for Language Acquisition. IJCAI 2011.