Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages. Dan Garrette, Jason Mielens, and Jason Baldridge. Proceedings of ACL 2013.


Page 1

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Dan Garrette, Jason Mielens, and Jason Baldridge

Proceedings of ACL 2013

Page 2

Semi-Supervised Training

HMM with Expectation-Maximization (EM)

Need:

Large raw corpus

Tag dictionary

[Kupiec, 1992] [Merialdo, 1994]
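As a rough illustration of how the two ingredients on this slide fit together, the sketch below runs EM (forward-backward) for a bigram HMM in which the tag dictionary limits the tags each word may take. It is a minimal toy under stated assumptions, not the authors' implementation: the names `train_hmm_em` and `normalize`, the boundary symbols, the uniform initialization, and the example sentences are all hypothetical, and probabilities are left unscaled, so it only suits short, non-empty sentences.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary symbols


def normalize(counts):
    """Turn expected counts keyed by (context, outcome) into probabilities."""
    totals = defaultdict(float)
    for (context, _), c in counts.items():
        totals[context] += c
    probs = defaultdict(float)  # pairs never licensed get probability 0
    for key, c in counts.items():
        probs[key] = c / totals[key[0]] if totals[key[0]] else 0.0
    return probs


def train_hmm_em(sentences, tag_dict, all_tags, iterations=10):
    def allowed(word):
        # Assumption: words missing from the dictionary may take any tag.
        return sorted(tag_dict.get(word, all_tags))

    trans = defaultdict(lambda: 1.0)  # (prev_tag, tag): uniform start
    emit = defaultdict(lambda: 1.0)   # (tag, word): uniform start
    for _ in range(iterations):
        t_count, e_count = defaultdict(float), defaultdict(float)
        for sent in sentences:
            cands = [allowed(w) for w in sent]
            n = len(sent)
            # Forward probabilities over dictionary-licensed tags only.
            alpha = [{t: trans[START, t] * emit[t, sent[0]] for t in cands[0]}]
            for i in range(1, n):
                alpha.append({
                    t: emit[t, sent[i]]
                    * sum(alpha[i - 1][p] * trans[p, t] for p in cands[i - 1])
                    for t in cands[i]
                })
            # Backward probabilities.
            beta = [None] * n
            beta[-1] = {t: trans[t, END] for t in cands[-1]}
            for i in range(n - 2, -1, -1):
                beta[i] = {
                    t: sum(trans[t, q] * emit[q, sent[i + 1]] * beta[i + 1][q]
                           for q in cands[i + 1])
                    for t in cands[i]
                }
            z = sum(alpha[-1][t] * beta[-1][t] for t in cands[-1])
            # E-step: accumulate expected emission and transition counts.
            for i in range(n):
                for t in cands[i]:
                    e_count[t, sent[i]] += alpha[i][t] * beta[i][t] / z
            for t in cands[0]:
                t_count[START, t] += alpha[0][t] * beta[0][t] / z
            for t in cands[-1]:
                t_count[t, END] += alpha[-1][t] * beta[-1][t] / z
            for i in range(n - 1):
                for p in cands[i]:
                    for q in cands[i + 1]:
                        t_count[p, q] += (alpha[i][p] * trans[p, q]
                                          * emit[q, sent[i + 1]]
                                          * beta[i + 1][q] / z)
        # M-step: renormalize the expected counts.
        trans, emit = normalize(t_count), normalize(e_count)
    return trans, emit


# Toy raw corpus and tag dictionary (illustrative only).
sentences = [["the", "dog", "walks"], ["the", "thug", "walks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN"}, "walks": {"VBZ", "NNS"}}
trans, emit = train_hmm_em(sentences, tag_dict, {"DT", "NN", "NNS", "VBZ"})
```

Because "the" is listed only as DT, no probability mass is ever assigned to other tags for it. The unlisted word "thug" stays fully ambiguous, which is why a sparse dictionary leaves EM with little to go on.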

Page 3

Previous Works

Supervised Learning

• Provides high accuracy for POS tagging (Manning, 2011).

• Performs poorly when little supervision is available.

Semi-Supervised

• Done by training sequence models such as HMMs with the EM algorithm.

• Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

Page 4

Previous Works

Goldberg et al. (2008)

• Manually constructed a lexicon for Hebrew to train an HMM tagger.

• The lexicon was developed over a long period of time by expert lexicographers.

Täckström et al. (2013)

• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.

• Large parallel corpora are required.

Page 5

Low-Resource Languages

6,900 languages in the world

~30 have non-negligible quantities of data

No million-word corpus for any endangered language

[Maxwell and Hughes, 2006] [Abney and Bird, 2010]

Page 6

Low-Resource Languages

Kinyarwanda (KIN): Niger-Congo. Morphologically rich.

Malagasy (MLG): Austronesian. Spoken in Madagascar.

Also, English

Page 7

Collecting Annotations

• Supervised training is not an option.

• Semi-supervised training:

• Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:

• Type supervision.

• Token supervision.
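To make the two kinds of supervision concrete, here is a small sketch of what each annotation might look like as data and how both can feed a single tag dictionary. The data layout and the helper name `build_tag_dict` are assumptions for illustration; the example words and tags are the ones used on later slides.

```python
from collections import defaultdict

# Type supervision: the annotator lists the tags a word type can take.
type_annotations = {"the": {"DT"}, "dog": {"NN"}}

# Token supervision: the annotator tags words in running text.
token_annotations = [[("the", "DT"), ("dog", "NN"), ("walks", "VBZ")]]


def build_tag_dict(type_annotations, token_annotations):
    """Merge both annotation kinds into one word -> set-of-tags dictionary."""
    tag_dict = defaultdict(set)
    for word, tags in type_annotations.items():
        tag_dict[word] |= set(tags)
    for sentence in token_annotations:
        for word, tag in sentence:
            tag_dict[word].add(tag)
    return dict(tag_dict)


tag_dict = build_tag_dict(type_annotations, token_annotations)
```

Token annotations also record tags in context, which a flat dictionary like this one discards.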

Page 8

Tag Dict Generalization

These annotations are too sparse!

Generalize to the entire vocabulary

Page 9

Tag Dict Generalization

Haghighi and Klein (2006) do this with a vector space.

We don’t have enough raw data.

Das and Petrov (2011) do this with a parallel corpus.

We don’t have a parallel corpus.

Page 10

Tag Dict Generalization

Strategy: Label Propagation

• Connect annotations to raw corpus tokens

• Push tag labels to entire corpus

[Talukdar and Crammer, 2009]
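The idea of pushing labels through a graph can be sketched with a much simpler scheme than the cited algorithm. The code below is plain iterative neighbour averaging with clamped seeds, offered only as an illustration of the concept. It is not the Talukdar and Crammer (2009) method, and the function name `propagate`, the edge list, and the seed labels are assumptions.

```python
def propagate(edges, seeds, tags, iterations=20):
    """Spread seed tag distributions to unlabelled nodes by neighbour averaging."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Seed nodes start with their labels; all other nodes start empty.
    dist = {node: dict(seeds.get(node, {})) for node in neighbors}
    for _ in range(iterations):
        new = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                new[node] = dict(seeds[node])  # seeds stay clamped
                continue
            summed = {t: sum(dist[m].get(t, 0.0) for m in nbrs) for t in tags}
            total = sum(summed.values())
            new[node] = ({t: v / total for t, v in summed.items() if v > 0}
                         if total else {})
        dist = new
    return dist


# Toy graph: annotated types connect to corpus tokens, tokens share contexts.
edges = [
    ("TYPE_the", "TOK_the_1"), ("TYPE_the", "TOK_the_4"),
    ("TYPE_dog", "TOK_dog_2"), ("TOK_dog_2", "PREV_the"),
    ("TOK_thug_5", "PREV_the"), ("TOK_thug_5", "TYPE_thug"),
]
seeds = {"TYPE_the": {"DT": 1.0}, "TYPE_dog": {"NN": 1.0}}
dist = propagate(edges, seeds, ["DT", "NN"])
```

In this toy graph the unannotated word "thug" picks up the NN label because it shares the PREV_the context with "dog".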

Page 11

Morphological Transducers

• Finite-state transducers (FSTs) are used for morphological analysis.

• An FST accepts a word type and produces a set of morphological features.

• Power of FSTs:

• Analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.
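The input/output behaviour described above (word type in, morphological features out, with stem guessing for unknown words) can be mimicked in a few lines. This is a plain-Python stand-in, not a finite-state transducer and not the analyzers used in the work; the suffix table, feature names, and the function `analyze` are hypothetical, and English is used only to keep the toy readable.

```python
# Hypothetical suffix table: suffix -> morphological feature label.
KNOWN_SUFFIXES = {"s": "+S", "ed": "+ED", "ing": "+ING"}


def analyze(word, lexicon):
    """Return a set of (stem, features) analyses for a word type."""
    analyses = set()
    if word in lexicon:
        analyses.add((word, "+BARE"))
    for suffix, feature in KNOWN_SUFFIXES.items():
        # Require a stem of at least two characters.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            # Unknown stems are still analyzed, but flagged as guesses.
            label = feature if stem in lexicon else feature + "+GUESS"
            analyses.add((stem, label))
    return analyses


known = analyze("walks", {"walk"})
unknown = analyze("blorked", set())
```

The second call shows the point made on the slide: "blorked" is out of vocabulary, yet the known suffix still yields features that can help decide its tag.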

Page 12

Tag Dict Generalization

[Figure: label propagation graph. Token nodes: TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5, TOK_the_9. Type nodes: TYPE_the, TYPE_dog, TYPE_thug. Context nodes: PREV_<b>, PREV_the, NEXT_thug, NEXT_walks. Affix nodes: PRE1_t, PRE2_th, PRE1_d, PRE2_do, SUF1_e, SUF1_g.]
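The node names on this slide suggest how the graph is assembled from raw text. The sketch below builds such a graph as a set of edges. The wiring (tokens to their type and context nodes, types to affix nodes), the use of `<b>` at both sentence boundaries, the affix length limit, and the function name `build_graph` are assumptions made for illustration.

```python
def build_graph(sentences, max_affix=2):
    """Return edges linking token, type, context, and affix nodes."""
    edges = set()
    position = 0  # running token index across the whole corpus
    for sent in sentences:
        for i, word in enumerate(sent):
            position += 1
            tok = f"TOK_{word}_{position}"
            prev_word = sent[i - 1] if i > 0 else "<b>"
            next_word = sent[i + 1] if i + 1 < len(sent) else "<b>"
            # Each token links to its word type and its bigram contexts.
            edges.add((tok, f"TYPE_{word}"))
            edges.add((tok, f"PREV_{prev_word}"))
            edges.add((tok, f"NEXT_{next_word}"))
            # Each type links to its prefixes and suffixes.
            for k in range(1, min(max_affix, len(word)) + 1):
                edges.add((f"TYPE_{word}", f"PRE{k}_{word[:k]}"))
                edges.add((f"TYPE_{word}", f"SUF{k}_{word[-k:]}"))
    return edges


edges = build_graph([["the", "dog", "walks"], ["the", "thug", "walks"]])
```

With these two toy sentences the token numbering reproduces several nodes from the slide, such as TOK_the_1, TOK_dog_2, TOK_the_4, and TOK_thug_5.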

Page 13

Tag Dict Generalization

Type Annotations: the = DT, dog = NN

[Figure: the same graph of TYPE_, TOK_, PREV_, NEXT_, PRE, and SUF nodes, with the type annotations shown beside it.]

Page 14

Tag Dict Generalization

Type Annotations

[Figure: the type annotations become seed labels in the graph. TYPE_the is labelled DT and TYPE_dog is labelled NN.]

Page 15

Tag Dict Generalization

Type Annotations

Token Annotations: the/DT dog/NN walks/VBZ

[Figure: the same graph, with the token annotations shown beside it.]

Page 16

Tag Dict Generalization

Type Annotations

Token Annotations

[Figure: the token annotations become seed labels in the graph. TOK_the_4 is labelled DT and TOK_dog_2 is labelled NN.]

Page 17

Model Minimization

[Ravi et al., 2010; Garrette and Baldridge, 2012]

• The LP graph has a node for each corpus token.

• Each node is labelled with a distribution over POS tags.

• The graph provides a corpus of sentences labelled with noisy tag distributions.

• Greedily seek the minimal set of tag bigrams that describe the raw corpus.

• Then train an HMM with EM.
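The greedy search for a small set of tag bigrams can be pictured as a set-cover loop. The sketch below is a simplified, unweighted version: it ignores the noisy tag distributions and any later path-completion step, so it should be read as an illustration of the greedy idea rather than the cited algorithms. The function name `greedy_bigram_cover` and the toy data are assumptions.

```python
def greedy_bigram_cover(sentences, tag_dict, all_tags):
    """Greedily pick tag bigrams until every adjacent word pair is explained."""
    def allowed(word):
        # Assumption: words without an entry may take any tag.
        return tag_dict.get(word, all_tags)

    def covers(bigram, s, i):
        sent = sentences[s]
        return bigram[0] in allowed(sent[i]) and bigram[1] in allowed(sent[i + 1])

    # One item to cover per adjacent word pair, identified by (sentence, index).
    uncovered = {(s, i) for s, sent in enumerate(sentences)
                 for i in range(len(sent) - 1)}
    candidates = {(a, b) for a in all_tags for b in all_tags}
    chosen = []
    while uncovered:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(candidates),
                   key=lambda bg: sum(covers(bg, s, i) for s, i in uncovered))
        newly = {(s, i) for s, i in uncovered if covers(best, s, i)}
        if not newly:
            break  # nothing left can be covered
        chosen.append(best)
        uncovered -= newly
        candidates.discard(best)
    return chosen


sentences = [["the", "dog", "walks"], ["the", "thug", "walks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN"}, "walks": {"VBZ", "NNS"}}
chosen = greedy_bigram_cover(sentences, tag_dict, {"DT", "NN", "NNS", "VBZ"})
```

The chosen bigrams can then restrict which transitions the HMM may use before EM training. An unweighted cover cannot tell equally frequent alternatives apart, which is one reason the tag distributions from the LP graph are useful.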

Page 18

Overall Accuracy

[Figure: bar chart of tagging accuracy (y-axis: Accuracy, 0% to 100%) for three settings: KIN using all types; MLG using half types and half tokens; ENG using all types and maximal amount of data.]

All of these values were achieved using both FST and affix LP features.

Page 19

Results

Page 20

Types versus Tokens

Page 21

Mixing Type and Token Annotations

Page 22

Morphological Analysis

Page 23

Annotator Experience

Page 24

Conclusion

• Type annotations are the most useful input from a linguist.

• We can train effective POS-taggers on low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native linguist.