Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Dan Garrette, Jason Mielens, and Jason Baldridge

Proceedings of ACL 2013

Semi-Supervised Training

HMM with Expectation-Maximization (EM)

Need:

Large raw corpus

Tag dictionary

[Kupiec, 1992][Merialdo, 1994]
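
The recipe on this slide (an HMM trained with EM from a raw corpus plus a tag dictionary) can be illustrated with a small sketch. This is not the authors' implementation: the function names `train_hmm_em` and `normalise`, the uniform initialisation, and the toy corpus are all assumptions made for illustration, and the forward-backward pass is unscaled, so it is only suitable for short sentences.

```python
from collections import defaultdict


def normalise(counts, keys, fallback):
    """Turn expected counts into probabilities; keep `fallback` if nothing was counted."""
    total = sum(counts[k] for k in keys)
    return {k: counts[k] / total for k in keys} if total > 0 else dict(fallback)


def train_hmm_em(corpus, tag_dict, tags, iterations=5):
    """corpus: list of word lists; tag_dict: {word: set of allowed tags} (hypothetical API)."""
    vocab = sorted({w for sent in corpus for w in sent})

    def allowed(word):
        # Assumption: words missing from the tag dictionary may take any tag.
        return tag_dict.get(word, set(tags))

    # Uniform initialisation; the tag dictionary zeroes out forbidden emissions.
    start = {t: 1.0 / len(tags) for t in tags}
    trans = {a: {b: 1.0 / len(tags) for b in tags} for a in tags}
    emit = {}
    for t in tags:
        words = [w for w in vocab if t in allowed(w)]
        emit[t] = {w: (1.0 / len(words) if w in words else 0.0) for w in vocab}

    for _ in range(iterations):
        c_start = defaultdict(float)
        c_trans = {a: defaultdict(float) for a in tags}
        c_emit = {t: defaultdict(float) for t in tags}
        for sent in corpus:
            n = len(sent)
            if n == 0:
                continue
            # E-step, forward pass.
            alpha = [{t: start[t] * emit[t][sent[0]] for t in tags}]
            for i in range(1, n):
                alpha.append({
                    b: emit[b][sent[i]] * sum(alpha[i - 1][a] * trans[a][b] for a in tags)
                    for b in tags
                })
            # E-step, backward pass.
            beta = [None] * n
            beta[n - 1] = {t: 1.0 for t in tags}
            for i in range(n - 2, -1, -1):
                beta[i] = {
                    a: sum(trans[a][b] * emit[b][sent[i + 1]] * beta[i + 1][b] for b in tags)
                    for a in tags
                }
            z = sum(alpha[n - 1].values())  # sentence likelihood
            if z == 0.0:
                continue
            # Accumulate expected counts.
            for i in range(n):
                for t in tags:
                    gamma = alpha[i][t] * beta[i][t] / z
                    c_emit[t][sent[i]] += gamma
                    if i == 0:
                        c_start[t] += gamma
            for i in range(n - 1):
                for a in tags:
                    for b in tags:
                        c_trans[a][b] += (alpha[i][a] * trans[a][b]
                                          * emit[b][sent[i + 1]] * beta[i + 1][b] / z)
        # M-step: renormalise the expected counts.
        start = normalise(c_start, tags, start)
        trans = {a: normalise(c_trans[a], tags, trans[a]) for a in tags}
        emit = {t: normalise(c_emit[t], vocab, emit[t]) for t in tags}
    return start, trans, emit


# Toy data (made up): "cat" is not in the tag dictionary.
tags = ["DT", "NN"]
tag_dict = {"the": {"DT"}, "dog": {"NN"}}
corpus = [["the", "dog"], ["the", "cat"]]
start, trans, emit = train_hmm_em(corpus, tag_dict, tags)
```

In this toy run the dictionary entry for "dog" teaches the model that NN tends to follow DT, which in turn pulls the unlisted word "cat" towards NN.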

Previous Works

Supervised Learning

• Provides high accuracy for POS tagging (Manning, 2011).

• Performs poorly when little supervision is available.

Semi-Supervised Learning

• Done by training sequence models such as HMMs with the EM algorithm.

• Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

Previous Works

Goldberg et al. (2008)

• Manually constructed a lexicon for Hebrew to train an HMM tagger.

• The lexicon was developed over a long period of time by expert lexicographers.

Täckström et al. (2013)

• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.

• Large parallel corpora are required.

Low-Resource Languages

6,900 languages in the world

~30 have non-negligible quantities of data

No million-word corpus for any endangered language

[Maxwell and Hughes, 2006][Abney and Bird, 2010]

Low-Resource Languages

Kinyarwanda (KIN): Niger-Congo. Morphologically rich.

Malagasy (MLG): Austronesian. Spoken in Madagascar.

Also, English

Collecting Annotations

• Supervised training is not an option.

• Semi-supervised training:

• Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks.

• Type supervision.

• Token supervision.

Tag Dict Generalization

These annotations are too sparse!

Generalize to the entire vocabulary

Tag Dict Generalization

Haghighi and Klein (2006) do this with a vector space.

We don’t have enough raw data

Das and Petrov (2011) do this with a parallel corpus.

We don’t have a parallel corpus

Tag Dict Generalization

Strategy: Label Propagation

• Connect annotations to raw corpus tokens

• Push tag labels to entire corpus

[Talukdar and Crammer, 2009]
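
The two steps above (connect annotations to raw tokens, then push tag labels through the graph) can be sketched with a deliberately simple propagation rule. This is an assumption-laden illustration, not the algorithm of the cited work: each unlabelled node just takes the normalised sum of its neighbours' tag distributions, and seed nodes stay clamped. The helper names `undirected` and `propagate` and the toy edges are hypothetical.

```python
def undirected(edges):
    """Build {node: set(neighbours)} from a list of (a, b) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def propagate(graph, seeds, iterations=10):
    """graph: {node: neighbours}; seeds: {node: {tag: prob}} from annotations."""
    labels = {node: dict(seeds.get(node, {})) for node in graph}
    for _ in range(iterations):
        updated = {}
        for node, neighbours in graph.items():
            if node in seeds:
                updated[node] = dict(seeds[node])  # annotated nodes stay fixed
                continue
            totals = {}
            for nb in neighbours:
                for tag, p in labels[nb].items():
                    totals[tag] = totals.get(tag, 0.0) + p
            z = sum(totals.values())
            # Nodes that no label has reached yet stay empty.
            updated[node] = {t: p / z for t, p in totals.items()} if z else {}
        labels = updated
    return labels


# Toy graph (made up): type, token, context, and suffix nodes.
graph = undirected([
    ("TYPE_the", "TOK_the_1"), ("TYPE_the", "TOK_the_4"),
    ("TYPE_dog", "TOK_dog_2"), ("TYPE_thug", "TOK_thug_5"),
    ("TOK_dog_2", "PREV_the"), ("TOK_thug_5", "PREV_the"),
    ("TYPE_dog", "SUF1_g"), ("TYPE_thug", "SUF1_g"),
])
seeds = {"TYPE_the": {"DT": 1.0}, "TYPE_dog": {"NN": 1.0}}
labels = propagate(graph, seeds)
```

Here the unannotated word "thug" picks up NN because it shares a suffix node and a preceding-word node with "dog", which is the kind of generalization the slide describes.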

Morphological Transducers

• Finite-state transducers (FSTs) are used for morphological analysis.

• An FST accepts a word type and produces a set of morphological features.

• Power of FSTs: they can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.

Tag Dict Generalization

[Figure: label propagation graph, built up over several slides. Token nodes: TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5, TOK_the_9. Type nodes: TYPE_the, TYPE_dog, TYPE_thug. Context nodes: PREV_<b>, PREV_the, NEXT_thug, NEXT_walks. Affix nodes: PRE1_t, PRE2_th, PRE1_d, PRE2_do, SUF1_e, SUF1_g.]

• Type annotations (the: DT, dog: NN) label the type nodes TYPE_the and TYPE_dog.

• Token annotations (the/DT dog/NN walks/VBZ) label the matching token nodes (shown on TOK_the_4 and TOK_dog_2).
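
The node names in the graph (TOK_, TYPE_, PREV_, NEXT_, PRE1_, PRE2_, SUF1_) suggest how such a graph could be assembled from raw sentences. The sketch below is a guess at that construction, not the authors' code: the function name `build_lp_graph`, the `max_affix` parameter, and the choice of which nodes are linked (tokens to their type and context nodes, types to their affix nodes) are assumptions.

```python
from collections import defaultdict


def build_lp_graph(sentences, max_affix=2):
    """Return an undirected graph as {node: set(neighbours)}."""
    graph = defaultdict(set)

    def connect(a, b):
        graph[a].add(b)
        graph[b].add(a)

    position = 0  # running token index across the whole corpus
    for sentence in sentences:
        padded = ["<b>"] + list(sentence) + ["<b>"]  # <b> marks a sentence boundary
        for i in range(1, len(padded) - 1):
            position += 1
            word = padded[i]
            tok = f"TOK_{word}_{position}"
            typ = f"TYPE_{word}"
            connect(tok, typ)                         # token to its word type
            connect(tok, f"PREV_{padded[i - 1]}")     # token to preceding word
            connect(tok, f"NEXT_{padded[i + 1]}")     # token to following word
            for n in range(1, min(max_affix, len(word)) + 1):
                connect(typ, f"PRE{n}_{word[:n]}")    # type to its prefixes
                connect(typ, f"SUF{n}_{word[-n:]}")   # type to its suffixes
    return graph


# Made-up two-sentence corpus that reproduces some node names from the slide.
graph = build_lp_graph([["the", "dog", "walks"], ["the", "thug", "walks"]])
```

Shared feature nodes such as SUF1_g or PREV_the are what let a label on one word reach tokens of a word that was never annotated.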

Model Minimization

[Ravi et al., 2010; Garrette and Baldridge, 2012]

• The LP graph has a node for each corpus token.

• Each node is labelled with a distribution over POS tags.

• The graph thus provides a corpus of sentences labelled with noisy tag distributions.

• Greedily seek the minimal set of tag bigrams that describes the raw corpus.

• Then train an HMM with EM.
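
The greedy search for a small set of tag bigrams can be pictured as a set-cover loop. The sketch below is a simplification under stated assumptions, not the procedure of the cited papers: it treats every adjacent pair of tokens as something to be "explained", repeatedly picks the tag bigram that explains the most still-unexplained pairs, and does not check that the chosen bigrams form complete tag paths through each sentence. The function name `greedy_tag_bigrams` and the toy candidate tags are hypothetical.

```python
def greedy_tag_bigrams(sentences):
    """sentences: list of sentences, each a list of candidate-tag sets (one per token)."""
    uncovered = set()   # (sentence index, position) pairs not yet explained
    candidates = {}     # tag bigram -> positions it could explain
    for s, sent in enumerate(sentences):
        for i in range(len(sent) - 1):
            uncovered.add((s, i))
            for a in sent[i]:
                for b in sent[i + 1]:
                    candidates.setdefault((a, b), set()).add((s, i))

    chosen = []
    while uncovered:
        # Sorting first makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=lambda bg: len(candidates[bg] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # remaining positions cannot be explained (e.g. empty tag sets)
        chosen.append(best)
        uncovered -= gain
    return chosen


# Made-up candidate tags for "the dog walks" and "the thug walks".
sentences = [
    [{"DT"}, {"NN", "VB"}, {"VBZ"}],
    [{"DT"}, {"NN"}, {"VBZ", "NNS"}],
]
chosen = greedy_tag_bigrams(sentences)
```

On this toy input two bigrams, DT NN and NN VBZ, are enough to explain every adjacent pair, so the noisier candidates VB and NNS are never needed.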

Overall Accuracy

[Figure: bar chart of accuracy (0% to 100%) for three settings: KIN using all types; MLG using half types and half tokens; ENG using all types and the maximal amount of data.]

All of these values were achieved using both FST and affix LP features.

Results

Types versus Tokens

Mixing Type and Token Annotations

Morphological Analysis

Annotator Experience

Conclusion

• Type annotations are the most useful input from a linguist.

• We can train effective POS-taggers on low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native linguist.
