Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason...

Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages

Dan Garrette, Jason Mielens, and Jason Baldridge

Proceedings of ACL 2013

Semi-Supervised Training

HMM with Expectation-Maximization (EM)

Large raw corpus

Tag dictionary

[Kupiec, 1992][Merialdo, 1994]
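As a rough illustration of the setup on this slide, the sketch below trains a bigram HMM with EM on raw sentences while restricting each word to the tags its tag dictionary entry allows. It is a minimal sketch and not the authors' implementation: the function names (`train_hmm_em`, `prob`, `normalize`), the uniform initialisation, the absence of smoothing, and the rule that words missing from the dictionary may take any tag are all assumptions made for the example.

```python
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical boundary symbols

def prob(table, context, event):
    """Look up P(event | context), treating unseen events as probability 0."""
    return table.get(context, {}).get(event, 0.0)

def normalize(counts):
    """Convert nested counts {context: {event: count}} into probabilities."""
    table = {}
    for context, row in counts.items():
        total = sum(row.values())
        if total > 0.0:
            table[context] = {event: c / total for event, c in row.items()}
    return table

def train_hmm_em(sentences, tag_dict, tagset, iterations=5):
    def allowed(word):
        # Assumption: words missing from the tag dictionary may take any tag.
        return sorted(tag_dict.get(word, tagset))

    # Initialisation: equal counts for every event the tag dictionary permits.
    trans_c = defaultdict(lambda: defaultdict(float))
    emit_c = defaultdict(lambda: defaultdict(float))
    for sent in sentences:
        prev = [START]
        for word in sent:
            for t in allowed(word):
                emit_c[t][word] = 1.0
                for p in prev:
                    trans_c[p][t] = 1.0
            prev = allowed(word)
        for p in prev:
            trans_c[p][END] = 1.0
    trans, emit = normalize(trans_c), normalize(emit_c)

    for _ in range(iterations):
        trans_c = defaultdict(lambda: defaultdict(float))
        emit_c = defaultdict(lambda: defaultdict(float))
        for sent in sentences:
            if not sent:
                continue
            n = len(sent)
            cands = [allowed(w) for w in sent]
            # Forward probabilities over the allowed tags only.
            fwd = [{} for _ in sent]
            for t in cands[0]:
                fwd[0][t] = prob(trans, START, t) * prob(emit, t, sent[0])
            for i in range(1, n):
                for t in cands[i]:
                    total = sum(fwd[i - 1][p] * prob(trans, p, t) for p in cands[i - 1])
                    fwd[i][t] = total * prob(emit, t, sent[i])
            # Backward probabilities.
            bwd = [{} for _ in sent]
            for t in cands[-1]:
                bwd[-1][t] = prob(trans, t, END)
            for i in range(n - 2, -1, -1):
                for t in cands[i]:
                    bwd[i][t] = sum(
                        prob(trans, t, q) * prob(emit, q, sent[i + 1]) * bwd[i + 1][q]
                        for q in cands[i + 1]
                    )
            z = sum(fwd[-1][t] * bwd[-1][t] for t in cands[-1])
            if z == 0.0:
                continue
            # E-step: accumulate expected emission and transition counts.
            for i in range(n):
                for t in cands[i]:
                    gamma = fwd[i][t] * bwd[i][t] / z
                    emit_c[t][sent[i]] += gamma
                    if i == 0:
                        trans_c[START][t] += gamma
                    if i == n - 1:
                        trans_c[t][END] += gamma
                    else:
                        for q in cands[i + 1]:
                            trans_c[t][q] += (
                                fwd[i][t] * prob(trans, t, q)
                                * prob(emit, q, sent[i + 1]) * bwd[i + 1][q] / z
                            )
        # M-step: re-estimate the parameters from the expected counts.
        trans, emit = normalize(trans_c), normalize(emit_c)
    return trans, emit

# Toy data: "thug" is absent from the tag dictionary, so any tag is allowed for it.
sentences = [["the", "dog", "walks"], ["the", "thug", "walks"]]
tag_dict = {"the": {"DT"}, "dog": {"NN"}, "walks": {"VBZ", "NNS"}}
trans, emit = train_hmm_em(sentences, tag_dict, {"DT", "NN", "NNS", "VBZ"})
```

The tag dictionary acts purely as a constraint here: a (tag, word) pair it rules out never receives any expected count, so its emission probability stays at zero throughout training.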

Previous Works: Supervised Learning

Provides high accuracy for POS tagging (Manning, 2011).

Performs poorly when little supervision is available.

Previous Works: Semi-Supervised Learning

Done by training sequence models such as HMMs using the EM algorithm.

Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

Previous Works

Goldberg et al. (2008)

• Manually constructed lexicon for Hebrew to train an HMM tagger.

• Lexicon was developed over a long period of time by expert lexicographers.

Täckström et al. (2013)

• Evaluated the use of mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.

• Large parallel corpora required.

6,900 languages in the world

~30 have non-negligible quantities of data

No million-word corpus for any endangered language

[Maxwell and Hughes, 2006][Abney and Bird, 2010]

Kinyarwanda (KIN): Niger-Congo. Morphologically rich.

Malagasy (MLG): Austronesian. Spoken in Madagascar.

Also, English

Collecting Annotations

• Supervised training is not an option.

• Semi-supervised training:

• Annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:

• Type supervision.

• Token supervision.

Tag Dict Generalization

These annotations are too sparse!

Generalize to the entire vocabulary

Haghighi and Klein (2006) do this with a vector space.

We don’t have enough raw data

Das and Petrov (2011) do this with a parallel corpus.

We don’t have a parallel corpus

Strategy: Label Propagation

• Connect annotations to raw corpus tokens

• Push tag labels to entire corpus

[Talukdar and Crammer, 2009]
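The idea of pushing tag labels from annotated nodes to the rest of a graph can be sketched with a very simple propagation scheme: seed nodes keep their label, and every other node repeatedly takes the average of its neighbours' tag distributions. This is a deliberately simplified stand-in, not the algorithm of Talukdar and Crammer (2009); the function name `propagate_labels`, the unweighted edges, and the fixed iteration count are assumptions made for the example. The node names follow the naming used in the deck's graph figures.

```python
def propagate_labels(edges, seeds, tags, iterations=20):
    """Spread seed tags over an undirected graph by repeated neighbour averaging."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    for node in seeds:
        neighbours.setdefault(node, set())

    # Every node starts with no tag mass; seed nodes get all mass on their tag.
    dist = {node: {t: 0.0 for t in tags} for node in neighbours}
    for node, tag in seeds.items():
        dist[node][tag] = 1.0

    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbours.items():
            if node in seeds or not nbrs:
                updated[node] = dist[node]  # seeds stay clamped to their annotation
            else:
                updated[node] = {
                    t: sum(dist[n][t] for n in nbrs) / len(nbrs) for t in tags
                }
        dist = updated
    return dist

# Toy graph: "thug" shares a prefix with "the" but a suffix and a left context with "dog".
edges = [
    ("TOK_the_1", "TYPE_the"), ("TOK_the_4", "TYPE_the"),
    ("TOK_dog_2", "TYPE_dog"), ("TOK_thug_5", "TYPE_thug"),
    ("TYPE_the", "PRE1_t"), ("TYPE_thug", "PRE1_t"),
    ("TYPE_thug", "SUF1_g"), ("TYPE_dog", "SUF1_g"),
    ("TYPE_thug", "PREV_the"), ("TYPE_dog", "PREV_the"),
]
seeds = {"TYPE_the": "DT", "TYPE_dog": "NN"}
dist = propagate_labels(edges, seeds, ["DT", "NN"])
```

In this toy graph the unannotated type `thug` ends up with more NN mass than DT mass, because it shares two feature nodes with the annotated noun `dog` and only one with the determiner `the`.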

Morphological Transducers

• Finite-state transducers (FSTs) are used for morphological analysis.

• An FST accepts a word type and produces a set of morphological features.

• Power of FSTs: they can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.

Tag Dict Generalization

[Figure: label propagation graph for example tokens. Token nodes: TOK_the_1, TOK_dog_2, TOK_the_4, TOK_thug_5, TOK_the_9. Type nodes: TYPE_the, TYPE_thug, TYPE_dog. Context feature nodes: PREV_, PREV_the, NEXT_thug, NEXT_walks. Affix feature nodes: PRE1_t, PRE2_th, PRE1_d, PRE2_do, SUF1_e, SUF1_g.]

Tag Dict Generalization: Type Annotations

[Figure: the same graph seeded with type annotations. The annotations the: DT and dog: NN label the nodes TYPE_the and TYPE_dog.]

Tag Dict Generalization: Token Annotations

[Figure: the same graph seeded with token annotations. The annotated sentence "the dog walks" (DT NN VBZ) labels the token nodes TOK_the_4 and TOK_dog_2 with DT and NN.]
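The node names in these figures suggest how such a graph could be assembled from a raw corpus. The sketch below builds edges between token nodes, type nodes, and context and affix feature nodes using the same naming scheme (TOK_, TYPE_, PREV_, NEXT_, PREn_, SUFn_). The edge layout (tokens attach to their type, types attach to features), the function name `build_lp_edges`, and the `max_affix` parameter are assumptions made for the example, and FST-derived features are left out.

```python
def build_lp_edges(sentences, max_affix=2):
    """Return a set of (node, node) edges for a label propagation graph."""
    edges = set()
    position = 0  # running 1-based token index over the whole corpus
    for sent in sentences:
        for i, word in enumerate(sent):
            position += 1
            type_node = f"TYPE_{word}"
            # Each token occurrence attaches to the node for its word type.
            edges.add((f"TOK_{word}_{position}", type_node))
            # Context features; an empty name marks a sentence boundary.
            prev_word = sent[i - 1] if i > 0 else ""
            next_word = sent[i + 1] if i + 1 < len(sent) else ""
            edges.add((type_node, f"PREV_{prev_word}"))
            edges.add((type_node, f"NEXT_{next_word}"))
            # Prefix and suffix features of length 1..max_affix.
            for n in range(1, min(max_affix, len(word)) + 1):
                edges.add((type_node, f"PRE{n}_{word[:n]}"))
                edges.add((type_node, f"SUF{n}_{word[-n:]}"))
    return edges

edges = build_lp_edges([["the", "dog", "walks"], ["the", "thug"]])
```

With this toy corpus the token indices line up with the figure: `the` occurs at positions 1 and 4, `dog` at 2, and `thug` at 5, and both `dog` and `thug` attach to the shared context node `PREV_the`.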

Model Minimization

[Ravi et al., 2010; Garrette and Baldridge, 2012]

• LP graph has a node for each corpus token.

• Each node is labelled with a distribution over POS tags.

• Graph provides a corpus of sentences labelled with noisy tag distributions.

• Greedily seek the minimal set of tag bigrams that describe the raw corpus.

• Then train an HMM with EM.
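The greedy search for a small set of tag bigrams can be illustrated as a set-cover problem: every adjacent pair of tokens must be explained by at least one chosen tag bigram, and at each step the bigram that explains the most still-unexplained pairs is added. This is only a sketch of that one idea, not the procedure of Ravi et al. (2010) or Garrette and Baldridge (2012). It assumes each token comes with a set of candidate tags (for instance, the tags with non-negligible mass after label propagation), it ignores the tag weights, and it does not check that the chosen bigrams form a consistent tag path through each sentence. The function name `greedy_min_tag_bigrams` is hypothetical.

```python
def greedy_min_tag_bigrams(sentences):
    """sentences: list of sentences, each a list of candidate-tag sets per token."""
    # Each adjacent token pair must be covered by one of its possible tag bigrams.
    uncovered = {}
    for s, sent in enumerate(sentences):
        for i in range(len(sent) - 1):
            options = {(a, b) for a in sent[i] for b in sent[i + 1]}
            if options:
                uncovered[(s, i)] = options

    chosen = []
    while uncovered:
        # Count how many uncovered pairs each tag bigram would explain.
        counts = {}
        for options in uncovered.values():
            for bigram in options:
                counts[bigram] = counts.get(bigram, 0) + 1
        # Pick the best bigram; sorting makes tie-breaking deterministic.
        best = max(sorted(counts), key=counts.get)
        chosen.append(best)
        uncovered = {k: v for k, v in uncovered.items() if best not in v}
    return chosen

candidates = [
    [{"DT"}, {"NN", "VB"}, {"VBZ", "NNS"}],
    [{"DT"}, {"NN"}, {"VBZ"}],
]
bigrams = greedy_min_tag_bigrams(candidates)
```

On this toy input two bigrams, (DT, NN) and (NN, VBZ), are enough to cover every adjacent token pair, so the ambiguous tokens in the first sentence are explained without introducing VB or NNS.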

Overall Accuracy

[Figure: bar chart of tagging accuracy (y-axis: Accuracy, 0% to 100%). Bar labels are only partly legible: "KIN using all t…", "… using half types and half …", "ENG using all types and m… …ount of data".]

All of these values were achieved using both FST and affix LP features.

Results

Types versus Tokens

Mixing Type and Token Annotations

Morphological Analysis

Annotator Experience

Conclusion

• Type annotations are the most useful input from a linguist.

• We can train effective POS-taggers on low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a non-native linguist.
