Factored Language Models EE517 Presentation April 19, 2005 Kevin Duh ([email protected])


Page 1: Factored Language Models

Factored Language Models

EE517 Presentation

April 19, 2005

Kevin Duh ([email protected])

Page 2: Factored Language Models

Outline

1. Motivation

2. Factored Word Representation

3. Generalized Parallel Backoff

4. Model Selection Problem

5. Applications

6. Tools

Page 3: Factored Language Models

Word-based Language Models

• Standard word-based language models:

  p(w_1, w_2, \ldots, w_T) = \prod_{t=1}^{T} p(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{T} p(w_t \mid w_{t-1}, w_{t-2})

• How to get robust n-gram estimates p(w_t \mid w_{t-1}, w_{t-2})?
  • Smoothing, e.g. Kneser-Ney, Good-Turing
  • Class-based language models:

  p(w_t \mid w_{t-1}) \approx p(w_t \mid C(w_t)) \, p(C(w_t) \mid C(w_{t-1}))
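A minimal sketch of the maximum-likelihood trigram estimate p(w_t | w_{t-1}, w_{t-2}) above, using invented sentences and counts (not from the slides), to show why raw counts alone need smoothing or backoff:

    from collections import Counter

    # Toy corpus (invented for illustration; not from the slides).
    sentences = [["<s>", "i", "read", "books", "</s>"],
                 ["<s>", "i", "read", "a", "book", "</s>"]]

    trigram_counts = Counter()
    bigram_counts = Counter()
    for s in sentences:
        for i in range(2, len(s)):
            trigram_counts[(s[i-2], s[i-1], s[i])] += 1
            bigram_counts[(s[i-2], s[i-1])] += 1

    def p_ml(w, u, v):
        # Maximum-likelihood trigram estimate p(w | u, v); zero for unseen events.
        if bigram_counts[(u, v)] == 0:
            return 0.0
        return trigram_counts[(u, v, w)] / bigram_counts[(u, v)]

    print(p_ml("books", "i", "read"))   # 0.5 (seen trigram)
    print(p_ml("novels", "i", "read"))  # 0.0 -> needs smoothing or backoff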

Page 4: Factored Language Models

Limitation of Word-based Language Models

• Words are inseparable whole units
  • E.g. “book” and “books” are distinct vocabulary units
• Especially problematic in morphologically rich languages
  • E.g. Arabic, Finnish, Russian, Turkish
  • Many unseen word contexts
  • High out-of-vocabulary rate
  • High perplexity

Arabic root k-t-b:

  Kitaab        A book
  Kitaab-iy     My book
  Kitaabu-hum   Their book
  Kutub         Books

Page 5: Factored Language Models

Arabic Morphology

[Diagram: “so I lived” analyzed as particle fa- + root/pattern sakan (LIVE + past) + affix -tu (1st-sg-past)]

• ~5000 roots
• several hundred patterns
• dozens of affixes

Page 6: Factored Language Models

Vocabulary Growth - full word forms

[Chart: CallHome vocabulary size (0 to 16,000) as a function of the number of word tokens (10k to 120k), for English and Arabic full word forms]

Source: K. Kirchhoff, et al., “Novel Approaches to Arabic Speech Recognition - Final Report from the JHU Summer Workshop 2002”, JHU Tech Report 2002

Page 7: Factored Language Models

Vocabulary Growth - stemmed words

[Chart: CallHome vocabulary size (0 to 16,000) as a function of the number of word tokens (10k to 120k), comparing EN words, AR words, EN stems, and AR stems]

Source: K. Kirchhoff, et al., “Novel Approaches to Arabic Speech Recognition - Final Report from the JHU Summer Workshop 2002”, JHU Tech Report 2002
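A rough sketch of how such vocabulary-growth curves can be computed; the corpus file name and the toy "stemmer" below are placeholders, not the CallHome data or the stemmer used in the JHU report:

    def vocab_growth(tokens, step=10000):
        # Record (number of tokens seen, vocabulary size) every `step` tokens.
        seen, points = set(), []
        for i, tok in enumerate(tokens, start=1):
            seen.add(tok)
            if i % step == 0:
                points.append((i, len(seen)))
        return points

    # Hypothetical usage: replace corpus.txt and naive_stem with real data and
    # a real morphological stemmer.
    tokens = open("corpus.txt", encoding="utf-8").read().split()
    naive_stem = lambda w: w[:5]
    full_form_curve = vocab_growth(tokens)
    stemmed_curve = vocab_growth([naive_stem(w) for w in tokens])
    print(full_form_curve, stemmed_curve)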

Page 8: Factored Language Models

Solution: Word as Factors

• Decompose words into “factors” (e.g. stems)
• Build a language model over factors: P(w | factors)
• Two approaches to decomposition:
  • Linear [e.g. Geutner, 1995]
  • Parallel [Kirchhoff et al., JHU Workshop 2002], [Bilmes & Kirchhoff, NAACL/HLT 2003]

[Diagram: linear decomposition splits a word into a sequence of morphs (prefix, stem, suffix); parallel decomposition represents each position t by parallel factor streams Wt, St (stem), Mt (morph), each conditioned on the corresponding factors at t-1 and t-2]

Page 9: Factored Language Models

Factored Word Representations

• Factors may be any word feature. Here we use morphological features:
  • E.g. POS, stem, root, pattern, etc.

w \equiv \{ f^1, f^2, \ldots, f^K \} \equiv f^{1:K}

p(w_1, w_2, \ldots, w_T) \equiv p(f_1^{1:K}, f_2^{1:K}, \ldots, f_T^{1:K}) \approx \prod_{t=1}^{T} p(f_t^{1:K} \mid f_{t-1}^{1:K}, f_{t-2}^{1:K})

E.g. with word, stem, and morph factors:

P(w_t \mid w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, m_{t-1}, m_{t-2})

[Diagram: parallel factor streams Wt, St, Mt, each conditioned on the corresponding factors at t-1 and t-2]
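A small sketch of the factored view of a word: each token becomes a bundle of K factors, and the model conditions on factors from the two previous positions. The factor values are copied from the slides' Arabic examples; the helper function is only illustrative:

    # Each word token is a bundle of factors f^{1:K}.
    kutub = {"word": "kutub", "stem": "kutub", "root": "ktb", "tag": "noun (pl.)"}
    kitaabiy = {"word": "kitaab-iy", "stem": "kitaab", "root": "ktb", "tag": "noun+poss"}

    def conditioning_factors(prev1, prev2, names=("word", "stem", "root")):
        # Collect the conditioning variables for a factored trigram such as
        # P(w_t | w_{t-1}, w_{t-2}, s_{t-1}, s_{t-2}, ...).
        ctx = {}
        for n in names:
            ctx[n + "[-1]"] = prev1[n]
            ctx[n + "[-2]"] = prev2[n]
        return ctx

    print(conditioning_factors(kitaabiy, kutub))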

Page 10: Factored Language Models

Advantage of Factored Word Representations

• Main advantage: Allows robust estimation of probabilities (i.e. p(f_t \mid f_{t-1}^{1:K}, f_{t-2}^{1:K})) using backoff
  • Word combinations in context may not be observed in training data, but factor combinations are
  • Simultaneous class assignment

Word                       word          stem      root   tag
Kutub (Books)              kutub         kutub     ktb    noun (pl.)
Kitaab-iy (My book)        kitaab-iy     kitaab    ktb    noun+poss
Kitaabu-hum (Their book)   kitaabu-hum   kitaabu   ktb    noun+poss

Page 11: Factored Language Models

Example

• Training sentence: “lAzim tiqra kutubiy bi sorca” (You have to read my books quickly)

• Test sentence: “lAzim tiqra kitAbiy bi sorca” (You have to read my book quickly)

Count(tiqra, kitAbiy, bi) = 0

Count(tiqra, kutubiy, bi) > 0

Count(tiqra, ktb, bi) > 0

P(bi | kitAbiy, tiqra) can back off to P(bi | ktb, tiqra) to obtain a more robust estimate.

=> This is better than backing off to P(bi | <unknown>, tiqra).
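A toy sketch of this fallback with made-up counts that mirror the example (the root lookup and count tables are illustrative, not real corpus statistics):

    root_of = {"kutubiy": "ktb", "kitAbiy": "ktb"}          # from a morphological analyzer
    word_context_counts = {("tiqra", "kutubiy", "bi"): 1}    # word trigrams seen in training
    root_context_counts = {("tiqra", "ktb", "bi"): 1}        # root-level contexts are denser

    def seen_word_context(w2, w1, w):
        return word_context_counts.get((w2, w1, w), 0) > 0

    def seen_root_context(w2, w1, w):
        return root_context_counts.get((w2, root_of.get(w1, "<unk>"), w), 0) > 0

    # The test trigram (tiqra, kitAbiy, bi) was never seen at the word level,
    # but its root-level version (tiqra, ktb, bi) was, so the model can back off there.
    print(seen_word_context("tiqra", "kitAbiy", "bi"))   # False
    print(seen_root_context("tiqra", "kitAbiy", "bi"))   # True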

Page 12: Factored Language Models

Language Model Backoff

• When an n-gram count is low, use the (n-1)-gram estimate
• Ensures more robust parameter estimation in sparse data:

Word-based LM: single backoff path (drop the most distant word during backoff):

  P(Wt | Wt-1 Wt-2 Wt-3) -> P(Wt | Wt-1 Wt-2) -> P(Wt | Wt-1) -> P(Wt)

Factored Language Model: backoff graph, multiple backoff paths possible:

  F | F1 F2 F3
    -> F | F1 F2,  F | F1 F3,  F | F2 F3
    -> F | F1,  F | F2,  F | F3
    -> F
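A small sketch that enumerates the backoff paths of such a graph by dropping one conditioning factor at a time; this only illustrates the graph structure, it is not the SRILM implementation:

    from itertools import combinations

    def children(parents):
        # Nodes reachable by dropping exactly one conditioning factor.
        return [c for c in combinations(parents, len(parents) - 1)]

    def backoff_paths(parents):
        # Every path from the full context down to the unconditioned node ().
        if not parents:
            return [[()]]
        paths = []
        for child in children(parents):
            for tail in backoff_paths(child):
                paths.append([tuple(parents)] + tail)
        return paths

    for path in backoff_paths(("F1", "F2", "F3")):
        print(" -> ".join("F | " + " ".join(n) if n else "F" for n in path))
    # prints the 6 distinct backoff paths for three conditioning factors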

Page 13: Factored Language Models

Choosing Backoff Paths

• Four methods for choosing a backoff path:

  1. Fixed path (a priori)

  2. Choose path dynamically during training

  3. Choose multiple paths dynamically during training and combine the results (Generalized Parallel Backoff)

  4. Constrained version of (2) or (3)

[Backoff graph over F | F1 F2 F3, as on the previous slide]

Page 14: Factored Language Models

Generalized Backoff

• Katz Backoff:

  P_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
    \begin{cases}
      d_{N(w_t, w_{t-1}, w_{t-2})} \dfrac{N(w_t, w_{t-1}, w_{t-2})}{N(w_{t-1}, w_{t-2})} & \text{if } N(w_t, w_{t-1}, w_{t-2}) > 0 \\
      \alpha(w_{t-1}, w_{t-2}) \, P_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
    \end{cases}

• Generalized Backoff:

  P_{BO}(f \mid f_{P_1}, f_{P_2}) =
    \begin{cases}
      d_{N(f, f_{P_1}, f_{P_2})} \dfrac{N(f, f_{P_1}, f_{P_2})}{N(f_{P_1}, f_{P_2})} & \text{if } N(f, f_{P_1}, f_{P_2}) > 0 \\
      \alpha(f_{P_1}, f_{P_2}) \, g(f, f_{P_1}, f_{P_2}) & \text{otherwise}
    \end{cases}

  with backoff weight

  \alpha(f_{P_1}, f_{P_2}) =
    \dfrac{1 - \sum_{f:\, N(f, f_{P_1}, f_{P_2}) > 0} d_{N(f, f_{P_1}, f_{P_2})} \frac{N(f, f_{P_1}, f_{P_2})}{N(f_{P_1}, f_{P_2})}}
          {\sum_{f:\, N(f, f_{P_1}, f_{P_2}) = 0} g(f, f_{P_1}, f_{P_2})}

g() can be any positive function, but some choices of g() make the backoff-weight computation difficult.
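A compact sketch of the two-case rule above with a pluggable g(); the count function N, discount d, and vocabulary are placeholders supplied by the caller, and this is not the SRILM code:

    def generalized_backoff(f, p1, p2, N, d, g, alpha):
        # P_BO(f | f_P1, f_P2): discounted relative frequency when the full
        # context was observed, otherwise alpha(p1, p2) * g(f, p1, p2).
        n_full = N((f, p1, p2))
        if n_full > 0:
            return d(n_full) * n_full / N((p1, p2))
        return alpha(p1, p2) * g(f, p1, p2)

    def make_alpha(vocab, N, d, g):
        # Backoff weight chosen so that the distribution over f sums to one.
        def alpha(p1, p2):
            seen_mass = sum(d(N((f, p1, p2))) * N((f, p1, p2)) / N((p1, p2))
                            for f in vocab if N((f, p1, p2)) > 0)
            unseen_g = sum(g(f, p1, p2) for f in vocab if N((f, p1, p2)) == 0)
            return (1.0 - seen_mass) / unseen_g if unseen_g > 0 else 0.0
        return alpha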

Page 15: Factored Language Models

g() functions

• A priori fixed path:

  g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_1})

• Dynamic path, max counts (based on raw counts => favors robust estimation):

  g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_{j^*}}), \quad j^* = \arg\max_j N(f, f_{P_j})

• Dynamic path, max normalized counts (based on maximum likelihood => favors statistical predictability):

  g(f, f_{P_1}, f_{P_2}) = P_{BO}(f \mid f_{P_{j^*}}), \quad j^* = \arg\max_j \dfrac{N(f, f_{P_j})}{N(f_{P_j})}
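Sketches of the three g() choices above, written against the same caller-supplied count function N and lower-order backoff model P_bo as the previous sketch (illustrative only):

    def g_fixed(P_bo, j_fixed):
        # A priori fixed path: always back off to the same parent.
        return lambda f, *parents: P_bo(f, parents[j_fixed])

    def g_max_counts(P_bo, N):
        # Dynamic path, max counts: pick the parent with the largest raw count
        # N(f, f_Pj) -- favors robust estimation.
        def g(f, *parents):
            j = max(range(len(parents)), key=lambda j: N((f, parents[j])))
            return P_bo(f, parents[j])
        return g

    def g_max_normalized_counts(P_bo, N):
        # Dynamic path, max normalized counts: pick the parent maximizing
        # N(f, f_Pj) / N(f_Pj) -- favors statistical predictability.
        def g(f, *parents):
            j = max(range(len(parents)),
                    key=lambda j: N((f, parents[j])) / max(N((parents[j],)), 1))
            return P_bo(f, parents[j])
        return g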

Page 16: Factored Language Models

Dynamically Choosing Backoff Paths During Training

• Choose the backoff path based on g() and statistics of the data

[Diagram: backoff graph for Wt | Wt-1 St-1 Tt-1; at each node, g() and the training-data statistics pick the next lower-order node, e.g. the path Wt | Wt-1 St-1 Tt-1 -> Wt | Wt-1 St-1 -> Wt | St-1 -> Wt]

Page 17: Factored Language Models

Multiple Backoff Paths: Generalized Parallel Backoff

• Choose multiple paths during training and combine probability estimates

[Diagram: the node Wt | Wt-1 St-1 Tt-1 backs off in parallel to Wt | Wt-1 St-1 and Wt | Wt-1 Tt-1]

  p_{BO}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) =
    \begin{cases}
      d_c \, p_{ML}(w_t \mid w_{t-1}, s_{t-1}, t_{t-1}) & \text{if count} \ge \text{threshold} \\
      \alpha \cdot \tfrac{1}{2} \left[ p_{BO}(w_t \mid w_{t-1}, s_{t-1}) + p_{BO}(w_t \mid w_{t-1}, t_{t-1}) \right] & \text{otherwise}
    \end{cases}

Options for combining the parallel estimates: average, sum, product, geometric mean, weighted mean.
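A small sketch of the averaging case above; the threshold, discount d, maximum-likelihood estimate, the two lower-order backoff models, and alpha are all supplied by the caller and are placeholders:

    def parallel_backoff(w, w1, s1, t1, N, d, p_ml, p_bo_ws, p_bo_wt, alpha, threshold=1):
        # Back off to the mean of two parallel lower-order estimates when the
        # full context (w1, s1, t1) is too rare.
        c = N((w, w1, s1, t1))
        if c >= threshold:
            return d(c) * p_ml(w, w1, s1, t1)
        # Mean is one combination option; sum, product, geometric mean, and
        # weighted mean are also possible.
        return alpha(w1, s1, t1) * 0.5 * (p_bo_ws(w, w1, s1) + p_bo_wt(w, w1, t1))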

Page 18: Factored Language Models

Summary: Factored Language Models

FACTORED LANGUAGE MODEL = Factored Word Representation + Generalized Backoff

• Factored Word Representation
  • Allows a rich feature-set representation of words
• Generalized (Parallel) Backoff
  • Enables robust estimation of models with many conditioning variables

Page 19: Factored Language Models

Model Selection Problem

• In n-gram models, choose, e.g., bigram vs. trigram vs. 4-gram
  => relatively easy search; just try each and note perplexity on a development set
• In a Factored LM, choose:
  • Initial conditioning factors
  • Backoff graph
  • Smoothing options
  => too many options; an automatic search is needed
• Tradeoff: the Factored LM is more general, but it is harder to select a good model that fits the data well.
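One hedged sketch of such an automatic search: score candidate FLM specifications by development-set perplexity and keep the best. The `train_and_eval_ppl` callback is hypothetical; in practice it would wrap fngram-count/fngram, and the real work (e.g. the genetic search of Duh & Kirchhoff 2004) also samples backoff graphs:

    import random

    def random_spec(factors=("W", "S", "T"), smoothings=("kndiscount", "wbdiscount")):
        # Draw a random candidate model: a set of conditioning factors plus a
        # smoothing choice.  A real search would also sample the backoff graph.
        k = random.randint(1, len(factors))
        return {"parents": tuple(random.sample(factors, k)),
                "smoothing": random.choice(smoothings)}

    def search(train_and_eval_ppl, n_candidates=20):
        # Keep the specification with the lowest development-set perplexity.
        best_spec, best_ppl = None, float("inf")
        for _ in range(n_candidates):
            spec = random_spec()
            ppl = train_and_eval_ppl(spec)   # hypothetical: trains an FLM, returns dev ppl
            if ppl < best_ppl:
                best_spec, best_ppl = spec, ppl
        return best_spec, best_ppl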

Page 20: Factored Language Models

Example: a Factored LM

• Initial Conditioning Factors, Backoff Graph, and Smoothing parameters completely specify a Factored Language Model

• E.g. 3 factors total:

0. Begin with the full backoff graph for 3 factors:

   Wt | Wt-1 St-1 Tt-1
     -> Wt | Wt-1 St-1,  Wt | Wt-1 Tt-1,  Wt | St-1 Tt-1
     -> Wt | Wt-1,  Wt | St-1,  Wt | Tt-1
     -> Wt

1. The initial conditioning factors specify the start node, e.g. Wt | Wt-1 St-1

Page 21: Factored Language Models

Example: a Factored LM

• Initial Conditioning Factors, Backoff Graph, and Smoothing parameters completely specify a Factored Language Model

• E.g. 3 factors total:

3. Begin with the subgraph rooted at the new start node:

   Wt | Wt-1 St-1
     -> Wt | Wt-1,  Wt | St-1
     -> Wt

4. Specify the backoff graph, i.e. which backoff to use at each node, e.g.:

   Wt | Wt-1 St-1  ->  Wt | Wt-1  ->  Wt

5. Specify the smoothing for each edge

Page 22: Factored Language Models

Applications for Factored LM

• Modeling of Arabic, Turkish, Finnish, German, and other morphologically rich languages
  • [Kirchhoff et al., JHU Summer Workshop 2002]
  • [Duh & Kirchhoff, COLING 2004], [Vergyri et al., ICSLP 2004]
• Modeling of conversational speech
  • [Ji & Bilmes, HLT 2004]
• Applied in speech recognition and machine translation
• The general Factored LM tools can also be used to obtain various smoothed conditional probability tables for applications outside of language modeling (e.g. tagging)
• More possibilities (factors can be anything!)

Page 23: Factored Language Models

To explore further…

• Factored Language Models are now part of the standard SRI Language Modeling Toolkit distribution (v1.4.1)
  • Thanks to Jeff Bilmes (UW) and Andreas Stolcke (SRI)
  • Downloadable at: http://www.speech.sri.com/projects/srilm/

Page 24: Factored Language Models

fngram Tools

fngram-count -factor-file my.flmspec -text train.txt
fngram -factor-file my.flmspec -ppl test.txt

train.txt (“Factored LM is fun”, with word and part-of-speech factors):
W-Factored:P-adj W-LM:P-noun W-is:P-verb W-fun:P-adj

my.flmspec:
W: 2 W(-1) P(-1) my.count my.lm 3
  W1,P1 W1 kndiscount gtmin 1 interpolate
  P1 P1 kndiscount gtmin 1
  0 0 kndiscount gtmin 1
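A hedged convenience wrapper around the two commands above (file names follow the slide; only the flags shown on the slide are used, and the tools are assumed to be on PATH):

    import subprocess

    def train_flm(flmspec="my.flmspec", train_text="train.txt"):
        # Count factored n-grams and estimate the FLM described in the spec file.
        subprocess.run(["fngram-count", "-factor-file", flmspec, "-text", train_text],
                       check=True)

    def perplexity(flmspec="my.flmspec", test_text="test.txt"):
        # Score factored test text with the trained FLM.
        subprocess.run(["fngram", "-factor-file", flmspec, "-ppl", test_text],
                       check=True)

    if __name__ == "__main__":
        train_flm()
        perplexity()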

Page 25: Factored Language Models


Page 26: Factored Language Models

Turkish Language Model

• Newspaper text from the web [Hakkani-Tür, 2000]
  • Train: 400K tokens / Dev: 100K / Test: 90K
• Factors from a morphological analyzer

Factored representation of the word “yararmanlak”:

  word:              yararmanlak
  root:              yarar
  part-of-speech:    Noun (Inf-N:A3sg)
  number:            singular
  case:              Nom
  other:             Pnon
  inflection-group:  NounA3sgPnonNom+Verb+Acquire+Pos

Page 27: Factored Language Models

Turkish: Dev Set Perplexity

  N-gram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2        593.8           555.0      556.4        539.2         -2.9
  3        534.9           533.5      497.1        444.5         -10.6
  4        534.8           549.7      566.5        522.2         -5.0

• The Factored Language Models found by genetic algorithms perform best
• The poor performance of the higher-order hand-designed FLM reflects the difficulty of manual model search

Page 28: Factored Language Models

Turkish: Eval Set Perplexity

  N-gram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2        609.8           558.7      525.5        487.8         -7.2
  3        545.4           583.5      509.8        452.7         -11.2
  4        543.9           559.8      574.6        527.6         -5.8

• The Dev Set results generalize to the Eval Set => the genetic algorithm did not overfit
• The best models used the Word, POS, Case, and Root factors and parallel backoff

Page 29: Factored Language Models

Arabic Language Model

• LDC CallHome conversational Egyptian Arabic speech transcripts
  • Train: 170K words / Dev: 23K / Test: 18K
• Factors from a morphological analyzer
  • [LDC, 1996], [Darwish, 2002]

Factored representation of the word “Il+dOr”:

  word:               Il+dOr
  root:               dwr
  morphological tag:  Noun+masc-sg+article
  stem:               dOr
  pattern:            CCC

Page 30: Factored Language Models

Arabic: Dev Set and Eval Set Perplexity

Dev Set perplexities:

  N-gram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2        229.9           229.6      229.9        222.9         -2.9
  3        229.3           226.1      230.3        212.6         -6.0

Eval Set perplexities:

  N-gram   Word-based LM   Hand FLM   Random FLM   Genetic FLM   ppl (%)
  2        249.9           230.1      239.2        223.6         -2.8
  3        285.4           217.1      224.3        206.2         -5.0

The best models used all available factors (Word, Stem, Root, Pattern, Morph) and various parallel backoffs.

Page 31: Factored Language Models

Word Error Rate (WER) Results

Dev Set:

  Stage   Word LM Baseline   Factored LM
  1       57.3               56.2
  2a      54.8               52.7
  2b      54.3               52.5
  3       53.9               52.1

Eval Set (eval 97):

  Stage   Word LM Baseline   Factored LM
  1       61.7               61.0
  2a      58.2               56.5
  2b      58.8               57.4
  3       57.6               56.1

Factored language models gave a 1.5% improvement in WER.