Machine Translation - Spring 2011 1
Machine Translation
Language Model
Stephan Vogel, 21 February 2011
Machine Translation - Spring 2011 2
Overview
• N-gram LMs
• Perplexity
• Smoothing
• Dealing with large LMs
Machine Translation - Spring 2011 3
A Word Guessing Game
Good afternoon, how are ____
Machine Translation - Spring 2011 4
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very _____
Machine Translation - Spring 2011 5
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very sorry!
Machine Translation - Spring 2011 6
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very sorry!
My favorite OS is ________
Machine Translation - Spring 2011 7
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very sorry!
My favorite OS is Linux
Mac OS
Windows XP
WinCE
Machine Translation - Spring 2011 8
A Word Guessing Game
Hello, I am very happy to see you, Mr. _____
Machine Translation - Spring 2011 9
A Word Guessing Game
Hello, I am very happy to see you, Mr. White
Black
Jones
Smith
…
Machine Translation - Spring 2011 10
A Word Guessing Game
What do we learn from the word guessing game?
• For some histories the number of expected words is rather small.
• For some histories we can make virtually no prediction about the next word.
• The more words that fit at some point, the more difficult it is to select the correct one (more errors are possible).
• The difficulty of generating a word sequence is correlated with the "branching degree".
Machine Translation - Spring 2011 11
Language Model in MT
• From the translation model, we typically have many alternative translations for words and phrases
• Reordering throws phrases around pretty arbitrarily
• The LM needs to help:
  - To select words with the right sense (disambiguate homonyms, e.g. river bank versus money bank)
  - To select hypotheses where words are ‘in the right place’
  - To increase cohesion between words (agreement)
Machine Translation - Spring 2011 12
N-gram LM
Probability for a sentence: Pr(w_1^N)

Factorize (chain rule) without loss of generality:
Pr(w_1^N) = \prod_{n=1}^{N} Pr(w_n | w_1^{n-1})

But too many possible word sequences: we do not see them
• Vocabulary of 10k and sentence length of 5 words -> 10^20 different word sequences
• Of course, most sequences are extremely unlikely
Machine Translation - Spring 2011 13
N-gram LM
Probability for a sentence: Pr(w_1^N)

Factorize (chain rule) without loss of generality:
Pr(w_1^N) = \prod_{n=1}^{N} Pr(w_n | w_1^{n-1})

Limit the length of the history:

Unigram LM:  Pr(w_1^N) \approx \prod_{n=1}^{N} Pr(w_n)

Bigram LM:   Pr(w_1^N) \approx \prod_{n=1}^{N} Pr(w_n | w_{n-1})

Trigram LM:  Pr(w_1^N) \approx \prod_{n=1}^{N} Pr(w_n | w_{n-2}, w_{n-1})
Machine Translation - Spring 2011 14
Maximum Likelihood Estimates
Probabilities as relative frequencies: Here shown for 3-gram LM
Estimate from corpus – just count
Pr(w_3 | w_1, w_2) = \frac{Count(w_1, w_2, w_3)}{\sum_{w'} Count(w_1, w_2, w')}

or, in general, for a word w with history h:

Pr(w | h) = \frac{Count(h, w)}{\sum_{w'} Count(h, w')} = \frac{Count(h, w)}{Count(h)}
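To make "just count" concrete, here is a minimal Python sketch (my illustration, not part of the original slides; the names train_trigram_mle, trigram_counts and history_counts are made up for the example):

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Estimate Pr(w3 | w1, w2) as Count(w1, w2, w3) / Count(w1, w2) by counting."""
    trigram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for sentence in sentences:
        # Pad with boundary symbols so the first real word has a full history.
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            history = (words[i - 2], words[i - 1])
            trigram_counts[history + (words[i],)] += 1
            history_counts[history] += 1

    def prob(w1, w2, w3):
        h = (w1, w2)
        if history_counts[h] == 0:
            return 0.0  # unseen history: plain MLE gives zero, which smoothing will fix
        return trigram_counts[(w1, w2, w3)] / history_counts[h]

    return prob

# Toy corpus of already tokenized sentences
p = train_trigram_mle([["good", "afternoon", "how", "are", "you"],
                       ["how", "are", "you"]])
print(p("how", "are", "you"))  # 1.0 - "you" is the only word ever seen after "how are"
```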
Machine Translation - Spring 2011 15
Measuring the Quality of LMs
Obvious approach to finding out whether LM1 or LM2 is better: run the translation system with both; choose the one that produces better output

But:
• Performance of the MT system depends also on the translation and distortion models, and on the interaction between these models
• Performance of the MT system also depends on pruning
• Expensive and time consuming

We would like to have an independent measure: declare an LM to be good if it restricts the future more strongly (i.e. if it has a smaller average "branching factor").
Machine Translation - Spring 2011 16
Perplexity
Inverse geometric average
Interpretation: weighted number of choices per word position
PP = Pr(w_1^N)^{-1/N} = \left( \prod_{n=1}^{N} Pr(w_n | w_1^{n-1}) \right)^{-1/N}
Easy to see that a lower perplexity is better than a higher one
Machine Translation - Spring 2011 17
Perplexity 0-gram LM
Zero-gram LM: constant probability for each word
• Vocabulary size V
• Probability for each word: 1/V
• Probability for the corpus:

Pr(w_1^N) = \prod_{n=1}^{N} Pr(w_n) = \prod_{n=1}^{N} \frac{1}{V} = \frac{1}{V^N}

Perplexity:

PP = Pr(w_1^N)^{-1/N} = \left( \frac{1}{V^N} \right)^{-1/N} = V
Machine Translation - Spring 2011 18
Perplexity and Entropy
Log perplexity:

\log PP = -\frac{1}{N} \log Pr(w_1^N)
        = -\frac{1}{N} \sum_{n=1}^{N} \log p(w_n | h_n)
        = -\sum_{(h,w)} \frac{N(h,w)}{N} \log p(w | h)

Relation between perplexity and entropy: entropy = log perplexity
Entropy: minimum number of bits per symbol needed for encoding, on average

Perplexity depends on:
• Language model p(w|h)
• Corpus w_1^N
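As an illustration of the formula above, a small Python sketch (not from the slides) that computes perplexity as the exponential of the average negative log probability, for any conditional model prob(history, word); the final check reproduces PP = V for the zero-gram LM of the previous slide:

```python
import math

def perplexity(sentences, prob, order=3):
    """PP = exp(-(1/N) * sum_n log p(w_n | h_n)), summed over all word positions."""
    log_sum, num_words = 0.0, 0
    for sentence in sentences:
        words = ["<s>"] * (order - 1) + sentence + ["</s>"]
        for i in range(order - 1, len(words)):
            history = tuple(words[i - order + 1:i])
            log_sum += math.log(prob(history, words[i]))  # assumes a smoothed model, p > 0
            num_words += 1
    return math.exp(-log_sum / num_words)

# Sanity check: a zero-gram model with vocabulary size V has perplexity V
V = 1000
print(round(perplexity([["a"] * 50], lambda history, word: 1.0 / V)))  # 1000
```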
Machine Translation - Spring 2011 19
Some Perplexities
Problem: perplexities are not directly comparable across tasks – even if the perplexity of one task is lower than that of another, that task can still be the more difficult one
Task                     Vocabulary   Language Model   Perplexity
Conference Registration  400          Bigrams          7
Resource Management      1,000        Bigrams          20
Wall Street Journal      60,000       Bigrams          160
Wall Street Journal      60,000       Trigrams         130
Arabic Broadcast News    220,000      Fourgrams        212
Chinese Broadcast News   90,000       Fourgrams        430
Machine Translation - Spring 2011 20
Sparse Data and Smoothing
• Many n-grams are not seen -> zero probability
  - Sentences with such an n-gram would be considered impossible by the model
  - Need smoothing
• Essentially: unseen events (in the training corpus) still have probability > 0
• Need to deduct probability from seen events and distribute it to unseen (and rare) events
  - Robin Hood principle: take from the rich, redistribute to the poor
• Large body of work on smoothing in speech recognition
Machine Translation - Spring 2011 21
Smoothing
Two types of smoothing:
• Backing off: if word w_n was not seen with history w_1 … w_{n-1}, back off to a shorter history. Different variants:
  - Absolute discounting
  - Katz smoothing
  - (Modified) Kneser-Ney smoothing
• Linear interpolation: interpolate with the probability distribution for n-grams with shorter histories
Machine Translation - Spring 2011 22
Smoothing
Assume n-grams of the form a b c:
• a is the first word
• c is the last word
• b is zero or more words in between

Interpolation:

Pr(c | a b) = \hat{P}(c | a b) + \gamma(a b) Pr(c | b)

Backing off:

Pr(c | a b) = \hat{P}(c | a b)         if C(a b c) > 0
            = \gamma(a b) Pr(c | b)    otherwise

• Back-off weights \gamma(a b) are chosen such that probabilities are normalized
• Interpolation always incorporates the lower-order distribution; back-off uses it only when a b c was not seen
Machine Translation - Spring 2011 23
Absolute Discounting
• Subtract a fixed constant D from each nonzero count to allocate probability mass to unseen events
• Different D for each n-gram order
• For the interpolated form:
  - C(ab*): total count of n-grams starting with ab
  - N(ab*): number of distinct words following ab
P(c | a b) = \frac{\max(C(a b c) - D, 0)}{C(a b *)} + \frac{D \cdot N(a b *)}{C(a b *)} \, Pr(c | b)
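A minimal Python sketch of interpolated absolute discounting for the bigram case (my own illustration with made-up class and variable names, not the course code): the discount D is taken from every seen bigram, and the freed mass D · N(a*) / C(a*) is spread over a lower-order unigram distribution.

```python
from collections import defaultdict

class AbsoluteDiscountBigram:
    """Interpolated absolute discounting for bigrams, backing off to a unigram MLE."""

    def __init__(self, sentences, D=0.75):
        self.D = D
        self.bigram = defaultdict(int)      # C(a c)
        self.context = defaultdict(int)     # C(a *): total count of bigrams starting with a
        self.followers = defaultdict(set)   # used for N(a *): distinct words following a
        self.unigram = defaultdict(int)
        self.total = 0
        for sentence in sentences:
            words = ["<s>"] + sentence + ["</s>"]
            for a, c in zip(words, words[1:]):
                self.bigram[(a, c)] += 1
                self.context[a] += 1
                self.followers[a].add(c)
                self.unigram[c] += 1
                self.total += 1

    def prob(self, a, c):
        p_lower = self.unigram[c] / self.total        # lower-order distribution
        if self.context[a] == 0:
            return p_lower                            # completely unseen history
        discounted = max(self.bigram[(a, c)] - self.D, 0.0) / self.context[a]
        gamma = self.D * len(self.followers[a]) / self.context[a]   # D * N(a*) / C(a*)
        return discounted + gamma * p_lower

lm = AbsoluteDiscountBigram([["how", "are", "you"], ["how", "is", "it"]])
print(lm.prob("how", "are"))  # seen bigram: discounted relative frequency plus unigram share
print(lm.prob("how", "it"))   # unseen bigram: only the redistributed mass gamma * P(it)
```

For a fixed history a, the discounted terms and the redistributed mass sum to 1, so the result is a proper distribution as long as the lower-order model is.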
Machine Translation - Spring 2011 24
Kneser-Ney Smoothing
• ‘Standard’ smoothing technique, as it consistently shows good performance
• Motivation: P(Francisco | eggplant) vs P(stew | eggplant)
  - "Francisco" is common, so backoff and interpolated methods say it is likely
  - But it mostly occurs in the context of "San"
  - "Stew" is common, and in many contexts
• Weight the backoff distribution by the number of contexts a word occurs in (count-of-counts)
Pr(z | x y) = \frac{\max(C(x y z) - D, 0)}{C(x y)} + \gamma(x y) \, \frac{|\{w : C(w y z) > 0\}|}{\sum_{v} |\{w : C(w y v) > 0\}|}
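The distinctive part of Kneser-Ney is the lower-order term: a word is weighted by how many distinct contexts it continues, not by its raw frequency. A small illustrative sketch (unigram continuation case only; names and data are made up):

```python
from collections import defaultdict

def continuation_prob(bigrams):
    """P_cont(z) = |{w : C(w z) > 0}| / |{(w, v) : C(w v) > 0}| (unigram continuation)."""
    left_contexts = defaultdict(set)
    bigram_types = set()
    for w, z in bigrams:
        left_contexts[z].add(w)
        bigram_types.add((w, z))
    return lambda z: len(left_contexts[z]) / len(bigram_types)

p_cont = continuation_prob([("San", "Francisco"), ("San", "Francisco"),
                            ("beef", "stew"), ("fish", "stew"), ("vegetable", "stew")])
print(p_cont("Francisco"))  # 0.25 - frequent, but seen after only one distinct word
print(p_cont("stew"))       # 0.75 - seen after three distinct words
```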
Machine Translation - Spring 2011 25
Problems With N-gram LM
• N-gram LMs have short memory
  - OK for speech recognition
  - Less sufficient for MT, where reordering is needed
• Typically trained on ‘good’ sentences, applied to not-so-good sentences
  - Some unseen events are good – should be allowed by the LM
  - Some (many) unseen events are not good – should be disfavored
  - The LM cannot distinguish – it often assigns higher probability to worse sentences
• This is true also for other LMs, e.g. syntax-based LMs
Machine Translation - Spring 2011 26
Problem With N-gram LMs
LM does not assign higher probability to reference translations
Gloss: Pakistan president Musharraf win senate representative two houses trust vote

     Sentence                                                                                   LM score
R1   pakistani president musharraf wins vote of confidence in both houses                       -7.18
R2   pakistani president musharraf won the trust vote in senate and lower house                 -7.27
R3   pakistani president musharraf wins votes of confidence in senate and house                 -6.78
R4   the pakistani president musharraf won the vote of confidence in senate and lower house     -6.18
H1   pakistani president musharraf has won both vote of confidence in                           -6.12
H2   pakistani president musharraf win both the house and the senate confidence vote in         -5.97
H3   pakistani president musharraf win both the house and the senate vote of confidence in      -5.24
H4   pakistani president musharraf win both the senate confidence vote in                       -7.25
H5   pakistani president musharraf win both the senate vote of confidence in                    -6.22
Machine Translation - Spring 2011 27
Dealing with Large LMs
Three strategies:
• Distribute over many machines
• Filter
• Use lossy compression
Machine Translation - Spring 2011 28
Distributed LM
• Want to use large LMs
  - From large corpora
  - Long histories: 5-gram, 6-gram, …
• Too much data to fit into the memory of one workstation
• Use a client-server architecture
  - Each server owns part of the LM data
  - Use a suffix array to find n-grams and get n-gram counts
  - Efficiency has been greatly improved using batch-mode communication between client and server
• Client collects n-gram occurrence counts and calculates probabilities
• Simple smoothing: linear interpolation with uniform weights
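A sketch of how a client could turn the returned counts into a probability under "linear interpolation with uniform weights" (this is my reading of the scheme, not the actual system's code; get_count stands in for the batched server lookups):

```python
def interpolated_prob(history, word, get_count):
    """Uniform-weight linear interpolation of relative frequencies of all n-gram orders.

    get_count(ngram_tuple) returns the corpus count of that n-gram; in the distributed
    setup all of these counts would come back from the servers in a single batch.
    The empty tuple () stands for the total corpus size.
    """
    probs = []
    for k in range(len(history) + 1):                 # use the last k history words
        h = tuple(history[len(history) - k:])
        denominator = get_count(h) if k > 0 else get_count(())
        numerator = get_count(h + (word,))
        probs.append(numerator / denominator if denominator > 0 else 0.0)
    return sum(probs) / len(probs)

# Toy count table standing in for the suffix-array servers
counts = {(): 100, ("you",): 6, ("are",): 10, ("are", "you"): 3,
          ("how", "are"): 4, ("how", "are", "you"): 2}
print(interpolated_prob(("how", "are"), "you", lambda ng: counts.get(ng, 0)))
```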
Machine Translation - Spring 2011 29
Client-Server Architecture for LM
• A suffix array for 50M words needs ~450MB RAM
• We use 2.9 billion words from the Gigaword corpus
[Figure: client-server architecture – the client sends a batch of n-grams extracted from the N-best list hypotheses (Hyp 1 … Hyp N) to monolingual corpus information servers (NYT1999, NYT2000, …, XINHUA2004) and adds up the occurrence information returned from the servers]
Courtesy Joy Zhang
Machine Translation - Spring 2011 30
Impact on Decoder
• Communication overhead is substantial
  - Cannot query each n-gram probability individually
• Restructure the decoder
  - Generate many hypotheses, using all information but the LM
  - Send all queries (LMstate_i, WordSequence_i) to the LM server
  - Get back all probabilities and new LM states (p_i, NewLMstate_i)
• Will have some impact on pruning
  - Early termination when expanding hypotheses cannot use the LM score
Machine Translation - Spring 2011 31
Filtering LM
• Assume we have been able to build a large LM
• Most entries will not be used to translate the current sentence
• Extract those entries which may be used:
  - Filter the phrase table for the current sentence
  - Establish a word list from the source sentence and the filtered phrase table
  - Extract all n-grams from the LM file which are completely covered by the word list
• Filtering does not change the decoding result
  - All needed n-grams are still available
  - All probabilities remain unchanged
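A minimal sketch of the filtering step (illustrative only; a real implementation operates on the actual LM and phrase-table file formats): keep an LM entry only if every word in it is covered by the sentence-specific word list.

```python
def filter_lm(lm_entries, source_words, phrase_table):
    """Keep only LM entries whose n-grams are fully covered by the sentence word list.

    lm_entries: iterable of (ngram_tuple, log_prob) pairs
    phrase_table: source phrase -> list of target phrases, already filtered
                  to the current source sentence
    """
    word_list = set(source_words)                    # source words (for pass-through)
    for target_phrases in phrase_table.values():
        for phrase in target_phrases:
            word_list.update(phrase.split())         # all reachable target words
    return [(ngram, lp) for ngram, lp in lm_entries
            if all(w in word_list for w in ngram)]

filtered = filter_lm(
    lm_entries=[(("how", "are"), -1.2), (("are", "you"), -0.9), (("are", "they"), -1.5)],
    source_words=["wie", "geht", "es"],
    phrase_table={"wie geht es": ["how are you", "how is it"]},
)
print(filtered)  # ("are", "they") is dropped: "they" is not covered by the word list
```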
Machine Translation - Spring 2011 32
Filtering LM
• Filtering is an expensive operation
  - Needs to be re-done whenever the phrase table changes
  - MapReduce can help
• Typical filtered LM sizes (using a pruned phrase table: top 60)
  - Min: 100KB, Avg: 75MB, Max: 500MB
  - Divide by 40 to get the approximate number of n-grams
• For long sentences and weakly pruned phrase tables the filtered LM can still be large (>2GB)
• But works well for n-best list rescoring and system combination
Machine Translation - Spring 2011 33
Bloom-Filter LMs
• Proposed by David Talbot and Miles Osborne (2007)
• Bloom filters have long been used for storing data in a compact way
• General idea: use multiple hash functions to generate a ‘foot print’ for an object
• Following slides courtesy of Matthias Eck …
Machine Translation - Spring 2011 34
General Bloomfilter
• Bit array and k hash functions f1 .. fk
• To enter object o in the hash: set a 1 at positions f1(o) ... fk(o)
• To check if object o is in the hash: check whether there is a 1 at f1(o) ... fk(o)

[Figure: a bit array, initially all zeros; after inserting objects o1 and o2, the bits at their hashed positions are set to 1, and a membership query checks exactly those positions]
Machine Translation - Spring 2011 35
Lossy Hash
Properties:
• False negatives are impossible: inserted objects will always be found again
• False positives are possible: objects that were not inserted can be "found"
• Objects cannot be deleted

Optimal size:
• For k hash functions and n elements to insert, the hash size is optimal at M = n · k / ln 2
• At optimal size, error rate: (0.5)^k (smaller than 1% for k = 7)
Machine Translation - Spring 2011 36
Usage as a LM
• Bloom filter is only boolean (contains / does not contain)
• To realize an LM, put n-gram frequencies in the Bloom filter:
  - n-gram "a b c" with frequency f: insert "a b c_1" ... "a b c_f"
  - Or map frequencies first, e.g.: 1→1, 2-3→2, 4-8→3, … (sketch below)
• Lower the error rate by checking whether sub-n-grams are present in the Bloom filter:
  - "a b c" with frequency 10 means frequency of "a b" ≥ 10, "b c" ≥ 10, etc.
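A compact sketch of the idea (my own illustration; Talbot and Osborne's implementation differs in details such as hashing and frequency quantisation): n-gram/frequency pairs are inserted into a bit-array Bloom filter, and the stored frequency is recovered by probing counts 1, 2, 3, … until a probe fails.

```python
import hashlib

class BloomLM:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        # Derive k bit positions from one strong hash of the key (an illustrative choice).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def _set(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def _test(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    def insert_ngram(self, ngram, freq):
        # Insert "ngram_1" ... "ngram_f" (frequencies could also be quantised first).
        for f in range(1, freq + 1):
            self._set(" ".join(ngram) + "_" + str(f))

    def freq(self, ngram):
        # Probe increasing counts; stop at the first miss. False positives can only
        # overestimate a count, never report a stored n-gram as missing.
        f = 0
        while self._test(" ".join(ngram) + "_" + str(f + 1)):
            f += 1
        return f

lm = BloomLM()
lm.insert_ngram(("how", "are", "you"), 3)
print(lm.freq(("how", "are", "you")))   # 3 (barring false positives)
print(lm.freq(("how", "are", "they")))  # usually 0
```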
Machine Translation - Spring 2011 37
Experiments
As an LM during decoding:
• All NYT data (Gigaword Corpus): 57.9M lines, 1220M words; overall 704M n-grams (up to 4-grams) → subsampled to 105M
• 105M n-grams × frequency 10 (average) → 1000M entries
• 1000M entries × 4 hash functions / ln 2 → 5700 Mbit; choose: Bloom filter LM with 4 hash functions, 1 GByte memory
• Frequency mapping: inserted frequency = 1 + ln(frequency) · 5
• Baseline LM (250M words, Xinhua News), baseline MT03: 28.2 BLEU
• + Bloom filter LM: 28.9 BLEU
Machine Translation - Spring 2011 39
Other LMs
• Cache LM: boost n-grams seen recently, e.g. in the current document
• Trigger LM: boost the current word based on specific words in the past
• Class-based LMs: generalize, e.g. for numbers, named entities
  - Cluster the entire vocabulary
  - Longer matching history if the number of classes is small
• Continuous LMs (Holger Schwenk et al. 2006)
  - Based on neural nets
  - Used only to predict the most frequent words
Machine Translation - Spring 2011 40
Summary
• LM used for translation: typically n-gram
  - More data, longer histories possible
  - Large systems use 5-grams
• Smoothing is important
  - Different smoothing techniques; Kneser-Ney is a good default
• Many LM toolkits are available
  - SRI LM most widely used
  - KenLM (Kenneth Heafield) provides a more memory-efficient version (integrated into Moses)
• Using very large LMs requires engineering
  - MapReduce or Hadoop
  - Bloom filter
• Many extensions possible
  - Class-based, trigger, cache, …
  - LSA and topic LMs
  - Syntactic LMs