Machine Translation - Spring 2011 1
Machine Translation
Language Model
Stephan Vogel, 21 February 2011
Machine Translation - Spring 2011 2
Overview
• N-gram LMs
• Perplexity
• Smoothing
• Dealing with large LMs
Machine Translation - Spring 2011 3
A Word Guessing Game
Good afternoon, how are ____
Machine Translation - Spring 2011 4
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very _____
Machine Translation - Spring 2011 5
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very sorry!
Machine Translation - Spring 2011 6
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very sorry!
My favorite OS is ________
Machine Translation - Spring 2011 7
A Word Guessing Game
Good afternoon, how are you?
I apologize for being late, I am very sorry!
My favorite OS is Linux
Mac OS
Windows XP
WinCE
Machine Translation - Spring 2011 8
A Word Guessing Game
Hello, I am very happy to see you, Mr. _____
Machine Translation - Spring 2011 9
A Word Guessing Game
Hello, I am very happy to see you, Mr. White
Black
Jones
Smith
…
Machine Translation - Spring 2011 10
A Word Guessing Game
What do we learn from the word guessing game?
• For some histories the number of expected words is rather small.
• For some histories we can make virtually no prediction about the next word.
• The more words that fit at some point, the more difficult it is to select the correct one (more errors are possible).
• The difficulty of generating a word sequence is correlated with the "branching degree".
Machine Translation - Spring 2011 11
Language Model in MT
• From the translation model, we typically have many alternative translations for words and phrases
• Reordering throws phrases around pretty arbitrarily
• The LM needs to help:
  - To select words with the right sense (disambiguate homonyms, e.g. river bank versus money bank)
  - To select hypotheses where words are ‘in the right place’
  - To increase cohesion between words (agreement)
Machine Translation - Spring 2011 12
N-gram LM
Probability for a sentence: Pr(w_1^N)

Factorize (chain rule) without loss of generality:
Pr(w_1^N) = \prod_{n=1}^{N} Pr(w_n | w_1^{n-1})

But too many possible word sequences: we do not see them
• Vocabulary of 10k and sentence length of 5 words -> 10^20 different word sequences
• Of course, most sequences are extremely unlikely
Machine Translation - Spring 2011 13
N-gram LM
Probability for a sentence: Pr(w_1^N)

Factorize (chain rule) without loss of generality:
Pr(w_1^N) = \prod_{n=1}^{N} Pr(w_n | w_1^{n-1})

Limit the length of the history:

Unigram LM:  Pr(w_1^N) \approx \prod_{n=1}^{N} Pr(w_n)

Bigram LM:   Pr(w_1^N) \approx \prod_{n=1}^{N} Pr(w_n | w_{n-1})

Trigram LM:  Pr(w_1^N) \approx \prod_{n=1}^{N} Pr(w_n | w_{n-2}, w_{n-1})
Machine Translation - Spring 2011 14
Maximum Likelihood Estimates
Probabilities as relative frequencies: Here shown for 3-gram LM
Estimate from corpus – just count
Pr(w_3 | w_1, w_2) = \frac{Count(w_1, w_2, w_3)}{\sum_{w'} Count(w_1, w_2, w')}

or, in general, for a word w with history h:

Pr(w | h) = \frac{Count(h, w)}{\sum_{w'} Count(h, w')} = \frac{Count(h, w)}{Count(h)}
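To make "just count" concrete, here is a minimal Python sketch (my illustration, not part of the original slides; the names train_trigram_mle, trigram_counts and history_counts are made up for the example):

```python
from collections import defaultdict

def train_trigram_mle(sentences):
    """Estimate Pr(w3 | w1, w2) as Count(w1, w2, w3) / Count(w1, w2) by counting."""
    trigram_counts = defaultdict(int)
    history_counts = defaultdict(int)
    for sentence in sentences:
        # Pad with boundary symbols so the first real word has a full history.
        words = ["<s>", "<s>"] + sentence + ["</s>"]
        for i in range(2, len(words)):
            history = (words[i - 2], words[i - 1])
            trigram_counts[history + (words[i],)] += 1
            history_counts[history] += 1

    def prob(w1, w2, w3):
        h = (w1, w2)
        if history_counts[h] == 0:
            return 0.0  # unseen history: plain MLE gives zero, which smoothing will fix
        return trigram_counts[(w1, w2, w3)] / history_counts[h]

    return prob

# Toy corpus of already tokenized sentences
p = train_trigram_mle([["good", "afternoon", "how", "are", "you"],
                       ["how", "are", "you"]])
print(p("how", "are", "you"))  # 1.0 - "you" is the only word ever seen after "how are"
```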
Machine Translation - Spring 2011 15
Measuring the Quality of LMs
Obvious approach to finding out whether LM1 or LM2 is better: run the translation system with both; choose the one that produces better output

But:
• Performance of the MT system depends also on the translation and distortion models, and on the interaction between these models
• Performance of the MT system also depends on pruning
• Expensive and time consuming

We would like to have an independent measure: declare an LM to be good if it restricts the future more strongly (i.e. if it has a smaller average "branching factor").
Machine Translation - Spring 2011 16
Perplexity
Inverse geometric average
Interpretation: weighted number of choices per word position
PP = Pr(w_1^N)^{-1/N} = \left( \prod_{n=1}^{N} Pr(w_n | w_1^{n-1}) \right)^{-1/N}
Easy to see that a lower perplexity is better than a higher one
Machine Translation - Spring 2011 17
Perplexity 0-gram LM
Zero-gram LM: constant probability for each word
• Vocabulary size V
• Probability for each word: 1/V
• Probability for the corpus:

Pr(w_1^N) = \prod_{n=1}^{N} Pr(w_n) = \prod_{n=1}^{N} \frac{1}{V} = \frac{1}{V^N}

Perplexity:

PP = Pr(w_1^N)^{-1/N} = \left( \frac{1}{V^N} \right)^{-1/N} = V
Machine Translation - Spring 2011 18
Perplexity and Entropy
Log perplexity:

\log PP = -\frac{1}{N} \log Pr(w_1^N)
        = -\frac{1}{N} \sum_{n=1}^{N} \log p(w_n | h_n)
        = -\sum_{(h,w)} \frac{N(h,w)}{N} \log p(w | h)

Relation between perplexity and entropy: entropy = log perplexity
Entropy: minimum number of bits per symbol needed for encoding, on average

Perplexity depends on:
• Language model p(w|h)
• Corpus w_1^N
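As an illustration of the formula above, a small Python sketch (not from the slides) that computes perplexity as the exponential of the average negative log probability, for any conditional model prob(history, word); the final check reproduces PP = V for the zero-gram LM of the previous slide:

```python
import math

def perplexity(sentences, prob, order=3):
    """PP = exp(-(1/N) * sum_n log p(w_n | h_n)), summed over all word positions."""
    log_sum, num_words = 0.0, 0
    for sentence in sentences:
        words = ["<s>"] * (order - 1) + sentence + ["</s>"]
        for i in range(order - 1, len(words)):
            history = tuple(words[i - order + 1:i])
            log_sum += math.log(prob(history, words[i]))  # assumes a smoothed model, p > 0
            num_words += 1
    return math.exp(-log_sum / num_words)

# Sanity check: a zero-gram model with vocabulary size V has perplexity V
V = 1000
print(round(perplexity([["a"] * 50], lambda history, word: 1.0 / V)))  # 1000
```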
Machine Translation - Spring 2011 19
Some Perplexities
Problem: perplexities are not directly comparable across tasks – even if the perplexity of one task is lower than that of another, that task can still be the more difficult one
Task                     Vocabulary   Language Model   Perplexity
Conference Registration  400          Bigrams          7
Resource Management      1,000        Bigrams          20
Wall Street Journal      60,000       Bigrams          160
Wall Street Journal      60,000       Trigrams         130
Arabic Broadcast News    220,000      Fourgrams        212
Chinese Broadcast News   90,000       Fourgrams        430
Machine Translation - Spring 2011 20
Sparse Data and Smoothing
• Many n-grams are not seen -> zero probability
  - Sentences with such an n-gram would be considered impossible by the model
  - Need smoothing
• Essentially: unseen events (in the training corpus) still have probability > 0
• Need to deduct probability from seen events and distribute it to unseen (and rare) events
  - Robin Hood principle: take from the rich, redistribute to the poor
• Large body of work on smoothing in speech recognition
Machine Translation - Spring 2011 21
Smoothing
Two types of smoothing:
• Backing off: if word w_n was not seen with history w_1 … w_{n-1}, back off to a shorter history. Different variants:
  - Absolute discounting
  - Katz smoothing
  - (Modified) Kneser-Ney smoothing
• Linear interpolation: interpolate with the probability distribution for n-grams with shorter histories
Machine Translation - Spring 2011 22
Smoothing
Assume n-grams of the form a b c:
• a is the first word
• c is the last word
• b is zero or more words in between

Interpolation:

Pr(c | a b) = \hat{P}(c | a b) + \gamma(a b) Pr(c | b)

Backing off:

Pr(c | a b) = \hat{P}(c | a b)         if C(a b c) > 0
            = \gamma(a b) Pr(c | b)    otherwise

• Back-off weights \gamma(a b) are chosen such that probabilities are normalized
• Interpolation always incorporates the lower-order distribution; back-off uses it only when a b c was not seen
Machine Translation - Spring 2011 23
Absolute Discounting
• Subtract a fixed constant D from each nonzero count to allocate probability mass to unseen events
• Different D for each n-gram order
• For the interpolated form:
  - C(ab*): total count of n-grams starting with ab
  - N(ab*): number of distinct words following ab
P(c | a b) = \frac{\max(C(a b c) - D, 0)}{C(a b *)} + \frac{D \cdot N(a b *)}{C(a b *)} \, Pr(c | b)
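A minimal Python sketch of interpolated absolute discounting for the bigram case (my own illustration with made-up class and variable names, not the course code): the discount D is taken from every seen bigram, and the freed mass D · N(a*) / C(a*) is spread over a lower-order unigram distribution.

```python
from collections import defaultdict

class AbsoluteDiscountBigram:
    """Interpolated absolute discounting for bigrams, backing off to a unigram MLE."""

    def __init__(self, sentences, D=0.75):
        self.D = D
        self.bigram = defaultdict(int)      # C(a c)
        self.context = defaultdict(int)     # C(a *): total count of bigrams starting with a
        self.followers = defaultdict(set)   # used for N(a *): distinct words following a
        self.unigram = defaultdict(int)
        self.total = 0
        for sentence in sentences:
            words = ["<s>"] + sentence + ["</s>"]
            for a, c in zip(words, words[1:]):
                self.bigram[(a, c)] += 1
                self.context[a] += 1
                self.followers[a].add(c)
                self.unigram[c] += 1
                self.total += 1

    def prob(self, a, c):
        p_lower = self.unigram[c] / self.total        # lower-order distribution
        if self.context[a] == 0:
            return p_lower                            # completely unseen history
        discounted = max(self.bigram[(a, c)] - self.D, 0.0) / self.context[a]
        gamma = self.D * len(self.followers[a]) / self.context[a]   # D * N(a*) / C(a*)
        return discounted + gamma * p_lower

lm = AbsoluteDiscountBigram([["how", "are", "you"], ["how", "is", "it"]])
print(lm.prob("how", "are"))  # seen bigram: discounted relative frequency plus unigram share
print(lm.prob("how", "it"))   # unseen bigram: only the redistributed mass gamma * P(it)
```

For a fixed history a, the discounted terms and the redistributed mass sum to 1, so the result is a proper distribution as long as the lower-order model is.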
Machine Translation - Spring 2011 24
Kneser-Ney Smoothing
• ‘Standard’ smoothing technique, as it consistently shows good performance
• Motivation: P(Francisco | eggplant) vs P(stew | eggplant)
  - "Francisco" is common, so backoff and interpolated methods say it is likely
  - But it mostly occurs in the context of "San"
  - "Stew" is common, and in many contexts
• Weight the backoff distribution by the number of contexts a word occurs in (count-of-counts)
Pr(z | x y) = \frac{\max(C(x y z) - D, 0)}{C(x y)} + \gamma(x y) \, \frac{|\{w : C(w y z) > 0\}|}{\sum_{v} |\{w : C(w y v) > 0\}|}
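The distinctive part of Kneser-Ney is the lower-order term: a word is weighted by how many distinct contexts it continues, not by its raw frequency. A small illustrative sketch (unigram continuation case only; names and data are made up):

```python
from collections import defaultdict

def continuation_prob(bigrams):
    """P_cont(z) = |{w : C(w z) > 0}| / |{(w, v) : C(w v) > 0}| (unigram continuation)."""
    left_contexts = defaultdict(set)
    bigram_types = set()
    for w, z in bigrams:
        left_contexts[z].add(w)
        bigram_types.add((w, z))
    return lambda z: len(left_contexts[z]) / len(bigram_types)

p_cont = continuation_prob([("San", "Francisco"), ("San", "Francisco"),
                            ("beef", "stew"), ("fish", "stew"), ("vegetable", "stew")])
print(p_cont("Francisco"))  # 0.25 - frequent, but seen after only one distinct word
print(p_cont("stew"))       # 0.75 - seen after three distinct words
```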
Machine Translation - Spring 2011 25
Problems With N-gram LM
• N-gram LMs have short memory
  - OK for speech recognition
  - Less sufficient for MT, where reordering is needed
• Typically trained on ‘good’ sentences, applied to not-so-good sentences
  - Some unseen events are good – should be allowed by the LM
  - Some (many) unseen events are not good – should be disfavored
  - The LM cannot distinguish – it often assigns higher probability to worse sentences
• This is true also for other LMs, e.g. syntax-based LMs
Machine Translation - Spring 2011 26
Problem With N-gram LMs
LM does not assign higher probability to reference translations
Gloss: Pakistan president Musharraf win senate representative two houses trust vote

     Sentence                                                                                   LM score
R1   pakistani president musharraf wins vote of confidence in both houses                       -7.18
R2   pakistani president musharraf won the trust vote in senate and lower house                 -7.27
R3   pakistani president musharraf wins votes of confidence in senate and house                 -6.78
R4   the pakistani president musharraf won the vote of confidence in senate and lower house     -6.18
H1   pakistani president musharraf has won both vote of confidence in                           -6.12
H2   pakistani president musharraf win both the house and the senate confidence vote in         -5.97
H3   pakistani president musharraf win both the house and the senate vote of confidence in      -5.24
H4   pakistani president musharraf win both the senate confidence vote in                       -7.25
H5   pakistani president musharraf win both the senate vote of confidence in                    -6.22
Machine Translation - Spring 2011 27
Dealing with Large LMs
Three strategies:
• Distribute over many machines
• Filter
• Use lossy compression
Machine Translation - Spring 2011 28
Distributed LM
• Want to use large LMs
  - From large corpora
  - Long histories: 5-gram, 6-gram, …
• Too much data to fit into the memory of one workstation
• Use a client-server architecture
  - Each server owns part of the LM data
  - Use a suffix array to find n-grams and get n-gram counts
  - Efficiency has been greatly improved using batch-mode communication between client and server
• Client collects n-gram occurrence counts and calculates probabilities
• Simple smoothing: linear interpolation with uniform weights
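A sketch of how a client could turn the returned counts into a probability under "linear interpolation with uniform weights" (this is my reading of the scheme, not the actual system's code; get_count stands in for the batched server lookups):

```python
def interpolated_prob(history, word, get_count):
    """Uniform-weight linear interpolation of relative frequencies of all n-gram orders.

    get_count(ngram_tuple) returns the corpus count of that n-gram; in the distributed
    setup all of these counts would come back from the servers in a single batch.
    The empty tuple () stands for the total corpus size.
    """
    probs = []
    for k in range(len(history) + 1):                 # use the last k history words
        h = tuple(history[len(history) - k:])
        denominator = get_count(h) if k > 0 else get_count(())
        numerator = get_count(h + (word,))
        probs.append(numerator / denominator if denominator > 0 else 0.0)
    return sum(probs) / len(probs)

# Toy count table standing in for the suffix-array servers
counts = {(): 100, ("you",): 6, ("are",): 10, ("are", "you"): 3,
          ("how", "are"): 4, ("how", "are", "you"): 2}
print(interpolated_prob(("how", "are"), "you", lambda ng: counts.get(ng, 0)))
```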
Machine Translation - Spring 2011 29
Client-Server Architecture for LM
• A suffix array for 50M words needs ~450MB RAM
• We use 2.9 billion words from the Gigaword corpus
[Figure: client-server architecture – the client sends a batch of n-grams extracted from the N-best list hypotheses (Hyp 1 … Hyp N) to monolingual corpus information servers (NYT1999, NYT2000, …, XINHUA2004) and adds up the occurrence information returned from the servers]
Courtesy Joy Zhang
Machine Translation - Spring 2011 30
Impact on Decoder
• Communication overhead is substantial
  - Cannot query each n-gram probability individually
• Restructure the decoder
  - Generate many hypotheses, using all information but the LM
  - Send all queries (LMstate_i, WordSequence_i) to the LM server
  - Get back all probabilities and new LM states (p_i, NewLMstate_i)
• Will have some impact on pruning
  - Early termination when expanding hypotheses cannot use the LM score
Machine Translation - Spring 2011 31
Filtering LM
• Assume we have been able to build a large LM
• Most entries will not be used to translate the current sentence
• Extract those entries which may be used:
  - Filter the phrase table for the current sentence
  - Establish a word list from the source sentence and the filtered phrase table
  - Extract all n-grams from the LM file which are completely covered by the word list
• Filtering does not change the decoding result
  - All needed n-grams are still available
  - All probabilities remain unchanged
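A minimal sketch of the filtering step (illustrative only; a real implementation operates on the actual LM and phrase-table file formats): keep an LM entry only if every word in it is covered by the sentence-specific word list.

```python
def filter_lm(lm_entries, source_words, phrase_table):
    """Keep only LM entries whose n-grams are fully covered by the sentence word list.

    lm_entries: iterable of (ngram_tuple, log_prob) pairs
    phrase_table: source phrase -> list of target phrases, already filtered
                  to the current source sentence
    """
    word_list = set(source_words)                    # source words (for pass-through)
    for target_phrases in phrase_table.values():
        for phrase in target_phrases:
            word_list.update(phrase.split())         # all reachable target words
    return [(ngram, lp) for ngram, lp in lm_entries
            if all(w in word_list for w in ngram)]

filtered = filter_lm(
    lm_entries=[(("how", "are"), -1.2), (("are", "you"), -0.9), (("are", "they"), -1.5)],
    source_words=["wie", "geht", "es"],
    phrase_table={"wie geht es": ["how are you", "how is it"]},
)
print(filtered)  # ("are", "they") is dropped: "they" is not covered by the word list
```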
Machine Translation - Spring 2011 32
Filtering LM
• Filtering is an expensive operation
  - Needs to be re-done whenever the phrase table changes
  - MapReduce can help
• Typical filtered LM sizes (using a pruned phrase table: top 60)
  - Min: 100KB, Avg: 75MB, Max: 500MB
  - Divide by 40 to get the approximate number of n-grams
• For long sentences and weakly pruned phrase tables the filtered LM can still be large (>2GB)
• But works well for n-best list rescoring and system combination
Machine Translation - Spring 2011 33
Bloom-Filter LMs
• Proposed by David Talbot and Miles Osborne (2007)
• Bloom filters have long been used for storing data in a compact way
• General idea: use multiple hash functions to generate a ‘foot print’ for an object
• Following slides courtesy of Matthias Eck …
Machine Translation - Spring 2011 34
General Bloomfilter
• Bit array and k hash functions f1 .. fk
• To enter object o in the hash: set a 1 at positions f1(o) ... fk(o)
• To check if object o is in the hash: check whether there is a 1 at f1(o) ... fk(o)

[Figure: a bit array, initially all zeros; after inserting objects o1 and o2, the bits at their hashed positions are set to 1, and a membership query checks exactly those positions]
Machine Translation - Spring 2011 35
Lossy Hash
Properties:
• False negatives are impossible: inserted objects will always be found again
• False positives are possible: objects that were not inserted can be "found"
• Objects cannot be deleted

Optimal size:
• For k hash functions and n elements to insert, the hash size is optimal at M = n · k / ln 2
• At optimal size, error rate: (0.5)^k (smaller than 1% for k = 7)
Machine Translation - Spring 2011 36
Usage as a LM
• Bloom filter is only boolean (contains / does not contain)
• To realize an LM, put n-gram frequencies in the Bloom filter:
  - n-gram "a b c" with frequency f: insert "a b c_1" ... "a b c_f"
  - Or map frequencies first, e.g.: 1→1, 2-3→2, 4-8→3, … (sketch below)
• Lower the error rate by checking whether sub-n-grams are present in the Bloom filter:
  - "a b c" with frequency 10 means frequency of "a b" ≥ 10, "b c" ≥ 10, etc.
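A compact sketch of the idea (my own illustration; Talbot and Osborne's implementation differs in details such as hashing and frequency quantisation): n-gram/frequency pairs are inserted into a bit-array Bloom filter, and the stored frequency is recovered by probing counts 1, 2, 3, … until a probe fails.

```python
import hashlib

class BloomLM:
    def __init__(self, num_bits=1 << 20, num_hashes=4):
        self.bits = bytearray(num_bits // 8)
        self.num_bits = num_bits
        self.num_hashes = num_hashes

    def _positions(self, key):
        # Derive k bit positions from one strong hash of the key (an illustrative choice).
        digest = hashlib.sha256(key.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def _set(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def _test(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    def insert_ngram(self, ngram, freq):
        # Insert "ngram_1" ... "ngram_f" (frequencies could also be quantised first).
        for f in range(1, freq + 1):
            self._set(" ".join(ngram) + "_" + str(f))

    def freq(self, ngram):
        # Probe increasing counts; stop at the first miss. False positives can only
        # overestimate a count, never report a stored n-gram as missing.
        f = 0
        while self._test(" ".join(ngram) + "_" + str(f + 1)):
            f += 1
        return f

lm = BloomLM()
lm.insert_ngram(("how", "are", "you"), 3)
print(lm.freq(("how", "are", "you")))   # 3 (barring false positives)
print(lm.freq(("how", "are", "they")))  # usually 0
```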
Machine Translation - Spring 2011 37
Experiments
As an LM during decoding:
• All NYT data (Gigaword Corpus): 57.9M lines, 1220M words; overall 704M n-grams (up to 4-grams) → subsampled to 105M
• 105M n-grams × frequency 10 (average) → 1000M entries
• 1000M entries × 4 hash functions / ln 2 → 5700 Mbit; choose: Bloom filter LM with 4 hash functions, 1 GByte memory
• Frequency mapping: inserted frequency = 1 + ln(frequency) · 5
• Baseline LM (250M words, Xinhua News), baseline MT03: 28.2 BLEU
• + Bloom filter LM: 28.9 BLEU
Machine Translation - Spring 2011 39
Other LMs
• Cache LM: boost n-grams seen recently, e.g. in the current document
• Trigger LM: boost the current word based on specific words in the past
• Class-based LMs: generalize, e.g. for numbers, named entities
  - Cluster the entire vocabulary
  - Longer matching history if the number of classes is small
• Continuous LMs (Holger Schwenk et al. 2006)
  - Based on neural nets
  - Used only to predict the most frequent words
Machine Translation - Spring 2011 40
Summary
• LM used for translation: typically n-gram
  - More data, longer histories possible
  - Large systems use 5-grams
• Smoothing is important
  - Different smoothing techniques; Kneser-Ney is a good default
• Many LM toolkits are available
  - SRI LM most widely used
  - KenLM (Kenneth Heafield) provides a more memory-efficient version (integrated into Moses)
• Using very large LMs requires engineering
  - MapReduce or Hadoop
  - Bloom filter
• Many extensions possible
  - Class-based, trigger, cache, …
  - LSA and topic LMs
  - Syntactic LMs