
Page 1: Syntax-based and Factored Language Models

1

Syntax-based and Factored Language Models

Rashmi Gangadharaiah

April 16th, 2008

Page 2: Syntax-based and Factored Language Models

2

Noisy Channel Model
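For reference, the decision rule behind the noisy-channel model for MT, with e the English output, f the foreign-language input, P(e) the language model and P(f|e) the translation model, is

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)

The rest of the talk is about building better estimates of the language model P(e).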

Page 3: Syntax-based and Factored Language Models

3

Why is MT output still bad?

• Strong translation models, but weak language models
• Can other knowledge sources (parse trees, taggers, etc.) be used in model building? How much improvement would they bring?
• Models can be computationally expensive
  – n-gram models are the least expensive models
  – Other models have to be efficiently coded

Page 4: Syntax-based and Factored Language Models

4

Conventional Language models

• n-gram word-based language model: p(w_i | h) = p(w_i | w_{i-1}, …, w_1)
• Retain only the n-1 most recent words of history to avoid storing a large number of parameters:
  p(w_i | h) ≈ p(w_i | w_{i-1}, …, w_{i-n+1}); for n = 3, P(S) = p(w_1) p(w_2 | w_1) … p(w_i | w_{i-1}, w_{i-2})
• Estimated using MLE (a toy sketch follows below)
• Inaccurate probability estimates for higher-order n-grams
• Smoothing/discounting to overcome sparseness
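As a concrete illustration of the estimates above, here is a minimal trigram model in Python with MLE counts and a crude stupid-backoff-style fallback (a sketch only; the models discussed in these slides use proper discounting such as Kneser-Ney):

    from collections import defaultdict

    class TrigramLM:
        """Toy trigram LM: MLE counts with a crude backoff fallback."""

        def __init__(self):
            self.c3 = defaultdict(int)    # (w_{i-2}, w_{i-1}, w_i) counts
            self.ctx3 = defaultdict(int)  # (w_{i-2}, w_{i-1}) context counts
            self.c2 = defaultdict(int)    # (w_{i-1}, w_i) counts
            self.ctx2 = defaultdict(int)  # w_{i-1} context counts
            self.c1 = defaultdict(int)    # unigram counts
            self.n = 0                    # total tokens

        def train(self, sentences):
            for sent in sentences:
                toks = ["<s>", "<s>"] + sent + ["</s>"]
                for i in range(2, len(toks)):
                    w2, w1, w = toks[i - 2], toks[i - 1], toks[i]
                    self.c3[(w2, w1, w)] += 1
                    self.ctx3[(w2, w1)] += 1
                    self.c2[(w1, w)] += 1
                    self.ctx2[w1] += 1
                    self.c1[w] += 1
                    self.n += 1

        def prob(self, w, w1, w2, alpha=0.4):
            """p(w | w2 w1): MLE where the trigram was seen, else back off."""
            if self.c3[(w2, w1, w)] > 0:
                return self.c3[(w2, w1, w)] / self.ctx3[(w2, w1)]
            if self.c2[(w1, w)] > 0:
                return alpha * self.c2[(w1, w)] / self.ctx2[w1]
            return alpha * alpha * self.c1[w] / max(self.n, 1)

    lm = TrigramLM()
    lm.train([["the", "contract", "ended", "with", "a", "loss"]])
    print(lm.prob("ended", "contract", "the"))  # 1.0 on this one-sentence corpus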

Page 5: Syntax-based and Factored Language Models

5

Problems still present in the n-gram model

• Do not make efficient use of the training corpus
  – Blindly discard relevant words that lie n or more positions in the past
  – Retain words of little or no value
• Do not generalize well to unseen word sequences → main motivation for using class-based LMs and factored LMs
• Lexical dependencies are structurally related rather than sequentially related → main motivation for using syntactic/structural LMs

Page 6: Syntax-based and Factored Language Models

6

Earlier work on incorporating low level syntactic information(1)

Group words into classes
• P. F. Brown et al.:
  – Start with each word in a separate class, then iteratively merge classes
• Heeman’s (1998) POS LM:
  – Achieved a perplexity reduction compared to a trigram LM by redefining the speech recognition problem to jointly find the best word and POS tag sequence

P. F. Brown et al. 1992. Class-Based n-Gram Models of Natural Language. Computational Linguistics, 18(4):467-479.
P. A. Heeman. 1998. POS Tagging versus Classes in Language Modeling. In Proceedings of the 6th Workshop on Very Large Corpora, Montreal.

Page 7: Syntax-based and Factored Language Models

7

Earlier work on incorporating low level syntactic information(2)

• Use predictive clustering and conditional clustering

• Predictive: P(Tuesday | party on) = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY) (see the toy example below)

• Conditional: P(Tuesday | party EVENT on PREPOSITION)
  Backoff order:
  P(w_i | w_{i-2} W_{i-2} w_{i-1} W_{i-1})
  → P(w_i | W_{i-2} w_{i-1} W_{i-1})  (= P(Tuesday | EVENT on PREPOSITION))
  → P(w_i | w_{i-1} W_{i-1})  (= P(Tuesday | on PREPOSITION))
  → P(w_i | W_{i-1})  (= P(Tuesday | PREPOSITION))
  → P(w_i)  (= P(Tuesday))
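As a toy illustration of the predictive-clustering factorization above (made-up probabilities and a hypothetical helper, not Goodman's numbers or code):

    # Predict the class of the next word first, then the word given its class.
    word2class = {"Tuesday": "WEEKDAY", "Friday": "WEEKDAY"}

    def predictive_prob(word, history, p_class, p_word_given_class):
        """P(word | history) = P(class(word) | history) * P(word | history, class(word))."""
        c = word2class[word]
        return p_class(c, history) * p_word_given_class(word, history, c)

    # Made-up component probabilities, purely for illustration:
    p_class = lambda c, h: 0.2 if (c, h) == ("WEEKDAY", ("party", "on")) else 0.01
    p_word_given_class = lambda w, h, c: 0.3 if w == "Tuesday" else 0.1
    print(predictive_prob("Tuesday", ("party", "on"), p_class, p_word_given_class))  # ~0.06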

J. Goodman. 2000. Putting it all together: Language model combination. In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, volume 3, pages 1647-1650, Istanbul

Page 8: Syntax-based and Factored Language Models

8

More Complex Language models that we will look at today…

• LMs that incorporate syntax
  – Charniak et al. 2003: syntax-based LM (in MT)
• LMs that incorporate both syntax and semantics
  – Model headword dependencies only
    • N-best rescoring strategy
      – Chelba et al. 1997: almost parsing (in ASR)
    • Full parsing for decoding word lattices
      – Chelba et al. 1998: full parsing in a left-to-right fashion with a dependency LM (in ASR)
  – Model both headword and non-headword dependencies
    • N-best rescoring strategy
      – Wang et al. 2007: SuperARV LMs (in MT)
      – Kirchhoff et al. 2005: factored LMs (in MT)
    • Full parsing
      – Rens Bod 2001: Data Oriented Parsing (in ASR)
      – Wang et al. 2003 (in ASR)

Page 9: Syntax-based and Factored Language Models

9

link grammar to model long distance dependencies (1)

• Maximum Entropy language model that incorporates both syntax and semantics via dependency grammar

• Motivation:
  – Dependencies are structurally related rather than sequentially related
  – Incorporates the predictive power of words that lie outside bigram or trigram range
• Elements of the model: a disjunct (rule) specifies how a word must be connected to other words in a legal parse

Ciprian Chelba, David Engle, Frederick Jelinek, Victor Jimenez, Sanjeev Khudanpur, Lidia Mangu, Harry Printz, Eric Ristad, Ronald Rosenfeld, Andreas Stolcke, Dekai Wu. 1997. “Structure and Performance of a Dependency Language Model”. In Eurospeech.

Page 10: Syntax-based and Factored Language Models

10

• Mapping of histories
  – The mapping retains:
    • a finite context of 0, 1, or 2 preceding words
    • a link stack consisting of the open links at the current position and the identities of the words from which they emerge

link grammar to model long distance dependencies (2)

Page 11: Syntax-based and Factored Language Models

11

• Maximum entropy formulation
  – Treats each of the numerous elements of [h] as a distinct predictor variable (a generic form is sketched below)
• Link grammar feature functions
  – “[h] matches d”: d is a legal disjunct to occupy the next position in the parse
  – “yLz”: at least one of the links must bear label L and connect to word y

link grammar to model long distance dependencies (3)
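A conditional maximum entropy model of this kind has the generic form below (a sketch; the paper's exact parameterization may differ), where the f_i are binary feature functions such as “[h] matches d” and “yLz”, the λ_i are their weights, and Z([h]) normalizes over the possible next words and disjuncts:

    P(w, d \mid [h]) = \frac{1}{Z([h])} \exp\Big( \sum_i \lambda_i f_i([h], w, d) \Big)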

Page 12: Syntax-based and Factored Language Models

12

• Tagging and parsing
  – Dependency parser of Michael Collins (requires POS tags)
• P(S,K) = P(S|K) P(K|S)
  – The parser did not operate in a left-to-right direction, hence N-best lists were used
  – Training and testing data drawn from the Switchboard corpus and from the Treebank corpus
    • Trained a tagger on 1 million words (Ratnaparkhi), applied it to 226,000 words of the hand-parsed training set, and finally applied this to 1.44 million words; tested on 11 time-marked telephone transcripts
• Dependency model
  – Used the Maximum Entropy modeling toolkit
    • Generated the 100 best hypotheses for each utterance
  – P(S) = …
  – Achieved a reduction in WER from 47.4% (adjacency bigram) to 46.4%

link grammar to model long distance dependencies (4)

Page 13: Syntax-based and Factored Language Models

13

Syntactic structure to model long distance dependencies (1)

• Language model develops syntactic structure and uses it to extract meaningful information from the word history

• Motivation:
  – A 3-gram approach would predict “after” from “7 cents”, but the strongest predictor should be “ended”
  – Syntactic structure in the past filters out irrelevant words

Ciprian Chelba and Frederick Jelinek. 1998. “Exploiting Syntactic Structure for Language Modeling”. ACL.

[Figure: partial parse of the example sentence; the exposed headword when predicting “after” is the headword of the constituent ended(with(…)), i.e. “ended”, rather than the immediately preceding words.]

Page 14: Syntax-based and Factored Language Models

14

Syntactic structure to model long distance dependencies (2)

• Terminology
  – W_k: word k-prefix w_0 … w_k of the sentence
  – W_k T_k: the word-parse k-prefix
    • A word-parse k-prefix contains, for a given parse, only those binary subtrees whose span is completely included in the word k-prefix, excluding w_0 = <s>
    • Single words along with their POS tags can be regarded as root-only trees

Page 15: Syntax-based and Factored Language Models

15

Syntactic structure to model long distance dependencies (3)

• The model operates by means of three modules (control flow sketched below):
  – WORD-PREDICTOR
    • Predicts the next word w_{k+1} given the word-parse k-prefix and passes control to the TAGGER
  – TAGGER
    • Predicts the POS tag t_{k+1} of the next word given the word-parse k-prefix and w_{k+1}, and passes control to the PARSER
  – PARSER
    • Grows the already existing binary branching structure by repeatedly generating the transitions (unary, NTlabel), (adjoin-left, NTlabel), or (adjoin-right, NTlabel) until it passes control to the PREDICTOR by taking a null transition
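A schematic of the control flow between the three modules, with hypothetical callables predict_word, predict_tag and parser_transition standing in for the WORD-PREDICTOR, TAGGER and PARSER (a sketch, not the authors' code):

    def extend_word_parse_prefix(prefix, predict_word, predict_tag, parser_transition):
        """Extend a word-parse k-prefix by one word, following the control flow above."""
        w = predict_word(prefix)                 # WORD-PREDICTOR proposes w_{k+1}
        t = predict_tag(prefix, w)               # TAGGER assigns t_{k+1}
        prefix = prefix + [("word", w, t)]       # append the tagged word
        while True:
            move = parser_transition(prefix)     # (unary, NT), (adjoin-left, NT),
            if move is None:                     # (adjoin-right, NT), or None for the
                break                            # null transition
            prefix = prefix + [("parse", move)]  # a real implementation would grow the
                                                 # binary branching structure here
        return prefix                            # control returns to the WORD-PREDICTOR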

Page 16: Syntax-based and Factored Language Models

16

Syntactic structure to model long distance dependencies (4)

• Probabilistic Model:

• Word Level Perplexity
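Assuming the standard formulation of Chelba and Jelinek (1998), the probability of a word sequence W with parse structure T factors over the three modules as

    P(W, T) = \prod_{k=1}^{n+1} \Big[ P(w_k \mid W_{k-1} T_{k-1}) \, P(t_k \mid W_{k-1} T_{k-1}, w_k) \prod_{i=1}^{N_k} P(p_i^k \mid W_{k-1} T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \Big]

where the p_i^k are the parser transitions taken after word k. Word-level perplexity is then computed in the usual way from P(W) = \sum_T P(W, T), i.e. PPL = \exp\big(-\frac{1}{N} \ln P(W)\big).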

Page 17: Syntax-based and Factored Language Models

17

Syntactic structure to model long distance dependencies (5)

• Search strategy
  – Synchronous multi-stack search algorithm
  – Each stack contains partial parses constructed by the same number of predictor and parser operations
  – Hypotheses ranked according to their ln P(W,T) score
  – Search width controlled by a maximum stack depth and a log-probability threshold
• Parameter estimation
  – Solution inspired by an HMM re-estimation technique that works on a pruned N-best trellis (Byrne)
  – Binarized the UPenn Treebank parse trees and percolated the headwords using a rule-based approach

W. Byrne, A. Gunawardhana, and S. Khudanpur, 1998. “Information geometry and EM variants”. Technical Report CLSP Research Note 17.

Page 18: Syntax-based and Factored Language Models

18

Syntactic structure to model long distance dependencies (6)

• Setup: UPenn Treebank corpus
  – Stack depth = 10, log-probability threshold = 6.91 nats
  – Training data: 1M words; word vocabulary: 10k; POS tag vocabulary: 40; non-terminal tag vocabulary: 52
  – Test data: 82,430 words
• Results
  – Reduced test-set perplexity from 167.14 (trigram model) to 158.28
  – Interpolating the model with a trigram model resulted in 148.90 (interpolation weight = 0.36)

Page 19: Syntax-based and Factored Language Models

19

Non-headword dependencies matter : DOP-based LM(1)

• The DOP (Data Oriented Parsing) model learns a stochastic tree-substitution grammar (STSG) from a treebank by
  – extracting all subtrees from the treebank
  – assigning probabilities to the subtrees
  – DOP takes into account both headword and non-headword dependencies
  – Subtrees are lexicalized at their frontiers with one or more words
• Motivation
  – A head-lexicalized grammar is limited
    • It cannot capture dependencies between non-headwords
    • E.g. “more people than cargo”, “more couples exchanging rings in 1988 than in the previous year” (from WSJ)
      – Neither “more” nor “than” is the headword of these phrases
    • The dependency between “more” and “than” is captured by a subtree where “more” and “than” are the only frontier words

Rens Bod. 2000. “Combining Semantic and Syntactic Structure for Language Modeling”.

Page 20: Syntax-based and Factored Language Models

20

Non-headword dependencies matter: DOP-based LM(2)

• DOP learns an STSG from a treebank by taking all subtrees in that treebank
  – E.g., consider a treebank:

Page 21: Syntax-based and Factored Language Models

21

Non-headword dependencies matter: DOP-based LM(3)

• New sentences may be derived by combining subtrees from the treebank
  – Node substitution is left-associative
  – Other derivations may yield the same parse tree

Page 22: Syntax-based and Factored Language Models

22

Non-headword dependencies matter : DOP-based LM(4)

• Model computes the probability of a subtree as:

r(t): root label of t

• Probability of a derivation
• Probability of a parse tree
• Probability of a word string W
• Note:
  – Does not maximize the likelihood of the corpus
  – Implicit assumption that all derivations of a parse tree contribute equally to the total probability of the parse tree
  – There is a hidden component → DOP can be trained using EM
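The formulas referred to on this slide, under the standard DOP1 definitions (with |t| the number of occurrences of subtree t in the treebank and r(t) its root label), are

    P(t) = \frac{|t|}{\sum_{t': r(t') = r(t)} |t'|}

    P(d = t_1 \circ \ldots \circ t_n) = \prod_{i=1}^{n} P(t_i), \qquad P(T) = \sum_{d \,\text{derives}\, T} P(d), \qquad P(W) = \sum_{T \,\text{yields}\, W} P(T)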

Page 23: Syntax-based and Factored Language Models

23

Non-headword dependencies matter : DOP-based LM(5)

• Combining semantic and syntactic structure

Page 24: Syntax-based and Factored Language Models

24

Non-headword dependencies matter : DOP-based LM(6)

• Computation of the most probable string
  – NP-hard: employed a Viterbi n-best search
  – Estimate the most probable string from the 1000 most probable derivations
• OVIS corpus
  – 10,000 user utterances about Dutch public transport information, syntactically and semantically annotated
  – DOP model obtained by extracting all subtrees of depth up to 4

Page 25: Syntax-based and Factored Language Models

25

More and More features (A Hybrid) :SuperARV LMs (1)

• SuperARV LM is a highly lexicalized probabilistic LM based on the Constraint Dependency Grammar (CDG)

• CDG represents a parse as assignments of dependency relations to functional variables (roles) associated with each word in a sentence

• Motivation
  – High levels of word prediction capability can be achieved by tightly integrating knowledge of words, structural constraints, and morphological and lexical features at the word level

Wen Wang, Mary P. Harper, 2002 “The SuperARV Language Model: Investigating the Effectiveness of Tightly Integrating Multiple Knowledge Sources”, ACL

Page 26: Syntax-based and Factored Language Models

26

• CDG parse
  – Each word in the parse has a lexical category and a set of feature values
  – Each word has a governor role (G)
    • Comprised of a label (which indicates the position of the word’s head/governor) and a modifiee
  – Need roles are used to ensure that the grammatical requirements of a word are met
    • A mechanism for using non-headword dependencies

More and More features (A Hybrid) :SuperARV LMs (2)

Page 27: Syntax-based and Factored Language Models

27

• ARVs and ARVPs
  – Using the relationship between a role value’s position and its modifiee’s position, unary and binary constraints can be represented as a finite set of ARVs and ARVPs

More and More features (A Hybrid) :SuperARV LMs (3)

Page 28: Syntax-based and Factored Language Models

28

• SuperARVs
  – A four-tuple for a word: <C, F, (R, L, UC, MC)+, DC>
  – An abstraction of the joint assignment of dependencies for a word
  – A mechanism for lexicalizing CDG parse rules
  – Encode lexical information and syntactic and semantic constraints; much more fine-grained than POS tags

More and More features (A Hybrid) :SuperARV LMs (4)

Page 29: Syntax-based and Factored Language Models

29

• The SuperARV LM estimates the joint probability of the words w_1^N and their SuperARV tags t_1^N (a generic decomposition is sketched below)
• The SuperARV LM does not encode the word identity at the data-structure level, since this can cause serious data sparsity problems
• Probability distributions are estimated with recursive linear interpolation
• WER on the WSJ CSR 20k test sets:
  – 3-gram = 14.74, SuperARV = 14.28, Chelba = 14.36

More and More features (A Hybrid) :SuperARV LMs (5)
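The joint probability is not spelled out on the slide; a chain-rule decomposition of the kind typically used for joint word/tag language models (the paper's exact conditioning may differ) is

    P(w_1^N, t_1^N) = \prod_{i=1}^{N} P(w_i, t_i \mid w_1^{i-1}, t_1^{i-1}) \approx \prod_{i=1}^{N} P(t_i \mid h_i)\, P(w_i \mid t_i, h_i)

with h_i a truncated history of recent words and SuperARV tags, and the component distributions estimated by the recursive linear interpolation mentioned above.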

Page 30: Syntax-based and Factored Language Models

30

• SCDG parser
  – Probabilistic generative model
  – For a sentence S, the parser returns the parse T that maximizes its probability
• First step:
  – N-best SuperARV assignments are generated
  – Each SuperARV sequence is represented as (w_1, s_1), …, (w_n, s_n)
• Second step: the modifiees are statistically specified in a left-to-right manner
  – Determine the left dependents of w_k from the closest to the farthest
  – Also determine whether w_k could be the (d+1)-th right dependent of a previously seen word w_p, p = 1, …, k-1
    » d denotes the number of already assigned right dependents of w_p

Wen Wang and Mary P. Harper, 2004, A Statistical Constraint Dependency Grammar (CDG) Parser, ACL

More and More features (A Hybrid) :SuperARV LMs (6)

Page 31: Syntax-based and Factored Language Models

31

• SCDG parser (contd.)
• Second step (contd.)
  – After processing word w_k in each partial parse on the stack, the partial parses are re-ranked according to their updated probabilities
  – The parsing algorithm is implemented as a simple best-first search
  – Two pruning thresholds: maximum stack depth, and maximum difference between the log probabilities of the top and bottom partial parses in the stack
• WER
  – LM training data for this task is composed of the 1987-1989 files containing 37,243,300 words
  – All LMs are evaluated on the 1993 20k open-vocabulary DARPA WSJ CSR evaluation set (denoted 93-20K), which consists of 213 utterances and 3,446 words
  – 3-gram = 13.72, Chelba = 13.0, SCDG LM = 12.18

More and More features (A Hybrid) :SuperARV LMs (7)

Page 32: Syntax-based and Factored Language Models

32

• Employ LMs for N-best re-ranking in MT
• Two-pass decoding
  – First pass: generate N-best lists
    • Uses a hierarchical phrase decoder with a standard 4-gram LM
  – Second pass:
    • Rescore the N-best lists using several LMs trained on different corpora and estimated in different ways
    • Scores are combined in a log-linear modeling framework
      – Along with other features used in SMT
        » Rule probabilities P(f|e), P(e|f); lexical weights p_w(f|e), p_w(e|f); sentence length and rule counts
        » Weights optimized (on GALE dev07) using minimum error rate training to maximize BLEU
        » Blind test set: NIST MT eval06 GALE portion (eval06)

Wen Wang, Andreas Stolcke and Jing Zheng. Dec 2007. “Reranking Machine Translation Hypotheses with Structured and Web-based Language Models”. ASRU, IEEE Workshop.

More and More features (A Hybrid) :SuperARV LMs (8)

Page 33: Syntax-based and Factored Language Models

33

• Structured LMs
  – Almost-parsing LM
  – Parsing LM
• Using a baseNP model
  – Given W, generates the reduced sentence W’ by marking all baseNPs and then reducing all baseNPs to their headwords
• A further simplification of the parser LM

More and More features (A Hybrid) :SuperARV LMs (9)

Page 34: Syntax-based and Factored Language Models

34

• LMs for searching:
  – 4-gram (4g) LM
    • English side of Arabic-English and Chinese-English data from LDC
    • All of the English BN and BC data, webtexts and translations for Mandarin and Arabic BC/BN released under DARPA EARS and GALE
    • LDC2005T12, LDC95T21, LDC98T30
    • Webdata collected by SRI and BBN
• LMs for reranking (N-best list size = 3000)
  – 5-gram count LM from Google web data (google) (1 teraword)
  – 5-gram count LM from Yahoo web data (yahoo) (3.4G words)
  – First two sources used for training the almost-parsing LM (sarv)
  – The second source used for training the parser LM (plm)
  – 5-gram count LM on all BBN webdata (wlm)

More and More features (A Hybrid) :SuperARV LMs (10)

Page 35: Syntax-based and Factored Language Models

35

More and More features (A Hybrid) :SuperARV LMs (11)

Page 36: Syntax-based and Factored Language Models

36

More and More features (A Hybrid) :SuperARV LMs (12)

Page 37: Syntax-based and Factored Language Models

37

Syntax-based LMs (1)

• Performs translation by assuming that the target language specifies not just words but a complete parse
• Yamada 2002: incomplete use of syntactic information
  – The decoder was optimized, but the language model was not
• Goal: develop a system in which the translation model of [Yamada 2001] is “married” to the syntax-based language model of [Charniak 2001]

Kenji Yamada and Kevin Knight. 2001. “A Syntax-based Statistical Translation Model”.
Kenji Yamada and Kevin Knight. 2002. “A Decoder for Syntax-based Statistical MT”.
Eugene Charniak. 2001. “Immediate-Head Parsing for Language Models”.
Eugene Charniak, Kevin Knight and Kenji Yamada. “Syntax-based Language Models for Statistical Machine Translation”.

Page 38: Syntax-based and Factored Language Models

38

Syntax-based LMs (2)

• The translation model has 3 operations:
  – Reorder child nodes, insert an optional word, translate the leaf words
  – θ varies over the possible alignments between F and E
• Decoding algorithm similar to a regular parser
  – Build an English parse tree from the Chinese sentence
    • Extract CFG rules from a parsed corpus of English
    • Supplement each non-lexical English CFG rule (VP → VB NP PP) with all possible reordered rules (VP → NP PP VB, VP → PP NP VB, etc.) (see the sketch below)
    • Add extra rules “VP → VP X” and “X → word” for insertion operations
    • Also add “englishword → chineseword” rules for translation
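A tiny illustration of the rule-supplementation step above (a Python sketch, not the authors' implementation):

    from itertools import permutations

    def reorder_rules(lhs, rhs):
        """Supplement a non-lexical CFG rule with every reordering of its right-hand side."""
        return [(lhs, list(p)) for p in permutations(rhs)]

    for lhs, rhs in reorder_rules("VP", ["VB", "NP", "PP"]):
        print(lhs, "->", " ".join(rhs))
    # Prints all six orderings, including VP -> NP PP VB and VP -> PP NP VB.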

Page 39: Syntax-based and Factored Language Models

39

Syntax-based LMs (3)

• Now we can parse a Chinese sentence and extract an English parse tree by
  – removing the leaf Chinese words
  – recovering the reordered child nodes into English order
• Pick the best tree: the one for which the product of the LM probability and the TM probability is highest

Page 40: Syntax-based and Factored Language Models

40

Syntax-based LMs (4)

• Decoding process
  – First build a forest using the bottom-up decoding parser, using only P(F|E)
  – Pick the best tree from the forest with the LM
• Parser/language model (Charniak 2001)
  – Takes an English sentence and uses two parsing stages
    • A simple non-lexical PCFG creates a large parse forest
      – Pruning step
    • A sophisticated lexicalized PCFG is then applied to the sentence

Page 41: Syntax-based and Factored Language Models

41

Syntax-based LMs (5)

• Evaluation
  – 347 previously unseen Chinese newswire sentences
  – 780,000 English parse tree / Chinese sentence pairs
  – YC: TM of Yamada 2001 and LM of Charniak 2001
  – YT: TM of Yamada 2001, trigram LM, decoder of Yamada 2002
  – BT: TM of Brown et al. 1993, trigram LM, and the greedy decoder of Ulrich Germann

Eugene Charniak. 2000. “A Maximum-Entropy-Inspired Parser”.
Ulrich Germann, Michael Jahr, Daniel Marcu, Kevin Knight, and Kenji Yamada. 2001. “Fast Decoding and Optimal Decoding for Machine Translation”.

Page 42: Syntax-based and Factored Language Models

42

Factored Language models(1)

• Allow a larger set of conditioning variables for predicting the current word
  – morphological, syntactic or semantic word features, etc.
• Motivation
  – Statistical language modeling is a difficult problem for languages with rich morphology → high perplexity
  – Probability estimates are unreliable even with smoothing
  – The features mentioned above are shared by many words and hence can be used to obtain better-smoothed probability estimates

Katrin Kirchhoff and Mei Yang. 2005. “Improved Language Modeling for Statistical Machine Translation”. ACL.

Page 43: Syntax-based and Factored Language Models

43

Factored Language models(2)

• Factored word representation
  – Decompose words into sets of features (or factors)
• Probabilistic language models are constructed over subsets of word features
• A word is equivalent to a fixed bundle of K factors, W ≡ f^{1:K} (e.g., surface form, stem, POS tag)

Page 44: Syntax-based and Factored Language Models

44

Factored Language models(3)

• Probability model

• Standard generalized parallel backoff

  c: count of (w_t, w_{t-1}, w_{t-2}); p_ML: maximum-likelihood distribution; d_c: discounting factor; τ_3: count threshold; α: normalization factor

– N-grams whose counts are above the threshold retain their ML estimates discounted by a factor that redistributes probability mass to the lower-order distribution
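With the symbols defined above, the backoff equation (assumed here to be the standard generalized-parallel-backoff form of Bilmes and Kirchhoff) is

    p_{BO}(w_t \mid w_{t-1}, w_{t-2}) =
    \begin{cases}
      d_c \, p_{ML}(w_t \mid w_{t-1}, w_{t-2}) & \text{if } c > \tau_3 \\
      \alpha(w_{t-1}, w_{t-2}) \, p_{BO}(w_t \mid w_{t-1}) & \text{otherwise}
    \end{cases}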

Page 45: Syntax-based and Factored Language Models

45

Factored Language models(4)

• Backoff paths

Page 46: Syntax-based and Factored Language Models

46

Factored Language models(5)

• Backoff paths
  – The space of possible models is extremely large
  – Ways of choosing among different paths:
    • Linguistically determined
      – E.g., drop syntactic before morphological variables
      – Usually leads to sub-optimal results
    • Choose a path at runtime based on statistical criteria
    • Choose multiple paths and combine their probability estimates

  c: count of (f, f_1, f_2, f_3); p_ML: ML distribution; τ_4: count threshold; α: normalization factor; g: determines the backoff strategy and can be any non-negative function of f, f_1, f_2, f_3
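With g as defined above, the generalized parallel backoff step takes the form (again assuming the standard formulation)

    p_{GBO}(f \mid f_1, f_2, f_3) =
    \begin{cases}
      d_c \, p_{ML}(f \mid f_1, f_2, f_3) & \text{if } c > \tau_4 \\
      \alpha(f_1, f_2, f_3) \, g(f, f_1, f_2, f_3) & \text{otherwise}
    \end{cases}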

Page 47: Syntax-based and Factored Language Models

47

Factored Language models(6)

• Learning FLM structure
  – Three types of parameters need to be specified:
    • Initial conditioning factors, backoff graph, smoothing options
  – The model space is extremely large → find the best model structure automatically
  – Genetic algorithms (GAs)
    • A class of evolution-inspired search/optimization techniques
    • Encode problem solutions as strings (genes) and evolve and test successive populations of solutions through genetic operators (selection, crossover, mutation) applied to the encoded strings
    • Each solution is evaluated according to a fitness function that represents the desired optimization criterion
    • No guarantee of finding the optimal solution, but they find good solutions quickly

Page 48: Syntax-based and Factored Language Models

48

Factored Language models(7)

• Structure search using GA
  – Conditioning factors
    • Encoded as binary strings
      – E.g., with 3 factors (A, B, C) there are 6 conditioning variables {A-1, B-1, C-1, A-2, B-2, C-2}
        » The string 100011 corresponds to F = {A-1, B-2, C-2}
  – Backoff graph → large number of possible paths
    • Encode a binary string in terms of graph grammar rules
      – 1 indicates use of the rule, 0 non-use
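The bit-string encoding of conditioning factors can be decoded as in this hypothetical helper (not from the paper), which reproduces the slide's example:

    def decode_factors(gene, variables=("A-1", "B-1", "C-1", "A-2", "B-2", "C-2")):
        """Return the set of conditioning factors switched on by a GA gene string."""
        return {v for bit, v in zip(gene, variables) if bit == "1"}

    print(decode_factors("100011"))  # {'A-1', 'B-2', 'C-2'}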

Page 49: Syntax-based and Factored Language Models

49

Factored Language models(8)

• Structure search using GA (contd.)
  – Smoothing options
    • Encoded as tuples of integers
      – First integer → discounting method
      – Second integer → backoff threshold
    • The integer string consists of successive concatenated tuples, each representing the smoothing option at a node in the graph
• GA operators are applied to the concatenation of all three substrings describing the set of factors, the backoff graph and the smoothing options, to jointly optimize all parameters

Page 50: Syntax-based and Factored Language Models

50

Factored Language models(9)

• Data: ACL05 shared MT task website, for 4 language pairs
  – Finnish, Spanish, French to English
  – Development set provided by the website: 2000 sentences
• Trained using GIZA++; Pharaoh for phrase-based decoding
• Trigram word LM trained using the SRILM toolkit with Kneser-Ney smoothing and interpolation of higher- and lower-order n-grams
  – Combination weights trained using minimum error weight optimization (Pharaoh)

Page 51: Syntax-based and Factored Language Models

51

Factored Language models(10)

• First pass
  – Extract N-best lists: 2000 hypotheses per sentence
  – 7 model scores collected from the outputs
    • Distortion model score, first-pass LM score, word and phrase penalties, bidirectional phrase and word translation scores
• Second pass
  – N-best lists rescored with additional LMs
    • Word-based 4-gram model
    • Factored trigram model: a separate FLM for each language
      – Features: POS tags (Ratnaparkhi), stems (Porter)
      – Optimized to achieve a low perplexity on the oracle 1-best hypothesis (the one with the best BLEU score) from the first pass
  – Resulting scores combined with the above scores in a log-linear fashion
    • Combination weights optimized on the development set to maximize the BLEU score
    • The weighted, combined scores are used to select the best hypothesis
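The second-pass log-linear combination can be sketched as follows (feature names, weights and scores are made up for illustration; this is not the authors' implementation):

    def rescore_nbest(nbest, weights):
        """nbest: list of (hypothesis, {feature: log_score}); return the best hypothesis."""
        def total(feats):
            return sum(weights[name] * value for name, value in feats.items())
        return max(nbest, key=lambda hyp: total(hyp[1]))

    nbest = [
        ("hypothesis one", {"tm": -12.3, "lm_4g": -45.1, "flm": -50.2, "word_pen": -8.0}),
        ("hypothesis two", {"tm": -11.9, "lm_4g": -46.0, "flm": -49.0, "word_pen": -8.0}),
    ]
    weights = {"tm": 1.0, "lm_4g": 0.6, "flm": 0.4, "word_pen": -0.2}
    print(rescore_nbest(nbest, weights)[0])  # picks the higher weighted-score hypothesis

In the systems described here, the weights would be tuned on a development set (e.g. with minimum error rate training) rather than set by hand.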

Page 52: Syntax-based and Factored Language Models

52

Factored Language models(11)

Page 53: Syntax-based and Factored Language Models

53

Conclusion(1)

• Chelba et al. 1997 dependency LM (in ASR)
  – Incorporated syntax and semantics
  – Predicted words based on their relation to words that lie far in the past
  – For practical purposes, selected the best output from an N-best list
  – For MT: can be applied to N-best lists
• Rens Bod 2001 DOP (in ASR)
  – Used both headword and non-headword dependencies
  – Incorporated syntax and semantics
    • Showed that semantic annotations contribute to performance
  – For MT: huge search space (reordering); better to use it on N-best lists

Page 54: Syntax-based and Factored Language Models

54

Conclusion(2)

• Chelba and Jelinek 1998 structured LM (in ASR)
  – The model assigned probability to joint sequences of words and binary parse structures with headword annotations in a left-to-right manner; improvements in PPL
  – For MT: can be done, but huge search space (reordering)
• Charniak et al. 2003 syntax-based LM (in MT)
  – TM “married” to a syntax-based LM
  – No improvements in BLEU score
    • Blame it on BLEU?
    • Blame it on parse accuracies?
    • Obtained fewer translations that were syntactically and semantically wrong, and more perfect translations
  – In future, can integrate all knowledge sources at hand

Page 55: Syntax-based and Factored Language Models

55

Conclusion(3)

• SuperARV LMs
  – Wang et al. 2003 (in ASR)
    • Almost parsing, statistical dependency grammar parser
    • Enriched tags
    • Reduced WER
  – Wang et al. 2007 (in MT)
    • For reranking N-best hypotheses
    • Showed improvements in BLEU score, by almost a BLEU point

Page 56: Syntax-based and Factored Language Models

56

Conclusion(4)

• Katrin Kirchhoff 2005 factored LMs (in MT)
  – Used a set of factors to define a word
  – Did not show improvements in MT quality
    • Could be adding in more noise
    • Blame BLEU?
      – No study
    • Structure learning intuitively makes sense
      – but does not find the optimum structure
    • Was the list of N-best hypotheses good?
    • Context was limited to 3-grams
      – Might help with higher values of n
    • Features are probably not good/insufficient
  – Interpolating with other LMs might help
  – Might perform better than a word-based LM on morphologically rich languages

Page 57: Syntax-based and Factored Language Models

57

Thank You

Page 58: Syntax-based and Factored Language Models

58