Contributions to connectionist language modeling and its application to sequence recognition and machine translation
PhD Thesis defense
Francisco Zamora Martínez, supervised by María José Castro Bleda
Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València
30 November 2012



DESCRIPTION

Natural Language Processing is an area of Artificial Intelligence and, in particular, of Pattern Recognition. It is a multidisciplinary field that studies human language, both oral and written, and deals with the development and study of computational mechanisms for communication between people and computers using natural languages. Natural Language Processing is a research area in constant evolution, and this work focuses on the part related to language modeling and its application to several tasks: sequence recognition/understanding and statistical machine translation. Specifically, this thesis focuses on the so-called connectionist language models (or continuous space language models), i.e., language models based on neural networks. Their excellent performance in various Natural Language Processing areas motivated this study.

Because of certain computational problems suffered by connectionist language models, the most widespread approach followed by systems that currently use these models is based on two totally decoupled stages. In a first stage, a standard and cheaper language model generates a set of feasible hypotheses, under the assumption that this set is representative of the search space in which the best hypothesis lies. In a second stage, a connectionist language model is applied to this set and the list of hypotheses is rescored. This scenario motivates the scientific goals of this thesis:

- Developing techniques to drastically reduce the computational cost while degrading quality as little as possible.
- Studying the effect of a totally coupled approach that integrates neural network language models into the decoding stage.
- Developing extensions of the original model in order to improve its quality and to achieve context/domain adaptation.
- Empirically applying neural network language models to sequence recognition and machine translation tasks.

All developed algorithms were implemented in C++, using Lua as scripting language. The implementations are compared with those considered standard for each of the addressed tasks. Neural network language models achieve very interesting quality improvements over the reference baseline systems:

- competitive results on automatic speech recognition and spoken language understanding;
- improvement of the state of the art in handwritten text recognition;
- state-of-the-art results in statistical machine translation, as shown by the participation in international evaluation campaigns.

On sequence recognition tasks, the integration of neural network language models into the first decoding stage achieves very competitive computational costs. However, their integration in machine translation tasks requires further development, because the computational cost of the system is still somewhat high.


Page 1: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Contributions to connectionist language modeling and its application to sequence recognition and machine translation

PhD Thesis defense

Francisco Zamora Martínez
supervised by María José Castro Bleda

Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València

2012 November 30

Page 2: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 3: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 4: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Motivation

Important role of Language Models (LMs): N-grams
+ learned automatically
+ simple and effective
– problems with unseen patterns, requiring smoothing heuristics

Neural Network Language Models (NN LMs): based on N-grams
+ automatic smoothing of unseen patterns
– big computational cost
– decoupled integration ⇒ N-best list rescoring

Open questions on NN LMs:
- Totally coupled integration scheme.
- NN LM capabilities to improve hypothesis pruning.
- Quality of fast evaluation of NN LMs.
- Modeling of long-distance dependencies with NN LMs.

Tasks: Spoken Language Understanding, Handwritten Text Recognition, Statistical Machine Translation.

Page 5: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Pattern Recognition

Fundamental equation:

    ŷ = argmax_{y∈Ω+} p(y|x) = argmax_{y∈Ω+} p(x|y)·p(y)

Machine translation example:
    x = Traduciremos esta frase al inglés              (x1 x2 x3 x4 x5)
    y = We will translate this sentence into English   (y1 y2 y3 y4 y5 y6 y7)

Handwritten Text Recognition example:
    x = (text line image)
    y = must be points .                               (y1 y2 y3 y4)

Page 6: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Pattern Recognition

Fundamental equation and its generalization:

    ŷ = argmax_{y∈Ω+} p(y|x) = argmax_{y∈Ω+} p(x|y)·p(y)

    ŷ = argmax_{y∈Ω+} p(y|x) = argmax_{y∈Ω+} ∏_{m=1..M} H_m(x, y)^{λ_m}

Language Models (LMs) estimate the a-priori probability p(y), that is, they compute a measure of how well y belongs to the task language.
The model is generalized under the maximum entropy framework as a log-linear combination.
For Spoken Language Understanding (SLU) and Handwritten Text Recognition (HTR), it models the Grammar Scale Factor (GSF) and Word Insertion Penalty (WIP) weights.

Page 7: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Dataflow architecture

The search procedure computes ŷ using the previous equations.
Algorithms based on graphs were implemented, breaking the search into small building blocks.
Each block is a module in a dataflow architecture.
Modules exchange different information types: feature vectors, graph protocol messages, probabilities, …

(Figure: speech recognition dataflow — x → GenProbs Module → OSE Module → WGen Module → word graph → NGramParser Module → y, with Viterbi pruning max_{e ∈ active vertex} p(e).)
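As a rough illustration of the dataflow idea (not the actual April toolkit code; the names Packet, Module and MaxScoreModule are assumptions), a module can be sketched as an object that consumes one typed packet and emits others through a callback:

    #include <algorithm>
    #include <functional>
    #include <string>
    #include <variant>
    #include <vector>

    // Illustrative packet type: modules exchange feature vectors,
    // graph protocol messages (here just strings) or probabilities.
    using Packet = std::variant<std::vector<float>, std::string, double>;

    // Hypothetical module interface: consume an input packet, emit output packets.
    struct Module {
        virtual ~Module() = default;
        virtual void process(const Packet &in,
                             const std::function<void(Packet)> &emit) = 0;
    };

    // Toy example: forward the maximum of a feature vector as a score.
    struct MaxScoreModule : Module {
        void process(const Packet &in,
                     const std::function<void(Packet)> &emit) override {
            if (auto v = std::get_if<std::vector<float>>(&in)) {
                float best = v->empty() ? 0.0f : v->front();
                for (float x : *v) best = std::max(best, x);
                emit(Packet{static_cast<double>(best)});
            }
        }
    };

Chaining such modules, each one reading from the previous one's emitted packets, reproduces the pipeline drawn in the figure above.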

Page 8: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Graph protocol messages

The most important dataflow data type.
Normally, graph messages are produced/consumed left-to-right.
A specialization for multi-stage graphs is possible (as for Statistical Machine Translation).

General graph (message sequence):

     1: begin_dag (multistage=false)
     2: vertex (0)
     3: is_initial (0)
     4: no_more_in_edges (0)
     5: vertex (1)
     6: edge (0, data=⟨a, 1.0⟩, ⟨b, 0.7⟩)
     7: no_more_in_edges (1)
     8: vertex (2)
     9: edge (0, data=⟨b, 0.5⟩)
    10: no_more_in_edges (2)
    11: no_more_out_edges (0)
    12: vertex (3)
    13: edge (1, data=⟨b, 0.1⟩)
    14: edge (2, data=⟨a, 1.0⟩)
    15: no_more_in_edges (3)
    16: no_more_out_edges (2)
    17: vertex (4)
    18: is_final (4)
    19: edge (1, data=⟨a, 1.0⟩, ⟨c, 0.2⟩)
    20: no_more_out_edges (1)
    21: edge (3, data=⟨d, 0.4⟩)
    22: no_more_out_edges (3)
    23: no_more_in_edges (4)
    24: no_more_out_edges (4)
    25: end_dag ()
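A minimal sketch, under assumptions, of a left-to-right consumer of these messages; the message names follow the listing above, while the GraphConsumer class and its layout are illustrative only:

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // One edge payload: (symbol, score) pairs as in data=<a, 1.0>, <b, 0.7>.
    using EdgeData = std::vector<std::pair<std::string, double>>;

    // Hypothetical handler: the decoder reacts to each graph protocol message.
    struct GraphConsumer {
        void begin_dag(bool multistage)        { std::printf("begin_dag(%d)\n", int(multistage)); }
        void vertex(int v)                     { std::printf("vertex(%d)\n", v); }
        void is_initial(int v)                 { std::printf("is_initial(%d)\n", v); }
        void is_final(int v)                   { std::printf("is_final(%d)\n", v); }
        void edge(int from, const EdgeData &d) { std::printf("edge(from=%d, %zu labels)\n", from, d.size()); }
        void no_more_in_edges(int v)           { std::printf("no_more_in_edges(%d)\n", v); }
        void no_more_out_edges(int v)          { std::printf("no_more_out_edges(%d)\n", v); }
        void end_dag()                         { std::printf("end_dag()\n"); }
    };

    int main() {
        GraphConsumer c;                       // replay the beginning of the listing above
        c.begin_dag(false);
        c.vertex(0); c.is_initial(0); c.no_more_in_edges(0);
        c.vertex(1); c.edge(0, {{"a", 1.0}, {"b", 0.7}}); c.no_more_in_edges(1);
        c.end_dag();
    }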

Page 9: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Optimization and evaluation

Evaluation measures:
- Perplexity (PPL).
- Word Error Rate (WER).
- Character Error Rate (CER).
- Sentence Error Rate (SER).
- Concept Error Rate (CER).
- Bilingual Evaluation Understudy (BLEU).
- Translation Edit Rate (TER).

Log-linear combination:
- All tasks are formalized as a log-linear combination.
- Weights are estimated via Minimum Error Rate Training (MERT).

Confidence intervals and comparison:
- Bootstrapping technique.
- Pairwise comparison.

Page 10: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Goals

Scientific aims:
- NN LM formalization as general N-grams.
- Method specification for the totally coupled integration of NN LMs.
- Evaluation of the totally coupled approach.
- NN LM extension to Cache NN LMs, inspired by cache LMs.

Technological aims:
- Efficient implementation of training and evaluation of NN LMs.
- Efficient implementation of algorithms for coupling NN LMs in SLU, HTR and Statistical Machine Translation (SMT).
- MERT algorithm to estimate GSF and WIP in SLU and HTR.
- April toolkit development in collaboration with the research group.

Page 11: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 12: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Language modeling

- Statistical LMs follow the pattern recognition fundamental equation.
- They estimate the probability of a sentence as belonging to a certain language.
- Simplified by using N-gram LMs over sequences of order N:

    p(y) ≈ ∏_{j=1..|y|} p(y_j | h_j),   with h_j = y_{j−1} y_{j−2} … y_{j−N+1}

N-gram probability computation (3-gram):

    p(ω1 ω2 ω3) = p(ω1 | bcc) · p(ω2 | bcc ω1) · p(ω3 | ω1 ω2) · p(ecc | ω2 ω3)

bcc is the begin context cue (start of sentence); ecc is the end context cue (end of sentence).

Page 13: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Connectionist Language Models

- Based on the idea of word projections onto a continuous space.
- Interpolation of unseen N-grams given the word projections.
- Joint training of word projections and LM probability computation.
- Word projections are position independent: shared weights.

Word projection:

Input: local encoding, word as a category (size = |Ω|)
    0, 0, …, 0, 0, 0, 1, 0, 0, …, 0, 0, 0, 0

Projection: distributed encoding, a feature vector (size ≪ |Ω|)
    0.1, …, −0.4, 0.2, …, 1.1
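Because the input is a local (one-hot) encoding, multiplying it by the projection matrix amounts to selecting one row. A minimal sketch of this lookup; the ProjectionLayer type, sizes and field names are illustrative assumptions, not the April implementation:

    #include <cstddef>
    #include <vector>

    // Projection matrix: |vocab| rows, proj_dim columns, shared by all
    // context word positions (shared weights).
    struct ProjectionLayer {
        std::size_t proj_dim;
        std::vector<float> weights;            // row-major, |vocab| x proj_dim

        // One-hot input times the matrix == a copy of row `word_id`.
        std::vector<float> project(std::size_t word_id) const {
            const float *row = &weights[word_id * proj_dim];
            return std::vector<float>(row, row + proj_dim);
        }
    };

The projected vectors of the N−1 context words are then concatenated and fed to the rest of the network.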

Page 14: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Connectionist Language Models

(Figure: feed-forward NN LM architecture — each of the N−1 context words, in local one-hot encoding, feeds a shared projection layer, followed by the rest of the network up to the output layer.)

Page 15: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Connectionist Language Models

Training issues:
- Stochastic backpropagation algorithm with weight decay regularization.
- Stochastic training selects, with replacement, a random set of patterns every epoch.
- For large datasets, training converges before the training partition has been completely traversed.
- Fast training using matrix-matrix multiplications and fine-tuned BLAS implementations (bunch mode).
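A rough sketch, under assumptions not on the slide, of the pattern selection and bunch assembly: a random subset is drawn with replacement every epoch and grouped into bunches so that each forward/backward pass becomes one matrix-matrix product handled by BLAS. Function names and sizes are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw `epoch_size` training pattern indices with replacement.
    std::vector<std::size_t> sample_epoch(std::size_t num_patterns,
                                          std::size_t epoch_size,
                                          std::mt19937 &rng) {
        std::uniform_int_distribution<std::size_t> pick(0, num_patterns - 1);
        std::vector<std::size_t> idx(epoch_size);
        for (auto &i : idx) i = pick(rng);
        return idx;
    }

    // Group sampled indices into bunches of `bunch_size` patterns; each bunch
    // is later forwarded as one matrix-matrix multiplication (BLAS gemm).
    std::vector<std::vector<std::size_t>> make_bunches(
            const std::vector<std::size_t> &idx, std::size_t bunch_size) {
        std::vector<std::vector<std::size_t>> bunches;
        for (std::size_t i = 0; i < idx.size(); i += bunch_size)
            bunches.emplace_back(idx.begin() + i,
                                 idx.begin() + std::min(i + bunch_size, idx.size()));
        return bunches;
    }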

Page 16: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

NN LM deficiencies and our solutions I

Word projections – Projection layer

Deficiencies:
- Random initialization of the projection layer: low-frequency (“rare”) words have very different encodings.
- Large vocabularies + stochastic training ⇒ poor training of “rare” words.

Solutions:
- Restrict the NN LM input vocabulary to words with frequency > θ (experiments).
- Use of bias and weight decay terms on the projection layer.

Page 17: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

NN LM deficiencies and our solutions II

Computational problems – Output layer

Deficiencies:
- Softmax activation forces the computation of all outputs: o_i = exp(a_i) / Σ_{k=1..|A|} exp(a_k).
- Training problems were partially solved using fast math operations and stochastic backpropagation.
- At decoding time, the softmax bottleneck forces the development of decoupled systems: N-best list rescoring.

Solutions:
- Shortlist: output vocabulary Ω′ restricted to the most frequent words.

    p(y_j | h_j) = p_NN(y_j | h_j),                                        if y_j ∈ Ω′
    p(y_j | h_j) = p_NN(OOS | h_j) · C_OOS(y_j) / Σ_{y′∉Ω′} C_OOS(y′),     if y_j ∉ Ω′

- Precompute softmax normalization constants: Fast NN LM.
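A minimal sketch of the shortlist computation above, assuming the network outputs (including the out-of-shortlist neuron) and the unigram counts C_OOS are already available for the current context; the ShortlistLM wrapper and its field names are assumptions:

    #include <string>
    #include <unordered_map>

    // Hypothetical shortlist wrapper around a trained NN LM, for one context h.
    struct ShortlistLM {
        std::unordered_map<std::string, double> nn_prob;    // p_NN(w | h) for w in the shortlist
        double oos_prob;                                     // p_NN(OOS | h)
        std::unordered_map<std::string, double> oos_counts;  // C_OOS(w) for out-of-shortlist words
        double oos_count_sum;                                // sum of C_OOS over out-of-shortlist words

        double prob(const std::string &word) const {
            auto it = nn_prob.find(word);
            if (it != nn_prob.end()) return it->second;            // word in the shortlist
            auto c = oos_counts.find(word);
            double count = (c != oos_counts.end()) ? c->second : 0.0;
            return oos_prob * count / oos_count_sum;               // redistribute the OOS mass
        }
    };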

Page 18: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

NN LM and decoding

N-best list rescoring:
- Widely used in the literature.
- Encouraging improvements in ASR and SMT.

Totally integrated decoding:
- Not previously attempted in the literature.
- Major contribution and focus of this work.

Page 19: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Integration of NN LM into decoding

- A generic framework will be presented.
- The softmax computation problems need to be solved.

Generic LM interface (sketched in C++ below):
- An LMkey is an automaton state number.
- The LMkey is used by the decoder as an N-gram context identifier.
- prepareLM(LMkey)
- getLMprob(LMkey, word)
- getNextLMKey(LMkey, word)
- getInitialLMKey()
- getFinalLMKey()
- restartLM()
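Read as an abstract class, the interface could look like the following sketch (a free C++ interpretation using the function names listed above, not the April code; the LMKey type and comments are assumptions):

    #include <cstdint>
    #include <string>

    using LMKey = std::uint32_t;   // automaton state number / N-gram context id

    // Hypothetical C++ rendering of the generic LM interface.
    struct LanguageModel {
        virtual ~LanguageModel() = default;
        virtual void   prepareLM(LMKey key) = 0;                          // precompute what the state needs
        virtual double getLMprob(LMKey key, const std::string &w) = 0;    // p(w | context identified by key)
        virtual LMKey  getNextLMKey(LMKey key, const std::string &w) = 0; // follow the transition labelled w
        virtual LMKey  getInitialLMKey() = 0;                             // state for the start-of-sentence context
        virtual LMKey  getFinalLMKey() = 0;                               // state reached by the end-of-sentence cue
        virtual void   restartLM() = 0;                                   // reset between sentences
    };

Any LM (standard N-gram, NN LM, Fast NN LM) plugged behind this interface can be used by the same decoder.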

(Figure: example N-gram stochastic finite-state automaton with states for the contexts _, a, b, aa and ab, and transitions labelled a and b.)

Page 20: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Generic framework

TrieLM:
- The TrieLM represents the whole N-gram space: |Ω|^N (NN LMs).
- It is built on-the-fly, enumerating only the states needed by decoding.
- A path in the TrieLM is the sequence of the N − 1 context words of an N-gram.
- Two kinds of node: persistent and dynamic.

TrieLM node:
- Parent back-pointer.
- Time-stamp.
- Word transition.

(Figure: TrieLM example.)
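A sketch of what a TrieLM node might look like, following the three fields on the slide (parent back-pointer, time-stamp, word transition); the persistent flag, the stored constant and all names are assumptions made for illustration:

    #include <cstdint>
    #include <vector>

    // One TrieLM node: the path from the root spells the N-1 context words.
    struct TrieLMNode {
        std::int32_t  parent;      // back-pointer to the parent node (-1 for the root)
        std::uint32_t word;        // word transition taken from the parent
        std::uint64_t timestamp;   // last decoding step that touched it (dynamic nodes)
        bool          persistent;  // true for precomputed (training-time) nodes
        double        softmax_Z;   // precomputed softmax normalization constant, if any
    };

    // The whole trie is just a growable pool of nodes built on-the-fly.
    using TrieLM = std::vector<TrieLMNode>;

Dynamic nodes can be recycled when their timestamp falls behind the current decoding step, keeping the enumerated part of the |Ω|^N space small.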

Page 21: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

The softmax normalization constant forces the computation of all output neurons, even when only a few are needed:

    o_i = exp(a_i) / Σ_{k=1..|A|} exp(a_k)

Softmax has several advantages:
- Ensures true probability computations with ANNs.
- Improves training convergence.

Number of weights at the output layer:

    |Ω′|       Output layer weights
    100             25 700
    1 000          257 000
    10 000       2 570 000

Our solution:
- Precompute the most important constants needed during decoding.
- When a constant is not found, two possibilities are feasible: compute the constant on-the-fly, or use some kind of smoothing.

Page 22: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Preliminary notes:
- For an NN LM of order N there exist |Ω_I|^{N−1} softmax normalization constants.
- Note that a bigram only needs |Ω_I| constants.
- Ω_I is the restricted NN LM input vocabulary.

(Figure: a 4-gram NN LM.)

Page 23: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Precomputation procedure (training):
- N-gram contexts are extracted from a training corpus, counting their frequencies.
- For the most frequent contexts, the softmax normalization constants are computed.
- The TrieLM stores, as persistent nodes, the N-gram context words associated with each softmax normalization constant together with the constant value.

Softmax normalization constants computation (3-gram):

    Sentence: A MOVE to stop Mr. Gaitskell from …

    N − 1 context words    Softmax constant
    A MOVE                 43 418
    MOVE to                78 184
    to stop                88 931
    …                      …
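A sketch, under assumptions, of this training-time pass: count the N−1 word contexts seen in the corpus and compute the softmax normalization constant for the most frequent ones. computeSoftmaxZ stands in for a forward pass of the NN LM; all names are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using Context = std::vector<std::string>;   // N-1 context words

    // Stand-in for a forward pass of the NN LM returning sum_i exp(a_i).
    double computeSoftmaxZ(const Context &ctx);

    // Count contexts, keep the `max_constants` most frequent ones, and
    // precompute their softmax normalization constants (later stored as
    // persistent TrieLM nodes).
    std::map<Context, double> precomputeConstants(
            const std::vector<Context> &corpus_contexts, std::size_t max_constants) {
        std::map<Context, std::size_t> freq;
        for (const auto &c : corpus_contexts) ++freq[c];

        std::vector<std::pair<Context, std::size_t>> ranked(freq.begin(), freq.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto &a, const auto &b) { return a.second > b.second; });

        std::map<Context, double> constants;
        for (std::size_t i = 0; i < ranked.size() && i < max_constants; ++i)
            constants[ranked[i].first] = computeSoftmaxZ(ranked[i].first);
        return constants;
    }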

Page 24: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Smoothed approach (evaluation):
- Use a simpler model if a constant is not found.
- The simplest model is an NN LM bigram or a standard N-gram.

(Figure: smoothed evaluation flowchart. To compute P(stop | a move to), the 3-gram softmax normalization constants are searched for the context “a move to”; if not found, the 3-gram NN LM is used with the 2-gram constants for “move to”, and finally the 2-gram NN LM with the 1-gram constants.)
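The flowchart amounts to a back-off over precomputed constants: try the longest context first and, when its constant is missing, fall back to a shorter context and a lower-order NN LM. A compact sketch under the same illustrative assumptions as before (constants indexed by context length, unnormalizedScore standing in for the NN LM output):

    #include <map>
    #include <string>
    #include <vector>

    using Context = std::vector<std::string>;

    // Stand-in: unnormalized output exp(a_word) of the NN LM of order ctx.size()+1.
    double unnormalizedScore(const Context &ctx, const std::string &word);

    // constants[k] holds the precomputed softmax constants for contexts of length k;
    // the vector must be sized up to the maximum context length used.
    double smoothedProb(const std::vector<std::map<Context, double>> &constants,
                        Context ctx, const std::string &word) {
        while (!ctx.empty()) {
            const auto &table = constants[ctx.size()];
            auto it = table.find(ctx);
            if (it != table.end())                   // constant found: full NN LM of this order
                return unnormalizedScore(ctx, word) / it->second;
            ctx.erase(ctx.begin());                  // drop the oldest word: back off one order
        }
        return unnormalizedScore(ctx, word);         // last resort: lowest-order model
    }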

Page 25: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Smoothed approach pros/cons:
– Model quality reduction.
+ Constants do not need to be computed.
Trade-off between quality and speed: more precomputed constants mean higher quality, but slower speed.

On-the-fly approach pros/cons:
+ Full model quality.
– Needs to compute a lot of constants on-the-fly; more precomputed constants mean faster speed.
It is possible to store computed constants for future use.
– However, it is always slower than the smoothed approach.

Page 26: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments with PPL in text: NN LMs and speed-up

Fast evaluation approach:
- LOB-ale corpus: random sentences from LOB with a closed vocabulary.
- Three configurations are evaluated:
  - On-the-fly Fast NN LM: computing constants when needed.
  - Smoothed Fast NN LM: using a simpler model.
  - Smoothed-SRI Fast NN LM: the simplest model is an SRI model.

PPL and speed-up results:

    Setup                      ms/word   speed-up   test PPL
    Mixed NN LM                6.43      0          79.90
    On-the-fly Fast NN LM      1.82      3          79.90
    Smoothed Fast NN LM        0.19      33         80.78
    Smoothed-SRI Fast NN LM    0.19      33         79.02

Page 27: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments with PPL in text: NN LMs and speed-up

(Figure: Smoothed Fast NN LM results.)

Page 28: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments on HTR: NN LMs and integrated decoding

- Totally integrated decoding.
- Experiments on the influence of pruning on WER and time.
- IAM-DB task, presented in depth in the sequence recognition part.
- 4-gram NN LM linearly combined with an SRI bigram, following the on-the-fly (standard) approach and the smoothed approach.

Page 29: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments on HTR: NN LMs and integrated decoding

(Figure: three plots comparing the Smoothed Fast NN LM, the standard NN LM and the SRI bigram — sec/word vs. WER, WER and relative improvement (%) vs. histogram pruning size (×1000), and sec/word and time ratio (%) vs. histogram pruning size.)

Conclusions:
- Smoothed Fast NN LM: better time, same WER.
- Smoothed Fast NN LM: improves WER by 8%, with an additional 10% time.
- Standard NN LM approach: same WER as the smoothed one, but two times slower.

Page 30: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 31: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Hidden Markov Model decoding

- Based on HMM/ANN models.
- Two-step decoding with pruning synchronization.
- N-gram Viterbi decoder with integrated NN LMs.

(Figure: dataflow — x → GenProbs Module → OSE Module → WGen Module → word graph → NGramParser Module → y, with Viterbi pruning max_{e ∈ active vertex} p(e).)

Two tasks:
- Spoken Language Understanding (SLU).
- Handwritten Text Recognition (HTR).

Page 32: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

Cache NN LM for long-distance dependencies:

Cache-based LM:

    p(y_j | y_{j−1} … y_1) = α · p(y_j | h_j) + (1 − α) · p_cache(y_j | h_1^{j−1})

Cache NN LM:

    p(y) ≈ p(y | h_1^{u−1}) ≈ ∏_{j=1..|y|} p(y_j | h_j, h_1^{u−1})

where h_j is the N-gram context and h_1^{u−1} is the cache part.

- The Cache NN LM receives a summary of all previous machine/user interactions.
- The cache part remains the same during the decoding of one sentence.
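A rough sketch of how the Cache NN LM input could be assembled for one sentence: the N-gram context slides word by word while the cache summary stays fixed. The CacheNNLMInput layout and the way the cache summary is encoded are assumptions for illustration only:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Input for one word position: the usual N-1 context word ids plus the
    // cache summary h_1^{u-1}, fixed while decoding the current sentence.
    struct CacheNNLMInput {
        std::vector<std::uint32_t> ngram_context;  // y_{j-1} ... y_{j-N+1}
        std::vector<float>         cache_summary;  // summary of previous machine/user turns
    };

    // Build inputs for a whole sentence: the context slides, the cache does not.
    std::vector<CacheNNLMInput> buildInputs(const std::vector<std::uint32_t> &sentence,
                                            const std::vector<float> &cache_summary,
                                            std::size_t N) {
        std::vector<CacheNNLMInput> inputs;
        for (std::size_t j = 0; j < sentence.size(); ++j) {
            CacheNNLMInput in;
            for (std::size_t k = 1; k < N && k <= j; ++k)
                in.ngram_context.push_back(sentence[j - k]);
            in.cache_summary = cache_summary;   // same vector for every position j
            inputs.push_back(std::move(in));
        }
        return inputs;
    }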

Page 33: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

- SLU using language models of pairs: concept/word sequences.
- Using the Cache NN LM, different summary combinations were tested.

Page 34: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

Experiments:
- MEDIA French corpus.
- Using the Cache NN LM, different summary combinations:
  A) Using a cache of concepts only.
  B) Using only words in the cache.
  C) Using concepts + words in the cache.
  D) Using concepts + words + Wizard-of-Oz in the cache.

MEDIA statistics:

    Set          # sentences   # running words   # running concepts
    Training     12 811        87 297            42 251
    Validation    1 241         9 996             4 652
    Test          3 468        24 598            11 790

Page 35: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

                     2-grams         3-grams         4-grams
    System           Val.   Test     Val.   Test     Val.   Test
    baseline-a       33.6   30.1     32.9   29.3     33.5   29.3
    baseline-b       33.1   28.3     30.7   27.4     30.2   28.1
    cacheNNLM-A      31.7   28.2     29.7   27.0     29.7   27.0
    cacheNNLM-B      30.5   27.3     29.7   27.0     30.0   26.1
    cacheNNLM-C      32.2   28.3     30.5   27.0     30.8   27.4
    cacheNNLM-D      31.2   28.2     29.9   26.2     30.3   27.1

CER results for validation and test sets.

baseline-a) Standard N-gram of pairs, without cache.
baseline-b) Standard NN LM of pairs, without cache.
A) Using a cache of concepts only.
B) Using only words in the cache.
C) Using concepts + words in the cache.
D) Using concepts + words + Wizard-of-Oz in the cache.

Conclusions:
- Significant CER reduction using a rather simple SLU model.
- The best Cache NN LM suggests that there is plenty of room for improvement.
- The use of a cache with long-distance dependencies systematically improves the baselines.
- Best CER in the literature: 23.8%.

Page 36: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition

- Based on HMM/ANN models: 80 characters (26 lowercase and 26 uppercase letters, 10 digits, 16 punctuation marks, white space, and the crossing-out mark).
- Two-step decoding, total integration of NN LMs.
- IAM-DB task: off-line text line recognition.
- Language models trained using LOB + WELLINGTON + BROWN: 103K vocabulary size (|Ω|).
- Two types of experiments: word-based and character-based.

(Figure: IAM-DB text line images.)

Page 37: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: word-based experiments

Validation results (% WER / % CER / % SER):

    System           |Ω_I|   bigram           3-gram           4-gram
    Bigram mKN       103K    17.3/6.2/69.8    17.8/6.3/70.3    17.9/6.3/69.3

    Rescoring with NN LMs
    Θ = 21  NN       10K     16.0/5.8/67.4    16.0/5.9/67.2    16.2/5.8/67.2
    Θ = 10  NN       16K     15.9/5.9/67.5    16.0/5.9/67.2    16.4/5.9/66.5
    Θ = 8   NN       19K     16.0/5.8/66.9    16.3/5.9/67.8    16.9/6.1/68.8
    Θ = 1   NN       56K     16.0/5.8/66.6    16.3/6.0/67.9    16.9/6.2/69.2
    Θ = 21  FNN      10K     16.0/5.8/67.4    15.8/5.7/66.0    15.8/5.7/65.1
    Θ = 10  FNN      16K     15.9/5.9/67.5    15.9/5.7/65.0    15.9/5.8/66.0
    Θ = 8   FNN      19K     16.0/5.8/66.9    15.8/5.8/65.8    15.8/5.7/65.8
    Θ = 1   FNN      56K     16.0/5.8/66.6    15.9/5.7/66.1    15.7/5.8/66.3

    NN LMs integrated during decoding
    Θ = 21  NN       10K     16.0/5.8/67.0    16.0/5.8/67.2    16.1/5.8/67.2
    Θ = 10  NN       16K     16.1/5.8/66.9    16.1/5.8/67.3    16.4/5.8/66.7
    Θ = 8   NN       19K     16.1/5.9/67.0    16.3/5.9/67.9    16.9/6.0/69.2
    Θ = 1   NN       56K     16.1/5.8/66.7    16.4/6.0/67.5    16.8/6.2/69.3
    Θ = 21  FNN      10K     16.0/5.8/67.0    15.8/5.8/66.5    15.8/5.6/65.2
    Θ = 10  FNN      16K     16.1/5.8/66.9    15.9/5.7/65.0    16.0/5.8/65.6
    Θ = 8   FNN      19K     16.1/5.9/67.0    15.8/5.8/65.7    15.8/5.7/65.6
    Θ = 1   FNN      56K     16.1/5.8/66.7    15.9/5.7/66.1    15.8/5.7/65.6

NN are standard NN LMs; FNN are Smoothed Fast NN LMs; words with frequency > Θ form Ω_I.

Page 38: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: word-based experiments

Test results:

    System                              % WER        % CER       % SER
    [Graves et al.]                     25.9 ± 0.8   –           –
    Bigram 20K mKN                      23.4 ± 0.8   9.6 ± 0.4   78.1 ± 1.6
    Bigram mKN                          21.9 ± 0.7   8.8 ± 0.4   76.0 ± 1.6

    Rescoring with NN LMs
    4-gram Θ = 21 FNN                   20.2 ± 0.7   8.3 ± 0.4   72.9 ± 1.6

    NN LMs integrated during decoding
    4-gram Θ = 21 FNN                   20.2 ± 0.7   8.3 ± 0.4   73.0 ± 1.6

The 20K mKN setup is from [Graves et al.]; words with frequency > Θ form Ω_I.

Conclusions:
- Differences among input vocabulary sizes are not significant.
- Rescoring and the integrated approach yield similar results.
- First HTR system using NN LMs, with a significant improvement of state-of-the-art results.

Page 39: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: character-based experiments

- LMs using characters instead of words.
- Useful for tasks with scarce data.

(Figure: validation results.)

Page 40: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: character-based experiments

Test results:

    System          % WER   % CER   OOV accuracy (%)
    SRI             30.9    13.8    29.8
    NN LM           24.2    10.1    33.8
    % improvement   21.7    26.8    13.4

Conclusions:
- Large improvement over the N-gram baseline (SRI).
- Almost 33% of the words that are OOV for a word-based LM were recovered.
- These results suggest trying a mixed approach combining word-based and character-based LMs.
- Ancient document transcription could benefit from this approach.

Page 41: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 42: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Statistical Machine Translation

Overview:
- Introduction to SMT following the phrase-based and N-gram-based approaches.
- Decoding algorithm details.
- Experiments on decoder quality and integration of NN LMs.

Page 43: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Statistical Machine Translation

- Follows the maximum entropy approach: log-linear combination of several models.
- This work focuses on the phrase-based and N-gram-based approaches:
  - Both share a similar basis: word alignments, segmentation into tuples (bilingual phrases), log-linear modeling.
  - Phrase-based: all consistent phrases (multiple segmentations) are extracted.
  - N-gram-based: the source language sentence is reordered to follow the target language order; a unique segmentation into tuples is extracted for training, while in decoding multiple paths are possible.

Fundamental equation:

    ŷ = target_part(T̂, φ̂),   (T̂, φ̂) = argmax_{(T, φ)} ∏_{m=1..M} H_m(T, φ)^{λ_m}

where T = T_1, T_2, …, T_{|T|} is a sequence of bilingual tuples, and φ : {1, 2, …, |x|} → {1, 2, …, |T|} is a mapping between source words and tuples.

Page 44: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Statistical Machine Translation

Source: María no daba una bofetada a la bruja verde
Reordered source: María no daba una bofetada a la verde bruja
Target: Mary did not slap the green witch

(Figure: word alignment matrix between the source and target sentences.)

Phrase-based (examples of extracted phrases):
(María, Mary), (no, did not), (María no, Mary did not), (daba una bofetada, slap), (verde, green), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (a la, the), (bruja, witch), (bruja verde, green witch), (a la bruja verde, the green witch), …

N-gram-based (unique segmentation into tuples):
(María, Mary) ⇒ T1, (no, did not) ⇒ T2, (daba una bofetada, slap) ⇒ T3, (a, <NULL>) ⇒ T4, (la, the) ⇒ T5, (verde, green) ⇒ T6, (bruja, witch) ⇒ T7.

Page 45: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation decoding

- Phrase-based and N-gram-based systems follow a similar decoding algorithm.
- Major difference:
  - N-gram-based approach: uses an LM for the computation of the joint probability.
  - Phrase-based approach: does not need the joint probability; it uses a larger phrase table with conditional probabilities.

(Figure: general decoding overview — Reordering Module → Word2Tuple Module (current stage, max_{v ∈ current stage} p(v)) → Viterbi Module (current active vertex, max_{e ∈ current active vertex} p(e)).)

Page 46: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation decoding

(Figure: decoding example for the source sentence “tengo un coche rojo”, translated as “I have a red car”: hypotheses are expanded with tuples such as (tengo ||| I have), (un ||| a), (coche ||| car), (rojo ||| red), (tengo un ||| I have a) and (coche rojo ||| red car), tracking source coverage vectors over a multi-stage search graph.)

Page 47: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

N-best list rescoring with NN LMs:
- IWSLT’06 Italian-English task.
- WMT’10 and WMT’11 English-Spanish tasks.

Totally integrated NN LM decoding:
- IWSLT’10 French-English task.
- News-Commentary 2010 Spanish-English task.

Participation in the international evaluation campaigns IWSLT’10, WMT’10 and WMT’11, achieving very well positioned systems (second position at IWSLT’10 and WMT’11).

Page 48: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

- News-Commentary 2010 Spanish-English task.
- Total integration of NN LMs in decoding vs. N-best list rescoring.
- Comparative evaluation of the N-gram-based and phrase-based approaches.
- Phrase-based models are trained using the Giza++ and Moses toolkits, decoding with April.
- N-gram-based models are trained with the Giza++ and April toolkits, decoding with April.

Page 49: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

News-Commentary 2010 statistics:

                              Spanish              English
    Set                       # Lines   # Words    # Lines   # Words    Voc. size
    News-Commentary 2010      80.9K     1.8M       81.0K     1.6M       38 781
    News2008                  2.0K      52.6K      2.0K      49.7K      –
    News2009                  2.5K      68.0K      2.5K      65.6K      –
    News2010                  2.5K      65.5K      2.5K      61.9K      –

    N-gram-based bilingual tuple translation corpus
    Set                       # Lines   # Tuples   Voc. size
    News-Commentary 2010      80.9K     1.5M       231 981

    English data for LM training
    Set                       # Lines   # Words
    News-Commentary 2010      125.9K    2.97M

The tuple vocabulary size is too large for direct NN LM training ⇒ NN LMs of statistical classes.

Page 50: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

NN LMs of statistical classes, training procedure:

1. Using Giza++, a non-ambiguous mapping between tuples and classes is built (CLS is the number of classes).
2. The conditional probability of each tuple given its class is computed by counting:

       p(z ∈ Δ | c ∈ CLS) = C(z|c) / Σ_{z′∈Δ} C(z′|c)

   where C(z|c) is the count of tuple z in class c.
3. Tuples are substituted by their corresponding classes.
4. The standard training algorithm is used over the previous dataset to estimate the NN LMs.
5. The joint probability is computed as:

       p(x, y) ≈ ∏_i p(T_i | T_{i−1} … T_{i−N+1}) ≈ ∏_i p(T_i | c_i) · p(c_i | c_{i−1} … c_{i−N+1})
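A sketch of steps 2 and 5, assuming the tuple-to-class mapping (from Giza++) and the class N-gram probabilities (from the class NN LM) are computed elsewhere; the function names and the count representation are illustrative only:

    #include <cmath>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Step 2: p(z | c) by relative counts inside each class.
    // `count` maps (tuple, class id) to C(z|c).
    std::map<std::pair<std::string, int>, double>
    tupleGivenClass(const std::map<std::pair<std::string, int>, long> &count) {
        std::map<int, long> class_total;
        for (const auto &kv : count) class_total[kv.first.second] += kv.second;
        std::map<std::pair<std::string, int>, double> prob;
        for (const auto &kv : count)
            prob[kv.first] = double(kv.second) / double(class_total[kv.first.second]);
        return prob;
    }

    // Step 5: log p(x, y) ≈ sum_i [ log p(T_i | c_i) + log p(c_i | c_{i-1} ... c_{i-N+1}) ],
    // given the per-position factors already looked up for one tuple sequence.
    double jointLogProb(const std::vector<double> &p_tuple_given_class,
                        const std::vector<double> &p_class_ngram) {
        double lp = 0.0;
        for (std::size_t i = 0; i < p_tuple_given_class.size(); ++i)
            lp += std::log(p_tuple_given_class[i]) + std::log(p_class_ngram[i]);
        return lp;
    }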

Page 51: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

(Figure: BLEU on News2009 vs. N-gram order (2 to 5) for NNTM-100, NNTM-300, NNTM-500 and NNTM-1000.)

News2009 results:

    System              BLEU   TER
    April-NB baseline   20.2   60.4
    + NNTLM-4gr         20.9   59.9
    + NNTM-300-4gr      21.1   59.7
    + NNTM-500-4gr      21.2   59.7

Page 52: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

                                               News2009        News2010
    System                                     BLEU   TER      BLEU   TER     Time (s/sentence)
    Moses                                      20.4   60.3     22.6   57.8    0.6
    April-PB                                   20.6   60.3     22.7   57.8    0.4
    Moses⋆                                     –      –        22.6   57.9    0.6
    April-NB                                   20.2   60.4     22.7   58.0    0.8

    Integrating smoothed Fast NN LMs in the decoder
    April-PB + NNTLM                           21.2   59.8     23.2   57.5    1.8
    April-NB + NNTLM                           20.9   59.9     23.2   57.4    1.8
    April-NB + NNTM                            20.7   60.0     23.3   57.6    1.6
    April-NB + NNTLM + NNTM                    21.2   59.7     23.6   57.1    2.5

    Integrating on-the-fly Fast NN LMs (standard NN LMs) in the decoder
    April-PB + NNTLM                           –      –        23.3   57.3    384.3
    April-NB + NNTLM + NNTM                    –      –        23.7   57.1    177.3

    Rescoring a 2000-uniq-best list with standard NN LMs
    April-PB + NNTLM                           21.1   59.9     23.4   57.3    –
    April-NB + NNTLM                           20.9   60.0     –      –       –
    April-NB + NNTM                            20.6   60.2     –      –       –
    April-NB + NNTLM + NNTM                    21.1   59.8     23.5   57.4    –

Page 53: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

(Figure: BLEU, TER and time (s/sentence) for the N-gram-based (NG) and phrase-based (PB) systems as a function of the number of precomputed softmax normalization constants, log scale from 1e+02 to 1e+06.)

Conclusions:
- Smoothed Fast NN LM loss ≈ 0.2 BLEU/TER points.
- Phrase-based system: slightly worse when integrated.
- N-gram-based system: slightly better when integrated; the TER difference is statistically significant at 95% confidence using a pairwise test.
- Adding the class-based NNTM improves results by 0.6 BLEU points.
- The Smoothed Fast NN LM system is two/three times slower… but achieves a speed-up of 70 compared with the standard NN LM.

Page 54: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 55: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Final conclusions I

Contributions to connectionist language modeling:
- Speed-up technique based on the precomputation of softmax normalization constants.
- Formalization and development of a totally coupled Viterbi decoding:
  - comparable computational cost for sequence recognition tasks;
  - computational cost two/three times higher for Machine Translation tasks;
  - improves baseline quality in every case.
- Extension to dynamic domain adaptation with cache-based NN LMs.

Page 56: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Final conclusions II

Contributions to sequence recognition:
- Encouraging results using Cache NN LMs for an SLU task.
- State-of-the-art improvement using NN LMs in HTR tasks.
- Character-based NN LMs to deal with scarce data.

Contributions to SMT:
- Implementation of a DP decoding algorithm for SMT.
- Improving N-gram-based SMT by using class-based NN LMs.
- Well positioned systems at international Machine Translation evaluations.

Page 57: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Future work

Connectionist language modeling:
- New projection layer initialization method based on POS tags.
- Improve the NN LM and standard N-gram combination: GLI-CS.

Sequence recognition:
- Improve SLU using deep learning with continuous space techniques.
- HTR system combining character-based and word-based LMs (OOV).

Statistical Machine Translation:
- Application of the continuous space idea to reordering models.
- Study of Cache NN LMs for document translation.
- New NN LMs for the vocabulary dispersion problem in N-gram-based SMT.
- Integration of the SMT decoder for human-assisted transcription.

Page 58: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications related with this PhD I

Speed-up technique for NN LMs:

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera. Fast Evaluation of Connectionist Language Models. Pages 33–40 of the IWANN 2009 proceedings, Salamanca.

S. España-Boquera, F. Zamora-Martínez, M.J. Castro-Bleda, J. Gorbe-Moya. Efficient BP Algorithms for General Feedforward Neural Networks. Pages 327–336 of the IWINAC 2007 proceedings, Murcia.

Spoken Language Understanding and Cache NN LMs:

F. Zamora-Martínez, Salvador España-Boquera, M.J. Castro-Bleda, Renato de-Mori. Cache Neural Network Language Models based on Long-Distance Dependencies for a Spoken Dialog System. Pages 4993–4996 of the IEEE ICASSP 2012 proceedings, Kyoto (Japan).

Page 59: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications related with this PhD II

Handwritten Text Recognition:

F. Zamora-Martínez, V. Frinken, S. España-Boquera, M.J. Castro-Bleda, A. Fischer, H. Bunke. Neural Network Language Models in Off-Line Handwriting Recognition. Pattern Recognition. SUBMITTED.

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, J. Gorbe-Moya. Unconstrained Offline Handwriting Recognition using Connectionist Character N-grams. Pages 18–23 of the IJCNN proceedings, Barcelona, 2010.

Statistical Machine Translation:

F. Zamora-Martínez, M.J. Castro-Bleda. CEU-UPV English-Spanish system for WMT11. Pages 490–495 of the WMT 2011 proceedings, Edinburgh (Scotland).

F. Zamora-Martínez, M.J. Castro-Bleda, H. Schwenk. Ngram-based Machine Translation enhanced with Neural Networks for the French-English BTEC-IWSLT’10 task. Pages 45–52 of the IWSLT 2010 proceedings, Paris (France).

Page 60: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications related with this PhD III

F. Zamora-Martínez, Germán Sanchis-Trilles. UCH-UPV English-Spanish system for WMT10. Pages 207–211 of the WMT 2010 proceedings, Uppsala (Sweden).

F. Zamora-Martínez, M.J. Castro-Bleda. Traducción Automática Estadística basada en Ngramas Conexionistas. Pages 221–228 of the SEPLN journal, volume 45, number 45, 2010. Valencia (Spain).

Maxim Khalilov, José A. R. Fonollosa, F. Zamora-Martínez, María J. Castro-Bleda, S. España-Boquera. Neural Network Language Models for Translation with Limited Data. Pages 445–451 of the ICTAI 2008 proceedings, Dayton (USA).

Maxim Khalilov, José A. R. Fonollosa, F. Zamora-Martínez, María J. Castro-Bleda, S. España-Boquera. Arabic-English translation improvement by target-side neural network language modeling. In proceedings of the HLT & NLP workshop at LREC 2008, Marrakech (Morocco).

Page 61: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications in collaboration I

Recurrent NN LMs based on Long Short-Term Memories:

Volkmar Frinken, F. Zamora-Martínez, Salvador España-Boquera, María J. Castro-Bleda, Andreas Fischer, Horst Bunke. Long-Short Term Memory Neural Networks Language Modeling for Handwriting Recognition. ICPR 2012 proceedings, Tsukuba (Japan).

Handwritten Text Recognition:

S. España-Boquera, M.J. Castro-Bleda, J. Gorbe-Moya, F. Zamora-Martínez. Improving Offline Handwritten Text Recognition with Hybrid HMM/ANN Models. Pages 767–779 of the IEEE TPAMI journal, volume 33, number 4, 2011.

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, J. Gorbe-Moya. Improving Isolated Handwritten Word Recognition Using a Specialized Classifier for Short Words. Pages 61–70 of the CAEPIA 2009 proceedings, Sevilla.

J. Gorbe-Moya, S. España-Boquera, F. Zamora-Martínez, M.J. Castro-Bleda. Handwritten Text Normalization by using Local Extrema Classification. Pages 164–172 of the PRIS 2008 workshop, Barcelona.

Page 62: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications in collaboration II

Decoding:

S. España-Boquera, M.J. Castro-Bleda, F. Zamora-Martínez, J. Gorbe-Moya. Efficient Viterbi Algorithms for Lexical Tree Based Models. Pages 179–187 of the NOLISP 2007 proceedings, Paris.

S. España-Boquera, J. Gorbe-Moya, F. Zamora-Martínez. Semiring Lattice Parsing Applied to CYK. Pages 603–610 of the IbPRIA 2007 conference, Girona.

Page 63: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Thanks for your attention!

Questions?