Contributions to connectionist language modeling and its application to sequence recognition and machine translation
PhD Thesis defense
Francisco Zamora Martínez, supervised by María José Castro Bleda
Departament de Sistemes Informàtics i Computació, Universitat Politècnica de València
30 November 2012



DESCRIPTION

Natural Language Processing is an area of Artificial Intelligence and, in particular, of Pattern Recognition. It is a multidisciplinary field that studies human language, both oral and written, and deals with the development and study of computational mechanisms for communication between people and computers using natural languages. Natural Language Processing is a research area in constant evolution, and this work focuses on the part related to language modeling and its application to several tasks: sequence recognition/understanding and statistical machine translation. Specifically, this thesis focuses on the so-called connectionist language models (or continuous space language models), i.e., language models based on neural networks. Their excellent performance in various Natural Language Processing areas motivated this study.

Because of certain computational problems suffered by connectionist language models, the most widespread approach followed by systems that currently use these models is based on two totally decoupled stages. In a first stage, a standard and cheaper language model generates a set of feasible hypotheses, under the assumption that this set is representative of the search space in which the best hypothesis lies. In a second stage, a connectionist language model is applied to this set and the list of hypotheses is rescored. This scenario motivates the scientific goals of this thesis:

- Developing techniques to drastically reduce the computational cost while degrading quality as little as possible.
- Studying the effect of a totally coupled approach that integrates neural network language models into the decoding stage.
- Developing extensions of the original model in order to improve its quality and to achieve context/domain adaptation.
- Empirically applying neural network language models to sequence recognition and machine translation tasks.

All developed algorithms were implemented in C++, using Lua as scripting language. The implementations are compared with those considered standard for each of the addressed tasks. Neural network language models achieve very interesting quality improvements over the reference baseline systems:

- competitive results on automatic speech recognition and spoken language understanding;
- improvement of the state of the art in handwritten text recognition;
- state-of-the-art results in statistical machine translation, as shown by the participation in international evaluation campaigns.

On sequence recognition tasks, the integration of neural network language models into the first decoding stage achieves very competitive computational costs. However, their integration in machine translation tasks requires further development, because the computational cost of the system is still somewhat high.


Page 1: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Contributions to connectionist language modeling and its application to sequence recognition and machine translation

PhD Thesis defense

Francisco Zamora Martínez
supervised by María José Castro Bleda

Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València

2012 November 30

Page 2: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 3: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 4: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Motivation

Important role of Language Models (LMs): N-grams
+ learned automatically
+ simple and effective
– problems with unseen patterns, requiring smoothing heuristics

Neural Network Language Models (NN LMs): based on N-grams
+ automatic smoothing of unseen patterns
– big computational cost
– decoupled integration ⇒ N-best list rescoring

Open questions on NN LMs:
- Totally coupled integration scheme.
- NN LM capabilities to improve hypothesis pruning.
- Quality of fast evaluation of NN LMs.
- Modeling of long-distance dependencies with NN LMs.

Tasks: Spoken Language Understanding, Handwritten Text Recognition, Statistical Machine Translation.

Page 5: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Pattern Recognition

Fundamental equation:

    ŷ = argmax_{y∈Ω+} p(y|x) = argmax_{y∈Ω+} p(x|y)·p(y)

Machine translation example:
    x = Traduciremos esta frase al inglés              (x1 x2 x3 x4 x5)
    y = We will translate this sentence into English   (y1 y2 y3 y4 y5 y6 y7)

Handwritten Text Recognition example:
    x = (text line image)
    y = must be points .                               (y1 y2 y3 y4)

Page 6: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Pattern Recognition

Fundamental equation and its generalization:

    ŷ = argmax_{y∈Ω+} p(y|x) = argmax_{y∈Ω+} p(x|y)·p(y)

    ŷ = argmax_{y∈Ω+} p(y|x) = argmax_{y∈Ω+} ∏_{m=1..M} H_m(x, y)^{λ_m}

Language Models (LMs) estimate the a-priori probability p(y), that is, they compute a measure of how well y belongs to the task language.
The model is generalized under the maximum entropy framework as a log-linear combination.
For Spoken Language Understanding (SLU) and Handwritten Text Recognition (HTR), it models the Grammar Scale Factor (GSF) and Word Insertion Penalty (WIP) weights.

Page 7: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Dataflow architecture

The search procedure computes ŷ using the previous equations.
Algorithms based on graphs were implemented, breaking the search into small building blocks.
Each block is a module in a dataflow architecture.
Modules exchange different information types: feature vectors, graph protocol messages, probabilities, …

(Figure: speech recognition dataflow — x → GenProbs Module → OSE Module → WGen Module → word graph → NGramParser Module → y, with Viterbi pruning max_{e ∈ active vertex} p(e).)
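As a rough illustration of the dataflow idea (not the actual April toolkit code; the names Packet, Module and MaxScoreModule are assumptions), a module can be sketched as an object that consumes one typed packet and emits others through a callback:

    #include <algorithm>
    #include <functional>
    #include <string>
    #include <variant>
    #include <vector>

    // Illustrative packet type: modules exchange feature vectors,
    // graph protocol messages (here just strings) or probabilities.
    using Packet = std::variant<std::vector<float>, std::string, double>;

    // Hypothetical module interface: consume an input packet, emit output packets.
    struct Module {
        virtual ~Module() = default;
        virtual void process(const Packet &in,
                             const std::function<void(Packet)> &emit) = 0;
    };

    // Toy example: forward the maximum of a feature vector as a score.
    struct MaxScoreModule : Module {
        void process(const Packet &in,
                     const std::function<void(Packet)> &emit) override {
            if (auto v = std::get_if<std::vector<float>>(&in)) {
                float best = v->empty() ? 0.0f : v->front();
                for (float x : *v) best = std::max(best, x);
                emit(Packet{static_cast<double>(best)});
            }
        }
    };

Chaining such modules, each one reading from the previous one's emitted packets, reproduces the pipeline drawn in the figure above.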

Page 8: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Graph protocol messages

The most important dataflow data type.
Normally, graph messages are produced/consumed left-to-right.
A specialization for multi-stage graphs is possible (as for Statistical Machine Translation).

General graph (message sequence):

     1: begin_dag (multistage=false)
     2: vertex (0)
     3: is_initial (0)
     4: no_more_in_edges (0)
     5: vertex (1)
     6: edge (0, data=⟨a, 1.0⟩, ⟨b, 0.7⟩)
     7: no_more_in_edges (1)
     8: vertex (2)
     9: edge (0, data=⟨b, 0.5⟩)
    10: no_more_in_edges (2)
    11: no_more_out_edges (0)
    12: vertex (3)
    13: edge (1, data=⟨b, 0.1⟩)
    14: edge (2, data=⟨a, 1.0⟩)
    15: no_more_in_edges (3)
    16: no_more_out_edges (2)
    17: vertex (4)
    18: is_final (4)
    19: edge (1, data=⟨a, 1.0⟩, ⟨c, 0.2⟩)
    20: no_more_out_edges (1)
    21: edge (3, data=⟨d, 0.4⟩)
    22: no_more_out_edges (3)
    23: no_more_in_edges (4)
    24: no_more_out_edges (4)
    25: end_dag ()
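A minimal sketch, under assumptions, of a left-to-right consumer of these messages; the message names follow the listing above, while the GraphConsumer class and its layout are illustrative only:

    #include <cstdio>
    #include <string>
    #include <utility>
    #include <vector>

    // One edge payload: (symbol, score) pairs as in data=<a, 1.0>, <b, 0.7>.
    using EdgeData = std::vector<std::pair<std::string, double>>;

    // Hypothetical handler: the decoder reacts to each graph protocol message.
    struct GraphConsumer {
        void begin_dag(bool multistage)        { std::printf("begin_dag(%d)\n", int(multistage)); }
        void vertex(int v)                     { std::printf("vertex(%d)\n", v); }
        void is_initial(int v)                 { std::printf("is_initial(%d)\n", v); }
        void is_final(int v)                   { std::printf("is_final(%d)\n", v); }
        void edge(int from, const EdgeData &d) { std::printf("edge(from=%d, %zu labels)\n", from, d.size()); }
        void no_more_in_edges(int v)           { std::printf("no_more_in_edges(%d)\n", v); }
        void no_more_out_edges(int v)          { std::printf("no_more_out_edges(%d)\n", v); }
        void end_dag()                         { std::printf("end_dag()\n"); }
    };

    int main() {
        GraphConsumer c;                       // replay the beginning of the listing above
        c.begin_dag(false);
        c.vertex(0); c.is_initial(0); c.no_more_in_edges(0);
        c.vertex(1); c.edge(0, {{"a", 1.0}, {"b", 0.7}}); c.no_more_in_edges(1);
        c.end_dag();
    }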

Page 9: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Optimization and evaluation

Evaluation measures:
- Perplexity (PPL).
- Word Error Rate (WER).
- Character Error Rate (CER).
- Sentence Error Rate (SER).
- Concept Error Rate (CER).
- Bilingual Evaluation Understudy (BLEU).
- Translation Edit Rate (TER).

Log-linear combination:
- All tasks are formalized as a log-linear combination.
- Weights are estimated via Minimum Error Rate Training (MERT).

Confidence intervals and comparison:
- Bootstrapping technique.
- Pairwise comparison.

Page 10: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Goals

Scientific aims:
- NN LM formalization as general N-grams.
- Method specification for the totally coupled integration of NN LMs.
- Evaluation of the totally coupled approach.
- NN LM extension to Cache NN LMs, inspired by cache LMs.

Technological aims:
- Efficient implementation of training and evaluation of NN LMs.
- Efficient implementation of algorithms for coupling NN LMs in SLU, HTR and Statistical Machine Translation (SMT).
- MERT algorithm to estimate GSF and WIP in SLU and HTR.
- April toolkit development in collaboration with the research group.

Page 11: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 12: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Language modeling

- Statistical LMs follow the pattern recognition fundamental equation.
- They estimate the probability of a sentence as belonging to a certain language.
- Simplified by using N-gram LMs over sequences of order N:

    p(y) ≈ ∏_{j=1..|y|} p(y_j | h_j),   with h_j = y_{j−1} y_{j−2} … y_{j−N+1}

N-gram probability computation (3-gram):

    p(ω1 ω2 ω3) = p(ω1 | bcc) · p(ω2 | bcc ω1) · p(ω3 | ω1 ω2) · p(ecc | ω2 ω3)

bcc is the begin context cue (start of sentence); ecc is the end context cue (end of sentence).

Page 13: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Connectionist Language Models

- Based on the idea of word projections onto a continuous space.
- Interpolation of unseen N-grams given the word projections.
- Joint training of word projections and LM probability computation.
- Word projections are position independent: shared weights.

Word projection:

Input: local encoding, word as a category (size = |Ω|)
    0, 0, …, 0, 0, 0, 1, 0, 0, …, 0, 0, 0, 0

Projection: distributed encoding, a feature vector (size ≪ |Ω|)
    0.1, …, −0.4, 0.2, …, 1.1
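Because the input is a local (one-hot) encoding, multiplying it by the projection matrix amounts to selecting one row. A minimal sketch of this lookup; the ProjectionLayer type, sizes and field names are illustrative assumptions, not the April implementation:

    #include <cstddef>
    #include <vector>

    // Projection matrix: |vocab| rows, proj_dim columns, shared by all
    // context word positions (shared weights).
    struct ProjectionLayer {
        std::size_t proj_dim;
        std::vector<float> weights;            // row-major, |vocab| x proj_dim

        // One-hot input times the matrix == a copy of row `word_id`.
        std::vector<float> project(std::size_t word_id) const {
            const float *row = &weights[word_id * proj_dim];
            return std::vector<float>(row, row + proj_dim);
        }
    };

The projected vectors of the N−1 context words are then concatenated and fed to the rest of the network.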

Page 14: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Connectionist Language Models

(Figure: feed-forward NN LM architecture — each of the N−1 context words, in local one-hot encoding, feeds a shared projection layer, followed by the rest of the network up to the output layer.)

Page 15: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Connectionist Language Models

Training issues:
- Stochastic backpropagation algorithm with weight decay regularization.
- Stochastic training selects, with replacement, a random set of patterns every epoch.
- For large datasets, training converges before the training partition has been completely traversed.
- Fast training using matrix-matrix multiplications and fine-tuned BLAS implementations (bunch mode).
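A rough sketch, under assumptions not on the slide, of the pattern selection and bunch assembly: a random subset is drawn with replacement every epoch and grouped into bunches so that each forward/backward pass becomes one matrix-matrix product handled by BLAS. Function names and sizes are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    // Draw `epoch_size` training pattern indices with replacement.
    std::vector<std::size_t> sample_epoch(std::size_t num_patterns,
                                          std::size_t epoch_size,
                                          std::mt19937 &rng) {
        std::uniform_int_distribution<std::size_t> pick(0, num_patterns - 1);
        std::vector<std::size_t> idx(epoch_size);
        for (auto &i : idx) i = pick(rng);
        return idx;
    }

    // Group sampled indices into bunches of `bunch_size` patterns; each bunch
    // is later forwarded as one matrix-matrix multiplication (BLAS gemm).
    std::vector<std::vector<std::size_t>> make_bunches(
            const std::vector<std::size_t> &idx, std::size_t bunch_size) {
        std::vector<std::vector<std::size_t>> bunches;
        for (std::size_t i = 0; i < idx.size(); i += bunch_size)
            bunches.emplace_back(idx.begin() + i,
                                 idx.begin() + std::min(i + bunch_size, idx.size()));
        return bunches;
    }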

Page 16: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

NN LM deficiencies and our solutions I

Word projections – Projection layer

Deficiencies:
- Random initialization of the projection layer: low-frequency (“rare”) words have very different encodings.
- Large vocabularies + stochastic training ⇒ poor training of “rare” words.

Solutions:
- Restrict the NN LM input vocabulary to words with frequency > θ (experiments).
- Use of bias and weight decay terms on the projection layer.

Page 17: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

NN LM deficiencies and our solutions II

Computational problems – Output layer

Deficiencies:
- Softmax activation forces the computation of all outputs: o_i = exp(a_i) / Σ_{k=1..|A|} exp(a_k).
- Training problems were partially solved using fast math operations and stochastic backpropagation.
- At decoding time, the softmax bottleneck forces the development of decoupled systems: N-best list rescoring.

Solutions:
- Shortlist: output vocabulary Ω′ restricted to the most frequent words.

    p(y_j | h_j) = p_NN(y_j | h_j),                                        if y_j ∈ Ω′
    p(y_j | h_j) = p_NN(OOS | h_j) · C_OOS(y_j) / Σ_{y′∉Ω′} C_OOS(y′),     if y_j ∉ Ω′

- Precompute softmax normalization constants: Fast NN LM.
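A minimal sketch of the shortlist computation above, assuming the network outputs (including the out-of-shortlist neuron) and the unigram counts C_OOS are already available for the current context; the ShortlistLM wrapper and its field names are assumptions:

    #include <string>
    #include <unordered_map>

    // Hypothetical shortlist wrapper around a trained NN LM, for one context h.
    struct ShortlistLM {
        std::unordered_map<std::string, double> nn_prob;    // p_NN(w | h) for w in the shortlist
        double oos_prob;                                     // p_NN(OOS | h)
        std::unordered_map<std::string, double> oos_counts;  // C_OOS(w) for out-of-shortlist words
        double oos_count_sum;                                // sum of C_OOS over out-of-shortlist words

        double prob(const std::string &word) const {
            auto it = nn_prob.find(word);
            if (it != nn_prob.end()) return it->second;            // word in the shortlist
            auto c = oos_counts.find(word);
            double count = (c != oos_counts.end()) ? c->second : 0.0;
            return oos_prob * count / oos_count_sum;               // redistribute the OOS mass
        }
    };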

Page 18: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

NN LM and decoding

N-best list rescoring:
- Widely used in the literature.
- Encouraging improvements in ASR and SMT.

Totally integrated decoding:
- Not previously attempted in the literature.
- Major contribution and focus of this work.

Page 19: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Integration of NN LM into decoding

- A generic framework will be presented.
- The softmax computation problems need to be solved.

Generic LM interface (sketched in C++ below):
- An LMkey is an automaton state number.
- The LMkey is used by the decoder as an N-gram context identifier.
- prepareLM(LMkey)
- getLMprob(LMkey, word)
- getNextLMKey(LMkey, word)
- getInitialLMKey()
- getFinalLMKey()
- restartLM()
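Read as an abstract class, the interface could look like the following sketch (a free C++ interpretation using the function names listed above, not the April code; the LMKey type and comments are assumptions):

    #include <cstdint>
    #include <string>

    using LMKey = std::uint32_t;   // automaton state number / N-gram context id

    // Hypothetical C++ rendering of the generic LM interface.
    struct LanguageModel {
        virtual ~LanguageModel() = default;
        virtual void   prepareLM(LMKey key) = 0;                          // precompute what the state needs
        virtual double getLMprob(LMKey key, const std::string &w) = 0;    // p(w | context identified by key)
        virtual LMKey  getNextLMKey(LMKey key, const std::string &w) = 0; // follow the transition labelled w
        virtual LMKey  getInitialLMKey() = 0;                             // state for the start-of-sentence context
        virtual LMKey  getFinalLMKey() = 0;                               // state reached by the end-of-sentence cue
        virtual void   restartLM() = 0;                                   // reset between sentences
    };

Any LM (standard N-gram, NN LM, Fast NN LM) plugged behind this interface can be used by the same decoder.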

(Figure: example N-gram stochastic finite-state automaton with states for the contexts _, a, b, aa and ab, and transitions labelled a and b.)

Page 20: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Generic framework

TrieLM:
- The TrieLM represents the whole N-gram space: |Ω|^N (NN LMs).
- It is built on-the-fly, enumerating only the states needed by decoding.
- A path in the TrieLM is the sequence of the N − 1 context words of an N-gram.
- Two kinds of node: persistent and dynamic.

TrieLM node:
- Parent back-pointer.
- Time-stamp.
- Word transition.

(Figure: TrieLM example.)
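A sketch of what a TrieLM node might look like, following the three fields on the slide (parent back-pointer, time-stamp, word transition); the persistent flag, the stored constant and all names are assumptions made for illustration:

    #include <cstdint>
    #include <vector>

    // One TrieLM node: the path from the root spells the N-1 context words.
    struct TrieLMNode {
        std::int32_t  parent;      // back-pointer to the parent node (-1 for the root)
        std::uint32_t word;        // word transition taken from the parent
        std::uint64_t timestamp;   // last decoding step that touched it (dynamic nodes)
        bool          persistent;  // true for precomputed (training-time) nodes
        double        softmax_Z;   // precomputed softmax normalization constant, if any
    };

    // The whole trie is just a growable pool of nodes built on-the-fly.
    using TrieLM = std::vector<TrieLMNode>;

Dynamic nodes can be recycled when their timestamp falls behind the current decoding step, keeping the enumerated part of the |Ω|^N space small.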

Page 21: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

The softmax normalization constant forces the computation of all output neurons, even when only a few are needed:

    o_i = exp(a_i) / Σ_{k=1..|A|} exp(a_k)

Softmax has several advantages:
- Ensures true probability computations with ANNs.
- Improves training convergence.

Number of weights at the output layer:

    |Ω′|       Output layer weights
    100             25 700
    1 000          257 000
    10 000       2 570 000

Our solution:
- Precompute the most important constants needed during decoding.
- When a constant is not found, two possibilities are feasible: compute the constant on-the-fly, or use some kind of smoothing.

Page 22: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Preliminary notes:
- For an NN LM of order N there exist |Ω_I|^{N−1} softmax normalization constants.
- Note that a bigram only needs |Ω_I| constants.
- Ω_I is the restricted NN LM input vocabulary.

(Figure: a 4-gram NN LM.)

Page 23: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Precomputation procedure (training):
- N-gram contexts are extracted from a training corpus, counting their frequencies.
- For the most frequent contexts, the softmax normalization constants are computed.
- The TrieLM stores, as persistent nodes, the N-gram context words associated with each softmax normalization constant together with the constant value.

Softmax normalization constants computation (3-gram):

    Sentence: A MOVE to stop Mr. Gaitskell from …

    N − 1 context words    Softmax constant
    A MOVE                 43 418
    MOVE to                78 184
    to stop                88 931
    …                      …
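A sketch, under assumptions, of this training-time pass: count the N−1 word contexts seen in the corpus and compute the softmax normalization constant for the most frequent ones. computeSoftmaxZ stands in for a forward pass of the NN LM; all names are illustrative:

    #include <algorithm>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    using Context = std::vector<std::string>;   // N-1 context words

    // Stand-in for a forward pass of the NN LM returning sum_i exp(a_i).
    double computeSoftmaxZ(const Context &ctx);

    // Count contexts, keep the `max_constants` most frequent ones, and
    // precompute their softmax normalization constants (later stored as
    // persistent TrieLM nodes).
    std::map<Context, double> precomputeConstants(
            const std::vector<Context> &corpus_contexts, std::size_t max_constants) {
        std::map<Context, std::size_t> freq;
        for (const auto &c : corpus_contexts) ++freq[c];

        std::vector<std::pair<Context, std::size_t>> ranked(freq.begin(), freq.end());
        std::sort(ranked.begin(), ranked.end(),
                  [](const auto &a, const auto &b) { return a.second > b.second; });

        std::map<Context, double> constants;
        for (std::size_t i = 0; i < ranked.size() && i < max_constants; ++i)
            constants[ranked[i].first] = computeSoftmaxZ(ranked[i].first);
        return constants;
    }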

Page 24: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Smoothed approach (evaluation):
- Use a simpler model if a constant is not found.
- The simplest model is an NN LM bigram or a standard N-gram.

(Figure: smoothed evaluation flowchart. To compute P(stop | a move to), the 3-gram softmax normalization constants are searched for the context “a move to”; if not found, the 3-gram NN LM is used with the 2-gram constants for “move to”, and finally the 2-gram NN LM with the 1-gram constants.)
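The flowchart amounts to a back-off over precomputed constants: try the longest context first and, when its constant is missing, fall back to a shorter context and a lower-order NN LM. A compact sketch under the same illustrative assumptions as before (constants indexed by context length, unnormalizedScore standing in for the NN LM output):

    #include <map>
    #include <string>
    #include <vector>

    using Context = std::vector<std::string>;

    // Stand-in: unnormalized output exp(a_word) of the NN LM of order ctx.size()+1.
    double unnormalizedScore(const Context &ctx, const std::string &word);

    // constants[k] holds the precomputed softmax constants for contexts of length k;
    // the vector must be sized up to the maximum context length used.
    double smoothedProb(const std::vector<std::map<Context, double>> &constants,
                        Context ctx, const std::string &word) {
        while (!ctx.empty()) {
            const auto &table = constants[ctx.size()];
            auto it = table.find(ctx);
            if (it != table.end())                   // constant found: full NN LM of this order
                return unnormalizedScore(ctx, word) / it->second;
            ctx.erase(ctx.begin());                  // drop the oldest word: back off one order
        }
        return unnormalizedScore(ctx, word);         // last resort: lowest-order model
    }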

Page 25: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Fast evaluation of NN LMs

Smoothed approach pros/cons:
– Model quality reduction.
+ Constants do not need to be computed.
Trade-off between quality and speed: more precomputed constants mean higher quality, but slower speed.

On-the-fly approach pros/cons:
+ Full model quality.
– Needs to compute a lot of constants on-the-fly; more precomputed constants mean faster speed.
It is possible to store computed constants for future use.
– However, it is always slower than the smoothed approach.

Page 26: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments with PPL in text: NN LMs and speed-up

Fast evaluation approach:
- LOB-ale corpus: random sentences from LOB with a closed vocabulary.
- Three configurations are evaluated:
  - On-the-fly Fast NN LM: computing constants when needed.
  - Smoothed Fast NN LM: using a simpler model.
  - Smoothed-SRI Fast NN LM: the simplest model is an SRI model.

PPL and speed-up results:

    Setup                      ms/word   speed-up   test PPL
    Mixed NN LM                6.43      0          79.90
    On-the-fly Fast NN LM      1.82      3          79.90
    Smoothed Fast NN LM        0.19      33         80.78
    Smoothed-SRI Fast NN LM    0.19      33         79.02

Page 27: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments with PPL in text: NN LMs and speed-up

(Figure: Smoothed Fast NN LM results.)

Page 28: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments on HTR: NN LMs and integrated decoding

- Totally integrated decoding.
- Experiments on the influence of pruning on WER and time.
- IAM-DB task, presented in depth in the sequence recognition part.
- 4-gram NN LM linearly combined with an SRI bigram, following the on-the-fly (standard) approach and the smoothed approach.

Page 29: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Experiments on HTR: NN LMs and integrated decoding

(Figure: three plots comparing the Smoothed Fast NN LM, the standard NN LM and the SRI bigram — sec/word vs. WER, WER and relative improvement (%) vs. histogram pruning size (×1000), and sec/word and time ratio (%) vs. histogram pruning size.)

Conclusions:
- Smoothed Fast NN LM: better time, same WER.
- Smoothed Fast NN LM: improves WER by 8%, with an additional 10% time.
- Standard NN LM approach: same WER as the smoothed one, but two times slower.

Page 30: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 31: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Hidden Markov Model decoding

- Based on HMM/ANN models.
- Two-step decoding with pruning synchronization.
- N-gram Viterbi decoder with integrated NN LMs.

(Figure: dataflow — x → GenProbs Module → OSE Module → WGen Module → word graph → NGramParser Module → y, with Viterbi pruning max_{e ∈ active vertex} p(e).)

Two tasks:
- Spoken Language Understanding (SLU).
- Handwritten Text Recognition (HTR).

Page 32: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

Cache NN LM for long-distance dependencies:

Cache-based LM:

    p(y_j | y_{j−1} … y_1) = α · p(y_j | h_j) + (1 − α) · p_cache(y_j | h_1^{j−1})

Cache NN LM:

    p(y) ≈ p(y | h_1^{u−1}) ≈ ∏_{j=1..|y|} p(y_j | h_j, h_1^{u−1})

where h_j is the N-gram context and h_1^{u−1} is the cache part.

- The Cache NN LM receives a summary of all previous machine/user interactions.
- The cache part remains the same during the decoding of one sentence.
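A rough sketch of how the Cache NN LM input could be assembled for one sentence: the N-gram context slides word by word while the cache summary stays fixed. The CacheNNLMInput layout and the way the cache summary is encoded are assumptions for illustration only:

    #include <cstddef>
    #include <cstdint>
    #include <utility>
    #include <vector>

    // Input for one word position: the usual N-1 context word ids plus the
    // cache summary h_1^{u-1}, fixed while decoding the current sentence.
    struct CacheNNLMInput {
        std::vector<std::uint32_t> ngram_context;  // y_{j-1} ... y_{j-N+1}
        std::vector<float>         cache_summary;  // summary of previous machine/user turns
    };

    // Build inputs for a whole sentence: the context slides, the cache does not.
    std::vector<CacheNNLMInput> buildInputs(const std::vector<std::uint32_t> &sentence,
                                            const std::vector<float> &cache_summary,
                                            std::size_t N) {
        std::vector<CacheNNLMInput> inputs;
        for (std::size_t j = 0; j < sentence.size(); ++j) {
            CacheNNLMInput in;
            for (std::size_t k = 1; k < N && k <= j; ++k)
                in.ngram_context.push_back(sentence[j - k]);
            in.cache_summary = cache_summary;   // same vector for every position j
            inputs.push_back(std::move(in));
        }
        return inputs;
    }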

Page 33: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

- SLU using language models of pairs: concept/word sequences.
- Using the Cache NN LM, different summary combinations were tested.

Page 34: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

Experiments:
- MEDIA French corpus.
- Using the Cache NN LM, different summary combinations:
  A) Using a cache of concepts only.
  B) Using only words in the cache.
  C) Using concepts + words in the cache.
  D) Using concepts + words + Wizard-of-Oz in the cache.

MEDIA statistics:

    Set          # sentences   # running words   # running concepts
    Training     12 811        87 297            42 251
    Validation    1 241         9 996             4 652
    Test          3 468        24 598            11 790

Page 35: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Spoken Language Understanding

                     2-grams         3-grams         4-grams
    System           Val.   Test     Val.   Test     Val.   Test
    baseline-a       33.6   30.1     32.9   29.3     33.5   29.3
    baseline-b       33.1   28.3     30.7   27.4     30.2   28.1
    cacheNNLM-A      31.7   28.2     29.7   27.0     29.7   27.0
    cacheNNLM-B      30.5   27.3     29.7   27.0     30.0   26.1
    cacheNNLM-C      32.2   28.3     30.5   27.0     30.8   27.4
    cacheNNLM-D      31.2   28.2     29.9   26.2     30.3   27.1

CER results for validation and test sets.

baseline-a) Standard N-gram of pairs, without cache.
baseline-b) Standard NN LM of pairs, without cache.
A) Using a cache of concepts only.
B) Using only words in the cache.
C) Using concepts + words in the cache.
D) Using concepts + words + Wizard-of-Oz in the cache.

Conclusions:
- Significant CER reduction using a rather simple SLU model.
- The best Cache NN LM suggests that there is plenty of room for improvement.
- The use of a cache with long-distance dependencies systematically improves the baselines.
- Best CER in the literature: 23.8%.

Page 36: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition

- Based on HMM/ANN models: 80 characters (26 lowercase and 26 uppercase letters, 10 digits, 16 punctuation marks, white space, and the crossing-out mark).
- Two-step decoding, total integration of NN LMs.
- IAM-DB task: off-line text line recognition.
- Language models trained using LOB + WELLINGTON + BROWN: 103K vocabulary size (|Ω|).
- Two types of experiments: word-based and character-based.

(Figure: IAM-DB text line images.)

Page 37: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: word-based experiments

Validation results (% WER / % CER / % SER):

    System           |Ω_I|   bigram           3-gram           4-gram
    Bigram mKN       103K    17.3/6.2/69.8    17.8/6.3/70.3    17.9/6.3/69.3

    Rescoring with NN LMs
    Θ = 21  NN       10K     16.0/5.8/67.4    16.0/5.9/67.2    16.2/5.8/67.2
    Θ = 10  NN       16K     15.9/5.9/67.5    16.0/5.9/67.2    16.4/5.9/66.5
    Θ = 8   NN       19K     16.0/5.8/66.9    16.3/5.9/67.8    16.9/6.1/68.8
    Θ = 1   NN       56K     16.0/5.8/66.6    16.3/6.0/67.9    16.9/6.2/69.2
    Θ = 21  FNN      10K     16.0/5.8/67.4    15.8/5.7/66.0    15.8/5.7/65.1
    Θ = 10  FNN      16K     15.9/5.9/67.5    15.9/5.7/65.0    15.9/5.8/66.0
    Θ = 8   FNN      19K     16.0/5.8/66.9    15.8/5.8/65.8    15.8/5.7/65.8
    Θ = 1   FNN      56K     16.0/5.8/66.6    15.9/5.7/66.1    15.7/5.8/66.3

    NN LMs integrated during decoding
    Θ = 21  NN       10K     16.0/5.8/67.0    16.0/5.8/67.2    16.1/5.8/67.2
    Θ = 10  NN       16K     16.1/5.8/66.9    16.1/5.8/67.3    16.4/5.8/66.7
    Θ = 8   NN       19K     16.1/5.9/67.0    16.3/5.9/67.9    16.9/6.0/69.2
    Θ = 1   NN       56K     16.1/5.8/66.7    16.4/6.0/67.5    16.8/6.2/69.3
    Θ = 21  FNN      10K     16.0/5.8/67.0    15.8/5.8/66.5    15.8/5.6/65.2
    Θ = 10  FNN      16K     16.1/5.8/66.9    15.9/5.7/65.0    16.0/5.8/65.6
    Θ = 8   FNN      19K     16.1/5.9/67.0    15.8/5.8/65.7    15.8/5.7/65.6
    Θ = 1   FNN      56K     16.1/5.8/66.7    15.9/5.7/66.1    15.8/5.7/65.6

NN are standard NN LMs; FNN are Smoothed Fast NN LMs; words with frequency > Θ form Ω_I.

Page 38: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: word-based experiments

Test results:

    System                              % WER        % CER       % SER
    [Graves et al.]                     25.9 ± 0.8   –           –
    Bigram 20K mKN                      23.4 ± 0.8   9.6 ± 0.4   78.1 ± 1.6
    Bigram mKN                          21.9 ± 0.7   8.8 ± 0.4   76.0 ± 1.6

    Rescoring with NN LMs
    4-gram Θ = 21 FNN                   20.2 ± 0.7   8.3 ± 0.4   72.9 ± 1.6

    NN LMs integrated during decoding
    4-gram Θ = 21 FNN                   20.2 ± 0.7   8.3 ± 0.4   73.0 ± 1.6

The 20K mKN setup is from [Graves et al.]; words with frequency > Θ form Ω_I.

Conclusions:
- Differences among input vocabulary sizes are not significant.
- Rescoring and the integrated approach yield similar results.
- First HTR system using NN LMs, with a significant improvement of state-of-the-art results.

Page 39: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: character-based experiments

- LMs using characters instead of words.
- Useful for tasks with scarce data.

(Figure: validation results.)

Page 40: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Handwritten Text Recognition: character-based experiments

Test results:

    System          % WER   % CER   OOV accuracy (%)
    SRI             30.9    13.8    29.8
    NN LM           24.2    10.1    33.8
    % improvement   21.7    26.8    13.4

Conclusions:
- Large improvement over the N-gram baseline (SRI).
- Almost 33% of the words that are OOV for a word-based LM were recovered.
- These results suggest trying a mixed approach combining word-based and character-based LMs.
- Ancient document transcription could benefit from this approach.

Page 41: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 42: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Statistical Machine Translation

Overview:
- Introduction to SMT following the phrase-based and N-gram-based approaches.
- Decoding algorithm details.
- Experiments on decoder quality and integration of NN LMs.

Page 43: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Statistical Machine Translation

- Follows the maximum entropy approach: log-linear combination of several models.
- This work focuses on the phrase-based and N-gram-based approaches:
  - Both share a similar basis: word alignments, segmentation into tuples (bilingual phrases), log-linear modeling.
  - Phrase-based: all consistent phrases (multiple segmentations) are extracted.
  - N-gram-based: the source language sentence is reordered to follow the target language order; a unique segmentation into tuples is extracted for training, while in decoding multiple paths are possible.

Fundamental equation:

    ŷ = target_part(T̂, φ̂),   (T̂, φ̂) = argmax_{(T, φ)} ∏_{m=1..M} H_m(T, φ)^{λ_m}

where T = T_1, T_2, …, T_{|T|} is a sequence of bilingual tuples, and φ : {1, 2, …, |x|} → {1, 2, …, |T|} is a mapping between source words and tuples.

Page 44: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Statistical Machine Translation

Source: María no daba una bofetada a la bruja verde
Reordered source: María no daba una bofetada a la verde bruja
Target: Mary did not slap the green witch

(Figure: word alignment matrix between the source and target sentences.)

Phrase-based (examples of extracted phrases):
(María, Mary), (no, did not), (María no, Mary did not), (daba una bofetada, slap), (verde, green), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (a la, the), (bruja, witch), (bruja verde, green witch), (a la bruja verde, the green witch), …

N-gram-based (unique segmentation into tuples):
(María, Mary) ⇒ T1, (no, did not) ⇒ T2, (daba una bofetada, slap) ⇒ T3, (a, <NULL>) ⇒ T4, (la, the) ⇒ T5, (verde, green) ⇒ T6, (bruja, witch) ⇒ T7.

Page 45: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation decoding

- Phrase-based and N-gram-based systems follow a similar decoding algorithm.
- Major difference:
  - N-gram-based approach: uses an LM for the computation of the joint probability.
  - Phrase-based approach: does not need the joint probability; it uses a larger phrase table with conditional probabilities.

(Figure: general decoding overview — Reordering Module → Word2Tuple Module (current stage, max_{v ∈ current stage} p(v)) → Viterbi Module (current active vertex, max_{e ∈ current active vertex} p(e)).)

Page 46: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation decoding

(Figure: decoding example for the source sentence “tengo un coche rojo”, translated as “I have a red car”: hypotheses are expanded with tuples such as (tengo ||| I have), (un ||| a), (coche ||| car), (rojo ||| red), (tengo un ||| I have a) and (coche rojo ||| red car), tracking source coverage vectors over a multi-stage search graph.)

Page 47: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

N-best list rescoring with NN LMs:
- IWSLT’06 Italian-English task.
- WMT’10 and WMT’11 English-Spanish tasks.

Totally integrated NN LM decoding:
- IWSLT’10 French-English task.
- News-Commentary 2010 Spanish-English task.

Participation in the international evaluation campaigns IWSLT’10, WMT’10 and WMT’11, achieving very well positioned systems (second position at IWSLT’10 and WMT’11).

Page 48: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

- News-Commentary 2010 Spanish-English task.
- Total integration of NN LMs in decoding vs. N-best list rescoring.
- Comparative evaluation of the N-gram-based and phrase-based approaches.
- Phrase-based models are trained using the Giza++ and Moses toolkits, decoding with April.
- N-gram-based models are trained with the Giza++ and April toolkits, decoding with April.

Page 49: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

News-Commentary 2010 statistics:

                              Spanish              English
    Set                       # Lines   # Words    # Lines   # Words    Voc. size
    News-Commentary 2010      80.9K     1.8M       81.0K     1.6M       38 781
    News2008                  2.0K      52.6K      2.0K      49.7K      –
    News2009                  2.5K      68.0K      2.5K      65.6K      –
    News2010                  2.5K      65.5K      2.5K      61.9K      –

    N-gram-based bilingual tuple translation corpus
    Set                       # Lines   # Tuples   Voc. size
    News-Commentary 2010      80.9K     1.5M       231 981

    English data for LM training
    Set                       # Lines   # Words
    News-Commentary 2010      125.9K    2.97M

The tuple vocabulary size is too large for direct NN LM training ⇒ NN LMs of statistical classes.

Page 50: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

NN LMs of statistical classes, training procedure:

1. Using Giza++, a non-ambiguous mapping between tuples and classes is built (CLS is the number of classes).
2. The conditional probability of each tuple given its class is computed by counting:

       p(z ∈ Δ | c ∈ CLS) = C(z|c) / Σ_{z′∈Δ} C(z′|c)

   where C(z|c) is the count of tuple z in class c.
3. Tuples are substituted by their corresponding classes.
4. The standard training algorithm is used over the previous dataset to estimate the NN LMs.
5. The joint probability is computed as:

       p(x, y) ≈ ∏_i p(T_i | T_{i−1} … T_{i−N+1}) ≈ ∏_i p(T_i | c_i) · p(c_i | c_{i−1} … c_{i−N+1})
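A sketch of steps 2 and 5, assuming the tuple-to-class mapping (from Giza++) and the class N-gram probabilities (from the class NN LM) are computed elsewhere; the function names and the count representation are illustrative only:

    #include <cmath>
    #include <cstddef>
    #include <map>
    #include <string>
    #include <utility>
    #include <vector>

    // Step 2: p(z | c) by relative counts inside each class.
    // `count` maps (tuple, class id) to C(z|c).
    std::map<std::pair<std::string, int>, double>
    tupleGivenClass(const std::map<std::pair<std::string, int>, long> &count) {
        std::map<int, long> class_total;
        for (const auto &kv : count) class_total[kv.first.second] += kv.second;
        std::map<std::pair<std::string, int>, double> prob;
        for (const auto &kv : count)
            prob[kv.first] = double(kv.second) / double(class_total[kv.first.second]);
        return prob;
    }

    // Step 5: log p(x, y) ≈ sum_i [ log p(T_i | c_i) + log p(c_i | c_{i-1} ... c_{i-N+1}) ],
    // given the per-position factors already looked up for one tuple sequence.
    double jointLogProb(const std::vector<double> &p_tuple_given_class,
                        const std::vector<double> &p_class_ngram) {
        double lp = 0.0;
        for (std::size_t i = 0; i < p_tuple_given_class.size(); ++i)
            lp += std::log(p_tuple_given_class[i]) + std::log(p_class_ngram[i]);
        return lp;
    }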

Page 51: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

(Figure: BLEU on News2009 vs. N-gram order (2 to 5) for NNTM-100, NNTM-300, NNTM-500 and NNTM-1000.)

News2009 results:

    System              BLEU   TER
    April-NB baseline   20.2   60.4
    + NNTLM-4gr         20.9   59.9
    + NNTM-300-4gr      21.1   59.7
    + NNTM-500-4gr      21.2   59.7

Page 52: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

                                               News2009        News2010
    System                                     BLEU   TER      BLEU   TER     Time (s/sentence)
    Moses                                      20.4   60.3     22.6   57.8    0.6
    April-PB                                   20.6   60.3     22.7   57.8    0.4
    Moses⋆                                     –      –        22.6   57.9    0.6
    April-NB                                   20.2   60.4     22.7   58.0    0.8

    Integrating smoothed Fast NN LMs in the decoder
    April-PB + NNTLM                           21.2   59.8     23.2   57.5    1.8
    April-NB + NNTLM                           20.9   59.9     23.2   57.4    1.8
    April-NB + NNTM                            20.7   60.0     23.3   57.6    1.6
    April-NB + NNTLM + NNTM                    21.2   59.7     23.6   57.1    2.5

    Integrating on-the-fly Fast NN LMs (standard NN LMs) in the decoder
    April-PB + NNTLM                           –      –        23.3   57.3    384.3
    April-NB + NNTLM + NNTM                    –      –        23.7   57.1    177.3

    Rescoring a 2000-uniq-best list with standard NN LMs
    April-PB + NNTLM                           21.1   59.9     23.4   57.3    –
    April-NB + NNTLM                           20.9   60.0     –      –       –
    April-NB + NNTM                            20.6   60.2     –      –       –
    April-NB + NNTLM + NNTM                    21.1   59.8     23.5   57.4    –

Page 53: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Machine Translation experiments

(Figure: BLEU, TER and time (s/sentence) for the N-gram-based (NG) and phrase-based (PB) systems as a function of the number of precomputed softmax normalization constants, log scale from 1e+02 to 1e+06.)

Conclusions:
- Smoothed Fast NN LM loss ≈ 0.2 BLEU/TER points.
- Phrase-based system: slightly worse when integrated.
- N-gram-based system: slightly better when integrated; the TER difference is statistically significant at 95% confidence using a pairwise test.
- Adding the class-based NNTM improves results by 0.6 BLEU points.
- The Smoothed Fast NN LM system is two/three times slower… but achieves a speed-up of 70 compared with the standard NN LM.

Page 54: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Index

1. Introduction

2. Connectionist language modeling

3. Sequence recognition applications

4. Machine translation applications

5. Conclusions

Page 55: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Final conclusions I

Contributions to connectionist language modeling:
- Speed-up technique based on the precomputation of softmax normalization constants.
- Formalization and development of a totally coupled Viterbi decoding:
  - comparable computational cost for sequence recognition tasks;
  - computational cost two/three times higher for Machine Translation tasks;
  - improves baseline quality in every case.
- Extension to dynamic domain adaptation with cache-based NN LMs.

Page 56: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Final conclusions II

Contributions to sequence recognition:
- Encouraging results using Cache NN LMs for an SLU task.
- State-of-the-art improvement using NN LMs in HTR tasks.
- Character-based NN LMs to deal with scarce data.

Contributions to SMT:
- Implementation of a DP decoding algorithm for SMT.
- Improving N-gram-based SMT by using class-based NN LMs.
- Well positioned systems at international Machine Translation evaluations.

Page 57: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Future work

Connectionist language modeling:
- New projection layer initialization method based on POS tags.
- Improve the NN LM and standard N-gram combination: GLI-CS.

Sequence recognition:
- Improve SLU using deep learning with continuous space techniques.
- HTR system combining character-based and word-based LMs (OOV).

Statistical Machine Translation:
- Application of the continuous space idea to reordering models.
- Study of Cache NN LMs for document translation.
- New NN LMs for the vocabulary dispersion problem in N-gram-based SMT.
- Integration of the SMT decoder for human-assisted transcription.

Page 58: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications related with this PhD I

Speed-up technique for NN LMs:

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera. Fast Evaluation of Connectionist Language Models. Pages 33–40 of the IWANN 2009 proceedings, Salamanca.

S. España-Boquera, F. Zamora-Martínez, M.J. Castro-Bleda, J. Gorbe-Moya. Efficient BP Algorithms for General Feedforward Neural Networks. Pages 327–336 of the IWINAC 2007 proceedings, Murcia.

Spoken Language Understanding and Cache NN LMs:

F. Zamora-Martínez, Salvador España-Boquera, M.J. Castro-Bleda, Renato de-Mori. Cache Neural Network Language Models based on Long-Distance Dependencies for a Spoken Dialog System. Pages 4993–4996 of the IEEE ICASSP 2012 proceedings, Kyoto (Japan).

Page 59: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications related with this PhD II

Handwritten Text Recognition:

F. Zamora-Martínez, V. Frinken, S. España-Boquera, M.J. Castro-Bleda, A. Fischer, H. Bunke. Neural Network Language Models in Off-Line Handwriting Recognition. Pattern Recognition. SUBMITTED.

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, J. Gorbe-Moya. Unconstrained Offline Handwriting Recognition using Connectionist Character N-grams. Pages 18–23 of the IJCNN proceedings, Barcelona, 2010.

Statistical Machine Translation:

F. Zamora-Martínez, M.J. Castro-Bleda. CEU-UPV English-Spanish system for WMT11. Pages 490–495 of the WMT 2011 proceedings, Edinburgh (Scotland).

F. Zamora-Martínez, M.J. Castro-Bleda, H. Schwenk. Ngram-based Machine Translation enhanced with Neural Networks for the French-English BTEC-IWSLT’10 task. Pages 45–52 of the IWSLT 2010 proceedings, Paris (France).

Page 60: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications related with this PhD III

F. Zamora-Martínez, Germán Sanchis-Trilles. UCH-UPV English-Spanish system for WMT10. Pages 207–211 of the WMT 2010 proceedings, Uppsala (Sweden).

F. Zamora-Martínez, M.J. Castro-Bleda. Traducción Automática Estadística basada en Ngramas Conexionistas. Pages 221–228 of the SEPLN journal, volume 45, number 45, 2010. Valencia (Spain).

Maxim Khalilov, José A. R. Fonollosa, F. Zamora-Martínez, María J. Castro-Bleda, S. España-Boquera. Neural Network Language Models for Translation with Limited Data. Pages 445–451 of the ICTAI 2008 proceedings, Dayton (USA).

Maxim Khalilov, José A. R. Fonollosa, F. Zamora-Martínez, María J. Castro-Bleda, S. España-Boquera. Arabic-English translation improvement by target-side neural network language modeling. In proceedings of the HLT & NLP workshop at LREC 2008, Marrakech (Morocco).

Page 61: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications in collaboration I

Recurrent NN LMs based on Long Short-Term Memories:

Volkmar Frinken, F. Zamora-Martínez, Salvador España-Boquera, María J. Castro-Bleda, Andreas Fischer, Horst Bunke. Long-Short Term Memory Neural Networks Language Modeling for Handwriting Recognition. ICPR 2012 proceedings, Tsukuba (Japan).

Handwritten Text Recognition:

S. España-Boquera, M.J. Castro-Bleda, J. Gorbe-Moya, F. Zamora-Martínez. Improving Offline Handwritten Text Recognition with Hybrid HMM/ANN Models. Pages 767–779 of the IEEE TPAMI journal, volume 33, number 4, 2011.

F. Zamora-Martínez, M.J. Castro-Bleda, S. España-Boquera, J. Gorbe-Moya. Improving Isolated Handwritten Word Recognition Using a Specialized Classifier for Short Words. Pages 61–70 of the CAEPIA 2009 proceedings, Sevilla.

J. Gorbe-Moya, S. España-Boquera, F. Zamora-Martínez, M.J. Castro-Bleda. Handwritten Text Normalization by using Local Extrema Classification. Pages 164–172 of the PRIS 2008 workshop, Barcelona.

Page 62: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Publications in collaboration II

Decoding:

S. España-Boquera, M.J. Castro-Bleda, F. Zamora-Martínez, J. Gorbe-Moya. Efficient Viterbi Algorithms for Lexical Tree Based Models. Pages 179–187 of the NOLISP 2007 proceedings, Paris.

S. España-Boquera, J. Gorbe-Moya, F. Zamora-Martínez. Semiring Lattice Parsing Applied to CYK. Pages 603–610 of the IbPRIA 2007 conference, Girona.

Page 63: Contributions to connectionist language modeling and its application to sequence recognition and machine translation

Thanks for your attention!

Questions?