Page 1:

Why Generative Models Underperform Surface Heuristics

UC Berkeley Natural Language Processing

John DeNero, Dan Gillick, James Zhang, and Dan Klein

Page 2:

Overview: Learning Phrases

Sentence-aligned corpus
  ↓
Directional word alignments
  ↓
Intersected and grown word alignments
  ↓
Phrase table (translation model):

cat ||| chat ||| 0.9
the cat ||| le chat ||| 0.8
dog ||| chien ||| 0.8
house ||| maison ||| 0.6
my house ||| ma maison ||| 0.9
language ||| langue ||| 0.9
…
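The final step of this heuristic pipeline scores each extracted phrase pair by relative frequency, φ(e|f) = count(f, e) / count(f). A minimal sketch of that scoring step (the function name and toy counts are illustrative, not from the talk):

```python
from collections import Counter, defaultdict

def score_phrase_pairs(extracted_pairs):
    """Relative-frequency estimate: phi(e|f) = count(f, e) / count(f)."""
    pair_counts = Counter(extracted_pairs)               # (f, e) -> count
    f_counts = Counter(f for f, _ in extracted_pairs)    # f -> count
    table = defaultdict(dict)
    for (f, e), c in pair_counts.items():
        table[f][e] = c / f_counts[f]
    return table

# Toy counts chosen to reproduce the 0.9 entry in the table above.
pairs = [("chat", "cat")] * 9 + [("chat", "spade")]
print(score_phrase_pairs(pairs)["chat"])  # {'cat': 0.9, 'spade': 0.1}
```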

Page 3:

Overview: Learning Phrases

Sentence-aligned corpus
  ↓
Phrase-level generative model
  ↓
Phrase table (translation model), with the same example entries as above

• Early successful phrase-based SMT system [Marcu & Wong '02]

• Challenging to train

• Underperforms the heuristic approach

Page 4:

Outline

I)   Generative phrase-based alignment
     • Motivation
     • Model structure and training
     • Performance results

II)  Error analysis
     • Properties of the learned phrase table
     • Contributions to increased error rate

III) Proposed improvements

Page 5:

Motivation for Learning Phrases

Translate!

Input sentence:  J ' ai un chat .
Output sentence: I have a spade .

Page 6:

Motivation for Learning Phrases

appelle un chat un chat
call a spade a spade

Phrase pairs extracted from the alignment:

appelle ||| call
chat un chat ||| spade a spade

Page 7:

Motivation for Learning Phrases

… appelle un chat un chat …
… call a spade a spade …

All phrase pairs extracted from the word alignment:

appelle ||| call
appelle un ||| call a
appelle un chat ||| call a spade
un ||| a (×2)
un chat ||| a spade (×2)
un chat un ||| a spade a
chat ||| spade (×2)
chat un ||| spade a
chat un chat ||| spade a spade
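The table above is what the standard alignment-consistent extraction heuristic produces. A minimal sketch of that extractor, simplified to ignore unaligned words (the full algorithm also extends phrases over them); the function name is mine:

```python
def extract_phrases(f_words, e_words, alignment, max_len=3):
    """Enumerate phrase pairs consistent with a word alignment: every
    aligned word inside the French span must map inside the English span,
    and vice versa. alignment is a list of (f_index, e_index) links."""
    pairs = []
    for i in range(len(f_words)):
        for j in range(i + 1, min(i + max_len, len(f_words)) + 1):
            e_pos = [e for f, e in alignment if i <= f < j]
            if not e_pos:
                continue  # require at least one alignment link
            lo, hi = min(e_pos), max(e_pos) + 1
            # consistency: no link from the English span may leave the French span
            if all(i <= f < j for f, e in alignment if lo <= e < hi):
                pairs.append((" ".join(f_words[i:j]), " ".join(e_words[lo:hi])))
    return pairs

f = "appelle un chat un chat".split()
e = "call a spade a spade".split()
diagonal = [(k, k) for k in range(5)]   # the one-to-one alignment shown above
for pair in extract_phrases(f, e, diagonal):
    print(pair)
# Prints the 12 pairs in the table above, including ('un', 'a') twice.
```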

Page 8:

A Phrase Alignment Model Compatible with Pharaoh

les chats aiment le poisson frais .

cats like fresh fish .

Page 9:

Training Regimen That Respects Word Alignment

[Two alignment grids for "les chats aiment le poisson frais ." ↔ "cats like fresh fish .": a phrase segmentation consistent with the word alignment is accepted; a second segmentation that crosses the word alignment is rejected (✗).]

Page 10:

Training Regimen That Respects Word Alignment

[Alignment grid as on the previous slide.]

Only 46% of training sentences contributed to training.

Page 11:

Performance Results

[Plot: BLEU (scale 36–40) over EM iterations 0–4 for 100k- and 25k-sentence training sets; iteration 0 corresponds to the heuristically generated parameters.]

Page 12:

Performance Results

BLEU:
  Heuristic (100k): 39.0
  Heuristic (50k):  38.5
  Heuristic (25k):  38.3
  Learned (100k):   38.8

Lost training data is not the whole story: learned parameters with 4× the training data still underperform the heuristic.

Page 13:

Outline

I)   Generative phrase-based alignment
     • Model structure and training
     • Performance results

II)  Error analysis
     • Properties of the learned phrase table
     • Contributions to increased error rate

III) Proposed improvements

Page 14:

Example: Maximizing Likelihood with Competing Segmentations

Training corpus:

French:  carte sur la table        French:  carte sur la table
English: map on the table          English: notice on the chart

A phrase table that splits probability evenly among the observed translations:

carte ||| map ||| 0.5
carte ||| notice ||| 0.5
carte sur ||| map on ||| 0.5
carte sur ||| notice on ||| 0.5
carte sur la ||| map on the ||| 0.5
carte sur la ||| notice on the ||| 0.5
sur ||| on ||| 1.0
la ||| the ||| 1.0
sur la ||| on the ||| 1.0
sur la table ||| on the table ||| 0.5
sur la table ||| on the chart ||| 0.5
la table ||| the table ||| 0.5
la table ||| the chart ||| 0.5
table ||| table ||| 0.5
table ||| chart ||| 0.5

Likelihood computation for "carte sur la table": each of the 7 segmentations yields its English side with probability 0.25, so each sentence pair has likelihood 0.25 × 7 / 7 = 0.25.

Page 15:

Example: Maximizing Likelihood with Competing Segmentations

Training corpus: as above.

A determinized phrase table, in which every entry has probability 1.0:

carte ||| map ||| 1.0
carte sur ||| notice on ||| 1.0
carte sur la ||| notice on the ||| 1.0
sur ||| on ||| 1.0
sur la ||| on the ||| 1.0
sur la table ||| on the table ||| 1.0
la ||| the ||| 1.0
la table ||| the table ||| 1.0
table ||| chart ||| 1.0

Likelihood of the "map on the table" pair: 1.0 × 2 / 7 ≈ 0.286 > 0.25
Likelihood of the "notice on the chart" pair: 1.0 × 2 / 7 ≈ 0.286 > 0.25

Two of the seven segmentations now generate each observed English sentence with probability 1.0, so this degenerate table achieves a higher likelihood than the even split.
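A short script can verify these likelihood computations. The sketch below enumerates monotone segmentations (phrases up to length 3, so a four-word sentence has exactly 7 segmentations), scores each as a product of phrase probabilities under a uniform prior over segmentations, and reproduces 0.25 for the even table and 2/7 ≈ 0.286 for the determinized one. The function names are mine, and pairing each French phrase with the English span at the same positions is a simplification that happens to hold for this toy corpus:

```python
def segmentations(n, max_len=3):
    """All ways to split n positions into contiguous phrases of length <= max_len."""
    if n == 0:
        yield []
        return
    for k in range(1, min(max_len, n) + 1):
        for rest in segmentations(n - k, max_len):
            yield [k] + rest

def likelihood(f_words, e_words, phi, max_len=3):
    """Sum over segmentations of the product of phrase probabilities,
    averaged under a uniform prior over segmentations."""
    segs = list(segmentations(len(f_words), max_len))
    total = 0.0
    for seg in segs:
        prob, i = 1.0, 0
        for k in seg:
            pair = (" ".join(f_words[i:i + k]), " ".join(e_words[i:i + k]))
            prob *= phi.get(pair, 0.0)
            i += k
        total += prob
    return total / len(segs)

even = {("carte", "map"): 0.5, ("carte", "notice"): 0.5,
        ("carte sur", "map on"): 0.5, ("carte sur", "notice on"): 0.5,
        ("carte sur la", "map on the"): 0.5, ("carte sur la", "notice on the"): 0.5,
        ("sur", "on"): 1.0, ("la", "the"): 1.0, ("sur la", "on the"): 1.0,
        ("sur la table", "on the table"): 0.5, ("sur la table", "on the chart"): 0.5,
        ("la table", "the table"): 0.5, ("la table", "the chart"): 0.5,
        ("table", "table"): 0.5, ("table", "chart"): 0.5}

det = {("carte", "map"): 1.0, ("carte sur", "notice on"): 1.0,
       ("carte sur la", "notice on the"): 1.0, ("sur", "on"): 1.0,
       ("sur la", "on the"): 1.0, ("sur la table", "on the table"): 1.0,
       ("la", "the"): 1.0, ("la table", "the table"): 1.0, ("table", "chart"): 1.0}

f = "carte sur la table".split()
print(likelihood(f, "map on the table".split(), even))    # 0.25
print(likelihood(f, "map on the table".split(), det))     # 2/7 ~ 0.286
print(likelihood(f, "notice on the chart".split(), det))  # 2/7 ~ 0.286
```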

Page 16:

EM Training Significantly Decreases Entropy of the Phrase Table

French phrase entropy: H(f) = −Σₑ φ(e|f) log φ(e|f)

[Histogram: percent of French phrases (scale 0–40%) falling into entropy bins 0–.01, .01–.5, .5–1, 1–1.5, 1.5–2, and >2, for the learned vs. heuristic tables.]

10% of French phrases have deterministic distributions.
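For concreteness, the quantity being histogrammed is the entropy of each French phrase's translation distribution φ(·|f). A minimal sketch (log base 2 is my assumption; the slide does not specify units):

```python
import math

def phrase_entropy(phi_given_f):
    """Entropy of a French phrase's translation distribution phi(. | f)."""
    return -sum(p * math.log2(p) for p in phi_given_f.values() if p > 0)

print(phrase_entropy({"cat": 0.9, "feline": 0.1}))  # ~0.47 bits: low entropy
print(phrase_entropy({"cat": 1.0}))                 # 0.0: deterministic
```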

Page 17:

Effect 1: Useful Phrase Pairs Are Lost Due to Critically Small Probabilities

In 10k translated sentences, no phrase with weight less than 10⁻⁵ was used by the decoder.

[Bar chart: effective phrase table size (thousands of phrases, scale 0–400) for the heuristic vs. learned tables.]

Page 18:

Effect 2: Determinized Phrases Override Better Candidates During Decoding

Input:     the situation varies to an enormous degree
Heuristic: the situation varie d ' une immense degré
Learned:   the situation varie d ' une immense caractérise

English translation probabilities for degré:

                φEM     φH
degree          0.64    0.49
level           0.26    0.38
extent          0.01    0.02
amount          ~0      0.02

English translation probabilities for caractérise:

                φEM     φH
degree          0.998   ~0
features        ~0      0.05
characterized   0.001   0.21
characterizes   0.001   0.49

EM has determinized caractérise to translate as "degree" (φEM ≈ 1), so the learned model prefers it over the correct degré during decoding.

Page 19:

Effect 3: Ambiguous Foreign Phrases Become Active During Decoding

Deterministic phrases can be used by the decoder at no cost: a translation probability of 1.0 contributes log 1.0 = 0 to the model score.

[Table: translations for the French apostrophe under the heuristic vs. learned models.]

Page 20:

Outline

I)   Generative phrase-based alignment
     • Model structure and training
     • Performance results

II)  Error analysis
     • Properties of the learned phrase table
     • Contributions to increased error rate

III) Proposed improvements

Page 21:

Motivation for Reintroducing Entropy to the Phrase Table

1. Useful phrase pairs are lost due to critically small probabilities.

2. Determinized phrases override better candidates.

3. Ambiguous foreign phrases become active during decoding.

Page 22:

Reintroducing Lost Phrases

[Bar chart: BLEU on 25k training sentences (scale 36.5–39) for the learned, heuristic, and interpolated phrase tables.]

Interpolation yields up to a 1.0 BLEU improvement.
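The slide does not spell out the interpolation, but a natural reading is a linear mixture of the learned and heuristic conditional distributions, which also reintroduces heuristic-only pairs that EM drove to negligible mass. A hedged sketch (the mixture weight λ and the zero-backoff convention are my assumptions):

```python
def interpolate_tables(phi_heur, phi_em, lam=0.5):
    """Pointwise linear mixture of two phrase tables keyed by (f, e) pairs.
    Pairs missing from one table contribute 0 there, so phrases that EM
    effectively deleted re-enter with weight (1 - lam) * phi_heur."""
    keys = set(phi_heur) | set(phi_em)
    return {k: lam * phi_em.get(k, 0.0) + (1 - lam) * phi_heur.get(k, 0.0)
            for k in keys}

mixed = interpolate_tables({("table", "table"): 0.5, ("table", "chart"): 0.5},
                           {("table", "chart"): 1.0})
print(mixed)  # ('table','table') gets 0.25; ('table','chart') gets 0.75
```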

Page 23:

Smoothing Phrase Probabilities

Reserves probability mass for unseen translations based on the length of the French phrase.

[Bar chart: BLEU on 25k training sentences (scale 36.5–39) for the learned, heuristic, and smoothed phrase tables.]
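The slide gives only the idea, not the formula; one way to realize it is to discount each φ(e|f) by a held-out mass that grows with |f|, since longer French phrases are seen fewer times and their empirical distributions are less trustworthy. A purely illustrative sketch (the discount schedule is my invention, not the talk's scheme):

```python
def smooth_translations(phi_given_f, f_length, reserve_per_word=0.1):
    """Hold out probability mass for unseen translations of a French phrase,
    reserving more mass for longer (hence sparser) phrases. Returns the
    discounted distribution and the reserved mass."""
    reserved = min(0.9, reserve_per_word * f_length)
    discounted = {e: p * (1.0 - reserved) for e, p in phi_given_f.items()}
    return discounted, reserved

discounted, reserved = smooth_translations({"chart": 1.0}, f_length=1)
print(discounted, reserved)  # {'chart': 0.9} with 0.1 held out for unseen translations
```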

Page 24:

Conclusion

• Generative phrase models determinize the phrase table via the latent segmentation variable.

• A determinized phrase table introduces errors at decoding time.

• Modest improvement can be realized by reintroducing entropy to the phrase table.

Page 25:

Questions?