
JHU MT class: Human Evaluation of Machine Translation Systems


Page 1: JHU MT class: Human Evaluation of Machine Translation Systems

Evaluation

Page 2: JHU MT class: Human Evaluation of Machine Translation Systems

•Some (not all) key ingredients in Google Translate:

•Phrase-based translation models

•... Learned heuristically from word alignments

•... Coupled with a huge language model

•... And very tight pruning heuristics

•Q: How do they know it works?

Page 5: JHU MT class: Human Evaluation of Machine Translation Systems

Overview

training data (parallel text) → learner → model

联合国 安全 理事会 的 五个 常任 理事 国都 → decoder → However , the sky remained clear under the strong north wind .

Evaluation

Page 6: JHU MT class: Human Evaluation of Machine Translation Systems

More has been written about machine translation evaluation than about machine translation itself.

Yorick Wilks

Page 7: JHU MT class: Human Evaluation of Machine Translation Systems

•Why evaluate?
  •Rank systems.
  •Evaluate incremental changes.
  •Assess new ideas empirically.

•Evaluation must be:
  •Fast
  •Cheap
  •Reliable
  •Repeatable

Page 9: JHU MT class: Human Evaluation of Machine Translation Systems

[Figure, © 2010 IBM Corporation, IBM Research: "What It Takes to Compete Against Top Human Jeopardy! Players: Our Analysis Reveals the Winner’s Cloud." Each dot is an actual historical human Jeopardy! game. Plot labels: Winning Human Performance; Grand Champion Human Performance; 2007 QA Computer System; More Confident / Less Confident.]

Computers? Not So Good.

Page 10: JHU MT class: Human Evaluation of Machine Translation Systems

[Figure, © 2010 IBM Corporation, IBM Research: "DeepQA: Incremental Progress in Answering Precision on the Jeopardy Challenge: 6/2007-11/2010." Plotted versions: Baseline 12/06, v0.1 12/07, v0.2 05/08, v0.3 08/08, v0.4 12/08, v0.5 05/09, v0.6 10/09, v0.7 04/10, v0.8 11/10. Caption: IBM Watson Playing in the Winners Cloud.]

Page 13: JHU MT class: Human Evaluation of Machine Translation Systems

美国愿和北韩谈判但拒绝再付出报酬

US willing to negotiate with North Korea but not to pay more compensation.

The United States is willing to hold talks with North Korea but refused to pay remuneration.

Page 16: JHU MT class: Human Evaluation of Machine Translation Systems

“Progress” postponed because of mechanical hand into the sky.

Launch of “Endeavour” delayed by robotic arm problems.

“奋进”号因机械手故障推迟到升空

Page 17: JHU MT class: Human Evaluation of Machine Translation Systems

Manual Evaluation

Rank Sentences

You have judged 25 sentences for WMT09 Spanish-English News Corpus, 427 sentences total taking 64.9 seconds per sentence.

Source: Estos tejidos están analizados, transformados y congelados antes de ser almacenados en Hema-Québec, que gestiona también el único banco público de sangre del cordón umbilical en Quebec.

Reference: These tissues are analyzed, processed and frozen before being stored at Héma-Québec, which manages also the only bank of placental blood in Quebec.

Translation / Rank (each translation is ranked on a scale from 1 = Best to 5 = Worst):

•These weavings are analyzed, transformed and frozen before being stored in Hema-Quebec, that negotiates also the public only bank of blood of the umbilical cord in Quebec.

•These tissues analysed, processed and before frozen of stored in Hema-Québec, which also operates the only public bank umbilical cord blood in Quebec.

•These tissues are analyzed, processed and frozen before being stored in Hema-Québec, which also manages the only public bank umbilical cord blood in Quebec.

•These tissues are analyzed, processed and frozen before being stored in Hema-Quebec, which also operates the only public bank of umbilical cord blood in Quebec.

•These fabrics are analyzed, are transformed and are frozen before being stored in Hema-Québec, who manages also the only public bank of blood of the umbilical cord in Quebec.

Annotator: ccb Task: WMT09 Spanish-English News Corpus

Instructions: Rank each translation from Best to Worst relative to the other choices (ties are allowed). These are not interpreted as absolute scores. They are relative scores.

Page 19: JHU MT class: Human Evaluation of Machine Translation Systems

Chinese people in the traditional Spring Festival is approaching, the CPC Central Committee this afternoon in Zhongnanhai on the 22nd non-Party personages to convene a forum in Spring Festival, invited the central committees of democratic parties, the leadership of the National Federation of Industry and Commerce and personages without party affiliation on behalf of comrades gathered together State yes, talked in length about the friendship, to greet the Chinese New Year. CPC Central Committee General Secretary and State President and Central Military Commission Chairman Hu Jintao on behalf of the CPC Central Committee, the State Council, to the central committees of democratic parties, leaders of the National Federation of Industry and Commerce and personages without party affiliation, to members of the united front, to extend my New Year's blessing.

Page 25: JHU MT class: Human Evaluation of Machine Translation Systems

Design of the WMT Evaluation (2008-2011)

Systems: system A, system B, system C, system D, system E, system F, system G

Resulting ranking:

1. system C
2. system D
3. system A
4. system B
5. system G
6. system F
7. system E

➡ Costly: 361 hours of human effort in 2011.

Page 27: JHU MT class: Human Evaluation of Machine Translation Systems

Design of the WMT Evaluation (2008-2011)

Systems: system A, system B, system C, system D, system E, system F, system G

Resulting ranking:

1. system C
2. system D
3. system A
4. system B
5. system G
6. system F
7. system E

Are you sure this is the correct ranking?

•In the example above, there are 7! = 5040 possible rankings.
•With 10 systems: about 3.6 million possible rankings.
•With 20 systems: about 2.4 quintillion possible rankings.
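The counts above are factorials: n systems can be strictly ordered in n! ways. A minimal sketch to check the numbers, assuming rankings without ties (the function name `possible_rankings` is only illustrative):

```python
import math

def possible_rankings(n_systems: int) -> int:
    """Number of strict total orderings of n systems (ties not counted), i.e. n!."""
    return math.factorial(n_systems)

for n in (7, 10, 20):
    print(f"{n} systems: {possible_rankings(n):,} possible rankings")
```

If ties between systems were allowed, the number of possible outcomes would be larger still.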

Page 35: JHU MT class: Human Evaluation of Machine Translation Systems

Design of the WMT Evaluation (2008-2011)

Translators: system A, system B, system C, system D, system E, system F, system G, and the reference.

While (evaluation period is not over):

➡ Sample input sentence.
➡ Sample five translators of it from Systems ∪ {Reference}.
➡ Sample an assessor.
➡ Receive (partial) ranking of translations from assessor.

Example ranking received:

1. reference
2. system C
3. system A, system F
4. system D

This ranking is equivalent to the following pairwise rankings (≺: ranked better than; ≻: ranked worse than; ≡: tied):

reference ≺ system A
reference ≺ system C
reference ≺ system D
reference ≺ system F
system A ≻ system C
system A ≺ system D
system A ≡ system F
system C ≺ system D
system C ≺ system F
system D ≻ system F

WMT Raw Data: pairwise rankings
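The collection loop and the expansion of one partial ranking into pairwise rankings can be sketched as follows. This is an illustration under assumptions, not the actual WMT tooling: the function names, the sentence and assessor placeholders, and the representation of a ranking as a list of tiers (best first, tied items sharing a tier) are all hypothetical.

```python
import itertools
import random

def sample_task(sentences, translators, assessors, rng, k=5):
    """One loop iteration: pick a sentence, k distinct translators, and an assessor."""
    sentence = rng.choice(sentences)
    chosen = rng.sample(translators, k)  # sampling without replacement
    assessor = rng.choice(assessors)
    return sentence, chosen, assessor

def pairwise_from_ranking(tiers):
    """Expand a partial ranking (list of tiers, best first) into pairwise outcomes."""
    rank = {name: i for i, tier in enumerate(tiers) for name in tier}
    pairs = []
    for a, b in itertools.combinations(sorted(rank), 2):
        if rank[a] < rank[b]:
            relation = "better"
        elif rank[a] > rank[b]:
            relation = "worse"
        else:
            relation = "tie"
        pairs.append((a, relation, b))
    return pairs

# Hypothetical inputs: seven systems plus the reference, placeholder sentences.
translators = [f"system {c}" for c in "ABCDEFG"] + ["reference"]
rng = random.Random(0)
sentence, chosen, assessor = sample_task(
    ["sentence 1", "sentence 2"], translators, ["assessor 1"], rng
)

# The example ranking from the slide: reference, then C, then A tied with F, then D.
example = [["reference"], ["system C"], ["system A", "system F"], ["system D"]]
pairs = pairwise_from_ranking(example)
for a, relation, b in pairs:
    print(a, relation, b)
```

Five ranked translations always yield ten pairwise outcomes, which is why a single assessor judgment contributes several data points.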

Page 38: JHU MT class: Human Evaluation of Machine Translation Systems

Tournaments

[Figure: directed graph over system A, system B, system C, system D.]

•Directed edge between every pair of vertices.
•Edge from A to B if A beats B in pairwise comparison.
•Widely used to model: sports, web results, elections.

Landau, 1951. On dominance relations and the structure of animal societies.

•We use them to model all WMT ’10-’11 rankings (25 tasks).
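A minimal sketch of how such a tournament could be built from pairwise judgments. The input format (a list of `(winner, loser)` tuples), the function name, and the handling of exact ties are assumptions made for illustration; the slides do not specify a tie-breaking rule.

```python
from collections import Counter
from itertools import combinations

def build_tournament(judgments):
    """Return directed edges (a, b), meaning a beats b more often than b beats a.

    judgments: iterable of (winner, loser) tuples, one per pairwise comparison.
    """
    wins = Counter(judgments)
    systems = sorted({s for pair in wins for s in pair})
    edges = set()
    for a, b in combinations(systems, 2):
        if wins[(a, b)] > wins[(b, a)]:
            edges.add((a, b))
        elif wins[(b, a)] > wins[(a, b)]:
            edges.add((b, a))
        # Equal counts leave the pair without an edge; a strict tournament
        # would need a tie-breaking rule, which is not given in the slides.
    return edges

# Hypothetical judgments over three systems.
judgments = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C"), ("B", "C")]
edges = build_tournament(judgments)
print(sorted(edges))
```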

Page 46: JHU MT class: Human Evaluation of Machine Translation Systems

Tournaments

system A
system B
system C
system D

If the tournament is acyclic: topological sort.
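For the acyclic case, a topological sort can be sketched with Python's standard `graphlib` module (Python 3.9+). The edge set below is hypothetical, since the graph on the slide is not recoverable from the text; it only illustrates that an acyclic tournament has exactly one consistent ordering.

```python
from graphlib import TopologicalSorter

# Hypothetical acyclic tournament: each system maps to the systems it beats.
beats = {
    "system A": {"system B", "system C", "system D"},
    "system B": {"system C", "system D"},
    "system C": {"system D"},
    "system D": set(),
}

# TopologicalSorter expects, for each node, the set of nodes that must come
# before it, so invert the relation: a system's predecessors are its winners.
predecessors = {system: set() for system in beats}
for winner, losers in beats.items():
    for loser in losers:
        predecessors[loser].add(winner)

order = list(TopologicalSorter(predecessors).static_order())
print(order)
```

If the graph contained a cycle, `static_order()` would raise `graphlib.CycleError`, which is one way to detect that a plain topological sort does not apply.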

Page 48: JHU MT class: Human Evaluation of Machine Translation Systems

Tournaments

[Figure: tournament over system A, system B, system C, system D with weighted edges; the edge weights shown are 1, 2, 3, 1, 1, 2.]

What if the tournament contains cycles?

16 out of 25 tasks in WMT ’10-’11 contain cycles!
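One way to test whether a tournament contains a cycle relies on a standard property: a tournament on n vertices is acyclic exactly when its win counts are 0, 1, ..., n-1. The sketch below assumes a complete tournament (one directed edge per pair); the function name and the example edges are hypothetical.

```python
def has_cycle(systems, edges):
    """True if the tournament contains a cycle.

    Assumes exactly one directed edge (winner, loser) per pair of systems.
    """
    wins = {s: 0 for s in systems}
    for winner, _loser in edges:
        wins[winner] += 1
    # An acyclic tournament on n vertices has win counts 0, 1, ..., n-1.
    return sorted(wins.values()) != list(range(len(systems)))

systems = ["A", "B", "C"]
cyclic = has_cycle(systems, [("A", "B"), ("B", "C"), ("C", "A")])
acyclic = has_cycle(systems, [("A", "B"), ("B", "C"), ("A", "C")])
print(cyclic, acyclic)
```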

Page 49: JHU MT class: Human Evaluation of Machine Translation Systems

Tournaments

What if the tournament contains cycles?

Solution: Reverse a set of edges such that:
(a) The resulting graph is acyclic.
(b) The sum of the reversed edges' weights is minimized.
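For a handful of systems, this optimization can be solved by brute force: try every ordering and keep the one whose backward edges have the smallest total weight. The sketch below is an illustration only, with hypothetical names and weights; it enumerates n! orderings, so it is not a practical solver for larger tournaments and is not claimed to be the method used in the analysis.

```python
from itertools import permutations

def min_weight_reversal(systems, weights):
    """Find an ordering that minimizes the total weight of edges it contradicts.

    weights: dict mapping (winner, loser) to a positive weight.
    Returns (ordering, reversed_edges, cost).
    """
    best = None
    for order in permutations(systems):
        position = {s: i for i, s in enumerate(order)}
        # An edge winner -> loser is contradicted when the loser is placed first.
        reversed_edges = [e for e in weights if position[e[0]] > position[e[1]]]
        cost = sum(weights[e] for e in reversed_edges)
        if best is None or cost < best[2]:
            best = (order, reversed_edges, cost)
    return best

# Hypothetical weighted tournament containing the cycle A -> B -> C -> A.
weights = {
    ("A", "B"): 3, ("B", "C"): 2, ("C", "A"): 1,
    ("A", "D"): 1, ("B", "D"): 1, ("C", "D"): 2,
}
order, reversed_edges, cost = min_weight_reversal(["A", "B", "C", "D"], weights)
print(order, reversed_edges, cost)
```

In this example, reversing the single lightest edge of the cycle is enough to make the graph acyclic.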

Page 52: JHU MT class: Human Evaluation of Machine Translation Systems

Tournaments

What if the tournament contains cycles?

Set of reversed edges = minimum feedback arc set (MFAS).

In theory, this optimization is NP-hard (Karp, 1972).

In practice, it’s not too hard.

Page 53: JHU MT class: Human Evaluation of Machine Translation Systems

Tournaments

What if the tournament contains cycles?

Important detail: What should the weight be?

The following analysis uses #(wins - losses).

Dumb, but counts each observation equally.
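The wins-minus-losses weight can be sketched as below. The input format (a list of `(winner, loser)` tuples) and the function name are assumptions for illustration; pairs with equal wins and losses get no edge here, which is one possible convention rather than a documented one.

```python
from collections import Counter

def edge_weights(judgments):
    """Weight each edge winner -> loser by (wins - losses) over all judgments."""
    wins = Counter(judgments)
    weights = {}
    for (a, b), n_ab in wins.items():
        margin = n_ab - wins.get((b, a), 0)
        if margin > 0:
            weights[(a, b)] = margin  # a beats b by this many observations
    return weights

# Hypothetical judgments: A beats B three times and loses once; B beats C twice.
judgments = [("A", "B")] * 3 + [("B", "A")] + [("B", "C")] * 2
weights = edge_weights(judgments)
print(weights)
```

Because every judgment adds or subtracts exactly one, each observation counts equally regardless of which assessor or sentence produced it.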

Page 55: JHU MT class: Human Evaluation of Machine Translation Systems

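Example: French-English 2010

[Figure: system orderings under the headings "Task Rankings" and "MFAS". System names appearing in the figure: onlineB, rwth-combo, cmu-hyposel-combo, cambridge, lium, dcu-combo, cmu-heafield-combo, upv-combo, nrc, uedin, jhu, limsi, jhu-combo, lium-combo, rali, lig, bbn-combo, rwth, cmu-statxfer, onlineA, huicong, dfki, cu-zeman, geneva.]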

Page 59: JHU MT class: Human Evaluation of Machine Translation Systems

Has WMT solved these problems?

Human evaluation is too slow and expensive!
➡ With crowdsourcing, WMT has made a good dent in this problem.

Human evaluation isn’t reproducible!
➡ Empirically true in the WMT data.

Page 60: JHU MT class: Human Evaluation of Machine Translation Systems

Human Assessment is Fast and Cheap!