“A Tool for a High-Carat Gold-Standard Word Alignment” Drayton C. Benner [email protected] Miklal Software Solutions

Benner LaTech 2014 Presentation


Page 1: Benner LaTech 2014 Presentation

“A Tool for a High-Carat Gold-Standard Word Alignment”

Drayton C. Benner

[email protected]

Miklal Software Solutions

Page 2: Benner LaTech 2014 Presentation

Manual word alignment

• Uses
  • Mostly used as an input for statistical machine translation
  • But also useful in and of itself to humanists and other linguists
    • Philology, translation technique, textual criticism, lexicography
    • Contact linguistics, corpus linguistics, historical linguistics

• Distinctive needs for use in humanistic inquiry
  • Visualization
  • Quality
  • Consistency

Page 3: Benner LaTech 2014 Presentation

Project: Manually align the Hebrew Bible with an English translation

• Hebrew text and morphology: Westminster Leningrad Codex, Westminster Hebrew morphology

• English translation: English Standard Version, 2011 text edition

• Lengthy document outlining consistency standards

• Built a Java application to do the manual alignment

Page 4: Benner LaTech 2014 Presentation

Past visualizations

Lines (from Smith and Jahr 2000)

Alignment matrix (from Germann 2007)

Page 5: Benner LaTech 2014 Presentation

Past visualizations (cont.)

Colors (from Merkel et al 2003)

Mouseover (from Germann 2008)

Page 6: Benner LaTech 2014 Presentation

Visualization

Language helps

Colors

Blank rows

Lines

Page 7: Benner LaTech 2014 Presentation

Quality and consistency

• Sortable table providing detailed information about how a source lexeme is grouped and linked to the target translation

  • Allows for quick check for quality and consistency for a source lexeme
  • English NLP (for target lexemes, shortened glosses) done using Stanford parser and WordNet mixed with local algorithms

• Another panel shows analogous information for target lexemes
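
The kind of table described on this slide can be sketched as a grouping of links by source lexeme, with a count for each target gloss. This is a minimal illustration in Python, not the tool's Java code. The record format, the function names `link_table` and `rows`, and the transliterated sample lexemes are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical link records: (source lexeme, shortened target gloss).
# Illustrative values only, not project data.
SAMPLE_LINKS = [
    ("dabar", "word"),
    ("dabar", "word"),
    ("dabar", "thing"),
    ("melek", "king"),
    ("dabar", "matter"),
]


def link_table(links):
    """Group links by source lexeme and count each target gloss."""
    table = defaultdict(Counter)
    for lexeme, gloss in links:
        table[lexeme][gloss] += 1
    return table


def rows(table):
    """Flatten the table into sortable rows of (lexeme, gloss, count).

    Rows are ordered by lexeme, then by descending count, then by gloss,
    so rare renderings of a lexeme appear at the end of its group.
    """
    out = []
    for lexeme in sorted(table):
        ranked = sorted(table[lexeme].items(), key=lambda kv: (-kv[1], kv[0]))
        for gloss, count in ranked:
            out.append((lexeme, gloss, count))
    return out


if __name__ == "__main__":
    for row in rows(link_table(SAMPLE_LINKS)):
        print(row)
```

Sorting each lexeme's glosses by frequency puts one-off renderings at the bottom of the group, which is where an aligner would look first for possible inconsistencies.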

Page 8: Benner LaTech 2014 Presentation

Quality and consistency (cont.)

• Sortable, filterable table providing information about source lexemes and how they are linked

• Another panel shows analogous information for target lexemes

Page 9: Benner LaTech 2014 Presentation

Quality and consistency (cont.)

• An attempt at enforcing some of the project’s consistency standards
  • Uses the Westminster Hebrew Morphology and the syntactic parsing from the Stanford parser to look for common errors in the alignment
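
A rule-based check of this kind could look roughly like the sketch below. The rule itself is hypothetical and is not taken from the project's standards document: it flags a source verb whose linked target tokens include no verb. The `Token` class, the coarse part-of-speech labels, and the simplified tokenization of the sample are assumptions. In the real tool the source tags would come from the morphology and the target tags from the parser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    text: str
    pos: str  # coarse part of speech, e.g. "verb", "noun", "conj"


def flag_verb_without_verb(source, target, links):
    """Return indices of source verbs whose linked target tokens contain no verb.

    links is a set of (source_index, target_index) pairs.
    Unlinked source verbs are not flagged by this (hypothetical) rule.
    """
    flagged = []
    for i, tok in enumerate(source):
        if tok.pos != "verb":
            continue
        linked = [target[t] for (s, t) in links if s == i]
        if linked and not any(t.pos == "verb" for t in linked):
            flagged.append(i)
    return flagged


# Illustrative sample with simplified tokenization and tags.
SOURCE = [Token("wayyomer", "verb"), Token("elohim", "noun")]
TARGET = [Token("And", "conj"), Token("God", "noun"), Token("said", "verb")]

GOOD_LINKS = {(0, 0), (0, 2), (1, 1)}  # verb linked to "And" and "said"
BAD_LINKS = {(0, 0), (1, 1)}           # verb linked only to "And"

if __name__ == "__main__":
    print(flag_verb_without_verb(SOURCE, TARGET, GOOD_LINKS))
    print(flag_verb_without_verb(SOURCE, TARGET, BAD_LINKS))
```

A check like this only surfaces candidates for review. Whether a flagged link is an actual error remains the aligner's decision.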

Page 10: Benner LaTech 2014 Presentation

Preliminary results

• Enabling aligners to work quickly
  • Approximately 13 source tokens/21 target tokens per minute

• Enabling high-quality, consistent work
  • Two aligners produced extremely similar manual alignments of a small sample.
    • 22 different decisions
    • 34/1623 (2.1%) target tokens did not have identical source tokens linked to them
    • 29/933 (3.1%) source tokens did not have identical target tokens linked to them
  • In the absence of consistency standards, all the differences would be considered ambiguous
    • Half of the different decisions reflected deviations from the project’s detailed consistency standards by one of the aligners.
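
The token-level disagreement figures above can be reproduced in spirit with a small comparison of two link sets. The sketch below is in Python and assumes each alignment is a set of (source index, target index) pairs. That representation, the function names, and the sample data are assumptions, and the project's exact counting procedure is not stated in the slides.

```python
from collections import defaultdict


def links_by_target(links):
    """Map each target index to the set of source indices linked to it."""
    by_target = defaultdict(set)
    for s, t in links:
        by_target[t].add(s)
    return by_target


def differing_target_tokens(links_a, links_b, n_target):
    """Return target indices whose linked source tokens differ between aligners."""
    a, b = links_by_target(links_a), links_by_target(links_b)
    return [t for t in range(n_target) if a.get(t, set()) != b.get(t, set())]


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Illustrative sample: the two aligners disagree only on target token 2.
ALIGNER_A = {(0, 0), (1, 1), (1, 2), (2, 3)}
ALIGNER_B = {(0, 0), (1, 1), (2, 2), (2, 3)}

if __name__ == "__main__":
    diff = differing_target_tokens(ALIGNER_A, ALIGNER_B, n_target=4)
    print(diff, percent(len(diff), 4))
    # The reported counts from the slide, as percentages.
    print(percent(34, 1623), percent(29, 933))
```

Swapping the roles of source and target in `links_by_target` gives the corresponding source-side count.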