Benner LaTech 2014 Presentation

“A Tool for a High-Carat Gold-Standard Word Alignment”

Drayton C. Benner

draytonbenner@miklalsoftware.com

Miklal Software Solutions

Manual word alignment

• Uses
  • Mostly used as an input for statistical machine translation
  • But also useful in and of itself to humanists and other linguists
    • Philology, translation technique, textual criticism, lexicography
    • Contact linguistics, corpus linguistics, historical linguistics
• Distinctive needs for use in humanistic inquiry
  • Visualization
  • Quality
  • Consistency

Project: Manually align the Hebrew Bible with an English translation

• Hebrew text and morphology: Westminster Leningrad Codex, Westminster Hebrew morphology

• English translation: English Standard Version, 2011 text edition

• Lengthy document outlining consistency standards

• Built a Java application to do the manual alignment

Past visualizations

Lines (from Smith and Jahr 2000)

Alignment matrix (from Germann 2007)

Past visualizations (cont.)

Colors (from Merkel et al. 2003)

Mouseover (from Germann 2008)

Visualization

Language helps

Colors

Blank rows

Lines

Quality and consistency

• Sortable table providing detailed information about how a source lexeme is grouped and linked to the target translation

• Allows for quick check for quality and consistency for a source lexeme
• English NLP (for target lexemes, shortened glosses) done using Stanford parser and WordNet mixed with local algorithms
• Another panel shows analogous information for target lexemes
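
The kind of table described on this slide can be illustrated with a minimal sketch. It is not the tool's actual code (the tool is a Java application whose internals are not shown here). The function name `rendering_table`, the input format of (source lexeme, target gloss) pairs, and the transliterated sample data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def rendering_table(aligned_links):
    """Group links by source lexeme and count each target gloss.

    aligned_links: iterable of (source_lexeme, target_gloss) pairs.
    This input format is a hypothetical simplification.
    Returns rows of (source_lexeme, target_gloss, count), sorted by
    lexeme, then by descending count, then by gloss.
    """
    table = defaultdict(Counter)
    for lexeme, gloss in aligned_links:
        table[lexeme][gloss] += 1
    rows = [
        (lexeme, gloss, count)
        for lexeme, glosses in table.items()
        for gloss, count in glosses.items()
    ]
    rows.sort(key=lambda row: (row[0], -row[2], row[1]))
    return rows


# Hypothetical sample links, not taken from the project's data.
sample_links = [
    ("melek", "king"),
    ("melek", "king"),
    ("melek", "ruler"),
    ("dabar", "word"),
]

for row in rendering_table(sample_links):
    print(row)
```

Sorting each lexeme's renderings by frequency puts rare renderings at the bottom of its group, which is where an inconsistent alignment decision would most likely show up.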

Quality and consistency (cont.)

• Sortable, filterable table providing information about source lexemes and how they are linked

• Another panel shows analogous information for target lexemes

Quality and consistency (cont.)

• An attempt at enforcing some of the projected consistency standards
  • Uses the Westminster Hebrew Morphology and the syntactic parsing from the Stanford parser to look for common errors in the alignment

Preliminary results

• Enabling aligners to work quickly
  • Approximately 13 source tokens/21 target tokens per minute
• Enabling high-quality, consistent work
  • Two aligners produced extremely similar manual alignments of a small sample.
    • 22 different decisions
    • 34/1623 (2.1%) target tokens did not have identical source tokens linked to them
    • 29/933 (3.1%) source tokens did not have identical target tokens linked to them
  • In the absence of consistency standards, all the differences would be considered ambiguous
    • Half of the different decisions reflected deviations from the project’s detailed consistency standards by one of the aligners.
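
The token-level disagreement figures above can be read as a count of tokens whose set of linked counterparts differs between the two aligners. The sketch below shows one plausible way to compute such a figure. The slides do not state the exact procedure, so the representation of an alignment as a set of (source index, target index) pairs and the function names are assumptions, and the data is a toy example.

```python
from collections import defaultdict


def links_by_token(links, side):
    """Map each token index on one side to the set of indices linked to it.

    links: iterable of (source_index, target_index) pairs (assumed format).
    side: "source" or "target", the side whose tokens are the keys.
    """
    index = defaultdict(set)
    for src, tgt in links:
        if side == "target":
            index[tgt].add(src)
        else:
            index[src].add(tgt)
    return index


def disagreement(links_a, links_b, n_tokens, side):
    """Count tokens on `side` whose linked sets differ between two aligners.

    Unlinked tokens compare as empty sets, so a token left unaligned by
    both aligners counts as agreement.
    Returns (number of differing tokens, fraction of all tokens).
    """
    a = links_by_token(links_a, side)
    b = links_by_token(links_b, side)
    differing = sum(
        1 for i in range(n_tokens) if a.get(i, set()) != b.get(i, set())
    )
    return differing, differing / n_tokens


# Toy alignments over 3 source tokens and 4 target tokens (hypothetical).
aligner_a = {(0, 0), (1, 1), (2, 2), (2, 3)}
aligner_b = {(0, 0), (1, 1), (2, 2)}

print(disagreement(aligner_a, aligner_b, 4, "target"))  # (1, 0.25)
print(disagreement(aligner_a, aligner_b, 3, "source"))
```

Under this reading, a single differing decision can affect both a source token and a target token, which is why the source-side and target-side counts need not match the number of differing decisions.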