“A Tool for a High-Carat Gold-Standard Word Alignment”
Drayton C. Benner
draytonbenner@miklalsoftware.com
Miklal Software Solutions
Manual word alignment
• Uses
  • Mostly used as an input for statistical machine translation
  • But also useful in and of itself to humanists and other linguists
    • Philology, translation technique, textual criticism, lexicography
    • Contact linguistics, corpus linguistics, historical linguistics
• Distinctive needs for use in humanistic inquiry
  • Visualization
  • Quality
  • Consistency
Project: Manually align the Hebrew Bible with an English translation
• Hebrew text and morphology: Westminster Leningrad Codex, Westminster Hebrew morphology
• English translation: English Standard Version, 2011 text edition
• Lengthy document outlining consistency standards
• Built a Java application to do the manual alignment
Past visualizations
Lines (from Smith and Jahr 2000)
Alignment matrix (from Germann 2007)
Past visualizations (cont.)
Colors (from Merkel et al. 2003)
Mouseover (from Germann 2008)
Visualization
Language helps
Colors
Blank rows
Lines
Quality and consistency
• Sortable table providing detailed information about how a source lexeme is grouped and linked to the target translation
• Allows a quick check of quality and consistency for a source lexeme
• English NLP (for target lexemes, shortened glosses) done using the Stanford parser and WordNet mixed with local algorithms
• Another panel shows analogous information for target lexemes
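
The per-lexeme view described above amounts to grouping links by source lexeme and counting the target glosses. The following is a minimal sketch of that idea, not the tool's actual implementation (the tool is a Java application whose code is not shown here). The function name `gloss_table`, the transliterated lexeme, and the counts are all invented for illustration.

```python
from collections import Counter, defaultdict


def gloss_table(aligned_pairs):
    """Group (source_lexeme, target_gloss) pairs by lexeme.

    Returns a dict mapping each lexeme to a list of (gloss, count)
    tuples, most frequent gloss first, ties broken alphabetically.
    """
    counts = defaultdict(Counter)
    for lexeme, gloss in aligned_pairs:
        counts[lexeme][gloss] += 1
    return {
        lexeme: sorted(glosses.items(), key=lambda item: (-item[1], item[0]))
        for lexeme, glosses in counts.items()
    }


# Toy data (hypothetical): one lexeme linked to three different glosses.
toy_pairs = [
    ("erets", "land"),
    ("erets", "earth"),
    ("erets", "land"),
    ("erets", "ground"),
]
toy_table = gloss_table(toy_pairs)
for gloss, count in toy_table["erets"]:
    print(gloss, count)
```

Rare glosses end up at the bottom of each lexeme's list, which is where an aligner would most likely look for inconsistent links.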
Quality and consistency (cont.)
• Sortable, filterable table providing information about source lexemes and how they are linked
• Another panel shows analogous information for target lexemes
Quality and consistency (cont.)
• An attempt at enforcing some of the project's consistency standards
  • Uses the Westminster Hebrew Morphology and the syntactic parsing from the Stanford parser to look for common errors in the alignment
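
One way to picture such a check is as a rule table over part-of-speech pairs. The sketch below is purely illustrative: the slides do not list the project's actual rules, so the `Link` record, the coarse source tags, and the pairs in `SUSPICIOUS_PAIRS` are all hypothetical. The target tags follow the Penn Treebank tag set (for example `DT` for determiner, `IN` for preposition).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source_lexeme: str
    source_pos: str  # hypothetical coarse tag derived from the Hebrew morphology
    target_word: str
    target_pos: str  # Penn Treebank tag from the English parse


# Hypothetical rules: POS pairings that would deserve a second look.
SUSPICIOUS_PAIRS = {("verb", "DT"), ("noun", "IN")}


def flag_suspicious(links):
    """Return the links whose (source_pos, target_pos) pair is in the rule table."""
    return [
        link for link in links
        if (link.source_pos, link.target_pos) in SUSPICIOUS_PAIRS
    ]


# Toy links (invented): the second one pairs a verb with a determiner.
toy_links = [
    Link("bara", "verb", "created", "VBD"),
    Link("bara", "verb", "the", "DT"),
    Link("erets", "noun", "earth", "NN"),
]
flagged = flag_suspicious(toy_links)
for link in flagged:
    print(link.source_lexeme, "->", link.target_word)
```

A flagged link is only a candidate error; under this sketch a human aligner would still decide whether it violates the standards.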
Preliminary results
• Enabling aligners to work quickly
  • Approximately 13 source tokens/21 target tokens per minute
• Enabling high-quality, consistent work
  • Two aligners produced extremely similar manual alignments of a small sample.
    • 22 different decisions
    • 34/1623 (2.1%) target tokens did not have identical source tokens linked to them
    • 29/933 (3.1%) source tokens did not have identical target tokens linked to them
    • In the absence of consistency standards, all the differences would be considered ambiguous
    • Half of the different decisions reflected deviations from the project’s detailed consistency standards by one of the aligners.
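
The per-token disagreement figures above can be computed by comparing, for each token, the set of tokens that each aligner linked to it. The sketch below assumes an alignment is stored as a set of (source index, target index) pairs. That representation, the function names, and the toy links are assumptions made for illustration, not the project's data format. The last line only rechecks the percentages reported on the slide.

```python
from collections import defaultdict


def links_by_token(links, side):
    """Map each token index on `side` to the set of indices linked to it."""
    grouped = defaultdict(set)
    for source_index, target_index in links:
        if side == "target":
            grouped[target_index].add(source_index)
        else:
            grouped[source_index].add(target_index)
    return grouped


def count_disagreements(links_a, links_b, token_count, side):
    """Count tokens on `side` whose linked sets differ between two aligners."""
    by_a = links_by_token(links_a, side)
    by_b = links_by_token(links_b, side)
    return sum(
        1 for index in range(token_count)
        if by_a.get(index, set()) != by_b.get(index, set())
    )


def percent(count, total):
    """Express count/total as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)


# Toy alignments (invented): the aligners disagree about target token 2.
aligner_a = {(0, 0), (1, 1), (2, 2)}
aligner_b = {(0, 0), (1, 1), (1, 2)}
target_diffs = count_disagreements(aligner_a, aligner_b, 3, "target")
source_diffs = count_disagreements(aligner_a, aligner_b, 3, "source")
print(target_diffs, source_diffs)

# Recheck of the reported rates: 34/1623 and 29/933.
print(percent(34, 1623), percent(29, 933))
```

In the toy case a single differing decision changes one target token but two source tokens, which shows why the two token counts (34 and 29) need not match each other or the number of differing decisions (22).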