Slides for the eMOP presentation at the Digital Humanities 2014 conference in Lausanne, Switzerland.
Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards
Elizabeth Grumbach, Co-Project Manager, IDHMC
Laura Mandell, PI, IDHMC
Apostolos Antonacopoulos, PRImA Lab
Clemens Neudecker, Koninklijke Bibliotheek
Matthew Christy, Co-Project Manager, IDHMC
Loretta Auvil, SEASR Analytics
Todd Samuelson, Cushing Memorial Library
emop.tamu.edu
Initial Goals
Challenges or Failures
Analysis + New Directions
Adaptability
Navigating the Storm | @EMGrumbach | emop.tamu.edu
Straight from the grant proposal…
“Our overarching goals”
1) Train three open-access OCR engines to “read” early modern fonts
2) Map specific font training onto specific sets of documents
3) Create error-evaluation mechanisms for failed documents
4) Use crowd-sourced correction tools specific to OCR errors
5) Identify pages that are too flawed to be “readable”
6) Share our workflow procedure and results, so that the community can use them in digitizing and transcribing early modern documents.
Main Collaborators
• CIIR (UMass Amherst)
• IDHMC + Cushing Memorial Library (Texas A&M)
• Koninklijke Bibliotheek (Netherlands)
• Performant Software Solutions (Charlottesville, Virginia)
• PRImA Labs (University of Salford, Manchester)
• PSI Labs (Texas A&M)
• SEASR (U of Illinois, Urbana-Champaign)
Data Contributors + Collaborators
• Early English Books Online (EEBO)
• Eighteenth Century Collections Online (ECCO)
• Text Creation Partnership (TCP)
• Brazos Computing Cluster (Texas A&M)
Laura Mandell, Principal Investigator, eMOP
Director, IDHMC
@mandellc
Early Modern Printing
• Individual, hand-made typefaces
• Worn and broken type
• Poor quality equipment/paper
• Inconsistent line bases
• Unusual page layouts, decorative page elements
• Special characters & ligatures
• Spelling variations
• Mixed typefaces and languages
Slides by Matthew Christy
• Irregular Layouts
• Print Bleedthrough
Document/Image Quality
• Torn and damaged pages
• Noise introduced to images of pages
• Skewed pages
• Warped pages
• Missing pages
• Inverted pages
• Incorrect metadata
• Extremely low quality TIFFs (~50K)
Dream
Training Tesseract on different fonts and applying that training to the documents printed in those particular fonts will improve OCR quality.
Reality
There may be as much difference between one letter and another in a specific font as there is between letters in different fonts.
Training Tesseract: Aletheia
Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
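To make the Aletheia output concrete, here is a minimal sketch of reading glyph records from such a file. The XML layout, the attribute names, and the `read_glyphs` helper are all assumptions made for illustration; the slides do not describe Aletheia's actual schema, only that each glyph carries coordinates and a Unicode value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a glyph-level export.
# The real Aletheia schema is not described in these slides.
SAMPLE_XML = """
<Page imageFilename="page_0001.tif">
  <Glyph x="120" y="340" width="18" height="32" unicode="U+017F"/>
  <Glyph x="140" y="342" width="16" height="30" unicode="U+0065"/>
  <Glyph x="158" y="340" width="18" height="32" unicode="U+017F"/>
</Page>
"""


def read_glyphs(xml_text):
    """Return one dict per glyph with its character and bounding box."""
    root = ET.fromstring(xml_text.strip())
    glyphs = []
    for node in root.iter("Glyph"):
        # Turn a label such as "U+017F" into the character it names.
        code_point = int(node.get("unicode")[2:], 16)
        glyphs.append({
            "char": chr(code_point),
            "x": int(node.get("x")),
            "y": int(node.get("y")),
            "width": int(node.get("width")),
            "height": int(node.get("height")),
        })
    return glyphs


glyphs = read_glyphs(SAMPLE_XML)
```

Storing the code point rather than a rendered letter keeps early modern forms such as the long s (U+017F) distinct from their modern equivalents.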
Training Tesseract: Franken+
1. Takes Aletheia's output files as input.
2. Groups all glyphs with the same Unicode values into one window for comparison.
3. Mistakenly coded glyphs are easily identified and re-coded.
4. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired.
5. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.
6. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.
7. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files.
8. Outputs a .traineddata file used by Tesseract when OCRing page images.
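Steps 2 and 6 above can be sketched roughly as follows. This is not Franken+'s code: the glyph dictionaries, `group_by_char`, and `to_box_line` are hypothetical names, and the glyph coordinates are assumed to use a top-left origin. The box-file line layout (symbol, left, bottom, right, top, page, with the origin at the bottom-left of the image) follows the format commonly documented for Tesseract training.

```python
from collections import defaultdict

# Hypothetical glyph records (top-left origin), as a transcription tool
# might export them.
GLYPHS = [
    {"char": "\u017f", "x": 120, "y": 340, "width": 18, "height": 32},
    {"char": "e", "x": 140, "y": 342, "width": 16, "height": 30},
    {"char": "\u017f", "x": 158, "y": 340, "width": 18, "height": 32},
]


def group_by_char(glyphs):
    """Collect all exemplars of each character so they can be compared."""
    groups = defaultdict(list)
    for glyph in glyphs:
        groups[glyph["char"]].append(glyph)
    return dict(groups)


def to_box_line(glyph, image_height, page=0):
    """Format one glyph as a box-file line, flipping y to a bottom-left origin."""
    left = glyph["x"]
    right = glyph["x"] + glyph["width"]
    top = image_height - glyph["y"]
    bottom = image_height - (glyph["y"] + glyph["height"])
    return f'{glyph["char"]} {left} {bottom} {right} {top} {page}'


groups = group_by_char(GLYPHS)
box_lines = [to_box_line(g, image_height=1000) for g in GLYPHS]
```

Grouping by character is what makes a mis-coded glyph stand out: it appears as the odd shape among otherwise similar exemplars.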
Clemens Neudecker, Koninklijke Bibliotheek
@cneudecker
The case of IMPACT
• IMPACT = IMProving ACcess to Text
• EU FP7, 2008 – 2012
• €16.7 M budget
• 22 partners (libraries, universities, companies)
• Goal: Significantly improve OCR for historical documents
Issue 1
• Expectation: The "IMPACT OCR"
• Reality: A collection of very diverse tools, algorithms, etc. Some prototypes, some commercial tools, different programming languages, different levels of maturity, etc.
• No integrated product possible!
Issue 1
• Solution: Interoperability rather than integration
• Change: Individual applications as pluggable modules in a web-based framework
• Result: Flexible framework with additional benefits for testing, transparency, provenance
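The idea of pluggable modules behind one shared interface might look something like the sketch below. It is an in-process illustration only and leaves out the web layer entirely; the registry, the module names, and `run_workflow` are hypothetical and are not taken from the IMPACT framework.

```python
from typing import Callable, Dict, List, Tuple

# Every module shares one interface: text in, text out.
Module = Callable[[str], str]
REGISTRY: Dict[str, Module] = {}


def register(name: str):
    """Decorator that plugs a module into the registry under a name."""
    def decorator(func: Module) -> Module:
        REGISTRY[name] = func
        return func
    return decorator


@register("normalise_whitespace")
def normalise_whitespace(text: str) -> str:
    return " ".join(text.split())


@register("modernise_long_s")
def modernise_long_s(text: str) -> str:
    return text.replace("\u017f", "s")


def run_workflow(text: str, steps: List[str]) -> Tuple[str, List[str]]:
    """Run the named modules in order and record which ones were applied."""
    provenance = []
    for name in steps:
        text = REGISTRY[name](text)
        provenance.append(name)
    return text, provenance


result, provenance = run_workflow(
    "\u017fome   text", ["normalise_whitespace", "modernise_long_s"]
)
```

Because modules are looked up by name, one can be swapped or tested in isolation, and the recorded list of steps gives a simple provenance trail.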
Issue 2
• Diversity: Librarians, Computer Scientists, Computational Linguists, Humanists
• Are we really talking the same language?
• Different focus points in the project: applicable solutions vs. academic publications
Issue 2
• Solution: Create bonding activities, foster atmosphere for knowledge exchange
• Change: Buddy programme, social games, quizzes about partners
• Result: Understanding your partners' backgrounds and their ways of thinking enriches the experience for everyone
Large Digitisation Projects: Two Key Perspectives
Apostolos Antonacopoulos, PRImA Research Lab
Background
Since 2002 the PRImA Lab has been involved in large digitisation projects, creating software tools for all stages of the workflow
• From Image Enhancement to Layout Analysis to OCR
• Use-scenario based evaluation of extracted text quality
• Crowd/Scholar-sourcing
Two general points are routinely underestimated:
• (Really) Understanding stakeholders and their roles
• (Real) Understanding of problems, their extent and the effectiveness/requirements of potential solutions
Stakeholders and their roles
Seems obvious and often mentioned, but the significance of understanding this point and its effects is vastly underestimated
Content holders
• Keen for their content to be widely available and used
• Do not know their content well, nor its potential uses
Computer scientists
• Have technical expertise to solve many of the problems
• Do not know the material and its uses well enough to prioritise problems
DH researchers – the catalysts
• Very knowledgeable of material and potential use
• Have technical skills complementary to those of computer scientists
Problem understanding
At the start of each project everyone is eager to deliver “big” results, but it is important to identify and understand a few key problems and solve them well
“Improve OCR results” is an ill-defined and short-sighted goal
• OCR results measured only in terms of word accuracy are of little use
• Layout is very important
• Even if all the words are recognised correctly, the reading order is unlikely to be correct, limiting potentially interesting uses.
• Page numbers, captions, running headers etc. should not be mixed with body text
• Graphical elements / illustrations are important too
Think: Useful data (investment) vs. just more of any data (instant gratification)
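The point about reading order can be illustrated with a toy two-column page. Both metrics below are simplified assumptions made up for this sketch, not the evaluation measures used by PRImA, and the sample text is invented.

```python
from collections import Counter


def bag_of_words_accuracy(reference, ocr):
    """Share of reference words found in the OCR output, ignoring order."""
    matched = sum((Counter(reference) & Counter(ocr)).values())
    return matched / len(reference)


def in_order_accuracy(reference, ocr):
    """Share of positions where the OCR word matches the reference word."""
    hits = sum(1 for ref_word, ocr_word in zip(reference, ocr)
               if ref_word == ocr_word)
    return hits / len(reference)


# Column 1 reads "the king / rode north"; column 2 reads "the queen / sailed south".
reference = "the king rode north the queen sailed south".split()

# Every word is recognised, but lines are read straight across both columns.
ocr = "the king the queen rode north sailed south".split()

word_score = bag_of_words_accuracy(reference, ocr)
order_score = in_order_accuracy(reference, ocr)
```

Here `word_score` is perfect while `order_score` is not, which is the gap a word-accuracy figure alone would hide.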
Elizabeth Grumbach, Co-Project Manager, IDHMC
@EMGrumbach
“If an electronic scholarly project can’t fail and doesn’t produce new ignorance, then it isn’t worth a damn.”
- John Unsworth, “Documenting the Reinvention of Text: The Importance of Failure”
Challenges + Failures should be constantly or consistently communicated.
Analysis + New Directions should lead to research and communication with similar projects.
Adaptability should allow for new possibilities, new questions.