Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards


Elizabeth Grumbach, Co-Project Manager, IDHMC

Laura Mandell, PI, IDHMC

Apostolos Antonacopoulos, PRImA Lab

Clemens Neudecker, Koninklijke Bibliotheek

Matthew Christy, Co-Project Manager, IDHMC

Loretta Auvil, SEASR Analytics

Todd Samuelson, Cushing Memorial Library

emop.tamu.edu


Initial Goals

Challenges or Failures

Analysis

New Directions

Adaptability

Navigating the Storm | @EMGrumbach | emop.tamu.edu

Straight from the grant proposal…

“Our overarching goals”

1) Train three open-access OCR engines to “read” early modern fonts

2) Map specific font training onto specific sets of documents

3) Create error-evaluation mechanisms for failed documents

4) Use crowd-sourced correction tools specific to OCR errors

5) Identify pages that are too flawed to be “readable”

6) Share our workflow procedure and results, so that the community can use them in digitizing and transcribing early modern documents.


Main Collaborators

CIIR (UMass Amherst)

IDHMC + Cushing Memorial Library (Texas A&M)

Koninklijke Bibliotheek (Netherlands)

Performant Software Solutions (Charlottesville, Virginia)

PRImA Labs (University of Salford, Manchester)

PSI Labs (Texas A&M)

SEASR (U of Illinois, Urbana-Champaign)


Data Contributors + Collaborators

Early English Books Online (EEBO)

Eighteenth Century Collections Online (ECCO)

Text Creation Partnership (TCP)

Brazos Computing Cluster (Texas A&M)




Steering Standards

Laura Mandell, Principal Investigator, eMOP

Director, IDHMC

@mandellc


Early Modern Printing

• Individual, hand-made typefaces

• Worn and broken type

• Poor quality equipment/paper

• Inconsistent line bases

• Unusual page layouts, decorative page elements

• Special characters & ligatures

• Spelling variations

• Mixed typefaces and languages

Slides by Matthew Christy


• Irregular layouts

• Print bleedthrough

Document/Image Quality

• Torn and damaged pages

• Noise introduced to images of pages

• Skewed pages

• Warped pages

• Missing pages

• Inverted pages

• Incorrect metadata

• Extremely low quality TIFFs (~50K)




Dream

Training Tesseract in different fonts and applying that training to the documents printed in those particular fonts will improve OCR quality.

Reality

There may be as much difference between one letter and another in a specific font as there is between letters in different fonts.

Training Tesseract: Aletheia

Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
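
As a rough illustration of what such an export makes possible, the sketch below reads glyph records from a small XML sample. The element and attribute names (`Page`, `Glyph`, `Coords`, `Unicode`, `points`) are assumptions chosen for the example, not Aletheia's actual schema, which may use namespaces and different nesting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a glyph-level export.
# The real Aletheia schema may differ (namespaces, nesting, names).
SAMPLE_XML = """
<Page imageFilename="page_0001.tif">
  <Glyph id="g1">
    <Coords points="10,20 30,20 30,50 10,50"/>
    <Unicode>a</Unicode>
  </Glyph>
  <Glyph id="g2">
    <Coords points="35,18 52,18 52,50 35,50"/>
    <Unicode>ſ</Unicode>
  </Glyph>
</Page>
"""


def parse_glyphs(xml_text):
    """Return a list of (unicode_value, (left, top, right, bottom)) tuples."""
    root = ET.fromstring(xml_text)
    glyphs = []
    for glyph in root.iter("Glyph"):
        # Each point is "x,y"; points are separated by spaces.
        points = [
            tuple(int(v) for v in pair.split(","))
            for pair in glyph.find("Coords").get("points").split()
        ]
        xs = [x for x, _ in points]
        ys = [y for _, y in points]
        # Reduce the outline polygon to its bounding box.
        box = (min(xs), min(ys), max(xs), max(ys))
        glyphs.append((glyph.find("Unicode").text, box))
    return glyphs


glyphs = parse_glyphs(SAMPLE_XML)
```

Reducing each outline to a bounding box discards the polygon shape, which is a simplification made here only to keep the example short.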

Training Tesseract: Franken+

1. Takes Aletheia's output files as input.

2. Groups all glyphs with the same Unicode values into one window for comparison.

3. Mistakenly coded glyphs are easily identified and re-coded.

4. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired.

5. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.

6. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.

7. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files.

8. Outputs a .traineddata file used by Tesseract when OCRing page images.
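
Steps 2 and 6 above can be sketched in a few lines of Python. This is not Franken+'s code: the function names and the input tuple layout are hypothetical. The only part taken from Tesseract itself is the box-file line layout (symbol, left, bottom, right, top, page, with y measured from the bottom edge of the image).

```python
from collections import defaultdict


def group_by_unicode(glyphs):
    """Group glyph boxes by their Unicode value (compare step 2).

    `glyphs` is a list of (char, (left, top, right, bottom)) tuples in
    image coordinates, where y grows downward. This layout is an assumption.
    """
    groups = defaultdict(list)
    for char, box in glyphs:
        groups[char].append(box)
    return dict(groups)


def to_box_line(char, box, image_height, page=0):
    """Format one glyph as a Tesseract box-file line (compare step 6)."""
    left, top, right, bottom = box
    # Box files measure y from the bottom of the image, so flip the axis.
    return f"{char} {left} {image_height - bottom} {right} {image_height - top} {page}"


sample = [("a", (10, 20, 30, 50)), ("b", (40, 10, 60, 50)), ("a", (70, 22, 90, 50))]
groups = group_by_unicode(sample)
lines = [to_box_line(char, box, image_height=100) for char, box in sample]
```

Grouping every exemplar of one character together is what lets a reviewer spot a mis-coded glyph at a glance, since it will look unlike its neighbours.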



Steering Standards

Clemens Neudecker, Koninklijke Bibliotheek

@cneudecker

The case of IMPACT

• IMPACT = IMProving ACcess to Text

• EU FP7, 2008 – 2012

• €16.7 M budget

• 22 partners (libraries, universities, companies)

• Goal: Significantly improve OCR for historical documents

Issue 1

• Expectation: The "IMPACT OCR"

• Reality: A collection of very diverse tools, algorithms, etc. Some prototypes, some commercial tools, different programming languages, different levels of maturity, etc.

• No integrated product possible!

Issue 1

• Solution: Interoperability rather than integration

• Change: Individual applications as pluggable modules in a web-based framework

• Result: Flexible framework with additional benefits for testing, transparency, provenance
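
The idea of pluggable modules with provenance can be illustrated with a minimal in-process sketch. This is not the IMPACT framework, which was web-based: the registry, the decorator, and the two toy text-cleaning modules below are all hypothetical, and they only show the pattern of independent tools behind one shared interface.

```python
from typing import Callable, Dict, List, Tuple

# Every module shares one interface: text in, text out.
Module = Callable[[str], str]
REGISTRY: Dict[str, Module] = {}


def register(name: str):
    """Decorator that plugs a module into the registry under a name."""
    def decorator(func: Module) -> Module:
        REGISTRY[name] = func
        return func
    return decorator


@register("normalize_long_s")
def normalize_long_s(text: str) -> str:
    # Toy module: replace the early modern long s with a modern s.
    return text.replace("ſ", "s")


@register("collapse_whitespace")
def collapse_whitespace(text: str) -> str:
    # Toy module: squeeze runs of whitespace into single spaces.
    return " ".join(text.split())


def run_pipeline(text: str, steps: List[str]) -> Tuple[str, List[str]]:
    """Run the named modules in order and record which ones were applied."""
    provenance = []
    for name in steps:
        text = REGISTRY[name](text)
        provenance.append(name)
    return text, provenance


result, provenance = run_pipeline(
    "Bleſſed   be", ["normalize_long_s", "collapse_whitespace"]
)
```

Because modules only meet at the shared interface, any one of them can be swapped or tested in isolation, and the recorded step list documents how an output was produced.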

Issue 2

• Diversity: Librarians, Computer Scientists, Computational Linguists, Humanists

• Are we really talking the same language?

• Different focus points in the project: applicable solutions vs. academic publications

Issue 2

• Solution: Create bonding activities, foster atmosphere for knowledge exchange

• Change: Buddy programme, social games, quizzes about partners

• Result: Understanding your partners' backgrounds and their ways of thinking enriches the experience for everyone

Large Digitisation Projects: Two Key Perspectives

Apostolos Antonacopoulos, PRImA Research Lab

Background

Since 2002 the PRImA Lab has been involved in large digitisation projects, creating software tools for all stages of the workflow

• From Image Enhancement to Layout Analysis to OCR

• Use-scenario based evaluation of extracted text quality

• Crowd/Scholar-sourcing

Two general points are routinely underestimated:

• (Really) Understanding stakeholders and their roles

• (Real) Understanding of problems, their extent and the effectiveness/requirements of potential solutions

Stakeholders and their roles

Seems obvious and often mentioned, but the significance of understanding this point and its effects is vastly underestimated

Content holders

• Keen for their content to be widely available and used

• Do not know their content well, nor its potential uses

Computer scientists

• Have technical expertise to solve many of the problems

• Do not know the material and its use well enough to prioritise problems

DH researchers – the catalysts

• Very knowledgeable of material and potential use

• Have complementary technical skills to computer scientists

Problem understanding

At the start of each project everyone is eager to deliver “big” results but it is important to identify and understand a few key problems and solve them well

“Improve OCR results” is an ill-defined and short-sighted goal

• Measured in terms of word-accuracy, OCR results are of little use

• Layout is very important

• Even if all the words are recognised correctly, the reading order is unlikely to be correct, limiting potentially interesting uses.

• Page numbers, captions, running headers etc. should not be mixed with body text

• Graphical elements / illustrations are important too

Think: Useful data (investment) vs. just more of any data (instant gratification)
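
The point about word accuracy and reading order can be made concrete with a small sketch. Both metrics below are illustrative choices, not the evaluation measures used by PRImA or eMOP: one counts matching words regardless of position, the other compares the word sequences with Python's standard `difflib`.

```python
from collections import Counter
from difflib import SequenceMatcher


def bag_of_words_accuracy(truth, ocr):
    """Share of ground-truth words found in the OCR output, ignoring order."""
    truth_counts = Counter(truth.split())
    ocr_counts = Counter(ocr.split())
    matched = sum((truth_counts & ocr_counts).values())
    return matched / sum(truth_counts.values())


def order_sensitive_ratio(truth, ocr):
    """Similarity of the two word sequences; drops when blocks are reordered."""
    return SequenceMatcher(None, truth.split(), ocr.split()).ratio()


# Every word is recognised, but two text blocks come out in the wrong order,
# as can happen when layout analysis misreads columns.
truth = "the quick brown fox jumps over the lazy dog"
ocr = "jumps over the lazy dog the quick brown fox"

word_score = bag_of_words_accuracy(truth, ocr)
order_score = order_sensitive_ratio(truth, ocr)
```

Here the order-blind score is perfect while the sequence-based score is not, which is the gap a word-accuracy figure alone would hide.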


Steering Standards

Elizabeth Grumbach, Co-Project Manager, IDHMC

@EMGrumbach


“If an electronic scholarly project can’t fail and doesn’t produce new ignorance, then it isn’t worth a damn.”

- John Unsworth, “Documenting the Reinvention of Text: The Importance of Failure”



Initial Goals

Challenges or Failures


Adaptability




Challenges + Failures should be constantly or consistently communicated.

Analysis + New Directions should lead to research and communication with similar projects.

Adaptability should allow for new possibilities, new questions.


Education
