Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards. Elizabeth Grumbach, Co-Project Manager, IDHMC; Laura Mandell, PI, IDHMC; Apostolos Antonacopoulos, PRImA Lab; Clemens Neudecker, Koninklijke Bibliotheek; Matthew Christy, Co-Project Manager, IDHMC; Loretta Auvil, SEASR Analytics; Todd Samuelson, Cushing Memorial Library. emop.tamu.edu

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards


DESCRIPTION

Slides for the eMOP presentation at the Digital Humanities 2014 conference in Lausanne, Switzerland.


Page 1: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Elizabeth Grumbach, Co-Project Manager, IDHMC

Laura Mandell, PI, IDHMC

Apostolos Antonacopoulos, PRImA Lab

Clemens Neudecker, Koninklijke Bibliotheek

Matthew Christy, Co-Project Manager, IDHMC

Loretta Auvil, SEASR Analytics

Todd Samuelson, Cushing Memorial Library

emop.tamu.edu

Page 2: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Initial Goals

Challenges or Failures

Analysis + New Directions

Adaptability

Navigating the Storm | @EMGrumbach | emop.tamu.edu

Page 3: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Straight from the grant proposal…

“Our overarching goals”

1) Train three open-access OCR engines to “read” early modern fonts

2) Map specific font training onto specific sets of documents

3) Create error-evaluation mechanisms for failed documents

4) Use crowd-sourced correction tools specific to OCR errors

5) Identify pages that are too flawed to be “readable”

6) Share our workflow procedure and results, so that the community can use them in digitizing and transcribing early modern documents.


Page 4: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Main Collaborators

• CIIR (UMass Amherst)

• IDHMC + Cushing Memorial Library (Texas A&M)

• Koninklijke Bibliotheek (Netherlands)

• Performant Software Solutions (Charlottesville, Virginia)

• PRImA Labs (University of Salford, Manchester)

• PSI Labs (Texas A&M)

• SEASR (U of Illinois, Urbana-Champaign)


Page 5: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Data Contributors + Collaborators

• Early English Books Online (EEBO)

• Eighteenth Century Collections Online (ECCO)

• Text Creation Partnership (TCP)

• Brazos Computing Cluster (Texas A&M)

Main Collaborators


Page 6: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Laura Mandell, Principal Investigator, eMOP

Director, IDHMC

@mandellc


Page 7: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Early Modern Printing

• Individual, hand-made typefaces

• Worn and broken type

• Poor quality equipment/paper

• Inconsistent line bases

• Unusual page layouts, decorative page elements

• Special characters & ligatures

• Spelling variations

• Mixed typefaces and languages

Slides by Matthew Christy 7

Page 8: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards


• Irregular Layouts

• Print Bleedthrough

Page 9: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Document/Image Quality

• Torn and damaged pages

• Noise introduced to images of pages

• Skewed pages

• Warped pages

• Missing pages

• Inverted pages

• Incorrect metadata

• Extremely low quality TIFFs (~50K)


Page 10: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards


Page 11: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Dream

Training Tesseract on different fonts and applying that training to the documents printed in those particular fonts will improve OCR quality.

Reality

There may be as much difference between one letter and another in a specific font as there is between letters in different fonts.

Page 12: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Training Tesseract: Aletheia

Created by PRImA Research Labs. A team of undergraduates uses Aletheia to identify each glyph on the page images, and ensure that the correct Unicode value is assigned to each. Aletheia outputs an XML file containing all identified glyphs on a page with their corresponding coordinates and Unicode values.
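As a rough illustration of what consuming that XML output could look like, the sketch below reads glyph coordinates and Unicode values from a small sample. The element and attribute names (`Glyph`, `Coords points`, `TextEquiv/Unicode`) and the sample file itself are assumptions made for the example; the slides do not show the actual schema, so check a real Aletheia export before relying on them.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample. The real Aletheia output may differ
# (namespaces, nesting inside regions/lines/words, extra attributes).
SAMPLE_XML = """<PcGts>
  <Page imageFilename="page_0001.tif" imageWidth="1200" imageHeight="2000">
    <Glyph id="g1">
      <Coords points="100,200 130,200 130,250 100,250"/>
      <TextEquiv><Unicode>a</Unicode></TextEquiv>
    </Glyph>
    <Glyph id="g2">
      <Coords points="140,195 160,195 160,250 140,250"/>
      <TextEquiv><Unicode>&#x017F;</Unicode></TextEquiv>
    </Glyph>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip any XML namespace so the sketch works with or without one."""
    return tag.rsplit("}", 1)[-1]


def read_glyphs(xml_text):
    """Return a list of (unicode_value, bounding_box) for every glyph.

    bounding_box is (left, top, right, bottom) in image pixel coordinates,
    computed from the glyph's outline points.
    """
    root = ET.fromstring(xml_text)
    glyphs = []
    for elem in root.iter():
        if local_name(elem.tag) != "Glyph":
            continue
        points, text = None, None
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "Coords":
                points = child.get("points")
            elif name == "Unicode":
                text = child.text
        if not points or text is None:
            continue  # skip glyphs that have not been fully annotated
        xy = [tuple(int(v) for v in p.split(",")) for p in points.split()]
        xs = [x for x, _ in xy]
        ys = [y for _, y in xy]
        glyphs.append((text, (min(xs), min(ys), max(xs), max(ys))))
    return glyphs


glyphs = read_glyphs(SAMPLE_XML)
for char, box in glyphs:
    print(f"U+{ord(char):04X} {box}")
```

The second sample glyph is a long s (U+017F), the kind of early modern character that needs an explicit Unicode assignment during annotation.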

Page 13: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Training Tesseract: Franken+

1. Takes Aletheia's output files as input.

2. Groups all glyphs with the same Unicode values into one window for comparison.

3. Mistakenly coded glyphs are easily identified and re-coded.

4. A user can quickly compare all exemplars of a glyph and choose just the best subset, if desired.

5. Uses all selected glyphs to create a Franken-page image (TIFF) using a selected text as a base.

6. Outputs the same box files and TIFF images that Tesseract's first stage of native training produces.

7. Also allows users to complete Tesseract training using newly created box/TIFF file pairs, and add optional dictionary and other files.

8. Outputs a .traineddata file used by Tesseract when OCRing page images.
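Steps 2 and 6 above can be sketched in a few lines. This is not Franken+'s code: the glyph records and function names are hypothetical, and the only external fact used is the Tesseract box-file line layout (`<char> <left> <bottom> <right> <top> <page>`, with the origin at the bottom-left of the image).

```python
from collections import defaultdict

# Hypothetical glyph records: (unicode_value, (left, top, right, bottom)),
# with the origin at the top-left corner of the page image.
GLYPHS = [
    ("a", (100, 200, 130, 250)),
    ("e", (135, 205, 160, 250)),
    ("a", (400, 610, 431, 660)),
]


def group_by_unicode(glyphs):
    """Collect all exemplars of each character so they can be compared."""
    groups = defaultdict(list)
    for char, box in glyphs:
        groups[char].append(box)
    return dict(groups)


def to_box_line(char, box, image_height, page=0):
    """Format one glyph as a Tesseract box-file line.

    Box files measure y from the bottom of the image, so the top-left-origin
    coordinates used above have to be flipped vertically.
    """
    left, top, right, bottom = box
    return f"{char} {left} {image_height - bottom} {right} {image_height - top} {page}"


groups = group_by_unicode(GLYPHS)
for char, boxes in sorted(groups.items()):
    print(char, len(boxes), "exemplar(s)")

for char, box in GLYPHS:
    print(to_box_line(char, box, image_height=2000))
```

Grouping by Unicode value is what makes mis-coded glyphs stand out: an "e" filed under "a" is visibly different from its neighbours in the same group.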


Page 14: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Clemens Neudecker, Koninklijke Bibliotheek

@cneudecker

Page 15: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

The case of IMPACT

• IMPACT = IMProving ACcess to Text

• EU FP7, 2008 – 2012

• €16.7 M budget

• 22 partners (libraries, universities, companies)

• Goal: Significantly improve OCR for historical documents

Page 16: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Issue 1

• Expectation: The "IMPACT OCR"

• Reality: A collection of very diverse tools, algorithms, etc. Some prototypes, some commercial tools, different programming languages, different levels of maturity, etc.

• No integrated product possible!

Page 17: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Issue 1

• Solution: Interoperability rather than integration

• Change: Individual applications as pluggable modules in a web-based framework

• Result: Flexible framework with additional benefits for testing, transparency, provenance
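The slides do not describe how the IMPACT framework was implemented, so the following is only a minimal sketch of the general idea of "pluggable modules": every tool is wrapped behind the same small interface, and a workflow is just an ordered list of module names. All names here (`REGISTRY`, `register`, `run_workflow`, the two toy modules) are hypothetical.

```python
from typing import Callable, Dict, List

# A module is any callable that takes a document record and returns a new one.
Module = Callable[[dict], dict]

# Hypothetical registry mapping a module name to its implementation.
REGISTRY: Dict[str, Module] = {}


def register(name: str):
    """Decorator that plugs a module into the registry under a given name."""
    def decorator(func: Module) -> Module:
        REGISTRY[name] = func
        return func
    return decorator


@register("binarise")
def binarise(doc: dict) -> dict:
    # Stand-in for an image-enhancement tool.
    return {**doc, "image": doc["image"] + "+binarised"}


@register("ocr")
def ocr(doc: dict) -> dict:
    # Stand-in for an OCR engine; a real module would call an external tool.
    return {**doc, "text": "placeholder text"}


def run_workflow(doc: dict, steps: List[str]) -> dict:
    """Run the named modules in order and record which ones were applied."""
    doc = {**doc, "provenance": list(doc.get("provenance", []))}
    for step in steps:
        doc = REGISTRY[step](doc)
        doc["provenance"].append(step)
    return doc


result = run_workflow({"image": "page.tif"}, ["binarise", "ocr"])
print(result["provenance"])
```

Because each module only has to honour the shared interface, tools of different maturity or written in different languages can be swapped or tested in isolation, and the recorded step list gives a simple form of provenance.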

Page 18: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Issue 2

• Diversity: Librarians, Computer Scientists, Computational Linguists, Humanists

• Are we really talking the same language?

• Different focus points in the project: applicable solutions vs. academic publications

Page 19: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Issue 2

• Solution: Create bonding activities, foster atmosphere for knowledge exchange

• Change: Buddy programme, social games, quizzes about partners

• Result: Understanding your partners' backgrounds and their ways of thinking enriches the experience for everyone

Page 20: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Large Digitisation Projects: Two Key Perspectives

Apostolos Antonacopoulos

PRImA Research Lab

Page 21: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Background

Since 2002 the PRImA Lab has been involved in large digitisation projects, creating software tools for all stages of the workflow

• From Image Enhancement to Layout Analysis to OCR

• Use-scenario based evaluation of extracted text quality

• Crowd/Scholar-sourcing

Two general points are routinely underestimated:

• (Really) Understanding stakeholders and their roles

• (Real) Understanding of problems, their extent and the effectiveness/requirements of potential solutions

Page 22: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Stakeholders and their roles

Seems obvious and is often mentioned, but the significance of understanding this point and its effects is vastly underestimated

Content holders

• Keen for their content to be widely available and used

• Do not know their content well, nor its potential uses

Computer scientists

• Have technical expertise to solve many of the problems

• Do not know the material and its uses well enough to prioritise problems

DH researchers – the catalysts

• Very knowledgeable of material and potential use

• Have technical skills complementary to those of computer scientists

Page 23: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Problem understanding

At the start of each project everyone is eager to deliver “big” results but it is important to identify and understand a few key problems and solve them well

“Improve OCR results” is an ill-defined and short-sighted goal

• Measured in terms of word-accuracy, OCR results are of little use

• Layout is very important

• Even if all the words are recognised correctly, the reading order is unlikely to be correct, limiting potentially interesting uses.

• Page numbers, captions, running headers etc. should not be mixed with body text

• Graphical elements / illustrations are important too

Think: Useful data (investment) vs. just more of any data (instant gratification)
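The point about word accuracy and reading order can be made concrete with a toy example. The two measures below are simplified stand-ins chosen for illustration, not the evaluation methods used by PRImA or eMOP: one counts matching words regardless of position, the other only counts words that appear in the same sequence as the reference (a longest common subsequence). The sample text is invented.

```python
from collections import Counter


def bag_of_words_accuracy(reference, ocr):
    """Share of reference words found in the OCR output, ignoring order."""
    ref, hyp = Counter(reference.split()), Counter(ocr.split())
    matched = sum((ref & hyp).values())
    return matched / sum(ref.values())


def in_order_accuracy(reference, ocr):
    """Share of reference words recovered in the right sequence.

    Uses the length of the longest common subsequence of the two word lists.
    """
    a, b = reference.split(), ocr.split()
    prev = [0] * (len(b) + 1)
    for word_a in a:
        cur = [0]
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / len(a)


# Two columns of three words each; the correct reading order is column-wise.
reference = "alpha beta gamma delta epsilon zeta"
# Every word recognised correctly, but the columns were read row by row.
ocr_output = "alpha delta beta epsilon gamma zeta"

print(bag_of_words_accuracy(reference, ocr_output))
print(in_order_accuracy(reference, ocr_output))
```

Every word in the OCR output is correct, so the order-blind measure reports a perfect score, while the order-aware one drops because the column layout was misread.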

Page 24: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Elizabeth Grumbach, Co-Project Manager, IDHMC

@EMGrumbach


Page 25: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

“If an electronic scholarly project can’t fail and doesn’t produce new ignorance, then it isn’t worth a damn.”

- John Unsworth

“Documenting the Reinvention of Text: The Importance of Failure”


Page 26: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Initial Goals

Challenges or Failures

Analysis + New Directions

Adaptability


Page 27: Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Navigating the Storm: eMOP, Big DH Projects, and Agile Steering Standards

Challenges or Failures

Analysis + New Directions

Adaptability

Challenges + Failures should be constantly or consistently communicated.

Analysis + New Directions should lead to research and communication with similar projects.

Adaptability should allow for new possibilities, new questions.
