Workshop
OCR/NER Digital Humanities Summer school
2015
Agenda
• Introduction OCR
• Introduction NER
• Use case Succeed project @ KU Leuven
• Getting set up
• Hands-on session
Who are we?
• INL (Institute for Dutch Lexicology)
– Katrien Depuydt
– Jesse de Does
• LIBIS
– Roxanne Wyns
– Sam Alloing
What is OCR?
• Definition
– Converting an image of text into electronic text
• But entails a lot more!
– It is a workflow
• Recognition of printed text, not handwritten text
Why OCR?
• Improve discoverability
– Search within the image
– Search across images
• Text processing
– Named entity recognition
• See later
– Further analysis
• TEI…
• …
Why bad performance on some text?
• Quality of the printed text
– Can be a problem on historical material
• Different spacing between words, characters,…
• Low quality of scans
• Font and language not supported
• Complex layout is not preserved
Workflow OCR
[Diagram of the workflow steps: Digitisation, Pre-processing, Executing OCR, Post-processing, Evaluation set, Attestation, Improving]
Digitisation
• Images in greyscale or black and white
– The OCR software will convert to B&W (= binarisation)
• 300 dpi is recommended
– For smaller fonts 400 to 600 dpi
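Binarisation can be illustrated with a small sketch. The slides do not say which method an OCR engine uses; the example below assumes Otsu's global threshold, a common choice, and works on a plain list of grey values (0 = black, 255 = white). The function names `otsu_threshold` and `binarise` are made up for this illustration.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark from light pixels."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]          # pixels at or below this level
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg     # pixels above this level
        if weight_fg == 0:
            break
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        # Between-class variance: large when the two groups are well separated.
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image):
    """image: rows of grey values. Returns rows of 0 (black ink) / 1 (white paper)."""
    flat = [p for row in image for p in row]
    threshold = otsu_threshold(flat)
    return [[0 if p <= threshold else 1 for p in row] for row in image]


scan = [[10, 12, 200],
        [11, 210, 205]]
print(binarise(scan))
```

A global threshold like this tends to struggle with uneven lighting or stained paper, which is one reason pre-processing tools offer other binarisation methods.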
Pre-processing
• All kinds of improvements
• Depends on the capabilities of your OCR engine
– Some engines contain some of the pre-processing features
• Layout correction
– De-skew
• Document Deskewer
• Scan Tailor
• Page Curl Corrector
• Removal tools
– Noise removal
– Border removal
• Scan Tailor
• NCSR Border Detection and Removal
• Image correction/enhancements
– Binarization
• Creation of Black and white images
– Image tools
• ImageMagick
• Photoshop
• GIMP
• …
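As a rough illustration of noise removal, the sketch below deletes isolated black pixels ("salt and pepper" specks) from a binarised page. It is a deliberately simple stand-in, not the method used by Scan Tailor or any other tool named above; the function name `remove_isolated_pixels` and the 0 = ink / 1 = paper convention are assumptions of this example.

```python
def remove_isolated_pixels(image):
    """image: rows of 0 (black ink) / 1 (white paper). Returns a cleaned copy."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if image[y][x] != 0:
                continue
            # Look at the (up to) 8 surrounding pixels for any other ink.
            has_ink_neighbour = any(
                image[ny][nx] == 0
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_ink_neighbour:
                cleaned[y][x] = 1  # a lone speck: treat it as paper
    return cleaned


page = [[1, 1, 1, 1, 1],
        [1, 0, 1, 1, 1],   # isolated speck at column 1
        [1, 1, 1, 0, 0],   # two connected ink pixels, kept
        [1, 1, 1, 1, 1]]
for row in remove_isolated_pixels(page):
    print(row)
```

Note that a filter this aggressive would also remove single-pixel punctuation on a low-resolution scan, so it only makes sense at the recommended resolutions.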
De-skewing of image
Page curl
Pre-processing: skew and warp correction
Pre-processing: Examples of noise
Pre-processing: “Star Wars” journal…
Attestation
• Create a ground truth or gold standard = 100% correct transcription
• Compare the ‘truth’ to the OCR output
• Evaluate the OCR output; other uses include:
– Testing different OCR engines
– Testing outsourced OCR
• Tool
– Aletheia
• Creates PAGE XML ground truth
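Comparing the ‘truth’ to the OCR output can be sketched with Python's standard `difflib` module. This is only an illustration on plain strings: it does not read PAGE XML, and the helper name `list_differences` is invented for the example.

```python
from difflib import SequenceMatcher


def list_differences(ground_truth, ocr_output):
    """Return (operation, truth_fragment, ocr_fragment) for every mismatch."""
    matcher = SequenceMatcher(None, ground_truth, ocr_output, autojunk=False)
    differences = []
    for tag, g1, g2, o1, o2 in matcher.get_opcodes():
        if tag != "equal":  # tag is 'replace', 'delete' or 'insert'
            differences.append((tag, ground_truth[g1:g2], ocr_output[o1:o2]))
    return differences


for difference in list_differences("house", "hovse"):
    print(difference)
```

A listing like this shows which characters the engine confuses, which helps when deciding what to train or which dictionary to add.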
OCR evaluation
• OCRevalUAtion
• Page Evaluator for Tesseract
• Determine the error rate
– CER = character error rate
– WER = word error rate
• Compare OCR output against ground truth
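The error rates can be computed as follows. This sketch assumes the common definition: the edit (Levenshtein) distance between OCR output and ground truth, divided by the length of the ground truth, counted in characters for CER and in whitespace-separated words for WER. Evaluation tools such as those listed above may normalise or tokenise differently, so their figures need not match this sketch exactly.

```python
def edit_distance(reference, hypothesis):
    """Minimum number of insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def cer(ground_truth, ocr_output):
    """Character error rate relative to the ground truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


def wer(ground_truth, ocr_output):
    """Word error rate, splitting on whitespace."""
    truth_words = ground_truth.split()
    if not truth_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(truth_words, ocr_output.split()) / len(truth_words)


truth = "the quick brown fox"
ocr = "the quiek brown fox"
print(f"CER = {cer(truth, ocr):.3f}, WER = {wer(truth, ocr):.3f}")
```

A single wrong character counts as one whole word error, so WER is usually noticeably higher than CER for the same page.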
Improving
• Different techniques used to improve the OCR result
• Pattern training
– Teach the OCR engine new characters
– Tools
• Part of some OCR engines
• Franken+ for Tesseract
• Cutout and page generator for Tesseract
• Dictionaries
– Built-in
– Custom
• Tools to create dictionaries: CoBaLT
• Changing the settings of the OCR engine
• Add training data
Executing OCR
OCR applications
• Desktop applications
– GUI
– Processing page by page
– Some batch processing capabilities
– Easy to use
• OCR engines
– No GUI
– Processing large amounts of images
– More OCR features and more fine-tuning
– More knowledge required to use
2 types of OCR
• Omnifont engine
– No knowledge of example font needed
• Adaptive engine
– Creates a model during training
• => More training required
• Examples
– ABBYY FineReader
– Tesseract
– OmniPage
– OCRopus/ocropy
– BIT-Alpha
– IBM Adaptive OCR engine
Actions of the OCR engine
• See pre-processing
– Binarisation
• Layout analysis
– Identify the regions with text
– Tools:
• Layout Evaluation Tool
• Segmentation
– Line, character and word
• Text recognition
– Character categorisation
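Line segmentation can be sketched with a horizontal projection profile: a row of the binarised image belongs to a text line if it contains any ink. This is a textbook simplification and not a description of how any particular engine segments a page; it assumes a de-skewed, single-column image with 0 = ink and 1 = paper, and the name `find_text_lines` is made up for the example.

```python
def find_text_lines(image):
    """Return (top_row, bottom_row) pairs, inclusive, for each run of inked rows."""
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = 0 in row
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # blank row closes the line
            start = None
    if start is not None:                  # text runs to the bottom edge
        lines.append((start, len(image) - 1))
    return lines


page = [[1, 1, 1],
        [0, 1, 0],
        [0, 0, 1],
        [1, 1, 1],
        [1, 0, 1]]
print(find_text_lines(page))
```

The same idea applied to columns within one line gives a first approximation of word and character boundaries, which is why skew and uneven spacing hurt recognition so much.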
Output formats
• Vendor specific XML
– Richest format
• Text
• ALTO
– Library of Congress
– XML format describing the page
– Used in some viewer software, to overlay text on image
• TEI
– Text Encoding Initiative
– Describe text in high detail
– XML format
– Don’t expect too much from OCR engine
• …
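To show how a viewer can overlay text on an image, the sketch below reads word positions from a tiny hand-written ALTO fragment using Python's standard library. The sample document is invented for illustration and assumes the ALTO v3 namespace; real files may use another ALTO version (and namespace) and carry many more elements and attributes.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO version 3 namespace. Adjust for other versions.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v3#"}

# Invented sample, reduced to the elements needed for this example.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page ID="P1" WIDTH="2480" HEIGHT="3508">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine ID="TL1" HPOS="200" VPOS="300" WIDTH="600" HEIGHT="50">
            <String CONTENT="Digital" HPOS="200" VPOS="300" WIDTH="260" HEIGHT="50"/>
            <SP WIDTH="20" HPOS="460" VPOS="300"/>
            <String CONTENT="Humanities" HPOS="480" VPOS="300" WIDTH="320" HEIGHT="50"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def words_with_boxes(alto_xml):
    """Return (word, (left, top, width, height)) for every String element."""
    root = ET.fromstring(alto_xml)
    words = []
    for string in root.iterfind(".//alto:String", ALTO_NS):
        box = tuple(int(float(string.get(name)))
                    for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        words.append((string.get("CONTENT"), box))
    return words


for word, box in words_with_boxes(SAMPLE_ALTO):
    print(word, box)
```

With the word boxes in hand, a viewer can highlight search hits on the scanned page, which is the "search within the image" use mentioned earlier.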
Post-correction
• Manual or semi-automated
• Tools
– Korrektor
– Virtual Transcription Laboratory
– Page corrector for Tesseract
– CONCERT (IBM)
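A semi-automated correction step can be sketched as a dictionary lookup that proposes candidates and leaves the decision to a person. The example uses `difflib.get_close_matches` from the Python standard library; it is not how any of the tools above work internally, and the function name, the tiny lexicon and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches


def suggest_corrections(tokens, lexicon, cutoff=0.8):
    """Map each token missing from the lexicon to a list of candidate words."""
    known = set(lexicon)
    suggestions = {}
    for token in tokens:
        word = token.lower()
        if word in known:
            continue  # already a dictionary word, nothing to review
        # Up to three lexicon entries that are similar enough to the token.
        suggestions[token] = get_close_matches(word, lexicon, n=3, cutoff=cutoff)
    return suggestions


lexicon = ["bibliotheek", "boek", "krant"]          # hypothetical word list
ocr_tokens = ["bibliothcek", "boek", "xyzzy"]
for token, candidates in suggest_corrections(ocr_tokens, lexicon).items():
    print(token, "->", candidates or "no suggestion, review manually")
```

For historical material the lexicon needs to contain historical spellings, otherwise correct readings get flagged as errors.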
Conclusion
• Optimise for your use case
– Not all use cases need perfection
• Start with easy gains
– Dictionary
– Good images
• Evaluate!