Workshop OCR/NER Digital Humanities Summer school 2015

Workshop OCR/NER - KU Leuven · 2020-04-01 · Workshop OCR/NER Digital Humanities Summer school 2015 . Agenda • Introduction OCR • Introduction NER • Use case Succeed project



Workshop

OCR/NER Digital Humanities Summer school

2015


Agenda

• Introduction OCR

• Introduction NER

• Use case Succeed project @ KU Leuven

• Getting setup

• Hands-on session


Who are we?

• INL (Institute for Dutch Lexicology)

– Katrien Depuydt

– Jesse de Does

• LIBIS

– Roxanne Wyns

– Sam Alloing


What is OCR?

• Definition

– Converting image of text to electronic text

• But entails a lot more!

– It is a workflow

• Recognition of printed text, not handwritten text


Why OCR?

• Improve discoverability

– Search within the image

– Search across images

• Text processing

– Named entity recognition

• See later

– Further analysis

• TEI…

• …


Why bad performance on some text?

• Quality of the printed text

– Can be a problem on historical material

• Different spacing between words, characters,…

• Low quality of scans

• Font and language not supported

• Complex layout is not kept


OCR workflow (diagram): Digitisation, Pre-processing, Executing OCR, Post-processing, Evaluation set, Attestation, Improving


Digitisation

• Images in greyscale or black and white

– The OCR software will convert to B&W (= binarisation)

• 300 dpi is recommended

– For smaller fonts 400 to 600 dpi
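
The binarisation step mentioned above can be sketched in a few lines. This is a minimal illustration using Otsu's method, a common global thresholding technique; the slides do not say which method a given OCR engine uses, so the choice of Otsu and the function names `otsu_threshold` and `binarise` are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that best separates dark and light pixels.

    Otsu's method: pick the threshold that maximises the between-class variance.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarise(image):
    """Convert a greyscale image (rows of 0-255 values) to pure black (0) and white (255)."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[0 if value <= threshold else 255 for value in row] for row in image]


# Tiny hypothetical "scan": dark ink values next to light paper values.
scan = [[20, 30, 200],
        [25, 210, 220]]
black_and_white = binarise(scan)
```

A single global threshold works best on evenly lit scans; historical material with stains or uneven lighting often needs a local (adaptive) method instead.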




Pre-processing

• All kinds of improvements

• Depends on the capabilities of your OCR engine

– Some engines contain some of the pre-processing features

• Layout correction

– De-skew

• Document Deskewer

• Scan Tailor

• Page Curl Corrector

• Removal tools

– Noise removal

– Border removal

• Scan Tailor

• NCSR Border Detection and Removal

• Image correction/enhancements

– Binarisation

• Creation of black-and-white images

– Image tools

• Imagemagick

• Photoshop

• Gimp

• …

De-skewing of image

Page curl


Pre-processing: skew and warp correction


Pre-processing: Examples of noise


Pre-processing: “Star Wars” journal…




Attestation

• Create a ground truth or gold standard = 100% correct transcription

• Compare the ‘truth’ to the OCR output

• Evaluate the OCR output; other uses include:

– To test different OCR engines

– To test outsourced OCR

• Tool

– Aletheia

• Creates PAGE XML ground truth
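
To use a PAGE XML ground truth in an evaluation, the transcribed text first has to be read out of the file. The sketch below assumes the usual PAGE layout, where each `TextRegion` carries its transcription in a `TextEquiv/Unicode` child. The sample document is a simplified, hypothetical fragment (a real file also contains coordinates and metadata), and `region_texts` is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical PAGE XML fragment; real files contain Coords, Metadata, etc.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="page_0001.tif" imageWidth="2000" imageHeight="3000">
    <TextRegion id="r1">
      <TextLine id="r1l1">
        <TextEquiv><Unicode>Workshop OCR/NER</Unicode></TextEquiv>
      </TextLine>
      <TextEquiv><Unicode>Workshop OCR/NER</Unicode></TextEquiv>
    </TextRegion>
    <TextRegion id="r2">
      <TextEquiv><Unicode>Digital Humanities</Unicode></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the XML namespace so the code works across PAGE schema versions."""
    return tag.rsplit("}", 1)[-1]


def region_texts(xml_text):
    """Return (region id, transcription) pairs for every TextRegion."""
    root = ET.fromstring(xml_text)
    results = []
    for element in root.iter():
        if local_name(element.tag) != "TextRegion":
            continue
        # Only look at direct children, so line-level TextEquiv is not counted twice.
        for child in element:
            if local_name(child.tag) != "TextEquiv":
                continue
            for grandchild in child:
                if local_name(grandchild.tag) == "Unicode":
                    results.append((element.get("id"), grandchild.text or ""))
    return results


ground_truth = region_texts(SAMPLE_PAGE_XML)
```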




OCR evaluation

• OCRevalUAtion

• Page Evaluator for Tesseract

• Determine the error rate

– CER = character error rate

– WER = word error rate

• Compare OCR output against ground truth
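
The two error rates can be illustrated with a short calculation. The sketch below uses the common definition (edit distance between OCR output and ground truth, divided by the length of the ground truth). Tools such as OCRevalUAtion may normalise text or count errors differently, so this is an assumption-based illustration rather than a reimplementation of any tool; the function names are invented for this example.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions.

    Works on any two sequences, so it serves both characters and word lists.
    """
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_error_rate(ground_truth, ocr_output):
    """CER: character-level edit distance divided by ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ground_truth, ocr_output) / len(ground_truth)


def word_error_rate(ground_truth, ocr_output):
    """WER: word-level edit distance divided by the number of ground-truth words."""
    reference_words = ground_truth.split()
    if not reference_words:
        raise ValueError("ground truth must contain at least one word")
    return edit_distance(reference_words, ocr_output.split()) / len(reference_words)


truth = "the quick brown fox"
ocr = "the qu1ck brown f0x"
cer = character_error_rate(truth, ocr)  # 2 wrong characters out of 19
wer = word_error_rate(truth, ocr)       # 2 wrong words out of 4
```

Note how two wrong characters give a low CER but a WER of 50%: a single bad character spoils the whole word, which matters for search and named entity recognition.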




Improving

• Different techniques used to improve the OCR result

• Pattern training

– Teach the OCR engine new characters

– Tools

• Part of some OCR engines

• Franken+ for Tesseract

• Cutout and page generator for Tesseract

• Dictionaries

– Built-in

– Custom

• Tools to create dictionaries: CoBaLT

• Changing the settings of the OCR engine

• Add training data
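
The effect of a custom dictionary can be illustrated outside the engine. OCR engines consult their dictionaries internally during recognition; the sketch below instead applies the same idea as a simple correction step on the output, using fuzzy matching from Python's standard library. The word list, the cutoff value and the function name `correct_token` are hypothetical choices for this example.

```python
import difflib

# Hypothetical custom lexicon; in practice this would come from a tool or a word list file.
LEXICON = ["leuven", "bibliotheek", "universiteit", "handschrift"]


def correct_token(token, lexicon, cutoff=0.8):
    """Replace a token by its closest lexicon entry, if one is similar enough.

    Tokens already in the lexicon, or without a close match, are returned unchanged.
    """
    lowered = token.lower()
    if lowered in lexicon:
        return token
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else token


ocr_tokens = ["bibliotheck", "un1versiteit", "Leuven", "xyz"]
corrected = [correct_token(token, LEXICON) for token in ocr_tokens]
```

A high cutoff keeps the correction conservative: unknown words such as names are left alone rather than forced onto a dictionary entry.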




OCR applications

• Desktop applications

– GUI

– Processing page by page

– Some batch processing capabilities

– Easy to use

• OCR engines

– No GUI

– Processing large amounts of images

– More OCR features and more fine-tuning

– More knowledge required to use
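
Processing large amounts of images with a GUI-less engine usually comes down to a small script. The sketch below builds one Tesseract command per image, following the basic command-line form `tesseract <image> <output base> -l <language>`. The directory names, the `*.tif` pattern and the Dutch language code `nld` are assumptions for this example, and the script only prints the commands unless `dry_run` is switched off.

```python
from pathlib import Path
import subprocess


def build_command(image_path, output_dir, language="nld"):
    """Build the Tesseract call for one image; output goes to <output_dir>/<image name>.txt."""
    output_base = Path(output_dir) / Path(image_path).stem
    return ["tesseract", str(image_path), str(output_base), "-l", language]


def run_batch(image_dir, output_dir, language="nld", dry_run=True):
    """Create (and optionally execute) one OCR command per TIFF image in a directory."""
    images = sorted(Path(image_dir).glob("*.tif"))
    commands = [build_command(image, output_dir, language) for image in images]
    if not dry_run:
        Path(output_dir).mkdir(parents=True, exist_ok=True)
        for command in commands:
            # Requires Tesseract and the language data to be installed.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical directories; dry run only lists what would be executed.
    for command in run_batch("scans", "ocr_output"):
        print(" ".join(command))
```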


2 types of OCR

• Omnifont engine

• Adaptive engine

– No knowledge of example font needed

– Creates a model during training

• => More training required

• Examples

– ABBYY FineReader

– Tesseract

– OmniPage

– OCRopus/ocropy

– BIT-Alpha

– IBM Adaptive OCR engine


Actions of the OCR engine

• See pre-processing

– Binarisation

• Layout analysis

– Identify the regions with text

– Tools:

• Layout Evaluation Tool

• Segmentation

– Line, character and word

• Text recognition

– Character categorisation
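
The segmentation step can be illustrated with the simplest line-finding technique, a horizontal projection profile: rows that contain ink belong to a text line, and empty rows separate lines. Real engines use more robust methods, so this is only a sketch under that assumption; it expects an already binarised image where 1 means ink, and `segment_lines` is a name invented for this example.

```python
def segment_lines(image):
    """Split a binarised page into text lines.

    image: list of rows, each a list of 0 (background) and 1 (ink).
    Returns (start_row, end_row) pairs, with end_row exclusive.
    """
    lines = []
    start = None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                 # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))  # an empty row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line runs to the bottom of the page
    return lines


# Tiny hypothetical page: two text lines separated by blank rows.
page = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 0, 0],
]
text_lines = segment_lines(page)
```

The same idea applied to columns inside one line gives a rough word and character segmentation. It also shows why skew hurts: on a tilted page the blank rows between lines disappear.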


Output formats

• Vendor-specific XML

– Richest format

• Text

• PDF

• ALTO

– Maintained by the Library of Congress

– XML format describing the page

– Used in some viewer software, to overlay text on image

• TEI

– Text Encoding Initiative

– Describes text in high detail

– XML format

– Don’t expect too much from the OCR engine

• …
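
The text overlay that viewers build from ALTO relies on the word coordinates stored in the file. The sketch below reads each `String` element with its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. The sample is a simplified, hypothetical ALTO fragment (a real file has more required elements and attributes), and `word_boxes` is a helper name invented for this example.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical ALTO fragment; real files also contain a Description section, styles, etc.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="p1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1">
            <String CONTENT="Workshop" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <SP/>
            <String CONTENT="OCR" HPOS="420" VPOS="200" WIDTH="120" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def word_boxes(alto_text):
    """Return (word, x, y, width, height) for every String element, in document order."""
    root = ET.fromstring(alto_text)
    boxes = []
    for element in root.iter():
        # Compare the local name only, so any ALTO namespace version is accepted.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        boxes.append((
            element.get("CONTENT"),
            float(element.get("HPOS")),
            float(element.get("VPOS")),
            float(element.get("WIDTH")),
            float(element.get("HEIGHT")),
        ))
    return boxes


boxes = word_boxes(SAMPLE_ALTO)
plain_text = " ".join(word for word, *_ in boxes)
```

A viewer can draw a highlight rectangle from each box when a search term matches the word, which is how "search within the image" is typically made visible.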




Conclusion

• Optimise for your use case

– Not all use cases need perfection

• Start with easy gains

– Dictionary

– Good images

• Evaluate!


Questions?

• Sam Alloing
