Interactive Handwritten Text Recognition and Indexing of Historical Documents: the tranScriptorium Project

Alejandro H. Toselli

Pattern Recognition and Human Language Technology Research Center

Universitat Politècnica de València

Spain

June 2015

Index

1 Handwritten Text Recognition (HTR) and Indexing
2 The tranScriptorium Project
3 Selected Handwriting Datasets
4 Interactive HTR: Transcription Demonstration
5 HTR and Interactive-Predictive HTR Results
6 Handwritten Text Images Indexing: Search Demonstration
7 Handwritten Text Images Indexing and Search Results
8 Conclusion

A.H. Toselli – PRHLT/UPV Page 1

Handwritten Text Recognition (HTR) and Indexing

Huge amounts of handwritten historical documents are being published by on-line digital libraries worldwide

However, for these raw digital images to be really useful, they need to be annotated with informative content

This presentation introduces efficient solutions for the indexing, search and full transcription of historical handwritten document images

The tranScriptorium Project
http://www.transcriptorium.eu

• STREP of the FP7 in the ICT for Learning and Access to Cultural Resources challenge (1 January 2013 to 31 December 2015)

• tranScriptorium aims to develop innovative, efficient and cost-effective solutions for the indexing, search and full transcription of historical handwritten document images, using modern, holistic Handwritten Text Recognition technology

1. Enhancing HTR technology for efficient transcription
2. Bringing the HTR technology to users
3. Integrating the HTR results in public web portals

Selected Handwriting Datasets

• BENTHAM: XVIII/XIX century collection of over 4 000 volumes of drafts and notes, written by several hands in English

• PLANTAS: XVII century botanical specimen manuscript collection of seven volumes written by a single hand in Old Spanish – kindly provided by the BNE

• HATTEM: XV century Medieval Manuscript composed of 573 sheets written by a single hand in Dutch

• ESPOSALLES: XVII century Marriage License records written by several hands in old Catalan and other languages

• AUSTEN: XVIII century Juvenilia manuscripts by Jane Austen (single hand in English) – kindly provided by the BL

• REICHSGERICHT: XVIII century court decisions from the High Court of Germany, written by several hands in German

“BENTHAM” Dataset

XVIII century collection of over 4 000 volumes of drafts and notes, written by several writers in English

Experiments on a first batch of 433 pre-selected page images

Number of:             Total
Pages                    433
Lines                 11 473
Running words        106 905
Lexicon size           9 717
Running characters   550 674
Character set size        86

“PLANTAS” Dataset

XVII century Botanical Specimen Manuscript Collection of seven volumes written by a single writer in Old Spanish

Experiments on the first volume

Number of:             Total
Pages                    871
Lines                 19 544
Running words        197 617
Lexicon size          21 148
Character set size        77

“AUSTEN” Dataset

Jane Austen’s Juvenilia: XVIII century single hand manuscript

Experiments on Volume The Third

Number of:             Total
Pages                    128
Lines                  2 693
Running words         25 291
Dataset lexicon        3 567
Running characters   118 881
Character set size        81

“Reichsgericht” Dataset

XVIII century court decisions from the High Court of Germany, written by several hands

Number of:             Total
Pages                    114
Lines                  4 106
Running words         31 545
Dataset lexicon        8 108
Running characters   239 762
Character set size        92

Interactive HTR: CATTI operation example

x         [text line image]

STEP-0    p
          s ≡ w   antiguas cuidadelas que en el Castillo sus llamadas
          p′      antiguas cuidadelas que en el Castillo sus llamadas

STEP-1    κ       antiguos cuidadelas que en el Castillo sus llamadas
          p       antiguos ciudadanos que en el Castillo sus llamadas
          s       antiguos ciudadanos que en el Castillo sus llamadas
          p′      antiguos ciudadanos que en el Castillo sus llamadas

STEP-2    κ       antiguos ciudadanos que en Castilla sus llamadas
          p       antiguos ciudadanos que en Castilla se llamaban
          s       antiguos ciudadanos que en Castilla se llamaban
          p′      antiguos ciudadanos que en Castilla se llamaban

FINAL     κ       antiguos ciudadanos que en Castilla se llamaban #
          p ≡ T   antiguos ciudadanos que en Castilla se llamaban

Post-editing WER: 6/7 (86%)
Interactive WSR: 2/7 (29%, assuming a whole-word correction in step-1)
Estimated effort reduction: 1 − 29/86 (66%)
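The counts in this example can be reproduced with a small sketch. It assumes that post-editing effort is the word-level edit distance between the first system hypothesis and the final transcript, and that each interactive step costs one whole-word correction at the first wrong word. The function names and the replayed list of hypotheses are illustrative only; the hypotheses themselves are copied from the example.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def count_corrections(hypotheses, ref):
    """Replay the session: one correction per hypothesis that still has an error."""
    corrections = 0
    prefix = []
    for hyp in hypotheses:
        # Each new hypothesis must keep the prefix validated so far.
        assert hyp[:len(prefix)] == prefix
        if hyp == ref:
            return corrections
        # Position of the first word that differs from the reference.
        first_error = next(
            (k for k, (h, r) in enumerate(zip(hyp, ref)) if h != r),
            min(len(hyp), len(ref)),
        )
        corrections += 1
        prefix = ref[:first_error + 1]  # validated words plus the corrected one
    raise ValueError("session ended before the reference was reached")


REF = "antiguos ciudadanos que en Castilla se llamaban".split()
HYPOTHESES = [
    "antiguas cuidadelas que en el Castillo sus llamadas".split(),  # STEP-0
    "antiguos ciudadanos que en el Castillo sus llamadas".split(),  # STEP-1
    "antiguos ciudadanos que en Castilla se llamaban".split(),      # STEP-2
]

post_edit_errors = word_edit_distance(HYPOTHESES[0], REF)
interactive_corrections = count_corrections(HYPOTHESES, REF)
print(f"WER = {post_edit_errors}/{len(REF)}, WSR = {interactive_corrections}/{len(REF)}")
```

Under these assumptions the replay counts six post-editing errors and two interactive corrections over seven reference words, matching the figures in the example.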

Interactive HTR: Transcription Demonstration

• It is just a “demo”: not intended for real operation (other systems do that)

• Everything is real. No tricks to make the demo look better than real

• Web client-server architecture: Web browser front-end, back-end server providing off-line HTR-CATTI

• Off-line HTR-CATTI decoder based on word graphs

• Three tasks:

– BENTHAM: 10K words open vocabulary

– AUSTEN: 78K words external, open vocabulary from Bentham texts

– AUSTEN: 20K words external, open vocabulary from Austen texts

HTR and Interactive-Predictive HTR

HTR current state-of-the-art:

• Segmentation-free approach: no explicit segmentation of text images into words or characters is required

• The basic input unit is a handwritten text line image

• Statistical modeling at different perception levels:

– Optical (character shape), using Hidden Markov Models (HMMs)

– Lexical, by means of finite-state character representation of words

– Syntactical, based on statistical language models, such as N-grams

Interactive-predictive framework: rather than full transcription automation, the system assists the human transcriber

• Combines HTR efficiency with the accuracy of human experts, leading to cost-effective perfect transcripts
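As background, the way these models are usually combined can be written as a pair of decision rules. This is the standard statistical formulation, stated here as an assumption rather than taken from the slides: x is the text line image, w a word sequence, p the prefix already validated by the user and s the suffix proposed by the system.

```latex
% Plain HTR: most probable word sequence for the line image x
\hat{w} = \operatorname*{arg\,max}_{w} P(w \mid x)
        = \operatorname*{arg\,max}_{w} P(x \mid w)\, P(w)

% Interactive-predictive HTR: best suffix s given x and the validated prefix p
\hat{s} = \operatorname*{arg\,max}_{s} P(s \mid x, p)
        = \operatorname*{arg\,max}_{s} P(x \mid p, s)\, P(s \mid p)
```

In this reading, P(x | w) would come from the optical and lexical models and P(w) from the N-gram language model; the interactive rule reuses the same models but restricts the search to continuations of p.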

HTR and Interactive-Predictive HTR Results

• BENTHAM: Training: OMs with 400 pages, LM lexicon 78K words. Test: 33 pages.
  WER = 22.0 %, CER = 9.9 %, WSR = 17.2 %, EFR = 21.5 % wrt post-editing

• PLANTAS: Training: OMs with 224 pages, LM lexicon 21K words. Test: 647 pages.
  WER = 33.4 %, CER = 16.0 %, WSR = 31.5 %, EFR = 5.7 % wrt post-editing

• AUSTEN: Training: OMs with 50 pages, LM lexicon 20K words. Test: 78 pages.
  WER = 32.2 %, CER = 15.9 %, WSR = 21.4 %, EFR = 33.5 % wrt post-editing

• REICHSGERICHT: Training: OMs with 88 pages, LM lexicon 5K words. Test: 26 pages.
  WER = 33.3 %, CER = 12.9 %, WSR = 25.1 %, EFR = 24.6 % wrt post-editing

WER/CER: percentage of misrecognized words/characters.
WSR: percentage of word-level corrections needed to achieve ground-truth transcripts.
EFR: “Estimated Effort Reduction”.
Experiments with open-vocabulary lexica and bi-gram LMs.
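As a consistency check, the EFR figures can be recomputed from the reported WER and WSR. The sketch below assumes EFR = 1 − WSR/WER, the same ratio used in the CATTI operation example; the dictionary layout and function name are illustrative. Because the inputs are already rounded to one decimal, a recomputed value can differ from the reported one by a few tenths of a point.

```python
# (WER %, WSR %, reported EFR %) as listed for each dataset
RESULTS = {
    "BENTHAM": (22.0, 17.2, 21.5),
    "PLANTAS": (33.4, 31.5, 5.7),
    "AUSTEN": (32.2, 21.4, 33.5),
    "REICHSGERICHT": (33.3, 25.1, 24.6),
}


def estimated_effort_reduction(wer, wsr):
    """Relative saving (in %) of interactive corrections over post-editing."""
    return 100.0 * (1.0 - wsr / wer)


for name, (wer, wsr, reported) in RESULTS.items():
    recomputed = estimated_effort_reduction(wer, wsr)
    print(f"{name:14s} reported {reported:5.1f} %  recomputed {recomputed:5.1f} %")
```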

Handwritten Text Images Indexing and Search

• There are massive text image collections out there, but their textual content remains practically inaccessible

• If perfect or sufficiently accurate text image transcripts were available, image textual content could be straightforwardly indexed for plain-text access

• But fully automatic transcription results lack the level of accuracy needed for useful text indexing and search purposes

• And manual or even interactive-predictive assisted transcription is prohibitively expensive for massive image collections

• Good news: indexing and search can be directly implemented on the images themselves, without explicitly resorting to any image transcripts, as we will see now

Handwritten Text Images Indexing and Search: Demonstration

• It is just a “demo”: not (yet) intended for real operation. But everything is real – no tricks to make the demo look better than real

• Line-level indexing according to the precision-recall trade-off model: rather than exact searching, search is carried out with a confidence threshold, specified by the user as part of the query in order to meet the required precision-recall trade-off

• Word confidence scores are based on pixel-level probabilities and computed for line-shaped regions. Spotted word positions are marked only approximately

• Two tasks:

– AUSTEN: Trained on Austen (50p), 20K words open vocabulary. Demo on the whole “Juvenilia, Volume The Third” (128 pages)

– PLANTAS: Trained on Plantas (224p), 21K words open vocabulary. Demo on Volume I (about 1 000 pages)
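The threshold-based search described above can be pictured with a minimal sketch. It assumes an index that stores, for every word, the lines where it was spotted together with a confidence score; the class name, line identifiers and scores are hypothetical and do not describe the demonstrator's actual data structures.

```python
from collections import defaultdict


class LineIndex:
    """Toy line-level index: word -> list of (line_id, confidence)."""

    def __init__(self):
        self._postings = defaultdict(list)

    def add(self, word, line_id, confidence):
        self._postings[word].append((line_id, confidence))

    def search(self, word, threshold):
        """Return lines whose confidence for `word` reaches the threshold, best first."""
        hits = [(line_id, conf)
                for line_id, conf in self._postings.get(word, [])
                if conf >= threshold]
        return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Made-up entries for illustration only.
index = LineIndex()
index.add("matter", "p12-l03", 0.93)
index.add("matter", "p12-l17", 0.41)
index.add("matter", "p40-l08", 0.22)

print(index.search("matter", threshold=0.8))  # strict query: favours precision
print(index.search("matter", threshold=0.2))  # permissive query: favours recall
```

Raising the threshold returns fewer, more reliable lines; lowering it returns more lines at the cost of more false alarms, which is the trade-off the user controls through the query.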

Indexing and Search for Handwritten Text Images: Pixel-level Posteriorgram

[Figure: text image X and its pixel-level posteriorgram P]

Pixel-level posterior probabilities P for a text image X and word v = “matter”.

An accurate, contextual (n-gram based) word classifier was used to compute P. This helped to achieve very low posteriors in a region of X around (i=100, j=200), where a very similar word, “matters”, is written.
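To connect the posteriorgram with line-level search, the sketch below shows one plausible way to turn pixel-level posteriors into a single confidence per line region. Taking the maximum over the region is an assumption made for illustration; the slides only state that scores are computed for line-shaped regions. The grid values and region boundaries are made up.

```python
# Toy posteriorgram for one word: P[i][j] is the posterior that the word
# is written at pixel (i, j). Values are invented for illustration.
P = [
    [0.01, 0.02, 0.01, 0.00],
    [0.02, 0.85, 0.90, 0.03],
    [0.01, 0.05, 0.04, 0.02],
    [0.00, 0.10, 0.12, 0.01],
]

# Hypothetical line-shaped regions given as (top row, bottom row exclusive).
LINE_REGIONS = {"line-1": (0, 2), "line-2": (2, 4)}


def line_confidence(posteriorgram, top, bottom):
    """Assumed aggregation: highest pixel posterior inside the region."""
    return max(max(row) for row in posteriorgram[top:bottom])


scores = {name: line_confidence(P, top, bottom)
          for name, (top, bottom) in LINE_REGIONS.items()}
print(scores)
```

A region that contains a confident spot gets a high score even if most of its pixels have low posteriors, while a region with only weak evidence stays low.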

Results on tranScriptorium Data Sets

Average Precision (AP), Mean Average Precision (MAP) and Recall-Precision curves

[Figure: recall-precision curves, precision π versus recall ρ, both on a 0 to 1 scale, for Bentham, Plantas, Austen-B and Austen]

Dataset     AP     MAP
Bentham     0.88   0.94
Plantas     0.89   0.92
Austen-B    0.76   0.65
Austen      0.91   0.77

Datasets training and test details

• BENTHAM: Multi-hand. Training: 400 pages from Bentham, 87 char. HMMs, 2-gram LM trained on Bentham texts; Lexicon 9 341 tokens.
  Test: 33 pages; query set: 6 962 keywords

• PLANTAS: Single hand. Training: 224 pages from Plantas, 77 char. HMMs, 2-gram LM trained with the training set + book glossary transcripts. Lexicon 11 561 tokens.
  Test: 32 pages; query set: 9 945 keywords

• AUSTEN-B: Single hand. No training; using Bentham char. HMMs, lexicon and LM.
  Test: 78 pages; query set: 9 000 keywords

• AUSTEN: Single hand. Training: 50 Austen pages, 81 char. HMMs, 2-gram LM trained on Austen texts; Lexicon 20K tokens.
  Test: 78 pages; query set: 9 000 keywords
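For readers unfamiliar with the two measures, here is a minimal sketch of how AP and MAP are commonly computed in keyword spotting. It assumes AP is taken over all (keyword, line) results pooled into one ranking, MAP is the mean of per-keyword average precisions, and every relevant line appears somewhere in the ranking. The toy scores and relevance labels are invented and unrelated to the reported results.

```python
def average_precision(ranked_relevance):
    """Mean of the precision values measured at each relevant item in the ranking."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def ranked(results):
    """Sort (score, relevant) pairs by decreasing score and keep the labels."""
    return [rel for _, rel in sorted(results, key=lambda r: r[0], reverse=True)]


def global_ap(per_keyword):
    """AP over the pooled results of all keywords."""
    pooled = [r for results in per_keyword.values() for r in results]
    return average_precision(ranked(pooled))


def mean_ap(per_keyword):
    """Mean of per-keyword APs, skipping keywords without relevant lines."""
    aps = [average_precision(ranked(results))
           for results in per_keyword.values()
           if any(rel for _, rel in results)]
    return sum(aps) / len(aps)


# Invented (confidence, is_relevant) pairs for two keywords.
TOY = {
    "matter": [(0.90, True), (0.80, False), (0.70, True)],
    "court": [(0.95, False), (0.60, True)],
}
print(f"AP = {global_ap(TOY):.3f}, MAP = {mean_ap(TOY):.3f}")
```

The two measures can disagree, as they do across the datasets above: pooling rewards confidence scores that are comparable across keywords, while the per-keyword mean gives every keyword the same weight.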

Conclusions

• Automatic or assisted handwritten text transcription and fully automatic indexing are now becoming perfectly feasible

• Models trained for a given collection can provide quite useful performance on images from other similar collections, without the need for (re-)training

• Several demonstrators have been implemented and made publicly available for first-hand experience in real use; see: http://transcriptorium.eu/demonstrations

Thanks for Your Attention !