A Cost Efficient Approach to Correct OCR Errors in Large Document Collections

Deepayan Das, Jerin Philip, Minesh Mathew and C.V. Jawahar
Center for Visual Information Technology, IIIT Hyderabad

Digital Library

A digital repository for books, accessible to people around the world

Digital Libraries

Popular digital libraries include:

Project                  #Books
Google Books Project     25 million (as of 2015)
Project Gutenberg        60,000
Million Books Project    1.5 million

Digital Libraries

● Easy access to millions of books and articles.
● Lower maintenance and support costs.
● Supports search and indexing.

Digital Libraries

Figure: scanning centers, OCR, an annotator proofreads the text, access for millions of people.

OCRs are not always 100% accurate

● OCR is sensitive to the quality of document images.
● Degradations can result in words being misclassified.

OCR prediction    Ground Truth
Lord Cauning      Lord Canning
Cawnporo          Cawnpore
Dolhi             Delhi
rnorning          Morning

Information Retrieval on OCR text

OCR errors lead to differences in the ranking of the retrieved documents.

Post-processing for Large Document Collection

Project Gutenberg: G. B. Newby and C. Frank. “Distributed proofreading.” JCDL, 2003.

Google Books: von Ahn et al. “reCAPTCHA: Human-based character recognition.” Science, 2008.

Motivation

● OCR makes consistent errors throughout a document collection.

Word images and their corresponding predictions in a collection:

Juiiet, Juiiet, Juiiet, Juiiet
Qucen, Qucen, Qucen, Qucen
Camiing, Canning, Caniiing, Caiiing

Motivation

● Books and collections have a finite vocabulary that repeats throughout the text.

A small subset of words can cover more than 50% of the total words in a collection.
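The coverage claim can be checked on any tokenized text. The sketch below is only an illustration: the function name `words_needed_for_coverage` and the toy sentence are assumptions, not part of the original work. It picks the most frequent distinct words until they account for a target fraction of all tokens.

```python
from collections import Counter

def words_needed_for_coverage(tokens, target=0.5):
    """Return the most frequent distinct words whose counts reach `target` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    chosen = []
    # Walk the vocabulary from most to least frequent word.
    for word, freq in counts.most_common():
        if covered >= target * total:
            break
        chosen.append(word)
        covered += freq
    return chosen

# Toy example (hypothetical text, not from the datasets in the slides).
tokens = "the queen met the lord and the queen left the hall".split()
top = words_needed_for_coverage(tokens, target=0.5)
print(top, "out of", len(set(tokens)), "distinct words")
```

On a real collection the same function would show how few distinct words need to be grouped and corrected to reach half of all word occurrences.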

Motivation

● Grouping and correcting high-frequency words can lead to a significant gain in word accuracy.

t-SNE Image Embedding

Maaten, Laurens van der and Hinton, Geoffrey. “Visualizing data using t-SNE”. JMLR, 2008

Reverse Annotation

Sankar et al. “Probabilistic Reverse Annotation For Large Scale Image Annotation.” CVPR, 2007.

Fusing Word Clusters

Rasagna et al. “Robust Recognition of Documents by Fusing Results of Word Clusters.” ICDAR, 2009.

Khader and Casey. “Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment.” ICDAR, 2009.

Automatic Error Correction


Cluster representative, propagated to all cluster elements

Character Majority Voting

● Word images are clustered in a feature space.
● A cluster representative is chosen for each cluster.
  ○ Rasagna et al. use character majority voting, where the most frequent character is taken at each time step.
● The voted label is propagated to all the cluster elements.
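The voting and propagation steps above can be sketched as follows. This is a simplified illustration, not the implementation of Rasagna et al.: it assumes the predictions are compared position by position without any alignment, and the name `character_majority_vote` is hypothetical.

```python
from collections import Counter
from itertools import zip_longest

def character_majority_vote(predictions):
    """Vote position by position over the OCR predictions of one cluster."""
    voted = []
    # Pad shorter predictions with "" so every position gets one vote per word.
    for chars in zip_longest(*predictions, fillvalue=""):
        winner, _ = Counter(chars).most_common(1)[0]
        voted.append(winner)
    return "".join(voted)

# Predictions for one cluster of word images; one of them is wrong.
cluster = ["money", "money", "money", "aoney", "money"]
representative = character_majority_vote(cluster)
# Propagate the voted label to every element of the cluster.
corrected = [representative for _ in cluster]
```

Because the vote is purely positional, an inserted or deleted character shifts every later position, so a real system would likely align the predictions first.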

The figure shows a cluster of error words with the label “thousand”. The two incorrect labels, “housan” and “thusiasn”, can be corrected with the method proposed above.


The figure shows the nearest neighbours of the image embedding for the word “money”, with predictions: money, money, money, money, money, money, money, aoney, more. The error word (highlighted in red) can be corrected using character majority voting.

Cluster Impurities

● Clusters cannot be completely homogeneous.
● Character majority voting can lead to error propagation.

Predictions in the example cluster, with the impurity marked in the figure: “money”, “money”, “money”, “money,”, “money”, “money”, “aoney”, “more”.

Can a better clustering algorithm help?

MST on word predictions

● We further partition the clusters using a minimum spanning tree (MST).
● The nodes are the predictions.
● The edit distances between the predictions form the edge weights.
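A minimal sketch of this partitioning step is given below. It builds the MST with Kruskal's algorithm over pairwise edit distances and drops every edge heavier than a threshold, which leaves a forest of sub-clusters. The threshold `max_edge` and the function names are assumptions made for illustration; the slides do not state how edges are cut.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def mst_partition(predictions, max_edge=1):
    """Split a cluster into the trees left after cutting MST edges above max_edge."""
    n = len(predictions)
    # Fully connected graph: one weighted edge per pair of predictions.
    edges = sorted((edit_distance(predictions[i], predictions[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Kruskal: add edges in increasing weight, stopping at the threshold.
    for weight, i, j in edges:
        if weight > max_edge:
            break
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j

    groups = {}
    for idx, word in enumerate(predictions):
        groups.setdefault(find(idx), []).append(word)
    return list(groups.values())

cluster = ["money!", "money", "money,", "money,", "money", "money", "aoney", "more,"]
groups = mst_partition(cluster, max_edge=1)
```

With this threshold the impurity “more,” ends up in its own tree, while “aoney” stays with the “money” variants, where voting can still correct it.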

Minimum spanning tree

Figure panels: fully connected graph, minimum spanning tree, forest of individual trees.

MST on word predictions

Cluster partition using MST. Predictions in the example cluster: “money!”, “money”, “money,”, “money,”, “money”, “money”, “aoney”, “more,”.

What happens when all the predictions in a cluster are wrong?

Manual correction

The figure shows a cluster with erroneous OCR predictions.

Automatic error correction by label propagation will not achieve perfect word accuracy when clusters are not homogeneous.

Human vs Machine

● Humans are accurate but slow.
  ○ High cost.
● Machines are fast but inaccurate.
  ○ Error propagation.
  ○ A human will be needed to rectify the mistakes made by the machine.
  ○ This will lead to high cost.

Human Machine Collaboration

Proposed Method

A human should verify each cluster by:

1. Picking the cluster representative.
2. Choosing the cluster elements to which the label should be propagated.
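The two verification steps can be pictured as a small function that takes the annotator's choices as input. This is only a sketch under assumed names (`propagate_label`, `selected`); the actual annotation interface is not described in the slides.

```python
def propagate_label(predictions, representative, selected):
    """Return a copy of `predictions` with the chosen label written to the selected indices."""
    corrected = list(predictions)
    for idx in selected:
        corrected[idx] = representative
    return corrected

# The annotator picks "money" as the representative and marks elements 1 and 2
# as belonging to it; the other elements are left untouched.
cluster = ["money!", "money", "aoney", "more,"]
fixed = propagate_label(cluster, representative="money", selected=[1, 2])
```

Leaving unselected elements untouched is what prevents a wrong label from spreading to impurities in the cluster.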

Pipeline

Pipeline components shown in the figure: CRNN-OCR [1] (word predictions), HWNet [2] (image features), error detection, clustering (error clusters), and manual correction.

1. Shi et al. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.” PAMI, 2017.
2. Krishnan et al. “HWNet v2: An Efficient Word Image Representation for Handwritten Documents.” IJDAR, 2018.

Implementation Details

Edit Actions

● Full typing (no dictionary involved)
● Type + select from dropdown (static dictionary)
● Type + select from dropdown (growing dictionary)

Datasets

● Fully annotated
  ○ English
    ■ 19 books
    ■ ~2.5k pages
  ○ Hindi
    ■ 32 books
    ■ ~5k pages

Sample word images from the fully annotated dataset.

Evaluation Protocol

● For the fully annotated data:
  ○ We measure the seconds spent by a human to correct a book.
  ○ We refer to this as the cost of correction (C).
  ○ We measure the cost of each method relative to full typing.

Cost of correction

$C = w_1 c_t + w_2 c_d + w_3 c_v$

$c_t$ = typing cost
$c_d$ = selection cost
$c_v$ = verification cost
$w_1$ = number of error words that need typing
$w_2$ = number of error words whose correct alternative is present in the dictionary
$w_3$ = number of words that are correct but wrongly flagged as errors
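The cost formula is straightforward to evaluate. The sketch below uses made-up per-word costs and word counts purely for illustration. It also assumes that the full-typing baseline types every error word, meaning the $w_2$ words move into the typing term; the slides do not spell this out.

```python
def correction_cost(w1, w2, w3, c_t, c_d, c_v):
    """C = w1*c_t + w2*c_d + w3*c_v, in the same time unit as the per-word costs."""
    return w1 * c_t + w2 * c_d + w3 * c_v

def relative_cost(w1, w2, w3, c_t, c_d, c_v):
    """Cost of a dictionary-assisted method relative to the assumed full-typing baseline."""
    cost = correction_cost(w1, w2, w3, c_t, c_d, c_v)
    # Assumption: without a dictionary, all w1 + w2 error words must be typed.
    baseline = correction_cost(w1 + w2, 0, w3, c_t, c_d, c_v)
    return cost / baseline

# Hypothetical values: 5 s to type, 2 s to select, 1 s to verify a word.
cost = correction_cost(100, 300, 50, c_t=5, c_d=2, c_v=1)
ratio = relative_cost(100, 300, 50, c_t=5, c_d=2, c_v=1)
print(cost, round(ratio, 3))
```

A ratio below 1 means the method is cheaper than typing every correction by hand.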

Results

Relative cost

Fig. Relative cost of correction with respect to full typing when no clustering is involved.

Relative cost

Fig. Cost of correction across different clustering techniques for automatic label selection and propagation, shown alongside the cost without clustering.

Qualitative Results

Qualitative results of k-means + MST clustering. Images relevant to the cluster are marked correct, while the non-relevant ones (false positives) are crossed out.

Results: Fully Annotated Data (English)

Comparison of automatic and manual correction

● No false error propagation.

● Reduces cost of correction.

Scope for automatic error correction

Clustering on Large Scale Dataset

● We cluster images from 100 unannotated books.
● Testing is done on 200 annotated pages.
● We use character majority voting (CMV) to assign labels to erroneous predictions.
● We compare the performance of CRNN and Tesseract OCR.

Results

Automatic error correction rectifies errors more accurately as the size of the collection increases.

● The gain in word accuracy is much larger for CRNN-OCR than for Tesseract OCR.

Clustering on Large Scale Dataset

We observe that as the size of the collection increases, CMV becomes better at picking the correct cluster representative, which is subsequently propagated to all the cluster elements.

Conclusion

● We proposed a cost-efficient batch correction scheme for reducing OCR errors.

● We also demonstrated how our approach can be scaled effectively to larger collections.

Future Work

● Active learning techniques to find clusters/subclusters that need post-processing.

● Adapting the recognizer to a collection, not just the post-processing module.

Thank You
