A Cost Efficient approach to correct OCR errors in Large Document Collections

Deepayan Das, Jerin Philip, Minesh Mathew and C.V. Jawahar
Center for Visual Information Technology, IIIT-Hyderabad


Digital Library

A digital repository for books, accessible to people around the world

Digital Libraries

Popular digital libraries include:

Project                  #Books
Google Books Project     25 million (as of 2015)
Project Gutenberg        60,000
Million Books Project    1.5 million

Digital Libraries

● Easy access to millions of books and articles.
● Lower maintenance and support costs.
● Supports search and indexing.

Digital Libraries

[Diagram: scanning centers, OCR, annotator proofreads the text, access to millions of people]

OCRs are not always 100% accurate

● OCR is sensitive to the quality of document images.
● Degradations can result in words being misclassified.

OCR prediction    Ground Truth
Lord Cauning      Lord Canning
Cawnporo          Cawnpore
Dolhi             Delhi
rnorning          Morning

Information Retrieval on OCR text

OCR errors lead to differences in the ranking of the retrieved documents.

Post-processing for Large Document Collection

● Project Gutenberg: G.B. Newby and C. Frank. “Distributed proofreading.” JCDL, 2003.
● Google Books: Von Ahn et al. “Recaptcha: Human based character recognition.” Science, 2008.

Motivation

● OCR makes consistent errors throughout a document collection.

Predictions: Juiiet, Juiiet, Juiiet, Juiiet; Qucen, Qucen, Qucen, Qucen; Camiing, Canning, Caniiing, Caiiing

Fig. Word images and their corresponding predictions in a collection.

Motivation

● Books and collections have a finite vocabulary that repeats throughout the text.

A small subset of words can cover more than 50% of the total words in a collection.
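The coverage claim can be checked on any tokenized text. The sketch below counts how many of the most frequent distinct words are needed to reach a target share of all tokens. The function name and the toy token stream are illustrative assumptions, not data from the talk.

```python
from collections import Counter


def words_needed_for_coverage(words, target=0.5):
    """Return how many of the most frequent distinct words cover `target` of all tokens."""
    counts = Counter(words)
    total = sum(counts.values())
    covered = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / total >= target:
            return rank
    return len(counts)


# Toy token stream (illustrative only, not taken from the evaluated books).
tokens = "the queen met the lord and the queen left the city of delhi".split()
needed = words_needed_for_coverage(tokens, target=0.5)
print(needed, "of", len(set(tokens)), "distinct words cover half of the tokens")
```

On a real collection the same count, taken over OCR word predictions, indicates how many word groups an annotator would have to touch to reach half of the text.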

Motivation

● Grouping and correcting high-frequency words can lead to a significant gain in word accuracy.

t-SNE Image Embedding

Maaten, Laurens van der and Hinton, Geoffrey. “Visualizing data using t-SNE”. JMLR, 2008

Reverse Annotation

Sankar et al. “Probabilistic Reverse Annotation For Large Scale Image Annotation.” CVPR, 2007.

Fusing Word Clusters

Rasagna et al. “Robust Recognition of Documents by Fusing Results of Word Clusters.” ICDAR, 2009.

Abdulkader and Casey. “Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment.” ICDAR, 2009.

Automatic Error Correction

● Word images are clustered in a feature space.
● A cluster representative is chosen for each cluster.
  ○ Rasagna et al. use character majority voting, where the most frequent character is taken at each time step.

[Diagram: character majority voting selects the cluster representative, which is propagated to all cluster elements]

Automatic Error Correction

● The voted label is propagated to all the cluster elements.

Fig. A cluster of error words with the label “thousand”. The two incorrect labels, “housan” and “thusiasn”, can be corrected with the method described above.

Automatic Error Correction

Predictions: money, money, money, money, money, money, money, aoney, more

Fig. Nearest neighbours of the image embedding for the word “money”. The error word (highlighted in red) can be corrected using character majority voting.
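Character majority voting on a cluster like this one can be sketched as follows. This is a minimal illustration, not the implementation of Rasagna et al.: it treats each character position as a "time step" and, as a simplifying assumption, lets only the predictions of the most common length vote so that positions line up without an alignment step. The function names are hypothetical.

```python
from collections import Counter


def character_majority_vote(predictions):
    """Pick a cluster label by voting on the character at each position.

    Assumption: only predictions of the most common length vote, so that
    positions line up without an explicit alignment step.
    """
    length = Counter(len(p) for p in predictions).most_common(1)[0][0]
    voters = [p for p in predictions if len(p) == length]
    return "".join(
        Counter(chars).most_common(1)[0][0] for chars in zip(*voters)
    )


def propagate(predictions):
    """Assign the voted label to every element of the cluster."""
    label = character_majority_vote(predictions)
    return [label] * len(predictions)


cluster = ["money"] * 7 + ["aoney", "more"]
print(character_majority_vote(cluster))
print(propagate(cluster))
```

Note that `propagate` also overwrites "more" with the voted label. If "more" is a different word that landed in the cluster by mistake, that is exactly the kind of error propagation that impure clusters can cause.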

Cluster Impurities

● Clusters are not always completely homogeneous.
● Character majority voting can lead to error propagation.

Fig. A cluster containing an impurity. Predictions: “money”, “money”, “money”, “money,”, “money”, “money”, “aoney”, “more”

Can a better clustering algorithm help?

MST on word predictions

● We further partition the clusters using a minimum spanning tree (MST).
● The nodes are the predictions.
● The edit distances between the predictions form the edge weights.

Minimum spanning tree

[Diagram: fully connected graph, minimum spanning tree, forest of individual trees]

MST on word predictions

Predictions: “money!”, “money”, “money,”, “money,”, “money”, “money”, “aoney”, “more,”

Fig. Cluster partition using MST.
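A minimal sketch of this partitioning step is given below. It builds the fully connected graph over the predictions with edit-distance weights, extracts an MST with Kruskal's algorithm, and then removes heavy edges to obtain a forest. The slides do not say how the tree is cut, so the `max_edge` threshold is an assumption made for illustration, as are the function names.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def mst_partition(predictions, max_edge=1):
    """Build an MST (Kruskal), then drop edges heavier than `max_edge`."""
    n = len(predictions)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = sorted(
        (edit_distance(predictions[i], predictions[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    mst = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            mst.append((w, i, j))

    # Cut the heavy MST edges; the surviving edges define the forest.
    parent = list(range(n))
    for w, i, j in mst:
        if w <= max_edge:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(predictions[i])
    return list(groups.values())


cluster = ["money!", "money", "money,", "money,", "money", "money", "aoney", "more,"]
for group in mst_partition(cluster, max_edge=1):
    print(group)
```

With this threshold the prediction "more," ends up in its own tree, while "aoney" stays with the "money" variants because it is one edit away from them.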

What happens when all the predictions in a cluster are wrong?

Manual correction

Fig. A cluster with erroneous OCR predictions.

Automatic error correction by label propagation will not achieve 100% word accuracy when clusters are not homogeneous.

Human vs Machine

● Humans are accurate but slow.
  ○ High cost.
● Machines are fast but inaccurate.
  ○ Error propagation.
  ○ A human is needed to rectify the mistakes made by the machine.
  ○ This leads to high cost.

Human Machine Collaboration

Proposed Method

A human should verify each cluster by:

1. Picking the cluster representative.
2. Choosing the cluster elements to which the label should be propagated.
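The two verification steps can be sketched as a single update function. The data layout (a list of predictions indexed by word id, clusters as lists of ids) and all names are assumptions made for this illustration; the talk does not describe the actual interface.

```python
def apply_verified_label(predictions, cluster, representative, accepted):
    """Write the human-chosen label to the accepted members of one cluster.

    predictions    : list of OCR strings, indexed by word id
    cluster        : word ids grouped together by clustering
    representative : label the annotator picked (or typed) for the cluster
    accepted       : subset of `cluster` the annotator selected for propagation
    """
    unknown = set(accepted) - set(cluster)
    if unknown:
        raise ValueError(f"ids not in cluster: {sorted(unknown)}")
    corrected = list(predictions)  # leave the input untouched
    for word_id in accepted:
        corrected[word_id] = representative
    return corrected


predictions = ["money", "aoney", "more", "money"]
corrected = apply_verified_label(
    predictions, cluster=[0, 1, 2, 3], representative="money", accepted=[0, 1, 3]
)
print(corrected)
```

Because the annotator left word 2 out of the accepted set, its prediction "more" is not overwritten, which is how human verification avoids propagating a label to cluster impurities.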

Pipeline

[Diagram components: CRNN-OCR [1] (word predictions), HWNet [2] (image features), error detection, clustering, error clusters, manual correction]

1. Shi et al. “An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.” PAMI, 2017.
2. Krishnan et al. “HWNet v2: An Efficient Word Image Representation for Handwritten Documents.” IJDAR, 2018.

Implementation Details

Edit Actions

● Full typing (no dictionary involved)
● Type + select from dropdown (static dictionary)
● Type + select from dropdown (growing dictionary)
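The difference between the three edit actions can be illustrated by counting how many corrections require typing and how many can be picked from a dropdown. The sketch assumes that a "growing" dictionary simply adds every word the annotator types, which is an interpretation and not something the slides spell out; the function name and the word list are hypothetical.

```python
def count_edit_actions(corrections, dictionary, growing=False):
    """Count corrections that need full typing versus a dropdown selection."""
    known = set(dictionary)
    typed = selected = 0
    for word in corrections:
        if word in known:
            selected += 1
        else:
            typed += 1
            if growing:
                known.add(word)  # later occurrences can be selected instead
    return typed, selected


# Sequence of correct words the annotator has to supply (illustrative only).
corrections = ["Canning", "Delhi", "Canning", "Cawnpore", "Canning"]

print("full typing:", count_edit_actions(corrections, dictionary=[]))
print("static     :", count_edit_actions(corrections, dictionary=["Delhi"]))
print("growing    :", count_edit_actions(corrections, dictionary=["Delhi"], growing=True))
```

Under these assumptions the growing dictionary pays the typing cost once per new word, so repeated names shift from typing to selection.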

Datasets

● Fully annotated
  ○ English
    ■ 19 books
    ■ ~2.5k pages
  ○ Hindi
    ■ 32 books
    ■ ~5k pages

Fig. Sample word images from the fully annotated dataset.

Evaluation Protocol

● For the fully annotated dataset:
  ○ We measure the number of seconds a human spends correcting a book.
  ○ We refer to this as the cost of correction (C).
  ○ We report the cost of each method relative to full typing.

Cost of correction

$C = w_1 c_t + w_2 c_d + w_3 c_v$

$c_t$ = typing cost
$c_d$ = selection cost
$c_v$ = verification cost
$w_1$ = number of error words that need typing
$w_2$ = number of error words whose correct alternative is present in the dictionary
$w_3$ = number of words that are correct but wrongly flagged as errors
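As a worked example of the cost formula, the sketch below evaluates C and divides it by a full-typing baseline. The per-action costs in seconds and the word counts are made-up numbers, and the baseline definition (every true error word is typed, falsely flagged words are still verified) is an assumption, since the slides do not define it.

```python
def correction_cost(w1, w2, w3, c_t, c_d, c_v):
    """C = w1*c_t + w2*c_d + w3*c_v, in seconds."""
    return w1 * c_t + w2 * c_d + w3 * c_v


def relative_cost(w1, w2, w3, c_t, c_d, c_v):
    """Cost relative to an assumed full-typing baseline.

    Assumption: under full typing all w1 + w2 error words are typed and the
    w3 falsely flagged words are still verified.
    """
    baseline = correction_cost(w1 + w2, 0, w3, c_t, c_d, c_v)
    return correction_cost(w1, w2, w3, c_t, c_d, c_v) / baseline


# Hypothetical values: 20 typed words, 80 selected words, 10 false flags;
# 5 s to type, 2 s to select, 1 s to verify.
cost = correction_cost(w1=20, w2=80, w3=10, c_t=5, c_d=2, c_v=1)
ratio = relative_cost(w1=20, w2=80, w3=10, c_t=5, c_d=2, c_v=1)
print(cost, round(ratio, 3))
```

With these numbers C = 20·5 + 80·2 + 10·1 = 270 seconds against a baseline of 100·5 + 10·1 = 510 seconds, so the relative cost is 270/510.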

Results

Relative cost

Fig. Relative cost of correction with respect to full typing when no clustering is involved.

Relative cost

Fig. Cost of correction across different clustering techniques for automatic label selection and propagation. (Plot annotation: cost without clustering.)

Qualitative Results

Fig. Qualitative results of k-means + MST clustering. Images relevant to the cluster are marked correct, while the non-relevant ones (false positives) are crossed out.

Results: Fully Annotated Data (English)

Comparison of automatic and manual correction

● No false error propagation.
● Reduces cost of correction.

Scope for automatic error correction

Clustering on Large Scale Dataset

● We cluster images from 100 unannotated books.
● Testing is done on 200 annotated pages.
● We use character majority voting (CMV) to assign labels to erroneous predictions.

Results

● Comparison of the performance of CRNN and Tesseract OCR.
● Gain in word accuracy: CRNN-OCR >> Tesseract OCR.

Automatic error correction rectifies errors more accurately as the size of the collection increases.

Clustering on Large Scale Dataset

We observe that as the size of the collection increases, CMV becomes better at picking the correct cluster representative, which is subsequently propagated to all the cluster elements.

Conclusion

● We proposed a cost-efficient batch correction scheme for reducing OCR errors.
● We also demonstrated how our approach can be scaled effectively to larger collections.

Future Work

● Active learning techniques to find clusters or sub-clusters that need post-processing.
● Adapting the recognizer to a collection, not just the post-processing module.

Thank You