Duplicate detection for quality assurance of document image collections

Reinhold Huber-Mörk (1), Alexander Schindler (1,2), Sven Schlarb (3)

(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology

(2) Department of Software Technology and Interactive Systems, Vienna University of Technology

(3) Department for Research and Development, Austrian National Library

Overview

Digital preservation & quality assurance

Digital image preservation workflows

Image duplicate detection

Keypoints and feature descriptors in Computer Vision

Bag of visual words

Results on a real-world data set

2 22.11.2012

SCAPE project and quality assurance

SCAlable Preservation Environments, EU FP7

Preservation Components:

- improve and extend existing tools,

- develop new ones where necessary,

- apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation


Quality assurance in image preservation

Comparison of image content

- automatic image processing workflows (e.g. format conversion)

- reacquisition of images

Duplicate detection

- within a single collection (filtering)

- between collections (merging, comparison)

Solutions:

- page segmentation + OCR

- feature based approaches


Book scan sequence with duplicates


Duplicate detection workflow


Keypoint detection and description (1)

Keypoints are detected at salient image regions

A keypoint is described by a descriptor (= vector of features)

Scale-Invariant Feature Transform - SIFT (Lowe, 2004)




Invariance w.r.t. color/tone transformation

Invariance w.r.t. rotation, scaling or translation




All detections (ordered by scale)


Bag of words model in text information retrieval:

Document 1: “Peter likes to read books. Paul likes too”.

Document 2: “Peter also likes to read poems”

Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]

Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]

Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
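The text example above can be reproduced with a few lines of Python. This is a minimal sketch: the function names `tokenize`, `build_bag` and `histogram` are illustrative and not part of the presented system, and the bag is assumed to be ordered by first occurrence of each word.

```python
import re

def tokenize(text):
    # Keep runs of letters only, so punctuation is dropped; case is preserved.
    return re.findall(r"[A-Za-z]+", text)

def build_bag(documents):
    # Vocabulary ordered by first occurrence across all documents (assumption).
    bag = []
    for doc in documents:
        for word in tokenize(doc):
            if word not in bag:
                bag.append(word)
    return bag

def histogram(doc, bag):
    # Count how often each word of the bag occurs in the document.
    words = tokenize(doc)
    return [words.count(term) for term in bag]

doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

bag = build_bag([doc1, doc2])
hist1 = histogram(doc1, bag)
hist2 = histogram(doc2, bag)
print(bag)
print(hist1)
print(hist2)
```

The two printed histograms correspond to Histogram 1 and Histogram 2 of the slide.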

Visual analogy: bag of visual words or bag of features

Document                     Image
Document made of words       Image made of descriptors
Bag of words                 Bag of clustered descriptors = visual words
Word occurrence histogram    Visual word histogram / ”fingerprint”
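The analogy can be made concrete with a small sketch that turns a set of descriptors into a visual word histogram. It assumes the vocabulary (cluster centres) is already available, for example from k-means clustering; the toy 2-D descriptors, the Euclidean nearest-centre assignment and the normalisation to relative frequencies are illustrative assumptions, not details taken from the slides (real SIFT descriptors have 128 dimensions).

```python
def squared_distance(u, v):
    # Squared Euclidean distance between two descriptors.
    return sum((a - b) ** 2 for a, b in zip(u, v))

def nearest_word(descriptor, vocabulary):
    # Index of the closest cluster centre, i.e. the visual word.
    return min(range(len(vocabulary)),
               key=lambda k: squared_distance(descriptor, vocabulary[k]))

def visual_word_histogram(descriptors, vocabulary):
    # Count visual word occurrences and normalise to relative frequencies.
    counts = [0] * len(vocabulary)
    for d in descriptors:
        counts[nearest_word(d, vocabulary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Hypothetical toy data: three visual words, four descriptors.
vocabulary = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descriptors = [(0.1, 0.0), (0.9, 0.1), (1.1, -0.1), (0.0, 0.8)]

fingerprint = visual_word_histogram(descriptors, vocabulary)
print(fingerprint)
```

The resulting list plays the role of the image "fingerprint": two images with similar content should yield similar histograms.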

Bag of visual words (1)



Examples of visual words: #104, #15, #221, #312, #424, #250



Image comparison / duplicate detection schemes

Comparison of visual histograms - tf (“term frequency”) score

Inverse document frequency - idf

Spatial verification - sv: detailed image comparison
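The difference between the tf and tf/idf schemes can be sketched as follows. The slides do not state which histogram distance or idf formula is used, so cosine similarity and idf = log(N / df) are assumptions here, and the function names are hypothetical.

```python
import math

def idf_weights(histograms):
    # idf per visual word: log(number of images / images containing the word).
    n_images = len(histograms)
    weights = []
    for k in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[k] > 0)
        weights.append(math.log(n_images / df) if df else 0.0)
    return weights

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def tf_score(h1, h2):
    # Plain comparison of visual word histograms.
    return cosine_similarity(h1, h2)

def tfidf_score(h1, h2, weights):
    # Same comparison after down-weighting words that occur in many images.
    w1 = [a * w for a, w in zip(h1, weights)]
    w2 = [b * w for b, w in zip(h2, weights)]
    return cosine_similarity(w1, w2)

# Hypothetical toy collection: word 0 occurs in every image.
histograms = [
    [2, 1, 0, 1],
    [2, 1, 0, 1],
    [2, 0, 3, 0],
]
weights = idf_weights(histograms)
print(tf_score(histograms[0], histograms[2]))
print(tfidf_score(histograms[0], histograms[2], weights))
```

In this toy collection, images 0 and 2 share only the word that occurs everywhere, so their tf score is positive while their tf/idf score drops to zero.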



Spatial verification (1)

Bag of visual words maintains no (or limited) spatial information

Spatial verification:

1. Ranking of most similar images in a shortlist

2. Direct matching of descriptors for pairs of images

3. Overlaying of images

4. Estimation of similarity
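Step 1 of the list above, building a shortlist of the most similar images, might look like the sketch below. The shortlist size, the histogram intersection score and the function names are assumptions chosen for illustration; the expensive steps 2 to 4 would then run only on the returned candidates.

```python
def histogram_intersection(h1, h2):
    # Overlap of two normalised histograms; 1.0 means identical (assumed score).
    return sum(min(a, b) for a, b in zip(h1, h2))

def shortlist(query_index, histograms, score, size=3):
    # Rank all other images by histogram similarity and keep the best few.
    candidates = [
        (score(histograms[query_index], h), j)
        for j, h in enumerate(histograms)
        if j != query_index
    ]
    candidates.sort(key=lambda item: item[0], reverse=True)
    return [j for _, j in candidates[:size]]

# Hypothetical toy fingerprints for four images.
histograms = [
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.25, 0.25],
    [0.0, 0.0, 1.0],
]
top = shortlist(0, histograms, histogram_intersection, size=2)
print(top)
```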


Spatial verification (2)


Pair of possible duplicates

- Descriptor matching

- Estimation of affine transformation

- Image overlay

- Similarity estimation

Similarity measure: MSSIM
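The final similarity estimation can be illustrated with a simplified mean structural similarity (MSSIM) computation. This sketch assumes the two images are already overlaid (registered), are grey-scale lists of rows with values in 0..255, and are at least one block in size; it averages SSIM over non-overlapping blocks, whereas the usual formulation uses a sliding weighted window, so the numbers will differ from a reference implementation.

```python
def ssim_block(x, y, dynamic_range=255.0):
    # SSIM of two equally sized pixel lists, with the commonly used constants.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    return numerator / denominator

def mssim(img1, img2, block=4):
    # Mean of block-wise SSIM values over non-overlapping blocks (simplification).
    rows, cols = len(img1), len(img1[0])
    scores = []
    for r in range(0, rows - block + 1, block):
        for c in range(0, cols - block + 1, block):
            x = [img1[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            y = [img2[i][j] for i in range(r, r + block) for j in range(c, c + block)]
            scores.append(ssim_block(x, y))
    return sum(scores) / len(scores)

# Hypothetical 8x8 test image and its inverted copy.
img = [[(i * 8 + j) * 4 for j in range(8)] for i in range(8)]
inverted = [[255 - v for v in row] for row in img]

same = mssim(img, img)
different = mssim(img, inverted)
print(same, different)
```

An image compared with itself scores (numerically) 1, while structurally dissimilar content scores much lower, which is what makes the measure usable for confirming or rejecting a candidate pair.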

Duplicate detection (1)

Pairwise comparison for a collection of N pages
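A pairwise comparison of N pages needs N(N-1)/2 similarity evaluations. The sketch below records, for each page a, the highest similarity to any other page; this reads the plot label max(Da) as the maximum over the similarities of page a to all other pages, which is an interpretation and not stated explicitly. The `similarity` callback and the toy scores are hypothetical, and the similarity is assumed to be symmetric with values in [0, 1].

```python
def max_similarity_per_page(n_pages, similarity):
    # Evaluate each unordered pair once and keep the best score per page.
    best = [0.0] * n_pages
    comparisons = 0
    for a in range(n_pages):
        for b in range(a + 1, n_pages):
            s = similarity(a, b)
            comparisons += 1
            best[a] = max(best[a], s)
            best[b] = max(best[b], s)
    return best, comparisons

# Hypothetical toy scores: pages 1 and 3 are duplicates of each other.
toy = {
    (0, 1): 0.2, (0, 2): 0.1, (0, 3): 0.3,
    (1, 2): 0.2, (1, 3): 0.9, (2, 3): 0.1,
}
best, comparisons = max_similarity_per_page(4, lambda a, b: toy[(a, b)])
print(best, comparisons)
```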

Plot: max(Da) over image index a (0 to 500); values range from 0 to 1.

Duplicate detection (2)

Robust outlier detection
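The slides do not say which robust estimator is used. As one possible illustration, the sketch below flags pages whose maximum similarity lies far above the median, measured in units of the median absolute deviation (MAD); the factor k = 3 and the scaling constant 1.4826 (which makes the MAD comparable to a standard deviation for normally distributed data) are assumptions.

```python
import statistics

def robust_outliers(values, k=3.0):
    # Median and MAD are barely affected by the few duplicate pages themselves.
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    threshold = med + k * 1.4826 * mad
    outliers = [i for i, v in enumerate(values) if v > threshold]
    return outliers, threshold

# Hypothetical per-page maximum similarities; pages 2 and 3 stand out.
max_similarity = [0.30, 0.28, 0.90, 0.92, 0.31, 0.27, 0.29, 0.33]
outliers, threshold = robust_outliers(max_similarity)
print(outliers, round(threshold, 3))
```

Because the threshold is derived from the data of each collection, no fixed global similarity cut-off has to be chosen.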

Plot: max(Da) over image index a, with annotated index ranges a=12..15, a=22..25, a=106,107, a=108,109, a=188..197 and a=198..207.

Comparison of duplicate detection schemes


a) Visual histogram comparison - tf

b) tf and inv. document frequency - tf/idf

c) tf and spatial verification – tf/sv

Results

Manual vs. automatic detection

59 books, 34805 pages

53 books correctly processed

53/59 ≈ 90% correct

69 of 75 duplicate runs detected

69/75 ≈ 92% correct

Missing detections due to heavily mixed content


Conclusion and outlook

Workflows for duplicate detection for complex documents

Keypoint detection and description = purely image based

Bag of visual words provides fast matching

Spatial verification applied to shortlist

Robust thresholding scheme for duplicate identification

Evaluation at Austrian National Library

Integration on SCAPE platform for scalable preservation

