Reinhold Huber-Mörk, Austrian Institute of Technology, presented a method for quality assurance of scanned content based on computer vision at iPres 2012, Toronto. In: iPRES 2012 – Proceedings of the 9th International Conference on Preservation of Digital Objects. Toronto 2012, 136-143. ISBN 978-0-9917997-0-1
Duplicate detection for quality assurance of document image collections
Reinhold Huber-Mörk (1), Alexander Schindler (1,2), Sven Schlarb (3)
(1) Research Area Intelligent Vision Systems, Department Safety & Security, AIT Austrian Institute of Technology
(2) Department of Software Technology and Interactive Systems, Vienna University of Technology
(3) Department for Research and Development, Austrian National Library
Overview
Digital preservation & quality assurance
Digital image preservation workflows
Image duplicate detection
Keypoints and feature descriptors in Computer Vision
Bag of visual words
Results on a real-world data set
2 22.11.2012
SCAPE project and quality assurance
SCAlable Preservation Environments, EU FP7
Preservation Components:
- improve and extend existing tools,
- develop new ones where necessary,
- apply proven approaches like image and pattern analysis to the problem of ensuring quality in digital preservation
Quality assurance in image preservation
Comparison of image content
- automatic image processing workflows (e.g. format conversion)
- reacquisition of images
Duplicate detection
- within a single collection (filtering)
- between collections (merging, comparison)
Solutions:
- page segmentation + OCR
- feature based approaches
Book scan sequence with duplicates
Duplicate detection workflow
Keypoint detection and description (1)
Keypoints are detected at salient image regions
Each keypoint is described by a descriptor (a vector of features)
Scale-Invariant Feature Transform, SIFT (Lowe, 2004)
[Figure: two plots, x-axis 0 to 120, y-axis 0 to 0.2]
Keypoint detection and description (2)
Invariance w.r.t. color/tone transformation
Invariance w.r.t. rotation, scaling or translation
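The tone invariance named above can be illustrated with a toy descriptor. The sketch below is not SIFT itself: it is a simplified, normalised histogram of gradient orientations over a small patch, and the names `orientation_histogram`, `patch` and `toned` are assumptions made for illustration. Because a contrast and brightness change only scales the gradients, the normalised histogram stays the same.

```python
import math

def orientation_histogram(patch, bins=8):
    """L2-normalised histogram of gradient orientations, weighted by gradient magnitude."""
    hist = [0.0] * bins
    rows, cols = len(patch), len(patch[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            # Central differences approximate the image gradient.
            dx = patch[y][x + 1] - patch[y][x - 1]
            dy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(dx, dy)
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]

# Hypothetical 6x6 grey-value patch.
patch = [[(3 * x + x * y + y * y) % 17 for x in range(6)] for y in range(6)]
# Hypothetical tone change: contrast doubled, brightness raised by 10.
toned = [[2.0 * v + 10.0 for v in row] for row in patch]

h_original = orientation_histogram(patch)
h_toned = orientation_histogram(toned)
max_difference = max(abs(a - b) for a, b in zip(h_original, h_toned))
```

The brightness offset cancels in the differences and the contrast factor cancels in the normalisation, so `max_difference` is zero up to rounding. Rotation and scale invariance need the additional orientation and scale assignment of a full keypoint detector, which this sketch leaves out.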
[Figure: four plots, x-axis 0 to 120, y-axis 0 to 0.2]
Keypoint detection and description (3)
All detections (ordered by scale)
Duplicate detection workflow
Bag of visual words (1)
Bag of words model in text information retrieval:
Document 1: "Peter likes to read books. Paul likes too"
Document 2: "Peter also likes to read poems"
Bag: [ Peter, likes, to, read, books, Paul, too, also, poems ]
Histogram 1: [ 1, 2, 1, 1, 1, 1, 1, 0, 0 ]
Histogram 2: [ 1, 1, 1, 1, 0, 0, 0, 1, 1 ]
Visual analogy: bag of visual words or bag of features
Document                   | Image
Document made of words     | Image made of descriptors
Bag of words               | Bag of clustered descriptors = visual words
Word occurrence histogram  | Visual word histogram / "fingerprint"
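The two word histograms of the text example can be reproduced with a few lines of Python. This is a minimal sketch; the helper names `tokenize` and `histogram` are assumptions, and the vocabulary is ordered by first appearance so that it matches the bag given on the slide.

```python
import re
from collections import Counter

def tokenize(text):
    """Split a text into words, dropping punctuation."""
    return re.findall(r"[A-Za-z]+", text)

doc1 = "Peter likes to read books. Paul likes too"
doc2 = "Peter also likes to read poems"

# Vocabulary ("bag") in order of first appearance across both documents.
bag = list(dict.fromkeys(tokenize(doc1) + tokenize(doc2)))

def histogram(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

histogram1 = histogram(doc1, bag)
histogram2 = histogram(doc2, bag)
```

For images, the same counting is applied to visual words instead of text words.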
Bag of visual words (2)
[Figure: examples of visual words #104, #15, #221, #312, #424 and #250]
Bag of visual words (3)
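The step from descriptors to a visual word histogram can be sketched as nearest-centre assignment. The sketch assumes that a codebook of cluster centres already exists (the slides say only that descriptors are clustered, not which algorithm is used); the toy two-dimensional `codebook` and `descriptors` are invented for illustration, whereas real descriptors have many more dimensions.

```python
def nearest_word(descriptor, codebook):
    """Index of the cluster centre closest to the descriptor (squared Euclidean distance)."""
    def squared_distance(centre):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centre))
    return min(range(len(codebook)), key=lambda i: squared_distance(codebook[i]))

def visual_word_histogram(descriptors, codebook):
    """Relative frequency of each visual word among the descriptors of one image."""
    counts = [0] * len(codebook)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, codebook)] += 1
    total = sum(counts) or 1
    return [count / total for count in counts]

# Hypothetical codebook with three visual words.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Hypothetical descriptors extracted from one image.
descriptors = [[0.1, 0.0], [0.9, 0.1], [1.1, -0.1], [0.0, 0.8]]

fingerprint = visual_word_histogram(descriptors, codebook)
```

Normalising by the number of descriptors makes fingerprints comparable between pages with different numbers of keypoints.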
Duplicate detection workflow
Image comparison / duplicate detection schemes
Comparison of visual histograms: tf ("term frequency") score
Inverse document frequency: idf
Spatial verification (sv): detailed image comparison
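The tf and idf schemes can be sketched on toy histograms. The slides do not state which similarity function is used, so cosine similarity and the common weighting log(N / document frequency) are assumptions here, as are all names and numbers.

```python
import math

def cosine(u, v):
    """Cosine similarity of two histograms (0.0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def idf_weights(histograms):
    """log(N / df) per visual word, where df counts the images containing that word."""
    n = len(histograms)
    weights = []
    for word in range(len(histograms[0])):
        df = sum(1 for h in histograms if h[word] > 0)
        weights.append(math.log(n / df) if df else 0.0)
    return weights

def tf_idf(histogram, weights):
    """Reweight term frequencies by inverse document frequency."""
    return [tf * w for tf, w in zip(histogram, weights)]

# Hypothetical visual word counts for three pages.
histograms = [
    [3, 1, 0, 2],  # page A
    [3, 1, 0, 2],  # page B, a duplicate of A
    [0, 1, 4, 2],  # page C
]
weights = idf_weights(histograms)

tf_score_ab = cosine(histograms[0], histograms[1])
tf_score_ac = cosine(histograms[0], histograms[2])
tfidf_score_ac = cosine(tf_idf(histograms[0], weights), tf_idf(histograms[2], weights))
```

In this toy data pages A and C share only visual words that occur on every page. The idf weighting sets those words to zero, so the tf/idf score of the non-duplicate pair drops below its plain tf score.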
[Figure: three plots, x-axis 0 to 500, y-axis scaled by 10^-3]
Spatial verification (1)
Bag of visual words maintains no (or limited) spatial information
Spatial verification:
1. Ranking of most similar images in a shortlist
2. Direct matching of descriptors for pairs of images
3. Overlaying of images
4. Estimation of similarity
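Step 1 and its hand-over to the detailed comparison can be sketched as follows. Histogram intersection as the ranking score, the shortlist size, and the `detailed_similarity` callback (a stand-in for steps 2 to 4) are all assumptions made for illustration.

```python
def intersection(u, v):
    """Histogram intersection: larger values mean more similar histograms."""
    return sum(min(a, b) for a, b in zip(u, v))

def shortlist(query, histograms, similarity, size=3):
    """Indices of the `size` images most similar to image `query`, best first."""
    candidates = [
        (similarity(histograms[query], h), i)
        for i, h in enumerate(histograms)
        if i != query
    ]
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [i for _, i in candidates[:size]]

def verify(query, histograms, similarity, detailed_similarity, size=3):
    """Run the expensive detailed comparison only for the shortlisted images."""
    return {
        i: detailed_similarity(query, i)
        for i in shortlist(query, histograms, similarity, size)
    }

# Hypothetical visual word histograms of four pages.
histograms = [
    [0.5, 0.5, 0.0],
    [0.5, 0.4, 0.1],
    [0.0, 0.1, 0.9],
    [0.3, 0.5, 0.2],
]
top_two = shortlist(0, histograms, intersection, size=2)
# Placeholder for descriptor matching, overlay and similarity estimation.
verified = verify(0, histograms, intersection, lambda a, b: 1.0, size=2)
```

Restricting the detailed comparison to a shortlist keeps the number of expensive image-to-image comparisons small while the fast histogram comparison covers the whole collection.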
Spatial verification (2)
Pair of possible duplicates
- Descriptor matching
- Estimation of affine transformation
- Image overlay
- Similarity estimation
Similarity measure: MSSIM
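The final similarity estimation can be sketched with a simplified mean structural similarity (MSSIM). The SSIM formula and its constants follow the standard definition; as a simplification, this sketch averages over non-overlapping square blocks with uniform weights instead of a sliding weighted window, and it assumes that the two images are already aligned and of equal size. The test images are invented.

```python
def ssim_window(x, y, dynamic_range=255.0):
    """Structural similarity of two equally sized lists of grey values."""
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def mssim(image_a, image_b, block=4):
    """Mean SSIM over non-overlapping block x block windows."""
    rows, cols = len(image_a), len(image_a[0])
    scores = []
    for top in range(0, rows - block + 1, block):
        for left in range(0, cols - block + 1, block):
            cells = [(r, c) for r in range(top, top + block)
                     for c in range(left, left + block)]
            scores.append(ssim_window([image_a[r][c] for r, c in cells],
                                      [image_b[r][c] for r, c in cells]))
    return sum(scores) / len(scores)

# Hypothetical 8x8 page image with a coarse checkerboard pattern, and its inverse.
page = [[255.0 if (r // 2 + c // 2) % 2 == 0 else 0.0 for c in range(8)] for r in range(8)]
inverted = [[255.0 - v for v in row] for row in page]

same_score = mssim(page, page)
inverted_score = mssim(page, inverted)
```

Identical images score 1, while the inverted image scores far lower because its local structure is anti-correlated with the original.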
Duplicate detection (1)
Pairwise comparison for a collection of N pages
[Figure: max(Da) plotted against image index a; x-axis 0 to 500, y-axis 0 to 1]
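The quantity max(Da) can be sketched under one reading of the axis label: Da is taken to be the set of similarities between image a and every other image in the collection, so max(Da) is the score of the best match for image a. This reading, the similarity function and the toy histograms are assumptions.

```python
def intersection(u, v):
    """Histogram intersection used here as a stand-in similarity score."""
    return sum(min(a, b) for a, b in zip(u, v))

def best_match_scores(items, similarity):
    """For each item a, the maximum similarity to any other item: max(D_a)."""
    scores = []
    for a, item_a in enumerate(items):
        d_a = [similarity(item_a, item_b) for b, item_b in enumerate(items) if b != a]
        scores.append(max(d_a))
    return scores

# Hypothetical visual word histograms of four pages; pages 0 and 1 are duplicates.
histograms = [
    [0.5, 0.5, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 0.2, 0.8],
    [0.7, 0.0, 0.3],
]
max_d = best_match_scores(histograms, intersection)
```

For N pages this evaluates N(N-1) similarities, or N(N-1)/2 if the similarity is symmetric and each pair is computed once.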
Duplicate detection (2)
Robust outlier detection
[Figure: max(Da) plotted against image index a (x-axis 0 to 500, y-axis 0 to 1), annotated with a=12..15, a=22..25, a=106,107, a=108,109, a=188..197 and a=198..207]
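The slides do not say which robust outlier test is applied to the max(Da) values. One common choice, shown here purely as an assumption, is a threshold based on the median and the median absolute deviation (MAD); the factor `k` and the scores are invented for illustration.

```python
import statistics

def robust_outliers(scores, k=3.0):
    """Indices whose score exceeds median + k * 1.4826 * MAD."""
    median = statistics.median(scores)
    mad = statistics.median([abs(s - median) for s in scores])
    # 1.4826 scales the MAD to the standard deviation of a normal distribution.
    threshold = median + k * 1.4826 * mad
    return [i for i, s in enumerate(scores) if s > threshold]

# Hypothetical max(D_a) values: most pages have no close match, two pages do.
scores = [0.21, 0.25, 0.19, 0.93, 0.95, 0.22, 0.24, 0.20, 0.23, 0.26]
duplicate_candidates = robust_outliers(scores)
```

Unlike a threshold built from mean and standard deviation, the median and MAD are barely shifted by the duplicates themselves, so a few very high scores do not mask each other.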
Comparison of duplicate detection schemes
a) Visual histogram comparison - tf
b) tf and inv. document frequency - tf/idf
c) tf and spatial verification – tf/sv
Results
Manual vs. automatic detection
59 books, 34805 pages
53 books correctly processed
53/59 ≈ 90% correct
69 of 75 duplicate runs detected
69/75 ≈ 92% correct
Missing detections due to heavily mixed content
Conclusion and outlook
Workflows for duplicate detection for complex documents
Keypoint detection and description = purely image based
Bag of visual words provides fast matching
Spatial verification applied to shortlist
Robust thresholding scheme for duplicate identification
Evaluation at Austrian National Library
Integration on SCAPE platform for scalable preservation
AIT Austrian Institute of Technology, your ingenious partner