Form Image Compression using Template Extraction and Matching
Jianguo Wang and Hong Yan
School of Electrical and Information Engineering
University of Sydney, NSW 2006, Australia
phone: +61 2 9351 5338
fax: +61 2 9351 4824
e-mail: jwang@ee.usyd.edu.au
Multi-copy Form Images
Redundancy Analysis
• Local Redundancy (CCITT Group 3, Group 4, JBIG)
• Global Redundancy
– Component-level redundancy (JBIG2)
– Pattern assemblage redundancy in similar images (TEM)
Flow chart of the TEM form compression scheme
(Figure with two panels, Compressing and Restoring. Compressing boxes: Form images, Comparing, Similar? (Yes/No), Try again? (Yes/No), Template extraction, Filled-in pattern extraction, Compressing and saving, Finish. Restoring boxes: Compressed images, Template location, Restoring compressed images, Display and/or saving, Finish.)
Template extraction
• image de-skewing and locating,
• distortion adjusting,
• template extraction,
– generating greyscale image
– thresholding to get two pre-templates
– getting template by comparing pre-templates
• template refining.
A set of adjusted binary form images is overlapped to generate a greyscale image. The grey level of each pixel is determined by the number of images in which that pixel is black.
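The overlapping step described above can be sketched as a per-pixel count of black pixels across aligned binary images, followed by thresholding. This is only a minimal illustration: the slides do not say how the two thresholds are chosen or how the two pre-templates are compared, so the function names and the threshold values below are assumptions.

```python
def overlap_to_greyscale(binary_images):
    """Sum aligned binary images (1 = black) into per-pixel black counts."""
    rows, cols = len(binary_images[0]), len(binary_images[0][0])
    grey = [[0] * cols for _ in range(rows)]
    for image in binary_images:
        for r in range(rows):
            for c in range(cols):
                grey[r][c] += image[r][c]
    return grey


def threshold(grey, min_count):
    """Keep pixels that are black in at least min_count of the images."""
    return [[1 if value >= min_count else 0 for value in row] for row in grey]


# Three tiny 2x4 "forms": the first column is black on every copy,
# the other black pixels stand in for filled-in data that varies.
forms = [
    [[1, 1, 0, 0], [1, 0, 0, 0]],
    [[1, 0, 1, 0], [1, 1, 0, 0]],
    [[1, 0, 0, 1], [1, 1, 0, 0]],
]

grey = overlap_to_greyscale(forms)
strict = threshold(grey, len(forms))  # hypothetical high threshold
loose = threshold(grey, 2)            # hypothetical low threshold

print(grey)    # [[3, 1, 1, 1], [3, 2, 0, 0]]
print(strict)  # [[1, 0, 0, 0], [1, 0, 0, 0]]
print(loose)   # [[1, 0, 0, 0], [1, 1, 0, 0]]
```

Pixels that survive the strict threshold are black on every copy and are the most likely template pixels; the loose result also admits pixels that are black on most copies.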
Examples of the compression approach: (a) an original form image; (b) template extracted from a set of filled-in forms
Compression
• image de-skewing and locating,
• distortion adjusting,
• filled-in data extraction,
– three possible situations
– two types of prototypes: SCC and CCC
• compression with Group 4, saved as TIFF files.
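As a rough illustration of the filled-in data extraction step, the sketch below keeps only the pixels that are black in an aligned form image but not in the template. It is a pixel-level simplification chosen for brevity: the scheme on the slides works with SCC and CCC prototypes, which are not modelled here, and the name `extract_filled_in` is hypothetical.

```python
def extract_filled_in(form, template):
    """Return pixels that are black in the form but white in the template."""
    return [
        [1 if f == 1 and t == 0 else 0 for f, t in zip(form_row, template_row)]
        for form_row, template_row in zip(form, template)
    ]


# 1 = black. The template is an empty box outline; the form has one
# extra black pixel inside the box standing in for filled-in data.
template = [[1, 1, 1, 1], [1, 0, 0, 1]]
form = [[1, 1, 1, 1], [1, 1, 0, 1]]

filled_in = extract_filled_in(form, template)
print(filled_in)  # [[0, 0, 0, 0], [0, 1, 0, 0]]
```

Only the sparse `filled_in` image would then need to be stored per form, while the template is stored once for the whole set.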
Decompression
• two types of prototypes:
– SCC: performing in the rectangle area
– CCC: performing in the pixel set of prototypes
• three possible situations:
– blank: copy the corresponding prototype
– different: no substitution occurs
– exactly same: delete the component
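In the simplest pixel-level reading, restoring a form means overlaying the stored filled-in data on the template. The sketch below does this with a logical OR. It deliberately ignores the SCC/CCC prototype rules listed above and assumes the template and the filled-in image are already aligned; `restore_form` is a hypothetical name.

```python
def restore_form(template, filled_in):
    """Overlay filled-in pixels on the template (1 = black) with a logical OR."""
    return [
        [1 if t == 1 or f == 1 else 0 for t, f in zip(template_row, filled_row)]
        for template_row, filled_row in zip(template, filled_in)
    ]


template = [[1, 1, 1, 1], [1, 0, 0, 1]]
filled_in = [[0, 0, 0, 0], [0, 1, 0, 0]]

restored = restore_form(template, filled_in)
print(restored)  # [[1, 1, 1, 1], [1, 1, 0, 1]]
```

This overlay reproduces the original form exactly only when every template pixel is also black in that form, which is one reason a component-level rule set is needed in practice.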
Form Document Compression Experiment Results
A             Directory                                            Micros       Soco       Tafe      Westp
B             Number of files                                         100          6        100         50
C = B*F       Size of all the TIFF files (bytes)                2,141,539    141,673  3,640,274  4,456,211
D = G*B+H     Size of the compressed file (bytes)                 467,548     25,702    274,344    896,958
E = C/D       Average compression rate over TIFF                     4.58       5.51      13.27       4.97
F = C/B       Average size of each TIFF file (bytes)               21,415     23,612     36,403     89,953
G = (D-H)/B   Average size of each compressed image (bytes)         4,497      1,652      2,389     16,490
H             Size of the template (bytes)                         17,832     15,792     35,462     72,646
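The relations E = C/D and G = (D-H)/B can be checked directly from the table. The short script below recomputes them from the B, C, D and H rows; the function names are illustrative only.

```python
# (B, C, D, H) per directory, copied from the results table:
# number of files, total TIFF bytes, compressed bytes, template bytes.
results = {
    "Micros": (100, 2_141_539, 467_548, 17_832),
    "Soco": (6, 141_673, 25_702, 15_792),
    "Tafe": (100, 3_640_274, 274_344, 35_462),
    "Westp": (50, 4_456_211, 896_958, 72_646),
}


def compression_rate(total_tiff_bytes, compressed_bytes):
    """E = C / D."""
    return total_tiff_bytes / compressed_bytes


def average_compressed_size(compressed_bytes, template_bytes, n_files):
    """G = (D - H) / B: per-image cost once the template is set aside."""
    return (compressed_bytes - template_bytes) / n_files


rates = {}
for name, (b, c, d, h) in results.items():
    rates[name] = round(compression_rate(c, d), 2)
    per_image = average_compressed_size(d, h, b)
    print(f"{name}: E = {rates[name]}, G = {per_image:.0f} bytes")
```

The recomputed E values match the table to two decimals; recomputed averages may differ slightly from the printed figures. Because the template size H is paid once per directory, its share of D shrinks as the number of files B grows.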
Conclusion
• TEM reduces pattern assemblage redundancy in similar images;
– can combine with any current standard (CCITT G3, G4, JBIG) to reduce local redundancy
– can combine with JBIG2 to reduce component-level redundancy in the same image;
• a statistical template extraction algorithm that overlaps binary images into a greyscale image;
• form image de-skewing, locating and distortion adjusting;
• pattern matching rules for SCC and CCC.