Form Image Compression using Template Extraction and Matching
Jianguo Wang and Hong Yan
School of Electrical and Information Engineering
University of Sydney, NSW 2006, Australia
phone: +61 2 9351 5338
fax: +61 2 9351 4824
e-mail: [email protected]
Multi-copy Form Images
Redundancy Analysis
• Local Redundancy (CCITT Group 3, Group 4, JBIG)
• Global Redundancy
– Component-level redundancy (JBIG2)
– Pattern assemblage redundancy in similar images (TEM)
Flow chart of the TEM form compression scheme
[Figure: flow chart with two paths, Compressing and Restoring.
Compressing boxes: form images; comparing; template extraction; filled-in pattern extraction; compressing and saving; finish.
Restoring boxes: compressed images; template location; restoring compressed images; display and/or saving; finish.
Decision nodes: "Similar?" and "Try again?", each with Yes/No branches.]
Template extraction
• image de-skewing and locating,
• distortion adjusting,
• template extraction,
– generating greyscale image
– thresholding to get two pre-templates
– getting template by comparing pre-templates
• template refining.
A set of adjusted binary form images is overlapped to generate a greyscale image. The grey level of each pixel is determined by the number of images in which that pixel is black.
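The overlap and thresholding steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the two threshold values, and the toy 3x4 images are assumptions, and the slides do not state how the two pre-templates are compared, so the sketch only flags the pixels on which they disagree.

```python
def overlap_counts(images):
    """Count, per pixel, how many binary images (1 = black) are black there."""
    rows, cols = len(images[0]), len(images[0][0])
    counts = [[0] * cols for _ in range(rows)]
    for img in images:
        for r in range(rows):
            for c in range(cols):
                counts[r][c] += img[r][c]
    return counts


def threshold(counts, min_count):
    """Binarise the count image: black where at least min_count images agree."""
    return [[1 if v >= min_count else 0 for v in row] for row in counts]


# Three aligned toy "forms": the top and bottom rows play the role of
# pre-printed lines, the middle row holds filled-in data that varies per copy.
forms = [
    [[1, 1, 1, 1], [0, 1, 0, 0], [1, 1, 1, 1]],
    [[1, 1, 1, 1], [0, 0, 1, 0], [1, 1, 1, 1]],
    [[1, 1, 1, 1], [0, 0, 0, 1], [1, 1, 1, 0]],
]

counts = overlap_counts(forms)

# Two pre-templates from two thresholds (values chosen for illustration only).
strict = threshold(counts, 3)  # black in every copy
loose = threshold(counts, 2)   # black in at least two copies

# Pixels where the pre-templates disagree are candidates for refinement.
uncertain = [
    (r, c)
    for r in range(len(counts))
    for c in range(len(counts[0]))
    if strict[r][c] != loose[r][c]
]
```

Filled-in strokes appear in only one copy each, so both thresholds reject them, while the pre-printed pixel that dropped out of one copy is the only point of disagreement.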
Examples of the compression approach: (a) an original form image; (b) the template extracted from a set of filled-in forms
Compression
• image de-skewing and locating,
• distortion adjusting,
• filled-in data extraction,
– three possible situations
– two types of prototypes: SCC and CCC
• compression with Group 4, saved as TIFF files.
Decompression
• two types of prototypes:
– SCC: performing in the rectangle area
– CCC: performing in the pixel set of prototypes
• three possible situations:
– blank: copy the corresponding prototype
– different: no substitution occurs
– exactly same: delete the component
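One possible reading of the three situations, sketched for rectangle-area prototypes only. Everything here is an assumption made for illustration: the helper names, the `(top, left, bottom, right)` box format, and the interpretation that "exactly same" applies when compressing while "blank" and "different" apply when restoring. It is not the authors' SCC or CCC matching rule.

```python
def region(img, box):
    """Return the pixels of img inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    return [row[left:right] for row in img[top:bottom]]


def put_region(img, box, patch):
    """Overwrite the pixels of img inside box with patch, in place."""
    top, left, bottom, right = box
    for r in range(top, bottom):
        img[r][left:right] = patch[r - top]


def is_blank(patch):
    return not any(any(row) for row in patch)


def compress(form, template, boxes):
    """Exactly same as the prototype: delete the component from the form."""
    out = [row[:] for row in form]
    for box in boxes:
        proto = region(template, box)
        if region(form, box) == proto:
            put_region(out, box, [[0] * len(row) for row in proto])
    return out


def restore(filled, template, boxes):
    """Blank: copy the prototype back. Different: no substitution occurs."""
    out = [row[:] for row in filled]
    for box in boxes:
        if is_blank(region(out, box)):
            put_region(out, box, region(template, box))
    return out


template = [[1, 1, 0, 0], [1, 1, 0, 0]]
boxes = [(0, 0, 2, 2)]                    # one hypothetical prototype area
form_a = [[1, 1, 0, 1], [1, 1, 1, 0]]     # prototype intact, data beside it
form_b = [[1, 0, 0, 1], [1, 1, 0, 0]]     # prototype altered by the filling

packed_a = compress(form_a, template, boxes)
packed_b = compress(form_b, template, boxes)
```

Under this reading a region that was left blank in the original form but is non-blank in the template would be restored incorrectly, so a real scheme needs a rule for that case; the slides do not say how it is handled.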
Form Document Compression Experiment Results

Row  Definition   Quantity                                        Micros      Soco       Tafe       Westp
A                 Directory                                       Micros      Soco       Tafe       Westp
B                 Number of files                                 100         6          100        50
C    = B*F        Size of all the tiff files (bytes)              2,141,539   141,673    3,640,274  4,456,211
D    = G*B+H      Size of the compressed file (bytes)             467,548     25,702     274,344    896,958
E    = C/D        Average compression rate over tiff              4.58        5.51       13.27      4.97
F    = C/B        Average size of each tiff file (bytes)          21,415      23,612     36,403     89,953
G    = (D-H)/B    Average size of each compressed image (bytes)   4,497       1,652      2,389      16,490
H                 Size of the template (bytes)                    17,832      15,792     35,462     72,646
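As a worked check of the row definitions, the derived quantities E, F and G can be recomputed from rows B, C, D and H. The sketch below uses the Micros column; the function and variable names are illustrative and not from the slides.

```python
def derived_figures(num_files, tiff_bytes, compressed_bytes, template_bytes):
    """Recompute rows E, F and G of the table from rows B, C, D and H."""
    rate = tiff_bytes / compressed_bytes                               # E = C/D
    avg_tiff = tiff_bytes / num_files                                  # F = C/B
    avg_compressed = (compressed_bytes - template_bytes) / num_files   # G = (D-H)/B
    return rate, avg_tiff, avg_compressed


# Micros directory: B = 100, C = 2,141,539, D = 467,548, H = 17,832
rate, avg_tiff, avg_img = derived_figures(100, 2_141_539, 467_548, 17_832)
print(round(rate, 2), round(avg_tiff), round(avg_img))  # 4.58 21415 4497
```

The template is stored once per directory, so its cost H is subtracted before averaging over the B compressed images.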
Conclusion
• TEM reduces pattern assemblage redundancy in similar images;
– can be combined with any current standard (CCITT G3, G4, JBIG) to reduce local redundancy
– can be combined with JBIG2 to reduce component-level redundancy within the same image;
• a statistical template extraction algorithm that overlaps binary images into a greyscale image;
• form image de-skewing, locating and distortion adjusting;
• pattern matching rules for SCC and CCC.