Form Image Compression using Template Extraction and Matching

Jianguo Wang and Hong Yan

School of Electrical and Information Engineering

University of Sydney, NSW 2006, Australia

phone: +61 2 9351 5338

fax: +61 2 9351 4824

e-mail: jwang@ee.usyd.edu.au

Multi-copy Form Images

Redundancy Analysis

• Local Redundancy (CCITT Group 3, Group 4, JBIG)

• Global Redundancy

– Component-level redundancy (JBIG2)

– Pattern assemblage redundancy in similar images (TEM)

Flow chart of the TEM form compression scheme

Compressing: Form images; Comparing; Similar? (Yes / No); Try again? (Yes / No); Template extraction; Filled-in pattern extraction; Compressing and saving; Finish

Restoring: Compressed images; Template location; Restoring compressed images; Display and/or saving; Finish

Template extraction

• image de-skewing and locating,

• distortion adjusting,

• template extraction,

– generating greyscale image

– thresholding to get two pre-templates

– getting template by comparing pre-templates

• template refining.

A set of adjusted binary form images is overlaid to generate a greyscale image. The density of each pixel is determined by the number of images in which that pixel is black.
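The overlaying step can be pictured with a minimal sketch. It assumes the images are already de-skewed, aligned and of equal size, and represents them as nested lists with 1 for black. The function names and the two threshold fractions (0.9 and 0.5) are hypothetical, chosen only to illustrate how two pre-templates might be obtained; the slides do not give the actual thresholds or the rule for comparing the pre-templates, so that step is left out.

```python
def overlay_counts(images):
    """Count, for every pixel, in how many aligned binary images it is black (1)."""
    height, width = len(images[0]), len(images[0][0])
    counts = [[0] * width for _ in range(height)]
    for img in images:
        for y in range(height):
            for x in range(width):
                counts[y][x] += img[y][x]
    return counts


def threshold(counts, n_images, fraction):
    """Keep pixels that are black in at least `fraction` of the images.

    `fraction` is an illustrative parameter, not a value from the slides.
    """
    cutoff = fraction * n_images
    return [[1 if c >= cutoff else 0 for c in row] for row in counts]


# Three tiny "forms": the top row is a printed line, the bottom row is filled-in data.
forms = [
    [[1, 1, 1, 1], [0, 1, 0, 0]],
    [[1, 1, 1, 1], [0, 0, 1, 0]],
    [[1, 1, 1, 0], [0, 0, 0, 1]],  # one template pixel lost to noise
]

counts = overlay_counts(forms)                    # the "greyscale" image
strict = threshold(counts, len(forms), 0.9)       # hypothetical strict pre-template
loose = threshold(counts, len(forms), 0.5)        # hypothetical loose pre-template
```

In this toy case the filled-in pixels appear in only one image each and drop out of both pre-templates, while the noisy template pixel survives only in the loose one.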

Examples of the compression approach: (a) an original form image; (b) the template extracted from a set of filled-in forms

Compression

• image de-skewing and locating,

• distortion adjusting,

• filled-in data extraction,

– three possible situations

– two types of prototypes: SCC and CCC

• compression with CCITT Group 4, saved as TIFF files.

Decompression

• two types of prototypes:

– SCC: performing in the rectangle area

– CCC: performing in the pixel set of prototypes

• Three possible situations:

– blank: copy the corresponding prototype

– different: no substitution occurs

– exactly same: delete the component
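As a rough intuition for how a template and the filled-in data fit together, here is a deliberately simplified pixel-level sketch. It is not the component-level SCC/CCC prototype matching listed above, whose details the slides do not give; it only assumes that filled-in data is whatever is black in the form but not in the template, and that restoring means merging the two again. Both function names are hypothetical.

```python
def extract_filled_in(form, template):
    """Keep the black pixels of the form that are not part of the template."""
    return [[f & (1 - t) for f, t in zip(f_row, t_row)]
            for f_row, t_row in zip(form, template)]


def restore(template, filled_in):
    """Rebuild a form image by OR-ing the template with the filled-in data."""
    return [[t | d for t, d in zip(t_row, d_row)]
            for t_row, d_row in zip(template, filled_in)]


template = [[1, 1, 1, 1], [0, 0, 0, 0]]
form = [[1, 1, 1, 1], [0, 1, 1, 0]]

filled_in = extract_filled_in(form, template)
restored = restore(template, filled_in)
```

The round trip reproduces the form exactly only when every template pixel is also black in the form; handling missing or distorted template pixels is presumably what the matching rules above are for.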

(c) the reconstructed image; (d) the filled-in data extracted from (a).

Sample forms used for testing

Form Document Compression Experiment Results

A                Directory                                           Micros       Soco       Tafe      Westp
B                Number of files                                        100          6        100         50
C    = B*F       Size of all the TIFF files (bytes)               2,141,539    141,673  3,640,274  4,456,211
D    = G*B+H     Size of the compressed file (bytes)                467,548     25,702    274,344    896,958
E    = C/D       Average compression rate over TIFF                    4.58       5.51      13.27       4.97
F    = C/B       Average size of each TIFF file (bytes)              21,415     23,612     36,403     89,953
G    = (D-H)/B   Average size of each compressed image (bytes)        4,497      1,652      2,389     16,490
H                Size of the template (bytes)                        17,832     15,792     35,462     72,646
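The derived rows follow directly from the formulas in the table. The short sketch below recomputes E, F and G for the Micros directory from its B, C, D and H values; the function and key names are illustrative only.

```python
def compression_stats(num_files, total_tiff_bytes, compressed_bytes, template_bytes):
    """Recompute the derived rows of the results table from the measured ones."""
    return {
        # E = C / D: compression rate relative to the TIFF files
        "rate": total_tiff_bytes / compressed_bytes,
        # F = C / B: average size of one TIFF file
        "avg_tiff": total_tiff_bytes / num_files,
        # G = (D - H) / B: average size of one compressed image, template excluded
        "avg_compressed": (compressed_bytes - template_bytes) / num_files,
    }


# Values for the Micros directory, taken from rows B, C, D and H.
micros = compression_stats(100, 2_141_539, 467_548, 17_832)
print(round(micros["rate"], 2), round(micros["avg_tiff"]), round(micros["avg_compressed"]))
```

Because the template is stored once per directory, its cost H is amortised over all B files, which is why G is computed after subtracting H from D.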

Conclusion

• TEM reduces pattern assemblage redundancy in similar images;

– can be combined with any current standard (CCITT G3, G4, JBIG) to reduce local redundancy

– can be combined with JBIG2 to reduce component-level redundancy within the same image;

• a statistical template extraction algorithm that overlays binary images to form a greyscale image;

• form image de-skewing, locating and distortion adjusting;

• pattern matching rules for SCC and CCC.
