Form Image Compression using Template Extraction and Matching Jianguo Wang and Hong Yan School of Electrical and Information Engineering University of Sydney, NSW 2006, Australia phone: +61 2 9351 5338 fax: +61 2 9351 4824 e-mail: [email protected]



Multi-copy Form Images

Redundancy Analysis

• Local Redundancy (CCITT Group 3, Group 4, JBIG)

• Global Redundancy

– Component-level redundancy (JBIG2)

– Pattern assemblage redundancy in similar images (TEM)

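Flow chart of the TEM form compression scheme

[Figure: flow chart with two paths, labelled "Compressing" and "Restoring".]

Compressing path, boxes and decisions as labelled in the figure: Form images; Template extraction; Filled-in pattern extraction; Compressing and saving; Comparing; Similar? (Yes / No); Try again? (Yes / No); Finish.

Restoring path, boxes as labelled in the figure: Compressed images; Template location; Restoring compressed images; Display and/or saving; Finish.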

Template extraction

• image de-skewing and locating,

• distortion adjusting,

• template extraction,

– generating a greyscale image

– thresholding to get two pre-templates

– getting the template by comparing the pre-templates

• template refining.

A set of adjusted binary form images is overlapped to generate a greyscale image. The grey level of each pixel is determined by the number of images in the set that have a black pixel at that position.
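The overlapping and thresholding steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the slides do not give the two threshold values or the rule used to compare the pre-templates, so `low_frac`, `high_frac` and the 3x3 neighbourhood rule in `extract_template` are assumptions, and all function names are hypothetical. Images are assumed to be already de-skewed, located and distortion-adjusted, and are represented as lists of rows with 1 for black and 0 for white.

```python
import math
from typing import List

Image = List[List[int]]  # binary image: 1 = black pixel, 0 = white pixel


def overlap_counts(images: List[Image]) -> List[List[int]]:
    """Greyscale image: number of input images that are black at each pixel."""
    height, width = len(images[0]), len(images[0][0])
    counts = [[0] * width for _ in range(height)]
    for img in images:
        for y in range(height):
            for x in range(width):
                counts[y][x] += img[y][x]
    return counts


def threshold(counts: List[List[int]], min_count: int) -> Image:
    """Binary pre-template: black where the overlap count reaches min_count."""
    return [[1 if c >= min_count else 0 for c in row] for row in counts]


def extract_template(images: List[Image],
                     low_frac: float = 0.5,
                     high_frac: float = 0.9) -> Image:
    """Build two pre-templates and combine them into one template.

    ASSUMPTION: the thresholds (fractions of the number of images) and the
    combination rule below are illustrative; the slides do not specify them.
    """
    n = len(images)
    counts = overlap_counts(images)
    loose = threshold(counts, max(1, math.ceil(low_frac * n)))
    strict = threshold(counts, max(1, math.ceil(high_frac * n)))
    height, width = len(counts), len(counts[0])
    template = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not loose[y][x]:
                continue
            # Assumed rule: keep a loose pixel only if some pixel in its
            # 3x3 neighbourhood (itself included) is in the strict pre-template.
            near_strict = any(
                strict[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            )
            template[y][x] = 1 if near_strict else 0
    return template


# Four tiny 2x4 "forms": the top-left stroke is pre-printed, the rest is filled in.
forms = [
    [[1, 1, 0, 0], [0, 0, 0, 1]],
    [[1, 1, 0, 0], [0, 0, 0, 0]],
    [[1, 0, 0, 0], [0, 0, 1, 0]],
    [[1, 1, 0, 0], [0, 0, 0, 0]],
]
print(overlap_counts(forms))    # [[4, 3, 0, 0], [0, 0, 1, 1]]
print(extract_template(forms))  # [[1, 1, 0, 0], [0, 0, 0, 0]]
```

In this toy input the pixels that are black in most forms survive as template, while marks that appear in only one form are treated as filled-in data and dropped.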

Examples of the compression approach: (a) an original form image; (b) the template extracted from a set of filled-in forms

Compression

• image de-skewing and locating,

• distortion adjusting,

• filled-in data extraction,

– three possible situations

– two types of prototypes: SCC and CCC

• compression with CCITT Group 4, saved as TIFF files.

Decompression

• Two types of prototypes:

– SCC: matching is performed in the rectangle area

– CCC: matching is performed in the pixel set of the prototype

• Three possible situations:

– blank: copy the corresponding prototype

– different: no substitution occurs

– exactly the same: delete the component

(c) the reconstructed image; (d) the filled-in data extracted from (a).

Sample forms used for testing

Form Document Compression Experiment Results

Row            Quantity                                          Micros      Soco      Tafe        Westp

A              Directory                                         Micros      Soco      Tafe        Westp

B              Number of files                                   100         6         100         50

C = B*F        Size of all the TIFF files (bytes)                2,141,539   141,673   3,640,274   4,456,211

D = G*B + H    Size of the compressed file (bytes)               467,548     25,702    274,344     896,958

E = C/D        Average compression rate over TIFF                4.58        5.51      13.27       4.97

F = C/B        Average size of each TIFF file (bytes)            21,415      23,612    36,403      89,953

G = (D-H)/B    Average size of each compressed image (bytes)     4,497       1,652     2,389       16,490

H              Size of the template (bytes)                      17,832      15,792    35,462      72,646
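As a worked check of the arithmetic behind the results, the following sketch recomputes the compression rate E = C/D from the reported totals and, for one directory, the per-image size G = (D-H)/B. The byte counts are copied from the results above; the dictionary layout and function names are hypothetical and only serve the illustration.

```python
# Totals reported in the experiment results (bytes), keyed by directory.
results = {
    "Micros": {"files": 100, "tiff": 2_141_539, "compressed": 467_548, "template": 17_832},
    "Soco":   {"files": 6,   "tiff": 141_673,   "compressed": 25_702,  "template": 15_792},
    "Tafe":   {"files": 100, "tiff": 3_640_274, "compressed": 274_344, "template": 35_462},
    "Westp":  {"files": 50,  "tiff": 4_456_211, "compressed": 896_958, "template": 72_646},
}


def compression_rate(tiff_bytes: int, compressed_bytes: int) -> float:
    """E = C / D: how many times smaller the TEM output is than the TIFF files."""
    return tiff_bytes / compressed_bytes


def per_image_bytes(compressed_bytes: int, template_bytes: int, files: int) -> float:
    """G = (D - H) / B: average cost of one form once the template is stored."""
    return (compressed_bytes - template_bytes) / files


for name, r in results.items():
    rate = compression_rate(r["tiff"], r["compressed"])
    print(f"{name}: E = {rate:.2f}")

micros = results["Micros"]
print(round(per_image_bytes(micros["compressed"], micros["template"], micros["files"])))
```

The template is stored once per directory, so its cost H is amortised over the B files. This is one reason the rate for Soco, with only 6 files, is held down by a template that accounts for more than half of the compressed size.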

Conclusion

• TEM reduces pattern assemblage redundancy in similar images;

– it can be combined with any current standard (CCITT G3, G4, JBIG) to reduce local redundancy;

– it can be combined with JBIG2 to reduce component-level redundancy within the same image;

• a statistical template extraction algorithm that overlaps binary images into a greyscale image;

• de-skewing, location and distortion adjustment of form images;

• pattern matching rules for SCC and CCC.