IMPACT Research Image Enhancement, Segmentation, Experimental OCR Apostolos Antonacopoulos PRImA Lab, The University of Salford, United Kingdom www.primaresearch.org

IMPACT Final Conference - USAL - Text line and word segmentation


Page 1: IMPACT Final Conference - USAL - Text line and word segmentation

IMPACT Research: Image Enhancement, Segmentation, Experimental OCR

Apostolos Antonacopoulos

PRImA Lab, The University of Salford, United Kingdom

www.primaresearch.org

Page 2: Outline

Overview: digitisation workflow

Image enhancement: border removal, page curl removal, correction of arbitrary warping

Segmentation: recognition-based, standalone

Typewritten document OCR

Wordspotting

Page 3: Overview: Digitisation Workflow

Main steps:

① Scanning

② Image enhancement: page splitting, border removal, page curl removal, dewarping

③ Layout analysis: segmentation of regions, lines, words and characters; region classification; logical layout analysis

④ OCR (incl. specialist or wordspotting)

⑤ Post-processing

Page 4: Text Line and Word Segmentation

Standalone methods that can be integrated into systems without the need to integrate the FR engine

Not based on recognition of characters/words, so suitable for documents that contain non-dictionary words or that are not practical to OCR (word spotting)

Used in other IMPACT methods: typewritten OCR, correction of arbitrary warping, word spotting

Page 5: Hybrid Text Line Segmenter

Hybrid approach based on connected component clustering and projection profiles

Connected component extraction (incl. noise filtering)

Group components into line candidates using an efficient data structure

Find and split under-segmented lines using local projection profiles

Merge small peripheral lines to appropriate neighbour (e.g. for i-dots etc.)
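The four steps above can be sketched roughly as follows. This is a minimal illustration and not the PRImA implementation: the image is assumed to be a list of strings with `#` for ink, components are clustered by plain vertical overlap rather than with an efficient data structure, a component that bridges two lines is assigned whole to one of them instead of being cut, and all function names and the `factor` thresholds are hypothetical.

```python
from statistics import median

INK = "#"  # assumed ink marker; rows are strings of '#' (ink) and '.' (background)

def connected_components(img, min_area=1):
    """Bounding boxes (top, left, bottom, right) of 8-connected ink blobs."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if img[y][x] != INK or seen[y][x]:
                continue
            seen[y][x] = True
            stack, pixels = [(y, x)], []
            while stack:  # flood fill one component
                cy, cx = stack.pop()
                pixels.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if img[ny][nx] == INK and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
            if len(pixels) >= min_area:  # crude noise filter by pixel count
                ys, xs = zip(*pixels)
                boxes.append((min(ys), min(xs), max(ys), max(xs)))
    return boxes

def extent(line):
    """Bounding box (top, left, bottom, right) enclosing all boxes of a line."""
    tops, lefts, bottoms, rights = zip(*line)
    return (min(tops), min(lefts), max(bottoms), max(rights))

def group_into_lines(boxes):
    """Cluster boxes into line candidates by transitive vertical overlap."""
    lines = []
    for box in sorted(boxes, key=lambda b: b[1]):  # left to right
        def overlaps(line):
            top, _, bottom, _ = extent(line)
            return box[0] <= bottom and box[2] >= top
        merged = [box] + [b for line in lines if overlaps(line) for b in line]
        lines = [line for line in lines if not overlaps(line)] + [merged]
    return lines

def split_tall_lines(img, lines, factor=1.5):
    """Split candidates much taller than a typical component at a profile valley."""
    typical = median(b[2] - b[0] + 1 for line in lines for b in line)
    result = []
    for line in lines:
        top, left, bottom, right = extent(line)
        if bottom - top + 1 <= factor * typical or bottom - top < 2:
            result.append(line)
            continue
        # Local horizontal projection profile: ink count per row inside the line.
        valley = min(range(top + 1, bottom),
                     key=lambda r: img[r][left:right + 1].count(INK))
        upper = [b for b in line if (b[0] + b[2]) / 2 <= valley]
        lower = [b for b in line if (b[0] + b[2]) / 2 > valley]
        result.extend(part for part in (upper, lower) if part)
    return result

def merge_small_lines(lines, factor=0.5):
    """Attach very short lines (e.g. i-dots) to the vertically nearest line."""
    heights = [e[2] - e[0] + 1 for e in map(extent, lines)]
    typical = median(heights)
    big = [l for l, h in zip(lines, heights) if h >= factor * typical]
    small = [l for l, h in zip(lines, heights) if h < factor * typical]
    for s in small:
        s_top, _, s_bottom, _ = extent(s)
        def gap(line):
            top, _, bottom, _ = extent(line)
            return max(top - s_bottom, s_top - bottom)
        min(big, key=gap).extend(s)
    return big

def segment_lines(img):
    """Return the sorted bounding boxes of the detected text lines."""
    boxes = connected_components(img)
    if not boxes:
        return []
    lines = group_into_lines(boxes)
    return sorted(extent(l) for l in merge_small_lines(split_tall_lines(img, lines)))

page = (["..#.......", ".........."] + ["#.#.##.#.."] * 3
        + [".........."] * 2 + ["##.#.##..."] * 3)
print(segment_lines(page))  # [(0, 0, 4, 7), (7, 0, 9, 6)]
```

In the toy page, the single pixel in the first row plays the role of an i-dot: it forms its own line candidate at first and is then merged into the line directly below it.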

Input: bitonal image, text regions (PAGE XML), parameters

Output: regions with text lines (PAGE XML)
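To illustrate the PAGE XML input and output, the sketch below reads text region outlines and appends a text line to a region using Python's standard `xml.etree.ElementTree`. It assumes the 2013-07-15 PAGE namespace, in which `Coords` carries a `points` attribute; files written with other schema versions may use a different namespace or coordinate encoding. The sample document is a trimmed fragment (not schema-complete), and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; check the namespace declared in your own files.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
NS = {"pc": PAGE_NS}

SAMPLE = f"""<PcGts xmlns="{PAGE_NS}">
  <Page imageFilename="page.tif" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,400 100,400"/>
    </TextRegion>
  </Page>
</PcGts>"""

def text_regions(root):
    """Map each TextRegion id to its outline as a list of (x, y) tuples."""
    regions = {}
    for region in root.iterfind(".//pc:TextRegion", NS):
        points = region.find("pc:Coords", NS).get("points")
        regions[region.get("id")] = [
            tuple(int(v) for v in pair.split(",")) for pair in points.split()
        ]
    return regions

def add_text_line(region, line_id, points):
    """Append a TextLine with the given outline to a TextRegion element."""
    line = ET.SubElement(region, f"{{{PAGE_NS}}}TextLine", {"id": line_id})
    ET.SubElement(line, f"{{{PAGE_NS}}}Coords",
                  {"points": " ".join(f"{x},{y}" for x, y in points)})
    return line

ET.register_namespace("", PAGE_NS)  # keep PAGE as the default namespace on output
root = ET.fromstring(SAMPLE)
print(text_regions(root))
region = root.find(".//pc:TextRegion", NS)
add_text_line(region, "l1", [(100, 100), (900, 100), (900, 150), (100, 150)])
print(ET.tostring(root, encoding="unicode"))
```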

Page 6: Density Word Segmenter

Adaptive projection-profile based approach using foreground pixel density

Input: bitonal image, text regions and lines (PAGE XML), parameters

Output: regions, text lines and words (PAGE XML)

For each text line:

Generate vertical projection profile

Find delimiting white spaces using an adaptive threshold based on the density of foreground pixels in the line

Group connected components into words
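The per-line procedure above might look like the following sketch. The slides do not give the actual threshold formula, so the rule used here (a gap threshold proportional to line height and to the share of background pixels) is purely an assumption for illustration. Runs of inked columns stand in for connected components, and the `scale` parameter and all function names are hypothetical.

```python
def vertical_profile(line_img):
    """Ink count per column of a text line (rows are strings of '#' and '.')."""
    return [sum(row[x] == "#" for row in line_img) for x in range(len(line_img[0]))]

def ink_runs(profile):
    """Maximal runs of non-empty columns as (start, end) pairs, end inclusive."""
    runs, start = [], None
    for x, count in enumerate(profile + [0]):  # sentinel closes a trailing run
        if count and start is None:
            start = x
        elif not count and start is not None:
            runs.append((start, x - 1))
            start = None
    return runs

def gap_threshold(line_img, scale=0.5):
    """Assumed adaptive rule: denser lines get a smaller word-gap threshold."""
    height, width = len(line_img), len(line_img[0])
    ink = sum(row.count("#") for row in line_img)
    density = ink / (height * width)
    return max(1.0, scale * height * (1.0 - density))

def segment_words(line_img, scale=0.5):
    """Group ink runs into words; a gap wider than the threshold starts a new word."""
    runs = ink_runs(vertical_profile(line_img))
    if not runs:
        return []
    threshold = gap_threshold(line_img, scale)
    words = [list(runs[0])]
    for start, end in runs[1:]:
        gap = start - words[-1][1] - 1  # number of empty columns between runs
        if gap > threshold:
            words.append([start, end])
        else:
            words[-1][1] = end
    return [tuple(w) for w in words]

line = ["##.##...#.##"] * 6
print(segment_words(line))  # [(0, 4), (8, 11)]
```

In this toy line the threshold works out to 1.25 columns, so the one-column gaps are treated as inter-character spacing and the three-column gap as a word boundary.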

Page 7: Evaluation

Text line ground truth: 25 historical documents (more than 2700 text lines)

Results (using the USAL layout evaluation tool): [results figure]

Word ground truth: 15 historical documents (more than 14500 words)

Results (using the USAL layout evaluation tool): [results figure]

Page 8: Further Information

PRImA: http://www.primaresearch.org

IMPACT: http://www.impact-project.eu