IMPACT Final Conference - USAL - Text line and word segmentation
IMPACT Research: Image Enhancement, Segmentation, Experimental OCR
Apostolos Antonacopoulos
PRImA Lab, The University of Salford, United Kingdom
www.primaresearch.org
Outline
  Overview: digitisation workflow
  Image enhancement
    Border removal
    Page curl removal
    Correction of arbitrary warping
  Segmentation
    Recognition-based
    Standalone
  Typewritten document OCR
  Word spotting
Overview: Digitisation Workflow
Main steps:
① Scanning
② Image enhancement
    Page splitting
    Border removal
    Page curl removal
    Dewarping
③ Layout analysis
    Segmentation of regions, lines, words and characters
    Region classification
    Logical layout analysis
④ OCR (incl. specialist or word spotting)
⑤ Post-processing
Textline and Word Segmentation
Standalone methods that can be integrated into systems without the need to integrate the FR engine
Not based on recognition of characters/words: suitable for documents with non-dictionary words, or documents that are not practical to OCR (word spotting)
Used in other IMPACT methods: typewritten OCR, correction of arbitrary warping, word spotting
Hybrid Text Line Segmenter
Hybrid approach based on connected component clustering and projection profiles

Steps:
  Connected component extraction (incl. noise filtering)
  Group components into line candidates using an efficient data structure
  Find and split under-segmented lines using local projection profiles
  Merge small peripheral lines into the appropriate neighbour (e.g. for i-dots)

Input: bitonal image; text regions (PAGE XML); parameters
Output: regions with text lines (PAGE XML)
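The grouping and merging steps of the Hybrid Text Line Segmenter can be illustrated with a minimal sketch. It is not the PRImA implementation: it assumes connected components have already been extracted and are given as bounding boxes `(x0, y0, x1, y1)`, it uses a plain list scan rather than the efficient data structure mentioned on the slide, and it leaves out the projection-profile split of under-segmented lines. The function names and the `min_ratio` parameter are hypothetical.

```python
from statistics import median


def group_into_lines(boxes):
    """Greedily cluster component boxes (x0, y0, x1, y1) into line candidates.

    Taller components are placed first so that they define the lines. A box
    joins the first line whose vertical extent contains the box's vertical
    centre; otherwise it starts a new line candidate.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b[3] - b[1], reverse=True):
        centre_y = (box[1] + box[3]) / 2
        for line in lines:
            if line["top"] <= centre_y <= line["bottom"]:
                line["top"] = min(line["top"], box[1])
                line["bottom"] = max(line["bottom"], box[3])
                line["boxes"].append(box)
                break
        else:
            lines.append({"top": box[1], "bottom": box[3], "boxes": [box]})
    return lines


def vertical_gap(a, b):
    """Vertical distance between two lines (0 if their extents overlap)."""
    return max(0, max(a["top"], b["top"]) - min(a["bottom"], b["bottom"]))


def merge_small_lines(lines, min_ratio=0.5):
    """Merge lines much shorter than the median height into the nearest line.

    min_ratio is an assumed tuning parameter, not a value from the slides.
    """
    if len(lines) < 2:
        return lines
    typical = median(line["bottom"] - line["top"] for line in lines)
    big = [l for l in lines if l["bottom"] - l["top"] >= min_ratio * typical]
    small = [l for l in lines if l["bottom"] - l["top"] < min_ratio * typical]
    if not big:
        return lines
    for s in small:
        target = min(big, key=lambda l: vertical_gap(l, s))
        target["top"] = min(target["top"], s["top"])
        target["bottom"] = max(target["bottom"], s["bottom"])
        target["boxes"].extend(s["boxes"])
    return sorted(big, key=lambda l: l["top"])


# Two text lines; the box at y 4..7 plays the role of an i-dot above line one.
boxes = [
    (0, 10, 8, 30), (10, 10, 18, 30), (20, 12, 24, 30), (20, 4, 24, 7),
    (0, 50, 8, 70), (10, 50, 18, 70),
]
result = merge_small_lines(group_into_lines(boxes))
```

In this example the i-dot first forms its own tiny line candidate, because its centre lies above the letters, and is then merged into the first text line, which is the situation the final step on the slide addresses.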
Density Word Segmenter
Adaptive projection-profile based approach using foreground pixel density

Input: bitonal image; text regions and lines (PAGE XML); parameters
Output: regions, text lines and words (PAGE XML)

For each text line:
  Generate the vertical projection profile
  Find delimiting white spaces using an adaptive threshold based on the density of foreground pixels in the line
  Group connected components into words
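The per-line procedure of the Density Word Segmenter can be sketched as follows. The slides do not give the threshold formula, so the rule used here (`min_gap = gap_factor * height * (1 - density)`) is a placeholder assumption, as are the function names and the `gap_factor` parameter. For brevity the sketch returns word column spans rather than grouping connected components.

```python
def vertical_profile(line):
    """Count foreground pixels (1s) in each column of a bitonal text line."""
    return [sum(column) for column in zip(*line)]


def segment_words(line, gap_factor=1.0):
    """Split a text line (list of equal-length rows of 0/1) into words.

    Returns a list of (start, end) column ranges, end exclusive.
    """
    profile = vertical_profile(line)
    ink_columns = [x for x, count in enumerate(profile) if count > 0]
    if not ink_columns:
        return []
    first, last = ink_columns[0], ink_columns[-1]

    # Foreground density inside the inked extent of the line.
    height = len(line)
    width = last - first + 1
    density = sum(profile) / (height * width)

    # Hypothetical adaptive rule (not from the slides): sparser lines
    # need a wider white gap before it counts as a word delimiter.
    min_gap = gap_factor * height * (1.0 - density)

    words = []
    word_start = first
    gap_start = None
    for x in range(first, last + 1):
        if profile[x] == 0:
            if gap_start is None:
                gap_start = x
        elif gap_start is not None:
            if x - gap_start >= min_gap:
                words.append((word_start, gap_start))
                word_start = x
            gap_start = None
    words.append((word_start, last + 1))
    return words


# Two "words": one-column gaps separate letters, a four-column gap separates words.
line = [[int(c) for c in "11011000011011011"] for _ in range(4)]
print(segment_words(line))  # [(0, 5), (9, 17)]
```

Connected components could then be assigned to words by checking which span contains each component's horizontal centre.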
Evaluation
Text line ground truth: 25 historical documents (more than 2700 text lines)
Results (using USAL layout evaluation tool):
Word ground truth: 15 historical documents (more than 14500 words)
Results (using USAL layout evaluation tool):
Further Information
PRImA: http://www.primaresearch.org
IMPACT: http://www.impact-project.eu