CHAPTER 3
SYSTEM OVERVIEW
3.1 INTRODUCTION
The recent advances in digital technology have resulted in more
and more multimedia databases containing images and video, in addition to
textual information. Multimedia documents with embedded text present
many challenging research issues in document analysis and recognition.
Among them, text detection and extraction are critical pre-processing steps for
automated indexing systems using OCR. The extraction of text may seem to
be a trivial application for existing optical character recognition (OCR) tools.
However, compared to OCR for document images, extracting text from real
images faces numerous challenges due to low resolution, unknown text color,
size and position, or complex backgrounds. This has led to the development
of new techniques to tackle these challenges. With this in view, this thesis
aims at presenting a robust system for automatically extracting the text
appearing in heterogeneous textual images, such as scene text images,
caption text images and document images, within a unified text extraction
framework. The thesis also discusses the text detection, localization and
binarization problems, and presents their solutions in the following chapters.
This short chapter, organized in three sections, gives a global view
of the proposed system for extracting text embedded in images and videos.
An overview of the TIE system is presented in the first section. In the second
section, our text detection/localization and text binarization
systems are briefly introduced, which helps in understanding the related
techniques that will be proposed in the subsequent chapters. The third section
presents and discusses the characteristics of the databases used for
evaluation in this thesis.
3.2 TEXT INFORMATION EXTRACTION (TIE) SYSTEM
Text information extraction from an image can be divided into the
following sub-tasks:
Text Detection (TD): The text detection problem involves
locating the text regions in a textual image. Text detection takes
a raw image as input. Its aim is to decide whether there is any text
present in the image and, if so, to give those parts of the image
containing the text as output.
Text Localization (TL): Text localization groups the text
regions identified at the detection stage into text instances. It
takes the output of the TD step as input and merges text regions
which belong to the same text candidate. Its final task is to
determine the exact coordinates of the text position.
Text Tracking (TT): Text often spans many, even hundreds of,
frames in a digital video. In such cases, text tracking algorithms
are used to exploit the temporal occurrence of the text over a
sequence of video frames. This step tracks the text across the
video frames and reduces the processing time, since the TD/TL
algorithms need not be applied to every frame separately.
Text Binarization: The text binarization problem involves separating the text strokes from the background in a localized text region. The output of a binarization module is a binary image, with the pixels corresponding to text strokes marked as one binary level, and the background pixels marked as the other.
Text Recognition: The final stage is the text recognition problem, in which the text appearing in the binarized image is recognized. This stage performs optical character recognition (OCR) on the binarized text image and converts it into the corresponding ASCII text.
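The sub-tasks above can be sketched as a pipeline. The following Python sketch is purely illustrative and is not the thesis code (the thesis implementations are in Matlab); every stage is a toy stand-in: detection marks dark pixels as candidates, localization merges all candidates into one bounding box, and recognition merely reports the stroke-pixel count.

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    x: int
    y: int
    w: int
    h: int

def detect_text(image):
    # TD: toy stand-in that marks each dark pixel (value 0) as a 1x1 candidate.
    return [TextRegion(x, y, 1, 1)
            for y, row in enumerate(image)
            for x, v in enumerate(row) if v == 0]

def localize_text(regions):
    # TL: toy stand-in that merges all candidates into one bounding box.
    if not regions:
        return []
    x0 = min(r.x for r in regions); y0 = min(r.y for r in regions)
    x1 = max(r.x + r.w for r in regions); y1 = max(r.y + r.h for r in regions)
    return [TextRegion(x0, y0, x1 - x0, y1 - y0)]

def binarize(image, region):
    # Text strokes become one binary level (1), the background the other (0).
    return [[1 if image[y][x] == 0 else 0
             for x in range(region.x, region.x + region.w)]
            for y in range(region.y, region.y + region.h)]

def recognize(binary):
    # OCR stand-in: reports the stroke-pixel count instead of real characters.
    return f"<{sum(map(sum, binary))} stroke pixels>"

def extract_text(image):
    # TD -> TL -> binarization -> recognition; tracking is omitted, as for still images.
    return [recognize(binarize(image, r)) for r in localize_text(detect_text(image))]
```

On a toy 3x3 image with two adjacent dark pixels, the pipeline localizes a single 2x1 box and "recognizes" two stroke pixels.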
The architecture of the text information extraction system is shown in Figure 3.1.
[Figure: pipeline blocks Input image → Text detection → Text localization → Text identification → Text segmentation and binarization → Character recognition (OCR) → ASCII text]
Figure 3.1 Text Information Extraction (TIE) System architecture
As the thesis focuses on TIE from still textual images, text
tracking (required only for tracking text across video frames) is not
considered. The processes associated with the TIE system here are as
follows: text detection, text localization, text segmentation and
binarization, and text recognition. The recognition problem is not discussed,
as it is beyond the scope of this thesis. It is assumed that once a text has been
binarized, any of the many commercial document-image OCR systems could
be used for the recognition stage.
3.3 METHODOLOGIES
Two different methodologies for text detection/localization are proposed. A
solution to the text segmentation and binarization problem is also presented.
Experimental results for different sets of images will be presented to
demonstrate the performance of the approaches.
3.3.1 Text Detection / Localization (TD/TL) Methods
The observation from the literature that a separate algorithm needs to
be designed for each kind of image sparked the idea of developing a
unified scheme to distinguish text from non-text in heterogeneous
images, one that takes care of variations in illumination, transformation/
perspective projection, font size, and angular and radially changing text. The
various factors that motivated this research are:
The need for a unified text detection and localization system to
extract the text from heterogeneous images with an improved
recall rate.
The need for a text detection and localization system that is
robust to various text font sizes, location and orientation of text,
complex background, lighting conditions and perspective
projection.
Two different TD/TL methods for detecting and localizing the text
in heterogeneous textual images are presented; they tackle the issues
reported in the literature with two different philosophies, and prepare the
text for the subsequent segmentation and binarization process.
SBTA-TD/TL - Sub-band Texture Analysis based text detection and
localization:
In order to develop a common framework for text extraction from
various images, a methodology has to be designed to handle variations in the
text, such as font size and orientation. It is to be emphasized that, compared to
the edge-based and CC-based methods of TD/TL, texture-based methods are
more accurate when the text is embedded in a complex background; they
consider text as regions with distinct textural properties. Capturing these
textural details in various directions enables the design of a unified
framework for text extraction. In this work, the non-subsampled
contourlet transform (NSCT) is introduced to handle variations in the text,
and its sub-bands are analyzed to distinguish the text and non-text regions.
The SBTA-TD/TL method follows the image-analysis based
approach and consists of the following processes: candidate text region
detection, energy computation and texture analysis, text region localization,
and text region extraction. First, the candidate text regions are identified by
decomposing the input image with the NSCT into eight directional sub-band
outputs at three different scales. The energy is computed for each sub-band,
and the sub-bands are categorized into strong and weak bands. Subsequently,
a boosting level is applied to the weak bands so as to bring them to the level
of the strong bands.
Then, edge detection followed by a suitable dilation operator is applied. The
strong and boosted edges after dilation are combined by addition followed
by a logical AND operation, which forms the text region. Finally, the
remaining non-text regions are identified and eliminated. This method works
across heterogeneous images, and is robust to a limited range of character
font sizes and text orientations. It provides encouraging results when
compared with other techniques.
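The energy computation and weak-band boosting step can be illustrated in isolation. The sketch below is a NumPy approximation, not the thesis code: it takes pre-computed sub-band coefficient arrays (an NSCT implementation is not reproduced here), defines energy as the mean squared coefficient, thresholds at the mean energy to separate strong from weak bands, and assumes that boosting means scaling each weak band so its energy matches the average strong-band energy; the exact rule used in the thesis is specified in a later chapter.

```python
import numpy as np

def band_energy(band):
    # Energy of one directional sub-band: mean squared coefficient magnitude.
    return float(np.mean(np.square(band)))

def split_and_boost(subbands):
    """Categorize sub-bands into strong/weak by energy and boost the weak ones.

    Assumed rule (for illustration only): a band is "strong" if its energy is
    at least the mean energy over all bands; each weak band is then rescaled
    so its energy equals the average strong-band energy.
    """
    energies = np.array([band_energy(b) for b in subbands])
    threshold = energies.mean()
    strong = [b for b, e in zip(subbands, energies) if e >= threshold]
    weak = [b for b, e in zip(subbands, energies) if e < threshold]
    # Target level: average energy of the strong bands.
    target = np.mean([band_energy(b) for b in strong])
    # Scaling coefficients by sqrt(target/energy) scales energy to the target.
    boosted = [b * np.sqrt(target / band_energy(b)) for b in weak]
    return strong, boosted
```

After boosting, the strong and boosted bands would feed the edge detection and dilation stage described above.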
MLFP-TD/TL - Multi-Level Feature Priority based text detection
and localization method:
This approach is designed to improve upon the robustness of the SBTA-
TD/TL method in handling various orientations of text, a wide range of font
sizes, different lighting conditions and perspective-projected images. The
incorporation of a feature selection algorithm enables the development of a
common TD/TL framework for heterogeneous and hybrid textual images,
moving from a domain-dependent approach to a domain-independent one.
This algorithm is introduced to select the optimal features for distinguishing
text from non-text for all types of images, with all the considered variations
in illumination, transformation/perspective projection, font size, and angular
and radially changing text.
The MLFP-TD/TL applies image analysis and a machine-learning
based approach, carried out in two phases, namely, offline processing (the
training phase) and online processing (the testing phase). Offline processing
extracts and generates feature vectors for training images from the image
corpus, which constitutes heterogeneous images with variations in lighting,
orientation and font size, and trains the classifier; the testing phase then
classifies samples based on it. In the MLFP-TD/TL, candidate text region
detection is composed of i) the decomposition
of the input image with the non-subsampled contourlet transform (NSCT) and
ii) the MLFP based feature selection algorithm for selecting the best features
from the merged sub-bands, so as to better distinguish the text and non-text
regions with the help of the classifier stage. These detected candidate text
regions are later verified at the text localization stage, where the angular
closing operation is introduced to handle radially changing text.
During online processing, when the user supplies an input image,
the selected features are extracted from the transformed and merged
contourlet coefficients, and are used to classify the regions as candidate text
and non-text regions with the neural network classifier. Subsequently, the
candidate text regions undergo selected verification rules to eliminate the
unwanted non-text regions. The optimality of the chosen features, their
extraction in various directions, and the scale-invariant nature of the NSCT
together provide improved performance when compared with the existing
methods.
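The offline/online split can be sketched as follows. This is an illustrative stand-in, not the MLFP algorithm itself: feature priority is approximated by a simple class-mean-difference score, and the neural network classifier is replaced by a nearest-centroid rule so the example stays self-contained.

```python
import numpy as np

def offline_phase(X, y, k):
    """Training phase: rank features and fit a classifier on the selected ones.

    Stand-in for MLFP feature selection: each feature is scored by the absolute
    difference of its class means (text = 1 vs non-text = 0) and the top k are
    kept. The real MLFP priority levels are more elaborate; only the
    select-then-train structure is mirrored here.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    selected = np.argsort(score)[::-1][:k]          # indices of top-k features
    # Nearest-centroid "classifier" over the selected features only.
    centroids = {c: X[y == c][:, selected].mean(axis=0) for c in (0, 1)}
    return selected, centroids

def online_phase(sample, selected, centroids):
    """Testing phase: extract only the selected features and classify the region."""
    v = np.asarray(sample, float)[selected]
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))
```

In the thesis the classifier stage is a neural network trained in WEKA; the nearest-centroid rule here merely keeps the two-phase flow visible.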
3.3.2 Text Segmentation and Binarization method
Works reported in the literature demand the development of new
techniques for a text binarization system that can handle color documents
with multi-colored text, the presence of text-like objects, uneven illumination
and textured backgrounds. This challenge is met by the Integration of Edge
and Color Analysis (IECA) technique.
Here, an approach is proposed in which the localized region is first
separated into individual characters, and binarization is then performed on
each individual character image to avoid windowing artifacts. Suitable
preprocessing techniques (Zhang et al 2009) are applied first to increase the
contrast of the image and blur the background noise due to the textured
background. Then, edges are detected with iterative thresholding and the
morphological erosion algorithm (Yan et al 2008), rather than a conventional
edge algorithm, so as to maintain edge continuity and avoid edge overlapping.
A bounding box is then generated for the detected edges, termed an
edge-box (EB); the uniform edge-box size (Ts) is determined by the
proposed sliding-window based Character Uniformity Check (CUC) algorithm.
EBs with sizes smaller than Ts are removed.
Subsequently, the unwanted edges resembling a character are
removed, and the characters are segmented with the proposed Edge Quadrant
Coverage Analysis (EQCA) algorithm. Edge intensity information is used to
filter non-character edges and to identify/segment candidate character EBs.
However, it cannot be applied further to remove the patches present in
character EBs due to background edges, if the intensity of the background
edges is similar to that of the character edges. Therefore, the Corner Vertices
Color Analysis (CVCA) algorithm is proposed, which uses the color
information of the edges as an attribute to remove those background patches.
It is based on the K-means clustering algorithm in the HCL (hue, chroma,
luminance) color space (Sarifuddin et al 2005), a color model close to the
human perception of colors. This method is able to binarize images taken
under uneven illumination, by binarizing each segmented character rather
than the whole text row, and is able to handle multi-colored text with CVCA
based binarization, by analyzing only the segmented character without
depending upon the color of the neighboring components. The reported
performance results for the IECA method are competitive with the results
reported by other researchers.
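The color-clustering idea behind CVCA can be illustrated with a minimal two-cluster K-means sketch. This is not the thesis implementation: it clusters raw RGB triples rather than HCL coordinates, and it assumes, purely for illustration, that the dominant cluster inside a character edge-box carries the character color, with the minority cluster treated as background patches to be removed.

```python
import numpy as np

def kmeans_2(colors, iters=20, seed=0):
    # Minimal 2-means on color vectors (the thesis clusters in HCL space;
    # plain RGB triples are used here purely for illustration).
    rng = np.random.default_rng(seed)
    colors = np.asarray(colors, float)
    centers = colors[rng.choice(len(colors), 2, replace=False)]
    for _ in range(iters):
        # Assign each color to the nearest of the two centers.
        d = np.linalg.norm(colors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned colors.
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = colors[labels == k].mean(axis=0)
    return labels, centers

def keep_character_cluster(labels):
    # Hypothetical CVCA-style rule: the dominant cluster inside a character
    # edge-box is taken as the character color; the minority cluster is
    # treated as background patches and dropped.
    counts = np.bincount(labels, minlength=2)
    return labels == counts.argmax()
```

Applied to the pixels of one segmented character, this yields a per-pixel mask that keeps the character-colored cluster, independently of the colors of neighboring components.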
An attempt was made to integrate all the above mentioned
components, namely text detection, localization and binarization, to build a
unified framework for text extraction from heterogeneous textual images.
The integration of all three yields promising results, except in scenarios
where touching characters are present for binarization.
3.4 SYSTEM REQUIREMENTS
This thesis is intended to handle various kinds of images effectively.
The image processing programs for the various methodologies in this thesis
are therefore implemented using Matlab. The Waikato Environment for
Knowledge Analysis (WEKA) tool is used to implement the machine learning
algorithms used in this thesis. The demo version of the commercial Optical
Character Recognition (OCR) software, ABBYY Fine Reader 7.0
Professional, is used for recognition, to evaluate the performance of the
text binarization system developed in this thesis.
Software and Hardware requirements for this thesis are listed below:
Software requirements: Matlab 7.6, WEKA Data mining tool,
OCR software (ABBYY Fine Reader 7.0 Professional)
Hardware requirements:
PC with Pentium 1 GHz processor
1 GB RAM
40 GB hard disk drive
CD-ROM or DVD-ROM drive
Keyboard and a Microsoft Mouse or other compatible
pointing device
Video adapter and monitor with Super VGA (800 x 600) or
higher resolution
3.5 DATABASES FOR EXPERIMENTS
This thesis is proposed to handle caption text images, scene text
images and document images. These heterogeneous images were collected
from various benchmark sources.
Text detection / localization system: For this process, data set
images have been gathered from several sources, such as:
Laboratory for Language and Media Processing (LAMP)
Automatic Movie Content Analysis (MoCA) Project
Computer Vision Lab., Pennsylvania State University.
ICDAR 2003 competition dataset
Synthetic datasets taken by a digital camera
They collectively form 4 sets of images with various image types as
shown in Table 3.1.
Table 3.1 Data set for TD/TL system
Set no.  Image Group          Image type
Set 1    Caption text images  News telecast and video titles
Set 2    Scene text images    Name plate images, number plate images, book
                              cover images and general embedded text images.
                              Variations: perspective projection, angular text,
                              radially changing text, illumination (day, evening
                              and night light). Languages tested: English,
                              Tamil and Hindi
Set 3    Document images      Scanned document images, web document images
Set 4    Hybrid               Scene and caption text in the same image
Some sample caption text, scene text and document images are
shown in Figure 3.2.
[Figure: (a) caption text images; (b) scene text images; (c) document images]
Figure 3.2 Sample data set images for TD/TL system
Feature selection algorithm: The data set for the feature selection
algorithm is created by
Pooling features from five different texture models (Chapter 5)
comprising 55 features for the textual dataset.
In addition to this, some more data sets are selected from the
UCI Machine Learning Repository such as
- Splice
- Chemical
- CoIL2000
Text segmentation and binarization system: Test sets are
selected from the ICDAR 2003 dataset to cover a wide variety of
background complexity and different text colors, fonts, sizes
and uneven illumination.
Localized text regions from the proposed SBTA-TD/TL and
MLFP-TD/TL systems.
Some sample images for the text segmentation and binarization
system are shown in Figure 3.3.
Figure 3.3 Sample data set images for text binarization system
3.6 SUMMARY
This chapter gave a global view of the proposed system for
extracting text embedded in images and videos. The first section presented
an overview of the TIE system. The second section briefly introduced the
proposed text detection/localization and text binarization systems. The third
section discussed the characteristics of the databases used for evaluation in
this thesis.