CHAPTER 3
SYSTEM OVERVIEW
3.1 INTRODUCTION
The recent advances in digital technology have resulted in more
and more multimedia databases containing images and video, in addition to
textual information. Multimedia documents with embedded text present
many challenging research issues in document analysis and recognition.
Among them, text detection and extraction are critical pre-processing steps for
automated indexing systems using OCR. The extraction of text may seem to
be a trivial application for existing optical character recognition (OCR) tools.
However, compared to OCR for document images, extracting text from real
images faces numerous challenges due to low resolution, unknown text color,
size and position, or complex backgrounds. This has led to the development
of new techniques to tackle these challenges. With this in view, this thesis
aims at presenting a robust system for automatically extracting the text
appearing in heterogeneous textual images, such as scene text images,
caption text images and document images, within a unified text extraction
framework. The thesis also discusses the text detection, localization and
binarization problems, and presents their solutions in the following chapters.
This short chapter, organized in three sections, gives a global view
of the proposed system for extracting text embedded in images and videos.
An overview of the TIE system is presented in the first section. In the second
section, our text detection/localization and text binarization
systems are briefly introduced, which helps in understanding the related
techniques that will be proposed in the subsequent chapters. The third section
presents and discusses the characteristics of the databases used for
evaluation in this thesis.
3.2 TEXT INFORMATION EXTRACTION (TIE) SYSTEM
Text information extraction from an image can be divided into the
following sub-tasks:
Text Detection (TD): The text detection problem involves
locating the text regions in a textual image. Text detection takes
a raw image as input. Its aim is to decide whether there is any text
present in the image and, if so, to give those parts of the image
containing the text as output.
Text Localization (TL): Text localization groups the text
regions identified at the detection stage into text instances. It
takes the output of the TD step as input and merges text regions
which belong to the same text candidate. Its final task is to
determine the exact coordinates of the text position.
Text Tracking (TT): Text often spans many, even hundreds of,
frames in a digital video. In such cases, text tracking algorithms
are used to exploit the temporal occurrence of the text over a
sequence of video frames. This step tracks the text across the
video frames and reduces the processing time, since the TD/TL
algorithms need not be applied to every frame separately.
Text Binarization: The text binarization problem involves separating the text strokes from the background in a localized text region. The output of a binarization module is a binary image, with the pixels corresponding to text strokes marked as one binary level, and the background pixels marked as the other.
Text Recognition: The final stage is the text recognition problem, in which the text appearing in the binarized image is recognized. This stage performs optical character recognition (OCR) on the binarized text image and converts it into the corresponding ASCII text.
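The sub-tasks above can be sketched as a pipeline. The following Python sketch is purely illustrative and is not the thesis code (the thesis implementations are in Matlab); every stage is a toy stand-in: detection marks dark pixels as candidates, localization merges all candidates into one bounding box, and recognition merely reports the stroke-pixel count.

```python
from dataclasses import dataclass

@dataclass
class TextRegion:
    x: int
    y: int
    w: int
    h: int

def detect_text(image):
    # TD: toy stand-in that marks each dark pixel (value 0) as a 1x1 candidate.
    return [TextRegion(x, y, 1, 1)
            for y, row in enumerate(image)
            for x, v in enumerate(row) if v == 0]

def localize_text(regions):
    # TL: toy stand-in that merges all candidates into one bounding box.
    if not regions:
        return []
    x0 = min(r.x for r in regions); y0 = min(r.y for r in regions)
    x1 = max(r.x + r.w for r in regions); y1 = max(r.y + r.h for r in regions)
    return [TextRegion(x0, y0, x1 - x0, y1 - y0)]

def binarize(image, region):
    # Text strokes become one binary level (1), the background the other (0).
    return [[1 if image[y][x] == 0 else 0
             for x in range(region.x, region.x + region.w)]
            for y in range(region.y, region.y + region.h)]

def recognize(binary):
    # OCR stand-in: reports the stroke-pixel count instead of real characters.
    return f"<{sum(map(sum, binary))} stroke pixels>"

def extract_text(image):
    # TD -> TL -> binarization -> recognition; tracking is omitted, as for still images.
    return [recognize(binarize(image, r)) for r in localize_text(detect_text(image))]
```

On a toy 3x3 image with two adjacent dark pixels, the pipeline localizes a single 2x1 box and "recognizes" two stroke pixels.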
The architecture of the text information extraction system is shown in Figure 3.1.
[Figure: pipeline blocks Input image → Text detection → Text localization → Text identification → Text segmentation and binarization → Character recognition (OCR) → ASCII text]
Figure 3.1 Text Information Extraction (TIE) System architecture
As the thesis focuses on TIE from still textual images, text
tracking (required only for tracking text across video frames) is not
considered. The processes associated with the TIE system here are as
follows: text detection, text localization, text segmentation and
binarization, and text recognition. The recognition problem is not discussed,
as it is beyond the scope of this thesis. It is assumed that once a text has been
binarized, any of the many commercial document-image OCR systems could
be used for the recognition stage.
3.3 METHODOLOGIES
Two different methodologies for text detection/localization are proposed. A
solution to the text segmentation and binarization problem is also presented.
Experimental results for different sets of images will be presented to
demonstrate the performance of the approaches.
3.3.1 Text Detection / Localization (TD/TL) Methods
The observation from the literature that a separate algorithm needs to
be designed for each kind of image sparked the idea of developing a
unified scheme to distinguish text from non-text in heterogeneous
images, one that takes care of variations in illumination, transformation/
perspective projection, font size, and angular and radially changing text. The
various factors that motivated this research are:
The need for a unified text detection and localization system to
extract the text from heterogeneous images with an improved
recall rate.
The need for a text detection and localization system that is
robust to various text font sizes, location and orientation of text,
complex background, lighting conditions and perspective
projection.
Two different TD/TL methods for detecting and localizing the text
in heterogeneous textual images are presented; they tackle the issues
reported in the literature with two different philosophies, and prepare the
text for the subsequent segmentation and binarization process.
SBTA-TD/TL - Sub-band Texture Analysis based text detection and
localization:
In order to develop a common framework for text extraction from
various images, a methodology has to be designed to handle variations in the
text, such as font size and orientation. It is to be emphasized that, compared to
the edge-based and CC-based methods of TD/TL, texture-based methods are
more accurate when the text is embedded in a complex background; they
consider text as regions with distinct textural properties. Capturing these
textural details in various directions enables the design of a unified
framework for text extraction. In this work, the non-subsampled
contourlet transform (NSCT) is introduced to handle variations in the text,
and its sub-bands are analyzed to distinguish the text and non-text regions.
The SBTA-TD/TL method follows the image-analysis based
approach and consists of the following processes: candidate text region
detection, energy computation and texture analysis, text region localization,
and text region extraction. First, the candidate text regions are identified by
decomposing the input image with the NSCT into eight directional sub-band
outputs at three different scales. The energy is computed for each sub-band,
and the sub-bands are categorized into strong and weak bands. Subsequently,
a boosting level is applied to the weak bands so as to bring them to the level
of the strong bands.
Then, edge detection followed by a suitable dilation operator is applied. The
strong and boosted edges after dilation are combined by addition followed
by a logical AND operation, which forms the text region. Finally, the
remaining non-text regions are identified and eliminated. This method works
across heterogeneous images, and is robust to a limited range of character
font sizes and text orientations. It provides encouraging results when
compared with other techniques.
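The energy computation and weak-band boosting step can be illustrated in isolation. The sketch below is a NumPy approximation, not the thesis code: it takes pre-computed sub-band coefficient arrays (an NSCT implementation is not reproduced here), defines energy as the mean squared coefficient, thresholds at the mean energy to separate strong from weak bands, and assumes that boosting means scaling each weak band so its energy matches the average strong-band energy; the exact rule used in the thesis is specified in a later chapter.

```python
import numpy as np

def band_energy(band):
    # Energy of one directional sub-band: mean squared coefficient magnitude.
    return float(np.mean(np.square(band)))

def split_and_boost(subbands):
    """Categorize sub-bands into strong/weak by energy and boost the weak ones.

    Assumed rule (for illustration only): a band is "strong" if its energy is
    at least the mean energy over all bands; each weak band is then rescaled
    so its energy equals the average strong-band energy.
    """
    energies = np.array([band_energy(b) for b in subbands])
    threshold = energies.mean()
    strong = [b for b, e in zip(subbands, energies) if e >= threshold]
    weak = [b for b, e in zip(subbands, energies) if e < threshold]
    # Target level: average energy of the strong bands.
    target = np.mean([band_energy(b) for b in strong])
    # Scaling coefficients by sqrt(target/energy) scales energy to the target.
    boosted = [b * np.sqrt(target / band_energy(b)) for b in weak]
    return strong, boosted
```

After boosting, the strong and boosted bands would feed the edge detection and dilation stage described above.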
MLFP-TD/TL - Multi-Level Feature Priority based text detection
and localization method:
This approach is designed to improve upon the robustness of the SBTA-
TD/TL method in handling various orientations of text, a wide range of font
sizes, different lighting conditions and perspective-projected images. The
incorporation of a feature selection algorithm enables the development of a
common TD/TL framework for heterogeneous and hybrid textual images,
moving from a domain-dependent approach to a domain-independent one.
This algorithm is introduced to select the optimal features for distinguishing
text from non-text for all types of images, with all the considered variations
in illumination, transformation/perspective projection, font size, and angular
and radially changing text.
The MLFP-TD/TL applies image analysis and a machine-learning
based approach, carried out in two phases, namely, offline processing (the
training phase) and online processing (the testing phase). Offline processing
extracts and generates feature vectors for training images from the image
corpus, which constitutes heterogeneous images with variations in lighting,
orientation and font size, and trains the classifier; the testing phase then
classifies samples based on it. In the MLFP-TD/TL, candidate text region
detection is composed of i) the decomposition
of the input image with the non-subsampled contourlet transform (NSCT) and
ii) the MLFP based feature selection algorithm for selecting the best features
from the merged sub-bands, so as to better distinguish the text and non-text
regions with the help of the classifier stage. These detected candidate text
regions are later verified at the text localization stage, where the angular
closing operation is introduced to handle radially changing text.
During online processing, when the user supplies an input image,
the selected features are extracted from the transformed and merged
contourlet coefficients, and are used to classify the regions as candidate text
and non-text regions with the neural network classifier. Subsequently, the
candidate text regions undergo selected verification rules to eliminate the
unwanted non-text regions. The optimality of the chosen features, their
extraction in various directions, and the scale-invariant nature of the NSCT
together provide improved performance when compared with the existing
methods.
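The offline/online split can be sketched as follows. This is an illustrative stand-in, not the MLFP algorithm itself: feature priority is approximated by a simple class-mean-difference score, and the neural network classifier is replaced by a nearest-centroid rule so the example stays self-contained.

```python
import numpy as np

def offline_phase(X, y, k):
    """Training phase: rank features and fit a classifier on the selected ones.

    Stand-in for MLFP feature selection: each feature is scored by the absolute
    difference of its class means (text = 1 vs non-text = 0) and the top k are
    kept. The real MLFP priority levels are more elaborate; only the
    select-then-train structure is mirrored here.
    """
    X, y = np.asarray(X, float), np.asarray(y)
    score = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    selected = np.argsort(score)[::-1][:k]          # indices of top-k features
    # Nearest-centroid "classifier" over the selected features only.
    centroids = {c: X[y == c][:, selected].mean(axis=0) for c in (0, 1)}
    return selected, centroids

def online_phase(sample, selected, centroids):
    """Testing phase: extract only the selected features and classify the region."""
    v = np.asarray(sample, float)[selected]
    return min(centroids, key=lambda c: np.linalg.norm(v - centroids[c]))
```

In the thesis the classifier stage is a neural network trained in WEKA; the nearest-centroid rule here merely keeps the two-phase flow visible.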
3.3.2 Text Segmentation and Binarization method
Works reported in the literature demand the development of new
techniques for a text binarization system that can handle color documents
with multi-colored text, the presence of text-like objects, uneven illumination
and textured backgrounds. This challenge is met by the Integration of Edge
and Color Analysis (IECA) technique.
Here, an approach is proposed in which the localized region is first
separated into individual characters, and binarization is then performed on
each individual character image to avoid windowing artifacts. Suitable
preprocessing techniques (Zhang et al 2009) are applied first to increase the
contrast of the image and blur the background noise due to the textured
background. Then, edges are detected with iterative thresholding and the
morphological erosion algorithm (Yan et al 2008), rather than a conventional
edge algorithm, so as to maintain edge continuity and avoid edge overlapping.
A bounding box is then generated for the detected edges, termed an
edge-box (EB); the uniform edge-box size (Ts) is determined by the
proposed sliding-window based Character Uniformity Check (CUC) algorithm.
EBs with sizes smaller than Ts are removed.
Subsequently, the unwanted edges resembling a character are
removed, and the characters are segmented with the proposed Edge Quadrant
Coverage Analysis (EQCA) algorithm. Edge intensity information is used to
filter non-character edges and to identify/segment candidate character EBs.
However, it cannot be applied further to remove the patches present in
character EBs due to background edges, if the intensity of the background
edges is similar to that of the character edges. Therefore, the Corner Vertices
Color Analysis (CVCA) algorithm is proposed, which uses the color
information of the edges as an attribute to remove those background patches.
It is based on the K-means clustering algorithm in the HCL (hue, chroma,
luminance) color space (Sarifuddin et al 2005), a color model close to the
human perception of colors. This method is able to binarize images taken
under uneven illumination, by binarizing each segmented character rather
than the whole text row, and is able to handle multi-colored text with CVCA
based binarization, by analyzing only the segmented character without
depending upon the color of the neighboring components. The reported
performance results for the IECA method are competitive with the results
reported by other researchers.
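The color-clustering idea behind CVCA can be illustrated with a minimal two-cluster K-means sketch. This is not the thesis implementation: it clusters raw RGB triples rather than HCL coordinates, and it assumes, purely for illustration, that the dominant cluster inside a character edge-box carries the character color, with the minority cluster treated as background patches to be removed.

```python
import numpy as np

def kmeans_2(colors, iters=20, seed=0):
    # Minimal 2-means on color vectors (the thesis clusters in HCL space;
    # plain RGB triples are used here purely for illustration).
    rng = np.random.default_rng(seed)
    colors = np.asarray(colors, float)
    centers = colors[rng.choice(len(colors), 2, replace=False)]
    for _ in range(iters):
        # Assign each color to the nearest of the two centers.
        d = np.linalg.norm(colors[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned colors.
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = colors[labels == k].mean(axis=0)
    return labels, centers

def keep_character_cluster(labels):
    # Hypothetical CVCA-style rule: the dominant cluster inside a character
    # edge-box is taken as the character color; the minority cluster is
    # treated as background patches and dropped.
    counts = np.bincount(labels, minlength=2)
    return labels == counts.argmax()
```

Applied to the pixels of one segmented character, this yields a per-pixel mask that keeps the character-colored cluster, independently of the colors of neighboring components.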
An attempt was made to integrate all the above mentioned
components, namely text detection, localization and binarization, to build a
unified framework for text extraction from heterogeneous textual images.
The integration of all three yields promising results, except in scenarios
where touching characters are present for binarization.
3.4 SYSTEM REQUIREMENTS
This thesis is intended to handle various kinds of images effectively.
The image processing programs for the various methodologies in this thesis
are therefore implemented using Matlab. The Waikato Environment for
Knowledge Analysis (WEKA) tool is used to implement the machine learning
algorithms used in this thesis. The demo version of the commercial Optical
Character Recognition (OCR) software, ABBYY Fine Reader 7.0
Professional, is used for recognition, to evaluate the performance of the
text binarization system developed in this thesis.
Software and Hardware requirements for this thesis are listed below:
Software requirements: Matlab 7.6, WEKA Data mining tool,
OCR software (ABBYY Fine Reader 7.0 Professional)
Hardware requirements:
PC with Pentium 1 GHz processor
1 GB RAM
40 GB hard disk drive
CD-ROM or DVD-ROM drive
Keyboard and a Microsoft Mouse or other compatible
pointing device
Video adapter and monitor with Super VGA (800 x 600) or
higher resolution
3.5 DATABASES FOR EXPERIMENTS
This thesis is proposed to handle caption text images, scene text
images and document images. These heterogeneous images were collected
from various benchmark sources.
Text detection / localization system: For this process, data set
images have been gathered from several sources, such as:
Laboratory for Language and Media Processing (LAMP)
Automatic Movie Content Analysis (MoCA) Project
Computer Vision Lab., Pennsylvania State University.
ICDAR 2003 competition dataset
Synthetic datasets taken by a digital camera
They collectively form 4 sets of images with various image types as
shown in Table 3.1.
Table 3.1 Data set for TD/TL system
Set no.  Image Group          Image type
Set 1    Caption text images  News telecast and video titles
Set 2    Scene text images    Name plate images, number plate images, book
                              cover images and general embedded text images.
                              Variations: perspective projection, angular text,
                              radially changing text, illumination (day, evening
                              and night light). Languages tested: English,
                              Tamil and Hindi
Set 3    Document images      Scanned document images, web document images
Set 4    Hybrid               Scene and caption text in the same image
Some sample caption text, scene text and document images are
shown in Figure 3.2.
[Figure: (a) caption text images; (b) scene text images; (c) document images]
Figure 3.2 Sample data set images for TD/TL system
Feature selection algorithm: The data set for the feature selection
algorithm is created by
Pooling features from five different texture models (Chapter 5)
comprising 55 features for the textual dataset.
In addition to this, some more data sets are selected from the
UCI Machine Learning Repository such as
- Splice
- Chemical
- CoIL2000
Text segmentation and binarization system: Test sets are
selected from the ICDAR 2003 dataset to cover a wide variety of
background complexity and different text colors, fonts, sizes
and uneven illumination.
Localized text regions from the proposed SBTA-TD/TL and
MLFP-TD/TL systems.
Some sample images for the text segmentation and binarization
system are shown in Figure 3.3.
Figure 3.3 Sample data set images for text binarization system
3.6 SUMMARY
This chapter gave a global view of the proposed system for
extracting text embedded in images and videos. The first section presented
an overview of the TIE system. The second section briefly introduced the
proposed text detection/localization and text binarization systems. The third
section discussed the characteristics of the databases used for evaluation in
this thesis.