DETECTING AND RECOGNIZING TEXT FROM VIDEO FRAMES
A THESIS SUBMITTED TO
THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF
THE MIDDLE EAST TECHNICAL UNIVERSITY
BY
SERHAT TEKİNALP
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
THE DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
SEPTEMBER 2002
Approval of the Graduate School of Natural and Applied Sciences ____________________________ Prof. Dr. Tayfur Öztürk Director
I certify that this thesis satisfies all the requirements as a thesis for the degree of Master of Science.
____________________________ Prof. Dr. Mübeccel Demirekler Head of Department
This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Master of Science in Electrical and Electronics Engineering.
____________________________ Assoc. Prof. Dr. A. Aydın Alatan Supervisor
Examining Committee Members: Assoc. Prof. Dr. İsmet Erkmen ____________________________ Assoc. Prof. Dr. A. Aydın Alatan ____________________________ Assoc. Prof. Dr. Gözde Bozdağı Akar ____________________________ Assoc. Prof. Dr. Buyurman Baykal ____________________________ Ersin Esen ____________________________
ABSTRACT
DETECTING AND RECOGNIZING TEXT FROM VIDEO FRAMES
Tekinalp, Serhat
M.S, Department of Electrical and Electronics Engineering
Supervisor: Assoc. Prof. Dr. A. Aydın Alatan
September 2002, 93 pages
An important source of information contained in digital video is the
text in video frames. This text may appear as a part of the scene (scene text)
or may be rendered artificially during production (superimposed text). By
detecting and recognizing videotext, it is possible to index and easily
manage large video archives. There are some basic properties that make videotext detectable: a distinctive texture, high contrast, and uniform color. By exploiting these properties, it is possible to detect text regions and to binarize the image for character recognition after thresholding these regions. In this thesis, a complete framework for the detection and recognition of videotext is presented. The performance of the system is tested in terms of its recognition rate for various combinations. The system achieves a character recognition rate of up to 59%, which is quite reasonable for most purposes.
Keywords: Text detection, text extraction, videotext, video-ocr,
texture, segmentation.
ÖZ
DETECTING AND RECOGNIZING TEXT FROM VIDEO FRAMES
Tekinalp, Serhat
M.S., Department of Electrical and Electronics Engineering
Supervisor: A. Aydın Alatan
September 2002, 93 pages
One of the most important sources of information in digital video is the text contained within the video. Text in video may appear as part of the scene, or it may be added artificially during the production of the video. Detecting and reading this text makes it possible to index and manage large video databases. Videotext has basic properties that make it detectable; these can be described as distinctive texture, high contrast, and uniform color. Using these properties, it is possible to locate text regions in video and, by thresholding the located regions, to binarize the image for character recognition. In this work, a complete system for detecting and recognizing videotext is presented. The performance of the system is tested in terms of character recognition rate for different combinations. On average, the system achieves a recognition rate of 59%, which can be considered sufficient for many applications.
Keywords: Text detection, text extraction, videotext, video-OCR, texture, segmentation.
ACKNOWLEDGEMENTS
First of all, I would like to express my sincere appreciation to Dr. A.
Aydın Alatan, my supervisor, for his valuable guidance and insight,
encouragement, support, and reliance throughout the research. I am also
grateful to my supervisory committee members, Dr. İsmet Erkmen, Dr.
Gözde Bozdağı Akar, Dr. Buyurman Baykal, and Ersin Esen, for their
valuable suggestions, contributions and comments.
I wish to express sincere thanks and love to my wife, Bengü, for her
patience, understanding, help and support in every phase of the thesis.
Last, but not least, I am, as always, indebted to my family. The love and support of my mother Gülay, my father Feyzi and my sister Serap remain the bedrock of my life.
TABLE OF CONTENTS
ABSTRACT
ÖZ
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER
1 INTRODUCTION
1.1 Text in video: videotext
1.2 Properties of videotext
2 RELATED WORK
2.1 Connected component-based approaches
2.2 Texture based approaches
2.3 Work on scene text
3 VIDEOTEXT DETECTION
3.1 Texture analysis
3.1.1 Texture analysis by Gabor filters
3.1.2 Texture analysis by Haar wavelets
3.1.3 Detection by a feed-forward neural network
3.2 Contrast analysis
3.3 Region analysis
3.4 Thresholding
3.5 Heuristics
4 EXPERIMENTAL RESULTS
4.1 Videotext recognition
5 CONCLUSIONS
REFERENCES
LIST OF TABLES
4.1: Pre-OCR performance evaluation – Part1.
4.2: Pre-OCR performance evaluation – Part2.
4.3: Pre-OCR performance evaluation – Part3.
LIST OF FIGURES
1.1: Example video frame containing both superimposed text and scene text.
1.2: Demonstration of complexity of background-character separation.
3.1: Block diagram overview of videotext detection and recognition process.
3.2: Gray-scale representation of Gabor filters.
3.3: Gabor filter outputs for a typical video frame.
3.4: Feature vector extraction using Gabor filters.
3.5: First level wavelet decomposition of a video frame.
3.6: Block diagram of the neural network classifier.
3.7: Training performance of Gabor and wavelet-based features.
3.8: Neural network output image example.
3.9: Typical neural network output examples.
3.10: Contrast analysis example.
3.11: Region analysis example.
4.1: Pre-OCR steps.
4.2: Various recognition results.
4.3: Various recognition results, continued.
CHAPTER 1
INTRODUCTION
The increasing capacity of available storage media, transmission
bandwidth, and the efficiency of compression algorithms increase the
availability of online digital imagery and video, and make digital video the
standard form of video broadcast and distribution. Today, many homes own
digital video equipment and are maintaining personal media archives. Large
media archives are also being maintained by several organizations with an
interest in commerce, entertainment, medical research, security, etc. This
increasing popularity and availability has rekindled interest in the problem
of how to manage multimedia information sources efficiently.
The ease of use of digital video is directly related to advances in
compression, transmission, storage, archival, indexing, retrieval, querying
and browsing technologies. Compression of digital video has already
become a mature technology and there are widely accepted compression
standards. Closely related to compression technology, transmission of
digital video is also standardized and becoming much more efficient with
the recent advances in communication technologies. On the other hand, the importance of, and the lack of attention to, the indexing, retrieval, querying and browsing counterparts of digital video technology have only recently been recognized.
Presently, digital video is extensively produced in the entertainment
industry (digital movie and digital broadcast video), business
communications (video conferencing), medical imaging, and surveillance
video among other uses. All of these are candidates to be a part of some
video archival system. Content-based information retrieval from such digital
video databases and media archives is a challenging problem and is rapidly
gaining widespread research and commercial interest. Such systems are also
called Content Based Image (and Video) Retrieval (CBIR) Systems or
Visual Information Retrieval Systems (VIRS) [20].
Research efforts have led to the development of many methods that
can be used to provide pseudo-semantic content-based access to image and
video data. These methods determine the similarity in the visual information
content extracted from low-level features and have their roots in pattern
recognition [21]. Low-level features such as color, shape, and texture are
used to select a good match in response to the user query. Humans, on the other hand, attach semantic meaning to visual content; indexing video sequences therefore requires the extraction of a natural-language representation of the sequence. Traditionally, this process is performed by a human reviewer, who manually annotates images and video sequences with a small number of keyword descriptors after visual inspection.
can be very time-consuming and such delays may inhibit the ability to
perform near-real-time filtering and retrieval. Therefore, to be able to index
video sequences, it is necessary to produce the natural language
representation of the sequence automatically.
Several automated methods have also been developed which attempt
to access image and video data by content from media databases [21]. A
popular approach to address this problem has been to temporally segment
video into subsequences separated by shot changes, gradual transitions or
special effects such as fade-ins and fade-outs [22,23]. A story board of
events that occurred in the video can be created by selecting a key frame
from each subsequence. The video can then be queried with visual queries
that use color, texture or activity. This is a pseudo-semantic approach to
video content description, wherein the human interpretation of color, texture
and/or motion defines the content. Some researchers have extended this
philosophy in an attempt to identify the genre of the video [24]. By studying
patterns of color distribution in segmented video clips and relative motion of
objects within them, judgments can be made regarding the genre of the
video.
The next step in content-based indexing of digital video takes a hint
from the method used by humans in understanding visual information.
Humans recognize the objects imaged in the scene and form spatial and
temporal relationships between them in order to understand the semantics
contained. Developing methods that recognize such objects and form relationships between them would go a long way toward the automated generation of content descriptions. One such object is the text contained in the
video.
Generally speaking, the natural language representation of a video
sequence can most directly be extracted from information carriers such as
voice, closed-caption text, scene text and superimposed text. Although
sound and closed captions provide index information on the spoken content,
basic annotational information often appears only in the image text. For
example, sports scores, product names, scene locations, speaker names,
movie credits, program introductions and special announcements often
appear and supplement or summarize the visual content. These types of
annotations are usually rendered in high contrast with respect to the
background, are of readable quality, and use keywords that facilitate
indexing. Specific searches for a particular actor or reference to a particular
story can easily be realized if there exists access to this textual content.
It is important to note that the efforts of the Moving Picture Experts Group (MPEG), a joint working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), are producing standards that are object based. Within these standards, video can be encoded as a static background with a moving foreground composed of various objects, ultimately allowing textual content to be annotated as an object and making it easier to extract and use as an indexing tool [61-64]. The standards are not yet completely implemented, but the value of textual content rendered as part of the video for indexing is widely accepted, and researchers worldwide are continuously working on the subject.
1.1 Text in video: videotext
There is a considerable amount of text occurring in video which is a
useful source of information. The presence of text in a scene, to some
extent, naturally describes its content. If this text information can be
harnessed, it can be used along with the temporal segmentation, and other
video object recognition methods to provide richer content-based access to
the video data. The text in video frames can be classified broadly into two large categories: superimposed (caption, artificial, or overlay) text and scene text. Both will hereafter be called videotext, to distinguish them from ASCII text.
Scene text appears within the scene, which is then captured by the
recording device. It is an integral part of the image and can be considered a
sample of the world. Examples of scene text include street signs, billboards,
text on trucks, and writing on shirts. Although valuable, the appearance of
such videotext is typically incidental to the scene content, and only useful in
applications such as navigation, surveillance or reading text on known
objects, rather than general indexing and retrieval. One exception is in
domains, where text or symbols may be used to identify players or vehicles.
Scene text is often difficult to detect and extract since it may appear in a
virtually unlimited number of poses, sizes, shapes and colors. Moreover
scene text may appear partly occluded by other objects in the scene.
On the other hand, superimposed text is a kind of videotext that is
mechanically added to video frames to supplement the visual content, and is
often more structured and closely related to the subject than scene text is.
Examples of superimposed text include headlines, keyword summaries, time
and location stamps, names of people and scores. The descriptors are
typically predictable, have simple styles, and are produced with the intent of
being read by the viewer.
Superimposed text has a number of functions that differ between
domains. In commercials, videotext appears to reinforce the vital
information such as the product name, claims made, or in some cases to
provide disclaimers. In sporting events, videotext is used to identify specific
players, provide game information, or relay statistics. In newscasts,
superimposed text can be used to either identify key features of the scene,
such as location or speaker, to provide a synopsis of the topic, or to provide
a visual summary of statistical information. In movies and television shows,
videotext provides production and acting credits, and in other cases captions
or language translations.
Figure 1.1 shows an example video frame containing both
superimposed text and scene text. In this image, the text giving speed and time information is an example of superimposed text, and it contains information directly related to the content of the image. The commercials, on the other hand, are examples of scene text, and they are not directly related to the content.
Figure 1.1: Example video frame containing both superimposed text and scene text.
For the most part, research in this area has focused on the
identification of superimposed text. However, there are some cases where
videotext appears as scene text but shows the properties of superimposed
text such as product pack shots on TV commercials. Therefore it is better to
describe the scope of the research as “extracting (i.e. detecting and
recognizing) superimposed text and scene text which possesses
superimposed text properties”, and this is exactly how it is expressed in the
MPEG-7 standard [61-64].
The use of textual information as a key for an indexing system requires conversion of the videotext present in the image to ASCII form, a process commonly known as Optical Character Recognition (OCR).
Unfortunately, commercially available OCR systems are designed for
recognition of characters in printed documents [1]. Characters in these
documents appear in uniform color on a clean background and the document
is typically scanned at high resolution. Therefore, employing OCR for videotext requires segmentation of the characters from the background, which is usually arbitrarily complex. The segmentation consists of background
removal and character binarization. For OCR systems to recognize
characters efficiently, a video frame should be converted to a black and
white image where pixels on the characters of the videotext are black and
the remaining pixels are white. This binarization process is referred to as pre-OCR in an overall sense. The operation is not straightforward, because the character strokes are often composed of pixels with a range of values and the boundary between the character region and the background is not perfectly defined.
This statement can be best understood by Figure 1.2 where the original
image and a magnified character region are shown. As can be seen from the figure, in the magnified image the character cannot be separated from the background, even manually.
Although modern OCR systems are designed to decompose pages for recognition and can still recognize text on a limited portion of complex images, the success of the pre-OCR step is highly critical for the performance of the OCR. The pre-OCR process will obviously be based on the distinctive properties of videotext.
Figure 1.2: Demonstration of complexity of background-character separation. (b) is the magnified version of (a) of the region containing text “JOSE”.
1.2 Properties of videotext
Although it is impossible to define a set of properties that completely covers the whole videotext domain, some properties that cover the majority of cases can be described as follows:
1. Texture
Videotext has a distinctive texture, which makes it distinguishable
from background. This statement stems from the fact that a text from an
unknown alphabet can be distinguished even without recognizing the
characters contained inside the text. This texture is a result of density of
edges forming the characters. Most alphabets have strong edges in a
particular direction, e.g. Latin scripts have strong vertical edges. These
edges form a high frequency texture.
2. Contrast
The colors for artificial text are chosen so as to have contrast against
the background. This is also true for a lot of scene text (signs, billboards,
etc.). Thus color-connected components belonging to text can be segmented
against the frame.
In some cases, the contrast between the text and the background may not be high. This is because the background over which the text is rendered varies both spatially and temporally. It is often observed that the color of caption characters is similar to that of the surrounding frame region. However, even if this occurs, it is usually true only for a portion of the caption or for a short duration, since otherwise the editor would have chosen a different color for the text.
3. Uniform color and intensity
Color is a strong feature for use in visual information indexing. Text
characters tend to have a perceptually uniform color and intensity over the
character stroke. While the character stroke appears to have the same color,
in reality it is usually composed of many different colors. In cases where the
color does vary across the caption, it varies in a gradual way so that adjacent
characters or character segments have very similar colors.
4. Size limits
Text appears in a variety of sizes in video data. Since text is intended to be readable at a range of viewing distances in a limited time, there is usually a minimum size for text characters. However, the upper bound on character sizes is much looser.
5. Geometrical alignment
Characters belonging to an artificial text string are usually
horizontally aligned and the spaces between characters of a word have
restrictions.
These properties will be explained and demonstrated in detail in subsequent sections, but it should be noted that these properties are utilized in the pre-OCR binarization step.
CHAPTER 2
RELATED WORK
This chapter presents a survey of methods in the literature for extracting text from images and video. There has been a growing interest in the
development of methods for detecting, localizing and segmenting text from
images and video. In this chapter, we concentrate on presenting the
literature on detection, localization, and extraction of text regions from
images and video.
Methods for extracting artificial overlay text have primarily used
intensity, color, texture, edge, or a combination of these features. As
mentioned above, most methods have been designed to extract text from
images. Some have been extended for application to video.
Existing work on text detection has focused primarily on characters
in printed and handwritten documents [1]. There also exists considerable
amount of work on text detection in industrial applications, most of which
focus on a very narrow application field with simplifying assumptions, such
as license plate recognition from a uniform background [2]. The research on
text detection and text extraction from natural images and video is not as
mature as the work on document images.
The current research can be mainly divided into two classes:
Connected-component-based approaches [6,7,8], and texture-based methods
[4,5,9,10]. While the former tries to find characters as closed regions
containing uniform color, the latter considers the text as a special class of
texture. The boundary between these two classes is not always clear, since the distinction is sometimes not obvious. As a consequence, there are also some hybrid approaches [3] that utilize both.
2.1 Connected component-based approaches
Connected component-based approaches require high contrast and
resolution in order to give an acceptable performance. Some neighboring
regions with similar color values, i.e. low contrast, may yield erroneous
segmentation of the text [6,7,13]. In some methods, the assumption of
having monochrome text or, a monochrome background surrounding the
text is another limiting factor for the general performance [8]. Moreover,
finding characters as a single closed region is only possible by utilizing
high-resolution images [3]. However, for video indexing, such constraints
are usually far from being practical.
Zhong, Karu and Jain [3] locate text in complex color images. They
propose a method based on quantizing the color space based on peaks in the
histogram. Adjacent colors are merged into these peaks. This assumes that
the text occupies a narrow region of color space and a significant portion of
the image. Color components are then labeled as text based on their
geometrical properties and if there are at least 3 characters aligned
horizontally. If the color quantization step fails because the text has low
contrast or occupies a small portion of the frame, then a character may be
broken into multiple segments. As an alternative, same researchers use the
heuristic of high horizontal spatial variance to localize text. They report
results for compact disc and book cover images and the spatial variance
method on car scenes from video. Although the spatial variance may be
considered as a texture, the dominant characteristic of the approach is based
on connected components. Kim [28] also proposes a very detailed text
localization method for video images and compares it with the method of
Zhong, Karu, and Jain [3]. Color clustering followed by shape analysis is
used to determine text regions. This method appears to be fragile, with a
large number of thresholds that are empirically determined. In a later work
presented by Jain and Yu [8], text from video frames and compressed images is extracted by background color separation and by applying the heuristic that the text has high contrast with the background and is of the
foreground color. A similar assumption made in the earlier work on CD
cover images [3] that the text forms a significantly large portion of the
image, does not hold true for video. The later work shows sample
localizations on video frames but inexplicably misses text that is not
particularly large or not on a uniform background (e.g. text on a weather
map from a weather news broadcast). Messelodi and Modena [44] extract
text from book cover images. They use simple homogeneity properties to
separate text from other image components and further correct their
extraction through estimation of orientation and skew correction of text
lines.
Color clustering is the main tool for connected component analysis.
Lienhart and Stuber [30,31,32] describe a system for automatic text
recognition in digitized movies that works on movie opening and closing
sequences, credit title sequences and closing sequences with title and
credits. Their work is well grounded, but not robust enough to be applicable
to general video, because they expect to extract characters as a single
connected component. They employ color clustering with both spatial and
temporal heuristics of segments and track (possibly scrolling) caption text in
video. The system is likely to fail for noisy video, text with poor contrast, or
text with a highly cluttered background. Suen and Wang [33,34] also
propose a method for segmenting uniformly colored text from a color
graphics background. Page segmentation techniques are applied to separate
text and graphic regions based on color edge strength. Non-text regions are
discarded by applying color distribution properties for text regions. Mariano
and Kasturi [48] propose a color clustering based method for locating text
based on color uniformity of text pixels and horizontal alignment. An image
is searched for rows that pass through vertical and diagonal strokes of text.
The pixels of a row that goes through the middle of caption text on a non-uniform background are clustered in the L*a*b* space. These clusters appear as short streaks. By searching for similar clusters in adjacent rows, heuristics are formed for the detection of horizontal text.
Zhou and Lopresti [37] take a similar approach, involving color clustering, computation of connected components, and the application of spatial heuristics for text to remove components from World Wide Web images.
Zhou, Lopresti, and Lei [38] extend this work to perform OCR on these
images. The work by Zhou, Lopresti, and Tasdizen [42] is another example of color clustering. Their approach adds robustness to their earlier
work on extracting text from World Wide Web images. Color clustering is
done using the Euclidean minimum spanning tree (MST) technique. Spatial
distance between colors is also considered in the clustering process.
In some of the works in literature on text extraction from video,
connected component analysis is performed on multiple frames. An
example work of this kind is presented by Ariki and Teranishi [39]. In their
method, they make very simplistic assumptions as to the nature of the text,
i.e. it is brighter than the background. The method uses simple frame
subtraction to extract text. Kurakake, Kuwano, and Odaka [40] divide the
frame into several uniform sub-regions. Color histogram differencing is
applied to each sub-region. If a significant change is detected between the
sub-regions of consecutive frames, then the latter frame is assumed to have
text in it. Spatial analysis is used to further determine if text indeed exists in
these frames. Yeo’s approach [29] seeks to detect the appearance and
disappearance of text captions. The disadvantage is the assumption that captions occur only in certain predefined areas of the frame, which are then template-matched to detect sudden changes, presumably due to caption appearances.
detection) method assumes that all shot changes have been accurately
detected in the sequence before text detection needs to be applied, which
may not be true. Even if it is, text caption events may co-occur with shot
changes or camera motion. Multiple captions may appear within a shot, they
may fade in and out over long periods, they may coincide with a shot
change or they may remain constant across shot changes.
Hase et al. [35] propose an extraction algorithm for character
strings. Their approach is directed towards binary document images but
works on simple, clean color images after color segmentation. Their
relaxation method has the advantage of handling strings of various
orientations. In later work [36], they extend their method to perform conflict
resolution between overlapping text regions through the use of the likelihood of a character string.
Dimitrova, Agnihotri, Dorai and Bolle [13] present a method to
detect caption text from video frames. The first method assumes that all
caption text is white or yellow in color. The method looks for strong edges
in the Red and Green color channels. Further refinement is done by
identifying homogeneous regions in intensity images, forming positive and
negative images by thresholding and applying heuristics based on text
characteristics for eliminating non-text regions. The text regions are
validated using the temporal redundancy property of text in video.
2.2 Texture based approaches
Among various texture-based text detection algorithms, the main
difference is representation of the texture. The main idea behind texture
analysis is quantifying the distribution of different edge orientations.
Among different models, Gabor filters with various orientations [9] or
simple edge detectors for finding a specific direction [5] or multiresolution
wavelet filtering [4,10] can be used for finding textures. With an arbitrary number of orientations and scales, multichannel Gabor filters provide higher flexibility and hence better discrimination among different textures. Jain et al. [9] describe a method for separating text and
background areas on document images by using a group of multichannel
Gabor filters. However, the extension of such an approach for low-
resolution video frames has not been examined. Although it is claimed that
use of texture for text detection is sensitive to character font size and style
[8], our experimental results show that a feature vector constructed by
directional Gabor filtered images is still a very powerful representation of
text regions.
Frequency components are used as texture representatives. Chaddha,
Sharma, Agrawal, and Gupta [25] propose a method to detect text from
JPEG compressed images. The method extracts the DCT coefficients for
each block of a macro block in the JPEG image. The absolute values of a set
of DCT coefficients are summed. This measure reflects the high spatial
frequency content of blocks containing text. This particular set of
coefficients was empirically determined as the optimum to discriminate text
and non-text blocks. Next, this sum is thresholded using an empirically
determined value. All blocks having a measure greater than this value are
marked as text blocks. Because the JPEG compressed information is not
decompressed, the method is fast and capable of running in real-time. The
algorithm finds artificial and scene text within a range of scales. However, it
fails on text that has low or varying contrast within the string or on text that
is much larger than the block size used (8x8). Zhong et al. [47], as in [25], use the DCT coefficients available in the MPEG I-frames. Selected coefficients that highlight the vertical and horizontal frequencies are summed and then thresholded to detect text blocks.
Another tool for texture representation, wavelet coefficients, also
carry information about frequency of text blocks. Li and Doermann [45,46]
extract text from digital video key frames. They use the heuristic that the
texture for text is different from the surrounding background to identify text
regions. Wavelets are used for feature extraction and a Neural Network is
used for decisions. They also present an algorithm for tracking moving text.
The tracker assumes that text is mostly rigid and moves in a simple, linear
manner.
Some of the researchers consider edge density as texture
representation. Hauptmann and Smith [26] localize text in video using the
heuristic that text regions consist of a large number of horizontal and
vertical edges in spatial proximity. In a later work, Smith and Kanade [27]
apply a 3x3 horizontal difference filter to extract text regions. Wu,
Manmatha, and Risemann [4] describe a scheme for finding text in images.
They use texture segmentation to localize text, i.e. edge detection to detect
character strokes and join strokes to form text regions. While texture
segmentation may work for high-resolution images or document images, it
is likely to fail on noisy, low-resolution video. Outdoor scenes especially are
likely to give rise to a large number of false alarms. They also propose a
stroke aggregation method, which is used to merge extracted text regions
from multiple scales of operation.
A similar approach is proposed by Sato, Kanade, et al. [5]. They
describe a system for performing OCR on video caption text in the context
of a digital news archive. They localized text by looking for clusters of edge
pixels that satisfy aspect ratio and other criteria. They interpolate between
pixels of the frame to increase its resolution. They also assume that the text
caption consists of static white characters on a dark background. Under this
assumption, they replace each pixel with the minimum value of pixels at
that position for the duration of the text caption, to minimize the background
variation. Filtering to detect white on black edges of characters follows this
step. They report results on CNN news captions. Their method appears to
work on clean sequences under the assumptions made, but it is not clear if
it would work on unconstrained, noisy video.
Different techniques, which can be classified as texture-based, try to
classify image blocks as text or non-text. Mitrea and deWith [43] propose a
simple algorithm to classify video frame blocks into graphics or video based
on the dynamic range and variation of gray levels within a 4x4 pixel block.
Their application was improving image compression rates. This method can
be extended to extract text regions from images or video. Le Bourgeois [41]
presents a system for multi font OCR from gray level images. The text
localization stage of this system appears to be fast and robust for clean video
and assuming not too much variation in size. Text regions are identified as
those where pixels are organized in a coherent way to form a regular
texture. Normalized intensity differences over a certain period are used to
separate such regions from the background.
Other approaches for detecting text in images and video are found in
[49,50,51,52,53,54,55].
2.3 Work on scene text
Although they have similarities, it is appropriate to review the existing work on scene text detection in a separate section. Some of the ideas from these works, such as adaptive or local thresholding, can be incorporated for the detection of superimposed text.
Very little work has been done for extraction of scene text. A brief
overview of the research efforts published in the literature is presented.
Some research involves manufacturing contexts where the conditions are
completely under control [56] and thus inapplicable to our problem domain.
Ohya, Shio and Akamatsu [6] describe a method to recognize characters in
scene images. They use local gray level thresholding to segment the image
and localize text regions by looking for high contrast, uniform gray level of
a character, and uniform width. They also use the results of an OCR stage to
improve their extraction result. If the Chinese OCR algorithm they use does
not find a good enough match, the character candidate region is rejected.
There is work on the recognition of vehicle license plates [57,58]
from video, which shares some of the characteristics of scene text.
However, these approaches make restrictive assumptions on the placement,
contrast or format of the license plate characters. A typical system makes a
number of assumptions as to the capture process (vehicle and camera
positions leading to near normal projection and near horizontal orientation
etc.) and type of text (constant text color against constant background). The
existence of license plates with logos and other background is known to
decrease performance. In contrast a general-purpose scene text detection
algorithm must handle text in all orientations, of all colors and backgrounds.
Cui and Huang's approach [58] applies the Markov Random Field (MRF) model to localize the text region in a frame and takes advantage of the information from multiple frames. They also correct for perspective projection distortion using the fact that the characters lie on a plane.
Winger, Jernigan, and Robinson [59] discuss the segmentation and
thresholding of characters from low-contrast scene images acquired from a
hand-held camera. Their data set includes images with low contrast, poor
and uneven illumination.
Gandhi [60] explores an approach for extracting scene text from a
sequence of images with relative motion between the camera and the scene.
It is assumed that the scene text lies on planar surfaces, whereas the other
features are likely to be at random depths or undergoing independent
motion. The motion model parameters of these planar surfaces are estimated
using gradient based methods, and multiple motion segmentation. The
equations of the planar surfaces, as well as the camera motion parameters
are extracted by combining the motion models of multiple planar surfaces.
This approach is expected to improve the reliability and robustness of the
estimates, which are used to perform perspective correction on the
individual surfaces. Perspective correction can lead to improvement in OCR
performance. This work could be useful for detecting road signs and
billboards from a moving vehicle.
The existing work considered so far generally suffers from a lack of robustness. In fact, almost none of the proposed algorithms are tested with quantitative recognition results. Connected component-based techniques are more general, in the sense that texture-based methods are usually based on assumptions about videotext character size. On the other hand, texture-based methods offer much more reliability.
CHAPTER 3
VIDEOTEXT DETECTION
As mentioned in the introduction, the pre-OCR binarization process for videotext extraction will ideally remove all image components other than videotext characters. The problem is not trivial, because the distinction is not clear at the pixel level; i.e., one cannot decide which pixels belong to videotext characters and which to the background just by looking at a pixel and/or its neighbors. Therefore, it is necessary to employ regional and object-level image analysis [15]. This analysis will be based on the distinctive properties of videotext. The following discussion presents the methods of employing videotext properties for the pre-OCR process.
At this point we can give an overall picture, i.e. framework of the
videotext detection and recognition process. With this picture in mind, it
will be easier to follow the implementation details of the process.
The properties listed in the introduction, which distinguish videotext regions from the background, help us find the regions of interest, i.e. the regions that may contain text. These regions are called candidate regions. Finding candidate regions is only a part of the solution and does not give us the required binary image. In order to produce the binary image, which will be fed to the OCR for recognition, all candidate regions must be binarized individually. The binarization of individual candidate regions is accomplished by thresholding each region with an appropriate threshold value specific to that region. This threshold value is determined automatically for each candidate region. After thresholding, we obtain two binary images: one in which the remaining pixels are those above the threshold value, and one in which they are those below it. These two binary images are the images to be recognized by the OCR. A block diagram of the algorithm is given in Figure 3.1.
[Block diagram: Image → Candidate region extraction (texture, contrast, color) → Candidate regions → Regional thresholding → Heuristic improvement → Binary image → OCR → ASCII characters]
Figure 3.1: Block diagram overview of videotext detection and recognition process.
In Figure 3.1, an extra block is shown after thresholding. In this step,
the binary images produced by thresholding are further filtered by a set of
heuristics, which are based on the horizontal alignment properties of
videotext.
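As a rough illustration of the regional binarization just described, the sketch below thresholds each candidate region with its own automatically determined threshold and produces the two complementary binary images that are later passed to the OCR. Otsu's method is used here only as a stand-in for the region-specific threshold selection of the actual system (detailed in Section 3.4), and the candidate boxes are assumed to come from the detection stage described in the following sections.

    import numpy as np

    def otsu_threshold(gray):
        """Stand-in automatic threshold (Otsu); the thesis derives its own
        region-specific threshold in Section 3.4."""
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        total = gray.size
        mean_all = np.dot(np.arange(256), hist) / total
        best_t, best_var = 0, -1.0
        cum, cum_mean = 0.0, 0.0
        for t in range(256):
            cum += hist[t]
            cum_mean += t * hist[t]
            if cum == 0 or cum == total:
                continue
            w0 = cum / total
            w1 = 1.0 - w0
            m0 = cum_mean / cum
            m1 = (mean_all * total - cum_mean) / (total - cum)
            var_between = w0 * w1 * (m0 - m1) ** 2
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    def binarize_candidates(gray, candidate_boxes):
        """Threshold each candidate region individually and return the two
        complementary binary images (text brighter / text darker than background)."""
        above = np.zeros(gray.shape, dtype=np.uint8)   # pixels above the regional threshold
        below = np.zeros(gray.shape, dtype=np.uint8)   # pixels below the regional threshold
        for (x0, y0, x1, y1) in candidate_boxes:       # boxes from the detection stage
            region = gray[y0:y1, x0:x1]
            t = otsu_threshold(region)
            above[y0:y1, x0:x1] = (region > t).astype(np.uint8)
            below[y0:y1, x0:x1] = (region <= t).astype(np.uint8)
        return above, below

Both returned images would then be filtered by the horizontal-alignment heuristics and sent to the OCR, as in Figure 3.1.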
The idea of binarization with regional thresholding was originally proposed by Dorai et al. [13]. They also extracted candidate regions by region analysis. However, their work suffers from the large number of non-character regions, which cannot be eliminated by any heuristics or shape analysis. We propose that if region analysis is supported by videotext detection methods that can narrow the regions of interest down to the text-box level, the limitations of region analysis can be overcome.
3.1 Texture analysis
Videotext regions have distinctive texture properties. This is mainly
due to artificial (i.e. non natural) edges and strokes that form the characters
of videotext.
Texture analysis is one of the major topics of computer vision and
there are well-known techniques for the analysis. We will present here two
different approaches for videotext texture analysis and compare them in
terms of representation capabilities. These two techniques are Gabor filters
and 2D Haar wavelets. A set of feature vectors will be produced and these
vectors will be used for detecting text regions. For detection, a three layer
neural network will be used and the two techniques will be compared in
terms of their training performances.
3.1.1 Texture analysis by Gabor filters
Gabor filters have been one of the major tools for texture analysis
[18,19]. This technique has the advantage of analyzing texture in an
unlimited number of directions and scales. This flexibility is very useful for
videotext detection because the character edges of videotext may appear in a
diverse range of directions [9]. For videotext detection, we will use a set of
Gabor filters from the same scale and with 8 different orientations. Before
proceeding further, it is better to give a brief description of Gabor filters.
Physiological studies found simple cells in the human visual cortex that are selectively tuned to orientation as well as to spatial frequency. It was
suggested that the response of a simple cell could be approximated by 2D
Gabor filters [11]. The Gabor filters proposed by Daugman [12] are local
spatial bandpass filters that achieve the theoretical limit for conjoint
resolution of information in the 2D spatial and 2D frequency domains.
Gabor functions were first proposed by Dennis Gabor [65], as a tool
for signal detection in noise. Gabor showed that there exists a “quantum
principle” for information; the conjoint time-frequency domain for 1D
signals must necessarily be quantized so that no signal or filter can occupy
less than certain minimal area in it. However, there is a trade off between
time resolution and frequency resolution. Gabor discovered that Gaussian
modulated complex exponentials provide the best trade off. For such a case,
the original Gabor elementary functions are generated with a fixed
Gaussian, while frequency of the modulating wave varies. Gabor filters,
rediscovered and generalized to 2D, are now being used extensively in
various computer vision applications.
Daugman [12] generalized the Gabor filters to the following 2D
form in order to model the receptive fields of the orientation selective
simple cells:
$$\Psi_i(\vec{x}) = \frac{k_i^2}{\sigma^2}\, e^{-\frac{k_i^2 x^2}{2\sigma^2}} \left[ e^{\,j\vec{k}_i^{T}\vec{x}} - e^{-\frac{\sigma^2}{2}} \right]$$

$$\vec{k}_i = \begin{pmatrix} k_{ix} \\ k_{iy} \end{pmatrix} = \begin{pmatrix} k_v\cos\theta_\mu \\ k_v\sin\theta_\mu \end{pmatrix}$$
Each Ψi is a plane wave characterized by the vector ki enveloped by
a Gaussian function, where σ is the standard deviation of this Gaussian. The
first term in brackets in this equation determines the oscillatory part of the
kernel, and the second term compensates for the DC value of the kernel.
Subtracting the DC response, Gabor filters become insensitive to the overall
level of illumination. The center frequency of the i-th filter is adjusted by the characteristic wave vector $\vec{k}_i$, which has a scale and orientation given by $(k_v, \theta_\mu)$.
The decomposition of the image by use of these Gabor filters is
achieved by the 2D convolution integral:
$$R_i(\vec{x}) = \int I(\vec{x}\,')\,\Psi_i(\vec{x}-\vec{x}\,')\,d\vec{x}\,'$$

where $I(\vec{x})$ is the image intensity value at $\vec{x}$.
In Figure 3.2, gray-scale representations of Gabor filters for varying spatial frequency (vertical axis, $k_v = 2^{-v/4}$, $v = 0,1,2,3,4$) and orientation (horizontal axis, $\theta_\mu = \mu\pi/8$, $\mu = 0,1,\ldots,7$), with $\sigma = 4$, are shown.
Figure 3.2: Gray-scale representation of Gabor filters for varying spatial frequency and orientation.
Each member of the family of Gabor filters models the spatial
receptive field structure of a simple cell in the primary visual cortex. The
Gabor decomposition can be considered as a “directional microscope” with
an orientation and scaling sensitivity. These filters respond to short lines,
line endings, and sharp changes in curvature. Since such curves correspond
to some low-level salient features in an image, these filters can be assumed
to form a low-level feature extractor of an intensity image.
For the purpose of videotext region detection, we selected the scale,
kv, as 2 and for this scale all the filters with 8 orientations form the filter
bank for feature extraction.
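A minimal sketch of how such a filter bank could be generated from the kernel given above is shown below. The 8 orientations $\theta_\mu = \mu\pi/8$, the DC-compensation term and $\sigma = 4$ follow the equations; the kernel support (a half-width of $3\sigma$) is an implementation choice made here only for illustration.

    import numpy as np

    def gabor_kernel(k_v, theta, sigma=4.0):
        """Complex 2D Gabor kernel following Daugman's form:
        (k^2/sigma^2) * exp(-k^2 x^2 / (2 sigma^2)) * (exp(j k.x) - exp(-sigma^2/2))."""
        half = int(np.ceil(3 * sigma))                  # kernel support, chosen for illustration
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
        kx, ky = k_v * np.cos(theta), k_v * np.sin(theta)
        k_sq, x_sq = k_v ** 2, x ** 2 + y ** 2
        envelope = (k_sq / sigma ** 2) * np.exp(-k_sq * x_sq / (2 * sigma ** 2))
        carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)  # DC-free
        return envelope * carrier

    def gabor_bank(k_v=2.0, n_orientations=8, sigma=4.0):
        """One scale, 8 orientations (theta = mu*pi/8), as used for videotext detection."""
        return [gabor_kernel(k_v, mu * np.pi / n_orientations, sigma)
                for mu in range(n_orientations)]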
In Figure 3.3, the filtered images of a video frame containing videotext are shown. It is important to note here that ordinary edges in the image give output only for a limited set of directions. Text regions, on the other hand, contribute to the output in almost all directions.
Figure 3.3: Gabor filter outputs for a typical video frame. Right-bottom,
θµ=0; right-top, θµ=2π/8; left-top, θµ=5π/8; left-bottom, θµ=7π/8.
The filtered images are computed once for a given image and the feature vectors for all blocks are calculated. The feature extraction process can be described with the block diagram given in Figure 3.4. In the block diagram, the input is the original image for which the process is run. The image is filtered with the Gabor filter bank consisting of 8 directional filters, and as a result 8 filtered images are obtained. Once these images are obtained, the feature vector for a block at a particular location is computed by averaging the filter-output values falling within the corresponding block, yielding an 8-dimensional feature vector. For the experiments, the block dimension is selected as 16x16. The extracted vectors are then fed to a neural network for detection. This will be discussed later, after describing the other method for texture analysis, because the neural network counterpart of both methods is identical.
[Block diagram: Image → Filtering with 8 directional Gabor filters → Filter outputs → Block average calculation (using block coordinates) → Vector output]
Figure 3.4: Feature vector extraction using Gabor filters
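The block-averaging step of Figure 3.4 can be sketched as follows: each 16x16 block of the eight filtered images is averaged to give one component of the 8-dimensional feature vector. The kernels are assumed to come from a bank such as the one sketched earlier; the use of the complex response magnitude and of scipy's convolution are implementation choices of this sketch, not prescribed by the thesis.

    import numpy as np
    from scipy.ndimage import convolve

    def gabor_block_features(gray, kernels, block=16):
        """Filter the image with every Gabor kernel once, then average the response
        magnitude inside each (block x block) tile -> one 8-D vector per tile."""
        gray = gray.astype(np.float64)
        responses = [np.abs(convolve(gray, np.real(k)) + 1j * convolve(gray, np.imag(k)))
                     for k in kernels]
        h, w = gray.shape
        rows, cols = h // block, w // block
        features = np.zeros((rows, cols, len(kernels)))
        for r in range(rows):
            for c in range(cols):
                sl = (slice(r * block, (r + 1) * block), slice(c * block, (c + 1) * block))
                for i, resp in enumerate(responses):
                    features[r, c, i] = resp[sl].mean()
        return features   # features[r, c] is the vector fed to the neural network classifier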
3.1.2 Texture analysis by Haar wavelets
The next method for texture analysis to be presented is the 2D Haar wavelet. The idea of using wavelets for videotext detection was first proposed by Doermann et al. [10]. As in the case of Gabor filters, wavelets are good candidates for videotext detection because of their directional filtering capabilities. In the following paragraphs, 2D Haar wavelets will be briefly described and their use in videotext detection will be introduced.
Using wavelets to decompose the image provides successive approximations to the image by downsampling and has the ability to detect edges during the high-pass filtering. The low-pass filter creates the successive approximations to the image, while the detail signals provide a feature-rich representation.
Figure 3.5, where the original image and its first level wavelet
decomposition are shown.
Note that the videotext regions show high activity in the three high
frequency subbands (High-Low, HL, Low-High, LH, High-High, HH). As a
result of their local nature, only wavelets, which are located on or near the
edge yield large wavelet coefficients, making videotext regions detectable in
the high frequency subbands.
The scaling and wavelet functions of Haar wavelets can be written
as:
$$\Phi(x) = \sum_{k\in\mathbb{Z}} p_k\,\Phi(2x-k) = \Phi(2x) + \Phi(2x-1)$$

$$W_H(x) = \sum_{k\in\mathbb{Z}} q_k\,\Phi(2x-k) = \Phi(2x) - \Phi(2x-1)$$

respectively, with

$$\Phi(x) = \begin{cases} 1 & 0 \le x < 1 \\ 0 & \text{otherwise} \end{cases}$$

For the equations above, $p_k$ has non-zero values $p_0 = p_1 = 1$ and zero values for all other $k$, and $q_k$ is zero except for $q_0 = 1$ and $q_1 = -1$.
For an image I(x,y) represented as
$$I(x,y) = \begin{pmatrix} i_{0,0} & i_{0,1} & \cdots & i_{0,2N-1} \\ i_{1,0} & i_{1,1} & \cdots & i_{1,2N-1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{2N-1,0} & i_{2N-1,1} & \cdots & i_{2N-1,2N-1} \end{pmatrix}_{2N\times 2N}$$
we can use Mallat's algorithm [17] to obtain the two-dimensional Haar wavelet transform of I(x,y):
$$LL_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} p_{k_1} p_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} + i_{2x+1,2y} + i_{2x,2y+1} + i_{2x+1,2y+1}\right)$$

$$LH_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} p_{k_1} q_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} + i_{2x+1,2y} - i_{2x,2y+1} - i_{2x+1,2y+1}\right)$$

$$HL_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} q_{k_1} p_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} - i_{2x+1,2y} + i_{2x,2y+1} - i_{2x+1,2y+1}\right)$$

$$HH_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} q_{k_1} q_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} - i_{2x+1,2y} - i_{2x,2y+1} + i_{2x+1,2y+1}\right)$$
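The four subband equations amount to 2x2 averaging and differencing; a minimal sketch of one decomposition level, written directly from the formulas above, is given below.

    import numpy as np

    def haar_level(img):
        """One level of the 2D Haar transform via the LL/LH/HL/HH formulas above.
        The image is assumed to have even height and width."""
        img = img.astype(np.float64)
        a = img[0::2, 0::2]   # i_{2x,   2y}
        b = img[1::2, 0::2]   # i_{2x+1, 2y}
        c = img[0::2, 1::2]   # i_{2x,   2y+1}
        d = img[1::2, 1::2]   # i_{2x+1, 2y+1}
        LL = (a + b + c + d) / 4.0
        LH = (a + b - c - d) / 4.0
        HL = (a - b + c - d) / 4.0
        HH = (a - b - c + d) / 4.0
        return LL, LH, HL, HH

    # Successive levels are obtained by re-applying haar_level to the LL band, e.g.
    # LL1, LH1, HL1, HH1 = haar_level(frame); LL2, LH2, HL2, HH2 = haar_level(LL1)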
So far, we have introduced 2D Haar wavelet transform and its
practical implementation. Now, the feature extraction strategy for this case
will be explained.
For detection of videotext regions, we extract features from the
wavelet decomposition of the image. We use the mean and the second and
third order central moments as features. For an N x N block of I(x,y) we
calculate the mean, the second and third order central moments of the block
which can be written as:
$$M(I) = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} I(i,j)$$

$$\mu_2(I) = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left(I(i,j)-M(I)\right)^2 \qquad \mu_3(I) = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left(I(i,j)-M(I)\right)^3$$
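These three statistics can be computed directly; a small sketch for one N x N block follows.

    import numpy as np

    def block_moments(block):
        """Mean and second/third order central moments of an N x N block,
        following the formulas above."""
        block = block.astype(np.float64)
        n_sq = block.size                 # N * N
        mean = block.sum() / n_sq
        mu2 = ((block - mean) ** 2).sum() / n_sq
        mu3 = ((block - mean) ** 3).sum() / n_sq
        return mean, mu2, mu3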
All the features are computed on the decomposed subband images.
Since the block size chosen is 16x16, the maximum level of decomposition
is 3 since only one pixel is left for each subband image on the fourth level.
This method of feature selection produces 27 values for the construction of the feature vector, since we have 3 levels of decomposition, 3 subbands, and 3 moments. To reduce this dimension, we rely on a ranking demonstrated by Doermann et al. [46]. According to this ranking, the most salient features are, in decreasing order, $\mu_2^{3,HH}$, $\mu_2^{2,HH}$, $\mu_2^{3,HL}$, $\mu_2^{2,HL}$, $\mu_2^{1,HL}$, $\mu_2^{1,LH}$, $\mu_3^{1,LH}$, $\mu_3^{1,HH}$, where we use $\mu_2^{j,i}$ and $\mu_3^{j,i}$ to denote the second and third order moments respectively, with $j$ representing the decomposition level and $i$ representing the subband.
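The sketch below illustrates how such an 8-dimensional wavelet feature vector could be assembled for a 16x16 block. It repeats the one-level Haar step and the central-moment computation from the earlier sketches in compact form; the ordering of the selected moments follows the ranked list as reconstructed above, which should be treated as an assumption of this illustration.

    import numpy as np

    def haar_level(img):
        a, b = img[0::2, 0::2], img[1::2, 0::2]
        c, d = img[0::2, 1::2], img[1::2, 1::2]
        return ((a + b + c + d) / 4.0, (a + b - c - d) / 4.0,
                (a - b + c - d) / 4.0, (a - b - c + d) / 4.0)

    def central_moment(x, order):
        x = x.astype(np.float64)
        return ((x - x.mean()) ** order).sum() / x.size

    def wavelet_feature_vector(block16):
        """8-D feature vector for one 16x16 block: central moments of selected
        subbands over a 3-level Haar decomposition, in the ranked order above."""
        subbands = {}
        ll = block16.astype(np.float64)
        for level in (1, 2, 3):
            ll, lh, hl, hh = haar_level(ll)
            subbands[(level, "LH")] = lh
            subbands[(level, "HL")] = hl
            subbands[(level, "HH")] = hh
        selected = [(3, "HH", 2), (2, "HH", 2), (3, "HL", 2), (2, "HL", 2),
                    (1, "HL", 2), (1, "LH", 2), (1, "LH", 3), (1, "HH", 3)]
        return np.array([central_moment(subbands[(j, band)], order)
                         for j, band, order in selected])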
In Figure 3.5, the first level wavelet decomposition of a video frame
is shown. Similar to the Gabor filter case, videotext regions have energy in all subbands, and this makes wavelets a suitable tool for the detection of videotext regions.
Figure 3.5: First level wavelet decomposition of a video frame.
Both Gabor filtering and wavelet decomposition methods produce an
eight dimensional feature vector for each 16x16 block of the image. A pre-
trained neural network then classifies this vector. The training and
classification process will be explained in the next section.
3.1.3 Detection by a feed-forward neural network
In the previous sections, feature extraction methods for texture
analysis are introduced. In this section the neural network structure used for
classification will be described and the two methods will be compared
according to their performances.
We consider detection of videotext regions using texture analysis as
a classical supervised pattern recognition problem. The classifier is selected
as a 3-layer single output feedforward neural network, which has the well-
known capability of discriminating linearly inseparable classes. In fact,
theoretically a 3-layer neural network can approximate any nonlinear
function after training.
The success of neural networks in related problems [16] provides us
with further motivation to rely on a neural network as a classifier to identify
videotext regions. By employing such a network and training the network
with representative sample feature vectors of videotext regions, it is possible
to detect the regions of interest in a given image.
In Figure 3.6, the block diagram of the classifier is given. The
feature vector to be classified is extracted using either Gabor filtering or
wavelet transform. This vector is fed to the network and the network
responds to this input by giving an output indicating the type of the input.
The network is trained in such a way that it will give 1.0 as output if given
an input vector from a videotext region and 0.0 otherwise. The network is
trained to an acceptable error level and during classification an output
greater than 0.5 is accepted to indicate a videotext region.
[Block diagram: Image and block coordinates → Filtering/Transformation → Feature extraction → Neural network classifier → Result s: s > 0.5 text, s < 0.5 background]
Figure 3.6: Block diagram of the neural network classifier
In order to obtain better localization, the 16x16 block is moved 4
pixels vertically and horizontally, and at each position of the block, the
corresponding feature vector is fed to the network. The output of the
network is accumulated on each pixel and the accumulation result is
thresholded with 0.5. The pixels above the threshold are considered as
pixels of text regions.
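The sliding-block classification and per-pixel accumulation described above can be sketched as follows. The forward pass of the 3-layer, single-output network is written out explicitly with sigmoid units; averaging the overlapping block outputs before applying the 0.5 threshold is an assumption of this sketch, since the thesis only states that the outputs are accumulated and thresholded.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mlp_output(v, W1, b1, W2, b2):
        """Forward pass of a 3-layer (one hidden layer), single-output network;
        W2 is a 1-D weight vector and b2 a scalar."""
        return float(sigmoid(W2 @ sigmoid(W1 @ v + b1) + b2))

    def classify_text_pixels(feature_fn, img_shape, W1, b1, W2, b2,
                             block=16, step=4, thr=0.5):
        """Slide a 16x16 block in steps of 4 pixels, feed its feature vector to the
        network, accumulate the output on every covered pixel, and threshold.
        feature_fn(x, y) is assumed to return the feature vector of the block whose
        top-left corner is (x, y), e.g. from the Gabor or wavelet extractors above."""
        h, w = img_shape
        acc = np.zeros((h, w))
        cnt = np.zeros((h, w))
        for y in range(0, h - block + 1, step):
            for x in range(0, w - block + 1, step):
                out = mlp_output(feature_fn(x, y), W1, b1, W2, b2)
                acc[y:y + block, x:x + block] += out
                cnt[y:y + block, x:x + block] += 1
        # assumption: average the accumulated responses before the 0.5 threshold
        avg = np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
        return avg > thr    # True on pixels considered to belong to text regions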
The neural network is trained using the RPROP algorithm [14]. The
algorithm can be briefly described as follows:
RPROP (resilient backpropagation) is a very useful gradient based
learning algorithm. It uses individual adaptive learning rates combined with
the so-called “Manhattan" update step.
The standard backpropagation updates the weights according to
$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}}$$
The “Manhattan" update step, on the other hand, uses only the sign
of the derivative (the reason for the name should be obvious to anyone who
has seen a map of Manhattan), i.e.
$$\Delta w_{ij} = -\eta\cdot\mathrm{sign}\!\left(\frac{\partial E}{\partial w_{ij}}\right)$$
The RPROP algorithm combines this Manhattan step with individual
learning rates for each weight, and the algorithm goes as follows
$$\Delta w_{ij}(t) = -\eta_{ij}(t)\cdot\mathrm{sign}\!\left(\frac{\partial E(t)}{\partial w_{ij}}\right)$$
where wij denotes any weight in the network.
The learning rate $\eta_{ij}(t)$ is adjusted according to

$$\eta_{ij}(t) = \begin{cases} \gamma^{+}\,\eta_{ij}(t-1) & \text{if } \partial E(t)\cdot\partial E(t-1) > 0 \\ \gamma^{-}\,\eta_{ij}(t-1) & \text{if } \partial E(t)\cdot\partial E(t-1) < 0 \end{cases}$$

where

$$\partial E(t) = \frac{\partial E(t)}{\partial w_{ij}}$$

and $\gamma^{+}$ and $\gamma^{-}$ are growth and shrinking factors such that $0 < \gamma^{-} < 1 < \gamma^{+}$. The values used during training are $\gamma^{-} = 0.5$ and $\gamma^{+} = 1.5$, with limits such that $10^{-6} < \eta_{ij}(t) < 50$.
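A minimal sketch of the RPROP weight update described by these equations, using the growth/shrinking factors and learning-rate limits quoted above, is given below.

    import numpy as np

    def rprop_step(weights, grad, prev_grad, lr,
                   gamma_plus=1.5, gamma_minus=0.5, lr_min=1e-6, lr_max=50.0):
        """One RPROP update. `weights`, `grad`, `prev_grad` and `lr` are arrays of
        the same shape (one learning rate per weight); returns updated copies."""
        sign_change = grad * prev_grad
        lr = np.where(sign_change > 0, lr * gamma_plus, lr)    # same sign: grow the rate
        lr = np.where(sign_change < 0, lr * gamma_minus, lr)   # sign flipped: shrink it
        lr = np.clip(lr, lr_min, lr_max)                       # 1e-6 < eta < 50
        new_weights = weights - lr * np.sign(grad)             # "Manhattan" step
        return new_weights, lr, grad.copy()                    # grad becomes prev_grad next time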
To compare the two feature extraction methods, we can look at the
training behaviors of the two training sets. The training performances of the
two methods are shown in Figure 3.7. The graph contains plots of error
versus iteration count during training. Both feature vector sets are extracted
from the same videotext regions. As can be seen from the graph, Gabor filters
are much more powerful representatives of videotext regions.
[Plot: training error (SSD) versus number of iterations for the wavelet-based and Gabor-based feature sets.]
Figure 3.7: Training performance of Gabor and wavelet-based features
Although Gabor filtering leads to a much more efficient representation of videotext regions, its computational complexity reduces its usefulness in real-time applications. Nevertheless, for the rest of the experiments, Gabor filtering with the neural network classifier will be used for texture analysis.
After thresholding the neural network output image, a bounding box is found for each connected region. These bounding boxes indicate the candidate videotext regions. In Figure 3.8, a video frame, the Gabor filter outputs and the neural network output image are shown. As can be seen from the figure, the network output is very high on the pixels corresponding to videotext regions. Although the background is fairly complex, the network can distinguish text from non-text regions.
Figure 3.8: Neural network output image example.
More examples of the neural network output image are shown in Figure 3.9. In these examples, videotext regions, and even scene text, give a high neural network response.
Figure 3.9: Typical neural network output examples.
3.2 Contrast analysis
Another important distinctive property of videotext regions is their high contrast with the background. This property comes from the fact that superimposed text is typically rendered to be easily read.

For contrast analysis, a simple contrast measure proposed by Lienhart et al. [7] is used. For each video frame, a binary contrast image is derived, in which pixels in regions showing sufficiently high absolute local contrast are marked with 1 and all other pixels are marked with 0.
The absolute local contrast at position (x, y) is measured by

$$C(x,y) = \sum_{k=-r}^{r}\sum_{l=-r}^{r} G_{k,l}\,\bigl|\,I(x,y) - I(x-k,\,y-l)\,\bigr|$$

where $G_{k,l}$ denotes a 2D Gaussian smoothing filter, and r denotes the size of the local neighborhood.
This operation constructs a 2D array of positive values that grow around high-contrast regions, especially at the edges of regions. Since the characters of videotext usually form strong edges, this array can be used to identify videotext regions. The array should be thresholded in order to get rid of the low-contrast regions. For this purpose, the values in the array are normalized to the [0, 1] interval and thresholded at 0.5. The pixels above the threshold are marked with 1 and the rest with 0.
Next, each pixel with value 1 is dilated by half the maximum expected stroke width of a character. As a result, all character pixels, as well as some non-character pixels that also show high local contrast, are registered in the binary contrast image. Figure 3.10 shows a contrast image and the corresponding thresholded image.
Figure 3.10: Contrast analysis example.
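A minimal sketch of the contrast analysis described in this section is given below. It computes the Gaussian-weighted sum of absolute local differences, normalizes, thresholds at 0.5 and dilates; the neighborhood size, Gaussian sigma, stroke width and the np.roll border handling are illustrative assumptions.

import numpy as np
from scipy.ndimage import binary_dilation

def local_contrast(gray, r=2, sigma=1.0):
    # C(x, y) = sum over the (2r+1)x(2r+1) neighborhood of
    # G[k, l] * |I(x, y) - I(x - k, y - l)|, with Gaussian weights G.
    gray = np.asarray(gray, dtype=np.float64)
    ax = np.arange(-r, r + 1)
    g1 = np.exp(-ax ** 2 / (2.0 * sigma ** 2))
    G = np.outer(g1, g1)
    G /= G.sum()
    C = np.zeros_like(gray)
    for k in range(-r, r + 1):
        for l in range(-r, r + 1):
            shifted = np.roll(np.roll(gray, k, axis=0), l, axis=1)
            C += G[k + r, l + r] * np.abs(gray - shifted)
    return C

def contrast_mask(gray, r=2, sigma=1.0, stroke_width=8):
    # Normalize to [0, 1], threshold at 0.5, then dilate by half the
    # (assumed) maximum expected stroke width.
    C = local_contrast(gray, r, sigma)
    C = C / (C.max() + 1e-12)
    return binary_dilation(C > 0.5, iterations=max(1, stroke_width // 2))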
3.3 Region analysis
For a videotext extraction system it is necessary to employ some
kind of region analysis in order to isolate candidate text regions from
background. Here we use the term region for connected pixels that are homogeneous in terms of gray-level values. Since the characters of videotext are assumed to be homogeneous, they can be treated as regions and analyzed at the object level.
Once the image is decomposed into non-overlapping homogeneous regions, it is possible to exploit size restrictions on the characters of videotext. Specifically, regions that are too large or too small to represent a character are marked as background and removed.

The criterion used to group pixels into a region is that the absolute gray-level difference between any pair of pixels within the region cannot exceed "delta", which is experimentally determined to be 20.
The segmentation is performed in a recursive manner, each time checking the 4-connected neighbors of a pixel. The segmentation procedure for region labeling can be outlined as follows:
MAIN LOOP:
FOR (each pixel in the image) BEGIN
IF(pixel is not owned by a region)BEGIN
Register a new region.
Record max and min as the value of the current pixel
CALL TraceNeighbours();
END
END.
RECURSIVE FUNCTION TraceNeighbours:
FOR (all the neighbor pixels) BEGIN
IF (value of the neighbor is between min and max) BEGIN
Accept the neighbor.
CALL TraceNeighbours();
END
ELSE IF (value of the neighbor is greater than max but less
than min+delta) BEGIN
Accept the neighbor.
Update max
CALL TraceNeighbours();
END
ELSE IF (value of the neighbor is less than min but greater
than max-delta) BEGIN
Accept the neighbor.
Update min.
CALL TraceNeighbours();
END
ELSE
Reject neighbor.
END.
Although only the min and max values are tracked during region growing as outlined in the pseudocode above, it is similarly possible to record the minimum and maximum coordinates of the region extent and the number of pixels contained in the region.
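For illustration, a stack-based variant of the region-growing procedure above is sketched below; the explicit stack replaces the recursion but applies the same min/max/delta acceptance tests, and the default delta of 20 follows the value quoted in the text.

import numpy as np

def grow_regions(gray, delta=20):
    # Label 4-connected regions using the min/max/delta criterion above.
    gray = np.asarray(gray)
    h, w = gray.shape
    labels = np.zeros((h, w), dtype=np.int32)   # 0 means "not yet owned"
    current = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx]:
                continue
            current += 1
            lo = hi = int(gray[sy, sx])          # min and max of the new region
            labels[sy, sx] = current
            stack = [(sy, sx)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < h and 0 <= nx < w) or labels[ny, nx]:
                        continue
                    v = int(gray[ny, nx])
                    if lo <= v <= hi:             # within the current range
                        pass
                    elif hi < v < lo + delta:     # extends max, range stays below delta
                        hi = v
                    elif hi - delta < v < lo:     # extends min, range stays below delta
                        lo = v
                    else:
                        continue                  # reject the neighbor
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    return labels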
By employing the recursive segmentation algorithm to segment the image into non-overlapping homogeneous regions, we obtain complete region information such as the label, area, average gray level, and minimum bounding rectangle (MBR) of each region. Having obtained a number of homogeneous regions in the image, non-videotext background regions are removed based on their size. A region is identified as non-videotext and removed if the height of its MBR is greater than 1/8 of the image height or if its width is greater than 1/8 of the image width. These values are determined experimentally.
Within the remaining regions, characters of videotext may appear split into multiple regions, and a region may contain some background pixels along with character pixels. We cannot guarantee to isolate the characters by segmenting the homogeneous regions alone. In order to overcome this problem, we group the remaining touching regions into a single region, so that the videotext regions are isolated as much as possible. Obviously, some background regions will remain, depending on the complexity of the background. In Figure 3.11, a sample video frame and the candidate regions extracted using region analysis are shown.
Figure 3.11: Region analysis example.
3.4 Thresholding
Thresholding may be considered as another tool for background removal. Applying a single optimum threshold to the entire image can only be successful for images with a uniform background, which is a very rare case. Although global thresholding cannot meet our requirements, local thresholding is necessary at the final stage of the binarization process, since the other tools described above can only narrow the regions of interest down to the videotext-region level but cannot isolate the characters. Therefore, we must find and apply an appropriate local threshold for each candidate region.
For thresholding we have two alternatives: one is from Dorai et al. [13] and the other is the well-known iterative thresholding [15].

Dorai et al. [13] call this thresholding region boundary enhancement and formulate the threshold as

$$T = \frac{\sum I_{cb} + \sum I_i}{N_{cb} + N_i}$$

where $I_{cb}$ is the gray level of a pixel on the circumscribing boundary of the region, $I_i$ is the gray level of a pixel belonging to the region, $N_{cb}$ is the number of pixels on the circumscribing boundary of the region and $N_i$ is the number of pixels belonging to the region.
A pixel is defined to be on the circumscribing boundary of a region
if it does not belong to the region but at least one of its neighbors (using
four-connectivity) does. Those pixels in the region whose gray level is less
than T are marked as belonging to the background and discarded while the
others are retained in the region. Note that this condition is reversed for a negative image, i.e., when the characters are darker than the background. This step is repeated until the value of T does not change over two consecutive iterations.
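A minimal sketch of this boundary enhancement, for the positive (bright text on darker background) case, is given below; the candidate region is assumed to be given as a boolean mask, and the iteration cap is an illustrative safeguard.

import numpy as np

def boundary_enhance(gray, region_mask, max_iter=50):
    # Iteratively discard region pixels whose gray level is below T, where
    # T averages the region and its circumscribing boundary as defined above.
    gray = np.asarray(gray, dtype=np.float64)
    region = region_mask.copy()
    for _ in range(max_iter):
        # Circumscribing boundary: non-region pixels with a 4-connected
        # neighbor inside the region.
        boundary = np.zeros_like(region)
        boundary[1:, :] |= region[:-1, :]
        boundary[:-1, :] |= region[1:, :]
        boundary[:, 1:] |= region[:, :-1]
        boundary[:, :-1] |= region[:, 1:]
        boundary &= ~region
        n_i, n_cb = int(region.sum()), int(boundary.sum())
        if n_i == 0 or n_cb == 0:
            break
        T = (gray[boundary].sum() + gray[region].sum()) / (n_cb + n_i)
        new_region = region & (gray >= T)     # keep pixels at or above T
        if new_region.sum() == n_i:           # T and the region have stabilized
            break
        region = new_region
    return region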
The other method is the well-known iterative thresholding [15], and can be formulated as

$$T = \frac{1}{2}\,(\mu_1 + \mu_2)$$

where $\mu_1$ is the mean of the pixels below the threshold and $\mu_2$ is the mean of the pixels above the threshold. The process of determining the threshold starts with an initial estimate of T, which is the average intensity of the region. Then the region is partitioned using this T and a new T is calculated. This step is repeated until the value of T does not change over two consecutive iterations.
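A short sketch of this iterative thresholding is given below; the tolerance and the iteration cap are illustrative assumptions.

import numpy as np

def iterative_threshold(values, tol=0.5, max_iter=100):
    # values: gray levels of the candidate region (any shape).
    values = np.asarray(values, dtype=np.float64).ravel()
    T = values.mean()                        # initial estimate: average intensity
    for _ in range(max_iter):
        below, above = values[values < T], values[values >= T]
        if below.size == 0 or above.size == 0:
            break
        new_T = 0.5 * (below.mean() + above.mean())
        if abs(new_T - T) < tol:             # T no longer changes (up to tol)
            T = new_T
            break
        T = new_T
    return T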
3.5 Heuristics
After thresholding, some of the remaining regions are eliminated according to a set of heuristics. For a pair of regions i and j, the following conditions are tested:

1 : (MinXj > MaxXi) AND (MinXj − MaxXi > MAX(Wi, Wj)) : the regions are apart horizontally

2 : (MinXi < MinXj < (MinXi + MaxXi)/2) OR ((MinXi + MaxXi)/2 < MaxXj < MaxXi) : the regions coincide horizontally

3 : ((MinYi + MaxYi)/2 < MinYj < MaxYi) OR (MinYi < MaxYj < (MinYi + MaxYi)/2) : the regions do not coincide vertically

where MinX stands for the coordinate of the leftmost pixel, MaxX for the coordinate of the rightmost pixel, MinY for the coordinate of the topmost pixel, MaxY for the coordinate of the bottommost pixel, and W for the width of the region.
All candidate regions are tested in pairs for the above conditions. If a pair does not satisfy all of the conditions, the two regions are grouped together. The candidate regions that do not belong to any group are discarded.
CHAPTER 4
EXPERIMENTAL RESULTS
In the preceding sections, we have introduced the possible tools for videotext detection, i.e. the pre-OCR process. The performance of these methods can only be understood through a set of experiments. Although one can rely on any single method, the performance of the pre-OCR step can be increased by a logical combination of the tools. In this section, possible alternatives for the videotext binarization step will be presented. These alternatives will also be tested for their performance.
Before proceeding further, it will be useful to recall the flow of the pre-OCR process. For videotext binarization, the input is the original image (the video frame) and the output is the binarized image, where the pixels on the videotext characters are black and the remaining pixels are white. The pre-OCR process can be logically grouped into three subprocesses: pre-thresholding, thresholding and post-thresholding.
The pre-thresholding step is constructed from a subset of the texture analysis, contrast analysis and region analysis methods. This step takes the video frame as input and produces a binary mask where 1's indicate the possible videotext regions. The thresholding step is constructed from one (or both) of the thresholding methods described earlier. It takes the video frame and the binary mask produced by the pre-thresholding step as input and produces a binary image, which can be fed to the OCR for recognition. The post-thresholding step stands for any further enhancement of the binary image using the heuristics described.
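The overall flow can be summarized by the sketch below, in which the individual tools are passed in as callables; every function name, and the simplification of thresholding the whole masked area at once rather than each candidate region separately, are illustrative assumptions rather than the exact implementation.

import numpy as np

def binarize_videotext(frame, pre_masks, threshold_fn, enhance_fn, heuristics_fn):
    # Pre-thresholding: intersect the binary masks of the selected analyses
    # (e.g. texture, contrast and/or region analysis).
    mask = np.ones(frame.shape, dtype=bool)
    for mask_fn in pre_masks:
        mask &= mask_fn(frame)
    # Thresholding: an iterative threshold followed by boundary enhancement.
    T = threshold_fn(frame[mask]) if mask.any() else frame.mean()
    binary = mask & (frame >= T)
    binary = enhance_fn(frame, binary)
    # Post-thresholding: heuristic grouping / elimination of candidate regions.
    return heuristics_fn(binary)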
The pre-OCR process can be visualized with the help of the typical example in Figure 4.1. The input image, Figure 4.1(a), is analyzed for texture and color segmentation in Figure 4.1(b) and (c), respectively, in parallel. This particular combination is specific to this example; contrast analysis may also be added as a third option. The intersection of the outputs of these parallel threads forms the input for the thresholding step, Figure 4.1(d). After thresholding this image, we obtain a positive image, where the candidate regions are formed by grouping connected pixels above the threshold, and a negative image, where the pixels are below the same threshold. In Figure 4.1(e), the positive image formed after thresholding is shown. After thresholding, heuristic improvement is applied and the binary image to be fed to the OCR, shown in Figure 4.1(f), is formed.
Figure 4.1: Pre-OCR steps. (a) original image (b) texture analysis output (c) region analysis output (d) pre-thresholding output (e) thresholding output (f) heuristic improvement output
4.1 Videotext recognition
So far we have concentrated on the pre-OCR process. The experiments for the evaluation of the videotext binarization alternatives will be based on the OCR performance. For the OCR part of the experiments, the ABBYY FineReader 5.0 commercial OCR package is used [66]. This package provides a set of Dynamic Link Libraries (DLLs), which can be linked to the main program. By using this library, it is possible to incorporate the capabilities of the OCR engine into the main program.
Tabulated results for the videotext recognition experiments are given in Tables 4.1, 4.2, and 4.3. Table 4.1 lists the description of the experiments, and Tables 4.2 and 4.3 give the result of each experiment, where each row corresponds to one test frame and each column to one experiment. The second column in these tables is the actual number of characters. As can be seen from the results, different combinations of algorithms for the pre-OCR step result in different recognition rates. The maximum recognition rate is achieved when region analysis and texture analysis are used in parallel before thresholding and heuristics are applied after thresholding. A similar result, in fact the second highest recognition rate, is achieved when region analysis is used in parallel with contrast analysis. Using contrast analysis and texture analysis together decreases the recognition rate with respect to texture analysis or contrast analysis alone. This is due to the fact that the drawbacks of contrast analysis and texture analysis affect the output negatively when used together. It is possible to say that contrast analysis and texture analysis should be considered as alternatives to each other. The drawbacks of texture analysis are the limited block size and the need for supervised training, and the main drawback of contrast analysis is the assumption of an "over the average contrast" videotext region. On the other hand, texture analysis is more accurate than contrast analysis, and contrast analysis is computationally much simpler than texture analysis.
Exp. No   Pre-OCR components
          Pre-Thresh.            Thresh.       Post-Thresh.
0         NONE (Raw Image)
1         R.A.                   I.T.
2         R.A.                   B.E.
3         R.A.                   I.T.+B.E.
4         R.A.                   B.E.+I.T.
5         C.A.                   I.T.
6         C.A.                   B.E.
7         C.A.                   I.T.+B.E.
8         C.A.                   B.E.+I.T.
9         R.A. + C.A.            I.T.
10        R.A. + C.A.            B.E.
11        R.A. + C.A.            I.T.+B.E.
12        R.A. + C.A.            B.E.+I.T.
13        R.A. + C.A.            I.T.+B.E.     HR
14        R.A. + T.A.            I.T.+B.E.     HR
15        R.A. + C.A. + T.A.     I.T.+B.E.     HR
16        R.A. + P.T.B.D.        I.T.+B.E.

C.A.: Contrast Analysis   R.A.: Region Analysis   T.A.: Texture Analysis
P.T.B.D.: Perfect Text Box Detection   I.T.: Iterative Thresholding
B.E.: Boundary Enhancement   HR: Heuristics

Table 4.1: Pre-OCR performance evaluation – Part 1.
Frame No   Char. Count   Exp0   Exp1   Exp2   Exp3   Exp4   Exp5   Exp6   Exp7   Exp8
1          10            8      0      0      0      0      8      10     9      10
2          16            0      0      0      13     13     13     12     13     12
3          36            35     30     24     0      24     28     19     27     25
4          38            19     0      35     35     33     0      0      0      0
5          41            20     0      24     10     13     21     0      20     0
6          24            0      21     22     21     0      20     20     19     0
7          10            9      0      0      9      0      0      9      9      9
8          27            0      13     25     21     25     0      0      0      0
9          23            0      0      0      0      0      0      0      4      17
10         84            0      0      0      0      0      29     31     25     29
11         44            27     38     41     40     34     0      32     34     32
12         14            0      0      10     9      7      0      0      0      0
13         35            0      0      0      26     0      21     20     26     20
14         14            0      9      14     0      13     0      0      13     0
15         14            7      0      0      11     10     0      0      0      0
16         14            0      0      0      0      0      9      0      12     0
17         22            4      17     0      18     10     15     0      21     0
18         86            0      33     17     20     48     29     33     0      38
19         133           43     67     78     67     72     51     75     75     64
20         14            0      0      8      0      0      0      0      0      9
21         7             0      0      0      0      0      4      0      0      3
22         23            0      0      8      0      0      0      14     9      11
23         17            0      16     0      16     8      0      0      0      3
24         22            0      4      0      13     0      0      0      0      3
25         28            0      7      21     21     28     8      6      15     8
26         20            0      8      12     12     12     15     14     13     14
27         10            10     0      0      10     0      0      10     10     10
28         41            0      35     34     35     34     15     11     18     12
29         11            0      0      0      0      0      0      0      2      0
30         86            60     53     71     68     71     71     78     79     78
TOTAL      964           242    351    444    475    455    357    394    453    407

Table 4.2: Pre-OCR performance evaluation – Part 2. The first column is the frame number, the second column is the number of characters in each frame, and the remaining columns are the number of characters recognized.
Frame No   Char. Count   Exp9   Exp10  Exp11  Exp12  Exp13  Exp14  Exp15  Exp16
1          10            8      9      9      9      9      9      9      9
2          16            11     13     13     13     13     13     13     13
3          36            30     24     29     22     29     31     0      24
4          38            35     33     34     31     31     32     32     34
5          41            8      21     17     11     12     15     10     13
6          24            0      0      21     0      21     21     21     21
7          10            3      9      9      9      9      9      9      9
8          27            13     23     21     23     23     17     17     21
9          23            0      0      0      0      15     15     15     15
10         84            29     30     28     0      28     28     28     29
11         44            40     0      40     40     40     40     40     40
12         14            0      0      6      0      0      5      5      9
13         35            6      26     22     19     26     0      26     24
14         14            0      0      0      0      13     14     13     13
15         14            0      0      11     0      11     12     11     11
16         14            0      0      0      0      0      0      0      12
17         22            14     17     17     12     18     21     21     21
18         86            19     20     20     39     37     39     39     30
19         133           60     69     72     63     65     73     73     73
20         14            0      0      0      0      5      0      5      9
21         7             0      0      0      3      3      3      3      2
22         23            0      0      0      0      9      9      9      9
23         17            15     0      16     8      0      16     16     15
24         22            0      0      0      3      0      6      0      10
25         28            22     28     27     28     21     18     18     28
26         20            0      11     11     11     10     11     10     12
27         10            10     10     10     10     10     10     10     10
28         41            16     16     18     16     18     38     18     38
29         11            0      0      2      0      0      0      0      2
30         86            78     78     82     78     78     62     65     77
TOTAL      964           417    437    535    448    554    567    536    633

Table 4.3: Pre-OCR performance evaluation – Part 3. The first column is the frame number, the second column is the number of characters in each frame, and the remaining columns are the number of characters recognized.
Another remark about the experiments concerns the utilization of the thresholding techniques. The best results are obtained when the candidate regions are first thresholded with iterative thresholding, and the resulting intensity regions are then thresholded again with the boundary enhancement technique. This is always the case, independent of the pre-thresholding part. The reason for this behavior is the requirement of determining the boundaries of the candidate regions before applying boundary enhancement. On the other hand, the heuristics increase the average recognition rate, but their effect is not very large.

In the last experiment, experiment 16, the text boxes are placed manually instead of using automated texture analysis, and an upper bound for the performance is determined. For this experiment, heuristic improvement is not applied because non-character regions are almost completely eliminated by manually placing the text boxes.
In Figures 4.2 and 4.3, sample images from the experiment set and the corresponding recognition results are shown.
THE qUEEN MOTHER HAS DIED
Nliiir bit iild ihc titn) 1 In hvr limp
Me olmu^ Anderson Pierson
Kibirli hukuk prgfesoru
Figure 4.2: Various recognition results.
MQNTQYA
56 007 56442
CANU
MAKDULEVILDIZ/ fST
Figure 4.3: Various recognition results, continued.
CHAPTER 5
CONCLUSIONS
The problem of detection and extraction of textual information from video frames is a relatively new subject. Throughout the review of the related work, it is observed that none of the proposed methods is mature enough to be accepted as a standard framework. However, as the literature suggests, the use of the distinctive properties of videotext leads to relatively high accuracy in the detection of videotext character regions. This work also concentrates on the use of these distinctive properties.
As detailed in the preceding chapters, the distinctive properties of videotext are texture, contrast and color uniformity. In order to binarize the image for recognition, one should first localize the candidate character regions and then threshold these regions individually.
By employing contrast analysis and/or texture analysis, one can narrow the regions of interest, i.e. the candidate character regions, down to the text-box level. Additional analysis of homogeneous regions further resolves the regions, even down to the character level, since large homogeneous regions outside the text box are usually connected to the regions between the characters of videotext. The use of region analysis alone, however, does not give satisfactory results, since there usually exist many non-character homogeneous regions, most of which cannot be eliminated by shape analysis or any heuristics. Therefore, region analysis should be supported with texture and/or contrast analysis in order to reduce the candidate regions to the characters of videotext.
As can be seen from the experimental results, the use of contrast analysis together with texture analysis decreases the recognition rate with respect to texture or contrast analysis alone. This is due to the fact that both methods carry similar information and have drawbacks that limit the success of recognition. It is possible to conclude that contrast analysis and texture analysis should be considered as alternatives to each other. The drawbacks of texture analysis are the limited block size and the need for supervised training, whereas the main drawback of contrast analysis is the "over the average contrast" assumption for videotext regions. On the other hand, texture analysis is much more accurate than contrast analysis, and contrast analysis is computationally much simpler than texture analysis.
The proposed system is very successful for images where a uniformly colored frame surrounds the videotext and/or the characters of videotext are large and have high contrast with the surrounding background. As the character size decreases and the background complexity increases, the recognition rate decreases. The overall performance of the system also depends strongly on image resolution, since the OCR utilized during the simulations (as well as many other OCR engines) requires high-resolution input for accurate recognition.
Another conclusion regarding texture analysis is that Gabor-filtering-based features represent videotext regions much better than wavelet-based features. Since Gabor filters have higher directional resolution, they are much more appropriate for representing videotext. This comparison is accomplished by observing the training performance of both feature sets, i.e. by observing the error at the output of the artificial neural network during training. To our knowledge, this kind of feature saliency test has not been reported in the literature.
Further improvements of the system can be achieved by integrating information over multiple video frames. One can assume that the videotext is stationary and, using a simple low-pass filter, get rid of the moving parts of the video. One way of doing this may be averaging the consecutive frames and then subtracting the result from the middle frame, as sketched below. The difficulty with this method arises at shot boundaries, where the continuity is completely lost; a sliding window for filtering would be an appropriate solution. Obviously, the stationarity assumption is not always valid, because there are cases where videotext moves in a linear fashion. For such situations the system can be supported with a suitable tracking algorithm. By multiple-frame integration it is also possible to enhance the resolution of the frames. Moreover, another improvement would be to make the texture analysis unsupervised, to increase robustness.
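As a sketch of the multiple-frame idea mentioned above, the following computes a sliding-window temporal average and the residual of the middle frame with respect to it, under the stationary-text assumption; the window size is an illustrative choice and no shot-boundary handling is included.

import numpy as np

def temporal_average_and_residual(frames, window=5):
    # frames: array of shape (N, H, W); for each valid middle frame, return
    # the temporal average over the window (a low-pass result that attenuates
    # moving content) and the middle frame minus that average.
    frames = np.asarray(frames, dtype=np.float64)
    half = window // 2
    averages, residuals = [], []
    for t in range(half, len(frames) - half):
        avg = frames[t - half:t + half + 1].mean(axis=0)
        averages.append(avg)
        residuals.append(frames[t] - avg)
    return np.array(averages), np.array(residuals)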
REFERENCES
[1] S. Mori, C.Y. Suen, and K. Yamamoto, “Historical Review of
OCR Research and Development,” Proc. IEEE, vol. 80, no. 7, pp. 1029-
1058, July 1992.
[2] Takatoo M., et al., “Gray-Scale Image-Processing Technology Applied to a Vehicle License Number Recognition System,” Proc. Int. Workshop Industrial Applications of Machine Vision and Machine Intelligence, Los Alamitos, CA, IEEE Computer Society, pp. 76-79, 1987.
[3] Zhong Y., Karu K., Jain A.K., “Locating Text in Complex Color
Images,” Pattern Recognition, 28(10), pp. 1523-1535, 1995.
[4] Wu V., Manmatha R., Riseman E.M., “TextFinder: An
Automatic System To Detect and Recognize Text in Images,” IEEE Trans.
on Pattern Analysis and Machine Intelligence, Vol. 21, No. 11, November
1999.
[5] Sato T., Kanade T., Hughes E.K., Smith M.A., Satoh S., “Video
OCR: Indexing Digital News Libraries by Recognition of Superimposed
Captions,” Multimedia Systems, 7, pp. 385-395, 1999.
[6] Ohya J., Shio A., Akamatsu S., “Recognizing Characters in
Scene Images,” IEEE Trans. on Pattern Analysis and Machine Intelligence,
Vol. 16, No. 2, February 1994.
[7] Lienhart R., Effelsberg W., “Automatic Text Segmentation and
Text Recognition for Video Indexing,” Multimedia Systems, 8, pp. 69-81,
2000.
[8] Jain A.K., Yu B., “Automatic Text Location in Images and
Video Frames,” Proc. IEEE Pattern Recognition, Vol. 31, No. 12, pp. 2055-
2076, 1998.
[9] Jain A.K., Bhattacharjee S., “Text Segmentation using Gabor
Filters for Automatic Document Processing,” Machine Vision and
Applications, 5, pp. 169-184, 1992.
[10] Li H., Doermann D., Kia O., “Automatic Text Detection and
Tracking in Digital Video,” Computer Vision Lab. Technical Report,
University of Maryland, CAR-TR-900, December 1998.
[11] Daugman J.G., “Two Dimensional Spectral Analysis of Cortical
Receptive Field Profile,” Vision Research, vol. 20, pp. 847-856, 1980.
[12] Lee T.S., “Image Representation Using 2-D Gabor Wavelets,”
IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 18, no.10,
October 1996.
[13] Dimitrova N., Agnihotri L., Dorai C., Bolle R., “MPEG-7
Videotext description scheme for superimposed text in images and video,”
Signal Processing: Image Communication, 16, pp. 137-155, 2000.
[14] Riedmiller M., and Braun H., “A Direct Adaptive Method for
Faster Backpropagation Learning: The RPROP Algorithm,” Proc. of the
IEEE Int. Conf. on Neural Networks, pp. 586-591, 1993.
[15] Jain R., Kasturi R., Schunck B.G., Machine Vision, McGraw-Hill, 1995.
[16] Chellappa, R., Manjunath, B.S., Simchony, T., "Texture
segmentation with neural networks," Neural Networks in Signal Processing,
pp 37 – 61, Prentice Hall, 1992
[17] Mallat S.G., “Multiresolution approximations and wavelet orthonormal bases of L2(R),” Trans. Amer. Math. Soc., 315, pp. 69-87, 1989.
[18] Clark M., Bovik A.C., “Experiments in segmenting texture patterns using localized spatial filters,” Pattern Recognition, 22(6), pp. 707-717, 1989.
[19] Jain A.K., Farrokhnia F., “Unsupervised texture segmentation
using Gabor filters,” Proc. IEEE Int. Conf. Sys. Man. Cybern., pp. 14 - 19,
Los Angeles, CA, November 1990.
[20] Gupta A., Jain R., “Visual Information Retrieval,”
Communications of the ACM, 40(5), pp. 70 - 79, May 1997.
[21] Antani S., Kasturi R., and Jain R., “Recognition Methods in
Image and Video Databases: Past, Present and Future,” Joint IAPR
International Workshops SSPR and SPR, 1451 in Lecture Notes in
Computer Science, pp. 31 - 38, 1998.
[22] Gargi U., Kasturi R., and Antani S., “Performance characterization and comparison of video indexing algorithms,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 559-565, 1998.
[23] Gargi U., Kasturi R., and Strayer S.H., “Performance
Characterization of Video-Shot-Change Detection Methods,” IEEE
Transactions on Circuits and Systems for Video Technology, 10(1), pp. 1 –
13, 2000.
[24] Fischer S., Lienhart R., and Effelsberg W., “Automatic Recognition of Film Genres,” Proceedings of the ACM Multimedia Conference, pp. 295-304, 1995.
[25] Chaddha N., Sharma R., Agrawal A., and Gupta A., “Text
Segmentation in Mixed–Mode Images,” In 28th Asilomar Conference on
Signals, Systems and Computers, pp. 1356-1361, October 1995.
[26] Hauptmann A. and Smith M., “Text, Speech, and Vision for
Video Segmentation: The Informedia Project,” In AAAI Fall 1995
Symposium on Computational Models for Integrating Language and Vision,
1995.
[27] Smith M.A. and Kanade T., “Video Skimming and
Characterization through the Combination of Image and Language
Understanding,” In International Workshop on Content-Based Access of
Image and Video Databases, pp. 61-70, 1998.
[28] H.K. Kim, “Efficient Automatic Text Location Method and
Content Based Indexing and Structuring of Video Database,” Journal of
Visual Communications and Image Representation, 7(4): 336-344,
December 1996.
[29] B.L. Yeo and B. Liu, “Visual Content Highlighting Via
Automatic Extraction of Embedded Captions on MPEG Compressed
Video,” In SPIE/IS&T Symposium on Electronic Imaging Science and
Technology: Digital Video Compression: Algorithms and Technologies, vol.
2668, pp 38-47, 1996.
[30] R. Lienhart and F. Stuber, “Automatic Text Recognition for Video Indexing,” In Proceedings of the ACM International Multimedia Conference & Exhibition, pp. 11-20, 1996.
[31] R. Lienhart and F. Stuber, “Automatic Text Recognition in Digital Videos,” In Proceedings of SPIE, volume 2666, pages 180-188, 1996.
[32] R. Lienhart and F. Stuber, “Indexing and Retrieval of Digital
Video Sequences based on Automatic Text Recognition,” In Proceedings of
the ACM International Multimedia Conference & Exhibition, pp. 419-420,
1996.
[33] H.M. Suen and J.F. Wang, “Text String Extraction from Images
of Colour-Printed Documents,” IEE Proceedings: Vision, Image and Signal
Processing, 143(4), August 1996.
[34] H.M. Suen and J.F. Wang, “Segmentation of Uniform-Coloured
Text from Colour Graphics Background,” IEE Proceedings: Vision, Image
and Signal Processing, 144(6): 317-322, December 1997.
[35] H. Hase, T. Shinokawa, M. Yoneda, H. Sakai, and H.
Maruyama, “Character String Extraction by Multi-stage Relaxation,” In
International Conference on Document Analysis and Recognition, volume
1, pp. 298-302, 1997.
[36] H. Hase, T. Shinokawa, M. Yoneda, H. Sakai, and H.
Maruyama, “Character String Extraction from a Color Document,” In
International Conference on Document Analysis and Recognition, volume
1, pp 75-78, 1999.
[37] D. Lopresti and J. Zhou, “Document Analysis and the World
Wide Web,” In IAPR International Workshop on Document Analysis
Systems, pp. 651-659, 1996.
[38] J. Zhou, D. Lopresti, and Z. Lei, “OCR for World Wide Web Images,” In Proceedings of SPIE Document Recognition IV, volume 3027, pages 58-66, 1997.
[39] Y. Ariki and T. Teranishi, “Indexing and Classification of TV News Articles based on Telop Recognition,” In International Conference on Document Analysis and Recognition, volume 2, pages 422-427, 1997.
[40] S. Kurakake, H. Kuwano, and K. Odaka, “Recognition and
Visual Feature Matching of Text Region in Video for Conceptual Indexing,”
In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for
Image and Video Databases I, Vol. SPIE 1908, pages 368-379, 1997.
[41] F. Le Bourgeois, “Robust Multifont OCR System from Gray Level Images,” In International Conference on Document Analysis and Recognition, volume 1, pages 1-5, 1997.
[42] J. Zhou, D. Lopresti, and T. Tasdizen, “Finding Text in Color
Images,” In Proceedings of SPIE Document Recognition V, volume 3305,
pages 130-140, 1998.
[43] M.V.D. Schaar-Mitrea and P.H.N. de With, “Compression of
Mixed Video and Graphics Images for TV Systems,” In SPIE Visual
Communications and Image Processing, pages 213-221, 1998.
[44] S. Messelodi and C.M. Modena, “Automatic Identification and
Skew Estimation of Text Lines in Real Scene Images,” Pattern Recognition,
32(5): 791-810, May 1999.
[45] H. Li and D. Doermann, “A Video Text Detection System
Based on Automated Training,” In Proc. International Conference on
Pattern Recognition, volume2, pages 223-226, 2000.
[46] H. Li, D. Doermann, and O. Kia, “Automatic Text Detection
and Tracking in Digital Video,” IEEE Transactions on Image Processing,
9(1): 147-156, 2000.
[47] Y. Zhong, H. Zhang, and A.K. Jain, “Automatic Caption
Localization in Compressed Video,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(4): 385–392, 2000.
[48] V.Y. Mariano and R. Kasturi, “Locating Uniform-Colored Text
in Video Frames,” In Proc. International Conference on Pattern
Recognition, volume 4, pages 539-542, 2000.
[49] B.T. Chun, Y. Bae, and T.Y. Kim, “Text Extraction in Videos
using Topographical Features of Characters,” In IEEE International Fuzzy
Systems Conference, volume2, pages 1126-1130, 1999.
[50] C. Garcia and X. Apostolidis, “Text detection and segmentation
in complex color images,” In Proc. IEEE International Conference on
Acoustics, Speech, and Signal Processing, pages 2326-2329, 2000.
[51] O. Hori, “A Video Text Extraction Method for Character
Recognition,” In International Conference on Document Analysis and
Recognition, pages 25-28, 1999.
[52] H. Kasuga, M. Okamoto, and H. Yamamoto, “Extraction of
Characters from Color Documents,” In Proceedings of IS&T/SPIE
Conference on Document Recognition and Retrieval VII, volume 3967,
pages 278-285, 2000.
[53] Y.K. Lim, S.H. Choi, and S.W. Lee, “Text Extraction in MPEG
Compressed Video for Content-based Indexing,” In Proc. International
Conference on Pattern Recognition, volume 4, pages 409-412, 2000.
[54] M. Sawaki, H. Murase, and N. Hagita, “Automatic acquisition of context-based image templates for degraded character recognition in scene images,” In Proc. International Conference on Pattern Recognition, volume 4, pages 15-18, 2000.
[55] K. Sobottka, H. Bunke, and H. Kronenberg, “Identification of Text on Colored Book and Journal Covers,” In International Conference on Document Analysis and Recognition, pages 57-62, 1999.
[56] J.B. Bosch and E.M. Ehlers, “Remote Sensing of Characters on
3D Objects,” Computers & Industrial Engineering, 33(1-2): 429-432, 1997.
[57] P. Comelli, P. Ferragina, M.N. Granieri, and F. Stabile, “Optical Recognition of Motor Vehicle License Plates,” IEEE Transactions on Vehicular Technology, 44(4): 790-799, November 1995.
[58] Y. Cui and Q. Huang, “Extracting Characters of License Plates
from Video Sequences,” Machine Vision and Applications, 10(5-6): 308-
320, April 1998.
[59] L.L. Winger, M.E. Jernigan, and J.A. Robinson, “Character
Segmentation and Thresholding in Low-Contrast Scene Images,” In
Proceedings of SPIE, volume 2660, pages 286-296, 1996.
[60] T. Gandhi, R. Kasturi, and S. Antani, “Application of Planar
Motion Segmentation for Scene Text Extraction,” In Proc. International
Conference on Pattern Recognition, volume 3, pages 445-449, 2000.
[61] MPEG Requirements Group, MPEG-7 Requirements
Document, Doc. ISO/MPEG N2460, MPEG Atlantic City Meeting, October
1998.
[62] MPEG Requirements Group, MPEG-7 Requirements
Document, Doc. ISO/MPEG N2461, MPEG Atlantic City Meeting, October
1998.
[63] MPEG-7 Description Schemes, ISO/ IEC/ JTC1/ SC29/ WG11/
N2844, July 1999.
[64] MPEG-7 Description Schemes (V0.6), ISO/ IEC/ JTC1/ SC29/
WG11/ M5040, Version 0.6-a, September 1999.
[65] D. Gabor, “Theory of communication,” J. IEE, vol. 93, pp. 429-459, 1946.
[66] http://www.abbyy.com