DETECTING AND RECOGNIZING TEXT FROM VIDEO FRAMES
A THESIS SUBMITTED TO
THE GRADUATE SCHOOL OF NATURAL AND APPLIED SCIENCES
OF
THE MIDDLE EAST TECHNICAL UNIVERSITY
BY
SERHAT TEKİNALP
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE
IN
THE DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING
SEPTEMBER 2002
Approval of the Graduate School of Natural and Applied Sciences ____________________________ Prof. Dr. Tayfur Öztürk Director
I certify that this thesis satisfies all the requirements as a thesis for the degree of Master of Science.
____________________________ Prof. Dr. Mübeccel Demirekler Head of Department
This is to certify that we have read this thesis and that in our opinion it is fully adequate, in scope and quality, as a thesis for the degree of Master of Science in Electrical and Electronics Engineering.
____________________________ Assoc. Prof. Dr. A. Aydın Alatan Supervisor
Examining Committee Members: Assoc. Prof. Dr. İsmet Erkmen ____________________________ Assoc. Prof. Dr. A. Aydın Alatan ____________________________ Assoc. Prof. Dr. Gözde Bozdağı Akar ____________________________ Assoc. Prof. Dr. Buyurman Baykal ____________________________ Ersin Esen ____________________________
ABSTRACT
DETECTING AND RECOGNIZING TEXT FROM VIDEO FRAMES
Tekinalp, Serhat
M.S, Department of Electrical and Electronics Engineering
Supervisor: Assoc. Prof. Dr. A. Aydın Alatan
September 2002, 93 pages
An important source of information contained in digital video is the
text in video frames. This text may appear as a part of the scene (scene text)
or may be rendered artificially during production (superimposed text). By
detecting and recognizing videotext, it is possible to index and easily
manage large video archives. There are some basic properties that make videotext detectable: a distinctive texture, high contrast, and uniform color. By exploiting these properties, it is possible to detect text regions and to binarize the image for character recognition after thresholding these regions. In this thesis, a complete framework for the detection and recognition of videotext is presented. The performance of the system is tested in terms of its recognition rate for various combinations. The system achieves a character recognition rate of up to 59%, which is quite reasonable for most purposes.
Keywords: Text detection, text extraction, videotext, video-ocr,
texture, segmentation.
ÖZ
DETECTING AND RECOGNIZING TEXT FROM VIDEO FRAMES
Tekinalp, Serhat
M.S., Department of Electrical and Electronics Engineering
Supervisor: A. Aydın Alatan
September 2002, 93 pages
One of the most important sources of information in digital video is the text contained within the video. Text in video may appear as part of the scene, or it may be added artificially during the production of the video. Detecting and reading this text makes it possible to index and manage large video databases. Videotext has basic properties that make it detectable; these can be described as distinctive texture, high contrast, and uniform color. Using these properties, it is possible to locate text regions in video and, by thresholding the located regions, to binarize the image for character recognition. In this work, a complete system for detecting and recognizing videotext is presented. The performance of the system is tested in terms of character recognition rate for different combinations. On average, the system achieves a recognition rate of 59%, which can be considered sufficient for many applications.
Keywords: Text detection, text extraction, videotext, video-OCR, texture, segmentation.
ACKNOWLEDGEMENTS
First of all, I would like to express my sincere appreciation to Dr. A.
Aydın Alatan, my supervisor, for his valuable guidance and insight,
encouragement, support, and reliance throughout the research. I am also
grateful to my supervisory committee members, Dr. İsmet Erkmen, Dr.
Gözde Bozdağı Akar, Dr. Buyurman Baykal, and Ersin Esen, for their
valuable suggestions, contributions and comments.
I wish to express sincere thanks and love to my wife, Bengü, for her
patience, understanding, help and support in every phase of the thesis.
Last, but not least, I am, as always, indebted to my family. The love and support of my mother Gülay, my father Feyzi and my sister Serap remain the bedrock of my life.
TABLE OF CONTENTS
ABSTRACT
ÖZ
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER
1 INTRODUCTION
1.1 Text in video: videotext
1.2 Properties of videotext
2 RELATED WORK
2.1 Connected component-based approaches
2.2 Texture based approaches
2.3 Work on scene text
3 VIDEOTEXT DETECTION
3.1 Texture analysis
3.1.1 Texture analysis by Gabor filters
3.1.2 Texture analysis by Haar wavelets
3.1.3 Detection by a feed-forward neural network
3.2 Contrast analysis
3.3 Region analysis
3.4 Thresholding
3.5 Heuristics
4 EXPERIMENTAL RESULTS
4.1 Videotext recognition
5 CONCLUSIONS
REFERENCES
LIST OF TABLES
4.1: Pre-OCR performance evaluation – Part1.
4.2: Pre-OCR performance evaluation – Part2.
4.3: Pre-OCR performance evaluation – Part3.
LIST OF FIGURES
1.1: Example video frame containing both superimposed text and scene text.
1.2: Demonstration of complexity of background-character separation.
3.1: Block diagram overview of videotext detection and recognition process.
3.2: Gray-scale representation of Gabor filters.
3.3: Gabor filter outputs for a typical video frame.
3.4: Feature vector extraction using Gabor filters.
3.5: First level wavelet decomposition of a video frame.
3.6: Block diagram of the neural network classifier.
3.7: Training performance of Gabor and wavelet-based features.
3.8: Neural network output image example.
3.9: Typical neural network output examples.
3.10: Contrast analysis example.
3.11: Region analysis example.
4.1: Pre-OCR steps.
4.2: Various recognition results.
4.3: Various recognition results, continued.
CHAPTER 1
INTRODUCTION
The increasing capacity of available storage media, transmission
bandwidth, and the efficiency of compression algorithms increase the
availability of online digital imagery and video, and make digital video the
standard form of video broadcast and distribution. Today, many homes own
digital video equipment and are maintaining personal media archives. Large
media archives are also being maintained by several organizations with an
interest in commerce, entertainment, medical research, security, etc. This
increasing popularity and availability has rekindled interest in the problem
of how to manage multimedia information sources efficiently.
The ease of use of digital video is directly related to advances in
compression, transmission, storage, archival, indexing, retrieval, querying
and browsing technologies. Compression of digital video has already
become a mature technology and there are widely accepted compression
standards. Closely related to compression technology, transmission of
digital video is also standardized and becoming much more efficient with
the recent advances in communication technologies. On the other hand, the importance of, and the lack of attention to, the indexing, retrieval, querying and browsing counterparts of digital video technology have only recently been recognized.
Presently, digital video is extensively produced in the entertainment
industry (digital movie and digital broadcast video), business
communications (video conferencing), medical imaging, and surveillance
video among other uses. All of these are candidates to be a part of some
video archival system. Content-based information retrieval from such digital
video databases and media archives is a challenging problem and is rapidly
gaining widespread research and commercial interest. Such systems are also
called Content Based Image (and Video) Retrieval (CBIR) Systems or
Visual Information Retrieval Systems (VIRS) [20].
Research efforts have led to the development of many methods that
can be used to provide pseudo-semantic content-based access to image and
video data. These methods determine the similarity in the visual information
content extracted from low-level features and have their roots in pattern
recognition [21]. Low-level features such as color, shape, and texture are
used to select a good match in response to the user query. Humans, on the other hand, attach semantic meaning to visual content; indexing video sequences therefore requires the extraction of a natural-language representation of the sequence. Traditionally, this process is performed by a human reviewer, who manually annotates images and video sequences with a small number of keyword descriptors after visual inspection.
can be very time-consuming and such delays may inhibit the ability to
perform near-real-time filtering and retrieval. Therefore, to be able to index
video sequences, it is necessary to produce the natural language
representation of the sequence automatically.
Several automated methods have also been developed which attempt
to access image and video data by content from media databases [21]. A
popular approach to address this problem has been to temporally segment
video into subsequences separated by shot changes, gradual transitions or
special effects such as fade-ins and fade-outs [22,23]. A story board of
events that occurred in the video can be created by selecting a key frame
from each subsequence. The video can then be queried with visual queries
that use color, texture or activity. This is a pseudo-semantic approach to
video content description, wherein the human interpretation of color, texture
and/or motion defines the content. Some researchers have extended this
philosophy in an attempt to identify the genre of the video [24]. By studying
patterns of color distribution in segmented video clips and relative motion of
objects within them, judgments can be made regarding the genre of the
video.
The next step in content-based indexing of digital video takes a hint
from the method used by humans in understanding visual information.
Humans recognize the objects imaged in the scene and form spatial and
temporal relationships between them in order to understand the semantics
contained. Developing methods that recognize such objects and form relationships between them would go a long way toward the automated generation of content descriptions. One such object is the text contained in the
video.
Generally speaking, the natural language representation of a video
sequence can most directly be extracted from information carriers such as
voice, closed-caption text, scene text and superimposed text. Although
sound and closed captions provide index information on the spoken content,
basic annotational information often appears only in the image text. For
example, sports scores, product names, scene locations, speaker names,
movie credits, program introductions and special announcements often
appear and supplement or summarize the visual content. These types of
annotations are usually rendered in high contrast with respect to the
background, are of readable quality, and use keywords that facilitate
indexing. Specific searches for a particular actor or reference to a particular
story can easily be realized if there exists access to this textual content.
It is important to note that the efforts of the Moving Picture Experts Group (MPEG), a joint working group of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC), are producing standards that are object based. Within these standards, video can be encoded as a static background with a moving foreground composed of various objects, ultimately allowing textual content to be annotated as an object and making it easier to extract and use as an indexing tool [61-64]. The standards are not yet completely implemented, but the value of textual content rendered as part of the video for indexing is widely accepted, and researchers worldwide are continuously working on the subject.
1.1 Text in video: videotext
There is a considerable amount of text occurring in video which is a
useful source of information. The presence of text in a scene, to some
extent, naturally describes its content. If this text information can be
harnessed, it can be used along with the temporal segmentation, and other
video object recognition methods to provide richer content-based access to
the video data. The text in video frames can be classified broadly into two large categories: superimposed (caption, artificial, or overlay) text and scene text. Both will hereafter be called videotext, to distinguish them from ASCII text.
Scene text appears within the scene, which is then captured by the
recording device. It is an integral part of the image and can be considered a
sample of the world. Examples of scene text include street signs, billboards,
text on trucks, and writing on shirts. Although valuable, the appearance of
such videotext is typically incidental to the scene content, and only useful in
applications such as navigation, surveillance or reading text on known
objects, rather than general indexing and retrieval. One exception is in
domains, where text or symbols may be used to identify players or vehicles.
Scene text is often difficult to detect and extract since it may appear in a
virtually unlimited number of poses, sizes, shapes and colors. Moreover
scene text may appear partly occluded by other objects in the scene.
On the other hand, superimposed text is a kind of videotext that is
mechanically added to video frames to supplement the visual content, and is
often more structured and closely related to the subject than scene text is.
Examples of superimposed text include headlines, keyword summaries, time
and location stamps, names of people and scores. The descriptors are
typically predictable, have simple styles, and are produced with the intent of
being read by the viewer.
Superimposed text has a number of functions that differ between
domains. In commercials, videotext appears to reinforce the vital
information such as the product name, claims made, or in some cases to
provide disclaimers. In sporting events, videotext is used to identify specific
players, provide game information, or relay statistics. In newscasts,
superimposed text can be used to either identify key features of the scene,
such as location or speaker, to provide a synopsis of the topic, or to provide
a visual summary of statistical information. In movies and television shows,
videotext provides production and acting credits, and in other cases captions
or language translations.
Figure 1.1 shows an example video frame containing both
superimposed text and scene text. In this image, the text giving speed and time information is an example of superimposed text, and it contains information directly related to the content of the image. The commercials, on the other hand, are examples of scene text, and they are not directly related to the content.
Figure 1.1: Example video frame containing both superimposed text and scene text.
For the most part, research in this area has focused on the
identification of superimposed text. However, there are some cases where
videotext appears as scene text but shows the properties of superimposed
text such as product pack shots on TV commercials. Therefore it is better to
describe the scope of the research as “extracting (i.e. detecting and
recognizing) superimposed text and scene text which possesses
superimposed text properties”, and this is exactly how it is expressed in the
MPEG-7 standard [61-64].
The use of textual information as a key for an indexing system requires conversion of the videotext present in the image to ASCII form, a process commonly known as Optical Character Recognition (OCR).
Unfortunately, commercially available OCR systems are designed for
recognition of characters in printed documents [1]. Characters in these
documents appear in uniform color on a clean background and the document
is typically scanned at high resolution. Therefore, employing OCR for videotext requires segmentation of the characters from the background, which is usually arbitrarily complex. The segmentation consists of background
removal and character binarization. For OCR systems to recognize
characters efficiently, a video frame should be converted to a black and
white image where pixels on the characters of the videotext are black and
the remaining pixels are white. This binarization process is referred to as pre-OCR in an overall sense. The operation is not straightforward, because the character strokes are often composed of pixels with a range of values and the boundary between the character region and the background is not perfectly defined.
This statement can be best understood by Figure 1.2 where the original
image and a magnified character region are shown. As can be seen from the figure, in the magnified image the character cannot be separated from the background, even manually.
Although modern OCR systems are designed to decompose pages for recognition and can still recognize text on a limited portion of complex images, the success of the pre-OCR step is highly critical for the performance of the OCR. The pre-OCR process will obviously be based on the distinctive properties of videotext.
Figure 1.2: Demonstration of complexity of background-character separation. (b) is the magnified version of (a) of the region containing text “JOSE”.
1.2 Properties of videotext
Although it is impossible to define a set of properties that completely covers the whole videotext domain, some properties that cover the majority of cases can be described as follows:
1. Texture
Videotext has a distinctive texture, which makes it distinguishable
from background. This statement stems from the fact that a text from an
unknown alphabet can be distinguished even without recognizing the
characters contained inside the text. This texture is a result of density of
edges forming the characters. Most alphabets have strong edges in a
particular direction, e.g. Latin scripts have strong vertical edges. These
edges form a high frequency texture.
2. Contrast
The colors for artificial text are chosen so as to have contrast against
the background. This is also true for a lot of scene text (signs, billboards,
etc.). Thus color-connected components belonging to text can be segmented
against the frame.
In some cases, the contrast between the text and the background may not be high. This is because the background over which the text is rendered varies both spatially and temporally. It is often observed that the color of caption characters is similar to that of the surrounding frame region. However, even if this occurs, it is usually true only for a portion of the caption or for a short duration, since otherwise the editor would have chosen a different color for the text.
3. Uniform color and intensity
Color is a strong feature for use in visual information indexing. Text
characters tend to have a perceptually uniform color and intensity over the
character stroke. While the character stroke appears to have the same color,
in reality it is usually composed of many different colors. In cases where the
color does vary across the caption, it varies in a gradual way so that adjacent
characters or character segments have very similar colors.
4. Size limits
Text appears in a variety of sizes in video data. Since text is intended to be readable at a range of viewing distances in a limited time, there is usually a minimum size for text characters. However, the upper bound on character sizes is much looser.
5. Geometrical alignment
Characters belonging to an artificial text string are usually
horizontally aligned and the spaces between characters of a word have
restrictions.
These properties will be explained and demonstrated in detail in subsequent sections, but it should be noted that these properties are utilized in the pre-OCR binarization step.
CHAPTER 2
RELATED WORK
This chapter presents a survey of methods in the literature for extracting text from images and video. There has been a growing interest in the
development of methods for detecting, localizing and segmenting text from
images and video. In this chapter, we concentrate on presenting the
literature on detection, localization, and extraction of text regions from
images and video.
Methods for extracting artificial overlay text have primarily used
intensity, color, texture, edge, or a combination of these features. As
mentioned above, most methods have been designed to extract text from
images. Some have been extended for application to video.
Existing work on text detection has focused primarily on characters
in printed and handwritten documents [1]. There also exists considerable
amount of work on text detection in industrial applications, most of which
focus on a very narrow application field with simplifying assumptions, such
as license plate recognition from a uniform background [2]. The research on
text detection and text extraction from natural images and video is not as
mature as the work on document images.
The current research can be mainly divided into two classes:
Connected-component-based approaches [6,7,8], and texture-based methods
[4,5,9,10]. While the former tries to find characters as closed regions
containing uniform color, the latter considers the text as a special class of
texture. The boundary between these two classes is not always clear, since the distinction is sometimes not obvious. As a consequence, there are also some hybrid approaches [3] that utilize both.
2.1 Connected component-based approaches
Connected component-based approaches require high contrast and
resolution in order to give an acceptable performance. Some neighboring
regions with similar color values, i.e. low contrast, may yield erroneous
segmentation of the text [6,7,13]. In some methods, the assumption of
having monochrome text or, a monochrome background surrounding the
text is another limiting factor for the general performance [8]. Moreover,
finding characters as a single closed region is only possible by utilizing
high-resolution images [3]. However, for video indexing, such constraints
are usually far from being practical.
Zhong, Karu and Jain [3] locate text in complex color images. They
propose a method based on quantizing the color space based on peaks in the
histogram. Adjacent colors are merged into these peaks. This assumes that
the text occupies a narrow region of color space and a significant portion of
the image. Color components are then labeled as text based on their
geometrical properties and if there are at least 3 characters aligned
horizontally. If the color quantization step fails because the text has low
contrast or occupies a small portion of the frame, then a character may be
broken into multiple segments. As an alternative, same researchers use the
heuristic of high horizontal spatial variance to localize text. They report
results for compact disc and book cover images and the spatial variance
method on car scenes from video. Although the spatial variance may be
considered as a texture, the dominant characteristic of the approach is based
on connected components. Kim [28] also proposes a very detailed text
localization method for video images and compares it with the method of
Zhong, Karu, and Jain [3]. Color clustering followed by shape analysis is
used to determine text regions. This method appears to be fragile, with a
large number of thresholds that are empirically determined. In a later work
presented by Jain and Yu [8], text from video frames and compressed images is extracted by background color separation and by applying the heuristic that the text has high contrast with the background and is of the
foreground color. A similar assumption made in the earlier work on CD
cover images [3] that the text forms a significantly large portion of the
image, does not hold true for video. The later work shows sample
localizations on video frames but inexplicably misses text that is not
particularly large or not on a uniform background (e.g. text on a weather
map from a weather news broadcast). Messelodi and Modena [44] extract
text from book cover images. They use simple homogeneity properties to
separate text from other image components and further correct their
extraction through estimation of orientation and skew correction of text
lines.
Color clustering is the main tool for connected component analysis.
Lienhart and Stuber [30,31,32] describe a system for automatic text
recognition in digitized movies that works on movie opening and closing
sequences, credit title sequences and closing sequences with title and
credits. Their work is well grounded, but not robust enough to be applicable
to general video, because they expect to extract characters as a single
connected component. They employ color clustering with both spatial and
temporal heuristics of segments and track (possibly scrolling) caption text in
video. The system is likely to fail for noisy video, text with poor contrast, or
text with a highly cluttered background. Suen and Wang [33,34] also
propose a method for segmenting uniformly colored text from a color
graphics background. Page segmentation techniques are applied to separate
text and graphic regions based on color edge strength. Non-text regions are
discarded by applying color distribution properties for text regions. Mariano
and Kasturi [48] propose a color clustering based method for locating text
based on color uniformity of text pixels and horizontal alignment. An image
is searched for rows that pass through vertical and diagonal strokes of text.
The pixels of a row that goes through the middle of caption text on a non-uniform background are clustered in the L*a*b* space. These clusters appear as short streaks. By searching for similar clusters in adjacent rows, heuristics are formed for the detection of horizontal text.
Zhou and Lopresti [37] take a similar approach, involving color clustering, computation of connected components, and the application of spatial heuristics for text to remove components from World Wide Web images.
Zhou, Lopresti, and Lei [38] extend this work to perform OCR on these
images. The work by Zhou, Lopresti, and Tasdizen [42] is another example of color clustering. Their approach adds robustness to their earlier
work on extracting text from World Wide Web images. Color clustering is
done using the Euclidean minimum spanning tree (MST) technique. Spatial
distance between colors is also considered in the clustering process.
In some of the works in literature on text extraction from video,
connected component analysis is performed on multiple frames. An
example work of this kind is presented by Ariki and Teranishi [39]. In their
method, they make very simplistic assumptions as to the nature of the text,
i.e. it is brighter than the background. The method uses simple frame
subtraction to extract text. Kurakake, Kuwano, and Odaka [40] divide the
frame into several uniform sub-regions. Color histogram differencing is
applied to each sub-region. If a significant change is detected between the
sub-regions of consecutive frames, then the latter frame is assumed to have
text in it. Spatial analysis is used to further determine if text indeed exists in
these frames. Yeo’s approach [29] seeks to detect the appearance and
disappearance of text captions. The disadvantage is the assumption that captions occur only in certain predefined areas of the frame, which are then template-matched to detect sudden changes, presumably due to caption appearances.
detection) method assumes that all shot changes have been accurately
detected in the sequence before text detection needs to be applied, which
may not be true. Even if it is, text caption events may co-occur with shot
changes or camera motion. Multiple captions may appear within a shot, they
may fade in and out over long periods, they may coincide with a shot
change or they may remain constant across shot changes.
Hase et al. [35] propose an extraction algorithm for character
strings. Their approach is directed towards binary document images but
works on simple, clean color images after color segmentation. Their
relaxation method has the advantage of handling strings of various
orientations. In later work [36], they extend their method to perform conflict
resolution between overlapping text regions through the use of the likelihood of a character string.
Dimitrova, Agnihotri, Dorai and Bolle [13] present a method to
detect caption text from video frames. The first method assumes that all
caption text is white or yellow in color. The method looks for strong edges
in the Red and Green color channels. Further refinement is done by
identifying homogeneous regions in intensity images, forming positive and
negative images by thresholding and applying heuristics based on text
characteristics for eliminating non-text regions. The text regions are
validated using the temporal redundancy property of text in video.
2.2 Texture based approaches
Among various texture-based text detection algorithms, the main
difference is representation of the texture. The main idea behind texture
analysis is quantifying the distribution of different edge orientations.
Among different models, Gabor filters with various orientations [9] or
simple edge detectors for finding a specific direction [5] or multiresolution
wavelet filtering [4,10] can be used for finding textures. With an arbitrary number of orientations and scales, multichannel Gabor filters provide higher flexibility and hence better discrimination among different textures. Jain et al. [9] describe a method for separating text and
background areas on document images by using a group of multichannel
Gabor filters. However, the extension of such an approach for low-
resolution video frames has not been examined. Although it is claimed that
use of texture for text detection is sensitive to character font size and style
[8], our experimental results show that a feature vector constructed by
directional Gabor filtered images is still a very powerful representation of
text regions.
Frequency components are used as texture representatives. Chaddha,
Sharma, Agrawal, and Gupta [25] propose a method to detect text from
JPEG compressed images. The method extracts the DCT coefficients for
each block of a macro block in the JPEG image. The absolute values of a set
of DCT coefficients are summed. This measure reflects the high spatial
frequency content of blocks containing text. This particular set of
coefficients was empirically determined as the optimum to discriminate text
and non-text blocks. Next, this sum is thresholded using an empirically
determined value. All blocks having a measure greater than this value are
marked as text blocks. Because the JPEG compressed information is not
decompressed, the method is fast and capable of running in real-time. The
algorithm finds artificial and scene text within a range of scales. However, it
fails on text that has low or varying contrast within the string or on text that
is much larger than the block size used (8x8). Zhong et al. [47], as in [25], use the DCT coefficients available in the MPEG I-frames. Selected coefficients that highlight the vertical and horizontal frequencies are summed and then thresholded to detect text blocks.
Another tool for texture representation, wavelet coefficients, also
carry information about frequency of text blocks. Li and Doermann [45,46]
extract text from digital video key frames. They use the heuristic that the
texture for text is different from the surrounding background to identify text
regions. Wavelets are used for feature extraction and a Neural Network is
used for decisions. They also present an algorithm for tracking moving text.
The tracker assumes that text is mostly rigid and moves in a simple, linear
manner.
Some of the researchers consider edge density as texture
representation. Hauptmann and Smith [26] localize text in video using the
heuristic that text regions consist of a large number of horizontal and
vertical edges in spatial proximity. In a later work, Smith and Kanade [27]
apply a 3x3 horizontal difference filter to extract text regions. Wu,
Manmatha, and Risemann [4] describe a scheme for finding text in images.
They use texture segmentation to localize text, i.e. edge detection to detect
character strokes and join strokes to form text regions. While texture
segmentation may work for high-resolution images or document images, it
is likely to fail on noisy, low-resolution video. Outdoor scenes especially are
likely to give rise to a large number of false alarms. They also propose a
stroke aggregation method, which is used to merge extracted text regions
from multiple scales of operation.
A similar approach is proposed by Sato, Kanade, et al. [5]. They
describe a system for performing OCR on video caption text in the context
of a digital news archive. They localized text by looking for clusters of edge
pixels that satisfy aspect ratio and other criteria. They interpolate between
pixels of the frame to increase its resolution. They also assume that the text
caption consists of static white characters on a dark background. Under this
assumption, they replace each pixel with the minimum value of pixels at
that position for the duration of the text caption, to minimize the background
variation. Filtering to detect white on black edges of characters follows this
step. They report results on CNN news captions. Their method appears to
work on clean sequences under the assumptions made, but it is not clear if
it would work on unconstrained, noisy video.
Different techniques, which can be classified as texture-based, try to
classify image blocks as text or non-text. Mitrea and deWith [43] propose a
simple algorithm to classify video frame blocks into graphics or video based
on the dynamic range and variation of gray levels within a 4x4 pixel block.
Their application was improving image compression rates. This method can
be extended to extract text regions from images or video. Le Bourgeois [41]
presents a system for multi font OCR from gray level images. The text
localization stage of this system appears to be fast and robust for clean video
and assuming not too much variation in size. Text regions are identified as
those where pixels are organized in a coherent way to form a regular
texture. Normalized intensity differences over a certain period are used to
separate such regions from the background.
Other approaches for detecting text in images and video are found in
[49,50,51,52,53,54,55].
2.3 Work on scene text
Although they have similarities, it is appropriate to review the existing work on scene text detection in a separate section. Some of the ideas from these works, such as adaptive or local thresholding, can be incorporated for the detection of superimposed text.
Very little work has been done for extraction of scene text. A brief
overview of the research efforts published in the literature is presented.
Some research involves manufacturing contexts where the conditions are
completely under control [56] and thus inapplicable to our problem domain.
Ohya, Shio and Akamatsu [6] describe a method to recognize characters in
scene images. They use local gray level thresholding to segment the image
and localize text regions by looking for high contrast, uniform gray level of
a character, and uniform width. They also use the results of an OCR stage to
improve their extraction result. If the Chinese OCR algorithm they use does
not find a good enough match, the character candidate region is rejected.
There is work on the recognition of vehicle license plates [57,58]
from video, which shares some of the characteristics of scene text.
However, these approaches make restrictive assumptions on the placement,
contrast or format of the license plate characters. A typical system makes a
number of assumptions as to the capture process (vehicle and camera
positions leading to near normal projection and near horizontal orientation
etc.) and type of text (constant text color against constant background). The
existence of license plates with logos and other background is known to
decrease performance. In contrast a general-purpose scene text detection
algorithm must handle text in all orientations, of all colors and backgrounds.
Cui and Huang's approach [58] applies the Markov Random Field (MRF) model to localize the text region in a frame and takes advantage of the information from multiple frames. They also correct for perspective projection distortion using the fact that the characters lie on a plane.
Winger, Jernigan, and Robinson [59] discuss the segmentation and
thresholding of characters from low-contrast scene images acquired from a
hand-held camera. Their data set includes images with low contrast, poor
and uneven illumination.
Gandhi [60] explores an approach for extracting scene text from a
sequence of images with relative motion between the camera and the scene.
It is assumed that the scene text lies on planar surfaces, whereas the other
features are likely to be at random depths or undergoing independent
motion. The motion model parameters of these planar surfaces are estimated
using gradient based methods, and multiple motion segmentation. The
equations of the planar surfaces, as well as the camera motion parameters
are extracted by combining the motion models of multiple planar surfaces.
This approach is expected to improve the reliability and robustness of the
estimates, which are used to perform perspective correction on the
individual surfaces. Perspective correction can lead to improvement in OCR
performance. This work could be useful for detecting road signs and
billboards from a moving vehicle.
The existing work considered so far generally suffers from a lack of robustness. In fact, almost none of the proposed algorithms are tested with quantitative recognition results. Connected component-based techniques are more general, in the sense that texture-based methods are usually based on assumptions about videotext character size. On the other hand, texture-based methods offer much more reliability.
CHAPTER 3
VIDEOTEXT DETECTION
As mentioned in the introduction, the pre-OCR binarization process for videotext extraction will ideally remove all image components other than videotext characters. The problem is not trivial, because the distinction is not clear at the pixel level; i.e., one cannot decide which pixels belong to videotext characters and which to the background just by looking at a pixel and/or its neighbors. Therefore, it is necessary to employ regional and object-level image analysis [15]. This analysis will be based on the distinctive properties of videotext. The following discussion presents the methods of employing videotext properties for the pre-OCR process.
At this point we can give an overall picture, i.e. framework of the
videotext detection and recognition process. With this picture in mind, it
will be easier to follow the implementation details of the process.
The properties listed in the introduction, which distinguish videotext regions from the background, help us find the regions of interest, i.e. the regions that may contain text. These regions are called candidate regions. Finding candidate regions is only a part of the solution and does not give us the required binary image. In order to produce the binary image, which will be fed to the OCR for recognition, all candidate regions must be binarized individually. The binarization of individual candidate regions is accomplished by thresholding each region with an appropriate threshold value specific to that region. This threshold value is determined automatically for each candidate region. After thresholding, we obtain two binary images: one in which the remaining pixels are those above the threshold value, and one in which they are those below it. These two binary images are the images to be recognized by the OCR. A block diagram of the algorithm is given in Figure 3.1.
[Block diagram: Image → Candidate region extraction (texture, contrast, color) → Candidate regions → Regional thresholding → Heuristic improvement → Binary image → OCR → ASCII characters]
Figure 3.1: Block diagram overview of videotext detection and recognition process.
In Figure 3.1, an extra block is shown after thresholding. In this step,
the binary images produced by thresholding are further filtered by a set of
heuristics, which are based on the horizontal alignment properties of
videotext.
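As a rough illustration of the regional binarization just described, the sketch below thresholds each candidate region with its own automatically determined threshold and produces the two complementary binary images that are later passed to the OCR. Otsu's method is used here only as a stand-in for the region-specific threshold selection of the actual system (detailed in Section 3.4), and the candidate boxes are assumed to come from the detection stage described in the following sections.

    import numpy as np

    def otsu_threshold(gray):
        """Stand-in automatic threshold (Otsu); the thesis derives its own
        region-specific threshold in Section 3.4."""
        hist, _ = np.histogram(gray, bins=256, range=(0, 256))
        total = gray.size
        mean_all = np.dot(np.arange(256), hist) / total
        best_t, best_var = 0, -1.0
        cum, cum_mean = 0.0, 0.0
        for t in range(256):
            cum += hist[t]
            cum_mean += t * hist[t]
            if cum == 0 or cum == total:
                continue
            w0 = cum / total
            w1 = 1.0 - w0
            m0 = cum_mean / cum
            m1 = (mean_all * total - cum_mean) / (total - cum)
            var_between = w0 * w1 * (m0 - m1) ** 2
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    def binarize_candidates(gray, candidate_boxes):
        """Threshold each candidate region individually and return the two
        complementary binary images (text brighter / text darker than background)."""
        above = np.zeros(gray.shape, dtype=np.uint8)   # pixels above the regional threshold
        below = np.zeros(gray.shape, dtype=np.uint8)   # pixels below the regional threshold
        for (x0, y0, x1, y1) in candidate_boxes:       # boxes from the detection stage
            region = gray[y0:y1, x0:x1]
            t = otsu_threshold(region)
            above[y0:y1, x0:x1] = (region > t).astype(np.uint8)
            below[y0:y1, x0:x1] = (region <= t).astype(np.uint8)
        return above, below

Both returned images would then be filtered by the horizontal-alignment heuristics and sent to the OCR, as in Figure 3.1.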
The idea of binarization with regional thresholding was originally proposed by Dorai et al. [13]. They also extracted candidate regions by region analysis. However, their work suffers from the large number of non-character regions, which cannot be eliminated by any heuristics or shape analysis. We propose that if region analysis is supported by videotext detection methods that can narrow the regions of interest down to the text-box level, the limitations of region analysis can be overcome.
3.1 Texture analysis
Videotext regions have distinctive texture properties. This is mainly
due to artificial (i.e. non natural) edges and strokes that form the characters
of videotext.
Texture analysis is one of the major topics of computer vision and
there are well-known techniques for the analysis. We will present here two
different approaches for videotext texture analysis and compare them in
terms of representation capabilities. These two techniques are Gabor filters
and 2D Haar wavelets. A set of feature vectors will be produced and these
vectors will be used for detecting text regions. For detection, a three layer
neural network will be used and the two techniques will be compared in
terms of their training performances.
3.1.1 Texture analysis by Gabor filters
Gabor filters have been one of the major tools for texture analysis
[18,19]. This technique has the advantage of analyzing texture in an
unlimited number of directions and scales. This flexibility is very useful for
videotext detection because the character edges of videotext may appear in a
diverse range of directions [9]. For videotext detection, we will use a set of
Gabor filters from the same scale and with 8 different orientations. Before
proceeding further, it is better to give a brief description of Gabor filters.
Physiological studies found simple cells in the human visual cortex that are selectively tuned to orientation as well as to spatial frequency. It was
suggested that the response of a simple cell could be approximated by 2D
Gabor filters [11]. The Gabor filters proposed by Daugman [12] are local
spatial bandpass filters that achieve the theoretical limit for conjoint
resolution of information in the 2D spatial and 2D frequency domains.
Gabor functions were first proposed by Dennis Gabor [65], as a tool
for signal detection in noise. Gabor showed that there exists a “quantum
principle” for information; the conjoint time-frequency domain for 1D
signals must necessarily be quantized so that no signal or filter can occupy
less than certain minimal area in it. However, there is a trade off between
time resolution and frequency resolution. Gabor discovered that Gaussian
modulated complex exponentials provide the best trade off. For such a case,
the original Gabor elementary functions are generated with a fixed
Gaussian, while frequency of the modulating wave varies. Gabor filters,
rediscovered and generalized to 2D, are now being used extensively in
various computer vision applications.
Daugman [12] generalized the Gabor filters to the following 2D
form in order to model the receptive fields of the orientation selective
simple cells:
$$\Psi_i(\vec{x}) = \frac{k_i^2}{\sigma^2}\, e^{-\frac{k_i^2 x^2}{2\sigma^2}} \left[ e^{\,j\vec{k}_i^{T}\vec{x}} - e^{-\frac{\sigma^2}{2}} \right]$$

$$\vec{k}_i = \begin{pmatrix} k_{ix} \\ k_{iy} \end{pmatrix} = \begin{pmatrix} k_v\cos\theta_\mu \\ k_v\sin\theta_\mu \end{pmatrix}$$
Each Ψi is a plane wave characterized by the vector ki enveloped by
a Gaussian function, where σ is the standard deviation of this Gaussian. The
first term in brackets in this equation determines the oscillatory part of the
kernel, and the second term compensates for the DC value of the kernel.
Subtracting the DC response, Gabor filters become insensitive to the overall
level of illumination. The center frequency of the i-th filter is adjusted by the characteristic wave vector $\vec{k}_i$, which has a scale and orientation given by $(k_v, \theta_\mu)$.
The decomposition of the image by use of these Gabor filters is
achieved by the 2D convolution integral:
$$R_i(\vec{x}) = \int I(\vec{x}\,')\,\Psi_i(\vec{x}-\vec{x}\,')\,d\vec{x}\,'$$

where $I(\vec{x})$ is the image intensity value at $\vec{x}$.
In Figure 3.2, gray-scale representations of Gabor filters for varying spatial frequency (vertical axis, $k_v = 2^{-v/4}$, $v = 0,1,2,3,4$) and orientation (horizontal axis, $\theta_\mu = \mu\pi/8$, $\mu = 0,1,\ldots,7$), with $\sigma = 4$, are shown.
Figure 3.2: Gray-scale representation of Gabor filters for varying spatial frequency and orientation.
Each member of the family of Gabor filters models the spatial
receptive field structure of a simple cell in the primary visual cortex. The
Gabor decomposition can be considered as a “directional microscope” with
an orientation and scaling sensitivity. These filters respond to short lines,
line endings, and sharp changes in curvature. Since such curves correspond
to some low-level salient features in an image, these filters can be assumed
to form a low-level feature extractor of an intensity image.
For the purpose of videotext region detection, we selected the scale,
kv, as 2 and for this scale all the filters with 8 orientations form the filter
bank for feature extraction.
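A minimal sketch of how such a filter bank could be generated from the kernel given above is shown below. The 8 orientations $\theta_\mu = \mu\pi/8$, the DC-compensation term and $\sigma = 4$ follow the equations; the kernel support (a half-width of $3\sigma$) is an implementation choice made here only for illustration.

    import numpy as np

    def gabor_kernel(k_v, theta, sigma=4.0):
        """Complex 2D Gabor kernel following Daugman's form:
        (k^2/sigma^2) * exp(-k^2 x^2 / (2 sigma^2)) * (exp(j k.x) - exp(-sigma^2/2))."""
        half = int(np.ceil(3 * sigma))                  # kernel support, chosen for illustration
        y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
        kx, ky = k_v * np.cos(theta), k_v * np.sin(theta)
        k_sq, x_sq = k_v ** 2, x ** 2 + y ** 2
        envelope = (k_sq / sigma ** 2) * np.exp(-k_sq * x_sq / (2 * sigma ** 2))
        carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2.0)  # DC-free
        return envelope * carrier

    def gabor_bank(k_v=2.0, n_orientations=8, sigma=4.0):
        """One scale, 8 orientations (theta = mu*pi/8), as used for videotext detection."""
        return [gabor_kernel(k_v, mu * np.pi / n_orientations, sigma)
                for mu in range(n_orientations)]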
In Figure 3.3, the filtered images of a video frame containing videotext are shown. It is important to note here that ordinary edges in the image give output only for a limited set of directions. Text regions, on the other hand, contribute to the output in almost all directions.
Figure 3.3: Gabor filter outputs for a typical video frame. Right-bottom,
θµ=0; right-top, θµ=2π/8; left-top, θµ=5π/8; left-bottom, θµ=7π/8.
The filtered images are computed once for a given image and the feature vectors for all blocks are calculated. The feature extraction process can be described with the block diagram given in Figure 3.4. In the block diagram, the input is the original image for which the process is run. The image is filtered with the Gabor filter bank consisting of 8 directional filters, and as a result 8 filtered images are obtained. Once these images are obtained, the feature vector for a block at a particular location is computed by averaging the filter-output values falling within the corresponding block, yielding an 8-dimensional feature vector. For the experiments, the block dimension is selected as 16x16. The extracted vectors are then fed to a neural network for detection. This will be discussed later, after describing the other method for texture analysis, because the neural network counterpart of both methods is identical.
[Block diagram: Image → Filtering with 8 directional Gabor filters → Filter outputs → Block average calculation (using block coordinates) → Vector output]
Figure 3.4: Feature vector extraction using Gabor filters
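The block-averaging step of Figure 3.4 can be sketched as follows: each 16x16 block of the eight filtered images is averaged to give one component of the 8-dimensional feature vector. The kernels are assumed to come from a bank such as the one sketched earlier; the use of the complex response magnitude and of scipy's convolution are implementation choices of this sketch, not prescribed by the thesis.

    import numpy as np
    from scipy.ndimage import convolve

    def gabor_block_features(gray, kernels, block=16):
        """Filter the image with every Gabor kernel once, then average the response
        magnitude inside each (block x block) tile -> one 8-D vector per tile."""
        gray = gray.astype(np.float64)
        responses = [np.abs(convolve(gray, np.real(k)) + 1j * convolve(gray, np.imag(k)))
                     for k in kernels]
        h, w = gray.shape
        rows, cols = h // block, w // block
        features = np.zeros((rows, cols, len(kernels)))
        for r in range(rows):
            for c in range(cols):
                sl = (slice(r * block, (r + 1) * block), slice(c * block, (c + 1) * block))
                for i, resp in enumerate(responses):
                    features[r, c, i] = resp[sl].mean()
        return features   # features[r, c] is the vector fed to the neural network classifier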
3.1.2 Texture analysis by Haar wavelets
The next method for texture analysis to be presented is the 2D Haar wavelet. The idea of using wavelets for videotext detection was first proposed by Doermann et al. [10]. As in the case of Gabor filters, wavelets are good candidates for videotext detection because of their directional filtering capabilities. In the following paragraphs, 2D Haar wavelets will be briefly described and their use in videotext detection will be introduced.
Using wavelets to decompose the image provides successive approximations to the image by downsampling and has the ability to detect edges during the high-pass filtering. The low-pass filter creates the successive approximations to the image, while the detail signals provide a feature-rich representation.
Figure 3.5, where the original image and its first level wavelet
decomposition are shown.
Note that the videotext regions show high activity in the three high
frequency subbands (High-Low, HL, Low-High, LH, High-High, HH). As a
result of their local nature, only wavelets, which are located on or near the
edge yield large wavelet coefficients, making videotext regions detectable in
the high frequency subbands.
The scaling and wavelet functions of Haar wavelets can be written
as:
$$\Phi(x) = \sum_{k\in\mathbb{Z}} p_k\,\Phi(2x-k) = \Phi(2x) + \Phi(2x-1)$$

$$W_H(x) = \sum_{k\in\mathbb{Z}} q_k\,\Phi(2x-k) = \Phi(2x) - \Phi(2x-1)$$

respectively, with

$$\Phi(x) = \begin{cases} 1 & 0 \le x < 1 \\ 0 & \text{otherwise} \end{cases}$$

For the equations above, $p_k$ has non-zero values $p_0 = p_1 = 1$ and zero values for all other $k$, and $q_k$ is zero except for $q_0 = 1$ and $q_1 = -1$.
For an image I(x,y) represented as
$$I(x,y) = \begin{pmatrix} i_{0,0} & i_{0,1} & \cdots & i_{0,2N-1} \\ i_{1,0} & i_{1,1} & \cdots & i_{1,2N-1} \\ \vdots & \vdots & \ddots & \vdots \\ i_{2N-1,0} & i_{2N-1,1} & \cdots & i_{2N-1,2N-1} \end{pmatrix}_{2N\times 2N}$$
we can use Mallat's algorithm [17] to obtain the two-dimensional Haar wavelet transform of I(x,y):
$$LL_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} p_{k_1} p_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} + i_{2x+1,2y} + i_{2x,2y+1} + i_{2x+1,2y+1}\right)$$

$$LH_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} p_{k_1} q_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} + i_{2x+1,2y} - i_{2x,2y+1} - i_{2x+1,2y+1}\right)$$

$$HL_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} q_{k_1} p_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} - i_{2x+1,2y} + i_{2x,2y+1} - i_{2x+1,2y+1}\right)$$

$$HH_{x,y} = \frac{1}{4}\sum_{k_1,k_2=0}^{1} q_{k_1} q_{k_2}\, i_{2x+k_1,\,2y+k_2} = \frac{1}{4}\left(i_{2x,2y} - i_{2x+1,2y} - i_{2x,2y+1} + i_{2x+1,2y+1}\right)$$
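The four subband equations amount to 2x2 averaging and differencing; a minimal sketch of one decomposition level, written directly from the formulas above, is given below.

    import numpy as np

    def haar_level(img):
        """One level of the 2D Haar transform via the LL/LH/HL/HH formulas above.
        The image is assumed to have even height and width."""
        img = img.astype(np.float64)
        a = img[0::2, 0::2]   # i_{2x,   2y}
        b = img[1::2, 0::2]   # i_{2x+1, 2y}
        c = img[0::2, 1::2]   # i_{2x,   2y+1}
        d = img[1::2, 1::2]   # i_{2x+1, 2y+1}
        LL = (a + b + c + d) / 4.0
        LH = (a + b - c - d) / 4.0
        HL = (a - b + c - d) / 4.0
        HH = (a - b - c + d) / 4.0
        return LL, LH, HL, HH

    # Successive levels are obtained by re-applying haar_level to the LL band, e.g.
    # LL1, LH1, HL1, HH1 = haar_level(frame); LL2, LH2, HL2, HH2 = haar_level(LL1)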
So far, we have introduced 2D Haar wavelet transform and its
practical implementation. Now, the feature extraction strategy for this case
will be explained.
For detection of videotext regions, we extract features from the
wavelet decomposition of the image. We use the mean and the second and
third order central moments as features. For an N x N block of I(x,y) we
calculate the mean, the second and third order central moments of the block
which can be written as:
$$M(I) = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1} I(i,j)$$

$$\mu_2(I) = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left(I(i,j)-M(I)\right)^2 \qquad \mu_3(I) = \frac{1}{N^2}\sum_{i=0}^{N-1}\sum_{j=0}^{N-1}\left(I(i,j)-M(I)\right)^3$$
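These three statistics can be computed directly; a small sketch for one N x N block follows.

    import numpy as np

    def block_moments(block):
        """Mean and second/third order central moments of an N x N block,
        following the formulas above."""
        block = block.astype(np.float64)
        n_sq = block.size                 # N * N
        mean = block.sum() / n_sq
        mu2 = ((block - mean) ** 2).sum() / n_sq
        mu3 = ((block - mean) ** 3).sum() / n_sq
        return mean, mu2, mu3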
All the features are computed on the decomposed subband images.
Since the block size chosen is 16x16, the maximum level of decomposition
is 3 since only one pixel is left for each subband image on the fourth level.
This method of feature selection produces 27 values for the construction of the feature vector, since we have 3 levels of decomposition, 3 subbands, and 3 moments. To reduce this dimension, we rely on a ranking demonstrated by Doermann et al. [46]. According to this ranking, the most salient features are, in decreasing order, $\mu_2^{3,HH}$, $\mu_2^{2,HH}$, $\mu_2^{3,HL}$, $\mu_2^{2,HL}$, $\mu_2^{1,HL}$, $\mu_2^{1,LH}$, $\mu_3^{1,LH}$, $\mu_3^{1,HH}$, where we use $\mu_2^{j,i}$ and $\mu_3^{j,i}$ to denote the second and third order moments respectively, with $j$ representing the decomposition level and $i$ representing the subband.
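The sketch below illustrates how such an 8-dimensional wavelet feature vector could be assembled for a 16x16 block. It repeats the one-level Haar step and the central-moment computation from the earlier sketches in compact form; the ordering of the selected moments follows the ranked list as reconstructed above, which should be treated as an assumption of this illustration.

    import numpy as np

    def haar_level(img):
        a, b = img[0::2, 0::2], img[1::2, 0::2]
        c, d = img[0::2, 1::2], img[1::2, 1::2]
        return ((a + b + c + d) / 4.0, (a + b - c - d) / 4.0,
                (a - b + c - d) / 4.0, (a - b - c + d) / 4.0)

    def central_moment(x, order):
        x = x.astype(np.float64)
        return ((x - x.mean()) ** order).sum() / x.size

    def wavelet_feature_vector(block16):
        """8-D feature vector for one 16x16 block: central moments of selected
        subbands over a 3-level Haar decomposition, in the ranked order above."""
        subbands = {}
        ll = block16.astype(np.float64)
        for level in (1, 2, 3):
            ll, lh, hl, hh = haar_level(ll)
            subbands[(level, "LH")] = lh
            subbands[(level, "HL")] = hl
            subbands[(level, "HH")] = hh
        selected = [(3, "HH", 2), (2, "HH", 2), (3, "HL", 2), (2, "HL", 2),
                    (1, "HL", 2), (1, "LH", 2), (1, "LH", 3), (1, "HH", 3)]
        return np.array([central_moment(subbands[(j, band)], order)
                         for j, band, order in selected])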
In Figure 3.5, the first level wavelet decomposition of a video frame
is shown. Similar to the Gabor filter case, videotext regions have energy in all subbands, and this makes wavelets a suitable tool for the detection of videotext regions.
Figure 3.5: First level wavelet decomposition of a video frame.
Both Gabor filtering and wavelet decomposition methods produce an
eight dimensional feature vector for each 16x16 block of the image. A pre-
trained neural network then classifies this vector. The training and
classification process will be explained in the next section.
3.1.3 Detection by a feed-forward neural network
In the previous sections, feature extraction methods for texture
analysis are introduced. In this section the neural network structure used for
classification will be described and the two methods will be compared
according to their performances.
We consider detection of videotext regions using texture analysis as
a classical supervised pattern recognition problem. The classifier is selected
as a 3-layer single output feedforward neural network, which has the well-
known capability of discriminating linearly inseparable classes. In fact,
theoretically a 3-layer neural network can approximate any nonlinear
function after training.
The success of neural networks in related problems [16] provides us
with further motivation to rely on a neural network as a classifier to identify
videotext regions. By employing such a network and training the network
with representative sample feature vectors of videotext regions, it is possible
to detect the regions of interest in a given image.
In Figure 3.6, the block diagram of the classifier is given. The
feature vector to be classified is extracted using either Gabor filtering or
wavelet transform. This vector is fed to the network and the network
responds to this input by giving an output indicating the type of the input.
The network is trained in such a way that it will give 1.0 as output if given
an input vector from a videotext region and 0.0 otherwise. The network is
trained to an acceptable error level and during classification an output
greater than 0.5 is accepted to indicate a videotext region.
[Block diagram: Image and block coordinates → Filtering/Transformation → Feature extraction → Neural network classifier → Result s: s > 0.5 text, s < 0.5 background]
Figure 3.6: Block diagram of the neural network classifier
In order to obtain better localization, the 16x16 block is moved 4
pixels vertically and horizontally, and at each position of the block, the
corresponding feature vector is fed to the network. The output of the
network is accumulated on each pixel and the accumulation result is
thresholded with 0.5. The pixels above the threshold are considered as
pixels of text regions.
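The sliding-block classification and per-pixel accumulation described above can be sketched as follows. The forward pass of the 3-layer, single-output network is written out explicitly with sigmoid units; averaging the overlapping block outputs before applying the 0.5 threshold is an assumption of this sketch, since the thesis only states that the outputs are accumulated and thresholded.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mlp_output(v, W1, b1, W2, b2):
        """Forward pass of a 3-layer (one hidden layer), single-output network;
        W2 is a 1-D weight vector and b2 a scalar."""
        return float(sigmoid(W2 @ sigmoid(W1 @ v + b1) + b2))

    def classify_text_pixels(feature_fn, img_shape, W1, b1, W2, b2,
                             block=16, step=4, thr=0.5):
        """Slide a 16x16 block in steps of 4 pixels, feed its feature vector to the
        network, accumulate the output on every covered pixel, and threshold.
        feature_fn(x, y) is assumed to return the feature vector of the block whose
        top-left corner is (x, y), e.g. from the Gabor or wavelet extractors above."""
        h, w = img_shape
        acc = np.zeros((h, w))
        cnt = np.zeros((h, w))
        for y in range(0, h - block + 1, step):
            for x in range(0, w - block + 1, step):
                out = mlp_output(feature_fn(x, y), W1, b1, W2, b2)
                acc[y:y + block, x:x + block] += out
                cnt[y:y + block, x:x + block] += 1
        # assumption: average the accumulated responses before the 0.5 threshold
        avg = np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)
        return avg > thr    # True on pixels considered to belong to text regions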
The neural network is trained using the RPROP algorithm [14]. The
algorithm can be briefly described as follows:
RPROP (resilient backpropagation) is a very useful gradient based
learning algorithm. It uses individual adaptive learning rates combined with
the so-called “Manhattan" update step.
The standard backpropagation updates the weights according to
$$\Delta w_{ij} = -\eta\,\frac{\partial E}{\partial w_{ij}}$$
The “Manhattan" update step, on the other hand, uses only the sign
of the derivative (the reason for the name should be obvious to anyone who
has seen a map of Manhattan), i.e.
$$\Delta w_{ij} = -\eta\cdot\mathrm{sign}\!\left(\frac{\partial E}{\partial w_{ij}}\right)$$
The RPROP algorithm combines this Manhattan step with individual
learning rates for each weight, and the algorithm goes as follows
$$\Delta w_{ij}(t) = -\eta_{ij}(t)\cdot\mathrm{sign}\!\left(\frac{\partial E(t)}{\partial w_{ij}}\right)$$
where wij denotes any weight in the network.
The learning rate $\eta_{ij}(t)$ is adjusted according to

$$\eta_{ij}(t) = \begin{cases} \gamma^{+}\,\eta_{ij}(t-1) & \text{if } \partial E(t)\cdot\partial E(t-1) > 0 \\ \gamma^{-}\,\eta_{ij}(t-1) & \text{if } \partial E(t)\cdot\partial E(t-1) < 0 \end{cases}$$

where

$$\partial E(t) = \frac{\partial E(t)}{\partial w_{ij}}$$

and $\gamma^{+}$ and $\gamma^{-}$ are growth and shrinking factors such that $0 < \gamma^{-} < 1 < \gamma^{+}$. The values used during training are $\gamma^{-} = 0.5$ and $\gamma^{+} = 1.5$, with limits such that $10^{-6} < \eta_{ij}(t) < 50$.
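A minimal sketch of the RPROP weight update described by these equations, using the growth/shrinking factors and learning-rate limits quoted above, is given below.

    import numpy as np

    def rprop_step(weights, grad, prev_grad, lr,
                   gamma_plus=1.5, gamma_minus=0.5, lr_min=1e-6, lr_max=50.0):
        """One RPROP update. `weights`, `grad`, `prev_grad` and `lr` are arrays of
        the same shape (one learning rate per weight); returns updated copies."""
        sign_change = grad * prev_grad
        lr = np.where(sign_change > 0, lr * gamma_plus, lr)    # same sign: grow the rate
        lr = np.where(sign_change < 0, lr * gamma_minus, lr)   # sign flipped: shrink it
        lr = np.clip(lr, lr_min, lr_max)                       # 1e-6 < eta < 50
        new_weights = weights - lr * np.sign(grad)             # "Manhattan" step
        return new_weights, lr, grad.copy()                    # grad becomes prev_grad next time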
To compare the two feature extraction methods, we can look at the
training behaviors of the two training sets. The training performances of the
two methods are shown in Figure 3.7. The graph contains plots of error
versus iteration count during training. Both feature vector sets are extracted
from the same videotext regions. As can be seen from the graph, Gabor filters
are much more powerful representatives of videotext regions.
[Plot: training error (SSD) versus number of iterations for the wavelet-based and Gabor-based feature sets.]
Figure 3.7: Training performance of Gabor and wavelet-based features
Although Gabor filtering leads to a much more efficient representation of videotext regions, its computational complexity reduces its usefulness in real-time applications. Nevertheless, for the rest of the experiments, Gabor filtering with the neural network classifier will be used for texture analysis.
After thresholding the neural network output image, a bounding box is found for each connected region. These bounding boxes indicate the candidate videotext regions. In Figure 3.8, a video frame, the Gabor filter outputs and the neural network output image are shown. As can be seen from the figure, the network output is very high on the pixels corresponding to videotext regions. Although the background is fairly complex, the network can distinguish text from non-text regions.
Figure 3.8: Neural network output image example.
More examples of the neural network output image are shown in Figure 3.9. In these examples, videotext regions, and even scene text, give a high neural network response.
Figure 3.9: Typical neural network output examples.
3.2 Contrast analysis
Another important distinctive property of videotext regions is their high contrast with the background. This property comes from the fact that superimposed text is typically rendered to be easily read.

For contrast analysis, a simple contrast measure proposed by Lienhart et al. [7] is used. For each video frame, a binary contrast image is derived, in which pixels in regions showing sufficiently high absolute local contrast are marked with 1 and all other pixels are marked with 0.
The absolute local contrast at position (x, y) is measured by

$$C(x,y) = \sum_{k=-r}^{r}\sum_{l=-r}^{r} G_{k,l}\,\bigl|\,I(x,y) - I(x-k,\,y-l)\,\bigr|$$

where $G_{k,l}$ denotes a 2D Gaussian smoothing filter, and r denotes the size of the local neighborhood.
This operation constructs a 2D array of positive values that grow around high-contrast regions, especially at the edges of regions. Since the characters of videotext usually form strong edges, this array can be used to identify videotext regions. The array should be thresholded in order to get rid of the low-contrast regions. For this purpose, the values in the array are normalized to the [0, 1] interval and thresholded at 0.5. The pixels above the threshold are marked with 1 and the rest with 0.
Next, each pixel with value 1 is dilated by half the maximum expected stroke width of a character. As a result, all character pixels, as well as some non-character pixels that also show high local contrast, are registered in the binary contrast image. Figure 3.10 shows a contrast image and the corresponding thresholded image.
Figure 3.10: Contrast analysis example.
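A minimal sketch of the contrast analysis described in this section is given below. It computes the Gaussian-weighted sum of absolute local differences, normalizes, thresholds at 0.5 and dilates; the neighborhood size, Gaussian sigma, stroke width and the np.roll border handling are illustrative assumptions.

import numpy as np
from scipy.ndimage import binary_dilation

def local_contrast(gray, r=2, sigma=1.0):
    # C(x, y) = sum over the (2r+1)x(2r+1) neighborhood of
    # G[k, l] * |I(x, y) - I(x - k, y - l)|, with Gaussian weights G.
    gray = np.asarray(gray, dtype=np.float64)
    ax = np.arange(-r, r + 1)
    g1 = np.exp(-ax ** 2 / (2.0 * sigma ** 2))
    G = np.outer(g1, g1)
    G /= G.sum()
    C = np.zeros_like(gray)
    for k in range(-r, r + 1):
        for l in range(-r, r + 1):
            shifted = np.roll(np.roll(gray, k, axis=0), l, axis=1)
            C += G[k + r, l + r] * np.abs(gray - shifted)
    return C

def contrast_mask(gray, r=2, sigma=1.0, stroke_width=8):
    # Normalize to [0, 1], threshold at 0.5, then dilate by half the
    # (assumed) maximum expected stroke width.
    C = local_contrast(gray, r, sigma)
    C = C / (C.max() + 1e-12)
    return binary_dilation(C > 0.5, iterations=max(1, stroke_width // 2))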
3.3 Region analysis
For a videotext extraction system it is necessary to employ some
kind of region analysis in order to isolate candidate text regions from
background. Here we use the term region for connected pixels that are homogeneous in terms of gray-level values. Since the characters of videotext are assumed to be homogeneous, they can be treated as regions and analyzed at the object level.
Once the image is decomposed into non-overlapping homogeneous regions, it is possible to exploit size restrictions on the characters of videotext. Specifically, regions that are too large or too small to represent a character are marked as background and removed.

The criterion used to group pixels into a region is that the absolute gray-level difference between any pair of pixels within the region cannot exceed "delta", which is experimentally determined to be 20.
The segmentation is performed in a recursive manner, each time checking the 4-connected neighbors of a pixel. The segmentation procedure for region labeling can be outlined as follows:
MAIN LOOP:
FOR (each pixel in the image) BEGIN
IF(pixel is not owned by a region)BEGIN
Register a new region.
Record max and min as the value of the current pixel
CALL TraceNeighbours();
END
END.
RECURSIVE FUNCTION TraceNeighbours:
FOR (all the neighbor pixels) BEGIN
IF (value of the neighbor is between min and max) BEGIN
Accept the neighbor.
CALL TraceNeighbours();
END
ELSE IF (value of the neighbor is greater than max but less
than min+delta) BEGIN
Accept the neighbor.
Update max
CALL TraceNeighbours();
END
ELSE IF (value of the neighbor is less than min but greater
than max-delta) BEGIN
Accept the neighbor.
Update min.
CALL TraceNeighbours();
END
ELSE
Reject neighbor.
END.
Although only the min and max values are tracked during region growing as outlined in the pseudocode above, it is similarly possible to record the minimum and maximum coordinates of the region extent and the number of pixels contained in the region.
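For illustration, a stack-based variant of the region-growing procedure above is sketched below; the explicit stack replaces the recursion but applies the same min/max/delta acceptance tests, and the default delta of 20 follows the value quoted in the text.

import numpy as np

def grow_regions(gray, delta=20):
    # Label 4-connected regions using the min/max/delta criterion above.
    gray = np.asarray(gray)
    h, w = gray.shape
    labels = np.zeros((h, w), dtype=np.int32)   # 0 means "not yet owned"
    current = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy, sx]:
                continue
            current += 1
            lo = hi = int(gray[sy, sx])          # min and max of the new region
            labels[sy, sx] = current
            stack = [(sy, sx)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if not (0 <= ny < h and 0 <= nx < w) or labels[ny, nx]:
                        continue
                    v = int(gray[ny, nx])
                    if lo <= v <= hi:             # within the current range
                        pass
                    elif hi < v < lo + delta:     # extends max, range stays below delta
                        hi = v
                    elif hi - delta < v < lo:     # extends min, range stays below delta
                        lo = v
                    else:
                        continue                  # reject the neighbor
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    return labels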
By employing the recursive segmentation algorithm to segment the image into non-overlapping homogeneous regions, we obtain complete region information such as the label, area, average gray level, and minimum bounding rectangle (MBR) of each region. Having obtained a number of homogeneous regions in the image, non-videotext background regions are removed based on their size. A region is identified as non-videotext and removed if the height of its MBR is greater than 1/8 of the image height or if its width is greater than 1/8 of the image width. These values are determined experimentally.
Within the remaining regions, characters of videotext may appear split into multiple regions, and a region may contain some background pixels along with character pixels. We cannot guarantee to isolate the characters by segmenting the homogeneous regions alone. In order to overcome this problem, we group the remaining touching regions into a single region, so that the videotext regions are isolated as much as possible. Obviously, some background regions will remain, depending on the complexity of the background. In Figure 3.11, a sample video frame and the candidate regions extracted using region analysis are shown.
Figure 3.11: Region analysis example.
3.4 Thresholding
Thresholding may be considered as another tool for background removal. Applying a single optimum threshold to the entire image can only be successful for images with a uniform background, which is a very rare case. Although global thresholding cannot meet our requirements, local thresholding is necessary at the final stage of the binarization process, since the other tools described above can only narrow the regions of interest down to the videotext-region level but cannot isolate the characters. Therefore, we must find and apply an appropriate local threshold for each candidate region.
For thresholding we have two alternatives: one is from Dorai et al. [13] and the other is the well-known iterative thresholding [15].

Dorai et al. [13] call this thresholding region boundary enhancement and formulate the threshold as

$$T = \frac{\sum I_{cb} + \sum I_i}{N_{cb} + N_i}$$

where $I_{cb}$ is the gray level of a pixel on the circumscribing boundary of the region, $I_i$ is the gray level of a pixel belonging to the region, $N_{cb}$ is the number of pixels on the circumscribing boundary of the region and $N_i$ is the number of pixels belonging to the region.
A pixel is defined to be on the circumscribing boundary of a region
if it does not belong to the region but at least one of its neighbors (using
four-connectivity) does. Those pixels in the region whose gray level is less
than T are marked as belonging to the background and discarded while the
others are retained in the region. Note that this condition is reversed for a negative image, i.e., when the characters are darker than the background. This step is repeated until the value of T does not change over two consecutive iterations.
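A minimal sketch of this boundary enhancement, for the positive (bright text on darker background) case, is given below; the candidate region is assumed to be given as a boolean mask, and the iteration cap is an illustrative safeguard.

import numpy as np

def boundary_enhance(gray, region_mask, max_iter=50):
    # Iteratively discard region pixels whose gray level is below T, where
    # T averages the region and its circumscribing boundary as defined above.
    gray = np.asarray(gray, dtype=np.float64)
    region = region_mask.copy()
    for _ in range(max_iter):
        # Circumscribing boundary: non-region pixels with a 4-connected
        # neighbor inside the region.
        boundary = np.zeros_like(region)
        boundary[1:, :] |= region[:-1, :]
        boundary[:-1, :] |= region[1:, :]
        boundary[:, 1:] |= region[:, :-1]
        boundary[:, :-1] |= region[:, 1:]
        boundary &= ~region
        n_i, n_cb = int(region.sum()), int(boundary.sum())
        if n_i == 0 or n_cb == 0:
            break
        T = (gray[boundary].sum() + gray[region].sum()) / (n_cb + n_i)
        new_region = region & (gray >= T)     # keep pixels at or above T
        if new_region.sum() == n_i:           # T and the region have stabilized
            break
        region = new_region
    return region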
The other method is the well-known iterative thresholding [15], and can be formulated as

$$T = \frac{1}{2}\,(\mu_1 + \mu_2)$$

where $\mu_1$ is the mean of the pixels below the threshold and $\mu_2$ is the mean of the pixels above the threshold. The process of determining the threshold starts with an initial estimate of T, which is the average intensity of the region. Then the region is partitioned using this T and a new T is calculated. This step is repeated until the value of T does not change over two consecutive iterations.
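A short sketch of this iterative thresholding is given below; the tolerance and the iteration cap are illustrative assumptions.

import numpy as np

def iterative_threshold(values, tol=0.5, max_iter=100):
    # values: gray levels of the candidate region (any shape).
    values = np.asarray(values, dtype=np.float64).ravel()
    T = values.mean()                        # initial estimate: average intensity
    for _ in range(max_iter):
        below, above = values[values < T], values[values >= T]
        if below.size == 0 or above.size == 0:
            break
        new_T = 0.5 * (below.mean() + above.mean())
        if abs(new_T - T) < tol:             # T no longer changes (up to tol)
            T = new_T
            break
        T = new_T
    return T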
3.5 Heuristics
After thresholding, some of the remaining regions are eliminated according to a set of heuristics. For a pair of regions i and j, the following conditions are tested:

1 : (MinXj > MaxXi) AND (MinXj − MaxXi > MAX(Wi, Wj)) : the regions are apart horizontally

2 : (MinXi < MinXj < (MinXi + MaxXi)/2) OR ((MinXi + MaxXi)/2 < MaxXj < MaxXi) : the regions coincide horizontally

3 : ((MinYi + MaxYi)/2 < MinYj < MaxYi) OR (MinYi < MaxYj < (MinYi + MaxYi)/2) : the regions do not coincide vertically

where MinX stands for the coordinate of the leftmost pixel, MaxX for the coordinate of the rightmost pixel, MinY for the coordinate of the topmost pixel, MaxY for the coordinate of the bottommost pixel, and W for the width of the region.
All candidate regions are tested in pairs for the above conditions. If a pair does not satisfy all of the conditions, the two regions are grouped together. The candidate regions that do not belong to any group are discarded.
CHAPTER 4
EXPERIMENTAL RESULTS
In the preceding sections, we have introduced the possible tools for videotext detection, i.e. the pre-OCR process. The performance of these methods can only be understood through a set of experiments. Although one can rely on any single method, the performance of the pre-OCR step can be increased by a logical combination of the tools. In this section, possible alternatives for the videotext binarization step will be presented. These alternatives will also be tested for their performance.
Before proceeding further, it will be useful to recall the flow of the pre-OCR process. For videotext binarization, the input is the original image (the video frame) and the output is the binarized image, where the pixels on the videotext characters are black and the remaining pixels are white. The pre-OCR process can be logically grouped into three subprocesses: pre-thresholding, thresholding and post-thresholding.
The pre-thresholding step is constructed from a subset of the texture analysis, contrast analysis and region analysis methods. This step takes the video frame as input and produces a binary mask where 1's indicate the possible videotext regions. The thresholding step is constructed from one (or both) of the thresholding methods described earlier. It takes the video frame and the binary mask produced by the pre-thresholding step as input and produces a binary image, which can be fed to the OCR for recognition. The post-thresholding step stands for any further enhancement of the binary image using the heuristics described.
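The overall flow can be summarized by the sketch below, in which the individual tools are passed in as callables; every function name, and the simplification of thresholding the whole masked area at once rather than each candidate region separately, are illustrative assumptions rather than the exact implementation.

import numpy as np

def binarize_videotext(frame, pre_masks, threshold_fn, enhance_fn, heuristics_fn):
    # Pre-thresholding: intersect the binary masks of the selected analyses
    # (e.g. texture, contrast and/or region analysis).
    mask = np.ones(frame.shape, dtype=bool)
    for mask_fn in pre_masks:
        mask &= mask_fn(frame)
    # Thresholding: an iterative threshold followed by boundary enhancement.
    T = threshold_fn(frame[mask]) if mask.any() else frame.mean()
    binary = mask & (frame >= T)
    binary = enhance_fn(frame, binary)
    # Post-thresholding: heuristic grouping / elimination of candidate regions.
    return heuristics_fn(binary)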
The pre-OCR process can be visualized with the help of the typical example in Figure 4.1. The input image, Figure 4.1(a), is analyzed for texture and color segmentation in Figure 4.1(b) and (c), respectively, in parallel. This particular combination is specific to this example; contrast analysis may also be added as a third option. The intersection of the outputs of these parallel threads forms the input for the thresholding step, Figure 4.1(d). After thresholding this image, we obtain a positive image, where the candidate regions are formed by grouping connected pixels above the threshold, and a negative image, where the pixels are below the same threshold. In Figure 4.1(e), the positive image formed after thresholding is shown. After thresholding, heuristic improvement is applied and the binary image to be fed to the OCR, shown in Figure 4.1(f), is formed.
Figure 4.1: Pre-OCR steps. (a) original image (b) texture analysis output (c) region analysis output (d) pre-thresholding output (e) thresholding output (f) heuristic improvement output
4.1 Videotext recognition
So far we have concentrated on the pre-OCR process. The experiments for the evaluation of the videotext binarization alternatives will be based on the OCR performance. For the OCR part of the experiments, the ABBYY FineReader 5.0 commercial OCR package is used [66]. This package provides a set of Dynamic Link Libraries (DLLs), which can be linked to the main program. By using this library, it is possible to incorporate the capabilities of the OCR engine into the main program.
Tabulated results for the videotext recognition experiments are given in Tables 4.1, 4.2, and 4.3. Table 4.1 lists the description of the experiments, and Tables 4.2 and 4.3 give the result of each experiment, where each row corresponds to one test frame and each column to one experiment. The second column in these tables is the actual number of characters. As can be seen from the results, different combinations of algorithms for the pre-OCR step result in different recognition rates. The maximum recognition rate is achieved when region analysis and texture analysis are used in parallel before thresholding and heuristics are applied after thresholding. A similar result, in fact the second highest recognition rate, is achieved when region analysis is used in parallel with contrast analysis. Using contrast analysis and texture analysis together decreases the recognition rate with respect to texture analysis or contrast analysis alone. This is due to the fact that the drawbacks of contrast analysis and texture analysis affect the output negatively when used together. It is possible to say that contrast analysis and texture analysis should be considered as alternatives to each other. The drawbacks of texture analysis are the limited block size and the need for supervised training, and the main drawback of contrast analysis is the assumption of an "over the average contrast" videotext region. On the other hand, texture analysis is more accurate than contrast analysis, and contrast analysis is computationally much simpler than texture analysis.
Exp. No   Pre-OCR components
          Pre-Thresh.            Thresh.       Post-Thresh.
0         NONE (Raw Image)
1         R.A.                   I.T.
2         R.A.                   B.E.
3         R.A.                   I.T.+B.E.
4         R.A.                   B.E.+I.T.
5         C.A.                   I.T.
6         C.A.                   B.E.
7         C.A.                   I.T.+B.E.
8         C.A.                   B.E.+I.T.
9         R.A. + C.A.            I.T.
10        R.A. + C.A.            B.E.
11        R.A. + C.A.            I.T.+B.E.
12        R.A. + C.A.            B.E.+I.T.
13        R.A. + C.A.            I.T.+B.E.     HR
14        R.A. + T.A.            I.T.+B.E.     HR
15        R.A. + C.A. + T.A.     I.T.+B.E.     HR
16        R.A. + P.T.B.D.        I.T.+B.E.

C.A.: Contrast Analysis   R.A.: Region Analysis   T.A.: Texture Analysis
P.T.B.D.: Perfect Text Box Detection   I.T.: Iterative Thresholding
B.E.: Boundary Enhancement   HR: Heuristics

Table 4.1: Pre-OCR performance evaluation – Part 1.
Frame No   Char. Count   Exp0   Exp1   Exp2   Exp3   Exp4   Exp5   Exp6   Exp7   Exp8
1          10            8      0      0      0      0      8      10     9      10
2          16            0      0      0      13     13     13     12     13     12
3          36            35     30     24     0      24     28     19     27     25
4          38            19     0      35     35     33     0      0      0      0
5          41            20     0      24     10     13     21     0      20     0
6          24            0      21     22     21     0      20     20     19     0
7          10            9      0      0      9      0      0      9      9      9
8          27            0      13     25     21     25     0      0      0      0
9          23            0      0      0      0      0      0      0      4      17
10         84            0      0      0      0      0      29     31     25     29
11         44            27     38     41     40     34     0      32     34     32
12         14            0      0      10     9      7      0      0      0      0
13         35            0      0      0      26     0      21     20     26     20
14         14            0      9      14     0      13     0      0      13     0
15         14            7      0      0      11     10     0      0      0      0
16         14            0      0      0      0      0      9      0      12     0
17         22            4      17     0      18     10     15     0      21     0
18         86            0      33     17     20     48     29     33     0      38
19         133           43     67     78     67     72     51     75     75     64
20         14            0      0      8      0      0      0      0      0      9
21         7             0      0      0      0      0      4      0      0      3
22         23            0      0      8      0      0      0      14     9      11
23         17            0      16     0      16     8      0      0      0      3
24         22            0      4      0      13     0      0      0      0      3
25         28            0      7      21     21     28     8      6      15     8
26         20            0      8      12     12     12     15     14     13     14
27         10            10     0      0      10     0      0      10     10     10
28         41            0      35     34     35     34     15     11     18     12
29         11            0      0      0      0      0      0      0      2      0
30         86            60     53     71     68     71     71     78     79     78
TOTAL      964           242    351    444    475    455    357    394    453    407

Table 4.2: Pre-OCR performance evaluation – Part 2. The first column is the frame number, the second column is the number of characters in each frame, and the remaining columns are the number of characters recognized.
Frame No   Char. Count   Exp9   Exp10  Exp11  Exp12  Exp13  Exp14  Exp15  Exp16
1          10            8      9      9      9      9      9      9      9
2          16            11     13     13     13     13     13     13     13
3          36            30     24     29     22     29     31     0      24
4          38            35     33     34     31     31     32     32     34
5          41            8      21     17     11     12     15     10     13
6          24            0      0      21     0      21     21     21     21
7          10            3      9      9      9      9      9      9      9
8          27            13     23     21     23     23     17     17     21
9          23            0      0      0      0      15     15     15     15
10         84            29     30     28     0      28     28     28     29
11         44            40     0      40     40     40     40     40     40
12         14            0      0      6      0      0      5      5      9
13         35            6      26     22     19     26     0      26     24
14         14            0      0      0      0      13     14     13     13
15         14            0      0      11     0      11     12     11     11
16         14            0      0      0      0      0      0      0      12
17         22            14     17     17     12     18     21     21     21
18         86            19     20     20     39     37     39     39     30
19         133           60     69     72     63     65     73     73     73
20         14            0      0      0      0      5      0      5      9
21         7             0      0      0      3      3      3      3      2
22         23            0      0      0      0      9      9      9      9
23         17            15     0      16     8      0      16     16     15
24         22            0      0      0      3      0      6      0      10
25         28            22     28     27     28     21     18     18     28
26         20            0      11     11     11     10     11     10     12
27         10            10     10     10     10     10     10     10     10
28         41            16     16     18     16     18     38     18     38
29         11            0      0      2      0      0      0      0      2
30         86            78     78     82     78     78     62     65     77
TOTAL      964           417    437    535    448    554    567    536    633

Table 4.3: Pre-OCR performance evaluation – Part 3. The first column is the frame number, the second column is the number of characters in each frame, and the remaining columns are the number of characters recognized.
Another remark about the experiments concerns the utilization of the thresholding techniques. The best results are obtained when the candidate regions are first thresholded with iterative thresholding, and the resulting intensity regions are then thresholded again with the boundary enhancement technique. This is always the case, independent of the pre-thresholding part. The reason for this behavior is the requirement of determining the boundaries of the candidate regions before applying boundary enhancement. On the other hand, the heuristics increase the average recognition rate, but their effect is not very large.

In the last experiment, experiment 16, the text boxes are placed manually instead of using automated texture analysis, and an upper bound for the performance is determined. For this experiment, heuristic improvement is not applied because non-character regions are almost completely eliminated by manually placing the text boxes.
In Figures 4.2 and 4.3, sample images from the experiment set and the corresponding recognition results are shown.
THE qUEEN MOTHER HAS DIED
Nliiir bit iild ihc titn) 1 In hvr limp
Me olmu^ Anderson Pierson
Kibirli hukuk prgfesoru
Figure 4.2: Various recognition results.
MQNTQYA
56 007 56442
CANU
MAKDULEVILDIZ/ fST
Figure 4.3: Various recognition results, continued.
CHAPTER 5
CONCLUSIONS
The problem of detection and extraction of textual information from video frames is a relatively new subject. Throughout the review of the related work, it is observed that none of the proposed methods is mature enough to be accepted as a standard framework. However, as the literature suggests, the use of the distinctive properties of videotext leads to relatively high accuracy in the detection of videotext character regions. This work also concentrates on the use of these distinctive properties.
As detailed in the preceding chapters, the distinctive properties of videotext are texture, contrast and color uniformity. In order to binarize the image for recognition, one should first localize the candidate character regions and then threshold these regions individually.
By employing contrast analysis and/or texture analysis, one can narrow the regions of interest, i.e. the candidate character regions, down to the text-box level. Additional analysis of homogeneous regions further resolves the regions, even down to the character level, since large homogeneous regions outside the text box are usually connected to the regions between the characters of videotext. The use of region analysis alone, however, does not give satisfactory results, since there usually exist many non-character homogeneous regions, most of which cannot be eliminated by shape analysis or any heuristics. Therefore, region analysis should be supported with texture and/or contrast analysis in order to reduce the candidate regions to the characters of videotext.
As can be seen from the experimental results, the use of contrast analysis together with texture analysis decreases the recognition rate with respect to texture or contrast analysis alone. This is due to the fact that both methods carry similar information and have drawbacks that limit the success of recognition. It is possible to conclude that contrast analysis and texture analysis should be considered as alternatives to each other. The drawbacks of texture analysis are the limited block size and the need for supervised training, whereas the main drawback of contrast analysis is the "over the average contrast" assumption for videotext regions. On the other hand, texture analysis is much more accurate than contrast analysis, and contrast analysis is computationally much simpler than texture analysis.
The proposed system is very successful for images where a uniformly colored frame surrounds the videotext and/or the characters of videotext are large and have high contrast with the surrounding background. As the character size decreases and the background complexity increases, the recognition rate decreases. The overall performance of the system also depends strongly on image resolution, since the OCR utilized during the simulations (as well as many other OCR engines) requires high-resolution input for accurate recognition.
Another conclusion regarding texture analysis is that Gabor-filtering-based features represent videotext regions much better than wavelet-based features. Since Gabor filters have higher directional resolution, they are much more appropriate for representing videotext. This comparison is accomplished by observing the training performance of both feature sets, i.e. by observing the error at the output of the artificial neural network during training. To our knowledge, this kind of feature saliency test has not been reported in the literature.
Further improvements of the system can be achieved by integrating information over multiple video frames. One can assume that the videotext is stationary and, using a simple low-pass filter, get rid of the moving parts of the video. One way of doing this may be averaging the consecutive frames and then subtracting the result from the middle frame, as sketched below. The difficulty with this method arises at shot boundaries, where the continuity is completely lost; a sliding window for filtering would be an appropriate solution. Obviously, the stationarity assumption is not always valid, because there are cases where videotext moves in a linear fashion. For such situations the system can be supported with a suitable tracking algorithm. By multiple-frame integration it is also possible to enhance the resolution of the frames. Moreover, another improvement would be to make the texture analysis unsupervised, to increase robustness.
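As a sketch of the multiple-frame idea mentioned above, the following computes a sliding-window temporal average and the residual of the middle frame with respect to it, under the stationary-text assumption; the window size is an illustrative choice and no shot-boundary handling is included.

import numpy as np

def temporal_average_and_residual(frames, window=5):
    # frames: array of shape (N, H, W); for each valid middle frame, return
    # the temporal average over the window (a low-pass result that attenuates
    # moving content) and the middle frame minus that average.
    frames = np.asarray(frames, dtype=np.float64)
    half = window // 2
    averages, residuals = [], []
    for t in range(half, len(frames) - half):
        avg = frames[t - half:t + half + 1].mean(axis=0)
        averages.append(avg)
        residuals.append(frames[t] - avg)
    return np.array(averages), np.array(residuals)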
REFERENCES
[1] S. Mori, C.Y. Suen, and K. Yamamoto, “Historical Review of
OCR Research and Development,” Proc. IEEE, vol. 80, no. 7, pp. 1029-
1058, July 1992.
[2] Takatoo M., et al., “Gray-Scale Image-Processing Technology Applied to a Vehicle License Number Recognition System,” Proc. Int. Workshop Industrial Applications of Machine Vision and Machine Intelligence, Los Alamitos, CA, IEEE Computer Society, pp. 76-79, 1987.
[3] Zhong Y., Karu K., Jain A.K., “Locating Text in Complex Color
Images,” Pattern Recognition, 28(10), pp. 1523-1535, 1995.
[4] Wu V., Manmatha R., Riseman E.M., “TextFinder: An
Automatic System To Detect and Recognize Text in Images,” IEEE Trans.
on Pattern Analysis and Machine Intelligence, Vol. 21, No. 11, November
1999.
[5] Sato T., Kanade T., Hughes E.K., Smith M.A., Satoh S., “Video
OCR: Indexing Digital News Libraries by Recognition of Superimposed
Captions,” Multimedia Systems, 7, pp. 385-395, 1999.
[6] Ohya J., Shio A., Akamatsu S., “Recognizing Characters in
Scene Images,” IEEE Trans. on Pattern Analysis and Machine Intelligence,
Vol. 16, No. 2, February 1994.
[7] Lienhart R., Effelsberg W., “Automatic Text Segmentation and
Text Recognition for Video Indexing,” Multimedia Systems, 8, pp. 69-81,
2000.
[8] Jain A.K., Yu B., “Automatic Text Location in Images and
Video Frames,” Proc. IEEE Pattern Recognition, Vol. 31, No. 12, pp. 2055-
2076, 1998.
[9] Jain A.K., Bhattacharjee S., “Text Segmentation using Gabor
Filters for Automatic Document Processing,” Machine Vision and
Applications, 5, pp. 169-184, 1992.
[10] Li H., Doermann D., Kia O., “Automatic Text Detection and
Tracking in Digital Video,” Computer Vision Lab. Technical Report,
University of Maryland, CAR-TR-900, December 1998.
[11] Daugman J.G., “Two Dimensional Spectral Analysis of Cortical
Receptive Field Profile,” Vision Research, vol. 20, pp. 847-856, 1980.
[12] Lee T.S., “Image Representation Using 2-D Gabor Wavelets,”
IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 18, no.10,
October 1996.
[13] Dimitrova N., Agnihotri L., Dorai C., Bolle R., “MPEG-7
Videotext description scheme for superimposed text in images and video,”
Signal Processing: Image Communication, 16, pp. 137-155, 2000.
[14] Riedmiller M., and Braun H., “A Direct Adaptive Method for
Faster Backpropagation Learning: The RPROP Algorithm,” Proc. of the
IEEE Int. Conf. on Neural Networks, pp. 586-591, 1993.
[15] Jain R., Kasturi R., Schunck B.G., Machine Vision, McGraw-Hill, 1995.
[16] Chellappa, R., Manjunath, B.S., Simchony, T., "Texture
segmentation with neural networks," Neural Networks in Signal Processing,
pp 37 – 61, Prentice Hall, 1992
[17] Mallat S.G., “Multiresolution approximations and wavelet orthonormal bases of L2(R),” Trans. Amer. Math. Soc., 315, pp. 69-87, 1989.
[18] Clark M., Bovik A.C., “Experiments in segmenting texture patterns using localized spatial filters,” Pattern Recognition, 22(6), pp. 707-717, 1989.
[19] Jain A.K., Farrokhnia F., “Unsupervised texture segmentation
using Gabor filters,” Proc. IEEE Int. Conf. Sys. Man. Cybern., pp. 14 - 19,
Los Angeles, CA, November 1990.
[20] Gupta A., Jain R., “Visual Information Retrieval,”
Communications of the ACM, 40(5), pp. 70 - 79, May 1997.
[21] Antani S., Kasturi R., and Jain R., “Recognition Methods in
Image and Video Databases: Past, Present and Future,” Joint IAPR
International Workshops SSPR and SPR, 1451 in Lecture Notes in
Computer Science, pp. 31 - 38, 1998.
[22] Gargi U., Kasturi R., and Antani S., “Performance characterization and comparison of video indexing algorithms,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 559-565, 1998.
[23] Gargi U., Kasturi R., and Strayer S.H., “Performance
Characterization of Video-Shot-Change Detection Methods,” IEEE
Transactions on Circuits and Systems for Video Technology, 10(1), pp. 1 –
13, 2000.
[24] Fischer S., Lienhart R., and Effelsberg W., “Automatic Recognition of Film Genres,” Proceedings of the ACM Multimedia Conference, pp. 295-304, 1995.
[25] Chaddha N., Sharma R., Agrawal A., and Gupta A., “Text
Segmentation in Mixed–Mode Images,” In 28th Asilomar Conference on
Signals, Systems and Computers, pp. 1356-1361, October 1995.
[26] Hauptmann A. and Smith M., “Text, Speech, and Vision for
Video Segmentation: The Informedia Project,” In AAAI Fall 1995
Symposium on Computational Models for Integrating Language and Vision,
1995.
[27] Smith M.A. and Kanade T., “Video Skimming and
Characterization through the Combination of Image and Language
Understanding,” In International Workshop on Content-Based Access of
Image and Video Databases, pp. 61-70, 1998.
[28] H.K. Kim, “Efficient Automatic Text Location Method and
Content Based Indexing and Structuring of Video Database,” Journal of
Visual Communications and Image Representation, 7(4): 336-344,
December 1996.
[29] B.L. Yeo and B. Liu, “Visual Content Highlighting Via
Automatic Extraction of Embedded Captions on MPEG Compressed
Video,” In SPIE/IS&T Symposium on Electronic Imaging Science and
Technology: Digital Video Compression: Algorithms and Technologies, vol.
2668, pp 38-47, 1996.
[30] R. Lienhart and F. Stuber, “Automatic Text Recognition for Video Indexing,” In Proceedings of the ACM International Multimedia Conference & Exhibition, pp. 11-20, 1996.
[31] R. Lienhart and F. Stuber, “Automatic Text Recognition in Digital Videos,” In Proceedings of SPIE, volume 2666, pages 180-188, 1996.
[32] R. Lienhart and F. Stuber, “Indexing and Retrieval of Digital
Video Sequences based on Automatic Text Recognition,” In Proceedings of
the ACM International Multimedia Conference & Exhibition, pp. 419-420,
1996.
[33] H.M. Suen and J.F. Wang, “Text String Extraction from Images
of Colour-Printed Documents,” IEE Proceedings: Vision, Image and Signal
Processing, 143(4), August 1996.
[34] H.M. Suen and J.F. Wang, “Segmentation of Uniform-Coloured
Text from Colour Graphics Background,” IEE Proceedings: Vision, Image
and Signal Processing, 144(6): 317-322, December 1997.
[35] H. Hase, T. Shinokawa, M. Yoneda, H. Sakai, and H.
Maruyama, “Character String Extraction by Multi-stage Relaxation,” In
International Conference on Document Analysis and Recognition, volume
1, pp. 298-302, 1997.
[36] H. Hase, T. Shinokawa, M. Yoneda, H. Sakai, and H.
Maruyama, “Character String Extraction from a Color Document,” In
International Conference on Document Analysis and Recognition, volume
1, pp 75-78, 1999.
[37] D. Lopresti and J. Zhou, “Document Analysis and the World
Wide Web,” In IAPR International Workshop on Document Analysis
Systems, pp. 651-659, 1996.
[38] J. Zhou, D. Lopresti, and Z. Lei, “OCR for World Wide Web Images,” In Proceedings of SPIE Document Recognition IV, volume 3027, pages 58-66, 1997.
[39] Y. Ariki and T. Teranishi, “Indexing and Classification of TV News Articles based on Telop Recognition,” In International Conference on Document Analysis and Recognition, volume 2, pages 422-427, 1997.
[40] S. Kurakake, H. Kuwano, and K. Odaka, “Recognition and
Visual Feature Matching of Text Region in Video for Conceptual Indexing,”
In Proceedings of IS&T/SPIE Conference on Storage and Retrieval for
Image and Video Databases I, Vol. SPIE 1908, pages 368-379, 1997.
[41] F. Le Bourgeois, “Robust Multifont OCR System from Gray Level Images,” In International Conference on Document Analysis and Recognition, volume 1, pages 1-5, 1997.
[42] J. Zhou, D. Lopresti, and T. Tasdizen, “Finding Text in Color
Images,” In Proceedings of SPIE Document Recognition V, volume 3305,
pages 130-140, 1998.
[43] M.V.D. Schaar-Mitrea and P.H.N. de With, “Compression of
Mixed Video and Graphics Images for TV Systems,” In SPIE Visual
Communications and Image Processing, pages 213-221, 1998.
[44] S. Messelodi and C.M. Modena, “Automatic Identification and
Skew Estimation of Text Lines in Real Scene Images,” Pattern Recognition,
32(5): 791-810, May 1999.
[45] H. Li and D. Doermann, “A Video Text Detection System
Based on Automated Training,” In Proc. International Conference on
Pattern Recognition, volume2, pages 223-226, 2000.
[46] H. Li, D. Doermann, and O. Kia, “Automatic Text Detection
and Tracking in Digital Video,” IEEE Transactions on Image Processing,
9(1): 147-156, 2000.
[47] Y. Zhong, H. Zhang, and A.K. Jain, “Automatic Caption
Localization in Compressed Video,” IEEE Transactions on Pattern Analysis
and Machine Intelligence, 22(4): 385–392, 2000.
[48] V.Y. Mariano and R. Kasturi, “Locating Uniform-Colored Text
in Video Frames,” In Proc. International Conference on Pattern
Recognition, volume 4, pages 539-542, 2000.
[49] B.T. Chun, Y. Bae, and T.Y. Kim, “Text Extraction in Videos
using Topographical Features of Characters,” In IEEE International Fuzzy
Systems Conference, volume2, pages 1126-1130, 1999.
[50] C. Garcia and X. Apostolidis, “Text detection and segmentation
in complex color images,” In Proc. IEEE International Conference on
Acoustics, Speech, and Signal Processing, pages 2326-2329, 2000.
[51] O. Hori, “A Video Text Extraction Method for Character
Recognition,” In International Conference on Document Analysis and
Recognition, pages 25-28, 1999.
[52] H. Kasuga, M. Okamoto, and H. Yamamoto, “Extraction of
Characters from Color Documents,” In Proceedings of IS&T/SPIE
Conference on Document Recognition and Retrieval VII, volume 3967,
pages 278-285, 2000.
[53] Y.K. Lim, S.H. Choi, and S.W. Lee, “Text Extraction in MPEG
Compressed Video for Content-based Indexing,” In Proc. International
Conference on Pattern Recognition, volume 4, pages 409-412, 2000.
[54] M. Sawaki, H. Murase, and N. Hagita, “Automatic acquisition of context-based image templates for degraded character recognition in scene images,” In Proc. International Conference on Pattern Recognition, volume 4, pages 15-18, 2000.
[55] K. Sobottka, H. Bunke, and H. Kronenberg, “Identification of Text on Colored Book and Journal Covers,” In International Conference on Document Analysis and Recognition, pages 57-62, 1999.
[56] J.B. Bosch and E.M. Ehlers, “Remote Sensing of Characters on
3D Objects,” Computers & Industrial Engineering, 33(1-2): 429-432, 1997.
[57] P. Comelli, P. Ferragina, M.N. Granieri, and F. Stabile, “Optical Recognition of Motor Vehicle License Plates,” IEEE Transactions on Vehicular Technology, 44(4): 790-799, November 1995.
[58] Y. Cui and Q. Huang, “Extracting Characters of License Plates
from Video Sequences,” Machine Vision and Applications, 10(5-6): 308-
320, April 1998.
[59] L.L. Winger, M.E. Jernigan, and J.A. Robinson, “Character
Segmentation and Thresholding in Low-Contrast Scene Images,” In
Proceedings of SPIE, volume 2660, pages 286-296, 1996.
[60] T. Gandhi, R. Kasturi, and S. Antani, “Application of Planar
Motion Segmentation for Scene Text Extraction,” In Proc. International
Conference on Pattern Recognition, volume 3, pages 445-449, 2000.
[61] MPEG Requirements Group, MPEG-7 Requirements
Document, Doc. ISO/MPEG N2460, MPEG Atlantic City Meeting, October
1998.
[62] MPEG Requirements Group, MPEG-7 Requirements
Document, Doc. ISO/MPEG N2461, MPEG Atlantic City Meeting, October
1998.
[63] MPEG-7 Description Schemes, ISO/ IEC/ JTC1/ SC29/ WG11/
N2844, July 1999.
[64] MPEG-7 Description Schemes (V0.6), ISO/ IEC/ JTC1/ SC29/
WG11/ M5040, Version 0.6-a, September 1999.
[65] D. Gabor, “Theory of communication,” J. IEE, vol. 93, pp. 429-459, 1946.
[66] http://www.abbyy.com