Character Segmentation and Ground-truth Preparation for Handwritten Bangla Word Images. Submitted by SANCHITA MAITY, Exam. Roll No.: MCA-3212027 of 2011-12, University Regn. No.: 108560 of 2009-10, under the guidance of Mr. Ram Sarkar, Department of Computer Science and Engineering, Jadavpur University. A dissertation submitted in partial fulfillment of the requirements for the award of Master of Computer Application (MCA), Department of Computer Science and Engineering, Faculty of Engineering and Technology, Jadavpur University, Kolkata - 700 032, 2011-2012


Character Segmentation and Ground-truth Preparation for Handwritten Bangla Word Images

Submitted by

SANCHITA MAITY

Exam. Roll No. : MCA-3212027 of 2011-12

University Regn. No. : 108560 of 2009-10

Under the guidance of

Mr. Ram Sarkar

Department of Computer Science and Engineering,

Jadavpur University.

A dissertation submitted in partial fulfillment of the requirements for the award of Master

of Computer Application (MCA)

Department of Computer Science and Engineering

Faculty of Engineering and Technology

Jadavpur University

Kolkata - 700 032

2011-2012


CONTENTS

Chapter 1: Introduction
  1.1 An overview on Optical Character Recognition (OCR)
    1.1.1 Description
    1.1.2 History of OCR
    1.1.3 Problem of OCR
    1.1.4 Recent Trends in OCR research
  1.2 Characteristics of Bangla script
  1.3 Character Segmentation and Ground-truthing
    1.3.1 What is character segmentation?
    1.3.2 What is ground-truthing?
    1.3.3 Importance of handwritten Bangla Word

Chapter 2: Review of existing work
  2.1 Problems of Character Segmentation from handwritten Bangla word images
  2.2 Some recent character segmentation and ground-truthing methodologies
    2.2.1 A fuzzy technique for character segmentation
    2.2.2 A two stage approach for segmentation
    2.2.3 A database for unconstrained handwritten Bangla word images
    2.2.4 A complete handwritten numeral database
  2.3 Motivation

Chapter 3: Present Work
  3.1 Data collection methodologies
  3.2 Segmentation
    3.2.1 Selection of SF and DNS Components
      3.2.1.1 Initial Selection of Obvious SF and DNS Class Components
      3.2.1.2 Classification of SF/DNS Components using MLP
    3.2.2 Determination of Matra Pixels using a Fuzzy Membership Function and Horizontalness Feature for SF components
    3.2.3 Determination of Potential Segmentation Points using Two Fuzzy Membership Functions for SF components
    3.2.4 Identification of Actual Segmentation Points in the SF Components
  3.3 Preparation of Ground-truthed images

Chapter 4: Conclusion

References


Chapter 1

Introduction

1.1 An Overview on Optical Character Recognition (OCR)

1.1.1 Description

Optical character recognition, usually abbreviated to OCR, is the mechanical or electronic translation of images of handwritten, typewritten or printed text (usually captured by a scanner) into machine-editable text.

Broadly speaking, an OCR system lowers the barrier of the keyboard interface between man and machine to a great extent and helps advance office automation. In doing so, OCR systems facilitate large-scale document transcription with a huge saving of time and human effort. Such systems have potential applications in reading amounts from bank cheques, extracting data from filled-in forms, interpreting handwritten addresses on mail pieces for automatic routing, and so on.

OCR is a field of research in pattern recognition, artificial intelligence and machine vision. Though academic research in the field continues, the focus of OCR has shifted to the implementation of proven techniques. Optical character recognition (using optical techniques such as mirrors and lenses) and digital character recognition (using scanners and computer algorithms) were originally considered separate fields. Because very few applications survive that use true optical techniques, the term OCR has now been broadened to include digital image processing as well.

Early systems required training (the provision of known samples of each character) to read a specific font. "Intelligent" systems with a high degree of recognition accuracy for most fonts are now common. Some systems are even capable of reproducing output that closely approximates the original scanned page, including images, columns and other non-textual components.

1.1.2 History of OCR

In 1929 Gustav Tauschek obtained a patent on OCR in Germany, followed by Handel, who obtained a US patent on OCR in 1933 (U.S. Patent 1,915,993). In 1935 Tauschek was also granted a US patent on his method (U.S. Patent 2,026,329).

Tauschek's machine was a mechanical device that used templates. A photodetector was placed so that when the template and the character to be recognized were lined up for an exact match and a light was directed towards them, no light would reach the photodetector.

In 1950, David H. Shepard, a cryptanalyst at the Armed Forces Security Agency in the United States, was asked by Frank Rowlett, who had broken the Japanese PURPLE diplomatic code, to work with Dr. Louis Tordella to recommend data automation procedures for the Agency. This included the problem of converting printed messages into machine language for computer processing. Shepard decided it must be possible to build a machine to do this and, with the help of Harvey Cook, a friend, built "Gismo" in his attic during evenings and weekends. This was reported in the Washington Daily News on 27 April 1951 and in the New York Times on 26 December 1953, after his U.S. Patent 2,663,758 was issued. Shepard then founded Intelligent Machines Research Corporation (IMR), which went on to deliver the world's first several OCR systems used in commercial operation. While both Gismo and the later IMR systems used image analysis, as opposed to character matching, and could accept some font variation, Gismo was limited to reasonably close vertical registration, whereas the later commercial IMR scanners analyzed characters anywhere in the scanned field, a practical necessity on real-world documents.

The first commercial system was installed at Reader's Digest in 1955 and, many years later, was donated by Reader's Digest to the Smithsonian, where it was put on display. The second system was sold to the Standard Oil Company of California for reading credit card imprints for billing purposes, with many more systems sold to other oil companies. Other systems sold by IMR during the late 1950s included a bill stub reader to the Ohio Bell Telephone Company and a page scanner to the United States Air Force for reading typewritten messages and transmitting them by teletype. IBM and others were later licensed on Shepard's OCR patents.

In about 1965 Reader's Digest and RCA collaborated to build an OCR document reader designed to digitize the serial numbers on Reader's Digest coupons returned from advertisements. The documents were printed by an RCA drum printer using the OCR-A font. The reader was connected directly to an RCA 301 computer (one of the first solid-state computers). It was followed by a specialized document reader installed at TWA, where it processed airline ticket stock (a task made more difficult by the carbonized backing on the ticket stock). The readers processed documents at a rate of 1,500 documents per minute and checked each document, rejecting those they were not able to process correctly. The product became part of the RCA product line as a reader designed to process "turnaround documents" such as the utility and insurance bills returned with payments.

The United States Postal Service has been using OCR machines to sort mail since 1965, based on technology devised primarily by the prolific inventor Jacob Rabinow. The first use of OCR in Europe was by the British General Post Office (GPO). In 1965 it began planning an entire banking system, the National Giro, using OCR technology, a process that revolutionized bill payment systems in the UK. Canada Post has been using OCR systems since 1971. OCR systems read the name and address of the addressee at the first mechanized sorting center and print a routing bar code on the envelope based on the postal code. After that, the letters need only be sorted at later centers by less expensive sorters, which need only read the bar code. To avoid interference with the human-readable address field, which can be located anywhere on the letter, a special ink is used that is clearly visible under ultraviolet light. This ink looks orange in normal lighting conditions. Envelopes marked with the machine-readable bar code may then be processed.

In 1974, Ray Kurzweil started the company Kurzweil Computer Products, Inc. and led development of the first omni-font optical character recognition system, a computer program capable of recognizing text printed in any normal font. He decided that the best application of the technology would be to create a reading machine for the blind, which would allow blind people to understand written text by having a computer read it to them out loud. However, this device required the invention of two enabling technologies: the CCD flatbed scanner and the text-to-speech synthesizer. On January 13, 1976, the finished product was unveiled during a widely reported news conference headed by Kurzweil and the leaders of the National Federation of the Blind. Called the Kurzweil Reading Machine, the device covered an entire tabletop but functioned exactly as intended. On the day of the machine's unveiling, Walter Cronkite used the machine to give his signature sign-off, "And that's the way it was, January 13, 1976." While listening to The Today Show, musician Stevie Wonder heard a demonstration of the device and personally purchased the first production version of the Kurzweil Reading Machine.

In 1978 Kurzweil Computer Products began selling a commercial version of the optical character recognition computer program. LexisNexis was one of the first customers and bought the program to upload paper legal and news documents onto its nascent online databases. Two years later, Kurzweil sold his company to Xerox, which had an interest in further commercializing paper-to-computer text conversion. Kurzweil Computer Products thus became a subsidiary of Xerox known as ScanSoft (now Nuance).


1.1.3 Problem of OCR

OCR of textual documents in general involves the following problems.

i) Image acquisition

ii) Text line extraction from document images

iii) Word segmentation and character segmentation

iv) Character recognition and word recognition

Optical scanners attached to PCs are mostly used for capturing digital images of documents. Extraction of text lines from document images is a trivial problem provided that the document image is unskewed. Text lines can then be extracted easily by identifying the valleys of the horizontal pixel density histogram of the image. In practical situations, however, document images are skewed at least to some extent, and this technique fails for such images. Many text lines may also touch each other. Skewness is inherent in handwritten text. So special techniques are required for character segmentation of handwritten Bangla word images.
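The histogram-based line extraction mentioned above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in this work: the image is assumed to be unskewed and binarized, stored as a list of rows with 1 for a black pixel, and the names `horizontal_profile`, `extract_text_lines` and the `min_pixels` threshold are hypothetical.

```python
def horizontal_profile(image):
    """Count the black pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def extract_text_lines(image, min_pixels=1):
    """Return inclusive (top, bottom) row ranges of the text lines.

    A row belongs to a text line when its black-pixel count reaches
    min_pixels; runs of emptier rows are the valleys between lines.
    """
    profile = horizontal_profile(image)
    lines, start = [], None
    for y, count in enumerate(profile):
        if count >= min_pixels and start is None:
            start = y                      # a text line begins
        elif count < min_pixels and start is not None:
            lines.append((start, y - 1))   # a valley ends the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Toy page with two text lines separated by two empty rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0],
]
print(extract_text_lines(page))  # [(1, 2), (5, 6)]
```

On a skewed page the rows of neighbouring lines overlap, the valleys fill up, and a sketch of this kind merges several lines into one, which is the failure described above.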

Segmentation of isolated word images, extracted from optically scanned document images of handwritten text, is one of the major problems of optical character recognition (OCR). If we can find a better method for segmenting handwritten words into characters, then we can also improve the recognition of those characters. So segmentation of words into characters makes a large contribution to character recognition and hence to the overall performance of an OCR system.

Characters segmented from a document image are to be recognized so that they can be coded in ASCII or some other standard character code. For any of the widely used non-holistic OCR approaches, the success of a specific technique depends on how well a word can be segmented into pieces, which are subsequently considered as candidates for its constituent characters. The better the segmentation, the less ambiguity is encountered in recognizing candidate characters or word pieces. To recognize a candidate character, its context also requires due consideration. Because of variation in shapes and sizes, character segmentation of handwritten Bangla word images requires more sophisticated techniques than that of printed characters.

1.1.4 Recent Trends in OCR Research

Research on OCR has mostly concentrated on text of European languages based on the Roman alphabet. Possibly the popularity of European languages in the industrialized West has interested both researchers and entrepreneurs in OCR of text of European languages, including English. Scripts of Asian languages such as Chinese, Korean, Japanese and Arabic have also received considerable attention from researchers working in the field of OCR. Other than these, a number of Indian scripts, viz., Devanagari, Oriya and Bangla, have started to receive attention in OCR-related research in recent years. Of these, Bangla is the second most popular script and language in the Indian subcontinent. As a script, it is used for the Bangla, Assamese and Manipuri languages. Bangla, which is also the national language of Bangladesh, is the fifth most popular language in the world. Such is the importance of Bangla both as a script and as a language. But evidence of research on OCR of handwritten Bangla characters, as observed in the literature, is scarce.

1.2 Characteristics of Bangla script

Characters of Bangla script can be grouped into five categories, viz., vowels, consonants, modified shapes, compound characters and punctuation symbols. Of these, vowels and consonants, which constitute the Bangla alphabet, are called basic characters. There are 11 vowels and 39 consonants in the Bangla alphabet. There is no concept of upper and lower case in Bangla script. Characters are written from left to right. A vowel following a consonant in a word takes a modified shape; such shapes of all vowels are termed modified shapes. It is noteworthy that some modified shapes attached to a consonant have two isolated parts appearing on two opposite sides of the consonant. Some modified shapes may appear just below the consonant, and some may reach its top from one of its sides with a curved or partly curved segment. So characters in Bangla script may not always appear in non-overlapping consecutive positions. Depending on the mode of pronunciation, a Bangla consonant followed by one or two consonants takes a complex shape, which is called a compound character. There are in all 280 compound characters in Bangla script. Apart from the basic characters, the modified shapes and the compound characters, Bangla script also includes 10 digit patterns. An important feature of Bangla characters is the Matra, or head line. Excluding a few, all basic and compound characters of Bangla script have this feature. The width of a Matra is nearly the same as the width of the character it touches. The Matras of consecutive characters appearing in a Bangla word are joined to form a common Matra for the word.

Fig. 1 Bangla alphabet basic shapes (the first 11 characters are vowels, while the others are consonants)


Fig. 2 Examples of vowel and consonant modifiers: (a) vowel modifiers, (b) exceptional cases of vowel modifiers and (c) consonant modifiers


1.3 Character segmentation and ground truthing

1.3.1 What is character segmentation?

Character segmentation is a necessary preprocessing step for character recognition in many handwritten word recognition systems. The most difficult case in character segmentation is cursive script. The fully cursive nature of Bangla handwriting poses considerable challenges for automatic character segmentation. Character segmentation techniques are mostly script dependent, not only because character shapes vary from one alphabet to another but also because of certain script-specific features of text documents. Segmentation of isolated words into constituent characters is a challenging problem for Bangla script. The appearance of consecutive characters in overlapping column positions over a text line makes Bangla word segmentation more complex than segmentation of English words. The problem is compounded for handwritten Bangla words because of variation in the sizes and shapes of handwritten characters. Considering all this, a novel technique for segmenting images of handwritten Bangla words is presented in this work.

Before the individual characters of each Bangla word in the text image are segmented, the word is horizontally partitioned into three adjacent zones, as shown in Fig. 3. The portion of each word on and above the Matra constitutes the 'upper zone', the main body of the characters in a word lies within the 'middle zone', and the portion of the word below the main body, containing mainly modified shapes and period-like isolated character components, forms the 'lower zone'. The technique of word segmentation is based on detection of the Matra.
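A minimal sketch of the three-zone partition is given below. It rests on assumptions that are not part of this work: the Matra row is estimated as the single row with the most black pixels, which is only reasonable for a nearly horizontal Matra, and the bottom row of the main body (`base_row`) is supplied by the caller. The function names and the list-of-rows image representation (1 for a black pixel) are hypothetical.

```python
def estimate_matra_row(word):
    """Return the index of the row with the most black pixels.

    Assumption: for a nearly horizontal Matra this densest row lies on
    the head line. The first such row wins when several rows tie.
    """
    profile = [sum(row) for row in word]
    return max(range(len(profile)), key=profile.__getitem__)


def split_zones(word, matra_row, base_row):
    """Split a word image into (upper, middle, lower) zones.

    upper  : rows on and above the Matra
    middle : rows of the main body, down to base_row inclusive
    lower  : rows below the main body
    """
    upper = word[:matra_row + 1]
    middle = word[matra_row + 1:base_row + 1]
    lower = word[base_row + 1:]
    return upper, middle, lower


# Toy word: row 1 is the Matra, rows 2-3 the main body,
# rows 4-5 a modifier hanging below the body.
word = [
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
]
matra_row = estimate_matra_row(word)
upper, middle, lower = split_zones(word, matra_row, base_row=3)
print(matra_row, len(upper), len(middle), len(lower))  # 1 2 2 2
```

For handwritten words, where the Matra is rarely strictly horizontal, a single-row estimate of this kind is too crude, which is one reason a more tolerant, pixel-level treatment of the Matra is needed.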


Fig. 3 Illustration of three zones and region boundaries of a Bangla word

A Matra is a horizontal line that touches the upper part of many characters of Bangla script, as shown in Fig. 4(a). Depending on the character, it covers at most the entire character width. Consecutive characters in a Bangla word that have Matras are joined through a common Matra, formed by joining the Matras of the individual characters, as shown in Fig. 4(b). This line may be discontinuous at the positions where characters without Matras appear in the word.

In handwritten Bangla words, the Matras are not as strictly horizontal as they are in printed words. So the technique of removing the Matra of a word to segment its constituent characters may leave many characters joined with each other. Such under-segmentation may complicate classification decisions in the subsequent stage.

How to segment handwritten Bangla words into constituent characters efficiently is still a challenging problem in OCR-related research. This is a major motivation behind the present work, which deals with the problem of segmenting handwritten Bangla words into constituent characters.


Fig. 4(a) An illustration of the common Matra of a word

Fig. 4(b) An illustration of Matra of individual characters in a word

In image analysis, testing any algorithm manually, as is now widely done around the world, consumes time and manpower. Even the testing scheme is not standard: different organizations use different testing schemes, so reported success rates vary. Standard databases are also scarce. Generating results for a particular technique therefore becomes laborious for researchers, as they need to collect or prepare a database for their own work.

1.3.2 What is ground truthing?

Generation of appropriate ground truth data has always been a challenging and time-consuming task for the kind of problem under consideration. The availability of ground truth information, however, makes any database more useful, enabling proper evaluation of a technique by comparing its output with the corresponding ground truth. In the present work, we have prepared ground truth images for a subset of our database, viz., CMATERg1.1.1 and CMATERg1.2.1. We have prepared these ground truth images in a semi-automatic way. More specifically, we have employed our previously developed technique [9] to identify individual character segments from any document image. Errors that might have been generated in the automated character segmentation are corrected using a software tool called GT Gen version 1.1, which we have developed for this project. Basically, we have used GT Gen to recolor the characters that were erroneously labeled by our previously developed technique [9]. It may be noted that all the ground truth images are stored in bitmap (bmp) file format, where the background is labeled in white and individual characters are marked in different colors.
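The labeling convention just described (white background, one color per character, manual recoloring of wrong labels) can be illustrated with a small sketch. It is not the GT Gen tool: the label-map representation (0 for background, k for the k-th character segment), the palette and the function names are assumptions made for illustration, and writing the result to a bmp file is left out.

```python
WHITE = (255, 255, 255)
# Hypothetical palette; any set of distinct non-white colors would do.
PALETTE = [(255, 0, 0), (0, 128, 0), (0, 0, 255), (255, 128, 0), (128, 0, 128)]


def relabel(labels, wrong, right):
    """Correction step: give every pixel labeled `wrong` the label `right`."""
    return [[right if lab == wrong else lab for lab in row] for row in labels]


def colour_ground_truth(labels):
    """Turn a label map into an RGB pixel grid.

    Background (label 0) becomes white; each character label gets its
    own palette color (the palette is reused cyclically if it runs out).
    """
    return [
        [WHITE if lab == 0 else PALETTE[(lab - 1) % len(PALETTE)] for lab in row]
        for row in labels
    ]


# Automatic segmentation split the second character into labels 2 and 3.
labels = [
    [0, 1, 1, 0, 2, 3],
    [0, 1, 0, 0, 2, 3],
]
corrected = relabel(labels, wrong=3, right=2)  # merge the over-segmented pieces
image = colour_ground_truth(corrected)
print(corrected)  # [[0, 1, 1, 0, 2, 2], [0, 1, 0, 0, 2, 2]]
```

Keeping labels and colors separate in this way means a correction is a pure relabeling, and the colored image can always be regenerated from the corrected label map.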

1.3.3 Importance of handwritten Bangla word

Bangla is an important South Asian script widely used in India and Bangladesh. In popularity, Bangla ranks 5th in the world and 2nd in India, both as a script and as a language. It is also the national language of Bangladesh.

Handwritten Bangla words are cursive in nature in most cases, so identifying each character is difficult for any segmentation algorithm. In handwritten Bangla words, the Matras are not as strictly horizontal as they are in printed words. So the technique of removing the Matra of a word to segment its constituent characters may leave many characters joined with each other. Such under-segmentation may complicate classification decisions in the subsequent stage.


Chapter 2

Review of Existing Work

This chapter reviews previous work on character segmentation from handwritten Bangla word images and its drawbacks.

2.1 Problems of Character Segmentation from handwritten Bangla word images

Character identification is the first and most important step in the process of OCR of document images. If the characters are not identified accurately, for example when two or more characters remain connected by a common Matra line, then none of the characters of the word can be recognized correctly. The same problem occurs if a character is accidentally split into two or more parts. Characters might be written so close to one another that accents and similar features become difficult to assign to the correct character. Adjacent characters might even touch one another at some points, and in those cases it becomes very difficult to identify the constituent characters that have joined to form a single component. Character segmentation of handwritten Bangla word images thus faces many challenges that depend on the writing style of the individual. In image analysis, testing any algorithm manually, as is now widely done around the world, consumes time and manpower. Even the testing scheme is not standard: different organizations use different testing schemes, so reported success rates vary. The present work suggests a method based on comparison of neighboring connected or disconnected components to determine whether they belong to the same character.


2.2 Some recent character segmentation and ground-truthing methodologies

A wide variety of character segmentation and ground-truthing methods for handwritten Bangla word images have been reported in the literature. The works reviewed here are of four types, namely (i) a fuzzy technique for segmentation; (ii) a two stage approach for segmentation; (iii) a database for unconstrained handwritten Bangla word images; and (iv) a complete handwritten numeral database. They cannot be grouped into a single category since they do not share a common guideline.

2.2.1 A fuzzy technique for character segmentation

A fuzzy technique for segmentation of handwritten Bangla word images has been presented in [1]. It works in two steps. In the first step, the black pixels constituting the Matra (i.e., the longest horizontal line joining the tops of the individual characters of a Bangla word) in the target word image are identified using a fuzzy feature. In the second step, some of the black pixels on the Matra are identified as segment points (i.e., the points through which the word is to be segmented) using three fuzzy features. In experiments with a set of 210 samples of handwritten Bangla words collected from different sources, the average success rate of the technique is shown to be 95.32%. Apart from certain limitations, the technique can be considered a significant step towards the development of a full-fledged Bangla OCR system, especially for handwritten documents.
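The fuzzy feature used in [1] is not spelled out in this review, so the sketch below substitutes a simple stand-in to show the general idea of grading black pixels by how Matra-like they are. The membership of a black pixel is assumed to be the length of the horizontal run of black pixels it lies on, divided by the word width; this definition, the 0.5 threshold and the function names are all hypothetical and should not be read as the method of [1].

```python
def run_length_membership(word):
    """Assign each black pixel a membership value in [0, 1].

    Assumed stand-in feature: length of the horizontal black run that
    contains the pixel, divided by the image width. White pixels get 0.
    """
    height, width = len(word), len(word[0])
    mu = [[0.0] * width for _ in range(height)]
    for y, row in enumerate(word):
        x = 0
        while x < width:
            if row[x] == 1:
                end = x
                while end < width and row[end] == 1:
                    end += 1               # extend to the end of the run
                for k in range(x, end):
                    mu[y][k] = (end - x) / width
                x = end
            else:
                x += 1
    return mu


def matra_pixels(word, threshold=0.5):
    """Return (row, column) positions whose membership reaches the threshold."""
    mu = run_length_membership(word)
    return [(y, x) for y, row in enumerate(mu)
            for x, m in enumerate(row) if m >= threshold]


# Toy word: a long head line with a gap in row 0, shorter strokes below.
word = [
    [1, 1, 1, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 0, 1],
    [1, 1, 0, 1, 0, 1, 1, 1],
]
print(matra_pixels(word))
# [(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5)]
```

A graded membership, unlike a hard row-based rule, lets pixels of a slanted or broken head line still score highly, which is the property a fuzzy treatment of the Matra is meant to provide.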

2.2.2 A two stage approach for segmentation

Segmentation of handwritten Bangla word images is a challenging problem for researchers. Discontinuity or absence of the Matra, an important feature of Bangla script, may lead to inherent segmentation within the word images. Around 55% of these inherently segmented connected sub-images do not require further segmentation. The authors of [2] have designed a novel two-stage approach for segmentation of isolated Bangla word images. In the first stage, a feature-based approach is designed to classify the connected word segments into one of two classes, 'Segment further' and 'Do not Segment', using a multi-layer perceptron (MLP) based classifier. In the second stage, fuzzy segmentation features are designed to identify the Matra region and the potential segmentation points on the Matra of the connected word segments that belong to the 'Segment further' class. Using this technique, the overall segmentation accuracy achieved after the two stages is 95.87% in [2].

2.2.3 A database for unconstrained handwritten Bangla word images

The work [7] describes the preparation of a benchmark database for research on off-line Optical Character Recognition (OCR) of document images of handwritten Bangla text and of Bangla text mixed with English words. This is the first handwritten database in this area available as an open source document. As India is a multilingual country with a colonial past, multi-script document pages are very common. The database contains 150 handwritten document pages, of which 100 pages are written purely in Bangla script and the remaining 50 pages are written in Bangla text mixed with English words. The handwritten pages were collected from different data sources. After collection, all the document pages were preprocessed and distributed into two groups: CMATERdb1.1.1, containing document pages written in Bangla script only, and CMATERdb1.2.1, containing document pages written in Bangla text mixed with English words. Finally, we have also provided ground truth images for line segmentation. To generate the ground truth images, we first labeled each line in a document page automatically by applying one of our previously developed line extraction techniques and then corrected any errors using our tool GT Gen 1.1. Line extraction accuracies of 90.6% and 92.38% are achieved on the two databases, respectively, using our algorithm. Both databases, along with the ground truth annotations and the ground truth generating tool, are freely available at http://code.google.com/p/cmaterdb.

2.2.4 A complete handwritten numeral database

The paper [16] describes the ISI database of handwritten Bangla numerals. Bangla is the second most popular language and script of the Indian subcontinent and is used by more than 200 million people all over the globe. The database has several components, which include both on-line and off-line handwritten numerals. Samples of numeral strings and isolated numerals have been collected under both modes of writing. The database has been developed at the Computer Vision and Pattern Recognition Unit of the Indian Statistical Institute, Kolkata. Its samples are properly ground-truthed and subdivided into respective training and test sets. The off-line sample images are stored in TIFF image format, and the on-line samples are stored in ASCII file format with various information as a header. Through free access for researchers, this database will facilitate fruitful research on handwriting recognition of Bangla.

Other methodologies are included in the works described in [12-14]. In [5], the character segmentation problem is viewed from an artificial intelligence perspective.

2.3 Motivation

Considering the problems discussed above, we need an automated evaluation tool for OCR systems that compares the segmentation results of a technique or algorithm with ground-truthed images. The evaluation technique consists of the following steps. First, a database of word images is prepared. Then a segmentation technique is applied to that database. Since the characters of the word images are not always segmented properly, the errors must be detected and corrected manually. After manual correction, the word images are stored in the database as ground-truthed images. Our aim is to compare the segmented word images with the corresponding ground-truthed images automatically and to report the success rate. We intend to create a tool for this purpose in the future. It will save time and manpower and will minimize analytical errors. As mentioned above, ground-truth preparation plays an important role in image analysis. We also found no standard database or automatic evaluation tool for handwritten Bangla word images for handwritten Bangla OCR systems. Therefore, the present work introduces ground-truthing for handwritten Bangla word images at two levels.

Ground-truthed images are generated for the said database to evaluate character segmentation algorithms. The character segmentation accuracy on these handwritten word images is also reported in the current work. Ground-truthed images are prepared at the component level and the character level, so the database would also be useful for evaluating the performance of character recognition systems.


Chapter 3

Present Work

The present work on character segmentation and ground-truth preparation for handwritten Bangla word images is described below. A typical OCR system consists of scanning, preprocessing, word and character separation, recognition and post-processing stages. Each stage has an impact on overall performance. India is a multilingual country with a colonial past, so multi-script document pages are very common. The database contains 5000 handwritten word images written purely in Bangla script.

The offline handwritten documents were collected from different data sources. After collection, all the document pages were preprocessed. Each document page contains 180-200 words on average. Finally, we have also provided ground-truthed images for character segmentation. To generate the ground-truthed images, we first labeled each component in a word image with a unique color by applying our previously developed technique [ ] and then corrected any errors manually as part of this project. The database would be very useful for handwritten OCR research on Bangla, especially for evaluating the performance of character segmentation methodologies, as there is hardly any standard database of handwritten Bangla word images. The database is currently available at www.cmaterju.org. Our aim is to provide ground-truthed images for component-level and character-level segmentation for Bangla handwritten character recognition systems.

The name of the prepared database is CMATERdb1.1.1, where CMATER stands for Center of Microprocessor Application for Training Education and Research, a research laboratory at the Computer Science and Engineering department of Jadavpur University, India, where the current research activity took place. Db stands for database and the numeric value.

The overall workflow is shown in Fig. 5, and its implementation details are discussed in the following sections. The part covered by the present work is highlighted in the flow diagram.

Fig.5 The basic flow diagram of the overall project


3.1 Data Collection Methodologies

The handwritten document pages for the proposed database were collected from different types of sources, viz., class notes of students of different age groups, handwritten manuscripts of a popular Bangla monthly magazine "Computer Jagat", and document pages written by different persons, on request, under our supervision.

The document pages written under our supervision were collected from various persons, with subject matter varying from newspaper articles to Bangla textbooks containing both Bangla and English vocabulary. The writers were asked to use a black or blue ink pen and to write inside A4-size pages. We imposed no other restrictions regarding the kind of pen used or the style of writing. Special attention was paid to ensuring data collection from writers of different age groups and educational levels. Moreover, the pages were collected at different places (home, office, school, etc.) in order to include different styles of writing. In total, 25 men and 15 women participated in the data collection drive. The main characteristics of our database are as follows.

95% of the writers were native Bengali.

Places of data collection: in school/colleges, 40% in writers' homes, and 20% in public places.

Educational level of the writers: 20% 10th standard school, 40% general high school, and 40% college and university.

Writers' age: 40% between 15-25 years, 30% between 25-35 years, and 30% between 35-55 years.

3.2 Segmentation

The current work designs a novel technique for identifying potential segmentation points on the Matra to isolate constituent characters from a word image of Bangla script. In the first stage, a component-labeling algorithm is applied to identify connected sub-parts of the word image. The second stage classifies the connected sub-parts into either the SF or the DNS class, using a rule-based prior selection and a well-known MLP-based classifier with a set of features extracted from these components. Finally, fuzzy segmentation features are used to identify potential segmentation points on the detected fuzzy Matra region, for subsequent extraction of isolated characters or character sub-parts from the overall word image. The basic steps involved in segmenting handwritten word images of Bangla script are depicted in Fig. 6.

Constituent characters of words, or their sub-parts, often extend above the common Matra or appear below the main character body. In the current work, we have identified three adjacent horizontally partitioned zones (viz., upper, middle and lower) in each word image, as shown in Fig. 1. More specifically, the top row of the upper zone (R1), the top row of the middle zone (R2), the middle row of the middle zone (R3), the bottom row of the middle zone (R4) and the bottom row of the lower zone (R5) are identified in the word image. A horizontal pixel scan of the word image from top to bottom identifies the first row with any black pixel as the top row of the upper zone, i.e., R1. Similarly, a horizontal scan from bottom to top identifies the first row with any black pixel as the bottom row of the lower zone, i.e., R5. Identifying the top and bottom row boundaries of the middle zone (a key decision for subsequent feature extraction) is a challenging task in handwritten word segmentation.
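The two boundary scans for R1 and R5 can be sketched as follows. This is a minimal illustration, assuming the word image is a list of rows in which 1 marks a black (text) pixel and 0 marks background; the function name `find_r1_r5` is hypothetical and not taken from the original implementation.

```python
def find_r1_r5(img):
    """Return (R1, R5): the first and last rows that contain a black pixel.

    Assumption: img is a list of rows, 1 = black (text), 0 = background.
    """
    rows_with_ink = [r for r, row in enumerate(img) if any(row)]
    if not rows_with_ink:
        raise ValueError("image contains no black pixels")
    return rows_with_ink[0], rows_with_ink[-1]


word = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
r1, r5 = find_r1_r5(word)
print(r1, r5)
```

Collecting the inked rows once gives the same result as a top-down scan followed by a bottom-up scan.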

In the work [2], the authors scanned the whole image to calculate, for each row, the sum of the maximum horizontal run lengths, and then estimated R2 from those values. This can sometimes be misleading: depending on an individual's handwriting style, the row with the largest sum of longest run lengths may appear anywhere in the word image, so R2 is not estimated correctly, as shown in Fig. 7. Therefore, we have modified the technique for determining R2.


Fig.6 Block diagram of basic steps of present work.

In general, the Matra of a handwritten word image does not appear in the lower half of the image. So in the present work, to identify the common headline of the word, the horizontalness of each row is computed over the top half of the word image, i.e., from R1 to R1+(R5-R1)/2. Each black pixel of the word image in this region is replaced by the length of the longest horizontal run of black pixels that contains it. The sum of these longest-run values over all the pixels in a row is then computed for each row. The row with the highest sum is the row with maximum horizontalness. This row signifies the possible upper boundary of the middle zone, and we call it the 1st approximation of the upper boundary of the middle zone (R21). Then, from the vertical feature, we have estimated the 2nd approximation of R2 and called it R22.
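The computation of R21 described above can be sketched as follows. This is an illustrative sketch only, assuming a list-of-rows image with 1 for black pixels and integer division for the half-height; the names `longest_run_map` and `estimate_r21` are hypothetical.

```python
def longest_run_map(row):
    """Replace each black pixel by the length of the horizontal run containing it."""
    out = [0] * len(row)
    c = 0
    while c < len(row):
        if row[c]:
            start = c
            while c < len(row) and row[c]:
                c += 1
            for k in range(start, c):
                out[k] = c - start
        else:
            c += 1
    return out


def estimate_r21(img, r1, r5):
    """Row in [r1, r1 + (r5 - r1) // 2] with the largest sum of run-length values."""
    last = r1 + (r5 - r1) // 2
    best_row, best_sum = r1, -1
    for r in range(r1, last + 1):
        s = sum(longest_run_map(img[r]))
        if s > best_sum:  # ties keep the topmost row (an assumption)
            best_row, best_sum = r, s
    return best_row


word = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # headline-like row
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
print(estimate_r21(word, 0, 4))
```

Because every pixel of a run of length L contributes L, a run adds L squared to its row sum, which is why long headline strokes dominate short strokes.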

Fig.7 Wrong estimation of R2 using technique described in [2]


Even after estimating R21 and R22, we have observed that in a few cases the Matra region is not estimated accurately (as shown in Fig. 6). To address this issue, we have determined another estimate of R2 as the row containing the longest single run of black pixels and called it R23. Finally, we have taken the average of the three approximations of R2 and called it R2final, such that R2final = (R21 + R22 + R23)/3. This new estimate of R2 (involving three approximations) is observed to be more accurate than our prior works involving two such approximations. We have taken R2final as the final upper boundary (R2) of the middle zone of a handwritten word image.
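A small sketch of the third estimate and of the averaging step is given below. The row range searched for R23 and the rounding of the average to a whole row are assumptions, since the text does not state them; the function names are hypothetical, and R22 is simply passed in as a number because the vertical feature is not detailed here.

```python
def longest_single_run(row):
    """Length of the longest horizontal run of black pixels (1s) in a row."""
    best = cur = 0
    for p in row:
        cur = cur + 1 if p else 0
        best = max(best, cur)
    return best


def estimate_r23(img, top, bottom):
    """Row in [top, bottom] holding the longest single run (topmost on ties)."""
    return max(range(top, bottom + 1), key=lambda r: longest_single_run(img[r]))


def r2_final(r21, r22, r23):
    """Average of the three approximations, rounded to the nearest row (assumed)."""
    return round((r21 + r22 + r23) / 3)


word = [
    [0, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 1],
    [0, 1, 0, 0, 0, 0],
]
r23 = estimate_r23(word, 0, 4)
print(r23, r2_final(1, 2, r23))
```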

Similarly, the bottom row of the middle zone (i.e., R4) of a handwritten word image generally does not appear in the upper half of the image. So in the current work, to identify R4, horizontal transition points between text and background pixels are computed from the middle to the bottom of the word image, i.e., from R1+(R5-R1)/2 to R5. In each row, from the middle row to the bottom row of the word image, the number of transitions from text pixel to background pixel and vice versa is counted. The average number of transition points per row in the lower half of the image is denoted by eta (η). Scanning upward from the bottom row of the lower zone to the middle of the word image, the first row with more transition points than η is identified as the bottom row of the middle zone (say, R41). Again, as in the case of R2, we have estimated R4 from the vertical feature and called it R42. We have then taken the average of R41 and R42 as the final R4, i.e., R4final = (R41+R42)/2. We have taken R4final as the bottom row of the middle zone, i.e., R4. Finally, the middle row of the middle zone is taken as R3, i.e., R3 = (R2+R4)/2.
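The transition-based estimate R41 can be sketched as follows. This is a minimal sketch under stated assumptions: the image is a list of rows with 1 for text pixels, transitions against the image border are not counted, and R5 is returned as a fallback when no row exceeds η (the text does not cover that case). The names `transitions` and `estimate_r41` are hypothetical.

```python
def transitions(row):
    """Number of text/background changes between horizontally adjacent pixels."""
    return sum(1 for a, b in zip(row, row[1:]) if a != b)


def estimate_r41(img, r1, r5):
    """First row, scanning upward from r5, whose transition count exceeds eta."""
    mid = r1 + (r5 - r1) // 2
    counts = {r: transitions(img[r]) for r in range(mid, r5 + 1)}
    eta = sum(counts.values()) / len(counts)  # mean transitions per row
    for r in range(r5, mid - 1, -1):
        if counts[r] > eta:
            return r
    return r5  # assumed fallback: every row has the same count


word = [
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0, 1],
    [0, 1, 1, 1, 0, 1],  # last row of the main character body
    [0, 0, 0, 0, 1, 0],  # descending stroke only
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0],
]
print(estimate_r41(word, 0, 6))
```

Rows that contain only a thin descending stroke have few transitions, so the upward scan passes over them until it reaches the denser character body.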

In the present work we have used a simple yet popular technique for identifying the connected components within the word image. Identifying connected word components requires identifying the connected pixels in them and marking them with identical labels. For this, the connected component labeling (CCL) algorithm [14] scans the word image pixel by pixel, from left to right and from top to bottom. During scanning, it considers all 8 neighbors of each pixel. For each connected component, all its member pixels appearing in the sub-image are replaced by a single distinct symbol. This completes the labeling of the connected pixels in the image and generates uniquely coded connected components, as described in Fig. 8. Each such connected component is subsequently extracted for analysis.
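For illustration, 8-connected labeling can be sketched as below. Note that this sketch uses a breadth-first flood fill rather than the scan-based algorithm of [14]; both assign one label per 8-connected component, but the label order and implementation details of the original may differ. The function name `label_components` is hypothetical.

```python
from collections import deque


def label_components(img):
    """Label 8-connected components of black pixels (1s).

    Returns (labels, count); labels has the same shape as img, 0 = background.
    """
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if img[r][c] and not labels[r][c]:
                count += 1
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = count
                                queue.append((ny, nx))
    return labels, count


img = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
]
labels, count = label_components(img)
print(count)
```

The diagonal pair in the top-left corner ends up in one component, which is the practical difference between 8-way and 4-way connectivity.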

3.2.1 Selection of SF and DNS Components

Among all the digitized word sub-parts generated by CCL, a decision is often required to identify only the components that need further segmentation, because word images contain many inherently segmented characters or character sub-parts (as shown in Fig. 9). Thus, not all word components require further segmentation. These components are classified into SF and DNS classes, as shown in Fig. 7. Segmenting DNS components is an overhead and causes over-segmentation of word components. So, selecting SF and DNS components not only minimizes the character isolation overhead but also minimizes the probability of over-segmentation. For this, we have developed a two-stage selection of SF and DNS class components. These stages are described in subsections 3.2.1.1 and 3.2.1.2.

Fig. 8 A sample word image and three of its connected components


3.2.1.1 Initial Selection of Obvious SF and DNS Class Components

In the work [3], MLP-based schemes were used for this classification problem. However, passing all word sub-parts to the classification algorithm not only increases computational overhead but also introduces ambiguities in the selection, leading to erroneous classification. To solve this problem, the present work adds a pre-selection step that identifies obvious SF and DNS class components. In the designed methodology, two scale-invariant thresholds are used for this pre-selection of obvious SF and DNS components prior to the MLP-based classification scheme.

In the current approach, every word component is hypothetically divided into pieces by a horizontal separating line along the middle of the region R2 to R4. The row through which this separating line passes is selected experimentally from the sample word images of the database.

After this hypothetical separation, the number of connected sub-components, or pieces, generated by the division is counted. This number is used as the decision maker: based on the number of generated sub-components, the original component is assigned to one of the two classes, viz., DNS or SF. If the number of sub-components of a component is less than a threshold value T1, we consider the component a member of the DNS class. On the other hand, if it is greater than another threshold value T2, the component is considered to belong to the SF class. Some sample components classified successfully using this thresholding technique are shown in Fig. 9.

The components whose number of sub-components (n) lies between T1 and T2, i.e., T1<n<T2, are sent to a previously trained MLP classifier for accurate classification, because no decision on these components is possible using either T1 or T2. Experimentally, we have found suitable values of T1 and T2 to be 2 and 6, respectively.
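The pre-selection rule can be sketched as follows. Two points are assumptions made for this sketch: the hypothetical separation is modeled by erasing the pixels on the separating row before counting 8-connected pieces, and a count exactly equal to T1 or T2 is treated as an obvious case so that only the open interval T1<n<T2 goes to the MLP. The function names are hypothetical.

```python
def count_pieces(component, cut_row):
    """Count 8-connected pieces left after erasing the pixels on cut_row."""
    h, w = len(component), len(component[0])
    grid = [row[:] for row in component]  # work on a copy
    grid[cut_row] = [0] * w
    pieces = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c]:
                pieces += 1
                grid[r][c] = 0
                stack = [(r, c)]
                while stack:  # flood fill, erasing visited pixels
                    y, x = stack.pop()
                    for ny in range(max(0, y - 1), min(h, y + 2)):
                        for nx in range(max(0, x - 1), min(w, x + 2)):
                            if grid[ny][nx]:
                                grid[ny][nx] = 0
                                stack.append((ny, nx))
    return pieces


def preselect(n, t1=2, t2=6):
    """Return 'DNS', 'SF', or 'MLP' (undecided) from the piece count n."""
    if n <= t1:  # boundary handling is an assumption of this sketch
        return "DNS"
    if n >= t2:
        return "SF"
    return "MLP"


comp = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],  # separating row
    [0, 1, 1, 1, 0],
]
n = count_pieces(comp, 2)
print(n, preselect(n))
```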


Fig. 9 Sample images of Bangla script which are pre-classified as Obvious DNS components,

pre-classified as Obvious SF components and sent for MLP based classification (for subsequent

SF/DNS class identification)

From the images of Fig. 9, it is evident that the choice of T1 suits single-character components that are partitioned into two pieces along the hypothetical separating line and subsequently classified as DNS segments. The value of T2 is chosen so that multiple touching characters, or their sub-parts, generate more than T2 pieces after being hypothetically partitioned along the separating line. These components are classified as SF components. In all remaining cases ambiguities may exist, which call for more sophisticated techniques such as MLP-based classifiers and associated feature vectors.

3.2.1.2 Classification of SF/DNS Components using MLP

In the present work, an MLP-based classifier is used to classify the connected word components that were not classified in the pre-selection stage into one of the two classes, i.e., to decide whether a given component needs to be segmented further or not, using the feature set listed in Table 1. The MLP-based classifier designed for this work is trained with the Back Propagation (BP) algorithm, which minimizes the sum of squared errors over the training samples by conducting a gradient descent search in the weight space. The number of neurons in the hidden layer is also adjusted during training. In the current methodology we have designed a new feature set containing 11 statistical features, as described in Table 1. The following discussion justifies the choice of the respective feature descriptors.

A higher value of feature F1 signifies that the component may belong to the DNS class, as such a component may have some part(s) in the upper zone of the word, as shown in Fig. 10(a). A similar explanation applies to features F2 and F4 for components in the middle zone and the middle-lower zone, respectively, as illustrated in Fig. 10(b). Feature F3 is used to classify noise segments (i.e., broken part(s) of the Matra), which almost certainly appear partially in the upper and/or middle zone, as shown in Fig. 10(c).

Table 1: Feature vector and their description

Feature ID    Feature description
F1            Percentage height (w.r.t. the overall component height) of the component that appears in the upper zone of the word image
F2            Percentage height (w.r.t. the overall component height) of the component that appears in the middle zone of the word image
F3            Percentage height (w.r.t. the overall component height) of the component that appears in the lower part of the middle zone of the word image
F4            Percentage height (w.r.t. the overall component height) of the component that appears in the lower zone of the word image
F5            Maximum horizontalness of the component within the region R2 to R4
F6            Area of the component within the region R2 to R4
F7            Number of data pixels of the component within the region R2 to R4
F8            Number of data pixels of the component on R2
F9            Maximum width of the component within the region R2 to R4
F10           Width of the component along R2
F11           Number of segmentation-point clusters on the Matra region of the component

Feature F5, i.e., the maximum horizontalness feature, was used in the work [3]. However, depending on individual writing styles, this feature value may be higher in the upper zone, the lower zone or the lower half of the middle zone if an ascendant (a character sub-part in the upper zone of the word image) or a descendant (a character sub-part in the lower zone of the word image) is extended unnecessarily, as shown in Fig. 11(a). Because of this, in the present work we have additionally used feature F9, i.e., the maximum width of the component within the region R2 to R4, as shown in Fig. 11(b). A smaller value of this feature implies that the component may be categorized as a DNS class component.

For feature F6, as used in the work [3], the whole component was considered in the area calculation. But this feature value may be higher for a DNS class component with an extended ascendant and/or descendant, as shown in Fig. 11(c). For this reason, we have modified feature F6 to consider only the area of interest, i.e., the area within the region R2 to R4. For the same reason, the value of feature F7, i.e., the number of data pixels, is also calculated only within the region R2 to R4. The higher the value of feature F7, the more likely the component belongs to the SF class. Similarly, a high value of feature F8 implies a more prominent and continuous Matra, i.e., the component is likely to be a member of the SF class.

Again, due to cursive handwriting or discontinuity of the Matra, the value of feature F9 may be lower for components that need to be segmented further. That is why we have also taken feature F10, which gives the width of the component in the middle zone, i.e., the vertical projection of the component within R2 to R4 onto R2 (the central Matra row).


(a) Word component in upper zone inside color box

(b) Word components in middle and lower zone inside color boxes

(c) Noise component inside color box

Fig. 10(a-c) Illustration of features F1, F2, F3 and F4

Feature F11 is the number of segmentation-point clusters. Often a component yields multiple, closely spaced segmentation points on the Matra region (the selection of the Matra and of segmentation points is discussed in sections 3.2.2 and 3.2.3). Using 8-way connectivity, we have identified clusters of such segmentation points, and the number of such clusters is taken as the feature value. The larger the number of clusters, the higher the chance that the component is classified as SF class. In the previous work [3], the number of segmentation points was used as the feature. But a larger number of segmentation points does not always imply that the component needs to be segmented further. This is illustrated in Fig. 12(a-b).

To compute feature F11, the potential segmentation points in the region R1-R3 of the connected components must be determined first. The technique for finding potential segmentation points is discussed in sections 3.2.2 and 3.2.3.


Fig. 11(a-c) Illustration of feature F5, F6 and F7

3.2.2 Determination of Matra Pixels using a Fuzzy

Membership Function and Horizontalness Feature for SF

components

The boundary between the set of Matra pixels and the set of non-Matra pixels in the region R1-R3 is not distinct in practice. The black pixels lying on the line R2 have the strongest membership in the set of Matra pixels. As pixels lie farther from the line R2 on either side, their degree of belongingness to the set diminishes, as shown in Fig. 13, through a membership function μMATRA(x). The exact expression of μMATRA(x) is shown below.

Here c = R2 denotes the center of the function, shown in Fig. 13, and 'a' and 'b' are parameters of the equation. The value of 'a' is chosen as |R1-R2|/2 on the upper side of R2 and as |R2-R3|/2 on the lower side of R2. The value of 'b' is chosen as 1.
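The parameters named above (a center c, a width a and an exponent b) match the generalized bell-shaped membership function commonly used in fuzzy logic. Assuming that this is the form intended here, which is an assumption and not confirmed by the surviving text, the membership function would read:

```latex
\mu_{\mathrm{MATRA}}(x) = \frac{1}{1 + \left|\dfrac{x - c}{a}\right|^{2b}},
\qquad c = R2,\quad b = 1,\quad
a = \begin{cases}
  |R1 - R2|/2, & x \text{ above } R2,\\[2pt]
  |R2 - R3|/2, & x \text{ below } R2.
\end{cases}
```

Under this form the membership equals 1 on the row R2 and falls to 0.5 at a distance a from it, which agrees with the description of a membership that is strongest on R2 and diminishes on both sides.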

To identify Matra pixels in the region R1-R3, all run lengths of black pixels along each row in the region are computed. Averaging all these run lengths gives the mean run length of black pixels in the region, which can be considered the mean horizontalness of all black horizontal line segments in the region. Fig. 14(a-b) shows three word images and the continuous horizontal line segments of black pixels whose lengths exceed the mean horizontalness of the respective words. Candidates for Matra pixels are selected from such line segments.

To finally determine whether a black pixel in the region R1-R3 is a Matra pixel, the product of the horizontalness [2] of the black line segment on which the pixel lies and its μMATRA value is computed. If the product exceeds the average of all such products over all black pixels in the region R1-R3, the pixel is finally considered a Matra pixel. All such Matra pixels constitute the Matra region.
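The final decision rule for Matra pixels can be sketched as follows. Several points are assumptions of this sketch: the membership function is taken to be the generalized bell function, the width parameter is floored at 0.5 to avoid a division by zero when R2 coincides with R1 or R3, and only the final product rule is implemented (the earlier candidate filtering by mean run length is omitted). All function names are hypothetical.

```python
def bell(x, a, b, c):
    """Generalized bell membership function (assumed form of the MATRA function)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def run_lengths(row):
    """For each black pixel (1), the length of the horizontal run containing it."""
    out, c = [0] * len(row), 0
    while c < len(row):
        if row[c]:
            start = c
            while c < len(row) and row[c]:
                c += 1
            out[start:c] = [c - start] * (c - start)
        else:
            c += 1
    return out


def matra_pixels(img, r1, r2, r3):
    """Pixels in rows r1..r3 whose horizontalness * membership exceeds the mean."""
    scores = {}
    for r in range(r1, r3 + 1):
        a = abs(r1 - r2) / 2 if r < r2 else abs(r2 - r3) / 2
        mu = bell(r, max(a, 0.5), 1, r2)  # 0.5 floor is an assumption
        for c, h in enumerate(run_lengths(img[r])):
            if h:
                scores[(r, c)] = h * mu
    if not scores:
        return set()
    mean = sum(scores.values()) / len(scores)
    return {p for p, s in scores.items() if s > mean}


img = [
    [0, 0, 1, 0, 0, 0],  # R1
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # R2: headline
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],  # R3
]
print(sorted(matra_pixels(img, 0, 2, 4)))
```

In this toy image only the six pixels of the headline row pass the test, since short vertical strokes have horizontalness 1 and lie away from R2.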

Fig. 12(a-b) More segmentation clusters in (a) but only one segmentation cluster in (b)


Fig. 13 The membership function for the set of Matra pixels

(a) Three sample word images

(b) Consecutive black pixels, in the sample images, whose horizontalness exceeds the mean horizontalness form continuous lines highlighted with darker shading

Fig.14 Illustration of horizontalness (h) feature


3.2.3 Determination of Potential Segmentation Points using

Two Fuzzy Membership Functions for SF components

Potential segmentation points are Matra pixels across which the segment is to be fragmented if it falls in the SF category. They remain candidates for segmentation points until the classification of the segment into the SF class is completed.

Potential segmentation points usually lie at column positions along which the black pixel count is low. The lower the black pixel count along the column position of a Matra pixel, the higher the degree of belongingness of the pixel to the class of potential segmentation points. To model this, a membership function μ1, shown in Fig. 15(a), is introduced here. The equation of this function is defined for x≥0. The values of its parameters a, b and c are chosen as follows: c=0, b=1 and a=WM, where WM is the maximum vertical width of the Matra region defined in Section 3.2.2.

To ascertain whether a Matra pixel is a potential segmentation point, its distance from the line R2 is also considered. The smaller the distance, the higher its degree of belongingness to the set of potential segmentation points; ideally, the pixel would lie on R2. On this basis, another membership function μ2 is introduced here. The function is shown in Fig. 15(b). The values of the parameters of μ2 are the same as those of μMATRA. To decide whether a Matra pixel in the region R1-R3 is a potential segmentation point, the average of its μ1 and μ2 values is computed. If the average exceeds the mean of the averages over all the Matra pixels in the region R1-R3, the pixel is considered a potential segmentation point. Feature F11 here represents the total number of all such pixels identified as potential segmentation points in the region R1-R3.
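The selection of potential segmentation points can be sketched as below. This sketch rests on several assumptions that the text leaves open: both membership functions are taken to be generalized bell functions, the column pixel count is taken over the whole image passed in, WM is computed as the largest number of Matra pixels found in any single column, and the width parameter of μ2 is floored at 0.5. The function names are hypothetical.

```python
def bell(x, a, b, c):
    """Generalized bell membership function (assumed form of mu1 and mu2)."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def potential_points(img, matra, r1, r2, r3):
    """Matra pixels whose mean of mu1 and mu2 exceeds the mean over all Matra pixels.

    img: list of rows, 1 = black; matra: set of (row, col) Matra pixels.
    """
    if not matra:
        return set()
    cols = {c for _, c in matra}
    col_count = {c: sum(row[c] for row in img) for c in cols}
    # WM: assumed to be the tallest stack of Matra pixels in any column.
    wm = max(sum(1 for _, cc in matra if cc == c) for c in cols)
    score = {}
    for r, c in matra:
        mu1 = bell(col_count[c], wm, 1, 0)  # low column count -> high mu1
        a = abs(r1 - r2) / 2 if r < r2 else abs(r2 - r3) / 2
        mu2 = bell(r, max(a, 0.5), 1, r2)   # close to R2 -> high mu2
        score[(r, c)] = (mu1 + mu2) / 2
    mean = sum(score.values()) / len(score)
    return {p for p, s in score.items() if s > mean}


img = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],  # R2: headline, all Matra pixels
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
]
matra = {(2, c) for c in range(6)}
print(sorted(potential_points(img, matra, 0, 2, 4)))
```

In the toy image, the headline pixels in columns with no stroke above or below them are selected, because those columns have the lowest black pixel count.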


Fig. 15 (a-b) The membership functions for determination of potential segmentation points

3.2.4 Identification of Actual Segmentation Points in the SF

Components

In determining the actual segmentation points for SF components, there is always a trade-off between under-segmentation and over-segmentation of word images. In the current work, we have attempted to balance the two with minimum loss of information. Segmentation also becomes more difficult in the presence of ascendants in the upper zone of the word component. For this reason, we have designed further algorithmic steps to identify a single column for segmentation on the Matra region. The methodology is described as follows.

Often the fuzzy features generate multiple neighboring segmentation points along the Matra region of an SF component. We have identified clusters of segmentation points using the 8-neighbor CCL algorithm, as described for the selection of feature F11 in section 3.2.1.2. It may be observed from Fig. 12(a-b) that the actual segmentation should not involve all the potential segmentation points in a cluster, but should focus only on the pixels that optimally separate the connected parts into different characters (or their sub-parts) of the word image.

Selecting points that accurately segment the word components (sub-parts) into their constituent characters or sub-parts is a challenging issue. If such points are selected poorly, over-segmentation may occur during the segmentation process. As a result, characters or their sub-parts may be internally broken, leading to loss of information.

In light of the above, in the current work we select the actual (more accurate) segmentation points from the segmentation points generated in each segmentation cluster. Two primary decisions have to be taken for this purpose: first, the selection of the row boundaries for segmentation along specific columns on the Matra region and, second, the identification of the segmentation column in each segmentation cluster. The algorithmic steps involved in this process are given below:

1. Check whether the word component under consideration has an ascendant. This is done by estimating the height of the upper zone of the component: if the height of this zone exceeds an adaptively tuned threshold value (0.2*(R4 - R2)), the component is taken to have an ascendant part in the upper zone.

2. In either case, the generated potential segmentation points are labeled using the 8-way CCL algorithm, so that each cluster of segmentation points is labeled uniquely. For each cluster, the following technique is applied to determine the segmentation column along which the word component under consideration is segmented:

A) If there is no ascendant in the word component under consideration, calculate the sum of the number of data pixels, Matra pixels and segmentation-point pixels for each column in the region from R1 to (R3 - R2)/2. Otherwise, calculate the same sum for each column in the region from (R2 - R1)/2 to (R2 + (R3 - R2)/2).

B) Within the estimated region (row boundaries), select as the segmentation column the column that has the minimum sum, as calculated in step A.
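The steps above can be sketched in Python as follows. This is only a minimal illustration, not the implementation used in this work: the function names, the 0/1 list-of-lists image representation and the leftmost-column tie-break are assumptions, and the row boundaries are passed in explicitly as `row_top` and `row_bottom` instead of being derived from R1 to R4.

```python
def has_ascendant(upper_zone_height, r2, r4):
    """Step 1: the component has an ascendant if its upper-zone height
    exceeds the threshold 0.2 * (R4 - R2)."""
    return upper_zone_height > 0.2 * (r4 - r2)


def pick_segmentation_column(data, matra, seg_points,
                             cluster_columns, row_top, row_bottom):
    """Steps 2A and 2B: return the column of one segmentation-cluster
    with the smallest pixel sum.

    data, matra and seg_points are same-sized 2-D lists of 0/1 flags
    (1 = the pixel belongs to that class). row_top and row_bottom are
    the inclusive row boundaries chosen in step A (hypothetical
    parameters of this sketch).
    """
    def column_sum(col):
        return sum(
            data[r][col] + matra[r][col] + seg_points[r][col]
            for r in range(row_top, row_bottom + 1)
        )

    # Assumption: on ties the leftmost column of the cluster is taken.
    return min(sorted(cluster_columns), key=column_sum)


# Toy example: 4 rows x 5 columns, one cluster covering columns 1..3.
data = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
]
matra = [
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
seg_points = [
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
best = pick_segmentation_column(data, matra, seg_points, [1, 2, 3], 0, 2)
print(best)  # column 2 has the fewest pixels in rows 0..2
```

In this toy example the column sums over rows 0 to 2 are 4, 3 and 4 for columns 1, 2 and 3, so column 2 is selected.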

Once the word components are segmented into their constituent characters or sub-parts, the 8-way CCL algorithm is applied again to separate each such word-component. Finally, the segmented components are considered for recognition as meaningful character codes.
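For illustration, 8-way connected component labelling can be sketched as below. This is a generic breadth-first formulation and not necessarily the CCL variant used in this work; the function name and the convention that foreground pixels are marked 1 in the input are assumptions of the sketch.

```python
from collections import deque


def label_components_8way(image):
    """Label foreground pixels (value 1) using 8-connectivity.

    Returns (labels, count): labels holds 0 for background and
    1..count for the connected components, numbered in raster order.
    """
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or labels[r][c]:
                continue
            # Start a new component and flood it breadth-first.
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


image = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
labels, count = label_components_8way(image)
print(count)  # 3: the diagonal pair, the vertical pair, the single pixel
```

The two pixels touching only diagonally in the top-left corner end up in the same component, which is the difference between 8-way and 4-way labelling.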

3.3 Preparation of Ground-truthed images

After scanning, the documents were binarized by a global thresholding technique, where the threshold was chosen as the mean of the maximum and minimum gray level values in each document image. All the binarized images were archived in DAT format, where the foreground and background pixels were represented as 0 and 1 respectively. The documents were then preprocessed to remove the remaining noise, such as salt and pepper noise and long lines in the border zone(s). Next, the segmentation technique described in section 3.2 was applied to the word images, which produces the colored images. Since not all characters are segmented properly by the segmentation technique, error detection and correction is required once the segmented components are obtained. The errors were detected and corrected manually, and the character and component level ground-truthed database was prepared. The basic steps of this work are shown in fig. 16.
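The binarization rule described above can be sketched as follows. The function name and the list-of-lists image representation are assumptions of this sketch, as is the convention that dark ink on light paper puts the foreground at or below the threshold.

```python
def binarize(gray):
    """Global thresholding with threshold = (max + min) / 2.

    gray is a 2-D list of gray level values. Returns (binary, threshold)
    where foreground pixels are 0 and background pixels are 1.
    """
    flat = [value for row in gray for value in row]
    threshold = (max(flat) + min(flat)) / 2
    # Assumption: values at or below the threshold are ink (foreground).
    binary = [[0 if value <= threshold else 1 for value in row]
              for row in gray]
    return binary, threshold


gray = [
    [250, 40, 245],
    [30, 35, 240],
    [255, 20, 250],
]
binary, threshold = binarize(gray)
print(threshold)  # 137.5
print(binary)     # [[1, 0, 1], [0, 0, 1], [1, 0, 1]]
```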

Using the segmentation technique described in section 3.2, we get isolated characters or their sub-parts. In this stage, the components are reassembled into a word image according to their positions in the original word image. Each component is assigned a distinct color, and consequently we get a colored image showing the segmentation effect, as shown in table 2. However, not all of these word images are segmented properly, so the errors have to be detected and corrected manually in order to prepare the ground truth images.

Fig. 16 Basic steps of generating Ground-truthed images

The preparation of ground-truthed images works at two levels. At the first level, each character in the word image is identified and separated from the others, and each component of a character, whether connected or disconnected, is given a different color. The first level's work is therefore called component level segmentation. At the second level, a character may have two or more sub-parts, either connected or disconnected, but they all carry the same color, as they are components of a single character. The second level's work is therefore called character level segmentation. For error detection and correction, the tool ‘Paint’ is used. ‘Paint’ reads word images with a white background. We can select any color from the color box and use it to recolor the characters that are not segmented properly in the word image, by selecting the intended segment point with the pencil. Using this technique, we can easily correct the errors made by our segmentation algorithm and generate ground truth data. A screenshot of the tool Paint with a word image is shown in fig. 17. The steps of this method are given below:

Steps:

1. Open a word image with the tool Paint.

2. Pick any color from the color box and then select the intended segment point with the pencil.

3. Fill the character that is to be segmented with the chosen color.

4. Repeat until all the characters are segmented.

5. Save the image and close the window.

Fig. 17 A screenshot of the tool Paint with a word image

Table 2: Some results after segmentation

Among all the digitized word sub-parts generated after connected component labelling, a decision is often required to identify only the components that need further segmentation, because word images contain many inherently segmented characters or their sub-parts (as shown in Fig. 19(a)). Thus, not all word-components require further segmentation. These components are classified into SF and DNS classes, as shown in Fig. 19. Segmenting DNS word components is an overhead for the isolation of character components and may also cause over-segmentation of word components. So, the selection of SF and DNS components not only minimizes the character isolation overhead but also reduces the probability of over-segmentation. For this purpose, we have developed a two stage selection for SF and DNS class components.

In table 2, figure 01 is segmented correctly by the work [2], so no change is required. In figure 02, however, two consecutive characters remain connected to each other after segmentation. This type of segmentation is called under-segmentation. These characters need to be separated by giving them two distinct colors, which is done by identifying the intended segment point. A character having two parts, one in the middle zone and another in the upper zone, also requires segmentation. This is shown step by step in fig. 18(a). At the second level, two sub-parts of a character, either connected or disconnected, have the same color, as they are parts of the same character. In table 2, fig. 03 has a character containing two or more colors after segmentation. Word images segmented in this way are called over-segmented. The character level and component level segmentation of the over-segmented word images is shown in fig. 18(b).
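The two error types described above can be detected from a pair of colored images, as in the following sketch. It is only an illustration under stated assumptions: the images are given as 2-D lists of color ids with 0 for the background, the ground truth is the character level image, and the function name `classify_segmentation` is hypothetical.

```python
def classify_segmentation(result, truth):
    """Flag under- and over-segmentation of one word image.

    result: color ids produced by the segmentation technique.
    truth:  color ids of the character level ground-truthed image.
    Both are same-sized 2-D lists; 0 marks background pixels.
    """
    truth_ids_per_result = {}
    result_ids_per_truth = {}
    for result_row, truth_row in zip(result, truth):
        for res, tru in zip(result_row, truth_row):
            if res == 0 or tru == 0:
                continue
            truth_ids_per_result.setdefault(res, set()).add(tru)
            result_ids_per_truth.setdefault(tru, set()).add(res)
    # One result color covering several characters: under-segmentation.
    under = any(len(ids) > 1 for ids in truth_ids_per_result.values())
    # One character carrying several result colors: over-segmentation.
    over = any(len(ids) > 1 for ids in result_ids_per_truth.values())
    return {"under_segmented": under, "over_segmented": over}


truth = [[1, 1, 0, 2, 2]]          # two characters
merged = [[5, 5, 0, 5, 5]]         # both characters share one color
split = [[3, 4, 0, 6, 6]]          # first character has two colors
correct = [[7, 7, 0, 9, 9]]        # same grouping, different colors

print(classify_segmentation(merged, truth))
print(classify_segmentation(split, truth))
print(classify_segmentation(correct, truth))
```

Only the grouping of pixels matters here, so the actual colors chosen in Paint do not affect the outcome.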

By collecting the component level and character level ground-truthed word images, we create a database, which is shown in table 3. The results after segmentation are compared automatically with the corresponding ground-truthed images by the proposed tool to obtain the success rate.
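How the success rate is computed is not detailed here, so the following is only one possible sketch. It assumes that a word counts as a success when its segmented image groups the pixels exactly as the ground-truthed image does, irrespective of the colors used; the function names and the percentage output are assumptions.

```python
def canonical(labels):
    """Renumber non-zero color ids by order of first appearance in
    raster order, so that the choice of colors does not matter."""
    mapping = {}
    renumbered = []
    for row in labels:
        new_row = []
        for value in row:
            if value == 0:
                new_row.append(0)
            else:
                new_row.append(mapping.setdefault(value, len(mapping) + 1))
        renumbered.append(new_row)
    return renumbered


def success_rate(pairs):
    """pairs: list of (result, truth) colored images as 2-D lists.
    Returns the percentage of words whose grouping matches."""
    if not pairs:
        return 0.0
    hits = sum(1 for result, truth in pairs
               if canonical(result) == canonical(truth))
    return 100.0 * hits / len(pairs)


pairs = [
    ([[5, 5, 0, 8]], [[1, 1, 0, 2]]),  # same grouping
    ([[5, 5, 0, 5]], [[1, 1, 0, 2]]),  # under-segmented
    ([[3, 4, 0, 6]], [[1, 1, 0, 2]]),  # over-segmented
    ([[9, 9, 0, 2]], [[4, 4, 0, 7]]),  # same grouping
]
print(success_rate(pairs))  # 50.0
```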

Fig. 18 (a) Illustration of Component level and Character level segmentation of the under segmented word image (panels: Component level segmentation, Character level segmentation)

Fig. 18 (b) Illustration of Component level and Character level segmentation of the over segmented word image

Table 3: Character and component level ground-truthed database (columns: Sl. #, Original gray level Bangla word Images, Corresponding Character level Ground-truthed images, Corresponding Component level Ground-truthed images; entries 01 to 44)

Chapter 4

Conclusion

Character segmentation is a vital step for an OCR system because the higher the accuracy of segmentation, the lower the error in recognition. The work presented here provides a practical solution to the problem of how best word images of handwritten Bangla characters can be segmented into constituent characters. Moreover, the technique can segment words having a discontinuous or cursive Matra. It also optimizes the trade-off between under- and over-segmentation, as the Matra and the segmentation-point clusters are estimated correctly. As a result, better word segmentation accuracy is achieved with minimal data loss.

This character segmentation methodology could also be applied to other Matra-based scripts, viz., Devanagari, Gurmukhi etc. However, there is further scope for improvement of the present technique. An iterative implementation of the present technique, along with the existing segmentation algorithm, or the design of a more precise feature set for the MLP, may further improve the overall segmentation performance on handwritten Bangla word images in future. Varying the classifier, or combining the results of different classifiers, may also improve the present technique. The work as a whole can be considered a significant contribution towards the development of a future Optical Character Recognition (OCR) system for handwritten Bangla text documents.

In future, our aim is to increase the size of the database. The technique may also significantly reduce the cases of under-segmentation.

References

[1] “A fuzzy technique for segmentation of handwritten Bangla word images”, Subhadip Basu, Ram Sarkar, Nibaran Das, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu.

[2] “A two stage approach for segmentation of handwritten Bangla word images”, Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu.

[3] “An improved offline handwritten character segmentation algorithm for Bangla script”, Subhadip Basu, Nibaran Das, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu.

[4] “Development of a recognizer for Bangla text: present status and future challenges”, Saima Hossain, Nasreen Akter, Hasan Sarwar and Chowdhury Mofizur Rahman.

[5] “Character segmentation for handwritten Bangla words using artificial neural network”, T. K. Bhowmik, A. Roy, U. Roy.

[6] “Individual character segmentation from single stroke of Bangla online handwritten text”, Nilanjana Bhattacharya, Umapada Pal, Koushik Roy.

[7] “A database of unconstrained handwritten Bangla-English mixed script document image”, Ram Sarkar, Nibaran Das, Subhadip Basu, Mahantapas Kundu, Mita Nasipuri, Dipak Kumar Basu.

[8] “A script independent technique for extraction of characters from handwritten word images”, Ram Sarkar, Samir Malakar, Nibaran Das, Subhadip Basu, Mita Nasipuri.

[9] Digital Image Processing, http://en.wikipedia.org/wiki/Pixel

[10] Digital Image Processing, http://en.wikipedia.org/wiki/Grayscale

[11] http://www.ijcaonline.org/journal/number23/pxc387693.pdf

[12] http://www.mathworks.in/help/toolbox/images/f18-12508.html

[13] M. Maragoudisakis et al., “Improving handwritten character segmentation by incorporating Bayesian knowledge with support vector machines”, in Proc. ICASSP 2002, vol. 4, pp. IV-4174.

[14] R. M. Bozinovic et al., “Off-line Cursive Script Word Recognition”, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 11, pp. 68-83, 1989.

[15] “A complete handwritten numeral database of Bangla - a major Indic script”, B. B. Chaudhuri.