full report 13 dec 2010 - eprints.utm.my/id/eprint/19120/7/WongYeeLengMFSKSM2010.pdf

ABSTRACT

Writer identification based on Chinese handwriting is an active research topic in the field of pattern recognition and computer vision, and many innovative methods and approaches have recently been developed for it. Unlike the characters of Western alphabets such as English, German and French, some oriental scripts such as Korean, Arabic and Chinese have structural characteristics. Chinese characters in particular have a complex structure, because their numerous strokes are warped into cursive shapes and the character set is much larger. Hence, more features need to be generated prior to the classification phase for better identification, and these features need to be well represented. In this study, an improved discretization is implemented to transform the range of continuous quantitative values of a writer’s features into a number of appropriate intervals, each denoted by an integer label. Several experiments have been conducted with two types of datasets: pre-discretized and post-discretized. The post-discretized datasets are the extracted features after the discretization process, while the pre-discretized datasets are the original features obtained with the Direction-based Feature Extraction (DFE) technique. To assess identification performance reliably, 10-, 7- and 5-fold cross-validation (CV) has been tested on both datasets. The experiments show that the best overall results are obtained with discretized data, with identification accuracy above 94.0%, compared with below 50.0% for pre-discretized data. It can be concluded that the discretization process represents the writers’ features efficiently and yields higher identification rates for better forensic document analysis.

ABSTRAK

Identification of Chinese handwriting is an interesting research thrust in the field of pattern recognition and computer vision. Many recent innovative methods and approaches have been developed by researchers for the purpose of identifying writers. Unlike Western characters such as English, German and French, most oriental characters such as Korean, Arabic and Chinese have structural characteristics. These structural characteristics, especially the structure of Chinese characters, are complex because various looped shapes form many cursive representations. Therefore, many features need to be generated before classification to obtain good identification. However, the features need to be represented well for quality identification. This study therefore implements an improved discretization by transforming the range of continuous quantitative values of the writer's features into appropriate intervals; this is known as integer labelling. Various experiments were conducted using two different datasets, namely pre-discretized data and post-discretized data. The post-discretized dataset is the data whose features have been extracted after the discretization process, whereas the pre-discretized dataset is the original features extracted using the Direction-based Feature Extraction (DFE) method. To obtain good identification performance, 10-, 7- and 5-fold cross-validation was used to test both types of datasets. The findings show that almost all of the best results were produced by the post-discretized dataset, with accuracy exceeding 94%, compared with the pre-discretized dataset, with accuracy below 50%. It can therefore be concluded that the discretization process is very effective in representing the writer's features to obtain high identification rates in analysing forensic documents.

TABLE OF CONTENTS

CHAPTER TITLE

DECLARATION
DEDICATION
ACKNOWLEDGEMENT
ABSTRACT
ABSTRAK
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

1 INTRODUCTION
1.1 Overview
1.2 Problem Background
1.3 Problem Statement
1.4 Dissertation Aim
1.5 Objectives
1.6 Dissertation Scope
1.7 Significance of the Dissertation
1.8 Organization of the Dissertation

2 LITERATURE REVIEW
2.1 Introduction
2.2 Overview of Pattern Recognition
2.2.1 Pattern Recognition Hierarchy (Recognition, Identification and Verification)
2.2.2 Identification based handwriting in Pattern Recognition
2.3 Overview of Chinese Handwriting
2.3.1 Writer Identification on Multiple Languages
2.3.2 Writer Identification on Chinese Handwriting
2.4 Process in Chinese Handwritten Identification
2.4.1 Pre-processing
2.4.1.1 Normalisation
2.4.1.2 Thinning
2.4.2 Feature Extraction on Handwritten Character
2.4.3 Discretization Process on Chinese Handwritten Data
2.4.3.1 Previous Discretization Method
2.4.3.2 Benefits of Discretization Method
2.4.4 Writer Identification
2.4.4.1 Classification based on Soft Computing Approaches
2.4.4.1.1 Rough Set Theory
2.5 Summary

3 RESEARCH METHODOLOGY
3.1 Introduction
3.2 Research Framework
3.3 Pre-Processing Phase
3.3.1 Chinese Handwritten Datasets
3.3.2 Defining Chinese Handwritten Data Primitives
3.3.3 Normalisation on Chinese handwriting
3.3.4 Feature Extraction
3.3.4.1 Directional based Feature Extraction technique
3.4 Discretization Phase
3.4.1 Invariant Discretization Method
3.5 Identification Phase
3.5.1 Rough set classification
3.5.2 Performance Measurement
3.5.2.1 Confusion Matrix
3.5.2.2 Cross Validation
3.6 Summary

4 EXPERIMENTAL RESULTS
4.1 Introduction
4.2 Chinese Handwritten datasets
4.3 Pre-processing Phase
4.3.1 Standard size and Elimination of Background’s Noise
4.3.2 Binarization
4.3.3 Feature Extraction
4.4 Discretization Phase
4.4.1 Discretization Procedure
4.5 Training Process
4.6 Testing Process and Classification for Chinese Writer Identification
4.7 Experimental results of Pre-Discretize and Post-Discretize on Chinese datasets
4.8 Summary

5 CONCLUSION
5.1 Introduction
5.2 Dissertation Contributions
5.3 Future Works
5.4 Summary

REFERENCES

LIST OF TABLES

TABLE NO. TITLE

1.1 Advantages and disadvantages of feature extraction and identification methods
2.1 Hierarchy in Pattern Recognition
2.2 Various methods on multiple languages
2.3 Offline and online writer identification on Chinese handwriting
2.4 Feature extraction techniques for identification process
2.5 Categories of Discretization
2.6 Various Discretization methods
2.7 Techniques based on Soft computing approaches
3.1 Information for writer identification
4.1 Feature vectors from 12 samples of handwriting
4.2 Reduction process on both datasets classification
4.3 Comparisons of identification rates with different training and testing datasets
4.4 Cross Validation experiment settings for both types of datasets
4.5 Comparisons of identification rates using 70% training and 30% testing data with 10, 7 and 5-fold Cross Validation (CV)
4.6 Comparisons of identification rates using 60% training and 40% testing data with 10, 7 and 5-fold Cross Validation (CV) on both datasets

LIST OF FIGURES

FIGURE NO. TITLE

2.1 Basic framework of Pattern Recognition System (Selim Aksoy, 2010)
2.2 Writer identification framework (Saphin Gupta, 2008)
2.3 Writer verification framework (Saphin Gupta, 2008)
2.4 Same Chinese character ( ) with different number of strokes (Fang Hsuan Cheng, 1997) (a) 8 strokes character (b) 6 strokes character
2.5 Examples of similar Chinese character with different meaning (Po Hsien Wu, 2003)
2.6 Examples of new created word and its meaning
2.7 (a) Four handwritings from two writers (b) Grid microstructure features on handwritings (c) Differences measurement between two handwritings (Xin Li and Xiaoqing Ding, 2009)
2.8 An example of the original image and the normalisation result of the Chinese handwritten text (Yuchen et al., 2009) (a) before normalisation (b) after normalisation
2.9 An example of the original image and the thinning result of the Chinese character (Fang Hsuan Cheng, 1997) (a) before thinning (b) after thinning
2.10 Rough Set Theory (Pawlak, 1982)
2.11 Approximation role in Rough Set Theory (Pawlak, 1982)
3.1 Discretization on writer identification based Chinese handwriting framework
3.2 Sample of Chinese handwritten text (Tonghua et al., 2006)
3.3 Example of normalized Chinese handwritten image (Yuchen et al., 2009) (a) before normalisation (b) after normalisation
3.4 Image of 9 equal sized zones of 1 Chinese character,
3.5 Starters are rounded in red colour
3.6 Intersections are rounded in red
3.7 Minor starters are rounded in red
3.8 Feature vectors of the zoned Chinese character image,
3.9 Traverse process in a Chinese character,
3.10 Statistic feature in ROSETTA Toolkit
3.11 Annotation feature in ROSETTA Toolkit
3.12 Confusion Matrix of pre-discretized datasets with Johnson’s classification
3.13 Confusion Matrix of post-discretized datasets with Johnson’s classification
3.14 Interface for 10 fold Cross Validation
3.15 Interface for command file and log file
3.16 Cross Validation log file
4.1 4 variation of Chinese handwritten characters from 3 writers
4.2 Image of 50x50 pixels effect of pre-processing algorithm (a) before pre-processing (b) after pre-processing
4.3 Binary value from the input character for writer 1
4.4 Pre-discretized Chinese datasets of 15 writers ready for ID process
4.5 Examples of Discretization process for writer 1 and writer 13
4.6 Post-discretized Chinese datasets for 15 writers
4.7 Identification rate for post-discretized datasets of writer 1
4.8 Confusion Matrix for post-discretized datasets of all writers
4.9 JohnsonReducer rules and its BatchClassifier command
4.10 CV script for Holte 1R, Genetic Algorithm and Exhaustive method
4.11 Visualization of divergence level between pre-discretized and post-discretized Chinese datasets using 70% training and 30% testing data
4.12 Visualization of divergence level between pre-discretized and post-discretized Chinese datasets using 60% training and 40% testing data

LIST OF ABBREVIATIONS

ACO Ant Colony Optimisation

ANN Artificial Neural Network

ASI Aspect Scale Invariant

ATM Automated Teller Machine

BMP Bit Map Picture

CV Cross Validation

DFE Directional based Feature Extraction

DCR Discretization Class Reduction

EIG Equal Information Gain

EIW Equal Interval Width

FFT Fast Fourier Transformation

GGD Generalized Gaussian Distribution

GM Geometrical Moments

GMM Gaussian Mixture Model

GSC Gradient, Structural, Concavity

GUI Graphical User Interface

HIT MW Harbin Institute of Technology, Multi Writer

HMM Hidden Markov Model

HPP Horizontal Projection Profile

ID Invariant Discretization

IDC Information Distance Criterion

LM Language Model

MAE Mean Absolute Error

ME Maximum Entropy

MSC Multimedia Super Corridor

OCR Optical Character Recognition

RFID Radio Frequency Identification

RNG Random Number Generator

SD Standard Deviation

SC Shape Codes

SVM Support Vector Machine

TSC Temporal Sequence Code

UMI United Moment Invariant

VPP Vertical Projection Profile

WPT Wavelet Packet Transform

CHAPTER 1

INTRODUCTION

1.1 Overview

Since the launch of the Multimedia Super Corridor (MSC) mission in Malaysia, studies in the field of computer vision and pattern recognition have shown a great amount of interest in biometric security and in information and communication technology. As a result, the computer system has become increasingly vital as a tool and medium for information and communication. Several focus areas have been identified under the MSC themes, including Digital Content, RFID Technology, Advanced Materials and Biometric Technology. Biometric technology has become an important research area nowadays. A biometric system is an established type of system built on pattern recognition technology. Its operation begins with the individual’s input, continues with feature extraction, and ends with a comparison between the extracted features and the model set stored in a database to obtain the result. This growth of interest has led to many biometric applications being analysed, developed and commercialised for security and crime identification purposes, including handwriting (Srihari and Ball, 2009; Guo, Christian and Alex, 2010), signature (Cheng, Beng, and Connie, 2009; Eusebiu Marcu, 2010), iris (Ramkumar et al., 2005; Kevin Bowyer, 2009), facial features (Thirimachos and Josef, 2009; Ehsan et al., 2010), fingerprints (Duncan et al., 2009) and others. This study concerns writer identification based on Chinese handwriting, an interesting topic in the area of pattern recognition.

Pattern recognition in handwriting is a wide-ranging term that covers many application fields, including identification based on handwriting (Guo, Christian and Alex, 2010), verification based on handwriting (Srihari and Ball, 2009), authentication (Muzaffar and Jurgen, 2009; Behzad and Mohsen, 2010) and character recognition (Tonghua et al., 2009). Each of these approaches has a different intention. Guo, Christian and Alex (2010) proposed character prototypes as model templates to support their alphabet knowledge base. Their approach supplies additional information to support the writer identification process while preserving the individuality of the handwriting style, and their experiments increased writer identification accuracy from 66% to 87%. Other studies focus on the early stages, such as pre-processing (including normalisation) and feature extraction, in order to attain better performance in recognition and identification tasks. It is vital to understand the history and complexity of the handwriting structure before it can be classified and categorized. Comprehensive reviews of research on writer identification for Chinese handwriting using various methods are given by LongZuo and Tieniu Tan (2002) and Cheng and En (2009).

In this study, the effectiveness of the discretization process for individual identification based on Chinese handwriting is investigated. The discretization used in the proposed framework throughout this dissertation is based on the Invariant Discretization proposed by Azah Kamilah Muda, Siti Mariyam Shamsuddin and Maslina Darus (2008). According to the authors, the Mean Absolute Error (MAE) is used to describe the authorship invariance of the handwriting, which is then combined with the proposed Invariant Discretization to organize the obtained features into the related writer class in order to improve identification performance.
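As a rough illustration of the MAE measure mentioned above, the following minimal Python sketch computes the mean absolute error between two feature vectors. The function name and the feature values are hypothetical and chosen only for illustration; they are not taken from the dissertation or from Azah Kamilah Muda et al. (2008), and the sketch does not reproduce the Invariant Discretization itself.

```python
def mean_absolute_error(reference, other):
    """Average absolute difference between two equal-length feature vectors."""
    if len(reference) != len(other):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(r - o) for r, o in zip(reference, other)) / len(reference)


# Hypothetical feature vectors (made-up numbers, not experimental data).
writer_a_sample_1 = [0.20, 0.50, 0.90]
writer_a_sample_2 = [0.25, 0.45, 0.95]
writer_b_sample_1 = [0.70, 0.10, 0.40]

# Same writer, two samples: a small value suggests the features are stable.
intra_writer = mean_absolute_error(writer_a_sample_1, writer_a_sample_2)
# Two different writers: a larger value suggests the features separate them.
inter_writer = mean_absolute_error(writer_a_sample_1, writer_b_sample_1)

print(f"intra-writer MAE: {intra_writer:.3f}")
print(f"inter-writer MAE: {inter_writer:.3f}")
```

In this toy case the MAE between two samples of the same writer is smaller than the MAE between samples of different writers, which is the kind of behaviour an invariance measure is expected to show.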

1.2 Problem Background

Identification based on Chinese handwriting is an interesting research area in the field of pattern recognition and computer vision. Recently, many innovative methods and approaches have been developed for writer identification, using dynamic features on strokes (Bangy Li and Tieniu Tan, 2009), fusion of dynamic and static features (Wenfeng Jin, Yunhong Wang and Tieniu Tan, 2005), and textual analysis (JunFeng and Xu Gao, 2009). These approaches manage to overcome the complexity of Chinese characters and make the identification task easier, but each has its own advantages and disadvantages. Table 1.1 summarizes the advantages and disadvantages of the proposed identification methods. Some background is also useful for early understanding, such as the development of Hidden Markov Model (HMM) based approaches for Chinese character recognition (Kim and Govindaraju, 1997; Tonghua et al., 2009). These approaches are well proven and established, but some weaknesses remain. They often require large amounts of data in the computation process, and they are complicated because a classifier has to be developed for each character when dealing with a large character set. Yu Chen et al. (2009) pointed out three typical kinds of randomness that occur in Chinese handwriting during feature extraction: the writing style, the content and the position of the Chinese characters. Their proposed method handled this randomness successfully, achieving accuracies of 98% and 95% on databases of 100 and 500 persons respectively. Unfortunately, the method incurs higher computational costs. Nafiz and Fatos (2001) provide a summary of character recognition prior to the identification task. The authors discuss the status of current character recognition and its weaknesses, and offer suggestions that are helpful for new research in the related field. Unlike the characters of Western alphabets such as English, German and French, some oriental scripts such as Korean, Arabic and Chinese have structural characteristics.

Table 1.1: Advantages and disadvantages of feature extraction and identification methods

| Authors | Approaches | Pros | Cons | Result |
|---|---|---|---|---|
| Tonghua et al. (2009) | Segmentation-free strategy based on HMM | State-of-the-art strategy | Depends on CPU time, memory and the experience of the designer | Confidence of more than 99% |
| YuChen et al. (2009) | Spectral feature extraction method based on Fast Fourier Transformation | Minimizes randomness in Chinese characters; feasible for large volumes of data; successfully commercialised | Higher computation costs | Handwriting samples: 100 persons = 98%; 500 persons = 95% |
| WenFeng Jin et al. (2009) | Sum Rule, Weighted Sum Rule and User Specific Sum Rule applied to combine dynamic and static features | Minimizes the complexity in Chinese characters | Focuses on 12 primary Chinese strokes; more training data needed | Dynamic + static features give better identification accuracy than dynamic features alone |
| Bangy Li and Tieniu Tan (2009) | Temporal Sequence Code (TSC) and Shape Codes (SC) | Requires a small number of characters; works effectively for English and Chinese characters | Depends on the emotions and physical state of writers | Chinese template accuracy > 90%; English template accuracy > 95% |
| He and Tan (2004) | Textual analysis approaches using Gabor filters and autocorrelation function | Applicable to both text-dependent and text-independent Chinese handwriting | Long calculation time; high computation cost | Top 1 = 90%; Top 3 = 96%; Top 5 = 100%; Top 10 = 100% |
| Kim et al. (1997) | Hidden Markov Model | Works well with a variety of cursive strokes; uses little memory | Large data (18,000 characters); time complexity, not applicable to real-time applications; depends on input pattern | Accuracy rate of 90.3% at a speed of 1.83 s per character |

Chinese is known to be difficult because it is a large-alphabet language with a wide variety of character categories. In addition, Chinese handwriting involves many types of stroke features, such as dynamic strokes, sub-strokes and others. A significant method for the dynamic stroke problem in Chinese characters was proposed by Bangy Li et al. (2007). Unfortunately, this method is only useful for online, text-independent identification.

Motivated by the common issues in handling the varied stroke feature data of Chinese handwriting, we were attracted by the discretization approaches described by Alexis et al. (2009) and Fabrice and Ricco (2005). According to the authors, these approaches can handle complex, continuous attributes and are applied before classification. Thus, in this study, the discretization concept is adopted in the proposed identification framework to seek an improvement in writer identification accuracy. The identification results are used to show the effectiveness of discretization on Chinese handwriting. The discretization conducted in this study is based on the method proposed by Azah Kamilah, Siti Mariyam and Maslina Darus (2008), who proposed an Invariant Discretization method for handwriting identification. Their results show an accuracy of 99.9% for writer identification using post-discretized data rather than pre-discretized data.

1.3 Problem Statement

Discretization processes can be classified as global versus local, supervised versus unsupervised, and static versus dynamic. Supervised methods need class label information, while unsupervised methods do not. Gennady Agre and Stanimir (2002) compared supervised and unsupervised discretization methods for continuous attributes. Their results showed that both kinds of method significantly improved classification accuracy. This indicates that discretization is an important pre-processing step for machine learning and data mining applications.
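The difference between supervised and unsupervised discretization can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the supervised function picks a single cut point by minimising weighted class entropy (a common textbook criterion, not necessarily one used by the cited authors), the unsupervised function simply halves the value range, and all names and data values are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def best_supervised_cut(values, labels):
    """Supervised: choose the cut that minimises the weighted class entropy of both sides."""
    pairs = sorted(zip(values, labels))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        left_value, right_value = pairs[i - 1][0], pairs[i][0]
        if left_value == right_value:
            continue  # no boundary exists between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = (left_value + right_value) / 2, score
    return best_cut


def equal_width_cut(values):
    """Unsupervised: split the value range in the middle, ignoring class labels."""
    return (min(values) + max(values)) / 2


# Hypothetical feature values for two writers (made-up numbers).
values = [0.1, 0.2, 0.3, 0.8, 0.9, 4.0]
labels = ["w1", "w1", "w1", "w2", "w2", "w2"]

supervised = best_supervised_cut(values, labels)
unsupervised = equal_width_cut(values)
print(f"supervised cut: {supervised:.2f}, unsupervised cut: {unsupervised:.2f}")
```

On this toy data the supervised cut falls between the two writers, while the equal-width cut is pulled towards the outlying value because it never looks at the labels.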

As described in the problem background, many efforts have been made at either the pre-processing or the post-processing stage in order to achieve the best classification performance. Chinese is a large-alphabet language whose characters have structural characteristics and are constructed from many stroke features (Fang Hsuan Cheng, 1997; Tieniu Tan et al., 2000). This creates a large amount of feature data, which slows down the computation and leads to misclassification during learning and classification. Misclassification is often caused by overlapping feature spaces, which is typical of handwriting and of noisy data.

In practice, too much pre-processing at an early stage, such as normalisation to remove noise from the character in order to refine classification, would almost certainly trade off important individual characteristics of the writer’s handwriting (Andreas and Horst, 2005) and hence lead to poor classification.

These issues can significantly affect the overall performance of a handwriting identification task in terms of both its identification capability and its competency. Thus, discovering a set of decision rules in datasets is one of the most important tasks in machine learning and data mining. For instance, if a raw dataset is large and contains continuous values, conditions need to be created through appropriate decision rules to find the threshold values or specific ranges within the dataset that represent the objects. In other words, a better data representation is created. This process is called discretization.
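To make the idea of thresholds and integer labels concrete, the following minimal Python sketch maps continuous feature values to interval labels. It assumes equal-width intervals purely for illustration; the Invariant Discretization used in this dissertation chooses its intervals differently, and the function names and feature values here are hypothetical.

```python
from bisect import bisect_right


def equal_width_cuts(values, n_bins):
    """Return the n_bins - 1 interior thresholds that split the value range evenly."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    return [low + width * k for k in range(1, n_bins)]


def discretize(values, cuts):
    """Replace each continuous value by the integer label of the interval it falls in."""
    return [bisect_right(cuts, v) for v in values]


# Hypothetical continuous feature values (made-up numbers).
features = [0.05, 0.40, 0.55, 0.90, 1.00, 0.00]

cuts = equal_width_cuts(features, 4)   # thresholds between the four intervals
labels = discretize(features, cuts)    # one integer label per feature value
print(cuts)
print(labels)
```

Each feature value is now represented by a small integer instead of a real number, which is the representation a rule-based classifier can work with directly.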

As such, discretization based on the idea proposed by Azah Kamilah et al. (2008) is added to the writer identification framework for Chinese handwriting in this dissertation. The proposed Invariant Discretization is chosen as the basic structure to assist the classification phase because their experiments demonstrated 99.9% identification accuracy with discretized data as opposed to non-discretized data.

Hence, the hypothesis of this study can be stated as follows:

The discretization process would enhance the performance of individual identification on Chinese handwriting.

1.4 Dissertation Aim

The aim of this study is to investigate the impact of Discretization process on

Chinese handwriting for individual identification.

1.5 Objectives

There are four main objectives to be achieved in this study:

1. To propose a new framework of the discretization process for Chinese writer identification.

2. To develop Azah’s Invariant Discretization algorithm on Chinese characters.

3. To evaluate the effectiveness of the discretization process on Chinese handwriting.

4. To compare post-discretized Chinese handwriting data, produced with Azah’s Invariant Discretization method, against pre-discretized data for Chinese handwriting identification.

1.6 Dissertation Scope

Several concerns were considered before defining the scope of this study. Common problems in Chinese handwriting, including the large number of character features and the complicated structure, would probably take a very long time to address if started from scratch. Because this dissertation focuses mainly on the discretization process, only a small amount of pre-processing is performed on the Chinese handwriting. Normalisation is applied here only to standardize all character images to an equal size and to enhance image quality for clearer depiction. Overall, the scope of the study covers the following:

1. The HIT MW handwritten Chinese database is chosen as the experimental database.

2. MATLAB is used for simulation and visualization.

3. Programs are developed in Visual C++.

4. The ROSETTA Toolkit is used for classification.

1.7 Significance of the Dissertation

This study provides some basic ideas and can be treated as a benchmark for researchers or practitioners who are interested in finding an alternative approach to individual identification based on handwriting. Moreover, this study could form part of research work on biometric and natural approaches to personal authentication, such as the work of Eusebiu Marcu (2010), since accurate personal identification could help prevent crime and forgery.

1.8 Organization of the Dissertation

The dissertation is organized as follows:

1. Introduction

The first chapter gives a general overview of the dissertation, including an introduction to pattern recognition technology and Chinese handwriting identification, the common issues that arise in Chinese handwriting, and the scope, objectives and significance of the study.

2. Literature Review

The second chapter discusses the background and history of the related work in more detail.

3. Research Methodology

In this chapter, the writer identification framework is constructed. The relevant data and the methods and tools best suited to the work are briefly described.

4. Experimental Results

Chapter 4 presents the experimental results after pre-processing, feature extraction and classification. The results obtained are analysed, compared and evaluated.

5. Conclusion

This chapter summarizes the whole study. The contributions of the dissertation and future work are also included.
