
Information Sciences 180 (2010) 301–311

Contents lists available at ScienceDirect

Information Sciences

journal homepage: www.elsevier.com/locate/ins

Cascade Markov random fields for stroke extraction of Chinese characters

Jia Zeng a,*, Wei Feng b, Lei Xie c, Zhi-Qiang Liu d

a Department of Computer Science, Hong Kong Baptist University, Kowloon Tong, Hong Kong

b Department of Computer Science and Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong

c School of Computer Science, Northwestern Polytechnical University, Xi’an, PR China

d School of Creative Media, City University of Hong Kong, Tat Chee Avenue 83, Hong Kong

Article info

Article history: Received 11 October 2008; Received in revised form 7 June 2009; Accepted 18 September 2009

Keywords: Stroke extraction; Cursive Chinese characters; Cascade Markov random fields; Bottom-up/top-down

0020-0255/$ - see front matter © 2009 Elsevier Inc. All rights reserved. doi:10.1016/j.ins.2009.09.011

* Corresponding author. Tel.: +852 34117636; fax: +852 34117892. E-mail addresses: [email protected] (J. Zeng), [email protected] (W. Feng), [email protected] (L. Xie), [email protected] (Z.-Q. Liu).

Abstract

Extracting perceptually meaningful strokes plays an essential role in modeling structures of handwritten Chinese characters for accurate character recognition. This paper proposes a cascade Markov random field (MRF) model that combines both bottom-up (BU) and top-down (TD) processes for stroke extraction. In the low-level stroke segmentation process, we use a BU MRF model with smoothness prior to segment the character skeleton into directional substrokes based on self-organization of pixel-based directional features. In the high-level stroke extraction process, the segmented substrokes are sent to a TD MRF-based character model that, in turn, feeds back to guide the merging of corresponding substrokes to produce reliable candidate strokes for character recognition. The merit of the cascade MRF model lies in its ability to encode the local statistical dependencies of neighboring stroke components as well as prior knowledge of Chinese character structures. Encouraging stroke extraction and character recognition results confirm the effectiveness of our method, which integrates both BU/TD vision processing streams within the unified MRF framework.

© 2009 Elsevier Inc. All rights reserved.

1. Introduction

A handwritten Chinese character is naturally composed of many straight-line strokes, whose relationships reflect the character structure [16]. Biederman’s [1] recognition-by-components (RBC) theory suggests that visual input is matched against objects’ structural representations consisting of primitive shapes and their interrelations in the brain. Therefore, most handwritten Chinese character recognition (HCCR) systems are based on modeling stroke relationships [17,10,31], and thus extracting perceptually meaningful strokes plays an essential role in the structural representation of Chinese characters.

Stroke extraction of Chinese characters is a challenging task. According to Chinese character structures, most stroke extraction methods segment the character skeleton into four directional substrokes: horizontal (0°), right-diagonal (45°), vertical (90°), and left-diagonal (135°) [22]. In general, substroke segmentation involves a bottom-up (BU) process that in principle utilizes self-organization of pixel-based directional features. The interactions between neighboring pixels automatically determine if they belong to the same substroke. However, the thinning process causes distortions on the character skeleton for two types of ambiguous regions. The first is the junction region shared by more than two strokes. The second is the transitional high curvature region of a continuous stroke. In practice, these ambiguous parts lead to over-fragmented substrokes, a problem that is even more serious in cursive Chinese characters (whose strokes are no longer straight lines) due to large shape variations. As a result, the BU process alone cannot produce reliable strokes for real-world character recognition. Fig. 1C shows a typical example of substroke segmentation on the character skeleton in Fig. 1A. We see that the fragmented substrokes in Fig. 1C occur at junction and high curvature regions. These substrokes are far from the perceptually meaningful strokes in Fig. 1E that reflect the intrinsic character structures.

Following Treisman’s feature integration theory [24], the processing of image information is assumed to be an interaction of two opposite image processing streams. One is the BU process, which groups initial pieces of image information into self-organizing structures. The other is the supervised top-down (TD) process, which conveys the rules and knowledge that guide the linking and merging of disjoint preliminary information pieces into perceptually meaningful image objects. The high-level TD processing stream is associated with image understanding and cognitive image perception. More specifically, recent psychological results [25] support a cascade model of image segmentation, in which the partial BU information is sent to the higher level object representation that, in turn, feeds back to guide the segmentation process. We therefore need to integrate the TD character-representation-guided stroke merging process with the BU self-organization process.

To combine both BU and TD processes, we propose a cascade Markov random field (MRF) [12] model for stroke extraction:

(1) In the low-level stroke segmentation process, we use the BU MRF with smoothness prior to segment character skeletons into continuous directional substrokes based on self-organization of pixel-based directional features. We believe that this BU MRF is superior to other heuristic rule-based stroke segmentation methods [22] because it considers the local statistical dependencies of pixels on the character skeleton as well as their smoothness constraints.

(2) In the high-level stroke extraction process, we use the TD MRF-based character model to guide the merging of substrokes into perceptually meaningful candidate strokes for character recognition. Since the character model encodes the global information of individual strokes as well as their relationships, it first concatenates the substrokes to form all possible candidate strokes by a selective search algorithm, and then finds the best strokes from the candidates by a structural matching algorithm such as relaxation labeling [13].

In summary, Fig. 1B and D illustrate the data flow of the proposed cascade MRF model integrating BU/TD processes for stroke extraction. The BU MRF encodes relationships of four labels representing horizontal (H), left-diagonal (L), vertical (V) and right-diagonal (R) directions. The TD MRF describes individual strokes (indexed by numbers) and their relationships, which guide the production of the corresponding reliable strokes in Fig. 1E. Our contributions lie in two aspects. First, we propose to describe coherent groups of pixels on the character skeleton by the MRF, which is based on statistical dependencies rather than on the rule-based criteria of traditional BU approaches [5,4,8,2,22,23]. Second, we extend our previous work [31,32] to use the character representation to guide and merge substrokes by a selective search, which avoids searching all possible concatenations of substrokes. We also believe that the cascade MRF model can be generalized to other sophisticated object recognition tasks.

The remainder of this paper is organized as follows. Related work is discussed in Section 2. We formulate stroke extraction as a labeling problem and present our cascade MRF model in Section 3. The merit of this model is due to its expressive power for encoding local statistical dependencies as well as structural prior knowledge between neighboring components. Section 4 shows experimental results on extracting perceptually meaningful candidate strokes for accurate character recognition. Section 5 draws conclusions and envisions future work.

Fig. 1. The data flow of the cascade MRF model for stroke extraction. Using the bottom-up MRF (B), the input character skeleton (A) is segmented into substrokes (C) denoted by different colors. Substrokes are generally over-fragmented because of cursive strokes with large shape variations. They are guided to merge into perceptually meaningful strokes (E) indexed by numbers for character recognition using the top-down MRF-based character model (D), which encodes information of individual strokes and their relationships indexed by corresponding numbers. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

2. Related work

In the past decades, the BU and the TD approaches have been two major strategies for stroke extraction from Chinese characters [17,10,31,22,8,4,11,3,15,28,23,19,14,2].

The BU approaches segment characters into substrokes based on coherent groups of pixels, and subsequently merge corresponding substrokes into meaningful strokes by heuristic rules to obtain a robust structural representation of characters (for example, trend-followed transcribing [15], fuzzy substroke extractor [28], and pixel-tracing using configuration templates [11]). Previous BU approaches often involve a thinning preprocessing step, and focus on correcting the junction distortions of character skeletons [19,14]. To avoid junction-distortion and spurious-branch problems, more advanced BU techniques have been developed to directly handle the thick-line character image (for example, run-length-based approach [4,8], direction distribution [2], Gabor filtering and thresholding [22], and second-order Gaussian derivative filters [23]). Despite the good performance of BU segmentation, two major problems remain to be solved. The first is the broken stroke problem: given complicated character shapes and ambiguous character strokes, how can the smoothness (continuity) of the substrokes be preserved? The second lies in the subsequent merging process, which depends heavily on the developer’s prior knowledge. For example, the relationships of substrokes are often represented by a graph, whose complexity can be reduced based on heuristic rules (for example, polygon approximation rules [17], fuzzy techniques [28], connection rules [14], and heuristic merging algorithms [4,8,22,23]). Since these heuristic rules are only concerned with the local information of relative position and direction between substrokes, they are not robust to large variations of character shapes and may produce broken strokes or incorrect combinations of substrokes, leading to a problematic structural description of characters. It is difficult to merge several substrokes into a correct and complete stroke by the BU approach alone because the character representation is unknown in advance.

In contrast, the TD approaches use the character representation learned from examples to guide the stroke extraction process (for example, model-based stroke extraction [17,10,31,32] and multi-stroke relaxation matching [3]). Existing TD approaches need over-segmented pieces of substrokes (for example, line approximation method [17,10,3] and curvature detection method [31,32]), and involve an exhaustive search of almost all possible concatenations of substrokes. However, when the number of over-segmented substrokes increases, merging slows down considerably because the search space of all possible combinations is large. As a result, the TD process can in theory produce perceptually meaningful candidate strokes for recognition, but in practice its computational efficiency is constrained by the BU process.

In recent object recognition research, a promising direction is the synergy between BU and TD vision processes, which aims to combine the relative merits of both approaches. For example, Zhu et al. [33] integrated BU and TD for object recognition by data driven Markov chain Monte Carlo. Weber et al. [27] developed the constellation model that represents each class of object as flexible constellations of rigid parts (BU process), and further describes intra-class variability by a joint probability density function on the shape of the constellations (TD process) and the output of part detectors. Mendoza et al. [20] used a hybrid fuzzy approach for image recognition taking advantage of both BU and TD information. In this study, we show that the combination of BU and TD can also be formulated within the unified MRF framework, which works well in the challenging stroke extraction of handwritten Chinese characters.

3. Cascade MRFs for stroke extraction

3.1. Cascade MRFs for both BU and TD processes

Stroke extraction can be formulated as the labeling problem to which the solution is a set of semantic labels, $\mathcal{J} = \{1, \ldots, j, \ldots, J\}$, assigned to a set of sites, $\mathcal{I} = \{1, \ldots, i, \ldots, I\}$, to explain the observations, $O = \{o_1, \ldots, o_I\}$, at all sites. The sites are image pixels or concatenated substrokes, while the labels reflect intrinsic relations, regularities or structures between sites. At each site, the random observation $o_i$ is the pixel-based directional feature or holistic information of the substroke. For most practical problems, the number of labels $J$ is not equal to the number of sites $I$. Let $F = (f_1, \ldots, f_i, \ldots, f_I)^T$ be a labeling configuration at all sites, where the labeling strength, $f_i = (f_i(j)) \in [0, 1]$, $1 \le j \le J$, is the assignment probability of the label $j$ to the site $i$. If label $j$ is definitely assigned to the site $i$, we denote $f_i = j$ or $f_i(j) = 1$. The null label is not assigned to any site, denoted by $\sum_i f_i(j) = 0$, and the null site is not associated with any labels, denoted by $\sum_j f_i(j) = 0$. The labels can be viewed as a set of hidden variables generating observations, so the class-conditional joint probability $P(O, F|\lambda)$ can represent the underlying structure of patterns for each class $\lambda$ [21]. Such a representation of patterns follows the logical principle of analysis by synthesis [7, p. 7], which tests a class model by generating random samples according to $P(O, F|\lambda)$ to see how well they resemble the signals we are observing in the world. This synthesis principle has been widely used in many pattern recognition problems, such as hidden Markov models (HMMs) for speech recognition [30].
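To make the notation concrete, the labeling configuration $F$ can be stored as an $I \times J$ array of labeling strengths. The following is a minimal sketch, not the authors' implementation: the array values and the helper names `null_labels`, `null_sites` and `hard_labels` are hypothetical, and labels are indexed from 0 rather than 1.

```python
import numpy as np

# Hypothetical labeling configuration F for I = 4 sites and J = 3 labels.
# F[i, j] is the labeling strength f_i(j): the probability of label j at site i.
F = np.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],   # a null site: no label is associated with it
])

def null_labels(F):
    """Labels j with sum_i f_i(j) = 0, i.e. labels assigned to no site."""
    return np.flatnonzero(F.sum(axis=0) == 0)

def null_sites(F):
    """Sites i with sum_j f_i(j) = 0, i.e. sites associated with no label."""
    return np.flatnonzero(F.sum(axis=1) == 0)

def hard_labels(F):
    """Definite assignment f_i = j: strongest label per site, -1 for null sites."""
    labels = F.argmax(axis=1)
    labels[F.sum(axis=1) == 0] = -1
    return labels
```

In this toy configuration the third label is a null label and the fourth site is a null site, which is the situation the text describes when the number of labels and sites differ.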

In the MRF, the local statistical interactions among adjacent sites in a pattern or image are reflected by two fundamental concepts: neighborhood system $\partial_i$ and clique potentials $V_c$. The neighborhood system $\partial_i$ defines a set of neighbors of site $i$ provided that $i' \in \partial_i \iff i \in \partial_{i'}$ and $i \notin \partial_i$. We can define any two sites $i$ and $i'$ as neighbors in $\partial_i$. The clique $c$ is a subset of sites that are all pair-wise neighbors. A high-order neighborhood system may represent high-order relationships between sites, but it also leads to high computational cost. For simplicity, we use the second-order neighborhood system, and consider only the single-site cliques, $C_1 = \{i\}$, and pair-site cliques, $C_2 = \{(i, i')\}$. To encourage or penalize different local interactions, we assign costs $V_c$ to different cliques in the neighborhood system. Generally, we use the unary feature $o_i$ and binary feature $o_{ii'}$ to represent individual observations and their interactions in the neighborhood system.
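The neighborhood system and the two clique types can be sketched as follows. This is an illustration under stated assumptions rather than the paper's construction: sites are taken to be skeleton pixels given as (row, column) tuples, neighbors are assumed to be 8-connected skeleton pixels, and the function names and the toy skeleton are hypothetical.

```python
from itertools import product

def neighborhood(pixels):
    """Map each skeleton pixel (row, col) to its 8-connected neighbors
    that are also skeleton pixels. A site is never its own neighbor."""
    pixel_set = set(pixels)
    offsets = [d for d in product((-1, 0, 1), repeat=2) if d != (0, 0)]
    return {
        (r, c): {(r + dr, c + dc) for dr, dc in offsets
                 if (r + dr, c + dc) in pixel_set}
        for (r, c) in pixel_set
    }

def cliques(nbrs):
    """Return single-site cliques C1 and unordered pair-site cliques C2."""
    c1 = sorted(nbrs)
    c2 = sorted({tuple(sorted((i, k))) for i in nbrs for k in nbrs[i]})
    return c1, c2

# Hypothetical toy skeleton: an L-shaped run of four pixels.
skeleton = [(0, 0), (1, 0), (2, 0), (2, 1)]
nbrs = neighborhood(skeleton)
c1, c2 = cliques(nbrs)
```

By construction the relation is symmetric and irreflexive, which are the two conditions stated for $\partial_i$ above.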

Our goal is to find the best labeling configuration $F^*$ for the observation $O$. To this end, we maximize the joint probability $P(O, F|\lambda)$ in terms of $F$. For simplicity, we assume that the unary and binary features are conditionally independent given the labeling configuration $F$, and we obtain

$$P(O, F|\lambda) = P(F|\lambda) \prod_{i=1}^{I} p(o_i|f_i, \lambda) \prod_{ii'} p(o_{ii'}|f_i, f_{i'}, \lambda). \tag{1}$$

According to the equivalence between MRFs and the Gibbs distribution established by Hammersley and Clifford [9], the prior $P(F|\lambda)$ has the form

$$P(F|\lambda) = \frac{1}{Z_\lambda} \exp\left[ -\sum_{i \in C_1} V_{C_1}(f_i) - \sum_{ii' \in C_2} V_{C_2}(f_i, f_{i'}) \right]. \tag{2}$$

In the BU process, we use the same MRF model $\lambda$ for stroke segmentation because the pixel-based observations share the same set of pixel-based labels across different classes. As a result, we can cancel the constant normalization factors $Z_\lambda$ in Eq. (2), and obtain the joint energy function corresponding to Eq. (1)

$$U(O, F) = U(O|F) + U(F) = \underbrace{\sum_{i \in C_1} V_{C_1}(f_i) + \sum_{ii' \in C_2} V_{C_2}(f_i, f_{i'})}_{\text{prior clique potentials}} + \underbrace{\sum_{i \in C_1} V_{C_1}(o_i|f_i) + \sum_{ii' \in C_2} V_{C_2}(o_{ii'}|f_i, f_{i'})}_{\text{likelihood clique potentials}}, \tag{3}$$

where the likelihood clique potentials are defined as the negative log likelihood in Eq. (1)

$$V_{C_1}(o_i|f_i) = -\ln p(o_i|f_i, \lambda), \tag{4}$$

$$V_{C_2}(o_{ii'}|f_i, f_{i'}) = -\ln p(o_{ii'}|f_i, f_{i'}, \lambda). \tag{5}$$

The single-site clique potentials, $V_{C_1}(o_i|f_i)$ and $V_{C_1}(f_i)$, describe the statistical information of the pixel-based observation $o_i$ given the labeling strength $f_i$, and the pair-site clique potentials, $V_{C_2}(f_i, f_{i'})$ and $V_{C_2}(o_{ii'}|f_i, f_{i'})$, encode the structural information of neighboring labeling strengths $f_i$ and $f_{i'}$. To obtain the desired global configuration $F^*$, maximizing Eq. (1) is equivalent to minimizing the energy function Eq. (3) [12].
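The joint energy of Eq. (3) can be evaluated directly for a hard labeling. The sketch below is illustrative only: the potentials are passed in as plain functions, the three-site chain, the cost table `unary_cost` and the unit smoothness penalty are hypothetical values, and the exhaustive search in `best_labeling` is shown only because the toy problem is tiny.

```python
from itertools import product

def joint_energy(labels, c1, c2, prior1, prior2, like1, like2):
    """Joint energy U(O, F) of Eq. (3) for a hard labeling.

    labels: dict mapping site -> label
    c1: iterable of sites; c2: iterable of (site, site) pairs
    prior1(j), prior2(j, j2): prior clique potentials
    like1(i, j), like2(i, k, j, j2): likelihood clique potentials
    """
    prior = sum(prior1(labels[i]) for i in c1)
    prior += sum(prior2(labels[i], labels[k]) for i, k in c2)
    like = sum(like1(i, labels[i]) for i in c1)
    like += sum(like2(i, k, labels[i], labels[k]) for i, k in c2)
    return prior + like

# Hypothetical toy problem: three sites on a chain, two labels (0 and 1).
sites = [0, 1, 2]
c1 = sites
c2 = [(0, 1), (1, 2)]
unary_cost = {0: [0.1, 2.0], 1: [0.9, 1.0], 2: [0.2, 1.5]}  # stands in for -ln p(o_i | j)

def prior1(j):
    return 0.0                        # single-site prior carries no structure

def prior2(j, j2):
    return 1.0 if j != j2 else 0.0    # penalize label changes between neighbors

def like1(i, j):
    return unary_cost[i][j]

def like2(i, k, j, j2):
    return 0.0                        # pairwise likelihood omitted in this toy case

def best_labeling():
    """Exhaustive minimization of U(O, F); feasible only for tiny problems."""
    candidates = (dict(zip(sites, ls)) for ls in product([0, 1], repeat=len(sites)))
    return min(candidates,
               key=lambda lab: joint_energy(lab, c1, c2, prior1, prior2, like1, like2))
```

Because the number of labelings grows as $J^I$, exhaustive minimization does not scale, which is why an iterative minimizer is needed in practice.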

In the TD process, Eq. (1) cannot be simplified to Eq. (3) by dropping the normalization factor, because different categories of Chinese characters will have different class models $\lambda$, which are composed of different stroke-based labels $f_i$. However, we note that the role of the normalization factor is to make the prior probability (2) comparable among different character class models $\lambda$. So, a practical choice is to design the prior clique potentials proportional to the likelihood potentials, where the joint energy will decrease in a certain proportion if the prior is consistent with the likelihood. Then we obtain an approximation similar to Eq. (3)

$$U(O, F|\lambda) = U(O|F, \lambda) + U(F|\lambda) = \underbrace{\sum_{i \in C_1} V_{C_1}(f_i|\lambda) + \sum_{ii' \in C_2} V_{C_2}(f_i, f_{i'}|\lambda)}_{\text{prior clique potentials}} + \underbrace{\sum_{i \in C_1} V_{C_1}(o_i|f_i, \lambda) + \sum_{ii' \in C_2} V_{C_2}(o_{ii'}|f_i, f_{i'}, \lambda)}_{\text{likelihood clique potentials}}, \tag{6}$$

where the single-site clique potentials, $V_{C_1}(o_i|f_i, \lambda)$ and $V_{C_1}(f_i|\lambda)$, describe the statistical variation of the substroke-based observation $o_i$ given the stroke-based label $f_i$ of the character $\lambda$. The pair-site clique potentials, $V_{C_2}(o_{ii'}|f_i, f_{i'}, \lambda)$ and $V_{C_2}(f_i, f_{i'}|\lambda)$, encode the statistical dependency between neighboring stroke labels $f_i$ and $f_{i'}$.

Interestingly, the only difference between the BU and TD processes in Eqs. (3) and (6) is that Eq. (3) does not depend on the character model $\lambda$, while Eq. (6) needs information about the specific character model to perform labeling. To encode statistical information, we need to derive the specific clique potentials in Eqs. (4) and (5) from the corresponding probability density functions. We derive the likelihood clique potentials from Gaussian mixture models (GMMs) [31]. More specifically, we design the following single-site likelihood clique potential:

$$V_{C_1}(o_i|f_i = j) = -\ln\left[ \sum_{m=1}^{M_s} w_{jm} N(o_i; \mu_{jm}, \Sigma_{jm}) \right], \tag{7}$$

where $M_s$ is the number of mixture components, $w_{jm}$ are the weights of the mixture components, and $N(\cdot; \mu, \Sigma)$ is a multivariate Gaussian distribution. Similarly, the pairwise likelihood clique potential is

$$V_{C_2}(o_{ii'}|f_i = j, f_{i'} = j') = -\ln\left[ \sum_{m=1}^{M_s} w_{jj'm} N(o_{ii'}; \mu_{jj'm}, \Sigma_{jj'm}) \right]. \tag{8}$$

For short, we rewrite Eqs. (7) and (8) as $V_{C_1}(o_i|j)$ and $V_{C_2}(o_{ii'}|j, j')$. If we marginalize Eqs. (7) and (8) over all the labels in the set $\mathcal{J}$, we obtain $V_{C_1}(o_i|f_i)$ and $V_{C_2}(o_{ii'}|f_i, f_{i'})$. Generally, three mixture components are enough to produce good results in stroke extraction.
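The single-site GMM potential of Eq. (7) can be computed as in the following sketch. It is a minimal illustration assuming full covariance matrices and NumPy; the function name `gmm_potential` and all parameter values in the example are hypothetical, and the log-sum-exp form is a numerical convenience rather than something the paper prescribes.

```python
import numpy as np

def gmm_potential(o, weights, means, covs):
    """Single-site likelihood clique potential of Eq. (7) for one label j:
    V(o | j) = -ln sum_m w_m N(o; mu_m, Sigma_m)."""
    o = np.asarray(o, dtype=float)
    d = o.size
    log_terms = []
    for w, mu, cov in zip(weights, means, covs):
        diff = o - np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
        log_terms.append(np.log(w) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    # log-sum-exp keeps the mixture sum stable when the densities are tiny
    return -np.logaddexp.reduce(log_terms)

# Hypothetical two-component mixture over 4-D directional features.
weights = [0.6, 0.4]
means = [[0.90, 0.03, 0.03, 0.04], [0.80, 0.10, 0.05, 0.05]]
covs = [0.01 * np.eye(4), 0.02 * np.eye(4)]
near = gmm_potential([0.88, 0.05, 0.03, 0.04], weights, means, covs)
far = gmm_potential([0.03, 0.93, 0.03, 0.01], weights, means, covs)
```

An observation close to the mixture means receives a lower potential (`near`) than one far from them (`far`), so minimizing the energy favors labels whose GMM explains the observation well. The pairwise potential of Eq. (8) has the same form with scalar observations $o_{ii'}$.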

3.2. Inference and parameter estimation

In the BU process, the basic idea of the MRF is to constrain the interactions between semantic labels by the Markov property in terms of the neighborhood system and clique potentials. The likelihood clique potential describes the statistical variations of directional features at each site, while the smoothness-based prior clique potential imposes the structural constraint that ensures stroke continuity. The optimal labeling configuration of the character skeleton implies the best stroke segmentation, achieved by minimizing the joint energy Eq. (3). The relaxation labeling (RL) algorithm [13] is a fast parallel minimizer of Eq. (3) with polynomial complexity. Indeed, the RL algorithm is a kind of belief propagation method that passes messages between neighboring labeling configurations.

Given the observations $O$, the RL algorithm [13] can find the best labeling, $F^*$, and the minimum of the joint energy $U(O, F^*)$. The labeling strength $f_i(j)$ can be interpreted as the posterior probability $P(f_i = j|o_i)$, which corresponds to the E-step of the EM algorithm [6] for parameter estimation. In practice we convert the minimization of the joint energy (3) into the maximization of a corresponding gain function,

$$g(O, F) = \sum_{i=1}^{I} \sum_{j=1}^{J} \left[ K_i(j) f_i(j) + \sum_{i' \in \partial_i} \sum_{j'=1}^{J} K_{i,i'}(j, j') f_i(j) f_{i'}(j') \right], \tag{9}$$

which consists of compatibility functions defined by clique potentials,

$$K_i(j) = \mathrm{CONST}_1 - V_{C_1}(o_i|j) - V_{C_1}(j), \tag{10}$$

$$K_{i,i'}(j, j') = \mathrm{CONST}_2 - V_{C_2}(o_{ii'}|j, j') - V_{C_2}(j, j'), \tag{11}$$

where the constants $\mathrm{CONST}_1$ and $\mathrm{CONST}_2$ ensure that all the compatibility functions are non-negative. Because the compatibility functions contain both prior and likelihood information, the RL does not heavily depend on the initial labeling. So we set equal initial labeling strengths, $f_i^0(j) = 1$, $1 \le j \le J$. We update $f_i^t(j)$ by the gradient $q_i^t(j)$ of the gain function (9) until $t$ reaches the maximum iteration $T$, as shown in Algorithm 1. Finally, we retrieve the best labeling configuration $F^*$ using the winner-take-all strategy.

Based on $F^*$, we use k-means clustering and the EM algorithm [6] to estimate the GMM parameters $\mu_{jm}$, $\Sigma_{jm}$, and $w_{jm}$. First, we use manually labeled data (for example, we manually assign each pixel-based observation one of five directional labels: horizontal, left-diagonal, vertical, right-diagonal and ambiguous) to initialize these parameters by k-means clustering. We assign the labeled data to $M_s$ clusters for each label $j$, and meanwhile estimate the parameters $w_{jm}$, $\mu_{jm}$, and $\hat{\Sigma}_{jm}$ within each cluster. Second, we use the initialized MRF to automatically label unlabeled observations. The best labeling configuration implies an alignment of labels with observations. As mentioned previously, the labeling strength $f_i^*(j)$ is used to re-estimate the parameters of the GMMs by the EM algorithm.

Algorithm 1. The relaxation labeling algorithm.
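The listing of Algorithm 1 is not reproduced in this copy of the text. The sketch below is therefore only an assumption about its general shape: it uses the classical multiplicative relaxation labeling update, in which each labeling strength is scaled by its support $q_i(j)$ and renormalized, followed by winner-take-all. The support computed here matches the gradient of the gain function (9) up to a constant factor on the pairwise term; the paper's exact update rule may differ. The function name, the arrays `K1` and `K2`, and the toy values are hypothetical.

```python
import numpy as np

def relaxation_labeling(K1, K2, nbrs, T=50):
    """K1: (I, J) array of unary compatibilities K_i(j) >= 0.
    K2: dict (i, i2) -> (J, J) array of pairwise compatibilities >= 0.
    nbrs: dict i -> list of neighbor sites.
    Returns winner-take-all labels and the final labeling strengths."""
    I, J = K1.shape
    f = np.full((I, J), 1.0 / J)              # equal initial labeling strengths
    for _ in range(T):
        q = K1.astype(float)                  # astype returns a fresh copy
        for i in range(I):
            for i2 in nbrs[i]:
                q[i] += K2[(i, i2)] @ f[i2]   # sum_j' K(j, j') f_i2(j')
        f = f * q                             # scale strengths by their support
        f /= f.sum(axis=1, keepdims=True) + 1e-12
    return f.argmax(axis=1), f

# Hypothetical chain of three sites and two labels. Site 1 slightly prefers
# label 1 on its own, but both neighbors strongly prefer label 0.
K1 = np.array([[2.0, 0.1], [1.0, 1.1], [2.0, 0.2]])
chain = {0: [1], 1: [0, 2], 2: [1]}
K2 = {(i, k): np.eye(2) for i in chain for k in chain[i]}   # reward equal labels
labels, strengths = relaxation_labeling(K1, K2, chain)

# Same unary compatibilities with no neighbors, for comparison.
isolated = {0: [], 1: [], 2: []}
labels_alone, _ = relaxation_labeling(K1, K2, isolated)
```

With the neighborhood in place, the middle site is pulled to the label of its neighbors, whereas in isolation it keeps its own weak preference. This is the smoothing effect that the prior clique potential is meant to provide.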

For the TD process, we can use a procedure similar to Algorithm 1 to infer the best labeling configuration by minimizing the joint energy Eq. (6), with slight differences. Because of stroke segmentation noise, the number of substroke-based observations is not equal to the number of stroke labels in practice. We often encounter null observations and null stroke labels, which are not matched to any labels or observations. In these cases, we have to penalize such mismatches using heuristic rules [31]. As a result, each stroke label finds the substroke-based observation with the minimum energy. This alignment can be used to estimate the parameters of the GMM based on the EM algorithm.

3.3. Stroke segmentation

Stroke segmentation of Chinese characters is a typical labeling problem, as illustrated in Fig. 1. To account for ambiguous parts, we assign each pixel of the character skeleton one of the five labels, $\mathcal{J} = \{1, \ldots, 5\}$, representing horizontal, right-diagonal, vertical, left-diagonal, and ambiguous substrokes, respectively. At each site $i$ (pixel) of the character skeleton, one of the five labels should be assigned to the directional feature $o_i$ in order to explain its property.

To extract directional features, we simplify the response of the Gabor filter [29] as

$$h(x, y) = \exp\left( -\frac{x^2 + y^2}{2\sigma^2} \right) \exp\bigl( j 2\pi f (x \cos\theta + y \sin\theta) \bigr). \tag{12}$$

The output of the Gabor filter $I(x, y)$ is the magnitude of the convolution between the input character skeleton image $i(x, y)$ and the Gabor filter template $h(x, y)$, where $h(x, y)$ is defined on a square with $x \in [-\sigma, \sigma]$ and $y \in [-\sigma, \sigma]$ in Eq. (12). The selection of the parameters, frequency $f$ and standard deviation $\sigma$, in Eq. (12) for stroke extraction has been discussed in [22,26]. The frequency $f$ is associated with the stroke width, where the stroke width approximately equals one in the character skeleton. The standard deviation is $\sigma = \sqrt{2}/f$ [22]. Gabor filters are robust to the selection of $f$ and $\sigma$, i.e., the responses are stable under small variations of $f$ and $\sigma$.
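The filter template of Eq. (12) and its magnitude response can be sketched as follows. This is an illustration under stated assumptions: the template is sampled on integer pixel offsets inside $[-\sigma, \sigma]$, the frequency value `f = 0.25` is a hypothetical choice rather than the paper's setting, and the convolution is written with plain NumPy loops for clarity instead of speed.

```python
import numpy as np

def gabor_kernel(f, theta, sigma=None):
    """Complex Gabor template of Eq. (12), sampled on integer pixels of the
    square x, y in [-sigma, sigma]. sigma defaults to sqrt(2) / f."""
    if sigma is None:
        sigma = np.sqrt(2.0) / f
    half = int(np.floor(sigma))
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    envelope = np.exp(-(xs**2 + ys**2) / (2.0 * sigma**2))
    carrier = np.exp(1j * 2.0 * np.pi * f * (xs * np.cos(theta) + ys * np.sin(theta)))
    return envelope * carrier

def gabor_magnitude(image, kernel):
    """Magnitude of the 2-D convolution of image with kernel
    (same-size output, zero padding)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(image.astype(float), ((ph, ph), (pw, pw)))
    flipped = kernel[::-1, ::-1]              # convolution flips the kernel
    out = np.zeros(image.shape, dtype=complex)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + kh, c:c + kw] * flipped)
    return np.abs(out)

# Hypothetical test image: a one-pixel-wide line along the x axis.
img = np.zeros((21, 21))
img[10, :] = 1.0
resp_0 = gabor_magnitude(img, gabor_kernel(0.25, 0.0))
resp_90 = gabor_magnitude(img, gabor_kernel(0.25, np.pi / 2))
```

With Eq. (12) as written, the carrier oscillates along the direction $\theta$, so a line perpendicular to $\theta$ produces the strongest response: here `resp_90` exceeds `resp_0` at the line center. How the paper maps each $\theta$ to a stroke label is not assumed in this sketch.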

The extraction of the directional feature consists of three steps. First, we normalize the slant and moment with aspect ratio preserved for character images [18], and then perform the thinning process to obtain the character skeleton. Second, we use four Gabor filters with orientations, $\theta = 0°, 45°, 90°, 135°$, to convolve with the character skeleton, producing four uncorrelated gray images. Finally, at each pixel $i$ on the character skeleton, the values from the four gray images compose the directional feature $o_i$, which is further normalized into $[0, 1]$. For example, at site $i$ on the character skeleton, the four values of the gray images compose the directional feature, $o_i = [0.03, 0.93, 0.03, 0.01]^T$, which shows that the response of the 45° Gabor filter is the largest. The cosine between two directional vectors can measure the similarity of two observations because the same directions have the largest similarity value and the perpendicular directions have the smallest similarity value. We define the binary feature as the cosine similarity $o_{ii'} = o_i^T o_{i'} / (\|o_i\| \|o_{i'}\|)$.

In the BU process, we define the neighborhood system $\partial_i$ based on the connection of sites. The sites $i'$ and $i''$ are the first-order and second-order neighbors of the site $i$, respectively. Although more complex relationships among image pixels can be represented by higher order neighborhood systems, in practice we use the second-order $\partial_i$ and consider only the single-site and pair-site cliques to reduce the computational cost.
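The directional feature and the cosine-similarity binary feature defined above can be sketched in a few lines. The normalization by the sum of the four responses is an assumption (the example vector in the text sums to one, but the paper does not state the rule), and the function names and the vectors `o_b` and `o_c` are hypothetical.

```python
import numpy as np

def directional_feature(responses):
    """Stack the four filter responses at one pixel and rescale into [0, 1].
    Dividing by the sum is one possible normalization (an assumption)."""
    o = np.asarray(responses, dtype=float)
    total = o.sum()
    return o / total if total > 0 else o

def cosine_similarity(o_i, o_k):
    """Binary feature o_{ii'} = o_i^T o_i' / (||o_i|| ||o_i'||)."""
    o_i = np.asarray(o_i, dtype=float)
    o_k = np.asarray(o_k, dtype=float)
    return float(o_i @ o_k / (np.linalg.norm(o_i) * np.linalg.norm(o_k)))

o_a = directional_feature([3, 93, 3, 1])      # the example vector from the text
o_b = np.array([0.02, 0.90, 0.05, 0.03])      # hypothetical pixel, same direction
o_c = np.array([0.01, 0.03, 0.03, 0.93])      # hypothetical pixel, other direction
same_dir = cosine_similarity(o_a, o_b)
other_dir = cosine_similarity(o_a, o_c)
```

Pixels dominated by the same filter give a similarity close to one, while pixels dominated by different filters give a value close to zero, which is what makes the cosine a usable pairwise observation.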

The single-site prior clique potential does not encode structural information, and thus $V_{C_1}(j) = 0$. To keep continuity of substrokes, we design the pair-site smoothness-based prior clique potential as follows:

$$V_{C_2}(j, j') = \begin{cases} \alpha_{jj'}, & \text{if } j \neq j', \\ 0, & \text{otherwise}, \end{cases} \tag{13}$$

where $\alpha_{jj'}$ is a positive constant penalizing inconsistent labels at neighboring sites. If $\alpha_{jj'}$ is large, the smoothness constraint is so strong that the neighboring sites are prone to be associated with the same label. In practice we have to balance the prior and likelihood clique potentials in order to achieve the best performance. We determine $\alpha_{jj'}$ as the minimum cosine similarity between directional features from training data. For example, given a set of observations with horizontal labels ($j = 1$), and another set of observations with left-diagonal labels ($j' = 2$), the value $\alpha_{12}$ is set as the minimum cosine similarity between these observations. The experimental results show that this smoothness prior $\alpha_{jj'}$ works well.
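The estimation of $\alpha_{jj'}$ and the prior potential of Eq. (13) can be sketched as follows. This is a minimal illustration assuming each label's training observations are stored as rows of an array; the function names and the small training sets are hypothetical, and labels are indexed from 0 in the code.

```python
import numpy as np

def smoothness_penalty(obs_j, obs_k):
    """alpha_{jj'}: minimum cosine similarity between any observation
    labeled j and any observation labeled j' in the training data."""
    A = np.asarray(obs_j, dtype=float)
    B = np.asarray(obs_k, dtype=float)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return float((A @ B.T).min())             # all pairwise cosines, then the minimum

def prior_pair_potential(j, k, alpha):
    """Eq. (13): alpha[j][k] if the two labels differ, 0 otherwise."""
    return alpha[j][k] if j != k else 0.0

# Hypothetical training observations for two direction labels.
obs_first = [[0.93, 0.03, 0.03, 0.01], [0.90, 0.05, 0.03, 0.02]]
obs_second = [[0.03, 0.93, 0.03, 0.01], [0.05, 0.90, 0.02, 0.03]]
a01 = smoothness_penalty(obs_first, obs_second)
alpha = [[0.0, a01], [a01, 0.0]]
```

Because directional features are non-negative, the resulting penalty lies between zero and one, which keeps it on a scale comparable to the cosine-based pairwise observations.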

3.4. Stroke extraction

The BU MRF for stroke segmentation does not use the high-level information, such as the position and the length of a complete stroke, so we obtain only fragmented directional substrokes. In particular, we often obtain unreliable substrokes for cursive characters due to large variations of curvatures. Because stroke extraction of cursive Chinese characters is still an unsolved problem, we focus on MRF-based character models to guide the stroke extraction process of cursive Chinese characters. In the TD process, the MRF-based character model can guide the concatenation and merging of substrokes to produce reliable complete strokes. This MRF-guided stroke extraction resembles previous model-based stroke extraction methods [3,17,10,31,32], but with one main distinction: in contrast to an exhaustive search over all substroke concatenations, we use a selective search method controlled by the stroke-label parameters to reduce the search space.

J. Zeng et al. / Information Sciences 180 (2010) 301–311 307

As shown in Fig. 2, the MRF-based character representation is composed of a set of stroke labels corresponding to real strokes of input characters. The Markov property constrains the interrelationship of stroke labels within the neighborhood system. Each stroke label $j$ is associated with a single-site clique potential, $V_{C_1}(o_i \mid f_i = j, \lambda)$, where $\lambda$ represents a specific character model, $o_i = [o_i^P, o_i^L, o_i^D]^T$ is the input substroke with its position, length and direction information, and the stroke label $j \triangleq \{\mu_j^P, \mu_j^L, \mu_j^D\}$ encodes the mean values of the complete stroke's position, length and direction parameters learned from training samples by the RL and EM algorithms [31]. For each stroke label, we perform a selective search for the most likely concatenations of substrokes, referred to as candidate strokes. For the stroke label $j \triangleq \{\mu_j^P, \mu_j^L, \mu_j^D\}$, we build a straight line based on the mean position, direction and length. First, we select all substrokes $o_i$ whose distance to this straight line is below a threshold in terms of the position and direction. This threshold is defined as one standard deviation, $\sigma_j^P$ and $\sigma_j^D$, in the estimated GMMs. Second, if these substrokes can be concatenated with each other, we concatenate them to produce a new substroke $o_i^{\mathrm{new}}$. Finally, if the difference between the length of the new candidate stroke and the mean length $\mu_j^L$ of the stroke label is below a threshold, we accept it as a candidate stroke for the later stroke matching process. This threshold is defined as one standard deviation $\sigma_j^L$ in the estimated GMMs. Fig. 2 shows the candidate strokes for the stroke label of a typical Chinese character model. Through the selective search, we reduce the search space and avoid concatenating all possible substrokes to find suitable candidate strokes.
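The selective search described above can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not in the paper: each substroke is summarized by its midpoint, length, and direction; the position distance is the perpendicular distance from the midpoint to the label's straight line; all selected substrokes are treated as concatenable; and the length of the merged stroke is the sum of the substroke lengths. The class and function names are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class Substroke:
    x: float          # midpoint x (assumed position feature)
    y: float          # midpoint y
    length: float
    direction: float  # degrees in [0, 180)

@dataclass
class StrokeLabel:
    mu_x: float       # mean position
    mu_y: float
    mu_len: float     # mean length
    mu_dir: float     # mean direction in degrees
    sigma_pos: float  # one standard deviation for each parameter
    sigma_len: float
    sigma_dir: float

def angle_difference(a, b):
    """Smallest difference between two undirected line orientations."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

def distance_to_label_line(s, label):
    """Perpendicular distance from the substroke midpoint to the label's line."""
    t = math.radians(label.mu_dir)
    dx, dy = s.x - label.mu_x, s.y - label.mu_y
    return abs(dx * math.sin(t) - dy * math.cos(t))

def select_substrokes(substrokes, label):
    """Step 1: keep substrokes within one standard deviation in position and direction."""
    return [s for s in substrokes
            if distance_to_label_line(s, label) <= label.sigma_pos
            and angle_difference(s.direction, label.mu_dir) <= label.sigma_dir]

def candidate_stroke(substrokes, label):
    """Steps 2 and 3: merge the selected substrokes and accept the result
    only if its length is within one standard deviation of the mean length."""
    selected = select_substrokes(substrokes, label)
    merged_length = sum(s.length for s in selected)
    if selected and abs(merged_length - label.mu_len) <= label.sigma_len:
        return selected
    return None

# Hypothetical horizontal stroke label and four input substrokes.
label = StrokeLabel(mu_x=50, mu_y=20, mu_len=60, mu_dir=0,
                    sigma_pos=5, sigma_len=10, sigma_dir=15)
substrokes = [Substroke(30, 21, 20, 5), Substroke(50, 19, 22, 175),
              Substroke(70, 22, 20, 0), Substroke(50, 60, 30, 90)]
candidate = candidate_stroke(substrokes, label)
```

In this toy input the vertical substroke far from the line is filtered out in the first step, so only the three near-horizontal substrokes are merged. A fuller implementation would also check connectivity between substrokes and could return several ranked candidates per stroke label rather than one.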

We can select the candidate stroke with the minimum single-site likelihood clique potential in Eq. (7) as the best stroke extracted by the stroke label. For example, in Fig. 2, the candidate stroke 1–2–3–4 is the best one because it has the lowest single-site clique potential. Nevertheless, another choice is to perform the structural match between stroke labels and candidate strokes, because the best stroke extraction result should satisfy the relationships between all candidate strokes. The best match implies the association of stroke labels with the best candidate strokes, which are the final results of stroke extraction. Meanwhile, the cost of the structural match can be used to recognize characters: the input candidate strokes are classified to the character model with the minimum cost. In this sense, stroke extraction and character recognition are two simultaneous and mutually dependent processes, i.e., stroke extraction leads to character recognition, and character recognition refines stroke extraction. To encode prior structural knowledge of each category of characters, we need to design prior clique potentials and minimize the joint energy in Eq. (6). The best structural match between candidate strokes and stroke labels is again inferred by the RL algorithm (Algorithm 1). More details can be found in [31].

Fig. 2. The input substrokes are merged into candidate strokes corresponding to stroke labels. The stroke label denoted by the thick bold line corresponds to the top four candidate strokes with the lowest single-site clique potentials. The most meaningful stroke has the lowest single-site likelihood clique potential, i.e., the candidate stroke 1–2–3–4.


4. Experimental results

4.1. Stroke extraction of cursive Chinese characters

We focus on the stroke segmentation of cursive Chinese characters because it is still a challenging problem. We performed extensive experiments on the KAIST Hanja1 and Hanja2 handwritten Chinese character databases [10]. Hanja1 has 783 categories with 200 samples for each category. Hanja2 has 1309 samples of naturally cursive Chinese characters, which are the most difficult to segment because of touching strokes. The image quality of Hanja1 is good, whereas that of Hanja2 is poor. Directly using cursive characters as training data is infeasible because cursive samples have large variations in shape. Therefore, from the Hanja1 database, we used 50 thinned characters as training data and manually assigned ambiguous labels at junction and transition regions. For testing on Hanja2, we segmented all 1309 cursive characters, which have an average complexity of about nine strokes per character. We compared cascade MRFs with a recent method based on Gabor filters, referred to as the GF method [22]. In this stroke extraction method, a set of Gabor filters is used to break down an image of a character into four directional features, and then an iterative thresholding technique is used to recover the stroke shape by minimizing the reconstruction error. A refinement process is used to remove redundant stroke pieces based on measuring the degree of stroke overlap. This method has been confirmed to be effective on well-written Chinese characters, but has not been examined on cursive Chinese characters.

Fig. 3. Stroke segmentation results of cursive Chinese characters on the KAIST Hanja2 database. The horizontal, left-diagonal, vertical, right-diagonal, and ambiguous labels are each denoted by a distinct marker symbol. The first line is produced by the MRF, and the second line is produced by the GF method [22]. Some fragmented substrokes are marked by circles.

Fig. 4. Stroke extraction results of cursive Chinese characters on the Hanja2 database. The first column is the MRF-based character models composed of many stroke labels. The second column shows stroke extraction results by the MRF. Only the best candidate strokes with the lowest clique potentials are illustrated. The numbers denote the correspondence between stroke labels and extracted strokes. The third column shows stroke extraction results by the GF method. The broken strokes and touching strokes are marked by "B" and "T", respectively.

Table 1. Recognition rate comparison on cursive Chinese characters.

| Database | SCSM [10] | MRFs [31] | T2MRFs [32] | Cascade MRFs |
|----------|-----------|-----------|-------------|--------------|
| Hanja2   | 83.14%    | 84.95%    | 85.79%      | 87.32%       |

Fig. 3 illustrates results of stroke segmentation of cursive Chinese characters on Hanja2, where the first line is produced by the MRF and the second line is produced by the GF method [22]. Two important observations are as follows. First, the MRF can effectively detect ambiguous parts, which are assigned the ambiguous label. These ambiguous parts play important roles in producing continuous substrokes because two substrokes connected by an ambiguous part can belong to the same substroke if they are associated with the same directional label. Second, the GF method produces more fragmented substrokes than the MRF because cursive Chinese characters often have fewer straight-line strokes than well-written Chinese characters. For the GF method, we marked those fragmented substrokes by circles in Fig. 3. These fragmented noisy substrokes increase the complexity of the merging process that produces perceptually meaningful strokes. In particular, fragmented substrokes cause broken strokes that deteriorate the Chinese character structure significantly. In contrast, by introducing the smoothness-based prior, the MRF retains the smoothness of substrokes as much as possible. Furthermore, we introduce statistical learning of the directional features of substrokes so as to handle local shape variations of cursive Chinese characters. Another advantage is that the MRF-based stroke segmentation does not require external corner detection [15,17] or line approximation [19,10] to break the substroke at high-curvature places.

Fig. 4 illustrates the stroke extraction results of cursive characters on the Hanja2 database. The first column is the MRF-based character models composed of many stroke labels. The second column shows stroke extraction results by the MRF. Only the best candidate strokes with the lowest clique potentials are illustrated. The numbers denote the correspondence between stroke labels and extracted strokes. The third column shows stroke extraction results by the GF method. For the GF method, we marked broken strokes by "B" and touching strokes by "T", where broken strokes mean that substrokes are not correctly merged and touching strokes mean that substrokes are incorrectly merged. For the MRF-based character model, we see that the stroke labels can accurately find the perceptually meaningful candidate strokes in the second column. In particular, stroke labels can guide the merging of proper substrokes and break some touching strokes in cursive characters, which cannot be achieved by the GF method in the third column. Indeed, the GF method almost fails to produce reliable strokes because it lacks global information such as the position and length of the original strokes. This problem is more serious in cursive Chinese characters due to unreliable local information. To summarize, it is advantageous to combine both BU and TD vision processing streams to extract perceptually meaningful strokes from cursive Chinese characters.

4.2. Cursive Chinese character recognition

The recognition performance of cursive characters degrades dramatically compared with well-written ones [10]. To validate cascade MRFs, we tested cursive Chinese character recognition performance on the Hanja2 database. In the training process, we selected 783 classes of characters from the Hanja1 database as the recognition vocabulary. For each class, we held out 10 even-numbered samples and used the remaining 190 samples for training purposes. To evaluate the MRF-based character model for cursive Chinese characters, we used all the test samples from the Hanja2 database. Table 1 compares the recognition rate with three recent methods: statistical character structure modeling (SCSM) [10], the Markov random field-based character model (MRF) [31], and the type-2 fuzzy Markov random field (T2MRF) [32]. All these methods used the same set of training and test data. We see that the accuracy of cascade MRFs is higher than that of the other state-of-the-art approaches, which demonstrates the effectiveness of cascade MRFs.
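To put the differences in Table 1 in perspective, the short Python sketch below computes the absolute gain in recognition rate and the relative reduction in error rate of cascade MRFs over each compared method. Only the four recognition rates come from Table 1; the relative error reduction is a derived quantity that the paper does not report, and the function name is hypothetical.

```python
# Recognition rates on Hanja2 from Table 1, in percent.
rates = {
    "SCSM [10]": 83.14,
    "MRFs [31]": 84.95,
    "T2MRFs [32]": 85.79,
    "Cascade MRFs": 87.32,
}

def relative_error_reduction(baseline_rate, new_rate):
    """Percentage of the baseline's errors that the new method removes."""
    baseline_error = 100.0 - baseline_rate
    new_error = 100.0 - new_rate
    return 100.0 * (baseline_error - new_error) / baseline_error

cascade = rates["Cascade MRFs"]
for name, rate in rates.items():
    if name != "Cascade MRFs":
        gain = cascade - rate
        reduction = relative_error_reduction(rate, cascade)
        print(f"{name}: +{gain:.2f} points, {reduction:.1f}% fewer errors")
```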

5. Conclusions

Within the theoretically well-founded MRF framework, we propose a cascade model to extract reliable strokes from Chinese characters. It involves the BU MRF-based stroke segmentation and the TD MRF-guided stroke extraction. Therefore, the MRF unifies both BU and TD processes according to Eqs. (3) and (6). This joint segmentation and recognition approach is supported by psychological theories of vision [24,25].

In stroke segmentation, we use the second-order neighborhood system. To describe the large variations of the directional features, we derive the likelihood clique potential from GMMs. We penalize inconsistent labels at neighboring sites by the pair-site prior clique potential, which ensures smoothness of substrokes. The best stroke segmentation is achieved by minimizing the joint energy function using the RL algorithm. We believe that a higher order neighborhood system with more complex cliques would further improve the stroke segmentation. Compared with traditional BU methods, the BU MRF keeps the smoothness of substrokes and thus avoids broken strokes.

The stroke extraction and character recognition proceed simultaneously by structural match between input substrokes and character models. We first use the MRF-based stroke labels to extract possible candidate strokes, and then conduct matching between stroke labels and candidate strokes to find the best strokes. Meanwhile, the cost of the structural match can be used for character recognition. The TD MRF can effectively avoid broken and touching strokes. The recognition accuracy of cursive Chinese characters is also encouraging. Future work may generalize cascade MRFs to more sophisticated image segmentation and object recognition problems.

Acknowledgements

This work was partly supported by the Hong Kong Research Grant Council (Project CityU 118608) and the City University of Hong Kong (Project 9041369). This work was also partly supported by the National Natural Science Foundation of China (60802085), the Program for New Century Excellent Talents in University (2008) supported by the Ministry of Education (MOE) of China, the Research Fund for the Doctoral Program of Higher Education in China (20070699015), and the NPU Foundation for Fundamental Research (W018103).

References

[1] I. Biederman, Recognition-by-components: a theory of human image understanding, Psychological Review 94 (2) (1987) 115–147.
[2] R. Cao, C.L. Tan, A model of stroke extraction from Chinese character images, in: Proceedings of the International Conference on Pattern Recognition, vol. 4, 2000, pp. 368–371.
[3] F.-H. Cheng, Multi-stroke relaxation matching method for handwritten Chinese character recognition, Pattern Recognition 31 (4) (1998) 401–410.
[4] H.-P. Chiu, D.-C. Tseng, A novel stroke-based feature extraction for handwritten Chinese character recognition, Pattern Recognition 32 (12) (1999) 1947–1959.
[5] R.I. Chou, A. Kershenbaum, E.K. Wong, Representation and recognition of handprinted Chinese characters by string-matching, Information Sciences 67 (1993) 1–34.
[6] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood from incomplete data via the EM algorithm, Journal of the Royal Statistical Society, Series B 39 (1) (1977) 1–38.
[7] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., John Wiley & Sons, New York, 2001.
[8] K.-C. Fan, W.-H. Wu, A run-length-coding-based approach to stroke extraction of Chinese characters, Pattern Recognition 33 (11) (2000) 1881–1895.
[9] J.M. Hammersley, P. Clifford, Markov field on finite graphs and lattices, unpublished, 1971.
[10] I.-J. Kim, J.-H. Kim, Statistical character structure modeling and its application to handwritten Chinese character recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (11) (2003) 1422–1436.
[11] C. Lee, B. Wu, A Chinese-character-stroke-extraction algorithm based on contour information, Pattern Recognition 31 (6) (1998) 651–663.
[12] S.Z. Li, Markov Random Field Modeling in Image Analysis, Springer-Verlag, New York, 2001.
[13] S.Z. Li, H. Wang, K.L. Chan, Minimization of MRF energy with relaxation labeling, Journal of Mathematical Imaging and Vision 7 (2) (1997) 149–161.
[14] F. Lin, X. Tang, Off-line handwritten Chinese character stroke extraction, in: Proceedings of the International Conference on Pattern Recognition, vol. 3, 2002, pp. 249–252.
[15] J.-R. Lin, C.-F. Chen, Stroke extraction for Chinese characters using a trend-followed transcribing technique, Pattern Recognition 29 (11) (1996) 1789–1805.
[16] C.-L. Liu, S. Jaeger, M. Nakagawa, Online recognition of Chinese characters: the state-of-the-art, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2) (2004) 198–213.
[17] C.-L. Liu, I.-J. Kim, J.H. Kim, Model-based stroke extraction and matching for handwritten Chinese character recognition, Pattern Recognition 34 (12) (2001) 2339–2352.
[18] C.-L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: investigation of normalization and feature extraction techniques, Pattern Recognition 37 (2) (2004) 265–279.
[19] K. Liu, Y.S. Huang, C.Y. Suen, Robust stroke segmentation method for handwritten Chinese character recognition, in: Proceedings of the International Conference on Document Analysis and Recognition, vol. 1, 1997, pp. 211–215.
[20] O. Mendoza, P. Melin, G. Licea, A hybrid approach for image recognition combining type-2 fuzzy logic, modular neural networks and the Sugeno integral, Information Sciences 179 (2009) 2078–2101.
[21] D. Mumford, Pattern theory: the mathematics of perception, in: T. Li (Ed.), Proc. Int'l. Congress of Mathematicians, vol. 1, Higher Education Press, Beijing, 2002.
[22] Y.-M. Su, J.-F. Wang, A novel stroke extraction method for Chinese character using Gabor filters, Pattern Recognition 36 (3) (2003) 635–647.
[23] Y.-M. Su, J.-F. Wang, Decomposing Chinese characters into stroke segments using SOGD filters and orientation normalization, in: Proceedings of the International Conference on Pattern Recognition, vol. 2, 2004, pp. 351–354.
[24] A.M. Treisman, G. Gelade, A feature-integration theory of attention, Cognitive Psychology 12 (1) (1980) 97–136.
[25] S.P. Vecera, M.J. Farah, Is visual image segmentation a bottom-up or an interactive process, Perception and Psychophysics 59 (8) (1997) 1280–1296.
[26] X. Wang, X. Ding, C. Liu, Gabor filters-based feature extraction for character recognition, Pattern Recognition 38 (3) (2005) 369–379.
[27] M. Weber, M. Welling, P. Perona, Unsupervised learning of models for recognition, in: Proceedings of the 6th European Conference on Computer Vision – Part I, 2000, pp. 18–32.
[28] D.S. Yeung, H.S. Fong, A fuzzy substroke extractor for handwritten Chinese characters, Pattern Recognition 29 (12) (1996) 1963–1980.
[29] J. Zeng, Z.-Q. Liu, Stroke segmentation of Chinese characters using Markov random fields, in: Proceedings of the International Conference on Pattern Recognition, 2006, pp. 868–871.
[30] J. Zeng, Z.-Q. Liu, Type-2 fuzzy hidden Markov models and their application to speech recognition, IEEE Transactions on Fuzzy Systems 14 (3) (2006) 454–467.
[31] J. Zeng, Z.-Q. Liu, Markov random field-based statistical character structure modeling for handwritten Chinese character recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (5) (2008) 767–780.
[32] J. Zeng, Z.-Q. Liu, Type-2 fuzzy Markov random fields and their application to handwritten Chinese character recognition, IEEE Transactions on Fuzzy Systems 16 (3) (2008) 747–760.
[33] S.-C. Zhu, R. Zhang, Z. Tu, Integrating bottom-up/top-down for object recognition by data driven Markov chain Monte Carlo, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, pp. 738–745.