
Synthesizing queries for handwritten word image retrieval




Pattern Recognition 45 (2012) 3270–3276

doi:10.1016/j.patcog.2012.02.015


Jose A. Rodriguez-Serrano, Florent Perronnin

Textual and Visual Pattern Analysis (TVPA) Area, Xerox Research Centre Europe, 6 Chemin de Maupertuis, 38240 Meylan, France


Article history:

Received 4 January 2012

Accepted 15 February 2012
Available online 8 March 2012

Keywords:

Data synthesis

Word-spotting

Handwriting recognition

Hidden Markov models


Abstract

We propose a method to perform text searches on handwritten word image databases when no ground-truth data is available to learn models or select example queries. The approach proceeds by synthesizing multiple images of the query string using different computer fonts. While this idea has been successfully applied to printed documents in the past, its application to the handwritten domain is not straightforward. Indeed, the domain mismatch between queries (synthetic) and database images (handwritten) leads to poor accuracy.

Our solution is to represent the queries with robust features and use a model that explicitly accounts for the domain mismatch. While the model is trained using synthetic images, its generative process produces samples according to the distribution of handwritten features. Furthermore, we propose an unsupervised method to perform font selection, which has a significant impact on accuracy. Font selection is formulated as finding an optimal weighted mixture of fonts that best approximates the distribution of handwritten low-level features. Experiments demonstrate that the proposed method is an effective way to perform queries without using any human-annotated example in any part of the process.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

We consider the problem of handwritten word image retrieval, where a collection of handwritten word images is available and the goal is to rank the images in order of relevance to a given query word. The relevance score indicates how likely each word image represents the query. One of the prominent applications of word image retrieval is building search engines for document images, also referred to as word spotting [1–9].

In a first type of technique, the query is a string representing the word of interest, and classical handwriting recognition techniques, which model words as sequences of sub-word units such as characters, are applied to parse the images. These are usually adapted to generate lexicons/tries on-the-fly to encode only the word(s) of interest [9,10]. An advantage of such an approach is that it allows querying arbitrary strings. However, these systems need to be trained with a sufficient amount of data from the target domain.

We work under the assumption that such a recognition system is not available. This is the case, for instance, when a new set of unannotated data becomes available in a different language/script than any previously annotated data. Here, we cannot re-use existing recognizers, and we assume that there are no resources available to ground-truth the data and train a new recognizer. A similar situation would be encountered with historical scripts or documents with high degradation.

Manmatha et al. [1] showed that an efficient way to browse such datasets is through query-by-example. Here, one defines a similarity function between word images. Rather than attempting to recognize each word image, searching for a query word amounts to finding word images with sufficiently high similarity to a query image.

While this approach has been successfully adopted in many subsequent works [2,4–6,11], a practical drawback is that it does not allow searching for arbitrary strings; instead, one needs a prototype image of the word to kick off the search. This is not an impediment in some scenarios (searching for other instances of a seen word), but represents a significant burden for unseen words, especially rare words.

Our objective is to build a system that is able to query for any arbitrary word (string) when no previous training data is available. The contribution of this article is a system where an arbitrary query string is accepted; multiple images of the query are synthesized automatically from different computer fonts; a model is learned on-the-fly from the synthetic images; and the model is employed to assign a relevance score to each image in the database.

One difficulty is that a model learned on synthetic samples may not be fully representative of real handwritten images. Actually, this phenomenon is known in machine learning as "concept drift", occurring when training and target sample sets have been generated by different sources. Thus, a key aspect is how to avoid the concept drift between synthetic and real images. To that end, we employ a model where word-dependent parameters are learned from synthetic data, whereas style-dependent parameters (those more likely to cause concept drift) are estimated from handwritten data. This is done in an unsupervised way, thus meeting the requirement of not depending on labeled examples.

On top of that, we propose a font weighting mechanism, where each font is weighted according to its style (handwritten-like fonts receive stronger weights). The weighting is performed in a principled way, by finding the mixture of fonts that best approximates the low-level feature distribution of handwritten word images, and again this is done in an unsupervised manner.

We remark that the intention of the proposed solution is not to compete with handwriting recognition models, which are likely to obtain very good performance on the same task when trained with an adequate and sufficient amount of data. In contrast, the proposed idea represents a solution at a reduced cost when training data is not available: the proposed method does not require any ground-truthed data.

Experiments demonstrate that the proposed synthesis, modeling and font weighting provide a competitive system to query for strings when no training data is available. Except for our preliminary conference version [12], we are not aware of any previous attempts to synthesize word images to query handwritten documents.

The remainder of the paper is organized as follows. Section 2 discusses the related work. Section 3 presents the retrieval framework, describing the processes of image synthesis, feature extraction and word modeling. Section 4 proposes a method to improve the accuracy by performing automatic font selection. Section 5 reports the experimental validation of the proposed approach. Finally, in Section 6 conclusions are drawn.

2. Related work

The generation of synthetic examples to train classifiers is increasingly gaining interest in the computer vision community. For instance, for the problem of traffic sign classification, Hoessler et al. [13] train a classifier with synthetic images undergoing a variety of distortions. For pedestrian detection, Marin et al. [14] synthesize images of pedestrians using virtual world simulators, and actually report performances close to systems trained with real images. Similarly, Schels et al. [15] use 3-D modeling software to render images of objects from different viewpoints to produce the training set for an object classifier. Wang et al. [16] use synthetic characters to learn classifiers for text detection and recognition in natural images.

The idea of exploring a document image by synthesizing an image of the query text has previously been applied by Konidaris et al. [17]. In this work, the domain of interest is typed documents, in which the font is uniform and known. Thus, it is possible to synthesize word images with exactly the same font and obtain reliable queries.

However, in the case of unknown fonts or multi-font documents this approach falls short. As a matter of fact, Sankar et al. [18] report experiments on a historical Telugu document collection, rendering unseen words using a standard typed font. They obtain a recognition accuracy of 33%, while the performance of the system trained on real data approaches 80%, and point to the font style mismatch as the likely cause of the decrease in accuracy.

Therefore, it is clear that the immediate application of this idea to handwritten words is difficult, especially for multi-writer collections of contemporary handwriting. If, as shown, a mismatch between two typed fonts already causes a degradation, this effect is even more likely to occur with handwritten word images, which present higher variability and are harder to mimic with computer fonts. Furthermore, despite some efforts (see e.g. [19]), accurate handwritten word synthesis remains an open problem.

Thus, the main differences between [17] and the proposed work are that (i) we work with the more challenging case of handwritten collections, (ii) we synthesize a model rather than a single image, (iii) our model explicitly considers the asymmetry between the queries (synthetic) and the documents (handwritten), and (iv) we propose a font selection mechanism.

Note that one way to explicitly account for the domain asymmetry would be to use metric learning techniques (see, for instance, [20]). However, these use supervised learning and would require labeled data, something we explicitly would like to avoid.

Finally, we briefly highlight the work of Leydier et al. [21]. They enable text searches in old historical manuscripts by composing a query image from images of individual character "glyphs". Although this is related to the idea of synthesizing queries, their approach requires an off-line, manual annotation to assign a class to each prototype glyph. Thus, even if it does not use learning explicitly, it can also be considered a "supervised" approach. Furthermore, it is not immediately straightforward to apply this approach to cursive handwriting styles, where characters are not isolated, or to multi-writer scenarios.

3. Retrieval framework

We assume the existence of a collection of handwritten word images. The workflow of the proposed solution is described below.

Off-line processing: only once for each collection

1. On each image in the collection, perform a feature extraction operation.
2. Obtain the distribution of features (vocabulary) of the target collection.

On-line processing: for each query

1. A user types a string Q corresponding to a word to be searched: the query.
2. Multiple images of the string Q are synthesized using a variety of pre-defined computer fonts.
3. The synthesized images undergo normalization and feature extraction operations.
4. A word model is trained which makes use both (i) of the "synthesized" features, and (ii) of the feature distribution (vocabulary) obtained in the off-line phase.
5. The obtained model is employed to score all the images in the collection according to the probability that each image contains the query.

Details of the image synthesis step, feature extraction (with normalization), and word modeling are as follows.

Image synthesis. An image of a word Q is automatically rendered using a set of F specified computer fonts. The size of the font is fixed to a standard value. Fig. 2 shows synthesized images of the query "Monsieur" ("Sir" in English) with F = 25. (Fig. 1 displays real handwritten examples for comparison.)
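
For concreteness, a rendering step along these lines can be implemented with the Pillow library, as in the sketch below; the font file names, image height and margins are illustrative placeholders, not the authors' settings.

```python
from PIL import Image, ImageDraw, ImageFont

def synthesize_query(word, font_paths, height=64, point_size=48):
    """Render one image of `word` per computer font (illustrative sketch)."""
    images = []
    for path in font_paths:
        font = ImageFont.truetype(path, size=point_size)  # fixed standard size
        left, top, right, bottom = font.getbbox(word)
        img = Image.new("L", (right - left + 20, height), color=255)
        draw = ImageDraw.Draw(img)
        # white background, black ink, word roughly centered vertically
        draw.text((10 - left, (height - (bottom - top)) // 2 - top),
                  word, font=font, fill=0)
        images.append(img)
    return images

# e.g. queries = synthesize_query("Monsieur", ["KunstlerScript.ttf", "Arial.ttf"])
```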

Fig. 1. Handwritten examples of the word "Monsieur".

Fig. 2. Synthesized examples of the word "Monsieur" with 25 different computer fonts.

Fig. 3. Schema of the SC-HMM.

Feature extraction. Features are extracted from a word image using a sliding window approach: a window slides along the word image from left to right, and at each position a feature vector is computed for the pixels inside the window.

We employ LGH features [22] since they have shown state-of-the-art results for word spotting. In a nutshell, the feature extraction in a given window consists of three steps:

1. Adjust the upper and lower bounds of the sliding window to the area actually containing pixels.
2. Split the reduced window into a 4×4 grid.
3. At each of the obtained cells, compute the gradient and accumulate a histogram of eight angle orientations.

For further details, we refer the reader to [22]. Experiments in Section 5 demonstrate that the choice of these features is one of the factors that contribute to the good results of our approach.
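
As an illustration only, the following NumPy sketch computes sliding-window gradient histograms in the spirit of the three steps above; the exact LGH definition is given in [22], and the window width, step and ink threshold used here are arbitrary assumptions.

```python
import numpy as np

def lgh_features(img, win_w=16, step=4, grid=4, bins=8):
    """Rough LGH-style features: img is a 2-D array with ink as high values.

    Returns a (num_windows, grid*grid*bins) sequence of feature vectors.
    """
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), 2 * np.pi)    # orientations in [0, 2*pi)
    feats = []
    for x0 in range(0, img.shape[1] - win_w + 1, step):
        wm, wa = mag[:, x0:x0 + win_w], ang[:, x0:x0 + win_w]
        rows = np.where(wm.sum(axis=1) > 1e-6)[0]  # rows that contain ink
        if rows.size == 0:
            feats.append(np.zeros(grid * grid * bins))
            continue
        wm, wa = wm[rows[0]:rows[-1] + 1], wa[rows[0]:rows[-1] + 1]
        hist = np.zeros((grid, grid, bins))
        ch = max(1, wm.shape[0] // grid)           # cell height
        cw = max(1, wm.shape[1] // grid)           # cell width
        for i in range(grid):
            for j in range(grid):
                cm = wm[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
                ca = wa[i * ch:(i + 1) * ch, j * cw:(j + 1) * cw]
                b = np.minimum((ca * bins / (2 * np.pi)).astype(int), bins - 1)
                for k in range(bins):              # accumulate magnitude per bin
                    hist[i, j, k] = cm[b == k].sum()
        feats.append(hist.ravel())
    return np.asarray(feats)
```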

Word modeling. Once a set of synthetic images has been generated and their features extracted, a crucial point is to model the word class represented by these images. The challenge is the mismatch between the domains of the training images (synthetic) and test images (handwritten). We propose to overcome this issue by using a semi-continuous hidden Markov model (SC-HMM). The key point of this model is that part of the parameters can be estimated from handwritten data in an unsupervised way, while the remaining word-dependent parameters are directly learned from the synthetic examples.

In the following, a feature sequence of length L is denoted by $X = x_1, \ldots, x_L$, where upper-case variables such as $X$ denote full sequences and lower-case variables such as $x$ denote feature vectors of the sequences. Abusing the language, we may refer to $X$ as the word image.

The method proceeds in two phases. In an off-line phase, we obtain the distribution of features in the handwritten domain. To that end, we fit a Gaussian mixture model (GMM) to a set of $M$ feature vectors $\{x_m\}_{m=1}^{M}$ extracted from handwritten samples of the target collection. The standard EM estimation [23] yields a mixture $q(x) = \sum_{k=1}^{K} \pi_k q_k(x)$, where $q_k$ is a shorthand for $\mathcal{N}(x \mid \mu_k, \Sigma_k)$, a multivariate Normal distribution with mean $\mu_k$ and covariance $\Sigma_k$. The number of components $K$ is typically cross-validated. Note that this step is unsupervised: even though we make use of handwritten word images, we do not need them to be ground-truthed.
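
For concreteness, such a vocabulary GMM can be fit with any off-the-shelf EM implementation; the sketch below uses scikit-learn with K = 512 components as in Section 5, while the diagonal covariances and the input file name are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Fit the "visual vocabulary" GMM q(x) on unlabeled handwritten feature
# vectors; no ground truth is needed. The file path is illustrative.
X_handwritten = np.load("handwritten_lgh_features.npy")   # shape (M, 128)
q = GaussianMixture(n_components=512, covariance_type="diag", max_iter=100)
q.fit(X_handwritten)
# q.weights_, q.means_, q.covariances_ hold pi_k, mu_k, Sigma_k of the mixture.
```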

Following Perronnin [24], we draw an analogy between a GMM and the "visual vocabulary" of bag-of-visual-words approaches: each Gaussian component can be interpreted as a visual word (where $\mu_k$ and $\Sigma_k$ encode the mean and dispersion of the feature values of each word).

Here, we extend the bag-of-visual-words representation to sequences by imposing a generative model where the visual word selection is governed by a hidden Markov model. The generative process is as follows: in a given state $j$, a visual word $k$ is drawn from the multinomial distribution given by the weights $\{w_{jk}\}_{k=1}^{K}$, and a feature $x$ is generated by drawing a random sample from the $k$-th word Gaussian $q_k(x)$. Then, the system transitions to a different state $i$ as determined by the transition probabilities $A_{ji}$, and the process is repeated. This can be interpreted as a state-dependent bag-of-words model.
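
The generative process just described is only a few lines of code; the following toy sketch (not the authors' implementation) makes it concrete.

```python
import numpy as np

def sample_schmm(A, w, means, covs, length, seed=0):
    """Draw a synthetic feature sequence from an SC-HMM.

    A:     (J, J) state transition probabilities
    w:     (J, K) per-state weights over the shared visual-word Gaussians
    means: (K, D) shared means; covs: (K, D) shared diagonal covariances
    """
    rng = np.random.default_rng(seed)
    j, sample = 0, []                              # start in the first state
    for _ in range(length):
        k = rng.choice(w.shape[1], p=w[j])         # pick a visual word in state j
        sample.append(rng.normal(means[k], np.sqrt(covs[k])))  # emit a feature
        j = rng.choice(A.shape[0], p=A[j])         # move to the next state
    return np.asarray(sample)
```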

This results in a hidden Markov model where all states share the same means and covariances $\{\mu_k, \Sigma_k\}_{k=1}^{K}$. This is known as an SC-HMM. The emission probability of this HMM is

$$p(x \mid j) = \sum_{k=1}^{K} w_{jk} \, q_k(x) \qquad (1)$$

In the run-time phase, the free, word-dependent parameters $w_{jk}$ and $A_{ji}$ are trained using the synthetic samples. Note that, crucially, even if the training occurs in the synthetic space, the generative model produces features that are distributed according to the visual word model of the handwritten space, which reduces the impact of the concept drift issue.

If $X_1, \ldots, X_F$ denote the features of the synthetic samples, the parameters $w_{jk}$ and $A_{ji}$ are trained by maximizing the MLE criterion

$$\sum_{i=1}^{F} \log p(X_i \mid \theta) \qquad (2)$$

using the EM algorithm with the Baum–Welch re-estimation procedure [25]. Once the models are trained, a handwritten word sample $X = x_1, \ldots, x_L$ is scored as

$$S(X) = \frac{p(X \mid \theta)}{\prod_{l=1}^{L} q(x_l)}, \qquad (3)$$

where $\theta$ summarizes all the SC-HMM parameters. The likelihood in the numerator is typically computed with the forward–backward algorithm [25], while the denominator is known as a score normalization (see [8]). A schema of the SC-HMM, summarizing the model topology and parameters, is shown in Fig. 3. Further details on SC-HMMs can be found in [8].
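
In the log domain, the score of Eq. (3) reduces to a subtraction; a minimal sketch, assuming the two log-likelihoods have been computed elsewhere:

```python
import numpy as np

def normalized_score(hmm_loglik, bg_logliks):
    """Eq. (3) in the log domain.

    hmm_loglik: log p(X | theta), e.g. from a forward pass over the SC-HMM
    bg_logliks: per-frame log q(x_l) under the background GMM, shape (L,)
    """
    return hmm_loglik - float(np.sum(bg_logliks))
```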

4. Font weighting

A key aspect of the proposed system is the choice of fonts for the synthesis. Using handwritten-like fonts would yield more realistic samples, while using a broad variety of fonts may increase the generalization ability of the system.

4.1. Issues of supervised font selection

A straightforward solution to font selection would be to evaluate the accuracy of all the possible font combinations on a validation set. However, this is impractical: since we have two choices per font (include it in the training set or not), even for a small number of fonts it would be extremely time-consuming to try all possible combinations (for 10 fonts, that already makes $2^{10} - 1 = 1023$ combinations).

Heuristic strategies can be applied to reduce the number of combinations, for example by first evaluating all the fonts independently, then ranking the fonts by accuracy, and finally evaluating the combinations of the top $1, \ldots, F$ fonts. This would reduce the complexity to a linear one (evaluate F combinations) at the expense of finding a sub-optimal solution. This was the solution proposed in [12].

Still, important drawbacks can be identified in this font selection approach. First, running training/test experiments with a decently sized validation set and a sufficient number of fonts and queries takes a long time in practice. Secondly, these methods are supervised, making a ground-truthed training set necessary. Again, this is a requirement we deliberately would like to avoid. Finally, this selection method would depend on the particular keywords used for validation.

Therefore, a font selection method should ideally be efficient, unsupervised and keyword-independent. We describe a solution satisfying these properties in the next subsection.

4.2. Proposed unsupervised font weighting

The proposed solution does not select fonts in a binary fashion but assigns a continuous weight to each font. The central idea is to estimate a set of weights such that the weighted distribution of synthesized samples optimally approximates the distribution of handwritten samples.

4.2.1. Learning the font weights

Let $q$ denote the distribution of handwritten samples, estimated using a GMM as explained in the previous section. To learn GMMs for each font $i = 1, \ldots, F$, we first synthesize a set of word images (e.g. hundreds of the most common words in English) for a given font, and then extract LGH feature sequences. We subsequently fit a GMM for font $i$ through maximum a posteriori (MAP) adaptation using $q$ as the prior distribution. The GMM has $K$ Gaussians by construction and, analogously to $q$, we express it as $p_i = \sum_{k=1}^{K} \pi_{i,k} \, p_{i,k}$. Although the use of MAP is not strictly necessary when a large number of synthetic features are used, we will explain later how this can be leveraged to gain efficiency.
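
For illustration, a mean-only MAP adaptation in the spirit of Reynolds et al. [26] is sketched below; the relevance factor r, the diagonal covariances and the restriction to adapting only the means are assumptions, since the paper does not spell out these details.

```python
import numpy as np

def map_adapt_means(q_means, q_vars, q_weights, X, r=16.0):
    """Mean-only MAP adaptation of the prior GMM q to font data X (sketch).

    q_means: (K, D), q_vars: (K, D) diagonal, q_weights: (K,), X: (N, D)
    """
    # posterior of each prior Gaussian for each sample (diagonal covariances)
    logp = (-0.5 * (np.log(2 * np.pi * q_vars).sum(1)
                    + (((X[:, None, :] - q_means) ** 2) / q_vars).sum(2))
            + np.log(q_weights))
    gamma = np.exp(logp - np.logaddexp.reduce(logp, axis=1, keepdims=True))
    n_k = gamma.sum(0)                                # soft counts per Gaussian
    ex_k = (gamma[:, :, None] * X[:, None, :]).sum(0) \
           / np.maximum(n_k[:, None], 1e-10)          # per-Gaussian data means
    alpha = n_k / (n_k + r)                           # adaptation coefficient
    return alpha[:, None] * ex_k + (1 - alpha[:, None]) * q_means
```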

To find an optimal set of font weights $\omega = \{\omega_1, \ldots, \omega_F\}$, we propose to minimize the divergence between the distribution $q$ and the weighted mixture $\sum_{i=1}^{F} \omega_i p_i$. If we choose the Kullback–Leibler divergence to measure the dissimilarity between two distributions, it can be shown that we need to maximize

$$E(\omega) = \int_x q(x) \log \left( \sum_{i=1}^{F} \omega_i \, p_i(x) \right) dx \qquad (4)$$

with the constraint $\sum_{i=1}^{F} \omega_i = 1$.

Although this objective function is convex, it is difficult to maximize directly. Therefore, we follow the traditional sampling approach. Let $\mathcal{X} = \{x_m\}_{m=1}^{M}$ be a set of feature vectors extracted from a large set of handwritten samples of the target collection. If we assume that the $x_m$'s have been generated independently from $q$, then according to the law of large numbers we have

$$E(\omega) \approx \hat{E}(\omega, \mathcal{X}) = \frac{1}{M} \sum_{m=1}^{M} \log \left( \sum_{i=1}^{F} \omega_i \, p_i(x_m) \right) \qquad (5)$$

$\hat{E}(\omega, \mathcal{X})$ is still a convex function, and it may be iteratively maximized with a standard EM algorithm as follows:

E-step:

$$\gamma_m(i) = \frac{\omega_i \, p_i(x_m)}{\sum_{j=1}^{F} \omega_j \, p_j(x_m)} \qquad (6)$$

M-step:

$$\omega_i = \frac{1}{M} \sum_{m=1}^{M} \gamma_m(i) \qquad (7)$$
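
These two updates amount to a few lines of NumPy. The sketch below iterates Eqs. (6) and (7) in the log domain for numerical stability, assuming the per-font log-likelihoods log p_i(x_m) have been precomputed.

```python
import numpy as np

def learn_font_weights(logp, n_iter=50):
    """EM for the font weights of Eqs. (6)-(7).

    logp: (M, F) array of log p_i(x_m) for each handwritten sample m and font i.
    Returns omega, a length-F weight vector summing to 1.
    """
    M, F = logp.shape
    omega = np.full(F, 1.0 / F)
    for _ in range(n_iter):
        # E-step (Eq. 6): responsibilities gamma_m(i), computed in the log domain
        log_num = logp + np.log(omega)
        gamma = np.exp(log_num - np.logaddexp.reduce(log_num, axis=1)[:, None])
        # M-step (Eq. 7): omega_i = (1/M) * sum_m gamma_m(i)
        omega = gamma.mean(axis=0)
    return omega
```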

4.2.2. Leveraging the font weights

The learned $\omega_i$'s can then be applied to weight the contribution of each font. Given a query word, let us recall that $\{X_1, \ldots, X_F\}$ denotes the set of word images generated for the $F$ fonts. The training objective can be modified to maximize a weighted MLE criterion:

$$\sum_{i=1}^{F} \omega_i \log p(X_i \mid \theta) \qquad (8)$$

Note that we recover the unweighted MLE criterion (2) if we set $\omega_i = 1/F$ (up to a multiplicative factor $1/F$). Again, the weighted MLE criterion (8) can be maximized using the EM algorithm. It can be shown that, compared to the unweighted criterion, this simply leads to weighting the contribution of each synthetic sample (font) with the corresponding $\omega_i$ when computing the accumulators in the E-step of the SC-HMM training.

The proposed font weighting scheme is a principled approach which possesses the three required properties: it is efficient, keyword-independent and unsupervised. The experimental section will show that it brings an improvement in practice in terms of both accuracy and speed.

4.2.3. Efficient Gaussian evaluation

Note that we need to compute $F \times K \times M$ likelihood values $p_{i,k}(x_m)$. This is very expensive for a large number of Gaussians $K$ (we use $K = 512$ in our experiments). We can speed up the computation using a procedure inspired by the speech recognition literature (see [26]). This technique is based on two observations:

1. When a GMM $q$ with a large number of Gaussians $K$ is evaluated on a sample $x_m$, only a few of the Gaussians will contribute significantly to the likelihood value.
2. The Gaussians of an adapted GMM retain a correspondence with the mixtures of the prior GMM. Therefore, if $q_k(x_m)$ is high, the $p_{i,k}(x_m)$ values should be high for all $i = 1, \ldots, F$.

The following approximation is derived based on these observations. First, for each $m$ we compute the $K$ values $q_k(x_m)$ and retain the $K'$ highest values, with $K' \ll K$. Let $I_m$ denote the set of indices of these Gaussians. Then, for each font $i$, we approximate $p_i(x_m) \approx \sum_{k \in I_m} \pi_{i,k} \, p_{i,k}(x_m)$.

Hence, the total number of Gaussian computations is reduced from $F \times K \times M$ to $M \times (K + F \times K')$. For $F = 100$ and $K' = 5$, this leads to a reduction by a factor of about 50.
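
A sketch of this shortlist evaluation is given below, assuming diagonal-covariance GMMs stored as plain arrays; the data layout and function names are illustrative.

```python
import numpy as np

def diag_gauss_logpdf(x, mu, var):
    """Log-density of diagonal Gaussians: x (D,), mu/var (K, D) -> (K,)."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)

def fast_font_loglik(x, q_gmm, font_gmms, k_prime=5):
    """Top-K' shortlist evaluation of log p_i(x) for every font.

    q_gmm:     dict with 'means' (K, D) and 'vars' (K, D) of the prior GMM
    font_gmms: list of dicts with 'pi' (K,), 'means' (K, D), 'vars' (K, D)
    Only the K' Gaussians ranked best by the prior GMM are evaluated per font,
    exploiting the correspondence preserved by MAP adaptation.
    """
    logq = diag_gauss_logpdf(x, q_gmm['means'], q_gmm['vars'])  # K evaluations
    idx = np.argsort(logq)[-k_prime:]                           # shortlist I_m
    out = []
    for g in font_gmms:                                         # F * K' evaluations
        terms = np.log(g['pi'][idx]) \
                + diag_gauss_logpdf(x, g['means'][idx], g['vars'][idx])
        out.append(np.logaddexp.reduce(terms))
    return np.asarray(out)
```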

5. Experimental validation

5.1. Setup

To validate the proposed system, we carried out a set of experiments on a database of 105 real scanned letters written in French, provided by the customer department of a large corporation. This database is particularly challenging owing to the variability of writers, styles, artifacts and other anomalies such as spelling mistakes. The occurrences of 10 of the words (Monsieur, Madame, contrat, resiliation, salutation, resilier, demande, abonnement, company name and veuillez) are labeled for evaluation purposes.

Standard segmentation techniques are employed to obtain a set of word image hypotheses; over-segmentation is used to produce a large set of such hypotheses. About 250 candidate word images are generated per document image. Each candidate word image is described as a sequence of 128-dimensional LGH features.

A GMM with $K = 512$ Gaussians is trained using approximately $M = 1{,}000{,}000$ feature vectors randomly extracted from a separate set of letters. All the SC-HMMs involved in the experiments below are constrained to this GMM and use 10 states per character.

The performance of the detection task is evaluated in terms of the average precision (AP), which represents the average of the precision values in a precision/recall plot. We perform experiments for the 10 different keywords and report the mean across the 10 keywords (mAP).
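
For reference, one standard way to compute the AP of a ranked list with binary relevance labels is sketched below.

```python
import numpy as np

def average_precision(scores, labels):
    """AP of a ranked list: mean precision at the rank of each relevant item."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    rel = np.asarray(labels)[order].astype(bool)
    if not rel.any():
        return 0.0
    hits = np.cumsum(rel)                      # number of hits up to each rank
    ranks = np.nonzero(rel)[0] + 1             # 1-based ranks of relevant items
    return float(np.mean(hits[rel] / ranks))

# mAP: the mean of average_precision over the 10 keyword queries.
```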

Comparison to alternative methods. In order to assess the role of both the LGH features and the SC-HMM in the proposed approach, we repeat the retrieval experiments in two alternative settings: (i) by replacing the LGH features with a standard set of features, and (ii) by replacing the SC-HMM with a standard image matching approach. These constitute two baselines for typed-to-handwritten matching. Regarding the alternative features, we chose the zoning features proposed by Vinciarelli et al. [27]. This feature set is standard for word representation and consists in counting the pixels in a 4×4 split of the window. As for the image matching approach, we use standard DTW [2].

Fig. 4. The 25 font faces used in the experiments.

Fig. 5. Results (mAP, in %) with single-font models for each of the 25 fonts, comparing the proposed method (LGH + SC-HMM) to the alternative methods (Zoning + SC-HMM, LGH + DTW).

5.2. Results with single fonts

In the first round of experiments, we use a single synthesized sample per query. As opposed to ordinary hidden Markov models, training an SC-HMM with a single sample does not lead to over-fitting, as shown in [8]. Indeed, our experimental results in this paper concur with those of [8] and show that training an SC-HMM with a single sample yields higher accuracy than a template matching approach using DTW.

In this case, for a desired word, we generate a single word image using a computer font. We evaluate the performance of the retrieval task as a function of the employed font face, where we experiment with the most usual computer fonts, shown in Fig. 4. Fig. 5 shows the mAP for each font.

It can be observed that, for 19 out of the 25 fonts, the approach using LGH features and the SC-HMM outperforms the alternative approaches. In particular, for 18 out of the 25 fonts the proposed approach obtains a relative increase of over 20% with respect to the best alternative approach; in 13 out of the 25 the relative increase is over 50%; and in 7 out of 25 it is over 100%. We conclude that both the LGH features and the SC-HMM are crucial for matching typed words against handwritten words.

The importance of learning the GMM q(x) on handwritten samples and re-using it in the SC-HMMs to reduce the concept drift is made clear by the following experiment. When repeating the experiment but learning the GMM from typed text images, the mAP for 23 out of the 25 fonts drops to less than 3%. This poor result is due to the fact that no prior information about handwritten shapes is considered in this case.

Another interesting observation is that the best ranked fonts (e.g. Kunstler Script, French Script, Lucida Handwriting) are very handwritten-like, while the classical typed fonts (e.g. Times, Arial, Courier, OCR) rank low.

Finally, a simple experiment is carried out to compare these results to the case where we query with a handwritten word image. Although we explicitly assume that no prototypes are available at query time, it is worth carrying out this comparison since this is a standard approach in word-spotting. Thus, we repeat the experiments with a randomly chosen example of each word (instead of a synthesized word), perform 10 different runs and average the results. This leads to mAP = 17.7%. Note that this is higher than for most individual fonts, since the query is handwritten and there is no concept drift in this case. However, below we will show that font combinations can significantly outperform this value.

5.3. Results with multiple fonts

In the next experiment, we generate word images using different fonts and use several images to train the SC-HMM. The question is whether the retrieval accuracy can be improved by using multiple fonts.



Fig. 6. Results (mAP, in %) using the best F fonts, as a function of the number of fonts F.

Fig. 7. Top 25 ranked samples for the best keyword (top) and worst keyword (bottom) with the nine best typed fonts.

Table 1. Results (per word and mAP) of font weighting compared to the multiple-font results of Section 5.3.

Word          | Nine-best fonts | Font weighting (F=9) | Font weighting (F=100)
--------------|-----------------|----------------------|-----------------------
Monsieur      | 40.1            | 40.3                 | 51.8
Madame        | 55.8            | 55.8                 | 59.1
Contrat       | 39.5            | 37.4                 | 68.0
Résiliation   | 45.4            | 44.6                 | 50.9
Salutation    | 27.4            | 27.0                 | 27.2
Résilier      | 21.1            | 21.1                 | 26.4
Demande       | 56.0            | 55.7                 | 61.7
Abonnement    | 79.5            | 79.4                 | 83.0
Company name  | 78.5            | 78.2                 | 84.4
Veuillez      | 41.1            | 40.1                 | 51.6
mAP           | 48.4            | 48.0                 | 56.4


Based on the ranking of fonts in Fig. 5, we trained a model using the F best fonts, with F = 1, 2, 3, ..., 25. Fig. 6 shows the mAP as a function of the number F of fonts. The best performance is obtained by considering the best nine fonts (>32%), compared to the 21% obtained when using the best single font. This is a significant improvement in retrieval accuracy. Recall that this is also significantly higher than the 17.7% obtained with a real handwritten query, even though our approach does not make use of any real example. Of course, this set of nine fonts might not be the optimal one among all possible combinations of fonts.

For illustration, Fig. 7 shows the 25 top-ranked samples for the model trained using the nine best fonts. For the best word (AP = 79.5%), all 25 samples are correct, which highlights the practical usefulness of the proposed approach. Even in the case of the worst word (AP = 21.1%), several correct samples appear among the top 25 ranked samples.

5.4. Results with automatic font weighting

The previous section has shown that good accuracy can be obtained with the proposed method using multiple fonts. The present section reports experiments using the font weighting scheme described in Section 4.

We perform the evaluation on a different set of 500 documents. To learn the weights, we use a held-out set of 100 documents. To compare with the font selection method, we treat the experiments of Section 5.3 as a validation run from which the best nine fonts were selected.

In a first experiment, we learn the font weights using only the nine best fonts obtained in Section 5.3.

After repeating the experiments of Section 5.3, but training the SC-HMMs with the weighted MLE criterion using the estimated weights, the obtained mAP is 47.9% (see Table 1). This is on par with the unweighted results (0.5% worse on average). This can be considered an achievement in itself. Indeed, the nine fonts were selected to maximize the validation accuracy for these exact 10 keywords, while the weights computed by our approach are keyword-independent. This gives an unfair advantage to the approach of Section 5.3. Moreover, the font weighting approach is much less computationally intensive, and proceeds in an unsupervised way. In contrast, the font selection approach requires a ground-truthed set for validation experiments.

In the previous experiments, the number of fonts was limited to 25, as a higher number of fonts would yield long experimental times. Actually, it takes about a day to run the experiment reported in Section 5.3 (on a single CPU of a 2.8 GHz AMD Opteron machine).

To show the full power of the font weighting scheme, we select a larger number of fonts, F = 100. Nine of these fonts are the best fonts used in the previous experiment. The remaining 91 fonts have been extracted from the "handwritten" category of the font repository http://www.dafont.com. Estimation of the font weights in this case takes on the order of minutes. By repeating the experiment with the 100 weighted fonts, the obtained mAP is 56.4% (see Table 1), which represents a significant improvement with respect to the nine-best-font case.


6. Conclusions

We have presented a system to perform text searches on a database of handwritten word images. The system synthesizes a set of images using a variety of computer fonts, and makes use of these images to learn a model of the word on-the-fly.

We have demonstrated that the proposed approach outperforms other alternative approaches, and is superior to querying with a handwritten image. The three key elements of the system are robust features, SC-HMM modeling, and unsupervised font selection.

All three elements contribute to reducing the concept drift caused by the asymmetry of modeling real handwritten words with synthetic images. In the experiments, the LGH features show robustness across domains. Moreover, the SC-HMM explicitly models the domain mismatch: by being constrained to a GMM learned on handwritten examples, its generative process produces features according to the distribution of handwritten words. Experiments demonstrate that this is a key aspect and that learning the model fully on synthetic data yields very low accuracy. Finally, a font selection method (which is unsupervised, efficient, and keyword-independent) weights the font contributions to best represent handwritten data, and yields significant improvements in accuracy.

We remark that the proposed system is an effective way to query handwritten word images by text without using any ground-truthed image in the process.

We ask ourselves whether it is also possible to use synthetic data to train word HMMs composed of sub-word models, and whether similar mechanisms to reduce the concept drift can be used in this case. However, this requires further experimentation and is left as future work.

Acknowledgments

The authors would like to thank Thierry Lehoux for generating a dataset of synthetic images of common English words and François Ragnet for useful discussions.

References

[1] R. Manmatha, C. Han, E.M. Riseman, Word spotting: a new approach to indexing handwriting, in: IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, 1996, pp. 631–637.
[2] T.M. Rath, R. Manmatha, Word image matching using dynamic time warping, in: IEEE Conference on Computer Vision and Pattern Recognition, 2003, pp. 521–527.
[3] J. Edwards, Y.W. Teh, D.A. Forsyth, R. Bock, M. Maire, G. Vesom, Making Latin manuscripts searchable using gHMMs, in: Advances in Neural Information Processing Systems, Vancouver, Canada, 2005.
[4] S. Srihari, H. Srinivasan, P. Babu, C. Bhole, Handwritten Arabic word spotting using the CEDARABIC document analysis system, in: Symposium on Document Image Understanding, College Park, MD, 2005.
[5] K. Terasawa, T. Nagasaki, T. Kawashima, Eigenspace method for text retrieval in historical document images, in: International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, pp. 436–441.
[6] E. Ataer, P. Duygulu, Matching Ottoman words: an image retrieval approach to historical document indexing, in: ACM International Conference on Image and Video Retrieval, Amsterdam, The Netherlands, 2007, pp. 341–347.
[7] T. Van der Zant, L. Schomaker, K. Haak, Handwritten-word spotting using biologically inspired features, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (11) (2008) 1945–1957.
[8] J.A. Rodríguez-Serrano, F. Perronnin, Handwritten word-spotting using hidden Markov models and universal vocabularies, Pattern Recognition 42 (9) (2009) 2106–2116.
[9] A. Fischer, A. Keller, V. Frinken, H. Bunke, HMM-based word spotting in handwritten documents using subword models, in: International Conference on Pattern Recognition, Istanbul, Turkey, 2010, pp. 3416–3419.
[10] C. Choisy, Dynamic handwritten keyword spotting based on the NSHP-HMM, in: International Conference on Document Analysis and Recognition, Curitiba, Brazil, 2007, pp. 242–246.
[11] J.A. Rodríguez-Serrano, F. Perronnin, J. Lladós, G. Sánchez, A similarity measure between vector sequences with application to handwritten word image retrieval, in: IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 1722–1729.
[12] J.A. Rodríguez-Serrano, F. Perronnin, Handwritten word image retrieval with synthesized typed queries, in: International Conference on Document Analysis and Recognition, 2009, pp. 351–355.
[13] H. Hoessler, C. Wöhler, F. Lindner, U. Kreßel, Classifier training based on synthetically generated samples, in: International Conference on Computer Vision Systems, 2007.
[14] D. Geronimo, J. Marin, D. Vazquez, A.M. Lopez, Learning appearance in virtual scenarios for pedestrian detection, in: IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, 2010, pp. 137–144.
[15] J. Schels, J. Liebelt, K. Schertler, R. Lienhart, Synthetically trained multi-view object class and viewpoint detection for advanced image retrieval, in: International Conference on Multimedia Retrieval, Trento, Italy, 2011, pp. 3:1–3:8.
[16] K. Wang, B. Babenko, S. Belongie, End-to-end scene text recognition, in: International Conference on Computer Vision, Barcelona, Spain, 2011.
[17] T. Konidaris, B. Gatos, K. Ntzios, I. Pratikakis, S. Theodoridis, S.J. Perantonis, Keyword-guided word spotting in historical printed documents using synthetic data and user feedback, International Journal of Document Analysis and Recognition 9 (2–4) (2007) 167–177.
[18] P. Sankar, C.V. Jawahar, R. Manmatha, Nearest neighbor based collection OCR, in: Document Analysis Systems, Boston, MA, 2010, pp. 207–214.
[19] J. Wang, C. Wu, Y.-Q. Xu, H.-Y. Shum, Combining shape and physical models for online cursive handwriting synthesis, International Journal of Document Analysis and Recognition 7 (4) (2005) 219–227.
[20] B. Kulis, K. Saenko, T. Darrell, What you saw is not what you get: domain adaptation using asymmetric kernel transforms, in: IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[21] Y. Leydier, A. Ouji, F. LeBourgeois, H. Emptoz, Towards an omnilingual word retrieval system for ancient manuscripts, Pattern Recognition 42 (9) (2009) 2089–2105.
[22] J.A. Rodríguez, F. Perronnin, Local gradient histogram features for word spotting in unconstrained handwritten documents, in: International Conference on Frontiers in Handwriting Recognition, Montreal, Canada, 2008.
[23] J.A. Bilmes, A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models, Technical Report TR-97-021, International Computer Science Institute, 1998.
[24] F. Perronnin, Universal and adapted vocabularies for generic visual categorization, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (7) (2008) 1243–1256.
[25] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (2) (1989) 257–286.
[26] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, Speaker verification using adapted Gaussian mixture models, Digital Signal Processing 10 (2000) 19–41.
[27] A. Vinciarelli, S. Bengio, H. Bunke, Offline recognition of unconstrained handwritten texts using HMMs and statistical language models, IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (6) (2004) 709–720.

Jose A. Rodriguez-Serrano graduated in Physics in 2003 at the Universitat de Barcelona and received his Master in Computer Vision in 2006 from the Computer Vision Center (CVC) at the Universitat Autònoma de Barcelona (UAB). In 2009 he completed his Ph.D. thesis at UAB, done in collaboration with the Xerox Research Centre Europe (XRCE), on handwritten word-spotting. After his thesis, he worked as a Research Associate at Loughborough University, UK (2008–2009), and as a Research Fellow at the University of Leeds (2009–2010), involved in projects on video analysis for transportation and event analysis. In 2010 he joined XRCE as a Research Scientist. His main research interests include the analysis of textual images, image and video understanding, and sequential statistical models.

Florent Perronnin received his Engineering degree in 2000 from the École Nationale Supérieure des Télécommunications (Paris, France) and his Ph.D. degree in 2004 from the École Polytechnique Fédérale de Lausanne (Lausanne, Switzerland). From 2000 to 2001 he was a Research Engineer with the Panasonic Speech Technology Laboratory (Santa Barbara, CA), working on speech and speaker recognition. In 2005, he joined the Xerox Research Centre Europe (Grenoble, France). His main interests are in the practical application of machine learning to computer vision tasks such as image classification, retrieval or segmentation.