
Keywords image retrieval in historical handwritten Arabic documents

Raid Saabni, Jihad El-Sana


Raid Saabni
Triangle R&D Center
Kafr Qarea, Israel

E-mail: [email protected]

Jihad El-Sana
Ben-Gurion University of the Negev

Computer Science Department
Beer Sheva, Israel

Abstract. A system is presented for spotting and searching keywords in handwritten Arabic documents. A slightly modified dynamic time warping algorithm is used to measure similarities between words. Two sets of features are generated from the outer contour of the words/word-parts. The first set is based on the angles between nodes on the contour, and the second set is based on the shape context features taken from the outer contour. To recognize a given word, the segmentation-free approach is partially adopted, i.e., continuous word parts are used as the basic alphabet, instead of individual characters or complete words. Additional strokes, such as dots and detached short segments, are classified and used in a postprocessing step to determine the final comparison decision. The search for a keyword is performed by the search for its word parts given in the correct order. The performance of the presented system was very encouraging in terms of efficiency and match rates. To evaluate the presented system, its performance is compared to three different systems. Unfortunately, there are no publicly available standard datasets with ground truth for testing Arabic keyword searching systems. Therefore, a private set of images, partially taken from the Juma'a Al-Majid Center in Dubai, is used for evaluation, while a slightly modified version of the IFN/ENIT database is used for training. © 2013 SPIE and IS&T [DOI: 10.1117/1.JEI.22.1.013016]

1 Introduction
Over 90 million documents were written in Arabic script between the seventh and fourteenth centuries. About seven million documents, in various disciplines, survived the years. During the last 300 years, only 300,000 manuscripts have been revised and edited. The advances in digital scanning and electronic storage have driven the digitization of historical documents for preservation and analysis of cultural heritage. This process enables important knowledge to be accessible to the general public, while protecting historical documents from deterioration by frequent handling. These documents are usually stored as a collection of images, an approach that complicates searching them for a specific word or phrase. To optimally utilize the digital availability of these documents, it is essential to develop an indexing and searching mechanism. Currently, indexing is manually constructed and the search is performed on the scanned pages, one by one. Since this procedure is expensive and time consuming, an automation process is desirable. One may consider using off-line handwriting recognition to convert these document images into text files. However, the research on off-line handwritten script recognition has been limited to domains with small vocabularies, such as automatic mail sorting and check processing. Historical documents add a further level of complexity resulting from lower quality sources due to diverse aging-related and deteriorative factors, such as faded ink, stained paper, dirt, and yellowing. These reasons make a word-spotting approach1 a practical alternative for keyword searching in images of documents. Spotted words, their position within the document, and their textual representation are used to index keywords.

In this paper we present a novel approach for spotting keywords in Arabic historical documents. The pictorial representations of predefined keywords are manually selected by a human operator and assigned the corresponding textual representation. The spotting algorithm relies on two sets of features, which are used in two consecutive dynamic-time-warping (DTW) based classifiers. The first set is extracted from the segments of the simplified contour of the word-parts, and the second is extracted from the closed curve representing the outer contour of the input component. DTW relies on the Euclidean and χ2 distance metrics to measure the similarity between two feature vectors, using the first and second feature sets, respectively. Fortunately, large amounts of the Arabic historical manuscript heritage are well preserved and their quality is reasonable. Therefore, in this research we choose to work with handwritten Arabic documents of good quality, and concentrate on the difficulty of reading cursive handwritten Arabic texts. In the rest of the paper, we will first review closely related work and subsequently present our approach, followed by experimental results. Finally, we draw conclusions and suggest directions for future work.

2 Related Work
Word spotting algorithms aim to reduce the tedious and time-consuming manual annotation applied to the pictorial

Paper 12341 received Aug. 30, 2012; revised manuscript received Nov. 28, 2012; accepted for publication Dec. 19, 2012; published online Jan. 31, 2013.

0091-3286/2013/$25.00 © 2013 SPIE and IS&T

Journal of Electronic Imaging 013016-1 Jan–Mar 2013/Vol. 22(1)

Journal of Electronic Imaging 22(1), 013016 (Jan–Mar 2013)


representation of the input document's words. It commences by clustering similar pictorial representations (images) into classes. For each class, ci, a human operator assigns a textual representation or declares the class insignificant and ignores it. The annotated clusters are used to construct a partial index for the processed documents.

Keyword spotting as detecting a word in an image was initially proposed in Refs. 2 and 3 for printed and handwritten text, respectively. The core of any word spotting procedure is a word-matching algorithm, which measures the distance between pictorial representations of words. Word-matching algorithms roughly fall into two categories:3 pixel-based matching and feature-based matching. Pixel-based matching approaches measure the similarity between the two images in the pixel domain using various metrics, such as the Euclidean distance map, XOR difference, Scott and Longuet-Higgins distance, Hausdorff distance, or the sum of square differences.3–7 In feature-based matching, images are compared using representative features extracted from the images. Similarity measures, such as DTW and point correspondence, are defined on the feature domain.8–14

The DTW technique was implemented and evaluated using various sets of features15–17 and yielded better results than competing techniques.3 Rath and Manmatha15 preprocessed segmented word images to create sets of one-dimensional features, which were compared using DTW. Rath and Manmatha18 describe an approach called word spotting which involves grouping word images into clusters of similar words by using image matching to find similarity. They automatically build an index that links "interesting" words to their locations. To compute image similarities they compare a number of different techniques, including DTW. The word similarities are then used for clustering using both K-means and agglomerative clustering techniques. Alternatively, different approaches were presented to spot words within lines when the document is first segmented into lines only,19,20 or to work on completely unsegmented pages of text, treating the spotting task as an image retrieval task.21–23 Recently, hidden Markov models (HMM)24–26 and neural networks27,28 were used in many studies for keyword searching and spotting tasks.

Rothfeder et al.7 presented an algorithm that draws correspondences between points of interest in two word images and utilizes these correspondences to measure the similarity between the images. Srihari et al.29 developed a word-spotting system that retrieves candidate words from the documents and ranks them based on global word shape features. Yalniz and Manmatha30 present an efficient word-spotting framework to search text in scanned books. SIFT descriptors are extracted over the corner points of each word image, quantized into visual terms using a hierarchical K-means algorithm, and indexed using an inverted file. They perform efficient matching by projecting the visual terms on the horizontal axis and searching for the longest common subsequence between the sequences. A novel keyword spotting method for handwritten documents was described,31 which is derived from a neural network-based system for unconstrained handwriting recognition. In this template-free approach, it is not necessary for a keyword to appear in the training set. The keyword spotting is done using a modification of the CTC token passing algorithm in conjunction with a recurrent neural network.

Srihari et al.32 used global word shape features to measure the similarities among the spotted words and a set of prototypes from known writers, and presented a design for a search engine for handwritten documents.33 They indexed documents using global image features, such as stroke width, slant, and word gaps, as well as local features that describe the shapes of characters and words. Image indexing is carried out automatically using page analysis, page segmentation, line separation, word segmentation, and recognition of characters and words. Rath et al.34,35 extracted discrete feature vectors that describe word images, which are used to train a probabilistic classifier, which is then used to estimate similarity between word images.

A segmentation-free approach was adopted by Lavrenko et al.,36 who used the upper envelope and projection profile features to spot word images without segmenting them into individual characters. They showed that this approach is feasible even for noisy documents. Gatos et al.21

developed a segmentation-free approach for keyword search in historical documents, which combines image preprocessing, synthetic data creation, word spotting, and user feedback technologies. Moghaddam et al.23 presented a language-independent system for preprocessing and word spotting of historical document images that does not require line and word segmentation. The distance between images is measured using Euclidean distance and dynamic time warping. Manmatha and Rothfeder37 described a novel scale space algorithm for automatic segmentation of handwritten documents into words. They detect margins, segment lines, and use an anisotropic Laplacian at several scales to segment lines into words. Llorente et al.38 propose a direct image retrieval framework based on Markov random fields. In their approach, they use different kernels in a nonparametric density estimation together with configurations that explore semantic relationships among concepts at the same time as low-level features. Kuo and Agazzi2 presented a robust algorithm for the recognition of keywords embedded in poorly printed documents. For each keyword, two statistical models are generated: one represents the actual keyword and the other represents all irrelevant words. They adopted dynamic programming to enable elastic matching using the two models. Chen et al.39 developed a font-independent system, based on HMM, to spot user-specified keywords in document images. The system extracts potential keywords from the image using a morphology-based preprocessor and then uses the external shape and internal structure of the words to produce feature vectors. Duong et al.40 presented an approach that extracts regions of interest from grayscale images and classifies them as either textual or nontextual using geometric and texture features. Farooq et al.41 present preprocessing techniques for handwritten Arabic documents to overcome the ineffectiveness of conventional preprocessing for such documents.
They described techniques for slant normalization and slope correction, as well as for line and word separation, for handwritten Arabic documents.

Fischer et al.24 present a word-spotting system based on HMM. They use subword models to train the HMM, which can spot keywords outside the training set using no text line segmentation. A language-neutral approach for searching online handwritten text using the Fréchet distance was presented.42

In their work, they used a variant of the Fréchet distance


for retrieving words even when only a prefix of the word is given as a query.

A few systems for word image retrieval have been presented for a variety of languages, such as Urdu43 and Tamil.44 A word spotting system is presented43 in which the text is first segmented into partial words. A set of features is computed from each partial word. The user queries the system using a word image. The partial words in the query image are then matched with those in the database, and the matched partial words are merged into complete words. Another approach44 involves the use of HMM to characterize the features of the dynamically varying strokes of handwritten characters.

Liang et al.45,46 described a novel approach that uses character-based modeling for training to overcome the lack of existing large data sets for training. A word modeling technique (synthesized words) is used to enable the retrieval of keywords that have not explicitly been seen in the training set.

Leydier et al.47 describe a word retrieval algorithm that allows the indexing of ancient manuscripts in any language and alphabet. The presented approach does not need any layout segmentation and uses features fitted to any type of alphabet directly on the image. The main idea is to compare only informative parts of the template keyword with parts of the documents that may contain information. The gradient fields of gray-level pixels with a norm larger than a threshold are extracted and considered as part of a shape. They define and use these significant pixels as guides for possible text locations called zones of interest (ZOI). Vertical strokes extracted with a morphological opening from the ZOIs are used for matching, while allowing a small displacement in order to find the best location.

3 Our Approach
We present a feature-based approach for keyword search in handwritten Arabic documents, in which features are extracted from the contours of the connected components. We adopted the holistic approach and avoided segmenting words into letters. The search for a given keyword is performed by determining its word-parts in the correct order. The system treats each word-part as a meta component: one main component and associated secondary components. The secondary components represent delayed strokes associated with the word-parts, which are represented by the main component. A delayed stroke's shape may be a dot, a detached vertical segment, or a small curve (usually similar to "s" or "~"), which can appear above or below a word-part.

The word spotting process undergoes several steps, commencing with a binary image as input and ending with generating the required indexes. In the first step, components are extracted from the input image and labeled (see Fig. 1). The text lines are then determined, and the connected components in each line are classified as primary (main) or secondary (additional strokes). A pruning process is applied to reduce the size of the set of word-part images presented to the matching process. Finally, a two-stage matching process using the two feature sets mentioned above is used to cluster similar images. The following sections discuss these steps in detail.

3.1 Line Extraction and Component Labeling
To extract lines from the text and label the components in the correct order, we use an algorithm48 based on the seam carving approach for content-aware image resizing.49 The algorithm uses the signed distance transform to generate an energy map, where extreme points (minima/maxima) indicate the layout of text lines. Dynamic programming is then used to compute the minimum-energy left-to-right paths (seams), which pass along the "middle" of the text lines. Each path intersects a set of components, which determine the extracted text line and estimate its height. Unassigned components that fall within a text line region are added to the component list of that text line. The components between two consecutive text lines are processed when the two lines are extracted. The algorithm assigns components to the closest text line, which is estimated based on the attributes of extracted lines as well as the sizes and positions of components. The resulting images, each representing a word part, are used in the matching process.
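The dynamic-programming step above can be sketched as follows. This is a minimal illustration of the seam search only, assuming the signed-distance-transform energy map has already been computed; the function name `min_energy_seam` is ours, not from Ref. 48.

```python
import numpy as np

def min_energy_seam(energy: np.ndarray) -> np.ndarray:
    """Find a minimum-energy left-to-right path (seam) through an
    energy map by dynamic programming, as in seam carving.
    Returns one row index per column."""
    h, w = energy.shape
    cost = energy.astype(float).copy()
    back = np.zeros((h, w), dtype=int)
    for x in range(1, w):
        for y in range(h):
            # the seam may move at most one row per column step
            lo, hi = max(0, y - 1), min(h, y + 2)
            prev = cost[lo:hi, x - 1]
            k = int(np.argmin(prev))
            back[y, x] = lo + k
            cost[y, x] += prev[k]
    # backtrack from the cheapest end point in the last column
    seam = np.empty(w, dtype=int)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(w - 1, 0, -1):
        seam[x - 1] = back[seam[x], x]
    return seam
```

In the paper's setting, each returned seam would be intersected with the connected components to collect the members of one text line.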

Since we are tracing the contour of each component independently, segmenting the line into words is unnecessary. The extracted components are classified as main or secondary based on their size and location with respect to the estimated baseline. We use "main component" to denote the continuous body of a word part and "secondary component" to refer to an additional stroke. Each secondary component is associated with a main component. A main component together with its secondary components represents an Arabic word part, which will be denoted a meta component (see Fig. 2).

3.2 Pruning
The compared components are normalized according to the average height of the document's main components. The contour properties and the density histograms are used to prune irrelevant components. The ratio between the width and length of the compared word parts' contours is computed. A precomputed ratio for a main component is used to prune word parts with distant ratios. In the next pruning step, we use the horizontal and vertical density histograms of the two compared main components. We calculate the sum of the square differences between the two horizontal and vertical density histograms of the compared main

Fig. 1 The flow of the spotting process, starting at the upper left with a binary image and ending at the bottom left with clusters of spotted words.


components, separately. An experimentally determined threshold is used to eliminate the irrelevant components (see Fig. 3).
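A rough sketch of this two-part pruning filter follows; the helper name `prune` and the threshold values are illustrative assumptions, not the paper's.

```python
import numpy as np

def prune(candidate: np.ndarray, template: np.ndarray,
          ratio_tol: float = 0.3, hist_thresh: float = 0.05) -> bool:
    """Return True if `candidate` survives pruning against `template`.
    Both are binary word-part images normalized to the same height.
    ratio_tol and hist_thresh are illustrative thresholds."""
    # width/height ratio test: reject components with distant ratios
    r_c = candidate.shape[1] / candidate.shape[0]
    r_t = template.shape[1] / template.shape[0]
    if abs(r_c - r_t) > ratio_tol * r_t:
        return False

    # horizontal and vertical density histograms (ink per row/column),
    # resampled to a common length and compared by sum of squares
    def density(img, axis, bins=32):
        h = img.sum(axis=axis).astype(float)
        h /= max(h.sum(), 1.0)
        return np.interp(np.linspace(0, 1, bins),
                         np.linspace(0, 1, len(h)), h)

    for axis in (0, 1):
        d = density(candidate, axis) - density(template, axis)
        if (d ** 2).sum() > hist_thresh:
            return False
    return True
```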

3.3 Matching
The matching algorithm accepts two binary images, w1 and w2, normalized to the same height and representing two word-parts, and returns the distance d(w1, w2) between them. We extract the contours C(w1) and C(w2) of the main components of w1 and w2. The pixels on the boundary of a component form its contour. However, such a representation includes more vertices than required, which often complicates processing and handling these contours. Therefore, we simplify the contour polygon to work with a small number of representative vertices. The simplification is applied iteratively by removing the vertex with the smallest distance from the line passing through its two adjacent neighbors, until an error threshold or a satisfying number of vertices is reached. To ensure meeting the requirement of nearly equal-length edges of the contour polygon, i.e., similar distances between consecutive vertices, we use the resulting short edges and a predefined tolerance value to subdivide the long edges into shorter ones.
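The iterative vertex-removal simplification can be sketched as below; the helper names are ours, and the error-threshold stopping rule and the subsequent edge-subdivision step are omitted for brevity.

```python
import numpy as np

def point_line_dist(p, a, b):
    """Distance from point p to the infinite line through a and b."""
    a, b, p = map(np.asarray, (a, b, p))
    d = b - a
    n = np.hypot(*d)
    if n == 0:
        return float(np.hypot(*(p - a)))
    return abs(d[0] * (p[1] - a[1]) - d[1] * (p[0] - a[0])) / n

def simplify_contour(pts, target_len):
    """Iteratively drop the vertex closest to the line through its two
    neighbors (closed polygon) until `target_len` vertices remain."""
    pts = [tuple(p) for p in pts]
    while len(pts) > target_len:
        n = len(pts)
        dists = [point_line_dist(pts[i], pts[i - 1], pts[(i + 1) % n])
                 for i in range(n)]
        pts.pop(int(np.argmin(dists)))
    return pts
```

Collinear vertices have zero distance to the line through their neighbors, so they are removed first, leaving the corners that carry the shape.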

3.4 Feature Extraction
The vertices of the simplified polygon are used to generate a vector of features. Let C(w) be the contour of the main component of a word part w and P_s = {p_0, ..., p_{n-1}} be the set of vertices of the simplified contour C(w). For the first set, we extract two types of features, semiglobal and local, from the point sequence, which quantify the relation between neighboring strokes as follows:

• For a point p_i, i > 0, we determine the angle between the segment p_{i-1}p_i and the segment p_i p_{i+1}. We refer to this feature as α(p_i). It quantifies the relation between adjacent segments, but does not provide any information concerning the point's broader environment.

• To quantify the relation between a point and its environment, we extract a semiglobal feature, which is defined as the angle between the segment p_{i-1}p_i and the segment p_i p_{i+δ}, where δ determines the width of the considered environment. We will refer to this feature as β(p_i, δ), where δ > 2.

The two features are interpolated linearly using Eq. (1), where w is a normalized positive weight that controls the blending of the two features and δ determines the width of the neighborhood.

f(p_i) = (1 − w) α(p_i) + w β(p_i, δ). (1)
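A small sketch of Eq. (1), computing α, β, and their blend for one vertex of a closed contour; the function names are ours.

```python
import math

def angle(a, b, c):
    """Angle at b between segments b->a and b->c, in radians."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    n1 = math.hypot(*v1)
    n2 = math.hypot(*v2)
    return math.acos(max(-1.0, min(1.0, dot / (n1 * n2))))

def blended_feature(pts, i, w=0.5, delta=3):
    """f(p_i) = (1 - w) * alpha(p_i) + w * beta(p_i, delta) on a
    closed contour: alpha uses the immediate neighbors, beta looks
    delta vertices ahead (semiglobal environment)."""
    n = len(pts)
    alpha = angle(pts[i - 1], pts[i], pts[(i + 1) % n])
    beta = angle(pts[i - 1], pts[i], pts[(i + delta) % n])
    return (1 - w) * alpha + w * beta
```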

The second set of features is based on the shape context feature,50 which considers the set P_s of n equidistant points on the outer contour of the main component. For each point p_i ∈ P_s it assigns n − 1 vectors, p_i p_j, one for each point p_j ∈ (P_s − {p_i}). This set of vectors forms a rich, yet overly detailed description of the shape. To simplify the representation and generate a robust, compact, and highly discriminative descriptor, a relative position distribution is identified. For each point p_i on the shape, a coarse histogram h_i of the relative coordinates of the remaining n − 1 points is computed. The histogram h_i is defined to be the shape context of the point p_i; see Fig. 4 and Eq. (2).

Fig. 2 Meta components with different numbers of additional strokes, which are written in different ways: separated dots and connected pairs or triples.

Fig. 3 Columns (c) and (g) show the similarity of the density histograms of the same word parts.


The shape context feature is defined on the set of points P_s, which are distributed almost uniformly over the contour C(w).

h_i(k) = #{q ≠ p_i : (q − p_i) ∈ bin(k)}. (2)
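The histogram of Eq. (2) can be sketched with log-polar binning, as in the original shape context descriptor; the bin counts and the choice of radial edges here are illustrative assumptions.

```python
import numpy as np

def shape_context(pts, i, r_bins=5, theta_bins=12):
    """Coarse log-polar histogram h_i of the positions of all other
    contour points relative to p_i (Eq. 2)."""
    p = np.asarray(pts, dtype=float)
    rel = np.delete(p, i, axis=0) - p[i]      # vectors p_i -> p_j
    r = np.hypot(rel[:, 0], rel[:, 1])
    theta = np.arctan2(rel[:, 1], rel[:, 0])  # in (-pi, pi]
    # log-spaced radial edges between the nearest and farthest point
    edges = np.logspace(np.log10(r.min()), np.log10(r.max()), r_bins + 1)
    r_idx = np.clip(np.searchsorted(edges, r, side='right') - 1,
                    0, r_bins - 1)
    t_idx = np.minimum(((theta + np.pi) / (2 * np.pi)
                        * theta_bins).astype(int), theta_bins - 1)
    h = np.zeros((r_bins, theta_bins), dtype=int)
    for ri, ti in zip(r_idx, t_idx):
        h[ri, ti] += 1
    return h
```

Every one of the n − 1 relative vectors falls into exactly one bin, so the histogram entries sum to n − 1.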

3.5 Word-Part Matching
DTW is an algorithm for measuring the similarity between two polylines. This technique suits matching sequences with nonlinear warping. For one-dimensional sequences, DTW runs in polynomial time and is usually computed using dynamic programming based on Eq. (3).

D(i, j) = min[D(i, j − 1), D(i − 1, j), D(i − 1, j − 1)] + cost. (3)

In this work, we slightly adjusted the classic DTW to include different costs for insertion, deletion, and substitution. In addition, we adopted an extra cost for consecutive insertions and deletions to avoid introducing long segments that disturb the recognition accuracy. The DTW is computed by taking the minimum of the three possible operations (insertion, deletion, and substitution), including the cost of each operation, as shown in Eq. (7). We assign different cost functions for deletion, insertion, and substitution based on the introduced change. In handwriting, including Arabic handwriting, the difference between two point sequences that represent two different words is usually very small, i.e., inserting/deleting just a few consecutive elements can change the sequence to represent a different word-part.

The distance between the shapes of two word-parts is estimated by computing the similarity among the corresponding feature vectors (see Sec. 3.4). Let S_a and S_b be the sequences of feature vectors extracted from the two word parts. We define cost_ins(i), cost_del(i), and cost_sub(i, j) as the cost of inserting a new element i into the sequence S_a, deleting the element i from the sequence S_a, and substituting the element i in the sequence S_a by the element j in the sequence S_b, respectively. In the case of the geometric features, these differences (subtractions) are simply calculated using

the real values in the normalized coordinates. For the shape context features, we use the χ2 test statistic to compute the difference between the normalized histograms.50

Equations (4)–(6) define the cost of each operation, where del_i and ins_i are the numbers of consecutive deletions or insertions up to point i, respectively:

cost_sub = [S_a(i) − S_b(j)]^2, (4)

cost_del = {[S_a(i + 1) − S_a(i)] · ins_i}^2, (5)

cost_ins = {[S_b(i + 1) − S_b(i)] · del_i}^2. (6)

In order to embed the influence of consecutive deletions or insertions into the DTW minimization problem, we use Eq. (7) to define the dynamic programming:

Fig. 4 (a) A sample of extracting the first feature value for the pixel p_i, where δ is 3, α is in red, and β is in green. (b) Overlay of the outer contour of the main component of the word part (md) on five bins for log r and 12 bins for the angle θ to calculate the shape context value of each pixel using Eq. (2).

Fig. 5 Results of searching for the keyword (Yashbah). When ignoring the additional strokes, two different words sharing the same main parts can be found, as seen in the first two occurrences of the found words.

Fig. 6 Results of clustering four different words taken from two pages of the historical document. In each column, we can see how similar but still different words are clustered to the same class using the presented approach.


D(i, j) = min[D(i, j − 1) + cost_ins, D(i − 1, j) + cost_del, D(i − 1, j − 1) + cost_sub]. (7)

As can be seen, this rule adds a quadratic penalty for consecutive deletions and insertions. This scheme forces these operations to spread over the whole fitting process and thus discourages long runs of deletions or insertions.
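A sketch of this modified DTW with per-operation costs and the quadratic run penalty of Eqs. (4)–(7). The run-length bookkeeping below is one plausible reading of "consecutive" operations along the chosen path, not necessarily the authors' exact recurrence.

```python
import numpy as np

def dtw_distance(sa, sb):
    """DTW with separate substitution/deletion/insertion costs and a
    quadratic penalty on runs of consecutive deletions or insertions.
    sa and sb are 1-D feature sequences."""
    n, m = len(sa), len(sb)
    INF = float('inf')
    D = np.full((n + 1, m + 1), INF)
    runs = np.zeros((n + 1, m + 1), dtype=int)  # current del/ins run length
    D[0, 0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best, run = INF, 0
            if i > 0 and j > 0:  # substitution, Eq. (4)
                c = D[i - 1, j - 1] + (sa[i - 1] - sb[j - 1]) ** 2
                if c < best:
                    best, run = c, 0
            if i > 1:            # deletion, penalized by run length, Eq. (5)
                r = runs[i - 1, j] + 1
                c = D[i - 1, j] + ((sa[i - 1] - sa[i - 2]) * r) ** 2
                if c < best:
                    best, run = c, r
            if j > 1:            # insertion, penalized symmetrically, Eq. (6)
                r = runs[i, j - 1] + 1
                c = D[i, j - 1] + ((sb[j - 1] - sb[j - 2]) * r) ** 2
                if c < best:
                    best, run = c, r
            D[i, j], runs[i, j] = best, run
    return D[n, m]
```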

Given two word-parts w1 and w2, a pruning step compares their complementary parts and indicates a mismatch when the complementary parts are different. In the next step, the system applies the DTW method to measure the distance between the two word parts using the two sets of features. Algorithm 1 presents the pseudo-code to compute the distance between two word-parts w1 and w2, using two thresholds δ1 and δ2. When the distance using the first feature

set (contour points) falls in the range between δ1 and δ2, the second feature set (shape context) serves as a second opinion.

The distance between two word-parts is set to a maximum value when their complementary parts do not agree. When the distance value using the first feature set falls within the range (between δ1 and δ2), the second feature set is used to compute the final distance. The motivation for the second step comes from experimental results showing the ability of the shape context feature to give accurate results when the distance computed by the first feature set is in the threshold neighborhood.

3.6 Clustering Keywords
This phase aims to cluster the shape instances of the different words within the document into a predefined number of clusters, derived from the number of words to be indexed (target clusters), plus one additional cluster for the ignored words. Among the clustering methods used in the literature, we chose K-nearest neighbors for its simplicity and good performance.51 In this case, the k nearest neighbors are extracted for the given word and a maximal vote is used to determine the final result. The clustering process initializes each cluster c_i using a synthetic shape of its corresponding words. The shapes of words are generated by laying out the word-parts horizontally, synthetically and without overlapping. The cluster corresponding to the ignored word-parts is initialized to the empty set. A newly encountered word-part w_i is compared with the already clustered words and added to the cluster with the maximum votes within the k nearest neighbors. If the minimum distance between w_i and the clustered word-parts exceeds a predefined threshold, w_i is added to the ignored word-parts cluster.
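The assignment step can be sketched as a brute-force k-NN vote with an "ignored" fallback; the function name, cluster representation, and the `max_dist` parameter are our illustrative assumptions.

```python
from collections import Counter

def assign_to_cluster(word, clusters, distance, k=5, max_dist=10.0):
    """Assign `word` to the cluster holding the majority of its k
    nearest already-clustered neighbors; fall back to the 'ignored'
    cluster when even the closest neighbor exceeds `max_dist`."""
    scored = [(distance(word, member), label)
              for label, members in clusters.items()
              for member in members]
    scored.sort(key=lambda t: t[0])
    if not scored or scored[0][0] > max_dist:
        return 'ignored'
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

In the paper's pipeline, `distance` would be the two-stage DTW measure of Algorithm 1 and the initial cluster members the synthesized word shapes.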

4 Experimental Results
We have compared the proposed system to the following well-known systems for spotting words in handwritten documents:

• The CEDARABIC system was presented by Srihari et al.32 to spot words in Arabic documents. The CEDARABIC system extracts a 1024-bit binary feature vector for a word. This is carried out by dividing the scanned image of the word into 8 × 4 regions. Gradient (384 bits), structural (384 bits), and concavity (256 bits) features are extracted from each region and quantized to yield a 1024-bit binary vector. To measure the similarity between two words, the distance between the two feature vectors is computed using a correlation measure.52

• Manmatha et al.3 developed a system to spot Latin words in the George Washington collection of letters. The system uses DTW and a set of features (projection profile, word profile, and background/ink transitions) to represent an image of a given word. The projection profile records the sum of the pixels' intensities along a column, the word profile is extracted from the upper and lower profiles of the word, and the background/ink transitions feature is the number of transitions between foreground and background along a

Algorithm 1 The distance between the word parts w1 and w2.

M1 ← mainComponent(w1)
M2 ← mainComponent(w2)
CM1 ← complementaryComponent(w1)
CM2 ← complementaryComponent(w2)
if CM1 ≠ CM2 then
    d(w1, w2) ← MaxValue
    return
end if
C1 ← contour(M1)
C2 ← contour(M2)
P1 ← simplifyContour(C1)
P2 ← simplifyContour(C2)
f1(v1) ← geometricFeatureVector(P1)
f1(v2) ← geometricFeatureVector(P2)
d(w1, w2) ← DTW(f1(v1), f1(v2))
if d(w1, w2) < δ1 or d(w1, w2) > δ2 then
    return
end if
f2(v1) ← shapeContextFeatureVector(P1)
f2(v2) ← shapeContextFeatureVector(P2)
d(w1, w2) ← DTW(f2(v1), f2(v2))


column. The system was adapted to handle Arabic word-parts in lieu of Latin words.

• The system described in Sec. 2, developed by Leydier et al.47 for word image retrieval from ancient manuscripts of any language and alphabet. In our case, we adapted the system to the Arabic alphabet using synthesized word-shape images as queries.

4.1 Data Sets
Due to the absence of a standard database of Arabic historical documents for evaluating keyword searching and spotting systems, we used a data set of 113 pages from three copies, written by different writers, of the 'Book in differences between similar diseases in medicine' edited by Ahmad Ibn Al-Jazzar in the tenth century. We also used an additional set of 100 pages from five different documents (20 pages from each document) from the Juma'a Al-Majid Center in Dubai.53 We used a total of 213 pages including 29,614 words and 103,716 word-parts. To generate a representative list of templates of word-parts/words to be searched, we extracted multiple shapes of 783 different word-parts from the IFN/ENIT database.54 The IFN database contents were slightly modified to include each word-part as one connected component, i.e., incorrectly split components for single word-parts were rejoined to form a single one, and touching components were split manually into individual word-parts. The resulting datasets include several different shapes for each word-part. Using this set of word-parts as an alphabet, we generated a list of 1500 words to be searched for within the given documents. This list was generated by concatenating the word-parts horizontally in the correct order. The same concatenation was performed on the word-parts within the documents before matching in order to preserve consistency. Five hundred of these words were generated using multiple shapes for each word-part to serve as seeds for clusters in the spotting process. On average, each class of these 500 words included 45 different shapes. The same process of treating words as a horizontal sequence of word-parts was adopted in the two compared systems.
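The horizontal concatenation of word-part images can be sketched as follows, assuming binary images already normalized to a common height; the right-to-left layout is our assumption for Arabic script, and the function name is ours.

```python
import numpy as np

def concat_word_parts(parts):
    """Synthesize a word image by laying word-part images out
    horizontally without overlap. Parts are binary arrays
    normalized to the same height."""
    h = parts[0].shape[0]
    assert all(p.shape[0] == h for p in parts)
    # Arabic reads right to left, so the first word-part goes rightmost
    return np.hstack(list(reversed(parts)))
```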

For the spotting process, we used K-NN clustering51 with different values of k (in our tests, k = 5 gave the best results). The threshold mentioned in the matching process was determined experimentally: we chose the values that yielded the best performance. To evaluate the systems' performance, we used the selected 500 different words to be spotted and indexed. The same clustering was adopted in the three systems, but with different matching procedures, which were implemented based on the respective papers and adapted to deal with Arabic word-parts.
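The cluster-assignment rule can be sketched roughly as follows. The function name, the seed representation, and the rejection threshold value are illustrative assumptions, not the paper's actual code; `dist` stands in for the DTW-based distance.

```python
from collections import Counter

def knn_assign(sample, seeds, dist, k=5, threshold=50.0):
    """Assign `sample` to the majority cluster label among its k nearest
    seed shapes under the distance `dist`; reject (return None) when even
    the closest seed lies beyond the experimentally chosen threshold."""
    scored = sorted((dist(sample, s), label) for s, label in seeds)
    if not scored or scored[0][0] > threshold:
        return None
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

Samples rejected by the threshold simply remain unclustered, which is what allows the precision and recall of the clusters to be traded off experimentally.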

We ran experiments using two spotting schemes. In the first scheme, we extracted word-parts from the IFN/ENIT database to synthetically generate the list of shapes for each word to be searched. In the second scheme, we added one additional word-part image from each document to the list from the first scheme. As expected, the results improved when using real shapes from the dataset (see Table 1). The instances of each keyword in the documents were found and recorded as ground truth and used to evaluate the performance of the three systems; the results are summarized in Table 1.

To analyze the results of the presented systems, we compared the ground truth, derived manually from the dataset, with the generated clusters. We define precision as the ratio of correctly retrieved word-parts to the total number of retrieved word-parts (true and false positives) in each cluster. The system's final retrieval score is the weighted average over the 500 clusters. Recall was used to measure the completeness of each system, computed as the ratio of correctly clustered word-parts to the total number of appearances of each word-part in the ground-truth data. The weighted average over the 500 clusters was taken as the recall score for each system.
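The paper does not spell out the weights of the weighted average; one natural reading is to weight each cluster's precision by its retrieved count and its recall by its ground-truth count, which reduces to pooled (micro-averaged) ratios. A sketch under that assumption:

```python
def weighted_precision_recall(clusters):
    """clusters: per-keyword (tp, fp, fn) counts, where tp = correctly
    retrieved word-parts, fp = wrongly retrieved word-parts, and
    fn = ground-truth appearances the system missed. Size-weighted
    per-cluster averages collapse to these pooled ratios."""
    tp = sum(c[0] for c in clusters)
    retrieved = sum(c[0] + c[1] for c in clusters)  # tp + fp
    relevant = sum(c[0] + c[2] for c in clusters)   # tp + fn
    return tp / retrieved, tp / relevant
```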

As can be seen in Table 1, our system provides better performance in both schemes. Note that extracting word/word-part samples from the documents lowers the automation level, since it requires the involvement of a human operator.

To evaluate the searching process, we searched for one appearance of each of the 1500 generated words. The last two columns in Table 1 show the improved precision and recall rates of the presented approach when compared to the other systems. Samples of searched words and some clustered images can be seen in Figs. 5 and 6.

5 Conclusions and Future Work
We presented keyword searching and spotting algorithms for Arabic documents. Our experimental results show that the geometrical and shape context features used capture the behavior of the written script for matching purposes. The nonlinearity of the DTW provides very good results

Table 1 Precision and recall results of the four compared systems, in terms of counting false positives and false negatives among the automatically clustered words. The last two columns show precision and recall rates for the keyword searching system.

                  First scheme             Second scheme            Word searching
              Precision (%)  Recall (%)  Precision (%)  Recall (%)  Precision (%)  Recall (%)

CEDARABIC         81.6         83.18        82.8          84.1         80.6          81.18
MANMATHA          80.15        81.8         81.85         82.8         80.15         80.8
Leydier           83.35        82.1         84.5          86.3         85.3          81.9
Our system        84.10        83.4         87.2          88.8         86.1          82.4

Journal of Electronic Imaging 013016-7 Jan–Mar 2013/Vol. 22(1)

Saabni and El-Sana: Keywords image retrieval in historical handwritten Arabic documents


and seems adequate for keyword searching in handwrittenArabic documents.

The DTW-based algorithm for measuring the distance between objects does not maintain the triangle inequality, which is considered a prerequisite in many clustering methods. In the scope of future research, we plan to investigate various clustering methods and algorithms in an attempt to overcome the non-metric nature of such measurements. Furthermore, we contend that working directly on grayscale images provides an opportunity to cope with low-quality images, where typical binarization algorithms eliminate valuable detail.
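For reference, the recurrence behind such distance measurements looks as follows. This is a textbook DTW sketch over one-dimensional feature sequences, not the paper's modified variant; the warping path it computes is what gives the nonlinear stretch-and-compress alignment, and the resulting dissimilarity need not satisfy the triangle inequality.

```python
def dtw(a, b):
    """Classic dynamic-time-warping distance between two feature
    sequences, allowing nonlinear (stretch/compress) alignment."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])        # local feature distance
            D[i][j] = cost + min(D[i - 1][j],      # stretch a
                                 D[i][j - 1],      # stretch b
                                 D[i - 1][j - 1])  # match step
    return D[n][m]
```

Because one element may align with several elements of the other sequence, two sequences of different lengths can still have distance zero, which is exactly the behavior that helps with variable-width handwriting but breaks metric assumptions.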

The scope of future work involves improving the response time of word searching and spotting systems. Even though word spotting is performed off-line, the huge number of documents prevents word searching or spotting in linear time. Therefore, embedding into alternative spaces that enable rapid sublinear methods for image retrieval is necessary.
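One simple instance of such an embedding, offered here purely as a hypothetical sketch and not something the paper specifies, maps each shape to its vector of distances from a small fixed set of pivot shapes; standard vector-space indexes can then answer approximate nearest-neighbor queries in sublinear time.

```python
def pivot_embed(obj, pivots, dist):
    """Map an object into R^len(pivots) via its distances to fixed pivot
    shapes. Objects that are close under `dist` receive nearby embedding
    vectors, so a conventional vector index can prune most candidates
    before any expensive DTW comparison is run."""
    return [dist(obj, p) for p in pivots]
```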

References

1. R. Manmatha, C. Han, and E. M. Riseman, "Word spotting: a new approach to indexing handwriting," in Proc. IEEE Comput. Soc. Conf. on Comput. Vis. and Pattern Recognit., pp. 631–637, IEEE, San Francisco, California (1996).

2. S. S. Kuo and O. E. Agazzi, "Keyword spotting in poorly printed documents using pseudo 2-d hidden Markov models," IEEE Trans. Pattern Anal. Mach. Intell. 16(8), 842–848 (1994).

3. R. Manmatha and T. Rath, "Indexing handwritten historical documents—recent progress," in Proc. Symposium on Document Image Understanding Technology, pp. 77–86 (2003).

4. Y. Lu and C. L. Tan, "Word spotting in Chinese document images without layout analysis," in Proc. 16th IEEE Int. Conf. on Pattern Recognit., Vol. 3, pp. 57–60, IEEE Computer Society, Washington, DC (2002).

5. R. Manmatha and W. B. Croft, "Word spotting: indexing handwritten archives," Intell. Multimed. Inform. Retrieval Collect., pp. 43–64 (1997).

6. T. Rath et al., "Indexing for a digital library of George Washington's manuscripts: a study of word matching techniques," CIIR Technical Report, University of Massachusetts, Amherst (2002).

7. J. L. Rothfeder, S. Feng, and T. M. Rath, "Using corner feature correspondences to rank word images by similarity," in Proc. IEEE Conf. on Comput. Vis. and Pattern Recognit. Workshop, Vol. 3, p. 30, IEEE, Madison, Wisconsin (2003).

8. D. Jose, A. Bhardwaj, and V. Govindaraju, "Script independent word spotting in multilingual documents," in Proc. 2nd Int. Workshop on Cross Lingual Information Access, pp. 48–54 (2008).

9. S. N. Srihari, B. Zhang, and C. Huang, “Word image retrieval usingbinary features,” Proc. SPIE 5296, 45–53 (2004).

10. H. Cao and V. Govindaraju, "Template-free word spotting in low-quality manuscripts," in Proc. 6th Int. Conf. on Advances in Pattern Recognit. (2007).

11. Y. Leydier, F. Lebourgeois, and H. Emptoz, "Text search for medieval manuscript images," Pattern Recognit. 40(12), 3552–3567 (2007).

12. M. T. Rath, R. Manmatha, and V. Lavrenko, "A search engine for historical manuscript images," in Proc. 27th Annual Int. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 369–376, ACM, New York (2004).

13. P. Babu et al., "Handwritten Arabic word spotting using the CEDARABIC document analysis system," in Proc. Symposium on Document Image Understanding Technology, College Park, Maryland (2005).

14. T. Adamek, N. E. Connor, and A. F. Smeaton, "Word matching using single closed contours for indexing historical documents," Int. J. Doc. Anal. Recognit. 9(2), 153–165 (2007).

15. T. M. Rath and R. Manmatha, "Word image matching using dynamic time warping," in Proc. IEEE Comput. Soc. Conf. on Comput. Vis. and Pattern Recognit., Vol. 2, pp. II-521–II-527, IEEE (2003).

16. S. N. Srihari et al., "Spotting words in Latin, Devanagari and Arabic scripts," Vivek: Indian J. Artif. Intell. 16(3), 2–9 (2006).

17. K. Terasawa and Y. Tanaka, "Slit style HOG features for document image word spotting," in Proc. 10th IEEE Int. Conf. on Document Analysis and Recognit., pp. 116–120, IEEE Computer Society, Washington, DC (2009).

18. M. T. Rath and R. Manmatha, “Word spotting for historical documents,”Int. J. Doc. Anal. Recognit. 9(2), 139–152 (2007).

19. M. A. A. Kolcz and J. Alspector, "A line-oriented approach to word spotting in handwritten documents," Pattern Anal. Appl. 3(2), 153–168 (2000).

20. H. Cao, A. Bhardwaj, and V. Govindaraju, "A probabilistic method for keyword retrieval in handwritten document images," Pattern Recognit. 42(12), 3374–3382 (2009).

21. B. Gatos et al., "A segmentation-free approach for keyword search in historical typewritten documents," in Proc. 8th Int. Conf. on Document Analysis and Recognit., Vol. 1, pp. 54–58, IEEE Computer Society, Washington, DC (2005).

22. Y. Leydier, F. Le Bourgeois, and H. Emptoz, "Omnilingual segmentation-free word spotting for ancient manuscripts indexation," in Proc. 8th IEEE Int. Conf. on Document Analysis and Recognit., Vol. 1, pp. 533–537, IEEE Computer Society, Washington, DC (2005).

23. R. F. Moghaddam and M. Cheriet, "Application of multi-level classifiers and clustering for automatic word spotting in historical document images," in Proc. 10th IEEE Int. Conf. on Document Analysis and Recognit., pp. 511–515, IEEE, Barcelona (2009).

24. A. Fischer et al., "HMM-based word spotting in handwritten documents using subword models," in Proc. 20th IEEE Int. Conf. on Pattern Recognit., pp. 3416–3419, IEEE, Istanbul (2010).

25. C. Ziftci, J. Chan, and D. Forsyth, "Searching off-line Arabic documents," in Proc. IEEE Comput. Soc. Conf. on Comput. Vis. and Pattern Recognit., pp. 1455–1462, IEEE, Illinois (2006).

26. J. A. Rodreguez and F. Perronnin, "Local gradient histogram features for word spotting in unconstrained handwritten documents," in Proc. Int. Conf. on Frontiers in Handwriting Recognit., Montréal, Canada, pp. 7–12 (2008).

27. J. Keshet et al., "Robust discriminative keyword spotting for emotionally colored spontaneous speech using bidirectional LSTM networks," in Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Process., pp. 3949–3952, IEEE Computer Society, Washington, DC (2009).

28. A. Graves, S. Fernandez, and J. Schmidhuber, "An application of recurrent neural networks to discriminative keyword spotting," in Proc. 17th Int. Conf. on Artificial Neural Networks, pp. 220–229, Springer-Verlag, Berlin, Heidelberg (2007).

29. C. Huang et al., “Spotting words in Latin, Devanagari and Arabicscripts,” Vivek: Indian J. Artif. Intell. 16(3), 2–9 (2003).

30. I. Z. Yalniz and R. Manmatha, "An efficient framework for searching text in noisy document images," in Proc. 10th IEEE IAPR Int. Workshop on Document Analysis Systems, pp. 48–52, IEEE Computer Society, Gold Coast, Queensland, Australia (2012).

31. V. Frinken et al., "A novel word spotting method based on recurrent neural networks," IEEE Trans. Pattern Anal. Mach. Intell. 34(2), 211–224 (2012).

32. S. Srihari et al., "Handwritten Arabic word spotting using the CEDARABIC document analysis system," in Proc. Symposium on Document Image Understanding Technology, College Park, Maryland, pp. 123–132 (2005).

33. S. N. Srihari, C. Huang, and H. Srinivasan, "Search engine for handwritten documents," Proc. SPIE 5676, 66–75 (2005).

34. T. Rath, V. Lavrenko, and R. Manmatha, "Retrieving historical manuscripts using shape," Technical Report, Center for Intelligent Information Retrieval, Univ. of Massachusetts, Amherst (2003).

35. T. M. Rath, V. Lavrenko, and R. Manmatha, "A statistical approach to retrieving historical manuscript images without recognition," CIIR Technical Report MM-42, Space and Naval Warfare Systems Center, Univ. of Massachusetts, Amherst (2003).

36. V. Lavrenko, T. M. Rath, and R. Manmatha, "Holistic word recognition for handwritten historical documents," in Proc. 1st IEEE Int. Workshop on Document Image Analysis for Libraries, pp. 278–287, IEEE Computer Society, Washington, DC (2004).

37. R. Manmatha and J. Rothfeder, "A scale space approach for automatically segmenting words from degraded handwritten documents," IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1212–1225 (2005).

38. A. Llorente, R. Manmatha, and S. Rüger, "Image retrieval using Markov random fields and global image features," in Proc. ACM Int. Conf. on Image and Video Retrieval, pp. 243–250, ACM, New York (2010).

39. F. R. Chen, L. D. Wilcox, and D. S. Bloomberg, "Word spotting in scanned images using hidden Markov models," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Process., Vol. 5, pp. 1–4, IEEE, Minneapolis, Minnesota (1993).

40. J. Duong et al., "Extraction of text areas in printed document images," in Proc. 2001 ACM Symposium on Document Eng., pp. 157–165, ACM, New York (2001).

41. F. Farooq, V. Govindaraju, and M. Perrone, "Pre-processing methods for handwritten Arabic documents," in Proc. 8th IEEE Int. Conf. on Document Analysis and Recognit., Vol. 1, pp. 267–271, IEEE, Seoul, Korea (2005).

42. E. Sriraghavendra, K. Karthik, and C. Bhattacharyya, "Fréchet distance based approach for searching online handwritten documents," in Proc. 9th Int. Conf. on Document Analysis and Recognit., Vol. 1, pp. 461–465, IEEE Computer Society, Washington, DC (2007).

43. A. Abidi et al., "Word spotting based retrieval of Urdu handwritten documents," in Proc. 2012 Int. Conf. on Frontiers in Handwriting Recognit., Bari, Italy (2012).


44. A. N. Sigappi, S. Palanivel, and V. Ramalingam, "Handwritten document retrieval system for Tamil language," Int. J. Comput. Appl. 31(4), 42–47 (2011).

45. Y. Liang, M. C. Fairhurst, and R. M. Guest, "A synthesised word approach to word retrieval in handwritten documents," Pattern Recognit. 45(12), 4225–4236 (2012).

46. Y. Liang, R. M. Guest, and M. Fairhurst, "Implementing word retrieval in handwritten documents using a small dataset," in Proc. 2012 Int. Conf. on Frontiers in Handwriting Recognit., Bari, Italy (2012).

47. Y. Leydier et al., "Towards an omnilingual word retrieval system for ancient manuscripts," Pattern Recognit. 42(9), 2089–2105 (2009).

48. R. Saabni and J. El-Sana, "Language-independent text lines extraction using seam carving," in Proc. IEEE 12th Int. Conf. on Document Analysis and Recognit., pp. 563–568, IEEE, Beijing, China (2011).

49. S. Avidan and A. Shamir, "Seam carving for content-aware image resizing," ACM Trans. Graph. 26(3), 10 (2007).

50. S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 509–522 (2002).

51. S. Bubeck and U. von Luxburg, "Nearest neighbor clustering: a baseline method for consistent clustering with arbitrary objective functions," J. Mach. Learn. Res. 10(3), 657–698 (2009).

52. B. Zhang and S. N. Srihari, “Binary vector dissimilarity measures forhandwriting identification,” Proc. SPIE 5010, 28 (2003).

53. Juma Al Majid Heritage and Culture Center, Dubai, http://www.almajidcenter.org.

54. M. Pechwitz et al., "IFN/ENIT—database of handwritten Arabic words," in Proc. CIFED 2002, pp. 129–136 (2002).

Raid Saabni is a senior researcher at the Triangle Research & Development Center and a lecturer at the Tel Aviv Yafo Academic College. He received his BSc in mathematics and computer science in 1989 and his MSc and PhD in computer science from Ben-Gurion University of the Negev in 2006 and 2010, respectively. His research interests are historical document image analysis, handwriting recognition, image retrieval, and image processing.

Jihad El-Sana is a senior lecturer at the Department of Computer Science, Ben-Gurion University of the Negev. He received his BSc and MSc in computer science from BGU. In 1995, he won a Fulbright Scholarship for Israeli Arabs for doctoral studies in the U.S. In 1999, he earned a PhD in computer science from the State University of New York, Stony Brook. He heads the department's Visual Media Lab, which hosts various research projects in computer graphics, image processing, augmented reality, computational geometry, and document image analysis. He has published over 50 papers in leading conferences and scientific journals. He is also a member of the IAPR, IEEE, and Euro-Graphics societies.
