Spatial layout representation for query-by-sketch content-based image retrieval
E. Di Sciascio *, F.M. Donini, M. Mongiello
Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, Via Re David 200, I-70125 Bari, Italy
Received 10 August 2001; received in revised form 30 November 2001
Abstract
Most content-based image retrieval techniques and systems rely on global features, ignoring the spatial relationships in the image. Other approaches take into account spatial composition but usually focus on iconic or symbolic images. Nevertheless, a large class of users' queries request images having regions, or objects, in well-determined spatial arrangements. Within the framework of query-by-sketch image retrieval, we propose a structured spatial layout representation. The approach provides a way to extract the image content starting from basic features and combining them in a higher-level description of the spatial layout of components, characterizing the semantics of the scene. We also propose an algorithm that measures a weighted global similarity between a sketched query and a database image. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Image retrieval; Similarity retrieval; Query by sketch; Spatial relationships
1. Introduction
Content-based image retrieval (CBIR) systems mainly perform extraction of visual features, typically color, shape and texture, as a set of uncorrelated characteristics. Such features provide a global description of images but fail to consider the meaning of portrayed objects and the semantics of scenes. At a more abstract level of knowledge about the content of images, extraction of object descriptions and their relative positions provides a spatial configuration and a logical representation of images. Because of the lack of low-level feature extraction, such methods generally fail to consider the physical extension of objects and their primitive features. Anyway, the two approaches to CBIR should be considered complementary. An image retrieval system should perform similarity matching based on the representation of visual features conveying the content of segmented regions; besides, it should capture the spatial layout of the depicted scenes in order to meet the user's expectations.

In this paper we strive to overcome the gap existing between these two approaches. Hence we provide a method for describing the content of an image as the spatial composition of objects/regions
Pattern Recognition Letters 23 (2002) 1599–1612
www.elsevier.com/locate/patrec
* Corresponding author. Tel.: +39-805-460641; fax: +39-805-460410.
E-mail addresses: [email protected] (E. Di Sciascio), [email protected] (F.M. Donini), [email protected] (M. Mongiello).
0167-8655/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(02)00124-1
in which each one preserves its visual features, including shape, color and texture. To this aim we propose a representation for the spatial layout of sketches and a "good performance" algorithm for computing a similarity measure between a sketched query and a database image. Similarity is a concept that involves human interpretation and is affected by the nature of real images; therefore we base our similarity on a fuzzy concept that includes exact similarity in the case of perfect matching.

The outline of the paper is as follows: in the next section, we review related work on image retrieval based on spatial relationships and outline our proposal. In Section 3 we define how we represent shapes in the user sketch, and their spatial arrangement, as well as other relevant features, such as color and texture. We define similarly how we segment regions in an image, and compare the sketch with an arrangement of regions. Then in Section 4 we define the formulae we use to compute single similarities, and how we compose them using a more general algorithm. In Section 5 we present a set of experimental results, showing the accordance between our method and a group of independent human experimenters. Finally, we draw conclusions in the last section.
2. Related work and proposal of the paper
In the work by Chang et al. (1983), which can be considered the ancestor of works on image retrieval by spatial relationships, the modeling of iconic images was presented in terms of 2D strings, each string accounting for the position of icons along one of the two planar dimensions. In this approach retrieval of images basically reverts to simple string matching. In the paper by Gudivada and Raghavan (1995) objects in a symbolic image are associated with vertexes in a weighted graph. The spatial relationships among the objects are represented through the list of the edges connecting pairs of centroids. A similarity function computes the degree of closeness between the two edge lists representing the query and the database picture as a measure of the matching between the two spatial graphs.

More recent papers on the topic include Gudivada (1998) and El-Kwae and Kabuka (1999), which basically propose extensions of the string approach for efficient retrieval of subsets of icons. Gudivada (1998) proposes a logical representation of an image based on so-called θR-strings. Such a representation also provides a geometry-based approach to iconic indexing based on spatial relationships between the iconic objects in an image, individuated by their centroid coordinates. Translation, rotation and scale variants of images, and the variants generated by an arbitrary composition of these three geometric transformations, are considered. The similarity between a database and a query image is obtained through a spatial similarity algorithm, simg, that measures the degree of similarity between a query and a database image by comparing their θR-strings. The algorithm recognizes rotation, scale and translation variants of the arrangement of objects, and also subsets of the arrangements. A constraint limiting the practical use of this approach is the assumption that an image can contain at most one instance of each icon or object. The worst limitation of the algorithm, however, stems from the fact that it takes into account only the points at which the icons are placed and not the real configuration of objects. For example, a picture with a plane in the left corner and a house in the right one has the same meaning as a picture with the same objects in different proportions, say a small plane and a big house; besides, the relative size and orientation of the objects are not taken into account.

El-Kwae and Kabuka (1999) propose a further
extension of the spatial-graph approach that includes both topological and directional constraints. The topological extension of the objects can obviously be useful in determining further differences between images. The similarity algorithm extends the graph matching proposed by Gudivada and Raghavan (1995) and retains the properties of the original approach, including its invariance to scaling, rotation and translation, and is also able to recognize multiple rotation variants. Even though it considers the topological extension of objects, it is far from considering the compositional structure of objects: the objects are considered as a whole and no reasoning is possible about the meaning and the purpose of the objects or scenes depicted. Several extensions of these approaches have been proposed, with an evaluation and a comparison of computational complexity (Zhou et al., 2001).

The computational complexity of evaluating the
similarity of two arrangements of shapes has been analyzed since the early 1990s. Results have been found for what are called type-0 and type-1 similarity in S.K. Chang's classification (Chang, 1989). Briefly, in type-0 and type-1 similarity, arrangements are considered similar if they preserve horizontal and vertical orderings. For example, if object A is below and on the left of object B in picture 1, picture 2 is considered similar if the same object A appears below and on the left of B, regardless of their relative size and distance in the two pictures. Tucci et al. (1991) studied the case when there can be multiple instances of an object, and found that the problem is NP-complete. Later, Guan et al. (2000) proved that the same lower bound also holds when objects are all distinct. The authors also gave a polynomial-time algorithm for type-2 similarity, which is stricter than type-1 since it considers two arrangements similar only if one of the two is a rigid translation of the other. Type-2 similarity is too strict for our approach, since we also admit rotational and scale variants of an arrangement.

When objects reduce to points, the problem of
evaluating the similarity of arrangements has been called point pattern matching. This problem has been studied in computational geometry, where it is known as the "constellation problem". In Cardoze and Schulman (1998), a randomized algorithm is given for exact matching of points under translations and rotations, but not under scaling. For matching n points in the plane, the algorithm works in O(n² log n), where the probability of finding wrong matches is a decreasing exponential. A different probabilistic algorithm, this one possibly missing good matches, has been proposed by Irani and Raghavan (1996) for the best matching problem, which can also be a non-exact match. The algorithm considers matching under translation, rotation, and scaling. However, the point pattern matching problem solves just the matching
of centroids of an arrangement of shapes, and it is not obvious if and how such algorithms could be generalized to matching arrangements of shapes.

All the previously described methods are partial solutions to the problem of image retrieval. They consider disjoint properties instead of a global similarity measure between images. CBIR essentially reverts to two basic specifications: representation of the image features and definition of a metric for similarity retrieval. Most features adopted in the literature are global ones, which convey information, and measure similarity, based purely on the visual appearance of an image. Unfortunately, what a generic user typically considers the "content" of an image is seldom captured by such global features. Particularly for query-by-sketch image retrieval, the main issue is the recognition of the sketched components of a query in one or more database images, followed by a measure of similarity, to find the more relevant ones. In our approach, we assume retrievable only images where all components of the sketch are present as regions in the image. 1 The approach may be considered more restrictive than other approaches since only images having all the components of the sketch are taken into account. This is because we consider such components meaningful to the overall configuration: why would a user sketch them if not? Once all the shapes of the configuration have been found in the image, other properties of the shapes concur to define the overall similarity measure, such as rotation of each shape, rotation of the overall arrangement, scale and translation, color, texture. For measuring the similarity of the overall arrangement, we propose a modified version of Gudivada's θR-strings.

There are m!/(m − n)! pairings between the n objects of a sketch and n out of m regions in an image, with m > n, hence a blind trial of all possible pairings takes exponential time. The results we mentioned about type-2 similarity and point pattern matching suggest that there might exist a polynomial-time algorithm for matching arrangements of
1 Of course, not all regions in the image must correspond to components of the sketch. In other words, we do not enforce full picture matching.
shapes under translation, rotation and scaling. Since our approach is different from the previous ones, we devised a non-optimal algorithm to conduct our experiments, leaving a deeper analysis of the computational complexity of the problem, and optimal algorithms to solve it, to future research. The algorithm we employ is not totally blind, though, since it takes advantage of the fact that usually most shapes in the sketch and regions in the image do not match each other. Hence, computing an n × m matrix of similarities between shapes and regions, one can discard most of the wrong matchings from the beginning. Nevertheless, the worst case of the algorithm, when all shapes are similar to all regions, is exponential.
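The pruning idea described above can be sketched as follows: given a precomputed shape-similarity matrix, only regions above a threshold are candidates for each sketch component, and injective mappings are enumerated over those candidates alone. This is a minimal illustration in Python; the function and variable names are ours, not the paper's.

```python
def candidate_mappings(sim, threshold=0.9):
    """Enumerate injective mappings from sketch shapes to image regions.

    sim[j][i] is the shape similarity between sketch component j and
    image region i; component j may only map to regions whose
    similarity reaches the threshold, which prunes the search space.
    """
    n = len(sim)  # number of sketch components
    candidates = [
        [i for i, s in enumerate(row) if s >= threshold] for row in sim
    ]
    results = []

    def extend(j, used, partial):
        if j == n:
            results.append(tuple(partial))
            return
        for i in candidates[j]:
            if i not in used:
                extend(j + 1, used | {i}, partial + [i])

    extend(0, frozenset(), [])
    return results

# A 2x4 similarity matrix: shape 0 matches regions 0 and 2 well,
# shape 1 matches only region 3, so only two mappings survive.
sim = [[0.95, 0.1, 0.92, 0.2],
       [0.3, 0.4, 0.1, 0.97]]
print(candidate_mappings(sim))  # -> [(0, 3), (2, 3)]
```

In the worst case (every shape similar to every region) the enumeration is still exponential, exactly as the text observes, but on typical images the candidate lists are short.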
3. Representing shapes, objects and images
In a previous paper (Di Sciascio et al., 2000a) we proposed a structured language (based on ideas borrowed from Description Logics) to describe the complex structure of objects in an image, starting from a region-based segmentation of objects. The complete formalism was presented in another paper (Di Sciascio et al., 2000b), together with the syntax and an extensional semantics, which is fully compositional.

The main syntactic objects we consider are basic shapes, composite shape descriptions, and transformations.
Basic shapes are denoted with the letter B, and have an edge contour e(B) characterizing them. We assume that e(B) is described as a single, closed 2D curve in a space whose origin coincides with the centroid of B. Examples of basic shapes are circle and rectangle, with the contours e(circle) = ○ and e(rectangle) = ▭, but also any complete, rough contour, e.g., the one of a ship, can be a basic shape.
ones that are present in any drawing tool: rota-tion (around the centroid of the shape), scalingand translation. We globally denote a rotation–translation–scaling transformation as s. Recallthat transformations can be composed in se-quences s1 � � � � � sn, and they form a mathemat-ical group.
The basic building block of our syntax is a basic shape component ⟨c, t, τ, B⟩, which represents a region with color c, texture t, and contour τ(e(B)). With τ(e(B)) we denote the pointwise transformation τ of the whole contour of B. For example, τ could specify to place the contour e(B) in the upper left corner of the image, scaled by 1/2 and rotated by 45°.
Composite shape descriptions are conjunctions of basic shape components, each one with its own color and texture. They are denoted as

C = ⟨c1, t1, τ1, B1⟩ ⊓ ⋯ ⊓ ⟨cn, tn, τn, Bn⟩
Notice that this is just an internal representation, invisible to the user, which we map into a visual language actually used to sketch the query.

Gudivada (1998) proposed θR-strings, a geometry-based structure for the representation of spatial relations in images, and an associated spatial similarity algorithm. The θR-string is a symbolic representation of an image. It is obtained by associating a name with each domain object identified in the image, and then considering the sequence of names and coordinates of the centroids of the objects, with reference to a Cartesian coordinate system. Gudivada's original representation was limited by the assumption that images could not contain multiple instances of the same object type. We propose a modified formulation of θR-strings, which allows our similarity algorithm to overcome this limitation. Although we still consider the arrangement of components as the spatial layout of a composite shape, we also describe each shape by including its relevant features. The icons of the symbolic image in Gudivada's θR-string representation are replaced by objects with their shape. Hence each shape is not scale, rotation and translation invariant (as it was in θR-strings), since we believe such properties have their meaning in composite shape descriptions. Therefore we provide a characterization both of images and of objects, in terms of basic shapes composing complex objects in a sketched query and of regions composing database images.

To measure similarity, we propose an algorithm
that takes into account the arrangement of shapes in the sketch and compares it with groups of regions in an image. The algorithm provides a spatial similarity measure that, with respect to Gudivada's simg algorithm, assumes the presence in the image of a group of regions that corresponds to the components of the sketch. Besides, it considers the relative size of corresponding regions and shapes. To define a global similarity measure, we take into account similarity measures depending on all the features that characterize the components of objects, i.e., color, texture, rotation and scaling.
3.1. Description of relevant features
While relevant features are properties easily computable for elements of a sketch, dealing with regions in real images requires segmentation to obtain a partition of the image. Several segmentation algorithms have been proposed in the literature; our approach does not depend on the particular segmentation algorithm adopted. Anyway, it is obvious that the better the segmentation, the better our approach will work.

Our scope here is limited to the description of the computation of image features, assuming a successful segmentation. To make the description self-contained, we start by defining a generic color image as {I(x, y) | 1 ≤ x ≤ N_h, 1 ≤ y ≤ N_v}, where N_h, N_v are the horizontal and vertical dimensions, respectively, and I(x, y) is a triple (R, G, B). We assume that the image I has been partitioned into m regions r_i, i = 1, …, m, satisfying the following properties:
• I = ∪ r_i, i = 1, 2, …, m;
• for all i ∈ {1, 2, …, m}, r_i is a non-empty and simply-connected set;
• r_i ∩ r_j = ∅ for i ≠ j;
• each region satisfies heuristic and physical requirements.
We characterize each region r_i with the following attributes: shape, position, size, orientation, color and texture.
3.1.1. Shape
Given a connected region, a point moving along its boundary generates a complex function defined as z(t) = x(t) + j y(t), t = 1, …, N_b, with N_b the number of boundary points. Following the approach proposed in (Rui et al., 1996) we define the discrete Fourier transform of z(t) as:

Z(k) = Σ_{t=1}^{N_b} z(t) e^{−j2πtk/N_b} = M(k) e^{jθ(k)}

with k = 1, …, N_b.

In order to address the spatial discretization problem we compute the fast Fourier transform (FFT) of the boundary z(t) and use the first (2N_c + 1) FFT coefficients to form a dense, non-uniform set of points of the boundary as:

z_dense(t) = Σ_{k=−N_c}^{N_c} Z(k) e^{j2πtk/N_b}

with t = 1, …, N_dense, where N_dense is the number of "dense" samples used in resampling the Modified Fourier Descriptors.

We then interpolate these samples to obtain uniformly spaced samples z_unif(t), t = 0, …, N_unif. We compute again the FFT of z_unif(t), obtaining Fourier coefficients Z_unif(k), k = −N_c, …, N_c. The shape feature of a region is hence characterized by a vector of 2N_c + 1 complex coefficients.
3.1.2. Position and size
We characterize size and position with the aid of moment invariants (Pratt, 1991). In order to simplify notation, let us assume that the extracted external contour of each region r_i is placed in a new image I_{r_i}(x, y) having the same size as the original image, with a uniform background. Let us also suppose the new images are binarized, i.e., discretized with two levels. The (p, q)th spatial moment of each I_{r_i} can be defined as:

M_u(p, q) = Σ_{y=1}^{N_v} Σ_{x=1}^{N_h} x^p y^q I_{r_i}(x, y)

The (p, q)th scaled spatial moment can be defined as:

M(p, q) = M_u(p, q) / (N_h^p N_v^q)

The definition of size is just the zero-order moment M(0, 0). The position is obtained with reference to the region centroid, having coordinates:

x̄ = M(1, 0) / M(0, 0),  ȳ = M(0, 1) / M(0, 0)
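The moment formulas above translate directly into a few lines of array code; a minimal sketch with our own function names, following the same 1-based coordinates and N_h^p N_v^q normalization as the text:

```python
import numpy as np

def scaled_moment(img, p, q):
    """Scaled spatial moment M(p,q) of a binary region image:
    M_u(p,q) = sum over pixels of x^p y^q I(x,y), then normalized
    by N_h^p * N_v^q."""
    nv, nh = img.shape                      # rows = vertical, cols = horizontal
    ys, xs = np.mgrid[1:nv + 1, 1:nh + 1]   # 1-based coordinates as in the text
    mu = np.sum((xs ** p) * (ys ** q) * img)
    return mu / (nh ** p * nv ** q)

def centroid(img):
    """Region centroid (x_bar, y_bar) in normalized coordinates."""
    m00 = scaled_moment(img, 0, 0)
    return scaled_moment(img, 1, 0) / m00, scaled_moment(img, 0, 1) / m00

# A 2x2 block of ones in the top-left corner of a 4x4 image:
# its centroid in pixel coordinates is (1.5, 1.5).
img = np.zeros((4, 4))
img[0:2, 0:2] = 1.0
xbar, ybar = centroid(img)
```

Note that the centroid comes out in normalized units; multiplying x̄ by N_h (and ȳ by N_v) recovers pixel coordinates.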
3.1.3. Orientation
In order to quantify the orientation of each region r_i we use the same Fourier representation, which stores the orientation information in the phase values. We obviously also deal with special cases in which the shape of a region has more than one symmetry, e.g., a rectangle or a circle. Rotational similarity between a reference shape B and a given region r_i can then be obtained by finding maximum values of the cross-correlation:

C(t) = (1 / (2N_c + 1)) Σ_{k=0}^{2N_c} Z_B(k) Z_{r_i}(k) e^{j2πkt/(2N_c+1)}

with t ∈ {0, …, 2N_c}.
3.1.4. Color
Color information for each region r_i is stored, after quantization in a 112-value color space, as the mean RGB value within the region:

R_{r_i} = (1/|r_i|) Σ_{p∈r_i} R(p),  G_{r_i} = (1/|r_i|) Σ_{p∈r_i} G(p),  B_{r_i} = (1/|r_i|) Σ_{p∈r_i} B(p)
3.1.5. Texture
We extract texture information for each region r_i with a method based on the work in (Pok and Liu, 1999). Following this approach, we extract texture features by convolving the original grey-level image I(x, y) with a bank of Gabor filters, having the following impulse response:

h(x, y) = (1 / (2πσ²)) e^{−(x²+y²)/(2σ²)} e^{j2π(Ux+Vy)}

where σ is the scale factor, (U, V) represents the filter location in frequency-domain coordinates, and k and θ are the central frequency and the orientation, respectively defined as:

k = √(U² + V²),  θ = arctan(U/V)
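A single Gabor kernel of this form can be built directly from the impulse-response formula; the following is a minimal construction of ours, not the authors' full filter bank or their feature-extraction pipeline:

```python
import numpy as np

def gabor_kernel(size, sigma, U, V):
    """Complex Gabor impulse response
    h(x,y) = 1/(2 pi sigma^2) * exp(-(x^2+y^2)/(2 sigma^2))
             * exp(j 2 pi (U x + V y)),
    sampled on a size x size grid centred at the origin."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    carrier = np.exp(2j * np.pi * (U * xs + V * ys))
    return envelope * carrier

k = gabor_kernel(size=15, sigma=2.0, U=0.1, V=0.0)
# Central frequency as defined in the text: sqrt(U^2 + V^2).
freq = np.hypot(0.1, 0.0)
```

Convolving the grey-level image with a bank of such kernels at several (U, V) locations and scales σ, and collecting statistics of the responses, yields a texture feature vector per region.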
The processing extracts a 24-component feature vector, which characterizes each textured region.

Descriptions relative to basic shapes can be used to build complex objects by properly composing such components. For example, we can sketch a house using a rectangle and a triangle posed on its top. To deal with complex objects we assume a set F of basic shapes that can be used for comparison with other instances of the same shape having the same proportions. The image composed in Fig. 1 has five shapes; two of them are occurrences of the same shape, a circle, but with a different scale factor. To describe a complex object we extend the description given for a basic shape.

From now on, the regions of a query sketch will
be denoted by f_j, j = 1, …, n. Each region f_j represents an instance of a shape F_j. Consider an observer posed at the center-of-mass of the sketch. We define the order in which the components appear as the order in which they are seen from this viewpoint, counter-clockwise (CCW). For each component of index j, we call its left-hand side component, with index j + 1, the following component CCW, and its right-hand side component, with index j − 1, the preceding component CCW. A shape is hence characterized as follows:
• index of F_j;
• coordinates of the centroid C_{f_j} of the contour of f_j;
• length of the segment C_Q C_{f_j} between the centroid of the shape and the centroid of the image;
• angle of the segment C_Q C_{f_j} with the x-axis;
• scale factor of the shape f_j with respect to the shape F_j;
• rotation angle of f_j with respect to F_j;
Fig. 1. Information contained in the f_k component of a MθR-string.
• index j − 1 of the shape F_{j−1} of which the shape that precedes f_j is an instance;
• index j + 1 of the shape F_{j+1} of which the shape that follows f_j is an instance;
• color of f_j;
• texture of f_j;
• distance between the centroid of the edge of shape f_j and that of shape f_{j+1} on the left side of f_j;
• distance between the centroid of the edge of shape f_j and that of shape f_{j−1} on the right side of f_j.
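The attribute list above amounts to a fixed record per component. A compact sketch of such a record as a Python dataclass follows; the field names are ours, not the paper's, and the sample values are invented:

```python
from dataclasses import dataclass

@dataclass
class ShapeComponent:
    """One entry f_j of a sketched object's description, mirroring
    the attribute list in the text."""
    index: int         # index of the prototypical shape F_j
    centroid: tuple    # centroid C_{f_j} of the contour of f_j
    distance: float    # length of the segment to the sketch centroid
    angle: float       # angle of that segment with the x-axis
    scale: float       # scale of f_j with respect to F_j
    rotation: float    # rotation of f_j with respect to F_j
    index_l: int       # prototype index of the following (left) neighbour
    index_r: int       # prototype index of the preceding (right) neighbour
    color: tuple       # mean RGB colour
    texture: tuple     # texture feature vector
    distance_l: float  # centroid distance to the left neighbour
    distance_r: float  # centroid distance to the right neighbour

# Hypothetical roof component of the "house" example above.
house_roof = ShapeComponent(index=2, centroid=(50.0, 20.0), distance=12.5,
                            angle=90.0, scale=1.0, rotation=0.0,
                            index_l=1, index_r=1, color=(180, 40, 40),
                            texture=(), distance_l=15.0, distance_r=15.0)
```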
3.2. Spatial representation with the modified θR-string
We now define the Modified θR-string (MθR-string). The terminology we introduce refers to Definitions 1 through 8 of the work by Gudivada (1998) and properly extends them for our purposes. The main difference concerns the definition itself: in the original version there is a unique definition of θR-string for both the database image and the sketched query, whereas here we give two different definitions, one for a sketched query and one for a real image. In the definitions we refer to the group of regions of the segmented image and to the shape components of a complex object. Nearly all the terminology extends to both definitions. For a more accurate analysis of the differences between the θR-string and the MθR-string, recall the definition of Augmented θR-string given by Gudivada. In a MθR-string we consider the index of the reference shape instead of the name of the object; the angle θ used for defining the order relation among components is computed between the center-of-mass of the considered shape and the centroid of the centroids in the image, instead of the centroid of the image I. Such an angle induces the order relation in which the elements appear: the same order in which the components are seen, CCW, by an observer posed at their center-of-mass. With respect to this order relation, the definition of left and right neighbors of a shape (region) extends to the MθR-string, and the indexes of the relative shapes are added to the MθR-string. The distances between components are computed using the Euclidean distance between the centroids and the angle θ. Moreover, in the MθR-string of an image the index of the shape f_j in the object description corresponding to the region r_i is added. We remind the reader that before determining the MθR-string it is necessary to search for a group of regions in the image that resembles the set of components in the sketched object.
Definition 1 (MθR-string for an object Q). The MθR-string for a complex object Q is a list of n elements, one for each component. The order in which the elements appear is the same order in which the components are seen, CCW, by an observer posed at their center-of-mass. Then, each component has a left-hand side component and a right-hand side one. The element for the jth component f_j of Q contains the items below:

• f_j.index: index of the prototypical shape of which f_j is an instance;
• f_j.angle: the angle that the line joining the centroid ("center-of-mass") of f_j with the centroid of the centroids of Q subtends with the positive x-axis;
• f_j.distance: distance between the centroid of f_j and the centroid of the object description;
• f_j.indexR: index of the prototypical shape of the shape on the right side of f_j;
• f_j.indexL: index of the prototypical shape of the shape on the left side of f_j;
• f_j.distanceR: distance between the centroid of f_j and the centroid of the shape on the right side;
• f_j.distanceL: distance between the centroid of f_j and the centroid of the shape on the left side.
An example is shown in Fig. 1, where we represent the MθR-string for a complex object. We refer to the MθR-string of the f_k component, here represented by the bold circle. f_k.index is the index of the basic shape which represents the bold circle. The components f_k.indexR and f_k.indexL are respectively the index of the circle on the right side and of the rectangle on the left side of f_k. The other components of the MθR-string are the distances and the angle highlighted in the figure.
Definition 2 (MθR-string for an image I). The MθR-string for an image I is a list of m elements, one for each segmented region. The order in which the elements appear is the same order in which the components are seen, CCW, by an observer posed at their center-of-mass. Then, each component has a left-hand side component and a right-hand side one. The element for the ith region r_i of I contains the items below:
(1) r_i.index: index of the shape of which r_i is an instance;
(2) r_i.angle: the angle that the line joining the centroid of r_i with the centroid of the centroids of I subtends with the positive x-axis;
(3) r_i.distance: distance between the centroid of r_i and the centroid of the group of regions;
(4) r_i.indexR: index of the shape corresponding to the region on the right side of r_i;
(5) r_i.indexL: index of the shape corresponding to the region on the left side of r_i;
(6) r_i.distanceR: distance between the centroid of r_i and the centroid of the region on the right side;
(7) r_i.distanceL: distance between the centroid of r_i and the centroid of the region on the left side;
(8) r_i.ObjectShape: index of the shape f_j in the object description corresponding to the region r_i (this value is returned by the search for a group of regions).
4. Computing similarities
In all similarity measures we adopt the function U(x, g_x, g_y). This function was determined by trial and error to replace the exponential one used by Gudivada in his similarity algorithm. The motivation was that the exponential function was too steep, so that even small variations in the relative position of the centroids would produce considerable variations in the similarity function sim_θ, which appeared too drastic for real image retrieval. We therefore chose a function with a null first-order derivative at x = 0. The role of the function is to "smooth" the changes of the quantity x, depending on two parameters g_x, g_y which have been tuned in the experimental phase, and to turn a distance x (in which 0 corresponds to perfect matching) into a similarity measure (in which the value 1 corresponds to perfect matching).
U(x, g_x, g_y) =
  g_y + (1 − g_y) cos(πx / (2g_x))                          if 0 < x ≤ g_x
  g_y [1 − arctan(π (x − g_x)(1 − g_y) / (g_x g_y)) / π]    if x > g_x
where the values g_x > 0 and 0 < g_y < 1 have been experimentally determined, and are presented in Table 1.

Given a query Q with n objects and a picture I segmented into m regions, from all the groups of regions in the picture that might resemble the components we select the groups that present the higher spatial similarity with the objects. In artificial examples in which all shapes in I and Q resemble each other, this may generate an exponential number of groups to be tested, given by m(m − 1)⋯(m − n + 1). Hence, our algorithm may not be optimal from a computational point of view. Whether or not there exist polynomial-time algorithms to solve this problem is still an open question; a polynomial-time algorithm is known only if scaling transformations of a layout
Table 1
Configuration parameters, grouped by feature type
Parameter Value
Fourier descriptors threshold 0.9
Circular symmetry threshold 0.9
θR-string spatial factor 0.25
θR-string scale factor 0.25
Spatial similarity threshold 0.2
Spatial similarity weight a 0.30
Spatial similarity sensitivity gx 0.5
Spatial similarity sensitivity gy 0.4
Shape similarity weight b 0.30
Shape similarity sensitivity gx 0.005
Shape similarity sensitivity gy 0.2
Scale similarity weight g 0.1
Scale similarity sensitivity gx 0.5
Scale similarity sensitivity gy 0.4
Rotation similarity weight d 0.1
Rotation similarity sensitivity gx 10
Rotation similarity sensitivity gy 0.3
Color similarity weight c 0.1
Color similarity sensitivity gx 110
Color similarity sensitivity gy 0.4
Texture similarity weight ε 0.1
Texture similarity sensitivity gx 110
Texture similarity sensitivity gy 0.4
Global similarity threshold 0.29
are not considered (Chew et al., 1997). However, in typical real images the similarity between shapes is selective enough to yield only a very small number of possible groups to try. Therefore, our algorithm behaves efficiently enough to carry out experiments. We recall that in Gudivada's approach (1998) a strict assumption is made, namely, that each basic component in Q does not appear twice, and that each region in I matches at most one component in Q. Our approach, relaxing this assumption, is better suited for image retrieval. We compute the shape similarity sim_ss between regions of I and shape components of Q, and select regions with a similarity greater than a given threshold. Computation of sim_ss is invariant with respect to scale and rotation. The measure is obtained as the cosine norm applied to the tuples Z_{f_j} and Z_{r_i} of complex coefficients describing respectively the shape of a region r_i and the shape of a component f_j, with Z_{f_j} = (x_1, …, x_{2N_c}) and Z_{r_i} = (y_1, …, y_{2N_c}):
sim_ss(f_j, r_i) = ( Σ_{l=1}^{2N_c} x_l y_l ) / √( Σ_{l=1}^{2N_c} x_l² · Σ_{l=1}^{2N_c} y_l² )
sim_ss(f_j, r_i) is a number in the range [0, 1], where sim_ss = 1 corresponds to maximum similarity between a region and a shape component.
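The cosine-norm formula above can be sketched in a few lines. Since the Fourier coefficients are complex, this illustration applies the formula to their magnitudes; that choice is our assumption, not spelled out in the text, and the function name is ours.

```python
import numpy as np

def sim_ss(zf, zr):
    """Cosine-norm similarity between two shape-descriptor vectors,
    computed on coefficient magnitudes (our assumption for the
    complex case). Returns a value in [0, 1]; 1 means maximally
    similar descriptors."""
    x = np.abs(np.asarray(zf, dtype=complex))
    y = np.abs(np.asarray(zr, dtype=complex))
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

# Parallel descriptors (one a scaled copy of the other) score 1.0,
# which is why sim_ss is invariant to scale.
same = sim_ss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The scale invariance visible in the example is exactly the property the text claims for sim_ss: multiplying all coefficients of one shape by a constant does not change the cosine.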
4.1. Spatial similarity
We recall that all the components of Q must be found in I. So, the recognition algorithm discards, without further processing, those images that do not contain all the shape components of Q. Notice that we consider a shape component recognized if its shape similarity with respect to a region in the image is beyond a threshold. Also notice that this does not prevent regions not included in the sketch Q from being present in the image I. As a consequence, we describe the MθR-string algorithm assuming the presence in an image of a group of regions that corresponds to the components of Q. Let the objects in the sketch be numbered 1, …, n and the regions in the image 1, …, m. To make precise the correspondence between objects and regions, we use an injective mapping μ: {1, …, n} → {1, …, m}, such that object i is matched with region r_{μ(i)}. In order to compute the spatial similarity of the overall arrangement of regions with reference to the arrangement of components in Q, we consider their centroids. Once the groups of regions have been selected, the algorithm extracts for each group the corresponding MθR-string and evaluates the overall similarity. The spatial similarity value sim_θ is computed by the ComputeSIMθ algorithm, considering the relative positions of shapes and regions (normalized with respect to the image size), and accounts for the similarity in terms of the spatial arrangements of objects.
Algorithm ComputeSIMθ (MθR_Q, MθR_I)
input: the MθR-string of the query Q;
       the MθR-string of the group of n regions r_μ(1), …, r_μ(n)
output: sim_θ
begin
  sim_θ = 0
  for j ∈ {1, …, n} do
    i = r_μ(j).ObjectShape
    if r_μ(j).indexR = f_i.indexR then
      sim_θ = sim_θ + K_SpatialFactor
      sim_θ = sim_θ + K_ScaleFactor · U_θ(|f_i.distanceR − r_μ(j).distanceR| / Q.magnitude)
    endif
    if r_μ(j).indexL = f_i.indexL then
      sim_θ = sim_θ + K_SpatialFactor
      sim_θ = sim_θ + K_ScaleFactor · U_θ(|f_i.distanceL − r_μ(j).distanceL| / Q.magnitude)
    endif
  endfor
  return sim_θ
end
The factors K_SpatialFactor and K_ScaleFactor are constants, subject to the following constraint: 2·K_SpatialFactor + 2·K_ScaleFactor = 1. K_SpatialFactor is the degree of importance of the spatial relationships between the corresponding region and shape in computing the spatial similarity; K_ScaleFactor is the degree of importance of the scale variations between the region and the corresponding shape. The multiplicative term 2 comes from the computation of the spatial similarity considering both the left and right neighbors of a shape (region). Q.magnitude represents the extension of the object Q. Its value is obtained as the mean value of the lengths of the segments between the centroid of each shape and the centroid of the image.

E. Di Sciascio et al. / Pattern Recognition Letters 23 (2002) 1599–1612 1607
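As an illustrative sketch (not the system's actual code), the scoring loop of ComputeSIMθ can be rendered in Python. The record layout, the constant values, and the smoothing function u_theta are our assumptions; the paper parameterises its smoothing function U differently:

```python
K_SPATIAL = 0.3  # assumed weight of matching neighbour indices
K_SCALE = 0.2    # assumed weight of neighbour-distance agreement
# the paper's constraint: 2*K_SpatialFactor + 2*K_ScaleFactor = 1
assert abs(2 * K_SPATIAL + 2 * K_SCALE - 1.0) < 1e-9

def u_theta(d):
    """Hypothetical stand-in for the paper's smoothing function U:
    maps a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

def compute_sim_theta(query, regions, magnitude):
    """query/regions: lists of dicts; each region carries the index of
    its matched query shape plus the left/right neighbour indices and
    centroid distances stored in the MθR-string."""
    sim = 0.0
    for r in regions:
        f = query[r["shape"]]
        if r["indexR"] == f["indexR"]:  # right neighbour agrees
            sim += K_SPATIAL
            sim += K_SCALE * u_theta(abs(f["distanceR"] - r["distanceR"]) / magnitude)
        if r["indexL"] == f["indexL"]:  # left neighbour agrees
            sim += K_SPATIAL
            sim += K_SCALE * u_theta(abs(f["distanceL"] - r["distanceL"]) / magnitude)
    return sim
```

With a single region whose neighbours and distances match the query exactly, the score is 2·(K_SpatialFactor + K_ScaleFactor) = 1, the maximum allowed by the constraint.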
4.2. Shape similarity
sim_shape measures the similarity between the shapes in the composite shape description and the regions in the segmented image:
sim_shape = U( max_{j=1…n} [1 − sim_ss(f_j, r_μ(j))], gx_shape, gy_shape ).
4.3. Scale similarity
sim_scale takes into account the differences in scale between each region r_μ(j) in the considered group of regions and the corresponding basic shape f_j in Q:
Δ_scale = max_{j=1…n} { |r_μ(j).scale / ScaleFactor − f_j.scale| / f_j.scale }
where f_j.scale is the scaling factor in f_j (the transformation for the j-th component of Q), r_μ(j).scale is the scaling factor of region r_μ(j) when matched to basic shape f_j, and ScaleFactor is the overall scaling factor of the selected group of regions when matched with Q. Observe that we choose the maximum, since the differences are distances, and not similarities. Then we compute the scale similarity with the aid of the function U:
sim_scale = U(Δ_scale, gx_scale, gy_scale).
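The scale-discrepancy computation above can be sketched in Python as follows; the concrete form of the smoothing function u is our assumption, since the paper only states that U maps a distance and two parameters to a similarity:

```python
def delta_scale(region_scales, shape_scales, scale_factor):
    """Maximum relative scale discrepancy between matched regions and
    query shapes, after factoring out the group's overall ScaleFactor."""
    return max(abs(r / scale_factor - s) / s
               for r, s in zip(region_scales, shape_scales))

def u(d, gx=1.0, gy=1.0):
    """Hypothetical stand-in for the paper's smoothing function
    U(d, gx, gy): linearly maps distance 0 to similarity gy, and
    distances >= gx to 0."""
    return max(0.0, 1.0 - d / gx) * gy

# a group uniformly scaled by 2 relative to the query has zero discrepancy
sim_scale = u(delta_scale([2.0, 4.0], [1.0, 2.0], 2.0))
```

Because the group-level ScaleFactor is divided out first, a uniformly rescaled group of regions still yields Δ_scale = 0 and hence maximum scale similarity.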
4.4. Rotation similarity
sim_rotation takes into account the errors in the rotation of each region in the considered group of regions with respect to the corresponding basic shapes in Q. We start by finding the RotationFactor of the group of n regions r_μ(1), …, r_μ(n) with respect to Q, in accordance with the following algorithm:
Algorithm ComputeSIMRotation (MθR_Q, MθR_I)
input: the MθR-string of the query Q;
       the MθR-string of the group of n regions r_μ(1), …, r_μ(n)
output: RotationFactor
begin
  RotationFactor = 0
  for j ∈ {1, …, n} do
    i = r_μ(j).ObjectShape
    Δ_angle = r_μ(j).angle − f_i.angle
    if Δ_angle > 180° then
      Δ_angle = Δ_angle − 360°
    else if Δ_angle ≤ −180° then
      Δ_angle = Δ_angle + 360°
    endif
    RotationFactor = RotationFactor + Δ_angle
  endfor
  RotationFactor = RotationFactor / n
  return RotationFactor
end ComputeSIMRotation
In order to compute the rotation of each region with respect to the corresponding shapes we consider the maximum angles a_r, r ∈ {1, …, k_j}, j ∈ {1, …, n}, obtained by computing the cross-correlation function, as described in Section 3. Each rotation error Δ_j, for j ∈ {1, …, n}, is then computed as follows:
for r ∈ {1, …, k_j} do
  Δ_r = |a_r − RotationFactor − B_j.angle|
  Δ_r = Δ_r mod 180°
endfor
Δ_j = min_{r=1…k_j} {|Δ_r|}
The maximum rotation difference is then:
Δ_rotation = max_{j=1…n} {|Δ_j|}
and again, since we computed a distance, we smooth and convert it into a similarity measure with the help of U:
sim_rotation = U(Δ_rotation, gx_rotation, gy_rotation).
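The group RotationFactor computed by ComputeSIMRotation, with its wrap of each angular difference into (−180°, 180°], can be sketched in Python; the function name is ours:

```python
def rotation_factor(region_angles, shape_angles):
    """Average signed angular offset of the region group with respect
    to the query shapes, with each difference wrapped into (-180, 180]
    so that, e.g., 350 deg vs 10 deg counts as a 20 deg offset, not 340."""
    total = 0.0
    for r, s in zip(region_angles, shape_angles):
        d = r - s
        if d > 180.0:
            d -= 360.0
        elif d <= -180.0:
            d += 360.0
        total += d
    return total / len(region_angles)
```

Averaging the wrapped differences gives the overall rotation of the group, against which each individual region's error Δ_j is then measured.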
4.5. Color similarity
sim_color measures the similarity in terms of color appearance between the regions and the corresponding shapes in the composite shape description. In the following formula, Δ_color(j).R denotes the difference in the red color component between the j-th component of Q and the region r_μ(j), and similarly for the green and the blue color components:
Δ_color(j) = √( [Δ_color(j).R]² + [Δ_color(j).G]² + [Δ_color(j).B]² )

Then the function U takes the maximum of the differences to obtain a similarity:

sim_color = U( max_{j=1…n} {Δ_color(j)}, gx_color, gy_color ).
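The per-component color distance above is a plain Euclidean distance in RGB space; a minimal Python sketch (names ours):

```python
def delta_color(shape_rgb, region_rgb):
    """Euclidean distance between the mean RGB colour of a query shape
    and that of its matched region; 0 means identical mean colour."""
    return sum((a - b) ** 2 for a, b in zip(shape_rgb, region_rgb)) ** 0.5
```

The overall color similarity then smooths the worst (largest) of these distances over the n matched pairs, consistently with the max in the formula above.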
4.6. Texture similarity
Finally, sim_texture measures the similarity between the texture features in the components of Q and in the corresponding regions:
sim_texture = U(Δ_texture, gx_texture, gy_texture).
4.7. The recognition algorithm
We now formally describe the recognition algorithm. The algorithm uses a global similarity value that is determined from the similarity values for the various features we considered relevant for region description. These are sim_θ for spatial similarity, sim_shape for shape similarity, sim_scale for scale similarity, sim_rotation for rotational similarity, sim_color for color similarity and sim_texture for textural similarity. Usually, such features are weighted, as a user may consider colors more important than textures, or the arrangement of shapes more important than the single shapes. However, their composition is not a weighted sum, since we use the minimum for computing similarities. The minimum stems from the fuzzy interpretation of the logical "and" in the formal language we mentioned in the beginning. The minimum is also suitable when composing the similarities of each object shape f_j with the region r_μ(j) it is mapped into; in fact, if f_j is totally dissimilar from r_μ(j), its similarity drops to zero, and we want this null value to propagate to the similarity of the entire layout. In this way, requiring all objects to be present in the image is not an additional constraint, but comes naturally from the semantics. Therefore, we have the problem of combining weighted minimums. The problem has been solved by Fagin and Wimmers (2000), in a way we briefly recall. Suppose the coefficients α, β, γ, η, δ, ε, with α + β + γ + η + δ + ε = 1, weight the relevance each feature has in the global similarity computation. We number these coefficients as θ_1, …, θ_6, and link them to the (numbered) similarities as follows:
θ_1 = α   sim_1 = sim_θ
θ_2 = β   sim_2 = sim_shape
θ_3 = γ   sim_3 = sim_color
θ_4 = η   sim_4 = sim_scale
θ_5 = δ   sim_5 = sim_rotation
θ_6 = ε   sim_6 = sim_texture
Now we reorder the coefficients in non-increasing order, by means of a permutation σ: {1, …, 6} → {1, …, 6} such that θ_σ(1) ≥ θ_σ(2) ≥ … ≥ θ_σ(6). Then the formula by Fagin and Wimmers is as follows:

similarity = (θ_σ(1) − θ_σ(2)) · sim_σ(1) + 2 (θ_σ(2) − θ_σ(3)) · min(sim_σ(1), sim_σ(2)) + 3 (θ_σ(3) − θ_σ(4)) · min(sim_σ(1), sim_σ(2), sim_σ(3)) + … + 6 θ_σ(6) · min(sim_σ(1), …, sim_σ(6))    (1)

The behavior of the algorithm, hereafter presented, obviously depends on the configuration parameters, shown in Table 1, which determine the relevance of the various features involved in the similarity computation.

Algorithm Recognize (Q, I)
input: a query picture Q composed of n single objects, and an image I, segmented into regions r_1, …, r_m
output: a similarity measure of the recognition of Q in I
begin
  /* for each object in the sketch */
  for j ∈ {1, …, n} do
    sim_ss_max = 0
    /* make sure that at least one region is similar to the object */
    for i ∈ {1, …, m} do
      compute the similarity sim_ss(f_j, r_i) between f_j and r_i
      if sim_ss_max < sim_ss(f_j, r_i) then sim_ss_max = sim_ss(f_j, r_i)
    endfor
    /* if no region is similar to object j, fail */
    if sim_ss_max < threshold then return (0)
  endfor
  s_max = 0
  for all injective functions μ: {1, …, n} → {1, …, m} yielding a group of n regions r_μ(1), …, r_μ(n) such that sim_ss(f_j, r_μ(j)) > threshold for j = 1, …, n do
    extract the MθR-string MθR_I of r_μ(1), …, r_μ(n)
    compute the similarity s using Formula (1)
    if s_max < s then s_max = s
  endfor
  return s_max
end
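To make the weighted-minimum combination of Formula (1) concrete, here is a small Python sketch of the Fagin–Wimmers rule as recalled above; function and variable names are ours, not from the paper:

```python
def fagin_wimmers(weights, sims):
    """Fagin-Wimmers combination of feature similarities:
    sort the weights in non-increasing order (the permutation sigma),
    then sum, over k, of k * (theta_sigma(k) - theta_sigma(k+1)) times
    the minimum of the k best-weighted similarities, with the weight
    after the last taken as 0. Assumes sims lie in [0, 1]."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    w = [weights[i] for i in order]
    s = [sims[i] for i in order]
    total, running_min = 0.0, 1.0
    for k in range(len(w)):
        running_min = min(running_min, s[k])          # min of first k+1 sims
        next_w = w[k + 1] if k + 1 < len(w) else 0.0  # theta_sigma(k+2) or 0
        total += (k + 1) * (w[k] - next_w) * running_min
    return total
```

Two sanity checks follow from the formula: with all six weights equal the result collapses to the plain minimum of the similarities, and with all weight on one feature it returns that feature's similarity alone.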
5. Results
The approach described in the preceding sections has been deployed in DESIR (DEscription logics Structured Image Retrieval), a prototype interactive system. Queries can be posed by sketch, using pre-stored basic shapes or newly created ones (see an example of query by sketch in Fig. 2), but also by example. Both a query by example and a submitted database image undergo the same procedure: segmentation, feature extraction and classification.

Fig. 2. A query by sketch and retrieved images.

One of the unfortunate aspects of similarity retrieval is that it is inherently fuzzy and subjective, and widely accepted benchmarks are still to be found. Although our image database currently stores a few thousand images, in order to ease comparisons with previous work on spatial-relationships-based image-retrieval approaches we chose an experimental setup close to the one proposed by Gudivada and Raghavan (1995), and also adopted by Gudivada (1998) and El-Kwae and Kabuka (1999), in which a small, well-defined set of images is used to check the system performance against human users' judgement. Our test set consists of 93 images (available at www-ictserv.poliba.it/disciascio/test-images.htm), picturing simple or composite objects arranged together. The total number of different objects was 18. Images were captured using a digital camera; all had size 1080 × 720 pixels, 24 bits/pixel. Images were automatically segmented by the system. Thirty-one images of the set were selected as queries by example. The resulting classification carried out by the system against the image set was defined as the system-provided ranking. Fig. 3 shows one of the queries and the retrieved set of images. To obtain a users' ranking we then asked five volunteers to classify the 93 images based on their similarity to each query image. Each user could group database images that he/she considered equivalent in terms of similarity to a given query. The five classifications were not univocal, and they were merged into a unique ranking by considering, for each image, the minimum ranking among the five available. This provided a users' ranking for the same set of queries. Notice that this approach limited the weight that images badly classified by single users have on the final ranking. Also following the approach by Gudivada and Raghavan (1995), we measured the retrieval effectiveness adopting the Rnorm (Bollmann et al., 1985) as quality measure. Assuming G is a finite set of images with a user-defined preference relation P that is complete and transitive, and Δ_usr is the rank ordering of G induced by the user preference relation, let Δ_sys be some rank ordering of G induced by the similarity values computed by the image retrieval system. The formulation of Rnorm is:
Rnorm(Δ_sys) = (1/2) · ( 1 + (S⁺ − S⁻) / S⁺_max )
where S⁺ is the number of image pairs where a better image is ranked by the system ahead of a worse one; S⁻ is the number of pairs where a worse image is ranked ahead of a better one; and S⁺_max is the maximum possible value of S⁺. It should be noticed that the calculation of S⁺, S⁻, and S⁺_max is based on the ranking of image pairs in Δ_sys relative to the ranking of the corresponding image pairs in Δ_usr. Rnorm values are in the range [0, 1.0]; 1.0 corresponds to a system-provided ordering of the database images that is either identical to the one provided by the human experts or has a higher degree of resolution; lower values correspond to a proportional disagreement between the two. We obtained an average Rnorm-AVG = 0.89, with lowest value Rnorm-MIN = 0.62 and highest Rnorm-MAX = 1.0. We also computed, as a reference, average values for Precision = NRR/N and Recall = NRR/NR, where NRR is the number of images retrieved and relevant, NR is the total number of relevant images in the database, and N is the total number of retrieved images. We obtained Precision-AVG = 0.91 and Recall-AVG = 0.80.

Results presented by Gudivada and Raghavan (1995) showed an average Rnorm = 0.98, on a database of 24 iconic images used both as queries and database images, with similarity computed only on spatial relationships between icons. Our system works on real images and computes similarity on several image features; we believe that these results prove the ability of the system to capture to a good extent the user's information need, and to make refined distinctions between images when searching for composite shapes. Obviously, we do not claim these results would scale exactly to larger image databases, or to databases with images picturing scenes without well-defined objects. Nevertheless they show, in a controlled framework, good performance. As a final remark, notice that we do not require full picture matching; the image retrieved from the sketch may contain other regions not in the sketch. However, all components of the sketch must be recognized in an image in order to make it retrievable. Only in this case is the database image processed to determine its similarity. This approach tends to increase precision, even at the expense of recall. It is anyway reasonable to assume that, in an interactive session using query-by-sketch, a user will introduce a partial query picturing what he/she thinks are the main features/objects representing his/her information need. Should the retrieved set still be too large, further details can be added to reduce the set.
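The pairwise Rnorm computation above can be sketched in Python; rank arrays and function names are our illustrative choices, with lower rank values meaning "judged more similar":

```python
def rnorm(sys_rank, usr_rank):
    """Rnorm of Bollmann et al. (1985). For every ordered pair (i, j)
    that the user strictly prefers (usr_rank[i] < usr_rank[j]):
    s_plus counts pairs the system orders the same way, s_minus counts
    inversions, s_plus_max counts all user-comparable pairs; ties in
    the system ordering count in neither s_plus nor s_minus."""
    s_plus = s_minus = s_plus_max = 0
    n = len(sys_rank)
    for i in range(n):
        for j in range(n):
            if usr_rank[i] < usr_rank[j]:  # user prefers image i to j
                s_plus_max += 1
                if sys_rank[i] < sys_rank[j]:
                    s_plus += 1
                elif sys_rank[i] > sys_rank[j]:
                    s_minus += 1
    return 0.5 * (1.0 + (s_plus - s_minus) / s_plus_max)
```

A system ranking identical to the user's gives 1.0, and a fully reversed ranking gives 0.0, matching the [0, 1.0] range stated above.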
6. Conclusion
Starting from the observation that most image retrieval systems either rely on global features or concentrate on symbolic images considered only in terms of spatial positioning, we proposed an approach particularly suitable for query-by-sketch image retrieval, able to handle queries made of several shapes, where the position, orientation and size of the shapes relative to each other are meaningful. In our approach we start by extracting basic features of the image and combine them in a higher-level description of the spatial layout of components, characterizing the semantics of the scene. We also defined a similarity algorithm that measures a weighted global similarity between a sketched query and a database image and allows for both a perfect matching and an approximate recognition of sketches in real images.

Fig. 3. A query by example and retrieved images.
References
Bollmann, P., Jochum, F., Reiner, U., Weissmann, V., Zuse, H., 1985. The LIVE-Project: retrieval experiments based on evaluation viewpoints. In: Proceedings of the 8th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, pp. 213–214.

Cardoze, D.E., Schulman, L.J., 1998. Pattern matching for spatial point sets. In: Proceedings of the 39th Annual Symposium on the Foundations of Computer Science (FOCS'98).

Chang, S.K., 1989. Principles of Pictorial Information Systems Design. Prentice-Hall, Englewood Cliffs, New Jersey.

Chang, S.K., Shi, Q.Y., Yan, C.W., 1983. Iconic indexing by 2D strings. IEEE Trans. Pattern Anal. Mach. Intelligence 9 (3), 413–428.

Chew, L.P., Goodrich, M.T., Huttenlocher, D.P., Kedem, K., Kleinberg, J.M., Kravets, D., 1997. Geometric pattern matching under Euclidean motion. Comput. Geom. 7, 113–124.

Di Sciascio, E., Donini, F.M., Mongiello, M., 2000a. A description logic for image retrieval. In: Lamma, E., Mello, P. (Eds.), AI*IA 99: Advances in Artificial Intelligence, number 1792 in Lecture Notes in Artificial Intelligence. Springer, pp. 13–24.

Di Sciascio, E., Donini, F.M., Mongiello, M., 2000b. Semantic indexing for image retrieval using description logics. In: Laurini, R. (Ed.), Advances in Visual Information Systems, number 1929 in Lecture Notes in Computer Science. Springer, pp. 372–383.

El-Kwae, E.A., Kabuka, M.R., 1999. Content-based retrieval by spatial similarity in image databases. ACM Trans. Inform. Syst. 17, 174–198.

Fagin, R., Wimmers, E.L., 2000. A formula for incorporating weights into scoring rules. Theor. Comput. Sci. 239, 309–338.

Guan, D.J., Chou, C.Y., Chen, C.W., 2000. Computational complexity of similarity retrieval in a pictorial database. Inform. Process. Lett. 75, 113–117.

Gudivada, V.N., 1998. θR-string: a geometry-based representation for efficient and effective retrieval of images by spatial similarity. IEEE Trans. Knowl. Data Eng. 10 (3), 504–512.

Gudivada, V.N., Raghavan, J.V., 1995. Design and evaluation of algorithms for image retrieval by spatial similarity. ACM Trans. Inform. Syst. 13 (2), 115–144.

Irani, S., Raghavan, P., 1996. Combinatorial and experimental results for randomized point matching algorithms. In: Proceedings of the 12th ACM Symposium on Computational Geometry, pp. 68–77.

Pok, G., Liu, J., 1999. Texture classification by a two-level hybrid scheme. In: Storage and Retrieval for Image and Video Databases VII, vol. 3656. SPIE, pp. 614–622.

Pratt, W.K., 1991. Digital Image Processing. J. Wiley & Sons Inc., Englewood Cliffs, NJ.

Rui, Y., She, A.C., Huang, T.S., 1996. Modified Fourier descriptors for shape representation: a practical approach. In: Proceedings of the 1st Workshop on Image Databases and Multimedia Search.

Tucci, M., Costagliola, G., Chang, S.K., 1991. A remark on NP-completeness of picture matching. Inform. Process. Lett. 39, 241–243.

Zhou, X.M., Ang, C.H., Ling, T.W., 2001. Image retrieval based on object's orientation spatial relationship. Pattern Recogn. Lett. 22, 469–477.