Spatial layout representation for query-by-sketch content-based image retrieval
E. Di Sciascio *, F.M. Donini, M. Mongiello
Dipartimento di Elettrotecnica ed Elettronica, Politecnico di Bari, Via Re David 200, I-70125 Bari, Italy
Received 10 August 2001; received in revised form 30 November 2001
Abstract
Most content-based image retrieval techniques and systems rely on global features, ignoring the spatial relationships in the image. Other approaches take into account spatial composition but usually focus on iconic or symbolic images. Nevertheless, a large class of users' queries request images having regions, or objects, in well-determined spatial arrangements. Within the framework of query-by-sketch image retrieval, we propose a structured spatial layout representation. The approach provides a way to extract the image content starting from basic features and combining them in a higher-level description of the spatial layout of components, characterizing the semantics of the scene. We also propose an algorithm that measures a weighted global similarity between a sketched query and a database image. © 2002 Elsevier Science B.V. All rights reserved.
Keywords: Image retrieval; Similarity retrieval; Query by sketch; Spatial relationships
1. Introduction
Content-based image retrieval (CBIR) systems mainly perform extraction of visual features, typically color, shape and texture, as a set of uncorrelated characteristics. Such features provide a global description of images but fail to consider the meaning of portrayed objects and the semantics of scenes. At a more abstract level of knowledge about the content of images, extraction of object descriptions and their relative positions provides a spatial configuration and a logical representation of images. Because of the lack of low-level feature extraction, such methods generally fail to consider the physical extension of objects and their primitive features. Anyway, the two approaches to CBIR should be considered complementary. An image retrieval system should perform similarity matching based on the representation of visual features conveying the content of segmented regions; besides, it should capture the spatial layout of the depicted scenes in order to meet the user's expectations.

In this paper we strive to overcome the gap existing between these two approaches. Hence we provide a method for describing the content of an image as the spatial composition of objects/regions
Pattern Recognition Letters 23 (2002) 1599–1612
www.elsevier.com/locate/patrec
* Corresponding author. Tel.: +39-805-460641; fax: +39-805-460410.
E-mail addresses: [email protected] (E. Di Sciascio), [email protected] (F.M. Donini), [email protected] (M. Mongiello).
0167-8655/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-8655(02)00124-1
in which each one preserves its visual features, including shape, color and texture. To this aim we propose a representation for the spatial layout of sketches and a "good performance" algorithm for computing a similarity measure between a sketched query and a database image. Similarity is a concept that involves human interpretation and is affected by the nature of real images; therefore we base our similarity on a fuzzy concept that includes exact similarity in the case of perfect matching.

The outline of the paper is as follows: in the next section, we review related work on image retrieval based on spatial relationships and outline our proposal. In Section 3 we define how we represent shapes in the user sketch, and their spatial arrangement, as well as other relevant features, such as color and texture. We define similarly how we segment regions in an image, and compare the sketch with an arrangement of regions. Then in Section 4 we define the formulae we use to compute single similarities, and how we compose them using a more general algorithm. In Section 5 we present a set of experimental results, showing the accordance between our method and a group of independent human experimenters. Finally, we draw conclusions in the last section.
2. Related work and proposal of the paper
In the work by Chang et al. (1983), which can be considered the ancestor of works on image retrieval by spatial relationships, the modeling of iconic images was presented in terms of 2D strings, each string accounting for the position of icons along one of the two planar dimensions. In this approach retrieval of images basically reverts to simple string matching. In the paper by Gudivada and Raghavan (1995) objects in a symbolic image are associated with vertexes in a weighted graph. The spatial relationships among the objects are represented through the list of the edges connecting pairs of centroids. A similarity function computes the degree of closeness between the two edge lists representing the query and the database picture as a measure of the matching between the two spatial graphs.

More recent papers on the topic include Gudivada (1998) and El-Kwae and Kabuka (1999), which basically propose extensions of the string approach for efficient retrieval of subsets of icons. Gudivada (1998) proposes a logical representation of an image based on so-called θR-strings. Such a representation also provides a geometry-based approach to iconic indexing based on spatial relationships between the iconic objects in an image, individuated by their centroid coordinates. Translation, rotation and scale variants of images, and the variants generated by an arbitrary composition of these three geometric transformations, are considered. The similarity between a database and a query image is obtained through a spatial similarity algorithm, simg, that measures the degree of similarity between a query and a database image by comparing their θR-strings. The algorithm recognizes rotation, scale and translation variants of the arrangement of objects, and also subsets of the arrangements. A constraint limiting the practical use of this approach is the assumption that an image can contain at most one instance of each icon or object. The worst limitation of the algorithm, however, stems from the fact that it takes into account only the points at which the icons are placed and not the real configuration of objects. For example, a picture with a plane in the left corner and a house in the right one has the same meaning as a picture with the same objects in different proportions, say a small plane and a big house; besides, the relative size and orientation of the objects are not taken into account.

El-Kwae and Kabuka (1999) propose a further
extension of the spatial-graph approach that includes both topological and directional constraints. The topological extension of the objects can obviously be useful in determining further differences between images. The similarity algorithm extends the graph matching proposed by Gudivada and Raghavan (1995) and retains the properties of the original approach, including its invariance to scaling, rotation and translation, and is also able to recognize multiple rotation variants. Even though it considers the topological extension of objects, it is far from considering the compositional structure of objects: the objects are considered as a whole and no reasoning is possible about the meaning and the purpose of the objects or scenes depicted. Several extensions of these approaches have been proposed, with an evaluation and a comparison of computational complexity (Zhou et al., 2001).

The computational complexity of evaluating the
similarity of two arrangements of shapes has been analyzed since the early 1990s. Results have been found for what are called type-0 and type-1 similarity in S.K. Chang's classification (Chang, 1989). Briefly, in type-0 and type-1 similarity, arrangements are considered similar if they preserve horizontal and vertical orderings. For example, if object A is below and on the left of object B in picture 1, picture 2 is considered similar if the same object A appears below and on the left of B, regardless of their relative size and distance in the two pictures. Tucci et al. (1991) studied the case when there can be multiple instances of an object, and found that the problem is NP-complete. Later, Guan et al. (2000) proved that the same lower bound also holds when objects are all distinct. The authors also gave a polynomial-time algorithm for type-2 similarity, which is stricter than type-1 since it considers two arrangements similar only if one of the two is a rigid translation of the other. Type-2 similarity is too strict for our approach, since we also admit rotational and scale variants of an arrangement.

When objects reduce to points, the problem of
evaluating the similarity of arrangements has been called point pattern matching. This problem has been studied in computational geometry, where it is known as the "constellation problem". In Cardoze and Schulman (1998), a randomized algorithm is given for exact matching of points under translations and rotations, but not under scaling. For matching n points in the plane, the algorithm works in O(n² log n), where the probability of finding wrong matches is a decreasing exponential. A different probabilistic algorithm, this one possibly missing good matches, has been proposed by Irani and Raghavan (1996) for the best matching problem, which can also be a non-exact match. The algorithm considers matching under translation, rotation, and scaling. However, the point pattern matching problem solves just the matching
of centroids of an arrangement of shapes, and it is not obvious if and how such algorithms could be generalized to matching arrangements of shapes.

All the previously described methods are partial solutions to the problem of image retrieval. They consider disjoint properties instead of a global similarity measure between images. CBIR essentially reverts to two basic specifications: representation of the image features and definition of a metric for similarity retrieval. Most features adopted in the literature are global ones, which convey information, and measure similarity, based purely on the visual appearance of an image. Unfortunately, what a generic user typically considers the "content" of an image is seldom captured by such global features. Particularly for query-by-sketch image retrieval, the main issue is the recognition of the sketched components of a query in one or more database images, followed by a measure of similarity, to find the more relevant ones. In our approach, we assume retrievable only images where all components of the sketch are present as regions in the image. 1 The approach may be considered more restrictive than other approaches since only images having all the components of the sketch are taken into account. This is because we consider such components meaningful to the overall configuration: why would a user sketch them if not? Once all the shapes of the configuration have been found in the image, other properties of the shapes concur to define the overall similarity measure, such as rotation of each shape, rotation of the overall arrangement, scale and translation, color, texture. For measuring the similarity of the overall arrangement, we propose a modified version of Gudivada's θR-strings.

There are m!/(m − n)! pairings between the n objects of a sketch and n out of m regions in an image, with m > n, hence a blind trial of all possible pairings takes exponential time. The results we mentioned about type-2 similarity and point pattern matching suggest that there might exist a polynomial-time algorithm for matching arrangements of
1 Of course, not all regions in the image must correspond to components of the sketch. In other words, we do not enforce full picture matching.
shapes under translation, rotation and scaling. Since our approach is different from the previous ones, we devised a non-optimal algorithm to conduct our experiments, leaving a deeper analysis of the computational complexity of the problem, and optimal algorithms to solve it, to future research. The algorithm we employ is not totally blind, though, since it takes advantage of the fact that usually most shapes in the sketch and regions in the image do not match each other. Hence, computing an n × m matrix of similarities between shapes and regions, one can discard most of the wrong matchings from the beginning. Nevertheless, the worst case of the algorithm, when all shapes are similar to all regions, is exponential.
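The pruning idea described above can be sketched as follows: given a precomputed shape-similarity matrix, only regions above a threshold are candidates for each sketch component, and injective mappings are enumerated over those candidates alone. This is a minimal illustration in Python; the function and variable names are ours, not the paper's.

```python
def candidate_mappings(sim, threshold=0.9):
    """Enumerate injective mappings from sketch shapes to image regions.

    sim[j][i] is the shape similarity between sketch component j and
    image region i; component j may only map to regions whose
    similarity reaches the threshold, which prunes the search space.
    """
    n = len(sim)  # number of sketch components
    candidates = [
        [i for i, s in enumerate(row) if s >= threshold] for row in sim
    ]
    results = []

    def extend(j, used, partial):
        if j == n:
            results.append(tuple(partial))
            return
        for i in candidates[j]:
            if i not in used:
                extend(j + 1, used | {i}, partial + [i])

    extend(0, frozenset(), [])
    return results

# A 2x4 similarity matrix: shape 0 matches regions 0 and 2 well,
# shape 1 matches only region 3, so only two mappings survive.
sim = [[0.95, 0.1, 0.92, 0.2],
       [0.3, 0.4, 0.1, 0.97]]
print(candidate_mappings(sim))  # -> [(0, 3), (2, 3)]
```

In the worst case (every shape similar to every region) the enumeration is still exponential, exactly as the text observes, but on typical images the candidate lists are short.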
3. Representing shapes, objects and images
In a previous paper (Di Sciascio et al., 2000a) we proposed a structured language (based on ideas borrowed from Description Logics) to describe the complex structure of objects in an image, starting from a region-based segmentation of objects. The complete formalism was presented in another paper (Di Sciascio et al., 2000b), together with the syntax and an extensional semantics, which is fully compositional.

The main syntactic objects we consider are basic shapes, composite shape descriptions, and transformations.
Basic shapes are denoted with the letter B, and have an edge contour e(B) characterizing them. We assume that e(B) is described as a single, closed 2D curve in a space whose origin coincides with the centroid of B. Examples of basic shapes are circle and rectangle, with the contours e(circle) = ○ and e(rectangle) = ▭, but also any complete, rough contour, e.g., the one of a ship, can be a basic shape.
ones that are present in any drawing tool: rota-tion (around the centroid of the shape), scalingand translation. We globally denote a rotation–translation–scaling transformation as s. Recallthat transformations can be composed in se-quences s1 � � � � � sn, and they form a mathemat-ical group.
The basic building block of our syntax is a basic shape component ⟨c, t, τ, B⟩, which represents a region with color c, texture t, and contour τ(e(B)). With τ(e(B)) we denote the pointwise transformation τ of the whole contour of B. For example, τ could specify to place the contour e(B) in the upper left corner of the image, scaled by 1/2 and rotated by 45°.
Composite shape descriptions are conjunctions of basic shape components, each one with its own color and texture. They are denoted as

C = ⟨c1, t1, τ1, B1⟩ ⊓ ⋯ ⊓ ⟨cn, tn, τn, Bn⟩
Notice that this is just an internal representation, invisible to the user, which we map into a visual language actually used to sketch the query.

Gudivada (1998) proposed θR-strings, a geometry-based structure for the representation of spatial relations in images, and an associated spatial similarity algorithm. The θR-string is a symbolic representation of an image. It is obtained by associating a name with each domain object identified in the image, and then considering the sequence of names and coordinates of the centroids of the objects, with reference to a Cartesian coordinate system. Gudivada's original representation was limited by the assumption that images could not contain multiple instances of the same object type. We propose a modified formulation of θR-strings, which allows our similarity algorithm to overcome this limitation. Although we still consider the arrangement of components as the spatial layout of a composite shape, we also describe each shape by including its relevant features. The icons of the symbolic image in Gudivada's θR-string representation are replaced by objects with their shape. Hence each shape is not scale, rotation and translation invariant (as it was in θR-strings), since we believe such properties have their meaning in composite shape descriptions. Therefore we provide a characterization both of images and of objects, in terms of basic shapes composing complex objects in a sketched query and of regions composing database images.

To measure similarity, we propose an algorithm
that takes into account the arrangement of shapes in the sketch and compares it with groups of regions in an image. The algorithm provides a spatial similarity measure that, with respect to Gudivada's simg algorithm, assumes the presence in the image of a group of regions that corresponds to the components of the sketch. Besides, it considers the relative size of corresponding regions and shapes. To define a global similarity measure, we take into account similarity measures depending on all the features that characterize the components of objects, i.e., color, texture, rotation and scaling.
3.1. Description of relevant features
While relevant features are properties easily computable for elements of a sketch, dealing with regions in real images requires segmentation to obtain a partition of the image. Several segmentation algorithms have been proposed in the literature; our approach does not depend on the particular segmentation algorithm adopted. Anyway, it is obvious that the better the segmentation, the better our approach will work.

Our scope here is limited to the description of the computation of image features, assuming a successful segmentation. To make the description self-contained, we start by defining a generic color image as {I(x, y) | 1 ≤ x ≤ N_h, 1 ≤ y ≤ N_v}, where N_h, N_v are the horizontal and vertical dimensions, respectively, and I(x, y) is a triple (R, G, B). We assume that the image I has been partitioned into m regions r_i, i = 1, …, m, satisfying the following properties:
• I = ∪ r_i, i = 1, 2, …, m;
• for all i ∈ {1, 2, …, m}, r_i is a non-empty and simply-connected set;
• r_i ∩ r_j = ∅ for i ≠ j;
• each region satisfies heuristic and physical requirements.
We characterize each region r_i with the following attributes: shape, position, size, orientation, color and texture.
3.1.1. Shape
Given a connected region, a point moving along its boundary generates a complex function defined as z(t) = x(t) + j y(t), t = 1, …, N_b, with N_b the number of boundary points. Following the approach proposed in (Rui et al., 1996) we define the discrete Fourier transform of z(t) as:

Z(k) = Σ_{t=1}^{N_b} z(t) e^{−j2πtk/N_b} = M(k) e^{jθ(k)}

with k = 1, …, N_b.

In order to address the spatial discretization problem we compute the fast Fourier transform (FFT) of the boundary z(t) and use the first (2N_c + 1) FFT coefficients to form a dense, non-uniform set of points of the boundary as:

z_dense(t) = Σ_{k=−N_c}^{N_c} Z(k) e^{j2πtk/N_b}

with t = 1, …, N_dense, where N_dense is the number of "dense" samples used in resampling the Modified Fourier Descriptors.

We then interpolate these samples to obtain uniformly spaced samples z_unif(t), t = 0, …, N_unif. We compute again the FFT of z_unif(t), obtaining Fourier coefficients Z_unif(k), k = −N_c, …, N_c. The shape feature of a region is hence characterized by a vector of 2N_c + 1 complex coefficients.
3.1.2. Position and size
We characterize size and position with the aid of moment invariants (Pratt, 1991). In order to simplify notation, let us assume that the extracted external contour of each region r_i is placed in a new image I_{r_i}(x, y) having the same size as the original image, with a uniform background. Let us also suppose the new images are binarized, i.e., discretized with two levels. The (p, q)th spatial moment of each I_{r_i} can be defined as:

M_u(p, q) = Σ_{y=1}^{N_v} Σ_{x=1}^{N_h} x^p y^q I_{r_i}(x, y)

The (p, q)th scaled spatial moment can be defined as:

M(p, q) = M_u(p, q) / (N_h^p N_v^q)

The definition of size is just the zero-order moment M(0, 0). The position is obtained with reference to the region centroid, having coordinates:

x̄ = M(1, 0) / M(0, 0),  ȳ = M(0, 1) / M(0, 0)
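The moment formulas above translate directly into a few lines of array code; a minimal sketch with our own function names, following the same 1-based coordinates and N_h^p N_v^q normalization as the text:

```python
import numpy as np

def scaled_moment(img, p, q):
    """Scaled spatial moment M(p,q) of a binary region image:
    M_u(p,q) = sum over pixels of x^p y^q I(x,y), then normalized
    by N_h^p * N_v^q."""
    nv, nh = img.shape                      # rows = vertical, cols = horizontal
    ys, xs = np.mgrid[1:nv + 1, 1:nh + 1]   # 1-based coordinates as in the text
    mu = np.sum((xs ** p) * (ys ** q) * img)
    return mu / (nh ** p * nv ** q)

def centroid(img):
    """Region centroid (x_bar, y_bar) in normalized coordinates."""
    m00 = scaled_moment(img, 0, 0)
    return scaled_moment(img, 1, 0) / m00, scaled_moment(img, 0, 1) / m00

# A 2x2 block of ones in the top-left corner of a 4x4 image:
# its centroid in pixel coordinates is (1.5, 1.5).
img = np.zeros((4, 4))
img[0:2, 0:2] = 1.0
xbar, ybar = centroid(img)
```

Note that the centroid comes out in normalized units; multiplying x̄ by N_h (and ȳ by N_v) recovers pixel coordinates.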
3.1.3. Orientation
In order to quantify the orientation of each region r_i we use the same Fourier representation, which stores the orientation information in the phase values. We obviously also deal with special cases in which the shape of a region has more than one symmetry, e.g., a rectangle or a circle. Rotational similarity between a reference shape B and a given region r_i can then be obtained by finding maximum values of the cross-correlation:

C(t) = (1 / (2N_c + 1)) Σ_{k=0}^{2N_c} Z_B(k) Z_{r_i}(k) e^{j2πkt/(2N_c+1)}

with t ∈ {0, …, 2N_c}.
3.1.4. Color
Color information for each region r_i is stored, after quantization in a 112-value color space, as the mean RGB value within the region:

R_{r_i} = (1/|r_i|) Σ_{p∈r_i} R(p),  G_{r_i} = (1/|r_i|) Σ_{p∈r_i} G(p),  B_{r_i} = (1/|r_i|) Σ_{p∈r_i} B(p)
3.1.5. Texture
We extract texture information for each region r_i with a method based on the work in (Pok and Liu, 1999). Following this approach, we extract texture features by convolving the original grey-level image I(x, y) with a bank of Gabor filters, having the following impulse response:

h(x, y) = (1 / (2πσ²)) e^{−(x²+y²)/(2σ²)} e^{j2π(Ux+Vy)}

where σ is the scale factor, (U, V) represents the filter location in frequency-domain coordinates, and k and θ are the central frequency and the orientation, respectively defined as:

k = √(U² + V²),  θ = arctan(U/V)
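A single Gabor kernel of this form can be built directly from the impulse-response formula; the following is a minimal construction of ours, not the authors' full filter bank or their feature-extraction pipeline:

```python
import numpy as np

def gabor_kernel(size, sigma, U, V):
    """Complex Gabor impulse response
    h(x,y) = 1/(2 pi sigma^2) * exp(-(x^2+y^2)/(2 sigma^2))
             * exp(j 2 pi (U x + V y)),
    sampled on a size x size grid centred at the origin."""
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    envelope = np.exp(-(xs ** 2 + ys ** 2) / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)
    carrier = np.exp(2j * np.pi * (U * xs + V * ys))
    return envelope * carrier

k = gabor_kernel(size=15, sigma=2.0, U=0.1, V=0.0)
# Central frequency as defined in the text: sqrt(U^2 + V^2).
freq = np.hypot(0.1, 0.0)
```

Convolving the grey-level image with a bank of such kernels at several (U, V) locations and scales σ, and collecting statistics of the responses, yields a texture feature vector per region.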
The processing extracts a 24-component feature vector, which characterizes each textured region.

Descriptions relative to basic shapes can be used to build complex objects by properly composing such components. For example, we can sketch a house using a rectangle and a triangle posed on its top. To deal with complex objects we assume a set F of basic shapes that can be used for comparison with other instances of the same shape having the same proportions. The image composed in Fig. 1 has five shapes; two of them are occurrences of the same shape, a circle, but with a different scale factor. To describe a complex object we extend the description given for a basic shape.

From now on, the regions of a query sketch will
be denoted by f_j, j = 1, …, n. Each region f_j represents an instance of a shape F_j. Consider an observer posed at the center-of-mass of the sketch. We define the order in which the components appear as the order in which they are seen from this viewpoint, counter-clockwise (CCW). For each component of index j, we call its left-hand side component, with index j + 1, the following component CCW, and its right-hand side component, with index j − 1, the preceding component CCW. A shape is hence characterized as follows:
• index of F_j;
• coordinates of the centroid C_{f_j} of the contour of f_j;
• length of the segment C_Q C_{f_j} between the centroid of the shape and the centroid of the image;
• angle of the segment C_Q C_{f_j} with the x-axis;
• scale factor of the shape f_j with respect to the shape F_j;
• rotation angle of f_j with respect to F_j;
Fig. 1. Information contained in the f_k component of a MθR-string.
• index j − 1 of the shape F_{j−1} of which the shape that precedes f_j is an instance;
• index j + 1 of the shape F_{j+1} of which the shape that follows f_j is an instance;
• color of f_j;
• texture of f_j;
• distance between the centroid of the edge of shape f_j and that of shape f_{j+1} on the left side of f_j;
• distance between the centroid of the edge of shape f_j and that of shape f_{j−1} on the right side of f_j.
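The attribute list above amounts to a fixed record per component. A compact sketch of such a record as a Python dataclass follows; the field names are ours, not the paper's, and the sample values are invented:

```python
from dataclasses import dataclass

@dataclass
class ShapeComponent:
    """One entry f_j of a sketched object's description, mirroring
    the attribute list in the text."""
    index: int         # index of the prototypical shape F_j
    centroid: tuple    # centroid C_{f_j} of the contour of f_j
    distance: float    # length of the segment to the sketch centroid
    angle: float       # angle of that segment with the x-axis
    scale: float       # scale of f_j with respect to F_j
    rotation: float    # rotation of f_j with respect to F_j
    index_l: int       # prototype index of the following (left) neighbour
    index_r: int       # prototype index of the preceding (right) neighbour
    color: tuple       # mean RGB colour
    texture: tuple     # texture feature vector
    distance_l: float  # centroid distance to the left neighbour
    distance_r: float  # centroid distance to the right neighbour

# Hypothetical roof component of the "house" example above.
house_roof = ShapeComponent(index=2, centroid=(50.0, 20.0), distance=12.5,
                            angle=90.0, scale=1.0, rotation=0.0,
                            index_l=1, index_r=1, color=(180, 40, 40),
                            texture=(), distance_l=15.0, distance_r=15.0)
```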
3.2. Spatial representation with the modified θR-string
We now define the Modified θR-string (MθR-string). The terminology we introduce refers to Definitions 1 through 8 of the work by Gudivada (1998) and properly extends them for our purposes. The main difference concerns the definition itself: in the original version there is a unique definition of θR-string for both the database image and the sketched query, whereas here we give two different definitions, one for a sketched query and one for a real image. In the definitions we refer to the group of regions of the segmented image and to the shape components of a complex object. Nearly all the terminology extends to both definitions. For a more accurate analysis of the differences between the θR-string and the MθR-string, recall the definition of Augmented θR-string given by Gudivada. In a MθR-string we consider the index of the reference shape instead of the name of the object; the angle θ used for defining the order relation among components is computed between the center-of-mass of the considered shape and the centroid of the centroids in the image, instead of the centroid of the image I. Such an angle induces the order relation in which the elements appear: the same order in which the components are seen, CCW, by an observer posed at their center-of-mass. With respect to this order relation, the definition of left and right neighbors of a shape (region) extends to the MθR-string, and the indexes of the relative shapes are added to the MθR-string. The distances between components are computed using the Euclidean distance between the centroids and the angle θ. Moreover, in the MθR-string of an image the index of the shape f_j in the object description corresponding to the region r_i is added. We remind the reader that before determining the MθR-string it is necessary to search for a group of regions in the image that resembles the set of components in the sketched object.
Definition 1 (MθR-string for an object Q). The MθR-string for a complex object Q is a list of n elements, one for each component. The order in which the elements appear is the same order in which the components are seen, CCW, by an observer posed at their center-of-mass. Then, each component has a left-hand side component and a right-hand side one. The element for the jth component f_j of Q contains the items below:

• f_j.index: index of the prototypical shape of which f_j is an instance;
• f_j.angle: the angle that the line joining the centroid ("center-of-mass") of f_j with the centroid of the centroids of Q subtends with the positive x-axis;
• f_j.distance: distance between the centroid of f_j and the centroid of the object description;
• f_j.indexR: index of the prototypical shape of the shape on the right side of f_j;
• f_j.indexL: index of the prototypical shape of the shape on the left side of f_j;
• f_j.distanceR: distance between the centroid of f_j and the centroid of the shape on the right side;
• f_j.distanceL: distance between the centroid of f_j and the centroid of the shape on the left side.
An example is shown in Fig. 1, where we represent the MθR-string for a complex object. We refer to the MθR-string of the f_k component, here represented by the bold circle. f_k.index is the index of the basic shape which represents the bold circle. The components f_k.indexR and f_k.indexL are respectively the index of the circle on the right side and of the rectangle on the left side of f_k. The other components of the MθR-string are the distances and the angle highlighted in the figure.
Definition 2 (MθR-string for an image I). The MθR-string for an image I is a list of m elements, one for each segmented region. The order in which the elements appear is the same order in which the components are seen, CCW, by an observer posed at their center-of-mass. Then, each component has a left-hand side component and a right-hand side one. The element for the ith region r_i of I contains the items below:
(1) r_i.index: index of the shape of which r_i is an instance;
(2) r_i.angle: the angle that the line joining the centroid of r_i with the centroid of the centroids of I subtends with the positive x-axis;
(3) r_i.distance: distance between the centroid of r_i and the centroid of the group of regions;
(4) r_i.indexR: index of the shape corresponding to the region on the right side of r_i;
(5) r_i.indexL: index of the shape corresponding to the region on the left side of r_i;
(6) r_i.distanceR: distance between the centroid of r_i and the centroid of the region on the right side;
(7) r_i.distanceL: distance between the centroid of r_i and the centroid of the region on the left side;
(8) r_i.ObjectShape: index of the shape f_j in the object description corresponding to the region r_i (this value is returned by the search for a group of regions).
4. Computing similarities
In all similarity measures we adopt the function U(x, g_x, g_y). This function was determined by trial and error to replace the exponential one used by Gudivada in his similarity algorithm. The motivation was that the exponential function was too steep, so that even small variations in the relative position of the centroids would produce considerable variations in the similarity function sim_θ, which appeared too drastic for real image retrieval. We therefore chose a function with a null first-order derivative at x = 0. The role of the function is to "smooth" the changes of the quantity x, depending on two parameters g_x, g_y which have been tuned in the experimental phase, and to turn a distance x (in which 0 corresponds to perfect matching) into a similarity measure (in which the value 1 corresponds to perfect matching).
U(x, g_x, g_y) =
  g_y + (1 − g_y) cos(πx / (2g_x))                          if 0 < x ≤ g_x
  g_y [1 − arctan(π (x − g_x)(1 − g_y) / (g_x g_y)) / π]    if x > g_x
where the values g_x > 0 and 0 < g_y < 1 have been experimentally determined, and are presented in Table 1.

Given a query Q with n objects and a picture I segmented into m regions, from all the groups of regions in the picture that might resemble the components we select the groups that present the higher spatial similarity with the objects. In artificial examples in which all shapes in I and Q resemble each other, this may generate an exponential number of groups to be tested, given by m(m − 1)⋯(m − n + 1). Hence, our algorithm may not be optimal from a computational point of view. Whether or not there exist polynomial-time algorithms to solve this problem is still an open question; a polynomial-time algorithm is known only if scaling transformations of a layout
Table 1
Configuration parameters, grouped by feature type
Parameter Value
Fourier descriptors threshold 0.9
Circular symmetry threshold 0.9
θR-string spatial factor 0.25
θR-string scale factor 0.25
Spatial similarity threshold 0.2
Spatial similarity weight a 0.30
Spatial similarity sensitivity gx 0.5
Spatial similarity sensitivity gy 0.4
Shape similarity weight b 0.30
Shape similarity sensitivity gx 0.005
Shape similarity sensitivity gy 0.2
Scale similarity weight g 0.1
Scale similarity sensitivity gx 0.5
Scale similarity sensitivity gy 0.4
Rotation similarity weight d 0.1
Rotation similarity sensitivity gx 10
Rotation similarity sensitivity gy 0.3
Color similarity weight c 0.1
Color similarity sensitivity gx 110
Color similarity sensitivity gy 0.4
Texture similarity weight ε 0.1
Texture similarity sensitivity gx 110
Texture similarity sensitivity gy 0.4
Global similarity threshold 0.29
are not considered (Chew et al., 1997). However, in typical real images the similarity between shapes is selective enough to yield only a very small number of possible groups to try. Therefore, our algorithm behaves efficiently enough to carry out experiments. We recall that in Gudivada's approach (1998) a strict assumption is made, namely, that each basic component in Q does not appear twice, and that each region in I matches at most one component in Q. Our approach, relaxing this assumption, is better suited for image retrieval. We compute the shape similarity sim_ss between regions of I and shape components of Q, and select regions with a similarity greater than a given threshold. Computation of sim_ss is invariant with respect to scale and rotation. The measure is obtained as the cosine norm applied to the tuples Z_{f_j} and Z_{r_i} of complex coefficients describing respectively the shape of a region r_i and the shape of a component f_j, with Z_{f_j} = (x_1, …, x_{2N_c}) and Z_{r_i} = (y_1, …, y_{2N_c}):
sim_ss(f_j, r_i) = ( Σ_{l=1}^{2N_c} x_l y_l ) / √( Σ_{l=1}^{2N_c} x_l² · Σ_{l=1}^{2N_c} y_l² )
sim_ss(f_j, r_i) is a number in the range [0, 1], where sim_ss = 1 corresponds to maximum similarity between a region and a shape component.
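The cosine-norm formula above can be sketched in a few lines. Since the Fourier coefficients are complex, this illustration applies the formula to their magnitudes; that choice is our assumption, not spelled out in the text, and the function name is ours.

```python
import numpy as np

def sim_ss(zf, zr):
    """Cosine-norm similarity between two shape-descriptor vectors,
    computed on coefficient magnitudes (our assumption for the
    complex case). Returns a value in [0, 1]; 1 means maximally
    similar descriptors."""
    x = np.abs(np.asarray(zf, dtype=complex))
    y = np.abs(np.asarray(zr, dtype=complex))
    return float(np.dot(x, y) / np.sqrt(np.dot(x, x) * np.dot(y, y)))

# Parallel descriptors (one a scaled copy of the other) score 1.0,
# which is why sim_ss is invariant to scale.
same = sim_ss([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The scale invariance visible in the example is exactly the property the text claims for sim_ss: multiplying all coefficients of one shape by a constant does not change the cosine.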
4.1. Spatial similarity
We recall that all the components of Q must be found in I. So, the recognition algorithm discards, without further processing, those images that do not contain all the shape components of Q. Notice that we consider a shape component recognized if its shape similarity with respect to a region in the image is beyond a threshold. Also notice that this does not prevent regions not included in the sketch Q from being present in the image I. As a consequence, we describe the MθR-string algorithm assuming the presence in an image of a group of regions that corresponds to the components of Q. Let the objects in the sketch be numbered 1, …, n and the regions in the image 1, …, m. To make precise the correspondence between objects and regions, we use an injective mapping μ: {1, …, n} → {1, …, m}, such that object i is matched with region r_{μ(i)}. In order to compute the spatial similarity of the overall arrangement of regions with reference to the arrangement of components in Q, we consider their centroids. Once the groups of regions have been selected, the algorithm extracts for each group the corresponding MθR-string and evaluates the overall similarity. The spatial similarity value sim_θ is computed by the ComputeSIMθ algorithm, considering the relative positions of shapes and regions (normalized with respect to the image size), and accounts for the similarity in terms of the spatial arrangements of objects.
Algorithm ComputeSIMθ (MθR_Q, MθR_I)
input: the MθR-string of the query Q;
       the MθR-string of the group of n regions r_μ(1), …, r_μ(n)
output: sim_θ
begin
  sim_θ = 0
  for j ∈ {1, …, n} do
    i = r_μ(j).ObjectShape
    if r_μ(j).indexR = f_i.indexR then
      sim_θ = sim_θ + K_SpatialFactor
      sim_θ = sim_θ + K_ScaleFactor · U_θ(|f_i.distanceR − r_μ(j).distanceR| / Q.magnitude)
    endif
    if r_μ(j).indexL = f_i.indexL then
      sim_θ = sim_θ + K_SpatialFactor
      sim_θ = sim_θ + K_ScaleFactor · U_θ(|f_i.distanceL − r_μ(j).distanceL| / Q.magnitude)
    endif
  endfor
  return sim_θ
end
The factors K_SpatialFactor and K_ScaleFactor are constants, subject to the following constraint: 2·K_SpatialFactor + 2·K_ScaleFactor = 1. K_SpatialFactor is the degree of importance of the spatial relationships between the corresponding region and shape in computing the spatial similarity; K_ScaleFactor is the degree of importance of the scale variations between the region and the corresponding shape. The multiplicative term 2 comes from the computation of the spatial similarity considering both the left and right neighbors of a shape (region). Q.magnitude represents the extension of the object Q. Its value is obtained as the mean value of the lengths of the segments between the centroid of each shape and the centroid of the image.

E. Di Sciascio et al. / Pattern Recognition Letters 23 (2002) 1599–1612 1607
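As an illustrative sketch (not the system's actual code), the scoring loop of ComputeSIMθ can be rendered in Python. The record layout, the constant values, and the smoothing function u_theta are our assumptions; the paper parameterises its smoothing function U differently:

```python
K_SPATIAL = 0.3  # assumed weight of matching neighbour indices
K_SCALE = 0.2    # assumed weight of neighbour-distance agreement
# the paper's constraint: 2*K_SpatialFactor + 2*K_ScaleFactor = 1
assert abs(2 * K_SPATIAL + 2 * K_SCALE - 1.0) < 1e-9

def u_theta(d):
    """Hypothetical stand-in for the paper's smoothing function U:
    maps a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

def compute_sim_theta(query, regions, magnitude):
    """query/regions: lists of dicts; each region carries the index of
    its matched query shape plus the left/right neighbour indices and
    centroid distances stored in the MθR-string."""
    sim = 0.0
    for r in regions:
        f = query[r["shape"]]
        if r["indexR"] == f["indexR"]:  # right neighbour agrees
            sim += K_SPATIAL
            sim += K_SCALE * u_theta(abs(f["distanceR"] - r["distanceR"]) / magnitude)
        if r["indexL"] == f["indexL"]:  # left neighbour agrees
            sim += K_SPATIAL
            sim += K_SCALE * u_theta(abs(f["distanceL"] - r["distanceL"]) / magnitude)
    return sim
```

With a single region whose neighbours and distances match the query exactly, the score is 2·(K_SpatialFactor + K_ScaleFactor) = 1, the maximum allowed by the constraint.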
4.2. Shape similarity
sim_shape measures the similarity between the shapes in the composite shape description and the regions in the segmented image:
sim_shape = U( max_{j=1…n} [1 − sim_ss(f_j, r_μ(j))], gx_shape, gy_shape ).
4.3. Scale similarity
sim_scale takes into account the differences in scale between each region r_μ(j) in the considered group of regions and the corresponding basic shape f_j in Q:
Δ_scale = max_{j=1…n} { |r_μ(j).scale / ScaleFactor − f_j.scale| / f_j.scale }
where f_j.scale is the scaling factor in f_j (the transformation for the j-th component of Q), r_μ(j).scale is the scaling factor of region r_μ(j) when matched to basic shape f_j, and ScaleFactor is the overall scaling factor of the selected group of regions when matched with Q. Observe that we choose the maximum, since the differences are distances, and not similarities. Then we compute the scale similarity with the aid of the function U:
sim_scale = U(Δ_scale, gx_scale, gy_scale).
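The scale-discrepancy computation above can be sketched in Python as follows; the concrete form of the smoothing function u is our assumption, since the paper only states that U maps a distance and two parameters to a similarity:

```python
def delta_scale(region_scales, shape_scales, scale_factor):
    """Maximum relative scale discrepancy between matched regions and
    query shapes, after factoring out the group's overall ScaleFactor."""
    return max(abs(r / scale_factor - s) / s
               for r, s in zip(region_scales, shape_scales))

def u(d, gx=1.0, gy=1.0):
    """Hypothetical stand-in for the paper's smoothing function
    U(d, gx, gy): linearly maps distance 0 to similarity gy, and
    distances >= gx to 0."""
    return max(0.0, 1.0 - d / gx) * gy

# a group uniformly scaled by 2 relative to the query has zero discrepancy
sim_scale = u(delta_scale([2.0, 4.0], [1.0, 2.0], 2.0))
```

Because the group-level ScaleFactor is divided out first, a uniformly rescaled group of regions still yields Δ_scale = 0 and hence maximum scale similarity.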
4.4. Rotation similarity
sim_rotation takes into account the errors in the rotation of each region in the considered group of regions with respect to the corresponding basic shapes in Q. We start by finding the RotationFactor of the group of n regions r_μ(1), …, r_μ(n) with respect to Q, in accordance with the following algorithm:
Algorithm ComputeSIMRotation (MθR_Q, MθR_I)
input: the MθR-string of the query Q;
       the MθR-string of the group of n regions r_μ(1), …, r_μ(n)
output: RotationFactor
begin
  RotationFactor = 0
  for j ∈ {1, …, n} do
    i = r_μ(j).ObjectShape
    Δ_angle = r_μ(j).angle − f_i.angle
    if Δ_angle > 180° then
      Δ_angle = Δ_angle − 360°
    else if Δ_angle ≤ −180° then
      Δ_angle = Δ_angle + 360°
    endif
    RotationFactor = RotationFactor + Δ_angle
  endfor
  RotationFactor = RotationFactor / n
  return RotationFactor
end ComputeSIMRotation
In order to compute the rotation of each region with respect to the corresponding shapes we consider the maximum angles a_r, r ∈ {1, …, k_j}, j ∈ {1, …, n}, obtained by computing the cross-correlation function, as described in Section 3. Each rotation error Δ_j, for j ∈ {1, …, n}, is then computed as follows:
for r ∈ {1, …, k_j} do
  Δ_r = |a_r − RotationFactor − B_j.angle|
  Δ_r = Δ_r mod 180°
endfor
Δ_j = min_{r=1…k_j} {|Δ_r|}
The maximum rotation difference is then:
Δ_rotation = max_{j=1…n} {|Δ_j|}
and again, since we computed a distance, we smooth and convert it into a similarity measure with the help of U:
sim_rotation = U(Δ_rotation, gx_rotation, gy_rotation).
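The group RotationFactor computed by ComputeSIMRotation, with its wrap of each angular difference into (−180°, 180°], can be sketched in Python; the function name is ours:

```python
def rotation_factor(region_angles, shape_angles):
    """Average signed angular offset of the region group with respect
    to the query shapes, with each difference wrapped into (-180, 180]
    so that, e.g., 350 deg vs 10 deg counts as a 20 deg offset, not 340."""
    total = 0.0
    for r, s in zip(region_angles, shape_angles):
        d = r - s
        if d > 180.0:
            d -= 360.0
        elif d <= -180.0:
            d += 360.0
        total += d
    return total / len(region_angles)
```

Averaging the wrapped differences gives the overall rotation of the group, against which each individual region's error Δ_j is then measured.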
4.5. Color similarity
sim_color measures the similarity in terms of color appearance between the regions and the corresponding shapes in the composite shape description. In the following formula, Δ_color(j).R denotes the difference in the red color component between the j-th component of Q and the region r_μ(j), and similarly for the green and the blue color components:
Δ_color(j) = √( [Δ_color(j).R]² + [Δ_color(j).G]² + [Δ_color(j).B]² )

Then the function U takes the maximum of the differences to obtain a similarity:

sim_color = U( max_{j=1…n} {Δ_color(j)}, gx_color, gy_color ).
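The per-component color distance above is a plain Euclidean distance in RGB space; a minimal Python sketch (names ours):

```python
def delta_color(shape_rgb, region_rgb):
    """Euclidean distance between the mean RGB colour of a query shape
    and that of its matched region; 0 means identical mean colour."""
    return sum((a - b) ** 2 for a, b in zip(shape_rgb, region_rgb)) ** 0.5
```

The overall color similarity then smooths the worst (largest) of these distances over the n matched pairs, consistently with the max in the formula above.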
4.6. Texture similarity
Finally, sim_texture measures the similarity between the texture features in the components of Q and in the corresponding regions:
sim_texture = U(Δ_texture, gx_texture, gy_texture).
4.7. The recognition algorithm
We now formally describe the recognition algorithm. The algorithm uses a global similarity value that is determined from the similarity values for the various features we considered relevant for region description. These are sim_θ for spatial similarity, sim_shape for shape similarity, sim_scale for scale similarity, sim_rotation for rotational similarity, sim_color for color similarity and sim_texture for textural similarity. Usually, such features are weighted, as a user may consider colors more important than textures, or the arrangement of shapes more important than the single shapes. However, their composition is not a weighted sum, since we use the minimum for computing similarities. The minimum stems from the fuzzy interpretation of the logical "and" in the formal language we mentioned in the beginning. The minimum is also suitable when composing the similarities of each object shape f_j with the region r_μ(j) it is mapped into; in fact, if f_j is totally dissimilar from r_μ(j), its similarity drops to zero, and we want this null value to propagate to the similarity of the entire layout. In this way, requiring all objects to be present in the image is not an additional constraint, but comes naturally from the semantics. Therefore, we have the problem of combining weighted minimums. The problem has been solved by Fagin and Wimmers (2000), in a way we briefly recall. Suppose the coefficients α, β, γ, η, δ, ε, with α + β + γ + η + δ + ε = 1, weight the relevance each feature has in the global similarity computation. We number these coefficients as θ_1, …, θ_6, and link them to the (numbered) similarities as follows:
θ_1 = α   sim_1 = sim_θ
θ_2 = β   sim_2 = sim_shape
θ_3 = γ   sim_3 = sim_color
θ_4 = η   sim_4 = sim_scale
θ_5 = δ   sim_5 = sim_rotation
θ_6 = ε   sim_6 = sim_texture
Now we reorder the coefficients in non-increasing order, by means of a permutation σ: {1, …, 6} → {1, …, 6} such that θ_σ(1) ≥ θ_σ(2) ≥ … ≥ θ_σ(6). Then the formula by Fagin and Wimmers is as follows:

similarity = (θ_σ(1) − θ_σ(2)) · sim_σ(1) + 2 (θ_σ(2) − θ_σ(3)) · min(sim_σ(1), sim_σ(2)) + 3 (θ_σ(3) − θ_σ(4)) · min(sim_σ(1), sim_σ(2), sim_σ(3)) + … + 6 θ_σ(6) · min(sim_σ(1), …, sim_σ(6))    (1)

The behavior of the algorithm, hereafter presented, obviously depends on the configuration parameters, shown in Table 1, which determine the relevance of the various features involved in the similarity computation.

Algorithm Recognize (Q, I)
input: a query picture Q composed of n single objects, and an image I, segmented into regions r_1, …, r_m
output: a similarity measure of the recognition of Q in I
begin
  /* for each object in the sketch */
  for j ∈ {1, …, n} do
    sim_ss_max = 0
    /* make sure that at least one region is similar to the object */
    for i ∈ {1, …, m} do
      compute the similarity sim_ss(f_j, r_i) between f_j and r_i
      if sim_ss_max < sim_ss(f_j, r_i) then sim_ss_max = sim_ss(f_j, r_i)
    endfor
    /* if no region is similar to object j, fail */
    if sim_ss_max < threshold then return (0)
  endfor
  s_max = 0
  for all injective functions μ: {1, …, n} → {1, …, m} yielding a group of n regions r_μ(1), …, r_μ(n) such that sim_ss(f_j, r_μ(j)) > threshold for j = 1, …, n do
    extract the MθR-string MθR_I of r_μ(1), …, r_μ(n)
    compute the similarity s using Formula (1)
    if s_max < s then s_max = s
  endfor
  return s_max
end
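To make the weighted-minimum combination of Formula (1) concrete, here is a small Python sketch of the Fagin–Wimmers rule as recalled above; function and variable names are ours, not from the paper:

```python
def fagin_wimmers(weights, sims):
    """Fagin-Wimmers combination of feature similarities:
    sort the weights in non-increasing order (the permutation sigma),
    then sum, over k, of k * (theta_sigma(k) - theta_sigma(k+1)) times
    the minimum of the k best-weighted similarities, with the weight
    after the last taken as 0. Assumes sims lie in [0, 1]."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    w = [weights[i] for i in order]
    s = [sims[i] for i in order]
    total, running_min = 0.0, 1.0
    for k in range(len(w)):
        running_min = min(running_min, s[k])          # min of first k+1 sims
        next_w = w[k + 1] if k + 1 < len(w) else 0.0  # theta_sigma(k+2) or 0
        total += (k + 1) * (w[k] - next_w) * running_min
    return total
```

Two sanity checks follow from the formula: with all six weights equal the result collapses to the plain minimum of the similarities, and with all weight on one feature it returns that feature's similarity alone.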
5. Results
The approach described in the preceding sections has been deployed in DESIR (DEscription logics Structured Image Retrieval), a prototype interactive system. Queries can be posed by sketch, using pre-stored basic shapes or newly created ones (see an example of query by sketch in Fig. 2), but also by example. Both a query by example and a submitted database image undergo the same procedure: segmentation, feature extraction and classification.

Fig. 2. A query by sketch and retrieved images.

One of the unfortunate aspects of similarity retrieval is that it is inherently fuzzy and subjective, and widely accepted benchmarks are still to be found. Although our image database currently stores a few thousand images, in order to ease comparisons with previous work on spatial-relationships-based image-retrieval approaches we chose an experimental setup close to the one proposed by Gudivada and Raghavan (1995), and also adopted by Gudivada (1998) and El-Kwae and Kabuka (1999), in which a small, well-defined set of images is used to check the system performance against human users' judgement. Our test set consists of 93 images (available at www-ictserv.poliba.it/disciascio/test-images.htm), picturing simple or composite objects arranged together. The total number of different objects was 18. Images were captured using a digital camera; all had size 1080 × 720 pixels, 24 bits/pixel. Images were automatically segmented by the system. Thirty-one images of the set were selected as queries by example. The resulting classification carried out by the system against the image set was defined as the system-provided ranking. Fig. 3 shows one of the queries and the retrieved set of images. To obtain a users' ranking we then asked five volunteers to classify the 93 images based on their similarity to each query image. Each user could group database images that he/she considered equivalent in terms of similarity to a given query. The five classifications were not univocal, and they were merged into a unique ranking by considering, for each image, the minimum ranking among the five available. This provided a users' ranking for the same set of queries. Notice that this approach limited the weight that images badly classified by single users have on the final ranking. Also following the approach by Gudivada and Raghavan (1995), we measured the retrieval effectiveness adopting the Rnorm (Bollmann et al., 1985) as quality measure. Assuming G is a finite set of images with a user-defined preference relation P that is complete and transitive, and Δ_usr is the rank ordering of G induced by the user preference relation, let Δ_sys be some rank ordering of G induced by the similarity values computed by the image retrieval system. The formulation of Rnorm is:
Rnorm(Δ_sys) = (1/2) · ( 1 + (S⁺ − S⁻) / S⁺_max )
where S⁺ is the number of image pairs where a better image is ranked by the system ahead of a worse one; S⁻ is the number of pairs where a worse image is ranked ahead of a better one; and S⁺_max is the maximum possible value of S⁺. It should be noticed that the calculation of S⁺, S⁻, and S⁺_max is based on the ranking of image pairs in Δ_sys relative to the ranking of the corresponding image pairs in Δ_usr. Rnorm values are in the range [0, 1.0]; 1.0 corresponds to a system-provided ordering of the database images that is either identical to the one provided by the human experts or has a higher degree of resolution; lower values correspond to a proportional disagreement between the two. We obtained an average Rnorm-AVG = 0.89, with lowest value Rnorm-MIN = 0.62 and highest Rnorm-MAX = 1.0. We also computed, as a reference, average values for Precision = NRR/N and Recall = NRR/NR, where NRR is the number of images retrieved and relevant, NR is the total number of relevant images in the database, and N is the total number of retrieved images. We obtained Precision-AVG = 0.91 and Recall-AVG = 0.80.

Results presented by Gudivada and Raghavan (1995) showed an average Rnorm = 0.98, on a database of 24 iconic images used both as queries and database images, with similarity computed only on spatial relationships between icons. Our system works on real images and computes similarity on several image features; we believe that these results prove the ability of the system to capture to a good extent the user's information need, and to make refined distinctions between images when searching for composite shapes. Obviously, we do not claim these results would scale exactly to larger image databases, or to databases with images picturing scenes without well-defined objects. Nevertheless they show, in a controlled framework, good performance. As a final remark, notice that we do not require full picture matching; the image retrieved from the sketch may contain other regions not in the sketch. However, all components of the sketch must be recognized in an image in order to make it retrievable. Only in this case is the database image processed to determine its similarity. This approach tends to increase precision, even at the expense of recall. It is anyway reasonable to assume that, in an interactive session using query-by-sketch, a user will introduce a partial query picturing what he/she thinks are the main features/objects representing his/her information need. Should the retrieved set still be too large, further details can be added to reduce the set.
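The pairwise Rnorm computation above can be sketched in Python; rank arrays and function names are our illustrative choices, with lower rank values meaning "judged more similar":

```python
def rnorm(sys_rank, usr_rank):
    """Rnorm of Bollmann et al. (1985). For every ordered pair (i, j)
    that the user strictly prefers (usr_rank[i] < usr_rank[j]):
    s_plus counts pairs the system orders the same way, s_minus counts
    inversions, s_plus_max counts all user-comparable pairs; ties in
    the system ordering count in neither s_plus nor s_minus."""
    s_plus = s_minus = s_plus_max = 0
    n = len(sys_rank)
    for i in range(n):
        for j in range(n):
            if usr_rank[i] < usr_rank[j]:  # user prefers image i to j
                s_plus_max += 1
                if sys_rank[i] < sys_rank[j]:
                    s_plus += 1
                elif sys_rank[i] > sys_rank[j]:
                    s_minus += 1
    return 0.5 * (1.0 + (s_plus - s_minus) / s_plus_max)
```

A system ranking identical to the user's gives 1.0, and a fully reversed ranking gives 0.0, matching the [0, 1.0] range stated above.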
6. Conclusion
Starting from the observation that most image retrieval systems either rely on global features or concentrate on symbolic images considered only in terms of spatial positioning, we proposed an approach particularly suitable for query-by-sketch image retrieval, able to handle queries made of several shapes, where the position, orientation and size of the shapes relative to each other are meaningful. In our approach we start by extracting basic features of the image and combine them in a higher-level description of the spatial layout of components, characterizing the semantics of the scene. We also defined a similarity algorithm that measures a weighted global similarity between a sketched query and a database image and allows for both a perfect matching and an approximate recognition of sketches in real images.

Fig. 3. A query by example and retrieved images.
References
Bollmann, P., Jochum, F., Reiner, U., Weissmann, V., Zuse, H., 1985. The LIVE-Project: retrieval experiments based on evaluation viewpoints. In: Proceedings of the 8th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, pp. 213–214.

Cardoze, D.E., Schulman, L.J., 1998. Pattern matching for spatial point sets. In: Proceedings of the 39th Annual Symposium on the Foundations of Computer Science (FOCS'98).

Chang, S.K., 1989. Principles of Pictorial Information Systems Design. Prentice-Hall, Englewood Cliffs, New Jersey.

Chang, S.K., Shi, Q.Y., Yan, C.W., 1983. Iconic indexing by 2D strings. IEEE Trans. Pattern Anal. Mach. Intelligence 9 (3), 413–428.

Chew, L.P., Goodrich, M.T., Huttenlocher, D.P., Kedem, K., Kleinberg, J.M., Kravets, D., 1997. Geometric pattern matching under Euclidean motion. Comput. Geom. 7, 113–124.

Di Sciascio, E., Donini, F.M., Mongiello, M., 2000a. A description logic for image retrieval. In: Lamma, E., Mello, P. (Eds.), AI*IA 99: Advances in Artificial Intelligence, number 1792 in Lecture Notes in Artificial Intelligence. Springer, pp. 13–24.

Di Sciascio, E., Donini, F.M., Mongiello, M., 2000b. Semantic indexing for image retrieval using description logics. In: Laurini, R. (Ed.), Advances in Visual Information Systems, number 1929 in Lecture Notes in Computer Science. Springer, pp. 372–383.

El-Kwae, E.A., Kabuka, M.R., 1999. Content-based retrieval by spatial similarity in image databases. ACM Trans. Inform. Syst. 17, 174–198.

Fagin, R., Wimmers, E.L., 2000. A formula for incorporating weights into scoring rules. Theor. Comput. Sci. 239, 309–338.

Guan, D.J., Chou, C.Y., Chen, C.W., 2000. Computational complexity of similarity retrieval in a pictorial database. Inform. Process. Lett. 75, 113–117.

Gudivada, V.N., 1998. θR-string: a geometry-based representation for efficient and effective retrieval of images by spatial similarity. IEEE Trans. Knowl. Data Eng. 10 (3), 504–512.

Gudivada, V.N., Raghavan, J.V., 1995. Design and evaluation of algorithms for image retrieval by spatial similarity. ACM Trans. Inform. Syst. 13 (2), 115–144.

Irani, S., Raghavan, P., 1996. Combinatorial and experimental results for randomized point matching algorithms. In: Proceedings of the 12th ACM Symposium on Computational Geometry, pp. 68–77.

Pok, G., Liu, J., 1999. Texture classification by a two-level hybrid scheme. In: Storage and Retrieval for Image and Video Databases VII, vol. 3656. SPIE, pp. 614–622.

Pratt, W.K., 1991. Digital Image Processing. J. Wiley & Sons Inc., Englewood Cliffs, NJ.

Rui, Y., She, A.C., Huang, T.S., 1996. Modified Fourier descriptors for shape representation: a practical approach. In: Proceedings of the 1st Workshop on Image Databases and Multimedia Search.

Tucci, M., Costagliola, G., Chang, S.K., 1991. A remark on NP-completeness of picture matching. Inform. Process. Lett. 39, 241–243.

Zhou, X.M., Ang, C.H., Ling, T.W., 2001. Image retrieval based on object's orientation spatial relationship. Pattern Recogn. Lett. 22, 469–477.