
Multimedia Tools and Applications manuscript No. (will be inserted by the editor)

Interest Point Selection by Topology Coherence for Multi-Query Image Retrieval

Xiaomeng Wu · Kunio Kashino

Received: date / Accepted: date

Abstract Although the bag-of-visual-words (BOVW) model in computer vision has been demonstrated successfully for the retrieval of particular objects, it suffers from limited accuracy when images of the same object are very different in terms of viewpoint or scale. Naively leveraging multiple views of the same object to query the database naturally alleviates this problem to some extent. However, the bottleneck appears to be the presence of background clutter, which causes significant confusion with images of different objects. To address this issue, we explore the structural organization of interest points within multiple query images and select those that derive from the tentative region of interest (ROI) to significantly reduce the negative contributions of confusing images. Specifically, we propose the use of a multi-layered undirected graph model built on sets of Hessian affine interest points to model the images' elastic spatial topology. We detect repeating patterns that preserve a coherent local topology, show how these redundancies are leveraged to estimate tentative ROIs, and demonstrate how this novel interest point selection approach improves the quality of visual matching. The approach is discriminative in distinguishing clutter from interest points, and at the same time, is highly robust as regards variation in viewpoint and scale as well as errors in interest point detection and description. Large-scale datasets are used for extensive experimentation and discussion.

Xiaomeng Wu
NTT Communication Science Laboratories
3-1, Morinosato Wakamiya, Atsugi-shi, Kanagawa, Japan 243-0198
Tel.: +81-46-240-3664
E-mail: [email protected]

Kunio Kashino
NTT Communication Science Laboratories
3-1, Morinosato Wakamiya, Atsugi-shi, Kanagawa, Japan 243-0198
Tel.: +81-46-240-3568
E-mail: [email protected]


Keywords Interest Point Selection · Spatial Topology · Delaunay Triangulation · Multi-Query Image Retrieval

1 Introduction

The bag-of-visual-words (BOVW) representation of local features [26], along with query expansion [5], Hamming embedding [9], spatial re-ranking [19], and soft assignment [20], has been shown to be successful in image retrieval, especially in the retrieval of particular objects. In general, an image is represented by a set of interest points usually obtained by affine-invariant region detectors, e.g. Maximally Stable Extremal Regions (MSER) [16], the Harris affine region detector [17], or the Hessian affine region detector [17]. A local patch around each interest point is described using a local feature, e.g. the Scale-Invariant Feature Transform (SIFT) [15], Speeded-Up Robust Features (SURF) [3], or the Local Intensity Order Pattern (LIOP) [30], usually of high dimension. The BOVW model defines a visual vocabulary and quantizes local features into visual words to achieve a much more compact representation. The visual vocabulary can be constructed offline by unsupervised clustering algorithms, typically the k-means [26], hierarchical k-means (HKM) [19], or approximate k-means (AKM) [19] algorithms. In consequence, an image is represented by a set of visual words. The inverted index structure is typically leveraged to index and retrieve a large-scale image database.

However, retrieval systems based on the BOVW model suffer from limited accuracy when the relevant images differ significantly from the query in terms of viewpoint (front vs. profile), scale (macro photography vs. deep focus), illumination condition (daytime vs. night), or age (old photos) [1]. Leveraging multiple images, with sufficient variations, of the same object to query the database naturally alleviates all of these problems to some extent. In the field of multimedia information retrieval, a standard method is to obtain a set of images from a labeled corpus corresponding to a query represented by text. The labeled corpus can be a public dataset derived from experts [8] or social media sharing services, e.g. Flickr [28], or a set of top-ranked results returned by an image search engine, e.g. Google Images [27]. Content-based video retrieval based on representative frame indexing [18,36] and pseudo- [5] or user- [6] relevance feedback can also be deemed alternative forms of multi-query image retrieval (MQIR). In the context of BOVW-based retrieval systems, a recent study [1] compared a number of different early- and late-fusion methods for combining the multiple query images. It was concluded that the method of choice for this task was Multi-Query-Max (MQ-Max), in which each image from the query was independently queried and maximum pooling was applied to the sets of similarities of each result. This strategy performed best especially in retrieving unusual instances that differed from the query.

Without supervision in specifying the region of interest (ROI) or carefully selecting query images, one of the main problems in MQIR is the presence of background clutter that frequently occurs in the database and hence causes significant confusion between images of different objects.


(a) Multiple images in response to the query "Magdalen" in the Oxford Buildings Dataset [19].

(b) The top five false matches with the highest similarities to the query.

Fig. 1 An example of a query containing multiple images and the top-ranked false matches returned by using it to query the Oxford Buildings Dataset [19] with MQ-Max.

The ROI indicates the region of the particular object to be queried. Fig. 1 shows an example of multiple query images and the top-ranked false matches returned by using these images to query the database. In this extreme case, all five false images achieved their highest rankings due to the massive mismatches of interest points with the query image fourth from the left in Fig. 1a. In this image, the foliage region was very similar to the region of plant leaves in the false images. In addition, the road region in the query image second from the left also led to false matches with images unrelated to the query. To deal with the destructive effects of the background clutter, we propose a novel interest point selection approach that explores the structural organization of interest points within multiple query images and selects those deriving from a tentative ROI rather than the background. The proposed approach, which significantly reduces the negative contributions of confusing query images, is intrinsically motivated by the following observations.

Observation 1 The object to be queried appears in all query images, and the ROIs share local visual similarity and a certain level of local spatial similarity with each other even when the viewpoints vary.

Observation 2 The appearance of the background clutter varies with the nature of the viewpoint variation. Some of them share a local visual similarity, while in most cases, they bear no similarity to each other in a local spatial context.

Discussion Fig. 1a serves as an example of the evidence of these two observations. It can be seen that the particular object to be queried, i.e. the Magdalen Tower and connected buildings, shares both local visual and local spatial similarities even when the viewpoint is varied, while the background, e.g. the trees or the High Street, does not. These observations inspired us to introduce a new spatial verification strategy into the interest point selection process, in which interest points with a higher spatial coherence are determined as having a higher chance of deriving from the ROI and vice versa. We adapt a triangulation-based spatial verification method previously designed for logo recognition [11]. The method emphasizes elastic spatial coherence for quality interest point matching. Rather than imposing a strict transformation for geometric verification, a graph is constructed to encode the elastic spatial information for matched interest points. This gives better tolerance to true responses when detecting repeating patterns from the ROI by accumulating evidence from local regularities. We leverage the repeating patterns to segment a tentative ROI. We then compute local features from the initial query image for those selected interest points within the tentative ROI. The proposed approach is evaluated on the Oxford Buildings [19], Flickr 100K [19], Paris [20], and Flickr Logos 32 [23] datasets, and achieves a large performance gain with only a small computation overhead. The rest of the paper is organized as follows. After describing related work in Sect. 2, we develop our main contribution to interest point selection by topology coherence in Sect. 3. We then present our experiments and results in Sect. 4 and discuss future directions in Sect. 5.

2 Related Work

2.1 BOVW-Based Image Retrieval

The bag-of-visual-words (BOVW) framework [26] is the long-lasting standard approach for image and visual object retrieval. Recent schemes [2,20,36] derived from this approach exhibit state-of-the-art performance on several benchmarks. This baseline method has been improved in several ways in recent years, in particular to compensate for quantization errors, e.g. by using large vocabularies [19], multiple- or soft-assignment [9,10,20], and Hamming embedding [9,10]. Other techniques include improving the initial ranking list by exploiting spatial geometry, which is discussed and explained in more detail in Sect. 2.3, and query expansion [5].

However, retrieval systems based on the BOVW model suffer from limited accuracy when the relevant images differ significantly from the query in terms of viewpoint etc. Leveraging multiple images, with sufficient variations, of the same object to query the database naturally alleviates all of these problems to some extent. In the context of BOVW-based retrieval systems, a recent study [1] compared a number of different early- and late-fusion methods for combining the multiple query images. To the best of our knowledge, this is the only research that addresses the BOVW-based multi-query image retrieval (MQIR) task. Without supervision in specifying the region of interest (ROI) or carefully selecting query images, one of the major problems in MQIR is the presence of background clutter that frequently occurs in the database and hence causes significant confusion between images of different objects. This problem gives rise to the motivation of the research presented in this paper, and inspires us to propose a new interest point selection approach that leverages a spatial coherence model to avoid the confusion problem described above.

2.2 Interest Point Selection Based on Local Feature Only

Interest point selection has recently become a popular way of reducing index space for image retrieval based on local features. This is typically an application-dependent task, and most existing studies have focused on content-based location systems. Li and Kosecka [13] obtain an information content posterior probability for each local feature with respect to location identification and select those with high posterior ranks. Similarly, given a dense street-view geo-tagged image database, Schindler et al. [24] select informative local features occurring mostly in images of each specific location. Gammeter et al. [7] start from image clusters extracted based on geographical proximity and visual similarity, and then extend the idea of posterior ranking to estimate a tentative ROI in the form of a minimum bounding box around a foreground object in each image. Taking the opposite perspective, Knopp et al. [12] densely compute a confusion posterior probability for regions of images in a geo-tagged database by using a sliding window scheme and remove interest points inside regions that have a high probability of causing confusion.

Although these existing studies have reported success with content-based location systems, it is usually difficult to directly adapt them to an MQIR task for the following reasons: 1. most of these methods [12,13,24] require a training set of negative images per category, which is obviously unavailable in MQIR; 2. the effectiveness of these methods depends on the number of positive (negative for Knopp's method [12]) images per category, while in MQIR only a small number of images are available, e.g. five images per query in our experiments; and 3. ignoring the spatial context among interest points means that these methods suffer from limited discriminative power in distinguishing confusing local features deriving from white noise patterns, e.g. the images of plant leaves shown in Fig. 1b. In contrast, the approach we propose in this paper introduces a new spatial verification strategy into interest point selection, which does not depend on negative images and requires only a small number of positive images per query.

2.3 Spatial Verification

In the past decade, with the introduction of local features, many methods for duplicate detection, image retrieval, and object recognition have been proposed on the basis of verifying the spatial context among interest points. These methods can be summarized into two categories, and representatives of each category are discussed below.

Geometry-Based Spatial Verification The motivation behind geometric verification is to emphasize geometric information about interest points and impose a rather strong spatial coherence between images. Image re-ranking based on RANdom SAmple Consensus (RANSAC) [19], which is one of the most widely used methods for global geometric verification, approximates the homography between images as an affine transformation model built on the basis of the shape of the affine-invariant ellipse region deriving from each interest point. Weak geometric consistency (WGC) [10] uses a global geometric model, in which matches with consistent differences in the scale and the dominant orientation of interest points are thought to be true matches across multiple images. Instead of focusing on global consistency, Wang et al. [29] use the geometric statistics in the neighborhood of interest points as the spatial context within images. This spatial context contains the number of neighborhood features and the difference in scale and orientation between each interest point and its neighborhood points. In a more recent study, Liu et al. [14] explore the two-order spatial structure of each interest point, and embed the scale and orientation differences between interest points in the inverted file to achieve a much lower time complexity.

Topology-Based Spatial Verification In the context of spatial verification, the term topology indicates the elastic spatial context between neighboring interest points. The motivation for topology-based methods is to deliberately ignore interest point geometry, which is sensitive to errors in interest point detection and description when there are large viewpoint and scale variations, so that the true responses in interest point matching have greater tolerance. Bundled features [32] search for the local spatial neighbors of each interest point on the basis of the scale of its Maximally Stable Extremal Regions (MSER) while ignoring the scale information during the retrieval process. Poullot et al. [21] group spatially neighboring interest points into triangles using a k-nearest-neighbor (k-NN) search and extract a compact binary signature to represent the spatial context of each image. Zhang et al. [35] search for local spatial neighbors on the basis of a grid-construction scheme and describe the spatial context by using the co-occurrences of visual words. Kalantidis et al. [11] propose a novel representation for logo recognition whereby neighboring interest points are grouped into triplets by means of Delaunay triangulation (DT). Zhang et al. [34] adapted the same idea to an instance search task [18] for spatial re-ranking after retrieval.

Discussion Geometry-based methods, e.g. RANSAC [19] or Liu's method [14], achieve state-of-the-art results in terms of image matching accuracy, although they are sensitive to errors in interest point detection and description. In many cases, this sensitivity is even helpful for distinguishing false matches. However, their direct adaptation to interest point selection leads to the over-elimination of useful interest points. In this sense, topology-based spatial verification, which ignores the interest point geometry to allow more elastic spatial coherence modeling, is more suitable as a way of completing our task. In this paper, we adapt the DT-based method [11] because of its high efficiency and robustness to viewpoint changes. We show how this method can be effectively adapted for tentative ROI segmentation and interest point selection.

3 Interest Point Selection

Given a query composed of multiple images, the objective is to estimate a few tentative ROIs, which are assumed to contain the object to be queried, from each image of the query. Only the interest points within these ROIs are selected for retrieval. In Sect. 3.1, we first present the BOVW-based retrieval model adopted in this paper. Sect. 3.2 introduces Delaunay triangulation (DT) and its adaptation to the proposed approach for spatial neighborhood modeling. Subsequent sections describe the phases of the proposed approach, including multi-scale Delaunay triangulation (MSDT, Sect. 3.3), common Delaunay edge detection (Sect. 3.4), and tentative ROI estimation (Sect. 3.5).

3.1 BOVW-Based Image Retrieval

We adopted the following BOVW-based retrieval model for all the experiments presented in this paper. We explain it here to help the reader better understand our interest point selection approach, which is built on top of the BOVW model. The implementation is conventional, state-of-the-art, and compatible with pre- and post-processing methods, e.g. query expansion [5], Hamming embedding [9], spatial re-ranking [19], and soft assignment [20]. We use the Hessian affine region detector [17]¹ and Root SIFT [2] for interest point extraction and description. A visual vocabulary with 10⁶ (1M) visual words is constructed by using AKM [19]² on a set of 5,062 images with 17,862,193 local features. In our implementation, AKM constructs a KD-forest containing eight KD-trees for an approximate nearest neighbor (ANN) search with 768 maximum computations per entry, and runs for 30 iterations. Local features are quantized into visual words by hard assignment, again based on ANN³ with eight KD-trees. In consequence, an image is represented by a set of visual words and encoded into a 1M-dimensional TF-IDF vector. Each vector is L2-normalized and indexed with an inverted index for fast retrieval. We use the L2 distance as the dissimilarity between two images, which is equivalent to the cosine similarity in combination with the L2 normalization.

1 http://www.robots.ox.ac.uk/~vgg/research/affine/#software
2 http://www.robots.ox.ac.uk/~vgg/software/fastcluster/
3 http://www.robots.ox.ac.uk/~vgg/software/fastann/
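To make the retrieval model above concrete, the following Python sketch (not the authors' implementation; the function and variable names are hypothetical) encodes a quantized image as an L2-normalized TF-IDF vector and checks that, for such vectors, ranking by L2 distance is equivalent to ranking by cosine similarity.

import numpy as np
from collections import Counter

def tfidf_vector(visual_words, idf, vocab_size):
    """Encode a bag of visual-word IDs as an L2-normalized TF-IDF vector.

    visual_words: iterable of integer word IDs assigned to the image's interest points.
    idf: array of shape (vocab_size,) holding inverse document frequencies.
    """
    vec = np.zeros(vocab_size)
    for w, tf in Counter(visual_words).items():
        vec[w] = tf * idf[w]
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# For L2-normalized vectors, ||a - b||^2 = 2 - 2 * a.dot(b), so sorting by
# L2 distance and sorting by cosine similarity give the same ranking.
vocab_size = 1000                          # toy vocabulary (the paper uses 10^6 words)
rng = np.random.default_rng(0)
idf = rng.random(vocab_size) + 0.5
a = tfidf_vector(rng.integers(0, vocab_size, 200), idf, vocab_size)
b = tfidf_vector(rng.integers(0, vocab_size, 180), idf, vocab_size)
assert np.isclose(np.sum((a - b) ** 2), 2.0 - 2.0 * a.dot(b))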

In this paper, each query is composed of five images. The retrieval model first queries the database by using each individual image, and outputs five ranking lists by sorting the dissimilarity between the query image and each database image. The MQ-Max strategy, discussed in Sect. 1, is adopted for fusing the ranking lists because of its good performance as regards retrieving unusual instances.
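A minimal sketch of the MQ-Max late-fusion step follows (the per-image score dictionaries and all names are hypothetical): each query image is issued independently, and the fused score of a database image is the maximum of its per-query-image similarities.

from collections import defaultdict

def mq_max(per_image_scores):
    """Late fusion by maximum pooling (MQ-Max).

    per_image_scores: list of dicts, one per query image, mapping a
    database image ID to its similarity from that single-image query.
    Returns a ranking list of (image_id, fused_score) pairs.
    """
    fused = defaultdict(lambda: float("-inf"))
    for scores in per_image_scores:
        for image_id, s in scores.items():
            fused[image_id] = max(fused[image_id], s)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Example: three database images scored by two query images.
print(mq_max([{"db1": 0.2, "db2": 0.7}, {"db1": 0.9, "db3": 0.4}]))
# [('db1', 0.9), ('db2', 0.7), ('db3', 0.4)]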

3.2 Delaunay Triangulation

Delaunay triangulation (DT) [4] is a triangulation method often used to build meshes for space-discretized solvers. In mathematics and computational geometry, a DT for a set P of points in a plane is a triangulation DT(P) such that no point in P is inside the circumcircle of any triangle, called a Delaunay triangle, in DT(P). The three sides of each Delaunay triangle are called Delaunay edges. DT maximizes the minimum of all the angles of the Delaunay triangles; it tends to avoid long and thin ones. The DT of a discrete point set P corresponds to the dual graph of the Voronoi tessellation for P. Fig. 2d shows an example of a DT built on the interest points of a given image.

The nearest neighbor graph (NNG) is one of the most important subgraphs of DT. The NNG for a set P of n points in a plane with the Euclidean distance is a directed graph with P being its vertex set and with a directed edge from p to q whenever the distance from p to q is no larger than the distance from p to any other point in P. In consequence, for any point p, its nearest neighbor b lies on an edge pb in the NNG, which is also a Delaunay edge in DT. On the other hand, DT minimizes the maximum of all the circumcircles of the Delaunay triangles, i.e. the method tries to avoid connecting two points that are distant from each other. That means the other Delaunay edges that are not in the NNG are still connections of points that are near each other but are not nearest neighbors. In this sense, DT can be thought of as a spatial neighborhood construction process, and it serves as a much more efficient alternative to greedy approaches, e.g. k-NN.
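As an illustration of DT used as a spatial neighborhood construction process, the sketch below extracts the set of Delaunay edges from 2-D interest point coordinates. It relies on SciPy's Qhull-based triangulation rather than the divide-and-conquer implementation adopted later in the paper, and the names are hypothetical.

import numpy as np
from scipy.spatial import Delaunay

def delaunay_edges(points):
    """Return the Delaunay edges of a set of 2-D points as sorted index pairs."""
    tri = Delaunay(points)                 # requires at least 3 non-collinear points
    edges = set()
    for a, b, c in tri.simplices:          # each simplex is a Delaunay triangle
        for i, j in ((a, b), (b, c), (a, c)):
            edges.add((min(i, j), max(i, j)))
    return edges

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.4]])
print(delaunay_edges(pts))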

In this paper, we apply DT to interest points that multiple query images have in common and build a spatial neighborhood model for each image. We consider a query I of NI images, each of which is described by a set of interest points. A visual word ID wp is assigned to each interest point p by quantization as described in Sect. 3.1. The following equation defines the number of query images containing a given visual word w, where I denotes a query image.

N(w) = ‖{I | w ∈ I, I ∈ I}‖    (1)

Given a query image I, the set PI of common interest points within I is defined by Equation 2.

PI = {p | N(wp) ≥ τw, p ∈ I, I ∈ I} (2)

An interest point is regarded as common if its visual word appears in no less than τw query images. In this paper, τw is set at two because we have only five images per query (NI = 5) and a threshold larger than two leads to the over-elimination of useful interest points. To validate this claim, we conducted an experiment under one of the configurations of our experiments, namely OB, and compared the MAP and MR@200 for various τw (see Sect. 4 for more details on the configuration and the evaluation criteria).


(a) Initial image. (b) Interest points.

(c) Common interest points. (d) Delaunay triangulation.

Fig. 2 An example of Delaunay triangulation built from common interest points. The initial image is from the query "Cornmarket" in the Oxford Buildings Dataset [19]. The area of each square shown in Fig. 2b and 2c indicates the scale of each interest point. The common interest points are those common to the multiple images shown in Fig. 12a.


Our approach with τw = 2 achieved a MAP of 88.1% and an MR@200 of 93.1%, while the same approach with τw = 3 achieved a MAP of 85.7% and an MR@200 of 91.6%. For τw ≥ 4, the number of common interest points became zero for most queries, and in consequence, the AP and Recall@200 in these cases became zero.
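A minimal sketch of common interest point detection (Equations 1 and 2); representing each query image as a list of (point ID, visual word) pairs is a hypothetical simplification.

from collections import defaultdict

def common_interest_points(query_images, tau_w=2):
    """Keep interest points whose visual word occurs in at least tau_w query images.

    query_images: list of lists of (point_id, visual_word) pairs, one list per image.
    Returns, per image, the list of retained (point_id, visual_word) pairs.
    """
    # N(w): the number of query images containing visual word w (Equation 1).
    images_with_word = defaultdict(set)
    for i, points in enumerate(query_images):
        for _, w in points:
            images_with_word[w].add(i)
    # P_I: the common interest points of each query image I (Equation 2).
    return [[(p, w) for p, w in points if len(images_with_word[w]) >= tau_w]
            for points in query_images]

query = [[(0, 5), (1, 7)], [(0, 5), (1, 9)], [(0, 9), (1, 2)]]
print(common_interest_points(query, tau_w=2))   # only words 5 and 9 survive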

DT is then applied on top of the set PI of common interest points. Popular solutions for DT include flip algorithms, the Bowyer-Watson algorithm, the divide-and-conquer algorithm, Fortune's algorithm, and Sweephull. Among them, the divide-and-conquer algorithm [25] has been shown to be the fastest [31]. We therefore adopt this algorithm⁴ in our implementation. The total running time is O(n log n), where n is the number of vertices. Fig. 2 shows an example of common interest points and the corresponding DT.

Here, the ROI is the building in the foreground housing a Nokia shop. By comparing Fig. 2b and Fig. 2c, we observe that the selection of common interest points discards a number of noise points derived from the non-ROI region. However, there is still a large number of remnants surrounding the trees and the pedestrians. A selection of the common interest points for retrieval still suffers from mismatches caused by this clutter. Sect. 4.5 and Sect. 4.6 show more experimental results for interest point selection based on common interest points. Fig. 2d shows the DT built from Fig. 2c. Although some Delaunay edges connect interest points that are distant from each other, e.g. those connecting the roof and the tree branches, further verification based on visual words (Sect. 3.4) can successfully distinguish this type of false spatial neighborhood.

It is well known that SIFT and Root SIFT, the latter being adopted here as the local feature, are derived from directionally sensitive gradient fields, and are not invariant to flips. To tackle this issue, we also take into account the flipped Root SIFT to describe and detect common interest points. In other words, each interest point is associated with two visual words, assigned respectively on the basis of the initial and the flipped Root SIFT features. The common interest points shown in Fig. 2c were detected under exactly this setting. It should be noted that the flipped Root SIFT is only used for interest point selection, and is not used in the BOVW-based retrieval model, to avoid a large computation overhead in quantization.

3.3 Multi-Scale Delaunay Triangulation

Technically speaking, DT is a representative research subject in computational geometry rather than a deliberately designed solution for spatial neighborhood modeling. Its construction is still sensitive to certain variations in transformation and to errors in interest point detection. Fig. 3 shows two toy examples illustrating the limitations of DT.

4 http://www.cs.cmu.edu/afs/cs/project/quake/public/www/triangle.html


(a) Sensitivity to an additional interest point. (b) Sensitivity to relative position.

Fig. 3 Two toy examples illustrating the limitations of Delaunay triangulation. Circles indicate the vertices of a Delaunay triangle, and line segments indicate Delaunay edges. Vertices with the same color indicate interest points with the same visual word. A variation in the addition and relative positions of interest points leads to a different triangulation.

The left image in Fig. 3a shows a Delaunay triangle with three vertices. Here, suppose that an additional interest point is detected, e.g. the red point shown in the right image of Fig. 3a. According to the Delaunay condition that maximizes the minimum angle, the green-yellow Delaunay edge will be replaced by the new blue-red one. This usually happens due to a variation in the object scale or errors in interest point detection. In another toy example, given the four points shown in the left image in Fig. 3b, two Delaunay triangles will be constructed sharing the green-red Delaunay edge. Here, suppose that a diagonal stretch transforms the positions of the four points into those shown in the right image in Fig. 3b. According to the Delaunay condition, the green-red Delaunay edge will be replaced by the new blue-yellow one. This frequently happens due to an affine transformation between images. In this section, we explain how we address the first problem, i.e. the sensitivity of DT to an additional interest point. The second problem will be addressed in Sect. 3.4.

In this paper, we generalize the multi-scale DT (MSDT) strategy proposed by Kalantidis et al. [11] to tolerate the addition of interest points. The idea is to constrain DT to subsets of interest points sharing nearby scales such that an additional interest point will affect the DT when and only when it is similar in scale to those surrounding it. On the other hand, a Delaunay triangle constructed from a certain subset in one image is still expected to be found in a subset in another image of the same query.

To build an MSDT for each query image, the scales of all the interest points in the same image are first transformed into log scales and normalized into the range [0, 1]. The set of interest points is then divided into overlapping partitions of the same size by range partitioning. We take the normalized log scale as the partitioning key and select a partition for each interest point by determining whether its partitioning key is inside a certain range. Given a partition size sp ∈ (0, 1] and a minimum overlap ratio rm ∈ [0, 1), the number of partitions np is computed by Equation 3.

np = ⌈(1 − sp rm) / (sp(1 − rm))⌉    (3)


Fig. 4 Multi-scale Delaunay triangulation built on the common interest points shown in Fig. 2c. These triangulations have been arranged in ascending order of log scale.

All we do then is triangulate the interest points in each partition and collect the union of all constructed Delaunay triangles. Sect. 4.4 shows experimental results achieved when sp and rm were varied. Fig. 4 shows the DTs in a multi-scale space for the common interest points shown in Fig. 2c with sp = 0.3 and rm = 0.5. The set has been divided into six partitions (np = 6). It can be seen that partitions corresponding to smaller log scales contain more interest points. This makes sense because interest points with smaller log scales tend to be derived from the textures in the foreground, which usually has a larger area than the background. In Sect. 4.5, we compare the performance of interest point selection based on DT with that based on MSDT. Also, we show how the parameters sp and rm affect the retrieval performance in Sect. 4.4.
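The range-partitioning step of MSDT can be sketched as follows (a sketch only: it uses SciPy for the per-partition triangulation, the exact placement of the overlapping partitions is one possible reading of the scheme, and all names are hypothetical). The number of partitions follows Equation 3, and the union of Delaunay edges over all partitions is returned.

import math
import numpy as np
from scipy.spatial import Delaunay

def msdt_edges(xy, scales, s_p=0.3, r_m=0.5):
    """Multi-scale Delaunay triangulation: the union of Delaunay edges over
    overlapping log-scale partitions of the interest points."""
    key = np.log(scales)
    key = (key - key.min()) / max(key.max() - key.min(), 1e-12)   # normalize to [0, 1]
    n_p = math.ceil((1.0 - s_p * r_m) / (s_p * (1.0 - r_m)))      # Equation 3
    step = (1.0 - s_p) / max(n_p - 1, 1)     # spacing between partition starts;
                                             # consecutive partitions overlap by >= r_m * s_p
    edges = set()
    for k in range(n_p):
        lo, hi = k * step, k * step + s_p
        idx = np.where((key >= lo) & (key <= hi))[0]
        if len(idx) < 3:
            continue                          # DT needs at least three points
        try:
            tri = Delaunay(xy[idx])
        except Exception:
            continue                          # e.g. degenerate (collinear) partitions
        for a, b, c in tri.simplices:
            for i, j in ((idx[a], idx[b]), (idx[b], idx[c]), (idx[a], idx[c])):
                edges.add((min(i, j), max(i, j)))
    return edges

rng = np.random.default_rng(0)
print(len(msdt_edges(rng.random((50, 2)), rng.uniform(1.0, 30.0, 50))))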

3.4 Common Delaunay Edge Detection

In this section, we detect local topology coherence on the basis of MSDT and output a set of repeating patterns. To accomplish this, we first need to define what a local pattern is. Kalantidis et al. [11] define each Delaunay triangle as a local pattern and regard each image as a bag of Delaunay triangles for further logo recognition. This representation is highly discriminative, but possesses invariance limitations and sensitivity to outliers. This may not be an issue when dealing with planar homography, e.g. in logo recognition, but in our problem the limitations would be prohibitive, especially in the case shown in Fig. 3b. To validate this claim, we conducted an experiment under two of the configurations of our experiments, namely OB and OB-P, and compared the MAP and MR@200 of an edge-based representation, described below, with those of the triangle-based representation (see Sect. 4 for more details on the configuration and the evaluation criteria). For OB, our approach achieved a MAP of 88.1% and an MR@200 of 93.1%, while the triangle-based representation achieved a MAP of 79.7% and an MR@200 of 86.3%. For OB-P, our approach achieved a MAP of 75.8% and an MR@200 of 84.1%, while the triangle-based representation achieved a MAP of 54.1% and an MR@200 of 65.0%.

Fig. 5 Common Delaunay edges detected from MSDT.

In this paper, we define each Delaunay edge as a local pattern and detect local topology coherence by representing, indexing, and matching the Delaunay edges across multiple query images. A visual phrase is assigned to each Delaunay edge as an ordered doublet of the visual words, in lexicographically ascending order, of its two vertices. Two Delaunay edges match if their visual phrases are identical, i.e. if the two visual words are identical. This provides better tolerance to true responses when detecting repeating patterns from the ROI compared with the triangle-based topology model. Given a Delaunay edge e composed of two interest points p and q with wp ≤ wq, the visual phrase is denoted as {wp, wq}, where w indicates the corresponding visual word. This doublet can be regarded as a double-digit base-k numeral, with k = 10⁶ in our implementation being the size of the visual vocabulary. In practice, this base-k numeral is encoded into a long decimal integer by Equation 4, with ve denoting the visual phrase of the Delaunay edge e.

ve = wp · k + wq    (4)

Given a query I of NI images, the following equation defines the number of query images containing a given visual phrase v.

N(v) = ‖{I | v ∈ I, I ∈ I}‖    (5)

The set EI of common Delaunay edges within a given query image I is defined by Equation 6.

EI = {e | N(ve) ≥ τv, e ∈ I, I ∈ I} (6)

A Delaunay edge is regarded as common if its visual phrase appears in no less than τv images. In this paper, τv is set at two because a threshold larger than two leads to the over-elimination of useful visual regularities. To validate this claim, we conducted an experiment under one of the configurations of our experiments, namely OB, and compared the MAP and MR@200 for various τv (see Sect. 4 for more details on the configuration and the evaluation criteria). Our approach with τv = 2 achieved a MAP of 88.1% and an MR@200 of 93.1%, while the same approach with τv = 3 achieved a MAP of 82.4% and an MR@200 of 88.7%. For τv ≥ 4, the number of common Delaunay edges became zero for most queries, and in consequence, the AP and Recall@200 in these cases became zero.
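A minimal sketch of common Delaunay edge detection (Equations 4-6), assuming each query image supplies its Delaunay edges as pairs of visual words; the data layout and names are hypothetical.

from collections import defaultdict

K = 10**6                                   # visual vocabulary size

def visual_phrase(w_p, w_q, k=K):
    """Encode an unordered pair of visual words as a single integer (Equation 4)."""
    w_p, w_q = sorted((w_p, w_q))
    return w_p * k + w_q

def common_delaunay_edges(query_edges, tau_v=2, k=K):
    """Keep Delaunay edges whose visual phrase occurs in at least tau_v query images.

    query_edges: list (one entry per query image) of lists of (w_p, w_q) word pairs.
    """
    # N(v): the number of query images containing visual phrase v (Equation 5).
    images_with_phrase = defaultdict(set)
    for i, edges in enumerate(query_edges):
        for w_p, w_q in edges:
            images_with_phrase[visual_phrase(w_p, w_q, k)].add(i)
    # E_I: the common Delaunay edges of each query image I (Equation 6).
    return [[e for e in edges
             if len(images_with_phrase[visual_phrase(e[0], e[1], k)]) >= tau_v]
            for edges in query_edges]

edges = [[(3, 8), (1, 2)], [(8, 3), (4, 5)]]
print(common_delaunay_edges(edges))         # only the {3, 8} edge survives in both images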

As with common interest point detection, we also employ the flipped Root SIFT when describing the visual phrase of each Delaunay edge. Fig. 5 shows the common Delaunay edges detected from the MSDT shown in Fig. 4. As expected, the common Delaunay edges are almost all derived from the object to be queried. The proposed approach successfully discarded the edges surrounding the trees and the pedestrians as well as those connecting the roof and the tree branches.

Fig. 6 A toy example illustrating the robustness of Delaunay triangulation. Circles indicate interest points, and those of the same color are interest points with the same visual word.

The proposed approach is highly discriminative compared with existing studies based on local features only, because two visual words must be identical to match a single Delaunay edge. It is also robust as regards variations in spatial configuration, because we deliberately ignore the interest point geometry, which is unstable when there is severe viewpoint and scale variation. Fig. 6 shows a toy example illustrating the robustness of the DT-based topology model. Suppose that the two patterns derive from the same object seen from different viewpoints. There is a severe transformation between them, which consists of a flip in an oblique direction around the green-yellow edge and a horizontal scaling. Neither RANSAC [19] nor Liu's method [14] can match these two local patterns because the flip does not comply with the transformation they assume, e.g. the affine transformation in RANSAC. Here, suppose that the green and yellow points are reversed; the transformation then becomes a rotation and the scaling becomes horizontal. Handling this situation with the two methods described above is possible in theory, but in practice the effectiveness depends heavily on the correctness of the Hessian affine region detector and the SIFT computation, especially in this extreme horizontal stretch case. In contrast, these transformations have little or no influence on the DT-based topology model because it only takes the visual words into account once the spatial neighborhood has been constructed between interest points.

3.5 Tentative ROI Estimation

(a) ROI labeled by a human. (b) Tentative ROI.

Fig. 7 Comparison of the ROI labeled by a human being and that estimated from topology coherence. The human-labeled ROI is available in the Oxford Buildings Dataset [19].


Given a set EI of common Delaunay edges, ROI estimation is based on a transitive closure and on finding axis-aligned minimum bounding boxes. A transitive closure is applied to the common Delaunay edges and outputs groups of edges connected to each other. The minimum bounding box of each edge group is then found, and the union of all bounding boxes is defined as the tentative ROI. Only the interest points within the ROI are selected for further BOVW-based retrieval. Fig. 7b shows the tentative ROI estimated from Fig. 5. Compared with the human-labeled ROI that is available in the Oxford Buildings Dataset and shown in Fig. 7a, the estimated ROI still includes a few regions containing pedestrians, but is already effective for significantly reducing the negative contributions of confusing interest points. Sect. 4.5 and Sect. 4.6 show more examples of ROI segmentation and experimental retrieval results.
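The ROI estimation step can be sketched as follows (a minimal illustration using union-find for the transitive closure; the coordinates and names are hypothetical): connected groups of common Delaunay edges are found, each group is covered by its axis-aligned minimum bounding box, and every interest point falling inside any box is selected.

def estimate_roi_points(points, common_edges):
    """points: list of (x, y) coordinates; common_edges: list of (i, j) index pairs.
    Returns the indices of the points inside the union of per-group bounding boxes."""
    # Transitive closure over the common edges via union-find.
    parent = list(range(len(points)))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]   # path halving
            a = parent[a]
        return a
    for i, j in common_edges:
        parent[find(i)] = find(j)
    # Group the edge endpoints by connected component.
    groups = {}
    for i, j in common_edges:
        groups.setdefault(find(i), set()).update((i, j))
    # Axis-aligned minimum bounding box of each edge group.
    boxes = []
    for members in groups.values():
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    # Select every interest point that falls inside at least one box.
    return [idx for idx, (x, y) in enumerate(points)
            if any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)]

pts = [(0, 0), (1, 1), (2, 0), (9, 9), (0.5, 0.5)]
print(estimate_roi_points(pts, [(0, 1), (1, 2)]))   # [0, 1, 2, 4]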

3.6 Summary

Algorithm 1 summarizes the interest point selection scheme in pseudo-code form. The input is the set of interest point sets, each deriving from an image in a query, and the output is the corresponding set of selected interest point sets. The step from line 1 to line 8 builds the inverted index of visual words; the step from line 9 to line 13 detects common interest points; the step from line 14 to line 18 extracts Delaunay edges using MSDT (Sect. 3.3); the step from line 19 to line 27 builds the inverted index of visual phrases; the final step from line 28 to line 38 applies the transitive closure, finds the minimum bounding boxes, and selects the interest points inside the tentative ROIs.

4 Experimentation

This section is organized as follows. After introducing the databases, configurations, runs, and parameters from Sect. 4.1 to Sect. 4.4, we present our experiments and comparisons in Sect. 4.5. We then comprehensively discuss how the proposed approach helps the retrieval of the query classes in Sect. 4.6 and report the processing time in Sect. 4.7. Finally, Sect. 4.8 presents comparisons between our approach and other advanced approaches.

4.1 Database

For our evaluation, we use the Oxford Buildings [19] and Flickr Logos 32 [23] datasets. Oxford Buildings is a set of 5,062 images with a ground truth of 11 queries of Oxford landmarks: 5 query images per query, i.e. 55 query images in total. The number of relevant images ranges between 7 and 220. The human-labeled ROI is available for all query images in this dataset. In our experiments, we group the query images of the same landmark and define the group as a query class. We thus have 11 query classes with 5 images per query. This configuration is exactly the same as that designed in Arandjelovic's paper [1].


Input P = {Pi}: the set containing the interest point set Pi of the i-th query image.
Input k: the visual vocabulary size.
Input τw: the threshold used for common interest point detection.
Input τv: the threshold used for common Delaunay edge detection.
 1: for all w do                              ⊲ Visual Word
 2:     I1(w) ← ∅                             ⊲ Inverted Index
 3: end for
 4: for Pi ∈ P do                             ⊲ Interest Point Set
 5:     for all p ∈ Pi do                     ⊲ Interest Point
 6:         I1(wp) ← I1(wp) ∪ {i}
 7:     end for
 8: end for
 9: P′ ← ∅                                    ⊲ Common Interest Point Set
10: for Pi ∈ P do
11:     P′i ← {p ∈ Pi : |I1(wp)| ≥ τw}
12:     P′ ← P′ ∪ P′i
13: end for
14: E ← ∅                                     ⊲ Delaunay Edge Set
15: for P′i ∈ P′ do
16:     Ei ← MSDT(P′i)                        ⊲ Multi-Scale Delaunay Triangulation
17:     E ← E ∪ Ei
18: end for
19: for all v do                              ⊲ Visual Phrase
20:     I2(v) ← ∅                             ⊲ Inverted Index
21: end for
22: for Ei ∈ E do
23:     for all e ∈ Ei do
24:         ve ← wp · k + wq                  ⊲ e = (p, q) & wp ≤ wq
25:         I2(ve) ← I2(ve) ∪ {i}
26:     end for
27: end for
28: P∗ ← ∅                                    ⊲ Selected Interest Point Set
29: for Ei ∈ E do
30:     E′i ← {e ∈ Ei : |I2(ve)| ≥ τv}        ⊲ Common Delaunay Edge Set
31:     G ← TC(E′i)                           ⊲ Transitive Closure
32:     P∗i ← ∅
33:     for all G ∈ G do                      ⊲ Edge Group
34:         B ← MBB(G)                        ⊲ Minimum Bounding Box
35:         P∗i ← P∗i ∪ {P′i ∩ B}
36:     end for
37:     P∗ ← P∗ ∪ P∗i
38: end for
Output P∗: the set containing the selected interest point set of each query image.

Algorithm 1: Pseudo code of interest point selection.

Flickr Logos 32 [23] is a set of 5,240 images with 32 queries of logos: 30 query images per query and 960 query images in total. The other 4,280 images form the retrieval set, which contains 1,280 logo images and 3,000 non-logo images. We randomly split the set of the 30 query images of each logo into 6 disjoint subsets with 5 query images per subset. We thus have 32 × 6 = 192 subsets and define each subset as a query class.

We also use an additional unlabeled dataset, known as Flickr 100K [19]. The images in Flickr 100K were crawled from Flickr's 145 most popular tags that did not contain tags of the landmarks in Oxford Buildings, and so the dataset is assumed not to contain images of these landmarks. The images in this additional dataset are used as distractors and provide an important test for the large-scale experiments of BOVW-based retrieval. Although the assumption described above has not been validated in any way, the configuration of using Flickr 100K as a distractor dataset has been widely adopted in state-of-the-art studies [1,2,5,9,10,19,20,35] in this community. Hence, we also adopt this configuration in our experiments.

To examine the data dependency of our approach, we also use a stand-alone training dataset, known as Paris [20], as an alternative source for local feature quantization, on the basis of which both interest point selection and BOVW-based retrieval are carried out. The resolution of all the images in these datasets is 1024 × 768 pixels. The datasets are compared in Table 1.

Table 1 The statistics of the datasets used in our experiments.

Dataset             Number of Images    Number of Interest Points
Oxford Buildings    5,062               17,862,193
Flickr Logos 32     5,240               12,719,969
Paris               6,412               20,662,980
Flickr 100K         100,071             310,073,708
Total               116,785             361,318,850

4.2 Experimental Setting

The details of BOVW-based retrieval were presented in Sect. 3.1. The pre- and post-processing methods, including query expansion [5], Hamming embedding [9], spatial re-ranking [19], and soft assignment [20], are not tested in our experiments, but we believe that the proposed interest point selection approach is compatible with these methods. Because the effectiveness of spatial re-ranking [19] depends on how many relevant images the retrieval model can deliver in the top ranks, we take this issue into account in our evaluations with the mean recall at 200 (MR@200) criterion, as explained later.

The datasets discussed in Sect. 4.1 provide the six experimental settings used in our experiments, as listed in Table 2. It should be noted that OB-P, OBF100K-P, and FL32-P are used only for empirical testing. In practice, the retrieval dataset that we intend to search is usually available before searching. The unsupervised learning of the visual vocabulary can always be achieved by using this dataset. There is no need to purposely find a stand-alone dataset (Paris) for vocabulary construction that does not contain the images to be retrieved. In contrast, OB, OBF100K, and FL32 are more standard protocols for the usage of the OB and FL32 datasets [1,9,10,19,22,23,35]. None of the state-of-the-art studies listed above considered OB-P, OBF100K-P, or FL32-P.

Table 2 The six experimental settings deriving from the datasets listed in Table 1.

Experimental Setting Quantization Retrieval

OB Oxford Buildings Oxford BuildingsOBF100K Oxford Buildings Oxford Buildings & Flickr 100KFL32 Flickr Logos 32 Flickr Logos 32

OB-P Paris Oxford BuildingsOBF100K-P Paris Oxford Buildings & Flickr 100KFL32-P Paris Flickr Logos 32

We measure the retrieval performance by using the mean average precision (MAP) and MR@200 over all queries. The average precision (AP) is defined by Equation 7, where k is the rank in the ranking list of retrieved results, n is the number of retrieved results, Pre(k) is the precision at cut-off k in the list, and Rel(k) is an indicator function that equals one if the result at rank k is a relevant one, and zero otherwise.

AP = ( ∑_{k=1}^{n} Pre(k) Rel(k) ) / (Number of Relevant Images)    (7)

The MAP for a set of queries is the mean of the AP scores of all queries, as defined in Equation 8, where Q is the number of queries.

MAP = ( ∑_{q=1}^{Q} AP(q) ) / Q    (8)

MR@200 measures how effective the retrieval model is at fetching relevant results in the top ranks. It serves as a criterion for the compatibility of the proposed interest point selection approach with spatial re-ranking. Because of the high computational burden of spatial re-ranking, a standard solution is to carry out re-ranking on the (10³/Q) top-ranked images for Q-query image retrieval [1,19]. A higher number of targets usually results in the user running out of patience. Recall@200 is first defined by Equation 9.

Recall@200 = ( ∑_{k=1}^{200} Rel(k) ) / (Number of Relevant Images)    (9)

Similar to MAP, MR@200 is the mean of the Recall@200 scores of all queries, and is defined in Equation 10.

MR@200 = ( ∑_{q=1}^{Q} Recall@200(q) ) / Q    (10)

Note that, unlike MAP, which is an absolute criterion, MR@200 is a comparative criterion. That is, MR@200 gives more weight to how far one approach outperforms (or underperforms) another than to how close an approach comes to being a perfect solution. Because there are queries with more than 200 relevant images, Recall@200 will not reach 1 for these cases, even if all top-200 images are relevant. This may be superficially misleading but is actually reasonable, because the purpose is simply to compare the compatibility of the different approaches with re-ranking. If one wants to see how close to perfect an approach is, MAP is the better barometer.
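For reference, the evaluation criteria of Equations 7-10 can be computed as in the following sketch (representing a ranked result list as a list of 0/1 relevance flags is a hypothetical simplification).

def average_precision(relevance, n_relevant):
    """AP over a ranked list of 0/1 relevance flags (Equation 7)."""
    hits, score = 0, 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            score += hits / k               # Pre(k) * Rel(k)
    return score / n_relevant

def recall_at(relevance, n_relevant, cutoff=200):
    """Recall at a fixed cut-off (Equation 9)."""
    return sum(relevance[:cutoff]) / n_relevant

def mean_over_queries(per_query_scores):
    """MAP (Equation 8) or MR@200 (Equation 10), depending on the input scores."""
    return sum(per_query_scores) / len(per_query_scores)

relevance = [1, 0, 1, 1, 0]                 # toy ranking; 4 relevant images exist in total
print(average_precision(relevance, 4))      # (1/1 + 2/3 + 3/4) / 4
print(recall_at(relevance, 4))              # 3 / 4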

4.3 Run

We compare our interest point selection approach to the full feature set (MQ), the selection approach based on local features only (IPS-CLF), and the ROI labeled by a human (MQ-HL). We also implement three alternative embodiments of our proposal using spatial coherence models, including k-NN, Liu's geometric model [14], and single-scale DT. In total, we implemented seven runs, as listed in Table 3. Specifically, IPS-CLF detects common interest points occurring mostly in images of each query, and defines the minimum bounding box surrounding all common interest points as the ROI; IPS-KNN uses k-NN to build edges connecting neighboring interest points and, similarly to IPS-MSDT, detects common edges for ROI estimation; IPS-GBSV extends IPS-KNN by embedding the difference in scale and orientation between interest points into common edge detection; IPS-DT and IPS-MSDT use single- and multi-scale DT to model spatial topology. For late fusion, the MQ-Max strategy discussed in Sect. 1 is adopted for all runs.

Table 3 The seven runs implemented in the experiments.

Run           Selection Model                            Selection Category
MQ [1]        Full                                       NA
IPS-CLF [7]   Common Interest Point                      Local Feature Only
IPS-KNN       k Nearest Neighbors                        Local Feature & Topology
IPS-GBSV      Geometry-Based Spatial Verification [14]   Local Feature & Geometry
IPS-DT        Delaunay Triangulation                     Local Feature & Topology
IPS-MSDT      Multi-Scale Delaunay Triangulation         Local Feature & Topology
MQ-HL [1]     Human-Labeled ROI                          NA

To the best of our knowledge, no existing research uses a spatial coherence model for interest point selection in an MQIR scenario. IPS-KNN, IPS-GBSV, and IPS-DT are not existing methods but alternative embodiments of our proposal that use spatial coherence models differing from MSDT. In this sense, only MQ and IPS-CLF are baselines, and the comparisons of IPS-MSDT with IPS-KNN, IPS-GBSV, and IPS-DT only serve as evidence for the belief that MSDT is the better solution for this task. The motivations for using DT and MSDT are as follows. As discussed in Sect. 2.3, Liu's geometric model [14], adapted by IPS-GBSV, is sensitive to errors in interest point detection and description because it imposes a geometric coherence (e.g. as regards orientation and scale) in addition to the spatial coherence (e.g. as regards the neighborhood relationship). k-NN can be considered an alternative to DT and MSDT but is computationally more expensive. For these reasons, we choose to adopt DT and MSDT because of their higher efficiency and lower sensitivity. We justify these insights from Sect. 4.5 to Sect. 4.7.

4.4 Parameter

In IPS-MSDT, there are four main parameters influencing the tentative ROI estimation performance: the number of images per query NI, the size of the visual vocabulary k, and the partition size sp ∈ (0, 1] and the minimum overlap ratio rm ∈ [0, 1) used for range partitioning in MSDT. We do not examine NI and k in our implementation because these two parameters operate simultaneously with BOVW-based retrieval; optimizing them for interest point selection at the potential expense of retrieval would shift the focus away from the main issue. Therefore, we only investigate sp and rm in this section.

We fix NI = 5 following the standard protocol [1]. We also fix k = 10⁶, which is one of the most reliable choices in this community. The best performance for OB is achieved with sp = 0.3 and rm = 0.5. In Fig. 8 we show the MAP of interest point selection based on MSDT for various sp and rm. A smaller sp and a higher rm lead to DTs with more layers and, in theory, to greater robustness as regards scale variation and interest point detection errors but lower selectivity. Unfortunately, this trend is not reflected in Fig. 8. We assume this to be because ROI estimation based on minimum bounding boxes evens out the differences between range partitioning strategies. Fig. 8 indirectly demonstrates the high invariance of IPS-MSDT to these parameters, and IPS-MSDT outperforms MQ for all sp and rm. The best performance for OB-P is achieved with sp = 0.3 and rm = 0.6. We adopt sp = 0.3 and rm = 0.5 for OB and OBF100K, and sp = 0.3 and rm = 0.6 for OB-P and OBF100K-P. The best-performing parameters are sp = 0.6 and rm = 0.5 for FL32 and sp = 0.3 and rm = 0.5 for FL32-P.

In addition, IPS-CLF, IPS-KNN, and IPS-GBSV also require parameter examination. IPS-CLF has a coefficient α [7] that adjusts the cut-off of the interest point frequency; IPS-KNN has the choice of k denoting the number of nearest neighbors; IPS-GBSV has the choice of k for k-NN, the number of radial partitions nr, and the number of angular partitions na [14]. After careful parameter examination, we choose α = 1.0 for IPS-CLF in all cases; k = 30 (OB and FL32), k = 70 (OB-P), and k = 100 (FL32-P) for IPS-KNN; and {k = 100, nr = 1, na = 4} (OB and FL32) and {k = 80, nr = 1, na = 4} (OB-P and FL32-P) for IPS-GBSV.


Fig. 8 MAP of IPS-MSDT compared with MQ for various partition sizes sp and minimum overlap ratios rm, examined under the experimental setting OB. (a) MAP versus partition size (rm = 0.5); (b) MAP versus minimum overlap ratio (sp = 0.3).

4.5 Effectiveness Comparison

Tables 4 and 5 show the MAP and MR@200 performance obtained under the various experimental settings discussed in Sect. 4.3. In general, when we compare OB/FL32 with OB-P/FL32-P, the performance of the latter was poorer because an independent dataset (Paris), containing no images from the retrieval set, was used to learn the visual vocabulary, which caused a larger error in the local feature quantization. The performance also degraded after adding a large number of distractor images (Flickr 100K) to the retrieval dataset.

As expected, IPS-MSDT outperformed MQ and IPS-CLF under almost all configurations with both evaluation criteria.

Table 4 MAP (%) comparison of various experimental settings.

Run OB OBF100K OB-P OBF100K-P FL32 FL32-P

MQ        84.1  76.4  73.9  63.0  68.2  44.7
IPS-CLF   84.9  78.7  74.1  64.8  73.3  48.0

IPS-KNN   87.3  77.4  74.8  62.4  76.0  52.6
IPS-GBSV  85.9  76.2  73.0  59.3  76.7  53.0
IPS-DT    86.6  77.7  73.3  58.8  77.2  52.0
IPS-MSDT  88.1  79.9  75.8  62.5  77.2  52.8

MQ-HL 89.4 79.6 74.6 61.3 NA NA

Table 5 MR@200 (%) comparison of various experimental settings.

Run OB OBF100K OB-P OBF100K-P FL32 FL32-P

MQ        90.0  83.0  81.3  71.3  81.8  67.7
IPS-CLF   91.0  84.6  80.7  72.0  83.8  69.6

IPS-KNN   93.4  86.1  83.9  72.1  84.6  73.8
IPS-GBSV  92.5  85.1  83.7  68.2  84.7  72.0
IPS-DT    92.9  84.8  83.4  69.3  87.1  71.6
IPS-MSDT  93.1  87.2  84.1  73.3  87.5  72.4

MQ-HL 93.4 85.9 81.7 70.2 NA NA

The only exception was the MAP of IPS-MSDT under OBF100K-P, which was slightly lower than that of MQ and IPS-CLF. We believe this exception is due to the large local feature quantization error under OBF100K-P, where a stand-alone training dataset (Paris), containing no image from the retrieval dataset, is used for quantization and visual word assignment. Because interest point selection approaches with a spatial coherence model are much more discriminative than MQ and IPS-CLF, they are also more sensitive to this quantization error. Note that even MQ-HL underperformed MQ and IPS-CLF in this case. Also, as discussed in Sect. 4.2, OBF100K-P is used only for empirical testing; in practice, there is no need to purposely find a stand-alone dataset (Paris) for vocabulary construction that does not contain the images to be retrieved. Furthermore, the MSDT parameters discussed in Sect. 4.4 were not tuned under OBF100K and OBF100K-P, which may be another reason for the inferior performance of IPS-MSDT.

Surprisingly, IPS-MSDT even outperformed MQ-HL under OBF100K, OB-P, and OBF100K-P. Compared with the other runs, IPS-MSDT also achieved the best performance in almost all cases. Among the interest point selection runs, the performance of IPS-KNN was comparable to, but still below, that of IPS-MSDT. In general, the performance of the topology-based interest point selection techniques, including IPS-KNN and IPS-DT, is superior to that of the geometry-based approach, i.e. IPS-GBSV. This is because the geometric specification imposes a stronger spatial coherence between images and thus leads to the over-elimination of useful interest points. IPS-CLF showed slightly better performance than MQ but could not match the others; because it completely neglects spatial coherence, it has limited discriminative power when distinguishing confusing interest points in the non-ROI clutter.

The MAP superiority over MQ of all the interest point selection approaches degraded under OB-P and FL32-P compared with that under OB and FL32. This is because interest point selection has to be performed on top of the matching between local features, so its performance is directly affected by the reliability of the local feature quantization; as discussed above, OB-P and FL32-P cause a significant error in quantization and visual word assignment. We believe that a more robust quantization strategy could further improve the effectiveness of interest point selection, which constitutes another research topic. In contrast, the MR@200 superiority of IPS-MSDT over MQ was rather stable. Combined with the observed MAP boost, this suggests that IPS-MSDT is more effective in improving the ranking of top-ranked relevant images than of those with lower ranks. This makes sense because, in real applications, users tend to pay more attention to top-ranked results in order to find the one they are interested in expeditiously. Also, as discussed in Sect. 4.2, the stable MR@200 boost demonstrates the high compatibility of IPS-MSDT with spatial re-ranking.
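For reference, the two evaluation criteria can be sketched as follows; the sketch assumes that MR@200 denotes the mean recall within the top 200 results and omits any junk-image handling prescribed by the benchmark protocol:

```python
def average_precision(ranked_ids, relevant_ids):
    """Standard AP over a ranked list: the mean of the precision values taken
    at the rank of every relevant image that is retrieved."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=200):
    """Fraction of the relevant images that appear within the top-k results."""
    relevant = set(relevant_ids)
    return len(relevant & set(ranked_ids[:k])) / len(relevant) if relevant else 0.0

# MAP and MR@200 are then the means of these two scores over all query classes.
```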

Under certain configurations, IPS-MSDT slightly underperformed IPS-KNN or IPS-GBSV as regards MR@200. Note that the MR@200 of IPS-MSDT in Table 5 is not the highest MR@200 of this run but the MR@200 obtained with the highest MAP. For example, the highest MR@200 of IPS-MSDT for OB was 93.5%, which is comparable with the highest MR@200 of IPS-KNN (93.4%). Also, as discussed in Sect. 4.3, IPS-KNN, IPS-GBSV, and IPS-DT are not existing methods but alternative embodiments of our proposal, and the comparisons of IPS-MSDT with these alternatives only serve as evidence for our belief that MSDT is the better solution for this task. As shown in Table 5, the MR@200 superiority of IPS-MSDT over the others was rather stable.

4.6 Discussion

Fig. 9 shows the MAP superiority of IPS-MSDT over MQ and MQ-HL under OB and OBF100K. Under both settings, IPS-MSDT outperformed MQ in nine out of 11 queries, especially for “Magdalen” and “Cornmarket” under OB and for “Radcliffe Camera” under OBF100K. IPS-MSDT also outperformed MQ-HL in some cases, especially for “Bodleian” under OBF100K. In general, IPS-MSDT is especially advantageous when the background contains large finely textured regions, e.g. foliage (Magdalen) or a throng of people (Cornmarket), or regular patterns, e.g. a rectangular frame (Radcliffe Camera). It performs poorly when the viewpoint variation within the query is so large that few common interest points or common Delaunay edges can be found across the images (All Souls).


Fig. 9 The MAP boost of IPS-MSDT over MQ and MQ-HL for each query class. (a) MAP boost versus query class (OB); (b) MAP boost versus query class (OBF100K).

To discuss in depth how IPS-MSDT helps the retrieval of these query classes, we first explore which query image has the largest negative impact on retrieval in MQ. We denote by $Q$ a query image of interest within a given query class $\mathcal{Q}$, and by $\mathcal{F}$ the set of top-ranked false images obtained with $\mathcal{Q}$. The set of false images that are top-ranked due to $Q$ is defined as

$$\mathcal{F}(Q) = \bigl\{\, F \in \mathcal{F} \;\big|\; Q = \operatorname{argmax}_{Q' \in \mathcal{Q}} \mathrm{Sim}(Q', F) \,\bigr\} \qquad (11)$$

Here, $\mathrm{Sim}(Q, F)$ denotes the similarity between a query image $Q$ and a false image $F$. Note that each false image $F$ corresponds to exactly one query image $Q$, namely the one most similar to $F$, and this $Q$ serves as the villain of $F$. Hence, $\sum_{Q \in \mathcal{Q}} \|\mathcal{F}(Q)\| = \|\mathcal{F}\|$.


(a) Initial image. (AP: 27.4%/12.3%)

(b) Tentative ROI estimated from IPS-CLF. (AP: 29.6%/15.6%)

(c) Tentative ROI estimated from IPS-GBSV. (AP: 54.9%/34.1%)

(d) Tentative ROI estimated from IPS-MSDT. (AP: 54.3%/33.6%)

(e) ROI labeled by a human. (AP: 56.9%/35.4%)

Fig. 10 ROIs estimated with interest point selection approaches. The AP is noted in brackets for each run under settings OB/OBF100K. (Magdalen)


Fig. 11 The rank difference of false images after interest point selection based on MSDT. (a) N = 926; ID = 4; OB; Magdalen. (b) N = 158; ID = 2; OB; Cornmarket. (c) N = 131; ID = 4; OBF100K; Radcliffe Camera. N: the number of false images; ID: the ID of the query image that caused these false images to be top-ranked; OB/OBF100K: the experimental setting under which retrieval is performed; Magdalen/Cornmarket/Radcliffe Camera: the query name.


A negative contribution ratio (NCR) for each query image $Q$ of interest can thus be defined as

$$\mathrm{NCR}(Q, \mathcal{F}) = \frac{\|\mathcal{F}(Q)\|}{\|\mathcal{F}\|} \qquad (12)$$

Obviously, $\sum_{Q \in \mathcal{Q}} \mathrm{NCR}(Q, \mathcal{F}) = 1$, and a higher NCR indicates a larger negative impact on retrieval in MQ.
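Equations (11) and (12) can be computed directly from the per-image similarity scores; the following is a minimal sketch (the data structures and names are ours, not the code used in the experiments):

```python
from collections import defaultdict

def negative_contribution_ratios(sim, false_ids):
    """sim: {query image id Q -> {database image id -> similarity Sim(Q, .)}}.
    false_ids: the set F of top-ranked false images for the query class.
    Returns NCR(Q, F) for every query image Q (Equations 11 and 12)."""
    assert false_ids, "F must be non-empty"
    groups = defaultdict(set)                                  # F(Q) for each Q
    for f in false_ids:
        villain = max(sim, key=lambda q: sim[q].get(f, 0.0))   # Eq. (11): the Q most similar to f
        groups[villain].add(f)
    return {q: len(groups[q]) / len(false_ids) for q in sim}   # Eq. (12); the values sum to 1
```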

Fig. 10a shows the initial images of “Magdalen”, among which the fourth image from the left (ID = 4) resulted in the highest NCR of 92.6%, computed from the set F1000 of the top-1000 false results. Some examples of the false images that were top-ranked due to this query image are shown in Fig. 1b. All of these mismatches arise from the confusion caused by the foliage region. Fig. 10d shows the tentative ROI estimated from IPS-MSDT; IPS-MSDT successfully discarded the foliage region without over-eliminating the region of the object being queried. It is interesting to explore what became of the ranking of the false results after IPS-MSDT. We compute the difference between the ranks of each false result before and after IPS-MSDT, which gives a total of 926 (92.6% × 1000) rank differences. The rank differences, sorted in descending order, are plotted in Fig. 11a. Almost all the false results (88.9%) dropped significantly in rank after IPS-MSDT. The ROIs estimated from all queries are available in the support documentation5.
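The rank differences plotted in Fig. 11 can be derived from the two ranking lists as follows (a sketch with our own naming; a negative value means the false image was pushed further down the list after selection):

```python
def rank_differences(ranks_before, ranks_after, false_ids):
    """ranks_before / ranks_after: {image id -> rank in the result list, 1 = best}.
    Returns the rank differences of the given false images, sorted in
    descending order as in Fig. 11."""
    diffs = [ranks_before[f] - ranks_after[f] for f in false_ids]
    return sorted(diffs, reverse=True)

# Fraction of false images whose rank became worse after interest point selection:
# sum(d < 0 for d in diffs) / len(diffs)
```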

Fig. 12a shows the initial images of “Cornmarket”, among which the second image from the left (ID = 2) resulted in the highest NCR of 79%, computed from the set F200 of the top-200 false results. As with “Magdalen”, most mismatches are caused by the foliage region. In this case, although IPS-MSDT could not estimate the ROI precisely, as shown in Fig. 12d, the foliage region was successfully discarded. The imprecise ROI estimation is believed to be due to the large viewpoint variation. Fig. 11b shows the rank differences of the 158 (79% × 200) false results before and after IPS-MSDT; the ranks of 94.9% of the false results are significantly lower after interest point selection.

Fig. 13a shows the initial images of “Radcliffe Camera”, among which the fourth image from the left (ID = 4) caused the highest NCR of 65.5%, computed from the set F200 of the top-200 false results retrieved under OBF100K. Unlike the queries described above, most mismatches in this case arise because there is a barely visible margin close to the image boundary. This rectangular, frame-like margin is closely similar to false images with similar rectangular patterns in Flickr 100K. IPS-MSDT could not estimate the ROI precisely, as shown in Fig. 13d, but did successfully discard the margin. The imprecise ROI estimation is believed to be due to the variation in viewpoint and illumination. Fig. 11c shows the rank differences of the 131 (65.5% × 200) false results before and after IPS-MSDT. All the false results were effectively pushed down from the top positions in the ranking list.

Fig. 14a shows the initial images of “Bodleian”, and Fig. 14e shows the corresponding human-labeled ROI. Under OBF100K, the former achieved 92.4%

5 http://www.brl.ntt.co.jp/people/wu.xiaomeng/MTAP2013/sup.pdf


(a) Initial image. (AP: 88.2%/85.1%)

(b) Tentative ROI estimated from IPS-CLF. (AP: 87.6%/84.5%)

(c) Tentative ROI estimated from IPS-GBSV. (AP: 87.9%/78.6%)

(d) Tentative ROI estimated from IPS-MSDT. (AP: 95.1%/86.4%)

(e) ROI labeled by a human. (AP: 91.0%/86.7%)

Fig. 12 ROI estimated by interest point selection approaches. AP is noted in brackets for each run under settings OB/OBF100K. (Cornmarket)

MAP while the latter achieved 75.8%. The object to be queried is the windowabove the entrance of the Bodleian Library. In this case, the human-labeledROI is small and the information contained in the ROI is insufficient to querythe large-scale OBF100K database. Instead, the building of the Bodleian Li-brary contains more discriminative patterns that share a close similarity torelevant images. Fig. 14d shows the tentative ROI estimated from IPS-MSDT,which successfully preserves the useful context regions described above. This


(a) Initial image. (AP: 92.1%/66.4%)

(b) Tentative ROI estimated from IPS-CLF. (AP: 94.3%/82.6%)

(c) Tentative ROI estimated from IPS-GBSV. (AP: 80.9%/65.1%)

(d) Tentative ROI estimated from IPS-MSDT. (AP: 93.7%/79.7%)

(e) ROI labeled by a human. (AP: 95.3%/85.1%)

Fig. 13 ROI estimated by interest point selection approaches. AP is noted in brackets for each run under settings OB/OBF100K. (Radcliffe Camera)


(a) Initial image. (AP: 93.7%/92.4%)

(b) Tentative ROI estimated from IPS-CLF. (AP: 93.7%/92.1%)

(c) Tentative ROI estimated from IPS-GBSV. (AP: 93.3%/91.6%)

(d) Tentative ROI estimated from IPS-MSDT. (AP: 93.3%/91.6%)

(e) ROI labeled by a human. (AP: 97.1%/75.8%)

Fig. 14 ROI estimated by interest point selection approaches. AP is noted in brackets for each run under settings OB/OBF100K. (Bodleian)

query serves as an example where IPS-MSDT can outperform MQ-HL becauseit properly maintains the context of interest points.

It is also interesting to compare the ROI estimation performance. Fig. 10 and Figs. 12 to 14 show the tentative ROIs estimated by interest point selection. IPS-CLF can only estimate the ROI in the form of a single minimum bounding box due to the nature of the approach. This is more perceptually comprehensible but tends to retain more cluttered interest points, as shown in the fourth image from the left in Fig. 10b and in almost all the images in Fig. 12b. IPS-GBSV is comparable to IPS-MSDT; for “Cornmarket”, for example, its estimated ROIs are even more precise than those of IPS-MSDT. However, it also tends to over-eliminate useful regions, as shown in all the images in Fig. 10c, in the second image from the left in Fig. 12c, and in the second and fourth images from the left in Fig. 13c. In general, IPS-MSDT performed most stably as regards ROI estimation. In addition to the images discussed in the previous paragraphs, the second image from the left in Fig. 10d serves as another example. Its estimation is rather challenging but well accomplished by IPS-MSDT. This image contains four different types of clutter regions, namely trees, pedestrians, cars, and the unrelated building on the right. IPS-MSDT discarded almost all the clutter regions and precisely isolated the Magdalen tower, the object being queried, despite its very small area.

It should be noted that the precision of ROI estimation is not always consistent with the retrieval MAP. For instance, IPS-GBSV obviously failed to estimate the ROI of “Magdalen” (Fig. 10c), but surprisingly achieved 0.6% higher MAP than IPS-MSDT (Fig. 10d). In another example, IPS-GBSV almost perfectly estimated the ROI of “Cornmarket” (Fig. 12c), but its retrieval performance was much poorer than that of IPS-MSDT (Fig. 12d). This could be because the context regions compensated for the over-elimination of useful interest points in the former example, and boosted the retrieval of IPS-MSDT in the latter example. Finding a way to determine whether a context region is useful or harmful remains a challenging issue, especially when the number of training images is small. We regard this issue as part of our future work.

4.7 Efficiency Comparison

In this study, we used an Intel Xeon X3750 2.93 GHz CPU with 16 cores and 128 GB of memory for all the experiments. Table 6 shows the time complexity of offline indexing in BOVW-based retrieval. The processing time has been normalized to a single-CPU basis except for local feature quantization, in which an approximate k-means was implemented with a parallel computation tool. In practice, interest point detection and description as well as visual word assignment were both implemented in parallel, and the speed improved significantly depending on the server capability. Although the processing times listed here may appear long, they are reasonable and acceptable because these steps are accomplished offline; such long processing times are common in the field of local feature based image retrieval [1,2,5,9,10,19,20,22,23,35]. Also, in our experiments, this time is exactly the same for the proposed approach and all the other runs compared in Sect. 4.

The processing time for interest point selection based on MSDT is shown in Table 7. We can observe that the construction of the MSDT is even faster than common Delaunay edge detection and ROI estimation. Interest point selection naturally requires more processing time than MQ, but for IPS-MSDT we believe that the processing time listed here is within the acceptable range. Note that the time complexity of IPS-MSDT depends only on the number of query images in each query class and is totally independent of the size of the retrieval database.


Table 6 Time complexity (days and HH:MM:SS) of offline indexing versus experimental settings. IPDD: Interest Point Detection & Description. LFQ: Local Feature Quantization. VWA: Visual Word Assignment. SEI: Search Engine Indexing.

Step   OB               OBF100K           OB-P             OBF100K-P         FL32             FL32-P

IPDD   13:23:23         11 Days 11:23:53  1 Day 06:31:23   12 Days 04:31:53  10:25:20         1 Day 03:32:20
LFQ    8 Days 06:05:51  8 Days 06:05:51   9 Days 14:07:50  9 Days 14:07:50   7 Days 18:37:55  9 Days 14:07:50
VWA    2 Days 23:08:50  60 Days 10:59:23  2 Days 22:26:58  60 Days 11:41:16  1 Day 18:54:48   3 Days 00:37:15
SEI    00:03:16         02:17:56          00:03:49         03:26:48          00:04:01         00:02:27

Table 7 Time complexity (per query class) of interest point selection based on MSDT. Each query class contains five query images and the set of common interest points is divided into six parts, which results in 30 Delaunay triangulations.

Step Time Complexity (Sec.)

Common Interest Point Detection                    0.2
MSDT                                               0.4
Common Delaunay Edge Detection & ROI Estimation    0.6

Total 1.2

Table 8 lists the time complexity of online retrieval with and without interest point selection. Among the runs with interest point selection, IPS-CLF requires the least processing time because it skips spatial verification while the others do not. IPS-DT and IPS-MSDT follow IPS-CLF with reasonable efficiency. IPS-KNN and IPS-GBSV consume the most processing time because they are both based on k-NN, which imposes brute-force computation of Euclidean distances with O(n^2) time complexity. k-NN can be implemented more elegantly, e.g. in a divide-and-conquer manner, so that the time complexity is reduced to O(n log n); even so, each individual distance computation is still much more expensive than the corresponding operation in DT.
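This gap can be illustrated with off-the-shelf tools; the rough SciPy-based sketch below (not our experimental code) contrasts brute-force k-NN, tree-based k-NN, and Delaunay triangulation for building a spatial neighborhood over 2-D interest point locations:

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree

points = np.random.rand(2000, 2)                     # 2-D interest point locations

# Brute-force k-NN: O(n^2) pairwise squared Euclidean distances.
d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
knn_brute = np.argsort(d2, axis=1)[:, 1:31]          # 30 nearest neighbours per point

# Tree-based k-NN: roughly O(n log n), but every query still pays for
# explicit distance computations (k=31 because the point itself is returned).
_, knn_tree = cKDTree(points).query(points, k=31)

# Delaunay triangulation: O(n log n), and it yields the neighbourhood graph
# (the triangle edges) directly, with no distance threshold and no choice of k.
tri = Delaunay(points)
```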

The time complexity of retrieval indirectly reflects the selectivity of interest point selection, i.e. the percentage of interest points selected by each approach. As expected, all interest point selection approaches outperformed MQ, which used the full set of interest points. IPS-MSDT slightly underperforms but is still comparable to IPS-KNN, IPS-GBSV, and MQ-HL. In general, these runs did not exhibit significant differences as regards the efficiency of online retrieval.


Table 8 Time complexity (sec. per query class) comparison of online retrieval. IPS: Interest Point Selection. R: Retrieval.

Run IPS R (OB) R (OBF100K) R (FL32)

MQ        0      1    39.4  0.36
IPS-CLF   0.1    0.9  28.6  0.35

IPS-KNN   306.6  0.5  13.1  0.18
IPS-GBSV  333.1  0.6  18.6  0.18
IPS-DT    0.7    0.5  11.2  0.19
IPS-MSDT  1.2    0.7  19.1  0.24

MQ-HL 0 0.6 16.4 NA

4.8 Comparison with Advanced Approaches

In previous sections, we compared the proposed approach with the BOVW with TF-IDF and with an advanced interest point selection approach [7]. To provide a more comprehensive comparison, in this section we compare our approach with more advanced local feature based image retrieval approaches, namely MSDT [11], the Spatial Co-occurrence Kernel (SCK) [33], and Liu's approach [14]. MSDT [11] and SCK [33] explore the higher-order intra-image co-occurrence of local features. Both are more discriminative than the BOVW but less discriminative than the others described below because they exclude a geometric coherence constraint. Among geometry-based approaches [14,32,35], Liu's approach offers the widest variety of affine invariance and so is chosen as the representative for comparison. Zhang et al. [35] describe the long-range spatial layout of local features by computing a Hough transform in the Euclidean space; the representation is invariant to translation but achieves only limited robustness as regards rotation and scaling. Wu et al. [32] measure spatial coherence by projecting the local features inside each maximally stable extremal region along the Cartesian coordinate axes; the approach achieves scale invariance but remains sensitive to rotation. Liu et al. [14] explore the second-order spatial structure of local features and embed the relative distance and the relative principal angle between them into the image representation, which was shown to be robust to translation, rotation, and scaling.

In the implementation of these advanced approaches, each approach searches the database using each individual query image, and MQ-Max is adopted to fuse the ranking lists. This configuration is exactly the same as that of the proposed approach. For MSDT [11], we followed the original publication and implemented the triangle-based representation. Please note that the motivations of these studies are quite different from that of our main proposal, and it is inevitable that approaches imposing spatial coherence constraints on the retrieval have an advantage over the proposed approach, or over any other approach [7] that does not incorporate such constraints. Still, it is interesting to see how well the proposed approach fares against them.
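Here, MQ-Max late fusion retains, for every database image, the maximum similarity obtained over the individual query images; a minimal sketch (the interface is ours) is:

```python
def mq_max_fusion(score_lists):
    """score_lists: one dict per query image, mapping database image id -> similarity.
    Returns database image ids ranked by their maximum score over all query
    images (MQ-Max late fusion)."""
    fused = {}
    for scores in score_lists:
        for image_id, score in scores.items():
            if score > fused.get(image_id, float("-inf")):
                fused[image_id] = score
    return sorted(fused, key=fused.get, reverse=True)
```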


The comparison with the state of the art is given in Table 9. In the experiments, IPS-MSDT underperformed MQ-LIU [14] under OB-P, where MQ-LIU showed greater discernment in distinguishing between confusing objects. Apart from that, the performance of IPS-MSDT is very competitive, in particular outperforming MQ-MSDT [11] and even equaling MQ-SCK [33]. IPS-MSDT also outperformed all the advanced approaches as regards efficiency: it retrieved approximately twice as fast as MQ-MSDT [11] and more than 13 times as fast as MQ-SCK [33] and MQ-LIU [14].

Table 9 Comparison with various advanced approaches. RT: Retrieval Time.

OB OB-P

MAP MR@200 RT MAP MR@200 RT

MQ-MSDT [11]  82.8  90.9  1.28   70.2  81.7  1.64
MQ-SCK [33]   86.1  93.2  8.56   75.4  85.2  11.56
MQ-LIU [14]   87.1  94.2  8.94   77.0  85.0  13.96

IPS-MSDT 88.1 93.1 0.66 75.8 84.1 0.57

5 Conclusion

We have presented our approach to interest point selection by topology coherence in an MQIR scenario with a limited number of positive images and no negative images for the query. The approach substantially reduces the negative contributions of confusing interest points in cluttered regions. It was unexpected that topology coherence would prove such a flexible criterion that interest point selection based on it even outperforms human supervision in certain cases. The approach based on MSDT is far faster than those based on greedy solutions to spatial neighborhood construction, e.g. k-NN. Interest point selection naturally consumes more processing time than naive BOVW-based retrieval, but with our approach the processing time is within an acceptable range, as shown in Sect. 4.7. The time complexity depends only on the number of images in each query, so the approach requires little time for interest point selection even on a large scale.

Because interest point selection must be performed on top of the visual verification of interest points, its performance is directly affected by the robustness of local features as regards viewpoint variation. As a result, the bottleneck of our approach appears to be how to make local features more robust and how to improve the precision of local feature quantization. We regard this issue as a future subject. In Sect. 3.3, we discussed the sensitivity of DT to variations in interest point detection and image transformation. In this paper, we adopted the multi-scale strategy proposed by Kalantidis [11] to cope with this problem. This strategy is effective, but may sacrifice some true responses due to its sensitivity to SIFT scale estimation. As another future subject, we will also examine the possibility of finding an efficient solution entirely independent of the error-prone collection of visual geometry.

References

1. Arandjelovic, R., Zisserman, A.: Multiple queries for large scale specific object retrieval. In: BMVC, pp. 1–11 (2012)
2. Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR, pp. 2911–2918 (2012)
3. Bay, H., Ess, A., Tuytelaars, T., Gool, L.J.V.: Speeded-up robust features (SURF). Computer Vision and Image Understanding 110(3), 346–359 (2008)
4. Berg, M.d., Cheong, O., Kreveld, M.v., Overmars, M.: Computational Geometry: Algorithms and Applications, 3rd edn. Springer-Verlag TELOS, Santa Clara, CA, USA (2008)
5. Chum, O., Mikulik, A., Perdoch, M., Matas, J.: Total Recall II: Query expansion revisited. In: CVPR, pp. 889–896 (2011)
6. Cox, I.J., Miller, M.L., Minka, T.P., Papathomas, T.V., Yianilos, P.N.: The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments. IEEE Transactions on Image Processing 9(1), 20–37 (2000)
7. Gammeter, S., Bossard, L., Quack, T., Gool, L.J.V.: I know what you did last summer: Object-level auto-annotation of holiday snaps. In: ICCV, pp. 614–621 (2009)
8. Heller, K.A., Ghahramani, Z.: A simple Bayesian framework for content-based image retrieval. In: CVPR (2), pp. 2110–2117 (2006)
9. Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: ECCV (1), pp. 304–317 (2008)
10. Jegou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. International Journal of Computer Vision 87(3), 316–336 (2010)
11. Kalantidis, Y., Pueyo, L.G., Trevisiol, M., van Zwol, R., Avrithis, Y.S.: Scalable triangulation-based logo recognition. In: ICMR, p. 20 (2011)
12. Knopp, J., Sivic, J., Pajdla, T.: Avoiding confusing features in place recognition. In: ECCV (1), pp. 748–761 (2010)
13. Li, F., Kosecka, J.: Probabilistic location recognition using reduced feature set. In: ICRA, pp. 3405–3410 (2006)
14. Liu, Z., Li, H., Zhou, W., Tian, Q.: Embedding spatial context information into inverted file for large-scale image retrieval. In: ACM Multimedia, pp. 199–208 (2012)
15. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
16. Matas, J., Chum, O., Urban, M., Pajdla, T.: Robust wide baseline stereo from maximally stable extremal regions. In: BMVC, pp. 1–10 (2002)
17. Mikolajczyk, K., Schmid, C.: Scale & affine invariant interest point detectors. International Journal of Computer Vision 60(1), 63–86 (2004)
18. Over, P., Awad, G., Michel, M., Fiscus, J., Sanders, G., Shaw, B., Kraaij, W., Smeaton, A.F., Quenot, G.: TRECVID 2012 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In: Proceedings of TRECVID 2012. NIST, USA (2012)
19. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Object retrieval with large vocabularies and fast spatial matching. In: CVPR (2007)
20. Philbin, J., Chum, O., Isard, M., Sivic, J., Zisserman, A.: Lost in quantization: Improving particular object retrieval in large scale image databases. In: CVPR (2008)
21. Poullot, S., Buisson, O., Crucianu, M.: Scaling content-based video copy detection to very large databases. Multimedia Tools Appl. 47(2), 279–306 (2010)
22. Romberg, S., Lienhart, R.: Bundle min-hashing. International Journal of Multimedia Information Retrieval, pp. 1–17 (2013)
23. Romberg, S., Pueyo, L.G., Lienhart, R., van Zwol, R.: Scalable logo recognition in real-world images. In: ICMR, p. 25 (2011)
24. Schindler, G., Brown, M., Szeliski, R.: City-scale location recognition. In: CVPR (2007)
25. Shewchuk, J.R.: Triangle: Engineering a 2D quality mesh generator and Delaunay triangulator. In: WACG, pp. 203–222 (1996)
26. Sivic, J., Zisserman, A.: Video Google: A text retrieval approach to object matching in videos. In: ICCV, pp. 1470–1477 (2003)
27. Torresani, L., Szummer, M., Fitzgibbon, A.W.: Efficient object category recognition using classemes. In: ECCV (1), pp. 776–789 (2010)
28. Wang, S.Y., Liao, W.S., Hsieh, L.C., Chen, Y.Y., Hsu, W.H.: Learning by expansion: Exploiting social media for image classification with few training examples. Neurocomputing 95, 117–125 (2012)
29. Wang, X., Yang, M., Cour, T., Zhu, S., Yu, K., Han, T.X.: Contextual weighting for vocabulary tree based image retrieval. In: ICCV, pp. 209–216 (2011)
30. Wang, Z., Fan, B., Wu, F.: Local intensity order pattern for feature description. In: ICCV, pp. 603–610 (2011)
31. Welzl, E., Su, P., Drysdale III, R.L.S.: A comparison of sequential Delaunay triangulation algorithms. Comput. Geom. 7, 361–385 (1997)
32. Wu, Z., Ke, Q., Isard, M., Sun, J.: Bundling features for large scale partial-duplicate web image search. In: CVPR, pp. 25–32 (2009)
33. Yang, Y., Newsam, S.: Spatial pyramid co-occurrence for image classification. In: ICCV, pp. 1465–1472 (2011)
34. Zhang, W., Pang, L., Ngo, C.W.: Snap-and-ask: Answering multimodal question by naming visual instance. In: ACM Multimedia, pp. 609–618 (2012)
35. Zhang, Y., Jia, Z., Chen, T.: Image retrieval with geometry-preserving visual phrases. In: CVPR, pp. 809–816 (2011)
36. Zhu, C.Z., Satoh, S.: Large vocabulary quantization for searching instances from videos. In: ICMR, p. 52 (2012)