

Vis Comput (2013) 29:565–575
DOI 10.1007/s00371-013-0818-0

ORIGINAL ARTICLE

Accurate and efficient cross-domain visual matching leveraging multiple feature representations

Gang Sun · Shuhui Wang · Xuehui Liu · Qingming Huang · Yanyun Chen · Enhua Wu

Published online: 25 April 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract Cross-domain visual matching aims at finding visually similar images across a wide range of visual domains, and has shown a practical impact on a number of applications. Unfortunately, the state-of-the-art approach, which estimates the relative importance of the single feature dimensions, still suffers from low matching accuracy and high time cost. To this end, this paper proposes a novel cross-domain visual matching framework leveraging multiple feature representations. To integrate the discriminative power of multiple features, we develop a data-driven, query specific feature fusion model, which estimates the relative importance of the individual feature dimensions as well as the weight vector among multiple features simultaneously. Moreover, to alleviate the computational burden of an exhaustive subimage search, we design a speedup scheme, which employs hyperplane hashing for rapidly collecting the hard-negatives. Extensive experiments carried out on various matching tasks demonstrate that the proposed approach outperforms the state-of-the-art in both accuracy and efficiency.

G. Sun (✉) · X. Liu · Y. Chen · E. Wu
State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China
e-mail: [email protected]

G. Sun · Q. Huang
University of Chinese Academy of Sciences, Beijing, China

S. Wang · Q. Huang
Key Laboratory of Intelligent Information Processing (CAS), Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

E. Wu
University of Macau, Macao, China

Keywords Visual matching · Cross-domain · Multiple features · Hyperplane hashing

1 Introduction

The prevalence of imaging devices and the Internet has given ordinary users the ability to capture their worlds in images and conveniently share them with others on the web. Today, people can generate volumes of images with exactly the same content across a wide range of visual domains (e.g., natural images, sketches, paintings, and computer-generated (CG) images), with dramatic variations in lighting conditions, seasons, ages, and rendering styles. Matching these images has proven very helpful to a number of applications, such as scene completion [11], Sketch2Photo [3, 4, 7], Internet rephotography [27], painting2GPS [27], and CG2Real [16]. Therefore, accurately and efficiently finding visually similar images across different domains is urgently needed in both research and applications.

Our objective in this paper is to search a large-scale dataset for image patches (subimages) that are visually similar to a given query across different domains. This is a very challenging task because small perceptual differences can result in arbitrarily large differences at the raw pixel level. In addition, without knowledge of the query's specific domain, it is very difficult to develop a generalized solution for multiple potential visual domains. Finally, rapid system feedback is desired for a number of applications, such as scene completion [11] and Sketch2Photo [3, 4, 7]. Researchers have made significant progress in the study of cross-domain visual matching. Some methods focus on the design of local invariant descriptors (self-similarity (SSIM) [26], symmetry [10], etc.) across different visual domains. As a second paradigm, a data-driven approach [27] focuses on weighting the dimensions of the Histograms of Oriented Gradients (HOG) [6] descriptor by training a linear classifier at query time.


In comparison with uniform weights, the learned weights appropriately measure the relative importance of the feature dimensions, and thus the performance is significantly improved.

Although promising results have been achieved, the above mentioned approaches are still far from perfect for many practical applications, for two reasons. First, a single image feature (e.g., SSIM, HOG) lacks sufficient discriminative power under dramatic appearance variations, which may lead to restricted generalization ability and low accuracy for cross-domain visual matching. A possible way to tackle this problem is to integrate several kinds of features together, for example by feature concatenation. However, systems that adopt such equally weighted concatenation may perform no better (or even worse) than a single feature, because the information conveyed by different features can be significantly unbalanced; a single feature may even dominate the effectiveness of the feature ensemble. On the other hand, manually setting the feature weights is impractical, as the weights may vary significantly from query to query.

Furthermore, owing to the tens of millions of negative subimages in the dataset, the data-driven approach [27] requires training a linear Support Vector Machine (SVM) classifier with the hard-negative mining approach [6] at query time. In hard-negative mining, one alternates between using the current classifier to search the retrieval dataset exhaustively for sufficient false positives (hard-negatives), and learning a classifier given the current set of hard-negatives. Exhaustively searching the whole dataset for hard-negatives incurs a heavy computational burden and prevents practical use in a number of applications.

To address the above mentioned issues, we propose an accurate and efficient cross-domain visual matching framework leveraging multiple feature representations, because multiple features provide complementary discriminative power for boosting matching accuracy. We present a data-driven, query specific feature fusion model to overcome the imbalance of information conveyed by different features. We train a discriminative linear classifier on multiple feature descriptors to learn uniqueness-based weights, which are derived from both the individual feature dimensions and the weight vector among multiple features. In comparison with [27], our learned weights appropriately encode the discriminative power of multiple features. Moreover, both our approach and [27] involve hard-negative mining, which incurs a heavy computational burden at query time. To alleviate this overhead, we employ compact hyperplane hashing [19] to recast hard-negative mining as a hyperplane-to-point search problem. On a dataset containing tens of millions of subimages, our speedup scheme can rapidly and consistently identify top matches of higher quality in roughly 10 minutes per query on a PC, which is much faster than the original hard-negative mining approach [6].

Extensive experiments demonstrate that our proposed approach outperforms the state-of-the-art [27] in both accuracy and efficiency.

The contributions in this paper can be summarized as follows:

− We propose an accurate and efficient cross-domain visual matching framework, which is the first work to effectively make use of multiple feature representations in this task.

− We develop a data-driven, query specific feature fusion model, which outperforms feature concatenation significantly.

− We design a speedup scheme (hashing-based hard-negative mining), which is more efficient than the original hard-negative mining approach.

The remainder of this paper is organized as follows. Sections 2 and 3 review related work and the previous data-driven approach, and Sect. 4 describes our approach. Extensive experimental results are provided in Sect. 5. We discuss limitations and future work in Sect. 6.

2 Related work

We briefly review related work on cross-domain visual matching, multiple features, and approximate near-neighbor search.

2.1 Cross-domain visual matching

Many studies have been devoted to matching images within specific domains, such as photos under different lighting conditions [5], sketches to photographs [3, 4, 7], paintings to photographs [25], and CG images to photographs [16]. However, these domain-specific solutions do not generalize across multiple potential visual domains. For a generalized solution, some methods focus on the design of local invariant descriptors (self-similarity (SSIM) [26], symmetry [10], etc.) across different visual domains. In addition, a data-driven approach [27] weights the dimensions of the HOG [6] or SIFT [20] descriptor by training a linear classifier. Nevertheless, none of these approaches effectively makes use of multiple feature representations to boost matching accuracy.

2.2 Multiple features

Researchers in the Content-Based Image Retrieval (CBIR) community have proposed a few systems using multiple feature representations [9, 24, 28] to boost retrieval performance. However, how to integrate the discriminative power of multiple features without relevance feedback is still an open problem.


As a promising method successfully applied in image classification and object recognition, Multiple Kernel Learning (MKL) [1, 17, 23, 29, 30] aims to optimize the discriminative model built on multiple similarity measures (kernels) and the kernel weights simultaneously. Compared with feature concatenation, MKL is more capable of identifying the meaningful features for matching cross-domain images.

2.3 Approximate near-neighbor search

While spatial partitioning and tree-based search algorithms break down for high-dimensional data, many researchers have proposed hashing-based near-neighbor search methods to deal with high-dimensional inputs. One of the most successful methods is Locality-Sensitive Hashing (LSH) [8, 13], which has been employed in a number of applications. However, in the hard-negative mining approach, where the query is the current classifier, such point-to-point search methods are not applicable. Recently, the hyperplane-to-point search problem has drawn much attention. In [14, 19], the authors develop hyperplane hashing to address this problem, rapidly returning the points with minimal distance to a hyperplane query. Owing to its efficiency, we employ hyperplane hashing to speed up hard-negative mining by avoiding an exhaustive search of the whole retrieval dataset.

3 Data-driven framework

Following [27], cross-domain visual matching is formalized as an optimization problem that minimizes the following objective function:

L(w_q, b_q) = \sum_{x_i \in P \cup I_q} h\bigl(w_q^\top x_i + b_q\bigr) + \sum_{x_j \in N} h\bigl(-w_q^\top x_j - b_q\bigr) + \lambda \|w_q\|^2    (1)

where $w_q$ is the relative importance vector over the feature dimensions, $x_i$ is subimage $I_i$'s grid-like HOG feature template, and $b_q$ is the bias term. $I_q$ is the query image (positive). $P$ represents a set of extra positive data points obtained by applying small transformations to the query image $I_q$. $N$ is the set of tens of millions of subimages (negatives) from the retrieval dataset. $\lambda$ is a regularization parameter, and $h(x) = \max(0, 1 - x)$ is the hinge loss function.

Given the learned, query-dependent importance vector $w_q$, the visual similarity between a query image $I_q$ and any other subimage $I_i$ can be defined as

S(I_q, I_i) = w_q^\top x_i + b_q    (2)

To handle multiple feature representations in the data-driven framework, a possible way is feature concatenation. Let $n$ denote the number of features and $x_{i,k}$ be the $k$th normalized feature of subimage $I_i$. Subimage $I_i$'s concatenated feature $x_i^c$ is given as:

x_i^c = [x_{i,1}^\top, x_{i,2}^\top, \ldots, x_{i,n}^\top]^\top    (3)

Then $x_i^c$ can be treated as a single feature, and the relative importance vector can be learned by minimizing Eq. (1). However, owing to the unbalanced information conveyed by different features, such simple feature concatenation may perform no better (or even worse) than a single feature, as illustrated in Sect. 5.
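To make Eqs. (2) and (3) concrete, the following minimal Python/NumPy sketch concatenates per-feature descriptors and scores a subimage with a learned weight vector. The helper names and toy dimensions are ours for illustration only, and the block normalization of [6] is simplified to a plain L2 normalization.

import numpy as np

def concatenate_features(per_feature_descriptors):
    # Eq. (3): stack the n normalized feature vectors of one subimage into a
    # single concatenated descriptor x_i^c (block normalization of [6] is
    # simplified to a plain L2 normalization here).
    normalized = [x / (np.linalg.norm(x) + 1e-12) for x in per_feature_descriptors]
    return np.concatenate(normalized)

def similarity(w_q, b_q, x_i):
    # Eq. (2): similarity between the query and subimage I_i under the learned,
    # query-dependent importance vector w_q.
    return float(np.dot(w_q, x_i) + b_q)

# Toy usage with three hypothetical descriptors (filter bank, HOG, SSIM).
x_c = concatenate_features([np.random.rand(300), np.random.rand(1000), np.random.rand(240)])
w_q = np.random.rand(x_c.size)   # stands in for the weights learned from Eq. (1)
print(similarity(w_q, 0.0, x_c))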

4 Our approach

An overview of our approach is shown in Fig. 1. In the preprocessing stage, we learn a set of hashing functions, then convert the concatenated feature of each subimage in the retrieval dataset into a hash code and store it in a hyperplane hash table. At query time, we first extract multiple features of the query. These features form a feature ensemble that serves as the initial classifier (decision hyperplane). Then a set of extra hyperplanes is created and the corresponding hash codes are provided to hard-negative mining as inputs. Our hashing-based hard-negative mining is shown in the red box of Fig. 1. Given a set of hash codes, hyperplane hashing rapidly returns a sufficient number of hard-negatives. Using these hard-negatives, the MKL learner (our multiple feature fusion model) trains a classifier and creates a new set of extra hyperplanes, whose corresponding hash codes are again looked up in the hyperplane hash table for hard-negatives. In our experiments, we use ten iterations of this procedure, as sketched below. Finally, the system outputs the learned weights and the top matches. Note that the learned weights comprise the relative importance of the individual feature dimensions and the weight vector among multiple features. In this section, we first describe our multiple feature fusion model, and then explain how to rapidly collect hard-negatives using our hashing-based hard-negative mining.
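The query-time loop of Fig. 1 can be summarized by the following Python sketch. The callables passed in (an MKL learner minimizing Eq. (5), the hard-negative collector of Sect. 4.2, and a ranking routine based on Eq. (6)) are placeholders we introduce for illustration, not actual implementations.

from typing import Callable, List, Sequence, Tuple

def query_time_matching(query_features: Sequence,
                        train_mkl: Callable,               # minimizes Eq. (5)
                        collect_hard_negatives: Callable,  # hyperplane-hashing lookup (Sect. 4.2)
                        rank_matches: Callable,            # scores subimages via Eq. (6)
                        n_iterations: int = 10) -> Tuple:
    # Query-time loop of Fig. 1: the feature ensemble of the query acts as the
    # initial decision hyperplane (encoded here as classifier = None); we then
    # alternate hashing-based hard-negative collection and MKL training.
    positives: List = [query_features]   # extra positives from small transformations omitted
    negatives: List = []
    classifier = None
    for _ in range(n_iterations):
        negatives += collect_hard_negatives(classifier)
        classifier = train_mkl(positives, negatives)
    return classifier, rank_matches(classifier)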

4.1 Multiple feature fusion

To balance the information conveyed by different features, we define the visual similarity between a query image $I_q$ and any other subimage $I_i$ as follows:

S'(I_q, I_i) = \sum_{k=1}^{n} d_{q,k} \, (w_{q,k}^\top x_{i,k}) + b_q    (4)

where $d_q = [d_{q,1}, d_{q,2}, \ldots, d_{q,n}]^\top$ is the weight vector for the $n$ different features, $w_{q,k}$ is the relative importance vector of the $k$th normalized feature $x_{i,k}$, and $w^c_q = [w_{q,1}^\top, w_{q,2}^\top, \ldots, w_{q,n}^\top]^\top$. In other words, we aim to construct a linear combination of multiple visual similarities.


Fig. 1 Overview of our approach. For the initial feature ensemble and the learned weights, the transparency of colors (red, green, blue) denotes the relative value in individual features. A lower transparency means a larger value

Instead of an equally weighted or manually set combination, we learn the query specific weight vector $d_q$ and $w^c_q$ simultaneously by minimizing the following $L_2$-norm multiple kernel learning objective function:

L'(w^c_q, b_q, d_q) = \sum_{x^c_i \in P \cup I_q} h\Bigl(\sum_{k=1}^{n} d_{q,k} (w_{q,k}^\top x_{i,k}) + b_q\Bigr) + \sum_{x^c_j \in N} h\Bigl(-\sum_{k=1}^{n} d_{q,k} (w_{q,k}^\top x_{j,k}) - b_q\Bigr) + \lambda \|w^c_q\|^2 + \mu \|d_q \odot d_q\|^2

s.t. \; d_q \geq 0    (5)

where $\odot$ denotes the Hadamard (elementwise) product, and $\lambda$ and $\mu$ are regularization parameters. In our experiments, $\lambda = 10^2$ and $\mu = 10^8$.

We use the SMO-MKL package [29] to learn $w^c_q$, $b_q$, and $d_q$. Usually, one uses the hard-negative mining approach [6] to cope with a very large amount of negative data. In hard-negative mining, one alternates between using the current classifier to search the whole negative set exhaustively for sufficient hard-negatives, and learning a classifier given the current set of hard-negatives. However, exhaustively searching the whole negative set is very time consuming, and prevents practical use.

Our visual similarity Eq. (4) is still a linear model with respect to the concatenated feature $x^c_i$, since Eq. (4) can be rewritten as

S'(I_q, I_i) = w_q^{f\top} x^c_i + b_q    (6)

where $w^f_q = [d_{q,1} w_{q,1}^\top, d_{q,2} w_{q,2}^\top, \ldots, d_{q,n} w_{q,n}^\top]^\top$. Equation (6) allows us to employ hyperplane hashing to speed up the hard-negative mining approach.
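The folding of Eq. (6) is a simple bookkeeping step; the following Python sketch (with hypothetical feature dimensions) verifies that the folded linear model reproduces Eq. (4).

import numpy as np

def fold_mkl_weights(d_q, w_q_list):
    # Eq. (6): fold the feature weights d_q and the per-feature importance
    # vectors w_{q,k} into one linear weight vector w_q^f over the concatenated feature.
    return np.concatenate([d * w for d, w in zip(d_q, w_q_list)])

def similarity_mkl(d_q, w_q_list, b_q, x_parts):
    # Eq. (4) evaluated directly, for checking against the folded form.
    return sum(d * np.dot(w, x) for d, w, x in zip(d_q, w_q_list, x_parts)) + b_q

# Toy check that the folded linear model (Eq. (6)) matches Eq. (4).
dims = [300, 1000, 240]                    # hypothetical per-feature dimensions
w_list = [np.random.rand(d) for d in dims]
x_parts = [np.random.rand(d) for d in dims]
d_q, b_q = np.array([0.7, 0.2, 0.1]), -0.3
w_f = fold_mkl_weights(d_q, w_list)
assert np.isclose(np.dot(w_f, np.concatenate(x_parts)) + b_q,
                  similarity_mkl(d_q, w_list, b_q, x_parts))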

4.2 Hashing-based hard-negative mining

To alleviate the computational burden of an exhaustive subimage search, we employ compact hyperplane hashing [19] to recast hard-negative mining in a hyperplane-to-point search framework. Given a hyperplane query, hyperplane hashing is capable of rapidly returning the points which have minimal distances to the hyperplane. In the preprocessing stage, we learn a set of hashing functions $H = [h_1, h_2, \ldots, h_m]$ as in [19]. Then we convert each subimage's concatenated feature $x^c_i$ into an $m$-bit hash code and store it in a single hash table with $m$-bit hash keys as entries. To search for hard-negatives during each hard-negative mining iteration, we design a hard-negative collecting procedure that approximates the exhaustive search.

Given the current classifier (decision hyperplane) parameterized by $w^f_q$ and $b_q$, we aim to find a sufficient number of subimages, each of which must satisfy $S'(I_q, I_i) \geq -1$. In fact, in our experiments, all the hard-negatives satisfy $-1 \leq S'(I_q, I_i) \leq 1$. We create a set of extra hyperplanes by modifying the bias term $b_q$ of the current decision hyperplane. To collect hard-negatives, we look up the hash codes of all these hyperplanes in the hash table until a sufficient number of hard-negatives is found, as illustrated in Fig. 2. In Fig. 2(a), hard-negative mining is depicted as searching the retrieval dataset exhaustively to find the hard-negatives (red points). In Fig. 2(b), the extra hyperplanes $(p_1, p_2, \ldots, p_l)$ between $S'(I_q, I_i) - 1 = 0$ and $S'(I_q, I_i) + 1 = 0$ are created by modifying the bias term $b_q$ of the current decision hyperplane. Note that although there are innumerable hyperplanes between $S'(I_q, I_i) - 1 = 0$ and $S'(I_q, I_i) + 1 = 0$, these hyperplanes correspond to only a few distinct hash codes, as the space has been quantized by hashing. The hard-negative collecting for hyperplanes $p_1$ and $p_2$ is shown in Figs. 2(c) and 2(d). For each hyperplane $p$, we extract its hash key $H(p)$ and perform a bitwise NOT operation to obtain the key $\overline{H(p)}$.


Fig. 2 Hard-negative collecting. (a) The hard-negative mining can be described as searching the retrieval dataset exhaustively to find the hard-negatives (red points). Black points denote easy-negatives. (b) A set of extra hyperplanes $(p_1, p_2, \ldots, p_l)$ between $S'(I_q, I_i) - 1 = 0$ and $S'(I_q, I_i) + 1 = 0$ is created by modifying the bias term $b_q$ of the current decision hyperplane. (c–d) For each hyperplane $p$, the green points which have minimal distances to $p$ are collected. Brown points denote the points which have been collected by previous hyperplanes

We then look up $\overline{H(p)}$ in the hyperplane hash table up to a small Hamming distance to obtain the points which have minimal distances to the hyperplane. Once a sufficient number of hard-negatives is found, we retrain the classifier by minimizing Eq. (5).
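A simplified Python sketch of this collecting procedure is given below. The hashing functions of [19] are represented by a generic callable, the bias is folded in via an appended coordinate, and the Hamming-ball probing is done by brute-force key enumeration; these are our simplifications for illustration rather than the actual bilinear hyperplane hash.

import numpy as np
from itertools import combinations

def probe_keys(key, radius):
    # All m-bit keys within the given Hamming distance of `key` (a tuple of 0/1 bits).
    yield key
    for r in range(1, radius + 1):
        for flip in combinations(range(len(key)), r):
            yield tuple(b ^ 1 if i in flip else b for i, b in enumerate(key))

def collect_hard_negatives(w_f, b_q, hash_fn, hash_table, n_extra=8, n_needed=2000, radius=2):
    # Hard-negative collecting (Fig. 2): create extra hyperplanes by shifting the
    # bias within the margin, take the bitwise NOT of each hyperplane's hash key,
    # and probe the hash table up to a small Hamming distance. `hash_fn` stands in
    # for the hashing functions H of [19]; folding the bias in as an appended
    # coordinate is a simplification of the actual bilinear hyperplane hash.
    collected = set()
    for b in np.linspace(b_q - 1.0, b_q + 1.0, n_extra):   # hyperplanes p_1, ..., p_l
        p = np.append(w_f, b)
        key = tuple(1 - bit for bit in hash_fn(p))          # bitwise NOT of H(p)
        for probe in probe_keys(key, radius):
            collected.update(hash_table.get(probe, ()))     # buckets hold subimage ids
            if len(collected) >= n_needed:
                return collected
    return collected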

5 Experiments

In this section, we perform a number of matching experiments across multiple visual domains. We first describe the features used in the experiments, then present the experimental results.

5.1 Image features

In our experiments, we first resize the query image heuristically to limit its feature dimensionality to no more than 10,000. We then extract the following three features for the query image. Each feature is normalized over overlapping spatial blocks as in [6].

Filter Bank: We compute the output energy of a bank of filters. The filters are Gabor-like functions with 4 orientations at 3 different scales. The output of each filter is averaged on a regular grid with a stride of 8 pixels. Then all the local filter bank descriptors are stacked together to form a global descriptor. In comparison with the GIST descriptor [22], the filter bank feature encodes more discriminative information with spatial layout.

HOG: The HOG feature [6] provides excellent performance for object and human detection tasks. As in [27], we also use the HOG feature for the cross-domain visual matching task. The HOG descriptors are densely extracted on a regular grid with a stride of 8 pixels. Then all the local HOG descriptors are stacked together to form a global descriptor.

SSIM: Unlike the above two features, the SSIM feature [26] provides a distinct, complementary measure of scene layout using local self-similarities. The SSIM descriptors are extracted on a regular grid with a stride of 8 pixels. Each descriptor is obtained by computing the correlation surface and partitioning it into 24 bins (8 angles, 3 radial intervals).


Then all the local SSIM descriptors are concatenated to form a global descriptor.

As in [27], we use a sliding window fashion to increase the number of good matches, and use non-maxima suppression to remove highly overlapping redundant matches. For each image in the retrieval dataset, we precompute the filter bank, HOG, and SSIM feature pyramids at multiple scales ahead of time, and iteratively collect hard-negative subimages in these feature pyramids at query time. Although only three features are used in our experiments, we emphasize that any rigid grid-like image representation (e.g., dense SIFT [18, 20], Spark feature (Shape Context) [2, 7], Local Binary Patterns (LBP) [21]) can be integrated into our framework.
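The grid-and-stack construction shared by the three features can be sketched as follows in Python, with a crude gradient-orientation histogram standing in for the actual HOG/filter-bank/SSIM cell descriptors and per-cell L2 normalization standing in for the overlapping block normalization of [6].

import numpy as np

def orientation_histogram(patch, bins=8):
    # A crude gradient-orientation histogram, standing in for a real cell descriptor.
    gy, gx = np.gradient(patch.astype(float))
    angles = np.arctan2(gy, gx).ravel() % np.pi
    hist, _ = np.histogram(angles, bins=bins, range=(0, np.pi),
                           weights=np.hypot(gx, gy).ravel())
    return hist

def dense_grid_descriptor(image, cell_descriptor=orientation_histogram, stride=8, cell=16):
    # image: 2D grayscale array. Stack local descriptors computed on a regular grid
    # (stride of 8 pixels, as in Sect. 5.1) into one global, grid-like descriptor;
    # per-block normalization as in [6] is reduced to per-cell L2 normalization here.
    h, w = image.shape[:2]
    rows = []
    for y in range(0, h - cell + 1, stride):
        for x in range(0, w - cell + 1, stride):
            d = cell_descriptor(image[y:y + cell, x:x + cell])
            rows.append(d / (np.linalg.norm(d) + 1e-12))
    return np.concatenate(rows)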

5.2 Results

We conduct a series of experiments to validate our approach. We collect 500 query images of outdoor scenes at different geographical locations across diversified domains. Like [12], our retrieval dataset is composed of GPS-tagged Flickr images.

Fig. 3 Illustration of the weight vector of multiple features (normalized $d_q$). In the first query, the SSIM feature dominates the effectiveness of the feature ensemble. In the second query, the relative importance of the three features is almost equal

For each query, we create a set of 5,000 random images, and 5,000 images randomly sampled within a 30 mile radius of the query location (containing tens of millions of subimages). We compare with two baseline methods: the state-of-the-art method of Shrivastava et al. [27] (using the HOG feature) and feature concatenation (using the filter bank, HOG, and SSIM features) as described in Sect. 3. All the experiments are run on a PC with a 3.40 GHz Intel Core i7-2600 CPU and 8 GB RAM.

Image-to-Image: Here, we match photos taken across different ages, seasons, weather, or lighting conditions. Figure 4 shows some queries and the corresponding top matches for our approach and the baselines. To investigate the weight vector of multiple features, we plot the normalized $d_q$ of the two queries in Fig. 4, as shown in Fig. 3. In the first query, the two baseline methods fail to find good matches, because the SSIM feature dominates the effectiveness of the feature ensemble. In contrast, our approach can find good matches by properly integrating the discriminative power of multiple features. In the second query, although the HOG feature lacks sufficient discriminative power, both feature concatenation and our approach find good matches. This is because the relative importance of the three features is almost equal for this query. In most cases, however, the relative importance of multiple features differs considerably.

Sketch-to-Image: Matching sketches to images is a very challenging task, as the majority of sketches tend to make use of complex lines rather than just simple silhouettes. In addition, sketches show strong abstraction and local deformations with respect to the real scene. While most current approaches focus on designing a sketch-specific solution, our approach provides a general framework that integrates multiple visual cues to boost matching performance. Some qualitative examples can be seen in Fig. 5.

Fig. 4 Qualitative comparison of our approach against baselines for Image-to-Image matching


Fig. 5 Qualitative comparison of our approach against baselines for Sketch-to-Image matching

Fig. 6 Qualitative comparison of our approach against baselines for Painting-to-Image matching

Painting-to-Image: Matching paintings to images is also a very difficult task, as painting styles may vary significantly from painter to painter. Moreover, brush strokes may bring in strong local gradients even in regions such as sky. Without any changes, we apply our approach as before. A qualitative comparison of our approach against the baselines can be seen in Fig. 6.

CG Image-to-Image: As another cross-domain visual matching evaluation, we test our approach on matching CG images to real images. CG images usually lack sufficient details (e.g., texture and noise), and their color distributions are over-saturated and exaggerated, which results in insufficient discriminative power for any single feature. A qualitative comparison of our approach against the baselines can be seen in Fig. 7. In the first example, it is noteworthy that feature concatenation may perform even worse than using a single feature.

Others-to-Image: Furthermore, we match some other queries (e.g., stereoscopic 3D images, logos, and ink and wash paintings) to images, as shown in Fig. 8. The results further demonstrate that our approach consistently achieves impressive matching performance for queries from additional visual domains.

For quantitative evaluation, we create labels for the retrieval dataset as ground truth. For each image in the dataset, we label whether the query scenes (Angkor Wat, Sydney Opera House, Venice, etc.) appear or not. For each query, we count how many query-scene subimages are retrieved in the top 20 matches. We use the bounded mean Average Precision (mAP) [15] over the 500 collected query images to evaluate the performance of our approach.


Fig. 7 Qualitative comparison of our approach against baselines for CG Image-to-Image matching

Fig. 8 Qualitative comparison of our approach against baselines for Others-to-Image matching

Fig. 9 Typical failure case of our approach


Fig. 10 Illustration of the mAP of our approach against the baselines. The mAP is computed across 500 collected query images

Fig. 11 Illustration of the average query time (in 10^3 seconds) of our approach against the baselines. The average query time is computed across 500 collected query images

Figure 10 shows the mAP of the different approaches. Our approach outperforms the two baselines significantly. In addition, we compare our approach with a variant using the original hard-negative mining (our approach (O)), which shows that our speedup scheme returns results comparable to the original hard-negative mining approach. To evaluate the efficiency of our speedup scheme, we compare the average query time of our approach with the two baselines and our approach (O), as shown in Fig. 11. Our approach takes only about ten minutes per query to find good matches.
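As a rough illustration of this protocol, the following Python sketch computes average precision truncated at the top 20 matches and averages it over queries; it is a simplified stand-in for the bounded mAP of [15], with hypothetical input conventions (a 0/1 relevance label per ranked match).

import numpy as np

def average_precision_at_k(ranked_labels, k=20):
    # Average precision over the top-k matches; ranked_labels[i] is 1 if the i-th
    # ranked subimage belongs to the query scene, else 0.
    labels = np.asarray(ranked_labels[:k], dtype=float)
    if labels.sum() == 0:
        return 0.0
    precision_at_i = np.cumsum(labels) / (np.arange(len(labels)) + 1)
    return float((precision_at_i * labels).sum() / labels.sum())

def mean_average_precision(all_ranked_labels, k=20):
    # mAP across queries (500 collected query images in our experiments).
    return float(np.mean([average_precision_at_k(r, k) for r in all_ranked_labels]))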

6 Limitations and future work

A typical failure case of our approach can be seen in Fig. 9. In this example, we fail to find good top matches, although we have integrated the discriminative power of the filter bank, HOG, and SSIM features. This is because none of the three features is effective for the query. A possible way to address this problem is to integrate more features into our framework. Moreover, our approach is still too slow for online interactive applications. In future work, we will make more efforts to improve the efficiency of our system.

Acknowledgements The authors would like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper. This research is supported by the National Fundamental Research Grant 973 Program (2009CB320802), NSFC grant (61272326, 61025011), and a Research Grant of the University of Macau. Image credits: Andy Carvin, Bob Pejman, Leonid Afremov, Mariko Jesse, Risto-Jussi Isopahkala, Steven Allen, www.turbosquid.com.

References

1. Bach, F.R., Lanckriet, G.R.G., Jordan, M.I.: Multiple kernel learning, conic duality, and the SMO algorithm. In: Proceedings of the 21st International Conference on Machine Learning (ICML), pp. 6–13 (2004)
2. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE Trans. Pattern Anal. Mach. Intell. 24(4), 509–522 (2002)
3. Cao, Y., Wang, C., Zhang, L., Zhang, L.: Edgel index for large-scale sketch-based image search. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 761–768 (2011)
4. Chen, T., Cheng, M.M., Tan, P., Shamir, A., Hu, S.M.: Sketch2Photo: Internet image montage. ACM Trans. Graph. 28(5), 124 (2009)
5. Chong, H.Y., Gortler, S.J., Zickler, T.: A perception-based color space for illumination-invariant image processing. ACM Trans. Graph. 27(3), 61 (2008) (SIGGRAPH)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, pp. 886–893 (2005)
7. Eitz, M., Hildebrand, K., Boubekeur, T., Alexa, M.: Sketch-based image retrieval: benchmark and bag-of-features descriptors. IEEE Trans. Vis. Comput. Graph. 17(11), 1624–1636 (2011)
8. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: Proceedings of the 25th International Conference on Very Large Data Bases (VLDB), pp. 518–529 (1999)
9. Ha, J.Y., Kim, G.Y., Choi, H.I.: The content-based image retrieval method using multiple features. In: International Conference on Networked Computing and Advanced Information Management (NCM), vol. 1, pp. 652–657 (2008)
10. Hauagge, D.C., Snavely, N.: Image matching using local symmetry features. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 206–213 (2012)
11. Hays, J., Efros, A.A.: Scene completion using millions of photographs. ACM Trans. Graph. 26(3), 4 (2007)
12. Hays, J., Efros, A.A.: Im2gps: estimating geographic information from a single image. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2008)
13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing (STOC), pp. 604–613 (1998)
14. Jain, P., Vijayanarasimhan, S., Grauman, K.: Hashing hyperplane queries to near points with applications to large-scale active learning. In: Advances in Neural Information Processing Systems (NIPS), vol. 23 (2010)
15. Jegou, H., Douze, M., Schmid, C.: Hamming embedding and weak geometric consistency for large scale image search. In: European Conference on Computer Vision (ECCV), pp. 304–317 (2008)
16. Johnson, M.K., Dale, K., Avidan, S., Pfister, H., Freeman, W.T., Matusik, W.: CG2Real: improving the realism of computer generated images using a large collection of photographs. IEEE Trans. Vis. Comput. Graph. 17(9), 1273–1285 (2011)
17. Lanckriet, G.R.G., Cristianini, N., Bartlett, P., Ghaoui, L.E., Jordan, M.I.: Learning the kernel matrix with semidefinite programming. J. Mach. Learn. Res. 5, 27–72 (2004)
18. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 2169–2178 (2006)
19. Liu, W., Wang, J., Mu, Y., Kumar, S., Chang, S.F.: Compact hyperplane hashing with bilinear functions. In: Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 17–24 (2012)
20. Lowe, D.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
21. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
22. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. Int. J. Comput. Vis. 42(3), 145–175 (2001)
23. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. J. Mach. Learn. Res. 9, 2491–2521 (2008)
24. Rui, Y., Huang, T.S., Ortega, M., Mehrotra, S.: Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. Circuits Syst. Video Technol. 8(5), 644–655 (1998)
25. Russell, B.C., Sivic, J., Ponce, J., Dessales, H.: Automatic alignment of paintings and photographs depicting a 3D scene. In: 3rd International IEEE Workshop on 3D Representation for Recognition (3dRR) (2011)
26. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–8 (2007)
27. Shrivastava, A., Malisiewicz, T., Gupta, A., Efros, A.A.: Data-driven visual similarity for cross-domain image matching. ACM Trans. Graph. 30(6), 154 (2011) (SIGGRAPH Asia)
28. Vadivel, A., Sural, S., Majumdar, A.K.: Image retrieval from the web using multiple features. Online Inf. Rev. 33(6), 1169–1188 (2009)
29. Vishwanathan, S.V.N., Sun, Z., Theera-Ampornpunt, N., Varma, M.: Multiple kernel learning and the SMO algorithm. In: Advances in Neural Information Processing Systems (NIPS), vol. 23 (2010)
30. Wang, S., Huang, Q., Jiang, S., Tian, Q.: S3MKL: scalable semi-supervised multiple kernel learning for real-world image applications. IEEE Trans. Multimedia 14(4), 1259–1274 (2012)

Gang Sun received his B.Sc. degree from Nankai University, Tianjin, China, in 2009. He is currently a Ph.D. candidate at the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences. His research interests include computer graphics, computer vision, and machine learning.

Shuhui Wang received the B.Sc. degree from Tsinghua University, Beijing, China, in 2006, and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, in 2012. He is currently a researcher with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing. He is also with the Key Laboratory of Intelligent Information Processing, Chinese Academy of Sciences. His research interests include semantic image analysis, image and video retrieval, and large scale web multimedia data mining.

Xuehui Liu received her B.Sc. and M.Sc. degrees from Xiangtan University, Hunan, and her Ph.D. degree in 1998 from the Institute of Software, Chinese Academy of Sciences. Since then she has been working at the Institute of Software, Chinese Academy of Sciences, and is now an Associate Professor. Her research interests include realistic image synthesis, physically based modeling, and simulation.

Qingming Huang received the B.Sc. degree in computer science and the Ph.D. degree in computer engineering from Harbin Institute of Technology, Harbin, China, in 1988 and 1994, respectively. He is currently a Professor with the Graduate University of the Chinese Academy of Sciences (CAS), Beijing, China, and an Adjunct Research Professor with the Institute of Computing Technology, CAS. His research areas include multimedia video analysis, video adaptation, image processing, computer vision, and pattern recognition.


Yanyun Chen received his Ph.D. from the Institute of Software, Chinese Academy of Sciences, in 2000, and his M.S. and B.S. from the Chinese University of Mining and Technology in 1996 and 1993, respectively. He is currently a researcher at the State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences. His current research interests include photorealistic rendering, virtual reality, augmented reality, and complex surface appearance modeling techniques.

Enhua Wu received his B.Sc. degree in 1970 from Tsinghua University, Beijing, and his Ph.D. degree in 1984 from the University of Manchester, UK. Since 1985, he has been working at the Institute of Software, Chinese Academy of Sciences, and since 1997 he has also been teaching at the University of Macau. He is a member of IEEE and ACM. His research interests include realistic image synthesis, virtual reality, physically based modeling, and scientific visualization.