10
Parallel Field Alignment for Cross Media Retrieval Xiangbo Mao State Key Lab of CAD&CG Zhejiang University Hangzhou, China [email protected] Binbin Lin The Biodesign Institute Arizona State University Tempe, US [email protected] Deng Cai State Key Lab of CAD&CG Zhejiang University Hangzhou, China [email protected] Xiaofei He State Key Lab of CAD&CG Zhejiang University Hangzhou, China [email protected] Jian Pei School of Computing Science Simon Fraser University Burnaby, Canada [email protected] ABSTRACT Cross media retrieval systems have received increasing inter- est in recent years. Due to the semantic gap between low- level features and high-level semantic concepts of multimedia data, many researchers have explored joint-model techniques in cross media retrieval systems. Previous joint-model ap- proaches usually focus on two traditional ways to design cross media retrieval systems: (a) fusing features from dif- ferent media data; (b) learning different models for different media data and fusing their outputs. However, the process of fusing features or outputs will lose both low- and high- level abstraction information of media data. Hence, both ways do not really reveal the semantic correlations among the heterogeneous multimedia data. In this paper, we in- troduce a novel method for the cross media retrieval task, named Parallel Field Alignment Retrieval (PFAR), which in- tegrates a manifold alignment framework from the perspec- tive of vector fields. Instead of fusing original features or outputs, we consider the cross media retrieval as a manifold alignment problem using parallel fields. The proposed man- ifold alignment algorithm can effectively preserve the met- ric of data manifolds, model heterogeneous media data and project their relationship into intermediate latent semantic spaces during the process of manifold alignment. After the alignment, the semantic correlations are also determined. In this way, the cross media retrieval task can be resolved by the determined semantic correlations. Comprehensive experimental results have demonstrated the effectiveness of our approach. Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis; H.3.3 [Information Search and Retrieval]: Re- trieval Models Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. MM’13, October 21–25, 2013, Barcelona, Spain. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2404-5/13/10 ...$15.00. http://dx.doi.org/10.1145/2502081.2502087. General Terms Algorithms Keywords Cross media, manifold alignment, parallel field, text corpus and images, multimedia 1. INTRODUCTION In recent years, multimedia contents including text, im- age, audio and video on the web have been growing rapidly. With the role switch between social media (Twitter, Face- book) and traditional media (newspapers, etc.), more and more multimedia contents are published on the web by peo- ple. However, the explosion of multimedia contents has not been matched by an equivalent increase in the sophistication of multimedia content retrieval technology [7, 24, 25]. Nowa- days, the dominate search engines for multimedia retrieval, such as Google and Bing, are still text-based. To effectively leverage the massive explosion of multimedia content, a large number of approaches have been proposed in the areas such as information retrieval, multimedia retrieval and computer vision [2, 3, 5, 11, 16]. One of the well known challenges in the area of multimedia retrieval is the so called semantic gap, i.e ., low-level features are not sufficient to character- ize high-level semantics of media data [34, 35]. Aiming at this point, many previous works [27] have been proposed to simultaneously utilize multiple types of information such as the original multimedia contents, surrounding texts (or labels), and links to improve the multimedia retrieval per- formance. However, these works do not consider the seman- tic correlation among different media types. Generally, they can be viewed as uni-media retrieval, in which the query example and the retrieved results are of the same media type [7]. It is common knowledge that an important requirement for further progress in these areas is the development of so- phisticated joint models for multiple media types. In which the most significant is the development of models that sup- port inference with respect to content that is rich in multiple media types [24]. Specifically, these models should utilize the full structure of document which has a body of text accom- panied with images or videos. For example, wikipedia page or newspaper article usually contains several paragraphs of

Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

Parallel Field Alignment for Cross Media Retrieval

Xiangbo MaoState Key Lab of CAD&CG

Zhejiang UniversityHangzhou, China

[email protected]

Binbin LinThe Biodesign InstituteArizona State University

Tempe, [email protected]

Deng CaiState Key Lab of CAD&CG

Zhejiang UniversityHangzhou, China

[email protected] He

State Key Lab of CAD&CGZhejiang UniversityHangzhou, China

[email protected]

Jian PeiSchool of Computing Science

Simon Fraser UniversityBurnaby, Canada

[email protected]

ABSTRACTCross media retrieval systems have received increasing inter-est in recent years. Due to the semantic gap between low-level features and high-level semantic concepts of multimediadata, many researchers have explored joint-model techniquesin cross media retrieval systems. Previous joint-model ap-proaches usually focus on two traditional ways to designcross media retrieval systems: (a) fusing features from dif-ferent media data; (b) learning different models for differentmedia data and fusing their outputs. However, the processof fusing features or outputs will lose both low- and high-level abstraction information of media data. Hence, bothways do not really reveal the semantic correlations amongthe heterogeneous multimedia data. In this paper, we in-troduce a novel method for the cross media retrieval task,named Parallel Field Alignment Retrieval (PFAR), which in-tegrates a manifold alignment framework from the perspec-tive of vector fields. Instead of fusing original features oroutputs, we consider the cross media retrieval as a manifoldalignment problem using parallel fields. The proposed man-ifold alignment algorithm can effectively preserve the met-ric of data manifolds, model heterogeneous media data andproject their relationship into intermediate latent semanticspaces during the process of manifold alignment. After thealignment, the semantic correlations are also determined.In this way, the cross media retrieval task can be resolvedby the determined semantic correlations. Comprehensiveexperimental results have demonstrated the effectiveness ofour approach.

Categories and Subject DescriptorsH.3.1 [Information Storage and Retrieval]: ContentAnalysis; H.3.3 [Information Search and Retrieval]: Re-trieval Models

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. Copyrights for components of this work owned by others than theauthor(s) must be honored. Abstracting with credit is permitted. To copy otherwise, orrepublish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected]’13, October 21–25, 2013, Barcelona, Spain.Copyright is held by the owner/author(s). Publication rights licensed to ACM.ACM 978-1-4503-2404-5/13/10 ...$15.00.http://dx.doi.org/10.1145/2502081.2502087.

General TermsAlgorithms

KeywordsCross media, manifold alignment, parallel field, text corpusand images, multimedia

1. INTRODUCTIONIn recent years, multimedia contents including text, im-

age, audio and video on the web have been growing rapidly.With the role switch between social media (Twitter, Face-book) and traditional media (newspapers, etc.), more andmore multimedia contents are published on the web by peo-ple. However, the explosion of multimedia contents has notbeen matched by an equivalent increase in the sophisticationof multimedia content retrieval technology [7, 24, 25]. Nowa-days, the dominate search engines for multimedia retrieval,such as Google and Bing, are still text-based. To effectivelyleverage the massive explosion of multimedia content, a largenumber of approaches have been proposed in the areas suchas information retrieval, multimedia retrieval and computervision [2, 3, 5, 11, 16]. One of the well known challengesin the area of multimedia retrieval is the so called semanticgap, i.e., low-level features are not sufficient to character-ize high-level semantics of media data [34, 35]. Aiming atthis point, many previous works [27] have been proposedto simultaneously utilize multiple types of information suchas the original multimedia contents, surrounding texts (orlabels), and links to improve the multimedia retrieval per-formance. However, these works do not consider the seman-tic correlation among different media types. Generally, theycan be viewed as uni-media retrieval, in which the queryexample and the retrieved results are of the same mediatype [7].

It is common knowledge that an important requirementfor further progress in these areas is the development of so-phisticated joint models for multiple media types. In whichthe most significant is the development of models that sup-port inference with respect to content that is rich in multiplemedia types [24]. Specifically, these models should utilize thefull structure of document which has a body of text accom-panied with images or videos. For example, wikipedia pageor newspaper article usually contains several paragraphs of

Page 2: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

text and a number of related images for illustration. Theperformance of such models is referred as a cross media re-trieval problem: the retrieval of all documents with the othermedia types in response to a media type query data. Thetask is crucial to many practical applications, such as findingimages on the web that best illustrate a given text, findingthe texts which are most related to a given image, or further,searching using a combination of different types of multime-dia data and labelling media data automatically [24]. Fig-ure 1 illustrates the following example, given an image ofa horse, the cross media retrieval model should return allthe contents related to the horse concept, e.g ., the sound ofhorse, the introduction text.

Figure 1: An example of cross media retrievalmodel.

To address the cross media retrieval problem, advanceshave been reported over the last decades [7, 26, 28]. Thesemethods focus on two traditional ways to design cross mediaretrieval systems: (a) fusing features from different mediadata into a single vector [23, 33]; (b) learning different mod-els for different media data and fusing their outputs [14,32]. And most of these approaches require multiple-typequeries, e.g ., queries composed of both image and text fea-tures. Hence, these methods are extensions of the classicuni-media retrieval systems.

The key challenge of cross media retrieval is to explorethe semantic correlations among the heterogeneous mediadata. However, both of previous ways do not really revealthe semantic correlations. Semantic correlations can help usto better understand, organize and make use of the mediadata [34].

Recently, some developments bring new perspective tosolve the cross media retrieval problem. Multimedia Corre-lation Space [34] and Correlation Semantic Space [24] intro-duce the idea of constructing a joint model to project orig-inal multimedia features to the semantic correlation space.And in the same period, manifold alignment methods [10,29, 31] are proposed and shown that they are appropri-ate joint models for pair matching between heterogeneousdata sources. However, most of existing manifold align-ment methods use graph based regularizer e.g ., graph Lapla-cian, which focus on ensuring the first order smoothness ofthe mapping functions [21] in manifold alignment process.The first order smoothness of the mapping functions is not

enough to reveal the underlying semantic correlations be-tween heterogeneous types of multimedia data. In order todiscover the latent semantic correlations, we would like toensure the second order smoothness of the mapping func-tions in manifold alignment which preserve the geodesic dis-tance of the manifolds. In some recent work, parallel fields[22] are found capable to keep the second order smoothnessof the mapping functions [18].

Inspired by these developments in the cross media re-trieval area, we propose a novel approach for cross mediaretrieval, called Parallel Field Alignment Retrieval (PFAR),which integrates a manifold alignment framework from theperspective of vector fields. Instead of fusing original fea-tures or outputs, we consider the cross media retrieval as amanifold alignment problem using parallel fields. The pro-posed manifold alignment algorithm can effectively preservethe metric of data manifolds, model heterogeneous mediadata and project their relationship into intermediate latentsemantic spaces during the process of manifold alignment.After the alignment, the semantic correlations are also de-termined. In this way, the cross media retrieval task canbe resolved by the determined semantic correlations. Theempirical results from a real world data demonstrate thebenefits of our approach over state-of-the-art cross mediaretrieval methods.

The rest of this paper is organized as follows. In Sec-tion 2, we introduce some basic background informationabout manifold alignment and parallel fields. In Section 3,we describe the proposed cross media retrieval algorithm(PFAR) in great detail. The experimental results of realworld cross media data are presented in Section 4. Finally,Section 5 provides some concluding remarks.

2. BACKGROUNDAs discussed in the previous section, our cross media re-

trieval approach involves manifold alignment and parallelfields techniques. In this section, we introduce some back-ground on mainfold alignment and parallel fields.

−10 0 10 20 0

20

40

−20

−10

0

10

20

Swiss roll

−10 0 10 20 0

20

40

−20

−10

0

10

20

Swiss roll with hole

After alignment

Figure 2: An example of manifold alignment. Twodata manifold samples are shown in the left.The re-sult of these two manifolds after alignment is shownin the right.

2.1 Manifold alignment

Page 3: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

In many areas of machine learning and data mining, oneis often confronted with very high dimensional data suchas high definition videos and large vector-space documents.Learning problems involving these datasets are usually chal-lenging. However, in many cases, there is a strong intuitionthat the high dimensional data may have a lower dimen-sional intrinsic representation. Manifold alignment is a classof machine learning algorithms which takes advantage of thisassumption to produce projections between sets of data byaligning their underlying manifold representations [9, 30].

One of the pioneering work in manifold alignment is thepaper of semi-supervised alignment [10]. Given certain la-beled samples, semi-supervised alignment aims to find twomaps which map two datasets to the new common spacewhile satisfying the label constraint. Suppose U is the vertexspace, there are l labeled vertices Vlabel = {v1, v2, . . . , vl},Vlabel ⊂ U, and Tlabel = {t1, t2, . . . , tl} (ti ∈ R), which is thevector of labeled target values. Similar to regression models,we would like to find a map defined on the vertices of thegraph f : U → R which matches known target values forthe labeled vertices. This can be solved by minimizing thefollowing objective function:

E(f) =∑i

µ|f(vi)− ti|+ fTLf (1)

Here vi ∈ Vlabel, ti ∈ Tlabel, and L is the graph Laplacianmatrix. The relative weighting of these terms is given by thecoefficient µ. The symmetric graph Laplacian matrix L =LT provides the information of the data manifold structure.Given two datasets with corresponding labels, we can learnmanifolds for each datasets according to Equation 1, andalign these two manifold with the intrinsic coordinates oflabels. An illustration of manifold alignment is shown inFigure 2.

2.2 Parallel field regularizationGiven a manifoldM, a vector field is a mapping from the

manifold to tangent spaces on the manifold [22]. We canthink of a vector field on the manifold M in the same wayas we think of the vector field in Euclidean space. For eachpoint on the manifold, we assign an arrow on it, with a givenmagnitude and direction, chosen to be tangent to the man-ifold M. A smooth vector field means that tangent vectorsvary smoothly on the manifold. An example of vector fieldsis shown in Figure 3.

One kind of the most important vector fields are parallelfields. The definition of parallel fields on the manifold isgiven as follows.

Definition 1. (Parallel Fields [22]) A vector field X on themanifold M is a parallel field if

∇X ≡ 0,

where ∇ is the covariant derivative on M.

The parallel field is closely related to the linear function onthe manifold. Let V be a parallel field on the manifold. If itis also a gradient field for function f , V = ∇f , then f mustbe a linear function on the manifold.

Parallel fields essentially captures the second order smooth-ness of functions on the manifold [18]. Some recent theo-retical results [15] shows that penalizing the second order

−1 −0.5 0 0.5 1−1

0

1

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

Example of a vector field

Figure 3: A vector field on punched sphere.

smoothness of functions helps achieve faster rates of conver-gence for semi-supervised regression problems. Moreover,parallel fields can also captures the metric structure of themanifold [17]. In other words, we can use parallel fields topreserve the distance on the manifold.

Next we briefly introduce Parallel Field Regularization(PFR). Let M be a d-dimensional manifold embedded inEuclidean space Rm. Given l labeled data points (xi, yi)

li=1,

xi ∈ M and yi ∈ R, the goal of semi-supervised regressionon the manifold is to learn a function f : M→ R. Specifi-cally, PFR tries to learn the function f and it gradient field∇f simultaneously via regularization. Formally, PFR aimsto learn a function f and a vector field V by optimizing thefollowing objective function:

E(f, V ) =1

l

l∑i=1

R0(xi, yi, f) + λ1R1(f, V ) + λ2R2(V ), (2)

where

R1(f, V ) =

∫M‖∇f − V ‖2 (3)

and

R2(V ) =

∫M‖∇V ‖2F . (4)

The first term in Equation 2 is the loss function, and thesecond term enforces the vector field V to be close to thegradient field ∇f of f . ∇V in the third term measures thechange of the vector field V . If ∇V vanishes, V must be aparallel field.

3. CROSS MEDIA RETRIEVALIn this section, we present a novel approach to solve cross

media retrieval problem.

3.1 The problemSuppose we have a dataset X = {X1, . . . , X|X|} which

contains documents of two different types of media data Aand B, e.g ., A is a collection of texts and B consists of im-ages. Specifically, all Xi, i ∈ (1, |X |), in X can be quite di-verse: from documents where a single text is complemented

Page 4: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

by one or more images to documents containing multipleimages and texts. For simplicity, we consider the case whereeach document in X consists of two sample components, oneis from A and the other one is from B, i.e. Xi = (Ai,Bi).All data points of A and B are of vector forms in featurespaces RA and RB, respectively. Under this circumstances,each document establishes a one-to-one mapping betweencomponent from the A and component from B media dataspaces.

We consider cross media retrieval problem based on thedoument dataset X . The fundamental concept of cross me-dia retrieval is rather straightforward. Given a query QA ∈RA (or QB ∈ RB), the goal of cross media retrieval is to re-turn the closest matches in the B (or A) space RB (or RA).Whenever the A and B media data spaces have a naturalcorrespondence, the original cross media retrieval problemcan reduce to a classical retrieval problem: finding an in-vertible mapping function f between A and B:

f : RA →RB. (5)

Hence, if a query QA ∈ RA is given, we would be able tofind the nearest neighbor of f(QA) in RB. Similarly, givena query QB ∈ RB, it would be easy for us to find the nearestneighbor of f−1(QB) in RA [24].

However, since A and B are different types of media data,they tend to adopt different feature representations. There-fore, it is hard to reveal semantic correlations between RAand RB, which means the semantic gap still exists. For ex-ample, suppose A is a dataset consists of texts and B is adataset consists of images. And we adopt TF/IDF and His-togram of Oriented Gradients (HOG) [4] to be the featurerepresentations for texts and images, respectively. Thus, itis hard for us to directly explore the semantic correlationsbetween text and image data spaces.

In order to abridge the semantic gap betweenRA andRB,it is possible to map these two representations into two inter-mediate spaces which have a natural correspondence [24] andsemantic correlations [36]. Consider following mappings:

fA : RA → IA, (6)

and

fB : RB → IB. (7)

Here, fA and fB map originalRA andRB media data spacesto a pair of intermediate spaces IA and IB, respectively.Further, we would like to ensure that there is also an invert-ible mapping between IA and IB:

fI : IA → IB. (8)

Now, if a query QA ∈ RA is given, the cross media re-trieval task becomes finding the nearest neighbor of f−1

B ◦fI ◦ fA(QA) in RB. Similarly, the goal becomes finding thenearest neighbor of f−1

A ◦ f−1I ◦ fB(QB) in RA if a query

QB ∈ RB is given [24]. Under this circumstances, the orig-inal problem of cross media retrieval is equivalent to learnthe intermediate spaces IA and IB.

In our approach, we use manifold alignment with parallelfields to model disparate media A and B data and projecttheir relationship into intermediate spaces, and the semanticcorrelations are determined during the process of manifoldalignment. In this way, the cross media retrieval task can beresolved by the determined semantic correlation. We next

show our proposed method, the parallel field alignment re-trieval (PFAR), in detail.

3.2 Parallel field alignment retrievalIn order to learn intermediate spaces IA and IB, An opti-

mal correspondence between the representations in the orig-inal spaces RA and RB [24] is needed. One possible wayis to apply the subspace learning framework which utilizesome extremely well developed dimensionality reduction ap-proaches, such as Principal Component Analysis (PCA) [13]or Latent Semantic Indexing (LSI) [6].

Our approach here utilizes manifold alignment algorithms[10, 30] to find mappings defined on the media data mani-folds, and use these mappings to align the underlying mediamanifold representations in intermediate semantic spaces.In the sense of manifold alignment, the semantic correla-tions are learned after the alignment of the underlying me-dia manifold representations. In addition, we would like touse parallel fields, which preserve the metric of the manifoldto measure the disparate media data manifolds. Most of ex-isting manifold alignment algorithms focus on ensuring thefirst order smoothness of the function. However, researchershave shown that the second order smoothness of the functionis particularly important for preserving the metric. And thesecond order smoothness of the function is equivalent to theparallelism of the gradient field of the function [18]. Thus,we propose to learn two mapping functions and two vectorfields simultaneously with two constraints in the process ofmanifold alignment. By designing the two constraints, weensure each vector field to be close to the gradient field ofeach corresponding mapping function, and the vector fieldsshould be as parallel as possible. In this way, the secondorder smoothness of the function is also ensured. The wholeconcept of our proposed parallel field alignment retrieval ap-proach is shown in Figure 4.

We now consider aligning disparate media data manifolds,given some additional information of datasets, from the par-allel field perspective. In our approach, the desired coordi-nates for labeled samples are provided. In detail, the coor-dinates of labels indicate the intrinsic information of paireddata samples within the manifolds. With a small numberof labeled examples, it is crucial to exploit manifold struc-tures of datasets in manifold alignment [10]. Specifically, wefirst estimates the gradient field of the prediction functionby a vector field, and then require the vector field to be asparallel as possible.

In the setting of parallel field alignment retrieval (PFAR),we denote the notations A ⊂ RA and B ⊂ RB as thedatasets, and s = {s1, s2, . . . , sl} and t = {t1, t2, . . . , tl}are the vectors referred to the target paired information ofthe l “labeled” data samples for A and B respectively. Letf and g denote two mapping functions defined on the datamanifolds that match known target labeled pairs. Followingthe above analysis, we try to integrate vector fields VA andVB on the manifold with two constraints to our alignmentfunctions:

E(f, VA) =1

lA

lA∑i=1

(f(ai)−si)2+µA,1R1(f, VA)+µA,2R2(VA)

(9)

Page 5: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

Figure 4: The PFAR concept illustration.

and

E(g, VB) =1

lB

lB∑i=1

(g(bi)−ti)2+µB,1R1(g, VB)+µB,2R2(VB),

(10)where ai ∈ A and bi ∈ B, R1 and R2 are regularizers definedin Equation 3 and 4, R1 enforces vector fields VA and VB tobe close to gradient fields ∇f and ∇g of mapping functions.As shown in Section 2, if R2 vanishes, VA and VB must beparallel fields.

In order to explore how underlying data manifolds can bemapped into intermediate spaces and then aligned to eachother through a common set of paired media data informa-tion in the process of PFAR, we should make sure that thevector field is close to the gradient field of the function andfurther the vector field should be as parallel as possible.

Next we show that how to discretize the continuous ob-jective function. First of all, we give some notations:

• Let TaiM denote the d dimensional tangent space ofai on the manifold M.

• Let Ti ∈ Rm×d denote the matrix whose columns con-stitute an orthonormal basis for TaiM.

• Let Vai denote the value of the vector field V at datapoint ai.

Following the above notations, define Pi = TiTTi . It can

be shown that Pi is the unique orthogonal projection fromRm onto the tangent space TaiM [8]. According to thedefinition of vector fields, each vector Vai should be on thetangent space TaiM. Thus we can represent Vai by thecoordinates of tangent spaces, i.e., Vai = Tivi. Let VA =(vT1 , . . . , v

Tn )T ∈ Rdn be a dn-dimension column vector which

concatenates all the vi’s, i ∈ (1, . . . , n).

Then the regularizers R1 and R2 can be discretely reducedto:

R1(f,VA) =∑i

∑j∼i

wij((ai − aj)T Tivi − fj + fi)2 (11)

and

R2(VA) =∑i

∑j∼i

wij‖PiTjvj − Tivi‖2, (12)

where wij , weight parameters, which can be approximatedby heat kernel weights exp(−‖ai−aj‖2/δ) or by 0 - 1 weightsfor simplicity.

Now, let IA denote an n×n diagonal matrix where IAii = 1if ai is labeled and IAii = 0 otherwise. Then the dis-crete form of our parallel field alignment objective functionE(f, VA) can be written as:

E(f,VA) =1

lA(f − s)T IA(f − s)

+ µA,1

∑i

∑j∼i

wij((ai − aj)T Tivi − fj + fi)2

+ µA,2

∑i

∑j∼i

wij‖PiTjvj − Tivi‖2

.

(13)

The optimal solution to this objective function is thenobtained via solving the following linear systems:(

1lA

IA + 2µA,1LA −µA,1CTA

−µA,1CA µA,1GA + µA,2KA

)(fVA

)=

(1lA

s

0

),

(14)where L denotes the Laplacian matrix of the graph withweights wij , GA is a dn×dn block diagonal matrix, and CA

is a dn × dn block matrix. Let us denote GAii and CAi asthe i-th d×d diagonal block of G and the i-th d×n block ofC respectively, and zij ∈ Rn is a selection vector of all zero

Page 6: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

elements except for the i-th element being −1 and the j-thelement being 1. Then GAii and CAi are defined as

GAii =∑j∼i

wijTTi (aj − ai)(aj − ai)TTi (15)

and

CAi =∑j∼i

wijTTi (aj − ai)zTij , (16)

KA is a dn×dn sparse block matrix. If we use KAij to indexeach d× d block in A, i, j ∈ (1, . . . , n), we have

KAii =∑j∼i

wij(QijQTij + I) (17)

and

KAij =

{−2wijQij , if ai ∼ aj0, otherwise

(18)

where Qij = TTi Tj .

Similarly, the optimal solution for data set B is as follows:(1lB

IB + 2µB,1LB −µB,1CTB

−µB,1CB µB,1GB + µB,2KB

)(gVB

)=

(1lB

t

0

).

(19)Given two datasets A and B with target paired values of

labeled data samples, the solutions to f and g of Equation14 and Equation 19 can be used to estimate coordinatesof the other unlabeled data points in intermediate spaces,which further can be utilized to align their intrinsic datamanifolds.

Given any query q from A, we use the mapping function fobtained in the training step to project the query q into theintermediate space. And the projected q is then semanticcorrelated aligned in the intermediate space. To find thebest match samples in B, we can use the metric definedas: let F = (f1, . . . , fr)T and G = (g1, . . . , gr)T be the rdimensional representations of aligned manifolds of A andB. If the coordinates in F and G are aligned from knowncoordinates, the distance between ai ∈ A and bj ∈ B is thengiven by [10]:

d(ai, bj)2 =

∑k

|Fik −Gjk|2, (20)

then the best match bj ∈ B to a ∈ A is given by:

arg minj

d(a, bj). (21)

The alignment framework is similar to some of the exist-ing alignment methods [9, 10, 30]. However, most of theexisting alignment approaches focus on preserving the pair-wise similarity between data pairs. In this case, they maynot preserve the relative order of the similarity measure. Forexample, suppose object A and object B are similar, objectA and object C are also similar, however B is more simi-lar to A than C. In existing alignment methods, they canfind a space in which A, B, C are close, but they cannottell whether B is closer to A than C or not. By using vec-tor fields, we require the mapping function varies linearlyalong the geodesics on the manifolds and naturally the or-der can be preserved. This property is extremely importantfor multimedia retrieval problems.

4. EXPERIMENTSIn this section, we conduct some extensive experimental

evaluation to demonstrate the effectiveness of our proposedParallel Field Alignment Retrieval (PFAR) approach.

4.1 DatasetThe evaluation of a cross media retrieval system usually

needs a document corpus with paired contents from dis-parate domains of multimedia source. In this experiment,we use the recently proposed Wikipedia dataset composed oftext and image pairs1 [24]. This real world dataset consistsof a continually updated collection of Wikipedia’s featuredarticles spread over 29 categories, since 2009. The articlesare accompanied by one or more images from the WikipediaCommons, supplying a pairing of the desired kind. Sincesome categories are very scarce, we will only consider thetop 10 most categories in the experiment.

Each article was split into sections based on its sectionheadings, and each image was assigned to the correspond-ing section in which it was placed by the author(s) of theWikipedia page. Then this dataset was pruned to keep onlysections that contained a single image and at least 70 words.The final corpus contains 2866 text-image pairs, each be-longs to one of 10 semantic categories. And the corpus israndom split into a training set with 2173 documents, anda test set with 693 documents [24]. The detail informationof this corpus is summarized in Table 1.

Table 1: Experiment Dataset Summary [24]Category Training Query Total

Art & architecture 138 34 172Biology 272 88 360

Geography & places 244 96 340History 248 85 333

Literature & theatre 202 65 267Media 178 58 236Music 186 51 237

Royalty & nobility 144 41 185Sport & recreation 214 71 285

Warfare 347 104 451Summary 2173 693 2866

4.2 Data representation and evaluationThe data representation is similar to those of two previ-

ous works [20, 24]. In terms of image representations RI ,we represented the image using the histogram over the pop-ular scale-invariant feature transformation (SIFT) [19] withthe codebook of 512 codewords. The text representation isderived from a Latent Dirichlet Allocation (LDA) [1] modelwith 10 topics (for training data, label information is derivedbased on the topics). Thus, in RT , text documents are rep-resented by their topic assignment probability distributionsamong these 10 topics.

To test the proposed PFAR approach on real data, weconduct the retrieval task with two parts: text retrieval us-ing an image query, and image retrieval using a text query.In the first part, each image in the test set is used as a query,and the goal is to rank all the texts in the test set based ontheir match to the query image. In the second, a text query

1http://www.svcl.ucsd.edu/projects/crossmodal

Page 7: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Recall

Precis

ion

Random

CCA

SCM

Fast−MCU

PFAR

(a) PR curve of text query task

0 0.2 0.4 0.6 0.8 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

Recall

Precis

ion

Random

CCA

SCM

Fast−MCU

PFAR

(b) PR curve of image query task

Figure 5: Precision recall curves: (a) text query and (b) image query

is used to rank the images. In both parts, the performanceis measured using precision-recall (PR) curves and the meanaverage precision (MAP).

In the experiment, our approach learns the aligned in-termediate spaces by incorporating the information of me-dia data. We use k = 8 nearest neighbors method to con-struct the neighborhood for each media source in formula-tion of Equation 9, and we apply the 10-fold cross validationto select the parameters (e.g ., µA,1 for R1, µA,2 for R2).To demonstrate the performance of PFAR, we compare anumber of state-of-the-art cross media retrieval approaches,CCA [12], Semantic Correlation Matching (SCM) [24], Fastversion of Maximum Covariance Unfolding (Fast-MCU) [20]and Manifold Alignment (MA) [10] with PFAR.

In these retrieval approaches, given a test sample (imageor text), it is first projected into the learned intermediatespace. For CCA and SCM, this involves a linear transforma-tion to the low dimensional subspace, while for Fast-MCU,the nearest neighbors of the test point among the trainingsamples in the original space are used to obtain a mapping ofthe point as a weighted combination of these neighbors [20].

For PFAR, we use training datasets to learn the interme-diate spaces during the parallel field alignment process andtwo manifolds are aligned at the same time. The same map-ping is then applied to the projection of the neighbors in thelearned intermediate space to compute the projection of thetest point. To perform retrieval, all the test samples fromboth models, image and text, are projected on the alignedmanifolds, respectively. For a given test point from one kindof media, we use the correlation distance shown in Equation21 to compute its distance to all the other projected testpoints of the other medium, and then these distances areranked. If a retrieved sample belongs to the same categoryas the query, it is considered to be correct.

4.3 Test of the cross media retrievalThe result of the retrieval task is shown in Table 2, which

summarizes the MAP scores obtained for the 5 cross mediaretrieval approaches. This table contains scores for bothimage retrieval from a text query, and text retrieval froman image query, and the average. The performance of therandom retrieval test is also shown to indicate the baseline

chance level. From Table 2, it is clear that our proposedapproach PFAR outperforms CCA, SCM and Fast-MCU inboth image-to-text (image query) and text-to-image (textquery) cross media retrieval tasks. PFRA leads non-trivialimprovements over other approaches. In both parts, theaverage MAP of our approach is more than 2.5 times thatof random method.

Table 2: Retrieval Performance (MAP Scores)Experiments Image Query Text Query Average

Random 0.118 0.118 0.118CCA 0.246 0.196 0.221SCM 0.274 0.225 0.250

Fast-MCU 0.287 0.224 0.256MA 0.262 0.225 0.243

PFAR 0.298 0.273 0.286

To further analyze the performance of our proposed ap-proach, we also presents the results of precision-recall (PR)curves for both image and text queries in Figure 5. Ac-cording to the results of Figure 5, it is clear that our pro-posed approach PFAR, Fast-MCU, SCM, and CCA all gainimprovements over random retrieval at all levels of recall.Compared the PR curves of PFAR with those of SCM andFast-MCU, it shows that PFAR gets higher precision at alllevels of recall for both text queries (on the left) and imagequeries (on the right). It shows that the PFAR approachis more effective to precisely find matches than the otherapproaches.

Figure 6 illustrates two examples of text queries and thetop ranked images retrieved by PFAR to provide the visu-alization of cross media retrieval. These two examples arechosen from geography category and sport category, respec-tively. In each case, the query text is shown at the top,and the first image in the images row is the groundtruth im-age. As indicated in Figure 6, top four retrieved images areshown next to the groundtruth image. We can see that theretrieved images are quite semantic correlated to the querytexts.

Figure 7 shows an example about topics distribution ofretrieved texts by CCA, SCM, Fast-MCU, MA and our pro-

Page 8: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

(a) Text query from geography category

(b) Text query from sport category

Figure 6: Two examples of text queries and the top images retrieved by PFAR. The query texts of (a) and(b) are chosen from geography category and sport category, respectively. In both (a) and (b), first image inbottom row is the groundtruth image, and others are top ranked retrieved images.

Page 9: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

(a) Query image

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Art

Biology

GeographyHistory

LiteratureMedia

Music

RoyalitySport

Warface

Random

CCA

SCMFast−MCU

MA

PFAR

(b) Topics distribution of retrieved texts (c) The groundtruth text: Seattle fromWikipedia

Figure 7: A comparison example of cross media retrieval with given image query.

posed PFAR. Given a query image, we compared the top-ics distribution of all retrieved texts with respect to abovemethods. In Figure 7, the query image is taken from Seattle,the groudtruth text gives some description about the geog-raphy of Seattle. The result of PFAR can accurately givethe topics distributions over other methods.

Actually, all methods tend to retrieve texts that are re-lated specifically to the query image. However, PFAR is ableto retrieve texts which are closer to the query image on thecategorical level. This also indicates that the abstractionwork with relative order infomation preserved by PFAR isespecially important for exploratory tasks.

According to the experiment result analysis, PFAR canbe viewed as one kind of semantic correlation spaces projec-tion methods. In addition, the parallel field in the alignmentprocess, which preserves the geodesic distance of media man-ifolds, also significantly contribute to the performance of thetext and image cross media retrieval.

5. CONCLUSIONSIn this paper, we have proposed a novel approach for cross

media retrieval, called Parallel Field Alignment Retrieval(PFAR), which integrates a manifold alignment frameworkfrom the perspective of vector fields. Our goal is to solve thefundamental problem in cross media retrieval area. Givena query QA ∈ RA (or QB ∈ RB), the goal of cross mediaretrieval is to return the closest matches in the B (or A)media space RB (or RA).

Due to the semantic gap between low level features andhigh level semantic concepts of multimedia data, it is noteasy for us to reveal the underlying semantic correlationsamong the heterogeneous multimedia data. Previous stud-ies have shown that manifold alignment method is an appro-priate model for pair matching between heterogeneous datasources. Hence, we consider the cross media retrieval as amanifold alignment problem and use parallel fields, whichcan effectively preserve the metric of media data manifolds,to model heterogeneous multimedia data and project theirrelationship into intermediate latent semantic spaces duringthe process of manifold alignment. And the most importantfactor is that the semantic correlations are also determinedin the manifold alignment process.

Finally, the experimental results from a real world dataillustrate the validity of our approach. In text and image

cross media retrieval tasks, our approach attains significantin the retrieval performance over state-of-the-art cross mediaretrieval methods.

6. ACKNOWLEDGMENTSThis work is supported by National Basic Research Pro-

gram of China (973 Program) under Grant 2012CB316400and National Natural Science Foundation of China (GrantNo: 61222207, 61125203, 61233011, 90920303). Mao’s andPei’s research is supported in part by an NSERC DiscoveryGrant and a BCFRST NRAS Endowment Research TeamProgram project. All opinions, findings, conclusions andrecommendations in this paper are those of the author anddo not necessarily reflect the views of the funding agencies.

7. REFERENCES[1] D. Blei, A. Ng, and M. Jordan. Latent dirichlet

allocation. The Journal of Machine LearningResearch, 3:993–1022, 2003.

[2] M. Casey, R. Veltkamp, M. Goto, M. Leman,C. Rhodes, and M. Slaney. Content-based musicinformation retrieval: current directions and futurechallenges. Proceedings of the IEEE, 96(4):668–696,2008.

[3] C. Cotsaces, N. Nikolaidis, and I. Pitas. Video shotdetection and condensed representation. a review.Signal Processing Magazine, IEEE, 23(2):28–37, 2006.

[4] N. Dalal and B. Triggs. Histograms of orientedgradients for human detection. In IEEE Conference onComputer Vision and Pattern Recognition, volume 1,pages 886–893. IEEE, 2005.

[5] R. Datta, D. Joshi, J. Li, and J. Wang. Imageretrieval: Ideas, influences, and trends of the new age.ACM Computing Surveys, 40(2):5:1–5:60, 2008.

[6] S. Deerwester, S. Dumais, G. Furnas, T. Landauer,and R. Harshman. Indexing by latent semanticanalysis. Journal of the American Society forInformation Science, 41(6):391–407, 1990.

[7] H. Escalante, C. Hernadez, L. Sucar, and M. Montes.Late fusion of heterogeneous methods for multimediaimage retrieval. In Proceedings of the 1st ACMInternational Conference on Multimedia InformationRetrieval, pages 172–179. ACM, 2008.

Page 10: Parallel Field Alignment for Cross Media Retrievaljpei/publications/ParallelField... · 2013. 7. 31. · days, the dominate search engines for multimedia retrieval, such as Google

[8] G. H. Golub and C. F. V. Loan. Matrix Computations.Johns Hopkins University Press, 1996.

[9] J. Ham, D. D. Lee, and L. K. Saul. Learning highdimensional correspondences from low dimensionalmanifolds. In Workshop on the Continuum fromLabeled to Unlabeled Data in Machine Learning andData Mining, pages 34–41, 2003.

[10] J. Ham, D. D. Lee, and L. K. Saul. Semisupervisedalignment of manifolds. In Proceedings of the AnnualConference on Uncertainty in Artificial Intelligence,volume 10, pages 120–127, 2005.

[11] X. He, W. Ma, and H. Zhang. Learning an imagemanifold for retrieval. In Proceedings of the 12th ACMInternational Conference on Multimedia, pages 17–23.ACM, 2004.

[12] H. Hotelling. Relations between two sets of variates.Biometrika, 28(3/4):321–377, 1936.

[13] I. Jolliffe. Principal component analysis, volume 2.Wiley Online Library, 2002.

[14] T. Kliegr, K. Chandramouli, J. Nemrava, V. Svatek,and E. Izquierdo. Combining image captions andvisual analysis for image concept classification. InProceedings of the 9th International Workshop onMultimedia Data Mining, pages 8–17. ACM, 2008.

[15] J. Lafferty and L. Wasserman. Statistical analysis ofsemi-supervised regression. In Advances in NeuralInformation Processing Systems 20, pages 801–808,Cambridge, MA, 2008.

[16] M. Lestari Paramita, M. Sanderson, and P. Clough.Diversity in photo retrieval: overview of theimageclefphoto task 2009. pages 45–59. Springer, 2010.

[17] B. Lin, X. He, C. Zhang, and M. Ji. Parallel vectorfield embedding. The Journal of Machine LearningResearch, 2013.

[18] B. Lin, C. Zhang, and X. He. Semi-supervisedregression via parallel field regularization. In Advancesin Neural Information Processing Systems 24, pages433–441, Cambridge, MA, 2011.

[19] D. Lowe. Distinctive image features fromscale-invariant keypoints. International Journal ofComputer Vision, 60(2):91–110, 2004.

[20] V. Mahadevan, C.-W. Wong, T. T. Liu,N. Vasconcelos, and L. K. Saul. Maximum covarianceunfolding: Manifold learning for bimodal data. InAdvances in Neural Information Processing Systems24, pages 918–926, 2011.

[21] B. Nadler, N. Srebro, and X. Zhou. Statistical analysisof semi-supervised learning: The limit of infiniteunlabelled data. In Advances in Neural InformationProcessing Systems 22, pages 1330–1338, 2009.

[22] P. Petersen. Riemannian Geometry, volume 171.Springer, 2006.

[23] T. Pham, N. Maillot, J. Lim, and J. Chevallet. Latentsemantic fusion model for image retrieval andannotation. In Proceedings of the 16th ACMConference on Information and KnowledgeManagement, pages 439–444. ACM, 2007.

[24] N. Rasiwasia, J. Costa Pereira, E. Coviello, G. Doyle,G. Lanckriet, R. Levy, and N. Vasconcelos. A newapproach to cross-modal multimedia retrieval. In

Proceedings of the 18th ACM International Conferenceon Multimedia, pages 251–260. ACM, 2010.

[25] N. Rasiwasia, P. Moreno, and N. Vasconcelos.Bridging the gap: Query by semantic example. IEEETransactions on Multimedia, 9(5):923–938, 2007.

[26] C. Snoek and M. Worring. Multimodal video indexing:A review of the state-of-the-art. Multimedia Tools andApplications, 25(1):5–35, 2005.

[27] C. Snoek, M. Worring, and A. Smeulders. Early versuslate fusion in semantic video analysis. In Proceedingsof the 13th ACM International Conference onMultimedia, pages 399–402. ACM, 2005.

[28] T. Tsikrika and J. Kludas. Overview of thewikipediamm task at imageclef 2008. pages 539–550.Springer, 2009.

[29] C. Wang and S. Mahadevan. Manifold alignment usingprocrustes analysis. In Proceedings of the 25thInternational Conference on Machine Learning, pages1120–1127. ACM, 2008.

[30] C. Wang and S. Mahadevan. A general framework formanifold alignment. 2009.

[31] C. Wang and S. Mahadevan. Heterogeneous domainadaptation using manifold alignment. volume 2, pages1541–1546. AAAI Press, 2011.

[32] G. Wang, D. Hoiem, and D. Forsyth. Building textfeatures for object image classification. In IEEEConference on Computer Vision and PatternRecognition, pages 1367–1374. IEEE, 2009.

[33] T. Westerveld. Probabilistic multimedia retrieval. InProceedings of the 25th International ACM SIGIRConference on Research and Development inInformation Retrieval, pages 437–438. ACM, 2002.

[34] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang.Ranking with local regression and global alignment forcross media retrieval. In Proceedings of the 17th ACMInternational Conference on Multimedia, pages175–184. ACM, 2009.

[35] Y. Yang, Y. Zhuang, F. Wu, and Y. Pan. Harmonizinghierarchical manifolds for multimedia documentsemantics understanding and cross-media retrieval.IEEE Transactions on Multimedia, 10(3):437–446,2008.

[36] Y. Zhuang, Y. Yang, and F. Wu. Mining semanticcorrelation of heterogeneous multimedia data forcross-media retrieval. IEEE Transactions onMultimedia, 10(2):221–229, 2008.