
THE UNIVERSITY OF CHICAGO

JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET

CONSTRAINTS

A DISSERTATION SUBMITTED TO

THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES

IN CANDIDACY FOR THE DEGREE OF

MASTER OF SCIENCE

DEPARTMENT OF COMPUTER SCIENCE

BY

LIWEN ZHANG

CHICAGO, ILLINOIS

WINTER, 2015


ABSTRACT

The measure of similarity plays a crucial role in many applications such as content-based recommendation, image search, and speech recognition. Similarity between objects is multifaceted, and it is easier to judge similarity when the focus is on a specific aspect. We consider the problem of mapping objects into view-specific embeddings where the distance between them is consistent with similarity comparisons of the form “from the t-th perspective, object A is more similar to B than to C”. We propose a framework to jointly learn multiple views by deploying different similarity metrics in a unified embedding space. Our approach is a natural extension of the view-wise independent approach and is capable of exploiting the correlation between the views when it exists. Experiments on a number of datasets, including a large dataset of multi-view crowdsourced comparisons on bird images, show that the proposed method achieves lower triplet generalization error and better grouping of classes in most cases, compared to learning embeddings independently for each view or learning a single embedding from triplets collected on all views. More specifically, on datasets where the correlation between views is strong, the proposed method achieves a significant improvement, while on datasets with limited view correlation, it still performs no worse than its independent learning counterpart.


ACKNOWLEDGEMENTS

This thesis is based on joint work with Ryota Tomioka, my thesis advisor, and Subhransu Maji, who is now a faculty member at UMass Amherst. I am grateful to them for their collaboration and help on many technical details. I also wish to express my sincere thanks to Ryota for his guidance throughout the completion of this thesis. Thanks go to Min Xu at the Statistics Department for pointing out the idea of using Kendall's τ in the analysis.


TABLE OF CONTENTS

ABSTRACT

ACKNOWLEDGEMENTS

1 INTRODUCTION

2 LEARNING SIMILARITY FROM RELATIVE COMPARISON
  2.1 Learning with Relative Similarity Measure
  2.2 Triplet Embedding

3 JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES
  3.1 Formulation
  3.2 Relation with Learning Multiple Independent Similarities
  3.3 Influence of Parameters
  3.4 Alternative Formulations and Computational Complexity

4 EXPERIMENTS
  4.1 Experimental Setup
  4.2 Datasets
  4.3 Experimental Results
  4.4 Learning a New View (Zero-Shot Learning)
  4.5 Discussion
    4.5.1 Influence of Dimension
    4.5.2 Correlation Between Views

5 RELATED WORK
  5.1 Multitask Metric Learning with Low-Rank Tensors
  5.2 Relation between Triplet Embedding and Metric Learning

6 CONCLUSION AND FUTURE WORK

A VIEWS OF POSES OF AIRPLANES DATASET

B ATTRIBUTES OF PUBLIC FIGURES FACE DATASET

C CORRELATION BETWEEN SIMILARITY METRICS ON MULTIPLE DATASETS

REFERENCES


LIST OF FIGURES

1.1 Illustration of ambiguity in similarity.

4.1 View specific similarities between poses of planes.
4.2 View specific similarities between birds.
4.3 Experimental results on synthetic data with clusters.
4.4 Experimental results on uniformly distributed synthetic data.
4.5 The global view of embeddings of poses of planes.
4.6 Experimental results on poses of planes dataset.
4.7 Illustration of public figures face data embedded under the metric of the first view.
4.8 Experimental results on public figure faces dataset.
4.9 Illustration of CUB-200 birds data.
4.10 Experimental results on CUB-200 birds dataset.
4.11 Learning a new view on CUB-200 birds dataset.
4.12 Relation between triplet consistency and correlation of pairwise distances.
4.13 Performance gain and correlation of views.

A.1 Landmarks illustrated on several planes.


LIST OF TABLES

4.1 Measure of similarity correlation and performance gain of using joint learning. Entries marked with * are values estimated from independent embeddings.

B.1 List of PubFig attributes that were used in this work.

C.1 Consistency of triplet constraints between different views in synthetic data (clustered).
C.2 Consistency of triplet constraints between different views in synthetic data (uniformly distributed).
C.3 Consistency of triplet constraints between different views in poses of planes dataset estimated from independent embedding.
C.4 Consistency of triplet constraints between different views in public figures dataset.
C.5 Consistency of triplet constraints between different views in CUB-200 dataset estimated from independent embedding.
C.6 Distance correlation between different views in synthetic data (clustered).
C.7 Distance correlation between different views in synthetic data (uniformly distributed).
C.8 Distance correlation between different views in poses of planes dataset estimated from independent embedding.
C.9 Distance correlation between different views in public figures dataset.
C.10 Distance correlation between different views in CUB-200 dataset estimated from independent embedding.


CHAPTER 1

INTRODUCTION

The measure of similarity plays an important role in applications such as content-based recommendation, image search, and speech recognition, and various techniques to learn such a measure from data have been proposed (Jain et al., 2008; Weinberger and Saul, 2009; McFee and Lanckriet, 2011). While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observations. Similarity comparisons of the form “object A is more similar to B than to C” are a commonly used form of supervision for learning such embeddings (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012).

Although judging similarity comparisons is easier than rating similarity on an absolute scale (Kendall and Gibbons, 1990), there is sometimes ambiguity in how similarity is measured. Consider the problem of comparing three birds (see Fig. 1.1): the answer to the question “is bird A more similar to B or to C?” could depend on which aspect of the bird one is interested in. Most annotators will say that the head of bird A is more similar to the head of B, while the back of A is more similar to the back of C. In the presence of multiple notions of similarity, it is natural to treat observations from different perspectives separately.

To address this issue, we study the setting where the similarity observation takes the form “from the t-th perspective, A is more similar to B than to C”. These view-specific similarities can be seen as an extension of attributes that can be collected without crisply defining attribute vocabularies, and they can encode continuous and multidimensional structures such as color and pose. Similarity in such cases can be expressed as a view-specific low dimensional embedding, e.g., color as a (red, green, blue) vector. In addition to making the annotation task simpler, multiple perceptual similarities can also enable precise feedback for human “in the loop” tasks, such as content-based image retrieval systems (Cox et al., 2000) and interactive fine-grained recognition (Wah et al., 2014a).

The main drawback of learning view-specific embeddings independently is that one needs similarity comparisons for each view: simply learning a single embedding of N objects may require $O(N^3)$ triplet comparisons, and this cost is incurred for every view. Such comparisons can be expensive to obtain, especially when experts are involved.


Figure 1.1: Ambiguity in similarity. A is more similar to C than to B when focusing on the back (middle row), but is more similar to B than to C when focusing on the head (bottom row).


We propose a framework to learn the embeddings jointly that addresses this drawback. The key intuition is that while different notions of similarity may produce contradicting triplet relations between some instances, they could be related for most of them. For instance, photos of a car taken from different angles are correlated, as they can be considered projections of a 3D model onto different planes. In a symphony, the music played by the individual instruments in the orchestra could be related, as they may share some of the melody or rhythm. Our method exploits such correlation between views by constructing a unified embedding space in which the different notions of similarity are treated as its subspaces.

We perform experiments on a synthetic dataset and three real datasets from different domains, namely poses of airplanes, features from face images (PubFig dataset, Kumar et al., 2009), and crowd-sourced similarities collected on different body parts of birds (CUB dataset, Welinder et al., 2010). For a given amount of training data per view, the proposed joint learning approach obtains lower triplet generalization error than the naive independent learning approach on most datasets. The proposed joint learning approach also tends to obtain better cluster structures at the category level, measured through the leave-one-out classification error. On datasets where the different similarity metrics are highly related, the joint embeddings are significantly better.

The rest of this thesis is structured as follows: in Chapter 2, we review the literature on learning perceptual similarity from relative comparisons. We formulate the problem of multiple-metric triplet embedding as an extension of learning a single similarity metric in Chapter 3; our algorithm for solving it is presented in the same chapter. Experiments on both synthetic and real datasets are presented in Chapter 4. Chapter 5 lists other work related to our problem. Discussion and future directions are presented in Chapter 6.


CHAPTER 2

LEARNING SIMILARITY FROM RELATIVE COMPARISON

While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observations.

Suppose there are N objects represented as D dimensional vectors {x_1, x_2, ..., x_N} ⊂ R^D, and we would like to learn a distance mapping d : R^D × R^D → R_{≥0} which satisfies the following properties:

1. d(x_i, x_j) ≥ 0 (non-negativity).

2. d(x_i, x_j) = d(x_j, x_i) (symmetry).

3. d(x_i, x_j) + d(x_j, x_k) ≥ d(x_i, x_k) (triangle inequality).

Notice that d here isn't necessarily a metric, because it doesn't have to satisfy the following property:

• d(x_i, x_j) = 0 if and only if x_i = x_j (distinguishability).

A distance mapping that satisfies conditions (1), (2), and (3) above, but not distinguishability, is called a pseudometric. Throughout this work, we consider the learning of pseudometrics.

A family of metrics commonly used in the literature is the Mahalanobis metrics, which define the distance as

$$d(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) \qquad (2.1)$$

where M is a D × D positive semi-definite matrix. In a learning problem, when the x_i's are given, we may learn a positive semi-definite matrix M. It is common to think of this as learning a linear transformation L and letting d(x_i, x_j) = (x_i − x_j)^⊤ L^⊤ L (x_i − x_j), since L^⊤L is guaranteed to be positive semi-definite. When the x_i's are not given, we may take x_i = e_i ∈ R^N and learn a rank-D positive semi-definite matrix M ∈ R^{N×N}. Because M can be decomposed as M = L^⊤L for some L ∈ R^{D×N}, and Lx_i is simply the i-th column of L in this case, learning a rank-D metric M while fixing the x_i's is equivalent to taking M to be the identity matrix I_D and learning the embedding x_i's in R^D directly.
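As a small numerical illustration of this equivalence (my own Python sketch, not part of the thesis; the function name mahalanobis_sq is an assumption), the Mahalanobis distance with M = L^⊤L equals the squared Euclidean distance between the linearly transformed points:

import numpy as np

def mahalanobis_sq(xi, xj, M):
    # Squared Mahalanobis distance (2.1) for a PSD matrix M.
    diff = xi - xj
    return diff @ M @ diff

rng = np.random.default_rng(0)
L = rng.normal(size=(3, 5))                      # a linear map, so M = L^T L
xi, xj = rng.normal(size=5), rng.normal(size=5)
assert np.isclose(mahalanobis_sq(xi, xj, L.T @ L),
                  ((L @ xi - L @ xj) ** 2).sum())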


2.1 Learning with Relative Similarity Measure

Similarity knowledge can be described in different forms. In this work, we focus on problems where the similarity observation takes the form of a similarity triplet, “A is more similar to B than to C”, which is a special case of the similarity quadruple “A is more similar to B than C is to D”. In this section, we review different approaches that have been taken to learn similarity from this kind of pairwise comparison. A formal description of learning from similarity triplets is presented in the next section.

Different techniques have been studied for embedding data points based on similarity triplets. Agarwal et al. (2007) seek an embedding in which inter-point Euclidean distances have the same ordering as a given set of dissimilarities. Tamuz et al. (2011) proposed a method to learn an embedding from crowd-sourced data alone, where the training set consists of triplet relations of the form “object A is more similar to B than to C”. van der Maaten and Weinberger (2012) proposed a technique called t-Distributed Stochastic Triplet Embedding (t-STE) which fulfills a similar task but uses a Student-t kernel to model the similarity, so that the resulting embedding is more compact and clustered.

In the work of McFee and Lanckriet (2009), the authors consider the problem of learning an embedding subject to a set of constraints of the form “objects i and j are more similar than k and l”. They encode such ordinal constraints in a graph and exploit the global structure of the graph to obtain an efficient embedding algorithm.

McFee and Lanckriet (2011) formulated a novel multiple kernel learning technique for integrating heterogeneous data into a single unified similarity space. They also handle similarity constraints in the form of pairwise comparisons and use them as side information.

Pairwise similarity measurements can sometimes be inferred from other knowledge. For example, supervision in the form of class labels can be treated as similarity information, as it is natural to consider two objects belonging to the same class to be more similar than two objects from different classes. For instance, Cui et al. (2013) developed a pairwise-constrained multiple metric learning algorithm for face recognition in which similarity constraints are derived from class labels. Parameswaran and Weinberger (2010) consider the case of supervised learning and train multiple k-NN classifiers jointly by learning multiple metrics in the feature spaces. They use triplet similarity constraints inferred from supervision information and showed that the jointly learned metrics achieve higher classification accuracies than their single-metric counterpart.


2.2 Triplet Embedding

In this section, we assume that the embeddings are learned purely from similarity triplets. More specifically, given a set of triplets S = {(i, j, k) | z_i is more similar to z_j than to z_k}, we aim to find an embedding x_1, x_2, ..., x_N ⊂ R^D, for some D ≪ N, such that the pairwise comparison of Euclidean distances agrees with S, i.e., (i, j, k) ∈ S implies ‖x_i − x_j‖² < ‖x_i − x_k‖². This is the setting studied by van der Maaten and Weinberger (2012) and Tamuz et al. (2011). It is also a special case of the paired comparison setting proposed in Agarwal et al. (2007), which originates from the work of Shepard (1962a,b) and Kruskal (1964a,b). We give a brief review of these methods here.

Generalized Non-Metric Multidimensional Scaling (GNMDS)

GNMDS (Agarwal et al., 2007) seeks a low-rank kernel matrix K = X^⊤X such that the distance constraints are satisfied with a large margin. The trace norm of K is minimized in order to approximately minimize its rank. Although GNMDS was proposed for solving problems where similarity constraints are posed as “i and j are more alike than k and l”, it can easily be adapted to triplet constraints, as illustrated in van der Maaten and Weinberger (2012), and solve a problem of the form:

$$\begin{aligned} \min_{K} \quad & \operatorname{tr}(K) + C \sum_{(i,j,l) \in S} \xi_{ijl}, \\ \text{s.t.} \quad & k_{jj} - 2k_{ij} - k_{ll} + 2k_{il} \le -1 + \xi_{ijl}, \\ & \xi_{ijl} \ge 0, \\ & K \succeq 0, \end{aligned}$$

where C is a regularization parameter and the ξ_ijl's are slack variables. After the optimal K is learned, the embedding X is obtained via an SVD of K. Notice that this optimization problem is equivalent to using a hinge loss to penalize the violation of the similarity constraints:

$$\begin{aligned} \min_{K} \quad & \operatorname{tr}(K) + C \sum_{(i,j,l) \in S} \max\left\{0,\; 1 + k_{jj} - 2k_{ij} - k_{ll} + 2k_{il}\right\}, \\ \text{s.t.} \quad & K \succeq 0. \end{aligned}$$
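For concreteness, the hinge-loss form above can be sketched in a few lines of Python (my own code, not the authors' implementation; cvxpy, the name gnmds_kernel, and the defaults C and dim are assumptions):

import cvxpy as cp
import numpy as np

def gnmds_kernel(triplets, n_objects, C=1.0, dim=2):
    # Learn a PSD Gram matrix K from triplets (i, j, l) meaning
    # "i is more similar to j than to l", via the hinge-loss objective.
    K = cp.Variable((n_objects, n_objects), PSD=True)
    loss = sum(cp.pos(1 + K[j, j] - 2 * K[i, j] - K[l, l] + 2 * K[i, l])
               for (i, j, l) in triplets)
    cp.Problem(cp.Minimize(cp.trace(K) + C * loss)).solve()
    w, V = np.linalg.eigh(K.value)               # recover X from K = X^T X
    top = np.argsort(w)[-dim:]                   # keep the leading eigenpairs
    return V[:, top] * np.sqrt(np.maximum(w[top], 0))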


Crowd Kernel Learning (CKL)

Tamuz et al. (2011) proposed the Crowd Kernel Learning method to deal with training triplets (i, j, l) collected from annotators without domain expertise. They consider receiving the feedback “i is more similar to j than to l” from a random crowd member as a random event which occurs with a certain probability. In the learning stage, the goal is to minimize the empirical log-loss log(1/p_ijl), where a higher probability p_ijl means the triplet (i, j, l) is modelled better. They adopt a scale-invariant model and define the probability p_ijl as

$$p_{ijl} = \frac{\|x_i - x_l\|_2^2 + \mu}{\|x_i - x_j\|_2^2 + \|x_i - x_l\|_2^2 + 2\mu} = \frac{k_{ii} + k_{ll} - 2k_{il} + \mu}{(k_{ii} + k_{jj} - 2k_{ij}) + (k_{ii} + k_{ll} - 2k_{il}) + 2\mu}$$

where μ is added to avoid numerical instability and the k_ij's are entries of the kernel matrix K. Their optimization problem can be written as:

$$\begin{aligned} \min_{K} \quad & -\frac{1}{|S|} \sum_{(i,j,l) \in S} \log p_{ijl}, \\ \text{s.t.} \quad & k_{ii} = 1, \quad i = 1, 2, \dots, N, \\ & K \succeq 0. \end{aligned}$$

Stochastic Triplet Embedding

Based on the idea of CKL, van der Maaten and Weinberger (2012) developed the method of Stochastic Triplet Embedding. They proposed two ways to define the probability p_ijl:

1. STE:

$$p_{ijl} = \frac{\exp\left(-\|x_i - x_j\|_2^2\right)}{\exp\left(-\|x_i - x_j\|_2^2\right) + \exp\left(-\|x_i - x_l\|_2^2\right)}$$

2. t-STE:

$$p_{ijl} = \frac{\left(1 + \frac{\|x_i - x_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}{\left(1 + \frac{\|x_i - x_j\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}} + \left(1 + \frac{\|x_i - x_l\|_2^2}{\alpha}\right)^{-\frac{\alpha+1}{2}}}$$
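In code, the two probability models read as follows (a minimal numpy sketch of the formulas above; the function names are my own):

import numpy as np

def ste_prob(xi, xj, xl):
    # STE: Gaussian kernel on squared Euclidean distances.
    a = np.exp(-((xi - xj) ** 2).sum())
    b = np.exp(-((xi - xl) ** 2).sum())
    return a / (a + b)

def tste_prob(xi, xj, xl, alpha=1.0):
    # t-STE: heavy-tailed Student-t kernel with alpha degrees of freedom.
    kj = (1 + ((xi - xj) ** 2).sum() / alpha) ** (-(alpha + 1) / 2)
    kl = (1 + ((xi - xl) ** 2).sum() / alpha) ** (-(alpha + 1) / 2)
    return kj / (kj + kl)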

They argued that, during the iterations for numerically solving the optimization problem, STE and t-STE tend to penalize violated triplets and reward satisfied triplets more locally, and the influence of a triplet becomes very small when its constraint is strongly violated. Thus, these methods are much better at revealing the underlying data structure, and their experiments showed that STE and t-STE produce more clustered embeddings. See van der Maaten and Weinberger (2012) for details.

From the methods reviewed above, we observe that a commonly employed strategy is to define a loss function ℓ that measures how well the embedding models a triplet (i, j, k), and to solve a minimization problem of the form

$$\min_{X \in \mathbb{R}^{N \times D}} \; C \cdot \sum_{(i,j,k) \in S} \ell\left(\|x_i - x_j\|^2, \|x_i - x_k\|^2\right) + \|X\|_F^2, \qquad (2.2)$$

where ‖·‖_F is the Frobenius norm and C > 0 is a regularization parameter.

Since the squared Euclidean distance ‖x_i − x_j‖² can be expressed as ‖x_i − x_j‖² = k_ii − 2k_ij + k_jj using a positive semi-definite Gram matrix K ⪰ 0, minimization problem (2.2) can be equivalently rewritten as follows:

$$\begin{aligned} \min_{K \in \mathbb{R}^{N \times N}} \quad & C \cdot \sum_{(i,j,k) \in S} \ell\left(d_K(i,j), d_K(i,k)\right) + \operatorname{tr}(K), \qquad (2.3) \\ \text{s.t.} \quad & K \succeq 0, \end{aligned}$$

where d_K(i, j) = k_ii − 2k_ij + k_jj. Penalizing the trace of the Gram matrix can be seen as a convex surrogate for penalizing its rank (Agarwal et al., 2007; Fazel et al., 2001). After the optimal K is learned, we recover the embedding X = [x_1, ..., x_N] by the decomposition K = X^⊤X.

Since optimization problems (2.2) and (2.3) are equivalent up to rotation, we will interchangeably say “learning the embedding X” and “learning the metric K”. However, optimization problem (2.3) is convex if the loss function ℓ is convex, while this is not true for (2.2).


CHAPTER 3

JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES

While previous studies deal with the situation where there is one underlying similarity measure, we consider the case where objects are compared from several different aspects, i.e., there exist multiple measures of similarity.

The most closely related work is that of McFee and Lanckriet (2011), Parameswaran and Weinberger (2010), and Cui et al. (2013). However, the problem we address has a more general setting and differs from the work mentioned above in two respects: 1) existing work that deals with similarity triplets learns a single similarity metric, whereas we consider the case where there exist multiple similarity measures and aim to learn embeddings under those measures simultaneously; 2) the algorithms given by Parameswaran and Weinberger (2010) and Cui et al. (2013) require supervision and target the minimization of classification error, whereas we consider an embedding task that is purely based on constraints in the form of similarity triplets.

Finally, our approach is complementary to methods that seek to minimize user effort by actively collecting triplet comparisons (Jamieson and Nowak, 2011; Tamuz et al., 2011) and to better interfaces for triplet collection (Wilber et al., 2014).

In this chapter, we first give a formal formulation of the problem of jointly learning multiple similarities and introduce our approach to solving it. We then discuss issues such as the influence of the parameters and the computational complexity.

3.1 Formulation

Assume that T sets of triplets S_1, ..., S_T are obtained by asking labelers to focus on a specific aspect when making pairwise comparisons. In addition, a set of triplets S_0 corresponding to a global (unspecific) notion of similarity may be available. Our goal is to obtain an embedding for the global notion of similarity as well as embeddings corresponding to the local views. To this end, we take a hybrid approach that combines (2.2) and (2.3): we use the explicit parameterization (2.2) for the global embedding of each object and the Gram-matrix-based parameterization (2.3) for modeling each view.


Let x_1, ..., x_N ∈ R^D be vectors corresponding to the global embedding of the objects as above. Each local view is associated with a Gram matrix M_t, and the underlying metric is defined as a (squared) Mahalanobis distance d_{M_t}(x_i, x_j) := (x_i − x_j)^⊤ M_t (x_i − x_j). We formulate the learning problem as follows:

$$\begin{aligned} \min_{x_1,\dots,x_N,\; M_1,\dots,M_T} \quad & \sum_{(i,j,k) \in S_0} \ell\left(\|x_i - x_j\|^2, \|x_i - x_k\|^2\right) + \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)\right) \\ & + \sum_{t=1}^{T} \gamma \operatorname{tr}(M_t) + \beta \|X\|_F^2, \qquad (3.1) \\ \text{s.t.} \quad & M_t \succeq 0 \quad (t = 1, \dots, T), \end{aligned}$$

where ℓ is a loss function that measures how well the embedding models a triplet. We use the hinge loss in our experiments, but the proposed framework readily generalizes to other loss functions proposed in the literature, such as those of Tamuz et al. (2011) and van der Maaten and Weinberger (2012).

We employ regularization terms for both the local metrics M_t and the global embedding vectors x_i in (3.1). The trace terms tr(M_t) are added to the objective because the trace norm is known to be effective in producing low-rank solutions (Agarwal et al., 2007; Fazel et al., 2001). The norm of the x_i's is necessary because, without this term, we could reduce the trace of M_t by scaling down M_t while scaling up the x_i's simultaneously, which would significantly weaken the effect of the trace term.

Although objective function (3.1) is non-convex, if we fix the values of the x_i's and choose a convex loss function, e.g., the hinge loss, then the problem becomes convex with respect to the M_t's, and the M_t's can be learned independently since they appear in disjoint terms.

To minimize the objective, we update the M_t's and the x_i's alternately. When the M_t's are fixed, the x_i's are updated directly via gradient descent (Algorithm 1). When the x_i's are fixed, the M_t's can be updated independently via projected gradient descent, i.e., by iteratively taking a sub-gradient step and projecting the resulting M_t onto the positive semi-definite cone (Algorithm 2). The number of dimensions D is left as a hyperparameter. The overall algorithm is summarized in Algorithm 3.


Algorithm 1: Update x_i's
Input: x_i, i = 1, 2, ..., N; M_t, t = 1, 2, ..., T; β.
Output: x_i, i = 1, 2, ..., N.
Initialization: choose a proper initial step size η_0; set M_0 := I_D; set η ← η_0; set iteration counter m = 1.
while not converged do
    For n = 1, ..., N, let
        g_n = 2βx_n + Σ_{t=0}^{T} Σ_{(i,j,k)∈S_t : n∈{i,j,k}} ∇_{x_n} ℓ(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k));
    Update x_n ← x_n − ηg_n (n = 1, ..., N);
    Update counter: m ← m + 1;
    Update step size: η ← η_0/√m;
end

Algorithm 2: Update M_t
Input: x_i, i = 1, 2, ..., N; M_t; γ.
Output: M_t.
Initialization: choose a proper initial step size η_0; set η ← η_0; set iteration counter m = 1.
while not converged do
    Take a step in the direction of the negative sub-gradient:
        G ← Σ_{(i,j,k)∈S_t} ∇_{M_t} ℓ(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)) + γI_D,
        M_t ← M_t − ηG;
    Find the eigenvalue decomposition M_t = VΛV^⊤;
    Project M_t onto the PSD cone:
        M_t ← V max(Λ, 0) V^⊤;
    Update counter: m ← m + 1;
    Update step size: η ← η_0/√m;
end


Algorithm 3: Multiple-metric Learning
Input: N, the number of objects; D, the dimension of the embedding; S_t, t = 0, 1, 2, ..., T, triplet constraints; regularization parameters β, γ.
Output: embedding {x_i}_{i=1}^N, PSD matrices {M_t}_{t=1}^T.
Initialization: initialize the x_i's randomly; initialize the M_t's as identity matrices.
while not converged do
    Update the x_i's by using Algorithm 1;
    for t ∈ {1, 2, ..., T} do
        Update M_t by using Algorithm 2;
    end
end
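As an illustration of the M_t update (Algorithm 2) under the hinge loss ℓ(a, b) = max{0, 1 + a − b}, the following Python sketch performs projected sub-gradient descent. It is my own code, not the thesis implementation; numpy, the names project_psd and update_metric, and the fixed iteration count are assumptions. Algorithm 3 would alternate this with the x_i update of Algorithm 1.

import numpy as np

def project_psd(M):
    # Eigendecompose and clip negative eigenvalues at zero, i.e. the
    # projection onto the PSD cone used in Algorithm 2.
    w, V = np.linalg.eigh(M)
    return (V * np.maximum(w, 0)) @ V.T

def update_metric(X, M, triplets, gamma, eta0=0.1, n_iters=100):
    # X: (N, D) global embedding; M: (D, D) metric of one view;
    # triplets: list of (i, j, k) with "i more similar to j than to k".
    for m in range(1, n_iters + 1):
        G = gamma * np.eye(M.shape[0])           # gradient of gamma * tr(M)
        for (i, j, k) in triplets:
            u, v = X[i] - X[j], X[i] - X[k]
            # The hinge is active iff 1 + d_M(x_i, x_j) - d_M(x_i, x_k) > 0.
            if 1 + u @ M @ u - v @ M @ v > 0:
                G += np.outer(u, u) - np.outer(v, v)
        M = project_psd(M - (eta0 / np.sqrt(m)) * G)
    return M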

3.2 Relation with Learning Multiple Independent Similarities

Consider the problem of embedding N objects into D dimensional spaces from T sets of similarity triplets S_1, S_2, ..., S_T. By independent learning or independent embedding, we mean independently learning T embeddings X_t, t = 1, 2, ..., T, where each X_t is learned from S_t only. In contrast, our proposed model learns a global embedding X and T view-specific metrics M_t, and all the S_t's are used simultaneously. A simple parameter-counting argument tells us that independent learning requires fitting O(NDT) parameters, whereas our joint learning model has O(ND + D²T) parameters, which is fewer when D < N.

By comparing the families of embeddings that can be represented by these two models, it is not difficult to see that independently embedding into T spaces, each of dimension D_ind, is a special case of jointly embedding in a TD_ind dimensional space. Analytically, solving optimization problem (2.2) independently on T views can be expressed as

$$\min_{x_1,\dots,x_N \in \mathbb{R}^{T D_{\mathrm{ind}}}} \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)\right) + \beta \|X\|_F^2,$$

where

$$M_t(p, q) = \begin{cases} 1, & p = q \in \{(t-1)D_{\mathrm{ind}}+1, (t-1)D_{\mathrm{ind}}+2, \dots, tD_{\mathrm{ind}}\}, \\ 0, & \text{otherwise}, \end{cases}$$

which is a special case of the joint learning problem (3.1).

The same conclusion can be reached from the other direction. Consider a multiview embedding represented by the joint model, parameterized by X ∈ R^{D×N} and M_t ∈ R^{D×D}, t = 1, 2, ..., T. If we constrain the local metrics M_t to be pairwise orthogonal, i.e., ⟨M_i, M_j⟩ = 0 for all i ≠ j, then each M_t is associated with a subspace W_t ⊂ R^D, and the W_t's associated with different M_t's are mutually orthogonal. Any x ∈ R^D can be decomposed as x = Σ_{t=1}^T y_t + z, where y_t ∈ W_t and z ∈ (⊕_t W_t)^⊥. In this special case, finding a global embedding X reduces to seeking embeddings in the W_t's independently.

In conclusion, jointly embedding in a D dimensional space subsumes independent learning in T spaces with dimensions D_t as a special case, provided Σ_{t=1}^T D_t ≤ D.

3.3 Influence of Parameters

The objective function in (3.1) has two parameters, β and γ, in the regularization term. For all the loss functions mentioned in Sec. 2.2, the value of the loss depends only on the distances d_{M_t}(x_i, x_j) and d_{M_t}(x_i, x_k). For any scalar α > 0, by the definition of the Mahalanobis distance (2.1), it is easy to see that d_{M_t/α²}(αx_i, αx_j) = d_{M_t}(x_i, x_j) for all x_i and x_j. This implies that, if we substitute αx_i for x_i and M_t/α² for M_t in the objective function of (3.1), we can alter the value of the regularization term while keeping the same loss. Meanwhile, by the arithmetic mean-geometric mean inequality, the value of the regularization term is lower bounded by

$$\beta \alpha^2 \|X\|_F^2 + \sum_{t=1}^{T} \frac{\gamma}{\alpha^2} \operatorname{tr}(M_t) \;\ge\; 2\sqrt{\beta\gamma \|X\|_F^2 \sum_{t=1}^{T} \operatorname{tr}(M_t)}, \qquad (3.2)$$

and the lower bound is attained when α takes the value

$$\alpha = \left(\frac{\gamma \sum_{t=1}^{T} \operatorname{tr}(M_t)}{\beta \|X\|_F^2}\right)^{1/4}.$$

On the other hand, if the M_t's and X are the optimal solution of (3.1), they must attain the lower bound above. (Otherwise, we could scale them by the α that gives the lower bound and reduce the value of the objective function further.) Thereby, although we have two hyperparameters β and γ, the effective regularization depends only on the product βγ. Since Algorithm 3 updates the x_i's and M_t's alternately, the lower bound in (3.2) may not be attained numerically. As a heuristic, we can add an optional step to the learning algorithm: after one update of the x_i's and M_t's, we scale them simultaneously to attain the lower bound on the regularization term. See Algorithm 4.


Algorithm 4: Multiple-metric Learning with Scaling
Input: N, the number of objects; D, the dimension of the embedding; S_t, t = 0, 1, 2, ..., T, triplet constraints; regularization parameters β, γ.
Output: embedding {x_i}_{i=1}^N, PSD matrices {M_t}_{t=1}^T.
Initialization: initialize the x_i's randomly; initialize the M_t's as identity matrices.
while not converged do
    Update the x_i's by using Algorithm 1;
    for t ∈ {1, 2, ..., T} do
        Update M_t by using Algorithm 2;
    end
    Scale the x_i's and M_t's:
        α ← (γ Σ_{t=1}^T tr(M_t) / (β‖X‖_F²))^{1/4},
        X ← αX,
        M_t ← M_t/α² (t = 1, ..., T);
end
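The scaling step at the end of each outer iteration takes only a few lines (a sketch; the function name rescale is my own):

import numpy as np

def rescale(X, Ms, beta, gamma):
    # Jointly scale X and the M_t's to attain the lower bound (3.2)
    # on the regularization term; every loss term is left unchanged.
    alpha = (gamma * sum(np.trace(M) for M in Ms)
             / (beta * (X ** 2).sum())) ** 0.25
    return alpha * X, [M / alpha ** 2 for M in Ms]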

3.4 Alternative Formulations and Computational Complexity

The joint learning model (3.1) parameterizes the similarity metrics with a single global embedding of the x_i's and multiple local metrics M_t. Alternatively, we could model the global embedding implicitly using a positive semi-definite kernel matrix K, which can be considered a Gram matrix: K = Φ^⊤Φ for some Φ ∈ R^{D×N}, which could be regarded as the objects' representations in a D dimensional space. We can then model the embedding of the t-th view with a linear transformation L_t ∈ R^{D×D}, so that L_tΦ is the embedding of the t-th view. If we choose the regularization for the L_t's appropriately, by the generalized representer theorem (Scholkopf et al., 2001), L_t can be expressed as L_t = W_tΦ^⊤ for some W_t ∈ R^{D×N}. Now the Gram matrix on the t-th view becomes

$$K_t = \left((W_t \Phi^\top)\Phi\right)^\top (W_t \Phi^\top)\Phi = K W_t^\top W_t K,$$

and the distance between objects i and j can be expressed as

$$d_t(i, j) = K_t(i, i) - 2K_t(i, j) + K_t(j, j) = \left(K(:, i) - K(:, j)\right)^\top W_t^\top W_t \left(K(:, i) - K(:, j)\right).$$

The problem of learning multiple similarity metrics can then be formulated as


$$\begin{aligned} \min_{W_1,\dots,W_T,\; K} \quad & \sum_{(i,j,k) \in S_0} \ell\left(\|K(i,:) - K(j,:)\|^2, \|K(i,:) - K(k,:)\|^2\right) + \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left(d_t(i,j), d_t(i,k)\right) \\ & + \sum_{t=1}^{T} \gamma \|W_t\|_F^2 + \beta \|K\|_F^2, \\ \text{s.t.} \quad & K \succeq 0. \end{aligned}$$

In this formulation, the objective function is convex neither in K nor in the W_t's. Since the W_t's appear only as inner products, we can substitute M_t = W_t^⊤W_t, but M_t will then have size N × N. Either way, solving the optimization problem involves an eigendecomposition of the N-by-N matrix K, which can be computationally expensive.

In contrast, formulation (3.1) is convex in the M_t's and involves eigendecompositions of the D-by-D matrices M_t, which is less expensive when D < N. However, compared with independent learning, solving the joint learning problem (3.1) with Algorithm 4 is empirically much slower, presumably because it involves alternating updates of the M_t's and x_i's.


CHAPTER 4

EXPERIMENTS

In this chapter, we test our algorithm on both synthetic data and real datasets. One of the real datasets consists of images of airplanes with different poses, where the similarity between two poses is defined rigorously. Another real dataset uses triplets sampled from a few kernel matrices computed from attributes of facial images. The third dataset contains images of 200 species of birds, where similarity triplets among the birds are crowd-sourced. We first give a brief explanation of the design of the experiments and introduce the datasets. Experimental results follow.

4.1 Experimental Setup

On each dataset, we inspect the quality of embeddings learned from training sets of increasing sizes. We draw roughly equal numbers of training triplets from all views and check the quality of the embedding. In addition, we inspect how similarity knowledge on existing views can be “transferred” to a “new” view where the number of similarity comparisons is small. We do this by conducting an experiment in which we draw a small set of training triplets from one view but use large numbers of training triplets from the remaining views. The quality of the embeddings is measured from the following aspects:

1. Triplet generalization error. We split the triplets randomly into a training set and a test set. Triplet generalization error is defined as the percentage of test triplets whose triplet relations are not correctly modelled by the learned embedding.

2. Leave-one-out classification error. We hold out all information about the objects' class labels during the training stage; only triplet constraints are used for learning the embedding. At the test stage, we choose one embedded object as the target and predict its label from the labels of its neighbours, making this prediction for every object in turn. The leave-one-out classification error is the percentage of objects whose labels are not correctly predicted. Throughout the experiments, we use a 3-nearest-neighbour classifier to test classification error, as in the sketch below.
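The evaluation can be sketched as follows (my own Python code; the function name loo_3nn_error is an assumption):

import numpy as np

def loo_3nn_error(X, labels):
    # X: (N, D) learned embedding; labels: (N,) non-negative integer classes.
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # pairwise sq. distances
    np.fill_diagonal(D2, np.inf)                         # exclude the target itself
    errors = 0
    for i in range(len(X)):
        nn = np.argsort(D2[i])[:3]                       # three nearest neighbours
        errors += labels[i] != np.bincount(labels[nn]).argmax()
    return errors / len(X)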


van der Maaten and Weinberger (2012) showed that reducing the triplet generalization error often leads to a small nearest-neighbour classification error, although a small triplet generalization error doesn't necessarily guarantee a low classification error. We also inspect embeddings obtained by simultaneously learning multiple metrics from this aspect.

Throughout the experiments, we use the hinge loss as the loss function. Our joint learning method is compared with two baselines, independent learning and pooled single-view embedding. In independent learning, we conduct triplet embedding on every view as if they were independent tasks. In pooled single-view embedding, triplets collected from all views are merged to learn a single embedding, as if they came from the same similarity metric. When learning a single embedding, we adopt problem (2.2) as the objective and solve it using the package provided by the authors of van der Maaten and Weinberger (2012). To tune the regularization parameters, we further split the training sets for a 5-fold cross-validation and swept over {10^-5, 10^-4, ..., 10^5} for all parameters.

4.2 Datasets

Here are the details of the datasets:

Synthetic Data

Two synthetic datasets are generated. One consists of 200 points uniformly sampled from a 20 dimensional unit hypercube, while the other is generated in a 10 dimensional space and has 200 objects drawn from a mixture of four Gaussians, each of which has variance 1 and is randomly centered in a hypercube with side length 10. Six views are generated for each dataset. Every view is produced by projecting the data points onto a random five dimensional subspace. Training and test triplets (i, j, k) are randomly sampled from all possible triplets on every view.
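The construction can be sketched as follows (my own code; details such as drawing a random orthonormal subspace via a QR factorization are assumptions not specified in the text):

import numpy as np

rng = np.random.default_rng(0)
N, D, T, d = 200, 20, 6, 5                        # uniform dataset dimensions

X = rng.uniform(size=(N, D))                      # 200 points in the unit hypercube
views = []
for _ in range(T):
    P, _ = np.linalg.qr(rng.normal(size=(D, d)))  # random 5-dim orthonormal basis
    views.append(X @ P)                           # view-specific coordinates

def sample_triplets(Y, n):
    # Sample n triplets (i, j, k) with i closer to j than to k under view Y.
    out = []
    while len(out) < n:
        i, j, k = rng.choice(len(Y), size=3, replace=False)
        dij = ((Y[i] - Y[j]) ** 2).sum()
        dik = ((Y[i] - Y[k]) ** 2).sum()
        out.append((i, j, k) if dij < dik else (i, k, j))
    return out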

Poses of Airplanes

This dataset is constructed from 200 images of airplanes from the PASCAL VOC dataset (Everingham et al., 2010), which are annotated with 16 landmarks such as the nose tip, wing tips, etc. (Bourdev et al., 2010). We use these landmarks to construct a pose-based similarity. Given two planes and the positions of the landmarks in their images, pose similarity is defined as the residual error of alignment between the two sets of landmarks under scaling and translation, as in the sketch below.

Figure 4.1: View specific similarities between poses of planes are obtained by considering subsets of landmarks, shown by different colored rectangles, and measuring their similarity in configuration up to scaling and translation. For example, view 1 consists of all landmarks.

We generated 5 views, each of which is associated with a subset of these landmarks, as seen in Fig. 4.1, which shows three annotated images from the set. The planes are highly diverse, ranging from passenger planes to fighter jets and varying in size and form, which results in a slightly different similarity between instances for each view. However, there is a strong correlation between the views because the underlying set of landmarks is shared.

Additionally, we categorize the planes into five classes, “left-facing”, “right-facing”, “pointing-up”, “pointing-down”, and “facing out or facing away”, to evaluate classification error. This produces classes with unbalanced numbers of members: about 80% of the images belong to one of the three classes “left-facing”, “right-facing”, and “facing out or facing away”.

Public Figures Face Data

The Public Figures Face Database was created by Kumar et al. (2009). It consists of 58,797 images of 200 people. Every image is characterized by 75 real-valued attributes which describe the appearance of the person in the image. We selected 39 of the attributes and categorized them into 5 groups according to the aspects they describe: hair, age, accessory, shape, and ethnicity. We randomly selected ten people and drew 20 images for each of them to create a dataset with 200 images. Similarity between instances for a given group is equal to the dot product between their attribute vectors, where the attributes are restricted to those in the group. We describe the details of these attributes in the Appendix. Each group is considered a local view, and the identities of the people in the images are considered class labels.

Figure 4.2: Perceptual similarities between bird species were collected by showing users either the full image (view 1) or crops around various parts (views 2, 3, ..., 6). Images were taken from the CUB dataset (Welinder et al., 2010) containing 200 species of birds.

CUB-200 Birds Data

We use the birds dataset introduced by Welinder et al. (2010), which contains images of birds categorized into 200 species. We consider the problem of embedding these species based on similarity feedback from human users. Similarity triplets among species were collected in a crowd-sourced manner: each time, a user is asked to judge the similarity between an image of a bird from the target species z_i and nine images of birds of different species {z_k}_{k∈K}, using the interface of Wilber et al. (2014), where K is the set of all 200 species. For each display, the user partitions the nine images into two sets, K_sim and K_dissim, with K_sim containing the birds considered similar to the target and K_dissim the ones considered dissimilar. Such a partition is expanded into an equivalent set of triplet constraints on the associated species, {(i, j, l) | j ∈ K_sim, l ∈ K_dissim}. Therefore, for each user response, |K_sim| · |K_dissim| triplet constraints are acquired.
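In code, the expansion of one response into triplets is simply (a sketch; the name is mine):

def partition_to_triplets(i, K_sim, K_dissim):
    # Expand one user response on target species i into the triplet set
    # {(i, j, l) | j in K_sim, l in K_dissim}.
    return [(i, j, l) for j in K_sim for l in K_dissim]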

In the setting of multiple-metric embedding, different views of the birds are obtained by cropping regions around various parts of the bird, as shown in Fig. 4.2, and then using the same procedure as before to collect triplet comparisons. In this dataset, there are about 100,000 triplets from comparisons made on the whole birds, while there are 5 other views, each of which is cast on a localized region of the birds (e.g., beak, breast, wing). The number of triplets obtained from these views ranges from about 4,000 to 7,000. For testing classification error, we use a taxonomy of the bird species provided by Welinder et al. (2010). To balance the number of objects across classes, we manually grouped some of the classes to obtain 6 super-classes in total.

We note that these global and local similarity triplets were used by Wah et al. (2014a) and Wah et al. (2014b) as a way to incorporate human feedback during recognition. However, our focus is to learn better embeddings of the data itself by combining information across different views.

4.3 Experimental Results

Synthetic data

We first conduct experiments on the synthetic datasets. The synthetic data with clusters is embedded in 5 and 10 dimensional spaces; the uniformly distributed synthetic data is embedded in 10 and 20 dimensional spaces. Triplet generalization errors and leave-one-out 3-nearest-neighbour classification errors are plotted in Fig. 4.3 and Fig. 4.4. The small plots on the right illustrate the view-wise test errors, while the big plot on the left shows the average across all views.

The dataset with clusters was sampled from a 10 dimensional space, and each local view lies in a 5 dimensional space. Our algorithm achieves a significant improvement on small training samples, and it continues to perform better than the baselines, except that independent learning obtains a lower triplet generalization error than joint learning on training sets with nearly 40,000 triplets. On the dataset that is uniformly sampled from a 20 dimensional space, joint embedding reduces the triplet generalization error faster than independent embedding and pooled embedding as the number of training triplets increases, until around 20,000 triplets. At this point, joint embedding in a 10 dimensional space is overtaken by its independent learning counterpart. However, joint embedding in a 20 dimensional space has a lower error rate than the baselines in all cases.

Poses of Airplanes

The airplanes are embedded into a 10 dimensional space based on a training set which consists of 3,000 triplets from every view. As an illustration, we project the learned global view of the objects onto their first two principal dimensions via SVD and show the embedding in Fig. 4.5. The visualization shows that the objects roughly lie on a circle. Meanwhile, the same figure shows a clear clustered structure: objects belonging to each of the three dominant classes are clustered together, except that members of the class “facing out or facing away” are separated into two clusters.

Figure 4.3: Experimental results on synthetic data with clusters. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error.

Figure 4.4: Triplet generalization errors on uniformly distributed synthetic data.

Fig. 4.6 shows the triplet generalization errors and classification errors of the learned embeddings. Errors on local views are shown in the small plots, while the large plots show the average across views. Results of independent learning are included in the same figure. Our algorithm almost always performs better at predicting both triplets and classes when using fewer than 10,000 training triplets. However, this advantage disappears as the training set becomes larger.

Public Figures Face Dataset

The 200 images are embedded into 5, 10, and 20 dimensional spaces. We draw triplets randomly from the ground-truth similarity measure to form the training and test sets. See Fig. 4.7 for a visualization of the data embedded under the metric of the first view. Triplet generalization errors and classification errors are shown in Fig. 4.8. In terms of the triplet generalization error of the 5 and 10 dimensional embeddings, we find that joint learning performs better with few training triplets, while its error rate exceeds that of independent learning as the training set gets larger. Joint learning and independent learning have comparable triplet generalization errors when the embedding is conducted in a 20 dimensional space. Pooled embedding has the highest triplet generalization errors among the three methods. In terms of leave-one-out classification error, joint learning has a lower error rate than independent learning and pooled embedding, although pooled embedding performs best on training sets with more than about 10,000 triplets.

Figure 4.5: The global view of embeddings of poses of planes.

CUB-200 Birds Data

The birds dataset is more challenging in the sense that it arises from a more realistic situation: because the triplets are crowdsourced, triplet relations among all the bird species are not fully available. We learn the embedding both in a 10 dimensional space and in a 60 dimensional space. An illustration of the embedding learned in the 60 dimensional space can be found in Fig. 4.9.

In the birds dataset, the number of available triplets from the first view is larger than the number available from the other views. During the experiment, we first sample equal numbers of triplets from each view to form the training set. We keep adding triplets to the training set until it has 3,000 triplets from each view. Then, we add triplets only from the first view and test the embedding on all views.

See Fig. 4.10 for the triplet generalization errors and leave-one-out 3-nearest-neighbour classification errors. The solid vertical line marks the moment when we start to add training triplets only to the first view. We found that, unlike on the previous datasets, our algorithm doesn't perform better on this dataset in terms of triplet generalization error. However, the triplet generalization error of learning multiple views jointly in a 60 dimensional space is comparable to that obtained by embedding every view independently. In terms of leave-one-out classification error, our method obtains lower error on all views except the first.

Figure 4.6: Results on poses of planes dataset. Embeddings are learned in 3 and 10 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.

Figure 4.7: Illustration of public figures face data embedded under the metric of the first view: hair. Points with the same color belong to the same person. Embeddings are learned in a 10 dimensional space using 10^5 triplets and then further embedded in a 2 dimensional plane by using tSNE (van der Maaten and Hinton, 2008) for visualization. Left: embeddings learned independently. Right: embeddings learned jointly.

4.4 Learning a New View (Zero-Shot Learning)

On the CUB-200 birds dataset, we simulated a setting similar to zero-shot learning (Palatucci et al., 2009). We draw a training set containing 100 triplets from the 2nd local view and 3,000 triplets from each of the other 5 views, and we investigate how joint learning helps create an embedding on the 2nd local view. The learned embedding is shown in Fig. 4.11. It is evident that the embedding learned purely from the 100 triplets of that view has objects from different classes completely mixed together, while the corresponding local view of the jointly learned embedding groups objects from some classes better. For example, members of the class Passeriformes (Emberizidae) are more separated from members of Passeriformes (Icteridae). Meanwhile, the same figure also shows the change in triplet generalization error as we add more training triplets from the 2nd local view. We can see that joint learning has a lower triplet generalization error when the number of training triplets is small but, as expected, it is matched and then outperformed by independent learning as the training set gets larger.

4.5 Discussion

We have shown through experiments that our algorithm achieves lower triplet generalization errors on the synthetic data, the airplane poses data, and the public figure faces data when the training set is relatively small. Since similarity triplets can be expensive to obtain in some real applications, jointly learning similarity metrics is preferable, as it can recover the underlying structure using a relatively small amount of training data.

Figure 4.8: Results on public figures faces dataset. Embeddings are learned in 5, 10, and 20 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.

Figure 4.9: Illustration of CUB-200 birds data. The figure shows the data's embedding under the metric of the first view. Embeddings are learned in a 60 dimensional space using 18,000 triplets and then further embedded in a 2 dimensional plane by using tSNE (van der Maaten and Hinton, 2008) for visualization. Top: embeddings learned independently. Bottom: embeddings learned jointly.

Figure 4.10: Results on CUB-200 birds dataset. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.


Figure 4.11: Learning a new view on CUB-200 birds dataset. The training data contains 100 triplets from the second local view and 3,000 triplets from the other 5 views. Embeddings are learned in a 10 dimensional space and then further embedded in a 2 dimensional plane by using tSNE (van der Maaten and Hinton, 2008) for visualization. Left: triplet generalization error on the second local view. Middle: embedding learned independently. Right: embedding learned jointly.

The experiments also showed that jointly learning multiple metrics performs better in terms of 3-nearest-neighbour classification error in almost all cases, which implies that it has the potential to recover the category-level structure of the data.

From the experimental results presented above, the performance gain from joint learning varies across datasets. On the synthetic data with clusters and the poses of planes dataset, the benefit of joint learning is evident, and it is especially large for small numbers of training triplets. Its performance gain on the uniformly distributed synthetic data and the public figure faces dataset is relatively small, and it is very limited on the CUB-200 birds dataset. The influence may come from multiple aspects. We discuss two different yet related factors: (1) the choice of dimensions and (2) the correlation underlying the different views.

4.5.1 Influence of Dimension

From the results on public figures faces and CUB-200 (Fig. 4.8 and 4.10), we see that the dimension of the embedding space affects the triplet generalization error of embeddings learned jointly, especially when the training set is large. For example, on the public figures dataset (Fig. 4.8), joint learning reduces the error faster than independent learning up to around 10,000 triplets. When there are more than 10,000 triplets, the error of the joint embedding decreases monotonically as the number of dimensions increases. Meanwhile, when we learn a 10 dimensional joint embedding for the uniform synthetic dataset sampled from a 20 dimensional space, joint learning obtains a lower triplet generalization error on training sets of up to around 20,000 triplets but is caught up by independent learning on larger training sets (Fig. 4.4). This can be understood as a bias induced by joint learning.


As analysed in Sec. 3.2, joint learning has a lower complexity when the embedding dimension is chosen to be the same as that of independent learning. However, experimental results have shown that in many cases joint learning has a large performance gain even when using the same embedding dimension as independent learning. For example, when the public figures data was embedded in a 20 dimensional space, joint embedding continued to perform better than independent learning. Meanwhile, from the results on the clustered synthetic data and the poses of planes data (Fig. 4.3 and 4.6), joint learning is able to achieve lower errors. We interpret this as follows: when the model has sufficient complexity to capture the inherent structure of the data, the joint embedding makes use of triplets from all views simultaneously, while independent learning only employs triplets from individual views. Next, we consider one particular aspect of the underlying structure, namely the correlation between different views.

4.5.2 Correlation Between Views

Another question of particular interest is how the underlying similarities affect the performance of the proposed joint learning algorithm. Speculatively, joint learning would be affected by the correlation between the similarity metrics of different views. Here, we examine such correlation from two aspects: consistency of triplet constraints and correlation of distances.

Consistency of Triplet Constraints We assume that there is a ground truth embedding for each similarity metric. Since there are $3!\binom{N}{3}$ triplets on N objects and any similarity metric satisfies exactly half of them¹, a similarity metric induces $m(N) = 3!\binom{N}{3}/2$ triplet constraints. Triplet constraint consistency between two views is defined as the ratio of the number of triplet constraints shared by both views to m(N). In our joint learning algorithm, the global embedding X is trained to conform to the triplet constraints of all views. Therefore, if there exists a pair of conflicting triplet constraints, joint learning might not be able to model both triplets well. We examine the consistency of triplet constraints on a dataset in the following way. On datasets where we have the ground truth similarity metrics in the form of a true embedding (e.g., synthetic data) or a true similarity kernel (e.g., public figures data), we draw 100,000 triplets at random and check whether each triplet is satisfied or violated on each view. Triplet constraint consistency between two similarity metrics is then estimated as the percentage of triplets which receive the same judgement under the two metrics, i.e., are satisfied or violated by both.

¹This is because if the triplet constraint (i, j, k) is satisfied under a metric, then the constraint (i, k, j) must be violated by the same metric.


On the poses of planes data and the CUB-200 dataset, since we do not have access to all triplet constraints, we estimate the triplet consistency from the embeddings learned independently with the largest number of triplets. The triplet constraint consistency for the datasets used in this work is listed in Appendix C. The average triplet constraint consistency for each dataset is summarized in Table 4.1.
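To make the estimation procedure concrete, the following sketch (a minimal Python illustration, not the code used for the experiments; the per-view ground truth embeddings X1 and X2 are hypothetical inputs) estimates triplet constraint consistency by random sampling:

```python
import numpy as np

def triplet_satisfied(X, i, j, k):
    # A triplet (i, j, k) is satisfied when object i lies closer to j than to k.
    return np.linalg.norm(X[i] - X[j]) < np.linalg.norm(X[i] - X[k])

def triplet_consistency(X1, X2, num_samples=100_000, seed=0):
    # Fraction of random triplets that receive the same judgement
    # (satisfied or violated) under both ground truth embeddings.
    rng = np.random.default_rng(seed)
    n = X1.shape[0]
    agree = 0
    for _ in range(num_samples):
        i, j, k = rng.choice(n, size=3, replace=False)
        agree += triplet_satisfied(X1, i, j, k) == triplet_satisfied(X2, i, j, k)
    return agree / num_samples
```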

Correlation of Distances A metric over N objects can be fully captured by a distance matrix D whose (i, j)-th entry is the distance between objects i and j. D can be computed from the kernel matrix K by $D(i, j) = K(i, i) - K(i, j) - K(j, i) + K(j, j)$. Suppose $D_1$ and $D_2$ are two distance matrices. Since a (pseudo) metric is symmetric and so is the distance matrix, to compare two distance matrices we may restrict to their lower triangles and define their correlation as

$$\mathrm{Corr}(D_1, D_2) = \frac{\sum_{1 \le i < j \le N} \left(D_1(i, j) - \mu_1\right)\left(D_2(i, j) - \mu_2\right)}{\sqrt{\sum_{1 \le i < j \le N} \left(D_1(i, j) - \mu_1\right)^2} \, \sqrt{\sum_{1 \le i < j \le N} \left(D_2(i, j) - \mu_2\right)^2}}$$

where

$$\mu_t = \frac{2}{N(N-1)} \sum_{1 \le i < j \le N} D_t(i, j), \qquad t = 1, 2.$$

On the synthetic data and the public figures faces dataset, we are able to either compute the distance matrix or access the true kernel matrix, so we can compute the distance correlation between every pair of local metrics. For the poses of planes and CUB-200 data, we estimate the correlation of distances from the embeddings learned independently with the largest number of triplets. Distance correlations between every pair of views on each dataset are shown in Appendix C. The average distance correlation for each dataset is summarized in Table 4.1.
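A minimal sketch of this computation, assuming NumPy and distance matrices as inputs (the kernel-to-distance conversion follows the identity above):

```python
import numpy as np

def kernel_to_distance(K):
    # D(i, j) = K(i, i) - K(i, j) - K(j, i) + K(j, j)
    d = np.diag(K)
    return d[:, None] - K - K.T + d[None, :]

def distance_correlation(D1, D2):
    # Pearson correlation over the strict lower triangles of the
    # (symmetric) distance matrices.
    idx = np.tril_indices_from(D1, k=-1)
    return np.corrcoef(D1[idx], D2[idx])[0, 1]
```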

Relation between triplet consistency and correlation of distances. Since we aim to recover the underlying similarity metric from triplet constraints, a natural question is: how far apart could two metrics that satisfy the same set of triplet constraints possibly be? Since triplet consistency and correlation of distances relate to the two sides of this problem, we inspect the relation between them empirically to obtain some clues; a rigorous analysis is left as future work. Here, an approximate relation between the two quantities is shown first; the intuition behind it is given subsequently.


[Scatter plot: x-axis: correlation; y-axis: sin(π/2 (2 TripletConsist − 1)); datasets: syn (cluster), syn (uniform), pubfig, bird, poses of planes.]

Figure 4.12: Relation between triplet consistency and correlation of distances.


Let $\delta(D_1, D_2)$ denote the triplet consistency between two distance metrics. We illustrate the relation between $\delta(D_p, D_q)$ and $\mathrm{Corr}(D_p, D_q)$ in Fig. 4.12. It shows that the two quantities roughly follow the relation

$$\sin\left(\frac{\pi}{2}\left(2\delta(D_p, D_q) - 1\right)\right) = \mathrm{Corr}(D_p, D_q). \qquad (4.1)$$

The intuition behind this relation unfolds as follows.

Given two distance matrices $D_1$ and $D_2$ over N objects, consider the following generative model:

1. Let (I, J) be a random variable that takes values in the space of all unordered pairs among these N objects with equal chance.

2. Define $Y_1 := D_1(I, J)$, $Y_2 := D_2(I, J)$.

By this, $\mathrm{Corr}(D_1, D_2)$ defined previously is exactly the correlation of $Y_1$ and $Y_2$. On the other hand, given a sample $\{(y_1^{(t)}, y_2^{(t)}) : t = 1, \ldots, m\}$ drawn from the generative model defined above,


Kendall's τ (Kruskal, 1958) is defined by

$$\tau = \frac{2}{m(m-1)} \sum_{1 \le t < t' \le m} \mathrm{sign}\left[\left(y_1^{(t)} - y_1^{(t')}\right)\left(y_2^{(t)} - y_2^{(t')}\right)\right]$$

and, as stated by Liu et al. (2012), its population version is given by

$$\tau := \mathrm{Corr}\left(\mathrm{sign}(Y_1 - \tilde{Y}_1),\, \mathrm{sign}(Y_2 - \tilde{Y}_2)\right)$$

where $\tilde{Y}_t$ is an independent copy of $Y_t$, $t = 1, 2$. Moreover,

$$\tau = P\left((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\right) - P\left((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) < 0\right).$$

Loosely speaking, we have $\tau = 2P\left((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\right) + P\left((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) = 0\right) - 1$, which is approximately $2P\left((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\right) - 1$ when the number of objects N is large.

As we define $Y_1$ and $Y_2$ to be the distances of a random pair of objects under metrics $D_1$ and $D_2$, $P\left((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\right)$ can be regarded as the consistency of the quadruple relations induced by the two metrics. Each quadruple relation is written as (A, B, C, D), meaning "A is more similar to B than C is to D", and it subsumes triplet constraints as special cases.

If $Y_1$ and $Y_2$ are jointly Gaussian or belong to a nonparanormal family, it is known that

$$\sin\left(\frac{\pi}{2}\tau\right) = \rho_{Y_1, Y_2} \qquad (4.2)$$

where $\rho_{Y_1, Y_2}$ is the correlation between $Y_1$ and $Y_2$; see (Kruskal, 1958; Liu et al., 2012).

This gives the intuition for using (4.1) to approximate the relation between triplet consistency and the correlation of pairwise distances. Note that, here, we use triplet consistency as an estimate of quadruple consistency. Moreover, $Y_1$ and $Y_2$ obtained from the generative model are not necessarily Gaussian. But, empirically, the approximation works quite well on the datasets used in this work, as shown in Fig. 4.12.
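The empirical check behind Fig. 4.12 can be reproduced along the following lines (a sketch assuming SciPy is available; the sampling details of the actual experiments are not reproduced here):

```python
import numpy as np
from scipy.stats import kendalltau

def sine_tau_vs_correlation(D1, D2):
    # Compare sin(pi/2 * tau) with the Pearson correlation of pairwise
    # distances, mirroring the approximate relation (4.1)/(4.2).
    idx = np.tril_indices_from(D1, k=-1)
    y1, y2 = D1[idx], D2[idx]
    tau, _ = kendalltau(y1, y2)
    return np.sin(0.5 * np.pi * tau), np.corrcoef(y1, y2)[0, 1]
```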

Relating performance gain with view correlation. The performance gain is measured by the difference between the areas under the triplet generalization error curves, normalized by the area for independent learning, in the case of a 10 dimensional embedding. Relating the performance gain to the correlation among similarities (Table 4.1, Fig. 4.13), we see that on the clustered synthetic data, where joint learning obtained a significant performance gain, both the triplet consistency and the distance correlation are high.


dataset                  average triplet   average distance   performance
                         consistency       correlation        gain (%)
synthetic (clustered)    0.78              0.71               52
synthetic (uniform)      0.59              0.29               18
poses of planes          0.63∗             0.37∗              24
public figure            0.58              0.25               10
CUB-200                  0.52∗             0.06∗              0.4

Table 4.1: Measures of similarity correlation and the performance gain of joint learning. Entries marked with ∗ are values estimated from independent embeddings.

[Two scatter panels: performance gain (%) versus triplet consistency and versus correlation of distances; points: syn (clustered), syn (uniform), poses of planes, pubfig, CUB-200.]

Figure 4.13: Performance gain and correlation of views.

On the other hand, the uniform synthetic data, the poses of planes dataset and the public figures faces dataset have relatively mild view correlations, and the performance gains on these datasets are not very significant. In contrast, the triplet consistency of CUB-200 is close to random (0.5), which is probably the reason that the performance gain on this dataset is very limited.
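For reference, the performance gain measure used in Table 4.1 can be sketched as follows (a minimal sketch; the trapezoidal rule and the log-scale grid are assumptions, as the exact integration details are not specified above):

```python
import numpy as np

def performance_gain(n_triplets, err_joint, err_ind):
    # Difference between the areas under the triplet generalization error
    # curves, normalized by the area for independent learning, in percent.
    x = np.log10(np.asarray(n_triplets, dtype=float))  # assumed log-scale axis

    def auc(err):
        err = np.asarray(err, dtype=float)
        return np.sum(0.5 * (err[1:] + err[:-1]) * np.diff(x))  # trapezoid rule

    return 100.0 * (auc(err_ind) - auc(err_joint)) / auc(err_ind)
```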


CHAPTER 5

RELATED WORK

Previously, we have covered some literature that is closely related to our formulation of jointly learning multiple metrics from triplet constraints. In this chapter, we mention some other work related to the problem of learning multiple similarities from pairwise comparisons and take a glimpse at the problem from different angles.

5.1 Multitask Metric Learning with Low-Rank Tensors

Tensors are widely used in the context of multi-task learning (Evgeniou and Pontil, 2007; Romera-Paredes et al., 2013). Learning a low-rank tensor is a problem that has been studied and used in many applications (Tomioka et al., 2011; Signoretto et al., 2014; Liu et al., 2013; Blankertz et al., 2008). Since T similarity metrics over a given set of N objects can be represented as kernel matrices $K_t$, $t = 1, \ldots, T$, we may consider them as slices of an $N \times N \times T$ tensor $\mathcal{K}$ where the (i, j, t) entry of $\mathcal{K}$ reflects the similarity between objects i and j under the t-th view. Meanwhile, under the assumption that there exists correlation among the different kernel matrices, we may seek a $\mathcal{K}$ with low rank; see (Kolda and Bader, 2009) for the definition of tensor rank. Let $\Delta^{(i,j,t)} \in \mathbb{R}^{N \times N \times T}$ be a tensor with entries $\Delta^{(i,j,t)}(i, i, t) = \Delta^{(i,j,t)}(j, j, t) = 1$, $\Delta^{(i,j,t)}(i, j, t) = \Delta^{(i,j,t)}(j, i, t) = -1$ and 0 everywhere else. Then its inner product with $\mathcal{K}$ is equal to the (squared) distance under the t-th similarity metric, i.e.,

$$\left\langle \Delta^{(i,j,t)}, \mathcal{K} \right\rangle = \mathcal{K}_{i,i,t} - \mathcal{K}_{i,j,t} - \mathcal{K}_{j,i,t} + \mathcal{K}_{j,j,t}.$$

By this, each triplet constraint can be expressed as a linear constraint on $\mathcal{K}$ as

$$(i, j, k) \in S_t \implies \left\langle \Delta^{(i,j,t)}, \mathcal{K} \right\rangle < \left\langle \Delta^{(i,k,t)}, \mathcal{K} \right\rangle.$$


The problem of learning multiple similarity metrics can be formulated with a tensor as:

$$\min_{\mathcal{K}} \; \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell\left(\left\langle \Delta^{(i,j,t)}, \mathcal{K} \right\rangle, \left\langle \Delta^{(i,k,t)}, \mathcal{K} \right\rangle\right) + \Omega(\mathcal{K}), \qquad (5.1)$$
$$\text{s.t.} \quad \mathcal{K}(:, :, t) \succeq 0 \quad (t = 1, \ldots, T),$$

where $\Omega(\mathcal{K})$ is a regularization term that induces a low-rank tensor; see (Tomioka et al., 2010) for example choices of such regularizers. Comparing (5.1) with (3.1), we can see that (3.1) can also be cast as a tensor learning problem, but the learning space is restricted to tensors with the special factorization $\mathcal{K} = \mathcal{M} \times_1 X^\top \times_2 X^\top$. Here, $\mathcal{M}$ is a $D \times D \times T$ tensor whose t-th slice is required to be a positive semi-definite matrix and corresponds to the $M_t$ representing the metric under the t-th view, while $X \in \mathbb{R}^{D \times N}$ represents the global embedding in (3.1).
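Since each $\Delta^{(i,j,t)}$ has only four nonzero entries, the inner products above reduce to index lookups; a minimal sketch (hypothetical helper names; kernel tensor K of shape N × N × T):

```python
import numpy as np

def delta_inner_product(K, i, j, t):
    # <Delta^(i,j,t), K>: the (squared) distance between objects i and j
    # under the t-th view, read directly off the kernel tensor.
    return K[i, i, t] - K[i, j, t] - K[j, i, t] + K[j, j, t]

def triplet_constraint_holds(K, i, j, k, t):
    # The linear constraint <Delta^(i,j,t), K> < <Delta^(i,k,t), K>.
    return delta_inner_product(K, i, j, t) < delta_inner_product(K, i, k, t)
```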

5.2 Relation between Triplet Embedding and Metric Learning

In the literature, embedding with pairwise comparisons is sometimes called ordinal embedding (Alon et al., 2008; von Luxburg et al., 2014; Terada and Luxburg, 2014) or partial order embedding (McFee and Lanckriet, 2009).

von Luxburg et al. (2014) consider the problem of recovering the embeddings of n objects from pairwise comparisons. They proved that, asymptotically, if we have knowledge of the ordinal relationships for all quadruples (i, j, k, l), then as $n \to \infty$ the set of embedded points always converges to the set of original points, up to similarity transformations such as rotations, translations, rescalings and reflections. Building on that, a further work of Terada and Luxburg (2014) investigates the consistency of local ordinal embedding (LOE), which only uses triplet constraints that reflect k-nearest-neighbour information. More specifically, in the setting of LOE, a triplet constraint (i, j, l) implies that j is a k-nearest-neighbour of i but l is not. The authors proved that, under certain conditions, it is possible to reconstruct the point set $x_1, \ldots, x_n$ asymptotically if we just know the k-nearest neighbours of each point. They also proposed a Soft Ordinal Embedding algorithm that recovers not only the ordinal constraints but also the density structure underlying the data set.

Jamieson and Nowak (2011) derived a lower bound on the minimum number of queries of triplet relations (i, j, k) needed to determine an embedding. They proved that at least $\Omega(dn \log n)$ such comparisons are needed to determine the embedding of n objects into a d dimensional space, and that this lower bound cannot be achieved by randomly choosing pairwise comparisons.


The work of McFee and Lanckriet (2009) suggested that it might be hard to find the minimal dimension needed to produce an embedding that satisfies all the constraints. They showed that it is NP-complete to decide whether a given set of ordinal constraints C can be satisfied in $\mathbb{R}^1$.

Alon et al. (2008) studied the problem of minimum-relaxation ordinal embedding. Roughly speaking, given a distance function D(i, j), an ordinal embedding with relaxation $\alpha$ is an embedding that satisfies $D(i, j) > \alpha D(p, q) \implies d(x_i, x_j) > d(x_p, x_q)$, where the $x_i$'s are the learned embeddings and $d(\cdot, \cdot)$ is the metric of the embedding space. They also established that the problem of ordinal embedding has many qualitative differences from metric embedding, which aims to minimize distortion.


CHAPTER 6

CONCLUSION AND FUTURE WORK

In this work, we introduced our formulation of the problem of jointly learning multiple similarity metrics, which is a natural extension of the conventional independent learning model. The proposed model consists of a global view, which represents each object as a fixed dimensional vector, and local views, which specify each view-specific Mahalanobis metric as a positive semi-definite matrix. Its performance is studied empirically on both synthetic and real-world datasets. Experimental results have demonstrated that, on datasets with higher correlation between similarity metrics, joint learning performs better than independent learning in terms of triplet generalization error, especially when the number of training triplets is small. This suggests that joint learning would be preferable for applications where triplet relations are expensive to obtain. Although the results are presented for the hinge loss (Agarwal et al., 2007), the proposed algorithm easily generalizes to other loss functions, e.g., the CKL loss (Tamuz et al., 2011) and the t-STE loss (van der Maaten and Weinberger, 2012). Meanwhile, joint learning tends to have a lower leave-one-out classification error in most situations, which implies that it could support supervised learning tasks such as learning a classifier. As future work, we aim to study how to use embeddings learned from similarity comparisons for classification tasks where the data is partially labelled.

Some preliminary analysis shows that the underlying correlation among similarity metrics can notably affect the performance of joint learning. Based on our empirical study, to achieve performance comparable to independent learning, joint learning would require a higher dimensional embedding space on datasets where views are less correlated. It would be interesting to find a way of estimating the amount of view correlation and thereby provide guidance for choosing the dimension of the embedding space.

Moreover, as mentioned in Section 5.1, learning multiple metrics from triplet constraints can be formulated as a problem of learning a low-rank tensor under linear constraints. We may explore this approach and use low-rank tensors as a tool to model correlated similarity metrics. By resorting to existing studies on tensors, we might gain some insight into multiple metric learning under triplet constraints.


On the other hand, embeddings learned from joint learning can be used in higher level machine learning tasks, such as fast retrieval and recognition. We may integrate our joint learning framework with those tasks and study the performance of such systems in real applications, such as music recommendation (McFee et al., 2012), word embedding in language (Mikolov et al., 2013) and interactive fine-grained recognition (Wah et al., 2014a). Since triplet relations might be expensive to obtain in many applications, when building such a system it would also be interesting to develop an active triplet sampling algorithm, as an extension to (Jamieson and Nowak, 2011), so that the queried triplets are the most informative ones.


APPENDIX A

VIEWS OF POSES OF AIRPLANES DATASET

Each of the 200 airplanes was annotated with 16 landmarks, namely:

01. Top Rudder      05. L WingTip     09. Nose Bottom        13. Left Engine Back
02. Bot Rudder      06. R WingTip     10. Left Wing Base     14. Right Engine Front
03. L Stabilizer    07. NoseTip       11. Right Wing Base    15. Right Engine Back
04. R Stabilizer    08. Nose Top      12. Left Engine Front  16. Bot Rudder Front

This is also illustrated in Figure A.1. The five different views are defined by considering different subsets of landmarks as follows:

1. all = {1, 2, . . . , 16}

2. back = {1, 2, 3, 4, 16}

3. nose = {7, 8, 9}

4. back+wings = {1, 2, . . . , 6, 10, 11, . . . , 16}

5. nose+wings = {5, 6, . . . , 15}

For a triplet (A, B, C), we compute similarities $s_i(A, B)$ and $s_i(A, C)$ by aligning the subset i of the landmarks of B and C to A under a translation and scaling that minimize the sum of squared errors after alignment. The similarity is inversely proportional to the residual error. This is also known as "Procrustes analysis", commonly used for matching shapes.
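A minimal sketch of the alignment step (translation and scaling only, as described above; the exact residual normalization used when constructing the dataset is assumed):

```python
import numpy as np

def alignment_residual(A, B):
    # Align landmarks B (m x 2) to A (m x 2) under the translation and
    # scaling that minimize the sum of squared errors; return the residual.
    A0 = A - A.mean(axis=0)                 # optimal translation aligns centroids
    B0 = B - B.mean(axis=0)
    s = np.sum(A0 * B0) / np.sum(B0 * B0)   # least-squares scale of B0 onto A0
    return np.sum((A0 - s * B0) ** 2)
```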


Figure A.1: Landmarks illustrated on several planes.


APPENDIX B

ATTRIBUTES OF PUBLIC FIGURES FACE DATASET

Each image in the Public Figures Face Dataset (Pubfig)¹ is characterized by 75 attributes. We used 39 of the attributes in our work and categorized them into 5 groups according to the aspects they describe. Here is a table of the categories and attributes:

Category   | Attributes
Hair       | Black Hair, Blond Hair, Brown Hair, Gray Hair, Bald, Curly Hair, Wavy Hair, Straight Hair, Receding Hairline, Bangs, Sideburns
Age        | Baby, Child, Youth, Middle Aged, Senior
Accessory  | No Eyewear, Eyeglasses, Sunglasses, Wearing Hat, Wearing Lipstick, Heavy Makeup, Wearing Earrings, Wearing Necktie, Wearing Necklace
Shape      | Oval Face, Round Face, Square Face, High Cheekbones, Big Nose, Pointy Nose, Round Jaw, Narrow Eyes, Big Lips, Strong Nose-Mouth Lines
Ethnicity  | Asian, Black, White, Indian

Table B.1: List of Pubfig attributes that were used in this work.

¹Available at http://www.cs.columbia.edu/CAVE/databases/pubfig/


APPENDIX C

CORRELATION BETWEEN SIMILARITY METRICS ON MULTIPLE DATASETS

        view 1    view 2    view 3    view 4    view 5
view 2  0.712220
view 3  0.754380  0.720280
view 4  0.770390  0.842020  0.782490
view 5  0.780540  0.779740  0.703990  0.799840
view 6  0.728800  0.846140  0.759320  0.847380  0.810520

Table C.1: Consistency of triplet constraints between different views in synthetic data (clustered).

        view 1    view 2    view 3    view 4    view 5
view 2  0.558910
view 3  0.566690  0.703850
view 4  0.635800  0.506790  0.502940
view 5  0.636420  0.622560  0.708390  0.570290
view 6  0.567020  0.624570  0.569100  0.496120  0.629680

Table C.2: Consistency of triplet constraints between different views in synthetic data (uniformly distributed).

        view 1    view 2    view 3    view 4
view 2  0.497920
view 3  0.493910  0.709840
view 4  0.489880  0.712820  0.660470
view 5  0.498180  0.800280  0.715140  0.683480

Table C.3: Consistency of triplet constraints between different views in poses of planes dataset estimated from independent embedding.


        view 1    view 2    view 3    view 4
view 2  0.577800
view 3  0.641800  0.592700
view 4  0.578000  0.575200  0.616400
view 5  0.582300  0.531000  0.546300  0.546100

Table C.4: Consistency of triplet constraints between different views in public figures dataset.

        view 1    view 2    view 3    view 4    view 5
view 2  0.501170
view 3  0.491470  0.506750
view 4  0.514880  0.509960  0.553930
view 5  0.493880  0.539450  0.531050  0.550690
view 6  0.494220  0.527550  0.541820  0.512840  0.516650

Table C.5: Consistency of triplet constraints between different views in CUB-200 dataset estimated from independent embedding.

        view 1    view 2    view 3    view 4    view 5
view 2  0.488729
view 3  0.665361  0.549823
view 4  0.708692  0.838732  0.775284
view 5  0.680630  0.762995  0.579146  0.776799
view 6  0.556607  0.886884  0.652158  0.874139  0.819197

Table C.6: Distance correlation between different views in synthetic data (clustered).

        view 1    view 2    view 3     view 4     view 5
view 2  0.191292
view 3  0.217535  0.588245
view 4  0.430519  0.025099  0.008072
view 5  0.426157  0.393445  0.620055   0.223425
view 6  0.200978  0.375466  0.231990  -0.004176   0.402714

Table C.7: Distance correlation between different views in synthetic data (uniformly distributed).


         view 1     view 2    view 3    view 4
view 2  -0.012479
view 3  -0.023239  0.611187
view 4  -0.028532  0.665929  0.512171
view 5  -0.006959  0.815321  0.616287  0.571281

Table C.8: Distance correlation between different views in poses of planes dataset estimated from independent embedding.

        view 1    view 2    view 3    view 4
view 2  0.240323
view 3  0.410561  0.194059
view 4  0.252613  0.228472  0.307446
view 5  0.296265  0.101105  0.216864  0.249339

Table C.9: Distance correlation between different views in public figures dataset.

         view 1     view 2    view 3    view 4    view 5
view 2  -0.007114
view 3  -0.028691  0.017077
view 4   0.011462  0.046835  0.202294
view 5  -0.014319  0.139694  0.127920  0.174904
view 6  -0.031425  0.041847  0.142853  0.007505  0.018800

Table C.10: Distance correlation between different views in CUB-200 dataset estimated from independent embedding.


REFERENCES

Agarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D. J., and Belongie, S. (2007). Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics, pages 11–18.

Alon, N., Badoiu, M., Demaine, E. D., Farach-Colton, M., Hajiaghayi, M., and Sidiropoulos, A. (2008). Ordinal embeddings of minimum relaxation: general properties, trees, and ultrametrics. ACM Transactions on Algorithms (TALG), 4(4):46.

Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., and Muller, K.-R. (2008). Optimizing spatial filters for robust EEG single-trial analysis. Signal Processing Magazine, IEEE, 25(1):41–56.

Bourdev, L., Maji, S., Brox, T., and Malik, J. (2010). Detecting people using mutually consistent poselet activations. In European Conference on Computer Vision (ECCV).

Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T. V., and Yianilos, P. N. (2000). The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. Image Processing, IEEE Transactions on, 9(1):20–37.

Cui, Z., Li, W., Xu, D., Shan, S., and Chen, X. (2013). Fusing robust face region descriptors via multiple metric learning for face recognition in the wild. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3554–3561. IEEE.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Evgeniou, A. and Pontil, M. (2007). Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41.

Fazel, M., Hindi, H., and Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. In Proc. of the American Control Conference.

Jain, P., Kulis, B., and Grauman, K. (2008). Fast image search for learned metrics. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.

Jamieson, K. G. and Nowak, R. D. (2011). Low-dimensional embedding using adaptively selected ordinal data. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 1077–1084. IEEE.

Kendall, M. G. and Gibbons, J. D. (1990). Rank correlation methods. Edward Arnold, Oxford University Press.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27.

Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129.

Kruskal, W. H. (1958). Ordinal measures of association. Journal of the American Statistical Association, 53(284):814–861.

Kumar, N., Berg, A. C., Belhumeur, P. N., and Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 365–372. IEEE.

Liu, H., Han, F., Yuan, M., Lafferty, J., Wasserman, L., et al. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326.

Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):208–220.

McFee, B., Barrington, L., and Lanckriet, G. (2012). Learning content similarity for music recommendation. Audio, Speech, and Language Processing, IEEE Transactions on, 20(8):2207–2218.

McFee, B. and Lanckriet, G. (2009). Partial order embedding with multiple kernels. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 721–728. ACM.

McFee, B. and Lanckriet, G. (2011). Learning multi-modal similarity. The Journal of Machine Learning Research, 12:491–523.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, pages 1410–1418.

Parameswaran, S. and Weinberger, K. Q. (2010). Large margin multi-task metric learning. In Advances in Neural Information Processing Systems, pages 1867–1875.

Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., and Pontil, M. (2013). Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1444–1452.

Scholkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In Computational Learning Theory, pages 416–426. Springer.

Shepard, R. N. (1962a). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125–140.

Shepard, R. N. (1962b). The analysis of proximities: Multidimensional scaling with an unknown distance function. II. Psychometrika, 27(3):219–246.

Signoretto, M., Dinh, Q. T., De Lathauwer, L., and Suykens, J. A. (2014). Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351.

Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. (2011). Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033.

Terada, Y. and Luxburg, U. V. (2014). Local ordinal embedding. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 847–855.

Tomioka, R., Hayashi, K., and Kashima, H. (2010). Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789.

Tomioka, R., Suzuki, T., Hayashi, K., and Kashima, H. (2011). Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems, pages 972–980.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

van der Maaten, L. and Weinberger, K. (2012). Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE.

von Luxburg, U. et al. (2014). Uniqueness of ordinal embedding. In Proceedings of The 27th Conference on Learning Theory, pages 40–67.

Wah, C., Horn, G. V., Branson, S., Maji, S., Perona, P., and Belongie, S. (2014a). Similarity comparisons for interactive fine-grained categorization. In Computer Vision and Pattern Recognition.

Wah, C., Maji, S., and Belongie, S. (2014b). Learning localized perceptual similarity metrics for interactive categorization. In Human-Machine Communication for Visual Recognition and Search Workshop, ECCV.

Weinberger, K. Q. and Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. (2010). Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.

Wilber, M. J., Kwak, I. S., and Belongie, S. J. (2014). Cost-effective HITs for relative similarity comparisons. arXiv preprint arXiv:1404.3291.