THE UNIVERSITY OF CHICAGO
JOINTLY LEARNING MULTIPLE SIMILARITY METRICS FROM TRIPLET
CONSTRAINTS
A DISSERTATION SUBMITTED TO
THE FACULTY OF THE DIVISION OF THE PHYSICAL SCIENCES
IN CANDIDACY FOR THE DEGREE OF
MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
BY
LIWEN ZHANG
CHICAGO, ILLINOIS
WINTER, 2015
ABSTRACT
The measure of similarity plays a crucial role in many applications such as content-based recommendation, image search and speech recognition. Similarity between objects is multifaceted, and it is easier to judge similarity when the focus is on a specific aspect. We consider the problem of mapping objects into view-specific embeddings where the distance between them is consistent with similarity comparisons of the form "from the t-th perspective, object A is more similar to B than to C". We propose a framework to jointly learn multiple views by deploying different similarity metrics in a unified embedding space. Our approach is a natural extension of the view-wise independent approach and is capable of exploiting correlation between the views where it exists. Experiments on a number of datasets, including a large dataset of multi-view crowdsourced comparisons on bird images, show that the proposed method achieves lower triplet generalization error and better grouping of classes in most cases, compared to learning embeddings independently for each view or learning a single embedding from triplets collected on all views. More specifically, on datasets where correlation between views is strong, the proposed method achieves significant improvement, while on datasets with limited view correlation it still performs no worse than its independent-learning counterpart.
ACKNOWLEDGEMENTS
This thesis is based on joint work with Ryota Tomioka, my thesis advisor, and Subhransu Maji, who is now a faculty member at UMass Amherst. I am grateful to them for their collaboration and help on many technical details. I also wish to express my sincere thanks to Ryota for his guidance throughout the completion of this thesis. Thanks also go to Min Xu of the Statistics Department for pointing out the idea of using Kendall's τ in the analysis.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
1 INTRODUCTION
2 LEARNING SIMILARITY FROM RELATIVE COMPARISON
  2.1 Learning with Relative Similarity Measure
  2.2 Triplet Embedding
3 JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES
  3.1 Formulation
  3.2 Relation with Learning Multiple Independent Similarities
  3.3 Influence of Parameters
  3.4 Alternative Formulations and Computational Complexity
4 EXPERIMENTS
  4.1 Experimental Setup
  4.2 Datasets
  4.3 Experimental Results
  4.4 Learning a New View (Zero-Shot Learning)
  4.5 Discussion
    4.5.1 Influence of Dimension
    4.5.2 Correlation Between Views
5 RELATED WORK
  5.1 Multitask Metric Learning with Low-Rank Tensors
  5.2 Relation between Triplet Embedding and Metric Learning
6 CONCLUSION AND FUTURE WORK
A VIEWS OF POSES OF AIRPLANES DATASET
B ATTRIBUTES OF PUBLIC FIGURES FACE DATASET
C CORRELATION BETWEEN SIMILARITY METRICS ON MULTIPLE DATASETS
REFERENCES
LIST OF FIGURES
1.1 Illustration of ambiguity in similarity. . . . . . . . . . . . . . . . . . . . . . . . . 2
4.1 View specific similarities between poses of planes.
4.2 View specific similarities between birds.
4.3 Experimental results on synthetic data with clusters.
4.4 Experimental results on uniformly distributed synthetic data.
4.5 The global view of embeddings of poses of planes.
4.6 Experimental results on poses of planes dataset.
4.7 Illustration of public figures face data embedded under the metric of the first view.
4.8 Experimental results on public figure faces dataset.
4.9 Illustration of CUB-200 birds data.
4.10 Experimental results on CUB-200 birds dataset.
4.11 Learning a new view on CUB-200 birds dataset.
4.12 Relation between triplet consistency and correlation of pairwise distances.
4.13 Performance gain and correlation of views.
A.1 Landmarks illustrated on several planes.
LIST OF TABLES
4.1 Measure of similarity correlation and performance gain of using joint learning. Entries marked with * are values estimated from independent embeddings.
B.1 List of Pubfig attributes that were used in this work.
C.1 Consistency of triplet constraints between different views in synthetic data (clustered).
C.2 Consistency of triplet constraints between different views in synthetic data (uniformly distributed).
C.3 Consistency of triplet constraints between different views in poses of planes dataset estimated from independent embedding.
C.4 Consistency of triplet constraints between different views in public figures dataset.
C.5 Consistency of triplet constraints between different views in CUB-200 dataset estimated from independent embedding.
C.6 Distance correlation between different views in synthetic data (clustered).
C.7 Distance correlation between different views in synthetic data (uniformly distributed).
C.8 Distance correlation between different views in poses of planes dataset estimated from independent embedding.
C.9 Distance correlation between different views in public figures dataset.
C.10 Distance correlation between different views in CUB-200 dataset estimated from independent embedding.
CHAPTER 1
INTRODUCTION
The measure of similarity plays an important role in applications such as content-based recommendation, image search and speech recognition, and various techniques to learn such a measure from data have been proposed (Jain et al., 2008; Weinberger and Saul, 2009; McFee and Lanckriet, 2011). While the concept of similarity between objects can be abstract, it is typically captured by embedding objects into a vector space equipped with a distance metric that conforms to the similarity observations. Similarity comparisons of the form "object A is more similar to B than to C" are a commonly used form of supervision for learning such embeddings (Agarwal et al., 2007; Tamuz et al., 2011; van der Maaten and Weinberger, 2012).
Although judging similarity comparisons is easier than rating similarity on an absolute scale (Kendall and Gibbons, 1990), there is sometimes ambiguity in how similarity is measured. Consider the problem of comparing three birds (see Fig. 1.1): the answer to the question "is bird A more similar to B or to C?" could depend on what aspect of the bird one is interested in. Most annotators will say that the head of bird A is more similar to the head of B, while the back of A is more similar to C. In the presence of multiple notions of similarity, it is natural to treat observations from different perspectives separately.
To address this issue, we study the setting where the similarity observation takes the form "from the t-th perspective, A is more similar to B than to C". These view-specific similarities can be seen as an extension of attributes that can be collected without crisply defining attribute vocabularies, and they can encode continuous and multidimensional structures such as color and pose. Similarity in such cases can be expressed as a view-specific low-dimensional embedding, e.g., color as a (red, green, blue) vector. In addition to making the annotation task simpler, multiple perceptual similarities can also enable precise feedback for human-in-the-loop tasks, such as content-based image retrieval (Cox et al., 2000) and interactive fine-grained recognition (Wah et al., 2014a).
The main drawback of learning view-specific embeddings independently is that one needs similarity comparisons for each view. Even learning a single embedding of N objects may require O(N³) triplet comparisons, and collecting this many for each view can be expensive, especially when experts are involved.
Figure 1.1: Ambiguity in similarity. A is more similar to C than B when focusing on the back (middle row), but is more similar to B than C when focusing on the head (bottom row).
We propose a framework for learning embeddings jointly that addresses this drawback. The key intuition is that while different notions of similarity may produce contradicting triplet relations between some instances, they could be related for most of them. For instance, photos of a car taken from different angles are correlated, as they can be considered projections of a 3D model onto different planes. In a symphony, the music played by the individual instruments of the orchestra could be related, as the parts may share some melody or rhythm. Our method exploits such correlation between views by constructing a unified embedding space where the different notions of similarity are treated as its subspaces.
We perform experiments on a synthetic dataset and three real datasets from different domains,
namely, poses of airplanes, features from face images (PubFig dataset, Kumar et al., 2009), and
crowd-sourced similarities collected on different body parts of birds (CUB dataset, Welinder et al.,
2010). For a given amount of training data per view, the proposed joint learning approach obtains
lower triplet generalization error compared to the naive independent learning approach on most
datasets. The proposed joint learning approach also tends to obtain better cluster structures at the category level, as measured by the leave-one-out classification error. On datasets where the different similarity metrics are highly related, the joint embeddings are significantly better.
The rest of this thesis is structured as follows: in Chapter 2, we review the literature of learning
perceptual similarity from relative comparisons. We formulate the problem of multiple-metric
triplet embedding as an extension to learning a single similarity metric in Chapter 3. Our algorithm
to solve it is presented in the same chapter. Experiments on both synthetic and real datasets are
presented in Chapter 4. Chapter 5 lists some other work related to our problem. Discussion and
future directions are presented in Chapter 6.
CHAPTER 2
LEARNING SIMILARITY FROM RELATIVE COMPARISON
While the concept of similarity between objects can be abstract, it is typically captured by embed-
ding objects into a vector space equipped with a distance metric that conforms to the similarity
observation.
Suppose there are N objects represented as D-dimensional vectors x_1, x_2, …, x_N ∈ R^D, and we would like to learn a distance mapping d : R^D × R^D → R_{≥0} which satisfies the following properties:

1. d(x_i, x_j) ≥ 0 (non-negativity).

2. d(x_i, x_j) = d(x_j, x_i) (symmetry).

3. d(x_i, x_j) + d(x_j, x_k) ≥ d(x_i, x_k) (triangle inequality).

Notice that d here is not necessarily a metric, because it does not have to satisfy the following property:

• d(x_i, x_j) = 0 if and only if x_i = x_j (distinguishability).

If a distance mapping satisfies conditions (1), (2) and (3) above but not distinguishability, it is called a pseudometric. Throughout this work, we consider the learning of pseudometrics.
A family of metrics commonly used in the literature is the Mahalanobis metrics, which define the distance as

    d(x_i, x_j) = (x_i − x_j)^T M (x_i − x_j),    (2.1)

where M is a D×D positive semi-definite matrix. In a learning problem, when the x_i's are given, we may learn the positive semi-definite matrix M. It is common to think of this as learning a linear transformation L and letting d(x_i, x_j) = (x_i − x_j)^T L^T L (x_i − x_j), since L^T L is guaranteed to be positive semi-definite. When the x_i's are not given, we may take x_i = e_i ∈ R^N and learn a rank-D positive semi-definite matrix M ∈ R^{N×N}. Because M can be decomposed as M = L^T L for some L ∈ R^{D×N}, and Lx_i is simply the i-th column of L in this case, learning a rank-D metric M with the x_i's fixed is equivalent to taking M to be the identity matrix I_D and learning the embedding x_i's in R^D directly.
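The equivalence between learning M = L^T L and learning a linear map L can be checked numerically. The following sketch (with arbitrary illustrative values, not from the thesis) verifies that the Mahalanobis distance under M equals the squared Euclidean distance after mapping the points through L:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 5, 3
L = rng.normal(size=(d, D))   # a linear transformation (arbitrary example values)
M = L.T @ L                   # positive semi-definite by construction

xi, xj = rng.normal(size=D), rng.normal(size=D)

# Mahalanobis (squared) distance as in Eq. (2.1)
diff = xi - xj
d_mahal = diff @ M @ diff

# The same quantity as a squared Euclidean distance in the transformed space
d_euclid = np.sum((L @ xi - L @ xj) ** 2)

assert np.allclose(d_mahal, d_euclid)
```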
2.1 Learning with Relative Similarity Measure
Similarity knowledge can be described in different forms. In this work, we focus on problems
where the similarity observation takes the form of similarity triplet: “A is more similar to B than
to C”, which is a special case of similarity quadruples “A is more similar to B than C is to D”. In
this section, we review different approaches that people have taken to learn similarity with this kind
of pairwise comparison. A formal description of learning from similarity triplets will be presented
in the next section.
Different techniques have been studied for embedding data points based on similarity triplets. Agarwal et al. (2007) seek an embedding where inter-point Euclidean distances have the same ordering as a given set of dissimilarities. Tamuz et al. (2011) proposed a method to learn an embedding from crowd-sourced data alone, where the training set consists of triplet relations of the form "object A is more similar to B than to C". van der Maaten and Weinberger (2012) proposed a technique called t-Distributed Stochastic Triplet Embedding (t-STE) which fulfills a similar task but uses a Student-t kernel to model the similarity, so that the resulting embedding is more compact and clustered.
In the work of McFee and Lanckriet (2009), the authors consider the problem of learning an
embedding subject to a set of constraints in the form “objects i and j are more similar than k and
l”. They encode such ordinal constraints on a graph and exploit the global structure of the graph
to get an efficient embedding algorithm.
McFee and Lanckriet (2011) formulated a novel multiple kernel learning technique for inte-
grating heterogeneous data into a single unified similarity space. They also cope with similarity
constraints in the form of pair-wise comparison and use them as side information.
Pairwise similarity measurements can sometimes be inferred from other knowledge. For example, class labels can be treated as similarity information, as two objects belonging to the same class can naturally be considered more similar than two objects from different classes. For instance, Cui et al. (2013) developed a pairwise-constrained multiple metric learning algorithm for face recognition where similarity constraints are derived from class labels. Parameswaran and Weinberger (2010) consider the case of supervised learning and train multiple k-NN classifiers jointly by learning multiple metrics in the feature spaces. They use triplet similarity constraints inferred from the supervision information and showed that the jointly learned metrics achieve higher classification accuracies than their single-metric counterparts.
2.2 Triplet Embedding
In this section, we assume that the embeddings are learned purely from similarity triplets. More specifically, given a set of triplets S = {(i, j, k) | object i is more similar to object j than to object k}, we aim to find an embedding x_1, x_2, …, x_N ∈ R^D, for some D ≪ N, such that the pairwise comparison of Euclidean distances agrees with S, i.e., (i, j, k) ∈ S ⟹ ‖x_i − x_j‖² < ‖x_i − x_k‖². This is the setting studied by van der Maaten and Weinberger (2012) and Tamuz et al. (2011). It is also a special case of the paired-comparison setting proposed in Agarwal et al. (2007), which originates from the work of Shepard (1962a,b) and Kruskal (1964a,b). We give a brief review of these methods here.
Generalized Non-Metric Multidimensional Scaling (GNMDS)
GNMDS (Agarwal et al., 2007) seeks a low-rank kernel matrix K = X^T X such that the distance constraints are satisfied with a large margin. The trace norm of K is minimized in order to approximately minimize its rank. Although GNMDS was proposed for problems where similarity constraints are posed as "i and j are more alike than k and l", it is easily adapted to triplet constraints, as illustrated in van der Maaten and Weinberger (2012), and solves a problem of the form:

    min_K   trace(K) + C ∑_{(i,j,l)∈S} ξ_{ijl},
    s.t.    k_{jj} − 2k_{ij} − k_{ll} + 2k_{il} ≤ −1 + ξ_{ijl},
            ξ_{ijl} ≥ 0,
            K ⪰ 0,

where C is a regularization parameter and the ξ_{ijl}'s are slack variables. After the optimal K is learned, the embedding X is obtained via an SVD of K. Notice that this optimization problem is equivalent to using a hinge loss to penalize the violation of similarity constraints:

    min_K   trace(K) + C ∑_{(i,j,l)∈S} max{0, 1 + k_{jj} − 2k_{ij} − k_{ll} + 2k_{il}},
    s.t.    K ⪰ 0.
Crowd Kernel Learning (CKL)
Tamuz et al. (2011) proposed the Crowd Kernel Learning method to deal with training triplets (i, j, l) collected from annotators without domain expertise. They consider receiving the feedback "i is more similar to j than to l" from a random crowd member as a random event which occurs with a certain probability. In the learning stage, the goal is to minimize the empirical log-loss log(1/p_{ijl}), where a higher probability p_{ijl} means the triplet (i, j, l) is modelled better. They take a scale-invariant model and define the probability p_{ijl} as

    p_{ijl} = (‖x_i − x_l‖² + µ) / (‖x_i − x_j‖² + ‖x_i − x_l‖² + 2µ)
            = (k_{ii} + k_{ll} − 2k_{il} + µ) / ((k_{ii} + k_{jj} − 2k_{ij}) + (k_{ii} + k_{ll} − 2k_{il}) + 2µ),

where µ is added to avoid numerical instability and the k_{ij}'s are entries of the kernel matrix K. Their optimization problem can be written as:

    min_K   −(1/|S|) ∑_{(i,j,l)∈S} log p_{ijl},
    s.t.    k_{ii} = 1,  i = 1, 2, …, N,
            K ⪰ 0.
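The CKL probability model can be computed directly from an embedding. The sketch below (the function name and the value of µ are our own choices) also checks that the probabilities of the two orderings of a pair are complementary, and that the model is scale-invariant when µ = 0:

```python
import numpy as np

def ckl_probability(X, i, j, l, mu=0.05):
    """CKL model: probability that object i is judged more similar to j than to l.
    X is an (N, D) array of embedding vectors; mu avoids numerical instability."""
    d_ij = np.sum((X[i] - X[j]) ** 2)
    d_il = np.sum((X[i] - X[l]) ** 2)
    return (d_il + mu) / (d_ij + d_il + 2 * mu)

# Object 0 is much closer to 1 than to 2, so the probability should be near 1
X = np.array([[0.0, 0.0], [0.1, 0.0], [2.0, 0.0]])
p = ckl_probability(X, 0, 1, 2)
assert p > 0.9
# The two orderings are complementary events
assert np.isclose(p + ckl_probability(X, 0, 2, 1), 1.0)
```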
Stochastic Triplet Embedding
Based on the idea of CKL, van der Maaten and Weinberger (2012) developed the method of Stochastic Triplet Embedding. They proposed two ways to define the probability p_{ijl}:

1. STE:

    p_{ijl} = exp(−‖x_i − x_j‖²) / (exp(−‖x_i − x_j‖²) + exp(−‖x_i − x_l‖²))

2. t-STE:

    p_{ijl} = (1 + ‖x_i − x_j‖²/α)^{−(α+1)/2} / ((1 + ‖x_i − x_j‖²/α)^{−(α+1)/2} + (1 + ‖x_i − x_l‖²/α)^{−(α+1)/2})
They claimed that, during the iterations of the numerical optimization, STE and t-STE tend to penalize violated triplets and reward satisfied ones more locally, and that the influence of a triplet becomes very small once its constraint is strongly violated. Thus, these methods are much better at revealing the underlying data structure, and their experiments showed that STE and t-STE produce more clustered embeddings. See van der Maaten and Weinberger (2012) for details.
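The two probability models above can be written down in a few lines. This sketch (helper names are our own) makes it easy to compare the Gaussian and Student-t kernels on a given triplet:

```python
import numpy as np

def ste_probability(xi, xj, xl):
    """STE: Gaussian-kernel probability that i is more similar to j than to l."""
    a = np.exp(-np.sum((xi - xj) ** 2))
    b = np.exp(-np.sum((xi - xl) ** 2))
    return a / (a + b)

def t_ste_probability(xi, xj, xl, alpha=1.0):
    """t-STE: the same model with a Student-t kernel (alpha degrees of freedom)."""
    a = (1 + np.sum((xi - xj) ** 2) / alpha) ** (-(alpha + 1) / 2)
    b = (1 + np.sum((xi - xl) ** 2) / alpha) ** (-(alpha + 1) / 2)
    return a / (a + b)

# A satisfied triplet: i is closer to j than to l, so both probabilities exceed 1/2
xi, xj, xl = np.zeros(2), np.array([0.5, 0.0]), np.array([3.0, 0.0])
assert ste_probability(xi, xj, xl) > 0.5
assert t_ste_probability(xi, xj, xl) > 0.5
```

The polynomially decaying tail of the Student-t kernel is what makes the influence of a strongly violated triplet fade, matching the behaviour described above.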
From the methods reviewed above, we observe that a commonly employed strategy is to define a loss function ℓ that measures how well the embedding models a triplet (i, j, k), and to solve a minimization problem of the form

    min_{X∈R^{N×D}}   C · ∑_{(i,j,k)∈S} ℓ(‖x_i − x_j‖², ‖x_i − x_k‖²) + ‖X‖²_F,    (2.2)

where ‖·‖_F is the Frobenius norm and C > 0 is a regularization parameter.

Since the squared Euclidean distance can be expressed as ‖x_i − x_j‖² = k_{ii} − 2k_{ij} + k_{jj} using a positive semi-definite Gram matrix K ⪰ 0, the minimization problem (2.2) can be equivalently rewritten as follows:

    min_{K∈R^{N×N}}   C · ∑_{(i,j,k)∈S} ℓ(d_K(i, j), d_K(i, k)) + tr(K),    (2.3)
    s.t.              K ⪰ 0,

where d_K(i, j) = k_{ii} − 2k_{ij} + k_{jj}. Penalizing the trace of the Gram matrix can be seen as a convex surrogate for penalizing its rank (Agarwal et al., 2007; Fazel et al., 2001). After the optimal K is learned, we recover the embedding X = [x_1, …, x_N] by the decomposition K = X^T X.

Since the optimization problems (2.2) and (2.3) are equivalent up to rotation, we will interchangeably say "learning the embedding X" and "learning the metric K". However, optimization problem (2.3) is convex if the loss function ℓ is convex, while this is not true for (2.2).
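As an illustration of this common strategy, here is a minimal gradient-descent solver for the explicit formulation (2.2) with the hinge loss ℓ(d₁, d₂) = max(0, 1 + d₁ − d₂). The function name, step size, and iteration count are arbitrary choices of ours; a practical implementation would use one of the solvers cited above.

```python
import numpy as np

def triplet_embedding(triplets, N, D=2, C=1.0, lr=0.05, iters=500, seed=0):
    """Gradient descent on formulation (2.2) with the hinge loss
    l(d1, d2) = max(0, 1 + d1 - d2), using an explicit embedding X."""
    rng = np.random.default_rng(seed)
    X = 0.1 * rng.normal(size=(N, D))
    for _ in range(iters):
        G = 2 * X.copy()                          # gradient of the ||X||_F^2 term
        for i, j, k in triplets:
            d_ij = np.sum((X[i] - X[j]) ** 2)
            d_ik = np.sum((X[i] - X[k]) ** 2)
            if 1 + d_ij - d_ik > 0:               # hinge is active for this triplet
                G[i] += 2 * C * ((X[i] - X[j]) - (X[i] - X[k]))
                G[j] += 2 * C * (X[j] - X[i])
                G[k] += 2 * C * (X[i] - X[k])
        X -= lr * G
    return X

# Two triplets saying objects 0 and 1 are mutually closer to each other than to 2
X = triplet_embedding([(0, 1, 2), (1, 0, 2)], N=3)
```

After optimization, the learned distances respect the training triplets: ‖x₀ − x₁‖² < ‖x₀ − x₂‖².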
CHAPTER 3
JOINT LEARNING MULTIPLE PERCEPTUAL SIMILARITIES
While previous studies deal with the situation where there is one underlying similarity measure, we consider the case where objects are compared from a few different aspects, i.e., there exist multiple measures of similarity.

The most closely related work is that of McFee and Lanckriet (2011), Parameswaran and Weinberger (2010) and Cui et al. (2013). However, the problem we address has a more general setting and differs from the work mentioned above in two respects: 1) existing work that deals with similarity triplets learns a single similarity metric, whereas we consider the case where there exist multiple similarity measures and aim to learn embeddings under those measures simultaneously; 2) the algorithms of Parameswaran and Weinberger (2010) and Cui et al. (2013) require supervision and aim to minimize classification error, whereas we consider an embedding task based purely on constraints in the form of similarity triplets.
Finally, our approach is complementary to methods that seek to minimize user effort by ac-
tively collecting triplet comparisons (Jamieson and Nowak, 2011; Tamuz et al., 2011) and better
interfaces for triplet collection (Wilber et al., 2014).
In this chapter, we first give a formal formulation of the problem of jointly learning multiple similarities and introduce our approach to solving it. We then discuss issues such as the influence of parameters and computational complexity.
3.1 Formulation
Assume that T sets of triplets S_1, …, S_T are obtained by asking labelers to focus on a specific aspect when making pairwise comparisons. In addition, a set of triplets S_0 corresponding to a global (unspecific) notion of similarity may be available. Our goal is to obtain an embedding for the global notion of similarity as well as embeddings corresponding to the local views. To this end, we take a hybrid approach that combines (2.2) and (2.3): we use the explicit parameterization (2.2) for the global embedding of each object and the Gram-matrix-based parameterization (2.3) for modeling each view.
Let x_1, …, x_N ∈ R^D be vectors corresponding to the global embedding of the objects as above. Each local view is associated with a Gram matrix M_t, and the underlying metric is defined as a (squared) Mahalanobis distance d_{M_t}(i, j) := (x_i − x_j)^T M_t (x_i − x_j). We formulate the learning problem as follows:

    min_{x_1,…,x_N, M_1,…,M_T}   ∑_{(i,j,k)∈S_0} ℓ(‖x_i − x_j‖², ‖x_i − x_k‖²)
        + ∑_{t=1}^T ∑_{(i,j,k)∈S_t} ℓ(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k))
        + ∑_{t=1}^T γ tr(M_t) + β‖X‖²_F,    (3.1)
    s.t.  M_t ⪰ 0  (t = 1, …, T),
where ℓ is a loss function that measures how well the embedding models a triplet. We use the hinge loss in our experiments, but the proposed framework readily generalizes to other loss functions proposed in the literature, such as those of Tamuz et al. (2011) and van der Maaten and Weinberger (2012).

We employ regularization terms for both the local metrics M_t and the global embedding vectors x_i in (3.1). The traces tr(M_t) are added to the optimization objective because the trace norm is known to be effective in producing low-rank solutions (Agarwal et al., 2007; Fazel et al., 2001). The norm of the x_i's is necessary because, without this term, we could reduce the trace of M_t by scaling down M_t while scaling up the x_i's simultaneously, which would significantly weaken the effect of the trace term.

Although the objective function (3.1) is non-convex, if we fix the values of the x_i's and choose a convex loss function, e.g. the hinge loss, it becomes a convex problem with respect to the M_t's, and the M_t's can be learned independently since they appear in disjoint terms.
To minimize the objective, we update the M_t's and x_i's alternately. When the M_t's are fixed, the x_i's are updated directly via gradient descent (Algorithm 1). When the x_i's are fixed, the M_t's can be updated independently via projected gradient descent, i.e., by iteratively taking a sub-gradient step and projecting the resulting M_t onto the positive semi-definite cone (Algorithm 2). The number of dimensions D is left as a hyperparameter. The overall procedure is summarized in Algorithm 3.
Algorithm 1: Update x_i's
Input: x_i, i = 1, …, N; M_t, t = 1, …, T; β.
Output: x_i, i = 1, …, N.
Initialization: choose a proper initial step size η_0; set M_0 := I_D; set η ← η_0; set iteration counter m = 1.
while not converged do
    For n = 1, …, N, let
        g_n = 2βx_n + ∑_{t=0}^T ∑_{(i,j,k)∈S_t : n∈{i,j,k}} ∇_{x_n} ℓ(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k));
    Update x_n ← x_n − η g_n  (n = 1, …, N);
    Update counter: m ← m + 1;
    Update step size: η ← η_0 / √m;
end
Algorithm 2: Update M_t
Input: x_i, i = 1, …, N; M_t; γ.
Output: M_t.
Initialization: choose a proper initial step size η_0; set η ← η_0; set iteration counter m = 1.
while not converged do
    Take a step in the direction of the negative sub-gradient:
        G ← ∑_{(i,j,k)∈S_t} ∇_{M_t} ℓ(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)) + γ I_D,
        M_t ← M_t − ηG;
    Find the eigenvalue decomposition M_t = V Λ V^T;
    Project M_t onto the PSD cone: M_t ← V max(Λ, 0) V^T;
    Update counter: m ← m + 1;
    Update step size: η ← η_0 / √m;
end
Algorithm 3: Multiple-metric Learning
Input: N, the number of objects; D, the dimension of the embedding; S_t, t = 0, 1, …, T, triplet constraints; regularization parameters β, γ.
Output: embedding {x_i}_{i=1}^N, PSD matrices {M_t}_{t=1}^T.
Initialization: initialize the x_i's randomly; initialize each M_t as the identity matrix.
while not converged do
    Update the x_i's using Algorithm 1;
    for t = 1, …, T do
        Update M_t using Algorithm 2;
    end
end
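The projection step inside Algorithm 2 is the standard Euclidean projection onto the PSD cone: eigendecompose and clip negative eigenvalues to zero. A sketch (the helper name is ours):

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the positive semi-definite cone by
    clipping negative eigenvalues to zero, as in Algorithm 2."""
    M = (M + M.T) / 2                     # guard against numerical asymmetry
    eigvals, V = np.linalg.eigh(M)
    return (V * np.maximum(eigvals, 0.0)) @ V.T

# An indefinite matrix, e.g. the result of a sub-gradient step
M = np.array([[1.0, 0.0], [0.0, -1.0]])
P = project_psd(M)
```

Here the negative eigenvalue direction is simply zeroed out, so P = diag(1, 0) and the next iterate is again a feasible Gram matrix.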
3.2 Relation with Learning Multiple Independent Similarities
Consider the problem of embedding N objects into D-dimensional spaces from T sets of similarity triplets S_1, S_2, …, S_T. By independent learning or independent embedding, we mean independently learning T embeddings X_t, t = 1, 2, …, T, where each X_t is learned from S_t only. In contrast, our proposed model learns a global embedding X and T view-specific metrics M_t, and all the S_t's are used simultaneously. A simple parameter-counting argument shows that independent learning requires fitting O(NDT) parameters, whereas our joint learning model has O(ND + D²T) parameters, which is fewer when D < N.
By comparing the families of embeddings that can be represented by these two models, it is not difficult to see that independently embedding into T spaces, each of dimension D_ind, is a special case of jointly embedding into a TD_ind-dimensional space. Analytically, solving the optimization problem (2.2) independently on T views can be expressed as

    min_{x_1,…,x_N ∈ R^{TD_ind}}   ∑_{t=1}^T ∑_{(i,j,k)∈S_t} ℓ(d_{M_t}(x_i, x_j), d_{M_t}(x_i, x_k)) + β‖X‖²_F,

    where M_t(p, q) = 1 if p = q ∈ {(t−1)D_ind + 1, (t−1)D_ind + 2, …, tD_ind}, and 0 otherwise,

which is a special case of the joint learning problem (3.1).
The same conclusion can be reached from the other direction. Consider a multiview embedding represented by the joint embedding parameterized by X ∈ R^{D×N} and M_t ∈ R^{D×D}, t = 1, 2, …, T. If we constrain the local metrics M_t to be pairwise orthogonal, i.e., ⟨M_i, M_j⟩ = 0 for all i ≠ j, then each M_t is associated with a subspace W_t ⊂ R^D, and the W_t's associated with different M_t's are mutually orthogonal. Any x ∈ R^D can be decomposed as x = ∑_{t=1}^T y_t + z, where y_t ∈ W_t and z ∈ (⊕_t W_t)^⊥. In this special case, finding a global embedding X reduces to seeking embeddings in the W_t's independently.

In conclusion, jointly embedding in a D-dimensional space subsumes independent learning in T spaces of dimensions D_t as a special case, whenever ∑_{t=1}^T D_t ≤ D.
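The special case above can be checked numerically: with block-diagonal indicator metrics M_t, the joint Mahalanobis distance on view t coincides with the plain Euclidean distance restricted to that view's block of coordinates. The dimensions and data below are arbitrary illustrative choices:

```python
import numpy as np

T, D_ind = 2, 2
D = T * D_ind
rng = np.random.default_rng(1)
X = rng.normal(size=(4, D))              # four objects in the joint space

for t in range(T):
    # M_t is the identity on the t-th block of D_ind coordinates, zero elsewhere
    M_t = np.zeros((D, D))
    blk = slice(t * D_ind, (t + 1) * D_ind)
    M_t[blk, blk] = np.eye(D_ind)

    diff = X[0] - X[1]
    d_joint = diff @ M_t @ diff                      # joint-model distance on view t
    d_indep = np.sum((X[0, blk] - X[1, blk]) ** 2)   # independent-embedding distance
    assert np.allclose(d_joint, d_indep)
```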
3.3 Influence of Parameters
The objective function in (3.1) has two parameters, β and γ, in the regularization term. For all the loss functions mentioned in Sec. 2.2, the value of the loss depends only on the distances d_{M_t}(x_i, x_j) and d_{M_t}(x_i, x_k). For any scalar α > 0, by the definition of the Mahalanobis distance (2.1), it is easy to see that d_{M_t/α²}(αx_i, αx_j) = d_{M_t}(x_i, x_j) for all x_i and x_j. This implies that, if we substitute αx_i for x_i and M_t/α² for M_t in the objective function of (3.1), we can alter the value of the regularization term while keeping the same loss. Meanwhile, by the AM-GM inequality, the value of the regularization term is lower bounded by

    βα²‖X‖²_F + ∑_{t=1}^T (γ/α²) tr(M_t) ≥ 2 √( βγ ‖X‖²_F ∑_{t=1}^T tr(M_t) ),    (3.2)

and the lower bound is attained when α takes the value

    α = ( γ ∑_{t=1}^T tr(M_t) / (β‖X‖²_F) )^{1/4}.
On the other hand, if the M_t's and X are the optimal solution of (3.1), they must attain the lower bound above (otherwise, we could scale them using the α that gives the lower bound and reduce the objective value further). Thereby, although we have two hyperparameters β and γ, the effective regularization depends only on the product βγ. Since Algorithm 3 updates the x_i's and M_t's alternately, the lower bound in (3.2) may not be reached numerically. As a heuristic, we can add an optional step to the learning algorithm: after one update of the x_i's and M_t's, we scale them simultaneously to attain the lower bound on the regularization term. See Algorithm 4.
Algorithm 4: Multiple-metric Learning with Scaling
Input: N, the number of objects; D, the dimension of the embedding; S_t, t = 0, 1, …, T, triplet constraints; regularization parameters β, γ.
Output: embedding {x_i}_{i=1}^N, PSD matrices {M_t}_{t=1}^T.
Initialization: initialize the x_i's randomly; initialize each M_t as the identity matrix.
while not converged do
    Update the x_i's using Algorithm 1;
    for t = 1, …, T do
        Update M_t using Algorithm 2;
    end
    Scale the x_i's and M_t's:
        α ← ( γ ∑_{t=1}^T tr(M_t) / (β‖X‖²_F) )^{1/4},
        X ← αX,
        M_t ← M_t / α²  (t = 1, …, T).
end
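The scaling step of Algorithm 4 can be isolated as a small function. The sketch below (names and values are our own) verifies that the rescaling leaves every Mahalanobis distance unchanged while driving the regularization term exactly to its lower bound (3.2):

```python
import numpy as np

def rescale(X, Ms, beta, gamma):
    """Scaling step of Algorithm 4: jointly rescale X and the M_t's so the
    regularization term attains its lower bound (3.2), without changing
    any distance d_{M_t}(x_i, x_j)."""
    trace_sum = sum(np.trace(M) for M in Ms)
    alpha = (gamma * trace_sum / (beta * np.sum(X ** 2))) ** 0.25
    return alpha * X, [M / alpha ** 2 for M in Ms]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
Ms = [np.eye(3), 2 * np.eye(3)]
beta, gamma = 0.5, 1.0

X2, Ms2 = rescale(X, Ms, beta, gamma)

# Distances are invariant under the rescaling
d_before = (X[0] - X[1]) @ Ms[0] @ (X[0] - X[1])
d_after = (X2[0] - X2[1]) @ Ms2[0] @ (X2[0] - X2[1])
assert np.allclose(d_before, d_after)

# The regularization term now equals the lower bound in (3.2)
reg = beta * np.sum(X2 ** 2) + gamma * sum(np.trace(M) for M in Ms2)
bound = 2 * np.sqrt(beta * gamma * np.sum(X ** 2) * sum(np.trace(M) for M in Ms))
assert np.allclose(reg, bound)
```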
3.4 Alternative Formulations and Computational Complexity
The joint learning model (3.1) parameterizes the similarity metrics with a single global embedding and multiple local metrics M_t. Alternatively, we could model the global embedding implicitly with a positive semi-definite kernel matrix K, which can be considered a Gram matrix: K = Φ^T Φ for some Φ ∈ R^{D×N}, which can be regarded as the objects' representations in a D-dimensional space. We can then model the embedding of the t-th view with a linear transformation L_t ∈ R^{D×D}, so that L_t Φ is the embedding for the t-th view. If we choose the regularization for the L_t's appropriately, then by the generalized representer theorem (Scholkopf et al., 2001), L_t can be expressed as L_t = W_t Φ^T for some W_t ∈ R^{D×N}. The Gram matrix on the t-th view then becomes

    K_t = ((W_t Φ^T)Φ)^T (W_t Φ^T)Φ = K W_t^T W_t K,

and the distance between objects i and j can be expressed as

    d_t(i, j) = K_t(i, i) − 2K_t(i, j) + K_t(j, j)
              = (K(i, :) − K(j, :)) W_t^T W_t (K(:, i) − K(:, j)).

The problem of learning multiple similarity metrics can then be formulated as
    min_{W_1,…,W_T, K}   ∑_{(i,j,k)∈S_0} ℓ(‖K(i, :) − K(j, :)‖², ‖K(i, :) − K(k, :)‖²)
        + ∑_{t=1}^T ∑_{(i,j,k)∈S_t} ℓ(d_t(i, j), d_t(i, k)) + ∑_{t=1}^T γ‖W_t‖²_F + β‖K‖²_F,
    s.t.  K ⪰ 0.
In this formulation, the objective function is convex neither in K nor in the W_t's. Since the W_t's only appear as inner products, we could substitute M_t = W_t^T W_t, but M_t would then have size N × N. Either way, solving the optimization problem involves an eigendecomposition of the N-by-N matrix K, which can be computationally expensive.

In contrast, formulation (3.1) is convex in the M_t's and only involves eigendecompositions of the D-by-D matrices M_t, which is less expensive when D < N. However, compared with independent learning, solving the joint learning problem (3.1) with Algorithm 4 is empirically much slower, presumably because it involves alternating updates of the M_t's and x_i's.
CHAPTER 4
EXPERIMENTS
In this chapter, we test our algorithm on both synthetic data and real datasets. One of the real
datasets consists of images of airplanes in different poses, where the similarity between two poses is
defined rigorously. Another real dataset uses triplets sampled from a few kernel matrices which
are computed from attributes of facial images. The third dataset contains images of 200 species of
birds, where similarity triplets among the birds are crowd-sourced. We first give a brief explanation
of the design of the experiments and introduce the datasets. Experimental results follow.
4.1 Experimental Setup
On each dataset, we inspect the quality of embeddings learned from training sets of increasing
size. We draw roughly equal numbers of training triplets from all views and check the quality
of the embedding. In addition, we inspect how the similarity knowledge on existing views can be
“transferred” to a “new” view where the number of similarity comparisons is small. We did this
by conducting an experiment in which we draw a small set of training triplets from one view but
use large numbers of training triplets from the remaining views. The quality of embeddings is measured
in the following respects:
1. Triplet generalization error. We split triplets randomly into a training set and a test set.
The triplet generalization error is defined as the percentage of test triplets whose triplet relations
are not correctly modelled by the learned embedding.

2. Leave-one-out classification error. We hold out all information about the objects' class labels
during the training stage; only triplet constraints are used for learning the embedding. Then,
at the test stage, we choose one embedded object as the target and predict its label by revealing
the labels of its neighbours. We do this prediction for every object in turn. The leave-one-out
classification error is the percentage of objects whose labels are not correctly predicted.
Throughout the experiments, we use a 3-nearest-neighbour classifier to measure classification error.
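The leave-one-out 3-nearest-neighbour error described in item 2 can be sketched as follows (a minimal illustration with our own function names; ties are broken towards the smallest label):

```python
import numpy as np

def loo_knn_error(X, labels, k=3):
    """Leave-one-out k-NN classification error: predict each object's
    label from its k nearest neighbours (excluding itself) and report
    the fraction of wrong predictions."""
    n = len(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(D, np.inf)            # exclude the target itself
    errors = 0
    for i in range(n):
        nn = np.argsort(D[i])[:k]          # k nearest neighbours of i
        votes = np.bincount([labels[j] for j in nn])
        errors += int(votes.argmax() != labels[i])
    return errors / n
```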
van der Maaten and Weinberger (2012) showed that reducing triplet generalization error often
leads to small nearest-neighbour classification error, although a small triplet generalization error does not
necessarily guarantee a low classification error. We also inspect embeddings obtained by simultaneously
learning multiple metrics from this perspective.
Throughout the experiments, we use the hinge loss as the loss function. Our joint learning method
is compared with two baselines: independent learning and pooled single-view embedding. In independent
learning, we conduct triplet embedding on every view as if the views were independent tasks.
In pooled single-view embedding, triplets collected from all views are merged together to learn a
single embedding, as if they came from the same similarity metric. When learning a single embedding,
we adopt problem (2.2) as the objective and solve it by using the package provided by
the authors of van der Maaten and Weinberger (2012). To tune the regularization parameters, we
further split the training sets for a 5-fold cross-validation and swept over {10^{-5}, 10^{-4}, ..., 10^{5}} for
all parameters.
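For concreteness, the hinge loss on a triplet penalizes the amount by which the “dissimilar” distance fails to exceed the “similar” distance by a margin; a one-line sketch (the unit margin is our assumption for illustration):

```python
def triplet_hinge_loss(d_ij, d_ik, margin=1.0):
    """Hinge loss on a triplet 'i is closer to j than to k':
    zero when d_ik >= d_ij + margin, linear in the violation otherwise."""
    return max(0.0, margin + d_ij - d_ik)
```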
4.2 Datasets
Here are the details of the datasets:
Synthetic Data
Two synthetic datasets are generated. One consists of 200 points uniformly sampled from a 20
dimensional unit hypercube, while the other dataset is generated in a 10 dimensional space and
has 200 objects drawn from a mixture of four Gaussians, each of which has variance 1 and is randomly
centered in a hypercube with side length 10. Six views are generated on each dataset. Every view
is produced by projecting the data points onto a random five dimensional subspace. Training and test
triplets (i, j, k) are randomly sampled from all possible triplets on every view.
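The uniform synthetic construction can be sketched as follows (our own illustration; the QR orthonormalization is our choice for obtaining a random subspace):

```python
import numpy as np

def make_views(n=200, d=20, n_views=6, view_dim=5, seed=0):
    """Sample n points uniformly from a d-dimensional unit hypercube and
    build each view by projecting onto a random view_dim-dimensional
    subspace (orthonormal basis via QR)."""
    rng = np.random.default_rng(seed)
    X = rng.random((n, d))                         # uniform in [0, 1]^d
    views = []
    for _ in range(n_views):
        Q, _ = np.linalg.qr(rng.standard_normal((d, view_dim)))
        views.append(X @ Q)                        # n x view_dim projection
    return X, views
```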
Poses of Airplanes
This dataset is constructed from 200 images of airplanes from the PASCAL VOC dataset (Everingham
et al., 2010), which are annotated with 16 landmarks such as the nose tip, wing tips, etc. (Bourdev
et al., 2010). We use these landmarks to construct a pose-based similarity. Given two planes and
the positions of the landmarks in their images, pose similarity is defined as the residual error of alignment
between the two sets of landmarks under scaling and translation. We generated 5 views, each
Figure 4.1: View-specific similarities between poses of planes are obtained by considering subsets of landmarks, shown by different colored rectangles, and measuring their similarity in configuration up to a scaling and translation. E.g., view 1 consists of all landmarks.
of which is associated with a subset of these landmarks, as seen in Fig. 4.1, which shows three annotated
images from the set. The planes are highly diverse, ranging from passenger planes to fighter
jets and varying in size and form, which results in a slightly different similarity between instances for
each view. However, there is a strong correlation between the views because the underlying set of
landmarks is shared.
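The pose similarity described above, the residual of aligning two landmark sets under scaling and translation only, reduces to a small least-squares problem. A sketch (our own illustration; the thesis does not specify whether the residual is symmetrized):

```python
import numpy as np

def alignment_residual(A, B):
    """Residual of aligning landmark set A to B under scaling and
    translation only (no rotation): minimize ||s*A + t - B||_F^2."""
    Ac = A - A.mean(axis=0)                # optimal translation centers both
    Bc = B - B.mean(axis=0)
    s = np.sum(Ac * Bc) / np.sum(Ac * Ac)  # optimal least-squares scale
    return float(np.sum((s * Ac - Bc) ** 2))
```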
Additionally, we categorize the planes into five classes: “left-facing”, “right-facing”, “pointing-up”,
“pointing-down” and “facing out or facing away”, to evaluate classification error. This produces
classes with unbalanced numbers of members: about 80% of the images belong to one of
these three classes: “left-facing”, “right-facing” and “facing out or facing away”.
Public Figures Face Data
The Public Figures Face Database was created by Kumar et al. (2009). It consists of 58,797 images of
200 people. Every image is characterized by 75 real-valued attributes which describe the
appearance of the person in the image. We selected 39 of the attributes and categorized them into
5 groups according to the aspects they describe: hair, age, accessory, shape and ethnicity. We
randomly selected ten people and drew 20 images for each of them to create a dataset with 200
Figure 4.2: Perceptual similarities between bird species were collected by showing users either the full image (view 1), or crops around various parts (views 2, 3, . . . , 6). Images were taken from the CUB dataset (Welinder et al., 2010) containing 200 species of birds.
images. The similarity between instances for a given group is equal to the dot product between their
attribute vectors, where the attributes are restricted to those in the group. We describe the details
of these attributes in the Appendix. Each group is considered a local view, and the identities of the
people in the images are considered class labels.
CUB-200 Birds Data
We use the birds dataset introduced by Welinder et al. (2010), which contains images of birds categorized
into 200 species. We consider the problem of embedding these species based on similarity
feedback from human users. Similarity triplets among species were collected in a crowd-sourced
manner: each time, a user is asked to judge the similarity between an image of a bird from the
target species z_i and nine images of birds of different species {z_k}_{k∈K}, using the interface of Wilber
et al. (2014), where K is the set of all 200 species. For each display, the user partitions these nine
images into two sets, K_sim and K_dissim, with K_sim containing the birds considered similar to the target
and K_dissim the ones considered dissimilar. Such a partition is expanded into an equivalent
set of triplet constraints on the associated species, {(i, j, l) | j ∈ K_sim, l ∈ K_dissim}. Therefore, for
each user response, |K_sim| · |K_dissim| triplet constraints are acquired.
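The expansion of a single user response into triplets can be sketched as (illustrative function; the names are ours):

```python
def expand_triplets(target, k_sim, k_dissim):
    """Expand one user response into triplet constraints (i, j, l):
    target i is more similar to each j in K_sim than to each l in K_dissim."""
    return [(target, j, l) for j in k_sim for l in k_dissim]
```

Each response thus yields |K_sim| · |K_dissim| constraints, which is why a single display is an efficient way to collect triplets.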
In the setting of multiple-metric embedding, different views of the birds are obtained by cropping
regions around various parts of the bird, as shown in Fig. 4.2, and then using the same procedure
as before to collect triplet comparisons. In this dataset, there are about 100,000 triplets from
comparisons made on images of the whole bird, while each of the other 5 views
is based on a localized region of the birds (e.g., beak, breast, wing). The numbers of triplets obtained from
these views range from about 4,000 to 7,000. For testing classification error, we use a taxonomy of
the bird species provided by Welinder et al. (2010). To balance the number of objects across classes,
we manually grouped some of the classes to get 6 super-classes in total.
We note that these global and local similarity triplets were used by Wah et al. (2014a) and Wah
et al. (2014b) as a way to incorporate human feedback during recognition. However, our focus is
to learn better embeddings of the data itself by combining information across different views.
4.3 Experimental Results
Synthetic data
We first conduct experiments on the synthetic datasets. The clustered synthetic data is embedded in 5 and
10 dimensional spaces; the uniformly distributed synthetic data is embedded in 10 and 20 dimensional
spaces. Triplet generalization errors and leave-one-out 3-nearest-neighbour classification
errors are plotted in Fig. 4.3 and Fig. 4.4. The small plots on the right illustrate the view-wise test
errors while the big plot on the left shows the average across all views.
The clustered dataset was sampled from a 10 dimensional space and each local view lies in
a 5 dimensional space. Our algorithm achieves a significant improvement on small training samples
and continues to perform better than the baselines, except that independent learning obtains a
lower triplet generalization error than joint learning on training sets with nearly 40,000 triplets.
On the dataset uniformly sampled from a 20 dimensional space, joint embedding reduces
triplet generalization error faster than independent embedding and pooled embedding as the number
of training triplets increases, up to around 20,000 triplets. At that point, joint embedding in
a 10 dimensional space is outperformed by its independent learning counterpart. However, joint
embedding in a 20 dimensional space has a lower error rate than the baselines in all cases.
Poses of Airplanes
The airplanes are embedded into a 10 dimensional space using a training set that consists
of 3,000 triplets from every view. As an illustration, we project the learned global view of the
objects onto their first two principal dimensions via SVD and show the embedding in Fig. 4.5.
The visualization shows that the objects roughly lie on a circle. Meanwhile, the same figure shows a
clear clustered structure: objects belonging to each of the three dominant classes are clustered
Figure 4.3: Experimental results on synthetic data with clusters. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error.
Figure 4.4: Triplet generalization errors on uniformly distributed synthetic data.
together, except that members in the class “facing out or facing away” are separated into two
clusters.
Fig. 4.6 shows the triplet generalization errors and classification errors of the learned embeddings.
Errors on local views are shown in the small plots, while the large plots show the average across
views. Results of independent learning are included in the same figure. Our algorithm
almost always performs better at predicting both triplets and classes when using fewer than 10,000
training triplets. However, this advantage disappears as the training set becomes larger.
Public Figures Face Dataset
The 200 images are embedded into 5, 10 and 20 dimensional spaces. We draw triplets at random
from the ground-truth similarity measure to form training and test sets. See Fig. 4.7 for a visualization
of the data embedded under the metric of the first view. Triplet generalization errors and
classification errors are shown in Fig. 4.8. In terms of triplet generalization error for the 5 and 10 dimensional
embeddings, we find that joint learning performs better with few training triplets, while
its error rate exceeds that of independent learning as the training set gets larger. Joint learning and
independent learning have comparable triplet generalization errors when the embedding is conducted in a
20 dimensional space. Pooled embedding has the highest triplet generalization errors among the
three methods. In terms of leave-one-out classification error, joint learning has a lower error rate
than independent learning and pooled embedding, although pooled embedding performs best on
training sets with more than about 10,000 triplets.
Figure 4.5: The global view of embeddings of poses of planes.
CUB-200 Birds Data
The birds dataset is more challenging in the sense that it arises from a more realistic situation:
because the triplets are crowdsourced, triplet relations among all bird species are not fully available. We learn
the embedding both in a 10 dimensional space and in a 60 dimensional space. An illustration of the
embedding learned in the 60 dimensional space can be found in Fig. 4.9.

In the birds dataset, the number of available triplets from the first view is larger than the number
available from the other views. During the experiment, we first sample equal numbers of triplets from
each view to form the training set. We keep adding triplets to the training set until it has 3,000
triplets from each view. Then, we add triplets only from the first view and test the embedding on
all views.
See Fig. 4.10 for the triplet generalization errors and leave-one-out 3-nearest-neighbour classification
errors. The solid vertical line marks the moment at which we start to add training triplets only to
the first view. We found that, unlike on the previous datasets, our algorithm does not
work better on this dataset in terms of triplet generalization error. However, the triplet generalization
error of learning multiple views jointly in a 60 dimensional space is comparable to that obtained
by embedding every view independently. In terms of leave-one-out classification error, our
method obtains a lower error on all views except the first.
Figure 4.6: Results on the poses of planes dataset. Embeddings are learned in 3 and 10 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.
Figure 4.7: Illustration of public figures face data embedded under the metric of the first view: hair. Points with the same color belong to the same person. Embeddings are learned in a 10 dimensional space using 10^5 triplets and then further embedded in a 2 dimensional plane using tSNE (van der Maaten and Hinton, 2008) for visualization. Left: embeddings learned independently. Right: embeddings learned jointly.
4.4 Learning a New View (Zero-Shot Learning)
On the CUB-200 birds dataset, we simulated a setting similar to zero-shot learning (Palatucci
et al., 2009). We draw a training set containing 100 triplets from the 2nd local view and 3,000 triplets
from the other 5 views, and investigate how joint learning helps create an embedding on
the 2nd local view. The learned embedding is shown in Fig. 4.11. It is evident that the embedding
learned purely from the 100 triplets of that view has objects from different classes completely
mixed together, while the local view from the embedding learned jointly groups objects from some
classes better. For example, members of the class Passeriformes (Emberizidae) are better separated
from members of Passeriformes (Icteridae). Meanwhile, the last plot in the same figure shows the
change in triplet generalization error as we add more training triplets from the 2nd local view. We
can see that joint learning has a lower triplet generalization error when the number of training triplets
is small but, as expected, it is matched and then outperformed by independent learning as
the training set gets larger.
4.5 Discussion
We have shown through experiments that our algorithm achieves lower triplet generalization errors
on the synthetic data, the airplane poses data and the public figure faces data when the training set is relatively
small. Since similarity triplets can be expensive to obtain in some real applications, jointly learning
similarity metrics is preferable, as it can recover the underlying structure from a relatively small
Figure 4.8: Results on the public figures faces dataset. Embeddings are learned in 5, 10 and 20 dimensional spaces. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.
Figure 4.9: Illustration of CUB-200 birds data. The figure shows the data's embedding under the metric of the first view. Embeddings are learned in a 60 dimensional space using 18,000 triplets and then further embedded in a 2 dimensional plane using tSNE (van der Maaten and Hinton, 2008) for visualization. Top: embeddings learned independently. Bottom: embeddings learned jointly.
Figure 4.10: Results on the CUB-200 birds dataset. Top: triplet generalization error. Bottom: leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.
Figure 4.11: Learning a new view on the CUB-200 birds dataset. The training data contains 100 triplets from the second local view and 3,000 triplets from the other 5 views. Embeddings are learned in a 10 dimensional space and then further embedded in a 2 dimensional plane using tSNE (van der Maaten and Hinton, 2008) for visualization. Left: triplet generalization error on the second local view. Middle: embedding learned independently. Right: embedding learned jointly.
amount of training data.
Experiments showed that jointly learning multiple metrics performs better in terms of 3-nearest-neighbour
classification error in almost all cases, which implies that it has the potential to recover
the category-level structure of the data.
From the experimental results presented above, the gain in performance from joint learning
varies among the datasets. On the clustered synthetic data and the poses of planes dataset, the benefit of
joint learning is evident, and it is especially large for small numbers of training triplets. Its
performance gain on the uniformly distributed synthetic data and the public figure faces dataset is relatively
small, and very limited on the CUB-200 birds dataset. The influence may come from multiple aspects. We
discuss two different yet related factors: (1) the choice of dimensions and (2) the correlation underlying
different views.
4.5.1 Influence of Dimension
From the results on public figure faces and CUB-200 (Fig. 4.8 and 4.10), we see that the dimension of the
embedding space affects the triplet generalization errors of embeddings learned jointly, especially
when the training set is large. For example, on the public figure dataset (Fig. 4.8), joint learning
reduces the error faster than independent learning up to around 10,000 triplets. When there are more
than 10,000 triplets, the errors of joint embedding decrease monotonically as the number of dimensions
increases. Meanwhile, when we learn a 10 dimensional joint embedding for the uniform synthetic
dataset sampled from a 20 dimensional space, we see that joint learning obtains a lower triplet
generalization error on training sets of up to around 20,000 triplets but is caught up by independent learning
on larger training sets (Fig. 4.4). This can be understood as a bias induced by the joint learning.
As analysed in Sec. 3.2, joint learning has a lower complexity when the embedding dimension is
chosen the same as in independent learning. However, the experimental results have shown that in many
cases joint learning has a large performance gain even when using the same embedding dimension
as independent learning. For example, when the public figures data was embedded in a 20 dimensional
space, joint embedding continued to perform better than independent learning. Meanwhile, from
the results on the clustered synthetic data and the poses of planes data (Fig. 4.3 and 4.6), joint learning
is able to achieve lower errors. We interpret this as follows: when the model has a sufficient amount
of complexity to capture the inherent structure of the data, joint embedding makes use of triplets
from all views simultaneously, while independent learning only employs triplets from individual
views. Next, we consider one particular aspect of the underlying structure, namely the correlation
between different views.
4.5.2 Correlation Between Views
Another question of particular interest is how the underlying similarities affect the performance of
the proposed joint learning algorithm. Speculatively, joint learning would be affected by the correlation
between the similarity metrics of different views. Here, we examine such correlation from two
aspects: consistency of triplet constraints and correlation of distances.
Consistency of Triplet Constraints. We assume that there is a ground truth embedding for each
similarity metric. Since there are $3!\binom{N}{3}$ triplets on N objects and any similarity metric satisfies
exactly half of them¹, a similarity metric induces $m(N) = 3!\binom{N}{3}/2$ triplet constraints. The triplet
constraint consistency between two views is defined as the ratio of the number of triplet constraints
shared by both views to m(N). In our joint learning algorithm, the global
embedding X is trained to conform to triplet constraints over all views. Therefore, if there exists
a pair of conflicting triplet constraints, joint learning might not be able to model
both triplets well. We examine the consistency of triplet constraints on a dataset in the following
way. On datasets where we have the ground truth similarity metrics in the form of a true embedding
(e.g., synthetic data) or a true similarity kernel (e.g., public figure data), we draw 100,000 triplets
at random and check whether each triplet is satisfied or violated on each view. The triplet constraint
consistency between two similarity metrics is estimated as the percentage of triplets which received the
¹This is because if the triplet constraint (i, j, k) is satisfied under a metric, then the constraint (i, k, j) must be violated by the same metric.
same judgement under the two metrics, i.e., satisfied or violated by both. On the poses of planes
data and the CUB-200 dataset, since we do not have access to all triplet constraints, we estimate the
triplet consistency from the embeddings learned independently with the largest number of triplets. The
triplet constraint consistency for the datasets used in this work is listed in Appendix C. The average
triplet constraint consistency for each dataset is summarized in Table 4.1.
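The sampling estimate of triplet constraint consistency described above can be sketched as (our own illustration; the function name is ours):

```python
import numpy as np

def triplet_consistency(D1, D2, n_samples=100000, seed=0):
    """Estimate triplet-constraint consistency between two distance
    matrices: the fraction of random triplets (i, j, k) that receive the
    same judgement (d(i,j) < d(i,k) or not) under both metrics."""
    rng = np.random.default_rng(seed)
    n = D1.shape[0]
    agree = 0
    for _ in range(n_samples):
        i, j, k = rng.choice(n, size=3, replace=False)
        agree += int((D1[i, j] < D1[i, k]) == (D2[i, j] < D2[i, k]))
    return agree / n_samples
```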
Correlation of Distances. A metric over N objects can be fully captured by a distance matrix D
whose (i, j)-th entry is the distance between objects i and j. D can be computed from the kernel
matrix K by D(i, j) = K(i, i) − K(i, j) − K(j, i) + K(j, j). Suppose D_1 and D_2 are two
distance matrices. Since a (pseudo)metric is symmetric and so is the distance matrix, to compare
two distance matrices we may restrict to their lower triangles and define their correlation as
$$\mathrm{Corr}(D_1, D_2) = \frac{\sum_{1\le i<j\le N} \bigl(D_1(i,j)-\mu_1\bigr)\bigl(D_2(i,j)-\mu_2\bigr)}{\sqrt{\sum_{1\le i<j\le N} \bigl(D_1(i,j)-\mu_1\bigr)^2}\;\sqrt{\sum_{1\le i<j\le N} \bigl(D_2(i,j)-\mu_2\bigr)^2}}$$

where

$$\mu_t = \frac{2}{N(N-1)} \sum_{1\le i<j\le N} D_t(i,j), \qquad t = 1, 2.$$
On the synthetic data and the public figure faces dataset, we are able either to compute the distance
matrix or to access the true kernel matrix, so we can compute the distance correlation
between every pair of local metrics. For the poses of planes and CUB-200 data,
we estimate the correlation of distances from the embeddings learned independently with the largest
number of triplets. The distance correlations between every pair of views on each dataset are shown in
Appendix C. The average distance correlation for each dataset is summarized in Table 4.1.
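The distance correlation defined above, i.e. the Pearson correlation over the strictly lower-triangular entries, can be computed as (a minimal sketch, our own illustration):

```python
import numpy as np

def distance_correlation(D1, D2):
    """Pearson correlation between the strictly lower-triangular entries
    of two symmetric distance matrices."""
    idx = np.tril_indices_from(D1, k=-1)   # each unordered pair once
    v1, v2 = D1[idx], D2[idx]
    v1 = v1 - v1.mean()
    v2 = v2 - v2.mean()
    return float((v1 @ v2) / np.sqrt((v1 @ v1) * (v2 @ v2)))
```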
Relation between triplet consistency and correlation of distances. Since we aim to recover the
underlying similarity metric from triplet constraints, a natural question is: how far apart can
two metrics that satisfy the same set of triplet constraints possibly be?
Since triplet consistency and correlation of distances relate to the two sides of this problem,
we inspect the relation between them empirically, to obtain some clues to the problem. A
rigorous analysis is left as future work.
Here, an approximate relation between the two quantities is shown first. Its intuition is given
Figure 4.12: Relation between triplet consistency and correlation of distances.
subsequently.
Let δ(D_1, D_2) denote the triplet consistency between two distance metrics. We illustrate
the relation between δ(D_p, D_q) and Corr(D_p, D_q) in Fig. 4.12. It shows that the two quantities
roughly follow the relation

$$\sin\Bigl(\frac{\pi}{2}\bigl(2\,\delta(D_p, D_q) - 1\bigr)\Bigr) = \mathrm{Corr}(D_p, D_q). \tag{4.1}$$
The intuition behind this relation unfolds as follows.
Given two distance matrices D_1 and D_2 over N objects, consider the following generative
model:

1. Let (I, J) be a random variable that takes values in the space of all unordered pairs among
these N objects with equal chance.

2. Define Y_1 := D_1(I, J) and Y_2 := D_2(I, J).

By this, Corr(D_1, D_2) defined previously is exactly the correlation of Y_1 and Y_2. On the other
hand, given a sample $\{(y^{(t)}_1, y^{(t)}_2) : t = 1, \dots, m\}$ drawn from the generative model defined above,
the Kendall's τ (Kruskal, 1958) is defined by

$$\tau = \frac{2}{m(m-1)} \sum_{1\le t<t'\le m} \mathrm{sign}\Bigl[\bigl(y^{(t)}_1 - y^{(t')}_1\bigr)\bigl(y^{(t)}_2 - y^{(t')}_2\bigr)\Bigr]$$

and, as stated by Liu et al. (2012), its population version is given by

$$\tau := \mathrm{Corr}\bigl(\mathrm{sign}(Y_1 - \tilde{Y}_1),\ \mathrm{sign}(Y_2 - \tilde{Y}_2)\bigr)$$

where $\tilde{Y}_t$ is an i.i.d. copy of $Y_t$, $t = 1, 2$. Moreover,

$$\tau = P\bigl((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\bigr) - P\bigl((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) < 0\bigr).$$

Loosely speaking, we have $\tau = 2P\bigl((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\bigr) + P\bigl((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) = 0\bigr) - 1$, which is approximately $2P\bigl((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\bigr) - 1$ when the number of objects N is large.
As we defined Y_1 and Y_2 to be the distances of a random pair of objects under the metrics D_1 and D_2,
$P\bigl((Y_1 - \tilde{Y}_1)(Y_2 - \tilde{Y}_2) > 0\bigr)$ can be regarded as the consistency of quadruple relations induced
by these two metrics. Each quadruple relation is written as (A, B, C, D), meaning “A is
more similar to B than C is to D”, and it subsumes triplet constraints as special cases. If Y_1 and
Y_2 are jointly Gaussian or belong to a nonparanormal family, it is known that

$$\sin\Bigl(\frac{\pi}{2}\tau\Bigr) = \rho_{Y_1, Y_2} \tag{4.2}$$

where $\rho_{Y_1, Y_2}$ is the correlation between Y_1 and Y_2. See (Kruskal, 1958; Liu et al., 2012).
This gives the intuition for using (4.1) to approximate the relation between triplet consistency
and the correlation of pairwise distances. Note that, here, we use triplet consistency as an estimate
of quadruple consistency. Besides, Y_1 and Y_2 obtained from the generative model do not necessarily
resemble Gaussians. Empirically, however, the approximation works quite well on the datasets used in this work, as shown
in Fig. 4.12.
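Relation (4.2) can be checked with a small simulation (our own sanity check, not from the experiments above): for a jointly Gaussian pair with correlation ρ, the pairwise-sign estimate of Kendall's τ satisfies sin(πτ/2) ≈ ρ.

```python
import numpy as np

def kendall_tau(y1, y2):
    """Sample Kendall's tau via the pairwise-sign definition."""
    s1 = np.sign(y1[:, None] - y1[None, :])
    s2 = np.sign(y2[:, None] - y2[None, :])
    iu = np.triu_indices(len(y1), k=1)     # each unordered pair once
    return float(np.mean(s1[iu] * s2[iu]))

# Jointly Gaussian pair with correlation rho = 0.6;
# sin(pi/2 * tau) should then be close to rho.
rng = np.random.default_rng(0)
rho = 0.6
z = rng.standard_normal((1000, 2))
y1 = z[:, 0]
y2 = rho * z[:, 0] + np.sqrt(1 - rho ** 2) * z[:, 1]
tau = kendall_tau(y1, y2)
```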
Relating performance gain with view correlation. The performance gain is measured by the
difference between the areas under the triplet generalization error curves, normalized by that of
independent learning, in the case of the 10 dimensional embedding. Relating the performance gain to the
correlation among similarities (Table 4.1, Fig. 4.13), we see that on the clustered synthetic data, where
joint learning obtained a significant performance gain, triplet consistency and distance correlations
dataset                 | avg. triplet consistency | avg. distance correlation | performance gain (%)
synthetic (clustered)   | 0.78                     | 0.71                      | 52
synthetic (uniform)     | 0.59                     | 0.29                      | 18
poses of planes         | 0.63*                    | 0.37*                     | 24
public figure           | 0.58                     | 0.25                      | 10
CUB-200                 | 0.52*                    | 0.06*                     | 0.4

Table 4.1: Measures of similarity correlation and the performance gain of using joint learning. Entries marked with * are values estimated from independent embeddings.
Figure 4.13: Performance gain and correlation of views.
are high. On the other hand, the uniform synthetic data, the poses of planes and the public figure faces
datasets have relatively mild view correlations, and the performance gains on these datasets are not very
significant. In contrast, the triplet consistency of CUB-200 is close to random (0.5), which is probably
why the performance gain on this dataset is very limited.
CHAPTER 5
RELATED WORK
Previously, we covered some literature closely related to our formulation of jointly
learning multiple metrics from triplet constraints. In this chapter, we mention some
other work related to the problem of learning multiple similarities from pairwise
comparisons, looking at this problem from different angles.
5.1 Multitask Metric Learning with Low-Rank Tensors
Tensors are widely used in the context of multi-task learning (Evgeniou and Pontil, 2007; Romera-Paredes et al., 2013). Learning a low-rank tensor has been studied and used in many applications (Tomioka et al., 2011; Signoretto et al., 2014; Liu et al., 2013; Blankertz et al., 2008). Since T similarity metrics over a given set of N objects can be represented as kernel matrices K_t, t = 1, . . . , T, we may consider them as slices of an N × N × T tensor K, where the (i, j, t) entry of K reflects the similarity between objects i and j under the t-th view. Meanwhile, under the assumption that there is correlation among the different kernel matrices, we may seek a K with low rank; see (Kolda and Bader, 2009) for the definition of tensor rank. Let Δ^{(i,j,t)} ∈ R^{N×N×T} be the tensor with entries Δ^{(i,j,t)}(i, i, t) = Δ^{(i,j,t)}(j, j, t) = 1, Δ^{(i,j,t)}(i, j, t) = Δ^{(i,j,t)}(j, i, t) = −1, and 0 everywhere else. Then its inner product with K is equal to the (squared) distance under the t-th similarity metric, i.e.,
⟨Δ^{(i,j,t)}, K⟩ = K_{i,i,t} − K_{i,j,t} − K_{j,i,t} + K_{j,j,t}.
By this, each triplet constraint can be expressed as a linear constraint on K:

(i, j, k) ∈ S_t  ⟹  ⟨Δ^{(i,j,t)}, K⟩ < ⟨Δ^{(i,k,t)}, K⟩.
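To make the construction concrete, the following sketch builds Δ^{(i,j,t)} explicitly and checks numerically that its inner product with K reproduces the squared distance when each slice of K is the Gram matrix of an explicit embedding (the variable names and dimensions are our own illustrative choices):

```python
import numpy as np

def delta(i, j, t, N, T):
    """The indicator tensor Δ^{(i,j,t)} defined above: +1 at (i,i,t) and (j,j,t),
    -1 at (i,j,t) and (j,i,t), and 0 elsewhere."""
    D = np.zeros((N, N, T))
    D[i, i, t] = D[j, j, t] = 1.0
    D[i, j, t] = D[j, i, t] = -1.0
    return D

# Build K whose t-th slice is the Gram matrix X^T X of an explicit embedding X;
# then <Δ^{(i,j,t)}, K> should equal the squared distance ||x_i - x_j||^2.
N, T, d = 6, 2, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(d, N))
K = np.stack([X.T @ X] * T, axis=2)
i, j, t = 1, 4, 0
lhs = np.tensordot(delta(i, j, t, N, T), K, axes=3)   # full contraction <Δ, K>
rhs = np.sum((X[:, i] - X[:, j]) ** 2)
assert np.isclose(lhs, rhs)
```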
The problem of learning multiple similarity metrics can then be formulated as a tensor learning problem:

min_K  Σ_{t=1}^{T} Σ_{(i,j,k)∈S_t} ℓ(⟨Δ^{(i,j,t)}, K⟩, ⟨Δ^{(i,k,t)}, K⟩) + Ω(K),    (5.1)

s.t.  K(:, :, t) ⪰ 0  (t = 1, . . . , T),
where Ω(K) is a regularization term that induces a low-rank tensor; see (Tomioka et al., 2010) for example choices of such regularizers. Comparing (5.1) with (3.1), we can see that (3.1) can also be cast as a tensor learning problem, but the learning space is restricted to tensors with the special factored form K = M ×₁ X^⊤ ×₂ X^⊤. Here, M is a D × D × T tensor whose t-th slice is required to be a positive semi-definite matrix and corresponds to the M_t representing the metric under the t-th view, while X ∈ R^{D×N} represents the global embedding in (3.1).
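The restricted factorization is easy to state slice-wise: each slice satisfies K(:, :, t) = X^⊤ M_t X, the kernel induced by metric M_t on the shared embedding X. A small sketch with made-up dimensions (illustrative only):

```python
import numpy as np

# Sketch of the factored form K = M ×_1 X^T ×_2 X^T: the t-th slice of K is
# X^T M_t X, the kernel induced by metric M_t on the shared embedding X.
D_, N, T = 4, 7, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(D_, N))                 # global embedding, X in R^{D×N}
A = rng.normal(size=(T, D_, D_))
M = np.einsum('tab,tcb->act', A, A)          # slice M[:, :, t] = A_t A_t^T is PSD

# K[i, j, t] = sum_{a,b} X[a, i] * M[a, b, t] * X[b, j]
K = np.einsum('ai,abt,bj->ijt', X, M, X)
assert np.allclose(K[:, :, 1], X.T @ M[:, :, 1] @ X)
```

Because every slice shares the same X, the tensor K automatically has low multilinear rank (at most D along the first two modes), which is the structural link to the low-rank formulation (5.1).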
5.2 Relation between Triplet Embedding and Metric Learning
In the literature, embedding with pairwise comparisons is sometimes called ordinal embedding
(Alon et al., 2008; von Luxburg et al., 2014; Terada and Luxburg, 2014) or partial order embedding
(McFee and Lanckriet, 2009).
von Luxburg et al. (2014) consider the problem of recovering the embeddings of n objects from pairwise comparisons. They proved that, asymptotically, if we have knowledge of the ordinal relationships for all quadruples (i, j, k, l), then as n → ∞ the set of embedded points converges to the set of original points, up to similarity transformations such as rotations, translations, rescalings, and reflections. Building on that, further work by Terada and Luxburg (2014) investigates the consistency of local ordinal embedding (LOE), which only uses triplet constraints that reflect k-nearest-neighbour information. More specifically, in the setting of LOE, a triplet constraint (i, j, l) implies that j is a k-nearest-neighbour of i but l is not. The authors proved that, under certain conditions, it is possible to reconstruct the point set x1, . . . , xn asymptotically, knowing only the k-nearest neighbours of each point. They also proposed a Soft Ordinal Embedding algorithm that recovers not only the ordinal constraints but also the density structure underlying the data set.
Jamieson and Nowak (2011) derived a lower bound on the minimum number of queries of triplet relations (i, j, k) needed to determine the embedding. They proved that at least Ω(dn log n) such comparisons are needed to determine the embedding of n objects into a d-dimensional space, and that this lower bound cannot be achieved by randomly choosing pairwise comparisons. The work
of McFee and Lanckriet (2009) suggested that it might be hard to find the minimal dimension needed to produce an embedding that satisfies all the constraints. They showed that it is NP-complete to decide whether a given set of ordinal constraints C can be satisfied in R^1.
Alon et al. (2008) studied the problem of minimum-relaxation ordinal embedding. Roughly speaking, given a distance function D(i, j), an ordinal embedding with relaxation α is an embedding that satisfies D(i, j) > αD(p, q) ⟹ d(x_i, x_j) > d(x_p, x_q), where the x_i are the learned embeddings and d(·, ·) is the metric of the embedding space. They also established that the problem of ordinal embedding has many qualitative differences from metric embedding, which aims to minimize distortion.
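For a given embedding, the smallest feasible relaxation can be evaluated by brute force over all pairs of pairs: whenever the embedded order is not strict, i.e., d(x_i, x_j) ≤ d(x_p, x_q), the implication forces α ≥ D(i, j)/D(p, q). The O(n⁴) sketch below illustrates that definition; it is our own didactic check, not the algorithm studied by Alon et al.:

```python
import numpy as np
from itertools import combinations

def min_relaxation(D, X):
    """Smallest α for which X is an ordinal embedding of distances D with
    relaxation α: whenever D(i,j) > α·D(p,q) we need ||x_i-x_j|| > ||x_p-x_q||,
    so α must dominate D(i,j)/D(p,q) on every pair-pair whose embedded order
    is not strict. Brute-force over all O(n^4) pair-pairs."""
    pairs = list(combinations(range(len(X)), 2))
    emb = {p: np.linalg.norm(X[p[0]] - X[p[1]]) for p in pairs}
    alpha = 0.0
    for (i, j) in pairs:
        for (p, q) in pairs:
            if emb[(i, j)] <= emb[(p, q)] and D[p, q] > 0:
                alpha = max(alpha, D[i, j] / D[p, q])
    return alpha
```

An embedding that reproduces D exactly attains relaxation 1, and any distortion of the order pushes the minimum relaxation above 1.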
CHAPTER 6
CONCLUSION AND FUTURE WORK
In this work, we introduced our formulation of the problem of jointly learning multiple similarity metrics, a natural extension of the conventional independent learning model. The proposed model consists of a global view, which represents each object as a fixed-dimensional vector, and local views, which specify each view-specific Mahalanobis metric as a positive semi-definite matrix. Its performance was studied empirically on both synthetic and real-world datasets. The experimental results demonstrated that, on datasets with a higher correlation between similarity metrics, joint learning performs better than independent learning in terms of triplet generalization error, especially when the number of training triplets is small. This suggests that joint learning would be preferable for applications where triplet relations are expensive to obtain. Although the results are presented for the hinge loss (Agarwal et al., 2007), the proposed algorithm generalizes easily to other loss functions, e.g., the CKL loss (Tamuz et al., 2011) and the t-STE loss (van der Maaten and Weinberger, 2012). Meanwhile, joint learning tends to have a lower leave-one-out classification error in most situations, which implies that it could support supervised learning tasks such as learning a classifier. As future work, we aim to study how to use embeddings learned from similarity comparisons for classification tasks where data is partially labelled.
Some preliminary analysis shows that the underlying correlation among similarity metrics notably affects the performance of joint learning. Based on our empirical study, to achieve performance comparable to independent learning, joint learning requires a higher-dimensional embedding space on datasets where views are less correlated. It would be interesting to find a way of estimating the amount of view correlation and thereby provide guidance for choosing the dimension of the embedding space.
Moreover, as mentioned in Section 5.1, learning multiple metrics from triplet constraints can be formulated as learning a low-rank tensor under linear constraints. We may explore this approach and use low-rank tensors as a tool for modelling correlated similarity metrics. Resorting to existing studies on tensors, we might gain further insight into learning multiple metrics under triplet constraints.
On the other hand, embeddings learned from joint learning can be used in higher-level machine learning tasks, such as fast retrieval and recognition. We may integrate our joint learning framework with those tasks and study the performance of such systems in real applications, such as music recommendation (McFee et al., 2012), word embedding in language (Mikolov et al., 2013), and interactive fine-grained recognition (Wah et al., 2014a). Since triplet relations can be expensive to obtain in many applications, when building such a system it would also be interesting to develop an active triplet sampling algorithm, as an extension to (Jamieson and Nowak, 2011), so that the queried triplets are the most informative ones.
APPENDIX A
VIEWS OF POSES OF AIRPLANES DATASET
Each of the 200 airplanes was annotated with 16 landmarks, namely:

01. Top Rudder      05. L WingTip     09. Nose Bottom        13. Left Engine Back
02. Bot Rudder      06. R WingTip     10. Left Wing Base     14. Right Engine Front
03. L Stabilizer    07. NoseTip       11. Right Wing Base    15. Right Engine Back
04. R Stabilizer    08. Nose Top      12. Left Engine Front  16. Bot Rudder Front
This is also illustrated in the Figure A.1. The five different views are defined by considering
different subsets of landmarks as follows:
1. all ∈ {1, 2, . . . , 16}

2. back ∈ {1, 2, 3, 4, 16}

3. nose ∈ {7, 8, 9}

4. back+wings ∈ {1, 2, . . . , 6, 10, 11, . . . , 16}

5. nose+wings ∈ {5, 6, . . . , 15}
For a triplet (A, B, C), we compute similarities s_i(A, B) and s_i(A, C) by aligning the subset i of landmarks of B and C to A under a translation and scaling that minimizes the sum of squared errors after alignment. The similarity is inversely proportional to the residual error. This is also known as “Procrustes analysis”, commonly used for matching shapes.
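For translation and scaling only (no rotation, matching the description above), the alignment has a closed form: center both landmark sets, then take the least-squares scale factor. The sketch below illustrates this residual computation; since the thesis does not specify the exact normalization, the `similarity` function here is one plausible choice, not the thesis implementation:

```python
import numpy as np

def alignment_residual(P, Q):
    """Sum of squared errors after aligning landmark set P (k x 2) to Q (k x 2)
    with the best translation and uniform scale (no rotation)."""
    Pc = P - P.mean(axis=0)                  # centering removes the translation
    Qc = Q - Q.mean(axis=0)
    s = np.sum(Pc * Qc) / np.sum(Pc * Pc)    # least-squares scale factor
    return np.sum((Qc - s * Pc) ** 2)

def similarity(P, Q, eps=1e-8):
    # "inversely proportional to the residual error" -- one plausible choice
    return 1.0 / (alignment_residual(P, Q) + eps)
```

A landmark set that is a translated and rescaled copy of another aligns with (near-)zero residual, hence very high similarity.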
Figure A.1: Landmarks illustrated on several planes.
APPENDIX B
ATTRIBUTES OF PUBLIC FIGURES FACE DATASET
Each image in the Public Figures Face Dataset (Pubfig)1 is characterized by 75 attributes. We used 39 of the attributes in our work and categorized them into 5 groups according to the aspects they describe. Here is a table of the categories and attributes:

Category    Attributes
Hair        Black Hair, Blond Hair, Brown Hair, Gray Hair, Bald, Curly Hair, Wavy Hair, Straight Hair, Receding Hairline, Bangs, Sideburns.
Age         Baby, Child, Youth, Middle Aged, Senior.
Accessory   No Eyewear, Eyeglasses, Sunglasses, Wearing Hat, Wearing Lipstick, Heavy Makeup, Wearing Earrings, Wearing Necktie, Wearing Necklace.
Shape       Oval Face, Round Face, Square Face, High Cheekbones, Big Nose, Pointy Nose, Round Jaw, Narrow Eyes, Big Lips, Strong Nose-Mouth Lines.
Ethnicity   Asian, Black, White, Indian.

Table B.1: List of Pubfig attributes that were used in this work.
1Available at http://www.cs.columbia.edu/CAVE/databases/pubfig/
APPENDIX C
CORRELATION BETWEEN SIMILARITY METRICS ON MULTIPLE
DATASETS
view 2   0.712220
view 3   0.754380   0.720280
view 4   0.770390   0.842020   0.782490
view 5   0.780540   0.779740   0.703990   0.799840
view 6   0.728800   0.846140   0.759320   0.847380   0.810520
         view 1     view 2     view 3     view 4     view 5

Table C.1: Consistency of triplet constraints between different views in synthetic data (clustered).
view 2   0.558910
view 3   0.566690   0.703850
view 4   0.635800   0.506790   0.502940
view 5   0.636420   0.622560   0.708390   0.570290
view 6   0.567020   0.624570   0.569100   0.496120   0.629680
         view 1     view 2     view 3     view 4     view 5

Table C.2: Consistency of triplet constraints between different views in synthetic data (uniformly distributed).
view 2   0.497920
view 3   0.493910   0.709840
view 4   0.489880   0.712820   0.660470
view 5   0.498180   0.800280   0.715140   0.683480
         view 1     view 2     view 3     view 4

Table C.3: Consistency of triplet constraints between different views in the poses of planes dataset, estimated from independent embedding.
view 2   0.577800
view 3   0.641800   0.592700
view 4   0.578000   0.575200   0.616400
view 5   0.582300   0.531000   0.546300   0.546100
         view 1     view 2     view 3     view 4

Table C.4: Consistency of triplet constraints between different views in the public figures dataset.
view 2   0.501170
view 3   0.491470   0.506750
view 4   0.514880   0.509960   0.553930
view 5   0.493880   0.539450   0.531050   0.550690
view 6   0.494220   0.527550   0.541820   0.512840   0.516650
         view 1     view 2     view 3     view 4     view 5

Table C.5: Consistency of triplet constraints between different views in the CUB-200 dataset, estimated from independent embedding.
view 2   0.488729
view 3   0.665361   0.549823
view 4   0.708692   0.838732   0.775284
view 5   0.680630   0.762995   0.579146   0.776799
view 6   0.556607   0.886884   0.652158   0.874139   0.819197
         view 1     view 2     view 3     view 4     view 5

Table C.6: Distance correlation between different views in synthetic data (clustered).
view 2   0.191292
view 3   0.217535   0.588245
view 4   0.430519   0.025099   0.008072
view 5   0.426157   0.393445   0.620055   0.223425
view 6   0.200978   0.375466   0.231990   -0.004176  0.402714
         view 1     view 2     view 3     view 4     view 5

Table C.7: Distance correlation between different views in synthetic data (uniformly distributed).
view 2   -0.012479
view 3   -0.023239  0.611187
view 4   -0.028532  0.665929   0.512171
view 5   -0.006959  0.815321   0.616287   0.571281
         view 1     view 2     view 3     view 4

Table C.8: Distance correlation between different views in the poses of planes dataset, estimated from independent embedding.
view 2   0.240323
view 3   0.410561   0.194059
view 4   0.252613   0.228472   0.307446
view 5   0.296265   0.101105   0.216864   0.249339
         view 1     view 2     view 3     view 4

Table C.9: Distance correlation between different views in the public figures dataset.
view 2   -0.007114
view 3   -0.028691  0.017077
view 4   0.011462   0.046835   0.202294
view 5   -0.014319  0.139694   0.127920   0.174904
view 6   -0.031425  0.041847   0.142853   0.007505   0.018800
         view 1     view 2     view 3     view 4     view 5

Table C.10: Distance correlation between different views in the CUB-200 dataset, estimated from independent embedding.
REFERENCES

Agarwal, S., Wills, J., Cayton, L., Lanckriet, G., Kriegman, D. J., and Belongie, S. (2007). Generalized non-metric multidimensional scaling. In International Conference on Artificial Intelligence and Statistics, pages 11–18.

Alon, N., Badoiu, M., Demaine, E. D., Farach-Colton, M., Hajiaghayi, M., and Sidiropoulos, A. (2008). Ordinal embeddings of minimum relaxation: general properties, trees, and ultrametrics. ACM Transactions on Algorithms (TALG), 4(4):46.

Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., and Muller, K.-R. (2008). Optimizing spatial filters for robust EEG single-trial analysis. Signal Processing Magazine, IEEE, 25(1):41–56.

Bourdev, L., Maji, S., Brox, T., and Malik, J. (2010). Detecting people using mutually consistent poselet activations. In European Conference on Computer Vision (ECCV).

Cox, I. J., Miller, M. L., Minka, T. P., Papathomas, T. V., and Yianilos, P. N. (2000). The Bayesian image retrieval system, PicHunter: theory, implementation, and psychophysical experiments. Image Processing, IEEE Transactions on, 9(1):20–37.

Cui, Z., Li, W., Xu, D., Shan, S., and Chen, X. (2013). Fusing robust face region descriptors via multiple metric learning for face recognition in the wild. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 3554–3561. IEEE.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338.

Evgeniou, A. and Pontil, M. (2007). Multi-task feature learning. Advances in Neural Information Processing Systems, 19:41.

Fazel, M., Hindi, H., and Boyd, S. P. (2001). A rank minimization heuristic with application to minimum order system approximation. In Proc. of the American Control Conference.

Jain, P., Kulis, B., and Grauman, K. (2008). Fast image search for learned metrics. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE.

Jamieson, K. G. and Nowak, R. D. (2011). Low-dimensional embedding using adaptively selected ordinal data. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pages 1077–1084. IEEE.

Kendall, M. G. and Gibbons, J. D. (1990). Rank Correlation Methods. Edward Arnold. Oxford University Press.

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3):455–500.

Kruskal, J. B. (1964a). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29(1):1–27.

Kruskal, J. B. (1964b). Nonmetric multidimensional scaling: a numerical method. Psychometrika, 29(2):115–129.

Kruskal, W. H. (1958). Ordinal measures of association. Journal of the American Statistical Association, 53(284):814–861.

Kumar, N., Berg, A. C., Belhumeur, P. N., and Nayar, S. K. (2009). Attribute and simile classifiers for face verification. In Computer Vision, 2009 IEEE 12th International Conference on, pages 365–372. IEEE.

Liu, H., Han, F., Yuan, M., Lafferty, J., Wasserman, L., et al. (2012). High-dimensional semiparametric Gaussian copula graphical models. The Annals of Statistics, 40(4):2293–2326.

Liu, J., Musialski, P., Wonka, P., and Ye, J. (2013). Tensor completion for estimating missing values in visual data. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(1):208–220.

McFee, B., Barrington, L., and Lanckriet, G. (2012). Learning content similarity for music recommendation. Audio, Speech, and Language Processing, IEEE Transactions on, 20(8):2207–2218.

McFee, B. and Lanckriet, G. (2009). Partial order embedding with multiple kernels. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 721–728. ACM.

McFee, B. and Lanckriet, G. (2011). Learning multi-modal similarity. The Journal of Machine Learning Research, 12:491–523.

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems, pages 1410–1418.

Parameswaran, S. and Weinberger, K. Q. (2010). Large margin multi-task metric learning. In Advances in Neural Information Processing Systems, pages 1867–1875.

Romera-Paredes, B., Aung, H., Bianchi-Berthouze, N., and Pontil, M. (2013). Multilinear multitask learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1444–1452.

Scholkopf, B., Herbrich, R., and Smola, A. J. (2001). A generalized representer theorem. In Computational Learning Theory, pages 416–426. Springer.

Shepard, R. N. (1962a). The analysis of proximities: Multidimensional scaling with an unknown distance function. I. Psychometrika, 27(2):125–140.

Shepard, R. N. (1962b). The analysis of proximities: Multidimensional scaling with an unknown distance function. II. Psychometrika, 27(3):219–246.

Signoretto, M., Dinh, Q. T., De Lathauwer, L., and Suykens, J. A. (2014). Learning with tensors: a framework based on convex optimization and spectral regularization. Machine Learning, 94(3):303–351.

Tamuz, O., Liu, C., Belongie, S., Shamir, O., and Kalai, A. T. (2011). Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033.

Terada, Y. and Luxburg, U. V. (2014). Local ordinal embedding. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 847–855.

Tomioka, R., Hayashi, K., and Kashima, H. (2010). Estimation of low-rank tensors via convex optimization. arXiv preprint arXiv:1010.0789.

Tomioka, R., Suzuki, T., Hayashi, K., and Kashima, H. (2011). Statistical performance of convex tensor decomposition. In Advances in Neural Information Processing Systems, pages 972–980.

van der Maaten, L. and Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605.

van der Maaten, L. and Weinberger, K. (2012). Stochastic triplet embedding. In Machine Learning for Signal Processing (MLSP), 2012 IEEE International Workshop on, pages 1–6. IEEE.

von Luxburg, U. et al. (2014). Uniqueness of ordinal embedding. In Proceedings of The 27th Conference on Learning Theory, pages 40–67.

Wah, C., Horn, G. V., Branson, S., Maji, S., Perona, P., and Belongie, S. (2014a). Similarity comparisons for interactive fine-grained categorization. In Computer Vision and Pattern Recognition.

Wah, C., Maji, S., and Belongie, S. (2014b). Learning localized perceptual similarity metrics for interactive categorization. In Human-Machine Communication for Visual Recognition and Search Workshop, ECCV.

Weinberger, K. Q. and Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 10:207–244.

Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., and Perona, P. (2010). Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.

Wilber, M. J., Kwak, I. S., and Belongie, S. J. (2014). Cost-effective HITs for relative similarity comparisons. arXiv preprint arXiv:1404.3291.