Pattern Recognition 75 (2018) 302–314
OPML: A one-pass closed-form solution for online metric learning
Wenbin Li a, Yang Gao a,∗, Lei Wang b, Luping Zhou b, Jing Huo a, Yinghuan Shi a
a National Key Laboratory for Novel Software Technology, Nanjing University, China
b School of Computing and Information Technology, University of Wollongong, Australia
Article info
Article history:
Received 28 September 2016
Revised 12 January 2017
Accepted 8 March 2017
Available online 9 March 2017
Keywords:
One-pass
Online metric learning
Triplet construction
Face verification
Abnormal event detection
Abstract
To achieve a low computational cost when performing online metric learning on large-scale data, we present a one-pass closed-form solution, namely OPML. OPML first adopts a one-pass triplet construction strategy, which uses only a very small number of triplets to approximate the representation ability of the full triplet set constructed by batch-manner methods. OPML then employs a closed-form solution to update the metric for each newly arriving sample, which leads to low space (i.e., O(d)) and time (i.e., O(d^2)) complexity, where d is the feature dimensionality. In addition, an extension of OPML (namely COPML) is proposed to enhance robustness when, as often happens in practice, the first several samples come from the same class (i.e., the cold start problem). In the experiments, we systematically evaluate OPML and COPML on three typical tasks, including UCI data classification, face verification, and abnormal event detection in videos, so as to assess the proposed methods across different sample sizes, feature dimensionalities and feature extraction schemes (i.e., hand-crafted and deeply learned). The results show that OPML and COPML achieve promising performance at a very low computational cost. The effectiveness of COPML under the cold start setting is also experimentally verified.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
In computer vision and machine learning, learning a meaningful distance/similarity metric on the original feature representation of samples, subject to given distance constraints (either pairwise similar/dissimilar constraints or triplet-based relative distance constraints), is regarded as a crucial and challenging problem, and it has been actively studied for decades. According to the measure function used (e.g., the Mahalanobis distance function or the bilinear similarity function), current metric learning methods can be roughly divided into two categories: Mahalanobis distance-based methods [1–5] and bilinear similarity-based methods [6–8]. The first class, Mahalanobis distance-based methods, learns a pairwise real-valued distance function parameterized by a symmetric Positive Semi-Definite (PSD) matrix. The second class, bilinear similarity-based methods, learns a bilinear similarity function that does not need to impose the PSD constraint on the learned metric.
∗ Corresponding author.
E-mail addresses: [email protected] (W. Li), [email protected] (Y. Gao),
[email protected] (L. Wang), [email protected] (L. Zhou), [email protected]
(J. Huo), [email protected] (Y. Shi).
http://dx.doi.org/10.1016/j.patcog.2017.03.016
0031-3203/© 2017 Elsevier Ltd. All rights reserved.
Recently, instead of the batch manner, learning the metric in an online manner, which is referred to as online metric learning (OML), has attracted a lot of interest, with the goal of learning a discriminative metric from partially observed data in order to deal efficiently with large-scale learning problems. Generally, to reach online processing speed on large-scale problems, OML methods must address two core issues: (1) how to quickly construct triplets (or pairs) from the original data, especially for large-scale data, and (2) how to quickly update the metric with newly arriving samples in real time.

For fast triplet (or pair) construction (the first issue), existing OML methods usually assume that pairwise or triplet constraints can be obtained in advance [9], or employ a random sampling strategy to reduce the number of triplets [6]. However, in real applications it is usually infeasible to access the entire training set at a time; especially when the training set is relatively large, constructing the constraints is both time- and space-consuming. To this end, we propose a novel one-pass triplet construction strategy to rapidly construct triplets in an online manner. In particular, the strategy selects the two latest samples, from the same and from a different class of the currently available samples respectively, to construct a triplet. Compared with the Online Algorithm for Scalable Image Similarity (OASIS) [6], which utilizes a random sampling strategy and stores the entire training data in memory with space complexity of O(md) (d is the feature dimensionality, and m is the data size,
Table 1
The comparison of different OML methods. MA/BI denotes the Mahalanobis distance-based/bilinear similarity-based method, respectively. The last 3 columns give the processing time (in ms) per sample at different dimensions d.

Method  Type  Constraint      Solution     d=21   d=64   d=310
POLA    MA    Pair            approximate  8      6.2    120
RDML    MA    Pair            approximate  0.040  0.043  2.4
LEGO    MA    Pair            closed-form  0.159  0.472  47
OASIS   BI    Triplet         closed-form  0.029  0.028  9.4
SOML    BI    Triplet         approximate  0.032  0.094  21
OPML    MA    Triplet         closed-form  0.026  0.023  1.7
COPML   MA    Pair & Triplet  closed-form  0.027  0.024  1.7
which is very large for large-scale data), our one-pass strategy vastly reduces the space complexity to O(cd) (c is the total number of classes, which is usually small). Also, the time complexity of our triplet construction strategy is O(1), which is truly fast.
For fast metric updating (the second issue), several studies [6,10] adopt a closed-form solution for accurate computation. Among them, OASIS adopts bilinear similarity learning and has a closed-form solution, but it lacks the good interpretability of Mahalanobis distance metric learning (i.e., a linear projection) and the learned similarity function is asymmetric. In contrast, LogDet Exact Gradient Online (LEGO) [10] learns a Mahalanobis distance and has a closed-form solution. In addition, thanks to its LogDet regularization, LEGO is not required to maintain the PSD constraint, which is time-consuming in some Mahalanobis distance metric learning methods [9,11,12]. However, LEGO is designed for pairwise constraints. Compared with LEGO, we develop a different Mahalanobis distance-based OML method for triplet-based constraints, named OPML, which also has a closed-form solution and does not need projection steps to maintain the PSD constraint. Specifically, the proposed OPML directly learns the transformation matrix L ∈ R^{d×d} (M = L^T L is the symmetric PSD matrix usually learnt in Mahalanobis metric learning); such a setting does not require imposing the PSD constraint. By carefully analyzing the structure of the triplet-based loss and using a few fundamental properties (Lemmas 2 and 3), a closed-form solution at each step is obtained with time complexity O(d^2). Actually, when we learn a low-rank matrix M, which means learning a rectangular matrix L ∈ R^{c×d} (c is the rank of M), the time complexity can be further reduced to O(cd). Since the rank c is difficult to choose in practice and the performance of low-rank OPML is usually worse than that of full-rank OPML, we mainly focus on learning a square matrix L in this paper. The major differences between OPML and OASIS/LEGO can be found in Table 1. Please note that in our online metric learning setting, the overall time complexity of an algorithm is influenced by three parts: the time to construct pairwise/triplet constraints, the time of each metric update, and the number of pairwise/triplet constraints used for updating. Hence, we present an integrated processing time per sample (the total time divided by the data size) instead of presenting the time complexity of the metric update of each algorithm alone, which is almost O(d^2) with different coefficients (except for SOML).
Also, in some tasks, e.g., abnormal event detection in videos, the data is usually imbalanced: the first several samples may belong to the same class, in which case the triplet construction strategy is invalid until samples from different classes appear. We call this the cold start case. To deal with the cold start issue, an extension, namely COPML, is developed in this paper. Specifically, COPML includes a pre-stage that constructs pairwise constraints from two adjacent samples (of the same class) to update the metric.
To summarize, compared with previous OML methods, the advantages of our work are as follows. First, the proposed OPML and COPML are easy to implement. Second, OPML and COPML are scalable to large datasets, with low space (i.e., O(d)) and time (i.e., O(d^2)) complexity, where d is the feature dimensionality. Third, OPML and COPML can easily be extended to handle high-dimensional data by learning a rectangular matrix L ∈ R^{c×d} with time complexity O(cd). Fourth, we derive several theoretical guarantees, including the difference bound between the metrics learned with the one-pass and batch triplet construction strategies, the average loss bound between these two strategies, and a regret bound, to guarantee the effectiveness of our methods.
The rest of this paper is organized as follows. In Section 2, we review related OML methods. Section 3 details the proposed one-pass triplet construction strategy and the OPML and COPML algorithms. In Section 4, we give the theoretical guarantees for our algorithms. The experimental results, comparisons and analysis are given in Section 5, followed by conclusions in Section 6.
2. Related work

Typically, previous online metric learning methods can be roughly classified into two categories: bilinear similarity-based and Mahalanobis distance-based. A comparison of the most related works is given in Table 1 for clarity.
Among bilinear similarity-based methods, OASIS [6], built on the Passive-Aggressive algorithm [13], learns a similarity metric for image similarity. Sparse Online Metric Learning (SOML) [7] follows a setting similar to OASIS, but learns a diagonal matrix instead of a full matrix to handle very high-dimensional cases. To deal with multi-modal data, an online kernel-based method, namely Online Multiple Kernel Similarity (OMKS), was proposed by Xia et al. [8]. All these methods are based on triplet constraints, and they assume that the constraints are available beforehand or can be randomly sampled from the entire dataset. Among them, OASIS is the most relevant to the proposed OPML, as both are Passive-Aggressive based. The main differences between OASIS and OPML are that OASIS learns a bilinear similarity metric (hard to interpret, and asymmetric) while OPML learns a Mahalanobis metric (good interpretability, and symmetric), and that OASIS randomly samples triplet constraints from the entire training set (with space complexity O(md)) while OPML constructs triplet constraints in an online manner (with space complexity O(cd) and time complexity O(1)), which makes the solutions of the objective functions largely different.
Among Mahalanobis distance-based methods, the Pseudo-Metric Online Learning Algorithm (POLA) [9] is the first OML method; it introduces successive projection operations to learn the optimal metric. LEGO [10] is an extended version of Information Theoretic Metric Learning-Online (ITML-Online) [12], built with LogDet divergence regularization. Jin et al. [11] presented a regularized OML method, namely RDML, with a provable regret bound. Also, Kunapuli and Shavlik proposed a unified approach based on composite mirror descent, named MDML [14]. These
Fig. 1. The illustration of one-pass triplet construction. x_1, x_2, ..., x_t denote the samples at the 1st, 2nd and t-th time steps, respectively (three classes shown over time).
methods are all based on pairwise constraints, and all of them assume that the pairwise constraints can be obtained in advance, except RDML, which receives two adjacent samples as a pairwise constraint at each time step. In fact, pairwise constraint-based methods can easily be converted to online pair construction by using the strategy of RDML. However, in general, triplet constraints are more effective than pairwise constraints for learning a metric [6,15–17]. In contrast, the proposed OPML is a triplet constraint-based Mahalanobis distance method. Besides, the proposed method has a closed-form solution and does not require the PSD constraint, making it more efficient.
3. The proposed method
We now first present our strategy for one-pass triplet construc-
tion, and then discuss the technical details of OPML and COPML,
respectively.
3.1. One-pass triplet construction

When dealing with large-scale data, quickly obtaining the triplets is a crucial step, since the number of triplets usually determines the tradeoff between effectiveness and time efficiency. There are several previous attempts to reduce the size of the triplet set, e.g., partial complete construction after random sampling [7], random sampling as in OASIS [6], and an online importance sampling scheme [18]. The partial complete construction strategy constructs all possible triplets after randomly sampling some positive and negative samples, which is time- and space-consuming and comes without a theoretical guarantee. The random sampling strategy is limited by its space consumption. In [18], the online importance sampling strategy must calculate the pairwise relevance scores between any two samples in advance, which is time-consuming. Inspired by the impressive scalability of one-pass strategies [19–22], we propose a one-pass triplet construction strategy, aiming to quickly obtain all the triplets in a single pass over the data. Fig. 1 illustrates the main idea. Formally, in an online manner, the sample at the t-th (t = 1, 2, ..., T) step is denoted as x_t ∈ R^d. An array H = {h_1, h_2, ..., h_c} is maintained, where c is the total number of classes, and h_k ∈ R^d (k ∈ H_label = {1, 2, ..., c}) is the latest sample of the k-th class at the current time step.

For triplet construction, our goal is to obtain a triplet ⟨x_t, x_p, x_q⟩ (x_t, x_p, x_q ∈ R^d) such that x_t and x_p belong to the same class, while x_t and x_q belong to different classes. Specifically, at the t-th time step, given x_t belonging to the k-th class, we assign h_k as x_p, and randomly pick h_{k'} (k' ∈ H_label, k' ≠ k) as x_q. Then h_k ∈ H is replaced by x_t, so that H always consists of the latest sample of each class. Please note that if x_t belongs to the (c+1)-th class, we assign x_t to h_{c+1}, and update the number of classes as well as H accordingly.

Intuitively, at the t-th time step, the triplet can be constructed as follows:
1: Input: x_t, H = {h_1, h_2, ..., h_c}, H_label = {1, 2, ..., c}
2: Output: ⟨x_t, x_p, x_q⟩
3: if label(x_t) belongs to the k-th class in H_label
4:   x_p = h_k, x_q = h_{k'}, where k' ≠ k
5:   h_k = x_t, update H
6: else
7:   add h_{c+1} = x_t to H and c + 1 to H_label
8: end if
We can observe that the space complexity of this strategy is O(cd). When the number of classes c is small, the space complexity can be regarded as O(d). Also, the time complexity of triplet construction at each time step is O(1).
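As a concrete illustration, the construction step above can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are ours, not from the paper):

```python
import random

def one_pass_triplet(x_t, label_t, H):
    """One step of the one-pass triplet construction (illustrative sketch).

    H maps each class label seen so far to the latest sample of that class,
    so the memory cost is O(cd) for c classes of d-dimensional samples.
    Returns <x_t, x_p, x_q> when a triplet can be formed, otherwise None;
    H is updated in place either way, which makes each step O(1).
    """
    triplet = None
    other_labels = [k for k in H if k != label_t]
    if label_t in H and other_labels:
        x_p = H[label_t]                      # latest sample of the same class
        x_q = H[random.choice(other_labels)]  # latest sample of another class
        triplet = (x_t, x_p, x_q)
    H[label_t] = x_t                          # x_t is now the latest of its class
    return triplet
```

Note that the first sample of a new class produces no triplet (there is no same-class partner yet), which is exactly the situation the cold start extension of Section 3.3 addresses.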
3.2. OPML

We mainly focus on Mahalanobis distance learning, which aims to learn a symmetric PSD matrix M ∈ S^{d×d}_+ (the cone of d × d real-valued symmetric PSD matrices) and can be formally defined as follows:

D_M(x_i, x_j) = sqrt((x_i − x_j)^T M (x_i − x_j)),  (1)

where x_i ∈ R^d and x_j ∈ R^d are the i-th and j-th samples, respectively. M can be mathematically decomposed as L^T L, where L ∈ R^{r×d} (r is the rank of M) denotes the transformation matrix. Then we can rewrite Eq. (1) as:

D_L(x_i, x_j) = ‖L(x_i − x_j)‖_2^2.  (2)

Our goal is to learn a transformation matrix L that satisfies the following large-margin constraint:

D_L(x_i, x_l) > D_L(x_i, x_j) + 1,  ∀ x_i, x_j, x_l ∈ R^d,  (3)

where x_i and x_j belong to the same class, while x_i and x_l belong to different classes. We define the hinge loss function as:

G((x_i, x_j, x_l); L) = max(0, 1 + D_L(x_i, x_j) − D_L(x_i, x_l)).  (4)
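For concreteness, Eqs. (1)–(4) can be written down directly in NumPy (a sketch with our own function names; note that with M = L^T L, the quantity D_L of Eq. (2) is the square of D_M in Eq. (1)):

```python
import numpy as np

def d_M(x_i, x_j, M):
    """Mahalanobis distance of Eq. (1): sqrt((x_i - x_j)^T M (x_i - x_j))."""
    diff = x_i - x_j
    return np.sqrt(diff @ M @ diff)

def d_L(x_i, x_j, L):
    """Squared form of Eq. (2): ||L(x_i - x_j)||_2^2."""
    return float(np.sum((L @ (x_i - x_j)) ** 2))

def hinge_loss(x_i, x_j, x_l, L):
    """Triplet hinge loss of Eq. (4); zero once the margin of Eq. (3) holds."""
    return max(0.0, 1.0 + d_L(x_i, x_j, L) - d_L(x_i, x_l, L))
```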
By applying the one-pass triplet construction, at the t-th time step we obtain the triplet ⟨x_t, x_p, x_q⟩. Thus, using the Passive-Aggressive algorithm [13], the online optimization is formulated as:

L_t = arg min_L Φ(L) = arg min_L (1/2)‖L − L_{t−1}‖_F^2 + (γ/2)[1 + ‖L(x_t − x_p)‖_2^2 − ‖L(x_t − x_q)‖_2^2]_+,  (5)

where γ is the regularization parameter, which is set in the range (0, 1/4) as a sufficient condition to theoretically guarantee the positive-definite property (see Lemma 2); ‖·‖_F denotes the Frobenius norm, ‖·‖_2 denotes the ℓ2-norm, and [z]_+ = max(0, z), namely the hinge loss G((x_t, x_p, x_q); L).

The optimal solution is obtained when the gradient vanishes, ∂Φ(L)/∂L = 0; hence we have:

∂Φ(L)/∂L = { L − L_{t−1} + γLA_t = 0,  [z]_+ > 0;
             L − L_{t−1} = 0,           [z]_+ = 0, }  (6)

where A_t = (x_t − x_p)(x_t − x_p)^T − (x_t − x_q)(x_t − x_q)^T ∈ R^{d×d}. Since x_t, x_p and x_q are all nonzero vectors, x_t ≠ x_p and x_t ≠ x_q, the
rank of (x_t − x_p)(x_t − x_p)^T and (x_t − x_q)(x_t − x_q)^T is 1. Hence, the rank of A_t is 1 or 2. When x_t − x_p ≠ μ(x_t − x_q), we get rank(A_t) = 2.

Lemma 1. Let M_1, M_2 be two PSD matrices, and Δ = M_1 − M_2. The eigenvalues of Δ, denoted by λ(Δ), satisfy:

−λmax(M_2) ≤ λ(Δ) ≤ λmax(M_1),  (7)

where λmax(M_1) and λmax(M_2) are the maximum eigenvalues of M_1 and M_2, respectively.

Proof. ∀ x ∈ R^d and x ≠ 0, x^T Δ x = x^T (M_1 − M_2) x. Since M_1 and M_2 are both PSD matrices, x^T M_1 x ≥ 0 and x^T M_2 x ≥ 0. Thus,

−(x^T M_2 x)/(x^T x) ≤ (x^T Δ x)/(x^T x) ≤ (x^T M_1 x)/(x^T x).  (8)

According to the Rayleigh quotient, we have λmin(Δ) ≤ (x^T Δ x)/(x^T x) ≤ λmax(Δ). Assume (x^T Δ x)/(x^T x) achieves its maximum λmax(Δ) at x = e, i.e., (e^T Δ e)/(e^T e) = λmax(Δ). By Eq. (8), we obtain

λmax(Δ) = (e^T Δ e)/(e^T e) ≤ (e^T M_1 e)/(e^T e) ≤ λmax(M_1).  (9)

Hence λmax(Δ) ≤ λmax(M_1). In the same way, we can prove that −λmax(M_2) ≤ λmin(Δ). Thus,

−λmax(M_2) ≤ λmin(Δ) ≤ λ(Δ) ≤ λmax(Δ) ≤ λmax(M_1).  (10)  □
Lemma 2. If 0 < γ < 1/4 and samples are normalized, I + γA_t is a positive definite matrix and is therefore invertible, where I ∈ R^{d×d} is the identity matrix.

Proof. Let M_1 = (x_t − x_p)(x_t − x_p)^T, M_2 = (x_t − x_q)(x_t − x_q)^T and A_t = M_1 − M_2. It is easy to see that λmax(M_1) = ‖x_t − x_p‖_2^2 and λmax(M_2) = ‖x_t − x_q‖_2^2. According to Lemma 1, −λmax(M_2) ≤ λ(A_t) ≤ λmax(M_1). Thus,

1 − γλmax(M_2) ≤ λ(I + γA_t) ≤ 1 + γλmax(M_1).  (11)

For normalized samples, namely 0 ≤ ‖x_t‖ ≤ 1, λmax(M_1) and λmax(M_2) lie in [0, 4]. Given 0 < γ < 1/4, λ(I + γA_t) ≥ 1 − γλmax(M_2) > 0. Obviously, I + γA_t is a symmetric matrix. Hence, I + γA_t is positive definite, and it is invertible.  □
According to Lemma 2, the optimal L_t can be updated as:

L_t = { L_{t−1}(I + γA_t)^{−1},  [z]_+ > 0;
        L_{t−1},                 [z]_+ = 0. }  (12)

The time complexity of the matrix inversion in Eq. (12) is O(d^3). However, the rank-2 property of A_t offers a nice way to accelerate the computation, by applying Lemma 3.
Lemma 3 ([23]). Given G and G + B as two nonsingular matrices, let B have rank r > 0. Let B = B_1 + ··· + B_r, where each B_k has rank 1, and C_{k+1} = G + B_1 + ··· + B_k is nonsingular for k = 1, ..., r, where C_1 = G. Then

(G + B)^{−1} = C_r^{−1} − g_r C_r^{−1} B_r C_r^{−1},  (13)

where, for k = 1, ..., r,

C_{k+1}^{−1} = C_k^{−1} − g_k C_k^{−1} B_k C_k^{−1},  g_k = 1 / (1 + tr(C_k^{−1} B_k)).  (14)
Theorem 1. If A_t is a rank-2 matrix and samples are normalized, then

(I + γA_t)^{−1} = I − (1/(η + β))[ηγA_t − (γA_t)^2],  (15)

where η = 1 + tr(γA_t) and β = (1/2)[(tr(γA_t))^2 − tr((γA_t)^2)].

Proof. According to Lemma 3, we set G = I, B = γA_t, and rewrite B = B_1 + B_2, where B_1 = γ(x_t − x_p)(x_t − x_p)^T and B_2 = −γ(x_t − x_q)(x_t − x_q)^T. It is obvious that B_1 and B_2 both have rank 1. Applying Lemma 3 then yields Theorem 1.  □
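The closed form of Eq. (15) is easy to sanity-check numerically against a direct matrix inverse (a small NumPy check of ours, under the assumptions of Lemma 2: normalized samples and γ ∈ (0, 1/4)):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 6, 0.2                                  # gamma in (0, 1/4)

# three normalized samples, as Lemma 2 assumes
x_t, x_p, x_q = [v / np.linalg.norm(v) for v in rng.standard_normal((3, d))]
A_t = np.outer(x_t - x_p, x_t - x_p) - np.outer(x_t - x_q, x_t - x_q)

gA = gamma * A_t                                   # rank <= 2
eta = 1.0 + np.trace(gA)
beta = 0.5 * (np.trace(gA) ** 2 - np.trace(gA @ gA))

# Eq. (15); here (gamma*A_t)^2 is formed by a full product for clarity,
# although the rank-2 structure allows O(d^2), as Eq. (16) exploits
inv_closed = np.eye(d) - (eta * gA - gA @ gA) / (eta + beta)

# agrees with the direct O(d^3) inverse used in Eq. (12)
assert np.allclose(inv_closed, np.linalg.inv(np.eye(d) + gA))
```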
By using Theorem 1 and plugging Eq. (15) back into the first case of Eq. (12), we obtain

L_t = L_{t−1} − (ηγ/(η + β))(L_{t−1}aa^T − L_{t−1}bb^T) + (γ^2/(η + β))[(a^T a)L_{t−1}aa^T − (a^T b)L_{t−1}ab^T − (b^T a)L_{t−1}ba^T + (b^T b)L_{t−1}bb^T],  (16)

where a = x_t − x_p and b = x_t − x_q.

Complexity: The time complexity of calculating η and β in Eq. (16) is O(d). In addition, a^T a, a^T b, b^T a and b^T b are scalars, each of which costs O(d). The time complexity of calculating L_{t−1}aa^T, L_{t−1}bb^T, L_{t−1}ab^T and L_{t−1}ba^T is O(d^2) each. Therefore, the overall time complexity of Eq. (16) is O(d^2). We give the pseudo-code of OPML in Algorithm 1.
Algorithm 1 OPML.
Require: (x_t, y_t)|_{t=1}^T, γ.
Ensure: L.
1: L_0 ← I.
2: for t = 1, 2, ..., T do
3:   ⟨x_t, x_p, x_q⟩ ← one-pass triplet construction
4:   if G((x_t, x_p, x_q); L_{t−1}) = 0 then
5:     L_t = L_{t−1}.
6:   else
7:     L_t ← solution by Eq. (16)
8:   end if
9: end for
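A faithful rendering of the inner step of this loop in NumPy might look as follows (a sketch; `opml_step` is our own name, and the active branch is the closed form of Eq. (16) with η and β computed from scalars to stay within O(d^2)):

```python
import numpy as np

def opml_step(L_prev, x_t, x_p, x_q, gamma=0.2):
    """One OPML update (the inner step of Algorithm 1), in O(d^2) time.

    Passive when the triplet hinge loss is zero; otherwise applies the
    closed-form update of Eq. (16), equal to L_prev @ inv(I + gamma*A_t)
    without ever forming a d x d inverse.
    """
    a, b = x_t - x_p, x_t - x_q
    La, Lb = L_prev @ a, L_prev @ b               # O(d^2) matrix-vector products
    z = 1.0 + La @ La - Lb @ Lb                   # hinge argument of Eq. (5)
    if z <= 0.0:
        return L_prev                             # [z]_+ = 0: keep the metric
    aa, bb, ab = a @ a, b @ b, a @ b              # O(d) scalars
    eta = 1.0 + gamma * (aa - bb)                 # 1 + tr(gamma * A_t)
    beta = gamma ** 2 * (ab ** 2 - aa * bb)       # 0.5[(tr gA)^2 - tr((gA)^2)]
    term1 = np.outer(La, a) - np.outer(Lb, b)     # L_{t-1} A_t
    term2 = (aa * np.outer(La, a) - ab * np.outer(La, b)
             - ab * np.outer(Lb, a) + bb * np.outer(Lb, b))
    return L_prev - (eta * gamma / (eta + beta)) * term1 \
                  + (gamma ** 2 / (eta + beta)) * term2
```

Starting from L_0 = I and feeding triplets from the one-pass construction reproduces Algorithm 1.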
3.3. Extending OPML to the cold start case

In practice, the first several available samples may all belong to the same class, which we call the cold start case. When cold start happens, since no triplet can be constructed, OPML will discard all these initial samples. To address this issue, we extend the proposed OPML to an enhanced version, namely COPML, which includes an additional pre-stage before calling OPML. Specifically, in the pre-stage, if a triplet cannot be constructed (i.e., samples from different classes are not yet available), the metric L can only be updated based on samples from the same class, for which the pairwise constraint is usually adopted. Typically, the pairwise constraint is written as ⟨x_t, x_{t+1}, y*_{t,t+1}⟩, where y*_{t,t+1} = 1 if two adjacent samples x_t ∈ R^d and x_{t+1} ∈ R^d share the same class, and y*_{t,t+1} = −1 otherwise. Actually, we only need to consider the case y*_{t,t+1} = 1, because we can update the metric by calling OPML if y*_{t,t+1} = −1 (the new sample belongs to a different class). After the pre-stage, OPML is adopted for the subsequent learning process.

Formally, in the pre-stage, when only the pairwise constraint ⟨x_t, x_{t+1}, y*_{t,t+1}⟩ can be used, the online optimization formulation
is formulated as follows:

L_t = arg min_L Φ(L) = arg min_L (1/2)‖L − L_{t−1}‖_F^2 + (γ_1/2) y*_{t,t+1} ‖L(x_t − x_{t+1})‖_2^2,  (17)

where γ_1 > 0 is the regularization parameter. The optimal solution is obtained when the gradient vanishes, ∂Φ(L)/∂L = 0; hence

∂Φ(L)/∂L = L − L_{t−1} + γ_1 y*_{t,t+1} L Δ_t = 0,  (18)

where Δ_t = (x_t − x_{t+1})(x_t − x_{t+1})^T. Obviously, Δ_t is a rank-1 PSD matrix for x_t ≠ x_{t+1}. When y*_{t,t+1} = 1, I + γ_1 y*_{t,t+1} Δ_t is a symmetric positive definite matrix and is therefore invertible. The optimal L_t is then obtained as

L_t = L_{t−1}(I + γ_1 Δ_t)^{−1}.  (19)

By the Sherman-Morrison formula, Eq. (19) can be equivalently rewritten as

L_t = L_{t−1} − γ_1 L_{t−1}(x_t − x_{t+1})(x_t − x_{t+1})^T / (1 + γ_1 (x_t − x_{t+1})^T (x_t − x_{t+1})),  (20)

whose time complexity is also O(d^2). Algorithm 2 shows the pseudo-code of COPML.
Algorithm 2 COPML.
Require: (x_t, y_t)|_{t=1}^T, γ_1, γ_2.
Ensure: L.
1: L_0 ← I, c ← 0.
2: for t = 1, 2, ..., T do
3:   Maintain H = {h_1, h_2, ..., h_c}
4:   if y_t = c = 1 then
5:     ⟨x_t, x_{t+1}, y*_{t,t+1}⟩ ← two adjacent samples
6:     L_t = L_{t−1} − γ_1 L_{t−1}(x_t − x_{t+1})(x_t − x_{t+1})^T / (1 + γ_1 (x_t − x_{t+1})^T (x_t − x_{t+1})).
7:   else if 1 ≤ y_t ≤ c and c ≥ 2 then
8:     L ← call OPML(x_t, y_t, γ_2)
9:   else
10:    h_{c+1} ← x_t
11:    c ← c + 1
12:  end if
13: end for
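The pre-stage update of line 6 can likewise be sketched in NumPy (a sketch; the function name is ours, and the update is exactly Eq. (20)):

```python
import numpy as np

def copml_prestage_step(L_prev, x_t, x_next, gamma1=0.2):
    """COPML pre-stage update of Eq. (20) for a same-class adjacent pair.

    The Sherman-Morrison formula turns the rank-1 inverse of Eq. (19)
    into an O(d^2) outer-product update.
    """
    u = x_t - x_next
    Lu = L_prev @ u
    return L_prev - (gamma1 / (1.0 + gamma1 * (u @ u))) * np.outer(Lu, u)
```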
4. Theoretical guarantee

The following theorems guarantee the effectiveness of our methods. Theorem 2 shows that the difference between the metrics learned with the one-pass and the batch triplet construction strategies is bounded. Note that, for a fair comparison, the batch triplet construction strategy is also considered in an online manner; that is, for each sample x_t at the t-th time step, all past samples are stored to construct triplets with x_t (i.e., each triplet contains x_t). Theorem 3 also shows, from another perspective, that the one-pass triplet construction strategy can approximate the batch triplet construction. Moreover, a regret bound is proved for the proposed OPML algorithm in Theorem 4. All proofs are provided in the appendix.
Theorem 2. Let L_t be the solution output by OPML with the one-pass triplet construction strategy at the t-th time step, and let L*_t be the solution output by OPML with the batch triplet construction strategy at the t-th time step. Assuming ‖x‖ ≤ 1 (for all samples), ‖L_t‖_F ≤ U and ‖L*_t‖_F ≤ U, the difference between these two matrices is bounded as

‖L_t − L*_t‖_F ≤ U · ‖ Σ_{i=1}^{C_N} B_i + Σ_{i=1,j=1,i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ‖_F,  (21)

where ‖B‖_F ≤ 32|γ^2/(η+β)| + 4√2 |ηγ/(η+β)| (for all B_i, B_j, ...), γ ∈ (0, 1/4), η ∈ (−1/4, 9/4), and β ∈ (−1, 25/32).
Theorem 3. Let ⟨x_t, x_p, x_q⟩ be the triplet constructed by the proposed one-pass triplet construction strategy at the t-th time step, and let {⟨x_t, x_{p_i}, x_{q_i}⟩}|_{i=1}^C be the triplet set constructed by the batch triplet construction strategy at the t-th time step. Assume ‖x‖_2 ≤ R (for all samples), ‖L‖_F ≤ U, ‖L*‖_F ≤ U, and that after the transformation of L or L* the angle θ between two samples from the same class is small (i.e., cos θ = α, α ≥ 0 and α close to 1), while θ is large otherwise (i.e., cos θ = −ξ, ξ ≥ 0 and ξ close to 1). Then the average loss bound between these two strategies at the t-th time step is

ℓ_1 − ℓ_2 ≤ 2(α + ξ + 1) R^2 U^2,  (22)

where ℓ_1 denotes the average loss generated by the one-pass triplet construction strategy, and ℓ_2 the average loss of the batch construction strategy.
Theorem 4. Let ⟨x_1, x_{p_1}, x_{q_1}⟩, ..., ⟨x_T, x_{p_T}, x_{q_T}⟩ be a sequence of triplets constructed by the proposed one-pass strategy. Let L_t|_{t=1}^T be the solutions output by OPML, and let L* be the optimal offline solution. Under the same assumptions as in Theorem 3 (‖x‖_2 ≤ R for all samples, ‖L‖_F ≤ U, ‖L*‖_F ≤ U, cos θ = α for same-class pairs and cos θ = −ξ otherwise after transformation), the regret bound is

R(L*, T) ≤ 2T(α + ξ + 1) R^2 U^2.  (23)
5. Experiments

To verify the effectiveness of our methods, we evaluate OPML and COPML on three typical tasks: (1) UCI data classification, (2) face verification, and (3) abnormal event detection in videos. An additional experiment is also conducted to validate the robustness of COPML when the cold start issue occurs.
5.1. UCI data classification

We use twelve datasets from the UCI repository for evaluation. The k-NN classifier is employed, since it is widely used for classification and has only one parameter. The detailed information of these datasets is presented in Table 2. All twelve datasets are normalized by Z-score. For each dataset, 50% of the samples are randomly picked for training and the rest are used for testing. We adopt the error rate as the evaluation criterion, and to reduce the influence of random partitioning, all classification results are averaged over 100 individual runs.
For an extensive comparison, we include several state-of-the-art methods, covering both batch metric learning and OML. The batch metric learning methods are: (1) the Euclidean distance metric (Eucli for short); (2) the Mahalanobis distance metric (Maha for short); (3) LMNN (Large Margin Nearest Neighbor) [15]; (4) ITML [12]. The OML methods are: (1) OASIS [6]; (2) RDML [11]; (3) POLA [9]; (4) LEGO [10]; (5) SOML-TG (SOML for short) [7]. The implementations of LMNN, ITML and OASIS were provided by the authors of the respective papers, while
Table 2
Error rates (mean ± std. deviation) of a k-NN (k = 5) classifier on the UCI datasets. p-values of Student's t-test are calculated between other methods and our methods. •/◦ indicates that OPML performs statistically better/worse than the respective method according to the p-values. The win/tie/loss statistics are also included. The value in brackets is the corresponding total processing time in seconds; 0.00 denotes a value below 0.005. Abbreviations: Sam, samples; Dim, dimensionality; C, classes.

Data        Sam   Dim  C   Euclidean        Mahalanobis      LMNN
lsvt        126   310  2   0.234 ± 0.056 •  0.238 ± 0.046 •  0.196 ± 0.044 (46.08)
iris        150   4    3   0.050 ± 0.023    0.075 ± 0.025 •  0.037 ± 0.017 ◦ (1.55)
wine        178   13   3   0.044 ± 0.020    0.046 ± 0.023    0.030 ± 0.016 ◦ (3.18)
glass       214   9    7   0.336 ± 0.036    0.349 ± 0.037    0.341 ± 0.038 (4.52)
spect       267   22   2   0.327 ± 0.035    0.348 ± 0.034 •  0.336 ± 0.037 (2.48)
ionosphere  351   34   2   0.172 ± 0.019 •  0.165 ± 0.016    0.131 ± 0.020 ◦ (3.81)
balance     625   4    3   0.146 ± 0.014 •  0.133 ± 0.017 •  0.124 ± 0.014 ◦ (1.42)
breast      683   9    2   0.034 ± 0.008    0.034 ± 0.007    0.033 ± 0.008 (1.26)
pima        768   8    2   0.273 ± 0.018 •  0.271 ± 0.018 •  0.272 ± 0.017 • (1.12)
segment     2310  19   7   0.067 ± 0.006 •  0.101 ± 0.008 •  0.047 ± 0.006 ◦ (5.59)
waveform    5000  21   3   0.187 ± 0.006 ◦  0.158 ± 0.006 ◦  0.182 ± 0.060 ◦ (6.04)
optdigits   5620  64   10  0.026 ± 0.003 •  0.039 ± 0.003 •  0.014 ± 0.002 ◦ (31.21)
win/tie/loss               6/5/1            7/4/1            1/4/7

ITML                     OASIS                   RDML                    POLA
0.175 ± 0.040 ◦ (142.3)  0.205 ± 0.043 • (0.84)  0.230 ± 0.053 • (0.17)  0.157 ± 0.036 ◦ (9.17)
0.034 ± 0.016 ◦ (11.16)  0.274 ± 0.050 • (0.10)  0.077 ± 0.027 • (0.00)  0.030 ± 0.016 ◦ (0.62)
0.035 ± 0.019 ◦ (14.15)  0.019 ± 0.014 ◦ (0.08)  0.039 ± 0.019 (0.00)    0.028 ± 0.018 ◦ (2.22)
0.358 ± 0.042 • (30.94)  0.485 ± 0.057 • (0.12)  0.349 ± 0.035 (0.00)    0.395 ± 0.042 (3.86)
0.337 ± 0.041 (3.44)     0.364 ± 0.050 • (0.13)  0.343 ± 0.032 • (0.01)  0.323 ± 0.040 • (19.22)
0.139 ± 0.025 ◦ (3.65)   0.124 ± 0.033 ◦ (0.12)  0.154 ± 0.016 ◦ (0.02)  0.147 ± 0.020 ◦ (17.65)
0.104 ± 0.019 ◦ (15.67)  0.126 ± 0.009 (0.10)    0.118 ± 0.012 ◦ (0.01)  0.156 ± 0.047 • (7.31)
0.035 ± 0.008 (2.68)     0.043 ± 0.021 • (0.09)  0.033 ± 0.007 (0.01)    0.038 ± 0.010 • (5.13)
0.279 ± 0.022 • (3.16)   0.346 ± 0.053 • (0.12)  0.269 ± 0.019 (0.01)    0.275 ± 0.021 • (12.40)
0.050 ± 0.008 ◦ (38.50)  0.343 ± 0.067 • (0.11)  0.082 ± 0.006 • (0.02)  0.057 ± 0.011 (22.89)
0.187 ± 0.008 ◦ (18.57)  0.357 ± 0.039 • (0.13)  0.186 ± 0.010 ◦ (0.12)  0.250 ± 0.030 • (28.38)
0.028 ± 0.006 • (122.4)  0.077 ± 0.009 • (0.15)  0.028 ± 0.003 • (0.13)  0.023 ± 0.003 • (19.76)
3/2/7                    9/1/2                   5/4/3                   6/2/4

LEGO                     SOML                     OPML                  COPML
0.239 ± 0.050 • (2.33)   0.223 ± 0.057 • (2.23)   0.189 ± 0.048 (0.07)  0.189 ± 0.047 (0.07)
0.050 ± 0.021 (0.14)     0.287 ± 0.080 • (0.69)   0.049 ± 0.023 (0.00)  0.048 ± 0.023 (0.00)
0.031 ± 0.020 ◦ (0.23)   0.169 ± 0.093 • (0.92)   0.042 ± 0.020 (0.00)  0.041 ± 0.019 (0.00)
0.390 ± 0.034 • (0.31)   0.557 ± 0.117 • (1.16)   0.339 ± 0.032 (0.00)  0.341 ± 0.033 (0.00)
0.311 ± 0.038 ◦ (0.61)   0.393 ± 0.084 • (1.45)   0.326 ± 0.034 (0.01)  0.327 ± 0.035 (0.01)
0.154 ± 0.020 ◦ (1.09)   0.362 ± 0.116 • (2.18)   0.161 ± 0.019 (0.01)  0.163 ± 0.021 (0.01)
0.118 ± 0.011 ◦ (1.40)   0.378 ± 0.077 • (4.78)   0.129 ± 0.012 (0.01)  0.129 ± 0.014 (0.01)
0.035 ± 0.008 • (1.64)   0.054 ± 0.040 • (5.19)   0.032 ± 0.008 (0.01)  0.032 ± 0.007 (0.01)
0.266 ± 0.019 (1.96)     0.353 ± 0.060 • (5.89)   0.266 ± 0.017 (0.01)  0.265 ± 0.018 (0.02)
0.040 ± 0.006 ◦ (14.50)  0.541 ± 0.092 • (19.26)  0.059 ± 0.006 (0.03)  0.059 ± 0.006 (0.03)
0.233 ± 0.007 • (4.60)   0.368 ± 0.035 • (49.30)  0.224 ± 0.009 (0.08)  0.225 ± 0.010 (0.09)
0.022 ± 0.003 • (6.97)   0.239 ± 0.079 • (79.10)  0.019 ± 0.003 (0.09)  0.019 ± 0.003 (0.10)
5/2/5                    12/0/0
the rest of the methods were implemented by ourselves. The parameters of these methods were selected by cross-validation, except for LMNN and ITML, which use their default settings. Moreover, a parameter sensitivity experiment was also conducted to verify the influence on OPML/COPML of changing the value of the parameter γ / the parameters γ1, γ2 (see Fig. 3). For OPML, setting γ to 10^-3 seems to be a good choice, and this value was indeed chosen on all twelve UCI datasets in our experiment. For COPML, setting γ1 to 10^-4 or 10^-3 seems to be a good choice when γ2 ≡ 10^-3 (the same setting as OPML, for simplicity). Since the pairwise or triplet constraints of POLA, LEGO and SOML need to be constructed in advance, we randomly sample 10,000 constraints for these three methods (the same setting as LEGO [10]). The error rates of the proposed methods and the competing methods are presented in Table 2.
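The one-pass triplet construction can be illustrated with a short sketch. This is a minimal, hypothetical instantiation (NumPy): we assume each incoming sample x_t is paired with the most recently seen same-class sample as x_p and the most recently seen different-class sample as x_q; the exact buffering rule of OPML is defined in Section 3, so treat the rule below as an assumption for illustration.

```python
import numpy as np

def one_pass_triplets(X, y):
    """Build at most one triplet <x_t, x_p, x_q> per incoming sample x_t.

    Assumption (our simplification): x_p is the most recent earlier sample
    with the same label as x_t, and x_q is the most recent earlier sample
    with a different label. Memory stays at one sample per class.
    """
    latest = {}                                   # label -> (arrival step, sample)
    triplets = []
    for t, (x_t, label) in enumerate(zip(X, y)):
        same = latest.get(label)
        diffs = [v for l, v in latest.items() if l != label]
        if same is not None and diffs:
            x_q = max(diffs, key=lambda v: v[0])[1]   # newest different-class sample
            triplets.append((x_t, same[1], x_q))
        latest[label] = (t, x_t)
    return triplets
```

Under such a rule, at most one triplet is formed per sample after warm-up, which is why the number of triplets stays linear in the number of samples rather than cubic as in batch construction.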
Moreover, the p-values of Student's t-test were calculated to check statistical significance. Also, the win/tie/loss statistics are reported according to the obtained p-values (see Table 2). It is observed that (1) the performance of our methods is comparable to LEGO, and slightly better than that of the other OML methods; (2) the performance of our methods is close to that of the batch metric learning methods, e.g., LMNN and ITML, and better than Euclidean and Mahalanobis; (3) our methods are faster than the other OML methods and comparable with RDML, since instead of constructing triplets, RDML only requires a pairwise constraint, receiving a pair of samples at each time step.
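Since a learned metric M = LᵀL is equivalent to a linear map L followed by the Euclidean distance, the k-NN evaluation used above reduces to transforming all samples by L and running ordinary k-NN in the transformed space. A minimal NumPy sketch (the function name and the majority-vote tie-breaking are ours):

```python
import numpy as np

def knn_error_rate(L, X_train, y_train, X_test, y_test, k=5):
    """Error rate of a k-NN classifier under d(x, x') = ||L(x - x')||_2."""
    Z_train, Z_test = X_train @ L.T, X_test @ L.T    # transform once by L
    errors = 0
    for z, label in zip(Z_test, y_test):
        d = np.linalg.norm(Z_train - z, axis=1)      # Euclidean in L-space
        votes = y_train[np.argsort(d)[:k]]           # labels of k nearest
        errors += np.bincount(votes).argmax() != label
    return errors / len(y_test)
```

With L set to the identity matrix, this reproduces the plain Euclidean baseline reported in Table 2.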
To illustrate the effect of the number of triplet constraints on the learning of the metric, we vary the number of triplet constraints over (100, 1000, 2000, 5000, 10,000, 15,000, 20,000) for OASIS and SOML (see Fig. 2). The number of triplet constraints in OPML is a constant determined by the one-pass triplet construction, and we can find that OPML achieves better performance with far fewer triplet constraints (except on UCI datasets 1, 3 and 6).
5.2. Face verification: PubFig

Face verification has become a popular application for evaluating the performance of metric learning methods. Many metric learning methods [24–26] have been proposed for this practical application, including deep metric learning methods [27–29]. For face verification, we first evaluate our methods on the Public Figures Face Database (PubFig) [30]. The PubFig dataset consists of two subsets: the Development Set (7650 images of 60 individuals) and the Evaluation Set (28,954 images of 140 individuals). Following [30],
[Fig. 2 shows twelve panels (UCI Data 1–12) plotting the error rate (%) against the number of triplets (100 to 25,000) for OPML, OASIS and SOML; the vertical line in each panel marks the constant number of triplets used by OPML: 59, 72, 87, 103, 132, 172, 310, 340, 380, 1148, 2498 and 2801, respectively.]
Fig. 2. Error rates of different methods with different numbers of triplets on twelve UCI datasets (the number of triplets in OPML is a constant).
Fig. 3. Error rates of OPML/COPML on the UCI datasets when changing the value of the parameter γ / the parameters γ1, γ2 (the last three datasets are specially constructed for the cold start issue in Section 5.5).
we use the development set to develop all these methods, including parameter tuning, while the evaluation set is used for performance evaluation. The goal of face verification in PubFig is to determine whether a pair of face images belongs to the same person. Please note that images coming from the same person are regarded as belonging to the same class. For all subsets, 10-fold cross-validation is adopted to conduct the experiments, and each fold is disjoint by identity (i.e., one person will not appear in both the training and testing sets). For testing each fold (with the rest 9 folds used for training), we randomly sample 10,000 pairs (5000 intra- and 5000 extra-personal pairs) for testing. Thus, the total number of pairs is 10^5. In each training phase, we also randomly select 10,000 pairwise or triplet constraints for LEGO, POLA and SOML, with the same settings as on the UCI datasets.

For sufficient and fair comparison, we use two forms of features (i.e., attribute features and deep features) to evaluate the performance of all algorithms. Attribute features (73-dimensional), provided by Kumar et al. [30], are 'high-level' features describing nameable attributes of a face image, such as gender, race, age and hair. For deep features, we use a VGG-Face model
[Fig. 4 panels with AUC values. Development set, attribute features: OPML (0.8297), COPML (0.8207), LEGO (0.8100), POLA (0.7934), ITML (0.7835), LMNN (0.7834), OASIS (0.7701), Eucli (0.7471), RDML (0.7287), Maha (0.7238), SOML (0.6956). Evaluation set, attribute features: OPML (0.8114), COPML (0.8063), LEGO (0.8020), POLA (0.7823), LMNN (0.7617), OASIS (0.7584), RDML (0.7279), Eucli (0.7273), ITML (0.7171), Maha (0.7151), SOML (0.6965). Development set, deep features: OPML (0.9861), COPML (0.9864), LEGO (0.9860), POLA (0.9840), ITML (0.9758), LMNN (0.9805), OASIS (0.9732), Eucli (0.9854), RDML (0.9573), Maha (0.9829), SOML (0.6918). Evaluation set, deep features: OPML (0.9858), COPML (0.9856), LEGO (0.9806), POLA (0.9734), LMNN (0.9749), OASIS (0.9726), RDML (0.9745), Eucli (0.9804), ITML (0.9810), Maha (0.9844), SOML (0.6228).]
Fig. 4. ROC curves of the development set (left column) and the evaluation set (right column) on the PubFig dataset. First row: attribute features; second row: deep features. The AUC value of each method is shown in brackets.
[31] to extract a 4096-dimensional feature for each face image, which has been aligned and cropped. For easier handling, the 4096-dimensional feature is reduced to a 54-dimensional feature by the Principal Component Analysis (PCA) algorithm.
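The reduction step is ordinary PCA; the following NumPy-only sketch (random data standing in for the 4096-dimensional VGG-Face features; a library routine such as scikit-learn's PCA would behave the same) shows the projection from 4096 down to 54 dimensions:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (plain SVD-based PCA)."""
    Xc = X - X.mean(axis=0)                      # center each feature dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # shape (n_samples, n_components)

# Stand-in "deep features": 200 random 4096-dimensional vectors.
feats = np.random.RandomState(0).randn(200, 4096)
feats_54 = pca_reduce(feats, 54)
```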
For each testing pair, we first calculate the distance (similarity) between the two images using the learned metric obtained from the respective method. Then, all the distances (similarities) are normalized into the range [0, 1]. Receiver Operating Characteristic (ROC) curves are provided in Fig. 4, with the corresponding AUC (Area Under the ROC Curve) values calculated. It can be observed that OPML and COPML obtain superior results compared with the state-of-the-art online/batch metric learning methods. Moreover, although the deep features already have a strong representation ability, our proposed methods can still slightly improve the performance.
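The evaluation above (normalize distances into [0, 1], turn them into similarity scores, compute the AUC) can be sketched as follows. The rank-based formula is the Mann-Whitney statistic, which equals the area under the ROC curve; the helper name is ours, and ties are ignored for simplicity (sklearn.metrics.roc_auc_score would give the same number):

```python
import numpy as np

def verification_auc(dists, same_person):
    """AUC for pair verification from metric distances.

    dists: distances between image pairs under the learned metric.
    same_person: 1 for intra-personal pairs, 0 for extra-personal pairs.
    Smaller distance should indicate 'same person'.
    """
    d = (dists - dists.min()) / (dists.max() - dists.min())  # min-max to [0, 1]
    scores = 1.0 - d                                         # similarity score
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)             # 1-based ranks
    pos = same_person.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic normalized by n_pos * n_neg equals the AUC
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```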
.3. Face verification: LFW
For face verification, we also evaluate our methods on the La-
eled Faces in the Wild Database (LFW) [32] . LFW is a widely used
ace verification benchmark with unconstrained images, which
ontains 13,233 images of 5749 individuals. This dataset has two
iews: View 1 is used for development purposes (containing a
raining set and a test set); And, View 2 is taken as evalua-
ion benchmark for comparison (i.e., a 10-fold cross-validation set).
here are two forms of configuration in both views, that is, im-
ge restricted configuration and image unrestricted configuration.
n the first formulation, the training information is restricted to
he provided image pairs and additional information such as ac-
ual name information can not be used. In other words, we can
nly use the pairwise images for training without any label infor-
ation can be used at all. While, in the second formulation, the
ctual name information (i.e., label information) can be used and
s many pairs or triplets can be formulated as one desires. No mat-
er which configuration we choose, the test procedure is the same
i.e., using pairwise images for testing). In order to simulate the
eal online environment and because our methods and some meth-
ds (e.g., OASIS [6] , SOML [7] ) are triplet-based methods, we adopt
he image unrestricted configuration to construct the experiment.
e use View 1 for parameter tuning and then evaluate the per-
ormance of all the algorithms on each fold (300 intra- and 300
xtra-personal pairs) in View 2. Other settings are similar with the
nes on the PubFig dataset.
In this experiment, we adopt two types of features (i.e., SIFT
eatures and attribute features) to represent each face image, re-
pectively. The SIFT features are provided by Guillaumin et al.
33] by extracting SIFT descriptors [34] at 9 fixed facial landmarks
etected on a face, over three scales. Then we perform PCA algo-
ithm to reduce the original 3456-dimensional feature to a 100-
imensional feature. Like PubFig, the attribute features of LFW are
3-dimensional ’high-level’ features describing the nameable at-
ributes of a face image [30] . To evaluate our methods and the con-
rastive methods, we report the ROC curves and AUC values of the
orresponding methods (see Fig. 5 ). The results of ITML [12] aren’t
isplayed for its difficulty of convergence in the training data. We
an see that the proposed COPML method can achieve the-state-of-
he-art performance compared with the contrastive metric learning
[Fig. 5 panels with AUC values. SIFT features: OPML (0.8606), COPML (0.8676), LEGO (0.7976), POLA (0.8442), LMNN (0.8634), OASIS (0.8267), Eucli (0.7335), RDML (0.6902), Maha (0.6570), SOML (0.5994). Attribute features: OPML (0.9136), COPML (0.9143), LEGO (0.8629), POLA (0.9058), LMNN (0.9105), OASIS (0.8906), Eucli (0.8564), RDML (0.8368), Maha (0.8138), SOML (0.7265).]

Fig. 5. ROC curves of our methods and the competing methods on the LFW dataset. Left: SIFT features; right: attribute features. The AUC value of each method is shown in brackets.
Table 3
Quantitative comparison of our methods with other abnormal event detection methods on the 3 scenes of the UMN dataset individually, with the AUC criterion.

Method                   AUC
Optical flow [36]        0.84 (average)
Social Force [36]        0.96 (average)
Chaotic Invariants [37]  0.99 (average)
LSA [38]                 0.985 (average)
STCOG [20]               0.936/0.776/0.966
Sparse [35]              0.995/0.975/0.964
MP-MIDL [39]             0.99/0.98/0.99
SVDD-based [40]          0.993/0.969/0.988
OPML                     0.993/0.983/0.973
COPML                    0.995/0.989/0.977
Table 4
Error rates on three UCI datasets.

Data     Euclidean        OPML             COPML
seg-10   0.069 ± 0.006    0.062 ± 0.006    0.057 ± 0.007
seg-5    0.067 ± 0.007    0.062 ± 0.007    0.054 ± 0.007
seg-2    0.067 ± 0.006    0.064 ± 0.007    0.059 ± 0.007
eeg-10   0.185 ± 0.004    0.181 ± 0.004    0.161 ± 0.023
eeg-5    0.213 ± 0.004    0.205 ± 0.005    0.185 ± 0.018
eeg-2    0.185 ± 0.004    0.178 ± 0.007    0.178 ± 0.007
sen-10   0.190 ± 0.002    0.093 ± 0.036    0.071 ± 0.011
sen-5    0.196 ± 0.002    0.097 ± 0.032    0.082 ± 0.014
sen-2    0.190 ± 0.006    0.067 ± 0.020    0.063 ± 0.015
methods. In particular, when using SIFT features, our methods significantly improve the AUC value over the Euclidean distance, by 13% (5.8% with attribute features), showing the validity of the proposed methods. It is worth noting that some metric learning methods cannot even improve over the Euclidean distance, which also happened on the PubFig dataset. The reason why LMNN cannot achieve the best performance may be over-fitting due to the lack of regularization.
5.4. Abnormal event detection in videos
The performance of the proposed methods is also evaluated on the UMN dataset for abnormal event detection. The UMN dataset contains 3 different scenes with 7739 frames in total: Scene1 (1453 frames), Scene2 (4144 frames) and Scene3 (2142 frames). In the UMN dataset, people walking around is considered normal, while people running away is regarded as abnormal. The resolution of the video is 320 × 240. We divide each frame into 5 × 4 non-overlapping 64 × 60 patches. For each patch, the MHOF (Multi-scale Histogram of Optical Flow) feature [35] is extracted from every two successive frames. The MHOF is a 16-dimensional feature, which can capture both motion direction and motion energy. To integrate the multi-patch features, we concatenate the features from all patches in each frame, forming a 320-dimensional feature. For each scene, we perform 2-fold cross-validation for evaluation. The distance metric is learnt from the training data in an online way; then we use an SVM classifier to classify the testing frames after transforming the features by the learned metric L.
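The per-frame feature assembly can be sketched as below. Note that the 16-bin, magnitude-weighted orientation histogram used here is a simplified stand-in for the actual MHOF descriptor of [35] (whose multi-scale binning is defined there); the patch layout and the concatenation into a 320-dimensional vector follow the description above:

```python
import numpy as np

def frame_feature(flow, patch_h=60, patch_w=64, bins=16):
    """Concatenate per-patch 16-bin flow-orientation histograms.

    flow: (240, 320, 2) optical-flow field between two successive frames.
    A 240 x 320 frame splits into 4 x 5 = 20 non-overlapping 64 x 60
    patches, giving a 20 * 16 = 320-dimensional frame feature.
    """
    H, W, _ = flow.shape
    feats = []
    for i in range(0, H, patch_h):
        for j in range(0, W, patch_w):
            patch = flow[i:i + patch_h, j:j + patch_w]
            ang = np.arctan2(patch[..., 1], patch[..., 0])   # motion direction
            mag = np.linalg.norm(patch, axis=-1)             # motion energy
            hist, _ = np.histogram(ang, bins=bins,
                                   range=(-np.pi, np.pi), weights=mag)
            feats.append(hist)
    return np.concatenate(feats)                             # shape (320,)
```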
Table 3 reports the AUC of all the methods. We can see that our methods are very effective and competitive compared with the other methods. Fig. 6 exhibits sample frames of normal and abnormal events in the 3 scenes (top row), and shows the abnormal event detection results of our method (COPML) in the indication bars (green/red indicates a normal/abnormal event). It is worth mentioning that in this experiment COPML performs better than OPML, because the video data suffers from the cold start issue, especially at the beginning.
5.5. COPML for cold start
We can observe that in the cases free of the cold start issue (e.g., UCI data classification, face verification), OPML and COPML obtain comparable results, while in the case with the cold start issue (e.g., abnormal event detection in videos), COPML is better than OPML. To further test the performance of COPML in an extreme case with the cold start issue, we construct several datasets with a specified structure to compare the behavior of COPML and OPML. Three datasets were picked from the UCI repository: (1) Image Segmentation (seg for short), with 7 classes, 19 features and 2310 samples; (2) EEG Eye State (eeg for short), with 2 classes, 15 features and 14,980 samples; (3) Sensorless (sen for short), with 11 classes, 49 features and 58,509 samples.

For each dataset, the samples from each class are divided into 10/5/2 disjoint parts, and the parts of different classes are then crosswise put together to construct a new dataset. Afterwards, the new dataset is divided into 2 folds; the first fold is used for training and the second for testing. As in the previous classification setting, we use a k-NN (k = 5) classifier to obtain the final test results, shown in Table 4. The results show that COPML
Fig. 6. Global abnormal event detection results of our method COPML and the ground truth on the UMN dataset. (For interpretation of the references to colour in this figure
legend, the reader is referred to the web version of this article.)
performs better than OPML when the data has the cold start issue, since when a cold start occurs, COPML incorporates both the pairwise and the triplet information, instead of only using triplets as OPML does.
6. Conclusion

We propose a one-pass closed-form solution for OML, namely OPML. It employs the one-pass triplet construction for fast triplet generation, together with a closed-form solution to update the metric with the newly arrived sample at each time step. Also, for the cold start issue, COPML, an extended version of OPML, is developed. The major advantages of our methods are that OPML and COPML are easy to implement, and that they are very scalable, with low space (i.e., O(d)) and time (i.e., O(d^2)) complexity. In the experiments, we show that our methods obtain superior performance on three typical tasks, compared with the state-of-the-art methods.
Acknowledgments

This work is supported by the National NSF of China (Nos. 61432008, 61673203, U1435214), the Australian Research Council (ARC) (No. DE160100241), the Primary R&D Plan of Jiangsu Province, China (No. BE2015213), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Appendix A. Proof of Theorem 2

Proof. Recall that the metric update formula of OPML is

L_t = { L_{t-1} (I + γA_t)^{-1},   [z]_+ > 0
      { L_{t-1},                   [z]_+ = 0.        (A.1)

According to Theorem 1, we can obtain

(I + γA_t)^{-1} = I - (1/(η + β)) [ ηγA_t - (γA_t)^2 ],        (A.2)

where η = 1 + tr(γA_t) and β = (1/2)[(tr(γA_t))^2 - tr((γA_t)^2)]. Here, we only consider the case that [z]_+ > 0. Then, at the t-th time step, the learned metric L_t of the one-pass strategy can be expressed as

L_t = L_0 (I + γA_1)^{-1} (I + γA_2)^{-1} ··· (I + γA_t)^{-1}.        (A.3)
Note that the batch triplet construction strategy is here considered in an online manner; that is, for each sample x_t at the t-th time step, all past samples are stored to construct triplets with x_t (i.e., each triplet contains this x_t). Similar to L_t, the learned metric L*_t of the batch strategy (at the t-th time step, Σ_{i=1}^{t} C_i triplets can be constructed) can be denoted as follows:

L*_t = L*_0 Π_{i=1}^{C_1} (I + γA_{1i})^{-1} Π_{i=1}^{C_2} (I + γA_{2i})^{-1} ··· Π_{i=1}^{C_t} (I + γA_{ti})^{-1}.        (A.4)

Let 〈x_1, x_{p_1}, x_{q_1}〉, ..., 〈x_t, x_{p_t}, x_{q_t}〉 be the sequence of triplets constructed by the proposed one-pass strategy, which is contained in the sequence of triplets constructed by the batch strategy. If we let L* learn on the sequence of triplets constructed by the one-pass strategy first, Eq. (A.4) can be reorganized as follows:

L*_t = L*_0 (I + γA_1)^{-1} ··· (I + γA_t)^{-1} · Π_{i=1}^{C_1+···+C_t-t} (I + γA_i)^{-1}
       (L*_t learns on the sequence of L_t first)
     = L_t · Π_{i=1}^{C_1+···+C_t-t} (I + γA_i)^{-1}
       (L_0 and L*_0 are both initialized as identity matrices)
     = L_t · Π_{i=1}^{C_1+···+C_t-t} (I + B_i)
       (by Theorem 1, where B_i = (1/(η + β)) [ (γA_i)^2 - ηγA_i ])
     = L_t [ I + Σ_{i=1}^{C_N} B_i + Σ_{i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ]
       (where C_N = C_1 + ··· + C_t - t).        (A.5)

Then we can calculate that

‖L_t - L*_t‖_F = ‖ L_t [ Σ_{i=1}^{C_N} B_i + Σ_{i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ] ‖_F
              ≤ ‖L_t‖_F · ‖ Σ_{i=1}^{C_N} B_i + Σ_{i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ‖_F.        (A.6)

Recall that A_t = M_1 - M_2 = (x_t - x_p)(x_t - x_p)^T - (x_t - x_q)(x_t - x_q)^T ∈ R^{d×d}, which is a symmetric square matrix. By the definition of the Frobenius norm,

‖A_t‖_F = √( Σ_{i=1}^{d} Σ_{j=1}^{d} |a_{ij}|^2 ) = √( Σ_{i=1}^{d} σ_i^2 ),        (A.7)

where the σ_i are the singular values of A_t, which equal the absolute values of the eigenvalues of A_t. According to Lemma 1, -λ_max(M_2) ≤ λ(A_t) ≤ λ_max(M_1), where λ(A_t) denotes an eigenvalue of A_t, and λ_max(M) indicates the maximum eigenvalue of M. Assuming ‖x_t‖_2 ≤ 1, λ_max(M_1) and λ_max(M_2) belong to the range [0, 4]. Since the rank of A_t is 2 (which has been proved in Section 3.2), there are at most two nonzero eigenvalues. Thus we can easily obtain that
‖A_t‖_F ≤ 4√2. Hence,

‖B‖_F = ‖ (γ^2/(η + β)) A_t^2 + (ηγ/(η + β)) (-A_t) ‖_F
      ≤ ‖ (γ^2/(η + β)) A_t^2 ‖_F + ‖ (ηγ/(η + β)) (-A_t) ‖_F
      ≤ | γ^2/(η + β) | · ‖A_t‖_F · ‖A_t‖_F + | ηγ/(η + β) | · ‖A_t‖_F
      ≤ 32 | γ^2/(η + β) | + 4√2 | ηγ/(η + β) |.        (A.8)

Then, we can also calculate the ranges of η and β, respectively:

η = 1 + tr(γA_t)
  = 1 + γ · tr(A_t)
  = 1 + γ [ (x_t - x_p)^T (x_t - x_p) - (x_t - x_q)^T (x_t - x_q) ]
  = 1 + γ [ ‖x_p‖_2^2 - ‖x_q‖_2^2 - 2‖x_t‖_2 · ‖x_p‖_2 cos θ_1 + 2‖x_t‖_2 · ‖x_q‖_2 cos θ_2 ].        (A.9)

Since the range of γ is (0, 1/4) and 0 < ‖x_t‖_2 ≤ 1, the range of tr(γA_t) is (-5/4, 5/4), and the range of η is therefore (-1/4, 9/4). Recall that

β = (1/2) [ (tr(γA_t))^2 - tr((γA_t)^2) ]
    (by the rule tr(cA) = c · tr(A))
  = (1/2) [ (tr(γA_t))^2 - γ^2 tr(A_t^2) ]
    (by the rule tr(A^k) = Σ_i λ_i^k, where λ_i is an eigenvalue of A)
  = (1/2) [ (tr(γA_t))^2 - γ^2 Σ_{i=1}^{d} λ_i^2 ]
    (by the rule ‖A_t‖_F = √( Σ_{i=1}^{d} λ_i^2 ))
  = (1/2) [ (tr(γA_t))^2 - γ^2 ‖A_t‖_F^2 ].        (A.10)

For 0 < ‖A_t‖_F^2 ≤ 32 and -5/4 < tr(γA_t) < 5/4, the range of β is (-1, 25/32). □
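The bounds derived above (‖A_t‖_F ≤ 4√2, η ∈ (-1/4, 9/4), β ∈ (-1, 25/32)) can be probed empirically. The following sketch draws random samples scaled into the unit ball and γ from (0, 1/4), and checks that every draw respects the stated ranges:

```python
import numpy as np

rng = np.random.RandomState(1)
ok = True
for _ in range(1000):
    x_t, x_p, x_q = rng.randn(3, 10)
    # Scale each sample into the unit ball, as assumed in the proof.
    x_t, x_p, x_q = (v / max(1.0, np.linalg.norm(v)) for v in (x_t, x_p, x_q))
    gamma = rng.uniform(1e-6, 0.25)            # gamma in (0, 1/4)
    A = np.outer(x_t - x_p, x_t - x_p) - np.outer(x_t - x_q, x_t - x_q)
    G = gamma * A
    eta = 1.0 + np.trace(G)
    beta = 0.5 * (np.trace(G) ** 2 - np.trace(G @ G))
    ok = ok and np.linalg.norm(A) <= 4 * np.sqrt(2) + 1e-9   # Frobenius bound
    ok = ok and (-0.25 < eta < 2.25) and (-1.0 < beta < 25 / 32)
```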
Appendix B. Proof of Theorem 3

Proof. By applying the one-pass triplet construction strategy, at the t-th time step, we obtain one triplet 〈x_t, x_p, x_q〉, while under the batch construction (where all past samples are stored), we get a triplet set {〈x_t, x_{p_i}, x_{q_i}〉}_{i=1}^{C}. The average losses of these two strategies can be expressed as follows:

Δ_1 = [ 1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2 ]_+
Δ_2 = (1/C) Σ_{i=1}^{C} [ 1 + ‖L*(x_t - x_{p_i})‖_2^2 - ‖L*(x_t - x_{q_i})‖_2^2 ]_+,        (B.1)

where [z]_+ = max(0, z), namely the hinge loss G((x_t, x_p, x_q); L). For Δ_1, we only consider the case that z ≥ 0, which is exactly the case that affects the updating of the metric L. However, in Δ_2, some losses may be negative. Thus,

Δ_1 = 1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2
Δ_2 ≤ (1/C) Σ_{i=1}^{C} [ 1 + ‖L*(x_t - x_{p_i})‖_2^2 - ‖L*(x_t - x_{q_i})‖_2^2 ].        (B.2)

Then we calculate the difference between the losses of these two strategies. That is,

Δ_1 - Δ_2 ≤ (1/C) Σ_{i=1}^{C} Δ
          ≤ (1/C) Σ_{i=1}^{C} [ ‖L(x_t - x_p)‖_2^2 - ‖L*(x_t - x_{p_i})‖_2^2 + ‖L*(x_t - x_{q_i})‖_2^2 - ‖L(x_t - x_q)‖_2^2 ].        (B.3)

For simplicity, we just analyze Δ. By applying the rule of the dot product (i.e., x^T y = ‖x‖ · ‖y‖ · cos θ), Δ can be expanded as follows:

Δ = ‖Lx_p‖_2^2 - ‖L*x_{p_i}‖_2^2 + ‖L*x_{q_i}‖_2^2 - ‖Lx_q‖_2^2
    - 2‖Lx_t‖ · ‖Lx_p‖ · cos θ_1 + 2‖L*x_t‖ · ‖L*x_{p_i}‖ · cos θ_2
    - 2‖L*x_t‖ · ‖L*x_{q_i}‖ · cos θ_3 + 2‖Lx_t‖ · ‖Lx_q‖ · cos θ_4.        (B.4)

In order to simplify the expression, we set

① = -2‖Lx_t‖ · ‖Lx_p‖ · cos θ_1
② = 2‖L*x_t‖ · ‖L*x_{p_i}‖ · cos θ_2
③ = -2‖L*x_t‖ · ‖L*x_{q_i}‖ · cos θ_3
④ = 2‖Lx_t‖ · ‖Lx_q‖ · cos θ_4.        (B.5)

Assume that the angle θ between two samples coming from the same class is very small after the transformation by L or L* (i.e., cos θ = α, α ≥ 0 and α close to 1), while θ is very large otherwise (i.e., cos θ = -ξ, ξ ≥ 0 and ξ close to 1). Thus,

0 ≤ cos θ_1 ≤ α,  0 ≤ cos θ_2 ≤ α,  -ξ ≤ cos θ_3 ≤ 0,  -ξ ≤ cos θ_4 ≤ 0,        (B.6)

where 0 ≤ α ≤ 1, 0 ≤ ξ ≤ 1 and both are close to 1. Then we can obtain

-2α‖Lx_t‖ · ‖Lx_p‖ ≤ ① ≤ 0
0 ≤ ② ≤ 2α‖L*x_t‖ · ‖L*x_{p_i}‖
0 ≤ ③ ≤ 2ξ‖L*x_t‖ · ‖L*x_{q_i}‖
-2ξ‖Lx_t‖ · ‖Lx_q‖ ≤ ④ ≤ 0.        (B.7)

Here, we only consider the upper bound of Δ:

Δ ≤ ‖Lx_p‖_2^2 - ‖L*x_{p_i}‖_2^2 + ‖L*x_{q_i}‖_2^2 - ‖Lx_q‖_2^2 + 2α‖L*x_t‖ · ‖L*x_{p_i}‖ + 2ξ‖L*x_t‖ · ‖L*x_{q_i}‖
  ≤ ‖Lx_p‖_2^2 + ‖L*x_{q_i}‖_2^2 + 2α‖L*x_t‖ · ‖L*x_{p_i}‖ + 2ξ‖L*x_t‖ · ‖L*x_{q_i}‖.        (B.8)

According to the property of compatible norms, that is,

‖Ax‖_2 ≤ ‖A‖_F · ‖x‖_2,        (B.9)

and assuming that ‖x‖_2 ≤ R (for all samples), ‖L‖_F ≤ U and ‖L*‖_F ≤ U, we can obtain

Δ ≤ 2(α + ξ + 1) R^2 U^2
Δ_1 - Δ_2 ≤ 2(α + ξ + 1) R^2 U^2.        (B.10)

Thus, the theorem has been proved. □
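The loss driving the comparison, G((x_t, x_p, x_q); L) = [1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2]_+, is a standard squared-distance triplet hinge; a minimal sketch:

```python
import numpy as np

def triplet_hinge(L, x_t, x_p, x_q):
    """[1 + ||L(x_t - x_p)||^2 - ||L(x_t - x_q)||^2]_+  (the Delta_1 of Eq. (B.1))."""
    d_pos = np.linalg.norm(L @ (x_t - x_p)) ** 2     # same-class distance
    d_neg = np.linalg.norm(L @ (x_t - x_q)) ** 2     # different-class distance
    return max(0.0, 1.0 + d_pos - d_neg)             # hinge: zero when margin holds
```

The update of Eq. (A.1) fires exactly when this value is positive, i.e., when the different-class sample is not yet at least a unit margin farther than the same-class sample.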
Appendix C. Proof of Theorem 4

Proof. The regret can be defined (according to the definition in Chapter 3 of [41]) as follows:

R(L*, T) = Σ_{t=1}^{T} G_t(L_t) - Σ_{t=1}^{T} G_t(L*),        (C.1)

where G_t(L) = [ 1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2 ]_+. Here, we also only consider the case that the loss is positive, which is exactly the case that affects the updating of the metric L. However, a triplet which generates a positive loss with L may incur a negative loss with L*. Thus, after expanding,

R(L*, T) ≤ Σ_{t=1}^{T} [ ‖L_t(x_t - x_p)‖_2^2 - ‖L_t(x_t - x_q)‖_2^2 - ‖L*(x_t - x_p)‖_2^2 + ‖L*(x_t - x_q)‖_2^2 ].        (C.2)

In a similar way to the proof of Theorem 3, we can easily prove this theorem. □

References
[1] K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in: Advances in Neural Information Processing Systems (NIPS), 2005, pp. 1473–1480.
[2] S. Xiang, F. Nie, C. Zhang, Learning a mahalanobis distance metric for data clustering and classification, Pattern Recognit. 41 (12) (2008) 3600–3612.
[3] C.-C. Chang, A boosting approach for supervised mahalanobis distance metric learning, Pattern Recognit. 45 (2) (2012) 844–862.
[4] Y. Mu, W. Ding, D. Tao, Local discriminative distance metrics ensemble learning, Pattern Recognit. 46 (8) (2013) 2337–2349.
[5] T. Mensink, J. Verbeek, F. Perronnin, G. Csurka, Distance-based image classification: generalizing to new classes at near-zero cost, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2624–2637.
[6] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, J. Mach. Learn. Res. 11 (2010) 1109–1135.
[7] X. Gao, S.C. Hoi, Y. Zhang, J. Wan, J. Li, SOML: sparse online metric learning with application to image retrieval, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
[8] H. Xia, S.C. Hoi, R. Jin, P. Zhao, Online multiple kernel similarity learning for visual search, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014) 536–549.
[9] S. Shalev-Shwartz, Y. Singer, A.Y. Ng, Online and batch learning of pseudo-metrics, in: Proceedings of the Twenty-First International Conference on Machine Learning (ICML), ACM, 2004, p. 94.
[10] P. Jain, B. Kulis, I.S. Dhillon, K. Grauman, Online metric learning and fast similarity search, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 761–768.
[11] R. Jin, S. Wang, Y. Zhou, Regularized distance metric learning: theory and algorithm, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 862–870.
[12] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning (ICML), ACM, 2007, pp. 209–216.
[13] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms, J. Mach. Learn. Res. 7 (2006) 551–585.
[14] G. Kunapuli, J. Shavlik, Mirror descent for metric learning: a unified approach, in: Machine Learning and Knowledge Discovery in Databases, Springer, 2012, pp. 859–874.
[15] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[16] B. Shaw, B. Huang, T. Jebara, Learning a distance metric from a network, in: Advances in Neural Information Processing Systems (NIPS), 2011, pp. 1899–1907.
[17] Q. Qian, R. Jin, J. Yi, L. Zhang, S. Zhu, Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD), Mach. Learn. 99 (3) (2015) 353–372.
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-grained image similarity with deep ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[19] C. Stauffer, W.E.L. Grimson, Learning patterns of activity using real-time tracking, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 747–757.
[20] Y. Shi, Y. Gao, R. Wang, Real-time abnormal event detection in complicated scenes, in: 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp. 3653–3656.
[21] W. Gao, R. Jin, S. Zhu, Z.-H. Zhou, One-pass AUC optimization, in: Proceedings of the 30th International Conference on Machine Learning (ICML), ACM, 2013, pp. 906–914.
[22] Y. Zhu, W. Gao, Z.-H. Zhou, One-pass multi-view learning, in: Proceedings of the 7th Asian Conference on Machine Learning, 2015, pp. 407–422.
[23] K.S. Miller, On the inverse of the sum of matrices, Math. Mag. 54 (1981) 67–72.
[24] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, J. Zhou, Neighborhood repulsed metric learning for kinship verification, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2014) 331–345.
[25] J. Lu, G. Wang, W. Deng, K. Jia, Reconstruction-based metric learning for unconstrained face verification, IEEE Trans. Inf. Forensics Secur. 10 (1) (2015) 79–89.
[26] Z. Huang, R. Wang, S. Shan, X. Chen, Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning, Pattern Recognit. 48 (10) (2015) 3113–3124.
[27] J. Hu, J. Lu, Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1875–1882.
[28] J. Hu, J. Lu, Y.-P. Tan, Deep transfer metric learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 325–333.
[29] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[30] N. Kumar, A.C. Berg, P.N. Belhumeur, S.K. Nayar, Attribute and simile classifiers for face verification, in: IEEE 12th International Conference on Computer Vision (ICCV), IEEE, 2009, pp. 365–372.
[31] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference (BMVC), 1, 2015, p. 6.
[32] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: a database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[33] M. Guillaumin, J. Verbeek, C. Schmid, Is that you? Metric learning approaches for face identification, in: IEEE 12th International Conference on Computer Vision (ICCV), IEEE, 2009, pp. 498–505.
[34] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 91–110.
[35] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 3449–3456.
[36] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 935–942.
[37] S. Wu, B.E. Moore, M. Shah, Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2054–2060.
[38] V. Saligrama, Z. Chen, Video anomaly detection based on local statistical aggregates, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2112–2119.
[39] J. Huo, Y. Gao, W. Yang, H. Yin, Multi-instance dictionary learning for detecting abnormal events in surveillance videos, Int. J. Neural Syst. 24 (03) (2014)
1430010 .
40] Y. Zhang , H. Lu , L. Zhang , X. Ruan , Combining motion and appearance cues foranomaly detection, Pattern Recognit. 51 (2016) 443–452 .
[41] S. Shalev-Shwartz, Y. Singer, Online learning: Theory, algorithms, and applica-tions (2007).
Wenbin Li is a Ph.D. student in the Department of Computer Science and Technology, Nanjing University. His current research interests include machine learning and computer vision.

Yang Gao received the Ph.D. degree from the Department of Computer Science and Technology, Nanjing University, China, in 2000. Currently, he is a Professor, and also the Deputy Director in the Department of Computer Science and Technology, Nanjing University. He is currently directing the Reasoning and Learning Research Group in Nanjing University. He has published more than 100 papers in top-tier conferences and journals. His current research interests include artificial intelligence and machine learning. He also serves as Program Chair and Area Chair for many international conferences.

Lei Wang received the Ph.D. degree from Nanyang Technological University, Singapore, in 2004. He is currently with the Faculty of Engineering and Information Sciences, University of Wollongong, Wollongong, NSW, Australia, as an Associate Professor. His current research interests include machine learning and pattern recognition, such as feature selection, model selection, and kernel-based learning methods, and computer vision, such as categorization and content-based image retrieval.

Luping Zhou received the Ph.D. degree from Australian National University, Canberra, ACT, Australia. She was a Research Fellow with the University of North Carolina at Chapel Hill, Chapel Hill, NC, USA, and a Staff Research Scientist with the Commonwealth Scientific and Industrial Research Organisation, Clayton, VIC, Australia. She is currently a Senior Lecturer with the School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, Australia. Her current research interests include machine learning, medical image analysis, and computer vision. Dr. Zhou received the Discovery Early Career Researcher Award from the Australian Research Council in 2015.

Jing Huo received her B.Eng. degree in computer science and technology from Nanjing Normal University, China, in 2011. She is currently working towards her Ph.D. degree in the Department of Computer Science, Nanjing University. Her research interests lie in machine learning and computer vision. Her work currently focuses on metric learning, subspace learning and their applications to heterogeneous face recognition.

Yinghuan Shi is currently an Assistant Researcher in the Department of Computer Science and Technology of Nanjing University, China. He received his Ph.D. and B.Sc. degrees from the Department of Computer Science of Nanjing University in 2013 and 2007, respectively. He was a visiting scholar at the University of Technology, Sydney, and the University of North Carolina at Chapel Hill. His research interests include computer vision and medical image analysis. He has published more than 40 research papers in related journals and conferences such as TPAMI, TBME, TNNLS, TCYB, CVPR, AAAI, ACMMM and IPMI. He serves as a program committee member for several conferences, and also as a referee for several journals.