Pattern Recognition 75 (2018) 302–314
OPML: A one-pass closed-form solution for online metric learning
Wenbin Li a, Yang Gao a,∗, Lei Wang b, Luping Zhou b, Jing Huo a, Yinghuan Shi a
a National Key Laboratory for Novel Software Technology, Nanjing University, China
b School of Computing and Information Technology, University of Wollongong, Australia
Article info
Article history:
Received 28 September 2016
Revised 12 January 2017
Accepted 8 March 2017
Available online 9 March 2017
Keywords:
One-pass
Online metric learning
Triplet construction
Face verification
Abnormal event detection
Abstract
To achieve a low computational cost when performing online metric learning on large-scale data, we present a one-pass closed-form solution, namely OPML. OPML first adopts a one-pass triplet construction strategy, which uses only a very small number of triplets to approximate the representation ability of the full triplet set constructed by batch-manner methods. OPML then employs a closed-form solution to update the metric for each newly arriving sample, which leads to low space (i.e., O(d)) and time (i.e., O(d^2)) complexity, where d is the feature dimensionality. In addition, an extension of OPML (namely COPML) is proposed to enhance robustness when, as often happens in practice, the first several samples come from the same class (i.e., the cold start problem). In the experiments, we systematically evaluate OPML and COPML on three typical tasks, including UCI data classification, face verification, and abnormal event detection in videos, so as to assess the proposed methods across different sample sizes, feature dimensionalities and feature extraction schemes (i.e., hand-crafted and deeply learned). The results show that OPML and COPML achieve promising performance at a very low computational cost. The effectiveness of COPML under the cold start setting is also experimentally verified.
© 2017 Elsevier Ltd. All rights reserved.
1. Introduction
In computer vision and machine learning, learning a meaningful distance/similarity metric on the original feature representation of samples, subject to given distance constraints (either pairwise similar/dissimilar constraints or triplet-based relative distance constraints), is regarded as a crucial and challenging problem, and it has been actively studied for decades. According to the measure function used (e.g., the Mahalanobis distance function or the bilinear similarity function), current metric learning methods can be roughly divided into two categories: Mahalanobis distance-based methods [1–5] and bilinear similarity-based methods [6–8]. The first class, Mahalanobis distance-based methods, learns a pairwise real-valued distance function parameterized by a symmetric Positive Semi-Definite (PSD) matrix. The second class, bilinear similarity-based methods, learns a bilinear similarity function that does not need to impose the PSD constraint on the learned metric.
∗ Corresponding author.
E-mail addresses: [email protected] (W. Li), [email protected] (Y. Gao),
[email protected] (L. Wang), [email protected] (L. Zhou), [email protected]
(J. Huo), [email protected] (Y. Shi).
http://dx.doi.org/10.1016/j.patcog.2017.03.016
0031-3203/© 2017 Elsevier Ltd. All rights reserved.
Recently, instead of the batch manner, learning the metric in an online manner, which is referred to as online metric learning (OML), has attracted a lot of interest, with the goal of learning a discriminative metric from partially observed data in order to deal efficiently with large-scale learning problems. Generally, to reach online processing speed on large-scale problems, OML methods must address two core issues: (1) how to quickly construct triplets (or pairs) from the original data, especially for large-scale data, and (2) how to quickly update the metric with newly arriving samples in real time.

For fast triplet (or pair) construction (the first issue), existing OML methods usually assume that pairwise or triplet constraints can be obtained in advance [9], or employ a random sampling strategy to reduce the number of triplets [6]. However, in real applications it is usually infeasible to access the entire training set at a time; especially when the training set is relatively large, constructing the constraints is both time- and space-consuming. To this end, we propose a novel one-pass triplet construction strategy to rapidly construct triplets in an online manner. In particular, the strategy selects the two latest samples, from the same and from a different class of the currently available samples respectively, to construct a triplet. Compared with the Online Algorithm for Scalable Image Similarity (OASIS) [6], which utilizes a random sampling strategy and stores the entire training data in memory with space complexity of O(md) (d is the feature dimensionality, and m is the data size,
Table 1
The comparison of different OML methods. MA/BI denotes the Mahalanobis distance-based/bilinear similarity-based method, respectively. The last 3 columns give the processing time (in ms) per sample at different dimensions d.

Method  Type  Constraint      Solution     d=21   d=64   d=310
POLA    MA    Pair            approximate  8      6.2    120
RDML    MA    Pair            approximate  0.040  0.043  2.4
LEGO    MA    Pair            closed-form  0.159  0.472  47
OASIS   BI    Triplet         closed-form  0.029  0.028  9.4
SOML    BI    Triplet         approximate  0.032  0.094  21
OPML    MA    Triplet         closed-form  0.026  0.023  1.7
COPML   MA    Pair & Triplet  closed-form  0.027  0.024  1.7
which is very large for large-scale data), our one-pass strategy vastly reduces the space complexity to O(cd) (c is the total number of classes, which is usually small). Also, the time complexity of our triplet construction strategy is O(1), which is truly fast.
For fast metric updating (the second issue), several studies [6,10] adopt a closed-form solution for accurate computation. Among them, OASIS adopts bilinear similarity learning and has a closed-form solution, but it lacks the good interpretability of Mahalanobis distance metric learning (i.e., a linear projection) and the learned similarity function is asymmetric. In contrast, LogDet Exact Gradient Online (LEGO) [10] learns a Mahalanobis distance and has a closed-form solution. In addition, thanks to its LogDet regularization, LEGO is not required to maintain the PSD constraint, which is time-consuming in some Mahalanobis distance metric learning methods [9,11,12]. However, LEGO is designed for pairwise constraints. Compared with LEGO, we develop a different Mahalanobis distance-based OML method for triplet-based constraints, named OPML, which also has a closed-form solution and does not need projection steps to maintain the PSD constraint. Specifically, the proposed OPML directly learns the transformation matrix L ∈ R^{d×d} (M = L^T L is the symmetric PSD matrix usually learnt in Mahalanobis metric learning); such a setting does not require imposing the PSD constraint. By carefully analyzing the structure of the triplet-based loss and using a few fundamental properties (Lemmas 2 and 3), a closed-form solution at each step is obtained with time complexity O(d^2). Actually, when we learn a low-rank matrix M, which means learning a rectangular matrix L ∈ R^{c×d} (c is the rank of M), the time complexity can be further reduced to O(cd). Since the rank c is difficult to choose in practice and the performance of low-rank OPML is usually worse than that of full-rank OPML, we mainly focus on learning a square matrix L in this paper. The major differences between OPML and OASIS/LEGO can be found in Table 1. Please note that in our online metric learning setting, the overall time complexity of an algorithm is influenced by three parts: the time to construct pairwise/triplet constraints, the time of each metric update, and the number of pairwise/triplet constraints used for updating. Hence, we present an integrated processing time per sample (the total time divided by the data size) instead of presenting the time complexity of the metric update of each algorithm alone, which is almost O(d^2) with different coefficients (except for SOML).
Also, in some tasks, e.g., abnormal event detection in videos, the data is usually imbalanced: the first several samples may belong to the same class, in which case the triplet construction strategy is invalid until samples from different classes appear. We call this the cold start case. To deal with the cold start issue, an extension, namely COPML, is developed in this paper. Specifically, COPML includes a pre-stage that constructs pairwise constraints from two adjacent samples (of the same class) to update the metric.
To summarize, compared with previous OML methods, the advantages of our work are as follows. First, the proposed OPML and COPML are easy to implement. Second, OPML and COPML are scalable to large datasets, with low space (i.e., O(d)) and time (i.e., O(d^2)) complexity, where d is the feature dimensionality. Third, OPML and COPML can easily be extended to handle high-dimensional data by learning a rectangular matrix L ∈ R^{c×d} with time complexity O(cd). Fourth, we derive several theoretical guarantees, including the difference bound between the metrics learned with the one-pass and batch triplet construction strategies, the average loss bound between these two strategies, and a regret bound, to guarantee the effectiveness of our methods.
The rest of this paper is organized as follows. In Section 2, we review related OML methods. Section 3 details the proposed one-pass triplet construction strategy and the OPML and COPML algorithms. In Section 4, we give the theoretical guarantees for our algorithms. The experimental results, comparisons and analysis are given in Section 5, followed by conclusions in Section 6.
2. Related work

Typically, previous online metric learning methods can be roughly classified into two categories: bilinear similarity-based and Mahalanobis distance-based. A comparison of the most related works is given in Table 1 for clarity.
Among bilinear similarity-based methods, OASIS [6], built on the Passive-Aggressive algorithm [13], learns a similarity metric for image similarity. Sparse Online Metric Learning (SOML) [7] follows a setting similar to OASIS, but learns a diagonal matrix instead of a full matrix to handle very high-dimensional cases. To deal with multi-modal data, an online kernel-based method, namely Online Multiple Kernel Similarity (OMKS), was proposed by Xia et al. [8]. All these methods are based on triplet constraints, and they assume that the constraints are available beforehand or can be randomly sampled from the entire dataset. Among them, OASIS is the most relevant to the proposed OPML, as both are Passive-Aggressive based. The main differences between OASIS and OPML are that OASIS learns a bilinear similarity metric (hard to interpret, and asymmetric) while OPML learns a Mahalanobis metric (good interpretability, and symmetric), and that OASIS randomly samples triplet constraints from the entire training set (with space complexity O(md)) while OPML constructs triplet constraints in an online manner (with space complexity O(cd) and time complexity O(1)), which makes the solutions of the objective functions largely different.
Among Mahalanobis distance-based methods, the Pseudo-Metric Online Learning Algorithm (POLA) [9] is the first OML method; it introduces successive projection operations to learn the optimal metric. LEGO [10] is an extended version of Information Theoretic Metric Learning-Online (ITML-Online) [12], built with LogDet divergence regularization. Jin et al. [11] presented a regularized OML method, namely RDML, with a provable regret bound. Also, Kunapuli and Shavlik proposed a unified approach based on composite mirror descent, named MDML [14]. These
Fig. 1. The illustration of one-pass triplet construction. x_1, x_2, ..., x_t denote the samples at the 1st, 2nd and t-th time steps, respectively (three classes shown over time).
methods are all based on pairwise constraints, and all of them assume that the pairwise constraints can be obtained in advance, except RDML, which receives two adjacent samples as a pairwise constraint at each time step. In fact, pairwise constraint-based methods can easily be converted to online pair construction by using the strategy of RDML. However, in general, triplet constraints are more effective than pairwise constraints for learning a metric [6,15–17]. In contrast, the proposed OPML is a triplet constraint-based Mahalanobis distance method. Besides, the proposed method has a closed-form solution and does not require the PSD constraint, making it more efficient.
3. The proposed method
We now first present our strategy for one-pass triplet construc-
tion, and then discuss the technical details of OPML and COPML,
respectively.
3.1. One-pass triplet construction

When dealing with large-scale data, quickly obtaining the triplets is a crucial step, since the number of triplets usually determines the tradeoff between effectiveness and time efficiency. There are several previous attempts to reduce the size of the triplet set, e.g., partial complete construction after random sampling [7], random sampling as in OASIS [6], and an online importance sampling scheme [18]. The partial complete construction strategy constructs all possible triplets after randomly sampling some positive and negative samples, which is time- and space-consuming and comes without a theoretical guarantee. The random sampling strategy is limited by its space consumption. In [18], the online importance sampling strategy must calculate the pairwise relevance scores between any two samples in advance, which is time-consuming. Inspired by the impressive scalability of one-pass strategies [19–22], we propose a one-pass triplet construction strategy, aiming to quickly obtain all the triplets in a single pass over the data. Fig. 1 illustrates the main idea. Formally, in an online manner, the sample at the t-th (t = 1, 2, ..., T) step is denoted as x_t ∈ R^d. An array H = {h_1, h_2, ..., h_c} is maintained, where c is the total number of classes, and h_k ∈ R^d (k ∈ H_label = {1, 2, ..., c}) is the latest sample of the k-th class at the current time step.

For triplet construction, our goal is to obtain a triplet ⟨x_t, x_p, x_q⟩ (x_t, x_p, x_q ∈ R^d) such that x_t and x_p belong to the same class, while x_t and x_q belong to different classes. Specifically, at the t-th time step, given x_t belonging to the k-th class, we assign h_k as x_p, and randomly pick h_{k'} (k' ∈ H_label, k' ≠ k) as x_q. Then h_k ∈ H is replaced by x_t, so that H always consists of the latest sample of each class. Please note that if x_t belongs to the (c+1)-th class, we assign x_t to h_{c+1}, and update the number of classes as well as H accordingly.

Intuitively, at the t-th time step, the triplet can be constructed as follows:
1: Input: x_t, H = {h_1, h_2, ..., h_c}, H_label = {1, 2, ..., c}
2: Output: ⟨x_t, x_p, x_q⟩
3: if label(x_t) belongs to the k-th class in H_label
4:   x_p = h_k, x_q = h_{k'}, where k' ≠ k
5:   h_k = x_t, update H
6: else
7:   add h_{c+1} = x_t to H and c + 1 to H_label
8: end if
We can observe that the space complexity of this strategy is O(cd). When the number of classes c is small, the space complexity can be regarded as O(d). Also, the time complexity of triplet construction at each time step is O(1).
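As a concrete illustration, the construction step above can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are ours, not from the paper):

```python
import random

def one_pass_triplet(x_t, label_t, H):
    """One step of the one-pass triplet construction (illustrative sketch).

    H maps each class label seen so far to the latest sample of that class,
    so the memory cost is O(cd) for c classes of d-dimensional samples.
    Returns <x_t, x_p, x_q> when a triplet can be formed, otherwise None;
    H is updated in place either way, which makes each step O(1).
    """
    triplet = None
    other_labels = [k for k in H if k != label_t]
    if label_t in H and other_labels:
        x_p = H[label_t]                      # latest sample of the same class
        x_q = H[random.choice(other_labels)]  # latest sample of another class
        triplet = (x_t, x_p, x_q)
    H[label_t] = x_t                          # x_t is now the latest of its class
    return triplet
```

Note that the first sample of a new class produces no triplet (there is no same-class partner yet), which is exactly the situation the cold start extension of Section 3.3 addresses.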
3.2. OPML

We mainly focus on Mahalanobis distance learning, which aims to learn a symmetric PSD matrix M ∈ S^{d×d}_+ (the cone of d × d real-valued symmetric PSD matrices) and can be formally defined as follows:

D_M(x_i, x_j) = sqrt((x_i − x_j)^T M (x_i − x_j)),  (1)

where x_i ∈ R^d and x_j ∈ R^d are the i-th and j-th samples, respectively. M can be mathematically decomposed as L^T L, where L ∈ R^{r×d} (r is the rank of M) denotes the transformation matrix. Then we can rewrite Eq. (1) as:

D_L(x_i, x_j) = ‖L(x_i − x_j)‖_2^2.  (2)

Our goal is to learn a transformation matrix L that satisfies the following large-margin constraint:

D_L(x_i, x_l) > D_L(x_i, x_j) + 1,  ∀ x_i, x_j, x_l ∈ R^d,  (3)

where x_i and x_j belong to the same class, while x_i and x_l belong to different classes. We define the hinge loss function as:

G((x_i, x_j, x_l); L) = max(0, 1 + D_L(x_i, x_j) − D_L(x_i, x_l)).  (4)
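For concreteness, Eqs. (1)–(4) can be written down directly in NumPy (a sketch with our own function names; note that with M = L^T L, the quantity D_L of Eq. (2) is the square of D_M in Eq. (1)):

```python
import numpy as np

def d_M(x_i, x_j, M):
    """Mahalanobis distance of Eq. (1): sqrt((x_i - x_j)^T M (x_i - x_j))."""
    diff = x_i - x_j
    return np.sqrt(diff @ M @ diff)

def d_L(x_i, x_j, L):
    """Squared form of Eq. (2): ||L(x_i - x_j)||_2^2."""
    return float(np.sum((L @ (x_i - x_j)) ** 2))

def hinge_loss(x_i, x_j, x_l, L):
    """Triplet hinge loss of Eq. (4); zero once the margin of Eq. (3) holds."""
    return max(0.0, 1.0 + d_L(x_i, x_j, L) - d_L(x_i, x_l, L))
```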
By applying the one-pass triplet construction, at the t-th time step we obtain the triplet ⟨x_t, x_p, x_q⟩. Thus, using the Passive-Aggressive algorithm [13], the online optimization is formulated as:

L_t = arg min_L Φ(L) = arg min_L (1/2)‖L − L_{t−1}‖_F^2 + (γ/2)[1 + ‖L(x_t − x_p)‖_2^2 − ‖L(x_t − x_q)‖_2^2]_+,  (5)

where γ is the regularization parameter, which is set in the range (0, 1/4) as a sufficient condition to theoretically guarantee the positive-definite property (see Lemma 2); ‖·‖_F denotes the Frobenius norm, ‖·‖_2 denotes the ℓ2-norm, and [z]_+ = max(0, z), namely the hinge loss G((x_t, x_p, x_q); L).

The optimal solution is obtained when the gradient vanishes, ∂Φ(L)/∂L = 0; hence we have:

∂Φ(L)/∂L = { L − L_{t−1} + γLA_t = 0,  [z]_+ > 0;
             L − L_{t−1} = 0,           [z]_+ = 0, }  (6)

where A_t = (x_t − x_p)(x_t − x_p)^T − (x_t − x_q)(x_t − x_q)^T ∈ R^{d×d}. Since x_t, x_p and x_q are all nonzero vectors, x_t ≠ x_p and x_t ≠ x_q, the
rank of (x_t − x_p)(x_t − x_p)^T and (x_t − x_q)(x_t − x_q)^T is 1. Hence, the rank of A_t is 1 or 2. When x_t − x_p ≠ μ(x_t − x_q), we get rank(A_t) = 2.

Lemma 1. Let M_1, M_2 be two PSD matrices, and Δ = M_1 − M_2. The eigenvalues of Δ, denoted by λ(Δ), satisfy:

−λmax(M_2) ≤ λ(Δ) ≤ λmax(M_1),  (7)

where λmax(M_1) and λmax(M_2) are the maximum eigenvalues of M_1 and M_2, respectively.

Proof. ∀ x ∈ R^d and x ≠ 0, x^T Δ x = x^T (M_1 − M_2) x. Since M_1 and M_2 are both PSD matrices, x^T M_1 x ≥ 0 and x^T M_2 x ≥ 0. Thus,

−(x^T M_2 x)/(x^T x) ≤ (x^T Δ x)/(x^T x) ≤ (x^T M_1 x)/(x^T x).  (8)

According to the Rayleigh quotient, we have λmin(Δ) ≤ (x^T Δ x)/(x^T x) ≤ λmax(Δ). Assume (x^T Δ x)/(x^T x) achieves its maximum λmax(Δ) at x = e, i.e., (e^T Δ e)/(e^T e) = λmax(Δ). By Eq. (8), we obtain

λmax(Δ) = (e^T Δ e)/(e^T e) ≤ (e^T M_1 e)/(e^T e) ≤ λmax(M_1).  (9)

Hence λmax(Δ) ≤ λmax(M_1). In the same way, we can prove that −λmax(M_2) ≤ λmin(Δ). Thus,

−λmax(M_2) ≤ λmin(Δ) ≤ λ(Δ) ≤ λmax(Δ) ≤ λmax(M_1).  (10)  □
Lemma 2. If 0 < γ < 1/4 and samples are normalized, I + γA_t is a positive definite matrix and is therefore invertible, where I ∈ R^{d×d} is the identity matrix.

Proof. Let M_1 = (x_t − x_p)(x_t − x_p)^T, M_2 = (x_t − x_q)(x_t − x_q)^T and A_t = M_1 − M_2. It is easy to see that λmax(M_1) = ‖x_t − x_p‖_2^2 and λmax(M_2) = ‖x_t − x_q‖_2^2. According to Lemma 1, −λmax(M_2) ≤ λ(A_t) ≤ λmax(M_1). Thus,

1 − γλmax(M_2) ≤ λ(I + γA_t) ≤ 1 + γλmax(M_1).  (11)

For normalized samples, namely 0 ≤ ‖x_t‖ ≤ 1, λmax(M_1) and λmax(M_2) lie in [0, 4]. Given 0 < γ < 1/4, λ(I + γA_t) ≥ 1 − γλmax(M_2) > 0. Obviously, I + γA_t is a symmetric matrix. Hence, I + γA_t is positive definite, and it is invertible.  □
According to Lemma 2, the optimal L_t can be updated as:

L_t = { L_{t−1}(I + γA_t)^{−1},  [z]_+ > 0;
        L_{t−1},                 [z]_+ = 0. }  (12)

The time complexity of the matrix inversion in Eq. (12) is O(d^3). However, the rank-2 property of A_t offers a nice way to accelerate the computation, by applying Lemma 3.
Lemma 3 ([23]). Given G and G + B as two nonsingular matrices, let B have rank r > 0. Let B = B_1 + ··· + B_r, where each B_k has rank 1, and C_{k+1} = G + B_1 + ··· + B_k is nonsingular for k = 1, ..., r, where C_1 = G. Then

(G + B)^{−1} = C_r^{−1} − g_r C_r^{−1} B_r C_r^{−1},  (13)

where, for k = 1, ..., r,

C_{k+1}^{−1} = C_k^{−1} − g_k C_k^{−1} B_k C_k^{−1},  g_k = 1 / (1 + tr(C_k^{−1} B_k)).  (14)
Theorem 1. If A_t is a rank-2 matrix and samples are normalized, then

(I + γA_t)^{−1} = I − (1/(η + β))[ηγA_t − (γA_t)^2],  (15)

where η = 1 + tr(γA_t) and β = (1/2)[(tr(γA_t))^2 − tr((γA_t)^2)].

Proof. According to Lemma 3, we set G = I, B = γA_t, and rewrite B = B_1 + B_2, where B_1 = γ(x_t − x_p)(x_t − x_p)^T and B_2 = −γ(x_t − x_q)(x_t − x_q)^T. It is obvious that B_1 and B_2 both have rank 1. Applying Lemma 3 then yields Theorem 1.  □
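The closed form of Eq. (15) is easy to sanity-check numerically against a direct matrix inverse (a small NumPy check of ours, under the assumptions of Lemma 2: normalized samples and γ ∈ (0, 1/4)):

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 6, 0.2                                  # gamma in (0, 1/4)

# three normalized samples, as Lemma 2 assumes
x_t, x_p, x_q = [v / np.linalg.norm(v) for v in rng.standard_normal((3, d))]
A_t = np.outer(x_t - x_p, x_t - x_p) - np.outer(x_t - x_q, x_t - x_q)

gA = gamma * A_t                                   # rank <= 2
eta = 1.0 + np.trace(gA)
beta = 0.5 * (np.trace(gA) ** 2 - np.trace(gA @ gA))

# Eq. (15); here (gamma*A_t)^2 is formed by a full product for clarity,
# although the rank-2 structure allows O(d^2), as Eq. (16) exploits
inv_closed = np.eye(d) - (eta * gA - gA @ gA) / (eta + beta)

# agrees with the direct O(d^3) inverse used in Eq. (12)
assert np.allclose(inv_closed, np.linalg.inv(np.eye(d) + gA))
```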
By using Theorem 1 and plugging Eq. (15) back into the first case of Eq. (12), we obtain

L_t = L_{t−1} − (ηγ/(η + β))(L_{t−1}aa^T − L_{t−1}bb^T) + (γ^2/(η + β))[(a^T a)L_{t−1}aa^T − (a^T b)L_{t−1}ab^T − (b^T a)L_{t−1}ba^T + (b^T b)L_{t−1}bb^T],  (16)

where a = x_t − x_p and b = x_t − x_q.

Complexity: The time complexity of calculating η and β in Eq. (16) is O(d). In addition, a^T a, a^T b, b^T a and b^T b are scalars, each of which costs O(d). The time complexity of calculating L_{t−1}aa^T, L_{t−1}bb^T, L_{t−1}ab^T and L_{t−1}ba^T is O(d^2) each. Therefore, the overall time complexity of Eq. (16) is O(d^2). We give the pseudo-code of OPML in Algorithm 1.
Algorithm 1 OPML.
Require: (x_t, y_t)|_{t=1}^T, γ.
Ensure: L.
1: L_0 ← I.
2: for t = 1, 2, ..., T do
3:   ⟨x_t, x_p, x_q⟩ ← one-pass triplet construction
4:   if G((x_t, x_p, x_q); L_{t−1}) = 0 then
5:     L_t = L_{t−1}.
6:   else
7:     L_t ← solution by Eq. (16)
8:   end if
9: end for
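A faithful rendering of the inner step of this loop in NumPy might look as follows (a sketch; `opml_step` is our own name, and the active branch is the closed form of Eq. (16) with η and β computed from scalars to stay within O(d^2)):

```python
import numpy as np

def opml_step(L_prev, x_t, x_p, x_q, gamma=0.2):
    """One OPML update (the inner step of Algorithm 1), in O(d^2) time.

    Passive when the triplet hinge loss is zero; otherwise applies the
    closed-form update of Eq. (16), equal to L_prev @ inv(I + gamma*A_t)
    without ever forming a d x d inverse.
    """
    a, b = x_t - x_p, x_t - x_q
    La, Lb = L_prev @ a, L_prev @ b               # O(d^2) matrix-vector products
    z = 1.0 + La @ La - Lb @ Lb                   # hinge argument of Eq. (5)
    if z <= 0.0:
        return L_prev                             # [z]_+ = 0: keep the metric
    aa, bb, ab = a @ a, b @ b, a @ b              # O(d) scalars
    eta = 1.0 + gamma * (aa - bb)                 # 1 + tr(gamma * A_t)
    beta = gamma ** 2 * (ab ** 2 - aa * bb)       # 0.5[(tr gA)^2 - tr((gA)^2)]
    term1 = np.outer(La, a) - np.outer(Lb, b)     # L_{t-1} A_t
    term2 = (aa * np.outer(La, a) - ab * np.outer(La, b)
             - ab * np.outer(Lb, a) + bb * np.outer(Lb, b))
    return L_prev - (eta * gamma / (eta + beta)) * term1 \
                  + (gamma ** 2 / (eta + beta)) * term2
```

Starting from L_0 = I and feeding triplets from the one-pass construction reproduces Algorithm 1.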
3.3. Extending OPML to the cold start case

In practice, the first several available samples may all belong to the same class, which we call the cold start case. When cold start happens, since no triplet can be constructed, OPML will discard all these initial samples. To address this issue, we extend the proposed OPML to an enhanced version, namely COPML, which includes an additional pre-stage before calling OPML. Specifically, in the pre-stage, if a triplet cannot be constructed (i.e., samples from different classes are not yet available), the metric L can only be updated based on samples from the same class, for which the pairwise constraint is usually adopted. Typically, the pairwise constraint is written as ⟨x_t, x_{t+1}, y*_{t,t+1}⟩, where y*_{t,t+1} = 1 if two adjacent samples x_t ∈ R^d and x_{t+1} ∈ R^d share the same class, and y*_{t,t+1} = −1 otherwise. Actually, we only need to consider the case y*_{t,t+1} = 1, because we can update the metric by calling OPML if y*_{t,t+1} = −1 (the new sample belongs to a different class). After the pre-stage, OPML is adopted for the subsequent learning process.

Formally, in the pre-stage, when only the pairwise constraint ⟨x_t, x_{t+1}, y*_{t,t+1}⟩ can be used, the online optimization formulation
is formulated as follows:

L_t = arg min_L Φ(L) = arg min_L (1/2)‖L − L_{t−1}‖_F^2 + (γ_1/2) y*_{t,t+1} ‖L(x_t − x_{t+1})‖_2^2,  (17)

where γ_1 > 0 is the regularization parameter. The optimal solution is obtained when the gradient vanishes, ∂Φ(L)/∂L = 0; hence

∂Φ(L)/∂L = L − L_{t−1} + γ_1 y*_{t,t+1} L Δ_t = 0,  (18)

where Δ_t = (x_t − x_{t+1})(x_t − x_{t+1})^T. Obviously, Δ_t is a rank-1 PSD matrix for x_t ≠ x_{t+1}. When y*_{t,t+1} = 1, I + γ_1 y*_{t,t+1} Δ_t is a symmetric positive definite matrix and is therefore invertible. The optimal L_t is then obtained as

L_t = L_{t−1}(I + γ_1 Δ_t)^{−1}.  (19)

By the Sherman-Morrison formula, Eq. (19) can be equivalently rewritten as

L_t = L_{t−1} − γ_1 L_{t−1}(x_t − x_{t+1})(x_t − x_{t+1})^T / (1 + γ_1 (x_t − x_{t+1})^T (x_t − x_{t+1})),  (20)

whose time complexity is also O(d^2). Algorithm 2 shows the pseudo-code of COPML.
Algorithm 2 COPML.
Require: (x_t, y_t)|_{t=1}^T, γ_1, γ_2.
Ensure: L.
1: L_0 ← I, c ← 0.
2: for t = 1, 2, ..., T do
3:   Maintain H = {h_1, h_2, ..., h_c}
4:   if y_t = c = 1 then
5:     ⟨x_t, x_{t+1}, y*_{t,t+1}⟩ ← two adjacent samples
6:     L_t = L_{t−1} − γ_1 L_{t−1}(x_t − x_{t+1})(x_t − x_{t+1})^T / (1 + γ_1 (x_t − x_{t+1})^T (x_t − x_{t+1})).
7:   else if 1 ≤ y_t ≤ c and c ≥ 2 then
8:     L ← call OPML(x_t, y_t, γ_2)
9:   else
10:    h_{c+1} ← x_t
11:    c ← c + 1
12:  end if
13: end for
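The pre-stage update of line 6 can likewise be sketched in NumPy (a sketch; the function name is ours, and the update is exactly Eq. (20)):

```python
import numpy as np

def copml_prestage_step(L_prev, x_t, x_next, gamma1=0.2):
    """COPML pre-stage update of Eq. (20) for a same-class adjacent pair.

    The Sherman-Morrison formula turns the rank-1 inverse of Eq. (19)
    into an O(d^2) outer-product update.
    """
    u = x_t - x_next
    Lu = L_prev @ u
    return L_prev - (gamma1 / (1.0 + gamma1 * (u @ u))) * np.outer(Lu, u)
```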
4. Theoretical guarantee

The following theorems guarantee the effectiveness of our methods. Theorem 2 shows that the difference between the metrics learned with the one-pass and the batch triplet construction strategies is bounded. Note that, for a fair comparison, the batch triplet construction strategy is also considered in an online manner; that is, for each sample x_t at the t-th time step, all past samples are stored to construct triplets with x_t (i.e., each triplet contains x_t). Theorem 3 also shows, from another perspective, that the one-pass triplet construction strategy can approximate the batch triplet construction. Moreover, a regret bound is proved for the proposed OPML algorithm in Theorem 4. All proofs are provided in the appendix.
Theorem 2. Let L_t be the solution output by OPML with the one-pass triplet construction strategy at the t-th time step, and let L*_t be the solution output by OPML with the batch triplet construction strategy at the t-th time step. Assuming ‖x‖ ≤ 1 (for all samples), ‖L_t‖_F ≤ U and ‖L*_t‖_F ≤ U, the difference between these two matrices is bounded as

‖L_t − L*_t‖_F ≤ U · ‖ Σ_{i=1}^{C_N} B_i + Σ_{i=1,j=1,i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ‖_F,  (21)

where ‖B‖_F ≤ 32|γ^2/(η+β)| + 4√2 |ηγ/(η+β)| (for all B_i, B_j, ...), γ ∈ (0, 1/4), η ∈ (−1/4, 9/4), and β ∈ (−1, 25/32).
Theorem 3. Let ⟨x_t, x_p, x_q⟩ be the triplet constructed by the proposed one-pass triplet construction strategy at the t-th time step, and let {⟨x_t, x_{p_i}, x_{q_i}⟩}|_{i=1}^C be the triplet set constructed by the batch triplet construction strategy at the t-th time step. Assume ‖x‖_2 ≤ R (for all samples), ‖L‖_F ≤ U, ‖L*‖_F ≤ U, and that after the transformation of L or L* the angle θ between two samples from the same class is small (i.e., cos θ = α, α ≥ 0 and α close to 1), while θ is large otherwise (i.e., cos θ = −ξ, ξ ≥ 0 and ξ close to 1). Then the average loss bound between these two strategies at the t-th time step is

ℓ_1 − ℓ_2 ≤ 2(α + ξ + 1) R^2 U^2,  (22)

where ℓ_1 denotes the average loss generated by the one-pass triplet construction strategy, and ℓ_2 the average loss of the batch construction strategy.
Theorem 4. Let ⟨x_1, x_{p_1}, x_{q_1}⟩, ..., ⟨x_T, x_{p_T}, x_{q_T}⟩ be a sequence of triplets constructed by the proposed one-pass strategy. Let L_t|_{t=1}^T be the solutions output by OPML, and let L* be the optimal offline solution. Under the same assumptions as in Theorem 3 (‖x‖_2 ≤ R for all samples, ‖L‖_F ≤ U, ‖L*‖_F ≤ U, cos θ = α for same-class pairs and cos θ = −ξ otherwise after transformation), the regret bound is

R(L*, T) ≤ 2T(α + ξ + 1) R^2 U^2.  (23)
5. Experiments

To verify the effectiveness of our methods, we evaluate OPML and COPML on three typical tasks: (1) UCI data classification, (2) face verification, and (3) abnormal event detection in videos. An additional experiment is also conducted to validate the robustness of COPML when the cold start issue occurs.
5.1. UCI data classification

We use twelve datasets from the UCI repository for evaluation. The k-NN classifier is employed, since it is widely used for classification and has only one parameter. The detailed information of these datasets is presented in Table 2. All twelve datasets are normalized by Z-score. For each dataset, 50% of the samples are randomly picked for training and the rest are used for testing. We adopt the error rate as the evaluation criterion, and to reduce the influence of random partitioning, all classification results are averaged over 100 individual runs.
For an extensive comparison, we include several state-of-the-art methods, covering both batch metric learning and OML. The batch metric learning methods are: (1) the Euclidean distance metric (Eucli for short); (2) the Mahalanobis distance metric (Maha for short); (3) LMNN (Large Margin Nearest Neighbor) [15]; (4) ITML [12]. The OML methods are: (1) OASIS [6]; (2) RDML [11]; (3) POLA [9]; (4) LEGO [10]; (5) SOML-TG (SOML for short) [7]. The implementations of LMNN, ITML and OASIS were provided by the authors of the respective papers, while
Table 2
Error rates (mean ± std. deviation) of a k-NN (k = 5) classifier on the UCI datasets. p-values of Student's t-test are calculated between other methods and our methods. •/◦ indicates that OPML performs statistically better/worse than the respective method according to the p-values. The win/tie/loss statistics are also included. The value in brackets is the corresponding total processing time in seconds; 0.00 denotes a value below 0.005. Abbreviations: Sam, samples; Dim, dimensionality; C, classes.

Data        Sam   Dim  C   Euclidean        Mahalanobis      LMNN
lsvt        126   310  2   0.234 ± 0.056 •  0.238 ± 0.046 •  0.196 ± 0.044 (46.08)
iris        150   4    3   0.050 ± 0.023    0.075 ± 0.025 •  0.037 ± 0.017 ◦ (1.55)
wine        178   13   3   0.044 ± 0.020    0.046 ± 0.023    0.030 ± 0.016 ◦ (3.18)
glass       214   9    7   0.336 ± 0.036    0.349 ± 0.037    0.341 ± 0.038 (4.52)
spect       267   22   2   0.327 ± 0.035    0.348 ± 0.034 •  0.336 ± 0.037 (2.48)
ionosphere  351   34   2   0.172 ± 0.019 •  0.165 ± 0.016    0.131 ± 0.020 ◦ (3.81)
balance     625   4    3   0.146 ± 0.014 •  0.133 ± 0.017 •  0.124 ± 0.014 ◦ (1.42)
breast      683   9    2   0.034 ± 0.008    0.034 ± 0.007    0.033 ± 0.008 (1.26)
pima        768   8    2   0.273 ± 0.018 •  0.271 ± 0.018 •  0.272 ± 0.017 • (1.12)
segment     2310  19   7   0.067 ± 0.006 •  0.101 ± 0.008 •  0.047 ± 0.006 ◦ (5.59)
waveform    5000  21   3   0.187 ± 0.006 ◦  0.158 ± 0.006 ◦  0.182 ± 0.060 ◦ (6.04)
optdigits   5620  64   10  0.026 ± 0.003 •  0.039 ± 0.003 •  0.014 ± 0.002 ◦ (31.21)
win/tie/loss               6/5/1            7/4/1            1/4/7

ITML                     OASIS                   RDML                    POLA
0.175 ± 0.040 ◦ (142.3)  0.205 ± 0.043 • (0.84)  0.230 ± 0.053 • (0.17)  0.157 ± 0.036 ◦ (9.17)
0.034 ± 0.016 ◦ (11.16)  0.274 ± 0.050 • (0.10)  0.077 ± 0.027 • (0.00)  0.030 ± 0.016 ◦ (0.62)
0.035 ± 0.019 ◦ (14.15)  0.019 ± 0.014 ◦ (0.08)  0.039 ± 0.019 (0.00)    0.028 ± 0.018 ◦ (2.22)
0.358 ± 0.042 • (30.94)  0.485 ± 0.057 • (0.12)  0.349 ± 0.035 (0.00)    0.395 ± 0.042 (3.86)
0.337 ± 0.041 (3.44)     0.364 ± 0.050 • (0.13)  0.343 ± 0.032 • (0.01)  0.323 ± 0.040 • (19.22)
0.139 ± 0.025 ◦ (3.65)   0.124 ± 0.033 ◦ (0.12)  0.154 ± 0.016 ◦ (0.02)  0.147 ± 0.020 ◦ (17.65)
0.104 ± 0.019 ◦ (15.67)  0.126 ± 0.009 (0.10)    0.118 ± 0.012 ◦ (0.01)  0.156 ± 0.047 • (7.31)
0.035 ± 0.008 (2.68)     0.043 ± 0.021 • (0.09)  0.033 ± 0.007 (0.01)    0.038 ± 0.010 • (5.13)
0.279 ± 0.022 • (3.16)   0.346 ± 0.053 • (0.12)  0.269 ± 0.019 (0.01)    0.275 ± 0.021 • (12.40)
0.050 ± 0.008 ◦ (38.50)  0.343 ± 0.067 • (0.11)  0.082 ± 0.006 • (0.02)  0.057 ± 0.011 (22.89)
0.187 ± 0.008 ◦ (18.57)  0.357 ± 0.039 • (0.13)  0.186 ± 0.010 ◦ (0.12)  0.250 ± 0.030 • (28.38)
0.028 ± 0.006 • (122.4)  0.077 ± 0.009 • (0.15)  0.028 ± 0.003 • (0.13)  0.023 ± 0.003 • (19.76)
3/2/7                    9/1/2                   5/4/3                   6/2/4

LEGO                     SOML                     OPML                  COPML
0.239 ± 0.050 • (2.33)   0.223 ± 0.057 • (2.23)   0.189 ± 0.048 (0.07)  0.189 ± 0.047 (0.07)
0.050 ± 0.021 (0.14)     0.287 ± 0.080 • (0.69)   0.049 ± 0.023 (0.00)  0.048 ± 0.023 (0.00)
0.031 ± 0.020 ◦ (0.23)   0.169 ± 0.093 • (0.92)   0.042 ± 0.020 (0.00)  0.041 ± 0.019 (0.00)
0.390 ± 0.034 • (0.31)   0.557 ± 0.117 • (1.16)   0.339 ± 0.032 (0.00)  0.341 ± 0.033 (0.00)
0.311 ± 0.038 ◦ (0.61)   0.393 ± 0.084 • (1.45)   0.326 ± 0.034 (0.01)  0.327 ± 0.035 (0.01)
0.154 ± 0.020 ◦ (1.09)   0.362 ± 0.116 • (2.18)   0.161 ± 0.019 (0.01)  0.163 ± 0.021 (0.01)
0.118 ± 0.011 ◦ (1.40)   0.378 ± 0.077 • (4.78)   0.129 ± 0.012 (0.01)  0.129 ± 0.014 (0.01)
0.035 ± 0.008 • (1.64)   0.054 ± 0.040 • (5.19)   0.032 ± 0.008 (0.01)  0.032 ± 0.007 (0.01)
0.266 ± 0.019 (1.96)     0.353 ± 0.060 • (5.89)   0.266 ± 0.017 (0.01)  0.265 ± 0.018 (0.02)
0.040 ± 0.006 ◦ (14.50)  0.541 ± 0.092 • (19.26)  0.059 ± 0.006 (0.03)  0.059 ± 0.006 (0.03)
0.233 ± 0.007 • (4.60)   0.368 ± 0.035 • (49.30)  0.224 ± 0.009 (0.08)  0.225 ± 0.010 (0.09)
0.022 ± 0.003 • (6.97)   0.239 ± 0.079 • (79.10)  0.019 ± 0.003 (0.09)  0.019 ± 0.003 (0.10)
5/2/5                    12/0/0
the rest of the methods were implemented by ourselves. The parameters of these methods were selected by cross-validation, except for LMNN and ITML, which use their default settings. Moreover, a parameter sensitivity experiment was also conducted to verify the influence on OPML/COPML of changing the value of the parameter γ / the parameters γ1, γ2 (see Fig. 3). For OPML, setting γ to 10^-3 seems to be a good choice, and this value was indeed chosen on all twelve UCI datasets in our experiment. For COPML, setting γ1 to 10^-4 or 10^-3 seems to be a good choice when γ2 ≡ 10^-3 (the same setting as OPML, for simplicity). Since the pairwise or triplet constraints of POLA, LEGO and SOML need to be constructed in advance, we randomly sample 10,000 constraints for these three methods (the same setting as LEGO [10]). The error rates of the proposed methods and the competing methods are presented in Table 2.
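The one-pass triplet construction can be illustrated with a short sketch. This is a minimal, hypothetical instantiation (NumPy): we assume each incoming sample x_t is paired with the most recently seen same-class sample as x_p and the most recently seen different-class sample as x_q; the exact buffering rule of OPML is defined in Section 3, so treat the rule below as an assumption for illustration.

```python
import numpy as np

def one_pass_triplets(X, y):
    """Build at most one triplet <x_t, x_p, x_q> per incoming sample x_t.

    Assumption (our simplification): x_p is the most recent earlier sample
    with the same label as x_t, and x_q is the most recent earlier sample
    with a different label. Memory stays at one sample per class.
    """
    latest = {}                                   # label -> (arrival step, sample)
    triplets = []
    for t, (x_t, label) in enumerate(zip(X, y)):
        same = latest.get(label)
        diffs = [v for l, v in latest.items() if l != label]
        if same is not None and diffs:
            x_q = max(diffs, key=lambda v: v[0])[1]   # newest different-class sample
            triplets.append((x_t, same[1], x_q))
        latest[label] = (t, x_t)
    return triplets
```

Under such a rule, at most one triplet is formed per sample after warm-up, which is why the number of triplets stays linear in the number of samples rather than cubic as in batch construction.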
Moreover, the p-values of Student's t-test were calculated to check statistical significance. Also, the win/tie/loss statistics are reported according to the obtained p-values (see Table 2). It is observed that (1) the performance of our methods is comparable to LEGO, and slightly better than that of the other OML methods; (2) the performance of our methods is close to that of the batch metric learning methods, e.g., LMNN and ITML, and better than Euclidean and Mahalanobis; (3) our methods are faster than the other OML methods and comparable with RDML, since instead of constructing triplets, RDML only requires a pairwise constraint, receiving a pair of samples at each time step.
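Since a learned metric M = LᵀL is equivalent to a linear map L followed by the Euclidean distance, the k-NN evaluation used above reduces to transforming all samples by L and running ordinary k-NN in the transformed space. A minimal NumPy sketch (the function name and the majority-vote tie-breaking are ours):

```python
import numpy as np

def knn_error_rate(L, X_train, y_train, X_test, y_test, k=5):
    """Error rate of a k-NN classifier under d(x, x') = ||L(x - x')||_2."""
    Z_train, Z_test = X_train @ L.T, X_test @ L.T    # transform once by L
    errors = 0
    for z, label in zip(Z_test, y_test):
        d = np.linalg.norm(Z_train - z, axis=1)      # Euclidean in L-space
        votes = y_train[np.argsort(d)[:k]]           # labels of k nearest
        errors += np.bincount(votes).argmax() != label
    return errors / len(y_test)
```

With L set to the identity matrix, this reproduces the plain Euclidean baseline reported in Table 2.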
To illustrate the effect of the number of triplet constraints on the learning of the metric, we vary the number of triplet constraints over (100, 1000, 2000, 5000, 10,000, 15,000, 20,000) for OASIS and SOML (see Fig. 2). The number of triplet constraints in OPML is a constant determined by the one-pass triplet construction, and we can find that OPML achieves better performance with far fewer triplet constraints (except on UCI datasets 1, 3 and 6).
5.2. Face verification: PubFig

Face verification has become a popular application for evaluating the performance of metric learning methods. Many metric learning methods [24–26] have been proposed for this practical application, including deep metric learning methods [27–29]. For face verification, we first evaluate our methods on the Public Figures Face Database (PubFig) [30]. The PubFig dataset consists of two subsets: the Development Set (7650 images of 60 individuals) and the Evaluation Set (28,954 images of 140 individuals). Following [30],
[Fig. 2 shows twelve panels (UCI Data 1–12) plotting the error rate (%) against the number of triplets (100 to 25,000) for OPML, OASIS and SOML; the vertical line in each panel marks the constant number of triplets used by OPML: 59, 72, 87, 103, 132, 172, 310, 340, 380, 1148, 2498 and 2801, respectively.]
Fig. 2. Error rates of different methods with different numbers of triplets on twelve UCI datasets (the number of triplets in OPML is a constant).
Fig. 3. Error rates of OPML/COPML on the UCI datasets when changing the value of the parameter γ / the parameters γ1, γ2 (the last three datasets are specially constructed for the cold start issue in Section 5.5).
we use the development set to develop all these methods, including parameter tuning, while the evaluation set is used for performance evaluation. The goal of face verification in PubFig is to determine whether a pair of face images belongs to the same person. Please note that images coming from the same person are regarded as belonging to the same class. For all subsets, 10-fold cross-validation is adopted to conduct the experiments, and each fold is disjoint by identity (i.e., one person will not appear in both the training and testing sets). For testing each fold (with the rest 9 folds used for training), we randomly sample 10,000 pairs (5000 intra- and 5000 extra-personal pairs) for testing. Thus, the total number of pairs is 10^5. In each training phase, we also randomly select 10,000 pairwise or triplet constraints for LEGO, POLA and SOML, with the same settings as on the UCI datasets.

For sufficient and fair comparison, we use two forms of features (i.e., attribute features and deep features) to evaluate the performance of all algorithms. Attribute features (73-dimensional), provided by Kumar et al. [30], are 'high-level' features describing nameable attributes of a face image, such as gender, race, age and hair. For deep features, we use a VGG-Face model
[Fig. 4 panels with AUC values. Development set, attribute features: OPML (0.8297), COPML (0.8207), LEGO (0.8100), POLA (0.7934), ITML (0.7835), LMNN (0.7834), OASIS (0.7701), Eucli (0.7471), RDML (0.7287), Maha (0.7238), SOML (0.6956). Evaluation set, attribute features: OPML (0.8114), COPML (0.8063), LEGO (0.8020), POLA (0.7823), LMNN (0.7617), OASIS (0.7584), RDML (0.7279), Eucli (0.7273), ITML (0.7171), Maha (0.7151), SOML (0.6965). Development set, deep features: OPML (0.9861), COPML (0.9864), LEGO (0.9860), POLA (0.9840), ITML (0.9758), LMNN (0.9805), OASIS (0.9732), Eucli (0.9854), RDML (0.9573), Maha (0.9829), SOML (0.6918). Evaluation set, deep features: OPML (0.9858), COPML (0.9856), LEGO (0.9806), POLA (0.9734), LMNN (0.9749), OASIS (0.9726), RDML (0.9745), Eucli (0.9804), ITML (0.9810), Maha (0.9844), SOML (0.6228).]
Fig. 4. ROC curves of the development set (left column) and the evaluation set (right column) on the PubFig dataset. First row: attribute features; second row: deep features. The AUC value of each method is shown in brackets.
[31] to extract a 4096-dimensional feature for each face image, which has been aligned and cropped. For easier handling, the 4096-dimensional feature is reduced to a 54-dimensional feature by the Principal Component Analysis (PCA) algorithm.
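The reduction step is ordinary PCA; the following NumPy-only sketch (random data standing in for the 4096-dimensional VGG-Face features; a library routine such as scikit-learn's PCA would behave the same) shows the projection from 4096 down to 54 dimensions:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project rows of X onto the top principal components (plain SVD-based PCA)."""
    Xc = X - X.mean(axis=0)                      # center each feature dimension
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # shape (n_samples, n_components)

# Stand-in "deep features": 200 random 4096-dimensional vectors.
feats = np.random.RandomState(0).randn(200, 4096)
feats_54 = pca_reduce(feats, 54)
```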
For each testing pair, we first calculate the distance (similarity) between the two images using the learned metric obtained from the respective method. Then, all the distances (similarities) are normalized into the range [0, 1]. Receiver Operating Characteristic (ROC) curves are provided in Fig. 4, with the corresponding AUC (Area Under the ROC Curve) values calculated. It can be observed that OPML and COPML obtain superior results compared with the state-of-the-art online/batch metric learning methods. Moreover, although the deep features already have a strong representation ability, our proposed methods can still slightly improve the performance.
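The evaluation above (normalize distances into [0, 1], turn them into similarity scores, compute the AUC) can be sketched as follows. The rank-based formula is the Mann-Whitney statistic, which equals the area under the ROC curve; the helper name is ours, and ties are ignored for simplicity (sklearn.metrics.roc_auc_score would give the same number):

```python
import numpy as np

def verification_auc(dists, same_person):
    """AUC for pair verification from metric distances.

    dists: distances between image pairs under the learned metric.
    same_person: 1 for intra-personal pairs, 0 for extra-personal pairs.
    Smaller distance should indicate 'same person'.
    """
    d = (dists - dists.min()) / (dists.max() - dists.min())  # min-max to [0, 1]
    scores = 1.0 - d                                         # similarity score
    order = scores.argsort()
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)             # 1-based ranks
    pos = same_person.astype(bool)
    n_pos, n_neg = pos.sum(), (~pos).sum()
    # Mann-Whitney U statistic normalized by n_pos * n_neg equals the AUC
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```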
.3. Face verification: LFW
For face verification, we also evaluate our methods on the La-
eled Faces in the Wild Database (LFW) [32] . LFW is a widely used
ace verification benchmark with unconstrained images, which
ontains 13,233 images of 5749 individuals. This dataset has two
iews: View 1 is used for development purposes (containing a
raining set and a test set); And, View 2 is taken as evalua-
ion benchmark for comparison (i.e., a 10-fold cross-validation set).
here are two forms of configuration in both views, that is, im-
ge restricted configuration and image unrestricted configuration.
n the first formulation, the training information is restricted to
he provided image pairs and additional information such as ac-
ual name information can not be used. In other words, we can
nly use the pairwise images for training without any label infor-
ation can be used at all. While, in the second formulation, the
ctual name information (i.e., label information) can be used and
s many pairs or triplets can be formulated as one desires. No mat-
er which configuration we choose, the test procedure is the same
i.e., using pairwise images for testing). In order to simulate the
eal online environment and because our methods and some meth-
ds (e.g., OASIS [6] , SOML [7] ) are triplet-based methods, we adopt
he image unrestricted configuration to construct the experiment.
e use View 1 for parameter tuning and then evaluate the per-
ormance of all the algorithms on each fold (300 intra- and 300
xtra-personal pairs) in View 2. Other settings are similar with the
nes on the PubFig dataset.
In this experiment, we adopt two types of features (i.e., SIFT
eatures and attribute features) to represent each face image, re-
pectively. The SIFT features are provided by Guillaumin et al.
33] by extracting SIFT descriptors [34] at 9 fixed facial landmarks
etected on a face, over three scales. Then we perform PCA algo-
ithm to reduce the original 3456-dimensional feature to a 100-
imensional feature. Like PubFig, the attribute features of LFW are
3-dimensional ’high-level’ features describing the nameable at-
ributes of a face image [30] . To evaluate our methods and the con-
rastive methods, we report the ROC curves and AUC values of the
orresponding methods (see Fig. 5 ). The results of ITML [12] aren’t
isplayed for its difficulty of convergence in the training data. We
an see that the proposed COPML method can achieve the-state-of-
he-art performance compared with the contrastive metric learning
[Fig. 5 panels with AUC values. SIFT features: OPML (0.8606), COPML (0.8676), LEGO (0.7976), POLA (0.8442), LMNN (0.8634), OASIS (0.8267), Eucli (0.7335), RDML (0.6902), Maha (0.6570), SOML (0.5994). Attribute features: OPML (0.9136), COPML (0.9143), LEGO (0.8629), POLA (0.9058), LMNN (0.9105), OASIS (0.8906), Eucli (0.8564), RDML (0.8368), Maha (0.8138), SOML (0.7265).]

Fig. 5. ROC curves of our methods and the competing methods on the LFW dataset. Left: SIFT features; right: attribute features. The AUC value of each method is shown in brackets.
Table 3
Quantitative comparison of our methods with other abnormal event detection methods on the 3 scenes of the UMN dataset individually, with the AUC criterion.

Method                   AUC
Optical flow [36]        0.84 (average)
Social Force [36]        0.96 (average)
Chaotic Invariants [37]  0.99 (average)
LSA [38]                 0.985 (average)
STCOG [20]               0.936/0.776/0.966
Sparse [35]              0.995/0.975/0.964
MP-MIDL [39]             0.99/0.98/0.99
SVDD-based [40]          0.993/0.969/0.988
OPML                     0.993/0.983/0.973
COPML                    0.995/0.989/0.977
Table 4
Error rates on three UCI datasets.

Data     Euclidean        OPML             COPML
seg-10   0.069 ± 0.006    0.062 ± 0.006    0.057 ± 0.007
seg-5    0.067 ± 0.007    0.062 ± 0.007    0.054 ± 0.007
seg-2    0.067 ± 0.006    0.064 ± 0.007    0.059 ± 0.007
eeg-10   0.185 ± 0.004    0.181 ± 0.004    0.161 ± 0.023
eeg-5    0.213 ± 0.004    0.205 ± 0.005    0.185 ± 0.018
eeg-2    0.185 ± 0.004    0.178 ± 0.007    0.178 ± 0.007
sen-10   0.190 ± 0.002    0.093 ± 0.036    0.071 ± 0.011
sen-5    0.196 ± 0.002    0.097 ± 0.032    0.082 ± 0.014
sen-2    0.190 ± 0.006    0.067 ± 0.020    0.063 ± 0.015
methods. In particular, when using SIFT features, our methods significantly improve the AUC value over the Euclidean distance, by 13% (5.8% with attribute features), showing the validity of the proposed methods. It is worth noting that some metric learning methods cannot even improve over the Euclidean distance, which also happened on the PubFig dataset. The reason why LMNN cannot achieve the best performance may be over-fitting due to the lack of regularization.
5.4. Abnormal event detection in videos
The performance of the proposed methods is also evaluated on the UMN dataset for abnormal event detection. The UMN dataset contains 3 different scenes with 7739 frames in total: Scene1 (1453 frames), Scene2 (4144 frames) and Scene3 (2142 frames). In the UMN dataset, people walking around is considered normal, while people running away is regarded as abnormal. The resolution of the video is 320 × 240. We divide each frame into 5 × 4 non-overlapping 64 × 60 patches. For each patch, the MHOF (Multi-scale Histogram of Optical Flow) feature [35] is extracted from every two successive frames. The MHOF is a 16-dimensional feature, which can capture both motion direction and motion energy. To integrate the multi-patch features, we concatenate the features from all patches in each frame, forming a 320-dimensional feature. For each scene, we perform 2-fold cross-validation for evaluation. The distance metric is learnt from the training data in an online way; then we use an SVM classifier to classify the testing frames after transforming the features by the learned metric L.
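The per-frame feature assembly can be sketched as below. Note that the 16-bin, magnitude-weighted orientation histogram used here is a simplified stand-in for the actual MHOF descriptor of [35] (whose multi-scale binning is defined there); the patch layout and the concatenation into a 320-dimensional vector follow the description above:

```python
import numpy as np

def frame_feature(flow, patch_h=60, patch_w=64, bins=16):
    """Concatenate per-patch 16-bin flow-orientation histograms.

    flow: (240, 320, 2) optical-flow field between two successive frames.
    A 240 x 320 frame splits into 4 x 5 = 20 non-overlapping 64 x 60
    patches, giving a 20 * 16 = 320-dimensional frame feature.
    """
    H, W, _ = flow.shape
    feats = []
    for i in range(0, H, patch_h):
        for j in range(0, W, patch_w):
            patch = flow[i:i + patch_h, j:j + patch_w]
            ang = np.arctan2(patch[..., 1], patch[..., 0])   # motion direction
            mag = np.linalg.norm(patch, axis=-1)             # motion energy
            hist, _ = np.histogram(ang, bins=bins,
                                   range=(-np.pi, np.pi), weights=mag)
            feats.append(hist)
    return np.concatenate(feats)                             # shape (320,)
```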
Table 3 reports the AUC of all the methods. We can see that our methods are very effective and competitive compared with the other methods. Fig. 6 exhibits sample frames of normal and abnormal events in the 3 scenes (top row), and shows the abnormal event detection results of our method (COPML) in the indication bars (green/red indicates a normal/abnormal event). It is worth mentioning that in this experiment COPML performs better than OPML, because the video data suffers from the cold start issue, especially at the beginning.
5.5. COPML for cold start
We can observe that in the cases free of the cold start issue (e.g., UCI data classification, face verification), OPML and COPML obtain comparable results, while in the case with the cold start issue (e.g., abnormal event detection in videos), COPML is better than OPML. To further test the performance of COPML in an extreme case with the cold start issue, we construct several datasets with a specified structure to compare the behavior of COPML and OPML. Three datasets were picked from the UCI repository: (1) Image Segmentation (seg for short), with 7 classes, 19 features and 2310 samples; (2) EEG Eye State (eeg for short), with 2 classes, 15 features and 14,980 samples; (3) Sensorless (sen for short), with 11 classes, 49 features and 58,509 samples.

For each dataset, the samples from each class are divided into 10/5/2 disjoint parts, and the parts of different classes are then crosswise put together to construct a new dataset. Afterwards, the new dataset is divided into 2 folds; the first fold is used for training and the second for testing. As in the previous classification setting, we use a k-NN (k = 5) classifier to obtain the final test results, shown in Table 4. The results show that COPML
Fig. 6. Global abnormal event detection results of our method COPML and the ground truth on the UMN dataset. (For interpretation of the references to colour in this figure
legend, the reader is referred to the web version of this article.)
performs better than OPML when the data has the cold start issue, since when a cold start occurs, COPML incorporates both the pairwise and the triplet information, instead of only using triplets as OPML does.
6. Conclusion

We propose a one-pass closed-form solution for OML, namely OPML. It employs the one-pass triplet construction for fast triplet generation, together with a closed-form solution to update the metric with the newly arrived sample at each time step. Also, for the cold start issue, COPML, an extended version of OPML, is developed. The major advantages of our methods are that OPML and COPML are easy to implement, and that they are very scalable, with low space (i.e., O(d)) and time (i.e., O(d^2)) complexity. In the experiments, we show that our methods obtain superior performance on three typical tasks, compared with the state-of-the-art methods.
Acknowledgments

This work is supported by the National NSF of China (Nos. 61432008, 61673203, U1435214), the Australian Research Council (ARC) (No. DE160100241), the Primary R&D Plan of Jiangsu Province, China (No. BE2015213), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.
Appendix A. Proof of Theorem 2

Proof. Recall that the metric update formula of OPML is

L_t = { L_{t-1} (I + γA_t)^{-1},   [z]_+ > 0
      { L_{t-1},                   [z]_+ = 0.        (A.1)

According to Theorem 1, we can obtain

(I + γA_t)^{-1} = I - (1/(η + β)) [ ηγA_t - (γA_t)^2 ],        (A.2)

where η = 1 + tr(γA_t) and β = (1/2)[(tr(γA_t))^2 - tr((γA_t)^2)]. Here, we only consider the case that [z]_+ > 0. Then, at the t-th time step, the learned metric L_t of the one-pass strategy can be expressed as

L_t = L_0 (I + γA_1)^{-1} (I + γA_2)^{-1} ··· (I + γA_t)^{-1}.        (A.3)
Note that the batch triplet construction strategy is here considered in an online manner; that is, for each sample x_t at the t-th time step, all past samples are stored to construct triplets with x_t (i.e., each triplet contains this x_t). Similar to L_t, the learned metric L*_t of the batch strategy (at the t-th time step, Σ_{i=1}^{t} C_i triplets can be constructed) can be denoted as follows:

L*_t = L*_0 Π_{i=1}^{C_1} (I + γA_{1i})^{-1} Π_{i=1}^{C_2} (I + γA_{2i})^{-1} ··· Π_{i=1}^{C_t} (I + γA_{ti})^{-1}.        (A.4)

Let 〈x_1, x_{p_1}, x_{q_1}〉, ..., 〈x_t, x_{p_t}, x_{q_t}〉 be the sequence of triplets constructed by the proposed one-pass strategy, which is contained in the sequence of triplets constructed by the batch strategy. If we let L* learn on the sequence of triplets constructed by the one-pass strategy first, Eq. (A.4) can be reorganized as follows:

L*_t = L*_0 (I + γA_1)^{-1} ··· (I + γA_t)^{-1} · Π_{i=1}^{C_1+···+C_t-t} (I + γA_i)^{-1}
       (L*_t learns on the sequence of L_t first)
     = L_t · Π_{i=1}^{C_1+···+C_t-t} (I + γA_i)^{-1}
       (L_0 and L*_0 are both initialized as identity matrices)
     = L_t · Π_{i=1}^{C_1+···+C_t-t} (I + B_i)
       (by Theorem 1, where B_i = (1/(η + β)) [ (γA_i)^2 - ηγA_i ])
     = L_t [ I + Σ_{i=1}^{C_N} B_i + Σ_{i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ]
       (where C_N = C_1 + ··· + C_t - t).        (A.5)

Then we can calculate that

‖L_t - L*_t‖_F = ‖ L_t [ Σ_{i=1}^{C_N} B_i + Σ_{i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ] ‖_F
              ≤ ‖L_t‖_F · ‖ Σ_{i=1}^{C_N} B_i + Σ_{i<j}^{C_N} B_i B_j + ··· + Π_{i=1}^{C_N} B_i ‖_F.        (A.6)

Recall that A_t = M_1 - M_2 = (x_t - x_p)(x_t - x_p)^T - (x_t - x_q)(x_t - x_q)^T ∈ R^{d×d}, which is a symmetric square matrix. By the definition of the Frobenius norm,

‖A_t‖_F = √( Σ_{i=1}^{d} Σ_{j=1}^{d} |a_{ij}|^2 ) = √( Σ_{i=1}^{d} σ_i^2 ),        (A.7)

where the σ_i are the singular values of A_t, which equal the absolute values of the eigenvalues of A_t. According to Lemma 1, -λ_max(M_2) ≤ λ(A_t) ≤ λ_max(M_1), where λ(A_t) denotes an eigenvalue of A_t, and λ_max(M) indicates the maximum eigenvalue of M. Assuming ‖x_t‖_2 ≤ 1, λ_max(M_1) and λ_max(M_2) belong to the range [0, 4]. Since the rank of A_t is 2 (which has been proved in Section 3.2), there are at most two nonzero eigenvalues. Thus we can easily obtain that
‖A_t‖_F ≤ 4√2. Hence,

‖B‖_F = ‖ (γ^2/(η + β)) A_t^2 + (ηγ/(η + β)) (-A_t) ‖_F
      ≤ ‖ (γ^2/(η + β)) A_t^2 ‖_F + ‖ (ηγ/(η + β)) (-A_t) ‖_F
      ≤ | γ^2/(η + β) | · ‖A_t‖_F · ‖A_t‖_F + | ηγ/(η + β) | · ‖A_t‖_F
      ≤ 32 | γ^2/(η + β) | + 4√2 | ηγ/(η + β) |.        (A.8)

Then, we can also calculate the ranges of η and β, respectively:

η = 1 + tr(γA_t)
  = 1 + γ · tr(A_t)
  = 1 + γ [ (x_t - x_p)^T (x_t - x_p) - (x_t - x_q)^T (x_t - x_q) ]
  = 1 + γ [ ‖x_p‖_2^2 - ‖x_q‖_2^2 - 2‖x_t‖_2 · ‖x_p‖_2 cos θ_1 + 2‖x_t‖_2 · ‖x_q‖_2 cos θ_2 ].        (A.9)

Since the range of γ is (0, 1/4) and 0 < ‖x_t‖_2 ≤ 1, the range of tr(γA_t) is (-5/4, 5/4), and the range of η is therefore (-1/4, 9/4). Recall that

β = (1/2) [ (tr(γA_t))^2 - tr((γA_t)^2) ]
    (by the rule tr(cA) = c · tr(A))
  = (1/2) [ (tr(γA_t))^2 - γ^2 tr(A_t^2) ]
    (by the rule tr(A^k) = Σ_i λ_i^k, where λ_i is an eigenvalue of A)
  = (1/2) [ (tr(γA_t))^2 - γ^2 Σ_{i=1}^{d} λ_i^2 ]
    (by the rule ‖A_t‖_F = √( Σ_{i=1}^{d} λ_i^2 ))
  = (1/2) [ (tr(γA_t))^2 - γ^2 ‖A_t‖_F^2 ].        (A.10)

For 0 < ‖A_t‖_F^2 ≤ 32 and -5/4 < tr(γA_t) < 5/4, the range of β is (-1, 25/32). □
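The bounds derived above (‖A_t‖_F ≤ 4√2, η ∈ (-1/4, 9/4), β ∈ (-1, 25/32)) can be probed empirically. The following sketch draws random samples scaled into the unit ball and γ from (0, 1/4), and checks that every draw respects the stated ranges:

```python
import numpy as np

rng = np.random.RandomState(1)
ok = True
for _ in range(1000):
    x_t, x_p, x_q = rng.randn(3, 10)
    # Scale each sample into the unit ball, as assumed in the proof.
    x_t, x_p, x_q = (v / max(1.0, np.linalg.norm(v)) for v in (x_t, x_p, x_q))
    gamma = rng.uniform(1e-6, 0.25)            # gamma in (0, 1/4)
    A = np.outer(x_t - x_p, x_t - x_p) - np.outer(x_t - x_q, x_t - x_q)
    G = gamma * A
    eta = 1.0 + np.trace(G)
    beta = 0.5 * (np.trace(G) ** 2 - np.trace(G @ G))
    ok = ok and np.linalg.norm(A) <= 4 * np.sqrt(2) + 1e-9   # Frobenius bound
    ok = ok and (-0.25 < eta < 2.25) and (-1.0 < beta < 25 / 32)
```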
Appendix B. Proof of Theorem 3

Proof. By applying the one-pass triplet construction strategy, at the t-th time step, we obtain one triplet 〈x_t, x_p, x_q〉, while under the batch construction (where all past samples are stored), we get a triplet set {〈x_t, x_{p_i}, x_{q_i}〉}_{i=1}^{C}. The average losses of these two strategies can be expressed as follows:

Δ_1 = [ 1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2 ]_+
Δ_2 = (1/C) Σ_{i=1}^{C} [ 1 + ‖L*(x_t - x_{p_i})‖_2^2 - ‖L*(x_t - x_{q_i})‖_2^2 ]_+,        (B.1)

where [z]_+ = max(0, z), namely the hinge loss G((x_t, x_p, x_q); L). For Δ_1, we only consider the case that z ≥ 0, which is exactly the case that affects the updating of the metric L. However, in Δ_2, some losses may be negative. Thus,

Δ_1 = 1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2
Δ_2 ≤ (1/C) Σ_{i=1}^{C} [ 1 + ‖L*(x_t - x_{p_i})‖_2^2 - ‖L*(x_t - x_{q_i})‖_2^2 ].        (B.2)

Then we calculate the difference between the losses of these two strategies. That is,

Δ_1 - Δ_2 ≤ (1/C) Σ_{i=1}^{C} Δ
          ≤ (1/C) Σ_{i=1}^{C} [ ‖L(x_t - x_p)‖_2^2 - ‖L*(x_t - x_{p_i})‖_2^2 + ‖L*(x_t - x_{q_i})‖_2^2 - ‖L(x_t - x_q)‖_2^2 ].        (B.3)

For simplicity, we just analyze Δ. By applying the rule of the dot product (i.e., x^T y = ‖x‖ · ‖y‖ · cos θ), Δ can be expanded as follows:

Δ = ‖Lx_p‖_2^2 - ‖L*x_{p_i}‖_2^2 + ‖L*x_{q_i}‖_2^2 - ‖Lx_q‖_2^2
    - 2‖Lx_t‖ · ‖Lx_p‖ · cos θ_1 + 2‖L*x_t‖ · ‖L*x_{p_i}‖ · cos θ_2
    - 2‖L*x_t‖ · ‖L*x_{q_i}‖ · cos θ_3 + 2‖Lx_t‖ · ‖Lx_q‖ · cos θ_4.        (B.4)

In order to simplify the expression, we set

① = -2‖Lx_t‖ · ‖Lx_p‖ · cos θ_1
② = 2‖L*x_t‖ · ‖L*x_{p_i}‖ · cos θ_2
③ = -2‖L*x_t‖ · ‖L*x_{q_i}‖ · cos θ_3
④ = 2‖Lx_t‖ · ‖Lx_q‖ · cos θ_4.        (B.5)

Assume that the angle θ between two samples coming from the same class is very small after the transformation by L or L* (i.e., cos θ = α, α ≥ 0 and α close to 1), while θ is very large otherwise (i.e., cos θ = -ξ, ξ ≥ 0 and ξ close to 1). Thus,

0 ≤ cos θ_1 ≤ α,  0 ≤ cos θ_2 ≤ α,  -ξ ≤ cos θ_3 ≤ 0,  -ξ ≤ cos θ_4 ≤ 0,        (B.6)

where 0 ≤ α ≤ 1, 0 ≤ ξ ≤ 1 and both are close to 1. Then we can obtain

-2α‖Lx_t‖ · ‖Lx_p‖ ≤ ① ≤ 0
0 ≤ ② ≤ 2α‖L*x_t‖ · ‖L*x_{p_i}‖
0 ≤ ③ ≤ 2ξ‖L*x_t‖ · ‖L*x_{q_i}‖
-2ξ‖Lx_t‖ · ‖Lx_q‖ ≤ ④ ≤ 0.        (B.7)

Here, we only consider the upper bound of Δ:

Δ ≤ ‖Lx_p‖_2^2 - ‖L*x_{p_i}‖_2^2 + ‖L*x_{q_i}‖_2^2 - ‖Lx_q‖_2^2 + 2α‖L*x_t‖ · ‖L*x_{p_i}‖ + 2ξ‖L*x_t‖ · ‖L*x_{q_i}‖
  ≤ ‖Lx_p‖_2^2 + ‖L*x_{q_i}‖_2^2 + 2α‖L*x_t‖ · ‖L*x_{p_i}‖ + 2ξ‖L*x_t‖ · ‖L*x_{q_i}‖.        (B.8)

According to the property of compatible norms, that is,

‖Ax‖_2 ≤ ‖A‖_F · ‖x‖_2,        (B.9)

and assuming that ‖x‖_2 ≤ R (for all samples), ‖L‖_F ≤ U and ‖L*‖_F ≤ U, we can obtain

Δ ≤ 2(α + ξ + 1) R^2 U^2
Δ_1 - Δ_2 ≤ 2(α + ξ + 1) R^2 U^2.        (B.10)

Thus, the theorem has been proved. □
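The loss driving the comparison, G((x_t, x_p, x_q); L) = [1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2]_+, is a standard squared-distance triplet hinge; a minimal sketch:

```python
import numpy as np

def triplet_hinge(L, x_t, x_p, x_q):
    """[1 + ||L(x_t - x_p)||^2 - ||L(x_t - x_q)||^2]_+  (the Delta_1 of Eq. (B.1))."""
    d_pos = np.linalg.norm(L @ (x_t - x_p)) ** 2     # same-class distance
    d_neg = np.linalg.norm(L @ (x_t - x_q)) ** 2     # different-class distance
    return max(0.0, 1.0 + d_pos - d_neg)             # hinge: zero when margin holds
```

The update of Eq. (A.1) fires exactly when this value is positive, i.e., when the different-class sample is not yet at least a unit margin farther than the same-class sample.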
Appendix C. Proof of Theorem 4

Proof. The regret can be defined (according to the definition in Chapter 3 of [41]) as follows:

R(L*, T) = Σ_{t=1}^{T} G_t(L_t) - Σ_{t=1}^{T} G_t(L*),        (C.1)

where G_t(L) = [ 1 + ‖L(x_t - x_p)‖_2^2 - ‖L(x_t - x_q)‖_2^2 ]_+. Here, we also only consider the case that the loss is positive, which is exactly the case that affects the updating of the metric L. However, a triplet which generates a positive loss with L may incur a negative loss with L*. Thus, after expanding,

R(L*, T) ≤ Σ_{t=1}^{T} [ ‖L_t(x_t - x_p)‖_2^2 - ‖L_t(x_t - x_q)‖_2^2 - ‖L*(x_t - x_p)‖_2^2 + ‖L*(x_t - x_q)‖_2^2 ].        (C.2)

In a similar way to the proof of Theorem 3, we can easily prove this theorem. □

References
[1] K.Q. Weinberger, J. Blitzer, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, in: Advances in Neural Information Processing Systems (NIPS), 2005, pp. 1473–1480.
[2] S. Xiang, F. Nie, C. Zhang, Learning a mahalanobis distance metric for data clustering and classification, Pattern Recognit. 41 (12) (2008) 3600–3612.
[3] C.-C. Chang, A boosting approach for supervised mahalanobis distance metric learning, Pattern Recognit. 45 (2) (2012) 844–862.
[4] Y. Mu, W. Ding, D. Tao, Local discriminative distance metrics ensemble learning, Pattern Recognit. 46 (8) (2013) 2337–2349.
[5] T. Mensink, J. Verbeek, F. Perronnin, G. Csurka, Distance-based image classification: generalizing to new classes at near-zero cost, IEEE Trans. Pattern Anal. Mach. Intell. 35 (11) (2013) 2624–2637.
[6] G. Chechik, V. Sharma, U. Shalit, S. Bengio, Large scale online learning of image similarity through ranking, J. Mach. Learn. Res. 11 (2010) 1109–1135.
[7] X. Gao, S.C. Hoi, Y. Zhang, J. Wan, J. Li, SOML: sparse online metric learning with application to image retrieval, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.
[8] H. Xia, S.C. Hoi, R. Jin, P. Zhao, Online multiple kernel similarity learning for visual search, IEEE Trans. Pattern Anal. Mach. Intell. 36 (3) (2014) 536–549.
[9] S. Shalev-Shwartz, Y. Singer, A.Y. Ng, Online and batch learning of pseudo-metrics, in: Proceedings of the Twenty-First International Conference on Machine Learning (ICML), ACM, 2004, p. 94.
[10] P. Jain, B. Kulis, I.S. Dhillon, K. Grauman, Online metric learning and fast similarity search, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 761–768.
[11] R. Jin, S. Wang, Y. Zhou, Regularized distance metric learning: theory and algorithm, in: Advances in Neural Information Processing Systems (NIPS), 2009, pp. 862–870.
[12] J.V. Davis, B. Kulis, P. Jain, S. Sra, I.S. Dhillon, Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning (ICML), ACM, 2007, pp. 209–216.
[13] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, Y. Singer, Online passive-aggressive algorithms, J. Mach. Learn. Res. 7 (2006) 551–585.
[14] G. Kunapuli, J. Shavlik, Mirror descent for metric learning: a unified approach, in: Machine Learning and Knowledge Discovery in Databases, Springer, 2012, pp. 859–874.
[15] K.Q. Weinberger, L.K. Saul, Distance metric learning for large margin nearest neighbor classification, J. Mach. Learn. Res. 10 (2009) 207–244.
[16] B. Shaw, B. Huang, T. Jebara, Learning a distance metric from a network, in: Advances in Neural Information Processing Systems (NIPS), 2011, pp. 1899–1907.
[17] Q. Qian, R. Jin, J. Yi, L. Zhang, S. Zhu, Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD), Mach. Learn. 99 (3) (2015) 353–372.
[18] J. Wang, Y. Song, T. Leung, C. Rosenberg, J. Wang, J. Philbin, B. Chen, Y. Wu, Learning fine-grained image similarity with deep ranking, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1386–1393.
[19] C. Stauffer, W.E.L. Grimson, Learning patterns of activity using real-time tracking, IEEE Trans. Pattern Anal. Mach. Intell. 22 (8) (2000) 747–757.
[20] Y. Shi, Y. Gao, R. Wang, Real-time abnormal event detection in complicated scenes, in: 20th International Conference on Pattern Recognition (ICPR), IEEE, 2010, pp. 3653–3656.
[21] W. Gao, R. Jin, S. Zhu, Z.-H. Zhou, One-pass AUC optimization, in: Proceedings of the 30th International Conference on Machine Learning (ICML), ACM, 2013, pp. 906–914.
[22] Y. Zhu, W. Gao, Z.-H. Zhou, One-pass multi-view learning, in: Proceedings of the 7th Asian Conference on Machine Learning, 2015, pp. 407–422.
[23] K.S. Miller, On the inverse of the sum of matrices, Math. Mag. 54 (1981) 67–72.
[24] J. Lu, X. Zhou, Y.-P. Tan, Y. Shang, J. Zhou, Neighborhood repulsed metric learning for kinship verification, IEEE Trans. Pattern Anal. Mach. Intell. 36 (2) (2014) 331–345.
[25] J. Lu, G. Wang, W. Deng, K. Jia, Reconstruction-based metric learning for unconstrained face verification, IEEE Trans. Inf. Forensics Secur. 10 (1) (2015) 79–89.
[26] Z. Huang, R. Wang, S. Shan, X. Chen, Face recognition on large-scale video in the wild with hybrid Euclidean-and-Riemannian metric learning, Pattern Recognit. 48 (10) (2015) 3113–3124.
[27] J. Hu, J. Lu, Y.-P. Tan, Discriminative deep metric learning for face verification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1875–1882.
[28] J. Hu, J. Lu, Y.-P. Tan, Deep transfer metric learning, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 325–333.
[29] F. Schroff, D. Kalenichenko, J. Philbin, FaceNet: a unified embedding for face recognition and clustering, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
[30] N. Kumar, A.C. Berg, P.N. Belhumeur, S.K. Nayar, Attribute and simile classifiers for face verification, in: IEEE 12th International Conference on Computer Vision (ICCV), IEEE, 2009, pp. 365–372.
[31] O.M. Parkhi, A. Vedaldi, A. Zisserman, Deep face recognition, in: British Machine Vision Conference (BMVC), 1, 2015, p. 6.
[32] G.B. Huang, M. Ramesh, T. Berg, E. Learned-Miller, Labeled Faces in the Wild: a database for studying face recognition in unconstrained environments, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
[33] M. Guillaumin, J. Verbeek, C. Schmid, Is that you? Metric learning approaches for face identification, in: IEEE 12th International Conference on Computer Vision (ICCV), IEEE, 2009, pp. 498–505.
[34] D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision 60 (2) (2004) 91–110.
[35] Y. Cong, J. Yuan, J. Liu, Sparse reconstruction cost for abnormal event detection, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2011, pp. 3449–3456.
[36] R. Mehran, A. Oyama, M. Shah, Abnormal crowd behavior detection using social force model, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2009, pp. 935–942.
[37] S. Wu, B.E. Moore, M. Shah, Chaotic invariants of Lagrangian particle trajectories for anomaly detection in crowded scenes, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2010, pp. 2054–2060.
[38] V. Saligrama, Z. Chen, Video anomaly detection based on local statistical aggregates, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2012, pp. 2112–2119.
[39] J. Huo, Y. Gao, W. Yang, H. Yin, Multi-instance dictionary learning for detecting abnormal events in surveillance videos, Int. J. Neural Syst. 24 (03) (2014)
1430010 .
40] Y. Zhang , H. Lu , L. Zhang , X. Ruan , Combining motion and appearance cues foranomaly detection, Pattern Recognit. 51 (2016) 443–452 .
[41] S. Shalev-Shwartz, Y. Singer, Online learning: Theory, algorithms, and applica-tions (2007).
Wenbin Li is a Ph.D. student in the Department of Computer Science and Technology, Nanjing University. His current research interests include machine learning and computer vision.

Yang Gao received the Ph.D. degree from the Department of Computer Science and Technology, Nanjing University, China, in 2000. Currently, he is a Professor, and also the Deputy Director in the Department of Computer Science and Technology, Nanjing University. He is currently directing the Reasoning and Learning Research Group in Nanjing University. He has published more than 100 papers in top-tier conferences and journals. His current research interests include artificial intelligence and machine learning. He also serves as Program Chair and Area Chair for many international conferences.

Lei Wang received the Ph.D. degree from Nanyang Technological University, Singapore, in 2004. He is currently with the Faculty of Engineering and Information Sciences, University of Wollongong, Wollongong, NSW, Australia, as an Associate Professor. His current research interests include machine learning and pattern recognition, such as feature selection, model selection, and kernel-based learning methods, and computer vision, such as categorization and content-based image retrieval.

Luping Zhou received the Ph.D. degree from Australian National University, Canberra, ACT, Australia. She was a Research Fellow with the University of North Carolina at Chapel Hill, Chapel Hill, NC, USA, and a Staff Research Scientist with the Commonwealth Scientific and Industrial Research Organisation, Clayton, VIC, Australia. She is currently a Senior Lecturer with the School of Computing and Information Technology, University of Wollongong, Wollongong, NSW, Australia. Her current research interests include machine learning, medical image analysis, and computer vision. Dr. Zhou received the Discovery Early Career Researcher Award from the Australian Research Council in 2015.

Jing Huo received her B.Eng. degree in computer science and technology from Nanjing Normal University, China, in 2011. She is currently working towards her Ph.D. degree in the Department of Computer Science, Nanjing University. Her research interests lie in machine learning and computer vision. Her work currently focuses on metric learning, subspace learning and their applications to heterogeneous face recognition.

Yinghuan Shi is currently an Assistant Researcher in the Department of Computer Science and Technology of Nanjing University, China. He received his Ph.D. and B.Sc. degrees from the Department of Computer Science of Nanjing University in 2013 and 2007, respectively. He was a visiting scholar at the University of Technology, Sydney, and the University of North Carolina at Chapel Hill. His research interests include computer vision and medical image analysis. He has published more than 40 research papers in related journals and conferences such as TPAMI, TBME, TNNLS, TCYB, CVPR, AAAI, ACMMM and IPMI. He serves as a program committee member for several conferences, and also as a referee for several journals.