
Understanding and Improving Kernel Local Descriptors

Arun Mukundan · Giorgos Tolias · Andrei Bursuc · Hervé Jégou · Ondřej Chum

Abstract We propose a multiple-kernel local-patch descriptor based on efficient match kernels from pixel gradients. It combines two parametrizations of gradient position and direction; each parametrization provides robustness to a different type of patch mis-registration: polar parametrization for noise in the patch dominant orientation detection, Cartesian for imprecise location of the feature point. Combined with whitening of the descriptor space, learned with or without supervision, the performance is significantly improved. We analyze the effect of the whitening on patch similarity and demonstrate its semantic meaning. Our unsupervised variant is the best performing descriptor constructed without the need for labeled data. Despite the simplicity of the proposed descriptor, it competes well with deep learning approaches on a number of different tasks.

1 Introduction

Representing and matching local features is an essential step of several computer vision tasks. It has attracted a lot of attention in the last decades, when local features were still a required step of most approaches.

A. Mukundan, G. Tolias, O. Chum
VRG, FEE, CTU in Prague
E-mail: arun.mukundan, giorgos.tolias, [email protected]

A. Bursuc
Valeo.ai
E-mail: [email protected]

H. Jégou
Facebook AI Research
E-mail: [email protected]

Despite the large focus on Convolutional Neural Networks (CNNs) for processing whole images, local features still remain important and necessary for tasks such as Structure-from-Motion (SfM) [22,25,51], stereo matching [40], or retrieval under severe changes in viewpoint or scale [53,72].

Classical approaches involve the hand-crafted design of a local descriptor, which has been the practice for more than a decade, with some widely used examples [34,10,38,60,17,33]. Such descriptors do not require any training data or supervision. This kind of approach makes it easy to inject domain expertise, prior knowledge, or even the result of a thorough analysis [20]. Learning methods have also been employed to learn parts of the hand-crafted design, e.g. the pooling regions [66,57], from training data.

Recently, the focus has shifted from hand-crafted descriptors to CNN-based descriptors. Learning such descriptors relies on large training sets of patches, which are commonly provided as a side-product of SfM [66]. Integrating domain expertise has so far been mostly neglected in this kind of approach. Nevertheless, remarkable performance is achieved on a standard benchmark [8,59,39]. On the other hand, recent work [7,52] shows that many CNN-based approaches do not necessarily generalize equally well to different tasks or datasets. Hand-crafted descriptors thus still appear an attractive alternative.

In this work, we choose to work with a particular family of hand-crafted descriptors, the so-called kernel descriptors [13,12,61]. They provide a quite flexible framework for matching sets (patches in our case) by encoding different properties of the set elements (pixels in our case). In particular, we build upon the hand-crafted kernel descriptor proposed by Bursuc et al. [16], which is shown to have good performance, even compared to learned alternatives. Its few parameters are easily tuned on a validation set.


Moreover, it is shown to perform well on multiple tasks, as we confirm in our experiments.

Further post-processing or descriptor normalization,such as Principal Component Analysis (PCA) and power-law normalization, is shown to be effective on differenttasks [19,16,37,58]. We combine our descriptor withsuch post-processing that is learned from the data inunsupervised or supervised ways. We show how to re-duce the estimation error and significantly improve re-sults even without any supervision.

The hand-crafted nature and simplicity of our descriptor allow us to visualize and analyze its parametrization, and finally understand its advantages and disadvantages. This leads us to propose a simple combination of parametrizations, each offering robustness to a different type of patch mis-registration. Interestingly, the same analysis is possible even for the learned post-processing. We observe that its effect on the patch similarity is semantically meaningful. The feasibility of such analysis and visualization is an advantage of our approach, and of hand-crafted approaches in general, compared to CNN-based methods. Several insightful ablation and visualization studies [71,68,35,9] reveal what a CNN has learned. This typically provides only a partial view of their behavior, i.e. for a small number of neurons, while our approach enables visualization of the overall learned similarity in a general way that is not restricted to particular examples.

This work is an extension of our earlier conference publication [41]. In addition to the earlier version, we propose unsupervised whitening with shrinkage, give extra insight into its effect on patch similarity, present extended comparisons of different whitening variants, and provide a proof justifying the absence of regularized concatenation.

The manuscript is organized as follows. Related work is discussed in Section 2, and background knowledge for kernel descriptors is presented in Section 3. Our descriptor, the different whitening variants, and their interpretation are described in Section 4. Finally, the experimental validation on two patch benchmarks is presented in Section 5.

2 Related work

We review prior work on local descriptors, covering both hand-crafted and learned ones.

2.1 Hand-crafted descriptors

Hand-crafted descriptors have dominated the research landscape and a variety of approaches and methodologies exists. There are different variants of descriptors, building features from filter-bank responses [10,15,29,43,50], pixel gradients [34,38,60,3], pixel intensities [55,17,33,48], ordering or ranking of pixel intensities [42,24], or local edge shape [21]. Some approaches focus on particular aspects of local descriptors, such as injecting invariance into the patch descriptor [42,30,1,58], computational efficiency [60,3], or binary descriptors [17,33,2].

A popular direction is that of gradient histogram-based descriptors, whose most popular representative is SIFT [34]. SIFT has been a long-standing top performer on multiple benchmarks and tasks across the years. Multiple improvements to SIFT have subsequently been proposed: PCA-SIFT [28], ASIFT [69], OpponentSIFT [49], 3D-SIFT [54], RootSIFT [4], DSP-SIFT [20], etc. A simple and effective improvement of SIFT is brought by the RootSIFT descriptor [4], which uses the Hellinger kernel as the similarity measure. DSP-SIFT [20] counters the aliasing effects caused by the binned quantization in SIFT by pooling gradients over multiple scales instead of only the scale selected by SIFT. Our kernelized descriptor also deals with quantization artifacts, by embedding each pixel in a continuous space and then aggregating pixels per patch by sum-pooling.

Kernel descriptors based on the idea of Efficient Match Kernels (EMK) [13] encode entities inside a patch (such as gradient, color, etc.) in a continuous domain, rather than as a histogram. The kernels and their few parameters are often hand-picked and tuned on a validation set. Kernel descriptors are commonly represented by finite-dimensional explicit feature maps [64]. Quantized descriptors, such as SIFT, can also be interpreted as kernel descriptors [16,11]. Furthermore, the widely used RootSIFT descriptor [4] can also be thought of as an explicit feature map from the original SIFT space to the RootSIFT space, such that the Hellinger kernel is linearized, i.e. the linear kernel (dot product) in RootSIFT space is equivalent to the Hellinger kernel in the original SIFT space. In this case, the feature mapping is performed by ℓ1-normalization and square-rooting, without any expansion in dimensionality.
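For concreteness, the RootSIFT mapping can be written in a few lines; the following is a minimal numpy sketch (the function name is ours, not from [4]):

```python
import numpy as np

def rootsift(v, eps=1e-12):
    """Explicit feature map from SIFT to RootSIFT: l1-normalize, then
    take the element-wise square root. The dot product of two RootSIFT
    vectors then equals the Hellinger kernel of the original vectors."""
    v = v / (np.abs(v).sum() + eps)  # l1 normalization (SIFT is non-negative)
    return np.sqrt(v)
```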

In this work we build upon EMK by integrating multiple pixel attributes into the patch descriptor. Unlike EMK, which relies on features from random projections that require subsequent learning, we instead leverage explicit feature maps to approximate a kernel behavior directly. These representations can be further improved by minimal learning.


2.2 Learned descriptors

Learned descriptors commonly require annotation at the patch level. Therefore, research in this direction is facilitated by the release of datasets that originate from an SfM system [66,44]. Such training datasets allow effective learning of local descriptors, and in particular of their pooling regions [66,57], filter banks [66], transformations for dimensionality reduction [57], or embeddings [46].

Kernelized descriptors are formulated within a supervised framework by Wang et al. [65], where image labels enable kernel learning and dimensionality reduction. In this work, we rather focus on learning discriminative projections with minimal or no supervision. This is several orders of magnitude faster than other learning approaches.

Recently, local descriptor learning has been dominated by deep learning. The proposed network architectures mimic those for full-image processing. They have fewer parameters, yet they still use a large number of training patches.

Among the representative examples are the work of Simo-Serra et al. [56], training with hard positive and negative examples, and the work of Zagoruyko and Komodakis [70], where a central-surround representation is found to be immensely beneficial. Such CNN-based approaches are seen as joint feature, filter bank, and metric learning [23], since the convolutional filters, the patch descriptor, and the metric are learned end-to-end. Going further towards an end-to-end pipeline for patch detection and description, LIFT [67] advances a multi-step architecture with several spatial transformer modules [26] that detects interest points and crops them, identifies their dominant orientation and rotates them accordingly, and finally extracts a patch descriptor. Paulin et al. [45] propose a deep patch descriptor from unsupervised learning. They consider a convolutional kernel network [36] with feature maps compatible with the Gaussian kernel, which requires layer-wise training.

Recent works on deep patch descriptors lean towards more compact architectures with more carefully designed training strategies and loss functions. Balntas et al. [8,6] advance shallower architectures with an improved triplet ranking loss. In L2-Net [59], supervision is imposed on intermediate feature maps, the loss function integrates multiple attributes, and the sampling of training data is done progressively to better balance positive and negative pairs at each step. In HardNet [39], Mishchuk et al. extend L2-Net with a loss that mimics Lowe's matching criterion by maximizing the distance between the closest positive and closest negative example in the batch. HardNet is currently a top performer on most benchmarks.

Despite obtaining impressive results on standard benchmarks, CNN-based local descriptors do not always generalize to other datasets [52].

2.3 Post-processing

A post-processing step is common to both hand-crafted and learned descriptors. This post-processing ranges from simple ℓ2-normalization and PCA dimensionality reduction to transformations learned on annotated data [47,14,7,27].

3 Preliminaries

3.1 Kernelized descriptors.

In general we follow the formulation of Bursuc et al. [16]. We represent a patch P as a set of pixels p ∈ P and compare two patches P and Q via a match kernel

M(P,Q) = \sum_{p \in P} \sum_{q \in Q} k(p, q),   (1)

where kernel k : R^n × R^n → R is a similarity function, typically non-linear, comparing two pixels or their corresponding feature vectors. The evaluation of this match kernel is costly, as it exhaustively computes similarities between all pairs of pixels from the two sets. Match kernel M(P,Q) can be approximated with EMK [13], which uses an explicit feature map ψ : R^n → R^d to approximate this result as

M(P,Q) = \sum_{p \in P} \sum_{q \in Q} k(p, q)
       \approx \sum_{p \in P} \sum_{q \in Q} \psi(p)^\top \psi(q)
       = \sum_{p \in P} \psi(p)^\top \sum_{q \in Q} \psi(q).   (2)

Vector V(P) = Σ_{p∈P} ψ(p) is a kernelized descriptor (KD) associated with patch P, used to approximate M(P,Q), whose explicit evaluation is costly. The approximation is given by a dot product V(P)⊤V(Q), where V(P) ∈ R^d. To ensure unit self-similarity, ℓ2-normalization by a factor γ is introduced. The normalized KD is then given by V̄(P) = γ(P)V(P), where γ(P) = (V(P)⊤V(P))^{-1/2}.
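In code, the approximation of (2) reduces to sum-pooling of per-pixel embeddings followed by ℓ2-normalization. A minimal numpy sketch, assuming some explicit feature map psi is given:

```python
import numpy as np

def kernelized_descriptor(pixels, psi):
    """V(P) = sum over pixels of psi(p), l2-normalized so that the
    self-similarity V(P)^T V(P) equals 1. `pixels` is any iterable of
    per-pixel attribute vectors; `psi` maps one pixel to R^d."""
    V = np.sum([psi(p) for p in pixels], axis=0)
    return V / np.linalg.norm(V)

# The match kernel M(P, Q) of (1) is then approximated by one dot product:
# kernelized_descriptor(P, psi) @ kernelized_descriptor(Q, psi)
```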

Kernel k comprises a product of kernels, each acting on a different scalar pixel attribute:

k(p, q) = k_1(p_1, q_1)\, k_2(p_2, q_2) \cdots k_n(p_n, q_n),   (3)

Page 4: arXiv:1811.11147v1 [cs.CV] 27 Nov 2018 · 2018. 11. 28. · arXiv:1811.11147v1 [cs.CV] 27 Nov 2018. 2 Arun Mukundan et al. on a validation set, while it is shown to perform well on

4 Arun Mukundan et al.

where kernel k_n is a pairwise similarity function for scalars, and p_n are pixel attributes such as position and gradient orientation. Feature map ψ_n corresponds to kernel k_n, and feature map ψ is constructed via the Kronecker product of the individual feature maps, ψ(p) = ψ_1(p_1) ⊗ ψ_2(p_2) ⊗ … ⊗ ψ_n(p_n). It is straightforward to show that

\psi(p)^\top \psi(q) \approx k_1(p_1, q_1)\, k_2(p_2, q_2) \cdots k_n(p_n, q_n).   (4)

3.2 Feature maps.

As a non-linear kernel for scalars we use the normalized Von Mises probability density function¹, which has been used for image [61] and patch [16] representations. It is parametrized by κ, which controls the shape of the kernel; lower κ corresponds to a wider, i.e. less selective, kernel. We use a stationary (shift-invariant) kernel that, by definition, depends only on the difference Δn = pn − qn, i.e. kVM(pn, qn) := kVM(Δn). We approximate this probability density function by a Fourier series with N frequencies, which produces a feature map ψVM : R → R^{2N+1} with the property that

k_{VM}(p_n, q_n) \approx \psi_{VM}(p_n)^\top \psi_{VM}(q_n).   (5)

In particular, we truncate the Fourier series after frequency N:

k_{VM}(\Delta_n) \approx \sum_{i=0}^{N} \gamma_i \cos(i \Delta_n).   (6)

The feature map ψVM(pn) is designed as follows:

\psi_{VM}(p_n) = \big(\sqrt{\gamma_0},\ \sqrt{\gamma_1}\cos(p_n), \ldots, \sqrt{\gamma_N}\cos(N p_n),\ \sqrt{\gamma_1}\sin(p_n), \ldots, \sqrt{\gamma_N}\sin(N p_n)\big)^\top.   (7)

This vector has 2N + 1 components. It is now easy to show that the inner product of two feature maps approximates the kernel:

\psi_{VM}(p_n)^\top \psi_{VM}(q_n) = \gamma_0 + \sum_{i=1}^{N} \gamma_i \big(\cos(i p_n)\cos(i q_n) + \sin(i p_n)\sin(i q_n)\big)
                                   = \sum_{i=0}^{N} \gamma_i \cos\big(i (p_n - q_n)\big)
                                   \approx k_{VM}(\Delta_n).   (8)

The reader is encouraged to consult prior work for details on these feature maps [63,18], which have previously been used in various contexts [61,16].

¹ Also known as the periodic normal distribution.
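The feature map of (7) is straightforward to implement once the coefficients γi are fixed. Below is a sketch using the standard Fourier expansion of the Von Mises density, whose coefficients are ratios of modified Bessel functions; the exact scaling convention may differ from the one used in [16]:

```python
import numpy as np
from scipy.special import iv  # modified Bessel function of the first kind

def vm_feature_map(x, kappa, N):
    """psi_VM(x) in R^(2N+1), approximating the Von Mises kernel (eq. 7).
    Coefficients follow the Fourier series of the Von Mises density:
    gamma_0 = 1/(2*pi), gamma_i = I_i(kappa) / (pi * I_0(kappa))."""
    gamma = np.array([iv(i, kappa) / iv(0, kappa) for i in range(N + 1)])
    gamma[1:] *= 2.0
    gamma /= 2.0 * np.pi
    i = np.arange(1, N + 1)
    return np.concatenate((
        [np.sqrt(gamma[0])],
        np.sqrt(gamma[1:]) * np.cos(i * x),
        np.sqrt(gamma[1:]) * np.sin(i * x),
    ))

# psi(x)^T psi(y) depends (approximately) only on x - y, as in (8):
approx = vm_feature_map(0.3, kappa=8, N=3) @ vm_feature_map(0.8, kappa=8, N=3)
```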

3.3 Descriptor post-processing.

It is known that further descriptor post-processing [47,5,16] is beneficial. In particular, the KD is further centered and projected as

\hat{V}(P) = A^\top (V(P) - \mu),   (9)

where µ ∈ R^d is the mean vector and A ∈ R^{d×d} is the projection matrix. These are commonly learned by PCA [27] or with supervision [47]. The final descriptor is always ℓ2-normalized in the end.

4 Method

In this section we consider different patch parametrizations and kernels that result in different patch similarities. We discuss the benefits of each and propose how to combine them. We further learn a descriptor transformation, with or without supervision, and provide useful insight into how the patch similarity is affected.

4.1 Patch attributes.

We consider a pixel p to be associated with coordinates px, py in the Cartesian coordinate system, coordinates pρ, pφ in the polar coordinate system, pixel gradient magnitude pm, and pixel gradient angle pθ. Angles pθ, pφ ∈ [0, 2π], the distance from the center pρ is normalized to [0, 1], while coordinates px, py ∈ {1, 2, …, W} for W × W patches. In order to use feature map ψVM, attributes pρ, px, and py are linearly mapped to [0, π]. The gradient angle is expressed either w.r.t. the patch orientation, i.e. pθ directly, or w.r.t. the position of the pixel; the latter is given by pθ̂ = pθ − pφ.
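These attributes are cheap to compute densely over a patch. The following numpy sketch is one possible realization; the gradient filter and the exact normalization of pρ are our assumptions, as they are not specified here:

```python
import numpy as np

def patch_attributes(patch):
    """Per-pixel attributes of a W x W grayscale patch: Cartesian and
    polar coordinates, gradient magnitude, absolute gradient angle, and
    the gradient angle relative to the pixel position (theta-hat)."""
    W = patch.shape[0]
    gy, gx = np.gradient(patch.astype(float))       # simple gradient filter
    p_m = np.hypot(gx, gy)                           # gradient magnitude
    p_theta = np.mod(np.arctan2(gy, gx), 2 * np.pi)  # gradient angle
    xs, ys = np.meshgrid(np.arange(W), np.arange(W))
    c = (W - 1) / 2.0
    p_rho = np.hypot(xs - c, ys - c)
    p_rho = p_rho / p_rho.max()                      # normalized to [0, 1]
    p_phi = np.mod(np.arctan2(ys - c, xs - c), 2 * np.pi)
    p_theta_hat = np.mod(p_theta - p_phi, 2 * np.pi) # relative gradient angle
    return xs, ys, p_rho, p_phi, p_m, p_theta, p_theta_hat
```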

4.2 Patch parametrizations.

Composing patch kernel k as a product of kernels over different attributes enables easy design of various patch similarities; correspondingly, this defines different KDs. All attributes px, py, pρ, pθ, pφ, and pθ̂ are matched by the Von Mises kernel, namely kx, ky, kρ, kθ, kφ, and kθ̂, parameterized by κx, κy, κρ, κθ, κφ, and κθ̂, respectively. In a similar manner to SIFT, we apply a Gaussian mask pg = exp(−pρ²), which gives more importance to central pixels.

In this work we focus on the two following match kernels over patches. One is in polar coordinates:

M_{\phi\rho\hat{\theta}}(P,Q) = \sum_{p \in P} \sum_{q \in Q} p_g\, q_g \sqrt{p_m} \sqrt{q_m}\; k_\phi(p_\phi, q_\phi)\, k_\rho(p_\rho, q_\rho)\, k_{\hat{\theta}}(p_{\hat{\theta}}, q_{\hat{\theta}}),   (10)

Page 5: arXiv:1811.11147v1 [cs.CV] 27 Nov 2018 · 2018. 11. 28. · arXiv:1811.11147v1 [cs.CV] 27 Nov 2018. 2 Arun Mukundan et al. on a validation set, while it is shown to perform well on

Understanding and Improving Kernel Local Descriptors 5

Fig. 1: Kernel approximations that we use for pixel attributes: kθ or kθ̂ (κ = 8, N = 3), kφ or kρ (κ = 8, N = 2), and kx or ky (κ = 1, N = 1), each plotted against the corresponding difference Δ over [0, π]. Parameter κ and the number of frequencies N define the final shape. The choice of kernel parameters is guided by [16].

and one is in Cartesian coordinates:

M_{xy\theta}(P,Q) = \sum_{p \in P} \sum_{q \in Q} p_g\, q_g \sqrt{p_m} \sqrt{q_m}\; k_x(p_x, q_x)\, k_y(p_y, q_y)\, k_\theta(p_\theta, q_\theta).   (11)

The KDs for the two cases are given by

V_{\phi\rho\hat{\theta}}(P) = \sum_{p \in P} p_g \sqrt{p_m}\; \psi_\phi(p_\phi) \otimes \psi_\rho(p_\rho) \otimes \psi_{\hat{\theta}}(p_{\hat{\theta}}) = \sum_{p \in P} p_g \sqrt{p_m}\; \psi_{\phi\rho\hat{\theta}}(p),   (12)

V_{xy\theta}(P) = \sum_{p \in P} p_g \sqrt{p_m}\; \psi_x(p_x) \otimes \psi_y(p_y) \otimes \psi_\theta(p_\theta) = \sum_{p \in P} p_g \sqrt{p_m}\; \psi_{xy\theta}(p).   (13)

The Vφρθ̂ variant is exactly the one proposed by Bursuc et al. [16], and is considered the baseline in this work. Different parametrizations result in different patch similarities, which are analyzed in the following. In Figure 1 we present the approximations of the kernels used per attribute.
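The tensor-product structure of (12) makes the implementation a per-pixel Kronecker product followed by weighted sum-pooling. A sketch for the polar variant follows (the Cartesian one is analogous); with the Von Mises maps of Figure 1 (N = 2, 2, 3), the dimensions multiply to 5 · 5 · 7 = 175, matching Table 1:

```python
import numpy as np

def kd_polar(phi, rho, theta_hat, m, g, psi_phi, psi_rho, psi_th):
    """V_{phi rho theta-hat} of eq. (12): per-pixel Kronecker product of
    the three feature maps, weighted by the Gaussian mask g and sqrt of
    the gradient magnitude m, sum-pooled and l2-normalized. All inputs
    are flat arrays with one entry per pixel; psi_* are feature maps."""
    V = 0.0
    for i in range(len(m)):
        emb = np.kron(np.kron(psi_phi(phi[i]), psi_rho(rho[i])),
                      psi_th(theta_hat[i]))
        V = V + g[i] * np.sqrt(m[i]) * emb
    return V / np.linalg.norm(V)
```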

4.3 Post-processing learned with or w/o supervision.

We detail different ways to learn the projection matrix A of (9) that performs the descriptor post-processing. Let us consider a learning set of patches P and the corresponding set of descriptors V_P = {V(P), P ∈ P}. Let C be the covariance matrix of V_P and µ the mean descriptor vector. The different ways to compute A are as follows.

Supervised whitening. We further assume that supervision is available in the form of pairs of matching patches, given by the set M = {(P,Q) ∈ P × P, P ∼ Q}, where ∼ denotes matching patches. We follow the work of Mikolajczyk and Matas [37] to learn discriminative projections using the available supervision. The discriminative projection is composed of two parts: a whitening part and a rotation part. The whitening part is obtained from the intraclass (matching pairs) covariance matrix C_M, while the rotation part is the PCA of the interclass (non-matching pairs) covariance matrix in the whitened space. We set the interclass covariance equal to C, as it is dominated by non-matching pairs, while the intraclass one is given by

C_M = \sum_{(P,Q) \in M} (V(P) - V(Q))(V(P) - V(Q))^\top.   (14)

The projection matrix is now given by

A = C_M^{-1/2}\, \mathrm{eig}(C_M^{-1/2}\, C\, C_M^{-1/2}),   (15)

where eig denotes the matrix whose columns are the eigenvectors of its argument. To reduce the descriptor dimensionality, only the eigenvectors corresponding to the largest eigenvalues are kept; the same holds for all cases where we perform PCA in the rest of the paper. We refer to this transformation as supervised whitening (WS).
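A compact numpy sketch of (14)-(15); the eigenvector ordering and the full-rank assumption on C_M are implementation choices of ours:

```python
import numpy as np

def supervised_whitening(X, pairs):
    """X: n x d descriptor matrix; pairs: (i, j) indices of matching rows.
    Returns mu and A of eq. (9); descriptors are mapped as (X - mu) @ A,
    keeping the first k columns of A for dimensionality reduction."""
    mu = X.mean(axis=0)
    Xc = X - mu
    C = Xc.T @ Xc / len(X)                     # interclass covariance
    D = np.array([X[i] - X[j] for i, j in pairs])
    CM = D.T @ D                               # intraclass covariance, eq. (14)
    w, U = np.linalg.eigh(CM)                  # CM must be full rank here
    CM_isqrt = U @ np.diag(w ** -0.5) @ U.T    # symmetric inverse square root
    _, E = np.linalg.eigh(CM_isqrt @ C @ CM_isqrt)
    return mu, CM_isqrt @ E[:, ::-1]           # largest eigenvalues first
```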

Unsupervised whitening. No supervision is used in this case, and the projection is learned via PCA on the set V_P. In particular, the projection matrix is given by

A = \mathrm{eig}(C)\, \mathrm{diag}(\lambda_1^{-1/2}, \ldots, \lambda_d^{-1/2})^\top,   (16)

where diag denotes a diagonal matrix with the given elements on its diagonal, and λi is the i-th eigenvalue of matrix C. This method is called PCA whitening, and we denote it simply by W [27].

Unsupervised whitening with shrinkage. We extend the PCA whitening scheme by introducing a parameter t that controls the extent of whitening, so the projection matrix becomes

A = \mathrm{eig}(C)\, \mathrm{diag}(\lambda_1^{-t/2}, \ldots, \lambda_d^{-t/2})^\top,   (17)

where t ∈ [0, 1], with t = 1 corresponding to standard PCA whitening and t = 0 to a simple rotation without whitening. Equivalently, t = 1 imposes an identity covariance matrix. We call this method attenuated PCA whitening and denote it by WUA.

The aforementioned process resembles covariance estimation with shrinkage [31,32].


Fig. 2: Patch maps for different parametrizations and kernels. We present two parametrizations in polar and two in Cartesian coordinates, with absolute or relative gradient angle for each one. The similarity between each pixel of patch Q and a single pixel p is shown over patch Q. All pixels in Q have the same gradient angle, which is shown with red arrows. The position of pixel p is shown with "×" on the patch maps. We show examples for Δθ equal to 0 (top, pθ = qθ = 0) and π/8 (bottom, pθ = 0, qθ = −π/8). At the top of each column the kernels that are used (patch similarity) are shown. The similarity is shown in a relative manner and, therefore, the absolute scale is missing. Ten isocontours are sampled uniformly and shown in different colors.

Fig. 3: Patch map with polar parametrization kφkρkθ̂ for Δθ = π/4 and the pair of toy pixel and patch on the left (pθ = −π/4 and qθ = −π/2, ∀q ∈ Q). The example explains why the kernel shifts away from the position of pixel p. The diagram in the 4th column overlays pixel p and 3 pixels of patch Q with the same distance from the center as p. The rightmost plot illustrates kθ̂(pθ̂, qθ̂) and kφ(pφ, qφ) for pixels q with qρ = pρ (on the black dashed circle). kθ̂ is maximized at q(3), kφ at q(1), and their product at q(2).

The sample covariance matrix is known to be a noisy estimator, especially when the available samples are few relative to the number of dimensions [32]. Ledoit and Wolf [32] propose to replace it by a linear combination of the sample covariance matrix and a structured estimator. Their solution is well conditioned and is shown to reduce the effect of noisy estimation in the eigendecomposition. The imposed condition is simply that all variances are the same and all covariances are zero. The shrunk covariance is

\bar{C} = (1 - \beta) C + \beta I_d,   (18)

where I_d is the identity matrix and β the shrinkage parameter. This process "shrinks" extreme (too large or too small) eigenvalues towards intermediate ones. In our experiments we show that a simple tuning of parameter β performs well across different contexts and datasets. The projection matrix is now

A = \mathrm{eig}(C)\, \mathrm{diag}\big((\alpha\lambda_1 + \beta)^{-1/2}, \ldots, (\alpha\lambda_d + \beta)^{-1/2}\big)^\top,   (19)

where α = 1 − β. We call this method PCA whitening with shrinkage and denote it by WUS. We set parameter β equal to the i-th eigenvalue λi. A method similar to ours is used in the work of Brown et al. [14], but it does not allow dimensionality reduction since descriptors are projected back to the original space after eigenvalue clipping.
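All three unsupervised variants differ only in how the eigenvalues are rescaled, so they fit in one function. A sketch of (16), (17), and (19), with a small guard against numerically zero eigenvalues added by us:

```python
import numpy as np

def unsupervised_whitening(X, t=1.0, beta=None):
    """Returns mu and A for W (t=1), WUA (0 <= t <= 1), or WUS (beta set).
    Descriptors are then mapped as (X - mu) @ A, optionally keeping only
    the first k columns of A for dimensionality reduction."""
    mu = X.mean(axis=0)
    Xc = X - mu
    lam, E = np.linalg.eigh(Xc.T @ Xc / len(X))
    lam, E = lam[::-1], E[:, ::-1]                   # descending eigenvalues
    lam = np.maximum(lam, 1e-12)                     # numerical guard
    if beta is None:
        scale = lam ** (-t / 2.0)                    # eq. (16) / (17)
    else:
        scale = ((1.0 - beta) * lam + beta) ** -0.5  # eq. (19), alpha = 1 - beta
    return mu, E * scale                             # eig(C) diag(scale)

# e.g. WUS with beta set to the i-th eigenvalue (lambda_40 in Section 5):
# mu, A = unsupervised_whitening(X, beta=lam_i)
```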


4.4 Visualizing and understanding patch similarity.

We define pixel similarity M(p, q) as the kernel response between pixels p and q, approximated as M(p, q) ≈ ψ(p)⊤ψ(q). To show the spatial distribution of the influence of pixel p, we define a patch map of pixel p (fixed px, py, and pθ). The patch map has the same size as the image patches; for each pixel q of the patch, M(p, q) is evaluated for some constant value of qθ.

For example, in Figure 2 patch maps for different kernels are shown. The position of p is denoted by the × symbol. Then, pθ = 0, while qθ = 0 for all spatial locations of q in the top row and qθ = −π/8 in the bottom row. This example shows the toy patches and their gradient angles as arrows to be more explanatory. The toy patches are directly defined by pθ and qθ. Only pθ and qθ are used in later examples, while the toy patches are skipped from the figures.

The example in Figure 2 reveals a discontinuity near the center of the patch when the pixel similarity is given by the Vφρθ̂ descriptor. It is caused by the polar coordinate system, where a small difference in position near the origin causes a large difference in φ and θ̂. The patch maps reveal weaknesses of kernel descriptors, such as the aforementioned discontinuity, but also advantages of each parametrization. It is easy to observe that the kernel parametrized by Cartesian coordinates and the absolute angle of the gradient (Vxyθ, third column) is insensitive to small translations, i.e. feature point displacement. Moreover, in the bottom row we see that using the relative gradient direction θ̂ compensates for imprecision caused by a small patch rotation, i.e. the most similar pixel is not the one at the location of p with a different θ, but a rotated pixel with a more similar value of θ̂. This effect is further analyzed in Figure 3. The final similarity involves the product of two kernels that both depend on angle φ. They are both maximized at the same point if Δθ = 0, otherwise not; the larger Δθ is, the further (in the patch) the maximum moves from p.

We additionally construct patch maps for the case of descriptor post-processing by a linear transformation, e.g. descriptor whitening. Now the contribution of a pixel pair is given by

\hat{M}(p,q) = (A^\top(\psi(p) - \mu))^\top (A^\top(\psi(q) - \mu))
             = (\psi(p) - \mu)^\top A A^\top (\psi(q) - \mu)
             = \psi(p)^\top A A^\top \psi(q) - \psi(p)^\top A A^\top \mu - \psi(q)^\top A A^\top \mu + \mu^\top A A^\top \mu.   (20)

The last term is constant and can be ignored. If A is a rotation matrix, then the similarity is affected only by the shift by µ.

Fig. 4: Patch maps for different pixels and parametrizations (kφkρkθ̂, kxkykθ) and their concatenation (kφkρkθ̂ + kxkykθ). We present two parametrizations, in polar and Cartesian coordinates, with relative and absolute gradient angle, respectively. Δθ is fixed to 0 (the individual values of pθ and qθ do not matter due to shift invariance) and pixel p is shown with "×". Note the behaviour around the centre in the concatenated case. Ten isocontours are sampled uniformly and shown in different colors.

After the transformation, the similarity is no longer shift-invariant. Non-linear post-processing, such as power-law normalization or simple ℓ2-normalization, cannot be visualized, as it acts after the pixel aggregation.
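A patch map is then just the pixel-pair similarity evaluated on a grid. A minimal sketch, assuming a feature map psi over per-pixel attributes; passing A and mu computes the whitened similarity of (20):

```python
import numpy as np

def patch_map(p, patch_pixels, psi, A=None, mu=None):
    """Similarity between a fixed pixel p and every pixel q of a patch
    (Section 4.4). patch_pixels lists the attributes of all W*W pixels
    in row-major order, each with a constant gradient angle q_theta."""
    def embed(attr):
        f = psi(attr)
        return f if A is None else A.T @ (f - mu)   # whitened case, eq. (20)
    fp = embed(p)
    sims = np.array([fp @ embed(q) for q in patch_pixels])
    W = int(round(np.sqrt(sims.size)))
    return sims.reshape(W, W)
```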

4.5 Combining kernel descriptors.

We propose to take advantage of both parametrizations, Vφρθ̂ and Vxyθ, by summing their contributions. This is performed by simple concatenation of the two descriptors. Finally, whitening is learned jointly and dimensionality reduction is performed.

In Figure 4 we show patch maps for the individual and combined representations, for different pixels p. Observe how the combined one behaves better around the center. The combined descriptor inherits reasonable behavior around the patch center and insensitivity to position misalignment from the Cartesian parametrization, and insensitivity to dominant-orientation misalignment from the polar parametrization, as shown earlier.


Fig. 5: Patch maps for different parametrizations, their concatenation, and different post-processing methods (columns: PC, P+WS, C+WS, PC+WS, PC+WUA, PC+WUS), for varying pθ and qθ (rows: pθ = qθ ∈ {0, π/4, π/2, 3π/4, π}), while Δθ is always 0. Pixel p is shown with "×". P: polar parametrization, C: Cartesian parametrization. WUA is shown for t = 0.7 and WUS for β = λ40. Whitening is learned on the Liberty dataset. Observe that the similarity is no longer shift-invariant after the whitening, and how the shape follows the angle of the gradients. Ten isocontours are sampled uniformly and shown in different colors.

Fig. 6: Visualizing 1D slices of a patch map. We show the similarity k(p, q) for all pixels q with qx = px (middle figure) and qy = py (right figure). It corresponds to the 1D similarity across the dashed lines (magenta and blue, respectively) of the patch map on the left. The particular patch map on the left is only chosen as an illustrative example. We show the similarity for the first row and for columns 1, 4, 5, and 6 of the patch maps from Figure 5 (PC, PC+WS, PC+WUA, PC+WUS). Pixel p has px = 20 and py = 20, with the origin at the bottom left corner.


Fig. 7: Positive patch pairs (patches Q and P) and the corresponding heat maps for polar (P), Cartesian (C), combined (PC), and whitened combined (PC+WS) parametrizations. Red (blue) corresponds to the maximum (minimum) value. Heat maps on the left side correspond to Σ_{p∈P} M(p, q), while the ones on the right side to Σ_{q∈Q} M(p, q). In the case of PC+WS, M̂(p, q) is used instead of M(p, q).

4.6 Understanding the whitened patch similarity.

We learn the different whitening variants of Section 4.3 and visualize their patch maps in Figure 5. All examples shown are for Δθ = 0, but the gradient angles pθ and qθ jointly vary. We initially observe that the similarity is shift-invariant only in the first column of patch maps, where no whitening is applied; this is expected by definition. Projecting by matrix A no longer allows reconstructing the shift-invariant kernels; the similarity does not depend only on Δθ, which is 0, but also on pθ and qθ.

The patch similarity learned by whitening exhibits an interesting property. The shape of the 2D similarity becomes anisotropic and gets aligned with the orientation of the gradient; equivalently, it becomes perpendicular to the edge on which the pixel lies. This is a semantically meaningful effect. It prevents over-counting of pixel matches along aligned edges of the two patches. In the case of a blob detector, this can provide tolerance to errors in scale estimation, i.e. the similarity remains large in the direction along which the blob edges shift in case of a scale estimation error.

We presume that this is learned from pixels with similar gradient angle that co-occur frequently. A similar effect is captured by both the supervised whitening and the unsupervised whitening with covariance shrinkage, though it is less evident in the case of WUS. Moreover, we see that it is mostly the Cartesian parametrization that allows this kind of deformation.

According to our interpretation, supervised whitening [37,47] owes its success to a covariance estimation that is less noisy. The noise removal comes from supervision, but we show that standard approaches for well-conditioned and accurate covariance estimation have a similar effect on the patch similarity even without supervision. The observation that different parametrizations allow different types of co-occurrences to be captured is related to other domains too. For instance, CNN-based image retrieval exhibits improvements after whitening [5], but these are very unequal between average and max pooling. However, observing the differences is not as easy as in our case with the visualized patch similarity.

Finally, we take slices of the 2D patch maps and present the resulting 1D similarity kernels in Figure 6. This is the similarity between pixel p and all pixels q ∈ Q that lie on the vertical and horizontal lines drawn on the patch map at the left of Figure 6. We present the case for which pθ = 0 and qθ = 0, ∀q ∈ Q (top row in Figure 5). The gradient angle is fully horizontal in this case and the 2D similarity kernel tends to get aligned with it, while the chosen slices are aligned in this fashion too. In our experiments we show that all of WS, WUA, and WUS provide significant performance improvements. However, here we observe that the underlying patch similarities exhibit some differences. Supervised whitening WS is not a decreasing function, which might be an outcome of over-fitting to the training data; this is further validated in our experiments. Finally, WUA and WS are not maximized at point p, which does not seem a desirable property, while WUS is maximized similarly to the raw descriptor without any post-processing.

Patch maps are a way to visualize and study the general shape of the underlying similarity function. In a similar manner, we visualize the kernel responses for a particular pair of patches to reflect which pixels contribute the most to the patch similarity. This is achieved by assigning strength Σ_{q∈Q} M̂(p, q) to pixel p; in cases without whitening, M(p, q) is used. We present such heat maps in Figure 7. Whitening significantly affects the contribution of most pixels. The over-counting phenomenon described above is also visible; some of the long edges are suppressed.


Fig. 8: Impact of the shrinkage parameter for unsupervised whitening when trained on the same or a different dataset. We evaluate performance on the Photo Tourism dataset (FPR95) and on the HPatches matching, retrieval, and verification tasks (mAP), versus shrinkage parameter t for the attenuated whitening WUA (top row), and versus shrinkage parameter β = λi for whitening with shrinkage WUS (bottom row).

5 Experiments

We evaluate our descriptor on two benchmarks, namely the widely used Phototourism (PT) dataset [66] and the recently released HPatches (HP) dataset [7]. We first show the impact of the shrinkage parameters in unsupervised whitening, and then compare with the baseline method of Bursuc et al. [16], on top of which we build our descriptor. We examine the generalization properties of whitening when learned on PT but tested on HP, and finally compare against state-of-the-art descriptors on both benchmarks. In all our experiments with descriptor post-processing, the dimensionality is reduced to 128, while the combined descriptor originally has 238 dimensions, except for the cases where the input descriptor is already of lower dimension. Our experiments are conducted with a Matlab implementation of the descriptor, which takes 5.6 ms per patch for extraction on a single CPU of a 3.5 GHz desktop machine. A GPU implementation reduces the time to 0.1 ms per patch on an Nvidia Titan X.

5.1 Datasets and protocols.

The Phototourism dataset contains three sets of patches, namely Liberty (Li), Notredame (No), and Yosemite (Yo). Additionally, labels are provided to indicate the 3D point that each patch corresponds to, thereby providing supervision. It has been widely used for training and evaluating local descriptors. Performance is measured by the false positive rate at 95% recall (FPR95). The protocol is to train on one of the three sets and test on the other two; an average over all six combinations is reported.

The HPatches dataset contains local patches of higher diversity and is more realistic; during evaluation, performance is measured on three tasks: verification, retrieval, and matching. We follow the standard evaluation protocol [7] and report mean Average Precision (mAP). We follow common practice and use models learned on Liberty of PT to compare descriptors that have not used HP during learning. We evaluate on all 3 train/test splits and report the average performance. All reported results on HP (ours and other descriptors) are produced by our own evaluation using the provided framework and descriptors².

5.2 Impact of the shrinkage parameter.

We evaluate the impact of the shrinkage parameter involved in the unsupervised whitening: t for WUA and β = λi for WUS. Results are presented in Figure 8, for evaluation on the PT and HP datasets, with the whitening learned on the same or a different dataset.

² L2Net and HardNet descriptors were provided by the authors of HardNet [39].


Fig. 9: Eigenvalues for standard PCA whitening (W: λ), attenuated whitening (WUA: λ^t, t = 0.7), and whitening with shrinkage (WUS: (1 − β)λ + β, β = λ40). We normalize so that the maximum eigenvalue is 1. The first 120 eigenvalues (out of 238) are shown.

Fig. 10: Histograms of patch similarity Vp⊤Vq for positive and negative patch pairs, for polar, cartes, polar+cartes, and polar+cartes+WS. Histograms are constructed from 50K matching and 50K non-matching pairs from the Notredame dataset.

The performance is stable over a range of values, which makes the parameter easy to tune in a robust way across tasks and datasets. In the rest of our experiments we set t = 0.7 and β = λ40. In Figure 9 we show the eigenvalues used by W, WUA, and WUS; the contrast between the larger and smaller eigenvalues is decreased.

5.3 Comparison with the baseline.

We compare the combined descriptor against the individual parametrizations used alone. The experimental evaluation on the PT dataset is shown in Table 1. The baseline is followed by PCA and square-rooting, as originally proposed in [16]. We did not consider the square-rooting variant in our analysis in Section 4 because such a non-linearity does not allow visualizing the underlying patch similarity. Supervised whitening on top of the combined descriptor performs best. Unsupervised whitening also improves the results significantly, while not requiring any labeling of the patches.

The polar parametrization with the relative gradient direction (polar) significantly outperforms the Cartesian parametrization with the absolute gradient direction (cartes). After descriptor post-processing (polar + WS vs. cartes + WS), the gap is reduced. The performance of the combined descriptor (polar + cartes) without descriptor post-processing is worse than the baseline descriptor. That is caused by the fact that the two descriptors are combined with equal weight, which is clearly suboptimal. No attempt is made to estimate the mixing parameter explicitly; it is implicitly included in the post-processing stage (see Appendix A).

Figure 10 presents patch similarity histograms for matching and non-matching pairs, showing how their separation is improved by the final descriptor.

We perform an experiment with synthetic patch transformations to test the robustness of the different parametrizations. The whole patch is synthetically rotated or translated, by appropriately transforming pixel positions, and gradient angles in the case of rotation. A fixed amount of rotation/translation is applied to one patch of each pair of the PT dataset, and results are presented in Figure 11. It is indeed verified that the Cartesian parametrization is more robust to translations, while the polar one is more robust to rotations. The combined one partially enjoys the benefits of both.

5.4 Generalization of whitening.

We learn the whitening on PT or HP (supervised and unsupervised) and evaluate the performance on HP. We present results in Table 2. Whitening always improves the performance of the raw descriptor. The unsupervised variant is superior when learned on an independent dataset: it generalizes better, implying over-fitting of the supervised one (recall the observations of Figure 6). Learning on HP (the corresponding training part per split) with supervision helps significantly. Note that PT contains only patches detected by DoG, while HP uses a combination of detectors.


Test                                          Liberty          Notredame        Yosemite
Train                          D      Mean    No      Yo       Li      Yo       Li      No
polar [16]                     175    22.42   24.34   24.34    16.06   16.06    26.85   26.85
cartes                          63    35.87   34.06   34.06    34.10   34.10    39.47   39.47
polar + cartes                 238    25.37   26.16   26.16    20.04   20.04    29.91   29.91
polar + PCA + SQRT [16]        128     8.30   12.09   13.13     5.16    5.41     7.52    6.49
polar [16] + WS                128     7.06    8.55   10.48     4.40    3.94     8.86    6.12
cartes + WS                     63    15.13   17.31   20.34    10.90   11.85    16.84   13.55
polar + cartes + WS            128     5.94    7.46    9.85     3.45    3.55     6.47    4.89
polar + cartes + WUA           128     6.79   10.59   11.17     3.80    4.36     5.58    5.16
polar + cartes + WUS           128     7.22   10.61   11.14     4.27    4.46     6.75    6.09

Table 1: Performance comparison on the Phototourism dataset between the baseline approach and our combined descriptor. We further show the benefit of learned whitening (WS) over the standard PCA followed by square-rooting, as well as the other variants that perform additional regularization (WUA, WUS) without supervision. FPR95 is reported for all methods.

Name                    Train   Sup.   R       M       V
polar + cartes          N/A     N/A    45.23   29.68   77.78
polar + cartes + WUA    PT      No     52.78   36.46   77.31
polar + cartes + WUS    PT      No     53.50   37.16   78.81
polar + cartes + WS     PT      Yes    49.66   32.58   75.82
polar + cartes + WUA    HP      No     56.36   39.88   80.06
polar + cartes + WUS    HP      No     56.71   40.13   80.70
polar + cartes + WS     HP      Yes    61.79   44.40   83.50

Table 2: Generalization of different whitening approaches. Mean Average Precision (mAP) for the 3 tasks of HP, namely Retrieval (R), Matching (M), and Verification (V). The whitening is learned on PT or HP. Sup. denotes supervised.

5.5 Comparison with the State of the Art.

We compare the performance of the proposed descriptor with previously published results on the Phototourism dataset. Results are shown in Table 3. Our method obtains the best performance among the unsupervised/hand-crafted approaches by a large margin. Overall, it comes right after the two very recent CNN-based descriptors, namely L2Net [59] and HardNet [39]. The advantage of our approach is the low cost of the learning: it takes less than a minute, about 45 seconds to extract descriptors of Liberty and about 10 seconds to compute the projection matrix. CNN-based competitors require several hours or days of training.

The comparison on the HPatches dataset is reported in Figure 12. We use the provided descriptors and framework to evaluate all approaches ourselves. For the descriptors that require learning, the model learned on Liberty-PT is used.

Fig. 11: Performance (FPR95) on PT (training on Liberty, testing on Notredame) when one patch of each pair undergoes synthetic rotation (degrees) or translation (pixels). Curves are shown for PC+WS, P+WS, and C+WS.

Our unsupervised descriptor is the top performing hand-crafted variant by a large margin. Overall, it is always outperformed by HardNet and L2Net, while on verification it is additionally outperformed by DDesc and TF-M. Verification is closer to the learning task (loss) involved in training these CNN-based methods.

Finally, we learn the supervised whitening WS for all other descriptors, post-process them, and present results in Figure 13. The projection matrix is learned on HP, in particular on the training part of each split.


Supervised whitening WS consistently boosts the performance of all descriptors, at a minimal extra cost compared to the initial training of a CNN descriptor. Our descriptor comes 3rd in 2 out of 3 tasks. Note that it uses the whitening learned on HP (as do all other descriptors in this comparison), but does not use the PT dataset at all. All CNN-based descriptors train their parameters on Liberty-PT, which is costly, while the overall learning of our descriptor again takes on the order of a single minute.

6 Conclusions

We have proposed a multiple-kernel local-patch descriptor combining two parametrizations of gradient position and direction. Each parametrization provides robustness to a different type of patch mis-registration: polar parametrization for noise in the dominant orientation, Cartesian for imprecise location of the feature point. We have performed descriptor whitening and have shown that its effect on patch similarity is semantically meaningful. The lessons learned from analyzing the similarity after whitening can be exploited for further improvements of kernel, or even CNN-based, descriptors. Learning the whitening in a supervised or unsupervised way boosts the performance. Interestingly, the latter generalizes better and yields the best performing hand-crafted descriptor so far, competing well even with CNN-based descriptors. Unlike the currently best performing CNN-based approaches, the proposed descriptor is easy to implement and interpret.

7 Acknowledgments

The authors were supported by the OP VVV funded project CZ.02.1.01/0.0/0.0/16 019/0000765 "Research Center for Informatics" and the MSMT LL1303 ERC-CZ grant. Arun Mukundan was supported by the CTU student grant SGS17/185/OHK3/3T/13. We would like to thank Karel Lenc and Vassileios Balntas for their valuable help with the HPatches benchmark, and Dmytro Mishkin for providing the L2Net and HardNet descriptors for the HPatches dataset.

A Regularized concatenation

When combining the descriptors of different parametrizations by concatenation, we use both with equal contribution, i.e. the final similarity equals kφkρkθ̂ + kxkykθ. In the case of the raw descriptors this is clearly suboptimal; one would rather regularize by kφkρkθ̂ + w·kxkykθ and search for the optimal value of the scalar w. We now prove that this is not necessary in the case of post-processing by supervised whitening, where the optimal regularization is included in the projection matrix.

We denote the set of descriptors by V_P when w = 1, and by V_P^{(w)} when w ≠ 1. It holds that V_P^{(w)} = {W V(P), P ∈ P}, where W is a diagonal matrix with ones on the dimensions corresponding to the first descriptor (for kφkρkθ̂), and all remaining diagonal elements equal to w. The covariance matrix of V_P is C, while that of V_P^{(w)} is C^{(w)} = W C W⊤.

Learning the supervised whitening on V_P as in (15) produces the projection matrix

A = C_M^{-1/2}\, \mathrm{eig}(C_M^{-1/2}\, C\, C_M^{-1/2}),   (21)

while learning it on V_P^{(w)} produces the projection matrix

A^{(w)} = (C_M^{(w)})^{-1/2}\, \mathrm{eig}\big((C_M^{(w)})^{-1/2}\, C^{(w)}\, (C_M^{(w)})^{-1/2}\big).   (22)

Cholesky decomposition of C_M gives

C_M = U^\top U = L L^\top,   (23)

which leads to the Cholesky decomposition

C_M^{(w)} = W U^\top U W^\top = W L L^\top W^\top.   (24)

Using (24) allows us to rewrite (22) as

A^{(w)} = (L^\top W^\top)^{-1}\, \mathrm{eig}\big((W U^\top)^{-1}\, W C W^\top\, (L^\top W^\top)^{-1}\big)
        = (W^\top)^{-1} (L^\top)^{-1}\, \mathrm{eig}\big(U^{-\top} W^{-1}\, W C W^\top\, (W^\top)^{-1} (L^\top)^{-1}\big)
        = (W^\top)^{-1} (L^\top)^{-1}\, \mathrm{eig}\big(U^{-\top}\, C\, (L^\top)^{-1}\big)
        = (W^\top)^{-1} (L^\top)^{-1}\, \mathrm{eig}\big(C_M^{-1/2}\, C\, C_M^{-1/2}\big)
        = (W^\top)^{-1} A.   (25)

Whitening descriptor V(P) ∈ V_P with matrix A is performed by

\hat{V}(P) = A^\top (V(P) - \mu),   (26)

while whitening descriptor V(P)^{(w)} ∈ V_P^{(w)} with matrix A^{(w)} is performed by

\hat{V}(P)^{(w)} = A^{(w)\top} (W V(P) - W \mu) = A^\top W^{-1} (W V(P) - W \mu) = \hat{V}(P).   (27)

No matter what the regularization parameter is, the descriptor is identical after whitening. We conclude that there is no need to perform such a regularized concatenation.
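The claim is easy to verify numerically. The toy check below (random data, arbitrary w, names ours) re-implements the supervised whitening of (15) with a symmetric inverse square root, which whitens C_M up to a rotation and therefore yields the same projected descriptors up to per-dimension signs:

```python
import numpy as np

def supervised_whitening(X, pairs):
    mu = X.mean(axis=0)
    C = (X - mu).T @ (X - mu) / len(X)
    D = np.array([X[i] - X[j] for i, j in pairs])
    w_, U = np.linalg.eigh(D.T @ D)
    CMi = U @ np.diag(w_ ** -0.5) @ U.T    # inverse square root of CM
    _, E = np.linalg.eigh(CMi @ C @ CMi)
    return mu, CMi @ E

rng = np.random.default_rng(0)
n, d1, d2 = 500, 6, 4                      # toy dimensions of the two parts
X = rng.normal(size=(n, d1 + d2))
pairs = [(2 * i, 2 * i + 1) for i in range(n // 2)]

w = 3.0                                    # arbitrary regularization weight
Wm = np.diag(np.r_[np.ones(d1), w * np.ones(d2)])
mu1, A1 = supervised_whitening(X, pairs)
mu2, A2 = supervised_whitening(X @ Wm, pairs)
Y1, Y2 = (X - mu1) @ A1, (X @ Wm - mu2) @ A2
print(np.allclose(np.abs(Y1), np.abs(Y2), atol=1e-6))  # True (up to signs)
```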


Supervised
Name                           D       FPR@95
Brown et al. [14]              29-36   15.36
Trzcinski et al. [62]          128     17.08
Simonyan et al. [57]           73-77   10.38
DC-S2S [70]                    512      9.67
DDESC [56]                     128      9.85
MatchNet [23]                  4096     7.75
TF-M [8]                       128      6.47
L2Net+ [59]                    128      2.22
HardNet+ [39]                  128      1.51
polar + cartes + WS (ours)     128      5.98

Unsupervised
Name                           D       FPR@95
RootSIFT                       128     26.14
RootSIFT + PCA + SQRT [16]     80      17.51
polar + PCA + SQRT [16]        128      8.30
polar + cartes + WUA (ours)    128      6.79
polar + cartes + WUS (ours)    128      7.21

Table 3: Performance comparison with the state of the art on the Phototourism dataset. We report FPR@95 averaged over the 6 dataset combinations for supervised (top) and unsupervised (bottom) approaches. The whitening for our descriptor is learned on the corresponding training part of PT for each combination.

Fig. 12: Performance comparison on the HP benchmark. The learning, whenever applicable, is performed on Liberty of the PT dataset. Descriptors that do not require any supervision in the form of labeled patches, i.e. hand-crafted or unsupervised, are shown in striped bars. Our descriptor is denoted by PCWUS (P = polar, C = cartes). Results (mAP [%]):
Patch Verification: RSIFT 57.2, BRIEF 57.9, ORB 59.7, SIFT 64.4, BBoost 66.5, DC-S 70.0, DC-S2S 77.9, PCWUS 78.8, DDesc 79.2, TF-M 81.4, L2Net+ 86.4, HardNet+ 88.1.
Image Matching: BRIEF 10.3, BBoost 14.7, ORB 15.2, DC-S 25.1, SIFT 25.8, RSIFT 27.2, DC-S2S 27.8, DDesc 28.3, TF-M 32.8, PCWUS 37.2, L2Net+ 45.5, HardNet+ 53.2.
Patch Retrieval: BRIEF 26.3, ORB 29.5, BBoost 34.2, SIFT 43.0, RSIFT 43.3, DC-S2S 47.6, DC-S 48.5, TF-M 52.4, PCWUS 53.5, DDesc 54.0, L2Net+ 63.6, HardNet+ 70.3.

Fig. 13: Performance comparison on the HP benchmark when post-processing all descriptors with supervised whitening WS learned on HP. The initial learning of the descriptor, whenever applicable, is performed on Liberty of the PT dataset. Our descriptor uses the whitening learned on HP and does not use the PT dataset at all. Our descriptor is denoted by PCWS (P = polar, C = cartes). Results (mAP [%]):
Patch Verification: ORB 60.06, BRIEF 64.4, BBoost 69.3, SIFT 76.4, RSIFT 77.7, DDesc 81.5, TF-M 83.0, PCWS 83.5, DC-S 84.2, DC-S2S 86.2, L2Net+ 87.6, HardNet+ 88.8.
Image Matching: BRIEF 11.5, ORB 14.9, BBoost 18.2, DC-S 30.9, SIFT 32.1, TF-M 35.7, DC-S2S 35.9, DDesc 35.9, RSIFT 37.9, PCWS 44.4, L2Net+ 48.3, HardNet+ 55.1.
Patch Retrieval: BRIEF 27.0, ORB 27.5, BBoost 36.3, SIFT 50.5, RSIFT 54.9, TF-M 55.4, DC-S 56.2, DC-S2S 58.4, DDesc 59.4, PCWS 61.8, L2Net+ 66.4, HardNet+ 72.3.


References

1. Ahonen, T., Matas, J., He, C., Pietikainen, M.: Rotation invariant image description with local binary pattern histogram Fourier features. In: Scandinavian Conference on Image Analysis, pp. 61–70. Springer (2009)

2. Alahi, A., Ortiz, R., Vandergheynst, P.: Freak: Fast retina keypoint. In: CVPR (2012)

3. Ambai, M., Yoshida, Y.: Card: Compact and real-time descriptors. In: ICCV (2011)

4. Arandjelovic, R., Zisserman, A.: Three things everyone should know to improve object retrieval. In: CVPR (2012)

5. Babenko, A., Lempitsky, V.: Aggregating deep convolutional features for image retrieval. In: ICCV (2015)

6. Balntas, V., Johns, E., Tang, L., Mikolajczyk, K.: Pn-net: Conjoined triple deep network for learning local image descriptors. In: arXiv (2016)

7. Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: CVPR (2017)

8. Balntas, V., Riba, E., Ponsa, D., Mikolajczyk, K.: Learning local feature descriptors with triplets and shallow convolutional neural networks. In: BMVC (2016)

9. Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying interpretability of deep visual representations. In: CVPR, pp. 3319–3327. IEEE (2017)

10. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). CVIU 110(3), 346–359 (2008)

11. Bo, L., Ren, X., Fox, D.: Kernel descriptors for visual recognition. In: NIPS (2010)

12. Bo, L., Ren, X., Fox, D.: Depth kernel descriptors for object recognition. In: IROS (2011)

13. Bo, L., Sminchisescu, C.: Efficient match kernels between sets of features for visual recognition. In: NIPS (2009)

14. Brown, M., Hua, G., Winder, S.: Discriminative learning of local image descriptors. IEEE Trans. PAMI 33(1), 43–57 (2011)

15. Brown, M., Szeliski, R., Winder, S.: Multi-image matching using multi-scale oriented patches. In: CVPR, vol. 1, pp. 510–517 (2005)

16. Bursuc, A., Tolias, G., Jegou, H.: Kernel local descriptors with implicit rotation matching. In: ICMR (2015)

17. Calonder, M., Lepetit, V., Strecha, C., Fua, P.: Brief: Binary robust independent elementary features. In: ECCV (2010)

18. Chum, O.: Low dimensional explicit feature maps. In: ICCV (2015)

19. Delhumeau, J., Gosselin, P.H., Jegou, H., Perez, P.: Revisiting the VLAD image representation. In: ACM Multimedia (2013)

20. Dong, J., Soatto, S.: Domain-size pooling in local descriptors: Dsp-sift. In: CVPR (2015)

21. Forssen, P.E., Lowe, D.G.: Shape descriptors for maximally stable extremal regions. In: ICCV, pp. 1–8. IEEE (2007)

22. Frahm, J.M., Fite-Georgel, P., Gallup, D., Johnson, T., Raguram, R., Wu, C., Jen, Y.H., Dunn, E., Clipp, B., Lazebnik, S., et al.: Building Rome on a cloudless day. In: ECCV (2010)

23. Han, X., Leung, T., Jia, Y., Sukthankar, R., Berg, A.C.: Matchnet: Unifying feature and metric learning for patch-based matching. In: CVPR (2015)

24. Heikkila, M., Pietikainen, M., Schmid, C.: Description of interest regions with local binary patterns. Pattern Recognition 42(3), 425–436 (2009)

25. Heinly, J., Schonberger, J.L., Dunn, E., Frahm, J.M.: Reconstructing the world* in six days* (as captured by the Yahoo 100 million image dataset). In: CVPR (2015)

26. Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: NIPS, pp. 2017–2025 (2015)

27. Jegou, H., Chum, O.: Negative evidences and co-occurrences in image retrieval: The benefit of PCA and whitening. In: ECCV (2012)

28. Ke, Y., Sukthankar, R.: PCA-SIFT: a more distinctive representation for local image descriptors. In: CVPR, pp. 506–513 (2004)

29. Kokkinos, I., Yuille, A.: Scale invariance without scale selection. In: CVPR (2008)

30. Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE Trans. PAMI 27(8), 1265–1278 (2005)

31. Ledoit, O., Wolf, M.: Honey, I shrunk the sample covariance matrix. The Journal of Portfolio Management 30(4), 110–119 (2004)

32. Ledoit, O., Wolf, M.: A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis 88(2), 365–411 (2004)

33. Leutenegger, S., Chli, M., Siegwart, R.Y.: Brisk: Binary robust invariant scalable keypoints. In: ICCV (2011)

34. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)

35. Mahendran, A., Vedaldi, A.: Visualizing deep convolutional neural networks using natural pre-images. IJCV 120(3), 233–255 (2016)

36. Mairal, J., Koniusz, P., Harchaoui, Z., Schmid, C.: Convolutional kernel networks. In: NIPS, pp. 2627–2635 (2014)

37. Mikolajczyk, K., Matas, J.: Improving descriptors for fast tree matching by optimal linear projection. In: ICCV (2007)

38. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. PAMI 27(10), 1615–1630 (2005)

39. Mishchuk, A., Mishkin, D., Radenovic, F., Matas, J.: Working hard to know your neighbor's margins: Local descriptor learning loss. In: NIPS (2017)

40. Mishkin, D., Matas, J., Perdoch, M., Lenc, K.: Wxbs: Wide baseline stereo generalizations. In: arXiv (2015)

41. Mukundan, A., Tolias, G., Chum, O.: Multiple-kernel local-patch descriptor. In: BMVC (2017)

42. Ojala, T., Pietikainen, M., Maenpaa, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. PAMI 24(7), 971–987 (2002)

43. Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3), 145–175 (2001)

44. Paulin, M., Douze, M., Harchaoui, Z., Mairal, J., Perronnin, F., Schmid, C.: Local convolutional features with unsupervised training for image retrieval. In: ICCV (2015)

45. Paulin, M., Mairal, J., Douze, M., Harchaoui, Z., Perronnin, F., Schmid, C.: Convolutional patch representations for image retrieval: an unsupervised approach. IJCV 121(1), 149–168 (2017)

46. Philbin, J., Isard, M., Sivic, J., Zisserman, A.: Descriptor learning for efficient retrieval. In: ECCV (2010)

47. Radenovic, F., Tolias, G., Chum, O.: CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples. In: ECCV (2016)

48. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to SIFT or SURF. In: ICCV (2011)

49. van de Sande, K.E.A., Gevers, T., Snoek, C.G.M.: Evaluating color descriptors for object and scene recognition. IEEE Trans. PAMI 32(9), 1582–1596 (2010)


50. Schmid, C., Mohr, R.: Local grayvalue invariants for image retrieval. IEEE Trans. PAMI 19(5), 530–535 (1997)

51. Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016)

52. Schonberger, J.L., Hardmeier, H., Sattler, T., Pollefeys, M.: Comparative evaluation of hand-crafted and learned local features. In: CVPR (2017)

53. Schonberger, J.L., Radenovic, F., Chum, O., Frahm, J.M.: From single image query to detailed 3D reconstruction. In: CVPR (2015)

54. Scovanner, P., Ali, S., Shah, M.: A 3-dimensional sift descriptor and its application to action recognition. In: Proceedings of the 15th ACM International Conference on Multimedia, pp. 357–360 (2007)

55. Shechtman, E., Irani, M.: Matching local self-similarities across images and videos. In: CVPR, pp. 1–8. IEEE (2007)

56. Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: ICCV (2015)

57. Simonyan, K., Vedaldi, A., Zisserman, A.: Learning local feature descriptors using convex optimisation. IEEE Trans. PAMI 36(8), 1573–1585 (2014)

58. Taira, H., Torii, A., Okutomi, M.: Robust feature matching by learning descriptor covariance with viewpoint synthesis. In: ICPR (2016)

59. Tian, Y., Fan, B., Wu, F.: L2-net: Deep learning of discriminative patch descriptor in Euclidean space. In: CVPR (2017)

60. Tola, E., Lepetit, V., Fua, P.: Daisy: An efficient dense descriptor applied to wide-baseline stereo. IEEE Trans. PAMI 32(5), 815–830 (2010)

61. Tolias, G., Bursuc, A., Furon, T., Jegou, H.: Rotation and translation covariant match kernels for image retrieval. CVIU 140, 9–20 (2015)

62. Trzcinski, T., Christoudias, M., Lepetit, V., Fua, P.: Learning image descriptors with the boosting-trick. In: NIPS (2012)

63. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. In: CVPR (2010)

64. Vedaldi, A., Zisserman, A.: Efficient additive kernels via explicit feature maps. IEEE Trans. PAMI 34, 480–492 (2012)

65. Wang, P., Wang, J., Zeng, G., Xu, W., Zha, H., Li, S.: Supervised kernel descriptors for visual recognition. In: CVPR (2013)

66. Winder, S., Brown, M.: Learning local image descriptors. In: CVPR (2007)

67. Yi, K.M., Trulls, E., Lepetit, V., Fua, P.: Lift: Learned invariant feature transform. In: ECCV, pp. 467–483. Springer (2016)

68. Yosinski, J., Clune, J., Nguyen, A., Fuchs, T., Lipson, H.: Understanding neural networks through deep visualization. arXiv preprint arXiv:1506.06579 (2015)

69. Yu, G., Morel, J.M.: A fully affine invariant image comparison method. In: ICASSP, pp. 1597–1600. IEEE (2009)

70. Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: CVPR (2015)

71. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional networks. In: ECCV (2014)

72. Zhou, L., Zhu, S., Shen, T., Wang, J., Fang, T., Quan, L.: Progressive large scale-invariant image matching in scale space. In: ICCV (2017)