Lec 08: Feature Aggregation II

Spring 2020: Venue: Haag 315, Time: M/W 4-5:15pm

ECE 5582 Computer Vision
Lec 08: Feature Aggregation II

Zhu Li
Dept of CSEE, UMKC

Office: FH560E, Email: [email protected], Ph: x 2346
http://l.web.umkc.edu/lizhu

Z. Li: ECE 5582 Computer Vision, 2020

slides created with WPS Office Linux and EqualX LaTex equation editor

Outline

• ReCap of Lecture 07
  • Image Retrieval System
  • BoW
  • VLAD
• Dense SIFT
• Fisher Vector Aggregation
• AKULA
• Summary


Precision, Recall, F-measure

• Precision = TP/(TP + FP)

• Recall (TPR) = TP/(TP + FN)

• FPR = FP/(FP + TN)

• F-measure = 2*(precision*recall)/(precision + recall)

Precision is the probability that a retrieved document is relevant.

Recall is the probability that a relevant document is retrieved in a search.
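As a quick check of these definitions, the sketch below computes the four quantities from confusion counts, using the standard convention FPR = FP/(FP + TN). The function name and the example counts are hypothetical, chosen only for illustration.

```python
def retrieval_metrics(tp, fp, fn, tn):
    """Return (precision, recall, fpr, f_measure) from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # also called TPR
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, fpr, f_measure


# Hypothetical query: 8 relevant items retrieved, 2 irrelevant items retrieved,
# 4 relevant items missed, 86 irrelevant items correctly left out.
precision, recall, fpr, f_measure = retrieval_metrics(tp=8, fp=2, fn=4, tn=86)
```

Note that precision and recall ignore TN entirely, while FPR depends on it, which is why TPR-FPR curves and precision-recall curves can rank systems differently on large databases.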


Why Aggregation ?

• Curse of Dimensionality

•Decision Boundary / Indexing



Bag-of-Words: Histogram Coding

• Codebook:
  • Feature space: $R^d$, k-means to get k centroids, $\{\mu_1, \mu_2, \ldots, \mu_k\}$

• BoW Hard Encoding:
  • For n feature points $\{x_1, x_2, \ldots, x_n\}$, the assignment matrix is k x n, with only one non-zero entry in each column
  • Aggregated dimension: k
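The hard encoding can be sketched as follows: build the k x n assignment matrix with a single 1 per column, then sum each row to get the k-bin histogram. The toy centroids and features are made up for illustration, and the helper names are not from the lecture.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def bow_hard_encode(features, centroids):
    """Return the k x n 0/1 assignment matrix and the k-bin BoW histogram."""
    k, n = len(centroids), len(features)
    assignment = [[0] * n for _ in range(k)]
    for j, x in enumerate(features):
        # Nearest centroid gets the single non-zero entry of column j.
        nearest = min(range(k), key=lambda i: sq_dist(x, centroids[i]))
        assignment[nearest][j] = 1
    histogram = [sum(row) for row in assignment]
    return assignment, histogram


# Toy 2-D example (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.7, 0.2]]
assignment, histogram = bow_hard_encode(features, centroids)
```

In practice the histogram is usually normalized (for example by n) so that images with different numbers of features are comparable.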

Kernel Code Book Soft Encoding

• Kernel Code Book Soft Encoding
  • Kernel Affinity: $K(x_j, \mu_k) = e^{-\beta \|x_j - \mu_k\|^2}$
  • Assignment Matrix: $A_{j,k} = K(x_j, \mu_k) / \sum_{k'} K(x_j, \mu_{k'})$
  • Encoding: k-dimensional: $X(k) = \frac{1}{n} \sum_j A_{j,k}$
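A minimal sketch of the soft encoding, assuming a Gaussian-style affinity exp(-beta * ||x - mu||^2); the parameter name `beta`, its default value, and the toy data are assumptions for illustration. Each feature spreads one unit of mass over all codewords, so the final code sums to 1.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def soft_encode(features, centroids, beta=1.0):
    """Kernel codebook encoding: average of the normalized affinities."""
    k = len(centroids)
    code = [0.0] * k
    for x in features:
        affinities = [math.exp(-beta * sq_dist(x, mu)) for mu in centroids]
        total = sum(affinities)
        for i in range(k):
            code[i] += affinities[i] / total   # row of the assignment matrix
    return [c / len(features) for c in code]


# Toy 2-D example (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [3.7, 0.2]]
code = soft_encode(features, centroids, beta=1.0)
```

A large `beta` makes the assignment approach the hard BoW histogram, while a small `beta` spreads each feature almost uniformly over the codebook.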


VLAD- Vector of Locally Aggregated Descriptors

• Aggregate feature differences from the codebook
  • Hard assignment by finding the NN of each feature $x_j$ in $\{\mu_k\}$
  • Compute aggregated differences: $v_k = \sum_{\forall j,\ \mathrm{s.t.}\ NN(x_j) = \mu_k} (x_j - \mu_k)$
  • L2 normalize: $v_k = v_k / \|v_k\|_2$
• Final feature: k x d

[Figure: VLAD on a codebook with 5 cells and aggregated vectors v1 to v5: ① assign descriptors, ② compute x - μ_i, ③ v_i = sum of x - μ_i for cell i]

VLAD on SIFT

• Example of aggregating SIFT with VLAD
  • K=16 codebook entries
  • Each cell is a SIFT visualized as centroids in blue, and VLAD difference in red
  • Top row: left image, bottom row: right image, red: code book, blue: encoded VLAD
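The VLAD steps (assign, accumulate residuals, L2 normalize each cell) can be sketched in a few lines. This is an illustrative version with per-cell normalization as written on the slide; the function name and toy data are hypothetical, and real pipelines often add a global normalization as well.

```python
import math


def vlad_encode(features, centroids):
    """Return the k*d VLAD vector: per-cell residual sums, L2-normalized per cell."""
    k, d = len(centroids), len(centroids[0])
    v = [[0.0] * d for _ in range(k)]
    for x in features:
        # Step 1: hard-assign the descriptor to its nearest centroid.
        nn = min(range(k),
                 key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
        # Steps 2-3: accumulate the residual x - mu_nn in that cell.
        for j in range(d):
            v[nn][j] += x[j] - centroids[nn][j]
    for i in range(k):
        norm = math.sqrt(sum(c * c for c in v[i]))
        if norm > 0:                       # empty cells stay all-zero
            v[i] = [c / norm for c in v[i]]
    return [c for row in v for c in row]   # concatenate cells: length k*d


# Toy 2-D example (hypothetical values).
centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
features = [[0.5, 0.0], [1.0, 2.0], [2.0, 1.0], [4.0, 1.0]]
vlad = vlad_encode(features, centroids)
```

With k=16 and 128-dimensional SIFT, as in the example above, the resulting vector has 16 x 128 = 2048 entries.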




One more trick

• Recall that SIFT is a powerful descriptor

• VL_FEAT: vl_dsift
  • A dense description of the image, computing SIFT descriptors at a predetermined grid (no spatial-scale space extrema detection)

• Supplement HoG as an alternative texture descriptor


VL_FEAT: vl_dsift

• Compute dense SIFT as a texture descriptor for the image
    [f, dsift] = vl_dsift(single(rgb2gray(im)), 'step', 2);

• There's also a FAST option
    [f, dsift] = vl_dsift(single(rgb2gray(im)), 'fast', 'step', 2);
  • A huge amount of SIFT data will be generated
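To get a feel for how much data a dense grid produces, the sketch below counts grid positions for a given step. It is a generic approximation: the `margin` parameter and the image size are hypothetical, and the exact bounds used by vl_dsift depend on its bin size and bounds options, so the real count will differ somewhat.

```python
def dense_grid_centers(width, height, step, margin):
    """Descriptor centers on a regular grid, kept `margin` pixels from the border."""
    xs = range(margin, width - margin, step)
    ys = range(margin, height - margin, step)
    return [(x, y) for y in ys for x in xs]


# Hypothetical 640x480 image, step 2 as in the calls above, 8-pixel margin.
centers = dense_grid_centers(width=640, height=480, step=2, margin=8)
num_descriptors = len(centers)
# Each SIFT descriptor has 128 entries; at one byte each this is the raw size.
raw_bytes = num_descriptors * 128
```

Tens of thousands of 128-dimensional descriptors per image is the reason an aggregation step such as VLAD or Fisher Vector is needed before indexing.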


Fisher Vector

• Fisher Vector and variations:
  • Winning in image classification
  • Winning in the MPEG object re-identification:
    o SCFV (Scalable Compressed Fisher Vector) in CDVS


Codebook: Gaussian Mixture Model (GMM)

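• GMM is a generative model to express data
  • Assuming the data is generated from a GMM with parameters $\{\pi_k, \mu_k, \sigma_k\}$:

$x \sim \sum_{k=1}^{K} \pi_k\, N(\mu_k, \Sigma_k)$

$N(x; \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}}\, e^{-\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)}$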
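As an illustration of the mixture density, here is a minimal sketch that evaluates a GMM at a point. It assumes diagonal covariances (each component stores a vector of per-dimension variances), which is a simplification of the general formula; the function names are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance diag(var)."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    maha = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + maha)


def gmm_density(x, priors, means, variances):
    """Mixture density: sum over components of prior times Gaussian density."""
    return sum(p * math.exp(log_gaussian_diag(x, m, v))
               for p, m, v in zip(priors, means, variances))


# Single standard normal component in 1-D: density at 0 is 1/sqrt(2*pi).
p_single = gmm_density([0.0], [1.0], [[0.0]], [[1.0]])
# Two equally weighted 1-D components centered at -1 and +1.
p_mix = gmm_density([0.0], [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]])
```

For high-dimensional features such as SIFT, the component densities become very small, so practical code usually stays in the log domain instead of exponentiating each term.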

A bit of Theory: Fisher Kernel

• Encode the deviation from the generative model
  • Observed feature set $\{x_1, x_2, \ldots, x_n\}$ in $R^d$, e.g., d=128 for SIFT.
  • How do these observations deviate from the given GMM with parameter set $\lambda = \{\pi_k, \mu_k, \sigma_k\}$?
    o i.e., how should the parameters, e.g., the means, move to best fit the observations?

A bit of Theory: Fisher Kernel

• Score function w.r.t. the likelihood function $u_\lambda(X)$
  • $G_\lambda^X = \nabla_\lambda \log u_\lambda(X)$: derivative of the log likelihood
  • The dimension of the score function is m, where m is the number of generative model parameters; m=3 parameter types for GMM (weight, mean, variance)
  • Given the observed data X, the score function indicates how the likelihood function parameters (e.g., the means) should move to better fit the data.

• Distance/deviation of two observations X, Y w.r.t. the generative model
  • Fisher Info Matrix (roughly the covariance in the Mahalanobis distance): $F_\lambda = E_X\left[ G_\lambda^X (G_\lambda^X)^T \right]$
  • Fisher Kernel Distance, normalized by the Fisher Info Matrix: $K_{FK}(X, Y) = (G_\lambda^X)^T F_\lambda^{-1} G_\lambda^Y$

Fisher Vector

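• $K_{FK}(X, Y)$ is a measure of similarity w.r.t. the generative model
• Similar to the Mahalanobis distance case, we can decompose this kernel, with $F_\lambda^{-1} = L_\lambda^T L_\lambda$, as

$K_{FK}(X, Y) = (G_\lambda^X)^T F_\lambda^{-1} G_\lambda^Y = (G_\lambda^X)^T L_\lambda^T L_\lambda G_\lambda^Y$

• That gives us a kernel feature mapping of X to the Fisher Vector, $\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$
• For observed image features $\{x_t\}$, assumed independent, it can be computed as $\mathcal{G}_\lambda^X = \sum_t L_\lambda \nabla_\lambda \log u_\lambda(x_t)$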

GMM Fisher Vector

• Encode the deviation from the generative model
  • Observed feature set $\{x_1, x_2, \ldots, x_n\}$ in $R^d$, e.g., d=128 (!) for SIFT.
  • How do these observations deviate from the given GMM with parameter set $\lambda = \{\pi_k, \mu_k, \sigma_k\}$?

• GMM Log Likelihood Gradient (equations not recoverable from the slide), one block per parameter type:
  • weight
  • mean
  • variance
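For reference, a sketch of the three gradient blocks in the form commonly used in the Fisher kernel literature. It assumes a diagonal-covariance GMM with per-dimension standard deviations $\sigma_{k,j}$ and treats the weights as unconstrained; the soft-assignment symbol $\gamma_t(k)$, the sample count $T$, and the dimension index $j$ are notation introduced here, not taken from the slide.

```latex
% Soft assignment (posterior) of feature x_t to component k
\gamma_t(k) = \frac{\pi_k\, N(x_t; \mu_k, \Sigma_k)}{\sum_{i=1}^{K} \pi_i\, N(x_t; \mu_i, \Sigma_i)}

% Weight
\frac{\partial \log u_\lambda(X)}{\partial \pi_k}
  = \sum_{t=1}^{T} \frac{\gamma_t(k)}{\pi_k}

% Mean
\frac{\partial \log u_\lambda(X)}{\partial \mu_{k,j}}
  = \sum_{t=1}^{T} \gamma_t(k)\, \frac{x_{t,j} - \mu_{k,j}}{\sigma_{k,j}^{2}}

% Variance (written w.r.t. the standard deviation)
\frac{\partial \log u_\lambda(X)}{\partial \sigma_{k,j}}
  = \sum_{t=1}^{T} \gamma_t(k) \left[ \frac{(x_{t,j} - \mu_{k,j})^{2}}{\sigma_{k,j}^{3}} - \frac{1}{\sigma_{k,j}} \right]
```

If the sum-to-one constraint on the weights is enforced, the weight gradient takes a different form depending on the chosen parametrization; the mean and variance blocks are unaffected.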

GMM Fisher Vector VL_FEAT implementation

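• GMM codebook
  • For a K-component GMM, we only allow 3K parameters, $\{\pi_k, \mu_k, \sigma_k \mid k = 1..K\}$, i.e., Gaussian components with diagonal covariance:

$\Sigma_k = \mathrm{diag}(\sigma_{k,1}^2, \sigma_{k,2}^2, \ldots, \sigma_{k,D}^2)$

• Posterior prob of feature point $x_i$ to GMM component k:

$q_{ik} = \frac{\pi_k\, N(x_i; \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, N(x_i; \mu_j, \Sigma_j)}$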
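The posterior of a feature point over the GMM components is Bayes' rule applied to the mixture. The sketch below assumes diagonal covariances stored as per-dimension variances and works in the log domain to avoid underflow; the function name and the toy parameters are hypothetical.

```python
import math


def gmm_posteriors(x, priors, means, variances):
    """Posterior probability of each diagonal-covariance component given x."""
    log_terms = []
    for p, m, v in zip(priors, means, variances):
        log_n = -0.5 * sum(math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj
                           for xj, mj, vj in zip(x, m, v))
        log_terms.append(math.log(p) + log_n)
    # Subtract the max before exponentiating (log-sum-exp trick).
    shift = max(log_terms)
    weights = [math.exp(t - shift) for t in log_terms]
    total = sum(weights)
    return [w / total for w in weights]


# Two equally weighted 1-D components centered at -1 and +1, unit variance.
priors, means, variances = [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]]
q_mid = gmm_posteriors([0.0], priors, means, variances)
q_right = gmm_posteriors([1.0], priors, means, variances)
```

Unlike the hard assignment in BoW and VLAD, every component receives a non-zero share of each feature, although in high dimensions the posterior is typically dominated by one or two components.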

GMM Fisher Vector VL_FEAT implementation

• FV encoding
  • Gradient w.r.t. the mean and variance, for GMM component k, j=1..D
  • In the end, we have a 2K x D aggregation of the deviation w.r.t. the means and variances:

$\Phi(X) = [u_1, u_2, \ldots, u_K, v_1, v_2, \ldots, v_K]$
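A minimal pure-Python sketch of this encoding, following the formulas given in the VLFeat documentation: u_jk = 1/(N sqrt(pi_k)) * sum_i q_ik (x_ji - mu_jk)/sigma_jk and v_jk = 1/(N sqrt(2 pi_k)) * sum_i q_ik [((x_ji - mu_jk)/sigma_jk)^2 - 1]. It is meant to show the arithmetic only; it omits the optional normalizations of vl_fisher, and the helper names and toy data are hypothetical.

```python
import math


def gmm_posteriors(x, priors, means, variances):
    """Posterior of each diagonal-covariance component given x (log domain)."""
    logs = [math.log(p) - 0.5 * sum(math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj
                                    for xj, mj, vj in zip(x, m, v))
            for p, m, v in zip(priors, means, variances)]
    shift = max(logs)
    w = [math.exp(t - shift) for t in logs]
    total = sum(w)
    return [wi / total for wi in w]


def fisher_vector(features, priors, means, variances):
    """Return [u_1..u_K, v_1..v_K], a vector of length 2*K*D."""
    n, K, D = len(features), len(priors), len(means[0])
    u = [[0.0] * D for _ in range(K)]
    v = [[0.0] * D for _ in range(K)]
    for x in features:
        q = gmm_posteriors(x, priors, means, variances)
        for k in range(K):
            for j in range(D):
                z = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                u[k][j] += q[k] * z                # deviation w.r.t. the mean
                v[k][j] += q[k] * (z * z - 1.0)    # deviation w.r.t. the variance
    for k in range(K):
        su = 1.0 / (n * math.sqrt(priors[k]))
        sv = 1.0 / (n * math.sqrt(2.0 * priors[k]))
        u[k] = [val * su for val in u[k]]
        v[k] = [val * sv for val in v[k]]
    return [c for row in u for c in row] + [c for row in v for c in row]


# Toy check: one 2-D standard normal component, two symmetric points.
fv = fisher_vector([[1.0, 0.0], [-1.0, 0.0]], [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
```

In the toy check the mean block is zero because the two points are symmetric around the mean, while the variance block is negative in the second dimension because the data has less spread there than the model predicts.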

VL_FEAT GMM/FV API

• Compute GMM model with VL_FEAT
  • Prepare data:
      numPoints = 1000 ;
      dimension = 2 ;
      data = rand(dimension, numPoints) ;

  • Call vl_gmm:
      numClusters = 30 ;
      [means, covariances, priors] = vl_gmm(data, numClusters) ;

  • Visualize:
      figure ;
      hold on ;
      plot(data(1,:), data(2,:), 'r.') ;
      for i = 1:numClusters
        vl_plotframe([means(:,i)' covariances(1,i) 0 covariances(2,i)]) ;
      end


VL_FEAT API

• FV encoding:
    encoding = vl_fisher(data_to_Be_Encoded, means, covariances, priors);

• Bonus points:
  • Encode HoG features with Fisher Vector?
  • Randomly collect 2~3 images from each class
  • Stack all HoG features together into an n x 36 data matrix (transpose it to 36 x n for vl_gmm, which takes one point per column)
  • Compute its GMM
  • Use this GMM to encode all image HoG features (other than average)


Super Vector Aggregation – Speaker ID

• Fisher Vector: Aggregates Features against a GMM
• Super Vector: Aggregates GMM against GMM

• Ref:
  o William M. Campbell, Douglas E. Sturim, Douglas A. Reynolds: Support vector machines using GMM supervectors for speaker verification. IEEE Signal Process. Lett. 13(5): 308-311 (2006)


“Yes, We Can !”

?

Super Vector from MFCC

• Motivated from Speaker ID work
  • Speech is a continuous evolution of the vocal tract
  • Need to extract a sequence of spectra or sequence of spectral coefficients
  • Use a sliding window: 25 ms window, 10 ms shift

[Figure: MFCC pipeline: |X(ω)| → Log → DCT → MFCC]

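GMM Model from MFCC

• GMM on MFCC feature
  • The acoustic vectors (MFCC) of speaker s are modeled by a prob. density function parameterized by $\lambda_s = \{\pi_k^s, \mu_k^s, \Sigma_k^s\}_{k=1}^{K}$
  • Gaussian mixture model (GMM) for speaker s: $p(x \mid \lambda_s) = \sum_{k=1}^{K} \pi_k^s\, N(x; \mu_k^s, \Sigma_k^s)$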

Universal Background Model

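• UBM GMM Model:
  • The acoustic vectors of a general population are modeled by another GMM called the universal background model (UBM): $p(x \mid \lambda_u) = \sum_{k=1}^{K} \pi_k\, N(x; \mu_k, \Sigma_k)$
  • Parameters of the UBM: $\lambda_u = \{\pi_k, \mu_k, \Sigma_k\}$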

MAP Adaptation

• Given the UBM GMM, how does the new observation deviate?
  • The adapted mean is given by:
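A common form of mean-only MAP adaptation in the speaker verification literature is relevance MAP: with soft counts n_k = sum_t q_tk and data means E_k = (1/n_k) sum_t q_tk x_t, the adapted mean is alpha_k * E_k + (1 - alpha_k) * mu_k with alpha_k = n_k / (n_k + r). The sketch below assumes this form and diagonal covariances; the relevance factor `r` and its default of 16 are assumptions, not values from the slide.

```python
import math


def gmm_posteriors(x, priors, means, variances):
    """Posterior of each diagonal-covariance UBM component given x."""
    logs = [math.log(p) - 0.5 * sum(math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj
                                    for xj, mj, vj in zip(x, m, v))
            for p, m, v in zip(priors, means, variances)]
    shift = max(logs)
    w = [math.exp(t - shift) for t in logs]
    total = sum(w)
    return [wi / total for wi in w]


def map_adapt_means(features, priors, means, variances, r=16.0):
    """Relevance-MAP adaptation of the UBM means only (priors, variances kept)."""
    K, D = len(means), len(means[0])
    n_k = [0.0] * K
    first_order = [[0.0] * D for _ in range(K)]
    for x in features:
        q = gmm_posteriors(x, priors, means, variances)
        for k in range(K):
            n_k[k] += q[k]
            for j in range(D):
                first_order[k][j] += q[k] * x[j]
    adapted = []
    for k in range(K):
        alpha = n_k[k] / (n_k[k] + r)      # more data: trust the data mean more
        e_k = [s / n_k[k] for s in first_order[k]] if n_k[k] > 0 else list(means[k])
        adapted.append([alpha * e + (1 - alpha) * m for e, m in zip(e_k, means[k])])
    return adapted


# Toy check: one component at 0, sixteen observations at 2, r = 16.
adapted = map_adapt_means([[2.0]] * 16, [1.0], [[0.0]], [[1.0]], r=16.0)
```

With n_k equal to r the adapted mean lands halfway between the UBM mean and the data mean; components that see little data stay close to the UBM.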


Supervector Distance

• Assuming we have the UBM GMM model $\lambda_u = \{\pi_k, \mu_k, \Sigma_k\}$, and the adapted models keep identical priors and covariances

• Then for two utterance samples a and b, with GMM models
  • $\lambda_a = \{\pi_k, \mu_k^a, \Sigma_k\}$,
  • $\lambda_b = \{\pi_k, \mu_k^b, \Sigma_k\}$,

the SV distance is

$K(\lambda_a, \lambda_b) = \sum_k \left( \sqrt{\pi_k}\, \Sigma_k^{-1/2} \mu_k^a \right)^T \left( \sqrt{\pi_k}\, \Sigma_k^{-1/2} \mu_k^b \right)$

This means the means of the two models are normalized by the Mahalanobis distance metric induced by the UBM covariance. It is also a linear kernel function scaled by the UBM covariances.
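The kernel above is an ordinary dot product once each adapted mean is scaled by sqrt(pi_k) and the inverse square root of the covariance, and the scaled means are stacked into one long supervector. The sketch assumes diagonal covariances stored as per-dimension variances; the function names and toy numbers are hypothetical.

```python
import math


def supervector(priors, adapted_means, variances):
    """Stack sqrt(pi_k) * mu_k / sigma_k over all components into one vector."""
    sv = []
    for p, m, v in zip(priors, adapted_means, variances):
        sv.extend(math.sqrt(p) * mj / math.sqrt(vj) for mj, vj in zip(m, v))
    return sv


def sv_kernel(priors, means_a, means_b, variances):
    """Linear kernel between the two scaled supervectors."""
    a = supervector(priors, means_a, variances)
    b = supervector(priors, means_b, variances)
    return sum(x * y for x, y in zip(a, b))


# Toy 1-D example with two components sharing the UBM priors and variances.
priors = [0.5, 0.5]
variances = [[4.0], [1.0]]
means_a = [[2.0], [1.0]]
means_b = [[4.0], [3.0]]
k_ab = sv_kernel(priors, means_a, means_b, variances)
```

Because the kernel is linear in the supervectors, they can be fed directly to a linear SVM, which is the setup used in the referenced speaker verification work.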

Supervector Performance in NIST Speaker ID

• System 5: Gaussian SV
• DCF (Detection Cost Function)


m31491

AKULA – Adaptive KLUster Aggregation

2013/10/25

Abhishek Nagar, Zhu Li, Gaurav Srivastava and Kyungmo Park


Outline

• Motivation
• Adaptive Aggregation
• Results with TM7
• Summary


Motivation

• Better Aggregation
  • Fisher Vector and VLAD type aggregation depend on a global model
  • AKULA removes this dependence, and directly codes the cluster centroids and SIFT count
  • SCFV/RVD both have situations where clusters are turned off due to no assignment; this can be avoided in AKULA

[Figure: pipeline: SIFT detection & selection → K-means → AKULA description]


Motivation

• Better Subspace Choice
  • Both SCFV and RVD do fixed normalization and PCA projection based on heuristics.
  • What is the best possible subspace to do the aggregation?
  • Use a boosting scheme to keep adding subspaces and aggregations in an iterative fashion, and tune TPR-FPR to the desired operating points on FPR.


CE2: AKULA – Adaptive KLUster Aggregation

• AKULA Descriptor: cluster centroids + SIFT count

$A_1 = \{yc^1_1, yc^1_2, \ldots, yc^1_k;\ pc^1_1, pc^1_2, \ldots, pc^1_k\}$,
$A_2 = \{yc^2_1, yc^2_2, \ldots, yc^2_k;\ pc^2_1, pc^2_2, \ldots, pc^2_k\}$

• Distance metric:
  • Min centroids distance, weighted by SIFT count
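One possible reading of "min centroid distance, weighted by SIFT count" is a one-way distance: for each centroid of the first descriptor, take the distance to its closest centroid in the second descriptor, and weight it by the share of SIFT points in that cluster. The sketch below implements this reading only; the exact AKULA metric, its normalization, and the function names are assumptions.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def akula_one_way_distance(centroids_1, counts_1, centroids_2):
    """Count-weighted sum of min centroid distances from descriptor 1 to 2."""
    total = float(sum(counts_1))
    distance = 0.0
    for y, p in zip(centroids_1, counts_1):
        d_min = min(euclidean(y, z) for z in centroids_2)
        distance += (p / total) * d_min    # weight by the cluster's SIFT share
    return distance


# Toy descriptors with two clusters each (hypothetical values).
a1_centroids, a1_counts = [[0.0, 0.0], [10.0, 0.0]], [3, 1]
a2_centroids = [[0.0, 1.0], [10.0, 2.0]]
d12 = akula_one_way_distance(a1_centroids, a1_counts, a2_centroids)
```

A one-way distance is not symmetric in general; a symmetric variant could average the two directions, at twice the matching cost.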


AKULA implementation in TM7

• Inner loop aggregation
  • Dimension is fixed at 8
  • Number of clusters nc = 8, 16, 32, to hit 64, 128, and 256 bytes
  • Quantization: scale by ½ and quantize to int8; SIFT count is 8 bits; total (nc+1)*dim bytes per aggregation
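The size arithmetic can be checked directly: nc centroids of dim int8 coordinates take nc x dim bytes, which gives the 64, 128, and 256 byte targets, and the stated total is (nc+1) x dim. The quantizer below is a guess at "scale by ½ and quantize to int8" (round, then clip to the int8 range); the function names and the rounding and clipping behaviour are assumptions.

```python
def akula_sizes(num_clusters, dim=8):
    """Return (centroid payload bytes, total bytes as stated: (nc+1)*dim)."""
    centroid_bytes = num_clusters * dim          # one int8 per coordinate
    stated_total_bytes = (num_clusters + 1) * dim
    return centroid_bytes, stated_total_bytes


def quantize_int8(values, scale=0.5):
    """Scale, round to the nearest integer, and clip to the int8 range."""
    return [max(-128, min(127, int(round(v * scale)))) for v in values]


sizes = {nc: akula_sizes(nc) for nc in (8, 16, 32)}
codes = quantize_int8([200.0, -300.0, 4.0])
```

Halving before quantization maps values in the 0..255 SIFT range into 0..127, which fits a signed byte without clipping.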






• Outer loop subspace optimization by boosting
  • Initial set of subspace models {Ak} computed from MIR FLICKR data set SIFT extractions, by k-means partitioning the space into 4096 clusters
  • Iterative search on subspaces to generate AKULA aggregations that can improve precision-recall performance
  • Notice that aggregation is de-coupled in subspace iteration, to allow more DoF in aggregation and to find subspaces that provide complementary info.

• The algorithm is still being debugged, hence only 1st iteration results are available in TM7
  • Indexing/Hashing is required for AKULA; it involves nc x dim multiplications and additions at this time. A binarization scheme will be considered once its performance is optimized in non-binary form.


GD Only TPR-FPR: AKULA vs SCFV

• Data set 1:
  • AKULA (128 bytes, dim=8, nc=16) distance is just 1-way dmin1.*wt
  • Forcing a weighted sum on SCFV (512 bytes) Hamming distances without 2D decision fitting, i.e., count the Hamming distance between common active clusters, and sum up their distances


GD Only TPR-FPR: AKULA vs SCFV

• Data set 2, 3:
  • AKULA distance is just 1-way dmin1.*wt
  • AKULA = 128 bytes, SCFV = 512 bytes.


3D object set: 4, 5

• Data set 4, 5:


AKULA in PM

•FPR performance:

•AKULA rates:

pm rates    m     akula rates
512         8     64
1K          16    128
2K          16    128
1K_4K       16    128
2K_4K       16    128
4K          16    128
8K          32    256
16K         32    256


TPR@1% FPR

[Figure: plots of TPR (%) at 1% FPR, TM7 vs AKULA, over data sets 1a, 1b, 1c, 2, 3, 4, 5; one panel per bitrate: 512, 1k, 2k, 1k-4k, 2k-4k, 4k, 8k, 16k]


AKULA Localization

•Quite some improvements: 2.7%


AKULA Summary

• Benefits:
  • Allows more DoF in aggregation optimization,
    o by an outer loop boosting scheme for subspace projection optimization
    o and an inner loop adaptive clustering without the constraint of the global GMM model
  • Simple weighted distance sum metric, with no need to tune a multi-dimensional decision boundary
  • The overall pairwise matching matches up with TM7 SCFV with its 2-dimensional decision boundary
  • In GD-only matching, outperforms the TM7 GD
  • Good improvements to the localization accuracy
  • Light in extraction, but still heavy in pairwise matching, and needs a binarization scheme and/or indexing scheme to work for retrieval

• Future Improvements:
  • Supervector AKULA?


Lec 08 Summary

• Fisher Vector
  • Aggregate features $\{x_k\}$ in $R^D$ against GMM

• Super Vector
  • Aggregate GMM against a global GMM (UBM)

• AKULA
  • Direct Aggregation, non-indexable
