
Page 1

SPARS 2017: Signal Processing with Adaptive Sparse Structured Representations
Lisbon, Portugal, June 5-8, 2017

Submission deadline: December 12, 2016
Notification of acceptance: March 27, 2017
Summer School: May 31-June 2, 2017 (tbc)
Workshop: June 5-8, 2017

spars2017.lx.it.pt

Page 2

Rémi Gribonval, Inria Rennes - Bretagne Atlantique

[email protected]

Random Moments for Compressive Learning

Page 3

Main Contributors & Collaborators

Anthony Bourrier, Nicolas Keriven, Yann Traonmilin, Nicolas Tremblay, Gilles Puy, Mike Davies, Patrick Pérez, Gilles Blanchard

Page 4

Agenda

- From Compressive Sensing to Compressive Learning?
- The Sketch Trick
- Compressive K-means
- Compressive GMM
- Conclusion

Page 5

From Compressive Sensing to Compressive Learning

Page 6

Machine Learning

Available data: a training collection of feature vectors = a point cloud X.

Goals: infer parameters to achieve a certain task, with generalization to future samples drawn from the same probability distribution.

Examples (task → inferred parameters): see the list below, after the embedded poster.

[Embedded poster, reproduced on several later slides:]

Compressive Gaussian Mixture Estimation
Anthony Bourrier (1,2), Rémi Gribonval (2), Patrick Pérez (1)
(1) Technicolor, 975 Avenue des Champs Blancs, 35576 Cesson-Sévigné, France, [email protected]
(2) INRIA Rennes - Bretagne Atlantique, Campus de Beaulieu, 35042 Rennes, [email protected]

Motivation. Goal: infer parameters $\theta$ from n-dimensional data $X = \{x_1, \ldots, x_N\}$. This typically requires extensive access to the data. Proposed method: infer from a sketch of the data, yielding memory and privacy savings. Figure 1 illustrates the proposed sketching framework: learning set $x_1 \ldots x_N$ (size $Nn$), sketching operator $\hat{A}$, database sketch $\hat{z}$ (size $m$), learning method $L$, learned parameters $\theta$ (size $K$).

Model and problem statement. Application to mixtures of isotropic Gaussians in $\mathbb{R}^n$:
$$ f_\mu \propto \exp\bigl(-\|x - \mu\|_2^2 / (2\sigma^2)\bigr). \qquad (1) $$
Data $X = \{x_j\}_{j=1}^N \sim_{\text{i.i.d.}} p = \sum_{s=1}^k \alpha_s f_{\mu_s}$, with weights $\alpha_1, \ldots, \alpha_k$ (positive, summing to one) and means $\mu_1, \ldots, \mu_k \in \mathbb{R}^n$. The sketch consists of Fourier samples at different frequencies, $(Af)_l = \hat{f}(\omega_l)$, with empirical version $(\hat{A}(X))_l = \frac{1}{N}\sum_{j=1}^N \exp(-i\langle \omega_l, x_j\rangle) \approx (Ap)_l$. We want to infer the mixture parameters from $\hat{z} = \hat{A}(X)$. The problem is cast as
$$ p = \arg\min_{q \in \Sigma_k} \|\hat{z} - Aq\|_2^2, \qquad (2) $$
where $\Sigma_k$ denotes the set of mixtures of $k$ isotropic Gaussians with positive weights.

Analogy with standard CS:

|              | Standard CS                     | Our problem                                                               |
|--------------|---------------------------------|---------------------------------------------------------------------------|
| Signal       | $x \in \mathbb{R}^n$            | $f \in L^1(\mathbb{R}^n)$                                                 |
| Dimension    | $n$                             | infinite                                                                  |
| Sparsity     | $k$                             | $k$                                                                       |
| Dictionary   | $\{e_1, \ldots, e_n\}$          | $F = \{f_\mu, \mu \in \mathbb{R}^n\}$                                     |
| Measurements | $x \mapsto \langle a, x\rangle$ | $f \mapsto \int_{\mathbb{R}^n} f(x)\, e^{-i\langle\omega, x\rangle}\, dx$ |

Algorithm. Given the current estimate $p$ with weights $\{\alpha_s\}_{s=1}^k$, support $\{\hat\mu_s\}_{s=1}^k$, and residual $\hat r = \hat z - Ap$:
1. Search for new support functions: "good components to add" are local minima of $\mu \mapsto -\langle A f_\mu, \hat r\rangle$; they are added to the support, yielding an enlarged support.
2. k-term thresholding: project $\hat z$ onto the enlarged support with positivity constraints on the coefficients,
$$ \arg\min_{\beta \in \mathbb{R}^K_+} \|\hat z - U\beta\|_2^2, \qquad (3) $$
with $U = [\hat\mu_1, \ldots, \hat\mu_K]$. The $k$ highest coefficients and the corresponding support elements are kept, giving the new support and coefficients $\alpha_1, \ldots, \alpha_k$.
3. Final "shift": gradient descent on the objective function, initialized at the current support and coefficients.

Figure 2: algorithm illustration in dimension n = 1 for k = 3 Gaussians (top: iteration 1, bottom: iteration 2; blue curve = true mixture, red curve = reconstructed mixture, green curve = gradient function, green dots = candidate centroids, red dots = reconstructed centroids).

Experimental results. Data setup: $\sigma = 1$, $(\alpha_1, \ldots, \alpha_k)$ drawn uniformly on the simplex, entries of $\mu_1, \ldots, \mu_k \sim_{\text{i.i.d.}} \mathcal{N}(0, 1)$. Algorithm heuristics: frequencies drawn i.i.d. from $\mathcal{N}(0, \mathrm{Id})$; the new support function search (step 1) is initialized at $ru$, where $r$ is drawn uniformly in $[0, \max_{x \in \mathcal{X}} \|x\|_2]$ and $u$ is drawn uniformly on $B_2(0, 1)$. Comparison between our method (the sketch is computed on-the-fly and the data is discarded) and EM (the data is stored to allow the standard optimization steps to be performed). Quality measures: KL divergence and Hellinger distance.

Table 1: comparison between our method and an EM algorithm ($n = 20$, $k = 10$, $m = 1000$).

| N      | Compressed: KL div. | Hell.       | Mem. | EM: KL div. | Hell.       | Mem. |
|--------|---------------------|-------------|------|-------------|-------------|------|
| $10^3$ | 0.68 ± 0.28         | 0.06 ± 0.01 | 0.6  | 0.68 ± 0.44 | 0.07 ± 0.03 | 0.24 |
| $10^4$ | 0.24 ± 0.31         | 0.02 ± 0.02 | 0.6  | 0.19 ± 0.21 | 0.01 ± 0.02 | 2.4  |
| $10^5$ | 0.13 ± 0.15         | 0.01 ± 0.02 | 0.6  | 0.13 ± 0.21 | 0.01 ± 0.02 | 24   |

Figure 3: left, example of data and sketch for n = 2; right, reconstruction quality for n = 10 (Hellinger distance for 80%), as a function of the sketch size m and of k·n/m.


- PCA → principal subspace
- Dictionary learning → dictionary
- Clustering → centroids
- Classification → classifier parameters (e.g. support vectors)

Pages 7-12

Point cloud = large matrix of feature vectors, $X = [x_1\; x_2\; \ldots\; x_N]$.

Challenging dimensions:
- high feature dimension n
- large collection size N

Challenge: compress before learning?

Pages 13-15

Compressive Machine Learning?

Point cloud = large matrix of feature vectors $X = [x_1\; x_2\; \ldots\; x_N]$, compressed column by column into $Y = MX = [y_1\; y_2\; \ldots\; y_N]$.

Reduce the feature dimension [Calderbank & al 2009, Reboredo & al 2013]:
- (random) feature projection M
- exploits / needs a low-dimensional feature model
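As a toy illustration of this random feature projection (my own example with illustrative sizes, not code from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

n, N = 128, 10_000                    # feature dimension, collection size (illustrative)
m = 16                                # reduced feature dimension, m < n

X = rng.standard_normal((n, N))       # point cloud: one column per feature vector
M = rng.standard_normal((m, n)) / np.sqrt(m)   # random Gaussian projection

Y = M @ X                             # compressed features; note there are still N columns,
                                      # so the collection size is untouched (next slides)
print(X.shape, "->", Y.shape)         # (128, 10000) -> (16, 10000)
```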

Pages 16-17

Challenges of large collections

Feature projection Y = MX: limited impact, since it reduces the feature dimension n but leaves the collection size N untouched.

"Big Data" challenge: compress the collection size.

Pages 18-21

Compressive Machine Learning?

Point cloud = … empirical probability distribution.

Reduce the collection dimension:
- (adaptive) column sampling / coresets, see e.g. [Agarwal & al 2003, Feldman 2010]
- sketching & hashing, see e.g. [Thaper & al 2002, Cormode & al 2005]

Sketching operator M: X ↦ z ∈ R^m, nonlinear in the feature vectors but linear in their probability distribution.

Page 22

Example: Compressive K-means

[Embedded poster "Compressive Gaussian Mixture Estimation" (Bourrier, Gribonval, Pérez); see the reconstruction on Page 6.]

Point cloud X (N = 1000, n = 2) → sketching operator M → sketch z ∈ R^m (m = 60) → recovery algorithm → estimated centroids, shown against the ground truth.

Page 23

Computational impact of sketching (Ph.D. theses of A. Bourrier & N. Keriven)

[Figure: computation time and memory vs. collection size N, for K = 5 and K = 20; curves for Sketching (no distributed computing), CLOMP, CLOMPR, BS-CGMM, EM and CHS, and memory curves for "Sketch + frequencies (compressive methods)" vs. "Data (EM)".]

Figure 7: Time (top) and memory (bottom) usage of all algorithms on synthetic data with dimension n = 10, number of components K = 5 (left) or K = 20 (right), and number of frequencies m = 5(2n+1)K, with respect to the number of items in the database N. [RG: replace BS-GMM with Algorithm 3.]
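For reference, the stated frequency count works out, for n = 10, to

$$ m = 5(2n+1)K = 5 \times 21 \times 5 = 525 \ \ (K = 5), \qquad m = 5 \times 21 \times 20 = 2100 \ \ (K = 20). $$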

[Excerpt from the accompanying paper:]

… distributions for further work. In this configuration, the speaker verification results will indeed be far from state-of-the-art but, as mentioned before, our goal is mainly to test our compressive approach on a different type of problem than GMM estimation on synthetic data, for which we have already observed excellent results.

In the GMM-UBM model, each speaker S is represented by one GMM $(\Theta_S, \alpha_S)$. The key point is the introduction of a model $(\Theta_{UBM}, \alpha_{UBM})$ that represents a "generic" speaker, referred to as the Universal Background Model (UBM). Given speech data $\mathcal{X}$ and a candidate speaker S, the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model:

$$ T(\mathcal{X}) = \frac{p_{\Theta_S, \alpha_S}(\mathcal{X})}{p_{\Theta_{UBM}, \alpha_{UBM}}(\mathcal{X})}. \qquad (23) $$

If $T(\mathcal{X})$ exceeds a threshold $\tau$, the data $\mathcal{X}$ are considered as being uttered by speaker S.
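To make the test statistic (23) concrete, here is a minimal sketch of the likelihood-ratio decision for diagonal-covariance GMMs, computed in the log domain for numerical stability (my own helper names and conventions; each model is a (weights, means, variances) triple, and frames are treated as independent, which is the usual GMM-UBM assumption rather than something stated in the excerpt):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (N x n) under a diagonal-covariance GMM."""
    # log p(x_i) = logsumexp_k [ log alpha_k + log N(x_i; mu_k, diag(var_k)) ]
    log_comp = np.stack([
        np.log(w) + multivariate_normal.logpdf(X, mean=mu, cov=np.diag(v))
        for w, mu, v in zip(weights, means, variances)
    ])                                       # shape (K, N)
    return np.logaddexp.reduce(log_comp, axis=0).sum()

def accept_speaker(X, speaker_gmm, ubm_gmm, log_tau=0.0):
    """Accept if log T(X) = log p_speaker(X) - log p_UBM(X) exceeds log(tau)."""
    log_T = gmm_loglik(X, *speaker_gmm) - gmm_loglik(X, *ubm_gmm)
    return log_T > log_tau
```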

The GMMs corresponding to each speaker must somehow be "comparable" to each other and to the UBM. Therefore, the UBM is learned prior to the individual speaker models, using a large database of speech data uttered by many speakers. Then, given training data $\mathcal{X}_S$ specific to one speaker, one M-step of the EM algorithm initialized with the UBM is used to adapt the UBM and derive the model $(\Theta_S, \alpha_S)$. We refer the reader to [51] for more details on this procedure.

In our framework, the EM or compressive estimation algorithms are used to learn the UBM.

5.2 Setup

The experiments were performed on the classical NIST05 speaker verification database. Both training and testing fragments are 5-minute conversations between two speakers. The database contains approximately 650 speakers and 30 000 trials.

Pages 24-27

The Sketch Trick

Data distribution $X \sim p(x)$ → sketch $z$.

Analogy: in signal processing (inverse problems, compressive sensing), an operator M maps a signal $x$ in signal space to an observation $y$ in observation space. In machine learning (method of moments, compressive learning), an operator M maps the probability distribution $p$ in probability space to a sketch $z$ in sketch space.

Linear "projection" (nonlinear in the feature vectors, but linear in the distribution p(x)):

$$ z_\ell = \int h_\ell(x)\, p(x)\, dx = \mathbb{E}\, h_\ell(X) \approx \frac{1}{N} \sum_{i=1}^N h_\ell(x_i) $$

This is a finite-dimensional Mean Map Embedding, cf. Smola & al 2007, Sriperumbudur & al 2010.

Information preservation?
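A minimal numerical illustration of the formula above (my own toy example, not from the talk): with $h_\ell(x) = x^\ell$ the sketch collects ordinary moments, as in the classical method of moments, and the empirical average over the point cloud approximates $\mathbb{E}\, h_\ell(X)$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.0, size=100_000)    # 1-D point cloud drawn from p(x)

# Generalized moments: z_l = E h_l(X) ~ (1/N) sum_i h_l(x_i), here with h_l(x) = x**l
m = 4
z = np.array([np.mean(x**l) for l in range(1, m + 1)])

# For N(2, 1) the exact values are E X = 2, E X^2 = 5, E X^3 = 14, E X^4 = 43
print(np.round(z, 2))
```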

Page 28

The Sketch Trick

Data distribution $X \sim p(x)$ → sketch $z$: a finite-dimensional Mean Map Embedding (cf. Smola & al 2007, Sriperumbudur & al 2010).

Dimension reduction?

Page 29

Compressive Learning: (Heuristic) Examples

Page 30

Compressive Machine Learning

Point cloud = empirical probability distribution.

Reduce the collection dimension ~ sketching: X → M → z ∈ R^m, with sketching operator

$$ z_\ell = \frac{1}{N} \sum_{i=1}^N h_\ell(x_i), \qquad 1 \le \ell \le m. $$

How to choose an information-preserving sketch?

Page 31

Example: Compressive K-means

Goal: find k centroids.

Standard approach: K-means.

Sketching approach: p(x) is spatially localized, so we need "incoherent" sampling; choose Fourier sampling, i.e. sample the characteristic function, and choose the sampling frequencies.

[Embedded poster "Compressive Gaussian Mixture Estimation" (Bourrier, Gribonval, Pérez); see the reconstruction on Page 6.]

The sketch samples the empirical characteristic function:

$$ z_\ell = \frac{1}{N} \sum_{i=1}^N e^{j\, \omega_\ell^\top x_i}, \qquad \omega_\ell \in \mathbb{R}^n $$

~ pooled Random Fourier Features, cf. Rahimi & Recht 2007.
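A minimal sketch of this computation (illustrative variable names; frequencies drawn i.i.d. Gaussian, which is the heuristic used on the embedded poster):

```python
import numpy as np

def sketch(X, Omega):
    """Empirical characteristic function sketch.

    X:     (N, n) point cloud, one row per feature vector.
    Omega: (m, n) frequency matrix, one row per omega_l.
    Returns z in C^m with z_l = (1/N) sum_i exp(1j * omega_l . x_i).
    """
    return np.exp(1j * X @ Omega.T).mean(axis=0)

rng = np.random.default_rng(0)
N, n, m = 1000, 2, 60                  # sizes as on the slides
X = rng.normal(size=(N, n))            # stand-in data
Omega = rng.normal(size=(m, n))        # frequencies ~ N(0, I)
z = sketch(X, Omega)                   # complex sketch of size m
```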



Page 34

Example: Compressive K-means

Goal: find k centroids. Point cloud X (N = 1000, n = 2) → M (sampled characteristic function) → sketch z ∈ R^m (m = 60).

[Embedded poster "Compressive Gaussian Mixture Estimation" (Bourrier, Gribonval, Pérez); see the reconstruction on Page 6.]

Page 35

Example: Compressive K-means

Goal: find k centroids. Point cloud X (N = 1000, n = 2; ground truth centroids shown) → M (sampled characteristic function) → sketch z ∈ R^m (m = 60).

Density model = mixture of K Diracs:

$$ p \approx \sum_{k=1}^K \alpha_k\, \delta_{\theta_k} $$

[Embedded poster "Compressive Gaussian Mixture Estimation" (Bourrier, Gribonval, Pérez); see the reconstruction on Page 6.]

The sketch is linear in p, so the model sketch is

$$ z = Mp \approx \sum_{k=1}^K \alpha_k\, M\delta_{\theta_k} $$
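Under the Fourier sketching operator above, the sketch of a Dirac has the closed form $(M\delta_\theta)_\ell = e^{j\,\omega_\ell^\top\theta}$, so the model sketch is cheap to evaluate. A minimal sketch of this forward model (my own notation and helper names):

```python
import numpy as np

def sketch_dirac_mixture(alphas, thetas, Omega):
    """z = sum_k alpha_k * M delta_{theta_k}, with (M delta_theta)_l = exp(1j * omega_l . theta)."""
    A = np.exp(1j * thetas @ Omega.T)      # (K, m): one sketched Dirac per row
    return alphas @ A                      # (m,):  weighted combination

rng = np.random.default_rng(0)
K, n, m = 3, 2, 60
thetas = rng.normal(size=(K, n))           # centroids
alphas = np.full(K, 1.0 / K)               # equal weights
Omega = rng.normal(size=(m, n))            # sampling frequencies
z_model = sketch_dirac_mixture(alphas, thetas, Omega)

# Sanity check: a point cloud sitting exactly on the centroids has the same sketch
X = np.repeat(thetas, 500, axis=0)                 # empirical distribution = this Dirac mixture
z_emp = np.exp(1j * X @ Omega.T).mean(axis=0)      # empirical characteristic function sketch
print(np.allclose(z_model, z_emp))                 # True
```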

Page 36

Example: Compressive K-means

Goal: find k centroids. Point cloud X (N = 1000, n = 2; estimated centroids shown against the ground truth) → M (sampled characteristic function) → sketch z ∈ R^m (m = 60).

Density model = mixture of K Diracs: $p \approx \sum_{k=1}^K \alpha_k\, \delta_{\theta_k}$.

[Embedded poster "Compressive Gaussian Mixture Estimation" (Bourrier, Gribonval, Pérez); see the reconstruction on Page 6.]

Recovery algorithm = "decoder": fit the model sketch to the empirical sketch,

$$ (\hat\alpha_k, \hat\theta_k) \approx \arg\min_{\alpha_k, \theta_k} \Bigl\| z - \sum_{k=1}^K \alpha_k\, M\delta_{\theta_k} \Bigr\|_2 $$

CLOMP = Compressive Learning OMP, similar to OMP with Replacement, Subspace Pursuit & CoSaMP.
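CLOMP itself is a greedy algorithm and is not reproduced here; as a rough illustration of the decoding objective only, the following naive decoder hands the nonconvex least-squares fit directly to a generic optimizer with random restarts (my own simplification with illustrative names, not the authors' algorithm):

```python
import numpy as np
from scipy.optimize import minimize

def decode_kmeans(z, Omega, K, n, n_restarts=5, seed=0):
    """Fit centroids/weights by minimizing || z - sum_k alpha_k exp(1j Omega theta_k) ||^2."""
    rng = np.random.default_rng(seed)

    def objective(params):
        thetas = params[: K * n].reshape(K, n)
        alphas = params[K * n :]
        z_model = alphas @ np.exp(1j * thetas @ Omega.T)
        return np.sum(np.abs(z - z_model) ** 2)

    best = None
    for _ in range(n_restarts):            # random restarts: the objective is nonconvex
        x0 = np.concatenate([rng.normal(size=K * n), np.full(K, 1.0 / K)])
        res = minimize(objective, x0, method="Nelder-Mead",
                       options={"maxiter": 20_000, "fatol": 1e-10})
        if best is None or res.fun < best.fun:
            best = res
    return best.x[: K * n].reshape(K, n), best.x[K * n :]   # (centroids, weights)
```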

Page 37

Compressive K-Means: Empirical Results

[Figure: example 2-D data; relative memory, relative time of estimation, and relative SSE as functions of m/(Kn), for N = 10^4, 10^5, 10^6, 10^7 training samples and K clusters.]

Training set of samples $x_i \in \mathbb{R}^n$ → sketch vector $z \in \mathbb{R}^m$ → matrix of centroids $C = \mathrm{CLOMP}(z) \in \mathbb{R}^{n \times K}$.

K-means objective:

$$ \mathrm{SSE}(\mathcal{X}, C) = \sum_{i=1}^N \min_k \|x_i - c_k\|^2. $$
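The K-means objective above is straightforward to evaluate; a minimal sketch (my own helper, with centroids stored as rows rather than the slide's n × K column convention):

```python
import numpy as np

def sse(X, C):
    """SSE(X, C) = sum_i min_k ||x_i - c_k||^2, for X of shape (N, n) and centroids C of shape (K, n)."""
    d2 = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=-1)   # (N, K) squared distances
    return d2.min(axis=1).sum()                                  # best centroid per point, then sum
```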



Page 40

Compressive K-Means: Empirical Results

N training samples, K = 10 clusters; spectral features on MNIST and infMNIST. Training set $x_i \in \mathbb{R}^n$ → sketch vector $z \in \mathbb{R}^m$ → matrix of centroids $C = \mathrm{CLOMP}(z) \in \mathbb{R}^{n \times K}$, evaluated with the K-means objective SSE/N.

[Figure: SSE/N for Lloyd-Max (KM) vs. Sketch + CLOMP (CKM), each with 1 or 5 replicates (random initialization), on MNIST (N1 = 70 000) and infMNIST (N2 = 300 000, N3 = 1 000 000).]

Page 41

Example: Compressive GMM

Goal: fit k Gaussians. Point cloud X (N = 60 000 000, n = 12) → M (sampled characteristic function) → sketch z ∈ R^m (m = 5 000).

Density model = GMM with diagonal covariance: $p \approx \sum_{k=1}^K \alpha_k\, p_{\theta_k}$, hence $z = Mp \approx \sum_{k=1}^K \alpha_k\, M p_{\theta_k}$.

Recovery algorithm = "decoder": estimated GMM parameters

$$ (\Theta, \alpha) \approx \arg\min_{\alpha_k, \theta_k} \Bigl\| z - \sum_{k=1}^K \alpha_k\, M p_{\theta_k} \Bigr\|_2 $$

Compressive Hierarchical Splitting (CHS) = extension of CLOMP to general GMM.
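For a Gaussian with diagonal covariance, $Mp_\theta$ again has a closed form through the characteristic function, $\mathbb{E}\, e^{j\omega^\top X} = e^{j\omega^\top\mu - \frac{1}{2}\sum_d \sigma_d^2 \omega_d^2}$ (a standard identity, not spelled out on the slide). A minimal sketch of the resulting forward model, with my own notation:

```python
import numpy as np

def sketch_diag_gmm(alphas, mus, sigma2s, Omega):
    """z = sum_k alpha_k * M p_{theta_k} for a diagonal-covariance GMM.

    alphas:  (K,)   mixture weights
    mus:     (K, n) component means
    sigma2s: (K, n) per-dimension variances
    Omega:   (m, n) sampling frequencies
    """
    phase = 1j * mus @ Omega.T                 # (K, m): j * omega^T mu
    decay = -0.5 * sigma2s @ (Omega.T ** 2)    # (K, m): -(1/2) sum_d sigma_d^2 omega_d^2
    return alphas @ np.exp(phase + decay)      # (m,)
```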

Pages 42-46

Proof of Concept: Speaker Verification Results (DET curves)

MFCC coefficients $x_i \in \mathbb{R}^{12}$; N = 300 000 000 (~50 GB, ~1000 hours of speech); after silence detection, N = 60 000 000. Maximum size manageable by EM: N = 300 000.

[Figure: DET curves for EM and for CHS.]

Page 47: Signal Processing with Adaptive Sparse Structured ... › ~icore › Docs › Remi-Gribonval-PLEASE...Signal Processing with Adaptive Sparse Structured Representations Submission deadline:

R. GRIBONVAL London Workshop on Sparse Signal Processing, September 2016

m= 5 000720 000-fold compression exploit whole collection improved performance

Proof of Concept: Speaker Verification Results (DET-curves)

25

~ 50 Gbytes ~ 1000 hours of speech

m= 10003 600 000-fold compression

m= 5007 200 000-fold compression

CHS
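For reference, the quoted compression factors are consistent with counting scalar entries (an assumption about how the factors were computed, but the numbers match): the raw collection holds N × n scalars while the sketch holds m.

\frac{N\,n}{m} = \frac{300\,000\,000 \times 12}{5\,000} = 720\,000, \qquad
\frac{300\,000\,000 \times 12}{1\,000} = 3\,600\,000, \qquad
\frac{300\,000\,000 \times 12}{500} = 7\,200\,000.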

Page 48:

Computational Efficiency


Pages 49-53:

Computational Aspects

Sketching = empirical characteristic function, sampled at random frequencies w_\ell:

z_\ell = \frac{1}{N} \sum_{i=1}^{N} e^{\, j\, w_\ell^\top x_i}

In matrix form: stack the frequencies into W, apply the pointwise nonlinearity h(\cdot) = e^{j(\cdot)} to WX, and average over the collection:

X \;\to\; WX \;\to\; h(WX) \;\to\; \text{average} \;\to\; z

~ One-layer random neural net; decoding = next layers. DNN ~ hierarchical sketching? See also [Bruna & al. 2013, Giryes & al. 2015].
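A minimal NumPy rendering of this pipeline (the sizes and the Gaussian frequency draw are placeholder assumptions; the actual choice of frequency distribution is a separate design question).

```python
import numpy as np

def sketch(X, W):
    """Empirical characteristic function of the collection X, sampled at the rows of W.

    z_l = (1/N) * sum_i exp(j * w_l . x_i)
    X: (N, n) data matrix, W: (m, n) frequency matrix  ->  z: (m,) complex sketch.
    """
    # Linear map W, pointwise nonlinearity h(.) = exp(j .), then pooling by averaging.
    return np.exp(1j * (X @ W.T)).mean(axis=0)

# Illustrative call on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 12))
W = rng.normal(size=(1_000, 12))
z = sketch(X, W)   # 1,000 complex moments summarizing 10,000 samples
```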

Page 54:

Computational Aspects

Privacy-preserving "sketch and forget": once the sketch z has been computed, the raw collection X need not be kept.

Pages 55-56:

Computational Aspects

Streaming algorithms: one pass over the data, with online updates of the sketch as samples stream in.
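A sketch of what such an online update could look like (an illustrative class, not the talk's implementation): the running average is refreshed batch by batch, so no raw samples need to be stored.

```python
import numpy as np

class StreamingSketch:
    """One-pass, online update of the empirical characteristic function."""

    def __init__(self, W):
        self.W = W                                 # (m, n) fixed random frequencies
        self.z = np.zeros(len(W), dtype=complex)   # running sketch
        self.count = 0                             # number of samples seen so far

    def update(self, x_batch):
        """Fold a new batch of shape (B, n) into the running average."""
        contrib = np.exp(1j * (x_batch @ self.W.T)).sum(axis=0)
        new_count = self.count + len(x_batch)
        self.z = (self.count * self.z + contrib) / new_count
        self.count = new_count
        return self.z
```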

Page 57:

Computational Aspects

Distributed computing: decentralized (HADOOP) or parallel (GPU) evaluation of the sketch.
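Because the sketch is an average of per-sample features, per-node sketches computed on disjoint shards can be merged exactly by a count-weighted average; a minimal illustration (the helper name is an assumption).

```python
import numpy as np

def merge_sketches(sketches, counts):
    """Merge per-node sketches computed on disjoint shards of the collection.

    The global sketch is the count-weighted average of the local ones
    (map-reduce friendly). sketches: list of (m,) complex arrays,
    counts: list of shard sizes.
    """
    counts = np.asarray(counts, dtype=float)
    Z = np.stack(sketches)                          # (num_nodes, m)
    return (counts[:, None] * Z).sum(axis=0) / counts.sum()
```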

Page 58:


Summary: Compressive K-means / GMM

✓ Dimension reduction
✓ Resource efficiency

[Figure: time (s) and memory (bytes) as a function of the collection size N (10^2 to 10^6), for K = 5 and K = 20, comparing sketching (without distributed computing), CLOMP, CLOMPR, BS-CGMM, CHS and EM; memory is reported as sketch + frequencies for the compressive methods vs. the full data for EM.]

Figure 7: Time (top) and memory (bottom) usage of all algorithms on synthetic data with dimension n = 10, number of components K = 5 (left) or K = 20 (right), and number of frequencies m = 5(2n + 1)K, with respect to the number of items in the database N.

… distributions for further work. In this configuration, the speaker verification results will indeed be far from state-of-the-art, but as mentioned before our goal is mainly to test our compressive approach on a different type of problem than that of GMM estimation on synthetic data, for which we have already observed excellent results.

In the GMM-UBM model, each speaker S is represented by one GMM (\Theta_S, \alpha_S). The key point is the introduction of a model (\Theta_{UBM}, \alpha_{UBM}) that represents a "generic" speaker, referred to as the Universal Background Model (UBM). Given speech data \mathcal{X} and a candidate speaker S, the statistic used for hypothesis testing is a likelihood ratio between the speaker and the generic model:

T(\mathcal{X}) = \frac{p_{\Theta_S, \alpha_S}(\mathcal{X})}{p_{\Theta_{UBM}, \alpha_{UBM}}(\mathcal{X})}. \qquad (23)

If T(\mathcal{X}) exceeds a threshold \tau, the data \mathcal{X} are considered as being uttered by the speaker S.
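As an illustration of how this statistic could be evaluated for diagonal-covariance GMMs, a minimal NumPy/SciPy sketch working in the log domain for numerical stability; the helper names and parameter packing are assumptions, not the paper's code.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of X (N, n) under a diagonal-covariance GMM.

    weights: (K,), means: (K, n), variances: (K, n).
    """
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)                 # (K,)
    quad = -0.5 * (((X[:, None, :] - means) ** 2) / variances).sum(axis=-1)     # (N, K)
    return logsumexp(np.log(weights) + log_norm + quad, axis=1).sum()

def verification_score(X, speaker, ubm):
    """log T(X) from (23): log-likelihood ratio between the speaker GMM and the UBM.

    `speaker` and `ubm` are (weights, means, variances) tuples; the score is
    compared against log(tau) rather than comparing T(X) against tau.
    """
    return gmm_loglik(X, *speaker) - gmm_loglik(X, *ubm)
```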

The GMMs corresponding to each speaker must somehow be "comparable" to each other and to the UBM. Therefore, the UBM is learned prior to the individual speaker models, using a large database of speech data uttered by many speakers. Then, given training data \mathcal{X}_S specific to one speaker, one M-step of the EM algorithm initialized with the UBM is used to adapt the UBM and derive the model (\Theta_S, \alpha_S). We refer the reader to [51] for more details on this procedure.

In our framework, the EM or compressive estimation algorithms are used to learn the UBM.

5.2 Setup

The experiments were performed on the classical NIST05 speaker verification database. Both training and testing fragments are 5-minute conversations between two speakers. The database contains approximately 650 speakers and 30,000 trials.


✓ In the pipe: information preservation (generalized RIP, “intrinsic dimension”)

✓ Neural-net-like
• Challenge: provably good recovery algorithms?

Page 59:

Conclusion


Page 60:


Projections & Learning

[Diagram: a linear "projection" M in both settings. Signal processing / compressive sensing: signal space (x) → M → observation space (y). Machine learning / compressive learning: probability space (p) → M → sketch space (z).]

Compressive sensing = random projections of data items: reduces the dimension of each data item.

Compressive learning with sketches = random projections of collections: nonlinear in the feature vectors, but linear in their probability distribution; reduces the size of the collection.
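To make this distinction concrete, an illustrative snippet (not from the talk; sizes and distributions are arbitrary): the compressive-sensing measurement is linear in a single signal, whereas the sketch is nonlinear per sample yet linear in the empirical distribution, so sketches of sub-collections simply average.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, N = 128, 32, 10_000

# Compressive sensing: one random linear projection per data item (reduces item dimension).
Phi = rng.normal(size=(m, n))
x = rng.normal(size=n)
y = Phi @ x                                # y is linear in the signal x

# Compressive learning: one sketch per collection (reduces collection size).
W = rng.normal(size=(m, n))
X = rng.normal(size=(N, n))
z = np.exp(1j * (X @ W.T)).mean(axis=0)    # nonlinear in each x_i ...
# ... but linear in the empirical distribution: averaging the sketches of two
# equal-size halves of the collection recovers the sketch of the whole collection.
z_halves = 0.5 * (np.exp(1j * (X[:N // 2] @ W.T)).mean(axis=0)
                  + np.exp(1j * (X[N // 2:] @ W.T)).mean(axis=0))
assert np.allclose(z, z_halves)
```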

Page 61:


Summary

Compressive GMM
• Bourrier, Gribonval & Pérez, Compressive Gaussian Mixture Estimation, ICASSP 2013.
• Keriven et al., Sketching for Large-Scale Learning of Mixture Models, ICASSP 2016 & arXiv:1606.02838.

Compressive k-means
• Keriven et al., Compressive K-Means, submitted to ICASSP 2017.

Compressive spectral clustering (with graph signal processing)
• Tremblay et al., Accelerated Spectral Clustering Using Graph Filtering of Random Signals, ICASSP 2016.
• Tremblay et al., Compressive Spectral Clustering, ICML 2016 & arXiv:1602.02018.
• Example: on the Amazon graph (10^6 edges), a 5x speedup (3 hours instead of 15 hours for k = 500 classes); complexity O(k^2 \log^2 k + N(\log N + k)) vs. O(k^2 N).

Challenge: compress before learning?

[Embedded slide (N. Tremblay, "Graph signal processing for clustering", Rennes, 13 January 2016): what's the point of using a graph? N points in d = 2 dimensions; result with k-means (k = 2) vs. result after creating a graph from the N points' inter-distances and running spectral clustering (k = 2).]

Page 62:


Recent / ongoing work / challenges

When is information preserved with sketches / projections?
• Bourrier et al., Fundamental performance limits for ideal decoders in high-dimensional linear inverse problems, IEEE Transactions on Information Theory, 2014.
• Notion of instance-optimal decoders = uniform guarantees; fundamental role of a general Restricted Isometry Property.

How to reconstruct: which algorithm / decoder?

\Delta(y) := \arg\min_{x \in \mathcal{H}} f(x) \quad \text{s.t.} \quad \| Mx - y \| \le \epsilon

• Traonmilin & Gribonval, Stable recovery of low-dimensional cones in Hilbert spaces: One RIP to rule them all, ACHA 2016.
• RIP guarantees for general (convex & nonconvex) regularizers.

How to (maximally) reduce dimension?
• [Dirksen 2014]: given a random sub-gaussian linear form. Puy et al., Recipes for stable linear embeddings from Hilbert spaces to ℝ^m, arXiv:1509.06947.
• Role of the covering dimension / Gaussian width of the normalized secant set.

What is the achievable compression for learning tasks? Guarantees?
• Compressive statistical learning: work in progress with G. Blanchard, N. Keriven, Y. Traonmilin.
• Number of random moments = "intrinsic dimension" of PCA, k-means, Dictionary Learning …
• Statistical learning: risk minimization + generalization to future samples with the same distribution.
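For readers unfamiliar with these notions, a schematic LaTeX statement in the spirit of the cited works (constants and exact hypotheses omitted; this is a paraphrase, not the theorems themselves): a restricted isometry on the differences of elements of the model set \Sigma is the kind of condition under which decoders can be instance optimal.

% Generalized RIP on the model set \Sigma (schematic):
(1-\delta)\,\| x - x' \|^2 \;\le\; \| M(x - x') \|^2 \;\le\; (1+\delta)\,\| x - x' \|^2
\qquad \forall\, x, x' \in \Sigma.

% Instance optimality of a decoder \Delta (schematic), uniformly over signals x and noise e:
\| \Delta(Mx + e) - x \| \;\le\; C\, d(x, \Sigma) + C'\,\| e \|.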


Page 64:

THANKS! Lisbon, Portugal, June 5-8, 2017

SPARS 2017 Signal Processing with Adaptive Sparse Structured Representations

Submission deadline: December 12, 2016 Notification of acceptance: March 27, 2017 Summer School: May 31-June 2, 2017 (tbc) Workshop: June 5-8, 2017

spars2017.lx.it.pt