Sketched Learning from Random Features Moments
Nicolas Keriven
École Normale Supérieure (Paris)
CFM-ENS chair in Data Science
(thesis with Rémi Gribonval at Inria Rennes)
ISMP, July 6th 2018
Compressive learning
Nicolas Keriven
[Diagram: data → compression into a linear sketch → learning]
• Sketched learning: first compress the data into a linear sketch [Cormode 2011], then learn
• Hash tables, count sketches, histograms…
• Advantages: one-pass, streaming, distributed compression, data privacy…
• In this talk: unsupervised learning
1/15
How-to: build a sketch
Nicolas Keriven
What is a sketch?
Any linear sketch = empirical moments: z = (1/n) Σᵢ Φ(xᵢ)
What is contained in a sketch?
• Φ(x) = x: the mean
• Φ(x) = x^⊗p: a moment of order p
• Φ(x) = indicator functions: a histogram
• Proposed: kernel random features [Rahimi 2007] (random projection + non-linearity)
Questions:
• What information is preserved by the sketching?
• How to retrieve this information?
• What is a sufficient number of features?
Intuition: sketching as a linear embedding
- Assumption: the data are drawn i.i.d. from a distribution π
- Linear operator: A(π) = E_{x~π} Φ(x)
- "Noisy" linear measurement: ẑ = A(π) + e, with the noise e small for large n
Dimensionality-reducing, random, linear embedding: Compressive Sensing?
2/15
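A minimal illustration of such a sketch in Python, assuming plain random Fourier features Φ(x) = exp(iωᵀx) with Gaussian frequencies (the exact features in the talk differ, e.g. the "adjusted" variants below):

```python
import numpy as np

def random_fourier_sketch(X, m, sigma=1.0, seed=None):
    """Sketch a dataset into m empirical moments of random Fourier features.

    The sketch is z = (1/n) * sum_i Phi(x_i) with Phi(x) = exp(i * Omega @ x):
    one m-dimensional complex vector, whatever the number of samples n.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Frequencies drawn i.i.d. Gaussian: random features for a Gaussian
    # kernel of bandwidth sigma [Rahimi 2007].
    Omega = rng.normal(scale=1.0 / sigma, size=(m, d))
    return np.exp(1j * X @ Omega.T).mean(axis=0), Omega

# Linearity in the empirical distribution: sketches of equal-size chunks
# average into the sketch of the full dataset (one-pass, streaming,
# distributed compression).
X = np.random.default_rng(0).normal(size=(10_000, 2))
z1, _ = random_fourier_sketch(X[:5_000], m=100, seed=42)
z2, _ = random_fourier_sketch(X[5_000:], m=100, seed=42)
z, _ = random_fourier_sketch(X, m=100, seed=42)
assert np.allclose((z1 + z2) / 2, z)
```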
Example of applications [Keriven 2016, 2017]
Nicolas Keriven
Retrieving mixtures of Diracs from a sketch = k-means
Application: spectral clustering for MNIST classification
- Twice as fast as k-means
- 4 orders of magnitude more memory-efficient
Retrieving GMMs from a sketch
Application: speaker verification [Reynolds 2000]
Error rates:
• EM on 300 000 samples: 29.53
• 20 kB sketch computed on a 50 GB database: 28.96
3/15
In this talk
Nicolas Keriven
Q: Theoretical guarantees?
• Inspired by Compressive Sensing:
• 1: with the Restricted Isometry Property (RIP)
• 2: with dual certificates
4/15
Outline
Nicolas Keriven
Information-preservation guarantees: a RIP analysis
Joint work with R. Gribonval, G. Blanchard, Y. Traonmilin
Total variation regularization:a dual certificate analysis
Conclusion, outlooks
Recall: Linear inverse problem
Nicolas Keriven
True distribution: π
Sketch: ẑ = A(π) + e
• Estimation problem = linear inverse problem on measures
• Extremely ill-posed!
• Feasibility? (information preservation)
Best possible algorithm?
5/15
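In display form (a hedged transcription: the equations on the slide were images, and the notation follows [Gribonval et al. 2017]):

```latex
\hat z \;=\; \frac{1}{n}\sum_{i=1}^{n}\Phi(x_i)
       \;=\; \mathcal{A}(\pi) + e,
\qquad
\mathcal{A}(\pi) := \mathbb{E}_{x\sim\pi}\,\Phi(x),
```

so estimating π from ẑ is a linear inverse problem on the space of probability measures, with noise e = ẑ − A(π) that concentrates around 0 as n grows.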
Information preservation guarantees
Nicolas Keriven
Model set of "simple" distributions (e.g. GMMs)
Goal: prove the existence of a decoder robust to noise and stable to modeling error (an "instance-optimal" decoder)
Tools: the Lower Restricted Isometry Property (LRIP) and a Generalized Method of Moments decoder
New goal: find/construct models and operators that satisfy the LRIP (w.h.p.)
6/15
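The two ingredients in display form (a sketch of the definitions as stated in [Gribonval et al. 2017]; ‖·‖_L denotes the task-dependent learning metric and 𝔖 the model set):

```latex
% Lower RIP on the model set:
\forall \pi,\pi' \in \mathfrak{S}:\quad
\|\pi - \pi'\|_{\mathcal{L}} \;\le\; C\,\|\mathcal{A}(\pi)-\mathcal{A}(\pi')\|_2 .

% It implies instance optimality of the moment-matching decoder:
\Delta(\hat z) \in \operatorname*{arg\,min}_{\tau \in \mathfrak{S}}
   \|\mathcal{A}(\tau)-\hat z\|_2,
\qquad
\|\pi - \Delta(\mathcal{A}(\pi)+e)\|_{\mathcal{L}}
   \;\lesssim\; d(\pi,\mathfrak{S}) + \|e\|_2 .
```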
Proving the LRIP
Nicolas Keriven
Goal: LRIP
Construction of the sketch:
- Kernel mean embedding [Gretton 2006, Borgwardt 2006]
- Random features [Rahimi 2007]
Pointwise LRIP
Extension to the full LRIP: covering numbers (compactness) of the normalized secant set,
a subset of a unit ball (in infinite dimension) that only depends on the model set
7/15
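For reference, the normalized secant set of the model (standard definition, assumed here):

```latex
\mathcal{S}_{\mathcal{L}}(\mathfrak{S}) \;=\;
\left\{ \frac{\pi-\pi'}{\|\pi-\pi'\|_{\mathcal{L}}}
        \;:\; \pi,\pi' \in \mathfrak{S},\ \pi \neq \pi' \right\} .
```

The LRIP holds uniformly exactly when ‖Aμ‖₂ is bounded below on this set; a pointwise concentration bound plus finite covering numbers extends the bound to the whole set w.h.p.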
Main result [Keriven 2016]
Nicolas Keriven
Main hypothesis
The normalized secant set has finite covering numbers.
- Classic Compressive Sensing (finite dimension): known
- Here (infinite dimension): technical
Result
For m larger than a pointwise-concentration factor times the dimensionality of the model (measured by covering numbers), w.h.p. the sketch satisfies the LRIP and the decoder error is bounded by a modeling error term plus an empirical noise term.
8/15
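The shape of the statement, with constants and exponents deferred to [Keriven 2016] (an assumed paraphrase, since the displayed bound on the slide was an image):

```latex
m \;\gtrsim\; \log \mathcal{N}\big(\mathcal{S}_{\mathcal{L}}(\mathfrak{S}),\,\cdot\,\big)
\;\Longrightarrow\; \text{w.h.p.}\quad
\|\pi - \Delta(\hat z)\|_{\mathcal{L}}
\;\lesssim\;
\underbrace{d(\pi,\mathfrak{S})}_{\text{modeling error}}
\;+\;
\underbrace{\|\hat z - \mathcal{A}(\pi)\|_2}_{\text{empirical noise}} .
```

The modeling error term involves no assumption on the data; the empirical noise term decays as n grows.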
Application
Nicolas Keriven

k-means with mixtures of Diracs
- Hypotheses: separated centroids; bounded domain for the centroids (no assumption on the data)
- Sketch: adjusted random Fourier features (for technical reasons)
- Result: w.r.t. the usual k-means cost (SSE)
- Sketch size

GMM with known covariance
- Hypotheses: sufficiently separated means; bounded domain for the means
- Sketch: Fourier features
- Result: with respect to the log-likelihood
- Sketch size

Compared to the Generalized Method of Moments: different guarantees
9/15
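As a toy illustration of decoding a mixture of Diracs (not the CL-OMP algorithm of [Keriven 2016]: just naive non-convex moment matching on the random Fourier sketch above, with uniform weights assumed):

```python
import numpy as np
from scipy.optimize import minimize

def decode_diracs(z, Omega, k, d, restarts=10, seed=0):
    """Fit k equal-weight Diracs (centroids) to a sketch z by moment matching.

    Minimizes || (1/k) * sum_j exp(i * Omega @ c_j) - z ||^2 over the
    centroids c_1..c_k.  Non-convex, hence the random restarts.
    """
    rng = np.random.default_rng(seed)

    def objective(theta):
        C = theta.reshape(k, d)                        # candidate centroids
        model = np.exp(1j * C @ Omega.T).mean(axis=0)  # sketch of the mixture
        return np.sum(np.abs(model - z) ** 2)

    best = None
    for _ in range(restarts):
        res = minimize(objective, rng.normal(size=k * d), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return best.x.reshape(k, d)
```

With z and Omega produced by random_fourier_sketch on clustered data, the returned centroids play the role of the k-means solution; the decoder only ever touches the sketch, never the data.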
Outline
Nicolas Keriven
Information-preservation guarantees: a RIP analysis
Total variation regularization: a dual certificate analysis
Joint work with C. Poon, G. Peyré
Conclusion, outlooks
Total Variation regularization
Nicolas Keriven
Previously: RIP analysis
Minimization: moment matching
• Must know the number of components
• Non-convex!
Convex relaxation ("super-resolution"): the Beurling-LASSO (BLASSO) [De Castro 2015]
• Optimize over Radon measures
• Penalize the total variation norm (the "L1 norm" for measures)
Questions:
• Is the recovered measure sparse?
• Does it have the right number of components?
• Does it recover the true components?
11/15
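In display form (following the BLASSO of [De Castro 2015]; the regularization parameter λ and the notation are assumptions of this transcription):

```latex
\min_{\mu \in \mathcal{M}(\Theta)}\;
\frac{1}{2}\,\big\|\hat z - \mathcal{A}\mu\big\|_2^2
\;+\; \lambda\,|\mu|(\Theta),
\qquad
\mathcal{A}\mu := \int_{\Theta} \Phi(\theta)\,\mathrm{d}\mu(\theta),
```

where M(Θ) is the space of Radon measures and |μ|(Θ) the total variation norm, which promotes sparse (discrete) solutions just as the L1 norm does in finite dimension.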
Dual certificates
Nicolas Keriven
Dual certificate analysis: find a function η = A*p (p = Lagrange multiplier) such that:
• η saturates at the support points of the true measure
• |η| < 1 otherwise
Step 1: study the full kernel [Candes 2013], assuming sufficiently separated components
Step 2: bound the deviations
[Plots of the certificate for m = 10, 20, 50 features]
12/15
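The standard conditions, written out (an assumed transcription; the aᵢ are the signed amplitudes of the true k-sparse measure Σᵢ aᵢ δ_{θᵢ}):

```latex
\eta = \mathcal{A}^* p \ \ (p:\ \text{Lagrange multiplier}),
\qquad
\eta(\theta_i) = \operatorname{sign}(a_i)\ \ \forall i,
\qquad
|\eta(\theta)| < 1 \ \ \text{for all other } \theta .
```

The existence of such an η certifies that the true measure solves the BLASSO; with m random features, η deviates from the full-kernel certificate, and step 2 controls these deviations.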
Results for separated GMM
Nicolas Keriven
1: Ideal scaling in sparsity (in progress…)
• Not necessarily the right number of components, but:
• The mass of the solution concentrates around the true components
• (Weak) robustness to modeling error
• Proof: infinite-dimensional golfing scheme (new)
2: Minimal norm certificate [Duval, Peyré 2015] (in progress…)
• Assumption: data are actually drawn from a GMM…
• When n is high enough: sparse, with the right number of components
• Proof: adaptation of [Tang, Recht 2013]
13/15
Outline
Nicolas Keriven
Information-preservation guarantees: a RIP analysis
Total variation regularization:a dual certificate analysis
Conclusion, outlooks
Sketch learning
Nicolas Keriven
• Sketching: streaming, distributed learning
• Original view on data compression and generalized moments
• Combines random features and kernel mean embeddings with infinite-dimensional Compressive Sensing
14/15
Summary, outlooks
Nicolas Keriven
• RIP analysis
• Information preservation guarantees
• Fine control on noise, modeling error (instance-optimal decoder) and recovery metrics
• Necessary and sufficient conditions
• Dual certificate analysis
• Convex minimization
• In some cases, automatically recovers the right number of components
• Outlooks
• Algorithms for TV minimization
• Other features (not necessarily random…)
• Other "sketched" learning tasks
• Multilayer sketches?
15/15
Thank you!
Nicolas Keriven
• Gribonval, Blanchard, Keriven, Traonmilin. Compressive Statistical Learning with Random Feature Moments. 2017. arXiv:1706.07180
• Keriven. Sketching for Large-Scale Learning of Mixture Models. PhD thesis. tel-01620815
• Poon, Keriven, Peyré. A Dual Certificates Analysis of Compressive Off-the-Grid Recovery. 2018. arXiv:1802.08464
• Code, applications: nkeriven.github.io