Stochastic optimization in Hilbert spaces
Aymeric Dieuleveut
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48
Outline
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 7 / 48
Learning vs Statistics
Tradeoffs of large scale learning: algorithm complexity, ERM?
Stochastic optimization: why is SGD so useful in learning?
A simple case: least mean squares in finite dimension
Higher dimension? RKHS, non-parametric learning
Lower complexity? Column sampling, feature selection
Tradeoffs of Large scale learning - Learning
Statistics vs Machine Learning
Statistics         Machine Learning
Estimation         Learning
Classifier         Hypothesis
Data point         Example / Instance
Regression         Supervised Learning
Classification     Supervised Learning
Covariate          Feature
Response           Label (1)

Essentially, the AI and the statistics communities do the same kind of work. The main differences:
Statisticians are more interested in the model and in drawing conclusions about it.
ML researchers are more interested in prediction, with a concern for algorithms that handle high-dimensional data.
1. taken from www.quora.com/What-is-the-difference-between-statistics-and-machine-learning
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 8 / 48
Tradeoffs of Large scale learning - Learning
Framework
We consider the classical risk minimization problem. Given:
a space of input-output pairs (x, y) ∈ X × Y, with probability distribution P(x, y),
a loss function ℓ : Y × Y → R, and a class of functions F,
the risk of a function f : X → Y is R(f) := E_P[ℓ(f(x), y)].
Our aim is min_{f ∈ F} R(f).
R is unknown.
Given a sequence of i.i.d. data points (x_i, y_i)_{i=1..n} ∼ P^⊗n, we can define the empirical risk
R_n(f) = (1/n) ∑_{i=1}^n ℓ(f(x_i), y_i).
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 9 / 48
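As a concrete illustration of this definition (not part of the original slides), here is a minimal NumPy sketch of the empirical risk R_n for the squared loss; the data layout (an n × d array X, a length-n vector y) and every name in it are assumptions made for the example.

```python
import numpy as np

# Empirical risk R_n(f) = (1/n) sum_i (f(x_i) - y_i)^2 for the squared loss.
def empirical_risk(predict, X, y):
    predictions = np.array([predict(x) for x in X])
    return np.mean((predictions - y) ** 2)

# Toy usage: a linear predictor on synthetic data (all names are illustrative).
rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
theta_star = rng.standard_normal(d)
y = X @ theta_star + 0.1 * rng.standard_normal(n)
print(empirical_risk(lambda x: x @ theta_star, X, y))   # close to the noise level 0.01
```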
Tradeoffs of Large scale learning - Learning
The bias-variance tradeoffs
a.k.a. the estimation-approximation error tradeoff.
There are many ways of seeing it:
constrained case,
penalized case,
other regularizations.
Thus the compromise: ε_app + ε_est.

           ε_app    ε_est
F ↗          ↘        ↗
This is the classical setting.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 10 / 48
Tradeoffs of Large scale learning - Learning
Adding an optimization term
When we face large datasets, it may be impractical, and useless, to optimize the estimator to high accuracy. We then question the choice of the algorithm from a fixed time-budget point of view. (2)
This raises the following questions:
up to which precision is it necessary to optimize?
which is the limiting factor? (time, data points)
A problem is said to be large scale when time is the limiting factor. For large-scale problems:
which algorithm?
does more data mean less work? (if time is limiting)
2. Ref: [Shalev-Schwartz and Srebro, 2008, Shalev-Schwartz and K., 2011, Bottou and Bousquet, 2008]
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 11 / 48
Tradeoffs of Large scale learning - Learning
Tradeoffs - Large scale learning
           F ↗     n ↗     ε ↗
ε_app       ↘
ε_est       ↗       ↘
ε_opt                        ↗
T           ↗       ↗        ↘

(Here ε denotes the optimization tolerance and T the computation time.)
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 12 / 48
Tradeoffs of Large scale learning - Learning
Different algorithms
To perform ERM (minimize the empirical risk), several algorithms may be considered:
Gradient descent
Second-order gradient descent
Stochastic gradient descent
Fast stochastic algorithms (requiring high memory storage)
Let us first compare the first-order methods: SGD and GD.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 13 / 48
Tradeoffs of Large scale learning - Learning
Stochastic gradient algorithms
Aim: min_f R(f)
We only have access to unbiased estimates of R(f) and ∇R(f).
1. Start at some f_0.
2. Iterate: get an unbiased gradient estimate g_k, s.t. E[g_k] = ∇R(f_k), and update f_{k+1} ← f_k − γ_k g_k.
3. Output f_m, or its average f̄_m := (1/m) ∑_{k=1}^m f_k (averaged SGD).
Gradient descent: the same, but with the "true" gradient.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 14 / 48
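Below is a minimal sketch of the generic scheme above, with the averaging of step 3; it is an illustration rather than the author's code, and the least-squares gradient oracle and the 1/√k step size are assumptions chosen for the example.

```python
import numpy as np

def averaged_sgd(theta0, gradient_estimate, n_steps, step_size):
    """Generic SGD: theta_{k+1} = theta_k - gamma_k g_k, plus a running average."""
    theta = theta0.copy()
    theta_bar = np.zeros_like(theta0)
    for k in range(1, n_steps + 1):
        g = gradient_estimate(theta)              # unbiased: E[g] = grad R(theta)
        theta = theta - step_size(k) * g
        theta_bar += (theta - theta_bar) / k      # theta_bar = (1/k) * sum of iterates
    return theta, theta_bar

# Toy gradient oracle for a least-squares stream y = <theta_*, x> + noise.
rng = np.random.default_rng(0)
d = 5
theta_star = rng.standard_normal(d)

def ls_gradient(theta):
    x = rng.standard_normal(d)
    y = x @ theta_star + 0.1 * rng.standard_normal()
    return (x @ theta - y) * x                    # gradient of (1/2)(<x, theta> - y)^2

theta_last, theta_avg = averaged_sgd(np.zeros(d), ls_gradient, 5000,
                                     step_size=lambda k: 0.1 / np.sqrt(k))
print(np.linalg.norm(theta_avg - theta_star))     # the averaged iterate approaches theta_*
```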
Tradeoffs of Large scale learning - Learning
ERM
SGD on the empirical risk: min_{f ∈ F} R_n(f)
Pick a point (x_i, y_i) from the empirical sample.
g_k = ∇_f ℓ(f_k, (x_i, y_i)), then f_{k+1} ← f_k − γ_k g_k.
Output f_m: R_n(f_m) − R_n(f_n^*) ≤ O(1/√m).
sup_{f ∈ F} |R − R_n|(f) ≤ O(1/√n).
Cost of one iteration: O(d).

GD on the empirical risk: min_{f ∈ F} R_n(f)
g_k = ∇_f (1/n) ∑_{i=1}^n ℓ(f_k, (x_i, y_i)) = ∇_f R_n(f_k), then f_{k+1} ← f_k − γ_k g_k.
Output f_m: R_n(f_m) − R_n(f_n^*) ≤ O((1 − κ)^m).
sup_{f ∈ F} |R − R_n|(f) ≤ O(1/√n).
Cost of one iteration: O(nd).

Overall, R(f_m) − R(f^*) ≤ O(1/√m) + O(1/√n), with step size γ_k proportional to 1/√k for SGD.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 15 / 48
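To make the O(d) versus O(nd) per-iteration costs concrete, here is a small sketch (my own illustration, not taken from the slides) of one stochastic step and one full-gradient step on the empirical least-squares risk; the data and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def sgd_step(theta, gamma):
    """One SGD step: a single sampled point, O(d) work."""
    i = rng.integers(n)
    return theta - gamma * (X[i] @ theta - y[i]) * X[i]

def gd_step(theta, gamma):
    """One GD step on R_n: touches all n points, O(nd) work (up to a constant factor)."""
    return theta - gamma * X.T @ (X @ theta - y) / n

theta = np.zeros(d)
theta = sgd_step(theta, 0.01)   # cheap iteration
theta = gd_step(theta, 0.01)    # roughly n times more expensive
```

One GD step therefore costs about as much as n SGD steps, which is the comparison summarized above.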
Tradeoffs of Large scale learning - Learning
Conclusion
In the large-scale setting, it is beneficial to use SGD!
Does more data help?
With the global estimation error fixed, it seems that T ≃ 1 / (R(f_m) − R(f^*) − 1/√n), which is decreasing with n.
Upper bounding R_n − R uniformly is dangerous: we also have to compare with one-pass SGD, which minimizes the true risk R directly.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 16 / 48
Tradeoffs of Large scale learning - Learning
Expectation minimization
Stochastic gradient descent may also be used to minimize R(f) directly:

SGD on the empirical risk: min_{f ∈ F} R_n(f)
Pick a point (x_i, y_i) from the empirical sample.
g_k = ∇_f ℓ(f_k, (x_i, y_i)), then f_{k+1} ← f_k − γ_k g_k.
Output f_m: R_n(f_m) − R_n(f_n^*) ≤ O(1/√m).
sup_{f ∈ F} |R − R_n|(f) ≤ O(1/√n).
Cost of one iteration: O(d).

One-pass SGD: min_{f ∈ F} R(f)
Pick an independent fresh point (x, y).
g_k = ∇_f ℓ(f_k, (x, y)), then f_{k+1} ← f_k − γ_k g_k.
Output f_k, k ≤ n: R(f_k) − R(f^*) ≤ O(1/√k).
Cost of one iteration: O(d).

SGD with one pass (early stopping acting as a regularization) achieves a nearly optimal bias-variance tradeoff with low complexity.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 17 / 48
Tradeoffs of Large scale learning - Learning
Rate of convergence
We are interested in prediction.
Strongly convex objective: rate 1/(μn).
Non-strongly convex: rate 1/√n.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 18 / 48
A case study -Finite dimension linear least mean squares
LMS [Bach and Moulines, 2013]
We now consider the simple case where X = R^d and the loss ℓ is quadratic. We are interested in linear predictors:
min_{θ ∈ R^d} E_P[(θ^T x − y)^2].
We assume that the data points are generated according to
y_i = θ_*^T x_i + ε_i.
We consider the stochastic gradient algorithm:
θ_0 = 0,
θ_{n+1} = θ_n − γ_n (⟨x_n, θ_n⟩ x_n − y_n x_n).
This recursion may be rewritten as
θ_{n+1} − θ_* = (I − γ_n x_n x_n^T)(θ_n − θ_*) − γ_n ξ_n.   (1)
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 19 / 48
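A minimal sketch of the recursion above combined with averaging, in the spirit of [Bach and Moulines, 2013]; the constant step size γ = 1/(4R²) and the synthetic Gaussian stream are assumptions made for the illustration.

```python
import numpy as np

def averaged_lms(stream, d, gamma):
    """theta_{n+1} = theta_n - gamma (<x_n, theta_n> - y_n) x_n, plus Polyak-Ruppert averaging."""
    theta = np.zeros(d)
    theta_bar = np.zeros(d)
    for n, (x, y) in enumerate(stream, start=1):
        theta = theta - gamma * (x @ theta - y) * x
        theta_bar += (theta - theta_bar) / n      # running average of the iterates
    return theta_bar

# Synthetic stream y_n = <theta_*, x_n> + noise, one single pass over the data.
rng = np.random.default_rng(1)
d, n_samples = 10, 20_000
theta_star = rng.standard_normal(d)
stream = ((x, x @ theta_star + 0.1 * rng.standard_normal())
          for x in rng.standard_normal((n_samples, d)))
R2 = d                                            # E[||x||^2] = d for standard Gaussian inputs
theta_hat = averaged_lms(stream, d, gamma=1.0 / (4 * R2))
print(np.linalg.norm(theta_hat - theta_star))     # the averaged iterate is close to theta_*
```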
A case study -Finite dimension linear least mean squares
Rate of convergence, back again!
We are interested in prediction.
Strongly convex objective: rate 1/(μn).
Non-strongly convex: rate 1/√n.
We define H = E[x x^T]; we have μ = min Sp(H).
For least mean squares, the statistical rate of the ordinary least-squares estimator is σ^2 d / n: there is still a gap to be bridged!
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 20 / 48
A case study -Finite dimension linear least mean squares
A few assumptions
We define H = E[x x^T] and C = E[ξ ξ^T].
Bounded noise variance: we assume C ≤ σ^2 H (as symmetric operators).
Covariance operator:
no assumption on the minimal eigenvalue,
E[‖x‖^2] ≤ R^2.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 21 / 48
A case study -Finite dimension linear least mean squares
Result
Theorem
E[R(θ_n) − R(θ_*)] ≤ (4/n) (σ^2 d + R^2 ‖θ_0 − θ_*‖^2)
optimal statistical rate,
rate 1/n without strong convexity.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 22 / 48
Non parametric learning
Outline
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 26 / 48
What if d ≫ n? Carry out the analysis in a Hilbert space, using reproducing kernel Hilbert spaces.
Non-parametric regression in RKHS: an interesting problem in itself.
Behaviour in finite dimension: adaptivity, tradeoffs.
Optimal statistical rates in RKHS: choice of γ.
Non parametric learning
Reproducing kernel Hilbert space [Dieuleveut and Bach, 2014]
We denote by H_K a Hilbert space of functions, H_K ⊂ R^X, characterized by its kernel function K : X × X → R:
for any x, the function K_x : X → R defined by K_x(x′) = K(x, x′) belongs to H_K;
reproducing property: for all g ∈ H_K and x ∈ X, g(x) = ⟨g, K_x⟩_K.
Two usages:
α) a hypothesis space for regression;
β) mapping data points into a linear space.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 27 / 48
Non parametric learning
α) A hypothesis space for regression.
Classical regression setting:
(X_i, Y_i) ∼ ρ i.i.d., with (X_i, Y_i) ∈ X × R.
Goal: minimize the prediction error
min_{g ∈ L^2} E[(g(X) − Y)^2].
We look for an estimator g_n of g_ρ(X) = E[Y | X], with g_ρ ∈ L^2_{ρ_X}, where
L^2_{ρ_X} = { f : X → R : ∫ f^2(t) dρ_X(t) < ∞ }.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 28 / 48
Non parametric learning
β) Mapping data points in a linear space.
Linear regression on data mapped into some RKHS:
argmin_{θ ∈ H} ‖Y − Xθ‖^2.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 29 / 48
Non parametric learning
Two approaches to the regression problem:
the general regression problem, g_ρ ∈ L^2;
the linear regression problem in the RKHS.
Link: in general H_K ⊂ L^2_{ρ_X}, and in some cases the closure of the RKHS for ‖·‖_{L^2_{ρ_X}} is the whole of L^2_{ρ_X}. We then look for an estimator of the regression function in the RKHS, i.e., an estimator for the first problem obtained with the natural algorithms for the second one.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 30 / 48
Non parametric learning
SGD algorithm in the RKHS
g_0 ∈ H_K (we often consider g_0 = 0),
g_n = ∑_{i=1}^n a_i K_{x_i},   (2)
with (a_n)_n such that a_n := −γ_n (g_{n−1}(x_n) − y_n) = −γ_n ( ∑_{i=1}^{n−1} a_i K(x_n, x_i) − y_n ).
Indeed,
g_n = g_{n−1} − γ_n (g_{n−1}(x_n) − y_n) K_{x_n} = ∑_{i=1}^n a_i K_{x_i}, with a_n defined as above,
and (g_{n−1}(x_n) − y_n) K_{x_n} is an unbiased estimate of the gradient of E[(⟨K_x, g_{n−1}⟩ − y)^2].
The SGD algorithm in the RKHS thus takes a very simple form.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 32 / 48
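A minimal sketch of this coefficient representation; the Gaussian kernel, the constant step size and the toy data are assumptions made for the illustration, not prescribed by the slides.

```python
import numpy as np

def gaussian_kernel(x, z, bandwidth=0.5):
    return np.exp(-np.sum((x - z) ** 2) / (2 * bandwidth ** 2))

def kernel_sgd(xs, ys, kernel, gamma):
    """One-pass SGD in the RKHS, g_n = sum_i a_i K_{x_i}.

    Evaluating g_{n-1}(x_n) costs n - 1 kernel evaluations, which is the
    source of the O(n^2) overall complexity discussed later in the slides.
    """
    coeffs = []                                    # a_1, ..., a_{n-1} so far
    for x, y in zip(xs, ys):
        pred = sum(a * kernel(xi, x) for a, xi in zip(coeffs, xs))   # g_{n-1}(x_n)
        coeffs.append(-gamma * (pred - y))         # a_n = -gamma (g_{n-1}(x_n) - y_n)
    return coeffs

def predict(x, xs, coeffs, kernel):
    return sum(a * kernel(xi, x) for a, xi in zip(coeffs, xs))

# Toy 1-d regression example.
rng = np.random.default_rng(0)
xs = rng.uniform(-1, 1, size=(200, 1))
ys = np.sin(3 * xs[:, 0]) + 0.1 * rng.standard_normal(200)
coeffs = kernel_sgd(xs, ys, gaussian_kernel, gamma=0.5)
print(predict(np.array([0.3]), xs, coeffs, gaussian_kernel))   # an estimate of sin(0.9)
```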
Non parametric learning
Assumptions
Two important points characterize the difficulty of the problem:
The regularity of the objective function
The spectrum of the covariance operator
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 33 / 48
Non parametric learning
Covariance operator
We define Σ = E[K_x ⊗ K_x], where K_x ⊗ K_x : g ↦ ⟨K_x, g⟩ K_x = g(x) K_x.
The covariance operator is a self-adjoint operator which contains information on the distribution of K_x.
Assumptions:
tr(Σ^α) < ∞, for some α ∈ [0, 1];
on g_ρ: g_ρ ∈ Σ^r(L^2_ρ(X)) with r ≥ 0.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 34 / 48
Non parametric learning
Interpretation
Decay of the eigenvalues of Σ.
An ellipsoid class of functions (we do not assume g_ρ ∈ H_K).
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 35 / 48
Non parametric learning
Result
Theorem
Under a few hidden assumptions:
E[R(g_n) − R(g_ρ)] ≤ O( σ^2 tr(Σ^α) γ^α / n^{1−α} ) + O( ‖Σ^{−r} g_ρ‖^2 / (nγ)^{2(r∧1)} )
Bias-variance decomposition.
The O(·) hides a known constant (4 or 8).
Finite-horizon result here, but it extends to the online setting.
Saturation: the bias exponent is capped at r ∧ 1.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 36 / 48
Non parametric learning
Corollary
Assume A1-8. If (1 − α)/2 < r < (2 − α)/2, then with γ = n^{−(2r+α−1)/(2r+α)} we get the optimal rate
E[R(g_n) − R(g_ρ)] = O( n^{−2r/(2r+α)} ).   (3)
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 37 / 48
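To make the exponents of the corollary concrete, here is a tiny sketch; the particular values of r, α and n are arbitrary illustrative choices of mine.

```python
def optimal_step_and_rate(r, alpha, n):
    """gamma = n^{-(2r+alpha-1)/(2r+alpha)} and the rate n^{-2r/(2r+alpha)} of the corollary."""
    assert (1 - alpha) / 2 < r < (2 - alpha) / 2, "outside the regime covered by the corollary"
    gamma = n ** (-(2 * r + alpha - 1) / (2 * r + alpha))
    rate = n ** (-2 * r / (2 * r + alpha))
    return gamma, rate

# Example: r = 0.5, alpha = 0.5, n = 10_000 gives gamma = n^{-1/3} and rate n^{-2/3}.
print(optimal_step_and_rate(0.5, 0.5, 10_000))
```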
Non parametric learning
Conclusion 1
We obtain the statistically optimal rate of convergence for learning in an RKHS with one-pass SGD.
We get insights on how to choose the kernel and the step size.
We compare favorably to [Ying and Pontil, 2008, Caponnetto and De Vito, 2007, Tarres and Yao, 2011].
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 38 / 48
Non parametric learning
Conclusion 2
The theorem can be rewritten as
E[R(θ_n) − R(θ_*)] ≤ O( σ^2 tr(Σ^α) γ^α / n^{1−α} ) + O( θ_*^T Σ^{1−2r} θ_* / (nγ)^{2(r∧1)} )   (4)
where the ellipsoid condition appears more clearly. Thus:
SGD is adaptive to the regularity of the problem;
it bridges the gap between the different regimes and explains the behaviour when d ≫ n.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 39 / 48
The complexity challenge, approximation of the kernel
1. Tradeoffs of Large scale learning - Learning
2. A case study - Finite dimension linear least mean squares
3. Non parametric learning
4. The complexity challenge, approximation of the kernel
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 40 / 48
The complexity challenge, approximation of the kernel
Reducing complexity: sampling methods
However, the complexity of such a method remains quadratic in the number of examples: iteration number n costs n kernel evaluations.

                      Rate      Complexity
Finite dimension      d/n       O(dn)
Infinite dimension    d_n/n     O(n^2)
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 41 / 48
The complexity challenge, approximation of the kernel
Two related methods:
Approximate the kernel matrix.
Approximate the kernel itself.
Results in the first situation come from [Bach, 2012]; they have been extended by [Alaoui and Mahoney, 2014, Rudi et al., 2015]. There also exist results in the second situation [Rahimi and Recht, 2008, Dai et al., 2014].
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 42 / 48
The complexity challenge, approximation of the kernel
Sharp analysis
We only consider a fixed-design setting. We then have to approximate the kernel matrix: instead of computing the whole matrix, we randomly pick a number d_n of its columns.
We then still get the same estimation errors, leading to:

                      Rate      Complexity
Finite dimension      d/n       O(dn)
Infinite dimension    d_n/n     O(n d_n^2)
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 43 / 48
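A sketch of the column-sampling approximation described above (a Nyström-type approximation in the spirit of [Bach, 2012]); the Gaussian kernel, uniform sampling without replacement and the pseudo-inverse are implementation choices of mine, not prescribed by the slides.

```python
import numpy as np

def nystrom_approximation(K, d_n, rng):
    """Approximate the n x n kernel matrix from d_n sampled columns:
    K ~ C @ pinv(W) @ C.T with C = K[:, J] and W = K[J][:, J],
    so only about n * d_n kernel entries are ever needed."""
    n = K.shape[0]
    J = rng.choice(n, size=d_n, replace=False)
    C = K[:, J]                          # n x d_n sampled columns
    W = K[np.ix_(J, J)]                  # d_n x d_n block
    return C @ np.linalg.pinv(W) @ C.T

# Usage on a small Gaussian kernel matrix.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)
K_approx = nystrom_approximation(K, d_n=50, rng=rng)
print(np.linalg.norm(K - K_approx) / np.linalg.norm(K))   # small relative error
```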
The complexity challenge, approximation of the kernel
Random feature selection
Many kernels may be represented, thanks to Bochner's theorem, as
K(x, y) = ∫_W φ(w, x) φ(w, y) dμ(w)
(think of translation-invariant kernels and the Fourier transform).
We thus consider the low-rank approximation
K̂(x, y) = (1/d) ∑_{i=1}^d φ(x, w_i) φ(y, w_i), with w_1, ..., w_d drawn i.i.d. from μ.
We use this approximation of the kernel in SGD.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 44 / 48
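A minimal sketch of this low-rank approximation for the Gaussian kernel via random Fourier features, in the spirit of [Rahimi and Recht, 2008]; the cosine feature map with a random phase is one standard choice of φ and is used here as an assumption.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """phi(x, w_i) = sqrt(2) cos(<w_i, x> + b_i), with w_i ~ N(0, I / bandwidth^2) and
    b_i ~ Unif[0, 2*pi), so that (1/d) sum_i phi(x, w_i) phi(y, w_i) approximates
    the Gaussian kernel exp(-||x - y||^2 / (2 bandwidth^2))."""
    W = rng.standard_normal((n_features, X.shape[1])) / bandwidth
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
Z = random_fourier_features(X, n_features=2000, bandwidth=1.0, rng=rng)
K_approx = Z @ Z.T / Z.shape[1]                     # (1/d) sum_i phi(x, w_i) phi(y, w_i)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-sq_dists / 2.0)
print(np.abs(K_approx - K_exact).max())             # shrinks as n_features grows
```

Running SGD on these d-dimensional features brings the per-iteration cost back to O(d), as in the finite-dimensional case.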
The complexity challenge, approximation of the kernel
Directions
What I am currently working on:
Random feature selection.
Tuning the sampling to improve the accuracy of the approximation.
Acceleration + stochasticity (with Nicolas Flammarion).
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 45 / 48
The complexity challenge, approximation of the kernel
Some references I
Alaoui, A. E. and Mahoney, M. W. (2014). Fast randomized kernel methods with statistical guarantees. CoRR, abs/1411.0306.

Bach, F. (2012). Sharp analysis of low-rank kernel matrix approximations. ArXiv e-prints.

Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). ArXiv e-prints.

Bottou, L. and Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems 20.

Caponnetto, A. and De Vito, E. (2007). Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368.

Dai, B., Xie, B., He, N., Liang, Y., Raj, A., Balcan, M., and Song, L. (2014). Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems 27, pages 3041–3049.

Dieuleveut, A. and Bach, F. (2014). Non-parametric stochastic approximation with large step sizes. ArXiv e-prints.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 46 / 48
The complexity challenge, approximation of the kernel
Some references II
Rahimi, A. and Recht, B. (2008). Weighted sums of random kitchen sinks: replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems 21, pages 1313–1320.

Rudi, A., Camoriano, R., and Rosasco, L. (2015). Less is more: Nyström computational regularization. CoRR, abs/1507.04717.

Shalev-Schwartz, S. and K., S. (2011). Theoretical basis for more data less work.

Shalev-Schwartz, S. and Srebro, N. (2008). SVM optimization: inverse dependence on training set size. In Proceedings of the International Conference on Machine Learning (ICML).

Tarres, P. and Yao, Y. (2011). Online learning as stochastic approximation of regularization paths. ArXiv e-prints 1103.5538.

Ying, Y. and Pontil, M. (2008). Online gradient descent learning algorithms. Foundations of Computational Mathematics, 5.
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 47 / 48
The complexity challenge, approximation of the kernel
Thank you for your attention!
Aymeric Dieuleveut Stochastic optimization Hilbert spaces 48 / 48