
Big learning: challenges and opportunities

Francis Bach

SIERRA Project-team, INRIA - Ecole Normale Supérieure

October 2013


Scientific context

Big data

• Omnipresent digital media

– Multimedia, sensors, indicators, social networks

– All levels: personal, professional, scientific, industrial

– Too large and/or complex for manual processing

• Computational challenges

– Dealing with large databases

• Statistical challenges

– What can be predicted from such databases and how?

– Looking for hidden information

• Opportunities (and threats)

Machine learning for big data

• Large-scale machine learning: large p, large n, large k

– p : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, etc.

Search engines - advertising

Advertising - recommendation

Object recognition

Learning for bioinformatics - Proteins

• Crucial components of cell life

• Predicting multiple functions and interactions

• Massive data: up to 1 million for humans!

• Complex data

– Amino-acid sequence

– Link with DNA

– Tri-dimensional molecule

Machine learning for big data


• Two main challenges:

1. Computational: ideal running-time complexity = O(pn + kn)

2. Statistical: meaningful results

Big learning: challenges and opportunities

Outline

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^p

• (Regularized) empirical risk minimization: find θ̂ solution of

min_{θ ∈ R^p} (1/n) ∑_{i=1}^n ℓ(y_i, θ⊤Φ(x_i)) + µΩ(θ)

(convex data-fitting term + regularizer)
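As a concrete illustration of this objective, the sketch below minimizes regularized ERM with the logistic loss ℓ(y, s) = log(1 + e^(−ys)), Φ(x) = x, and Ω(θ) = ½‖θ‖₂² by batch gradient descent; the function name and synthetic data are illustrative, not from the talk:

```python
import numpy as np

def erm_logistic(X, y, mu, steps=500, gamma=0.1):
    """Minimize (1/n) sum_i log(1 + exp(-y_i theta^T x_i)) + (mu/2) ||theta||^2
    by batch gradient descent (Phi(x) = x for simplicity)."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(steps):
        margins = y * (X @ theta)
        # gradient of the averaged logistic loss, plus the regularizer
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + mu * theta
        theta -= gamma * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200))
theta = erm_logistic(X, y, mu=0.1)
accuracy = np.mean(np.sign(X @ theta) == y)
```

With µ > 0 the objective is strongly convex, so this simple iteration converges to the unique minimizer.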

Supervised machine learning


• Applications to any data-oriented field

– Computer vision, bioinformatics

– Natural language processing, etc.

Supervised machine learning


• Main practical challenges

– Designing/learning good features Φ(x)

– Efficiently solving the optimization problem

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) ∑_{i=1}^n f_i′(θ_{t−1})

– Linear (e.g., exponential) convergence rate in O(e^{−αt})

– Iteration complexity is linear in n (with line search)


Stochastic vs. deterministic methods


• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

– Sampling with replacement: i(t) random element of {1, . . . , n}

– Convergence rate in O(1/t)

– Iteration complexity is independent of n (step size selection?)
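The contrast above can be sketched with least-squares f_i (an illustrative choice, not from the slides): each stochastic iteration touches a single f_i, so its cost is independent of n, at the price of a decaying step size γ_t:

```python
import numpy as np

def sgd(X, y, mu, epochs=30, seed=1):
    """Stochastic gradient descent on
    f_i(theta) = (x_i^T theta - y_i)^2 / 2 + (mu/2) ||theta||^2,
    sampling i(t) with replacement, step size gamma_t = 1/(mu (t + t0))."""
    n, p = X.shape
    theta = np.zeros(p)
    t0 = 10.0 / mu
    rng = np.random.default_rng(seed)
    for t in range(epochs * n):
        i = rng.integers(n)  # one observation: O(p) work, independent of n
        grad_i = (X[i] @ theta - y[i]) * X[i] + mu * theta
        theta -= grad_i / (mu * (t + t0))
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=500)
mu = 0.1
g = lambda th: 0.5 * np.mean((X @ th - y) ** 2) + 0.5 * mu * th @ th
theta = sgd(X, y, mu)
```

For strongly convex g, E[g(θ_t) − g(θ*)] decays in O(1/t) with this schedule, and the choice of t0 matters in practice, which is the step-size selection issue noted above.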


Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: log(excess cost) vs. time; the stochastic and deterministic curves motivate a hybrid method achieving both behaviors]


Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θ_t = θ_{t−1} − (γ_t/n) ∑_{i=1}^n y_i^t with y_i^t = f_i′(θ_{t−1}) if i = i(t), and y_i^t = y_i^{t−1} otherwise

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)


• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Simple implementation

– Extra memory requirement: same size as original data (or less)

– Simple/robust constant step-size
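A minimal sketch of the SAG iteration on least-squares losses (the quadratic f_i and the synthetic data are illustrative; the published method applies to any smooth f_i):

```python
import numpy as np

def sag(X, y, mu, epochs=100, seed=0):
    """Stochastic average gradient on
    f_i(theta) = (x_i^T theta - y_i)^2 / 2 + (mu/2) ||theta||^2:
    keep every f_i' in memory, refresh one per iteration,
    and step along the average with a constant step size."""
    n, p = X.shape
    L = np.max(np.sum(X ** 2, axis=1)) + mu  # smoothness constant of the f_i
    gamma = 1.0 / (16 * L)                   # constant step size from the analysis
    theta = np.zeros(p)
    stored = np.zeros((n, p))                # the stored gradients y_i^t
    total = np.zeros(p)                      # running sum of stored gradients
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)                  # sampling with replacement
        g_i = (X[i] @ theta - y[i]) * X[i] + mu * theta
        total += g_i - stored[i]             # O(p) update of the sum
        stored[i] = g_i
        theta -= gamma * total / n
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=50)
mu = 1.0
theta = sag(X, y, mu)
# closed-form minimizer of (1/n) sum f_i for comparison
theta_star = np.linalg.solve(X.T @ X / 50 + mu * np.eye(3), X.T @ y / 50)
```

Note the extra memory is one gradient per observation (for linear predictions it reduces to one scalar per observation), matching the "same size as original data (or less)" remark above.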

Stochastic average gradient

Convergence analysis

• Assume each f_i is L-smooth and g = (1/n) ∑_{i=1}^n f_i is µ-strongly convex

• Constant step size γ_t = 1/(16L). If µ > 2L/n, ∃C ∈ R such that

∀t > 0, E[g(θ_t) − g(θ*)] ≤ C exp(−t/(8n))

• Linear convergence rate with iteration cost independent of n

– After each pass through the data, constant error reduction

– Breaking two lower bounds

Stochastic average gradient

Simulation experiments

• protein dataset (n = 145751, p = 74)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step size 2/(L + nµ), and SAG with line search (SAG-LS)]

Stochastic average gradient

Simulation experiments

• covertype dataset (n = 581012, p = 54)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step size 2/(L + nµ), and SAG with line search (SAG-LS)]


Large-scale supervised learning

Convex optimization

• Simplicity

– Few lines of code

• Robustness

– Step-size

– Adaptivity to problem difficulty

• On-going work

– Single pass through the data (Bach and Moulines, 2013)

– Distributed algorithms

• Convexity as a solution to all problems?

– Need good features Φ(x) for linear predictions θ⊤Φ(x) !

Unsupervised learning through matrix factorization

• Given data matrix X = (x_1, …, x_n)⊤ ∈ R^{n×p}

– Principal component analysis: x_i ≈ Dα_i

– K-means: x_i ≈ d_k ⇒ X = DA
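To make the factorization view concrete, here is a toy K-means sketch in which X (n × p, one signal per row) is approximated as A D, with D holding the k centroids and A a one-hot assignment matrix; this layout and the farthest-first initialization are illustrative choices:

```python
import numpy as np

def kmeans_factorization(X, k, iters=20):
    """K-means as matrix factorization: X (n x p) ~ A D, with D the
    k centroids (k x p) and A a one-hot assignment matrix (n x k)."""
    # deterministic farthest-first initialization of the centroids
    D = [X[0]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(D)[None]) ** 2).sum(2).min(axis=1)
        D.append(X[np.argmax(d2)])
    D = np.array(D)
    for _ in range(iters):
        # assignment step: each row picks its nearest centroid
        labels = ((X[:, None, :] - D[None]) ** 2).sum(2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                D[j] = X[labels == j].mean(axis=0)  # centroid update
    A = np.eye(k)[labels]  # one-hot codes
    return A, D

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (40, 2))])
A, D = kmeans_factorization(X, k=2)
rel_err = np.linalg.norm(X - A @ D) / np.linalg.norm(X)
```

PCA fits the same template with A real-valued and D orthogonal directions; the two differ only in the constraints placed on the factors.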

Learning dictionaries for uncovering hidden structure

• Fact: many natural signals may be approximately represented as a

superposition of few atoms from a dictionary D = (d1, . . . ,dk)

– Decomposition x = ∑_{i=1}^k α_i d_i = Dα with α ∈ R^k sparse

– Natural signals (sounds, images) (Olshausen and Field, 1997)

• Decoding problem: given a dictionary D, find α through regularized convex optimization: min_{α ∈ R^k} ‖x − Dα‖₂² + λ‖α‖₁
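The decoding problem is a lasso; a minimal proximal-gradient (ISTA) sketch with illustrative synthetic data (dedicated ℓ1 solvers are faster in practice):

```python
import numpy as np

def ista_decode(D, x, lam, steps=300):
    """Solve min_alpha ||x - D alpha||_2^2 + lam ||alpha||_1 by ISTA:
    a gradient step on the quadratic followed by soft-thresholding,
    the proximal operator of the l1 norm."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(steps):
        z = alpha - step * 2 * D.T @ (D @ alpha - x)
        alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return alpha

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 10))
D /= np.linalg.norm(D, axis=0)   # unit-norm atoms
alpha_true = np.zeros(10)
alpha_true[[2, 7]] = 2.0         # a 2-sparse code
x = D @ alpha_true
alpha = ista_decode(D, x, lam=0.1)
```

When the atoms are sufficiently incoherent, the recovered α concentrates on the true support, which is why sparse decoding can uncover the hidden structure of the signal.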


Learning dictionaries for uncovering hidden structure

• Dictionary learning problem: given n signals x_1, …, x_n, estimate both the dictionary D and the codes α_1, …, α_n:

min_D ∑_{j=1}^n min_{α_j ∈ R^k} { ‖x_j − Dα_j‖₂² + λ‖α_j‖₁ }
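A common strategy, refined considerably in the online methods cited below, is to alternate between sparse coding of the α_j and an update of D. The toy sketch here uses ISTA for the codes and a least-squares dictionary update with renormalized columns, on synthetic data; all names and constants are illustrative:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dict_learn(X, k, lam, outer=20, inner=50, seed=0):
    """Alternating minimization for
    min_D sum_j min_alpha_j ||x_j - D alpha_j||^2 + lam ||alpha_j||_1
    (rows of X are the signals x_j; columns of D kept unit-norm)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))                         # codes, one column per signal
    for _ in range(outer):
        step = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)
        for _ in range(inner):                   # ISTA on all codes at once
            A = soft(A - step * 2 * D.T @ (D @ A - X.T), step * lam)
        D = X.T @ A.T @ np.linalg.pinv(A @ A.T)  # least-squares dictionary update
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A

rng = np.random.default_rng(1)
D_true = rng.normal(size=(10, 8))
D_true /= np.linalg.norm(D_true, axis=0)
A_true = rng.normal(size=(8, 100)) * (rng.random((8, 100)) < 0.3)  # sparse codes
X = (D_true @ A_true).T
D, A = dict_learn(X, k=8, lam=0.05)
rel_err = np.linalg.norm(X.T - D @ A) / np.linalg.norm(X)
```

The joint problem is non-convex, but each half-step is convex, which is what makes the alternating scheme simple and scalable.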

Challenges of dictionary learning

min_D ∑_{j=1}^n min_{α_j ∈ R^k} { ‖x_j − Dα_j‖₂² + λ‖α_j‖₁ }

• Algorithmic challenges

– Large number of signals ⇒ online learning (Mairal et al., 2009a)

• Theoretical challenges

– Identifiability/robustness (Jenatton et al., 2012)

• Domain-specific challenges

– Going beyond plain sparsity ⇒ structured sparsity

(Jenatton, Mairal, Obozinski, and Bach, 2011)

Dictionary learning for image denoising

x (measurements) = y (original image) + ε (noise)

Dictionary learning for image denoising

• Solving the denoising problem (Elad and Aharon, 2006)

– Extract all overlapping 8 × 8 patches x_i ∈ R^64

– Form the matrix X = (x_1, …, x_n)⊤ ∈ R^{n×64}

– Solve a matrix factorization problem:

min_{D,A} ‖X − DA‖_F² = min_{D,A} ∑_{i=1}^n ‖x_i − Dα_i‖₂²

where A is sparse, and D is the dictionary

– Each patch is decomposed into x_i = Dα_i

– Average all Dα_i to reconstruct a full-sized image

• The number of patches n is large (= number of pixels)

• Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a)
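The extraction and averaging steps of this pipeline can be sketched directly (a toy version with hypothetical helper names; the denoising of each patch via the learned dictionary would sit between the two calls):

```python
import numpy as np

def extract_patches(img, s=8):
    """All overlapping s x s patches of a 2-D image, flattened to rows."""
    H, W = img.shape
    return np.array([img[i:i + s, j:j + s].ravel()
                     for i in range(H - s + 1) for j in range(W - s + 1)])

def average_patches(patches, shape, s=8):
    """Reassemble an image by averaging overlapping patch estimates,
    the final step of the patch-based denoising pipeline."""
    H, W = shape
    out = np.zeros(shape)
    count = np.zeros(shape)
    idx = 0
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            out[i:i + s, j:j + s] += patches[idx].reshape(s, s)
            count[i:i + s, j:j + s] += 1
            idx += 1
    return out / count

img = np.random.default_rng(0).random((12, 12))
P = extract_patches(img, s=4)
recon = average_patches(P, img.shape, s=4)
```

Round-tripping unmodified patches recovers the image exactly, which confirms that the averaging step only blends the per-patch estimates.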

Denoising result

(Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)


Inpainting a 12-Mpixel photograph


Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

Structured sparse PCA (Jenatton et al., 2009b)

raw data sparse PCA

• Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability


Structured sparse PCA (Jenatton et al., 2009b)

raw data Structured sparse PCA

• Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification


Modelling of text corpora - Dictionary tree


Why structured sparsity?


• Stability and identifiability

• Prediction or estimation performance

– When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

• How?

– Design of sparsity-inducing norms

Structured sparsity

• Sparsity-inducing behavior from “corners” of constraint sets

Structured dictionary learning - Efficient optimization

min_{A ∈ R^{k×n}, D ∈ R^{p×k}} ∑_{i=1}^n ‖x_i − Dα_i‖₂² + λψ(α_i) s.t. ∀j, ‖d_j‖₂ ≤ 1

• Minimization with respect to αi : regularized least-squares

– Many algorithms dedicated to the ℓ1-norm ψ(α) = ‖α‖1

• Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)

– Requires solving many times min_{α ∈ R^k} ½‖y − α‖₂² + λψ(α)

• Efficient algorithms for structured sparse problems

– Bach, Jenatton, Mairal, and Obozinski (2011)

– Code available: http://www.di.ens.fr/willow/SPAMS/
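Each proximal step needs the proximal operator of ψ in closed form. For the ℓ1 norm this is soft-thresholding; for one simple structured penalty, the non-overlapping group ℓ1/ℓ2 norm, it is block soft-thresholding, which zeroes or keeps whole groups at once. A minimal sketch (not the SPAMS implementation):

```python
import numpy as np

def prox_l1(y, lam):
    """prox of lam * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_group_l2(y, lam, groups):
    """prox of lam * sum_g ||y_g||_2 for non-overlapping groups:
    each block is shrunk toward zero as a whole, so either a full
    group survives or it is zeroed out -- the structured-sparsity effect."""
    out = np.zeros_like(y)
    for g in groups:
        norm = np.linalg.norm(y[g])
        if norm > lam:
            out[g] = (1 - lam / norm) * y[g]
    return out

out = prox_group_l2(np.array([0.1, 0.2, 3.0, 4.0]), 1.0, [[0, 1], [2, 3]])
# the first group is zeroed as a whole; the second is shrunk but kept
```

Overlapping or hierarchical groups require more care, which is exactly what the references above address.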

Extensions - Digital Zooming

Digital Zooming (Couzinie-Devy et al., 2011)

Extensions - Task-driven dictionaries

inverse half-toning (Mairal et al., 2011)


Big learning: challenges and opportunities

Conclusion

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2008.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.

R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009b.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.