
Big learning: challenges and opportunities

Francis Bach

SIERRA Project-team, INRIA - Ecole Normale Supérieure

October 2013


Scientific context

Big data

• Omnipresent digital media

– Multimedia, sensors, indicators, social networks

– All levels: personal, professional, scientific, industrial

– Too large and/or complex for manual processing

• Computational challenges

– Dealing with large databases

• Statistical challenges

– What can be predicted from such databases and how?

– Looking for hidden information

• Opportunities (and threats)

Machine learning for big data

• Large-scale machine learning: large p, large n, large k

– p : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, etc.

Search engines - advertising

Advertising - recommendation

Object recognition

Learning for bioinformatics - Proteins

• Crucial components of cell life

• Predicting multiple functions and interactions

• Massive data: up to 1 million for humans!

• Complex data

– Amino-acid sequence

– Link with DNA

– Tri-dimensional molecule

Machine learning for big data


• Two main challenges:

1. Computational: ideal running-time complexity = O(pn + kn)

2. Statistical: meaningful results

Big learning: challenges and opportunities

Outline

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

Supervised machine learning

• Data: n observations (x_i, y_i) ∈ X × Y, i = 1, …, n

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^p

• (Regularized) empirical risk minimization: find θ̂ solution of

min_{θ ∈ R^p} (1/n) ∑_{i=1}^n ℓ(y_i, θ⊤Φ(x_i)) + µΩ(θ)

(convex data-fitting term + regularizer)
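As a concrete illustration of this objective, the sketch below minimizes regularized ERM with the logistic loss ℓ(y, s) = log(1 + e^(−ys)), Φ(x) = x, and Ω(θ) = ½‖θ‖₂² by batch gradient descent; the function name and synthetic data are illustrative, not from the talk:

```python
import numpy as np

def erm_logistic(X, y, mu, steps=500, gamma=0.1):
    """Minimize (1/n) sum_i log(1 + exp(-y_i theta^T x_i)) + (mu/2) ||theta||^2
    by batch gradient descent (Phi(x) = x for simplicity)."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(steps):
        margins = y * (X @ theta)
        # gradient of the averaged logistic loss, plus the regularizer
        grad = -(X * (y / (1 + np.exp(margins)))[:, None]).mean(axis=0) + mu * theta
        theta -= gamma * grad
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200))
theta = erm_logistic(X, y, mu=0.1)
accuracy = np.mean(np.sign(X @ theta) == y)
```

With µ > 0 the objective is strongly convex, so this simple iteration converges to the unique minimizer.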

Supervised machine learning


• Applications to any data-oriented field

– Computer vision, bioinformatics

– Natural language processing, etc.

Supervised machine learning


• Main practical challenges

– Designing/learning good features Φ(x)

– Efficiently solving the optimization problem

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ) with f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µΩ(θ)

• Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) ∑_{i=1}^n f_i′(θ_{t−1})

– Linear (e.g., exponential) convergence rate in O(e^{−αt})

– Iteration complexity is linear in n (with line search)


Stochastic vs. deterministic methods


• Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

– Sampling with replacement: i(t) random element of {1, . . . , n}

– Convergence rate in O(1/t)

– Iteration complexity is independent of n (step size selection?)
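The contrast above can be sketched with least-squares f_i (an illustrative choice, not from the slides): each stochastic iteration touches a single f_i, so its cost is independent of n, at the price of a decaying step size γ_t:

```python
import numpy as np

def sgd(X, y, mu, epochs=30, seed=1):
    """Stochastic gradient descent on
    f_i(theta) = (x_i^T theta - y_i)^2 / 2 + (mu/2) ||theta||^2,
    sampling i(t) with replacement, step size gamma_t = 1/(mu (t + t0))."""
    n, p = X.shape
    theta = np.zeros(p)
    t0 = 10.0 / mu
    rng = np.random.default_rng(seed)
    for t in range(epochs * n):
        i = rng.integers(n)  # one observation: O(p) work, independent of n
        grad_i = (X[i] @ theta - y[i]) * X[i] + mu * theta
        theta -= grad_i / (mu * (t + t0))
    return theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X @ np.array([1.0, -1.0, 2.0, 0.5]) + 0.1 * rng.normal(size=500)
mu = 0.1
g = lambda th: 0.5 * np.mean((X @ th - y) ** 2) + 0.5 * mu * th @ th
theta = sgd(X, y, mu)
```

For strongly convex g, E[g(θ_t) − g(θ*)] decays in O(1/t) with this schedule, and the choice of t0 matters in practice, which is the step-size selection issue noted above.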


Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: log(excess cost) vs. time; the stochastic and deterministic curves motivate a hybrid method achieving both behaviors]


Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θ_t = θ_{t−1} − (γ_t/n) ∑_{i=1}^n y_i^t with y_i^t = f_i′(θ_{t−1}) if i = i(t), and y_i^t = y_i^{t−1} otherwise

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)


• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Simple implementation

– Extra memory requirement: same size as original data (or less)

– Simple/robust constant step-size
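A minimal sketch of the SAG iteration on least-squares losses (the quadratic f_i and the synthetic data are illustrative; the published method applies to any smooth f_i):

```python
import numpy as np

def sag(X, y, mu, epochs=100, seed=0):
    """Stochastic average gradient on
    f_i(theta) = (x_i^T theta - y_i)^2 / 2 + (mu/2) ||theta||^2:
    keep every f_i' in memory, refresh one per iteration,
    and step along the average with a constant step size."""
    n, p = X.shape
    L = np.max(np.sum(X ** 2, axis=1)) + mu  # smoothness constant of the f_i
    gamma = 1.0 / (16 * L)                   # constant step size from the analysis
    theta = np.zeros(p)
    stored = np.zeros((n, p))                # the stored gradients y_i^t
    total = np.zeros(p)                      # running sum of stored gradients
    rng = np.random.default_rng(seed)
    for _ in range(epochs * n):
        i = rng.integers(n)                  # sampling with replacement
        g_i = (X[i] @ theta - y[i]) * X[i] + mu * theta
        total += g_i - stored[i]             # O(p) update of the sum
        stored[i] = g_i
        theta -= gamma * total / n
    return theta

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=50)
mu = 1.0
theta = sag(X, y, mu)
# closed-form minimizer of (1/n) sum f_i for comparison
theta_star = np.linalg.solve(X.T @ X / 50 + mu * np.eye(3), X.T @ y / 50)
```

Note the extra memory is one gradient per observation (for linear predictions it reduces to one scalar per observation), matching the "same size as original data (or less)" remark above.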

Stochastic average gradient

Convergence analysis

• Assume each f_i is L-smooth and g = (1/n) ∑_{i=1}^n f_i is µ-strongly convex

• Constant step size γ_t = 1/(16L). If µ > 2L/n, ∃C ∈ R such that

∀t > 0, E[g(θ_t) − g(θ*)] ≤ C exp(−t/(8n))

• Linear convergence rate with iteration cost independent of n

– After each pass through the data, constant error reduction

– Breaking two lower bounds

Stochastic average gradient

Simulation experiments

• protein dataset (n = 145751, p = 74)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step size 2/(L + nµ), and SAG with line search (SAG-LS)]

Stochastic average gradient

Simulation experiments

• covertype dataset (n = 581012, p = 54)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing steepest descent, AFG, L-BFGS, Pegasos, RDA, SAG with step size 2/(L + nµ), and SAG with line search (SAG-LS)]


Large-scale supervised learning

Convex optimization

• Simplicity

– Few lines of code

• Robustness

– Step-size

– Adaptivity to problem difficulty

• On-going work

– Single pass through the data (Bach and Moulines, 2013)

– Distributed algorithms

• Convexity as a solution to all problems?

– Need good features Φ(x) for linear predictions θ⊤Φ(x) !

Unsupervised learning through matrix factorization

• Given data matrix X = (x_1, …, x_n)⊤ ∈ R^{n×p}

– Principal component analysis: x_i ≈ Dα_i

– K-means: x_i ≈ d_k ⇒ X = DA
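To make the factorization view concrete, here is a toy K-means sketch in which X (n × p, one signal per row) is approximated as A D, with D holding the k centroids and A a one-hot assignment matrix; this layout and the farthest-first initialization are illustrative choices:

```python
import numpy as np

def kmeans_factorization(X, k, iters=20):
    """K-means as matrix factorization: X (n x p) ~ A D, with D the
    k centroids (k x p) and A a one-hot assignment matrix (n x k)."""
    # deterministic farthest-first initialization of the centroids
    D = [X[0]]
    for _ in range(k - 1):
        d2 = ((X[:, None, :] - np.array(D)[None]) ** 2).sum(2).min(axis=1)
        D.append(X[np.argmax(d2)])
    D = np.array(D)
    for _ in range(iters):
        # assignment step: each row picks its nearest centroid
        labels = ((X[:, None, :] - D[None]) ** 2).sum(2).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                D[j] = X[labels == j].mean(axis=0)  # centroid update
    A = np.eye(k)[labels]  # one-hot codes
    return A, D

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (40, 2))])
A, D = kmeans_factorization(X, k=2)
rel_err = np.linalg.norm(X - A @ D) / np.linalg.norm(X)
```

PCA fits the same template with A real-valued and D orthogonal directions; the two differ only in the constraints placed on the factors.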

Learning dictionaries for uncovering hidden structure

• Fact: many natural signals may be approximately represented as a

superposition of few atoms from a dictionary D = (d1, . . . ,dk)

– Decomposition x = ∑_{i=1}^k α_i d_i = Dα with α ∈ R^k sparse

– Natural signals (sounds, images) (Olshausen and Field, 1997)

• Decoding problem: given a dictionary D, find α through regularized convex optimization: min_{α ∈ R^k} ‖x − Dα‖₂² + λ‖α‖₁
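The decoding problem is a lasso; a minimal proximal-gradient (ISTA) sketch with illustrative synthetic data (dedicated ℓ1 solvers are faster in practice):

```python
import numpy as np

def ista_decode(D, x, lam, steps=300):
    """Solve min_alpha ||x - D alpha||_2^2 + lam ||alpha||_1 by ISTA:
    a gradient step on the quadratic followed by soft-thresholding,
    the proximal operator of the l1 norm."""
    alpha = np.zeros(D.shape[1])
    step = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant
    for _ in range(steps):
        z = alpha - step * 2 * D.T @ (D @ alpha - x)
        alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return alpha

rng = np.random.default_rng(0)
D = rng.normal(size=(20, 10))
D /= np.linalg.norm(D, axis=0)   # unit-norm atoms
alpha_true = np.zeros(10)
alpha_true[[2, 7]] = 2.0         # a 2-sparse code
x = D @ alpha_true
alpha = ista_decode(D, x, lam=0.1)
```

When the atoms are sufficiently incoherent, the recovered α concentrates on the true support, which is why sparse decoding can uncover the hidden structure of the signal.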


Learning dictionaries for uncovering hidden structure

• Dictionary learning problem: given n signals x_1, …, x_n, estimate both the dictionary D and the codes α_1, …, α_n:

min_D ∑_{j=1}^n min_{α_j ∈ R^k} { ‖x_j − Dα_j‖₂² + λ‖α_j‖₁ }
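A common strategy, refined considerably in the online methods cited below, is to alternate between sparse coding of the α_j and an update of D. The toy sketch here uses ISTA for the codes and a least-squares dictionary update with renormalized columns, on synthetic data; all names and constants are illustrative:

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dict_learn(X, k, lam, outer=20, inner=50, seed=0):
    """Alternating minimization for
    min_D sum_j min_alpha_j ||x_j - D alpha_j||^2 + lam ||alpha_j||_1
    (rows of X are the signals x_j; columns of D kept unit-norm)."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    D = rng.normal(size=(p, k))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((k, n))                         # codes, one column per signal
    for _ in range(outer):
        step = 1.0 / (2 * np.linalg.norm(D, 2) ** 2)
        for _ in range(inner):                   # ISTA on all codes at once
            A = soft(A - step * 2 * D.T @ (D @ A - X.T), step * lam)
        D = X.T @ A.T @ np.linalg.pinv(A @ A.T)  # least-squares dictionary update
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A

rng = np.random.default_rng(1)
D_true = rng.normal(size=(10, 8))
D_true /= np.linalg.norm(D_true, axis=0)
A_true = rng.normal(size=(8, 100)) * (rng.random((8, 100)) < 0.3)  # sparse codes
X = (D_true @ A_true).T
D, A = dict_learn(X, k=8, lam=0.05)
rel_err = np.linalg.norm(X.T - D @ A) / np.linalg.norm(X)
```

The joint problem is non-convex, but each half-step is convex, which is what makes the alternating scheme simple and scalable.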

Challenges of dictionary learning

min_D ∑_{j=1}^n min_{α_j ∈ R^k} { ‖x_j − Dα_j‖₂² + λ‖α_j‖₁ }

• Algorithmic challenges

– Large number of signals ⇒ online learning (Mairal et al., 2009a)

• Theoretical challenges

– Identifiability/robustness (Jenatton et al., 2012)

• Domain-specific challenges

– Going beyond plain sparsity ⇒ structured sparsity

(Jenatton, Mairal, Obozinski, and Bach, 2011)

Dictionary learning for image denoising

x (measurements) = y (original image) + ε (noise)

Dictionary learning for image denoising

• Solving the denoising problem (Elad and Aharon, 2006)

– Extract all overlapping 8 × 8 patches x_i ∈ R^64

– Form the matrix X = (x_1, …, x_n)⊤ ∈ R^{n×64}

– Solve a matrix factorization problem:

min_{D,A} ‖X − DA‖_F² = min_{D,A} ∑_{i=1}^n ‖x_i − Dα_i‖₂²

where A is sparse, and D is the dictionary

– Each patch is decomposed into x_i = Dα_i

– Average all Dα_i to reconstruct a full-sized image

• The number of patches n is large (= number of pixels)

• Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a)
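The extraction and averaging steps of this pipeline can be sketched directly (a toy version with hypothetical helper names; the denoising of each patch via the learned dictionary would sit between the two calls):

```python
import numpy as np

def extract_patches(img, s=8):
    """All overlapping s x s patches of a 2-D image, flattened to rows."""
    H, W = img.shape
    return np.array([img[i:i + s, j:j + s].ravel()
                     for i in range(H - s + 1) for j in range(W - s + 1)])

def average_patches(patches, shape, s=8):
    """Reassemble an image by averaging overlapping patch estimates,
    the final step of the patch-based denoising pipeline."""
    H, W = shape
    out = np.zeros(shape)
    count = np.zeros(shape)
    idx = 0
    for i in range(H - s + 1):
        for j in range(W - s + 1):
            out[i:i + s, j:j + s] += patches[idx].reshape(s, s)
            count[i:i + s, j:j + s] += 1
            idx += 1
    return out / count

img = np.random.default_rng(0).random((12, 12))
P = extract_patches(img, s=4)
recon = average_patches(P, img.shape, s=4)
```

Round-tripping unmodified patches recovers the image exactly, which confirms that the averaging step only blends the per-patch estimates.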

Denoising result

(Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)


Inpainting a 12-Mpixel photograph


Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

Structured sparse PCA (Jenatton et al., 2009b)

raw data sparse PCA

• Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability


Structured sparse PCA (Jenatton et al., 2009b)

raw data Structured sparse PCA

• Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification


Modelling of text corpora - Dictionary tree


Why structured sparsity?


• Stability and identifiability

• Prediction or estimation performance

– When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

• How?

– Design of sparsity-inducing norms

Structured sparsity

• Sparsity-inducing behavior from “corners” of constraint sets

Structured dictionary learning - Efficient optimization

min_{A ∈ R^{k×n}, D ∈ R^{p×k}} ∑_{i=1}^n ‖x_i − Dα_i‖₂² + λψ(α_i) s.t. ∀j, ‖d_j‖₂ ≤ 1

• Minimization with respect to αi : regularized least-squares

– Many algorithms dedicated to the ℓ1-norm ψ(α) = ‖α‖1

• Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)

– Requires solving many times min_{α ∈ R^k} ½‖y − α‖₂² + λψ(α)

• Efficient algorithms for structured sparse problems

– Bach, Jenatton, Mairal, and Obozinski (2011)

– Code available: http://www.di.ens.fr/willow/SPAMS/
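Each proximal step needs the proximal operator of ψ in closed form. For the ℓ1 norm this is soft-thresholding; for one simple structured penalty, the non-overlapping group ℓ1/ℓ2 norm, it is block soft-thresholding, which zeroes or keeps whole groups at once. A minimal sketch (not the SPAMS implementation):

```python
import numpy as np

def prox_l1(y, lam):
    """prox of lam * ||.||_1: coordinate-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_group_l2(y, lam, groups):
    """prox of lam * sum_g ||y_g||_2 for non-overlapping groups:
    each block is shrunk toward zero as a whole, so either a full
    group survives or it is zeroed out -- the structured-sparsity effect."""
    out = np.zeros_like(y)
    for g in groups:
        norm = np.linalg.norm(y[g])
        if norm > lam:
            out[g] = (1 - lam / norm) * y[g]
    return out

out = prox_group_l2(np.array([0.1, 0.2, 3.0, 4.0]), 1.0, [[0, 1], [2, 3]])
# the first group is zeroed as a whole; the second is shrunk but kept
```

Overlapping or hierarchical groups require more care, which is exactly what the references above address.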

Extensions - Digital Zooming

Digital Zooming (Couzinie-Devy et al., 2011)

Extensions - Task-driven dictionaries

inverse half-toning (Mairal et al., 2011)


Big learning: challenges and opportunities

Conclusion

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2008.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.

R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009b.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.