
Page 1

Big learning: challenges and opportunities

Francis Bach

SIERRA Project-team, INRIA - École Normale Supérieure

October 2013

Page 2

Scientific context

Big data

• Omnipresent digital media

– Multimedia, sensors, indicators, social networks

– All levels: personal, professional, scientific, industrial

– Too large and/or complex for manual processing

• Computational challenges

– Dealing with large databases

• Statistical challenges

– What can be predicted from such databases and how?

– Looking for hidden information

• Opportunities (and threats)

Page 3

Scientific context

Big data

• Omnipresent digital media

– Multimedia, sensors, indicators, social networks

– All levels: personal, professional, scientific, industrial

– Too large and/or complex for manual processing

• Computational challenges

– Dealing with large databases

• Statistical challenges

– What can be predicted from such databases and how?

– Looking for hidden information

• Opportunities (and threats)

Page 4

Machine learning for big data

• Large-scale machine learning: large p, large n, large k

– p : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, etc.

Page 5

Search engines - advertising

Page 6

Advertising - recommendation

Page 7

Object recognition

Page 8

Learning for bioinformatics - Proteins

• Crucial components of cell life

• Predicting multiple functions and interactions

• Massive data: up to 1 million for humans!

• Complex data

– Amino-acid sequence

– Link with DNA

– Three-dimensional molecule

Page 9

Machine learning for big data

• Large-scale machine learning: large p, large n, large k

– p : dimension of each observation (input)

– n : number of observations

– k : number of tasks (dimension of outputs)

• Examples: computer vision, bioinformatics, etc.

• Two main challenges:

1. Computational: ideal running-time complexity = O(pn + kn)

2. Statistical: meaningful results

Page 10

Big learning: challenges and opportunities

Outline

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

Page 11

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rp} (1/n) ∑_{i=1}^n ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data-fitting term + regularizer
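To make the objective concrete, here is a minimal numpy sketch of it, assuming logistic loss ℓ with labels yi ∈ {−1, +1} and Ω(θ) = (1/2)‖θ‖_2^2; the names and helpers are illustrative, not from the slides:

```python
import numpy as np

def erm_objective(theta, Phi, y, mu):
    # (1/n) sum_i log(1 + exp(-y_i theta^T Phi(x_i))) + (mu/2) ||theta||^2
    margins = y * (Phi @ theta)
    return np.mean(np.logaddexp(0.0, -margins)) + 0.5 * mu * theta @ theta

def erm_gradient(theta, Phi, y, mu):
    margins = y * (Phi @ theta)
    coeffs = -y / (1.0 + np.exp(margins))   # d(loss)/d(margin), elementwise
    return Phi.T @ coeffs / len(y) + mu * theta
```

Any gradient-based method applied to this smooth convex function fits the framework; which method to use is the subject of the following slides.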

Page 12

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rp} (1/n) ∑_{i=1}^n ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data-fitting term + regularizer

• Applications to any data-oriented field

– Computer vision, bioinformatics

– Natural language processing, etc.

Page 13

Supervised machine learning

• Data: n observations (xi, yi) ∈ X × Y, i = 1, . . . , n

• Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ Rp

• (regularized) empirical risk minimization: find θ solution of

min_{θ∈Rp} (1/n) ∑_{i=1}^n ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

convex data-fitting term + regularizer

• Main practical challenges

– Designing/learning good features Φ(x)

– Efficiently solving the optimization problem

Page 14

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′i(θt−1)

– Linear (i.e., exponential) convergence rate in O(e−αt)

– Iteration complexity is linear in n (with line search)

Page 15

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′i(θt−1)

Page 16

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′i(θt−1)

– Linear (i.e., exponential) convergence rate in O(e−αt)

– Iteration complexity is linear in n (with line search)

• Stochastic gradient descent: θt = θt−1 − γt f′i(t)(θt−1)

– Sampling with replacement: i(t) random element of {1, . . . , n}

– Convergence rate in O(1/t)

– Iteration complexity is independent of n (step size selection?)
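The two updates can be contrasted in a short sketch; grads_fi is a hypothetical list holding the n gradient functions f′i:

```python
import numpy as np

rng = np.random.default_rng(0)

def batch_step(theta, grads_fi, gamma):
    # one deterministic step: averages all n gradients, O(n) cost per iteration
    return theta - gamma * sum(g(theta) for g in grads_fi) / len(grads_fi)

def sgd_step(theta, grads_fi, gamma):
    # one stochastic step: a single sampled gradient, cost independent of n
    i = rng.integers(len(grads_fi))   # sampling with replacement
    return theta - gamma * grads_fi[i](theta)
```

The batch step pays O(n) per iteration but tolerates a constant step size; the stochastic step is O(1) but, in its plain form, needs a decaying γt, which is exactly the step-size question raised above.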

Page 17

Stochastic vs. deterministic methods

• Minimizing g(θ) = (1/n) ∑_{i=1}^n fi(θ) with fi(θ) = ℓ(yi, θ⊤Φ(xi)) + µΩ(θ)

• Batch gradient descent: θt = θt−1 − γt g′(θt−1) = θt−1 − (γt/n) ∑_{i=1}^n f′i(θt−1)

• Stochastic gradient descent: θt = θt−1 − γt f′i(t)(θt−1)

Page 18

Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: log(excess cost) vs. time for stochastic and deterministic methods]

Page 19

Stochastic vs. deterministic methods

• Goal = best of both worlds: Linear rate with O(1) iteration cost

Robustness to step size

[Figure: log(excess cost) vs. time for stochastic, deterministic, and hybrid methods]

Page 20

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t with y_i^t = f′i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

Page 21

Stochastic average gradient

(Le Roux, Schmidt, and Bach, 2012)

• Stochastic average gradient (SAG) iteration

– Keep in memory the gradients of all functions fi, i = 1, . . . , n

– Random selection i(t) ∈ {1, . . . , n} with replacement

– Iteration: θt = θt−1 − (γt/n) ∑_{i=1}^n y_i^t with y_i^t = f′i(θt−1) if i = i(t), and y_i^t = y_i^{t−1} otherwise

• Stochastic version of incremental average gradient (Blatt et al., 2008)

• Simple implementation

– Extra memory requirement: same size as original data (or less)

– Simple/robust constant step-size
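A direct transcription of the iteration, as a sketch (grads_fi again stands for the n gradient functions f′i, a name introduced here for illustration):

```python
import numpy as np

def sag(grads_fi, theta0, gamma, n_iters, seed=0):
    rng = np.random.default_rng(seed)
    n, theta = len(grads_fi), theta0.copy()
    y = np.zeros((n, theta.size))    # stored gradients y_i
    y_sum = np.zeros(theta.size)     # running sum of the stored gradients
    for _ in range(n_iters):
        i = rng.integers(n)          # random selection with replacement
        g = grads_fi[i](theta)
        y_sum += g - y[i]            # refresh only f_i'(theta) in the average
        y[i] = g
        theta -= gamma * y_sum / n
    return theta
```

For linear predictors, f′i(θ) is a scalar multiple of Φ(xi), so storing one scalar per example suffices, which is why the extra memory can be no larger than the original data.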

Page 22

Stochastic average gradient

Convergence analysis

• Assume each fi is L-smooth and g = (1/n) ∑_{i=1}^n fi is µ-strongly convex

• Constant step size γt = 1/(16L). If µ > 2L/n, ∃ C ∈ R such that

∀t > 0, E[g(θt) − g(θ∗)] ≤ C exp(−t/(8n))

• Linear convergence rate with iteration cost independent of n

– After each pass through the data, constant error reduction

– Breaking two lower bounds

Page 23

Stochastic average gradient

Simulation experiments

• protein dataset (n = 145751, p = 74)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)), and SAG-LS]

Page 24

Stochastic average gradient

Simulation experiments

• covertype dataset (n = 581012, p = 54)

• Dataset split in two (training/testing)

[Figure: training cost (objective minus optimum, log scale) and testing cost (test logistic loss) vs. effective passes, comparing Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)), and SAG-LS]

Page 25

Large-scale supervised learning

Convex optimization

• Simplicity

– Few lines of code

• Robustness

– Step-size

– Adaptivity to problem difficulty

• On-going work

– Single pass through the data (Bach and Moulines, 2013)

– Distributed algorithms

• Convexity as a solution to all problems?

– Need good features Φ(x) for linear predictions θ⊤Φ(x)!

Page 26

Large-scale supervised learning

Convex optimization

• Simplicity

– Few lines of code

• Robustness

– Step-size

– Adaptivity to problem difficulty

• On-going work

– Single pass through the data (Bach and Moulines, 2013)

– Distributed algorithms

• Convexity as a solution to all problems?

– Need good features Φ(x) for linear predictions θ⊤Φ(x)!

Page 27

Unsupervised learning through matrix factorization

• Given data matrix X = (x1⊤, . . . , xn⊤) ∈ Rn×p

– Principal component analysis: xi ≈ Dαi

– K-means: xi ≈ dk(i) ⇒ X = DA
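Both factorizations can be written in a few lines of numpy; a sketch with illustrative names, assuming X is centered for the PCA case:

```python
import numpy as np

def pca_factorization(X, k):
    # rank-k PCA as X ≈ A @ D.T, i.e. xi ≈ D alpha_i (assumes centered X)
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    D = Vt[:k].T                 # (p, k): principal directions as atoms
    A = U[:, :k] * s[:k]         # (n, k): codes alpha_i
    return D, A

def kmeans_factorization(X, k, n_iters=50, seed=0):
    # K-means as X ≈ A @ D.T with one-hot codes: xi ≈ d_{k(i)}
    rng = np.random.default_rng(seed)
    D = X[rng.choice(len(X), k, replace=False)].T   # (p, k) centroids
    for _ in range(n_iters):
        labels = ((X[:, :, None] - D[None]) ** 2).sum(axis=1).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                D[:, j] = X[labels == j].mean(axis=0)
    A = np.eye(k)[labels]        # (n, k) one-hot codes
    return D, A
```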

Page 28

Learning dictionaries for uncovering hidden structure

• Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d1, . . . , dk)

– Decomposition x = ∑_{i=1}^k αi di = Dα with α ∈ Rk sparse

– Natural signals (sounds, images) (Olshausen and Field, 1997)

• Decoding problem: given a dictionary D, finding α through regularized convex optimization min_{α∈Rk} ‖x − Dα‖_2^2 + λ‖α‖_1

Page 29

Learning dictionaries for uncovering hidden structure

• Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d1, . . . , dk)

– Decomposition x = ∑_{i=1}^k αi di = Dα with α ∈ Rk sparse

– Natural signals (sounds, images) (Olshausen and Field, 1997)

• Decoding problem: given a dictionary D, finding α through regularized convex optimization min_{α∈Rk} ‖x − Dα‖_2^2 + λ‖α‖_1

[Figure: sparsity-inducing geometry of the ℓ1-ball in the (w1, w2) plane]
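This decoding problem is a composite objective that proximal gradient methods handle well (see the optimization slide later); a minimal ISTA sketch with illustrative names:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1: coordinate-wise shrinkage toward 0
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_decode(x, D, lam, n_iters=200):
    # min_alpha ||x - D alpha||_2^2 + lam * ||alpha||_1 by proximal gradient,
    # with step 1/L, L = Lipschitz constant of the quadratic term's gradient
    L = 2.0 * np.linalg.norm(D, 2) ** 2
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iters):
        grad = 2.0 * D.T @ (D @ alpha - x)
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha
```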

Page 30

Learning dictionaries for uncovering hidden structure

• Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d1, . . . , dk)

– Decomposition x = ∑_{i=1}^k αi di = Dα with α ∈ Rk sparse

– Natural signals (sounds, images) (Olshausen and Field, 1997)

• Decoding problem: given a dictionary D, finding α through regularized convex optimization min_{α∈Rk} ‖x − Dα‖_2^2 + λ‖α‖_1

• Dictionary learning problem: given n signals x1, . . . , xn,

– Estimate both the dictionary D and the codes α1, . . . , αn:

min_D ∑_{j=1}^n min_{αj∈Rk} ‖xj − Dαj‖_2^2 + λ‖αj‖_1
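A plain alternating-minimization sketch of this problem, reusing ista_decode from the sketch above for the coding step; this is illustrative only, the online algorithm of Mairal et al. (2009a) is what scales to large n:

```python
import numpy as np

def dictionary_learning(X, k, lam, n_outer=20, seed=0):
    # alternate between sparse coding (D fixed) and dictionary update (codes fixed)
    n, p = X.shape
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((p, k))
    D /= np.linalg.norm(D, axis=0)
    for _ in range(n_outer):
        # sparse coding: one lasso problem per signal (uses ista_decode above)
        A = np.stack([ista_decode(x, D, lam) for x in X])   # (n, k)
        # dictionary update: least squares, then renormalize the atoms
        # (a simplification; the codes would be rescaled accordingly)
        D = np.linalg.lstsq(A, X, rcond=None)[0].T          # (p, k)
        D /= np.maximum(np.linalg.norm(D, axis=0), 1e-12)
    return D, A
```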

Page 31

Challenges of dictionary learning

min_D ∑_{j=1}^n min_{αj∈Rk} ‖xj − Dαj‖_2^2 + λ‖αj‖_1

• Algorithmic challenges

– Large number of signals ⇒ online learning (Mairal et al., 2009a)

• Theoretical challenges

– Identifiability/robustness (Jenatton et al., 2012)

• Domain-specific challenges

– Going beyond plain sparsity ⇒ structured sparsity (Jenatton, Mairal, Obozinski, and Bach, 2011)

Page 32

Dictionary learning for image denoising

x (measurements) = y (original image) + ε (noise)

Page 33

Dictionary learning for image denoising

• Solving the denoising problem (Elad and Aharon, 2006)

– Extract all overlapping 8 × 8 patches xi ∈ R64

– Form the matrix X = (x1⊤, . . . , xn⊤) ∈ Rn×64

– Solve a matrix factorization problem:

min_{D,A} ‖X − DA‖_F^2 = min_{D,A} ∑_{i=1}^n ‖xi − Dαi‖_2^2

where A is sparse, and D is the dictionary

– Each patch is decomposed into xi = Dαi

– Average all Dαi to reconstruct a full-sized image

• The number of patches n is large (= number of pixels)

• Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a)
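The patch-extraction and averaging steps of this pipeline are mechanical; a sketch with illustrative names (the denoised patches Dαi would come from the dictionary learned as above):

```python
import numpy as np

def extract_patches(img, size=8):
    # all overlapping size x size patches, flattened into rows of X
    H, W = img.shape
    return np.stack([img[i:i + size, j:j + size].ravel()
                     for i in range(H - size + 1)
                     for j in range(W - size + 1)])

def average_patches(patches, shape, size=8):
    # average the reconstructed patches D alpha_i back into a full image
    H, W = shape
    out, counts = np.zeros(shape), np.zeros(shape)
    idx = 0
    for i in range(H - size + 1):
        for j in range(W - size + 1):
            out[i:i + size, j:j + size] += patches[idx].reshape(size, size)
            counts[i:i + size, j:j + size] += 1.0
            idx += 1
    return out / counts
```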

Page 34

Denoising result

(Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)

Page 35

Denoising result

(Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)

Page 36

Inpainting a 12-Mpixel photograph

Page 37

Inpainting a 12-Mpixel photograph

Page 38

Inpainting a 12-Mpixel photograph

Page 39

Inpainting a 12-Mpixel photograph

Page 40

Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

Page 41

Structured sparse PCA (Jenatton et al., 2009b)

raw data sparse PCA

• Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

Page 42

Structured sparse PCA (Jenatton et al., 2009b)

raw data sparse PCA

• Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

Page 43

Structured sparse PCA (Jenatton et al., 2009b)

raw data Structured sparse PCA

• Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

Page 44

Structured sparse PCA (Jenatton et al., 2009b)

raw data Structured sparse PCA

• Enforce selection of convex nonzero patterns ⇒ robustness to occlusion in face identification

Page 45

Modelling of text corpora - Dictionary tree

Page 46

Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

Page 47

Why structured sparsity?

• Interpretability

– Structured dictionary elements (Jenatton et al., 2009b)

– Dictionary elements “organized” in a tree or a grid (Kavukcuoglu et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

• Stability and identifiability

• Prediction or estimation performance

– When prior knowledge matches data (Haupt and Nowak, 2006; Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

• How?

– Design of sparsity-inducing norms

Page 48

Structured sparsity

• Sparsity-inducing behavior from “corners” of constraint sets

Page 49

Structured dictionary learning - Efficient optimization

min_{A∈Rk×n, D∈Rp×k} ∑_{i=1}^n ‖xi − Dαi‖_2^2 + λψ(αi)   s.t. ∀j, ‖dj‖_2 ≤ 1

• Minimization with respect to αi: regularized least-squares

– Many algorithms dedicated to the ℓ1-norm ψ(α) = ‖α‖1

• Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)

– Requires solving many times min_{α∈Rk} (1/2)‖y − α‖_2^2 + λψ(α)

• Efficient algorithms for structured sparse problems

– Bach, Jenatton, Mairal, and Obozinski (2011)

– Code available: http://www.di.ens.fr/willow/SPAMS/
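For the simplest structured penalty, a sum of ℓ2-norms over non-overlapping groups, this inner problem has a closed form, block soft-thresholding; a sketch with illustrative names (the hierarchical norms of Jenatton et al. (2011) compose such steps along the tree):

```python
import numpy as np

def prox_group_l2(y, lam, groups):
    # prox of lam * sum_g ||alpha_g||_2 for non-overlapping index groups:
    # each block is shrunk toward zero and vanishes when its norm is <= lam
    alpha = y.copy()
    for g in groups:
        norm_g = np.linalg.norm(y[g])
        alpha[g] = 0.0 if norm_g <= lam else (1.0 - lam / norm_g) * y[g]
    return alpha

# e.g. prox_group_l2(np.array([3.0, 4.0, 0.5]), 1.0, [[0, 1], [2]])
# keeps the first block (norm 5 > 1, scaled by 0.8) and zeroes the second
```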

Page 50

Extensions - Digital Zooming

Page 51

Digital Zooming (Couzinie-Devy et al., 2011)

Page 52

Extensions - Task-driven dictionaries

inverse half-toning (Mairal et al., 2011)

Page 53

Extensions - Task-driven dictionaries

inverse half-toning (Mairal et al., 2011)

Page 54

Big learning: challenges and opportunities

Conclusion

• Scientific context

– Big data: need for supervised and unsupervised learning

• Beyond stochastic gradient for supervised learning

– Few passes through the data

– Provable robustness and ease of use

• Matrix factorization for unsupervised learning

– Looking for hidden information through dictionary learning

– Feature learning

Page 55

References

F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.

F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical report, arXiv:0808.3572, 2008.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2008.

M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory, 52(9):4036–4048, 2006.

J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Technical report, arXiv:0904.3523, 2009a.

Page 56

R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical report, arXiv:0909.1440, 2009b.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. In Proceedings of the International Conference on Machine Learning (ICML), 2010.

R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding. Journal of Machine Learning Research, 12:2297–2334, 2011.

K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In Proceedings of CVPR, 2009.

N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. Technical report, HAL, 2012.

J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In International Conference on Machine Learning (ICML), 2009a.

J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image restoration. In International Conference on Computer Vision (ICCV), 2009b.

J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In NIPS, 2010.

Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37:3311–3325, 1997.