
  • Big learning:challenges and opportunities

    Francis Bach

    SIERRA Project-team, INRIA - Ecole Normale Supérieure

    October 2013


  • Scientific context

    Big data

    • Omnipresent digital media

    – Multimedia, sensors, indicators, social networks

    – All levels: personal, professional, scientific, industrial

    – Too large and/or complex for manual processing

    • Computational challenges

    – Dealing with large databases

    • Statistical challenges

    – What can be predicted from such databases and how?

    – Looking for hidden information

    • Opportunities (and threats)

  • Machine learning for big data

    • Large-scale machine learning: large p, large n, large k

    – p : dimension of each observation (input)

    – n : number of observations

    – k : number of tasks (dimension of outputs)

    • Examples: computer vision, bioinformatics, etc.

  • Search engines - advertising

  • Advertising - recommendation

  • Object recognition

  • Learning for bioinformatics - Proteins

    • Crucial components of cell life

    • Predicting multiple functions and

    interactions

    • Massive data: up to 1 million for humans!

    • Complex data

    – Amino-acid sequence

    – Link with DNA

    – Tri-dimensional molecule

  • Machine learning for big data

    • Large-scale machine learning: large p, large n, large k

    – p : dimension of each observation (input)

    – n : number of observations

    – k : number of tasks (dimension of outputs)

    • Examples: computer vision, bioinformatics, etc.

    • Two main challenges:

    1. Computational: ideal running-time complexity = O(pn + kn)

    2. Statistical: meaningful results

  • Big learning: challenges and opportunities

    Outline

    • Scientific context

    – Big data: need for supervised and unsupervised learning

    • Beyond stochastic gradient for supervised learning

    – Few passes through the data

    – Provable robustness and ease of use

    • Matrix factorization for unsupervised learning

    – Looking for hidden information through dictionary learning

    – Feature learning

  • Supervised machine learning

    • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n

    • Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^p

    • (regularized) empirical risk minimization: find θ̂ solution of

      min_{θ ∈ R^p}  (1/n) ∑_{i=1}^n ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

      convex data fitting term  +  regularizer

  • Supervised machine learning

    • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n

    • Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^p

    • (regularized) empirical risk minimization: find θ̂ solution of

      min_{θ ∈ R^p}  (1/n) ∑_{i=1}^n ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

      convex data fitting term  +  regularizer

    • Applications to any data-oriented field

    – Computer vision, bioinformatics

    – Natural language processing, etc.

  • Supervised machine learning

    • Data: n observations (x_i, y_i) ∈ X × Y, i = 1, . . . , n

    • Prediction as a linear function θ⊤Φ(x) of features Φ(x) ∈ R^p

    • (regularized) empirical risk minimization: find θ̂ solution of

      min_{θ ∈ R^p}  (1/n) ∑_{i=1}^n ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

      convex data fitting term  +  regularizer

    • Main practical challenges

    – Designing/learning good features Φ(x)

    – Efficiently solving the optimization problem
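A minimal numerical sketch of the formulation above (an illustration added here, not from the original slides): logistic loss ℓ, ridge regularizer Ω(θ) = ½‖θ‖², and plain batch gradient descent on synthetic data. All names and parameter values are illustrative assumptions.

```python
import numpy as np

def objective(theta, Phi, y, mu):
    """(1/n) sum_i log(1 + exp(-y_i * theta^T Phi(x_i))) + (mu/2) * ||theta||^2."""
    margins = y * (Phi @ theta)
    return np.mean(np.log1p(np.exp(-margins))) + 0.5 * mu * (theta @ theta)

def gradient(theta, Phi, y, mu):
    margins = y * (Phi @ theta)
    weights = -y / (1.0 + np.exp(margins))   # derivative of the logistic loss w.r.t. the margin
    return Phi.T @ weights / len(y) + mu * theta

# Synthetic data: n observations with p features, labels in {-1, +1}.
rng = np.random.default_rng(0)
n, p = 1000, 20
Phi = rng.standard_normal((n, p))
y = np.sign(Phi @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n))

theta, mu, step = np.zeros(p), 1e-2, 1.0
for _ in range(200):                         # plain batch gradient descent
    theta -= step * gradient(theta, Phi, y, mu)
print(objective(theta, Phi, y, mu))
```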

  • Stochastic vs. deterministic methods

    • Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ)  with  f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

    • Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) ∑_{i=1}^n f_i′(θ_{t−1})

    – Linear (e.g., exponential) convergence rate in O(e^{−αt})

    – Iteration complexity is linear in n (with line search)

  • Stochastic vs. deterministic methods

    • Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ)  with  f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

    • Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) ∑_{i=1}^n f_i′(θ_{t−1})

  • Stochastic vs. deterministic methods

    • Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ)  with  f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

    • Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) ∑_{i=1}^n f_i′(θ_{t−1})

    – Linear (e.g., exponential) convergence rate in O(e^{−αt})

    – Iteration complexity is linear in n (with line search)

    • Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})

    – Sampling with replacement: i(t) random element of {1, . . . , n}

    – Convergence rate in O(1/t)

    – Iteration complexity is independent of n (step size selection?)

  • Stochastic vs. deterministic methods

    • Minimizing g(θ) = (1/n) ∑_{i=1}^n f_i(θ)  with  f_i(θ) = ℓ(y_i, θ⊤Φ(x_i)) + µ Ω(θ)

    • Batch gradient descent: θ_t = θ_{t−1} − γ_t g′(θ_{t−1}) = θ_{t−1} − (γ_t/n) ∑_{i=1}^n f_i′(θ_{t−1})

    • Stochastic gradient descent: θ_t = θ_{t−1} − γ_t f′_{i(t)}(θ_{t−1})
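To make the batch/stochastic contrast concrete, here is a toy comparison (added for illustration, with ad hoc step sizes) on a regularized least-squares instance of the objective above: one batch iteration touches all n gradients, one stochastic iteration touches a single random one.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, mu = 5000, 10, 1e-3
Phi = rng.standard_normal((n, p))
y = Phi @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def grad_i(theta, i):
    """Gradient of f_i(theta) = 0.5 * (Phi_i^T theta - y_i)^2 + 0.5 * mu * ||theta||^2."""
    return (Phi[i] @ theta - y[i]) * Phi[i] + mu * theta

# Batch gradient descent: each iteration costs O(n p).
theta_batch = np.zeros(p)
for t in range(100):
    full_grad = Phi.T @ (Phi @ theta_batch - y) / n + mu * theta_batch
    theta_batch -= 0.5 * full_grad

# Stochastic gradient descent: each iteration costs O(p), with a decaying step size.
theta_sgd = np.zeros(p)
for t in range(1, 5 * n + 1):
    i = rng.integers(n)
    theta_sgd -= 0.02 / (1.0 + 1e-3 * t) * grad_i(theta_sgd, i)
```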

  • Stochastic vs. deterministic methods

    • Goal = best of both worlds: Linear rate with O(1) iteration cost

    Robustness to step size

    [Figure: log(excess cost) vs. time, comparing stochastic and deterministic methods]

  • Stochastic vs. deterministic methods

    • Goal = best of both worlds: Linear rate with O(1) iteration cost

    Robustness to step size

    [Figure: log(excess cost) vs. time, comparing stochastic, deterministic, and hybrid methods]

  • Stochastic average gradient

    (Le Roux, Schmidt, and Bach, 2012)

    • Stochastic average gradient (SAG) iteration

    – Keep in memory the gradients of all functions f_i, i = 1, . . . , n

    – Random selection i(t) ∈ {1, . . . , n} with replacement

    – Iteration: θ_t = θ_{t−1} − (γ_t/n) ∑_{i=1}^n y_i^t   with   y_i^t = f_i′(θ_{t−1}) if i = i(t),  y_i^t = y_i^{t−1} otherwise

  • Stochastic average gradient

    (Le Roux, Schmidt, and Bach, 2012)

    • Stochastic average gradient (SAG) iteration

    – Keep in memory the gradients of all functions f_i, i = 1, . . . , n

    – Random selection i(t) ∈ {1, . . . , n} with replacement

    – Iteration: θ_t = θ_{t−1} − (γ_t/n) ∑_{i=1}^n y_i^t   with   y_i^t = f_i′(θ_{t−1}) if i = i(t),  y_i^t = y_i^{t−1} otherwise

    • Stochastic version of incremental average gradient (Blatt et al., 2008)

    • Simple implementation

    – Extra memory requirement: same size as original data (or less)

    – Simple/robust constant step-size
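A minimal sketch of the SAG iteration above on a toy regularized least-squares problem (added for illustration; the memory table and the crude per-function smoothness estimate are my own choices, and the constant step size 1/(16L) follows the analysis on the next slide).

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, mu = 2000, 10, 1e-3
Phi = rng.standard_normal((n, p))
y = Phi @ rng.standard_normal(p) + 0.1 * rng.standard_normal(n)

def grad_i(theta, i):
    """Gradient of f_i(theta) = 0.5 * (Phi_i^T theta - y_i)^2 + 0.5 * mu * ||theta||^2."""
    return (Phi[i] @ theta - y[i]) * Phi[i] + mu * theta

L = np.max(np.sum(Phi ** 2, axis=1)) + mu    # crude bound on the smoothness of each f_i
step = 1.0 / (16 * L)                        # constant step size, no line search needed

theta = np.zeros(p)
memory = np.zeros((n, p))                    # stored gradients y_i^t, one row per observation
avg = np.zeros(p)                            # running average (1/n) * sum_i y_i^t

for t in range(20 * n):                      # roughly 20 passes through the data
    i = rng.integers(n)
    g_new = grad_i(theta, i)
    avg += (g_new - memory[i]) / n           # update the average without re-summing n terms
    memory[i] = g_new
    theta -= step * avg
```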

  • Stochastic average gradient

    Convergence analysis

    • Assume each f_i is L-smooth and g = (1/n) ∑_{i=1}^n f_i is µ-strongly convex

    • Constant step size γ_t = 1/(16L). If µ > 2L/n, ∃ C ∈ R such that

      ∀t > 0,   E[g(θ_t) − g(θ∗)] ≤ C exp(−t/(8n))

    • Linear convergence rate with iteration cost independent of n

    – After each pass through the data, constant error reduction

    – Breaking two lower bounds

  • Stochastic average gradient

    Simulation experiments

    • protein dataset (n = 145751, p = 74)

    • Dataset split in two (training/testing)

    [Figure: objective minus optimum (training cost, left, log scale) and test logistic loss (testing cost, right) vs. effective passes through the data, comparing Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)) and SAG-LS]

  • Stochastic average gradient

    Simulation experiments

    • covertype dataset (n = 581012, p = 54)

    • Dataset split in two (training/testing)

    [Figure: objective minus optimum (training cost, left, log scale) and test logistic loss (testing cost, right) vs. effective passes through the data, comparing Steepest, AFG, L-BFGS, pegasos, RDA, SAG (2/(L+nµ)) and SAG-LS]


  • Large-scale supervised learning

    Convex optimization

    • Simplicity

    – Few lines of code

    • Robustness

    – Step-size

    – Adaptivity to problem difficulty

    • On-going work

    – Single pass through the data (Bach and Moulines, 2013)

    – Distributed algorithms

    • Convexity as a solution to all problems?

    – Need good features Φ(x) for linear predictions θ⊤Φ(x)!

  • Unsupervised learning through matrix factorization

    • Given data matrix X = (x_1⊤, . . . , x_n⊤) ∈ R^{n×p}

    – Principal component analysis: x_i ≈ Dα_i

    – K-means: x_i ≈ d_k ⇒ X = DA
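Both examples can be written as explicit rank-k factorizations. A small sketch (added here; it stores the codes as an n × k matrix multiplying a k × p dictionary, and all sizes are illustrative) computing PCA via a truncated SVD and K-means via scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
n, p, k = 500, 20, 5
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, p))   # roughly rank-k data
X += 0.05 * rng.standard_normal((n, p))
X = X - X.mean(axis=0)                                          # center the data

# PCA as a factorization: X ≈ codes @ dictionary, via a truncated SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
codes_pca = U[:, :k] * s[:k]            # n x k, one code per observation
dict_pca = Vt[:k]                       # k x p, the k principal directions
print(np.linalg.norm(X - codes_pca @ dict_pca))

# K-means as a factorization: each x_i is approximated by a single centroid.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
codes_km = np.eye(k)[km.labels_]        # n x k, one-hot assignment per observation
dict_km = km.cluster_centers_           # k x p, the centroids
print(np.linalg.norm(X - codes_km @ dict_km))
```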

  • Learning dictionaries for uncovering hidden structure

    • Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d_1, . . . , d_k)

    – Decomposition x = ∑_{i=1}^k α_i d_i = Dα with α ∈ R^k sparse

    – Natural signals (sounds, images) (Olshausen and Field, 1997)

    • Decoding problem: given a dictionary D, finding α through regularized convex optimization  min_{α ∈ R^k} ‖x − Dα‖_2^2 + λ‖α‖_1

  • Learning dictionaries for uncovering hidden structure

    • Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d_1, . . . , d_k)

    – Decomposition x = ∑_{i=1}^k α_i d_i = Dα with α ∈ R^k sparse

    – Natural signals (sounds, images) (Olshausen and Field, 1997)

    • Decoding problem: given a dictionary D, finding α through regularized convex optimization  min_{α ∈ R^k} ‖x − Dα‖_2^2 + λ‖α‖_1

    [Figure: illustration in the (w_1, w_2) plane]
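The decoding problem above is a Lasso in α. A minimal sketch (added for illustration, not the solver used in the cited work) via ISTA, i.e. proximal gradient descent with soft-thresholding, on a random toy dictionary:

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (coordinate-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista_decode(x, D, lam, n_iter=500):
    """Minimize ||x - D alpha||_2^2 + lam * ||alpha||_1 by proximal gradient (ISTA)."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2        # Lipschitz constant of the smooth part
    alpha = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * D.T @ (D @ alpha - x)     # gradient of the quadratic term
        alpha = soft_threshold(alpha - grad / L, lam / L)
    return alpha

rng = np.random.default_rng(4)
p, k = 64, 256
D = rng.standard_normal((p, k)) / np.sqrt(p)   # toy dictionary with roughly unit-norm atoms
alpha_true = np.zeros(k)
alpha_true[rng.choice(k, size=5, replace=False)] = 1.0
x = D @ alpha_true                             # a signal that is exactly 5-sparse in D
alpha_hat = ista_decode(x, D, lam=0.05)
print(np.count_nonzero(np.abs(alpha_hat) > 1e-3))   # few nonzero coefficients survive
```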

  • Learning dictionaries for uncovering hidden structure

    • Fact: many natural signals may be approximately represented as a superposition of few atoms from a dictionary D = (d_1, . . . , d_k)

    – Decomposition x = ∑_{i=1}^k α_i d_i = Dα with α ∈ R^k sparse

    – Natural signals (sounds, images) (Olshausen and Field, 1997)

    • Decoding problem: given a dictionary D, finding α through regularized convex optimization  min_{α ∈ R^k} ‖x − Dα‖_2^2 + λ‖α‖_1

    • Dictionary learning problem: given n signals x_1, . . . , x_n,

    – Estimate both dictionary D and codes α_1, . . . , α_n

      min_D ∑_{j=1}^n min_{α_j ∈ R^k} { ‖x_j − Dα_j‖_2^2 + λ‖α_j‖_1 }
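For the joint problem, a short sketch (added here) using scikit-learn's MiniBatchDictionaryLearning, in the spirit of the online approach of Mairal et al. (2009a); the random data stands in for real signals and every parameter value is a placeholder.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(5)
n, p, k = 2000, 64, 100
X = rng.standard_normal((n, p))          # stand-in for n real signals x_1, ..., x_n

# Jointly estimate the dictionary D (k atoms) and the sparse codes alpha_1, ..., alpha_n.
dico = MiniBatchDictionaryLearning(n_components=k, alpha=1.0, batch_size=200,
                                   random_state=0)
codes = dico.fit_transform(X)            # n x k matrix of sparse codes
D = dico.components_                     # k x p matrix of dictionary atoms
print(D.shape, np.mean(codes != 0))      # fraction of nonzero coefficients
```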

  • Challenges of dictionary learning

    min_D ∑_{j=1}^n min_{α_j ∈ R^k} { ‖x_j − Dα_j‖_2^2 + λ‖α_j‖_1 }

    • Algorithmic challenges

    – Large number of signals ⇒ online learning (Mairal et al., 2009a)

    • Theoretical challenges

    – Identifiability/robustness (Jenatton et al., 2012)

    • Domain-specific challenges

    – Going beyond plain sparsity ⇒ structured sparsity

    (Jenatton, Mairal, Obozinski, and Bach, 2011)

  • Dictionary learning for image denoising

    x = y + ε,   where x = measurements, y = original image, ε = noise

  • Dictionary learning for image denoising

    • Solving the denoising problem (Elad and Aharon, 2006)

    – Extract all overlapping 8 × 8 patches x_i ∈ R^64

    – Form the matrix X = (x_1⊤, . . . , x_n⊤) ∈ R^{n×64}

    – Solve a matrix factorization problem:

      min_{D,A} ‖X − DA‖_F^2 = min_{D,A} ∑_{i=1}^n ‖x_i − Dα_i‖_2^2

      where A is sparse, and D is the dictionary

    – Each patch is decomposed into x_i = Dα_i

    – Average all Dα_i to reconstruct a full-sized image

    • The number of patches n is large (= number of pixels)

    • Online learning (Mairal, Bach, Ponce, and Sapiro, 2009a)
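A compact sketch of this patch-based pipeline using scikit-learn utilities (added for illustration; the sample image, noise level, and parameters are arbitrary and do not reproduce the cited experiments).

```python
import numpy as np
from sklearn.datasets import load_sample_image
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.feature_extraction.image import extract_patches_2d, reconstruct_from_patches_2d

# A small grayscale crop, with synthetic noise added.
image = load_sample_image("china.jpg").mean(axis=2)[:160, :160] / 255.0
noisy = image + 0.1 * np.random.default_rng(0).standard_normal(image.shape)

# Extract all overlapping 8x8 patches and remove each patch's mean.
patches = extract_patches_2d(noisy, (8, 8))
X = patches.reshape(len(patches), -1)
means = X.mean(axis=1, keepdims=True)
X = X - means

# Learn the dictionary D on a subset of patches, then sparse-code every patch.
dico = MiniBatchDictionaryLearning(n_components=100, alpha=1.0, batch_size=256,
                                   random_state=0).fit(X[::10])
codes = dico.transform(X)                               # sparse codes alpha_i
approx = (codes @ dico.components_ + means).reshape(patches.shape)

# Average the overlapping reconstructions D alpha_i back into a full-sized image.
denoised = reconstruct_from_patches_2d(approx, noisy.shape)
```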

  • Denoising result

    (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)

  • Denoising result

    (Mairal, Bach, Ponce, Sapiro, and Zisserman, 2009b)

  • Inpainting a 12-Mpixel photograph

  • Inpainting a 12-Mpixel photograph

  • Inpainting a 12-Mpixel photograph

  • Inpainting a 12-Mpixel photograph

  • Why structured sparsity?

    • Interpretability

    – Structured dictionary elements (Jenatton et al., 2009b)

    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu

    et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  • Structured sparse PCA (Jenatton et al., 2009b)

    raw data sparse PCA

    • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  • Structured sparse PCA (Jenatton et al., 2009b)

    raw data sparse PCA

    • Unstructured sparse PCA ⇒ many zeros do not lead to better interpretability

  • Structured sparse PCA (Jenatton et al., 2009b)

    raw data Structured sparse PCA

    • Enforce selection of convex nonzero patterns ⇒ robustness to

    occlusion in face identification

  • Structured sparse PCA (Jenatton et al., 2009b)

    raw data Structured sparse PCA

    • Enforce selection of convex nonzero patterns ⇒ robustness to

    occlusion in face identification

  • Modelling of text corpora - Dictionary tree

  • Why structured sparsity?

    • Interpretability

    – Structured dictionary elements (Jenatton et al., 2009b)

    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu

    et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

  • Why structured sparsity?

    • Interpretability

    – Structured dictionary elements (Jenatton et al., 2009b)

    – Dictionary elements “organized” in a tree or a grid (Kavukcuoglu

    et al., 2009; Jenatton et al., 2010; Mairal et al., 2010)

    • Stability and identifiability

    • Prediction or estimation performance

    – When prior knowledge matches data (Haupt and Nowak, 2006;

    Baraniuk et al., 2008; Jenatton et al., 2009a; Huang et al., 2009)

    • How?

    – Design of sparsity-inducing norms

  • Structured sparsity

    • Sparsity-inducing behavior from “corners” of constraint sets

  • Structured dictionary learning – Efficient optimization

    min_{D ∈ R^{p×k}, A ∈ R^{k×n}}  ∑_{i=1}^n ‖x_i − Dα_i‖_2^2 + λ ψ(α_i)   s.t.  ∀j, ‖d_j‖_2 ≤ 1

    • Minimization with respect to α_i: regularized least-squares

    – Many algorithms dedicated to the ℓ1-norm ψ(α) = ‖α‖_1

    • Proximal methods: first-order methods with optimal convergence rate (Nesterov, 2007; Beck and Teboulle, 2009)

    – Requires solving many times  min_{α ∈ R^k}  (1/2) ‖y − α‖_2^2 + λ ψ(α)

    • Efficient algorithms for structured sparse problems

    – Bach, Jenatton, Mairal, and Obozinski (2011)

    – Code available: http://www.di.ens.fr/willow/SPAMS/
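The inner problem min_α ½‖y − α‖_2^2 + λψ(α) is a proximal operator; it has a closed form for the ℓ1 norm and for group-ℓ2 (group Lasso) penalties, which is what makes these first-order methods practical. A small sketch of both (added here; the group structure is an illustrative assumption):

```python
import numpy as np

def prox_l1(y, lam):
    """argmin_a 0.5*||y - a||_2^2 + lam*||a||_1 : coordinate-wise soft-thresholding."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def prox_group_l2(y, lam, groups):
    """Same problem with psi(a) = sum_g ||a_g||_2 : whole groups are shrunk or zeroed."""
    a = np.zeros_like(y)
    for g in groups:                         # groups is a list of index arrays
        norm = np.linalg.norm(y[g])
        if norm > lam:
            a[g] = (1.0 - lam / norm) * y[g]
    return a

y = np.array([0.3, -1.2, 2.0, 0.1, -0.4, 0.9])
print(prox_l1(y, 0.5))
print(prox_group_l2(y, 0.8, groups=[np.arange(0, 3), np.arange(3, 6)]))
```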

  • Extensions - Digital Zooming

  • Digital Zooming (Couzinie-Devy et al., 2011)

  • Extensions - Task-driven dictionaries

    inverse half-toning (Mairal et al., 2011)

  • Extensions - Task-driven dictionaries

    inverse half-toning (Mairal et al., 2011)

  • Big learning: challenges and opportunities

    Conclusion

    • Scientific context

    – Big data: need for supervised and unsupervised learning

    • Beyond stochastic gradient for supervised learning

    – Few passes through the data

    – Provable robustness and ease of use

    • Matrix factorization for unsupervised learning

    – Looking for hidden information through dictionary learning

    – Feature learning

  • References

    F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n). Technical Report 00831977, HAL, 2013.

    F. Bach, R. Jenatton, J. Mairal, and G. Obozinski. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2011.

    R. G. Baraniuk, V. Cevher, M. F. Duarte, and C. Hegde. Model-based compressive sensing. Technical

    report, arXiv:0808.3572, 2008.

    A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems.

    SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.

    D. Blatt, A. O. Hero, and H. Gauchman. A convergent incremental gradient method with a constant step size. SIAM Journal on Optimization, 18(1):29–51, 2008.

    M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned

    dictionaries. IEEE Transactions on Image Processing, 15(12):3736–3745, 2006.

    J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Transactions on

    Information Theory, 52(9):4036–4048, 2006.

    J. Huang, T. Zhang, and D. Metaxas. Learning with structured sparsity. In Proceedings of the 26th

    International Conference on Machine Learning (ICML), 2009.

    R. Jenatton, J.Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms.

    Technical report, arXiv:0904.3523, 2009a.

    R. Jenatton, G. Obozinski, and F. Bach. Structured sparse principal component analysis. Technical

    report, arXiv:0909.1440, 2009b.

    R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for sparse hierarchical dictionary learning. Submitted to ICML, 2010.

    R. Jenatton, J. Mairal, G. Obozinski, and F. Bach. Proximal methods for hierarchical sparse coding.

    Journal of Machine Learning Research, 12:2297–2334, 2011.

    K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic

    filter maps. In Proceedings of CVPR, 2009.

    N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence

    rate for strongly-convex optimization with finite training sets. Technical Report -, HAL, 2012.

    J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online dictionary learning for sparse coding. In

    International Conference on Machine Learning (ICML), 2009a.

    J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Non-local sparse models for image

    restoration. In International Conference on Computer Vision (ICCV), 2009b.

    J. Mairal, R. Jenatton, G. Obozinski, and F. Bach. Network flow algorithms for structured sparsity. In

    NIPS, 2010.

    Y. Nesterov. Gradient methods for minimizing composite objective function. Technical report, Center

    for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.

    B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed

    by V1? Vision Research, 37:3311–3325, 1997.