5 years from now, everyone will learn their features (you might as well start now)

Yann LeCun
Courant Institute of Mathematical Sciences and Center for Neural Science, New York University


  • Yann LeCun

    5 years from now, everyone will learn their features (you might as well start now)

    Yann LeCun
    Courant Institute of Mathematical Sciences and Center for Neural Science, New York University

  • Yann LeCun

    I Have a Terrible Confession to Make

    I'm interested in vision, but no more in vision than in audition or in other perceptual modalities.

    I'm interested in perception (and in control).

    I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities

    Nature seems to have found one.

    Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos

    Heavy correlation between neighboring variables. Local patches of variables have structure, and are representable by feature vectors.

    I like vision because it's challenging, it's useful, it's fun, we have data, and the image recognition community is not yet stuck in a deep local minimum like the speech recognition community.

  • Yann LeCun

    The Unity of Recognition Architectures

  • Yann LeCun

    Most Recognition Systems Are Built on the Same Architecture

    First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders.....

    Second stage: K-means, sparse coding, LCC....

    Pooling: average, L2, max, max with bias (elastic templates).....

    Convolutional Nets: same architecture, but everything is trained.

    [Diagram: Filter Bank -> Non-Linearity -> Normalization -> feature Pooling, repeated for two stages, followed by a Classifier]
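    To make the shared structure concrete, here is a minimal numpy sketch of such a two-stage pipeline (filter bank -> non-linearity -> normalization -> pooling, twice, then a feature vector for a classifier). The filter values, patch sizes, and the crude per-map mean subtraction standing in for normalization are illustrative placeholders, not any particular published system.

        import numpy as np

        def filter_bank(maps, filters):
            # maps: (depth, H, W); filters: (n_filters, depth, fh, fw)
            n, _, fh, fw = filters.shape
            oh, ow = maps.shape[1] - fh + 1, maps.shape[2] - fw + 1
            out = np.zeros((n, oh, ow))
            for k in range(n):
                for i in range(oh):
                    for j in range(ow):
                        out[k, i, j] = np.sum(maps[:, i:i+fh, j:j+fw] * filters[k])
            return out

        def avg_pool(maps, p=2):
            d, h, w = maps.shape
            h, w = h // p, w // p
            return maps[:, :h*p, :w*p].reshape(d, h, p, w, p).mean(axis=(2, 4))

        def stage(maps, filters):
            maps = filter_bank(maps, filters)                     # filter bank
            maps = np.abs(maps)                                   # non-linearity (rectification)
            maps = maps - maps.mean(axis=(1, 2), keepdims=True)   # crude stand-in for normalization
            return avg_pool(maps)                                 # spatial pooling

        rng = np.random.default_rng(0)
        image = rng.standard_normal((1, 32, 32))        # grayscale image as a 1-map stack
        f1 = rng.standard_normal((8, 1, 5, 5))          # stage-1 filter bank (placeholder values)
        f2 = rng.standard_normal((16, 8, 5, 5))         # stage-2 filter bank (placeholder values)
        features = stage(stage(image, f1), f2).ravel()  # two stages -> feature vector
        # a simple classifier (SVM, logistic regression, ...) is trained on `features`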

  • Yann LeCun

    Filter Bank + Non-Linearity + Pooling + Normalization

    This model of a feature extraction stage is biologically inspired... whether you like it or not (just ask David Lowe). Inspired by [Hubel and Wiesel 1962]. The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60's).

    [Diagram: Filter Bank -> Non-Linearity -> Spatial Pooling]

  • Yann LeCun

    How well does this work?

    Some results on C101 (I know, I know....)

    SIFT -> K-means -> Pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]

    SIFT -> Sparse coding on blocks -> Pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]

    SIFT -> Local sparse coding on blocks -> Pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]

    (Small) supervised ConvNet with sparsity penalty: >71% [rejected from CVPR, ICCV, etc.] REAL TIME

    [Diagram: SIFT (Oriented Edges -> Winner Takes All -> Histogram (sum)) -> K-means or Sparse Coding -> Pyramid Histogram / Elastic Parts Models -> SVM or another simple classifier, matching the generic Filter Bank -> Non-Linearity -> feature Pooling stages followed by a Classifier]

  • Yann LeCun

    Convolutional Networks (ConvNets) fit that model

  • Yann LeCun

    Why do two stages work better than one stage?

    The second stage extracts mid-level features

    Having multiple stages helps the selectivity-invariance dilemma

    [Diagram: two stages of Filter Bank -> Non-Linearity -> Normalization -> Pooling, followed by a Classifier]

  • Yann LeCun

    Learning Hierarchical Representations

    I agree with David Lowe: we should learn the features

    It worked for speech, handwriting, NLP.....

    In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:

    Mutation: tweak existing features in many different ways.
    Selection: publish the best ones at CVPR.
    Reproduction: combine several features from the last CVPR.
    Iterate. Problem: Moore's law works against you.

    [Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, producing a learned internal representation]

  • Yann LeCun

    Sometimes, Biology gives you good hints. Example: contrast normalization

  • Yann LeCun

    THIS IS ONE STAGE OF THE CONVNET

    Harsh Non-Linearity + Contrast Normalization + Sparsity

    C: Convolutions (filter bank)
    Soft Thresholding + Abs
    N: Subtractive and Divisive Local Normalization
    P: Pooling / downsampling layer: average or max?

    [Diagram: Convolutions -> Thresholding -> Rectification -> subtractive + divisive contrast normalization -> Pooling, subsampling]

  • Yann LeCun

    Soft Thresholding Non-Linearity
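    For reference, the soft-thresholding (shrinkage) function is sh_theta(x) = sign(x) * max(|x| - theta, 0); a one-line numpy version, where the threshold theta would be a hand-set or learned parameter:

        import numpy as np

        def soft_threshold(x, theta):
            # shrinkage: zero out small values, shrink the rest toward zero by theta
            return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)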

  • Yann LeCun

    Local Contrast Normalization

    Performed on the state of every layer, including the input

    Subtractive Local Contrast Normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (a high-pass filter).

    Divisive Local Contrast Normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.

    Subtractive + Divisive LCN performs a kind of approximate whitening.
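    A rough numpy/scipy sketch of these two steps (the Gaussian width, the way the neighborhood is pooled across maps, and the floor on the divisor are simplifying assumptions, not the exact recipe):

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def local_contrast_normalize(maps, sigma=2.0, eps=1e-6):
            # maps: (n_maps, H, W) -- the state of one layer
            # Subtractive step: remove a Gaussian-weighted local mean taken across all maps
            local_mean = gaussian_filter(maps.mean(axis=0), sigma)
            centered = maps - local_mean
            # Divisive step: divide by the local standard deviation over space and maps
            local_std = np.sqrt(gaussian_filter((centered ** 2).mean(axis=0), sigma))
            denom = np.maximum(local_std, local_std.mean())   # avoid amplifying flat regions
            return centered / np.maximum(denom, eps)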

  • Yann LeCun

    C101 Performance (I know, I know)

    Small network: 64 features at stage-1, 256 features at stage-2:

    Tanh non-linearity, No Rectification, No normalization: 29%

    Tanh non-linearity, Rectification, normalization: 65%

    Shrink non-linearity, Rectification, normalization, sparsity penalty: 71%

  • Yann LeCun

    Results on Caltech101 with sigmoid non-linearity

    (like HMAX model)

  • Yann LeCun

    Feature Learning Works Really Well on everything but C101

  • Yann LeCun

    C101 is very unfavorable to learning-based systems

    Because it's so small. We are switching to ImageNet

    Some results on NORB

    [Plot: NORB results for random filters, unsupervised filters, supervised filters, and unsupervised + supervised filters, with and without normalization]

  • Yann LeCun

    Sparse Auto-Encoders

    Inference by gradient descent starting from the encoder output

    Inference: $Z^i = \arg\min_z E(Y^i, z; W)$

    Energy: $E(Y^i, Z) = \|Y^i - W_d Z\|^2 + \|Z - g_e(W_e, Y^i)\|^2 + \lambda \sum_j |z_j|$

    [Diagram: INPUT $Y$ -> code $Z$ (FEATURES); the decoder computes $W_d Z$, the encoder computes $g_e(W_e, Y^i)$]
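    A small numpy sketch of this energy and of the gradient-descent inference, with a plain linear encoder $g_e(W_e, Y) = W_e Y$ standing in for the Linear-Sigmoid-Diagonal encoder used later in the talk; lambda, the step size, and the number of steps are illustrative:

        import numpy as np

        def psd_energy(y, z, Wd, We, lam):
            # E(Y, Z) = ||Y - Wd Z||^2 + ||Z - We Y||^2 + lam * sum_j |z_j|
            recon = y - Wd @ z
            pred = z - We @ y
            return recon @ recon + pred @ pred + lam * np.abs(z).sum()

        def infer_code(y, Wd, We, lam, steps=100, lr=0.05):
            z = We @ y                                        # start from the encoder output
            for _ in range(steps):                            # gradient descent on E wrt z
                grad = -2 * Wd.T @ (y - Wd @ z) + 2 * (z - We @ y) + lam * np.sign(z)
                z -= lr * grad
            return z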

  • Yann LeCun

    Using PSD to Train a Hierarchy of Features

    Phase 1: train first layer using PSD

    [Diagram: layer-1 PSD module, with input $Y$, encoder $g_e(W_e, Y^i)$, code $Z$ (FEATURES), and decoder $W_d Z$]

  • Yann LeCun

    Using PSD to Train a Hierarchy of Features

    Phase 1: train first layer using PSD

    Phase 2: use encoder + absolute value as feature extractor

    [Diagram: input $Y$ -> layer-1 encoder $g_e(W_e, Y^i)$ + absolute value -> FEATURES]

  • Yann LeCun

    Using PSD to Train a Hierarchy of Features

    Phase 1: train first layer using PSD

    Phase 2: use encoder + absolute value as feature extractor

    Phase 3: train the second layer using PSD

    [Diagram: layer-1 encoder + absolute value feeding a layer-2 PSD module (encoder, code $Z$, decoder)]

  • Yann LeCun

    Using PSD to Train a Hierarchy of Features

    Phase 1: train first layer using PSD

    Phase 2: use encoder + absolute value as feature extractor

    Phase 3: train the second layer using PSD

    Phase 4: use encoder + absolute value as 2nd feature extractor

    [Diagram: two stacked encoder + absolute value stages producing the FEATURES]

  • Yann LeCun

    Using PSD to Train a Hierarchy of Features

    Phase 1: train first layer using PSD

    Phase 2: use encoder + absolute value as feature extractor

    Phase 3: train the second layer using PSD

    Phase 4: use encoder + absolute value as 2nd feature extractor

    Phase 5: train a supervised classifier on top

    Phase 6 (optional): train the entire system with supervised back-propagation

    [Diagram: two stacked encoder + absolute value stages followed by a classifier]
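    Putting the phases together, a rough patch-level sketch of the layer-wise procedure (it reuses infer_code from the sparse auto-encoder sketch above; the alternating update scheme, hyper-parameters, random patches, and code sizes are illustrative assumptions, and the convolutional / pooling structure of the real system is omitted):

        import numpy as np

        def train_psd(data, n_code, lam=0.5, lr=0.01, epochs=2, seed=0):
            # Alternate: infer the code Z for each sample, then take a gradient
            # step on the decoder Wd and on the (linear) encoder We.
            rng = np.random.default_rng(seed)
            d = data.shape[1]
            Wd = 0.1 * rng.standard_normal((d, n_code))
            We = 0.1 * rng.standard_normal((n_code, d))
            for _ in range(epochs):
                for y in data:
                    z = infer_code(y, Wd, We, lam)                       # see sketch above
                    Wd += lr * np.outer(y - Wd @ z, z)                   # decoder update
                    Wd /= np.maximum(np.linalg.norm(Wd, axis=0), 1.0)    # bound the columns
                    We += lr * np.outer(z - We @ y, y)                   # encoder regression
            return Wd, We

        # Phases 1-2: train layer 1 with PSD, then use |encoder| as the feature extractor
        # Phases 3-4: train layer 2 on those features, use |encoder| again
        # Phase 5:    train a supervised classifier on the layer-2 features
        patches = np.random.default_rng(1).standard_normal((1000, 81))   # e.g. 9x9 patches
        Wd1, We1 = train_psd(patches, n_code=64)
        feats1 = np.abs(patches @ We1.T)
        Wd2, We2 = train_psd(feats1, n_code=256)
        feats2 = np.abs(feats1 @ We2.T)   # input to the Phase-5 classifier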

  • Yann LeCun

    Learned Features on natural patches: V1-like receptive fields

  • Yann LeCun

    Using PSD Features for Object Recognition

    64 filters on 9x9 patches trained with PSD with Linear-Sigmoid-Diagonal Encoder

  • Yann LeCun

    Convolutional Sparse Coding

    [Kavukcuoglu et al., NIPS 2010]: convolutional PSD

    [Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network
    [Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine
    [Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine
    [Chen, Sapiro, Dunson, Carin, preprint 2010]: Deconvolutional Network with automatic adjustment of the code dimension

  • Yann LeCun

    Convolutional Training

    Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector. But when the filters are used convolutionally, neighboring feature vectors will be highly redundant.

    Patch-level training produces lots of filters that are shifted versions of each other.

  • Yann LeCun

    Convolutional Sparse Coding

    Replace the dot products with the dictionary elements by convolutions. The input $Y$ is a full image. Each code component $Z_k$ is a feature map (an image). Each dictionary element is a convolution kernel.

    Regular sparse coding: $Y = \sum_k z_k W_k$ (each $W_k$ a dictionary element, each $z_k$ a scalar code component)

    Convolutional sparse coding: $Y = \sum_k Z_k * W_k$ (each $W_k$ a convolution kernel, each $Z_k$ a feature map)

    deconvolutional networks [Zeiler, Taylor, Fergus CVPR 2010]
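    A minimal scipy sketch of the convolutional reconstruction and the corresponding sparse-coding energy (lam and the "full" padding convention are illustrative choices; each $Z_k$ has size (H - fh + 1) x (W - fw + 1) so that the reconstruction matches the H x W image):

        import numpy as np
        from scipy.signal import convolve2d

        def conv_sc_reconstruct(Z, W):
            # Y_hat = sum_k W_k * Z_k : each Z_k is a sparse feature map,
            # each W_k a small convolution kernel
            return sum(convolve2d(Zk, Wk, mode="full") for Zk, Wk in zip(Z, W))

        def conv_sc_energy(Y, Z, W, lam=0.1):
            # ||Y - sum_k W_k * Z_k||^2 + lam * sum_k ||Z_k||_1
            R = Y - conv_sc_reconstruct(Z, W)
            return np.sum(R ** 2) + lam * sum(np.abs(Zk).sum() for Zk in Z)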

  • Yann LeCun

    Convolutional PSD: Encoder with a soft sh() Function

    Convolutional formulation: extend sparse coding from the PATCH level to the IMAGE level.

    [Figure: filters learned with PATCH-based learning vs. CONVOLUTIONAL learning]

  • Yann LeCun

    CIFAR-10 Dataset

    Dataset of tiny images. Images are 32x32 color images. 10 object categories, with 50,000 training and 10,000 test images.

    [Figure: example images]

  • Yann LeCun

    Comparative Results on the CIFAR-10 Dataset

    * Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Dept. of CS, U. of Toronto.

    ** Ranzato and Hinton. Modeling pixel means and covariances using a factorized third-order Boltzmann machine. CVPR 2010.

  • Yann LeCun

    Road Sign Recognition Competition

    GTSRB Road Sign Recognition Competition (phase 1). 32x32 images. 13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA. No. 6 is humans!

  • Yann LeCun

    Pedestrian Detection (INRIA Dataset)

    [Sermanet et al., rejected from ICCV 2011]

  • Yann LeCun

    Pedestrian Detection: Examples

    [Kavukcuoglu et al., NIPS 2010]

  • Yann LeCun

    Learning Invariant Features

  • Yann LeCun

    Why just pool over space? Why not over orientation?

    Using an idea from Hyvarinen: topographic square pooling (subspace ICA)
    1. Apply filters on a patch (with a suitable non-linearity)
    2. Arrange the filter outputs on a 2D plane
    3. Square the filter outputs
    4. Minimize the square root of the sum of each block of squared filter outputs
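    A small numpy sketch of the resulting pooled sparsity penalty, using non-overlapping 2x2 blocks on an 8x8 grid for simplicity (topographic formulations typically use overlapping neighborhoods; the grid and block sizes here are illustrative):

        import numpy as np

        def topographic_pool_penalty(z, grid=(8, 8), block=2):
            # steps 1-3: arrange the (squared) filter outputs on a 2D grid
            zmap = (z ** 2).reshape(grid)
            # step 4: sum each block of squared outputs, penalize the square roots
            h, w = grid[0] // block, grid[1] // block
            blocks = zmap[:h*block, :w*block].reshape(h, block, w, block).sum(axis=(1, 3))
            return np.sqrt(blocks + 1e-8).sum()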

  • Yann LeCun

    Why just pool over space? Why not over orientation?

    The filters arrange themselves spontaneously so that similar filters enter the same pool.

    The pooling units can be seen as complex cells

    They are invariant to local transformations of the input. For some it's translations, for others rotations, or other transformations.

  • Yann LeCun

    Pinwheels?

    Does that look pinwheely to you?

  • Yann LeCun

    Sparsity through Lateral Inhibition

  • Yann LeCun

    Invariant Features: Lateral Inhibition

    Replace the L1 sparsity term by a lateral inhibition matrix
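    One plausible way to write such a term, as a sketch rather than necessarily the exact form used here: a pairwise interaction between the absolute values of the code units, where a positive entry S_ij makes units i and j inhibit each other.

        import numpy as np

        def lateral_inhibition_penalty(z, S):
            # replaces lam * sum_j |z_j| with sum_ij S_ij |z_i| |z_j|
            a = np.abs(z)
            return a @ S @ a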

  • Yann LeCun

    Invariant Features: Lateral Inhibition

    Zeros in the S matrix have a tree structure

  • Yann LeCun

    Invariant Features: Lateral Inhibition

    Non-zero values in S form a ring in a 2D topology. Input patches are high-pass filtered.

  • Yann LeCun

    Invariant Features: Lateral Inhibition

    Non-zero values in S form a ring in a 2D topology. Left: no high-pass filtering of the input. Right: patch-level mean removal.

  • Yann LeCun

    Invariant Features: Short-Range Lateral Excitation + L1

  • Yann LeCun

    Disentangling the Explanatory Factors of Images

  • Yann LeCun

    Separating

    I used to think that recognition was all about eliminating irrelevant information while keeping the useful one

    Building invariant representations. Eliminating irrelevant variabilities.

    I now think that recognition is all about disentangling independent factors of variations:

    Separating what and where. Separating content from instantiation parameters. Hinton's capsules; Karol Gregor's what-where auto-encoders.

  • Yann LeCun

    Invariant Features through Temporal Constancy

    An object is the cross-product of an object type and instantiation parameters [Hinton 1981]

    [Figure: object type vs. object size (small, medium, large); Karol Gregor et al.]

  • Yann LeCun

    Invariant Features through Temporal Constancy

    [Diagram: what-where auto-encoder over a temporal sequence. Inputs $S^t, S^{t+1}, S^{t+2}$ pass through encoders $f_{W_1}$ (and $f_{W_2}$) to give inferred codes $C_1^t, C_1^{t+1}, C_1^{t+2}$ and $C_2^t$; the decoder (weights $W_1$, $W_2$) combines the inferred and predicted codes to produce the predicted input.]

  • Yann LeCun

    Invariant Features through Temporal Constancy

    [Figure: learned feature groups; $C_1$ codes the "where", $C_2$ codes the "what"]

  • Yann LeCun


    Generating from the Network

  • Yann LeCun

    What is the right criterion to train hierarchical feature extraction architectures?

  • Yann LeCun

    Flattening the Data Manifold?

    The manifold of all images of an object is low-dimensional and highly curvy

    Feature extractors should flatten the manifold

  • Yann LeCun

    Flattening the Data Manifold?

  • Yann LeCun

    The Ultimate Recognition System

    Bottom-up and top-down information. Top-down: complex inference and disambiguation. Bottom-up: learns to quickly predict the result of the top-down inference.

    Integrated supervised and unsupervised learning: capture the dependencies between all observed variables.

    Compositionality: each stage has latent instantiation variables.

    [Diagram: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation]
