5 years from now, everyone will learn their features (you might as well start now)
Yann LeCun
Courant Institute of Mathematical Sciences and Center for Neural Science, New York University


1. Title slide: 5 years from now, everyone will learn their features (you might as well start now). Yann LeCun, Courant Institute of Mathematical Sciences and Center for Neural Science, New York University.

2. I Have a Terrible Confession to Make
   - I'm interested in vision, but no more in vision than in audition or in other perceptual modalities.
   - I'm interested in perception (and in control).
   - I'd like to find a learning algorithm and architecture that could work (with minor changes) for many modalities. Nature seems to have found one.
   - Almost all natural perceptual signals have a local structure (in space and time) similar to images and videos: heavy correlation between neighboring variables, and local patches of variables have structure and are representable by feature vectors.
   - I like vision because it's challenging, it's useful, it's fun, we have data, and the image recognition community is not yet stuck in a deep local minimum like the speech recognition community.

3. The Unity of Recognition Architectures

4. Most Recognition Systems Are Built on the Same Architecture
   - Pipeline: Filter Bank -> Non-Linearity -> feature Pooling -> Normalization -> Classifier, often repeated over two stages (a toy sketch of one such stage follows slide 9 below).
   - First stage: dense SIFT, HOG, GIST, sparse coding, RBM, auto-encoders...
   - Second stage: K-means, sparse coding, LCC...
   - Pooling: average, L2, max, max with bias (elastic templates)...
   - Convolutional Nets: same architecture, but everything is trained.

5. Filter Bank + Non-Linearity + Pooling + Normalization
   - This model of a feature extraction stage is biologically inspired... whether you like it or not (just ask David Lowe).
   - Inspired by [Hubel and Wiesel 1962].
   - The use of this module goes back to Fukushima's Neocognitron (and even earlier models in the 60s).

6. How well does this work?
   - Typical pipeline: oriented edges / SIFT -> winner-takes-all, histogram (sum), or sparse coding -> pyramid histogram or elastic parts models -> SVM or another simple classifier.
   - Some results on Caltech-101 (I know, I know...):
     - SIFT -> K-means -> pyramid pooling -> SVM with intersection kernel: >65% [Lazebnik et al. CVPR 2006]
     - SIFT -> sparse coding on blocks -> pyramid pooling -> SVM: >75% [Boureau et al. CVPR 2010] [Yang et al. 2008]
     - SIFT -> local sparse coding on blocks -> pyramid pooling -> SVM: >77% [Boureau et al. ICCV 2011]
     - (Small) supervised ConvNet with sparsity penalty: >71%, in real time [rejected from CVPR, ICCV, etc.]

7. Convolutional Networks (ConvNets) fit that model.

8. Why do two stages work better than one stage?
   - The second stage extracts mid-level features.
   - Having multiple stages helps with the selectivity-invariance dilemma.

9. Learning Hierarchical Representations
   - Architecture: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation.
   - I agree with David Lowe: we should learn the features. It worked for speech, handwriting, NLP...
   - In a way, the vision community has been running a ridiculously inefficient evolutionary learning algorithm to learn features:
     - Mutation: tweak existing features in many different ways.
     - Selection: publish the best ones at CVPR.
     - Reproduction: combine several features from the last CVPR.
     - Iterate.
   - Problem: Moore's law works against you.
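
To make the architecture on slides 4-8 concrete, here is a minimal numpy sketch of one stage (filter bank, non-linearity, average pooling, divisive normalization). It is an illustration only, not the code behind any of the reported results: the filters, pooling window, and normalization neighbourhood are assumptions.

    import numpy as np
    from scipy.ndimage import correlate, uniform_filter

    def recognition_stage(image, filter_bank, pool_size=4, eps=1e-6):
        """One stage: filter bank -> non-linearity -> average pooling -> normalization."""
        maps = []
        for k in filter_bank:
            r = correlate(image, k, mode="nearest")    # filter bank (e.g. oriented edges)
            r = np.abs(np.tanh(r))                     # tanh non-linearity + rectification
            r = uniform_filter(r, size=pool_size)      # average pooling
            r = r[::pool_size, ::pool_size]            # spatial downsampling
            maps.append(r)
        stack = np.stack(maps)                         # (n_filters, H', W')
        # divisive normalization by a local estimate of activity across maps and space
        norm = np.sqrt(uniform_filter((stack ** 2).mean(axis=0), size=3)) + eps
        return stack / norm

    # toy usage: two oriented-edge filters on a random 32x32 "image"
    filters = [np.array([[1., 0., -1.]] * 3), np.array([[1., 0., -1.]] * 3).T]
    features = recognition_stage(np.random.rand(32, 32), filters)
    print(features.shape)   # (2, 8, 8)

Stacking two such stages and feeding the result to a classifier gives the two-stage pipeline of slide 8; in a ConvNet, the filters of both stages are learned rather than hand-designed.
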
10. Sometimes, Biology Gives You Good Hints
    - Example: contrast normalization.

11. Harsh Non-Linearity + Contrast Normalization + Sparsity
    - One stage consists of:
      - C: convolutions (filter bank), followed by rectification,
      - soft thresholding + absolute value,
      - N: subtractive and divisive local contrast normalization,
      - P: pooling / downsampling layer (average or max?).
    - THIS IS ONE STAGE OF THE CONVNET.

12. Soft Thresholding Non-Linearity

13. Local Contrast Normalization
    - Performed on the state of every layer, including the input.
    - Subtractive local contrast normalization: subtracts from every value in a feature map a Gaussian-weighted average of its neighbors (a high-pass filter).
    - Divisive local contrast normalization: divides every value in a layer by the standard deviation of its neighbors over space and over all feature maps.
    - Subtractive + divisive LCN performs a kind of approximate whitening.

14. Caltech-101 Performance (I know, I know)
    - Small network: 64 features at stage 1, 256 features at stage 2.
    - Tanh non-linearity, no rectification, no normalization: 29%.
    - Tanh non-linearity, rectification, normalization: 65%.
    - Shrink non-linearity, rectification, normalization, sparsity penalty: 71%.

15. Results on Caltech-101 with Sigmoid Non-Linearity
    - Like the HMAX model.

16. Feature Learning Works Really Well
    - ...on everything but Caltech-101.

17. Caltech-101 Is Very Unfavorable to Learning-Based Systems
    - Because it's so small. We are switching to ImageNet.
    - Some results on NORB: no normalization, random filters, unsupervised filters, supervised filters, unsupervised + supervised filters.

18. Sparse Auto-Encoders
    - Inference by gradient descent starting from the encoder output (see the sketch after slide 22 below).
    - Energy: E(Y^i, Z) = ||Y^i - W_d Z||^2 + lambda * sum_j |z_j| + ||Z - g_e(W_e, Y^i)||^2
    - Inference: Z^i = argmin_z E(Y^i, z; W)
    - Diagram: the input Y^i feeds a decoder W_d Z and an encoder g_e(W_e, Y^i); Z are the FEATURES.

19.-22. Using PSD to Train a Hierarchy of Features (one phase added per slide)
    - Phase 1: train the first layer using PSD.
    - Phase 2: use the encoder + absolute value as feature extractor.
    - Phase 3: train the second layer using PSD.
    - Phase 4: use the encoder + absolute value as the second feature extractor.
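
Slide 18's energy and slide 12's soft-thresholding ("shrink") non-linearity can be sketched as follows. This is a rough illustration under stated assumptions: the encoder is taken to be a linear map followed by soft thresholding, the dimensions, step size, and iteration count are made up, and the smooth terms are minimized with a proximal-gradient step rather than plain gradient descent.

    import numpy as np

    def soft_threshold(x, lam):
        # soft-thresholding ("shrink") non-linearity, slide 12
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    def psd_infer(y, Wd, We, lam=0.1, steps=50, lr=0.05):
        """Minimize E(y, z) = ||y - Wd z||^2 + lam*||z||_1 + ||z - ge(We, y)||^2."""
        z = soft_threshold(We @ y, lam)              # start from the encoder output ge(We, y)
        z_enc = z.copy()
        for _ in range(steps):
            grad = (-2 * Wd.T @ (y - Wd @ z)         # gradient of the reconstruction term
                    + 2 * (z - z_enc))               # gradient of the prediction term
            z = soft_threshold(z - lr * grad, lr * lam)   # proximal step for the L1 term
        return z

    # toy usage: a random 9x9 "patch" (81 pixels) with 64 dictionary elements
    rng = np.random.default_rng(0)
    y = rng.standard_normal(81)
    Wd = rng.standard_normal((81, 64)) * 0.1
    We = rng.standard_normal((64, 81)) * 0.1
    z = psd_infer(y, Wd, We)
    print(z.shape, np.count_nonzero(z))

In PSD training (slides 19-22), W_d and W_e are optimized jointly with the sparse codes so that the one-shot encoder output approximates the result of this iterative inference, which is why the encoder alone can then be used as a fast feature extractor.
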
23. Using PSD to Train a Hierarchy of Features
    - Phase 1: train the first layer using PSD.
    - Phase 2: use the encoder + absolute value as feature extractor.
    - Phase 3: train the second layer using PSD.
    - Phase 4: use the encoder + absolute value as the second feature extractor.
    - Phase 5: train a supervised classifier on top.
    - Phase 6 (optional): train the entire system with supervised back-propagation.

24. Learned Features on Natural Patches: V1-Like Receptive Fields

25. Using PSD Features for Object Recognition
    - 64 filters on 9x9 patches trained with PSD with a Linear-Sigmoid-Diagonal encoder.

26. Convolutional Sparse Coding
    - [Kavukcuoglu et al., NIPS 2010]: convolutional PSD.
    - [Zeiler, Krishnan, Taylor, Fergus, CVPR 2010]: Deconvolutional Network.
    - [Lee, Gross, Ranganath, Ng, ICML 2009]: Convolutional Boltzmann Machine.
    - [Norouzi, Ranjbar, Mori, CVPR 2009]: Convolutional Boltzmann Machine.
    - [Chen, Sapiro, Dunson, Carin, preprint 2010]: Deconvolutional Network with automatic adjustment of the code dimension.

27. Convolutional Training
    - Problem: with patch-level training, the learning algorithm must reconstruct the entire patch with a single feature vector.
    - But when the filters are used convolutionally, neighboring feature vectors are highly redundant.
    - Patch-level training produces lots of filters that are shifted versions of each other.

28. Convolutional Sparse Coding
    - Replace the dot products with dictionary elements by convolutions:
      - the input Y is a full image,
      - each code component Z_k is a feature map (an image),
      - each dictionary element W_k is a convolution kernel.
    - Regular sparse coding vs. convolutional sparse coding: Y = sum_k W_k * Z_k.
    - Same idea as deconvolutional networks [Zeiler, Taylor, Fergus, CVPR 2010].

29. Convolutional PSD: Encoder with a Soft sh() Function
    - Convolutional formulation: extend sparse coding from the PATCH level to the IMAGE level.
    - Patch-based learning vs. convolutional learning.

30. CIFAR-10 Dataset
    - Dataset of tiny images: 32x32 color images, 10 object categories, 50,000 training and 10,000 test images.
    - Example images.

31. Comparative Results on the CIFAR-10 Dataset
    - * Krizhevsky, "Learning multiple layers of features from tiny images," Master's thesis, Dept. of CS, University of Toronto.
    - ** Ranzato and Hinton, "Modeling pixel means and covariances using a factorized third-order Boltzmann machine," CVPR 2010.

32. Road Sign Recognition Competition
    - GTSRB Road Sign Recognition Competition (phase 1), 32x32 images.
    - 13 of the top 14 entries are ConvNets: 6 from NYU, 7 from IDSIA.
    - No. 6 is humans!

33. Pedestrian Detection (INRIA Dataset)
    - [Sermanet et al., rejected from ICCV 2011]

34. Pedestrian Detection: Examples
    - [Kavukcuoglu et al., NIPS 2010]

35. Learning Invariant Features

36. Why just pool over space? Why not over orientation?
    - Using an idea from Hyvarinen: topographic square pooling (subspace ICA); a toy sketch follows after this slide.
    - 1. Apply filters on a patch (with a suitable non-linearity).
    - 2. Arrange the filter outputs on a 2D plane.
    - 3. Square the filter outputs.
    - 4. Minimize the square root of the sum of blocks of squared filter outputs.
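
A toy numpy sketch of the topographic square-pooling penalty in steps 1-4 of slide 36. The 8x8 grid, the 3x3 overlapping blocks, and the random "filter outputs" are assumptions for illustration, not the configuration used in the talk.

    import numpy as np

    def topographic_penalty(filter_outputs, grid=(8, 8), pool=3):
        """Sum over overlapping blocks of sqrt(sum of squared filter outputs)."""
        plane = filter_outputs.reshape(grid) ** 2          # 2. arrange on a 2D plane, 3. square
        penalty = 0.0
        for i in range(grid[0] - pool + 1):                # 4. sqrt of sums over blocks
            for j in range(grid[1] - pool + 1):
                penalty += np.sqrt(plane[i:i + pool, j:j + pool].sum())
        return penalty

    # toy usage: 64 filter outputs assumed to come from step 1 (filtering a patch)
    rng = np.random.default_rng(0)
    outputs = rng.standard_normal(64)
    print(topographic_penalty(outputs))

Minimizing a penalty of this form during training encourages filters whose outputs fall in the same block to act as a group, which is what produces the complex-cell-like pools described on slide 37.
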
37. Why just pool over space? Why not over orientation? (continued)
    - The filters arrange themselves spontaneously so that similar filters enter the same pool.
    - The pooling units can be seen as complex cells.
    - They are invariant to local transformations of the input: for some it is translations, for others rotations or other transformations.

38. Pinwheels?
    - Does that look pinwheely to you?

39. Sparsity through Lateral Inhibition

40. Invariant Features: Lateral Inhibition
    - Replace the L1 sparsity term by a lateral inhibition matrix (a hedged sketch appears after the last slide).

41. Invariant Features: Lateral Inhibition
    - Zeros in the S matrix have a tree structure.

42. Invariant Features: Lateral Inhibition
    - Non-zero values in S form a ring in a 2D topology.
    - Input patches are high-pass filtered.

43. Invariant Features: Lateral Inhibition
    - Non-zero values in S form a ring in a 2D topology.
    - Left: no high-pass filtering of the input. Right: patch-level mean removal.

44. Invariant Features: Short-Range Lateral Excitation + L1

45. Disentangling the Explanatory Factors of Images

46. Separating
    - I used to think that recognition was all about eliminating irrelevant information while keeping the useful one: building invariant representations, eliminating irrelevant variabilities.
    - I now think that recognition is all about disentangling independent factors of variation: separating "what" and "where", separating content from instantiation parameters.
    - Hinton's capsules; Karol Gregor's what-where auto-encoders.

47. Invariant Features through Temporal Constancy
    - An object is the cross-product of object type and instantiation parameters [Hinton 1981].
    - Figure: object type x object size (small, medium, large) [Karol Gregor et al.].

48. Invariant Features through Temporal Constancy
    - Architecture diagram: an encoder/decoder pair over input frames S^t, S^t-1, S^t-2, with per-frame inferred codes C1, a shared code C2, and predicted codes and inputs produced through weights W1 and W2.

49. Invariant Features through Temporal Constancy
    - C1 (where), C2 (what).

50. Generating from the Network

51. What is the right criterion to train hierarchical feature extraction architectures?

52. Flattening the Data Manifold?
    - The manifold of all images of [...] is low-dimensional and highly curvy.
    - Feature extractors should flatten the manifold.

53. Flattening the Data Manifold?

54. The Ultimate Recognition System
    - Architecture: Trainable Feature Transform -> Trainable Feature Transform -> Trainable Classifier, with a learned internal representation.
    - Bottom-up and top-down information:
      - top-down: complex inference and disambiguation,
      - bottom-up: learns to quickly predict the result of the top-down inference.
    - Integrated supervised and unsupervised learning.
    - Capture the dependencies between all observed variables.
    - Compositionality: each stage has latent instantiation variables.
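
Finally, a hedged sketch of the lateral-inhibition sparsity term of slides 40-43. The slides only state that the L1 term is replaced by a lateral inhibition matrix S whose zero pattern (tree- or ring-structured) decides which units may be active together; the quadratic form over absolute values used below, and the toy 4x4 grid neighbourhood, are assumptions for illustration, not the exact construction from the talk.

    import numpy as np

    def lateral_inhibition_penalty(z, S):
        """Assumed sparsity term sum_ij S_ij |z_i| |z_j| replacing the plain L1 norm."""
        a = np.abs(z)
        return a @ S @ a

    # toy usage: 16 code units on a 4x4 topographic grid; units inhibit every other
    # unit except their immediate grid neighbours (the "zeros in S", a made-up rule).
    n, side = 16, 4
    S = np.ones((n, n)) - np.eye(n)
    for i in range(n):
        for j in range(n):
            if abs(i // side - j // side) <= 1 and abs(i % side - j % side) <= 1:
                S[i, j] = 0.0              # no inhibition between neighbouring units
    rng = np.random.default_rng(0)
    z = rng.standard_normal(n)
    print(lateral_inhibition_penalty(z, S))

With a term of this shape, units whose mutual entries in S are zero can be active together at no cost, so the zero pattern of S (tree or ring) determines which groups of features may co-activate.
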