UNSUPERVISED FEATURE LEARNING VIA SPARSE HIERARCHICAL REPRESENTATIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Honglak Lee
August 2010

Abstract

Machine learning has proved a powerful tool for artificial intelligence and data mining problems. However, its success has usually relied on having a good feature representation of the data, and a poor representation can severely limit the performance of learning algorithms. These feature representations are often hand-designed, require significant amounts of domain knowledge and human labor, and do not generalize well to new domains.

To address these issues, I will present machine learning algorithms that can automatically learn good feature representations from unlabeled data in various domains, such as images, audio, text, and robotic sensors. Specifically, I will first describe how efficient sparse coding algorithms, which represent each input example using a small number of basis vectors, can be used to learn good low-level representations from unlabeled data. I also show that this gives feature representations that yield improved performance in many machine learning tasks.

In addition, building on the deep learning framework, I will present two new algorithms, sparse deep belief networks and convolutional deep belief networks, for building more complex, hierarchical representations, in which more complex features are automatically learned as a composition of simpler ones. When applied to images, this method automatically learns features that correspond to objects and decompositions of objects into object-parts. These features often lead to performance competitive with or better than highly hand-engineered computer vision algorithms in object recognition and segmentation tasks. Further, the same algorithm can be used to learn feature representations from audio data.
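The sparse coding approach described above admits a compact statement. The following is a conventional sketch of the L1-regularized objective, with notation chosen for illustration rather than quoted from the thesis: each input x^(i) is reconstructed from a small set of basis vectors b_j with activations s^(i), and the L1 penalty encourages most activations to be zero.

```latex
% Sparse coding: reconstruct each input from a few basis vectors,
% with an L1 penalty on the activations to encourage sparsity.
\min_{\{b_j\},\,\{s^{(i)}\}} \;
  \sum_{i=1}^{m} \Bigl\| x^{(i)} - \sum_{j=1}^{n} b_j\, s_j^{(i)} \Bigr\|_2^2
  \;+\; \beta \sum_{i=1}^{m} \bigl\| s^{(i)} \bigr\|_1
\qquad \text{subject to} \quad \| b_j \|_2^2 \le c \quad \text{for all } j
```

Holding the bases fixed, the problem in each s^(i) is an L1-regularized least-squares problem (addressed in Chapter 2 by the feature-sign search algorithm); holding the activations fixed, learning the bases is a norm-constrained least-squares problem (addressed via its Lagrange dual).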
In particular, the learned features yield improved performance over state-of-the-art methods in several speech recognition tasks.

Acknowledgements

Most of all, I would like to thank my advisor, Andrew Ng. It has been a privilege and truly an honor to have him as a mentor. Andrew has been an amazing mentor and advisor, not only in research but also in other aspects of academic life. I cannot thank you enough.

I also would like to thank all my committee members: Daphne Koller and Krishna Shenoy, who were my reading committee, as well as Jay McClelland and Kai Yu, who were my defense committee. It has been truly a great privilege and honor to have them as mentors, and I received invaluable advice and constructive feedback on the research. Thank you so much.

I also would like to thank all the lab members of Andrew Ng's machine learning group, especially Rajat Raina, with whom I collaborated extensively. I also thank other former and current lab members: Pieter Abbeel, Ashutosh Saxena, Tom Do, Zico Kolter, Morgan Quigley, Adam Coates, Quoc Le, Olga Russakovsky, Jiquan Ngiam, and Andrew Maas. I thank my friends and colleagues at Stanford: Su-In Lee, Stephen Gould, Alexis Battle, Ben Packer, Suchi Saria, Varun Ganapathi, Alex Teichman, Jenny Finkel, Yun-Hsuan Sung, David Jackson, David Stavens, Roger Grosse, Chaitu Ekanadham, Rajesh Ranganath, Peter Pham, Yan Largman, Jaewon Yang, Myunghwan Kim, Dongjun Shin, and Jinsung Kwon. It has been such a wonderful experience and privilege to know and work with you.

Finally, I thank my parents and family for their love and support. Without their support, this thesis would not have been possible. Especially, I thank my wife Youngjoo, who has always been supportive and encouraging.

Contents

Abstract
Acknowledgements

1 Introduction
  1.1 Motivation
  1.2 Unsupervised Feature Learning
  1.3 Related work
    1.3.1 Learning features from labeled data
    1.3.2 Using unlabeled data to improve supervised learning tasks
    1.3.3 Generic unsupervised learning algorithms
    1.3.4 Deep learning
  1.4 Proposed approach
  1.5 Summary of contributions
  1.6 First published appearances of the described contributions
  1.7 Organization

2 Efficient Sparse Coding Algorithms
  2.1 Introduction
  2.2 Formulation
  2.3 L1-regularized least squares: The feature-sign search algorithm
  2.4 Learning bases using the Lagrange dual
  2.5 Experimental results
    2.5.1 The feature-sign search algorithm
    2.5.2 Total time for learning bases
    2.5.3 Learning highly overcomplete natural image bases
    2.5.4 Replicating complex neuroscience phenomena
  2.6 Application to self-taught learning
  2.7 Other related work and applications
  2.8 Summary

3 Exponential Family Sparse Coding
  3.1 Introduction
  3.2 Self-taught Learning for Discrete Inputs
  3.3 Exponential Family Sparse Coding
    3.3.1 Computing optimal activations
  3.4 Computational Efficiency
  3.5 Application to self-taught learning
    3.5.1 Text classification
    3.5.2 Robotic perception
  3.6 Discussion
  3.7 Summary

4 Sparse Deep Belief Networks
  4.1 Introduction
  4.2 Algorithm
    4.2.1 Sparse restricted Boltzmann machines
    4.2.2 Learning deep networks using sparse RBM
    4.2.3 Discussion
  4.3 Visualization
    4.3.1 Learning pen-strokes from handwritten digits
    4.3.2 Learning from natural images
    4.3.3 Learning a two-layer model of natural images using sparse RBMs
  4.4 Experimental results
    4.4.1 Biological comparison
    4.4.2 Machine learning applications
  4.5 Summary

5 Convolutional Deep Belief Networks
  5.1 Introduction
  5.2 Algorithm
    5.2.1 Notation
    5.2.2 Convolutional RBM
    5.2.3 Probabilistic max-pooling
    5.2.4 Training via sparsity regularization
    5.2.5 Convolutional deep belief network
    5.2.6 Hierarchical probabilistic inference
    5.2.7 Discussion
  5.3 Experimental results
    5.3.1 Learning hierarchical representations from natural images
    5.3.2 Self-taught learning for object recognition
    5.3.3 Handwritten digit classification
    5.3.4 Unsupervised learning of object parts
    5.3.5 Hierarchical probabilistic inference
  5.4 Multi-class image segmentation
  5.5 Summary

6 Convolutional DBNs for Audio Classification
  6.1 Introduction
  6.2 Algorithm
    6.2.1 Convolutional deep belief networks for time-series data
    6.2.2 Application to audio data
  6.3 Unsupervised feature learning
    6.3.1 Training on unlabeled TIMIT data
    6.3.2 Visualization
  6.4 Application to speech recognition tasks
    6.4.1 Speaker identification
    6.4.2 Speaker gender classification
    6.4.3 Phone classification
  6.5 Application to music classification tasks
    6.5.1 Music genre classification