1. Deep Learning with Python: getting started and getting from
ideas to insights in minutes. PyData Seattle 2015. Alex Korbonits
(@korbonits)
2. About Me: Alex Korbonits, Data Scientist at Nuiku, Inc.
Seattleite. Huge math/philosophy/music/art nerd.
3. You may think you need to have
4. in order to do Deep Learning. That is not the case. There's a
lot you can do with a little.
5. Yann LeCun, Geoff Hinton, Yoshua Bengio, and Andrew Ng
6. What is deep learning? A subset of machine learning and AI.
Yes, artificial neural networks are inspired by the brain, BUT they
are usually created to perform a specific task rather than to mimic
the brain. Deep: many-layered neural networks.
7. Perceptron. Rosenblatt, 1957, Cornell Aeronautical Laboratory,
funded by the Office of Naval Research. A linear classifier, designed
for image recognition. Inputs x and weights w are linearly combined to
produce an output y.
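A minimal NumPy sketch of that forward pass (the weights, bias, and step activation below are illustrative, not taken from the slides):

import numpy as np

def perceptron(x, w, b):
    # Linear combination of inputs and weights, thresholded to a binary output.
    return 1 if np.dot(w, x) + b > 0 else 0

# Illustrative weights for a simple OR-like decision.
w = np.array([1.0, 1.0])
b = -0.5
print(perceptron(np.array([0, 1]), w, b))  # -> 1
print(perceptron(np.array([0, 0]), w, b))  # -> 0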
8. XOR. What's great about perceptrons? They are linear
classifiers. What's wrong with this picture? They can't classify
non-linearly-separable problems such as XOR (the counterexample to everything).
Minsky & Papert in Perceptrons (1969): it's impossible for
single-layer perceptrons to learn XOR.
9. Multilayer Perceptrons vs.
10. Enter the multilayer perceptron. With one hidden layer, a
multilayer perceptron (which can now figure out XOR) is capable of
arbitrary function approximation. This is where the math nerds get
excited. Woot! Supervised, semi-supervised, unsupervised, and
reinforcement learning applications. Flexible architectural
components (layer types, connection types, regularization techniques)
allow for empirical tinkering. Think of playing with Lego.
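To make the XOR point concrete, here is a hand-wired two-layer perceptron in NumPy: the hidden units compute OR and AND, and the output unit combines them into XOR. The weights are chosen by hand for illustration rather than learned:

import numpy as np

def step(z):
    # Heaviside step activation.
    return np.where(z > 0, 1, 0)

def xor_mlp(x1, x2):
    h = step(np.array([x1 + x2 - 0.5,    # hidden unit 1 fires like OR
                       x1 + x2 - 1.5]))  # hidden unit 2 fires like AND
    return int(step(h[0] - h[1] - 0.5))  # output unit: OR and not AND = XOR

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_mlp(a, b))           # prints 0, 1, 1, 0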
11. DON'T BE SCARED BY THE MATH!
12. Backpropagation. Who remembers their first quarter of calculus? All we're
going to do is take a derivative. This diagram is a representation
of the chain rule: here, we take the derivative of
z, which is a function of two variables x and y, each a function of
variables s and t.
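Written out, this is the standard multivariable chain rule, using the variables from the slide:

\frac{\partial z}{\partial s} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial s} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial s},
\qquad
\frac{\partial z}{\partial t} = \frac{\partial z}{\partial x}\frac{\partial x}{\partial t} + \frac{\partial z}{\partial y}\frac{\partial y}{\partial t}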
13. Backpropagation. A simple learning algorithm that starts from a
total output error E defined by some loss function. For example, a
typical loss function for a multi-class classification task is log
loss.
14. Backpropagation. E is a function of all of its inputs, i.e.,
all of the incoming connections to the output unit of a neural
network: the network outputs a class-membership prediction, and that
prediction is checked against a ground-truth label.
15. Backpropagation. We then show a simple derivation of the
change in error as a function of each connection weight w_ij. This
gives a formula for updating each w_ij according to the learning
algorithm. There are different algorithms to do this, such as
SGD.
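In symbols (a standard formulation, not copied from the slides): with log loss over K classes, the error and the SGD weight update are

E = -\sum_{k=1}^{K} y_k \log \hat{y}_k,
\qquad
w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial E}{\partial w_{ij}},

where y_k is the one-hot ground-truth label, \hat{y}_k is the predicted probability for class k, and \eta is the learning rate.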
16. APPLICATIONS AND TOOLS. Wherefore and how
17. Motivation. We're at PyData, and we've got some motivating
deep learning concepts. What are some of the practical applications
and tools you can use? Deep learning techniques have recently
beaten many long-standing benchmarks.
18. Some common applications. Computer vision tasks:
classification, segmentation, facial recognition. NLP tasks: automatic
speech recognition (ASR), machine translation, POS tagging, sentiment
analysis, natural language understanding (NLU).
19. Some common tools. Torch (NYU, Facebook AI, Google DeepMind),
Caffe (Berkeley, Google), Theano (Univ. Montreal), GraphLab-Create
(Dato, Inc.). Under active development: Neon (Nervana Systems),
DeepLearning4j running on Apache Spark.
20. Torch. Created/used by NYU, Facebook, Google DeepMind. De
rigueur for deep learning research. Its language is Lua, NOT Python,
though Lua's syntax is somewhat Pythonic. Check it out. Torch's main
strengths are its features, which is why I mention it even though here
we are at PyData. See http://bit.ly/1KzuFhd for a closer look.
21. Caffe. Created/used by Berkeley, Google. Best tool to get
started with: lots of pre-trained reference models, lots of standard
deep learning datasets, and networks that are easy to configure with config
files. See http://bit.ly/1Db2bHT to get started.
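A minimal pycaffe sketch of running a pre-trained model forward. The file names are placeholders for whichever deploy prototxt, weights, and image you have on hand, and the 'prob' output blob name is an assumption that matches the BVLC reference models:

import caffe

# Placeholder paths; substitute a reference model downloaded with Caffe.
net = caffe.Net('deploy.prototxt', 'reference_model.caffemodel', caffe.TEST)

# Standard preprocessing: HxWxC float image in [0,1] -> CxHxW, scaled to [0,255], BGR.
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))
transformer.set_raw_scale('data', 255)
transformer.set_channel_swap('data', (2, 1, 0))

image = caffe.io.load_image('cat.jpg')
net.blobs['data'].data[...] = transformer.preprocess('data', image)
out = net.forward()
print(out['prob'].argmax())  # index of the most likely class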
22. Theano. Created/used by the University of Montreal. Very
flexible, very sophisticated: its lower-level interface allows for lots
of customization, and lots of libraries are being built ON TOP of Theano,
e.g. Keras, PyLearn2, Lasagne, etc. Pythonic API, and very well
documented. See http://bit.ly/1KBsMAv to get started.
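A tiny sketch of Theano's symbolic style, which those higher-level libraries build on; note that T.grad gives you derivatives (the backprop machinery) essentially for free:

import theano
import theano.tensor as T

x = T.dscalar('x')
y = x ** 2 + 3 * x           # build a symbolic expression
dy_dx = T.grad(y, x)         # symbolic differentiation
f = theano.function([x], [y, dy_dx])
print(f(2.0))                # [array(10.0), array(7.0)]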
23. GraphLab-Create Created by the wonderful folks at Dato,
Inc. User friendly, picks intelligent defaults. TONS of features,
AND all are state of the art. Blazing fast out-of-core computations
on small/medium/big data. Pythonic API, with amazing documentation.
See http://bit.ly/1LZVqLS to get started.
24. Under Active Development. Neon: Nervana Systems has released
a blazing-fast engine for training and testing DNNs that beats a lot
of benchmarks compared to other leading tools. DeepLearning4j: being
developed to run on top of Apache Spark; the PySpark possibilities
there are huge.
25. NETWORK TOPOLOGIES. Applications and examples
26. Convolutional Neural Networks. Named for one of the
principal layer types: the convolutional layer. MNIST and LeNet: used
in the '80s by folks such as Yann LeCun for handwritten digit
recognition for ATMs. ImageNet and AlexNet: a new-ish computer vision
competition; in 2012, the winning submission used a deep CNN. This
has completely changed how submissions are made: from handwritten
features crafted over decades to deep nets. Text understanding
from scratch: character-level inputs into CNNs for high-level
semantic knowledge.
27. Convolution. What is a convolution? One way to think of it
is kind of like REDUCE, but our example (next slide) is 2D, since
we're doing convolutions of 2D images! Here's a short clip to guide
intuition (next slide).
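A minimal NumPy sketch of a 2D "valid" convolution like the one animated in the clip. Strictly speaking this computes cross-correlation, which is what CNN convolution layers actually do; the image and kernel values are illustrative:

import numpy as np

def conv2d(image, kernel):
    # Slide the kernel over the image and sum elementwise products at each position.
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])    # a simple vertical-edge detector
print(conv2d(image, edge_kernel))          # 3x3 feature map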
28. Convolution http://bit.ly/1gquFDB
29. Let's talk about computer vision. Let's look at AlexNet.
30. AlexNet (Krizhevsky et al. 2012). Won the 2012 ImageNet
competition. Hard and interesting: classification of 1,000 object classes.
BEAT THE PANTS off of all previous attempts, which relied on
hand-engineered features that had been studied and improved for
decades: AlexNet's millions of parameters are learned via backprop!
31. AlexNet (Krizhevsky et al. 2012) When AlexNet is processing
an image, this is what is happening at each layer. The size of the
last layer is the number of classes
32. AlexNet (Krizhevsky et al. 2012) When AlexNet is processing
an image, this is what is happening at each layer. The last layer
takes a lot of abstraction and richness as its input
33. AlexNet (Krizhevsky et al. 2012) When AlexNet is processing
an image, this is what is happening at each layer. It then outputs
a vote of confidence as to which class the image belongs to
34. AlexNet (Krizhevsky et al. 2012) When AlexNet is processing
an image, this is what is happening at each layer. The class with
the highest likelihood is the one the DNN selects
35. AlexNet (Krizhevsky et al. 2012) When AlexNet is processing
an image, this is what is happening at each layer. In this
case
36. AlexNet (Krizhevsky et al. 2012) When AlexNet is processing
an image, this is what is happening at each layer. It's a cat!
37. AlexNet. This is an example of classification with AlexNet:
the top five class predictions for each image, with the correct
classification shown in red.
38. GoogLeNet. Networks keep getting larger and larger, with no
end in sight. Remember AlexNet? It was a monster in 2012 with its
eight learned layers. GoogLeNet, from 2014, uses what it calls Inception
modules to improve its convolutions. They're getting deeper.
39. Recurrent Neural Networks. Learning sequences of
words/characters/anything. A few well-known varieties: plain
vanilla RNNs, Long Short-Term Memory (LSTM) RNNs, attention
mechanisms. HOT right now for video scene descriptions, question-and-
answer systems, and text.
40. Recurrent Neural Networks. RNNs are different from
convolutional nets in that their connections don't only go up and down:
they can connect sideways within the same layer. There are even
architectures that go in both directions.
41. Word2Vec: a neural network for finding a high-dimensional
representation per word (Mikolov et al. '13). Skip-gram model: from a
word, predict nearby words in the sentence. [Diagram: each word of the
example sentence "Awesome deep learning talk at PyData" is fed through
the neural net to a 300-dim representation, viewed as deep features.]
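A hedged gensim sketch of training skip-gram vectors on your own corpus. The toy sentences and parameter values are illustrative, and this assumes the 2015-era gensim API where size= sets the vector dimension:

from gensim.models import Word2Vec

# Toy corpus: in practice you would feed many tokenized sentences.
sentences = [["awesome", "deep", "learning", "talk", "at", "pydata"],
             ["deep", "learning", "talk", "in", "seattle"]]

model = Word2Vec(sentences, size=300, window=5, sg=1, min_count=1)  # sg=1 -> skip-gram
vector = model["deep"]                      # 300-dim representation of a word
print(vector.shape)                         # (300,)
print(model.most_similar("deep", topn=3))   # nearest words in the embedding space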
42. Related words are placed nearby in the high-dimensional space. Projecting the
300-dim space into 2 dims with PCA (Mikolov et al. '13).
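A small scikit-learn sketch of that projection, assuming you already have a matrix of word vectors (the random matrix here is a stand-in for real 300-dim embeddings, e.g. from the gensim model above):

import numpy as np
from sklearn.decomposition import PCA

words = ["king", "queen", "man", "woman"]            # illustrative vocabulary
vectors = np.random.rand(len(words), 300)            # stand-in for real word vectors

coords = PCA(n_components=2).fit_transform(vectors)  # project 300 dims down to 2
for word, (x, y) in zip(words, coords):
    print(word, x, y)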
43. Ulysses on Fire with Torch. This is how my favorite book,
James Joyce's 1922 novel Ulysses, famously begins and famously
ends:
44. Ulysses on Fire with Torch I Stately, plump Buck Mulligan
came from the stairhead, bearing a bowl of lather on which a mirror
and a razor lay crossed. ... yes I said yes I will Yes.
Trieste-Zurich-Paris 1914-1921
45. Ulysses on Fire with Torch. After 17 iterations over the
training data, this is what my LSTM RNN can generate:
46. Generating Joycean Prose Bloom works. Quick! Pollyman. An a
lot it was seeming, mide, says, up and the rare borns at
Leopolters! Cilleynan's face. Childs hell my milk by their doubt in
thy last, unhall sit attracted with source The door of Kildan and
the followed their stowabout over that of three constant trousantly
Vinisis Henry Doysed and let up to a man with hands in surresses
afraid quarts to here over someware as cup to a whie yellow accept
thicks answer to me.
47. Ulysses is a tough example. Remember that Ulysses is only
1.5 MB, and that this is trained character by character: the model has no
knowledge of English or of language at all. Notice some of the emergent
properties of this prose: punctuation, indentation, and more.
Longer samples correctly show underlining (markdown formatted) and
properly formed parentheticals (classically a tough
problem in NLP due to variable-length issues).
48. Recursive Neural Tensor Networks. Capturing natural
language's recursive nature and handling variable-length sentences.
Created by applying the same set of weights recursively over a
structure. Uses: natural language inference, learning logical semantics,
and learning vector representations of words, multi-word phrases, grammar,
and multilingual phrase pairs.
50. Deep Unsupervised Learning. It's possible to train neurons to
be selective for high-level concepts using entirely unlabeled data.
Le et al. 2012 used a 9-layered, locally connected sparse
autoencoder with pooling and local contrast normalization: 1
billion parameters trained on 10 million images. 15.8% accuracy on
ImageNet's 22,000 categories; great at recognizing cats & humans.
51. Totally unsupervised!
52. QuocNet Optimal stimulus for two units according to
numerical constraint optimization.
53. Transfer Learning - An old idea, explored by Donahue et al.,
2014. - Steps: - Get some data and a pre-trained DNN. - Propagate
unseen data (that fits the DNN) through it. - Extract the outputs of
some layer before the final output. - Use these as feature vectors.
- You can then do supervised/unsupervised learning with these features.
54. Example: image similarity. - Distance between images'
extracted features: each set of extracted features forms a
vector. - Images whose deep visual features are similar have similar
feature vectors. - We can measure quantitatively how
similar two images are by taking the Euclidean distance between
these feature vectors. - More similar images are closer together,
distance-wise, in that space.
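A minimal NumPy sketch of that comparison, assuming deep features have already been extracted for each image (the random vectors are stand-ins for activations pulled from a late layer of a pre-trained net, as in the Caffe sketch earlier):

import numpy as np

# Stand-ins for extracted deep feature vectors of images A, B, and C.
features = {
    "A": np.random.rand(4096),
    "B": np.random.rand(4096),
    "C": np.random.rand(4096),
}

def distance(a, b):
    # Euclidean distance between two feature vectors: smaller means more similar.
    return np.linalg.norm(features[a] - features[b])

for pair in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(pair, distance(*pair))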
56. Deep Reinforcement Learning - DeepMind's Deep Q-network
agent - Pixels and the game score are the only inputs - Comparable to a
professional human game tester - Across a set of 49 games - Same
algorithm, network, and hyperparameters.
57. APPENDIX I: VISUALIZATION. What's going on under the
hood?
58. A view of AlexNet (Krizhevsky et al. 2012). Helpful, but
doesn't give intuition. On the following slides, we show random test
images along with a subset of the feature activation maps in the
indicated layer.
68. APPENDIX II: PITFALLS. We've still got a lot of learnin' to
do
69. DNN INTERPRETATION AND INTUITION
70. DNNs are hard to interpret: their parameters are learned via
backpropagation. DNNs have counter-intuitive properties. DNNs'
expressive powers come with subtle limitations.
71. Fool me once, shame on you. Szegedy et al., 2013. The authors
imperceptibly alter correctly classified images to fool DNNs (LeNet,
AlexNet, QuocNet). They call such inputs adversarial examples.
72. Ostrich, Struthio camelus, right? WRONG. Left: correctly
predicted sample. Center: 10x the difference between the Left and Right
columns. Right: classified as ostrich, Struthio camelus.
73. Fool me twice, shame on me. Nguyen et al., 2014. The authors look
at counter-intuitive properties of DNNs per Szegedy et al., 2013.
It is easy to produce images that are unrecognizable to humans, yet
which DNNs are almost certain belong to familiar classes. The
authors call these fooling images.
74. Directly encoded fooling images. These evolved images are
unrecognizable to humans, yet DNNs trained on ImageNet believe with
near certainty that they are a familiar object.
75. Indirectly encoded fooling images. These evolved images are
unrecognizable to humans, yet DNNs trained on ImageNet believe with
near certainty that they are a familiar object.
76. Tip: train with adversarial examples. It adds more
regularization than dropout! Szegedy et al., 2013: "These results
suggest that the deep neural networks that are learned by
backpropagation have nonintuitive characteristics and intrinsic
blind spots, whose structure is connected to the data distribution
in a non-obvious way." Nguyen et al., 2014: "The fact that DNNs are
increasingly used in a wide variety of industries, including
safety-critical ones such as driverless cars, raises the
possibility of costly exploits via techniques that generate fooling
images."
77. Bibliography Csáji, Balázs Csanád. "Approximation with
artificial neural networks." Faculty of Sciences, Eötvös Loránd
University, Hungary 24 (2001). Donahue, J., Jia, Y., Vinyals, O.,
Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. DeCAF: A deep
convolutional activation feature for generic visual recognition. In
JMLR, 2014. Goodfellow, Ian J., Jonathon Shlens, and Christian
Szegedy. "Explaining and harnessing adversarial examples." arXiv
preprint arXiv:1412.6572 (2014). Hermann, Karl Moritz, Tomáš Kočiský,
Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. "Teaching Machines to Read and Comprehend." arXiv
preprint arXiv:1506.03340 (2015). Hornik, Kurt, Maxwell
Stinchcombe, and Halbert White. "Multilayer feedforward networks
are universal approximators." Neural Networks 2, no. 5 (1989):
359-366. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton.
"ImageNet classification with deep convolutional neural networks."
In Advances in Neural Information Processing Systems, pp.
1097-1105. 2012. Le, Quoc V., Marc'Aurelio Ranzato, Rajat Monga,
Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean, and Andrew Y.
Ng. "Building high-level features using large scale unsupervised
learning." arXiv preprint arXiv:1112.6209 (2011).
78. Bibliography Mikolov, Tomas, Kai Chen, Greg Corrado, and
Jeffrey Dean. "Efficient estimation of word representations in
vector space." arXiv preprint arXiv:1301.3781 (2013). Mnih,
Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis
Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari
with deep reinforcement learning." arXiv preprint arXiv:1312.5602
(2013). Nguyen, Anh, Jason Yosinski, and Jeff Clune. "Deep neural
networks are easily fooled: High confidence predictions for
unrecognizable images." arXiv preprint arXiv:1412.1897 (2014). Olga
Russakovsky*, Jia Deng*, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael
Bernstein, Alexander C. Berg, and Li Fei-Fei (* = equal
contribution). ImageNet Large Scale Visual Recognition Challenge.
arXiv:1409.0575, 2014. Szegedy, Christian, Wei Liu, Yangqing Jia,
Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan,
Vincent Vanhoucke, and Andrew Rabinovich. "Going deeper with
convolutions." arXiv preprint arXiv:1409.4842 (2014). Szegedy,
Christian, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru
Erhan, Ian Goodfellow, and Rob Fergus. "Intriguing properties of
neural networks." arXiv preprint arXiv:1312.6199 (2013). Yang, J.,
Li, Y., Tian, Y., Duan, L., and Gao, W. Group-sensitive multiple
kernel learning for object categorization. In ICCV, 2009.