Deep Learning Tools
Emanuele [email protected]
5th Feb 2018
Introduction to Neural Networks 2017/2018
Deep Vision
Vision Problems
• Object Classification
• Image Recognition
• Place Recognition
• People Tracking
• Scene Captioning
• …
Generative Problems…
Semantic Segmentation
Captioning
DNN and large datasets overcome the limits of manual Feature selection
Tracking
ImageNet
• Over 15M labeled high-resolution images
• Roughly 22K categories
• Collected from the web and labeled via Amazon Mechanical Turk
http://image-net.org/
ILSVRC 2016
I. Object localization for 1000 categories.
II. Object detection for 200 fully labeled categories.
III. Object detection from video for 30 fully labeled categories.
IV. Scene classification for 365 scene categories (joint with MIT Places team) on Places2 Database http://places2.csail.mit.edu.
V. Scene parsing for 150 stuff and discrete object categories (joint with MIT Places team).
Concept Recaps
• Pooling
• Training Instances (N)
• Batch (B per iteration)
• Mini-batch (if B<<N)
• Epoch (every N/B iterations)
– Batch Mode (N=B)
– Mini-batch Mode (1 < B < N)
– Stochastic (B=1)
• Stochastic Gradient Descent (SGD)
• SGD with Momentum
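The recap above can be made concrete with a minimal NumPy sketch of mini-batch SGD with momentum. The linear least-squares problem, the function name `sgd_momentum` and all hyper-parameter values are assumptions chosen only for illustration; with N=128 and B=16 one epoch is N/B = 8 iterations.

```python
import numpy as np

def sgd_momentum(X, y, batch_size=16, epochs=50, lr=0.05, momentum=0.9, seed=0):
    """Fit y ~ X @ w with mini-batch SGD plus momentum on the squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    velocity = np.zeros(d)
    iterations = 0
    for _ in range(epochs):
        order = rng.permutation(n)              # reshuffle once per epoch
        for start in range(0, n, batch_size):   # N/B iterations per epoch
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ err / len(idx)
            velocity = momentum * velocity - lr * grad
            w = w + velocity
            iterations += 1
    return w, iterations

# Synthetic, noise-free data (hypothetical example problem)
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(128, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w
w, iterations = sgd_momentum(X, y)
```

Setting `batch_size=128` gives Batch Mode (N=B), and `batch_size=1` gives the Stochastic case (B=1).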
Software Tools for Deep Learning
• Many software frameworks are available for Deep Learning practice and research
• Highly optimized and tuned due to the industrial interest in the topic
• Examples
– TensorFlow: Google, C++ https://www.tensorflow.org/
– Theano: Univ, Python https://github.com/Theano/Theano
– Torch: C++, Lua, Python http://pytorch.org/
– Tiny-Dnn: C++ https://github.com/tiny-dnn
– Keras: Google, Python https://keras.io/
• Next: a look at their support for CPU and GPU optimizations
Graphics Processing Units (GPUs)
Memory Bandwidth comparison
Power comparison
• GTX Titan X is 6 Tflops single precision (SP)
• Tesla P100 of 2016 is 11 Tflops SP
• SP is on different silicon than double precision (DP)
Effective speedup with DNN
GPUs and CNN
• GPU parallelism and high throughput make training faster, or sometimes feasible at all
• CNN on GPU can be performed efficiently as:
1. Conversion of CNN to sparse Matrix Multiplication
2. Fourier version of CNN
3. Explicit Convolution
1) Data Parallelism
2) Model Parallelism
3) Hyper Param Parallelism
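To illustrate the first and third options for computing a CNN layer, here is a minimal NumPy sketch that compares an explicit convolution with a matrix-multiplication version. Note the assumption: it uses the dense patch-matrix ("im2col") variant rather than the sparse matrix formulation named on the slide, and it handles a single channel with no padding or stride. All function names are hypothetical.

```python
import numpy as np

def conv2d_explicit(image, kernel):
    """'Valid' cross-correlation (what CNN layers compute) with explicit loops."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def im2col(image, kh, kw):
    """Stack every kh x kw patch of the image as one row of a matrix."""
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = image[i:i + kh, j:j + kw].ravel()
    return cols

def conv2d_matmul(image, kernel):
    """Same result as conv2d_explicit, computed as one matrix-vector product."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    return (im2col(image, kh, kw) @ kernel.ravel()).reshape(oh, ow)

rng = np.random.default_rng(0)
image = rng.normal(size=(6, 7))
kernel = rng.normal(size=(3, 3))
same = np.allclose(conv2d_explicit(image, kernel), conv2d_matmul(image, kernel))
```

The matrix form trades extra memory (patches are duplicated) for a single large product, which is the kind of operation GPUs execute with high throughput.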
CPU SIMD
• GPUs are great, but modern CPUs can also be used effectively
• Intel AVX2 provides 16 registers of 256 bits
– 8/4 single/double values per register, respectively
– Fused Multiply and Accumulate
• Analysis by Google
– “Improving the speed of neural networks on CPUs”, Vanhoucke et al. (PDF)
• Use
– Theano is based on dynamic C code generation and it uses CPU features
– TensorFlow needs to be compiled from source to enable them
Common Concepts in DNN Software
• Most DNN software is based on the concepts of Computational Graph and Automatic Differentiation
• The objective is the efficient computation of the Gradient for the optimizer
• Three Gradient computation approaches:
– Numeric
– Symbolic
– Automatic Differentiation
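As a small illustration of the first approach, the sketch below compares a numeric gradient (central finite differences) against a hand-derived one. The test function `loss` and the step size `eps` are assumptions chosen for the example. The numeric approach needs two function evaluations per parameter, which is why it is typically used only to check gradients.

```python
import numpy as np

def loss(w):
    """Hypothetical scalar function: sum of squares plus sin of the first entry."""
    return np.sum(w ** 2) + np.sin(w[0])

def analytic_grad(w):
    """Gradient of `loss`, derived by hand."""
    g = 2.0 * w
    g[0] += np.cos(w[0])
    return g

def numeric_grad(f, w, eps=1e-6):
    """Central finite differences: two evaluations of f per parameter."""
    g = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        g[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return g

w0 = np.array([0.3, -1.2, 2.0])
max_error = np.max(np.abs(numeric_grad(loss, w0) - analytic_grad(w0)))
```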
Automatic Differentiation
• General concept for solving
• If the top-most function is scalar, as loss functions are, then it can be realized by two passes over the computational graph
– Bottom-up evaluation of C(x)
– Top-down evaluation of the gradient, pushing results toward each Jacobian
• This Backward AD is best suited to scalar-emitting functions
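The two passes can be sketched with a tiny reverse-mode AD over scalar values. This is a minimal illustration under stated assumptions, not how any specific framework implements it: the `Var` class, the `sin` helper and the example function C(x, y) = x*y + sin(x) are all hypothetical.

```python
import math

class Var:
    """Graph node holding a value and (parent, local derivative) pairs."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(output):
    """Top-down pass: push d(output)/d(node) from the output to the inputs."""
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)          # post-order: inputs before consumers
    visit(output)
    output.grad = 1.0
    for node in reversed(order):    # consumers before their inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local

# Bottom-up pass: building the expression evaluates C(x, y)
x, y = Var(2.0), Var(3.0)
c = x * y + sin(x)
backward(c)                         # now x.grad = y + cos(x), y.grad = x
```

One backward pass yields the derivative with respect to every input, which is why this mode fits scalar losses with many parameters.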
Applied Jacobian
• In reality we are not interested in the full Jacobian but in the application of the Jacobian to a generic vector → smaller problem
• Moreover, the resulting graph is generally independent of the batch size
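A minimal NumPy sketch of this idea for one layer y = tanh(W x): the product of a vector v with the Jacobian dy/dx is computed directly, and compared with the result obtained by building the full Jacobian first. The layer choice and the function names are assumptions for illustration.

```python
import numpy as np

def vjp_x(W, x, v):
    """Return v^T J for y = tanh(W @ x), J = dy/dx, without forming J."""
    y = np.tanh(W @ x)
    return W.T @ (v * (1.0 - y ** 2))

def full_jacobian(W, x):
    """Explicit Jacobian, J[i, j] = (1 - y_i^2) * W[i, j]; for comparison only."""
    y = np.tanh(W @ x)
    return (1.0 - y ** 2)[:, None] * W

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = rng.normal(size=3)
v = rng.normal(size=4)
matches = np.allclose(v @ full_jacobian(W, x), vjp_x(W, x, v))
```

The direct version never stores the 4 x 3 Jacobian; for real layers, where the Jacobian would be very large, this is the saving the slide refers to.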
GPU Inference Optimization
https://developer.nvidia.com/tensorrt
Structure + Weights + Label Names → Optimizer
Embedded Inference Scenario
How Big are the Models?
• VGGNet, runner-up of ILSVRC 2014, has 140M parameters (500MB model)
• The limit is the amount of GPU memory (4-12GB)
Smaller is better because
1) Less distributed bandwidth
2) Smaller embedding for Cars and Smartphones (Cloud not always an option)
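The relation between parameter count and model size quoted above is simple arithmetic. The sketch below assumes 4 bytes per parameter (32-bit floats) and counts only the weights; the helper name `model_size_mb` is hypothetical.

```python
def model_size_mb(num_params, bytes_per_param=4):
    """Storage for the weights alone, in MB (10^6 bytes)."""
    return num_params * bytes_per_param / 1e6

vgg_fp32_mb = model_size_mb(140e6)       # 32-bit floats: 560.0 MB
vgg_fp16_mb = model_size_mb(140e6, 2)    # 16-bit floats: 280.0 MB
```

The 32-bit result is of the same order as the roughly 500MB quoted for VGGNet. Training needs considerably more memory than this, since activations and gradients are stored as well.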
Deep Compression (without loss of accuracy)
e.g. 240MB to 7MB for AlexNet
Model Pruning
• Deep Compression aims to minimize the effect of pruning the connections
AlexNet pruning
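A minimal sketch of the pruning step alone, assuming simple magnitude-based pruning: the smallest weights are set to zero and a mask records which connections survive. The function name and the 90% sparsity are assumptions for illustration; Deep Compression additionally retrains the remaining weights and applies quantization and Huffman coding, none of which is shown here.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    magnitudes = np.abs(weights).ravel()
    k = int(sparsity * magnitudes.size)      # number of weights to remove
    if k == 0:
        return weights.copy(), np.ones(weights.shape, dtype=bool)
    threshold = np.partition(magnitudes, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(weights) > threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
layer_weights = rng.normal(size=(20, 10))
pruned, mask = prune_by_magnitude(layer_weights, sparsity=0.9)
kept = int(np.count_nonzero(mask))           # connections that survive
```

During retraining the mask would be reapplied after each update so that pruned connections stay at zero.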
TensorFlow
• C++ framework by Google (PDF) with Python API
• CPU (C++) and GPU (CUDA/OpenCL)
• Distributed using Hadoop FS
• Based on dataflow graph
• General optimization framework, e.g. PDE, SVM and more
• High-level API for DNN since TF 1.0
TensorFlow is Generic
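• Mandelbrot
import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph API
# Z is assumed to be a complex NumPy grid of sample points, defined beforehand
xs = tf.constant(Z.astype(np.complex64))
zs = tf.Variable(xs)
ns = tf.Variable(tf.zeros_like(xs, tf.float32))
# One iteration: z <- z^2 + x
zs_ = zs*zs + xs
# Points that have not diverged yet (bound of 4 as in the tutorial)
not_diverged = tf.abs(zs_) < 4
step = tf.group(zs.assign(zs_),
                ns.assign_add(tf.cast(not_diverged, tf.float32)))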
https://www.tensorflow.org/tutorials/mandelbrot GitHub
Example
• Check the Jupyter notebooks at https://github.com/eruffaldi/handson-ml
• Taken from the book on Scikit-Learn and TensorFlow
Theano
• Python framework initially developed at a Canadian university (MILA) (PDF)
– Development by MILA has stopped since October 2017
• Highly integrated with the Python numpy tensor library
• Backends
– CPU is based on cached dynamic C code generation
– GPU is based on CUDA
• General optimizer with some specializations for DNN
– Can be used as a general-purpose tool
– PyMC3 is a probabilistic programming framework based on Theano for gradient computations
• Graph Structure
– Bipartite DAG
– Variable
– Operations
• Approach:
1. Construction of Computational Graph
2. Graph Optimization
3. Code Generation
Comparison of Software
• Taken from Theano whitepaper (May 2016)
– GPU Test
– Theano a little slower but comparable
Test over ImageNet, divided into forward part and backward part
Keras
• Keras is an example of a high-level DNN library that supports TensorFlow, Theano and CNTK
• Works at the level of Network Layers
• A basic model is Sequential and can be built from layers
from keras.models import Sequential
from keras.layers import Dense, Activation
# Create the empty model before adding layers
model = Sequential()
model.add(Dense(units=64, input_dim=100))
model.add(Activation('relu'))
model.add(Dense(units=10))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
Keras Train and Eval
• Given a model, 4 operations
– model.fit(x_train, y_train, epochs=5, batch_size=32)
– model.train_on_batch(x_batch, y_batch)
– loss_and_metrics = model.evaluate(x_test, y_test, batch_size=128)
– classes = model.predict(x_test, batch_size=128)
MNIST Dataset
• The MNIST dataset consists of handwritten digit images (28x28 grayscale), divided into 60,000 examples for the training set and 10,000 examples for testing
DIGITS
• High-level tool by Nvidia, based on Torch or Caffe, that allows Web-based training of NN models.
• Dataset Creation
• Model Creation with existing Network or Custom one
• Job-based
• Visualization of Network and Inference
Interoperability
• With several libraries at our disposal, how can we interoperate between them to reuse training for inference, or for transfer learning?
• Fight against fragmentation
• For a while Caffe models have been used for exchange; ONNX and NNEF are proposed as interoperable solutions
– Open Neural Network Exchange format and Neural Network Exchange Format
• Tools around ONNX
– Direct or indirect support for specific libraries
– Runtime support by Nvidia TensorRT
ONNX
• Which kind of format is ONNX?
– Based on Google Protobuf serialization
– Describes network layers, optionally with trained parameters
– Node, Graph, Attribute, Operator, Value, Shape
– All operators here: https://github.com/onnx/onnx/blob/master/docs/Operators.md
• Example with TF
– https://github.com/onnx/tutorials/blob/master/tutorials/OnnxTensorflowImport.ipynb
• Repository of Pre-trained Networks
– https://github.com/onnx/models
– E.g. ResNet-50 is 92MB
Preprocessing for Image Data
• For image data it is sufficient to normalize in the range [0,1] or [-1,1]
• Most networks support variable-sized images by adjusting pooling size
• In specific cases we can employ custom preprocessing
• What does preprocessing do?
– Reduces data variability
– Helps the NN when the dataset is small
• What about color representation (RGB vs YUV)?
– Any trichromatic space will suffice
– But we can take advantage from
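The two normalization ranges mentioned in this slide can be sketched in NumPy as follows, assuming 8-bit input images with values in 0..255 (the function names are hypothetical):

```python
import numpy as np

def to_unit_range(img_uint8):
    """Map 8-bit pixel values 0..255 to floats in [0, 1]."""
    return img_uint8.astype(np.float32) / 255.0

def to_symmetric_range(img_uint8):
    """Map 8-bit pixel values 0..255 to floats in [-1, 1]."""
    return img_uint8.astype(np.float32) / 127.5 - 1.0

pixels = np.array([[0, 255]], dtype=np.uint8)
unit = to_unit_range(pixels)            # values 0.0 and 1.0
symmetric = to_symmetric_range(pixels)  # values -1.0 and 1.0
```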
Reducing Overfitting in Image Data
• When the number of parameters is large there is a tendency to overfit
• Dataset Augmentation is a solution, and it is particularly effective for image data
– Use label-invariant transformations
• If the network is not affine invariant then we can use rotation/translation of input images
• Also gamma/color transformations
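A minimal NumPy sketch of such an augmentation, combining a random translation with a random gamma change. The function name, the shift range and the gamma range are assumptions; the translation uses `np.roll`, which wraps pixels around the border, a simplification compared with padding or cropping.

```python
import numpy as np

def augment(image, rng, max_shift=2, gamma_range=(0.8, 1.25)):
    """Random translation plus gamma change for a float image in [0, 1] (H x W x C)."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    shifted = np.roll(image, (dy, dx), axis=(0, 1))   # wrap-around translation
    gamma = rng.uniform(*gamma_range)
    return np.clip(shifted, 0.0, 1.0) ** gamma        # brightness/contrast change

rng = np.random.default_rng(0)
image = rng.uniform(size=(32, 32, 3))   # stand-in for a real training image
augmented = augment(image, rng)         # the label stays the same as for `image`
```

Whether a transformation is label invariant depends on the task: a small shift does not change a digit's class, while a large rotation could.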
EgoCap
Augmentation with changing background
CNN based on changes over the 101-layer residual network of He et al.
CIFAR10
• CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
• The test batch contains exactly 1000 randomly-selected images from each class.
http://www.cs.toronto.edu/~kriz/cifar.html
ResNet
• Residual Learning Network (PDF)
• ILSVRC 2015 winner
“When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly”
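The residual idea can be sketched in NumPy with fully connected layers: the block outputs relu(x + F(x)), so when the residual branch F contributes nothing the block reduces to the identity. The two-layer form of F and the use of dense rather than convolutional layers are simplifying assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)) with residual branch F(x) = W2 @ relu(W1 @ x)."""
    return relu(x + W2 @ relu(W1 @ x))

x = np.array([0.5, 1.0, 2.0])
zero = np.zeros((3, 3))
identity_out = residual_block(x, zero, zero)   # F(x) = 0, so the output equals x
```

Because the identity mapping is available through the shortcut connection, adding such blocks should not make it harder to fit the training data, which addresses the degradation problem described in the quote.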
Image Segmentation with Fully Convolutional Networks
1. Segmentation needs a full (per-pixel) output
2. Adaptation of classic networks to fully convolutional ones
CNN Over other Data Types
• Convolution is “natural” for images but what happens with other data?
Deep Learning, Goodfellow et al., MIT, 2017
Dimensions | Single-Channel | Multi-Channel
1D | Audio | Multi-track skeleton in time
2D | Audio as FFT (rows are frequencies) | Color image RGB
3D | CT Scan | Video
Skeleton Motion
Which feature?
1) Recenter
2) Normalize Lengths
3) Relative/Global Poses
https://arxiv.org/pdf/1306.3874.pdf
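The first two feature steps listed above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the method of the referenced paper: the root joint index, the reference bone and the function name are hypothetical, and a single global scale is used in place of per-bone length normalization.

```python
import numpy as np

def normalize_skeleton(joints, root=0, ref_bone=(0, 1)):
    """Recenter a (J, 3) joint array on the root and scale by one bone length."""
    centered = joints - joints[root]                  # 1) recenter on the root joint
    a, b = ref_bone
    length = np.linalg.norm(centered[b] - centered[a])
    return centered / length                          # 2) reference bone gets length 1

# Three joints of a hypothetical skeleton frame
frame = np.array([[1.0, 1.0, 1.0],
                  [1.0, 3.0, 1.0],
                  [2.0, 1.0, 1.0]])
normalized = normalize_skeleton(frame)
```

After this step, skeletons of subjects of different size and position become comparable, which reduces the variability the network has to learn.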