Deep Learning
Machine Learning in Medicine

Asan Agibetov,

Medical University of Vienna
Center for Medical Statistics, Informatics and Intelligent Systems
Institute for Artificial Intelligence and Decision Support
Spitalgasse 23, 1090 Vienna, BT88.04.808

December 13, 2018


References (available online for free):
▶ "Neural Networks and Deep Learning", Michael A. Nielsen, Determination Press, 2015
  ▶ intuition first, math after
▶ "Deep Learning", Ian Goodfellow, Yoshua Bengio and Aaron Courville, MIT Press, 2016
  ▶ formal, with a fair amount of intuition; more general than Nielsen

These slides are based on the following DL courses:
▶ Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
▶ Course notes "An introduction to Deep Learning", Marc'Aurelio Ranzato (Facebook AI Research), DeepLearn Summer School, Bilbao, 17-21 July 2017

Why Deep Learning

Peter Norvig's recollection of Geoff Hinton's talk on his Boltzmann Machine work (back in 1980):

1. Cognitive plausibility in terms of a model of the brain
2. A model that learns from experience rather than being programmed by hand
3. Continuous representations rather than Boolean ones, as in traditional symbolic expert systems

Peter Norvig: Research Director at Google, co-author of classical texts on AI
Geoff Hinton: Professor at the University of Toronto, one of the pioneers of Deep Learning
Boltzmann Machines (and Probabilistic Graphical Models) are one of the theoretical foundations for generative DL models


Neural networks and Deep Learning

▶ Neural networks: a biologically inspired programming paradigm
  ▶ enables a computer to learn from observational data
  ▶ universal function approximation machine (Hornik, 1991)
▶ Deep learning: a powerful set of techniques for learning in neural networks
  ▶ harnesses GPU resources to parallelize and speed up matrix-vector computations
  ▶ gives rise to a modularized approach to learning

Hornik, "Approximation capabilities of Multilayer Feedforward Networks", Neural Networks, 1991

Deep Learning - what's in the name?
▶ DL, roughly speaking, is a NN with many layers and many neurons in each layer
▶ not true in all cases though (e.g., embeddings are often …)


Figure 1: Simple and Deep NNs


Hierarchical feature learning
▶ DL learns features automatically and hierarchically

Figure 2: (Convolutional) Neural Network to detect a face

Image credit: Michael A. Nielsen

Hierarchical feature learning (cont.)

▶ Learnt features can be combined

Figure 3: Further decomposition of learnt features

Image credit: Michael A. Nielsen

Neural networks

Figure 4: 2 hidden layer network/4 layer network (+ input, output)

▶ Universal function approximator that maps input to output
  ▶ $f : \mathbb{R}^n \to \mathbb{R}^m$
▶ Class of functions considered to map input to output
  ▶ composition of simpler functions, including non-linear ones (a composition of only linear functions would be equivalent to a single linear function)
  ▶ $h_1$ is non-linear: $h_1 = \max(0, W_1 \cdot \vec{x} + \vec{b}_1)$, aka ReLU
  ▶ $f = o \circ h_2 \circ h_1$, i.e. $f(\vec{x}) = o(h_2(h_1(\vec{x})))$

Image credit: M-A. Ranzato (Facebook AI Research)
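The remark about purely linear compositions can be checked numerically. The sketch below is only an illustration: the layer sizes and random weights are hypothetical, and `f_linear` / `f_relu` are names introduced here, not taken from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 inputs, two hidden layers of 5 units, 3 outputs.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 5)), rng.normal(size=5)
W3, b3 = rng.normal(size=(3, 5)), rng.normal(size=3)


def relu(v):
    return np.maximum(0.0, v)


def f_linear(x):
    """o(h2(h1(x))) where every layer is purely linear (affine)."""
    return W3 @ (W2 @ (W1 @ x + b1) + b2) + b3


def f_relu(x):
    """o(h2(h1(x))) with ReLU hidden layers and a linear output layer."""
    return W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3


# The three linear layers collapse into one matrix and one bias vector.
W_single = W3 @ W2 @ W1
b_single = W3 @ (W2 @ b1 + b2) + b3

x = rng.normal(size=4)
collapsed_matches = np.allclose(f_linear(x), W_single @ x + b_single)
```

With the ReLU in place (`f_relu`), no single matrix and bias reproduce the mapping in general, which is why the non-linearity is needed.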


Forward propagation

Figure 5: Forward pass on the network

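▶ $\vec{x} \in \mathbb{R}^{D}$, $W_1 \in \mathbb{R}^{N_1 \times D}$
▶ $\vec{b}_1 \in \mathbb{R}^{N_1}$, $\vec{h}_1 \in \mathbb{R}^{N_1}$

$\vec{h}_1 = \max(0, W_1 \cdot \vec{x} + \vec{b}_1)$

▶ $W_1$: first-layer weight matrix (weights)
▶ $\vec{b}_1$: first-layer biases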
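A minimal NumPy sketch of this first-layer forward pass, assuming toy sizes D = 3 and N1 = 2; the numbers are made up for illustration and `forward_layer` is a hypothetical helper name.

```python
import numpy as np


def forward_layer(x, W1, b1):
    """First hidden layer: h1 = max(0, W1 . x + b1)."""
    return np.maximum(0.0, W1 @ x + b1)


# Hypothetical toy sizes: D = 3 inputs, N1 = 2 hidden units.
x = np.array([1.0, -2.0, 0.5])
W1 = np.array([[1.0, 0.0, 2.0],
               [0.5, 1.0, 0.0]])
b1 = np.array([0.0, 1.0])

h1 = forward_layer(x, W1, b1)
# Pre-activations are [2.0, -0.5]; ReLU clips the negative one to 0.
print(h1)  # [2. 0.]
```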

Why non-linear layers
▶ ReLU layers provide a piecewise-linear tiling of the input space
▶ the number of planes (linear regions) grows exponentially with the number of hidden units
▶ multiple layers yield exponential savings in the number of parameters (parameter sharing)

Figure 6: with ReLU, the mapping is locally linear

Montufar et al. "On the number of linear regions of DNNs", arXiv, 2014
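The "locally linear" behaviour can be seen on a small example. The sketch below assumes a hypothetical one-dimensional network with three ReLU units whose kinks sit at x = -1, 0 and 1. It only illustrates the piecewise-linear tiling, not the exponential growth result of Montufar et al.

```python
import numpy as np

# Hypothetical 1-D network: 3 ReLU units with kinks at x = -1, 0 and 1.
w = np.array([1.0, 1.0, 1.0])      # input weights
b = np.array([1.0, 0.0, -1.0])     # biases, so unit i switches on at x = -b[i]
v = np.array([1.0, -2.0, 1.5])     # output weights


def net(x):
    """Scalar ReLU network f(x) = v . max(0, w * x + b)."""
    return v @ np.maximum(0.0, w * x + b)


def activation_pattern(x):
    """Which hidden units are active at x; constant inside a linear region."""
    return tuple((w * x + b > 0).astype(int))


xs = np.linspace(-2.0, 2.0, 401)
patterns = {activation_pattern(x) for x in xs}
n_regions = len(patterns)  # one activation pattern per linear piece

# Inside one region (here 0 < x < 1) the second difference vanishes,
# i.e. the mapping is exactly linear there.
second_diff = net(0.2) - 2 * net(0.5) + net(0.8)
```

Each distinct activation pattern corresponds to one linear piece; the three kinks here split the line into four pieces.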

How good is the network: task-dependent loss function $V_i$

▶ regression: MSE (mean squared error)
  ▶ $V_1(y, f) = (y - f(x))^2$
▶ classification: variants of the cross-entropy loss
  ▶ class (category) index $k \in \{1, \dots, C\}$
  ▶ predicted classes (one-hot; here class $k$ is predicted)
    ▶ $f(x) = [0, 0, \dots, 1, \dots, 0]$ with the 1 at position $k$, so $f(x)_k = 1$
  ▶ true classes (one-hot; here the true class is class 1)
    ▶ $y = [1, 0, \dots, 0, \dots, 0]$, so $y_k = 0$
  ▶ probability that $x$ belongs to class $c_k$
    ▶ $p(c_k = 1 \mid \vec{x}) = \dfrac{e^{f(x)_k}}{\sum_{j=1}^{C} e^{f(x)_j}}$
  ▶ loss function with log-likelihoods (easier to optimize)
    ▶ $V_2(y, f) = -\sum_{k} y_k \log p(c_k \mid x)$
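The two losses can be written in a few lines of NumPy. This is a sketch under stated assumptions: the scores and labels are hypothetical, and subtracting the maximum score inside `softmax` is a common numerical-stability step added here, not something the slides prescribe.

```python
import numpy as np


def softmax(scores):
    """p(c_k | x) = exp(f_k) / sum_j exp(f_j), computed stably."""
    shifted = scores - np.max(scores)   # subtracting the max does not change the result
    exps = np.exp(shifted)
    return exps / exps.sum()


def cross_entropy(y_onehot, scores):
    """V2(y, f) = -sum_k y_k log p(c_k | x)."""
    return -np.sum(y_onehot * np.log(softmax(scores)))


def squared_error(y, prediction):
    """V1(y, f) = (y - f(x))^2 for a single regression example."""
    return (y - prediction) ** 2


# Hypothetical scores f(x) for C = 3 classes; the true class is the first one.
scores = np.array([2.0, 1.0, 0.1])
y = np.array([1.0, 0.0, 0.0])

probs = softmax(scores)
loss = cross_entropy(y, scores)
```

Because `y` is one-hot, the sum reduces to minus the log-probability assigned to the true class, so the loss shrinks as that probability approaches 1.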

Optimization: finding the best $f$
Typical setup for optimization

▶ $f$ can be parameterized with $\Theta$ ($f = \Theta \cdot x$ in the linear case)
▶ minimizing (learning) the loss function $V$ over all training examples $1, \dots, n$
▶ plus regularization terms:
  ▶ $\lambda_2(f)$ controls the complexity of the function (usually a norm $\lVert f \rVert$)
  ▶ $\lambda_1(f, \Theta)$ controls the sparsity of the solution, where $\Theta$ are the parameters of $f$

$f^{*} = \arg\min_{f} \sum_{i=1}^{n} V(y_i, f(x_i)) + \lambda_2(f) + \lambda_1(f, \Theta)$

▶ to find $f^{*}$ you need to minimize a complicated function
▶ backpropagation gives the gradients of that complicated function
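For the linear case $f = \Theta \cdot x$ the objective can be minimized directly with gradient descent. The sketch below makes several assumptions: squared error as $V$, a squared L2 norm of $\Theta$ as the complexity penalty, no sparsity term, and made-up data, penalty strength and step size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: n = 50 examples, 3 features, linear ground truth.
X = rng.normal(size=(50, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=50)

lam = 0.1  # assumed strength of the L2 penalty


def objective(theta):
    """Sum of squared errors over all examples plus an L2 penalty on theta."""
    residual = y - X @ theta
    return np.sum(residual ** 2) + lam * np.sum(theta ** 2)


def gradient(theta):
    """Gradient of the objective with respect to theta."""
    return -2.0 * X.T @ (y - X @ theta) + 2.0 * lam * theta


theta = np.zeros(3)
step = 0.002  # assumed step size, small enough for this toy problem
for _ in range(500):
    theta = theta - step * gradient(theta)

# Closed-form ridge solution, used only as a reference for comparison.
theta_closed = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
```

For a neural network no closed form exists and the gradient is no longer a one-liner, which is where backpropagation comes in.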



▶ Neural nets: a chain (composition) of non-linear operations, implementing highly non-linear functions
▶ Forward pass computes the error between the currently learnt mapping function and the actual output
▶ Backward pass computes gradients w.r.t. the inputs at each layer and the parameters
▶ Optimization (minimization of the loss) is done by stochastic gradient descent (or variants of it)
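The three steps (forward pass, backward pass, SGD update) fit into one short training loop. The sketch below assumes a hypothetical toy regression task, one ReLU hidden layer of 8 units, mean squared error, and an untuned learning rate; it is an illustration rather than a recommended setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy regression task: learn y = x^2 on [-1, 1].
X = rng.uniform(-1.0, 1.0, size=(128, 1))
Y = X ** 2

# Parameters of a network with one ReLU hidden layer (8 units) and a linear output.
W1 = rng.normal(scale=0.5, size=(1, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)


def forward(x):
    """Forward pass: returns the prediction and the hidden activations."""
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2, h


def mse(pred, target):
    return np.mean((pred - target) ** 2)


initial_loss = mse(forward(X)[0], Y)

learning_rate = 0.05  # assumed value, not tuned
for epoch in range(300):
    order = rng.permutation(len(X))
    for start in range(0, len(X), 16):        # mini-batches of 16 examples
        idx = order[start:start + 16]
        x, y = X[idx], Y[idx]

        pred, h = forward(x)                   # forward pass

        # Backward pass: chain rule, from the loss back to each parameter.
        d_pred = 2.0 * (pred - y) / len(x)     # dLoss/dpred
        d_W2 = h.T @ d_pred
        d_b2 = d_pred.sum(axis=0)
        d_h = d_pred @ W2.T                    # gradient w.r.t. the layer input
        d_pre = d_h * (h > 0)                  # ReLU passes gradient only where active
        d_W1 = x.T @ d_pre
        d_b1 = d_pre.sum(axis=0)

        # Stochastic gradient descent update.
        W1 -= learning_rate * d_W1
        b1 -= learning_rate * d_b1
        W2 -= learning_rate * d_W2
        b2 -= learning_rate * d_b2

final_loss = mse(forward(X)[0], Y)
```

Comparing `initial_loss` with `final_loss` gives a quick sanity check that the updates move the parameters in a useful direction.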

Computation: speed up and parallelize with GPU
In a nutshell, DL is all about matrix multiplication

Figure 7: Matrix-matrix multiplication

▶ Entries of the A × C result matrix can be computed in parallel on the GPU
▶ rows of the A × B matrix and columns of the B × C matrix are loaded into shared memory

Image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
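Why the entries can be computed in parallel becomes clear when the product is written entry by entry. The sketch below runs sequentially on the CPU with hypothetical small shapes; it only shows that each output entry depends on one row and one column, and does not model GPU shared memory.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small shapes: (A x B) times (B x C) gives an (A x C) result.
A, B, C = 4, 3, 5
left = rng.normal(size=(A, B))
right = rng.normal(size=(B, C))


def matmul_entrywise(left, right):
    """Compute each output entry as an independent row-by-column dot product."""
    out = np.empty((left.shape[0], right.shape[1]))
    for i in range(left.shape[0]):
        for j in range(right.shape[1]):
            # Entry (i, j) needs only row i of `left` and column j of `right`,
            # so all A * C entries could be computed at the same time.
            out[i, j] = np.dot(left[i, :], right[:, j])
    return out


product = matmul_entrywise(left, right)
```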

Function composition and computational graph

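$f(x, y, z) = \sum_{i=1}^{n} (x_i y_i + z_i)$

In vector notation, with $\otimes$ the Hadamard product (elementwise multiplication):

$f(x, y, z) = \sum (x \otimes y + z)$

Decomposed into intermediate steps:

$a = x \otimes y$
$b = a + z$
$c = \sum_{i=1}^{n} b_i$, with $f = c$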

Function composition and computational graph (contd.)

$f(x, y, z) = \sum_{i=1}^{n} (x_i y_i + z_i)$

Figure 8: computational graph with numpy

Image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
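Since the figure itself is not reproduced here, the following is a minimal sketch of what such a graph could look like in NumPy, with one line per node. The vector length and random inputs are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4  # hypothetical vector length
x, y, z = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)

# Forward pass through the graph, one node at a time.
a = x * y        # elementwise (Hadamard) product
b = a + z        # elementwise sum
c = np.sum(b)    # reduce to a scalar: f(x, y, z) = c
```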

Gradients of function composition

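$\nabla_x f = \nabla_x \sum_{i=1}^{n} (x_i y_i + z_i) = \frac{\partial f}{\partial c} \frac{\partial c}{\partial b} \frac{\partial b}{\partial a} \frac{\partial a}{\partial x}$

$\nabla_x f = y$
$\nabla_y f = x$
$\nabla_z f = \mathbf{1}$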

Gradients of function composition (contd.)
▶ Cons of using numpy only:
  ▶ manual computation of gradients for every f
  ▶ no GPU support

Figure 9: computational graph and gradients with numpy

Image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)
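The manual gradient computation might look as follows. This is a sketch, not the code from the figure: the backward pass walks the graph in reverse order, and `numerical_grad` is a hypothetical helper added only to check the hand-derived result by finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x, y, z = rng.normal(size=n), rng.normal(size=n), rng.normal(size=n)


def f(x, y, z):
    return np.sum(x * y + z)


# Forward pass.
a = x * y
b = a + z
c = np.sum(b)

# Backward pass: apply the chain rule node by node, from c back to the inputs.
grad_c = 1.0
grad_b = grad_c * np.ones(n)   # c = sum(b)
grad_a = grad_b.copy()         # b = a + z
grad_z = grad_b.copy()
grad_x = grad_a * y            # a = x * y
grad_y = grad_a * x


def numerical_grad(func, v, eps=1e-6):
    """Central finite differences, used only to check the manual gradients."""
    grad = np.zeros_like(v)
    for i in range(v.size):
        step = np.zeros_like(v)
        step[i] = eps
        grad[i] = (func(v + step) - func(v - step)) / (2 * eps)
    return grad


check_x = numerical_grad(lambda v: f(v, y, z), x)
```

Every new function needs its own hand-written backward pass in this style, which is the first drawback listed above.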

Deep Learning frameworks
▶ Goals:
  1. Easily build big computational graphs
  2. Easily compute gradients in computational graphs (automatic gradient computation)
  3. Run it all efficiently on GPU (wrapping low-level NVIDIA and linear algebra libraries, e.g., cuDNN, cuBLAS)
▶ Academia/industry open source frameworks
  ▶ Caffe (UC Berkeley) → Caffe2 (Facebook)
  ▶ Torch (NYU/Facebook) → PyTorch (Facebook)
  ▶ Theano (U Montreal) → TensorFlow (Google)
▶ Industry (not necessarily open source) frameworks
  ▶ Paddle (Baidu), CNTK (Microsoft), MXNet (Amazon), and others...
▶ High-level frameworks
  ▶ Keras (Theano, TensorFlow or CNTK as backend)
  ▶ good for beginners

DL frameworks comparison

Figure 10: Computational graph definition in numpy, pytorch and tensorflow

Image credit: Course notes "CNN for Visual Recognition" (Stanford, Spring 2017)

DL frameworks: Demo

Deep Learning for Vision

Figure 11: fully connected layer for visual recognition

▶ Idea:
  ▶ unwrap images (2D matrices) into 1D vectors
  ▶ $\mathbb{R}^{200 \times 200} \to \mathbb{R}^{40000}$
  ▶ feed them into neural networks (fully connected layers)
▶ Problem:
  ▶ spatial correlation is local
  ▶ waste of resources
  ▶ not robust to transformations (scale, rotation, translation)
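A quick parameter count makes the "waste of resources" point concrete. The hidden-layer width and the bank of small shared filters used for comparison are hypothetical choices, not figures from the slides.

```python
import numpy as np

image = np.zeros((200, 200))        # a 200 x 200 grayscale image
vector = image.reshape(-1)          # unwrapped into a 1-D vector

n_inputs = vector.size              # 40000
n_hidden = 1000                     # hypothetical hidden-layer width

# A fully connected layer needs one weight per (input, hidden unit) pair plus biases.
fc_params = n_inputs * n_hidden + n_hidden

# For comparison: a hypothetical bank of 10 shared 5 x 5 filters plus biases.
conv_params = 10 * (5 * 5) + 10
```

Under these assumptions the fully connected layer has about 40 million parameters, against a few hundred for the shared filters.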

Convolutional Layer
▶ shared weights across the whole image
▶ convolution takes advantage of
  ▶ stationarity (similar statistics at different locations)
  ▶ local spatial correlation

Figure 12: convolutional layer for visual recognition

Figure 13: convolutions with learnt kernels

Convolutional layer activations

Receptive field (aka filter, kernel)
Slides across the input and outputs activation maps

Activation maps: results of convolutions (sums of element-wise multiplications)

Figure panels: positive activation, null activation
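A single filter sliding over an image can be sketched with two loops. The image and the vertical-edge kernel below are made up for illustration. As in most deep learning code, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image`; each output is a sum of element-wise products.

    As is common in deep learning code, the kernel is not flipped
    (strictly speaking this is cross-correlation).
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            activation_map[i, j] = np.sum(patch * kernel)
    return activation_map


# Hypothetical 5 x 5 image with a vertical edge between columns 1 and 2.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)
# Hypothetical 3 x 3 vertical-edge filter (the same weights are used at every location).
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

activation_map = convolve2d_valid(image, kernel)
```

The map is positive (3) where the window straddles the edge and null (0) where the window covers a uniform region, matching the two cases named above.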

Multiple convolutional filters

$h^{n}_{j} = \max\left(0, \sum_{k} h^{n-1}_{k} * w^{n}_{kj}\right)$

Figure 14: multiple convolutional filters for visual recognition

Figure 15: one convolution layer
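One convolutional layer with several input and output feature maps can be sketched as below. The kernel indexing (one kernel per pair of input map k and output map j), the absence of bias terms, and all shapes are assumptions made for this illustration.

```python
import numpy as np


def conv_layer(inputs, weights):
    """h_j = max(0, sum_k inputs[k] * weights[k, j]) with 'valid' sliding windows.

    inputs:  (K, H, W)       K input feature maps
    weights: (K, J, kh, kw)  one kh x kw kernel per (input map, output map) pair
    returns: (J, H - kh + 1, W - kw + 1) output feature maps
    """
    K, H, W = inputs.shape
    _, J, kh, kw = weights.shape
    out = np.zeros((J, H - kh + 1, W - kw + 1))
    for j in range(J):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                patch = inputs[:, r:r + kh, c:c + kw]          # (K, kh, kw)
                out[j, r, c] = np.sum(patch * weights[:, j])   # sum over k and window
    return np.maximum(0.0, out)                                # ReLU


rng = np.random.default_rng(0)
inputs = rng.normal(size=(3, 8, 8))       # hypothetical: 3 input maps of 8 x 8
weights = rng.normal(size=(3, 4, 3, 3))   # hypothetical: 4 output maps, 3 x 3 kernels
feature_maps = conv_layer(inputs, weights)
```

Each of the 4 output maps combines all 3 input maps, so the layer holds 3 × 4 kernels in total.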

Pooling layer

Figure 16: pooling layer

▶ Pooling layer goal: spatial robustness for feature extraction
▶ Assume our filter is an eye detector
▶ The pooling layer makes the eye detector robust to the exact location of the eye

Pooling layer (contd.)

$h^{n}_{j}(x, y) = \max_{\bar{x} \in N(x),\, \bar{y} \in N(y)} h^{n-1}_{j}(\bar{x}, \bar{y})$

Figure 17: pooling layer

▶ by pooling (e.g., taking the max of) filter responses at different locations
▶ we gain robustness to the exact spatial location of features
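The robustness claim can be illustrated with non-overlapping 2 × 2 max pooling. The "eye detector" response maps are hypothetical, and `max_pool` is a helper written for this sketch.

```python
import numpy as np


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling over size x size neighbourhoods."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))


# Hypothetical "eye detector" responses: one strong activation in a 4 x 4 map.
response = np.zeros((4, 4))
response[0, 0] = 1.0

# The same response, shifted by one pixel inside the same 2 x 2 neighbourhood.
shifted = np.zeros((4, 4))
shifted[1, 1] = 1.0

pooled = max_pool(response)
pooled_shifted = max_pool(shifted)
```

Both inputs pool to the same 2 × 2 map. The invariance is only local: a shift that crosses a neighbourhood boundary would change the pooled output.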

ConvNets architecture

Figure 18: LeCun et al., "Gradient-based learning applied to document recognition", Proceedings of the IEEE, 1998

DL for vision Demo
