
An introduction to Neural Networks and Deep Learning

Talk given at the Department of Mathematics of the University of Bologna

February 20, 2018

Andrea Asperti

DISI - Department of Informatics: Science and Engineering, University of Bologna

Mura Anteo Zamboni 7, 40127, Bologna, ITALY
[email protected]


A branch of Machine Learning

What is Machine Learning?

There are problems that are difficult to address with traditional programming techniques:

• classify a document according to some criteria (e.g. spam detection, sentiment analysis, ...)
• compute the probability that a credit card transaction is fraudulent
• recognize an object in some image (possibly from an unusual viewpoint, in new lighting conditions, in a cluttered scene)
• ...

Typically the result is a weighted combination of a large number of parameters, each one contributing to the solution to a small degree.


The Machine Learning approach

Suppose we have a set of input-output pairs (a training set)

$\{\langle x_i, y_i \rangle\}$

the problem consists in guessing the map $x_i \mapsto y_i$

The M.L. approach:

• describe the problem with a model depending on some parameters Θ (i.e. choose a parametric class of functions)
• define a loss function to compare the results of the model with the expected (experimental) values
• optimize (fit) the parameters Θ to reduce the loss to a minimum (a minimal sketch follows)
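A minimal Python sketch of the three steps, assuming a linear model and a mean squared error loss (the data, the model class, and the use of scipy's generic minimizer are illustrative choices, not prescribed by the slides):

import numpy as np
from scipy.optimize import minimize

# training set {<x_i, y_i>}, generated by y ≈ 1 + 2x
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([1.1, 2.9, 5.2, 6.8])

# 1. the model: a parametric class of functions, here linear in theta
def model(theta, x):
    return theta[0] + theta[1] * x

# 2. the loss: compare model outputs with the expected values
def loss(theta):
    return np.mean((model(theta, xs) - ys) ** 2)

# 3. fit: optimize theta to reduce the loss to a minimum
theta = minimize(loss, x0=np.zeros(2)).x
print(theta)   # close to [1, 2]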


Why Learning?

Machine Learning problems are in fact optimization problems! So why talk about learning?

The point is that the solution to the optimization problem is not given in an analytical form (often there is no closed-form solution).

So, we use iterative techniques (typically, gradient descent) to progressively approximate the result.

This form of iteration over data can be understood as a form of progressive learning of the objective function, based on the experience of past observations.


Using gradients

The objective is to minimize some loss function over (fixed) training samples, e.g.

$\Theta(w) = \sum_i E(o(w, x_i), y_i)$

by suitably adjusting the parameters w.

See how Θ changes according to small perturbations Δ(w) of the parameters w: this is the gradient

$\nabla_w \Theta = \left[ \frac{\partial \Theta}{\partial w_1}, \ldots, \frac{\partial \Theta}{\partial w_n} \right]$

of Θ w.r.t. w.

The gradient is a vector pointing in the direction of steepest ascent.


Gradient descent

Goal: minimize some loss function Θ(w) by suitably adjusting the parameters.

We can reach a minimal configuration for Θ(w) by iteratively taking small steps in the direction opposite to the gradient (gradient descent).

This is a general technique.

Warning: not guaranteed to work:

• it may end up in local minima

• it may get lost on plateaus
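As a concrete sketch, the update rule is w ← w − η ∇_w Θ(w) for a small learning rate η; below it is applied to the quadratic loss of the previous slide, with an illustrative dataset and a hand-computed gradient:

import numpy as np

# Theta(w) = sum_i (o(w, x_i) - y_i)^2 for the linear model o(w, x) = w[0] + w[1]*x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])               # generated by y = 1 + 2x

def gradient(w):
    err = w[0] + w[1] * x - y                    # residuals o(w, x_i) - y_i
    return np.array([2 * err.sum(), 2 * (err * x).sum()])

w = np.zeros(2)                                  # initial parameters
eta = 0.01                                       # learning rate (step size)
for _ in range(2000):
    w -= eta * gradient(w)                       # small step opposite to the gradient
print(w)                                         # converges to ≈ [1, 2]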


Next arguments

A bit of taxonomy


Different types of Learning Tasks

• supervised learning: inputs + outputs (labels)
  - classification
  - regression

• unsupervised learning: just inputs
  - clustering
  - component analysis
  - autoencoding

• reinforcement learning: actions and rewards
  - learning long-term gains
  - planning



Classification vs. Regression

Two forms of supervised learning: $\{\langle x_i, y_i \rangle\}$

classification: given a new input, predict its class (“Probably a cat!”); y is discrete, e.g. y ∈ {•, +}

regression: given a new input, predict the expected value; y is (conceptually) continuous


Many different techniques

• Different ways to define the models:
  - decision trees
  - linear models
  - neural networks
  - ...

• Different error (loss) functions:
  - mean squared error
  - logistic loss
  - cross entropy
  - cosine distance
  - maximum margin
  - ...

[Figures: a decision tree for the classic “play tennis” example (Outlook: Sunny/Overcast/Rain; Humidity: High/Normal; Wind: Strong/Weak; leaves Yes/No) beside a neural net; mean squared error beside maximum margin.]


Next argument

Neural Networks


Neural Network

A network of (artificial) neurons

Artificial neuron

Each neuron takes multiple inputs and produces a single output (that can be passed as input to many other neurons).


The artificial neuron

[Figure: an artificial neuron. Inputs x_1 ... x_n enter with weights w_1 ... w_n, together with a bias b (a weight on a constant input +1); a node Σ computes the weighted sum, and an activation function produces the output.]

The purpose of the activation function is to introduce a thresholding mechanism (similar to the axon hillock of cortical neurons).
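As a sketch, a single neuron reduces to a dot product, a bias, and an activation (the numbers below are illustrative):

import numpy as np

# One artificial neuron: output = activation(w · x + b)
def neuron(x, w, b, activation=lambda s: 1.0 if s > 0 else 0.0):
    return activation(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])     # inputs x_1 ... x_n
w = np.array([0.8, 0.2, -0.5])     # weights w_1 ... w_n
print(neuron(x, w, b=0.3))          # fires (1.0) only if w·x + b > 0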


Different activation functions

The activation function is responsible for threshold triggering.

• threshold: 1 if x > 0, else 0
• logistic function: $\frac{1}{1+e^{-x}}$
• hyperbolic tangent: $\frac{e^x - e^{-x}}{e^x + e^{-x}}$
• rectified linear (ReLU): x if x > 0, else 0
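The same four functions, written as vectorized Python definitions for reference:

import numpy as np

def threshold(x):
    return (x > 0).astype(float)       # step: 1 if x > 0 else 0

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))    # output in (0, 1)

def tanh(x):
    return np.tanh(x)                  # (e^x - e^-x)/(e^x + e^-x), output in (-1, 1)

def relu(x):
    return np.maximum(x, 0.0)          # x if x > 0 else 0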


A comparison with the cortical neuron


Next argument

Networks typology/topology


Layers

A neural network is a collection of artificial neurons connected together. Neurons are usually organized in layers.

If there is more than one hidden layer the network is deep; otherwise it is called a shallow network.

Andrea Asperti Universita di Bologna - DISI: Dipartimento di Informatica: Scienza e Ingegneria 20

Page 21: An introduction to Neural Networks and Deep Learningasperti/SLIDES/neural.pdf · An introduction to Neural Networks and Deep Learning Talk given at the Department of Mathematics of

Feed-forward networks

If the network is acyclic, it is called a feed-forward network. Feed-forward networks are (at present) the most common type of network in practical applications.

Important: composing linear transformations makes no sense, since we still get a linear transformation. What is the source of non-linearity in Neural Networks?

The activation function


Dense networks

The most typical feed-forward network is a dense network, where each neuron at layer k − 1 is connected to each neuron at layer k.

The network is defined by a matrix of parameters (weights) $W_k$ for each layer (plus biases).

The matrix $W_k$ has dimension $L_k \times L_{k+1}$, where $L_k$ is the number of neurons at layer k.
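As a sketch, the forward pass of a dense network is just one matrix product (plus bias and activation) per layer; the layer sizes below are illustrative:

import numpy as np

rng = np.random.default_rng(0)
L = [4, 5, 3]                                    # neurons per layer
W = [rng.normal(size=(L[k], L[k + 1])) for k in range(len(L) - 1)]
b = [np.zeros(L[k + 1]) for k in range(len(L) - 1)]

def forward(x):
    for Wk, bk in zip(W, b):
        x = np.maximum(x @ Wk + bk, 0.0)         # dense layer followed by ReLU
    return x

print(forward(rng.normal(size=4)))               # a 3-dimensional output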


Parameters and hyper-parameters

The weights $W_k$ are the parameters of the model: they are learned during the training phase.

The number of layers and the number of neurons per layer are hyper-parameters: they are chosen by the user and fixed before training starts.

Other important hyper-parameters govern training, such as the learning rate, the batch size, the number of epochs, and many others.


Convolutional networks

Convolutional networks are used for inputs with a topological structure: signal sequences (e.g. sound), or images.

They repeatedly apply a (small) uniform linear transformation, called a kernel, shifting it over the whole input image.


Example

[Figure: the kernel [−1 0 1] applied to an image highlights vertical edges; its transpose [−1 0 1]^T highlights horizontal edges.]
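A sketch of shifting the kernel [−1 0 1] over an image with scipy (the tiny image below is illustrative):

import numpy as np
from scipy.signal import convolve2d

image = np.zeros((5, 5))
image[:, 2:] = 1.0                           # left half dark, right half bright

kx = np.array([[-1, 0, 1]])                  # horizontal derivative kernel
edges = convolve2d(image, kx, mode='same')   # strong response along the vertical edge
print(edges)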


Computing features

Many interesting kernels (filters) are known from Image Processing:

• first- and second-order derivatives, image gradients

• Sobel, Prewitt, ...

In Neural Networks, kernels are learned by training.

Since kernels are small and weights are shared, training is relatively fast.


Recurrent Networks

In a recurrent network you may have cycles:

• the dynamics is very complex: it is not even clear that it stabilizes

• difficult to train

• biologically more realistic

Restricted models:

• Long Short-Term Memory models (LSTM)

• Gated Recurrent Units (GRU)


LSTM and GRU

LSTMs are useful to model sequences:

• equivalent to very deep nets with one hidden layer per time slice (net unrolling)

• weights are shared between different time slices

• they can keep information for a long time in an internal state
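A minimal, hypothetical Keras sketch of an LSTM sequence classifier (sequence length, feature count, and layer sizes are illustrative):

from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
model.add(LSTM(32, input_shape=(100, 8)))    # 100 time slices, 8 features each;
                                             # the internal state carries information across slices
model.add(Dense(1, activation='sigmoid'))    # binary prediction from the final state
model.compile(loss='binary_crossentropy', optimizer='adam')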


Symmetrically connected networks

Similar to recurrent networks, but connections between units are symmetrical (they have the same weight in both directions).

They have stable configurations corresponding to local minima of a suitable energy function.

Hopfield nets: symmetrically connected nets without hidden units

Boltzmann machines: symmetrically connected nets with hidden units:

• more powerful models than Hopfield nets

• less powerful than general recurrent networks

• have a nice and simple learning algorithm


What a real network looks like

VGG 16 (Simonyan and Zisserman): 92.7% top-5 accuracy on ImageNet.

Picture by Davi Frossard: VGG in TensorFlow


How do we implement a neural net?

Neural nets look complicated.

How do we implement them?

There exist suitable languages:

• Theano, University of Montreal

• TensorFlow, Google Brain

• Caffe, Berkeley Vision

• Keras, F. Chollet

• PyTorch, Facebook

• ...


VGG 16 in Keras

From GitHub

def VGG_16(weights_path=None):
    model = Sequential()
    model.add(ZeroPadding2D((1,1), input_shape=(3,224,224)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(64, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(ZeroPadding2D((1,1)))
    model.add(Convolution2D(128, 3, 3, activation='relu'))
    model.add(MaxPooling2D((2,2), strides=(2,2)))
    ...

The whole model is defined in 50 lines of code.


But what about training?

So complex ...

fit(x, y, batch_size=32, epochs=10)

• x: input, an array of data (hence, typically, an array of arrays)

• y: labels, an array of target categories

• batch_size: integer, number of samples per gradient update

• epochs: integer, the number of epochs (passes over the data) to train the model
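Putting the pieces together, a typical (hypothetical) training call for the VGG_16 model above, assuming x_train and y_train have already been loaded:

model = VGG_16()
model.compile(loss='categorical_crossentropy', optimizer='sgd')   # choose loss and optimizer
model.fit(x_train, y_train, batch_size=32, epochs=10)             # x_train: images, y_train: one-hot labels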


Next arguments

Features and deep features


Features

Any individual measurable property of data useful for the solution of a specific task is called a feature.

Examples:

• Emergency C-section: age, first pregnancy, anemia, fetal malpresentation, previous premature birth, anomalous ultrasound, ...

• Weather: humidity, pressure, temperature, wind, rain, snow, ...

• Expected lifetime: age, health, annual income, kind of work, ...


Derived (inner) features

New interesting features may be derived as a combination of input features.

Suppose for instance that we want to model some phenomenon with a cubic function

$f(x) = ax^3 + bx^2 + cx + d$

We can use x as input, or ...

we can precompute $x$, $x^2$ and $x^3$, reducing the problem to a linear model!
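A sketch of this trick with numpy: precomputing the powers of x turns cubic fitting into linear least squares (the data below are synthetic, generated from known coefficients):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
y = 2*x**3 - x**2 + 3*x + 1 + rng.normal(0, 0.1, 50)     # noisy cubic

X = np.stack([x**3, x**2, x, np.ones_like(x)], axis=1)   # derived features
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)           # fit a linear model
print(coeffs)                                             # ≈ [2, -1, 3, 1]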


Traditional Image Processing

In order to process an image we start computing interesting derived features on the image:

- first-order derivatives
- second-order derivatives
- difference of Gaussians
- Laplacian
- ...

[Figure: original image, Gaussian blur 25, Gaussian blur 10, and their Gaussian difference.]

Then we use these derived features to get the desired output.


Deep learning, in a deeper sense

Discovering good features is a complex task.

Why not delegate this task to the machine, and learn the features too?

Deep learning exploits a hierarchical organization of the learning model, allowing complex features to be computed in terms of simpler ones, through non-linear transformations.


AI, machine learning, deep learning

• Knowledge-based systems: take an expert, ask him how he solves a problem, and try to mimic his approach by means of logical rules

• Traditional Machine Learning: take an expert, ask him which features of the data are relevant to solve a given problem, and let the machine learn the mapping

• Deep Learning: get rid of the expert


Relations between research areas

[Figure: a nested Venn diagram. Deep learning (example: MLPs, autoencoders) is contained in representation learning, which is contained in machine learning (example: logistic regression), which is contained in artificial intelligence (example: knowledge bases).]

Picture taken from “Deep Learning” by I. Goodfellow, Y. Bengio and A. Courville, MIT Press.


Components trained to learn

[Figure: flow charts from input to output comparing four approaches. Rule-based systems use a hand-designed program; classic machine learning uses hand-designed features and a learned mapping from features; representation learning learns the features as well as the mapping; deep learning learns simple features, then more complex features, then the mapping. Shaded boxes mark the components trained to learn.]

Picture taken from “Deep Learning” by I. Goodfellow, Y. Bengio and A. Courville, MIT Press.


Next arguments

Some successful applications

• MNIST and ImageNet
• Speech Recognition
• Lip reading
• Text generation
• Deep dreams and Inceptionism
• Mimicking style
• Robot navigation
• Game simulation


MNIST

Modified National Institute of Standards and Technology database

• grayscale images of handwritten digits, 28 × 28 pixels each (each digit fits in a 20 × 20 box)

• 60,000 training images and 10,000 test images
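A hypothetical end-to-end sketch in Keras (the framework used earlier), training a shallow dense classifier on MNIST; all sizes and choices are illustrative:

from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.utils import to_categorical

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0        # scale pixels to [0, 1]
y_train, y_test = to_categorical(y_train), to_categorical(y_test)

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))                 # 28x28 image -> vector of 784
model.add(Dense(128, activation='relu'))                 # a single hidden layer (shallow net)
model.add(Dense(10, activation='softmax'))               # one output per digit class
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=10)
print(model.evaluate(x_test, y_test))                    # [loss, accuracy] on the test set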


MNIST

A comparison of different techniques

Classifier                      Error rate (%)
Linear classifier               7.6
K-Nearest Neighbors             0.52
SVM                             0.56
Shallow neural network          1.6
Deep neural network             0.35
Convolutional neural network    0.21

See LeCun’s page on the MNIST database for more data.


ImageNet

ImageNet (@Stanford Vision Lab)

• high-resolution color images covering 22K object classes

• over 15 million labeled images from the web


ImageNet competition

Annual competition of image classification (since 2010).

• 1.2 million images (1,000 categories)

• make five guesses about the image label, ordered by confidence


ImageNet samples


ImageNet results


Speech recognition

Several stages (similar to optical character recognition):

• Segmentation: convert the sound wave into a sequence of vectors of acoustic coefficients. Typical sampling: every 10 milliseconds.

• The acoustic model: use adjacent vectors of acoustic coefficients to associate probabilities with phonemes.

• Decoding: find the sequence of phonemes that best fits the acoustic data and a model of expected sentences.

Deep neural networks, pioneered by George Dahl and Abdel-rahman Mohamed, are replacing previous machine learning methods.


Speech recognition

Major companies are investing a lot of money in speech recognition: Amazon (with Intel), Google, Microsoft, ...

Achieving Human Parity in Conversational Speech Recognition. Speech & Dialog research group at Microsoft, 2016.

G. Zweig (project manager) attributes the accomplishment to the systematic use of the latest neural network technology in all aspects of the system.


Lip reading

Google’s DeepMind AI can lip-read TV shows better than a professional


Text Generation

See Andrej Karpathy’s blog post “The Unreasonable Effectiveness of Recurrent Neural Networks”.

Examples of fake algebraic documents automatically generated by an RNN.


Deep dreams

Visit the Deep Dream Generator.


Mimicking style

“A Neural Algorithm of Artistic Style”, L.A. Gatys, A.S. Ecker, M. Bethge

Similar to inceptionism, but with “style” (texture) instead of content.


More examples


More examples


Mimicking style: a different approach

Image-to-image translation with Cycle Generative Adversarial Networks


Robot navigation

Quadcopter Navigation in the Forest using Deep Neural Networks

Robotics and Perception Group, University of Zurich, Switzerland & Institute for Artificial Intelligence (IDSIA), Lugano, Switzerland

Based on Imitation Learning


Atari Games and Q-learning

Google DeepMind’s system playing Atari games (2013)

Recently extended with Imagination-Augmented Agents (2017)

video

Based on:

• deep neural networks

• an innovative reinforcement learning technique called Q-learning


Atari Games and Q-learning

The same network architecture was applied to all games.

Inputs are screen frames.

Works well for reactive games, not for planning.
