Grand Rehearsal FEV 2018 · 2018-12-13

Grand Rehearsal FEV 2018

Prof. Bart ter Haar Romeny

The relation between biological vision and computer vision

The beginning of Artificial Intelligence Campbell’s Soup Factories, USA

1984: Aldo Camino, an expert with 46 years of experience, knew everything about the complex 22 m high sterilisers, which heat 68,000 cans of soup to 120 degrees.

He retired: all his knowledge was captured in hundreds of RULES.

Dead end street!

Classical Computer-Aided Diagnosis / Detection with hand-crafted features

Gabor feature filters Clusters in feature space Validation, ROC curve

Classical machine learning:

‘Spurious resolution’: artefact of the wrong aperture

What is the best aperture?

Cortical receptive fields are well modeled by

derivatives of the Gaussian kernel

(Koenderink 1984)

$\frac{\partial}{\partial x}\bigl(L(x,y) \otimes G(x,y;\sigma)\bigr) = L(x,y) \otimes \frac{\partial}{\partial x} G(x,y;\sigma)$
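The identity above says that differentiating a blurred image gives the same result as convolving the image with a differentiated Gaussian. A minimal 1D NumPy sketch of this, assuming a sampled Gaussian and a central-difference kernel as a stand-in for the derivative (the names `gaussian_kernel`, `signal` and `diff` are illustrative, not from the slides):

```python
import numpy as np

def gaussian_kernel(sigma, radius=None):
    """Sampled, normalized 1D Gaussian G(x; sigma)."""
    if radius is None:
        radius = int(np.ceil(5 * sigma))
    x = np.arange(-radius, radius + 1)
    g = np.exp(-x**2 / (2.0 * sigma**2))
    return g / g.sum()

# Central-difference kernel. np.convolve flips the kernel, so
# [0.5, 0, -0.5] computes (f[i+1] - f[i-1]) / 2.
diff = np.array([0.5, 0.0, -0.5])

rng = np.random.default_rng(0)
signal = rng.normal(size=200)      # stand-in for one image row L(x)
G = gaussian_kernel(sigma=3.0)

# Left-hand side: blur first, then differentiate.
lhs = np.convolve(np.convolve(signal, G), diff)

# Right-hand side: differentiate the Gaussian once, then convolve.
dG = np.convolve(G, diff)
rhs = np.convolve(signal, dG)

max_abs_diff = np.max(np.abs(lhs - rhs))
print(max_abs_diff)                # floating-point round-off only
```

In practice this means the derivative kernel can be precomputed once and applied to any image.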

[Figure: image * Gaussian derivative kernel = gradient]

We model the synaptic connections with artificial neural networks

A 3-layer neural network

However, these networks only gave 75% correct…

THE MODEL

Learning: synapses get bigger

Hierarchical learning as the brain

Needed:

• Large sets of training data
• Clever network architecture
• Error backpropagation
• Robust classifier

The revolution: mimic the visual cascade:
Deep Learning with neural nets of many layers

The idea: context

What a local filter sees:
What a context filter sees:

Deep Learning Convolutional Neural Networks

THE TRICK: incremental contextual structure analysis

Convolution, ReLU, max pooling, convolution, convolution etc.
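The first three stages of this cascade can be sketched in plain NumPy. This is an illustration under stated assumptions, not the AlexNet implementation: the helper names, the 8 x 8 test image and the 3 x 3 horizontal-gradient filter are all made up for the example, and the filter is hand-set rather than learned.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2D filtering (cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Rectified linear unit: negative responses are set to zero."""
    return np.maximum(x, 0.0)

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling; odd trailing rows/cols are dropped."""
    h, w = (x.shape[0] // 2) * 2, (x.shape[1] // 2) * 2
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.zeros((8, 8))
image[:, 4:] = 1.0                          # vertical step edge
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)   # illustrative gradient filter

features = max_pool2x2(relu(conv2d_valid(image, kernel)))
print(features)   # 3 x 3 map, non-zero only in the column covering the edge
```

Stacking several such stages lets later filters see a progressively larger context of the input.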

Error backpropagation. AlexNet (Alex Krizhevsky, 2012)

ImageNet challenge: 1.4 million images, 1000 classes

75% → 94%

A typical big deep NN has (hundreds of) millions of connections: weights.

Nvidia blog examples Medium.com

Google TensorFlow Kaggle.com DR

Google AutoML Kaggle Competitions

Ramen Jiro (ラーメン二郎) prediction from 41 ramen shops in Tokyo.
Kenji collected 1170 photos x 41 shops = about 48,000 photos of ramen with shop labels.

AutoML Vision achieved 94.5% accuracy. AutoML did all the work: preprocessing, augmentation and training. The whole process is designed for non-data-scientists and does not require ML expertise.

It is becoming easier and easier …

More applications every day …

• > 80% of papers: Deep Learning
• Challenges with given data are the norm

Data augmentation:
Make MANY more new images from a single image by tiny transformations.
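A minimal sketch of such augmentation, assuming only three simple transformations (mirror, rotation by a multiple of 90 degrees, and a small cyclic shift). The function name `augment` and the parameter `max_shift` are hypothetical; real pipelines typically add scaling, small-angle rotation, noise and intensity changes.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return one randomly transformed copy of a 2D image."""
    out = image
    if rng.random() < 0.5:
        out = np.fliplr(out)                          # horizontal mirror
    out = np.rot90(out, k=int(rng.integers(0, 4)))    # 0, 90, 180 or 270 degrees
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(out, shift=(int(dy), int(dx)), axis=(0, 1))  # small cyclic shift
    return out

rng = np.random.default_rng(42)
image = np.arange(64, dtype=float).reshape(8, 8)      # stand-in for a real image

# Twenty new training images from a single original.
augmented = [augment(image, rng) for _ in range(20)]
print(len(augmented), augmented[0].shape)
```

Each transformed copy keeps the label of the original image, so the training set grows without any extra annotation effort.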

AI News Anchor

https://blogs.nvidia.com/
https://medium.com/topic/artificial-intelligence
https://www.tensorflow.org/
https://www.kaggle.com/c/diabetic-retinopathy-detection
https://cloud.google.com/blog/big-data/2018/03/automl-vision-in-action-from-ramen-to-branded-goods
https://www.kaggle.com/competitions
https://medium.com/mlmemoirs/worlds-first-ai-news-anchor-makes-its-debut-in-china-4ffc00716578

The term "deep learning" refers to the method of training multi-layered neural networks. It became popular after papers by Geoffrey Hinton and his co-workers showed a fast way to train such networks.

Yann LeCun, a former postdoctoral researcher in Geoff Hinton's lab, also developed a very effective algorithm for deep learning, called ConvNet, which was successfully used in the late 1980s and early 1990s for automatic reading of amounts on bank checks.

In May 2014, Baidu, the Chinese search giant, hired Andrew Ng, a leading Machine Learning and Deep Learning expert (and co-founder of Coursera), to head their new AI Lab in Silicon Valley, setting up an AI & Deep Learning race with Google (which hired Geoffrey Hinton) and Facebook (which hired Yann LeCun to head Facebook AI Lab).

Weights and Activation functions

Demo

$h = \sigma(W_1 x + b_1)$

$y = \sigma(W_2 h + b_2)$

[Figure: network with input x, hidden layer h and output y]

4 + 2 = 6 neurons (not counting inputs)

[3 x 4] + [4 x 2] = 20 weights

4 + 2 = 6 biases

26 learnable parameters
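The parameter count can be checked with a small NumPy sketch of this 3-4-2 network. The random weights and the example input are arbitrary, and the sigmoid is an assumption for the activation σ (the linked demo uses tanh). The weight matrices are stored as 4 x 3 and 2 x 4 so that `W1 @ x` matches the formulas; the number of entries is the same as in the [3 x 4] and [4 x 2] notation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 3, 4, 2

W1 = rng.normal(size=(n_hidden, n_in))    # 4 x 3 = 12 weights
b1 = np.zeros(n_hidden)                   # 4 biases
W2 = rng.normal(size=(n_out, n_hidden))   # 2 x 4 = 8 weights
b2 = np.zeros(n_out)                      # 2 biases

x = np.array([0.5, -1.0, 2.0])            # arbitrary example input
h = sigmoid(W1 @ x + b1)                  # hidden layer activations
y = sigmoid(W2 @ h + b2)                  # network output

n_weights = W1.size + W2.size
n_biases = b1.size + b2.size
n_params = n_weights + n_biases
print(n_weights, n_biases, n_params)      # 20 6 26
```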

Weights

Activation functions

http://playground.tensorflow.org/#activation=tanh&batchSize=10&dataset=circle&regDataset=reg-plane&learningRate=0.03&regularizationRate=0&noise=0&networkShape=4,2&seed=0.45430&showTestData=false&discretize=false&percTrainData=50&x=true&y=true&xTimesY=false&xSquared=false&ySquared=false&cosX=false&sinX=false&cosY=false&sinY=false&collectStats=false&problem=classification&initZero=false&hideText=false

Loss functions and output

Classification
• Training examples: R^n x {class_1, ..., class_n} (one-hot encoding)
• Output layer: Soft-max [maps R^n to a probability distribution]
• Cost (loss) function: Cross-entropy

Regression
• Training examples: R^n x R^m
• Output layer: Linear (Identity, f(x) = x) or Sigmoid
• Cost (loss) function: Mean Squared Error

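List of loss functions

Cross-entropy:

$J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} \left[ y_k^{(i)} \log \hat{y}_k^{(i)} + \left(1 - y_k^{(i)}\right) \log\left(1 - \hat{y}_k^{(i)}\right) \right]$

Mean Squared Error:

$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - \hat{y}^{(i)} \right)^2$

Mean Absolute Error:

$J(\theta) = \frac{1}{n} \sum_{i=1}^{n} \left| y^{(i)} - \hat{y}^{(i)} \right|$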
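The three loss functions translate directly into NumPy. This is a sketch with made-up example values; the `eps` clipping in the cross-entropy is an added safeguard against log(0) and is not part of the formula.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy summed over the K outputs, averaged over the n samples."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    per_sample = -np.sum(y_true * np.log(y_pred)
                         + (1.0 - y_true) * np.log(1.0 - y_pred), axis=1)
    return per_sample.mean()

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

# Classification: two samples, two classes, one-hot targets,
# and a maximally uncertain prediction of 0.5 everywhere.
labels = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
ce = cross_entropy(labels, probs)          # 2 * ln(2)

# Regression: three scalar targets and predictions.
targets = np.array([1.0, 2.0, 3.0])
preds = np.array([1.0, 4.0, 1.0])
mse = mean_squared_error(targets, preds)   # (0 + 4 + 4) / 3
mae = mean_absolute_error(targets, preds)  # (0 + 2 + 2) / 3
print(ce, mse, mae)
```

The squared error penalizes the two large misses more heavily than the absolute error does, which is why MAE is often preferred when the data contain outliers.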

Classification is about predicting a label and regression is about predicting a quantity.

Example (classification): digit classification
Example (regression): house prices

https://isaacchanghau.github.io/post/loss_functions/

Training

1. Sample labeled data (batch)
2. Forward it through the network, get predictions
3. Back-propagate the errors
4. Update the network weights

Optimize (min. or max.) an objective/cost function $J(\theta)$.
Generate an error signal that measures the difference between predictions and target values.

Use the error signal to change the weights and get more accurate predictions.
Subtracting a fraction of the gradient moves you towards the (local) minimum of the cost function.

https://medium.com/@ramrajchandradevan/the-evolution-of-gradient-descend-optimization-algorithm-4106a6702d39
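The training cycle (sample a batch, forward pass, back-propagate, update) can be sketched as a short NumPy loop. Everything specific here is an assumption made for illustration: the toy regression target, the single tanh hidden layer of 8 units, the mean squared error cost, the learning rate and the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (assumption): y = x1 - 2 * x2.
X = rng.normal(size=(256, 2))
Y = (X[:, 0] - 2.0 * X[:, 1]).reshape(-1, 1)

# One hidden tanh layer, linear output layer.
params = {
    "W1": 0.5 * rng.normal(size=(2, 8)), "b1": np.zeros(8),
    "W2": 0.5 * rng.normal(size=(8, 1)), "b2": np.zeros(1),
}

def forward(p, x):
    hidden = np.tanh(x @ p["W1"] + p["b1"])
    return hidden, hidden @ p["W2"] + p["b2"]

def mse(p, x, y):
    return float(np.mean((forward(p, x)[1] - y) ** 2))

learning_rate, batch_size = 0.02, 32
initial_loss = mse(params, X, Y)

for step in range(3000):
    # 1. Sample a labeled batch.
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], Y[idx]
    # 2. Forward it through the network.
    hidden, pred = forward(params, xb)
    # 3. Back-propagate the error signal (chain rule, layer by layer).
    d_pred = 2.0 * (pred - yb) / batch_size
    grads = {"W2": hidden.T @ d_pred, "b2": d_pred.sum(axis=0)}
    d_hidden = (d_pred @ params["W2"].T) * (1.0 - hidden ** 2)
    grads["W1"] = xb.T @ d_hidden
    grads["b1"] = d_hidden.sum(axis=0)
    # 4. Update: subtract a fraction of the gradient.
    for name in params:
        params[name] -= learning_rate * grads[name]

final_loss = mse(params, X, Y)
print(initial_loss, final_loss)
```

The learning rate is the "fraction of the gradient" mentioned above; too large a value makes the loss oscillate or diverge, too small a value makes training slow.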

How does this actually work?

Gradient descent, derivatives, chain rule:

With

etc.

And finally:

1

And iterate until convergence:
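How the chain rule produces the weight gradients can be written out for the small two-layer network $h = \sigma(W_1 x + b_1)$, $y = \sigma(W_2 h + b_2)$ used earlier in these slides. This is a sketch and not a recovery of the slide's own equations: the squared-error cost, the pre-activations $z_1, z_2$, the error terms $\delta_1, \delta_2$, the learning rate $\eta$, and the use of $\hat{y}$ for the network output and $y$ for the target are assumptions introduced here.

```latex
\begin{aligned}
&\text{Forward pass:} &
z_1 &= W_1 x + b_1, \quad h = \sigma(z_1), \quad
z_2 = W_2 h + b_2, \quad \hat{y} = \sigma(z_2) \\
&\text{Cost:} &
J &= \tfrac{1}{2}\,\lVert \hat{y} - y \rVert^2 \\
&\text{Output error:} &
\delta_2 &= \frac{\partial J}{\partial z_2}
          = (\hat{y} - y) \odot \sigma'(z_2) \\
&\text{Hidden error (chain rule):} &
\delta_1 &= \frac{\partial J}{\partial z_1}
          = \bigl(W_2^{\mathsf T}\,\delta_2\bigr) \odot \sigma'(z_1) \\
&\text{Gradients:} &
\frac{\partial J}{\partial W_2} &= \delta_2\, h^{\mathsf T}, \quad
\frac{\partial J}{\partial b_2} = \delta_2, \quad
\frac{\partial J}{\partial W_1} = \delta_1\, x^{\mathsf T}, \quad
\frac{\partial J}{\partial b_1} = \delta_1 \\
&\text{Update:} &
\theta &\leftarrow \theta - \eta\,\frac{\partial J}{\partial \theta},
\qquad \theta \in \{W_1, b_1, W_2, b_2\}
\end{aligned}
```

Here $\odot$ is the element-wise product; for the logistic sigmoid, $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$.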

Convolutional Neural Networks (CNNs)

Input matrix

Convolutional 3x3 filter

Convolution = filtering with a kernel / template / receptive field

To keep the same output size:
Boundary choice (all wrong!):
• Zero padding
• Mean padding
• Reflection
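The three boundary choices can be compared with `numpy.pad`. The sketch below works in 1D for brevity (the slide shows a 2D 3 x 3 filter); the example row and the box filter are arbitrary.

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0, 4.0])
pad = 1   # a 3-tap filter needs one extra sample on each side

zero_padded = np.pad(row, pad, mode="constant", constant_values=0.0)
mean_padded = np.pad(row, pad, mode="constant", constant_values=row.mean())
reflect_padded = np.pad(row, pad, mode="reflect")

kernel = np.ones(3) / 3.0   # simple box filter (illustrative)
outputs = {
    name: np.convolve(padded, kernel, mode="valid")
    for name, padded in [("zero", zero_padded),
                         ("mean", mean_padded),
                         ("reflect", reflect_padded)]
}

for name, out in outputs.items():
    print(name, out)   # same length as `row`; only the border samples differ
```

All three keep the output the same size as the input and agree in the interior, but each invents different data at the border, which is why the slide calls every choice wrong.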

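The convolution integral:

$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$

The cross-correlation integral:

$(f \star g)(t) = \int_{-\infty}^{\infty} \overline{f(\tau)}\, g(t + \tau)\, d\tau$

Wikipedia: convolution, cross-correlation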

The convolution theorem states that the Fourier transform of a convolution is the pointwise product of Fourier transforms.

https://en.wikipedia.org/wiki/Convolution
https://en.wikipedia.org/wiki/Cross-correlation
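A short numerical check of the relation between convolution and cross-correlation, and of the convolution theorem, using discrete 1D signals. The random signals are arbitrary; for real-valued data, cross-correlation equals convolution with the reversed kernel.

```python
import numpy as np

rng = np.random.default_rng(1)
f = rng.normal(size=16)   # signal
g = rng.normal(size=5)    # kernel

conv = np.convolve(f, g, mode="full")

# Cross-correlation slides the kernel without flipping it, so for real
# signals it equals convolution with the reversed kernel.
xcorr = np.correlate(f, g, mode="full")
xcorr_via_conv = np.convolve(f, g[::-1], mode="full")

# Convolution theorem: multiply the Fourier transforms pointwise, then
# transform back. Both inputs are zero-padded to the full output length.
n = len(f) + len(g) - 1
conv_via_fft = np.fft.irfft(np.fft.rfft(f, n) * np.fft.rfft(g, n), n)

print(np.allclose(xcorr, xcorr_via_conv), np.allclose(conv, conv_via_fft))
```

For large kernels the Fourier route is much faster than direct summation, which is the practical use of the theorem.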

Physiology of

Front-End Vision

Prof. Bart ter Haar Romeny, PhD
Biomedical Image Analysis

Eindhoven University of Technology

Synapses grow in size with learning: ~ increasing weights

Rod and cone disks

Some numbers:

Disk 10 nm thick
Disk spacing 25 nm
Outer segment 25 μm
1000 disks/rod
10^8 rhodopsin molecules/rod
10^8 rod cells/retina

From: book.bionumbers.org/how-big-is-a-photoreceptor/

Cones couple to on-center and to off-center ganglion cells

The discovery of receptive fields by Hubel and Wiesel (Nobel Prize 1981)

David Hubel Torsten Wiesel

50% on-center-surround RF 50% off-center-surround RF

Classical explanation:
• Lateral inhibition
• Surround suppression
• Equal speed for on- and off intensity

Two types of retinal ganglion cells:
Midget: small, for shape
Parasol: large, for motion

The retina is a multi-scale sampling device:

Multi-scale range

shape

motion

Disappearing black spots due to low acuity at higher eccentricity

We only see sharply in the fovea

The mapping of many types of fully tiling ganglion cells is coarse and overlapping

Masland 2012

Reichardt motion detector: two RFs separated by a delay
In the visual front-end, retinal receptive fields are organized in pairs, tuned to a specific velocity and direction. The pairs are coupled by a delay cell, possibly a specific type of amacrine cell.

Neurons act as temporal coincidence detectors → tuned velocity detector

All velocities and directions are measured at all scales.

Motion illusion: flying bird

Time = span/velocity = delay
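The relation delay = span / velocity can be illustrated with a simplified Reichardt correlator. This is a sketch under several assumptions: the two receptive fields are modelled as small Gaussian windows, the stimulus is a single moving dot, the opponent (preferred minus null) output is normalized by the signal energy so that its peak does not depend on how long the dot stays in view, and all names and parameter values are made up for the example.

```python
import numpy as np

def receptor_signal(t, position, velocity, start=20.0, width=1.0):
    """Response over time of a small RF at `position` to a moving dot."""
    # The dot starts `start` units away, on the side it is coming from.
    dot_position = -np.sign(velocity) * start + velocity * t
    return np.exp(-(dot_position - position) ** 2 / (2.0 * width ** 2))

def delayed(signal, delay):
    """Shift a signal `delay` samples later in time, filling with zeros."""
    out = np.zeros_like(signal)
    out[delay:] = signal[:-delay]
    return out

def reichardt_response(velocity, span=4.0, delay=8, n_steps=200):
    """Normalized opponent output of a pair of RFs `span` apart."""
    t = np.arange(n_steps, dtype=float)
    s1 = receptor_signal(t, 0.0, velocity)
    s2 = receptor_signal(t, span, velocity)
    preferred = np.sum(delayed(s1, delay) * s2)   # RF 1 delayed, coincides with RF 2
    null = np.sum(s1 * delayed(s2, delay))        # mirror-image arm
    return (preferred - null) / (np.linalg.norm(s1) * np.linalg.norm(s2))

tuned_velocity = 4.0 / 8   # span / delay, in units per time step
for v in (0.25, 0.5, 1.0, -0.5):
    print(v, reichardt_response(v))
```

The response is largest at the tuned velocity, smaller for slower or faster motion, and changes sign when the dot moves in the opposite direction.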

http://www.michaelbach.de/ot/cog-hiddenBird/index.html

Disparity - stereo for depth perception
Two slightly shifted RFs in different eyes for disparity detection

In the visual cortex V1 ‘far cells’ and ‘near cells’ are recorded.

Summary:

• The retina is a multi-scale sampling device
• The retina measures with at least 20 overlapping ganglion RF tilings
• Acuity decreases linearly with eccentricity
• 150 million receptors converge into 1 million fibers in the optic nerve
• Amacrine cells play a key role in directional motion detection
• Separate channels to the LGN exist for shape, motion and color

Suggested reading:
• R. Masland, The neuronal organization of the retina, Neuron 76, 2012
• D. H. Hubel, Eye, Brain and Vision, Scientific American Books, 1995
• H. Kolb et al., Webvision, https://webvision.med.utah.edu/
• E. R. Kandel et al., Principles of Neural Science, New York, McGraw-Hill, 2013
• R. W. Rodieck, The First Steps in Seeing, Sinauer Associates, 1998

Research questions

1. Why do we have center-surround receptive fields in the retina?
2. Why do we have on- and off channels?
3. Why do we have 150 million rods and cones, and only 1 million fibers in the optic nerve?
4. Why can neurons fire so slowly, while our computers need GigaHertz operations?
5. Why do we have ~20 retinal channels sampling the outer world image (Masland 2012)?
6. Why do cones have a cone shape?
7. Why do we make such precise binocular micro-saccades?
8. Why do we have pinwheel orientation structures in the visual cortex?
9. What is the visual field size of a cortical (pinwheel) hypercolumn?

FEV

Central Visual Pathways

David Hubel, Nobel Prize 1981

Torsten Wiesel, Nobel Prize 1981

The 6 layers of the LGN:
4 parvo-cellular layers (small cells)
2 magno-cellular layers (large cells)

[Figure: the six LGN layers, four labelled parvo and two labelled magno, each marked with its eye of input, L or R]

Motion channel

http://www.michaelbach.de/ot/cog-hiddenBird/index.html

The receptive fields of LGN cells all have an on- or off-center-surround sensitivity profile

Spatio-temporal receptive field mapping by reverse correlation by Ohzawa and Freeman (UC Berkeley)

Stimulus

RF profile

Temporal RF

Reverse Correlation Technique

Three main types of receptive field sensitivity profiles

Time sequence of a simple cell RF, separable
DeAngelis, Ohzawa and Freeman, TINS 1995

Reverse Correlation Technique

Voltage sensitive dye optical imaging of tree shrew cortex

Brain-inspired image analysis: multi-orientation (brain-inspired algorithms: multi-orientation analysis)
Neuro-mathematics: multi-orientation analysis by the cortex

pinwheel

Cortical hypercolumn: 0.3 x 0.3 mm

Color orientation coding

The technique of Voltage Sensitive Dyes for the measurement of neural population signals was pioneered by Prof. Amiram Grinvald, Weizmann Institute, Israel.

Voltage Sensitive Dyes

https://en.wikipedia.org/wiki/Voltage-sensitive_dye
http://www.weizmann.ac.il/brain/grinvald/

Optical dye response at different orientations (monkey V1): the discovery of cortical hypercolumns (1991).

From Bonhoeffer and Grinvald, Nature 353, 429-431, 1991

Voltage Sensitive Dyes

Fitzpatrick, Duke University, Nature 2002

Connections exist between similar orientations in far away columns

Alexander & van Leeuwen, 2010

[Figure: image, rotating kernel, orientation space]

What are proper kernels that allow an inverse orientation transform? (no data loss)

Fourier Transform / Inverse Fourier Transform: Sin / Cos

Orientation Space 2D: Cake kernels, a new wavelet family

Orientation Space 3D: Mathieu functions

Exactly invertible orientation transform:

[Equation: definition of the kernels Φ in terms of ∂, z, σ and the index n ≥ 0]

Multi-orientation differential geometry

Gabor vs Cake Kernel – Fourier Domain

Gabor Kernel

Cake Kernel

Different orientations are disentangled in the orientation space

[Figure: image and its orientation score; rotating kernel; filtered image]

Franken, Duits, ter Haar Romeny, TU/e, 2010

Denoising of crossing fibers (collagen, tissue engineered heart valve)

Properties of the Gaussian kernel

• Cascade property: a Gaussian convolved with a Gaussian is a Gaussian
• Normalization: area = 1
• Separable
• Relation to binomial coefficients
• Relation to generalized functions (Dirac, Heaviside)
• Fourier transform of a Gaussian is also a Gaussian
• Low-pass filter
• A narrow kernel in the spatial domain is a wide kernel in the Fourier domain
• Solution of the diffusion equation
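Three of these properties (cascade, separability, normalization) can be checked numerically with sampled Gaussians. The widths 2 and 3 and the grid radii are arbitrary choices for this sketch; for a cascade the variances add, so the resulting width is sqrt(2^2 + 3^2).

```python
import numpy as np

def sampled_gaussian(sigma, radius):
    """1D Gaussian of width sigma sampled at integer positions -radius..radius."""
    x = np.arange(-radius, radius + 1)
    return np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

s1, s2 = 2.0, 3.0
g1 = sampled_gaussian(s1, 30)
g2 = sampled_gaussian(s2, 30)

# Cascade property: G(s1) convolved with G(s2) equals G(sqrt(s1^2 + s2^2)).
cascade = np.convolve(g1, g2)                    # samples at x = -60..60
direct = sampled_gaussian(np.hypot(s1, s2), 60)

# Separability: a 2D Gaussian is the outer product of two 1D Gaussians.
g2d_separable = np.outer(g1, g1)
x = np.arange(-30, 31)
xx, yy = np.meshgrid(x, x)
g2d_direct = np.exp(-(xx**2 + yy**2) / (2.0 * s1**2)) / (2.0 * np.pi * s1**2)

# Normalization: the kernel sums (integrates) to 1.
area = g1.sum()

print(np.allclose(cascade, direct), np.allclose(g2d_separable, g2d_direct), area)
```

Separability is what makes Gaussian filtering cheap: a 2D blur costs two 1D passes instead of one full 2D convolution.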

Properties of the Gaussian derivative kernels

• Gaussian derivative = Gaussian times Hermite polynomial
• Band-pass filter
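The Hermite relation can be made concrete: with the physicists' Hermite polynomials H_n, the n-th derivative of the 1D Gaussian is (-1 / (σ√2))^n H_n(x / (σ√2)) times the Gaussian itself. The sketch below, with an arbitrary σ = 1.5, compares this with the closed forms of the first and second derivative; the function names are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite import hermval   # physicists' Hermite H_n

def gaussian(x, sigma):
    return np.exp(-x**2 / (2.0 * sigma**2)) / (sigma * np.sqrt(2.0 * np.pi))

def gaussian_derivative(x, sigma, order):
    """n-th derivative of the 1D Gaussian via the Hermite polynomial H_n."""
    u = x / (sigma * np.sqrt(2.0))
    coeffs = np.zeros(order + 1)
    coeffs[order] = 1.0                          # selects H_order
    scale = (-1.0 / (sigma * np.sqrt(2.0))) ** order
    return scale * hermval(u, coeffs) * gaussian(x, sigma)

x = np.linspace(-10.0, 10.0, 401)
sigma = 1.5

d1 = gaussian_derivative(x, sigma, 1)
d2 = gaussian_derivative(x, sigma, 2)

# Closed forms obtained by differentiating the Gaussian by hand.
d1_closed = -x / sigma**2 * gaussian(x, sigma)
d2_closed = (x**2 / sigma**4 - 1.0 / sigma**2) * gaussian(x, sigma)

print(np.allclose(d1, d1_closed), np.allclose(d2, d2_closed))
```

The polynomial factor adds zero crossings with increasing order, which is what turns the low-pass Gaussian into a band-pass derivative filter.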

Differential structure of images

Gauge invariants are made with intrinsic coordinates v and w.
Every derivative with respect to v and/or w is an orthogonal invariant.

Notebook

Some examples:
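For reference, commonly used gauge-coordinate expressions from the scale-space literature can be written in Cartesian derivatives as follows. These are standard forms given as a sketch, not equations recovered from the slides; here w is taken along the gradient and v along the isophote, and sign conventions differ between authors.

```latex
\begin{aligned}
L_w &= \sqrt{L_x^2 + L_y^2}
  && \text{(gradient magnitude)} \\
L_{vv} &= \frac{L_y^2 L_{xx} - 2 L_x L_y L_{xy} + L_x^2 L_{yy}}{L_x^2 + L_y^2}
  && \text{(second derivative along the isophote)} \\
\kappa &= -\frac{L_{vv}}{L_w}
  && \text{(isophote curvature)} \\
L_{vv} L_w^2 &= L_y^2 L_{xx} - 2 L_x L_y L_{xy} + L_x^2 L_{yy}
  && \text{(affine invariant corner detector)}
\end{aligned}
```

All derivatives are understood as Gaussian derivatives at a chosen scale σ.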

Second order structure

Affine invariant corner detection

Affine invariant corner detector:

Third order structure:

Change of isophote curvature at a T-junction →

Use: 3D TV from 2D video

Deep Learning with Convolutional Neural Networks

The first filters must represent the incoming data as efficiently as possible → represented in a compact basis

Mapping of spatiotemporal receptive fields in V1 of the tree shrew.

The negative subfield of the kernel is located centrally, indicating a preponderance of symmetric second order kernels in the selected RFs.

Calcium intrinsic imaging combined with reverse correlation of responses to a sparse noise stimulus.

From: K. S. Lee, X. Huang, and D. Fitzpatrick. Topology of ON and OFF inputs in visual cortex enables an invariant columnar architecture. Nature, 533(7601):90-94, May 2016

Principal Component Analysis (PCA) finds the intrinsic orthogonal local coordinate frame in the data as the orthogonal eigenvectors of the covariance matrix.

Covariance matrix (from Wikipedia):
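A minimal NumPy sketch of PCA as an eigen-decomposition of the covariance matrix. The 2D toy data, elongated along the diagonal, are an assumption made for illustration; with image patches the same steps apply, with each flattened patch as one row of `data`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2D data (assumption): large spread along (1, 1), small spread across it.
main_axis = np.array([1.0, 1.0]) / np.sqrt(2.0)
minor_axis = np.array([-1.0, 1.0]) / np.sqrt(2.0)
data = (rng.normal(scale=3.0, size=(500, 1)) * main_axis
        + rng.normal(scale=0.5, size=(500, 1)) * minor_axis)

# Covariance matrix of the centered data.
centered = data - data.mean(axis=0)
cov = centered.T @ centered / (len(data) - 1)

# Eigenvectors of the symmetric covariance matrix are orthogonal;
# sort them by decreasing eigenvalue (explained variance).
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

first_component = eigenvectors[:, 0]
print(first_component, eigenvalues)
```

The first component recovers the direction of largest spread (up to sign), and the eigenvalues give the variance along each axis of the new coordinate frame.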

Principal Component Analysis
Math: Learn from image patches → simple filters, edges, lines

If data are restricted: filters are restricted

Lesson:

Handcrafted filters

Filters are in the DATA

• Multi-scale derivatives
• Lie group: infinitesimal generator of translation
• Taylor expansion

https://www.youtube.com/watch?v=QzkMo45pcUo

Colin Blakemore’s famous experiment with visual deprivation (1974)

Blakemore, Colin, and Grahame F. Cooper. “Development of the brain depends on the visual environment.” Nature, 228(5270), 477-478 (1970).

Blakemore’s cat: for the first three months after birth it sees only horizontal stripes

After three months: it could see a horizontal stick, but NOT a vertical one.
Lesson: it had never learned filters for vertical lines, they were not in the data.

Bev Doolittle: The forest has eyes

The challenge

Computer-Aided Diagnosis

Bev Doolittle: The forest has eyes

We have much more difficulty in recognizing faces upside-down.
