Part 8: Neural networks
EEEN60178: Digital Image Engineering Hujun Yin
Cross-section of the retina, drawn by Santiago Ramón y Cajal (1852-1934).
Part 8: Neural networks
• Perception & cognition: retina & visual pathway; artificial intelligence
• Neurons & models
• Learning paradigms & network architectures
• Multilayer perceptrons
• Self-organising networks
• Learning & generalisation
• Genetic algorithm
8.1 Perception & cognition
• Perception & cognition
[Image: a tree, as an example of a perceived object]
8.1 Perception & cognition
• Evolution
“Evolution is fascinating to watch. To me it is the most
interesting when one can observe the evolution of a single man.” Shana Alexander
8.1 Perception & cognition
• Evolution of eyes
Photosensory cell → receptor pigment → cup eye → pinhole eye → single-chambered (lens) eye
Compound eyes have good temporal resolving power: human eyes need about 0.05 second to identify objects, while compound eyes need only 0.01 second.
8.1 Perception & cognition
• Evolution of eyes
[Figure: Limulus (horseshoe crab): head, abdomen, tail and compound eyes composed of ommatidia]
Hartline et al. (1960s) verified receptive fields and lateral inhibition in Limulus eyes.
8.1 Perception & cognition
• Eyes
[Figure: lateral inhibition between neighbouring excitatory (+) receptors]
Lateral inhibition serves to extract object boundaries, giving rise to the Mach band effect.
8.1 Perception & cognition
• A system point of view
[Figure: the human visual system (HVS) viewed as a linear system. An input intensity profile (intensity vs spatial position) has its input spectrum (magnitude vs spatial frequency) filtered by the HVS modulation transfer function (MTF, relative sensitivity vs spatial frequency), giving the output spectrum and the output intensity profile.]
8.1 Perception & cognition
• Human eye
Cones: 6-7 million
Red (64%), Green (32%), Blue (2%)
Rods: ~120 million
Dynamic range (brightest to darkest): 20 billion times (i.e. 206 dB)
Cameras: 8-16 bits
8.1 Perception & cognition
• Retina
8.1 Perception & cognition
• Visual pathway
There are about 100 million photosensitive cells in the human retina, but only about 1 million optic nerve fibres connecting the retina to the cortex.
8.1 Perception & cognition
• Visual cortex
• Many regions have predetermined responses to visual stimuli.
• Sensory experience from the external world can influence how the brain wires itself up after birth.
Visual areas: V1, V2; V3 (orientation); V4 (colour); V5 (motion); V6 (depth)
8.1 Perception & cognition
• Cortex
8.1 Perception & cognition
• Perception
Einstein and ??
- from Kelso (1997)
8.1 Perception & cognition
• Perception
Rubin’s vase
Kanizsa triangle
8.1 Perception & cognition
• Cognition
Cognition is the result of a dynamic pattern-forming process in the brain, in which brain activities and behaviours appear to self-organise into coherent states or patterns. Self-organisation thus provides a paradigm for behaviour and cognition, as well as for the structure and function of the nervous system. - from Kelso (1997)
8.1 Perception & cognition
• Cognition
The human brain has trillions of cells, large enough [as claimed] to hold all the events we have ever experienced, so why can’t we remember all of them in detail? Pinker believes that our brains are compelled to organise: without categories, mental life would be chaos. Categorisation allows inference about objects having the same or similar properties. - from Pinker (1997)
8.1 Perception & cognition
• Cognition
[Diagram: Psychology, Behaviourism and Neurobiology leading to Cognitive Science, which comprises Computationism and Connectionism]
8.1 Perception & cognition
• Artificial system
Cognition is a complex information-processing process (Marr 1982).
8.1 Perception & cognition
• Computing history
Ancient China: the abacus
Blaise Pascal, 1642: Pascaline (adding machine)
Joseph Jacquard, 1804: punch-card controlled loom
Charles Babbage, 1823: Difference Engine
Augusta Ada (Lovelace), 1836: programming
Herman Hollerith, 1880: data processing devices (IBM)
UK team, 1942: Colossus
USA team, 1946: ENIAC
UK teams, 1949: EDSAC and Manchester Mark I
8.1 Perception & cognition
• Artificial intelligence
Alan Turing: Turing Machine,
Universal Turing Machine
Turing Test (Can Machines Think?)
Top-down: Representation/Symbols
Reasoning/Rules/Grammars
Learning/Logical Operations
Turing test
"Can machines think?" Alan Turing posed this question in his seminal 1950 paper, Computing Machinery and Intelligence, published in Mind.
Singularity?
8.2 Neurons
• Models of neural networks
(Artificial) Neural Networks Approach (bottom-up approach)
° NNs started from findings and discoveries in neurobiology and neuroscience. It has gradually been realised that human brains bear little resemblance to von Neumann type computers.
° The human brain has about 100 billion interconnected neurons. Each is a cell that uses biochemical reactions to receive, process and transmit stimuli. Each neuron is a simple processing unit: it receives stimuli (input) through its dendrites and synapses from roughly 1000 neighbouring neurons (their axons), accumulates these stimuli in the soma, and fires through its axon to its output neurons if the aggregated input exceeds a threshold (at the hillock). Networks of these cells form the basis of information processing.
8.2 Neurons
• Biological neurons
The brain has ~100 billion interconnected neurons, each connecting to ~1000 neighbouring neurons.
8.2 Neurons
• Biological neurons
The brain has 60-100 trillion synapses.
[Figure: a synapse, showing the action potential, neurotransmitter release, protein molecules, and ion channels for sodium, potassium and chloride]
The axon’s action potential opens ion channels, admitting Ca2+, which leads to release of neurotransmitter; this causes ion-conducting channels on the postsynaptic membrane to open. Synapses can have either an excitatory (depolarising) or an inhibitory (hyperpolarising) effect on postsynaptic neurons.
8.2 Neurons
• Biological neurons
8.2 Neurons
• Models of neurons
[Figure: model of a neuron. Input signals $x_1, x_2, \ldots, x_n$ are multiplied by synaptic weights $w_1, w_2, \ldots, w_n$, summed together with a bias $b$ by an adder, and passed through an activation function $\varphi(\cdot)$ to give the output.]

$$y = \varphi(\mathbf{x}^T\mathbf{w} + b)$$

[Plots: typical activation functions $\varphi(v)$, e.g. hard limit and sigmoid, ranging between 0 and 1]

Perceptron
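The model above is easy to express in code. Below is a minimal Python sketch (an illustration added here, not part of the original slides) of a single model neuron: a weighted sum plus bias, passed through an activation function. The function name and the AND-gate weights are choices made for this example.

```python
# A minimal sketch of the neuron model above: weighted sum plus bias,
# passed through an activation function (hard limit or sigmoid).
import numpy as np

def neuron(x, w, b, activation="step"):
    """Compute y = phi(x.w + b) for a single model neuron."""
    v = np.dot(x, w) + b                      # adder: induced local field
    if activation == "step":
        return 1.0 if v >= 0 else 0.0         # hard-limit activation
    return 1.0 / (1.0 + np.exp(-v))           # logistic sigmoid activation

# Example: a neuron with hand-set weights acting as a logical AND gate.
w = np.array([1.0, 1.0])
b = -1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", neuron(np.array(x, dtype=float), w, b))
```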
8.2 Neurons
• Models of neural networks
Multilayer Perceptron (MLP)
[Figure: multilayer perceptron with input layer $x_1, \ldots, x_n$, hidden layer $v_1, \ldots, v_{n_h}$, output layer $y_1, y_2, y_3$, and weights between successive layers]
8.2 Neurons
• Artificial neural networks: a history
McCulloch and Pitts 1943: Linear Threshold Gate (LTG)
Wiener 1948: “Cybernetics”
Hebb 1949: “Organisation of Behaviour”
Rosenblatt 1958: Perceptron
Minsky and Papert 1969: “Perceptrons”
Werbos 1974: Back-propagation algorithm
Hopfield 1982: Hopfield Recurrent Network
Kohonen 1982: Self-Organising Map
Rumelhart et al. 1986: Multilayer Perceptron
Hopfield net
Step 1: Assign connection weights
$$t_{ij} = \begin{cases} \sum_{s=0}^{M-1} x_i^s x_j^s, & i \neq j \\ 0, & i = j \end{cases} \qquad 0 \le i, j \le N-1$$
where $t_{ij}$ is the connection weight from node $i$ to node $j$, and $x_i^s$ (which can be +1 or -1) is element $i$ of the exemplar for class $s$.

Step 2: Initialise with unknown input pattern
$$\mu_i(0) = x_i, \qquad 0 \le i \le N-1$$
where $\mu_i(t)$ is the output of node $i$ at time $t$ and $x_i$ is element $i$ of the input pattern.

Step 3: Iterate until convergence
$$\mu_j(t+1) = f\left[\sum_{i=0}^{N-1} t_{ij}\,\mu_i(t)\right], \qquad 0 \le j \le N-1$$
where $f$ is an activation function such as the hard limit. The process is repeated until the node outputs remain unchanged with further iterations. The node outputs then represent the exemplar pattern that best matches the input.

Step 4: Repeat by going to Step 2
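As an illustration of the four steps above, here is a minimal Python sketch of a Hopfield net with bipolar exemplars. The helper names and the 8-element example patterns are assumptions made for this sketch.

```python
# A minimal sketch of the Hopfield procedure above, assuming bipolar (+1/-1)
# exemplars and repeated synchronous update passes.
import numpy as np

def hopfield_weights(exemplars):
    """Step 1: t_ij = sum_s x_i^s x_j^s for i != j, t_ii = 0."""
    X = np.asarray(exemplars, dtype=float)     # shape (M, N)
    T = X.T @ X
    np.fill_diagonal(T, 0.0)
    return T

def hopfield_recall(T, x, max_iters=100):
    """Steps 2-3: start from the unknown pattern, iterate until stable."""
    mu = np.asarray(x, dtype=float)
    for _ in range(max_iters):
        nxt = np.where(T @ mu >= 0, 1.0, -1.0)  # hard-limit activation f
        if np.array_equal(nxt, mu):             # outputs unchanged: converged
            return nxt
        mu = nxt
    return mu

# Example: store two 8-element exemplars, recall from a corrupted version.
a = np.array([1, -1, 1, -1, 1, -1, 1, -1])
b = np.array([1, 1, 1, 1, -1, -1, -1, -1])
T = hopfield_weights([a, b])
noisy = a.copy(); noisy[0] = -1                 # flip one element
print(hopfield_recall(T, noisy))                # recovers exemplar a
```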
8.3 Learning paradigms
• Learning
The process which leads to the modification of behaviour _Oxford Dictionary
“Some growth process or metabolic change takes place” _ Donald Hebb
Learning is a process by which the free parameters of a neural network are adapted through a process of stimulation by the environment in which the network is embedded. The type of learning is determined by the manner in which the parameter changes take place. _Simon Haykin
8.3 Learning paradigms
• Learning
Learning, in neural network terms, is the process of modification of connection weights.
• Supervised Learning
–Error-correction learning
• Reinforcement learning
• Unsupervised Learning
– Hebbian learning
– Competitive learning
– Self-organisation
Supervised (error-correction): minimise $\|d - y(\mathbf{x}, \mathbf{w})\|^2$; $\Delta\mathbf{w} = \eta\, e(d, y, \mathbf{x})$
Reinforcement: $\Delta\mathbf{w} \propto$ reward; reward $= \max_a \{r(s, a)\}$
Hebbian: $\Delta\mathbf{w} = \eta\, \mathbf{x}\, y$
Competitive: $\Delta\mathbf{w}_i = f(\mathbf{x}, y, i)$ if neuron $i$ wins, 0 otherwise
Self-organisation: $\Delta\mathbf{w}_i = \eta\, \mathbf{x}\, y_i$ if neuron $i$ is close to the winner, 0 otherwise
Global order can arise from local interactions (Turing 1952)
8.4 Network architectures
• Feedforward
[Figure: feed-forward network with input layer $x_1, \ldots, x_n$, hidden layer $v_1, \ldots, v_{n_h}$, output layer $y_1, y_2, y_3$, and weights between successive layers]
• Feed-forward networks
– Perceptron and multilayer perceptron
– Radial basis function
– Support vector machine
• Recurrent networks
– Hopfield networks
– Boltzmann machine
[Figure: recurrent network with inputs $x_1, \ldots, x_n$, units $v_1, \ldots, v_{n_h}$, outputs $y_1, \ldots, y_n$, and feedback connections through the weights]
8.5 Feedforward networks
• Perceptron learning
Least mean square method:
$$y(t) = \mathbf{x}^T(t)\mathbf{w}(t) = \sum_{k=0}^{n} x_k(t)\, w_k(t)$$

Cost function: $J = E\left[\frac{1}{2}e^2(t)\right]$, approximated by the instantaneous value $J(t) = \frac{1}{2}e^2(t)$, with error $e(t) = d(t) - y(t)$ (without thresholding or activation function).

$$\frac{\partial J(t)}{\partial \mathbf{w}} = e(t)\frac{\partial e(t)}{\partial \mathbf{w}} = -e(t)\,\mathbf{x}(t)$$

$$\mathbf{w}(t+1) = \mathbf{w}(t) - \eta\frac{\partial J(t)}{\partial \mathbf{w}} = \mathbf{w}(t) + \eta\, e(t)\,\mathbf{x}(t) = \mathbf{w}(t) + \eta\,[d(t) - \mathbf{x}^T(t)\mathbf{w}(t)]\,\mathbf{x}(t)$$

[Figure: perceptron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, bias $b$ and hard-limit output $y = \mathrm{sign}(\mathbf{x}^T\mathbf{w} + b) \in \{+1, -1\}$]
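The update rule derived above can be exercised in a few lines of Python. The following sketch is illustrative only: the toy data, the 'true' weight vector and the learning rate are arbitrary choices, and the bias is folded into the weight vector as a constant first input.

```python
# A minimal sketch of the LMS rule derived above:
# w(t+1) = w(t) + eta * (d - x'w) * x, with the bias folded in as w_0.
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: labels from a chosen weight vector.
X = rng.uniform(-1, 1, size=(200, 2))
X = np.hstack([np.ones((200, 1)), X])          # prepend constant 1 for the bias
d = np.sign(X @ np.array([-0.2, 1.0, 1.0]))    # desired outputs (+1/-1)

w = np.zeros(3)
eta = 0.05
for epoch in range(50):
    for x_t, d_t in zip(X, d):
        e = d_t - x_t @ w                      # error without activation (LMS)
        w += eta * e * x_t                     # gradient-descent weight update

print("learned weights:", w)
print("training accuracy:", np.mean(np.sign(X @ w) == d))
```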
8.5 Feedforward networks
• Perceptron limitations
[Figure: perceptron with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, bias $b$ and hard-limit output $y = \mathrm{sign}(\mathbf{x}^T\mathbf{w} + b) \in \{+1, -1\}$]
• Linear separation only
• Sensitive to initial states
• Local minima
[Figure: a cost function over $x$, with a starting point whose descent leads to a local minimum rather than the global minimum]
8.5 Feedforward networks
• Multilayer perceptron: back-propagation algorithm

Step 1: Initialisation. Set all weights and node biases to small random numbers.

Step 2: Presentation of training examples. Present input $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$ and desired output $\mathbf{d} = [d_1, d_2, \ldots, d_m]^T$.

Step 3: Forward computation. Calculate the outputs of the hidden and output layers,
$$v_j = \varphi\left(\sum_{i=1}^{n} w_{ij}^h x_i + b_j^h\right), \qquad y_k = \varphi\left(\sum_{j=1}^{n_h} w_{jk}^o v_j + b_k^o\right)$$

Step 4: Backward computation and weight update. Compute the error terms (for sigmoid units),
$$\delta_k^o = e_k\,\varphi' = e_k\, y_k (1 - y_k) = (d_k - y_k)\, y_k (1 - y_k)$$
$$\delta_j^h = \varphi' \sum_{k=1}^{m} \delta_k^o\, w_{jk}^o = v_j (1 - v_j) \sum_{k=1}^{m} \delta_k^o\, w_{jk}^o$$
and update the weights,
$$w_{jk}^o \leftarrow w_{jk}^o + \eta\, \delta_k^o\, v_j, \qquad w_{ij}^h \leftarrow w_{ij}^h + \eta\, \delta_j^h\, x_i$$

Step 5: Iteration. Repeat steps 3 and 4, and stop when the total error reaches the required level.
[Figure: multilayer perceptron with input layer $x_1, \ldots, x_n$, hidden layer $v_1, \ldots, v_{n_h}$, output layer $y_1, y_2, y_3$, and weights between successive layers]
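A minimal Python sketch of these five steps follows, assuming logistic-sigmoid units in both layers (so the derivative terms $y(1-y)$ and $v(1-v)$ above apply) and the XOR problem as training data. The layer sizes, learning rate and epoch count are illustrative choices.

```python
# A minimal sketch of the back-propagation steps above for one hidden layer.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

# XOR: the classic problem a single-layer perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: initialise weights and biases to small random numbers.
Wh = rng.normal(0, 0.5, (2, 4)); bh = np.zeros(4)    # input -> hidden
Wo = rng.normal(0, 0.5, (4, 1)); bo = np.zeros(1)    # hidden -> output
eta = 0.5

for epoch in range(10000):
    # Steps 2-3: present the examples and compute the forward pass.
    V = sigmoid(X @ Wh + bh)                   # hidden outputs v_j
    Y = sigmoid(V @ Wo + bo)                   # network outputs y_k
    # Step 4: backward computation of error terms and weight updates.
    delta_o = (D - Y) * Y * (1 - Y)            # output-layer deltas
    delta_h = (delta_o @ Wo.T) * V * (1 - V)   # hidden-layer deltas
    Wo += eta * V.T @ delta_o; bo += eta * delta_o.sum(axis=0)
    Wh += eta * X.T @ delta_h; bh += eta * delta_h.sum(axis=0)

print(Y.round(2).ravel())                      # should approach [0, 1, 1, 0]
```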
8.5 Feedforward networks
• Radial basis functions
Properties:
• Hidden neurons are Gaussian
• Output neurons are linear, similar to a single-layer perceptron
• Easy to train: hidden nodes can be chosen first by clustering; the output layer is then trained as a single-layer perceptron
• Fast to converge
[Figure: RBF network with input layer $x_1, \ldots, x_n$, Gaussian hidden layer $v_1, \ldots, v_{n_h}$ (input weights fixed to 1), and linear output layer $y_1, y_2, y_3$]

Gaussian hidden unit:
$$\varphi_k(\mathbf{x}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{m}_k\|^2}{2\sigma_k^2}\right)$$
8.5 Feedforward networks
• Radial basis functions
The RBF model rests on two fundamental theories:

The Universal Approximation Theory: any function can be approximated with arbitrary precision by a weighted sum of a set of non-constant, bounded and monotone-increasing continuous functions,
$$\hat{f}(\mathbf{x}) = \sum_{i=1}^{M} w_i\, \varphi_i(\mathbf{x})$$

Cover’s Separability Theory: a complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
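The two-stage training described above (clustering for the hidden layer, then linear training of the output layer) might look as follows in Python. This is an illustrative sketch: the k-means routine, the width sigma and the toy sine-regression data are all choices made here, and the output weights are obtained in closed form by least squares rather than by iterative perceptron training.

```python
# A minimal RBF-network sketch: Gaussian hidden units with centres chosen
# by k-means clustering, then a linear output layer solved by least squares.
import numpy as np

rng = np.random.default_rng(2)

def kmeans(X, k, iters=20):
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centres[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = X[labels == j].mean(axis=0)
    return centres

def design_matrix(X, centres, sigma):
    d2 = ((X[:, None] - centres[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))      # Gaussian hidden-unit outputs

# Toy regression: noisy samples of a sine curve.
X = rng.uniform(0, 2 * np.pi, (100, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=100)

centres = kmeans(X, k=10)
Phi = design_matrix(X, centres, sigma=0.5)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # train linear output layer

print("training RMSE:", np.sqrt(np.mean((Phi @ w - y) ** 2)))
```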
8.6 Self-organising networks
• Hebbian learning
When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic changes take place in one or both cells such that A’s efficiency as one of the cells firing B, is increased.
In mathematical terms:
$$\Delta\mathbf{w} = \eta\, \mathbf{x}\, y$$

Oja’s rule (Hebbian learning with weight normalisation):
$$w_i(t+1) = \frac{w_i(t) + \eta\, y(t)\, x_i(t)}{\left\{\sum_{j=1}^{n} [w_j(t) + \eta\, y(t)\, x_j(t)]^2\right\}^{1/2}} \approx w_i(t) + \eta\, y(t)\,[x_i(t) - y(t)\, w_i(t)]$$
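A short Python sketch of the approximate form of Oja's rule above; the data distribution, learning rate and seed are arbitrary choices. The weight vector should converge towards the first principal direction of the inputs, which is the known behaviour of Oja's rule.

```python
# A minimal sketch of Oja's rule: a Hebbian term minus a decay term,
# converging to the first principal direction of the input data.
import numpy as np

rng = np.random.default_rng(3)

# Zero-mean 2-D data stretched along the direction (1, 1).
A = np.array([[2.0, 1.5], [1.5, 2.0]])
X = rng.normal(size=(5000, 2)) @ A

w = rng.normal(size=2)
eta = 0.01
for x in X:
    y = x @ w                     # neuron output
    w += eta * y * (x - y * w)    # Oja's rule: Hebbian term minus decay

w /= np.linalg.norm(w)
print("learned direction:", w)    # approx +-(0.707, 0.707), the top eigenvector
```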
8.6 Self-organising networks
• Hebbian learning
Train a perceptron for an AND function, $y = b + \sum_i x_i w_i$, using the Hebbian rule $\Delta w_i = x_i y$ (learning rate 1):

Input (x1, x2, 1)   Output y   Changes (Δw1, Δw2, Δb)   Weights (w1, w2, b)
(initial)                                               (0, 0, 0)
 1,  1, 1            1          1,  1,  1               (1, 1, 1)
 1, -1, 1           -1         -1,  1, -1               (0, 2, 0)
-1,  1, 1           -1          1, -1, -1               (1, 1, -1)
-1, -1, 1           -1          1,  1, -1               (2, 2, -2)

[Figure: the resulting perceptron and its decision boundary $x_1 + x_2 = 1$, separating the positive example (1, 1) from the three negative examples]
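The table above can be reproduced directly. The sketch below (illustrative Python) applies the Hebbian update once per training pair, prints the running weights, and then verifies the learned boundary.

```python
# A minimal sketch reproducing the Hebbian training table above for the
# bipolar AND function: after one pass, w = (2, 2), b = -2.
import numpy as np

X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
d = np.array([1, -1, -1, -1], dtype=float)    # bipolar AND targets

w = np.zeros(2); b = 0.0
for x, y in zip(X, d):
    w += x * y                                # Hebbian update: dw = x * y
    b += y                                    # bias input is the constant 1
    print(f"x={x}, y={y:+.0f} -> w={w}, b={b:+.0f}")

# Check the learned boundary x1 + x2 = 1 classifies AND correctly.
print(np.sign(X @ w + b))                     # [ 1, -1, -1, -1]
```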
8.6 Self-organising networks
• Von der Malsburg and Willshaw model (1973,1976)
Activity of postsynaptic cell $i$, with short-range excitatory and long-range inhibitory lateral connections:
$$\frac{\partial y_i(t)}{\partial t} = -c_i\, y_i(t) + \sum_j w_{ij}(t)\, x_j(t) + \sum_{k'} e_{ik'}\, y_{k'}^*(t) - \sum_{k''} b_{ik''}\, y_{k''}^*(t)$$

Output after thresholding:
$$y_j^*(t) = \begin{cases} y_j(t), & \text{if } y_j(t) \ge \theta \\ 0, & \text{otherwise} \end{cases}$$

Hebbian weight update, subject to normalisation ($\sum_j w_{ij}$ kept constant):
$$\frac{\partial w_{ij}(t)}{\partial t} = \eta\, x_j(t)\, y_i^*(t)$$
8.6 Self-organising networks
• Kohonen's SOM
Input vector $\mathbf{x} = [x_1, x_2, \ldots, x_n]^T$.

Winner selection:
$$\|\mathbf{x}(t) - \mathbf{w}_v(t)\| = \min_j \|\mathbf{x}(t) - \mathbf{w}_j(t)\|$$

Bubble neighbourhood output:
$$y_j(t+1) = \begin{cases} 1, & \text{if neuron } j \text{ is inside the bubble} \\ 0, & \text{otherwise} \end{cases}$$

Weight adaptation:
$$\frac{d w_{ij}(t)}{dt} = \begin{cases} \alpha(t)\,[x_i(t) - w_{ij}(t)], & \text{if } j \in N_v(t) \\ 0, & \text{if } j \notin N_v(t) \end{cases}$$

or, in discrete form with a neighbourhood function $h_{vj}$:
$$\mathbf{w}_j(t+1) = \mathbf{w}_j(t) + \alpha(t)\, h_{vj}\, [\mathbf{x}(t) - \mathbf{w}_j(t)]$$
8.6 Self-organising networks
• SOM algorithm
• At each time $t$, present an input $\mathbf{x}(t)$ and select the winner:
$$v = \arg\min_c \|\mathbf{x}(t) - \mathbf{w}_c(t)\|$$
• Update the weights of the winner and its neighbours:
$$\mathbf{w}_k(t+1) = \mathbf{w}_k(t) + \alpha(t)\, h(v, k, t)\, [\mathbf{x}(t) - \mathbf{w}_k(t)]$$
• Repeat until the map converges.

Typical neighbourhood function:
$$h(v, k, t) = \exp\left[-\frac{\|\mathbf{r}_v - \mathbf{r}_k\|^2}{2\sigma^2(t)}\right]$$
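Putting the three steps together, a compact Python sketch of SOM training on a 10x10 map follows. The linear decay schedules for alpha(t) and sigma(t) and the unit-square training data are illustrative choices, not prescribed by the slides.

```python
# A minimal sketch of the SOM algorithm above: winner selection by minimum
# distance, then a Gaussian-neighbourhood update shrinking over time.
import numpy as np

rng = np.random.default_rng(4)

# A 10x10 map of 2-D weight vectors, trained on points from the unit square.
grid = np.array([(i, j) for i in range(10) for j in range(10)], dtype=float)
W = rng.uniform(size=(100, 2))                 # weight vector per map node
X = rng.uniform(size=(5000, 2))                # training inputs

T = len(X)
for t, x in enumerate(X):
    alpha = 0.5 * (1 - t / T)                  # decaying learning rate
    sigma = 3.0 * (1 - t / T) + 0.5            # decaying neighbourhood width
    v = np.argmin(((W - x) ** 2).sum(axis=1))  # winner: nearest weight vector
    d2 = ((grid - grid[v]) ** 2).sum(axis=1)   # map distance to the winner
    h = np.exp(-d2 / (2 * sigma ** 2))         # Gaussian neighbourhood function
    W += alpha * h[:, None] * (x - W)          # update winner and neighbours

print("weight range:", W.min(axis=0), W.max(axis=0))  # spread over the square
```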
8.6 Self-organising networks
• SOM clustering
[Figure: SOM of animals on a 10x10 grid. Birds (dove, hen, duck, goose, owl, hawk, eagle), predators (fox, dog, wolf, cat, tiger, lion) and grazers (horse, zebra, cow) occupy distinct, topologically ordered regions of the map.]
8.6 Self-organising networks
• SOM: data visualisation/management
8.6 Self-organising networks
• SOM
Quantisation & topology
Topologically “ordered” map
“Error tolerant” coding: HVQ (Luttrell, NC 1994)
“Minimum wiring” (Mitchison, NC 1995; Durbin & Mitchison, Nature 1990)
8.6 Self-organising networks
• SOM extensions
° HSOM (Miikkulainen 1990), DISLEX (1990, 1997)
° PSOM (Ritter 1993), Hyperbolic SOM (1999), H2SOM
° Temporal Kohonen Map (Chappell & Taylor 1993)
° Neural Gas (Martinetz et al. 1991), Growing Grid (Fritzke 1995)
° ASSOM (Kohonen 1997)
° Recurrent SOM (Koskela 1997), Recursive SOM (Voegtlin 2001)
° SOAR (Lampinen & Oja 1989), SOMAR (Ni & Yin 2007)
° Bayesian SOM & SOMN (Yin & Allinson 1995, 1997; Utsugi 1997)
° GTM (Bishop et al. 1998)
° GHSOM (Merkl et al. 2000), TOC (TreeSOM) (Freeman & Yin 2004)
° PicSOM (Laaksonen, Oja, et al. 2000)
° ViSOM (Yin 2001, 2002), gViSOM (Yin 2007)
8.7 Learning & generalisation
• Learning (fitting) & generalisation
[Figure: two networks fitted to the same training data (X vs Y): a smooth fit, and a wiggly fit that passes through every point]

On one hand, we want as high a classification rate (or as low an error) as possible on the training data set.
On the other hand, we would also like the trained network to perform well on unseen (testing) data.
Which is better?
8.7 Learning & generalisation
• Bias & variance dilemma (finite data set D)
Average (squared) error:
$$\frac{1}{2}\int [y(x) - d(x)]^2\, p(x)\, dx$$

Decompose the error by adding and subtracting $E\{y(x)\}$:
$$[y(x) - d(x)]^2 = [y(x) - E\{y(x)\} + E\{y(x)\} - d(x)]^2$$
$$= [E\{y(x)\} - d(x)]^2 + [y(x) - E\{y(x)\}]^2 + 2\,[y(x) - E\{y(x)\}][E\{y(x)\} - d(x)]$$

Taking the expectation over data sets $D$ (the cross term vanishes):
$$E_D\{[y(x) - d(x)]^2\} = \underbrace{[E_D\{y(x)\} - d(x)]^2}_{\text{(bias)}^2} + \underbrace{E_D\{[y(x) - E_D\{y(x)\}]^2\}}_{\text{(variance)}}$$
A close fitter/model yields small bias but large variance, while a loose fitter (or a wild guess) gives small variance but large bias.
8.7 Learning & generalisation
• Bias & variance dilemma (finite data set D)
g(x): generating curve; dots: sample data (* courtesy of http://www.aiaccess.net/)

For $h_1(x)$, a rigid fit (e.g. a straight line): the variance
$$E_D\{[y(x) - E_D\{y(x)\}]^2\} \approx 0$$
since $E_D\{y(x)\} \approx h_1(x)$ whatever the sample, but the bias $[E_D\{y(x)\} - d(x)]^2$ is large because $h_1(x) \ne g(x)$. Low variance, but large bias (error)!

For $h_2(x)$, a close fit through the sample points: the bias
$$[E_D\{y(x)\} - d(x)]^2 \approx 0$$
since on average $E_D\{y(x)\} \approx g(x)$, but the variance is large. That is, different samples give different $h_2(x)$, thus large variance! Or, if for Sample 2 one still uses the $h_2(x)$ obtained from Sample 1, the errors are large. Low bias but large variance! Poor generalisation!
8.7 Learning & generalisation
• Bias & variance dilemma (finite data set D)

[Figure: polynomial fits of order 0, 1, 3 and 9 to noisy samples (dots) of a sin(2πx) generating curve. The 3rd-order polynomial is the best representation; the 9th-order fit passes through every sample and over-fits.]
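The polynomial example above is easy to reproduce numerically. This Python sketch (illustrative; the noise level and sample size are assumptions made here) fits polynomials of order 0, 1, 3 and 9 and compares training and test errors, exposing the over-fitting of the 9th-order fit.

```python
# A minimal sketch of the over-fitting illustration above: polynomial fits of
# increasing order to noisy samples of a sin(2*pi*x) generating curve.
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)   # noisy samples
x_test = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_test)

for order in (0, 1, 3, 9):
    coeffs = np.polyfit(x, y, order)           # least-squares polynomial fit
    train_err = np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))
    test_err = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_true) ** 2))
    print(f"order {order}: train RMSE {train_err:.3f}, test RMSE {test_err:.3f}")
# The 9th-order fit drives the training error to ~0 but has the worst test error.
```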
8.8 Genetic algorithms
• Historical point of view
- Evolutionary computation:
A computational and optimisation methodology based on genetic and evolutionary mechanisms, proposed by John Holland in the 1960s.
8.8 Genetic algorithms
• Genetic view
The human body consists of thousands of millions of cells. Each cell’s nucleus contains 46 chromosomes: 23 from each parent, making 23 chromosome pairs in each cell of the offspring.
The chromosomes consist of DNA (deoxyribonucleic acid), a double helix made of sugar (deoxyribose), phosphate and four nitrogen bases: A (adenine), T (thymine), C (cytosine) and G (guanine). The bases A and T, and the bases C and G, always pair with each other.
Our DNA consists of 70-100 thousand genes. A gene is a segment of DNA that codes for the production of a specific protein by spelling out the sequence of amino acids composing that protein. It is also the genes that give every person their particular looks. The genes we inherit from our parents decide which sex we will be, the colour of our skin and eyes, and whether we will be tall or short; and if a gene in our body is defective, we can become ill.
8.8 Genetic algorithms
• Genetic view
23 paired chromosomes of the human (male):
22 pairs of autosomes
Sex chromosomes: 1 X and 1 Y (male), or 1 pair of X (female)
A trait (allele) occupies a locus on a chromosome.
8.8 Genetic algorithms
• Genetic algorithm (GA)
Population: a set of possible solutions (chromosomes).
Initial population: can be randomly chosen.
Fitness: a fitness measure for a solution (chromosome) needs to be defined.
Encoding: each solution (chromosome) is encoded, usually as a binary sequence (a number of binary bits).
8.8 Genetic algorithms
• GA operations
Selection: select a pair of chromosomes from the population, with probabilities in proportion to their fitness.
Crossover: randomly choose a locus and exchange the bits after that locus between the two chosen chromosomes to create two child chromosomes. For example, if the chosen parent strings are 11000100 and 01111110 and the locus is chosen at the third bit, then the child strings are 11011110 and 01100100.
Mutation: randomly (with a small probability) flip some of the bits in the chromosomes. For example, the string 00011000 might be mutated in its fifth position to yield 00010000.
8.8 Genetic algorithms
• Selection: Roulette wheel
8.8 Genetic algorithms
• Crossover (one-point, after the fourth bit):
11001011 + 11011111 = 11001111
• Mutation (bits flipped at random positions):
11001111 => 10001011
8.8 Genetic algorithms
• Example
Find the maximum point of
f(y) = y + |sin(32y)|, 0 ≤ y < π
Using 16-bit encoding for solutions in [0, π)
Population size: 200
Initial population: randomly chosen from the search space
Fitness: f(y)
Crossover rate: 0.5
Mutation rate: 0.005
run matlab code: GA1.m
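Alongside the MATLAB script GA1.m referenced above (not reproduced here), an equivalent Python sketch of the example might look as follows. The generation count is an arbitrary choice, and the search interval [0, π) follows the reconstruction above.

```python
# A minimal GA sketch for the example above: 16-bit encoding of y in [0, pi),
# roulette-wheel selection, one-point crossover and bit-flip mutation,
# maximising f(y) = y + |sin(32y)|.
import numpy as np

rng = np.random.default_rng(6)

BITS, POP, GENS = 16, 200, 100
P_CROSS, P_MUT = 0.5, 0.005

def decode(pop):
    """Map each 16-bit row to a real number y in [0, pi)."""
    weights = 2.0 ** np.arange(BITS - 1, -1, -1)
    return (pop @ weights) / 2 ** BITS * np.pi

def fitness(y):
    return y + np.abs(np.sin(32 * y))

pop = rng.integers(0, 2, size=(POP, BITS))     # random initial population
for gen in range(GENS):
    f = fitness(decode(pop))
    # Roulette-wheel selection: sample parents proportionally to fitness.
    probs = f / f.sum()
    parents = pop[rng.choice(POP, size=POP, p=probs)]
    # One-point crossover on consecutive pairs, with probability P_CROSS.
    for i in range(0, POP - 1, 2):
        if rng.random() < P_CROSS:
            cut = rng.integers(1, BITS)
            parents[i, cut:], parents[i + 1, cut:] = (
                parents[i + 1, cut:].copy(), parents[i, cut:].copy())
    # Mutation: flip each bit with probability P_MUT.
    flip = rng.random(parents.shape) < P_MUT
    pop = np.where(flip, 1 - parents, parents)

y = decode(pop)
best = np.argmax(fitness(y))
print(f"best y = {y[best]:.4f}, f(y) = {fitness(y[best]):.4f}")
```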
Part 8: Neural networks (Summary)
• Introduction & cognition
• Artificial intelligence
• Retina & visual pathway
• Neurons & models
• Multilayer perceptrons
• Self-organising networks
• Learning & generalisation
• Genetic algorithms