PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 5: NEURAL NETWORKS
Include Nonlinearity g(x) in Output with Respect to Input

$$Y_k(X) = g\!\left( \sum_i w_{ki} x_i + w_{k0} \right), \qquad k = 1, \ldots, K$$
[Figure: single-layer network with inputs x_1 ... x_d, bias input x_0, weights w_11 ... w_1d, w_0, nonlinearities g(.), and outputs Y_1(X) ... Y_k(X)]
Using a sigmoid nonlinearity and normal class-conditional probabilities, the outputs of a NN discriminant can be interpreted as the posterior probabilities P(C_k|x).
Mapping an Arbitrary Boolean Function
- Input vector: length d, all components 0 or 1
- Output: 1 if the given input is in class A, 0 if the input is from class B
- Total 2^d possible inputs; say K are in class A, 2^d - K in B
- 2-layer FF NN: input size d; hidden size K (one node per class-A pattern); output size 1; hard-limit threshold function
- Weights, input -> hidden: for hidden node k, +1 at each input where its class-A pattern has a 1; -1 otherwise
- Hidden -> output: all 1; bias of hidden node k: 1 - b if its pattern has b ones
- Prove: this NN gives 1 if the input is from A, 0 if from B.
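The construction above can be checked with a short sketch (helper names are mine, not from the slides; the slides use bias 1 - b with the threshold at 1, while here an equivalent bias of 0.5 - b with the threshold at 0 avoids ties):

```python
import itertools

def step(s):
    # hard-limit threshold: 1 if s >= 0, else 0
    return 1 if s >= 0 else 0

def make_boolean_net(class_A):
    """2-layer net: one hidden unit per class-A pattern, OR at the output."""
    hidden = []
    for pattern in class_A:
        w = [1 if bit == 1 else -1 for bit in pattern]  # +1 at the pattern's 1s, -1 elsewhere
        b = sum(pattern)                                # number of ones in the pattern
        hidden.append((w, 0.5 - b))                     # fires only on an exact match (w.x >= b)
    def net(x):
        h = [step(sum(wi * xi for wi, xi in zip(w, x)) + bias) for w, bias in hidden]
        return step(sum(h) - 0.5)                       # output is the OR of the hidden units
    return net

# example: class A = {011, 101} over d = 3
A = [(0, 1, 1), (1, 0, 1)]
net = make_boolean_net(A)
for x in itertools.product([0, 1], repeat=3):
    assert net(x) == (1 if x in A else 0)
```

Each hidden unit acts as a detector for exactly one class-A pattern, so the output fires iff the input matches some pattern in A.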
Mapping an Arbitrary Function with a 3-layer FFNN
- Single threshold neuron -> half-space; 2-layer NN -> convex region
- Output bias -M (for M hidden units) gives a logical AND; 3-layer NN -> any region!
- Subdivide the input into approximate hypercubes; a cluster of first-hidden-layer nodes maps one cube; bias -1 gives a logical OR at the output; can produce any combination of input cubes
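A minimal sketch of the AND construction (my own example region, not from the slides): each hidden unit indicates one half-plane, and an output bias of -M fires only when all M hidden units agree, carving out their intersection.

```python
def step(s):
    # hard-limit threshold: 1 if s >= 0, else 0
    return 1 if s >= 0 else 0

# four half-planes whose intersection is the unit square: x >= 0, x <= 1, y >= 0, y <= 1
# each hidden unit computes step(w . [x, y] + b)
halfplanes = [((1, 0), 0.0), ((-1, 0), 1.0), ((0, 1), 0.0), ((0, -1), 1.0)]

def in_region(x, y):
    h = [step(w[0] * x + w[1] * y + b) for w, b in halfplanes]
    M = len(halfplanes)
    return step(sum(h) - M)  # logical AND: fires only if all M hidden units fire

assert in_region(0.5, 0.5) == 1   # inside the square
assert in_region(1.5, 0.5) == 0   # outside
```

Replacing the AND bias (-M) with an OR bias (-1) over several such convex pieces yields arbitrary unions of regions, as the slide states.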
3-Layer Neural Network (1 hidden layer)
Kolmogorov Approximation Theorem (1957)
- Discovered independently of NNs; related to Hilbert's 23 unsolved problems (1900)
- #13: Can a function of several variables be represented as a combination of functions of fewer variables? Arnold: yes, for 3 variables using functions of 2!
- Kolmogorov: any multivariable continuous function can be expressed as a superposition of functions of one variable (and a small number of compositions and additions)
- Limitations: not constructive, the inner functions are too complicated
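In its commonly cited form (notation assumed, not taken from the slides), the representation writes any continuous f on the n-dimensional unit cube as

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q \!\left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right)$$

with continuous one-variable functions $\Phi_q$ and $\psi_{q,p}$: an exact two-hidden-stage decomposition, which is why it is cited as motivation for 3-layer networks.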
Example: 3-layer, feedforward ANN with supervised learning
Motivation for 3 layers: Kolmogorov Representation Theorem
[Figure: 3-layer network with input nodes 1, 2, 3 (inputs x1, x2, x3), hidden nodes 4, 5, 6, 7, and output nodes 8, 9 (outputs y8, y9); weights w41 ... w73 from input to hidden, w84 ... w97 from hidden to output]
Example: 3-layer, FF ANN (cont'd). Assume: the transfer function is just summation.
Input-to-hidden weights:

$$\begin{pmatrix} v_4 \\ v_5 \\ v_6 \\ v_7 \end{pmatrix} = \begin{pmatrix} w_{41} & w_{42} & w_{43} \\ w_{51} & w_{52} & w_{53} \\ w_{61} & w_{62} & w_{63} \\ w_{71} & w_{72} & w_{73} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$
Example: 3-layer, FF ANN (cont'd). Hidden-to-output weights:

$$\begin{pmatrix} y_8 \\ y_9 \end{pmatrix} = \begin{pmatrix} w_{84} & w_{85} & w_{86} & w_{87} \\ w_{94} & w_{95} & w_{96} & w_{97} \end{pmatrix} \begin{pmatrix} v_4 \\ v_5 \\ v_6 \\ v_7 \end{pmatrix}$$
Example: 3-layer, FF ANN (cont'd). Final matrix representation for the linear system, with $V = W_A X$ and $Y = W_B V$:

$$Y = W_B V = W_B (W_A X) = (W_B W_A) X = C X, \qquad C = W_B W_A$$

A purely linear 3-layer network therefore collapses to a single linear map.
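The collapse of the linear network can be verified numerically (the weight values below are hypothetical, chosen only to exercise the matrix identity):

```python
def matmul(A, B):
    # multiply two matrices given as lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# hypothetical numeric values for the 4x3 matrix W_A and the 2x4 matrix W_B
W_A = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.2, 0.1], [0.2, 0.2, -0.4]]
W_B = [[1.0, -1.0, 0.5, 0.0], [0.0, 2.0, -0.5, 1.0]]
x = [[1.0], [2.0], [-1.0]]            # column vector (x1, x2, x3)

v = matmul(W_A, x)                    # hidden values v4 .. v7
y = matmul(W_B, v)                    # outputs y8, y9
C = matmul(W_B, W_A)                  # collapsed 2x3 matrix C = W_B W_A
y_direct = matmul(C, x)

assert all(abs(a[0] - b[0]) < 1e-12 for a, b in zip(y, y_direct))
```

This is exactly why nonlinear transfer functions matter: without them, extra layers add no representational power.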
[Figure: the same 3-layer network (inputs x1, x2, x3; hidden nodes 4-7; outputs y8, y9) with its layers labeled Layer A, Layer B, Layer C]
Transfer Functions
- May be a threshold function that passes information ONLY IF the activation exceeds the threshold
- Can be a continuous function of the input
- The result is usually passed to the output path of the node
Example of a transfer function: the sigmoid
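A minimal sketch of the logistic sigmoid and its derivative (the derivative identity g'(a) = g(a)(1 - g(a)) is what makes it convenient in backpropagation):

```python
import math

def sigmoid(a):
    # logistic sigmoid: smooth, monotone, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    # derivative in the form used by backpropagation: g'(a) = g(a) * (1 - g(a))
    g = sigmoid(a)
    return g * (1.0 - g)

assert sigmoid(0.0) == 0.5                       # symmetric about 0
assert abs(sigmoid_prime(0.0) - 0.25) < 1e-12    # maximum slope at a = 0
```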
Examples)of)Approximation)(with)3)hidden)nodes))
Approximation by Gradient Descent
- Often not practical to directly solve dE/dw = 0
- Instead, minimize the error function iteratively
- Gradient descent idea: at the current location (given values of the parameters w to be optimized), change w in the direction of steepest descent of the error
- W(t+1) = W(t) - η dE/dW
- Continue the iteration until it converges to a minimum; it does not always converge!
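The update rule W(t+1) = W(t) - η dE/dW can be sketched on a toy error surface (the quadratic example and function names are mine, for illustration):

```python
def grad_descent(dE, w0, eta=0.1, steps=200):
    # iterate W(t+1) = W(t) - eta * dE/dW
    w = w0
    for _ in range(steps):
        w = w - eta * dE(w)
    return w

# toy error surface E(w) = (w - 3)^2 with gradient dE/dw = 2 * (w - 3)
w_min = grad_descent(lambda w: 2 * (w - 3.0), w0=0.0)
assert abs(w_min - 3.0) < 1e-6
```

On this convex surface the iteration contracts toward the minimum at w = 3; on the multimodal error surfaces of real networks, the same rule can stall in a local minimum or decrease very slowly.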
Illustration+of+the+Error+Surface
Learning with Error Backpropagation (BP)
Learning: determine the weights of the NN.
Assume: the structure is given; the transfer functions are given; input-output pairs are given.
Supervised learning based on examples! See: derivation of backpropagation.
Backpropagation Learning
- Paul Werbos (1974), PhD thesis, Harvard: roots of backpropagation
- Rumelhart, McClelland (1986), PDP Research Group: popularization of the idea
http://scsnl.stanford.edu/conferences/NSF_Brain_Network_Dynamics_Jan2007
http://www.archive.org/search.php?query=2007+brain+network+dynamics
Supervised Learning Scheme
Standard Backpropagation: the Delta Rule
- Gradient of the sum squared error
- Backpropagation delta-rule
- Weight change algorithm, applied iteratively from the top layer backward
Generalized δ-rule, where N is the batch size. The error $F(w) = F(w_1, w_2, \ldots, w_Q)$ is the average of per-sample errors:

$$F(w) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} F_k(w), \qquad F_k(w) = \left( Y(x_k) - Y^*(x_k, w) \right)^2$$

$$\frac{\partial F(w)}{\partial w_p} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} \frac{\partial F_k(w)}{\partial w_p}$$

Weight update:

$$w_{lij}^{new} = w_{lij}^{old} - \eta \, \frac{1}{N} \sum_{k=1}^{N} \delta_{li}^{k} \, z_{(l-1)j}^{k}$$
For the i-th node in the l-th layer, with activation $I_{li}^k = \sum_q w_{liq} z_{(l-1)q}^k$ and output $z_{li}^k = f(I_{li}^k)$:

$$\frac{\partial F_k}{\partial w_{lij}} = \frac{\partial F_k}{\partial I_{li}^k} \, \frac{\partial I_{li}^k}{\partial w_{lij}} = \delta_{li}^k \, z_{(l-1)j}^k, \qquad \delta_{li}^k \equiv \frac{\partial F_k}{\partial I_{li}^k}$$

since $\partial I_{li}^k / \partial w_{lij} = z_{(l-1)j}^k$. We have to calculate $\delta_{li}^k$!
If l is the output layer:

$$\delta_{li}^k = \frac{\partial F_k}{\partial z_{li}^k} \, \frac{\partial z_{li}^k}{\partial I_{li}^k} = -2 \left( y_i^k - z_{li}^k \right) f'(I_{li}^k)$$

If l is not the output layer:

$$\delta_{li}^k = \sum_p \frac{\partial F_k}{\partial I_{(l+1)p}^k} \, \frac{\partial I_{(l+1)p}^k}{\partial I_{li}^k} = \left( \sum_p \delta_{(l+1)p}^k \, w_{(l+1)pi} \right) f'(I_{li}^k)$$

Thus

$$\frac{\partial F}{\partial w_{lij}} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} \delta_{li}^k \, z_{(l-1)j}^k$$

where $\delta_{li}^k$ is determined iteratively, layer by layer, from the output backward (see above).
Algorithm: the weight change is proportional to the gradient $\partial F / \partial w_{lij}$:

$$w_{lij}^{new} = w_{lij}^{old} - \eta \, \frac{1}{N} \sum_{k=1}^{N} \delta_{li}^k \, z_{(l-1)j}^k, \qquad 0 < \eta \ \text{(small learning rate)}$$
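The delta-rule derivation can be sketched on a tiny 2-3-1 sigmoid network and checked against a finite difference (network size, weight values, and helper names are mine; the deltas follow the formulas above with $F_k = (y - z)^2$):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

random.seed(1)
# tiny 2-3-1 network; weights stored as plain lists
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2 = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    I1 = [sum(W1[i][j] * x[j] for j in range(2)) for i in range(3)]
    z1 = [sigmoid(s) for s in I1]
    z2 = sigmoid(sum(W2[i] * z1[i] for i in range(3)))
    return z1, z2

def backprop_grads(x, y):
    # F_k = (y - z2)^2; deltas as in the derivation, with f'(I) = z (1 - z)
    z1, z2 = forward(x)
    d2 = -2.0 * (y - z2) * z2 * (1.0 - z2)                        # output-layer delta
    d1 = [d2 * W2[i] * z1[i] * (1.0 - z1[i]) for i in range(3)]   # hidden-layer deltas
    gW2 = [d2 * z1[i] for i in range(3)]                          # delta * previous-layer z
    gW1 = [[d1[i] * x[j] for j in range(2)] for i in range(3)]
    return gW1, gW2

# sanity check: backprop gradient vs. central finite difference on one weight
x, y = [0.3, -0.7], 1.0
gW1, gW2 = backprop_grads(x, y)
eps = 1e-6
W1[0][0] += eps; Fp = (y - forward(x)[1]) ** 2
W1[0][0] -= 2 * eps; Fm = (y - forward(x)[1]) ** 2
W1[0][0] += eps
assert abs(gW1[0][0] - (Fp - Fm) / (2 * eps)) < 1e-6
```

Subtracting these gradients from the weights, averaged over a batch, is exactly the update rule stated above.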
Convergence of Backpropagation
1. Standard backpropagation reduces the error F
   - BUT: no guarantee of convergence to the global minimum
   - Possible problems: local minima, very slow decrease
2. Theorem on approximation by a 3-layer NN: any square-integrable function can be approximated with arbitrary accuracy by a 3-layer backpropagation ANN
   - BUT: no guarantee that BP (delta-rule or other) gives the optimum approximation
Theorem on Optimal Approximation by Backpropagation
Two classes: p(x) = p(x|w1)P(w1) + p(x|w2)P(w2)
- p(x): probability distribution of the feature vectors x
- p(x|wi): class-conditional probability density of class wi
- P(wi): 'a priori' probability of class wi, i = 1, 2
- P(wi|x): 'a posteriori' probability that x belongs to class wi
Bayes rule: p(x|wi)P(wi) = P(wi|x)p(x)
Bayes discriminant: P(w1|x) - P(w2|x) > 0 -> select class 1
THEOREM (approximation by BPNN): an optimally selected BP NN approximates the Bayesian (maximum) discriminant function.
NOTE: the actual approximation depends on the structure of the network, the class-conditional probabilities, etc.
D.W. Ruck, S.K. Rogers, et al., IEEE Trans. Neural Netw., Vol. 1, pp. 296-298, 1990.
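A minimal sketch of the Bayes discriminant that the theorem says a BP NN approximates (the 1-D Gaussian class conditionals, means, and priors are hypothetical choices, not from the slides):

```python
import math

def gauss(x, mu, sigma):
    # Gaussian density, used here as the class-conditional p(x|wi)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# hypothetical two-class problem: N(-1, 1) vs N(+1, 1), equal priors
P1, P2 = 0.5, 0.5

def posterior1(x):
    # Bayes rule: P(w1|x) = p(x|w1) P(w1) / p(x)
    p1, p2 = gauss(x, -1.0, 1.0) * P1, gauss(x, 1.0, 1.0) * P2
    return p1 / (p1 + p2)

# Bayes discriminant: P(w1|x) - P(w2|x) > 0  ->  select class 1
assert posterior1(-2.0) > 0.5   # left of the decision boundary at x = 0
assert posterior1(2.0) < 0.5    # right of the boundary -> class 2
```

With equal priors and equal variances, the decision boundary sits midway between the means; this posterior difference is the target function an optimally trained BP NN approximates.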
Local Quadratic Approximation
- Taylor expansion of the error function w.r.t. the weights
- 1st- and 2nd-order terms: gradient and Hessian (H)
- Gradient of the error
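The expansion referred to is presumably the standard one (notation assumed, not recovered from the slides): around a point $\hat{w}$,

$$E(w) \approx E(\hat{w}) + (w - \hat{w})^T b + \tfrac{1}{2} (w - \hat{w})^T H (w - \hat{w}), \qquad b \equiv \nabla E \big|_{\hat{w}}, \quad (H)_{ij} \equiv \frac{\partial^2 E}{\partial w_i \, \partial w_j}$$

so that the gradient of the error is locally $\nabla E \approx b + H (w - \hat{w})$.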
Modifications of Standard Backpropagation
1. Optimum choice of the learning rate: initialization of the weights; adaptive learning rate; randomization
2. Adding a momentum term
3. Regularization term added to the SSE: e.g., sum of weights -> pruning; forgetting rate -> pruning
Momentum term:

$$w_{k+1} = w_k + (1 - \mu) \, \eta \, \delta_k x_k + \mu \, (w_k - w_{k-1})$$

Regularized error (SSE plus weight penalty):

$$I = \sum_i (y_i - y_i^*)^2 + \varepsilon' \sum_{i,j} |w_{ij}|$$
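The momentum update can be sketched in a few lines (the 1-D quadratic error and the parameter values are my own illustration; `step` is the plain delta-rule step $\eta \delta_k x_k$):

```python
def momentum_step(w, w_prev, step, mu=0.9):
    # blend the current delta-rule step with the previous weight change:
    # w_{k+1} = w_k + (1 - mu) * step_k + mu * (w_k - w_{k-1})
    return w + (1 - mu) * step + mu * (w - w_prev)

# hypothetical 1-D run on E(w) = w^2, whose plain gradient step is -eta * 2w
w_prev, w = 5.0, 5.0
for _ in range(300):
    w_prev, w = w, momentum_step(w, w_prev, step=-0.05 * 2 * w)
assert abs(w) < 1e-3
```

The momentum term damps oscillations across narrow valleys of the error surface and lets the effective step grow along directions of consistent descent.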
Basic NN Architectures
- Feed-forward NN: a directed graph in which a path never visits the same node twice; relatively simple behavior. Example: MLP for classification, pattern recognition
- Feedback or recurrent NNs: contain loops of directed edges going forward and also backward; complicated oscillations may occur. Examples: Hopfield NN, Elman NN for speech recognition
- Random NNs: more realistic, very complex