PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 5: NEURAL NETWORKS
Include Nonlinearity g(x) in Output with Respect to Input

$$Y_k(X) = g\!\left( \sum_i w_{ki} x_i + w_{k0} \right), \qquad k = 1, \ldots, K$$
[Figure: single-layer network with inputs x_1 ... x_d, bias input x_0, weights w_11 ... w_1d, w_0, nonlinearities g(.), and outputs Y_1(X) ... Y_k(X)]
Using a sigmoid nonlinearity and normal class-conditional probabilities, the outputs of a NN discriminant can be interpreted as the posterior probabilities P(C_k|x).
Mapping an Arbitrary Boolean Function
- Input vector: length d, all components 0 or 1
- Output: 1 if the given input is in class A, 0 if the input is from class B
- Total 2^d possible inputs; say K are in class A, 2^d - K in B
- 2-layer FF NN: input size d; hidden size K (one node per class-A pattern); output size 1; hard-limit threshold function
- Weights, input -> hidden: for hidden node k, +1 at each input where its class-A pattern has a 1; -1 otherwise
- Hidden -> output: all 1; bias of hidden node k: 1 - b if its pattern has b ones
- Prove: this NN gives 1 if the input is from A, 0 if from B.
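The construction above can be checked with a short sketch (helper names are mine, not from the slides; the slides use bias 1 - b with the threshold at 1, while here an equivalent bias of 0.5 - b with the threshold at 0 avoids ties):

```python
import itertools

def step(s):
    # hard-limit threshold: 1 if s >= 0, else 0
    return 1 if s >= 0 else 0

def make_boolean_net(class_A):
    """2-layer net: one hidden unit per class-A pattern, OR at the output."""
    hidden = []
    for pattern in class_A:
        w = [1 if bit == 1 else -1 for bit in pattern]  # +1 at the pattern's 1s, -1 elsewhere
        b = sum(pattern)                                # number of ones in the pattern
        hidden.append((w, 0.5 - b))                     # fires only on an exact match (w.x >= b)
    def net(x):
        h = [step(sum(wi * xi for wi, xi in zip(w, x)) + bias) for w, bias in hidden]
        return step(sum(h) - 0.5)                       # output is the OR of the hidden units
    return net

# example: class A = {011, 101} over d = 3
A = [(0, 1, 1), (1, 0, 1)]
net = make_boolean_net(A)
for x in itertools.product([0, 1], repeat=3):
    assert net(x) == (1 if x in A else 0)
```

Each hidden unit acts as a detector for exactly one class-A pattern, so the output fires iff the input matches some pattern in A.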
Mapping an Arbitrary Function with a 3-layer FFNN
- Single threshold neuron -> half-space; 2-layer NN -> convex region
- Output bias -M (for M hidden units) gives a logical AND; 3-layer NN -> any region!
- Subdivide the input into approximate hypercubes; a cluster of first-hidden-layer nodes maps one cube; bias -1 gives a logical OR at the output; can produce any combination of input cubes
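A minimal sketch of the AND construction (my own example region, not from the slides): each hidden unit indicates one half-plane, and an output bias of -M fires only when all M hidden units agree, carving out their intersection.

```python
def step(s):
    # hard-limit threshold: 1 if s >= 0, else 0
    return 1 if s >= 0 else 0

# four half-planes whose intersection is the unit square: x >= 0, x <= 1, y >= 0, y <= 1
# each hidden unit computes step(w . [x, y] + b)
halfplanes = [((1, 0), 0.0), ((-1, 0), 1.0), ((0, 1), 0.0), ((0, -1), 1.0)]

def in_region(x, y):
    h = [step(w[0] * x + w[1] * y + b) for w, b in halfplanes]
    M = len(halfplanes)
    return step(sum(h) - M)  # logical AND: fires only if all M hidden units fire

assert in_region(0.5, 0.5) == 1   # inside the square
assert in_region(1.5, 0.5) == 0   # outside
```

Replacing the AND bias (-M) with an OR bias (-1) over several such convex pieces yields arbitrary unions of regions, as the slide states.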
3-Layer Neural Network (1 hidden layer)
Kolmogorov Approximation Theorem (1957)
- Discovered independently of NNs; related to Hilbert's 23 unsolved problems (1900)
- #13: Can a function of several variables be represented as a combination of functions of fewer variables? Arnold: yes, for 3 variables using functions of 2!
- Kolmogorov: any multivariable continuous function can be expressed as a superposition of functions of one variable (and a small number of compositions and additions)
- Limitations: not constructive, the inner functions are too complicated
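In its commonly cited form (notation assumed, not taken from the slides), the representation writes any continuous f on the n-dimensional unit cube as

$$f(x_1, \ldots, x_n) = \sum_{q=0}^{2n} \Phi_q \!\left( \sum_{p=1}^{n} \psi_{q,p}(x_p) \right)$$

with continuous one-variable functions $\Phi_q$ and $\psi_{q,p}$: an exact two-hidden-stage decomposition, which is why it is cited as motivation for 3-layer networks.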
Example: 3-layer, feedforward ANN with supervised learning
Motivation for 3 layers: Kolmogorov Representation Theorem
[Figure: 3-layer network with input nodes 1, 2, 3 (inputs x1, x2, x3), hidden nodes 4, 5, 6, 7, and output nodes 8, 9 (outputs y8, y9); weights w41 ... w73 from input to hidden, w84 ... w97 from hidden to output]
Example: 3-layer, FF ANN (cont'd). Assume: the transfer function is just summation.
Input-to-hidden weights:

$$\begin{pmatrix} v_4 \\ v_5 \\ v_6 \\ v_7 \end{pmatrix} = \begin{pmatrix} w_{41} & w_{42} & w_{43} \\ w_{51} & w_{52} & w_{53} \\ w_{61} & w_{62} & w_{63} \\ w_{71} & w_{72} & w_{73} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ x_3 \end{pmatrix}$$
Example: 3-layer, FF ANN (cont'd). Hidden-to-output weights:

$$\begin{pmatrix} y_8 \\ y_9 \end{pmatrix} = \begin{pmatrix} w_{84} & w_{85} & w_{86} & w_{87} \\ w_{94} & w_{95} & w_{96} & w_{97} \end{pmatrix} \begin{pmatrix} v_4 \\ v_5 \\ v_6 \\ v_7 \end{pmatrix}$$
Example: 3-layer, FF ANN (cont'd). Final matrix representation for the linear system, with $V = W_A X$ and $Y = W_B V$:

$$Y = W_B V = W_B (W_A X) = (W_B W_A) X = C X, \qquad C = W_B W_A$$

A purely linear 3-layer network therefore collapses to a single linear map.
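The collapse of the linear network can be verified numerically (the weight values below are hypothetical, chosen only to exercise the matrix identity):

```python
def matmul(A, B):
    # multiply two matrices given as lists of rows
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# hypothetical numeric values for the 4x3 matrix W_A and the 2x4 matrix W_B
W_A = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1], [-0.3, 0.2, 0.1], [0.2, 0.2, -0.4]]
W_B = [[1.0, -1.0, 0.5, 0.0], [0.0, 2.0, -0.5, 1.0]]
x = [[1.0], [2.0], [-1.0]]            # column vector (x1, x2, x3)

v = matmul(W_A, x)                    # hidden values v4 .. v7
y = matmul(W_B, v)                    # outputs y8, y9
C = matmul(W_B, W_A)                  # collapsed 2x3 matrix C = W_B W_A
y_direct = matmul(C, x)

assert all(abs(a[0] - b[0]) < 1e-12 for a, b in zip(y, y_direct))
```

This is exactly why nonlinear transfer functions matter: without them, extra layers add no representational power.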
[Figure: the same 3-layer network (inputs x1, x2, x3; hidden nodes 4-7; outputs y8, y9) with its layers labeled Layer A, Layer B, Layer C]
Transfer Functions
- May be a threshold function that passes information ONLY IF the activation exceeds the threshold
- Can be a continuous function of the input
- The result is usually passed to the output path of the node
Example of a transfer function: the sigmoid
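A minimal sketch of the logistic sigmoid and its derivative (the derivative identity g'(a) = g(a)(1 - g(a)) is what makes it convenient in backpropagation):

```python
import math

def sigmoid(a):
    # logistic sigmoid: smooth, monotone, output in (0, 1)
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_prime(a):
    # derivative in the form used by backpropagation: g'(a) = g(a) * (1 - g(a))
    g = sigmoid(a)
    return g * (1.0 - g)

assert sigmoid(0.0) == 0.5                       # symmetric about 0
assert abs(sigmoid_prime(0.0) - 0.25) < 1e-12    # maximum slope at a = 0
```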
Examples)of)Approximation)(with)3)hidden)nodes))
Approximation by Gradient Descent
- Often not practical to directly solve dE/dw = 0
- Instead, minimize the error function iteratively
- Gradient descent idea: at the current location (given values of the parameters w to be optimized), change w in the direction of steepest descent of the error
- W(t+1) = W(t) - η dE/dW
- Continue the iteration until it converges to a minimum; it does not always converge!
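The update rule W(t+1) = W(t) - η dE/dW can be sketched on a toy error surface (the quadratic example and function names are mine, for illustration):

```python
def grad_descent(dE, w0, eta=0.1, steps=200):
    # iterate W(t+1) = W(t) - eta * dE/dW
    w = w0
    for _ in range(steps):
        w = w - eta * dE(w)
    return w

# toy error surface E(w) = (w - 3)^2 with gradient dE/dw = 2 * (w - 3)
w_min = grad_descent(lambda w: 2 * (w - 3.0), w0=0.0)
assert abs(w_min - 3.0) < 1e-6
```

On this convex surface the iteration contracts toward the minimum at w = 3; on the multimodal error surfaces of real networks, the same rule can stall in a local minimum or decrease very slowly.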
Illustration+of+the+Error+Surface
Learning with Error Backpropagation (BP)
Learning: determine the weights of the NN.
Assume: the structure is given; the transfer functions are given; input-output pairs are given.
Supervised learning based on examples! See: derivation of backpropagation.
Backpropagation Learning
- Paul Werbos (1974), PhD thesis, Harvard: roots of backpropagation
- Rumelhart, McClelland (1986), PDP Research Group: popularization of the idea
http://scsnl.stanford.edu/conferences/NSF_Brain_Network_Dynamics_Jan2007
http://www.archive.org/search.php?query=2007+brain+network+dynamics
Supervised Learning Scheme
Standard Backpropagation: the Delta Rule
- Gradient of the sum squared error
- Backpropagation delta-rule
- Weight change algorithm, applied iteratively from the top layer backward
Generalized δ-rule, where N is the batch size. The error $F(w) = F(w_1, w_2, \ldots, w_Q)$ is the average of per-sample errors:

$$F(w) = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} F_k(w), \qquad F_k(w) = \left( Y(x_k) - Y^*(x_k, w) \right)^2$$

$$\frac{\partial F(w)}{\partial w_p} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} \frac{\partial F_k(w)}{\partial w_p}$$

Weight update:

$$w_{lij}^{new} = w_{lij}^{old} - \eta \, \frac{1}{N} \sum_{k=1}^{N} \delta_{li}^{k} \, z_{(l-1)j}^{k}$$
For the i-th node in the l-th layer, with activation $I_{li}^k = \sum_q w_{liq} z_{(l-1)q}^k$ and output $z_{li}^k = f(I_{li}^k)$:

$$\frac{\partial F_k}{\partial w_{lij}} = \frac{\partial F_k}{\partial I_{li}^k} \, \frac{\partial I_{li}^k}{\partial w_{lij}} = \delta_{li}^k \, z_{(l-1)j}^k, \qquad \delta_{li}^k \equiv \frac{\partial F_k}{\partial I_{li}^k}$$

since $\partial I_{li}^k / \partial w_{lij} = z_{(l-1)j}^k$. We have to calculate $\delta_{li}^k$!
If l is the output layer:

$$\delta_{li}^k = \frac{\partial F_k}{\partial z_{li}^k} \, \frac{\partial z_{li}^k}{\partial I_{li}^k} = -2 \left( y_i^k - z_{li}^k \right) f'(I_{li}^k)$$

If l is not the output layer:

$$\delta_{li}^k = \sum_p \frac{\partial F_k}{\partial I_{(l+1)p}^k} \, \frac{\partial I_{(l+1)p}^k}{\partial I_{li}^k} = \left( \sum_p \delta_{(l+1)p}^k \, w_{(l+1)pi} \right) f'(I_{li}^k)$$

Thus

$$\frac{\partial F}{\partial w_{lij}} = \lim_{N \to \infty} \frac{1}{N} \sum_{k=1}^{N} \delta_{li}^k \, z_{(l-1)j}^k$$

where $\delta_{li}^k$ is determined iteratively, layer by layer, from the output backward (see above).
Algorithm: the weight change is proportional to the gradient $\partial F / \partial w_{lij}$:

$$w_{lij}^{new} = w_{lij}^{old} - \eta \, \frac{1}{N} \sum_{k=1}^{N} \delta_{li}^k \, z_{(l-1)j}^k, \qquad 0 < \eta \ \text{(small learning rate)}$$
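The delta-rule derivation can be sketched on a tiny 2-3-1 sigmoid network and checked against a finite difference (network size, weight values, and helper names are mine; the deltas follow the formulas above with $F_k = (y - z)^2$):

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

random.seed(1)
# tiny 2-3-1 network; weights stored as plain lists
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
W2 = [random.uniform(-1, 1) for _ in range(3)]

def forward(x):
    I1 = [sum(W1[i][j] * x[j] for j in range(2)) for i in range(3)]
    z1 = [sigmoid(s) for s in I1]
    z2 = sigmoid(sum(W2[i] * z1[i] for i in range(3)))
    return z1, z2

def backprop_grads(x, y):
    # F_k = (y - z2)^2; deltas as in the derivation, with f'(I) = z (1 - z)
    z1, z2 = forward(x)
    d2 = -2.0 * (y - z2) * z2 * (1.0 - z2)                        # output-layer delta
    d1 = [d2 * W2[i] * z1[i] * (1.0 - z1[i]) for i in range(3)]   # hidden-layer deltas
    gW2 = [d2 * z1[i] for i in range(3)]                          # delta * previous-layer z
    gW1 = [[d1[i] * x[j] for j in range(2)] for i in range(3)]
    return gW1, gW2

# sanity check: backprop gradient vs. central finite difference on one weight
x, y = [0.3, -0.7], 1.0
gW1, gW2 = backprop_grads(x, y)
eps = 1e-6
W1[0][0] += eps; Fp = (y - forward(x)[1]) ** 2
W1[0][0] -= 2 * eps; Fm = (y - forward(x)[1]) ** 2
W1[0][0] += eps
assert abs(gW1[0][0] - (Fp - Fm) / (2 * eps)) < 1e-6
```

Subtracting these gradients from the weights, averaged over a batch, is exactly the update rule stated above.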
Convergence of Backpropagation
1. Standard backpropagation reduces the error F
   - BUT: no guarantee of convergence to the global minimum
   - Possible problems: local minima, very slow decrease
2. Theorem on approximation by a 3-layer NN: any square-integrable function can be approximated with arbitrary accuracy by a 3-layer backpropagation ANN
   - BUT: no guarantee that BP (delta-rule or other) gives the optimum approximation
Theorem on Optimal Approximation by Backpropagation
Two classes: p(x) = p(x|w1)P(w1) + p(x|w2)P(w2)
- p(x): probability distribution of the feature vectors x
- p(x|wi): class-conditional probability density of class wi
- P(wi): 'a priori' probability of class wi, i = 1, 2
- P(wi|x): 'a posteriori' probability that x belongs to class wi
Bayes rule: p(x|wi)P(wi) = P(wi|x)p(x)
Bayes discriminant: P(w1|x) - P(w2|x) > 0 -> select class 1
THEOREM (approximation by BPNN): an optimally selected BP NN approximates the Bayesian (maximum) discriminant function.
NOTE: the actual approximation depends on the structure of the network, the class-conditional probabilities, etc.
D.W. Ruck, S.K. Rogers, et al., IEEE Trans. Neural Netw., Vol. 1, pp. 296-298, 1990.
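A minimal sketch of the Bayes discriminant that the theorem says a BP NN approximates (the 1-D Gaussian class conditionals, means, and priors are hypothetical choices, not from the slides):

```python
import math

def gauss(x, mu, sigma):
    # Gaussian density, used here as the class-conditional p(x|wi)
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# hypothetical two-class problem: N(-1, 1) vs N(+1, 1), equal priors
P1, P2 = 0.5, 0.5

def posterior1(x):
    # Bayes rule: P(w1|x) = p(x|w1) P(w1) / p(x)
    p1, p2 = gauss(x, -1.0, 1.0) * P1, gauss(x, 1.0, 1.0) * P2
    return p1 / (p1 + p2)

# Bayes discriminant: P(w1|x) - P(w2|x) > 0  ->  select class 1
assert posterior1(-2.0) > 0.5   # left of the decision boundary at x = 0
assert posterior1(2.0) < 0.5    # right of the boundary -> class 2
```

With equal priors and equal variances, the decision boundary sits midway between the means; this posterior difference is the target function an optimally trained BP NN approximates.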
Local Quadratic Approximation
- Taylor expansion of the error function w.r.t. the weights
- 1st- and 2nd-order terms: gradient and Hessian (H)
- Gradient of the error
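The expansion referred to is presumably the standard one (notation assumed, not recovered from the slides): around a point $\hat{w}$,

$$E(w) \approx E(\hat{w}) + (w - \hat{w})^T b + \tfrac{1}{2} (w - \hat{w})^T H (w - \hat{w}), \qquad b \equiv \nabla E \big|_{\hat{w}}, \quad (H)_{ij} \equiv \frac{\partial^2 E}{\partial w_i \, \partial w_j}$$

so that the gradient of the error is locally $\nabla E \approx b + H (w - \hat{w})$.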
Modifications of Standard Backpropagation
1. Optimum choice of the learning rate: initialization of the weights; adaptive learning rate; randomization
2. Adding a momentum term
3. Regularization term added to the SSE: e.g., sum of weights -> pruning; forgetting rate -> pruning
Momentum term:

$$w_{k+1} = w_k + (1 - \mu) \, \eta \, \delta_k x_k + \mu \, (w_k - w_{k-1})$$

Regularized error (SSE plus weight penalty):

$$I = \sum_i (y_i - y_i^*)^2 + \varepsilon' \sum_{i,j} |w_{ij}|$$
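The momentum update can be sketched in a few lines (the 1-D quadratic error and the parameter values are my own illustration; `step` is the plain delta-rule step $\eta \delta_k x_k$):

```python
def momentum_step(w, w_prev, step, mu=0.9):
    # blend the current delta-rule step with the previous weight change:
    # w_{k+1} = w_k + (1 - mu) * step_k + mu * (w_k - w_{k-1})
    return w + (1 - mu) * step + mu * (w - w_prev)

# hypothetical 1-D run on E(w) = w^2, whose plain gradient step is -eta * 2w
w_prev, w = 5.0, 5.0
for _ in range(300):
    w_prev, w = w, momentum_step(w, w_prev, step=-0.05 * 2 * w)
assert abs(w) < 1e-3
```

The momentum term damps oscillations across narrow valleys of the error surface and lets the effective step grow along directions of consistent descent.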
Basic NN Architectures
- Feed-forward NN: a directed graph in which a path never visits the same node twice; relatively simple behavior. Example: MLP for classification, pattern recognition
- Feedback or recurrent NNs: contain loops of directed edges going forward and also backward; complicated oscillations may occur. Examples: Hopfield NN, Elman NN for speech recognition
- Random NNs: more realistic, very complex