Problem Solving in Hyperspace
or
Artificial Neural Net Basics
Tim Hare
Some History
In the 1960s, much interest in artificial neural networks (ANNs)
Rosenblatt (1962) proves an important theorem regarding perceptron (single learning layer) network learning
Widrow, Angell, Hoff (1960-1962): demonstrations of perceptron learning
Minsky (1969) kills the party: he analyzes with great rigor and finds perceptrons have restrictions on what they can learn; since multilayer network training approaches were not defined, the world lost interest.
Work in the field slows for a decade, but Widrow is defiant and establishes training algorithms for multilayer perceptrons.
The party starts again in the 80s
Why are ANNs important?
Ability to automatically create complex non-linear functions from simpler linear functions, by composition of the individual pieces of the network into a meta-function
Process learns from data; a priori knowledge is not needed
Not a black box result: one can discern (and we'll go through this) the specifics of the model, one can adjust the model, and one can embed the final model in other applications
Process can be made continuously adaptive: it continues to modify itself as the data set changes
Alternative to traditional modeling techniques such as ANOVA and multiple regression
In more advanced forms, continues to be a means to explore the underpinnings of the organic intelligence that evolved on this planet
Biological Neurons
The sort of ANN architecture we'll be playing with today:
Two processing layers, each with its own weights.
Information flows from left to right during execution of the network (forward propagation),
and from right to left during the weight adjustment cycle (backward propagation).
[Figure: Input(1) and Input(2) feed the Layer-1 neurons, Neuron 1 and Neuron 2, through weights W(1,1), W(1,2), W(2,1), W(2,2); the Layer-1 outputs feed the Layer-2 neuron, Neuron 3, through weights W(1,3) and W(2,3), producing the network output, which is compared to the target to give the error.]
Each neuron is in effect a summation operator. That is, per below, NET(i) is a summation (Σ) of all X(m)*AB(m,i), where m = input number and i = neuron number.
AB(1,1)X(1) + AB(1,2)X(2) = NET(1)
AB(2,1)X(1) + AB(2,2)X(2) = NET(2)

[Figure: inputs X(1) and X(2) feed summation nodes Net(1) and Net(2) through the weights AB(1,1), AB(1,2), AB(2,1), AB(2,2).]
Or, equivalently, a vector-matrix product of the input vector (X) and the weight matrix (AB) produces the vector NET (see the sketch below):

X x [AB] = NET

[X(1) X(2)] x | AB(1,1) AB(1,2) | = [NET(1) NET(2)]
              | AB(2,1) AB(2,2) |

X (1x2) x AB (2x2) = NET (1x2)
Input Vector x Weight Matrix = NET Vector

Σ_m AB(m,i)X(m) = NET(i)
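A minimal NumPy sketch of this product, following the slide's row-vector-times-matrix layout (the array values are arbitrary examples of mine):

```python
import numpy as np

X = np.array([0.5, -0.3])           # input vector, treated as a (1x2) row
AB = np.array([[0.8,  0.2],         # weight matrix, (2x2)
               [-0.4, 0.6]])

NET = X @ AB                        # NET(i) = sum over m of AB(m,i) * X(m)
print(NET)                          # [ 0.52 -0.08]
```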
In fact, our artificial neuron definition actually also includes a sigmoid function, TanH(x):

OUT = TanH(NET), where NET = W x X
The TanH(x) activation (or transfer) function allows gain control (squashing) over the value of each neuron. Large neuron values (or large weights) won't be amplified downstream, leading to noise saturation and distortion in network learning. My impression in testing is that if you don't use the sigmoid transfer function, you run the risk of creating a feed-forward loop that runs the weights to large values; while it is possible to get training, many times the net explodes into huge neuron values, leading to overflow errors.

OUT = TanH(NET), where NET = W x X
TanH(x) = [exp(x) - exp(-x)] / [exp(x) + exp(-x)]

[Figure: the TanH curve maps the NET distribution into the range (-1, 1), centered at 0.]
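A quick sketch (NumPy; my own toy values) of the squashing behavior the slide describes:

```python
import numpy as np

# TanH maps any NET value into (-1, 1), so large activations or
# weights cannot snowball into huge values downstream.
for net in [0.5, 2.0, 10.0, 100.0]:
    print(f"NET = {net:6.1f}  ->  OUT = {np.tanh(net):+.4f}")
```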
The networks we'll test today will have one or two neurons in Layer-1.

[Figure: inputs X(1) and X(2) feed two Layer-1 neurons through the weights AB:
AB(1,1)X(1) + AB(1,2)X(2) = NET(1), with F(Net(1)) = OUT(1)
AB(2,1)X(1) + AB(2,2)X(2) = NET(2), with F(Net(2)) = OUT(2)
The Layer-1 outputs feed the single Layer-2 neuron through weights BC(1) and BC(2), giving F(Net(3)) = OUT(3). Throughout, F(X) = TanH(X).]
TRAINING: we'll need the derivative of our chosen sigmoid function. This allows us to adjust the weight space error by establishing a relationship to the training error.

TanH(x) = [exp(x) - exp(-x)] / [exp(x) + exp(-x)]
TanH'(x) = [1 - TanH(x)][1 + TanH(x)]

WAF = TanH'(NET(i*)) * (d* - OUT*) for a particular position on the sigmoid.

During forward propagation, NET* is fed into the sigmoid function, and OUT* is produced. During backward propagation, a delta-OUT (d* - OUT*) is fed into the linearization around OUT*, and a delta-NET (our weight adjustment factor, WAF) is produced.

(* = a particular value)

[Figure: the TanH curve from -1 to 1, with a tangent line of slope TanH'(NET(i*)) drawn at the point (NET*, OUT*), where NET = W x X and NET* = W x X*.]
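A small check (my own script) that the slide's identity matches a numerical derivative of TanH:

```python
import numpy as np

def dtanh(x):
    # TanH'(x) = [1 - TanH(x)][1 + TanH(x)]
    t = np.tanh(x)
    return (1.0 - t) * (1.0 + t)

x, h = 0.7, 1e-6
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)
print(dtanh(x), numeric)   # both ~ 0.6347
```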
The network is a META function

The network is a meta function: a functional composition of the more primitive functions in each node.
We use the WAF iteratively to minimize the total error in this meta function with respect to the entire pattern set, on average.
WAF is used in conjunction with coefficients to tailor training: NewWeight = (LR)(OldWeight) + (MO)(WAF), where LR = learning rate and MO = momentum.
LR and MO refine the adjustment; they are chosen empirically, vary according to each problem's data set, and can vary as a function of training results if encoded to do so.
Despite all this, we can still get caught in local minima as we attempt to reduce the error in weight space.
In training, we want to minimize the error (cost function) on the network (meta function) output:

X = [X1, X2] = our input vector
D = our desired output for X
META(X) = network output
E(X) = cost function = AVG(ABS(D - META(X)))

The cost function is minimized across X vectors for the entire training set, iteratively, as the weights in META(X) are adjusted (a sketch follows).
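In code, the cost for one epoch might look like the sketch below; `meta` is a stand-in for the network's forward pass, and the helper name is my own:

```python
import numpy as np

def epoch_cost(meta, inputs, targets):
    # E = AVG(ABS(D - META(X))) across all training pairs.
    return np.mean([abs(d - meta(x)) for x, d in zip(inputs, targets)])
```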
Pseudo-code training algorithm for back-propagation of error using gradient descent (a runnable sketch follows):

For each example in the training set:
  Calculate error (d - OUT)
  Compute delta-WX for all weights from Layer-1 neurons to the j Layer-2 (output) neurons: E2(j)
  Compute delta-WX for all weights from the X(m) inputs to the i Layer-1 neurons: E1(i). This value is based upon W x E2(j), since there is no training pair for the Layer-1 neurons.
  Use E2(j) to update the weights leading back to each of the j Layer-1 neurons
  Use E1(i) to update the weights leading back to each of the m inputs
Next example
(Do while not meeting some stop criterion, such as low average absolute error across all patterns in one epoch of training.)
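Below is one way to realize this pseudo-code as runnable Python/NumPy for the 2-input, 2-hidden-neuron, 1-output TanH network, trained on the scaled XOR data from the later slides. It is a sketch under my own assumptions (plain gradient-descent update, learning rate, random starting weights), not the Excel implementation; if it stalls in a local minimum, re-run with a different seed or learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled XOR patterns: inputs and desired outputs in {-1, +1}.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
d = np.array([-1, 1, 1, -1], dtype=float)

# Layer-1: 2 inputs + bias -> 2 neurons. Layer-2: 2 neurons + bias -> 1 output.
W1 = rng.uniform(-0.5, 0.5, size=(3, 2))
W2 = rng.uniform(-0.5, 0.5, size=(3, 1))
lr = 0.3                                  # assumed learning rate

def dtanh(net):
    # TanH'(x) = [1 - TanH(x)][1 + TanH(x)]
    t = np.tanh(net)
    return (1.0 - t) * (1.0 + t)

for epoch in range(10000):
    errs = []
    for x, target in zip(X, d):
        # Forward propagation.
        x1 = np.append(x, 1.0)            # inputs plus bias
        net1 = x1 @ W1
        h = np.append(np.tanh(net1), 1.0) # Layer-1 outputs plus bias neuron
        net2 = h @ W2
        out = np.tanh(net2)               # network output

        # Backward propagation: E2 at the output; E1 pushed back through
        # the Layer-2 weights (no training pair for the Layer-1 neurons).
        e2 = dtanh(net2) * (target - out)
        e1 = dtanh(net1) * (W2[:2, 0] * e2[0])
        W2 += lr * np.outer(h, e2)        # update Layer-2 weights
        W1 += lr * np.outer(x1, e1)       # update Layer-1 weights
        errs.append(abs(target - out[0]))
    if np.mean(errs) < 0.1:               # stop criterion: low epoch error
        print(f"stopped after epoch {epoch}")
        break

for x in X:
    h = np.append(np.tanh(np.append(x, 1.0) @ W1), 1.0)
    print(x, "->", np.tanh(h @ W2)[0])
```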
Error hyper-surface minimization: 2D pictured, but our error is in the WEIGHT space, therefore in a much higher dimension.

The weight space is the domain over which we must minimize the error. This is distinct from the dimensionality of the input space X, the neuron space, or the output space. The high-dimensional weight space surface is what we move down, to a (hopefully) global minimum.
This is not a 1D weight space graph, but our cost function, or error function: the overall network error for one epoch (one pass through the patterns), Error = AVG(ABS(d - OUT)), is gradually minimized and reflects our weight space error reduction process during training.

[Figure: AVG(ABS(d - OUT)) plotted against training epoch, descending past local minima toward a (hopefully) global minimum.]
Decision surfaces: some 2D open sets. These lines would be higher-dimensional linear equations if more than two inputs were specified. Some minimum number of linear equations will be needed to solve each type of problem.

[Figure: three plots in the X(1)-X(2) plane, each showing decision lines at a constant value K.
One neuron: AB(1,1)X(1) + AB(1,2)X(2) = NET(1) = K. One decision surface is only good for simple classifications such as these.
Two neurons: two surfaces (two equations, NET(1) = K and NET(2) = K') are needed for more complex problems.]
An open convex set that classifies A as above the lower line and below the upper line. The weights that feed downstream neurons from each of these two neurons (linear equations) will establish the cutoff by virtue of their interpretation by the downstream neuron.

We have two neurons (or two decision lines) and two inputs, hence the form of the equations:
W(1,1)X(1) + W(2,1)X(2) = NET(1)
W(1,2)X(1) + W(2,2)X(2) = NET(2)

[Figure: class A lies in the open region between the two lines in the X(1)-X(2) plane; alongside, the network diagram shows X(1) and X(2) feeding Net(1) and Net(2) through the weights W.]
More on decision surfaces: again in 2D, the network can create closed convex sets.

[Figure: three X(1)-X(2) plots, each enclosing a region with a closed convex set of decision lines.]

We are STILL somewhat limited in that we can't enclose any arbitrary shape (concave not possible) in a single class using convex objects made from Layer-1 neurons.
A single additional computational layer (e.g. between Layer-1 and Layer-2) adds the capacity to make concave sets.

[Figure: in the X(1)-X(2) plane, region B is carved out of region A; "A not B" gives concavity.]
In summary: 1 neuron = 1 linear equation = 1 decision surface (a tiny sketch follows this list)
Each neuron represents a line (in 2D input space), a plane (in 3D input space), or a hyper-plane (in higher dimensions).
All of these are linear decision objects/surfaces, regardless of the dimension of the vector X.
The dimension of the space in which the decision surfaces exist is determined by the dimension of X, the input vector, whose dimension depends upon the number of inputs we feed into the network (X[1,2] = line, X[1,2,3] = plane, X[1,2,3,4...n] = hyper-plane).
Additional network layers beyond two provide logical operations, through the weights that connect the previous layer's neurons (objects) to the next, to allow concave sets.
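As a tiny illustration of "one neuron = one decision surface" (the weights are arbitrary values of mine): which side of the line a point falls on is just the sign of the linear form.

```python
import numpy as np

W = np.array([1.0, -1.0])                 # arbitrary example weights
for x in [(2, 1), (1, 2), (0, 0)]:
    net = W @ np.array(x, dtype=float)    # W(1)X(1) + W(2)X(2) = NET
    print(x, "->", "above" if net > 0 else "on/below", "the line NET = 0")
```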
XOR training data format: two inputs coupled with our intended (d = desired) classification, by which the network will learn to group patterns (the data rows), and a total of four patterns. Each of the four pattern rows has the columns:

X(1)  X(2)  d
Each row has an input (X) vector and an output vector, or desired vector, d. In this case d is a 1-dimensional vector (a single output neuron); however, we could specify as many outputs as we like, and so have higher-dimensional vectors in both cases. The vectors are our training pairs, making up a single row or record, each of which is submitted to the net during training, one at a time.
1 1 0
1 0 1
0 1 1
0 0 0
It should be clear that our intended classification is column 3, and we have two classes we want the net to separate. While we encode them in binary when we feed them to the network during training, we'll reference them on the graphs to come as classes A and B.
1 1 B
1 0 A
0 1 A
0 0 B
Finally, the actual data we send to the net has been scaled for our preferred transfer function, TanH(x), which will categorize our input patterns as either 1 or -1. While TanH(x) has domain [-inf, +inf] and range [-1, 1], we'll want to scale and restrict our inputs for a variety of reasons. In a real-world problem, we'd also likely scale our outputs to a smaller dynamic range of TanH(x), say R[-0.75, 0.75], for optimal training times. We'll use the below for clarity, though.

 1  1 -1
 1 -1  1
-1  1  1
-1 -1 -1
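The rescaling from the binary table to this one is the linear map 2x - 1 (my inference from comparing the two tables):

```python
import numpy as np

binary = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [0, 1, 1],
                   [0, 0, 0]], dtype=float)   # columns: X(1), X(2), d

scaled = 2.0 * binary - 1.0                   # {0, 1} -> {-1, +1} for TanH
print(scaled)
```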
Here is a program I'm developing to analyze patterns in data. If we have time, I'll demo this later to show you more of a typical workflow...
For now, we'll use the hard-coded two-layer ANN software I wrote for you in EXCEL.
Before we tackle the XOR problem, let's try a simpler problem:

Two classes: A = (1,1) and B = (-1,-1)
Two inputs, X1 and X2
Let's try a single Layer-1 neuron and see if we can solve it
Here's our result (or close enough), but are we unduly biased?

Uh, wait... we specified only ONE Layer-1 neuron, so why the extra connections???
The equations that result from training our network on a simple two-class pattern using one neuron and two inputs:

1.77X(1) + 1.51X(2) + 0.01X(3) = NET(1)
We can safely ignore the input-level bias, X(3), since its weight is near ZERO. So, 1.77X(1) + 1.51X(2) = NET(1).
0.03X(1) - 0.52X(2) + 0.49X(3) = NET(2)
The value of NET(2) is always 1, since it is in fact a bias neuron itself, therefore its weights don't count. Therefore, 1 = NET(2) (always).
F(NET(1))*(-2.43) + F(NET(2))*(0.01) = NET(3)
As well, the weights coming out of NET(2) are small, so we'll ignore them. So, F(NET(1))*(-2.43) = NET(3), therefore
OUT = F(NET(3)) = F(F(1.77X(1) + 1.51X(2))*(-2.43)), where F is, again, TanH(x) (a quick check follows).
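A quick check of the simplified meta function (my own script); with the scaled encoding, the two classes land near opposite ends of TanH's range:

```python
import numpy as np

def meta(x1, x2):
    # OUT = F(F(1.77*X(1) + 1.51*X(2)) * (-2.43)), with F = TanH
    return np.tanh(np.tanh(1.77 * x1 + 1.51 * x2) * -2.43)

print(meta(1, 1), meta(-1, -1))   # roughly -0.98 and +0.98
```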
Initially, what looks pretty complicated...

[Figure: the full trained network. X(1), X(2), and the bias input X(3) feed F(Net(1)) with weights 1.77, 1.52, and 0.01, and feed F(Net(2)) with weights 0.03, -0.52, and 0.49; F(Net(1)) and F(Net(2)) then feed the output neuron with weights -2.43 and 0.01.]
...can be simplified...

[Figure: the same network with the near-zero weights (0.01, 0.03, 0.01) replaced by 0 and the bias values shown as 1; the weights -0.52 and 0.49 into the bias neuron remain but no longer matter, and the significant weights 1.77, 1.52, and -2.43 are kept.]
...and further simplified.

[Figure: X(1) and X(2) feed F(Net(1)) with weights 1.77 and 1.52; F(Net(1)) feeds the output neuron with weight -2.43.]

Resulting in a single 2D decision line, or one linear equation in two dimensions (X(1) & X(2)):

OUT = F(NET(3)) = F(F(1.77X(1) + 1.51X(2))*(-2.43)), where F is, again, TanH(x)

In fact, the weight from F(NET(1)) to NET(3) has evolved just to scale the value of NET(3) for TanH(x), so that it produces an OUT value close to +1 or -1.
So, we can do this classification with a single Layer-1 neuron. In fact, each hidden-layer neuron is tantamount to a linear equation, that is, a line, surface, or hypersurface, depending on the number of inputs to Layer-1. We needed at least one solution line to solve this problem.

[Figure: the X(1)-X(2) plane with one class at (0,0) and the other at (1,1), separated by the decision line W(1,1)X(1) + W(2,1)X(2) = NET(1).]
What if we don't know the minimum number of neurons and use more?

We created more decision lines than we needed and made the interpretation of the equations, the relationships between the inputs, difficult to understand.

[Figure: a cluttered decision plot; 10 neurons used!!!]
To get a simple set of equations and a good model:

Use as few neurons as you can, to keep the equations interpretable
Start with some minimum number that works, and work backwards to the true minimum (though with more complex, noisy data sets, this optimum may be hard to assess)
Also, generalization (averaging across noise within a class) is impaired when one uses too many neurons. A less-than-robust network results, which can't generalize to patterns it has never seen and is over-fitted to noise.
Let's try to solve the XOR problem. How many linear equations (Layer-1 neurons) will we need for XOR?

Let's first try with 1 neuron.
Why can't a 2-layer net with a single neuron in Layer-1 (one linear equation) solve XOR?

[Figure: the X(1)-X(2) plane with (0,0) = B, (1,1) = B, (0,1) = A, and (1,0) = A, plus a single decision line W(1,1)X(1) + W(2,1)X(2) = NET(1). No single line can separate the A points from the B points.]

As we saw, a network with a single Layer-1 neuron can't separate class A from class B, and so the net iterates but never converges to a solution.
So, a single Layer-1 neuron was not sufficient, and we know why, but what about two Layer-1 neurons (two equations, or decision lines)?

We'll try with 2 neurons
Then we'll take a look at the equations
Then we'll prune neurons if they don't contribute, to make the equations and variable relationships clearer
We'll remove clear ZERO weights and test those we suspect may not be impacting the result
We'll then check for generalization across input vectors that the net has not seen
Here's a result with XOR and two neurons.
Again, with XOR, things initially look pretty complicated...

[Figure: the full trained XOR network. X(1), X(2), and the bias X(3) feed F(Net(1)) with weights 1.15, 1.16, and -0.81, and feed F(Net(2)) with weights -1.14, -1.13, and -0.82; small stray weights (0.81, 0.03, 0.04) feed the bias neuron F(Net(3)); F(Net(1)) and F(Net(2)) feed the output F(Net(4)) with weights -2.57 and -2.57, with a -1.75 bias contribution.]
...but it can be made less complicated by eliminating clear zero inputs, pruning inputs that don't contribute, and labeling our bias neurons...

[Figure: the same network with the near-zero weights replaced by 0 and the bias inputs shown as 1; the weights 1.15, 1.16, -0.81, -1.14, -1.13, -0.82, -2.57, -2.57, and -1.75 remain.]
...until things are somewhat more clear.

[Figure: the pruned XOR network. X(1), X(2), and a bias of 1 feed F(Net(1)) with weights 1.15, 1.16, and -0.81, and feed F(Net(2)) with weights -1.14, -1.13, and -0.82; F(Net(1)) and F(Net(2)) feed the output F(Net(4)) with weights -2.57 and -2.57, plus the -1.75 bias term.]
The equations that result from training our network on the XOR pattern using two Layer-1 neurons and two inputs:

1) 1.15X(1) + 1.16X(2) - 0.81 = NET(1)
2) -1.14X(1) - 1.13X(2) - 0.82 = NET(2)
3) -1.75 = NET(3)
4) F(NET(1))*(-2.57) + F(NET(2))*(-2.57) - 1.75 = NET(4)
5) OUT = F(NET(4))
...where F(x) = TanH(x).
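As a sanity check (my own script, not part of the deck), feeding the four scaled XOR patterns through equations 1-5 lands each output near its ±1 target:

```python
import numpy as np

def net_out(x1, x2):
    # Equations 1-5 above, with F(x) = TanH(x).
    net1 = 1.15 * x1 + 1.16 * x2 - 0.81
    net2 = -1.14 * x1 - 1.13 * x2 - 0.82
    net4 = -2.57 * np.tanh(net1) - 2.57 * np.tanh(net2) - 1.75
    return np.tanh(net4)

for x1, x2, d in [(1, 1, -1), (1, -1, 1), (-1, 1, 1), (-1, -1, -1)]:
    print(f"({x1:+d}, {x2:+d}) -> {net_out(x1, x2):+.2f}  (target {d:+d})")
```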
Miscellany and discussion
Linking up pre-trained networks as input to a training network
Linking up the above, but using the hidden layer of the trained network as input to the training network
Depending on where you start, you may never get out of a local minimum, or you may fall into one after making progress.
Demo of VB software as needed to illustrate 3-layer networks