Problem Solving in Hyperspace
or
Artificial Neural Net Basics
Tim Hare
Some History
In the 1960s, much interest in artificial neural networks (ANNs)
Rosenblatt (1962) proves an important theorem regarding perceptron (single learning layer) network learning
Widrow, Angell, Hoff (1960-1962): demonstrations of perceptron learning
Minsky (1969) kills the party: he analyzes with great rigor and finds perceptrons have restrictions on what they can learn; since multilayer network training approaches were not defined, the world lost interest.
Work in the field slows for a decade, but Widrow is defiant and establishes training algorithms for multilayer perceptrons.
The party starts again in the 80s
Why are ANNs important?
Ability to automatically create complex non-linear functions from simpler linear functions, by composition of the individual pieces of the network into a meta-function
Process learns from data; a priori knowledge is not needed
Not a black box result: one can discern (and we'll go through this) the specifics of the model, one can adjust the model, and one can embed the final model in other applications
Process can be made continuously adaptive: it continues to modify itself as the data set changes
Alternative to traditional modeling techniques such as ANOVA and multiple regression
In more advanced forms, continues to be a means to explore the underpinnings of the organic intelligence that evolved on this planet
Biological Neurons
The sort of ANN architecture we'll be playing with today:
Two processing layers, each with its own weights.
Information flows from left to right during execution of the network (forward propagation),
and from right to left during the weight adjustment cycle (backward propagation).
[Figure: Input(1) and Input(2) feed the Layer-1 neurons, Neuron 1 and Neuron 2, through weights W(1,1), W(1,2), W(2,1), W(2,2); the Layer-1 outputs feed the Layer-2 neuron, Neuron 3, through weights W(1,3) and W(2,3), producing the network output, which is compared to the target to give the error.]
Each neuron is in effect a summation operator. That is, per below, NET(i) is a summation (Σ) of all X(m)*AB(m,i), where m = input number and i = neuron number.
AB(1,1)X(1) + AB(1,2)X(2) = NET(1)
AB(2,1)X(1) + AB(2,2)X(2) = NET(2)

[Figure: inputs X(1) and X(2) feed summation nodes Net(1) and Net(2) through the weights AB(1,1), AB(1,2), AB(2,1), AB(2,2).]
Or, equivalently, a vector-matrix product of the input vector (X) and the weight matrix (AB) produces the vector NET (see the sketch below):

X x [AB] = NET

[X(1) X(2)] x | AB(1,1) AB(1,2) | = [NET(1) NET(2)]
              | AB(2,1) AB(2,2) |

X (1x2) x AB (2x2) = NET (1x2)
Input Vector x Weight Matrix = NET Vector

Σ_m AB(m,i)X(m) = NET(i)
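A minimal NumPy sketch of this product, following the slide's row-vector-times-matrix layout (the array values are arbitrary examples of mine):

```python
import numpy as np

X = np.array([0.5, -0.3])           # input vector, treated as a (1x2) row
AB = np.array([[0.8,  0.2],         # weight matrix, (2x2)
               [-0.4, 0.6]])

NET = X @ AB                        # NET(i) = sum over m of AB(m,i) * X(m)
print(NET)                          # [ 0.52 -0.08]
```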
In fact, our artificial neuron definition actually also includes a sigmoid function, TanH(x):

OUT = TanH(NET), where NET = W x X
The TanH(x) activation (or transfer) function allows gain control (squashing) over the value of each neuron. Large neuron values (or large weights) won't be amplified downstream, leading to noise saturation and distortion in network learning. My impression in testing is that if you don't use the sigmoid transfer function, you run the risk of creating a feed-forward loop that runs the weights to large values; while it is possible to get training, many times the net explodes into huge neuron values, leading to overflow errors.

OUT = TanH(NET), where NET = W x X
TanH(x) = [exp(x) - exp(-x)] / [exp(x) + exp(-x)]

[Figure: the TanH curve maps the NET distribution into the range (-1, 1), centered at 0.]
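A quick sketch (NumPy; my own toy values) of the squashing behavior the slide describes:

```python
import numpy as np

# TanH maps any NET value into (-1, 1), so large activations or
# weights cannot snowball into huge values downstream.
for net in [0.5, 2.0, 10.0, 100.0]:
    print(f"NET = {net:6.1f}  ->  OUT = {np.tanh(net):+.4f}")
```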
The networks we'll test today will have one or two neurons in Layer-1.

[Figure: inputs X(1) and X(2) feed two Layer-1 neurons through the weights AB:
AB(1,1)X(1) + AB(1,2)X(2) = NET(1), with F(Net(1)) = OUT(1)
AB(2,1)X(1) + AB(2,2)X(2) = NET(2), with F(Net(2)) = OUT(2)
The Layer-1 outputs feed the single Layer-2 neuron through weights BC(1) and BC(2), giving F(Net(3)) = OUT(3). Throughout, F(X) = TanH(X).]
TRAINING: we'll need the derivative of our chosen sigmoid function. This allows us to adjust the weight space error by establishing a relationship to the training error.

TanH(x) = [exp(x) - exp(-x)] / [exp(x) + exp(-x)]
TanH'(x) = [1 - TanH(x)][1 + TanH(x)]

WAF = TanH'(NET(i*)) * (d* - OUT*) for a particular position on the sigmoid.

During forward propagation, NET* is fed into the sigmoid function, and OUT* is produced. During backward propagation, a delta-OUT (d* - OUT*) is fed into the linearization around OUT*, and a delta-NET (our weight adjustment factor, WAF) is produced.

(* = a particular value)

[Figure: the TanH curve from -1 to 1, with a tangent line of slope TanH'(NET(i*)) drawn at the point (NET*, OUT*), where NET = W x X and NET* = W x X*.]
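A small check (my own script) that the slide's identity matches a numerical derivative of TanH:

```python
import numpy as np

def dtanh(x):
    # TanH'(x) = [1 - TanH(x)][1 + TanH(x)]
    t = np.tanh(x)
    return (1.0 - t) * (1.0 + t)

x, h = 0.7, 1e-6
numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2 * h)
print(dtanh(x), numeric)   # both ~ 0.6347
```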
The network is a META function

The network is a meta function: a functional composition of the more primitive functions in each node.
We use the WAF iteratively to minimize the total error in this meta function with respect to the entire pattern set, on average.
WAF is used in conjunction with coefficients to tailor training: NewWeight = (LR)(OldWeight) + (MO)(WAF), where LR = learning rate and MO = momentum.
LR and MO refine the adjustment; they are chosen empirically, vary according to each problem's data set, and can vary as a function of training results if encoded to do so.
Despite all this, we can still get caught in local minima as we attempt to reduce the error in weight space.
In training, we want to minimize the error (cost function) on the network (meta function) output:

X = [X1, X2] = our input vector
D = our desired output for X
META(X) = network output
E(X) = cost function = AVG(ABS(D - META(X)))

The cost function is minimized across X vectors for the entire training set, iteratively, as the weights in META(X) are adjusted (a sketch follows).
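In code, the cost for one epoch might look like the sketch below; `meta` is a stand-in for the network's forward pass, and the helper name is my own:

```python
import numpy as np

def epoch_cost(meta, inputs, targets):
    # E = AVG(ABS(D - META(X))) across all training pairs.
    return np.mean([abs(d - meta(x)) for x, d in zip(inputs, targets)])
```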
Pseudo-code training algorithm for back-propagation of error using gradient descent (a runnable sketch follows):

For each example in the training set:
  Calculate error (d - OUT)
  Compute delta-WX for all weights from Layer-1 neurons to the j Layer-2 (output) neurons: E2(j)
  Compute delta-WX for all weights from the X(m) inputs to the i Layer-1 neurons: E1(i). This value is based upon W x E2(j), since there is no training pair for the Layer-1 neurons.
  Use E2(j) to update the weights leading back to each of the j Layer-1 neurons
  Use E1(i) to update the weights leading back to each of the m inputs
Next example
(Do while not meeting some stop criterion, such as low average absolute error across all patterns in one epoch of training.)
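Below is one way to realize this pseudo-code as runnable Python/NumPy for the 2-input, 2-hidden-neuron, 1-output TanH network, trained on the scaled XOR data from the later slides. It is a sketch under my own assumptions (plain gradient-descent update, learning rate, random starting weights), not the Excel implementation; if it stalls in a local minimum, re-run with a different seed or learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Scaled XOR patterns: inputs and desired outputs in {-1, +1}.
X = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
d = np.array([-1, 1, 1, -1], dtype=float)

# Layer-1: 2 inputs + bias -> 2 neurons. Layer-2: 2 neurons + bias -> 1 output.
W1 = rng.uniform(-0.5, 0.5, size=(3, 2))
W2 = rng.uniform(-0.5, 0.5, size=(3, 1))
lr = 0.3                                  # assumed learning rate

def dtanh(net):
    # TanH'(x) = [1 - TanH(x)][1 + TanH(x)]
    t = np.tanh(net)
    return (1.0 - t) * (1.0 + t)

for epoch in range(10000):
    errs = []
    for x, target in zip(X, d):
        # Forward propagation.
        x1 = np.append(x, 1.0)            # inputs plus bias
        net1 = x1 @ W1
        h = np.append(np.tanh(net1), 1.0) # Layer-1 outputs plus bias neuron
        net2 = h @ W2
        out = np.tanh(net2)               # network output

        # Backward propagation: E2 at the output; E1 pushed back through
        # the Layer-2 weights (no training pair for the Layer-1 neurons).
        e2 = dtanh(net2) * (target - out)
        e1 = dtanh(net1) * (W2[:2, 0] * e2[0])
        W2 += lr * np.outer(h, e2)        # update Layer-2 weights
        W1 += lr * np.outer(x1, e1)       # update Layer-1 weights
        errs.append(abs(target - out[0]))
    if np.mean(errs) < 0.1:               # stop criterion: low epoch error
        print(f"stopped after epoch {epoch}")
        break

for x in X:
    h = np.append(np.tanh(np.append(x, 1.0) @ W1), 1.0)
    print(x, "->", np.tanh(h @ W2)[0])
```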
Error hyper-surface minimization: 2D pictured, but our error is in the WEIGHT space, therefore in a much higher dimension.

The weight space is the domain over which we must minimize the error. This is distinct from the dimensionality of the input space X, the neuron space, or the output space. The high-dimensional weight space surface is what we move down, to a (hopefully) global minimum.
This is not a 1D weight space graph, but our cost function, or error function: the overall network error for one epoch (one pass through the patterns), Error = AVG(ABS(d - OUT)), is gradually minimized and reflects our weight space error reduction process during training.

[Figure: AVG(ABS(d - OUT)) plotted against training epoch, descending past local minima toward a (hopefully) global minimum.]
Decision surfaces: some 2D open sets. These lines would be higher-dimensional linear equations if more than two inputs were specified. Some minimum number of linear equations will be needed to solve each type of problem.

[Figure: three plots in the X(1)-X(2) plane, each showing decision lines at a constant value K.
One neuron: AB(1,1)X(1) + AB(1,2)X(2) = NET(1) = K. One decision surface is only good for simple classifications such as these.
Two neurons: two surfaces (two equations, NET(1) = K and NET(2) = K') are needed for more complex problems.]
An open convex set that classifies A as above the lower line and below the upper line. The weights that feed downstream neurons from each of these two neurons (linear equations) will establish the cutoff by virtue of their interpretation by the downstream neuron.

We have two neurons (or two decision lines) and two inputs, hence the form of the equations:
W(1,1)X(1) + W(2,1)X(2) = NET(1)
W(1,2)X(1) + W(2,2)X(2) = NET(2)

[Figure: class A lies in the open region between the two lines in the X(1)-X(2) plane; alongside, the network diagram shows X(1) and X(2) feeding Net(1) and Net(2) through the weights W.]
More on decision surfaces: again in 2D, the network can create closed convex sets.

[Figure: three X(1)-X(2) plots, each enclosing a region with a closed convex set of decision lines.]

We are STILL somewhat limited in that we can't enclose any arbitrary shape (concave not possible) in a single class using convex objects made from Layer-1 neurons.
A single additional computational layer (e.g. between Layer-1 and Layer-2) adds the capacity to make concave sets.

[Figure: in the X(1)-X(2) plane, region B is carved out of region A; "A not B" gives concavity.]
In summary: 1 neuron = 1 linear equation = 1 decision surface (a tiny sketch follows this list)
Each neuron represents a line (in 2D input space), a plane (in 3D input space), or a hyper-plane (in higher dimensions).
All of these are linear decision objects/surfaces, regardless of the dimension of the vector X.
The dimension of the space in which the decision surfaces exist is determined by the dimension of X, the input vector, whose dimension depends upon the number of inputs we feed into the network (X[1,2] = line, X[1,2,3] = plane, X[1,2,3,4...n] = hyper-plane).
Additional network layers beyond two provide logical operations, through the weights that connect the previous layer's neurons (objects) to the next, to allow concave sets.
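As a tiny illustration of "one neuron = one decision surface" (the weights are arbitrary values of mine): which side of the line a point falls on is just the sign of the linear form.

```python
import numpy as np

W = np.array([1.0, -1.0])                 # arbitrary example weights
for x in [(2, 1), (1, 2), (0, 0)]:
    net = W @ np.array(x, dtype=float)    # W(1)X(1) + W(2)X(2) = NET
    print(x, "->", "above" if net > 0 else "on/below", "the line NET = 0")
```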
XOR training data format: two inputs coupled with our intended (d = desired) classification, by which the network will learn to group patterns (the data rows), and a total of four patterns. Each of the four pattern rows has the columns:

X(1)  X(2)  d
Each row has an input (X) vector and an output vector, or desired vector, d. In this case d is a 1-dimensional vector (a single output neuron); however, we could specify as many outputs as we like, and so have higher-dimensional vectors in both cases. The vectors are our training pairs, making up a single row or record, each of which is submitted to the net during training, one at a time.
1 1 0
1 0 1
0 1 1
0 0 0
It should be clear that our intended classification is column 3, and we have two classes we want the net to separate. While we encode them in binary when we feed them to the network during training, we'll reference them on the graphs to come as classes A and B.
1 1 B
1 0 A
0 1 A
0 0 B
Finally, the actual data we send to the net has been scaled for our preferred transfer function, TanH(x), which will categorize our input patterns as either 1 or -1. While TanH(x) has domain [-inf, +inf] and range [-1, 1], we'll want to scale and restrict our inputs for a variety of reasons. In a real-world problem, we'd also likely scale our outputs to a smaller dynamic range of TanH(x), say R[-0.75, 0.75], for optimal training times. We'll use the below for clarity, though.

 1  1 -1
 1 -1  1
-1  1  1
-1 -1 -1
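The rescaling from the binary table to this one is the linear map 2x - 1 (my inference from comparing the two tables):

```python
import numpy as np

binary = np.array([[1, 1, 0],
                   [1, 0, 1],
                   [0, 1, 1],
                   [0, 0, 0]], dtype=float)   # columns: X(1), X(2), d

scaled = 2.0 * binary - 1.0                   # {0, 1} -> {-1, +1} for TanH
print(scaled)
```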
Here is a program I'm developing to analyze patterns in data. If we have time, I'll demo this later to show you more of a typical workflow...
For now, we'll use the hard-coded two-layer ANN software I wrote for you in EXCEL.
Before we tackle the XOR problem, let's try a simpler problem:

Two classes: A = (1,1) and B = (-1,-1)
Two inputs, X1 and X2
Let's try a single Layer-1 neuron and see if we can solve it
Here's our result (or close enough), but are we unduly biased?

Uh, wait... we specified only ONE Layer-1 neuron, so why the extra connections???
The equations that result from training our network on a simple two-class pattern using one neuron and two inputs:

1.77X(1) + 1.51X(2) + 0.01X(3) = NET(1)
We can safely ignore the input-level bias, X(3), since its weight is near ZERO. So, 1.77X(1) + 1.51X(2) = NET(1).
0.03X(1) - 0.52X(2) + 0.49X(3) = NET(2)
The value of NET(2) is always 1, since it is in fact a bias neuron itself, therefore its weights don't count. Therefore, 1 = NET(2) (always).
F(NET(1))*(-2.43) + F(NET(2))*(0.01) = NET(3)
As well, the weights coming out of NET(2) are small, so we'll ignore them. So, F(NET(1))*(-2.43) = NET(3), therefore
OUT = F(NET(3)) = F(F(1.77X(1) + 1.51X(2))*(-2.43)), where F is, again, TanH(x) (a quick check follows).
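A quick check of the simplified meta function (my own script); with the scaled encoding, the two classes land near opposite ends of TanH's range:

```python
import numpy as np

def meta(x1, x2):
    # OUT = F(F(1.77*X(1) + 1.51*X(2)) * (-2.43)), with F = TanH
    return np.tanh(np.tanh(1.77 * x1 + 1.51 * x2) * -2.43)

print(meta(1, 1), meta(-1, -1))   # roughly -0.98 and +0.98
```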
Initially, what looks pretty complicated...

[Figure: the full trained network. X(1), X(2), and the bias input X(3) feed F(Net(1)) with weights 1.77, 1.52, and 0.01, and feed F(Net(2)) with weights 0.03, -0.52, and 0.49; F(Net(1)) and F(Net(2)) then feed the output neuron with weights -2.43 and 0.01.]
...can be simplified...

[Figure: the same network with the near-zero weights (0.01, 0.03, 0.01) replaced by 0 and the bias values shown as 1; the weights -0.52 and 0.49 into the bias neuron remain but no longer matter, and the significant weights 1.77, 1.52, and -2.43 are kept.]
...and further simplified.

[Figure: X(1) and X(2) feed F(Net(1)) with weights 1.77 and 1.52; F(Net(1)) feeds the output neuron with weight -2.43.]

Resulting in a single 2D decision line, or one linear equation in two dimensions (X(1) & X(2)):

OUT = F(NET(3)) = F(F(1.77X(1) + 1.51X(2))*(-2.43)), where F is, again, TanH(x)

In fact, the weight from F(NET(1)) to NET(3) has evolved just to scale the value of NET(3) for TanH(x), so that it produces an OUT value close to +1 or -1.
So, we can do this classification with a single Layer-1 neuron. In fact, each hidden-layer neuron is tantamount to a linear equation, that is, a line, surface, or hypersurface, depending on the number of inputs to Layer-1. We needed at least one solution line to solve this problem.

[Figure: the X(1)-X(2) plane with one class at (0,0) and the other at (1,1), separated by the decision line W(1,1)X(1) + W(2,1)X(2) = NET(1).]
What if we don't know the minimum number of neurons and use more?

We created more decision lines than we needed and made the interpretation of the equations, the relationships between the inputs, difficult to understand.

[Figure: a cluttered decision plot; 10 neurons used!!!]
To get a simple set of equations and a good model:

Use as few neurons as you can, to keep the equations interpretable
Start with some minimum number that works, and work backwards to the true minimum (though with more complex, noisy data sets, this optimum may be hard to assess)
Also, generalization (averaging across noise within a class) is impaired when one uses too many neurons. A less-than-robust network results, which can't generalize to patterns it has never seen and is over-fitted to noise.
Let's try to solve the XOR problem. How many linear equations (Layer-1 neurons) will we need for XOR?

Let's first try with 1 neuron.
Why can't a 2-layer net with a single neuron in Layer-1 (one linear equation) solve XOR?

[Figure: the X(1)-X(2) plane with (0,0) = B, (1,1) = B, (0,1) = A, and (1,0) = A, plus a single decision line W(1,1)X(1) + W(2,1)X(2) = NET(1). No single line can separate the A points from the B points.]

As we saw, a network with a single Layer-1 neuron can't separate class A from class B, and so the net iterates but never converges to a solution.
So, a single Layer-1 neuron was not sufficient, and we know why, but what about two Layer-1 neurons (two equations, or decision lines)?

We'll try with 2 neurons
Then we'll take a look at the equations
Then we'll prune neurons if they don't contribute, to make the equations and variable relationships clearer
We'll remove clear ZERO weights and test those we suspect may not be impacting the result
We'll then check for generalization across input vectors that the net has not seen
Here's a result with XOR and two neurons.
Again, with XOR, things initially look pretty complicated...

[Figure: the full trained XOR network. X(1), X(2), and the bias X(3) feed F(Net(1)) with weights 1.15, 1.16, and -0.81, and feed F(Net(2)) with weights -1.14, -1.13, and -0.82; small stray weights (0.81, 0.03, 0.04) feed the bias neuron F(Net(3)); F(Net(1)) and F(Net(2)) feed the output F(Net(4)) with weights -2.57 and -2.57, with a -1.75 bias contribution.]
...but it can be made less complicated by eliminating clear zero inputs, pruning inputs that don't contribute, and labeling our bias neurons...

[Figure: the same network with the near-zero weights replaced by 0 and the bias inputs shown as 1; the weights 1.15, 1.16, -0.81, -1.14, -1.13, -0.82, -2.57, -2.57, and -1.75 remain.]
...until things are somewhat more clear.

[Figure: the pruned XOR network. X(1), X(2), and a bias of 1 feed F(Net(1)) with weights 1.15, 1.16, and -0.81, and feed F(Net(2)) with weights -1.14, -1.13, and -0.82; F(Net(1)) and F(Net(2)) feed the output F(Net(4)) with weights -2.57 and -2.57, plus the -1.75 bias term.]
The equations that result from training our network on the XOR pattern using two Layer-1 neurons and two inputs:

1) 1.15X(1) + 1.16X(2) - 0.81 = NET(1)
2) -1.14X(1) - 1.13X(2) - 0.82 = NET(2)
3) -1.75 = NET(3)
4) F(NET(1))*(-2.57) + F(NET(2))*(-2.57) - 1.75 = NET(4)
5) OUT = F(NET(4))
...where F(x) = TanH(x).
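As a sanity check (my own script, not part of the deck), feeding the four scaled XOR patterns through equations 1-5 lands each output near its ±1 target:

```python
import numpy as np

def net_out(x1, x2):
    # Equations 1-5 above, with F(x) = TanH(x).
    net1 = 1.15 * x1 + 1.16 * x2 - 0.81
    net2 = -1.14 * x1 - 1.13 * x2 - 0.82
    net4 = -2.57 * np.tanh(net1) - 2.57 * np.tanh(net2) - 1.75
    return np.tanh(net4)

for x1, x2, d in [(1, 1, -1), (1, -1, 1), (-1, 1, 1), (-1, -1, -1)]:
    print(f"({x1:+d}, {x2:+d}) -> {net_out(x1, x2):+.2f}  (target {d:+d})")
```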
Miscellany and discussion
Linking up pre-trained networks as input to a training network
Linking up the above, but using the hidden layer of the trained network as input to the training network
Depending on where you start, you may never get out of a local minimum, or you may fall into one after making progress.
Demo of VB software as needed to illustrate 3-layer networks