Soft Computing Unit-2 by Arun Pratap Singh

8/12/2019 Soft Computing Unit-2 by Arun Pratap Singh


UNIT : II

SOFT COMPUTING II SEMESTER (MCSE 205)

PREPARED BY ARUN PRATAP SINGH



NEURAL NETWORK:

These networks are simplified models of the biological neuron system: massively parallel, distributed processing systems made up of highly interconnected neural computing elements. Their ability to learn is what makes them powerful and flexible; through learning they acquire knowledge and make it available for use. These networks are also called neural nets or artificial neural networks. A neural network does not require an explicitly devised algorithm for each specific task. These networks are also well suited to real-time systems, because their parallel architecture gives short computation times and fast responses.


ARTIFICIAL NEURAL NETWORK (ANN):

In computer science and related fields, artificial neural networks (ANNs) are computational models inspired by animals' central nervous systems (in particular the brain) that are capable of machine learning and pattern recognition. Artificial neural networks are generally presented as systems of interconnected "neurons" which can compute values from inputs.

For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are then passed on to other neurons. This process is repeated until finally an output neuron is activated. This determines which character was read.
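The weighting-and-transforming step described above can be sketched as follows. This is only a minimal illustration, not a real handwriting recognizer: the four-pixel "image", the two character labels, the weight values and the choice of a logistic transform are all assumptions made for the example.

```python
import math

def logistic(z):
    """Logistic (sigmoid) transform, one possible designer-chosen function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)

# Hypothetical 4-pixel image and one output neuron per character (assumed values).
pixels = [1.0, 0.0, 1.0, 0.0]
output_weights = {
    "A": [2.0, -1.0, 2.0, -1.0],
    "B": [-1.0, 2.0, -1.0, 2.0],
}

activations = {
    label: neuron_output(pixels, weights, 0.0, logistic)
    for label, weights in output_weights.items()
}

# The most strongly activated output neuron determines the character read.
recognized = max(activations, key=activations.get)
```

In a real network the weights would be learned from data rather than written by hand, and there would usually be one or more hidden layers between the pixels and the output neurons.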

Like other machine learning methods (systems that learn from data), neural networks have been used to solve a wide variety of tasks that are hard to solve using ordinary rule-based programming, including computer vision and speech recognition.

DIFFERENT ACTIVATION FUNCTION:



SINGLE LAYER PERCEPTRON:

In machine learning, the perceptron is an algorithm for supervised classification of an input into one of two classes (a binary classifier). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.
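A minimal sketch of such an online perceptron is given below. The function names, the zero initial weights, the learning rate of 1.0 and the logical-AND training set are assumptions chosen for illustration; the notes do not prescribe them.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a hard threshold (outputs 0 or 1)."""
    net = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if net > 0 else 0

def train_perceptron(samples, learning_rate=1.0, max_epochs=100):
    """Online training: the weights are adjusted after each individual sample."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * xi
                           for w, xi in zip(weights, x)]
                bias += learning_rate * error
        if mistakes == 0:  # every sample classified correctly
            break
    return weights, bias

# Logical AND is linearly separable, so training is expected to converge.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, x) for x, _ in and_samples]
```

Because each update uses only the current sample, the same loop could consume a stream of examples instead of a stored training set.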

The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced.

WIDROW-HOFF/DELTA LEARNING RULE:



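For a neuron $j$ with activation function $g(x)$, the delta rule for the $i$-th weight $w_{ji}$ is

$$\Delta w_{ji} = \alpha\,(t_j - y_j)\,g'(h_j)\,x_i$$

where

$\alpha$ is a small constant called the learning rate,

$g(x)$ is the neuron's activation function,

$t_j$ is the target output,

$h_j$ is the weighted sum of the neuron's inputs,

$y_j$ is the actual output,

$x_i$ is the $i$-th input.

It holds that $h_j = \sum_i x_i w_{ji}$ and $y_j = g(h_j)$.

The delta rule is commonly stated in simplified form for a neuron with a linear activation function as

$$\Delta w_{ji} = \alpha\,(t_j - y_j)\,x_i$$

While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses the Heaviside step function as the activation function $g(h)$, and that means that $g'(h)$ does not exist at zero, and is equal to zero elsewhere, which makes the direct application of the delta rule impossible.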
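The simplified (linear-activation) form of the delta rule can be sketched as below. The training data, the learning rate of 0.1 and the number of passes are assumptions chosen so that the example has a known answer: the targets were generated from the weights (2, -1).

```python
def delta_rule_step(weights, x, target, alpha):
    """One delta-rule update for a neuron with linear activation g(h) = h."""
    y = sum(w * xi for w, xi in zip(weights, x))  # actual output
    return [w + alpha * (target - y) * xi for w, xi in zip(weights, x)]

# Hypothetical data generated by t = 2*x1 - 1*x2 (assumed for illustration).
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]

weights = [0.0, 0.0]
for _ in range(200):                # repeated passes over the training set
    for x, t in samples:
        weights = delta_rule_step(weights, x, t, alpha=0.1)
```

After training, the weights should be very close to (2, -1), and every update shrinks as the actual output approaches the target.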

WINNER-TAKE-ALL LEARNING RULE:

Winner-take-all is a computational principle applied in computational models of neural networks by which neurons in a layer compete with each other for activation. In the classical form, only the neuron with the highest activation stays active while all other neurons shut down. Other variations that allow more than one neuron to be active do exist, for example the soft winner-take-all, by which a power function is applied to the neurons.

In the theory of artificial neural networks, winner-take-all networks are a case of competitive learning in recurrent neural networks. Output nodes in the network mutually inhibit each other, while simultaneously activating themselves through reflexive connections. After some time, only one node in the output layer will be active, namely the one corresponding to the strongest input. Thus the network uses nonlinear inhibition to pick out the largest of a set of inputs. Winner-take-all is a general computational primitive that can be implemented using different types of neural network models, including both continuous-time and spiking networks (Grossberg, 1973; Oster et al. 2009).

Winner-take-all networks are commonly used in computational models of the brain, particularly for distributed decision-making or action selection in the cortex. Important examples include hierarchical models of vision (Riesenhuber et al. 1999), and models of selective attention and recognition (Carpenter and Grossberg, 1987; Itti et al. 1998). They are also common in artificial neural networks and neuromorphic analog VLSI circuits. It has been formally proven that the winner-take-all operation is computationally powerful compared to other nonlinear operations, such as thresholding (Maass 2000).

In many practical cases, there is not just a single neuron which becomes active; instead, exactly k neurons become active for a fixed number k. This principle is referred to as k-winners-take-all.
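The outcome of winner-take-all and k-winners-take-all competition can be sketched as a simple selection. Note the assumption: this computes only the end result (which neurons remain active), not the recurrent inhibition dynamics that produce it in the networks described above; the function name and the activation values are hypothetical.

```python
def k_winners_take_all(activations, k=1):
    """Keep the k largest activations and silence (zero) all the others."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: activations[i], reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]

layer = [0.2, 0.9, 0.4, 0.7]                 # assumed activations of one layer
single = k_winners_take_all(layer)           # classical winner-take-all (k = 1)
top_two = k_winners_take_all(layer, k=2)     # k-winners-take-all with k = 2
```

With k = 1 only the neuron with activation 0.9 stays active; with k = 2 the neuron with activation 0.7 survives as well.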

LINEAR SEPARABILITY:

Linear separability is an important concept in neural networks. The idea is to check whether you can separate points in an n-dimensional space using only an (n-1)-dimensional boundary.

Lost it? Here's a simpler explanation.

One Dimension

Let's say you're on a number line. You take any two numbers. Now, there are two possibilities:

1. You choose two different numbers

2. You choose the same number

If you choose two different numbers, you can always find another number between them. This number separates the two numbers you chose.

So, you say that these two numbers are linearly separable.

But, if both numbers are the same, you simply cannot separate them. They're the same. So, they're linearly inseparable. (Not just linearly, they aren't separable at all. You cannot separate something from itself.)

Two Dimensions

On extending this idea to two dimensions, some more possibilities come into existence. Consider the following:



Here, we'd like to separate the point (1,1) from the other points. You can see that there exists a line that does this. In fact, there exist infinitely many such lines. So, these two classes of points are linearly separable. The first class consists of the point (1,1) and the other class has (0,1), (1,0) and (0,0).

Now consider this:

In this case, you just cannot use one single line to separate the two classes (one containing the black points and one containing the red points). So, they are linearly inseparable.



Three dimensions

Extending the above example to three dimensions, you need a plane to separate the two classes.

The dashed plane separates the red point from the other blue points. So it's linearly separable. If the bottom right point on the opposite side were red too, it would become linearly inseparable.

Extending to n dimensions

Things go up to a lot of dimensions in neural networks. So to separate classes in n dimensions, you need an (n-1)-dimensional hyperplane.
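The idea of a separating hyperplane can be checked numerically. In the sketch below, a hyperplane is written as w.x + b = 0 and a point's class is decided by the sign of w.x + b. The first test uses the point (1,1) against (0,1), (1,0) and (0,0) from the text. The second test assumes an XOR-style labelling for the inseparable case, since the original figure is not reproduced here, and searches a small grid of candidate lines; the grid is illustrative, not a proof.

```python
from itertools import product

def separates(w, b, positives, negatives):
    """True if w.x + b > 0 for every positive point and < 0 for every negative."""
    def score(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return (all(score(x) > 0 for x in positives)
            and all(score(x) < 0 for x in negatives))

# Example from the text: the line x1 + x2 = 1.5 isolates the point (1, 1).
and_ok = separates((1, 1), -1.5, [(1, 1)], [(0, 0), (0, 1), (1, 0)])

# Assumed XOR-style labelling: opposite corners share a class.
xor_pos = [(0, 1), (1, 0)]
xor_neg = [(0, 0), (1, 1)]
grid = [i / 2 for i in range(-8, 9)]          # candidate values -4.0 ... 4.0
xor_found = any(
    separates((w1, w2), b, xor_pos, xor_neg)
    for w1, w2, b in product(grid, repeat=3)
)
```

The same `separates` function works unchanged in any number of dimensions, because it only relies on the dot product.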

Multilayer Perceptron Neural Network Model

The following diagram illustrates a perceptron network with three layers:



This network has an input layer (on the left) with three neurons, one hidden layer (in the middle) with three neurons and an output layer (on the right) with three neurons.

There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used to represent the N categories of the variable.

Input Layer: A vector of predictor variable values (x1...xp) is presented to the input layer. The input layer (or processing before the input layer) standardizes these values so that the range of each variable is -1 to 1. The input layer distributes the values to each of the neurons in the hidden layer. In addition to the predictor variables, there is a constant input of 1.0, called the bias, that is fed to each neuron in the hidden layer; the bias is multiplied by a weight and added to the sum going into the neuron.

Hidden Layer: Arriving at a neuron in the hidden layer, the value from each input neuron is multiplied by a weight (wji), and the resulting weighted values are added together producing a combined value uj. The weighted sum (uj) is fed into a transfer function, σ, which outputs a value hj. The outputs from the hidden layer are distributed to the output layer.

Output Layer: Arriving at a neuron in the output layer, the value from each hidden layer neuron is multiplied by a weight (wkj), and the resulting weighted values are added together producing a combined value vk. The weighted sum (vk) is fed into a transfer function, σ, which outputs a value yk. The y values are the outputs of the network.

If a regression analysis is being performed with a continuous target variable, then there is a single neuron in the output layer, and it generates a single y value. For classification problems with categorical target variables, there are N neurons in the output layer producing N values, one for each of the N categories of the target variable.
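The layer-by-layer computation described above can be sketched as a forward pass. This is a minimal illustration under stated assumptions: the logistic function stands in for the transfer function, the weight values are made up, the inputs are assumed to be already standardized to the range -1 to 1, and the bias weight is stored as the last entry of each neuron's weight row.

```python
import math

def logistic(z):
    """Assumed transfer function; other differentiable choices are possible."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(inputs, weights, transfer):
    """Each row of `weights` holds one weight per input plus a final bias weight."""
    extended = list(inputs) + [1.0]            # constant bias input of 1.0
    return [transfer(sum(w * x for w, x in zip(row, extended)))
            for row in weights]

def mlp_forward(x, hidden_weights, output_weights):
    h = layer_forward(x, hidden_weights, logistic)   # hidden outputs h_j
    y = layer_forward(h, output_weights, logistic)   # network outputs y_k
    return h, y

# Hypothetical weights for a 3-input, 3-hidden, 3-output network.
hidden_weights = [[0.5, -0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 0.0],
                  [1.0, 1.0, 1.0, -1.0]]
output_weights = [[0.0, 0.0, 0.0, 0.0],
                  [1.0, -1.0, 0.0, 0.0],
                  [0.0, 0.0, 0.0, 2.0]]

h, y = mlp_forward([1.0, 1.0, -1.0], hidden_weights, output_weights)
```

For a regression problem the output layer would have a single row of weights, giving a single y value.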



MULTILAYER PERCEPTRON ARCHITECTURE:

The network diagram shown above is a fully connected, three-layer, feed-forward perceptron neural network. Fully connected means that the output from each input and hidden neuron is distributed to all of the neurons in the following layer. Feed-forward means that the values only move from input to hidden to output layers; no values are fed back to earlier layers (a recurrent network allows values to be fed backward).

All neural networks have an input layer and an output layer, but the number of hidden layers may vary. Here is a diagram of a perceptron network with two hidden layers and four total layers:

When there is more than one hidden layer, the output from one hidden layer is fed into the next hidden layer and separate weights are applied to the sum going into each layer.

Training Multilayer Perceptron Networks

The goal of the training process is to find the set of weight values that will cause the output from the neural network to match the actual target values as closely as possible. There are several issues involved in designing and training a multilayer perceptron network:

- Selecting how many hidden layers to use in the network.

- Deciding how many neurons to use in each hidden layer.

- Finding a globally optimal solution that avoids local minima.



This picture is highly simplified because it represents only a single weight value (on the horizontal axis). With a typical neural network, you would have a 200-dimension, rough surface with many local valleys.

Optimization methods such as steepest descent and conjugate gradient are highly susceptible to finding local minima if they begin the search in a valley near a local minimum. They have no ability to see the big picture and find the global minimum.

Several methods have been tried to avoid local minima. The simplest is just to try a number of random starting points and use the one with the best value. A more sophisticated technique called simulated annealing improves on this by trying widely separated random values and then gradually reducing ("cooling") the random jumps in the hope that the location is getting closer to the global minimum.
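The simplest strategy, trying several random starting points and keeping the best result, can be sketched as follows. Everything here is an assumption for illustration: a one-weight error surface E(w) = w^4 - 3w^2 + w with one local and one global valley, plain steepest descent with a fixed step size, and ten random starts drawn from a seeded generator.

```python
import random

def error(w):
    """Hypothetical one-weight error surface with two valleys."""
    return w**4 - 3 * w**2 + w

def gradient(w):
    return 4 * w**3 - 6 * w + 1

def steepest_descent(w, learning_rate=0.01, steps=500):
    """Follow the negative gradient downhill from the starting weight w."""
    for _ in range(steps):
        w -= learning_rate * gradient(w)
    return w

rng = random.Random(0)                        # seeded for repeatability
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
finishes = [steepest_descent(w) for w in starts]

# Keep the finishing point with the lowest error across all restarts.
best = min(finishes, key=error)
```

A start on the right-hand side of the surface settles in the shallower valley, while a start on the left reaches the deeper one; comparing the finishes is what lets the restart strategy pick the better valley.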

DTREG uses the Nguyen-Widrow algorithm to select the initial range of starting weight values. It then uses the conjugate gradient algorithm to optimize the weights. Conjugate gradient usually finds the optimum weights quickly, but there is no guarantee that the weight values it finds are globally optimal. So it is useful to allow DTREG to try the optimization multiple times with different sets of initial random weight values. The number of tries allowed is specified on the Multilayer Perceptron property page.



MADALINE :



MADALINE (Many ADALINE) is a three-layer (input, hidden, output), fully connected, feed-forward artificial neural network architecture for classification that uses ADALINE units in its hidden and output layers, i.e. its activation function is the sign function. The three-layer network uses memistors. Three different training algorithms have been suggested for MADALINE networks, which cannot be trained using backpropagation because the sign function is not differentiable; they are called Rule I, Rule II and Rule III. The first of these dates back to 1962 and cannot adapt the weights of the hidden-output connection. The second training algorithm improved on Rule I and was described in 1988. The third "Rule" applied to a modified network with sigmoid activations instead of signum; it was later found to be equivalent to backpropagation.

The Rule II training algorithm is based on a principle called "minimal disturbance". It proceeds by looping over training examples, then for each example, it:

- finds the hidden layer unit (ADALINE classifier) with the lowest confidence in its prediction,

- tentatively flips the sign of the unit,

- accepts or rejects the change based on whether the network's error is reduced,

- stops when the error is zero.

Additionally, when flipping single units' signs does not drive the error to zero for a particular

example, the training algorithm starts flipping pairs of units' signs, then triples of units, etc.

DIFFERENCE BETWEEN HUMAN BRAIN AND ANN:



BACK PROPAGATION:

Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function.

Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.
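The layer-by-layer use of the chain rule can be sketched for a very small network. The sketch assumes a squared-error loss, logistic activations, one hidden layer, a single output neuron and no bias terms; the weights, input and learning rate are made-up values, and the function names are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w_hidden, w_out):
    """Forward pass: hidden activations h, then the single output y."""
    h = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = logistic(sum(w * hj for w, hj in zip(w_out, h)))
    return h, y

def loss(x, t, w_hidden, w_out):
    _, y = forward(x, w_hidden, w_out)
    return 0.5 * (t - y) ** 2

def backprop(x, t, w_hidden, w_out):
    """Gradient of the loss with respect to every weight, via the chain rule."""
    h, y = forward(x, w_hidden, w_out)
    delta_out = (y - t) * y * (1.0 - y)               # error signal at the output
    grad_out = [delta_out * hj for hj in h]
    # Propagate the error signal backward through the output weights.
    delta_hidden = [delta_out * w_out[j] * h[j] * (1.0 - h[j])
                    for j in range(len(h))]
    grad_hidden = [[dj * xi for xi in x] for dj in delta_hidden]
    return grad_hidden, grad_out

x, t = [1.0, 0.5], 1.0
w_hidden = [[0.1, -0.2], [0.4, 0.3]]
w_out = [0.7, -0.5]
grad_hidden, grad_out = backprop(x, t, w_hidden, w_out)

# One gradient-descent step: move every weight against its gradient.
rate = 0.5
new_hidden = [[w - rate * g for w, g in zip(w_row, g_row)]
              for w_row, g_row in zip(w_hidden, grad_hidden)]
new_out = [w - rate * g for w, g in zip(w_out, grad_out)]
```

A useful sanity check for any backpropagation code is to compare these gradients with finite-difference estimates of the loss; the two should agree closely.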

DERIVATION OF ERROR BACK PROPAGATION ALGORITHM (EBPA) :



Derivation-

Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is:

$$E = \tfrac{1}{2}(t - y)^2,$$

where

$E$ = the squared error

$t$ = target output

$y$ = actual output of the output neuron

(The factor of $\tfrac{1}{2}$ is included to cancel the exponent when differentiating.) Therefore the error, $E$, depends on the output $y$. However, the output $y$ depends on the weighted sum of all its inputs:

$$y = \sum_{i=1}^{n} w_i x_i$$

$n$ = the number of input units to the neuron

$w_i$ = the $i$-th weight

$x_i$ = the $i$-th input value to the neuron

The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of the inputs). In general, a non-linear, differentiable activation function, $\varphi$, is used. Thus, more correctly:

$$y = \varphi(\mathrm{net}), \qquad \mathrm{net} = \sum_{i=1}^{n} w_i x_i$$

This lays the groundwork for calculating the partial derivative of the error with respect to a weight $w_i$ using the chain rule:

$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy}\,\frac{dy}{d\mathrm{net}}\,\frac{\partial \mathrm{net}}{\partial w_i}$$

$\partial E / \partial w_i$ = How the error changes when the weights are changed

$dE / dy$ = How the error changes when the output is changed

$dy / d\mathrm{net}$ = How the output changes when the weighted sum changes

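$\partial \mathrm{net} / \partial w_i$ = How the weighted sum changes as the weights change

Since the weighted sum $\mathrm{net}$ is just the sum over all products $w_i x_i$, the partial derivative of the sum with respect to a weight $w_i$ is just the corresponding input $x_i$. Similarly, the partial derivative of the sum with respect to an input value $x_i$ is just the weight $w_i$:

$$\frac{\partial \mathrm{net}}{\partial w_i} = x_i, \qquad \frac{\partial \mathrm{net}}{\partial x_i} = w_i$$

The derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is simply the derivative of the activation function $\varphi$:

$$\frac{dy}{d\mathrm{net}} = \frac{d}{d\mathrm{net}}\,\varphi(\mathrm{net})$$

This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:

$$y = \frac{1}{1 + e^{-z}}$$

which has a nice derivative of:

$$\frac{dy}{dz} = y(1 - y)$$

For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is the same as the derivative of the logistic function:

$$\frac{dy}{d\mathrm{net}} = y(1 - y)$$

Finally, the derivative of the error $E$ with respect to the output $y$ is:

$$\frac{dE}{dy} = \frac{d}{dy}\,\tfrac{1}{2}(t - y)^2 = y - t$$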

Putting it all together:

$$\frac{\partial E}{\partial w_i} = (y - t)\,y(1 - y)\,x_i$$

If one were to use a different activation function, the only difference would be that the term $y(1 - y)$ would be replaced by the derivative of the newly chosen activation function.

To update the weight $w_i$ using gradient descent, one must choose a learning rate, $\alpha$. The change in weight after learning would then be the product of the learning rate and the gradient, taken with a negative sign so that the error decreases:

$$\Delta w_i = -\alpha\,\frac{\partial E}{\partial w_i}$$

For a linear neuron, the derivative of the activation function is 1, which yields:

$$\Delta w_i = \alpha\,(t - y)\,x_i$$

This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule. In backpropagation and perceptron learning, when the output $y$ matches the desired output $t$, the change in weight $\Delta w_i$ would be zero, which is exactly what is desired.
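The result of the derivation can be tried out on a single logistic neuron. The sketch below assumes the squared-error function and logistic activation used in the derivation; the input vector, target, starting weights and learning rate are made-up values for illustration.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(weights, x):
    """y = logistic(net), where net is the weighted sum of the inputs."""
    return logistic(sum(w * xi for w, xi in zip(weights, x)))

def gradient(weights, x, t):
    """dE/dw_i = (y - t) * y * (1 - y) * x_i for E = 0.5 * (t - y)^2."""
    y = output(weights, x)
    return [(y - t) * y * (1.0 - y) * xi for xi in x]

def update(weights, x, t, alpha):
    """Gradient-descent step: delta w_i = -alpha * dE/dw_i."""
    return [w - alpha * g for w, g in zip(weights, gradient(weights, x, t))]

x, t = [1.0, 2.0], 1.0          # assumed input vector and target output
weights = [0.0, 0.0]            # assumed starting weights
first_gradient = gradient(weights, x, t)
new_weights = update(weights, x, t, alpha=1.0)
```

With zero starting weights the output is 0.5, so the gradient is (0.5 - 1) * 0.25 * x_i, and one update moves the output toward the target of 1.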

MOMENTUM:

Empirical evidence shows that the use of a term called momentum in the backpropagation algorithm can be helpful in speeding the convergence and avoiding local minima.

The idea behind using momentum is to stabilize the weight change by making nonradical revisions, combining the gradient-decreasing term with a fraction of the previous weight change:

$$\Delta w(t) = -\eta\,\frac{\partial E_e}{\partial w(t)} + \alpha\,\Delta w(t-1)$$

where $\eta$ is the learning rate, $\alpha$ is taken as $0 \le \alpha \le 0.9$, and $t$ is the index of the current weight change.

This gives the system a certain amount of inertia since the weight vector will tend to continue moving in the same direction unless opposed by the gradient term.
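The momentum update can be sketched on a toy problem. The assumptions here are a single weight, the error surface E(w) = w^2 (so the gradient is 2w), a learning rate of 0.05, a starting weight of 5.0 and a momentum coefficient of 0.5; none of these values come from the notes.

```python
def momentum_step(w, previous_change, gradient, eta, alpha):
    """delta_w(t) = -eta * gradient + alpha * delta_w(t-1)."""
    change = -eta * gradient + alpha * previous_change
    return w + change, change

def run(alpha, steps=50, eta=0.05):
    """Minimise the assumed error surface E(w) = w^2 starting from w = 5."""
    w, change = 5.0, 0.0
    for _ in range(steps):
        w, change = momentum_step(w, change, 2.0 * w, eta, alpha)
    return w

plain = run(alpha=0.0)            # ordinary gradient descent, no momentum
with_momentum = run(alpha=0.5)    # a fraction of the previous change is reused
```

On this surface the run with momentum ends much closer to the minimum at w = 0 after the same number of steps, because consecutive changes point in the same direction and reinforce each other.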

The momentum has the following effects:

- it smooths the weight changes and suppresses cross-stitching, that is, it cancels side-to-side oscillations across the error valley;

- when all weight changes are in the same direction, the momentum amplifies the learning rate, causing faster convergence;

- it makes it possible to escape from small local minima on the error surface.

The hope is that the momentum will allow a larger learning rate and that this will speed convergence and avoid local minima. On the other hand, a learning rate of 1 with no momentum will be much faster when no problem with local minima or non-convergence is encountered.

LIMITATIONS OF NEURAL NETWORK :

There are many advantages and limitations to neural network analysis, and to discuss this subject properly we would have to look at each individual type of network, which isn't necessary for this general discussion. In reference to backpropagational networks, however, there are some specific issues potential users should be aware of.

Backpropagational neural networks (and many other types of networks) are in a sense the ultimate 'black boxes'. Apart from defining the general architecture of a network and perhaps initially seeding it with random numbers, the user has no other role than to feed it input, watch it train and await the output. In fact, it has been said that with backpropagation, "you almost don't know what you're doing". Some freely available software packages (NevProp, bp, Mactivation) do allow the user to sample the network's 'progress' at regular time intervals, but the learning itself progresses on its own. The final product of this activity is a trained network that provides no equations or coefficients defining a relationship (as in regression) beyond its own internal mathematics. The network 'IS' the final equation of the relationship.



Backpropagational networks also tend to be slower to train than other types of networks and sometimes require thousands of epochs. If run on a truly parallel computer system this issue is not really a problem, but if the BPNN is being simulated on a standard serial machine (i.e. a single SPARC, Mac or PC) training can take some time. This is because the machine's CPU must compute the function of each node and connection separately, which can be problematic in very large networks with a large amount of data. However, the speed of most current machines is such that this is typically not much of an issue.
