8/12/2019 Soft Computing Unit-2 by Arun Pratap Singh
UNIT : II
SOFT COMPUTING II SEMESTER (MCSE 205)
PREPARED BY ARUN PRATAP SINGH
NEURAL NETWORK:
These networks are simplified models of the biological neuron system, which is a massively parallel distributed processing system made up of highly interconnected neural computing elements. Their ability to learn makes neural networks powerful and flexible: they acquire knowledge and make it available for use. These networks are also called neural nets or artificial neural networks. In a neural network there is no need to devise an algorithm for performing a specific task. These networks are also well suited to real-time systems, because their parallel architecture gives short computation times and fast response.
ARTIFICIAL NEURAL NETWORK (ANN):
In computer science and related fields, artificial neural networks (ANNs) are computational models inspired by animals' central nervous systems (in particular the brain) that are capable of machine learning as well as pattern recognition. Artificial neural networks are generally presented as systems of interconnected "neurons" which can compute values from inputs.
For example, a neural network for handwriting recognition is defined by a set of input neurons which may be activated by the pixels of an input image. After being weighted and transformed by a function (determined by the network's designer), the activations of these neurons are then passed on to other neurons. This process is repeated until finally an output neuron is activated. This determines which character was read.
Like other machine learning methods (systems that learn from data), neural networks have been used to solve a wide variety of tasks that are hard to solve using ordinary rule-based programming, including computer vision and speech recognition.
DIFFERENT ACTIVATION FUNCTION:
SINGLE LAYER PERCEPTRON:
In machine learning, the perceptron is an algorithm for supervised classification of an input into one of two classes (multiclass generalizations also exist). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.
The perceptron algorithm dates back to the late 1950s; its first implementation, in custom hardware, was one of the first artificial neural networks to be produced.
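The online learning procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the learning rate `lr`, the 0/1 output coding and the logical-AND training set are all assumptions chosen for the example.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a hard threshold: returns 1 or 0."""
    s = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Online perceptron learning: weights are updated after each training element."""
    n = len(samples[0][0])
    weights = [0.0] * n
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every sample is classified correctly
            break
    return weights, bias


# Logical AND (assumed example data): linearly separable, so training converges.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
predictions = [predict(weights, bias, x) for x, _ in AND_SAMPLES]
print(predictions)  # [0, 0, 0, 1]
```

Because each update uses only the current training element, the same loop could in principle consume a stream of examples rather than a fixed list.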
WIDROW-HOFF/DELTA LEARNING RULE:
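For a neuron $j$ with activation function $g$, the delta rule for the $i$th weight $w_{ji}$ uses the following quantities:
$g(x)$ is the neuron's activation function
$t_j$ is the target output
$h_j$ is the weighted sum of the neuron's inputs
$y_j$ is the actual output
$x_i$ is the $i$th input.
It holds that $h_j = \sum_i x_i w_{ji}$ and $y_j = g(h_j)$.
The delta rule is commonly stated in simplified form for a neuron with a linear activation function as
$$\Delta w_{ji} = \eta \, (t_j - y_j) \, x_i$$
where $\eta$ is the learning rate.
While the delta rule is similar to the perceptron's update rule, the derivation is different. The perceptron uses the Heaviside step function as the activation function $g(h)$, and that means that $g'(h)$ does not exist at zero, and is equal to zero elsewhere, which makes the direct application of the delta rule impossible.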
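As a rough illustration of the simplified (linear-activation) delta rule, the sketch below applies the update "weight change = learning rate × (target − output) × input" repeatedly to a single linear neuron. The name `delta_rule_step`, the learning rate `eta = 0.1` and the four training samples are assumptions made for the example only.

```python
def delta_rule_step(weights, x, target, eta):
    """One delta-rule update for a linear neuron: w_i += eta * (t - y) * x_i."""
    y = sum(w * xi for w, xi in zip(weights, x))  # linear activation: y equals the weighted sum
    return [w + eta * (target - y) * xi for w, xi in zip(weights, x)]


# Noise-free samples generated from y = 2*x1 - 1*x2 (illustrative data).
samples = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
weights = [0.0, 0.0]
for _ in range(200):
    for x, t in samples:
        weights = delta_rule_step(weights, x, t, eta=0.1)
print([round(w, 3) for w in weights])
```

With consistent data and a small enough learning rate, the weights approach the generating coefficients; a larger learning rate could make the updates oscillate or diverge.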
WINNER-TAKE-ALL LEARNING RULE:
Winner-take-all is a computational principle applied in computational models of neural networks by which neurons in a layer compete with each other for activation. In the classical form, only the neuron with the highest activation stays active while all other neurons shut down. Other variations that allow more than one neuron to be active also exist, for example the soft winner-take-all, by which a power function is applied to the neurons.
In the theory of artificial neural networks, winner-take-all networks are a case of competitive learning in recurrent neural networks. Output nodes in the network mutually inhibit each other, while simultaneously activating themselves through reflexive connections. After some time, only one node in the output layer will be active, namely the one corresponding to the strongest input. Thus the network uses nonlinear inhibition to pick out the largest of a set of inputs. Winner-take-all is a general computational primitive that can be implemented using different types of neural network models, including both continuous-time and spiking networks (Grossberg, 1973; Oster et al. 2009).
Winner-take-all networks are commonly used in computational models of the brain, particularly for distributed decision-making or action selection in the cortex. Important examples include hierarchical models of vision (Riesenhuber et al. 1999), and models of selective attention and recognition (Carpenter and Grossberg, 1987; Itti et al. 1998). They are also common in artificial neural networks and neuromorphic analog VLSI circuits. It has been formally proven that the winner-take-all operation is computationally powerful compared to other nonlinear operations, such as thresholding (Maass 2000).
In many practical cases, there is not only a single neuron which becomes the only active one but there are exactly k neurons which become active for a fixed number k. This principle is referred to as k-winners-take-all.
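The outcome of the competition (not the recurrent inhibition dynamics that produce it) can be sketched directly. The function names and the example activations below are assumptions for illustration; ties are broken in favour of the earliest neuron.

```python
def winner_take_all(activations):
    """Classical WTA: keep the highest activation, set all others to 0."""
    winner = max(range(len(activations)), key=lambda i: activations[i])
    return [a if i == winner else 0.0 for i, a in enumerate(activations)]


def k_winners_take_all(activations, k):
    """k-WTA: keep the k highest activations, set all others to 0."""
    ranked = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    winners = set(ranked[:k])
    return [a if i in winners else 0.0 for i, a in enumerate(activations)]


layer = [0.2, 0.9, 0.4, 0.7]
print(winner_take_all(layer))        # [0.0, 0.9, 0.0, 0.0]
print(k_winners_take_all(layer, 2))  # [0.0, 0.9, 0.0, 0.7]
```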
LINEAR SEPARABILITY:
Linear separability is an important concept in neural networks. The idea is to check whether two classes of points in an n-dimensional space can be separated by an (n-1)-dimensional hyperplane.
Lost it? Here's a simpler explanation.
One Dimension
Let's say you're on a number line. You take any two numbers. Now, there are two possibilities:
1. You choose two different numbers
2. You choose the same number
If you choose two different numbers, you can always find another number between them. This number separates the two numbers you chose.
So, you say that these two numbers are linearly separable.
But, if both numbers are the same, you simply cannot separate them. They're the same. So, they're linearly inseparable. (Not just linearly, they aren't separable at all. You cannot separate something from itself.)
Two Dimensions
On extending this idea to two dimensions, some more possibilities come into existence. Consider the following:
Here, we'd like to separate the point (1,1) from the other points. You can see that there exists a line that does this. In fact, there exist infinitely many such lines. So, these two classes of points are linearly separable. The first class consists of the point (1,1) and the other class has (0,1), (1,0) and (0,0).
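One way to make this concrete is to pick a candidate line and check which side each point falls on. The line x1 + x2 - 1.5 = 0 used below is an assumption (one of the infinitely many that work), as are the names `side`, `class_a` and `class_b`. Since the check is just a weighted sum plus a constant, the same function works unchanged for a plane in three dimensions or a hyperplane in n dimensions.

```python
def side(w, b, point):
    """Return +1 or -1 depending on which side of the hyperplane w.x + b = 0 the point lies."""
    value = sum(wi * xi for wi, xi in zip(w, point)) + b
    return 1 if value > 0 else -1


# Assumed separating line: x1 + x2 - 1.5 = 0.
w, b = (1.0, 1.0), -1.5
class_a = [(1, 1)]
class_b = [(0, 1), (1, 0), (0, 0)]

sides_a = [side(w, b, p) for p in class_a]
sides_b = [side(w, b, p) for p in class_b]
print(sides_a, sides_b)  # [1] [-1, -1, -1]
```

All points of one class land on the positive side and all points of the other on the negative side, which is exactly what linear separability requires.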
Now consider this:
In this case, you just cannot use one single line to separate the two classes (one containing the
black points and one containing the red points). So, they are linearly inseparable.
Three dimensions
Extending the above example to three dimensions, you need a plane for separating the two classes.
The dashed plane separates the red point from the other blue points. So the two classes are linearly separable. If the bottom right point on the opposite side were red too, the classes would become linearly inseparable.
Extending to n dimensions
Things go up to a lot of dimensions in neural networks. So to separate classes in n dimensions, you need an (n-1)-dimensional hyperplane.
Multilayer Perceptron Neural Network Model
The following diagram illustrates a perceptron network with three layers:
This network has an input layer (on the left) with three neurons, one hidden layer (in the middle) with three neurons and an output layer (on the right) with three neurons.
There is one neuron in the input layer for each predictor variable. In the case of categorical variables, N-1 neurons are used to represent the N categories of the variable.
Input Layer: A vector of predictor variable values (x1...xp) is presented to the input layer. The input layer (or processing before the input layer) standardizes these values so that the range of each variable is -1 to 1. The input layer distributes the values to each of the neurons in the hidden layer. In addition to the predictor variables, there is a constant input of 1.0, called the bias, that is fed to each neuron in the hidden layer; the bias is multiplied by a weight and added to the sum going into the neuron.
Hidden Layer: Arriving at a neuron in the hidden layer, the value from each input neuron is multiplied by a weight (wji), and the resulting weighted values are added together producing a combined value uj. The weighted sum (uj) is fed into a transfer function, σ, which outputs a value hj. The outputs from the hidden layer are distributed to the output layer.
Output Layer: Arriving at a neuron in the output layer, the value from each hidden layer neuron is multiplied by a weight (wkj), and the resulting weighted values are added together producing a combined value vk. The weighted sum (vk) is fed into a transfer function, σ, which outputs a value yk. The y values are the outputs of the network.
If a regression analysis is being performed with a continuous target variable, then there is a single neuron in the output layer, and it generates a single y value. For classification problems with categorical target variables, there are N neurons in the output layer producing N values, one for each of the N categories of the target variable.
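The layer-by-layer computation described above can be sketched as a forward pass. This is only an illustration under stated assumptions: the transfer function is taken to be the logistic function (the text does not say which one is used), and the weight and bias values are made up for a network with three inputs, three hidden neurons and three outputs.

```python
import math


def layer_forward(inputs, weights, biases):
    """For each neuron: weighted sum of inputs plus bias weight, then a logistic transfer function."""
    outputs = []
    for neuron_weights, bias_weight in zip(weights, biases):
        # The constant bias input of 1.0 is multiplied by its own weight.
        u = bias_weight * 1.0 + sum(w * x for w, x in zip(neuron_weights, inputs))
        outputs.append(1.0 / (1.0 + math.exp(-u)))
    return outputs


def mlp_forward(x, hidden_w, hidden_b, output_w, output_b):
    h = layer_forward(x, hidden_w, hidden_b)     # hidden outputs h_j
    return layer_forward(h, output_w, output_b)  # network outputs y_k


# Illustrative (made-up) weights: rows are neurons, columns are incoming connections.
hidden_w = [[0.2, -0.4, 0.1], [0.5, 0.3, -0.2], [-0.3, 0.8, 0.6]]
hidden_b = [0.1, -0.1, 0.0]
output_w = [[0.7, -0.5, 0.2], [-0.6, 0.4, 0.9], [0.3, 0.3, -0.8]]
output_b = [0.0, 0.2, -0.1]

y = mlp_forward([0.5, -1.0, 1.0], hidden_w, hidden_b, output_w, output_b)
print([round(v, 4) for v in y])
```

Adding a second hidden layer would simply mean one more `layer_forward` call with its own weights between the two existing ones.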
MULTILAYER PERCEPTRON ARCHITECTURE:
The network diagram shown above is a fully connected, three-layer, feed-forward perceptron neural network. "Fully connected" means that the output from each input and hidden neuron is distributed to all of the neurons in the following layer. "Feed forward" means that the values only move from input to hidden to output layers; no values are fed back to earlier layers (a recurrent network allows values to be fed backward).
All neural networks have an input layer and an output layer, but the number of hidden layers may vary. Here is a diagram of a perceptron network with two hidden layers and four total layers:
When there is more than one hidden layer, the output from one hidden layer is fed into the next hidden layer and separate weights are applied to the sum going into each layer.
Training Multilayer Perceptron Networks
The goal of the training process is to find the set of weight values that will cause the output from the neural network to match the actual target values as closely as possible. There are several issues involved in designing and training a multilayer perceptron network:
- Selecting how many hidden layers to use in the network.
- Deciding how many neurons to use in each hidden layer.
- Finding a globally optimal solution that avoids local minima.
This picture is highly simplified because it represents only a single weight value (on the horizontal axis). With a typical neural network, you would have a 200-dimension, rough surface with many local valleys.
Optimization methods such as steepest descent and conjugate gradient are highly susceptible to finding local minima if they begin the search in a valley near a local minimum. They have no ability to see the big picture and find the global minimum.
Several methods have been tried to avoid local minima. The simplest is just to try a number of random starting points and use the one with the best value. A more sophisticated technique called simulated annealing improves on this by trying widely separated random values and then gradually reducing ("cooling") the random jumps in the hope that the location is getting closer to the global minimum.
DTREG uses the Nguyen-Widrow algorithm to select the initial range of starting weight values. It then uses the conjugate gradient algorithm to optimize the weights. Conjugate gradient usually finds the optimum weights quickly, but there is no guarantee that the weight values it finds are globally optimal. So it is useful to allow DTREG to try the optimization multiple times with different sets of initial random weight values. The number of tries allowed is specified on the Multilayer Perceptron property page.
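The "several random starting points" idea can be sketched on a toy one-weight error surface. Everything here is an assumption made for illustration: the surface w^4 - 3w^2 + w (which has one local and one global minimum), plain steepest descent in place of conjugate gradient, and the step size, step count and seed. It does not reproduce what DTREG does internally.

```python
import random


def loss(w):
    """Toy one-weight error surface with a local and a global minimum."""
    return w**4 - 3 * w**2 + w


def grad(w):
    """Derivative of the toy error surface."""
    return 4 * w**3 - 6 * w + 1


def descend(w, lr=0.01, steps=2000):
    """Plain steepest descent from starting weight w."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w


def best_of_restarts(tries, seed=0):
    """Run descent from several random starting weights and keep the lowest-error result."""
    rng = random.Random(seed)
    results = [descend(rng.uniform(-2.0, 2.0)) for _ in range(tries)]
    return min(results, key=loss)


stuck = descend(1.0)         # starts in the valley of the local minimum
better = descend(-1.0)       # starts in the valley of the global minimum
best = best_of_restarts(10)  # keeps whichever restart ended lowest
print(round(stuck, 3), round(better, 3), round(best, 3))
```

A single run that starts at w = 1.0 settles in the shallower valley; trying several starting points makes it likely, though not certain, that at least one run reaches the deeper one.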
MADALINE :
MADALINE (Many ADALINE) is a three-layer (input, hidden, output), fully connected, feed-forward artificial neural network architecture for classification that uses ADALINE units in its hidden and output layers, i.e. its activation function is the sign function. The three-layer network uses memistors. Three different training algorithms for MADALINE networks, which cannot be learned using backpropagation because the sign function is not differentiable, have been suggested, called Rule I, Rule II and Rule III. The first of these dates back to 1962 and cannot adapt the weights of the hidden-output connection. The second training algorithm improved on Rule I and was described in 1988. The third "Rule" applied to a modified network with sigmoid activations instead of signum; it was later found to be equivalent to backpropagation.
The Rule II training algorithm is based on a principle called "minimal disturbance". It proceeds by looping over training examples, then for each example, it:
- finds the hidden layer unit (ADALINE classifier) with the lowest confidence in its prediction,
- tentatively flips the sign of the unit,
- accepts or rejects the change based on whether the network's error is reduced,
- stops when the error is zero.
Additionally, when flipping single units' signs does not drive the error to zero for a particular
example, the training algorithm starts flipping pairs of units' signs, then triples of units, etc.
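A single "minimal disturbance" trial for one training example might look like the sketch below. It is deliberately partial and rests on several assumptions: confidence is measured as the absolute value of a unit's weighted sum, the output unit is a fixed majority vote over the hidden signs, and the weight adaptation that follows an accepted flip (as well as the pairs-and-triples search) is left out. The names and the example weights are hypothetical.

```python
def sign(v):
    return 1 if v >= 0 else -1


def hidden_nets(x, hidden_units):
    """Weighted sums of the hidden ADALINE units; each unit is (bias, weights)."""
    return [b + sum(w * xi for w, xi in zip(ws, x)) for b, ws in hidden_units]


def network_output(hidden_outputs):
    """Assumed fixed output unit: majority vote over the hidden signs."""
    return sign(sum(hidden_outputs))


def rule2_trial(x, target, hidden_units):
    """Return the index of the single unit whose flip fixes the error, or None."""
    nets = hidden_nets(x, hidden_units)
    outputs = [sign(n) for n in nets]
    if network_output(outputs) == target:
        return None  # already correct, nothing to disturb
    # Lowest confidence = weighted sum closest to zero.
    least_confident = min(range(len(nets)), key=lambda j: abs(nets[j]))
    trial = list(outputs)
    trial[least_confident] = -trial[least_confident]  # tentative flip
    if network_output(trial) == target:
        return least_confident  # accept: this unit would then be retrained
    return None  # reject the flip


units = [(-0.2, (0.1, 0.0)), (0.0, (-1.0, 0.0)), (0.0, (1.0, 0.0))]
print(rule2_trial((1, -1), 1, units))  # 0
```

In this example the first unit has the weighted sum nearest zero, and flipping it changes the majority vote to the target, so the flip is accepted.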
DIFFERENCE BETWEEN HUMAN BRAIN AND ANN:
BACK PROPAGATION:
Backpropagation, an abbreviation for "backward propagation of errors", is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method, which in turn uses it to update the weights in an attempt to minimize the loss function.
Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient. It is therefore usually considered to be a supervised learning method, although it is also used in some unsupervised networks such as autoencoders. It is a generalization of the delta rule to multi-layered feedforward networks, made possible by using the chain rule to iteratively compute gradients for each layer. Backpropagation requires that the activation function used by the artificial neurons (or "nodes") be differentiable.
DERIVATION OF ERROR BACK PROPAGATION ALGORITHM (EBPA) :
Derivation-
Since backpropagation uses the gradient descent method, one needs to calculate the derivative of the squared error function with respect to the weights of the network. The squared error function is:
$$E = \tfrac{1}{2}(t - y)^2,$$
where
$E$ = the squared error
$t$ = target output
$y$ = actual output of the output neuron
(The factor of $\tfrac{1}{2}$ is included to cancel the exponent when differentiating.) Therefore the error, $E$, depends on the output $y$. However, the output $y$ depends on the weighted sum of all its inputs:
$$y = \sum_{i=1}^{n} w_i x_i$$
$n$ = the number of input units to the neuron
$w_i$ = the $i$-th weight
$x_i$ = the $i$-th input value to the neuron
The above formula only holds true for a neuron with a linear activation function (that is, the output is solely the weighted sum of the inputs). In general, a non-linear, differentiable activation function, $\varphi$, is used. Thus, more correctly:
$$y = \varphi(\mathrm{net}), \qquad \mathrm{net} = \sum_{i=1}^{n} w_i x_i$$
This lays the groundwork for calculating the partial derivative of the error with respect to a weight $w_i$ using the chain rule:
$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\,\mathrm{net}}{\partial w_i}$$
$\frac{\partial E}{\partial w_i}$ = How the error changes when the weights are changed
$\frac{dE}{dy}$ = How the error changes when the output is changed
$\frac{dy}{d\,\mathrm{net}}$ = How the output changes when the weighted sum changes
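$\frac{\partial\,\mathrm{net}}{\partial w_i}$ = How the weighted sum changes as the weights change
Since the weighted sum $\mathrm{net}$ is just the sum over all products $w_i x_i$, the partial derivative of the sum with respect to a weight $w_i$ is just the corresponding input $x_i$. Similarly, the partial derivative of the sum with respect to an input value $x_i$ is just the weight $w_i$:
$$\frac{\partial\,\mathrm{net}}{\partial w_i} = x_i, \qquad \frac{\partial\,\mathrm{net}}{\partial x_i} = w_i$$
The derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is simply the derivative of the activation function $\varphi$:
$$\frac{dy}{d\,\mathrm{net}} = \frac{d}{d\,\mathrm{net}}\,\varphi(\mathrm{net})$$
This is the reason why backpropagation requires the activation function to be differentiable. A commonly used activation function is the logistic function:
$$y = \frac{1}{1 + e^{-z}}$$
which has a nice derivative of:
$$\frac{dy}{dz} = y(1 - y)$$
For example purposes, assume the network uses a logistic activation function, in which case the derivative of the output $y$ with respect to the weighted sum $\mathrm{net}$ is the same as the derivative of the logistic function:
$$\frac{dy}{d\,\mathrm{net}} = y(1 - y)$$
Finally, the derivative of the error $E$ with respect to the output $y$ is:
$$\frac{dE}{dy} = \frac{d}{dy}\,\tfrac{1}{2}(t - y)^2 = y - t$$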
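Putting it all together:
$$\frac{\partial E}{\partial w_i} = \frac{dE}{dy} \cdot \frac{dy}{d\,\mathrm{net}} \cdot \frac{\partial\,\mathrm{net}}{\partial w_i} = (y - t)\, y (1 - y)\, x_i$$
If one were to use a different activation function, the only difference would be that the $y(1 - y)$ term is replaced by the derivative of the newly chosen activation function.
To update the weight $w_i$ using gradient descent, one must choose a learning rate, $\eta$. The change in weight after learning is then the product of the learning rate and the gradient, taken with a negative sign so that the weight moves downhill on the error surface:
$$\Delta w_i = -\eta \, \frac{\partial E}{\partial w_i}$$
For a linear neuron, the derivative of the activation function is 1, which yields:
$$\Delta w_i = \eta \, (t - y) \, x_i$$
This is exactly the delta rule for perceptron learning, which is why the backpropagation algorithm is a generalization of the delta rule. In backpropagation and perceptron learning, when the output $y$ matches the desired output $t$, the change in weight $\Delta w_i$ would be zero, which is exactly what is desired.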
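The result of the derivation for a single logistic neuron, gradient = (output − target) × output × (1 − output) × input, can be checked numerically. The sketch below compares that expression with a finite-difference estimate and then takes one gradient descent step. The function names, the sample weights and inputs, and the learning rate of 0.5 are assumptions for illustration.

```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(weights, x):
    """Output of one neuron: logistic function of the weighted sum of inputs."""
    return logistic(sum(w * xi for w, xi in zip(weights, x)))


def error(weights, x, t):
    """Squared error with the factor of one half."""
    return 0.5 * (t - forward(weights, x)) ** 2


def gradient(weights, x, t):
    """Chain-rule result: dE/dw_i = (y - t) * y * (1 - y) * x_i."""
    y = forward(weights, x)
    return [(y - t) * y * (1.0 - y) * xi for xi in x]


def numerical_gradient(weights, x, t, eps=1e-6):
    """Central finite differences, used only as an independent check."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        grads.append((error(up, x, t) - error(down, x, t)) / (2 * eps))
    return grads


weights, x, t = [0.5, -0.3, 0.8], [1.0, 2.0, -1.0], 1.0
analytic = gradient(weights, x, t)
numeric = numerical_gradient(weights, x, t)

# One gradient descent step: each weight moves against its gradient component.
eta = 0.5
updated = [w - eta * g for w, g in zip(weights, analytic)]
print(error(weights, x, t), error(updated, x, t))
```

Agreement between the two gradients is a common sanity check when implementing backpropagation by hand, and the error after the step should be lower than before it for a sufficiently small learning rate.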
MOMENTUM:
Empirical evidence shows that the use of a term called momentum in the backpropagation algorithm can be helpful in speeding the convergence and avoiding local minima.
The idea of using momentum is to stabilize the weight change by making nonradical revisions, combining the gradient-decreasing term with a fraction of the previous weight change:
$$\Delta w(t) = -\eta \, \frac{\partial E_e}{\partial w(t)} + \alpha \, \Delta w(t-1)$$
where $E_e$ is the error being minimized, $\eta$ is the learning rate, $\alpha$ is taken as $0 \le \alpha \le 0.9$, and $t$ is the index of the current weight change.
This gives the system a certain amount of inertia, since the weight vector will tend to continue moving in the same direction unless opposed by the gradient term.
The momentum has the following effects:
- it smooths the weight changes and suppresses cross-stitching, that is, it cancels side-to-side oscillations across the error valley;
- when all weight changes are in the same direction, the momentum amplifies the learning rate, causing faster convergence;
- it enables the search to escape from small local minima on the error surface.
The hope is that the momentum will allow a larger learning rate and that this will speed convergence and avoid local minima. On the other hand, a learning rate of 1 with no momentum will be much faster when no problem with local minima or non-convergence is encountered.
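The momentum update (new weight change = minus learning rate × gradient, plus a fraction of the previous weight change) can be tried on a toy error surface. In this sketch the quadratic surface (w − 3)^2, the learning rate 0.01, the momentum coefficient 0.9 and the 50-step budget are all assumptions chosen for illustration.

```python
def momentum_step(w, prev_delta, grad, eta, alpha):
    """Weight change = -eta * gradient + alpha * previous change; returns (new_w, change)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta


def minimise(grad_fn, w, eta, alpha, steps):
    """Repeated momentum updates starting from weight w with zero initial change."""
    delta = 0.0
    for _ in range(steps):
        w, delta = momentum_step(w, delta, grad_fn(w), eta, alpha)
    return w


def toy_gradient(w):
    """Gradient of the toy error surface E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


plain = minimise(toy_gradient, 0.0, eta=0.01, alpha=0.0, steps=50)
with_momentum = minimise(toy_gradient, 0.0, eta=0.01, alpha=0.9, steps=50)
print(plain, with_momentum)
```

With the same small learning rate and the same number of steps, the run with momentum ends closer to the minimum at w = 3, because successive changes in the same direction accumulate. On other surfaces or with larger coefficients, momentum can also overshoot.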
LIMITATIONS OF NEURAL NETWORK :
There are many advantages and limitations to neural network analysis, and to discuss this subject properly we would have to look at each individual type of network, which isn't necessary for this general discussion. In reference to backpropagational networks, however, there are some specific issues potential users should be aware of.
Backpropagational neural networks (and many other types of networks) are in a sense the ultimate "black boxes". Apart from defining the general architecture of a network and perhaps initially seeding it with random numbers, the user has no other role than to feed it input, watch it train and await the output. In fact, it has been said that with backpropagation, "you almost don't know what you're doing". Some freely available software packages (NevProp, bp, Mactivation) do allow the user to sample the network's "progress" at regular time intervals, but the learning itself progresses on its own. The final product of this activity is a trained network that provides no equations or coefficients defining a relationship (as in regression) beyond its own internal mathematics. The network "IS" the final equation of the relationship.
Backpropagational networks also tend to be slower to train than other types of networks and sometimes require thousands of epochs. If run on a truly parallel computer system this issue is not really a problem, but if the BPNN is being simulated on a standard serial machine (i.e. a single SPARC, Mac or PC) training can take some time. This is because the machine's CPU must compute the function of each node and connection separately, which can be problematic in very large networks with a large amount of data. However, the speed of most current machines is such that this is typically not much of an issue.