
Page 1: By  Dr. Mukhtiar Ali Unar

1

Mehran University of Engineering and Technology, Jamshoro

Institute of Information Technology
Third Term ME CSN & IT

Neural Networks

By Dr. Mukhtiar Ali Unar

Page 2: By  Dr. Mukhtiar Ali Unar

2

Statistical Nature of the Learning Process:

A neural network is merely one form in which empirical knowledge about the physical phenomenon or environment of interest may be encoded through training. By empirical knowledge we mean a set of measurements that characterizes the phenomenon. To be more specific, consider the example of a stochastic phenomenon described by a random vector X consisting of a set of independent variables, and a random scalar D representing a dependent variable. Suppose that we have N realizations of the random vector X denoted by {x_i}_{i=1}^N, and a corresponding set of realizations of the random scalar D denoted by {d_i}_{i=1}^N.

Page 3: By  Dr. Mukhtiar Ali Unar

3

These realizations (measurements) constitute the training sample denoted by

T = {(x_i, d_i)}_{i=1}^N        (1)

Ordinarily we do not have knowledge of the exact functional relationship between X and D, so we proceed by proposing the model

D = f(X) + ε        (2)

where f(·) is a deterministic function of its argument vector, and ε is a random expectational error that represents our "ignorance" about the dependence of D on X. The statistical model described by equation (2) is called a regressive model, and is depicted in the following figure:

[Fig. 1: The regressive model. The input x is passed through the deterministic function f(·); the error ε is added to produce the output d.]

Page 4: By  Dr. Mukhtiar Ali Unar

4

The expectational error ε, in general, has zero mean and a positive probability of occurrence. On this basis, the regression model of Fig. 1 has two useful properties:

1. The mean value of the expectational error ε, given any realization of X, is zero; that is,

E[ε|X] = 0        (3)

where E is the statistical expectation operator. As a corollary to this property, we may state that the regression function f(x) is the conditional mean of the model output D, given that the input X = x, as shown by

f(x) = E[D|x]        (4)

2. The expectational error ε is uncorrelated with the regression function f(X); that is,

E[εf(X)] = 0

This property is the well-known principle of orthogonality, which states that all the information about D available to us through the input X has been encoded into the regression function f(X).
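As an illustration (not part of the original slides), the following NumPy sketch checks these two properties numerically for an assumed toy model D = f(X) + ε with f(x) = sin(x) and a zero-mean Gaussian error; the model and all of its parameters are illustrative assumptions.

# Hypothetical illustration: Monte Carlo check of property (3) and the
# principle of orthogonality for an assumed toy model D = f(X) + eps,
# with f(x) = sin(x) and eps ~ N(0, 0.1) independent of X.
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

x = rng.uniform(-np.pi, np.pi, size=N)   # realizations of the random variable X
eps = rng.normal(0.0, 0.1, size=N)       # expectational error: zero mean, independent of X
f = np.sin(x)                            # assumed regression function f(X)
d = f + eps                              # realizations of the dependent variable D

# Property 1: E[eps | X] = 0, hence E[D | X = x] = f(x).  Check on a narrow slice of X.
mask = np.abs(x - 1.0) < 0.05
print("E[D | X near 1.0] ~", d[mask].mean(), " vs  f(1.0) =", np.sin(1.0))

# Property 2 (principle of orthogonality): E[eps * f(X)] = 0.
print("E[eps * f(X)] ~", np.mean(eps * f))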

Page 5: By  Dr. Mukhtiar Ali Unar

5

A neural network provides an approximation to the regressive model of Fig. 1. Let the actual response of the neural network, produced in response to the input vector x, be denoted by

y = F(x, w)        (5)

where F(·, w) is the input-output function realized by the neural network. Given the training data T of equation (1), the weight vector w is obtained by minimizing the cost function

ε(w) = (1/2) Σ_{i=1}^{N} [d_i − F(x_i, w)]²        (6)

In statistical terms, the error function ε(w) may be expressed as

ε(w) = B(w) + V(w)        (7)

The term B(w) is the bias, which represents the inability of the neural network to accurately approximate the regression function. The term V(w) is the variance, which represents the inadequacy of the information contained in the training sample about the regression function.
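As a toy illustration of minimizing the cost function (6) (not part of the original slides), the sketch below uses a linear-in-parameters approximator F(x, w) = w^T φ(x) with an assumed polynomial feature map φ, for which the minimization reduces to ordinary least squares.

# Hypothetical sketch: minimizing the cost function (6) for a linear-in-parameters
# approximator F(x, w) = w^T phi(x), where phi is an assumed feature map.
import numpy as np

rng = np.random.default_rng(1)
N = 500

x = rng.uniform(-np.pi, np.pi, size=N)
d = np.sin(x) + rng.normal(0.0, 0.1, size=N)         # training sample T = {(x_i, d_i)}

def phi(x):
    """Assumed polynomial feature map [1, x, x^2, x^3]."""
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=1)

def cost(w, x, d):
    """Cost function (6): 0.5 * sum_i (d_i - F(x_i, w))^2."""
    return 0.5 * np.sum((d - phi(x) @ w) ** 2)

# The minimizing weight vector, obtained in closed form via least squares.
w_opt, *_ = np.linalg.lstsq(phi(x), d, rcond=None)
print("w minimizing (6):", w_opt)
print("cost at w_opt   :", cost(w_opt, x, d))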

Page 6: By  Dr. Mukhtiar Ali Unar

6

Committee Machines:

Committee machines address the supervised learning task. The approach is based on a commonly used engineering principle: divide and conquer. According to the principle of divide and conquer, a complex computational task is solved by dividing it into a number of computationally simple tasks and then combining the solutions to those tasks. In supervised learning, computational simplicity is achieved by distributing the learning task among a number of experts, which in turn divide the input space into a set of subspaces. The combination of experts is said to constitute a committee machine.

Page 7: By  Dr. Mukhtiar Ali Unar

7

Committee machines are universal approximators.

Classification of Committee Machines:

1. Static Structures: In this class of committee machines, the responses of several predictors (experts) are combined by means of a mechanism that does not involve the input signal, hence the designation "static". This category includes the following methods:

Ensemble averaging, where the outputs of different predictors are linearly combined to produce an overall output.

Boosting, where a weak learning algorithm is converted into one that achieves arbitrarily high accuracy.

Page 8: By  Dr. Mukhtiar Ali Unar

8

2. Dynamic Structures: In this second class of committee machines, the input signal is directly involved in actuating the mechanism that integrates the outputs of the individual experts into an overall output, hence the designation "dynamic". There are two kinds of dynamic structures:

Mixture of experts, in which the individual responses of the experts are non-linearly combined by means of a single gating network.

Hierarchical mixture of experts, in which the individual responses of the individual experts are non-linearly combined by means of several gating networks arranged in a hierarchical fashion.

Page 9: By  Dr. Mukhtiar Ali Unar

9

The mixture of experts and hierarchical mixture of experts may also be viewed as examples of Modular Networks. A formal definition of the notion of modularity follows:

A neural network is said to be modular if the computation performed by the network can be decomposed into two or more modules (subsystems) that operate on distinct inputs without communicating with each other. The outputs of the modules are mediated by an integrating unit that is not permitted to feed information back to the modules. In particular, the integrating unit (1) decides how the outputs of the modules should be combined to form the final output of the system, and (2) decides which modules should learn which training patterns.

Page 10: By  Dr. Mukhtiar Ali Unar

10

Ensemble Averaging:

Fig. 2 shows a number of differently trained neural networks (i.e. experts), which share a common input and whose individual outputs are somehow combined to produce an overall output y. To simplify the presentation, the outputs of the experts are assumed to be scalar valued. Such a configuration is known as the ensemble averaging method.

[Fig. 2: Ensemble averaging. Experts 1, 2, …, K share a common input x[n]; a combiner merges their individual outputs to produce the overall output y.]
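As an illustration only (the slides do not prescribe a particular combiner), the sketch below implements the combiner of Fig. 2 as a simple, optionally weighted, average of the experts' scalar outputs; the ConstantExpert class is merely a stand-in for a trained network.

# Illustrative sketch of ensemble averaging of K differently trained experts.
# Each "expert" is assumed to be a trained model exposing a predict(x) method.
import numpy as np

def ensemble_average(experts, x, weights=None):
    """Combine the scalar outputs of the experts for input x by (weighted) averaging."""
    outputs = np.array([expert.predict(x) for expert in experts])
    if weights is None:
        return outputs.mean()                        # plain ensemble average
    weights = np.asarray(weights, dtype=float)
    return np.dot(weights, outputs) / weights.sum()  # weighted linear combination

# Usage with stand-in experts (e.g. networks trained from different initial conditions):
class ConstantExpert:                                # placeholder for a trained network
    def __init__(self, c):
        self.c = c
    def predict(self, x):
        return self.c

experts = [ConstantExpert(0.9), ConstantExpert(1.1), ConstantExpert(1.0)]
print(ensemble_average(experts, x=None))             # -> 1.0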

Page 11: By  Dr. Mukhtiar Ali Unar

11

The motivation for using ensemble averaging:

If the combination of experts in Fig. 2 were replaced by a single neural network, we would have a network with a correspondingly large number of adjustable parameters. The training time for such a large network is likely to be longer than for the case of a set of experts trained in parallel.

The risk of overfitting the data increases when the number of adjustable parameters is large compared to the cardinality (i.e. the size) of the training data set.

In using a committee machine the expectation is that the differently trained networks converge to different local minima on the error surface, and the overall performance is improved by combining the outputs in some way.

Page 12: By  Dr. Mukhtiar Ali Unar

12

Boosting:

Boosting is another method that belongs to the "static" class of committee machines. Boosting is quite different from ensemble averaging.

In a Committee machine based on ensemble averaging, all the experts in the machine are trained on the same data set; they may differ from each other in the choice of initial conditions used in network training. By contrast, in a boosting machine, the experts are trained on data sets with entirely different distributions; it is a general method that can be used to improve the performance of any learning algorithm.

Page 13: By  Dr. Mukhtiar Ali Unar

13

Boosting can be implemented in three fundamentally different ways:

Boosting by filtering: This approach involves filtering the training examples by different versions of a weak learning algorithm. It assumes the availability of a large (in theory, infinite) source of examples, with the examples being either discarded or kept during training. An advantage of this approach is that it allows for a small memory requirement compared to the other two approaches.

Page 14: By  Dr. Mukhtiar Ali Unar

14

Boosting by subsampling: This second approach works with a training sample of fixed size. The examples are “resampled” according to a given probability distribution during training. The error is calculated with respect to the fixed training sample.

Boosting by reweighting: This third approach also works with a fixed training sample, but it assumes that the weak learning algorithm can receive “weighted” examples. The error is calculated with respect to the weighted examples.
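As a hedged sketch (not taken from the slides), the code below shows one way boosting by subsampling could be realized: a decision-stump weak learner is trained on examples resampled according to a probability distribution, and the distribution is updated from the error measured on the fixed training sample. Both the stump and the weight-update rule are illustrative assumptions in the AdaBoost style.

# Hedged sketch of boosting by subsampling on a fixed training sample.
# The weak learner and the reweighting rule are assumptions, not from the slides.
import numpy as np

rng = np.random.default_rng(2)

def fit_stump(x, y):
    """Weak learner: threshold classifier on a 1-D input, labels in {-1, +1}."""
    best = None
    for thr in np.unique(x):
        for sign in (+1, -1):
            pred = sign * np.where(x >= thr, 1, -1)
            err = np.mean(pred != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    _, thr, sign = best
    return lambda z: sign * np.where(z >= thr, 1, -1)

# Fixed training sample.
x = rng.uniform(-1, 1, size=200)
y = np.where(np.sin(3 * x) > 0, 1, -1)

p = np.full(len(x), 1.0 / len(x))                # resampling distribution over the fixed sample
experts = []
for _ in range(3):
    idx = rng.choice(len(x), size=len(x), p=p)   # resample according to p
    h = fit_stump(x[idx], y[idx])
    experts.append(h)
    miss = h(x) != y                             # error measured on the fixed sample
    p = np.where(miss, 2.0 * p, p)               # emphasize misclassified examples (assumed rule)
    p /= p.sum()

votes = np.sign(sum(h(x) for h in experts))      # simple majority vote of the three experts
print("training error of the committee:", np.mean(votes != y))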

Page 15: By  Dr. Mukhtiar Ali Unar

15

Boosting by Filtering:

In boosting by filtering, the committee machine consists of three experts. The algorithm used to train them is called a boosting algorithm. The three experts are arbitrarily labeled "first", "second" and "third". The three experts are individually trained as follows:

The first expert is trained on a set consisting of N1 examples.

The trained first expert is used to filter another set of examples by proceeding in the following manner:

Flip a fair coin; this in effect simulates a random guess.

If the result is heads, pass new patterns through the first expert, and discard correctly classified patterns until a pattern is misclassified. This misclassified pattern is added to the training set for the second expert.

Page 16: By  Dr. Mukhtiar Ali Unar

16

Boosting by Filtering:

If the result is tails, do the opposite. Specifically, pass new patterns through the first expert and discard incorrectly classified patterns until a pattern is classified correctly. That correctly classified pattern is added to the training set for the second expert.

Continue this process until a total of N1 examples have been filtered by the first expert. This set of filtered examples constitutes the training set for the second expert.

In this way, the second expert is forced to learn a distribution different from that learned by the first expert.

Once the second expert has been trained in the usual way, a third training set is formed for the third expert by proceeding in the following manner:

Page 17: By  Dr. Mukhtiar Ali Unar

17

Boosting by Filtering:

Pass a new pattern through both the first and second experts. If the two experts agree in their decisions, discard the pattern. If, on the other hand, they disagree, the pattern is added to the training set for the third expert.

Continue this process until a total of N1 examples have been filtered jointly by the first and second experts. This set of jointly filtered examples constitutes the training set for the third expert.

The third expert is then trained in the usual way, and the training of the entire committee machine is thereby completed.
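The sketch below follows the filtering procedure described above; first_expert, second_expert and draw_pattern are stand-ins (assumed here, not defined in the slides) for the trained experts and for the large source of examples that the method presupposes.

# Illustrative sketch of the filtering steps of boosting by filtering.
import random

def filter_for_second_expert(first_expert, draw_pattern, n1):
    """Build the training set of N1 examples for the second expert by coin-flip filtering."""
    training_set = []
    while len(training_set) < n1:
        heads = random.random() < 0.5
        wanted_correct = not heads            # heads: keep a misclassified pattern; tails: keep a correct one
        while True:
            x, label = draw_pattern()
            correct = (first_expert(x) == label)
            if correct == wanted_correct:     # pattern passes the filter
                training_set.append((x, label))
                break
            # otherwise the pattern is discarded and a new one is drawn
    return training_set

def filter_for_third_expert(first_expert, second_expert, draw_pattern, n1):
    """Build the training set for the third expert from patterns on which experts 1 and 2 disagree."""
    training_set = []
    while len(training_set) < n1:
        x, label = draw_pattern()
        if first_expert(x) != second_expert(x):   # keep only patterns on which the experts disagree
            training_set.append((x, label))
    return training_set

With N1 examples collected this way, the second expert sees a distribution on which, on average, the first expert performs no better than a random guess, which is how it is forced to learn a distribution different from that learned by the first expert.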

Page 18: By  Dr. Mukhtiar Ali Unar

18

Mixture of Experts (ME) Model:

This configuration consists of K expert networks, or simply experts, and an integrating unit called a gating network that performs the function of a mediator among the expert networks (see the figure below). It is assumed that the different experts work best in different regions of the input space.

[Figure: Mixture of experts. Experts 1, 2, …, K and the gating network all receive the input vector x; the gating network produces the weights g1, g2, …, gK with which the expert outputs are combined into the output signal y.]

Page 19: By  Dr. Mukhtiar Ali Unar

19

The neurons of the experts are usually linear. The figure below shows the block diagram of a single neuron constituting expert k. The output of expert k is the inner product of the input vector x and the synaptic weight vector w_k of this neuron, as shown by

y_k = w_k^T x,    k = 1, 2, …, K        (8)

[Figure: Block diagram of a single linear neuron of expert k. The inputs x1, x2, …, xm are weighted by wk1, wk2, …, wkm and summed to produce the output yk.]

Page 20: By  Dr. Mukhtiar Ali Unar

20

The gating network consists of a single layer of K neurons, with each neuron assigned to a specific expert. Fig. (a) below shows the architectural graph of the gating network and Fig. (b) shows the block diagram of neuron k in that network.

[Figure: (a) Architectural graph of the gating network with inputs x1, x2, …, xm. (b) Block diagram of neuron k: the inputs are weighted by ak1, ak2, …, akm and summed to produce uk, which is passed through a softmax to give gk.]

Page 21: By  Dr. Mukhtiar Ali Unar

21

Unlike the experts, the neurons of the gating network are non-linear, with their activation function defined by

g_k = exp(u_k) / Σ_{j=1}^{K} exp(u_j),    k = 1, 2, …, K        (9)

where u_k is the inner product of the input vector x and the synaptic weight vector a_k; i.e.,

u_k = a_k^T x,    k = 1, 2, …, K

The normalized exponential function of equation (9) may be viewed as a multi-input generalization of the logistic function. It preserves the rank order of its input values, and is a differentiable generalization of the "winner takes all" operation of picking the maximum value. For this reason, the activation function of equation (9) is referred to as softmax.

Page 22: By  Dr. Mukhtiar Ali Unar

22

Let yk denote the output of the kth expert in response to the input vector x. The overall output of the ME model is

y = Σ_{k=1}^{K} g_k y_k        (10)
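Putting equations (8), (9) and (10) together, a minimal sketch of the ME forward pass is given below; the weight matrices W and A are assumed to be already trained (how they are trained is not covered by this sketch).

# Minimal sketch of the ME forward pass of equations (8)-(10).
import numpy as np

def me_output(x, W, A):
    """x: input vector (m,); W: (K, m) expert weights; A: (K, m) gating weights."""
    y_experts = W @ x                      # equation (8): y_k = w_k^T x
    u = A @ x                              # u_k = a_k^T x
    u = u - u.max()                        # numerical stabilization of the softmax
    g = np.exp(u) / np.exp(u).sum()        # equation (9): softmax gating
    return g @ y_experts                   # equation (10): y = sum_k g_k * y_k

# Usage with arbitrary (untrained) weights, K = 3 experts and m = 2 inputs:
rng = np.random.default_rng(3)
W, A = rng.normal(size=(3, 2)), rng.normal(size=(3, 2))
print(me_output(np.array([0.5, -1.0]), W, A))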

Example: Consider an ME model with two experts, and a gating network with two outputs denoted by g1 and g2. The output g1 is defined by

g1 = exp(u1) / (exp(u1) + exp(u2)) = 1 / (1 + exp(−(u1 − u2)))        (11)

Let a1 and a2 denote the two weight vectors of the gating network. We may then write

u_k = x^T a_k,    k = 1, 2        (12)

Page 23: By  Dr. Mukhtiar Ali Unar

23

and therefore rewrite equation (11) as

g1 = 1 / (1 + exp(−x^T(a1 − a2)))        (13)

The other output g2 of the gating network is

g2 = 1 − g1 = 1 / (1 + exp(−x^T(a2 − a1)))        (14)

Along the ridge defined by a1 = a2, we have g1 = g2 = ½, and the two experts contribute equally to the output of the ME model. Away from the ridge, one or the other of the two experts assumes the dominant role.

Page 24: By  Dr. Mukhtiar Ali Unar

24

Hierarchical Mixture of Experts (HME) Model:

The HME model, illustrated on the next slide, is a natural extension of the ME model. The illustration is for an HME model with four experts. It has two layers of gating networks. By continuing with the application of the principle of divide and conquer in a manner similar to that illustrated, we may construct an HME model with any number of levels of hierarchy.

The architecture of the HME model is like a tree in which the gating networks sit at the various nonterminals of the tree and the experts sit at the leaves of the tree.

The HME model differs from the ME model in that the input space is divided into a nested set of subspaces, with the information combined and redistributed among the experts under the control of several gating networks arranged in a hierarchical manner.

Page 25: By  Dr. Mukhtiar Ali Unar

25

[Figure: HME model with four experts, all sharing the input vector x. Experts (1,1) and (2,1) produce outputs y11 and y21, combined by Gating Network 1 with weights g1|1 and g2|1; experts (1,2) and (2,2) produce outputs y12 and y22, combined by Gating Network 2 with weights g1|2 and g2|2. A top-level gating network combines the two group outputs with weights g1 and g2 to produce the overall output y.]
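A minimal sketch of the two-level combination shown in the figure is given below (illustrative, with arbitrary weights): within each group the expert outputs are weighted by the conditional gates g_{j|i}, and the group outputs are then weighted by the top-level gates g1 and g2. Linear experts and softmax gates are assumed, as in the ME model.

# Minimal sketch of the two-level HME combination for the four-expert case.
import numpy as np

def softmax(u):
    u = u - u.max()
    return np.exp(u) / np.exp(u).sum()

def hme_output(x, W, A_low, a_top):
    """
    x: input vector (m,)
    W: (2, 2, m)      expert weights, W[j, i] belonging to expert (j+1, i+1)
    A_low: (2, 2, m)  lower-level gating weights, one pair per group i
    a_top: (2, m)     top-level gating weights
    """
    y_group = np.empty(2)
    for i in range(2):
        y_experts = W[:, i, :] @ x              # outputs y_{1,i}, y_{2,i}
        g_cond = softmax(A_low[:, i, :] @ x)    # conditional gates g_{1|i}, g_{2|i}
        y_group[i] = g_cond @ y_experts         # output of group i
    g_top = softmax(a_top @ x)                  # top-level gates g1, g2
    return g_top @ y_group                      # overall output y

rng = np.random.default_rng(4)
m = 3
print(hme_output(rng.normal(size=m),
                 rng.normal(size=(2, 2, m)),
                 rng.normal(size=(2, 2, m)),
                 rng.normal(size=(2, m))))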

Page 26: By  Dr. Mukhtiar Ali Unar

26

Local Model Networks:

[Figure: A local model network. The input u is fed to n local models f1(·), f2(·), …, fn(·); their outputs y1, y2, …, yn are weighted by the validity functions and summed to give the overall output y.]

Page 27: By  Dr. Mukhtiar Ali Unar

27

A Local Model Network (LMN) is a set of models (experts) weighted by some activation function.

The same input is fed to each model, and the outputs are weighted according to some scheduling variable (or variables) φ, as

y(t) = Σ_{i=1}^{n} ρ_i(φ) y_i(t)        (1)

where y(t) is the model network output, ρ_i(φ) is the validity (i.e. activation) function of the ith model, n is the number of models, and y_i(t) is the output of the ith local model f_i(·).

The weighting or activation of each local model is calculated using an activation function which is a function of the scheduling variable.

The scheduling variable could be a system state variable, an input variable or some other system parameter.

It is also feasible to schedule on more than one variable and to establish a multi-dimensional LMN.

Page 28: By  Dr. Mukhtiar Ali Unar

28

Although any function with a locally limited activation may be applied as an activation function, Gaussian functions are applied most widely.

Usually, normalized validity functions are used. The validity function ρ_i(φ) can be normalized as

ρ̃_i(φ) = ρ_i(φ) / Σ_{j=1}^{n} ρ_j(φ)

• The individual component models f_i can be of any form; they can be linear or nonlinear, have a state-space or input-output description, or be discrete or continuous time. They can be of different character, using physical models of the system for operating conditions where they are available, and parametric models for conditions where there is no physical description available. They can also be ANN models such as MLP and RBF networks.
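As a minimal illustration (not from the slides), the sketch below evaluates an LMN with normalized Gaussian validity functions and two assumed linear local models; the centres, widths and local model coefficients are placeholders rather than identified values.

# Minimal sketch of a local model network with normalized Gaussian validity functions.
import numpy as np

def lmn_output(u, phi, centres, widths, local_models):
    """
    u:               input fed to every local model
    phi:             scheduling variable
    centres, widths: parameters of the Gaussian validity functions
    local_models:    list of callables f_i(u) returning the local model outputs y_i
    """
    rho = np.exp(-0.5 * ((phi - centres) / widths) ** 2)   # Gaussian validity functions
    rho = rho / rho.sum()                                  # normalization, as above
    y_local = np.array([f(u) for f in local_models])       # local model outputs y_i
    return rho @ y_local                                   # y = sum_i rho_i(phi) * y_i

# Usage: two assumed linear local models scheduled on phi.
local_models = [lambda u: 2.0 * u + 1.0, lambda u: 0.5 * u - 1.0]
print(lmn_output(u=1.0, phi=0.3,
                 centres=np.array([0.0, 1.0]),
                 widths=np.array([0.5, 0.5]),
                 local_models=local_models))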

Page 29: By  Dr. Mukhtiar Ali Unar

29

The individual local models are smoothly interpolated by the validity functions ρ_i to produce the overall model.

The learning process in LMNs can be divided into two tasks:

1. Find the optimal number, position and shape of the validity functions, i.e. define the structure of the network.

2. Find the optimal set of parameters for the local models, i.e. define the parameters of the network. These parameters could be the complete set of coefficients of a linear model, numerical parameters of a non-linear model, or even switches which alter the local model structure.

Page 30: By  Dr. Mukhtiar Ali Unar

30

Advantages of LMNs:

The LMN has a transparent structure which allows a direct analysis of local model properties.

The LMN is less sensitive to the curse of dimensionality than many other local representations such as RBF networks.

Non-linear models based on LMNs are able to capture the non-linear effects and provide accuracy over a wide operational range.

The LMN framework allows the integration of a priori knowledge to define the model structure for a particular problem. This leads to more interpretable models which can be more reliably identified from a limited amount of observed data.

Page 31: By  Dr. Mukhtiar Ali Unar

31

Example: Modelling of Ship Dynamics

A ship is usually represented by the following mathematical equation:

m ψ̈ + d1 ψ̇ + d3 ψ̇³ = δ

where ψ is the heading of the ship and δ is the rudder angle (control signal). The parameters m, d1 and d3 depend upon the operating conditions, which include the speed of the vessel, depth of water, loading conditions and environmental disturbances etc. The table on the next slide shows how these parameters change with the forward speed of the ship.

Page 32: By  Dr. Mukhtiar Ali Unar

32

Table: Variation of ship parameters with speed

Speed (m/sec)        m        d1      d3
      2           387.5      5.00    12.5
      4            96.875    2.50     1.56
      6            43.055    1.66     0.46
      8            24.22     1.25     0.19
     10            15.5      1.00     0.1
     12            10.76     0.84     0.05
     14             7.91     0.72     0.03
     16             6.05     0.63     0.02
     18             4.78     0.55     0.01
     20             3.87     0.5      0.01

Page 33: By  Dr. Mukhtiar Ali Unar

33

A Local Model Network can easily be developed to incorporate the parameter variations with respect to speed. For example, four models at speeds of (say) 4 m/sec, 8 m/sec, 12 m/sec and 16 m/sec can be interpolated together as shown on page 26.

Gaussian functions with centres at 4, 8, 12 and 16 m/sec can be used as validity functions, and speed may be regarded as the scheduling variable.
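A hedged sketch of this construction follows: the (m, d1, d3) values from the table at 4, 8, 12 and 16 m/sec define four local models whose outputs (heading accelerations) are blended by normalized Gaussian validity functions of speed, as in equation (1). The Gaussian width of 2 m/sec and the operating point used in the example are assumptions; the slides do not specify them.

# Hedged sketch of the speed-scheduled LMN for the ship model
# m*psi_dd + d1*psi_d + d3*psi_d^3 = delta.
import numpy as np

centres = np.array([4.0, 8.0, 12.0, 16.0])     # m/sec, centres of the validity functions
width = 2.0                                     # assumed width of the Gaussians
params = np.array([                             # (m, d1, d3) at the centre speeds, from the table
    [96.875, 2.50, 1.56],
    [24.22,  1.25, 0.19],
    [10.76,  0.84, 0.05],
    [6.05,   0.63, 0.02],
])

def validity(speed):
    """Normalized Gaussian validity functions evaluated at the scheduling variable (speed)."""
    rho = np.exp(-0.5 * ((speed - centres) / width) ** 2)
    return rho / rho.sum()

def heading_acceleration(speed, psi_dot, delta):
    """Blend the local model outputs psi_dd_i = (delta - d1_i*psi_dot - d3_i*psi_dot^3) / m_i."""
    acc_local = (delta - params[:, 1] * psi_dot - params[:, 2] * psi_dot**3) / params[:, 0]
    return validity(speed) @ acc_local

# Example: heading acceleration for a 10-degree rudder angle at a few forward speeds.
delta = np.deg2rad(10.0)
for u in (4.1, 7.0, 8.2, 10.0):
    print(f"speed {u:4.1f} m/sec -> psi_ddot = {heading_acceleration(u, 0.01, delta):.5f} rad/s^2")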

Some results are shown next.

Page 34: By  Dr. Mukhtiar Ali Unar

34

[Figure: Results at a forward speed of 10 m/sec and at 7 m/sec.]

Page 35: By  Dr. Mukhtiar Ali Unar

35

[Figure: Results at a forward speed of 4.1 m/sec and at 8.2 m/sec.]