
Page 1: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Discovering Knowledge in Data

Daniel T. Larose, Ph.D.

Chapter 7: Neural Networks

Prepared by James Steck, Graduate Assistant

Page 2: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Neural Networks

• Neural Networks
– Complex learning systems recognized in animal brains
– Single neuron has simple structure
– Interconnected sets of neurons perform complex learning tasks
– Human brain has 10^15 synaptic connections
– Artificial Neural Networks attempt to replicate non-linear learning found in nature

[Figure: biological neuron, with dendrites, cell body, and axon labeled]

Page 3: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Neural Networks (cont’d)

– Dendrites gather inputs from other neurons and combine information
– Then generate non-linear response when threshold reached
– Signal sent to other neurons via axon

– Artificial neuron model is similar
– Data inputs (x_i) are collected from upstream neurons and input to a combination function (sigma)

[Figure: artificial neuron, with inputs x1, x2, ..., xn feeding a combination function and producing output y]

Page 4: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Neural Networks (cont’d)

– Activation function reads combined input and produces non-linear response (y)
– Response channeled downstream to other neurons

• What problems are applicable to Neural Networks?
– Quite robust with respect to noisy data
– Can learn and work around erroneous data
– Results opaque to human interpretation
– Often require long training times

Page 5: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Activation functions

An activation function is a nondecreasing function, which can be linear or nonlinear.

Ex 1: Binary threshold function [2]

f(net) = 1 if net ≥ θ
       = 0 if net < θ

Page 6: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Ex 2: Sigmoid function

f(net) = 2 / (1 + e^(-net)) - 1

Page 7: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Ex 3: Identity function

f(net) = net for all net

Ex 4: Hard limit function (sign function or bipolar function)

f(net) = +1 if net ≥ 0
       = -1 if net < 0
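The four example activation functions above are simple to express in code. Below is a minimal Python sketch (not from the text); the function names and the default threshold θ = 0 are illustrative assumptions.

import math

def binary_threshold(net, theta=0.0):
    """Ex 1: returns 1 if net >= theta, else 0."""
    return 1 if net >= theta else 0

def bipolar_sigmoid(net):
    """Ex 2: smooth squashing of net into the range (-1, 1)."""
    return 2.0 / (1.0 + math.exp(-net)) - 1.0

def identity(net):
    """Ex 3: passes the combined input through unchanged."""
    return net

def hard_limit(net):
    """Ex 4: sign function, returns +1 or -1."""
    return 1 if net >= 0 else -1

if __name__ == "__main__":
    for net in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(net, binary_threshold(net), round(bipolar_sigmoid(net), 4),
              identity(net), hard_limit(net))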

Page 8: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Figure 11.1 Application of a neural network to aircraft health monitoring [2]: input units take data from strain gauges, a layer of hidden units follows, and a single output unit indicates the health of the aircraft.

Page 9: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Input and Output Encoding

– Neural Networks require attribute values encoded to [0, 1]

• Numeric
– Apply min-max normalization to continuous variables
– Works well when Min and Max known
– Also assumes new data values occur within Min-Max range
– Values outside range may be rejected or mapped to Min or Max

X* = (X - min(X)) / range(X) = (X - min(X)) / (max(X) - min(X))
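A minimal Python sketch of min-max normalization (the function name and the clamping of out-of-range values to Min or Max are assumptions, mirroring one of the options noted above):

def min_max_normalize(x, x_min, x_max):
    """Scale x into [0, 1] using the training Min and Max.

    New values outside the training range are clamped to the nearest
    bound, one of the options mentioned above.
    """
    x = max(x_min, min(x, x_max))          # map out-of-range values to Min or Max
    return (x - x_min) / (x_max - x_min)

# Example: ages observed between 18 and 90 in training data
print(min_max_normalize(35, 18, 90))   # about 0.236
print(min_max_normalize(95, 18, 90))   # clamped to 1.0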

Page 10: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Input and Output Encoding (cont’d)

• Categorical
– Indicator variables used when number of category values small
– Categorical variable with k classes translated to k – 1 indicator variables
– For example, Gender attribute values are “Male”, “Female”, and “Unknown”
– Classes k = 3
– Create k – 1 = 2 indicator variables named Male_I and Female_I
– Male records have values Male_I = 1, Female_I = 0
– Female records have values Male_I = 0, Female_I = 1
– Unknown records have values Male_I = 0, Female_I = 0
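A short Python sketch of the k – 1 indicator-variable encoding for the Gender example (the function and dictionary names are illustrative, not from the text):

def encode_gender(value):
    """Translate the 3-class Gender attribute into k - 1 = 2 indicators."""
    male_i = 1 if value == "Male" else 0
    female_i = 1 if value == "Female" else 0
    # "Unknown" is the reference class: both indicators stay 0
    return {"Male_I": male_i, "Female_I": female_i}

for g in ("Male", "Female", "Unknown"):
    print(g, encode_gender(g))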

Page 11: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Input and Output Encoding (cont’d)

– Apply caution when recoding unordered categorical values to the [0, 1] range
– For example, attribute Marital_Status has values “Divorced”, “Married”, “Separated”, “Single”, “Widowed”, and “Unknown”
– Values coded as 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0, respectively
– Coding implies “Divorced” is closer to “Married”, and farther from “Separated”
– Neural Network only aware of numeric values
– Naive to pre-encoded meaning of categorical values
– Results of network model may be meaningless

Page 12: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Input and Output Encoding (cont’d)

• Output
– Neural Networks always return continuous values in [0, 1]
– Many classification problems have two outcomes
– Solution uses a threshold established a priori in single output node to separate classes
– For example, target variable is “leave” or “stay”
– Threshold value is “leave if output >= 0.67”
– Single output node value = 0.72 classifies record as “leave”

Page 13: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Input and Output Encoding (cont’d)

– Single output nodes applicable when target classes ordered
– For example, classify elementary-level reading ability

– Define thresholds                      Classify
– “if 0.00 <= output < 0.25”             “first-grade”
– “if 0.25 <= output < 0.50”             “second-grade”
– “if 0.50 <= output < 0.75”             “third-grade”
– “if output >= 0.75”                    “fourth-grade”

– Fine-tuning of thresholds may be required
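A minimal Python sketch of the a priori thresholds above (the function name is an assumption):

def classify_reading_level(output):
    """Map a single output-node value in [0, 1] to an ordered class."""
    if output < 0.25:
        return "first-grade"
    elif output < 0.50:
        return "second-grade"
    elif output < 0.75:
        return "third-grade"
    else:
        return "fourth-grade"

print(classify_reading_level(0.72))   # "third-grade"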

Page 14: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Neural Networks for Estimation and Prediction

– Continuous output useful for estimation and prediction problems
– For example, predict a stock price 3 months from now
– Input values normalized using min-max normalization
– Output values in [0, 1] denormalized to represent scale of stock prices

Prediction = output(data range) + minimum

– For example, suppose stock price ranged from $20 to $30
– Network output prediction value = 0.69

0.69(30.0 – 20.0) + 20.0 = $26.90
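The denormalization step is a one-line inversion of min-max normalization; a quick Python sketch reproducing the arithmetic above (names are illustrative):

def denormalize(output, data_min, data_max):
    """Invert min-max scaling: map a network output in [0, 1] back to the original scale."""
    return output * (data_max - data_min) + data_min

print(denormalize(0.69, 20.0, 30.0))   # 26.9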

Page 15: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network

– Neural Network consists of layered, feedforward, completely connected network of nodes
– Feedforward restricts network flow to single direction
– Flow does not loop or cycle
– Network composed of two or more layers

[Figure: a fully connected feedforward network with an input layer (Node 1, Node 2, Node 3), a hidden layer (Node A, Node B), and an output layer (Node Z); connection weights W1A, W2A, W3A, W1B, W2B, W3B, WAZ, WBZ and bias weights W0A, W0B, W0Z]

Page 16: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Ex 1: ANN for the AND logic operation

X1 X2 O

1 0 0

0 1 0

0 0 0

1 1 1

[Figure: single neuron with inputs x1 and x2, both connection weights equal to 1, combination function ∑ xi wi, and threshold θ = 2. AND is a linearly separable problem.]

Page 17: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Ex 2: ANN for the OR logic operation

X1 X2 O

1 0 1

0 1 1

0 0 0

1 1 1

[Figure: single neuron with inputs x1 and x2, both connection weights equal to 1, combination function ∑ xi wi, and threshold θ = 1. OR is a linearly separable problem.]
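A minimal Python sketch of the two threshold units above (a perceptron with fixed weights of 1, threshold θ = 2 for AND and θ = 1 for OR; names are illustrative):

def threshold_unit(x1, x2, w1=1, w2=1, theta=2):
    """Fire (output 1) when the weighted sum of inputs reaches the threshold."""
    net = x1 * w1 + x2 * w2
    return 1 if net >= theta else 0

# AND uses theta = 2, OR uses theta = 1
for x1, x2 in [(1, 0), (0, 1), (0, 0), (1, 1)]:
    print(x1, x2, "AND:", threshold_unit(x1, x2, theta=2),
          "OR:", threshold_unit(x1, x2, theta=1))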

Page 18: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Linearly separable vs. nonlinearly separable problems

[Figure: two scatter plots in the (X1, X2) plane; the “linearly separable” classes can be split by a straight line, while the “nonlinearly separable” classes cannot.]

Page 19: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Neural network architectures

Feedforward – single layer

[Figure: inputs x1 and x2 connected directly to output node y1]

Page 20: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Feedforward – multilayer

[Figure: inputs x1 and x2; hidden nodes y1 and y2; output node z1]

Generally, three layers are sufficient to approximate almost any mathematical function.

Page 21: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Linearly separable vs. nonlinearly separable problems

[Figure: two scatter plots in the (X1, X2) plane contrasting a linearly separable problem with a nonlinearly separable one]

Page 22: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Ex: Applying a BPNN (back-propagation neural network) to a character recognition problem

Each character is represented as a 9 × 7 grid of binary pixels, for example:

0 0 1 1 0 0 0
0 0 0 1 0 0 0
0 0 0 1 0 0 0
0 0 1 0 1 0 0
0 0 1 0 1 0 0
0 1 1 1 1 1 0
0 1 0 0 0 1 0
0 1 0 0 0 1 0
1 1 1 0 1 1 1

The rows are flattened into a single input vector X = 0011000 .... (63 values)

Page 23: Discovering Knowledge in Data Daniel T. Larose, Ph.D

[Figure: BPNN with 63 input nodes X1, X2, ..., X63 (one per pixel) and output nodes for the characters A, B, C, D, E, ..., J, K. For the character shown, the target output is 1 at node A and 0 at the other output nodes.]

Page 24: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– Most networks have Input, Hidden, Output layers
– Network may contain more than one hidden layer
– Network is completely connected
– Each node in given layer connected to every node in next layer

– Every connection has weight (Wij) associated with it

– Weight values randomly assigned 0 to 1 by algorithm
– Number of input nodes dependent on number of predictors
– Number of hidden and output nodes configurable

Page 25: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– How many nodes in hidden layer?
– Large number of nodes increases complexity of model
– Detailed patterns uncovered in data
– Leads to overfitting, at expense of generalizability
– Reduce number of hidden nodes when overfitting occurs
– Increase number of hidden nodes when training accuracy unacceptably low

– Input layer accepts values from input variables
– Values passed to hidden layer nodes
– Input layer nodes lack detailed structure compared to hidden and output layer nodes

Page 26: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Backpropagation algorithm

• Step 0: Initialize weights

• Step 1: While stopping condition is FALSE (e.g., SSE > 0.01)

• Step 2: For each training record

Feedforward

• Step 3: Each input node and the bias broadcast signals to the hidden layer

• Step 4: Compute net_input at each hidden node and calculate the output of each hidden node

• Step 5: Compute net_input at each output node and calculate the output of each output node

Backpropagation of error

• Step 6: Compute error at each output node and the weight corrections

• Step 7: Compute error at each hidden node and the weight corrections

• Step 8: Update weights and biases in both layers

• Step 9: Test stopping condition
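A compact Python sketch of this training loop for a small network with one hidden layer (sigmoid activations, stochastic weight updates after each record). It is an illustration of the steps above under those assumptions, not code from the text.

import math
import random

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def train(records, n_hidden=2, eta=0.1, sse_limit=0.01, max_epochs=1000):
    """records: list of (inputs, target) pairs, with inputs scaled to [0, 1]."""
    n_in = len(records[0][0])
    # Step 0: initialize weights randomly in [0, 1]; index 0 is the bias weight
    w_hidden = [[random.random() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_out = [random.random() for _ in range(n_hidden + 1)]

    for epoch in range(max_epochs):                      # Step 1: stopping condition
        sse = 0.0
        for x, target in records:                        # Step 2: each training record
            xb = [1.0] + list(x)                         # Step 3: bias + inputs broadcast
            h = [sigmoid(sum(w * xi for w, xi in zip(wh, xb)))   # Step 4: hidden outputs
                 for wh in w_hidden]
            hb = [1.0] + h
            out = sigmoid(sum(w * hi for w, hi in zip(w_out, hb)))   # Step 5: output
            err = target - out
            sse += err ** 2

            delta_out = out * (1 - out) * err            # Step 6: output error responsibility
            delta_h = [h[j] * (1 - h[j]) * w_out[j + 1] * delta_out  # Step 7: hidden errors
                       for j in range(n_hidden)]
            # Step 8: update weights and biases (stochastic, per record)
            for j in range(n_hidden + 1):
                w_out[j] += eta * delta_out * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_hidden[j][i] += eta * delta_h[j] * xb[i]
        if sse <= sse_limit:                             # Step 9: test stopping condition
            break
    return w_hidden, w_out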

Page 27: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– Combination function produces a linear combination of node inputs and connection weights as a single scalar value

– For node j, x_ij is the ith input
– W_ij is the weight associated with the ith input to node j
– I + 1 inputs to node j
– x_1, x_2, ..., x_I are inputs from upstream nodes
– x_0 is a constant input with value = 1.0
– Each node j therefore has an extra input W_0j x_0j = W_0j

net_j = Σ_i W_ij x_ij = W_0j x_0j + W_1j x_1j + ... + W_Ij x_Ij

Page 28: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– The scalar value computed for hidden layer Node A, using the inputs and weights below, equals netA = 1.32

– For Node A, netA = 1.32 is input to the activation function
– Neurons “fire” in biological organisms
– Signals sent between neurons when combination of inputs crosses threshold

x0 = 1.0 W0A = 0.5 W0B = 0.7 W0Z = 0.5

x1 = 0.4 W1A = 0.6 W1B = 0.9 WAZ = 0.9

x2 = 0.2 W2A = 0.8 W2B = 0.8 WBZ = 0.9

x3 = 0.7 W3A = 0.6 W3B = 0.4

netA = Σi WiA xiA = W0A(1.0) + W1A(x1) + W2A(x2) + W3A(x3)
     = 0.5 + 0.6(0.4) + 0.8(0.2) + 0.6(0.7) = 1.32

Page 29: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– Firing response not necessarily linearly related to increase in input stimulation

– Neural Networks model this behavior using a non-linear activation function

– Sigmoid function most commonly used

– In Node A, sigmoid function takes netA = 1.32 as input and produces output

y = 1 / (1 + e^(-x))

y = 1 / (1 + e^(-1.32)) = 0.7892

Page 30: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– Node A outputs 0.7892 along its connection to Node Z, where it becomes a component of netZ

– Before netZ is computed, contribution from Node B required

– Node Z combines outputs from Node A and Node B, through netZ

netB = Σi WiB xiB = W0B(1.0) + W1B(x1) + W2B(x2) + W3B(x3)
     = 0.7 + 0.9(0.4) + 0.8(0.2) + 0.4(0.7) = 1.5

and f(netB) = 1 / (1 + e^(-1.5)) = 0.8176

Page 31: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Simple Example of a Neural Network (cont’d)

– Inputs to Node Z are not data attribute values
– Rather, they are the outputs of the sigmoid functions in the upstream nodes

– Value 0.8750 output from Neural Network on first pass
– Represents predicted value for target variable, given first observation

netZ = Σi WiZ xiZ = W0Z(1.0) + WAZ(outputA) + WBZ(outputB)
     = 0.5 + 0.9(0.7892) + 0.9(0.8176) = 1.9461

and finally, f(netZ) = 1 / (1 + e^(-1.9461)) = 0.8750
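The full forward pass for this example can be reproduced in a few lines of Python; a sketch under the weight values given above (variable names are illustrative):

import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

# Inputs and weights from the example
x = [1.0, 0.4, 0.2, 0.7]                      # x0 (constant), x1, x2, x3
w_A = [0.5, 0.6, 0.8, 0.6]                    # W0A, W1A, W2A, W3A
w_B = [0.7, 0.9, 0.8, 0.4]                    # W0B, W1B, W2B, W3B
w_Z = [0.5, 0.9, 0.9]                         # W0Z, WAZ, WBZ

net_A = sum(w * xi for w, xi in zip(w_A, x))  # 1.32
net_B = sum(w * xi for w, xi in zip(w_B, x))  # 1.50
out_A, out_B = sigmoid(net_A), sigmoid(net_B) # 0.7892, 0.8176
net_Z = w_Z[0] * 1.0 + w_Z[1] * out_A + w_Z[2] * out_B   # 1.9461
out_Z = sigmoid(net_Z)                        # 0.8750

print(round(net_A, 2), round(out_A, 4), round(out_B, 4), round(out_Z, 4))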

Page 32: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Sigmoid Activation Function

– Sigmoid function combines nearly linear, curvilinear, and nearly constant behavior, depending on input value

– Function nearly linear for domain values -1 < x < 1
– Becomes curvilinear as values move away from center
– At extreme values, f(x) is nearly constant
– Moderate increments in x produce variable increases in f(x), depending on location of x
– Sometimes called a “squashing function”
– Takes real-valued input and returns values in [0, 1]

Page 33: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Back-Propagation

– Neural Networks are a supervised learning method
– Require a target variable
– Each observation passed through the network results in an output value
– Output value compared to actual value of target variable
– (Actual – Output) = Error
– Prediction error analogous to residuals in regression models
– Most networks use Sum of Squared Errors (SSE) to measure how well predictions fit target values

SSE = Σrecords Σoutput nodes (actual – output)²

Page 34: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Back-Propagation (cont’d)

– Squared prediction errors summed over all output nodes, and all records in the data set

– Model weights constructed that minimize SSE
– Actual values of the weights that minimize SSE are unknown
– Weights estimated, given the data set
– Unlike least-squares regression, no closed-form solution exists for minimizing SSE

Page 35: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Gradient Descent Method

– Gradient Descent Method determines the set of weights that minimizes SSE
– Given a set of m weights w = w1, w2, ..., wm in the network model
– Find values for the weights that, together, minimize SSE
– Gradient Descent determines the direction to adjust the weights that decreases SSE
– Gradient of SSE, with respect to the vector of weights w, is the vector derivative:

∇SSE(w) = [ ∂SSE/∂w0, ∂SSE/∂w1, ..., ∂SSE/∂wm ]

Page 36: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Gradient Descent Method (cont’d)

– Gradient Descent is illustrated using a single weight w1

– Figure plots SSE error against a range of values for w1

– Preferred values for w1 minimize SSE
– Optimal value for w1 is w*1

[Figure: SSE plotted against w1, with candidate values w1L and w1R on either side of the optimal value w*1]

Page 37: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Gradient Descent Method (cont’d)

– Develop rule defining movement from current w1 to optimal value w*1

– If current weight is near w1L, increasing w approaches w*1

– If current weight is near w1R, decreasing w approaches w*1

– Gradient of SSE, with respect to weight wCURRENT, is the slope of the SSE curve at wCURRENT

– For wCURRENT close to w1L, slope is negative
– For wCURRENT close to w1R, slope is positive

wNEW = wCURRENT + ΔwCURRENT

where ΔwCURRENT is the change in the current location of w

Page 38: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Gradient Descent Method (cont’d)

– Direction for adjusting wCURRENT is the negative of the sign of the derivative of SSE at wCURRENT

– For the size of the adjustment, use the magnitude of the derivative of SSE at wCURRENT
– When curve steep, adjustment large
– When curve nearly flat, adjustment small

– Learning Rate η (Greek “eta”) takes values in [0, 1]

direction = -sign( ∂SSE/∂wCURRENT )

ΔwCURRENT = -η ( ∂SSE/∂wCURRENT )
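A tiny Python sketch of this single-weight gradient descent update (the quadratic SSE surface and the starting point are made-up illustrations, not from the text):

def sse(w):
    """Illustrative error surface with its minimum at w* = 2.0."""
    return (w - 2.0) ** 2

def d_sse(w):
    """Derivative of the illustrative SSE with respect to w."""
    return 2.0 * (w - 2.0)

eta = 0.1          # learning rate
w = 5.0            # arbitrary starting weight
for _ in range(50):
    w = w + (-eta * d_sse(w))   # w_NEW = w_CURRENT + delta_w_CURRENT
print(round(w, 4))              # approaches w* = 2.0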

Page 39: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Back-Propagation Rules

– Back-propagation percolates the prediction error for a record back through the network

– Partitioned responsibility for the prediction error is assigned to the various connections

– Weights of connections adjusted to decrease the error, using the Gradient Descent Method

– Back-propagation rules defined (Mitchell)

wij,NEW = wij,CURRENT + Δwij, where Δwij = η δj xij

where:
η = learning rate
xij = the ith input to node j
δj = the error responsibility belonging to node j

Page 40: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Back-Propagation Rules (cont’d)

– Error responsibility computed using the partial derivative of the sigmoid function with respect to netj

– Values take one of two forms

– Rules show why input values require normalization
– Large input values xij would dominate weight adjustment
– Error propagation would be overwhelmed, and learning stifled

δj = outputj (1 – outputj)(actualj – outputj)                  for output layer nodes

δj = outputj (1 – outputj) ΣDOWNSTREAM Wjk δk                  for hidden layer nodes

where ΣDOWNSTREAM Wjk δk refers to the weighted sum of the error responsibilities of the downstream nodes

Page 41: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Example of Back-Propagation

– Recall that the first pass through the network yielded output = 0.8750
– Assume actual target value = 0.8, and learning rate η = 0.1
– Prediction error = 0.8 - 0.8750 = -0.075
– Neural Networks use stochastic back-propagation
– Weights updated after each record processed by network
– Adjusting the weights using back-propagation shown next

– Error responsibility for Node Z, an output node, found first

δZ = outputZ (1 – outputZ)(actualZ – outputZ) = 0.875(1 – 0.875)(0.8 – 0.875) = -0.0082

Page 42: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Example of Back-Propagation (cont’d)

– Now adjust the “constant” weight w0Z using the rules

– Move upstream to Node A, a hidden layer node
– Only node downstream from Node A is Node Z

Δw0Z = η δZ (1) = 0.1(-0.0082)(1) = -0.00082
w0Z,NEW = w0Z,CURRENT + Δw0Z = 0.5 - 0.00082 = 0.49918

δA = outputA (1 – outputA) ΣDOWNSTREAM Wjk δk = 0.7892(1 – 0.7892)(0.9)(-0.0082) = -0.00123

Page 43: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Example of Back-Propagation (cont’d)

– Adjust weight wAZ using the back-propagation rules

– Connection weight between Node A and Node Z adjusted from 0.9 to 0.899353

– Next, Node B is a hidden layer node
– Only node downstream from Node B is Node Z

ΔwAZ = η δZ (outputA) = 0.1(-0.0082)(0.7892) = -0.000647
wAZ,NEW = wAZ,CURRENT + ΔwAZ = 0.9 - 0.000647 = 0.899353

δB = outputB (1 – outputB) ΣDOWNSTREAM Wjk δk = 0.8176(1 – 0.8176)(0.9)(-0.0082) = -0.0011

Page 44: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Example of Back-Propagation (cont’d)

– Adjust weight wBZ using the back-propagation rules

– Connection weight between Node B and Node Z adjusted from 0.9 to 0.89933

– Similarly, application of the back-propagation rules continues to the input layer nodes

– Weights {w1A, w2A, w3A, w0A} and {w1B, w2B, w3B, w0B} updated by the process

ΔwBZ = η δZ (outputB) = 0.1(-0.0082)(0.8176) = -0.00067
wBZ,NEW = wBZ,CURRENT + ΔwBZ = 0.9 - 0.00067 = 0.89933
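The single back-propagation step worked through on the last few slides can be checked with a short Python sketch (it continues the forward-pass values computed earlier; variable names are illustrative):

eta = 0.1
actual, out_Z = 0.8, 0.8750
out_A, out_B = 0.7892, 0.8176
w_0Z, w_AZ, w_BZ = 0.5, 0.9, 0.9

# Error responsibility for output Node Z
delta_Z = out_Z * (1 - out_Z) * (actual - out_Z)            # about -0.0082

# Error responsibilities for hidden Nodes A and B (only Node Z is downstream)
delta_A = out_A * (1 - out_A) * w_AZ * delta_Z              # about -0.00123
delta_B = out_B * (1 - out_B) * w_BZ * delta_Z              # about -0.0011

# Weight adjustments for the output-layer connections
w_0Z_new = w_0Z + eta * delta_Z * 1.0                       # 0.49918
w_AZ_new = w_AZ + eta * delta_Z * out_A                     # 0.899353
w_BZ_new = w_BZ + eta * delta_Z * out_B                     # 0.89933

print(round(delta_Z, 4), round(w_0Z_new, 5), round(w_AZ_new, 6), round(w_BZ_new, 5))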

Page 45: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Example of Back-Propagation (cont’d)

– Now, all network weights in the model have been updated
– Each iteration based on single record from data set

• Summary
– Network calculated predicted value for target variable
– Prediction error derived
– Prediction error percolated back through network
– Weights adjusted to generate smaller prediction error
– Process repeats record by record

Page 46: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Termination Criteria

– Many passes through the data set performed
– Constantly adjusting weights to reduce prediction error
– When to terminate?

– Stopping criterion may be computational “clock” time?
– Short training times likely result in poor model

– Terminate when SSE reaches threshold level?
– Neural Networks are prone to overfitting
– Memorizing patterns rather than generalizing

Page 47: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Termination Criteria (cont’d)

• Cross-Validation Termination Procedure
– Retain portion of training set as “hold-out” data set
– Train network on remaining data
– Apply weights learned from training set to validation set
– Keep track of two sets of weights
– “Current” weights for training set, “best” weights giving minimum SSE on validation set
– Terminate algorithm when current weights have significantly greater SSE than best weights
– However, Neural Networks not guaranteed to arrive at global minimum for SSE
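A minimal Python sketch of this cross-validation (early-stopping) procedure, assuming generic train_one_epoch and compute_sse helpers exist for the network at hand (both are placeholders, not functions from the text):

import copy

def train_with_validation(network, train_set, validation_set, max_epochs=1000, patience=10):
    """Stop when validation SSE has stayed worse than the best seen for a while."""
    best_weights = copy.deepcopy(network.weights)
    best_sse = float("inf")
    epochs_since_best = 0

    for epoch in range(max_epochs):
        train_one_epoch(network, train_set)              # placeholder: one pass of back-propagation
        val_sse = compute_sse(network, validation_set)   # placeholder: SSE on the hold-out data
        if val_sse < best_sse:
            best_sse = val_sse
            best_weights = copy.deepcopy(network.weights)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
        if epochs_since_best >= patience:                # current weights clearly worse than best
            break

    network.weights = best_weights                       # keep the best validation weights
    return network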

Page 48: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Termination Criteria (cont’d)

– Algorithm may become stuck in a local minimum
– Results in a good, but not optimal, solution
– Not necessarily an insuperable problem

– Multiple networks trained using different starting weights
– Best model from the group chosen as “final”

– Stochastic back-propagation method acts to prevent getting stuck in a local minimum

– Random element introduced to gradient descent

– Momentum term may be added to back-propagation algorithm

Page 49: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Learning Rate

– Recall Learning Rate η (Greek “eta”) is a constant

– Helps adjust weights toward the global minimum for SSE

• Small Learning Rate
– With small learning rate, weight adjustments small
– Network takes unacceptably long to converge to solution

• Large Learning Rate
– Suppose algorithm close to optimal solution
– With large learning rate, network likely to “overshoot” optimal solution

0 ≤ η ≤ 1, where η = learning rate

Page 50: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Learning Rate (cont’d)

– w* is the optimum weight for w, which has current value wCURRENT

– According to the Gradient Descent Rule, wCURRENT adjusted in the direction of w*

– Learning rate acts as a multiplier in the formula for ΔwCURRENT

– Large learning rate may cause wNEW to jump past w*
– wNEW may be farther away from w* than wCURRENT

[Figure: SSE plotted against w; a large step from wCURRENT overshoots the optimum w*, landing at wNEW on the far side]

Page 51: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Learning Rate (cont’d)

– Next adjusted weight value lands on the opposite side of w*
– Leads to oscillation between the two “slopes”
– Network never settles down to the minimum between them

• Adjust Learning Rate as Training Progresses
– Learning rate initialized with large value
– Network quickly approaches general vicinity of optimal solution
– As network begins to converge, learning rate gradually reduced
– Avoids overshooting minimum

Page 52: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Momentum Term

– Momentum term α (“alpha”) makes back-propagation more powerful, and represents inertia

– Large momentum values influence ΔwCURRENT to move in the same direction as previous adjustments

– Including momentum in back-propagation results in adjustments becoming an exponential average of all previous adjustments (Reed and Marks)

ΔwCURRENT = -η ∂SSE/∂wCURRENT + α ΔwPREVIOUS

where ΔwPREVIOUS = the previous weight adjustment, and 0 ≤ α ≤ 1 is the fraction of the previous weight adjustment for a given weight

Expanding gives ΔwCURRENT = -η Σ(k ≥ 0) α^k ∂SSE/∂w(CURRENT–k)

where smaller k (more recent adjustments) exerts larger influence
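A brief Python sketch of the momentum update, reusing the illustrative quadratic SSE surface from the gradient descent sketch above (the surface and starting values are assumptions):

def d_sse(w):
    """Derivative of an illustrative SSE surface with minimum at w* = 2.0."""
    return 2.0 * (w - 2.0)

eta, alpha = 0.1, 0.5        # learning rate and momentum term
w, prev_delta = 5.0, 0.0
for _ in range(50):
    delta = -eta * d_sse(w) + alpha * prev_delta   # gradient step plus inertia
    w += delta
    prev_delta = delta
print(round(w, 4))           # settles near w* = 2.0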

Page 53: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Momentum Term (cont’d)

– Large alpha values enable algorithm to remember more terms in the adjustment history

– Small alpha values reduce the inertial effects, and the influence of previous adjustments

– With alpha = 0, all previous components disappear
– Momentum term encourages adjustments in the same direction
– Increases rate at which algorithm approaches neighborhood of optimality
– Early adjustments in same direction
– Exponential average of adjustments, also in same direction

Page 54: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Momentum Term (cont’d)

– Weight initialized at I, with local minima at A and C
– Global minimum at B

• Small Ball
– Small “ball”, symbolizing the momentum term, rolls down the hill
– Gets stuck in first trough A
– Momentum helps find A (local minimum), but not global minimum B

[Figure: two SSE curves plotted against w, each with starting point I, local minima at A and C, and global minimum at B; the left panel is labeled “Small Momentum” and the right panel “Large Momentum”]

Page 55: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Momentum Term (cont’d)

• Large Ball
– Now, a large “ball” symbolizes the momentum term
– Ball rolls down the curve and passes over the first and second hills
– Overshoots global minimum B because of too much momentum
– Settles into local minimum at C

– Values of learning rate and momentum require careful consideration

– Experimentation with different values necessary to obtain best results

Page 56: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Sensitivity Analysis

– Opacity is a drawback of Neural Networks
– Flexibility enables modeling of non-linear behavior
– However, limits ability to interpret results
– No procedure exists for translating weights to decision rules

– Sensitivity Analysis measures the relative influence attributes have on the solution

• Sensitivity Analysis Procedure
– Generate a new observation xMEAN

– Each attribute value of xMEAN equals the mean value for that attribute in the data set

– Find the network output for input xMEAN, called outputMEAN

Page 57: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Sensitivity Analysis (cont’d)

– Attribute by attribute, vary xMEAN to reflect that attribute’s Min and Max
– Find the network output for each variation, compare to outputMEAN

– Determines which attributes, when varied from their Min to Max, have a greater effect on the network results than other attributes

– For example, predicting stock price using price-earnings ratio and dividend yield

– Varying price-earnings ratio from its Min to Max results in output increase = 0.20

– Varying dividend yield from its Min to Max results in output increase = 0.30

– Conclude network is more sensitive to changes in dividend yield
– Therefore, dividend yield is the more important predictor of stock price
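A compact Python sketch of this sensitivity analysis procedure, assuming a generic predict(record) function for the trained network and a dict of per-attribute statistics (all names are placeholders, not from the text):

def sensitivity_analysis(predict, stats):
    """stats: {attribute: (mean, min, max)} on the normalized training data.

    Returns, per attribute, the change in network output when that attribute
    alone is swept from its Min to its Max, all others held at their means.
    """
    x_mean = {a: m for a, (m, _, _) in stats.items()}
    output_mean = predict(x_mean)

    effects = {}
    for attr, (_, lo, hi) in stats.items():
        x_lo = dict(x_mean, **{attr: lo})
        x_hi = dict(x_mean, **{attr: hi})
        effects[attr] = predict(x_hi) - predict(x_lo)
    return output_mean, effects

# Attributes with the largest absolute effect are the most important predictors.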

Page 58: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Application of Neural Network Modeling

– Insightful Miner models the adult data set using a network
– Model uses one hidden layer with 8 nodes
– Algorithm iterates 47 times (epochs) before terminating
– Resulting network model shown

Page 59: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Application of Neural Network Modeling (cont’d)

– Input nodes appear on left, one per attribute
– Eight hidden nodes in center
– Includes a single output node, indicating whether income <= $50,000

– Table below shows network weights from model
– Input nodes numbered 1 – 9, hidden nodes 22 – 29
– Weight from input node Age (1) to topmost hidden node (22) is -0.97

Page 60: Discovering Knowledge in Data Daniel T. Larose, Ph.D

Application of Neural Network Modeling (cont’d)

– Estimated prediction accuracy of model is 82%
– Approximately 75% of records have income <= $50,000
– Sensitivity Analysis performed with Clementine, shown below

– capital-gain and education-num attributes are the most important predictors of those with income <= $50,000

– However, gender (sex) is not highly predictive
– Further analysis required to develop the model