An Introduction To Neural Networks
Neural Networks
To combine the strengths of machines and the human brain, neural networks try to mimic the structure and function of our nervous system.
Biological Motivation #1
[Diagram: a biological neuron — dendrites, axon, and synapses — alongside an artificial network of nodes whose weighted connections play the role of synapses]
Biological Neural Systems
- Neuron switching time: ~10^-3 secs
- Connections (synapses) per neuron: ~10^4 to 10^5
- Number of neurons in the human brain: ~10^10
- Face recognition time: ~0.1 secs
- High degree of parallel computation
- Distributed representations
Biological Motivation #2
Appropriate Problem Domains for Neural Network (backprop) Learning
- Input is high-dimensional, discrete or real-valued (e.g., raw sensor input)
- Output is discrete or real-valued
- Output is a vector of values
- Possibly noisy data
- Form of target function is unknown
- Fast evaluation may be required
- Humans do not need to interpret the results (black-box model)
A Single Perceptron

[Diagram: inputs x1…x4 with weights w1…w4 feeding a unit with threshold T and output y]

If w1x1 + w2x2 + … + wnxn ≥ T, then the output of the unit is 1. Otherwise, the output is 0.
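As a concrete illustration, here is the threshold rule in Python (a minimal sketch; the weights, inputs, and threshold below are arbitrary illustrative values, not from the slides):

```python
def perceptron(inputs, weights, threshold):
    """Output 1 if the weighted sum of the inputs reaches the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum >= threshold else 0

# Illustrative values: w1x1 + w2x2 + w3x3 + w4x4 = 0.8 >= 0.7, so the output is 1.
print(perceptron([1, 0, 1, 1], [0.4, 0.9, 0.3, 0.1], threshold=0.7))
```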
Linearly Separable

x1  x2 | x1 AND x2
0   0  |  0
0   1  |  0
1   0  |  0
1   1  |  1

x1  x2 | x1 OR x2
0   0  |  0
0   1  |  1
1   0  |  1
1   1  |  1

x1  x2 | x1 XOR x2
0   0  |  0
0   1  |  1
1   0  |  1
1   1  |  0

[Plots in the (x1, x2) plane: a single straight line separates the 1s from the 0s for AND and for OR, but no single line can do so for XOR]
Perceptrons

1969 book by Marvin Minsky and Seymour Papert.
The problem: perceptrons can only solve classification problems that are linearly separable, i.e., they are insufficiently expressive.
Minsky and Papert called multilayer networks an "important research problem" to investigate, although they were pessimistic about their value.
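Although a single perceptron cannot compute XOR, a two-layer network of the same threshold units can. A minimal sketch with hand-picked weights (the decomposition XOR(x1, x2) = (x1 OR x2) AND NOT (x1 AND x2) is an illustrative choice, not from the slides):

```python
def unit(inputs, weights, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

def xor(x1, x2):
    or_out  = unit([x1, x2], [1, 1], threshold=1)   # x1 OR x2
    and_out = unit([x1, x2], [1, 1], threshold=2)   # x1 AND x2
    # Output unit: fires when OR is on but AND is off.
    return unit([or_out, and_out], [1, -1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor(a, b))   # reproduces the XOR truth table
```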
Perceptrons: Another View

AND: inputs x1 and x2 (each either 0 or 1) feed a unit with weights w1 = 1 and w2 = 1 and threshold T = 2, producing output y. The output is 1 only if all inputs are 1.
Now replace the threshold with a bias input x0 (fixed at 1): what weights w0, w1, and w2 make the unit compute AND? Inputs x1 and x2 are again either 0 or 1.
One solution: w1 = 0.5, w2 = 0.5, and bias weight w0 = −0.8. With x0 = 1, the weighted sum −0.8 + 0.5·x1 + 0.5·x2 is positive only when x1 = x2 = 1.
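A quick sketch verifying that these weights compute AND once the threshold is folded into the bias weight w0 (with x0 fixed at 1, the unit fires when the weighted sum exceeds 0):

```python
def and_gate(x1, x2):
    # Bias input x0 = 1 with weight w0 = -0.8; w1 = w2 = 0.5.
    weighted_sum = -0.8 * 1 + 0.5 * x1 + 0.5 * x2
    return 1 if weighted_sum > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_gate(x1, x2))   # prints 1 only for (1, 1)
```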
Training
Train a perceptron to respond to certain inputs with certain desired outputs
After training, the perceptron should give reasonable outputs for any input: if it was not trained on a given input, it should produce the best possible output given how it was trained.
Perceptron Training Rule
- Begin with random weights
- Apply the perceptron to each training example (each pass through the examples is called an epoch)
- If it misclassifies an example, modify the weights
- Continue until the perceptron classifies all training examples correctly
Modifying the Weights
wi ← wi + Δwi
Δwi = LearningRate (DesiredOutput − ActualOutput) xi

LearningRate is usually set to some small value, like 0.1. It moderates the degree to which the weights are changed at each step and keeps the update from overshooting.
Modifying the Weights
wi ← wi + Δwi
Δwi = LearningRate (DesiredOutput − ActualOutput) xi

(DesiredOutput − ActualOutput) is the difference between what we wanted the output to be and what it actually was. If the desired and actual outputs are equal, this factor is 0 and the weight won't change.
Modifying the Weights
wi ← wi + Δwi
Δwi = LearningRate (DesiredOutput − ActualOutput) xi

xi is the value of the input itself. If this value was 0, then it had no impact on the error, so its weight shouldn't be adjusted.
EXAMPLE
- Begin with random weights
- Apply the perceptron to each training example (each pass through the examples is called an epoch)
- If it misclassifies an example, modify the weights: wi ← wi + LearningRate (DesiredOutput − ActualOutput) xi
- Continue until the perceptron classifies all training examples correctly
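A minimal sketch of this training rule in Python, here learning the AND function (the data set, learning rate, and epoch limit are illustrative choices; the slides do not fix them):

```python
import random

def train_perceptron(examples, learning_rate=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    Each example is (inputs, desired_output); inputs include the bias x0 = 1."""
    n = len(examples[0][0])
    weights = [random.uniform(-0.5, 0.5) for _ in range(n)]
    for epoch in range(max_epochs):          # one pass through the examples = one epoch
        errors = 0
        for inputs, desired in examples:
            actual = 1 if sum(w * x for w, x in zip(weights, inputs)) > 0 else 0
            if actual != desired:             # misclassified: adjust the weights
                for i in range(n):
                    weights[i] += learning_rate * (desired - actual) * inputs[i]
                errors += 1
        if errors == 0:                       # all examples classified correctly
            break
    return weights

# AND, with a constant bias input x0 = 1 prepended to each example.
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
print(train_perceptron(data))
```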
Gradient Descent Learning Rule

Consider a linear unit without a threshold, with continuous output o (not just 0/1):
o = w0 + w1x1 + … + wnxn

Train the wi's so that they minimize the squared error
E[w1,…,wn] = ½ Σ_{d∈D} (t_d − o_d)²
where D is the set of training examples, t_d is the target output for example d, and o_d is the unit's actual output.
Gradient Descent

[Figure: the error surface over weight space; a gradient descent step moves the weight vector from (w1, w2) to (w1 + Δw1, w2 + Δw2) downhill]
Incremental (Stochastic) Gradient Descent

Batch mode: gradient descent over the entire data set D:
w ← w − η ∇E_D[w], where E_D[w] = ½ Σ_{d∈D} (t_d − o_d)²

Incremental mode: gradient descent over individual training examples d:
w ← w − η ∇E_d[w], where E_d[w] = ½ (t_d − o_d)²

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.
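For the squared error E_d[w] = ½(t_d − o_d)², the gradient with respect to w_i is −(t_d − o_d)x_i, so both modes apply updates of the form Δw_i = η(t_d − o_d)x_i. A minimal sketch of the two modes for a linear unit (the synthetic data and η are illustrative):

```python
def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))   # linear unit: o = w . x

def batch_step(w, data, eta):
    # One batch gradient-descent step: accumulate the gradient over all of D.
    grad = [0.0] * len(w)
    for x, t in data:
        err = t - predict(w, x)
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return [wi + eta * g for wi, g in zip(w, grad)]

def incremental_step(w, data, eta):
    # Stochastic mode: update after every individual example d.
    for x, t in data:
        err = t - predict(w, x)
        w = [wi + eta * err * xi for wi, xi in zip(w, x)]
    return w

# Illustrative data generated by t = 1 + 2*x1 (with bias input x0 = 1).
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = incremental_step(w, data, eta=0.05)
print(w)   # approaches [1.0, 2.0]
```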
Comparison: Perceptron and Gradient Descent Rule
The perceptron learning rule is guaranteed to succeed (perfectly classifying the training examples) if:
- the training examples are linearly separable
- the learning rate η is sufficiently small

The linear unit training rule using gradient descent is guaranteed to converge to the hypothesis with minimum squared error:
- given a sufficiently small learning rate η
- even when the training data contains noise
- even when the training data is not separable by H
Restaurant Problem: Will I wait for a table?
- Alternate: whether there is a suitable alternative restaurant nearby
- Bar: whether the restaurant has a comfortable bar area to wait in
- Fri/Sat: true on Fridays and Saturdays
- Hungry: whether we are hungry
- Patrons: how many people are in the restaurant (None, Some, or Full)
- Price: the restaurant's price range ($, $$, $$$)
- Raining: whether it is raining outside
- Reservation: whether we made a reservation
- Type: the kind of restaurant (French, Italian, Thai, or Burger)
- WaitEstimate: the wait estimate by the host (0-10 minutes, 10-30, 30-60, > 60)
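For a network to use these attributes, they must be encoded numerically. The slides don't specify an encoding, so the one-hot scheme below is purely an illustrative sketch:

```python
def encode(example):
    # One-hot codes for the multi-valued attributes (an illustrative choice).
    patrons = {"None": [1, 0, 0], "Some": [0, 1, 0], "Full": [0, 0, 1]}
    rtype = {"French": [1, 0, 0, 0], "Italian": [0, 1, 0, 0],
             "Thai": [0, 0, 1, 0], "Burger": [0, 0, 0, 1]}
    wait = {"0-10": [1, 0, 0, 0], "10-30": [0, 1, 0, 0],
            "30-60": [0, 0, 1, 0], ">60": [0, 0, 0, 1]}
    # Booleans become 0/1; price becomes a value in (0, 1].
    x = [int(example["Alternate"]), int(example["Bar"]), int(example["Fri/Sat"]),
         int(example["Hungry"]), int(example["Raining"]), int(example["Reservation"]),
         len(example["Price"]) / 3.0]          # $ -> 1/3, $$ -> 2/3, $$$ -> 1
    return x + patrons[example["Patrons"]] + rtype[example["Type"]] + wait[example["WaitEstimate"]]

x = encode({"Alternate": True, "Bar": False, "Fri/Sat": True, "Hungry": True,
            "Raining": False, "Reservation": False, "Price": "$$",
            "Patrons": "Full", "Type": "Thai", "WaitEstimate": "10-30"})
```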
Multilayer Network
A compromise function:

Perceptron:
output = 1 if Σ_{i=0}^{n} w_i x_i > 0, else 0

Linear:
output = net = Σ_{i=0}^{n} w_i x_i

Sigmoid (logistic):
output = σ(net) = 1 / (1 + e^(−net))
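The three unit types side by side, as a minimal Python sketch:

```python
import math

def net(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))    # net = sum_i w_i x_i

def perceptron_out(weights, inputs):
    return 1 if net(weights, inputs) > 0 else 0           # hard threshold

def linear_out(weights, inputs):
    return net(weights, inputs)                           # identity: output = net

def sigmoid_out(weights, inputs):
    return 1.0 / (1.0 + math.exp(-net(weights, inputs)))  # smooth, differentiable threshold
```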
Learning in Multilayer Networks
- Same method as for perceptrons
- Example inputs are presented to the network
- If the network computes an output that matches the desired output, nothing is done
- If there is an error, the weights are adjusted to reduce the error
BackPropagation Learning
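The slides do not reproduce the algorithm itself, so the following is a minimal sketch of one back-propagation update for a network with a single sigmoid hidden layer, a single sigmoid output, and squared error; all sizes, weights, and names are illustrative assumptions:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop_step(x, t, W_hidden, w_out, eta=0.5):
    """One back-propagation update: forward pass, error terms, weight changes."""
    # Forward pass: hidden activations, then the output.
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in W_hidden]
    o = sigmoid(sum(w * hi for w, hi in zip(w_out, h)))
    # Output error term for squared error with a sigmoid unit: o(1 - o)(t - o).
    delta_o = o * (1 - o) * (t - o)
    # Hidden error terms: each unit's share of the output error, scaled by its slope.
    delta_h = [hi * (1 - hi) * w * delta_o for hi, w in zip(h, w_out)]
    # Weight updates: w <- w + eta * delta * (input to that weight).
    w_out = [w + eta * delta_o * hi for w, hi in zip(w_out, h)]
    W_hidden = [[w + eta * dh * xi for w, xi in zip(ws, x)]
                for ws, dh in zip(W_hidden, delta_h)]
    return W_hidden, w_out

# One illustrative update on a single example (bias input x0 = 1 prepended).
W_h = [[0.1, -0.2, 0.3], [-0.3, 0.2, 0.1]]   # 2 hidden units, 3 inputs each
w_o = [0.2, -0.1]                             # hidden -> output weights
W_h, w_o = backprop_step([1, 0, 1], 1, W_h, w_o)
```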
Alternative Error Measures
Neural Network Model

[Figure: a feed-forward network. The inputs (independent variables) Age = 34, Gender = 2, and Stage = 4 feed a hidden layer through weighted connections (.6, .5, .8, .2, .1, .3, .7, .2); each hidden unit computes a weighted sum (Σ), and a second layer of weights (.4, .2) feeds the output unit, whose prediction (the dependent variable) is the "probability of being alive" = 0.6]
Getting an answer from a NN

[Figures: three slides step through the same network, highlighting in turn the weights into the first hidden unit, the weights into the second hidden unit, and the weights into the output unit, showing how the inputs (Age = 34, Gender, Stage = 4) are combined into the final prediction "probability of being alive" = 0.6]
Minimizing the Error
[Figure: the error surface over weight space. Training moves the weights from w_initial (initial error) to w_trained (final error) at a local minimum; where the derivative is negative, the weight change is positive, so each step moves downhill]
Representational Power (FFNN)
- Boolean functions: 2 layers of units
- Continuous functions: 2 layers of units (sigmoid, then linear)
- Arbitrary functions: 3 layers of units (sigmoids, then linear)
Hypothesis Space and Inductive Bias
Hidden Layer Representations
Overfitting
Handwritten Character Recognition
Le Cun et al. (1989) implemented a neural network to read zip codes on hand-addressed envelopes, for sorting purposes. To identify the digits, the network uses a 16x16 array of pixels as input (256 input nodes), 3 hidden layers, and a distributed output encoding with 10 output units for the digits 0-9 (one for the likelihood of each digit).
ALVINN: Drives 70 mph on a public highway

- Input: a 30x32-pixel camera image
- 4 hidden units, each with 30x32 weights from the image
- 30 output units for steering
Neural Nets for Face Recognition
Learning Hidden Unit Weights
Interpreting Satellite Imagery for Automated Weather Forecasting
Summary
- Perceptrons (one-layer networks) are insufficiently expressive
- Multi-layer networks are sufficiently expressive and can be trained by error back-propagation
- Many applications, including speech recognition, driving, handwritten character recognition, and fraud detection