
Page 1:

Linear Discrimination

Reading: Chapter 2 of textbook

Page 2:

Framework

• Assume our data consists of instances x = (x1, x2, ..., xn)

• Assume data can be separated into two classes, positive and negative, by a linear decision surface.

• Learning: Assuming the data is n-dimensional, learn an (n−1)-dimensional hyperplane that classifies the data into the two classes.

Page 3:

Linear Discriminant

[Figure: two-dimensional data plotted as Feature 1 vs. Feature 2.]

Page 4:

Linear Discriminant

[Figure: two-dimensional data plotted as Feature 1 vs. Feature 2.]

Page 5:

Linear Discriminant

[Figure: two-dimensional data plotted as Feature 1 vs. Feature 2.]

Page 6:

Example where line won’t work?

[Figure: two-dimensional data plotted as Feature 1 vs. Feature 2.]

Page 7:

Perceptrons

• Discriminant function:

  y(x) = f(wT x + w0)
       = f(w0 + w1 x1 + ... + wn xn)

  w0 is called the "bias".

  −w0 is called the "threshold".

• Classification:

  y(x) = sgn(w0 + w1 x1 + w2 x2 + ... + wn xn)

  where sgn(z) = −1 if z < 0, 0 if z = 0, +1 if z > 0.
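As a concrete illustration of this decision rule, here is a minimal Python sketch (not part of the original slides); the weights and input used in the call are made-up values.

import numpy as np

def sgn(z):
    # the sign convention used above: -1, 0, or +1
    return -1 if z < 0 else (0 if z == 0 else 1)

def perceptron_classify(w0, w, x):
    # y(x) = sgn(w0 + w1*x1 + ... + wn*xn)
    return sgn(w0 + np.dot(w, x))

# Made-up weights and input, for illustration only
print(perceptron_classify(w0=-0.5, w=np.array([1.0, 1.0]), x=np.array([1.0, -1.0])))  # prints -1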

Page 8:

Perceptrons as simple neural networks

[Figure: a perceptron drawn as a one-layer network: inputs +1, x1, x2, ..., xn connected by weights w0, w1, w2, ..., wn to a single output unit o.]

y(x) = sgn(w0 + w1 x1 + w2 x2 + ... + wn xn)

where sgn(z) = −1 if z < 0, 0 if z = 0, +1 if z > 0.

Page 9:

Example

• What is the class y?

[Figure: a perceptron with weights 0.4, −0.4, and −0.1 on its connections, and inputs +1, 1, and −1.]

Page 10:

Geometry of the perceptron

[Figure: two-dimensional data, Feature 1 vs. Feature 2, with the separating hyperplane (a line).]

Hyperplane in 2d:

  w1 x1 + w2 x2 + w0 = 0, i.e., x2 = −(w1 / w2) x1 − (w0 / w2)
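To make the geometry concrete, here is a small Python sketch (not from the slides) that recovers the slope and intercept of the 2-d decision line from a weight vector; the numeric weights reuse the example values (0.1, 0.1, −0.3) that appear a few slides later.

def decision_line(w0, w1, w2):
    # From w1*x1 + w2*x2 + w0 = 0:
    #   x2 = -(w1/w2)*x1 - w0/w2   (assumes w2 != 0)
    slope = -w1 / w2
    intercept = -w0 / w2
    return slope, intercept

print(decision_line(w0=0.1, w1=0.1, w2=-0.3))  # slope = 1/3, intercept = 1/3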

Page 11:

In-class exercise

Work with one neighbor on this:

(a) Find weights for a perceptron that separates "true" and "false" in x1 x2. Find the slope and intercept, and sketch the separation line defined by this discriminant.

(b) Do the same, but for x1 x2.

(c) What (if anything) might make one separation line better than another?
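A small Python sketch (not part of the exercise) that can help check an answer: given candidate weights, it prints the perceptron output on the four Boolean input pairs, so you can see which truth table your separation line implements. The weights in the call are placeholders to replace with your own.

from itertools import product

def sgn(z):
    return -1 if z < 0 else (0 if z == 0 else 1)

def truth_table(w0, w1, w2):
    # Print the perceptron's output for every Boolean input pair.
    for x1, x2 in product([0, 1], repeat=2):
        print((x1, x2), "->", sgn(w0 + w1 * x1 + w2 * x2))

truth_table(w0=-0.5, w1=1.0, w2=-1.0)  # placeholder weights; substitute your own answer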

Page 12:

• To simplify notation, assume a "dummy" coordinate (or attribute) x0 = 1. Then we can write:

  y(x) = sgn(wT x) = sgn(w0 x0 + w1 x1 + ... + wD xD)

• We can generalize the perceptron to cases where we project data points xn into "feature space", φ(xn):

  y(x) = sgn(wT φ(x))
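A minimal Python sketch (not in the slides) of these two notational points: prepending the dummy coordinate x0 = 1 so the bias becomes just another weight, and classifying after a projection φ into feature space. The quadratic feature map used here is only an illustrative choice.

import numpy as np

def augment(x):
    # Prepend the dummy coordinate x0 = 1.
    return np.concatenate(([1.0], x))

def phi(x):
    # Illustrative feature map: the original coordinates plus their squares.
    return np.concatenate((x, x ** 2))

def classify(w, x, feature_map=None):
    z = augment(feature_map(x) if feature_map else x)
    return int(np.sign(w @ z))   # y(x) = sgn(w^T x) or sgn(w^T phi(x))

x = np.array([0.5, -2.0])
w = np.array([0.1, 1.0, 1.0, 0.5, 0.5])   # arbitrary weights, sized for x0 plus phi(x)
print(classify(w, x, feature_map=phi))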

Page 13:

Notation

• Let S = {(xk, tk): k = 1, 2, ..., m} be a training set.

  Note: xk is a vector of inputs; tk ∈ {+1, −1} for binary classification, and tk is real-valued for regression.

• Output o:

  o = sgn( Σ_{j=0}^{n} wj xj ) = sgn(wT ⋅ x)

• Error of a perceptron on a single training example (xk, tk):

  Ek = (1/2)(tk − ok)^2

Page 14:

Example

• S = {((0,0), -1), ((0,1), 1)}

• Let w = (w0, w1, w2) = (0.1, 0.1, −0.3)

[Figure: the corresponding perceptron, with inputs +1, x1, x2, weights 0.1, 0.1, −0.3, and output o.]

What is E1? What is E2?

  Ek = (1/2)(tk − ok)^2
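A quick Python check of this example (a sketch, not from the slides), using the sgn output and the per-example error defined on the previous slide:

import numpy as np

def sgn(z):
    return -1 if z < 0 else (0 if z == 0 else 1)

w = np.array([0.1, 0.1, -0.3])      # (w0, w1, w2)
S = [((0, 0), -1), ((0, 1), 1)]     # the training set from this slide

for k, (x, t) in enumerate(S, start=1):
    x_aug = np.array([1.0, *x])     # prepend the dummy coordinate x0 = 1
    o = sgn(w @ x_aug)              # perceptron output
    E = 0.5 * (t - o) ** 2          # E^k = (1/2)(t^k - o^k)^2
    print(f"E{k} = {E}")

With these weights both examples are misclassified, so the script prints E1 = E2 = 2.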

Page 15:

How do we train a perceptron?

Gradient descent in weight space

[Figure from T. M. Mitchell, Machine Learning.]

Page 16:

Perceptron learning algorithm

• Start with random weights w = (w1, w2, ... , wn).

• Do gradient descent in weight space, in order to minimize error E:

– Given error E, we want to modify the weights w so as to take a step in the direction of steepest descent.

Page 17:

Gradient descent

• We want to find w so as to minimize the sum-squared error:

  E(w) = (1/2) Σ_{k=1}^{m} (tk − ok)^2

• To minimize, take the derivative of E(w) with respect to w.

• A vector derivative is called a "gradient": ∇E(w)

  ∇E(w) = [ ∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn ]

Page 18:

• Here is how we change each weight:

  wj ← wj + Δwj, where Δwj = −η ∂E/∂wj

  and η is the learning rate.

Page 19:

• The error function

  E(w) = (1/2) Σ_{k=1}^{m} (tk − ok)^2

  has to be differentiable, so the output function o also has to be differentiable.

Page 20:

Activation functions

[Figure: two plots of output vs. activation (output values −1 and +1): a step (sgn) function and a linear function.]

  o = sgn( Σ_j wj xj + w0 )      Not differentiable

  o = Σ_j wj xj + w0             Differentiable

Page 21:

∂E/∂wi = ∂/∂wi [ (1/2) Σ_k (tk − ok)^2 ]              (1)

       = (1/2) Σ_k ∂/∂wi (tk − ok)^2                   (2)

       = (1/2) Σ_k 2 (tk − ok) ∂/∂wi (tk − ok)         (3)

       = Σ_k (tk − ok) ∂/∂wi (tk − w ⋅ xk)             (4)

       = Σ_k (tk − ok) (−xik)                          (5)

So,

  Δwi = η Σ_k (tk − ok) xik                            (6)

This is called the perceptron learning rule, with "true gradient descent".
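A compact Python sketch (an illustration, not code from the textbook) of one true-gradient-descent step using rule (6): the weight change sums the contributions of all training examples before the weights are updated. The learning rate is arbitrary, and the data reuses the two-example set from the earlier Example slide.

import numpy as np

def batch_update(w, X, t, eta=0.1):
    # X: (m, n+1) inputs with the dummy coordinate x0 = 1 already prepended
    # t: (m,) targets in {-1, +1}
    o = np.sign(X @ w)              # outputs o^k for all m examples
    delta_w = eta * (t - o) @ X     # rule (6): eta * sum_k (t^k - o^k) * x^k
    return w + delta_w

X = np.array([[1.0, 0.0, 0.0],
              [1.0, 0.0, 1.0]])
t = np.array([-1.0, 1.0])
w = np.array([0.1, 0.1, -0.3])
print(batch_update(w, X, t))        # one batch step from these weights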

Page 22:

• Problem with true gradient descent:

  The search process can land in a local optimum.

• Common approach to this: use stochastic gradient descent:

  – Instead of doing the weight update after all training examples have been processed, do the weight update after each training example has been processed (i.e., after the perceptron output has been calculated for that example).

  – Stochastic gradient descent approximates true gradient descent increasingly well as η → 0.

Page 23:

Training a perceptron

1. Start with random weights, w = (w1, w2, ... , wn).

2. Select training example (xk, tk).

3. Run the perceptron with input xk and weights w to obtain o.

4. Let η be the learning rate (a user-set parameter). Now update each weight:

   wi ← wi + Δwi, where Δwi = η (tk − ok) xik

5. Go to 2.
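Putting steps 1-5 together, here is a runnable Python sketch of this stochastic training loop (an illustration under the slide's conventions, not the textbook's code); the toy dataset, learning rate, and epoch count are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)

def sgn(z):
    return -1 if z < 0 else (0 if z == 0 else 1)

def train_perceptron(X, t, eta=0.1, epochs=20):
    # X: (m, n) raw inputs; t: (m,) targets in {-1, +1}
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend the dummy coordinate x0 = 1
    w = rng.uniform(-0.5, 0.5, size=X.shape[1])    # 1. start with random weights
    for _ in range(epochs):                        # (repeated passes over the data)
        for x_k, t_k in zip(X, t):                 # 2. select a training example
            o_k = sgn(w @ x_k)                     # 3. run the perceptron to obtain o
            w = w + eta * (t_k - o_k) * x_k        # 4. w_i <- w_i + eta*(t^k - o^k)*x_i^k
    return w                                       # 5. "go to 2" becomes the inner loop

# Toy linearly separable data (arbitrary)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1, 1, 1, 1])
print(train_perceptron(X, t))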