LECTURE 07: CLASSIFICATION PT. 3 February 15, 2016 SDS 293 Machine Learning


Page 1: LECTURE 07: CLASSIFICATION PT. 3
February 15, 2016
SDS 293 Machine Learning

Page 2:

Announcements

• On Monday, SDS 293 will have guests:

Page 3:

Q&A: questions about labs

• Q1: when are they “due”?
- Answer: Ideally you should submit your post before you leave class on the day we do the lab. While there’s no “penalty” for turning them in later, it’s harder for me to judge where everyone is without feedback.

• Q2: resolved or unresolved?
- Answer: Leave them unresolved when you post, and I’ll mark them resolved as I enter them into my grade book.

Page 4:

Outline

• Motivation
• Bayes classifier
• K-nearest neighbors
• Logistic regression
- Logistic model
- Estimating coefficients with maximum likelihood
- Multivariate logistic regression
- Multiclass logistic regression
- Limitations
• Linear discriminant analysis (LDA)
- Bayes’ theorem
- LDA on one predictor
- LDA on multiple predictors
• Comparing Classification Methods

Page 5:

Warning!

Page 6:

Recap: logistic regression

Question: what were we trying to do using the logistic function?

Answer: estimate the conditional distribution of the response, given the predictors
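For reference, the logistic model we were fitting (with a single predictor X) was

p(X) = Pr(Y = 1 | X) = e^(β_0 + β_1 X) / (1 + e^(β_0 + β_1 X))

with the coefficients β_0 and β_1 chosen by maximum likelihood.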

Page 7:

Discussion

• What could go wrong?

Page 8:

Flashback: superheroes

According to this sample, we can perfectly predict the 3 classes using only:

Page 9:

Flashback: superheroes

Page 10:

Perfect separation

• If a predictor happens to perfectly align with the response, we call this perfect separation

• When this happens, logistic regression will grossly inflate the coefficients of that predictor (why?)
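A minimal sketch of what this looks like in practice (not from the slides; the tiny dataset below is made up so that the single predictor perfectly separates the two classes), using scikit-learn with the regularization effectively turned off:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One predictor that perfectly separates the two classes
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A huge C means (almost) no regularization, i.e. close to plain maximum
# likelihood -- so nothing stops the fitted coefficient from growing.
model = LogisticRegression(C=1e10, max_iter=10000).fit(X, y)
print(model.coef_)  # a very large slope: the unpenalized MLE is unbounded here

With the default amount of regularization the coefficient stays modest, which is one practical workaround; modeling the predictors directly (the approach below) is another.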

Page 11:

Another approach

• We could try modeling the distribution of the predictors in each class separately:

f_k(x): the “density function of X” for observations in class k

• How would this help?

Page 12:

Refresher: Bayes’ rule

Page 13:

Refresher: Bayes’ rule

Pr(Y = k | X = x) = Pr(X = x | Y = k) · Pr(Y = k) / Pr(X = x)

- Pr(Y = k | X = x): the probability that the true class is k, given that we observed predictor vector x
- Pr(X = x | Y = k): the probability that we observe predictor vector x, given that the true class is k
- Pr(Y = k): the probability that the true class is k
- Pr(X = x): the probability that we observe predictor vector x
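In the notation we’ll use for LDA, with π_k = Pr(Y = k) (the prior) and f_k(x) = Pr(X = x | Y = k) (the density of X within class k), the same rule reads

Pr(Y = k | X = x) = π_k f_k(x) / Σ_{l=1}^{K} π_l f_l(x)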

Page 14:

Refresher: Bayes’ rule

The prior Pr(Y = k) isn’t too hard to estimate*

*Assuming our training data is randomly sampled

Then we just need to estimate the density function f_k(x)!

Page 15:

Flashback: toy example

[Figure: Bayes’ decision boundary, where the posterior probability crosses 50%]

Page 16:

Using Bayes’ rule for classification

• Let’s start by assuming we have just a single predictor

• In order to estimate f_k(X), we’ll need to make some assumptions about its form

Page 17:

Assumption 1: f_k(X) is normally distributed

Page 18:

Assumption 1: f_k(X) is normally distributed

• If the density is normal (a.k.a. Gaussian), then we can calculate the function as:

f_k(x) = (1 / (√(2π) σ_k)) exp( -(x - μ_k)² / (2σ_k²) )

where μ_k and σ_k² are the mean and variance of the kth class

Page 19:

Assumption 2: classes have equal variance

• For simplicity, we’ll also assume that all the classes share the same variance:

σ_1² = σ_2² = … = σ_K²

• That gives us a single variance term, which we’ll denote σ²

Page 20:

Plugging in…

• Substituting the normal density into Bayes’ rule gives:

Pr(Y = k | X = x) = π_k (1/(√(2π) σ)) exp( -(x - μ_k)² / (2σ²) ) / Σ_{l=1}^{K} π_l (1/(√(2π) σ)) exp( -(x - μ_l)² / (2σ²) )

• For our purposes, the denominator is a constant: it’s the same for every class k!

Page 21:

More algebra!

• So really (after taking logs and dropping the terms that don’t depend on k), we just need to maximize:

δ_k(x) = x μ_k / σ² - μ_k² / (2σ²) + log(π_k)

• This is called a discriminant function of x

Page 22:

Okay, we need an example

Bayes’ Decision Boundary at x=0
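To see where that boundary comes from: with two classes, equal priors (π_1 = π_2), and a shared variance σ², the Bayes classifier picks class 2 exactly when δ_2(x) > δ_1(x), and a little algebra reduces this to

x > (μ_1 + μ_2) / 2

so the boundary sits at the midpoint of the two class means; when the means are symmetric about zero, that midpoint is x = 0.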

Page 23:

LDA: estimating the mean

• As usual, in practice we don’t know the actual values for the parameters and so we have to estimate them

• The linear discriminant analysis method uses the following estimate:

μ̂_k = (1 / n_k) Σ_{i : y_i = k} x_i

(the average of all n_k training examples from class k)

Page 24:

LDA: estimating the variance

• Then we’ll use that estimate to get:

σ̂² = (1 / (n - K)) Σ_{k=1}^{K} Σ_{i : y_i = k} (x_i - μ̂_k)²

(a weighted average of the sample variances of each class)

Page 25:

Flashback

• Remember that time I said “this isn’t too hard to estimate* (*assuming our training data is randomly sampled)”?

• If we don’t have additional knowledge about the class membership distribution, we just use the observed class proportions:

π̂_k = n_k / n

Page 26:

LDA classifier

• The LDA classifier plugs in all these estimates, and assigns the observation to the class for which

δ̂_k(x) = x μ̂_k / σ̂² - μ̂_k² / (2σ̂²) + log(π̂_k)

is the largest

• The “linear” in LDA comes from the fact that this equation is linear in x
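None of this needs to be done by hand; here is a minimal sketch in Python using scikit-learn’s LDA implementation (the toy data below is made up for illustration and is not the lab’s dataset):

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy one-predictor data: two classes with different means and a shared variance
rng = np.random.default_rng(293)
x0 = rng.normal(loc=-1.25, scale=1.0, size=100)  # class 0
x1 = rng.normal(loc=1.25, scale=1.0, size=100)   # class 1
X = np.concatenate([x0, x1]).reshape(-1, 1)
y = np.array([0] * 100 + [1] * 100)

# Fitting LDA estimates the class means, the shared variance, and the priors
lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.means_)            # estimated mu_k for each class
print(lda.priors_)           # estimated pi_k (the class proportions)
print(lda.predict([[0.1]]))  # the class whose discriminant is largest at x = 0.1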

Page 27:

Back to our example

LDA Decision Boundary

Page 28:

Quick recap: LDA

• LDA on 1 predictor makes 2 assumptions: what are they?
1. Observations within class are normally distributed
2. All classes have common variance

• So what would we need to change to make this work with multiple predictors?

Page 29:

LDA on multiple predictors

• Nothing! We just assume the density functions are multivariate normal (written out below):

[Figure: bivariate normal densities; left: uncorrelated predictors, right: correlation = 0.7]
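For reference, the multivariate normal density for p predictors with mean vector μ and covariance matrix Σ is

f(x) = (1 / ((2π)^(p/2) |Σ|^(1/2))) exp( -(1/2) (x - μ)ᵀ Σ⁻¹ (x - μ) )

and LDA gives each class its own mean vector μ_k while sharing a single covariance matrix Σ across classes.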

Page 30:

LDA on multiple predictors

• Well okay, not nothing…

• What happens to the mean?

• What happens to the variance?

Page 31:

LDA on multiple predictors

• Plugging in as before, we get:

δ_k(x) = xᵀ Σ⁻¹ μ_k - (1/2) μ_kᵀ Σ⁻¹ μ_k + log(π_k)

• While this looks a little terrifying, it’s actually just the matrix/vector version of the discriminant we had before:

δ_k(x) = x μ_k / σ² - μ_k² / (2σ²) + log(π_k)

Page 32:

Quick recap: LDA

• LDA on 1 predictor makes 2 assumptions: what are they?
1. Each class is normally distributed
2. All classes have common variance

• What can we do about this second assumption?

Page 33:

Quadratic discriminant analysis

• What if we relax the assumption that the classes have uniform variance, and give each class k its own covariance matrix Σ_k?

• If we plug this into Bayes’, we get:

δ_k(x) = -(1/2) (x - μ_k)ᵀ Σ_k⁻¹ (x - μ_k) - (1/2) log|Σ_k| + log(π_k)

• We’re now multiplying two x terms together, hence “quadratic”

Page 34:

Discussion: QDA vs. LDA

• Question: why does it matter whether or not we assume that the classes have common variance?

• Answer: bias/variance trade-off
- One covariance matrix on p predictors → p(p+1)/2 parameters (and QDA estimates a separate one for each of the K classes)
- The more parameters we have to estimate, the higher the variance of the model (but the lower the bias)
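A quick back-of-the-envelope example (the numbers are just for illustration): with p = 50 predictors,

shared Σ (LDA): 50 · 51 / 2 = 1275 covariance parameters
separate Σ_k (QDA with K = 3 classes): 3 · 1275 = 3825 covariance parameters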

Page 35:

Lab: LDA and QDA

• To do today’s lab in R: nothing special
• To do today’s lab in Python: nothing special
• Instructions and code: http://www.science.smith.edu/~jcrouser/SDS293/labs/lab5/

• You’ll notice that the labs are beginning to get a bit more open-ended – this may cause some discomfort!
- Goal: to help you become more fluent in using these methods

- Ask questions early and often: of me, of each other, of whatever resource helps you learn!

• Full version can be found beginning on p. 161 of ISLR
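If you’d like a Python warm-up before the lab, here is a minimal sketch using scikit-learn (this is not the lab code; the data below are simulated placeholders, and the real instructions live at the link above):

import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import train_test_split

# Simulated placeholder data: 200 observations, 2 predictors, 2 classes
rng = np.random.default_rng(0)
cov = [[1.0, 0.3], [0.3, 1.0]]
X = np.vstack([rng.multivariate_normal([-1, -1], cov, size=100),
               rng.multivariate_normal([1, 1], cov, size=100)])
y = np.array([0] * 100 + [1] * 100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# LDA assumes one shared covariance matrix; QDA estimates one per class
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("QDA", QuadraticDiscriminantAnalysis())]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))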

Page 36:

Coming up

• Wednesday is Rally Day
- NO AFTERNOON CLASSES
- Jordan’s office hours will be shifted ahead one hour (9am to 11am)

• Monday – last day of classification: comparing methods
• Solutions for A1 have been posted to the course website and Moodle
• A2 due on Monday, Feb. 22nd by 11:59pm