Chapter 1: Introduction, Chapter 2: Overview of Supervised Learning (2006.01.20)


Page 1:

Chapter 1: Introduction
Chapter 2: Overview of Supervised Learning
2006.01.20

Page 2:

Supervised learning

• Training data set: several features and an outcome for each observation
• Build a learner based on the training data set
• Predict the unseen outcome of future data from its observed features

Page 3:

An example of supervised learning: email spam

[Diagram: a set of known (labeled) emails, normal and spam, is fed to a learner; the learner then classifies new, unknown emails as normal or spam.]

Page 4:

Input & Output

• Input = predictor = independent variable
• Output = response = dependent variable

Page 5:

Output Types

• Quantitative >> regression. Ex) stock price, temperature, age
• Qualitative >> classification. Ex) Yes/No, spam/not spam

Page 6:

Input Types

• Quantitative
• Qualitative
• Ordered categorical. Ex) small, medium, big

Page 7:

Terminology

• X : input variable; X_j : its j-th component
• X (bold) : the matrix of inputs; x_j : the j-th observed value
• Y : quantitative output; Ŷ : its prediction
• G : qualitative output

Page 8:

General model

Given input X and output Y, assume Y depends on X through some unknown function f.
We want to estimate f based on a known data set (the training data).

Page 9:

Two simple methods

• Linear model (linear regression)
• Nearest-neighbor method

Page 10:

Linear model

Given a vector of input features $X = (X_1, \dots, X_p)$, assume the linear relationship
\[
\hat Y = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j = X^T\hat\beta
\quad (\text{with } X_0 = 1 \text{ for the intercept}).
\]
Least squares criterion: choose $\beta$ to minimize
\[
\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \big(y_i - x_i^T\beta\big)^2,
\]
which gives $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$ when $\mathbf{X}^T\mathbf{X}$ is nonsingular.
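A minimal sketch of the least-squares fit described above, using NumPy on synthetic data (variable names and data are illustrative, not from the slides):

    # least_squares_sketch.py -- illustrative only
    import numpy as np

    rng = np.random.default_rng(0)
    N, p = 100, 3
    X = rng.normal(size=(N, p))                # N observations of p input features
    X1 = np.column_stack([np.ones(N), X])      # prepend a column of 1s for the intercept
    beta_true = np.array([1.0, 2.0, -1.0, 0.5])
    y = X1 @ beta_true + 0.1 * rng.normal(size=N)  # linear signal plus noise

    # Closed-form least-squares solution: beta_hat = (X^T X)^{-1} X^T y
    beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)
    y_hat = X1 @ beta_hat                      # fitted values
    rss = np.sum((y - y_hat) ** 2)             # residual sum of squares
    print(beta_hat, rss)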

Page 11:

Classification example in two dimensions -1

Page 12:

Nearest-neighbor method

Majority vote within the k nearest neighbors.

[Figure: with k = 1 the new point is classified brown; with k = 3 it is classified green.]
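A minimal sketch of k-nearest-neighbor classification by majority vote (Euclidean distance on synthetic 2-D data; names and data are illustrative):

    # knn_sketch.py -- illustrative only
    import numpy as np
    from collections import Counter

    def knn_predict(X_train, y_train, x_new, k):
        """Classify x_new by majority vote among its k nearest training points."""
        dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]                   # indices of the k closest points
        votes = Counter(y_train[nearest])
        return votes.most_common(1)[0][0]

    rng = np.random.default_rng(1)
    X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y_train = np.array(["brown"] * 50 + ["green"] * 50)
    x_new = np.array([1.0, 1.0])
    print(knn_predict(X_train, y_train, x_new, k=1))
    print(knn_predict(X_train, y_train, x_new, k=3))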

Page 13:

Classification example in two dimensions -2

Page 14:

Linear model vs. k-nearest neighbor

Linear model
• #parameters: p
• Stable, smooth
• Low variance, high bias

k-nearest neighbor
• #parameters: N/k (effective)
• Unstable, wiggly
• High variance, low bias

Each method has its own situations for which it works best.

Page 15:

Misclassification curves

Page 16:

Enhanced Methods

• Kernel methods using weights
• Modifying the distance kernels
• Locally weighted least squares
• Expansion of inputs for arbitrarily complex models
• Projection pursuit and neural network models

Page 17:

Statistical decision theory (1)

• Given input X in R^p and output Y in R, with joint distribution Pr(X, Y).
• We look for a predicting function f(X).
• Squared error loss: $L(Y, f(X)) = (Y - f(X))^2$.
• Choose f to minimize the expected prediction error $\mathrm{EPE}(f) = E\big[(Y - f(X))^2\big]$ (derivation below).
• Nearest-neighbor methods: $\hat f(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big)$.
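A short worked derivation, not spelled out on the slide, of why minimizing EPE under squared-error loss yields the conditional mean:

\[
\mathrm{EPE}(f) = E\big[(Y - f(X))^2\big] = E_X\, E_{Y \mid X}\big[(Y - f(X))^2 \mid X\big],
\]
so it suffices to minimize pointwise:
\[
f(x) = \operatorname*{arg\,min}_{c}\; E_{Y \mid X}\big[(Y - c)^2 \mid X = x\big] = E\big[Y \mid X = x\big],
\]
the conditional mean (regression function). k-NN approximates this expectation by averaging the $y_i$ of the $k$ training points nearest to $x$.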

Page 18:

Statistical decision theory (2)

k-nearest neighbor
• $\hat f(x) = \mathrm{Ave}\big(y_i \mid x_i \in N_k(x)\big) \to E[Y \mid X = x]$ as $N, k \to \infty$ with $k/N \to 0$.
• But in practice: insufficient samples, and the curse of dimensionality!

Linear model
• Assume $f(x) \approx x^T\beta$ and plug it into the EPE (worked equation below).
• But the true function might not be linear!
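A worked equation for the model-based (linear) solution referenced above, a standard result not spelled out on the slide: plugging $f(x) \approx x^T\beta$ into the EPE and differentiating gives

\[
\beta = \big(E[\,X X^T\,]\big)^{-1} E[\,X Y\,],
\]
which least squares estimates by replacing these expectations with averages over the training data, i.e. $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.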

Page 19:

Statistical decision theory (3)

If we replace the squared-error loss with the absolute-error loss $E\,|Y - f(X)|$, the solution becomes the conditional median,
\[
\hat f(x) = \operatorname{median}\big(Y \mid X = x\big).
\]
More robust than the conditional mean, but the criterion is discontinuous in its derivatives.

Page 20:

Statistical decision theory (4)

• G : categorical output variable
• L : loss function (a K x K matrix of misclassification costs)
• $\mathrm{EPE} = E\big[L(G, \hat G(X))\big]$
• Minimizing EPE leads to the Bayes classifier (written out below).
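The resulting rule under 0-1 loss, written out as a worked equation (a standard result, only named on the slide):

\[
\hat G(x) = \operatorname*{arg\,max}_{g \in \mathcal{G}} \Pr\big(g \mid X = x\big),
\]
i.e. classify to the most probable class given the input; its error rate is the Bayes rate.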

Page 21:

References

• Reading group on "Elements of Statistical Learning" – overview.ppt, http://sifaka.cs.uiuc.edu/taotao/stat.html
• Welcome to STAT 894 – SupervisedLearningOVERVIEW05.pdf, http://www.stat.ohio-state.edu/~goel/STATLEARN/
• The Matrix Cookbook, http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/3274/pdf/imm3274.pdf
• A First Course in Probability

Page 22:

2.5 Local Methods in High Dimensions

With a reasonably large training set, we could always approximate the theoretically optimal conditional expectation by k-nearest-neighbor averaging.

The curse of dimensionality
• To capture a fraction $r$ of the data for a local average in the unit hypercube, the expected edge length of the required sub-cube is $e_p(r) = r^{1/p}$. With $p = 10$ inputs, capturing 1% of the data means $e_{10}(0.01) \approx 0.63$, i.e. covering 63% of the range of each input variable.
• All sample points are close to an edge of the sample: the median distance from the origin to the closest of $N$ data points uniformly distributed in the unit ball is
\[
d(p, N) = \Big(1 - \tfrac{1}{2}^{1/N}\Big)^{1/p}.
\]
(A small numeric check follows.)
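The values below follow directly from the two formulas above; the N = 500 case is an illustrative choice:

    # curse_of_dimensionality_sketch.py -- illustrative only
    def edge_length(r, p):
        """Expected edge length of a sub-cube capturing fraction r of uniform data in p dims."""
        return r ** (1.0 / p)

    def median_closest_distance(p, N):
        """Median distance from the origin to the closest of N uniform points in the unit ball."""
        return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

    print(edge_length(0.01, 10))             # ~0.63: 1% of the data needs 63% of each axis when p = 10
    print(edge_length(0.10, 10))             # ~0.80
    print(median_closest_distance(10, 500))  # ~0.52: the closest point is over halfway to the boundary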

Page 23:

2.5 Local Methods in High Dimensions: example, 1-NN vs. linear model

1-NN
• As the dimension p increases, the MSE and its bias component tend to 1.0.
• Bias-variance decomposition at a test point $x_0$ (deterministic target $f$, training set $\mathcal{T}$):
\[
\mathrm{MSE}(x_0)
= E_{\mathcal{T}}\big[(f(x_0) - \hat y_0)^2\big]
= E_{\mathcal{T}}\big[(\hat y_0 - E_{\mathcal{T}}\hat y_0)^2\big] + \big[E_{\mathcal{T}}\hat y_0 - f(x_0)\big]^2
= \mathrm{Var}_{\mathcal{T}}(\hat y_0) + \mathrm{Bias}^2(\hat y_0).
\]

Linear model
• Averaging over $x_0$, the expected EPE increases only linearly as a function of p:
\[
\mathrm{EPE}(x_0)
= E_{y_0 \mid x_0} E_{\mathcal{T}}\,(y_0 - \hat y_0)^2
= \mathrm{Var}(y_0 \mid x_0) + E_{\mathcal{T}}\big[\hat y_0 - E_{\mathcal{T}}\hat y_0\big]^2 + \big[E_{\mathcal{T}}\hat y_0 - x_0^T\beta\big]^2
\]
\[
= \mathrm{Var}(y_0 \mid x_0) + \mathrm{Var}_{\mathcal{T}}(\hat y_0) + \mathrm{Bias}^2(\hat y_0)
= \sigma^2 + E_{\mathcal{T}}\,x_0^T(\mathbf{X}^T\mathbf{X})^{-1}x_0\,\sigma^2 + 0^2.
\]

By relying on rigid assumptions, the linear model has no bias at all and negligible variance, while the error in 1-nearest neighbor is larger.
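A minimal simulation sketch of the 1-NN side of this comparison, assuming a textbook-style deterministic target f(x) = exp(-8 * ||x||^2) (the specific f and sample sizes are assumptions, not shown on the slide):

    # one_nn_bias_variance_sketch.py -- illustrative only
    import numpy as np

    def f(x):
        return np.exp(-8.0 * np.sum(x ** 2, axis=-1))     # assumed deterministic target

    def one_nn_at_origin(p, N, trials=500, seed=0):
        """Simulate the 1-NN estimate of f at x0 = 0 over many training sets."""
        rng = np.random.default_rng(seed)
        preds = np.empty(trials)
        for t in range(trials):
            X = rng.uniform(-1.0, 1.0, size=(N, p))       # uniform training inputs
            nn = X[np.argmin(np.linalg.norm(X, axis=1))]  # nearest neighbor of the origin
            preds[t] = f(nn)                              # 1-NN prediction at x0 = 0
        bias = preds.mean() - 1.0                         # f(0) = 1
        var = preds.var()
        return bias ** 2 + var, bias ** 2, var            # MSE = squared bias + variance

    for p in (1, 2, 5, 10):
        print(p, one_nn_at_origin(p, N=1000))             # MSE and bias^2 grow toward 1 as p grows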

Page 24:

2.6 Statistical Models, Supervised Learning and Function Approximation

Goal: finding a useful approximation $\hat f(x)$ to the function $f(x)$ that underlies the predictive relationship between the inputs and outputs.
• Supervised learning: the machine-learning point of view.
• Function approximation: the mathematics and statistics point of view.

Page 25:

2.7 Structured Regression Models

• Nearest-neighbor and other local methods face problems in high dimensions, and may be inappropriate even in low dimensions. Hence the need for structured approaches.
• Difficulty of the problem: there are infinitely many solutions minimizing the residual sum of squares
\[
\mathrm{RSS}(f) = \sum_{i=1}^{N} \big(y_i - f(x_i)\big)^2;
\]
a unique solution comes only from restrictions on f. (A small code sketch of the criterion follows.)
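The criterion above as a one-function sketch (data and the candidate f are illustrative):

    # rss_sketch.py -- illustrative only
    import numpy as np

    def rss(f, X, y):
        """Residual sum of squares of a candidate function f on training data (X, y)."""
        return float(np.sum((y - f(X)) ** 2))

    # Example: evaluate a candidate linear function on toy data
    X = np.array([[0.0], [1.0], [2.0]])
    y = np.array([0.1, 1.9, 4.2])
    print(rss(lambda X: 2.0 * X[:, 0], X, y))  # any f interpolating the data would give RSS = 0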

Page 26:

2.8 Classes of Restricted Estimators

Methods categorized by the nature of the restrictions:
• Roughness penalty and Bayesian methods: penalizing functions that vary too rapidly over small regions of input space.
• Kernel methods and local regression: explicitly specifying the nature of the local neighborhood (the kernel function); need adaptation in high dimensions (see the sketch after this list).
• Basis functions and dictionary methods: linear expansion of basis functions.
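A minimal sketch of the kernel-method idea as a locally weighted (Nadaraya-Watson) average with a Gaussian kernel; the kernel choice, bandwidth, and data are illustrative assumptions:

    # kernel_average_sketch.py -- illustrative only
    import numpy as np

    def kernel_regression(x0, X_train, y_train, bandwidth=0.5):
        """Estimate f(x0) as a weighted average of y_i, weights from a Gaussian kernel."""
        d = np.linalg.norm(X_train - x0, axis=1)          # distances to the query point
        w = np.exp(-0.5 * (d / bandwidth) ** 2)           # Gaussian kernel weights
        return np.sum(w * y_train) / np.sum(w)            # locally weighted average

    rng = np.random.default_rng(2)
    X_train = rng.uniform(-2, 2, size=(200, 1))
    y_train = np.sin(X_train[:, 0]) + 0.1 * rng.normal(size=200)
    print(kernel_regression(np.array([1.0]), X_train, y_train))  # roughly sin(1.0) ~ 0.84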

Page 27:

2.9 Model Selection and the Bias-Variance Tradeoff

All models have a smoothing or complexity parameter to be determined:
• multiplier of the penalty term
• width of the kernel
• number of basis functions

Page 28:

Bias-Variance tradeoff

• The irreducible error coming from ε is essential: there is no way to reduce it.
• Bias and variance trade off: reducing one typically increases the other.
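The standard decomposition behind this tradeoff, written out as a worked equation (assuming the additive-error model $Y = f(X) + \varepsilon$ with $E[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$):

\[
E\big[(Y - \hat f(x_0))^2 \mid X = x_0\big]
= \underbrace{\sigma^2}_{\text{irreducible}}
+ \underbrace{\big[E\hat f(x_0) - f(x_0)\big]^2}_{\text{squared bias}}
+ \underbrace{E\big[(\hat f(x_0) - E\hat f(x_0))^2\big]}_{\text{variance}}.
\]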

Page 29:

Bias-Variance tradeoff in kNN
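For k-NN regression this decomposition takes a closed form (a standard result under the additive-error model above, with $x_{(\ell)}$ the $\ell$-th nearest training point to $x_0$):

\[
\mathrm{EPE}_k(x_0)
= \sigma^2
+ \Big[f(x_0) - \tfrac{1}{k}\sum_{\ell=1}^{k} f\big(x_{(\ell)}\big)\Big]^2
+ \frac{\sigma^2}{k},
\]
so increasing $k$ lowers the variance term but typically raises the bias term, and vice versa.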

Page 30:

Model complexity

[Figure: training error and test error plotted against model complexity (horizontal axis, low to high); the vertical axis is prediction error. The low-complexity end corresponds to high bias / low variance, the high-complexity end to low bias / high variance.]