lec02 Machine Learning


    CSCI567 Machine Learning (Fall 2014)

    Drs. Sha & Liu

    {feisha,yanliu.cs}@usc.edu

    September 9, 2014


    Outline

    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC

    4 Some practical sides of NNC

    5 What we have learned


    Entrance exam

All exams have been graded.

About 87% have passed: 96 students in Prof. Sha's section and 48 in Prof. Liu's section.

Those who have passed the threshold were granted D-clearance; please enroll ASAP, by Friday noon.

In several cases, advisors have sent out inquiries; please respond by Thursday noon per their instructions.

If you have not been granted D-clearance or been contacted by advisors, you can assume that you are not permitted to enroll. You can ask the TAs to check your grade only if you strongly believe you did well.


    Outline

    1 Administration

2 First learning algorithm: Nearest neighbor classifier
    Intuitive example
    General setup for classification
    Algorithm

3 A deeper understanding of NNC

    4 Some practical sides of NNC

    5 What we have learned


    Recognizing flowers

    Types of Iris: setosa, versicolor, and virginica


    Measuring the properties of the flowers

    Features and attributes: the widths and lengths of sepal and petal


    Pairwise scatter plots of 131 flower specimens

Visualization of data helps to identify the right learning model to use.

Each colored point is a flower specimen: setosa, versicolor, virginica.

[Figure: 4x4 grid of pairwise scatter plots over sepal length, sepal width, petal length, and petal width]


    Different types seem well-clustered and separable

    Using two features: petal width and sepal length

[Figure: the same pairwise scatter plots, with the petal width vs. sepal length panel highlighted]


    Labeling an unknown flower type

It is closer to the red cluster, so we label it as setosa.

[Figure: pairwise scatter plots with the unknown specimen marked "?" near the red (setosa) cluster]


    Multi-class classification

Classify data into one of multiple categories.

Input (feature vector): $\mathbf{x} \in \mathbb{R}^D$

Output (label): $y \in [C] = \{1, 2, \cdots, C\}$

Learning goal: $y = f(\mathbf{x})$

Special case: binary classification

Number of classes: $C = 2$

Labels: $\{0, 1\}$ or $\{-1, +1\}$


    More terminology

Training data (set)

$N$ samples/instances: $\mathcal{D}^{\text{train}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$

They are used for learning $f(\cdot)$.

Test (evaluation) data

$M$ samples/instances: $\mathcal{D}^{\text{test}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_M, y_M)\}$

They are used for assessing how well $f(\cdot)$ will do in predicting an unseen $\mathbf{x} \notin \mathcal{D}^{\text{train}}$.

Training data and test data should not overlap: $\mathcal{D}^{\text{train}} \cap \mathcal{D}^{\text{test}} = \emptyset$.


    Often, data is conveniently organized as a table

Ex: Iris data (see http://en.wikipedia.org/wiki/Iris_flower_data_set for all data)

4 features
3 classes


    Nearest neighbor classification (NNC)

Nearest neighbor

$\mathbf{x}(1) = \mathbf{x}_{\mathrm{nn}(\mathbf{x})}$

where $\mathrm{nn}(\mathbf{x}) \in [N] = \{1, 2, \cdots, N\}$, i.e., the index of one of the training instances:

$\mathrm{nn}(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_2^2 = \arg\min_{n \in [N]} \sum_{d=1}^{D} (x_d - x_{nd})^2$

Classification rule

$y = f(\mathbf{x}) = y_{\mathrm{nn}(\mathbf{x})}$
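For concreteness, here is a minimal NumPy sketch of this rule (the function and variable names are ours, not from the lecture):

```python
import numpy as np

def nn_index(x, X_train):
    """nn(x): index of the training instance closest to x (squared Euclidean)."""
    dists = np.sum((X_train - x) ** 2, axis=1)  # ||x - x_n||_2^2 for every n
    return int(np.argmin(dists))

def nnc_predict(x, X_train, y_train):
    """Classification rule: y = f(x) = y_{nn(x)}."""
    return y_train[nn_index(x, X_train)]
```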


    Visual example

In this 2-dimensional example, the nearest point to $\mathbf{x}$ is a red training instance; thus, $\mathbf{x}$ will be labeled as red.

[Figure (a): 2D scatter plot over features $x_1$ and $x_2$, with the query point $\mathbf{x}$]


    Example: classify Iris with two features

Training data

ID (n)  petal width (x1)  sepal length (x2)  category (y)
1       0.2               5.1                setosa
2       1.4               7.0                versicolor
3       2.5               6.7                virginica
...     ...               ...                ...

Flower with unknown category: petal width = 1.8 and sepal length = 6.4.

Calculating the distance $d = \sqrt{(x_1 - x_{n1})^2 + (x_2 - x_{n2})^2}$:

ID  distance
1   2.06
2   0.72
3   0.76

Thus, the category is versicolor (the real category is virginica).
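A quick arithmetic check of this example (a sketch; the three rows are the training data above):

```python
import math

# (petal width, sepal length) for the three training flowers
train = {1: (0.2, 5.1), 2: (1.4, 7.0), 3: (2.5, 6.7)}
query = (1.8, 6.4)  # the unknown flower

for n, (x1, x2) in train.items():
    d = math.sqrt((query[0] - x1) ** 2 + (query[1] - x2) ** 2)
    print(n, round(d, 2))  # 1: 2.06, 2: 0.72, 3: 0.76 -> ID 2 (versicolor) is nearest
```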


    Decision boundary

For every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.

[Figure (b): NNC decision boundary in the $(x_1, x_2)$ plane]


    How to measure nearness with other distances?

Previously, we used the Euclidean distance:

$\mathrm{nn}(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

We can also use alternative distances, e.g., the $L_1$ distance (i.e., city block distance, or Manhattan distance):

$\mathrm{nn}(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_1 = \arg\min_{n \in [N]} \sum_{d=1}^{D} |x_d - x_{nd}|$

[Figure: the green line is the Euclidean distance; the red, blue, and yellow lines are L1 distances]
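A one-line variant of the earlier nn_index sketch, swapping in the L1 distance (same assumed names):

```python
import numpy as np

def nn_index_l1(x, X_train):
    """nn(x) under the city block (Manhattan) distance."""
    return int(np.argmin(np.sum(np.abs(X_train - x), axis=1)))
```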


    K-nearest neighbor (KNN) classification

Increase the number of nearest neighbors to use?

1st-nearest neighbor: $\mathrm{nn}_1(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

2nd-nearest neighbor: $\mathrm{nn}_2(\mathbf{x}) = \arg\min_{n \in [N] - \mathrm{nn}_1(\mathbf{x})} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

3rd-nearest neighbor: $\mathrm{nn}_3(\mathbf{x}) = \arg\min_{n \in [N] - \mathrm{nn}_1(\mathbf{x}) - \mathrm{nn}_2(\mathbf{x})} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

The set of K-nearest neighbors:

$\mathrm{knn}(\mathbf{x}) = \{\mathrm{nn}_1(\mathbf{x}), \mathrm{nn}_2(\mathbf{x}), \cdots, \mathrm{nn}_K(\mathbf{x})\}$

Let $\mathbf{x}(k) = \mathbf{x}_{\mathrm{nn}_k(\mathbf{x})}$; then

$\|\mathbf{x} - \mathbf{x}(1)\|_2^2 \le \|\mathbf{x} - \mathbf{x}(2)\|_2^2 \le \cdots \le \|\mathbf{x} - \mathbf{x}(K)\|_2^2$


How to classify with K neighbors?

Classification rule

Every neighbor votes: suppose $y_n$ (the true label) for $\mathbf{x}_n$ is $c$; then the vote for $c$ is 1, and the vote for every $c' \neq c$ is 0. We use the indicator function $\mathbb{I}(y_n == c)$ to represent this.

Aggregate everyone's vote:

$v_c = \sum_{n \in \mathrm{knn}(\mathbf{x})} \mathbb{I}(y_n == c), \quad \forall\, c \in [C]$

Label with the majority:

$y = f(\mathbf{x}) = \arg\max_{c \in [C]} v_c$
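Putting the pieces together, a minimal KNN sketch (ours; ties are broken in favor of the class first encountered among the neighbors):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, K=1):
    """Majority vote among the K training points nearest to x."""
    dists = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
    knn = np.argsort(dists)[:K]                  # indices nn_1(x), ..., nn_K(x)
    votes = Counter(y_train[n] for n in knn)     # v_c for each class c
    return votes.most_common(1)[0][0]            # argmax_c v_c
```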


    Example

K = 1, Label: red

K = 3, Label: red

K = 5, Label: blue

[Figure (a): the same query point in the $(x_1, x_2)$ plane, labeled under K = 1, 3, and 5]


    How to choose an optimal K?

[Figure: NNC decision boundaries on two features ($x_6$, $x_7$) for K = 1, K = 3, and K = 31]

As K increases, the decision boundary becomes smoother.


    Mini-summary

Advantages of NNC

Computationally, simple and easy to implement: just compute distances.

Theoretically, it has strong guarantees of doing the right thing.

Disadvantages of NNC

Computationally intensive for large-scale problems: O(ND) to label one data point.

We need to carry the training data around; without it, we cannot do classification. This type of method is called nonparametric.

Choosing the right distance measure and K can be involved.


    Outline

    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC
    Measuring performance
    The ideal classifier
    Comparing NNC to the ideal classifier

    4 Some practical sides of NNC

    5 What we have learned


    Is NNC too simple to do the right thing?

To answer this question, we proceed in 3 steps:

1 We define a performance metric for a classifier/algorithm.

2 We then propose an ideal classifier.

3 We then compare our simple NNC classifier to the ideal one and show that it performs nearly as well.


    How to measure performance of a classifier?

Intuition

We should compute the accuracy (the percentage of data points correctly classified), or the error rate (the percentage of data points incorrectly classified).

Two versions: which one to use?

Defined on the training data set:

$A^{\text{train}} = \frac{1}{N} \sum_n \mathbb{I}[f(\mathbf{x}_n) == y_n], \quad \varepsilon^{\text{train}} = \frac{1}{N} \sum_n \mathbb{I}[f(\mathbf{x}_n) \neq y_n]$

Defined on the test (evaluation) data set:

$A^{\text{test}} = \frac{1}{M} \sum_m \mathbb{I}[f(\mathbf{x}_m) == y_m], \quad \varepsilon^{\text{test}} = \frac{1}{M} \sum_m \mathbb{I}[f(\mathbf{x}_m) \neq y_m]$
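Both metrics in a few lines of NumPy (a sketch with our own names):

```python
import numpy as np

def accuracy_and_error(f, X, y):
    """Returns (A, eps); note A + eps == 1 by definition."""
    preds = np.array([f(x) for x in X])
    A = float(np.mean(preds == y))
    return A, 1.0 - A
```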

Example

Training data

What are $A^{\text{train}}$ and $\varepsilon^{\text{train}}$?

$A^{\text{train}} = 100\%$, $\varepsilon^{\text{train}} = 0\%$

Test data

What are $A^{\text{test}}$ and $\varepsilon^{\text{test}}$?

$A^{\text{test}} = 0\%$, $\varepsilon^{\text{test}} = 100\%$


Leave-one-out (LOO)

Idea

For each training instance $\mathbf{x}_n$, take it out of the training set and then label it.

For NNC, $\mathbf{x}_n$'s nearest neighbor will not be itself, so the error rate would not necessarily become 0.

Training data

What are the LOO versions of $A^{\text{train}}$ and $\varepsilon^{\text{train}}$?

$A^{\text{train}} = 66.67\%$ (i.e., 4/6), $\varepsilon^{\text{train}} = 33.33\%$ (i.e., 2/6)
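A sketch of LOO for 1-NN, reusing the conventions of the earlier snippets:

```python
import numpy as np

def loo_error_1nn(X, y):
    """Label each point by its nearest neighbor among the *other* points."""
    mistakes = 0
    for n in range(len(X)):
        dists = np.sum((X - X[n]) ** 2, axis=1).astype(float)
        dists[n] = np.inf                          # exclude x_n itself
        mistakes += y[int(np.argmin(dists))] != y[n]
    return mistakes / len(X)
```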


Drawback of the metrics

They are dataset-specific: given a different training (or test) dataset, $A^{\text{train}}$ (or $A^{\text{test}}$) will change. Thus, if we get a dataset randomly, these quantities are random:

$A^{\text{test}}_{\mathcal{D}_1}, A^{\text{test}}_{\mathcal{D}_2}, \cdots, A^{\text{test}}_{\mathcal{D}_q}, \cdots$

These are called empirical accuracies (or errors).

Can we understand the algorithm itself in a more certain way, by removing the uncertainty caused by the datasets?


Expected mistakes

Setup

Assume our data $(\mathbf{x}, y)$ is drawn from the joint but unknown distribution $p(\mathbf{x}, y)$.

Classification mistake on a single data point $\mathbf{x}$ with ground-truth label $y$:

$L(f(\mathbf{x}), y) = \begin{cases} 0 & \text{if } f(\mathbf{x}) = y \\ 1 & \text{if } f(\mathbf{x}) \neq y \end{cases}$

Expected classification mistake on a single data point $\mathbf{x}$:

$R(f, \mathbf{x}) = \mathbb{E}_{y \sim p(y|\mathbf{x})} L(f(\mathbf{x}), y)$

The average classification mistake made by the classifier itself:

$R(f) = \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} R(f, \mathbf{x}) = \mathbb{E}_{(\mathbf{x}, y) \sim p(\mathbf{x}, y)} L(f(\mathbf{x}), y)$


Jargons

$L(f(\mathbf{x}), y)$ is called the 0/1 loss function; many other forms of loss functions exist for different learning problems.

Expected conditional risk:

$R(f, \mathbf{x}) = \mathbb{E}_{y \sim p(y|\mathbf{x})} L(f(\mathbf{x}), y)$

Expected risk:

$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p(\mathbf{x}, y)} L(f(\mathbf{x}), y)$

Empirical risk:

$R_{\mathcal{D}}(f) = \frac{1}{N} \sum_n L(f(\mathbf{x}_n), y_n)$

Obviously, this is our empirical error (rate).


Ex: binary classification

Expected conditional risk of a single data point $\mathbf{x}$:

$R(f, \mathbf{x}) = \mathbb{E}_{y \sim p(y|\mathbf{x})} L(f(\mathbf{x}), y) = P(y = 1|\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 0] + P(y = 0|\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 1]$

Let $\eta(\mathbf{x}) = P(y = 1|\mathbf{x})$; we have

$R(f, \mathbf{x}) = \eta(\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 0] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f(\mathbf{x}) = 1] = 1 - \{\eta(\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 1] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f(\mathbf{x}) = 0]\}$

The term in braces is the expected conditional accuracy.

Exercise: please verify the last equality.


Bayes optimal classifier

Consider the following classifier, which uses the posterior probability $\eta(\mathbf{x}) = p(y = 1|\mathbf{x})$:

$f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } \eta(\mathbf{x}) \ge 1/2 \\ 0 & \text{if } \eta(\mathbf{x}) < 1/2 \end{cases}$ or, equivalently, $f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } p(y = 1|\mathbf{x}) \ge p(y = 0|\mathbf{x}) \\ 0 & \text{if } p(y = 1|\mathbf{x}) < p(y = 0|\mathbf{x}) \end{cases}$

Theorem

For any labeling function $f(\cdot)$, $R(f^*, \mathbf{x}) \le R(f, \mathbf{x})$. Similarly, $R(f^*) \le R(f)$. Namely, $f^*(\cdot)$ is optimal.
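The rule itself is one line; a toy sketch (ours) under the assumption that the posterior $\eta(\mathbf{x})$ is somehow known, which in practice it is not:

```python
def bayes_predict(eta_x):
    """Bayes optimal rule: predict 1 iff eta(x) = p(y=1|x) >= 1/2."""
    return 1 if eta_x >= 0.5 else 0
```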


Proof (not required)

From the definition,

$R(f^*, \mathbf{x}) = 1 - \{\eta(\mathbf{x})\, \mathbb{I}[f^*(\mathbf{x}) = 1] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f^*(\mathbf{x}) = 0]\}$

$R(f, \mathbf{x}) = 1 - \{\eta(\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 1] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f(\mathbf{x}) = 0]\}$

Thus,

$R(f, \mathbf{x}) - R(f^*, \mathbf{x}) = \eta(\mathbf{x}) \{\mathbb{I}[f^*(\mathbf{x}) = 1] - \mathbb{I}[f(\mathbf{x}) = 1]\} + (1 - \eta(\mathbf{x})) \{\mathbb{I}[f^*(\mathbf{x}) = 0] - \mathbb{I}[f(\mathbf{x}) = 0]\}$
$= \eta(\mathbf{x}) \{\mathbb{I}[f^*(\mathbf{x}) = 1] - \mathbb{I}[f(\mathbf{x}) = 1]\} + (1 - \eta(\mathbf{x})) \{1 - \mathbb{I}[f^*(\mathbf{x}) = 1] - 1 + \mathbb{I}[f(\mathbf{x}) = 1]\}$
$= (2\eta(\mathbf{x}) - 1) \{\mathbb{I}[f^*(\mathbf{x}) = 1] - \mathbb{I}[f(\mathbf{x}) = 1]\} \ge 0$

where the last inequality holds because, by the definition of $f^*$, the two factors always have the same sign.


Bayes optimal classifier in general form

For the multi-class classification problem,

$f^*(\mathbf{x}) = \arg\max_{c \in [C]} p(y = c|\mathbf{x})$

When $C = 2$, this reduces to detecting whether or not $\eta(\mathbf{x}) = p(y = 1|\mathbf{x})$ is greater than $1/2$. We refer to $p(y = c|\mathbf{x})$ as the posterior probability of $\mathbf{x}$.

Remarks

The Bayes optimal classifier is generally not computable, as it assumes knowledge of $p(\mathbf{x}, y)$ or $p(y|\mathbf{x})$.

However, it is useful as a conceptual tool for formalizing how well a classifier can do without knowing the joint distribution.


Comparing NNC to the Bayes optimal classifier

How well does our NNC do?

Theorem

For the NNC rule $f^{\mathrm{nnc}}$ for binary classification (in the limit $N \to +\infty$; see the proof sketch below), we have

$R(f^*) \le R(f^{\mathrm{nnc}}) \le 2 R(f^*)(1 - R(f^*)) \le 2 R(f^*)$

Namely, the expected risk of the NNC classifier is at worst twice that of the Bayes optimal classifier.

In short, NNC seems to be doing a reasonable thing.


Proof sketches (not required)

Step 1: Show that when $N \to +\infty$, $\mathbf{x}$ and its nearest neighbor $\mathbf{x}(1)$ are similar; in particular, $p(y = 1|\mathbf{x})$ and $p(y = 1|\mathbf{x}(1))$ are the same.

Step 2: Show that the mistake made by our classifier on $\mathbf{x}$ is attributed to the probability that $\mathbf{x}$ and $\mathbf{x}(1)$ have different labels.

Step 3: Show that the probability of having different labels is

$R(f^{\mathrm{nnc}}, \mathbf{x}) = 2 p(y = 1|\mathbf{x})[1 - p(y = 1|\mathbf{x})] = 2\eta(\mathbf{x})(1 - \eta(\mathbf{x}))$

Step 4: Show that the expected risk of the Bayes optimal classifier on $\mathbf{x}$ is

$R(f^*, \mathbf{x}) = \min\{\eta(\mathbf{x}), 1 - \eta(\mathbf{x})\}$

Step 5: Tie everything together:

$R(f^{\mathrm{nnc}}, \mathbf{x}) = 2 R(f^*, \mathbf{x})(1 - R(f^*, \mathbf{x}))$

We are almost there, but one more step is needed (and omitted here).
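A quick numerical check (ours) of the identities in Steps 3-5:

```python
for eta in [0.1, 0.25, 0.5, 0.8]:
    r_nnc = 2 * eta * (1 - eta)      # Step 3: asymptotic NNC conditional risk
    r_star = min(eta, 1 - eta)       # Step 4: Bayes conditional risk
    assert abs(r_nnc - 2 * r_star * (1 - r_star)) < 1e-12   # Step 5
    assert r_star <= r_nnc <= 2 * r_star                    # the theorem, pointwise
```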



    Outline


    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC

4 Some practical sides of NNC
    How to tune to get the best out of it?
    Preprocessing data

    5 What we have learned


Hyperparameters in NNC


Two practical issues about NNC:

Choosing K, i.e., the number of nearest neighbors (default is 1).

Choosing the right distance measure (default is the Euclidean distance), for example, from the following generalized distance measure:

$\|\mathbf{x} - \mathbf{x}_n\|_p = \left( \sum_d |x_d - x_{nd}|^p \right)^{1/p}$

for $p \ge 1$.

These are not specified by the algorithm itself; resolving them requires empirical studies and is task/dataset-specific.
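A sketch of this generalized ($L_p$, i.e., Minkowski) distance:

```python
import numpy as np

def lp_distance(x, xn, p=2):
    """||x - xn||_p for p >= 1 (p=2: Euclidean, p=1: Manhattan)."""
    return float(np.sum(np.abs(x - xn) ** p) ** (1.0 / p))
```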


    Tuning by using a validation dataset


Training data (set)

$N$ samples/instances: $\mathcal{D}^{\text{train}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$

They are used for learning $f(\cdot)$.

Test (evaluation) data

$M$ samples/instances: $\mathcal{D}^{\text{test}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_M, y_M)\}$

They are used for assessing how well $f(\cdot)$ will do in predicting an unseen $\mathbf{x} \notin \mathcal{D}^{\text{train}}$.

Development (or validation) data

$L$ samples/instances: $\mathcal{D}^{\text{dev}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_L, y_L)\}$

They are used to optimize hyperparameter(s).

Training, validation, and test data should not overlap!


    Recipe


For each possible value of the hyperparameter (say $K = 1, 3, \cdots, 100$):

    Train a model using $\mathcal{D}^{\text{train}}$.

    Evaluate the performance of the model on $\mathcal{D}^{\text{dev}}$.

Choose the model with the best performance on $\mathcal{D}^{\text{dev}}$.

Evaluate the model on $\mathcal{D}^{\text{test}}$.
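A sketch of this recipe, reusing the knn_predict helper assumed in the earlier snippet:

```python
import numpy as np

def tune_K(X_train, y_train, X_dev, y_dev, Ks=range(1, 101, 2)):
    """Return the K whose model has the best accuracy on the development set."""
    def dev_accuracy(K):
        preds = [knn_predict(x, X_train, y_train, K) for x in X_dev]
        return np.mean(np.array(preds) == y_dev)
    return max(Ks, key=dev_accuracy)
```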


    Cross-validation


What if we do not have validation data?

We split the training data into S equal parts.

We use each part in turn as a validation dataset and use the others as a training dataset.

We choose the hyperparameter such that, on average, the model performs the best.

[Figure: S = 5, i.e., 5-fold cross-validation]

Special case: when S = N, this becomes leave-one-out.


    Recipe


Split the training data into S equal parts; denote each part as $\mathcal{D}^{\text{train}}_s$.

For each possible value of the hyperparameter (say $K = 1, 3, \cdots, 100$):

    For every $s \in [1, S]$:

        Train a model using $\mathcal{D}^{\text{train}}_{\setminus s} = \mathcal{D}^{\text{train}} - \mathcal{D}^{\text{train}}_s$.

        Evaluate the performance of the model on $\mathcal{D}^{\text{train}}_s$.

    Average the S performance metrics.

Choose the hyperparameter corresponding to the best averaged performance.

Use the best hyperparameter to train a model on all of $\mathcal{D}^{\text{train}}$.

Evaluate the model on $\mathcal{D}^{\text{test}}$.
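A sketch of S-fold cross-validation for choosing K (ours; X and y are assumed to be NumPy arrays, and knn_predict is the helper from the earlier snippet):

```python
import numpy as np

def cross_validate_K(X, y, Ks=range(1, 101, 2), S=5):
    """Pick the K with the best accuracy averaged over the S folds."""
    folds = np.array_split(np.random.permutation(len(X)), S)
    def avg_accuracy(K):
        accs = []
        for s in range(S):
            val = folds[s]                                        # D^train_s
            trn = np.hstack([folds[r] for r in range(S) if r != s])  # D^train \ s
            preds = [knn_predict(X[i], X[trn], y[trn], K) for i in val]
            accs.append(np.mean(np.array(preds) == y[val]))
        return np.mean(accs)
    return max(Ks, key=avg_accuracy)
```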


Yet another practical issue with NNC


Distances depend on the units of the features!

(Show how proximity can change due to a change in feature scale; draw on screen or blackboard.)


    Preprocess data


Normalize the data so that it looks as if it came from a normal distribution.

Compute the mean and standard deviation of each feature:

$\bar{x}_d = \frac{1}{N} \sum_n x_{nd}, \qquad s_d^2 = \frac{1}{N-1} \sum_n (x_{nd} - \bar{x}_d)^2$

Scale the features accordingly:

$x_{nd} \leftarrow \frac{x_{nd} - \bar{x}_d}{s_d}$

There are many other ways of normalizing data; you would need/want to try different ones and pick among them using (cross-)validation.
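A sketch of this standardization; computing the statistics on the training data only (and applying them to the test data) is our own standard-practice assumption, not stated on the slide:

```python
import numpy as np

def standardize(X_train, X_test):
    """Per-feature z-scoring using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)   # ddof=1 matches the 1/(N-1) variance above
    return (X_train - mean) / std, (X_test - mean) / std
```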


    Outline


    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC

    4 Some practical sides of NNC

    5 What we have learned


    Summary so far


    Described a simple learning algorithm

Used intensively in practical applications; you will get a taste of it in your homework.

Discussed a few practical aspects, such as tuning hyperparameters with (cross-)validation.

Briefly studied its theoretical properties:

Concepts: loss function, risks, Bayes optimal classifier.
Theoretical guarantees: explaining why NNC would work.
