lec02 Machine Learning


    CSCI567 Machine Learning (Fall 2014)

    Drs. Sha & Liu

    {feisha,yanliu.cs}@usc.edu

    September 9, 2014


    Outline

    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC

    4 Some practical sides of NNC

    5 What we have learned


    Entrance exam

All exams have been graded.

About 87% have passed: 96 students in Prof. Sha's section and 48 in Prof. Liu's section.

Those who have passed the threshold were granted D-clearance; please enroll ASAP, by Friday noon.

In several cases, advisors have sent out inquiries; please respond by Thursday noon per their instructions.

If you have not been granted D-clearance or been contacted by advisors, you can assume that you are not permitted to enroll. You can ask the TAs to check your grade only if you strongly believe you did well.


    Outline

    1 Administration

2 First learning algorithm: Nearest neighbor classifier
    Intuitive example
    General setup for classification
    Algorithm

3 A deeper understanding of NNC

    4 Some practical sides of NNC

    5 What we have learned


    Recognizing flowers

    Types of Iris: setosa, versicolor, and virginica


    Measuring the properties of the flowers

    Features and attributes: the widths and lengths of sepal and petal


    Pairwise scatter plots of 131 flower specimens

Visualization of data helps to identify the right learning model to use.

Each colored point is a flower specimen: setosa, versicolor, virginica.

[Figure: 4x4 grid of pairwise scatter plots over sepal length, sepal width, petal length, and petal width]


    Different types seem well-clustered and separable

    Using two features: petal width and sepal length

[Figure: the same pairwise scatter plots, with the petal width vs. sepal length panel highlighted]


    Labeling an unknown flower type

It is closer to the red cluster, so we label it as setosa.

[Figure: pairwise scatter plots with the unknown specimen marked "?" near the red (setosa) cluster]


    Multi-class classification

Classify data into one of multiple categories.

Input (feature vector): $\mathbf{x} \in \mathbb{R}^D$

Output (label): $y \in [C] = \{1, 2, \cdots, C\}$

Learning goal: $y = f(\mathbf{x})$

Special case: binary classification

Number of classes: $C = 2$

Labels: $\{0, 1\}$ or $\{-1, +1\}$


    More terminology

Training data (set)

$N$ samples/instances: $\mathcal{D}^{\text{train}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$

They are used for learning $f(\cdot)$.

Test (evaluation) data

$M$ samples/instances: $\mathcal{D}^{\text{test}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_M, y_M)\}$

They are used for assessing how well $f(\cdot)$ will do in predicting an unseen $\mathbf{x} \notin \mathcal{D}^{\text{train}}$.

Training data and test data should not overlap: $\mathcal{D}^{\text{train}} \cap \mathcal{D}^{\text{test}} = \emptyset$.


    Often, data is conveniently organized as a table

Ex: Iris data (see http://en.wikipedia.org/wiki/Iris_flower_data_set for all data)

4 features
3 classes


    Nearest neighbor classification (NNC)

Nearest neighbor

$\mathbf{x}(1) = \mathbf{x}_{\mathrm{nn}(\mathbf{x})}$

where $\mathrm{nn}(\mathbf{x}) \in [N] = \{1, 2, \cdots, N\}$, i.e., the index of one of the training instances:

$\mathrm{nn}(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_2^2 = \arg\min_{n \in [N]} \sum_{d=1}^{D} (x_d - x_{nd})^2$

Classification rule

$y = f(\mathbf{x}) = y_{\mathrm{nn}(\mathbf{x})}$
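For concreteness, here is a minimal NumPy sketch of this rule (the function and variable names are ours, not from the lecture):

```python
import numpy as np

def nn_index(x, X_train):
    """nn(x): index of the training instance closest to x (squared Euclidean)."""
    dists = np.sum((X_train - x) ** 2, axis=1)  # ||x - x_n||_2^2 for every n
    return int(np.argmin(dists))

def nnc_predict(x, X_train, y_train):
    """Classification rule: y = f(x) = y_{nn(x)}."""
    return y_train[nn_index(x, X_train)]
```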


    Visual example

In this 2-dimensional example, the nearest point to $\mathbf{x}$ is a red training instance; thus, $\mathbf{x}$ will be labeled as red.

[Figure (a): 2D scatter plot over features $x_1$ and $x_2$, with the query point $\mathbf{x}$]


    Example: classify Iris with two features

Training data

ID (n)  petal width (x1)  sepal length (x2)  category (y)
1       0.2               5.1                setosa
2       1.4               7.0                versicolor
3       2.5               6.7                virginica
...     ...               ...                ...

Flower with unknown category: petal width = 1.8 and sepal length = 6.4.

Calculating the distance $d = \sqrt{(x_1 - x_{n1})^2 + (x_2 - x_{n2})^2}$:

ID  distance
1   2.06
2   0.72
3   0.76

Thus, the category is versicolor (the real category is virginica).
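A quick arithmetic check of this example (a sketch; the three rows are the training data above):

```python
import math

# (petal width, sepal length) for the three training flowers
train = {1: (0.2, 5.1), 2: (1.4, 7.0), 3: (2.5, 6.7)}
query = (1.8, 6.4)  # the unknown flower

for n, (x1, x2) in train.items():
    d = math.sqrt((query[0] - x1) ** 2 + (query[1] - x2) ** 2)
    print(n, round(d, 2))  # 1: 2.06, 2: 0.72, 3: 0.76 -> ID 2 (versicolor) is nearest
```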


    Decision boundary

For every point in the space, we can determine its label using the NNC rule. This gives rise to a decision boundary that partitions the space into different regions.

[Figure (b): NNC decision boundary in the $(x_1, x_2)$ plane]


    How to measure nearness with other distances?

Previously, we used the Euclidean distance:

$\mathrm{nn}(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

We can also use alternative distances, e.g., the $L_1$ distance (i.e., city block distance, or Manhattan distance):

$\mathrm{nn}(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_1 = \arg\min_{n \in [N]} \sum_{d=1}^{D} |x_d - x_{nd}|$

[Figure: the green line is the Euclidean distance; the red, blue, and yellow lines are L1 distances]
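A one-line variant of the earlier nn_index sketch, swapping in the L1 distance (same assumed names):

```python
import numpy as np

def nn_index_l1(x, X_train):
    """nn(x) under the city block (Manhattan) distance."""
    return int(np.argmin(np.sum(np.abs(X_train - x), axis=1)))
```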


    K-nearest neighbor (KNN) classification

Increase the number of nearest neighbors to use?

1st-nearest neighbor: $\mathrm{nn}_1(\mathbf{x}) = \arg\min_{n \in [N]} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

2nd-nearest neighbor: $\mathrm{nn}_2(\mathbf{x}) = \arg\min_{n \in [N] - \mathrm{nn}_1(\mathbf{x})} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

3rd-nearest neighbor: $\mathrm{nn}_3(\mathbf{x}) = \arg\min_{n \in [N] - \mathrm{nn}_1(\mathbf{x}) - \mathrm{nn}_2(\mathbf{x})} \|\mathbf{x} - \mathbf{x}_n\|_2^2$

The set of K-nearest neighbors:

$\mathrm{knn}(\mathbf{x}) = \{\mathrm{nn}_1(\mathbf{x}), \mathrm{nn}_2(\mathbf{x}), \cdots, \mathrm{nn}_K(\mathbf{x})\}$

Let $\mathbf{x}(k) = \mathbf{x}_{\mathrm{nn}_k(\mathbf{x})}$; then

$\|\mathbf{x} - \mathbf{x}(1)\|_2^2 \le \|\mathbf{x} - \mathbf{x}(2)\|_2^2 \le \cdots \le \|\mathbf{x} - \mathbf{x}(K)\|_2^2$


How to classify with K neighbors?

Classification rule

Every neighbor votes: suppose $y_n$ (the true label) for $\mathbf{x}_n$ is $c$; then the vote for $c$ is 1, and the vote for every $c' \neq c$ is 0. We use the indicator function $\mathbb{I}(y_n == c)$ to represent this.

Aggregate everyone's vote:

$v_c = \sum_{n \in \mathrm{knn}(\mathbf{x})} \mathbb{I}(y_n == c), \quad \forall\, c \in [C]$

Label with the majority:

$y = f(\mathbf{x}) = \arg\max_{c \in [C]} v_c$
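Putting the pieces together, a minimal KNN sketch (ours; ties are broken in favor of the class first encountered among the neighbors):

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, K=1):
    """Majority vote among the K training points nearest to x."""
    dists = np.sum((X_train - x) ** 2, axis=1)   # squared Euclidean distances
    knn = np.argsort(dists)[:K]                  # indices nn_1(x), ..., nn_K(x)
    votes = Counter(y_train[n] for n in knn)     # v_c for each class c
    return votes.most_common(1)[0][0]            # argmax_c v_c
```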


    Example

K = 1, Label: red

K = 3, Label: red

K = 5, Label: blue

[Figure (a): the same query point in the $(x_1, x_2)$ plane, labeled under K = 1, 3, and 5]


    How to choose an optimal K?

[Figure: NNC decision boundaries on two features ($x_6$, $x_7$) for K = 1, K = 3, and K = 31]

As K increases, the decision boundary becomes smoother.


    Mini-summary

Advantages of NNC

Computationally, simple and easy to implement: just compute distances.

Theoretically, it has strong guarantees of doing the right thing.

Disadvantages of NNC

Computationally intensive for large-scale problems: O(ND) to label one data point.

We need to carry the training data around; without it, we cannot do classification. This type of method is called nonparametric.

Choosing the right distance measure and K can be involved.


    Outline

    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC
    Measuring performance
    The ideal classifier
    Comparing NNC to the ideal classifier

    4 Some practical sides of NNC

    5 What we have learned


    Is NNC too simple to do the right thing?

To answer this question, we proceed in 3 steps:

1 We define a performance metric for a classifier/algorithm.

2 We then propose an ideal classifier.

3 We then compare our simple NNC classifier to the ideal one and show that it performs nearly as well.


    How to measure performance of a classifier?

Intuition

We should compute the accuracy (the percentage of data points correctly classified), or the error rate (the percentage of data points incorrectly classified).

Two versions: which one to use?

Defined on the training data set:

$A^{\text{train}} = \frac{1}{N} \sum_n \mathbb{I}[f(\mathbf{x}_n) == y_n], \quad \varepsilon^{\text{train}} = \frac{1}{N} \sum_n \mathbb{I}[f(\mathbf{x}_n) \neq y_n]$

Defined on the test (evaluation) data set:

$A^{\text{test}} = \frac{1}{M} \sum_m \mathbb{I}[f(\mathbf{x}_m) == y_m], \quad \varepsilon^{\text{test}} = \frac{1}{M} \sum_m \mathbb{I}[f(\mathbf{x}_m) \neq y_m]$
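Both metrics in a few lines of NumPy (a sketch with our own names):

```python
import numpy as np

def accuracy_and_error(f, X, y):
    """Returns (A, eps); note A + eps == 1 by definition."""
    preds = np.array([f(x) for x in X])
    A = float(np.mean(preds == y))
    return A, 1.0 - A
```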

Example

Training data

What are $A^{\text{train}}$ and $\varepsilon^{\text{train}}$?

$A^{\text{train}} = 100\%$, $\varepsilon^{\text{train}} = 0\%$

Test data

What are $A^{\text{test}}$ and $\varepsilon^{\text{test}}$?

$A^{\text{test}} = 0\%$, $\varepsilon^{\text{test}} = 100\%$


Leave-one-out (LOO)

Idea

For each training instance $\mathbf{x}_n$, take it out of the training set and then label it.

For NNC, $\mathbf{x}_n$'s nearest neighbor will not be itself, so the error rate would not necessarily become 0.

Training data

What are the LOO versions of $A^{\text{train}}$ and $\varepsilon^{\text{train}}$?

$A^{\text{train}} = 66.67\%$ (i.e., 4/6), $\varepsilon^{\text{train}} = 33.33\%$ (i.e., 2/6)
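A sketch of LOO for 1-NN, reusing the conventions of the earlier snippets:

```python
import numpy as np

def loo_error_1nn(X, y):
    """Label each point by its nearest neighbor among the *other* points."""
    mistakes = 0
    for n in range(len(X)):
        dists = np.sum((X - X[n]) ** 2, axis=1).astype(float)
        dists[n] = np.inf                          # exclude x_n itself
        mistakes += y[int(np.argmin(dists))] != y[n]
    return mistakes / len(X)
```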


Drawback of the metrics

They are dataset-specific: given a different training (or test) dataset, $A^{\text{train}}$ (or $A^{\text{test}}$) will change. Thus, if we get a dataset randomly, these quantities are random:

$A^{\text{test}}_{\mathcal{D}_1}, A^{\text{test}}_{\mathcal{D}_2}, \cdots, A^{\text{test}}_{\mathcal{D}_q}, \cdots$

These are called empirical accuracies (or errors).

Can we understand the algorithm itself in a more certain way, by removing the uncertainty caused by the datasets?


Expected mistakes

Setup

Assume our data $(\mathbf{x}, y)$ is drawn from the joint but unknown distribution $p(\mathbf{x}, y)$.

Classification mistake on a single data point $\mathbf{x}$ with ground-truth label $y$:

$L(f(\mathbf{x}), y) = \begin{cases} 0 & \text{if } f(\mathbf{x}) = y \\ 1 & \text{if } f(\mathbf{x}) \neq y \end{cases}$

Expected classification mistake on a single data point $\mathbf{x}$:

$R(f, \mathbf{x}) = \mathbb{E}_{y \sim p(y|\mathbf{x})} L(f(\mathbf{x}), y)$

The average classification mistake made by the classifier itself:

$R(f) = \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})} R(f, \mathbf{x}) = \mathbb{E}_{(\mathbf{x}, y) \sim p(\mathbf{x}, y)} L(f(\mathbf{x}), y)$


Jargons

$L(f(\mathbf{x}), y)$ is called the 0/1 loss function; many other forms of loss functions exist for different learning problems.

Expected conditional risk:

$R(f, \mathbf{x}) = \mathbb{E}_{y \sim p(y|\mathbf{x})} L(f(\mathbf{x}), y)$

Expected risk:

$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim p(\mathbf{x}, y)} L(f(\mathbf{x}), y)$

Empirical risk:

$R_{\mathcal{D}}(f) = \frac{1}{N} \sum_n L(f(\mathbf{x}_n), y_n)$

Obviously, this is our empirical error (rate).


Ex: binary classification

Expected conditional risk of a single data point $\mathbf{x}$:

$R(f, \mathbf{x}) = \mathbb{E}_{y \sim p(y|\mathbf{x})} L(f(\mathbf{x}), y) = P(y = 1|\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 0] + P(y = 0|\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 1]$

Let $\eta(\mathbf{x}) = P(y = 1|\mathbf{x})$; we have

$R(f, \mathbf{x}) = \eta(\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 0] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f(\mathbf{x}) = 1] = 1 - \{\eta(\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 1] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f(\mathbf{x}) = 0]\}$

The term in braces is the expected conditional accuracy.

Exercise: please verify the last equality.


Bayes optimal classifier

Consider the following classifier, which uses the posterior probability $\eta(\mathbf{x}) = p(y = 1|\mathbf{x})$:

$f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } \eta(\mathbf{x}) \ge 1/2 \\ 0 & \text{if } \eta(\mathbf{x}) < 1/2 \end{cases}$ or, equivalently, $f^*(\mathbf{x}) = \begin{cases} 1 & \text{if } p(y = 1|\mathbf{x}) \ge p(y = 0|\mathbf{x}) \\ 0 & \text{if } p(y = 1|\mathbf{x}) < p(y = 0|\mathbf{x}) \end{cases}$

Theorem

For any labeling function $f(\cdot)$, $R(f^*, \mathbf{x}) \le R(f, \mathbf{x})$. Similarly, $R(f^*) \le R(f)$. Namely, $f^*(\cdot)$ is optimal.
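The rule itself is one line; a toy sketch (ours) under the assumption that the posterior $\eta(\mathbf{x})$ is somehow known, which in practice it is not:

```python
def bayes_predict(eta_x):
    """Bayes optimal rule: predict 1 iff eta(x) = p(y=1|x) >= 1/2."""
    return 1 if eta_x >= 0.5 else 0
```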


Proof (not required)

From the definition,

$R(f^*, \mathbf{x}) = 1 - \{\eta(\mathbf{x})\, \mathbb{I}[f^*(\mathbf{x}) = 1] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f^*(\mathbf{x}) = 0]\}$

$R(f, \mathbf{x}) = 1 - \{\eta(\mathbf{x})\, \mathbb{I}[f(\mathbf{x}) = 1] + (1 - \eta(\mathbf{x}))\, \mathbb{I}[f(\mathbf{x}) = 0]\}$

Thus,

$R(f, \mathbf{x}) - R(f^*, \mathbf{x}) = \eta(\mathbf{x}) \{\mathbb{I}[f^*(\mathbf{x}) = 1] - \mathbb{I}[f(\mathbf{x}) = 1]\} + (1 - \eta(\mathbf{x})) \{\mathbb{I}[f^*(\mathbf{x}) = 0] - \mathbb{I}[f(\mathbf{x}) = 0]\}$
$= \eta(\mathbf{x}) \{\mathbb{I}[f^*(\mathbf{x}) = 1] - \mathbb{I}[f(\mathbf{x}) = 1]\} + (1 - \eta(\mathbf{x})) \{1 - \mathbb{I}[f^*(\mathbf{x}) = 1] - 1 + \mathbb{I}[f(\mathbf{x}) = 1]\}$
$= (2\eta(\mathbf{x}) - 1) \{\mathbb{I}[f^*(\mathbf{x}) = 1] - \mathbb{I}[f(\mathbf{x}) = 1]\} \ge 0$

where the last inequality holds because, by the definition of $f^*$, the two factors always have the same sign.


Bayes optimal classifier in general form

For the multi-class classification problem,

$f^*(\mathbf{x}) = \arg\max_{c \in [C]} p(y = c|\mathbf{x})$

When $C = 2$, this reduces to detecting whether or not $\eta(\mathbf{x}) = p(y = 1|\mathbf{x})$ is greater than $1/2$. We refer to $p(y = c|\mathbf{x})$ as the posterior probability of $\mathbf{x}$.

Remarks

The Bayes optimal classifier is generally not computable, as it assumes knowledge of $p(\mathbf{x}, y)$ or $p(y|\mathbf{x})$.

However, it is useful as a conceptual tool for formalizing how well a classifier can do without knowing the joint distribution.


Comparing NNC to the Bayes optimal classifier

How well does our NNC do?

Theorem

For the NNC rule $f^{\mathrm{nnc}}$ for binary classification (in the limit $N \to +\infty$; see the proof sketch below), we have

$R(f^*) \le R(f^{\mathrm{nnc}}) \le 2 R(f^*)(1 - R(f^*)) \le 2 R(f^*)$

Namely, the expected risk of the NNC classifier is at worst twice that of the Bayes optimal classifier.

In short, NNC seems to be doing a reasonable thing.


Proof sketches (not required)

Step 1: Show that when $N \to +\infty$, $\mathbf{x}$ and its nearest neighbor $\mathbf{x}(1)$ are similar; in particular, $p(y = 1|\mathbf{x})$ and $p(y = 1|\mathbf{x}(1))$ are the same.

Step 2: Show that the mistake made by our classifier on $\mathbf{x}$ is attributed to the probability that $\mathbf{x}$ and $\mathbf{x}(1)$ have different labels.

Step 3: Show that the probability of having different labels is

$R(f^{\mathrm{nnc}}, \mathbf{x}) = 2 p(y = 1|\mathbf{x})[1 - p(y = 1|\mathbf{x})] = 2\eta(\mathbf{x})(1 - \eta(\mathbf{x}))$

Step 4: Show that the expected risk of the Bayes optimal classifier on $\mathbf{x}$ is

$R(f^*, \mathbf{x}) = \min\{\eta(\mathbf{x}), 1 - \eta(\mathbf{x})\}$

Step 5: Tie everything together:

$R(f^{\mathrm{nnc}}, \mathbf{x}) = 2 R(f^*, \mathbf{x})(1 - R(f^*, \mathbf{x}))$

We are almost there, but one more step is needed (and omitted here).
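A quick numerical check (ours) of the identities in Steps 3-5:

```python
for eta in [0.1, 0.25, 0.5, 0.8]:
    r_nnc = 2 * eta * (1 - eta)      # Step 3: asymptotic NNC conditional risk
    r_star = min(eta, 1 - eta)       # Step 4: Bayes conditional risk
    assert abs(r_nnc - 2 * r_star * (1 - r_star)) < 1e-12   # Step 5
    assert r_star <= r_nnc <= 2 * r_star                    # the theorem, pointwise
```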



    Outline


    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC

4 Some practical sides of NNC
    How to tune to get the best out of it?
    Preprocessing data

    5 What we have learned


Hyperparameters in NNC


Two practical issues about NNC:

Choosing K, i.e., the number of nearest neighbors (default is 1).

Choosing the right distance measure (default is the Euclidean distance), for example, from the following generalized distance measure:

$\|\mathbf{x} - \mathbf{x}_n\|_p = \left( \sum_d |x_d - x_{nd}|^p \right)^{1/p}$

for $p \ge 1$.

These are not specified by the algorithm itself; resolving them requires empirical studies and is task/dataset-specific.
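A sketch of this generalized ($L_p$, i.e., Minkowski) distance:

```python
import numpy as np

def lp_distance(x, xn, p=2):
    """||x - xn||_p for p >= 1 (p=2: Euclidean, p=1: Manhattan)."""
    return float(np.sum(np.abs(x - xn) ** p) ** (1.0 / p))
```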


    Tuning by using a validation dataset


Training data (set)

$N$ samples/instances: $\mathcal{D}^{\text{train}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_N, y_N)\}$

They are used for learning $f(\cdot)$.

Test (evaluation) data

$M$ samples/instances: $\mathcal{D}^{\text{test}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_M, y_M)\}$

They are used for assessing how well $f(\cdot)$ will do in predicting an unseen $\mathbf{x} \notin \mathcal{D}^{\text{train}}$.

Development (or validation) data

$L$ samples/instances: $\mathcal{D}^{\text{dev}} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_L, y_L)\}$

They are used to optimize hyperparameter(s).

Training, validation, and test data should not overlap!


    Recipe


For each possible value of the hyperparameter (say $K = 1, 3, \cdots, 100$):

    Train a model using $\mathcal{D}^{\text{train}}$.

    Evaluate the performance of the model on $\mathcal{D}^{\text{dev}}$.

Choose the model with the best performance on $\mathcal{D}^{\text{dev}}$.

Evaluate the model on $\mathcal{D}^{\text{test}}$.
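A sketch of this recipe, reusing the knn_predict helper assumed in the earlier snippet:

```python
import numpy as np

def tune_K(X_train, y_train, X_dev, y_dev, Ks=range(1, 101, 2)):
    """Return the K whose model has the best accuracy on the development set."""
    def dev_accuracy(K):
        preds = [knn_predict(x, X_train, y_train, K) for x in X_dev]
        return np.mean(np.array(preds) == y_dev)
    return max(Ks, key=dev_accuracy)
```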


    Cross-validation


What if we do not have validation data?

We split the training data into S equal parts.

We use each part in turn as a validation dataset and use the others as a training dataset.

We choose the hyperparameter such that, on average, the model performs the best.

[Figure: S = 5, i.e., 5-fold cross-validation]

Special case: when S = N, this becomes leave-one-out.


    Recipe


Split the training data into S equal parts; denote each part as $\mathcal{D}^{\text{train}}_s$.

For each possible value of the hyperparameter (say $K = 1, 3, \cdots, 100$):

    For every $s \in [1, S]$:

        Train a model using $\mathcal{D}^{\text{train}}_{\setminus s} = \mathcal{D}^{\text{train}} - \mathcal{D}^{\text{train}}_s$.

        Evaluate the performance of the model on $\mathcal{D}^{\text{train}}_s$.

    Average the S performance metrics.

Choose the hyperparameter corresponding to the best averaged performance.

Use the best hyperparameter to train a model on all of $\mathcal{D}^{\text{train}}$.

Evaluate the model on $\mathcal{D}^{\text{test}}$.
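A sketch of S-fold cross-validation for choosing K (ours; X and y are assumed to be NumPy arrays, and knn_predict is the helper from the earlier snippet):

```python
import numpy as np

def cross_validate_K(X, y, Ks=range(1, 101, 2), S=5):
    """Pick the K with the best accuracy averaged over the S folds."""
    folds = np.array_split(np.random.permutation(len(X)), S)
    def avg_accuracy(K):
        accs = []
        for s in range(S):
            val = folds[s]                                        # D^train_s
            trn = np.hstack([folds[r] for r in range(S) if r != s])  # D^train \ s
            preds = [knn_predict(X[i], X[trn], y[trn], K) for i in val]
            accs.append(np.mean(np.array(preds) == y[val]))
        return np.mean(accs)
    return max(Ks, key=avg_accuracy)
```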


Yet another practical issue with NNC


Distances depend on the units of the features!

(Show how proximity can change due to a change in feature scale; draw on screen or blackboard.)


    Preprocess data


Normalize the data so that it looks as if it came from a normal distribution.

Compute the mean and standard deviation of each feature:

$\bar{x}_d = \frac{1}{N} \sum_n x_{nd}, \qquad s_d^2 = \frac{1}{N-1} \sum_n (x_{nd} - \bar{x}_d)^2$

Scale the features accordingly:

$x_{nd} \leftarrow \frac{x_{nd} - \bar{x}_d}{s_d}$

There are many other ways of normalizing data; you would need/want to try different ones and pick among them using (cross-)validation.
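A sketch of this standardization; computing the statistics on the training data only (and applying them to the test data) is our own standard-practice assumption, not stated on the slide:

```python
import numpy as np

def standardize(X_train, X_test):
    """Per-feature z-scoring using training-set statistics."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)   # ddof=1 matches the 1/(N-1) variance above
    return (X_train - mean) / std, (X_test - mean) / std
```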


    Outline


    1 Administration

    2 First learning algorithm: Nearest neighbor classifier

3 A deeper understanding of NNC

    4 Some practical sides of NNC

    5 What we have learned


    Summary so far


    Described a simple learning algorithm

Used intensively in practical applications; you will get a taste of it in your homework.

Discussed a few practical aspects, such as tuning hyperparameters with (cross-)validation.

Briefly studied its theoretical properties:

Concepts: loss function, risks, Bayes optimal classifier.
Theoretical guarantees: explaining why NNC would work.
