
Page 1: Kernel (continued), Clustering

Lirong Xia

Friday, April 18, 2014

Page 2: Projects

• Project 4 due tonight

• Project 5 (last project!) out later

– Q1: Naïve Bayes with auto tuning (for smoothing)

– Q2: freebie

– Q3: Perceptron

– Q4: another freebie

– Q5: MIRA with auto tuning (for C)

– Q6: feature design for Naïve Bayes


Page 3: Last time

• Fixing the Perceptron: MIRA

• Support Vector Machines

• k-nearest neighbor (KNN)

Page 4: MIRA

• MIRA*: choose an update size that fixes the current mistake

• …but minimizes the change to w

• Solution

*Margin Infused Relaxed Algorithm

Page 5: MIRA with Maximum Step Size

• In practice, it’s also bad to make updates that are too large
– Solution: cap the maximum possible value of τ with some constant C
– Tune C on held-out data
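The step-size formula itself lived in the lost figure; as a hedged sketch, here is the capped multiclass update in the margin-1 form commonly paired with this algorithm (the numerator, the dict-of-weight-vectors layout, and the function name are my assumptions, not from the slide):

```python
import numpy as np

def mira_update(weights, f, y_true, C):
    """One capped MIRA step.  weights: dict label -> np.ndarray,
    f: feature vector, C: cap on the step size tau (tuned on held-out data)."""
    y_pred = max(weights, key=lambda y: np.dot(weights[y], f))
    if y_pred == y_true:
        return  # no mistake, no update
    # Smallest tau that fixes the mistake with margin 1 (minimal change
    # to w) ... then capped at C so one example can't move w too far.
    tau = ((weights[y_pred] - weights[y_true]).dot(f) + 1.0) / (2.0 * f.dot(f))
    weights[y_true] += min(C, tau) * f
    weights[y_pred] -= min(C, tau) * f
```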

Page 6: Support Vector Machines

• Maximizing the margin: good according to intuition, theory, practice

SVM

Page 7: Parametric / Non-parametric

• Parametric models:
– Fixed set of parameters
– More data means better settings

• Non-parametric models:
– Complexity of the classifier increases with data
– Better in the limit, often worse in the non-limit

• (K)NN is non-parametric

Page 8: Basic Similarity

• Many similarities based on feature dot products:

• If features are just the pixels:

• Note: not all similarities are of this form
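Both missing formulas here are plain dot products; a tiny sketch (the arrays and values are mine):

```python
import numpy as np

def dot_similarity(fx, fy):
    """sim(x, y) = f(x) . f(y): large when many active features overlap."""
    return float(np.dot(fx, fy))

# If the features are just the pixels, f(x) is the flattened image and the
# similarity counts (intensity-weighted) overlapping pixels.
img_x = np.array([[0, 1], [1, 0]], dtype=float).ravel()
img_y = np.array([[0, 1], [0, 1]], dtype=float).ravel()
print(dot_similarity(img_x, img_y))  # 1.0: one shared "on" pixel
```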

Page 9: Perceptron Weights

• What is the final value of a weight w_y of a perceptron?
– Can it be any real vector?
– No! It’s built by adding up inputs

• Can reconstruct weight vectors (the primal representation) from update counts (the dual representation)

primal: $w_y = 0 + f(x_1) - f(x_5) + \cdots$

dual: $w_y = \sum_i \alpha_{i,y}\, f(x_i)$

Page 10: Dual Perceptron

• Start with zero counts (alpha)
• Pick up training instances one by one
• Try to classify x_n
• If correct, no change!
• If wrong: lower count of the wrong class (for this instance), raise count of the right class (for this instance)
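The same loop as a runnable sketch (NumPy; precomputing the Gram matrix of dot products and fixing the number of passes are my additions):

```python
import numpy as np

def dual_perceptron(X, labels, classes, passes=5):
    """Dual perceptron: learn per-(instance, class) counts alpha
    instead of explicit weight vectors w_y."""
    n = len(X)
    alpha = {c: np.zeros(n) for c in classes}  # start with zero counts
    G = X @ X.T                                # all pairwise dot products
    for _ in range(passes):
        for i in range(n):
            # score(c, x_i) = sum_j alpha[c][j] * (x_j . x_i)
            pred = max(classes, key=lambda c: alpha[c] @ G[:, i])
            if pred != labels[i]:              # correct -> no change
                alpha[pred][i] -= 1            # lower count of wrong class
                alpha[labels[i]][i] += 1       # raise count of right class
    return alpha
```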

Page 11: Outline

• The kernel trick

• Neural networks

• Clustering (k-means)


Page 12: Kernelized Perceptron

• If we had a black box (kernel) which told us the dot product of two examples x and y:
– Could work entirely with the dual representation
– No need to ever take dot products (“kernel trick”)

• Like nearest neighbor – work with black-box similarities

• Downside: slow if many examples get nonzero alpha

Page 13: Kernelized Perceptron Structure

$\mathrm{score}(y, x) = \sum_i \alpha_{i,y}\, K(x_i, x)$
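Read literally: to score a class, sum the kernel value against every stored training example, weighted by that example’s count. A minimal sketch (all names are mine):

```python
def kernel_score(alpha, K, train_x, c, x):
    """score(c, x) = sum_i alpha[c][i] * K(x_i, x)."""
    return sum(a * K(xi, x) for a, xi in zip(alpha[c], train_x))

def kernel_classify(alpha, K, train_x, classes, x):
    """Predict the class with the highest kernelized score."""
    return max(classes, key=lambda c: kernel_score(alpha, K, train_x, c, x))
```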

Page 14: Kernels: Who Cares?

• So far: a very strange way of doing a very simple calculation

• “Kernel trick”: we can substitute many similarity functions in place of the dot product

• Lets us learn new kinds of hypotheses

Page 15: Non-Linear Separators

• Data that is linearly separable (with some noise) works out great:

• But what are we going to do if the dataset is just too hard?

• How about… mapping data to a higher-dimensional space:

This and next few slides adapted from Ray Mooney, UT

Page 16: Non-Linear Separators

• General idea: the original feature space can always be mapped to some higher-dimensional feature space where the training set is separable:

Page 17: Some Kernels

• Kernels implicitly map original vectors to higher dimensional spaces, take the dot product there, and hand the result back

• Linear kernel:

• Quadratic kernel:

• RBF: infinite dimensional representation

• Mercer’s theorem: if, for every finite set of points, the matrix A with A_{i,j} = K(x_i, x_j) is positive semi-definite, then K can be represented as an inner product in some higher-dimensional space
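The slide names the kernels without formulas; these are the usual closed forms (the +1 in the quadratic kernel and the sigma parameter are conventional choices, not from the slide):

```python
import numpy as np

def linear_kernel(x, y):
    return float(np.dot(x, y))

def quadratic_kernel(x, y):
    # equals a dot product of (implicit) degree-2 feature vectors
    return (np.dot(x, y) + 1.0) ** 2

def rbf_kernel(x, y, sigma=1.0):
    # Gaussian / RBF: corresponds to an infinite-dimensional feature map
    d = np.subtract(x, y)
    return float(np.exp(-d.dot(d) / (2.0 * sigma ** 2)))
```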

Page 18: Why Kernels?

• Can’t you just add these features on your own?
– Yes, in principle, just compute them
– No need to modify any algorithms
– But the number of features can get large (or infinite)

• Kernels let us compute with these features implicitly
– Example: the implicit dot product in the quadratic kernel takes much less space and time per dot product
– Of course, there’s a cost to using the pure dual algorithms: you need to compute the similarity to every training datum
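A quick numeric check of that implicitness claim: for two-dimensional x and y, the quadratic kernel (x·y + 1)² equals an ordinary dot product of explicit 6-dimensional feature vectors that the kernel never builds (this particular feature map is one standard choice, not from the slide):

```python
import numpy as np

def phi2(x):
    """Explicit degree-2 feature map whose dot product is (x.y + 1)**2."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 * x1, x2 * x2, np.sqrt(2) * x1 * x2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])
assert np.isclose((np.dot(x, y) + 1.0) ** 2, np.dot(phi2(x), phi2(y)))  # both 25.0
```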

Page 19: Outline

• The kernel trick

• Neural networks

• Clustering (k-means)


Page 20: Neuron

• Neuron = perceptron with a non-linear activation function

$a_i = g\left(\sum_j W_{j,i}\, a_j\right)$


Page 21: Activation function

• (a) step function or threshold function
– perceptron

• (b) sigmoid function $1/(1+e^{-x})$
– logistic regression

• Does the range of $g(in_i)$ matter?
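The two activation functions as code (a small sketch; thresholding at zero is an assumption):

```python
import math

def step(z):
    """(a) threshold / step activation: the perceptron."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """(b) sigmoid activation: smooth, and its range is (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))
```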


Page 22: Implementation by perceptron

• AND, OR, and NOT can be implemented

• XOR cannot be implemented (by a single perceptron; a two-layer sketch follows)
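To see why one hidden layer fixes this: XOR is AND(OR, NAND), and each of those three gates is a single perceptron. The weights below are one standard choice, not from the slide:

```python
def step(z):
    return 1 if z > 0 else 0

def xor(x1, x2):
    h_or = step(x1 + x2 - 0.5)        # hidden unit 1: OR
    h_nand = step(1.5 - x1 - x2)      # hidden unit 2: NAND
    return step(h_or + h_nand - 1.5)  # output unit: AND of the two

assert [xor(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```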


Page 23: Neural networks

• Feed-forward networks
– single-layer (perceptrons)
– multi-layer (perceptrons)
• can represent any boolean function
– Learning
• Minimize squared error of prediction
• Gradient descent: backpropagation algorithm

• Recurrent networks
– Hopfield networks have symmetric weights $W_{i,j} = W_{j,i}$
– Boltzmann machines use stochastic activation functions


Page 24: Feed-forward example

• $a_5 = g_5(W_{3,5}\, a_3 + W_{4,5}\, a_4) = g_5(W_{3,5}\, g_3(W_{1,3}\, a_1 + W_{2,3}\, a_2) + W_{4,5}\, g_4(W_{1,4}\, a_1 + W_{2,4}\, a_2))$
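The same composition as code (a sketch: one shared sigmoid stands in for g_3, g_4, g_5, and the weights live in a dict keyed by edge; both are my choices):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

def forward(a1, a2, W):
    """Two inputs -> two hidden units (3, 4) -> one output (5)."""
    a3 = g(W[(1, 3)] * a1 + W[(2, 3)] * a2)
    a4 = g(W[(1, 4)] * a1 + W[(2, 4)] * a2)
    return g(W[(3, 5)] * a3 + W[(4, 5)] * a4)  # a5
```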


Page 25: Outline

• The kernel trick

• Neural networks

• Clustering (k-means)


Page 26: Recap: Classification

• Classification systems:
– Supervised learning
– Make a prediction given evidence
– We’ve seen several methods for this
– Useful when you have labeled data

Page 27: Clustering

• Clustering systems:
– Unsupervised learning
– Detect patterns in unlabeled data
• e.g. group emails or search results
• e.g. find categories of customers
• e.g. detect anomalous program executions
– Useful when you don’t know what you’re looking for
– Requires data, but no labels
– Often get gibberish

Page 28: Clustering

• Basic idea: group together similar instances
• Example: 2D point patterns

• What could “similar” mean?
– One option: small (squared) Euclidean distance

$\mathrm{dist}(x, y) = (x - y)^\top (x - y) = \sum_i (x_i - y_i)^2$

Page 29: K-Means

• An iterative clustering algorithm
– Pick K random points as cluster centers (means)
– Alternate:
• Assign data instances to closest mean
• Assign each mean to the average of its assigned points
– Stop when no points’ assignments change
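That loop, transcribed directly (a minimal NumPy sketch; means are initialized by sampling K data points and the stopping test is on assignments, as above):

```python
import numpy as np

def kmeans(X, k, seed=0):
    """Plain k-means on an (n, d) array X; returns (means, assignments)."""
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = None
    while True:
        # Phase I: assign each instance to its closest mean
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                      # no assignment changed: done
        assign = new_assign
        # Phase II: move each mean to the average of its assigned points
        for j in range(k):
            if (assign == j).any():    # guard against empty clusters
                means[j] = X[assign == j].mean(axis=0)
    return means, assign
```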

Page 30: K-Means Example

Page 31: Example: K-Means

• [web demo]
– http://www.cs.washington.edu/research/imagedatabase/demo/kmcluster/

Page 32: K-Means as Optimization

• Consider the total distance to the means:

$\phi(\{x_i\}, \{a_i\}, \{c_k\}) = \sum_i \mathrm{dist}(x_i, c_{a_i})$  (points $x_i$, assignments $a_i$, means $c_k$)

• Each iteration reduces phi
• Two stages each iteration:
– Update assignments: fix means c, change assignments a
– Update means: fix assignments a, change means c
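The objective as code, so each stage can be checked to decrease it (array names are mine):

```python
import numpy as np

def phi(X, assign, means):
    """Total squared distance from every point to its assigned mean."""
    return sum(float(((x - means[a]) ** 2).sum()) for x, a in zip(X, assign))
```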

Page 33: Phase I: Update Assignments

• For each point, re-assign to closest mean:

• Can only decrease total distance phi!

Page 34: Phase II: Update Means

• Move each mean to the average of its assigned points

• Also can only decrease total distance…(Why?)

• Fun fact: the point y with minimum squared Euclidean distance to a set of points {x} is their mean
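One line of calculus behind the fun fact (the standard argument, not on the slide): setting the gradient of the total squared distance to zero,

$$\nabla_y \sum_i \|x_i - y\|^2 = \sum_i 2\,(y - x_i) = 0 \;\Longrightarrow\; y = \frac{1}{n}\sum_i x_i,$$

so the average of the assigned points minimizes each cluster’s contribution, and Phase II can only decrease the total distance.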

Page 35: Initialization

• K-means is non-deterministic
– Requires initial means
– It does matter what you pick!
– What can go wrong?
– Various schemes for preventing this kind of thing: variance-based split / merge, initialization heuristics

Page 36: K-Means Getting Stuck

• A local optimum:

Page 37: K-Means Questions

• Will K-means converge?
– To a global optimum?

• Will it always find the true patterns in the data?
– If the patterns are very very clear?

• Will it find something interesting?

• Do people ever use it?

• How many clusters to pick?

Page 38: Agglomerative Clustering

• Agglomerative clustering:
– First merge very similar instances
– Incrementally build larger clusters out of smaller clusters

• Algorithm:
– Maintain a set of clusters
– Initially, each instance in its own cluster
– Repeat:
• Pick the two closest clusters
• Merge them into a new cluster
• Stop when there is only one cluster left

• Produces not one clustering, but a family of clusterings represented by a dendrogram
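The algorithm, transcribed as a hedged sketch (single-link distance, defined inline, just makes it runnable; the next page lists the alternatives):

```python
import numpy as np

def single_link(A, B):
    """Cluster distance = closest pair of points."""
    return min(np.linalg.norm(np.subtract(a, b)) for a in A for b in B)

def agglomerate(points, linkage=single_link):
    """Merge the two closest clusters until one remains; the recorded
    merges are the family of clusterings a dendrogram represents."""
    clusters = [[p] for p in points]   # initially, one instance per cluster
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters)
                    if k not in (i, j)] + [merges[-1]]
    return merges
```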

Page 39: Agglomerative Clustering

• How should we define “closest” for clusters with multiple elements?

• Many options:
– Closest pair (single-link clustering)
– Farthest pair (complete-link clustering)
– Average of all pairs
– Ward’s method (min variance, like k-means)

• Different choices create different clustering behaviors

Page 40: Clustering Application