
Kernel Methods

Barnabás Póczos
University of Alberta

Oct 1, 2009


2

Outline
• Quick Introduction
• Feature space
• Perceptron in the feature space
• Kernels
• Mercer’s theorem
  • Finite domain
  • Arbitrary domain
• Kernel families
• Constructing new kernels from kernels
• Constructing feature maps from kernels
• Reproducing Kernel Hilbert Spaces (RKHS)
• The Representer Theorem


3

Ralf Herbrich: Learning Kernel Classifiers Chapter 2


Quick Overview


5

Hard 1-dimensional Dataset

[Figure: data points on the real line around x = 0, with the positive “plane” and the negative “plane” marked]

taken from Andrew W. Moore; CMU + Nello Cristianini, Ron Meir, Ron Parr

• If the data set is not linearly separable, then by adding new features (mapping the data into a higher-dimensional feature space) the data might become linearly separable.

• m points in general position in an (m−1)-dimensional space are always linearly separable by a hyperplane ⇒ it is good to map the data to high-dimensional spaces

(For example 4 points in 3D)


6

Hard 1-dimensional Dataset

Make up a new feature!

Sort of… … computed from original feature(s)

x=0

z_k = (x_k, x_k²)

Separable! MAGIC!

Now drop this “augmented” data into our linear SVM.

taken from Andrew W. Moore; CMU + Nello Cristianini, Ron Meir, Ron Parr
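A quick numerical illustration of this trick (my own numpy sketch, not from the slides; the data values and the threshold x² = 2 are made up):

    import numpy as np

    # Hypothetical 1-D data: positives lie far from 0, negatives close to 0,
    # so no single threshold on x separates the two classes.
    x = np.array([-3.0, -2.0, 2.0, 3.0, -0.5, 0.0, 0.5])
    y = np.array([+1, +1, +1, +1, -1, -1, -1])

    # New feature: z_k = (x_k, x_k^2).  In the (x, x^2) plane the classes
    # are separated by the horizontal line x^2 = 2.
    z = np.column_stack([x, x ** 2])
    print(np.all((z[:, 1] > 2) == (y > 0)))   # True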


7

Feature mapping

• m points in general position in an (m−1)-dimensional space are always linearly separable by a hyperplane ⇒ it is good to map the data to high-dimensional spaces

• Having m training points, is it always enough to map the data into a feature space of dimension m−1?

• Nope... We have to think about the test data as well! Even if we don’t know how many test points we will have...

• We might want to map our data to a huge (∞) dimensional feature space

• Overfitting? Generalization error?... We don’t care now...


8

Feature mapping, but how???



9

Observation

Several algorithms use the inner products only, not the feature values themselves! E.g. Perceptron, SVM, Gaussian Processes...


10

The Perceptron


11

SVM

Maximize
    Σ_{k=1}^{R} α_k  −  (1/2) Σ_{k=1}^{R} Σ_{l=1}^{R} α_k α_l Q_kl ,    where  Q_kl = y_k y_l (x_k · x_l)

Subject to these constraints:
    0 ≤ α_k ≤ C   ∀k
    Σ_{k=1}^{R} α_k y_k = 0
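Note that Q is built entirely from inner products x_k · x_l, so replacing them by any kernel evaluation k(x_k, x_l) is the only change needed. A minimal numpy sketch of building Q (mine, not from the slides; the toy data and the polynomial kernel at the end are just for illustration):

    import numpy as np

    def build_Q(X, y, kernel=lambda a, b: a @ b):
        # Q[k, l] = y_k * y_l * kernel(x_k, x_l); the default kernel gives
        # exactly the linear-SVM Q of this slide.
        R = X.shape[0]
        K = np.array([[kernel(X[k], X[l]) for l in range(R)] for k in range(R)])
        return np.outer(y, y) * K

    X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([+1.0, -1.0, +1.0])
    print(build_Q(X, y))                                          # linear kernel
    print(build_Q(X, y, kernel=lambda a, b: (a @ b + 1.0) ** 2))  # polynomial kernel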


12

Inner products

So we need the inner product between

and

Looks ugly, and needs lots of computation...

Can’t we just say that let


13

Finite example


14

Finite example

Lemma:

Proof:


15

Finite example

Choose 7 2D points

Choose a kernel k

[Figure: the 7 chosen points plotted in the plane, labeled 1–7]

G =
    1.0000    0.8131    0.9254    0.9369    0.9630    0.8987    0.9683
    0.8131    1.0000    0.8745    0.9312    0.9102    0.9837    0.9264
    0.9254    0.8745    1.0000    0.8806    0.9851    0.9286    0.9440
    0.9369    0.9312    0.8806    1.0000    0.9457    0.9714    0.9857
    0.9630    0.9102    0.9851    0.9457    1.0000    0.9653    0.9862
    0.8987    0.9837    0.9286    0.9714    0.9653    1.0000    0.9779
    0.9683    0.9264    0.9440    0.9857    0.9862    0.9779    1.0000


16

[U,D] = svd(G),   U D Uᵀ = G,   U Uᵀ = I

U =
   -0.3709    0.5499    0.3392    0.6302    0.0992   -0.1844   -0.0633
   -0.3670   -0.6596   -0.1679    0.5164    0.1935    0.2972    0.0985
   -0.3727    0.3007   -0.6704   -0.2199    0.4635   -0.1529    0.1862
   -0.3792   -0.1411    0.5603   -0.4709    0.4938    0.1029   -0.2148
   -0.3851    0.2036   -0.2248   -0.1177   -0.4363    0.5162   -0.5377
   -0.3834   -0.3259   -0.0477   -0.0971   -0.3677   -0.7421   -0.2217
   -0.3870    0.0673    0.2016   -0.2071   -0.4104    0.1628    0.7531

D =
    6.6315         0         0         0         0         0         0
         0    0.2331         0         0         0         0         0
         0         0    0.1272         0         0         0         0
         0         0         0    0.0066         0         0         0
         0         0         0         0    0.0016         0         0
         0         0         0         0         0     0.000         0
         0         0         0         0         0         0     0.000


17

Mapped points = sqrt(D)*Uᵀ

Mapped points =
   -0.9551   -0.9451   -0.9597   -0.9765   -0.9917   -0.9872   -0.9966
    0.2655   -0.3184    0.1452   -0.0681    0.0983   -0.1573    0.0325
    0.1210   -0.0599   -0.2391    0.1998   -0.0802   -0.0170    0.0719
    0.0511    0.0419   -0.0178   -0.0382   -0.0095   -0.0079   -0.0168
    0.0040    0.0077    0.0185    0.0197   -0.0174   -0.0146   -0.0163
   -0.0011    0.0018   -0.0009    0.0006    0.0032   -0.0045    0.0010
   -0.0002    0.0004    0.0007   -0.0008   -0.0020   -0.0008    0.0028
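The same construction in a few lines of numpy (my own sketch; eigh replaces svd for a symmetric matrix, clipping tiny negative eigenvalues is just a numerical safeguard, and the small Gaussian Gram matrix below is made up, the 7×7 G above would work the same way):

    import numpy as np

    def mercer_map(G):
        # Eigendecompose the symmetric Gram matrix: G = U D U^T.
        eigvals, U = np.linalg.eigh(G)
        D = np.clip(eigvals, 0.0, None)       # clip tiny negative round-off
        return np.sqrt(D)[:, None] * U.T      # column i is the mapped point phi(x_i)

    # Made-up example with a small Gaussian Gram matrix.
    pts = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0], [2.0, 1.0]])
    sq_dists = ((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq_dists / 4.0)

    Phi = mercer_map(G)
    print(np.allclose(Phi.T @ Phi, G))        # True: inner products reproduce G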


18

Roadmap I

We need feature maps
• Implicit (kernel functions)
• Explicit (feature maps)

Several algorithms need the inner products of features only!

It is much easier to use implicit feature maps (kernels).

Is it a kernel function???


19

Roadmap II

Is it a kernel function???

Finite domain:    SVD, eigenvectors, eigenvalues ⇒ positive semi-def. matrices ⇒ finite-dim feature space
Arbitrary domain: Mercer’s theorem, eigenfunctions, eigenvalues ⇒ positive semi-def. integral operators ⇒ infinite-dim feature space (l2)

We have to think about the test data as well...

If the kernel is pos. semi def. ⇒ feature map construction


20

Mercer’s theorem

(*)

2 variables 1 variable


21

Mercer’s theorem

...


22

Roadmap III

We want to know which functions are kernels
• How to make new kernels from old kernels?
• The polynomial kernel:

We will show another way using RKHS:

Inner product=???


Ready for the details? ;)


24

Hard 1-dimensional Dataset

What would SVMs do with this data?

Not a big surprise

[Figure: the same 1-D data around x = 0, with the positive “plane” and the negative “plane” marked]

Doesn’t look like slack variables will save us this time…

taken from Andrew W. Moore


25

Hard 1-dimensional Dataset

Make up a new feature!

Sort of… … computed from original feature(s)

x=0

z_k = (x_k, x_k²)

New features are sometimes called basis functions.

Separable! MAGIC!

Now drop this “augmented” data into our linear SVM.

taken from Andrew W. Moore


26

Hard 2-dimensional Dataset

[Figure: a 2-D data set of X and O points that is not linearly separable]

Let us map this point to the 3rd dimension...


27

Kernels and Linear Classifiers

We will use linear classifiers in this feature space.


28

Picture is taken from R. Herbrich


29

Picture is taken from R. Herbrich


30

Kernels and Linear Classifiers

Feature functions


31

Back to the Perceptron Example


32

The Perceptron

• The primal algorithm in the feature space


33

The primal algorithm in the feature space

Picture is taken from R. Herbrich


34

The Perceptron


35

The Perceptron

The Dual Algorithm in the feature space


36

The Dual Algorithm in the feature space

Picture is taken from R. Herbrich


37

The Dual Algorithm in the feature space
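The algorithm itself is in the (missing) figure from Herbrich; below is my own rough numpy sketch of a dual perceptron of this kind, which keeps one coefficient α_i per training point and touches the data only through kernel values (no bias term, for brevity):

    import numpy as np

    def dual_perceptron(K, y, epochs=10):
        # K[i, j] = k(x_i, x_j) is the Gram matrix, y holds +/-1 labels.
        m = len(y)
        alpha = np.zeros(m)
        for _ in range(epochs):
            for i in range(m):
                # The decision value uses kernel evaluations only,
                # never the feature vectors phi(x) explicitly.
                f_i = np.sum(alpha * y * K[:, i])
                if y[i] * f_i <= 0:        # mistake -> bump this point's coefficient
                    alpha[i] += 1.0
        return alpha

    # Toy usage with a linear kernel on a separable 1-D problem.
    X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
    y = np.array([-1.0, -1.0, 1.0, 1.0])
    K = X @ X.T
    alpha = dual_perceptron(K, y)
    print(np.sign((alpha * y) @ K))        # recovers the training labels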


38

Kernels

Definition: (kernel)


39

Kernels

Definition: (Gram matrix, kernel matrix)

Definition: (Feature space, kernel space)


40

Kernel technique

Lemma:

The Gram matrix is a symmetric, PSD matrix.

Proof:

Definition:
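The proof is in the (missing) figure; the key point is that for any coefficients c we have cᵀGc = ‖Σ_i c_i φ(x_i)‖² ≥ 0. A quick numerical illustration of the lemma (my own sketch, with an arbitrary explicit feature map):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 3))                         # 6 arbitrary points in R^3
    phi = lambda v: np.concatenate([v, v ** 2, [1.0]])  # some explicit feature map

    G = np.array([[phi(a) @ phi(b) for b in X] for a in X])

    print(np.allclose(G, G.T))                          # symmetric
    print(np.all(np.linalg.eigvalsh(G) >= -1e-10))      # all eigenvalues >= 0 (PSD)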


41

Kernel technique

Key idea:


42

Kernel technique


43

Finite example


44

Finite example

Lemma:

Proof:


45

Kernel technique, Finite example

We have seen:

Lemma:

These conditions are necessary


46

Kernel technique, Finite example

Proof: ... wrong in Herbrich’s book...


47

Kernel technique, Finite example

Summary:

How to generalize this to general sets???


48

Integral operators, eigenfunctions

Definition: Integral operator with kernel k(.,.)

Remark:


49

From Vector domain to Functions

• Observe that each vector v = (v[1], v[2], ..., v[n]) is a mapping from the integers {1, 2, ..., n} to ℝ

• We can generalize this easily to an INFINITE domain:

w = (w[1], w[2], ..., w[n], ...) where w is a mapping from {1, 2, ...} to ℝ


50

From Vector domain to Functions

From integers we can further extend to

• ℝ or ℝ^m
• Strings
• Graphs
• Sets
• Whatever
• …


51

Lp and lp spaces


Picture is taken from R. Herbrich


52

Lp and lp spaces

Picture is taken from R. Herbrich


53

L2 and l2 special cases

Picture is taken from R. Herbrich


54

Kernels

Definition: inner product, Hilbert spaces


55

Integral operators, eigenfunctions

Definition: Eigenvalue, Eigenfunction


56

Positive (semi) definite operators

Definition: Positive Definite Operator


57

Mercer’s theorem

(*)

2 variables 1 variable


58

Mercer’s theorem

...


59

A nicer characterization

Theorem: nicer kernel characterization


60

Kernel Families

• Kernels have the intuitive meaning of a similarity measure between objects.

• So far we have seen two ways of making a linear classifier nonlinear in the input space:

  • (explicit) Choosing a mapping φ ⇒ Mercer kernel k

  • (implicit) Choosing a Mercer kernel k ⇒ Mercer map φ


61

Designing new kernels from kernels

are also kernels.

Picture is taken from R. Herbrich
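The list of constructions is in the (missing) figure; among the standard ones are sums, products and positive scalings of kernels. A small numerical spot-check of these three cases (my own sketch, not from the slides):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(8, 2))

    k1 = lambda a, b: a @ b                           # linear kernel
    k2 = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # Gaussian kernel

    gram = lambda k: np.array([[k(a, b) for b in X] for a in X])
    G1, G2 = gram(k1), gram(k2)

    for G in (G1 + G2,        # k1 + k2
              G1 * G2,        # k1 * k2  (elementwise, i.e. the Schur product)
              3.0 * G1):      # c * k1 with c > 0
        print(np.linalg.eigvalsh(G).min() >= -1e-10)  # still positive semi-definite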


62

Designing new kernels from kernels

Picture is taken from R. Herbrich


63

Designing new kernels from kernels


64

Kernels on inner product spaces

Note:


65

Picture is taken from R. Herbrich


66

Common Kernels

• Polynomials of degree d

• Polynomials of degree up to d

• Sigmoid

• Gaussian kernels

Equivalent to φ(x) of infinite dimensionality!

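For reference, the usual textbook forms of the kernels named above, written as plain functions (a sketch; the parameter names d, kappa, theta, sigma are the conventional ones, not taken from the slide, and the sigmoid kernel is in fact not PSD for every parameter choice):

    import numpy as np

    def poly_kernel(x, z, d=2):                 # polynomials of degree d
        return (x @ z) ** d

    def poly_up_to_kernel(x, z, d=2):           # polynomials of degree up to d
        return (x @ z + 1.0) ** d

    def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
        return np.tanh(kappa * (x @ z) + theta)

    def gaussian_kernel(x, z, sigma=1.0):       # RBF; its phi(x) is infinite dimensional
        return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))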


67

The RBF kernel

Note:

Proof:


68

The RBF kernel

Note:

Note:

Proof:


69

The Polynomial kernel


70

Reminder: Hard 1-dimensional Dataset

Make up a new feature!

Sort of… … computed from original feature(s)

x=0

z_k = (x_k, x_k²)

New features are sometimes called basis functions.

Separable! MAGIC!

Now drop this “augmented” data into our linear SVM.

taken from Andrew W. Moore


71

… New Features from Old …

• Here: mapped ℜ → ℜ² by Φ: x → [x, x²]

• Found “extra dimensions” ⇒ linearly separable!

• In general,
  • Start with vector x ∈ ℜ^N
  • Want to add in x_1², x_2², …
  • Probably want other terms – e.g. x_2 ⋅ x_7, …
  • Which ones to include? Why not ALL OF THEM???


72

Special Case

• x = (x_1, x_2, x_3) → (1, x_1, x_2, x_3, x_1², x_2², x_3², x_1x_2, x_1x_3, x_2x_3)

• ℜ³ → ℜ¹⁰, N = 3, n = 10

In general, the dimension of the quadratic map:

N → 1 + N + N + N(N−1)/2 = (N+1)(N+2)/2 ≈ N²/2

taken from Andrew W. Moore


73

Quadratic Basis Functions

Let

Φ(x) = ( 1,                                        ← Constant Term
         √2 x_1, √2 x_2, …, √2 x_N,                 ← Linear Terms
         x_1², x_2², …, x_N²,                       ← Pure Quadratic Terms
         √2 x_1x_2, √2 x_1x_3, …, √2 x_(N−1)x_N )   ← Quadratic Cross-Terms

What about those √2 ’s?? … stay tuned

taken from Andrew W. Moore


74

Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1}^N a_i b_i + Σ_{i=1}^N a_i² b_i² + 2 Σ_{i=1}^N Σ_{j=i+1}^N a_i a_j b_i b_j

taken from Andrew W. Moore


75

Quadratic Dot Products

Φ(a)·Φ(b) = 1 + 2 Σ_{i=1}^N a_i b_i + Σ_{i=1}^N a_i² b_i² + 2 Σ_{i=1}^N Σ_{j=i+1}^N a_i a_j b_i b_j

Now consider another fn of a and b:

(a·b + 1)² = (a·b)² + 2 a·b + 1
           = ( Σ_{i=1}^N a_i b_i )² + 2 Σ_{i=1}^N a_i b_i + 1
           = Σ_{i=1}^N Σ_{j=1}^N a_i b_i a_j b_j + 2 Σ_{i=1}^N a_i b_i + 1
           = Σ_{i=1}^N a_i² b_i² + 2 Σ_{i=1}^N Σ_{j=i+1}^N a_i a_j b_i b_j + 2 Σ_{i=1}^N a_i b_i + 1

They’re the same!

And this is only O(N) to compute… not O(N²)

taken from Andrew W. Moore
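This identity is easy to check numerically; a short sketch (mine, not from the slides) that builds the explicit quadratic feature vector and compares both sides for random vectors:

    import numpy as np

    def quad_features(x):
        # Moore's quadratic basis: constant, sqrt(2)*linear, squares, sqrt(2)*cross terms.
        n = len(x)
        cross = [np.sqrt(2.0) * x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
        return np.concatenate([[1.0], np.sqrt(2.0) * x, x ** 2, cross])

    rng = np.random.default_rng(2)
    a, b = rng.normal(size=5), rng.normal(size=5)

    lhs = quad_features(a) @ quad_features(b)   # O(N^2) work: explicit feature map
    rhs = (a @ b + 1.0) ** 2                    # O(N) work: kernel evaluation
    print(np.isclose(lhs, rhs))                 # True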


76

Higher Order Polynomials

Q_kl = y_k y_l (x_k · x_l),   with N = input dimension and m = number of training points

Polynomial | φ(x)                            | Cost to build Q_kl matrix: traditional | Cost if N=100 dim inputs | φ(a)·φ(b) | Cost to build Q_kl matrix: sneaky | Cost if 100 dim inputs
Quadratic  | All N²/2 terms up to degree 2   | N² m² / 4                              | 2 500 m²                 | (a·b+1)²  | N m² / 2                          | 50 m²
Cubic      | All N³/6 terms up to degree 3   | N³ m² / 12                             | 83 000 m²                | (a·b+1)³  | N m² / 2                          | 50 m²
Quartic    | All N⁴/24 terms up to degree 4  | N⁴ m² / 48                             | 1 960 000 m²             | (a·b+1)⁴  | N m² / 2                          | 50 m²

taken from Andrew W. Moore


77

The Polynomial kernel, General case

We are going to map these to a larger space

We want to show that this k is a kernel function


78

The Polynomial kernel, General case

P factors

We are going to map these to a larger space


79

The Polynomial kernel, General case

We already know:

We want to get k in this form:


80

The Polynomial kernel

For example

We already know:


81

The Polynomial kernel


82

The Polynomial kernel

⇒ k is really a kernel!


83

Reproducing Kernel Hilbert Spaces


84

RKHS, Motivation

Now, we show another way using RKHS

What objective do we want to optimize?

1.,

2.,


85

RKHS, Motivation

1st term, empirical loss 2nd term, regularization

3., How can we minimize the objective over functions???

• Be PARAMETRIC!!!...

(nope, we do not like that...)

• Use RKHS, and suddenly the problem will be finite dimensional optimization only (yummy...)

The Representer theorem will help us here


86

Reproducing Kernel Hilbert Spaces

Now, we show another way using RKHS

Completing (closing) a pre-Hilbert space ⇒ Hilbert space


87

Reproducing Kernel Hilbert Spaces

The inner product:

(*)


88

Reproducing Kernel Hilbert Spaces

Note:

Proof:

(*)


89

Reproducing Kernel Hilbert Spaces

Lemma:

• Pre-Hilbert space: Like the Euclidean space with rational scalars only

• Hilbert space: Like the Euclidean space with real scalars

Proof:


90

Reproducing Kernel Hilbert Spaces

Lemma: (Reproducing property)

Lemma: The constructed features match k

Huhh...


91

Reproducing Kernel Hilbert Spaces

Proof of property 4.,:

rep. property

CBS (Cauchy–Schwarz). For CBS we don’t need 4., we only need that <0,0> = 0!


92

Methods to Construct Feature Spaces

We now have two methods to construct feature maps from kernels.

Well, these feature spaces are all isomorphic with each other...


93

The Representer Theorem

In the perceptron problem we could use the dual algorithm, because we had this representation:


94

The Representer Theorem

Theorem:

1st term, empirical loss 2nd term, regularization


95

The Representer Theorem

Proof of Representer Theorem:

Message: Optimizing over general function classes is difficult, but in an RKHS it becomes only a finite (m-dimensional) problem!
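One concrete instance of how the theorem is used (my own example, not on the slides): with squared loss and the RKHS-norm regularizer, the optimum is f(·) = Σ_i α_i k(x_i, ·), and the coefficients α solve an ordinary m×m linear system (kernel ridge regression):

    import numpy as np

    def kernel_ridge_fit(K, y, lam=0.1):
        # Representer theorem: only f = sum_i alpha_i k(x_i, .) needs to be searched,
        # so the "optimization over functions" collapses to an m x m linear system.
        m = len(y)
        return np.linalg.solve(K + lam * np.eye(m), y)

    # Toy usage with a Gaussian kernel on 1-D data.
    X = np.linspace(0.0, 1.0, 8)
    y = np.sin(2.0 * np.pi * X)
    K = np.exp(-(X[:, None] - X[None, :]) ** 2 / 0.1)

    alpha = kernel_ridge_fit(K, y)
    k_new = np.exp(-(X - 0.35) ** 2 / 0.1)    # k(x_i, x_new) for a test point
    print(k_new @ alpha)                      # the predicted value f(0.35)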


96

Proof of the Representer Theorem

1st term, empirical loss 2nd term, regularization


97

1st term, empirical loss 2nd term, regularization

Proof of the Representer Theorem


98

Later will come
• Supervised Learning
  • SVM using kernels
  • Gaussian Processes
    • Regression
    • Classification
    • Heteroscedastic case
• Unsupervised Learning
  • Kernel Principal Component Analysis
  • Kernel Independent Component Analysis
  • Kernel Mutual Information
  • Kernel Generalized Variance
  • Kernel Canonical Correlation Analysis


99

If we still have time…

• Automatic Relevance Machines
• Bayes Point Machines
• Kernels on other objects
  • Kernels on graphs
  • Kernels on strings
• Fisher kernels
• ANOVA kernels
• Learning kernels


100

Thanks for the Attention!