
Intro ML: Tutorial on Kernel Methods



Gideon Dresdner¹, Vincent Fortuin¹

¹ETH Zurich


Motivation

1. We need to fit non-linear functions.

2. Map the data using a non-linear function, then perform a linear regression on the resulting "features".

3. We have a problem: the number of features explodes.


Solution — the “kernel trick”

1. "Only compute what you need."

2. Let $\phi(x_i)$ be the feature expansion. Don't compute $\phi(x_i)$ for every data point. Instead, find a shortcut to compute $\phi(x_i)^\top \phi(x_j)$ for every pair $(x_i, x_j)$.

3. The map $(x, y) \mapsto \phi(x)^\top \phi(y)$ is called a kernel.

4. The matrix $K_{ij} = \phi(x_i)^\top \phi(x_j)$ generated from all the data is called a kernel matrix, also referred to as a "Gram" matrix (a small numerical sketch follows after this list).
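To make the trick concrete, here is a minimal NumPy sketch (not from the slides) comparing the explicit feature expansion of the homogeneous quadratic kernel with the shortcut $k(x, y) = (x^\top y)^2$. The two Gram matrices agree, but the shortcut never materialises the $d^2$-dimensional features.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))           # n = 5 points in d = 3 dimensions

# Explicit feature expansion: phi(x) = all pairwise products x_i * x_j (d^2 features).
def phi(x):
    return np.outer(x, x).ravel()

Phi = np.stack([phi(x) for x in X])   # n x d^2 feature matrix
K_explicit = Phi @ Phi.T              # Gram matrix via explicit features

# Kernel shortcut: k(x, y) = (x^T y)^2, computed without ever building phi.
K_trick = (X @ X.T) ** 2

print(np.allclose(K_explicit, K_trick))  # True
```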


Example — kernelizing absolute value loss (1)

• $\arg\min_w \frac{1}{n} \sum_{i=1}^n |w^\top x_i - y_i| + \lambda \|w\|_2^2$

• Let $w = \sum_{i=1}^n \alpha_i x_i$.

• Rewrite in terms of $\vec{\alpha}$:

$w^\top x_i = \Big(\sum_{j=1}^n \alpha_j x_j\Big)^\top x_i = \sum_{j=1}^n \alpha_j x_j^\top x_i$  (1)

And,

$\|w\|_2^2 = w^\top w = \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j x_i^\top x_j$  (2)


Example — kernelizing absolute value loss (2)

• Putting this together, and substituting $w = \sum_{j=1}^n \alpha_j x_j$:

$\frac{1}{n} \sum_{i=1}^n |w^\top x_i - y_i| + \lambda \|w\|_2^2$  (3)

$= \frac{1}{n} \sum_{i=1}^n \Big|\sum_{j=1}^n \alpha_j x_j^\top x_i - y_i\Big| + \lambda \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j x_i^\top x_j$  (4)

• The $x_i$'s only appear within inner products! Let $K_{ij} := k(x_i, x_j)$ for some kernel function $k$.


Example — kernelizing absolute value loss (3)

• $\sum_{j=1}^n \alpha_j k(x_j, x_i) - y_i = \vec{\alpha}^\top K_i - y_i$, where $[K_i]_j := k(x_j, x_i)$

• $\sum_{i=1}^n |\vec{\alpha}^\top K_i - y_i| = \|\vec{\alpha}^\top K - \vec{y}\|_1$

• Thus,

$\arg\min_w \frac{1}{n} \sum_{i=1}^n |w^\top x_i - y_i| + \lambda \|w\|_2^2$  (5)

$= \arg\min_{\vec{\alpha}} \frac{1}{n} \|\vec{\alpha}^\top K - \vec{y}\|_1 + \lambda \vec{\alpha}^\top K \vec{\alpha}$  (6)
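As a quick sanity check (not from the slides), the kernelized objective in (6) can be evaluated directly from the Gram matrix. The sketch below assumes a linear kernel and simply computes the objective for a given coefficient vector, leaving the actual minimisation to whatever solver you prefer.

```python
import numpy as np

def kernelized_objective(alpha, K, y, lam):
    """Objective from eq. (6): (1/n) * ||K @ alpha - y||_1 + lam * alpha^T K alpha."""
    n = len(y)
    residual = K @ alpha - y          # K is symmetric, so K @ alpha == (alpha^T K)^T
    return np.abs(residual).sum() / n + lam * alpha @ K @ alpha

# Tiny example with a linear kernel K = X X^T.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
y = rng.normal(size=10)
K = X @ X.T

alpha = np.zeros(10)                  # some candidate coefficient vector
print(kernelized_objective(alpha, K, y, lam=0.1))
```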


How to derive kernel formulation

1. Assume a linear kernel.

2. Only consider solutions that are in the linear span of the data: $w^* = \sum_{i=1}^n \alpha_i x_i$.

3. Reformulate so that $X$ only appears as $X^\top X$.

4. Replace $X^\top X$ with the kernel matrix $K(X, X)_{ij} = k(x_i, x_j)$, also called the "Gram matrix" (see the prediction sketch after this list).
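One consequence of step 2 is that predictions also only need inner products: for a new point $x$, $w^{*\top} x = \sum_{i=1}^n \alpha_i x_i^\top x = \sum_{i=1}^n \alpha_i k(x_i, x)$. A minimal sketch of this prediction rule follows; the coefficient vector `alpha` is assumed to come from whatever kernelized solver was used.

```python
import numpy as np

def predict(alpha, X_train, x_new, kernel):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) without ever forming w explicitly."""
    return sum(a * kernel(x_i, x_new) for a, x_i in zip(alpha, X_train))

# With a linear kernel, the prediction equals w^T x for w = sum_i alpha_i x_i.
linear_kernel = lambda x, y: x @ y

X_train = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
alpha = np.array([0.5, -0.2, 0.1])

w = (alpha[:, None] * X_train).sum(axis=0)     # explicit w, for comparison only
x_new = np.array([2.0, 3.0])

print(predict(alpha, X_train, x_new, linear_kernel), w @ x_new)  # identical values
```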


Kernel Methods in practice

1. Determine the type of problem you have (regression, binary classification, multi-class classification, etc.).

2. Select an appropriate model (perceptron, SVM, etc.).

3. Construct a kernel.

4. Repeat (a short scikit-learn sketch follows after this list).
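As an illustration of this workflow (not from the slides), scikit-learn's `SVC` accepts a precomputed Gram matrix, so any kernel you construct can be plugged in. The sketch below uses the polynomial kernel $(x^\top x' + 1)^2$ from the exercises later in the tutorial purely as a stand-in.

```python
import numpy as np
from sklearn.svm import SVC

def poly_kernel(A, B):
    """Gram matrix for k(x, y) = (x^T y + 1)^2."""
    return (A @ B.T + 1.0) ** 2

# Toy binary classification data.
rng = np.random.default_rng(0)
X_train, X_test = rng.normal(size=(20, 2)), rng.normal(size=(5, 2))
y_train = (X_train[:, 0] * X_train[:, 1] > 0).astype(int)

clf = SVC(kernel="precomputed")
clf.fit(poly_kernel(X_train, X_train), y_train)           # n_train x n_train Gram matrix
predictions = clf.predict(poly_kernel(X_test, X_train))   # n_test x n_train cross-kernel
print(predictions)
```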


Kernel methods in practice — "with great power comes great responsibility"

Pros
1. Explicit control (and understanding) of the model.
2. Incorporate prior knowledge via kernel engineering.

Cons
1. We avoided a large $d$ (e.g. $d = \infty$), but now our solution lives in $\mathbb{R}^n$; there is lots of research on this scaling problem.
2. Kernels can be hard to design without a thorough understanding of the problem/dataset; there is lots of research on this kernel design problem.


Choose the right kernel

So you have decided to use some kernel-based method, say an SVM. How do you design a kernel?

• Sit and think very hard.

• Combine known kernels (that other people have laboriously constructed) to suit your particular problem.


Spectrum kernel for biological sequences

Let $x$, $y$ be two biological sequences (e.g. GATAACA). Define

$k(x, y) = \sum_{s_k} \#(x, s_k)\, \#(y, s_k)$  (7)

where the sum is over all substrings $s_k$ of length $k$, and $\#(x, s_k)$ counts the occurrences of $s_k$ in $x$.
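A minimal sketch of equation (7) (not from the slides): count all length-$k$ substrings in each sequence and take the inner product of the count vectors.

```python
from collections import Counter

def spectrum_kernel(x, y, k=3):
    """k(x, y) = sum over all length-k substrings s of #(x, s) * #(y, s)."""
    count_x = Counter(x[i:i + k] for i in range(len(x) - k + 1))
    count_y = Counter(y[i:i + k] for i in range(len(y) - k + 1))
    # Only substrings occurring in both sequences contribute to the sum.
    return sum(count_x[s] * count_y[s] for s in count_x.keys() & count_y.keys())

print(spectrum_kernel("GATAACA", "GATTACA", k=3))  # GAT and ACA are shared -> 2
```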


Example fits

[Figures: example fits and combining kernels]


Properties of Kernels

Symmetry and positive semi-definiteness

A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel if and only if it is symmetric and positive semi-definite, that is,

1. $k(x, x') = k(x', x)$
2. Any of the following equivalent statements holds:
   2.1 The kernel matrix $K$ computed on data $X \subset \mathcal{X}$ is positive semi-definite for all $X$, that is, $v^\top K v \ge 0 \;\; \forall v \in \mathbb{R}^n$.
   2.2 $\sum_i \sum_j c_i c_j k(x_i, x_j) \ge 0 \;\; \forall X \subset \mathcal{X},\; c_i, c_j \in \mathbb{R}$.
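A small numerical sketch of property 2.1 (not from the slides): build the Gram matrix for a kernel on some sampled data, then check symmetry and that all eigenvalues are non-negative up to floating-point tolerance. This only provides evidence on the sampled data, not a proof of validity.

```python
import numpy as np

def is_psd_gram(kernel, X, tol=1e-9):
    """Check symmetry and positive semi-definiteness of the Gram matrix on data X."""
    K = np.array([[kernel(x, y) for y in X] for x in X])
    symmetric = np.allclose(K, K.T)
    eigenvalues = np.linalg.eigvalsh(K)       # eigvalsh assumes a symmetric matrix
    return symmetric and eigenvalues.min() >= -tol

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))

print(is_psd_gram(lambda x, y: (x @ y + 1.0) ** 2, X))  # valid kernel -> True
```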


Properties of Kernels

Inner product in Hilbert space (Mercer 1909)

A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a valid kernel if and only if there exists a feature map $\phi : \mathcal{X} \to \mathcal{H}$ into a Hilbert space $\mathcal{H}$, such that

$k(x, x') = \phi(x)^\top \phi(x') \quad \forall x, x' \in \mathcal{X}.$


Kernel composition rules

Sum rule
If $k_1$ and $k_2$ are valid kernels on $\mathcal{X}$, then $k_1 + k_2$ is a valid kernel on $\mathcal{X}$.

Scaling rule
If $\lambda > 0$ and $k$ is a valid kernel on $\mathcal{X}$, then $\lambda k$ is a valid kernel on $\mathcal{X}$.

Product rule
If $k_1$ and $k_2$ are valid kernels on $\mathcal{X}$, then $k_1 k_2$ is a valid kernel on $\mathcal{X}$. If $k_1$ is a valid kernel on $\mathcal{X}_1$ and $k_2$ is a valid kernel on $\mathcal{X}_2$, then $k_1 k_2$ is a valid kernel on $\mathcal{X}_1 \times \mathcal{X}_2$.
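A minimal sketch (not from the slides) of how these rules translate into code: kernels are just functions, so sums, positive scalings, and products of valid kernels can be built with small combinators.

```python
import numpy as np

# Kernels as plain functions k(x, y); the composition rules become higher-order functions.
def sum_kernel(k1, k2):
    return lambda x, y: k1(x, y) + k2(x, y)

def scale_kernel(lam, k):
    assert lam > 0, "the scaling rule requires lambda > 0"
    return lambda x, y: lam * k(x, y)

def product_kernel(k1, k2):
    return lambda x, y: k1(x, y) * k2(x, y)

linear = lambda x, y: x @ y                                   # a valid (linear) kernel
quadratic = product_kernel(linear, linear)                    # product rule: (x^T y)^2
combined = sum_kernel(scale_kernel(0.5, linear), quadratic)   # scaling + sum rules

x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(combined(x, y))   # 0.5 * (x^T y) + (x^T y)^2
```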


Proving kernel validity

1. Proving that a kernel is valid:
   1.1 Prove symmetry (easy) and positive semi-definiteness (usually harder).
   1.2 Find an explicit feature map $\phi(x)$ such that $k(x, x') = \phi(x)^\top \phi(x')$.
   1.3 Derive the kernel from other valid ones using the composition rules.

2. Proving that a kernel is invalid:
   2.1 Find a counterexample against symmetry (might be easy).
   2.2 Find a counterexample against positive semi-definiteness (might be harder); see the small sketch after this list.
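As an illustration of strategy 2.2 (not from the slides), take the hypothetical candidate $k(x, x') = x^\top x' - 1$: it is symmetric, but on the single data point $x = 0$ the 1×1 Gram matrix equals $-1$, so it is not positive semi-definite and hence not a valid kernel. The sketch below confirms this numerically.

```python
import numpy as np

def candidate(x, y):
    """Symmetric, but NOT a valid kernel: k(x, y) = x^T y - 1."""
    return x @ y - 1.0

X = [np.zeros(2)]                                     # a single data point x = 0
K = np.array([[candidate(x, y) for y in X] for x in X])
print(K, np.linalg.eigvalsh(K).min())                 # [[-1.]] with eigenvalue -1 < 0
```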


Exercises

For $x, x' \in \mathbb{R}^d$ and $k(x, x') = (x^\top x' + 1)^2$, show that $k(\cdot, \cdot)$ is a valid kernel.


Exercises

$(x^\top x' + 1)^2 = \Big(\sum_{i=1}^d x_i x'_i + 1\Big)^2$

$= 1 + 2 \sum_i x_i x'_i + \sum_i \sum_j x_i x_j x'_i x'_j$

$= 1 + \sum_i (\sqrt{2}\, x_i)(\sqrt{2}\, x'_i) + \sum_i \sum_j (x_i x_j)(x'_i x'_j)$

Thus $k(x, x') = \phi(x)^\top \phi(x')$ with

$\phi(x) = \big[1, \sqrt{2}\, x_1, \ldots, \sqrt{2}\, x_d, x_1 x_1, x_1 x_2, \ldots, x_1 x_d, x_2 x_1, \ldots, x_d x_d\big]^\top$
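A quick numerical check of this feature map (not from the slides): build $\phi(x)$ explicitly and compare $\phi(x)^\top \phi(x')$ against $(x^\top x' + 1)^2$.

```python
import numpy as np

def phi(x):
    """Explicit feature map for k(x, x') = (x^T x' + 1)^2."""
    return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

rng = np.random.default_rng(0)
x, x_prime = rng.normal(size=4), rng.normal(size=4)

print(phi(x) @ phi(x_prime), (x @ x_prime + 1.0) ** 2)  # the two values agree
```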


Exercises

For $\tilde{k}(x, x') = f(k(x, x'))$, show that $\tilde{k}(\cdot, \cdot)$ is a valid kernel if $k(\cdot, \cdot)$ is a valid kernel and $f$ is a polynomial with non-negative coefficients.


Exercises

$\tilde{k}(x, x') = a_1 k(x, x')^{e_1} + a_2 k(x, x')^{e_2} + \ldots$

Proof:

• All the $k(x, x')^{e_i}$ are valid kernels by the product rule.

• Thus, all the $a_i k(x, x')^{e_i}$ are valid kernels by the scaling rule.

• Thus, $\tilde{k}(x, x')$ is a valid kernel by the sum rule.


Exercises

For $\tilde{k}(x, x') = f(x)\, k(x, x')\, f(x')$ with $f : \mathcal{X} \to \mathbb{R}$, show that $\tilde{k}(\cdot, \cdot)$ is a valid kernel if $k(\cdot, \cdot)$ is a valid kernel.


Exercises

If $k(\cdot, \cdot)$ is a valid kernel, we can write it as $k(x, x') = \phi(x)^\top \phi(x')$ for some $\phi(x)$. Thus

$\tilde{k}(x, x') = f(x)\, \phi(x)^\top \phi(x')\, f(x') = (f(x)\phi(x))^\top (f(x')\phi(x'))\,.$

With $\tilde{\phi}(x) = f(x)\phi(x)$, we can write $\tilde{k}(x, x') = \tilde{\phi}(x)^\top \tilde{\phi}(x')$, which makes it a valid kernel.


Exercises

For $\tilde{k}(x, x') = k(f(x), f(x'))$ with $f : \mathcal{X} \to \mathcal{X}$, show that $\tilde{k}(\cdot, \cdot)$ is a valid kernel if $k(\cdot, \cdot)$ is a valid kernel.


Exercises

If $k(\cdot, \cdot)$ is a valid kernel, we can write it as $k(x, x') = \phi(x)^\top \phi(x')$ for some $\phi(x)$. Thus

$\tilde{k}(x, x') = \phi(f(x))^\top \phi(f(x'))\,.$

With $\tilde{\phi}(x) = \phi(f(x))$, we can write $\tilde{k}(x, x') = \tilde{\phi}(x)^\top \tilde{\phi}(x')$, which makes it a valid kernel.


Exercises

For the data set $X = [(-3, 4), (1, 0)]$ and the feature map $\phi(x) = [x_1, x_2, \|x\|]^\top$, compute the kernel matrix $K$.


Exercises

$\phi((-3, 4)) = [-3, 4, 5]^\top$

$\phi((1, 0)) = [1, 0, 1]^\top$

$\phi((-3, 4))^\top \phi((-3, 4)) = 50$

$\phi((-3, 4))^\top \phi((1, 0)) = 2$

$\phi((1, 0))^\top \phi((1, 0)) = 2$

$K = \begin{bmatrix} 50 & 2 \\ 2 & 2 \end{bmatrix}$
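The same computation in a few lines of NumPy (not from the slides), as a check on the arithmetic:

```python
import numpy as np

def phi(x):
    """Feature map from the exercise: [x1, x2, ||x||]."""
    return np.array([x[0], x[1], np.linalg.norm(x)])

X = [np.array([-3.0, 4.0]), np.array([1.0, 0.0])]
Phi = np.stack([phi(x) for x in X])
K = Phi @ Phi.T
print(K)   # [[50.  2.], [ 2.  2.]]
```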


END OF PRESENTATION

BEGINNING OF Q&A
