
Lecture 3

Approximate Kernel Methods (Computational vs. Statistical Trade-off)

Bharath K. Sriperumbudur∗ & Dougal J. Sutherland†

∗Department of Statistics, Pennsylvania State University

†Gatsby Unit, University College London

Data Science Summer School, École Polytechnique

June 2019


Outline

- Motivating examples

- Kernel ridge regression and kernel PCA

- Approximation methods

- Computational vs. statistical trade-off


Kernel Ridge Regression: Feature Map and Kernel Trick

- Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

- Task: Find a regressor $f \in \mathcal{H}$ (some feature space) s.t. $f(x_i) \approx y_i$.

- Idea: Map $x_i$ to $\Phi(x_i)$ and do linear regression,
$$\min_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n \left( \langle f, \Phi(x_i) \rangle_{\mathcal{H}} - y_i \right)^2 + \lambda \|f\|_{\mathcal{H}}^2 \qquad (\lambda > 0).$$

- Solution: For $\Phi(X) := (\Phi(x_1), \ldots, \Phi(x_n)) \in \mathbb{R}^{\dim(\mathcal{H}) \times n}$ and $y := (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$,
$$f = \underbrace{\frac{1}{n}\left( \frac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(\mathcal{H})} \right)^{-1} \Phi(X)\, y}_{\text{primal}}
= \underbrace{\frac{1}{n}\Phi(X)\left( \frac{1}{n}\Phi(X)^\top \Phi(X) + \lambda I_n \right)^{-1} y}_{\text{dual}}$$


Kernel Ridge Regression: Feature Map and Kernel Trick

- Prediction: Given $t \in \mathcal{X}$,
$$f(t) = \langle f, \Phi(t) \rangle_{\mathcal{H}}
= \frac{1}{n}\, y^\top \Phi(X)^\top \left( \frac{1}{n}\Phi(X)\Phi(X)^\top + \lambda I_{\dim(\mathcal{H})} \right)^{-1} \Phi(t)
= \frac{1}{n}\, y^\top \left( \frac{1}{n}\Phi(X)^\top \Phi(X) + \lambda I_n \right)^{-1} \Phi(X)^\top \Phi(t).$$

As before,
$$\Phi(X)^\top \Phi(X) =
\underbrace{\begin{pmatrix}
\langle \Phi(x_1), \Phi(x_1) \rangle_{\mathcal{H}} & \cdots & \langle \Phi(x_1), \Phi(x_n) \rangle_{\mathcal{H}} \\
\langle \Phi(x_2), \Phi(x_1) \rangle_{\mathcal{H}} & \cdots & \langle \Phi(x_2), \Phi(x_n) \rangle_{\mathcal{H}} \\
\vdots & \ddots & \vdots \\
\langle \Phi(x_n), \Phi(x_1) \rangle_{\mathcal{H}} & \cdots & \langle \Phi(x_n), \Phi(x_n) \rangle_{\mathcal{H}}
\end{pmatrix}}_{k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle_{\mathcal{H}}}$$
and
$$\Phi(X)^\top \Phi(t) = \left[ \langle \Phi(x_1), \Phi(t) \rangle_{\mathcal{H}}, \ldots, \langle \Phi(x_n), \Phi(t) \rangle_{\mathcal{H}} \right]^\top.$$
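
The dual form above needs only the Gram matrix $K = \Phi(X)^\top\Phi(X)$ and the vector $(k(x_i, t))_i$. A minimal NumPy sketch of dual kernel ridge regression; the Gaussian kernel and the toy data are assumptions made purely for illustration, not part of the slides:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def krr_fit(X, y, lam, sigma=1.0):
    # Dual solution: f = sum_i alpha_i k(., x_i) with
    # alpha = (1/n) ((1/n) K + lam I_n)^{-1} y  -- the O(n^3) step
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K / n + lam * np.eye(n), y) / n

def krr_predict(T, X, alpha, sigma=1.0):
    # f(t) = sum_i alpha_i k(x_i, t)
    return gaussian_kernel(T, X, sigma) @ alpha

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
alpha = krr_fit(X, y, lam=1e-2)
print(krr_predict(X[:5], X, alpha))
```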


Remarks

- The primal formulation requires knowledge of the feature map $\Phi$ (and of course $\mathcal{H}$), and these could be infinite-dimensional.

- The dual formulation is entirely determined by kernel evaluations: the Gram matrix and $(k(x_i, t))_i$. But poor scalability: $O(n^3)$.


Kernel Principal Component Analysis

- Dimensionality reduction.

- Given: $\{x_i\}_{i=1}^n$ where $x_i \in \mathbb{R}^d$.

- Task: Find a low-dimensional representation for $(x_i)$.
$$\max_{\|f\|_{\mathcal{H}} = 1} \operatorname{Var}\left( \langle f, \Phi(x_1) \rangle_{\mathcal{H}}, \langle f, \Phi(x_2) \rangle_{\mathcal{H}}, \ldots, \langle f, \Phi(x_n) \rangle_{\mathcal{H}} \right)
\equiv \max_{\|f\|_{\mathcal{H}} = 1} \frac{1}{n}\sum_{i=1}^n \langle f, \Phi(x_i) \rangle_{\mathcal{H}}^2 - \left( \frac{1}{n}\sum_{i=1}^n \langle f, \Phi(x_i) \rangle_{\mathcal{H}} \right)^2
\equiv \max_{\|f\|_{\mathcal{H}} = 1} \langle f, \hat{\Sigma} f \rangle_{\mathcal{H}},$$
where
$$\hat{\Sigma} := \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \otimes \Phi(x_i) - \left( \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \right) \otimes \left( \frac{1}{n}\sum_{i=1}^n \Phi(x_i) \right)
= \frac{1}{n}\Phi(X)\left( I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^\top \right)\Phi(X)^\top =: \Phi(X)\, H\, \Phi(X)^\top.$$
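
The slides state the primal objective; a standard dual route (not spelled out on the slides) solves the same problem through the centred Gram matrix $HKH$, whose top eigenvectors give the coefficients of the principal directions $f = \sum_i \beta_i \bar{k}(\cdot, x_i)$. A rough NumPy sketch under that standard formulation:

```python
import numpy as np

def kernel_pca(K, n_components):
    # K: n x n Gram matrix [k(x_i, x_j)]; centre it: H K H with H = I_n - (1/n) 1 1^T
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    evals, evecs = np.linalg.eigh(Kc / n)          # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]     # largest components first
    # coefficients beta of f = sum_i beta_i k~(., x_i); scale so that ||f||_H = 1
    beta = evecs[:, :n_components] / np.sqrt(n * np.maximum(evals[:n_components], 1e-12))
    scores = Kc @ beta                             # projections <f_j, Phi~(x_i)>_H
    return beta, scores
```

The $O(n^3)$ eigendecomposition of the $n \times n$ Gram matrix is exactly the cost the approximation schemes below are meant to avoid.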




Approximation Schemes

- Incomplete Cholesky factorization (Fine and Scheinberg, JMLR 2001)

- Sketching (Yang et al., 2015)

- Sparse greedy approximation (Smola and Schölkopf, ICML 2000)

- Nyström method (Williams and Seeger, NIPS 2001)

- Random Fourier features (Rahimi and Recht, NIPS 2008), ...


Key Ideas

- Approach 1: Finite-dimensional approximation to $\Phi(x)$

  - Perform a linear method on $\Phi_m(x) \in \mathbb{R}^m$.

  - Involves $\Phi_m(X)\Phi_m(X)^\top \in \mathbb{R}^{m \times m}$.

- Approach 2: Approximate the representer

  - The representer theorem yields that the solution lies in
$$\left\{ f \in \mathcal{H} \,\Big|\, f = \sum_{i=1}^n \alpha_i k(\cdot, x_i) : (\alpha_1, \ldots, \alpha_n) \in \mathbb{R}^n \right\}.$$

  - Instead, restrict the solution to
$$\left\{ f \in \mathcal{H} \,\Big|\, f = \sum_{i=1}^m \beta_i k(\cdot, x_i) : (\beta_1, \ldots, \beta_m) \in \mathbb{R}^m \right\}.$$


Approach 1: Finite-dimensional approximation to $\Phi(x)$


Random Fourier Approximation

- $\mathcal{X} = \mathbb{R}^d$; let $k$ be continuous and translation-invariant, i.e., $k(x, y) = \psi(x - y)$.

- Bochner's theorem: $\psi$ is positive definite if and only if
$$k(x, y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\, \langle \omega,\, x - y \rangle_2} \, d\Lambda(\omega),$$
where $\Lambda$ is a finite non-negative Borel measure on $\mathbb{R}^d$.

- $k$ is symmetric and therefore $\Lambda$ is a "symmetric" measure on $\mathbb{R}^d$.

- Therefore
$$k(x, y) = \int_{\mathbb{R}^d} \cos\left( \langle \omega,\, x - y \rangle_2 \right) d\Lambda(\omega).$$


Random Feature Approximation

(Rahimi and Recht, 2008a): Draw $(\omega_j)_{j=1}^m \overset{\text{i.i.d.}}{\sim} \Lambda$. Then
$$k_m(x, y) = \frac{1}{m}\sum_{j=1}^m \cos\left( \langle \omega_j,\, x - y \rangle_2 \right) = \langle \Phi_m(x), \Phi_m(y) \rangle_{\mathbb{R}^{2m}}
\approx k(x, y) = \underbrace{\langle k(\cdot, x), k(\cdot, y) \rangle_{\mathcal{H}}}_{\langle \Phi(x), \Phi(y) \rangle_{\mathcal{H}}},$$
where
$$\Phi_m(x) = \frac{1}{\sqrt{m}}\Big( \underbrace{\cos(\langle \omega_1, x \rangle_2)}_{\varphi_1(x)}, \ldots, \cos(\langle \omega_m, x \rangle_2), \sin(\langle \omega_1, x \rangle_2), \ldots, \sin(\langle \omega_m, x \rangle_2) \Big).$$
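
For the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$, Bochner's theorem gives $\Lambda = N(0, \sigma^{-2} I_d)$, so the features can be sampled directly. A short sketch of $\Phi_m$ under that (assumed) kernel choice:

```python
import numpy as np

def rff_features(X, m, sigma=1.0, rng=None):
    # Phi_m(x) = (1/sqrt(m)) [cos(<w_1,x>), ..., cos(<w_m,x>), sin(<w_1,x>), ..., sin(<w_m,x>)]
    # with w_j ~ Lambda = N(0, sigma^{-2} I_d), the spectral measure of the Gaussian kernel.
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, m))
    Z = X @ W                                               # entries <w_j, x_i>
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)   # n x 2m feature matrix

# k_m(x, y) = <Phi_m(x), Phi_m(y)>_{R^{2m}} approximates k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).
```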


How good is the approximation?

(S and Szabó, NIPS 2015):
$$\sup_{x, y \in S} |k_m(x, y) - k(x, y)| = O_{a.s.}\left( \sqrt{\frac{\log |S|}{m}} \right)$$

Optimal convergence rate.

- Other results are known but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).
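
A quick Monte Carlo sanity check of this behaviour (a rough experiment sketch, not reproducing the paper's setup): compare $k_m$ to the exact Gaussian kernel over all pairs from a bounded set and watch the maximal error shrink as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(300, 2))                  # points in a bounded set S
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-sq / 2)                              # Gaussian kernel, sigma = 1

for m in [10, 100, 1000, 10000]:
    W = rng.normal(size=(2, m))                        # omega_j ~ N(0, I_2) for sigma = 1
    Z = X @ W
    Phi = np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(m)
    print(m, np.abs(Phi @ Phi.T - K_exact).max())      # shrinks roughly like sqrt(log|S|/m)
```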


Ridge Regression: Random Feature Approximation

- Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

- Task: Find a regressor $f$ s.t. $f(x_i) \approx y_i$.

- Idea: Map $x_i$ to $\Phi_m(x_i)$ and do linear regression,
$$\min_{w \in \mathbb{R}^{2m}} \frac{1}{n}\sum_{i=1}^n \left( \langle w, \Phi_m(x_i) \rangle_{\mathbb{R}^{2m}} - y_i \right)^2 + \lambda \|w\|_{\mathbb{R}^{2m}}^2 \qquad (\lambda > 0).$$

- Solution: For $\Phi_m(X) := (\Phi_m(x_1), \ldots, \Phi_m(x_n)) \in \mathbb{R}^{2m \times n}$ and $y := (y_1, \ldots, y_n)^\top \in \mathbb{R}^n$,
$$f = \underbrace{\frac{1}{n}\left( \frac{1}{n}\Phi_m(X)\Phi_m(X)^\top + \lambda I_{2m} \right)^{-1} \Phi_m(X)\, y}_{\text{primal}}
= \underbrace{\frac{1}{n}\Phi_m(X)\left( \frac{1}{n}\Phi_m(X)^\top \Phi_m(X) + \lambda I_n \right)^{-1} y}_{\text{dual}}$$

Computation: $O(m^2 n)$.
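
The primal solve now involves the $2m \times 2m$ matrix $\Phi_m(X)\Phi_m(X)^\top$, which is where the $O(m^2 n)$ cost comes from. A minimal sketch, reusing the rff_features helper from the earlier (assumed Gaussian-kernel) sketch and storing $\Phi_m(X)^\top$ row-wise:

```python
import numpy as np

def rff_ridge_fit(Phi, y, lam):
    # Phi: n x 2m matrix whose i-th row is Phi_m(x_i), i.e. Phi = Phi_m(X)^T
    # Primal solution: w = (1/n) ((1/n) Phi_m(X) Phi_m(X)^T + lam I_{2m})^{-1} Phi_m(X) y
    n, D = Phi.shape
    A = Phi.T @ Phi / n + lam * np.eye(D)      # 2m x 2m, built in O(m^2 n)
    return np.linalg.solve(A, Phi.T @ y) / n   # O(m^3) solve

def rff_ridge_predict(Phi_test, w):
    # f(t) = <w, Phi_m(t)>_{R^{2m}}
    return Phi_test @ w

# e.g. w = rff_ridge_fit(rff_features(X, m=500), y, lam=1e-2)
```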


What happens statistically?

Kernel ridge regression: $(X_i, Y_i)_{i=1}^n \overset{\text{i.i.d.}}{\sim} \rho_{XY}$.

- $\mathcal{R}^*_P = \inf_{f \in L^2(\rho_X)} \mathbb{E}|f(X) - Y|^2 = \mathbb{E}|f^*(X) - Y|^2$.

- Penalized risk minimization: $O(n^3)$
$$f_n = \arg\inf_{f \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda \|f\|_{\mathcal{H}}^2$$

- Penalized risk minimization (approximate): $O(m^2 n)$
$$f_{m,n} = \arg\inf_{f \in \mathcal{H}_m} \frac{1}{n}\sum_{i=1}^n |Y_i - f(X_i)|^2 + \lambda \|f\|_{\mathcal{H}_m}^2$$


What happens statistically?

$$\underbrace{\mathcal{R}_P(f_{m,n})}_{\mathbb{E}|f_{m,n}(X) - Y|^2} - \mathcal{R}^*_P
= \underbrace{\left( \mathcal{R}_P(f_{m,n}) - \mathcal{R}_P(f_n) \right)}_{\text{error due to approximation}} + \left( \mathcal{R}_P(f_n) - \mathcal{R}^*_P \right)$$

- (Rahimi and Recht, 2008b): $(m \wedge n)^{-\frac{1}{2}}$

- (Rudi and Rosasco, 2016): If $m \geq n^{\alpha}$ where $\frac{1}{2} \leq \alpha < 1$, with $\alpha$ depending on the properties of $f^*$, then $f_{m,n}$ achieves the minimax optimal rate as obtained in the case with no approximation.

Computational gain with no statistical loss!!


Kernel PCA: Random Feature Approximation

- Perform linear PCA on $\{\Phi_m(x_i)\}_{i=1}^n$.

- Approximate KPCA finds $\beta \in \mathbb{R}^m$ that solves
$$\sup_{\|\beta\|_2 = 1} \operatorname{Var}\left[ \{\langle \beta, \Phi_m(x_i) \rangle_2\}_{i=1}^n \right] = \sup_{\|\beta\|_2 = 1} \left\langle \beta, \hat{\Sigma}_m \beta \right\rangle_2,$$
where
$$\hat{\Sigma}_m := \frac{1}{n}\sum_{i=1}^n \Phi_m(x_i) \otimes \Phi_m(x_i) - \left( \frac{1}{n}\sum_{i=1}^n \Phi_m(x_i) \right) \otimes \left( \frac{1}{n}\sum_{i=1}^n \Phi_m(x_i) \right).$$

- Same as doing kernel PCA in $\mathcal{H}_m$ where
$$\mathcal{H}_m = \left\{ f = \sum_{i=1}^m \beta_i \varphi_i : \beta \in \mathbb{R}^m \right\}$$
is an RKHS induced by the reproducing kernel $k_m$.

- Computation: $O(m^2 n)$
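
Concretely, approximate KPCA is just linear PCA on the random-feature matrix: build the feature-space covariance $\hat{\Sigma}_m$ and take its top eigenvectors, at $O(m^2 n)$ cost. A sketch (the rff_features helper is the assumed Gaussian-kernel construction from before; its feature dimension is $2m$, which plays the role of the slides' $\mathbb{R}^m$):

```python
import numpy as np

def rff_kpca(Phi, n_components):
    # Phi: n x D matrix with rows Phi_m(x_i) (here D = 2m random Fourier features)
    # Sigma_m = (1/n) sum_i Phi_m(x_i) Phi_m(x_i)^T - mean mean^T   (D x D covariance)
    mean = Phi.mean(axis=0)
    Phic = Phi - mean
    Sigma_m = Phic.T @ Phic / Phi.shape[0]         # built in O(m^2 n)
    evals, evecs = np.linalg.eigh(Sigma_m)
    order = np.argsort(evals)[::-1][:n_components]
    B = evecs[:, order]                            # columns beta with ||beta||_2 = 1
    return B, Phic @ B                             # principal directions and scores
```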


Reconstruction Error

- Linear PCA
$$\mathbb{E}_{X \sim P}\left\| (X - \mu) - \sum_{i=1}^{\ell} \langle (X - \mu), \phi_i \rangle_2 \, \phi_i \right\|_2^2$$

- Kernel PCA
$$\mathbb{E}_{X \sim P}\left\| \bar{k}(\cdot, X) - \sum_{i=1}^{\ell} \langle \bar{k}(\cdot, X), \phi_i \rangle_{\mathcal{H}} \, \phi_i \right\|_{\mathcal{H}}^2,$$
where $\bar{k}(\cdot, x) = k(\cdot, x) - \int k(\cdot, x) \, dP(x)$.

- However, the eigenfunctions of approximate empirical KPCA lie in $\mathcal{H}_m$, which is finite-dimensional and not contained in $\mathcal{H}$.


Embedding to $L^2(P)$ (S and Sterge, 2017)

What do we have?

- Population eigenfunctions $(\phi_i)_{i \in I}$ of $\Sigma$: these form a subspace in $\mathcal{H}$.

- Empirical eigenfunctions $(\hat{\phi}_i)_{i=1}^n$ of $\hat{\Sigma}$: these form a subspace in $\mathcal{H}$.

- Eigenvectors after approximation, $(\hat{\phi}_{i,m})_{i=1}^m$ of $\hat{\Sigma}_m$: these form a subspace in $\mathbb{R}^m$.

- We embed them in a common space before comparing. The common space is $L^2(P)$.

- (Inclusion operator) $\mathcal{I} : \mathcal{H} \to L^2(P)$, $f \mapsto f - \int_{\mathcal{X}} f(x) \, dP(x)$

- (Approximation operator) $\mathcal{U} : \mathbb{R}^m \to L^2(P)$,
$$\alpha \mapsto \sum_{i=1}^m \alpha_i \left( \varphi_i - \int_{\mathcal{X}} \varphi_i(x) \, dP(x) \right)$$


Reconstruction Error (S and Sterge, 2017)

$\left( \frac{\mathcal{I}\phi_i}{\sqrt{\lambda_i}} \right)_{i=1}^{\infty}$ form an ONS for $L^2(P)$. Define $\bar{k}(\cdot, x) = k(\cdot, x) - \mu_P$ and $\tau > 0$.

- Population KPCA:
$$\mathcal{R}_\ell = \mathbb{E}\left\| \mathcal{I}\bar{k}(\cdot, X) - \sum_{i=1}^{\ell} \left\langle \frac{\mathcal{I}\phi_i}{\sqrt{\lambda_i}}, \mathcal{I}\bar{k}(\cdot, X) \right\rangle_{L^2(P)} \frac{\mathcal{I}\phi_i}{\sqrt{\lambda_i}} \right\|_{L^2(P)}^2$$

- Empirical KPCA:
$$\mathcal{R}_{n,\ell} = \mathbb{E}\left\| \mathcal{I}\bar{k}(\cdot, X) - \sum_{i=1}^{\ell} \left\langle \frac{\mathcal{I}\hat{\phi}_i}{\sqrt{\hat{\lambda}_i}}, \mathcal{I}\bar{k}(\cdot, X) \right\rangle_{L^2(P)} \frac{\mathcal{I}\hat{\phi}_i}{\sqrt{\hat{\lambda}_i}} \right\|_{L^2(P)}^2$$

- Approximate Empirical KPCA:
$$\mathcal{R}_{m,n,\ell} = \mathbb{E}\left\| \mathcal{I}\bar{k}(\cdot, X) - \sum_{i=1}^{\ell} \left\langle \frac{\mathcal{U}\hat{\phi}_{i,m}}{\sqrt{\hat{\lambda}_{i,m}}}, \mathcal{I}\bar{k}(\cdot, X) \right\rangle_{L^2(P)} \frac{\mathcal{U}\hat{\phi}_{i,m}}{\sqrt{\hat{\lambda}_{i,m}}} \right\|_{L^2(P)}^2$$


Result (S and Sterge, 2017)

Clearly $\mathcal{R}_\ell \to 0$ as $\ell \to \infty$. The goal is to study the convergence rates for $\mathcal{R}_\ell$, $\mathcal{R}_{n,\ell}$ and $\mathcal{R}_{m,n,\ell}$ as $\ell, m, n \to \infty$.

Suppose $\lambda_i \asymp i^{-\alpha}$, $\alpha > \frac{1}{2}$, $\ell = n^{\theta/\alpha}$ and $m = n^{\gamma}$, where $\theta > 0$ and $0 < \gamma < 1$.

- $n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)} \lesssim \mathcal{R}_\ell \lesssim n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)}$

- $$\mathcal{R}_{n,\ell} \lesssim \begin{cases} n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)}, & 0 < \theta \leq \frac{\alpha}{2(3\alpha - 1)} \quad \text{(bias dominates)} \\[4pt] n^{-\left(\frac{1}{2} - \theta\right)}, & \frac{\alpha}{2(3\alpha - 1)} \leq \theta < \frac{1}{2} \quad \text{(variance dominates)} \end{cases}$$

- $$\mathcal{R}_{m,n,\ell} \lesssim \begin{cases} n^{-2\theta\left(1 - \frac{1}{2\alpha}\right)}, & 0 < \theta \leq \frac{\alpha}{2(3\alpha - 1)} \\[4pt] n^{-\left(\frac{1}{2} - \theta\right)}, & \frac{\alpha}{2(3\alpha - 1)} \leq \theta < \frac{1}{2} \end{cases} \qquad \text{for } \gamma > 2\theta.$$

No statistical loss.


Approach 2: Approximate the representer


Ridge Regression: Nyström Approximation

- Given: $\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$, $y_i \in \mathbb{R}$.

- Task: Find a regressor $f$ s.t. $f(x_i) \approx y_i$.

- Idea: Restrict $f$ to $\mathcal{H}_m = \{f \in \mathcal{H} : f = \sum_{i=1}^m \alpha_i k(\cdot, x_i)\}$ and do kernel ridge regression,
$$\min_{f \in \mathcal{H}_m} \frac{1}{n}\sum_{i=1}^n \left( \langle f, k(\cdot, x_i) \rangle_{\mathcal{H}} - y_i \right)^2 + \lambda \|f\|_{\mathcal{H}}^2 \qquad (\lambda > 0).$$

- Solution: Since $f \in \mathcal{H}_m$,
$$\min_{\alpha \in \mathbb{R}^m} \frac{1}{n}\sum_{i=1}^n \left( [K_{nm}\alpha]_i - y_i \right)^2 + \lambda\, \alpha^\top K_{mm} \alpha \qquad (\lambda > 0),$$
where $[K_{mm}]_{ij} = k(x_i, x_j)$ for $i, j \in \{1, \ldots, m\}$ and $[K_{nm}]_{ij} = k(x_i, x_j)$ for $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, m\}$. Therefore
$$\alpha = \left( K_{nm}^\top K_{nm} + n\lambda K_{mm} \right)^{-1} K_{nm}^\top y.$$

Computation: $O(m^2 n)$.
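
A rough sketch of the Nyström solve for $\alpha$. Taking the first $m$ points as the centres and passing the kernel as a function are assumptions made for illustration (in practice the centres are typically subsampled, e.g. uniformly or by leverage scores); `kernel` could be the gaussian_kernel helper from the earlier sketch:

```python
import numpy as np

def nystrom_krr_fit(X, y, m, lam, kernel):
    # kernel(A, B) returns the matrix [k(a_i, b_j)]
    Xm = X[:m]                                     # the m centres x_1, ..., x_m (assumed choice)
    Knm = kernel(X, Xm)                            # n x m
    Kmm = kernel(Xm, Xm)                           # m x m
    # alpha = (Knm^T Knm + n lam Kmm)^{-1} Knm^T y   -- O(m^2 n)
    A = Knm.T @ Knm + X.shape[0] * lam * Kmm
    return Xm, np.linalg.solve(A, Knm.T @ y)

def nystrom_krr_predict(T, Xm, alpha, kernel):
    # f(t) = sum_{i=1}^m alpha_i k(x_i, t)
    return kernel(T, Xm) @ alpha
```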


Kernel PCA: Nyström Approximation

- $$\arg\sup\left\{ \left\langle f, \hat{\Sigma} f \right\rangle_{\mathcal{H}} : f \in \mathcal{H}_m, \ \|f\|_{\mathcal{H}} = 1 \right\},$$
where
$$\mathcal{H}_m := \left\{ f \in \mathcal{H} : f = \sum_{i=1}^m \beta_i k(\cdot, x_i), \ (\beta_1, \ldots, \beta_m) \in \mathbb{R}^m \right\}.$$

- $$\beta = K_{mm}^{-1/2}\, u,$$
where $u$ is an eigenvector of $\frac{1}{n} K_{mm}^{-1/2} K_{nm}^\top K_{nm} K_{mm}^{-1/2}$.

Computation: $O(m^2 n)$.
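
A corresponding sketch for Nyström KPCA: eigendecompose $\frac{1}{n} K_{mm}^{-1/2} K_{nm}^\top K_{nm} K_{mm}^{-1/2}$ and map the top eigenvectors back through $K_{mm}^{-1/2}$. The jitter added to $K_{mm}$ is purely for numerical stability and is not part of the slides; the centre choice is again an assumption:

```python
import numpy as np

def nystrom_kpca(X, m, n_components, kernel, jitter=1e-10):
    Xm = X[:m]                                         # m centres (assumed: the first m points)
    Knm = kernel(X, Xm)                                # n x m
    Kmm = kernel(Xm, Xm) + jitter * np.eye(m)          # m x m, jittered for stability
    s, V = np.linalg.eigh(Kmm)
    Kmm_inv_half = V @ np.diag(1.0 / np.sqrt(s)) @ V.T # Kmm^{-1/2}
    M = Kmm_inv_half @ Knm.T @ Knm @ Kmm_inv_half / X.shape[0]   # O(m^2 n)
    evals, U = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][:n_components]
    Beta = Kmm_inv_half @ U[:, order]                  # beta = Kmm^{-1/2} u, one column per component
    return Beta                                        # f_j = sum_i Beta[i, j] k(., x_i)
```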


Computational vs. Statistical Trade-off

- Results similar to Approach 1 are derived for KRR with Nyström approximation (Bach, 2013; Alaoui and Mahoney, 2015; Rudi et al., 2015).

- Kernel PCA with Nyström approximation

Many directions and open questions . . .


Questions


Thank You


References I

Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264.

Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc.

Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, pages 1313–1320.

Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911–918. Morgan Kaufmann, San Francisco, CA.

Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv:1706.06296.

Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc.

Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862–871.

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA. MIT Press.


References II

Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.