Lecture 3
Approximate Kernel Methods (Computational vs. Statistical Trade-off)
Bharath K. Sriperumbudur∗ & Dougal J. Sutherland†
∗Department of Statistics, Pennsylvania State University
†Gatsby Unit, University College London
Data Science Summer School, École Polytechnique
June 2019
Outline

▶ Motivating examples
▶ Kernel ridge regression and kernel PCA
▶ Approximation methods
▶ Computational vs. statistical trade-off
Kernel Ridge Regression: Feature Map and Kernel Trick

▶ Given: {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ R
▶ Task: Find a regressor f ∈ H (some feature space) s.t. f(x_i) ≈ y_i.
▶ Idea: Map x_i to Φ(x_i) and do linear regression,

    min_{f∈H} (1/n) ∑_{i=1}^n (⟨f, Φ(x_i)⟩_H − y_i)² + λ‖f‖²_H    (λ > 0)

▶ Solution: For Φ(X) := (Φ(x_1), …, Φ(x_n)) ∈ R^{dim(H)×n} and y := (y_1, …, y_n)^⊤ ∈ R^n,

    f = (1/n) ((1/n) Φ(X)Φ(X)^⊤ + λ I_{dim(H)})^{−1} Φ(X) y        (primal)
      = (1/n) Φ(X) ((1/n) Φ(X)^⊤Φ(X) + λ I_n)^{−1} y               (dual)
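To make the primal/dual identity concrete, here is a minimal numerical sketch (not from the slides): it uses a toy explicit, finite-dimensional feature map so that both expressions can be computed and compared directly.

```python
# Sketch only: verify numerically that the primal and dual ridge solutions agree,
# using a hypothetical explicit feature map (degree-2 monomials) so dim(H) is finite.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 50, 3, 0.1
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def feature_map(x):
    # Phi(x): constant, linear, and quadratic monomials (illustrative choice)
    return np.concatenate([[1.0], x, np.outer(x, x).ravel()])

Phi = np.stack([feature_map(x) for x in X], axis=1)   # dim(H) x n, columns Phi(x_i)
D = Phi.shape[0]

# primal: f = (1/n) (Phi Phi^T / n + lam I_dim(H))^{-1} Phi y
f_primal = np.linalg.solve(Phi @ Phi.T / n + lam * np.eye(D), Phi @ y) / n

# dual:   f = (1/n) Phi (Phi^T Phi / n + lam I_n)^{-1} y
f_dual = Phi @ np.linalg.solve(Phi.T @ Phi / n + lam * np.eye(n), y) / n

print(np.allclose(f_primal, f_dual))                  # True (up to round-off)
```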
Kernel Ridge Regression: Feature Map and Kernel Trick

▶ Prediction: Given t ∈ X,

    f(t) = ⟨f, Φ(t)⟩_H = (1/n) y^⊤ Φ(X)^⊤ ((1/n) Φ(X)Φ(X)^⊤ + λ I_{dim(H)})^{−1} Φ(t)
                       = (1/n) y^⊤ ((1/n) Φ(X)^⊤Φ(X) + λ I_n)^{−1} Φ(X)^⊤ Φ(t)

As before,

    Φ(X)^⊤Φ(X) = [⟨Φ(x_i), Φ(x_j)⟩_H]_{i,j=1}^n    (the Gram matrix, with entries k(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩_H)

and

    Φ(X)^⊤Φ(t) = [⟨Φ(x_1), Φ(t)⟩_H, …, ⟨Φ(x_n), Φ(t)⟩_H]^⊤.
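As a concrete illustration of the dual prediction formula, here is a minimal sketch (assuming a Gaussian kernel, which the slides do not fix): fitting and prediction use only the Gram matrix and the vector (k(x_i, t))_i, never the feature map itself.

```python
# Sketch only: kernel ridge regression in dual form with an assumed Gaussian kernel.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all pairs of rows of A and B
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d, lam = 200, 2, 1e-2
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=n)

K = gaussian_kernel(X, X)                               # Gram matrix Phi(X)^T Phi(X)
coef = np.linalg.solve(K / n + lam * np.eye(n), y) / n  # (1/n) (K/n + lam I_n)^{-1} y

def predict(T):
    # f(t) = (1/n) y^T (K/n + lam I_n)^{-1} (k(x_i, t))_i, for each row t of T
    return gaussian_kernel(T, X) @ coef

print(predict(X[:3]))
print(y[:3])
```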
Remarks

▶ The primal formulation requires knowledge of the feature map Φ (and of course H), and these could be infinite dimensional.
▶ The dual formulation is entirely determined by kernel evaluations: the Gram matrix and (k(x_i, t))_i. But poor scalability: O(n³).
Kernel Principal Component Analysis

▶ Dimensionality reduction
▶ Given: {x_i}_{i=1}^n where x_i ∈ R^d
▶ Task: Find a low-dimensional representation for (x_i).

    max_{‖f‖_H=1} Var(⟨f, Φ(x_1)⟩_H, ⟨f, Φ(x_2)⟩_H, …, ⟨f, Φ(x_n)⟩_H)

    ≡ max_{‖f‖_H=1} (1/n) ∑_{i=1}^n ⟨f, Φ(x_i)⟩²_H − ((1/n) ∑_{i=1}^n ⟨f, Φ(x_i)⟩_H)²

    ≡ max_{‖f‖_H=1} ⟨f, Σ̂ f⟩_H

where

    Σ̂ := (1/n) ∑_{i=1}^n Φ(x_i) ⊗ Φ(x_i) − ((1/n) ∑_{i=1}^n Φ(x_i)) ⊗ ((1/n) ∑_{i=1}^n Φ(x_i))

       = (1/n) Φ(X) (I_n − (1/n) 1 1^⊤) Φ(X)^⊤ =: Φ(X) H Φ(X)^⊤.
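A minimal sketch of the dual computation (again assuming a Gaussian kernel): the kernel principal components come from an eigendecomposition of the doubly-centred Gram matrix HKH, which is the O(n³) bottleneck.

```python
# Sketch only: kernel PCA in the dual, with an assumed Gaussian kernel.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d, n_components = 300, 2, 2
X = rng.normal(size=(n, d))

K = gaussian_kernel(X, X)
H = np.eye(n) - np.ones((n, n)) / n          # centring matrix
Kc = H @ K @ H                               # doubly-centred Gram matrix

evals, evecs = np.linalg.eigh(Kc)            # eigenvalues in ascending order
idx = np.argsort(evals)[::-1][:n_components]
evals, evecs = evals[idx], evecs[:, idx]

# projections of the training points onto the top kernel principal components
scores = evecs * np.sqrt(np.maximum(evals, 0.0))
print(scores.shape)                          # (n, n_components)
```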
Remarks

▶ The primal formulation requires knowledge of the feature map Φ (and of course H), and these could be infinite dimensional.
▶ The dual formulation is entirely determined by kernel evaluations: the Gram matrix and (k(x_i, t))_i. But poor scalability: O(n³).
Approximation Schemes

▶ Incomplete Cholesky factorization (Fine and Scheinberg, JMLR 2001)
▶ Sketching (Yang et al., 2015)
▶ Sparse greedy approximation (Smola and Schölkopf, NIPS 2000)
▶ Nyström method (Williams and Seeger, NIPS 2001)
▶ Random Fourier features (Rahimi and Recht, NIPS 2008), ...
Key Ideas

▶ Approach 1: Finite dimensional approximation to Φ(x)
    ▶ Perform the linear method on Φ_m(x) ∈ R^m.
    ▶ Involves Φ_m(X)Φ_m(X)^⊤ ∈ R^{m×m}.

▶ Approach 2: Approximate the representer
    ▶ The representer theorem yields that the solution lies in

          { f ∈ H | f = ∑_{i=1}^n α_i k(·, x_i) : (α_1, …, α_n) ∈ R^n }.

    ▶ Instead, restrict the solution to

          { f ∈ H | f = ∑_{i=1}^m β_i k(·, x_i) : (β_1, …, β_m) ∈ R^m }.
Approach 1: Finite dimensional approximation to Φ(x)
Random Fourier Approximation

▶ X = R^d; let k be continuous and translation invariant, i.e., k(x, y) = ψ(x − y).

▶ Bochner's theorem: ψ is positive definite if and only if

      k(x, y) = ∫_{R^d} e^{√−1 ⟨ω, x−y⟩} dΛ(ω),

  where Λ is a finite non-negative Borel measure on R^d.

▶ k is symmetric and therefore Λ is a "symmetric" measure on R^d.

▶ Therefore

      k(x, y) = ∫_{R^d} cos(⟨ω, x − y⟩) dΛ(ω).
Random Feature Approximation

(Rahimi and Recht, 2008a): Draw (ω_j)_{j=1}^m i.i.d. ∼ Λ and set

    k_m(x, y) = (1/m) ∑_{j=1}^m cos(⟨ω_j, x − y⟩) = ⟨Φ_m(x), Φ_m(y)⟩_{R^{2m}}
              ≈ k(x, y) = ⟨k(·, x), k(·, y)⟩_H = ⟨Φ(x), Φ(y)⟩_H,

where

    Φ_m(x) = (1/√m) (cos(⟨ω_1, x⟩), …, cos(⟨ω_m, x⟩), sin(⟨ω_1, x⟩), …, sin(⟨ω_m, x⟩)),

with coordinate functions denoted ϕ_1(x), ϕ_2(x), … .
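A minimal sketch of this construction, assuming a Gaussian kernel k(x, y) = exp(−‖x − y‖²/2), whose spectral measure Λ is the standard Gaussian on R^d (a standard fact, stated here as an assumption rather than taken from the slides):

```python
# Sketch only: random Fourier feature map Phi_m and the approximation k_m ~ k.
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2000
omega = rng.normal(size=(m, d))               # (omega_j)_{j=1}^m  i.i.d. ~ Lambda = N(0, I_d)

def phi_m(x):
    # Phi_m(x) = (1/sqrt(m)) (cos<omega_1, x>, ..., cos<omega_m, x>, sin<omega_1, x>, ...)
    proj = omega @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(m)

x, y = rng.normal(size=d), rng.normal(size=d)
k_true = np.exp(-np.sum((x - y) ** 2) / 2)    # exact Gaussian kernel value
k_m = phi_m(x) @ phi_m(y)                     # <Phi_m(x), Phi_m(y)>_{R^{2m}}
print(k_true, k_m)                            # close for large m
```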
How good is the approximation?

(S and Szabó, NIPS 2015):

    sup_{x,y∈S} |k_m(x, y) − k(x, y)| = O_{a.s.}(√(log |S| / m))

▶ Optimal convergence rate.
▶ Other results are known but they are non-optimal (Rahimi and Recht, NIPS 2008; Sutherland and Schneider, UAI 2015).
Ridge Regression: Random Feature Approximation

▶ Given: {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ R
▶ Task: Find a regressor f s.t. f(x_i) ≈ y_i.
▶ Idea: Map x_i to Φ_m(x_i) and do linear regression,

    min_{w∈R^{2m}} (1/n) ∑_{i=1}^n (⟨w, Φ_m(x_i)⟩_{R^{2m}} − y_i)² + λ‖w‖²_{R^{2m}}    (λ > 0)

▶ Solution: For Φ_m(X) := (Φ_m(x_1), …, Φ_m(x_n)) ∈ R^{2m×n} and y := (y_1, …, y_n)^⊤ ∈ R^n,

    f = (1/n) ((1/n) Φ_m(X)Φ_m(X)^⊤ + λ I_{2m})^{−1} Φ_m(X) y      (primal)
      = (1/n) Φ_m(X) ((1/n) Φ_m(X)^⊤Φ_m(X) + λ I_n)^{−1} y         (dual)

Computation: O(m²n).
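A minimal sketch of the primal computation (same Gaussian-kernel assumption as above): solving in R^{2m} costs O(m²n) rather than the O(n³) of exact kernel ridge regression.

```python
# Sketch only: ridge regression on random Fourier features, primal form.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 2000, 3, 100, 1e-3
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

omega = rng.normal(size=(m, d))               # spectral samples, Lambda = N(0, I_d)

def phi_m(A):
    proj = A @ omega.T                                            # (n, m)
    return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)   # rows are Phi_m(x_i)^T

Z = phi_m(X)                                                      # (n, 2m)

# primal solution w = (1/n) (Phi_m(X) Phi_m(X)^T / n + lam I_{2m})^{-1} Phi_m(X) y
w = np.linalg.solve(Z.T @ Z / n + lam * np.eye(2 * m), Z.T @ y) / n

pred = Z @ w                                  # <w, Phi_m(x_i)> for each training point
print(np.mean((pred - y) ** 2))               # training error
```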
What happens statistically?

Kernel ridge regression: (X_i, Y_i)_{i=1}^n i.i.d. ∼ ρ_XY.

▶ R*_P = inf_{f∈L²(ρ_X)} E|f(X) − Y|² = E|f*(X) − Y|².

▶ Penalized risk minimization: O(n³)

    f̂_n = arg inf_{f∈H} (1/n) ∑_{i=1}^n |Y_i − f(X_i)|² + λ‖f‖²_H

▶ Penalized risk minimization (approximate): O(m²n)

    f̂_{m,n} = arg inf_{f∈H_m} (1/n) ∑_{i=1}^n |Y_i − f(X_i)|² + λ‖f‖²_{H_m}
What happens statistically?

    R_P(f̂_{m,n}) − R*_P = (R_P(f̂_{m,n}) − R_P(f̂_n)) + (R_P(f̂_n) − R*_P),

where R_P(f̂_{m,n}) = E|f̂_{m,n}(X) − Y|² and the first term is the error due to approximation.

▶ (Rahimi and Recht, 2008b): (m ∧ n)^{−1/2}
▶ (Rudi and Rosasco, 2016): If m ≥ n^α where 1/2 ≤ α < 1, with α depending on the properties of f*, then f̂_{m,n} achieves the minimax optimal rate as obtained in the case with no approximation.

Computational gain with no statistical loss!!
Kernel PCA: Random Feature Approximation

▶ Perform linear PCA on {Φ_m(x_i)}_{i=1}^n.
▶ Approximate KPCA finds β ∈ R^m that solves

    sup_{‖β‖₂=1} Var[{⟨β, Φ_m(x_i)⟩₂}_{i=1}^n] = sup_{‖β‖₂=1} ⟨β, Σ̂_m β⟩₂

  where

    Σ̂_m := (1/n) ∑_{i=1}^n Φ_m(x_i) ⊗ Φ_m(x_i) − ((1/n) ∑_{i=1}^n Φ_m(x_i)) ⊗ ((1/n) ∑_{i=1}^n Φ_m(x_i)).

▶ Same as doing kernel PCA in H_m where

    H_m = { f = ∑_{i=1}^m β_i ϕ_i : β ∈ R^m }

  is the RKHS induced by the reproducing kernel k_m.

▶ Computation: O(m²n)
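A minimal sketch (Gaussian-kernel assumption again): approximate kernel PCA is just linear PCA on the random features, i.e. an eigendecomposition of Σ̂_m, at O(m²n) cost. Here the feature dimension is written 2m, matching the cosine-sine map defined earlier.

```python
# Sketch only: approximate kernel PCA = linear PCA on random Fourier features.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, n_components = 1000, 2, 200, 2
X = rng.normal(size=(n, d))

omega = rng.normal(size=(m, d))
proj = X @ omega.T
Z = np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(m)   # rows: Phi_m(x_i), in R^{2m}

Zc = Z - Z.mean(axis=0)                     # centre the features
Sigma_m = Zc.T @ Zc / n                     # empirical covariance Sigma_m, (2m x 2m)

evals, evecs = np.linalg.eigh(Sigma_m)      # ascending order
top = evecs[:, np.argsort(evals)[::-1][:n_components]]

scores = Zc @ top                           # projections onto the top components
print(scores.shape)                         # (n, n_components)
```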
Reconstruction Error

▶ Linear PCA:

    E_{X∼P} ‖(X − μ) − ∑_{i=1}^ℓ ⟨X − μ, φ_i⟩₂ φ_i‖²₂

▶ Kernel PCA:

    E_{X∼P} ‖k̄(·, X) − ∑_{i=1}^ℓ ⟨k̄(·, X), φ_i⟩_H φ_i‖²_H

  where k̄(·, x) = k(·, x) − ∫ k(·, x) dP(x).

▶ However, the eigenfunctions of approximate empirical KPCA lie in H_m, which is finite dimensional and not contained in H.
Embedding to L²(P) (S and Sterge, 2017)

What we have:

▶ Population eigenfunctions (φ_i)_{i∈I} of Σ: these form a subspace in H.
▶ Empirical eigenfunctions (φ̂_i)_{i=1}^n of Σ̂: these form a subspace in H.
▶ Eigenvectors after approximation, (φ̂_{i,m})_{i=1}^m of Σ̂_m: these form a subspace in R^m.
▶ We embed them in a common space before comparing. The common space is L²(P).

▶ (Inclusion operator) I : H → L²(P),  f ↦ f − ∫_X f(x) dP(x)
▶ (Approximation operator) U : R^m → L²(P),

    α ↦ ∑_{i=1}^m α_i (ϕ_i − ∫_X ϕ_i(x) dP(x))
Reconstruction Error (S and Sterge, 2017)

(Iφ_i/√λ_i)_{i=1}^∞ form an ONS for L²(P). Define k̄(·, x) = k(·, x) − μ_P and τ > 0.

▶ Population KPCA:

    R_ℓ = E ‖ I k̄(·, X) − ∑_{i=1}^ℓ ⟨Iφ_i/√λ_i, I k̄(·, X)⟩_{L²(P)} Iφ_i/√λ_i ‖²_{L²(P)}

▶ Empirical KPCA:

    R_{n,ℓ} = E ‖ I k̄(·, X) − ∑_{i=1}^ℓ ⟨Iφ̂_i/√λ̂_i, I k̄(·, X)⟩_{L²(P)} Iφ̂_i/√λ̂_i ‖²_{L²(P)}

▶ Approximate Empirical KPCA:

    R_{m,n,ℓ} = E ‖ I k̄(·, X) − ∑_{i=1}^ℓ ⟨Uφ̂_{i,m}/√λ̂_{i,m}, I k̄(·, X)⟩_{L²(P)} Uφ̂_{i,m}/√λ̂_{i,m} ‖²_{L²(P)}
Result (S and Sterge, 2017)

Clearly R_ℓ → 0 as ℓ → ∞. The goal is to study the convergence rates for R_ℓ, R_{n,ℓ} and R_{m,n,ℓ} as ℓ, m, n → ∞.

Suppose λ_i ≍ i^{−α} with α > 1/2, ℓ = n^{θ/α} and m = n^γ, where θ > 0 and 0 < γ < 1.

▶ n^{−2θ(1 − 1/(2α))} ≲ R_ℓ ≲ n^{−2θ(1 − 1/(2α))}

▶ R_{n,ℓ} ≲ n^{−2θ(1 − 1/(2α))} for 0 < θ ≤ α/(2(3α−1)) (bias dominates), and
  R_{n,ℓ} ≲ n^{−(1/2 − θ)} for α/(2(3α−1)) ≤ θ < 1/2 (variance dominates).

▶ R_{m,n,ℓ} ≲ n^{−2θ(1 − 1/(2α))} for 0 < θ ≤ α/(2(3α−1)), and
  R_{m,n,ℓ} ≲ n^{−(1/2 − θ)} for α/(2(3α−1)) ≤ θ < 1/2,
  provided γ > 2θ.

No statistical loss
Approach 2: Approximate the representer
Ridge Regression: Nyström Approximation

▶ Given: {(x_i, y_i)}_{i=1}^n where x_i ∈ X, y_i ∈ R
▶ Task: Find a regressor f s.t. f(x_i) ≈ y_i.
▶ Idea: Restrict f to H_m = {f ∈ H : f = ∑_{i=1}^m α_i k(·, x_i)} and do kernel ridge regression,

    min_{f∈H_m} (1/n) ∑_{i=1}^n (⟨f, k(·, x_i)⟩_H − y_i)² + λ‖f‖²_H    (λ > 0)

▶ Solution: Since f ∈ H_m,

    min_{α∈R^m} (1/n) ∑_{i=1}^n ([K_{nm} α]_i − y_i)² + λ α^⊤ K_{mm} α    (λ > 0)

  where [K_{mm}]_{ij} = k(x_i, x_j) for i, j = 1, …, m and [K_{nm}]_{ij} = k(x_i, x_j) for i = 1, …, n, j = 1, …, m. Therefore

    α = (K_{nm}^⊤ K_{nm} + nλ K_{mm})^{−1} K_{nm}^⊤ y.

Computation: O(m²n).
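A minimal sketch of the Nyström solution above (assumptions: Gaussian kernel, and the m centres chosen uniformly at random from the data):

```python
# Sketch only: Nystrom kernel ridge regression,
# alpha = (K_nm^T K_nm + n*lam*K_mm)^{-1} K_nm^T y, at O(m^2 n) cost.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d, m, lam = 2000, 3, 100, 1e-3
X = rng.uniform(-1, 1, size=(n, d))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=n)

centres = X[rng.choice(n, size=m, replace=False)]   # the m landmark points
K_nm = gaussian_kernel(X, centres)                  # n x m
K_mm = gaussian_kernel(centres, centres)            # m x m

alpha = np.linalg.solve(K_nm.T @ K_nm + n * lam * K_mm, K_nm.T @ y)

def predict(T):
    # f(t) = sum_i alpha_i k(t, x_i), sum over the m centres
    return gaussian_kernel(T, centres) @ alpha

print(np.mean((predict(X) - y) ** 2))               # training error
```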
Kernel PCA: Nyström Approximation

▶   arg sup { ⟨f, Σ̂ f⟩_H : f ∈ H_m, ‖f‖_H = 1 },

  where

    H_m := { f ∈ H : f = ∑_{i=1}^m β_i k(·, x_i), (β_1, …, β_m) ∈ R^m }.

▶   β = K_{mm}^{−1/2} u,

  where u is an eigenvector of (1/n) K_{mm}^{−1/2} K_{nm}^⊤ K_{nm} K_{mm}^{−1/2}.

Computation: O(m²n).
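A minimal sketch of this computation (assumptions: Gaussian kernel, random landmarks, and centring of the kernel omitted for brevity):

```python
# Sketch only: Nystrom kernel PCA via beta = K_mm^{-1/2} u, with u an eigenvector
# of (1/n) K_mm^{-1/2} K_nm^T K_nm K_mm^{-1/2}, as on the slide.
import numpy as np
from scipy.linalg import sqrtm

def gaussian_kernel(A, B, sigma=1.0):
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

rng = np.random.default_rng(0)
n, d, m, n_components = 1000, 2, 50, 2
X = rng.normal(size=(n, d))

centres = X[rng.choice(n, size=m, replace=False)]
K_nm = gaussian_kernel(X, centres)                            # n x m
K_mm = gaussian_kernel(centres, centres) + 1e-10 * np.eye(m)  # small jitter for stability

K_mm_inv_sqrt = np.linalg.inv(sqrtm(K_mm).real)               # K_mm^{-1/2}
M = K_mm_inv_sqrt @ K_nm.T @ K_nm @ K_mm_inv_sqrt / n

evals, U = np.linalg.eigh(M)                                  # ascending order
top = U[:, np.argsort(evals)[::-1][:n_components]]
beta = K_mm_inv_sqrt @ top                                    # columns: beta = K_mm^{-1/2} u

scores = K_nm @ beta                                          # projections onto the components
print(scores.shape)                                           # (n, n_components)
```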
Computational vs. Statistical Trade-off

▶ Results similar to Approach 1 are derived for KRR with Nyström approximation (Bach, 2013; Alaoui and Mahoney, 2015; Rudi et al., 2015).
▶ Kernel PCA with Nyström approximation

Many directions and open questions . . .
Questions
Thank You
References I

Fine, S. and Scheinberg, K. (2001). Efficient SVM training using low-rank kernel representations. Journal of Machine Learning Research, 2:243–264.

Rahimi, A. and Recht, B. (2008a). Random features for large-scale kernel machines. In Platt, J. C., Koller, D., Singer, Y., and Roweis, S. T., editors, Advances in Neural Information Processing Systems 20, pages 1177–1184. Curran Associates, Inc.

Rahimi, A. and Recht, B. (2008b). Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, pages 1313–1320.

Rudi, A. and Rosasco, L. (2016). Generalization properties of learning with random features. https://arxiv.org/pdf/1602.04474.pdf.

Smola, A. J. and Schölkopf, B. (2000). Sparse greedy matrix approximation for machine learning. In Proc. 17th International Conference on Machine Learning, pages 911–918. Morgan Kaufmann, San Francisco, CA.

Sriperumbudur, B. K. and Sterge, N. (2017). Statistical consistency of kernel PCA with random features. arXiv:1706.06296.

Sriperumbudur, B. K. and Szabó, Z. (2015). Optimal rates for random Fourier features. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 1144–1152. Curran Associates, Inc.

Sutherland, D. and Schneider, J. (2015). On the error of random Fourier features. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, pages 862–871.

Williams, C. and Seeger, M. (2001). Using the Nyström method to speed up kernel machines. In Leen, T. K., Dietterich, T. G., and Tresp, V., editors, Advances in Neural Information Processing Systems 13, pages 682–688, Cambridge, MA. MIT Press.
References II

Yang, Y., Pilanci, M., and Wainwright, M. J. (2015). Randomized sketches for kernels: Fast and optimal non-parametric regression. Technical report. https://arxiv.org/pdf/1501.06195.pdf.