
Kernel Mean Embeddings

George Hron, Adam Scibior

Department of Engineering, University of Cambridge

Reading Group, March 2017

Outline

RKHS

Kernel Mean Embeddings

Characteristic kernels

Two Sample Testing

MMD

Kernelised Stein Discrepancy

Kernel Bayes’ Rule


Kernels

A kernel k is a positive definite function k : X × X → R, that is, for all a_1, …, a_n ∈ R and x_1, …, x_n ∈ X,

Σ_{i,j=1}^n a_i a_j k(x_i, x_j) ≥ 0

Intuitively, a kernel is a measure of similarity between two elements of X.

Kernels

Commonly used kernels:

- polynomial: k(x_1, x_2) = (⟨x_1, x_2⟩ + 1)^d

- Gaussian RBF: k(x, y) = exp( −‖x − y‖² / (2σ²) )

- Laplace: k(x, y) = exp( −‖x − y‖ / σ )
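To make this concrete, a minimal NumPy sketch (function names, bandwidth, and sample sizes are illustrative, not from the slides) that builds a Gaussian RBF Gram matrix and numerically checks positive semi-definiteness via its eigenvalues:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """Gaussian RBF kernel matrix, k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = rbf_kernel(X, X)

# Positive definiteness: the Gram matrix of a valid kernel has no
# (numerically significant) negative eigenvalues on any sample.
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```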

RKHS

A reproducing kernel Hilbert space (RKHS) for a kernel k is the Hilbert space H spanned by the functions k(x, ·) for all x ∈ X (completed under the induced norm), with the inner product defined by

⟨k(x_1, ·), k(x_2, ·)⟩ = k(x_1, x_2),

which is well-defined on all of H by linearity of the inner product.

The equation above is known as the kernel trick and lets us treat H as an implicit feature space, where we never have to explicitly evaluate the feature map.

RKHS

H has the reproducing property, that is, for all f ∈ H and all x ∈ X,

⟨f, k(x, ·)⟩ = f(x)
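A small NumPy sketch (the points and coefficients are arbitrary) of how the kernel trick and the reproducing property reduce inner products in H to kernel evaluations, for functions in the span of {k(x, ·)}:

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
# f = sum_i a_i k(z_i, .) and g = sum_j b_j k(w_j, .) lie in the span of {k(x, .)}.
Z, W = rng.normal(size=(4, 3)), rng.normal(size=(6, 3))
a, b = rng.normal(size=4), rng.normal(size=6)

# Kernel trick: <f, g>_H = sum_ij a_i b_j k(z_i, w_j); no explicit feature map needed.
inner_fg = a @ rbf_kernel(Z, W) @ b

# Reproducing property: <f, k(x, .)>_H = sum_i a_i k(z_i, x), which is exactly f(x).
x = rng.normal(size=(1, 3))
f_at_x = (a @ rbf_kernel(Z, x))[0]
print(inner_fg, f_at_x)
```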


Expectations in RKHS

Goal: evaluating expected values of functions from an RKHS.

As with the kernel trick, it turns out that this is possible in terms of inner products, making the computation analytically tractable,

E_p[f(x)] = ∫_X p(dx) f(x) = ∫_X p(dx) ⟨k(x, ·), f⟩_{H_k} ?= ⟨ ∫_X p(dx) k(x, ·), f ⟩_{H_k} =: ⟨µ_p, f⟩_{H_k}

(the "?=" step, exchanging the integral and the inner product, is justified on the next slide).

µ_p is the so-called kernel mean embedding (KME) of the distribution p.

Note: X is assumed measurable throughout the whole presentation.


Existence of KME

Riesz representation theorem: For every bounded linear functional T : H → R (resp. T : H → C), there exists a unique g ∈ H such that T(f) = ⟨g, f⟩_H, ∀f ∈ H.

Expectation is a linear functional, and ∀f ∈ H we have,

E_p f(x) ≤ |E_p f(x)| ≤ E_p |f(x)| = E_p |⟨f, k(x, ·)⟩_H| ≤ ‖f‖_H E_p ‖k(x, ·)‖_H

Since ‖k(x, ·)‖_H = √(⟨k(x, ·), k(x, ·)⟩_H) = √(k(x, x)), the functional is bounded whenever E_p √(k(x, x)) < ∞. Thus, if E_p √(k(x, x)) < ∞, then µ_p ∈ H exists, and

E_p f(x) = ⟨µ_p, f⟩_H, ∀f ∈ H.


Estimating KME

Because µ_p is generally unknown, it has to be estimated.

A natural (and minimax optimal) estimator is the sample mean,

µ̂_p := (1/N) Σ_{n=1}^N k(x_n, ·)

and in particular,

µ̂_p(x) = ⟨µ̂_p, k(x, ·)⟩_H = (1/N) Σ_{n=1}^N k(x_n, x) → E_{p(x')} k(x', x)  as N → ∞.
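A minimal NumPy sketch of this estimator (distribution, sample size, and bandwidth are illustrative): evaluate µ̂_p at a few query points and compare with the population quantity E_{x'∼p} k(x', x), which happens to be available in closed form for a Gaussian p and a unit-bandwidth Gaussian kernel.

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def kme_evaluate(samples, query, sigma=1.0):
    """Empirical KME evaluated at query points: mu_hat(x) = (1/N) sum_n k(x_n, x)."""
    return rbf_kernel(query, samples, sigma).mean(axis=1)

rng = np.random.default_rng(0)
xs = rng.normal(size=(5000, 1))                  # samples from p = N(0, 1)
grid = np.linspace(-3.0, 3.0, 7)[:, None]

# For p = N(0, 1) and a unit-bandwidth RBF kernel, the population value is
# E_{x' ~ p} k(x', x) = exp(-x^2 / 4) / sqrt(2).
print(kme_evaluate(xs, grid))
print(np.exp(-grid[:, 0]**2 / 4) / np.sqrt(2))   # the two should roughly agree
```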



Representing Distributions via KME

A positive definite kernel is called characteristic if

T : M_+^1(X) → H,  p ↦ µ_p

is injective; M_+^1(X) is the set of probability measures on X. [5, 6]

Characteristic kernels

Proving that a kernel is characteristic is non-trivial in general, but sufficient conditions exist. Three well-known examples:

- Universality: If k is continuous, X is compact, and H_k is dense in C(X) w.r.t. the L∞ norm, then k is characteristic. [6, 8]

- Integral strict positive definiteness: A bounded measurable kernel k is called integrally strictly positive definite if ∫_X ∫_X k(x, y) µ(dx) µ(dy) > 0 for all non-zero finite signed Borel measures µ on X; every such kernel is characteristic. [22]

- Some stationary kernels: For X = R^d, a stationary kernel k is characteristic iff supp Λ(ω) = R^d, where Λ(ω) is the spectral density of k (cf. Bochner's theorem). [22]


t-test

Have {x_1, …, x_N} iid∼ N(µ_1, σ_1²) and {y_1, …, y_M} iid∼ N(µ_2, σ_2²), where all parameters are unknown.

Declare H_0: µ_1 = µ_2; H_1: µ_1 ≠ µ_2. Then for

µ̂_1 := (1/N) Σ_{n=1}^N x_n,   µ̂_2 := (1/M) Σ_{m=1}^M y_m,

we have µ̂_1 ∼ N(µ_1, σ_1²/N), µ̂_2 ∼ N(µ_2, σ_2²/M), and µ̂_1 − µ̂_2 ∼ N(µ_1 − µ_2, σ_1²/N + σ_2²/M).

Hence, under H_0, t ∼ t_ν, where

t := ((µ̂_1 − µ̂_2) − 0) / √(s_1²/N + s_2²/M),

s_1² := Σ_{i=1}^N (x_i − µ̂_1)² / (N − 1),   s_2² := Σ_{i=1}^M (y_i − µ̂_2)² / (M − 1),

ν := (s_1²/N + s_2²/M)² / ( (s_1²/N)²/(N − 1) + (s_2²/M)²/(M − 1) )   (Welch–Satterthwaite approximation).
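This unequal-variance ("Welch") t-test is available off the shelf; a minimal SciPy sketch (the means, variances, and sample sizes below are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=200)   # {x_n} ~ N(mu_1, sigma_1^2)
y = rng.normal(loc=0.3, scale=1.5, size=150)   # {y_m} ~ N(mu_2, sigma_2^2)

# equal_var=False gives Welch's t statistic and the approximate nu above.
t_stat, p_value = stats.ttest_ind(x, y, equal_var=False)
print(t_stat, p_value)   # reject H0: mu_1 = mu_2 when p_value < alpha
```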


t-test

[Figure: the errors of A/B testing, from http://grasshopper.com/blog/the-errors-of-ab-testing-your-conclusions-can-make-things-worse/]

More examples

[Figures: further two-sample testing examples, courtesy of Arthur Gretton, Gatsby Neuroscience Unit]


Recap

Kernel mean embedding,

E_{p(x)}[f(x)] = ⟨f, µ_p⟩_H, ∀f ∈ H

For a characteristic kernel k and the corresponding RKHS H,

µ_p = µ_q iff p = q.

This means we might be able to distinguish distributions by comparing the corresponding kernel mean embeddings.


Maximum Mean Discrepancy (MMD)

Measure the distance between mean embeddings by the worst-case difference of expected values [9],

MMD(p, q, H) := sup_{f∈H, ‖f‖_H≤1} ( E_{p(x)}[f(x)] − E_{q(y)}[f(y)] ) = sup_{f∈H, ‖f‖_H≤1} ⟨µ_p − µ_q, f⟩_H

Notice that ⟨µ_p − µ_q, f⟩_H ≤ ‖µ_p − µ_q‖_H ‖f‖_H, with equality iff f ∝ µ_p − µ_q. Hence,

MMD(p, q, H) = ‖µ_p − µ_q‖_H


Witness function

[Figures: MMD witness function examples, courtesy of Arthur Gretton, Gatsby Neuroscience Unit]

Estimation

Recall the empirical kernel mean embedding estimator,

µ̂_p = (1/N) Σ_{n=1}^N k(x_n, ·)

We can estimate the square of the MMD by substituting the empirical estimator. For {x_i}_{i=1}^N iid∼ p and {y_i}_{i=1}^N iid∼ q,

MMD̂² = 1/(N(N−1)) Σ_{i=1}^N Σ_{j≠i} ( k(x_i, x_j) + k(y_i, y_j) ) − (1/N²) Σ_{i=1}^N Σ_{j=1}^N ( k(x_i, y_j) + k(x_j, y_i) )
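A minimal NumPy sketch of this estimator with a Gaussian RBF kernel (the kernel choice, bandwidth, and sample sizes are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimator of MMD^2 from the slide, assuming equal sample sizes N."""
    N = X.shape[0]
    Kxx, Kyy, Kxy = rbf_kernel(X, X, sigma), rbf_kernel(Y, Y, sigma), rbf_kernel(X, Y, sigma)
    within = (Kxx.sum() - np.trace(Kxx) + Kyy.sum() - np.trace(Kyy)) / (N * (N - 1))
    cross = 2.0 * Kxy.sum() / N**2
    return within - cross

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, size=(500, 2))
X2 = rng.normal(0.0, 1.0, size=(500, 2))   # second sample from the same p
Y = rng.normal(0.5, 1.0, size=(500, 2))    # sample from a shifted q
print(mmd2_unbiased(X1, X2))               # close to 0 when p = q
print(mmd2_unbiased(X1, Y))                # noticeably positive when p != q
```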


Estimation

Proof sketch:

MMD²(p, q, H) = ‖µ_p − µ_q‖²_H = ‖µ_p‖²_H + ‖µ_q‖²_H − 2⟨µ_p, µ_q⟩_H

where,

‖µ_p‖²_H = ⟨µ_p, µ_p⟩_H = E_{x∼p} µ_p(x) = E_{x∼p} ⟨µ_p, k(x, ·)⟩_H = E_{x∼p} E_{x'∼p} k(x, x') ≈ 1/(N(N−1)) Σ_{i=1}^N Σ_{j≠i} k(x_i, x_j)

Similarly for the other terms.


Considerations

- Under H_0, the distribution of MMD̂² asymptotically approaches an infinite weighted sum of (shifted) chi-squared random variables, with weights given by the eigenvalues of the associated kernel integral operator.
  - Approximations to the sampling distribution exist [1, 10, 11, 12]; a simple permutation-based approximation is sketched below.

- Computing the naive MMD̂² estimator is O(N²).
  - Linear-time approximations [2, 26].

- Performance depends on the choice of kernel; there is no universally best-performing kernel.
  - Previously chosen heuristically; recently replaced by hyperparameter optimisation based on test power, or on Bayesian evidence [4, 24].

- The empirical kernel mean estimator might be suboptimal.
  - Better estimators exist [4, 16, 17].
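One simple way to calibrate the test in practice, alongside the approximations cited above, is a permutation (resampling) estimate of the null distribution; a minimal sketch, reusing mmd2_unbiased from the estimation sketch earlier (the permutation count is illustrative):

```python
import numpy as np
# Assumes mmd2_unbiased(X, Y, sigma) from the MMD estimation sketch above.

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=500, seed=0):
    """Permutation approximation to the null distribution of MMD^2 under H0: p = q."""
    rng = np.random.default_rng(seed)
    observed = mmd2_unbiased(X, Y, sigma)
    pooled = np.concatenate([X, Y], axis=0)
    n = X.shape[0]
    null_stats = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(pooled.shape[0])        # relabel the pooled sample
        null_stats[b] = mmd2_unbiased(pooled[idx[:n]], pooled[idx[n:]], sigma)
    # p-value: how often a permuted statistic is at least as large as the observed one.
    return (1 + np.sum(null_stats >= observed)) / (n_perm + 1)
```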


Optimised MMD and model criticism

[Figure: via Arthur Gretton's Twitter]

Optimised MMD and model criticism

MMD tends to lose test power with increasing dimensionality. [18]

Pick the kernel such that the test power is maximised, [24]

P_{H_1}( MMD̂² > ĉ_α/m ) = P_{H_1}( (MMD̂² − MMD²)/√V_m > (ĉ_α/m − MMD²)/√V_m ) → 1 − Φ( ĉ_α/(m √V_m) − MMD²/√V_m )   as m → ∞,

where ĉ_α is an estimator of the theoretical rejection threshold c_α, V_m is the variance of MMD̂² at sample size m, and Φ is the standard normal CDF.


Optimised MMD and model criticism

[Figure: witness function evaluated on a test set, ARD weights highlighted. [24]]


Integral Probability metrics

Have two probability measures p and q. An Integral Probability Metric (IPM) defines a discrepancy measure,

d_H(p, q) := sup_{f∈H} | E_{p(x)}[f(x)] − E_{q(y)}[f(y)] |

where the space of functions H must be rich enough such that d_H(p, q) = 0 iff p = q.

f* := argmax_{f∈H} | E_p[f(x)] − E_q[f(y)] | is the witness function.

MMD is an IPM where H is the unit ball in a characteristic RKHS.
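For the MMD case, the (unnormalised) empirical witness is just the difference of the two empirical mean embeddings, f̂*(t) ∝ µ̂_p(t) − µ̂_q(t); a minimal NumPy sketch evaluating it on a grid (the distributions and bandwidth are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd_witness(X, Y, grid, sigma=1.0):
    """Unnormalised empirical MMD witness: f*(t) ~ mu_p(t) - mu_q(t)."""
    return rbf_kernel(grid, X, sigma).mean(1) - rbf_kernel(grid, Y, sigma).mean(1)

rng = np.random.default_rng(0)
X = rng.normal(-1.0, 1.0, size=(2000, 1))   # sample from p
Y = rng.normal(+1.0, 1.0, size=(2000, 1))   # sample from q
grid = np.linspace(-4, 4, 9)[:, None]
print(mmd_witness(X, Y, grid))  # positive where p has more mass, negative where q does
```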


Stein Discrepancy

Construct H such that E_q[f(y)] = 0, ∀f ∈ H.

d_H(p, q) = S_q(p, T, H) := sup_{f∈H} | E_{p(x)}[ (T f)(x) ] |

where T is a real-valued operator and E_q[(T f)(y)] = 0, ∀f ∈ H.

A standard choice of T in the scalar case x = x ∈ R is Stein's operator,

(T f)(x) = A_q f(x) := s_q(x) f(x) + ∇_x f(x),

where s_q(x) := ∇_x log q(x) is the score function of q.
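As a quick sanity check, the Stein identity E_q[s_q(x) f(x) + f'(x)] = 0 can be verified by Monte Carlo, e.g. for a standard normal q (where s_q(x) = −x) and a smooth bounded test function; a minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)           # samples from q = N(0, 1), so s_q(x) = -x

f = np.tanh                               # a smooth, bounded test function
f_prime = lambda t: 1.0 - np.tanh(t)**2   # its derivative

# Stein identity: E_q[ s_q(x) f(x) + f'(x) ] = 0, up to Monte Carlo error.
print(np.mean(-x * f(x) + f_prime(x)))    # close to 0
```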


Kernelised Stein Discrepancy (KSD)

Assume that p(x) and q(y) are differentiable continuous density functions; q might be unnormalised [3, 15].

Use Stein's operator in the unit ball of the product RKHS F^D,

S_q(p, A_q, F^D) := sup_{f∈F^D, ‖f‖_{F^D}≤1} | E_{p(x)}[ Tr(A_q f(x)) ] |
 = sup_{f∈F^D, ‖f‖_{F^D}≤1} | E_{p(x)}[ Tr( f(x) s_q(x)^T + ∇_x f(x) ) ] |
 = sup_{f∈F^D, ‖f‖_{F^D}≤1} | Σ_{d=1}^D E_{p(x)}[ f_d(x) s_{q,d}(x) + ∂f_d(x)/∂x_d ] |
 = sup_{f∈F^D, ‖f‖_{F^D}≤1} | ⟨f, β_q⟩_{F^D} | = ‖β_q‖_{F^D}   (provided β_q ∈ F^D),

where ⟨f, g⟩_{F^D} := Σ_{d=1}^D ⟨f_d, g_d⟩_F and β_q := E_p[ k(x, ·) s_q(x) + ∇_x k(x, ·) ].
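Expanding the norm above gives the usual U-statistic form of KSD² in terms of the Stein kernel u_q(x, x') = s_q(x)^T s_q(x') k(x, x') + s_q(x)^T ∇_{x'} k(x, x') + s_q(x')^T ∇_x k(x, x') + Tr(∇_x ∇_{x'} k(x, x')). A minimal NumPy sketch with an RBF kernel, where the (possibly unnormalised) target q enters only through its score function (here q = N(0, I), so s_q(x) = −x; bandwidth and sample sizes are illustrative):

```python
import numpy as np

def ksd_u_statistic(X, score_q, sigma=1.0):
    """U-statistic estimate of KSD^2 between the sample X ~ p and a (possibly
    unnormalised) density q, given only through its score s_q = grad log q,
    using a Gaussian RBF kernel."""
    N, d = X.shape
    S = score_q(X)                                   # (N, d) score evaluations s_q(x_i)
    diff = X[:, None, :] - X[None, :, :]             # x_i - x_j
    sq = np.sum(diff**2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))
    t1 = (S @ S.T) * K                                           # s(x_i)^T s(x_j) k
    t2 = np.einsum('id,ijd->ij', S, diff) / sigma**2 * K         # s(x_i)^T grad_{x_j} k
    t3 = -np.einsum('jd,ijd->ij', S, diff) / sigma**2 * K        # s(x_j)^T grad_{x_i} k
    t4 = (d / sigma**2 - sq / sigma**4) * K                      # Tr(grad_{x_i} grad_{x_j} k)
    U = t1 + t2 + t3 + t4
    np.fill_diagonal(U, 0.0)                          # U-statistic: drop i = j terms
    return U.sum() / (N * (N - 1))

rng = np.random.default_rng(0)
score_std_normal = lambda X: -X                       # q = N(0, I)  =>  s_q(x) = -x
print(ksd_u_statistic(rng.normal(size=(500, 2)), score_std_normal))            # ~ 0
print(ksd_u_statistic(rng.normal(0.7, 1.0, size=(500, 2)), score_std_normal))  # > 0
```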


KSD vs. alternative GoF tests

Variational Inference using KSD

Natural idea: minimise KSD between the true and an approximate posterior distribution → a particular case of Operator Variational Inference. [19]

Quick detour: For the true posterior p(x) = p̃(x)/Z_p and an approximation q(x) = q̃(x)/Z_q (note that the scores s_p = ∇_x log p̃ and s_q = ∇_x log q̃ do not require the normalising constants),

S(p, q) = sup_{f∈F^D, ‖f‖_{F^D}≤1} | E_{q(x)}[ Tr(A_p f(x)) ] |
 = sup_{f∈F^D, ‖f‖_{F^D}≤1} | E_{q(x)}[ Tr(A_p f(x) − A_q f(x)) ] |   (since E_q[Tr(A_q f(x))] = 0)
 = sup_{f∈F^D, ‖f‖_{F^D}≤1} | E_{q(x)}[ f(x)^T (s_p(x) − s_q(x)) ] |

and, squaring,

S(p, q)² = E_{q(x)} E_{q(x')}[ k(x, x') (s_p(x) − s_q(x))^T (s_p(x') − s_q(x')) ]

→ parameterise q through auxiliary noise, q(x) = ∫ q(x, ε) dε, e.g. q(x, ε) = N(x | T_θ(ε), σ²I) N(ε | 0, I).


Learning to Sample using KSD

Two papers [15, 25] on optimising a set of particles, resp. a set of samples from a model (amortisation), by repeatedly applying KL-optimal perturbations x_t = x_{t−1} + ε_t f(x_{t−1}).

Read Yingzhen's blog post [13] and a recent paper [14]!

[Figure: random walk through the latent space of a GAN trained with a KSD adversary. [25]]
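A minimal sketch of a KL-optimal perturbation step of this form (an SVGD-style particle update with an RBF kernel; the step size, bandwidth, target, and iteration count are illustrative assumptions, and this is a sketch rather than either paper's exact algorithm):

```python
import numpy as np

def svgd_step(X, score_p, sigma=1.0, step=0.1):
    """One SVGD-style particle update x_i <- x_i + step * phi(x_i), with
    phi(x_i) = (1/N) sum_j [ k(x_j, x_i) s_p(x_j) + grad_{x_j} k(x_j, x_i) ]."""
    diff = X[:, None, :] - X[None, :, :]                 # x_i - x_j
    K = np.exp(-np.sum(diff**2, -1) / (2 * sigma**2))    # k(x_j, x_i), symmetric
    grad_k = K[:, :, None] * diff / sigma**2             # grad wrt x_j of k(x_j, x_i)
    phi = (K @ score_p(X) + grad_k.sum(axis=1)) / X.shape[0]
    return X + step * phi

rng = np.random.default_rng(0)
X = 3.0 * rng.normal(size=(100, 2))                      # initial particles
score = lambda X: -(X - 2.0)                             # target p = N(2, I): s_p(x) = -(x - 2)
for _ in range(300):
    X = svgd_step(X, score, sigma=1.0, step=0.1)
print(X.mean(axis=0))                                    # particles move toward (2, 2)
```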


Tensor product

If H and K are Hilbert spaces then so is H ⊗ K.

If {a_i}_{i=1}^∞ spans H and {b_j}_{j=1}^∞ spans K, then {a_i ⊗ b_j}_{i,j=1}^∞ spans H ⊗ K.

For any f, f' ∈ H and g, g' ∈ K we have

⟨f ⊗ g, f' ⊗ g'⟩ = ⟨f, f'⟩ ⟨g, g'⟩

H ⊗ K is isomorphic to a space of bounded linear operators K → H,

(f ⊗ g) g' = f ⟨g, g'⟩

Covariance operators

Covariance operators C_XY : K → H are defined as

C_XY = E[ k_X(X, ·) ⊗ k_Y(Y, ·) ]

C_XX = E[ k_X(X, ·) ⊗ k_X(X, ·) ]

For any f ∈ H and g ∈ K,

⟨f, C_XY g⟩ = ⟨C_YX f, g⟩ = E[ f(X) g(Y) ]

Empirical estimators

(X_i, Y_i) ∼ i.i.d. P(X, Y)

Ĉ_XY = (1/N) Σ_{i=1}^N k_X(X_i, ·) ⊗ k_Y(Y_i, ·)

Ĉ_XX = (1/N) Σ_{i=1}^N k_X(X_i, ·) ⊗ k_X(X_i, ·)
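A small numerical illustration of how the empirical operator acts: applying Ĉ_XY to g = k_Y(y_0, ·) and evaluating the result at a point x_0 only requires kernel evaluations (the data-generating process and bandwidths below are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

rng = np.random.default_rng(0)
N = 1000
Xs = rng.normal(size=(N, 1))
Ys = Xs + 0.1 * rng.normal(size=(N, 1))    # correlated (X, Y) pairs

# (C_XY g)(x0) with g = k_Y(y0, .):
#   (1/N) sum_i k_X(X_i, x0) <k_Y(Y_i, .), k_Y(y0, .)> = (1/N) sum_i k_X(X_i, x0) k_Y(Y_i, y0)
x0, y0 = np.array([[0.5]]), np.array([[0.5]])
value = np.mean(rbf_kernel(Xs, x0)[:, 0] * rbf_kernel(Ys, y0)[:, 0])
print(value)
```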

Conditional embeddings

Let

µ_{X|Y=y} = E[ k_X(X, ·) | Y = y ]

We seek an operator U_{X|Y} such that for all y,

µ_{X|Y=y} = U_{X|Y} k_Y(y, ·),

which we call the conditional mean embedding.

Conditional embeddings

Lemma ([21])

For any f ∈ H, let h(y) = E[f(X) | Y = y] and assume that h ∈ K. Then

C_YY h = C_YX f

Proof.

Take any g ∈ K.

⟨g, C_YY h⟩ = E_{Y∼P(Y)}[ g(Y) h(Y) ]
 = E_{Y∼P(Y)}[ g(Y) E_{X∼P(X|Y)}[ f(X) | Y ] ]
 = E_{Y∼P(Y)}[ E_{X∼P(X|Y)}[ g(Y) f(X) | Y ] ]
 = E_{X,Y∼P(X,Y)}[ g(Y) f(X) ]
 = ⟨g, C_YX f⟩

Conditional embeddings

Whenever C_YY is invertible we have

U_{X|Y} = C_XY C_YY^{-1}

In practice we use a regularized version

U^ε_{X|Y} = C_XY (C_YY + εI)^{-1}

which can be estimated by

Û^ε_{X|Y} = Ĉ_XY (Ĉ_YY + εI)^{-1}
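In Gram-matrix form the resulting conditional mean embedding estimate is µ̂_{X|Y=y} = Σ_i β_i(y) k_X(X_i, ·) with β(y) = (K_Y + NεI)^{-1} k_Y(y), so conditional expectations of (sufficiently nice) f are approximated by Σ_i β_i(y) f(X_i). A minimal NumPy sketch (the data-generating process, bandwidths, and regularisation constant are illustrative):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def cme_weights(Ys, y, eps=1e-3, sigma=0.5):
    """beta(y) = (K_Y + N*eps*I)^{-1} k_Y(y): mu_{X|Y=y} ~ sum_i beta_i k_X(X_i, .)."""
    N = Ys.shape[0]
    K_Y = rbf_kernel(Ys, Ys, sigma)
    return np.linalg.solve(K_Y + N * eps * np.eye(N), rbf_kernel(Ys, y, sigma)[:, 0])

rng = np.random.default_rng(0)
Ys = rng.uniform(-2, 2, size=(1500, 1))
Xs = Ys**2 + 0.1 * rng.normal(size=Ys.shape)     # X | Y = y concentrates around y^2

beta = cme_weights(Ys, np.array([[1.5]]))
# Treating f(x) = x as (approximately) an element of H_X, E[X | Y = 1.5] ~ beta . Xs.
print(beta @ Xs[:, 0])                           # roughly 1.5^2 = 2.25, up to smoothing bias
```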

Kernel Bayes’ rule

We could compute the posterior embedding by

µ_{X|Y=y} = U_{X|Y} k_Y(y, ·)

but we may want a different prior.

Setting:

- let P(X, Y) be the joint distribution of the model

- let Q(X, Y) be a different distribution such that P(Y | X = x) = Q(Y | X = x) for all x

- we have a sample (X_i, Y_i) ∼ Q and a sample X_j ∼ P(X)

- we want to estimate the posterior embedding E_{X∼P(X|Y=y)}[ k_X(X, ·) ] for some particular y

Kernel sum rule

Sum rule: P(X) = Σ_Y P(X, Y)

Kernel sum rule: µ_X = U_{X|Y} µ_Y

Corollary: C_XX = U_{(XX)|Y} µ_Y

Kernel product rule

Product rule: P(X, Y) = P(X | Y) P(Y)

Kernel product rule: C_XY = U_{X|Y} C_YY

Kernel Bayes’ rule

Observe that

C_XY = U_{Y|X} C_XX

C_YY = U_{(YY)|X} µ_X

and so

U_{X|Y} = C_XY C_YY^{-1} = (U_{Y|X} C_XX)(U_{(YY)|X} µ_X)^{-1}

lets us express U_{X|Y} in terms of U_{Y|X}.

Reminder

Setting:

- let P(X, Y) be the joint distribution of the model

- let Q(X, Y) be a different distribution such that P(Y | X = x) = Q(Y | X = x) for all x

- we have a sample (X_i, Y_i) ∼ Q and a sample X_j ∼ P(X)

- we want to estimate the posterior embedding E_{X∼P(X|Y=y)}[ k_X(X, ·) ] for some particular y

Kernel Bayes’ rule

We have

U^P_{Y|X} = U^Q_{Y|X}

U^P_{X|Y} ≠ U^Q_{X|Y}

but

U^P_{X|Y} = (U^P_{Y|X} C^P_{XX})(U^P_{(YY)|X} µ^P_X)^{-1} = (U^Q_{Y|X} C^P_{XX})(U^Q_{(YY)|X} µ^P_X)^{-1}

so

µ^P_{X|Y=y} = (U^Q_{Y|X} C^P_{XX})(U^Q_{(YY)|X} µ^P_X)^{-1} k_Y(y, ·)

Kernel Bayes’ rule

In practice we use the following estimator,

µ̂_{X|Y=y} = (Û^Q_{Y|X} Ĉ^P_{XX}) ( (Û^Q_{(YY)|X} µ̂^P_X)² + δI )^{-1} (Û^Q_{(YY)|X} µ̂^P_X) k_Y(y, ·)

which can be written as

µ̂_{X|Y=y} = A (B² + δI)^{-1} B k_Y(y, ·),

A = Û^Q_{Y|X} Ĉ^P_{XX} = Ĉ^Q_{YX} (Ĉ^Q_{XX} + εI)^{-1} Ĉ^P_{XX},

B = Û^Q_{(YY)|X} µ̂^P_X = Ĉ^Q_{(YY)X} (Ĉ^Q_{XX} + εI)^{-1} µ̂^P_X.

For ε = N^{-1/3} and δ = N^{-4/27} it can be shown [7] that

‖ µ̂_{X|Y=y} − µ_{X|Y=y} ‖ = O_p(N^{-4/27}).

Kernel Bayes’ rule

When is Kernel Bayes’ Rule useful?

- when densities aren't tractable (ABC)

- when you don't know how to write a model but you know how to pick a kernel

- perhaps it can perform better than alternatives even if the above aren't satisfied

Other topics

Other topics combining KMEs with Bayesian inference

- adaptive Metropolis-Hastings using KME [20]

- Hamiltonian Monte Carlo without gradients [23]

- Bayesian estimation of KMEs [4]

References

[1] M. A. Arcones and E. Giné. On the Bootstrap of U and V Statistics. Annals of Statistics, 20(2):655–674, 1992.

[2] K. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast Two-Sample Testing with Analytic Representations of Probability Measures. arXiv e-prints, 2015.

[3] K. Chwialkowski, H. Strathmann, and A. Gretton. A Kernel Test of Goodness of Fit. In Proceedings of the International Conference on Machine Learning (ICML), 2016.

[4] S. Flaxman, D. Sejdinovic, J. P. Cunningham, and S. Filippi. Bayesian learning of kernel embeddings. In UAI, 2016.

[5] K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality Reduction for Supervised Learning with Reproducing Kernel Hilbert Spaces. JMLR, 5:73–99, 2004.

[6] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel Measures of Conditional Dependence. In NIPS, pages 489–496, 2007.

[7] K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. JMLR, 14:3753–3783, 2013.

[8] A. Gretton, K. M. Borgwardt, M. Rasch, B. Schölkopf, and A. J. Smola. A Kernel Method for the Two-sample-problem. In NIPS, pages 513–520, 2006.

[9] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A Kernel Two-sample Test. JMLR, 13:723–773, 2012.

[10] A. Gretton, K. Fukumizu, Z. Harchaoui, and B. K. Sriperumbudur. A Fast, Consistent Kernel Two-sample Test. In NIPS, pages 673–681, 2009.

[11] W. Hoeffding. Probability Inequalities for Sums of Bounded Random Variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

[12] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Wiley, 1994.

[13] Y. Li. Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm. Blog post, August 2016.

[14] Y. Li, R. E. Turner, and Q. Liu. Approximate Inference with Amortised MCMC. arXiv:1702.08343, 2017.

[15] Q. Liu, J. D. Lee, and M. I. Jordan. A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation. arXiv:1602.03253, 2016.

[16] K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf. Kernel mean shrinkage estimators. JMLR, 17(48):1–41, 2016.

[17] K. Muandet, B. Sriperumbudur, and B. Schölkopf. Kernel mean estimation via spectral filtering. In NIPS, pages 1–9, 2014.

[18] A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. Wasserman. On the high-dimensional power of linear-time kernel two-sample testing under mean-difference alternatives. arXiv:1411.6314, 2014.

[19] R. Ranganath, D. Tran, J. Altosaar, and D. Blei. Operator Variational Inference. In NIPS, pages 496–504, 2016.

[20] D. Sejdinovic, H. Strathmann, M. L. Garcia, C. Andrieu, and A. Gretton. Kernel adaptive Metropolis-Hastings. In ICML, 2014.

[21] L. Song, J. Huang, A. Smola, and K. Fukumizu. Hilbert space embeddings of conditional distributions with applications to dynamical systems. In ICML, 2009.

[22] B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert Space Embeddings and Metrics on Probability Measures. JMLR, 11:1517–1561, 2010.

[23] H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS, 2015.

[24] D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. arXiv:1611.04488, 2016.

[25] D. Wang and Q. Liu. Learning to Draw Samples: With Application to Amortized MLE for Generative Adversarial Learning. arXiv:1611.01722, 2016.

[26] Q. Zhang, S. Filippi, A. Gretton, and D. Sejdinovic. Large-Scale Kernel Methods for Independence Testing. arXiv e-prints, 2016.
