
Kernel Thinning and Stein Thinning


Page 1: Kernel Thinning and Stein Thinning

Kernel Thinning and Stein Thinning

Lester Mackey

Microsoft Research New England

October 15, 2021

Joint work with Raaz Dwivedi, Marina Riabiz, Wilson Ye Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Chris J. Oates, and Abhishek Shetty

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 1 / 38

Page 2: Kernel Thinning and Stein Thinning

Motivation: Computational Cardiology

Computational Cardiology: Developing multiscale digital twins of human hearts to non-invasively predict disease progression and therapy response [Niederer, Sacks, Girolami, and Willcox, 2021]

[Figure: multiscale heart model spanning the organ scale, tissue scale, and single cell scale. Figure credit: Marina Riabiz]

Example (Heartbeats and arrhythmias)

Whole-organ heartbeats are coordinated by calcium signaling in heart cells

Dysregulation known to lead to life-threatening heart arrhythmias

Goal: Model impact of calcium signaling dysregulation on heart function [Campos, Shiferaw, Prassl, Boyle, Vigmond, and Plank, 2015, Niederer, Lumens, and Trayanova, 2019, Colman, 2019]

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 2 / 38

Page 3: Kernel Thinning and Stein Thinning

Motivation: Computational Cardiology

Figure credit: Augustin et al. 2020

Inferential Pipeline (Impact of calcium signaling dysregulation on heart function)

1 Estimate unknown calcium signaling model parameters from patient data

2 Capture uncertainty by sampling many likely parameter configurations

Run Markov chain Monte Carlo (MCMC) to (eventually) draw sample points from the posterior distribution P over unknown parameters

May require millions of sample points to adequately explore the target distribution P

3 Propagate uncertainty by simulating whole-heart model for each configuration

Problem: Each simulation requires 1000s of CPU hours!

Questions: Can we accurately summarize P using many fewer points? How?

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 3 / 38

Page 4: Kernel Thinning and Stein Thinning

Distribution Compression

Goal: Accurately summarize a distribution P using a small number of points

Standard solutions

i.i.d. sampling directly from P

MCMC with a Markov chain converging to P

[Figure: two scatter plots of sample points in coordinates x1 and x2]

Benefits: Readily available and eventually high-quality

Provide asymptotically exact sample estimates $P_n f = \frac{1}{n}\sum_{i=1}^n f(x_i)$ for intractable expectations $Pf = \mathbb{E}_{X \sim P}[f(X)]$

Drawback: Samples are too large!

Typical integration error $P_n f - Pf = \Theta(n^{-1/2})$: need $n = 10{,}000$ for 1% error

Prohibitive for expensive downstream tasks and function evaluations

Idea: Directly compress the high-quality sample approximations $P_n$

Reduces the general problem to approximating empirical distributions

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 4 / 38

Page 5: Kernel Thinning and Stein Thinning

Distribution Compression

Question: How do we effectively compress an empirical distribution Pn?

Standard solutions

Uniform subsampling / i.i.d. sampling

Standard thinning: Keep every t-th point

[Figure: two scatter plots of subsampled and standard-thinned points in coordinates x1 and x2]

Drawback: Large loss in accuracy, worst-case integration error $= \Theta(\sqrt{t/n})$

Compression from $n$ to $\sqrt{n}$ points increases error from $\Theta(n^{-1/2})$ to $\Theta(n^{-1/4})$

Question: Can we do better?

Minimax lower bounds for worst-case integration error to P:

$\Omega(n^{-1/2})$ for any compression procedure returning $\sqrt{n}$ points [Phillips and Tai, 2020]

$\Omega(n^{-1/2})$ for any approximation based only on $n$ i.i.d. points from P [Tolstikhin, Sriperumbudur, and Muandet, 2017]

This talk: Introduce a more effective compression strategy – kernel thinning – that matches these lower bounds up to log factors

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 5 / 38

Page 6: Kernel Thinning and Stein Thinning

Problem Setup

Given:

Input points $\mathcal{S}_{\mathrm{in}} = \{x_1, \ldots, x_n\} \subset \mathbb{R}^d$ with empirical distribution $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$

Pre-generated by any algorithm (i.i.d. sampling, MCMC, quadrature, kernel herding)

Target output size $s$ (e.g., $s = \sqrt{n}$ for heavy compression)

Goal: Return coreset $\mathcal{S}_{\mathrm{out}} \subset \mathcal{S}_{\mathrm{in}}$ with $|\mathcal{S}_{\mathrm{out}}| = s$, $Q = \frac{1}{s}\sum_{x \in \mathcal{S}_{\mathrm{out}}} \delta_x$, and $o(s^{-1/2})$ (better-than-i.i.d.) worst-case integration error between $P_n$ and $Q$

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 6 / 38

Page 7: Kernel Thinning and Stein Thinning

Maximum Mean Discrepancies

Goal: Return coreset $\mathcal{S}_{\mathrm{out}} \subset \mathcal{S}_{\mathrm{in}}$ with $|\mathcal{S}_{\mathrm{out}}| = s$, $Q = \frac{1}{s}\sum_{x \in \mathcal{S}_{\mathrm{out}}} \delta_x$, and $o(s^{-1/2})$ worst-case integration error between $P_n$ and $Q$

Quality measure: Maximum mean discrepancy (MMD) [Gretton, Borgwardt, Rasch, Scholkopf, and Smola, 2012]

$\mathrm{MMD}_k(P_n, Q) = \sup_{\|f\|_k \le 1} |P_n f - Qf|$

Measures maximum discrepancy between input and coreset expectations over a class of real-valued test functions (unit ball of a reproducing kernel Hilbert space)

Parameterized by a reproducing kernel $k$: any symmetric ($k(x, y) = k(y, x)$) and positive semidefinite ($\sum_{i,l} c_i c_l k(z_i, z_l) \ge 0$ for all $z_i \in \mathbb{R}^d$, $c_i \in \mathbb{R}$) function

Gaussian: $k(x, y) = e^{-\frac{1}{2}\|x-y\|_2^2}$, Inverse multiquadric: $k(x, y) = (1 + \|x-y\|_2^2)^{-1/2}$

Metrizes convergence in distribution for popular infinite-dimensional kernels (e.g., Gaussian, Matérn, B-spline, inverse multiquadric, sech, and Wendland)
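To make the quantity concrete, here is a minimal sketch (added for this write-up, not from the talk) that estimates MMD_k between two empirical distributions with a Gaussian kernel, using the expansion $\mathrm{MMD}_k(P_n, Q)^2 = (P_n \times P_n)k + (Q \times Q)k - 2(Q \times P_n)k$:

```python
# Minimal sketch: MMD_k between the empirical distributions supported on the rows of
# X and Y, via MMD^2 = (Pn x Pn)k + (Q x Q)k - 2 (Q x Pn)k.
import numpy as np

def gaussian_kernel(X, Y, sigma2=1.0):
    """Gaussian kernel matrix k(x, y) = exp(-||x - y||_2^2 / (2 sigma^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma2))

def mmd(X, Y, sigma2=1.0):
    """MMD_k(Pn, Q) for Pn uniform on the rows of X and Q uniform on the rows of Y."""
    Kxx = gaussian_kernel(X, X, sigma2)
    Kyy = gaussian_kernel(Y, Y, sigma2)
    Kxy = gaussian_kernel(X, Y, sigma2)
    return float(np.sqrt(max(Kxx.mean() + Kyy.mean() - 2 * Kxy.mean(), 0.0)))
```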

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 7 / 38

Page 8: Kernel Thinning and Stein Thinning

Square-root Kernels

Definition (Square-root kernel)

A reproducing kernel krt is a square-root kernel for k if

$k(x, y) = \int_{\mathbb{R}^d} k_{\mathrm{rt}}(x, z)\, k_{\mathrm{rt}}(y, z)\, dz$.

Name of kernel k(x, y) = κ(x − y) | Expression for κ(z) | Fourier transform κ̂(ω) | Square-root kernel krt

Gaussian(σ), σ > 0 | $\exp\big(-\frac{\|z\|_2^2}{2\sigma^2}\big)$ | $\sigma^d \exp\big(-\frac{\sigma^2\|\omega\|_2^2}{2}\big)$ | $\big(\frac{2}{\pi\sigma^2}\big)^{\frac{d}{4}}\,$Gaussian$\big(\frac{\sigma}{\sqrt{2}}\big)$

Matérn(ν, γ), ν > d, γ > 0 | $c_{\nu-\frac{d}{2}}\,(\gamma\|z\|_2)^{\nu-\frac{d}{2}} K_{\nu-\frac{d}{2}}(\gamma\|z\|_2)$ | $\phi_{d,\nu,\gamma}\,(\gamma^2 + \|\omega\|_2^2)^{-\nu}$ | $A_{\nu,\gamma,d}\,$Matérn$\big(\frac{\nu}{2}, \gamma\big)$

B-spline(2β + 1), β ∈ 2ℕ + 1 | $S_{2\beta+2,d} \prod_{j=1}^d \big(\circledast^{2\beta+2}\mathbf{1}_{[-\frac{1}{2},\frac{1}{2}]}\big)(z_j)$ | $S'_{2\beta+2,d} \prod_{j=1}^d \frac{\sin^{2\beta+2}(\omega_j/2)}{\omega_j^{2\beta+2}}$ | $S_{\beta,d}\,$B-spline$(\beta)$

Exact square-root kernel not necessary: see Dwivedi and Mackey [2021b] for convenient choices for inverse multiquadric, sech, Wendland, and all sufficiently smooth and integrable κ
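As a concrete check (a worked example added here, not on the slide), the Gaussian kernel admits a Gaussian square-root kernel. Using the identity $\|x - z\|_2^2 + \|y - z\|_2^2 = 2\|z - \tfrac{x+y}{2}\|_2^2 + \tfrac{1}{2}\|x - y\|_2^2$,

$\int_{\mathbb{R}^d} e^{-\|x-z\|_2^2/\sigma^2}\, e^{-\|y-z\|_2^2/\sigma^2}\, dz = e^{-\|x-y\|_2^2/(2\sigma^2)} \int_{\mathbb{R}^d} e^{-2\|z - \frac{x+y}{2}\|_2^2/\sigma^2}\, dz = \big(\tfrac{\pi\sigma^2}{2}\big)^{d/2} e^{-\|x-y\|_2^2/(2\sigma^2)},$

so $k_{\mathrm{rt}}(x, z) = \big(\tfrac{2}{\pi\sigma^2}\big)^{d/4} e^{-\|x-z\|_2^2/\sigma^2}$, a rescaled Gaussian($\sigma/\sqrt{2}$) kernel, satisfies the square-root identity for the Gaussian kernel $k(x, y) = e^{-\|x-y\|_2^2/(2\sigma^2)}$.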

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 8 / 38

Page 9: Kernel Thinning and Stein Thinning

Kernel Thinning

Two stages

1 Initialization: kt-split

Divides input $\mathcal{S}_{\mathrm{in}}$ into $2^m$ balanced candidate coresets, each of size $\frac{n}{2^m}$, using non-uniform randomness

When $m = \frac{1}{2}\log_2(n)$, obtain $\sqrt{n}$ coresets of size $\sqrt{n}$

2 Refinement: kt-swap

Selects candidate coreset closest to $\mathcal{S}_{\mathrm{in}}$ in terms of $\mathrm{MMD}_k$

Iteratively refines the coreset by replacing each coreset point in turn with the best alternative in $\mathcal{S}_{\mathrm{in}}$, as measured by $\mathrm{MMD}_k$ (a sketch follows below)

Complexity

Time: dominated by $O(n^2)$ kernel evaluations

Space: $O(\min(nd, n^2))$
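To illustrate the refinement stage, here is a minimal kt-swap sketch (added here; it follows the description above rather than the authors' implementation, and re-evaluates the MMD naively where the real algorithm maintains running sums to stay within $O(n^2)$ kernel evaluations):

```python
# Minimal kt-swap sketch: pick the candidate coreset closest to S_in under MMD_k,
# then sweep positions, swapping in whichever input point most reduces MMD_k.
import numpy as np

def kt_swap(K, candidates):
    """K: (n, n) kernel matrix over S_in; candidates: list of equal-size index lists."""
    n = K.shape[0]
    row_mean = K.mean(axis=1)  # (1/n) sum_j k(x_i, x_j), the Pn kernel embedding at x_i

    def mmd_sq(idx):
        # MMD_k(Pn, Q)^2 up to the (Pn x Pn)k constant shared by every coreset
        return K[np.ix_(idx, idx)].mean() - 2 * row_mean[idx].mean()

    best = list(min(candidates, key=mmd_sq))  # candidate coreset closest to S_in
    for pos in range(len(best)):
        scores = [mmd_sq(best[:pos] + [j] + best[pos + 1:]) for j in range(n)]
        best[pos] = int(np.argmin(scores))    # best single-point replacement from S_in
    return best
```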

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 9 / 38

Page 10: Kernel Thinning and Stein Thinning

KT-SPLIT

[Diagram: kt-split tree. The input 𝒮in splits into 𝒮1,1 and 𝒮1,2, which split into 𝒮2,1, ..., 𝒮2,4, then 𝒮3,1, ..., 𝒮3,8, and after m levels into 𝒮m,1, ..., 𝒮m,2^m]

kt-split partitions the input $\mathcal{S}_{\mathrm{in}}$ recursively, first dividing the input sequence in half, then halving those halves into quarters, and so on

Runs online: after $i$ input points have been processed, the output coresets have size $\frac{i}{2^m}$

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 10 / 38

Page 11: Kernel Thinning and Stein Thinning

KT-SPLIT

[Diagram: the input $\mathcal{S}_{\mathrm{in}}$ ($n$ points) passes through $m$ kernel halving rounds, producing $S^{(1,1)}$ ($n/2$ points), then $S^{(2,1)}$ ($n/4$ points), and finally the output $S^{(m,1)}$ ($n/2^m$ points)]

Each output coreset $S^{(m,\ell)}$ is the result of repeated kernel halving. On each halving round, remaining points are paired, and one point from each pair is selected using a new Hilbert space generalization of the self-balancing walk of Alweiss, Liu, and Sawhney [2020]

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 11 / 38

Page 12: Kernel Thinning and Stein Thinning

Kernel Halving with a Self-Balancing Hilbert Walk

Algorithm: Self-balancing Hilbert Walk [Dwivedi and Mackey, 2021b]

Input: sequence of functions $(f_i)_{i=1}^{n/2}$ in Hilbert space $\mathcal{H}$, threshold sequence $(a_i)_{i=1}^{n/2}$

$\psi_0 \leftarrow 0 \in \mathcal{H}$
for $i = 1, 2, \ldots, n/2$ do
    $\alpha_i \leftarrow \langle \psi_{i-1}, f_i \rangle_{\mathcal{H}}$    // Compute Hilbert space inner product
    if $|\alpha_i| > a_i$:
        $\psi_i \leftarrow \psi_{i-1} - f_i \cdot \alpha_i / a_i$    // We choose $a_i$ to avoid this case with high probability
    else:
        $\eta_i \leftarrow 1$ with probability $\frac{1}{2}(1 - \alpha_i/a_i)$ and $\eta_i \leftarrow -1$ otherwise
        $\psi_i \leftarrow \psi_{i-1} + \eta_i f_i$
end
return $\psi_{n/2}$, sum of signed input functions    // $\psi_{n/2} = \sum_{i=1}^{n/2} \eta_i f_i$ with high probability

1 Kernel Halving: If $f_i = k_{\mathrm{rt}}(x_{2i-1}, \cdot) - k_{\mathrm{rt}}(x_{2i}, \cdot)$, half of input points $\mathcal{S}_{\mathrm{out}}$ given sign 1

$\Rightarrow \frac{1}{n}\psi_{n/2} = P_n k_{\mathrm{rt}} - Q k_{\mathrm{rt}}$ with $Q = \frac{2}{n}\sum_{x \in \mathcal{S}_{\mathrm{out}}} \delta_x$

2 Balance: If $\mathcal{H} = k_{\mathrm{rt}}$ RKHS, $P_n k_{\mathrm{rt}}(x) - Q k_{\mathrm{rt}}(x)$ is $O(\sqrt{\log(n)}/n)$ sub-Gaussian, $\forall x$

In contrast, i.i.d. signs $\eta_i$ give $P_n k_{\mathrm{rt}}(x) - Q k_{\mathrm{rt}}(x) = \Omega(1/\sqrt{n})$
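The following is a simplified, runnable sketch of one kernel-halving round (added for illustration; it is not the authors' implementation, the threshold sequence is left as an input since the paper's adaptive choice is not reproduced, and the rare $|\alpha_i| > a_i$ branch is handled only approximately):

```python
# Kernel halving via the self-balancing Hilbert walk (simplified sketch). psi is never
# formed explicitly; its inner products with f_i = krt(x_{2i-1}, .) - krt(x_{2i}, .)
# are expanded through pairwise kernel evaluations.
import numpy as np

def kernel_halving(X, krt, thresholds, rng=None):
    """Return indices of a candidate coreset of size n/2 (n = len(X), assumed even)."""
    rng = np.random.default_rng() if rng is None else rng
    coefs = np.zeros(len(X) // 2)  # coefficient of each f_j in psi (eta_j = +/-1 w.h.p.)
    coreset = []
    for i in range(len(X) // 2):
        xa, xb = X[2 * i], X[2 * i + 1]
        # alpha_i = <psi_{i-1}, f_i>_H expanded against the earlier pairs
        alpha = sum(
            coefs[j] * (krt(X[2 * j], xa) - krt(X[2 * j], xb)
                        - krt(X[2 * j + 1], xa) + krt(X[2 * j + 1], xb))
            for j in range(i)
        )
        a_i = thresholds[i]
        if abs(alpha) > a_i:
            coefs[i] = -alpha / a_i        # rare under the paper's adaptive thresholds
            coreset.append(2 * i + 1)      # arbitrary tie-break for this sketch
        else:
            eta = 1 if rng.random() < 0.5 * (1 - alpha / a_i) else -1
            coefs[i] = eta
            # Sign +1 places the second point of the pair in the coreset, -1 the first
            coreset.append(2 * i + 1 if eta == 1 else 2 * i)
    return coreset
```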

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 12 / 38

Page 13: Kernel Thinning and Stein Thinning

Why the Square-root Kernel krt?

Theorem (L∞ coresets for $(k_{\mathrm{rt}}, P_n)$ are MMD coresets for $(k, P_n)$ [Dwivedi and Mackey, 2021b])

For any scalars $R, a, b \ge 0$ with $a + b = 1$, we have

$\mathrm{MMD}_k(P_n, Q) \le v_d R^{\frac{d}{2}} \cdot \|P_n k_{\mathrm{rt}} - Q k_{\mathrm{rt}}\|_\infty + 2\tau_{k_{\mathrm{rt}}}(aR) + 2\|k\|_\infty^{\frac{1}{2}} \cdot \max\{\tau_{P_n}(bR), \tau_Q(bR)\}$

for $v_d \triangleq \pi^{d/4}/\Gamma(d/2 + 1)^{1/2}$.

$L^\infty$ error: $\|P_n k_{\mathrm{rt}} - Q k_{\mathrm{rt}}\|_\infty \triangleq \sup_{x \in \mathbb{R}^d} |P_n k_{\mathrm{rt}}(x) - Q k_{\mathrm{rt}}(x)|$

Tail decay of $(P_n, Q, k_{\mathrm{rt}})$: $\tau_{P_n}(R) \triangleq P_n(\|X\|_2 \ge R)$

Effective radius: Want $\tau_{k_{\mathrm{rt}}}(aR), \tau_{P_n}(bR), \tau_Q(bR) = O(\frac{1}{\sqrt{n}})$

$R = O(1)$ for compact support, $R = O(\log(n))$ for sub-exponential decay

When $(P_n, Q, k_{\mathrm{rt}})$ are compactly supported, $\mathrm{MMD}_k(P_n, Q) = O(\|P_n k_{\mathrm{rt}} - Q k_{\mathrm{rt}}\|_\infty)$

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 13 / 38

Page 14: Kernel Thinning and Stein Thinning

L∞ Coresets from Kernel Halving

Theorem (L∞ guarantees for kernel halving [Dwivedi and Mackey, 2021b])

With high probability,

1 Kernel halving yields a 2-thinned $L^\infty$ coreset $Q^{(1)}_{\mathrm{KH}}$ satisfying

$\|P_n k_{\mathrm{rt}} - Q^{(1)}_{\mathrm{KH}} k_{\mathrm{rt}}\|_\infty \le \|k_{\mathrm{rt}}\|_\infty \cdot \frac{2}{n} M_{k_{\mathrm{rt}}}(P_n)$

2 Repeated kernel halving yields a $2^m$-thinned $L^\infty$ coreset $Q^{(m)}_{\mathrm{KH}}$ satisfying

$\|P_n k_{\mathrm{rt}} - Q^{(m)}_{\mathrm{KH}} k_{\mathrm{rt}}\|_\infty \le \|k_{\mathrm{rt}}\|_\infty \cdot \frac{2^m}{n} M_{k_{\mathrm{rt}}}(P_n)$

$M_{k_{\mathrm{rt}}}(P_n) = O(\sqrt{\log n})$ for compactly supported $(P, k_{\mathrm{rt}})$ and $O(\log n)$ in general

With $m = \frac{1}{2}\log_2(n)$ rounds, yields $\sqrt{n}$ points with $O(n^{-\frac{1}{2}} \log(n))$ $L^\infty$ error

An equal-sized i.i.d. sample has $\Omega(n^{-\frac{1}{4}})$ $L^\infty$ error

Near-optimal: any procedure outputting $\sqrt{n}$ points must suffer $\Omega(n^{-\frac{1}{2}})$ $L^\infty$ error for some $P_n$ [Phillips and Tai, 2020, Thm. 3.1]

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 14 / 38

Page 15: Kernel Thinning and Stein Thinning

MMD Coresets from Kernel Thinning

Theorem (MMD guarantee for kernel thinning [Dwivedi and Mackey, 2021b])

Kernel thinning returns a coreset $Q_{\mathrm{KT}}$ with $\sqrt{n}$ points satisfying, with high probability,

$\mathrm{MMD}_k(P_n, Q_{\mathrm{KT}}) = \begin{cases} O\big(\sqrt{\frac{\log n}{n}}\big) & \text{for compact support } (P, k_{\mathrm{rt}}) \text{ (e.g., B-spline } k\text{)} \\ O\big(\frac{(\log n)^{\frac{d+2}{4}} \log\log n}{\sqrt{n}}\big) & \text{for sub-Gaussian } (P, k_{\mathrm{rt}}) \text{ (e.g., Gaussian } k\text{)} \\ O\big(\frac{(\log n)^{\frac{d+1}{2}} \log\log n}{\sqrt{n}}\big) & \text{for sub-exponential } (P, k_{\mathrm{rt}}) \text{ (e.g., Matérn } k\text{)} \end{cases}$

An equal-sized i.i.d. sample has $\Omega(n^{-\frac{1}{4}})$ MMD

Sub-exponential guarantees resemble the classical $O\big(\frac{(\log n)^d}{\sqrt{n}}\big)$ quasi-Monte Carlo error rates for uniform P on $[0, 1]^d$ but apply to more general distributions on $\mathbb{R}^d$

See the paper for non-asymptotic bounds with explicit constants and $\frac{n}{2^m}$ points

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 15 / 38

Page 16: Kernel Thinning and Stein Thinning

Related Work on MMD Coresets

Uniform distribution P on $[0, 1]^d$: $O(n^{-\frac{1}{2}} \log^d n)$ $L^2$ discrepancy MMD, $\sqrt{n}$ points

Quasi-Monte Carlo [Hickernell, 1998, Novak and Wozniakowski, 2010], Online Haar strategy [Dwivedi, Feldheim, Gurel-Gurevich, and Ramdas, 2019]

Order $n^{-\frac{1}{4}}$ MMD coresets for general P

i.i.d. [Tolstikhin, Sriperumbudur, and Muandet, 2017], geometrically ergodic MCMC [Dwivedi and Mackey, 2021b]

Kernel herding [Chen, Welling, and Smola, 2010, Lacoste-Julien, Lindsten, and Bach, 2015], Stein points MCMC [Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019], Greedy sign selection [Karnin and Liberty, 2019]

Finite-dimensional linear kernels on $\mathbb{R}^d$: $O(\sqrt{d}\, n^{-\frac{1}{2}} \log^{2.5} n)$, $\sqrt{n}$ points

Discrepancy construction [Harvey and Samadi, 2014]: does not cover infinite-dimensional k

Unknown coreset quality

Super-sampling with a reservoir [Paige, Sejdinovic, and Wood, 2016]: coreset quality not analyzed

Support points [Mak and Joseph, 2018]

Optimal $\sqrt{n}$ coreset has $o(n^{-\frac{1}{4}})$ energy distance MMD but no construction given

Practical convex-concave procedures not analyzed or shown to be optimal

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 16 / 38

Page 17: Kernel Thinning and Stein Thinning

Related Work on L∞ Coresets

$L^\infty$ coresets for $P_n$: $o(n^{-\frac{1}{4}})$ $L^\infty$ error, $\sqrt{n}$ points

Series of breakthroughs due to [Joshi, Kommaraji, Phillips, and Venkatasubramanian, 2011, Phillips, 2013, Phillips and Tai, 2018, 2020, Tai, 2020]

Best known $L^\infty$ guarantees (for coreset of size $\sqrt{n}$)

Phillips and Tai [2020]: $O(\sqrt{d}\, n^{-\frac{1}{2}} \sqrt{\log n})$ error, $\Omega(n^4)$ time, $\Omega(n^2)$ space

Tai [2020] (Gaussian k): $O(2^d n^{-\frac{1}{2}} \sqrt{\log(d \log n)})$ error, $\Omega(\max(d^{5d}, n^4))$ time

Both are offline and require rebalancing after approximate halving steps

This work: $O(\sqrt{d}\, n^{-\frac{1}{2}} \log n)$ error, $O(n^2)$ time, $O(nd)$ space, online, exact halving

Sub-Gaussian $(k_{\mathrm{rt}}, P)$: $O(\sqrt{d}\, n^{-\frac{1}{2}} \sqrt{\log n \log\log n})$ error

Compact support $(k_{\mathrm{rt}}, P)$: $O(\sqrt{d}\, n^{-\frac{1}{2}} \sqrt{\log n})$ error

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 17 / 38

Page 18: Kernel Thinning and Stein Thinning

Kernel Thinning vs. i.i.d. Sampling: Mixture of Gaussians

[Figure: density contours of a Gaussian mixture overlaid with 8, 16, and 32 i.i.d. points (top row) and 8, 16, and 32 KT points (bottom row)]

$P = \frac{1}{k}\sum_{j=1}^{k} \mathcal{N}(\mu_j, I_d)$

$k(x, y) = \exp(-\frac{1}{2\sigma^2}\|x - y\|_2^2)$ with $\sigma^2 = 2d$

Even for small sample sizes, kernel thinning (KT) provides

Better stratification across components

Less clumping and fewer gaps within components

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 18 / 38

Page 19: Kernel Thinning and Stein Thinning

Kernel Thinning vs. i.i.d. Sampling: Mixture of Gaussians

[Figure: density contours of a Gaussian mixture overlaid with 8, 16, and 32 i.i.d. points (top row) and 8, 16, and 32 KT points (bottom row)]

$P = \frac{1}{M}\sum_{j=1}^{M} \mathcal{N}(\mu_j, I_d)$

$k(x, y) = \exp(-\frac{1}{2\sigma^2}\|x - y\|_2^2)$ with $\sigma^2 = 2d$

Even for small sample sizes, kernel thinning (KT) provides

Better stratification across components

Less clumping and fewer gaps within components

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 19 / 38

Page 20: Kernel Thinning and Stein Thinning

Kernel Thinning vs. i.i.d. Sampling: Mixture of Gaussians

[Figure: mean MMD vs. coreset size n for mixtures with M = 4, 6, 8 components; fitted decay rates: i.i.d. $n^{-0.26}$, $n^{-0.25}$, $n^{-0.25}$ vs. KT $n^{-0.52}$, $n^{-0.51}$, $n^{-0.52}$]

Kernel thinning (KT) improves both rate of decay and order of magnitude of $\mathrm{MMD}_k(P, Q_{\mathrm{KT}})$

$P = \frac{1}{M}\sum_{j=1}^{M} \mathcal{N}(\mu_j, I_d)$, $d = 2$

$k(x, y) = \exp(-\frac{1}{2\sigma^2}\|x - y\|_2^2)$ with $\sigma^2 = 2d$

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 20 / 38

Page 21: Kernel Thinning and Stein Thinning

Kernel Thinning vs. i.i.d. Sampling: Higher Dimensions

[Figure: mean MMD vs. coreset size n for d = 2, 4, 10, 100; fitted decay rates: i.i.d. $n^{-0.25}$ in every panel vs. KT $n^{-0.51}$, $n^{-0.48}$, $n^{-0.43}$, $n^{-0.34}$]

Kernel thinning (KT) improves both rate of decay and order of magnitude of $\mathrm{MMD}_k(P, Q_{\mathrm{KT}})$ even for high dimensions and small sample sizes

$P = \mathcal{N}(0, I_d)$

$k(x, y) = \exp(-\frac{1}{2\sigma^2}\|x - y\|_2^2)$ with $\sigma^2 = 2d$

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 21 / 38

Page 22: Kernel Thinning and Stein Thinning

Kernel Thinning vs. Standard MCMC Thinning

Posterior inference for systems of ordinary differential equations (ODEs)

P = posterior distribution of coupled ODE model parameters given observed data

Goodwin model of oscillatory enzymatic control (d = 4) [Goodwin, 1965]

Lotka-Volterra model of oscillatory predator-prey evolution (d = 4) [Lotka, 1925, Volterra, 1926]

Hinch model of cardiac calcium signalling (d = 38) [Hinch, Greenstein, Tanskanen, Xu, and Winslow, 2004]

Downstream goal: propagate model uncertainty through whole-heart simulation

Every sample point discarded via compression = 1000s of CPU hours saved

MCMC input points [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021]

Gaussian random walk (RW), adaptive RW (adaRW) [Haario, Saksman, and Tamminen, 1999]

2 weeks of CPU time to generate each RW Hinch chain of length $4 \times 10^6$

Metropolis-adjusted Langevin algorithm (MALA) [Roberts and Tweedie, 1996]

Pre-conditioned MALA (pMALA) [Girolami and Calderhead, 2011]

Discarded burn-in and standard-thinned to form $P_n$

$k(x, y) = \exp(-\frac{1}{2\sigma^2}\|x - y\|_2^2)$ with median heuristic $\sigma^2$ [Garreau, Jitkrittum, and Kanagawa, 2017]
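For reference, one common variant of the median heuristic (a sketch added here; bandwidth conventions differ across papers) sets $\sigma^2$ to the median pairwise squared distance among (a subsample of) the points:

```python
# One common median-heuristic variant: sigma^2 = median pairwise squared distance.
import numpy as np

def median_heuristic_sigma2(X, max_points=1000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    if len(X) > max_points:                      # subsample to keep the cost manageable
        X = X[rng.choice(len(X), max_points, replace=False)]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return float(np.median(sq[np.triu_indices(len(X), k=1)]))
```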

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 22 / 38

Page 23: Kernel Thinning and Stein Thinning

[Figure: mean MMD vs. coreset size n for twelve MCMC targets. Goodwin RW/adaRW/MALA/pMALA: standard thinning decays as $n^{-0.19}$/$n^{-0.15}$/$n^{-0.28}$/$n^{-0.30}$ vs. KT $n^{-0.50}$/$n^{-0.48}$/$n^{-0.49}$/$n^{-0.50}$. Lotka-Volterra RW/adaRW/MALA/pMALA: standard $n^{-0.14}$/$n^{-0.24}$/$n^{-0.40}$/$n^{-0.31}$ vs. KT $n^{-0.56}$/$n^{-0.57}$/$n^{-0.56}$/$n^{-0.57}$. Hinch 1/2/Tempered 1/Tempered 2: standard $n^{-0.46}$/$n^{-0.43}$/$n^{-0.35}$/$n^{-0.32}$ vs. KT $n^{-0.53}$/$n^{-0.49}$/$n^{-0.38}$/$n^{-0.38}$]

KT improves rate of decay and magnitude of MMD, even when standard thinning is accurate

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 23 / 38

Page 24: Kernel Thinning and Stein Thinning

Something’s Wrong

Problem: The Hinch Markov chains haven’t mixed!

Solution: Use a more diffuse tempered posterior for faster mixing

Problem: Tempering introduces a persistent bias

MCMC points Pn will be summarizing the wrong target P

Question: Can we correct for such biases during compression?

[Figure: Hinch model MCMC output for Parameters 1-10, two seeds, compared against the prior under the med, sclmed, and smpcov settings]

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 24 / 38

Page 25: Kernel Thinning and Stein Thinning

Compression with Bias Correction

Question: Can we correct for distributional biases in Pn during compression?

e.g., Biases due to off-target sampling, tempering, approximate MCMC, or burn-in

[Figure: three scatter plots of sample points in coordinates x1 and x2]

Difficulty: Pn alone is insufficient; need to measure distance to the true target P

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 25 / 38

Page 26: Kernel Thinning and Stein Thinning

Measuring Distance to P

Quality measure: Maximum mean discrepancy (MMD) [Gretton, Borgwardt, Rasch, Scholkopf, and Smola, 2012]

$\mathrm{MMD}_k(P, Q) = \sup_{\|f\|_k \le 1} |Pf - Qf| = \sqrt{(P \times P)k + (Q \times Q)k - 2(Q \times P)k}$

Problem: Integration under P is typically intractable!

⇒ Pk and MMDk(P,Q) cannot be computed in practice for most kernels

Idea: Only consider kernels kP with PkP known a priori to be 0

Then MMD computation only depends on Q!

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 26 / 38

Page 27: Kernel Thinning and Stein Thinning

Kernel Stein Discrepancies

Idea: Consider MMDkP with PkP known a priori to be 0

Kernel Stein discrepancy (KSD) [Chwialkowski, Strathmann, and Gretton, 2016, Liu, Lee, and Jordan, 2016, Gorham and Mackey, 2017]

$k_P(x, y) = \sum_{j=1}^d \frac{1}{p(x)p(y)} \nabla_{x_j} \nabla_{y_j} \big(p(x)\, k(x, y)\, p(y)\big)$ [Oates, Girolami, and Chopin, 2017]

P has differentiable Lebesgue density p

k is a bounded base kernel with bounded continuous derivatives

Depends on P through ∇ log p: computable when the normalization constant is unknown

$P k_P = 0$ whenever ∇ log p is integrable [Gorham and Mackey, 2017]

⇒ Kernel Stein discrepancy MMDkP(P,Q) is computable!

Theorem (KSD controls convergence in distribution [Gorham and Mackey, 2017, Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019])

Consider the base kernel $k(x, y) = (c^2 + \|\Gamma(x - y)\|_2^2)^{-1/2}$ for any $c > 0$ and positive definite $\Gamma$. If P has strongly log concave tails and Lipschitz $\nabla \log p$, then $Q_s \Rightarrow P$ whenever $\mathrm{MMD}_{k_P}(P, Q_s) \to 0$.
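Expanding the product rule in the definition of $k_P$ above gives the equivalent form $k_P(x, y) = \nabla_x \cdot \nabla_y k(x, y) + \langle \nabla_x k(x, y), \nabla \log p(y)\rangle + \langle \nabla_y k(x, y), \nabla \log p(x)\rangle + k(x, y)\,\langle \nabla \log p(x), \nabla \log p(y)\rangle$, which depends on P only through the score $\nabla \log p$. A minimal sketch (added here, not from the talk) for the inverse multiquadric base kernel with $\Gamma = I$:

```python
# Stein kernel k_P for the IMQ base kernel k(x, y) = (c^2 + ||x - y||^2)^(-1/2),
# written in terms of the score s(x) = grad log p(x). score_fn is user supplied;
# e.g., score_fn = lambda x: -x for a standard Gaussian target.
import numpy as np

def imq_stein_kernel(x, y, score_fn, c=1.0):
    d = len(x)
    diff = x - y
    r2 = float(diff @ diff)                           # ||x - y||^2
    u = c**2 + r2
    k = u**(-0.5)                                     # base kernel value
    grad_x_k = -diff * u**(-1.5)                      # gradient of k in x
    grad_y_k = diff * u**(-1.5)                       # gradient of k in y
    div_xy_k = d * u**(-1.5) - 3.0 * r2 * u**(-2.5)   # sum_j d^2 k / dx_j dy_j
    sx, sy = score_fn(x), score_fn(y)
    return div_xy_k + float(grad_x_k @ sy) + float(grad_y_k @ sx) + k * float(sx @ sy)
```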

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 27 / 38

Page 28: Kernel Thinning and Stein Thinning

Stein Thinning

Idea: Greedily minimize KSD using points from $\mathcal{S}_{\mathrm{in}} = \{x_1, \ldots, x_n\}$ [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021]

Choose initial approximation $Q_1 = \delta_{y_1}$ with

$y_1 \in \operatorname{argmin}_{y \in \mathcal{S}_{\mathrm{in}}} \mathrm{MMD}_{k_P}(P, \delta_y) = \operatorname{argmin}_{y \in \mathcal{S}_{\mathrm{in}}} k_P(y, y)$

Iteratively construct $Q_s = \frac{1}{s}\sum_{i=1}^s \delta_{y_i}$ with

$y_s \in \operatorname{argmin}_{y \in \mathcal{S}_{\mathrm{in}}} \mathrm{MMD}_{k_P}\big(P, \tfrac{s-1}{s} Q_{s-1} + \tfrac{1}{s}\delta_y\big) = \operatorname{argmin}_{y \in \mathcal{S}_{\mathrm{in}}} k_P(y, y) + 2\sum_{i=1}^{s-1} k_P(y_i, y)$

Same point xi can be selected multiple times

Running time $= O(n \sum_{i=1}^s r_i)$ for $r_i \le i$ the number of distinct points selected prior to round $i$ (worst case $= O(ns^2)$)
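A minimal sketch of the greedy selection above (added for illustration; it assumes the Stein kernel matrix KP[i, j] = kP(x_i, x_j) has been precomputed, whereas the actual implementation can evaluate kernel entries on demand to achieve the running time quoted above):

```python
# Greedy Stein thinning sketch: select s points (repeats allowed) from S_in by
# minimizing k_P(y, y) + 2 * sum_{i < s} k_P(y_i, y) at each round.
import numpy as np

def stein_thinning(KP, s):
    obj = np.diag(KP).astype(float)   # round-1 objective: k_P(y, y) for each candidate y
    idx = []
    for _ in range(s):
        j = int(np.argmin(obj))
        idx.append(j)
        obj += 2.0 * KP[j]            # fold the newly selected point into future objectives
    return idx
```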

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 28 / 38

Page 29: Kernel Thinning and Stein Thinning

Stein Thinning Guarantees

Theorem (Stein thinning KSD guarantee [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021])

$\mathrm{MMD}_{k_P}(P, Q_s)^2 \le \inf_{w \in \Delta^{n-1}} \mathrm{MMD}_{k_P}\big(P, \sum_{i=1}^n w_i \delta_{x_i}\big)^2 + \frac{1 + \log(s)}{s} \max_{x \in \mathcal{S}_{\mathrm{in}}} k_P(x, x)$

Expect $\max_{x \in \mathcal{S}_{\mathrm{in}}} k_P(x, x) = O(\log(n))$ for sub-Gaussian input and $k_P(x, x) = O(\|x\|_2^2)$

Takeaway: Stein thinning performs nearly as well as best simplex reweighting of Sin

⇒ Nearly as well as Markov chain with burn-in removed!

⇒ Nearly as well as off-target sample after optimal importance sampling reweighting!

[Figure: three scatter plots of sample points in coordinates x1 and x2]

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 29 / 38

Page 30: Kernel Thinning and Stein Thinning

Stein Thinning Guarantees

Takeaway: Stein thinning performs nearly as well as best simplex reweighting of Sin

⇒ Nearly as well as Markov chain with burn-in removed!

⇒ Nearly as well as off-target sample after optimal importance sampling reweighting!

Theorem (Stein thinning corrects off-target sampling [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021])

If $\mathcal{S}_{\mathrm{in}}$ is drawn i.i.d. from an off-target distribution $\tilde{P}$, then, under mild conditions ($s \le n$, $\log(n) = O(s^{\beta/2})$ for some $\beta < 1$, and $\mathbb{E}\big[e^{\gamma \max(1, \frac{dP}{d\tilde{P}}(X_i)^2)\, k_P(X_i, X_i)}\big] < \infty$ for some $\gamma > 0$),

$\mathrm{MMD}_{k_P}(P, Q_s) \to 0$ almost surely as $s, n \to \infty$.

Result extends to sufficiently ergodic Markov chains targeting P

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 30 / 38

Page 31: Kernel Thinning and Stein Thinning

Stein Thinning in Action: Correcting for Burn-in

Goodwin model of oscillatory enzymatic control

Projections on the first two coordinates of the MALA MCMC output

First s = 20 points from Stein thinning vs. burn-in removal + standard thinning

Substantial burn-in: b points out of $2 \times 10^6$ removed for standard thinning

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 31 / 38

Page 32: Kernel Thinning and Stein Thinning

Stein Thinning in Action: Correcting for Burn-in

Goodwin model of oscillatory enzymatic control

[Figure: KSD vs. coreset size m for RW, ADA-RW, MALA, and P-MALA; energy distance (ED) vs. m; and absolute error of the first moment for Parameters 1-4; each comparing Standard Thinning (high burn-in), Standard Thinning (low burn-in), Support Points, and Stein Thinning (med, sclmed, smpcov)]

Stein thinning outperforms standard thinning with high and low levels of burn-in removal in terms of KSD, energy distance (ED), and first moment estimation

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 32 / 38

Page 33: Kernel Thinning and Stein Thinning

Stein Thinning in Action: Correcting for Tempering

Hinch model of cardiac calcium signalling: Tempering improves mixing

[Figure: Hinch model MCMC output for Parameters 1-10, two seeds, compared against the prior under the med, sclmed, and smpcov settings]

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 33 / 38

Page 34: Kernel Thinning and Stein Thinning

Stein Thinning in Action: Correcting for Tempering

Hinch model of cardiac calcium signalling

[Figure: KSD vs. coreset size m for the RW chain, comparing Support Points, Support Points (tempered), and Stein Thinning (med, sclmed, smpcov)]

Untempered support points compression yields a poor summary due to poor mixing

Tempered SP without bias correction is even worse (due to tempering bias)

Tempering + Stein thinning bias correction improves approximation to P

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 34 / 38

Page 35: Kernel Thinning and Stein Thinning

Conclusions

Summary

New tools for summarizing a probability distribution more effectively than i.i.d. sampling or standard MCMC thinning

Kernel thinning compresses an $n$-point summary into a $\sqrt{n}$-point summary with better-than-i.i.d. approximation error

Stein thinning simultaneously compresses and reduces biases due to off-target sampling, tempering, or burn-in

Especially well-suited for tasks that incur substantial downstream computation costs

Kernel Thinning

Paper: arxiv.org/abs/2105.05842

Package: github.com/rzrsk/kernel_thinning

Stein Thinning

Paper: arxiv.org/abs/2105.05842

Website: stein-thinning.org

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 35 / 38

Page 36: Kernel Thinning and Stein Thinning

Generalized Kernel Thinning [Dwivedi and Mackey, 2021a]

Question: Do you really need a square-root kernel?

1 kt-split with target kernel k yields

Similar or better MMD guarantees for analytic kernels (like Gaussian, IMQ, & sinc)

Dimension-free $O\big(\sqrt{\frac{\log n}{n}}\big)$ single-function integration error for any k and P

2 kt-split with fractional power kernel kα yields

Improved MMD for kernels without krt (like Laplace and non-smooth Matern)

3 kt-split with k + kα yields all of the above simultaneously!

We call this kernel thinning+ (KT+)

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 36 / 38

Page 37: Kernel Thinning and Stein Thinning

Distribution Compression in Near-linear Time [Shetty, Dwivedi, and Mackey, 2021]

Question: Can we speed up thinning algorithms without ruining their quality?

[Figure: mean MMD and single-core runtime (seconds to over a day) vs. input size n for d = 2, 4, 10, 100. MMD decay rates: ST $n^{-0.24}$/$n^{-0.23}$/$n^{-0.25}$/$n^{-0.25}$, KT-Comp $n^{-0.46}$/$n^{-0.41}$/$n^{-0.38}$/$n^{-0.32}$, KT-Comp++ $n^{-0.50}$/$n^{-0.46}$/$n^{-0.41}$/$n^{-0.31}$, KT $n^{-0.52}$/$n^{-0.48}$/$n^{-0.43}$/$n^{-0.32}$]

Compress++ reduces $n^2$ runtime to $n \log^2 n$, applies to any thinning algorithm, and inflates error by at most a constant factor

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 37 / 38

Page 38: Kernel Thinning and Stein Thinning

Distribution Compression in Near-linear Time [Shetty, Dwivedi, and Mackey, 2021]

Question: Can we speed up thinning algorithms without ruining their quality?

[Figure: mean MMD and single-core runtime vs. input size n for d = 2, 4, 10, 100, with kernel herding as the base algorithm. MMD decay rates: ST $n^{-0.27}$/$n^{-0.24}$/$n^{-0.25}$/$n^{-0.24}$, Herd-Comp $n^{-0.45}$/$n^{-0.42}$/$n^{-0.37}$/$n^{-0.32}$, Herd-Comp++ $n^{-0.51}$/$n^{-0.45}$/$n^{-0.42}$/$n^{-0.32}$, Herd $n^{-0.49}$/$n^{-0.45}$/$n^{-0.46}$/$n^{-0.32}$]

Compress++ reduces $n^2$ runtime to $n \log^2 n$, applies to any thinning algorithm (e.g., kernel herding), and inflates error by at most a constant factor

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 38 / 38

Page 39: Kernel Thinning and Stein Thinning

References I

R. Alweiss, Y. P. Liu, and M. Sawhney. Discrepancy minimization via a self-balancing walk. arXiv preprint arXiv:2006.14009, 2020.

F. O. Campos, Y. Shiferaw, A. J. Prassl, P. M. Boyle, E. J. Vigmond, and G. Plank. Stochastic spontaneous calcium release events trigger premature ventricular complexes by overcoming electrotonic load. Cardiovascular Research, 107(1):175–183, 2015.

W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. Oates. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, pages 1011–1021. PMLR, 2019.

Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence,UAI’10, page 109–116, Arlington, Virginia, USA, 2010. AUAI Press. ISBN 9780974903965.

K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

M. A. Colman. Arrhythmia mechanisms and spontaneous calcium release: Bi-directional coupling between re-entrant and focal excitation. PLoS Computational Biology, 15(8), 2019.

R. Dwivedi and L. Mackey. Generalized kernel thinning. arXiv preprint arXiv:2110.01593, 2021a.

R. Dwivedi and L. Mackey. Kernel thinning, 2021b.

R. Dwivedi, O. N. Feldheim, O. Gurel-Gurevich, and A. Ramdas. The power of online thinning in reducing discrepancy. Probability Theory and Related Fields, 174(1):103–131, 2019.

D. Garreau, W. Jitkrittum, and M. Kanagawa. Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269, 2017.

M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.

B. C. Goodwin. Oscillatory behavior in enzymatic control process. Advances in Enzyme Regulation, 3:318–356, 1965.

J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, 2017.

A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.

H. Haario, E. Saksman, and J. Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375–395, 1999.

N. Harvey and S. Samadi. Near-optimal herding. In Conference on Learning Theory, pages 1165–1182, 2014.

F. Hickernell. A generalized discrepancy and quadrature error bound. Mathematics of computation, 67(221):299–322, 1998.

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 39 / 38

Page 40: Kernel Thinning and Stein Thinning

References II

R. Hinch, J. Greenstein, A. Tanskanen, L. Xu, and R. Winslow. A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophysical Journal, 87(6):3723–3736, 2004.

S. Joshi, R. V. Kommaraji, J. M. Phillips, and S. Venkatasubramanian. Comparing distributions and shapes using the kernel distance. In Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry, pages 47–56, 2011.

Z. Karnin and E. Liberty. Discrepancy, coresets, and sketches in machine learning. In Conference on Learning Theory, pages 1975–1993. PMLR, 2019.

T. Karvonen, M. Kanagawa, and S. Sarkka. On the positivity and magnitudes of Bayesian quadrature weights. Statistics and Computing, 29(6):1317–1333, 2019.

S. Lacoste-Julien, F. Lindsten, and F. Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Artificial Intelligence and Statistics, pages 544–552. PMLR, 2015.

Q. Liu, J. D. Lee, and M. I. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

A. J. Lotka. Elements of physical biology. Williams & Wilkins, 1925.

S. Mak and V. R. Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018.

S. A. Niederer, J. Lumens, and N. A. Trayanova. Computational models in cardiology. Nature Reviews Cardiology, 16(2):100–111, 2019.

S. A. Niederer, M. S. Sacks, M. Girolami, and K. Willcox. Scaling digital twins from the artisanal to the industrial. Nature Computational Science, 1(5):313–320, 2021.

E. Novak and H. Wozniakowski. Tractability of Multivariate Problems, Volume II: Standard Information for Functionals. European Math. Soc. Publ. House, Zurich, 3, 2010.

C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 79(3):695–718, 2017.

B. Paige, D. Sejdinovic, and F. Wood. Super-sampling with a reservoir. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, pages 567–576, 2016.

J. M. Phillips. ε-samples for kernels. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1622–1632. SIAM, 2013.

J. M. Phillips and W. M. Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718–2727. SIAM, 2018.

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 40 / 38

Page 41: Kernel Thinning and Stein Thinning

References III

J. M. Phillips and W. M. Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.

M. Riabiz, W. Chen, J. Cockayne, P. Swietach, S. A. Niederer, L. Mackey, and C. Oates. Optimal thinning of MCMC output. To appear: Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2021.

G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.

A. Shetty, R. Dwivedi, and L. Mackey. Distribution compression in near-linear time. arXiv preprint: to appear, 2021.

W. M. Tai. New nearly-optimal coreset for kernel density estimation. arXiv preprint arXiv:2007.08031, 2020.

I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002–3048, 2017.

V. Volterra. Variazioni e fluttuazioni del numero d’individui in specie animali conviventi. 1926.

Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 41 / 38