Learning PBD Powers
Kontonis Vasilis
November 13, 2017
Corelab, NTUA
Contents
1. Introduction
2. Binomial Powers
3. Learning the Parameters of a PBD
Introduction
Distribution Learning
• Draw samples from an unknown distribution D.
• Output an approximation of the density function of D.
Distances of Distributions
dtv(Q, P) = (1/2) ∫ |p(x) − q(x)| dx = sup_{A∈A} |P(A) − Q(A)|
[Figure: two example density functions]
Dvoretzky-Kiefer-Wolfowitz Inequality
Kolmogorov Distance
dkol(Q, P) = sup_{x∈R} |FQ(x) − FP(x)|
Empirical CDF and DKW Inequality
F̂N(x) = (1/N) ∑_{i=1}^N 1[Xi ≤ x], x ∈ R

For every ε > 0:

P[dkol(FP, F̂N) > ε] ≤ 2e^{−2Nε²}
• With N = O(1/ε²) samples we approximate the unknown CDF.
• Question: Is there anything left to do?
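The DKW bound is easy to sanity-check numerically. A small sketch (NumPy, not from the slides) builds the empirical CDF from Uniform(0,1) samples, where the true CDF is simply F(x) = x:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cdf(samples):
    """Return a function x -> F_hat_N(x) = (1/N) * #{i : X_i <= x}."""
    xs = np.sort(samples)
    n = len(xs)
    return lambda x: np.searchsorted(xs, x, side="right") / n

# Uniform(0,1) has true CDF F(x) = x, so d_kol is easy to evaluate.
N = 10_000
samples = rng.uniform(0, 1, N)
F_hat = empirical_cdf(samples)

grid = np.linspace(0, 1, 1_000)
d_kol = np.max(np.abs(F_hat(grid) - grid))

eps = 0.05
dkw_bound = 2 * np.exp(-2 * N * eps**2)  # DKW: P[d_kol > eps] <= this
print(d_kol, dkw_bound)
```

With N = 10,000 the observed Kolmogorov distance is well below ε = 0.05, in line with the 2e^{−2Nε²} failure probability.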
PBDs
Poisson Binomial Distributions
• Xi's are 0/1 Bernoulli with E[Xi] = pi.
• X = ∑_{i=1}^n Xi is an n-PBD with probability vector p = (p1, . . . , pn).
• E[X] = ∑ pi, V[X] = ∑ pi(1 − pi).
• If the pi's are small, then X is close in TVD to Pois(∑ pi).
• If V [X ] is large, then X is close in TVD to a (Discretized) Normal.
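The mean and variance formulas can be checked by direct simulation; a short sketch with a hypothetical probability vector p (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical probability vector of a small n-PBD.
p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def sample_pbd(p, size, rng):
    """Draw `size` samples of X = sum_i Bernoulli(p_i)."""
    return (rng.random((size, len(p))) < p).sum(axis=1)

xs = sample_pbd(p, 100_000, rng)
print(xs.mean(), p.sum())              # E[X] = sum p_i = 2.5
print(xs.var(), (p * (1 - p)).sum())   # V[X] = sum p_i(1 - p_i) = 0.85
```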
Learning PBDs
Simply running Birgé's algorithm [1] is not good enough: O(log n/ε³) samples are needed.
• Learn Sparse:
  • Truncate the support: draw O(1/ε²) samples, sort them, and let a, b be such that the empirical mass of [a, b] is 1 − ε.
  • If b − a > 1/ε³ then output fail, else run Birgé's unimodal algorithm on X restricted to [a, b].
• Learn Heavy:
  • Estimate the variance σ̂² and the mean µ̂ of X using O(1/ε²) samples.
  • The discretized normal DN(µ̂, σ̂²) is ε-close, so just output µ̂ and σ̂².
• Choose between Sparse and Heavy.
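A minimal sketch of the truncation step of the sparse branch (an assumption-laden toy: Binomial samples stand in for X, and the empirical ε/2- and (1 − ε/2)-quantiles play the role of a and b):

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_support(samples, eps):
    """Empirical interval [a, b] capturing roughly 1 - eps of the mass."""
    xs = np.sort(samples)
    n = len(xs)
    a = xs[int(np.floor(n * eps / 2))]
    b = xs[min(n - 1, int(np.ceil(n * (1 - eps / 2))))]
    return a, b

eps = 0.1
# Toy data: Binomial(100, 0.3) samples stand in for the unknown PBD X.
samples = rng.binomial(100, 0.3, size=int(1 / eps**2) * 10)
a, b = truncated_support(samples, eps)
is_sparse = (b - a) <= 1 / eps**3   # the sparse/heavy test from the slide
print(a, b, is_sparse)
```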
Tough Curriculum
• a set of m items, e.g. a set of students.
• a set of n items, e.g. a set of courses.
• Each student has passed course i with probability pi, independently of the other students.
Question: What is the distribution of the number of different courses
that a set of k students will have passed?
Answer: For course i not to be in the set, every one of the k students must have failed it. This happens with probability (1 − pi)^k. Therefore i is included in the union of passed courses with probability 1 − (1 − pi)^k.
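The answer in code, with a hypothetical list of pass probabilities. Note that the number of distinct passed courses is itself a PBD, with parameters 1 − (1 − pi)^k, since the inclusion events are independent across courses:

```python
# Probability that course i is among those passed by at least one of k
# students: the complement of all k students failing it independently.
def course_included(p_i: float, k: int) -> float:
    return 1 - (1 - p_i) ** k

# Expected number of distinct passed courses (hypothetical p's).
ps = [0.2, 0.5, 0.9]
k = 5
expected_distinct = sum(course_included(p, k) for p in ps)
print(course_included(0.2, 5), expected_distinct)   # 1 - 0.8^5 = 0.67232
```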
PBD Powers
• Let P be an n-PBD defined by p.
• P^i is the i-th PBD power of P, defined by the coordinate-wise power p^i = (p1^i, . . . , pn^i).
Question: Given samples from a subset of the powers of P can we learn
the other powers? Can we do better than learning each one of them
separately?
Binomial Powers
Approximating Binomials
Binomial TVD [3]
We want to approximate all the distributions B(n, p^i), i ∈ N.

|p − q| ≤ ε√(p(1 − p)/n) = err(n, p, ε)  =⇒  dtv(B(n, p), B(n, q)) ≤ ε
Approximating p
• Estimator:

  p̂ = (∑_{i=1}^N Xi) / (Nn)

• Sample Complexity: choosing N = O(ln(1/δ)/ε²), from Chernoff's bound we have:

  P[|p̂ − p| > err(n, p, ε)] < δ.
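A quick simulation of the estimator (NumPy; the concrete n, p, N, ε are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 50, 0.3
N = 10_000                         # plays the role of O(ln(1/delta)/eps^2)
xs = rng.binomial(n, p, size=N)
p_hat = xs.sum() / (N * n)         # the estimator from the slide

eps = 0.05
err = eps * np.sqrt(p * (1 - p) / n)   # err(n, p, eps)
print(p_hat, abs(p_hat - p), err)
```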
Real Powers of p
Assume that p ≈ 1 − 1/n or p ≈ 1/n.

p = 0.99...9458382...  (log n nines, then the "constant" part)
p = 0.00...0235711...  (log n zeros, then the "constant" part)

• Sampling from the first power reveals the first part of p, since √(p(1 − p)/n) ≈ 1/n.
• Is this good enough to approximate all binomial powers?
• 0.9995^1000 ≈ 0.6064, 0.9997^1000 ≈ 0.7407
A blast from the past
|x − y| = |x^{2^i} − y^{2^i}| / ∏_{j=0}^{i−1} (x^{2^j} + y^{2^j})
Finding the Sweet Spot
By the Mean Value Theorem applied to the mapping x ↦ x^l we obtain

p^l − q̂1^l ≤ l p^{l−1} (p − q̂1), l ∈ (1, +∞)

and

q̂2^l − p^l ≤ l p^{l−1} (q̂2 − p), l ∈ (0, 1)

We need to find a function u(p) such that for all l > 0:

u(p) l p^{l−1} err(n, p, ε) ≤ err(n, p^l, ε)    (1)

u(p) l p^{l−1} √(p(1 − p)/n) ≤ √(p^l(1 − p^l)/n)

u²(p) ≤ (p/(1 − p)) · (p^{−l} − 1)/l²
Finding the Sweet Spot
f(l) = (p^{−l} − 1)/l² is convex, attaining its minimum at l̄ = −C/ln p, with f(l̄) = C′ ln²(1/p).

Now we can choose:

u(p) = D √(p/(1 − p)) · ln(1/p), D ≈ 1.24
Getting in Range
Magic Power: a = −1/ln p.
Question: Can we guess the "magic" power using samples from the first one, for all values of p?
Answer: No, if p is very close to 1 then we cannot hope to learn the
number of 9’s in its decimal representation.
We approximate a with â = −1/ln p̂.

• p^a = p̂^â = 1/e.
• If |p − p̂| ≤ err(n, p, ε) then 1/e² ≤ p^â ≤ 1/e^{1/2}.
• Works for p ∈ [ε²/n, 1 − ε²/n].
Question: What if p is closer to 1 or 0?
Answer: For p ∈ [ε²/n^d, 1 − ε²/n^d] we need O(log(d)/ε²) samples.
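The "magic" power is easy to see numerically (plain Python; the concrete values of p and p̂ are illustrative):

```python
import math

def magic_power(p_hat: float) -> float:
    """a_hat = -1/ln(p_hat); by construction p_hat ** a_hat == 1/e."""
    return -1.0 / math.log(p_hat)

# p ** (-1/ln p) = 1/e for every p in (0, 1).
for p in (0.9995, 0.12):
    print(p, p ** magic_power(p))   # second column is exp(-1) ~ 0.3679

p = 0.999999          # p very close to 1
p_hat = 0.9999985     # a nearby (hypothetical) estimate
a_hat = magic_power(p_hat)
print(p_hat ** a_hat)  # exactly 1/e
print(p ** a_hat)      # stays in a constant range around 1/e
```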
Learning Binomial Powers
Algorithm 1 Binomial Powers
Input: O(ln(1/δ)²/ε²) samples from the powers of B(n, p).
Output: â, q̂1, q̂2.
1: Draw O(ln(1/δ)/ε²) samples from B(n, p) to obtain the approximation p̂.
2: Let â ← −1/ln(p̂).
3: Draw O(ln(1/δ)²/(ε²ψ(p^â)²)) samples from B(n, p^â) to get estimates q̂1, q̂2 of p, with q̂1 ≤ p ≤ q̂2.
4: return â, q̂1, q̂2
Question: How do we obtain q̂1, q̂2 ?
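A minimal Python sketch of Algorithm 1, under stated assumptions: since we know p here, sampling from B(n, p^â) is simulated rather than provided by a power oracle, and a single point estimate of the success probability p^â stands in for the bracketing pair q̂1 ≤ p ≤ q̂2 (the inversion back to p is omitted):

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate_p(n, p, N, rng):
    """Estimate p from N samples of B(n, p) (the slide's estimator)."""
    return rng.binomial(n, p, size=N).sum() / (N * n)

def binomial_powers(n, p, eps, rng):
    """Sketch of Algorithm 1: estimate p, guess the magic power,
    then re-estimate at that power."""
    N = int(10 / eps**2)
    p_hat = estimate_p(n, p, N, rng)            # step 1
    a_hat = -1.0 / np.log(p_hat)                # step 2
    q_hat = estimate_p(n, p**a_hat, N, rng)     # step 3 (simulated oracle)
    return a_hat, q_hat

a_hat, q_hat = binomial_powers(n=100, p=0.97, eps=0.05, rng=rng)
print(a_hat, q_hat)   # q_hat should land near 0.97**a_hat ~ 1/e
```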
Learning the Parameters of a PBD
Learning the Powers vs Learning the Parameters
Question: PBD powers ⇔ Parameter Estimation?

• Assuming that all pi's are well separated, we can learn them by sampling from the powers of a PBD.
• Is there a PBD where learning its powers is easy but learning its parameters is hard?
Learning the Parameters
• P(x) = ∏(x − pi) = x^n + c_{n−1}x^{n−1} + . . . + c0.
• µj = E[P^j] = ∑ pi^j.
Newton Identities

⎡ 1                          ⎤ ⎡ c_{n−1} ⎤   ⎡ −µ1 ⎤
⎢ µ1       2                 ⎥ ⎢ c_{n−2} ⎥   ⎢ −µ2 ⎥
⎢ µ2       µ1      3         ⎥ ⎢ c_{n−3} ⎥ = ⎢ −µ3 ⎥   ⇔ Ac = b
⎢ ⋮                 ⋱        ⎥ ⎢    ⋮    ⎥   ⎢  ⋮  ⎥
⎣ µ_{n−1}  µ_{n−2}  ⋯  µ1  n ⎦ ⎣   c0    ⎦   ⎣ −µn ⎦
We know the µi ’s only approximately from sampling the PBD powers.
Question: How do we measure the impact of noise on the solution of the system?

‖c − ĉ‖∞ ≤ u · O(n^{3/2} 2^n)
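The triangular system is easy to set up and solve numerically; a sketch for a hypothetical 3-PBD, cross-checked against NumPy's expansion of ∏(x − pi):

```python
import numpy as np

# Hypothetical parameter vector of a small PBD.
p = np.array([0.2, 0.5, 0.8])
n = len(p)

# Power sums mu_j = sum_i p_i^j for j = 1..n.
mu = np.array([(p**j).sum() for j in range(1, n + 1)])

# Newton's identities as the lower-triangular system A c = b.
A = np.zeros((n, n))
for k in range(n):
    A[k, k] = k + 1            # diagonal: 1, 2, ..., n
    A[k, :k] = mu[:k][::-1]    # row k: mu_k, mu_{k-1}, ..., mu_1
b = -mu
c = np.linalg.solve(A, b)      # c = (c_{n-1}, ..., c_0)

# Cross-check against numpy's expansion of prod (x - p_i).
coeffs = np.poly(p)            # leading 1, then c_{n-1}, ..., c_0
print(c, coeffs[1:])
```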
Finding the Roots
Pan's Algorithm

• P(x) = ∑_{i=0}^n ci x^i = cn ∏_{i=1}^n (x − pi), cn ≠ 0.
• |pj| ≤ 1 for all j.
• Computes roots such that |p̂j − pj| < ε for j = 1, . . . , n.
• Precision needed: ‖c − ĉ‖∞ = 2^{−O(n max(log(1/ε), log n))}.    (2)
• Overall we need 2^{O(n max(log(1/ε), log n))} samples.
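A toy illustration of noise sensitivity in root finding (numpy.roots stands in for Pan's certified solver; with well-separated roots the problem is benign, which is why the hard instances need clustered roots):

```python
import numpy as np

p = np.sort(np.array([0.2, 0.5, 0.8]))
coeffs = np.poly(p)                       # exact monic coefficients

# Perturb the coefficients slightly and see how far the roots move.
noisy = coeffs + 1e-6 * np.array([0, 1, -1, 1])
roots = np.sort(np.roots(noisy).real)
print(np.abs(roots - p).max())            # small: roots are well separated
```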
Le Cam’s Inequality
Question: How do we show a sample complexity lower bound?
Hypothesis Testing
• We know that samples can come from either P1 or P2.• How many samples do we need in order to decide whether they
come from P or Q?
• Ψ is a testing function, Ψ : X→ {1, 2}• Let V be the random variable of the choice of the unknown
distribution.
Le Cam’s Inequality
inf_Ψ P[Ψ(X^N) ≠ V] = 1 − dtv(P1^N, P2^N)
Minimax Risk, Sample Complexity
• P is a family of distributions.
• P ∈ P is a distribution.
• Θ is the space of the parameter we want to estimate.
• θ: P → Θ, θ(P) is the parameter of P we want to estimate.
• θ̂: X^N → Θ is the estimator.
• X^N = (X1, . . . , XN) ∼ P^N is the sample vector, N is the number of samples.
• ρ is a semimetric on Θ.

M_N(θ(P), ρ) := inf_θ̂ sup_{P∈P} E_{P^N}[ρ(θ̂(X^N), θ(P))].

n(ε, θ(P)) = inf{N : M_N(θ(P), ρ) ≤ ε}
Generalization of Minimax Risk
• P is a family of sequences of distributions.
• P ∈ P is a sequence of distributions.
• Θ is the space of the parameter we want to estimate.
• θ: P → Θ, θ(P) is the parameter of P we want to estimate.
• θ̂: X^m → Θ is the estimator.
• X^m = (X1, . . . , X_{m1}, . . . , X_{mk}) ∼ P^m is the sample vector, N = |m| is the total number of samples.
• ρ is a semimetric on Θ.

Definition

M_N(θ(P), ρ) := inf_θ̂ inf_{|m|=N} sup_{P∈P} E_{P^m}[ρ(θ̂(X^m), θ(P))].    (3)
From Estimation to Testing
Canonical Hypothesis Testing

• "nature" chooses V uniformly from V.
• Conditioned on V = v, we draw the sample X^m from the product distribution P_v^m.

Given X^m our goal is to determine V.

Lower Bound
Let F_V ⊆ P be a family of sequences of distributions indexed by v ∈ V such that ρ(θ(P_v), θ(P_u)) ≥ 2δ for all P_v, P_u ∈ F_V with v ≠ u ∈ V and δ > 0. Then

M_N(θ(P), ρ) ≥ δ inf_{|m|=N} inf_Ψ ν_m(Ψ(X^m) ≠ V).
From Estimation to Testing
Testing Function:

Ψ(X^m) := argmin_{v∈V} ρ(θ̂, θ_v), so that ρ(θ̂, θ_v) ≤ δ ⇔ Ψ(X^m) = v.
Le Cam’s Method
Main Idea: Find two sequences of distributions that are close in total variation but whose parameters are far apart.

• P, Q ∈ P and δ > 0.
• If ρ(θ(P), θ(Q)) ≥ 2δ, then after N observations (samples) the minimax risk has the lower bound

M_N(θ(P), ρ) ≥ (δ/2)(1 − √2 √(1 − (1 − dtv(P, Q))^N)).
The Lower Bound
Construction of [2]:

• n = Θ(log(N/ε))
• pj := (1 + cos(2πj/n))/8, qj := (1 + cos((2πj + π)/n))/8, j ∈ [n].
• For j = n/4 + O(1) we have |pj − qj| = Ω(1/log(N/ε)).
The Lower Bound
• the pi's are the roots of Tn(8x − 1) − 1.
• the qi's are the roots of Tn(8x − 1) + 1.
• for all l ∈ {1, 2, . . . , n − 1}, ∑_{i=1}^n pi^l = ∑_{i=1}^n qi^l.
• for l ≥ n, 3^l (∑_{i=1}^n (pi^l − qi^l)) ≤ n(3/4)^n = log(N/ε)(3/4)^{log(N/ε)}.

All PBD powers of p, q are very close in TVD:

dtv(P^i, Q^i) ≤ c/N.

Therefore the number of samples N should be Ω(2^{1/ε}).
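The moment-matching property of the construction is easy to verify numerically (a hypothetical small n; the first n − 1 power sums of the p's and q's coincide):

```python
import numpy as np

n = 8
j = np.arange(1, n + 1)
p = (1 + np.cos(2 * np.pi * j / n)) / 8           # roots of T_n(8x-1) - 1
q = (1 + np.cos((2 * np.pi * j + np.pi) / n)) / 8  # roots of T_n(8x-1) + 1

# Power sums agree for l = 1, ..., n-1, so the first n-1 moments match.
for l in range(1, n):
    print(l, (p**l).sum() - (q**l).sum())          # numerically zero
```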
Questions?
References

[1] L. Birgé. Estimating a Density under Order Restrictions: Nonasymptotic Minimax Risk. The Annals of Statistics, 15(3):995–1012, Sept. 1987.

[2] I. Diakonikolas, D. Kane, and A. Stewart. Properly learning Poisson binomial distributions in almost polynomial time. In Proceedings of the 29th Conference on Learning Theory (COLT'16), pages 850–878, 2016.

[3] B. Roos. Improvements in the Poisson approximation of mixed Poisson distributions. Journal of Statistical Planning and Inference, 113:467–483, 2003.