
  • Learning PBD Powers

    Kontonis Vasilis

    November 13, 2017

    Corelab, NTUA

  • Contents

    1. Introduction

    2. Binomial Powers

    3. Learning the Parameters of a PBD


  • Introduction

  • Distribution Learning

    • Draw samples from an unknown distribution D.
    • Output an approximation of the density function of D.

    Distances of Distributions

    d_{\mathrm{tv}}(Q, P) = \frac{1}{2} \int |p(x) - q(x)| \, dx = \sup_{A \in \mathcal{A}} |P(A) - Q(A)|.
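    As a quick illustration (my own, not part of the slides), here is a minimal Python sketch computing the total variation distance between two discrete distributions given as probability vectors over the same support:

```python
import numpy as np

def tv_distance(p, q):
    """Total variation distance between two discrete distributions
    given as probability vectors over the same finite support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()

# Example: two distributions on {0, 1, 2}
print(tv_distance([0.5, 0.3, 0.2], [0.4, 0.4, 0.2]))  # = 0.1
```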


  • Dvoretzky Kiefer Wolfowitz Inequality

    Kolmogorov Distance

    d_{\mathrm{kol}}(Q, P) = \sup_{x \in \mathbb{R}} |F_Q(x) - F_P(x)|

    Empirical CDF and DKW Inequality

    \hat{F}_N(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}[X_i \le x], \quad x \in \mathbb{R}

    For every ε > 0,

    \mathbb{P}\left[ d_{\mathrm{kol}}(F_P, \hat{F}_N) > \varepsilon \right] \le 2 e^{-2N\varepsilon^2}.

    • With N = O(1/ε²) samples we approximate the unknown CDF.

    • Question: Is there anything left to do?
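    A small Python sketch (my own, not from the slides) of the empirical CDF and of the sample size suggested by the DKW bound:

```python
import numpy as np

def empirical_cdf(samples):
    """Return a function x -> F_hat_N(x) built from the samples."""
    xs = np.sort(np.asarray(samples, dtype=float))
    return lambda x: np.searchsorted(xs, x, side="right") / len(xs)

def dkw_sample_size(eps, delta):
    """Smallest N with 2*exp(-2*N*eps^2) <= delta, per the DKW inequality."""
    return int(np.ceil(np.log(2.0 / delta) / (2.0 * eps**2)))

# Example: approximate the CDF of a standard normal to Kolmogorov distance 0.05
# with failure probability 0.01.
N = dkw_sample_size(eps=0.05, delta=0.01)
F_hat = empirical_cdf(np.random.standard_normal(N))
print(N, F_hat(0.0))  # F_hat(0) should be close to 0.5
```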


  • PBDs

    Poisson Binomial Distributions

    • The X_i's are 0/1 Bernoulli random variables with E[X_i] = p_i.
    • X = \sum_{i=1}^{n} X_i is an n-PBD with probability vector p = (p_1, ..., p_n).
    • E[X] = \sum_i p_i,  V[X] = \sum_i p_i (1 - p_i).
    • If the p_i's are small, then X is close in TVD to Pois(\sum_i p_i).
    • If V[X] is large, then X is close in TVD to a (discretized) normal.
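    For concreteness, a minimal sketch (mine, with illustrative parameters) of sampling from an n-PBD given its probability vector p:

```python
import numpy as np

def sample_pbd(p, num_samples, rng=None):
    """Draw samples of X = sum_i X_i, where X_i ~ Bernoulli(p_i) independently."""
    rng = rng or np.random.default_rng()
    p = np.asarray(p, dtype=float)
    # Each row is one draw of the Bernoulli vector; summing a row gives one PBD sample.
    return (rng.random((num_samples, len(p))) < p).sum(axis=1)

p = [0.1, 0.5, 0.9, 0.3]
samples = sample_pbd(p, 10_000)
print(samples.mean(), sum(p))  # sample mean should be close to sum(p) = 1.8
```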


  • Learning PBDs

    Simply running Birgé's algorithm [1] is not good enough: O(log n / ε³) samples are needed.

    • Learn Sparse:
      • Truncate the support: draw O(1/ε²) samples, sort them, and let a, b be such that the empirical probability of {a ≤ X ≤ b} is 1 − ε.
      • If b − a > 1/ε³ then output fail; else run Birgé's unimodal algorithm on X restricted to [a, b].
    • Learn Heavy:
      • Estimate the variance σ̂² and the mean µ̂ of X using O(1/ε²) samples.
      • The discretized normal DN(µ̂, σ̂²) is ε-close, so just output µ̂ and σ̂².
    • Choose between Sparse and Heavy.
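    A rough Python sketch of the sparse/heavy dichotomy (my simplification: the constants, the choice of quantiles for a and b, and the Birgé subroutine are all placeholders, and `draw` stands for the sampling oracle):

```python
import numpy as np

def learn_pbd_sketch(draw, eps):
    """Decide between the sparse and heavy regimes from O(1/eps^2) samples.
    `draw(k)` is assumed to return k i.i.d. samples of the unknown PBD."""
    n_samples = int(np.ceil(1.0 / eps**2))
    xs = np.sort(draw(n_samples))
    # Empirical (eps/2, 1 - eps/2) quantiles play the role of a and b.
    a, b = np.quantile(xs, [eps / 2, 1 - eps / 2])
    if b - a > 1.0 / eps**3:
        # Heavy case: the discretized normal DN(mu_hat, sigma_hat^2) is eps-close.
        return ("heavy", xs.mean(), xs.var())
    # Sparse case: here one would run Birge's unimodal learner on [a, b].
    return ("sparse", a, b)
```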


  • Tough Curriculum

    • A set of m items, e.g. a set of students.
    • A set of n items, e.g. a set of courses.
    • Each student has passed course i with probability p_i, independently of the other students.

    Question: What is the distribution of the number of different courses

    that a set of k students will have passed?

    Answer: For course i not to be in the set, all k students must have failed it.
    This happens with probability (1 − p_i)^k. Therefore course i is included
    in the union of courses with probability 1 − (1 − p_i)^k.


  • PBD Powers

    PBD Powers

    • Let P be an n-PBD defined by the vector p.
    • P^i is the i-th PBD power of P, defined by the coordinate-wise power p^i = (p_1^i, ..., p_n^i).

    Question: Given samples from a subset of the powers of P, can we learn the
    other powers? Can we do better than learning each one of them separately?
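    In code, taking a power is just a coordinate-wise power of the parameter vector (a one-line illustration, not from the slides):

```python
def pbd_power(p, i):
    """Parameter vector of the i-th PBD power: each p_j raised to the power i."""
    return [p_j ** i for p_j in p]

print(pbd_power([0.1, 0.5, 0.9], 3))  # [0.001..., 0.125, 0.729...]
```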


  • Binomial Powers

  • Approximating Binomials

    Binomial TVD [3]
    We want to approximate all the distributions B(n, p^i), i ∈ ℕ.

    |p - q| \le \varepsilon \sqrt{\frac{p(1-p)}{n}} = \mathrm{err}(n, p, \varepsilon) \implies d_{\mathrm{tv}}(B(n, p), B(n, q)) \le \varepsilon

    Approximating p

    • Estimator:

      \hat{p} = \frac{\sum_{i=1}^{N} X_i}{N n}

    • Sample complexity: choosing N = O(\ln(1/\delta)/\varepsilon^2), Chernoff's bound gives

      \mathbb{P}\left[ |\hat{p} - p| > \mathrm{err}(n, p, \varepsilon) \right] < \delta.
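    A minimal sketch of this estimator (the hidden constant in the sample complexity is my placeholder, and `draw` stands for the sampling oracle):

```python
import numpy as np

def estimate_binomial_p(draw, n, eps, delta):
    """Estimate p of B(n, p), aiming at |p_hat - p| <= err(n, p, eps).
    `draw(N)` is assumed to return N i.i.d. samples of B(n, p)."""
    N = int(np.ceil(np.log(1.0 / delta) / eps**2))  # O(ln(1/delta)/eps^2), constant 1 as placeholder
    xs = np.asarray(draw(N), dtype=float)
    return xs.sum() / (N * n)

rng = np.random.default_rng(0)
p_hat = estimate_binomial_p(lambda N: rng.binomial(1000, 0.3, size=N),
                            n=1000, eps=0.1, delta=0.01)
print(p_hat)  # should be close to 0.3
```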


  • Real Powers of p

    Assume that p ≈ 1 − 1/n or p ≈ 1/n.

    p = 0.\underbrace{99\ldots 9}_{\#\log n}\underbrace{458382}_{\text{``constant'' part}}

    p = 0.\underbrace{00\ldots 0}_{\#\log n}\underbrace{235711}_{\text{``constant'' part}}

    • Sampling from the first power reveals the first part of p, since \sqrt{p(1-p)/n} ≈ 1/n.
    • Is this good enough to approximate all binomial powers?
    • 0.9995^{1000} ≈ 0.6064,  0.9997^{1000} ≈ 0.7407

    A blast from the past

    |x - y| = \frac{|x^{2^i} - y^{2^i}|}{\prod_{j=0}^{i-1} \left( x^{2^j} + y^{2^j} \right)}


  • Finding the Sweet Spot

    By the Mean Value Theorem applied to the map x ↦ x^l we obtain

    p^l - \hat{q}_1^l \le l p^{l-1} (p - \hat{q}_1), \quad l \in (1, +\infty)

    and

    \hat{q}_2^l - p^l \le l p^{l-1} (\hat{q}_2 - p), \quad l \in (0, 1)

    We need to find a function u(p) such that for all l > 0:

    u(p)\, l p^{l-1} \mathrm{err}(n, p, \varepsilon) \le \mathrm{err}(n, p^l, \varepsilon) \tag{1}

    u(p)\, l p^{l-1} \sqrt{\frac{p(1-p)}{n}} \le \sqrt{\frac{p^l(1-p^l)}{n}}

    u^2(p) \le \frac{p}{1-p} \cdot \frac{p^{-l} - 1}{l^2}


  • Finding the Sweet Spot

    f(l) = \frac{p^{-l} - 1}{l^2} is convex, attaining its minimum at \bar{l} = -\frac{C}{\ln p}, with f(\bar{l}) = C \ln^2(1/p).

    Now we can choose:

    u(p) = D \sqrt{\frac{p}{1-p}} \ln(1/p), \quad D ≈ 1.24
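    A quick numerical sanity check (my own) that this choice of u(p) satisfies inequality (1) on a grid of p and l:

```python
import numpy as np

def err(n, p, eps):
    return eps * np.sqrt(p * (1 - p) / n)

def u(p, D=1.24):
    return D * np.sqrt(p / (1 - p)) * np.log(1 / p)

n, eps = 1000, 0.1
ok = True
for p in np.linspace(0.01, 0.99, 99):
    for l in np.logspace(-2, 2, 200):  # l ranging over (0.01, 100)
        if u(p) * l * p**(l - 1) * err(n, p, eps) > err(n, p**l, eps):
            ok = False
print(ok)  # expected: True
```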


  • Getting in Range

    Magic power: a = -\frac{1}{\ln p}.

    Question: Can we guess the "magic" power using samples from the first one, for all values of p?

    Answer: No. If p is very close to 1, then we cannot hope to learn the number of 9's in its decimal representation.

    We approximate a with \hat{a} = -\frac{1}{\ln \hat{p}}.

    • p^a = \hat{p}^{\hat{a}} = 1/e.
    • If |p - \hat{p}| \le \mathrm{err}(n, p, \varepsilon) then \frac{1}{e^2} \le p^{\hat{a}} \le \frac{1}{e^{3/2}}.
    • This works for p ∈ [ε²/n, 1 − ε²/n].

    Question: What if p is closer to 1 or to 0?

    Answer: For p ∈ [ε²/n^d, 1 − ε²/n^d] we need O(log(d)/ε²) samples.
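    A tiny numerical illustration (mine) of the magic power: raising p to a = −1/ln p always gives 1/e, and even with an imperfect estimate p̂ the value p^â stays a constant bounded away from 0 and 1:

```python
import numpy as np

p = 0.9995
a = -1.0 / np.log(p)
print(p ** a)          # 1/e ~ 0.3679 (up to rounding), for any p in (0, 1)

p_hat = 0.9996         # an imperfect estimate of p
a_hat = -1.0 / np.log(p_hat)
print(p ** a_hat)      # still a constant, about 0.29 here
```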


  • Learning Binomial Powers

    Algorithm 1: Binomial Powers

    Input: O(\ln(1/\delta)^2/\varepsilon^2) samples from the powers of B(n, p).
    Output: \hat{a}, \hat{q}_1, \hat{q}_2.

    1: Draw O(\ln(1/\delta)/\varepsilon^2) samples from B(n, p) to obtain the approximation \hat{p}.
    2: Let \hat{a} \leftarrow -1/\ln(\hat{p}).
    3: Draw O\left( \ln(1/\delta)^2 / (\varepsilon^2 \psi(p^{\hat{a}})^2) \right) samples from B(n, p^{\hat{a}}) to get estimates \hat{q}_1, \hat{q}_2 of p with \hat{q}_1 \le p \le \hat{q}_2.
    4: return \hat{a}, \hat{q}_1, \hat{q}_2

    Question: How do we obtain \hat{q}_1, \hat{q}_2?
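    A rough Python sketch of the algorithm's flow (the sample counts and, in particular, the way q̂₁, q̂₂ are extracted in step 3 are my naive placeholders; the slides leave that step as a question, and `draw_power` stands for the sampling oracle):

```python
import numpy as np

def binomial_powers_sketch(draw_power, n, eps, delta):
    """Sketch of Algorithm 1. `draw_power(l, N)` is assumed to return
    N i.i.d. samples of B(n, p^l)."""
    # Step 1: estimate p from the first power.
    N1 = int(np.ceil(np.log(1 / delta) / eps**2))
    p_hat = np.sum(draw_power(1, N1)) / (N1 * n)
    # Step 2: the "magic" power puts p^a_hat near 1/e.
    a_hat = -1.0 / np.log(p_hat)
    # Step 3 (naive placeholder): estimate q = p^a_hat and invert to bracket p.
    N2 = int(np.ceil(np.log(1 / delta) ** 2 / eps**2))
    q_hat = np.sum(draw_power(a_hat, N2)) / (N2 * n)
    q1 = (q_hat * (1 - eps)) ** (1 / a_hat)
    q2 = (q_hat * (1 + eps)) ** (1 / a_hat)
    return a_hat, q1, q2
```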


  • Learning the Parameters of a PBD

  • Learning the Powers vs Learning the Parameters

    Question: PBD powers ⇔ parameter estimation?

    • Assuming that all p_i's are well separated, we can learn them by sampling from the powers of a PBD.
    • Is there a PBD where learning its powers is easy but learning its parameters is hard?


  • Learning the Parameters

    • P(x) = \prod_i (x - p_i) = x^n + c_{n-1} x^{n-1} + \ldots + c_0.
    • \mu_j = E[P^j] = \sum_i p_i^j.

    Newton's Identities

    \begin{pmatrix}
      1         &           &        &        & \\
      \mu_1     & 2         &        &        & \\
      \mu_2     & \mu_1     & 3      &        & \\
      \vdots    &           & \ddots & \ddots & \\
      \mu_{n-1} & \mu_{n-2} & \cdots & \mu_1  & n
    \end{pmatrix}
    \begin{pmatrix} c_{n-1} \\ c_{n-2} \\ c_{n-3} \\ \vdots \\ c_0 \end{pmatrix}
    =
    \begin{pmatrix} -\mu_1 \\ -\mu_2 \\ -\mu_3 \\ \vdots \\ -\mu_n \end{pmatrix}
    \quad \Longleftrightarrow \quad A c = b

    We know the \mu_i's only approximately, from sampling the PBD powers.

    Question: How do we measure the impact of the noise on the solution of the system?

    \|c - \hat{c}\|_\infty \le u \cdot O(n^{3/2} 2^n)
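    A minimal numpy sketch (mine) of recovering the coefficients c from the power sums µ by solving this lower-triangular system:

```python
import numpy as np

def coeffs_from_power_sums(mu):
    """Given power sums mu[j-1] = sum_i p_i^j for j = 1..n, solve the
    Newton-identity system A c = b for c = (c_{n-1}, ..., c_0)."""
    n = len(mu)
    A = np.zeros((n, n))
    for j in range(n):
        A[j, j] = j + 1                  # diagonal entries 1, 2, ..., n
        if j > 0:
            A[j, :j] = mu[j - 1::-1]     # row j+1 starts with mu_j, mu_{j-1}, ..., mu_1
    b = -np.asarray(mu, dtype=float)
    return np.linalg.solve(A, b)

# Example with roots p = (0.2, 0.5): P(x) = x^2 - 0.7x + 0.1, i.e. (c_1, c_0) = (-0.7, 0.1).
mu = [0.2 + 0.5, 0.2**2 + 0.5**2]
print(coeffs_from_power_sums(mu))  # approx [-0.7, 0.1]
```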

  • Finding the Roots

    Pan's Algorithm

    • P(x) = \sum_{i=0}^{n} c_i x^i = c_n \prod_{i=1}^{n} (x - p_i), c_n ≠ 0.
    • |p_j| ≤ 1 for all j.
    • Computes roots such that |\hat{p}_j - p_j| < ε for j = 1, ..., n.
    • Precision needed:

      \|c - \hat{c}\|_\infty = 2^{O(-n \max(\log(1/\varepsilon), \log n))}. \tag{2}

    • Overall we need 2^{O(n \max(\log(1/\varepsilon), \log n))} samples.
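    Pan's algorithm is not available off the shelf in numpy; purely as a stand-in for the root-finding step, numpy.roots recovers the p_i from the coefficients in the small, well-conditioned toy example above:

```python
import numpy as np

# numpy.roots is only a stand-in here; the slides rely on Pan's algorithm.
# Coefficients are listed from the leading term down: x^2 - 0.7x + 0.1.
print(np.roots([1.0, -0.7, 0.1]))  # approx [0.5, 0.2]
```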


  • Le Cam’s Inequality

    Question: How do we show a sample complexity lower bound?

    Hypothesis Testing

    • We know that samples can come from either P_1 or P_2.
    • How many samples do we need in order to decide which of the two they come from?
    • Ψ is a testing function, Ψ : \mathcal{X} → {1, 2}.
    • Let V be the random variable of the choice of the unknown distribution.

    Le Cam's Inequality

    \inf_{\Psi} \mathbb{P}\left[ \Psi(X^N) \neq V \right] = 1 - d_{\mathrm{tv}}\left( P_1^N, P_2^N \right)


  • Minimax Risk, Sample Complexity

    • \mathcal{P} is a family of distributions.
    • P ∈ \mathcal{P} is a distribution.
    • Θ is the space of the parameter we want to estimate.
    • θ : \mathcal{P} → Θ; θ(P) is the parameter of P we want to estimate.
    • \hat{θ} : \mathcal{X}^N → Θ is the estimator.
    • X^N = (X_1, ..., X_N) ∼ P^N is the sample vector; N is the number of samples.
    • ρ is a semimetric on Θ.

    M_N(\theta(\mathcal{P}), \rho) := \inf_{\hat{\theta}} \sup_{P \in \mathcal{P}} \mathbb{E}_{P^N}\left[ \rho\left( \hat{\theta}(X^N), \theta(P) \right) \right].

    n(\varepsilon, \theta(\mathcal{P})) = \inf\left\{ N : M_N(\theta(\mathcal{P}), \rho) \le \varepsilon \right\}


  • Generalization of Minimax Risk

    • \mathcal{P} is a family of sequences of distributions.
    • P ∈ \mathcal{P} is a sequence of distributions.
    • Θ is the space of the parameter we want to estimate.
    • θ : \mathcal{P} → Θ; θ(P) is the parameter of P we want to estimate.
    • \hat{θ} : \mathcal{X}^m → Θ is the estimator.
    • X^m = (X_1, ..., X_{m_1}, ..., X_{m_k}) ∼ P^m is the sample vector; N is the total number of samples.
    • ρ is a semimetric on Θ.

    Definition

    M_N(\theta(\mathcal{P}), \rho) := \inf_{\hat{\theta}} \inf_{|m| = N} \sup_{P \in \mathcal{P}} \mathbb{E}_{P^m}\left[ \rho\left( \hat{\theta}(X^m), \theta(P) \right) \right]. \tag{3}


  • From Estimation to Testing

    Canonical Hypothesis Testing

    • "Nature" chooses V uniformly from \mathcal{V}.
    • Conditioned on V = v, we draw the sample X^m from the N-fold product distribution P_v^m.

    Given X^m, our goal is to determine V.

    Lower Bound
    Let \mathcal{F}_\mathcal{V} ⊆ \mathcal{P} be a family of sequences of distributions indexed by v ∈ \mathcal{V} such that ρ(θ(P_v), θ(P_u)) ≥ 2δ for all P_v, P_u ∈ \mathcal{F}_\mathcal{V} with v ≠ u ∈ \mathcal{V}, where δ > 0. Then

    M_N(\theta(\mathcal{P}), \rho) \ge \delta \inf_{|m| = N} \inf_{\Psi} \nu_m\left( \Psi(X^m) \neq V \right).


  • From Estimation to Testing

    Testing Function:

    Ψ(Xm) := argminv∈V{ρ(θ̂, θv )} , ρ(θ̂, θv ) 6 δ⇔ Ψ(θ̂) = v .


  • Le Cam’s Method

    Main idea: find two sequences of distributions that are close in total variation but whose parameters are far apart.

    • P, Q ∈ \mathcal{P} and δ > 0.
    • If ρ(θ(P), θ(Q)) ≥ 2δ, then:

    After N observations (samples) the minimax risk has the lower bound

    M_N(\theta(\mathcal{P}), \rho) \ge \frac{\delta}{2}\left( 1 - \sqrt{2} \sqrt{1 - (1 - d_{\mathrm{tv}}(P, Q))^N} \right).


  • The Lower Bound

    Construction of [2]:

    • n = Θ(log(N/ε))
    • p_j := \frac{1 + \cos(2\pi j / n)}{8}, \quad q_j := \frac{1 + \cos((2\pi j + \pi)/n)}{8}, \quad j ∈ [n].
    • For j = n/4 + O(1) we have |p_j − q_j| = Ω(1/log(N/ε)).


  • The Lower Bound

    • The p_i's are the roots of T_n(8x − 1) − 1.
    • The q_j's are the roots of T_n(8x − 1) + 1.
    • For all l ∈ {1, 2, ..., n − 1}: \sum_{i=1}^{n} p_i^l = \sum_{i=1}^{n} q_i^l.
    • For l ≥ n: 3^l \left( \sum_{i=1}^{n} (p_i^l - q_i^l) \right) \le n (3/4)^n = \log(N/\varepsilon)(3/4)^{\log(N/\varepsilon)}.

    All PBD powers of p, q are very close in TVD:

    d_{\mathrm{tv}}(P^i, Q^i) \le c/N.

    Therefore the number of samples N must be Ω(2^{1/ε}).
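    A small numerical check (my own) of the moment-matching property of this construction: the first n − 1 power sums of the two root sets coincide.

```python
import numpy as np

n = 16
j = np.arange(1, n + 1)
p = (1 + np.cos(2 * np.pi * j / n)) / 8            # roots of T_n(8x - 1) - 1
q = (1 + np.cos((2 * np.pi * j + np.pi) / n)) / 8  # roots of T_n(8x - 1) + 1

# Power sums agree for l = 1, ..., n - 1 (up to floating-point noise).
for l in range(1, n):
    assert abs(np.sum(p**l) - np.sum(q**l)) < 1e-12
print("power sums agree for l = 1, ...,", n - 1)
```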


  • Questions?


  • References

    [1] L. Birgé. Estimating a Density under Order Restrictions: Nonasymptotic Minimax Risk. The Annals of Statistics, 15(3):995–1012, Sept. 1987.

    [2] I. Diakonikolas, D. Kane, and A. Stewart. Properly learning Poisson binomial distributions in almost polynomial time. In Proceedings of the 29th Conference on Learning Theory (COLT'16), pages 850–878, 2016.

    [3] B. Roos. Improvements in the Poisson approximation of mixed Poisson distributions. Journal of Statistical Planning and Inference, 113:467–483, 2003.
