Learning PBD Powers
Kontonis Vasilis
November 13, 2017
Corelab, NTUA
Contents
1. Introduction
2. Binomial Powers
3. Learning the Parameters of a PBD
Introduction
Distribution Learning
• Draw samples from an unknown distribution D.
• Output an approximation of the density function of D.
Distances of Distributions
dtv(Q, P) = (1/2) ∫ |p(x) − q(x)| dx = sup_{A∈A} |P(A) − Q(A)|
[Figure: two example density functions]
Dvoretzky-Kiefer-Wolfowitz Inequality
Kolmogorov Distance
dkol(Q, P) = sup_{x∈R} |FQ(x) − FP(x)|
Empirical CDF and DKW Inequality
F̂N(x) = (1/N) ∑_{i=1}^N 1[Xi ≤ x], x ∈ R

For every ε > 0:

P[dkol(FP, F̂N) > ε] ≤ 2e^{−2Nε²}
• With N = O(1/ε²) samples we approximate the unknown CDF.
• Question: Is there anything left to do?
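The DKW bound is easy to sanity-check numerically. A small sketch (NumPy, not from the slides) builds the empirical CDF from Uniform(0,1) samples, where the true CDF is simply F(x) = x:

```python
import numpy as np

rng = np.random.default_rng(0)

def empirical_cdf(samples):
    """Return a function x -> F_hat_N(x) = (1/N) * #{i : X_i <= x}."""
    xs = np.sort(samples)
    n = len(xs)
    return lambda x: np.searchsorted(xs, x, side="right") / n

# Uniform(0,1) has true CDF F(x) = x, so d_kol is easy to evaluate.
N = 10_000
samples = rng.uniform(0, 1, N)
F_hat = empirical_cdf(samples)

grid = np.linspace(0, 1, 1_000)
d_kol = np.max(np.abs(F_hat(grid) - grid))

eps = 0.05
dkw_bound = 2 * np.exp(-2 * N * eps**2)  # DKW: P[d_kol > eps] <= this
print(d_kol, dkw_bound)
```

With N = 10,000 the observed Kolmogorov distance is well below ε = 0.05, in line with the 2e^{−2Nε²} failure probability.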
PBDs
Poisson Binomial Distributions
• Xi's are 0/1 Bernoulli with E[Xi] = pi.
• X = ∑_{i=1}^n Xi is an n-PBD with probability vector p = (p1, . . . , pn).
• E[X] = ∑ pi, V[X] = ∑ pi(1 − pi).
• If the pi's are small, then X is close in TVD to Pois(∑ pi).
• If V [X ] is large, then X is close in TVD to a (Discretized) Normal.
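The mean and variance formulas can be checked by direct simulation; a short sketch with a hypothetical probability vector p (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical probability vector of a small n-PBD.
p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

def sample_pbd(p, size, rng):
    """Draw `size` samples of X = sum_i Bernoulli(p_i)."""
    return (rng.random((size, len(p))) < p).sum(axis=1)

xs = sample_pbd(p, 100_000, rng)
print(xs.mean(), p.sum())              # E[X] = sum p_i = 2.5
print(xs.var(), (p * (1 - p)).sum())   # V[X] = sum p_i(1 - p_i) = 0.85
```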
Learning PBDs
Simply running Birgé's algorithm [1] is not good enough: O(log n/ε³) samples are needed.
• Learn Sparse:
  • Truncate the support: draw O(1/ε²) samples, sort them, and let a, b be such that the empirical mass of [a, b] is 1 − ε.
  • If b − a > 1/ε³ then output fail, else run Birgé's unimodal algorithm on X restricted to [a, b].
• Learn Heavy:
  • Estimate the variance σ̂² and the mean µ̂ of X using O(1/ε²) samples.
  • The discretized normal DN(µ̂, σ̂²) is ε-close, so just output µ̂ and σ̂².
• Choose between Sparse and Heavy.
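A minimal sketch of the truncation step of the sparse branch (an assumption-laden toy: Binomial samples stand in for X, and the empirical ε/2- and (1 − ε/2)-quantiles play the role of a and b):

```python
import numpy as np

rng = np.random.default_rng(2)

def truncated_support(samples, eps):
    """Empirical interval [a, b] capturing roughly 1 - eps of the mass."""
    xs = np.sort(samples)
    n = len(xs)
    a = xs[int(np.floor(n * eps / 2))]
    b = xs[min(n - 1, int(np.ceil(n * (1 - eps / 2))))]
    return a, b

eps = 0.1
# Toy data: Binomial(100, 0.3) samples stand in for the unknown PBD X.
samples = rng.binomial(100, 0.3, size=int(1 / eps**2) * 10)
a, b = truncated_support(samples, eps)
is_sparse = (b - a) <= 1 / eps**3   # the sparse/heavy test from the slide
print(a, b, is_sparse)
```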
Tough Curriculum
• a set of m items, e.g. a set of students.
• a set of n items, e.g. a set of courses.
• Each student has passed course i with probability pi, independently of the other students.
Question: What is the distribution of the number of different courses
that a set of k students will have passed?
Answer: For course i not to be in the set, every one of the k students must have failed it. This happens with probability (1 − pi)^k. Therefore i is included in the union of passed courses with probability 1 − (1 − pi)^k.
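The answer in code, with a hypothetical list of pass probabilities. Note that the number of distinct passed courses is itself a PBD, with parameters 1 − (1 − pi)^k, since the inclusion events are independent across courses:

```python
# Probability that course i is among those passed by at least one of k
# students: the complement of all k students failing it independently.
def course_included(p_i: float, k: int) -> float:
    return 1 - (1 - p_i) ** k

# Expected number of distinct passed courses (hypothetical p's).
ps = [0.2, 0.5, 0.9]
k = 5
expected_distinct = sum(course_included(p, k) for p in ps)
print(course_included(0.2, 5), expected_distinct)   # 1 - 0.8^5 = 0.67232
```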
PBD Powers
• Let P be an n-PBD defined by p.
• P^i is the i-th PBD power of P, defined by the coordinate-wise power p^i = (p1^i, . . . , pn^i).
Question: Given samples from a subset of the powers of P can we learn
the other powers? Can we do better than learning each one of them
separately?
Binomial Powers
Approximating Binomials
Binomial TVD [3]
We want to approximate all the distributions B(n, p^i), i ∈ N.

|p − q| ≤ ε√(p(1 − p)/n) = err(n, p, ε)  =⇒  dtv(B(n, p), B(n, q)) ≤ ε
Approximating p
• Estimator:

  p̂ = (∑_{i=1}^N Xi) / (Nn)

• Sample Complexity: choosing N = O(ln(1/δ)/ε²), from Chernoff's bound we have:

  P[|p̂ − p| > err(n, p, ε)] < δ.
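A quick simulation of the estimator (NumPy; the concrete n, p, N, ε are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(3)

n, p = 50, 0.3
N = 10_000                         # plays the role of O(ln(1/delta)/eps^2)
xs = rng.binomial(n, p, size=N)
p_hat = xs.sum() / (N * n)         # the estimator from the slide

eps = 0.05
err = eps * np.sqrt(p * (1 - p) / n)   # err(n, p, eps)
print(p_hat, abs(p_hat - p), err)
```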
Real Powers of p
Assume that p ≈ 1 − 1/n or p ≈ 1/n.

p = 0.99...9458382...  (log n nines, then the "constant" part)
p = 0.00...0235711...  (log n zeros, then the "constant" part)

• Sampling from the first power reveals the first part of p, since √(p(1 − p)/n) ≈ 1/n.
• Is this good enough to approximate all binomial powers?
• 0.9995^1000 ≈ 0.6064, 0.9997^1000 ≈ 0.7407
A blast from the past
|x − y| = |x^{2^i} − y^{2^i}| / ∏_{j=0}^{i−1} (x^{2^j} + y^{2^j})
Finding the Sweet Spot
By the Mean Value Theorem applied to the mapping x ↦ x^l we obtain

p^l − q̂1^l ≤ l p^{l−1} (p − q̂1), l ∈ (1, +∞)

and

q̂2^l − p^l ≤ l p^{l−1} (q̂2 − p), l ∈ (0, 1)

We need to find a function u(p) such that for all l > 0:

u(p) l p^{l−1} err(n, p, ε) ≤ err(n, p^l, ε)    (1)

u(p) l p^{l−1} √(p(1 − p)/n) ≤ √(p^l(1 − p^l)/n)

u²(p) ≤ (p/(1 − p)) · (p^{−l} − 1)/l²
Finding the Sweet Spot
f(l) = (p^{−l} − 1)/l² is convex, attaining its minimum at l̄ = −C/ln p, with f(l̄) = C′ ln²(1/p).

Now we can choose:

u(p) = D √(p/(1 − p)) · ln(1/p), D ≈ 1.24
Getting in Range
Magic Power: a = −1/ln p.
Question: Can we guess the "magic" power using samples from the first one, for all values of p?
Answer: No, if p is very close to 1 then we cannot hope to learn the
number of 9’s in its decimal representation.
We approximate a with â = −1/ln p̂.

• p^a = p̂^â = 1/e.
• If |p − p̂| ≤ err(n, p, ε) then 1/e² ≤ p^â ≤ 1/e^{1/2}.
• Works for p ∈ [ε²/n, 1 − ε²/n].
Question: What if p is closer to 1 or 0?
Answer: For p ∈ [ε²/n^d, 1 − ε²/n^d] we need O(log(d)/ε²) samples.
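The "magic" power is easy to see numerically (plain Python; the concrete values of p and p̂ are illustrative):

```python
import math

def magic_power(p_hat: float) -> float:
    """a_hat = -1/ln(p_hat); by construction p_hat ** a_hat == 1/e."""
    return -1.0 / math.log(p_hat)

# p ** (-1/ln p) = 1/e for every p in (0, 1).
for p in (0.9995, 0.12):
    print(p, p ** magic_power(p))   # second column is exp(-1) ~ 0.3679

p = 0.999999          # p very close to 1
p_hat = 0.9999985     # a nearby (hypothetical) estimate
a_hat = magic_power(p_hat)
print(p_hat ** a_hat)  # exactly 1/e
print(p ** a_hat)      # stays in a constant range around 1/e
```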
Learning Binomial Powers
Algorithm 1 Binomial Powers
Input: O(ln(1/δ)²/ε²) samples from the powers of B(n, p).
Output: â, q̂1, q̂2.
1: Draw O(ln(1/δ)/ε²) samples from B(n, p) to obtain the approximation p̂.
2: Let â ← −1/ln(p̂).
3: Draw O(ln(1/δ)²/(ε²ψ(p^â)²)) samples from B(n, p^â) to get estimates q̂1, q̂2 of p, with q̂1 ≤ p ≤ q̂2.
4: return â, q̂1, q̂2
Question: How do we obtain q̂1, q̂2 ?
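A minimal Python sketch of Algorithm 1, under stated assumptions: since we know p here, sampling from B(n, p^â) is simulated rather than provided by a power oracle, and a single point estimate of the success probability p^â stands in for the bracketing pair q̂1 ≤ p ≤ q̂2 (the inversion back to p is omitted):

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate_p(n, p, N, rng):
    """Estimate p from N samples of B(n, p) (the slide's estimator)."""
    return rng.binomial(n, p, size=N).sum() / (N * n)

def binomial_powers(n, p, eps, rng):
    """Sketch of Algorithm 1: estimate p, guess the magic power,
    then re-estimate at that power."""
    N = int(10 / eps**2)
    p_hat = estimate_p(n, p, N, rng)            # step 1
    a_hat = -1.0 / np.log(p_hat)                # step 2
    q_hat = estimate_p(n, p**a_hat, N, rng)     # step 3 (simulated oracle)
    return a_hat, q_hat

a_hat, q_hat = binomial_powers(n=100, p=0.97, eps=0.05, rng=rng)
print(a_hat, q_hat)   # q_hat should land near 0.97**a_hat ~ 1/e
```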
Learning the Parameters of a PBD
Learning the Powers vs Learning the Parameters
Question: PBD powers ⇔ Parameter Estimation?

• Assuming that all pi's are well separated, we can learn them by sampling from the powers of a PBD.
• Is there a PBD where learning its powers is easy but learning its parameters is hard?
Learning the Parameters
• P(x) = ∏(x − pi) = x^n + c_{n−1}x^{n−1} + . . . + c0.
• µj = E[P^j] = ∑ pi^j.
Newton Identities

⎡ 1                          ⎤ ⎡ c_{n−1} ⎤   ⎡ −µ1 ⎤
⎢ µ1       2                 ⎥ ⎢ c_{n−2} ⎥   ⎢ −µ2 ⎥
⎢ µ2       µ1      3         ⎥ ⎢ c_{n−3} ⎥ = ⎢ −µ3 ⎥   ⇔ Ac = b
⎢ ⋮                 ⋱        ⎥ ⎢    ⋮    ⎥   ⎢  ⋮  ⎥
⎣ µ_{n−1}  µ_{n−2}  ⋯  µ1  n ⎦ ⎣   c0    ⎦   ⎣ −µn ⎦
We know the µi ’s only approximately from sampling the PBD powers.
Question: How do we measure the impact of noise on the solution of the system?

‖c − ĉ‖∞ ≤ u · O(n^{3/2} 2^n)
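The triangular system is easy to set up and solve numerically; a sketch for a hypothetical 3-PBD, cross-checked against NumPy's expansion of ∏(x − pi):

```python
import numpy as np

# Hypothetical parameter vector of a small PBD.
p = np.array([0.2, 0.5, 0.8])
n = len(p)

# Power sums mu_j = sum_i p_i^j for j = 1..n.
mu = np.array([(p**j).sum() for j in range(1, n + 1)])

# Newton's identities as the lower-triangular system A c = b.
A = np.zeros((n, n))
for k in range(n):
    A[k, k] = k + 1            # diagonal: 1, 2, ..., n
    A[k, :k] = mu[:k][::-1]    # row k: mu_k, mu_{k-1}, ..., mu_1
b = -mu
c = np.linalg.solve(A, b)      # c = (c_{n-1}, ..., c_0)

# Cross-check against numpy's expansion of prod (x - p_i).
coeffs = np.poly(p)            # leading 1, then c_{n-1}, ..., c_0
print(c, coeffs[1:])
```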
Finding the Roots
Pan's Algorithm

• P(x) = ∑_{i=0}^n ci x^i = cn ∏_{i=1}^n (x − pi), cn ≠ 0.
• |pj| ≤ 1 for all j.
• Computes roots such that |p̂j − pj| < ε for j = 1, . . . , n.
• Precision needed: ‖c − ĉ‖∞ = 2^{−O(n max(log(1/ε), log n))}.    (2)
• Overall we need 2^{O(n max(log(1/ε), log n))} samples.
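A toy illustration of noise sensitivity in root finding (numpy.roots stands in for Pan's certified solver; with well-separated roots the problem is benign, which is why the hard instances need clustered roots):

```python
import numpy as np

p = np.sort(np.array([0.2, 0.5, 0.8]))
coeffs = np.poly(p)                       # exact monic coefficients

# Perturb the coefficients slightly and see how far the roots move.
noisy = coeffs + 1e-6 * np.array([0, 1, -1, 1])
roots = np.sort(np.roots(noisy).real)
print(np.abs(roots - p).max())            # small: roots are well separated
```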
Le Cam’s Inequality
Question: How do we show a sample complexity lower bound?
Hypothesis Testing
• We know that samples can come from either P1 or P2.• How many samples do we need in order to decide whether they
come from P or Q?
• Ψ is a testing function, Ψ : X→ {1, 2}• Let V be the random variable of the choice of the unknown
distribution.
Le Cam’s Inequality
inf_Ψ P[Ψ(X^N) ≠ V] = 1 − dtv(P1^N, P2^N)
Minimax Risk, Sample Complexity
• P is a family of distributions.
• P ∈ P is a distribution.
• Θ is the space of the parameter we want to estimate.
• θ: P → Θ, θ(P) is the parameter of P we want to estimate.
• θ̂: X^N → Θ is the estimator.
• X^N = (X1, . . . , XN) ∼ P^N is the sample vector, N is the number of samples.
• ρ is a semimetric on Θ.

M_N(θ(P), ρ) := inf_θ̂ sup_{P∈P} E_{P^N}[ρ(θ̂(X^N), θ(P))].

n(ε, θ(P)) = inf{N : M_N(θ(P), ρ) ≤ ε}
Generalization of Minimax Risk
• P is a family of sequences of distributions.
• P ∈ P is a sequence of distributions.
• Θ is the space of the parameter we want to estimate.
• θ: P → Θ, θ(P) is the parameter of P we want to estimate.
• θ̂: X^m → Θ is the estimator.
• X^m = (X1, . . . , X_{m1}, . . . , X_{mk}) ∼ P^m is the sample vector, N = |m| is the total number of samples.
• ρ is a semimetric on Θ.

Definition

M_N(θ(P), ρ) := inf_θ̂ inf_{|m|=N} sup_{P∈P} E_{P^m}[ρ(θ̂(X^m), θ(P))].    (3)
From Estimation to Testing
Canonical Hypothesis Testing

• "nature" chooses V uniformly from V.
• Conditioned on V = v, we draw the sample X^m from the product distribution P_v^m.

Given X^m our goal is to determine V.

Lower Bound
Let F_V ⊆ P be a family of sequences of distributions indexed by v ∈ V such that ρ(θ(P_v), θ(P_u)) ≥ 2δ for all P_v, P_u ∈ F_V with v ≠ u ∈ V and δ > 0. Then

M_N(θ(P), ρ) ≥ δ inf_{|m|=N} inf_Ψ ν_m(Ψ(X^m) ≠ V).
From Estimation to Testing
Testing Function:

Ψ(X^m) := argmin_{v∈V} ρ(θ̂, θ_v), so that ρ(θ̂, θ_v) ≤ δ ⇔ Ψ(X^m) = v.
Le Cam’s Method
Main Idea: Find two sequences of distributions that are close in total variation but whose parameters are far apart.

• P, Q ∈ P and δ > 0.
• If ρ(θ(P), θ(Q)) ≥ 2δ, then after N observations (samples) the minimax risk has the lower bound

M_N(θ(P), ρ) ≥ (δ/2)(1 − √2 √(1 − (1 − dtv(P, Q))^N)).
The Lower Bound
Construction of [2]:

• n = Θ(log(N/ε))
• pj := (1 + cos(2πj/n))/8, qj := (1 + cos((2πj + π)/n))/8, j ∈ [n].
• For j = n/4 + O(1) we have |pj − qj| = Ω(1/log(N/ε)).
The Lower Bound
• the pi's are the roots of Tn(8x − 1) − 1.
• the qi's are the roots of Tn(8x − 1) + 1.
• for all l ∈ {1, 2, . . . , n − 1}, ∑_{i=1}^n pi^l = ∑_{i=1}^n qi^l.
• for l ≥ n, 3^l (∑_{i=1}^n (pi^l − qi^l)) ≤ n(3/4)^n = log(N/ε)(3/4)^{log(N/ε)}.

All PBD powers of p, q are very close in TVD:

dtv(P^i, Q^i) ≤ c/N.

Therefore the number of samples N should be Ω(2^{1/ε}).
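The moment-matching property of the construction is easy to verify numerically (a hypothetical small n; the first n − 1 power sums of the p's and q's coincide):

```python
import numpy as np

n = 8
j = np.arange(1, n + 1)
p = (1 + np.cos(2 * np.pi * j / n)) / 8           # roots of T_n(8x-1) - 1
q = (1 + np.cos((2 * np.pi * j + np.pi) / n)) / 8  # roots of T_n(8x-1) + 1

# Power sums agree for l = 1, ..., n-1, so the first n-1 moments match.
for l in range(1, n):
    print(l, (p**l).sum() - (q**l).sum())          # numerically zero
```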
Questions?
References

[1] L. Birgé. Estimating a Density under Order Restrictions: Nonasymptotic Minimax Risk. The Annals of Statistics, 15(3):995–1012, Sept. 1987.

[2] I. Diakonikolas, D. Kane, and A. Stewart. Properly learning Poisson binomial distributions in almost polynomial time. In Proceedings of the 29th Conference on Learning Theory (COLT'16), pages 850–878, 2016.

[3] B. Roos. Improvements in the Poisson approximation of mixed Poisson distributions. Journal of Statistical Planning and Inference, 113:467–483, 2003.