Bayesian Model Choice and Information Criteria in Sparse Generalized Linear Models
Mathias Drton
Department of Statistics, University of Chicago
(Paper with this title: Rina Foygel & M.D., arXiv:1112.5635)
Outline
1 BIC and extensions
2 Asymptotics for marginal likelihood of GLMs
3 Consistency for GLMs
4 Ising models
1 BIC and extensions
Bayesian information criterion (BIC)
Sample Y1, . . . ,Yn
Parametric model $\mathcal{M}$
Maximized log-likelihood $\hat\ell(\mathcal{M})$

Bayesian information criterion (Schwarz, 1978):
$$\mathrm{BIC}(\mathcal{M}) := \hat\ell(\mathcal{M}) - \frac{\dim(\mathcal{M})}{2}\,\log n$$

'Generic' model selection approach: maximize $\mathrm{BIC}(\mathcal{M})$ over the set of considered models.
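To make the recipe concrete, here is a minimal sketch (mine, not the talk's code): best-subset Gaussian linear regression scored by BIC.

```python
# A minimal sketch of the generic approach: enumerate candidate supports,
# compute each model's maximized Gaussian log-likelihood, keep the BIC-best.
import itertools
import numpy as np

def max_loglik(XJ, y):
    """Gaussian log-likelihood maximized over coefficients and sigma^2."""
    n = len(y)
    yhat = XJ @ np.linalg.lstsq(XJ, y, rcond=None)[0] if XJ.shape[1] else np.zeros(n)
    sigma2 = np.mean((y - yhat) ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

def bic(loglik, dim, n):
    """BIC(M) = loglik - dim(M)/2 * log n; larger is better."""
    return loglik - 0.5 * dim * np.log(n)

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))
y = X[:, 0] + 2.0 * rng.normal(size=n)          # true support {0}, sigma = 2

# dim counts regression coefficients only (the sigma^2 term is common to all J)
scores = {J: bic(max_loglik(X[:, list(J)], y), len(J), n)
          for k in range(p + 1) for J in itertools.combinations(range(p), k)}
print("BIC-selected support:", max(scores, key=scores.get))   # typically (0,)
```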
Motivation: 1) Bayesian model choice
Posterior model probability in fully Bayesian treatment:
$$P(\mathcal{M} \mid Y_1, \dots, Y_n) \;\propto\; \underbrace{P(\mathcal{M})}_{\text{prior}}\; P(Y_1, \dots, Y_n \mid \mathcal{M}).$$

Marginal likelihood:
$$L_n(\mathcal{M}) := P(Y_1, \dots, Y_n \mid \mathcal{M}) = \int \underbrace{P(Y_1, \dots, Y_n \mid \theta_{\mathcal{M}}, \mathcal{M})}_{\text{likelihood fct.}}\; \underbrace{dP(\theta_{\mathcal{M}} \mid \mathcal{M})}_{\text{prior}}$$
Motivation: 2) Asymptotics
$Y_1, \dots, Y_n$ i.i.d. sample from $P_0 \in \mathcal{M}$

Theorem (Schwarz, 1978; Haughton, 1988; and others)

Assume $P(\theta_{\mathcal{M}} \mid \mathcal{M})$ is a 'nice' prior on $\mathbb{R}^d$. Then in 'nice' models,
$$\log L_n(\mathcal{M}) = \hat\ell_n(\mathcal{M}) - \frac{d}{2}\log n + O_p(1),$$
and a better (Laplace) approximation is possible:
$$\log L_n(\mathcal{M}) = \hat\ell_n(\mathcal{M}) - \frac{d}{2}\log\Big(\frac{n}{2\pi}\Big) + \log P(\hat\theta_{\mathcal{M}} \mid \mathcal{M}) - \frac{1}{2}\log\det\Big[\frac{1}{n}\,\mathrm{Hessian}(\hat\theta_{\mathcal{M}})\Big] + O_p\big(n^{-1/2}\big)$$
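A quick numerical illustration of both expansions (my own check, not from the slides): for the conjugate model $Y_i \sim N(\theta, 1)$ with prior $\theta \sim N(0, \tau^2)$, $\log L_n$ has a closed form and the observed information is $n$, so the $\log\det$ term vanishes and both approximation errors can be computed exactly.

```python
# Compare exact log L_n with the BIC (O_p(1) error) and Laplace (vanishing
# error) approximations for Y_i ~ N(theta, 1), prior theta ~ N(0, tau^2).
import numpy as np

rng = np.random.default_rng(1)
tau2 = 4.0
for n in [10, 100, 1000, 10000]:
    y = rng.normal(loc=0.5, scale=1.0, size=n)
    ybar, rss = y.mean(), np.sum((y - y.mean()) ** 2)
    loglik_hat = -0.5 * n * np.log(2 * np.pi) - 0.5 * rss          # ell_hat_n
    # exact log marginal likelihood from the normal-normal conjugate integral
    exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n * tau2 + 1)
             - 0.5 * np.sum(y ** 2) + (n * ybar) ** 2 / (2 * (n + 1 / tau2)))
    bic = loglik_hat - 0.5 * np.log(n)                              # d = 1
    log_prior_at_mle = -0.5 * np.log(2 * np.pi * tau2) - ybar ** 2 / (2 * tau2)
    # (1/n) * Hessian equals 1 here, so its log-det contributes nothing
    laplace = loglik_hat - 0.5 * np.log(n / (2 * np.pi)) + log_prior_at_mle
    print(f"n={n:6d}  exact-BIC={exact - bic:+.3f}  exact-Laplace={exact - laplace:+.5f}")
```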
Consistency
Theorem
Fix a finite set of 'nice' models. Then BIC selects a true model of smallest dimension with probability tending to one as $n \to \infty$.
Proof.
Finite set of models =⇒ pairwise comparisons suffice.
If $P_0 \in \mathcal{M}_1 \subsetneq \mathcal{M}_2$ and $d_1 < d_2$, then
$$\hat\ell_n(\mathcal{M}_2) - \hat\ell_n(\mathcal{M}_1) = O_p(1), \qquad (d_2 - d_1)\log n \to \infty.$$
If $P_0 \in \mathcal{M}_1 \setminus \mathrm{clos}(\mathcal{M}_2)$, then with probability tending to one,
$$\frac{1}{n}\Big[\hat\ell_n(\mathcal{M}_1) - \hat\ell_n(\mathcal{M}_2)\Big] > \varepsilon > 0, \qquad \log(n)/n \to 0.$$
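A small simulation of the proof's first case (a sketch, not from the slides): for nested models that both contain $P_0$, the log-likelihood gain of the larger model stays $O_p(1)$ while the BIC penalty gap $\frac{d_2 - d_1}{2}\log n$ grows.

```python
# M1: y ~ N(0, 1) with no free mean parameter (d1 = 0).
# M2: y ~ N(beta * x, 1) with one slope (d2 = 1), sigma = 1 known.
import numpy as np

rng = np.random.default_rng(6)
for n in [100, 1000, 10000]:
    x = rng.normal(size=n)
    y = rng.normal(size=n)                 # true model is M1
    beta = x.dot(y) / x.dot(x)             # MLE slope under M2
    gain = 0.5 * beta ** 2 * x.dot(x)      # ell_hat(M2) - ell_hat(M1) ~ chi^2_1 / 2
    print(f"n={n:6d}  loglik gain={gain:5.2f}  penalty gap={0.5 * np.log(n):5.2f}")
```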
Linear regression (covariates i.i.d. N(0, 1), φ1 = 1, σ = 2)
BIC in higher-dimensional linear regression

[Figure: probability of selecting the correct model ("Prob correct", 0 to 1) against $p$ (10 to 80); exhaustive search up to 6 covariates; $n = p$, $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]
Higher-dimensional linear regression … too large models

[Figure: $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]
Broman & Speed (2002)
Informative prior on models in higher dim. regression
[Figure: $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]
Informative prior on models in higher dim. regression

[Figure: probability of selecting the correct model ("Prob correct", 0 to 1) against $p$ (10 to 80) for BIC vs. EBIC; exhaustive search up to 6 covariates; $n = p$, $\sigma = 1$, $k = 2$, $\phi_1 = \phi_2 = 1$.]
Extended Bayesian information criterion
Linear regression
Models given by subsets of covariates $J \subset [p] := \{1, \dots, p\}$

Prior on models
$$P(J) = \frac{1}{p+1} \cdot \frac{1}{\binom{p}{|J|}}$$
has $k = \#\{\text{covariates}\}$ and $(J \mid k)$ uniformly distributed.

Extended BIC defined as
$$\mathrm{EBIC}(J) = \mathrm{BIC}(J) - |J| \log p;$$
we have $|J| \ll p$ in mind.
Bogdan et al. (2004), Chen & Chen (2008), Scott and Berger (2010), . . .
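A sketch of the resulting criterion (my own illustration, in the slides' regime $n = p$, $k = 2$, $\phi_1 = \phi_2 = 1$; the search is limited to $|J| \le 2$ only to keep the enumeration cheap):

```python
# EBIC for best-subset linear regression: plain BIC plus an extra
# gamma * |J| * log(p) penalty reflecting the model prior above.
import itertools
import numpy as np

def max_loglik(XJ, y):
    n = len(y)
    yhat = XJ @ np.linalg.lstsq(XJ, y, rcond=None)[0] if XJ.shape[1] else np.zeros(n)
    return -0.5 * n * (np.log(2 * np.pi * np.mean((y - yhat) ** 2)) + 1)

def ebic(J, X, y, gamma=1.0):
    """EBIC_gamma(J) = BIC(J) - gamma * |J| * log p; gamma = 1 is the EBIC above."""
    n, p = X.shape
    return (max_loglik(X[:, J], y) - 0.5 * len(J) * np.log(n)
            - gamma * len(J) * np.log(p))

rng = np.random.default_rng(2)
n = p = 80
X = rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)
cands = [list(J) for k in range(3) for J in itertools.combinations(range(p), k)]
print("EBIC pick:", max(cands, key=lambda J: ebic(J, X, y)))    # typically [0, 1]
```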
Theory = consistency for EBIC
Chen & Chen '08: high-dimensional sparse linear regression (fixed design, number of active covariates bounded).

Chen & Chen '11: generalized linear models (fixed design, canonical link).

Chen et al. '11: generalizations for fixed-design regression.

Gao et al. '10; Foygel & D. '10: Gaussian graphical models (adjust penalty for the number of graphs).
Questions
Bayesian connection under high-dimensional asymptotics:
- Is the Laplace approximation to the marginal likelihood accurate uniformly over a growing number of models?
- Does EBIC capture the growth of the marginal likelihood?

Consistency for random designs?

Consistency for pseudo-likelihood approaches to graphical model selection?

Consistency of fully Bayesian model choice as corollaries? (Shang & Clayton, 2011)
2 Asymptotics for marginal likelihood of GLMs
Generalized linear model: Setup
Independent (response) observations Y1, . . . ,Yn
Distribution of $Y_i \sim p_{\theta_i}$ from a univariate exponential family:
$$p_\theta(y) \propto \exp\{y \cdot \theta - b(\theta)\}, \qquad \theta \in \Theta = \mathbb{R}.$$
Linearity:
$$\theta = (\theta_1, \dots, \theta_n)^T = X\phi, \qquad \phi \in \mathbb{R}^p,$$
for design matrix $X = (X_{ij}) \in \mathbb{R}^{n \times p}$ (rows = experiments, columns = covariates).

Random design with $X_{1\bullet}, \dots, X_{n\bullet}$ i.i.d.
Variable selection:
Find support J∗ ⊂ [p] of true parameter φ∗.
Assumptions
(A) Bounded covariates (or a moment condition)
(B1) Subexponential growth of dimension: $\log(p_n) = o(n)$.

(B2) Dimension of smallest true model bounded by a fixed $q \in \mathbb{N}$.

(B3) Small sets of covariates have second-moment matrices with minimal eigenvalue bounded away from zero:
$$\lambda_{\min}\big(E[X_{1J} X_{1J}^T]\big) > a > 0 \quad \text{for all } |J| \leq 2q.$$
(B4) Norm of signal ‖φ∗‖2 bounded.
Theorem (Laplace approximation)
Assume (A), (B1)–(B4) and 'nice priors' $(f_J : J \subset [p], |J| \leq q)$. Then there is a constant $C$ such that the marginal likelihood sequence $L_n(J)$ satisfies
$$\log L_n(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) + \log f_J(\hat\phi_J) + \frac{|J|}{2}\log(2\pi) - \frac{1}{2}\log\det\Big(\frac{1}{n}\,\mathrm{Hessian}_J(\hat\phi_J)\Big) \pm C\,\sqrt{\frac{\log(np)}{n}} \quad \text{for all } |J| \leq q,$$
with probability tending to 1 as $n \to \infty$.
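The formula can be checked numerically in the simplest case (a sketch of mine, not the paper's code): one logistic-regression model with $|J| = 1$ and a $N(0, s^2)$ prior $f_J$, with $\log L_n(J)$ computed by one-dimensional quadrature.

```python
# Compare log L_n(J) by quadrature with the theorem's Laplace formula
# for a single-covariate logistic regression with a N(0, s^2) prior.
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(3)
n, s = 500, 2.0
x = rng.normal(size=n)
y = rng.binomial(1, 1 / (1 + np.exp(-1.2 * x)))

def loglik(phi):
    eta = phi * x
    return np.sum(y * eta - np.log1p(np.exp(eta)))

phi_hat = minimize_scalar(lambda t: -loglik(t)).x
mu = 1 / (1 + np.exp(-phi_hat * x))
hess = np.sum(x ** 2 * mu * (1 - mu))      # observed information at the MLE

# quadrature on phi_hat +- 1; likelihood mass outside is negligible here
integral = quad(lambda t: np.exp(loglik(t) - loglik(phi_hat)) * norm.pdf(t, 0, s),
                phi_hat - 1.0, phi_hat + 1.0)[0]
log_Ln = loglik(phi_hat) + np.log(integral)

laplace = (loglik(phi_hat) - 0.5 * np.log(n) + norm.logpdf(phi_hat, 0, s)
           + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess / n))
print(f"quadrature: {log_Ln:.4f}   Laplace formula: {laplace:.4f}")
```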
EBIC approximation
EBIC (with parameter $\gamma \geq 0$):
$$\mathrm{EBIC}_\gamma(J) = \ell_n(\hat\phi_J) - \frac{|J|}{2}\log(n) - \gamma\,|J| \log(p).$$

Corollary

Assume (A), (B1)–(B4) and 'nice priors' $(f_J : J \subset [p], |J| \leq q)$. Adopt the unnormalized model prior
$$P_\gamma(J) = \binom{p}{|J|}^{-\gamma} \cdot 1\{|J| \leq q\}.$$
Then there is a constant $C'$ such that with probability tending to 1 as $n \to \infty$, we have
$$\Big|\log\big[P_\gamma(J, Y)\big] - \mathrm{EBIC}_\gamma(J)\Big| \leq C' \quad \text{for all } |J| \leq q.$$
Laplace approximation to marginal likelihood
$$\int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J + \gamma)\big) \cdot f_J(\hat\phi_J + \gamma)\, d\gamma$$

Taylor series:
$$\ell_n(\hat\phi_J + \gamma) = \ell_n(\hat\phi_J) - \frac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \cdot \gamma)\, \gamma$$

Approximation by Gaussian integral:
$$f_J(\hat\phi_J) \cdot \int_{\mathbb{R}^J} \exp\big(\ell_n(\hat\phi_J)\big) \cdot \exp\Big(-\frac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\, \gamma\Big)\, d\gamma = f_J(\hat\phi_J) \cdot \exp\big(\ell_n(\hat\phi_J)\big) \cdot \sqrt{\Big(\frac{2\pi}{n}\Big)^{|J|} \cdot \det\Big(\frac{1}{n}\mathrm{Hessian}_J(\hat\phi_J)\Big)^{-1}}$$
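The last equality uses the standard Gaussian integral, stated here for completeness with $H = \mathrm{Hessian}_J(\hat\phi_J)$:
$$\int_{\mathbb{R}^{|J|}} \exp\Big(-\frac{1}{2}\gamma^\top H\, \gamma\Big)\, d\gamma = (2\pi)^{|J|/2}\det(H)^{-1/2} = \sqrt{\Big(\frac{2\pi}{n}\Big)^{|J|}\det\Big(\frac{1}{n}H\Big)^{-1}}.$$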
Laplace approximation to marginal likelihood

$$\int_{\mathbb{R}^J} \exp\Big(\underbrace{\ell_n(\hat\phi_J) - \tfrac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J + t_\gamma \cdot \gamma)\,\gamma}_{\approx\; \ell_n(\hat\phi_J) - \tfrac{1}{2}\gamma^\top \mathrm{Hessian}_J(\hat\phi_J)\,\gamma}\Big)\, d\gamma$$

[Figure: the integrand peaks at $\gamma = 0$, i.e. $\phi = \hat\phi_J$, with value $\ell_n(\hat\phi_J)$; the analysis splits the integral over the regions $\|\gamma\|_2 \leq \sqrt{\log(p)/n}$, $\|\gamma\|_2 \leq 1$, and the remainder.]
Assumptions on priors
Family of priors $(f_J : J \subset [p], |J| \leq q)$ is 'nice' if for constants $0 < F_1, F_2, F_3 < \infty$ we have, uniformly for all $|J| \leq q$:

(i) an upper bound: $\sup_{\phi_J} f_J(\phi_J) \leq F_1 < \infty$,

(ii) a lower bound over a compact set: $\inf_{\|\phi_J\|_2 \leq R+1} f_J(\phi_J) \geq F_2 > 0$, where $R$ is a function of the constants in (A) & (B1)–(B4),

(iii) a Lipschitz property on the same compact set: $\sup_{\|\phi_J\|_2 \leq R+1} \|\nabla f_J(\phi_J)\|_2 \leq F_3 < \infty$.
3 Consistency for GLMs
(B5) Small true coefficients don't decay too fast:
$$\sqrt{\frac{\log(n\,p_n)}{n}} = o\Big(\min\big\{|\phi^*_j| : j \in J^*\big\}\Big).$$

Theorem (EBIC consistency in GLM)

Assume (A), (B1)–(B5). Let
$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty],$$
and take $\gamma > 1 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$, we have
$$\mathrm{EBIC}_\gamma(J^*) - \max_{J \neq J^*,\, |J| \leq q} \mathrm{EBIC}_\gamma(J) \;\geq\; \log(p) \cdot C_{\mathrm{high}} + \log(n) \cdot C_{\mathrm{low}}$$
for constants $C_{\mathrm{high}}, C_{\mathrm{low}} > 0$.
EBIC approximates Bayesian model choice
Corollary (Consistency of Bayesian model choice)
Assume (A), (B1)–(B5) and 'nice priors'. Then with probability tending to 1 as $n \to \infty$, we have
$$P_\gamma(J^* \mid Y) > \max_{J \neq J^*,\, |J| \leq q} P_\gamma(J \mid Y).$$
Experiment for sparse logistic regression (with lasso)
Spambase data from UCI Machine Learning Data Repository
n0 = 4601 emails, p0 = 57 covariates
Downsample to $n < n_0$ experiments.

Create $p - p_0$ noise covariates by random permutation.

Total number of covariates satisfies $p_n = \frac{p_0}{25}\, n \approx 2.28\, n$.

Select a model from the lasso path using EBIC, cross-validation, and stability selection (Meinshausen & Bühlmann, 2010).
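A sketch of the selection step on synthetic data (not the authors' code; scikit-learn ≥ 1.2 assumed for `penalty=None`): trace an $\ell_1$ logistic path, score each support on the path with EBIC$_\gamma$, and keep the best one. The slides don't spell out whether supports are refit; a common variant, used here, refits each support by unpenalized maximum likelihood before scoring.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, p, gamma = 400, 200, 0.5
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] + X[:, 1]))))   # true support {0, 1}

def ebic_score(J):
    """Unpenalized refit on support J; the intercept is not counted in |J|."""
    if J:
        fit = LogisticRegression(penalty=None, max_iter=1000).fit(X[:, J], y)
        eta = fit.decision_function(X[:, J])
    else:
        eta = np.zeros(n)
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - 0.5 * len(J) * np.log(n) - gamma * len(J) * np.log(p)

best_J, best_score = [], ebic_score([])
for C in np.logspace(-2, 1, 30):                  # lasso path, lambda = 1/C
    path_fit = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    J = list(np.flatnonzero(path_fit.coef_[0]))
    score = ebic_score(J)
    if score > best_score:
        best_J, best_score = J, score
print("EBIC-selected support:", best_J)           # typically [0, 1]
```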
Positive selection and false discovery rate

[Figure: positive selection rate (PSR, 0% to 50%) and false discovery rate (FDR, 0% to 80%) against the number of samples (100 to 600), for $\mathrm{BIC}_0$, $\mathrm{BIC}_{0.25}$, $\mathrm{BIC}_{0.5}$, $\mathrm{BIC}_1$, cross-validation, and stability selection.]
Comparison to full data

[Figure: curves for $\mathrm{BIC}_0$, $\mathrm{BIC}_{0.25}$, $\mathrm{BIC}_{0.5}$, $\mathrm{BIC}_1$, cross-validation, and stability selection; smoothed probability of selection (subsample size 600) against the p-value of the feature in the full regression (sample size 4601).]
Figure: Smoothed probability of selecting a true feature, as a function of thep-value of that feature in the full regression.
4 Ising models
Ising model
Observe i.i.d. $X^{(1)}, \dots, X^{(n)} \in \{0, 1\}^p$

Likelihood function:
$$\frac{1}{Z(\Theta)} \cdot \exp\Big\{\sum_j \Theta_{j0} x_j + \sum_{j<k} \Theta_{jk} x_j x_k\Big\}$$
with normalizing constant $Z(\Theta)$ and (sparse) potential matrix $\Theta$.

Full conditional for $X_j$ is proportional to
$$\exp\Big\{x_j \cdot \Big(\Theta_{j0} + \sum_{k \neq j} \Theta_{jk} x_k\Big)\Big\}$$

Model selection problem: find support $E^*$ (the 'graph') of the true potential matrix $\Theta^*$.
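The full conditional says each coordinate is logistic in the others; it also yields a simple Gibbs sampler for generating from the model (a sketch with made-up potentials, not from the slides):

```python
import numpy as np

def gibbs_ising(Theta, n_samples, n_burn=200, rng=None):
    """Gibbs-sample the Ising model; Theta symmetric, diagonal = field Theta_j0."""
    rng = rng or np.random.default_rng()
    p = Theta.shape[0]
    x = rng.integers(0, 2, size=p)
    out = np.empty((n_samples, p), dtype=int)
    for t in range(n_burn + n_samples):
        for j in range(p):
            # conditional log-odds: Theta_j0 + sum_{k != j} Theta_jk x_k
            eta = Theta[j, j] + Theta[j].dot(x) - Theta[j, j] * x[j]
            x[j] = rng.random() < 1 / (1 + np.exp(-eta))
        if t >= n_burn:
            out[t - n_burn] = x
    return out

# chain graph on 4 nodes: edges (0,1), (1,2), (2,3), weight 1.5, no field
Theta = np.zeros((4, 4))
for j in range(3):
    Theta[j, j + 1] = Theta[j + 1, j] = 1.5
X = gibbs_ising(Theta, n_samples=1000, rng=np.random.default_rng(5))
print(X.mean(axis=0))
```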
Neighborhood selection for sparse Ising models
For each $X_j$, select its neighborhood via the lasso:
$$\hat\Theta^{(\lambda)}_{j\bullet} = \arg\max\Big[\ell_{X_j \mid X_{-j}}\big(\Theta_{j\bullet}\big) - \lambda \cdot \sum_{k \neq j} |\Theta_{jk}|\Big]$$
(Meinshausen & Bühlmann, 2006; Ravikumar et al., 2010)

How to choose $\lambda$, i.e., neighborhoods from each path?

Cross-validation tends to select too large neighborhoods.

Apply EBIC (a code sketch follows below):
- Let $E_{j,\lambda}$ be the edges incident to $j$ in the support of $\hat\Theta^{(\lambda)}_{j\bullet}$.
- Maximize
$$\ell_{X_j \mid X_{-j}}\big(\hat\Theta^{(\lambda)}_{j\bullet}\big) - \frac{|E_{j,\lambda}|}{2}\log(n) - |E_{j,\lambda}| \cdot \gamma \log(p)$$
with respect to $\lambda$.
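A sketch of the whole pipeline (my own illustration; scikit-learn assumed). The conditional log-likelihood is evaluated at the penalized estimate $\hat\Theta^{(\lambda)}_{j\bullet}$, as on the slide; how neighborhoods are combined is left unspecified there, so an "OR" rule is used here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_graph(X, gamma=0.25, Cs=np.logspace(-2, 1, 20)):
    """EBIC-tuned neighborhood selection; returns the combined edge set."""
    n, p = X.shape
    edges = set()
    for j in range(p):
        others = [k for k in range(p) if k != j]
        best_nbhd, best_score = [], -np.inf
        for C in Cs:                                   # path over lambda = 1/C
            fit = LogisticRegression(penalty="l1", solver="liblinear", C=C)
            fit.fit(X[:, others], X[:, j])
            nbhd = [others[i] for i in np.flatnonzero(fit.coef_[0])]
            eta = fit.decision_function(X[:, others])
            loglik = np.sum(X[:, j] * eta - np.log1p(np.exp(eta)))
            score = (loglik - 0.5 * len(nbhd) * np.log(n)
                     - gamma * len(nbhd) * np.log(p))
            if score > best_score:
                best_nbhd, best_score = nbhd, score
        edges |= {(min(j, k), max(j, k)) for k in best_nbhd}   # "OR" rule
    return sorted(edges)

# demo on independent coin flips: the selected graph should be (near) empty
print(select_graph(np.random.default_rng(7).integers(0, 2, size=(300, 6))))
```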
Consistency of EBIC for Ising model selection
Theorem
Consider subexponential growth of $p = p_n$ with
$$\kappa = \lim_{n\to\infty} \frac{\log p_n}{\log n} \in [0, \infty].$$
Assume:
- all neighborhood sizes are bounded by a constant,
- $\sqrt{\log(np)/n} \ll |\Theta^*_{jk}| \leq$ a constant, for all edges $(j, k)$.

Take $\gamma > 2 - \frac{1}{2\kappa}$. Then with probability tending to 1 as $n \to \infty$: $\mathrm{EBIC}_\gamma$ selects the right neighborhood for every $X_j$.

Follows from consistency of EBIC for GLMs with random covariates.
Precipitation data (U.S. Historical Climatology Network)
89 weather stations
measure precipitation (1 or 0) on 278 (nonconsecutive) dates
discard locations of the weather stations; can we recover the geographical layout?
[Figure: map of the 89 weather stations; longitude $-96$ to $-86$, latitude $36$ to $42$.]
[Figure: the same station map with the selected edges drawn, one panel each for BIC, Extended BIC, Cross-validation, and Stability selection; $\gamma = 0.25$ for the extended BIC.]
Edge selection vs distance
[Figure: smoothed probability of selecting an edge (0 to 1) against the distance between weather stations (0 to 600 miles), for BIC, extended BIC, cross-validation, and stability selection.]
Conclusion
Laplace approximation can be accurate uniformly over a large number of sparse GLMs
Chen & Chen's extended Bayesian information criterion (EBIC):
- is connected to Bayesian model choice;
- its consistency proves consistency of 'generic' Bayesian procedures;
- is a computationally inexpensive alternative to stability selection and other resampling methods;
- seems useful for tuning regularization methods.
For details including references, see:
Bayesian model choice and information criteria in sparse generalizedlinear models (with Rina Foygel). arXiv:1112.5635