Computational and Statistical Aspects of Statistical Machine Learning
John Lafferty
Department of Statistics Retreat, Gleacher Center
Outline
• “Modern” nonparametric inference for high dimensional data
  ◦ Nonparametric reduced rank regression
• Risk-computation tradeoffs
  ◦ Covariance-constrained linear regression
• Other research and teaching activities
Context for High Dimensional Nonparametrics
Great progress in recent years on high dimensional linear models
Many problems have important nonlinear structure.
We’ve been studying “purely functional” methods for high dimensional, nonparametric inference
• no basis expansions
• no Mercer kernels
Additive Models
Fully nonparametric models appear hopeless
• Logarithmic scaling, p = log n (e.g., “Rodeo”, Lafferty and Wasserman (2008))
Additive models are a useful compromise
• Exponential scaling, p = exp(n^c) (e.g., “SpAM”, Ravikumar, Lafferty, Liu and Wasserman (2009))
Additive Models
[Figure 23.1 (Bone Mineral Density Data): change in BMD versus age, separate panels for females and males.]
[Figure 23.2 (Diabetes Data): panels for Age, Bmi, Map, and Tc.]
Multivariate Regression
Y ∈ R^q and X ∈ R^p. Regression function m(X) = E(Y | X).

Linear model Y = BX + ε where B ∈ R^{q×p}.
Reduced rank regression: r = rank(B) ≤ C.
Recent work has studied properties and high dimensional scaling of reduced rank regression, where the nuclear norm ‖B‖_* is used as a convex surrogate for the rank constraint (Yuan et al., 2007; Negahban and Wainwright, 2011). E.g.,

‖B_n − B*‖_F = O_P( √( Var(ε) r (p + q) / n ) )
Low-Rank Matrices and Convex Relaxation
low rank matrices: rank(X) ≤ t        convex hull: ‖X‖_* ≤ t
Nuclear Norm Regularization
Algorithms for nuclear norm minimization are a lot like iterative soft thresholding for lasso problems.

To project a matrix B onto the nuclear norm ball ‖X‖_* ≤ t:

• Compute the SVD: B = U diag(σ) V^T

• Soft threshold the singular values: B ← U diag(Soft_λ(σ)) V^T
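A minimal NumPy sketch of this singular value soft-thresholding step (the function name and the fixed threshold λ are illustrative; projecting exactly onto the ball ‖X‖_* ≤ t would additionally require choosing the threshold so that the shrunken singular values sum to at most t):

```python
import numpy as np

def soft_threshold_singular_values(B, lam):
    """Shrink the singular values of B: B -> U diag(Soft_lam(sigma)) V^T."""
    U, sigma, Vt = np.linalg.svd(B, full_matrices=False)  # B = U diag(sigma) V^T
    sigma_shrunk = np.maximum(sigma - lam, 0.0)            # Soft_lam(sigma)
    return U @ np.diag(sigma_shrunk) @ Vt

# Example: a noisy rank-3 matrix is pushed back toward low rank.
rng = np.random.default_rng(0)
B = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 20))
B_hat = soft_threshold_singular_values(B + 0.1 * rng.standard_normal((50, 20)), lam=2.0)
print(np.linalg.matrix_rank(B_hat))  # 3: the small noise singular values are zeroed out
```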
Nonparametric Reduced Rank Regression
Foygel, Horrell, Drton and Lafferty (NIPS 2012)

Nonparametric multivariate regression m(X) = (m_1(X), . . . , m_q(X))^T
Each component is an additive model

m_k(X) = ∑_{j=1}^p m_{kj}(X_j)
What is the nonparametric analogue of the ‖B‖_* penalty?
Low Rank Functions
What does it mean for a set of functions m_1(x), . . . , m_q(x) to be low rank?

Let x_1, . . . , x_n be a collection of points.

We require that the n × q matrix M(x_{1:n}) = [m_k(x_i)] is low rank.
Stochastic setting: M = [m_k(X_i)]. Natural penalty is

(1/√n) ‖M‖_* = (1/√n) ∑_{s=1}^q σ_s(M) = ∑_{s=1}^q √( λ_s( (1/n) M^T M ) )

Population version:

|||M|||_* := ‖ √Cov(M(X)) ‖_* = ‖ Σ(M)^{1/2} ‖_*
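As a quick numerical sanity check of the two equal forms of the empirical penalty above (a NumPy sketch; the random matrix M simply stands in for the evaluations [m_k(X_i)]):

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 200, 5
M = rng.standard_normal((n, q))              # M[i, k] plays the role of m_k(X_i)

# (1/sqrt(n)) ||M||_* : sum of singular values, scaled by 1/sqrt(n)
penalty_svd = np.linalg.svd(M, compute_uv=False).sum() / np.sqrt(n)

# sum_s sqrt(lambda_s((1/n) M^T M)) : the same quantity via eigenvalues
penalty_eig = np.sqrt(np.linalg.eigvalsh(M.T @ M / n)).sum()

print(np.isclose(penalty_svd, penalty_eig))  # True
```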
Constrained Rank Additive Models (CRAM)
Let Σ_j = Cov(M_j). Two natural penalties:

‖Σ_1^{1/2}‖_* + ‖Σ_2^{1/2}‖_* + · · · + ‖Σ_p^{1/2}‖_*

‖( Σ_1^{1/2} Σ_2^{1/2} · · · Σ_p^{1/2} )‖_*

Population risk (first penalty):

(1/2) E‖ Y − ∑_j M_j(X_j) ‖_2^2 + λ ∑_j |||M_j|||_*
Linear case:

∑_{j=1}^p ‖Σ_j^{1/2}‖_* = ∑_{j=1}^p ‖B_j‖_2

‖( Σ_1^{1/2} Σ_2^{1/2} · · · Σ_p^{1/2} )‖_* = ‖B‖_*
CRAM Backfitting Algorithm (Penalty 1)
Input: Data (X_i, Y_i), regularization parameter λ.
Iterate until convergence:
  For each j = 1, . . . , p:
    Compute residual: R_j = Y − ∑_{k≠j} M_k(X_k)
    Estimate projection P_j = E(R_j | X_j) by smoothing: P_j = S_j R_j
    Compute SVD: (1/n) P_j P_j^T = U diag(τ) U^T
    Soft-threshold: M_j = U diag([1 − λ/√τ]_+) U^T P_j
Output: Estimator M(X_i) = ∑_j M_j(X_{ij}).
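Below is a rough NumPy sketch of this backfitting loop. The Gaussian kernel smoother, the bandwidth h, the fixed iteration count, and the centering of each fitted component are illustrative choices and are not specified on the slide:

```python
import numpy as np

def kernel_smoother(x, h):
    """n x n Nadaraya-Watson smoother matrix for one covariate (Gaussian kernel)."""
    W = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h) ** 2)
    return W / W.sum(axis=1, keepdims=True)

def cram_backfit(X, Y, lam, h=0.3, n_iters=25):
    """Sketch of the CRAM backfitting algorithm (penalty 1).
    X: n x p covariates, Y: n x q responses. Returns a list of q x n fitted components."""
    n, p = X.shape
    q = Y.shape[1]
    S = [kernel_smoother(X[:, j], h) for j in range(p)]      # smoothers S_j
    M = [np.zeros((q, n)) for _ in range(p)]                  # fitted values of M_j(X_j)
    Yt = Y.T                                                  # q x n
    for _ in range(n_iters):
        for j in range(p):
            R_j = Yt - sum(M[k] for k in range(p) if k != j)  # residual R_j
            P_j = R_j @ S[j].T                                # smooth: P_j = S_j R_j
            tau, U = np.linalg.eigh(P_j @ P_j.T / n)          # (1/n) P_j P_j^T = U diag(tau) U^T
            shrink = np.maximum(1.0 - lam / np.sqrt(np.maximum(tau, 1e-12)), 0.0)
            M[j] = U @ np.diag(shrink) @ U.T @ P_j            # soft-threshold step
            M[j] -= M[j].mean(axis=1, keepdims=True)          # center for identifiability
    return M
```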
Scaling of Estimation Error
Using a “double covering” technique (½-parametric, ½-nonparametric), we bound the deviation between empirical and population functional covariance matrices in spectral norm:

sup_V ‖ Σ(V) − Σ_n(V) ‖_sp = O_P( √( (q + log(pq)) / n ) ).
This allows us to bound the excess risk of the empirical estimator relative to an oracle.
Summary
• Variations on additive models enjoy most of the good statistical and computational properties of sparse or low-rank linear models.

• We’re building a toolbox for large scale, high dimensional nonparametric inference.
Computation-Risk Tradeoffs
• In “traditional” computational learning theory, the dividing line between learnable and non-learnable is polynomial vs. exponential time
• Valiant’s PAC model
• Mostly negative results: It is not possible to efficiently learn in natural settings
• Claim: Distinctions in polynomial time matter most
Analogy: Numerical Optimization
In numerical optimization, it is understood how to trade off computation for speed of convergence
• First order methods: linear cost, linear convergence
• Quasi-Newton methods: quadratic cost, superlinear convergence
• Newton’s method: cubic cost, quadratic convergence
Are similar tradeoffs possible in statistical learning?
Hints of a Computation-Risk Tradeoff
Graph estimation:
• Our method for estimating the graph of an Ising model: n = Ω(d^3 log p), T = O(p^4) for graphs with p nodes and maximum degree d
• Information-theoretic lower bound: n = Ω(d log p)
Statistical vs. Computational Efficiency
Challenge: Understand how families of estimators with different computational efficiencies can yield different statistical efficiencies
Rate_{H,F}(n) = inf_{m_n ∈ H} sup_{m ∈ F} Risk(m_n, m)
• H: computationally constrained hypothesis class
• F: smoothness constraints on “true” model
Computation-Risk Tradeoffs for Linear Regression
Dinah Shender has been studying such a tradeoff in the setting of high dimensional linear regression
Computation-Risk Tradeoffs for Linear Regression
Standard ridge estimator solves

( (1/n) X^T X + λ_n I ) β_λ = (1/n) X^T Y

Sparsify the sample covariance to get the estimator

( T_t[Σ] + λ_n I ) β_{t,λ} = (1/n) X^T Y

where T_t[Σ] is the hard-thresholded sample covariance:

T_t([m_{ij}]) = [ m_{ij} 1(|m_{ij}| > t) ]

Recent advance in theoretical CS (Spielman et al.): Solving a symmetric diagonally-dominant linear system with m nonzero matrix entries can be done in time

O(m log^2 p)
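A small NumPy/SciPy sketch of the thresholded-covariance ridge estimator. Plain conjugate gradients stands in for the near-linear-time solvers cited above, and it is an added assumption (not from the slides) that λ keeps the thresholded system positive definite:

```python
import numpy as np
from scipy.sparse import csr_matrix, identity
from scipy.sparse.linalg import cg

def sparsified_ridge(X, y, t, lam):
    """Solve (T_t[Sigma] + lam * I) beta = (1/n) X^T y with a sparse iterative solver."""
    n, p = X.shape
    Sigma = X.T @ X / n                                # sample covariance, p x p
    Sigma_t = np.where(np.abs(Sigma) > t, Sigma, 0.0)  # entrywise hard threshold T_t
    A = csr_matrix(Sigma_t) + lam * identity(p, format="csr")
    beta, info = cg(A, X.T @ y / n)                    # assumes A is positive definite
    return beta

# Example: larger t gives a sparser system (cheaper solves) at some cost in accuracy.
rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.standard_normal((n, p))
beta_star = np.zeros(p); beta_star[:5] = 1.0
y = X @ beta_star + 0.5 * rng.standard_normal(n)
print(np.linalg.norm(sparsified_ridge(X, y, t=0.1, lam=0.5) - beta_star))
```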
Computation-Risk Tradeoffs for Linear Regression
Dinah has recently proved that the statistical error scales as

‖β_{t,λ} − β*‖ / ‖β*‖ = O_P( ‖T_t(Σ) − Σ‖_2 ) = O(t^{1−q})

for a class of covariance matrices with rows in sparse ℓ_q balls (as studied by Bickel and Levina).
• Combined with the computational advance, this gives us an explicit, fine-grained risk/computation tradeoff
Simulation
[Simulation plot: risk as a function of lambda.]
Some Other Projects
Minhua Chen: Convex optimization for dictionary learning

Eric Janofsky: Nonparanormal component analysis

Min Xu: High dimensional conditional density and graph estimation
Courses in the Works
• Winter 2013: Nonparametric Inference (Undergraduate and Masters)

• Spring 2013: Machine Learning for Big Data (Undergraduate Statistics and Computer Science)

Charles Cary: Developing cloud-based infrastructure for the course. Candidate data: 80 million images, Yahoo! clickthrough data, Science journal articles, City of Chicago datasets.