Static Parameter Estimation using Kalman Filtering and Proximal Operators

Vivak Patel
Department of Statistics, University of Chicago
December 2, 2015
Acknowledgements
Mihai Anitescu
Senior Computational Mathematician in LANS
Part-time Professor at University of Chicago

Madhukanta Patel (June 8, 1924 - Nov. 21, 2015)
Outline
I. Stationary Parameter Estimation (SPE)
II. Optimization
III. Proximal Operators
IV. Kalman Filtering Theory
V. Kalman-based SGD
Stationary Parameters: Problem

- Observed pairs $(Y_t, X_t) \in \mathbb{R} \times \mathbb{R}^n$, $t = 1, 2, \ldots$
- Data Generation Process: $P(Y_t \mid X_t) = f(Y_t, X_t)$
- Estimate $f$, which is stationary (changes on long time scales)
Stationary Parameters: Nonparametric Density Estimation

- Assume some level of smoothness of $f$
- Use orthonormal bases of function spaces, or use a moving average
- Rate of Convergence (Curse of Dimensionality): $\mathrm{AMISE} \sim N^{-\frac{2s}{2s+n}}$
- Replace the smoothness assumption with sparsity
Stationary Parameters: Parametric Density Estimation

- Assume a particular form for $f$ depending on a parameter $\beta$:
$$P(Y_t \mid X_t, \beta) = f(Y_t, X_t, \beta)$$
- Estimate $\beta$
- Rate of Convergence: $\mathrm{MSE} \sim N^{-1}$
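As a concrete illustration (my example, consistent with the linear model used later in the kSGD section), the Gaussian linear model with known noise variance $\sigma^2$ is one such parametric family:

$$Y_t = X_t^T \beta + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2) \qquad \Longrightarrow \qquad f(Y_t, X_t, \beta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(Y_t - X_t^T \beta)^2}{2\sigma^2} \right)$$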
Optimization

For statistical inference, the objective function is the negative log-likelihood:

$$F_N(\beta) = -\sum_{t=1}^{N} \log f(Y_t, X_t, \beta)$$

Fisher (1922), Cramér (1946), Le Cam (1970), Hájek (1972)

Suppose the data are generated by some fixed $\beta^*$. Let $\beta_N = \arg\min_{\beta} F_N(\beta)$, and

$$V^* = E\left[ \nabla_{\beta} \log f(Y, X, \beta^*) \, \nabla_{\beta}^T \log f(Y, X, \beta^*) \right]^{-1}$$

If some regularity conditions on $f$ hold, then

$$\sqrt{N} \left( \beta_N - \beta^* \right) \Rightarrow N(0, V^*)$$

van der Vaart (2000)
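For the Gaussian linear model above (again my illustration, assuming the noise is independent of the regressors), the negative log-likelihood reduces to the least-squares objective up to constants, so $\beta_N$ is the least-squares estimator and the classical result specializes to

$$F_N(\beta) = \frac{1}{2\sigma^2} \sum_{t=1}^{N} (Y_t - X_t^T \beta)^2 + \text{const}, \qquad \sqrt{N} \left( \beta_N - \beta^* \right) \Rightarrow N\!\left( 0, \ \sigma^2 \, E[X X^T]^{-1} \right)$$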
Optimization: Goals of an Optimization Method

- Fast rate of convergence
- Computational stability
- Well-defined stop condition
- Low computational requirements (floating point operations, memory)

Classical Methods for $F_N(\beta)$:

Method           | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton           | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton     | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate

Moral: If $N$ is large, a single pass through the data set before calculating a new iterate is too expensive.
Stochastic Gradients

Basic Idea: Minimize $F_N(\beta) = -\sum_{t=1}^{N} \log f(Y_t, X_t, \beta)$ using:

$$\beta^{SGD}_{k+1} = \beta^{SGD}_k - \alpha_k \, \nabla\!\left( -\log f(Y_t, X_t, \beta^{SGD}_k) \right) =: \beta^{SGD}_k - \alpha_k \, \nabla l^{SGD}_{t,k}$$

History:

- Least Mean Squares: Cotes (early 1700s), Legendre (1805), Gauss (1806), Macchi & Eweda (1983), Gerencsér (1995)
- Stochastic Gradient Descent (SGD): Robbins & Monro (1951), Neveu (1975), Murata (1998)
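A minimal sketch of the SGD update above for the Gaussian linear model (my illustration; the synthetic data and the Robbins-Monro style step-size schedule are assumed choices, not from the slides):

```python
import numpy as np

def sgd_linear(X, Y, alpha0=0.1):
    """Plain SGD on the per-sample loss l_t(beta) = (Y_t - X_t^T beta)^2 / 2."""
    beta = np.zeros(X.shape[1])
    for k, (x, y) in enumerate(zip(X, Y), start=1):
        alpha_k = alpha0 / k                 # decaying step size (Robbins-Monro style)
        grad = -x * (y - x @ beta)           # gradient of the single-sample loss
        beta = beta - alpha_k * grad
    return beta

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
beta_star = rng.normal(size=5)
Y = X @ beta_star + 0.1 * rng.normal(size=10000)
print(sgd_linear(X, Y))
```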
Stochastic Gradients: Characterization per Iteration

Method                 | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton                 | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton           | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent       | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate
Alt. Full Gradient SGD | N + m          | 0             | O(N + m)       | O(n)   | Linear       | Probabilistic  | Moderate
SGD                    | 1              | 0             | O(n)           | O(n)   | Sub-Linear   | None           | Severe

Alternative Strategy? Without increasing the number of gradient evaluations, can we:

- Increase the rate of convergence?
- Reduce sensitivity to conditioning?

Yes, by increasing the computational resources used per iteration.
Proximal Operator: Proximal Form of SGD

$$\beta^{SGD}_{k+1} = \arg\min_{\beta} \left\{ l^{SGD}_{t,k} + (\beta - \beta^{SGD}_k)^T \nabla l^{SGD}_{t,k} + \frac{1}{2\alpha_k} \left\| \beta - \beta^{SGD}_k \right\|^2 \right\}$$

Levenberg (1944), Marquardt (1963) Improvement:

$$\beta^{LM}_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l^{LM}_{t,k} + (\beta - \beta^{LM}_k)^T \nabla l^{LM}_{t,k} \right)^2 + \frac{1}{2\alpha_k} \left\| \beta - \beta^{LM}_k \right\|^2 \right\}$$

$$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$$

Why is this an improvement?
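Because the matrix being inverted is the identity plus a rank-one term, the Levenberg-Marquardt step above can be applied without forming or factoring a matrix. A minimal sketch (my illustration, using the Sherman-Morrison identity $(\nabla l \nabla^T l + \alpha^{-1} I)^{-1} \nabla l = \alpha \nabla l / (1 + \alpha \|\nabla l\|^2)$):

```python
import numpy as np

def lm_stochastic_step(beta, grad_l, l_val, alpha):
    """One step of beta - (g g^T + I/alpha)^{-1} g * l, via Sherman-Morrison."""
    g = np.asarray(grad_l)
    direction = alpha * g / (1.0 + alpha * (g @ g))   # equals (g g^T + I/alpha)^{-1} g
    return beta - direction * l_val
```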
Proximal Operator

Recall:

$$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$$

Second Bartlett Identity:

$$E_{\beta^{LM}_k}\left[ \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} \right] = E_{\beta^{LM}_k}\left[ \nabla^2 l_{t,k} \right]$$

holds under some regularity conditions on $f$.

Moral: We are using a regularized estimate of the Hessian/covariance.

Can we do better still?
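As a concrete check (my example, the Gaussian linear model with known $\sigma^2$): with $l = (Y - X^T\beta)^2/(2\sigma^2)$, we have $\nabla l = -X(Y - X^T\beta)/\sigma^2$ and $\nabla^2 l = X X^T / \sigma^2$, and since $E_{\beta}\left[ (Y - X^T\beta)^2 \mid X \right] = \sigma^2$ at the data-generating $\beta$,

$$E_{\beta}\left[ \nabla l \, \nabla^T l \right] = \frac{E[X X^T]}{\sigma^2} = E_{\beta}\left[ \nabla^2 l \right]$$

so both sides of the identity equal the Fisher information of a single observation.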
Proximal Operator

Levenberg (1944), Marquardt (1963) Improvement:

$$\beta^{LM}_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l^{LM}_{t,k} + (\beta - \beta^{LM}_k)^T \nabla l^{LM}_{t,k} \right)^2 + \frac{1}{2\alpha_k} \left\| \beta - \beta^{LM}_k \right\|^2 \right\}$$

$$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$$

Possible Rescaling Improvement?

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$

$$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$$
Proximal Operator

Recall:

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$

$$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$$

Optimizer Interpretation: $M_k$ should be the inverse Hessian at $\beta_k$.

First Generation Stochastic Quasi-Newton:

- No improvement in rate of convergence
- No stop condition
- High sensitivity to conditioning

Byrd, Hansen, Nocedal & Singer (arXiv, 2015)
Proximal Operator

Recall:

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$

$$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$$

Statistician Interpretation: $M_k$ should estimate the covariance of the parameter estimate (equivalently, $M_k^{-1}$ the inverse covariance).

Kalman Filter: simultaneously estimates the parameter and its covariance matrix.

Kalman (1960)
Kalman Filter

Control Theoretic Derivation

Suppose
$$\beta_{k+1} = \beta_k - G_{k+1} \, l_{k,t}$$
and find $G_{k+1}$ which minimizes
$$E\left[ \left\| \beta_{k+1} - \beta^* \right\|^2 \,\middle|\, X_1, \ldots, X_{k+1} \right]$$

Proximal Derivation

Given the "variance" of $l_{k,t}$ conditioned on $X_t$, $\sigma_t^2$, and $M_k := E\left[ (\beta_k - \beta^*)(\beta_k - \beta^*)^T \,\middle|\, X_1, \ldots, X_k \right]$:

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2\sigma_t^2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$
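Solving this quadratic subproblem in closed form (a short derivation I am adding; the algebra mirrors the Levenberg-Marquardt step, and the rank-one structure reduces the matrix inverse to a scalar division via Sherman-Morrison):

$$\beta_{k+1} = \beta_k - \left( \frac{\nabla l_{t,k} \nabla^T l_{t,k}}{\sigma_t^2} + M_k^{-1} \right)^{-1} \frac{\nabla l_{t,k}}{\sigma_t^2} \, l_{t,k} = \beta_k - \frac{M_k \nabla l_{t,k}}{\sigma_t^2 + \nabla^T l_{t,k} \, M_k \, \nabla l_{t,k}} \, l_{t,k}$$

so the Kalman-type gain $G_{k+1} = M_k \nabla l_{t,k} / (\sigma_t^2 + \nabla^T l_{t,k} M_k \nabla l_{t,k})$ requires only matrix-vector products with $M_k$.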
Kalman Filter

Update for $M_k$:

$$M_{k+1}^{-1} = \frac{1}{\sigma_{t+1}^2} \nabla l_{k,t} \nabla^T l_{k,t} + M_k^{-1}$$

Problem: neither $\sigma_t^2$ nor $M_0$ is known a priori.

Possible Solution: replace $\sigma_t^2$ with another sequence $\gamma_k^2$, replace $M_0$ with $\hat{M}_0$, and generate

$$\hat{M}_{k+1}^{-1} = \frac{1}{\gamma_{k+1}^2} \nabla l_{k,t} \nabla^T l_{k,t} + \hat{M}_k^{-1}$$

Questions

- What values of $\gamma_k^2$ will result in $\beta_k$ converging to $\beta^*$?
- What values of $\gamma_k^2$ will result in $\hat{M}_k$ approximating $M_k$?
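By the Sherman-Morrison formula, the recursion above for $\hat{M}_{k+1}^{-1}$ can be maintained directly on $\hat{M}_k$ at $O(n^2)$ cost per iteration (a standard identity I am adding; it is what makes the $O(n^2)$ per-iteration cost in the final table possible):

$$\hat{M}_{k+1} = \hat{M}_k - \frac{\hat{M}_k \, \nabla l_{k,t} \, \nabla^T l_{k,t} \, \hat{M}_k}{\gamma_{k+1}^2 + \nabla^T l_{k,t} \, \hat{M}_k \, \nabla l_{k,t}}$$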
Kalman Filter: Summary of Kalman Filtering Theory

- Randomness in the model is not assumed to exist. Thus, $\sigma_t^2 = 0$, and $\gamma_k^2$ could be picked based on rate-of-convergence needs.
- There is a strict focus on dynamic parameter estimation. Approximating $M_k$ is ignored in order to prove a linear convergence rate.

References:

- Johnstone, Johnson, Bitmead & Anderson (1982)
- Bittanti, Bolzern & Campi (1990)
- Parkum, Poulsen & Holst (1992)
- Cao & Schwartz (2003)
Kalman-based SGD: Assumptions

- Linear Model: $l_{k,t} = (Y_t - X_t^T \beta_k)^2 / 2$ with $E[l_{*,t}] < \infty$
- Independence and Identical Distribution of $(Y_t, X_t)$
- Regularity: $E\left[ \|X_1\|_2^2 \right] < \infty$
- Uniqueness: for all unit vectors $v \in \mathbb{R}^n$, $P\left[ |X_1^T v| = 0 \right] < 1$

Results:

- In the noise-free case, $Y_t = X_t^T \beta^*$, kSGD will calculate the vector $\beta^*$ once $n$ linearly independent examples are assimilated. (Modified Gram-Schmidt)
- In the noisy case, if $0 < \inf_k \gamma_k^2 \le \sup_k \gamma_k^2 < \infty$, then almost surely
$$E\left[ \|\beta_k - \beta^*\| \,\middle|\, X_1, \ldots, X_k \right] \to 0$$
- For every $\epsilon > 0$, asymptotically almost surely:
$$\frac{1+\epsilon}{\inf_k \gamma_k^2} \, \hat{M}_k \;\succeq\; \frac{1}{\sigma^2} \, M_k \;\succeq\; \frac{1-\epsilon}{\sup_k \gamma_k^2} \, \hat{M}_k$$
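A minimal sketch of the linear-model iteration under these assumptions (my illustration: it uses the classical Kalman-filter/recursive-least-squares form of the update with a constant $\gamma_k^2$ and $\hat{M}_0 = I$, which are assumed choices, and is a sketch of the idea rather than the authors' reference implementation). In the noise-free case it reproduces, up to the small $\gamma^2$ regularization, the exact-recovery behavior described above:

```python
import numpy as np

def ksgd_linear(X, Y, gamma2=1.0):
    """Kalman-based SGD sketch for the linear model Y_t = X_t^T beta + noise.

    Maintains the estimate beta and the matrix M_hat via rank-one
    (Sherman-Morrison) updates: O(n^2) work and memory per observation.
    """
    n = X.shape[1]
    beta = np.zeros(n)
    M = np.eye(n)                               # assumed initialization M_hat_0 = I
    for x, y in zip(X, Y):
        Mx = M @ x
        denom = gamma2 + x @ Mx                 # gamma^2 + x^T M x
        gain = Mx / denom                       # Kalman gain
        beta = beta + gain * (y - x @ beta)     # correct by the residual
        M = M - np.outer(gain, Mx)              # M - M x x^T M / denom
    return beta, M

# Noise-free check: near-exact recovery once n linearly independent rows are seen
rng = np.random.default_rng(0)
n = 4
beta_star = rng.normal(size=n)
X = rng.normal(size=(n, n))                     # n linearly independent rows (a.s.)
Y = X @ beta_star                               # no observation noise
beta_hat, _ = ksgd_linear(X, Y, gamma2=1e-8)    # small gamma^2 in the noise-free case
print(np.allclose(beta_hat, beta_star, atol=1e-4))
```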
Kalman-based SGD: CMS Data Set ($n = 31$, $N = 2.8$ million) [two slides of results figures, not reproduced here]
Kalman-based SGD: Characterization per Iteration

Method                 | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton                 | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton           | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent       | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate
Alt. Full Gradient SGD | N + m          | 0             | O(N + m)       | O(n)   | Linear       | Probabilistic  | Moderate
kSGD                   | 1              | 0             | O(n^2)         | O(n^2) | >Sub-Linear  | Almost Sure    | Mild
SGD                    | 1              | 0             | O(n)           | O(n)   | Sub-Linear   | None           | Severe