Static Parameter Estimation using Kalman Filtering and Proximal Operators

Vivak Patel
Department of Statistics, University of Chicago
December 2, 2015
Acknowledgements
Mihai Anitescu
Senior Computational Mathematician in LANS
Part-time Professor at University of Chicago

Madhukanta Patel (June 8, 1924 - Nov. 21, 2015)
Outline
I. Stationary Parameter Estimation (SPE)
II. Optimization
III. Proximal Operators
IV. Kalman Filtering Theory
V. Kalman-based SGD
Stationary Parameters: Problem

- Observed pairs $(Y_t, X_t) \in \mathbb{R} \times \mathbb{R}^n$, $t = 1, 2, \ldots$
- Data Generation Process: $P(Y_t \mid X_t) = f(Y_t, X_t)$
- Estimate $f$, which is stationary (changes on long time scales)
Stationary Parameters: Nonparametric Density Estimation

- Assume some level of smoothness of $f$
- Use orthonormal bases of function spaces, or use a moving average
- Rate of Convergence (Curse of Dimensionality): $\mathrm{AMISE} \sim N^{-\frac{2s}{2s+n}}$
- Replace the smoothness assumption with sparsity
Stationary Parameters: Parametric Density Estimation

- Assume a particular form for $f$ depending on a parameter $\beta$:
$$P(Y_t \mid X_t, \beta) = f(Y_t, X_t, \beta)$$
- Estimate $\beta$
- Rate of Convergence: $\mathrm{MSE} \sim N^{-1}$
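As a concrete illustration (my example, consistent with the linear model used later in the kSGD section), the Gaussian linear model with known noise variance $\sigma^2$ is one such parametric family:

$$Y_t = X_t^T \beta + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2) \qquad \Longrightarrow \qquad f(Y_t, X_t, \beta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(Y_t - X_t^T \beta)^2}{2\sigma^2} \right)$$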
Optimization

For statistical inference, the objective function is the negative log-likelihood:

$$F_N(\beta) = -\sum_{t=1}^{N} \log f(Y_t, X_t, \beta)$$

Fisher (1922), Cramér (1946), Le Cam (1970), Hájek (1972)

Suppose the data are generated by some fixed $\beta^*$. Let $\beta_N = \arg\min_{\beta} F_N(\beta)$, and

$$V^* = E\left[ \nabla_{\beta} \log f(Y, X, \beta^*) \, \nabla_{\beta}^T \log f(Y, X, \beta^*) \right]^{-1}$$

If some regularity conditions on $f$ hold, then

$$\sqrt{N} \left( \beta_N - \beta^* \right) \Rightarrow N(0, V^*)$$

van der Vaart (2000)
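For the Gaussian linear model above (again my illustration, assuming the noise is independent of the regressors), the negative log-likelihood reduces to the least-squares objective up to constants, so $\beta_N$ is the least-squares estimator and the classical result specializes to

$$F_N(\beta) = \frac{1}{2\sigma^2} \sum_{t=1}^{N} (Y_t - X_t^T \beta)^2 + \text{const}, \qquad \sqrt{N} \left( \beta_N - \beta^* \right) \Rightarrow N\!\left( 0, \ \sigma^2 \, E[X X^T]^{-1} \right)$$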
Optimization: Goals of an Optimization Method

- Fast rate of convergence
- Computational stability
- Well-defined stop condition
- Low computational requirements (floating point operations, memory)

Classical Methods for $F_N(\beta)$:

Method           | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton           | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton     | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate

Moral: If $N$ is large, a single pass through the data set before calculating a new iterate is too expensive.
Stochastic Gradients

Basic Idea: Minimize $F_N(\beta) = -\sum_{t=1}^{N} \log f(Y_t, X_t, \beta)$ using:

$$\beta^{SGD}_{k+1} = \beta^{SGD}_k - \alpha_k \, \nabla\!\left( -\log f(Y_t, X_t, \beta^{SGD}_k) \right) =: \beta^{SGD}_k - \alpha_k \, \nabla l^{SGD}_{t,k}$$

History:

- Least Mean Squares: Cotes (early 1700s), Legendre (1805), Gauss (1806), Macchi & Eweda (1983), Gerencsér (1995)
- Stochastic Gradient Descent (SGD): Robbins & Monro (1951), Neveu (1975), Murata (1998)
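A minimal sketch of the SGD update above for the Gaussian linear model (my illustration; the synthetic data and the Robbins-Monro style step-size schedule are assumed choices, not from the slides):

```python
import numpy as np

def sgd_linear(X, Y, alpha0=0.1):
    """Plain SGD on the per-sample loss l_t(beta) = (Y_t - X_t^T beta)^2 / 2."""
    beta = np.zeros(X.shape[1])
    for k, (x, y) in enumerate(zip(X, Y), start=1):
        alpha_k = alpha0 / k                 # decaying step size (Robbins-Monro style)
        grad = -x * (y - x @ beta)           # gradient of the single-sample loss
        beta = beta - alpha_k * grad
    return beta

# Illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
beta_star = rng.normal(size=5)
Y = X @ beta_star + 0.1 * rng.normal(size=10000)
print(sgd_linear(X, Y))
```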
Stochastic Gradients: Characterization per Iteration

Method                 | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton                 | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton           | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent       | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate
Alt. Full Gradient SGD | N + m          | 0             | O(N + m)       | O(n)   | Linear       | Probabilistic  | Moderate
SGD                    | 1              | 0             | O(n)           | O(n)   | Sub-Linear   | None           | Severe

Alternative Strategy? Without increasing the number of gradient evaluations, can we:

- Increase the rate of convergence?
- Reduce sensitivity to conditioning?

Yes, by increasing the computational resources used per iteration.
Proximal Operator: Proximal Form of SGD

$$\beta^{SGD}_{k+1} = \arg\min_{\beta} \left\{ l^{SGD}_{t,k} + (\beta - \beta^{SGD}_k)^T \nabla l^{SGD}_{t,k} + \frac{1}{2\alpha_k} \left\| \beta - \beta^{SGD}_k \right\|^2 \right\}$$

Levenberg (1944), Marquardt (1963) Improvement:

$$\beta^{LM}_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l^{LM}_{t,k} + (\beta - \beta^{LM}_k)^T \nabla l^{LM}_{t,k} \right)^2 + \frac{1}{2\alpha_k} \left\| \beta - \beta^{LM}_k \right\|^2 \right\}$$

$$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$$

Why is this an improvement?
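Because the matrix being inverted is the identity plus a rank-one term, the Levenberg-Marquardt step above can be applied without forming or factoring a matrix. A minimal sketch (my illustration, using the Sherman-Morrison identity $(\nabla l \nabla^T l + \alpha^{-1} I)^{-1} \nabla l = \alpha \nabla l / (1 + \alpha \|\nabla l\|^2)$):

```python
import numpy as np

def lm_stochastic_step(beta, grad_l, l_val, alpha):
    """One step of beta - (g g^T + I/alpha)^{-1} g * l, via Sherman-Morrison."""
    g = np.asarray(grad_l)
    direction = alpha * g / (1.0 + alpha * (g @ g))   # equals (g g^T + I/alpha)^{-1} g
    return beta - direction * l_val
```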
Proximal Operator

Recall:

$$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$$

Second Bartlett Identity:

$$E_{\beta^{LM}_k}\left[ \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} \right] = E_{\beta^{LM}_k}\left[ \nabla^2 l_{t,k} \right]$$

holds under some regularity conditions on $f$.

Moral: We are using a regularized estimate of the Hessian/covariance.

Can we do better still?
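As a concrete check (my example, the Gaussian linear model with known $\sigma^2$): with $l = (Y - X^T\beta)^2/(2\sigma^2)$, we have $\nabla l = -X(Y - X^T\beta)/\sigma^2$ and $\nabla^2 l = X X^T / \sigma^2$, and since $E_{\beta}\left[ (Y - X^T\beta)^2 \mid X \right] = \sigma^2$ at the data-generating $\beta$,

$$E_{\beta}\left[ \nabla l \, \nabla^T l \right] = \frac{E[X X^T]}{\sigma^2} = E_{\beta}\left[ \nabla^2 l \right]$$

so both sides of the identity equal the Fisher information of a single observation.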
Proximal Operator

Levenberg (1944), Marquardt (1963) Improvement:

$$\beta^{LM}_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l^{LM}_{t,k} + (\beta - \beta^{LM}_k)^T \nabla l^{LM}_{t,k} \right)^2 + \frac{1}{2\alpha_k} \left\| \beta - \beta^{LM}_k \right\|^2 \right\}$$

$$\beta^{LM}_{k+1} = \beta^{LM}_k - \left( \nabla l^{LM}_{t,k} \nabla^T l^{LM}_{t,k} + \alpha_k^{-1} I \right)^{-1} \left( \nabla l^{LM}_{t,k} \right) l^{LM}_{t,k}$$

Possible Rescaling Improvement?

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$

$$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$$
Proximal Operator

Recall:

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$

$$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$$

Optimizer Interpretation: $M_k$ should be the inverse Hessian at $\beta_k$.

First Generation Stochastic Quasi-Newton:

- No improvement in rate of convergence
- No stop condition
- High sensitivity to conditioning

Byrd, Hansen, Nocedal & Singer (arXiv, 2015)
Proximal Operator

Recall:

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$

$$\beta_{k+1} = \beta_k - \left( \nabla l_{t,k} \nabla^T l_{t,k} + M_k^{-1} \right)^{-1} \left( \nabla l_{t,k} \right) l_{t,k}$$

Statistician Interpretation: $M_k$ should estimate the covariance of the parameter estimate (equivalently, $M_k^{-1}$ the inverse covariance).

Kalman Filter: simultaneously estimates the parameter and its covariance matrix.

Kalman (1960)
Kalman Filter

Control Theoretic Derivation

Suppose
$$\beta_{k+1} = \beta_k - G_{k+1} \, l_{k,t}$$
and find $G_{k+1}$ which minimizes
$$E\left[ \left\| \beta_{k+1} - \beta^* \right\|^2 \,\middle|\, X_1, \ldots, X_{k+1} \right]$$

Proximal Derivation

Given the "variance" of $l_{k,t}$ conditioned on $X_t$, $\sigma_t^2$, and $M_k := E\left[ (\beta_k - \beta^*)(\beta_k - \beta^*)^T \,\middle|\, X_1, \ldots, X_k \right]$:

$$\beta_{k+1} = \arg\min_{\beta} \left\{ \frac{1}{2\sigma_t^2}\left( l_{t,k} + (\beta - \beta_k)^T \nabla l_{t,k} \right)^2 + \frac{1}{2} \left\| \beta - \beta_k \right\|^2_{M_k^{-1}} \right\}$$
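Solving this quadratic subproblem in closed form (a short derivation I am adding; the algebra mirrors the Levenberg-Marquardt step, and the rank-one structure reduces the matrix inverse to a scalar division via Sherman-Morrison):

$$\beta_{k+1} = \beta_k - \left( \frac{\nabla l_{t,k} \nabla^T l_{t,k}}{\sigma_t^2} + M_k^{-1} \right)^{-1} \frac{\nabla l_{t,k}}{\sigma_t^2} \, l_{t,k} = \beta_k - \frac{M_k \nabla l_{t,k}}{\sigma_t^2 + \nabla^T l_{t,k} \, M_k \, \nabla l_{t,k}} \, l_{t,k}$$

so the Kalman-type gain $G_{k+1} = M_k \nabla l_{t,k} / (\sigma_t^2 + \nabla^T l_{t,k} M_k \nabla l_{t,k})$ requires only matrix-vector products with $M_k$.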
Kalman Filter

Update for $M_k$:

$$M_{k+1}^{-1} = \frac{1}{\sigma_{t+1}^2} \nabla l_{k,t} \nabla^T l_{k,t} + M_k^{-1}$$

Problem: neither $\sigma_t^2$ nor $M_0$ is known a priori.

Possible Solution: replace $\sigma_t^2$ with another sequence $\gamma_k^2$, replace $M_0$ with $\hat{M}_0$, and generate

$$\hat{M}_{k+1}^{-1} = \frac{1}{\gamma_{k+1}^2} \nabla l_{k,t} \nabla^T l_{k,t} + \hat{M}_k^{-1}$$

Questions

- What values of $\gamma_k^2$ will result in $\beta_k$ converging to $\beta^*$?
- What values of $\gamma_k^2$ will result in $\hat{M}_k$ approximating $M_k$?
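By the Sherman-Morrison formula, the recursion above for $\hat{M}_{k+1}^{-1}$ can be maintained directly on $\hat{M}_k$ at $O(n^2)$ cost per iteration (a standard identity I am adding; it is what makes the $O(n^2)$ per-iteration cost in the final table possible):

$$\hat{M}_{k+1} = \hat{M}_k - \frac{\hat{M}_k \, \nabla l_{k,t} \, \nabla^T l_{k,t} \, \hat{M}_k}{\gamma_{k+1}^2 + \nabla^T l_{k,t} \, \hat{M}_k \, \nabla l_{k,t}}$$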
Kalman Filter: Summary of Kalman Filtering Theory

- Randomness in the model is not assumed to exist. Thus, $\sigma_t^2 = 0$, and $\gamma_k^2$ could be picked based on rate-of-convergence needs.
- There is a strict focus on dynamic parameter estimation. Approximating $M_k$ is ignored in order to prove a linear convergence rate.

References:

- Johnstone, Johnson, Bitmead & Anderson (1982)
- Bittanti, Bolzern & Campi (1990)
- Parkum, Poulsen & Holst (1992)
- Cao & Schwartz (2003)
Kalman-based SGD: Assumptions

- Linear Model: $l_{k,t} = (Y_t - X_t^T \beta_k)^2 / 2$ with $E[l_{*,t}] < \infty$
- Independence and Identical Distribution of $(Y_t, X_t)$
- Regularity: $E\left[ \|X_1\|_2^2 \right] < \infty$
- Uniqueness: for all unit vectors $v \in \mathbb{R}^n$, $P\left[ |X_1^T v| = 0 \right] < 1$

Results:

- In the noise-free case, $Y_t = X_t^T \beta^*$, kSGD will calculate the vector $\beta^*$ once $n$ linearly independent examples are assimilated. (Modified Gram-Schmidt)
- In the noisy case, if $0 < \inf_k \gamma_k^2 \le \sup_k \gamma_k^2 < \infty$, then almost surely
$$E\left[ \|\beta_k - \beta^*\| \,\middle|\, X_1, \ldots, X_k \right] \to 0$$
- For every $\epsilon > 0$, asymptotically almost surely:
$$\frac{1+\epsilon}{\inf_k \gamma_k^2} \, \hat{M}_k \;\succeq\; \frac{1}{\sigma^2} \, M_k \;\succeq\; \frac{1-\epsilon}{\sup_k \gamma_k^2} \, \hat{M}_k$$
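A minimal sketch of the linear-model iteration under these assumptions (my illustration: it uses the classical Kalman-filter/recursive-least-squares form of the update with a constant $\gamma_k^2$ and $\hat{M}_0 = I$, which are assumed choices, and is a sketch of the idea rather than the authors' reference implementation). In the noise-free case it reproduces, up to the small $\gamma^2$ regularization, the exact-recovery behavior described above:

```python
import numpy as np

def ksgd_linear(X, Y, gamma2=1.0):
    """Kalman-based SGD sketch for the linear model Y_t = X_t^T beta + noise.

    Maintains the estimate beta and the matrix M_hat via rank-one
    (Sherman-Morrison) updates: O(n^2) work and memory per observation.
    """
    n = X.shape[1]
    beta = np.zeros(n)
    M = np.eye(n)                               # assumed initialization M_hat_0 = I
    for x, y in zip(X, Y):
        Mx = M @ x
        denom = gamma2 + x @ Mx                 # gamma^2 + x^T M x
        gain = Mx / denom                       # Kalman gain
        beta = beta + gain * (y - x @ beta)     # correct by the residual
        M = M - np.outer(gain, Mx)              # M - M x x^T M / denom
    return beta, M

# Noise-free check: near-exact recovery once n linearly independent rows are seen
rng = np.random.default_rng(0)
n = 4
beta_star = rng.normal(size=n)
X = rng.normal(size=(n, n))                     # n linearly independent rows (a.s.)
Y = X @ beta_star                               # no observation noise
beta_hat, _ = ksgd_linear(X, Y, gamma2=1e-8)    # small gamma^2 in the noise-free case
print(np.allclose(beta_hat, beta_star, atol=1e-4))
```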
Kalman-based SGD: CMS Data Set ($n = 31$, $N = 2.8$ million) [two slides of results figures, not reproduced here]
Kalman-based SGD: Characterization per Iteration

Method                 | Gradient Evals | Hessian Evals | Floating Point | Memory | RoC          | Stop Condition | Conditioning
Newton                 | N              | N             | O(n^2 N)       | O(n^2) | Quadratic    | Deterministic  | Mild
Quasi-Newton           | N              | 0             | O(n^2 + nN)    | O(n^2) | Super-Linear | Deterministic  | Mild
Gradient Descent       | N              | 0             | O(nN)          | O(n)   | Linear       | Deterministic  | Moderate
Alt. Full Gradient SGD | N + m          | 0             | O(N + m)       | O(n)   | Linear       | Probabilistic  | Moderate
kSGD                   | 1              | 0             | O(n^2)         | O(n^2) | >Sub-Linear  | Almost Sure    | Mild
SGD                    | 1              | 0             | O(n)           | O(n)   | Sub-Linear   | None           | Severe