High-dimensional Time Series Models
George Michailidis
University of Florida
Transdisciplinary Foundations of Data Science, IMA, September 2016
Slide 1 / 34
Introduction
Learning Tasks with Temporally Dependent Data
Predictive inference, forecasting, segmentation, covariance estimation/graphical modeling

Regression models: y_t = X_t β + ε_t, where the p-dimensional predictors X_t and the error term ε_t are generated by a stationary process

Autoregressive models: X_t = A X_{t−1} + E_t, where the p-dimensional error process E_t is white noise
Related control problem: X_t = A X_{t−1} + B U_t + E_t, together with a cost/performance function
Learning Tasks with Temporally Dependent Data
Predictive inference, forecasting, segmentation, covariance estimation/graphical modeling (ctd.)

Factor models: X_t = Λ F_t + E_t, where X_t is a p-dimensional process, F_t a k-dimensional latent/factor process, and E_t a noise process
A popular model in the economics/finance literature lets the factors change dynamically over time, e.g. F_t = Φ F_{t−1} + U_t

Given a multivariate time series X_t, identify structural breaks, i.e. identify points in time at which the structure of the model changes,
e.g. X_t = A_1 X_{t−1} I(t ≤ τ) + A_2 X_{t−1} I(t > τ) + E_t, for some τ ∈ [0, T]
There is an online version of the problem for streaming data
Estimate covariance matrix of temporally dependent data
Application areas
Macroeconomics/Finance
Functional Genomics
Neuroscience
Control of large networks
Application areas: Economics
testing relationship between money and income (Sims, 1972, 1980)
understanding stock price-volume relation (Hiemstra et al., 1994)
dynamic effect of government spending and taxes on output (Blanchard and Jones, 2002)

identify and measure the effects of monetary policy innovations on macroeconomic variables (Bernanke et al., 2005)
Forecasting models in Economics
[Figure: monthly series, February 1960 – August 1974, of Employment, the Federal Funds Rate, and the Consumer Price Index]
Application areas: Functional Genomics
Identify regulatory mechanisms from time course data (panel data structure)
HeLa gene expression regulatory network [From: Fujita et al., 2007]
Application Areas: Neuroscience
Identify brain connectivity regions
Need for high-dimensional models
Economics: forecasting with many predictors (De Mol et al., 2008) or understanding structural relationships (Christiano et al., 1999)
Finance: build large scale systemic risk models
Functional Genomics: reconstruct gene regulatory networks based on limited experimental data

Neuroscience: build detailed connectivity maps on temporal data exhibiting multiple structural changes
Network control: for large sparse systems (Liu, Slotine, Barabasi, 2011)
Key issues:
Nature of the data measurements (numerical, count, binary) (see Raskutti et al., 2016, for models for count data)

Capture the correct dynamics (see Chen and Shojaie, 2016, for models for self-exciting processes)

How does the temporal dependence impact estimation and prediction accuracy?
Illustration of estimation accuracy
Fig. 1. Estimation error of lasso in stochastic regression. Top panel: Example 1, VAR(1) process of predictors with cross-sectional dependence. Bottom panel: Example 2, VAR(2) process of predictors with no cross-sectional dependence.

… affect the convergence rates of lasso estimates in a more intricate manner, not completely captured by ρ(A). Further, several authors [Loh and Wainwright (2012), Negahban and Wainwright (2011), Han and Liu (2013)] conducted nonasymptotic analysis of high-dimensional VAR(1) models, assuming ‖A‖ < 1. In Appendix E (supplementary material [Basu and Michailidis (2015)]) (see Figure 1 and Lemma E.2), we show that this assumption is restrictive and is violated by many stable VAR(1) models. More importantly, such an assumption does not generalize beyond VAR(1).

Example 1. We generate data from the stochastic regression model (1.1) with p = 200 predictors and i.i.d. errors ε_t. The process of predictors comes from a Gaussian VAR(1) model X_t = A X_{t−1} + ξ_t, where A is an upper-triangular matrix with α = 0.2 on the diagonal and γ on the two upper off-diagonal bands. We generate processes with different levels of cross-correlation among the predictors by changing γ and plot the average estimation error of lasso (over multiple iterates) against different sample sizes n in Figure 1.

The spectral radius is common (α = 0.2) across all models. Consistently with the classical low-dimensional asymptotics, the lasso errors for different processes seem to converge as n goes to infinity. However, for small to moderate n, as is common in high-dimensional regimes, lasso errors are …
Modeling Framework
Vector Autoregression
Canonical model for understanding lead-lag cross-dependencies
Successful for forecasting purposes and for intervention analysis (impulseresponse)
Exhibits a number of technical challenges in high dimensions
The VAR Model
p-dimensional, discrete-time, stationary process X^t = (X^t_1, …, X^t_p)

X^t = A_1 X^{t−1} + … + A_d X^{t−d} + ε^t,   ε^t i.i.d. ∼ N(0, Σ_ε)   (1)

A_1, …, A_d: p × p transition matrices (solid, directed edges)

Σ_ε^{−1}: contemporaneous dependence (dotted, undirected edges)

stability: all roots of det A(z), with A(z) := I_p − Σ_{t=1}^d A_t z^t, lie outside the closed unit disc {z ∈ C : |z| ≤ 1}
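As a quick numerical illustration, for a VAR(1) the stability condition is equivalent to the eigenvalues of A_1 lying strictly inside the unit circle. A minimal pure-Python check for the 2 × 2 case; `var1_stable` is an illustrative helper name and the matrix below is a hypothetical example, not from the slides:

```python
import cmath

def var1_stable(A):
    """Check stability of a 2x2 VAR(1) X_t = A X_{t-1} + e_t.
    Stable iff both eigenvalues of A lie strictly inside the unit circle,
    equivalently all roots of det(I - A z) lie outside |z| <= 1."""
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)          # works for complex pairs too
    lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2
    return max(abs(lam1), abs(lam2)) < 1

A = [[0.5, 0.3],
     [0.0, 0.5]]                 # upper triangular: eigenvalues 0.5, 0.5
print(var1_stable(A))            # True
```

For d > 1 lags, the same check applies to the dp × dp companion matrix of the VAR(d).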
Detour: VARs and Granger Causality
Concept introduced by Granger (1969)
A time series X is said to Granger-cause Y if it can be shown, usually through a series of F-tests on lagged values of X (and with lagged values of Y also known), that those X values provide statistically significant information about future values of Y.
In the context of a high-dimensional VAR model, X^{T−t}_j is Granger-causal for X^T_i if A^t_{i,j} ≠ 0.
Granger-causality does not imply true causality; it is built on correlations
Also, related to estimating a Directed Acyclic Graph (DAG) with (d + 1) × p variables, with a known ordering of the variables
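Reading the Granger-causal network off the transition matrices is then just a support check on their entries. A small sketch; `granger_edges` is a hypothetical helper and the matrices are illustrative:

```python
def granger_edges(A_list, tol=1e-8):
    """Edges j -> i of the Granger-causal network implied by VAR transition
    matrices A_1, ..., A_d: series j is Granger-causal for series i iff some
    lag-t coefficient A_t[i][j] is nonzero (up to a numerical tolerance)."""
    p = len(A_list[0])
    return sorted({(j, i)
                   for A in A_list
                   for i in range(p)
                   for j in range(p)
                   if abs(A[i][j]) > tol})

A1 = [[0.5, 0.0], [0.4, 0.3]]    # hypothetical sparse lag-1 matrix
A2 = [[0.0, 0.2], [0.0, 0.0]]    # hypothetical lag-2 matrix
print(granger_edges([A1, A2]))   # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

In practice the A_t are replaced by their sparse estimates, so the recovered edge set inherits the estimator's support.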
Estimating high-dimensional VARs through regression
data: X^0, X^1, …, X^T (one replicate, observed at T + 1 time points)
construct the autoregression Y = 𝒳 B* + E, stacking the observations row-wise:

Y = [(X^T)′; (X^{T−1})′; …; (X^d)′]   (N × p response)

𝒳 = [(X^{T−1})′ (X^{T−2})′ ⋯ (X^{T−d})′; (X^{T−2})′ (X^{T−3})′ ⋯ (X^{T−1−d})′; ⋯; (X^{d−1})′ (X^{d−2})′ ⋯ (X^0)′]   (N × dp design)

B* = [A′_1; …; A′_d]   (dp × p coefficients),   E = [(ε^T)′; (ε^{T−1})′; …; (ε^d)′]   (N × p errors)

Vectorizing,

vec(Y) = vec(𝒳 B*) + vec(E) = (I ⊗ 𝒳) vec(B*) + vec(E)

vec(Y) (Np × 1) = Z (Np × q) β* (q × 1) + vec(E) (Np × 1),   vec(E) ∼ N(0, Σ_ε ⊗ I)

N = T − d + 1,   q = dp²

Key assumption: the A_t are sparse, Σ_{t=1}^d ‖A_t‖_0 ≤ k
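The stacking above can be sketched in a few lines; `build_autoregression` is a hypothetical helper name and the toy replicate is purely illustrative:

```python
def build_autoregression(X, d):
    """Stack one observed replicate X^0, ..., X^T (a list of p-vectors) into
    the response Y (N x p) and design matrix Xmat (N x dp) of the VAR(d)
    regression, N = T - d + 1. The row for time t is [X^{t-1}', ..., X^{t-d}'],
    with rows ordered t = T, T-1, ..., d as on the slide."""
    T = len(X) - 1
    Y, Xmat = [], []
    for t in range(T, d - 1, -1):           # t = T, T-1, ..., d
        Y.append(list(X[t]))
        row = []
        for lag in range(1, d + 1):
            row.extend(X[t - lag])
        Xmat.append(row)
    return Y, Xmat

# toy replicate with p = 2, T = 4, d = 2  ->  N = 3 rows
X = [[0, 0], [1, 2], [2, 1], [3, 3], [4, 2]]
Y, Xmat = build_autoregression(X, d=2)
print(len(Y), len(Xmat[0]))     # 3 4
print(Xmat[0])                  # [3, 3, 2, 1]  i.e. (X^3)', (X^2)'
```

With p = 2, d = 2 this gives q = dp² = 8 regression coefficients once the two columns of B* are vectorized.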
Estimation Methods
ℓ1-penalized least squares (ℓ1-LS):

argmin_{β ∈ R^q} (1/N) ‖Y − Zβ‖² + λ_N ‖β‖_1

ℓ1-penalized log-likelihood (ℓ1-LL) (Davis et al., 2012):

argmin_{β ∈ R^q} (1/N) (Y − Zβ)′ (Σ_ε^{−1} ⊗ I) (Y − Zβ) + λ_N ‖β‖_1
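A minimal sketch of the ℓ1-LS estimator for a single response column, via cyclic coordinate descent with soft-thresholding (pure Python for clarity; `lasso_cd` and `soft` are illustrative names, and in practice an optimized solver would be used):

```python
def soft(z, g):
    """Soft-thresholding operator S(z, g) = sign(z) * max(|z| - g, 0)."""
    return (z - g) if z > g else (z + g) if z < -g else 0.0

def lasso_cd(X, y, lam, n_iter=200):
    """argmin_b (1/N)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate
    descent: each coordinate update is a soft-thresholded least-squares
    fit against the partial residual."""
    N, q = len(X), len(X[0])
    b = [0.0] * q
    for _ in range(n_iter):
        for j in range(q):
            # partial residual excluding coordinate j
            r = [y[i] - sum(X[i][k] * b[k] for k in range(q) if k != j)
                 for i in range(N)]
            zj = sum(X[i][j] * r[i] for i in range(N))
            nj = sum(X[i][j] ** 2 for i in range(N))
            b[j] = soft(zj, N * lam / 2.0) / nj if nj > 0 else 0.0
    return b

# tiny check: with lam = 0 this is ordinary least squares
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
print([round(v, 6) for v in lasso_cd(X, y, lam=0.0)])   # [1.0, 2.0]
```

Applied column by column to the VAR regression, this recovers one row block of B* at a time; a sufficiently large λ_N shrinks all coefficients to zero.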
`1-LL Algorithm
The objective function is jointly non-convex, but convex w.r.t. the B's for fixed Σ_ε^{−1}, and vice versa

The algorithm converges to a stationary point near the truth with high probability under high-dimensional scaling, provided it is initialized at a good point (details in Lin et al., 2016)
Theoretical Considerations
Detour: Probabilistic Consistency of Lasso
For regression models, the quality of the estimates of the regression parameters relies crucially on two regularity conditions:

1. Restricted Eigenvalue (RE): the null space of the normalized design matrix X/√N avoids the cone C(S, 3) := {v ∈ R^p : ‖v_{S^c}‖_1 ≤ 3 ‖v_S‖_1}:

α_RE := min_{v ∈ C(S,3), ‖v‖ ≤ 1} (1/N) ‖Xv‖² > 0

2. Deviation condition: ‖X′E/N‖_max ≤ Q(X, σ) √(log p / N)

Under the above conditions,

Estimation error: ‖β̂ − β*‖_2 ≤ (Q(X, σ)/α_RE) √(k log p / N) with high probability
Lasso Regression for Time Series Data
It is unknown whether the above conditions are satisfied for high-dimensional time series data

Verifying RE-type assumptions for a fixed design is NP-hard

For a random design matrix X, existing results only provide guarantees when the samples are independent

Even for a stationary process, the data share complicated dependence patterns:
- the rows are dependent
- the columns are dependent
- the error term E and the design matrix X are dependent
Vector Autoregression
Random design matrix 𝒳, correlated with the error matrix E:

vec(Y) = vec(𝒳 B*) + vec(E) = (I ⊗ 𝒳) vec(B*) + vec(E)

vec(Y) (Np × 1) = Z (Np × q) β* (q × 1) + vec(E) (Np × 1),   vec(E) ∼ N(0, Σ_ε ⊗ I)

N = T − d + 1,   q = dp²

Here Y, 𝒳, B* and E are the stacked response, design, coefficient, and error matrices of the VAR regression constructed earlier

Question: How often does RE hold? How small is α_RE? How does the cross-correlation affect the convergence rates?
Quantifying Dependence in high-dimensional VAR: Existing approaches
One can try to proceed analogously to regression for i.i.d. data; e.g. Negahban and Wainwright (2011): for VAR(1) models, assume ‖A_1‖ < 1, where ‖A‖ := √(Λ_max(A′A))

For a univariate autoregression X^t = ρ X^{t−1} + ε^t, this reduces to |ρ| < 1, equivalent to the stability assumption

It turns out that this is a very restrictive assumption for most realistic VAR models
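To see why ‖A_1‖ < 1 is restrictive, consider a stable upper-triangular VAR(1) transition matrix with spectral radius 0.5 whose operator norm nonetheless exceeds 1. A pure-Python check in the spirit of Example 1 earlier (the particular entries are a hypothetical choice):

```python
import math

# Upper-triangular 2x2 transition matrix: eigenvalues are the diagonal
# entries (0.5, 0.5), so the VAR(1) is stable (spectral radius 0.5 < 1),
# yet the operator norm ||A|| = sqrt(lambda_max(A'A)) exceeds 1.
a, g = 0.5, 2.0
A = [[a, g],
     [0.0, a]]
# A'A = [[a^2, a*g], [a*g, g^2 + a^2]]; largest eigenvalue via the
# quadratic formula for a symmetric 2x2 matrix
tr = a * a + (g * g + a * a)
det = (a * a) * (g * g + a * a) - (a * g) ** 2
lam_max = (tr + math.sqrt(tr * tr - 4 * det)) / 2
op_norm = math.sqrt(lam_max)
print(a < 1)            # True  (stable: spectral radius 0.5)
print(op_norm > 1)      # True  (||A|| is about 2.12)
```

Increasing the off-diagonal entry g makes ‖A_1‖ arbitrarily large while the spectral radius, and hence stability, is unchanged.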
Quantifying Dependence via the Spectral Density
Spectral density function of a covariance-stationary process X^t:

f_X(θ) = (1/2π) Σ_{l=−∞}^{∞} Γ_X(l) e^{−ilθ},   θ ∈ [−π, π]

Γ_X(l) = E[X^t (X^{t+l})′], the autocovariance matrix of order l

If the VAR process is stable, the spectral density has a closed form (cf. equation (9.4.23), Priestley (1981)):

f_X(θ) = (1/2π) (A(e^{−iθ}))^{−1} Σ_ε (A*(e^{−iθ}))^{−1}

The two sources of dependence factorize in the frequency domain
Quantifying Dependence via the Spectral Density
For univariate processes, the “peak” of the spectral density measures the stability of the process (sharper peak = less stable)

[Figure: autocovariance Γ(h) and spectral density f(θ) of an AR(1) process for ρ = 0.1, 0.5, 0.7]
For multivariate processes, a similar role is played by the maximum eigenvalue ofthe (matrix-valued) spectral density
Quantifying Dependence via the Spectral Density
For a covariance-stationary process X^t with continuous spectral density f_X(θ), the maximum eigenvalue of the spectral density captures its stability:

M(f_X) = max_{θ ∈ [−π,π]} Λ_max(f_X(θ))

The minimum eigenvalue of the spectral density captures dependence among its components:

m(f_X) = min_{θ ∈ [−π,π]} Λ_min(f_X(θ))

For stable VAR(1) processes, M(f_X) scales with (1 − ρ(A_1))^{−2}, where ρ(A_1) is the spectral radius of A_1

m(f_X) scales with the capacity (maximum incoming + outgoing effect at a node) of the underlying graph

We can similarly measure the stability of subprocesses:

M(f_X, k) := max_{J ⊂ {1,…,p}, |J| = k} M(f_{X(J)}),   M(f_X, 1) ≤ M(f_X, 2) ≤ ⋯ ≤ M(f_X, p) = M(f_X)

This allows us to derive concentration inequalities for dependent random variables
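For a univariate AR(1), M(f_X) can be computed directly and matches the (1 − ρ)^{−2} scaling noted above. A small numerical check (the grid evaluation and `ar1_spec` helper are illustrative choices):

```python
import cmath
import math

def ar1_spec(rho, sigma2=1.0, grid=400):
    """Peak of the spectral density of a univariate AR(1)
    x_t = rho x_{t-1} + e_t:  f(theta) = sigma2 / (2 pi |1 - rho e^{-i theta}|^2).
    For rho > 0 the maximum over [-pi, pi] sits at theta = 0 and equals
    sigma2 / (2 pi (1 - rho)^2)."""
    thetas = [-math.pi + 2 * math.pi * k / grid for k in range(grid + 1)]
    f = [sigma2 / (2 * math.pi * abs(1 - rho * cmath.exp(-1j * th)) ** 2)
         for th in thetas]
    return max(f)

for rho in (0.1, 0.5, 0.7):
    peak = ar1_spec(rho)
    # rescaling by 2 pi (1 - rho)^2 should give ~1.0 for every rho
    print(round(peak * 2 * math.pi * (1 - rho) ** 2, 3))
```

The rescaled peak is constant across ρ, confirming that the peak itself grows like (1 − ρ)^{−2} as the process approaches instability.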
Consistency of sparse VAR estimates
It is established (see Basu and Michailidis, 2015) that

Σ_{h=1}^d ‖Â_h − A_h‖ ≤ φ(A_t, Σ_ε) √(k log(dp²)/N)

Consistency in high dimensions: even if d, p = O(N²), k log(dp²)/N → 0 as long as k = o(N)

The error has two components:
1. φ(A_t, Σ_ε): large ⇔ M(f_X) large, m(f_X) small
2. √(k log(dp²)/N): the estimation error for independent data

The estimation error is the same as for i.i.d. data, modulo a price for temporal dependence
Recap of the Main Theoretical Results
Assuming the RE and deviation conditions, we can establish the consistency of sparse estimates of high-dimensional VAR models

For stable VAR models, the RE and deviation conditions hold with high probability

The convergence rate has two components: (i) a component governed by the structural parameters of the problem, identical to the i.i.d. case, and (ii) a component governed by the temporal dependence
Beyond VAR and Sparsity: High-Dimensional Models for Time Series Data
The concentration bounds obtained can be used to prove estimation consistency for other regularized methods with high-dimensional, Gaussian time series:

Regression with Lasso; non-convex penalties (SCAD, MCP)

Generalized linear models (regression with non-continuous outcome variables) [Raskutti et al., 2016]

Sparse covariance estimation with time series data

Regression / VAR with Group Lasso [Basu et al., 2015]

Low rank and low rank + sparse VAR [Basu, 2014]

Tensor regression with dependent data [Raskutti and Yuan, 2015]

Time series with local dependence [Schweinberger et al., 2015]

VAR models with grouped structure on the transition matrices [Matteson et al., 2015]

The results have a common theme:

estimation error for dependent data ≈ (measure of narrowness of the spectrum) × (estimation error for i.i.d. data)
Segmentation problems
Models with Structural Breaks
Increasing interest in using time series models and/or graphical models as network models derived from high-dimensional data
Numerous applications both for the offline and online versions
A Canonical Statistical Problem: Change Point Detection
Simplest Setup:
Random vector observed over the time interval 1, …, T (offline version):
X_t = (X_{1t}, X_{2t}, …, X_{pt}) ∼ N(0, Σ_1), t ≤ τ
X_t = (X_{1t}, X_{2t}, …, X_{pt}) ∼ N(0, Σ_2), t > τ

Objectives:
1. Estimate the change point τ
2. Estimate the Gaussian graphical models Ω_1 ≡ Σ_1^{−1}, Ω_2 ≡ Σ_2^{−1}

The i.i.d. assumption is simplifying, but can easily be relaxed through neighborhood selection techniques leveraging lasso regressions with temporally dependent errors (Basu and Michailidis, 2015)
Some Background on Low Dimensional Change Point Problems
Assume a stump model:

y_i = α I(i ≤ τ) + β I(i > τ) + ε_i,   i = 1, …, T,

where ε_i ∼ N(0, σ²) and I(·) denotes the indicator function

Then, under a condition on the signal-to-noise ratio, |α − β|/σ ≥ C > 0, one can establish the following:

1. α, β can be estimated at rate √T (the usual parametric rate)
2. |τ̂ − τ|/T = O(1/T)
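The least-squares change-point estimate behind these rates scans over candidate splits and minimizes the within-segment residual sum of squares. A minimal sketch on simulated data; `stump_changepoint` is a hypothetical helper name and the simulation parameters are illustrative:

```python
import random

def stump_changepoint(y):
    """Least-squares change-point estimate for the stump model
    y_i = alpha*I(i <= tau) + beta*I(i > tau) + noise: choose the split
    minimizing the total within-segment sum of squared residuals."""
    T = len(y)
    best_tau, best_rss = None, float("inf")
    for tau in range(1, T):                  # tau = size of the left segment
        left, right = y[:tau], y[tau:]
        ml = sum(left) / len(left)
        mr = sum(right) / len(right)
        rss = (sum((v - ml) ** 2 for v in left)
               + sum((v - mr) ** 2 for v in right))
        if rss < best_rss:
            best_tau, best_rss = tau, rss
    return best_tau

random.seed(0)
T, tau = 200, 120
# mean shift 1.0 -> 3.0 at tau, noise sd 0.5: signal-to-noise ratio 4
y = [(1.0 if i < tau else 3.0) + random.gauss(0, 0.5) for i in range(T)]
est = stump_changepoint(y)
print(est)
```

With this signal-to-noise ratio the estimate lands within a few observations of the true τ, consistent with the O(1/T) rate for |τ̂ − τ|/T.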
Naive Algorithm
1. For each t = 1 + d, …, T − d, for some d > 0, calculate the joint Gaussian likelihood assuming τ_candidate = t, which is given by

L(Ω_1; t = 1, …, τ_candidate) + L(Ω_2; t = τ_candidate + 1, …, T)

2. Set τ̂ = argmax_{τ_candidate} L(Ω_1, Ω_2, τ_candidate)

Technical challenges: note that at any solution τ̂ ≠ τ, one term in the likelihood is misspecified

Hence, much more careful handling of the technical issues is needed to establish the results
Main Results
Under a Restricted Eigenvalue condition, it can be established that (Roy, Atchade and Michailidis, 2016)

1. ‖Ω̂_k − Ω_k‖_F = O(√(s log p / T))

2. |τ̂ − τ| / T = O(log(pT) / T)
Extension to Multiple Change Points
In a recent paper, Leonardi and Bühlmann (arXiv, 2016) look at the same problem, but allow multiple change points

To identify the change points, they propose a dynamic programming algorithm, as well as a computationally faster binary search approximation

Further, they look at estimation consistency properties of the τ_k, k = 1, …, K and the corresponding Ω_k's in a slow regime, where change points are sparse and far apart, and in a fast regime, where the number of change points grows as a function of T

The rates for the Ω_k's are the usual ones, but even in the slow regime the obtained rate for τ_k is worse than the one previously obtained

A related problem was studied in Kolar and Xing (Electronic J. of Statistics, 2012), where each node can experience multiple change points, and in Soh and Chandrasekaran (arXiv, 2014)
Concluding Remarks
Temporal data are present in a diverse set of applied areas
Time series models pose a number of subtle technical challenges in high dimensions

A number of open questions:
1. Going beyond Gaussian data (heavy-tailed distributions, mixed types of data)
2. Incorporation of prior information / Bayesian modeling
3. An inference framework for assessing both parameter and model significance
4. Better models for capturing intricate temporal dynamics
5. Intervention/control problems