
Lecture on Time Series

Stat 431, Summer 2012

Hyunseung Kang

July 30/31, 2012

1 Introduction

Time series data, by its very name, has a time component to its observations. For example, measuring the price of Google Stock every day is a time series since the daily measurements are taken across time. More formally, we have (Yt, t), where Yt is the price of Google Stock at time t and we have T measurements across time (i.e., t = 1, ..., T). In addition to measuring the price at each time t, we may also collect other information at time t.

Figure 1: Closing Price of Google Stock since 2004 (monthly GOOG closing prices).

For example, we can create a categorical variable that determines whether time t was the holiday season (e.g. Christmas, Thanksgiving, New Year's, Hanukkah, and others) or a binary variable indicating whether a quarterly earnings report was released at time t or not. Collectively, we can denote this data as (Yt, t, Xt,1, ..., Xt,p), t = 1, ..., T, where the Xt,j are the other p measurements taken at time t.

The data, (Yt, t, Xt,1, ..., Xt,p), may look like a multiple regression; if we treat t as another X, say Xt,p+1,


then, at least notationally, the data looks like a regression model

Yt = β0 + β1Xt,1 + ...+ βp+1Xt,p+1 + εt (1)

However, one important caveat in time series is that the observations are serially correlated. That is, there is correlation between the current observation, Yt, and the previous observation, Yt−1; it's possible that Yt's correlation extends to observations that may be far in the past, say Yt−2. We must take this correlation into account when we do inference and prediction with time series data.

In this lecture, we'll discuss how to analyze time series data. In particular, we'll formulate the time series data as a multiple regression model with an autoregressive term to account for the serial correlation. Under this framework, we'll be able to study time series like any other multiple regression model we studied before. For example, we'll be able to run F-tests to test whether other measurements, Xt,j, influence Yt. Finally, we'll conclude the lecture with how to forecast future observations.

2 Time Series and ANCOVA

All time series measurements are inherently correlated with their previous, current, and future observations. For example, if the price of Google's stock is $500 today, it's very likely that the price of Google's stock yesterday was close to $500. The same observation can be made about tomorrow's stock price. Understanding how correlated observations are can shed some light on the variation of Yt. To do this, the most commonly used tool is the autocorrelation plot.

An autocorrelation plot is a plot of serial correlations. Specifically, it plots the correlation between Yt and Yt−1, Yt and Yt−2, Yt and Yt−3, and so on and so forth; in general, for each lag value h, it plots the correlation between Yt and Yt−h, h = 1, ..., T − 1.

2.1 Correlated Errors

We can measure the correlation between the current observation, Yt, and its past, say Yt−h for h = 1, 2, ..., T − 1, by using an autocorrelation plot (also known as a correlogram). For example, if we want to study the correlation between an observation and its immediate past, we look at the correlation between Yt and Yt−1. The correlation at each lag value h is denoted as ρ(h) = corr(Yt, Yt−h). A plot of ρ(h) against h constitutes an autocorrelation plot.

### To generate an autocorrelation plot and a time series plot in R ###

# Read in data

data = read.csv("NASDAQ2011.txt")

attach(data)

# To generate a time series plot

plot(Date,Close,xlab="Date",ylab="NASDAQ close",main="NASDAQ Closing")

lines(Date,Close)

# To generate an autocorrelation plot

acf(Close,main="Autocorrelation for NASDAQ Closing")
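
As a rough sanity check on what acf() reports (a sketch, not part of the original analysis), the lag-h autocorrelation can be approximated by correlating the series with a copy of itself shifted by h. The helper function lag.cor below is hypothetical, and its values differ slightly from acf() because acf() uses the overall mean and variance of the series rather than those of the two subsamples.

### By-hand check of the lag-h correlation (sketch; assumes Close is already loaded) ###

lag.cor = function(y, h) {
  n = length(y)
  cor(y[(h + 1):n], y[1:(n - h)])  # correlation between Y_t and Y_(t-h)
}

lag.cor(Close, 1)   # roughly the height of the first bar in acf(Close)
lag.cor(Close, 5)   # roughly the height of the fifth bar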

The autocorrelation plot in figure 2 gives us a sense of how many lag terms to add in our time series model. Lag terms are terms that we include as X's in our regression model in (1). Specifically, if we want to include exactly h = 1 lag term, we would denote one of the Xt,j's as Xt,j = Yt−1 and, if this is the only X in our model, the expression for the model would be

Yt = β0 + β1Xt,1 + εt (2)

where Xt,1 = Yt−1. Under model equation (2), we would have T − 1 observations, instead of the T that we started with. This is because the first observation in time, Y1, does not have a past observation, say Y0. Thus, we start with t = 2, ..., T.
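
As a minimal sketch of fitting model (2) in R (the names Y.one, Lag1.one, and model.oneLag are just illustrative), we build the single lag term by shifting the Close vector by one time step and then use lm() as in any other regression:

### Minimal AR(1) fit (sketch; assumes Close is already loaded) ###

Y.one = Close[2:length(Close)]            # Y_t for t = 2, ..., T
Lag1.one = Close[1:(length(Close) - 1)]   # X_t,1 = Y_(t-1)

model.oneLag = lm(Y.one ~ Lag1.one)
summary(model.oneLag)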


Figure 2: NASDAQ closing each day in 2011 and its autocorrelation plot. Notice that there is significant correlation between the current observation and past observations, even those that are 15 days in the past (indicated by Lag = 15).

We can imagine adding more lag terms and constructing the following model

Yt = β0 + β1Xt,1 + ...+ βpXt,p + εt (3)

where Xt,j = Yt−j. Determining the number of lag terms to include in the model requires a combination of using the autocorrelation plot and the F-tests we have done for multiple regression. The autocorrelation plot tells us the empirical correlation between a current observation and an observation that is several time steps behind, ρ(h). We may use it to get a rough idea of the number of terms to initially include in our model and fit our preliminary model. For example, for the NASDAQ closing data, the autocorrelation plot in figure 2 showed ρ(h) > 0.6 for h ≤ 5 (see footnote 1). Hence, we may start our model with 5 lag terms in our regression.

Yt = β0 + β1Xt,1 + ...+ β5Xt,5 + εt

Afterwards, we can use the F-test to determine which coefficient or group of coefficients is significant in explaining the variation in Yt. For example, if we want to test whether the last three lag terms, Xt,3, ..., Xt,5, are related to Yt, controlling for the fact that we already have the first two lags, Xt,1 and Xt,2, we can perform the test:

H0 : β3 = β4 = β5 = 0 vs. Ha : at least one of the β’s in H0 is not zero

The F-test for the NASDAQ closing data is shown below. We see that the last three lag terms are not significant, given that we already have the first two lag terms.

1. I arbitrarily chose 0.6 as the correlation value that is considered to indicate significant lag terms. There is no special meaning behind 0.6, and this cutoff may vary from data set to data set.


### To define the lag values in R ###

# Note that our starting time step would be Y6 since there are five lag terms

Y = Close[6:length(Close)]

Lag1 = Close[5:(length(Close)-1)]

Lag2 = Close[4:(length(Close)-2)]

Lag3 = Close[3:(length(Close)-3)]

Lag4 = Close[2:(length(Close)-4)]

Lag5 = Close[1:(length(Close)-5)]

### Build the model with five lag terms ###

model.fiveLag = lm(Y ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5)

model.twoLag = lm(Y ~ Lag1 + Lag2)

### Run ANOVA ###

anova(model.twoLag,model.fiveLag)

### Output ###

# Our F test states that the last three lag terms are not significant

# at 0.05 significance

Analysis of Variance Table

Model 1: Y ~ Lag1 + Lag2

Model 2: Y ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5

Res.Df RSS Df Sum of Sq F Pr(>F)

1 244 160.23

2 241 155.51 3 4.7156 2.4359 0.06538 .

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Generally speaking, the F-tests are performed in a sequential fashion until we obtain no significant difference between the adjacent models. That is, by the F-test (see footnote 2), if there is no significant difference between

Yt = β0 + β1Xt,1 + ...+ βpXt,p + εt

Yt = β0 + β1Xt,1 + ...+ βp−1Xt,p−1 + εt

where Xt,j = Yt−j, then we would only keep p − 1 lag terms. Because we are doing multiple testing, we'll use a Bonferroni correction for our significance level. That is, we'll use α/(number of hypotheses) as our significance level to compare against. A result of this procedure is shown below for the NASDAQ data in table 1.

Model 1 (Reduced) | Model 2 (Full) | P-value (H0 : βlast = 0 vs. Ha : βlast ≠ 0)

Yt = β0 + β1Xt,1 + ...+ β4Xt,4 + εt | Yt = β0 + β1Xt,1 + ...+ β5Xt,5 + εt | 0.1873
Yt = β0 + β1Xt,1 + ...+ β3Xt,3 + εt | Yt = β0 + β1Xt,1 + ...+ β4Xt,4 + εt | 0.238606
Yt = β0 + β1Xt,1 + β2Xt,2 + εt | Yt = β0 + β1Xt,1 + ...+ β3Xt,3 + εt | 0.042970
Yt = β0 + β1Xt,1 + εt | Yt = β0 + β1Xt,1 + β2Xt,2 + εt | 0.950503
Yt = β0 + εt | Yt = β0 + β1Xt,1 + εt | < 2 × 10−16

Table 1: Sequential testing of lag terms for the NASDAQ data. Based on these tests, our Bonferroni-corrected significance level (from the original α = 0.05) is 0.05/5 = 0.01. At 0.01, only the last row is rejected, and there isn't a significant difference when adding the second lag term. Thus, we would only use one lag term. P-values were obtained using either the F-test or the t-test on the last coefficient in the full model.

2. If we do tests in a sequential fashion, we can use the t-test on the last lag term, Xt,p, since the t-test associated with the last term is identical to the F-test comparing the two nested regressions; the hypotheses for the t-test and the F-test are identical.
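
A sketch of how the p-values in table 1 could be produced in R, reusing the Y and Lag1, ..., Lag5 vectors defined above: fit the nested models from smallest to largest and compare each adjacent pair with anova(); the "Pr(>F)" column holds the F-test p-value.

### Sequential tests of the lag terms (sketch; reuses Y and Lag1, ..., Lag5 from above) ###

fits = list(lm(Y ~ 1),
            lm(Y ~ Lag1),
            lm(Y ~ Lag1 + Lag2),
            lm(Y ~ Lag1 + Lag2 + Lag3),
            lm(Y ~ Lag1 + Lag2 + Lag3 + Lag4),
            lm(Y ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5))

# Compare each model to the one with one fewer lag term and print the F-test p-value
for (k in 2:length(fits)) {
  p.value = anova(fits[[k - 1]], fits[[k]])$"Pr(>F)"[2]
  cat(k - 1, "lag(s) vs.", k - 2, "lag(s): p =", p.value, "\n")
}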


If the model contains other Xt,j's that are not lag terms (e.g. trends and seasons), we need to perform the F-tests controlling for these non-lag terms, since these non-lag terms may play a role in explaining some of the variation in Yt.

Finally, this type of model is known as an autoregressive model, denoted as AR() where the number inside ()denotes the number of lag terms in the model. Equation (2) is an AR(1) while equation (3) is an AR(p) model.

2.2 Trend

Most time series data exhibit some trend, and we want to build a model that takes these trends into account. For example, figure 3 shows the amount of carbon dioxide at Mauna Loa; there is an increasing, upward trend in CO2 from 1960 to 2000. Unfortunately, the lag terms we discussed previously cannot capture the trend that is present

Figure 3: CO2 levels in Mauna Loa (parts per million of CO2, by year). There is an upward trend in CO2 as time increases.

in time series data. However, similar to how we dealt with correlated errors, we can introduce a new term into our model in equation (1) to account for the trend. Specifically, let Xt,1 = t. Without any lag terms, we have the following model for Yt

Yt = β0 + β1Xt,1 + εt = β0 + β1t + εt (4)

Equation (4) can capture a linear trend in Yt since Yt is a linear function of t. However, we can also incorporate more complex functions as Xt,1. In the case of the CO2 data, we may believe there is a log(t) trend in CO2 since the CO2 levels level off after year 2000, and we may want to model Yt, the CO2 level at time t, as

Yt = β0 + β1Xt,1 + εt = β0 + β1 log(t) + εt (5)
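
As a quick, hedged sketch of how one might compare the linear trend in (4) against the log trend in (5): the two models are not nested, so instead of an F-test we can compare adjusted R² (or residual plots). The sketch assumes the annual series ppm.yr from the R code later in this section has been read in; the names t.trend, fit.linear, and fit.log are illustrative.

### Comparing a linear trend with a log(t) trend (sketch; assumes ppm.yr is loaded) ###

t.trend = 1:length(ppm.yr)
fit.linear = lm(ppm.yr ~ t.trend)
fit.log = lm(ppm.yr ~ log(t.trend))

summary(fit.linear)$adj.r.squared   # adjusted R^2 for the linear trend
summary(fit.log)$adj.r.squared      # adjusted R^2 for the log trend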


In conjunction with the lag terms, we have the following model for CO2

Yt = β0 + β1Xt,1 + β2Xt,2 + β3Xt,3 + εt (6)

where Xt,1 = log(t), Xt,2 = Yt−1, and Xt,3 = Yt−2. The estimates for equation (6) and a plot of the fit are shown in the attached R code and figure 4.

### R Code for the CO2 Data Set ###

# Read data

data = read.table("http://stat.wharton.upenn.edu/~khyuns/stat431/CO2MaunaLoa.txt",header=TRUE)

attach(data)

### Plot the Autocorrelation Plot ###

par(mfrow=c(2,1)) #Allows me to have multiple plots

acf(ppm.yr,main="Autocorrelation plot for CO2")

plot(1:length(ppm.yr),ppm.yr,type="l",main="CO2 at Mauna Loa",xaxt="n",xlab="Year",lwd=2)

axis(1,at=seq(1,length(ppm.yr),10),labels=seq(year[1],year[length(year)],10),las=2)

### Incorporating Lag Terms ###

Y = ppm.yr[3:length(ppm.yr)]

Lag1 = ppm.yr[2:(length(ppm.yr)-1)]

Lag2 = ppm.yr[1:(length(ppm.yr)-2)]

### Incorporating Log Trend ###

# I reformat the time so that 1959, the first observation in the data, is 1

# Because of the two lag terms, we have to start at time = 3 (or 1961)

t = 3:length(year)

X = log(t)

### Fit log(t) as the trend ###

model = lm(Y ~ X + Lag1 + Lag2)

summary(model)

lines(t,as.numeric(predict(model)),col="blue",lty=2,lwd=1.5)

legend("topleft",legend=c("Actual CO2 Level","Fitted CO2 Level"), lwd=2,col=c("black","blue"),lty = c(1,2),cex=0.75)

### Output ###

# Notice that the lag terms are not significant.

# We can conduct an F test to see whether both lag terms are not significant, controlling for

# the trend term, log(t)

Call:

lm(formula = Y ~ X + Lag1 + Lag2)

Residuals:

Min 1Q Median 3Q Max

-1.25090 -0.27764 -0.02237 0.31892 1.12342

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) -0.13443 0.29727 -0.452 0.653198

X 0.56247 0.14164 3.971 0.000244 ***

Lag1 -0.01123 0.14555 -0.077 0.938822

Lag2 -0.08995 0.14530 -0.619 0.538888

---


Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4702 on 47 degrees of freedom

Multiple R-squared: 0.4006, Adjusted R-squared: 0.3623

F-statistic: 10.47 on 3 and 47 DF, p-value: 2.165e-05

### ANOVA to test whether both lag terms are important or not ###

# Controlling for the trend #

model.reduced = lm(Y ~ X)

anova(model.reduced,model)

### Output ###

# ANOVA shows us that, controlling for the trend terms,

# the lag terms are not significant

Analysis of Variance Table

Model 1: Y ~ X

Model 2: Y ~ X + Lag1 + Lag2

Res.Df RSS Df Sum of Sq F Pr(>F)

1 49 10.479

2 47 10.393 2 0.085958 0.1944 0.824

Figure 4: Plot of CO2 over time, along with its autocorrelation plot. The fit suggested by the model in equation (6) is plotted as the dotted blue line.


If we take a look at the R output for equation (6), we see that, individually, the lag terms are not significant; the t-tests, which test individual coefficients controlling for all the others, suggest that neither lag term is significant, controlling for the other variables. To check whether the lag terms matter after taking the trend into account, we run an ANOVA with the hypothesis

H0 : β2 = β3 = 0 vs. Ha : at least one of β2 or β3 is not zero

The F value for this ANOVA is 0.1944 with a p-value of 0.824 (see the R output). This suggests that the log trend alone explains most of the variance in Yt, even though there is a strong correlation between Yt and Yt−1, i.e., ρ(1) is very high.

Finally, we can incorporate polynomials as a trend. Specifically, we have Xt,1 = t, Xt,2 = t^2, ..., Xt,p = t^p, where

Yt = β0 + β1Xt,1 + ...+ βpXt,p + εt = β0 + β1t + ...+ βpt^p + εt

If we do decide to use polynomials, we can use F-tests to determine the number of power terms to include, much like how we did F-tests in polynomial regression. A sketch of this is shown below.
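
As a sketch of what that looks like in R (assuming the annual ppm.yr series has been read in; the names t.trend, trend.lin, trend.quad, and trend.cub are illustrative), poly() builds the power terms and anova() supplies the nested F-tests:

### Choosing the polynomial degree by nested F-tests (sketch; assumes ppm.yr is loaded) ###

t.trend = 1:length(ppm.yr)
trend.lin = lm(ppm.yr ~ poly(t.trend, 1))    # linear trend only
trend.quad = lm(ppm.yr ~ poly(t.trend, 2))   # adds a t^2 term
trend.cub = lm(ppm.yr ~ poly(t.trend, 3))    # adds a t^3 term

anova(trend.lin, trend.quad)   # is the quadratic term needed?
anova(trend.quad, trend.cub)   # is the cubic term needed, given the quadratic?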

2.3 Season

Some time series data have regular patterns within certain intervals, known as seasons. These patterns differ from trends because trends are patterns that do not repeat on a fixed interval. For example, consider the CO2 data on a monthly scale in figure 5.

Figure 5: Plot of CO2 in Mauna Loa on a monthly scale. Peak CO2 levels occur around May and the lowest CO2 levels occur around November.


We can see there are regular sinusoidal patterns, with peaks occurring around May and troughs occurring around October. In addition, there is an increasing linear trend and definite autocorrelation (see figure 6). To incorporate seasonality, we can introduce a new term into our model in equation (1). Specifically, let Seasont be a categorical variable where

Seasont = "Peak"     from April to June
          "Fall"     from July to August
          "Troughs"  from September to November
          "Rise"     from December to March

Without any lag terms or trend terms, we have the following model for Yt, CO2 at Mauna Loa

Yt = β0 + β1Xt,1 + β2Xt,2 + β3Xt,3 + εt (7)

where Xt,1 is an indicator variable for the "Peak" months, Xt,2 is an indicator variable for the "Fall" months, and Xt,3 is an indicator variable for the "Troughs" months; these Xt,j's are a result of reformatting categorical variables, a topic we discussed in the multiple regression lecture (a small sketch of this reformatting is shown below). We can also specify other seasons as well. For example, for the Google Stock data in figure 1, our "seasons" can be each quarter of the earnings cycle. As long as the X's that represent seasons are categorical variables, any type of season can be incorporated into a time series model.
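
As a small sketch of that reformatting (the factor Season.example is purely illustrative): lm() expands a season factor into indicator columns through model.matrix(), with one level absorbed into the intercept as the baseline.

### How R turns a season factor into indicator variables (sketch) ###

Season.example = factor(c("Peak", "Fall", "Troughs", "Rise", "Peak"))
model.matrix(~ Season.example)
# The columns are indicators for "Peak", "Rise", and "Troughs";
# "Fall" (the first level alphabetically) is the baseline absorbed by the intercept.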

Trends and lag terms can be incorporated with seasons. For the CO2 data, we can propose the following modelfor Yt,

Yt = β0 + β1Xt,1 + β2Xt,2 + β3Xt,3 + β4Xt,4 + β5Xt,5 + β6Xt,6 + β7Xt,7 + β8Xt,8 + β9Xt,9 + εt (8)

where Xt,1, Xt,2, and Xt,3 represent the season terms from equation (7), Xt,4 = t is the linear trend term, and Xt,5 to Xt,9 represent the lag terms (i.e. Xt,5 = Yt−1, Xt,6 = Yt−2, ..., Xt,9 = Yt−5; see footnote 3). The estimates for equation (8) and a plot of the fit are shown in the attached R code and figure 6.

### R Code for the CO2 Data Set ###

# Read data

data = read.csv("http://stat.wharton.upenn.edu/~khyuns/stat431/CO2MaunaLoaMonthly.txt")

attach(data)

### Plot the Autocorrelation Plot ###

par(mfrow=c(2,1)) #Allows me to have multiple plots

acf(CO2,main="Autocorrelation plot for CO2 (Monthly)",lag.max=100)

t = 1:length(CO2)

plot(t,CO2,type="n",main="CO2 at Mauna Loa (Monthly)",xaxt="n",xlab="Year",lwd=2)

lines(t,CO2,lty=2,lwd=1,col="green")

axis(1,at=seq(1,length(CO2),60),labels=Year[seq(1,length(CO2),60)],las=2)

points(t[Month==5],CO2[Month==5],pch=16,col="blue",cex=1)

points(t[Month==10],CO2[Month==10],pch=16,col="red",cex=1)

legend("topleft",legend=c("Peak (May)", "Trough (October)"),

pch=16,col=c("blue","red"))

### Incorporating Lag Terms ###

Y = CO2[6:length(CO2)]

Lag1 = CO2[5:(length(CO2)-1)]

Lag2 = CO2[4:(length(CO2)-2)]

Lag3 = CO2[3:(length(CO2)-3)]

Lag4 = CO2[2:(length(CO2)-4)]

Lag5 = CO2[1:(length(CO2)-5)]

3. I arbitrarily chose to include five lag terms. Technically, you can include more lag terms and make a more complicated model.


Figure 6: Plot of CO2 in Mauna Loa on a monthly scale with the fitted values from equation (8). Notice that the model very closely fits the actual data.


### Incorporating Linear Trend ###

# I reformat the time so that the first observation is "retimed" as 1

# Because of the five lag terms, we have to start at time = 6

X = 6:length(CO2)

### Incorporating Seasons ###

# Because of the five lag terms, we have to start at time = 6

Month.new = Month[6:length(CO2)]

Season = rep("",length(Month.new))

Season[Month.new >= 4 & Month.new <= 6] = "Peak"

Season[Month.new >= 7 & Month.new <= 8] = "Fall"

Season[Month.new >= 9 & Month.new <= 11] = "Troughs"

Season[Month.new == 12] = "Rise"

Season[Month.new >= 1 & Month.new <=3] = "Rise"

### Fit the model ###

model = lm(Y ~ Season + X + Lag1 + Lag2 + Lag3 + Lag4 + Lag5)

summary(model)

lines(X,as.numeric(predict(model)),col="purple",lty=2,lwd=1)

legend("bottomright",legend=c("Actual CO2 Level","Fitted CO2 Level"),

lwd=1,col=c("green","purple"),lty = c(1,2))

### Output ###

Call:

lm(formula = Y ~ Season + X + Lag1 + Lag2 + Lag3 + Lag4 + Lag5)

Residuals:

Min 1Q Median 3Q Max

-1.70302 -0.35836 0.01451 0.34512 2.00743

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 8.638920 2.663726 3.243 0.001244 **

SeasonPeak 0.930349 0.119263 7.801 2.52e-14 ***

SeasonRise 1.206201 0.118082 10.215 < 2e-16 ***

SeasonTroughs 1.213743 0.114011 10.646 < 2e-16 ***

X 0.003880 0.001060 3.661 0.000272 ***

Lag1 1.692354 0.046408 36.467 < 2e-16 ***

Lag2 -0.842168 0.075020 -11.226 < 2e-16 ***

Lag3 0.003702 0.084890 0.044 0.965233

Lag4 0.121480 0.079066 1.536 0.124927

Lag5 -0.006363 0.040320 -0.158 0.874647

### ANOVA to test Importance of Season ###

model.reduced = lm(Y ~ X + Lag1 + Lag2 + Lag3 + Lag4 + Lag5)

anova(model.reduced,model)

### Output ###

# Clearly, season is an important factor.

Analysis of Variance Table

Model 1: Y ~ X + Lag1 + Lag2 + Lag3 + Lag4 + Lag5


Model 2: Y ~ Season + X + Lag1 + Lag2 + Lag3 + Lag4 + Lag5

Res.Df RSS Df Sum of Sq F Pr(>F)

1 640 301.05

2 637 225.28 3 75.779 71.425 < 2.2e-16 ***

---

Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After we fit the model, we can test whether incorporating seasons was important, given that the trend and lag terms were already included. To do this, we can perform an F-test where

H0 : β1 = β2 = β3 = 0 vs. Ha : at least one βj is not zero

We would fit two models, one with only the lag and trend terms and another with all the terms, and perform an F-test. With an F value of 71.425 (see the R output) and a p-value less than 2.2 × 10−16, we conclude that season is important for the model in equation (8).

We can conduct other tests to see whether the lag terms or the trend term are significant. For the lag terms, we can conduct a sequential test, just like the one we did in the section on correlated errors, to see how many lag terms to include. Here, we'll control for the season and the trend term when we test the lag terms. The results are summarized in table 2. Based on table 2, it makes sense to keep only four lag terms.

Model 1 (Reduced) | Model 2 (Full) | P-value (H0 : βlast = 0 vs. Ha : βlast ≠ 0)

Yt = β0 + S + T + β5Xt,5 + ...+ β8Xt,8 + εt | Yt = β0 + S + T + β5Xt,5 + ...+ β9Xt,9 + εt | 0.874647
Yt = β0 + S + T + β5Xt,5 + ...+ β7Xt,7 + εt | Yt = β0 + S + T + β5Xt,5 + ...+ β8Xt,8 + εt | 0.001950
Yt = β0 + S + T + β5Xt,5 + β6Xt,6 + εt | Yt = β0 + S + T + β5Xt,5 + ...+ β7Xt,7 + εt | 2.90 × 10−5
Yt = β0 + S + T + β5Xt,5 + εt | Yt = β0 + S + T + β5Xt,5 + β6Xt,6 + εt | < 2 × 10−16
Yt = β0 + S + T + εt | Yt = β0 + S + T + β5Xt,5 + εt | < 2 × 10−16

Table 2: Sequential testing of lag terms for the CO2 data. S represents the terms associated with the seasons and T represents the trend term, β4Xt,4 = t. Based on these tests, at the Bonferroni-corrected significance level of 0.05/5 = 0.01, only the first four lag terms are significant and will be incorporated into the model. P-values were obtained using either the F-test or the t-test on the last coefficient in the full model.
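
A quick diagnostic sketch (reusing the model object fit above with season, trend, and five lag terms): if the chosen terms have captured the serial structure, the residuals should look roughly uncorrelated, which we can check with acf().

### Residual autocorrelation check (sketch; reuses the model fit above) ###

acf(resid(model), main = "Autocorrelation of residuals")
# Most bars should fall within the dashed confidence bands if the lag, trend,
# and season terms have accounted for the serial correlation.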

3 Forecasting

The reason we build a time series model is to predict future observations. Making predictions is very simple: simply plug the X values into the model and obtain the predicted values, the Ŷ's. For example, with the CO2 model in equation (8), we can predict YT+1 by plugging in the season at time T + 1, XT+1,4 = T + 1, XT+1,5 = YT, XT+1,6 = YT−1, XT+1,7 = YT−2, XT+1,8 = YT−3, and XT+1,9 = YT−4. An example of this is shown in figure 7.

For observations where the lag terms are not available, we use the predicted values as our lag terms. For example, suppose we want to predict YT+10. This prediction requires the lag terms XT+10,5 = YT+9, XT+10,6 = YT+8, XT+10,7 = YT+7, XT+10,8 = YT+6, and XT+10,9 = YT+5, none of which are available. Hence, we would use the predicted value ŶT+9 for XT+10,5, ŶT+8 for XT+10,6, and so forth. This is not a statistically correct way to do prediction, but it does give decent results in practice. Regardless, always keep in mind that too much extrapolation into the future will generally lead to more unstable results.

We won't discuss how to construct CIs or PIs here because these intervals are not the same as those from multiple regression. However, the formulas for CIs and PIs from multiple regression can give an approximation of the actual CIs and PIs; a sketch of this is given after the forecasting code below.


Figure 7: Forecasting CO2 levels in Mauna Loa. We must always be cautious about forecasting too far into the future as estimates become more unreliable.


### Forecasting in R from equation 8 ###

# First, build a new model with only four lag terms #

### Incorporating Lag Terms ###

Y = CO2[5:length(CO2)]

Lag1 = CO2[4:(length(CO2)-1)]

Lag2 = CO2[3:(length(CO2)-2)]

Lag3 = CO2[2:(length(CO2)-3)]

Lag4 = CO2[1:(length(CO2)-4)]

### Incorporating Linear Trend ###

# I reformat the time so that the first observation is "retimed" as 1

# Because of the four lag terms, we have to start at time = 5

X = 5:length(CO2)

### Incorporating Seasons ###

# Because of the four lag terms, we have to start at time = 5

Month.new = Month[5:length(CO2)]

Season = rep("",length(Month.new))

Season[Month.new >= 4 & Month.new <= 6] = "Peak"

Season[Month.new >= 7 & Month.new <= 8] = "Fall"

Season[Month.new >= 9 & Month.new <= 11] = "Troughs"

Season[Month.new == 12] = "Rise"

Season[Month.new >= 1 & Month.new <=3] = "Rise"

Season = factor(Season)

### Fit the model ###

model = lm(Y ~ Season + X + Lag1 + Lag2 + Lag3 + Lag4)

### Predict future values ###

t.maxFuture = 12 # How far you want to go into the future

# Set values for the for-loop

t.future = X[length(X)]

m.future = Month.new[length(Month.new)]

s.future = Season[length(Season)]

l1.future = Lag1[length(Lag1)]

l2.future = Lag2[length(Lag2)]

l3.future = Lag3[length(Lag3)]

l4.future = Lag4[length(Lag4)]

Year.new = c(Year)

# Predicted Yi hats

Y.future = rep(0,t.maxFuture)

for (i in 1:t.maxFuture) {

y.predicted = predict(model,list(Season=s.future,X=t.future,

Lag1=l1.future,Lag2=l2.future,Lag3=l3.future,Lag4=l4.future))

y.predicted = as.numeric(y.predicted) # Makes the number look nice

# Get future months

if(m.future == 12) {

m.future = 1

Year.new = c(Year.new,Year.new[length(Year.new)] + 1)


} else {

m.future = m.future + 1

Year.new = c(Year.new,Year.new[length(Year.new)])

}

if(m.future >= 4 & m.future <= 6) s.future = "Peak"

if(m.future >= 7 & m.future <= 8) s.future = "Fall"

if(m.future >= 9 & m.future <= 11) s.future = "Troughs"

if(m.future == 12) s.future = "Rise"

if(m.future >= 1 & m.future <= 3) s.future = "Rise"

# Get future time (formatted)

t.future = t.future + 1

# Get lag values

l4.future = l3.future

l3.future = l2.future

l2.future = l1.future

l1.future = y.predicted

# Prediction

Y.future[i] = as.numeric(predict(model,list(Season=s.future,X=t.future,

Lag1=l1.future,Lag2=l2.future,Lag3=l3.future,Lag4=l4.future)))

}

# Plot the time series

par(mfrow=c(1,1))

t = 1:length(CO2)

CO2.new = c(CO2,Y.future)

plot(c(t,(length(CO2)+1):(length(CO2)+t.maxFuture)),CO2.new,type="n",

main="C02 at Mauna Loa (Monthly) ",xaxt="n",xlab="Year")

lines(t,CO2,lty=2,lwd=1,col="green")

axis(1,at=seq(1,length(CO2.new),60),labels=Year.new[seq(1,length(CO2.new),60)],las=2)

lines(X,as.numeric(predict(model)),col="purple",lty=2,lwd=1)

lines(c(X[length(X)],(length(CO2)+1):(length(CO2) + t.maxFuture)),

c(as.numeric(predict(model))[length(Y)],Y.future),col="red",lty=2,lwd=1)

legend("bottomright",legend=c("Actual CO2 Level","Fitted CO2 Level","Future C02 Level"),

lwd=1,col=c("green","purple","red"),lty = c(1,2,2))
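
As a rough illustration of the approximation mentioned earlier (a sketch reusing model, Season, X, Y, Lag1, ..., Lag4, and Month.new from the code above; m.next, s.next, and new.data are illustrative names), predict() can return the usual regression prediction interval for a one-step-ahead forecast. Keep in mind that this interval ignores the extra uncertainty introduced once predicted values are fed back in as lag terms, so it is only an approximation.

### Approximate one-step-ahead prediction interval (sketch; reuses objects from above) ###

m.next = Month.new[length(Month.new)] %% 12 + 1   # month of the next observation
if (m.next >= 4 & m.next <= 6) s.next = "Peak"
if (m.next >= 7 & m.next <= 8) s.next = "Fall"
if (m.next >= 9 & m.next <= 11) s.next = "Troughs"
if (m.next == 12 | m.next <= 3) s.next = "Rise"

new.data = list(Season = s.next, X = X[length(X)] + 1,
                Lag1 = Y[length(Y)], Lag2 = Lag1[length(Lag1)],
                Lag3 = Lag2[length(Lag2)], Lag4 = Lag3[length(Lag3)])

# Point forecast with an approximate 95% prediction interval
predict(model, new.data, interval = "prediction", level = 0.95)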
