
Augmented Difference-in-Differences: Estimation and Inference of

Average Treatment Effects

Kathleen T. Li∗

The Wharton School, University of Pennsylvania

Philadelphia, Pennsylvania 19104

David R. Bell

The Wharton School, University of Pennsylvania

Philadelphia, Pennsylvania 19104

This draft: June 2017

Abstract

Difference-in-differences (DID) methods are widely deployed in management sciences to estimate the average treatment effect (ATE) in quasi-experimental data. We derive and illustrate a new practical and consistent estimator for commonly encountered settings where control units violate the parallel lines assumption essential to DID. Our method, the augmented DID (ADID), exploits the correlation between treated and non-treated units as its identifying assumption, and is a complement to standard DID when the latter breaks down. To facilitate inference, we propose a simple bootstrap method and establish its validity for the stationary data case. Simulations show that our bootstrap method works well for stationary, non-stationary and complicated nonlinear trending data. Using analytics and simulations we demonstrate key properties of ADID. We then estimate the ATE of offline showroom openings on sales for a prominent online-first retailer using DID and ADID. The DID ATE estimate is biased due to the violation of the parallel lines assumption, whereas ADID gives a much more reasonable ATE estimate. The ADID estimator is not only straightforward to implement, but also consistent and robust to the selection method for non-treated units.

Keywords: Difference-in-differences, average treatment effects, parallel line violation, quasi-experimental methods, bootstrap inference

∗We would like to thank Christophe Van den Bulte and Eric Bradlow for helpful comments on this paper. Correspondence regarding this manuscript can be addressed to Kathleen T. Li, [email protected], The Wharton School, 700 Jon M. Huntsman Hall, 3730 Walnut Street, Philadelphia, PA 19104.


1 Introduction

Answering important policy questions in economics and the social sciences often relies on our ability to evaluate causal effects of programs and interventions on outcomes of interest. In economics, for example, much of the early literature on causal inference was motivated by the need to evaluate the effectiveness of education and labor market programs using quasi-experimental data (Ashenfelter 1979, Ashenfelter and Card 1985). Likewise, marketing and management scientists increasingly embrace quasi-experimental data to inform business decisions in a range of settings. These include estimating average treatment effects of: anonymity conferred by the Internet on financing terms for new cars (Busse, Silva-Risso and Zettelmeyer 2006), offline bookstore openings on sales at Amazon (Forman, Ghose and Goldfarb 2009), an acquired tax nexus on online sales when a firm opens offline stores (Anderson, Fong, Simester and Tucker 2010), and consumer relocation on brand preferences (Bronnenberg, Dube and Gentzkow 2012). Others have deployed DID to study phenomena especially prevalent in digital environments, including counterfeiting (Qian 2008), how online reviews drive sales (Chevalier and Mayzlin 2006), executional strategies for display advertising (Goldfarb and Tucker 2011), and privacy (Miller and Tucker 2009).

In the most general terms, the fundamental problem of causal inference in quasi-experimental

settings is the following: A researcher desires to compare two outcomes for the same observational

unit when that unit is exposed or not exposed to an intervention, yet can observe only one outcome

at any given time (Holland 1986). Difference-in-differences (DID) is the standard, and most widely

applied, econometric approach for measuring the average treatment effect (ATE) using panel data.

An essential assumption is that the outcomes of the treated and non-treated units follow parallel

paths over time, in the absence of any treatment. Violation of this ‘parallel lines’ assumption leads

to biased DID estimates (Donald and Lang 2007; Bertrand, Duflo and Mullainathan 2009). In this

paper we develop and implement a complementary DID estimator, the augmented DID, that is

easy to apply and yields robust estimates of the ATE when the essential parallel lines assumption

is violated.

The essence of the augmented DID and our contribution can be better understood via the following motivating example. Consider the case of the ‘digital first’ eyewear brand Warby Parker, which began life as WarbyParker.com in 2010 and has subsequently opened showrooms throughout the United States.1 The rationale for showrooms is that since eyewear is a tactile product, some customers may wish to touch, feel, or try the product before buying. Naturally, management would like to assess whether the ‘treatment’, i.e., the opening of a showroom in a specific market, impacted overall sales relative to control markets which did not contain showrooms. Warby Parker opened a showroom in Boston on September 22, 2011 for the purpose of promoting its online sales in Boston. We use Warby Parker’s sales data before and after the opening date to examine the effect of the showroom opening on online eyeglasses sales in Columbus. We show that while the standard DID should not be used to estimate the ATE because the ‘parallel lines’ assumption is violated, our augmented DID fits the data well and gives a reasonable ATE estimate.

1Warby Parker is widely regarded as the exemplar Digitally Native Vertical Brand (DNVB), a company that initially bypasses wholesale distribution and goes direct to consumers online, but subsequently opens offline sales channels. Other notable examples from the digital economy include DollarShaveClub.com (acquired in July 2016 by Unilever for $1 billion), Casper.com (mattresses), and Harrys.com (razors).

Methodologically, the augmented DID ATE is identified using the correlation between outcomes

from the treated and control units. We impose no requirement that the sample paths of treated

and non-treated units are parallel. Furthermore, we show in simulations and in real data that the

augmented DID is robust to the selection of different control units.

Our contribution is threefold. First, we propose a new DID estimator that is robust to violation of the key ‘parallel lines’ assumption as well as to alternative selection schemes for non-treated units; moreover, it is easy to implement. Second, we allow for panels with stationary data, unit-root non-stationary data, and linear or non-linear trended data, and we show that our ATE estimator is consistent under quite general data generating processes. Third, for inference, we propose a simple bootstrap method, and simulations show that it works well for various types of data, including stationary and non-stationary data generating processes. Note that the inference theory in our paper is not covered by that of Hsiao, Ching and Wan (2012) or Li and Bell (2017). Li and Bell derived the asymptotic distribution of an average treatment effects estimator proposed by Hsiao, Ching and Wan (2012) under the stationary data assumption. In addition, we use an empirical data set to demonstrate that when the ‘parallel paths’ assumption is violated and the DID method becomes invalid, the augmented DID method delivers a reasonable ATE estimate, supporting our argument that the augmented DID method extends the reach of the popular DID method to a wider range of data.

Augmented DID is practically useful and easily implemented, yet our method works best under

the following data conditions. Since we focus on a separate estimate of ATE for each specific

treatment unit over the post-treatment period, our method requires a relatively long time series

whereas standard DID can estimate ATEs with a large number of treated and control units with a

short time series. Therefore, our augmented DID should be viewed as complementary to standard

DID.

The remainder of the paper is organized as follows. In Section 2 we provide more background on DID and give detailed estimation steps for the DID ATE estimator as well as for our new ATE estimator. Section 3 analyzes consistency of the ATE estimator under various data types. We propose a simple bootstrap method for inference in Section 4. Section 5 reports simulation results to examine the finite sample performance of our estimator. In Section 6, we present an application. Section 7 concludes the paper. Appendix A provides the relevant derivations and proofs.


2 Estimation of ATEs

In this section, we discuss how to implement the DID and our augmented DID methods as well as

limitations of each method.

2.1 DID Mechanics and Implementation

Let $y^1_{it}$ and $y^0_{it}$ denote unit $i$’s outcome in period $t$ with and without treatment, respectively. The treatment intervention effect for the $i$th observational unit at time $t$ is defined as

$$\Delta_{it} = y^1_{it} - y^0_{it}. \tag{2.1}$$

However, we can observe either $y^0_{it}$ or $y^1_{it}$, but never both. Thus, the observed data are of the form

$$y_{it} = d_{it}\, y^1_{it} + (1 - d_{it})\, y^0_{it}, \tag{2.2}$$

where $d_{it} = 1$ if the $i$th unit receives a treatment at time $t$, and $d_{it} = 0$ otherwise.

In our application, we compute the ATE for each treatment market separately. Therefore, we only need to consider the case where there is one treatment market (i.e., a market where the firm opens a showroom). Without loss of generality, we assume that only the first unit receives a treatment at time $T_1 + 1$, while the remaining $(N-1)$ units do not receive any treatment throughout the sample period. Therefore, for the treatment unit, $y_{1t} = y^0_{1t}$ for $t = 1, \dots, T_1$, and $y_{1t} = y^1_{1t}$ for $t \ge T_1 + 1$. The $(N-1)$ units that do not receive any treatment serve as the control group. We use $y_{jt}$ for $j = 2, \dots, N$ and $t = 1, \dots, T$ to denote control units’ outcomes. We need to estimate $y^0_{1t}$ for $t \ge T_1 + 1$ in order to estimate the ATE. Let $\hat{y}^0_{1t}$ be a generic estimator of $y^0_{1t}$. Then the treatment effect at time $t$ can be estimated by $\hat{\Delta}_{1t} = y_{1t} - \hat{y}^0_{1t}$ ($t = T_1 + 1, \dots, T$), and the average treatment effect $\Delta_1 = E(y^1_{it} - y^0_{it})$ is estimated by

$$\hat{\Delta}_1 = \frac{1}{T_2} \sum_{t=T_1+1}^{T} \hat{\Delta}_{1t}, \tag{2.3}$$

where $T_2 = T - T_1$ is the post-treatment sample size.

Here we would like to emphasize that since there is only one unit (unit 1) that receives the

treatment, the ATE estimator is obtained by averaging over the post-treatment periods t = T1 +

1, ..., T (time series averaging) for unit 1. This differs from the usual DID method in which one

often has a larger number of units receiving treatments and the average is usually computed over

many treatment units (cross sectional averaging). Of course, if one wants to calculate the ATE

over all treated units, one can first calculate ATE for individual treated units and then average

over the treated units.


The difference in average outcomes after and before the treatment date for the treatment market (market 1) can be computed by

$$D_{Tr} = \frac{1}{T_2} \sum_{t=T_1+1}^{T} y_{1t} - \frac{1}{T_1} \sum_{t=1}^{T_1} y_{1t}. \tag{2.4}$$

The difference in outcomes for the $(N-1)$ control markets after and before $T_1$ is computed by

$$D_{Co} = \frac{1}{N-1} \sum_{j=2}^{N} \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} y_{jt} - \frac{1}{T_1} \sum_{t=1}^{T_1} y_{jt} \right]. \tag{2.5}$$

The difference-in-differences estimate for the average treatment effect is:

$$\mathrm{ATE}_{1,DID} = D_{Tr} - D_{Co} = \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} y_{1t} - \frac{1}{T_1} \sum_{t=1}^{T_1} y_{1t} \right] - \frac{1}{N-1} \sum_{j=2}^{N} \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} y_{jt} - \frac{1}{T_1} \sum_{t=1}^{T_1} y_{jt} \right]. \tag{2.6}$$

Under the ‘parallel lines’ assumption, it is easy to see that $\mathrm{ATE}_{1,DID}$ defined in (2.6) is a consistent (with large $T_1$ and $T_2$) estimator of the ATE for the treated unit.

It is also possible to use a regression method to estimate the ATE. Define the treatment group dummy and the post-treatment time period dummy as follows: $TG_i = 1$ if unit $i$ is a treatment market, and $0$ otherwise (we have $TG_1 = 1$ and $TG_j = 0$ for $j = 2, \dots, N$), and $AT_t = 1$ if $t \ge T_1 + 1$ and $AT_t = 0$ otherwise. Then the ATE estimator shown in (2.6) is identical to the least squares estimator of $\beta_4$ in the following regression model

$$y_{it} = \beta_1 + \beta_2\, TG_i + \beta_3\, AT_t + \beta_4\, (TG_i)(AT_t) + u_{it}, \quad i = 1, \dots, N;\ t = 1, \dots, T. \tag{2.7}$$

To see that $\beta_4$ indeed yields the same ATE estimate as in (2.6), we obtain from (2.7) that

$$\mathrm{ATE}_{1,DID} = D_{Tr} - D_{Co} = [(\beta_1 + \beta_2 + \beta_3 + \beta_4) - (\beta_1 + \beta_2)] - [(\beta_1 + \beta_3) - \beta_1] = \beta_4. \tag{2.8}$$
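The equivalence between (2.6) and the coefficient on the interaction term in (2.7) can be checked numerically. The following is a minimal sketch, not the authors' code: the panel is simulated, and the function names `did_by_means` and `did_by_regression`, the panel dimensions, and the effect size of 2.0 are all illustrative assumptions.

```python
import numpy as np

def did_by_means(y, T1):
    """DID estimate as in (2.6). y has shape (N, T); row 0 is the treated unit."""
    change = y[:, T1:].mean(axis=1) - y[:, :T1].mean(axis=1)  # post minus pre, per unit
    return change[0] - change[1:].mean()                      # D_Tr - D_Co

def did_by_regression(y, T1):
    """Least squares coefficient on TG*AT in regression (2.7)."""
    N, T = y.shape
    tg = np.zeros((N, T))
    tg[0, :] = 1.0            # treatment-group dummy TG_i
    at = np.zeros((N, T))
    at[:, T1:] = 1.0          # post-treatment dummy AT_t
    X = np.column_stack([np.ones(N * T), tg.ravel(), at.ravel(), (tg * at).ravel()])
    beta, *_ = np.linalg.lstsq(X, y.ravel(), rcond=None)
    return beta[3]

# Hypothetical simulated panel: 6 units, 40 pre- and 20 post-treatment periods.
rng = np.random.default_rng(0)
N, T1, T2 = 6, 40, 20
y = rng.normal(size=(N, T1 + T2))
y[0, T1:] += 2.0              # illustrative treatment effect on unit 1

ate_means = did_by_means(y, T1)
ate_reg = did_by_regression(y, T1)
print(ate_means, ate_reg)     # the two numbers agree up to floating-point error
```

The agreement relies on the panel being balanced: every unit is observed in all $T$ periods, so the fitted value of each of the four group-by-period cells in (2.7) equals that cell's sample mean.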

The intuition behind the DID method is that, if $y_{jt}$, $j = 1, \dots, N$, are random draws from a homogeneous population, then the before and after treatment period change in $y_{c,t} = \frac{1}{N-1} \sum_{j=2}^{N} y_{jt}$ may mimic the change in $y_{1t}$ well in the absence of treatment. Specifically, the DID method uses $y_{c,t}$ plus a constant (an intercept term, say $\delta_1$) to approximate $y_{1t}$, i.e., it uses $\delta_1 + y_{c,t}$ to approximate $y_{1t}$. One chooses $\delta_1$ so that $y_{1t}$ has the same sample mean as $\delta_1 + y_{c,t}$ over the pre-treatment period. Thus, one estimates $\delta_1$ using the pre-treatment data by

$$\hat{\delta}_1 = \bar{y}_1 - \bar{y}_c = \frac{1}{T_1} \sum_{t=1}^{T_1} y_{1t} - \frac{1}{T_1} \sum_{t=1}^{T_1} \frac{1}{N-1} \sum_{j=2}^{N} y_{jt}, \tag{2.9}$$

where $\hat{\delta}_1$ is the least squares estimator of $\delta_1$ in $y_{1t} - y_{c,t} = \delta_1 + \mathrm{error}_t$. Therefore, the DID in-sample fit and the out-of-sample counterfactual estimate are computed by

$$\hat{y}^0_{DID,1t} = \hat{\delta}_1 + \frac{1}{N-1} \sum_{j=2}^{N} y_{jt}, \quad t = 1, \dots, T_1, T_1 + 1, \dots, T, \tag{2.10}$$

where $\hat{\delta}_1$ is given in (2.9). For $t = 1, \dots, T_1$, (2.10) gives the in-sample fitted curve; and for $t = T_1 + 1, \dots, T$, (2.10) gives the out-of-sample counterfactual estimated curve.

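To verify that (2.10) indeed gives the correct counterfactual estimate of $y^0_{1t}$, using (2.3) and (2.10) we obtain that

$$\begin{aligned}
\hat{\Delta}_{1,DID} &= \frac{1}{T_2} \sum_{t=T_1+1}^{T} \left[ y_{1t} - \hat{y}^0_{DID,1t} \right] \\
&= \frac{1}{T_2} \sum_{t=T_1+1}^{T} \left[ y_{1t} - \hat{\delta}_1 - \frac{1}{N-1} \sum_{j=2}^{N} y_{jt} \right] \\
&= \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} y_{1t} - \frac{1}{T_1} \sum_{t=1}^{T_1} y_{1t} \right] - \frac{1}{N-1} \sum_{j=2}^{N} \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} y_{jt} - \frac{1}{T_1} \sum_{t=1}^{T_1} y_{jt} \right],
\end{aligned} \tag{2.11}$$

which identically equals $\mathrm{ATE}_{1,DID}$, defined in (2.6), as it should. This verifies that (2.10) is the correct formula for predicting the counterfactual outcome $y^0_{1t}$ for $t = T_1 + 1, \dots, T$.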
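The identity between (2.11) and (2.6) can also be confirmed on data. The sketch below builds the counterfactual path (2.10) and averages the post-treatment gap; the random-walk panel and the name `did_counterfactual` are assumptions made for illustration only.

```python
import numpy as np

def did_counterfactual(y, T1):
    """Fitted and counterfactual path (2.10) for the treated unit (row 0 of y)."""
    y_c = y[1:].mean(axis=0)                         # control average y_{c,t}
    delta1_hat = y[0, :T1].mean() - y_c[:T1].mean()  # intercept estimate (2.9)
    return delta1_hat + y_c

# Hypothetical panel of 5 random walks, 50 pre- and 25 post-treatment periods.
rng = np.random.default_rng(1)
T1, T2 = 50, 25
y = rng.normal(size=(5, T1 + T2)).cumsum(axis=1)

y0_hat = did_counterfactual(y, T1)
ate_path = (y[0, T1:] - y0_hat[T1:]).mean()          # average gap, as in (2.11)

change = y[:, T1:].mean(axis=1) - y[:, :T1].mean(axis=1)
ate_direct = change[0] - change[1:].mean()           # direct formula (2.6)

pre_gap = (y[0, :T1] - y0_hat[:T1]).mean()           # zero by construction of (2.9)
print(ate_path, ate_direct, pre_gap)
```

By construction the pre-treatment residuals average to zero, which is the sense in which the intercept in (2.9) matches the two sample means over the pre-treatment period.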

2.2 Factor Model Motivation

Similar to Hsiao, Ching and Wan (2012) and Li and Bell (2017), we motivate our method using a

factor model. The main idea is that there are some common factors that drive all units although

we allow for the common factors to affect different units in different ways. For example, in our

application, common factors that affect Warby Parker’s sales (outcome) in markets could include

media coverage of the company, national advertising, and general economic conditions. However,

a given factor may have a greater effect in some markets versus others. In the model, this is taken

care of by allowing the coefficients of each factor to vary by market.

Following prior research (e.g., Forni and Reichlin (1998), Gregory and Head (1999), Hsiao, Ching and Wan (2012)), the factor model for the pre-treatment period is:

$$y^0_{it} = \alpha_i + b_i' f_t + u_{it}, \quad i = 1, \dots, N;\ t = 1, \dots, T_1, \tag{2.12}$$

where $\alpha_i$ is unit $i$’s individual specific intercept, $b_i$ is a $K \times 1$ vector of coefficients (factor loadings), $f_t$ is a $K \times 1$ vector of unobservable factors common to treatment and control units, and $u_{it}$ is the idiosyncratic error term.

Thus, the treatment unit’s outcome, y1t, and the average control units’ outcomes, yc,t, are

correlated through these common factors. The correlation between the treatment unit, y1t, and

the control units, yc,t, is what we explore to create the counterfactual in the post-treatment period

(what the outcome in treatment unit would have been had there not been an intervention). This

is because we assume that had there not been an intervention, the correlation structure between

the treatment and control units would remain the same as in the pretreatment period. In fact,

this is our identification assumption. In the next section we show that under this assumption

we can consistently estimate the counterfactual outcome for the treated unit; after creating the

counterfactual, we can then estimate ATE for the intervention.
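To make the role of the common factors concrete, the following sketch simulates untreated outcomes from a factor structure of the form (2.12) and reports the correlation between unit 1 and the control average. Every numerical choice here (number of factors, loading range, noise scale) and the function name `simulate_factor_panel` are hypothetical and are not taken from the paper's simulation design.

```python
import numpy as np

def simulate_factor_panel(N, T, K, rng, noise_sd=0.5):
    """Draw untreated outcomes y0[i, t] = alpha_i + b_i' f_t + u_it, as in (2.12)."""
    alpha = rng.normal(size=N)                # unit-specific intercepts
    b = rng.uniform(0.5, 2.0, size=(N, K))    # heterogeneous factor loadings
    f = rng.normal(size=(T, K))               # common factors (stationary here)
    u = noise_sd * rng.normal(size=(N, T))    # idiosyncratic errors
    return alpha[:, None] + b @ f.T + u

rng = np.random.default_rng(2)
y0 = simulate_factor_panel(N=10, T=200, K=2, rng=rng)

y1 = y0[0]                  # unit 1 (treated unit, before any treatment)
yc = y0[1:].mean(axis=0)    # average of the control units
corr = np.corrcoef(y1, yc)[0, 1]
print(corr)
```

Because all units load on the same factors, the two series co-move even though their loadings differ, and it is this co-movement that the identifying assumption requires to stay stable after the treatment date.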

2.3 Our Augmented DID method

The DID method elaborated above rests on the assumption that the sample paths of $y_{1t}$ and $y_{c,t} = \frac{1}{N-1} \sum_{j=2}^{N} y_{jt}$ are parallel in the absence of treatment. However, when the treated unit is not randomly selected and there is heterogeneity across the treatment and control groups, this assumption is unlikely to hold in practice.

In this section we propose an augmented DID method which is robust to non-parallel paths of

the treated and the control units. We derive an estimator to address the question of interest to the

practitioner: “Was the intervention a success?” (e.g., did demand go up, costs decline, and so on).

We are able to derive an estimator that is consistent and delivers valid inference.

To accomplish this, we introduce a simple modification to the DID method. We multiply $y_{c,t}$ by a scale factor (a constant, $\delta_2$), which leads to the following regression model

$$y_{1t} = \delta_1 + \delta_2\, y_{c,t} + e_{1t}, \quad t = 1, \dots, T_1. \tag{2.13}$$

Let $\hat{\delta} = (\hat{\delta}_1, \hat{\delta}_2)'$ denote the least squares estimator of $\delta = (\delta_1, \delta_2)'$. Then, we estimate $y^0_{1t}$ by

$$\hat{y}^0_{1t} = \hat{\delta}_1 + \hat{\delta}_2\, y_{c,t} = x_t' \hat{\delta}, \quad t = 1, \dots, T_1, T_1 + 1, \dots, T, \tag{2.14}$$

where $x_t = (1, y_{c,t})'$. Note that for $t \le T_1$, $\hat{y}^0_{1t}$ is the in-sample fitted value of $y^0_{1t}$; and for $t \ge T_1 + 1$, $\hat{y}^0_{1t}$ is the out-of-sample estimator for the counterfactual outcome $y^0_{1t}$. Therefore, using our augmented DID method the ATE estimate is given by

$$\hat{\Delta}_1 = \frac{1}{T_2} \sum_{t=T_1+1}^{T} \left( y_{1t} - \hat{y}^0_{1t} \right), \tag{2.15}$$

where $\hat{y}^0_{1t}$ is defined in (2.14).

Note that the ATE estimator $\hat{\Delta}_1$ defined in (2.15) nests the DID estimator as a special case. To see this, note that if we replace $\hat{\delta}_2$ by 1, then the estimator of $\delta_1$ will be the same as defined in (2.9). It follows that (2.15) becomes identical to (2.11), the DID estimator of $\Delta_1$.
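The estimation steps (2.13) to (2.15) amount to one pre-treatment regression followed by an out-of-sample prediction. Below is a minimal sketch under assumed data: the control average follows a deterministic upward trend and the treated unit loads on it with slope 2, so the paths are not parallel. The function names, the data design, and the true effect of 3.0 are illustrative assumptions, not the paper's simulation setup.

```python
import numpy as np

def adid_ate(y1, yc, T1):
    """Augmented DID: fit (2.13) on t <= T1, predict with (2.14), average the gap (2.15)."""
    X = np.column_stack([np.ones(len(yc)), yc])                   # x_t = (1, y_{c,t})'
    delta_hat, *_ = np.linalg.lstsq(X[:T1], y1[:T1], rcond=None)  # pre-treatment fit
    y0_hat = X @ delta_hat                                        # fitted + counterfactual path
    return (y1[T1:] - y0_hat[T1:]).mean(), delta_hat

def did_ate(y1, yc, T1):
    """Standard DID, i.e. the special case with the slope fixed at 1."""
    delta1_hat = (y1[:T1] - yc[:T1]).mean()                       # intercept (2.9)
    return (y1[T1:] - delta1_hat - yc[T1:]).mean()

# Hypothetical data with non-parallel paths.
rng = np.random.default_rng(3)
T1, T2, true_ate = 80, 20, 3.0
yc = np.linspace(0.0, 10.0, T1 + T2)
y1 = 1.0 + 2.0 * yc + 0.1 * rng.normal(size=T1 + T2)
y1[T1:] += true_ate

ate_adid, delta_hat = adid_ate(y1, yc, T1)
ate_did = did_ate(y1, yc, T1)
print(ate_adid, ate_did, delta_hat)
```

In this design the DID estimate absorbs the unmodeled slope difference into the treatment effect and is far from 3.0, while the augmented estimate stays close to it. The comparison is specific to this constructed example and should not be read as a general performance claim.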

Here we give a heuristic argument showing that $\hat{\Delta}_1$ is indeed a consistent estimator of $\Delta_1$. We only need to show that $T_2^{-1} \sum_{t=T_1+1}^{T} \hat{y}^0_{1t}$ consistently estimates $T_2^{-1} \sum_{t=T_1+1}^{T} y^0_{1t}$. Notice that if the correlation between $y_{1t}$ and $y_{c,t}$ is stable in the absence of treatment (our identifying assumption), then $y_{1t}$ would be generated by $y^0_{1t} = \delta_1 + \delta_2\, y_{c,t} + e_{1t}$ in the absence of treatment for $t = T_1 + 1, \dots, T$. Given that $\hat{\delta} = (\hat{\delta}_1, \hat{\delta}_2)'$ is a consistent estimator of $\delta = (\delta_1, \delta_2)'$ because $T_1$ is large, and that $e_{1t}$ has zero mean so that the average of $e_{1t}$ over the post-treatment period is small if $T_2$ is large, the average of $\hat{y}^0_{1t}$ and the average of $y^0_{1t}$ (over the post-treatment period) are close to each other and become closer the greater $T_1$ and $T_2$ are. Hence, $\hat{\Delta}_1$ is a consistent estimator of $\Delta_1$. Note that the consistency argument does not require the data to be stationary; we allow for unit-root non-stationary data as well as linear and non-linear trending data. However, when the data are non-stationary with a nonlinear trend, asymptotic analysis of the ATE estimator is complex, and it can be difficult to simulate asymptotic distributions of ATE estimators when the exact data generating process is unknown. In this paper we propose a simple bootstrap method to approximate the ATE estimator’s distribution. Simulations reported in Section 5 show that our proposed bootstrap method works well for stationary and non-stationary data generating processes.

Before ending this section we would like to mention that, if the treated unit’s outcome differs from the average of the control units by a constant (they have different intercepts) and by a linear trend, then adding a linear trend component to a DID model will lead to a good ATE estimate. In this case one can use the following regression model to estimate the ATE:

$$y_{1t} = \delta_1 + \delta_2\, t + y_{c,t} + e_{1t}, \quad t = 1, \dots, T_1. \tag{2.16}$$

Let $\hat{\delta}_1$ and $\hat{\delta}_2$ denote the least squares estimates of $\delta_1$ and $\delta_2$ based on $y_{1t} - y_{c,t} = \delta_1 + \delta_2\, t + e_{1t}$ ($t = 1, \dots, T_1$). Then one can estimate the counterfactual $y^0_{1t}$ by $\hat{y}^0_{1t} = \hat{\delta}_1 + \hat{\delta}_2\, t + y_{c,t}$ for $t = T_1 + 1, \dots, T$. However, for many empirical data sets such as the one considered in this paper (see Section 5), model (2.16) may not give a good in-sample fit and therefore may not lead to an accurate ATE estimate. The reason is that the outcomes often exhibit nonlinear and/or stochastic trends. Adding a linear trend variable may not solve the non-parallel paths problem.
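For comparison, the linear-trend variant (2.16) can be sketched in a few lines. The example assumes the favorable case in which the gap between the treated unit and the control average is exactly a constant plus a linear trend; the function name `trend_did_ate` and all numbers are hypothetical.

```python
import numpy as np

def trend_did_ate(y1, yc, T1):
    """DID with a linear trend (2.16): regress y1 - yc on (1, t) over the pre-treatment period."""
    T = len(y1)
    t = np.arange(1, T + 1, dtype=float)
    Z = np.column_stack([np.ones(T), t])
    d_hat, *_ = np.linalg.lstsq(Z[:T1], (y1 - yc)[:T1], rcond=None)
    y0_hat = Z @ d_hat + yc                   # counterfactual path for all t
    return (y1[T1:] - y0_hat[T1:]).mean()

# Hypothetical data: the gap y1 - yc is 0.5 + 0.1 t, plus an effect of 2.0 after T1.
rng = np.random.default_rng(4)
T1, T2 = 60, 20
t = np.arange(1, T1 + T2 + 1, dtype=float)
yc = rng.normal(size=T1 + T2)
y1 = 0.5 + 0.1 * t + yc
y1[T1:] += 2.0

ate_trend = trend_did_ate(y1, yc, T1)
print(ate_trend)
```

The estimate recovers the assumed effect here only because the gap is exactly linear in $t$. When the gap follows a nonlinear or stochastic trend, extrapolating a fitted line beyond $T_1$ can be badly off, which is the limitation described in the text.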


3 Consistency Analysis of the Augmented DID ATE estimator

3.1 Stationary data

Our ATE estimator is consistent. That is, the ATE estimated using our method converges to the average change in outcome due to an intervention as long as the pre-treatment and post-treatment time periods are large. In our empirical application, we have $T_1 = 83$ and $T_2 = 27$. Simulations reported in Section 5 suggest that this sample size combination is large enough for our augmented DID method to work well. Below we present the model for the treated unit before and after the intervention and use linear projections to show the consistency of our proposed estimator.

Before the intervention, the outcome for the treated unit in the pre-treatment period is given by

$$y^0_{1t} = x_t' \delta + e_{1t}, \quad t = 1, \dots, T_1, \tag{2.17}$$

where $x_t = (1, y_{c,t})'$ and $\delta = (\delta_1, \delta_2)'$.

After an intervention occurs at time $t = T_1 + 1$, the outcome for the treated unit in the post-treatment period is given by

$$y^1_{1t} = x_t' \delta + \Delta_{1t} + e_{1t}, \quad t = T_1 + 1, \dots, T. \tag{2.18}$$

As previously discussed, we use a factor model as motivation and exploit the correlation between the treated unit’s outcome ($y_{1t}$) and the average of the control units’ outcomes ($y_{c,t}$) to estimate the counterfactual outcome for the treated unit. At first look, one may get the impression that the error $e_{1t}$ in (2.17) is correlated with the regressor $x_t$. However, the interpretation of (2.17) is that $x_t' \delta$ is the linear projection of $y_{1t}$ onto the space spanned by $x_t$. Hence, $e_{1t}$ is the linear projection error, which is orthogonal to $x_t$ by definition. To see this clearly, suppose that $e_{1t}$ is not orthogonal to $x_t$. Then we project $e_{1t}$ onto the linear space of $x_t$ to get the linear projection $L(e_{1t} | x_t) = x_t' \gamma$. We can re-write equation (2.17) as $y^0_{1t} = x_t' \beta + \varepsilon_{1t}$, where the new coefficients are related to the original coefficients by $\beta = \delta + \gamma$, and the new error term is $\varepsilon_{1t} = e_{1t} - L(e_{1t} | x_t)$. Then we have $L(\varepsilon_{1t} | x_t) = 0$ by definition. Therefore, the least squares method consistently estimates the coefficients $\beta$, and the error term $\varepsilon_{1t}$ is orthogonal to $x_t$.

To facilitate the exposition, we will slightly abuse notation and continue to use $\delta$ instead of $\beta$ when discussing our estimation model. That is, we will use model (2.17) and interpret $\delta$ as the linear projection coefficients, i.e., $L(y^0_{1t} | x_t) = x_t' \delta$. Hence, $L(e_{1t} | x_t) = 0$.
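The projection argument can be illustrated numerically. In the sketch below the "structural" error is deliberately built to be correlated with the regressor, yet the least squares residual is orthogonal to $x_t$ and the fitted slope converges to the projection coefficient $\beta = \delta + \gamma$ rather than to $\delta$. The specific numbers ($\delta = (1, 2)'$, $\gamma = (0, 0.8)'$) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T1 = 500
yc = rng.normal(size=T1)

# "Structural" error correlated with the regressor: L(e | x) = 0.8 * yc, so gamma = (0, 0.8)'.
e = 0.8 * yc + rng.normal(size=T1)
y1 = 1.0 + 2.0 * yc + e                       # delta = (1, 2)'

X = np.column_stack([np.ones(T1), yc])        # x_t = (1, y_{c,t})'
beta_hat, *_ = np.linalg.lstsq(X, y1, rcond=None)
eps_hat = y1 - X @ beta_hat                   # least squares (projection) residual

moment = X.T @ eps_hat                        # sample analogue of E[x_t * eps_1t]
print(beta_hat, moment)                       # slope near 2.8; moments numerically zero
```

This is why the counterfactual prediction only needs the projection coefficients: the method forecasts $y^0_{1t}$ from $y_{c,t}$ and does not require a structural interpretation of $\delta$.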

In this paper we focus on the case where the treatment effect $\Delta_{1t}$ is a weakly dependent stationary process with finite fourth moment. Then, because $\hat{\delta}$ is a consistent estimator of $\delta$ for large $T_1$, consistency of the ATE estimator follows if both $T_1$ and $T_2$ are large:

$$\begin{aligned}
\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1) &= \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} \left( y_{1t} - \hat{y}^0_{1t} - \Delta_1 \right) \\
&= -\left[ \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} x_t' \right] (\hat{\delta} - \delta) + \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} v_{1t} \\
&= A_1 + A_2,
\end{aligned} \tag{2.19}$$

where $A_1 = -\left[ \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} x_t' \right] (\hat{\delta} - \delta)$, $A_2 = \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} v_{1t}$, and $v_{1t} = \Delta_{1t} - \Delta_1 + e_{1t}$ is a zero mean weakly dependent process.

Li and Bell (2017) show that when the common factors $f_t$ and the idiosyncratic errors $u_{it}$ (defined in (2.12)) are weakly dependent processes (so that $(y_{c,t}, e_{1t})$ defined in (2.13) is a weakly dependent process) and under quite general conditions, $A_1 \overset{d}{\to} N(0, \Sigma_1)$ and $A_2 \overset{d}{\to} N(0, \Sigma_2)$, where $\Sigma_1$ and $\Sigma_2$ are finite, positive constants. Moreover, $A_1$ and $A_2$ are asymptotically independent of each other. Hence, Li and Bell (2017) establish that $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1) = A_1 + A_2 \overset{d}{\to} N(0, \Sigma_1 + \Sigma_2)$. They further provide consistent estimators for $\Sigma_1$ and $\Sigma_2$. Thus, one can use their result to conduct inference for the ATE estimator provided that the data are weakly dependent stationary processes.

Proposition 2.1 Under some regularity conditions (see Li and Bell (2017)), including that the data are weakly dependent stationary processes and that $T_2 / T_1$ is bounded as $T_1, T_2 \to \infty$, we have

(i): $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1) \overset{d}{\to} N(0, \Sigma)$,

where $\Sigma = \Sigma_1 + \Sigma_2$; the definitions of $\Sigma_1$ and $\Sigma_2$ are given in Appendix A. Let $\hat{\Sigma}$ be a consistent estimator of $\Sigma$ ($\hat{\Sigma}$ is given in Appendix A). Then we have

(ii):

$$\frac{\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1)}{\sqrt{\hat{\Sigma}}} = \frac{\hat{\Delta}_1 - \Delta_1}{\sqrt{\hat{\Sigma} / T_2}} \overset{d}{\to} N(0, 1). \tag{2.20}$$

We will give explicit expressions for $\Sigma_j$ and $\hat{\Sigma}_j$, $j = 1, 2$, in Appendix A.

Proposition 2.1 shows that our ATE estimator is $\sqrt{T_2}$-consistent. When analyzing $\hat{\Delta}_1 - \Delta_1 = (A_1 + A_2)/\sqrt{T_2}$, Li and Bell (2017) show that, for the two terms $A_1$ and $A_2$ defined in (2.19), $A_1/\sqrt{T_2}$ has order $T_1^{-1/2}$ and $A_2/\sqrt{T_2}$ has order $T_2^{-1/2}$. The assumption that $T_2 / T_1$ is bounded then implies that $\hat{\Delta}_1 - \Delta_1 = O_p(T_1^{-1/2} + T_2^{-1/2}) = O_p(T_2^{-1/2})$.

3.2 Non-stationary and nonlinear trending data

In this section we discuss the existing consistency results for the ATE estimator with non-stationary data and point out difficulties in inference based on asymptotic analysis. We first discuss the trend-stationary data case. It can be easily shown that, when the common factors contain a time trend variable $t$, $y_{jt}$ also contains a time trend component. Thus, we can write $y_{jt} = c_j t + \tilde{y}_{jt}$ for $j = 1, \dots, N$ and $t = 1, \dots, T_1$, where $c_j$ is a constant and $\tilde{y}_{jt}$ is a stationary process. Li (2017) shows that $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1)$ has an asymptotic normal distribution. This result is as expected because it is well established that, for trend-stationary data, least squares estimators are still asymptotically normally distributed, although the estimated time trend coefficient has a different convergence rate from the other estimated coefficients.

When the common factors contain a unit-root factor, say, $f_{jt} = f_{j,t-1} + \eta_{jt}$ for some $j \in \{1, \dots, K\}$, where $\eta_{jt}$ is a zero mean, weakly dependent stationary error term, it can be shown that $y_{jt}$ follows a drift-less unit-root process for all $j = 1, \dots, N$. Bai, Li and Ouyang (2017) show that in this case $\hat{\Delta}_1 - \Delta_1 = O_p(T_1^{-1/2} + T_2^{-1/2})$, so that $\hat{\Delta}_1$ still consistently estimates the ATE provided that $T_1$ and $T_2$ are large. However, Bai et al. did not provide limiting distribution theory for $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1)$. In this case the asymptotic distribution of $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1)$ is complicated because it is well established that the limiting distribution of $T_1(\hat{\delta}_2 - \delta_2)$ is characterized by integrals of Brownian motions.2 This implies that the asymptotic distribution of $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1)$ is non-normal and non-standard. Usually some simulation method is needed to approximate the asymptotic distribution of $\sqrt{T_2}\,(\hat{\Delta}_1 - \Delta_1)$. A problem with using a simulation method in our context is that a specific simulation method is needed for each specific data generating process. For example, the simulation method for a drift-less unit-root process differs from the one designed for a unit-root process with drift, or for a trend-stationary process. More importantly, researchers may not be sure what the true data generating process is, because the data may exhibit some complicated non-linear trending pattern. Without knowledge of the true data generating process, conventional simulation methods may not be feasible in practice.

One can establish consistency of our augmented DID ATE estimator under more general data

generating processes such as unit-root with drift, nonlinear time trend processes, etc. However,

establishing asymptotic distribution theory for the augmented DID estimator under general con-

ditions is challenging and inferences based on asymptotic theory may not be easy to implement.

Instead of pursuing asymptotic analysis of our estimator under general conditions, we suggest using

a simple and easy-to-implement bootstrap procedure to approximate the finite sample distribution

of $\sqrt{T_2}(\hat\Delta_1 - \Delta_1)$. One important advantage of our bootstrap method is that one does not need to

know the true data generating process when using the bootstrap method for inference. Simulations

reported in Section 5 show that our proposed bootstrap method works well for a variety of data

generating processes, including stationary, unit-root non-stationary, linear and non-linear trending

processes. In Appendix A we justify the validity of the bootstrap method with stationary data. We

2 Recall that $\delta_2$ is the coefficient of $y_{c,t}$, which is a drift-less unit-root process in this case. $y_{1t}$ and $y_{c,t}$ are cointegrated, and the asymptotic distribution of $(\sqrt{T_1}(\hat\delta_1 - \delta_1), T_1(\hat\delta_2 - \delta_2))'$ is well established and characterized by integrals of Brownian motions; see Hayashi (2000).


conjecture that our proposed bootstrap method can also be justified to approximate the asymptotic

distribution of $\sqrt{T_2}(\hat\Delta_1 - \Delta_1)$ when the data are non-stationary or have linear or nonlinear trends. However, a rigorous proof of this conjecture is technically challenging and we leave it as a future research topic. Simulation results in Section 5 support our conjecture.

4 A Simple Bootstrap Method

For expositional simplicity, we assume that $v_{1t} = \Delta_{1t} - \Delta_1 + e_{1t}$ is serially uncorrelated with zero mean and variance $\sigma_v^2$, where $e_{1t}$ is defined in (2.17) and is also serially uncorrelated with zero mean and variance $\sigma_e^2$. These assumptions can be easily tested given data. We show that in our empirical data these assumptions are quite reasonable. When they are violated, a more sophisticated approach such as the block bootstrap can replace the simple re-sampling method discussed below.

We now discuss how to use a bootstrap method to compute confidence intervals (CI) for ∆1.

The CI can be used to test a null hypothesis such as $H_0: \Delta_1 = \Delta_{1,0}$ against a one-sided or two-sided alternative hypothesis $H_1: \Delta_1 > \Delta_{1,0}$ (or $\Delta_1 < \Delta_{1,0}$, or $\Delta_1 \neq \Delta_{1,0}$), where $\Delta_{1,0}$ is a known constant (e.g., $\Delta_{1,0} = 0$).

For the pre-treatment period, our regression model is

$$y_{1t} = x_t'\delta + e_{1t}, \quad t = 1, ..., T_1, \qquad (2.21)$$

where $x_t = (1, y_{c,t})'$ and $\delta = (\delta_1, \delta_2)'$.

4.1 Bootstrap steps for the A-DID method

We describe the bootstrap steps below.

Step 1: Let $\hat\delta$ be the least squares estimator of $\delta$ based on model (2.21) using the pre-treatment data. We compute $\hat e_{1t} = y_{1t} - x_t'\hat\delta$ for $t = 1, ..., T_1$ and estimate $\sigma_e^2$ by $\hat\sigma_e^2 = T_1^{-1}\sum_{t=1}^{T_1} \hat e_{1t}^2$. We generate $e^*_{1t}$ as random draws from $N(0, \hat\sigma_e^2)$ for $t = 1, ..., T_1$ and generate the bootstrap outcome of the pre-treatment sample: $y^*_{1t,pre} = x_t'\hat\delta + e^*_{1t}$ for $t = 1, ..., T_1$. Then we use the bootstrap sample $\{y^*_{1t,pre}, x_t'\}_{t=1}^{T_1}$ to estimate $\delta$ based on $y^*_{1t,pre} = x_t'\delta + error_t$ for $t = 1, ..., T_1$. Let $\hat\delta^*$ denote the resulting least squares estimator of $\delta$.3 We generate the bootstrap sample post-treatment counterfactual outcome by $y^{0*}_{1t} = x_t'\hat\delta^*$ for $t = T_1 + 1, ..., T$.

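Step 2: Let $\hat v_{1t} = \hat\Delta_{1t} - \hat\Delta_1$ for $t = T_1 + 1, ..., T$, where $\hat\Delta_{1t} = y_{1t} - \hat y^0_{1t}$ and $\hat\Delta_1 = T_2^{-1}\sum_{t=T_1+1}^{T} \hat\Delta_{1t}$, and compute $\hat\sigma_v^2 = T_2^{-1}\sum_{t=T_1+1}^{T} \hat v_{1t}^2$. Then we generate $v^*_{1t}$ as random draws from $N(0, \hat\sigma_v^2)$ for $t = T_1 + 1, ..., T$ and generate the post-treatment bootstrap sample $y^*_{1t,post} = x_t'\hat\delta + v^*_{1t}$ for $t = T_1 + 1, ..., T$.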

3 If we use $X_1$ and $Y^*_{1,pre}$ to denote the $T_1 \times 2$ and $T_1 \times 1$ matrices with their typical rows given by $x_t' = (1, y_{c,t})$ and $y^*_{1t,pre}$, respectively, then $\hat\delta^* = (X_1'X_1)^{-1}X_1'Y^*_{1,pre}$.


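The bootstrap statistic is given by

$$S^* = \sqrt{T_2}\,\frac{1}{T_2}\sum_{t=T_1+1}^{T}\left(y^*_{1t,post} - y^{0*}_{1t}\right) = -\sqrt{T_2}\,\frac{1}{T_2}\sum_{t=T_1+1}^{T} x_t'(\hat\delta^* - \hat\delta) + \frac{1}{\sqrt{T_2}}\sum_{t=T_1+1}^{T} v^*_{1t}. \qquad (2.22)$$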

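Step 3: We repeat steps 1 and 2 $B$ times. The empirical distribution of $\{S^*_j\}_{j=1}^{B}$ is used to construct the CI for $\Delta_1$. We sort the bootstrap statistics such that $S^*_{(1)} \le S^*_{(2)} \le ... \le S^*_{(B)}$. Then the $100(1-\alpha)\%$ CI for $\Delta_1$ is given by

$$\left[\hat\Delta_1 - T_2^{-1/2} S^*_{(1-\alpha/2)},\ \hat\Delta_1 - T_2^{-1/2} S^*_{(\alpha/2)}\right], \qquad (2.23)$$

where, with a slight abuse of notation, $S^*_{(q)}$ for $q \in (0, 1)$ denotes the $q$-th empirical quantile of the sorted bootstrap statistics.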

In the Appendix we prove the validity of the above bootstrap method for the weakly dependent

stationary process. We conjecture that this bootstrap method is valid for a much wider range

of data generating processes including stationary, non-stationary, linear and nonlinear trending

processes. Simulation results reported in Section 5 strongly support our conjecture.
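Steps 1 to 3 can be sketched in Python as follows. This is a minimal illustration under the assumptions stated above (serially uncorrelated errors and Gaussian draws), not a reference implementation. The function name `adid_bootstrap_ci`, its arguments, and the use of NumPy's empirical quantiles in place of the sorted statistics $S^*_{(\alpha/2)}$ and $S^*_{(1-\alpha/2)}$ are illustrative choices that do not appear in the text. Here `y1` is the treated unit's outcome series, `yc` is the control series $y_{c,t}$, and `T1` is the number of pre-treatment periods.

```python
import numpy as np

def adid_bootstrap_ci(y1, yc, T1, alpha=0.05, B=1000, seed=0):
    """Hypothetical helper: A-DID point estimate and bootstrap CI (Steps 1-3)."""
    rng = np.random.default_rng(seed)
    y1 = np.asarray(y1, dtype=float)
    yc = np.asarray(yc, dtype=float)
    T2 = len(y1) - T1
    X = np.column_stack([np.ones(len(yc)), yc])   # rows x_t' = (1, y_ct)
    X_pre, X_post = X[:T1], X[T1:]

    # Step 1 (data part): pre-treatment least squares fit and residual scale.
    delta_hat, *_ = np.linalg.lstsq(X_pre, y1[:T1], rcond=None)
    e_hat = y1[:T1] - X_pre @ delta_hat
    sigma_e = np.sqrt(np.mean(e_hat ** 2))

    # Step 2 (data part): estimated effects and their centered scale.
    d_hat = y1[T1:] - X_post @ delta_hat
    ate_hat = d_hat.mean()
    sigma_v = np.sqrt(np.mean((d_hat - ate_hat) ** 2))

    S = np.empty(B)
    for b in range(B):
        # Bootstrap pre-treatment sample and re-estimated coefficients.
        y_pre_star = X_pre @ delta_hat + rng.normal(0.0, sigma_e, T1)
        delta_star, *_ = np.linalg.lstsq(X_pre, y_pre_star, rcond=None)
        y0_star = X_post @ delta_star
        # Bootstrap post-treatment sample and the statistic in (2.22).
        y_post_star = X_post @ delta_hat + rng.normal(0.0, sigma_v, T2)
        S[b] = np.sqrt(T2) * np.mean(y_post_star - y0_star)

    # Step 3: interval (2.23) from the empirical quantiles of S*.
    q_lo, q_hi = np.quantile(S, [alpha / 2, 1 - alpha / 2])
    ci = (ate_hat - q_hi / np.sqrt(T2), ate_hat - q_lo / np.sqrt(T2))
    return ate_hat, ci
```

The design fixes the regressors $x_t$ across bootstrap replications and redraws only the errors, which is the feature the text relies on when arguing that the serial correlation pattern of the controls is preserved.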

Remark 2.1 In Appendix A we show that the above bootstrap procedure works in delivering correct

confidence intervals for ∆1 for the stationary data case. Simulation results reported in Section 5

suggest that our proposed bootstrap method works for a much wider range of data generating processes, including non-stationary, linear and non-linear trending processes. This result is quite intuitive because our bootstrap procedure holds the data $\{x_t\}_{t=1}^{T}$ fixed, so that the serial correlation pattern, whether stationary or non-stationary, is preserved in the bootstrap sample. Under the assumption that the idiosyncratic error $e_{1t}$ is a zero mean and serially uncorrelated process, we use the residual bootstrap method to generate the bootstrap error $e^*_{1t}$, which mimics $e_{1t}$ well. Hence, $y^*_{1t}$ mimics $y_{1t}$ well. Therefore, the bootstrap method is expected to work whether $\{x_t\}_{t=1}^{T}$ is stationary or not.

4.2 Bootstrap steps for the standard DID method

In this subsection we give bootstrap steps for the DID method.

Step 1: Let $\hat\delta_{1,DID}$ be the least squares estimator of $\delta_1$ based on $y_{1t} - y_{c,t} = \delta_1 + error_{1t}$ for $t = 1, ..., T_1$. We compute $\hat e_{1t,DID} = y_{1t} - y_{c,t} - \hat\delta_{1,DID}$ for $t = 1, ..., T_1$ and estimate $\sigma^2_{DID}$ by $\hat\sigma^2_{DID} = T_1^{-1}\sum_{t=1}^{T_1} \hat e^2_{1t,DID}$. We generate $e^*_{1t,DID}$ as random draws from $N(0, \hat\sigma^2_{DID})$ and generate the bootstrap outcome $y^*_{1t,pre,DID} = y_{c,t} + \hat\delta_{1,DID} + e^*_{1t,DID}$ for $t = 1, ..., T_1$. Then we use the bootstrap sample $\{y^*_{1t,pre,DID}, y_{c,t}\}_{t=1}^{T_1}$ to estimate $\delta_1$ based on $y^*_{1t,pre,DID} - y_{c,t} = \delta_1 + error_t$ for $t = 1, ..., T_1$. Let $\hat\delta^*_{1,DID}$ denote the resulting least squares estimator of $\delta_1$. We generate the bootstrap counterfactual outcome by $y^{0*}_{1t,DID} = y_{c,t} + \hat\delta^*_{1,DID}$ for $t = T_1 + 1, ..., T$.


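Step 2: Let $\hat v_{1t,DID} = \hat\Delta_{1t,DID} - \hat\Delta_{1,DID}$ and $\hat\sigma^2_{v,DID} = T_2^{-1}\sum_{t=T_1+1}^{T} \hat v^2_{1t,DID}$. Then we generate $v^*_{1t,DID}$ as random draws from $N(0, \hat\sigma^2_{v,DID})$ for $t = T_1 + 1, ..., T$ and generate the post-treatment bootstrap sample $y^*_{1t,post,DID} = y_{c,t} + \hat\delta_{1,DID} + v^*_{1t,DID}$ for $t = T_1 + 1, ..., T$. The bootstrap statistic is given by

$$S^*_{DID} = \sqrt{T_2}\,\frac{1}{T_2}\sum_{t=T_1+1}^{T}\left(y^*_{1t,post,DID} - y^{0*}_{1t,DID}\right) = -\sqrt{T_2}\,(\hat\delta^*_{1,DID} - \hat\delta_{1,DID}) + \frac{1}{\sqrt{T_2}}\sum_{t=T_1+1}^{T} v^*_{1t,DID}.$$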

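Step 3: We repeat steps 1 and 2 $B$ times. The empirical distribution of $\{S^*_{DID,j}\}_{j=1}^{B}$ is used to construct the CI for $\Delta_1$. We sort $S^*_{DID,(1)} \le S^*_{DID,(2)} \le ... \le S^*_{DID,(B)}$ and the $100(1-\alpha)\%$ CI for $\Delta_1$ is given by

$$\left[\hat\Delta_{1,DID} - T_2^{-1/2} S^*_{DID,(1-\alpha/2)},\ \hat\Delta_{1,DID} - T_2^{-1/2} S^*_{DID,(\alpha/2)}\right]. \qquad (2.24)$$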
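For the DID case the regression has only an intercept, so the least squares estimator is a sample mean and the bootstrap can be vectorized. The sketch below is a hedged illustration of the three DID steps; the function name `did_bootstrap_ci` and its arguments are hypothetical, and the empirical quantiles again stand in for the sorted statistics.

```python
import numpy as np

def did_bootstrap_ci(y1, yc, T1, alpha=0.05, B=1000, seed=0):
    """Hypothetical helper: DID point estimate and bootstrap CI."""
    rng = np.random.default_rng(seed)
    gap = np.asarray(y1, dtype=float) - np.asarray(yc, dtype=float)
    T2 = len(gap) - T1

    # Step 1 (data part): the intercept estimate is the mean pre-treatment gap.
    delta1_hat = gap[:T1].mean()
    sigma_e = np.sqrt(np.mean((gap[:T1] - delta1_hat) ** 2))

    # Step 2 (data part): estimated effects and their centered scale.
    d_hat = gap[T1:] - delta1_hat
    ate_hat = d_hat.mean()
    sigma_v = np.sqrt(np.mean((d_hat - ate_hat) ** 2))

    # Bootstrap intercepts: mean of (delta1_hat + e*) over the pre-period.
    delta1_star = delta1_hat + rng.normal(0.0, sigma_e, (B, T1)).mean(axis=1)
    v_star_sum = rng.normal(0.0, sigma_v, (B, T2)).sum(axis=1)
    S = -np.sqrt(T2) * (delta1_star - delta1_hat) + v_star_sum / np.sqrt(T2)

    # Step 3: interval (2.24) from the empirical quantiles of S*_DID.
    q_lo, q_hi = np.quantile(S, [alpha / 2, 1 - alpha / 2])
    ci = (ate_hat - q_hi / np.sqrt(T2), ate_hat - q_lo / np.sqrt(T2))
    return ate_hat, ci
```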

5 Simulations

In this section we show that when the treated and the control units are random draws from a com-

mon distribution, our proposed bootstrap method works well for both the DID and our augmented

DID ATE estimators in the sense that it provides good confidence intervals’ coverage with moderate

sample sizes. When the treated and the control units are draws from heterogenous distributions,

the bootstrap method still works well for our augmented DID ATE estimator, supporting our ar-

gument that the augmented DID method is robust to violation of the parallel line assumption. In

contrast, for DID, the bootstrap does not provide correct confidence interval coverage because the

‘parallel line’ assumption does not hold when draws are from heterogenous distributions. Thus, the DID ATE estimator is biased, invalidating bootstrap inference for the DID method.

5.1 A three factor data generating process

We use a similar factor model data generating process as in Hsiao, Ching and Wan (2012) who

considered stationary data cases. Here, we consider stationary factors, as well as non-stationary

factors such as unit root non-stationary, trend-stationary and non-linear trend processes. The first

three factors are stationary and the last three factors are non-stationary. We generate outcome

variables using a three-factor model where the three factors are chosen from the following pool:

$f_{1t} = 0.8 f_{1,t-1} + \varepsilon_{1t}$,

$f_{2t} = -0.6 f_{2,t-1} + \varepsilon_{2t} + 0.8\varepsilon_{2,t-1}$,

$f_{3t} = \varepsilon_{3t} + 0.9\varepsilon_{3,t-1} + 0.4\varepsilon_{3,t-2}$,

$f_{4t} = f_{4,t-1} + \varepsilon_{4t}$,

$f_{5t} = (0.2 + 0.5\xi_t)t + \varepsilon_{5t}$,

$f_{6t} = \sqrt{t} + \varepsilon_{6t} + 0.9\varepsilon_{6,t-1} + 0.4\varepsilon_{6,t-2}$,

where $\varepsilon_{it}$ is iid $N(0, 1)$, $\xi_t$ is iid Uniform[0, 1], and $\varepsilon_{it}$ and $\xi_s$ are independent of each other for all $i$, $t$ and $s$.

Note that the first three factors are the same as those used in Hsiao, Ching and Wan (2012) and Du and Zhang (2015). They are stationary AR(1), ARMA(1,1) and MA(2) processes, respectively. The last three factors are non-stationary: $f_{4t}$ is a (drift-less) unit root process, $f_{5t}$ has a linear trend component with a random coefficient uniformly distributed between 0.2 and 0.7, and $f_{6t}$ has a non-linear (square-root) trend.
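The factor pool can be simulated along the following lines. This is a sketch only: the function name `simulate_factors`, the burn-in length used to initialize the stationary factors, and the zero starting value of the unit-root factor are assumptions that the text does not specify.

```python
import numpy as np

def simulate_factors(T, seed=0, burn=100):
    """Hypothetical helper: returns a 6 x T array whose rows are f_1t, ..., f_6t."""
    rng = np.random.default_rng(seed)
    n = T + burn                       # burn >= 2 is needed for the MA(2) lags
    eps = rng.normal(size=(6, n))      # eps_it ~ iid N(0, 1)
    xi = rng.uniform(0.0, 1.0, size=T) # xi_t ~ iid Uniform[0, 1]
    t = np.arange(1, T + 1)

    # AR(1) and ARMA(1,1) recursions, started at zero and burned in.
    f1 = np.zeros(n)
    f2 = np.zeros(n)
    for s in range(1, n):
        f1[s] = 0.8 * f1[s - 1] + eps[0, s]
        f2[s] = -0.6 * f2[s - 1] + eps[1, s] + 0.8 * eps[1, s - 1]

    def ma2(e):
        # e_t + 0.9 e_{t-1} + 0.4 e_{t-2}; output has length len(e) - 2.
        return e[2:] + 0.9 * e[1:-1] + 0.4 * e[:-2]

    f3 = ma2(eps[2])[-T:]
    f4 = np.cumsum(eps[3, -T:])                   # drift-less unit root, f_40 = 0
    f5 = (0.2 + 0.5 * xi) * t + eps[4, -T:]       # random-coefficient linear trend
    f6 = np.sqrt(t) + ma2(eps[5])[-T:]            # square-root trend plus MA(2)
    return np.vstack([f1[-T:], f2[-T:], f3, f4, f5, f6])
```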

Let y0t denote the N × 1 vector of outcome variables without treatment. It is generated via a

three-factor model

$$y^0_t = a + B f_t + u_t, \quad t = 1, ..., T, \qquad (2.25)$$

where $y^0_t = (y^0_{1t}, y^0_{2t}, ..., y^0_{Nt})'$, $a = (a_1, a_2, ..., a_N)'$ and $u_t = (u_{1t}, u_{2t}, ..., u_{Nt})'$ are all $N \times 1$ vectors, $B = (b_1, b_2, ..., b_N)'$ is the $N \times 3$ loading matrix where $b_j$ is a $3 \times 1$ loading vector for unit $j$, and $f_t$ is a $3 \times 1$ vector of common factors. We choose $(a_1, a_2, ..., a_N) = (1, 1, ..., 1)$, and we consider two cases for $u_{jt}$: (a) $u_{jt}$ is iid $N(0, 1)$; (b) $u_{jt}$ is iid Uniform$[-\sqrt{3}, \sqrt{3}]$ (zero mean and unit variance). The simulation results are virtually identical for these two cases. Hence, we only report the results of case (a). For the factor loadings $B = (b_1, b_2, ..., b_N)'$, we consider two situations: (i) $b_i = (1, 1, 1)'$ for all $i = 1, ..., N$; (ii) $b_1 = (2, 2, 2)'$ and $b_i = (1, 1, 1)'$ for $i = 2, ..., N$. For case (i), the treated and the control units’ outcomes are draws from a common distribution, while for case (ii), they are drawn from heterogenous distributions. We use a setup similar to our Warby Parker empirical data by choosing $T_1 = 83$, $T_2 = 27$, $T = T_1 + T_2 = 110$ and $N = 11$ (with 10 control units). The number

of simulations is 1000, and within each simulation, we generate 1000 bootstrap samples to obtain

50%, 80%, 90% and 95% confidence intervals for ∆1.

We consider the following combinations of factors (DGPs):

DGP1 : The three factors are f1t, f2t and f3t ,

DGP2 : The three factors are f2t, f3t and f4t ,

DGP3 : The three factors are f3t, f4t and f5t ,

DGP4 : The three factors are f4t, f5t and f6t .
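Given a $6 \times T$ factor pool, the four DGPs and the two loading cases of model (2.25) can be generated as in the sketch below. The helper names `select_dgp` and `generate_outcomes` are hypothetical; the sketch covers only error case (a), with unit 1 (row 0) as the treated unit.

```python
import numpy as np

def select_dgp(F, k):
    """Rows of the 6 x T factor pool used by DGP k (k = 1, ..., 4)."""
    return F[k - 1:k + 2]

def generate_outcomes(F3, heterogeneous=False, N=11, seed=0):
    """Untreated outcomes y0_t = a + B f_t + u_t as an N x T array."""
    rng = np.random.default_rng(seed)
    T = F3.shape[1]
    a = np.ones((N, 1))            # intercepts a_j = 1
    B = np.ones((N, 3))            # case (i): all loadings equal to 1
    if heterogeneous:
        B[0] = 2.0                 # case (ii): treated unit loads with 2
    u = rng.normal(size=(N, T))    # case (a): u_jt ~ iid N(0, 1)
    return a + B @ F3 + u
```

With the same seed, cases (i) and (ii) differ only in the treated unit's row, and by exactly the sum of the three factors, which makes the source of the non-parallel paths explicit.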

Note that for DGP1, the data are weakly dependent stationary processes, so that the statistic $\sqrt{T_2}(\hat\Delta_1 - \Delta_1)$ is asymptotically normally distributed (Li and Bell (2017)). For DGP2, one of the factors follows a non-stationary unit root process, so the distribution of $\sqrt{T_2}(\hat\Delta_1 - \Delta_1)$ is complicated as

it involves two terms, one is characterized by integrals of Brownian motions, the other is asymp-

totically normally distributed. The ATE estimator’s distributions for DGP3 and DGP4 are more

complicated as outcome variables are generated by non-stationary unit root and nonlinear factors.

For the treatment effects, we generate ∆1t as follows:

$$\Delta_{1t} = \alpha_0\left[\frac{e^{z_t}}{1 + e^{z_t}} + 1\right], \quad t = T_1 + 1, ..., T, \qquad (2.26)$$

where $z_t = 0.5 z_{t-1} + \eta_t$ and $\eta_t$ is iid $N(0, 0.5^2)$.

Note that for the post-treatment period, $y^1_{1t} = y^0_{1t} + \Delta_{1t}$, where $y^0_{1t}$ is generated as described earlier and $\Delta_{1t}$ is generated by (2.26). There is a zero or a positive treatment effect, corresponding to $\alpha_0 = 0$ and $\alpha_0 > 0$, respectively.
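Because the logistic term in (2.26) lies strictly between 0 and 1, each $\Delta_{1t}$ lies strictly between $\alpha_0$ and $2\alpha_0$ when $\alpha_0 > 0$. A minimal sketch of the generator follows; the function name `treatment_effects` and the burn-in used to initialize $z_t$ are assumptions, since the text does not state an initial condition.

```python
import numpy as np

def treatment_effects(T2, alpha0, seed=0, burn=100):
    """Hypothetical helper: draws Delta_1t for the T2 post-treatment periods."""
    rng = np.random.default_rng(seed)
    n = T2 + burn
    eta = rng.normal(0.0, 0.5, size=n)   # eta_t ~ iid N(0, 0.5^2)
    z = np.zeros(n)
    for s in range(1, n):
        z[s] = 0.5 * z[s - 1] + eta[s]   # AR(1) process z_t
    z = z[-T2:]
    return alpha0 * (np.exp(z) / (1.0 + np.exp(z)) + 1.0)
```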

Table 1 reports simulation results for the case of zero treatment (α0 = 0), and Table 2 gives

results for the positive treatment effects case (α0 > 0). All factors in DGP1 are stationary, DGP2

has a drift-less unit-root factor, DGP3 and DGP4 contain a unit-root factor and nonlinear trending

factors. First, the results for the DID and our augmented DID are very similar. The estimated coverage probabilities are close to their nominal values, showing that, for moderate sample sizes ($T_1 = 83$, $T_2 = 27$), our proposed bootstrap methods work well for both the DID and the augmented DID ATE estimators. Second, the results are not sensitive to the type of factors, whether stationary, unit-root non-stationary, or nonlinear trend non-stationary, supporting our conjecture that the bootstrap method works for different types of data generating processes. Third, the results are not sensitive to the level of treatment effects, whether zero or positive. This implies that the ATE is estimated accurately for different levels of treatment (zero or positive) and for different types of data generating processes.

Next, we consider the case that the treated unit and the 10 control units are drawn from dif-

ferent distributions so that their sample paths are not parallel even in the absence of treatment.

Specifically, we choose loading coefficients to be 2 for the treated unit, and 1 for all the control

units. The results for α0 = 0 (zero treatment effects) and α0 = 1 (a positive treatment effects) are

given in Tables 3 and 4, respectively. First, the results for the augmented DID method are very similar to those reported in Tables 1 and 2. This confirms that the augmented DID ATE estimation method is robust to violation of the parallel-sample-path condition. Next, for the DID method, the coverage probabilities are far from nominal levels: the 95% interval covers $\Delta_1$ in only about 70% of replications under DGP1 and never under DGP3 and DGP4. Moreover, the estimation results deteriorate as the non-stationarity or the non-linear trending of the data becomes more pronounced. This is because when the treated and control units are drawn from different distributions, the DID method becomes biased. Non-stationarity and/or linear and non-linear trending data amplify the estimation bias, so that confidence intervals based on biased ATE estimates fail to cover $\Delta_1$.

Table 1: Coverage probabilities (α0 = 0, loadings from a common distribution)

                               A-DID                           DID
CI                      95%    90%    80%    50%      95%    90%    80%    50%
DGP1 (f1t, f2t, f3t)   0.933  0.888  0.787  0.512    0.939  0.880  0.785  0.509
DGP2 (f2t, f3t, f4t)   0.940  0.881  0.784  0.476    0.937  0.884  0.783  0.496
DGP3 (f3t, f4t, f5t)   0.950  0.889  0.794  0.488    0.941  0.890  0.784  0.510
DGP4 (f4t, f5t, f6t)   0.946  0.891  0.794  0.512    0.945  0.896  0.801  0.506

Table 2: Coverage probabilities (α0 = 1, loadings from a common distribution)

                               A-DID                           DID
CI                      95%    90%    80%    50%      95%    90%    80%    50%
DGP1 (f1t, f2t, f3t)   0.925  0.876  0.772  0.476    0.920  0.872  0.791  0.489
DGP2 (f2t, f3t, f4t)   0.924  0.867  0.767  0.471    0.927  0.888  0.782  0.472
DGP3 (f3t, f4t, f5t)   0.941  0.888  0.786  0.466    0.932  0.882  0.781  0.468
DGP4 (f4t, f5t, f6t)   0.933  0.885  0.794  0.505    0.945  0.890  0.773  0.510

6 Empirical Application

6.1 Institutional Setting and The Data

WarbyParker.com is an online-first eyewear brand providing high quality eyeglasses at a lower price

point ($95) than that typically encountered in the North American consumer market (upwards of

$300). The data we analyze include all transactions that occurred in Boston during a 110-week period

from February 2010 to March 2012 and the variables made available to us are: customer ID,

customer ZIP code, item sold, and channel through which sales were made. Warby Parker operates


Table 3: Coverage probabilities (α0 = 0, loadings from heterogenous distributions)

                               A-DID                           DID
CI                      95%    90%    80%    50%      95%    90%    80%    50%
DGP1 (f1t, f2t, f3t)   0.933  0.869  0.767  0.465    0.702  0.616  0.505  0.288
DGP2 (f2t, f3t, f4t)   0.921  0.854  0.765  0.478    0.186  0.168  0.132  0.072
DGP3 (f3t, f4t, f5t)   0.939  0.900  0.801  0.497    0      0      0      0
DGP4 (f4t, f5t, f6t)   0.937  0.882  0.787  0.485    0      0      0      0

Table 4: Coverage probabilities (α0 = 1, loadings from heterogenous distributions)

                               A-DID                           DID
CI                      95%    90%    80%    50%      95%    90%    80%    50%
DGP1 (f1t, f2t, f3t)   0.944  0.876  0.781  0.480    0.678  0.592  0.496  0.271
DGP2 (f2t, f3t, f4t)   0.933  0.873  0.762  0.457    0.170  0.149  0.110  0.072
DGP3 (f3t, f4t, f5t)   0.931  0.871  0.771  0.469    0      0      0      0
DGP4 (f4t, f5t, f6t)   0.946  0.901  0.812  0.514    0      0      0      0


three channels: online, a sampling channel called ‘Home Try-On’ (HTO), and showrooms.4 We

aggregate data to the market-week level and the relevant dependent variable is total sales in dollars.

Our focus, of course, is whether the introduction of showrooms (locations in which customers can experience the entire product line and then purchase online) has any impact on total demand in those markets. Warby Parker opened a showroom in Boston on September 22, 2011. We want to examine the showroom treatment effect on its online eyeglass sales in Boston. For the control group,

we choose the 10 largest markets by population without showrooms: Chicago, Houston, Portland,

Seattle, Denver, Dallas, San Diego, Washington, Atlanta and Minneapolis.

We use this empirical data as an illustrative example to show that when the ‘parallel-lines’

assumption is violated and hence, the DID method should not be used to estimate ATE, our

augmented DID can still be used to successfully estimate ATE of the showroom opening.

6.2 Showroom ATE for Boston

We apply our proposed augmented DID method to estimate showroom opening effects for Boston, where a misapplication of the DID method would lead to a dramatically biased estimate. We first explain why a misapplication of DID to Boston’s sales data would severely underestimate the ATE for Boston. Next, we demonstrate that our augmented DID method substantially and effectively reduces the estimation bias.

In Figure 1, the solid line represents Boston’s weekly sales and the dashed line is Boston’s predicted sales by the DID method based on the average weekly sales of the control cities. DID requires treatment and controls to follow parallel paths in the absence of treatment, and, as seen in Figure 1, this assumption is violated: for the early period (e.g., $t \le 20$) the fitted curve is below the observed data, and for the later period ($t \ge 50$) the fitted curve is above the data. This implies that the fitted curve has a steeper slope than that of the real data, which results in an overestimate of the counterfactual sales and an underestimate of the ATE.

If the sales of Boston and the average sales of the 10 control cities differ only by an intercept and by a linear trend, then model (2.16) can provide a good ATE estimate. Figure 2 shows estimation results based on model (2.16). We see that model (2.16) fits the data better than the DID method. However, it still fits the data poorly in some ranges, say for $50 \le t \le 60$.

Before we present the ATE estimation results using our augmented DID method, we examine the problems of applying the DID method to Boston’s data. The first problem with the DID method when

applied to Boston’s data is that the average of the ten control markets’ sales and Boston’s sales

exhibit very different upward trends; in other words, the sales of the treatment market (Boston) and

4 In all three channels sales are fulfilled by shipping to a location of the customer’s choosing. In the HTO channel customers select five frames (without lenses) for a 5-day trial period, and then return them to the firm. HTO orders are said to convert to sales if an HTO customer buys product within two months of initiating the HTO. Showrooms are displays of the Warby Parker product line inside a third party retailer.


Figure 1: Boston DID ATE Estimation (10 largest cities as controls). [Plot omitted. Series: Actual, Counterfactual.]

sales of the control group do not follow parallel paths in the absence of treatment, as required by

DID. Consequently, the simple average of these control markets’ sales does not accurately predict

Boston’s counterfactual sales, which leads to an inaccurate estimate for the ATE.

The second problem (illustrated previously via our simulations), is that the DID ATE can be

sensitive to the selection of control units, i.e., the choice of which units to include in the control

group from among all potential units in the population of units that did not receive the treatment.

We will demonstrate this in the robustness check section 6.4 where we use a different group of

control cities to estimate Boston’s showroom opening ATE. We show there that the DID method should not be used to estimate the ATE due to the violation of the parallel path assumption. Hence, merely

changing the pool of non-treated units may not be sufficient to overcome the problem.

Fortunately, the augmented DID method overcomes both problems and continues to provide a

consistent estimate of the ATE. From Figure 3, we observe that the fitted curve traces the in-sample data well. Following the showroom opening in September 2011, Boston’s weekly online eyeglass sales went up by $945 (about 65%) from November 2011 to March 2012. As demonstrated

analytically in section 2 and via simulation in section 5, we exploit the correlation between the

treated and the control units’ outcomes to consistently estimate the ATE. That is, we do not require

that treatment and control outcomes follow parallel sample paths. Recall that to implement our

augmented DID estimator, we refrain from using a simple sample average of the control markets’

sales to approximate the treatment market’s sales sample path, and instead multiply the sample

average of the control markets’ sales by a scaling constant. This latter scaling constant is determined

by the least squares method.
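The point estimate just described amounts to a two-parameter least squares fit on the pre-treatment data followed by an average of the post-treatment prediction errors. The sketch below illustrates this; the function name `adid_ate` and the array layout (controls stacked by row) are hypothetical.

```python
import numpy as np

def adid_ate(y_treated, Y_controls, T1):
    """Hypothetical helper: A-DID ATE and (intercept, scaling constant)."""
    y = np.asarray(y_treated, dtype=float)
    Yc = np.asarray(Y_controls, dtype=float)    # shape (n_controls, T)
    yc = Yc.mean(axis=0)                        # simple average of the controls
    X = np.column_stack([np.ones(len(yc)), yc]) # rows (1, y_ct)
    # Intercept and scaling constant from the pre-treatment periods only.
    delta_hat, *_ = np.linalg.lstsq(X[:T1], y[:T1], rcond=None)
    counterfactual = X[T1:] @ delta_hat         # predicted untreated outcome
    return float(np.mean(y[T1:] - counterfactual)), delta_hat
```

Standard DID corresponds to fixing the scaling constant at one and estimating only the intercept, which is why it fails when the treated series moves more (or less) than one-for-one with the control average.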


Figure 2: Boston ATE Estimation based on (2.16) (10 largest cities as controls). [Plot omitted. Series: Actual, Counterfactual. Annotations: ATE = 714.28; ATE(%) = 42.4%; R square = 0.386.]

6.3 Inference

Using 10 control cities, our point ATE estimate is $\hat\Delta_1 = 944.73$, or a 65% increase in weekly sales due to the showroom opening. We use the bootstrap method discussed in Section 4 to construct confidence intervals for $\Delta_1$. Our simple bootstrap method (random draws of $e^*_{1t}$ from $N(0, \hat\sigma_e^2)$ and of $v^*_{1t}$ from $N(0, \hat\sigma_v^2)$) relies on the assumptions that $e_{1t}$ and $v_{1t} = \Delta_{1t} - \Delta_1 + e_{1t}$ are serially uncorrelated. Otherwise, a more sophisticated approach such as the block bootstrap may be needed. To justify our simple bootstrap method, we need to examine whether $e_{1t}$, for $t = 1, ..., T_1$, and $v_{1t}$, for $t = T_1 + 1, ..., T$, are serially correlated or not. Our test statistics are based on the sample analogues of $\rho_e = E(e_{1t}e_{1,t-1})$ and $\rho_v = E(v_{1t}v_{1,t-1})$. The p-values of these tests are 0.439 and

0.848, respectively. Therefore, we do not reject the null hypotheses at any conventional levels

that e1t and v1t are serially uncorrelated, justifying the use of the simple bootstrap method to our

empirical data.
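One simple way to build a test from the sample analogue of $E(e_{1t}e_{1,t-1})$ is a first-order autocorrelation z-test, sketched below. The exact form of the statistic used for the reported p-values is not given in the text, so this particular normalization, the asymptotic $N(0, 1)$ reference distribution, and the function name `lag1_autocorr_test` are assumptions made for illustration.

```python
import math
import numpy as np

def lag1_autocorr_test(resid):
    """Hypothetical helper: returns (r, z, two-sided p-value).

    r is the lag-1 sample autocorrelation of a zero-mean residual series;
    under serially uncorrelated errors sqrt(n) * r is approximately N(0, 1).
    """
    e = np.asarray(resid, dtype=float)
    n = e.size
    r = float(np.sum(e[1:] * e[:-1]) / np.sum(e ** 2))
    z = math.sqrt(n) * r
    p_value = math.erfc(abs(z) / math.sqrt(2.0))   # equals 2 * (1 - Phi(|z|))
    return r, z, p_value
```

The function would be applied separately to the pre-treatment residuals $\hat e_{1t}$ and to the centered post-treatment effects $\hat v_{1t}$; a small p-value in either case would point toward a block bootstrap instead of the simple resampling scheme.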

Using 10,000 bootstrap samples, the estimated 80%, 90%, 95% and 99% confidence inter-

vals are CI80% = [757.64, 1126.1]; CI90% = [704.82, 1179.0]; CI95% = [663.85, 1224.9]; CI99% =

[578.29, 1311.0], respectively. The lower bounds of these intervals are all positive, indicating that Boston’s showroom opening significantly increases its sales.

In fact, we can also conduct a hypothesis test for any given value $\Delta_{1,0}$, not only for $\Delta_{1,0} = 0$ (testing a zero ATE). Consider the one-sided test $H_0: \Delta_1 = \Delta_{1,0}$ against $H_1: \Delta_1 > \Delta_{1,0}$. For example, if we conduct a 5% level test and choose $\Delta_{1,0} = 700$, then we obtain a p-value < 0.05 because the lower bound of the 90% confidence interval, 704.82, exceeds 700. Hence, at the 5% level, we reject $H_0$ and conclude that $\Delta_1 > 700$.


Figure 3: Boston Augmented DID ATE Estimation (10 largest cities as controls). [Plot omitted. Series: Actual, Counterfactual. Annotations: ATE-ADID = 945.63; ATE-ADID(%) = 65%; R square = 0.469.]

6.4 Robustness Checks

6.4.1 Using different cities in the control group

One important robustness check is to examine whether the ATE estimation results are sensitive to

the selection of control units. Therefore, in this section we change the composition of the control

group. Instead of using the 10 largest cities in the U.S. (by population) that did not have a showroom during our sample period, we use the 11th to the 20th largest cities in the U.S. that did not have showrooms as our control group. These cities are: Cleveland, Arlington, Phoenix, San Antonio, San Jose, Nashville, Baltimore, Milwaukee, Omaha and Miami. The estimation results are plotted in Figure 4. We can see that the ATE estimate of $944.7, or a 64.9% increase in weekly sales after the opening of a showroom in Boston, is quite close to the result of $945.6, or a 65% increase in weekly sales, that we obtained earlier using the largest 10 cities (that had no showrooms) as the control

group. This result supports our claim that the augmented DID ATE estimate is not sensitive to

the selection of control units.

For comparison, in Figure 5 we draw the fitted and predicted curves using the DID method with the 11th to the 20th largest cities as the control group. We can see that Boston’s sales are not parallel to the average sales of these 10 control cities. This empirical example shows that, even when the treated and the control units exhibit substantial heterogeneity, our A-DID method is robust to the selection of the control group and can deliver a reasonable ATE estimate.


Figure 4: Boston Augmented DID ATE Estimation (11th-20th largest cities as controls). [Plot omitted. Series: Actual, Counterfactual. Annotations: ATE-ADID = 944.73; ATE-ADID(%) = 64.9%; R square = 0.494.]

6.4.2 Out-of-sample prediction comparison

The method introduced by Hsiao, Ching and Wan (HCW 2012) and extended by Li and Bell (2017)

is also applicable to scenarios where the parallel lines assumption fails to hold. Nevertheless, those

innovations carry their own restriction; specifically, the number of control units needs to be much

smaller than the number of pre-treatment time periods. In contrast, our augmented DID provides

a parsimonious solution which works well irrespective of whether the number of control units is

small or large.

In this subsection we compare our A-DID ATE estimation result with that obtained using the HCW method. Their method first uses AICC (a small-sample-size corrected version of the AIC model selection criterion) to choose the most significant control units from a pool of the 10 largest cities. The coefficients of the selected control units are then estimated by the least squares method, and the estimated coefficients, together with the post-treatment control cities’ sales data, are used to estimate the counterfactual outcome and the resulting ATE. Applying the HCW method to

the Boston data gives an estimated ATE of $972.26 or a 68.1% increase in weekly sales after the

opening of a showroom in Boston. While these numbers are close to our augmented DID estimation

result of $944.73 or a 64.9% increase in weekly sales we obtained earlier, we would like to compare

the out-of-sample forecasting performances of the two estimation methods in order to judge which

method gives a more accurate ATE estimation result.

The difference between the HCW method and our A-DID method is that the former allows

for each control unit to have a different coefficient when fitting the in-sample data and making

out-of-sample prediction for the counterfactual outcome for the treated unit. From this angle it


Figure 5: Boston DID ATE Estimation (11th-20th largest cities as controls). [Plot omitted. Series: Actual, Counterfactual.]

seems that HCW is more general than the A-DID method because the latter imposes the restriction that all control units have the same weight (but the total weight is not restricted to sum to one). On the other hand, if the coefficients of all control units are the same (or close to the same), then the A-DID method is more efficient because it imposes this correct restriction in estimation. We compare

the out-of-sample forecast performances of our A-DID and the HCW method. We select a value

$T_0$ ($1 < T_0 < T_1$) and estimate a regression model using the pre-treatment data up to $T_0$; then we forecast the outcome $y_{1t}$ for $t = T_0 + 1, ..., T_1$. Because there is no treatment prior to $T_1$, we can compare the average squared prediction error over the period $t = T_0 + 1, ..., T_1$. Specifically, we estimate the following model (with $x_t = (1, y_{c,t})'$)

$$y_{1t} = x_t'\delta + e_{1t}, \quad t = 1, ..., T_0, \qquad (2.27)$$

by the least squares method. Let $\hat\delta_{T_0}$ denote the resulting estimator. We predict $y^0_{1t}$ by $\hat y^0_{1t,A-DID} = x_t'\hat\delta_{T_0}$. Similarly, let $x_{t,HCW} = (1, y_{2t}, ..., y_{Nt})'$; the HCW method estimates the following model

$$y_{1t} = x_{t,HCW}'\beta + \varepsilon_{1t}, \quad t = 1, ..., T_0, \qquad (2.28)$$

where $\beta = (\beta_1, ..., \beta_N)'$. Let $\hat\beta_{T_0}$ be the least squares estimator of $\beta$ using data $t = 1, ..., T_0$. We estimate the counterfactual outcome by $\hat y^0_{1t,HCW} = x_{t,HCW}'\hat\beta_{T_0}$ for $t = T_0 + 1, ..., T_1$.

We compute the prediction MSEs by

PMSEA−DID =1

T1 − T0

T1∑t=T0+1

(y1t − y01t,A−DID)2,

23

Page 25: Augmented Di erence-in-Di erences: Estimation and ......Jun 08, 2017  · 1Warby Parker is widely regarded as the exemplar Digitally Native Vertical Brand (DVNB){a company that initially

$$ \mathrm{PMSE}_{HCW} = \frac{1}{T_1 - T_0} \sum_{t=T_0+1}^{T_1} \left(y_{1t} - \hat y^0_{1t,HCW}\right)^2. $$
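The out-of-sample comparison can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the simulated data are hypothetical, and the A-DID regressor y_{c,t} is taken here to be the cross-sectional average of the control outcomes.

```python
import numpy as np

def ols_fit(X, y):
    """Least squares coefficients for y = X b + error."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def pmse_ratio(y1, Yc, T0):
    """Return (PMSE_A-DID / PMSE_HCW, PMSE_A-DID, PMSE_HCW).

    y1 : (T1,) pre-treatment outcomes of the treated unit
    Yc : (T1, N-1) pre-treatment outcomes of the control units
    Rows 0..T0-1 are used for estimation, rows T0..T1-1 for evaluation.
    Assumption: the A-DID regressor is the average of the controls.
    """
    T1 = len(y1)
    ones = np.ones((T1, 1))
    X_adid = np.hstack([ones, Yc.mean(axis=1, keepdims=True)])  # x_t = (1, y_ct)
    X_hcw = np.hstack([ones, Yc])                               # x_t,HCW
    delta_hat = ols_fit(X_adid[:T0], y1[:T0])                   # model (2.27)
    beta_hat = ols_fit(X_hcw[:T0], y1[:T0])                     # model (2.28)
    err_adid = y1[T0:] - X_adid[T0:] @ delta_hat
    err_hcw = y1[T0:] - X_hcw[T0:] @ beta_hat
    pmse_adid = np.mean(err_adid ** 2)
    pmse_hcw = np.mean(err_hcw ** 2)
    return pmse_adid / pmse_hcw, pmse_adid, pmse_hcw

# Hypothetical data: 10 controls sharing a common trending factor.
rng = np.random.default_rng(0)
T1, n_controls = 83, 10
factor = np.cumsum(rng.normal(size=T1))
Yc = factor[:, None] + rng.normal(size=(T1, n_controls))
y1 = 1.0 + 2.0 * Yc.mean(axis=1) + rng.normal(size=T1)

for T0 in (60, 65, 70, 75, 80):
    ratio, _, _ = pmse_ratio(y1, Yc, T0)
    print(T0, round(ratio, 4))
```

A ratio below one means the restricted A-DID regression forecasts better than the unrestricted HCW regression on that evaluation window. The simulated numbers are for illustration only and are unrelated to the empirical results.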

We consider five values of T0: 60, 65, 70, 75, and 80. With T1 = 83, the corresponding evaluation sample sizes are T1 − T0 = 23, 18, 13, 8, and 3. We report the PMSE ratio PMSE_A-DID / PMSE_HCW in Table 5.

Table 5: Out-of-sample Prediction MSE ratio

T0                       60      65      70      75      80
PMSE_A-DID / PMSE_HCW    0.6801  0.8209  0.8068  0.6996  0.5896

From Table 5 we see that the HCW method has a larger PMSE than the A-DID method in all cases. The A-DID method gives an 18% to 41% PMSE reduction compared with the HCW method. This empirical example shows that our A-DID ATE estimator is not only simple and easy to implement, but also compares well with a more sophisticated competing estimator.
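The reduction range quoted above follows directly from the ratios in Table 5, since a ratio r corresponds to a PMSE reduction of 100(1 − r) percent. A short check, with the ratios copied from the table and variable names chosen for illustration:

```python
# PMSE_A-DID / PMSE_HCW ratios copied from Table 5, keyed by T0.
ratios = {60: 0.6801, 65: 0.8209, 70: 0.8068, 75: 0.6996, 80: 0.5896}

# Percentage PMSE reduction of A-DID relative to HCW for each T0.
reductions = {T0: round(100 * (1 - r), 1) for T0, r in ratios.items()}

for T0, red in reductions.items():
    print(T0, red)
print(min(reductions.values()), max(reductions.values()))
```

The smallest reduction (about 18%) occurs at T0 = 65 and the largest (about 41%) at T0 = 80.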

7 Conclusions

As noted in the Introduction, management and social scientists increasingly look to evaluate causal

effects of interventions on outcomes of interest. In this pursuit of causal effects, the popularity of

the DID method and diversity of applications are due in large part to the widespread availability

of quasi-experimental data, and the ease with which DID is implemented. While this method has

much to commend it, it relies on the restrictive ‘parallel lines’ assumption; namely, that outcomes

for treated and non-treated units follow parallel sample paths in the absence of any treatment.

The more varied the circumstances for treatment and control contexts, and the greater the pool

from which control units can be drawn, the more likely it is that this critical assumption will not

hold. Our empirical setting, the opening of a physical showroom in a given location by a digital-first

brand, is a case in point, as are those for the majority of articles referenced in the Introduction and

throughout the paper. In this paper we propose a suitable method for dealing with this problem,

irrespective of whether the number of control units is small or large. Using analytical results,

simulations, and an empirical application, we show that our method is practically useful and has

desirable theoretical properties. Specifically:

• Practically Useful. We propose and implement an augmented method that allows treatment units and control units to be drawn from heterogeneous populations, provided that the outcomes of the treatment and control units have a stable correlation in the absence of treatment. In other words, the augmented DID method is robust to the selection of control units.


• Theoretically Valid. We show that our estimator is consistent and propose a simple bootstrap method for valid inference. We then use simulated data to show that the augmented DID method is robust to data drawn from heterogeneous distributions.

While our overall contribution is methodological, the results from our empirical application

contribute to the emerging literature on online-offline market interaction (e.g., Forman, Ghose, and

Goldfarb 2009; Anderson, Fong, Simester and Tucker 2010; Bell, Gallino, and Moreno 2016). We find that the showroom opening in Boston is demand accretive and significantly increases the retailer's online eyeglass sales.

Finally, we would like to emphasize that our augmented DID method is a complement to the popular DID method. With both DID and augmented DID in one's toolkit, one can estimate the ATE with a simple method for a wide range of panel data: long or short time series, large or small numbers of treated and control units, and stationary or non-stationary data processes.

References

Anderson E, Fong N, Simester D, Tucker C (2010) How Sales Taxes Affect Customer and Firm

Behavior: The Role of Search on the Internet. Journal of Marketing Research. 47, 229-239.

Ashenfelter O (1978) Estimating the Effects of Training Programs on Earnings. The Review of

Economics and Statistics. 60, 47-57.

Ashenfelter O, Card D (1985) Using the Longitudinal Structure of Earnings to Estimate the Effect of Training Programs. The Review of Economics and Statistics. 67, 648-660.

Avery J, Steenburgh T, Deighton J, Caravella M (2012) Adding Bricks to Clicks: Predicting the

Patterns of Cross-Channel Elasticities Over Time. Journal of Marketing. 76, 96-111.

Bell D, Gallino S, Moreno A (2014) How to Win in an Omnichannel World. MIT Sloan Management

Review. 58 (1), 45-53.

Bell D, Gallino S, Moreno A (2016) Offline Showrooms and Customer Migration in Omni-Channel

Retail. Management Science, forthcoming.

Bertrand M, Duflo E, Mullainathan S (2009) How Much Should We Trust Differences-in-Differences Estimates? The Quarterly Journal of Economics. 119, 249-275.

Bronnenberg B, Dube J, Gentzkow M (2012) The Evolution of Brand Preferences. The American

Economic Review. 102, 2472-2508.

Busse M, Silva-Risso J, Zettelmeyer F (2006) $1,000 Cash Back: The Pass-Through of Auto Manufacturer Promotions. The American Economic Review. 96, 1253-1270.


Chevalier J, Mayzlin D (2006) The Effect of Word of Mouth on Sales: Online Book Reviews.

Journal of Marketing Research. 48, 345-354.

Donald S, Lang K (2007) Inference with Difference-in-Differences and Other Panel Data. The

Review of Economics and Statistics. 89, 221-233.

Du Z, Zhang L (2015) Home-Purchase Restriction, Property Tax and Housing Price in China: A Counterfactual Analysis. Journal of Econometrics 188, 558-568.

Forman C, Ghose A, Goldfarb A (2009) Competition Between Local and Electronic Markets: How

the Benefit of Buying Online Depends on Where You Live. Management Science 55, 47-57.

Forni M, Reichlin L (1998) Let's Get Real: A Factor-Analytic Approach to Disaggregated Business Cycle Dynamics. Review of Economic Studies. 65, 453-473.

Goldfarb A, Tucker C (2011) Private Regulation and Online Advertising. Management Science.

57, 40-56.

Gregory A, Head A (1999) Fluctuations in Productivity, Investment, and the Current Account.

Journal of Monetary Economics. 44, 423-452.

Hayashi F (2000) Econometrics. Princeton University Press.

Holland P (1986) Statistics and Causal Inference. Journal of the American Statistical Association.

81, 945-60.

Hsiao C, Ching H, Wan S (2012) A Panel Data Approach for Program Evaluation: Measuring the

Benefits of Political and Economic Integration of Hong Kong with Mainland China. Journal of

Applied Econometrics. 27, 705-740.

Li K (2017) Estimating Average Treatment Effects using a Modified Synthetic Control Method:

Theory and Applications. Unpublished manuscript.

Li K, Bell D (2017) Estimation of the Average Treatment Effect with Panel Data: Asymptotic

Theory and Implementation. Journal of Econometrics 197, 65-75.

Miller A, Tucker C (2009) Privacy Protection and Technology Diffusion: The Case of Electronic

Medical Records. Management Science 55, 1077-1093.

Newey W, West K (1987) A Simple Positive Semi-definite Heteroskedasticity and Autocorrelation

Consistent Covariance Matrix Estimation. Econometrica 55, 703-708.

Qian Y (2008) Impacts of Entry by Counterfeiters. The Quarterly Journal of Economics 4, 1577-

1609.


Appendix A

A.1 Expressions of Σ1 and Σ2

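In this section we give expressions for $\Sigma_1$ and $\Sigma_2$, which appear in Proposition 2.1, as well as consistent estimators of them. The definitions are: $\Sigma_1 = \eta\, E(x_t)' V E(x_t)$, where $\eta = \lim_{T_1, T_2 \to \infty} T_2/T_1$ and $V$ is the asymptotic variance of $\sqrt{T_1}(\hat\delta - \delta)$; $\Sigma_2 = \lim_{T_2 \to \infty} T_2^{-1} \sum_{t=T_1+1}^{T} \sum_{s=T_1+1}^{T} E(v_{1t} v_{1s})$ is the asymptotic variance of $T_2^{-1/2} \sum_{t=T_1+1}^{T} v_{1t}$, where $v_{1t} = \Delta_{1t} - \Delta_1 + e_{1t}$. When $v_{1t}$ is serially uncorrelated, $\Sigma_2$ simplifies to $\Sigma_2 = E(v_{1t}^2) \equiv \sigma_v^2$.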

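One can consistently estimate $\Sigma_1$ by $\hat\Sigma_1 = (T_2/T_1)\, \hat E(x_t)' \hat V \hat E(x_t)$, where $\hat E(x_t) = T_2^{-1} \sum_{t=T_1+1}^{T} x_t$ and $\hat V$ is a consistent estimator of $V$, for example the Newey-West heteroskedasticity and serial correlation robust estimator of $V$. When $e_{1t}$ is serially uncorrelated and conditionally homoskedastic, $V$ simplifies to $V = \sigma_e^2 [E(x_t x_t')]^{-1}$. For $\Sigma_2$, one can use

$$ \hat\Sigma_2 = T_2^{-1} \sum_{t=T_1+1}^{T} \; \sum_{s=T_1+1,\, |s-t| \le l}^{T} \hat v_{1t} \hat v_{1s}, $$

where $l = O(T_2^{1/2})$ (e.g., Newey and West (1987)). When $v_{1t}$ is serially uncorrelated, one can estimate $\Sigma_2$ by $\hat\Sigma_2 = T_2^{-1} \sum_{t=T_1+1}^{T} \hat v_{1t}^2$, where $\hat v_{1t} = \hat\Delta_{1t} - \hat\Delta_1$. To see that this indeed leads to a consistent estimate of $\Sigma_2$, note that $\hat\Delta_{1t} = y_{1t} - x_t'\hat\delta = \Delta_{1t} - x_t'(\hat\delta - \delta) + e_{1t}$ and $\hat\Delta_1 = T_2^{-1} \sum_{t=T_1+1}^{T} \hat\Delta_{1t} = \bar\Delta_1 - \bar x'(\hat\delta - \delta) + \bar e_1$, where $\bar a = T_2^{-1} \sum_{t=T_1+1}^{T} a_t$. We get

$$ \hat\Delta_{1t} - \hat\Delta_1 = \Delta_{1t} - \bar\Delta_1 + e_{1t} + O_p(T_1^{-1/2} + T_2^{-1/2}) = \Delta_{1t} - \Delta_1 + e_{1t} + O_p(T_1^{-1/2} + T_2^{-1/2}) $$

because $\hat\delta - \delta = O_p(T_1^{-1/2})$, $\bar e_1 = O_p(T_2^{-1/2})$ and $\bar\Delta_1 = \Delta_1 + O_p(T_2^{-1/2})$. Hence,

$$ \hat\Sigma_2 = \frac{1}{T_2} \sum_{t=T_1+1}^{T} (\Delta_{1t} - \Delta_1 + e_{1t})^2 + O_p(T_1^{-1/2} + T_2^{-1/2}) = \Sigma_2 + O_p(T_1^{-1/2} + T_2^{-1/2}), \tag{A.1} $$

which implies that $\hat\Sigma_2 \overset{p}{\to} \Sigma_2$ for large $T_1$ and $T_2$.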
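The plug-in estimators of this section can be sketched in Python for the simplified case in which e_{1t} is serially uncorrelated and conditionally homoskedastic and v_{1t} is serially uncorrelated. The function name, the simulated data, and the choice to divide the residual sum of squares by T1 are assumptions of this sketch. The standard error at the end further assumes that the limiting variance of √T2(∆̂1 − ∆1) is Σ1 + Σ2, as (A.4) and (A.12) in Section A.2 suggest.

```python
import numpy as np

def sigma_hats(y1, X, T1):
    """Plug-in estimates of the ATE, Sigma_1 and Sigma_2 (simplified case).

    y1 : (T,) outcomes of the treated unit; rows T1..T-1 are post-treatment
    X  : (T, 2) regressors x_t = (1, y_ct)
    """
    T = len(y1)
    T2 = T - T1
    X_pre, X_post = X[:T1], X[T1:]
    delta_hat, *_ = np.linalg.lstsq(X_pre, y1[:T1], rcond=None)

    # V_hat estimates sigma_e^2 [E(x_t x_t')]^{-1} from pre-treatment data.
    resid = y1[:T1] - X_pre @ delta_hat
    sigma2_e = np.mean(resid ** 2)
    V_hat = sigma2_e * np.linalg.inv(X_pre.T @ X_pre / T1)

    # Sigma_1_hat = (T2/T1) * E_hat(x_t)' V_hat E_hat(x_t), post-treatment mean.
    x_bar = X_post.mean(axis=0)
    Sigma1_hat = (T2 / T1) * (x_bar @ V_hat @ x_bar)

    # Sigma_2_hat = mean of squared v_hat_1t = Delta_hat_1t - Delta_hat_1.
    delta1_t = y1[T1:] - X_post @ delta_hat
    ate_hat = delta1_t.mean()
    Sigma2_hat = np.mean((delta1_t - ate_hat) ** 2)
    return ate_hat, Sigma1_hat, Sigma2_hat

# Hypothetical stationary data with a constant treatment effect of 3.
rng = np.random.default_rng(2)
T, T1 = 120, 83
y_c = 5.0 + rng.normal(size=T)
X = np.column_stack([np.ones(T), y_c])
effect = np.where(np.arange(T) >= T1, 3.0, 0.0)
y1 = 1.0 + 0.5 * y_c + effect + rng.normal(size=T)

ate_hat, Sigma1_hat, Sigma2_hat = sigma_hats(y1, X, T1)
se = np.sqrt((Sigma1_hat + Sigma2_hat) / (T - T1))
print(ate_hat, Sigma1_hat, Sigma2_hat, se)
```

When serial correlation is a concern, the two simplified variance formulas would be replaced by the Newey-West type estimators described above.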

A.2 Consistency of the bootstrap method for stationary data case

We reproduce equation (2.19) here for readers’ convenience.

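$$ S = \sqrt{T_2}\,(\hat\Delta_1 - \Delta_1) = -\sqrt{\frac{T_2}{T_1}} \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} x_t' \right] \sqrt{T_1}\,(\hat\delta - \delta) + \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} v_{1t} = A_1 + A_2. \tag{A.2} $$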

From (2.22) we know that the bootstrap analogue of (A.2) is given by

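$$ S^* = -\sqrt{\frac{T_2}{T_1}} \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} x_t' \right] \sqrt{T_1}\,(\hat\delta^* - \hat\delta) + \frac{1}{\sqrt{T_2}} \sum_{t=T_1+1}^{T} v_{1t}^* = A_1^* + A_2^*. \tag{A.3} $$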


Let $F_\Sigma(\cdot)$ denote the cumulative distribution function of a normal random variable with mean zero and variance $\Sigma$. Then Proposition 2.1 implies that ($S = A_1 + A_2$)

$$ \lim_{T_1, T_2 \to \infty} \left| \Pr(S < z) - F_\Sigma(z) \right| = 0 \tag{A.4} $$

for all $z \in \mathbb{R}$. We want to show that the bootstrap statistic $S^* = A_1^* + A_2^*$ satisfies a condition similar to (A.4). Specifically, let $\mathcal{X} = \{y_{1t}, x_t'\}_{t=1}^{T}$. We want to show that

$$ \left| \Pr(S^* < z \mid \mathcal{X}) - F_\Sigma(z) \right| \overset{p}{\to} 0 \tag{A.5} $$

as $T_1, T_2 \to \infty$. We also use "$S^* \overset{d}{\to} N(0, \Sigma)$ in probability" to stand for the same meaning as (A.5).

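Let $Y_1^* = (y_{11}^*, \ldots, y_{1T_1}^*)'$, $X = (x_1, \ldots, x_{T_1})'$ and $e_1^* = (e_{11}^*, \ldots, e_{1T_1}^*)'$. In matrix notation we have $Y_1^* = X\hat\delta + e_1^*$ and $\hat\delta^* = (X'X)^{-1} X' Y_1^* = \hat\delta + (X'X)^{-1} X' e_1^*$. Because $e_1^*$ is independent of $\mathcal{X}$ with $e_{1t}^* \mid \mathcal{X}$ iid $N(0, \hat\sigma_e^2)$, we have, conditional on $\mathcal{X}$,

$$ (\hat\delta^* - \hat\delta) \mid \mathcal{X} = (X'X)^{-1} X' e_1^* \mid \mathcal{X} \overset{d}{\sim} (X'X)^{-1} N(0, \hat\sigma_e^2 (X'X)) = N(0, \hat\sigma_e^2 (X'X)^{-1}), \tag{A.6} $$

where the notation $A \mid \mathcal{X} \overset{d}{\sim} N(\mu, \Omega)$ is read as 'conditional on $\mathcal{X}$, $A$ is normally distributed with mean $\mu$ and variance $\Omega$'. This result holds for any fixed value of $T_1$; it does not require $T_1$ to be large. (A.6) implies that

$$ \hat\sigma_e^{-1} (X'X)^{1/2} (\hat\delta^* - \hat\delta) \mid \mathcal{X} \overset{d}{\sim} N(0, I_2), \tag{A.7} $$

where, for a positive definite symmetric matrix $D$, $D^{1/2}$ is defined as a positive definite symmetric matrix satisfying $D^{1/2} D^{1/2} = D$, and $I_2$ is the $2 \times 2$ identity matrix. Because the right-hand side of (A.7) does not depend on $\mathcal{X}$, we have, unconditionally,

$$ \hat\sigma_e^{-1} (X'X)^{1/2} (\hat\delta^* - \hat\delta) \overset{d}{\sim} N(0, I_2). \tag{A.8} $$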
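The step from (A.6) to (A.7) is a standardization by the symmetric square root of X′X. The following sketch checks the underlying matrix algebra numerically. The design matrix, the value of σ̂²_e and the helper name `sym_sqrt` are hypothetical choices made for illustration.

```python
import numpy as np

def sym_sqrt(D):
    """Symmetric square root of a symmetric positive definite matrix."""
    eigval, eigvec = np.linalg.eigh(D)
    return eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T

rng = np.random.default_rng(1)
T1 = 50
X = np.column_stack([np.ones(T1), rng.normal(size=T1)])  # hypothetical x_t = (1, y_ct)
sigma2_e = 2.5                                           # hypothetical sigma_e^2 estimate

XtX = X.T @ X
root = sym_sqrt(XtX)

# Conditional variance of (delta* - delta_hat) from (A.6).
cov_delta = sigma2_e * np.linalg.inv(XtX)

# Variance after the scaling in (A.7): sigma_e^{-1} (X'X)^{1/2} (delta* - delta_hat).
cov_std = root @ cov_delta @ root / sigma2_e

print(np.allclose(root @ root, XtX))     # D^{1/2} D^{1/2} = D
print(np.allclose(cov_std, np.eye(2)))   # standardized variance is I_2
```

Because the standardized variance is the identity for any full-rank X, the distribution in (A.7) does not depend on the conditioning data, which is what allows the unconditional statement (A.8).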

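To link the above result to the asymptotic expression, we use the fact that $T_1$ is large. From $\hat\delta - \delta = O_p(T_1^{-1/2})$, we have $\hat\sigma_e^2 = \sigma_e^2 + O_p(T_1^{-1/2})$ and $X'X/T_1 = E(x_t x_t') + O_p(T_1^{-1/2})$. Then equation (A.8) implies that

$$ \hat\sigma_e^{-1} (X'X/T_1)^{1/2} \sqrt{T_1}\,(\hat\delta^* - \hat\delta) = \sigma_e^{-1} (E(x_t x_t'))^{1/2} \sqrt{T_1}\,(\hat\delta^* - \hat\delta) + o_p(1) \overset{d}{\to} N(0, I_2) \text{ in probability.} \tag{A.9} $$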


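Equation (A.9) is equivalent to

$$ \sqrt{T_1}\,(\hat\delta^* - \hat\delta) \overset{d}{\to} N(0, \sigma_e^2 [E(x_t x_t')]^{-1}) = N(0, V) \text{ in probability,} \tag{A.10} $$

where $V = \sigma_e^2 [E(x_t x_t')]^{-1}$ is the asymptotic variance of $\sqrt{T_1}(\hat\delta - \delta)$. Equation (A.10), together with $T_2/T_1 \to \eta$ and $T_2^{-1} \sum_{t=T_1+1}^{T} x_t \overset{p}{\to} E(x_t)$, leads to

$$ A_1^* = -\sqrt{\frac{T_2}{T_1}} \left[ \frac{1}{T_2} \sum_{t=T_1+1}^{T} x_t' \right] \sqrt{T_1}\,(\hat\delta^* - \hat\delta) \overset{d}{\to} -\sqrt{\eta}\, E(x_t')\, N(0, V) = N(0, \Sigma_1), \tag{A.11} $$

where $\Sigma_1 = \eta\, E(x_t') V E(x_t)$.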

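Next, we consider $A_2^*$. Since $v_{1t}^* \mid \mathcal{X}$ is iid $N(0, \hat\sigma_v^2)$, we have $A_2^* \mid \mathcal{X} \overset{d}{\sim} N(0, \hat\sigma_v^2)$, or $(A_2^*/\hat\sigma_v) \mid \mathcal{X} \overset{d}{\sim} N(0, 1)$. Hence, unconditionally, $A_2^*/\hat\sigma_v \overset{d}{\sim} N(0, 1)$. Equation (A.1) in Section A.1 shows that $\hat\sigma_v^2 \equiv \hat\Sigma_2 = \Sigma_2 + o_p(1) = \sigma_v^2 + o_p(1)$. Hence, we have $A_2^* \overset{d}{\to} N(0, \Sigma_2)$ in probability. Finally, $A_1^*$ and $A_2^*$ are independent of each other. Therefore, we obtain

$$ S^* \mid \mathcal{X} = (A_1^* + A_2^*) \mid \mathcal{X} \overset{d}{\to} N(0, \Sigma_1 + \Sigma_2) \text{ in probability,} \tag{A.12} $$

which is equivalent to

$$ \left| P(S^* \le z \mid \mathcal{X}) - F_\Sigma(z) \right| = o_p(1) $$

for all $z \in \mathbb{R}$. This proves (A.5).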
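The bootstrap statistic S* = A*1 + A*2 analyzed above can be simulated with a few lines of Python. This is a sketch under stated assumptions rather than the authors' procedure: the Gaussian draws for e*1t and v*1t follow the description in this section, while the function name, the simulated data, the variance estimates that divide by the sample size, and the way the quantiles of S* are inverted into a confidence interval are choices made here for illustration.

```python
import numpy as np

def bootstrap_S(y1, X, T1, B=2000, seed=0):
    """Draw B replications of S* = A1* + A2* as in (A.3)."""
    rng = np.random.default_rng(seed)
    T = len(y1)
    T2 = T - T1
    X_pre, X_post = X[:T1], X[T1:]

    # Pre-treatment regression and residual scale sigma_e_hat.
    delta_hat, *_ = np.linalg.lstsq(X_pre, y1[:T1], rcond=None)
    sigma_e = np.sqrt(np.mean((y1[:T1] - X_pre @ delta_hat) ** 2))

    # Post-treatment effect estimates and scale sigma_v_hat.
    delta1_t = y1[T1:] - X_post @ delta_hat
    ate_hat = delta1_t.mean()
    sigma_v = np.sqrt(np.mean((delta1_t - ate_hat) ** 2))

    x_bar = X_post.mean(axis=0)
    S_star = np.empty(B)
    for b in range(B):
        # Y1* = X delta_hat + e1*, then re-estimate delta on the bootstrap sample.
        y_star = X_pre @ delta_hat + rng.normal(0.0, sigma_e, size=T1)
        delta_star, *_ = np.linalg.lstsq(X_pre, y_star, rcond=None)
        v_star = rng.normal(0.0, sigma_v, size=T2)
        A1 = -np.sqrt(T2) * (x_bar @ (delta_star - delta_hat))
        A2 = v_star.sum() / np.sqrt(T2)
        S_star[b] = A1 + A2
    return ate_hat, S_star

# Hypothetical stationary data with a constant treatment effect of 3.
rng = np.random.default_rng(3)
T, T1 = 120, 83
T2 = T - T1
y_c = 5.0 + rng.normal(size=T)
X = np.column_stack([np.ones(T), y_c])
y1 = 1.0 + 0.5 * y_c + np.where(np.arange(T) >= T1, 3.0, 0.0) + rng.normal(size=T)

ate_hat, S_star = bootstrap_S(y1, X, T1)
q_lo, q_hi = np.quantile(S_star, [0.025, 0.975])
# Since S = sqrt(T2) (ate_hat - ATE), invert the quantiles to bound the ATE.
ci_low = ate_hat - q_hi / np.sqrt(T2)
ci_high = ate_hat - q_lo / np.sqrt(T2)
print(ate_hat, ci_low, ci_high)
```

Note that A*1 is computed as −√T2 x̄′(δ̂* − δ̂), which equals the expression in (A.3) after cancelling √T1.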
