March 27, 2020 › pdf › 2003.11991v1.pdfShantanu Gupta, Zachary C. Lipton, David Childers Carnegie Mellon University [email protected],[email protected],[email protected]

arX

iv:2

003.

1199

1v1

[st

at.M

E]

26

Mar

202

0

Estimating Treatment Effects with Observed

Confounders and Mediators

Shantanu Gupta, Zachary C. Lipton, David Childers

Carnegie Mellon [email protected],[email protected],[email protected]

March 27, 2020

Abstract

Given a causal graph, the do-calculus can express treatment effects as functionals of the observa-tional joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multi-ple valid formulae, prompting us to compare the statistical properties of the corresponding estimators.For example, the backdoor formula applies when all confounders are observed and the frontdoor for-mula applies when an observed mediator transmits the causal effect. In this paper, we investigate theover-identified scenario where both confounders and mediators are observed, rendering both estima-tors valid. Addressing the linear Gaussian causal model, we derive the finite-sample variance for bothestimators and demonstrate that either estimator can dominate the other by an unbounded constantfactor depending on the model parameters. Next, we derive an optimal estimator, which leverages allobserved variables to strictly outperform the backdoor and frontdoor estimators. We also present aprocedure for combining two datasets, with confounders observed in one and mediators in the other.Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets.

1 Introduction

Causal effects are not, in general, identifiable from observational data alone. Consequently, scientists typi-

cally rely on controlled experiments to estimate causal quantities such as treatment effects. The fundamen-

tal insight of causal inference is that given structural assumptions on the data generating process, causal

effectsmay become expressible as functionals of the joint distribution over observed variables. Thus, much

work on causal graphical models has focused on problems of identification—determining if a functional

capturing the desired causal quantity exists and deriving it. The do-calculus, introduced by Pearl [1995],

provides a set of three rules that can be used to convert causal quantities into such functionals. Moreover,

the do-calculus has been proven to be complete by Shpitser and Pearl [2006] who introduce conditions to

determine when a desired causal effect is identifiable and an algorithm to express it as a functional when-

ever the condition holds. Once the desired functional has been identified, we can estimate it using only

observational data.

We are motivated by the observation that, for some causal graphs, treatment effects may be overidentified.

Here, applications of the do-calculus produce distinct functionals, all of which, subject to positivity con-

ditions, yield consistent estimators of the same causal effect. Consider a causal graph (see Figure 1) for

which the treatment X , mediatorM , confounderW , and outcome Y are all observable. Using the backdoor

adjustment, we can expresses the average treatment effect of X on Y as a function of P(X ,W , Y ), while1

http://arxiv.org/abs/2003.11991v1

mailto:[email protected]



X M Y

W

Figure 1: Causal graph with observed mediator and confounder, rendering both backdoor and frontdoor

estimators viable.

the frontdoor adjustment expresses that same causal quantity via P(X ,M, Y ) [Pearl, 1995]. Faced with the

(fortunate) condition of overidentification, our focus shifts from identification—is our effect estimable?, to

optimality—which among multiple valid estimators dominates from a standpoint of statistical efficiency?

In this paper, we address this very graph, focusing our analysis on the linear causal model [Wright, 1934],

a central object of study in causal inference and econometrics. We also adapt existing semi-parametric

estimators to this graph.

Deriving the finite sample variance of the backdoor and frontdoor estimators, and precisely characterizing

conditions under which each dominates, we find that either may outperform the other to an arbitrary

degree depending on the underlying model parameters. These expressions can provide guidance to a

practitioner for assessing the suitability of each estimator. For example, one byproduct of our analysis is

to characterize what properties make for the “ideal mediator”. Moreover, in the data collection phase, if

one has a choice between collecting data on the mediator or the confounder, these expressions, together

with the practitioner’s beliefs about likely ranges for model parameters, can be used to decide what data

to collect.

Next, we propose techniques that leverage both observed confounders and mediators. For the setting

where we simultaneously observe both the confounder and the mediator, we introduce an estimator that

optimally combines all information. We prove theoretically that this method achieves lower mean squared

error (MSE) than both the backdoor and frontdoor estimators, for all settings of the underlying model

parameters. Moreover, the extent to which this estimator can dominate the better of the backdoor and

frontdoor estimators is unbounded. Subsequently, we consider the partially-observed setting in which

two datasets are available, one with observed confounders (but not mediators) {(X ,W , Y )}ni=1, and anotherwith observed mediators (but not confounders) {(X ,M, Y )}mi=1. Interestingly, the likelihood is convex givensimultaneous observations but non-convex under partially-observed data. We introduce and empirically

validate a heuristic that is guaranteed to achieve higher likelihood than either the backdoor or frontdoor

estimators.

Through a series of experiments, we evaluate our methods on both synthetic, semi-synthetic, and real

datasets. We show that the backdoor and frontdoor estimators exhibit different variance depending on

the model parameters as predicted by our variance formulae. Our proposed estimators that combine con-

founders and mediators always exhibit lower variance and MSE than the backdoor and frontdoor estima-

tors when our model assumptions are satisfied.

2

2 Related Work

The backdoor adjustment, which formalizes the common practice of controlling for known confounders, is

widely applied and studied in statistics and econometrics [Pearl, 2009, 2010, Perković et al., 2015]. More re-

cently, the frontdoor adjustment, which leverages observed mediators to identify causal effects, even amid

unobserved confounding, has seen increasing application in real-world datasets [Bellemare and Bloem,

2019, Glynn and Kashin, 2018, 2017, Chinco and Mayer, 2016, Cohen and Malloy, 2014].

In the most similar work to ours, Glynn and Kashin [2018] compare the frontdoor and backdoor adjust-

ments, computing bias (but not variance) formulas for each and performing sensitivity analysis. Exploring

a real-world job training dataset, they demonstrate that the frontdoor estimator outperforms its back-

door counterpart (in terms of bias). Henckel et al. [2019] analyze linear causal models where multiple

adjustments sets are available for the backdoor criterion. They introduce a graphical criterion for com-

paring the asymptotic variances of the adjustment sets and identifying the one with lowest variance.

Rotnitzky and Smucler [2019] extend this work and show that the same graphical criterion is valid even

when the causal model is non-parametric. They also present a semi-parametric efficient estimator that

exploits the conditional independences in a causal graph.

Researchers have also worked to generalize the frontdoor criterion. Bareinboim et al. [2019] introduce the

conditional frontdoor criterion, allowing for both treatment-mediator confounders and mediator-outcome

confounders. Fulcher et al. [2020] propose a method for including observed confounders along with a

mediator with discrete treatments.

The study of overidentified models dates at least back to Koopmans and Reiersøl [1950]. Sargan [1958],

Hansen [1982] formalized the result that in the presence of overidentification, multiple estimators can be

combined to improve efficiency. This was extended to the non-parametric setting by Chen and Santos

[2018]. A related line of work considers methods for combining multiple datasets for causal inference.

Bareinboim and Pearl [2016] study the problem of handling biaseswhile combining heterogeneous datasets,

while Jackson et al. [2009] present Bayesian methods for combining datasets with different covariates and

some common covariates.

3 Preliminaries

In this work, weworkwithin the structural causal model (SCM) framework due to Pearl [2009], formalizing

causal relationships via directed acyclic graphs (DAGs). Each X → Y edge in this DAG indicates that the

variable X is (potentially) a direct cause of variable Y . Informally, the edge indicates that Y listens to X in

the sense that the function that determines its value depends upon X . However, the DAG does not capture

the magnitude of the effect. All measured variables are deterministic functions of their parents and a set of

jointly independent per-variable noise terms. As an illustration, consider the graph in Figure 1 with four

variables {W , X ,M, Y}. Here each variable V ∈ {W , X ,M, Y} can be expressed as

V = fV (Pa(V ), uV ), (1)

where fV is some deterministic function, Pa(V ) denotes V ’s parents in the graph, and uV is the correspond-

ing noise term. Eq. 1 is known as a structural equation.

Linear Gaussian SCM In linear Gaussian SCMs, each variable is assumed to be a linear function of its

parents. The noise terms are assumed to be additive and Gaussian. In this paper, we work with the linear

3

Gaussian SCM for the overidentified confounder-mediator graph (Figure 1), where the structural equations

can be written as

wi = uwi ,xi = dwi + uxi ,mi = cxi + umi ,yi = ami + bwi + uyi ,

uwi ∼ (0, �2uw )uxi ∼ (0, �2ux )umi ∼ (0, �2um )uyi ∼ (0, �2uy ).(2)

Here, wi, xi , mi , yi are realized values of the random variablesW,X,M, Y , respectively, and uwi , uxi , umi , uyiare realized values of the corresponding noise terms. In Eq. 2, we assume that the variables have zero

mean. This is done to simplify analysis, but this assumption is not necessary for the results presented in

the paper.

3.1 The Backdoor and Frontdoor Adjustments

The effect of a treatmentX is expressible by reference to the post-intervention distributions of the outcomeY for different values of the treatment X = x . Informally, an intervention do(X = x) means that we

manually set X to the value x for all instances, regardless of what value it would have assumed naturally.

More formally, an intervention do(X = x) in a causal graph can be expressed via the mutilated graph that

results from deleting all incoming arrows to X , setting X ’s value to X = x for all instances, while keeping

the SCM otherwise identical. This distribution is denoted as P(Y |do(X = x)).The backdoor and frontdoor adjustments [Pearl, 2009] express treatment effects as functionals of the ob-

servational distribution. Consider our running example of the causal model in Figure 1. We denote X as

the treatment, Y as the outcome, W as a confounder, and M as a mediator. Our goal is to estimate the

causal quantity P(Y |do(X = x)).Backdoor Adjustment When all confounders of both X and Y are observed—in our example,W—then

the causal effect of X on Y , i.e., P(Y |do(X = x)) can be written as

P(Y |do(X = x)) = ∑w P(Y |X = x,W = w)P(W = w). (3)

Frontdoor Adjustment This technique applies even when the confounder W is unobserved. Here we

require access to a mediatorM that (i) is observed; (ii) transmits the entire causal effect from X to Y ; and(iii) is not influenced by the confounder W given X . The effect of X on Y is computed in two stages. We

first find the effect of X onM :

P(M = m|do(X = x)) = P(M = m|X = x). (4)

Then we find the effect ofM on Y , i.e., P(Y |do(M = m)):P(Y |do(M = m)) = ∑x P(Y |M = m, X = x)P(X = x). (5)

We can then write the causal effect of X on Y as

P(Y |do(X = x)) = ∑m P(M = m|do(X = x))P(Y |do(M = m)).4

3.2 Ordinary Least Squares (OLS) Regression

Since we use OLS regression in later sections, we briefly review OLS estimators. We consider the following

setup: y = X� + e, (6)

where y and e are n × 1 vectors, X is an n × d matrix of observations, and � is the d × 1 coefficient vector

that we want to estimate. If e ⟂⟂ X and e ∼ (0, �2e In), where In is the n × n identity matrix, then the OLS

estimate of � is � = (X⊤X)−1X⊤y= � + (X⊤X)−1X⊤e, (7)

with E[�] = � and Var(�) = �2e E[(X⊤X)−1]. If each row Xi of X is sampled from Xi i.i.d.∼ (0,�), then the

distribution of (X⊤X)−1 is an Inverse-Wishart distribution. Then the variance of � is

Var(�) = �2e �−1n − d − 1 . (8)

4 Variance of Backdoor & Frontdoor Estimators

In this section, we analyze the backdoor and frontdoor estimators and characterize the regimes where each

dominates. We work with the linear SCM described in Eq. 2. Throughout, our goal is to estimate the causal

effect of X on Y . In terms of the underlying parameters of the linear SCM, the quantity that we wish to

estimate is ac. Absent measurement error, both estimators are unbiased (see proof in Appendix B) and

thus we focus our comparison on their respective variances.

Variance of the Backdoor Estimator The backdoor estimator requires only thatwe observe {X , Y ,W}(but not necessarily the mediator M). Say we observe the samples {xi , yi , wi}ni=1. The outcome yi can be

written as

yi = acxi + bwi + aumi + uyi .We can estimate the causal effect ac by taking the coefficient on X in an OLS regression of Y on {X ,W}.This controls for the confounderW and corresponds naturally to the adjustment described in Eq. 3.

Let � = Cov([X ,W ]). Using Eq. 8, the variance of the backdoor estimator is

Var(ac)backdoor = Var(aum + uy ) (�−1)1,1n − 3= a2�2um + �2uy(n − 3)�2ux . (9)

Thus, the asymptotic variance of the backdoor estimator is

limn→∞Var(√n(ac − ac)) = a2�2um + �2uy�2ux . (10)

5

Variance of the Frontdoor Estimator The frontdoor estimator is used when {X, Y ,M} samples are

observed and the confounder W is unobserved. The causal effect ac is estimated in two stages. First, we

estimate c by taking the coefficient on X in an OLS regression of M on X . Let the estimate be c. Thiscorresponds to the adjustment in Eq. 4. Then, we estimate a by taking the coefficient on M in an OLS

regression of Y on {M,X}. Let the estimate be af . This corresponds to the adjustment in Eq. 5.

The regression of M on X can be written as mi = cxi + umi . Let Σc = Var(X ). Using Eq. 8, Var(c) isVar(c) = Var(um)(Σ−1c )n − 2

= �2um(n − 2)(d2�2uw + �2ux ) . (11)

The regression of Y on {M, X} can be written as yi = ami + bd xi + ei , where ei = − bd uxi +uyi . In this case, the

error ei is not independent of the regressor xi . Using the fact that (ux , x) has a bivariate normal distribution,

Var(e|x) isVar(e|x) = b2�2uw�2ux + �2uy (d2�2uw + �2ux )(d2�2uw + �2ux ) . (12)

Let �a = Cov([M, X]). Then, using a formula analogous to Eq. 8 for non-independent errors (see Appendix

C.1.1 for the proof), the variance of af isVar(af ) = E [Var(e|x)(�−1a )1,1(n − 3) ]

= b2�2uw�2ux + �2uy (d2�2uw + �2ux )(n − 3)(d2�2uw + �2ux )�2um , (13)

where the expression for Var(e|x) is taken from Eq. 12.

The variance of the product of two random variables can be written as

Var(af c) = Cov(a2f , c2) + (Var(af ) + E2[af ])(Var(c) + E2[c]) − (Cov(af , c) + E[af ]E[c])2 (14)= Cov(a2f , c2) + (Var(af ) + a2)(Var(c) + c2]) − (Cov(af , c) + ac)2, (15)

where in Eq. 14 we used the facts that E[af ] = a, and E[c] = c (see proof in Appendix B.2).

Using the facts that Cov(af , c) = 0 (see proof in Appendix A.1), and Cov(a2f , c2) = Var(af )Var(c) (see proofin Appendix C.1.2) in Eq. 15, we can express the finite sample variance of the frontdoor estimator as

Var(af c) = c2Var(af ) + a2Var(c) + 2Var(af )Var(c), (16)

where Var(c) and Var(af ) are defined in Eqs. 11 and 13, respectively. Thus, the variance of the frontdoor

estimator can be written as

Var(√naf c) = c2Var(√naf ) + a2Var(√nc) +( 1n) . (17)

6

The Ideal Frontdoor Mediator A natural question that arises is what properties make a mediator best

suited for applying the frontdoor adjustment. Under what conditions is the frontdoor estimator most

precise? We can see that Var(af c) is non-monotonic in the mediator noise �um . Our variance expressionsmake clear that if �um is high, it becomes difficult to estimate c (Eq. 11). However, when �um is low, our

estimate of a has high variance (Eq. 13). Eq. 16 provides us with guidance. Var(af c) is a convex function

of �2um . The ideal mediator will have noise variance �2∗um which minimizes Eq. 16. That is,

�2∗um = argmin� 2um[a2Var(c) + c2Var(af ) + 2Var(af )Var(c)]

⟹ �2∗um = |c|√b2�2uw�2ux + �2uy (d2�2uw + �2ux )|a| √n − 2n − 3 .4.1 Comparison of Backdoor and Frontdoor Estimators

The relative performance of the backdoor and frontdoor estimators depend on the underlying SCM’s pa-

rameters. Using Eqs. 9 and 16, the ratio of the backdoor to frontdoor variance is

RVar = Var(ac)backdoorVar(af c)

= (n − 2)�2umD2(a2�2um + �2uy )�2ux ((n − 3)a2�4umD + (2�2um + c2(n − 2)D)(b2�2uw�2ux + �2uyD)) , (18)

where D = (d2�2uw + �2ux ).The backdoor estimator dominates when RVar < 1 and vice versa when RVar > 1. Note that there exist

parameters that cause any value of RVar > 0. In particular, as �2ux → 0, RVar → ∞ and as �2ux → ∞,RVar → 0, regardless of the sample size n. Thus, either estimator can dominate the other by any arbitrary

constant factor.

Note that in this section, we characterize the variance of the estimators in terms of the underlying causal

model parameters which are unknown. One practical way to operationalize these expressions would be to

test if one model dominates in parameter ranges given by prior knowledge of the problem. Alternatively,

one could estimate the model parameters choosing the adjustment with lowest variance based on plug-in

estimates (analogous to Explore-then-Commit [Lattimore and Szepesvári, 2019, Ch. 6]).

5 Combining Mediators & Confounders

Having characterized the performance of each estimator separately, we can now consider optimal strate-

gies for estimating treatment effects in the overidentified regime, where we observe both the confounder

and mediator simultaneously. In other words, we observe N samples {xi , yi , wi , mi}Ni=1. We show that it is

possible to construct an estimator that is strictly better than the backdoor and frontdoor estimators.

We compute the maximum likelihood estimator (MLE). The MLE will be optimal since our model satisfies

the necessary regularity conditions for MLE optimality (by virtue of being linear and Gaussian). Let the

vector si = [xi , yi , wi , mi] denote the ith sample. Since the data is multivariate Gaussian, the log-likelihood

of the data can be written as

= −N2 [log (det�) + Tr (��−1)] , (19)

7

where � = Cov([X , Y ,W ,M]) (the true covariance matrix) and � = 1N ∑Ni=1 sisi⊤ (the sample covariance

matrix).

The MLE for a Gaussian graphical model is �MLE = � [Uhler, 2019]. Let the MLE estimates for parametersc and a be c and ac , respectively. Thenc = �1,4�1,1 (20)

ac = �1,4�3,3 − �1,3�3,4�3,3�4,4 − �23,4 . (21)

The MLE estimate for c in Eq. 20 is the same as for the frontdoor—the coefficient of X in an OLS regression

of M on X . The MLE estimate for a in Eq. 21 is the coefficient on M in an OLS regression of Y on{M,W}. Contrast this with the estimate of a for the frontdoor estimator where we regress Y on {M, X}.Conditioning on W instead of X allows the combined estimator to improve performance.

We can write the regression of Y on {M,W} as yi = ami + bwi + uyi . Let �ac = Cov([M,W ]). Using Eq. 8,

we get

Var(ac) = Var(uyi )(�−1ac )1,1n − 3= �2uy(n − 3)(c2�2ux + �2um ) . (22)

The variance of c is the same as the frontdoor case (Eq. 11), that is

Var(c) = �2um(n − 2)(d2�2uw + �2ux ) . (23)

The variance of the product of two random variables can be written as

Var(ac c) = Cov(a2c , c2) + (Var(ac) + E2[ac])(Var(c) + E2[c]) − (Cov(ac , c) + E[ac]E[c])2 (24)= Cov(a2c , c2) + (Var(ac) + a2)(Var(c) + c2]) − (Cov(ac , c) + ac)2, (25)

where in Eq. 24 we used the facts that E[ac] = a, and E[c] = c (see proof in Appendix B.3).

Using the facts that Cov(ac , c) = 0, andCov(a2c , c2) ≤

√n − 3n − 5 (2|c|Var(ac)√Var(c) + √3√n − 2n − 4Var(ac)Var(c)) ,that are proved in Appendices A.2 and C.2, respectively in Eq. 25, we can upper bound the finite sample

variance of the combined estimator as

Var(ac c) ≤ c2Var(ac) + a2Var(c) +√n − 3n − 5 (2|c|Var(ac)√Var(c) + √3√n − 2n − 4Var(ac)Var(c)) , (26)

where Var(c) and Var(ac) are defined in Eqs. 23 and 22, respectively. Thus, the variance of the combined

estimator can be written as

Var(√nac c) ≤ c2Var(√nac) + a2Var(√nc) +( 1√n) . (27)

8

The Ideal Mediator Just as with the frontdoor estimator, we can ask what makes for an ideal mediator

when using the combined estimator. Eq. 27 shows that limn→∞ Var(√nac c) is a convex function of the

variance of the noise term of the mediator �2um . The ideal mediator will have noise variance �2∗um which

minimizes the variance in Eq. 27. This means that

�2∗um = argmin� 2um [ limn→∞ (a2Var(√nc) + c2Var(√nac))]⟹ �2∗um = max {0, vmin} ,

where

vmin = |c|�uy√d2�2uw + �2ux|a| − c2�2ux .5.1 Comparison with Backdoor and Frontdoor Estimators

We can compare Eqs. 10 and 27 too see that, asymptotically, the combined estimator has lower variance

than the backdoor estimator for all values of model parameters. That is, as n → ∞,

Var(√nac c) ≤ Var(√nac)backdoor.Similarly, we can compare Eqs. 17 and 27 to see that, asymptotically, the combined estimator is better than

the frontdoor estimator for all values of model parameters. That is, as n → ∞,

Var(√nac c) ≤ Var(√naf c).In the finite sample case, using Eqs. 9 and 26, we can see that for a large enough n, the combined estimator

will dominate the backdoor estimator for all model parameters. That is,

∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(ac)backdoor,where the dependence of N on the model parameters is stated in Appendix D.1. We can make a similar

argument for the dominance of the combined estimator over the frontdoor estimator. Using Eqs 16 and

frontdoor-finite-sample-variance, it can be shown that

∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(af c),where the dependence of N on the model parameters is stated in Appendix D.2.

Next, we show that the combined estimator can dominate the better of the backdoor and frontdoor esti-

mators by an arbitrary amount. That is, we show that the quantity

R = min{Var(ac)backdoor ,Var(af c)}Var(ac c)

is unbounded. In order to demonstrate this, we consider the case when Var(ac)backdoor = Var(af c). Thiscondition holds for certain settings of the model parameters (see Appendix D.3 for an example). In this

case,

R = Var(ac)backdoorVar(ac c) (28)

≥ (n − 2)DE(a2�2um + �2uy )�2ux ((n − 3)a2�2umE + �2uy (�2um + √3�2um (r1r2 + |c|(n − 2)D(|c| + r1 �um√(n−2)D)))) ,

9

where D = d2�2uw + �2ux , E = c2�2ux + �2um , r1 = √ n−3n−5 , r2 = √n−2n−4 , and, in Eq. 28, we used the upper bound

from Eq. 26. We can see that as �ux → 0, R → ∞ and thus R is unbounded.

5.2 Semi-Parametric Estimators

Fulcher et al. [2020] derive the efficient influence function and semi-parametric efficiency bound for a

generalized model with discrete treatment and non-linear relationships between the variables. While they

allow for confounding of the treatment-mediator link and the mediator-outcome link, the graph in Figure

1 has additional restrictions. As per Chen and Santos [2018], this graph is locally overidentified. This

suggests that it is possible to improve the estimator by Fulcher et al. [2020, Eq. (6)] (which we refer to as

IF-Fulcher) by leveraging our model’s restrictions. In our model, we have Y ⟂⟂ X |(M,W ), and M ⟂⟂ W |X .So we adapt IF-Fulcher to our graph by incorporating the additional conditional independences by usingE[Y |M,W , X ] = E[Y |M,W ], and f (M |X,W ) = f (M |X ) to create an estimator we refer to as IF-Restricted:

Ψ = 1n n∑i=1(Yi − E[Y |Mi,Wi]) f (M |x ∗)f (M |Xi)+ 1{Xi = x ∗}P(Xi = x ∗|Wi)× {E[Y |Mi ,Wi] −∑m E[Y |m,Wi]f (m|Xi)}+∑m E[Y |m,Wi]f (m|x ∗), (29)

where, if f (M |X) and P (X |W ), E(Y |M,W ) are consistent estimators of the true quantities, then Ψ p→E[Y |do(X = x ∗)]. By double robustness of the given estimator, if f , P , and E are correctly specified, then

IF-Restricted has identical asymptotic distribution as IF-Fulcher. But using the additional restrictions im-

proves estimation of nuisance functions. Rotnitzky and Smucler [2019], in contemporaneous work, ana-

lyzed the same graph and showed that, in addition, the efficient influence function is also changed when

imposing these conditional independences (see Example 10 in their paper). For our experiments with bi-

nary treatments, we use linear regression for f , E and logistic regression for P .Another way to adapt IF-Fulcher is for the case when we do not observe the confounders (as in the front-

door adjustment). In this case, we can set Wi = ∅ and apply IF-Fulcher. We call this special case IF-

Frontdoor.

6 Combining Revealed-confounder and Revealed-mediator Datasets

We now consider a situation in which the practitioner has access to two datasets In the first one, the

confounders are observed but the mediators are unobserved. In the second one, the mediators are observed

but the confounders are unobserved. This situation might arise if data is collected by two groups, the first

selecting variables to measure to apply the backdoor adjustment and the second selecting variables to

apply the frontdoor adjustment. Given the two datasets, we now turn to the question of how to optimally

leverage all available data to estimate the effect of X on Y .Onewaymight be to apply the backdoor estimator to the first dataset, the frontdoor estimator to the second

dataset. We could then estimate the variance of each estimate, using either bootstrapping or plugging in

10

estimates of the model parameters to the variance equations, and then choose the estimator with the lower

estimated variance. However, in this case, each estimator uses samples from one only dataset. To leverage

all available information, we introduce theMLE, demonstrating that this estimator has lower variance than

both the backdoor and frontdoor estimators.

Combined Log-Likelihood under Partial Observability Say we have P samples of {xi , yi , wi}Pi=1.Let each such sample be denoted by the vector pi = [xi , yi , wi]. Moreover, say we have Q samples of{xi , yi , mi}Qi=1. Let each such sample be denoted using the vector qj = [xj , yj , mj]. Let the observed data berepresented as D. That is, D = {p1, p2, … , pP , q1, q2, … , qQ}. Let N = P + Q and let k = PQ . Since the datais multivariate Gaussian, the conditional log-likelihood given k can be written as

(D|k) = − 12N [P (log (det�p) + Tr (�p�−1p )) + Q (log (det�q) + Tr (�q�−1q ))]= − 12(k + 1)[k (log (det�p) + Tr (�p�−1p )) + log (det�q) + Tr (�q�−1q )], (30)

where �p = Cov([X , Y ,W ]), �q = Cov([X , Y ,M]), �p = 1P ∑Pi=1 pip⊤i and �q = 1Q ∑Qi=1 qiq⊤i .6.1 Cramer-Rao Lower Bound

Before we present the MLE, we analyse the asymptotic variance of the this estimator and show that vari-

ance can be reduced by utilizing the additional samples.

Reparameterizing the Likelihood We are interested in estimating the value of the product ac. Lete = ac. We reparameterize the likelihood in Eq. 30 by replacing c with e/a. This simplifies the calcula-

tions and improves numerical stability. Now, we have the following eight unknown model parameters:{e, a, b, d, �2uw , �2ux , �2um , �2uy}.In order to compute the variance of the estimate of parameter e = ac, we compute the Cramer-Rao variance

lower bound. We first compute the Fisher information matrix (FIM) I for the eight model parameters:

I = −E ⎡⎢⎢⎢⎢⎢⎣)2)e2 )2)e)a )2)e)b … )2)e)�uy)2)a)e )2)a2 )2)a)b … )2)a)�uy⋮ ⋮ ⋮ ⋱ ⋮)2)�uy )e )2)�uy )a )2)�uy )b … )2)2�uy

⎤⎥⎥⎥⎥⎥⎦Let e be the MLE. Since standard regularity hold for our model (due to linearity and Gaussianity), the MLE

is asymptotically normal. We can use the Cramer-Rao theorem to get the asymptotic variance of e. Thatis, for constant k, as N → ∞, we have√N (e − e) d→ (0, Ve), andVe = (I−1)1,1.The expression for (I−1)1,1 is available in closed form and is given in Appendix E.1.

11

Remarks on the Lower Bound Recall that k = PQ . As k → ∞, Ve becomes equal to the asymptotic

variance of the backdoor estimator. That is, as k → ∞,

Ve = limn→∞Var(√nac)backdoor.And as k → 0, Ve becomes equal to the asymptotic variance of the frontdoor estimator. That is, as k → 0,

Ve = limn→∞ [a2Var(√nc) + c2Var(√naf )] .For a fixed value of Q, Ve is a decreasing function of k. And for a fixed value of P , Ve is an increasing

function of k. This shows that samples from both datasets are useful for reducing the asymptotic variance

of the estimator.

6.2 The Maximum Likelihood Estimator

Computing an analytical solution for the model parameters that maximizes the log-likelihood turns out to

be intractable. As a result, we update our estimated parameters to maximize the likelihood numerically.

Parameter Initialization The likelihood in Eq. 30 is non-convex. As a result, we cannot start with

arbitrary initial values for model parameters because we might encounter a local minimum. To avoid this,

we use the two datasets to initialize our parameter estimates. Each of the eight parameters can be identified

using only data from one of the datasets. For example, d can be initialized using the revealed-confounder

dataset (via OLS regression of X on W ). The parameter e is can be identified using either dataset, so we

pick the value with lower bootstrapped variance.

After initializing the eight model parameters, we run the Broyden–Fletcher–Goldfarb–Shanno (BFGS) al-

gorithm [Fletcher, 2013] to find model parameters that minimize the negative log-likelihood. In our exper-

iments, for the sample sizes considered and given our initialization procedure, the non-convexity of the

likelihood never proved a practical problem. In our experiments, we were always able to converge to the

global minimum. When we find the global minimum, this estimator is optimal and dominates both the

backdoor and frontdoor estimators.

In a real-world scenario, without our procedure, a practitionerwould have to choose between the backdoor

and frontdoor estimates from the individual datasets. An advantage of using our procedure is that we

obviate this choice and output a single estimate of the causal effect that asymptotically performs better

than the individual estimates.

7 Experiments

First, we present results on synthetic datasets, where the structural causal model is known. Next, we

present results on a semi-synthetic dataset, where the covariates and treatment assignment are from a real

study but the outcomes are synthetic. Finally, we test our methods on job training data.

Synthetic Data Generating data from our causal model (Eq. 2), we first compare the empirical variances

of the backdoor, frontdoor and combined estimator to those predicted by our theory (Eqs. 9, 16, 26). We

12

Table 1: MeanAbsolute Percentage Error of the empirical variance with the theoretical variance (computed

using our variance formulae). The values are reported as mean ± std.

Estimator n = 50 n = 100 n = 200Backdoor 0.36 ± 0.37 0.32 ± 0.29 0.34 ± 0.13Frontdoor 0.33 ± 0.21 0.30 ± 0.25 0.23 ± 0.17Combined 1.20 ± 1.14 0.97 ± 0.64 0.58 ± 0.23

50 60 70 80 90 100Number of samples

0

2

4

6

8

10

12

MSE

of e

stim

ator

frontdoorbackdoorcombined

(a) Backdoor better

50 60 70 80 90 100Number of samples

0

5

10

15

20

25

MSE

of e

stim

ator

frontdoorbackdoorcombined

(b) Frontdoor better

Figure 2: Comparison of MSE when confounders and mediators are simultaneously observed. (a) Causal

model where backdoor dominates frontdoor. (b) Causal model where frontdoor dominates backdoor. In

both cases, the combined estimator has lowest MSE.

initialize the model parameters by sampling from the following distributions:

a, b, c, d ∼ Unif[−10, 10]�2uw , �2ux , �2um , �2uy ∼ Unif[0.01, 2]. (31)

For each initialization, we compute the Mean Absolute Percentage Error (MAPE) of the empirical variance

with the theoretical variance. We average the MAPE across 1000 realizations sampled from Eq. 31. We

find that the theoretical variance is close to the empirical variance even for small sample sizes (Table 1).

Next, we compare the backdoor, frontdoor and combined estimators under different settings of the model

parameters. Unless stated otherwise, the model parameter values we use for generating the data are

a = 10, b = 4, c = 5, d = 5,�2uw = 1, �2ux = 1, �2um = 1, �2uy = 1. (32)

The quantity of interest is the causal effect ac = 50. We set some of the values in Eq. 32 to construct

regimes where either of the backdoor or frontdoor estimators can dominate the other. For Figure 2a, we

set {�2ux = 0.05, �2um = 0.05}, which makes the backdoor estimator better as predicted by Eq. 18. For Figure

2b, we set {�2uw = 2, �2ux = 0.01, �2um = 0.1} which makes the frontdoor estimator better as predicted by Eq.

18. The plots in Figure 2 corroborate these predictions at different sample sizes. Furthermore, the optimal

combined estimator always outperforms both the backdoor and frontdoor estimators.

Finally, we evaluate the procedure for combining datasets described in Section 6, generating two datasets

with equal numbers of samples. In the first, only {X, Y ,W } are observed. In the second, only {X, Y ,M}are observed. We set {�2ux = 0.05, �2um = 0.05}, which makes the backdoor estimator better (Figure 3a),

and then set {�2uw = 2, �2ux = 0.01, �2um = 0.1}, which makes the frondoor estimator better (Figure 3b). We

then confirm that the combined estimator has lower MSE than either for various sample sizes (Figure 3),

supporting our theoretical claims.

13

100 150 200 250 300 350 400 450 500Number of samples per dataset

0.2

0.4

0.6

0.8

1.0

1.2

MSE

of e

stim

ator

backdoorcombined

(a)

100 150 200 250 300 350 400 450 500Number of samples per dataset

0.250.500.751.001.251.501.752.002.25

MSE

of e

stim

ator

frontdoorcombined

(b)

Figure 3: Comparison of MSE when confounders and mediators are observed in separate datasets. The

estimator that combines datasets dominates the better of backdoor and frontdoor estimators.

IHDP Dataset Hill [2011] constructed a dataset from the Infant Health and Development Program

(IHDP). This semi-synthetic dataset, which has been used for benchmarking causal inference algorithms

[Shi et al., 2019, Shalit et al., 2017], is based on a randomized experiment to measure the effect of home

visits from a specialist on future test scores of children. We use samples from the NPCI package [Dorie,

2016]. The randomized data is converted to an observational study by removing a biased subset of the

treated group. This set contains 747 samples with 25 covariates.We use the covariates and the treatment assignment from the real study. We use a procedure similar to

Hill [2011] to simulate the mediator and the outcome. The mediator M takes the form M ∼ (5X, �2um ),where X is the treatment. The response Y takes the form Y ∼ (10M +W b, 1) where W is the matrix of

standardized (zero mean and unit variance) covariates and values in the vector b are randomly sampled (0,

1, 2, 3, 4) with probabilities (0.5, 0.2, 0.15, 0.1, 0.05). The ground truth causal effect is 5 × 10 = 50.We evaluate our estimators and the three IF estimators — IF-Fulcher, IF-Restricted and IF-Frontdoor (Sec-

tion 5.2). We test the estimators on two settings of the parameter �um (Table 2, the Complete dataset setting).

The MSE values are computed across 1000 instantiations of the dataset created by simulating the mediators

and outcomes. We first evaluate the estimators on the complete dataset of 747 samples. We see that for�um = 2, the backdoor estimator dominates the frontdoor estimator whereas for �um = 1, the frontdoor esti-mator is better. In both cases, the combined estimator which uses bothmediators and confounders (Section

5) outperforms both estimators. Furthermore, we see that IF-Restricted outperforms IF-Frontdoor, show-

ing the value of leveraging the covariates. Moreover, IF-Restricted also outperforms IF-Fulcher, suggesting

that incorporating model restrictions improves performance.

Next, we randomly split the data into two sets, one with the confounder observed and the other with the

mediator observed, finding that the estimator that combines the datasets (Section 6) outperforms frontdoor

and backdoor estimators for various values of �um (Table 2, the Partial dataset setting). We compute the

MSE across 1000 instantiations of the dataset (we use a different random split in each iteration).

National JTPA Study The National Job Training Partnership Act (JTPA) Study was commissioned to

evaluate the effect of a job training program on future earnings. In this dataset, the treatment variable Xrepresents whether a participant signed up to receive JTPA services. The outcome variable Y represents

earnings 18 months after the study. The study had a randomized treatment and control group which

allows us to compute the ground truth treatment effect. There was also a non-randomized component

in the form of eligible non-participants (ENP). ENPs were individuals who were eligible but chose not to

participate. Moreover, there was non-compliance amongst the treated units. That is, some participants in

14

Table 2: MSE of on IHDP data. The complete dataset setting is when all of {W,X,M, Y} are observed andpartial is with two datasets with {X,M, Y} in one and {W,X, Y} in the other.

Estimator Dataset �um = 2 �um = 1Backdoor Complete 4.41 1.07

Frontdoor Complete 3.81 2.81

Combined Complete 3.61 0.93IF-Frontdoor Complete 4.15 2.07

IF-Restricted Complete 3.07 1.48

IF-Fulcher Complete 3.93 1.87

Backdoor Partial 10.17 2.43

Frontdoor Partial 8.19 4.94

Combined Partial 6.64 1.62

Table 3: MSE and variance of the estimators on JTPA data. The complete setting is when all among{W,X,M, Y} are observed. In the partial setting we have two datasets, one with {X,M, Y} and another

with {W,X, Y}.Estimator Dataset Variance MSE

Frontdoor Complete 40.9K 75.3K

Combined Complete 33.1K 70.1KIF-Frontdoor Complete 46.6K 77.9K

IF-Restricted Complete 40.4K 42.1KIF-Fulcher Complete 45.1K 46.2K

Frontdoor Partial 74.8K 115.1K

Combined Partial 79.5K 123.1K

the treatment group did not make use of the services [Heckman et al., 1997]. This lets us create a mediatorM which represents compliance. The study also collected data on a number of covariates (like race, study

location, age) which are the confoundersW in our notation.

On the dataset used by Glynn and Kashin [2019], we combine the treated units and ENP units to create

the observational data for our experiments. There are 3155 treated units and 1236 ENP units. The ground

truth treatment effect is 862.74. Glynn and Kashin [2018] showed that the backdoor estimator has high

bias, suggesting that there was unmeasured confounding. The frontdoor estimator, on the other hand, has

low bias and works well for this dataset. For this reason, we do not consider the backdoor estimator in our

comparisons.

We first compare the frontdoor estimator, the combined estimator (Section 5), IF-Restricted, IF-Frontdoor

and IF-Fulcher (Section 5.2). The results are shown in Table 3 (the “Complete” dataset setting). We compute

the variance and MSE using 1000 bootstrap iterations. The results show that the combined estimator has

lower variance and MSE than the frontdoor estimator. Similarly, IF-Restricted outperforms IF-Frontdoor,

reinforcing the utility of combined estimators. We also see that IF-Restricted outperforms IF-Fulcher, show-

ing that using model restrictions is valuable.

Next, we evaluate our procedure for the partially-observed setting (Section 6). We compute variance and

MSE across 1000 bootstrap iterations. At each iteration, we randomly split our dataset into two datasets

of equal size, one with revealed confounders, one with revealed mediators. Since the backdoor adjustment

works poorly for this study, we should not expect the revealed-confounder dataset to improve the frontdoor

15

estimator. And indeed, the frontdoor estimator outperforms the combined estimator (Table 3). Despite the

inapplicability of the backdoor estimator, the combined estimator does not suffer too badly.

8 Conclusion

Addressing two overidentified regimes, one where confounders and mediators are observed simultane-

ously, and another where they are observed in separate datasets, we introduced improved estimators for

both cases and confirmed their benefits experimentally. Extending our analysis to more general graphs,

and considering online decisions about which variables to observe are two promising future directions.

We are also interested in evaluating how our conclusions might be affected by model misspecification.

References

E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National

Academy of Sciences, 113(27):7345–7352, 2016.

E. Bareinboim et al. Causal inference and data-fusion in econometrics. Technical report, arXiv. org, 2019.

M. F. Bellemare and J. R. Bloem. The paper of how: Estimating treatment effects using the front-door

criterion. Working Paper, 2019.

X. Chen and A. Santos. Overidentification in regular models. Econometrica, 86(5):1771–1817, 2018.

A. Chinco and C. Mayer. Misinformed speculators and mispricing in the housing market. The Review of

Financial Studies, 29(2):486–522, 2016.

L. Cohen and C. J. Malloy. Friends in high places. American Economic Journal: Economic Policy, 6(3):63–91,

2014.

V. Dorie. Non-parametrics for causal inference. https://github.com/vdorie/npci, 2016.

M. L. Eaton. Chapter 8: The Wishart Distribution, volume Volume 53 of Lecture Notes–Monograph Se-

ries, pages 302–333. Institute of Mathematical Statistics, 2007. doi: 10.1214/lnms/1196285114. URL

https://doi.org/10.1214/lnms/1196285114.

R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.

I. R. Fulcher, I. Shpitser, S. Marealle, and E. J. Tchetgen Tchetgen. Robust inference on population indirect

causal effects: the generalized front door criterion. Journal of the Royal Statistical Society: Series B

(Statistical Methodology), 2020.

A. N. Glynn and K. Kashin. Front-door difference-in-differences estimators. American Journal of Political

Science, 61(4):989–1002, 2017.

A. N. Glynn and K. Kashin. Front-door versus back-door adjustment with unmeasured confounding: Bias

formulas for front-door and hybrid adjustments with application to a job training program. Journal of

the American Statistical Association, 113(523):1040–1049, 2018.

A. N. Glynn and K. Kashin. Replication Data for: Front-Door Versus Back-Door Adjustment With Unmea-

sured Confounding: Bias Formulas for Front-Door and Hybrid Adjustments With Application to a Job

Training Program, 2019. URL https://doi.org/10.7910/DVN/G7NNUL .

16

https://github.com/vdorie/npci

https://doi.org/10.1214/lnms/1196285114

https://doi.org/10.7910/DVN/G7NNUL

L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):

1029–1054, 1982.

J. J. Heckman, H. Ichimura, and P. E. Todd. Matching as an econometric evaluation estimator: Evidence

from evaluating a job training programme. The review of economic studies, 64(4):605–654, 1997.

L. Henckel, E. Perković, and M. H. Maathuis. Graphical criteria for efficient total effect estimation via

adjustment in causal linear models. arXiv preprint arXiv:1907.02435, 2019.

J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical

Statistics, 20(1):217–240, 2011.

C. Jackson, N. Best, and S. Richardson. Bayesian graphical models for regression on multiple data sets with

different variables. Biostatistics, 10(2):335–351, 2009.

T. C. Koopmans andO. Reiersøl. The identification of structural characteristics. The Annals of Mathematical

Statistics, 21(2):165–181, 1950.

T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.

J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.

J. Pearl. Causality. Cambridge university press, 2009.

J. Pearl. The foundations of causal inference. Sociological Methodology, 40(1):75–149, 2010.

E. Perković, J. Textor, M. Kalisch, and M. H. Maathuis. A complete generalized adjustment criterion. In

Uncertainty in Artificial Intelligence (UAI), 2015.

A. Rotnitzky and E. Smucler. Efficient adjustment sets for population average treatment effect estimation

in non-parametric causal graphical models. arXiv preprint arXiv:1912.00306, 2019.

J. D. Sargan. The estimation of economic relationships using instrumental variables. Econometrica: Journal

of the Econometric Society, pages 393–415, 1958.

U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds

and algorithms. In International Conference on Machine Learning (ICML), 2017.

C. Shi, D. Blei, and V. Veitch. Adapting neural networks for the estimation of treatment effects. InAdvances

in Neural Information Processing Systems, pages 2503–2513, 2019.

I. Shpitser and J. Pearl. Identification of conditional interventional distributions. In Uncertainty in Artifical

Intelligence (UAI), 2006.

C. Uhler. Gaussian graphical models: An algebraic and geometric perspective. Chapter in Handbook of

Graphical Models, 2019.

S. Wright. The method of path coefficients. Ann. Math. Statist., 5(3):161–215, 09 1934. doi: 10.1214/aoms/

1177732676.

17

A Covariance of a and cA.1 Frontdoor estimator

We prove that Cov(af , c) = 0 for the frontdoor estimator. The expressions for af and c arec = ∑ ximi∑ x2i= c + ∑ xiumi∑ x2i

(33)

af = ∑ x2i ∑miyi −∑ ximi ∑ xiyi∑ x2i ∑m2i − (∑ ximi)2= a + ∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 ,

(34)

where ei = − bd uxi + uyi . Using the fact the (ux , x) is bivariate normally distributed, we get

E[e|x] = b�2uxd(d2�2uw + �2ux )x= Fx, (35)

where F = b� 2uxd(d2� 2uw +� 2ux ) . The covariance then is

Cov(af , c) = E[(af − a)(c − c)]= E[E[(af − a)(c − c)|x,m]]= E[E[(∑ x2i ∑miei −∑ ximi ∑miei∑ x2i ∑m2i − (∑ ximi)2 )(∑ xiumi∑ x2i ) ||||x,m]]= E[(∑ x2i ∑miE[ei |x] −∑mixi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 )(∑ umi xi∑ x2i )] (36)

= E[F (∑ x2i ∑mixi −∑mixi ∑ x2i∑ x2i ∑m2i − (∑ ximi)2 )(∑ umi xi∑ x2i )]= 0,

where in Eq. 36 we used the expression from Eq. 35.

A.2 Combined estimator

We prove that Cov(ac , c) = 0 for the combined estimator from Section 5. The expressions for ac and c arec = ∑ ximi∑ x2i= c + ∑ xiumi∑ x2i

(37)

ac = ∑w2i ∑miyi −∑wimi ∑wiyi∑w2i ∑m2i − (∑wimi)2= a + ∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 .

(38)

18

The covariance is

Cov(ac , c) = E[(ac − a)(c − c)]= E[E[(ac − a)(c − c)|x,m, w]]= E[(∑w2i ∑miE[uyi ] −∑miwi ∑wiE[uyi ]∑w2i ∑m2i − (∑wimi)2 )(∑ umi xi∑ x2i )] (39)

= 0,where in 39 we used the fact that E[uyi ] = 0.B Unbiasedness of the estimators

B.1 Backdoor estimator

Recall that for the backdoor estimator, we take the coefficient of X in an OLS regression of Y on {X,W }.The outcome yi can be written as

yi = acxi + bwi + aumi + uyi .The error term aumi + uyi is independent of (xi , wi). In this case, the OLS estimator is unbiased. Therefore,E[acbackdoor] = ac.B.2 Frontdoor estimator

For the frontdoor estimator, we first compute c by taking the coefficient of X in an OLS regression of Mon X . The mediatormi can be written as

mi = cxi + umi .The error term umi is independent of xi . In this case, the OLS estimator is unbiased and hence, E[c] = c.We then compute af by taking the coefficient of M in an OLS regression of Y on {M,X}. The outcome yican be written as

yi = ami + bd xi − bd uxi + uy .In this case, the error term − bd uxi + uy is correlated with xi . The expression for a is given in Eq. 34. The

expectation E[af ] is E[af ] = a + E [∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 ]= a + E [E [∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2

||||x,m]]= a + E [∑ x2i ∑miE[ei |x] −∑ ximi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 ] (40)

= a + E [∑ x2i ∑mi(Fxi) −∑ ximi ∑ xi(Fxi)∑ x2i ∑m2i − (∑ ximi)2 ]= a,

19

where, in Eq. 40, the expression for E[ei |x] is taken from Eq. 35. Using the fact that Cov(af , c) = 0 (see

proof in Appendix A.1), we can see that the frontdoor estimator is unbiased asE[af c] = E[af ]E[c] + Cov(af , c)= ac.B.3 Combined estimator

In the combined estimator, the expression for c is the same as the frontdoor estimator. Therefore, as

shown in Appendix B.2, E[c] = c. We compute a by taking the coefficient of M in an OLS regression of Yon {M,W }. The outcome yi can be written as

yi = ami + bwi + uyi .The error term uyi is independent of (mi , wi). In this case, the OLS estimator is unbiased. Therefore, E[ac] =a. Using the fact that Cov(ac , c) = 0 (see proof Appendix A.2), we can see that the combined estimator is

unbiased as E[ac c] = E[ac]E[c] + Cov(ac , c)= ac.C Finite sample results for the variance

C.1 Frontdoor estimator

C.1.1 Variance of afFrom Eqs. 33 and 34, we know that

c = c + ∑ xiumi∑ x2iaf = a + ∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 .

where ei = − bd uxi + uyi . The following fact will be useful later in the derivations (taken from Eq. 12):

Var(e|x) = b2�2uw�2ux + �2uy (d2�2uw + �2ux )(d2�2uw + �2ux )= Ve . (41)

Note that Ve is a constant and does not depend on x .Let

A = ∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2C = ∑ xiumi∑ x2i .

20

First, we derive the expression for Var(af ) as follows,Var(af ) = Var(a + A)= Var(A)= Var(E[A|x,m]) + E[Var(A|x,m)]

= Var(E[∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2||||x,m]) + E[Var(A|x,m)]

= Var(E[∑ x2i ∑miE[ei |x] −∑ ximi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 ]) + E[Var(A|x,m)] (42)

= Var(E[∑ x2i ∑mi(Fxi) −∑ ximi ∑ xi(Fxi)∑ x2i ∑m2i − (∑ ximi)2 ]) + E[Var(A|x,m)]= E[Var(A|x,m)]= E [Var(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2

||||x,m)]= E [ 1(∑ x2i ∑m2i − (∑ ximi)2)2Var(∑ x2i ∑miei −∑ ximi ∑ xiei ||||x,m)]= E [ 1(∑ x2i ∑m2i − (∑ ximi)2)2Var(ei |xi)∑ x2i (∑ x2i ∑m2i − (∑ ximi)2)]= VeE [ ∑ x2i∑ x2i ∑m2i − (∑ ximi)2 ]= VeE [D] , (43)

where, in Eq. 42, we used the result from Eq. 35, and D = ∑ x2i∑ x2i ∑m2i −(∑ ximi )2 . Using the fact that D has

the distribution of a marginal from an inverse Wishart-distributed matrix, that is, if the matrix M ∼(Cov([M,X ])−1, n), then D = M1,1, in Eq. 43, we get

Var(af ) = VeE [D]= VeCov([M,X ])−11,1n − 2 − 1= b2�2uw�2ux + �2uy (d2�2uw + �2ux )(n − 3)(d2�2uw + �2ux )�2um ,

where the expression for Ve is taken from Eq. 41.

C.1.2 Covariance of a2f and c2We prove that Cov(a2f , c2) = Var(af )Var(c). This covariance can be written as

Cov(a2f , c2) = E[(a2f − E[a2f ])(c2 − E[c2])]= E[(a2f − Var(af ) − E2[af ])(c2 − Var(c2) − E2[c])]= E[a2f c2] − Var(af )Var(c) − a2Var(c) − c2Var(af ) − a2c2. (44)

21

We can write E[a2f c2] asE[a2f c2] = E[(a + A)2(c + C)2]= E[a2c2 + c2A2 + a2C2 + A2C2 + 2aAC2 + 2cCA2]= a2c2 + c2Var(a) + a2Var(c) + E[A2C2] + E[2aAC2] + E[2cCA2]. (45)

Substituting the result from Eq. 45 in Eq. 44, we get

Cov(a2f , c2) = E[A2C2] + E[2aAC2] + E[2cCA2]. (46)

Now we expand each term in Eq. 46 separately. E[2aAC2] isE[2aAC2] = 2aE [(∑ xiumi∑ x2i )2(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 )]

= 2aE [E [(∑ xiumi∑ x2i )2(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 ) |||||x,m]]= 2aE [(∑ xiumi∑ x2i )2(∑ x2i ∑miE[ei |x] −∑ ximi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 )] (47)

= 2aE [(∑ xiumi∑ x2i )2(∑ x2i ∑mi(Fxi) −∑ ximi ∑ xi(Fxi)∑ x2i ∑m2i − (∑ ximi)2 )]= 0, (48)

where, in Eq. 47, the expression for E[e|x] is taken from Eq. 35.

Next, we simplify E[2cCA2] asE[2cCA2]= 2cE [(∑ xiumi∑ x2i )(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 )2]= 2cE [E [(∑ xiumi∑ x2i )(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 )2 |||||x,m]]= 2cE [E [(∑ xiumi∑ x2i ) (∑ x2i )2(∑miei)2 + (∑ ximi)2(∑ xiei)2 − 2∑ x2i ∑miei ∑ ximi ∑ xiei(∑ x2i ∑m2i − (∑ ximi)2)2 |||||x,m]]= 2cE [(∑ xiumi∑ x2i )( Var(e|x)(∑ x2i )∑ x2i ∑m2i − (∑ ximi)2 +

F 2 (2(∑ x2i )2(∑ ximi)2 − 2(∑ x2i )2(∑ ximi)2)(∑ x2i ∑m2i − (∑ ximi)2)2 )]= 2cE [(∑ xiumi∑ x2i )( Var(e|x)(∑ x2i )∑ x2i ∑m2i − (∑ ximi)2)]= 2cVeE [(∑ xiumi∑ x2i )( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)]= 2cVeE [(c − c)( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)]= 2cVe (E [c( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)] − cE [( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)])= 2cVe (E [cD] − cE [D]) , (49)

22

where D = ∑ x2i∑ x2i ∑m2i −(∑ ximi )2 . Using the fact that c and D are independent of each other (see proof at the

end of this section), we get E[cD] = E[c]E[D]= cE[D]. (50)

Substituting the result from Eq. 50 in Eq. 49, we getE[2cCA2] = 2cVe (cE [D] − cE [D])= 0. (51)

We proceed similarly to Eq. 49 to write E[A2C2] asE[A2C2] = VeE[C2D].Then we further simplify E[A2C2] asE[A2C2] = VeE[C2D]= VeE[(c − c)2D]

= Ve (E [c2]E[D] − c2E[D])= Ve (Var(c)E[D] + E2 [c]E[D] − c2E[D])= VeVar(c)E[D] (52)

= VeVar(c)Cov([M,X ])−11,1n − 2 − 1= Ve(n − 3)�2um Var(c) (53)

= Var(af )Var(c), (54)

where, in Eq. 52, we used the fact that if thematrixM ∼ (Cov([M,X ])−1, n), thenD = M1,1 (that is,D has

the distribution of a marginal from an inverse Wishart-distributed matrix), and in Eq. 53, the expression

for Ve is taken from Eq. 41.

Substituting the results from Eqs. 48, 51, and 54 in Eq. 46, we get

Cov(a2f , c2) = Var(af )Var(c). (55)

Proof that c and D are independent. Let � be the following sample covariance matrix:

� = 1n [ ∑m2i ∑mixi∑mixi ∑x2i ] .The distribution of Σ is a Wishart distribution. That is, Σ ∼ (Cov([M,X ]), n). Then (�1,1 − �1,2�−12,2�2,1)and (�2,1,�2,2) are independent [Eaton, 2007, Proposition 8.7]. We can see that

�1,1 − �1,2�−12,2�2,1 = ∑m2i ∑x2i − (∑ximi)2∑x2i= 1D .

23

Therefore, we get

1D ⟂⟂ (∑x2i , ∑ ximi)⟹ D ⟂⟂ (∑x2i , ∑ ximi)⟹ D ⟂⟂ ∑ximi∑x2i⟹ D ⟂⟂ c.

C.2 Combined estimator

We prove that Cov(a2c , c2) = Var(ac)Var(c). We derive this in a similar manner as the frontdoor estimator

in Appendix C.1.2. From Eqs. 37 and 38, we know that

ac = a + ∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2c = c + ∑xiumi∑x2i .

Let

A = ∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2C = ∑xiumi∑x2i .

Then, similarly to Eq. 46, we get

Cov(a2c , c2) = E[A2C2] + E[2aAC2] + E[2cCA2]. (56)

Now we simplify each term in Eq. 56 separately. E[2aAC2] can be simplified as

E[2aAC2] = 2aE [(∑ xiumi∑ x2i )2(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 )]= E [E [(∑ xiumi∑ x2i )2(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 ) |||||x,m, w]]= E [(∑ xiumi∑ x2i )2(∑w2i ∑miE[uyi ] −∑wimi ∑wiE[uyi ]∑w2i ∑m2i − (∑wimi)2 )] (57)

= 0, (58)

where, in Eq. 57, we used the fact that E[uy] = 0.

24

Next, we simplify E[2cCA2] asE[2cCA2] = 2cE [(∑ xiumi∑ x2i )(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 )2]

= 2cE [E [(∑ xiumi∑ x2i )(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 )2 |||||x,m, w]]= 2cE [(∑ xiumi∑ x2i )Var(uy )( ∑w2i∑w2i ∑m2i − (∑wimi)2)]= 2c�2uyE [CD] , (59)

where D = ∑w2i∑w2i ∑m2i −(∑wimi )2 . We can upper bound the expression in Eq. 59 as

E[2cCA2] = 2c�2uyE [CD]≤ 2|c|�2uyE [CD] (60)

≤ 2|c|�2uy√E[C2]E[D2]

= 2|c|�2uy√Var(c)(Var(D) + E[D]2) (61)

= 2|c|�2uy√Var(c)(2 [(Cov([M,W ]))−11,1]2(n − 2 − 1)2(n − 2 − 3) +( (Cov([M,W ]))−11,1n − 2 − 1 )2)

= 2|c|�2uy√Var(c)√ 2(n − 3)2(n − 5)(c2�2ux + �2um )2 +

1(n − 3)2(c2�2ux + �2um )2= 2|c| �2uy(n − 3)(c2�2ux + �2um )

√Var(c)√n − 3n − 5

= 2|c|Var(a)√Var(c)√n − 3n − 5 , (62)

where, in Eq. 60, we used the Cauchy–Schwarz inequality, and in Eq. 61, we used the fact that if the matrixM ∼ (Cov([M,W ])−1, n), then D = M1,1 (that is, D has the distribution of a marginal from an inverse

Wishart-distributed matrix).

Similarly to Eq. 59, we simplify E[A2C2] asE[A2C2] = �2uyE[C2D]. (63)

25

The expression in Eq. 63 can be upper bounded using the Cauchy-Schwarz inequality asE[A2C2] = �2uyE[C2D]≤ �2uy

√E[C4]E[D2]= �2uy

√E[C4](Var(D) + E2[D])= �2uy

√E[C4](2 [(Cov([M,W ]))−11,1]2(n − 2 − 1)2(n − 2 − 3) +( (Cov([M,W ]))−11,1n − 2 − 1 )2)= �2uy(n − 3)(c2�2ux + �2um )

√n − 3n − 5√E[C4]

= Var(ac)√n − 3n − 5

√E[C4]. (64)

26

We can simplify E[C4] as follows,E[C4] = E [(∑ xiumi∑ x2i )4]

= E [E [(∑ xiumi∑ x2i )4 ||||x]]= E [ 1(∑ x2i )4E [(∑xiumi )4 ||||x]]= E [ 1(∑ x2i )4

{Var((∑ xiumi )2 |||x) + E [(∑ xiumi )2 |||x]2}]

= E [ 1(∑ x2i )4 {Var((∑ xiumi )2 |||x) + �4um (∑ x2i )2}]= E [ 1(∑ x2i )4

{Var(�2um ∑ x2i (∑ xiumi )2�2um ∑ x2i

||||x) + �4um (∑ x2i )2}]= E [ 1(∑ x2i )4

{�4um (∑ x2i )2 Var( (∑ xiumi )2�2um ∑ x2i||||x) + �4um (∑ x2i )2}] (65)

= E [ 1(∑ x2i )4 {�4um (∑ x2i )2 2 + �4um (∑ x2i )2}]= E [ 1(∑ x2i )4 {3�4um (∑ x2i )2}]= 3�4umE [ 1(∑ x2i )2 ]= 3�4um [Var( 1∑ x2i ) + E [ 1∑x2i ]

2] (66)

= 3�4um [ 2(n − 2)2(n − 4)(d2�2uw + �2ux )2 + 1(n − 2)2(d2�2uw + �2ux )2 ]= 3 �4um(n − 2)2(d2�2uw + �2ux )2 [n − 2n − 4]= 3 (Var(c2))2 [n − 2n − 4] , (67)

where, in Eq. 65, we used the fact that(∑ xiumi )2� 2um ∑ x2i

||||x has a Chi-squared distribution, that is,(∑ xiumi )2� 2um ∑ x2i

||||x ∼ � 2(1),and in Eq. 66, we used the fact that 1∑ x2i has a scaled inverse Chi-squared distribution, that is, 1∑ x2i ∼Scale-inv-� 2 (n, (d2� 2uw +� 2ux )2n ).Substituting the result from Eq. 67 in Eq. 64, we getE[A2C2] ≤ Var(ac)Var(c)√3(n − 3)(n − 2)(n − 5)(n − 4) . (68)

27

Substituting the results from Eqs. 58, 62, and 68 in Eq. 56, we get

Cov(a2c , c2) ≤√n − 3n − 5 (2|c|Var(ac)√Var(c) + √3√n − 2n − 4Var(ac)Var(c)) .

D Comparison of combined estimator with backdoor and frontdoor es-

timators

In this section, we provide more details on the comparison of the combined estimator presented in 5 to the

backdoor and frontdoor estimators.

D.1 Comparison with the backdoor estimator

In Section 5.1, we made the claim that

∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(ac)backdoor.In this case, by comparing Eqs. 9 and 26, we have

N = 2(�4uxF + d2�2uw (�2umD + �2uxF ) + c2�6ux√c2�6uxD2(F + 2√3�2um ))�2umD2 ,

where D = d2�2uw +�2ux , E = c2�2ux +�2um , and F = E + (1+2√3)�2um . Thus, for a large enough n, the combined

estimator has lower variance than the backdoor estimator for all model parameter values.

D.2 Comparison with the frontdoor estimator

In Section 5.1, we made the claim that

∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(af c).In this case, by comparing Eqs. 16 and 26, we have

N = 2(�6um + 2√3c2�4um�2ux − c4�2um�4ux + (D + �4um√�4um + 4√3c2�2um�2ux − 2c4�4ux))D2 ,

whereD = c6�6ux . Thus, for a large enough n, the combined estimator has lower variance than the frontdoor

estimator for all model parameter values.

D.3 Combined estimator dominates the better of backdoor and frontdoor

In this section, we provide more details for the claim in Section 5.1 that the combined estimator can dom-

inate the better of the backdoor and frontdoor estimators by an arbitrary amount. We show that the

quantity

R = min{Var(ac)backdoor ,Var(af c)}Var(ac c)28

is unbounded.

We do this by considering the case when Var(ac)backdoor = Var(af c). Note thatVar(ac)backdoor = Var(af c)

⟹ b =√−D(−a2�4um ((n − 2)d2�2uw + �2ux ) + (−(n − 2)d2�2uw (�2um − c2�2ux ) + �2uxE)�2uy )�2uw�4ux (2�2um + (n − 2)c2D) , (69)

where D = d2�2uw + �2ux , and E = (n − 2)c2�2ux − (n − 4)�2um . Hence, if the parameter b is set to the value

given in Eq. 69, the backdoor and frontdoor estimators will have equal variance. We have to ensure that

the value of b is real. b will be a real number if

|c| ≤ �um�ux√1 − 2�2ux(n − 2)D , and

n > 2.For the value of b in Eq. 69, the quantity R becomes

R = Var(ac)backdoorVar(ac c)

≥ (n − 2)DE(a2�2um + �2uy )�2ux ((n − 3)a2�2umE + �2uy (�2um + √3�2um (r1r2 + |c|(n − 2)D(|c| + r1 �um√(n−2)D)))) ,

where D = d2�2uw + �2ux , E = c2�2ux + �2um , r1 = √n−3n−5 and r2 = √n−2n−4 .R does not depend on the parameter b. It is possible to set the other model parameters in a way that allowsR to take any positive value. In particular, it can be seen that as �ux → 0, R → ∞, which shows that R is

unbounded.

29

E Combining Partially Observed Datasets

E.1 Cramer-Rao Lower Bound

Here we present the closed form expression for the asymptotic variance of the estimator presented in

Section 6, that is, (I−1)1,1. Let (I−1)1,1 = XY . ThenX = (a2�2um + �2uy )(a8d2(k + 1)(�2um )5�2uw (d2�2uw + �2ux )2 + a6(�2um )3(d2�2uw + �2ux )(b2�2uw�2ux (c2d4(k + 1)2

(�2uw )2 + d2�2uw (c2(2k2 + 3k + 1)�2ux + (k2 + 4k + 1)�2um ) + (k + 1)�2ux (c2k�2ux + k�2um + �2um )) + (k + 1)�2uy(d2�2uw + �2ux )(c2d4(k + 1)(�2uw )2 + d2�2uw (c2(2k + 1)�2ux + (k + 3)�2um ) + k�2ux (c2�2ux + �2um ))) + 4a5bcdk(�2um )3�2uw�2ux (d2�2uw + �2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + a4(�2um )2(b4(�2uw )2(�2ux )2(c2d4(k + 1)2(�2uw )2 + d2�2uw (2c2(k2 + 3k + 1)�2ux + (2k2 + 3k + 1)�2um ) + �2ux(c2(k2 + 4k + 1)�2ux + 2(k + 1)2�2um )) + b2�2uw�2ux�2uy (d2�2uw + �2ux )(2c2d4(k2 + 3k + 2)(�2uw )2 + d2�2uw(c2(4k2 + 13k + 5)�2ux + (4k2 + 8k + 2)�2um ) + �2ux (c2(2k2 + 7k + 1)�2ux + (4k2 + 6k + 2)�2um )) + (�2uy )2(d2�2uw + �2ux )2(c2d4(k2 + 4k + 3)(�2uw )2 + d2�2uw (c2(2k2 + 7k + 3)�2ux + (2k2 + 5k + 3)�2um ) + k�2ux(c2(k + 3)�2ux + 2(k + 1)�2um ))) + 4a3bcdk(�2um )2�2uw�2ux�2uy (d2�2uw + �2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + a2�2um (b2�2uw�2ux + �2uy (d2�2uw + �2ux ))(b4(k + 1)(�2uw )2(�2ux )2(c2(d2(k + 1)�2uw + (k + 2)�2ux ) + (k + 1)�2um ) + b2�2uw�2ux�2uy (2c2d4(k + 1)2(�2uw )2+2d2�2uw (k2(2c2�2ux + �2um ) + k(5c2�2ux + �2um ) + 2c2�2ux ) + �2ux (2c2(k2 + 3k + 1)�2ux + (2k2 + 3k + 1)�2um ))+(�2uy )2(d2�2uw + �2ux )(d2(k + 1)�2uw + k�2ux )(c2(k + 3)(d2�2uw + �2ux ) + (k + 1)�2um )) + c2(k + 1)(b2�2uw�2ux + �2uy (d2�2uw + �2ux ))2(b4(k + 1)(�2uw )2(�2ux )2 + b2�2uw�2ux�2uy (2d2k�2uw + 2k�2ux + �2ux ) + (�2uy )2(d2�2uw + �2ux )(d2(k + 1)�2uw + k�2ux ))),

and

Y = (a2�2um (d2�2uw + �2ux ) + b2�2uw�2ux + �2uy (d2�2uw + �2ux ))(a6d2(k + 1)(�2um )4�2uw(d2�2uw + �2ux )2 + a4(�2um )2(d2�2uw + �2ux )(b2�2uw�2ux (d2k�2uw (c2(k + 1)�2ux + 2�2um ) + (k + 1)�2ux (c2k�2ux + k�2um + �2um )) + (k + 1)�2uy (d2�2uw + �2ux )(d2�2uw(c2k�2ux + 3�2um ) + k�2ux (c2�2ux + �2um ))) + 4a3bcdk(�2um )2�2uw�2ux (d2�2uw+�2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + a2�2um (b4(�2uw )2(�2ux )2(d2�2uw (2c2k�2ux + k�2um + �2um ) + �2ux (2c2k�2ux + (k + 1)2�2um )) + 2b2�2uw�2ux�2uy (d2�2uw + �2ux )(2d2k�2uw (c2�2ux + �2um ) + �2ux (2c2k�2ux + (k + 1)2�2um )) + (�2uy )2(d2�2uw + �2ux )2(d2�2uw(2c2k�2ux + 3(k + 1)�2um ) + k�2ux (2c2�2ux + (k + 2)�2um ))) + 4abcdk�2um�2uw�2ux�2uy (d2�2uw+�2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + b6c2k(k + 1)(�2uw )3(�2ux )4 + b4(k + 1)(�2uw )2(�2ux )2�2uy (d2�2uw + �2ux )(3c2k�2ux + �2um ) + b2�2uw�2ux (�2uy )2(d2�2uw + �2ux )(d2k�2uw (3c2(k + 1)�2ux + 2�2um ) + �2ux (3c2k(k + 1)�2ux + 2k�2um + �2um )) + (�2uy )3(d2�2uw + �2ux )2(d2(k + 1)�2uw (c2k�2ux + �2um ) + k�2ux (c2(k + 1)�2ux + �2um ))),where k = PQ .

30

Documents

March 27, 2020 › pdf › 2003.11991v1.pdfShantanu Gupta, Zachary C. Lipton, David Childers Carnegie Mellon University [email protected],[email protected],[email protected]