Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
arX
iv:2
003.
1199
1v1
[st
at.M
E]
26
Mar
202
0
Estimating Treatment Effects with Observed
Confounders and Mediators
Shantanu Gupta, Zachary C. Lipton, David Childers
Carnegie Mellon [email protected],[email protected],[email protected]
March 27, 2020
Abstract
Given a causal graph, the do-calculus can express treatment effects as functionals of the observa-tional joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multi-ple valid formulae, prompting us to compare the statistical properties of the corresponding estimators.For example, the backdoor formula applies when all confounders are observed and the frontdoor for-mula applies when an observed mediator transmits the causal effect. In this paper, we investigate theover-identified scenario where both confounders and mediators are observed, rendering both estima-tors valid. Addressing the linear Gaussian causal model, we derive the finite-sample variance for bothestimators and demonstrate that either estimator can dominate the other by an unbounded constantfactor depending on the model parameters. Next, we derive an optimal estimator, which leverages allobserved variables to strictly outperform the backdoor and frontdoor estimators. We also present aprocedure for combining two datasets, with confounders observed in one and mediators in the other.Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets.
1 Introduction
Causal effects are not, in general, identifiable from observational data alone. Consequently, scientists typi-
cally rely on controlled experiments to estimate causal quantities such as treatment effects. The fundamen-
tal insight of causal inference is that given structural assumptions on the data generating process, causal
effectsmay become expressible as functionals of the joint distribution over observed variables. Thus, much
work on causal graphical models has focused on problems of identification—determining if a functional
capturing the desired causal quantity exists and deriving it. The do-calculus, introduced by Pearl [1995],
provides a set of three rules that can be used to convert causal quantities into such functionals. Moreover,
the do-calculus has been proven to be complete by Shpitser and Pearl [2006] who introduce conditions to
determine when a desired causal effect is identifiable and an algorithm to express it as a functional when-
ever the condition holds. Once the desired functional has been identified, we can estimate it using only
observational data.
We are motivated by the observation that, for some causal graphs, treatment effects may be overidentified.
Here, applications of the do-calculus produce distinct functionals, all of which, subject to positivity con-
ditions, yield consistent estimators of the same causal effect. Consider a causal graph (see Figure 1) for
which the treatment X , mediatorM , confounderW , and outcome Y are all observable. Using the backdoor
adjustment, we can expresses the average treatment effect of X on Y as a function of P(X ,W , Y ), while1
X M Y
W
Figure 1: Causal graph with observed mediator and confounder, rendering both backdoor and frontdoor
estimators viable.
the frontdoor adjustment expresses that same causal quantity via P(X ,M, Y ) [Pearl, 1995]. Faced with the
(fortunate) condition of overidentification, our focus shifts from identification—is our effect estimable?, to
optimality—which among multiple valid estimators dominates from a standpoint of statistical efficiency?
In this paper, we address this very graph, focusing our analysis on the linear causal model [Wright, 1934],
a central object of study in causal inference and econometrics. We also adapt existing semi-parametric
estimators to this graph.
Deriving the finite sample variance of the backdoor and frontdoor estimators, and precisely characterizing
conditions under which each dominates, we find that either may outperform the other to an arbitrary
degree depending on the underlying model parameters. These expressions can provide guidance to a
practitioner for assessing the suitability of each estimator. For example, one byproduct of our analysis is
to characterize what properties make for the “ideal mediator”. Moreover, in the data collection phase, if
one has a choice between collecting data on the mediator or the confounder, these expressions, together
with the practitioner’s beliefs about likely ranges for model parameters, can be used to decide what data
to collect.
Next, we propose techniques that leverage both observed confounders and mediators. For the setting
where we simultaneously observe both the confounder and the mediator, we introduce an estimator that
optimally combines all information. We prove theoretically that this method achieves lower mean squared
error (MSE) than both the backdoor and frontdoor estimators, for all settings of the underlying model
parameters. Moreover, the extent to which this estimator can dominate the better of the backdoor and
frontdoor estimators is unbounded. Subsequently, we consider the partially-observed setting in which
two datasets are available, one with observed confounders (but not mediators) {(X ,W , Y )}ni=1, and anotherwith observed mediators (but not confounders) {(X ,M, Y )}mi=1. Interestingly, the likelihood is convex givensimultaneous observations but non-convex under partially-observed data. We introduce and empirically
validate a heuristic that is guaranteed to achieve higher likelihood than either the backdoor or frontdoor
estimators.
Through a series of experiments, we evaluate our methods on both synthetic, semi-synthetic, and real
datasets. We show that the backdoor and frontdoor estimators exhibit different variance depending on
the model parameters as predicted by our variance formulae. Our proposed estimators that combine con-
founders and mediators always exhibit lower variance and MSE than the backdoor and frontdoor estima-
tors when our model assumptions are satisfied.
2
2 Related Work
The backdoor adjustment, which formalizes the common practice of controlling for known confounders, is
widely applied and studied in statistics and econometrics [Pearl, 2009, 2010, Perković et al., 2015]. More re-
cently, the frontdoor adjustment, which leverages observed mediators to identify causal effects, even amid
unobserved confounding, has seen increasing application in real-world datasets [Bellemare and Bloem,
2019, Glynn and Kashin, 2018, 2017, Chinco and Mayer, 2016, Cohen and Malloy, 2014].
In the most similar work to ours, Glynn and Kashin [2018] compare the frontdoor and backdoor adjust-
ments, computing bias (but not variance) formulas for each and performing sensitivity analysis. Exploring
a real-world job training dataset, they demonstrate that the frontdoor estimator outperforms its back-
door counterpart (in terms of bias). Henckel et al. [2019] analyze linear causal models where multiple
adjustments sets are available for the backdoor criterion. They introduce a graphical criterion for com-
paring the asymptotic variances of the adjustment sets and identifying the one with lowest variance.
Rotnitzky and Smucler [2019] extend this work and show that the same graphical criterion is valid even
when the causal model is non-parametric. They also present a semi-parametric efficient estimator that
exploits the conditional independences in a causal graph.
Researchers have also worked to generalize the frontdoor criterion. Bareinboim et al. [2019] introduce the
conditional frontdoor criterion, allowing for both treatment-mediator confounders and mediator-outcome
confounders. Fulcher et al. [2020] propose a method for including observed confounders along with a
mediator with discrete treatments.
The study of overidentified models dates at least back to Koopmans and Reiersøl [1950]. Sargan [1958],
Hansen [1982] formalized the result that in the presence of overidentification, multiple estimators can be
combined to improve efficiency. This was extended to the non-parametric setting by Chen and Santos
[2018]. A related line of work considers methods for combining multiple datasets for causal inference.
Bareinboim and Pearl [2016] study the problem of handling biaseswhile combining heterogeneous datasets,
while Jackson et al. [2009] present Bayesian methods for combining datasets with different covariates and
some common covariates.
3 Preliminaries
In this work, weworkwithin the structural causal model (SCM) framework due to Pearl [2009], formalizing
causal relationships via directed acyclic graphs (DAGs). Each X → Y edge in this DAG indicates that the
variable X is (potentially) a direct cause of variable Y . Informally, the edge indicates that Y listens to X in
the sense that the function that determines its value depends upon X . However, the DAG does not capture
the magnitude of the effect. All measured variables are deterministic functions of their parents and a set of
jointly independent per-variable noise terms. As an illustration, consider the graph in Figure 1 with four
variables {W , X ,M, Y}. Here each variable V ∈ {W , X ,M, Y} can be expressed as
V = fV (Pa(V ), uV ), (1)
where fV is some deterministic function, Pa(V ) denotes V ’s parents in the graph, and uV is the correspond-
ing noise term. Eq. 1 is known as a structural equation.
Linear Gaussian SCM In linear Gaussian SCMs, each variable is assumed to be a linear function of its
parents. The noise terms are assumed to be additive and Gaussian. In this paper, we work with the linear
3
Gaussian SCM for the overidentified confounder-mediator graph (Figure 1), where the structural equations
can be written as
wi = uwi ,xi = dwi + uxi ,mi = cxi + umi ,yi = ami + bwi + uyi ,
uwi ∼ (0, �2uw )uxi ∼ (0, �2ux )umi ∼ (0, �2um )uyi ∼ (0, �2uy ).(2)
Here, wi, xi , mi , yi are realized values of the random variablesW,X,M, Y , respectively, and uwi , uxi , umi , uyiare realized values of the corresponding noise terms. In Eq. 2, we assume that the variables have zero
mean. This is done to simplify analysis, but this assumption is not necessary for the results presented in
the paper.
3.1 The Backdoor and Frontdoor Adjustments
The effect of a treatmentX is expressible by reference to the post-intervention distributions of the outcomeY for different values of the treatment X = x . Informally, an intervention do(X = x) means that we
manually set X to the value x for all instances, regardless of what value it would have assumed naturally.
More formally, an intervention do(X = x) in a causal graph can be expressed via the mutilated graph that
results from deleting all incoming arrows to X , setting X ’s value to X = x for all instances, while keeping
the SCM otherwise identical. This distribution is denoted as P(Y |do(X = x)).The backdoor and frontdoor adjustments [Pearl, 2009] express treatment effects as functionals of the ob-
servational distribution. Consider our running example of the causal model in Figure 1. We denote X as
the treatment, Y as the outcome, W as a confounder, and M as a mediator. Our goal is to estimate the
causal quantity P(Y |do(X = x)).Backdoor Adjustment When all confounders of both X and Y are observed—in our example,W—then
the causal effect of X on Y , i.e., P(Y |do(X = x)) can be written as
P(Y |do(X = x)) = ∑w P(Y |X = x,W = w)P(W = w). (3)
Frontdoor Adjustment This technique applies even when the confounder W is unobserved. Here we
require access to a mediatorM that (i) is observed; (ii) transmits the entire causal effect from X to Y ; and(iii) is not influenced by the confounder W given X . The effect of X on Y is computed in two stages. We
first find the effect of X onM :
P(M = m|do(X = x)) = P(M = m|X = x). (4)
Then we find the effect ofM on Y , i.e., P(Y |do(M = m)):P(Y |do(M = m)) = ∑x P(Y |M = m, X = x)P(X = x). (5)
We can then write the causal effect of X on Y as
P(Y |do(X = x)) = ∑m P(M = m|do(X = x))P(Y |do(M = m)).4
3.2 Ordinary Least Squares (OLS) Regression
Since we use OLS regression in later sections, we briefly review OLS estimators. We consider the following
setup: y = X� + e, (6)
where y and e are n × 1 vectors, X is an n × d matrix of observations, and � is the d × 1 coefficient vector
that we want to estimate. If e ⟂⟂ X and e ∼ (0, �2e In), where In is the n × n identity matrix, then the OLS
estimate of � is � = (X⊤X)−1X⊤y= � + (X⊤X)−1X⊤e, (7)
with E[�] = � and Var(�) = �2e E[(X⊤X)−1]. If each row Xi of X is sampled from Xi i.i.d.∼ (0,�), then the
distribution of (X⊤X)−1 is an Inverse-Wishart distribution. Then the variance of � is
Var(�) = �2e �−1n − d − 1 . (8)
4 Variance of Backdoor & Frontdoor Estimators
In this section, we analyze the backdoor and frontdoor estimators and characterize the regimes where each
dominates. We work with the linear SCM described in Eq. 2. Throughout, our goal is to estimate the causal
effect of X on Y . In terms of the underlying parameters of the linear SCM, the quantity that we wish to
estimate is ac. Absent measurement error, both estimators are unbiased (see proof in Appendix B) and
thus we focus our comparison on their respective variances.
Variance of the Backdoor Estimator The backdoor estimator requires only thatwe observe {X , Y ,W}(but not necessarily the mediator M). Say we observe the samples {xi , yi , wi}ni=1. The outcome yi can be
written as
yi = acxi + bwi + aumi + uyi .We can estimate the causal effect ac by taking the coefficient on X in an OLS regression of Y on {X ,W}.This controls for the confounderW and corresponds naturally to the adjustment described in Eq. 3.
Let � = Cov([X ,W ]). Using Eq. 8, the variance of the backdoor estimator is
Var(ac)backdoor = Var(aum + uy ) (�−1)1,1n − 3= a2�2um + �2uy(n − 3)�2ux . (9)
Thus, the asymptotic variance of the backdoor estimator is
limn→∞Var(√n(ac − ac)) = a2�2um + �2uy�2ux . (10)
5
Variance of the Frontdoor Estimator The frontdoor estimator is used when {X, Y ,M} samples are
observed and the confounder W is unobserved. The causal effect ac is estimated in two stages. First, we
estimate c by taking the coefficient on X in an OLS regression of M on X . Let the estimate be c. Thiscorresponds to the adjustment in Eq. 4. Then, we estimate a by taking the coefficient on M in an OLS
regression of Y on {M,X}. Let the estimate be af . This corresponds to the adjustment in Eq. 5.
The regression of M on X can be written as mi = cxi + umi . Let Σc = Var(X ). Using Eq. 8, Var(c) isVar(c) = Var(um)(Σ−1c )n − 2
= �2um(n − 2)(d2�2uw + �2ux ) . (11)
The regression of Y on {M, X} can be written as yi = ami + bd xi + ei , where ei = − bd uxi +uyi . In this case, the
error ei is not independent of the regressor xi . Using the fact that (ux , x) has a bivariate normal distribution,
Var(e|x) isVar(e|x) = b2�2uw�2ux + �2uy (d2�2uw + �2ux )(d2�2uw + �2ux ) . (12)
Let �a = Cov([M, X]). Then, using a formula analogous to Eq. 8 for non-independent errors (see Appendix
C.1.1 for the proof), the variance of af isVar(af ) = E [Var(e|x)(�−1a )1,1(n − 3) ]
= b2�2uw�2ux + �2uy (d2�2uw + �2ux )(n − 3)(d2�2uw + �2ux )�2um , (13)
where the expression for Var(e|x) is taken from Eq. 12.
The variance of the product of two random variables can be written as
Var(af c) = Cov(a2f , c2) + (Var(af ) + E2[af ])(Var(c) + E2[c]) − (Cov(af , c) + E[af ]E[c])2 (14)= Cov(a2f , c2) + (Var(af ) + a2)(Var(c) + c2]) − (Cov(af , c) + ac)2, (15)
where in Eq. 14 we used the facts that E[af ] = a, and E[c] = c (see proof in Appendix B.2).
Using the facts that Cov(af , c) = 0 (see proof in Appendix A.1), and Cov(a2f , c2) = Var(af )Var(c) (see proofin Appendix C.1.2) in Eq. 15, we can express the finite sample variance of the frontdoor estimator as
Var(af c) = c2Var(af ) + a2Var(c) + 2Var(af )Var(c), (16)
where Var(c) and Var(af ) are defined in Eqs. 11 and 13, respectively. Thus, the variance of the frontdoor
estimator can be written as
Var(√naf c) = c2Var(√naf ) + a2Var(√nc) +( 1n) . (17)
6
The Ideal Frontdoor Mediator A natural question that arises is what properties make a mediator best
suited for applying the frontdoor adjustment. Under what conditions is the frontdoor estimator most
precise? We can see that Var(af c) is non-monotonic in the mediator noise �um . Our variance expressionsmake clear that if �um is high, it becomes difficult to estimate c (Eq. 11). However, when �um is low, our
estimate of a has high variance (Eq. 13). Eq. 16 provides us with guidance. Var(af c) is a convex function
of �2um . The ideal mediator will have noise variance �2∗um which minimizes Eq. 16. That is,
�2∗um = argmin� 2um[a2Var(c) + c2Var(af ) + 2Var(af )Var(c)]
⟹ �2∗um = |c|√b2�2uw�2ux + �2uy (d2�2uw + �2ux )|a| √n − 2n − 3 .4.1 Comparison of Backdoor and Frontdoor Estimators
The relative performance of the backdoor and frontdoor estimators depend on the underlying SCM’s pa-
rameters. Using Eqs. 9 and 16, the ratio of the backdoor to frontdoor variance is
RVar = Var(ac)backdoorVar(af c)
= (n − 2)�2umD2(a2�2um + �2uy )�2ux ((n − 3)a2�4umD + (2�2um + c2(n − 2)D)(b2�2uw�2ux + �2uyD)) , (18)
where D = (d2�2uw + �2ux ).The backdoor estimator dominates when RVar < 1 and vice versa when RVar > 1. Note that there exist
parameters that cause any value of RVar > 0. In particular, as �2ux → 0, RVar → ∞ and as �2ux → ∞,RVar → 0, regardless of the sample size n. Thus, either estimator can dominate the other by any arbitrary
constant factor.
Note that in this section, we characterize the variance of the estimators in terms of the underlying causal
model parameters which are unknown. One practical way to operationalize these expressions would be to
test if one model dominates in parameter ranges given by prior knowledge of the problem. Alternatively,
one could estimate the model parameters choosing the adjustment with lowest variance based on plug-in
estimates (analogous to Explore-then-Commit [Lattimore and Szepesvári, 2019, Ch. 6]).
5 Combining Mediators & Confounders
Having characterized the performance of each estimator separately, we can now consider optimal strate-
gies for estimating treatment effects in the overidentified regime, where we observe both the confounder
and mediator simultaneously. In other words, we observe N samples {xi , yi , wi , mi}Ni=1. We show that it is
possible to construct an estimator that is strictly better than the backdoor and frontdoor estimators.
We compute the maximum likelihood estimator (MLE). The MLE will be optimal since our model satisfies
the necessary regularity conditions for MLE optimality (by virtue of being linear and Gaussian). Let the
vector si = [xi , yi , wi , mi] denote the ith sample. Since the data is multivariate Gaussian, the log-likelihood
of the data can be written as
= −N2 [log (det�) + Tr (��−1)] , (19)
7
where � = Cov([X , Y ,W ,M]) (the true covariance matrix) and � = 1N ∑Ni=1 sisi⊤ (the sample covariance
matrix).
The MLE for a Gaussian graphical model is �MLE = � [Uhler, 2019]. Let the MLE estimates for parametersc and a be c and ac , respectively. Thenc = �1,4�1,1 (20)
ac = �1,4�3,3 − �1,3�3,4�3,3�4,4 − �23,4 . (21)
The MLE estimate for c in Eq. 20 is the same as for the frontdoor—the coefficient of X in an OLS regression
of M on X . The MLE estimate for a in Eq. 21 is the coefficient on M in an OLS regression of Y on{M,W}. Contrast this with the estimate of a for the frontdoor estimator where we regress Y on {M, X}.Conditioning on W instead of X allows the combined estimator to improve performance.
We can write the regression of Y on {M,W} as yi = ami + bwi + uyi . Let �ac = Cov([M,W ]). Using Eq. 8,
we get
Var(ac) = Var(uyi )(�−1ac )1,1n − 3= �2uy(n − 3)(c2�2ux + �2um ) . (22)
The variance of c is the same as the frontdoor case (Eq. 11), that is
Var(c) = �2um(n − 2)(d2�2uw + �2ux ) . (23)
The variance of the product of two random variables can be written as
Var(ac c) = Cov(a2c , c2) + (Var(ac) + E2[ac])(Var(c) + E2[c]) − (Cov(ac , c) + E[ac]E[c])2 (24)= Cov(a2c , c2) + (Var(ac) + a2)(Var(c) + c2]) − (Cov(ac , c) + ac)2, (25)
where in Eq. 24 we used the facts that E[ac] = a, and E[c] = c (see proof in Appendix B.3).
Using the facts that Cov(ac , c) = 0, andCov(a2c , c2) ≤
√n − 3n − 5 (2|c|Var(ac)√Var(c) + √3√n − 2n − 4Var(ac)Var(c)) ,that are proved in Appendices A.2 and C.2, respectively in Eq. 25, we can upper bound the finite sample
variance of the combined estimator as
Var(ac c) ≤ c2Var(ac) + a2Var(c) +√n − 3n − 5 (2|c|Var(ac)√Var(c) + √3√n − 2n − 4Var(ac)Var(c)) , (26)
where Var(c) and Var(ac) are defined in Eqs. 23 and 22, respectively. Thus, the variance of the combined
estimator can be written as
Var(√nac c) ≤ c2Var(√nac) + a2Var(√nc) +( 1√n) . (27)
8
The Ideal Mediator Just as with the frontdoor estimator, we can ask what makes for an ideal mediator
when using the combined estimator. Eq. 27 shows that limn→∞ Var(√nac c) is a convex function of the
variance of the noise term of the mediator �2um . The ideal mediator will have noise variance �2∗um which
minimizes the variance in Eq. 27. This means that
�2∗um = argmin� 2um [ limn→∞ (a2Var(√nc) + c2Var(√nac))]⟹ �2∗um = max {0, vmin} ,
where
vmin = |c|�uy√d2�2uw + �2ux|a| − c2�2ux .5.1 Comparison with Backdoor and Frontdoor Estimators
We can compare Eqs. 10 and 27 too see that, asymptotically, the combined estimator has lower variance
than the backdoor estimator for all values of model parameters. That is, as n → ∞,
Var(√nac c) ≤ Var(√nac)backdoor.Similarly, we can compare Eqs. 17 and 27 to see that, asymptotically, the combined estimator is better than
the frontdoor estimator for all values of model parameters. That is, as n → ∞,
Var(√nac c) ≤ Var(√naf c).In the finite sample case, using Eqs. 9 and 26, we can see that for a large enough n, the combined estimator
will dominate the backdoor estimator for all model parameters. That is,
∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(ac)backdoor,where the dependence of N on the model parameters is stated in Appendix D.1. We can make a similar
argument for the dominance of the combined estimator over the frontdoor estimator. Using Eqs 16 and
frontdoor-finite-sample-variance, it can be shown that
∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(af c),where the dependence of N on the model parameters is stated in Appendix D.2.
Next, we show that the combined estimator can dominate the better of the backdoor and frontdoor esti-
mators by an arbitrary amount. That is, we show that the quantity
R = min{Var(ac)backdoor ,Var(af c)}Var(ac c)
is unbounded. In order to demonstrate this, we consider the case when Var(ac)backdoor = Var(af c). Thiscondition holds for certain settings of the model parameters (see Appendix D.3 for an example). In this
case,
R = Var(ac)backdoorVar(ac c) (28)
≥ (n − 2)DE(a2�2um + �2uy )�2ux ((n − 3)a2�2umE + �2uy (�2um + √3�2um (r1r2 + |c|(n − 2)D(|c| + r1 �um√(n−2)D)))) ,
9
where D = d2�2uw + �2ux , E = c2�2ux + �2um , r1 = √ n−3n−5 , r2 = √n−2n−4 , and, in Eq. 28, we used the upper bound
from Eq. 26. We can see that as �ux → 0, R → ∞ and thus R is unbounded.
5.2 Semi-Parametric Estimators
Fulcher et al. [2020] derive the efficient influence function and semi-parametric efficiency bound for a
generalized model with discrete treatment and non-linear relationships between the variables. While they
allow for confounding of the treatment-mediator link and the mediator-outcome link, the graph in Figure
1 has additional restrictions. As per Chen and Santos [2018], this graph is locally overidentified. This
suggests that it is possible to improve the estimator by Fulcher et al. [2020, Eq. (6)] (which we refer to as
IF-Fulcher) by leveraging our model’s restrictions. In our model, we have Y ⟂⟂ X |(M,W ), and M ⟂⟂ W |X .So we adapt IF-Fulcher to our graph by incorporating the additional conditional independences by usingE[Y |M,W , X ] = E[Y |M,W ], and f (M |X,W ) = f (M |X ) to create an estimator we refer to as IF-Restricted:
Ψ = 1n n∑i=1(Yi − E[Y |Mi,Wi]) f (M |x ∗)f (M |Xi)+ 1{Xi = x ∗}P(Xi = x ∗|Wi)× {E[Y |Mi ,Wi] −∑m E[Y |m,Wi]f (m|Xi)}+∑m E[Y |m,Wi]f (m|x ∗), (29)
where, if f (M |X) and P (X |W ), E(Y |M,W ) are consistent estimators of the true quantities, then Ψ p→E[Y |do(X = x ∗)]. By double robustness of the given estimator, if f , P , and E are correctly specified, then
IF-Restricted has identical asymptotic distribution as IF-Fulcher. But using the additional restrictions im-
proves estimation of nuisance functions. Rotnitzky and Smucler [2019], in contemporaneous work, ana-
lyzed the same graph and showed that, in addition, the efficient influence function is also changed when
imposing these conditional independences (see Example 10 in their paper). For our experiments with bi-
nary treatments, we use linear regression for f , E and logistic regression for P .Another way to adapt IF-Fulcher is for the case when we do not observe the confounders (as in the front-
door adjustment). In this case, we can set Wi = ∅ and apply IF-Fulcher. We call this special case IF-
Frontdoor.
6 Combining Revealed-confounder and Revealed-mediator Datasets
We now consider a situation in which the practitioner has access to two datasets In the first one, the
confounders are observed but the mediators are unobserved. In the second one, the mediators are observed
but the confounders are unobserved. This situation might arise if data is collected by two groups, the first
selecting variables to measure to apply the backdoor adjustment and the second selecting variables to
apply the frontdoor adjustment. Given the two datasets, we now turn to the question of how to optimally
leverage all available data to estimate the effect of X on Y .Onewaymight be to apply the backdoor estimator to the first dataset, the frontdoor estimator to the second
dataset. We could then estimate the variance of each estimate, using either bootstrapping or plugging in
10
estimates of the model parameters to the variance equations, and then choose the estimator with the lower
estimated variance. However, in this case, each estimator uses samples from one only dataset. To leverage
all available information, we introduce theMLE, demonstrating that this estimator has lower variance than
both the backdoor and frontdoor estimators.
Combined Log-Likelihood under Partial Observability Say we have P samples of {xi , yi , wi}Pi=1.Let each such sample be denoted by the vector pi = [xi , yi , wi]. Moreover, say we have Q samples of{xi , yi , mi}Qi=1. Let each such sample be denoted using the vector qj = [xj , yj , mj]. Let the observed data berepresented as D. That is, D = {p1, p2, … , pP , q1, q2, … , qQ}. Let N = P + Q and let k = PQ . Since the datais multivariate Gaussian, the conditional log-likelihood given k can be written as
(D|k) = − 12N [P (log (det�p) + Tr (�p�−1p )) + Q (log (det�q) + Tr (�q�−1q ))]= − 12(k + 1)[k (log (det�p) + Tr (�p�−1p )) + log (det�q) + Tr (�q�−1q )], (30)
where �p = Cov([X , Y ,W ]), �q = Cov([X , Y ,M]), �p = 1P ∑Pi=1 pip⊤i and �q = 1Q ∑Qi=1 qiq⊤i .6.1 Cramer-Rao Lower Bound
Before we present the MLE, we analyse the asymptotic variance of the this estimator and show that vari-
ance can be reduced by utilizing the additional samples.
Reparameterizing the Likelihood We are interested in estimating the value of the product ac. Lete = ac. We reparameterize the likelihood in Eq. 30 by replacing c with e/a. This simplifies the calcula-
tions and improves numerical stability. Now, we have the following eight unknown model parameters:{e, a, b, d, �2uw , �2ux , �2um , �2uy}.In order to compute the variance of the estimate of parameter e = ac, we compute the Cramer-Rao variance
lower bound. We first compute the Fisher information matrix (FIM) I for the eight model parameters:
I = −E ⎡⎢⎢⎢⎢⎢⎣)2)e2 )2)e)a )2)e)b … )2)e)�uy)2)a)e )2)a2 )2)a)b … )2)a)�uy⋮ ⋮ ⋮ ⋱ ⋮)2)�uy )e )2)�uy )a )2)�uy )b … )2)2�uy
⎤⎥⎥⎥⎥⎥⎦Let e be the MLE. Since standard regularity hold for our model (due to linearity and Gaussianity), the MLE
is asymptotically normal. We can use the Cramer-Rao theorem to get the asymptotic variance of e. Thatis, for constant k, as N → ∞, we have√N (e − e) d→ (0, Ve), andVe = (I−1)1,1.The expression for (I−1)1,1 is available in closed form and is given in Appendix E.1.
11
Remarks on the Lower Bound Recall that k = PQ . As k → ∞, Ve becomes equal to the asymptotic
variance of the backdoor estimator. That is, as k → ∞,
Ve = limn→∞Var(√nac)backdoor.And as k → 0, Ve becomes equal to the asymptotic variance of the frontdoor estimator. That is, as k → 0,
Ve = limn→∞ [a2Var(√nc) + c2Var(√naf )] .For a fixed value of Q, Ve is a decreasing function of k. And for a fixed value of P , Ve is an increasing
function of k. This shows that samples from both datasets are useful for reducing the asymptotic variance
of the estimator.
6.2 The Maximum Likelihood Estimator
Computing an analytical solution for the model parameters that maximizes the log-likelihood turns out to
be intractable. As a result, we update our estimated parameters to maximize the likelihood numerically.
Parameter Initialization The likelihood in Eq. 30 is non-convex. As a result, we cannot start with
arbitrary initial values for model parameters because we might encounter a local minimum. To avoid this,
we use the two datasets to initialize our parameter estimates. Each of the eight parameters can be identified
using only data from one of the datasets. For example, d can be initialized using the revealed-confounder
dataset (via OLS regression of X on W ). The parameter e is can be identified using either dataset, so we
pick the value with lower bootstrapped variance.
After initializing the eight model parameters, we run the Broyden–Fletcher–Goldfarb–Shanno (BFGS) al-
gorithm [Fletcher, 2013] to find model parameters that minimize the negative log-likelihood. In our exper-
iments, for the sample sizes considered and given our initialization procedure, the non-convexity of the
likelihood never proved a practical problem. In our experiments, we were always able to converge to the
global minimum. When we find the global minimum, this estimator is optimal and dominates both the
backdoor and frontdoor estimators.
In a real-world scenario, without our procedure, a practitionerwould have to choose between the backdoor
and frontdoor estimates from the individual datasets. An advantage of using our procedure is that we
obviate this choice and output a single estimate of the causal effect that asymptotically performs better
than the individual estimates.
7 Experiments
First, we present results on synthetic datasets, where the structural causal model is known. Next, we
present results on a semi-synthetic dataset, where the covariates and treatment assignment are from a real
study but the outcomes are synthetic. Finally, we test our methods on job training data.
Synthetic Data Generating data from our causal model (Eq. 2), we first compare the empirical variances
of the backdoor, frontdoor and combined estimator to those predicted by our theory (Eqs. 9, 16, 26). We
12
Table 1: MeanAbsolute Percentage Error of the empirical variance with the theoretical variance (computed
using our variance formulae). The values are reported as mean ± std.
Estimator n = 50 n = 100 n = 200Backdoor 0.36 ± 0.37 0.32 ± 0.29 0.34 ± 0.13Frontdoor 0.33 ± 0.21 0.30 ± 0.25 0.23 ± 0.17Combined 1.20 ± 1.14 0.97 ± 0.64 0.58 ± 0.23
50 60 70 80 90 100Number of samples
0
2
4
6
8
10
12
MSE
of e
stim
ator
frontdoorbackdoorcombined
(a) Backdoor better
50 60 70 80 90 100Number of samples
0
5
10
15
20
25
MSE
of e
stim
ator
frontdoorbackdoorcombined
(b) Frontdoor better
Figure 2: Comparison of MSE when confounders and mediators are simultaneously observed. (a) Causal
model where backdoor dominates frontdoor. (b) Causal model where frontdoor dominates backdoor. In
both cases, the combined estimator has lowest MSE.
initialize the model parameters by sampling from the following distributions:
a, b, c, d ∼ Unif[−10, 10]�2uw , �2ux , �2um , �2uy ∼ Unif[0.01, 2]. (31)
For each initialization, we compute the Mean Absolute Percentage Error (MAPE) of the empirical variance
with the theoretical variance. We average the MAPE across 1000 realizations sampled from Eq. 31. We
find that the theoretical variance is close to the empirical variance even for small sample sizes (Table 1).
Next, we compare the backdoor, frontdoor and combined estimators under different settings of the model
parameters. Unless stated otherwise, the model parameter values we use for generating the data are
a = 10, b = 4, c = 5, d = 5,�2uw = 1, �2ux = 1, �2um = 1, �2uy = 1. (32)
The quantity of interest is the causal effect ac = 50. We set some of the values in Eq. 32 to construct
regimes where either of the backdoor or frontdoor estimators can dominate the other. For Figure 2a, we
set {�2ux = 0.05, �2um = 0.05}, which makes the backdoor estimator better as predicted by Eq. 18. For Figure
2b, we set {�2uw = 2, �2ux = 0.01, �2um = 0.1} which makes the frontdoor estimator better as predicted by Eq.
18. The plots in Figure 2 corroborate these predictions at different sample sizes. Furthermore, the optimal
combined estimator always outperforms both the backdoor and frontdoor estimators.
Finally, we evaluate the procedure for combining datasets described in Section 6, generating two datasets
with equal numbers of samples. In the first, only {X, Y ,W } are observed. In the second, only {X, Y ,M}are observed. We set {�2ux = 0.05, �2um = 0.05}, which makes the backdoor estimator better (Figure 3a),
and then set {�2uw = 2, �2ux = 0.01, �2um = 0.1}, which makes the frondoor estimator better (Figure 3b). We
then confirm that the combined estimator has lower MSE than either for various sample sizes (Figure 3),
supporting our theoretical claims.
13
100 150 200 250 300 350 400 450 500Number of samples per dataset
0.2
0.4
0.6
0.8
1.0
1.2
MSE
of e
stim
ator
backdoorcombined
(a)
100 150 200 250 300 350 400 450 500Number of samples per dataset
0.250.500.751.001.251.501.752.002.25
MSE
of e
stim
ator
frontdoorcombined
(b)
Figure 3: Comparison of MSE when confounders and mediators are observed in separate datasets. The
estimator that combines datasets dominates the better of backdoor and frontdoor estimators.
IHDP Dataset Hill [2011] constructed a dataset from the Infant Health and Development Program
(IHDP). This semi-synthetic dataset, which has been used for benchmarking causal inference algorithms
[Shi et al., 2019, Shalit et al., 2017], is based on a randomized experiment to measure the effect of home
visits from a specialist on future test scores of children. We use samples from the NPCI package [Dorie,
2016]. The randomized data is converted to an observational study by removing a biased subset of the
treated group. This set contains 747 samples with 25 covariates.We use the covariates and the treatment assignment from the real study. We use a procedure similar to
Hill [2011] to simulate the mediator and the outcome. The mediator M takes the form M ∼ (5X, �2um ),where X is the treatment. The response Y takes the form Y ∼ (10M +W b, 1) where W is the matrix of
standardized (zero mean and unit variance) covariates and values in the vector b are randomly sampled (0,
1, 2, 3, 4) with probabilities (0.5, 0.2, 0.15, 0.1, 0.05). The ground truth causal effect is 5 × 10 = 50.We evaluate our estimators and the three IF estimators — IF-Fulcher, IF-Restricted and IF-Frontdoor (Sec-
tion 5.2). We test the estimators on two settings of the parameter �um (Table 2, the Complete dataset setting).
The MSE values are computed across 1000 instantiations of the dataset created by simulating the mediators
and outcomes. We first evaluate the estimators on the complete dataset of 747 samples. We see that for�um = 2, the backdoor estimator dominates the frontdoor estimator whereas for �um = 1, the frontdoor esti-mator is better. In both cases, the combined estimator which uses bothmediators and confounders (Section
5) outperforms both estimators. Furthermore, we see that IF-Restricted outperforms IF-Frontdoor, show-
ing the value of leveraging the covariates. Moreover, IF-Restricted also outperforms IF-Fulcher, suggesting
that incorporating model restrictions improves performance.
Next, we randomly split the data into two sets, one with the confounder observed and the other with the
mediator observed, finding that the estimator that combines the datasets (Section 6) outperforms frontdoor
and backdoor estimators for various values of �um (Table 2, the Partial dataset setting). We compute the
MSE across 1000 instantiations of the dataset (we use a different random split in each iteration).
National JTPA Study The National Job Training Partnership Act (JTPA) Study was commissioned to
evaluate the effect of a job training program on future earnings. In this dataset, the treatment variable Xrepresents whether a participant signed up to receive JTPA services. The outcome variable Y represents
earnings 18 months after the study. The study had a randomized treatment and control group which
allows us to compute the ground truth treatment effect. There was also a non-randomized component
in the form of eligible non-participants (ENP). ENPs were individuals who were eligible but chose not to
participate. Moreover, there was non-compliance amongst the treated units. That is, some participants in
14
Table 2: MSE of on IHDP data. The complete dataset setting is when all of {W,X,M, Y} are observed andpartial is with two datasets with {X,M, Y} in one and {W,X, Y} in the other.
Estimator Dataset �um = 2 �um = 1Backdoor Complete 4.41 1.07
Frontdoor Complete 3.81 2.81
Combined Complete 3.61 0.93IF-Frontdoor Complete 4.15 2.07
IF-Restricted Complete 3.07 1.48
IF-Fulcher Complete 3.93 1.87
Backdoor Partial 10.17 2.43
Frontdoor Partial 8.19 4.94
Combined Partial 6.64 1.62
Table 3: MSE and variance of the estimators on JTPA data. The complete setting is when all among{W,X,M, Y} are observed. In the partial setting we have two datasets, one with {X,M, Y} and another
with {W,X, Y}.Estimator Dataset Variance MSE
Frontdoor Complete 40.9K 75.3K
Combined Complete 33.1K 70.1KIF-Frontdoor Complete 46.6K 77.9K
IF-Restricted Complete 40.4K 42.1KIF-Fulcher Complete 45.1K 46.2K
Frontdoor Partial 74.8K 115.1K
Combined Partial 79.5K 123.1K
the treatment group did not make use of the services [Heckman et al., 1997]. This lets us create a mediatorM which represents compliance. The study also collected data on a number of covariates (like race, study
location, age) which are the confoundersW in our notation.
On the dataset used by Glynn and Kashin [2019], we combine the treated units and ENP units to create
the observational data for our experiments. There are 3155 treated units and 1236 ENP units. The ground
truth treatment effect is 862.74. Glynn and Kashin [2018] showed that the backdoor estimator has high
bias, suggesting that there was unmeasured confounding. The frontdoor estimator, on the other hand, has
low bias and works well for this dataset. For this reason, we do not consider the backdoor estimator in our
comparisons.
We first compare the frontdoor estimator, the combined estimator (Section 5), IF-Restricted, IF-Frontdoor
and IF-Fulcher (Section 5.2). The results are shown in Table 3 (the “Complete” dataset setting). We compute
the variance and MSE using 1000 bootstrap iterations. The results show that the combined estimator has
lower variance and MSE than the frontdoor estimator. Similarly, IF-Restricted outperforms IF-Frontdoor,
reinforcing the utility of combined estimators. We also see that IF-Restricted outperforms IF-Fulcher, show-
ing that using model restrictions is valuable.
Next, we evaluate our procedure for the partially-observed setting (Section 6). We compute variance and
MSE across 1000 bootstrap iterations. At each iteration, we randomly split our dataset into two datasets
of equal size, one with revealed confounders, one with revealed mediators. Since the backdoor adjustment
works poorly for this study, we should not expect the revealed-confounder dataset to improve the frontdoor
15
estimator. And indeed, the frontdoor estimator outperforms the combined estimator (Table 3). Despite the
inapplicability of the backdoor estimator, the combined estimator does not suffer too badly.
8 Conclusion
Addressing two overidentified regimes, one where confounders and mediators are observed simultane-
ously, and another where they are observed in separate datasets, we introduced improved estimators for
both cases and confirmed their benefits experimentally. Extending our analysis to more general graphs,
and considering online decisions about which variables to observe are two promising future directions.
We are also interested in evaluating how our conclusions might be affected by model misspecification.
References
E. Bareinboim and J. Pearl. Causal inference and the data-fusion problem. Proceedings of the National
Academy of Sciences, 113(27):7345–7352, 2016.
E. Bareinboim et al. Causal inference and data-fusion in econometrics. Technical report, arXiv. org, 2019.
M. F. Bellemare and J. R. Bloem. The paper of how: Estimating treatment effects using the front-door
criterion. Working Paper, 2019.
X. Chen and A. Santos. Overidentification in regular models. Econometrica, 86(5):1771–1817, 2018.
A. Chinco and C. Mayer. Misinformed speculators and mispricing in the housing market. The Review of
Financial Studies, 29(2):486–522, 2016.
L. Cohen and C. J. Malloy. Friends in high places. American Economic Journal: Economic Policy, 6(3):63–91,
2014.
V. Dorie. Non-parametrics for causal inference. https://github.com/vdorie/npci, 2016.
M. L. Eaton. Chapter 8: The Wishart Distribution, volume Volume 53 of Lecture Notes–Monograph Se-
ries, pages 302–333. Institute of Mathematical Statistics, 2007. doi: 10.1214/lnms/1196285114. URL
https://doi.org/10.1214/lnms/1196285114.
R. Fletcher. Practical methods of optimization. John Wiley & Sons, 2013.
I. R. Fulcher, I. Shpitser, S. Marealle, and E. J. Tchetgen Tchetgen. Robust inference on population indirect
causal effects: the generalized front door criterion. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 2020.
A. N. Glynn and K. Kashin. Front-door difference-in-differences estimators. American Journal of Political
Science, 61(4):989–1002, 2017.
A. N. Glynn and K. Kashin. Front-door versus back-door adjustment with unmeasured confounding: Bias
formulas for front-door and hybrid adjustments with application to a job training program. Journal of
the American Statistical Association, 113(523):1040–1049, 2018.
A. N. Glynn and K. Kashin. Replication Data for: Front-Door Versus Back-Door Adjustment With Unmea-
sured Confounding: Bias Formulas for Front-Door and Hybrid Adjustments With Application to a Job
Training Program, 2019. URL https://doi.org/10.7910/DVN/G7NNUL .
16
L. P. Hansen. Large sample properties of generalized method of moments estimators. Econometrica, 50(4):
1029–1054, 1982.
J. J. Heckman, H. Ichimura, and P. E. Todd. Matching as an econometric evaluation estimator: Evidence
from evaluating a job training programme. The review of economic studies, 64(4):605–654, 1997.
L. Henckel, E. Perković, and M. H. Maathuis. Graphical criteria for efficient total effect estimation via
adjustment in causal linear models. arXiv preprint arXiv:1907.02435, 2019.
J. L. Hill. Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical
Statistics, 20(1):217–240, 2011.
C. Jackson, N. Best, and S. Richardson. Bayesian graphical models for regression on multiple data sets with
different variables. Biostatistics, 10(2):335–351, 2009.
T. C. Koopmans andO. Reiersøl. The identification of structural characteristics. The Annals of Mathematical
Statistics, 21(2):165–181, 1950.
T. Lattimore and C. Szepesvári. Bandit Algorithms. Cambridge University Press, 2019.
J. Pearl. Causal diagrams for empirical research. Biometrika, 82(4):669–688, 1995.
J. Pearl. Causality. Cambridge university press, 2009.
J. Pearl. The foundations of causal inference. Sociological Methodology, 40(1):75–149, 2010.
E. Perković, J. Textor, M. Kalisch, and M. H. Maathuis. A complete generalized adjustment criterion. In
Uncertainty in Artificial Intelligence (UAI), 2015.
A. Rotnitzky and E. Smucler. Efficient adjustment sets for population average treatment effect estimation
in non-parametric causal graphical models. arXiv preprint arXiv:1912.00306, 2019.
J. D. Sargan. The estimation of economic relationships using instrumental variables. Econometrica: Journal
of the Econometric Society, pages 393–415, 1958.
U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds
and algorithms. In International Conference on Machine Learning (ICML), 2017.
C. Shi, D. Blei, and V. Veitch. Adapting neural networks for the estimation of treatment effects. InAdvances
in Neural Information Processing Systems, pages 2503–2513, 2019.
I. Shpitser and J. Pearl. Identification of conditional interventional distributions. In Uncertainty in Artifical
Intelligence (UAI), 2006.
C. Uhler. Gaussian graphical models: An algebraic and geometric perspective. Chapter in Handbook of
Graphical Models, 2019.
S. Wright. The method of path coefficients. Ann. Math. Statist., 5(3):161–215, 09 1934. doi: 10.1214/aoms/
1177732676.
17
A Covariance of a and cA.1 Frontdoor estimator
We prove that Cov(af , c) = 0 for the frontdoor estimator. The expressions for af and c arec = ∑ ximi∑ x2i= c + ∑ xiumi∑ x2i
(33)
af = ∑ x2i ∑miyi −∑ ximi ∑ xiyi∑ x2i ∑m2i − (∑ ximi)2= a + ∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 ,
(34)
where ei = − bd uxi + uyi . Using the fact the (ux , x) is bivariate normally distributed, we get
E[e|x] = b�2uxd(d2�2uw + �2ux )x= Fx, (35)
where F = b� 2uxd(d2� 2uw +� 2ux ) . The covariance then is
Cov(af , c) = E[(af − a)(c − c)]= E[E[(af − a)(c − c)|x,m]]= E[E[(∑ x2i ∑miei −∑ ximi ∑miei∑ x2i ∑m2i − (∑ ximi)2 )(∑ xiumi∑ x2i ) ||||x,m]]= E[(∑ x2i ∑miE[ei |x] −∑mixi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 )(∑ umi xi∑ x2i )] (36)
= E[F (∑ x2i ∑mixi −∑mixi ∑ x2i∑ x2i ∑m2i − (∑ ximi)2 )(∑ umi xi∑ x2i )]= 0,
where in Eq. 36 we used the expression from Eq. 35.
A.2 Combined estimator
We prove that Cov(ac , c) = 0 for the combined estimator from Section 5. The expressions for ac and c arec = ∑ ximi∑ x2i= c + ∑ xiumi∑ x2i
(37)
ac = ∑w2i ∑miyi −∑wimi ∑wiyi∑w2i ∑m2i − (∑wimi)2= a + ∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 .
(38)
18
The covariance is
Cov(ac , c) = E[(ac − a)(c − c)]= E[E[(ac − a)(c − c)|x,m, w]]= E[(∑w2i ∑miE[uyi ] −∑miwi ∑wiE[uyi ]∑w2i ∑m2i − (∑wimi)2 )(∑ umi xi∑ x2i )] (39)
= 0,where in 39 we used the fact that E[uyi ] = 0.B Unbiasedness of the estimators
B.1 Backdoor estimator
Recall that for the backdoor estimator, we take the coefficient of X in an OLS regression of Y on {X,W }.The outcome yi can be written as
yi = acxi + bwi + aumi + uyi .The error term aumi + uyi is independent of (xi , wi). In this case, the OLS estimator is unbiased. Therefore,E[acbackdoor] = ac.B.2 Frontdoor estimator
For the frontdoor estimator, we first compute c by taking the coefficient of X in an OLS regression of Mon X . The mediatormi can be written as
mi = cxi + umi .The error term umi is independent of xi . In this case, the OLS estimator is unbiased and hence, E[c] = c.We then compute af by taking the coefficient of M in an OLS regression of Y on {M,X}. The outcome yican be written as
yi = ami + bd xi − bd uxi + uy .In this case, the error term − bd uxi + uy is correlated with xi . The expression for a is given in Eq. 34. The
expectation E[af ] is E[af ] = a + E [∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 ]= a + E [E [∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2
||||x,m]]= a + E [∑ x2i ∑miE[ei |x] −∑ ximi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 ] (40)
= a + E [∑ x2i ∑mi(Fxi) −∑ ximi ∑ xi(Fxi)∑ x2i ∑m2i − (∑ ximi)2 ]= a,
19
where, in Eq. 40, the expression for E[ei |x] is taken from Eq. 35. Using the fact that Cov(af , c) = 0 (see
proof in Appendix A.1), we can see that the frontdoor estimator is unbiased asE[af c] = E[af ]E[c] + Cov(af , c)= ac.B.3 Combined estimator
In the combined estimator, the expression for c is the same as the frontdoor estimator. Therefore, as
shown in Appendix B.2, E[c] = c. We compute a by taking the coefficient of M in an OLS regression of Yon {M,W }. The outcome yi can be written as
yi = ami + bwi + uyi .The error term uyi is independent of (mi , wi). In this case, the OLS estimator is unbiased. Therefore, E[ac] =a. Using the fact that Cov(ac , c) = 0 (see proof Appendix A.2), we can see that the combined estimator is
unbiased as E[ac c] = E[ac]E[c] + Cov(ac , c)= ac.C Finite sample results for the variance
C.1 Frontdoor estimator
C.1.1 Variance of afFrom Eqs. 33 and 34, we know that
c = c + ∑ xiumi∑ x2iaf = a + ∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 .
where ei = − bd uxi + uyi . The following fact will be useful later in the derivations (taken from Eq. 12):
Var(e|x) = b2�2uw�2ux + �2uy (d2�2uw + �2ux )(d2�2uw + �2ux )= Ve . (41)
Note that Ve is a constant and does not depend on x .Let
A = ∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2C = ∑ xiumi∑ x2i .
20
First, we derive the expression for Var(af ) as follows,Var(af ) = Var(a + A)= Var(A)= Var(E[A|x,m]) + E[Var(A|x,m)]
= Var(E[∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2||||x,m]) + E[Var(A|x,m)]
= Var(E[∑ x2i ∑miE[ei |x] −∑ ximi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 ]) + E[Var(A|x,m)] (42)
= Var(E[∑ x2i ∑mi(Fxi) −∑ ximi ∑ xi(Fxi)∑ x2i ∑m2i − (∑ ximi)2 ]) + E[Var(A|x,m)]= E[Var(A|x,m)]= E [Var(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2
||||x,m)]= E [ 1(∑ x2i ∑m2i − (∑ ximi)2)2Var(∑ x2i ∑miei −∑ ximi ∑ xiei ||||x,m)]= E [ 1(∑ x2i ∑m2i − (∑ ximi)2)2Var(ei |xi)∑ x2i (∑ x2i ∑m2i − (∑ ximi)2)]= VeE [ ∑ x2i∑ x2i ∑m2i − (∑ ximi)2 ]= VeE [D] , (43)
where, in Eq. 42, we used the result from Eq. 35, and D = ∑ x2i∑ x2i ∑m2i −(∑ ximi )2 . Using the fact that D has
the distribution of a marginal from an inverse Wishart-distributed matrix, that is, if the matrix M ∼(Cov([M,X ])−1, n), then D = M1,1, in Eq. 43, we get
Var(af ) = VeE [D]= VeCov([M,X ])−11,1n − 2 − 1= b2�2uw�2ux + �2uy (d2�2uw + �2ux )(n − 3)(d2�2uw + �2ux )�2um ,
where the expression for Ve is taken from Eq. 41.
C.1.2 Covariance of a2f and c2We prove that Cov(a2f , c2) = Var(af )Var(c). This covariance can be written as
Cov(a2f , c2) = E[(a2f − E[a2f ])(c2 − E[c2])]= E[(a2f − Var(af ) − E2[af ])(c2 − Var(c2) − E2[c])]= E[a2f c2] − Var(af )Var(c) − a2Var(c) − c2Var(af ) − a2c2. (44)
21
We can write E[a2f c2] asE[a2f c2] = E[(a + A)2(c + C)2]= E[a2c2 + c2A2 + a2C2 + A2C2 + 2aAC2 + 2cCA2]= a2c2 + c2Var(a) + a2Var(c) + E[A2C2] + E[2aAC2] + E[2cCA2]. (45)
Substituting the result from Eq. 45 in Eq. 44, we get
Cov(a2f , c2) = E[A2C2] + E[2aAC2] + E[2cCA2]. (46)
Now we expand each term in Eq. 46 separately. E[2aAC2] isE[2aAC2] = 2aE [(∑ xiumi∑ x2i )2(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 )]
= 2aE [E [(∑ xiumi∑ x2i )2(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 ) |||||x,m]]= 2aE [(∑ xiumi∑ x2i )2(∑ x2i ∑miE[ei |x] −∑ ximi ∑ xiE[ei |x]∑ x2i ∑m2i − (∑ ximi)2 )] (47)
= 2aE [(∑ xiumi∑ x2i )2(∑ x2i ∑mi(Fxi) −∑ ximi ∑ xi(Fxi)∑ x2i ∑m2i − (∑ ximi)2 )]= 0, (48)
where, in Eq. 47, the expression for E[e|x] is taken from Eq. 35.
Next, we simplify E[2cCA2] asE[2cCA2]= 2cE [(∑ xiumi∑ x2i )(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 )2]= 2cE [E [(∑ xiumi∑ x2i )(∑ x2i ∑miei −∑ ximi ∑ xiei∑ x2i ∑m2i − (∑ ximi)2 )2 |||||x,m]]= 2cE [E [(∑ xiumi∑ x2i ) (∑ x2i )2(∑miei)2 + (∑ ximi)2(∑ xiei)2 − 2∑ x2i ∑miei ∑ ximi ∑ xiei(∑ x2i ∑m2i − (∑ ximi)2)2 |||||x,m]]= 2cE [(∑ xiumi∑ x2i )( Var(e|x)(∑ x2i )∑ x2i ∑m2i − (∑ ximi)2 +
F 2 (2(∑ x2i )2(∑ ximi)2 − 2(∑ x2i )2(∑ ximi)2)(∑ x2i ∑m2i − (∑ ximi)2)2 )]= 2cE [(∑ xiumi∑ x2i )( Var(e|x)(∑ x2i )∑ x2i ∑m2i − (∑ ximi)2)]= 2cVeE [(∑ xiumi∑ x2i )( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)]= 2cVeE [(c − c)( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)]= 2cVe (E [c( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)] − cE [( ∑ x2i∑ x2i ∑m2i − (∑ ximi)2)])= 2cVe (E [cD] − cE [D]) , (49)
22
where D = ∑ x2i∑ x2i ∑m2i −(∑ ximi )2 . Using the fact that c and D are independent of each other (see proof at the
end of this section), we get E[cD] = E[c]E[D]= cE[D]. (50)
Substituting the result from Eq. 50 in Eq. 49, we getE[2cCA2] = 2cVe (cE [D] − cE [D])= 0. (51)
We proceed similarly to Eq. 49 to write E[A2C2] asE[A2C2] = VeE[C2D].Then we further simplify E[A2C2] asE[A2C2] = VeE[C2D]= VeE[(c − c)2D]
= Ve (E [c2]E[D] − c2E[D])= Ve (Var(c)E[D] + E2 [c]E[D] − c2E[D])= VeVar(c)E[D] (52)
= VeVar(c)Cov([M,X ])−11,1n − 2 − 1= Ve(n − 3)�2um Var(c) (53)
= Var(af )Var(c), (54)
where, in Eq. 52, we used the fact that if thematrixM ∼ (Cov([M,X ])−1, n), thenD = M1,1 (that is,D has
the distribution of a marginal from an inverse Wishart-distributed matrix), and in Eq. 53, the expression
for Ve is taken from Eq. 41.
Substituting the results from Eqs. 48, 51, and 54 in Eq. 46, we get
Cov(a2f , c2) = Var(af )Var(c). (55)
Proof that c and D are independent. Let � be the following sample covariance matrix:
� = 1n [ ∑m2i ∑mixi∑mixi ∑x2i ] .The distribution of Σ is a Wishart distribution. That is, Σ ∼ (Cov([M,X ]), n). Then (�1,1 − �1,2�−12,2�2,1)and (�2,1,�2,2) are independent [Eaton, 2007, Proposition 8.7]. We can see that
�1,1 − �1,2�−12,2�2,1 = ∑m2i ∑x2i − (∑ximi)2∑x2i= 1D .
23
Therefore, we get
1D ⟂⟂ (∑x2i , ∑ ximi)⟹ D ⟂⟂ (∑x2i , ∑ ximi)⟹ D ⟂⟂ ∑ximi∑x2i⟹ D ⟂⟂ c.
C.2 Combined estimator
We prove that Cov(a2c , c2) = Var(ac)Var(c). We derive this in a similar manner as the frontdoor estimator
in Appendix C.1.2. From Eqs. 37 and 38, we know that
ac = a + ∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2c = c + ∑xiumi∑x2i .
Let
A = ∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2C = ∑xiumi∑x2i .
Then, similarly to Eq. 46, we get
Cov(a2c , c2) = E[A2C2] + E[2aAC2] + E[2cCA2]. (56)
Now we simplify each term in Eq. 56 separately. E[2aAC2] can be simplified as
E[2aAC2] = 2aE [(∑ xiumi∑ x2i )2(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 )]= E [E [(∑ xiumi∑ x2i )2(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 ) |||||x,m, w]]= E [(∑ xiumi∑ x2i )2(∑w2i ∑miE[uyi ] −∑wimi ∑wiE[uyi ]∑w2i ∑m2i − (∑wimi)2 )] (57)
= 0, (58)
where, in Eq. 57, we used the fact that E[uy] = 0.
24
Next, we simplify E[2cCA2] asE[2cCA2] = 2cE [(∑ xiumi∑ x2i )(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 )2]
= 2cE [E [(∑ xiumi∑ x2i )(∑w2i ∑miuyi −∑wimi ∑wiuyi∑w2i ∑m2i − (∑wimi)2 )2 |||||x,m, w]]= 2cE [(∑ xiumi∑ x2i )Var(uy )( ∑w2i∑w2i ∑m2i − (∑wimi)2)]= 2c�2uyE [CD] , (59)
where D = ∑w2i∑w2i ∑m2i −(∑wimi )2 . We can upper bound the expression in Eq. 59 as
E[2cCA2] = 2c�2uyE [CD]≤ 2|c|�2uyE [CD] (60)
≤ 2|c|�2uy√E[C2]E[D2]
= 2|c|�2uy√Var(c)(Var(D) + E[D]2) (61)
= 2|c|�2uy√Var(c)(2 [(Cov([M,W ]))−11,1]2(n − 2 − 1)2(n − 2 − 3) +( (Cov([M,W ]))−11,1n − 2 − 1 )2)
= 2|c|�2uy√Var(c)√ 2(n − 3)2(n − 5)(c2�2ux + �2um )2 +
1(n − 3)2(c2�2ux + �2um )2= 2|c| �2uy(n − 3)(c2�2ux + �2um )
√Var(c)√n − 3n − 5
= 2|c|Var(a)√Var(c)√n − 3n − 5 , (62)
where, in Eq. 60, we used the Cauchy–Schwarz inequality, and in Eq. 61, we used the fact that if the matrixM ∼ (Cov([M,W ])−1, n), then D = M1,1 (that is, D has the distribution of a marginal from an inverse
Wishart-distributed matrix).
Similarly to Eq. 59, we simplify E[A2C2] asE[A2C2] = �2uyE[C2D]. (63)
25
The expression in Eq. 63 can be upper bounded using the Cauchy-Schwarz inequality asE[A2C2] = �2uyE[C2D]≤ �2uy
√E[C4]E[D2]= �2uy
√E[C4](Var(D) + E2[D])= �2uy
√E[C4](2 [(Cov([M,W ]))−11,1]2(n − 2 − 1)2(n − 2 − 3) +( (Cov([M,W ]))−11,1n − 2 − 1 )2)= �2uy(n − 3)(c2�2ux + �2um )
√n − 3n − 5√E[C4]
= Var(ac)√n − 3n − 5
√E[C4]. (64)
26
We can simplify E[C4] as follows,E[C4] = E [(∑ xiumi∑ x2i )4]
= E [E [(∑ xiumi∑ x2i )4 ||||x]]= E [ 1(∑ x2i )4E [(∑xiumi )4 ||||x]]= E [ 1(∑ x2i )4
{Var((∑ xiumi )2 |||x) + E [(∑ xiumi )2 |||x]2}]
= E [ 1(∑ x2i )4 {Var((∑ xiumi )2 |||x) + �4um (∑ x2i )2}]= E [ 1(∑ x2i )4
{Var(�2um ∑ x2i (∑ xiumi )2�2um ∑ x2i
||||x) + �4um (∑ x2i )2}]= E [ 1(∑ x2i )4
{�4um (∑ x2i )2 Var( (∑ xiumi )2�2um ∑ x2i||||x) + �4um (∑ x2i )2}] (65)
= E [ 1(∑ x2i )4 {�4um (∑ x2i )2 2 + �4um (∑ x2i )2}]= E [ 1(∑ x2i )4 {3�4um (∑ x2i )2}]= 3�4umE [ 1(∑ x2i )2 ]= 3�4um [Var( 1∑ x2i ) + E [ 1∑x2i ]
2] (66)
= 3�4um [ 2(n − 2)2(n − 4)(d2�2uw + �2ux )2 + 1(n − 2)2(d2�2uw + �2ux )2 ]= 3 �4um(n − 2)2(d2�2uw + �2ux )2 [n − 2n − 4]= 3 (Var(c2))2 [n − 2n − 4] , (67)
where, in Eq. 65, we used the fact that(∑ xiumi )2� 2um ∑ x2i
||||x has a Chi-squared distribution, that is,(∑ xiumi )2� 2um ∑ x2i
||||x ∼ � 2(1),and in Eq. 66, we used the fact that 1∑ x2i has a scaled inverse Chi-squared distribution, that is, 1∑ x2i ∼Scale-inv-� 2 (n, (d2� 2uw +� 2ux )2n ).Substituting the result from Eq. 67 in Eq. 64, we getE[A2C2] ≤ Var(ac)Var(c)√3(n − 3)(n − 2)(n − 5)(n − 4) . (68)
27
Substituting the results from Eqs. 58, 62, and 68 in Eq. 56, we get
Cov(a2c , c2) ≤√n − 3n − 5 (2|c|Var(ac)√Var(c) + √3√n − 2n − 4Var(ac)Var(c)) .
D Comparison of combined estimator with backdoor and frontdoor es-
timators
In this section, we provide more details on the comparison of the combined estimator presented in 5 to the
backdoor and frontdoor estimators.
D.1 Comparison with the backdoor estimator
In Section 5.1, we made the claim that
∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(ac)backdoor.In this case, by comparing Eqs. 9 and 26, we have
N = 2(�4uxF + d2�2uw (�2umD + �2uxF ) + c2�6ux√c2�6uxD2(F + 2√3�2um ))�2umD2 ,
where D = d2�2uw +�2ux , E = c2�2ux +�2um , and F = E + (1+2√3)�2um . Thus, for a large enough n, the combined
estimator has lower variance than the backdoor estimator for all model parameter values.
D.2 Comparison with the frontdoor estimator
In Section 5.1, we made the claim that
∃N , s.t., ∀n > N ,Var(ac c) ≤ Var(af c).In this case, by comparing Eqs. 16 and 26, we have
N = 2(�6um + 2√3c2�4um�2ux − c4�2um�4ux + (D + �4um√�4um + 4√3c2�2um�2ux − 2c4�4ux))D2 ,
whereD = c6�6ux . Thus, for a large enough n, the combined estimator has lower variance than the frontdoor
estimator for all model parameter values.
D.3 Combined estimator dominates the better of backdoor and frontdoor
In this section, we provide more details for the claim in Section 5.1 that the combined estimator can dom-
inate the better of the backdoor and frontdoor estimators by an arbitrary amount. We show that the
quantity
R = min{Var(ac)backdoor ,Var(af c)}Var(ac c)28
is unbounded.
We do this by considering the case when Var(ac)backdoor = Var(af c). Note thatVar(ac)backdoor = Var(af c)
⟹ b =√−D(−a2�4um ((n − 2)d2�2uw + �2ux ) + (−(n − 2)d2�2uw (�2um − c2�2ux ) + �2uxE)�2uy )�2uw�4ux (2�2um + (n − 2)c2D) , (69)
where D = d2�2uw + �2ux , and E = (n − 2)c2�2ux − (n − 4)�2um . Hence, if the parameter b is set to the value
given in Eq. 69, the backdoor and frontdoor estimators will have equal variance. We have to ensure that
the value of b is real. b will be a real number if
|c| ≤ �um�ux√1 − 2�2ux(n − 2)D , and
n > 2.For the value of b in Eq. 69, the quantity R becomes
R = Var(ac)backdoorVar(ac c)
≥ (n − 2)DE(a2�2um + �2uy )�2ux ((n − 3)a2�2umE + �2uy (�2um + √3�2um (r1r2 + |c|(n − 2)D(|c| + r1 �um√(n−2)D)))) ,
where D = d2�2uw + �2ux , E = c2�2ux + �2um , r1 = √n−3n−5 and r2 = √n−2n−4 .R does not depend on the parameter b. It is possible to set the other model parameters in a way that allowsR to take any positive value. In particular, it can be seen that as �ux → 0, R → ∞, which shows that R is
unbounded.
29
E Combining Partially Observed Datasets
E.1 Cramer-Rao Lower Bound
Here we present the closed form expression for the asymptotic variance of the estimator presented in
Section 6, that is, (I−1)1,1. Let (I−1)1,1 = XY . ThenX = (a2�2um + �2uy )(a8d2(k + 1)(�2um )5�2uw (d2�2uw + �2ux )2 + a6(�2um )3(d2�2uw + �2ux )(b2�2uw�2ux (c2d4(k + 1)2
(�2uw )2 + d2�2uw (c2(2k2 + 3k + 1)�2ux + (k2 + 4k + 1)�2um ) + (k + 1)�2ux (c2k�2ux + k�2um + �2um )) + (k + 1)�2uy(d2�2uw + �2ux )(c2d4(k + 1)(�2uw )2 + d2�2uw (c2(2k + 1)�2ux + (k + 3)�2um ) + k�2ux (c2�2ux + �2um ))) + 4a5bcdk(�2um )3�2uw�2ux (d2�2uw + �2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + a4(�2um )2(b4(�2uw )2(�2ux )2(c2d4(k + 1)2(�2uw )2 + d2�2uw (2c2(k2 + 3k + 1)�2ux + (2k2 + 3k + 1)�2um ) + �2ux(c2(k2 + 4k + 1)�2ux + 2(k + 1)2�2um )) + b2�2uw�2ux�2uy (d2�2uw + �2ux )(2c2d4(k2 + 3k + 2)(�2uw )2 + d2�2uw(c2(4k2 + 13k + 5)�2ux + (4k2 + 8k + 2)�2um ) + �2ux (c2(2k2 + 7k + 1)�2ux + (4k2 + 6k + 2)�2um )) + (�2uy )2(d2�2uw + �2ux )2(c2d4(k2 + 4k + 3)(�2uw )2 + d2�2uw (c2(2k2 + 7k + 3)�2ux + (2k2 + 5k + 3)�2um ) + k�2ux(c2(k + 3)�2ux + 2(k + 1)�2um ))) + 4a3bcdk(�2um )2�2uw�2ux�2uy (d2�2uw + �2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + a2�2um (b2�2uw�2ux + �2uy (d2�2uw + �2ux ))(b4(k + 1)(�2uw )2(�2ux )2(c2(d2(k + 1)�2uw + (k + 2)�2ux ) + (k + 1)�2um ) + b2�2uw�2ux�2uy (2c2d4(k + 1)2(�2uw )2+2d2�2uw (k2(2c2�2ux + �2um ) + k(5c2�2ux + �2um ) + 2c2�2ux ) + �2ux (2c2(k2 + 3k + 1)�2ux + (2k2 + 3k + 1)�2um ))+(�2uy )2(d2�2uw + �2ux )(d2(k + 1)�2uw + k�2ux )(c2(k + 3)(d2�2uw + �2ux ) + (k + 1)�2um )) + c2(k + 1)(b2�2uw�2ux + �2uy (d2�2uw + �2ux ))2(b4(k + 1)(�2uw )2(�2ux )2 + b2�2uw�2ux�2uy (2d2k�2uw + 2k�2ux + �2ux ) + (�2uy )2(d2�2uw + �2ux )(d2(k + 1)�2uw + k�2ux ))),
and
Y = (a2�2um (d2�2uw + �2ux ) + b2�2uw�2ux + �2uy (d2�2uw + �2ux ))(a6d2(k + 1)(�2um )4�2uw(d2�2uw + �2ux )2 + a4(�2um )2(d2�2uw + �2ux )(b2�2uw�2ux (d2k�2uw (c2(k + 1)�2ux + 2�2um ) + (k + 1)�2ux (c2k�2ux + k�2um + �2um )) + (k + 1)�2uy (d2�2uw + �2ux )(d2�2uw(c2k�2ux + 3�2um ) + k�2ux (c2�2ux + �2um ))) + 4a3bcdk(�2um )2�2uw�2ux (d2�2uw+�2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + a2�2um (b4(�2uw )2(�2ux )2(d2�2uw (2c2k�2ux + k�2um + �2um ) + �2ux (2c2k�2ux + (k + 1)2�2um )) + 2b2�2uw�2ux�2uy (d2�2uw + �2ux )(2d2k�2uw (c2�2ux + �2um ) + �2ux (2c2k�2ux + (k + 1)2�2um )) + (�2uy )2(d2�2uw + �2ux )2(d2�2uw(2c2k�2ux + 3(k + 1)�2um ) + k�2ux (2c2�2ux + (k + 2)�2um ))) + 4abcdk�2um�2uw�2ux�2uy (d2�2uw+�2ux )(b2�2uw�2ux + �2uy (d2�2uw + �2ux )) + b6c2k(k + 1)(�2uw )3(�2ux )4 + b4(k + 1)(�2uw )2(�2ux )2�2uy (d2�2uw + �2ux )(3c2k�2ux + �2um ) + b2�2uw�2ux (�2uy )2(d2�2uw + �2ux )(d2k�2uw (3c2(k + 1)�2ux + 2�2um ) + �2ux (3c2k(k + 1)�2ux + 2k�2um + �2um )) + (�2uy )3(d2�2uw + �2ux )2(d2(k + 1)�2uw (c2k�2ux + �2um ) + k�2ux (c2(k + 1)�2ux + �2um ))),where k = PQ .
30