Marginal integration for nonparametric causal inference

Jan Ernest∗ and Peter Bühlmann

Seminar für Statistik, ETH Zürich

e-mail: [email protected]; [email protected]

Abstract: We consider the problem of inferring the total causal effect of a single continuous variable intervention on a (response) variable of interest. We propose a certain marginal integration regression technique for a very general class of potentially nonlinear structural equation models (SEMs) with known structure, or at least a known superset of adjustment variables: we call the procedure S-mint regression. We easily derive that it achieves the same convergence rate as univariate nonparametric regression: for example, single variable intervention effects can be estimated with convergence rate n^{-2/5}, assuming smoothness in terms of twice differentiable functions. Our result can also be seen as a major robustness property with respect to model misspecification which goes much beyond the notion of double robustness. Furthermore, when the structure of the SEM is not known, we can estimate (the equivalence class of) the directed acyclic graph corresponding to the SEM, and then proceed by using S-mint based on these estimates. We empirically compare the S-mint regression method with more classical approaches and argue that the former is indeed more robust, more reliable and substantially simpler.

MSC 2010 subject classifications: Primary 62G05, 62H12.
Keywords and phrases: Backdoor adjustment, causal inference, intervention calculus, marginal integration, model misspecification, nonparametric inference, robustness, structural equation model.

1. Introduction

Understanding cause-effect relationships between variables is of great interest in many fields of science. An ambitious but highly desirable goal is to infer causal effects from observational data obtained by observing a system of interest without subjecting it to interventions.¹ This would allow one to circumvent potentially severe experimental constraints or to substantially lower experimental costs. The words "causal inference" (usually) refer to the problem of inferring effects which are due to (or caused by) interventions: if we make an outside intervention at a variable X, say, what is its effect on another response variable of interest Y? We describe examples in Section 1.3. Various fields and concepts have contributed to the understanding and quantification of causal inference:

∗ Partially supported by the Swiss National Science Foundation grant no. 20PA20E-134493.
¹ More generally, in the presence of both interventional and observational data, the goal is to infer, from the interventional data, intervention or causal effects among variables which are not directly targeted by the interventions.


the framework of potential outcomes and counterfactuals (cf. Rubin, 2005), see also Dawid (2000), structural equation modeling (cf. Bollen, 1998), and graphical modeling (cf. Lauritzen and Spiegelhalter, 1988; Greenland et al., 1999); the book by Pearl (2000) provides a nice overview.

We consider aspects of the problem indicated above, namely inferring intervention or causal effects from observational data without external interventions. Thus, we deal (in part) with the question of how to infer causal effects without relying on randomized experiments or randomized studies. Besides fundamental conceptual aspects, as treated for example in the books by Pearl (2000), Spirtes et al. (2000) and Koller and Friedman (2009), important issues include statistical tasks such as estimation accuracy and robustness with respect to model misspecification. This paper focuses on the two latter topics, covering also high-dimensional sparse settings with many variables (parameters) but relatively few observational data points.

In general, the tools for inferring causal effects are different from regression methods, but as we will argue, regression methods, when properly applied, remain a useful tool for causal inference. In fact, for the estimation of total causal effects, we make use of a marginal integration regression method which has originally been proposed for additive regression modeling (Linton and Nielsen, 1995). Its use in causal inference is novel. Relying on known theory for marginal integration in regression (Fan et al., 1998), our main result (Theorem 1) establishes optimal convergence properties and justifies the method as a fully robust procedure against model misspecification, as explained further in Section 1.2.

1.1. Basic concepts and definitions for causal inference

We very briefly introduce some of the basic concepts for causal inference (the reader who is familiar with them can skip this subsection). We consider p random variables X_1, ..., X_p, where one of them is a response variable Y of interest and one of them is an intervention variable X, that is, the variable where we make an external intervention by setting X to a certain value x. Such an intervention is denoted by Pearl's do-operator do(X = x) (cf. Pearl, 2000). We denote the indices corresponding to Y and X by j_Y and j_X, respectively: thus, Y = X_{j_Y} and X = X_{j_X}. We assume a setting where all relevant variables are observed, i.e., there are no relevant hidden variables.²

The system of variables is assumed to be generated from a structural equation model (SEM):

X_j ← f_j(X_{pa(j)}, ε_j), j = 1, ..., p. (1)

Thereby, ε_1, ..., ε_p are independent noise (or innovation) variables, and there is an underlying structure in terms of a directed acyclic graph (DAG) D, where each node j corresponds to the random variable X_j. We denote by pa(j) = pa_D(j) the set of parents of node j in the underlying DAG D,³ and the f_j(·) are assumed to be real-valued (measurable) functions. For any index set U ⊆ {1, ..., p} we write X_U := (X_v)_{v∈U}; for example, X_{pa(j)} = (X_v)_{v∈pa(j)}.

² It suffices to assume that Y, X and X_{pa(j_X)} (the parents of X) are observed, see (3).

The causal mechanism we are interested in is the total effect of an intervention at a single variable X on a response variable Y of interest.⁴ The distribution of Y when doing an external intervention do(X = x), setting variable X to x, is identified with its density (assumed to exist) or discrete probability function and is denoted by p(y | do(X = x)). The mathematical definition of p(y | do(X = x)) can be given in terms of a so-called truncated Markov factorization or, maybe more intuitively, by direct plug-in of the intervention value x for variable X and propagating this intervention value x to all other random variables including Y in the structural equation model (1); precise definitions are given in e.g. Pearl (2000) or Spirtes et al. (2000). The underlying important assumption in the definition of p(y | do(X = x)) is that the functional forms and error distributions of the structural equations for all the variables X_j which are different from X do not change when making an intervention at X.

A very powerful representation of the intervention distribution is given by the well-known backdoor adjustment formula.⁵ We say that a path in a DAG D is blocked by a set of nodes S if and only if it contains a chain ·· → m → ·· or a fork ·· ← m → ·· with m ∈ S, or a collider ·· → m ← ·· such that m ∉ S and no descendant of m is in S. Furthermore, a set of variables S is said to satisfy the backdoor criterion relative to (X, Y) if no node in S is a descendant of X and if S blocks every path between X and Y with an arrow pointing into X. For a set S that satisfies the backdoor criterion relative to (X, Y), the backdoor adjustment formula reads:

p(y | do(X = x)) = ∫ p(y | X = x, X_S) dP(X_S), (2)

where p(·) and P(·) are generic notations for the density or distribution (Pearl, 2000, Theorem 3.3.2). An important special case of the backdoor adjustment formula is obtained when considering the adjustment set S = pa(j_X): if j_Y ∉ pa(j_X), that is, if Y is not in the parental set of the variable X, then:

p(y | do(X = x)) = ∫ p(y | X = x, X_{pa(j_X)}) dP(X_{pa(j_X)}). (3)

Thus, if the parental set pa(j_X) is known, the intervention distribution can be calculated from the standard observational conditional and marginal distributions. Our main focus is the expectation of Y when doing the intervention do(X = x), the so-called total effect:

E[Y | do(X = x)] = ∫ y p(y | do(X = x)) dy.

³ The set of parents is pa_D(j) = {k : there exists a directed edge k → j in the DAG D}.
⁴ A total effect is the effect of an intervention at a variable X on another variable Y, taking into account the total of all (directed) paths from X to Y.
⁵ For a simple version of the formula, skip the text until the second line after formula (2).
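As a small illustration (not from the paper, but a standard textbook example), consider the three-variable DAG with edges Z → X, Z → Y and X → Y. Here S = {Z} = pa(j_X) satisfies the backdoor criterion relative to (X, Y), and the total effect

E[Y | do(X = x)] = ∫ E[Y | X = x, Z = z] dP(z)

differs in general from the observational regression E[Y | X = x] = ∫ E[Y | X = x, Z = z] dP(z | X = x): the adjustment integrates against the marginal distribution of the confounder Z rather than its conditional distribution given X = x.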


A general and often used route for inferring E[Y | do(X = x)] is as follows: the directed acyclic graph (DAG) corresponding to the structural equation model (SEM) is either known or (its Markov equivalence class is) estimated from data; building on this, one can estimate the functions in the SEM (edge functions in the DAG) and the error distributions in the SEM, and finally extract an estimate of E[Y | do(X = x)] (or bounds on this quantity if the DAG is not identifiable) from the observational distribution. See for example Spirtes et al. (2000); Pearl (2000); Maathuis et al. (2009); Spirtes (2010).

1.2. Our contribution

The new results of this paper should be explained for two different scenarios and application areas: one where the structure of the DAG D in the SEM is known, and the other where the structure and the DAG D are unknown and estimated from data. Of course, the second setting is linked to the first by treating the estimated structure as the true known structure. However, due to estimation errors, a separate discussion is in order.

1.2.1. Structural equation models with known structure

We consider a general SEM as in (1) with known structure in form of a DAG D but unknown functions f_j and unknown error distributions for the ε_j. As already mentioned before, our focus is on inferring the total effect

E[Y | do(X = x)] = ∫ y p(y | do(X = x)) dy, (4)

where p(y | do(X = x)) is the interventional density (or discrete probability function) of Y as loosely described in Section 1.1.

The first approach to infer the total effect in (4) is to estimate the functions f_j and the error distributions for the ε_j in the SEM. It is then possible to calculate E[Y | do(X = x)], typically using a path-based method based on the DAG D (see also Section 3.1). This route is essentially impossible without putting further assumptions on the functional form of the f_j in the SEM (1). For example, one often makes the assumption of additive errors, and if the cardinality of the parental set |pa(j)| is large, additional constraints such as additivity of a nonparametric function are needed to avoid the curse of dimensionality. Thus, to keep the general, possibly non-additive structure of the functions f_j in the SEM, we have to discard this approach.

The second approach for inferring the total effect in (4) relies on the powerful backdoor adjustment formula in (2). At first sight, the problem seems ill-posed because of the appearance of p(y | X = x, X_S) for a set S with possibly large cardinality |S|. But since we integrate over the variables X_S in (2), we are not entering the curse of dimensionality. This simple observation is a key idea of this paper. We present an estimation technique for E[Y | do(X = x)], or other functionals of p(y | do(X = x)), using marginal integration, which has been proposed and analyzed for additive regression modeling (Linton and Nielsen, 1995). The idea of our marginal integration approach is to first estimate a fully nonparametric regression of Y versus X and the variables X_S from a valid adjustment set satisfying the backdoor criterion (for example the parents of X or a superset thereof), and then average the obtained estimate over the variables X_S. We call the procedure "S-mint", standing for marginal integration with adjustment set S. Our main result in Theorem 1 establishes that E[Y | do(X = x)] can be inferred via marginal integration with the same rate of convergence as for one-dimensional nonparametric function estimation, for a very large class of structural equation models with potentially non-additive functional forms in the equations. We thereby achieve a major robustness result against model misspecification, as we only assume some standard smoothness assumptions but no further conditions on the functional form or nonlinearity of the functions f_j in the SEM, not even additive errors. Our main result (Theorem 1) also applies using a superset of the true underlying DAG D (i.e. there might be additional directed edges in the superset), see Section 2.3. For example, such a superset could arise from knowing the order of the variables (e.g. in a time series context), or an approximate superset might be available from estimation of the DAG where one would not care too much about slight or moderate overfitting.

Inferring E[Y | do(X = x)] under model misspecification is the theme of double robustness in causal inference, typically with a binary treatment variable X (cf. van der Laan and Robins, 2003). There, misspecification of either the regression or the propensity score model⁶ is allowed, but at least one of them has to be correct to allow for consistent estimation: the terminology "double robustness" is intended to reflect this kind of robustness. In contrast to double robustness, we achieve here "full robustness", where essentially any form of "misspecification" is allowed, in the sense that S-mint does not require any specification of the functional form of the structural equations in the SEM. More details are given in Section 2.1.1.

⁶ Definitions can be found in Section 2.1.1.

The local nature of parental sets. Our S-mint procedure requires the specification of a valid adjustment set S: as described in (3), we can always use the parental set pa(j_X) if j_Y ∉ pa(j_X). The parental variables are often an interesting choice for an adjustment set since they correspond to a local operation. Furthermore, as discussed below, the local nature of the parental sets can be very beneficial in the presence of only approximate knowledge of the true underlying DAG D.

1.2.2. Structural equation models with unknown structure

Consider the SEM (1), but now assume that the DAG D is unknown. For this setting, we propose a two-stage scheme ("est S-mint", Section 3.5). First, we estimate the structure of the DAG (or the Markov equivalence class of DAGs) or the order of the variables from observational data. To do this, all of the current approaches make further assumptions for the SEM in (1); see for example Chickering (2002); Teyssier and Koller (2005); Shimizu et al. (2006); Kalisch and Bühlmann (2007); Schmidt et al. (2007); Hoyer et al. (2009); Shojaie and Michailidis (2010); Bühlmann et al. (2014).

We can then infer E[Y | do(X = x)] as before with S-mint model fitting, but based on an estimated (instead of the true) adjustment set S. This seems often more advisable than using the estimated functions in the SEM, which are readily available from structure estimation, and pursuing a path-based method with the estimated DAG. Since estimation of (the Markov equivalence class of) the DAG or of the order of variables is often very difficult and of limited accuracy for finite sample size, the second stage with S-mint model fitting is fairly robust with respect to errors in order- or structure-estimation and model misspecification, as suggested by our empirical results in Section 5.3. Therefore, such a two-stage procedure with structure- or order-search⁷ and subsequent marginal integration leads to reasonably accurate and sometimes better results. For example, Section 5 reports a comparable performance to the direct CAM method (Bühlmann et al., 2014) with subsequent path-based estimation of causal effects, which is based on, or assumes, a correctly specified additive SEM.⁸ Thus, even if the est S-mint approach with fully nonparametric S-mint modeling in the second stage does not exploit the additional structural assumption of an additive SEM, it exhibits a competitive performance.

As mentioned in the previous subsection, the parental sets (or supersets thereof) with their local nature are often a very good choice in the presence of estimation errors with respect to inferring the true DAG (or equivalence class thereof): instead of assuming high accuracy for recovering the entire (equivalence class of the) DAG, we only need a reasonably accurate estimate of the much smaller and local parental set.

A combined structured (or parametric) and fully nonparametric approach. The two-stage est S-mint procedure is typically a combination of a structured nonparametric or parametric approach for estimating the DAG (or the equivalence class thereof) and the fully nonparametric S-mint method in the subsequent second stage. As outlined above, it exhibits comparatively good performance. One could think of pursuing the first stage in a fully nonparametric fashion as well, for example by using the PC-algorithm with nonparametric conditional independence tests (Spirtes et al., 2000), see also Song et al. (2013). For a finite amount of data and a fairly large number of variables, this is a very ambitious if not ill-posed task. In view of this, we almost have to make additional structural or parametric assumptions for structure learning of the DAG (or its equivalence class). However, since the fully nonparametric S-mint procedure in the second stage is less sensitive to incorrect specification of the DAG (or its equivalence class), the combined approach exhibits better robustness. Vice versa, if the structural or parametric model used for structure learning in the first stage is correct, we do not lose much efficiency when "throwing away" (or not exploiting) such structural information in the second stage with S-mint. We only have empirical results to support such accuracy statements.

⁷ We do not make use of e.g. estimated edge functions, even if they were implicitly estimated for structure-search, as e.g. in Chickering (2002).
⁸ For a short description of the CAM method, see the last paragraph of Section 3.4.

1.3. The scope of possible applications

Genetic network inference is a prominent example where causal inference methods are used, mainly for estimating an underlying network in terms of a directed graph (cf. Smith et al., 2002; Husmeier, 2003; Friedman, 2004; Yu et al., 2004). The goal is very ambitious, namely to recover relevant edges in a complex network from observational data or a few interventional data. This paper does not address this issue: instead of recovering a network (structure), inferring total causal or intervention effects from observational data is a different, maybe more realistic, but still very challenging goal in its full generality. Yet making progress can be very useful in many areas of application, notably for prioritizing and designing future randomized experiments which have a large total effect on a response variable of interest, ranging from molecular biology and bioinformatics (Editorial Nat. Methods, 2010) to many other fields including economics, medicine or the social sciences. Such model-driven prioritization for gene intervention experiments in molecular biology has been experimentally validated with some success (Maathuis et al., 2010; Stekhoven et al., 2012).

We will discuss an application from molecular biology on a rather "toy-like" level in Section 6. Despite all simplifying considerations, however, we believe that it indicates a broader scope of possible applications. When having approximate knowledge of the parental sets of the variables in a potentially large-scale system, one would not need to worry much about the underlying form of the dependences of (or structural equations linking) the variables: for quantifying the effect of single variable interventions, the proposed S-mint marginal integration estimator converges with the univariate rate, as stated in (the main result) Theorem 1.

Quantifying single variable interventions from observational data is indeed a useful first step. Further work is needed to address the following issues: (i) inference in settings with additional hidden, unobserved variables (cf. Spirtes et al., 2000; Zhang, 2008; Shpitser et al., 2011; Colombo et al., 2012); (ii) inference based on both observational and interventional data (cf. He and Geng, 2008; Hauser and Bühlmann, 2012, 2014, 2015); and finally (iii) developing sound tools and methods towards more confirmatory conclusions. The appropriate modifications and further developments of our new results (mainly Theorem 1) towards these points (i)-(iii) are not straightforward.

2. Causal effects for general nonlinear systems via backdoor adjustment: marginal integration suffices

We present here the, maybe surprising, result that marginal integration allows us to infer the causal effect of a single variable intervention with a convergence rate as for one-dimensional nonparametric function estimation, in essentially any nonlinear structural equation model.

We assume a structural equation model (as already introduced in Section 1.1)

X_j ← f_j^0(X_{pa(j)}, ε_j), j = 1, ..., p, (5)

where ε_1, ..., ε_p are independent noise (or innovation) variables, pa(j) denotes the set of parents of node j in the underlying DAG D^0, and the f_j^0(·) are real-valued (measurable) functions. We emphasize the true underlying quantities with a superscript "0". We assume in this section that the DAG D^0, or at least a (super-) DAG D^0_super which contains D^0 (see Section 2.3), is known. As mentioned earlier, our goal is to give a representation of the expected value of the intervention distribution, E[Y | do(X = x)], for two variables Y, X ∈ {X_1, ..., X_p}. That is, we want to study the total effect that an intervention at X has on a target variable Y. Let S be a set of variables satisfying the backdoor criterion relative to (X, Y), implying that

p(y | do(X = x)) = ∫ p(y | X = x, X_S) dP(X_S),

where p(·) and P(·) are generic notations for the density or distribution (see Section 1.1). Assuming that we can interchange the order of integration (cf. part 6 of Assumption 1), we obtain

E[Y | do(X = x)] = ∫ E[Y | X = x, X_S] dP(X_S). (6)

This is a function depending on the one-dimensional variable x only, and therefore, intuitively, its estimation should not be much exposed to the curse of dimensionality. We will argue below that this is indeed the case.
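For completeness, the interchange of integration behind (6) is the following elementary step (a standard Fubini argument; part 6 of Assumption 1 below guarantees that it is justified):

E[Y | do(X = x)] = ∫ y ( ∫ p(y | X = x, x_S) dP(x_S) ) dy = ∫ ( ∫ y p(y | X = x, x_S) dy ) dP(x_S) = ∫ E[Y | X = x, X_S = x_S] dP(x_S).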

2.1. Marginal integration

Marginal integration is an estimation method which has been primarily designed for additive and structured regression fitting (Linton and Nielsen, 1995). Without any modifications, though, it is also suitable for the estimation of E[Y | do(X = x)] in (6).

Let S be a set of variables satisfying the backdoor criterion relative to (X, Y) (see Section 1.1) and denote by s the cardinality of S. We use a nonparametric partial local estimator of the multivariate regression function m(x, x_S) = E[Y | X = x, X_S = x_S] of the form

(α̂, β̂) = argmin_{α,β} Σ_{i=1}^n (Y_i − α − β(X^{(i)} − x))² K_{h_1}(X^{(i)} − x) L_{h_2}(X_S^{(i)} − x_S), (7)

where α̂ = α̂(x, x_S), β̂ = β̂(x, x_S), K and L are two kernel functions, and h_1, h_2 are the respective bandwidths, i.e.,

K_{h_1}(t) = (1/h_1) K(t/h_1),  L_{h_2}(t) = (1/h_2^s) L(t/h_2).


We obtain the partial local linear estimator at (x, x_S) as m̂(x, x_S) = α̂(x, x_S). We then integrate over the variables X_S with the empirical mean and obtain:

Ê[Y | do(X = x)] = n^{-1} Σ_{k=1}^n m̂(x, X_S^{(k)}). (8)

This is a locally weighted average, with localization through the one-dimensional variable x. For our main theoretical result to hold, we make the following assumptions:

Assumption 1.

1. The variables X_S have a bounded support supp(X_S).
2. The regression function m(u, u_S) = E[Y | X = u, X_S = u_S] exists and has bounded partial derivatives up to order 2 with respect to u and up to order d with respect to u_S, for u in a neighborhood of x and u_S ∈ supp(X_S).
3. The variables X, X_S have a density p(·, ·) with respect to Lebesgue measure, and p(u, u_S) has bounded partial derivatives up to order 2 with respect to u and up to order d with respect to u_S. In addition, it holds that inf_{u ∈ x±δ, x_S ∈ supp(X_S)} p(u, x_S) > 0 for some δ > 0.
4. The kernel functions K and L are symmetric with bounded supports, and L is an order-d kernel.
5. For ε = Y − E[Y | X, X_S], it holds that E[ε⁴] is finite and E[ε² | X = x, X_S = x_S] is continuous. Furthermore, for a δ > 0, E[|ε|^{2+δ} | X = u] is bounded for u in a neighborhood of x.
6. There exists c < ∞ such that E[|Y| | X = x, X_S = x_S] ≤ c for all x_S.

Note that part 6 of Assumption 1 is only needed for interchanging the order of integration in (6). Due to the bounded support of the variables X_S, it is not overly restrictive.

As a consequence, the following result from Fan et al. (1998) establishes a convergence rate for the estimator as for one-dimensional nonparametric function estimation.

Theorem 1. Suppose that Assumption 1 holds for a set S satisfying the backdoor criterion relative to (X, Y) in the DAG D^0 from model (5). Consider the estimator in (8). Assume that the bandwidths are chosen such that h_1, h_2 → 0 with n h_1 h_2^{2s} / log²(n) → ∞ and h_2^d / h_1^2 → 0, and in addition satisfying n h_1 h_2^s / log(n) → ∞ and h_1^4 log(n) / h_2^s → 0 (all these conditions hold when choosing the bandwidths in a properly chosen optimal range, see below). Then,

Ê[Y | do(X = x)] − E[Y | do(X = x)] = O(h_1^2) + O_P(1/√(n h_1)).

Proof. The statement follows from Fan et al. (1998, Theorem 1 and Remark 3).


When assuming the smoothness condition d > s for m(u, u_S) with respect to the variable u_S, and when choosing h_1 ≍ n^{-1/5} and h_2 ≍ n^{-α} with 2/(5d) < α < 2/(5s) (which requires d > s), all the conditions for the bandwidths are satisfied: Theorem 1 then establishes the convergence rate O(n^{-2/5}), which matches the optimal rate for estimation of one-dimensional smooth functions having second derivatives; such a smoothness condition is assumed for m(u, u_S) with respect to the variable u in part 2 of Assumption 1. Thus, the implication is the important robustness fact that for any potentially nonlinear structural equation model satisfying the regularity conditions in Theorem 1, we can estimate the expected value of the intervention distribution with the same accuracy as in nonparametric estimation of a smooth function with one-dimensional argument. We note, as mentioned already in Section 1.2.1, that it would be essentially impossible to estimate the functions f_j in (1) in full generality: interestingly, when focusing on inferring the total effect E[Y | do(X = x)], the problem is much better posed, as demonstrated with our concrete S-mint procedure. Furthermore, with the (valid) choice S = pa(j_X) or an (estimated) superset thereof, one obtains a procedure that is only based on local information in the graph: this turns out to be advantageous, see also Section 1.2.1, particularly when the underlying DAG structure is not correctly specified (see Section 5.3). We will report on the performance of such an S-mint estimation method in Sections 4 and 5. Note that the rate of Theorem 1 remains valid (for a slightly modified estimator) if we allow for discrete variables in the parental set of X (Fan et al., 1998).
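To make the procedure concrete, here is a minimal Python sketch of S-mint. It uses a locally constant (Nadaraya-Watson type) regression fit in place of the partial local linear estimator (7), Gaussian product kernels, and fixed bandwidths; the function name, the toy SEM and all numerical choices are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def smint_effect(x, X, XS, Y, h1=0.3, h2=0.5):
    """Estimate E[Y | do(X = x)] by marginal integration, cf. (8):
    nonparametric regression of Y on (X, X_S), averaged over the
    empirical distribution of the adjustment variables X_S."""
    n = len(Y)
    est = 0.0
    for k in range(n):  # empirical mean over X_S^{(k)}
        # Gaussian product-kernel weights localizing at (x, X_S^{(k)})
        w = np.exp(-0.5 * ((X - x) / h1) ** 2)
        w *= np.exp(-0.5 * np.sum(((XS - XS[k]) / h2) ** 2, axis=1))
        est += np.sum(w * Y) / np.sum(w)  # locally constant fit
    return est / n

# toy SEM with a backdoor path X <- Z -> Y and effect X -> Y
rng = np.random.default_rng(0)
n = 500
Z = rng.normal(size=n)                       # Z = pa(j_X)
X = Z + rng.normal(scale=0.5, size=n)
Y = np.sin(X) + Z ** 2 + rng.normal(scale=0.5, size=n)

# adjustment set S = pa(j_X) = {Z}; true E[Y | do(X = 1)] = sin(1) + 1
print(smint_effect(1.0, X, Z.reshape(-1, 1), Y))
```

Note how the estimate targets sin(x) + E[Z²] rather than the confounded observational regression E[Y | X = x].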

It is also worthwhile to point out that S-mint becomes more challenging for inferring multiple variable interventions such as E[Y | do(X_1 = x_1, X_2 = x_2)]: the convergence rate is then of the order n^{-1/3} for a twice differentiable regression function.

Remark 1. Theorem 1 generalizes to real-valued transformations t(·) of Y. By using the argument as in (6) and replacing part 6 of Assumption 1 by the corresponding statement for t(Y), we obtain

E[t(Y) | do(X = x)] = ∫ t(y) p(y | do(X = x)) dy = ∫ E[t(Y) | X = x, X_S] dP(X_S).

For example, for t(y) = y² we obtain second moments, and we can then estimate the variance Var(Y | do(X = x)) = E[Y² | do(X = x)] − (E[Y | do(X = x)])². Or, with the indicator function t(y) = I(y ≤ c) (c ∈ ℝ), we obtain a procedure for estimating P[Y ≤ c | do(X = x)] with the same convergence rate as for one-dimensional nonparametric function estimation, using marginal integration of t(Y) versus X, X_S.

2.1.1. Binary treatment and connection to double robustness

For the special but important case with binary treatment, where X ∈ {0, 1} and X_S ∈ ℝ^s is continuous, we can use marginal integration as well. We can estimate the regression function m(x, x_S) for x ∈ {0, 1} by using a kernel estimator based on the data with observed X^{(k)} = 0 and X^{(k)} = 1, respectively, denoted by m̂(x, x_S) (x ∈ {0, 1}). We then integrate over x_S with the empirical mean n^{-1} Σ_{k=1}^n m̂(x, X_S^{(k)}) (x ∈ {0, 1}). When choosing the bandwidth h_2 (for smoothing over the X_S variables) smaller than for the non-integrated quantity m̂(x, x_S), and assuming smoothness conditions, we anticipate the n^{-1/2} convergence rate for estimating E[Y | do(X = x)] with x ∈ {0, 1}; see for example Hall and Marron (1987) in the context of nonparametric squared density estimation. We note that this establishes only the optimal parametric convergence rate but does not generally lead to asymptotic efficiency. For the case of binary treatment, semiparametric minimax rates have been established in Robins et al. (2009), and asymptotically efficient methods can be constructed using higher order influence functions (Li et al., 2011) or targeted maximum likelihood estimation (van der Laan and Rose, 2011), both of which might be more suitable than marginal integration.

Theorem 1 establishes that S-mint is "fully robust" against model misspecification for inferring E[Y | do(X = x)] or related quantities as mentioned in Remark 1. The existing framework of double robustness is related to the issue of misspecification, and we clarify here the connection. One specifies regression models for E[Y | X, X_S] = m(X, X_S) for both X = 0 and X = 1, and a propensity score (Rosenbaum and Rubin, 1983) or inverse probability weighting model (IPW; Robins et al. (1994)): for a binary intervention variable where X encodes "exposure" (X = 1) and "control" (X = 0), the latter is a (often parametric logistic) model for P[X = 1 | X_S]. A doubly robust (DR) estimator for E[Y | do(X = x)] requires that either the regression model or the propensity score model is correctly specified. If both of them are misspecified, the DR estimator is inconsistent. Double robustness of the augmented IPW approach has been proved by Scharfstein et al. (1999), and double robustness in general was further developed by many others, see e.g. Bang and Robins (2005). The targeted maximum likelihood estimation (TMLE) framework (van der Laan and Rose, 2011) is also doubly robust. It uses a second step where the initial estimate is modified in order to make it less biased for the target parameter (e.g. the average causal effect between "exposure" and "control"). If both the initial estimator and the treatment mechanism are consistently estimated, TMLE can be shown to be asymptotically efficient. TMLE with a super-learner or also the approach of higher order influence functions (Li et al., 2011) can deal with a nonparametric model. Robins et al. (2009) prove that s = dim(X_S) ≤ 2(β_regr + β_propens), where β_regr and β_propens denote the smoothness of the regression and propensity score function, respectively, is a necessary condition for an estimator to achieve the 1/√n convergence rate.

Our S-mint procedure is related to these nonparametric approaches: it differs, though, in that it deals with a continuous treatment variable. Similar to the smoothness requirement above, we have discussed after Theorem 1 that we can achieve the n^{-2/5} nonparametric optimal rate (when assuming bounded derivatives up to order 2 of the regression function with respect to the treatment variable) if s = dim(X_S) < d, where d plays the role of β_regr. The condition s < d is stronger than for the optimal 1/√n convergence rate with binary treatment; however, this could be relaxed to the regime s < 2d when invoking Fan et al. (1998, Remark 1). Therefore, rate optimal estimation with continuous treatment can be achieved under a "comparable" smoothness assumption as in the binary treatment case.

2.2. Implementation of marginal integration

Theorem 1 justifies marginal integration as in (8) asymptotically. One issue is the choice of the two bandwidths h_1 and h_2: we cannot rely on cross-validation because E[Y | do(X = x)] is not a regression function and is not linked to prediction of a new observation Y_new, nor can we use penalized likelihood techniques with e.g. BIC, since E[Y | do(X = x)] does not appear in the likelihood. Despite the difficulty of choosing the smoothing parameters, the smoothing problem becomes easier to address, at least in practice, with an iterative boosting approach (cf. Friedman, 2001; Bühlmann and Yu, 2003).

We propose here a scheme, without complicated tuning of parameters, which we found to be most stable and accurate in extensive simulations. The idea is to elaborate on the estimation of the function m(x, x_S) = E[Y | X = x, X_S = x_S], going from a simple starting point to more complex estimates, while the integration over the variables X_S is done with the empirical mean as in (8).

We start with the following simple but useful result.

Proposition 1. If pa(j_X) = ∅ or if there are no backdoor paths from j_X to j_Y in the true DAG D^0 from model (5), then

E[Y | do(X = x)] = E[Y | X = x].

Proof. If there are no backdoor paths from j_X to j_Y, the empty set S = ∅ satisfies the backdoor criterion relative to (X, Y). The statement then directly follows from the backdoor adjustment formula (2).

We learn from Proposition 1 that in simple cases, a standard one-dimensional regression estimator for E[Y | X = x] would suffice. On the other hand, we know from the backdoor adjustment formula in (6) that we should adjust with the variables X_S. Therefore, it seems natural to use an additive regression approximation for m(x, x_S) as a simple starting point. If the assumptions of Proposition 1 hold, such an additive model fit yields a consistent estimate for the component of the variable x: in fact, it is asymptotically as efficient as one-dimensional function estimation for E[Y | X = x] (Horowitz et al., 2006). If the assumptions of Proposition 1 do not hold, we can still view an additive model fit m̂_add(x, x_S) = μ̂ + m̂_{add,j_X}(x) + Σ_{j∈S} m̂_{add,j}(x_j) as one of the simplest starting points to approximate the more complex function m(x, x_S). When integrating out with the empirical mean as in (8), we obtain the estimate Ê_add[Y | do(X = x)] = μ̂ + m̂_{add,j_X}(x). As motivated above and backed up by simulations, μ̂ + m̂_{add,j_X}(x) is quite often already a reasonable estimator for E[Y | do(X = x)].


In the presence of strong interactions between the variables, the additive approximation may fail drastically, though. Thus, we implement marginal integration as follows: starting from m̂_add, we apply L2-Boosting with a nonparametric kernel estimator similar to the one in (7). More precisely, we compute residuals

R_1^{(i)} = Y^{(i)} − m̂_add(X^{(i)}, X_S^{(i)}), i = 1, ..., n,

which, for simplicity, are fitted with a locally constant estimator of the form

α̂(x, x_S) = argmin_α Σ_{i=1}^n (R_1^{(i)} − α)² K_{h_1}(X^{(i)} − x) L_{h_2}(X_S^{(i)} − x_S). (9)

The resulting fit is denoted by ĝ_{R_1}(x, x_S) := α̂(x, x_S). We add this new function fit to the previous one, compute residuals again, and then iterate the procedure b_stop times. To summarize, for b = 1, ..., b_stop − 1,

m̂_1(x, x_S) = m̂_add(x, x_S),
m̂_{b+1}(x, x_S) = m̂_b(x, x_S) + ĝ_{R_b}(x, x_S),
R_{b+1}^{(i)} = Y^{(i)} − m̂_{b+1}(X^{(i)}, X_S^{(i)}), i = 1, ..., n.

The final estimate for the total causal effect is obtained by marginally integrating over the variables X_S with the empirical mean as in (8), that is,

Ê[Y | do(X = x)] = n^{-1} Σ_{k=1}^n m̂_{b_stop}(x, X_S^{(k)}).

The concrete implementation of the additive model fitting is according to the default from the R-package mgcv: penalized thin plate splines with the regularization parameter in the penalty chosen by generalized cross-validation, see e.g. Wood (2003, 2006). The basis dimension for each smooth is set to 10. For the product kernel in (9), we choose K to be a Gaussian kernel and L to be a product of Gaussian kernels. The bandwidths h_1 and h_2 in the kernel estimator should be chosen "large" to yield an estimator with low variance but typically high bias. The iterations then reduce the bias. Once we have fixed h_1 and h_2 (and this choice is not very important as long as the bandwidths are "large"), the only regularization parameter is b_stop. It is chosen by the following considerations: for each iteration we approximate the sum of the differences to the previous approximation over the set of intervention values I (typically the nine deciles, see Section 5), that is,

Σ_{x∈I} | n^{-1} Σ_{k=1}^n ĝ_{R_b}(x, X_S^{(k)}) |. (10)

When this quantity becomes reasonably "small" (what counts as small needs to be specified depending on the context), we stop the boosting procedure. Such an iterative boosting scheme has the advantage that it is less sensitive to the choice of b_stop than the original estimator in (8) is to the specification of its tuning parameters; in addition, boosting adapts to some extent to different smoothness in different directions (variables). All these ideas are presented at various places in the boosting literature, particularly in Friedman (2001); Bühlmann and Yu (2003); Bühlmann and Hothorn (2007). In Section 4.2 we provide an example of a DAG with backdoor paths where the additive approximation is incorrect and several boosting iterations are needed to account for interaction effects between the variables. The implementation of our method is summarized in Algorithm 1; we also call it S-mint, and we use it for all our empirical results in Sections 4-6.

Algorithm 1 S-mint
1: if S = ∅ is a valid adjustment set (for example, if pa(j_X) = ∅) then
2:   Fit an additive regression of Y versus X to obtain m̂_add
3:   return m̂_add
4: else
5:   Fit an additive regression of Y versus X and the adjustment set variables X_S to obtain m̂_1 = m̂_add
6:   for b = 2, ..., b_stop − 1 do
7:     Apply L2-boosting to capture deviations from an additive regression model:
8:     (i) Compute residuals R_b = Y − m̂_b
9:     (ii) Fit the residuals with the kernel estimator (9) to obtain ĝ_{R_b}
10:    (iii) Set m̂_{b+1} = m̂_b + ĝ_{R_b}
11:  end for
12:  return the marginal integration output n^{-1} Σ_{k=1}^n m̂_{b_stop}(x, X_S^{(k)})
13: end if
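A minimal Python sketch of Algorithm 1 is given below. For self-containedness it simplifies the additive-model start (fitted with mgcv in the paper) to the overall mean of Y, so it only illustrates the boosting steps 6-11 and the marginal integration step 12; all function names, bandwidths and the fixed number of iterations are illustrative assumptions.

```python
import numpy as np

def nw(R, X, XS, x, xS, h1, h2):
    """Locally constant kernel fit of a residual vector R at (x, x_S),
    with Gaussian product kernels, cf. (9)."""
    w = np.exp(-0.5 * ((X - x) / h1) ** 2)
    w *= np.exp(-0.5 * np.sum(((XS - xS) / h2) ** 2, axis=1))
    return np.sum(w * R) / np.sum(w)

def smint_boost(x, X, XS, Y, h1=1.0, h2=1.5, b_stop=10):
    """Boosted S-mint: iteratively fit residuals with the kernel
    estimator (9) starting from a crude initial fit, then marginally
    integrate the final fit over the empirical X_S as in (8).
    The additive-model start of Algorithm 1 is simplified to Y.mean()."""
    n = len(Y)
    layers = []                          # stores residual vectors R_b
    fitted = np.full(n, Y.mean())        # simplified m_1 at the data points
    for _ in range(b_stop - 1):
        R = Y - fitted                   # step 8: residuals
        layers.append(R)
        fitted = fitted + np.array(      # steps 9-10: add kernel fit
            [nw(R, X, XS, X[i], XS[i], h1, h2) for i in range(n)])

    def m_hat(u, uS):                    # m_{b_stop}(u, u_S)
        return Y.mean() + sum(nw(R, X, XS, u, uS, h1, h2) for R in layers)

    # step 12: marginal integration over the empirical X_S
    return np.mean([m_hat(x, XS[k]) for k in range(n)])
```

The deliberately large bandwidths give a low-variance, high-bias starting fit; the boosting iterations then reduce the bias, mirroring the bandwidth recommendation above.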

We note the following about L2-boosting: if the initial estimator is a weighted mean m̂_1(x, x_S) = Σ_{i=1}^n w_i^{(1)}(x, x_S) Y_i with Σ_{i=1}^n w_i^{(1)}(x, x_S) = 1 (e.g. many additive function estimators are of this form), then, since the kernel estimator ĝ_{R_b} in boosting step 9 is a weighted mean too, m̂_b(x, x_S) = Σ_{i=1}^n w_i^{(b)}(x, x_S) Y_i is a weighted mean. Thus, L2-boosting has the form of a weighted mean estimator. When using kernel estimation for ĝ_{R_b}, the boosting estimator m̂_{b_stop} is related to an estimator with a higher order kernel (Marzio and Taylor, 2008) which depends on the bandwidth in ĝ_{R_b} and the number of boosting iterations in a rather non-explicit way. Establishing the theoretical properties of the L2-boosting estimator Ê[Y | do(X = x)] = n^{-1} Σ_{k=1}^n m̂_{b_stop}(x, X_S^{(k)}) is beyond the scope of this paper.

2.3. Knowledge of a superset of the DAG

It is known that a superset of the parental set pa(j_X) suffices for the backdoor adjustment in (3). To be precise, let

S(j_X) ⊇ pa(j_X) with S(j_X) ∩ de(j_X) = ∅, (11)

where de(j_X) denotes the descendants of j_X (in the true DAG D^0). For example, S(j_X) could be the parents of X in a superset of the true underlying DAG (a DAG with additional edges relative to the true DAG). We can then choose the adjustment set S in (8) as S(j_X) and Theorem 1 still holds true, assuming that the cardinality |S(j_X)| ≤ M < ∞ is bounded. Thus, with the choice S = S(j_X), we can use marginal integration by marginalizing over the variables X_{S(j_X)}.

A prime example where we are provided with a superset S(j_X) ⊇ pa(j_X) with S(j_X) ∩ de(j_X) = ∅ is when we know the order of the variables and can deduce an approximate superset of the parents from it. When the variables are ordered with X_j ≺ X_k for j < k, we would use

S(j_X) = {k : j_X − p_max ≤ k < j_X} ⊇ pa(j_X), (12)

where "≺" denotes the order relation among the variables and p_max is an upper bound on the size of the superset, chosen large enough to ensure that S(j_X) ⊇ pa(j_X). In code this is a one-liner, as sketched below.
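A minimal sketch of constructing the predecessor window (12) from a known causal order; the helper name and the 1-based indexing convention are illustrative assumptions.

```python
def adjustment_superset(j_x, p_max):
    """S(j_X) = {k : j_X - p_max <= k < j_X}, cf. (12): the (at most p_max)
    immediate predecessors of j_X in the causal order, 1-based indexing."""
    return set(range(max(1, j_x - p_max), j_x))

print(adjustment_superset(5, 3))  # {2, 3, 4}
```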

Corollary 1. Consider the estimator in (8) and assume the conditions of Theorem 1 for the variables Y, X and X_{S(j_X)}, with S(j_X) as in (11) or as in (12) for ordered variables. Then,

Ê[Y | do(X = x)] − E[Y | do(X = x)] = O(h_1^2) + O_P(1/√(n h_1)).

Proof. The statement is an immediate consequence of Theorem 1, as S(j_X) in (11) and (12) satisfies the backdoor criterion relative to (X, Y).

3. Path-based methods

We assume in the following, until Section 3.5, that we know the true DAG and all true functions and error distributions in the general SEM (1). Thus, in contrast to Section 2, we have here also knowledge of the entire structure in form of the DAG D^0 (and not only a valid adjustment set S as assumed for Theorem 1). This allows us to infer E[Y | do(X = x)] in different ways than with the generic S-mint regression from Section 2. The motivation to look at other methods is driven by potential gains in statistical accuracy when including the additional information of the functional form or of the entire DAG in the structural equation model. We will empirically analyze this issue in Section 5.

3.1. Entire path-based method from root nodes

Based on the true DAG, the variables can always be ordered such that

X_{j_1} ≺ X_{j_2} ≺ ... ≺ X_{j_p}.

Denote by j_X and j_Y the indices of the variables X and Y, respectively. If X is not an ancestor of Y, we trivially know that E[Y | do(X = x)] = E[Y]. If X is an ancestor of Y, it must hold that j_X < j_Y. We can then generate the intervention distribution of the random variables X_{j_1} ≺ X_{j_2} ≺ ... ≺ Y | do(X = x) in the model (1) as follows (Pearl, 2000, Def. 3.2.1):


Step 1: Generate ε_{j_1}, ..., ε_{j_Y}.
Step 2: Based on Step 1, recursively generate:

X_{j_1} ← ε_{j_1},
X_{j_2} ← f_{j_2}^0(X_{pa(j_2)}, ε_{j_2}),
...,
X_{j_X} ← x,
...,
X_{j_Y} ← f_{j_Y}^0(X_{pa(j_Y)}, ε_{j_Y}).

Instead of an analytic expression for p(y | do(X = x)), obtained by integrating out over the other variables {X_{j_k}; k ≠ j_X, j_Y}, we rather rely on simulation. We draw B samples Y^{(1)} = X_{j_Y}^{(1)}, ..., Y^{(B)} = X_{j_Y}^{(B)} by B independent simulations of Steps 1-2 above and then approximate, for B large,

E[Y | do(X = x)] ≈ B^{-1} Σ_{b=1}^B Y^{(b)}.

Furthermore, the simulation technique allows one to obtain the distribution p(y | do(X = x)) via e.g. density estimation or histogram approximation based on Y^{(1)}, ..., Y^{(B)}.

The method has an implementation in Algorithm 2, which uses propagation of simulated random variables along directed paths in the DAG. The method exploits the entire paths in the DAG from the root nodes to the node j_Y corresponding to the random variable Y. Figure 1 provides an illustration.

Algorithm 2 Entire path-based algorithm for simulating the intervention distribution
1: If there is no directed path from j_X to j_Y, the interventional and observational quantities coincide: p(Y | do(X = x)) ≡ p(Y) and E[Y | do(X = x)] ≡ E[Y].
2: If there is a directed path from j_X to j_Y, proceed with steps 3-9.
3: Set X = X_{j_X} = x and delete all in-going arcs into X.
4: Find all directed paths from root nodes (including j_X) to j_Y, and denote them by p_1, ..., p_q.
5: for b = 1, ..., B do
6:   For every path, recursively simulate the corresponding random variables according to the order of the variables in the DAG:
     (i) Simulate the random variables corresponding to the root nodes of p_1, ..., p_q;
     (ii) Simulate in each path p_1, ..., p_q the random variables following the root nodes; proceed recursively, according to the order of the variables in the DAG;
     (iii) Continue with the recursive simulation of random variables until Y is simulated.
7:   Store the simulated variable Y^{(b)}.
8: end for
9: Use the simulated sample Y^{(1)}, ..., Y^{(B)} to approximate the intervention distribution p(y | do(X = x)) or its expectation E[Y | do(X = x)].
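For a fully specified SEM, Steps 1-2 (and hence Algorithm 2) amount to ancestral sampling with the structural equation of X replaced by the constant x. A minimal sketch for the toy SEM with DAG Z → X, Z → Y, X → Y used in Section 2.1 (model and names are illustrative assumptions):

```python
import numpy as np

def simulate_do(x, B=100_000, rng=np.random.default_rng(0)):
    """Draw B samples of Y | do(X = x) in the toy SEM
    Z <- eps_Z, X <- Z + eps_X, Y <- sin(X) + Z**2 + eps_Y,
    by forward (ancestral) sampling with X fixed to x."""
    Z = rng.normal(size=B)              # root node
    X = np.full(B, x)                   # intervention: X <- x
    Y = np.sin(X) + Z ** 2 + rng.normal(scale=0.5, size=B)
    return Y

Y_do = simulate_do(1.0)
print(Y_do.mean())  # approximates E[Y | do(X = 1)] = sin(1) + 1
```

The sample Y_do can also be fed into a density estimator or histogram to approximate p(y | do(X = x)).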

When having estimates of the true DAG and of all functions and error distributions in the additive structural equation model (14), we would use the procedure above based on these estimated quantities; for the error distributions, we either use the estimated variances in Gaussian distributions, or we rely on bootstrapping residuals from the structural equation model (typically with residuals centered around zero).

3.2. Partially path-based method with short-cuts

Mainly motivated by computational considerations (see also Section 3.3), a modification of the procedure in Algorithm 2 is valid. Instead of considering all paths from root nodes to j_Y (corresponding to variable Y), we only consider all paths from j_X (corresponding to variable X) to j_Y and simulate the random variables on these paths p'_1, ..., p'_m. Obviously, in comparison to Algorithm 2, m ≤ q and every p'_k corresponds to a path p_r for an r ∈ {1, ..., q}.

Every path p'_k is of the form

j_X = j_{k,1} → j_{k,2} → ... → j_{k,ℓ_k−1} → j_{k,ℓ_k} = j_Y,

having length ℓ_k. For recursively simulating the random variables on the paths p'_1, ..., p'_m, we start by setting

X = X_{j_X} ← x.

Then we recursively simulate the random variables corresponding to all the paths p'_1, ..., p'_m according to the order of the variables in the DAG. For each of these random variables X_j with j ∈ {p'_1, ..., p'_m} and j ≠ j_X, we need the corresponding parental variables and error terms in

X_j ← f_j^0(X_{pa(j)}, ε_j),

where for every k ∈ pa(j) we set

X_k = { the previously simulated value, if k ∈ {p'_1, ..., p'_m},
      { a bootstrap resampled X_k^*,    otherwise,                (13)

where the bootstrap resampling is with replacement from the entire data. The errors are simulated according to the error distribution.

We summarize the procedure in Algorithm 3; Figure 1 provides an illustration.

Proposition 2. Consider the population case where the bootstrap resampling in (13) yields the correct distribution of the random variables X_1, ..., X_p. Then, as B → ∞, the partially path-based Algorithm 3 yields the correct intervention distribution p(y | do(X = x)) and its expected value E[Y | do(X = x)].

Proof. The statement of Proposition 2 directly follows from the definition of the intervention distribution in a structural equation model.

The same comment as in Section 3.1 applies here: when having estimates of the quantities in the additive structural equation model (14), we would use Algorithm 3 based on the plugged-in estimates. The computational benefit of using Algorithm 3 instead of Algorithm 2 is illustrated in Figure 7.


Algorithm 3 Partially path-based algorithm for simulating the intervention distribution
1: If there is no directed path from j_X to j_Y, the interventional and observational quantities coincide: p(Y | do(X = x)) ≡ p(Y) and E[Y | do(X = x)] ≡ E[Y].
2: If there is a directed path from j_X to j_Y, proceed with steps 3-9.
3: Set X = X_{j_X} = x and delete all in-going arcs into X.
4: Find all directed paths from j_X to j_Y, and denote them by p'_1, ..., p'_m.
5: for b = 1, ..., B do
6:   For every path, recursively simulate the corresponding random variables according to the order of the variables in the DAG:
     (i) Simulate in each path p'_1, ..., p'_m the random variables following the node j_X; proceed recursively as described in (13) according to the order of the variables in the DAG;
     (ii) Continue with the recursive simulation of random variables until Y is simulated.
7:   Store the simulated variable Y^{(b)}.
8: end for
9: Use the simulated sample Y^{(1)}, ..., Y^{(B)} to approximate the intervention distribution p(y | do(X = x)) or its expectation E[Y | do(X = x)].
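The distinguishing step of Algorithm 3 is the plug-in rule (13): parents lying on a directed path from j_X are propagated, while all other parents are bootstrap-resampled from the observed data. A minimal sketch for a hypothetical four-node SEM Z → X → M → Y with backdoor edge Z → Y, where Z lies off the directed path X → M → Y and is hence resampled (model and names are illustrative assumptions):

```python
import numpy as np

def partial_path_do(x, Z_data, B=100_000, rng=np.random.default_rng(1)):
    """Simulate Y | do(X = x) in the toy SEM
    Z <- eps_Z, X <- Z + eps_X, M <- X + eps_M, Y <- M + Z**2 + eps_Y.
    Only the directed path X -> M -> Y is simulated; the off-path
    parent Z of Y is bootstrap-resampled from the data, cf. (13)."""
    X = np.full(B, x)                                   # X <- x
    M = X + rng.normal(scale=0.5, size=B)               # on-path parent
    Z_star = rng.choice(Z_data, size=B, replace=True)   # bootstrap plug-in
    Y = M + Z_star ** 2 + rng.normal(scale=0.5, size=B)
    return Y

Z_data = np.random.default_rng(2).normal(size=500)  # observed sample of Z
print(partial_path_do(1.0, Z_data).mean())  # approx. 1 + E[Z^2] = 2
```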

3.3. Degree of localness

We can classify the different methods according to the degree to which the entire DAG or only a small (local) fraction of it is used. Algorithm 2 is a rather global procedure, as it uses entire paths from the root nodes to j_Y. Only when j_Y is close to the relevant root nodes does the method involve a smaller part of the DAG. Algorithm 3 is of semi-local nature, as it does not require considering paths going from root nodes to j_Y: it only considers paths from j_X to j_Y and all parental variables along these paths. The S-mint method based on marginal integration, described in Section 2 and Theorem 1, is of very local character, as it only requires knowledge of Y, X and the parental set pa(j_X) (or a superset thereof), but no further information about paths from j_X to j_Y.

In the presence of estimation errors, a local method might be more "reliable", as only a smaller fraction of the DAG needs to be approximately correct; global methods, in contrast, require that entire paths in the DAG are approximately correct. The local versus global issue is illustrated qualitatively in Figure 1, and empirical results on the statistical accuracy of the various methods are given in Section 5.

3.4. Estimation of DAG, edge functions and error distributions

With observational data, in general, it is impossible to infer the true underlyingDAGD0 in the structural equation model (5), or its parental sets, even as samplesize tends to infinity. One can only estimate the Markov equivalence class of thetrue DAG, assuming faithfulness of the data-generating distribution, see Spirteset al. (2000); Pearl (2000); Chickering (2002); Kalisch and Buhlmann (2007);van de Geer and Buhlmann (2013); Buhlmann (2013). The latter three referencesfocus on the high-dimensional Gaussian scenario with the number of random


Fig 1. (a) True DAG D0. (b) Illustration of the entire path-based method (Algorithm 2): X is set to x, the roots R1, R2 and all paths from the root nodes and X to Y are enumerated (here: p1, p2, p3); the interventional distribution at node Y is obtained by propagating samples along the three paths. (c) Illustration of the partially path-based method (Algorithm 3): X is set to x and all directed paths from X to Y are labeled (here: p′1); in order to obtain the interventional distribution at node Y, samples are propagated along the path p′1 and bootstrap resamples X∗k and X∗l are used according to (13). (d) Illustration of the S-mint method with adjustment set S = pa(jX): it only uses information about Y, X and the parents of X (here: Pa1, Pa2).

variables p ≫ n but assuming a sparsity condition in terms of the maximal degree of the skeleton of the DAG D0. The edge functions and error variances can then be estimated for every DAG member in the Markov equivalence class by pursuing regression of a variable versus its parents.

However, there are interesting exceptions regarding identifiability of the DAG from the observational distribution. For nonlinear structural equation models with additive error terms, it is possible to infer the true underlying DAG in the limit of infinitely many observational data (Hoyer et al., 2009; Peters et al., 2014). Various methods have been proposed to infer the true underlying DAG D0 and its corresponding functions f0j(·) and error distributions of the εj's; see for example Imoto et al. (2002); Hoyer et al. (2009); Peters et al. (2014); Buhlmann et al. (2014); van de Geer (2014); Nowzohour and Buhlmann (2015) (the fourth and fifth references consider high-dimensional scenarios). Another interesting class of models where the DAG D0 can be identified from the observational data distribution are linear structural equation models with non-Gaussian noise (Shimizu et al., 2006), or with Gaussian noise but equal or approximately equal error variances (van de Geer and Buhlmann, 2013; Peters and Buhlmann, 2014; Loh and Buhlmann, 2014) (the first and third references consider the high-dimensional setting).

As an example of a model with identifiable structure (DAG D0) we can specialize (5) to an additive structural equation model of the form

Xj ← ∑_{k∈pa(j)} f0jk(Xk) + εj,  j = 1, . . . , p,   (14)

where ε1, . . . , εp are independent with εj ∼ N(0, (σ0j)²), and the true underlying DAG is denoted by D0. This model is used for all numerical comparisons of the S-mint procedure and the path-based algorithms in Section 5. Estimation of the unknown quantities D0, f0jk and the error variances (σ0j)² can be done with the “CAM” method outlined below, which is used for the empirical results in Section 5.4 in connection with the two-stage procedure est S-mint that will be introduced in Section 3.5.

The CAM method (Buhlmann et al., 2014). The abbreviation “CAM” stands for Causal Additive Model, the additive structural equation model in (14). The CAM method is a nonparametric technique fitting smooth additive functions and Gaussian error terms in such an additive SEM. The unknown DAG D0 is estimated by restricted maximum likelihood: the restriction is to a space of sparse graphs (which can be determined by, e.g., neighborhood selection with the Group Lasso for an undirected additive association graph) and there is no further regularization of such a restricted MLE. The CAM method is consistent, even in the high-dimensional scenario with p ≫ n, assuming a sparse underlying true DAG.

3.5. Two-stage procedure: est S-mint

If the order of the variables or (a superset of) the parental set is unknown, we have to estimate it from observational data; this leads to the following two-stage procedure, described here for the case where the parental set pa(jX) is identifiable:

Stage 1 Estimate a superset of the parental set S(jX) (defined in (11)) from observational data.

Stage 2 Based on the estimate Ŝ(jX), run S-mint regression with S = Ŝ(jX).

Even if in Stage 1 one would also obtain estimates of functions in a specified SEM besides an estimate of S(jX), we would not use the estimated functions in Stage 2. We present empirical results for the est S-mint procedure, in connection with the CAM method for Stage 1 for estimating a valid adjustment set S(jX), in Section 5.4.

If the parental set pa(jX) is not identifiable (see Section 3.4), one could apply Stage 1 to obtain a set {S(jX)(1), . . . , S(jX)(cj)} such that the parental sets from each Markov-equivalent DAG would be contained in at least one of the S(jX)(k) for some k. Stage 2 would then be performed for all estimates {S(jX)(1), . . . , S(jX)(cj)} and one could then derive bounds on the quantity E[Y |do(X = x)] in the spirit of the approach of Maathuis et al. (2009).


In Section 5.5 we will give some intuition for why the two-stage est S-mint often leads to better and more reliable results than (at least some) other methods which rely on path-based estimation.

4. Empirical results: non-additive structural equation models

In this section we provide simple proof-of-concept examples for the generality of the proposed S-mint estimation method (Algorithm 1). In particular, the robustness of S-mint is experimentally validated for models where the structural equation model is not additive as in (14) but given in its general form (5). In Section 4.1 we make a naive comparison to path-based methods which are inconsistent due to incorrect specification of the model. However, taking the view of classical robustness (cf. Hampel et al., 2011), we consider a complementary and interesting issue in Section 5: namely, the “efficiency” of a robust procedure in comparison to other methods relying on the correct model specification.

In Section 4.1 we empirically show that the path-based methods based on the wrong additive model assumption in (14) may fail even in the absence of backdoor paths, where the S-mint method boils down to estimation of an additive model. In Section 4.2 we add backdoor paths to the graph and a strong interaction term to the corresponding structural equation model. We then empirically show that S-mint manages to approximate the true causal effect, whereas fitting only an additive regression fails. Section 4.3 contains an example that demonstrates a good performance of S-mint even in the presence of non-additive noise in the structural equation model. Finally, Section 4.4 empirically illustrates issues that can arise with a fixed choice of the bandwidths in the product kernel in (9).

4.1. Causal effects in the absence of backdoor paths

First let us illustrate the sensitivity of the path-based methods with respect to model specification, using a simple example of a 4-node graph with no backdoor paths between X1 = X and Y (see Figure 2). We consider a corresponding (non-additive) structural equation model of the form

X1 ← ε1
X2 ← ε2
X3 ← cos(4 · (X1 + X2)) · exp(X1/2 + X2/4) + ε3
Y ← cos(X3) · exp(X3/4) + ε4,   (15)

where εj ∼ N(0, σj²) with σ1 = σ2 = 0.7 and σ3 = σ4 = 0.2. We generate n samples from this model. From Proposition 1 we know that for j ∈ {1, 2, 3}, fitting an additive regression of Y versus Xj and Xpa(j) suffices to obtain the causal effect E[Y |do(Xj = x)]; that is, all causal effects can be readily estimated with an additive model. Our goal is to infer E[Y |do(X1 = x)], based on n = 500


Fig 2. Left: DAG corresponding to the structural equation model (15). Right: S-mint regression estimates of E[Y |do(X1 = x)] for the model in (15), with S = S(jX = 1) = ∅, based on one representative sample each for sample sizes n = 500 (top) and n = 10′000 (bottom). S-mint regression is consistent while the entire path-based method with a misspecified additive SEM (Algorithm 2) is not. The relative squared errors (over the 51 points x) are 0.013 for S-mint regression and 6.239 for the entire path-based method, both for n = 10′000.

and n = 10′000 samples of the joint distribution of the 4 nodes. The results are displayed in Figure 2.
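For reference, a short sketch of this simulation together with a Monte Carlo computation of the true effect E[Y |do(X1 = x)]; the function names and the evaluation grid (51 points on [−2, 2], matching Figure 2) are our own choices.

    ## Simulate model (15); setting x1 implements the intervention do(X1 = x1).
    set.seed(1)
    sim15 <- function(n, x1 = NULL) {
      X1 <- if (is.null(x1)) rnorm(n, sd = 0.7) else rep(x1, n)
      X2 <- rnorm(n, sd = 0.7)
      X3 <- cos(4 * (X1 + X2)) * exp(X1 / 2 + X2 / 4) + rnorm(n, sd = 0.2)
      Y  <- cos(X3) * exp(X3 / 4) + rnorm(n, sd = 0.2)
      data.frame(X1, X2, X3, Y)
    }

    obs <- sim15(500)                                  # observational sample
    xs  <- seq(-2, 2, length.out = 51)                 # intervention values
    true.effect <- sapply(xs, function(x) mean(sim15(10000, x1 = x)$Y))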

We consider the entire path-based Algorithm 2 (and Algorithm 3 as well, not shown), assuming an additive structural equation model as in (14). We see strikingly that this approach is exposed to model misspecification, while S-mint (in this case simply the fit of an additive model, i.e., bstop = 1, with the number of additional boosting iterations equaling zero) is not and leads to reliable and correct results. We included two settings: n = 500, to be consistent with the settings in the numerical study from Section 5, and n = 10′000, to demonstrate that the failure of the path-based methods is not a small-sample effect but an inconsistency phenomenon.

4.2. Causal effects in the presence of backdoor paths

We now consider a slight (but crucial) modification of the above DAG that was proposed by Linbo Wang and Mathias Drton through private communication. We consider the 4-node graph from Section 4.1 with additional edges X1 → Y and X2 → Y and corresponding structural equation model

X1 ← ε1
X2 ← ε2
X3 ← X1 + X2 + ε3
Y ← X1 · X2 · X3 + ε4,   (16)


where εj ∼ N(0, σj²) with σ1 = σ2 = 0.7 and σ3 = σ4 = 0.2. Note that this modification introduces two backdoor paths from X3 to Y. The goal is to estimate the causal effect E[Y |do(X3 = x)] using the S-mint estimation procedure introduced in Algorithm 1 with different numbers of boosting iterations. In Figure 3 one clearly sees that the additive approximation (with no additional boosting iterations) fails to approximate the total causal effect. It is not able to capture

Fig 3. Approximation of the causal effect E[Y |do(X3 = x)] in model (16) with S-mint regression for the additive model fit (starting value) and various boosting iterations (left), absolute differences between consecutive boosting iterations as in (10) (upper right) and integrated squared error for approximating the true effect as a function of boosting iterations (lower right). The boosting iterations in the S-mint procedure account for interactions between the variables. The adjustment set is chosen as the parental set of X3, that is, S(jX = 3) = {1, 2}. The results are based on one representative sample of size n = 500 (top) and n = 10′000 (bottom).

the full interaction term X1 · X2 · X3. However, adding boosting iterations significantly improves the approximation of the true causal effect, even for the small sample size n = 500.
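The following is a minimal sketch of the marginal integration idea behind S-mint on this example: a local-constant product-kernel regression of Y on (X3, X1, X2), averaged over the empirical distribution of the adjustment variables at fixed X3 = x. It replaces the additive-fit-plus-L2-boosting implementation of Section 2.2 by a plain kernel fit, so it illustrates the backdoor marginal-integration formula rather than the paper's exact estimator; all names are ours.

    ## Local-constant marginal integration: regress Y on (X, X_S) with a
    ## product of Gaussian kernels, then average the fitted surface over
    ## the empirical distribution of X_S at fixed X = x.
    smint0 <- function(x, X, XS, Y, h1, h2) {
      n <- length(Y)
      ghat <- numeric(n)
      for (i in 1:n) {
        w <- dnorm((X - x) / h1)                       # kernel in the X-direction
        for (s in seq_len(ncol(XS)))
          w <- w * dnorm((XS[, s] - XS[i, s]) / h2[s]) # product kernel over X_S
        ghat[i] <- sum(w * Y) / sum(w)                 # surface at (x, XS[i, ])
      }
      mean(ghat)                                       # marginal integration
    }

    ## Example on a sample dat from model (16), adjusting for S = pa(3) = {1, 2},
    ## with bandwidths 0.5 times the empirical standard deviations (cf. Section 5):
    ## est <- sapply(seq(-1, 1, length.out = 21), function(x)
    ##   smint0(x, dat$X3, cbind(dat$X1, dat$X2), dat$Y,
    ##          h1 = 0.5 * sd(dat$X3), h2 = 0.5 * c(sd(dat$X1), sd(dat$X2))))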

4.3. Causal effects in the presence of non-additive noise

Theorem 1 does not put any explicit restrictions on the noise structure in the structural equation model. In particular, S-mint also works well in the case of non-additive noise. As an example, we consider the causal graph and SEM from


Section 4.2 but replace the structural equation corresponding to Y in (16) with

Y ← exp(X1) · cos(X2 · X3 + ε4).   (17)

The goal is again to estimate the causal effect E[Y |do(X3 = x)] based on n = 500 observed samples of the joint distribution. Figure 4 shows that S-mint yields a close approximation to the true causal effect.

Fig 4. Approximation of the causal effect E[Y |do(X3 = x)] in model (17), exhibiting non-additive noise in the structural equation model, with S-mint regression for the additive model fit (starting value) and various boosting iterations (left). Absolute differences between consecutive boosting iterations as in (10) (upper right) and integrated squared error for approximating the true effect as a function of boosting iterations (lower right). The adjustment set is chosen as the parental set of X3, that is, S(jX = 3) = {1, 2}. The results are based on one representative sample of size n = 500.

4.4. Choice of the bandwidth

Theorem 1 provides an asymptotic result but does not specify how to choose the bandwidths h1 and h2 in the finite-sample case. In particular, the same fixed choice of h2 for all variables in the adjustment set S can be suboptimal in some situations. As an example let us consider the graph and structural equations from Section 4.2, where we replace one equation in (16) by

Y ← X1 + sin(X2 · X3) + ε4.   (18)

The goal is to approximate the causal effect E[Y |do(X3 = x)] based on n = 500 samples of the joint distribution. Inspecting the scatterplots of Y versus X1, X2 and X3 (see Figure 5) suggests that the bandwidth h2^(1) corresponding to X1 should be larger than the bandwidth h2^(2) corresponding to X2. Figure 6 depicts the corresponding approximated causal effects using the S-mint method for a fixed bandwidth h2 = (h2^(1), h2^(2)) = (0.4, 0.4) and for a variable bandwidth h2 = (h2^(1), h2^(2)) = (0.8, 0.4), respectively. Clearly, the approximation with the


Fig 5. Scatterplots of the data from model (18) of Y versus X1, X2 and X3. They reveal a difference in wiggliness.

variable bandwidth outperforms the approximation with the fixed bandwidth. Adaptive bandwidth choice methods as proposed by Polzehl and Spokoiny (2000) might be suitable, at the price of a more complicated and hence more variable estimation scheme.
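Reusing the smint0 sketch from Section 4.2 on a sample dat from model (18), a per-coordinate bandwidth vector h2 lets X1 be smoothed more strongly than X2; the values 0.4 and 0.8 follow Figure 6, while the evaluation grid and the choice of h1 are our own illustrative assumptions.

    xs <- seq(-1, 1, length.out = 21)
    ## fixed bandwidths, as in the left panel of Figure 6
    fixed  <- sapply(xs, function(x)
      smint0(x, dat$X3, cbind(dat$X1, dat$X2), dat$Y, h1 = 0.4, h2 = c(0.4, 0.4)))
    ## varying bandwidths, as in the right panel of Figure 6
    varied <- sapply(xs, function(x)
      smint0(x, dat$X3, cbind(dat$X1, dat$X2), dat$Y, h1 = 0.4, h2 = c(0.8, 0.4)))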

Fig 6. Approximation of the causal effect E[Y |do(X3 = x)] in model (18). The adjustment set is chosen as the parental set of X3, that is, S(jX = 3) = {1, 2}, with corresponding fixed bandwidths h2^(1) = h2^(2) = 0.4 (left) and varying bandwidths h2^(1) = 0.8 and h2^(2) = 0.4 (right). The results are based on one representative sample of size n = 500.

5. Empirical results: additive structural equation models

The goal of the numerical experiments in this section is to quantify the estimation accuracy of the total causal effect E[Y |do(X = x)] for two variables X, Y ∈ {X1, . . . , Xp} such that Y is a descendant of X (if Y is an ancestor of X, then the interventional expectation corresponds to the observational expectation E[Y ]). In this section we consider only additive structural equation


models as in (14). This allows for a comparison of the S-mint method and the path-based methods.

For S-mint regression, we use the implementation described in Section 2.2. The kernel functions K and L in the S-mint procedure are chosen to be a Gaussian kernel with bandwidth h1 and a product of Gaussian kernels with bandwidth h2, respectively. For simplicity, in the style of Fan et al. (1998), we choose h1 and h2 as 0.5 times the empirical standard deviation of the respective covariables in all of our simulations in this section. We use the following two criteria for bstop, that is, as an automated stopping criterion for the boosting iterations:

1. Stop if an iteration changes the approximation by less than 1%. That is, the integrated difference (10) to the previous approximation is less than 0.01.

2. Stop if the integrated difference between two consecutive approximations is less than 5% of the initial integrated difference.
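In code, the two criteria amount to something like the following sketch, where delta collects the integrated differences (10) between consecutive approximations; treating either criterion as sufficient to stop is our reading, since the combination is not spelled out here.

    ## Automated stopping rule for the boosting iterations (sketch).
    stop.boosting <- function(delta) {
      b <- length(delta)                 # current iteration
      delta[b] < 0.01 ||                 # criterion 1: change below 1%
        delta[b] < 0.05 * delta[1]       # criterion 2: below 5% of initial change
    }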

When using the path-based methods from Section 3, we estimate the functions f0j by additive functions using the R-package mgcv with default values (and thus using the knowledge of the form of the nonlinear functions in the SEM).

We test the performance of four different methods: S-mint with parental sets (Algorithm 1), with the stopping of boosting iterations as described above; additive regression with parental sets (the first step of S-mint, without additional boosting iterations); the entire path-based method from root nodes (Algorithm 2); and the partially path-based method with short-cuts (Algorithm 3). The reference effect E[Y |do(X = x)] is computed using Algorithm 2 with known (true) functions f0j,k and error variances (σ0j)², based on 5n samples.

Since in a nonlinear structural equation model (in contrast to a linear structural equation model) E[Y |do(X = x)] is a nonlinear function of the intervention value x, we compute the interventional expectation for several values x: typically, for the nine deciles d1(X), . . . , d9(X) of X. To compare the estimation accuracy of the different methods on a DAG D, we compute a relative squared error e(D) over all considered pairs (X, Y ) (for details see below), whose collection is denoted by L, and over all intervention values d1(X), . . . , d9(X) as

e(D) = [ ∑_{(X,Y)∈L} ∑_{i=1}^{9} ( Ê[Y |do(X = di(X))] − E0[Y |do(X = di(X))] )² ] / [ ∑_{(X,Y)∈L} ∑_{i=1}^{9} ( E0[Y |do(X = di(X))] )² ].   (19)

Typically, we repeat every experiment on N = 50 or N = 100 random DAGs (described in Section 5.1) and record the relative error e(D) of all methods for each repetition.
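Computationally, (19) is a one-liner; in the sketch below, est and ref are |L| × 9 matrices of estimated and reference interventional expectations at the nine deciles (the names are ours).

    ## Relative squared error e(D) as in (19).
    relative.error <- function(est, ref) sum((est - ref)^2) / sum(ref^2)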


5.1. Data simulation

To simulate data we first fix a causal order π0 of the variables, that is, Xπ0(1) ≺ Xπ0(2) ≺ · · · ≺ Xπ0(p), and include each of the p(p − 1)/2 possible directed edges, independently of each other, with probability pc. In the sparse setting we typically choose pc = 2/(p − 1), which yields an expected number of p edges in the resulting DAG. Based on the causal structure of the graph we then build the structural equation model. We simulate from the additive structural equation model (14), where every edge k → j in the DAG is associated with a nonlinear function f0j,k in the structural equation model. We use two function types:

1. edge functions f0j,k drawn from a Gaussian process with a Gaussian kernel with bandwidth one;

2. sigmoid-type edge functions of the form f0j,k(x) = a · b(x + c)/(1 + |b(x + c)|) with a ∼ Exp(4) + 1, b ∼ Unif([−2, −0.5] ∪ [0.5, 2]) and c ∼ Unif([−2, 2]).

All variables with empty parental set (root nodes in the DAG) follow a Gaussian distribution with mean zero and standard deviation uniformly distributed in the interval [1, √2]. To all remaining variables we add Gaussian noise with standard deviation uniformly distributed in [1/5, √2/5]. Note that both simulation settings correspond to the ones used by Buhlmann et al. (2014).
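A compact sketch of this simulation design with sigmoid-type edge functions; the code is ours, we read Exp(4) as an exponential distribution with rate 4, and we fix the causal order to the identity so that the adjacency matrix is upper triangular.

    set.seed(1)
    p <- 10; n <- 500; pc <- 2 / (p - 1)

    ## Random DAG: each of the p(p-1)/2 possible edges enters with probability pc.
    A <- matrix(0, p, p)
    A[upper.tri(A)] <- rbinom(p * (p - 1) / 2, 1, pc)

    ## Draw a random sigmoid-type edge function f(x) = a * b(x+c) / (1 + |b(x+c)|).
    sigmoid.fct <- function() {
      a  <- rexp(1, rate = 4) + 1
      b  <- sample(c(-1, 1), 1) * runif(1, 0.5, 2)
      cc <- runif(1, -2, 2)
      function(x) a * b * (x + cc) / (1 + abs(b * (x + cc)))
    }

    ## Simulate the additive SEM (14) along the causal order.
    X <- matrix(0, n, p)
    for (j in 1:p) {
      pa <- which(A[, j] == 1)
      if (length(pa) == 0) {
        X[, j] <- rnorm(n, sd = runif(1, 1, sqrt(2)))            # root node
      } else {
        for (k in pa) X[, j] <- X[, j] + sigmoid.fct()(X[, k])   # additive edges
        X[, j] <- X[, j] + rnorm(n, sd = runif(1, 1/5, sqrt(2)/5))
      }
    }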

5.2. Estimation of causal effects with known graphs

In this section we compare the different methods in terms of estimation accuracy and CPU time consumption for known underlying DAGs D0. To that end we generate random DAGs with p = 10 variables and simulate n = 500 samples of the joint distribution, applying the simulation procedure introduced in Section 5.1. We then select all index pairs (k, j) such that there exists a directed path from Xk to Xj and estimate the causal effect E[Xj |do(Xk)] for all such pairs at the nine deciles of Xk.

The experiment is done for two different levels of sparsity: a sparse graph with an expected number of p edges and a non-sparse graph with an expected number of 4p edges. We record the relative squared error (19) and the CPU time consumption, both averaged over all index pairs, for N = 100 (N = 20 in the dense setting) different DAGs D0. The results are displayed in Figure 7 for the sigmoid-type edge functions and in Figure 8 for the Gaussian process-type edge functions.

The method based on the entire paths (Algorithm 2) yields the smallest errors, followed by the partially path-based method with short-cuts (Algorithm 3). S-mint and additive regression exhibit a slightly worse performance. This finding can be explained by the fact that the path-based methods benefit from the full (and correct) structural information of the DAG, whereas the S-mint and additive regression methods only use local information (cf. Section 3.3). For the monotone sigmoid-type function class, additive regression provides a very good approximation to the true causal effect even in dense settings. For both settings


Fig 7. Comparison of the performance of the methods in terms of relative squared error as in (19) (left) and CPU time consumption (right) for the case where the true DAGs D0 are known and the edge functions belong to the sigmoid-type setting. The adjustment set is S = pa(Xk) for additive regression and S-mint. Number of variables p = 10 and sample size n = 500.

Fig 8. Comparison of the performance of the methods in terms of relative squared error as in (19) (left) and CPU time consumption (right) for the case where the true DAGs D0 are known and the edge functions are drawn from a Gaussian process with bandwidth one. The adjustment set is S = pa(Xk) for additive regression and S-mint. Number of variables p = 10 and sample size n = 500.

we observe that the boosting iterations in S-mint do not improve the additive approximation substantially.

In terms of CPU time consumption, S-mint and additive regression outperform the path-based methods. Additive regression is particularly fast as it only requires the fit of one nonparametric additive regression of Xj versus Xk and Xpa(k), whereas the path-based methods each require one nonparametric additive model fit for every node on all the traversed paths. As the set of paths in the partially path-based method is a subset of the one in the entire path-based method (cf. Section 3.2 and Figure 1), the partially path-based method needs fewer model fits, which explains the reduction in time consumption. In particular, both S-mint and additive regression are computationally feasible for computing E[Xj |do(Xk)] for all pairs (k, j), even when p is large and in the thousands, assuming that the cardinality of the corresponding adjustment sets is reasonably small.


5.3. Estimation of causal effects on perturbed graphs

In the previous section we demonstrated that the two path-based methods exhibit a better performance than S-mint and the additive regression approximation if causal effects are estimated based on the underlying true DAG D0. We will now focus on the more realistic situation in which we are only provided with a partially correct DAG. We model this by constructing a set of modified DAGs {Dhr}r∈K with pre-specified (fixed) structural Hamming distances {hr}r∈K to the true DAG D0, where K = {1, 2, . . . , 6} and the corresponding {hr}r∈K are described in Figures 9 and 10. To do so, we use the following rule: starting from D0 with p = 50 nodes, for each r ∈ K, we randomly remove and add hr/2 edges each to obtain Dhr. The structural Hamming distance between D0 and the perturbed graph Dhr is then equal to hr, and a fraction of 1 − hr/(2|E|) of the edges in Dhr are still correct, where |E| denotes the expected number of edges in the DAG D0. Note that this modification may change the order of the variables (especially for large values of hr).
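A sketch of this perturbation rule, where A is a 0/1 adjacency matrix with at least h/2 edges; for simplicity, acyclicity of the perturbed graph is not enforced here, so in practice offending insertions would have to be redrawn.

    ## Perturb a DAG to structural Hamming distance h from A:
    ## delete h/2 existing edges and insert h/2 new off-diagonal ones.
    perturb.dag <- function(A, h) {
      p <- nrow(A)
      diag.idx <- seq(1, p^2, by = p + 1)
      ones  <- which(A == 1)
      zeros <- setdiff(which(A == 0), diag.idx)
      A[sample(ones,  h / 2)] <- 0
      A[sample(zeros, h / 2)] <- 1
      A
    }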

We randomly choose 20 = |L| index pairs (k, j) such that there exists a directed path from Xk to Xj in D0, but now consider the problem of estimating the total causal effect E[Xj |do(Xk)] based on the perturbed graph Dhr for the adjustment sets or the paths, respectively (and based on sample size n = 500 as in Section 5.2). For every r ∈ K, this is repeated N = 100 times and in each repetition we record the relative squared error e(D) in (19). As before, we distinguish between a sparse graph with an expected number of 50 edges and a non-sparse graph with an expected number of 200 edges, and we use both simulation settings described in Section 5.1 for generating the edge functions f0. The results are shown in Figures 9 and 10.

For both the sparse and the non-sparse settings, one observes that the larger the structural Hamming distance (or equivalently, the smaller the percentage of correctly specified edges in D0), the better is the performance of S-mint and additive regression in comparison with the path-based methods. That is, both methods are substantially more robust with respect to possible misspecifications of edges in the graph. This may be explained by the different degrees of localness (cf. Section 3.3) of the respective methods. For the two local methods we can hope to have approximately correct information in the parental set of Xk even if the modified DAG is far away from the true DAG D0 in terms of the structural Hamming distance. For the path-based methods, however, randomly removing edges may break one or several of the traversed paths, which results in causal information being partially or fully lost. This effect is most evident in the two sparse settings. A similar behavior is also observed in Figure 11.

Note that, except for the true DAG D0, the performance of the partially path-based method is at least as good as that of the entire path-based method. The short-cut introduced in Algorithm 3 not only yields computational savings but also improves (relative to the full path-based Algorithm 2) the statistical estimation accuracy of causal effects in incorrect DAGs. Again, a possible explanation for this observation is that the partially path-based method acts more


Fig 9. Relative squared error performance of the methods on a set of modified DAGs {Dhr}r∈K with given structural Hamming distances {hr}r∈K to the true DAG D0 (or equivalently, with a given percentage of correct edges) for the sigmoid-type additive structural equation model. The top and bottom panels show the relative squared error e(D) in (19) in a sparse and a dense setting, respectively. The larger the structural Hamming distance hr between the modified DAG Dhr and the true DAG D0, the better is the performance of S-mint with parental sets in comparison with the two path-based methods. Number of variables p = 50 and sample size n = 500.

locally and thus is less affected by edge perturbations.

5.4. Estimation of causal effects in estimated graphs

We now turn our attention to the case where the goal is to compute causal effects on a DAG that has been estimated by a structure learning algorithm (while still relying on a correct model specification). In conjunction with S-mint regression, this is then the est S-mint method described in Section 3.5.

We generate N = 50 random DAGs with p = 20 nodes for different numbers n of observational data points, which are simulated according to the procedure in Section 5.1.

Using the knowledge that the structural equation model is additive as in (14), we apply the recently proposed CAM method (Buhlmann et al., 2014), outlined at the end of Section 3.4, for estimation of the true underlying DAG D0 (which is identifiable from the observational distribution). The implementation is according to the R-package CAM. Regarding the algorithmic details, we use the following in the three steps:

1. Preliminary neighborhood selection to restrict the number of potential parents per node: set to a maximum of 10 by default;

2. Estimation of the correct order by greedy search: we use 6 basis functions per parent to fit the additive model;


Fig 10. Relative squared error performance of the methods on a set of modified DAGs {Dhr}r∈K with given structural Hamming distances {hr}r∈K to the true DAG D0 (or equivalently, with a given percentage of correct edges) for the Gaussian process-type additive structural equation models. The top and bottom panels show the relative squared error e(D) in (19) in a sparse and a dense setting, respectively. The larger the structural Hamming distance hr between the modified DAG Dhr and the true DAG D0, the better is the performance of S-mint with parental sets in comparison with the two path-based methods. Number of variables p = 50 and sample size n = 500.

3. Optional: pruning of the DAG by feature selection to keep only the significant edges, where we use the default level α = 0.001.

After having estimated a DAG with the above procedure, we randomly select 10 = |L| index pairs (k, j) such that there exists a directed path from Xk to Xj in the true DAG D0 and approximate the total causal effect E[Xj |do(Xk)] based on the estimated graph. Figure 11 displays the relative squared errors as defined in (19).

All four methods show a similar performance with respect to the relative squared error on the DAGs that are obtained by applying the CAM method without feature selection. These DAGs mainly represent the causal order of the variables but are otherwise densely connected. An incorrectly specified order of the variables (e.g., for small sample sizes n) seems to affect S-mint and additive regression with parental sets and the path-based methods comparably. If the sample size increases, the estimated graph is closer to the true graph D0, which improves the estimation accuracy of causal effects for all four methods.

The two path-based methods approximate the causal effects more accurately on the DAGs that are obtained without feature selection; that is, pruning the DAG is not advantageous for the estimation accuracy of causal effects, at least for a small number of observations. However, the pruning step yields vast computational savings for the two path-based methods, as demonstrated in Figure 12. The S-mint regression is very fast in both settings, and pruning the DAG before


Fig 11. Sigmoid-type additive structural equation models. Relative squared error performance as in (19), for different numbers of observations (n), computed on graphs that have been estimated using the CAM method (Buhlmann et al., 2014). The algorithm has been applied without the pruning step (left) and with the pruning step (right). We use the estimated parental sets as adjustment sets and the number of variables is p = 20. The S-mint regression corresponds to est S-mint as described in Section 3.5.

estimating the causal effects only has a minor effect on the time consumption and estimation accuracy.

Fig 12. Sigmoid-type additive structural equation models. CPU time performance for n = 500 for N = 50 graphs of p = 20 variables that have been estimated using the CAM method (Buhlmann et al., 2014) with and without the pruning step. Pruning the DAG yields vast computational savings for the two path-based methods. S-mint and additive regression are barely affected by the pruning step and are considerably faster than the two path-based methods in both scenarios.

5.5. Summary of the empirical results, and the advantage of the proposed two-stage est S-mint method

With respect to statistical accuracy, measured with the relative squared error as in (19), we find that S-mint and additive regression are substantially more robust against incorrectness of the true underlying DAG (or against a wrong order of the variables) and against model misspecification, in comparison to the alternative path-based methods. The latter robustness of S-mint is rigorously backed up by our theory in Theorem 1 and Corollary 1, whereas the former seems to be due to the higher degree of localness as described in Section 3.3. As a consequence, the proposed two-stage est S-mint (Section 3.5), where we first estimate the order of the variables or the structure of the DAG (or the Markov


equivalence class of DAGs) and subsequently perform S-mint, is expected in general to lead to reasonably accurate results (which are empirically quantified above for some settings). Only when the DAG is perfectly known and the model correctly specified (here by an additive structural equation model), which is a rather unrealistic assumption for practical applications, were the path-based methods found to have a slight advantage. Thus, we recover here a typical robustness phenomenon: our nonparametric and more “model-free” S-mint regression procedure is robust against model misspecification.

Our empirical findings support the use of est S-mint, namely the combination of a structured nonparametric (or parametric) approach for estimating the DAG (or its equivalence class) in the first stage with the robust and fully nonparametric S-mint procedure in the second stage. The second stage leads to a clear gain in robustness, whereas the efficiency loss in the case of correctly specified models is marginal or even minimal.

Regarding computational efficiency, S-mint and in particular also the additive regression approximation are massively faster than the path-based procedures, making them feasible for larger scales where the number of variables is in the thousands.

6. Real data application

In this section we provide two examples of the application of our methodology to real data. We use gene expression data from the isoprenoid biosynthesis in Arabidopsis thaliana (Wille et al., 2004). The data consist of n = 118 gene expression measurements of p = 39 genes. In the original work the authors try to infer connections between the individual genes in the network using Gaussian graphical modeling. Our goal is to find the strongest causal connections between the individual genes. We do not standardize the original data but adjust the bandwidths in S-mint by scaling with the standard deviations of the corresponding variables.

6.1. Estimation and error control for causal connections between and within the pathways

We first turn our attention to the whole isoprenoid biosynthesis dataset and want to find the causal effects within and between the different pathways, with error control for false positive selections. To be able to compute the causal effects we have to estimate a causal network; to do so, we use the CAM method (Buhlmann et al., 2014).

We estimate a DAG using CAM with the default settings. We then apply the S-mint procedure with parental sets obtained from the estimated DAG (which corresponds to the est S-mint procedure from Section 3.5) to rank the total causal effects according to their strength. We define the relative causal strength CSrel_{k→j} of an intervention Xj |do(Xk) as a sum of relative distances of observational and interventional expectation for different intervention values, divided by the range of the intervention values, i.e.,

CSrel_{k→j} = (1 / Rk(d)) ∑_{i=1}^{9} |E[Xj] − E[Xj |do(Xk = di)]| / |E[Xj]|,

where we choose d1(Xk), . . . , d9(Xk) to be the nine deciles of Xk and denote their range by Rk(d) = d9(Xk) − d1(Xk).
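In code, the relative causal strength is immediate once the interventional expectations have been estimated; in the sketch below, EXj is the observational mean and do.means a vector of estimated E[Xj |do(Xk = di)] at the nine deciles d of Xk (the names are ours).

    ## Relative causal strength CSrel_{k -> j}.
    cs.rel <- function(EXj, do.means, d) {
      sum(abs(EXj - do.means) / abs(EXj)) / (d[9] - d[1])
    }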

To control the number of false positives (i.e., falsely selected strong causal effects) we use stability selection (Meinshausen and Buhlmann, 2010), which provides (conservative) error control under a so-called (and uncheckable) exchangeability condition. We randomly select 100 subsamples of size n/2 = 59 and repeat the procedure above 100 times. For each run, we record the indices of the top 30 ranked causal strengths. At the end we keep all index pairs that have been selected at least 66 times in the 100 runs, as this leads to an expected number of falsely selected edges (false positives) which is less than or equal to 2 (Meinshausen and Buhlmann, 2010). The graphical representation of the network in Figure 13 is based on Wille et al. (2004). The dotted arcs represent the underlying metabolic network (known from biology); the six red solid arcs correspond to the stable index pairs found by est S-mint with stability selection.
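A sketch of the subsampling loop, where gene.data is the n × p expression matrix and top.pairs is a hypothetical helper that runs est S-mint on a subsample and returns identifiers (e.g. "k->j") of the 30 largest relative causal strengths; everything else follows Meinshausen and Buhlmann (2010).

    ## Stability selection over 100 subsamples of size n/2.
    picks <- unlist(lapply(1:100, function(run) {
      sub <- sample(n, floor(n / 2))
      top.pairs(gene.data[sub, ], top = 30)       # hypothetical helper
    }))
    stable <- names(which(table(picks) >= 66))    # expected false positives <= 2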

None of the stable edges points opposite to the causal direction of the metabolic network. In particular, we found strong total causal effects between GGPPS variables in the MEP pathway, the MVA pathway and the mitochondrion. Note that in this section we heavily rely on model assumption (14), as the CAM method for estimating a DAG assumes additivity in the parents. Therefore we cannot fully exploit the advantage of the S-mint method that it works for arbitrary non-additive models (5) (but we would hope to be somewhat less sensitive to model misspecification than with path-based methods; see for example Figures 9 and 10).

6.2. Estimation and error control of strong causal connections within the MEP pathway

We now want to present a possible way of exploiting the very general model assumptions of S-mint. If the underlying order and an approximate graph structure are known a priori, we can use this information to proceed with S-mint using the order information, as described in Corollary 1. This relieves us from any model assumptions on the functional connections between two variables (e.g., linearity, additivity, etc.).

To give an example, let us focus on the genes in the MEP pathway (black box in Figure 13). The goal is to find the strongest total causal effects within this pathway. The metabolic network (dotted arcs) provides us with an order of the variables, which we use for S-mint regression as follows: we choose the adjustment set S(jX) in (12) by going three levels back (pmax = 3) in the


[Figure 13 displays the gene network in three compartments: Chloroplast (MEP pathway), Cytoplasm (MVA pathway) and Mitochondrion.]

Fig 13. Stable edges (with stability selection) for the Arabidopsis thaliana dataset. The dotted arcs represent the metabolic network, the red solid arcs the stable total causal effects found by the est S-mint method.

causal order (to achieve a reasonably sized set); for example, the adjustment set for CMK is {DXPS1, DXPS2, DXPS3, DXR, MCT}, whereas the adjustment set for GPPS is {HDS, HDR, IPPI1}. We cannot use the full set of all ancestors because there are only n/2 = 59 data points to fit the nonparametric additive regression and marginal integration, as we again use stability selection based on subsampling for controlling false positive selections, as described in the previous section. For each of the 100 subsampling runs we record the top 10 ranked index pairs and keep the ones that are selected at least 65 times out of 100 repetitions. This results in an expected number of false positives of less than 1 (Meinshausen and Buhlmann, 2010). The stable edges are shown in Figure 14. One of the four edges corresponds to an edge in the metabolic pathway. We find that the upper part of the pathway contains the strongest total causal effects and therefore may be an interesting target for intervention experiments.
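For a known adjacency matrix A of the (ordered) metabolic network, the adjustment set of all ancestors up to three levels back can be read off from the first three matrix powers; this small computation is our own illustration.

    ## Ancestors of node jX reachable by directed paths of length <= 3.
    A2 <- A %*% A
    A3 <- A2 %*% A
    S <- which((A + A2 + A3)[, jX] > 0)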

7. Conclusions

We considered the problem of estimating expected values of intervention distributions, also known as total causal effects, from observational data. A first main result (Theorem 1 and Corollary 1) says that if we know the local parental variables or a superset thereof (e.g., from the order of the variables), there is no


[Figure 14 displays the Chloroplast (MEP pathway) part of the network.]

Fig 14. Stable edges (with stability selection) for the MEP pathway in the Arabidopsis thaliana dataset. The dotted arcs represent the metabolic network whereas the red solid arcs denote the top ranked causal effects found by S-mint with adjustment sets chosen from the order of the metabolic network structure by considering all ancestors up to three levels back.

need to base estimation and computations on a causal graph. In fact, we can directly infer the expected values of single-intervention distributions via marginal integration: we call the procedure S-mint. This result holds for any nonlinear and non-additive structural equation model, apart from mild smoothness and regularity conditions. Hence, from another point of view, S-mint estimation of expected values of single intervention distributions is a fully nonparametric technique and thus robust against model misspecification of the functional form of the structural equations. We propose an L2-boosting approach for S-mint which is easy to use, without complicated tuning of parameters, and yields good empirical results.

We complement the robustness viewpoint by empirical results indicating that S-mint also works reasonably well when the DAG- or order-structure is misspecified to a certain extent, as will be the case when we estimate these quantities from data; in fact, S-mint regression is substantially more robust than methods which follow all directed paths in the DAG to infer causal effects.


This suggests that the two-stage est S-mint procedure is most reliable for causal inference from observational data: first estimate the DAG- or order-structure (or equivalence classes thereof), and subsequently pursue S-mint regression. In addition, such a procedure is computationally much faster than methods which exploit directed paths in (estimated) DAGs.

Acknowledgments. We sincerely thank Linbo Wang and Mathias Drton for having pointed out a major error in an earlier and very different version of the paper. We also thank anonymous reviewers for constructive comments.

References

Bang, H. and Robins, J. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61:962–972.

Bollen, K. A. (1998). Structural equation models. Wiley Online Library.

Buhlmann, P. (2013). Causal statistical inference in high dimensions. Mathematical Methods of Operations Research, 77:357–370.

Buhlmann, P. and Hothorn, T. (2007). Boosting algorithms: regularization, prediction and model fitting (with discussion). Statistical Science, 22:477–505.

Buhlmann, P., Peters, J., and Ernest, J. (2014). CAM: Causal additive models, high-dimensional order search and penalized regression. Annals of Statistics, 42:2526–2556.

Buhlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association, 98:324–339.

Chickering, D. (2002). Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554.

Colombo, D., Maathuis, M., Kalisch, M., and Richardson, T. (2012). Learning high-dimensional directed acyclic graphs with latent and selection variables. Annals of Statistics, 40:294–321.

Dawid, A. P. (2000). Causal inference without counterfactuals. Journal of the American Statistical Association, 95:407–424.

Editorial (2010). Cause and effect. Nature Methods, 7:243.

Fan, J., Hardle, W., and Mammen, E. (1998). Direct estimation of low-dimensional components in additive models. Annals of Statistics, 26:943–971.

Friedman, J. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, 29:1189–1232.

Friedman, N. (2004). Inferring cellular networks using probabilistic graphical models. Science, 303:799–805.

Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, 10:37–48.

Hall, P. and Marron, J. (1987). Estimation of integrated squared density derivatives. Statistics & Probability Letters, 6:109–115.

Hampel, F., Ronchetti, E., Rousseeuw, P., and Stahel, W. (2011). Robust statistics: the approach based on influence functions. John Wiley & Sons.


Hauser, A. and Buhlmann, P. (2012). Characterization and greedy learning of interventional Markov equivalence classes of directed acyclic graphs. Journal of Machine Learning Research, 13:2409–2464.

Hauser, A. and Buhlmann, P. (2014). Two optimal strategies for active learning of causal models from interventional data. International Journal of Approximate Reasoning, 55:926–939.

Hauser, A. and Buhlmann, P. (2015). Jointly interventional and observational data: estimation of interventional Markov equivalence classes of directed acyclic graphs. Journal of the Royal Statistical Society, Series B, 77:291–318.

He, Y.-B. and Geng, Z. (2008). Active learning of causal networks with intervention experiments and optimal designs. Journal of Machine Learning Research, 9:2523–2547.

Horowitz, J., Klemela, J., and Mammen, E. (2006). Optimal estimation in additive regression models. Bernoulli, 12:271–298.

Hoyer, P., Janzing, D., Mooij, J., Peters, J., and Scholkopf, B. (2009). Nonlinear causal discovery with additive noise models. In Advances in Neural Information Processing Systems 21 (NIPS 2008), pages 689–696.

Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19:2271–2282.

Imoto, S., Goto, T., and Miyano, S. (2002). Estimation of genetic networks and functional structures between genes by using Bayesian network and nonparametric regression. In Proceedings of the Pacific Symposium on Biocomputing (PSB-2002), volume 7, pages 175–186.

Kalisch, M. and Buhlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613–636.

Koller, D. and Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT Press.

Lauritzen, S. and Spiegelhalter, D. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50:157–224.

Li, L., Tchetgen, E. T., van der Vaart, A., and Robins, J. (2011). Higher order inference on a treatment effect under low regularity conditions. Statistics & Probability Letters, 81:821–828.

Linton, O. and Nielsen, J. P. (1995). A kernel method of estimating structured nonparametric regression based on marginal integration. Biometrika, 82:93–100.

Loh, P. and Buhlmann, P. (2014). High-dimensional learning of linear causal networks via inverse covariance estimation. Journal of Machine Learning Research, 15:3065–3105.

Maathuis, M., Colombo, D., Kalisch, M., and Buhlmann, P. (2010). Predicting causal effects in large-scale systems from observational data. Nature Methods, 7:247–248.

Maathuis, M., Kalisch, M., and Buhlmann, P. (2009). Estimating high-dimensional intervention effects from observational data. Annals of Statistics, 37:3133–3164.

Marzio, M. D. and Taylor, C. (2008). On boosting kernel regression. Journal of Statistical Planning and Inference, 138:2483–2498.

Meinshausen, N. and Buhlmann, P. (2010). Stability selection (with discussion). Journal of the Royal Statistical Society, Series B, 72:417–473.

Nowzohour, C. and Buhlmann, P. (2015). Score-based causal learning in additive noise models. Statistics. Published online, doi:10.1080/02331888.2015.1060237.

Pearl, J. (2000). Causality: models, reasoning and inference. Cambridge University Press.

Peters, J. and Buhlmann, P. (2014). Identifiability of Gaussian structural equation models with equal error variances. Biometrika, 101:219–228.

Peters, J., Mooij, J., Janzing, D., and Scholkopf, B. (2014). Causal discovery with continuous additive noise models. Journal of Machine Learning Research, 15:2009–2053.

Polzehl, J. and Spokoiny, V. (2000). Adaptive weights smoothing with applications to image restoration. Journal of the Royal Statistical Society, Series B, 62:335–354.

Robins, J., Rotnitzky, A., and Zhao, L. (1994). Estimation of regression coefficients when some of the regressors are not always observed. Journal of the American Statistical Association, 89:846–866.

Robins, J., Tchetgen, E. T., Li, L., and van der Vaart, A. (2009). Semiparametric minimax rates. Electronic Journal of Statistics, 3:1305–1321.

Rosenbaum, P. and Rubin, D. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70:41–55.

Rubin, D. B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100:322–331.

Scharfstein, D., Rotnitzky, A., and Robins, J. (1999). Adjusting for nonignorable drop-out using semiparametric nonresponse models (with discussion). Journal of the American Statistical Association, 94:1096–1146.

Schmidt, M., Niculescu-Mizil, A., and Murphy, K. (2007). Learning graphical model structure using L1-regularization paths. In Proceedings of the National Conference on Artificial Intelligence, volume 22, page 1278. AAAI Press.

Shimizu, S., Hoyer, P., Hyvarinen, A., and Kerminen, A. (2006). A linear non-Gaussian acyclic model for causal discovery. Journal of Machine Learning Research, 7:2003–2030.

Shojaie, A. and Michailidis, G. (2010). Penalized likelihood methods for estimation of sparse high-dimensional directed acyclic graphs. Biometrika, 97:519–538.

Shpitser, I., Richardson, T. S., and Robins, J. M. (2011). An efficient algorithm for computing interventional distributions in latent variable causal models. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 661–670.

Smith, V. A., Jarvis, E. D., and Hartemink, A. J. (2002). Evaluating functional network inference using simulations of complex biological systems. Bioinformatics, 18(suppl 1):S216–S224.

Song, L., Fukumizu, K., and Gretton, A. (2013). Kernel embeddings of conditional distributions: a unified kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine, 30:98–111.

Spirtes, P. (2010). Introduction to causal inference. Journal of Machine Learning Research, 11:1643–1662.

Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. MIT Press, second edition.

Stekhoven, D., Moraes, I., Sveinbjornsson, G., Hennig, L., Maathuis, M., and Buhlmann, P. (2012). Causal stability ranking. Bioinformatics, 28:2819–2823.

Teyssier, M. and Koller, D. (2005). Ordering-based search: a simple and effective algorithm for learning Bayesian networks. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI), pages 584–590, Edinburgh, Scotland, UK.

van de Geer, S. (2014). On the uniform convergence of empirical norms and inner products, with application to causal inference. Electronic Journal of Statistics, 8:543–574.

van de Geer, S. and Buhlmann, P. (2013). ℓ0-penalized maximum likelihood for sparse directed acyclic graphs. Annals of Statistics, 41:536–567.

van der Laan, M. J. and Robins, J. M. (2003). Unified methods for censored longitudinal data and causality. Springer.

van der Laan, M. J. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York.

Wille, A., Zimmermann, P., Vranova, E., Furholz, A., Laule, O., Bleuler, S., Hennig, L., Prelic, A., von Rohr, P., Thiele, L., et al. (2004). Sparse graphical Gaussian modeling of the isoprenoid gene network in Arabidopsis thaliana. Genome Biology, 5(11):R92.

Wood, S. (2006). Generalized Additive Models: An Introduction with R. Chapman and Hall/CRC.

Wood, S. N. (2003). Thin-plate regression splines. Journal of the Royal Statistical Society, Series B, 65:95–114.

Yu, J., Smith, V. A., Wang, P. P., Hartemink, A. J., and Jarvis, E. D. (2004). Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics, 20:3594–3603.

Zhang, J. (2008). On the completeness of orientation rules for causal discovery in the presence of latent confounders and selection bias. Artificial Intelligence, 172:1873–1896.