
Towards a Learning Theory of Cause-Effect Inference

David Lopez-Paz 1,2  DAVID@LOPEZPAZ.ORG
Krikamol Muandet 1  KRIKAMOL@TUEBINGEN.MPG.DE
Bernhard Schölkopf 1  BS@TUEBINGEN.MPG.DE
Ilya Tolstikhin 1  ILYA@TUEBINGEN.MPG.DE

1 Max-Planck-Institute for Intelligent Systems
2 University of Cambridge

Abstract

We pose causal inference as the problem of learning to classify probability distributions. In particular, we assume access to a collection {(Si, li)}ni=1, where each Si is a sample drawn from the probability distribution of Xi × Yi, and li is a binary label indicating whether "Xi → Yi" or "Xi ← Yi". Given these data, we build a causal inference rule in two steps. First, we featurize each Si using the kernel mean embedding associated with some characteristic kernel. Second, we train a binary classifier on such embeddings to distinguish between causal directions. We present generalization bounds showing the statistical consistency and learning rates of the proposed approach, and provide a simple implementation that achieves state-of-the-art cause-effect inference. Furthermore, we extend our ideas to infer causal relationships between more than two variables.

1. Introduction

The vast majority of statistical learning algorithms rely on the exploitation of associations between the variables under study. Given the argument that all associations arise from underlying causal structures (Reichenbach, 1956), and that different structures imply different influences between variables, the question of how to infer and use causal knowledge in learning acquires great importance (Pearl, 2000; Schölkopf et al., 2012). Traditionally, the most widely used strategy to infer the causal structure of a system is to perform interventions on some of its variables, while studying the response of some others. However, such interventions are in many situations unethical, expensive, or

Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

even impossible to realize. Consequently, we often face the need of causal inference purely from observational data. In these scenarios, one suffers, in the absence of strong assumptions, from the indistinguishability between latent confounding (X ← Z → Y) and direct causation (X → Y or X ← Y). Nevertheless, disregarding the impossibility of the task, humans continuously learn from experience to accurately infer causality-revealing patterns. Inspired by this successful learning, and in contrast to prior work, this paper addresses causal inference by unveiling such causal patterns directly from data. In particular, we assume access to a set {(Si, li)}ni=1, where each Si is a sample set drawn from the probability distribution of Xi × Yi, and li is a binary label indicating whether "Xi → Yi" or "Xi ← Yi". Using these data, we build a causal inference rule in two steps. First, we construct a suitable and nonparametric representation of each sample Si. Second, we train a nonlinear binary classifier on such features to distinguish between causal directions. Building upon this framework, we derive theoretical guarantees regarding consistency and learning rates, extend inference to the multivariate case, propose approximations to scale learning to big data, and obtain state-of-the-art performance with a simple implementation.

Given the ubiquity of uncertainty in data, which may arise from noisy measurements or the existence of unobserved common causes, we adopt the probabilistic interpretation of causation from Pearl (2000). Under this interpretation, the causal structure underlying a set of random variables X = (X1, . . . , Xd), with joint distribution P, is often described in terms of a Directed Acyclic Graph (DAG), denoted by G = (V, E). In this graph, each vertex Vi ∈ V is associated to the random variable Xi ∈ X, and an edge Eji ∈ E from Vj to Vi denotes the causal relationship "Xi ← Xj". More specifically, these causal relationships are defined by a structural equation model: each Xi ← fi(Pa(Xi), Ni), where fi is a function, Pa(Xi) is the parental set of Vi ∈ V, and Ni is some independent noise variable. Then, causal inference is the task of recovering G from S ∼ Pn.
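For concreteness, a structural equation model of this kind is easy to simulate. The sketch below (a hypothetical choice of mechanism f and noise N) draws a sample S from a two-variable SEM with causal graph X → Y:

```python
import numpy as np

def sample_sem(n, rng):
    # A hypothetical structural equation model over (X, Y) with graph X -> Y:
    # X has no parents, so X <- N_x; and Y <- f(Pa(Y), N_y) = f(X, N_y).
    X = rng.normal(size=n)                 # X <- N_x
    N_y = rng.normal(scale=0.1, size=n)    # independent noise variable
    Y = np.tanh(X) + N_y                   # Y <- f(X, N_y), f a fixed nonlinearity
    return np.column_stack([X, Y])         # one observational sample S from P(X, Y)

S = sample_sem(1000, np.random.default_rng(0))
```

Causal inference must then decide, from S alone, whether the data were generated by this model or by the symmetric one with the roles of X and Y exchanged.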

arXiv:1502.02398v2 [stat.ML] 18 May 2015


1.1. Prior Art

We now briefly review the state-of-the-art on the inference of causal structures G from observational data S ∼ Pn. For a more thorough exposition, see, e.g., Mooij et al. (2014).

One of the main strategies to recover G is through the exploration of conditional dependencies, together with some other technical assumptions such as the Markov and faithfulness relationships between P and G (Pearl, 2000). This is the case of the PC algorithm (Spirtes et al., 2000), which allows the recovery of the Markov equivalence class of G without placing any restrictions on the structural equation model specifying the random variables under study.

Causal inference algorithms that exploit conditional dependencies are unsuited for inference in the bivariate case. Consequently, a large body of work has been dedicated to the study of this scenario. First, the linear non-Gaussian causal model (Shimizu et al., 2006; 2011) recovers the true causal direction between two variables whenever their relationship is linear and polluted with additive and non-Gaussian noise. Second, this model was later extended into nonlinear additive noise models (Hoyer et al., 2009; Zhang & Hyvärinen, 2009; Stegle et al., 2010; Kpotufe et al., 2013; Peters et al., 2014), which prefer the causal direction under which the alleged cause is independent from the additive residuals of some nonlinear fit to the alleged effect. Third, the information geometric causal inference framework (Daniušis et al., 2012; Janzing et al., 2014) assumes that the cause random variable is independently generated from some invertible and deterministic mapping to its effect; thus, under the correct causal direction, it is unlikely to find dependencies between the density of the former and the slope of the latter.

As may be inferred from the previous exposition, there exists a large and heterogeneous array of causal inference algorithms, each of them working under a very specialized set of assumptions, which are sometimes difficult to test in practice. Therefore, there is a need for a more flexible causal inference rule, capable of learning the relevant causal footprints, later used for inference, directly from data. Such a "data-driven" approach would make it possible to deal with complex data-generating processes, and would greatly reduce the need to explicitly craft identifiability conditions a priori.

A preliminary step in this direction emerged from the competitions organized by Guyon (2013; 2014), which phrased causal inference as a learning problem. In these competitions, the participants were provided with a large collection of cause-effect samples {(Si, li)}ni=1, where Si = {(xij, yij)}nij=1 is drawn from the probability distribution of Xi × Yi, and li is a binary label indicating whether "Xi → Yi" or "Yi → Xi". Given these data, most participants adopted the strategy of i) crafting a vector of features from each Si, and ii) training a binary classifier on top of the constructed features and paired labels. Although these "data-driven" methods achieved state-of-the-art performance (Guyon, 2013), the laborious task of hand-crafting features renders their theoretical analysis impossible.

In more specific terms, the approach described above is a learning problem whose inputs are sample sets Si, where each Si contains samples drawn from the probability distribution Pi(Xi, Yi). In a separate strand of research, there have been several attempts to learn from probability distributions in a principled manner (Jebara et al., 2004; Hein & Bousquet, 2004; Cuturi et al., 2005; Martins et al., 2009; Muandet et al., 2012). Szabó et al. (2014) presented the first theoretical analysis of distributional learning based on kernel mean embeddings (Smola et al., 2007), with a focus on kernel ridge regression. Similarly, Muandet et al. (2012) studied the problem of classifying distributions, but their approach is constrained to kernel machines, and no guarantees regarding consistency or learning rates are provided.

1.2. Our Contribution

Inspired by Guyon's competitions, we pose causal inference as the problem of classifying probability measures on causally related pairs of random variables. Our contribution to this framework is the use of kernel mean embeddings to nonparametrically featurize each cause-effect sample Si. The benefits of this approach are three-fold. First, it avoids the need to hand-engineer features from the samples Si. Second, it enables a clean theoretical analysis, including provable learning rates and consistency results. Third, the kernel hyperparameters (that is, the data representation) can be jointly optimized with the classifier using cross-validation. Furthermore, we show how to extend these ideas to infer causal relationships between d ≥ 2 variables, give theoretically sustained approximations to scale learning to big data, and provide the source code of a simple implementation that outperforms the state-of-the-art.

The rest of this article is organized as follows. Section 2 reviews the concept of kernel mean embeddings, the tool that will facilitate learning from distributions. Section 3 shows the consistency and learning rates of our kernel mean embedding classification approach to cause-effect inference. Section 4 extends the presented ideas to the multivariate causal inference case. Section 5 presents a variety of experiments displaying the state-of-the-art performance of a simple implementation of the proposed framework. For convenience, Table 1 summarizes our notation.

2. Kernel Mean Embeddings of Probability Measures

In order to later classify probability measures P according to their causal properties, we first need to featurize them into


E[ξ], V[ξ]          Expected value and variance of r.v. ξ
Z                   Domain of cause-effect pairs Z = (X, Y)
P                   Set of cause-effect measures P on Z
L                   Set of labels li ∈ {−1, 1}
M                   Mother distribution over P × L
{(Pi, li)}ni=1      Sample from Mn
Si = {Zij}nij=1     Sample from Pnii
PSi                 Empirical distribution of Si
k                   Kernel function from Z × Z to R
Hk                  RKHS induced by k
µk(P)               Kernel mean embedding of measure P ∈ P
µk(PSi)             Empirical mean embedding of PSi
µk(P)               The set {µk(P) : P ∈ P}
Mk                  Measure over µk(P) × L induced by M
Fk                  Class of functionals mapping Hk to R
Rn(Fk)              Rademacher complexity of class Fk
ϕ, Rϕ(f)            Cost and surrogate ϕ-risk of sign ◦ f

Table 1. Table of notations.

a suitable representation. To this end, we will rely on the concept of kernel mean embeddings (Berlinet & Thomas-Agnan, 2004; Smola et al., 2007).

In particular, let P be the probability distribution of some random variable Z taking values in the separable topological space (Z, τz). Then, the kernel mean embedding of P associated with the continuous, bounded, and positive-definite kernel function k : Z × Z → R is

    µk(P) := ∫Z k(z, ·) dP(z),    (1)

which is an element in Hk, the Reproducing Kernel Hilbert Space (RKHS) associated with k (Schölkopf & Smola, 2002). Interestingly, the mapping µk is injective if k is a characteristic kernel (Sriperumbudur et al., 2010), that is, ‖µk(P) − µk(Q)‖Hk = 0 ⇔ P = Q. Said differently, if using a characteristic kernel, we do not lose any information when embedding distributions. An example of a characteristic kernel is the Gaussian kernel

    k(z, z′) = exp(−γ‖z − z′‖₂²), γ > 0,    (2)

which will be used throughout this paper.

In many practical situations, it is unrealistic to assume access to the true distribution P, and consequently to the true embedding µk(P). Instead, we often have access to a sample S = {zi}ni=1 ∼ Pn, which can be used to construct the empirical distribution PS := (1/n) Σzi∈S δ(zi), where δ(z) is the Dirac distribution centered at z. Using PS, we can approximate (1) by the empirical kernel mean embedding

    µk(PS) := (1/n) Σni=1 k(zi, ·) ∈ Hk.    (3)

The following result is a slight modification of Theorem 27 from (Song, 2008). It establishes the convergence of the empirical mean embedding µk(PS) to the embedding of its population counterpart µk(P), in RKHS norm:

Theorem 1. Assume that ‖f‖∞ ≤ 1 for all f ∈ Hk with ‖f‖Hk ≤ 1. Then, with probability at least 1 − δ, we have

    ‖µk(P) − µk(PS)‖Hk ≤ 2 √( E z∼P[k(z, z)] / n ) + √( 2 log(1/δ) / n ).

Proof. See Section B.1.
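Although µk(PS) is an infinite-dimensional object for the Gaussian kernel (2), RKHS distances between two empirical embeddings can be computed exactly through the reproducing property. The following sketch (plain numpy, with illustrative sample sizes) evaluates ‖µk(PS) − µk(PT)‖²Hk for two samples S and T:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # k(z, z') = exp(-gamma * ||z - z'||_2^2), the characteristic kernel of Equation (2).
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def embedding_distance_sq(S, T, gamma=1.0):
    # ||mu_k(P_S) - mu_k(P_T)||^2_{H_k}, expanded via the reproducing property:
    # mean_{s,s'} k(s,s') + mean_{t,t'} k(t,t') - 2 mean_{s,t} k(s,t).
    return (gaussian_kernel(S, S, gamma).mean()
            + gaussian_kernel(T, T, gamma).mean()
            - 2.0 * gaussian_kernel(S, T, gamma).mean())

rng = np.random.default_rng(0)
S = rng.normal(0.0, 1.0, size=(500, 2))
T_same = rng.normal(0.0, 1.0, size=(500, 2))   # same distribution as S
T_diff = rng.normal(3.0, 1.0, size=(500, 2))   # shifted distribution
```

Consistently with Theorem 1, the distance between embeddings of two samples from the same P shrinks as the sample sizes grow, while samples from different measures stay separated, reflecting the injectivity of µk for characteristic kernels.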

3. A Theory of Causal Inference as Distribution Classification

This section phrases the inference of cause-effect relationships from probability measures as the classification of empirical kernel mean embeddings, and analyzes the learning rates and consistency of such an approach. Throughout our exposition, the setup is as follows:

1. We assume the existence of some Mother distribution M, defined on P × L, where P is the set of all Borel probability measures on the space Z of two causally related random variables, and L = {−1, +1}.

2. A set {(Pi, li)}ni=1 is sampled from Mn. Each measure Pi ∈ P is the joint distribution of the causally related random variables Zi = (Xi, Yi), and the label li ∈ L indicates whether "Xi → Yi" or "Xi ← Yi".

3. In practice, we do not have access to the measures {Pi}ni=1. Instead, we observe samples Si = {(xij, yij)}nij=1 ∼ Pnii, for all 1 ≤ i ≤ n. Using Si, the data {(Si, li)}ni=1 is provided to the learner.

4. We featurize every sample Si into the empirical kernel mean embedding µk(PSi) associated with some kernel function k (Equation 3). If k is a characteristic kernel, we incur no loss of information in this step.

Under this setup, we will use the set {(µk(PSi), li)}ni=1 to train a binary classifier from Hk to L, which will later be used to unveil the causal directions of new, unseen probability measures drawn from M. Note that this framework can be straightforwardly extended to also infer the "confounding (X ← Z → Y)" and "independent (X ⊥⊥ Y)" cases by adding two extra labels to L.
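A toy version of steps 1–3 (with a hypothetical causal mechanism, purely for illustration) can be simulated by drawing cause-effect samples and randomly swapping their coordinates to produce both labels:

```python
import numpy as np

def make_pair(n, rng):
    # Draw one cause-effect sample S_i = {(x_ij, y_ij)}_{j=1}^{n} with label l_i:
    # l = +1 encodes "X -> Y"; l = -1 encodes "X <- Y" (coordinates swapped).
    x = rng.normal(size=n)
    y = np.sin(x) + 0.1 * rng.normal(size=n)   # hypothetical causal mechanism
    l = rng.choice([-1, 1])
    S = np.column_stack([x, y] if l == 1 else [y, x])
    return S, l

rng = np.random.default_rng(0)
data = [make_pair(200, rng) for _ in range(100)]   # the set {(S_i, l_i)}_{i=1}^{n}
```

Each (Si, li) pair then plays the role of one labeled training point for the distribution classifier; step 4 replaces Si by its empirical kernel mean embedding.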

Given the two nested levels of sampling (the first one from the Mother distribution M, and the second one from each of the drawn cause-effect measures Pi), it is not trivial to conclude whether this learning procedure is consistent, or how its learning rates depend on the sample sizes n and {ni}ni=1. In the following, we will study the generalization performance of empirical risk minimization over this learning setup. Specifically, we are interested in


upper bounding the excess risk between the empirical risk minimizer and the best classifier from our hypothesis class, with respect to the Mother distribution M.

We divide our analysis into three parts. First, §3.1 reviews the abstract setting of statistical learning theory and surrogate risk minimization. Second, §3.2 adapts these standard results to the case of empirical kernel mean embedding classification. Third, §3.3 considers theoretically sustained approximations to deal with big data.

3.1. Margin-based Risk Bounds in Learning Theory

Let P be some unknown probability measure defined on Z × L, where Z is referred to as the input space, and L = {−1, 1} is referred to as the output space¹. One of the main goals of statistical learning theory (Vapnik, 1998) is to find a classifier h : Z → L that minimizes the expected risk

    R(h) = E(z,l)∼P [ℓ(h(z), l)]

for a suitable loss function ℓ : L × L → R+, which penalizes departures between predictions h(z) and true labels l. For classification, one common choice of loss function is the 0-1 loss ℓ01(l, l′) = |l − l′|, for which the expected risk measures the probability of misclassification. Since P is unknown in natural situations, one usually resorts to the minimization of the empirical risk (1/n) Σni=1 ℓ(h(zi), li) over some fixed hypothesis class H, for the training set {(zi, li)}ni=1 ∼ Pn. It is well known that this procedure is consistent under mild assumptions (Boucheron et al., 2005).

Unfortunately, the 0-1 loss function is not convex, which leads to empirical risk minimization being generally intractable. Instead, we will focus on the minimization of surrogate risk functions (Bartlett et al., 2006). In particular, we will consider the set of classifiers of the form H = {sign ◦ f : f ∈ F}, where F is some fixed set of real-valued functions f : Z → R. Introduce a nonnegative cost function ϕ : R → R+ which is surrogate to the 0-1 loss, that is, ϕ(ε) ≥ 1ε>0. For any f ∈ F we define its expected and empirical ϕ-risks respectively as

    Rϕ(f) = E(z,l)∼P [ϕ(−f(z)l)],    (4)

    R̂ϕ(f) = (1/n) Σni=1 ϕ(−f(zi)li).    (5)

Many natural choices of ϕ lead to tractable empirical risk minimization. Common examples of cost functions include the hinge loss ϕ(ε) = max(0, 1 + ε) used in SVM, the exponential loss ϕ(ε) = exp(ε) used in Adaboost, and the logistic loss ϕ(ε) = log2(1 + e^ε) used in logistic regression.

¹ Refer to Section A for considerations on measurability.

The misclassification error of sign ◦ f is always upper bounded by Rϕ(f). The relationship between functions minimizing Rϕ(f) and functions minimizing R(sign ◦ f) has been intensively studied in the literature (Steinwart & Christmann, 2008, Chapter 3). Given the high uncertainty associated with causal inferences, we argue that one is interested in predicting soft probabilities rather than hard labels, a fact that makes the study of margin-based classifiers well suited for our problem.

We now focus on the estimation of f∗ ∈ F, the function minimizing (4). However, since the distribution P is unknown, we can only hope to estimate fn ∈ F, the function minimizing (5). Therefore, we are interested in high-probability upper bounds on the excess ϕ-risk

    EF(fn) = Rϕ(fn) − Rϕ(f∗),    (6)

w.r.t. the random training sample {(zi, li)}ni=1 ∼ Pn. The excess risk (6) can be upper bounded in the following way:

    EF(fn) ≤ Rϕ(fn) − R̂ϕ(fn) + R̂ϕ(f∗) − Rϕ(f∗)
           ≤ 2 sup f∈F |Rϕ(f) − R̂ϕ(f)|.    (7)

While this upper bound leads to tight results for worst case analysis, it is well known (Bartlett et al., 2005; Boucheron et al., 2005; Koltchinskii, 2011) that tighter bounds can be achieved under additional assumptions on P. However, we leave these analyses for future research.

The following result, in the spirit of Koltchinskii & Panchenko (1999) and Bartlett & Mendelson (2002), can be found in Boucheron et al. (2005, Theorem 4.1).

Theorem 2. Consider a class F of functions mapping Z to R. Let ϕ : R → R+ be a Lϕ-Lipschitz function such that ϕ(ε) ≥ 1ε>0. Let B be a uniform upper bound on ϕ(−f(z)l). Let {(zi, li)}ni=1 ∼ Pn and let {σi}ni=1 be i.i.d. Rademacher random signs. Then, with probability at least 1 − δ,

    sup f∈F |Rϕ(f) − R̂ϕ(f)| ≤ 2Lϕ E[ sup f∈F (1/n) |Σni=1 σi f(zi)| ] + B √( log(1/δ) / 2n ),

where the expectation is taken w.r.t. {σi, zi}ni=1.

The expectation in the bound of Theorem 2 is known as the Rademacher complexity of F, will be denoted by Rn(F), and has a typical order of O(n−1/2) (Koltchinskii, 2011).
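For intuition on the O(n−1/2) behavior, the Rademacher complexity of a simple class, the linear functionals f(z) = ⟨w, z⟩ with ‖w‖₂ ≤ B, can be estimated by Monte Carlo, since for that class the supremum has the closed form sup‖w‖≤B (1/n)|Σ σi⟨w, zi⟩| = (B/n)‖Σ σi zi‖₂. The sketch below (hypothetical data and constants) averages this quantity over random sign draws:

```python
import numpy as np

def rademacher_complexity(Z, B=1.0, draws=500, rng=None):
    # Monte Carlo estimate of E_sigma[ sup_{||w|| <= B} (1/n) |sum_i sigma_i <w, z_i>| ]
    # = (B/n) * E_sigma ||sum_i sigma_i z_i||_2, using the closed-form supremum.
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(Z)
    sigma = rng.choice([-1.0, 1.0], size=(draws, n))   # i.i.d. Rademacher signs
    return (B / n) * np.linalg.norm(sigma @ Z, axis=1).mean()

rng = np.random.default_rng(1)
Z = rng.normal(size=(400, 5))
r_n = rademacher_complexity(Z, rng=np.random.default_rng(2))
r_4n = rademacher_complexity(np.vstack([Z] * 4), rng=np.random.default_rng(3))
```

Quadrupling the sample size roughly halves the estimate (r_4n ≈ r_n / 2), matching the O(n−1/2) rate quoted above.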

3.2. From Classic to Distributional Learning Theory

Note that we cannot directly apply the empirical risk minimization bounds discussed in the previous section to our learning setup. This is because instead of learning a classifier on the i.i.d. sample {(µk(Pi), li)}ni=1, we have to learn


over the set {(µk(PSi), li)}ni=1, where Si ∼ Pnii. Said differently, our input feature vectors µk(PSi) are "noisy": they exhibit an additional source of variation, as any two different random samples Si, S′i ∼ Pnii do. In the following, we study how to incorporate these nested sampling effects into an argument similar to Theorem 2.

We will now frame our problem within the abstract learning setting considered in the previous section. Recall that our learning setup initially considers some Mother distribution M over P × L. Let µk(P) = {µk(P) : P ∈ P} ⊆ Hk, L = {−1, +1}, and let Mk be a measure (guaranteed to exist by Lemma 2, Section A.1) on µk(P) × L induced by M. Specifically, we will consider µk(P) ⊆ Hk and L to be the input and output spaces of our learning problem, respectively. Let {(µk(Pi), li)}ni=1 ∼ Mnk be our training set. We will now work with the set of classifiers {sign ◦ f : f ∈ Fk}, for some fixed class Fk of functionals mapping from the RKHS Hk to R.

As pointed out in the description of our learning setup, we do not have access to the distributions {Pi}ni=1, but to samples Si ∼ Pnii, for all 1 ≤ i ≤ n. For this reason, we define the sample-based empirical ϕ-risk

    R̃ϕ(f) = (1/n) Σni=1 ϕ(−li f(µk(PSi))),

which is the approximation to the empirical ϕ-risk R̂ϕ(f) that results from substituting the embeddings µk(Pi) with their empirical counterparts µk(PSi).

Our goal is again to find the function f∗ ∈ Fk minimizing the expected ϕ-risk Rϕ(f). Since Mk is unknown to us, and we have no access to the embeddings {µk(Pi)}ni=1, we will instead use the minimizer of R̃ϕ(f) in Fk:

    fn ∈ arg min f∈Fk R̃ϕ(f).    (8)

To sum up, the excess risk (6) can now be reformulated as

    Rϕ(fn) − Rϕ(f∗).    (9)

Note that the estimation of f∗ suffers from two nested sources of error: i) having only n training samples from the distribution Mk, and ii) having only ni samples from each measure Pi. Using a technique similar to (7), we can upper bound the excess risk as

    Rϕ(fn) − Rϕ(f∗) ≤ 2 sup f∈Fk |Rϕ(f) − R̂ϕ(f)|    (10)
                     + 2 sup f∈Fk |R̂ϕ(f) − R̃ϕ(f)|.    (11)

The term (10) is upper bounded by Theorem 2. On the other hand, to deal with (11), we will need to upper bound the deviations |f(µk(Pi)) − f(µk(PSi))| in terms of the distances ‖µk(Pi) − µk(PSi)‖Hk, which are in turn upper bounded using Theorem 1. To this end, we will have to assume that the class Fk consists of functionals with uniformly bounded Lipschitz constants, such as the set of linear functionals with uniformly bounded operator norm.

We now present the main result of this section, which provides a high-probability bound on the excess risk (9). Importantly, this excess risk translates into the expected causal inference accuracy of our distribution classifier.

Theorem 3. Consider the RKHS Hk associated with some bounded, continuous kernel function k, such that supz∈Z k(z, z) ≤ 1. Consider a class Fk of functionals mapping Hk to R, with Lipschitz constants uniformly bounded by LF. Let ϕ : R → R+ be a Lϕ-Lipschitz function such that ϕ(ε) ≥ 1ε>0. Let ϕ(−f(h)l) ≤ B for every f ∈ Fk, h ∈ Hk, and l ∈ L. Then, with probability not less than 1 − δ (over all sources of randomness),

    Rϕ(fn) − Rϕ(f∗) ≤ 4Lϕ Rn(Fk) + 2B √( log(2/δ) / 2n )
        + (4LϕLF / n) Σni=1 ( √( E z∼Pi[k(z, z)] / ni ) + √( log(2n/δ) / 2ni ) ).

Proof. See Section B.2.

As mentioned in Section 3.1, the typical order of Rn(Fk) is O(n−1/2). For particular examples of classes of functionals with small Rademacher complexity, we refer to Maurer (2006). In such cases, the upper bound in Theorem 3 converges to zero (meaning that our procedure is consistent) as both n and ni tend to infinity, in such a way that log n/ni = o(1).² The rate of convergence w.r.t. n can be improved up to O(n−1) by placing additional assumptions on M (Bartlett et al., 2005). On the contrary, the rate w.r.t. ni cannot be improved in general. Namely, the convergence rate O(n−1/2) presented in the upper bound of Theorem 1 is tight, as shown in the following novel result.

Theorem 4. Under the assumptions of Theorem 1, denote

    σ²Hk = sup ‖f‖Hk ≤ 1 Vz∼P[f(z)].

Then there exist universal constants c, C such that for every integer n ≥ 1/σ²Hk, with probability at least c,

    ‖µk(P) − µk(PS)‖Hk ≥ C σHk / √n.

Proof. See Section B.3.

Finally, it is instructive to relate the notion of "identifiability" often considered in the causal inference community (Pearl, 2000) to the properties of the Mother distribution. Saying that the model is identifiable means that the label l of P ∈ P is assigned deterministically by M. In this case, learning rates can become as fast as O(n−1). On the other hand, as M(l|P) becomes nondeterministic, the problem becomes unidentifiable and learning rates slow down (for example, in the extreme case of cause-effect pairs related by linear functions polluted with additive Gaussian noise, M(l = +1|P) = M(l = −1|P) almost surely). The investigation of these phenomena is left for future research.

² We conjecture that this constraint is an artifact of our proof.

3.3. Low Dimensional Embeddings for Large Data

For some kernel functions, the embeddings µk(PS) ∈ Hk are infinite dimensional. Because of this, one must resort to the use of dual optimization problems and, in particular, kernel matrices. The construction of these matrices entails at least O(n²) computation and memory, prohibitive for large n. In this section, we show that the infinite-dimensional embeddings µk(PS) ∈ Hk can be approximated with easy-to-compute, low-dimensional representations (Rahimi & Recht, 2007; 2008). This will allow us to replace the infinite-dimensional minimization problem (8) with a low-dimensional one.

Assume that Z = Rd, and that the kernel function k is real-valued and shift-invariant. Then, we can exploit Bochner's theorem (Rudin, 1962) to show that, for any z, z′ ∈ Z:

    k(z, z′) = 2Ck E w,b [cos(⟨w, z⟩ + b) cos(⟨w, z′⟩ + b)],    (12)

where w ∼ (1/Ck) pk, b ∼ U[0, 2π], pk : Z → R is the positive and integrable Fourier transform of k, and Ck = ∫Z pk(w) dw. For example, the squared-exponential kernel (2) is shift-invariant, and its evaluations can be approximated by (12), setting pk(w) = N(w|0, 2γI) and Ck = 1.

We now show that, for any probability measure Q on Z and z ∈ Z, the function k(z, ·) ∈ Hk ⊆ L2(Q) can be approximated by a linear combination of randomly chosen elements from the Hilbert space L2(Q). Namely, consider the functions parametrized by w, z ∈ Z and b ∈ [0, 2π]:

    g z w,b(·) = 2Ck cos(⟨w, z⟩ + b) cos(⟨w, ·⟩ + b),    (13)

which belong to L2(Q), since they are bounded. If we sample {(wj, bj)}mj=1 i.i.d., as discussed above, the average

    g z m(·) = (1/m) Σmj=1 g z wj,bj(·)

can be viewed as an L2(Q)-valued random variable. Moreover, (12) shows that E w,b[g z m(·)] = k(z, ·). This enables us to invoke concentration inequalities for Hilbert spaces (Ledoux & Talagrand, 1991) to show the following result, which is in the spirit of Rahimi & Recht (2008, Lemma 1).

Lemma 1. Let Z = Rd. For any shift-invariant kernel k s.t. supz∈Z k(z, z) ≤ 1, any fixed S = {zi}ni=1 ⊂ Z, any probability distribution Q on Z, and any δ > 0, we have

    ‖ µk(PS) − (1/n) Σni=1 g zi m(·) ‖ L2(Q) ≤ (2Ck / √m) ( 1 + √( 2 log(n/δ) ) )

with probability larger than 1 − δ over {(wj, bj)}mj=1.

Proof. See Section B.4.

Once sampled, the parameters {(w_i, b_i)}_{i=1}^m allow us to approximate the empirical kernel mean embeddings {µ_k(P_{S_i})}_{i=1}^n using elements from span({cos(⟨w_i, ·⟩ + b_i)}_{i=1}^m), which is a finite-dimensional subspace of L₂(Q). Therefore, we propose to use {(µ_{k,m}(P_{S_i}), l_i)}_{i=1}^n as the training sample for our final empirical risk minimization problem, where

\[
\mu_{k,m}(P_S) = \frac{2C_k}{|S|} \sum_{z\in S} \big(\cos(\langle w_j, z\rangle + b_j)\big)_{j=1}^m \in \mathbb{R}^m. \tag{14}
\]

These feature vectors can be computed in O(m) time and stored in O(1) memory; importantly, they can be used off-the-shelf in conjunction with any learning algorithm.
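A minimal NumPy sketch of (14), assuming a Gaussian kernel (so C_k = 1); `random_feature_embedding` is a hypothetical helper name:

```python
import numpy as np

def random_feature_embedding(S, w, b, C_k=1.0):
    """mu_{k,m}(P_S) from (14): average of m random cosine features over the sample S.

    S: (n, d) sample matrix; w: (m, d) frequencies drawn from p_k / C_k;
    b: (m,) phases drawn uniformly from [0, 2*pi]."""
    S = np.atleast_2d(S)
    return 2 * C_k * np.mean(np.cos(S @ w.T + b), axis=0)  # shape (m,)

rng = np.random.RandomState(0)
d, m, gamma = 2, 500, 1.0
w = rng.randn(m, d) * np.sqrt(2 * gamma)  # p_k of the squared-exponential kernel
b = rng.uniform(0, 2 * np.pi, size=m)

S = rng.randn(100, d)
mu = random_feature_embedding(S, w, b)
print(mu.shape)  # (500,)
```

The resulting m-dimensional vectors can be fed directly to any off-the-shelf classifier.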

For the precise excess risk bounds that take into account the use of these low-dimensional approximations, please refer to Theorem 6 in Section B.5.

4. Extensions to Multivariate Causal Inference

It is possible to extend our framework to infer causal relationships between d ≥ 2 variables X = (X₁, ..., X_d). To this end, and as introduced in Section 1, assume the existence of a causal directed acyclic graph G which underlies the dependencies present in the probability distribution P(X). Therefore, our task is to recover G from S ∼ Pⁿ.

Naïvely, one could extend the framework presented in Section 3 from the binary classification of 2-dimensional distributions to the multiclass classification of d-dimensional distributions. However, the number of possible DAGs (and therefore, the number of labels in our multiclass classification problem) grows super-exponentially in d.

An alternative approach is to consider the probabilities of the three labels "X_i → X_j", "X_i ← X_j", and "X_i ⊥⊥ X_j" for each pair of variables {X_i, X_j} ⊆ X, when embedded along with every possible context X_k ⊆ X \ {X_i, X_j}. The intuition here is the same as in the PC algorithm of Spirtes et al. (2000): in order to decide the (absence of a) causal relationship between X_i and X_j, one must analyze the confounding effects of every X_k ⊆ X \ {X_i, X_j}.
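To make the combinatorics concrete, here is a small sketch (with hypothetical helper names) of the enumeration this entails: for each pair {X_i, X_j}, every subset of the remaining variables is a candidate context:

```python
from itertools import chain, combinations

def contexts(variables, i, j):
    """All subsets X_k of X \\ {X_i, X_j}: the candidate contexts examined when
    classifying the edge between X_i and X_j (cf. the PC algorithm)."""
    rest = [v for v in variables if v not in (i, j)]
    return chain.from_iterable(combinations(rest, r) for r in range(len(rest) + 1))

variables = ["X1", "X2", "X3", "X4"]
pairs = list(combinations(variables, 2))                       # 6 unordered pairs
print(len(pairs), len(list(contexts(variables, "X1", "X2"))))  # 6 4
```

The number of contexts per pair is 2^(d−2), so this pairwise strategy trades the super-exponential label space for an exponential, but per-pair, enumeration.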


Towards a Learning Theory of Cause-Effect Inference

5. Numerical Simulations

We conduct an array of experiments to test the effectiveness of a simple implementation of the presented causal learning framework. Given the use of random embeddings (14) in our classifier, we term our method the Randomized Causation Coefficient (RCC). Throughout our simulations, we featurize each sample S = {(x_i, y_i)}_{i=1}^n as

\[
\nu(S) = \big(\mu_{k,m}(P_{S_x}),\, \mu_{k,m}(P_{S_y}),\, \mu_{k,m}(P_{S_{xy}})\big), \tag{15}
\]

where the three elements forming (15) stand for the low-dimensional representations (14) of the empirical kernel mean embeddings of {x_i}_{i=1}^n, {y_i}_{i=1}^n, and {(x_i, y_i)}_{i=1}^n, respectively. The representation (15) is motivated by the typical conjecture in causal inference about the existence of asymmetries between the marginal and conditional distributions of causally-related pairs of random variables (Schölkopf et al., 2012). Each of these three embeddings has random features sampled to approximate the sum of three Gaussian kernels (2) with hyper-parameters 0.1γ, γ, and 10γ, where γ is found using the median heuristic. In practice, we set m = 1000, and observe no significant improvements when using larger amounts of random features. To classify the embeddings (15) in each of the experiments, we use the random forest³ implementation from Python's sklearn-0.16-git. The number of trees is chosen from {100, 250, 500, 1000, 5000} via cross-validation.
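A toy end-to-end sketch of this pipeline. It is deliberately simplified: a single Gaussian kernel with a fixed γ replaces the kernel sum with median-heuristic bandwidths, the constant factors of (14) are dropped, and the training pairs are synthetic:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def embed(A, w, b):
    # Low-dimensional embedding (14), up to constant factors.
    return np.mean(np.cos(np.atleast_2d(A) @ w.T + b), axis=0)

def featurize(x, y, w1, b1, w2, b2):
    # nu(S) from (15): embeddings of {x_i}, {y_i}, and {(x_i, y_i)}.
    return np.concatenate([embed(x[:, None], w1, b1),
                           embed(y[:, None], w1, b1),
                           embed(np.column_stack([x, y]), w2, b2)])

rng = np.random.RandomState(0)
m, gamma = 100, 1.0
w1, b1 = rng.randn(m, 1) * np.sqrt(2 * gamma), rng.uniform(0, 2 * np.pi, m)
w2, b2 = rng.randn(m, 2) * np.sqrt(2 * gamma), rng.uniform(0, 2 * np.pi, m)

feats, labels = [], []
for _ in range(200):
    x = rng.randn(100)
    y = x ** 2 + 0.1 * rng.randn(100)           # toy mechanism: X -> Y
    feats += [featurize(x, y, w1, b1, w2, b2),  # label +1 for "X -> Y"
              featurize(y, x, w1, b1, w2, b2)]  # label -1 for the flipped pair
    labels += [+1, -1]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(feats, labels)

x_new = rng.randn(100)
y_new = x_new ** 2 + 0.1 * rng.randn(100)
pred = clf.predict([featurize(x_new, y_new, w1, b1, w2, b2)])  # value in {+1, -1}
```

Note that flipping each pair and negating its label, as in the last loop, is exactly how the direction-agnostic training set is built in Section 5.1.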

Our experiments can be replicated using the source code at

https://github.com/lopezpaz/causation_learning_theory.

5.1. Classification of Tübingen Cause-Effect Pairs

The Tübingen cause-effect pairs is a collection of heterogeneous, hand-collected, real-world cause-effect samples (Zscheischler, 2014). Given the small size of this dataset, we resort to the synthesis of an artificial Mother distribution to sample our training data from. To this end, assume that sampling a synthetic cause-effect sample set S_i := {(x_ij, y_ij)}_{j=1}^n ∼ P_θ amounts to the following simple generative process:

1. A cause vector (x_ij)_{j=1}^n is sampled from a mixture of Gaussians with c components. The mixture weights are sampled from U(0, 1), and normalized to sum to one. The mixture means and standard deviations are sampled from N(0, σ₁) and N(0, σ₂), respectively, accepting only positive standard deviations. The cause vector is standardized to zero mean and unit variance.

2. A noise vector (ε_ij)_{j=1}^n is sampled from a centered Gaussian, with variance sampled from U(0, σ₃).

³Although random forests do not comply with the Lipschitzness assumptions from Section 3, they showed the best empirical results. Compliant alternatives such as SVMs exhibited a typical drop in classification accuracy of 5%.

Figure 1. Accuracy of RCC, IGCI and ANM on the Tübingen cause-effect pairs, as a function of decision rate (both axes in %). The grey area depicts accuracies that are not statistically significant.

3. A mapping mechanism f_i is conceived as a spline fitted using a uniform grid of d_f elements from min((x_ij)_{j=1}^n) to max((x_ij)_{j=1}^n) as inputs, and d_f normally distributed outputs.

4. An effect vector is built as (y_ij := f_i(x_ij) + ε_ij)_{j=1}^n, and standardized to zero mean and unit variance.

5. Return the cause-effect sample S_i := {(x_ij, y_ij)}_{j=1}^n.
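The five steps above can be sketched as follows (with two labeled simplifications: linear interpolation stands in for the spline fit, and taking absolute values of the Gaussian draws, which is equivalent to rejecting negative standard deviations):

```python
import numpy as np

def sample_cause_effect(n=1000, c=3, s1=2.0, s2=2.0, s3=1.0, df=5, rng=None):
    """One synthetic cause-effect pair (x, y) ~ P_theta, following steps 1-5."""
    if rng is None:
        rng = np.random.RandomState(0)
    # 1. Cause: Gaussian mixture with random weights, means and (positive) scales.
    p = rng.uniform(0, 1, c)
    p /= p.sum()
    means, sds = rng.randn(c) * s1, np.abs(rng.randn(c) * s2)
    comp = rng.choice(c, size=n, p=p)
    x = rng.randn(n) * sds[comp] + means[comp]
    x = (x - x.mean()) / x.std()
    # 2. Noise: centered Gaussian, variance drawn from U(0, s3).
    eps = rng.randn(n) * np.sqrt(rng.uniform(0, s3))
    # 3. Mechanism: df random knot values on a uniform grid over [min(x), max(x)].
    grid = np.linspace(x.min(), x.max(), df)
    vals = rng.randn(df)
    # 4.-5. Effect: y = f(x) + eps, standardized.
    y = np.interp(x, grid, vals) + eps
    y = (y - y.mean()) / y.std()
    return x, y

x, y = sample_cause_effect()
print(x.shape, y.shape)  # (1000,) (1000,)
```

Sampling N such pairs, together with their flipped versions, yields the labeled training set described below.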

To choose a θ = (c, σ₁, σ₂, σ₃, d_f) that best resembles the unlabeled test data, we minimize the distance between the embeddings of N synthetic pairs and the Tübingen samples

\[
\operatorname*{arg\,min}_{\theta} \sum_i \min_{1\le j\le N} \left\| \nu(S_i) - \nu(S_j) \right\|_2^2, \tag{16}
\]

over c, d_f ∈ {1, ..., 10} and σ₁, σ₂, σ₃ ∈ {0, 0.5, 1, ..., 5}, where the S_j ∼ P_θ, the S_i are the Tübingen cause-effect pairs, and ν is as in (15). This strategy can be thought of as transductive learning, since we assume to know the test inputs prior to the training of our inference rule. We set n = 1000 and N = 10,000.
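A sketch of this selection step, with `synthesize` and `featurize` hypothetical stand-ins for the Mother-distribution sampler and the embedding ν:

```python
import numpy as np

def select_theta(test_feats, synthesize, featurize, grid, N=100):
    """Pick theta minimizing sum_i min_j ||nu(S_i) - nu(S_j')||_2^2, as in (16)."""
    best, best_val = None, np.inf
    for theta in grid:
        synth = np.stack([featurize(S) for S in synthesize(theta, N)])
        # Squared distances between every test embedding and every synthetic one.
        d2 = ((test_feats[:, None, :] - synth[None, :, :]) ** 2).sum(-1)
        val = d2.min(axis=1).sum()  # nearest synthetic neighbour per test sample
        if val < best_val:
            best, best_val = theta, val
    return best

# Toy check: the theta whose synthetic samples lie closest to the test data wins.
rng = np.random.RandomState(0)
featurize = lambda S: S.mean(axis=0)                            # stand-in embedding
synthesize = lambda theta, N: [rng.randn(50, 2) + theta for _ in range(N)]
test_feats = np.stack([(rng.randn(50, 2) + 2.0).mean(axis=0) for _ in range(20)])
best = select_theta(test_feats, synthesize, featurize, grid=[0.0, 2.0, 4.0])
print(best)  # 2.0
```

In the paper's setting, `grid` would be the full product grid over (c, σ₁, σ₂, σ₃, d_f) stated above.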

Using the generative process outlined above, we construct the synthetic training data

\[
\{(\nu(\{(x_{ij}, y_{ij})\}_{j=1}^n), +1)\}_{i=1}^N \cup \{(\nu(\{(y_{ij}, x_{ij})\}_{j=1}^n), -1)\}_{i=1}^N,
\]

where {(x_ij, y_ij)}_{j=1}^n ∼ P_θ, and train our classifier on it.

Figure 1 plots the classification accuracy of RCC, IGCI (Daniusis et al., 2012), and ANM (Mooij et al., 2014) versus the fraction of decisions that the algorithms are forced to make out of the 82 scalar Tübingen cause-effect pairs. To compare these results to other lower-performing methods,


refer to Janzing et al. (2012). RCC surpasses the state of the art with a classification accuracy of 81.61% when inferring the causal directions on all pairs. The confidence of RCC is computed using the classifier's output class probabilities. SVMs obtain a test accuracy of 77.2% in this same task.

5.2. Inferring the Arrow of Time

We test the effectiveness of our method to infer the arrow of time from causal time series. More specifically, we assume access to a set of time series {x_ij}_{j=1}^{n_i}, and our task is to infer, for each series, whether X_i → X_{i+1} or X_i ← X_{i+1}.

We compare our framework to the state of the art of Peters et al. (2009), using the same electroencephalography signals (Blankertz, 2005) as in their original experiment. On the one hand, Peters et al. (2009) construct two Auto-Regressive Moving-Average (ARMA) models for each causal time series and time direction, and prefer the solution under which the model residuals are independent from the inferred cause. To this end, the method uses two parameters for which no estimation procedure is provided. On the other hand, our approach makes no assumptions whatsoever about the parametric model underlying the series, at the expense of requiring a disjoint set of N = 10,000 causal time series for training. Our method matches the best performance of Peters et al. (2009), with an accuracy of 82.66%.

5.3. ChaLearn’s Challenge Data

The cause-effect challenges organized by Guyon (2014) provided N = 16,199 training causal samples S_i, each drawn from the distribution of X_i × Y_i, and labeled either "X_i → Y_i", "X_i ← Y_i", "X_i ← Z_i → Y_i", or "X_i ⊥⊥ Y_i". The task of the competition was to develop a causation coefficient which would predict large positive values for causal samples following "X_i → Y_i", large negative values for samples following "X_i ← Y_i", and zero otherwise. Using these data, our method obtained a test bidirectional area under the curve score (Guyon, 2014) of 0.74 in one minute and a half, ranking third on the overall leaderboard. The winner of the competition obtained a score of 0.82 in thirty minutes, but resorted to several dozens of hand-crafted features.

Partitioning these same data in different ways, we learned two related but different binary classification tasks. First, we trained our classifier to detect latent confounding, and obtained a test classification accuracy of 80% on the task of distinguishing "X → Y or X ← Y" from "X ← Z → Y". Second, we trained our classifier to measure dependence, and obtained a test classification accuracy of 88% on the task of distinguishing between "X ⊥⊥ Y" and "else". We consider these results to be a promising direction to learn flexible hypothesis tests and dependence measures directly from data.

5.4. Reconstruction of Causal DAGs

We apply the strategy described in Section 4 to reconstruct the causal DAGs of two multivariate datasets: autoMPG and abalone (Lichman, 2013). Once again, we resort to synthetic training data, generated by a procedure similar to the one used in Section 5.1. Refer to Section C for details.

Regarding autoMPG, Figure 2 shows that 1) the release date of the vehicle (AGE) causes the miles-per-gallon consumption (MPG), acceleration capabilities (ACC) and horse-power (HP), 2) the weight of the vehicle (WEI) causes the horse-power and MPG, and 3) other characteristics such as the engine displacement (DIS) and number of cylinders (CYL) cause the MPG. For abalone, Figure 3 shows that 1) the age of the snail causes all the other variables, 2) the overall weight of the snail (WEI) is caused by the partial weights of its meat (WEA), viscera (WEB), and shell (WEC), and 3) the height of the snail (HEI) is responsible for other physical attributes such as its diameter (DIA) and length (LEN).

The target variable for each dataset is shaded in gray. Interestingly, our inference reveals that the autoMPG dataset is a causal prediction task (the features cause the target), and that the abalone dataset is an anticausal prediction task (the target causes the features). This distinction has implications when learning from these data (Schölkopf et al., 2012).

Figure 2. Causal DAG recovered from the autoMPG data (nodes: MPG, AGE, ACC, WEI, HP, CYL, DIS).

Figure 3. Causal DAG recovered from the abalone data (nodes: AGE, WEA, WEB, WEC, WEI, LEN, DIA, HEI).

6. Future Work

Three research directions are in progress. First, to improve learning rates by using common assumptions from causal inference. Second, to further investigate methods to reconstruct multivariate DAGs. Third, to develop mechanisms to interpret the causal footprints learned by our classifiers.


References

Bartlett, P. L. and Mendelson, S. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 2002.

Bartlett, P. L. and Mendelson, S. Empirical minimization. Probability Theory and Related Fields, 135(4), 2006.

Bartlett, P. L., Bousquet, O., and Mendelson, S. Local Rademacher complexities. The Annals of Statistics, 33(4), 2005.

Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101, 2006.

Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Kluwer Academic Publishers, 2004.

Blankertz, B. BCI Competition III data, experiment 4a, subject 3, 1000Hz, 2005. URL http://bbci.de/competition/iii/download/.

Boucheron, S., Lugosi, G., and Bousquet, O. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 2005.

Cuturi, M., Fukumizu, K., and Vert, J. P. Semigroup kernels on measures. JMLR, 2005.

Daniusis, P., Janzing, D., Mooij, J., Zscheischler, J., Steudel, B., Zhang, K., and Schölkopf, B. Inferring deterministic causal relations. UAI, 2012.

Guyon, I. Cause-effect pairs kaggle competition, 2013. URL https://www.kaggle.com/c/cause-effect-pairs/.

Guyon, I. ChaLearn fast causation coefficient challenge, 2014. URL https://www.codalab.org/competitions/1381.

Hein, M. and Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. AISTATS, 2004.

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J. R., and Schölkopf, B. Nonlinear causal discovery with additive noise models. NIPS, 2009.

Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniusis, P., Steudel, B., and Schölkopf, B. Information-geometric approach to inferring causal directions. Artificial Intelligence, 2012.

Janzing, D., Steudel, B., Shajarisales, N., and Schölkopf, B. Justifying information-geometric causal inference. arXiv, 2014.

Jebara, T., Kondor, R., and Howard, A. Probability product kernels. JMLR, 2004.

Koltchinskii, V. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. École d'été de probabilités de Saint-Flour. Springer, 2011.

Koltchinskii, V. and Panchenko, D. Rademacher processes and bounding the risk of function learning. In Giné, E., Mason, D., and Wellner, J. (eds.), High Dimensional Probability II, pp. 443–457. Birkhäuser, 1999.

Kpotufe, S., Sgouritsa, E., Janzing, D., and Schölkopf, B. Consistency of causal inference under the additive noise model. ICML, 2013.

Ledoux, M. and Talagrand, M. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.

Lichman, M. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.

Martins, A. F. T., Smith, N. A., Xing, E. P., Aguiar, P. M. Q., and Figueiredo, M. A. T. Nonextensive information theoretic kernels on measures. JMLR, 2009.

Maurer, A. The Rademacher complexity of linear transformation classes. COLT, 2006.

Mooij, J. M., Peters, J., Janzing, D., Zscheischler, J., and Schölkopf, B. Distinguishing cause from effect using observational data: methods and benchmarks. arXiv preprint arXiv:1412.3773, 2014.

Muandet, K., Fukumizu, K., Dinuzzo, F., and Schölkopf, B. Learning from distributions via support measure machines. NIPS, 2012.

Pearl, J. Causality: models, reasoning and inference. Cambridge University Press, 2000.

Peters, J., Janzing, D., Gretton, A., and Schölkopf, B. Detecting the direction of causal time series. ICML, 2009.

Peters, J., Mooij, J. M., Janzing, D., and Schölkopf, B. Causal discovery with continuous additive noise models. JMLR, 2014.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. NIPS, 2007.

Rahimi, A. and Recht, B. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. NIPS, 2008.

Reed, M. and Simon, B. Methods of Modern Mathematical Physics. Vol. 1: Functional Analysis. Academic Press, New York, 1972.

Reichenbach, H. The Direction of Time. Dover, 1956.

Rudin, W. Fourier Analysis on Groups. Wiley, 1962.

Schölkopf, B. and Smola, A. J. Learning with Kernels. MIT Press, 2002.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. M. On causal and anticausal learning. ICML, 2012.

Shimizu, S., Hoyer, P. O., Hyvärinen, A., and Kerminen, A. A linear non-Gaussian acyclic model for causal discovery. JMLR, 2006.

Shimizu, S., Inazumi, T., Sogawa, Y., Hyvärinen, A., Kawahara, Y., Washio, T., Hoyer, P. O., and Bollen, K. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. JMLR, 2011.

Smola, A. J., Gretton, A., Song, L., and Schölkopf, B. A Hilbert space embedding for distributions. In ALT. Springer-Verlag, 2007.

Song, L. Learning via Hilbert Space Embedding of Distributions. PhD thesis, The University of Sydney, 2008.

Spirtes, P., Glymour, C. N., and Scheines, R. Causation, Prediction, and Search. MIT Press, 2000.

Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. Hilbert space embeddings and metrics on probability measures. JMLR, 2010.

Stegle, O., Janzing, D., Zhang, K., Mooij, J. M., and Schölkopf, B. Probabilistic latent variable models for distinguishing between cause and effect. NIPS, 2010.

Steinwart, I. and Christmann, A. Support Vector Machines. Springer, 2008.

Szabó, Z., Sriperumbudur, B., Póczos, B., and Gretton, A. Learning theory for distribution regression. arXiv, 2014.

Vapnik, V. N. Statistical Learning Theory. John Wiley & Sons, 1998.

Zhang, K. and Hyvärinen, A. On the identifiability of the post-nonlinear causal model. UAI, 2009.

Zscheischler, J. Benchmark data set for causal discovery algorithms, v0.8, 2014. URL http://webdav.tuebingen.mpg.de/cause-effect/.


A. Topological and Measurability Considerations

Let (Z, τ_Z) and (L, τ_L) be two separable topological spaces, where Z is the input space and L := {−1, 1} is the output space. Let B(τ) be the Borel σ-algebra induced by the topology τ. Let P be an unknown probability measure on (Z × L, B(τ_Z) ⊗ B(τ_L)).

Consider also the classifiers f ∈ F_k and the loss function ℓ to be measurable.

A.1. Measurability Conditions to Learn from Distributions

The first step towards the deployment of our learning setup is to guarantee the existence of a measure on the space µ_k(P) × L, where µ_k(P) = {µ_k(P) : P ∈ P} ⊆ H_k is the set of kernel mean embeddings associated with the measures in P. The following lemma provides such a guarantee. This allows the analysis within the rest of this section to proceed on µ_k(P) × L.

Lemma 2. Let (Z, τ_Z) and (L, τ_L) be two separable topological spaces. Let P be the set of all Borel probability measures on (Z, B(τ_Z)). Let µ_k(P) = {µ_k(P) : P ∈ P} ⊆ H_k, where µ_k is the kernel mean embedding (1) associated to some bounded continuous kernel function k : Z × Z → R. Then, there exists a measure on µ_k(P) × L.

Proof. The following result is similar to Szabó et al. (2014, Proof 3).

Start by endowing P with the weak topology τ_P, such that the map

\[
L(P) = \int_Z f(z)\, dP(z), \tag{17}
\]

is continuous for all f ∈ C_b(Z). This makes (P, B(τ_P)) a measurable space.

First, we show that µ_k : (P, B(τ_P)) → (H_k, B(τ_H)) is Borel measurable. Note that H_k is separable due to the separability of (Z, τ_Z) and the continuity of k (Steinwart & Christmann, 2008, Lemma 4.33). The separability of H_k implies that µ_k is Borel measurable iff it is weakly measurable (Reed & Simon, 1972, Thm. IV.22). Note that the boundedness and the continuity of k imply H_k ⊆ C_b(Z) (Steinwart & Christmann, 2008, Lemma 4.28). Therefore, (17) remains continuous for all f ∈ H_k, which implies the Borel measurability of µ_k.

Second, µ_k : (P, B(τ_P)) → (G, B(τ_G)) is Borel measurable, since B(τ_G) = {A ∩ G : A ∈ B(H_k)} ⊆ B(τ_H), where B(τ_G) is the σ-algebra induced by the topology of G ∈ B(H_k) (Szabó et al., 2014).

Third, we show that g : (P × L, B(τ_P) ⊗ B(τ_L)) → (G × L, B(τ_G) ⊗ B(τ_L)) is measurable. For that, it suffices to decompose g(x, y) = (g₁(x, y), g₂(x, y)) and show that g₁ and g₂ are measurable (Szabó et al., 2014).

B. Proofs

B.1. Theorem 1

Note that the original statement of Theorem 27 in Song (2008) assumed f ∈ [0, 1], whereas we let the elements of the ball in the RKHS take negative values as well, which can be achieved by minor changes to the proof. For completeness, we provide the modified proof here. Using the well-known dual relation between the norm in the RKHS and the sup-norm of the empirical process, which can be found in Theorem 28 of Song (2008), we can write:

\[
\|\mu_k(P) - \mu_k(P_S)\|_{\mathcal{H}_k} = \sup_{\|f\|_{\mathcal{H}_k}\le 1} \left( \mathbb{E}_{z\sim P}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i) \right). \tag{18}
\]

Now we proceed in the usual way. First, we note that the sup-norm of the empirical process appearing on the r.h.s. can be viewed as a real-valued function of the i.i.d. random variables z₁, ..., z_n. We will denote it by F(z₁, ..., z_n). A straightforward computation shows that the function F satisfies the bounded difference condition (Theorem 14 of Song (2008)). Indeed, let us fix all the values z₁, ..., z_n except for z_j, which we will set to z′_j. Using the identity |a − b| = (a − b)1_{a>b} + (b − a)1_{a≤b}, and noting that if sup_x f(x) = f(x*) then sup_x f(x) − sup_x g(x) is upper bounded by f(x*) − g(x*), we get

\[
|F(z_1,\dots,z_j',\dots,z_n) - F(z_1,\dots,z_j,\dots,z_n)| \le \frac{1}{n}\big(f(z_j) - f(z_j')\big)\mathbf{1}_{F(z_1,\dots,z_j',\dots,z_n) > F(z_1,\dots,z_j,\dots,z_n)} + \frac{1}{n}\big(f(z_j') - f(z_j)\big)\mathbf{1}_{F(z_1,\dots,z_j',\dots,z_n) \le F(z_1,\dots,z_j,\dots,z_n)}.
\]


Now, noting that |f(z) − f(z′)| ∈ [0, 2], we conclude with

\[
|F(z_1,\dots,z_j',\dots,z_n) - F(z_1,\dots,z_j,\dots,z_n)| \le \frac{2}{n}\mathbf{1}_{F(z_1,\dots,z_j',\dots,z_n) > F(z_1,\dots,z_j,\dots,z_n)} + \frac{2}{n}\mathbf{1}_{F(z_1,\dots,z_j',\dots,z_n) \le F(z_1,\dots,z_j,\dots,z_n)} = \frac{2}{n}.
\]

Using McDiarmid's inequality (Theorem 14 of Song (2008)) with c_i = 2/n, we obtain that with probability not less than 1 − δ the following holds:

\[
\sup_{\|f\|_{\mathcal{H}_k}\le 1} \left( \mathbb{E}_{z\sim P}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i) \right) \le \mathbb{E}\left[ \sup_{\|f\|_{\mathcal{H}_k}\le 1} \left( \mathbb{E}_{z\sim P}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i) \right) \right] + \sqrt{\frac{2\log(1/\delta)}{n}}.
\]

Finally, we proceed with the symmetrization step (Theorem 2.1 of Koltchinskii (2011)), which upper bounds the expected value of the sup-norm of the empirical process by twice the Rademacher complexity of the class {f ∈ H_k : ‖f‖_{H_k} ≤ 1}, together with the upper bound on this Rademacher complexity, which can be found in Lemma 22 and related remarks of Bartlett & Mendelson (2002).

We also note that the original statement of Theorem 27 in Song (2008) contains an extra factor of 2 under the logarithm compared to our modified result. This is explained by the fact that, while we upper bounded the Rademacher complexity directly, Song (2008) instead upper bounds it in terms of the empirical (or conditional) Rademacher complexity, which results in another application of McDiarmid's inequality together with a union bound.

B.2. Theorem 3

We will proceed as follows:

\[
\begin{aligned}
R_\varphi(f_n) - R_\varphi(f^*) &= R_\varphi(f_n) - \tilde{R}_\varphi(f_n) + \tilde{R}_\varphi(f_n) - \tilde{R}_\varphi(f^*) + \tilde{R}_\varphi(f^*) - R_\varphi(f^*) \\
&\le 2 \sup_{f\in\mathcal{F}_k} |R_\varphi(f) - \tilde{R}_\varphi(f)| \\
&= 2 \sup_{f\in\mathcal{F}_k} |R_\varphi(f) - \hat{R}_\varphi(f) + \hat{R}_\varphi(f) - \tilde{R}_\varphi(f)| \\
&\le 2 \sup_{f\in\mathcal{F}_k} |R_\varphi(f) - \hat{R}_\varphi(f)| + 2 \sup_{f\in\mathcal{F}_k} |\hat{R}_\varphi(f) - \tilde{R}_\varphi(f)|,
\end{aligned} \tag{19}
\]

where \(\hat{R}_\varphi\) denotes the empirical risk computed on the embeddings {µ_k(P_i)}_{i=1}^n, \(\tilde{R}_\varphi\) the empirical risk computed on the empirical embeddings {µ_k(P_{S_i})}_{i=1}^n, and the first inequality uses the fact that f_n minimizes \(\tilde{R}_\varphi\).

We will now upper bound the two terms in (19) separately.

We start by noticing that Theorem 2 can be used to upper bound the first term. All we need is to match the quantities appearing in our problem to the classical setting of learning theory discussed in Section 3.1. Indeed, let µ(P) play the role of the input space Z. Thus the input objects are kernel mean embeddings of elements of P. According to Lemma 2, there is a distribution defined over µ(P) × L, which will play the role of the unknown distribution P. Finally, the i.i.d. pairs {(µ_k(P_i), l_i)}_{i=1}^n form the training sample. Thus, using Theorem 2 we get that with probability not less than 1 − δ/2 (w.r.t. the random training sample {(µ_k(P_i), l_i)}_{i=1}^n) the following holds true:

\[
\sup_{f\in\mathcal{F}_k} |R_\varphi(f) - \hat{R}_\varphi(f)| \le 2L_\varphi\, \mathbb{E}\left[ \sup_{f\in\mathcal{F}_k} \frac{1}{n} \left| \sum_{i=1}^n \sigma_i f(z_i) \right| \right] + B\sqrt{\frac{\log(2/\delta)}{2n}}. \tag{20}
\]

To deal with the second term in (19), we note that

\[
\begin{aligned}
\sup_{f\in\mathcal{F}_k} |\hat{R}_\varphi(f) - \tilde{R}_\varphi(f)| &= \sup_{f\in\mathcal{F}_k} \left| \frac{1}{n}\sum_{i=1}^n \left[ \varphi\big(-l_i f(\mu_k(P_i))\big) - \varphi\big(-l_i f(\mu_k(P_{S_i}))\big) \right] \right| \\
&\le \sup_{f\in\mathcal{F}_k} \frac{1}{n}\sum_{i=1}^n \left| \varphi\big(-l_i f(\mu_k(P_i))\big) - \varphi\big(-l_i f(\mu_k(P_{S_i}))\big) \right| \\
&\le L_\varphi \sup_{f\in\mathcal{F}_k} \frac{1}{n}\sum_{i=1}^n \left| f(\mu_k(P_i)) - f(\mu_k(P_{S_i})) \right|,
\end{aligned}
\]


where we have used the Lipschitzness of the cost function ϕ. Using the Lipschitzness of the functionals f ∈ F_k, we obtain:

\[
\sup_{f\in\mathcal{F}_k} |\hat{R}_\varphi(f) - \tilde{R}_\varphi(f)| \le L_\varphi \sup_{f\in\mathcal{F}_k} \frac{L_f}{n} \sum_{i=1}^n \|\mu_k(P_i) - \mu_k(P_{S_i})\|_{\mathcal{H}_k}. \tag{21}
\]

Also note that the usual reasoning shows that if h ∈ H_k and ‖h‖_{H_k} ≤ 1, then:

\[
|h(z)| = |\langle h, k(z,\cdot)\rangle_{\mathcal{H}_k}| \le \|h\|_{\mathcal{H}_k} \|k(z,\cdot)\|_{\mathcal{H}_k} = \|h\|_{\mathcal{H}_k}\sqrt{k(z,z)} \le \sqrt{k(z,z)},
\]

and hence ‖h‖_∞ = sup_{z∈Z} |h(z)| ≤ 1, because our kernel is bounded. This allows us to use Theorem 1 to control every term in (21) and combine the resulting upper bounds in a union bound⁴ over i = 1, ..., n to show that, for any fixed P₁, ..., P_n, with probability not less than 1 − δ/2 (w.r.t. the random samples {S_i}_{i=1}^n) the following is true:

\[
L_\varphi \sup_{f\in\mathcal{F}} \frac{L_f}{n} \sum_{i=1}^n \|\mu_k(P_i) - \mu_k(P_{S_i})\|_{\mathcal{H}_k} \le L_\varphi \sup_{f\in\mathcal{F}} \frac{L_f}{n} \sum_{i=1}^n \left( 2\sqrt{\frac{\mathbb{E}_{z\sim P_i}[k(z,z)]}{n_i}} + \sqrt{\frac{2\log\frac{2n}{\delta}}{n_i}} \right). \tag{22}
\]

The quantity 2n/δ appears under the logarithm since, for every i, we have used Theorem 1 with δ′ = δ/(2n). Combining (20) and (22) in a union bound together with (19), we finally get that with probability not less than 1 − δ the following is true:

\[
R_\varphi(f_n) - R_\varphi(f^*) \le 4L_\varphi \mathcal{R}_n(\mathcal{F}) + 2B\sqrt{\frac{\log(2/\delta)}{2n}} + \frac{4L_\varphi L_{\mathcal{F}}}{n} \sum_{i=1}^n \left( \sqrt{\frac{\mathbb{E}_{z\sim P_i}[k(z,z)]}{n_i}} + \sqrt{\frac{\log\frac{2n}{\delta}}{2n_i}} \right),
\]

where we have defined L_F = sup_{f∈F} L_f.

B.3. Theorem 4

Our proof is a simple combination of the duality equation (18) with the following lower bound on the supremum of the empirical process, presented in Theorem 2.3 of Bartlett & Mendelson (2006):

Theorem 5. Let F be a class of real-valued functions defined on a set Z such that sup_{f∈F} ‖f‖_∞ ≤ 1. Let z₁, ..., z_n, z ∈ Z be i.i.d. according to some probability measure P on Z. Set σ²_F = sup_{f∈F} V[f(z)]. Then there are universal constants c, c′, and C for which the following holds:

\[
\mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left| \mathbb{E}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i) \right| \right] \ge \frac{c\,\sigma_{\mathcal{F}}}{\sqrt{n}}.
\]

Furthermore, for every integer n ≥ 1/σ²_F, with probability at least c′,

\[
\sup_{f\in\mathcal{F}} \left| \mathbb{E}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i) \right| \ge C\, \mathbb{E}\left[ \sup_{f\in\mathcal{F}} \left| \mathbb{E}[f(z)] - \frac{1}{n}\sum_{i=1}^n f(z_i) \right| \right].
\]

We note that the constants c, c′, and C appearing in the last result do not depend on n, σ²_F, or any other quantities appearing in the statement. This can be verified by inspection of the proof presented in Bartlett & Mendelson (2006).

B.4. Lemma 1

Proof. Bochner's theorem (Rudin, 1962) states that for any shift-invariant symmetric p.d. kernel k defined on Z × Z, where Z = R^d, and any z, z′ ∈ Z, the following holds:

\[
k(z, z') = \int_Z p_k(w)\, e^{i\langle w, z - z'\rangle}\, dw, \tag{23}
\]

⁴Note that the union bound results in the extra log n factor in our bound. We believe that this factor can be avoided using a refined proof technique, based on the application of McDiarmid's inequality. This question is left for future work.


where p_k is the positive and integrable Fourier transform of the kernel k. It is immediate to check that the Fourier transform of such kernels k is always an even function, meaning p_k(−w) = p_k(w). Indeed, since k(z − z′) = k(z′ − z) for all z, z′ ∈ Z (due to the symmetry of the kernel), we have:

\[
p_k(w) := \int_Z k(\delta)\, e^{i\langle w, \delta\rangle}\, d\delta = \int_Z k(\delta)\cos(\langle w, \delta\rangle)\, d\delta = \int_Z k(\delta)\cos(-\langle w, \delta\rangle)\, d\delta = p_k(-w),
\]

which holds for any w ∈ R^d. Thus, for any z, z′ ∈ R^d we can write:

\[
\begin{aligned}
k(z, z') &= \int_{\mathbb{R}^d} p_k(w)\big(\cos(\langle w, z - z'\rangle) + i\sin(\langle w, z - z'\rangle)\big)\, dw \\
&= \int_{\mathbb{R}^d} p_k(w)\cos(\langle w, z - z'\rangle)\, dw + i \int_{\mathbb{R}^d} p_k(w)\sin(\langle w, z - z'\rangle)\, dw \\
&= \int_{\mathbb{R}^d} p_k(w)\cos(\langle w, z - z'\rangle)\, dw \\
&= 2\int_{\mathbb{R}^d} \int_0^{2\pi} \frac{1}{2\pi}\, p_k(w)\cos(\langle w, z\rangle + b)\cos(\langle w, z'\rangle + b)\, db\, dw.
\end{aligned}
\]

Denote C_k = ∫_{R^d} p_k(w) dw < ∞. Next, we will use the identity cos(a − b) = (1/π)∫₀^{2π} cos(a + x)cos(b + x) dx and introduce random variables b and w distributed according to U[0, 2π] and (1/C_k)p_k, respectively. Then we can rewrite

\[
k(z, z') = 2C_k\, \mathbb{E}_{b,w}\left[\cos(\langle w, z\rangle + b)\cos(\langle w, z'\rangle + b)\right]. \tag{24}
\]
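The trigonometric identity used here can be checked numerically (midpoint-rule quadrature over one period):

```python
import numpy as np

a, b = 0.7, -1.3
n = 10000
dx = 2 * np.pi / n
xs = (np.arange(n) + 0.5) * dx  # midpoints of a uniform partition of [0, 2*pi]
lhs = (np.cos(a + xs) * np.cos(b + xs)).sum() * dx / np.pi
print(abs(lhs - np.cos(a - b)))  # ~0, up to quadrature error
```

For smooth periodic integrands, the midpoint rule over a full period is accurate to near machine precision.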

Now let Q be any probability distribution defined on Z. Then, for any z, w ∈ Z and b ∈ [0, 2π], the function

\[
g^z_{w,b}(\cdot) := 2C_k \cos(\langle w, z\rangle + b)\cos(\langle w, \cdot\rangle + b)
\]

belongs to the L₂(Q) space; namely, the L₂(Q) norm of such a function is finite. Moreover, it is bounded by 2C_k:

\[
\|g^z_{w,b}(\cdot)\|^2_{L_2(Q)} = \int_Z \big(2C_k \cos(\langle w, z\rangle + b)\cos(\langle w, t\rangle + b)\big)^2\, dQ(t) \le 4C_k^2 \int_Z dQ(t) = 4C_k^2. \tag{25}
\]

Note that for any fixed z ∈ Z and any random parameters w ∈ Z and b ∈ [0, 2π], the element g^z_{w,b}(·) is a random variable taking values in the L₂(Q) space (which is Hilbert). Such Banach-space-valued random variables are well-studied objects (Ledoux & Talagrand, 1991), and a number of concentration results for them are known by now. We will use the following version of the Hoeffding inequality, which can be found in Lemma 4 of Rahimi & Recht (2008):

Lemma 3. Let v₁, ..., v_m be i.i.d. random variables taking values in a ball of radius M centred around the origin in a Hilbert space H. Then, for any δ > 0, the following holds:

\[
\left\| \frac{1}{m}\sum_{i=1}^m v_i - \mathbb{E}\left[ \frac{1}{m}\sum_{i=1}^m v_i \right] \right\|_H \le \frac{M}{\sqrt{m}}\left(1 + \sqrt{2\log(1/\delta)}\right)
\]

with probability higher than 1 − δ over the random sample v₁, ..., v_m.
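Taking H = R⁵ as a concrete Hilbert space, a quick simulation illustrates that the deviation in Lemma 3 stays below the stated bound far more often than 1 − δ:

```python
import numpy as np

rng = np.random.RandomState(0)
M, m, d, delta, trials = 1.0, 400, 5, 0.05, 200
bound = (M / np.sqrt(m)) * (1 + np.sqrt(2 * np.log(1 / delta)))

failures = 0
for _ in range(trials):
    v = rng.randn(m, d)
    # Rescale rows so that ||v_i|| <= M (radii uniform on [0, M]); E[v_i] = 0 by symmetry.
    v *= M * rng.uniform(0, 1, (m, 1)) / np.linalg.norm(v, axis=1, keepdims=True)
    failures += np.linalg.norm(v.mean(axis=0)) > bound
print(failures / trials)  # far below delta = 0.05
```

The bound is of course not tight for this particular distribution; the point is only that the failure probability is controlled by δ.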

Note that Bochner's formula (23), and particularly its simplified form (24), indicates that if w is distributed according to the normalized Fourier transform (1/C_k)p_k and b ∼ U([0, 2π]), then E_{w,b}[g^z_{w,b}(·)] = k(z, ·). Moreover, we can show that any element h of the RKHS H_k also belongs to the L₂(Q) space:

\[
\|h(\cdot)\|^2_{L_2(Q)} = \int_Z \big(h(t)\big)^2\, dQ(t) = \int_Z \langle k(t,\cdot), h(\cdot)\rangle^2_{\mathcal{H}_k}\, dQ(t) \le \int_Z k(t,t)\, \|h\|^2_{\mathcal{H}_k}\, dQ(t) \le \|h\|^2_{\mathcal{H}_k} < \infty, \tag{26}
\]


where we have used the reproducing property of k in the RKHS H_k, the Cauchy–Schwarz inequality, and the fact that the kernel k is bounded. Thus, we conclude that the function k(z, ·) is also an element of the L₂(Q) space.

This shows that if we have a sample of i.i.d. pairs {(w_i, b_i)}_{i=1}^m, then E[(1/m)∑_{i=1}^m g^z_{w_i,b_i}(·)] = k(z, ·), where {g^z_{w_i,b_i}(·)}_{i=1}^m are i.i.d. elements of the Hilbert space L₂(Q). We conclude the proof using the concentration inequality for Hilbert spaces of Lemma 3 and a union bound over the elements z ∈ S, since

\[
\begin{aligned}
\left\| \mu_k(P_S) - \frac{1}{n}\sum_{i=1}^n g^{z_i}_m(\cdot) \right\|_{L_2(Q)} &= \left\| \frac{1}{n}\sum_{i=1}^n k(z_i, \cdot) - \frac{1}{n}\sum_{i=1}^n g^{z_i}_m(\cdot) \right\|_{L_2(Q)} \\
&\le \frac{1}{n}\sum_{i=1}^n \left\| k(z_i,\cdot) - g^{z_i}_m(\cdot) \right\|_{L_2(Q)} = \frac{1}{n}\sum_{i=1}^n \left\| k(z_i,\cdot) - \frac{1}{m}\sum_{j=1}^m g^{z_i}_{w_j,b_j}(\cdot) \right\|_{L_2(Q)},
\end{aligned}
\]

where we have used the triangle inequality.

B.5. Excess Risk Bound for Low-Dimensional Representations

Let us first recall some important notation introduced in Section 3.3. For any w, z ∈ Z and b ∈ [0, 2π], we define the functions

\[
g^z_{w,b}(\cdot) = 2C_k \cos(\langle w, z\rangle + b)\cos(\langle w, \cdot\rangle + b) \in L_2(Q), \tag{27}
\]

where C_k = ∫_Z p_k(w) dw, for p_k : Z → R the Fourier transform of k. We sample m pairs {(w_i, b_i)}_{i=1}^m i.i.d. from ((1/C_k)p_k) × U[0, 2π] and define the average function

\[
g^z_m(\cdot) = \frac{1}{m}\sum_{i=1}^m g^z_{w_i,b_i}(\cdot) \in L_2(Q).
\]

Since the cosine functions (27) do not necessarily belong to the RKHS H_k, and we are going to use their linear combinations as training points, our classifiers should now act on the whole L₂(Q) space. To this end, we redefine the set of classifiers introduced in Section 3.2 to be {sign ◦ f : f ∈ F_Q}, where now F_Q is a set of functionals mapping L₂(Q) to R.

Recall that our goal is to find f* such that

$$f^* \in \arg\min_{f \in \mathcal{F}_Q} R_\varphi(f) := \arg\min_{f \in \mathcal{F}_Q} \mathbb{E}_{(P,l)\sim\mathcal{M}}\big[\varphi\big(-f(\mu_k(P))\, l\big)\big]. \quad (28)$$

As was pointed out in Section B.4, if the kernel k is bounded, sup_{z∈Z} k(z, z) ≤ 1, then Hk ⊆ L2(Q). In particular, for any P ∈ P it holds that μk(P) ∈ L2(Q), and thus (28) is well defined.

Instead of solving (28) directly, we will again use a version of empirical risk minimization (ERM). However, this time we will not use the empirical mean embeddings {μk(P_{S_i})}_{i=1}^n since, as was already discussed, those lead to expensive computations involving the kernel matrix. Instead, we will pose the ERM problem in terms of the low-dimensional approximations based on cosines. Namely, we propose to use the following estimator f^m_n:

$$f^m_n \in \arg\min_{f \in \mathcal{F}_Q} R^m_\varphi(f) := \arg\min_{f \in \mathcal{F}_Q} \frac{1}{n}\sum_{i=1}^n \varphi\left(-f\left(\frac{1}{n_i}\sum_{z \in S_i} g^z_m(\cdot)\right) l_i\right).$$
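As an illustration, this low-dimensional ERM pipeline can be sketched end-to-end. Everything below is our own toy setup, not the paper's implementation: a fixed Gaussian-type random feature map stands in for g^z_m, synthetic pairs with Y = X² + noise provide causal and anticausal sample sets, and plain gradient descent on the logistic surrogate plays the role of ERM over FQ:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 2, 100                        # (x, y) pairs, m random features

W = rng.normal(size=(m, d))          # assumes a Gaussian kernel with C_k = 1
b = rng.uniform(0, 2 * np.pi, m)

def featurize(S):
    # m-dimensional surrogate for (1/n_i) * sum_{z in S_i} g^z_m(.):
    # average the random cosine features over the sample set.
    return np.mean(np.sqrt(2.0) * np.cos(S @ W.T + b), axis=0)

def toy_pair():
    # X -> Y with Y = X^2 + noise; label +1 for the causal order.
    x = rng.normal(size=(100, 1))
    y = x ** 2 + 0.1 * rng.normal(size=(100, 1))
    return np.hstack([x, y])

# Build the training set {(mu_i, l_i)}: each pair in both orders.
feats, labels = [], []
for _ in range(200):
    S = toy_pair()
    feats += [featurize(S), featurize(S[:, ::-1])]
    labels += [1.0, -1.0]
X, l = np.array(feats), np.array(labels)

# Minimise the logistic surrogate risk by plain gradient descent.
theta = np.zeros(m)
for _ in range(500):
    margin = l * (X @ theta)
    grad = -(X * (l / (1 + np.exp(margin)))[:, None]).mean(axis=0)
    theta -= 1.0 * grad

acc = np.mean(np.sign(X @ theta) == l)  # training accuracy of sign(f)
```

Since each sample set is represented by a single m-dimensional vector, the classifier never touches a kernel matrix, which is the point of the cosine approximation.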

The following result puts together Theorem 3 and Lemma 1 to provide an excess risk bound for f^m_n which accounts for all sources of error introduced in the learning pipeline:

Theorem 6. Let Z = R^d and let Q be any probability distribution on Z. Consider the RKHS Hk associated with some bounded, continuous, shift-invariant kernel function k such that sup_{z∈Z} k(z, z) ≤ 1. Consider a class FQ of functionals mapping L2(Q) to R with Lipschitz constants uniformly bounded by LQ. Let ϕ : R → R+ be an Lϕ-Lipschitz function such that ϕ(z) ≥ 1_{z>0}, and let ϕ(−f(h) l) ≤ B for every f ∈ FQ, h ∈ L2(Q), and l ∈ L. Then for any δ > 0 the following holds:

$$R_\varphi(f^m_n) - R_\varphi(f^*) \le 4 L_\varphi \mathcal{R}_n(\mathcal{F}_Q) + 2B\sqrt{\frac{\log(3/\delta)}{2n}} + \frac{4 L_\varphi L_Q}{n}\sum_{i=1}^n \left(\sqrt{\frac{\mathbb{E}_{z\sim P_i}[k(z,z)]}{n_i}} + \sqrt{\frac{\log\frac{3n}{\delta}}{2 n_i}}\right) + \frac{2 L_\varphi L_Q}{n}\sum_{i=1}^n \frac{2 C_k}{\sqrt{m}}\left(1 + \sqrt{2\log(3n\, n_i/\delta)}\right)$$

with probability not less than 1 − δ over all sources of randomness, namely {(P_i, l_i)}_{i=1}^n, {S_i}_{i=1}^n, and {(w_i, b_i)}_{i=1}^m.

Proof. We will proceed similarly to (19). Denote by $\bar{R}_\varphi(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-f(\mu_k(P_i))\, l_i)$ the empirical risk based on the true embeddings and by $\widehat{R}_\varphi(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-f(\mu_k(P_{S_i}))\, l_i)$ the one based on the empirical embeddings. Then

$$R_\varphi(f^m_n) - R_\varphi(f^*) = \big[R_\varphi(f^m_n) - R^m_\varphi(f^m_n)\big] + \big[R^m_\varphi(f^m_n) - R^m_\varphi(f^*)\big] + \big[R^m_\varphi(f^*) - R_\varphi(f^*)\big] \le 2\sup_{f\in\mathcal{F}_Q}\big|R_\varphi(f) - R^m_\varphi(f)\big|$$
$$= 2\sup_{f\in\mathcal{F}_Q}\big|R_\varphi(f) - \bar{R}_\varphi(f) + \bar{R}_\varphi(f) - \widehat{R}_\varphi(f) + \widehat{R}_\varphi(f) - R^m_\varphi(f)\big|$$
$$\le 2\sup_{f\in\mathcal{F}_Q}\big|R_\varphi(f) - \bar{R}_\varphi(f)\big| + 2\sup_{f\in\mathcal{F}_Q}\big|\bar{R}_\varphi(f) - \widehat{R}_\varphi(f)\big| + 2\sup_{f\in\mathcal{F}_Q}\big|\widehat{R}_\varphi(f) - R^m_\varphi(f)\big|. \quad (29)$$

The first two terms of (29) were upper bounded in Section B.2. Note that the upper bound on the second term (proved in Theorem 3) was based on the assumption that the functionals in FQ are Lipschitz on Hk w.r.t. the Hk metric. But, as we already noted, for bounded kernels we have Hk ⊆ L2(Q), which implies ‖h‖_{L2(Q)} ≤ ‖h‖_{Hk} for any h ∈ Hk (see (26)). Thus |f(h) − f(h′)| ≤ L_f ‖h − h′‖_{L2(Q)} ≤ L_f ‖h − h′‖_{Hk} for any h, h′ ∈ Hk. This means that the assumptions of Theorem 3 hold true and we can safely apply it to upper bound the first two terms of (29).

We now upper bound the third term, $\sup_{f\in\mathcal{F}_Q}|\widehat{R}_\varphi(f) - R^m_\varphi(f)|$, where $\widehat{R}_\varphi(f) = \frac{1}{n}\sum_{i=1}^n \varphi(-f(\mu_k(P_{S_i}))\, l_i)$, using Lemma 1:

$$\sup_{f\in\mathcal{F}_Q}\big|\widehat{R}_\varphi(f) - R^m_\varphi(f)\big| = \sup_{f\in\mathcal{F}_Q}\left|\frac{1}{n}\sum_{i=1}^n \varphi\big(-f(\mu_k(P_{S_i}))\, l_i\big) - \frac{1}{n}\sum_{i=1}^n \varphi\left(-f\left(\frac{1}{n_i}\sum_{z\in S_i} g^z_m(\cdot)\right) l_i\right)\right|$$
$$\le \frac{1}{n}\sum_{i=1}^n \sup_{f\in\mathcal{F}_Q}\left|\varphi\big(-f(\mu_k(P_{S_i}))\, l_i\big) - \varphi\left(-f\left(\frac{1}{n_i}\sum_{z\in S_i} g^z_m(\cdot)\right) l_i\right)\right|$$
$$\le \frac{L_\varphi}{n}\sum_{i=1}^n \sup_{f\in\mathcal{F}_Q}\left|f\big(\mu_k(P_{S_i})\big) - f\left(\frac{1}{n_i}\sum_{z\in S_i} g^z_m(\cdot)\right)\right| \le \frac{L_\varphi}{n}\sum_{i=1}^n \sup_{f\in\mathcal{F}_Q} L_f \left\|\mu_k(P_{S_i}) - \frac{1}{n_i}\sum_{z\in S_i} g^z_m(\cdot)\right\|_{L_2(Q)}.$$

We can now use Lemma 1 combined with a union bound over i = 1, . . . , n, with δ′ = δ/n. This gives

$$\sup_{f\in\mathcal{F}_Q}\big|\widehat{R}_\varphi(f) - R^m_\varphi(f)\big| \le \frac{L_\varphi L_Q}{n}\sum_{i=1}^n \frac{2 C_k}{\sqrt{m}}\left(1 + \sqrt{2\log(n\, n_i/\delta)}\right)$$

with probability not less than 1 − δ over {(w_i, b_i)}_{i=1}^m.


C. Training and Test Protocols for Section 5.4

The synthesis of the training data for the experiments described in Section 5.4 follows a very similar procedure to the one from Section 5.1. The main difference here is that, when trying to infer the cause-effect relationship between two variables Xi and Xj belonging to a larger set of variables X = (X1, . . . , Xd), we will have to account for the effects of possible confounders Xk ⊆ X \ {Xi, Xj}. For the sake of simplicity, we will only consider one-dimensional confounding effects, that is, scalar Xk.

C.1. Training Phase

To generate cause-effect pairs exhibiting every possible scalar confounding effect, we will generate data from the eight possible directed acyclic graphs depicted in Figure 4.

Figure 4. The eight possible directed acyclic graphs on three variables.

In particular, we will sample N different causal DAGs G1, . . . , GN, where Gi describes the causal structure underlying (Xi, Yi, Zi). Given Gi, we generate the sample set S_i = {(x_{ij}, y_{ij}, z_{ij})}_{j=1}^n according to the generative process described in Section 5.1. Together with Si, we annotate the triplet of labels (li1, li2, li3), where, according to Gi,

• li1 = +1 if “Xi → Yi”, li1 = −1 if “Xi ← Yi”, and li1 = 0 else.

• li2 = +1 if “Yi → Zi”, li2 = −1 if “Yi ← Zi”, and li2 = 0 else.

• li3 = +1 if “Xi → Zi”, li3 = −1 if “Xi ← Zi”, and li3 = 0 else.

Then, we add the following six elements to our training set:

$$\big(\{(x_{ij}, y_{ij}, z_{ij})\}_{j=1}^n, +l_{i1}\big),\quad \big(\{(y_{ij}, z_{ij}, x_{ij})\}_{j=1}^n, +l_{i2}\big),\quad \big(\{(x_{ij}, z_{ij}, y_{ij})\}_{j=1}^n, +l_{i3}\big),$$
$$\big(\{(y_{ij}, x_{ij}, z_{ij})\}_{j=1}^n, -l_{i1}\big),\quad \big(\{(z_{ij}, y_{ij}, x_{ij})\}_{j=1}^n, -l_{i2}\big),\quad \big(\{(z_{ij}, x_{ij}, y_{ij})\}_{j=1}^n, -l_{i3}\big),$$

for all 1 ≤ i ≤ N. Therefore, our training set will consist of 6N sample sets and their paired labels. At this point, and given any sample {(u_{ij}, v_{ij}, w_{ij})}_{j=1}^n from the training set, we propose to use as feature vectors the concatenation of the m-dimensional empirical kernel mean embeddings (14) of {u_{ij}}_{j=1}^n, {v_{ij}}_{j=1}^n, and {(u_{ij}, v_{ij}, w_{ij})}_{j=1}^n, respectively.

C.2. Test Phase

To start, given n_te test d-dimensional samples S = {(x_{1i}, . . . , x_{di})}_{i=1}^{n_te}, the hyper-parameters of the kernel and the training data synthesis process are transductively chosen, as described in Section 5.1.

In order to estimate the causal graph underlying the test sample set S, we compute three d × d matrices M→, M⊥⊥, and M←. Each of these matrices contains, at coordinates (i, j), the probability of the label “Xi → Xj”, “Xi ⊥⊥ Xj”, or “Xi ← Xj”, respectively, averaged over all possible scalar confounders Xk. Using these matrices, we estimate the underlying causal graph by selecting, for each edge, the type (forward, backward, or no edge) with maximal probability. As a post-processing step, we prune the least-confident edges until the resulting graph is acyclic.
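The edge-selection and pruning procedure can be sketched as follows (a hypothetical implementation of our own: the matrix names, the confidence-based pruning order, and the Kahn's-algorithm acyclicity test are our choices, not the paper's code):

```python
import numpy as np

def is_dag(d, edges):
    # Kahn's algorithm: a directed graph is acyclic iff every node can be
    # removed by repeatedly deleting zero-in-degree nodes.
    indeg = [0] * d
    for (_, j) in edges:
        indeg[j] += 1
    queue = [v for v in range(d) if indeg[v] == 0]
    seen = 0
    while queue:
        v = queue.pop()
        seen += 1
        for (a, b) in edges:
            if a == v:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == d

def estimate_graph(M_fwd, M_indep, M_bwd):
    d = M_fwd.shape[0]
    edges = {}  # (i, j) -> confidence of the edge "X_i -> X_j"
    for i in range(d):
        for j in range(i + 1, d):
            # Pick the most probable of "X_i -> X_j", "X_i <- X_j", "no edge".
            scores = {(i, j): M_fwd[i, j], (j, i): M_bwd[i, j], None: M_indep[i, j]}
            best = max(scores, key=scores.get)
            if best is not None:
                edges[best] = scores[best]
    # Post-processing: prune the least-confident edges until acyclic.
    while not is_dag(d, list(edges)):
        edges.pop(min(edges, key=edges.get))
    return set(edges)
```

On a three-variable example whose pairwise predictions form a cycle, the least-confident edge of the cycle is dropped and a DAG is returned.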