Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach
Camilo Alberto Cárdenas Hurtado
Economist, UNAL
Universidad Nacional de Colombia
Facultad de Ciencias
Departamento de Estadística
Bogotá, D.C.
November 2016



Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach

Camilo Alberto Cárdenas Hurtado
Economist, UNAL

Universidad Nacional de Colombia
Facultad de Ciencias
Departamento de Estadística
Bogotá, D.C.

November 2016


Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach

Camilo Alberto Cárdenas Hurtado
Economist, UNAL

A dissertation submitted for the degree of
Master of Science, Statistics

Advisor
B. Piedad Urdinola, Ph.D.
Ph.D. in Demography, UC Berkeley

Universidad Nacional de Colombia
Facultad de Ciencias
Departamento de Estadística
Bogotá, D.C.

November 2016


Title in English

Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric

Hierarchical Structural Equation Model Approach.

Título en español

Inferencia Causal en Presencia de Unidades Causalmente Conectadas: Una Aproximación

a través de un Modelo Semi-Paramétrico y Jerárquico de Ecuaciones Estructurales.

Abstract: Causal inference has become a dominant research area in both theoretical and empirical statistics. One of the main drawbacks of conventional frameworks is the assumption of no causal interactions among individuals (i.e., independent units). Violation of this assumption often yields biased estimates of the causal effects of an intervention in quantitative social, biomedical and epidemiological research. This document proposes a novel approach for modeling causal connections among units within the Structural Causal Model framework: a Semi-Parametric Hierarchical Structural Equation Model (SPHSEM). Estimation uses Bayesian techniques, and the empirical performance of the proposed model is evaluated through both simulation and applied studies. Results show that the Bayesian SPHSEM recovers nonlinear (causal) relationships between latent variables belonging to different levels and yields unbiased estimates of the (causal) model parameters.

Resumen: La inferencia causal se ha convertido en un área activa de investigación en la estadística teórica y aplicada. Una falencia de las aproximaciones convencionales es el supuesto de ausencia de interacciones causales entre individuos (unidades de estudio independientes). La violación de este supuesto resulta en estimaciones sesgadas de los efectos causales en investigaciones sociales, biomédicas y epidemiológicas. En este documento se propone una nueva manera de modelar dichas conexiones causales bajo el Modelo Estructural de Causalidad: un modelo Semi-Paramétrico y Jerárquico de Ecuaciones Estructurales (SPHSEM). La estimación se hace mediante técnicas Bayesianas, y su capacidad empírica se evalúa a través tanto de un ejercicio de simulación como de una aplicación empírica. Los resultados confirman que el SPHSEM Bayesiano recupera las relaciones causales no lineales que existen entre variables latentes pertenecientes a distintos niveles de agrupamiento, y que las estimaciones de los parámetros causales son insesgadas.

Keywords: Causal inference, independence assumption violation, causally connected units, directed acyclic graphs (DAG), structural equation models, hierarchical linear models, semiparametric models, Bayesian estimation.

Palabras clave: Inferencia causal, violación del supuesto de independencia, dependencia entre observaciones, grafos acíclicos direccionados (DAG), modelos de ecuaciones estructurales (SEM), modelos jerárquicos (HLM), modelos semiparamétricos, estimación Bayesiana.


Acceptance Note

Thesis Work

Approved

“TBD mention”

Jury

Edilberto Cepeda, Ph.D.

Jury

Iván Díaz, Ph.D.

Advisor

B. Piedad Urdinola, Ph.D.

Bogotá, D.C., May 31st, 2017


Dedicated to everyone who truly believed in me throughout this journey.


Acknowledgements

First, I would like to thank my advisor, Prof. B. Piedad Urdinola. Her never-ending patience and support fueled my passion for academics and encouraged me to keep working every day on my thesis, a challenging yet rewarding journey. To her, my deepest gratitude. Second, I am grateful to my jury members, Prof. Iván Díaz and Prof. Edilberto Cepeda, as well as to my professors in the Statistics Department at Universidad Nacional de Colombia, for their guidance, knowledge, time and comments on previous versions of this document. Third, I would like to thank Prof. Nian-Sheng Tang for kindly sharing his superb, unpublished manuscript. His work was key to understanding and setting the basis for the structural equation model presented here.

I am also indebted to my family, friends, and partners, who were always patient and understanding with my temporary absences. Their company, love and support were fundamental throughout these years. To them, thank you. Finally, to Paola: her support was critical during the last few months of this process. I cannot say anything but “gracias infinitas, siempre”.


Contents

Contents I

Introduction III

1. Causality: An introduction 1

1.1 In the search of a causal language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Going deeper into Pearl’s Structural Causal Model (SCM) . . . . . . . . . . . . . . 11

1.3 The relationship between SCM and RCM: Why the former and not the latter? 24

2. Causal Inference Through Parametric, Linear Models 28

2.1 Structural Equation Models (SEMs) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

2.2 Bayesian Estimation of Structural Equation Models . . . . . . . . . . . . . . . . . . . 36

3. Causally Connected Units 38

3.1 Hierarchical Structural Equation Models (HSEM) . . . . . . . . . . . . . . . . . . . . 42

3.2 Bayesian Estimation of HSEM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4. A Semi-Parametric Hierarchical Structural Equation Model (SPHSEM) 48

4.1 The observed random variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.2 The measurement equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

4.3 The structural equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.4 A Note on Bayesian P-splines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.5 Identification Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5. Bayesian Estimation of the SPHSEM 56

5.1 Prior Distributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.2 Posterior Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66


6. Simulations & Application 69

6.1 A Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.1.1 MCMC Simulations and Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.1.2 Bayesian Model Comparison and Goodness-of-fit Statistics . . . . . . . . 71

6.1.3 An Intervention to the Simulated Causal System . . . . . . . . . . . . . . . 74

6.2 Empirical Application of the SPHSEM: Hostility, Leadership, and Task Significance Perceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.2.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

6.2.2 Analysis of Intervention: What if soldiers reported higher Leadership perceptions? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

Conclusions 83

Further Research 84

Some Causal Quantities of Interest 85

Derivation of posterior distributions 88

Results of Simulation Study 94

MCMC Results for the Simulation Example 96

Description of the Empirical Exercise 99

R Codes 102


Introduction

Scientific discovery, in both the Natural and the Social Sciences, is the task of learning from the world through observation. Most Social Science studies handle social phenomena from a descriptive point of view, a task that is by no means easy, but challenges arise when shifting from descriptive questions (what) to causal questions (what if and why). As Social Scientists face what has been called the fundamental problem of causal inference in social research (Holland, 1986), the past few decades have seen a surge of new developments in statistical theory and methodology for causal analysis. These statistical methodologies are framed in what is known in social statistics as Causal Inference.

Several authors have contributed to the causal inference literature from either a theoretical or an empirical approach (Rubin, 1974, 1978, 2006; Angrist et al., 1996; Robins, 1986; Robins et al., 2004; or Pearl (2009b), as the most cited or best-known references), but most of them assume no causal connections or interactions among individuals (i.e., independent units and the Stable Unit Treatment Value Assumption, SUTVA), something that is very uncommon when by observations we mean people. Formal results show that the presence of causally connected individuals yields biased estimates of the causal effects of an intervention (Rosenbaum, 2007; Sekhon, 2008). VanderWeele and An (2013) present a survey of the most recent advances in modeling causal relationships in the presence of causally connected units, but, as the authors themselves recognize, “a formal theory for inferring causation on social networks is arguably still in its infancy” (p. 353), and most papers on causal inference with causally connected units lack the structural framework proposed by Pearl (1988b, 1995, 2009b).

This thesis aims to fill this gap by presenting a Semi-Parametric Hierarchical Structural Equation Model (SPHSEM) that accounts for the presence of non-independent, causally connected units clustered in groups that are organized in a multilevel fashion. We build upon the work on the hierarchical structural equation model by Rabe-Hesketh et al. (2004); Rabe-Hesketh and Skrondal (2012) and Lee and Tang (2006); Lee (2007), among others; and, following Song et al. (2013) and others, expand it by proposing a semi-parametric formulation akin to the theoretical SCM presented in Pearl (2009b). This novel methodology allows quantitative Social Scientists, as well as researchers in Applied Biology or Epidemiology, to assess the causal effects of interventions from observational data sampled from clustered subpopulations. Following Lee (2007); Song and Lee (2012a) and Song et al. (2013), we present a Bayesian estimation algorithm for


the SPHSEM’s parameters.

This document is organized as follows: after this introduction, Chapter 1 presents an introduction to Causality and its multiple theoretical and applied frameworks. Chapter 2 presents how causal inference is achieved through statistical models, in particular Structural Equation Models. Chapter 3 presents how causally connected units are modeled in a multilevel SEM. In Chapter 4 we present the Semi-Parametric Hierarchical Structural Equation Model (SPHSEM), the main contribution of this thesis. Chapter 5 explains the Bayesian estimation procedure, the algorithms and the algebraic derivations of the model. Chapter 6 presents both a simulation study and an applied study that give an empirical idea of the performance of the proposed SPHSEM. Lastly, we conclude and present some future research opportunities around the SPHSEM.


CHAPTER 1

Causality: An introduction

The main objective of scientific endeavors is to establish, understand and even find new causal relationships in the world we know. Nonetheless, causality is commonly a subject framed within controversial discussions. Aside from the mere philosophical motivation of truly understanding the causes and effects related to a given phenomenon, causal knowledge permits the researcher to build a system or model that allows for predicting new outcomes for a variable of interest given an intervention. Assessing the impact of an intervention on a causal system is straightforward under experimental conditions. However, researchers cannot always count on data from experimental designs and, in many cases, randomized experiments cannot be properly conducted, especially in the social sciences. In the latter cases it might be just too expensive or even unethical to perform experiments where people are involved, and therefore social scientists usually resort to observational data to get their research going. This makes causal knowledge hard to obtain, since the associations present in observational data do not necessarily imply direct causal relations. Moreover, from observational data alone we are only able to compute associational quantities, such as correlations, and there might be confounding variables present in the underlying structure of the data that cannot be directly measured and that have to be taken into account when posing causal claims. The reader might already be aware that resorting to observed data (i.e. realizations of random variables) necessarily comes with uncertainty. In light of that, Statistics plays a key role in deriving causal claims from non-experimental observations, that is, a probabilistic approach to causal analysis.

1.1. In the search of a causal language

A probabilistic approach to causality

We can, indeed, understand the world we live in as a set of deterministic causal systems that are unknown and not revealed to the researcher, but that can be somehow inferred from the information obtained from or provided by the environment. However, given the fact that we do not know these causal systems, uncertainty will play a central


role in causal analysis. In simple cases, knowing the causes allows us to predict the consequences without any further effort. For example:

Proposition A: It rains (cause),
Proposition B: The pavement gets wet (effect),

Evidence, logic, or simple common knowledge states that the relationship between propositions A and B goes as A → B. In this particular example uncertainty is not an issue of major concern, and therefore observing A allows for thinking that B will certainly happen as well. Notwithstanding, in more elaborate examples, knowing A will render B more likely to be observed, not absolutely certain (Suppes, 1970). For example:

Proposition A: A patient is given certain medicine X (cause),
Proposition B: The patient recovers from a disease Y (effect),

In this case the relationship between the propositions is still A → B, but we also have that P(B ∣ A) ≥ P(B). As stated above, the occurrence of the cause A increases the probability of occurrence of the consequence B, i.e. P(B ∣ A) > P(B ∣ ¬A) (see Suppes, 1970; Cartwright, 1983, 1989).
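The probability-raising idea can be checked with a short simulation. The sketch below (an illustrative Python example with invented probabilities, not data from any real study) draws patients from a hypothetical data generating process and verifies that P(B ∣ A) > P(B) and P(B ∣ A) > P(B ∣ ¬A):

```python
import random

random.seed(42)

# Hypothetical data generating process for the medicine example: taking
# medicine X (proposition A) raises the chance of recovery (proposition B),
# which can also happen without it. All probabilities are invented.
def draw_patient():
    a = random.random() < 0.5           # patient is given the medicine
    p_recover = 0.8 if a else 0.3       # P(B | A) = 0.8, P(B | not A) = 0.3
    b = random.random() < p_recover
    return a, b

sample = [draw_patient() for _ in range(100_000)]

p_b = sum(b for _, b in sample) / len(sample)
with_a = [b for a, b in sample if a]
without_a = [b for a, b in sample if not a]
p_b_given_a = sum(with_a) / len(with_a)
p_b_given_not_a = sum(without_a) / len(without_a)

print(p_b_given_a > p_b, p_b_given_a > p_b_given_not_a)
```

The empirical frequencies recover the assumed conditional probabilities, so both inequalities hold in the sample.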

In the latter example, the relationship A → B implies an underlying theoretical causal model that is familiar to the researcher, and observational data only allow for verifying the validity of the proposed causal model. However, nothing in the conditional probability P(B ∣ A) alone allows the researcher to elaborate causal claims derived from an external intervention, say A′. Many scientists have proposed different epistemological and empirical frameworks based on conditional probabilities that allow for the elucidation of counterfactuals and/or causal effects of a given treatment variable (A or A′) on an outcome variable of interest (B). However, the problem that most social scientists, biologists, medical researchers, and public policy makers face is what is known as the fundamental problem of causal inference (Holland, 1986): in most cases researchers cannot supply an alternative treatment (A′ ≠ A) to the same subject under the same conditions and/or at the same time at which the original experiment was conducted, and therefore cannot formulate conclusions about the effect of such a treatment on an outcome variable. That is, observational data provide information about P(B ∣ A), but not about P(B ∣ A′).

Given that we cannot derive causal claims from observational data alone, causal inference is the scientific process by which these relationships are inferred and (fully) characterized from observational data, but only after assuming a causal model driving the relationships between random variables. Put another way, as described in Pearl (2016), assume an unknown, real-world, invariant, and true data generating process, M, that generates a set of observed random variables (data), D, and an associated multivariate probability distribution, P, as shown in Figure 1.1.

The target of scientific inquiry in traditional statistical analysis is a probabilistic quantity, Q(P), which summarizes some attribute of D that is of interest to the researcher. Q(P) can be estimated from P and D alone. However, causal analysis is different from statistical analysis in that the former is interested in the effect on the causal system M of an external intervention (treatment) of interest, that is, in what happens when experimental conditions change. This intervention acts as a specific modification to the data-generating


[Figure 1.1 here: a diagram in which the data generating process M produces data D with joint distribution P, from which statistical inference estimates the target quantity Q(P).]

Figure 1.1. Traditional statistical inference paradigm, adapted from Pearl (2016).

model M, giving rise to an unobserved (counterfactual) set of data D′ and a distribution P′. This ‘change’ is known as the causal effect of the intervention, i.e. the changes in the data generating process that generate the hypothetical (unobserved) D′ and P′. Then a causal target parameter Q(P′) is computed, a quantity that summarizes the causal effect of the given intervention (or treatment). The problem is that in observational studies the researcher only has access to D (and therefore P), while D′ (and P′) remain unknown. D or P alone cannot give an answer to the causal quantity of interest. That is why the researcher resorts to a set of (un)testable causal assumptions that allow for estimating Q(P′) from D and P, as in Figure 1.2. With these assumptions at hand, the idea is to mathematically express Q(P′) in terms of both D and P, leaving D′ and P′ out. These assumptions come from the expertise and previous experience of the researcher.

[Figure 1.2 here: a diagram in which an intervention (a modification to the DGP) acts on the data generating process M, giving rise to a counterfactual joint distribution P′ and a causal target quantity Q(P′); the diagram summarizes: (D, P) + Causal Assumptions = Causal Inference.]

Figure 1.2. Causal inference paradigm, adapted from Pearl (2016).
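The distinction between the associational quantity Q(P) and the interventional quantity Q(P′) can be made concrete with a toy simulation. In the sketch below (an illustrative Python example with invented coefficients, not the thesis's model), a confounder x makes the contrast E[Y ∣ T = 1] − E[Y ∣ T = 0] computed from D differ from the contrast obtained by modifying the data generating process itself:

```python
import random

random.seed(0)

# A toy data generating process M with a confounder x that affects both
# treatment t and outcome y. Names and coefficients are illustrative.
def M(intervene_t=None):
    x = random.gauss(0.0, 1.0)                            # confounder
    if intervene_t is None:
        t = 1 if x + random.gauss(0.0, 1.0) > 0 else 0    # observational regime
    else:
        t = intervene_t                                   # intervention: modify the DGP
    y = 2.0 * t + 3.0 * x + random.gauss(0.0, 1.0)        # true causal effect of t is 2
    return t, y

n = 200_000
obs = [M() for _ in range(n)]                             # the observed data D, drawn from P

# Associational quantity from D alone: E[Y | T = 1] - E[Y | T = 0]
y1 = [y for t, y in obs if t == 1]
y0 = [y for t, y in obs if t == 0]
naive = sum(y1) / len(y1) - sum(y0) / len(y0)

# Interventional quantity Q(P'): re-run the modified DGP, as in Figure 1.2
do1 = sum(M(intervene_t=1)[1] for _ in range(n)) / n
do0 = sum(M(intervene_t=0)[1] for _ in range(n)) / n
causal = do1 - do0

print(naive, causal)   # the naive contrast overstates the true effect of 2
```

Here the simulation can re-run M under the intervention; with real observational data only `obs` would be available, which is exactly why causal assumptions are needed to express Q(P′) in terms of D and P.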

Moreover, given P, D consists of a sample of randomly distributed exogenous (treatment) variables, T ∈ D, from which the researcher is to deduce the causal structure that determines the values of the endogenous (outcome) variables, Y ∈ D, from a set of possible structures (including the true one, M), by imposing testable assumptions on the functional relationships (linear or nonlinear) between input and output variables, and


distributional forms for the exogenous variables. Also, confounding and/or baseline variables may be present, X ∈ D, so that spurious relationships between an outcome Y and a treatment T generated by X should be ‘weeded out’ from the true causal effects of T on Y. To do that, the concept of conditional independence is the key to probabilistic causal inference. Based on Dawid (1979) and Pearl and Paz (1987) (to be formally defined in the subsequent sections), given a set of random variables Y, T and X, the former two are conditionally independent given the latter, denoted as (Y ⊥ T ∣ X), if once we know the value X = x, knowing the value obtained by T does not provide any further (causal) information about Y. In words, causal quantities can only be estimated when, in a causal model M, a hypothetical randomized trial is mimicked by conditioning on the right set of variables X.
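As a rough numerical illustration of conditional independence (a Python sketch with invented probabilities), the system below lets X drive both T and Y while T has no effect of its own: marginally T and Y are associated, but within each stratum X = x the association vanishes, i.e. (Y ⊥ T ∣ X):

```python
import random

random.seed(1)

# Illustrative DGP in which (Y ⊥ T | X): x drives both t and y, while t has
# no effect of its own, so the marginal t-y association is spurious.
def unit():
    x = random.random() < 0.5
    t = random.random() < (0.8 if x else 0.2)
    y = random.random() < (0.7 if x else 0.1)
    return x, t, y

data = [unit() for _ in range(200_000)]

def p_y(rows):
    rows = list(rows)
    return sum(y for _, _, y in rows) / len(rows)

# Marginally, knowing t changes beliefs about y (association, not causation)
marginal_gap = abs(p_y(r for r in data if r[1]) - p_y(r for r in data if not r[1]))

# Within each stratum x, t carries no further information about y
gaps = [abs(p_y(r for r in data if r[0] == x and r[1])
            - p_y(r for r in data if r[0] == x and not r[1]))
        for x in (False, True)]

print(marginal_gap, gaps)   # large marginal gap, near-zero conditional gaps
```

Conditioning on X here plays exactly the role described above: within strata of X, the comparison of treated and untreated units mimics a randomized trial.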

Once the causal system M and the links of both P and D to the causal model are specified, the next step in causal inference is to define the target parameter or quantity Q(P′). As described in Petersen and van der Laan (2014), the researcher should translate the scientific inquiry into a formal causal quantity, usually defined as a (set of) parameter(s) of the joint distribution of the counterfactual scenario, P′. Several causal quantities can be of interest to the researcher, such as the Average Causal Effect (ACE), the Average Treatment Effect on the Treated (ATT), the Conditional Average Treatment Effect (CATE), the Population and Sample Average Treatment Effects (PATE and SATE) (see Rubin 1974; Holland 1986, 1988; Imbens 2004, and references therein), the Average Causal Mediation Effect (ACME) (see Baron and Kenny 1986; Imai et al. 2010a,b,c, and references therein), Direct and Indirect Effects (DE and IE) (see Pearl 2001b, 2005, 2009b, chapter 4, and references therein), etc. Further details on some of these quantities are presented in the Appendix. The estimators involve expressions of P′ in terms of P, and can therefore be estimated through a handful of statistical methods. It is up to the researcher to decide which one is the most appropriate tool for his/her scientific and empirical goals.

Following Barringer et al. (2013), in the next subsections we present a brief but concise historical compilation of theoretical and empirical developments in the causal inference literature that allow for the specification of M and the subsequent estimation of causal quantities Q(P′) (either in parametric or non-parametric ways) based on conditional independences between manifest random variables. Each of these approaches comes with identification rules and particular causal assumptions. Some models require stronger assumptions than others, but come with easily implementable statistical estimation methods and interpretability. Yet again, the choice is almost a matter of philosophical debate.

Rubin’s Causal Model, Randomization, and the Potential Outcome Framework

This approach views experiments as the ruling paradigm in causal discovery. It is grounded on the concept of random treatment assignment, a critical condition in experimental designs. Holland (1986) presents a section with a rather epistemological discussion about causality, from which we highlight his interpretation of a causal effect when establishing a bridge between Rubin’s Causal Model (RCM) and Mill’s (1843) ideas about the subject. Holland argues that, once one defines the treatment, T, and the outcome variable of scientific interest that depends on (is a function of) the treatment,


Y(T), causes are the “circumstances” in which instances Y(T) and Y(T′) differ, once the researcher controls for variables that do not differ between experiments (pretreatment variables, not affected by the treatment), X. In this case, the differences in Y are caused by the different treatment regimes T and T′, once X remains constant, as if experimental conditions were met.

Now, assume a sample of size N. In order to assess the causal effect of a treatment, Ti = ti, on a variable of interest, Yi(Ti = ti), for a particular individual i ∈ N (subscript notation), the treatment should have been assigned following random mechanisms that do not obscure the outcomes of an experimental design. That is, results should be comparable to those hypothetically obtained if the control treatment had been applied to the same individual, Yi(Ti = t′i) (under the same experimental conditions). The idea behind random treatment assignment is that, once we control for pretreatment variables xi ≅ xi′, differences between observed values yi and yi′ are only due to differences between ti and ti′, which are assigned by chance.

However, controlled randomized experiments are not always possible and researchers are obliged to resort to non-randomized observational data. The solution is to statistically mimic experimental conditions and randomized treatment assignment. Deep within the roots of this framework lies the well-known motto “No causation without manipulation” (Holland, 1986). The idea behind this assertion is that ideal experimental conditions are required to establish causal effects from treatments that are willfully but randomly assigned to an individual. If these ‘laboratory conditions’ cannot be fully met, even by some statistical procedure (Holland, 2003), then causal claims lack validity.

The role of randomization was first presented by Rubin (1974, 1978), who was largely influenced by the pioneering work of Neyman (1923) on the design of experiments in the agricultural sciences. The main idea of the RCM is, intuitively, to estimate the overall causal effect of a treatment Ti = ti (over a control Ti = ci, the absence of treatment) as the difference between what would have happened if unit i had been exposed to the treatment Ti = ti, Yi(ti), and what would have happened if i had been exposed to Ti = ci, Yi(ci). This definition uses a formal language full of counterfactuals, the mathematical basis of the potential outcome framework (Rubin, 2005).

More specifically, given a set of exogenous (not ‘affected’ by the treatment) variables Xi, define Yi(Xi, ti) as the value of an outcome variable Y measured for a particular unit i, with pretreatment variables Xi, who was randomly assigned to the treatment group (i.e. received Ti = ti), and Yi(Xi, ci) as the value of Y measured for the same unit i given that it was assigned to the control group, Ti = ci. Then, the causal effect of Ti = ti over Ti = ci for i is defined as τi = Yi(Xi, ti) − Yi(Xi, ci). The problem is that, since we are using non-experimental data, only one of these values is actually measured for i. These values, Yi(ti) and Yi(ci), are known as the potential outcomes for unit i, but only one of them is actually observed from the observational study design. In many cases, the other is estimated from the data. Therefore, causal inference within the RCM framework is reduced to a missing data problem.
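The missing-data character of the problem can be sketched as follows (an illustrative Python simulation with synthetic values): both potential outcomes exist for every unit inside the simulation, but the "observed" data retain only the one corresponding to the assigned treatment, and randomized assignment nevertheless recovers the average of τi = Yi(ti) − Yi(ci):

```python
import random

random.seed(2)

# Potential-outcome bookkeeping (synthetic values): the simulation knows both
# Y_i(t) and Y_i(c), but the observed data keep only the one that matches the
# randomly assigned treatment.
n = 10_000
units = []
for _ in range(n):
    y_c = random.gauss(10.0, 2.0)     # potential outcome under control
    y_t = y_c + 1.5                   # potential outcome under treatment: tau_i = 1.5
    t = random.random() < 0.5         # randomized assignment
    units.append((t, y_t, y_c))

# The estimand averages unit-level effects tau_i = Y_i(t) - Y_i(c),
# which no real dataset could compute directly
true_ace = sum(y_t - y_c for _, y_t, y_c in units) / n

# What a researcher can compute from the observed half of the table
obs_t = [y_t for t, y_t, _ in units if t]
obs_c = [y_c for t, _, y_c in units if not t]
estimate = sum(obs_t) / len(obs_t) - sum(obs_c) / len(obs_c)

print(true_ace, estimate)   # randomization makes the observable contrast unbiased
```

Under non-random assignment the same observable contrast would be biased, which motivates the homogenization and matching ideas discussed next.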


The main assumption in the RCM is related to the way missing information is estimated from observed data. The missing potential outcome for a unit i with observed treatment ti is computed using information from other units, i′ ≠ i, with similar pretreatment variables, xi ≅ xi′, but with different treatments, Ti′ = ci′.

Put another way, in order to compare potential outcomes, the researcher is urged to find comparable units, that is, to increase homogeneity in the counterfactual scenario (Sekhon, 2008). Suppose that the researcher observes Yi(ti), but knows, from the random nature of treatment assignment, from previous information, or by some other reasoning, that the way i responds to a hypothetical intervention Ti = ci is about the same way another individual i′ responds to Ti′ = ci′, a unit for which Yi′(ci′) is measured. By some ‘matching’ procedure, the researcher is able to ‘inform’ the potential outcome for i, Yi(ci), with the extra information obtained from i′, and is therefore able to compute the causal quantity of interest. In essence, matching is about finding units in the sample that did not receive the treatment, but that are statistically equivalent (in terms of pretreatment covariates) to those that actually received it.

Unit homogeneity is an important characteristic for claiming causal effects in observational studies; otherwise, estimates of causal quantities would be biased and would yield misleading conclusions. In the RCM, three other assumptions are needed in order to obtain unbiased estimates of causal quantities using observational data from a ‘homogenized’ sample: i) the perfect compliance assumption, that is, individuals who are randomly assigned to receive a treatment do, in fact, take the treatment; ii) the unconfoundedness/ignorability assumption, which states that treatment assignment is (conditionally) independent of the outcomes; and iii) the stable unit treatment value assumption (SUTVA), which assumes no interference between units and that the treatments for all units are comparable (Rubin, 1978), i.e. potential outcomes for one unit do not depend on the treatment assigned to any other unit in the sample. The latter is of special interest in this thesis, since its violation is very common in most social science contexts, where interference between units is quite common.
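The consequence of a SUTVA violation can be sketched numerically. In the hypothetical Python example below (illustrative effect sizes, not taken from the thesis), units interfere in pairs: a simple difference in means recovers only the direct effect and misses the spillover that an intervention applied to all units would produce:

```python
import random

random.seed(5)

# Hypothetical interference structure: units come in pairs and a unit's
# outcome also depends on its partner's treatment (a spillover effect).
direct, spillover = 2.0, 1.5
n_pairs = 50_000

ys_t, ys_c = [], []
for _ in range(n_pairs):
    t1 = 1 if random.random() < 0.5 else 0
    t2 = 1 if random.random() < 0.5 else 0
    y1 = direct * t1 + spillover * t2 + random.gauss(0.0, 1.0)
    y2 = direct * t2 + spillover * t1 + random.gauss(0.0, 1.0)
    for t, y in ((t1, y1), (t2, y2)):
        (ys_t if t else ys_c).append(y)

# Difference in means recovers only the direct effect (2.0) ...
naive = sum(ys_t) / len(ys_t) - sum(ys_c) / len(ys_c)

# ... but the effect of treating everyone versus no one includes the spillover
all_on = sum(direct + spillover + random.gauss(0.0, 1.0) for _ in range(n_pairs)) / n_pairs
all_off = sum(random.gauss(0.0, 1.0) for _ in range(n_pairs)) / n_pairs
policy = all_on - all_off

print(naive, policy)   # naive misses the spillover component of the policy effect
```

Which of the two numbers is "the" causal effect depends on the estimand; under interference the usual unit-level contrast no longer answers population-level intervention questions, which is precisely the motivation for this thesis.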

Unit homogeneity is thus achieved by using matching methods (Rubin, 1977; Rosenbaum and Rubin, 1983; Rubin, 2006) or other related statistical methods, such as regression discontinuity designs. It has also been argued that both the bias and the variance of the causal parameters of interest decrease as the homogeneity in the sample increases (Rosenbaum, 2005). This document does not explain matching methods any further, but we refer the reader to Rosenbaum and Rubin (1983, 1984) and Rubin (1973, 2006) for relevant literature on propensity score matching (PSM), to Cochran and Rubin (1973) and Rubin (1979, 1980) on multivariate matching based on the Mahalanobis distance, and to Sekhon (2009) and others on more advanced matching algorithms, such as Genetic Matching.
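To fix ideas, the matching logic just described can be sketched in a few lines. The following is an illustrative toy, not any of the production algorithms cited above: it simulates confounded data, fits a logistic propensity model by plain gradient ascent, and matches each treated unit to its nearest control on the estimated propensity score. All variable names and numbers are invented for the example.

```python
import math
import random

random.seed(42)

# Simulate confounded data: the covariate x raises both the probability of
# treatment and the outcome, so a naive treated-vs-control comparison is biased.
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]
t = [1 if random.random() < 1 / (1 + math.exp(-0.8 * xi)) else 0 for xi in x]
y = [2.0 * ti + 1.5 * xi + random.gauss(0, 1) for ti, xi in zip(t, x)]  # true ATT = 2.0

# Step 1: fit the propensity model e(x) = P(T = 1 | x) by gradient ascent.
b0, b1 = 0.0, 0.0
for _ in range(500):
    g0 = g1 = 0.0
    for xi, ti in zip(x, t):
        p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
        g0 += ti - p
        g1 += (ti - p) * xi
    b0 += g0 / n
    b1 += g1 / n
ps = [1 / (1 + math.exp(-(b0 + b1 * xi))) for xi in x]

# Step 2: match each treated unit to the nearest control on the propensity score
# (one-to-one, with replacement) and average the within-pair outcome differences.
controls = [i for i in range(n) if t[i] == 0]
att_terms = []
for i in range(n):
    if t[i] == 1:
        j = min(controls, key=lambda k: abs(ps[k] - ps[i]))
        att_terms.append(y[i] - y[j])

att = sum(att_terms) / len(att_terms)
print(round(att, 2))  # close to the true effect of 2.0
```

The point of the sketch is only the mechanics: the propensity score collapses the covariates into a single balancing score, and comparing matched pairs removes (most of) the confounding that the raw treated-control comparison would suffer.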

Angrist’s Instrumental Variables and Heckman’s Control Functions

Instrumental variables (IV) are commonly used by economists and econometricians in causal inference because of their capability to mimic random treatment assignment. This approach is quite similar to the RCM framework, but computation of causal quantities resorts to different estimation methods. The pioneering work of Angrist (1990) and Angrist et al. (1996) was the first to use IV for estimating causal quantities and to stress the relationship between IV and the RCM. The main assumption in this framework is that IVs are random variables Z that affect the assignment of a given treatment T, but do not have direct effects on the outcome variable Y. This assumption (known as the exclusion restriction) is also seen as a weakness of the IV approach to causal inference, since it is quite difficult to assess the validity and exogeneity of the instruments from the unexplained part of the model.

Nonetheless, the IV approach is quite useful when the SUTVA assumption is violated: controlling for an instrument Z that affects the outcome variable Y only through the treatment T (a procedure that mimics a random assignment of the treatment) ensures that treatment assignment is (conditionally) independent of the outcomes, i.e. Y ⊥ T ∣ Z. IVs also allow for estimating causal quantities such as the local average treatment effect (LATE, Imbens and Angrist, 1994), which are more complex in nature and provide some insight into the causal system when RCM assumptions are violated.
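As a toy illustration of the idea (not of any specific estimator in the cited papers), the sketch below simulates a treatment confounded by an unobserved u, together with a valid binary instrument z, and compares the naive OLS slope with the Wald/IV ratio cov(Z, Y)/cov(Z, T). All numbers are invented.

```python
import random

random.seed(1)
n = 50000

# u is an unobserved confounder; z is a valid instrument: it shifts the
# treatment t but affects y only through t (the exclusion restriction).
z = [random.randint(0, 1) for _ in range(n)]
u = [random.gauss(0, 1) for _ in range(n)]
t = [zi + ui + random.gauss(0, 0.5) for zi, ui in zip(z, u)]
y = [2.0 * ti + ui + random.gauss(0, 0.5) for ti, ui in zip(t, u)]  # true effect = 2.0

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

beta_ols = cov(t, y) / cov(t, t)   # biased upward: it picks up the confounder u
beta_iv = cov(z, y) / cov(z, t)    # Wald/IV estimator: close to the true 2.0
print(round(beta_ols, 2), round(beta_iv, 2))
```

The Wald ratio works precisely because z is correlated with t but, by the exclusion restriction, uncorrelated with every determinant of y other than t.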

On the other hand, Heckman’s control functions (Heckman, 2005) serve as an approach to causal inference when the exogeneity between treatment assignment and outcome (strong ignorability) assumption is violated. This happens when, for example, individuals in the sample are more prone to ‘force’ themselves into the treatment (self-selection bias) and, therefore, treatment assignment becomes a non-random process, different from what is assumed in the RCM. Heckman’s control functions are akin to the IV approach in the sense that the researcher needs to model the endogeneity in the assignment mechanism by identifying an unobservable factor whose impact on the outcome variable is restricted to be indirect, only through the matching equation (Barringer et al., 2013). The reader may refer to Heckman and Vytlacil (2007a,b) for deeper insight into causal inference using observational data.

Robins’ “G-methods”

The approach presented by Robins (1986, 1987, 1997) and colleagues is a bridge between the potential outcome framework and the structural causal model proposed by Pearl and others (to be presented below). Robins was concerned with the causal effects of a time-varying treatment on an outcome variable of interest within a framework with (possibly time-varying) concomitant variables/confounders. This is why Robins’ approach to causal inference is very popular among epidemiologists and public health and medical researchers (see, for example, Robins et al., 2004). The key assumptions are similar to those in the RCM. However, given the different nature of the problem and of the causal parameter of interest, matching methods in the PO framework will yield inconsistent and biased estimators of the causal quantity if the study contemplates time-varying confounders that are themselves affected by (past) treatment exposures. We extend the notation in previous subsections. Let i be an individual belonging to a sample of N units. Also, let t denote a follow-up visit, with t ∈ {1, ..., T + 1}, occurring within the lapses τ1, ..., τT+1. Robins (1986, 1987) expanded the static framework in Rubin (1974, 1978) to a longitudinal setting with direct and indirect effects of time-varying treatments (Ai,t) on confounders (Li,t), concomitant variables (Xi,t), and an outcome of interest (Yi), for an individual i at time t. That said, the RCM is a particular case of Robins’ approach in a baseline scenario, with a unique t = t0.

Consider a longitudinal setting in which the N individuals enter the study at an initial time τ0 and receive a baseline treatment Ai,0. The intervention is assumed to be assigned at random. A collection of covariates Li,0 and Xi,0 (confounders and concomitant variables, respectively) are measured. There are t = 1, ..., T + 1 follow-up visits, occurring at times τ1, ..., τT+1, at which measurements are taken of the treatment variable (Ai,t) and of the covariates (Li,t and Xi,t) for each subject i. An outcome variable Yi is measured at the end of the study, i.e. at visit T + 1.

The treatment history for subject i up to visit t is denoted as Āi,t = (Ai,0, ..., Ai,t). Similarly, we denote the confounders’ and concomitants’ histories as L̄i,t and X̄i,t, respectively. For the sake of simplicity, in this subsection we assume that concomitant variables are implicitly included in L̄i,t and are not explicitly expressed. Denote by Āi ≡ Āi,T the treatment history up to the end of the study (and similarly for the confounder and concomitant variables). Assuming either binary or continuous treatments, Āi takes values in the Euclidean space A = A0 × ⋯ × AT, where each At is the set of all possible values of the treatment (common to all subjects) at visit t. Therefore, for every realization āi ∈ A, we define a counterfactual (potential outcome) Y_i^{āi} as the value the outcome variable of interest for subject i would have attained had i been exposed to treatment history āi. It is clear that Robins’ potential outcomes, Y_i^{āi}, are similar to those of Rubin’s in the RCM, Yi(ai), but with a dynamic treatment Ti = āi instead of a static one ti.

As in every causal inference framework, we need untestable assumptions to claim causal results and to identify the effect of a time-varying treatment on the outcome variable of interest. First, we assume no interference between units (akin to SUTVA) and consistency, that is, Yi = Y_i^{Āi}. We also need the conditional exchangeability assumption (i.e. Y^ā ⊥ At ∣ L̄t, Āt−1, ∀ā ∈ A and ∀t ∈ {0, ..., T}, or in words: “treatment is assigned at random given the past”) and the positivity assumption (if P_L̄(l̄) ≠ 0 then P_A(ā ∣ l̄) > 0), as outlined in Robins and Hernán (2009). Given the latter, a formal definition of the causal effect of a sequential treatment exposure on the outcome variable of interest can be formulated.

Definition 1.1.1. (Causal Effect; Daniel et al., 2013): Given 𝒴, the support of Y, the causal effect of an intervention history Ā on Y is the mapping q : A × 𝒴 → ℝ+, where q(ā, y) gives the value of the probability function of Y^ā evaluated at y, P(Y^ā = y ∣ Ā = ā), provided ā ∈ A and y ∈ 𝒴.

Definition 1.1.1 is akin to the one in Pearl’s causal model, Definition 1.2.9, to be presented in the following subsections. In essence, the causal effect of a time-varying intervention is the conditional probability distribution defined over the outcome variable Y when Ā = ā. This conditional probability distribution can be graphically represented by an event tree that Robins (1986, 1987) called Causally Interpreted Structured Tree Graphs (CISTGs). A natural extension, the random CISTGs (RCISTGs), is an event tree in which the treatment received at visit t is randomly allocated given the past values of the treatment and other covariates (following the conditional exchangeability assumption). Given this graphical representation, the close relationship with the structural causal model (SCM) of Pearl and others (to be presented in the next subsection) has been extensively argued by both authors (see, for example, Robins, 1995; Greenland et al., 1999; Pearl, 2001a and Pearl, 2009b, sections 3.6.4 and 11.3.7).

Robins (1986) proved that, within the RCISTGs framework, if the exchangeability, consistency and positivity assumptions hold, the causal effect P(Y^ā = y ∣ Ā = ā) of a time-varying intervention ā on Y can be further expressed as

P(Y^ā = y ∣ Ā = ā) = ∑_{l̄_T} P(y ∣ l̄_T, ā_T) ∏_{t=1}^{T} P(l_t ∣ l̄_{t−1}, ā_{t−1})    (1.1)

Equation (1.1) is known as the “G-computation formula”. In essence, this equation expresses the counterfactual distribution P′ in terms of P and D, the non-experimental data observed by the researcher. With the causal effect defined and identified, a more ‘parsimonious’ and concise causal quantity of interest, Q(P′), can be estimated. Very often this quantity is the average causal effect of a particular treatment history ā on Y, i.e. E(Y^ā). Note, however, that a conventional potential outcome approach would fail to yield causally interpretable results. Assume a binary treatment at every time t. The main drawback is that we would have to deal with a large number of counterfactual treatment regimes (i.e. 2^{T+1} values of ā), and hence with the same number of potential outcomes {E(Y^ā) : ā ∈ A}.
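To see formula (1.1) at work, here is a minimal numeric sketch for one time-varying confounder L1 and two treatment occasions (A0, A1). The conditional probability tables are invented for the illustration.

```python
# Two-period toy: baseline treatment a0, time-varying confounder l1
# (itself affected by a0), second treatment a1, end-of-study outcome y.
# All conditional probabilities below are invented for the illustration.

p_l1 = {0: 0.8, 1: 0.3}            # P(L1 = 1 | A0 = a0)

def p_y(l1, a0, a1):               # P(Y = 1 | L1, A0, A1)
    # In this toy, a0 affects Y only indirectly, through L1.
    return 0.2 + 0.3 * a1 + 0.4 * l1

def g_formula(a0, a1):
    """P(Y^a = 1) = sum over l1 of P(Y = 1 | l1, a0, a1) * P(l1 | a0)."""
    return sum(p_y(l1, a0, a1) * (p_l1[a0] if l1 else 1 - p_l1[a0])
               for l1 in (0, 1))

always_treat = g_formula(1, 1)     # E[Y^(1,1)]
never_treat = g_formula(0, 0)      # E[Y^(0,0)]
print(round(always_treat, 2), round(never_treat, 2))   # 0.62 0.52
```

Note that naively conditioning on A0 and A1 would implicitly condition on L1’s downstream behaviour as well; the g-formula instead standardizes over the distribution of L1 induced by the earlier treatment, which is exactly the sum-and-product structure of (1.1).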

Robins and coauthors came up with statistical methods to estimate this target causal parameter for a time-varying intervention: the G-computation algorithm (Robins, 1986, 1987); a class of semi-parametric estimators of Structural Nested Models (SNM) (Robins et al., 1992; Robins, 1992, 1993); and double-robust estimation, including inverse probability weighted (IPW) estimation, of Marginal Structural Models (MSM) (Robins, 1998; Robins et al., 2000). We do not extend much on these, but an accessible introduction to these methods is given in Robins and Hernán (2009), Daniel et al. (2013) and Vansteelandt and Joffe (2014).
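As a flavour of the IPW idea behind marginal structural models (a single-time-point toy, not the longitudinal estimators of the cited papers), the following sketch reweights each unit by the inverse of its treatment probability, which is taken as known here for simplicity; all numbers are simulated.

```python
import random

random.seed(7)
n = 50000

# A binary confounder l drives both treatment assignment and the outcome;
# the propensity e(l) = P(A = 1 | l) is known in this toy.
data = []
for _ in range(n):
    l = random.randint(0, 1)
    e = 0.8 if l else 0.2
    a = 1 if random.random() < e else 0
    y = 1.0 * a + 2.0 * l + random.gauss(0, 1)   # true causal effect = 1.0
    data.append((l, a, y, e))

# The naive comparison is confounded: treated units tend to have l = 1.
naive = (sum(y for l, a, y, e in data if a) / sum(a for l, a, y, e in data)
         - sum(y for l, a, y, e in data if not a) / sum(1 - a for l, a, y, e in data))

# Horvitz-Thompson inverse-probability weighting recovers E[Y^1] - E[Y^0]:
# each unit stands in for the 1/e(l) units with its covariates.
ipw = (sum(a * y / e for l, a, y, e in data) / n
       - sum((1 - a) * y / (1 - e) for l, a, y, e in data) / n)
print(round(naive, 2), round(ipw, 2))
```

The reweighting creates a pseudo-population in which treatment no longer depends on l, which is why the weighted contrast approximates the causal effect while the naive contrast does not.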

Pearl’s Structural Causal Model (SCM)

This approach was a major breakthrough in the understanding of causal systems, as it views the system as a set of ‘mechanisms’ (unknown to the researcher) that can be individually studied, modeled, and changed. Pearl’s Structural Causal Model (SCM) went even further than the RCM, aiming at the identification of the causal model M from observed data D and its associated distribution P. Causal effects are defined as changes in M due to an external intervention on the functional mechanism of a variable T, expressed in terms of a logical operator, do(T = t). By applying such a manipulation to the system, a modified sub-model M′ is generated, from which hypothetical (non-observable) D′ and P′ are expected to be obtained. The main objective, yet again, is to translate those counterfactuals in terms of observed, non-experimental data.

Pearl’s SCM needs more untestable assumptions than other approaches in order to provide causal interpretability to the estimation results. In this case, the assumptions are mostly related to the structure imposed on M. That is, based on his or her expertise or on previous results, the researcher decides (or infers from data) how the random variables in D are causally interrelated. The latter means defining conditional (in)dependencies between the random variables in D. Some other assumptions are similar to those already explained for the RCM and Robins’ approach.

The SCM builds on the theoretical bases set by Wright (1920, 1921, 1934) in what is known as the path analysis method. In his work, Wright translated (conditional) correlations into path coefficients (standardized regression coefficients) with a causal interpretation. The key to understanding path coefficients as carriers of causal information is to realize that there are usually prior beliefs, based on the researcher’s previous experience or on experimental grounds, that allow for claiming that some factors (variables) are direct causes of variations in others, or that pairs of variables are related if both are effects of a common cause (Wright, 1921, page 559). Path coefficients are ultimately interpreted as the fraction of the standard deviation of the outcome (dependent) variable for which the factor (exogenous variable) is directly responsible, keeping all other factors constant. However, Wright himself argued that path coefficients (derived from partial correlations) have to be interpreted carefully when it comes to deriving causal claims, because of simultaneous causation or an improper definition of the diagram that represents the relationships between factors (Wright, 1934). In addition to path coefficients, another contribution by Wright was the path diagram, a graphical representation of the causal relationships between factors that shows direct and indirect effects.

Despite path analysis being formulated in the early 1920s and formalized in the 1930s, it was not until the 1960s that applied social scientists began using it to study the (reciprocal) causal relationships between socioeconomic variables (Duncan et al., 1968). Path analysis was complemented with features from simultaneous equation models (Haavelmo, 1943) and from factor analysis (Spearman, 1904). Later, in the 1970s, Jöreskog (1973), jointly with contributions from Keesling (1972) and Wiley (1973), developed an analytical framework known as Structural Equation Models (SEM), or LISREL (LInear Structural RELationships, Jöreskog and Sörbom, 1996). SEMs are a statistical tool used in the social sciences to assess causal problems and to estimate direct and indirect effects of treatments on a multivariate set of outcome variables (see, e.g. Goldberger, 1972, 1973; Goldberger and Duncan, 1973; Duncan, 1975).

The causality-oriented focus was not fully acknowledged until the publication of Bollen’s (1989) Structural Equations with Latent Variables. Bollen emphasized the importance of previous/expert knowledge in model (path) building. However, SEMs themselves are not a formal causal framework, but a statistical method that lacked a causal theory. It was the work of Pearl (1995, 2000, 2009b) and Spirtes et al. (1991, 2000), who built upon the advances in probabilistic graphical models and causal analysis in the context of artificial reasoning (Pearl and Paz, 1987; Pearl, 1986, 1988b; Verma and Pearl, 1988), that set the foundations of what we currently know as the Structural Causal Model (SCM).

The work of Pearl and coauthors allowed for the formulation of causal effects of interventions (as already defined above), following graphical criteria backed by solid formal probabilistic foundations. The definition of causal effects in the SCM is akin to that presented in (1.1), as shall be explained in the following subsection. Causal quantities can be estimated after defining the causal effect formula. Identification of the model is achieved following a graphical representation of the causal system (usually a Directed Acyclic Graph, DAG). The natural statistical method for estimating such target causal parameters is the parametric SEM, as it defines a set of functional causal relationships that represent the modular way nature works. Interventions (treatments) are modeled through the do-operator. An introductory, yet detailed, explanation of the SCM can be found in Pearl (2009a, 2010a,b) and Spirtes (2010).

SEM’s parametric approach has been found useful and accurate for estimating causal quantities of interest (see Chapter 2). More recently, fully nonparametric methods have been developed, such as the Targeted Maximum Likelihood Estimation (TMLE) approach (van der Laan and Rubin, 2006; van der Laan, 2010a,b; van der Laan and Rose, 2011). As a truly nonparametric method, TMLE is the perfect match for the SCM, as Pearl himself acknowledges. However, a detailed explanation of TMLE is out of the scope of this document.

1.2. Going deeper into Pearl’s Structural Causal Model (SCM)

It is important to distinguish between standard statistical analysis and causal analysis. On the one hand, the former aims to assess and estimate parameters of a probability density from samples drawn from that distribution, such as regression parameters and/or (conditional) correlations, and to use them to claim associational relationships between variables of interest. These inferences are valid under the assumption of stable experimental conditions, that is, no conditions are changed by means of the introduction of treatments or external interventions to the causal system. On the other hand, causal analysis is concerned not only with the associational relationships under static conditions, but also with what happens when external interventions (treatments) are introduced into the system (Pearl, 2009a,b).

More explicitly, a joint probability distribution cannot tell us anything about how that distribution changes in the presence of an external change. Instead, this information must be provided by causal assumptions, which identify relationships that are not affected by dynamic external conditions. These assumptions cannot, in principle, be tested from observational data; they are theoretically or judgmentally based claims related to the way the researcher understands the world. By testable, Pearl refers to the ideal situation in which direct manipulation by the researcher is possible, e.g. in controlled experimental settings, something that is not likely to happen in the Social Sciences. Therefore, it is clear that any scientific claim arguing causal relationships between variables needs causal assumptions that cannot be inferred from, or even defined in terms of, standard probabilistic language alone.

In the SCM, untestable assumptions are typically directed relationships between pairs of variables X, Y belonging to a set of measurements D from individuals (units) i ∈ {1, ..., N} in a non-experimental setting. For consistency of notation with the SCM literature, define V ≡ D from here onwards. X is assumed to be a direct cause of Y if there exists a causal path, defined as a collection of edges, from X to Y. These assumptions are clearly not verifiable from the joint probability distribution defined over V. However, these directed links are, in fact, related to conditional probabilities in a way such that Y is independent of X (Y ⊥ X) if X and Y are not assumed to share a causal connection (X ↛ Y), as we shall see promptly.

The process by which the researcher aims to define a logically well-defined set of directed links consistent with the joint distribution implicit in the observed data is known as causal discovery. The set of variables, together with the set of causal (directed) links consistent with the joint probability distribution governing the DGP, has a visual representation, called a graph, that allows for straightforward interpretation of the causal system itself and of the causal relationships therein assumed. We will not extend much on the explanation of causal discovery, since it is a wide topic that deserves rigorous attention by itself but is beyond the scope of this document. With the mathematical relationships between graphs and probabilistic dependencies being formally treated in the 1980s by computer scientists, mathematicians and philosophers, important developments were achieved on how causal relationships can be inferred from observational data (after making certain assumptions about the underlying DGP).

Moreover, computational improvements fostered the development of complex algorithms designed to find patterns of conditional independencies from the true DGP that were also coherent with partial sections of the assumed causal model (see Pearl, 2009a, Chapter 2, for an introduction to causal discovery). Algorithms such as the Inductive Causation and Inductive Causation with latent variables algorithms (IC and IC* algorithms, Verma and Pearl, 1990; Verma, 1993), the PC algorithm (Spirtes and Glymour, 1991), and the FCI algorithm (Fast Causal Inference algorithm, Spirtes et al., 2000; Spirtes, 2001), among others explained in greater detail in Spirtes et al. (2000, 2010), are designed to suggest candidate causal models that i) follow the assumptions encoded in the causal paths defined by the researcher, ii) are capable of generating the true DGP, and iii) satisfy the model minimality, Markov and stability conditions/assumptions defined in Pearl (2009a, Chapter 2), requirements necessary to efficiently distinguish causal structures from non-experimental data.

In this document we assume that the researcher has already ‘discovered’ a set of plausible equivalent causal structures that are consistent with the joint probability distribution governing the observed data.

From Conditional Independence to Probabilistic Graphical Models (PGM) to Directed Acyclic Graphs (DAG)

As we have already emphasised, causes only render their consequences more likely, not absolutely certain (Pearl, 2009a). This allows us to understand causal relationships between variables in terms of conditional probabilities and, thus, conditional independence. Some definitions and properties follow:

Definition 1.2.1. (Conditional Independence; based on Dawid, 1979): Let V be a set of discrete or continuous random variables, let P(⋅) be a multivariate probability density defined over V, and let X, Y and Z be any three disjoint subsets of V. X and Y are said to be conditionally independent given Z = z if P(x ∣ y, z) = P(x ∣ z) whenever P(y, z) > 0, and we write X ⊥ Y ∣ Z.

In words, “learning the value of Y does not provide additional information about X, once we know Z” (Pearl, 2009a, page 11). To denote the conditional independence of X and Y given Z we write (X ⊥ Y ∣ Z), or more specifically (X ⊥ Y ∣ Z)_P to explicitly denote conditional independence under a certain probability measure P. Some useful properties satisfied by the conditional independence relation are (Dawid, 1979):

Symmetry: If (X ⊥ Y ∣ Z), then (Y ⊥ X ∣ Z),

Decomposition: If (X ⊥ YW ∣ Z), then (X ⊥ Y ∣ Z),

Weak union: If (X ⊥ YW ∣ Z), then (X ⊥ Y ∣ ZW),

Contraction: If (X ⊥ Y ∣ Z) and (X ⊥ W ∣ YZ), then (X ⊥ YW ∣ Z), and

Intersection: If (X ⊥ W ∣ ZY) and (X ⊥ Y ∣ ZW), then (X ⊥ YW ∣ Z).

These properties were independently presented in Pearl and Paz (1987) and in Geiger et al. (1990); Geiger and Pearl (1990, 1993) under the name of graphoid axioms. This set of axioms is important and serves as a basis for the definition of informational relevance, especially in computation and AI frameworks (Pearl, 1988b). More specifically, in the context of graphs, these properties assure that (X ⊥ Y ∣ Z), read as ‘X is conditionally independent of Y given Z’, is translated from a probabilistic language into that of graphs as ‘all paths from X to Y are intercepted by Z’.
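The axioms can be checked numerically on any discrete distribution. The sketch below builds a joint distribution over four binary variables that satisfies the premise X ⊥ YW ∣ Z by construction (all probability tables are invented), and then verifies the decomposition and weak union consequences by exact enumeration with rational arithmetic.

```python
from fractions import Fraction as F
from itertools import product

# A joint distribution over binary (x, y, w, z) satisfying X ⊥ {Y, W} | Z
# by construction: P = P(z) P(x|z) P(y,w|z). Tables are invented.
pz = {0: F(2, 5), 1: F(3, 5)}
px_z = {0: F(1, 4), 1: F(2, 3)}                 # P(X = 1 | z)
pyw_z = {0: {(0, 0): F(1, 2), (0, 1): F(1, 6), (1, 0): F(1, 6), (1, 1): F(1, 6)},
         1: {(0, 0): F(1, 4), (0, 1): F(1, 4), (1, 0): F(1, 4), (1, 1): F(1, 4)}}

P = {}
for x, y, w, z in product((0, 1), repeat=4):
    px = px_z[z] if x else 1 - px_z[z]
    P[(x, y, w, z)] = pz[z] * px * pyw_z[z][(y, w)]

NAMES = ('x', 'y', 'w', 'z')

def marg(assign):
    """Probability of a partial assignment, e.g. {'x': 1, 'z': 0}."""
    return sum(p for cell, p in P.items()
               if all(cell[NAMES.index(n)] == v for n, v in assign.items()))

def indep(a, b, c):
    """Check A ⊥ B | C exactly: P(a,b,c) P(c) = P(a,c) P(b,c) on every cell."""
    for vals in product((0, 1), repeat=len(a) + len(b) + len(c)):
        va = dict(zip(a + b + c, vals))
        if marg({n: va[n] for n in c}) == 0:
            continue
        lhs = marg(va) * marg({n: va[n] for n in c})
        rhs = marg({n: va[n] for n in a + c}) * marg({n: va[n] for n in b + c})
        if lhs != rhs:
            return False
    return True

print(indep(['x'], ['y', 'w'], ['z']),   # premise: X ⊥ YW | Z
      indep(['x'], ['y'], ['z']),        # decomposition
      indep(['x'], ['y'], ['z', 'w']))   # weak union
```

Using `Fraction` keeps the comparisons exact, so the consequences hold with equality rather than up to floating-point error.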

Graphs, or more specifically Probabilistic Graphical Models (PGM), are representations of the multivariate probability density defined over the measurable space generated by a set of random variables. PGMs summarize the conditional (in)dependence relationships between random variables given a previously defined (causal) graph structure. An introductory approach to PGMs can be found in Koller and Friedman (2009) and Pearl (1988b). Some advantages of PGMs over conventional multivariate probability functions are i) their ease of interpretation, ii) their capability of compactly representing high-dimensional multivariate probability functions (in the sense that for a set of binary random variables of cardinality M, we would otherwise have to deal with 2^M probabilities), and iii) their adaptability to graph theory. The link between PGMs and multivariate density functions is possible due to the Hammersley-Clifford theorem (Hammersley and Clifford, 1971), which ultimately relates Markov properties and the factorization of conditional probabilities over graphs. More formally,

Definition 1.2.2. (Graph; Koller and Friedman, 2009): A graph G is a mathematical object consisting of a set V of random variables, each of them called a node, and a set E of edges (links) that connect pairs of nodes. The notation used throughout this document will be G = ⟨V,E⟩ when referring to a graph G with nodes V and edges E. The (absence of) edges between two variables Vi, Vj ∈ V represents conditional (in)dependencies.

For example, given the set of variables V = {V1, V2, V3, V4, V5} and the set of edges E = {(1,2), (1,3), (3,5), (4,5)}, the graph G = ⟨V,E⟩ is represented as in Figure 1.3.

The most basic type of graph, upon which we build, is known as a Markov Network (MN) or Markov Random Field (MRF), introduced by Besag (1974).


[Figure: an undirected graph on nodes V1, ..., V5 with edges V1–V2, V1–V3, V3–V5 and V4–V5.]

Figure 1.3. G = ⟨V,E⟩, an undirected Markov Network.

MRFs are undirected graphs, that is, they do not carry causal assumptions, and links merely represent symmetric probabilistic dependencies between nodes (random variables). This type of graph satisfies the Markov property, that is, the conditional probability for a given node is ‘memoryless’ (gains no new information) with respect to the variables to which it is not connected. More formally, given the undirected graph G = ⟨V,E⟩, the set of variables X = {Xv : v ∈ V} forms a MRF with respect to the graph G if the following (non-equivalent) Markov properties are satisfied:

Pairwise Markov property: Xu ⊥ Xv ∣ X_{V∖{u,v}} if (u, v) ∉ E; i.e. any two non-adjacent nodes are conditionally independent given all the other variables in V.

Local Markov property: Xv ⊥ X_{V∖NG[v]} ∣ X_{NG(v)}, where NG(v) is defined as the neighbourhood set1 of node Xv in the graph G, and NG[v] is defined as the closed neighbourhood set2 of node Xv; i.e. a variable is conditionally independent of all other variables given its neighbours, and,

Global Markov property: X_A ⊥ X_B ∣ X_C. This property implies that any two sets of nodes A, B ⊂ V are conditionally independent given a separating set C ⊂ V, i.e. C is a set of nodes such that every path from a node in A to a node in B goes through C.

To make the Markov properties clearer, we present an example using the graph G of Figure 1.3. We have that V4 ⊥ V3 ∣ {V1, V2, V5} (pairwise Markov property), V1 ⊥ {V4, V5} ∣ {V2, V3} (local Markov property), and, given X = {V1, V2}, Y = {V4}, and Z = {V3, V5}, X ⊥ Y ∣ Z (global Markov property).
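The three statements above can be verified mechanically: undirected separation is just graph reachability after deleting the conditioning set. A small self-contained sketch, using the graph of Figure 1.3:

```python
from collections import deque

# The undirected Markov network of Figure 1.3.
edges = {(1, 2), (1, 3), (3, 5), (4, 5)}
adj = {v: set() for v in (1, 2, 3, 4, 5)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def separated(A, B, C):
    """True iff every path from A to B passes through the separating set C."""
    seen, queue = set(A), deque(a for a in A if a not in C)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in C or v in seen:
                continue
            if v in B:
                return False          # found a path that avoids C
            seen.add(v)
            queue.append(v)
    return True

print(separated({4}, {3}, {1, 2, 5}),   # pairwise: V4 ⊥ V3 | V1,V2,V5
      separated({1}, {4, 5}, {2, 3}),   # local: V1 ⊥ {V4,V5} | V2,V3
      separated({1, 2}, {4}, {3, 5}),   # global: {V1,V2} ⊥ {V4} | {V3,V5}
      separated({1}, {5}, {2}))         # a non-separated pair, for contrast
```

The first three calls return True and the last one returns False, matching the reading of the graph by eye.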

However, the type of graphs that we are interested in are known as Bayesian Networks (BN) or Directed Graphs (DG), as presented in the context of causal inference and causal discovery by Pearl (1988b) and, from a mathematical point of view, by Thulasiraman and Swamy (1992) and Lauritzen (1996). More specifically, we are interested in those graphs that display an acyclic behaviour (Directed Acyclic Graphs, DAG). The term ‘directed’ comes from the assumption of directional (causal) associations between random variables. That is, the flow of information goes from one node to another, but not the other way around (except in graphs with bidirected links, which shall not be considered in this document). A directed edge (u, v) ∈ E in a graph G = ⟨V,E⟩ connects a pair of nodes in V such that the former (known as the parent) directly influences, or induces a dependency relationship on, the latter (known as the child); it is represented as Xu → Xv.

1The neighbourhood set of a node Xv in a graph G is the induced subgraph of G consisting of all nodes adjacent to Xv, i.e. the induced subgraph G[S](v) = ⟨S,E∗⟩, with S := {Xu ∈ V : u ≠ v, (u, v) ∈ E}, S ⊂ V, and E∗ := {(a, b) ∈ E : Xa, Xb ∈ S} ∪ {(a, v) ∈ E : Xa ∈ S}.

2The closed neighbourhood set of a node Xv is defined as NG[v] := {Xv} ∪ NG(v).

We recall that DAGs also satisfy the Markov conditions described above, but thepresence of directed paths requires additional definitions. We begin by defining the set ofnodes that renders Markov properties satisfied in a given DAG G.

Definition 1.2.3. (Markovian Parents; Pearl, 2009b): Let V be a set of random variables and P a joint probability measure defined over the measurable space generated by V. A set of variables PAj ⊂ V is known as the set of Markovian parents of Vj if PAj is a minimal set of predecessors of Vj that, once conditioned on, renders Vj independent of all its other predecessors. In other words, PAj is any subset of V that satisfies P(vj ∣ paj) = P(vj ∣ v1, ..., vj−1), where v1, ..., vj−1 are the predecessors of Vj in a given ordering, and no proper subset of PAj satisfies the latter, for every Vj ∈ V.

It can be shown that the set of Markovian parents PAj is unique whenever P belongs to the space M of all positive measures, P ∈ M (Pearl, 1988a). Moreover, making use of the chain rule of probabilities, we define the Markov property for a given DAG G.

Definition 1.2.4. (Markov Property for DAGs; Lauritzen, 1996): Let G be a DAG. We say that a probability density P ∈ M obeys the Markov properties of G if

P(x1, x2, ..., xn) = ∏_{i=1}^{n} P(xi ∣ x1, ..., xi−1) = ∏_{i=1}^{n} P(xi ∣ pai).

The set of all positive densities obeying the Markov properties of G is denoted as M(G).

Whenever we define a probability function P ∈ M(G) that admits a factorization like the one in Definition 1.2.4 relative to the causal structure in a graph G, we say that G represents P, or that G and P are compatible, or that P is Markov relative to G. This characteristic is known as Markov compatibility. Assuring compatibility between probability measures and DAGs is important in statistical modelling mainly because Markov compatibility is a necessary and sufficient condition for a DAG G to explain observed empirical (non-experimental) data represented by P, that is, to be a stochastic causal model capable of generating P (Pearl, 1988a, 2009b).
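A quick numeric sanity check of Definition 1.2.4, on a three-node chain with invented conditional tables: the product of the local conditionals is a proper joint density, and it exhibits the conditional independence X1 ⊥ X3 ∣ X2 that the chain implies.

```python
from fractions import Fraction as F
from itertools import product

# A density Markov relative to the chain DAG G: X1 -> X2 -> X3,
# built from invented local tables P(x1), P(x2|x1), P(x3|x2).
p1 = F(1, 3)                         # P(X1 = 1)
p2 = {0: F(1, 5), 1: F(3, 4)}        # P(X2 = 1 | x1)
p3 = {0: F(1, 2), 1: F(1, 10)}       # P(X3 = 1 | x2)

P = {}
for x1, x2, x3 in product((0, 1), repeat=3):
    f1 = p1 if x1 else 1 - p1
    f2 = p2[x1] if x2 else 1 - p2[x1]
    f3 = p3[x2] if x3 else 1 - p3[x2]
    P[(x1, x2, x3)] = f1 * f2 * f3    # P(x1) P(x2|x1) P(x3|x2)

# The factorization sums to one, so it is a proper joint density ...
total = sum(P.values())

# ... and it implies X1 ⊥ X3 | X2: P(x3 | x1, x2) does not depend on x1.
def cond_x3(x1, x2):
    return P[(x1, x2, 1)] / (P[(x1, x2, 0)] + P[(x1, x2, 1)])

ci_holds = all(cond_x3(0, x2) == cond_x3(1, x2) for x2 in (0, 1))
print(total, ci_holds)
```

With exact rationals, P(x3 ∣ x1, x2) reduces algebraically to P(x3 ∣ x2), so the check holds with equality, illustrating how the factorization encodes the graph's conditional independencies.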

The set of conditional independencies defined by the generating probability measure P and implied by a DAG G that is Markov compatible with P can also be read off following a graphical criterion. This criterion is called d-separation, and it allows for a clearer and more straightforward interpretation of the relationships between random variables. It was introduced in Pearl (1986, 1988b) and treated extensively throughout the refinement of the SCM. This concept shall play a major role in causal inference in this document. It is formally defined as follows.

Definition 1.2.5. (d-Separation; Pearl, 2009b): A path p (a sequence of consecutive edges) is said to be d-separated (blocked) by a set of nodes Z if, and only if,


CHAPTER 1. CAUSALITY: AN INTRODUCTION 16

1. p contains a chain i→m→ j, or a fork i←m→ j, such that the middle node m ∈ Z.

2. p contains an inverted fork (collider) i → m ← j such that m ∉ Z and no descendant of m is in Z.

A set Z is said to d-separate X from Y if and only if Z blocks every path from a node inX to a node in Y .

The intuition behind d-separation is easily understood by giving a causal meaning to each directed edge in the DAG. Condition 1 in Definition 1.2.5 states that the two extreme random variables become conditionally independent (i.e. the path is blocked) once we know the value of the middle node (i.e. condition on it). That is, conditioning on m blocks the flow of causal information from i to j (and vice versa) along path p. Condition 2, representing two causes with a common effect, becomes clearer by noting that knowing the value of m renders i and j conditionally dependent, because confirming one cause reduces the probability of the other.
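The definition can be turned into a mechanical test. The following sketch, a hand-rolled illustration (all function names are ours, not from any library), checks d-separation by brute force on small DAGs: it enumerates every path between two nodes in the undirected skeleton and applies conditions 1 and 2 of Definition 1.2.5 to each path.

```python
# Brute-force d-separation check; suitable only for small DAGs.
def descendants(node, edges):
    """All nodes reachable from `node` along directed edges."""
    seen, stack = set(), [node]
    while stack:
        cur = stack.pop()
        for a, b in edges:
            if a == cur and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def undirected_paths(x, y, edges):
    """All simple paths from x to y in the undirected skeleton of the DAG."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    out, stack = [], [[x]]
    while stack:
        path = stack.pop()
        if path[-1] == y:
            out.append(path)
            continue
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                stack.append(path + [n])
    return out

def blocked(path, z, edges):
    """Is this path blocked by Z? (conditions 1 and 2 of Definition 1.2.5)"""
    for a, m, b in zip(path, path[1:], path[2:]):
        collider = (a, m) in edges and (b, m) in edges
        if collider:
            if m not in z and not (descendants(m, edges) & z):
                return True   # collider outside Z, with no descendant in Z
        elif m in z:
            return True       # chain or fork whose middle node is in Z
    return False

def d_separated(x, y, z, edges):
    return all(blocked(p, set(z), edges) for p in undirected_paths(x, y, edges))

# The graph of Figure 1.4: U -> X, X -> Z, Z -> Y, U -> Y.
E = {("U", "X"), ("X", "Z"), ("Z", "Y"), ("U", "Y")}
print(d_separated("X", "Y", {"Z", "U"}, E))  # True: both paths are blocked
print(d_separated("X", "Y", {"Z"}, E))       # False: X <- U -> Y stays open
```

On this graph, {Z, U} d-separates X from Y, while {Z} alone leaves the fork X ← U → Y open, matching the reading of conditions 1 and 2 above.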

The connection between d-separation and conditional independence is establishedthrough a theorem in Verma and Pearl (1988) and Geiger et al. (1990):

Theorem 1.2.1. (Probabilistic implications of d-Separation; Pearl, 2009b): If sets X and Y are d-separated by Z in a DAG G, then X is independent of Y conditional on Z in every distribution compatible with G. Conversely, if X and Y are not d-separated by Z in a DAG G, then X and Y are dependent conditional on Z in at least one distribution compatible with G. This can be expressed more succinctly as follows. For any three disjoint subsets of nodes (X, Y, Z) in a DAG G, and for all probability functions P ∈ M, we have:

1. If (X ⊥ Y ∣ Z)G, then (X ⊥ Y ∣ Z)P holds whenever G and P are compatible;

2. If (X ⊥ Y ∣ Z)P holds in all distributions compatible with G, then it follows that (X ⊥ Y ∣ Z)G;

where the graphical notion of d-separation, (X ⊥ Y ∣ Z)G, is distinguished from the probabilistic notion of conditional independence, (X ⊥ Y ∣ Z)P.

Theorem 1.2.1 sets the logical foundations for building a ‘mathematical bridge’between the language of statistics and probabilities, and that of graphical models,particularly DAGs, which have a natural capacity for carrying causal assumptions andinformation. This ‘bridge’ shall be the backbone throughout the development of the SCM.

DAGs themselves are not the causal system. Nonetheless, there are two main advantages of approaching causal knowledge through DAGs: i) the model implications (causal hypotheses) become more meaningful, accessible, and reliable; and ii) DAGs are capable of dealing with external interventions of interest. The ability to manipulate the state of a node X in a DAG resembles an intervention on X in a non-experimental setting, which is the goal of measuring causal effects.


DAGs and Interventions: Towards causal inference

As emphasized at the beginning of this subsection, causal analysis is concerned with the effect of external interventions on the causal system. Assuming that a given DAG represents stable and autonomous causal relationships between variables, an intervention should be understood as the physical phenomenon in which one or several of those causal relationships are changed without changing the structure of the causal system or the remaining relationships (Pearl, 2009b). In other words, DAGs allow for manipulating the value of selected nodes in V in order to resemble an intervention in a non-experimental setting. This logical operation, known as the do operator and denoted do(X = x), first presented in Goldszmidt and Pearl (1992), or set(X = x), as in Pearl (1995), acts by eliminating the directed links from all predecessors to the intervened variable (cause) of interest X, so that the value attained (x) does not correspond to a stochastic result but to a deterministic process fixed by the researcher.

Note the difference between P (y ∣ x) and P (y ∣ do(X = x)). The former is a simple, passive observation. The latter is an active action, that is, an intervention on the natural (functional) process by which the random variable X is defined. As stated by Pearl (2009b), the ability of causal networks to predict the effects of such interventions requires a set of assumptions that rest on causal (not associational) knowledge and that ensure the system responds in accordance with what we described as the principle of autonomy. We now introduce the concept of causal Bayesian networks:

Definition 1.2.6. (Causal Bayesian Network; Pearl, 2009b): Let P (v) be a probability distribution over a set of variables V, and let Px(v) denote the distribution resulting from the intervention do(X = x). Let P∗(v) be the set of all interventional distributions Px(v), with X ⊆ V. Note that P (v) ∈ P∗(v), i.e., the case of no intervention (X = ∅). A DAG G is said to be a causal Bayesian network compatible with P∗(v) if, and only if, the following three conditions hold:

1. Px(v) is Markov relative to G.

2. Px(vi) = 1 for all Vi ∈X, whenever vi is consistent with X = x.

3. Px(vi ∣ pai) = P (vi ∣ pai) for all Vi ∉ X, whenever pai is consistent with X = x; i.e., each P (vi ∣ pai) remains invariant to interventions not involving Vi.

Departing from causal Bayesian networks to pursue causal analyses, once the intervention is performed the factorization in Definition 1.2.4 can be expressed as the truncated factorization

Px(v) = ∏_{i: Vi ∉ X} P(vi ∣ pai) (1.2)

for every v consistent with do(X = x). In other words, we limit our interventional space of distributions P∗ to a set of multivariate probability densities that fulfill specific hypotheses and are Markov relative to a causal Bayesian network G. Once we define P∗, the following properties must hold for every P ∈ P∗:

Property 1: For all i, P(vi ∣ pai) = P_{pai}(vi). That is, the set PAi is exogenous relative to Vi (i.e. the conditional probability P(vi ∣ pai) coincides with the one obtained by setting do(PAi = pai) by intervention).


Property 2: For all i and every subset S ⊆ V such that S ∩ ({Vi} ∪ PAi) = ∅, we have P_{pai,s}(vi) = P_{pai}(vi). That is, once we control for the direct causes of Vi (i.e. setting do(PAi = pai)), no other intervention (i.e. do(S = s)) will affect the outcome of Vi.
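To make the truncated factorization concrete, the sketch below evaluates equation (1.2) on a three-variable causal Bayesian network Z → X → Y with Z → Y, so Z confounds X and Y. The conditional probability tables are invented for illustration and the function names are ours.

```python
# Factors of the pre-intervention joint for the DAG Z -> X -> Y, Z -> Y.
def p_z(z):
    return 0.6 if z == 0 else 0.4

def p_x_given_z(x, z):
    p1 = (0.2, 0.8)[z]                       # P(X=1 | Z=z)
    return p1 if x == 1 else 1 - p1

def p_y_given_xz(y, x, z):
    p1 = {(0, 0): 0.1, (0, 1): 0.5,
          (1, 0): 0.4, (1, 1): 0.8}[(x, z)]  # P(Y=1 | X=x, Z=z)
    return p1 if y == 1 else 1 - p1

def p_y_do_x(y, x_star):
    """P(Y=y | do(X=x*)) via eq. (1.2): drop P(x|z) and fix X = x*."""
    return sum(p_z(z) * p_y_given_xz(y, x_star, z) for z in (0, 1))

def p_y_given_x(y, x):
    """Ordinary conditional P(Y=y | X=x) from the full joint, for contrast."""
    num = sum(p_z(z) * p_x_given_z(x, z) * p_y_given_xz(y, x, z) for z in (0, 1))
    den = sum(p_z(z) * p_x_given_z(x, z) for z in (0, 1))
    return num / den

print(p_y_do_x(1, 1))     # 0.56   (interventional)
print(p_y_given_x(1, 1))  # ~0.691 (observational, inflated by confounding)
```

The gap between the two numbers is exactly the confounding carried by Z: the interventional distribution keeps P(z) fixed, while the observational conditional re-weights Z toward the values that make X = 1 likely.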

Interventions and causal Bayesian networks are mathematical objects that allow for the derivation of several causal quantities in terms of probability distributions, many of them being hypothetical counterfactual analyses from observed data (Balke and Pearl, 1995). Notwithstanding, Pearl also understood causality in a modular way. He followed Laplace's (1814) conception of nature's laws as deterministic, with randomness arising from our ignorance of the underlying causal system. Therefore, DAGs and the causal influences implied therein are interpreted as a set of deterministic, functional relationships perturbed by random disturbances, as first stated in Pearl and Verma (1991). These functions, called structural equations, represent each child-parent link in a DAG G as Vi = fi(PAi, Ui), where the ui are iid stochastic error terms following a distribution P (U). Formally:

Definition 1.2.7. (Causal Model; Pearl, 2009b): A causal model is a triple M =⟨U,V,F ⟩ where

U is a set of background, exogenous, variables, determined by factors outside themodel,

V is a set {V1, ..., Vn} of endogenous variables, each determined by variables inside the model, i.e. in U ∪ V,

F is a set of functions {f1, ..., fn} such that each fi is a mapping from U ∪ (V \ Vi) to Vi, and such that F forms a mapping from U to V; i.e., if each fi yields a value for Vi given U ∪ V, then the entire system has a unique solution V(u) through F. The set of equations can be represented as

vi = fi(pai, ui), i = 1, ..., n

where pai is the realization of the unique minimal set of Markovian Parents (PAi)sufficient for representing fi. Likewise, Ui ⊆ U represents the minimal set of exoge-nous variables sufficient for representing fi.

The submodel created by the intervention do(X = x) in a model M represents the changes in the system after removing the functional relationship X = fX(PAX, UX). The latter is done by removing all incoming links to the node X in the DAG G associated with the causal model M. Formally:

Definition 1.2.8. (Submodel; Pearl, 2009b): Let M be a causal model, X a set of variables in V, and x a realization of X. A submodel Mx of M is the causal model Mx = ⟨U, V, Fx⟩, where Fx ∶= {fi ∶ Vi ∉ X} ∪ {X = x}.

That is, Fx is formed by deleting from F all the functions fi corresponding to nodes in X and replacing them with the constant function fX = x. Submodels are useful for representing interventions on the causal system of the form do(X = x). Solving for the distribution of another node (or set of nodes) Xj, P (xj ∣ do(X = x)), yields the causal effect of X on Xj. The formal definition reads as follows:
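Definitions 1.2.7 and 1.2.8 translate almost literally into code. The sketch below builds a toy model M = ⟨U, V, F⟩ with invented binary mechanisms (the structure and functions are ours, chosen only for illustration) and implements the submodel Mx by replacing one function with a constant.

```python
# A toy causal model M = <U, V, F> with binary variables: each endogenous
# variable is a deterministic function of its parents and its exogenous term
# (Y is left noiseless for brevity).
def make_F():
    return {
        "Z": lambda v, u: u["uz"],
        "X": lambda v, u: v["Z"] ^ u["ux"],   # X = f_X(Z, U_X)
        "Y": lambda v, u: v["X"] | v["Z"],    # Y = f_Y(X, Z)
    }

def solve(F, u, order=("Z", "X", "Y")):
    """Solve the system in topological order, yielding the unique V(u)."""
    v = {}
    for name in order:
        v[name] = F[name](v, u)
    return v

def do(F, node, value):
    """Submodel M_x (Definition 1.2.8): replace f_node by the constant value."""
    Fx = dict(F)
    Fx[node] = lambda v, u: value
    return Fx

F = make_F()
u = {"uz": 1, "ux": 1}
print(solve(F, u))              # {'Z': 1, 'X': 0, 'Y': 1}
print(solve(do(F, "X", 1), u))  # {'Z': 1, 'X': 1, 'Y': 1}
```

Note that the intervention changes only the equation for X; the remaining mechanisms, and the background realization u, are left untouched, which is exactly the modularity that Definition 1.2.8 formalizes.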


Definition 1.2.9. (Causal Effect; based on Pearl, 2009b): Given two disjoint sets of variables X and Y, the causal effect of X on Y, denoted as P (y ∣ do(x)) or Px(y), is a function g ∶ X → P, where P is the space of probability functions defined over Y. g(x, y) gives the probability of observing Y = y induced by deleting all the equations corresponding to variables in X from the set of functional causal relationships in M, and substituting X = x in the remaining equations. That is, g(x, y) = P (y ∣ do(x)).

Causal quantities Q(P′), such as the ATE, can be expressed in terms of causal effects. Note that E(Y ∣ do(x)) − E(Y ∣ do(x′)), as in Rosenbaum and Rubin (1983), can be computed from the probability distributions P (y ∣ do(x)) and P (y ∣ do(x′)), both belonging to the interventional space of distributions P∗. We also recall that the expression P (y ∣ do(x)) is equivalent to the potential outcome notation P (Yx = y) presented in Rubin (1974).

Now that we have defined the concept of causal effects, we present the logical framework that allows for their actual identification and computation. The total effect of an intervention do(X = x) on a set of outcome variables of interest can be computed thanks to the following theorem:

Theorem 1.2.2. (Adjustment for direct causes; Pearl, 2009b): Let Xi and PAi denote a random variable and its direct causes, and let Y be any set of variables disjoint from Xi and PAi, that is, Y ∩ ({Xi} ∪ PAi) = ∅. The effect of the intervention do(Xi = x′i) on Y is

P(y ∣ do(x′i)) = ∫_{pai} P(y ∣ x′i, pai) P(pai) dpai (1.3)

where P (y ∣ x′i, pai) and P (pai) represent preintervention probabilities.

Equation (1.3) calls for conditioning on the parents of Xi and then averaging the result, weighted by the probability P(PAi = pai). This operation is known as ‘adjusting for PAi’ (Pearl, 2009b). Adjusting for the set PAi eliminates spurious correlations (paths) between the cause do(Xi = xi) and the effect (Y = y). In other words, adjusting for the parents of the intervened variable Xi acts as if the researcher partitioned the sample into homogeneous groups with respect to PAi, assessed the effect of the intervention in each homogeneous group, and then averaged the results (equation 1.3). However, other sets of variables can also accomplish the task of eliminating spurious causal paths between interventions and outcome variables.

One of the many advantages of the SCM is the use of simple graphical criteria for the identification of the right set Z of ‘adjusting covariates’, as opposed to some other cumbersome concepts in the related literature, such as the ‘ignorability’ assumption in the potential-outcome framework (Rosenbaum and Rubin, 1983). Assume that we are provided with a causal DAG G, along with non-experimental data on V. Suppose we want to estimate the effect of the intervention do(X = x) on Y, where both X and Y are subsets of V. More formally, we want to estimate P (y ∣ do(X = x)) using the observed information encoded in P (v), given the causal assumptions encoded in G. A simple graphical test, known as the “back-door criterion” (Pearl, 1993), is used to test whether a set of nodes Z is sufficient for identifying P (y ∣ do(x)):

Definition 1.2.10. (Back-Door; Pearl, 2009b): A set of variables Z is said to satisfy the back-door criterion relative to a pair of variables (Xi, Xj) in a DAG G if:


1. No variable in Z is a descendant of Xi;

2. Z blocks (d-separates) every path between Xi and Xj that contains an arrow into Xi.

Similarly, if X and Y are two disjoint subsets of nodes in G, then Z is said to satisfythe back-door criterion relative to (X,Y ) if it satisfies the criterion relative to any pair(Xi,Xj) such that Xi ∈X and Xj ∈ Y .

The idea of “back-door” is essential in the measurement of causal quantities. To be clearer, assume we wish to compute the effect of a certain intervention do(X = x), but we are only able to observe the dependency P (y ∣ x) that results from the paths in a DAG G. This dependency is contaminated by some spurious correlations (the back-door paths), while other paths are genuinely causal (the directed paths from X to Y). In order to remove this bias, we need to modify the measured dependency so as to make it equal to the desired quantity (Pearl, 2009b). To do so, we condition on a set of variables Z satisfying the criterion previously defined. Once we ‘adjust’ for the set Z, we can identify the causal quantity by the following theorem, whose proof can be found in Pearl (1993, 2009b):

Theorem 1.2.3. (Back-Door Adjustment; Pearl, 2009b): If a set of variables Z satisfies the back-door criterion relative to (X, Y), then the causal effect of X on Y is identifiable and is given by the formula

P(y ∣ do(X = x)) = ∫_z P(y ∣ x, z) P(z) dz (1.4)
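As a numerical check of the back-door adjustment, the sketch below works from an observed joint distribution alone. The joint is generated by an invented DAG Z → X, Z → Y, X → Y (so Z satisfies the back-door criterion relative to (X, Y)), and equation (1.4) is contrasted with the confounded conditional P(y ∣ x).

```python
from itertools import product

# Observed joint P(z, x, y) over binary variables; all numbers illustrative.
joint = {}
for z, x, y in product((0, 1), repeat=3):
    pz = 0.5
    px1 = (0.3, 0.7)[z]                          # P(X=1 | z)
    py1 = {(0, 0): 0.2, (0, 1): 0.6,
           (1, 0): 0.5, (1, 1): 0.9}[(x, z)]     # P(Y=1 | x, z)
    joint[(z, x, y)] = pz * (px1 if x else 1 - px1) * (py1 if y else 1 - py1)

def marg(**fixed):
    """Marginal probability with some coordinates fixed."""
    return sum(p for (z, x, y), p in joint.items()
               if all({"z": z, "x": x, "y": y}[k] == v for k, v in fixed.items()))

def backdoor(y, x):
    """Eq. (1.4): P(y | do(x)) = sum_z P(y | x, z) P(z)."""
    return sum(marg(z=z, x=x, y=y) / marg(z=z, x=x) * marg(z=z) for z in (0, 1))

print(round(backdoor(1, 1), 3))              # 0.7  (adjusted, causal)
print(round(marg(x=1, y=1) / marg(x=1), 3))  # 0.78 (unadjusted, confounded)
```

Only the joint table enters the computation; no knowledge of the generating mechanism is needed once the back-door set Z is known, which is the practical content of Theorem 1.2.3.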

The “back-door” criterion does not allow adjusting for variables that are themselves affected by the intervention do(X = x). Such variables can nevertheless be used in the process of causal identification. Another criterion, introduced by Pearl (1995) under the name of the “front-door” criterion, is important inasmuch as it aids in the identification of causal effects.

Now assume a DAG with a causal path X → Z → Y, as the one presented in Figure 1.4, where X and Y are simultaneously influenced by a latent variable U. Z does not satisfy the back-door conditions in Definition 1.2.10, but measurements of Z can nonetheless help to consistently estimate P (y ∣ do(X = x)). As shown in Pearl (2009b), this can be achieved by translating P (y ∣ do(X = x)) into formulas that are computable from P (x, y, z).

The joint distribution of the hypothetical example described above can be furtherdecomposed as P (x, y, z, u) = P (u)P (x ∣ u)P (z ∣ x)P (y ∣ z, u). From the interventiondo(X = x), we remove the factor P (x ∣ u) from this expression, and after summing over zand u (since we are interested in a causal quantity defined for y), we have

P(y ∣ do(x)) = ∑_z P(z ∣ x) ∑_u P(y ∣ z, u) P(u) (1.5)

The following equalities hold due to the conditional independence assumptions encoded in Figure 1.4: P (u ∣ z, x) = P (u ∣ x) and P (y ∣ x, z, u) = P (y ∣ z, u). The latter implies the following


Figure 1.4. A diagram representing the front-door criterion: X → Z → Y, with a latent variable U influencing both X and Y.

equalities:

∑_u P(y ∣ z, u) P(u) = ∑_x ∑_u P(y ∣ z, u) P(u ∣ x) P(x)
= ∑_x ∑_u P(y ∣ x, z, u) P(u ∣ z, x) P(x)
= ∑_x P(y ∣ x, z) P(x) (1.6)

and therefore, replacing these expressions in equation (1.5) yields

P(y ∣ do(x)) = ∑_z P(z ∣ x) ∑_{x′} P(y ∣ x′, z) P(x′) (1.7)

The expression in equation (1.7) can be estimated from non-experimental data and allows for an unbiased estimate of the causal effect of X on Y, P (y ∣ do(X = x)), as in Definition 1.2.9. The derivation of (1.7) can be interpreted as a two-step application of the back-door criterion. As explained in Pearl (2009b), the first step is finding the causal effect of X on Z, a concomitant variable that is directly affected by the treatment but is not of main interest. Since there is no unblocked back-door path from X to Z in Figure 1.4, we have P (z ∣ do(x)) = P (z ∣ x).

The second step can be understood as the computation of the causal effect of Z on Y. Since the back-door path Z ← X ← U → Y renders Z and Y dependent, we adjust for X, a node that d-separates (blocks) this path, thereby allowing the computation of P(y ∣ do(z)) = ∑_{x′} P(y ∣ x′, z) P(x′). Combining the two steps, we have:

P(y ∣ do(x)) = ∑_z P(y ∣ do(z)) P(z ∣ do(x)) = ∑_z P(z ∣ x) ∑_{x′} P(y ∣ x′, z) P(x′) (1.8)

This result allows for defining the front-door criterion and its corresponding theorem for causal effect identification:

Definition 1.2.11. (Front-Door; Pearl, 2009b): A set of variables Z is said to satisfythe front-door criterion relative to an ordered pair of variables (X,Y ) if:

1. Z intercepts all directed paths from X to Y ;

2. There is no unblocked back-door path from X to Z; and


3. All back-door paths from Z to Y are blocked by X.

Theorem 1.2.4. (Front-Door Adjustment; Pearl, 2009b): If Z satisfies the front-door criterion relative to (X, Y) and if P(x, z) > 0, then the causal effect of X on Y is identifiable and is given by the formula

P(y ∣ do(x)) = ∑_z P(z ∣ x) ∑_{x′} P(y ∣ x′, z) P(x′) (1.9)
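The two-step reading of the front-door formula can be verified numerically. In the sketch below (all distributions are invented for illustration), data are generated from the model of Figure 1.4, the latent U is integrated out to obtain the “observable” joint P(x, z, y), and equation (1.9) evaluated on that joint is compared against the ground-truth interventional distribution computed with access to U.

```python
from itertools import product

# Latent-confounder model of Figure 1.4: U -> X, X -> Z, Z -> Y, U -> Y.
pu = {0: 0.6, 1: 0.4}                         # P(U=u)
px_u = {0: 0.3, 1: 0.8}                       # P(X=1 | u)
pz_x = {0: 0.2, 1: 0.7}                       # P(Z=1 | x)
py_zu = {(0, 0): 0.1, (0, 1): 0.5,
         (1, 0): 0.6, (1, 1): 0.9}            # P(Y=1 | z, u)

def b(p, v):                                  # Bernoulli lookup
    return p if v == 1 else 1 - p

# Observed joint P(x, z, y): U is integrated out, as in real data.
joint = {(x, z, y): sum(pu[u] * b(px_u[u], x) * b(pz_x[x], z) * b(py_zu[z, u], y)
                        for u in (0, 1))
         for x, z, y in product((0, 1), repeat=3)}

def marg(**fixed):
    return sum(p for (x, z, y), p in joint.items()
               if all({"x": x, "z": z, "y": y}[k] == v for k, v in fixed.items()))

def frontdoor(y, x):
    """Eq. (1.9): sum_z P(z|x) sum_x' P(y|x',z) P(x')."""
    return sum(marg(x=x, z=z) / marg(x=x)
               * sum(marg(x=xp, z=z, y=y) / marg(x=xp, z=z) * marg(x=xp)
                     for xp in (0, 1))
               for z in (0, 1))

def truth(y, x):
    """Ground truth P(y | do(x)), computed with access to the latent U."""
    return sum(b(pz_x[x], z) * sum(b(py_zu[z, u], y) * pu[u] for u in (0, 1))
               for z in (0, 1))

print(round(frontdoor(1, 1), 3), round(truth(1, 1), 3))  # 0.582 0.582
```

The agreement is exact (up to floating point) because the observed joint was generated by a model in which Z satisfies the front-door criterion, so Theorem 1.2.4 applies.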

Despite having derived some formulas by which the causal effect of an intervention of the form do(X = x) can be estimated, we still need ‘inference rules’ that allow for translating expressions involving hypothetical interventions, P (y ∣ do(x)), into equivalent expressions involving only standard probabilities of observed quantities (realizations). This set of inference rules is called the do-calculus, and it is presented next.

First, we introduce new notation. Following Pearl (1995, 2009b), let X, Y, and Z be arbitrary disjoint sets of nodes in a causal DAG G. We denote by GX̄ the subgraph obtained by removing all directed edges pointing to nodes in X. Likewise, GX̲ denotes the subgraph resulting from deleting all directed edges emerging from nodes in X. Also, P(y ∣ do(x), z) ≜ P(y, z ∣ do(x))/P(z ∣ do(x)) stands for the probability of Y = y given that X is held (intervened) at level x and Z = z is observed by chance. Three basic inference rules are now presented; proofs can be found in Pearl (1995).

Theorem 1.2.5. (Rules of do-calculus; Pearl, 2009b): Let G be the DAG associated with a causal model, and let P(⋅) be the probability distribution induced (implied) by that model. For any disjoint sets of variables X, Y, Z, and W, the following rules hold:

Rule 1 (Insertion/deletion of observations):

P(y ∣ do(x), z, w) = P(y ∣ do(x), w) if (Y ⊥ Z ∣ X, W) in GX̄,

Rule 2 (Action/observation exchange):

P(y ∣ do(x), do(z), w) = P(y ∣ do(x), z, w) if (Y ⊥ Z ∣ X, W) in GX̄Z̲,

Rule 3 (Insertion/deletion of actions):

P(y ∣ do(x), do(z), w) = P(y ∣ do(x), w) if (Y ⊥ Z ∣ X, W) in GX̄Z̄(W),

where Z(W) is the set of Z-nodes that are not ancestors of any W-node in GX̄.

A short explanation of each of these rules goes as follows. Rule 1 stems from the fact that d-separation is a valid criterion for testing conditional independence between two sets of variables: the intervention do(X = x), i.e. deleting the equations for X from the causal system (resulting in GX̄), does not introduce new dependencies among the remaining random variables. Rule 2 provides a condition for an external intervention do(Z = z) to have the same effect on Y as the passive observation Z = z; the condition is that (X, W) blocks all back-door paths from Z to Y. This holds because we are already fixing do(X = x) and not letting Z affect any of its descendants, ‘manipulating’ the graph (Spirtes et al., 2000) until we obtain the DAG GX̄Z̲. Rule 3 allows for introducing a new intervention do(Z = z) without changing the probability of Y = y. This rule is valid once we manipulate the DAG G by eliminating all the equations corresponding to Z, hence GX̄Z̄(W). A detailed explanation of the rules
of do-calculus can be found in Pearl (1995). The main result here is that, after simplemanipulations, causal effects can be computed from observed probabilities.

Corollary 1.2.6. The causal effect of a set of interventions do(X = x) on a set of outcome variables of interest Y, Q = P(y1, ..., yk ∣ do(x1, ..., xm)), is identifiable in a causal model characterized by a DAG G if there exists a finite sequence of transformations, each conforming to one of the inference rules in Theorem 1.2.5, that reduces Q to a standard probability expression involving only observed quantities.

We have briefly presented some theoretical results on the nonparametric identification of the causal effect of an intervention or treatment X on an outcome random variable of interest Y. What we wanted to highlight is that this quantity, P (y ∣ do(x)), can be recovered from standard probability distributions over observed non-experimental data, in contrast to what we have presented for the Potential Outcome framework (Rubin, 1974; Rosenbaum and Rubin, 1983; Holland, 1986; Rubin, 2005, 2006) and the Instrumental Variables approach (Angrist, 1990; Angrist et al., 1996), where causal claims are derived from hypothetical probability distributions defined over non-observable (counterfactual) random variables. We will bring up this issue in the following subsection, where the SCM and the RCM are compared.

It is important to recall that throughout this subsection we did not make any assumption about the nature of the sets of variables X, Y, and Z involved in the process of causal inference. Although Pearl (2009b) developed his structural causal theory based on observed random variables, this approach can be extended to the case in which causal relationships exist between latent (non-observable) variables. The main challenge when dealing with latent variables comes in the causal discovery step, that is, recovering the true underlying DAG (or a set of plausible DAGs) from the observed data. In response to this issue, different algorithms were developed in order to recover causal structures with latent variables from observed data (see, for example, Glymour and Spirtes, 1988; Scheines et al., 1991; Spirtes et al., 1995; Spirtes and Richardson, 1997; and Spirtes et al., 1997). These works gave a causal interpretation to the structural part of a structural equation model (SEM), as we shall see in the following sections.

Also, the SCM and the RCM differ with respect to the main causal quantities of interest. While the RCM is usually concerned with the computation of causal parameters like the ATE or the ATT, in the SCM the researcher mostly estimates total, direct, and indirect effects of a given intervention T on a variable of interest Y, with both Y, T ∈ V. This is possible due to the explicit, modular structure of M in the SCM (a feature that is not declared in the RCM), in which direct (and indirect) causal paths are explicitly modeled. Therefore, depending on the graph structure, there are multiple channels through which an intervention T affects an outcome Y. Pearl (2005) presents the formal definitions of these causal quantities (see the Appendix). Linear SEMs are handy statistical tools that yield parametric, yet consistent and unbiased, estimates of coefficients from which the total, direct, and indirect effects of an intervention can be estimated. As presented in Chapter 2, the estimation of causal quantities within the SCM can be achieved through SEM. See Bollen (1987) and Sobel (1987, 1990) for a causal interpretation of the estimated coefficients of a linear SEM and how they are used to compute total, direct, and indirect (mediation) causal effects. Also, Muthen (2011) and Muthen and Asparouhov (2015) explain how these effects can be computed in the context of nonlinear SEMs with latent variables.
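As an illustration of the product-of-coefficients logic behind total, direct, and indirect effects in a linear SEM, the sketch below simulates a mediation model X → M → Y with a direct path X → Y (coefficients and sample size are invented) and recovers the three effects by least squares.

```python
import numpy as np

# Simulated linear SEM: direct effect = c, indirect effect = a*b,
# total effect = c + a*b. All parameter values are illustrative.
rng = np.random.default_rng(0)
n = 200_000
a, b, c = 0.5, 0.8, 0.3
x = rng.normal(size=n)
m = a * x + rng.normal(size=n)          # mediator equation
y = c * x + b * m + rng.normal(size=n)  # outcome equation

def ols(target, *covs):
    """Least-squares slopes of `target` on `covs` (zero-mean data, no intercept)."""
    return np.linalg.lstsq(np.column_stack(covs), target, rcond=None)[0]

a_hat = ols(m, x)[0]
c_hat, b_hat = ols(y, x, m)   # direct effect of X, effect of M
total_hat = ols(y, x)[0]      # total effect from the reduced-form regression

print(round(c_hat, 2), round(a_hat * b_hat, 2), round(total_hat, 2))  # ≈ 0.3 0.4 0.7
```

The in-sample identity total = direct + indirect (here total_hat = c_hat + a_hat·b_hat) holds exactly for nested least-squares fits, which is the algebra underlying the Sobel-style decomposition of effects in linear SEMs.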


1.3. The relationship between SCM and RCM: Why the former and not the latter?

There has been a continuous discussion about which approach is better suited to assess causal claims from observed (non-experimental) data, yet no definitive conclusion has been reached. In this subsection we present a brief introduction to the current debate involving both SCM and RCM scholars in the causal inference field. First, we present the most radical view on why scholars in the RCM literature believe that the SCM cannot cope with causal inference; then, a more moderate approach that acknowledges some relevant features of the SCM; and finally, a third paragraph on how the SCM subsumes the RCM.

An agnostic point of view from the RCM towards SCM

Rubin’s (and colleagues’) arguments against the SCM rely on the claim that the SCM approach lacks the capacity to simulate proper experimental conditions, from which causal conclusions can be obtained. This is in fact summarized in the well-known motto “no causation without manipulation” in Holland (1986). For those versed in the RCM framework, it is the scientist who decides which intervention to apply and, subsequently, measures the causal effect of that intervention on some outcome variable of interest. In contrast, they argue, in the SCM there is no clear distinction between intervention variables that result from a decision made by the researcher and those that are mere outcome variables.

We expand on this discussion. First, in the RCM the key assumption is that treatment is assigned at random by some unknown mechanism (as under proper experimental conditions). The ignorability assumption then states that the potential outcomes are conditionally independent of the treatment given a set of observed (pre-treatment) variables. This is akin to the notion of randomization in design-of-experiments studies. Therefore, if this conditional independence holds, the set of observed variables must be sufficient to eliminate confounding of the causal effect estimates coming from sources other than the treatment and the outcome variable of interest. In favor of Pearl’s SCM, we argue that if the researcher chooses the right set of variables to block those spurious correlations (back-door criterion), then the causal effect of a ‘treatment’ on an outcome variable of interest is identifiable. In other words, if d-separation is fulfilled, then we meet the conditions that mimic proper experimental conditions (i.e. the strong ignorability assumption in the RCM).

On the other hand, a very enlightening discussion has arisen from a series of papers and blog entries by Gelman and by Pearl himself. Gelman (2011) summarizes the flaws of the SCM in a couple of points: i) he criticizes the idea of an algorithm telling the researcher which is the ‘correct’ graph that resembles the DGP and suits the observational data; and ii) he claims that there is no such thing as ‘true zeros’ in the social, behavioral, and environmental sciences (when referring to conditional independence). For the sake of an objective discussion, we give Gelman’s arguments credit. Indeed, as a starting point, it is clear to the reader that in this document we do not address the discovery step in the causal inference process. Instead, we depart from an established, assumed DAG, previously defined by the researcher. Given the recent advances in artificial intelligence
and machine learning, the discussions related to the development of the discovery step have gone far beyond the scope of this paper. With respect to his second argument, we argue that DAGs help to encode experts’ knowledge of how the world works, even if only as a naïve representation of it. Nonetheless, Gelman has a point: DAGs represent assumptions of conditional independence between random variables, something that is plausible in settings where there is a clear understanding of the mechanisms at work. In the social and cognitive sciences, however, these mechanisms are more cumbersome.

A moderate, second opinion by Dawid

Another (more moderate) opinion is that of Dawid (2008, 2010) and colleagues, who argue that, albeit very useful when it comes to carrying causal information, the conditional probabilities implied by a certain causal DAG might also be replicated by other DAGs encoding different causal assumptions. This said, causality is a complex subject, and it cannot be expressed only in terms of conditional independence properties inferred from purely observational data. A formal causal language has to be defined: Dawid (2008) stresses the importance of distinguishing between probabilistic DAGs, causal DAGs, and those expressing both conditional independence relationships and causal assumptions, the Pearlian DAGs (akin to those in Definition 1.2.6).

Also, Dawid is skeptical about how assumed interventions (like those in the SCM) propagate across passive, undisturbed systems (observational data) by merely inferring a set of ‘structural equations’ from the observed data. This critique is somewhat similar to Gelman’s on causal discovery. However, Dawid argues that in order to validate these causal discovery results (which he believes are logically consistent and well obtained), strong causal assumptions have to be made. Pearl himself acknowledges this point and provides a strong theoretical framework for the discovery step in his SCM.

On why SCM and not RCM

Now we list a series of advantages that several authors, and we ourselves, believe the SCM has over the RCM. First, DAGs are an easy, clear, intuitive, and accessible way to communicate causal relationships between random variables through conditional independence properties. In fact, DAGs are a graphical representation of how the researcher thinks the world works. Pearlian DAGs provide causal information without using any ‘algebra’, resorting only to rigorous and intuitive graphical rules. As stated in Pearl (2009b, Section 3.6.2), DAGs are mathematical objects that allow for the incorporation of substantive knowledge from researchers, such as statements of the form “Y does not directly affect X”, which are causal judgements rather than constraints on probability distributions with counterfactual variables (as in the RCM). This is why the SCM is so popular in Sociology, Computer Science, Epidemiology, and other biomedical sciences. As a matter of fact, Pearl (2009b, Section 11.3.4, page 347) says (emphasis added):

The power of graphs lies in offering investigators a transparent language to reason about, to discuss the plausibility of (causal) assumptions and, when consensus is not reached, to isolate differences of opinion and identify what additional observations would be needed to resolve differences. This facility is lacking in the potential-outcome approach where, for most investigators, “strong ignorability” remains a mystical black box.

Nonetheless, RCM advocates say that DAGs are not clear in terms of expressing counterfactuals, a key concept in Rubin’s approach. Counterfactuals are understood as hypothetical manipulations of a random variable, like those obtained under ideal experimental conditions. DAGs encode causal information about several counterfactuals (not only one), while PO models are clear about which intervention is of interest. Single World Intervention Graphs (SWIGs), a novel approach by Richardson and Robins (2013a,b), aim to unify causal DAGs and PO models using node-splitting transformations. Their cutting-edge work allows for understanding counterfactuals through graphical, yet powerful, tools. Richardson and Robins (2013a,b) show that SWIGs satisfy d-separation (factorization, global Markov condition) and modularity properties, key concepts rooted in the SCM framework and necessary in the RCM.

Second, Pearl formally demonstrated the equivalence between the structural and potential outcome frameworks. Recall the PO framework in Section 2.1. According to the RCM’s notation, Yi(xi, t) can be read as “the value Y would attain for individual i with background characteristics Xi = xi, had treatment been T = t”. Pearl (2009d) shows that the structural interpretation of the latter is

Yi(xi, t) ≡ YMt(xi) (1.10)

where YMt(xi) is the unique solution for Y in the submodel Mt of M (as in Definition 1.2.8) under the realization X = xi. While the term unit in the RCM refers to the individual, in the SCM it refers to the set of attributes that characterize that individual, represented as the vector xi in structural modeling. According to Pearl, equation (1.10) defines the formal connection between a counterfactual approach and the intervention do(T = t).

In fact, as Pearl (2009d) states, if T is treated as a random variable then Yi(xi, t) ≡ YMt(xi) is also a random variable. An important criticism of the RCM is that PO researchers take the counterfactual as a primitive (axiom) and think of P (x1, ..., xn) as a marginal distribution of an extended probability function P ∗ defined over both the set of observed variables and the counterfactuals. Instead, SCM researchers understand counterfactuals as the result of interventions on the causal system that change the structural model (and the distribution) but keep all variables the same. In other words, the RCM views Y under do(T = t), Yi(xi, t), as a different variable than YMt(xi), which Pearl has already shown to be equivalent. For a wider explanation of the subject refer to Pearl (2009b, Chapters 3, 7 and Section 11.3.5).

Third, the SCM (through DAGs) gives better guidance than the RCM on which variables are relevant (necessary) for the estimation of causal effects. Pearl (2009b,c) criticizes Rubin’s (and colleagues’) advice of conditioning on all pretreatment variables, while Rubin (2007, 2008) views the SCM as “non-scientific ad hockery”, because of its apparent difficulty in coping with untested assumptions related to the experts’ knowledge.


The discussion started with Shrier’s comment (2008) on Rubin (2007). It continued with Rubin (2008) and several other replies from both RCM and SCM scholars (see Pearl, 2009c; Sjolander, 2009; Rubin, 2009 for further details on this topic). The main idea is that, depending on the underlying structure of the DAG, spurious correlations might arise due to unnecessary variables in the PO exercise, yielding biased results about the causal effect of the selected intervention on outcomes of interest. This issue is common in the presence of confounding/mediating variables or M-structures in a causal DAG.

The simplest M-structure, as shown in Figure 1.5 (see Greenland et al., 1999; Hernan et al., 2004, and Pearl, 2009d), contains the path X ← U1 → Z ← U2 → Y , where X, Y and Z are measured, while U1 and U2 are latent variables. Conditioning on Z renders a spurious dependence between X and Y (since Z is in an inverted fork path), thus a standard PO analysis would yield biased estimates of the causal effect of do(X = x) on Y .

Figure 1.5. A causal DAG demonstrating M-Bias.

This spurious dependence is a result of Berkson’s paradox, which states that two independent causes of a common effect (say U1 and U2) become dependent once we condition on the observed effect (Z). RCM scholars do not acknowledge the logical consequences of conditioning on variables that describe subjects even before the treatment, and therefore, in most cases, PS results are biased. Some authors in the RCM literature (e.g. Greenland, 2003; Liu et al., 2012 in epidemiological studies, and Brookhart et al., 2006; Setoguchi et al., 2008; Clarke et al., 2011 in the RCM) have recently addressed this issue. Their findings are similar to those originally claimed by Pearl and colleagues.
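The collider bias just described is easy to verify numerically. The following sketch (our own illustration in Python/numpy; the unit coefficients are chosen for convenience, not taken from any cited study) simulates the M-structure of Figure 1.5 and computes the association between X and Y before and after partialling out Z:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Latent, unmeasured causes: U1 -> X, U1 -> Z, U2 -> Z, U2 -> Y.
u1 = rng.normal(size=n)
u2 = rng.normal(size=n)
x = u1 + rng.normal(size=n)
y = u2 + rng.normal(size=n)
z = u1 + u2 + rng.normal(size=n)   # collider on X <- U1 -> Z <- U2 -> Y

def partial_corr(a, b, c):
    """Correlation of a and b after regressing each on c (residual method)."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

r_marginal = np.corrcoef(x, y)[0, 1]   # ≈ 0: X and Y are independent
r_given_z = partial_corr(x, y, z)      # ≈ -0.2 here: conditioning on Z opens the path
print(round(r_marginal, 2), round(r_given_z, 2))
```

With these coefficients the theoretical partial correlation is (0 − 1/6)/(5/6) = −0.2, which the simulation reproduces up to sampling error.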


CHAPTER 2

Causal Inference Through Parametric, Linear Models

In Chapter 1 we presented the theoretical foundations of causal inference in the SCM. However, we did not explain how causal quantities are estimated through models. Statistical models play a key role in causal inference since they allow for setting a bridge between observed, non-experimental data, (D,P ), and causal assumptions, M , to answer causal questions derived from Q(P ′) and (D′, P ′). Hernan and Robins (2016) dedicated a whole volume of their book to explaining causal inference through models. Even Andrew Gelman, a prominent statistician, says in one of his blog’s entries (http://andrewgelman.com/2009/07/07/more_on_pearls/, emphases and parentheses added):

Statistical modeling can contribute to causal inference. In an observational study with lots of background variables to control for, there is a lot of freedom in putting together a statistical model: different possible interactions, link functions, and all the rest [...]. Better modeling, and model checking, can lead to better causal inference. This is true in Rubin’s framework as well as in Pearl’s: the structure may be there (referring to M), but the model still needs to be built and tested.

In summary, computing causal effects through statistical models, whether in a parametric or nonparametric way, allows for estimating causal quantities from non-experimental data. In the parametric case, models impose prior restrictions on the distribution of the data and inferences are valid as long as the model is correctly specified. In the nonparametric one, no restrictions are imposed, at the expense of interpretation and parsimony. Either way, statistical models give causal structure and interpretation to noisy data. It is through models that researchers answer causal questions. Within the context of causal inference, statistical models are not just carriers of probabilistic information: they have causal content and thus shall be interpreted in a causal way.

Pearl (2000, 2009b) made clear how causal knowledge can be understood in a modular fashion and how to give a causal interpretation to a system of equations. The quasi-deterministic approach (see Definitions 1.2.7 and 1.2.8) that has been accepted in the SCM literature allows for expressing causal ties between random variables and their Markovian parents as vi = fi(pai, ui), i = 1, ..., n, which is a nonparametric generalization of the (linear) Structural Equation Model (SEM):

vi = ∑k≠i βi,k vk + ui,   i = 1, ..., n. (2.1)

Both strong and weak causal assumptions are first implied by the causal DAG and then translated into the SEM’s specification (Bollen and Pearl, 2013). On one hand, strong causal assumptions are represented by the absence of directed links or bi-directed arcs between nodes or error terms in the DAG. When translated into the SEM’s language, these assumptions are expressed as restrictions over the parameters in the form of zero partial correlations and/or conditional independence (absence of causal relationships) between random variables. On the other hand, weak causal assumptions exclude some values for a parameter but allow another range of values. Directed links and bi-directed arcs in DAGs express the weak causal assumption of a nonzero effect.

A causal model M is called Markovian (recursive in the SEM literature) if its graph G contains no directed cycles and if the error terms (Ui, one for each endogenous variable) are mutually independent (i.e. no bi-directed arcs). A model is semi-Markovian if its graph is acyclic but some errors are dependent. It is common to assume that the Ui are multivariate normal. If so, the Vi in equation (2.1) will also be multivariate normal and (if centered) their distribution is completely characterized by the correlation coefficients ρik, i ≠ k. In fact, the partial structural (regression) coefficients in equation (2.1) can be expressed in terms of both partial correlations and standard deviations, as βY,X ≡ βY X ⋅Z = ρY X ⋅Z (σY ⋅Z/σX ⋅Z).

Now, consider a directed path X → Y in a graph G. In order to assess the causal effect of an intervention of the form do(X = x), one needs to estimate the structural parameters in a given system of causes and effects. Within the context of path analysis, the structural (regression) coefficient can be decomposed as the sum of path coefficients (Wright, 1921, 1934), as βXY = α + IXY , where α is the direct (causal) effect of X on Y , and IXY is the sum of other effects through paths connecting X and Y , excluding the direct link X → Y . Now, assume we remove the edge X → Y in G, resulting in a subgraph with no direct effects of X on Y , denoted Gα. We now introduce a theorem that allows for causal effect identification using the d-separation criterion presented in Chapter 1:

Theorem 2.0.1. (d-Separation in General Linear Models; based on Pearl, 2009b): For any linear model structured according to a graph G, which may include cycles and bi-directed arcs, the partial correlation ρXY ⋅Z vanishes if the nodes in set Z d-separate node X from node Y in graph G. In this context, each bi-directed arc Xi ←→ Xj is interpreted as a latent common parent Xi ← L → Xj.
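As a minimal numerical check of Theorem 2.0.1 (our illustration; the coefficients are arbitrary), consider the chain X → Z → Y , in which Z d-separates X from Y , so ρXY ⋅Z should vanish while ρXY does not:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Linear Markovian model for the chain X -> Z -> Y.
x = rng.normal(size=n)
z = 0.8 * x + rng.normal(size=n)
y = 0.5 * z + rng.normal(size=n)

def partial_corr(a, b, c):
    """Correlation of a and b after regressing each on c (residual method)."""
    ra = a - np.polyval(np.polyfit(c, a, 1), c)
    rb = b - np.polyval(np.polyfit(c, b, 1), c)
    return np.corrcoef(ra, rb)[0, 1]

r_xy = np.corrcoef(x, y)[0, 1]       # nonzero: X and Y are d-connected
r_xy_given_z = partial_corr(x, y, z) # ≈ 0: Z d-separates X from Y
print(round(r_xy, 2), round(r_xy_given_z, 2))
```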

Now, assume a set of variables Z in Gα. If, and only if, the nodes in Z d-separate X and Y (following Theorem 2.0.1), we have that

βXY ⋅Z = α + IXY ⋅Z = α, (2.2)

that is, IXY ⋅Z = 0. The latter is true because the sum of the other path coefficients indirectly connecting X and Y , IXY ⋅Z , vanishes once we ‘control for’ Z. This yields βXY ⋅Z = α as the estimate of the direct causal effect of X on Y . This result is summarized in the following theorem:

Theorem 2.0.2. (Single-Door Criterion for Direct Effects; Pearl, 2009b): Let G be any path diagram in which α is the path coefficient associated with the link X → Y , and let Gα denote the diagram that results when X → Y is deleted from G. The coefficient α is identifiable if there exists a set of nodes Z such that i) Z contains no descendant of Y , and ii) Z d-separates X from Y in Gα. If Z satisfies these two conditions, then α is equal to the regression coefficient βY X ⋅Z . Conversely, if Z does not satisfy these conditions, then βY X ⋅Z is not a consistent estimand of α (except in rare cases of measure zero).
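The single-door criterion can be illustrated with a small simulation (coefficients of our choosing): Z is a common cause of X and Y , so Z d-separates X from Y in Gα, and regressing Y on both X and Z recovers the direct effect α, while the unadjusted regression does not:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
alpha = 0.7                       # true direct effect of X on Y

z = rng.normal(size=n)            # common cause: Z -> X, Z -> Y
x = 0.9 * z + rng.normal(size=n)
y = alpha * x + 0.6 * z + rng.normal(size=n)

# Naive regression of Y on X alone: biased, the back-door via Z is open.
b_naive = np.linalg.lstsq(np.column_stack([x, np.ones(n)]), y, rcond=None)[0][0]

# Adjusting for the single-door set {Z} recovers alpha.
b_adj = np.linalg.lstsq(np.column_stack([x, z, np.ones(n)]), y, rcond=None)[0][0]
print(round(b_naive, 2), round(b_adj, 2))   # b_adj ≈ 0.7, b_naive is inflated
```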

Theorem 2.0.2 is useful when the structure of M is relatively simple. However, it can be extended to the case in which more complex paths are considered and where the identification of the structural (path) parameters is achieved through the identification of total rather than direct effects:

Theorem 2.0.3. (Back-Door Criterion in SEM; Pearl, 2009b): For any two variables X and Y in a causal diagram G, the total effect of X on Y is identifiable if there exists a set of variables Z such that

1. no member of Z is a descendant of X; and

2. Z d-separates X from Y in the subgraph GX obtained after deleting from G all arrows emanating from X.

Moreover, if the two conditions are satisfied, then the total effect of X on Y is given by βXY ⋅Z .

Theorem 2.0.3 states that, after controlling for Z, X and Y are not (causally) connected through spurious, indirect (back-door) paths. These results also hold for nonlinear SEMs (Pearl, 2009b). It is now clear how the (structural) parameter estimates of a SEM are useful for estimating causal quantities in the SCM, such as direct and total effects of an intervention of the form do(X = x). Theorems 2.0.1 to 2.0.3 allow for interpreting structural parameters in a causal way.

Path coefficients, Direct and Indirect Effects

The structural parameters will be estimated from non-experimental data through SEM, as we show in the following subsections. For now, assume the researcher knows the values of the coefficients in the linear structural equations. With the (estimated) structural coefficients at hand, the computation of direct and indirect effects is straightforward in a setting of linear, simultaneous, structural equations. Following Wright (1920, 1921, 1934), the total causal effect (correlation) of an intervention, do(X = x), on an outcome variable, Y , is equal to the sum of the products of the path coefficients along all direct and indirect paths connecting X and Y . By Theorems 2.0.2 and 2.0.3, these structural (path) coefficients can be approximated by estimated parameters.

For example, consider Figure 2.1 in Pearl (2009b, p. 151). This DAG represents a system of linear structural equations with observed variables V = {X, Y, Z1, Z2} and latent variables (dashed arcs). In this case, we are interested in the effect of an intervention do(X = x) on Y . Assume ∆X = 1 with respect to the baseline scenario. Wright’s path coefficient method states that the effect of the latter intervention can be computed as the sum of the products of the structural parameters for each path connecting X and Y , i.e. α + βγ. From Theorems 2.0.2 and 2.0.3, it is clear that we can compute βY X ⋅Z2 = α + βγ (as Z2 d-separates every back-door (spurious) causal path between X and Y ).

Figure 2.1. Graphical Identification of Total Effects, in Pearl (2009b).
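A stylized simulation loosely based on this example confirms Wright’s computation. Note that this is our own simplification: the latent back-door structure of the original figure is collapsed into a single observed confounder Z2, Z1 acts as a mediator, and all coefficient values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
alpha, beta, gamma = 0.5, 0.8, 0.4   # illustrative path coefficients

# Z2 confounds X and Y (back-door path); Z1 mediates part of X's effect.
z2 = rng.normal(size=n)
x = 0.7 * z2 + rng.normal(size=n)
z1 = beta * x + rng.normal(size=n)
y = alpha * x + gamma * z1 + 0.6 * z2 + rng.normal(size=n)

# Regressing Y on X and Z2 blocks the back-door path, so the coefficient on X
# is the total causal effect alpha + beta*gamma (Wright's path-tracing rule).
coef = np.linalg.lstsq(np.column_stack([x, z2, np.ones(n)]), y, rcond=None)[0][0]
print(round(coef, 2), alpha + beta * gamma)   # both ≈ 0.82
```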

The computation of direct and indirect (mediation) effects has recently been studied within the context of nonlinear SEMs. Muthen (2011) and Muthen and Asparouhov (2015) show how (natural) direct and indirect effects can be estimated with binary, discrete, and continuous outcomes or mediators (e.g. Z1 in the latter example), even in the presence of latent variables.

But how are regression and SEM estimates different? This question has long been addressed, and it is important to clarify the differences. First, Pearl (2009b, p. 159) gives a superb explanation of what structural coefficients are and how they should be interpreted. SEMs must be interpreted as carriers of causal information and not as a byproduct of statistical analysis of non-experimental data. As stated by Pearl, an equation y = βx + ε is called structural if it is interpreted as follows: in an ideal experiment, having performed the action do(X = x) and having set Z = z for any other set of variables Z not containing X or Y , the value of Y would be βx + ε, where ε is not a function of (does not causally depend on) the values x and z. Also, if we were to perform the action do(Y = y) and set Z = z, then we cannot say the value of X would be (y − ε)/β. This is because the relationship between X and Y is not bidirectional (unless otherwise stated), and the flow of causal information goes as X → Y , but not Y → X. Second, and similar to the first consideration, Bollen and Pearl (2013) argue that the equality sign should be removed and replaced with an assignment symbol (∶=) instead, which represents an asymmetrical transfer of causal information. It is precisely the claim that regressions and SEMs are equivalent that obscured the causal meaning of the latter in the causal inference literature. We now present some insights on Structural Equation Models as a statistical tool for estimating causal quantities, without forgetting their key role in causal inference.


2.1. Structural Equation Models (SEMs)

We have already discussed how causal quantities, Q(P ′), derived from an intervention (treatment) do(X = x) on an outcome variable of interest Y , can be computed through the estimated values of the structural parameters (path coefficients) in linear SEMs. In this subsection we present an introduction to the statistical properties and estimation procedures of Structural Equation Models.

SEMs are data-driven models with causal implications that describe the (causal) relationships between random variables. SEMs are systems of simultaneous regression equations, with less restrictive assumptions, that allow for measurement errors in both endogenous and exogenous variables, and that can be viewed as a generalization of much simpler statistical methods (linear models, ANOVA, factor analysis, etc.). One important feature of SEMs is their ability to incorporate the researchers’ expertise in the model specification, estimation and testing steps. The empirical results obtained from a SEM analysis can be given causal meaning only within the context of a substantively informed setup (Bollen, 1989).

SEMs consist of three main parts: i) a system of structural equations, ii) the path diagram, and iii) the covariance structure (matrix). The first part, the system of structural equations, is a set of equations that represents how the researcher believes the random variables are causally related. These equations contain random variables and structural parameters. Random variables can be either latent (concepts, unmeasured, factors), observed (measures, indicators, proxies), or disturbance terms; while the structural parameters are invariant constants that “quantify” the causal links between variables and allow for computing causal quantities of interest (Bollen, 1989).

The Linear System of Structural Equations

Consider a set of random variables, both manifest and latent, V , causally interrelated following the structure and assumptions implied by a causal model M . Variables in V are assumed to be linearly related to their direct causes (parents), as in equation (2.1). We introduce new notation that should not interfere with the SCM causal framework presented in Chapter 1. Let η and ξ, with (η, ξ) ∈ V , be sets of endogenous (determined within M) and exogenous (determined outside M) latent variables, respectively. Also, let Y and X, with (Y, X) ∈ V , be sets of endogenous and exogenous manifest variables, and assume a sample of N individuals. For each individual i = 1, ..., N , we measure Yi = yi and Xi = xi, and assume the existence of the latent constructs ηi and ξi. For ease of notation, we leave out the subscript i, and assume that the causal assumptions implied by M (and summarized by linear equations of the form of (2.1)) hold for all individuals in the sample.

Following the LISREL model (Joreskog and Sorbom, 1996), two major subsystems are part of the system of structural equations: a linear latent variable (structural) model and a linear measurement model. On one hand, the latent variable model summarizes the causal relationships between factors:

η = Bη + Γξ + ζ, (2.3)

where η is a q1 × 1 vector of endogenous latent variables, B is a q1 × q1 matrix of structural coefficients for the latent endogenous variables, ξ is a q2 × 1 vector of exogenous latent variables with covariance matrix E(ξξ′) = Φ, Γ is a q1 × q2 matrix of structural coefficients for the latent exogenous variables, and ζ is a q1 × 1 vector of disturbance terms with E(ζ) = 0 and covariance matrix E(ζζ′) = Ψ. Moreover, the error term satisfies E(ζξ′) = 0. An important assumption with respect to the structural parameters in the latent variable model is that diag(B) = 0, i.e. a variable is not an immediate and instantaneous cause of itself, and that (I − B) is non-singular so that (I − B)−1 exists. For statistical identification purposes, restrictions are imposed over the parameters. These restrictions also represent claims about (the absence of) causal effects between random variables.

On the other hand, the measurement equations represent the (linear) causal links between latent and manifest variables. It is assumed that the observed variables are highly correlated with the latent constructs they measure. That is, they provide quantitative evidence of their latent counterparts:

Y = ΛY η + ε (2.4)

X = ΛXξ + δ (2.5)

In equations (2.4) and (2.5), Y is a p1 × 1 vector of endogenous observed variables, ΛY is a p1 × q1 matrix of structural coefficients (factor loadings) linking endogenous latent variables (causes) and observed variables (consequences), ε is a p1 × 1 vector of disturbances, X is a p2 × 1 vector of exogenous manifest variables, ΛX is a p2 × q2 matrix of coefficients linking exogenous latent variables and exogenous observed variables, and δ is a p2 × 1 vector of error terms. It is important to recall that the coefficients in ΛY and ΛX represent the expected change in an observed variable resulting from a one-unit increase in its corresponding latent variable. To fully interpret a coefficient, one must assign a scale to the latent variable, and therefore the analyst typically either fixes one loading to one (e.g. λij = 1 for some Xi and ξj) or standardizes the latent variance to one (e.g. var(ξi) = 1). Also, the error terms satisfy E(ε) = E(δ) = E(εδ′) = E(εη′) = E(εξ′) = E(δη′) = E(δξ′) = 0. Moreover, the covariance matrices of the disturbance terms are E(εε′) = Θε and E(δδ′) = Θδ, respectively.
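A minimal simulated instance of equations (2.3) to (2.5), with one endogenous and one exogenous factor, two indicators each, and illustrative parameter values of our choosing, can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000

# Structural part:   eta = gamma * xi + zeta
# Measurement part:  Y_j = lamY_j * eta + eps_j,  X_j = lamX_j * xi + del_j
gamma = 0.6
lam_y = np.array([1.0, 0.8])   # first loading fixed to 1 to set eta's scale
lam_x = np.array([1.0, 0.7])

xi = rng.normal(size=n)                            # var(xi) = 1 (Phi = 1)
eta = gamma * xi + np.sqrt(0.5) * rng.normal(size=n)  # Psi = 0.5
Y = np.outer(eta, lam_y) + 0.3 * rng.normal(size=(n, 2))
X = np.outer(xi, lam_x) + 0.3 * rng.normal(size=(n, 2))

# Implied moment: cov(Y1, X1) = lamY1 * gamma * lamX1 * var(xi) = 0.6
c_y1x1 = np.cov(Y[:, 0], X[:, 0])[0, 1]
print(round(c_y1x1, 2))   # ≈ 0.6
```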

The Path Diagram

Path diagrams are the graphical representation of the causal assumptions coded in the system of structural equations. They are equivalent to DAGs in the SCM framework and, accordingly, have the same properties. As mentioned before, path analysis began with Wright (1920, 1921, 1934) as a way to compute causal effects of hypothetical interventions from (non-experimental) observed data. Within the SEM context, latent variables are represented by circular nodes, manifest variables by square nodes, and direct causal (linear) relationships by single-headed arrows; when an unmodeled causal association between two random variables (usually error disturbances) is assumed, a bi-directed arrow links those two random variables. The absence of a bi-directed arc between two error terms εi and εj reflects the causal assumption (and the corresponding parameter restriction) cov(εi, εj) = 0. An example of a path diagram (in Bollen, 1989) is Figure 2.2.

Figure 2.2. SEM example in Bollen (1989).

The latter is a graphical representation of the structural and measurement equations in (2.3), (2.4) and (2.5), with η = (η1, η2)′, ξ = (ξ1, ξ2)′, Y = (Y1, ..., Y4)′, X = (X1, ..., X4)′, δ = (δ1, ..., δ4)′, ε = (ε1, ..., ε4)′, ζ = (ζ1, ζ2)′ and the following causal assumptions encoded in the matrices:

B = [0 β12; β21 0],  Γ = [γ11 0; 0 γ22],  Λ′X = [0 0 λ3 λ4; λ1 λ2 0 0],  Λ′Y = [0 0 λ7 λ8; λ5 λ6 0 0],

with cov(ζ1, ζ2) ≠ 0 in Ψ, and cov(ξ1, ξ2) ≠ 0 in Φ.

The Covariance Matrix

Joreskog (1967, 1973, 1978) and Bollen (1989), among others, show how the theoretical covariance matrix of the manifest variables, Σ = E(ZZ′) with Z = (Y, X)′, can be written in terms of the set of parameters θ in equations (2.3) to (2.5). Therefore, the covariance matrix Σ(θ) can be expressed as

Σ(θ) = [ΣY Y (θ) ΣY X(θ); ΣXY (θ) ΣXX(θ)] (2.6)

After replacing equations (2.3) to (2.5) in (2.6), we have that

ΣY Y (θ) = ΛY (I −B)−1 (ΓΦΓ′ +Ψ) [(I −B)−1]′Λ′Y +Θε

ΣXY (θ) = ΛXΦΓ′ [(I −B)−1]′Λ′Y

ΣY X(θ) = Σ′XY (θ)

ΣXX(θ) = ΛXΦΛ′X +Θδ
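These formulas translate directly into a small routine that builds the model-implied covariance matrix Σ(θ) from the parameter matrices. The helper name implied_cov and the tiny parameter values below are our own illustration:

```python
import numpy as np

def implied_cov(B, Gamma, Phi, Psi, LamY, LamX, ThetaEps, ThetaDel):
    """Model-implied covariance matrix of (Y, X) from the LISREL formulas."""
    q1 = B.shape[0]
    inv = np.linalg.inv(np.eye(q1) - B)          # (I - B)^{-1}
    cov_eta = inv @ (Gamma @ Phi @ Gamma.T + Psi) @ inv.T
    Syy = LamY @ cov_eta @ LamY.T + ThetaEps
    Sxy = LamX @ Phi @ Gamma.T @ inv.T @ LamY.T
    Sxx = LamX @ Phi @ LamX.T + ThetaDel
    return np.block([[Syy, Sxy.T], [Sxy, Sxx]])

# Tiny example: one eta, one xi, two indicators each (illustrative values).
B = np.zeros((1, 1)); Gamma = np.array([[0.6]]); Phi = np.array([[1.0]])
Psi = np.array([[0.5]])
LamY = np.array([[1.0], [0.8]]); LamX = np.array([[1.0], [0.7]])
Sigma = implied_cov(B, Gamma, Phi, Psi, LamY, LamX,
                    0.09 * np.eye(2), 0.09 * np.eye(2))
print(Sigma.shape)    # (4, 4), symmetric
print(Sigma[0, 2])    # cov(Y1, X1) = 1.0 * 0.6 * 1.0 = 0.6
```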

It is clear from the former equations that in basic SEM the covariance matrix of the manifest variables can be expressed as a function of the parameters characterizing the system of equations. The aim is to estimate those structural parameters θ from observed, non-experimental data. More specifically, θ is estimated from the empirical covariance matrix S = (1/N) ∑Ni=1 ziz′i, where i = 1, ..., N denotes an individual and N the total number of observations in the sample.

In essence, we aim to estimate θ such that the discrepancy between S and Σ(θ) is minimized. Formally, θ is the vector of parameters that satisfies θ ∈ arg min F (θ) ∶= ∣∣S − Σ(θ)∣∣. The method of Maximum Likelihood (ML) is the default in most SEM analyses and software packages. It can be shown (Joreskog, 1973; Mulaik, 2009; Kline, 2011) that maximizing the assumed likelihood function is equivalent to minimizing the loss function

FML(θ) = log ∣Σ(θ)∣ + tr{SΣ−1(θ)} − log ∣S∣ − (p1 + p2).

As a full-information method, ML estimation aims to find θ such that the differences between log ∣S∣ and log ∣Σ(θ)∣, and between tr{SΣ−1(θ)} and tr(I) = p1 + p2, are simultaneously minimized. Although robust, ML estimation requires strong multivariate normality assumptions on the latent variables and disturbance terms. Also, a large sample size is needed in order to obtain unbiased, consistent and asymptotically efficient estimates. Other important features of ML estimators are scale invariance and scale freedom, which mean that ML estimators are not bound to the measurement scale of the manifest variables or their corresponding covariance matrix. See Joreskog (1973); Chou and Bentler (1995) and Bollen (1989, Chapters 4, 8, 9) for further details on ML estimation in SEM.
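The loss function FML is straightforward to evaluate (a sketch; the helper name f_ml is ours). At a perfect fit, Σ(θ) = S, the trace term equals p1 + p2 and the log-determinants cancel, so FML = 0; any misfit gives a positive value:

```python
import numpy as np

def f_ml(S, Sigma):
    """ML discrepancy: log|Sigma| + tr(S Sigma^{-1}) - log|S| - p."""
    p = S.shape[0]
    return (np.linalg.slogdet(Sigma)[1]
            + np.trace(S @ np.linalg.inv(Sigma))
            - np.linalg.slogdet(S)[1] - p)

S = np.array([[1.0, 0.3],
              [0.3, 1.0]])
print(f_ml(S, S))               # ≈ 0 at a perfect fit
print(f_ml(S, np.eye(2)) > 0)   # misfit (ignoring the covariance) is penalized
```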

Other less restrictive but equally robust covariance-based (CB) estimation methods have been proposed in the SEM literature: Ordinary Least Squares (OLS) and Two-Stage Least Squares (2SLS), Generalized Least Squares (GLS), as in Fox (1984) and Johnston and DiNardo (1997); and Unweighted/Weighted Least Squares (ULS/WLS), as in Browne (1982, 1984); among others, with their corresponding penalty functions:

FOLS(θ) = (1/2) tr{[S −Σ(θ)]′[S −Σ(θ)]}

FGLS(θ) = (1/2) tr{([S −Σ(θ)]W−1)2}

FULS(θ) = (1/2) tr{[S −Σ(θ)]2}

FWLS(θ) = [s −σ(θ)]′W−1[s −σ(θ)]

where W is a positive-definite weight matrix (usually W = S in GLS), and s and σ(θ) are the vector of sample variances/covariances and the model-implied vector of the population elements in Σ(θ), respectively. We do not present further details on estimation procedures and fit index analyses for linear SEM; instead, we refer the reader to comprehensive treatises and handbooks on classic (causal) SEM, such as Goldberger and Duncan (1973); Duncan (1975); Bollen (1989); Hoyle (1995); Spirtes et al. (2000); Hancock and Mueller (2006); Morgan and Winship (2007); Marcoulides and Schumacker (2009); Mulaik (2009); Kline (2011), among others.

Irrespective of which estimation method is used, all of the above are based on the empirical and theoretical covariance matrices. Some CB methods depend on the multivariate normality assumption for the observed and latent variables, while others depend on very large sample sizes. Moreover, as pointed out in Arminger and Muthen (1998) and Lee (2007), basic CB-SEM methods cannot easily cope with nonlinear terms or interactions between variables, and rely on strong assumptions when dealing with dichotomous and ordered categorical data, missing data or small sample sizes.

SEM methods based on raw individual observations, rather than the empirical covariance matrix (see, e.g. Lee and Zhu, 2000, 2002; Skrondal and Rabe-Hesketh, 2004; Mehta and Neale, 2005), have overcome the latter shortcomings and have some advantages: i) estimation based on first moments avoids loss functions based on second moments, which are more difficult to handle; ii) direct or indirect estimation of latent variables is more straightforward; and iii) the estimating equations have a direct interpretation, as they are similar to standard regression analyses in most cases.

2.2. Bayesian Estimation of Structural Equation Models

Among the individual-based (IB) SEM methods, the Bayesian approach has additional advantages in terms of flexibility in both estimation and dealing with the required assumptions. For example, in the Social Sciences it is very uncommon to satisfy multivariate normality or to have the large sample size necessary to derive the asymptotic properties of ML estimators. There is evidence of lack of robustness of ML estimates with small sample sizes (see, for example, Chou et al., 1991; Hoogland and Boomsma, 1998). In contrast, Bayesian analyses of SEM, which depend less on asymptotic theory, produce reliable results even with small sample sizes (Lee and Song, 2004; Lee, 2007).

Also, Bayesian SEM allows for incorporating experts’ knowledge with respect to the structural parameters and latent variables through the prior distribution functions. With the computational advances of recent decades, Bayesian analysis of SEM (and its extensions) has received greater attention from the academic community.

Let M be a causal SEM including a path diagram and a set of structural equations with parameters θ. Let Y = {Y1, ..., YN} be the observed data of the N individuals in the sample. In the Bayesian approach θ is considered a random variable with (conditional) distribution function p(θ ∣ M) ≅ p(θ). Also let p(Y, θ ∣ M) be the joint probability density function of Y and θ under M . Based on the well-known Bayes identity (and taking logs), we have

log p(θ ∣ Y,M)∝ log p(Y ∣ θ,M) + log p(θ ∣ M). (2.7)


CHAPTER 2. CAUSAL INFERENCE THROUGH PARAMETRIC, LINEAR MODELS 37

In equation (2.7), p(θ ∣ Y,M) is known as the posterior distribution, p(Y ∣ θ,M) is the likelihood function, and p(θ ∣ M) the prior distribution (defined by the researcher). Note that these distributions are model-contingent, that is, they are determined by the structure of the causal model M. Also note that the likelihood function depends on the sample size, while the prior does not. As N increases, the former dominates the latter and the posterior estimates become equivalent to those of the ML approach.

Therefore, Bayesian and ML estimators are asymptotically equivalent. Within this framework, assume a set of equations as in (2.3), (2.4), and (2.5). Again, let Y = {Y1, ...,YN} and Ω = {ω1, ...,ωN}, with ωi = (η′i, ξ′i)′, for i = 1, ...,N, be the sets of manifest and latent variables in the proposed model. Let θ be the set of unknown parameters in B, Γ, Ψ, ΛY , ΛX , Θε and Θδ. In the posterior analysis, Y is augmented with the matrix of latent variables Ω, and a sample is then drawn from the posterior distribution p(Ω,θ ∣ Y). A sufficiently large number of observations is generated from this distribution using the Gibbs sampler, as follows. At the (j + 1)th iteration, with current values Ω(j) and θ(j):

(i) Generate Ω(j+1) from p(Ω ∣ θ(j),Y) (2.8)

(ii) Generate θ(j+1) from p(θ ∣ Ω(j+1),Y) (2.9)

Let θY be the parameters in the measurement equations (2.4) and (2.5), and θω those in the structural equation (2.3). It is assumed that the prior distributions of θY and θω are independent, i.e., p(θ) = p(θY) p(θω). Moreover, p(Y ∣ Ω,θ) = p(Y ∣ Ω,θY) and p(Ω ∣ θ) = p(Ω ∣ θω). From equation (2.8) it is clear that the latent variables are generated directly from their posterior distribution. For further details on Bayesian estimation of SEM, refer to Lee (2007) and Song and Lee (2012a,b). We will make use of the notation and procedures explained therein in later chapters.
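The two-step scheme in (2.8)–(2.9) can be made concrete with a toy example. The sketch below is a hypothetical one-factor, two-indicator model with unit error variances and the first loading fixed at 1 for identifiability (an illustrative assumption, not a model from this thesis); it alternates between drawing the latent variables Ω and the free loading from their normal full conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i = lam * omega_i + eps_i, lam = (1, lam2)', eps ~ N(0, I), omega ~ N(0, 1)
N, lam2_true = 500, 0.8
omega_true = rng.normal(size=N)
Y = np.column_stack([omega_true + rng.normal(size=N),
                     lam2_true * omega_true + rng.normal(size=N)])

lam2, tau2 = 0.0, 10.0      # starting value; N(0, tau2) prior on the free loading
draws = []
for it in range(3000):
    # (i) draw Omega from p(Omega | theta, Y): normal full conditional per unit
    lam = np.array([1.0, lam2])
    prec = 1.0 + lam @ lam                        # posterior precision of omega_i
    omega = (Y @ lam) / prec + rng.normal(size=N) / np.sqrt(prec)
    # (ii) draw theta from p(theta | Omega, Y): conjugate normal update for lam2
    post_prec = omega @ omega + 1.0 / tau2
    lam2 = (omega @ Y[:, 1]) / post_prec + rng.normal() / np.sqrt(post_prec)
    if it >= 1000:                                # discard burn-in draws
        draws.append(lam2)

post_lam2 = float(np.mean(draws))
print(round(post_lam2, 2))    # posterior mean, near the true loading 0.8
```

With the first loading fixed, conjugacy keeps both steps as simple normal draws; richer models replace step (ii) with Metropolis-Hastings updates.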

Note, however, that in conventional Bayesian SEM analysis, Y1, ...,YN are assumed to be independent. This assumption amounts to a sample of independent individuals, i.e., no causal interference between observational units. However, it does not hold in most Social Science studies.


CHAPTER 3

Causally Connected Units

As explained in previous chapters, most of the causal inference literature is based on the assumption of independent units that are not causally connected. This assumption, known as the Stable Unit Treatment Value Assumption (SUTVA), asserts that there is no causal interference among individuals. Using the potential outcome notation introduced in Chapter 1, SUTVA means that the treatment supplied to unit i′, Ti′ = ti′, and/or the outcome variable for unit i′, Yi′(Xi′, ti′), have no effect on unit i's outcome, Yi(Xi, ti), with i ≠ i′ and (i, i′) ∈ N. In other words, Yi(Xi, Ti) ⊥ Ti′ and Yi(Xi, Ti) ⊥ Yi′(Xi′, Ti′). Common causal target parameters of interest (Q(P′)) are evaluated using individual-level information, and therefore their estimation and inference are based on the asymptotic properties of classical statistical estimators (van der Laan, 2014). However, these causal estimators fail to provide causally interpretable estimates of structural parameters when this assumption is violated.

On the one hand, SUTVA violation has been extensively explored in the RCM framework, challenging existing causal theory and statistical methodologies. The main issue with SUTVA violation is that potential outcomes in the RCM are based on counterfactuals, i.e., unobserved random variables that enter the estimation of causal quantities. More formally, assume a sample of i = 1, ...,N units, a binary treatment variable Ti ∈ {0,1}, and a continuous outcome variable of interest Yi measured for each unit i ∈ N. Let an individual i receive treatment Ti = 1. As explained in previous chapters, the potential outcome of unit i under Ti = 0, namely Yi(Xi, Ti = 0), is an unobserved variable. The estimation of causal quantities requires a set of observed data and counterfactuals for every individual i in the sample. Under SUTVA, and if T is a binary random variable, each individual i has one, and only one, unobserved potential outcome (keeping everything else constant). However, when SUTVA is violated, a larger set of counterfactuals emerges for each individual i ∈ N.

The larger set of potential outcomes in the context of SUTVA violation renders the causal quantity of interest Q(P′) unidentifiable. That is, the counterfactual outcome of a unit i that received treatment Ti = 1, namely Yi(Xi, Ti = 0, ⋯), depends not only on Ti = 0 but also on the treatments assigned to (or received by) units i′ ≠ i, ti′, ..., ti∗, for all i′ ∈ N∖{i}. For ease of notation, let the subscript −i = N∖{i} denote a vector of either treatments or outcomes for


all other units in N except i, e.g., T−i = {t1, ..., ti−1, ti+1, ..., tN}. This means that the potential outcome of unit i is Yi(Xi, Ti = 0, T−i = t−i). In this particular example (assuming binary treatments), SUTVA violation in a sample of N individuals leads to 2^N potential outcomes. Besides, Sobel (2006) proved that in the presence of SUTVA violation, estimates of causal parameters (such as the ATE) are not informative about the sole effect of an intervention on the outcome variable; instead, they are the sum of two distinct effects, one for the intended intervention and one for the spillover effects. When these are not analyzed separately, results can yield misleading conclusions about the effectiveness of a treatment. It is clear that SUTVA violation calls for the definition of new statistical methodologies and causal parameters of scientific interest.

The violation of the SUTVA assumption has been extensively studied within the RCM framework. A series of papers by Halloran and Struchiner (1995); Sobel (2006); Rosenbaum (2007); Hudgens and Halloran (2008); VanderWeele and Tchetgen Tchetgen (2011); Tchetgen Tchetgen and VanderWeele (2012); VanderWeele et al. (2012); Liu and Hudgens (2014); Lundin and Karlsson (2014), among others, approached this issue by extending Rubin's potential outcomes notation and by defining new causal target parameters.

The problem is presented as follows. Assume a sample of N individuals divided into G groups (clusters), each with Ng units, g = 1, ...,G (i.e., N = ∑_{g=1}^{G} Ng). Let Tig denote the treatment of individual i ∈ g, and let Tg = (T1,g, ..., TNg,g) be the observed treatment program assigned to cluster g. The same notation applies to the outcome of interest, Yig and Yg. Also, let T(Ng) be the set of all possible allocations of length Ng. That is, if treatment is a binary random variable, Tg is one of the 2^{N_g} possible orderings of the allocation program in T(Ng). For each cluster g, we assume a set of counterfactuals Yg(⋅) = {Yg(tg) : tg ∈ T(Ng)}. Therefore, the observed set of outcomes Yg is, in fact, a function of the actual treatment program, i.e., Yg = Yg(Tg). Note that Yg = {Yig(Tg) : i ∈ g}, which means that the individual potential outcome for a unit i may depend not only on its own treatment but on the entire allocation program, i.e., Yig(Tg), as opposed to the classical RCM paradigm, where Yig(Tig).
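To see how interference inflates the counterfactual set, one can enumerate T(Ng) directly. A minimal sketch with a binary treatment and an illustrative cluster of size Ng = 3:

```python
from itertools import product

# With binary treatment, T(N_g) contains every 0/1 vector of length N_g,
# so each cluster has 2**N_g allocation programs (and potential outcomes Y_g(t_g)).
N_g = 3
allocations = list(product([0, 1], repeat=N_g))
print(len(allocations))     # 8 = 2**3
print(allocations[:3])      # [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
```

Under SUTVA the same cluster would need only two potential outcomes per unit; with interference the count grows exponentially in cluster size.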

In order to compute the missing counterfactuals and estimate the new target causal parameters of interest, SUTVA violation requires additional assumptions. First, for simplicity, the authors assume perfect compliance. Second, they resort to the assumption of partial interference (Sobel, 2006) to ensure that the outcome variable of unit i, Yi(⋅), depends only on the treatments assigned to other units belonging to the same cluster as unit i, i.e., i′ ∈ Ng, with i′ ≠ i. Third, a two-stage randomization program/treatment assignment is assumed: clusters are randomly assigned to an allocation scheme and then, within each cluster, units are randomly assigned to a specific treatment value. We do not go further into SUTVA violation in the RCM framework, but we do encourage the reader to consult the aforementioned references for the formal definitions of the parametric causal parameters and the estimation methods therein presented. In summary, the study of SUTVA violation within the RCM called for new definitions of, and estimators for, causal parameters.


Akin to the latter approach, Hong and Raudenbush (2003, 2005, 2006, 2013) and Verbitsky-Savitz and Raudenbush (2012) presented an extension of propensity score matching (PSM) techniques (Rosenbaum and Rubin, 1983) to the case in which data are hierarchically organized and the SUTVA assumption is violated. Building on the assumption of strong ignorability (exogeneity between the treatment assignment mechanism and the potential outcomes, given observed covariates and outcome), they resort to linear models as a means to estimate causal quantities. The idea is that in multilevel settings, units are causally connected not only through the clustering effect, but also because of possible interactions between units within each group.

We use the same notation presented above. Again, let Yig, Tig, Yg, Tg, T−i,g be the same random variables (vectors) described earlier. The multilevel PSM framework mimics the two-stage stratification and random treatment assignment assumptions already discussed. However, in this approach the potential outcome Yig(Tig, T−i,g) is assumed to be a linear function of its arguments. Also, given the multiplicity of possible values Tg and T−i,g can attain, the authors propose a function r : R^{2n} → R that summarizes the effect of the allocation program on the outcome of unit i, i.e., Yig(Tig, T−i,g) = Yig(Tig, r(T−i,g)). This approach is an improvement in the sense that it simplifies the analysis and reduces the number of arguments in the potential outcome function.
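The summarizing function r is left generic in the text. A common concrete choice (an illustrative assumption here, not necessarily the one used by these authors) is the proportion of treated peers:

```python
import numpy as np

# Hypothetical summarizing function r(T_{-i,g}): the share of unit i's
# cluster-mates that received treatment.
T_g = np.array([1, 0, 1, 1, 0])     # observed treatment program of cluster g
n = len(T_g)
r = np.array([(T_g.sum() - T_g[i]) / (n - 1) for i in range(n)])
print(r)    # peer exposure for each unit i in cluster g
```

Collapsing the 2^{n−1} possible peer-treatment vectors into this single scalar is exactly the dimensionality reduction the text describes.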

On the other hand, less has been proposed within the SCM framework. An initial approach was presented by van der Laan (2012); VanderWeele and An (2013); Ogburn and VanderWeele (2014); van der Laan (2014). The authors developed non-parametric TMLE estimators of causal quantities in settings where the SUTVA assumption is violated and the allocation program varies over time. Their approach, however, is based on the study of networks of causally connected units, which means that it assumes/requires a previously known network structure. Notwithstanding, we acknowledge the way the authors used causal diagrams (DAGs) and the rules of structural causal inference in order to identify and estimate causal parameters of interest in a non-experimental setting with interrelated units. We also remark that the TMLE method represents a fully non-parametric and flexible approach to the estimation of causal target parameters of interest (Pearl, 2009b).

In this thesis we use a hierarchical-SEM (HSEM) based approach to model SUTVA violation within the SCM, partly because it is easier to understand the causal mechanisms in M through the use of causal diagrams. Also, the linear SEM framework presented in Chapter 2 allows for a straightforward computation of direct and indirect effects. The estimated structural parameters give a notion of the magnitude and strength of the causal linkages between the random variables in the study, something that is missing in the non-parametric TMLE approach. Although this topic is not covered in this thesis, we believe that the computation of direct and indirect effects of cluster-based interventions and/or causally connected units can be extended from what was presented in Chapter 2 (i.e., using estimates of structural parameters) through the proposed HSEM framework. Finally, we do not assume a previously known network structure, mainly because in many Social Science studies this information is not available to the researcher.


Recall that within the SCM, given a causal model M, direct and indirect causal effects are computed using the linear SEM estimates of the structural parameters. However, in the context of SUTVA violation (i.e., clustered, non-independent units), conventional linear SEM estimators yield inefficient estimates of the parameters, and therefore the computation of direct and indirect effects would be distorted and possibly biased. If this clustered design is not considered in the model specification, then inferences on the parameter estimates will be wrong, since the standard errors of the coefficients will be underestimated (i.e., confidence intervals that are too short for hypothesis testing, leaving the researcher prone to type I errors). Also, multilevel modeling allows for estimating cluster-specific structural parameters (or random effects). These group-specific coefficients prove useful when the researcher has a substantive interest in group-specific treatment (causal) effects. That said, we believe that the core ideas of the hierarchical modeling approach in Hong's and Raudenbush's work provide a clear point of departure for modeling causal relationships between random variables in a non-experimental setting with causally connected units.

Throughout the rest of this thesis, we extend the notation presented for the SEM framework in Chapter 2. Let N be a sample of individuals clustered within G groups, each with Ng individuals (as above). We assume that every single unit i belongs to one, and only one, group g, i.e., no multiple membership is allowed. We add the subscript g to the notation in Chapter 2 to denote cluster membership, i.e., Yig, Xig, ξig, ηig denote the vectors of endogenous and exogenous measured and latent variables for unit i ∈ g, respectively. Given the multilevel design, we let units i and i′, with i, i′ ∈ g, be causally connected if their outcome variables share common causes varying at the group level. That is, i and i′ are causally connected if there exists a direct or indirect causal path between them, usually at the cluster level.

We extend the causal diagram framework introduced in Chapters 1 and 2 to a setting that graphically represents the hierarchical structure of the causal model M, using observed, non-experimental data. The extension of the path diagram uses the plate notation, first introduced by Rabe-Hesketh et al. (2004) in the context of multilevel linear models with latent variables. As an example, assume a two-level dataset with latent variables. In Figure 3.1, the outer plate represents a cluster g ∈ G, and each inner plate represents an individual i, i′ ∈ g. Units i and i′ are causally connected through the causal path ηig ← ηg → ηi′g (as in the SCM framework presented earlier). In this case, ηg d-separates ηig and ηi′g. The latter means that once we 'know' the value of the random variable varying at the cluster level, ηg, the variables varying at the individual level become independent, i.e., ηig ⊥ ηi′g ∣ ηg.
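The conditional independence ηig ⊥ ηi′g ∣ ηg can be checked numerically. The simulation below (illustrative loading of 0.9 and standard normal disturbances, an assumption of this sketch) shows a sizable marginal correlation between the two individual-level variables that essentially vanishes once ηg is regressed out:

```python
import numpy as np

rng = np.random.default_rng(1)
G, lam = 20000, 0.9

eta_g  = rng.normal(size=G)                  # cluster-level common cause eta_g
eta_i  = lam * eta_g + rng.normal(size=G)    # eta_ig
eta_ip = lam * eta_g + rng.normal(size=G)    # eta_i'g

# Open path eta_ig <- eta_g -> eta_i'g: marginal dependence
marginal = np.corrcoef(eta_i, eta_ip)[0, 1]

# Conditioning on eta_g (here, subtracting its contribution) blocks the path
partial = np.corrcoef(eta_i - lam * eta_g, eta_ip - lam * eta_g)[0, 1]

print(round(marginal, 2), round(partial, 3))   # marginal far from 0, partial near 0
```

This is d-separation in action: the only path connecting the two individual-level variables runs through the cluster-level variable.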

It is clear that HSEMs provide a way of specifying a statistical model structure when the SUTVA assumption is violated and the sample consists of N clustered, causally connected units. If a linear specification is imposed, each path in Figure 3.1 represents a structural parameter whose estimate can be used to compute direct and indirect causal effects of a cluster-level intervention of the form do(Xg = xg), or of individual-level interventions of the form do(Xig = xig). As explained in Chapter 2, HSEMs are also capable of estimating counterfactuals from non-experimental, hierarchical data. We introduce HSEMs and their estimation methods in the following subsections. However, we do emphasize that the


computation of direct or indirect causal effects in a multilevel setting is not presented inthis thesis.

[Figure 3.1 (plate diagram): the outer plate represents a cluster g; two inner plates represent individuals i and i′. The cluster-level latent variable ηg has arrows into the individual-level latent variables ηig and ηi′g, each of which is associated with covariates X1, ..., Xk.]

Figure 3.1. HSEM example introducing the plate notation.

3.1. Hierarchical Structural Equation Models (HSEM)

Hierarchical SEM (HSEM) is not a new concept. First attempts focused on estimating two separate models (for a two-level model) for the between- and within-cluster covariance matrices, as in Longford and Muthen (1992) and Muthen (1989, 1994). Almost simultaneously, a different approach was undertaken by Goldstein and McDonald (1988); McDonald and Goldstein (1989), and Lee (1990), who developed a joint ML estimation procedure for both the within-cluster and between-cluster covariance matrices of a two-level model. Despite these efforts, the latter papers assumed continuous data and (almost always) balanced designs. Later, Ansari et al. (2000) developed a Bayesian approach using MCMC methods, and Lee and Shi (2001) an ML approach using an MC-EM algorithm, though limited to a two-level factor model (with no structural part).

Shortly after, Rabe-Hesketh et al. (2004, 2007, 2012) overcame these limitations and formulated the Generalized Linear Latent and Mixed Modeling (GLLAMM) framework. Rabe-Hesketh and Skrondal (2012) show the correspondence between mixed modeling and multilevel modeling (Raudenbush and Bryk, 2002). The main difference from previous approaches to HSEM is that the estimation procedure in their model is IB rather than CB. GLLAMM-SEM is a maximum likelihood estimation procedure in which the likelihood function is optimized using first-order approximations and the Newton-Raphson algorithm for the measurement equations, and in which the latent variables are integrated out of the structural equations using adaptive quadrature. For further detail see Skrondal and Rabe-Hesketh (2004). We recall that the link between the CB-SEM approach and GLLAMM-SEM is presented in Rabe-Hesketh et al. (2007).

Conditional on the latent variables (belonging to higher or traversal levels), the response model (or measurement equation) in Rabe-Hesketh et al. (2004) closely follows the linear prediction equation in the Generalized Linear Model (GLM) framework of


McCullagh and Nelder (1989). The linear predictor is accompanied by a link function defined according to a distribution from the exponential family. In GLLAMM-SEM, the measurement equations depend on the assumed distribution of the observed responses, conditional on the latent variables and covariates. Also, in the structural part, the latent variables are regressed on other latent variables and (possibly) observed exogenous covariates. Exogenous disturbances allow for unobserved variability among the latent and observed variables. As in most SEMs, restrictions on the parameters are imposed to achieve identifiability.

More formally, let N individuals or observational units (level 1) be organized in independent clusters within L levels of hierarchy. At each level l = 2, ..., L we observe Gl clusters, each with nGl individuals. The subscript Gl indicates membership of an individual i in cluster Gl at level l. Given the full information up to level L, Rabe-Hesketh et al. assume independent clusters within level l, which is known as partial interference, or no interference between clusters (Sobel, 2006). We keep this key assumption throughout the rest of this thesis. We also assume that each observation belongs to one, and only one, cluster at any given level of the hierarchy; cases of multiple membership are not considered here.

As for the measurement equation, let ν_{j(l)} be the [i × p] × 1 vector of p responses for every unit i (level 1) belonging to unit-group j at level l. Also, let X_{j(l)} be an [i × p] × K matrix of exogenous variables at the individual level and β a K × 1 vector of parameters. Moreover, let Λ^{(l)}_{j(l)} and η^{(l)}_{j(l)} be the [i × p] × q(l) matrix of structural parameters (factor loadings) and the q(l) × 1 vector of latent variables varying at level l (following the superscript) for the cluster (or l-level unit) j(l) (following the subscript), respectively. For a unit (cluster) z(L) at the top level L, the measurement equation is

ν_{z(L)} = X_{z(L)} β + ∑_{l=2}^{L} Λ^{(l)}_{z(L)} η^{(l)}_{z(L)}    (3.1)

Note that, as Skrondal and Rabe-Hesketh (2004) point out, the superscript (l) in Λ^{(l)}_{z(L)} denotes that the matrix of structural coefficients is specific to the level-l latent variables. Equation (3.1) can be expressed more succinctly by replacing the sum with the appended matrix Λ_{z(L)} = [Λ^{(2)}_{z(L)}, ..., Λ^{(L)}_{z(L)}] and the stacked vector η_{z(L)} = [η^{(2)′}_{z(L)}, ..., η^{(L)′}_{z(L)}]′:

ν_{z(L)} = X_{z(L)} β + Λ_{z(L)} η_{z(L)}    (3.2)

A very important assumption in Skrondal and Rabe-Hesketh (2004) and Rabe-Hesketh et al. (2004) is that the latent variables varying at level l, η^{(l)}, follow a multivariate Gaussian distribution N[0, Σ^{(l)}], with Σ^{(l)} a positive semi-definite covariance matrix for each l = 2, ..., L (i.e., latent variables at the same level are not necessarily independent of each other). The authors also assume that latent variables at different levels are independent.

Now, conditional on the vector of realizations of the latent variables at all L levels, η^{(L)}, the vector of response variables for all individuals, a, is linked to the linear predictor, ν,


through a continuous and twice-differentiable function g, known as the link function:

g (E [a ∣ X,η(L)]) = ν (3.3)

Note that the specification in equation (3.3) allows for modeling both continuous and discrete response distributions from the exponential family.
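As a small illustration of (3.3), assuming a Bernoulli response (one of several exponential-family choices, not a restriction of the framework), the logit link maps the linear predictor onto a conditional mean in (0, 1):

```python
import numpy as np

nu = np.array([-2.0, 0.0, 2.0])      # illustrative linear predictor values
mean = 1.0 / (1.0 + np.exp(-nu))     # inverse logit: E[a | X, eta] = g^{-1}(nu)
print(np.round(mean, 3))             # probabilities, increasing in nu
```

Swapping the inverse link (identity for Gaussian responses, exp for counts) changes the response type without altering the latent structure.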

As for the latent variable equations, for each unit-cluster j at level l, Skrondal and Rabe-Hesketh (2004) assume a linear system of the form

ηj(l) = Bηj(l) +Γwj(l) + ζj(l) (3.4)

In equation (3.4), η_{j(l)} is the same vector as in (3.2), but for the jth l-level unit-cluster. B is a q × q upper-block-diagonal matrix of regression parameters, with q = ∑_{l=2}^{L} q_l; w_{j(l)} = [w^{(2)′}_{jk..z}, w^{(3)′}_{k..z}, ..., w^{(L)′}_{z}]′ is a vector of r = ∑_{l=2}^{L} r_l covariates; Γ is a q × r matrix of regression parameters; and ζ_{j(l)} is the vector of q disturbance terms. Recall that each element of ζ_{j(l)} varies at the same level l as the corresponding latent variable in η_{j(l)}. As shown in Skrondal and Rabe-Hesketh (2004), the extensive representation of equation (3.4) is

\begin{bmatrix} \eta^{(2)}_{jk..z} \\ \eta^{(3)}_{k..z} \\ \vdots \\ \eta^{(L)}_{z} \end{bmatrix}
=
\begin{bmatrix} B^{(22)} & B^{(23)} & \cdots & B^{(2L)} \\ 0 & B^{(33)} & \cdots & B^{(3L)} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & B^{(LL)} \end{bmatrix}
\begin{bmatrix} \eta^{(2)}_{jk..z} \\ \eta^{(3)}_{k..z} \\ \vdots \\ \eta^{(L)}_{z} \end{bmatrix}
+
\begin{bmatrix} \Gamma^{(22)} & \Gamma^{(23)} & \cdots & \Gamma^{(2L)} \\ 0 & \Gamma^{(33)} & \cdots & \Gamma^{(3L)} \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \Gamma^{(LL)} \end{bmatrix}
\begin{bmatrix} w^{(2)}_{jk..z} \\ w^{(3)}_{k..z} \\ \vdots \\ w^{(L)}_{z} \end{bmatrix}
+
\begin{bmatrix} \zeta^{(2)}_{jk..z} \\ \zeta^{(3)}_{k..z} \\ \vdots \\ \zeta^{(L)}_{z} \end{bmatrix}    (3.5)

Also, Skrondal and Rabe-Hesketh (2004) assume a recursive SEM (i.e., no feedback effects, as explained in Bollen, 1989); therefore each B^{(ll′)} and Γ^{(ll′)} is a matrix with diag(B^{(ll)}) = diag(Γ^{(ll)}) = 0. This disposition of B also implies that lower-level latent variables have no effect on higher-level latent variables.
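Because B is strictly upper-block-triangular, I − B is invertible and (3.4) has the reduced form η = (I − B)^{-1}(Γw + ζ); moreover, B is nilpotent, so the inverse is a finite Neumann sum. A sketch with hypothetical dimensions q = 3, r = 2 (all numbers illustrative):

```python
import numpy as np

# Strictly upper-triangular B: recursive model, no feedback, and lower-level
# latent variables never feed into higher-level ones.
B = np.array([[0.0, 0.3, 0.5],
              [0.0, 0.0, 0.4],
              [0.0, 0.0, 0.0]])
Gamma = np.array([[0.5, 0.0],
                  [0.0, 0.7],
                  [0.2, 0.1]])
w    = np.array([1.0, -1.0])
zeta = np.array([0.1, 0.0, -0.2])

# eta = B eta + Gamma w + zeta  =>  eta = (I - B)^{-1} (Gamma w + zeta)
eta = np.linalg.solve(np.eye(3) - B, Gamma @ w + zeta)

# B is nilpotent (B^3 = 0), so (I - B)^{-1} = I + B + B^2 exactly
eta_series = (np.eye(3) + B + B @ B) @ (Gamma @ w + zeta)
print(np.allclose(eta, eta_series))   # True
```

The finite series makes explicit that effects only propagate "downwards" through the hierarchy a bounded number of times.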

The estimation procedure for HSEM in the GLLAMM framework is based on maximum marginal likelihood estimation using adaptive quadrature rules. Let θ be the vector of all parameters to be estimated, including β, Λ_{z(l)}, the non-duplicated elements of Σ^{(l)} for l = 2, ..., L, B and Γ. Also let y and X be the response vector and the matrix of explanatory variables for all units in the sample. The conditional likelihood of y given the latent and observed variables is obtained after substituting the structural (latent variable) model into the response (measurement) model.

Let the conditional density of the response vector for an l-level unit be f^{(l)}(y^{(l)} ∣ X^{(l)}, ζ^{(l+)}; θ), where ζ^{(l+)} = (ζ^{(l)′}, ..., ζ^{(L)′})′ is the vector of latent variable errors at level l and above. Moreover, let the multivariate normal density of the latent variables at level l be h^{(l)}(ζ^{(l)}; θ). Therefore, the density for an l-level unit, conditional on the latent variables at level l + 1 and above, is:

f^{(l)}(y^{(l)} ∣ X^{(l)}, ζ^{(l+1+)}; θ) = ∫ h^{(l)}(ζ^{(l)}; θ) ∏ f^{(l−1)}(y^{(l−1)} ∣ X^{(l−1)}, ζ^{(l+)}; θ) dζ^{(l)},    (3.6)

where the product runs over the (l−1)-level units contained in the l-level unit.


Note the recursive nature of equation (3.6): the likelihood contribution of upper levels of the hierarchy is computed using information from lower levels. Furthermore, the total marginal likelihood is the product of the contributions of all highest-level (L) units:

L (θ;y,X) =∏ f (L) (y(L) ∣ X(L);θ) . (3.7)

The Newton-Raphson algorithm is used to maximize the marginal log-likelihood derived from (3.7). For a given set of parameters θ, the multivariate integral over the latent variables at level l, ζ^{(l)}, is evaluated numerically using an adaptive quadrature rule. The integral is evaluated over q_l independent, standard normally distributed latent variables v^{(l)}, with ζ^{(l)} = C_l v^{(l)}, where C_l is the Cholesky factor of Σ^{(l)}. Letting v^{(l+)} = (v^{(l)′}, ..., v^{(L)′})′, the integral can be approximated by the Cartesian product quadrature

∫ h^{(l)}(ζ^{(l)}; θ) ∏ f^{(l−1)}(y^{(l−1)} ∣ X^{(l−1)}, ζ^{(l+)}; θ) dζ^{(l)}
= ∫ φ(v^{(l)}_{q_l}) ⋯ ∫ φ(v^{(l)}_{1}) ∏ f^{(l−1)}(y^{(l−1)} ∣ X^{(l−1)}, v^{(l)}, v^{(l+1+)}; θ) dv^{(l)}_{1} ⋯ dv^{(l)}_{q_l}
≈ ∑_{r_{q_l}} π_{r_{q_l}} ⋯ ∑_{r_1} π_{r_1} ∏ f^{(l−1)}(y^{(l−1)} ∣ X^{(l−1)}, α_{r_1}, ..., α_{r_{q_l}}, v^{(l+1+)}; θ),

where φ(⋅) is the standard Gaussian density, the sum over r_k runs over the quadrature points for the kth latent variable at level l, and π_r and α_r are the quadrature weights and locations, respectively. The reader may refer to Rabe-Hesketh et al. (2002, 2004, 2005) and Skrondal and Rabe-Hesketh (2004) for further details on HSEM estimation within the GLLAMM framework.
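The building block of this approximation is ordinary Gauss-Hermite quadrature for integrals against a standard normal density (the adaptive version additionally shifts and scales the nodes per cluster). A minimal sketch for a single latent variable, using an illustrative integrand:

```python
import numpy as np

# Gauss-Hermite rule: integral of exp(-x^2) f(x) dx ~ sum_r w_r f(x_r).
# Rescale to integrate against the standard normal density phi(v).
nodes, weights = np.polynomial.hermite.hermgauss(15)
alpha = np.sqrt(2.0) * nodes            # quadrature locations for v ~ N(0, 1)
pi = weights / np.sqrt(np.pi)           # quadrature weights summing to 1

f = lambda v: v ** 2                    # illustrative integrand; E[v^2] = 1
approx = float(np.sum(pi * f(alpha)))
print(approx)                           # 1.0 (exact for polynomials of this degree)
```

In the multivariate case of the display above, the same rule is applied coordinate-wise, giving the Cartesian product of weights π_{r_1} ⋯ π_{r_{q_l}}.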

3.2. Bayesian Estimation of HSEM

HSEM estimation within the GLLAMM framework has some drawbacks. First, the latent variables are integrated out of the analysis and there is no direct computation of their values (although values can be assigned afterwards; see Skrondal and Rabe-Hesketh, 2004, Chapter 7). Also, the basic algorithm is computationally intensive and is not well suited for modeling nonlinear relationships between latent variables.

A Bayesian alternative to cope with these issues in HSEM was first presented in Song and Lee (2004). They built upon the work of Ansari and Jedidi (2000); Ansari et al. (2000); Dunson (2000), who presented Bayesian procedures for the estimation of factor (latent variable) models. However, these multilevel models did not accommodate cross-level effects, that is, causal information flowing from latent variables at the group level to latent variables at the individuals' level.

Lee and Tang (2006) and Song and Lee (2012a, Chapter 9) present an HSEM with cross-level effects that allows group characteristics to have a causal impact on the behavior of individuals. The authors present a two-level HSEM framework that is easily extended to an arbitrary number of levels L. We use a different notation from the one presented in the original papers in order to make the GLLAMM-HSEM and the Bayesian HSEM comparable. Let aig be a p × 1 vector of random variables for unit i belonging to group g. Each group (cluster) g = 1, ...,G comprises i = 1, ...,Ng individuals. Lee


and Tang (2006) and Song and Lee (2012a, Chapter 9) assume a measurement equation that relates the observed variables to the latent variables at the individual and group levels:

aig = Axig +Λ1ω1,ig +Λ2ω2,g + εig (3.8)

where xig is a k × 1 vector of exogenous variables for unit i and A a p × k matrix of fixed parameters. Moreover, let Λ1 be a p × q1 matrix of factor loadings and ω1,ig a q1 × 1 random vector of latent variables varying at the first level; Λ2 a p × q2 matrix of factor loadings for latent variables at level 2; ω2,g a q2 × 1 vector of second-level latent variables; and εig a p × 1 random vector of measurement errors with distribution N(0, Ψε). In this model, εig is assumed to be independent of both ω1,ig and ω2,g, and Ψε is assumed to be diagonal. For the sake of simplicity, the authors assume a factor model for the second-level measurement equations. Notwithstanding, the extension of (3.8) to observations at any level l ∈ L is straightforward. Let (l) denote the level at which the observed variables, latent variables, and factor loadings vary. Then, for an arbitrary unit i belonging to group j, up to group z at an arbitrary level l, equation (3.8) can be rewritten as

a_{ig,...,z} = ∑_{l=1}^{L} A^{(l)} x^{(l)}_{z} + ∑_{l=1}^{L} Λ^{(l)} ω^{(l)}_{z} + ε_{ig,...,z}    (3.9)

Now, for a two-level HSEM, let ω1,ig = (η′1,ig, ξ′1,ig)′ be a partition of ω1,ig, where η1,ig is a q11 × 1 vector of endogenous latent variables and ξ1,ig a q12 × 1 vector of exogenous latent variables, both at the first level of the hierarchy. Song and Lee (2012a, Chapter 9) (a more general version of Lee and Tang, 2006) consider the following (nonlinear) structural equation:

η1,ig = ΓF (ξ1,ig,ω2,g) + δ1,ig (3.10)

where F(ξ1,ig, ω2,g) = [f1(ξ1,ig, ω2,g), ..., fm(ξ1,ig, ω2,g)]′ is an m × 1 vector of known, nonzero, differentiable functions f1, ..., fm, with m ≥ max{q12, q2}, and Γ is a q11 × m matrix of unknown coefficients. In addition, ξ1,ig and δ1,ig are assumed to be distributed N(0, Φ1) and N(0, Ψδ), respectively, with Ψδ a diagonal matrix. Moreover, δ1,ig is assumed to be independent of ξ1,ig and ω2,g. Yet again, the extension of the structural equation system to an arbitrary number of levels L is straightforward.
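A hypothetical instance of (3.10), with illustrative dimensions q11 = q12 = q2 = 1 and F(ξ, ω2) = (ξ, ω2, ξω2)′, i.e. a cross-level interaction between an individual-level and a group-level latent variable (all coefficients below are made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)

Ng, G = 50, 30
Gamma = np.array([0.6, 0.4, -0.5])            # 1 x m coefficient matrix, m = 3

omega2 = rng.normal(size=G)                   # group-level latent variable omega_2,g
xi     = rng.normal(size=(G, Ng))             # individual-level exogenous latent xi_1,ig
delta  = rng.normal(scale=0.3, size=(G, Ng))  # individual-level disturbance delta_1,ig

# eta_1,ig = Gamma F(xi_1,ig, omega2_g) + delta_1,ig, with F = (xi, omega2, xi*omega2)'
F = np.stack([xi,
              np.broadcast_to(omega2[:, None], xi.shape),
              xi * omega2[:, None]])
eta1 = np.tensordot(Gamma, F, axes=1) + delta
print(eta1.shape)    # (30, 50): one endogenous eta per individual per group
```

The interaction term ξω2 is what makes the structural part nonlinear while keeping Γ linear in the unknowns, which is exactly what a known F buys.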

Due to the complexity of the correlation structure between the latent and manifest terms of this model, Lee and Tang (2006) and Song and Lee (2012a) draw on data augmentation techniques (Tanner and Wong, 1987) in their Bayesian algorithm. The idea behind this procedure is to complement the observed data with latent constructs drawn in previous MCMC steps. Traditional Gibbs sampler (Geman and Geman, 1984) and Metropolis-Hastings (Metropolis et al., 1953; Hastings, 1970) algorithms are used, similar to the scheme in Section 2.2. The reader may refer to the original papers for a deeper explanation of the Bayesian estimation of HSEM with cross-level effects.

Despite the advantages of this framework for modeling causal relationships between random variables (both manifest and latent) in the presence of causally connected units, one drawback is the assumption of a known set of functional forms F(⋅) in the structural part of the HSEM. We aim to overcome this issue by extending the Bayesian HSEM framework with a semi-parametric specification of the structural part of the model.


This extended version of the HSEM implicitly assumes unknown functional forms for thecausal relationships between latent variables, both at the same and higher levels.


CHAPTER 4

A Semi-Parametric Hierarchical Structural Equation Model (SPHSEM)

4.1. The observed random variables

The proposed model builds upon the developments presented in Song et al. (2013), Lee and Tang (2006), Song and Lee (2012a, Chapter 9) and Lee (2007), among others. First, consider a hierarchically structured dataset with an arbitrary number L of levels. Let aig be a p × 1 vector of random variables for an individual i = 1, ..., ng (level 1) belonging to a group g = 1, ..., G (level 2), which in turn can be part of another group in a higher hierarchy (level 3 onwards). Subscripts for other levels are omitted for ease of notation. Note that ng might differ across g, which means that this framework allows for unbalanced designs. For now, we assume a two-level setting, but the extension to a higher number of levels is straightforward.

Random variables in aig can be ordered categorical (zig), continuous (yig), count (vig),or unordered categorical variables (uig). Therefore, without loss of generality, assume

aig = (aig,1, ..., aig,p)′

= (zig,1, ..., zig,r1 , yig,r1+1, ..., yig,r2 , vig,r2+1, ..., vig,r3 , uig,r3+1, ..., uig,r4)′.

In order to provide a clear framework for the generalized SPHSEM, let a∗ig be an underlying vector defined as

a∗ig = (a∗ig,1, ..., a∗ig,p)′ = (z∗ig,1, ..., z∗ig,r1, y∗ig,r1+1, ..., y∗ig,r2, v∗ig,r2+1, ..., v∗ig,r3, w′ig,r3+1, ..., w′ig,r4)′,


which is linked to aig as follows:

zig,k = gk(z∗ig,k), k = 1, ..., r1,
yig,k = gk(y∗ig,k), k = r1 + 1, ..., r2,
κig,k = gk(v∗ig,k), k = r2 + 1, ..., r3,
uig,k = gk(wig,k), k = r3 + 1, ..., r4,

where κig,k = E(vig,k) for all k = r2 + 1, ..., r3, and the gk(⋅)'s are the threshold, identity, exponential and multinomial probit link functions, respectively.

Following Muthen (1984), Lee and Song (2003a) and Lee (2007), for ordered categorical variables, i.e. zig,k for k = 1, ..., r1, that take integer values in the set {0, 1, ..., Zk − 1}, gk(⋅) is the threshold link function defined as:

zig,k = gk(z∗ig,k) = ∑_{q=0}^{Zk−1} q × I_[αk,q, αk,q+1)(z∗ig,k), (4.1)

where I(⋅) is an indicator function that takes the value of 1 whenever z∗ig,k ∈ [αk,q, αk,q+1), and 0 otherwise. For each ordered categorical variable we define a set of thresholds −∞ = αk,0 < αk,1 < ⋯ < αk,Zk−1 < αk,Zk = +∞ that defines the Zk categories.
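In practice, the threshold link in (4.1) reduces to locating z∗ig,k among the sorted thresholds. A minimal numerical sketch (the function name and example thresholds are ours, not part of the model):

```python
import numpy as np

def threshold_link(z_star, alphas):
    """Threshold link g_k of eq. (4.1): returns the category q such that
    z_star lies in the half-open interval [alpha_q, alpha_{q+1})."""
    # alphas = (-inf, alpha_1, ..., alpha_{Z_k - 1}, +inf), strictly increasing
    return int(np.searchsorted(alphas, z_star, side="right") - 1)

# Z_k = 3 categories defined by thresholds -inf < 0 < 1.5 < +inf
alphas = np.array([-np.inf, 0.0, 1.5, np.inf])
cats = [threshold_link(z, alphas) for z in (-2.0, 0.7, 3.1)]  # -> [0, 1, 2]
```

`searchsorted` with `side="right"` makes the intervals half-open on the right, matching the indicator in (4.1).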

For continuous variables, i.e. yig,k ∈ R for k = r1 + 1, ..., r2, gk(⋅) is the identity link function, defined as:

yig,k = gk(y∗ig,k) = y∗ig,k. (4.2)

For count variables, i.e. vig,k for k = r2 + 1, ..., r3, it is commonly assumed that vig,k ∼ Poisson(κig,k). As in the generalized linear model framework presented by McCullagh and Nelder (1989), the mean κig,k is related to v∗ig,k through the log link, so that gk(⋅) is its exponential inverse:

log(κig,k) = g−1_k(E(vig,k)) = v∗ig,k, or equivalently

κig,k = E(vig,k) = gk(v∗ig,k) = exp(v∗ig,k). (4.3)

Finally, for the unordered categorical variables, i.e. uig,k for k = r3 + 1, ..., r4, we assume they take values in the set {0, 1, ..., Uk − 1}. For the sake of simplicity, and as in Song et al. (2013), we assume that Uk = U for all k. However, as the original authors state, this assumption can easily be relaxed. Following Imai and van Dyk (2005) and Song et al. (2007), uig,k is modeled in terms of an unobserved continuous multivariate normal random vector wig,k = (wig,k,1, ..., wig,k,Uk−1)′, such that:

uig,k = gk(wig,k) =
  0, if max(wig,k) ≤ 0,
  u′, if max(wig,k) = wig,k,u′ > 0. (4.4)

To clarify the idea, for an unordered categorical variable uig,k with three categories, we have wig,k = (wig,k,1, wig,k,2)′. If both wig,k,1 ≤ 0 and wig,k,2 ≤ 0, then uig,k takes the value of 0. Accordingly, if wig,k,1 > wig,k,2 and wig,k,1 > 0, then uig,k = 1. Finally, if wig,k,2 > wig,k,1 and wig,k,2 > 0, then uig,k = 2. The generalization to Uk − 1 categories is straightforward.
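The three-category example above can be sketched directly from (4.4); a minimal illustration (the function name is ours):

```python
import numpy as np

def probit_link(w):
    """Multinomial probit link g_k of eq. (4.4): w holds the U_k - 1 latent
    components; return 0 if none is positive, else the 1-based index of
    the positive maximum."""
    w = np.asarray(w, dtype=float)
    if w.max() <= 0:
        return 0
    return int(np.argmax(w)) + 1

# U_k = 3 categories, so w_igk is 2-dimensional, as in the worked example
assert probit_link([-0.3, -1.2]) == 0   # both components non-positive
assert probit_link([0.8, 0.1]) == 1     # w_1 is the positive maximum
assert probit_link([0.2, 1.5]) == 2     # w_2 is the positive maximum
```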

4.2. The measurement equations

Having defined the nature of the manifest random variables for every individual i in group g, we set up measurement equations consistent with the different types of observed data. We extend the semiparametric SEM in Song et al. (2013) by combining features of the multilevel SEM (with cross-level effects) considered in Lee and Tang (2006), Lee (2007), Song and Lee (2004, 2012a), and, to some extent, in Rabe-Hesketh et al. (2004), Skrondal and Rabe-Hesketh (2004) and Rabe-Hesketh et al. (2007).

First, we define an aggregated measurement equation for the underlying manifest random vector of the ordered categorical, continuous and count variables, a∗ig, related to the vector measured on an individual i belonging to group g, aig, through the link equations gk(⋅):

a∗ig = µ + ∑_{l=1}^{L} A(l)x(l)ig + ∑_{l=1}^{L} Λ(l)ω(l)ig + εig, (4.5)

in which, for every set k = 1, ..., ri defined in equations (4.1) to (4.4), µ = (µ1, ..., µpk)′ is a pk × 1 vector of intercepts; A(l) and Λ(l) are unknown pk × r(l)x and pk × q(l) parameter matrices for fixed and latent variables belonging to level l, respectively; x(l)ig is an r(l)x × 1 vector of exogenous variables; ω(l)ig is a q(l) × 1 vector of latent random variables; and εig is a pk × 1 random vector of measurement errors such that εig ∼ N [0, Ψε,k], where Ψε,k is a diagonal sub-matrix of dimension k in Ψε, and εig is independent of ω(l)ig at all levels.

By defining Λ(l) = [A(1), ..., A(L), Λ(1), ..., Λ(L)] and ωig,(l) = [x(1)′ig, ..., x(L)′ig, ω(1)′ig, ..., ω(L)′ig]′, equation (4.5) can be expressed more succinctly as

a∗ig = µ + Λ(l)ωig,(l) + εig. (4.6)

Given ωig,(l) and the parameters in the aggregated matrices in equation (4.6), it is clear that the measurement equations for a∗ig,k, k = 1, ..., pk, are given by:

z∗ig,k = µk +Λ′k,(l)ωig,(l) + εig,k for k = 1, ..., r1, (4.7)

y∗ig,k = µk +Λ′k,(l)ωig,(l) + εig,k for k = r1 + 1, ..., r2, (4.8)

v∗ig,k = µk +Λ′k,(l)ωig,(l) for k = r2 + 1, ..., r3, (4.9)

where µk are the corresponding intercepts in µ, Λk,(l) is a (∑_{l=1}^{L} r(l)x + ∑_{l=1}^{L} q(l)) × 1 vector corresponding to the kth row of the matrix Λ(l), and εig,k ∼ N[0, ψε,k], with ψε,k ∈ diag(Ψε). For the unordered categorical variables, the underlying vector wig,k, defined for each of the k = r3 + 1, ..., r4 observed uig,k's, has the following measurement equation:

wig,k = µk + 1Uk−1Λ′k,(l)ωig,(l) + εwig,k for k = r3 + 1, ..., r4, (4.10)

where 1Uk−1 is a (Uk − 1) × 1 vector of ones, µk is a vector of Uk − 1 intercepts, and εwig,k ∼ N[0, Ψε,wk] are error terms, independent of ωig, where Ψε,wk is a sub-matrix of Ψε.

With respect to equation (4.10), the probability mass function of uig,k is

p(uig,k = u′ ∣ µk, Λk,(l), ωig,(l), Ψε,wk) = ∫_{Ru′} ΦUk−1(wig,k; µk + 1Uk−1Λ′k,(l)ωig,(l), Ψε,wk) dwig,k, (4.11)

where ΦUk−1(⋅; µ, Σ) denotes the density function of a (Uk − 1)-variate normal random variable with mean µ and covariance matrix Σ, and

Ru′ = {wig,k : max(wig,k) ≤ 0}, for u′ = 0,
Ru′ = {wig,k : max(wig,k) = wig,k,u′ > 0}, for u′ = 1, ..., Uk − 1.

The latter means that the multivariate density function for wig,k is proportional to the truncated multivariate normal distribution with density function

ΦUk−1(wig,k; µk + 1Uk−1Λ′k,(l)ωig,(l), Ψε,wk) IRu′(wig,k),

where IRu′(⋅) is an indicator function that takes the value of 1 if wig,k ∈ Ru′ and 0 otherwise.

4.3. The structural equations

The structural equations proposed in this model follow closely those presented in Song et al. (2013) and resemble those in Skrondal and Rabe-Hesketh (2004) and Rabe-Hesketh et al. (2004). Consider a partition of ω(l)ig = (η(l)′ig, ξ(l)′ig)′, where η(l)ig = (η(l)ig,1, ..., η(l)ig,q(l)1)′ is a q(l)1 × 1 vector of endogenous (outcome) latent variables, and ξ(l)ig = (ξ(l)ig,1, ..., ξ(l)ig,q(l)2)′ is a q(l)2 × 1 vector of exogenous (explanatory) latent variables that are assumed to follow a multivariate normal distribution, ξ(l)ig ∼ N[0, Φ(l)]. As in the previous section, we allow a set of exogenous covariates to enter the structural equation, c(l)ig = (c(l)ig,1, ..., c(l)ig,m(l))′.

We propose the following general structural equation. For an arbitrary element η(l)ig,k in η(l)ig,

η(l)ig,k = ∑_{j1=1}^{m(l)} πk,j1(c(l)ig,j1) + ∑_{j2=1}^{q(l)2} fk,j2(ξ(l)ig,j2) + ∑_{l∗>l} [ ∑_{j∗1=1}^{m(l∗)} π∗k,j∗1(c(l∗)ig,j∗1) + ∑_{j∗2=1}^{q(l∗)} f∗k,j∗2(ω(l∗)ig,j∗2) ] + δ(l)ig,k, (4.12)

where πk,j(⋅), fk,j(⋅), π∗k,j(⋅) and f∗k,j(⋅) are unspecified smooth functions with continuous second-order derivatives, for k = 1, ..., q(l)1 and j ranging over the index sets j1, j∗1, j2 and j∗2 defined in equation (4.12). δ(l)ig,k is a random residual term that is assumed to follow a normal distribution N[0, ψ(l)δ,k], with ψ(l)δ,k ∈ diag(Ψ(l)δ), and to be independent of ξ(l)ig at all levels.

Even though the structural equation in (4.12) is quite general, it is important to bear in mind that endogenous latent variables at level l only depend on exogenous latent variables belonging to the same level or on latent variables belonging to higher levels l∗ > l. As stated by Skrondal and Rabe-Hesketh (2004, chapter 4): “it would not make sense to regress a higher level latent variable on a lower level latent variable or observed variable since this would force the higher level variable to vary at a lower level”. Also, although not explicitly read from equation (4.12), this structural equation setting is not intended for recursive structural equation models.

4.4. A Note on Bayesian P-splines

Following Song and Lu (2010) and Song et al. (2013), we consider Bayesian penalized splines as an initial approach for estimating the unknown functions πk,j(⋅), fk,j(⋅), π∗k,j(⋅) and f∗k,j(⋅) in equation (4.12). These functions can be modeled by a sum of basis splines (B-splines; De Boor, 1978) defined over a set of knots in their respective domains. For simplicity, let the structural equation be η(l)ig,k = f(ξ(l)ig,1) + δ(l)ig,k, a special case of (4.12), as in Song et al. (2013). Using B-splines, f(ξ(l)ig,1) is modeled as

f(ξ(l)ig,1) = ∑_{k=1}^{K} γk Bk(ξ(l)ig,1), (4.13)

where K is the number of knots (splines), with K ≤ N, N being the total number of observations (individuals or groups) belonging to level l; the γk's are unknown parameters, and the functions Bk(⋅) are uniformly continuous polynomial functions of appropriate order n (i.e. n ≤ K − 1) defined over the domain of ξ(l)ig,1. Song et al. argue that a natural choice for Bk(⋅) is the cubic B-spline, with the number of knots typically ranging from 10 to 60 in order to ensure enough flexibility (Song and Lu, 2012).
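The basis functions Bk(⋅) in (4.13) can be generated with the standard Cox–de Boor recursion. The sketch below assumes a clamped (repeated-boundary) knot vector on [0, 1]; the function name and knot layout are our own choices, not the thesis':

```python
import numpy as np

def bspline_basis(x, knots, degree=3):
    """Evaluate all B-spline basis functions of the given degree at points x
    via the Cox-de Boor recursion. `knots` must be non-decreasing, with the
    boundary knots repeated degree+1 times (a 'clamped' knot vector)."""
    x = np.atleast_1d(x)
    # degree-0 bases: indicators of the half-open knot intervals
    B = np.zeros((len(x), len(knots) - 1))
    for j in range(len(knots) - 1):
        B[:, j] = (x >= knots[j]) & (x < knots[j + 1])
    # raise the degree one step at a time
    for d in range(1, degree + 1):
        B_new = np.zeros((len(x), len(knots) - d - 1))
        for j in range(len(knots) - d - 1):
            dl = knots[j + d] - knots[j]
            dr = knots[j + d + 1] - knots[j + 1]
            if dl > 0:
                B_new[:, j] += (x - knots[j]) / dl * B[:, j]
            if dr > 0:
                B_new[:, j] += (knots[j + d + 1] - x) / dr * B[:, j + 1]
        B = B_new
    return B

# cubic basis with K = 10 basis functions on [0, 1]
interior = np.linspace(0.0, 1.0, 8)
knots = np.concatenate([[0.0] * 3, interior, [1.0] * 3])
x = np.linspace(0.0, 0.999, 50)         # stay inside the half-open intervals
B = bspline_basis(x, knots)
partition_ok = np.allclose(B.sum(axis=1), 1.0)  # B-splines sum to one
```

A fitted function value is then simply `B @ gamma` for a coefficient vector γ of length K.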

The main drawback of traditional cubic B-splines is that the basis functions Bk(⋅) are defined on a fixed finite interval. As the reader might realize, the realizations of ξ(l)ig,1 generated by the MCMC algorithm iterations might not always fall inside that fixed interval.


Song et al. (2013) propose to solve these difficulties by transforming the explanatory latent variables through the probit function, i.e. the authors model f(ξ(l)ig,1) as

f(ξ(l)ig,1) = ∑_{k=1}^{K} γk Bk(Φ∗(ξ(l)ig,1)), (4.14)

where Φ∗(⋅) is the cumulative distribution function of N[0, 1]. Equation (4.14) transforms the original scale of ξ(l)ig,1, (−∞, ∞), to the scale of the cumulative probability p(ξ ≤ ξ(l)ig,1), which is the closed interval [0, 1]. Since Φ∗(⋅) is a monotonically increasing function, the composite function Bk(Φ∗(⋅)) on the right-hand side of (4.14) allows for the same interpretation of the relationship between ξ(l)ig,1 and f(ξ(l)ig,1). This modeling technique has some advantages over the ones presented in Lang and Brezger (2004) and Song and Lu (2010), especially when it comes to defining the positions of the knots, computational simplicity, and efficiency.
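The probit transformation is easy to verify numerically: Φ∗ can be written in terms of the error function, and a quick check confirms it maps any latent draw into [0, 1] while preserving ordering (a sketch; the example draws are arbitrary):

```python
import math

def probit_cdf(x):
    """Standard normal CDF Phi* of eq. (4.14), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Phi* maps the latent scale (-inf, inf) monotonically into [0, 1], so the
# B-spline bases, defined on a fixed interval, can be evaluated at any MCMC
# draw of the latent variable without falling outside their support.
draws = [-8.0, -1.0, 0.0, 1.0, 8.0]
u = [probit_cdf(x) for x in draws]
monotone = all(a < b for a, b in zip(u, u[1:]))
```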

Despite their flexibility and robustness, B-splines are subject to over-fitting when a large number of knots is used. Eilers and Marx (1996) proposed a frequentist approach to the so-called P-splines, an extension of the B-spline model in which a penalty is imposed on the coefficients of adjacent B-splines to avoid over-fitting and regularize the problem. The coefficients are chosen to minimize the penalized criterion

∑_{i=1}^{ng} ( η(l)ig,k − ∑_{k=1}^{K} γk Bk(Φ∗(ξ(l)ig,1)) )² + β ∑_{k=t+1}^{K} (∆tγk)², (4.15)

where β is a smoothing parameter controlling the amount of penalty, and ∆tγk denotes the difference operator of order t applied to γk. Usually, first- or second-order differences are enough (Brezger and Lang, 2006). It is important to note that minimizing (4.15) essentially mirrors penalized likelihood estimation, as in Eilers and Marx (1996, 1998).

In matrix notation, equation (4.15) can be written as

∑_{i=1}^{ng} ( η(l)ig,k − ∑_{k=1}^{K} γk Bk(Φ∗(ξ(l)ig,1)) )² + βγ′Mγγ, (4.16)

where γ = (γ1, ..., γK)′ and Mγ is the penalty matrix associated with the second sum on the right-hand side of equation (4.15). For an explicit description of the penalty matrix Mγ, see Fahrmeir and Raach (2007).
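Since γ′Mγγ = ∑ (∆tγk)², the penalty matrix can always be built as Mγ = D′tDt, where Dt is the order-t difference matrix. A small sketch (the helper name is ours):

```python
import numpy as np

def penalty_matrix(K, t=1):
    """Build M_gamma = D_t' D_t, where D_t is the order-t difference matrix,
    so that gamma' M_gamma gamma = sum_k (Delta^t gamma_k)^2."""
    D = np.eye(K)
    for _ in range(t):
        D = np.diff(D, axis=0)  # each pass applies one first difference
    return D.T @ D

# for t = 1 and K = 5 this reproduces the tridiagonal matrix displayed for
# the first-order random-walk penalty in Section 5.1
M1 = penalty_matrix(5, t=1)
expected = np.array([[ 1., -1.,  0.,  0.,  0.],
                     [-1.,  2., -1.,  0.,  0.],
                     [ 0., -1.,  2., -1.,  0.],
                     [ 0.,  0., -1.,  2., -1.],
                     [ 0.,  0.,  0., -1.,  1.]])
```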

Within a Bayesian framework, the unknown parameters γk, k = 1, ..., K, are regarded as random variables and have to be assigned appropriate prior distributions. These are defined by replacing the difference penalty in equation (4.15) by its stochastic analogue, i.e.

∆tγk = ek, (4.17)


where the ek's are independently distributed as N[0, τk]. For example, for t = 1, equation (4.17) becomes γk = γk−1 + ek, and for t = 2, γk = 2γk−1 − γk−2 + ek. This is analogous to defining a prior distribution by specifying the conditional distribution of a particular parameter γk given its left and right neighbors (Lang and Brezger, 2004). Therefore, the conditional means of the γk's can be understood as locally linear or quadratic fits at a particular knot position. In this modeling approach, the amount of smoothness is controlled by the additional variance parameter τk, which corresponds to the inverse of the smoothing parameter β in the classical approach of equation (4.15).
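The two random-walk priors can be simulated directly from the recursions above; a sketch, assuming zero starting values for the first t coefficients (a choice of ours, standing in for the flat priors usually placed on them):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_walk_prior(K, t, tau, rng):
    """Simulate gamma_1, ..., gamma_K from the stochastic analogue of the
    difference penalty, Delta^t gamma_k = e_k with e_k ~ N(0, tau)."""
    gamma = np.zeros(K)
    e = rng.normal(0.0, np.sqrt(tau), size=K)
    for k in range(t, K):
        if t == 1:
            gamma[k] = gamma[k - 1] + e[k]                     # local level
        else:
            gamma[k] = 2 * gamma[k - 1] - gamma[k - 2] + e[k]  # local trend
    return gamma

g1 = random_walk_prior(200, t=1, tau=0.1, rng=rng)  # first-order walk
g2 = random_walk_prior(200, t=2, tau=0.1, rng=rng)  # second-order walk
```

Smaller τ produces visibly smoother coefficient paths, mirroring the role of a larger β in (4.15).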

Accordingly, equation (4.12) can be reformulated as follows:

η(l)ig,k = ∑_{j1=1}^{m(l)} ∑_{k=1}^{K_{c,j1}} γj1,k B^c_{j1,k}(c(l)ig,j1) + ∑_{j2=1}^{q(l)2} ∑_{k=1}^{K_{ξ,j2}} γj2,k B^ξ_{j2,k}(Φ∗(ξ(l)ig,j2))
+ ∑_{l∗>l} [ ∑_{j∗1=1}^{m(l∗)} ∑_{k=1}^{K_{c,l∗,j∗1}} γj∗1,k B^{c,l∗}_{j∗1,k}(c(l∗)ig,j∗1) + ∑_{j∗2=1}^{q(l∗)} ∑_{k=1}^{K_{ω,j∗2}} γj∗2,k B^{ω,l∗}_{j∗2,k}(Φ∗(ω(l∗)ig,j∗2)) ] + δ(l)ig,k, (4.18)

where Ka,b denotes the number of knots defined for the bth random variable of type a. For simplicity, and without loss of generality, throughout the rest of this document we assume that Ka,b = K for all a, b.

4.5. Identification Constraints

As is common in the SEM literature, the model proposed in equations (4.5) and (4.18) is not identified without imposing identifiability constraints on the model parameters. Song et al. (2013) discuss appropriate solutions to these identification issues. Common restriction practices in both generalized HSEM and NPSEM arise from:

1. Existence of ordered (z∗ig,k) and unordered (wig,k) categorical variables: Given that the scale is not defined for z∗ig,k nor for wig,k, the remaining parameters in equations (4.7) and (4.1), and (4.10) and (4.4), cannot be simultaneously estimated. For the ordered categorical case, Song et al. (2013) propose to fix αk,1 = Φ∗−1(f∗k,1) and αk,Zk−1 = Φ∗−1(f∗k,Zk−1), where Φ∗(⋅) is the standard normal distribution function, f∗k,1 is the frequency of the first category of z∗ig,k, and f∗k,Zk−1 is the cumulative frequency of the categories zig,k < Zk − 1 (see Shi and Lee, 2000). For the unordered categorical case, we fix the covariance matrix Ψε,wk = IUk, where IUk is an identity matrix of appropriate dimensions (see Dunson, 2000).

2. Measurement equation parameters: This type of identifiability issue has been well addressed before (see, for example, Bollen, 1989, and Lee, 2007). For example, in the measurement equations (4.7) to (4.10) (assuming no exogenous variables), if an arbitrary nonsingular matrix C is introduced such that y∗ig,k = µk + Λ′k,(l)ωig,(l) + εig,k = µk + Λ′k,(l)CC−1ωig,(l) + εig,k = µk + Λ∗′k,(l)ω∗ig,(l) + εig,k, with Λ∗′k,(l) = Λ′k,(l)C and ω∗ig,(l) = C−1ωig,(l), then a definite solution cannot be estimated. Therefore, as is common in the SEM literature, we overcome this issue by fixing appropriate elements of Λk,(l) such that the only nonsingular matrix C that satisfies the imposed conditions is the identity matrix. This is usually achieved by fixing one factor loading at 1 (in order to introduce a scale for the corresponding latent variable), and/or by imposing a non-overlapping structure on Λ, i.e. only one latent variable enters the measurement equation for item k.

3. Unknown functions in the structural equation: The unknown functions in the structural equations are identified only up to a constant; i.e., adding and subtracting an arbitrary constant c in equation (4.12), say π∗k,1(⋅) = πk,1(⋅) + c and π∗k,2(⋅) = πk,2(⋅) − c, yields the same result and thus an unidentified model. Following Song and Lu (2010), restrictions are imposed on every unknown function at each MCMC iteration such that, for every p in j1, j2, j∗1 and j∗2, and for all k = 1, ..., q(l)1, the constraint ∑_{g=1}^{G} ∑_{i=1}^{ng} fk,p(κig,p) = 0 holds for the appropriate f's and κ's. The latter can be equivalently formulated in matrix notation as 1′N Fk,p = 0, where Fk,p = (fk,p(κ11,p), ..., fk,p(κnGG,p))′ and 1N is an N × 1 vector of ones, with N = ∑_{g=1}^{G} ng. After accounting for the nonparametric specification, the constraint is also equivalent to 1′N Bpγp = 0, where γp is a K × 1 vector of spline parameters and Bp is an N × K matrix, each of whose N rows is defined as [Bp,1(κig,p), ..., Bp,K(κig,p)], for i = 1, ..., ng and g = 1, ..., G; its elements are the B-spline bases of natural cubic splines. The extension to exogenous variables is straightforward. After restricting the mean of each function in (4.12) to be zero, the additive structural model is fully identified.

We recall that these are only sufficient but not necessary conditions for identification.
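The zero-mean constraint on each unknown function is easy to impose in practice: since 1′NBpγp is N times the sample mean of the fitted function values, recentering those values enforces it exactly. A small sketch with simulated stand-ins for Bp and γp (both are our own illustration data):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for a B-spline design matrix B_p (N x K) and coefficients gamma_p
N, K = 100, 10
B = rng.random((N, K))
gamma = rng.normal(size=K)

f = B @ gamma               # the unknown function evaluated at all N points
f_centered = f - f.mean()   # enforces sum_{g,i} f(kappa_ig) = 0
constraint = abs(np.ones(N) @ f_centered)   # 1_N' F = 0 up to rounding error
```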


CHAPTER 5

Bayesian Estimation of the SPHSEM

We shall use a Bayesian framework to analyze the proposed model. The advantages of Bayesian techniques include, but are not limited to, i) the use of prior information about the model parameters in addition to that provided by the data itself, ii) the power to deal with complex model structures (e.g. intractable integrals) via simulation, and iii) the ability to provide reliable results even with small sample sizes. In subsection 5.1 we first describe the prior distributions chosen for the model parameters introduced in subsections 4.1 to 4.4. We then present the posterior distributions in subsection 5.2.

Let ag = (a1g, ..., angg) be the observed data nested in the gth group, and a = (a1, ..., aG) be the overall observed data for all G groups. Also let a∗g = (a∗1g, ..., a∗ngg) and a∗ = (a∗1, ..., a∗G) be the underlying data nested in the gth group and the overall underlying data, respectively. Rows in a∗ are linked to those in a through a = g(a∗), where g(⋅) is a piecewise function consisting of the link functions described in equations (4.1) to (4.4).

Let θm be the set of parameters associated with the measurement and structural equations in (4.7) to (4.10) and (4.12), respectively, i.e. θm = {µ, Λ(l), Ψε, Ψδ, Φ}, where Ψδ = {Ψ(1)δ, ..., Ψ(L)δ} and Φ = {Φ(1), ..., Φ(L)}. Let θs be the set of parameters associated with the nonparametric equations, i.e. θs = {γ, τ}, where γ = {γj1, γj2, γj∗1, γj∗2}, in which γj1 = {γ1, ..., γm(l)}, γj2 = {γ1, ..., γq(l)2}, and γj∗1 = {γ1, ..., γm(l∗)}, γj∗2 = {γ1, ..., γq(l∗)} for every l∗ > l; and τ = {τj1, τj2, τj∗1, τj∗2}, with vectors τj1 = {τ1, ..., τm(l)}, τj2 = {τ1, ..., τq(l)2}, and τj∗1 = {τ1, ..., τm(l∗)}, τj∗2 = {τ1, ..., τq(l∗)} for every l∗ > l. Let α be the set of thresholds that define the values of the ordered categorical variables, i.e. α = {α1, ..., αr1}. Finally, let θ = {θm, θs, α} be the set of all parameters in the SPHSEM.

The Bayesian estimates of θ can be obtained by taking the mean of a sufficiently large number of random samples from the posterior density of θ given the observed data a, which is proportional to the product

p(θ ∣ a) ∝ p(a ∣ θ) p(θ), (5.1)

where p(a ∣ θ) is the likelihood function of a (given the parameters θ) and p(θ) is the prior density of the model parameters. Bear in mind that both the posterior distribution and the likelihood function in equation (5.1) are very complicated (and sometimes intractable) due to the presence of latent variables, discrete data, and a complex model structure. These features complicate the Bayesian analysis of the observed-data posterior on the left-hand side of equation (5.1). Therefore, data augmentation techniques as in Tanner and Wong (1987) are used to overcome the difficulties related to the posterior analysis. We follow the approach of Lee (2007) and Song and Lee (2012a), as is common in the Bayesian SEM literature.

Let z∗g = {z∗1g, ..., z∗ngg} be the set of underlying data related to the ordered categorical variables zig for all ng individuals belonging to the gth group, and Z∗ = {z∗1, ..., z∗G} be the underlying data linked to the complete sample's observed ordered categorical variables. Also let wig = {wig,r3+1, ..., wig,r4}, wg = {w1g, ..., wngg} and W = {w1, ..., wG} be the underlying vectors for the observed unordered categorical variables for individual i, group g and the complete dataset of N observations, respectively. Finally, let Ω(l)g = {ω(l)1g, ..., ω(l)ngg} and Ω(l) = {Ω(l)1, ..., Ω(l)G} be the sets of the l-level (both endogenous and exogenous) latent variables for group g and the aggregated latent variables for all N observations, respectively. There are as many Ω(l) sets as available levels L, such that Ω(L) = {Ω(1), ..., Ω(L)}.

In the data augmentation procedure, the observed data a are augmented with {Z∗, W, Ω(L)} to produce the complete dataset {a, Z∗, W, Ω(L)}, which is used to evaluate the augmented posterior distribution p(θ ∣ Z∗, W, Ω(L), a) through MCMC methods, namely the Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970).

That said, departing from the joint probability distribution of the observed, underlying, and latent random variables and the model parameters, p(θ, a, Z∗, W, Ω(L)), and by following Bayes' rule, the posterior distribution in equation (5.1) can be further expressed as

p(θ ∣ a, Z∗, W, Ω(L)) ∝ p(a, Z∗, W, Ω(L) ∣ θ) p(θ). (5.2)

By assuming that the prior distributions p(θ) are independent across parameter blocks (Shi and Lee, 1998), and by repeatedly applying Bayes' rule, equation (5.2) can be further expanded as

p(θ ∣ a, Z∗, W, Ω(L)) ∝ p(a, Z∗, W, Ω(L) ∣ θ) p(θ)
= p(a, Z∗, W, Ω(L) ∣ θ) p(θm) p(θs) p(α)
= p(a, Z∗, W ∣ θ, Ω(L)) p(Ω(L) ∣ θ) p(θm) p(θs) p(α)
= p(Z∗, W ∣ θ, Ω(L), a) p(a ∣ θ, Ω(L)) p(Ω(L) ∣ θ) p(θm) p(θs) p(α)
= p(Z∗ ∣ θ, Ω(L), a) p(W ∣ θ, Ω(L), a) p(a ∣ θ, Ω(L)) p(Ω(L) ∣ θ) p(θm) p(θs) p(α). (5.3)

Equation (5.3) can be further presented as the product of several independent distributions for random variables varying at different levels, groups and individuals. Recall that in a multilevel setting, individuals i, i′ belonging to a group g become independent once we ‘control for’ the appropriate random variables varying at group g and higher levels.


Also, we assume that the prior distributions within θm, θs and α are independent. Therefore, after rearranging some terms, the right-hand side of equation (5.3) can be expressed as follows (we omit the variables we condition on, which are the same as in eq. 5.3):

∝ ∏_{l=1}^{L} ∏_{g=1}^{G} ∏_{i=1}^{ng} { [ ∏_{k=1}^{r1} p(z∗ig,k, αk ∣ µk, Λk,(l), ωig,(l), ψε,k, zig) ] × p(wig ∣ µ, Λ(l), Ψε, uig) × p(ωig ∣ γ, Λ(l), Ψ(l)δ, Φ(l)) × p(aig ∣ z∗ig, wig, θm, Ω(L)) }
× p(Ψδ) × p(Ψε) × p(Φ) × p(µ) × [ ∏_{k=1}^{r2} p(Λk,(l) ∣ ψε,k) ∏_{k=r2+1}^{r4} p(Λk,(l)) ] × p(γ ∣ τ) × p(τ). (5.4)

As in Lee and Zhu (2000) and Shi and Lee (1998), we can sample from the distribution above by implementing an MCMC algorithm. Samples from the joint distribution p(θ, Z∗, W, Ω(L) ∣ a) can be obtained from the posterior distributions that result from concatenating terms in equation (5.4) into simpler, conventional, and more general distributions. First, the Gibbs sampler (Geman and Geman, 1984) requires setting arbitrary initial values (θ(0), Z∗(0), W(0), Ω(0)(L)). We then simulate (θ(m), Z∗(m), W(m), Ω(m)(L)) from the distributions above, for m = 1, ..., T, following the proposed algorithm:

Algorithm 5.1:

1. Generate (Z∗(m+1), α(m+1)) from p(Z∗, α ∣ θ(m), Ω(m)(L), Z)

2. Generate W(m+1) from p(W ∣ θ(m), Ω(m)(L), u)

3. Generate Ω(m+1)(L) from p(Ω(L) ∣ θ(m), W(m+1), Z∗(m+1), a)

4. Generate θ(m+1) from p(θ ∣ W(m+1), Z∗(m+1), Ω(m+1)(L), a). Due to its complexity, step 4 is further decomposed into:

4.1. Generate µ(m+1) from p(µ ∣ W(m+1), Z∗(m+1), Ω(m+1)(L), a, Λ(m)(l), Ψ(m)ε)

4.2. Generate Λ(m+1)(l) from p(Λ(l) ∣ W(m+1), Z∗(m+1), Ω(m+1)(L), a, µ(m+1), Ψ(m)ε)

4.3. Generate Ψ(m+1)ε from p(Ψε ∣ W(m+1), Z∗(m+1), Ω(m+1)(L), a, µ(m+1), Λ(m+1)(l))

4.4. Generate Ψ(m+1)δ from p(Ψδ ∣ W(m+1), Z∗(m+1), Ω(m+1)(L), a, γ(m), τ(m))

4.5. Generate Φ(m+1) from p(Φ ∣ Ω(m+1)(L))

4.6. Generate γ(m+1) from p(γ ∣ Ω(m+1)(L), a, Ψ(m+1)δ, τ(m))

4.7. Generate τ(m+1) from p(τ ∣ γ(m+1))

Algorithm 5.1 is cycled T times. Under mild regularity conditions, Geman and Geman (1984) showed that for sufficiently large T, (θ(T), Z∗(T), W(T), Ω(T)(L)) can be regarded as a realization of the posterior distribution in equation (5.4). Recall that Z∗ and W play a key role in the Gibbs algorithm described above because, when their values are given, the model becomes simpler and the estimation effort is reduced due to the presence of underlying continuous observations (a∗).
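The cycling structure of Algorithm 5.1 can be illustrated on a deliberately simple model. The sketch below is not the SPHSEM sampler; it is a toy two-block Gibbs sampler for y ∼ N(µ, 1/prec) with conjugate priors of our own choosing, showing how each block is drawn conditional on the latest value of the other:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy conjugate model: mu ~ N(0, 100), prec ~ Gamma(1, 1) (shape/rate)
y = rng.normal(2.0, 1.0, size=500)
n, T = len(y), 2000
mu, prec = 0.0, 1.0          # arbitrary initial values, as in Algorithm 5.1
mu_draws = np.empty(T)
for m in range(T):
    # block 1: mu | prec, y  (normal full conditional)
    post_var = 1.0 / (n * prec + 1.0 / 100.0)
    mu = rng.normal(post_var * prec * y.sum(), np.sqrt(post_var))
    # block 2: prec | mu, y  (gamma full conditional; numpy takes scale = 1/rate)
    rate = 1.0 + 0.5 * np.sum((y - mu) ** 2)
    prec = rng.gamma(shape=1.0 + n / 2.0, scale=1.0 / rate)
    mu_draws[m] = mu
posterior_mean = mu_draws[T // 2:].mean()   # discard the first half as burn-in
```

The SPHSEM sampler cycles analogously through steps 1 to 4.7, each full conditional playing the role of one of these blocks.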

In subsection 5.1 we describe the prior distributions defined for the parameters in θ, andin 5.2 we describe in detail the posterior distributions involved in algorithm 5.1. Furtherdetails on the derivation of the posterior distributions can be found in the Appendixsection.

5.1. Prior Distributions

In fully Bayesian settings, a relevant issue is the appropriate specification of prior distributions for the unknown parameters and underlying random variables.

We first consider prior distributions for the basis parameters in the P-spline equations, γ. Fahrmeir and Raach (2007) show that the stochastic analogues of the penalty on the B-spline coefficients (eq. 4.17) follow a Gaussian distribution, i.e. for each element in γj1, γj2 and γj∗1, γj∗2 with l∗ > l, we have that

p(γj1 ∣ τj1) = ∏_{k=1}^{K} p(γj1,k ∣ γj1,k−1, τj1) ∝ exp{ −(1/(2τj1)) ∑_{k=2}^{K} (γj1,k − γj1,k−1)² } = exp{ −(1/(2τj1)) γ′j1 Mγj1 γj1 }, (5.5)

for every j1 = 1, ..., m(l) and for an arbitrary differentiation order t. The same goes for every element in j2 and in j∗1, j∗2, for l∗ > l. When the differentiation order is t = 1, the K × K penalty matrix Mγj1 is defined as

Mγj1 =
⎡  1 −1             ⎤
⎢ −1  2 −1          ⎥
⎢     ⋱  ⋱  ⋱       ⎥
⎢       −1  2 −1    ⎥
⎣          −1  1    ⎦ [K × K]

Moreover, given the identification constraint 1′N Bj1γj1 = 0, the prior distribution in equation (5.5) becomes a truncated Gaussian distribution (for arbitrary t):

p(γj1 ∣ τj1) = (1/(2πτj1))^{K∗j1/2} exp{ −(1/(2τj1)) γ′j1 Mγj1 γj1 } I(1′N Bj1γj1 = 0), (5.6)

where K∗j1 = rank(Mγj1) and I(⋅) is an indicator function. The specification in equation (5.6) is the same for every j1 = 1, ..., m(l), j2 = 1, ..., q(l)2, j∗1 = 1, ..., m(l∗) and j∗2 = 1, ..., q(l∗), for l∗ > l.

Second, for all the smoothing parameters in τ, we assume highly dispersed but proper (conjugate) priors. Following Song and Lu (2010), for every p in j1, j2, j∗1 and j∗2, for l∗ > l, we assign gamma priors to the precision parameters (τ−1p), given by

p(τ−1p) D= Gamma[αγ0, βγ0], (5.7)

where αγ0 and βγ0 are shape and rate hyperparameters with fixed preassigned values. In order to achieve highly dispersed priors, we set αγ0 = 1 and βγ0 = 0.005, common values in the literature (see, for example, Song et al., 2013).
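One practical detail worth checking: eq. (5.7) uses the shape/rate parameterization, while NumPy's gamma sampler takes shape and scale, so the rate must enter as its reciprocal. A sketch of drawing from this prior:

```python
import numpy as np

rng = np.random.default_rng(3)

# tau_p^{-1} ~ Gamma[alpha_0 = 1, beta_0 = 0.005] (shape/rate) of eq. (5.7);
# NumPy's gamma sampler takes shape/SCALE, so scale = 1 / beta_0.
alpha0, beta0 = 1.0, 0.005
prec_draws = rng.gamma(shape=alpha0, scale=1.0 / beta0, size=200_000)
prior_mean = prec_draws.mean()   # theoretical mean is alpha0 / beta0 = 200
```

The large prior mean and variance reflect the "highly dispersed" intent of these hyperparameter values.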

Third, for all the structural parameters in θm, we consider the following conjugate priors:

p(µk) D= N[µk0, σ²k0] for k = 1, ..., r3 (5.8)
p(µk) D= N[µk0, Hµk0] for k = r3 + 1, ..., r4 (5.9)
p(Λk,(l) ∣ ψε,k) D= N[Λk0, ψε,k HΛk0] and (5.10)
p(ψε,k) D= InvGamma[αΛk0, βΛk0] for k = 1, ..., r2 (5.11)
p(Λk,(l)) D= N[Λk0, HΛk0] for k = r2 + 1, ..., r4 (5.12)
p(ψ(l)δ,k) D= InvGamma[αδk0, βδk0] for k = 1, ..., q(l)1 and l = 1, ..., L (5.13)
p(Φ(l)) D= InvWishart[R0, ρ0] for l = 1, ..., L (5.14)

where µk0, σ²k0, µk0, Λk0, αΛk0, βΛk0, αδk0, βδk0, ρ0 are appropriate hyperparameters, and Hµk0, HΛk0, R0 are positive semidefinite matrices whose values are assumed to be given by the prior information.

Lastly, in order to reflect the uncertainty related to these parameters, we assign a non-informative prior distribution to the thresholds that define each ordered categorical variable, αk. Given that αk,1 and αk,Zk−1 are fixed for every k = 1, ..., r1, we consider the prior distribution for αk,2 < ⋯ < αk,Zk−2 as follows:

p(αk) = p(αk,2, ..., αk,Zk−2) ∝ c, (5.15)

with c a fixed, arbitrary constant. Even though this prior distribution is improper, the joint conditional distribution (of the thresholds and underlying continuous variables) is proper, so they can be sampled within the MCMC procedure. We now explore the features of the posterior distributions, obtained by substituting equations (5.8) to (5.15) into (5.4), after some algebra. Details can be found in the Appendix.

5.2. Posterior Inference

We start with step 1 in algorithm 5.1. The joint posterior distribution in this step can be further decomposed as the product p(Z*, α ∣ θ^{(m)}, Ω^{(m)}_{(L)}, Z) = p(Z* ∣ α, θ^{(m)}, Ω^{(m)}_{(L)}, Z) p(α ∣ θ^{(m)}, Ω^{(m)}_{(L)}, Z). The first posterior distribution, p(Z* ∣ α, θ, Ω_{(L)}, Z), can in turn be expressed as the product of several independent



random variables, as follows. Let Z*_k = (z*_{11,k}, ..., z*_{n_G G,k})′ be the vector of all N underlying observations of z_{ig,k}, for every k = 1, ..., r_1; therefore, p(Z* ∣ α, θ, Ω_{(L)}, Z) = ∏_{k=1}^{r_1} p(Z*_k ∣ α_k, θ, Ω_{(L)}, Z). Given the appropriate current-level and higher-level explanatory variables (both observed and latent), and the model parameters involved in the conditional distribution, we end up with independent groups and individuals, such that

  \prod_{k=1}^{r_1} p(\mathbf{Z}^*_k \mid \alpha_k, \theta, \Omega_{(L)}, \mathbf{Z}) = \prod_{k=1}^{r_1} \prod_{g=1}^{G} \prod_{i=1}^{n_g} p(z^*_{ig,k} \mid \mu_k, \Lambda_{k,(l)}, \omega_{ig,(l)}, \psi_{\varepsilon,k}, \alpha_{k,z_{ig,k}}, \alpha_{k,z_{ig,k}+1}, z_{ig,k})   (5.16)

holds true, with p(z*_{ig,k} ∣ ⋅) D= N[μ_k + Λ′_{k,(l)} ω_{ig,(l)}, ψ_{ε,k}] I_{[α_{k,z_{ig,k}}, α_{k,z_{ig,k}+1})}(z*_{ig,k}), where I_A(⋅) is an indicator function that takes the value 1 if z*_{ig,k} ∈ A, with A = [α_{k,z_{ig,k}}, α_{k,z_{ig,k}+1}).

Now, for the second posterior distribution, p(α ∣ θ, Ω_{(L)}, Z), we also assume that it can be expressed as the product of a series of independent distributions, one for each of the r_1 ordered categorical variables, i.e. p(α ∣ ⋅) = ∏_{k=1}^{r_1} p(α_k ∣ θ, Ω_{(L)}, Z_k), with Z_k = (z_{11,k}, ..., z_{n_G G,k})′ the vector of all N observations of z_{ig,k}, and where

  p(\alpha_k \mid \theta, \Omega_{(L)}, \mathbf{Z}_k) \propto \prod_{g=1}^{G} \prod_{i=1}^{n_g} \Big[ \Phi^*\big(\psi_{\varepsilon,k}^{-1/2}[\alpha_{k,z_{ig,k}+1} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big) - \Phi^*\big(\psi_{\varepsilon,k}^{-1/2}[\alpha_{k,z_{ig,k}} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big) \Big]   (5.17)

for k = 1, ..., r_1. Recall that Φ*(⋅) is defined as the cumulative distribution function of a standard Gaussian distribution. A short explanation of the derivation of this distribution is in the Appendix section. Combining equations (5.16) and (5.17), we have that

  p(\alpha_k, \mathbf{Z}^*_k \mid \cdot) \propto \prod_{g=1}^{G} \prod_{i=1}^{n_g} \phi\big(\psi_{\varepsilon,k}^{-1/2}[z^*_{ig,k} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big)\, I_{[\alpha_{k,z_{ig,k}}, \alpha_{k,z_{ig,k}+1})}(z^*_{ig,k})   (5.18)

with φ(⋅) being the standard normal density. As in Lee and Zhu (2000), we sample joint realizations for (α, Z*), as it is more efficient. However, note that the posterior distribution in equation (5.18) is not standard, and therefore we sample from it using the MH algorithm, as described later in this chapter.

In a similar fashion, the posterior distribution for the underlying vectors of the unordered categorical variables in step 2, W, can be expressed as p(W ∣ θ, Ω_{(L)}, a) = ∏_{k=r_3+1}^{r_4} p(W_k ∣ θ, Ω_{(L)}, a), with W_k = (w′_{11,k}, ..., w′_{n_G G,k})′, for k = r_3 + 1, ..., r_4. It is clear that, following the measurement equation in (4.10), and given the appropriate current- and higher-level explanatory variables and the model's parameters θ, the latter product of probability distribution functions can be further expressed as

  \prod_{k=r_3+1}^{r_4} p(\mathbf{W}_k \mid \theta, \Omega_{(L)}, \mathbf{a}) = \prod_{k=r_3+1}^{r_4} \prod_{g=1}^{G} \prod_{i=1}^{n_g} p(\mathbf{w}_{ig,k} \mid \boldsymbol{\mu}_k, \Lambda_{k,(l)}, \omega_{ig,(l)}, u_{ig,k})



with p(w_{ig,k} ∣ u_{ig,k} = u′, ⋅) D= N[𝝁_k + 1_{U_k−1} Λ′_{k,(l)} ω_{ig,(l)}, I_{U_k−1}] I_{R_{u′}}(w_{ig,k}), where I_{U_k−1} is an identity matrix of order U_k − 1 and, again, I_A(⋅) is an indicator function that takes the value 1 whenever w_{ig,k} ∈ R_{u′}, with the (U_k − 1)-dimensional vector space R_{u′} defined as

  R_{u'} = \begin{cases} \{\mathbf{w}_{ig,k} : \max(\mathbf{w}_{ig,k}) < 0\} & \text{if } u_{ig,k} = u' = 0 \\ \{\mathbf{w}_{ig,k} : \max(\mathbf{w}_{ig,k}) = w_{ig,k,u'} > 0\} & \text{if } u_{ig,k} = u' = 1, ..., U_k - 1. \end{cases}

We shall bear in mind that the posterior distribution in equation (5.19) is a truncated multivariate normal distribution. We follow the approach of Song et al. (2007) and Song et al. (2013) (who in turn follow the algorithm in Robert, 1995) to obtain samples from this distribution. The authors simulate partitioned variables using the Gibbs sampler. Let w_{ig,k,−u′} be the vector w_{ig,k} with w_{ig,k,u′} = max(w_{ig,k}) deleted, i.e. a new vector of size (U_k − 2) × 1. The distribution of w_{ig,k,u′} given u_{ig,k} = u′, w_{ig,k,−u′}, ω_{ig,(l)} and the appropriate parameters in θ is a univariate truncated normal distribution defined as:

  p(w_{ig,k,u'} \mid \cdot) \overset{D}{=} \begin{cases} N[\mu_{k,u'} + \Lambda'_{k,(l)}\omega_{ig,(l)}, 1]\, I(w_{ig,k,u'} \geq \max\{\mathbf{w}_{ig,k,-u'}, 0\}) & \text{if } u_{ig,k} = u' \\ N[\mu_{k,u'} + \Lambda'_{k,(l)}\omega_{ig,(l)}, 1]\, I(w_{ig,k,u'} < \max\{\mathbf{w}_{ig,k,-u'}, 0\}) & \text{if } u_{ig,k} \neq u' \end{cases}   (5.19)

where μ_{k,u′} is the component of 𝝁_k associated with w_{ig,k,u′}. Sampling U_k − 1 random variables from a univariate truncated normal distribution consumes more computational resources, but it also keeps the algorithm simple.
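This Gibbs step only ever requires draws from a univariate normal truncated to a half-line. A minimal sketch of such a sampler follows, using naive rejection rather than the more efficient tail-proposal scheme of Robert (1995); the numeric bounds below are illustrative, not values from the thesis:

```python
import random

def sample_truncated_normal(mu, lower=None, upper=None, rng=random):
    """Draw from N(mu, 1) restricted to [lower, upper) by naive rejection.
    Robert (1995) uses an exponential tail proposal instead; plain rejection
    is fine whenever the truncation region keeps non-negligible mass."""
    while True:
        x = rng.gauss(mu, 1.0)
        if (lower is None or x >= lower) and (upper is None or x < upper):
            return x

random.seed(1)
# Component tied to the observed category: forced to be the positive maximum.
w_max = sample_truncated_normal(0.3, lower=max(0.4, 0.0))
# Any other component: forced to lie below that maximum (or below 0).
w_other = sample_truncated_normal(0.3, upper=max(w_max, 0.0))
```

Looping this over the U_k − 1 components, always conditioning on the current maximum, reproduces the partitioned Gibbs scheme described above.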

Now, sampling from the posterior distribution in step 3 of algorithm 5.1 requires more detail. Given the current- and higher-level explanatory variables in the measurement equations, latent variables are conditionally independent between individuals and groups, even between levels. Accordingly, recall that observed items are also conditionally independent, i.e. ω^{(l)}_{ig} ⊥ ω^{(l)}_{i′g} ∣ Ω_{(L)} and a_{ig} ⊥ a_{i′g} ∣ Ω_{(L)}, for i ≠ i′, for any arbitrary group g and arbitrary level l. Therefore, we have that:

  p(\Omega_{(L)} \mid \cdot) = \prod_{l=1}^{L} p(\Omega_{(l)} \mid \Omega_{(l^*)}, \cdot) = \prod_{l=1}^{L} \prod_{g=1}^{G} \prod_{i=1}^{n_g} p(\omega^{(l)}_{ig} \mid \Omega_{(l^*)}, \theta^{(m)}, \mathbf{W}^{(m+1)}, \mathbf{Z}^{*(m+1)}, \mathbf{a}),

where Ω_{(l*)} is the set of all the realizations of the latent variables for every level l* > l. Moreover, given the rules of conditional probability, p(ω^{(l)}_{ig} ∣ Ω_{(l*)}, ⋅) can be further expressed as:

  p(\omega^{(l)}_{ig} \mid \Omega_{(l^*)}, \cdot) \propto p(\mathbf{a}_{ig}, \mathbf{z}^*_{ig}, \mathbf{w}_{ig} \mid \omega^{(l)}_{ig}, \Omega_{(l^*)}, \cdot)\, p(\eta^{(l)}_{ig} \mid \xi^{(l)}_{ig}, \Omega_{(l^*)}, \cdot)\, p(\xi^{(l)}_{ig} \mid \cdot),   (5.20)

which, for each level l, is in turn decomposed into

  \propto \exp\Big\{ -\frac{1}{2}\Big[ \sum_{k=1}^{r_1} \big(z^*_{ig,k} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}\big)^2/\psi_{\varepsilon,k} + \sum_{k=r_1+1}^{r_2} \big(y_{ig,k} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}\big)^2/\psi_{\varepsilon,k} \Big]
  + \sum_{k=r_2+1}^{r_3} \big[ v_{ig,k}\big(\mu_k + \Lambda'_{k,(l)}\omega_{ig,(l)}\big) - \exp\big(\mu_k + \Lambda'_{k,(l)}\omega_{ig,(l)}\big) \big]
  - \frac{1}{2} \sum_{k=r_3+1}^{r_4} \big(\mathbf{w}_{ig,k} - \boldsymbol{\mu}_k - \mathbf{1}_{U_k-1}\Lambda'_{k,(l)}\omega_{ig,(l)}\big)'\big(\mathbf{w}_{ig,k} - \boldsymbol{\mu}_k - \mathbf{1}_{U_k-1}\Lambda'_{k,(l)}\omega_{ig,(l)}\big)
  + \sum_{l=1}^{L} \bigg( \sum_{k=1}^{q_1^{(l)}} -\frac{1}{2\psi^{(l)}_{\delta,k}} \Big( \eta^{(l)}_{ig,k} - \sum_{j_1=1}^{m^{(l)}} \sum_{k=1}^{K_{c,j_1}} \gamma_{j_1,k} B^c_{j_1,k}(c^{(l)}_{ig,j_1}) - \sum_{j_2=1}^{q_2^{(l)}} \sum_{k=1}^{K_{\xi,j_2}} \gamma_{j_2,k} B^{\xi}_{j_2,k}\big(\Phi^*(\xi^{(l)}_{ig,j_2})\big)
  - \sum_{l^*>l}^{L} \Big[ \sum_{j_1^*=1}^{m^{(l^*)}} \sum_{k=1}^{K_{c_{l^*},j_1^*}} \gamma_{j_1^*,k} B^{c_{l^*}}_{j_1^*,k}(c^{(l^*)}_{ig,j_1^*}) + \sum_{j_2^*=1}^{q^{(l^*)}} \sum_{k=1}^{K_{\omega,j_2^*}} \gamma_{j_2^*,k} B^{\omega_{l^*}}_{j_2^*,k}\big(\Phi^*(\omega^{(l^*)}_{ig,j_2^*})\big) \Big] \Big)^2
  - \frac{1}{2} \big(\xi^{(l)}_{ig}\big)'\big(\Phi^{(l)}\big)^{-1}\big(\xi^{(l)}_{ig}\big) \bigg) \Big\}   (5.21)

Given that the posterior distribution of Ω_{(L)} in equation (5.21) is highly complex and virtually intractable, we propose a Metropolis-Hastings within Gibbs sampler algorithm that allows us to draw samples from it. Details on this part of the algorithm are presented in section 5.3.
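The idea of alternating standard Gibbs updates with MH updates for the intractable blocks can be illustrated on a toy two-dimensional target. The target, seed, and tuning constants below are invented for illustration and are unrelated to the SPHSEM posterior:

```python
import math, random

def mh_within_gibbs(log_post, n_iter=2000, step=0.8, rng=random):
    """Toy MH-within-Gibbs: each coordinate is updated by a random-walk
    MH step, conditioning on the current value of the other coordinate."""
    x, y = 0.0, 0.0
    draws = []
    for _ in range(n_iter):
        for coord in (0, 1):
            prop = [x, y]
            prop[coord] += rng.gauss(0.0, step)
            # Accept with probability min(1, posterior ratio).
            log_ratio = log_post(prop[0], prop[1]) - log_post(x, y)
            if rng.random() < math.exp(min(0.0, log_ratio)):
                x, y = prop
        draws.append((x, y))
    return draws

# Correlated bivariate normal target; its full conditionals are in fact
# tractable, but the mechanics are identical for a block such as (5.21).
rho = 0.8
def log_post(x, y):
    return -0.5 * (x * x - 2 * rho * x * y + y * y) / (1 - rho * rho)

random.seed(42)
draws = mh_within_gibbs(log_post)
mean_x = sum(d[0] for d in draws[500:]) / 1500  # post burn-in average
```

In the SPHSEM, the "coordinates" are the blocks (Z*, α), W, Ω_{(L)} and θ, each updated from its full conditional, with MH used only where that conditional is non-standard.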

Lastly, we present a breakdown that allows for sampling from the posterior of θ in step 4 of algorithm 5.1. Assuming that the different k's correspond to independent random variables, p(µ ∣ ⋅) is derived using the observed-data likelihood and the conjugate priors in (5.8) and (5.9), as shown in the Appendix section. Following Song et al. (2013) and after some algebra, the posterior distributions for the intercept parameters are

  p(\mu_k \mid \cdot) \overset{D}{=} N[\mu^*_k, \sigma^*_k],  for k = 1, ..., r_1   (5.22)
  p(\mu_k \mid \cdot) \overset{D}{=} N[\mu^{**}_k, \sigma^*_k],  for k = r_1 + 1, ..., r_2   (5.23)
  p(\mu_k \mid \cdot) \propto \exp\Big\{ \sum_{g=1}^{G} \sum_{i=1}^{n_g} \big[ v_{ig,k}\big(\mu_k + \Lambda'_{k,(l)}\omega_{ig,(l)}\big) - \exp\big(\mu_k + \Lambda'_{k,(l)}\omega_{ig,(l)}\big) \big] \Big\},  for k = r_2 + 1, ..., r_3   (5.24)
  p(\boldsymbol{\mu}_k \mid \cdot) \overset{D}{=} N[\overline{\boldsymbol{\mu}}_k, \Sigma_{\mu_k}],  for k = r_3 + 1, ..., r_4   (5.25)

where

  \sigma^*_k = \big[N\psi^{-1}_{\varepsilon,k} + \sigma^{-1}_{k_0}\big]^{-1},
  \mu^*_k = \sigma^*_k \Big[ \psi^{-1}_{\varepsilon,k} \sum_{g=1}^{G} \sum_{i=1}^{n_g} \big(z^*_{ig,k} - \Lambda'_{k,(l)}\omega_{ig,(l)}\big) + \sigma^{-1}_{k_0}\mu_{k_0} \Big]  and
  \mu^{**}_k = \sigma^*_k \Big[ \psi^{-1}_{\varepsilon,k} \sum_{g=1}^{G} \sum_{i=1}^{n_g} \big(y_{ig,k} - \Lambda'_{k,(l)}\omega_{ig,(l)}\big) + \sigma^{-1}_{k_0}\mu_{k_0} \Big]

for the ordered categorical and continuous variables, and \Sigma_{\mu_k} = \big(N\,\mathbf{I}_{U_k-1} + \mathbf{H}^{-1}_{\mu_{k_0}}\big)^{-1} and \overline{\boldsymbol{\mu}}_k = \Sigma_{\mu_k}\big[ \sum_{g=1}^{G} \sum_{i=1}^{n_g} \tilde{\mathbf{w}}_{ig,k} + \mathbf{H}^{-1}_{\mu_{k_0}}\boldsymbol{\mu}_{\mu_{k_0}} \big], with \tilde{\mathbf{w}}_{ig,k} = \mathbf{w}_{ig,k} - \mathbf{1}_{U_k-1}\Lambda'_{k,(l)}\omega_{ig,(l)} for the unordered categorical variables.
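As a concrete illustration of the conjugate update behind equation (5.22), the following sketch computes σ*_k and μ*_k from a handful of hypothetical residuals z*_{ig,k} − Λ′_{k,(l)}ω_{ig,(l)} (all numbers invented):

```python
def normal_mean_posterior(resid, psi_eps, mu0, sigma2_0):
    """Conjugate update for an intercept mu_k (cf. equation 5.22):
    resid holds z*_ig,k - Lambda'_k omega_ig over all N observations,
    psi_eps is the residual variance, (mu0, sigma2_0) the prior."""
    n = len(resid)
    sigma2_star = 1.0 / (n / psi_eps + 1.0 / sigma2_0)        # sigma*_k
    mu_star = sigma2_star * (sum(resid) / psi_eps + mu0 / sigma2_0)
    return mu_star, sigma2_star

# With a diffuse prior the posterior mean is close to the residual average.
mu_star, s2 = normal_mean_posterior([0.4, 0.6, 0.5, 0.5], psi_eps=1.5,
                                    mu0=0.0, sigma2_0=100.0)
```

The update for μ**_k in (5.23) is identical with y_{ig,k}-based residuals, and the multivariate case (5.25) replaces the scalar precisions with the matrices Σ_{μ_k} and H_{μ_{k0}}.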

Most of the posterior distributions in (5.22) to (5.25) are normal distributions and, therefore, one could draw samples directly using the standard Gibbs sampler. However, the posterior distribution in (5.24) is not standard, thus we need to sample from it using a MH algorithm.

The posterior distributions in steps 4.2 and 4.3 of algorithm 5.1 are derived from the observed-data likelihood, using the conjugate priors in equations (5.10) to (5.12). Again, details can be found in the Appendix. The posterior distributions p(Λ_{k,(l)} ∣ ⋅) and p(ψ_{ε,k} ∣ ⋅) are:

  p(\Lambda_{k,(l)} \mid \cdot) \overset{D}{=} N[\Lambda^*_{k,(l)}, \psi_{\varepsilon,k}\mathbf{H}^*_k],  and  p(\psi^{-1}_{\varepsilon,k} \mid \cdot) \overset{D}{=} \text{Gamma}\Big[\frac{N}{2} + \alpha_{\Lambda_{k_0}}, \beta^*_k\Big],  for k = 1, ..., r_1,   (5.26)
  p(\Lambda_{k,(l)} \mid \cdot) \overset{D}{=} N[\Lambda^{**}_{k,(l)}, \psi_{\varepsilon,k}\mathbf{H}^*_k],  and  p(\psi^{-1}_{\varepsilon,k} \mid \cdot) \overset{D}{=} \text{Gamma}\Big[\frac{N}{2} + \alpha_{\Lambda_{k_0}}, \beta^{**}_k\Big],  for k = r_1 + 1, ..., r_2,   (5.27)
  p(\Lambda_{k,(l)} \mid \cdot) \propto \exp\Big\{ \sum_{g=1}^{G} \sum_{i=1}^{n_g} \big[ v_{ig,k}\big(\mu_k + \Lambda'_{k,(l)}\omega_{ig,(l)}\big) - \exp\big(\mu_k + \Lambda'_{k,(l)}\omega_{ig,(l)}\big) \big] - \frac{1}{2}\big[\Lambda_{k,(l)} - \Lambda_{k_0}\big]'\mathbf{H}^{-1}_{\Lambda_{k_0}}\big[\Lambda_{k,(l)} - \Lambda_{k_0}\big] \Big\},  for k = r_2 + 1, ..., r_3,   (5.28)
  p(\Lambda_{k,(l)} \mid \cdot) \overset{D}{=} N[\Lambda^{***}_{k,(l)}, \psi_{\varepsilon,k}\mathbf{H}^{**}_k],  for k = r_3 + 1, ..., r_4;   (5.29)

where \mathbf{H}^*_k = \big(\mathbf{H}^{-1}_{\Lambda_{k_0}} + \Omega\Omega'\big)^{-1} and \Lambda^*_{k,(l)} = \mathbf{H}^*_k\big[\mathbf{H}^{-1}_{\Lambda_{k_0}}\Lambda_{k_0} + \Omega\mathbf{Z}^*_k\big], with \Omega a \big[\sum_{l=1}^{L}(r^{(l)}_x + q^{(l)})\big] \times N matrix defined as \Omega = [\omega_{11,(l)}, ..., \omega_{n_G G,(l)}], the vectors \omega_{ig,(l)} defined as in equation (4.6); the N \times 1 vector \mathbf{Z}^*_k = (z^*_{11,k} - \mu_k, ..., z^*_{n_G G,k} - \mu_k)'; and \beta^*_k = \beta_{\Lambda_{k_0}} + \frac{1}{2}\big[\mathbf{Z}^{*\prime}_k\mathbf{Z}^*_k + \Lambda'_{k_0}\mathbf{H}^{-1}_{\Lambda_{k_0}}\Lambda_{k_0} - \Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}\big].

In addition, we define \Lambda^{**}_{k,(l)} = \mathbf{H}^*_k\big[\mathbf{H}^{-1}_{\Lambda_{k_0}}\Lambda_{k_0} + \Omega\mathbf{Y}_k\big], \mathbf{Y}_k = (y_{11,k} - \mu_k, ..., y_{n_G G,k} - \mu_k)', and \beta^{**}_k = \beta_{\Lambda_{k_0}} + \frac{1}{2}\big[\mathbf{Y}'_k\mathbf{Y}_k + \Lambda'_{k_0}\mathbf{H}^{-1}_{\Lambda_{k_0}}\Lambda_{k_0} - \Lambda^{**\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^{**}_{k,(l)}\big]. Finally, \mathbf{H}^{**}_k = \big(\mathbf{H}^{-1}_{\Lambda_{k_0}} + (U_k - 1)\Omega\Omega'\big)^{-1} and \Lambda^{***}_{k,(l)} = \mathbf{H}^{**}_k\big[\mathbf{H}^{-1}_{\Lambda_{k_0}}\Lambda_{k_0} + \Omega\mathbf{W}_k\big], with \mathbf{W}_k = \big(\mathbf{1}'_{U_k-1}(\mathbf{w}_{11,k} - \boldsymbol{\mu}_k), ..., \mathbf{1}'_{U_k-1}(\mathbf{w}_{n_G G,k} - \boldsymbol{\mu}_k)\big)' (see Appendix).

Furthermore, sampling from the posteriors p(Ψ_δ ∣ ⋅) and p(Φ ∣ ⋅) in steps 4.4 and 4.5 of algorithm 5.1 goes as follows. First, assume independent ψ^{(l)}_{δ,k}'s for every k = 1, ..., q_1 in every l = 1, ..., L. Therefore, the posterior distribution in step 4.4 is derived by multiplying numerous individual likelihoods for the η^{(l)}_{ig,k}'s times the prior distribution for ψ^{(l)}_{δ,k} (same k and l), as defined in equation (5.13). Thus, we have that for every k = 1, ..., q_1 in every l = 1, ..., L,

  p(\psi^{(l)}_{\delta,k} \mid \eta^{(l)}_{ig,k}, \cdot) \propto \prod_{l=1}^{L} \prod_{g=1}^{G} \prod_{i=1}^{n_g} p(\eta^{(l)}_{ig,k} \mid \cdot)\, p(\psi^{(l)}_{\delta,k})



After some algebra (see Appendix for further details on the computation of this posterior), we end up with the posterior distribution

  p(\psi^{(l),-1}_{\delta,k} \mid \eta^{(l)}_{ig,k}, \cdot) \overset{D}{=} \text{Gamma}\big[\alpha^{(l)*}_{\delta_k}, \beta^{(l)*}_{\delta_k}\big],   (5.30)

where \alpha^{(l)*}_{\delta_k} = \frac{N}{2} + \alpha_{\delta_{k_0}}, with N the total number of individuals (groups) at the level l to which \eta^{(l)}_{ig,k} belongs; and

  \beta^{(l)*}_{\delta_k} = \beta_{\delta_{k_0}} + \frac{1}{2} \sum_{l=1}^{L} \sum_{g=1}^{G} \sum_{i=1}^{n_g} \Big( \eta^{(l)}_{ig,k} - \sum_{j_1=1}^{m^{(l)}} \sum_{k=1}^{K_{c,j_1}} \gamma_{j_1,k} B^c_{j_1,k}(c^{(l)}_{ig,j_1}) - \sum_{j_2=1}^{q_2^{(l)}} \sum_{k=1}^{K_{\xi,j_2}} \gamma_{j_2,k} B^{\xi}_{j_2,k}\big(\Phi^*(\xi^{(l)}_{ig,j_2})\big)
  - \sum_{l^*>l}^{L} \Big[ \sum_{j_1^*=1}^{m^{(l^*)}} \sum_{k=1}^{K_{c_{l^*},j_1^*}} \gamma_{j_1^*,k} B^{c_{l^*}}_{j_1^*,k}(c^{(l^*)}_{ig,j_1^*}) + \sum_{j_2^*=1}^{q^{(l^*)}} \sum_{k=1}^{K_{\omega,j_2^*}} \gamma_{j_2^*,k} B^{\omega_{l^*}}_{j_2^*,k}\big(\Phi^*(\omega^{(l^*)}_{ig,j_2^*})\big) \Big] \Big)^2

Now, for the posterior in step 4.5 of algorithm 5.1, p(Φ ∣ ⋅), we assume independent covariance matrices for the exogenous latent variables across different levels, that is, p(Φ ∣ ⋅) = ∏_{l=1}^{L} p(Φ^{(l)} ∣ ⋅). The posterior is computed through the likelihood of the level-l exogenous latent variables times the prior defined for Φ^{(l)}. After some algebra, explicitly available in the Appendix section, we have that the posterior distribution for Φ^{(l)} is of the form:

  p(\Phi^{(l)} \mid \cdot) \overset{D}{=} \text{InvWishart}\big(N + \rho_0,\; \xi^{(l)\prime}\xi^{(l)} + \mathbf{R}_0\big),   (5.31)

for every l = 1, ..., L. N is defined in a similar way as in equation (5.30).

To derive the posterior distributions for the other parameters, those associated with the nonparametric structural equations, first let both τ_p and γ_p be independent among p's, that is,

  p(\tau) = \prod_{J} \prod_{p=1}^{P} p(\tau_p)  and  p(\gamma) = \prod_{J} \prod_{p=1}^{P} p(\gamma_p \mid \tau_p),

for every p in j_1 to j_2^* and J = \{j_1, j_2, j_1^*, j_2^*\}.

Second, following Song et al. (2013), each of the posteriors for the τ_p's results from the conjugation between its prior and the prior of the corresponding γ_p. Both priors were presented in equations (5.7) and (5.6), respectively. After some computations (in the Appendix), the posterior for τ_p results in

  \tau^{-1}_p \sim \text{Gamma}\Big[\alpha_{\gamma_0} + \frac{K}{2},\; \beta_{\gamma_0} + \frac{\gamma'_p \mathbf{M}_{\gamma_p} \gamma_p}{2}\Big]   (5.32)



where K is the number of knots, and p stands for the random exogenous variables in the structural equations, in j_1 to j_2^*. The posterior distribution for γ_p results from the combination of its prior (equation 5.6) and a modified version of the kth endogenous latent variable's likelihood. After some algebra, the posterior distribution for γ_p can be expressed as

  p(\gamma_p \mid \cdot) \overset{D}{=} N[\gamma^*_p, \Sigma^*_{\gamma_p}]\; I(\mathbf{1}'_N \mathbf{B}_p \gamma_p = 0)   (5.33)

with \gamma^*_p = \big[\Sigma^*_{\gamma_p}\mathbf{B}'_p \eta^{(l)*}_k\big]\big(\psi^{(l)}_{\delta,k}\big)^{-1} and \Sigma^*_{\gamma_p} = \big(\mathbf{B}'_p\mathbf{B}_p/\psi^{(l)}_{\delta,k} + \mathbf{M}_{\gamma_p}/\tau_p\big)^{-1}, and where I(⋅) is an indicator function that takes the value 1 if the restriction \mathbf{1}'_N \mathbf{B}_p \gamma_p = 0 is satisfied and 0 otherwise. In this specification we define \mathbf{B}_p as the N × K matrix

and 0 otherwise. In this specification we define Bp as the N × K matrix

Bp =

⎡⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎢⎣

[Bp,1(⋅) , ⋯, Bp,K(⋅)][1,1]

⋮ ⋱ ⋮

[Bp,1(⋅) , ⋯, Bp,K(⋅)][ng ,G]

⎤⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎥⎦[N×K]

where N is the total number of observations (individuals) at the level l to which the endogenous latent variable η^{(l)}_{ig,k} belongs, and K is the total number of knots defined for each basis spline. This configuration is valid for every p in j_1 to j_2^*. Also, let \eta^{(l)*}_k be the N × 1 vector defined as \eta^{(l)*}_k = (\eta^{(l)*}_{11,k}, ..., \eta^{(l)*}_{N,k})', with \eta^{(l)*}_{ig,k} = \eta^{(l)}_{ig,k} - \sum_{p' \neq p}^{J} \sum_{k=1}^{K} \gamma_{p',k} B_{p',k}(\cdot), for every individual/group ig in 1, ..., N at level l. To sample from the truncated posterior distribution,

we can sample an observation \gamma^{(New)}_p from equation (5.33) and then transform it as

  \gamma_p = \gamma^{(New)}_p - \Sigma^*_{\gamma_p}\mathbf{B}'_p\mathbf{1}_N \big(\mathbf{1}'_N\mathbf{B}_p\Sigma^*_{\gamma_p}\mathbf{B}'_p\mathbf{1}_N\big)^{-1} \mathbf{1}'_N\mathbf{B}_p\gamma^{(New)}_p

5.3. Implementation

Most posterior distributions in equations (5.16) to (5.33) are familiar standard distributions from the exponential family, such as the Normal, (Inverted) Gamma, and (Inverted) Wishart distributions. Therefore, it is straightforward to sample from them using the Gibbs sampler (Geman and Geman, 1984). However, when sampling from the joint posterior distribution p(Z*, α ∣ θ, Ω_{(L)}, Z) in equation (5.18) and from p(ω^{(l)}_{ig} ∣ Ω_{(l*)}, ⋅) in equation (5.21), it becomes clear that these distributions are not standard. We then appeal to the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970) specification presented in Lee and Zhu (2000) and Song et al. (2013).

Following Cowles (1996), for each of the k = 1, ..., r_1 ordered categorical variables we generate a vector of thresholds α_k = (α_{k,2}, ..., α_{k,Z_k−2}) from the following truncated normal distribution:

  \alpha_{k,q} \sim N\big[\alpha^{(m)}_{k,q}, \sigma^2_{\alpha_k}\big]\, I_{[\alpha_{k,q-1},\, \alpha^{(m)}_{k,q+1})}(\alpha_{k,q}),   (5.34)



for every q = 2, ..., Z_k − 2. In equation (5.34), \alpha^{(m)}_{k,q} is the value of \alpha_{k,q} at the mth iteration of the Gibbs sampler, and \sigma^2_{\alpha_k} is a preassigned constant that yields an appropriate acceptance rate. In addition, we can sample Z*_k from (5.16). It follows from the MH algorithm that the acceptance probability for (α_k, Z*_k) is min{1, R_k}, where

  R_k = \frac{p(\mathbf{Z}^*_k, \alpha_k \mid \theta, \Omega_{(L)}, \mathbf{Z})\; q(\mathbf{Z}^{*(m)}_k, \alpha^{(m)}_k \mid \mathbf{Z}^*_k, \alpha_k, \theta, \Omega_{(L)}, \mathbf{Z})}{p(\mathbf{Z}^{*(m)}_k, \alpha^{(m)}_k \mid \theta, \Omega_{(L)}, \mathbf{Z})\; q(\mathbf{Z}^*_k, \alpha_k \mid \mathbf{Z}^{*(m)}_k, \alpha^{(m)}_k, \theta, \Omega_{(L)}, \mathbf{Z})},

with p(⋅) being the posterior distribution in (5.18) and q(⋅) the proposal distribution,defined as the product of the truncated Gaussian distributions specified in equations (5.16)and (5.34). Therefore, following Lee and Zhu (2000), it can be shown that

  R_k = \prod_{q=2}^{Z_k-2} \frac{\Phi^*\big[(\alpha^{(m)}_{k,q+1} - \alpha^{(m)}_{k,q})/\sigma_{\alpha_k}\big] - \Phi^*\big[(\alpha_{k,q-1} - \alpha^{(m)}_{k,q})/\sigma_{\alpha_k}\big]}{\Phi^*\big[(\alpha_{k,q+1} - \alpha_{k,q})/\sigma_{\alpha_k}\big] - \Phi^*\big[(\alpha^{(m)}_{k,q-1} - \alpha_{k,q})/\sigma_{\alpha_k}\big]}
  \times \prod_{g=1}^{G} \prod_{i=1}^{n_g} \frac{\Phi^*\big(\psi^{-1/2}_{\varepsilon,k}[\alpha_{k,z_{ig,k}+1} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big) - \Phi^*\big(\psi^{-1/2}_{\varepsilon,k}[\alpha_{k,z_{ig,k}} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big)}{\Phi^*\big(\psi^{-1/2}_{\varepsilon,k}[\alpha^{(m)}_{k,z_{ig,k}+1} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big) - \Phi^*\big(\psi^{-1/2}_{\varepsilon,k}[\alpha^{(m)}_{k,z_{ig,k}} - \mu_k - \Lambda'_{k,(l)}\omega_{ig,(l)}]\big)}   (5.35)

Since R_k depends only on the old and new values of α_k and not on those of Z*_k, we do not need to draw new samples from the posterior distribution of Z*_k in the iterations for which the new value of α_k is not accepted (Cowles, 1996).

Something similar happens when sampling from p(ω^{(l)}_{ig} ∣ Ω_{(l*)}, ⋅) in equation (5.21). In this case, we follow the approach of Song and Lu (2010), which in turn departs from what was initially proposed by Arminger and Muthén (1998) and Zhu and Lee (1999). For the posterior distribution in (5.21), we choose N[ω^{(l),(m)}_{ig}, σ²_ω Σ^{(l)}_ω] as the proposal distribution, where ω^{(l),(m)}_{ig} is the random sample of ω^{(l)}_{ig} at the mth iteration of the Gibbs sampler, and the covariance matrix is defined as (Σ^{(l)}_ω)^{-1} = Λ′_ω Ψ^{-1}_ε Λ_ω + Σ^{(l)} if l = 1, or (Σ^{(l)}_ω)^{-1} = N_g Λ′_ω Ψ^{-1}_ε Λ_ω + Σ^{(l)} if l > 1, with

  \Sigma^{(l)} = \begin{pmatrix} (\Psi^{(l)}_{\delta})^{-1} & -(\Psi^{(l)}_{\delta})^{-1}\gamma_{\omega}\Delta^{(l)} \\ -\Delta'^{(l)}\gamma'_{\omega}(\Psi^{(l)}_{\delta})^{-1} & (\Phi^{(l)})^{-1} + \Delta'^{(l)}\gamma'_{\omega}(\Psi^{(l)}_{\delta})^{-1}\gamma_{\omega}\Delta^{(l)} \end{pmatrix},   (5.36)

where Λ_ω is a [r_3 + ∑_{k=r_3+1}^{r_4}(U_k − 1)] × q^{(l)} matrix that results from stacking every Λ_{k,(l)} vector, as defined in equations (4.7) to (4.10); γ_ω is a q^{(l)}_1 × (K × J*) matrix, J* = ∑_{i=1}^{j_2^*} j_i, that also results from stacking q^{(l)}_1 vectors, one for each latent endogenous variable, composed of K × J* γ's (assuming we fix an equal number of knots, K, for each basis expansion); and where Δ^{(l)} is a (K × J*) × q^{(l)}_2 matrix, defined as



  \Delta = \frac{\partial \mathbf{B}(\cdot)}{\partial \xi^{(l)\prime}_{ig}}\Bigg|_{\xi_{ig}=0} = \begin{bmatrix} \frac{\partial B_{j_1=1}(\cdot)}{\partial \xi^{(l)}_{ig,1}} & \cdots & \frac{\partial B_{j_1=1}(\cdot)}{\partial \xi^{(l)}_{ig,q_2}} \\ \vdots & \ddots & \vdots \\ \frac{\partial B_{j_2^*=q^{(l^*)}}(\cdot)}{\partial \xi^{(l)}_{ig,1}} & \cdots & \frac{\partial B_{j_2^*=q^{(l^*)}}(\cdot)}{\partial \xi^{(l)}_{ig,q_2}} \end{bmatrix}\Bigg|_{\xi_{ig}=0},

with \mathbf{B}(\cdot) being a (K × J*) × 1 vector defined as \mathbf{B}(\cdot) = (\mathbf{B}'_{j_1=1}(\cdot), ..., \mathbf{B}'_{j_2^*=q^{(l^*)}}(\cdot))', and \mathbf{B}_j(\cdot) being a K × 1 vector, for every j in j_1 to j_2^*, in the structural equation for each endogenous latent variable at level l. For this proposal distribution, σ²_ω is also fixed to a value such that the average acceptance rate is above 0.25 (Gelman et al., 1995). Thus, the acceptance probability is

  \min\Bigg\{1, \frac{p(\omega^{(l)}_{ig} \mid \cdot)}{p(\omega^{(l),(m)}_{ig} \mid \cdot)}\Bigg\}
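Because the Gaussian random-walk proposal is symmetric, the proposal densities cancel from the Hastings ratio and only the posterior ratio remains. A generic sketch of this accept/reject step on a toy standard-normal target (the target, seed, and tuning values are invented, not the SPHSEM conditional):

```python
import math, random

def rw_mh_step(omega_cur, log_post, scale, rng=random):
    """One random-walk MH update for a latent vector omega. The Gaussian
    proposal is symmetric, so the Hastings ratio reduces to the posterior
    ratio p(omega_prop | .) / p(omega_cur | .)."""
    omega_prop = [w + rng.gauss(0.0, scale) for w in omega_cur]
    log_ratio = log_post(omega_prop) - log_post(omega_cur)
    accept = rng.random() < math.exp(min(0.0, log_ratio))
    return (omega_prop if accept else omega_cur), accept

# Toy log-posterior; `scale` plays the role of sigma_omega and is tuned so
# the average acceptance rate stays above the 0.25 guideline cited above.
log_post = lambda w: -0.5 * sum(x * x for x in w)
random.seed(0)
state, n_acc = [0.0, 0.0], 0
for _ in range(2000):
    state, acc = rw_mh_step(state, log_post, scale=1.2)
    n_acc += acc
rate = n_acc / 2000
```

In the actual sampler the proposal covariance is σ²_ω Σ^{(l)}_ω from (5.36) rather than a scalar, but the accept/reject mechanics are identical.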


CHAPTER 6

Simulations & Application

6.1. A Simulation Study

A simulation study is presented to provide an empirical idea of the performance of the proposed Bayesian estimation of the SPHSEM. We simulated a dataset of i = 1, 2, ..., 800 observations at the first level, l = 1, distributed heterogeneously among G = 20 groups at a second level, l = 2, with group sizes ranging from n_g = 28 to n_g = 51.

For practical purposes, we present a simulation exercise for a set of six continuous manifest variables, y = {y_1, ..., y_6}, related in a linear fashion, as in equation (4.6), with a set of endogenous and exogenous latent variables varying at both l = 1 and l = 2. The proposed algorithm also works when count, ordered categorical, and unordered categorical variables are simulated. The measurement equations were simulated as follows:

yig = µ +Λ(2)ωig,(2) + εig (6.1)

or in a more explicit notation:

  \begin{bmatrix} y_{ig,1} \\ y_{ig,2} \\ y_{ig,3} \\ y_{ig,4} \\ y_{ig,5} \\ y_{ig,6} \end{bmatrix} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \\ \mu_4 \\ \mu_5 \\ \mu_6 \end{bmatrix} + \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & \lambda_{2,1} & 0 & \lambda_{2,2} \\ 0 & \lambda_{3,1} & 0 & \lambda_{3,2} \\ 1 & 0 & 1 & 0 \\ \lambda_{5,1} & 0 & \lambda_{5,2} & 0 \\ \lambda_{6,1} & 0 & \lambda_{6,2} & 0 \end{bmatrix} \begin{bmatrix} \eta^{(1)}_{ig,1} \\ \xi^{(1)}_{ig,1} \\ \eta^{(2)}_{g,1} \\ \xi^{(2)}_{g,1} \end{bmatrix} + \begin{bmatrix} \varepsilon_{ig,1} \\ \varepsilon_{ig,2} \\ \varepsilon_{ig,3} \\ \varepsilon_{ig,4} \\ \varepsilon_{ig,5} \\ \varepsilon_{ig,6} \end{bmatrix}   (6.2)

The true values for µ and Λ_{(2)} were randomly generated and fixed at µ_1 = ... = µ_6 = 0.5 and λ_{2,1} = ... = λ_{6,2} = 0.8. The 1's and 0's in Λ_{(2)} are fixed parameters that identify the model. Error terms ε_ig are distributed as ε_ig ∼ N[0, Ψ_ε], with the covariance matrix fixed at Ψ_ε = 1.5 · I_6 (where I_6 is the identity matrix of order 6).

The exogenous latent variables are drawn from ξ^{(1)}_{ig} ∼ N[0, Φ^{(1)}], with Φ^{(1)} = 1, and ξ^{(2)}_g ∼ N[0, Φ^{(2)}], with Φ^{(2)} = 1. For each i in 1, ..., 800 and for each g in 1, ..., 20, the endogenous latent variables η^{(1)}_{ig,1} and η^{(2)}_{g,1} were simulated according to the following simple, nonlinear, smooth structural functions (as in equation 4.12):

  \eta^{(1)}_{ig,1} = f_{11}(\xi^{(1)}_{ig,1}) + f_{12}(\eta^{(2)}_{g,1}) + \delta^{(1)}_{ig,1}   (6.3)
  \eta^{(2)}_{g,1} = f_{21}(\xi^{(2)}_{g,1}) + \delta^{(2)}_{g,1}   (6.4)

with f_{11}(\xi^{(1)}_{ig,1}) = \cos(1.5\,\xi^{(1)}_{ig,1}), f_{12}(\eta^{(2)}_{g,1}) = \sin(\eta^{(2)}_{g,1}), and f_{21}(\xi^{(2)}_{g,1}) = 2\sin(1.5\,\xi^{(2)}_{g,1}). Also, \delta^{(1)}_{ig,1} \sim N[0, \Psi^{(1)}_{\delta}] and \delta^{(2)}_g \sim N[0, \Psi^{(2)}_{\delta}], with values fixed at \Psi^{(1)}_{\delta} = \Psi^{(2)}_{\delta} = 1.

In this simulation study, the hyperparameters in the prior distributions presented in subsection 5.1 (equations 5.8 to 5.14) were assigned the following values: µ_{1_0} = ... = µ_{6_0} = 0, σ²_{1_0} = ... = σ²_{6_0} = 100, λ_{2,1} = ... = λ_{6,2} = 0, a matrix of appropriate order H_{Λ_{k_0}} = 100 · I_6, and α_{Λ_{k_0}} = α_{δ_{k_0}} = α_{γ_0} = 1 with β_{Λ_{k_0}} = β_{δ_{k_0}} = β_{γ_0} = 0.005, for uninformative priors on the dispersion hyperparameters. A total of 22 < N_1 = 800 equidistant knots were used to construct the cubic P-splines for the latent variables belonging to the first level. Accordingly, for the second level, 12 < N_2 = 20 knots were used. A first-order random-walk penalty matrix of appropriate order (M_γ) was used for the Bayesian P-splines in estimating the unknown smooth functions.

6.1.1. MCMC Simulations and Results

After several thousand iterations of Algorithm 5.1, we present the results of the simulation study in this subsection. Due to the presence of latent variables, the usage of data augmentation techniques, the complexity of the SPHSEM structure, but especially the presence of cross-level effects, the resulting chains converge at a very slow rate. We used a burn-in phase of 75,000 iterations and an additional 5,000 iterations to compute the Bayesian estimates of the parameters in θ and their 5% and 95% density bounds, respectively. To avoid confusion, the estimates of γ and Ω are not presented in table format. However, the Bayesian estimates for Ω are displayed in Figures 6.1 and 6.2, while those of γ are shown in Figure 12 in the Appendix section. It is observed that the transitions from γ_k to γ_{k+1} are smooth. The simulations were run on an Intel® Core i7 CPU @2.00GHz, 8.00GB RAM machine. It took about 12 hours to run a 10,000-iteration cycle.

The Gibbs sampler with the proposed MH algorithm produces the Bayesian estimates of the unknown parameters. In the MH algorithm (step 1 in Algorithm 5.1) we set σ²_ω = 1.5 to give acceptance rates of around 61% and 58% for the latent variables at levels 2 and 1, respectively. Main results are reported in Table A1 in the Appendix section.

The main result of this thesis is that the proposed Bayesian estimation of the SPHSEM recovers the nonlinear (causal) relationships between latent variables modeled in the set of structural equations with cross-level effects. As an example, Figures 6.1 and 6.2 show the recovered values for the latent variables ξ^{(1)}_{ig,1}, η^{(1)}_{ig,1}, ξ^{(2)}_{g,1}, and η^{(2)}_{g,1} from the SPHSEM simulation study (black dots), together with the respective estimated values for the endogenous latent variables (blue dots, η̂^{(1)}_{ig,1} and η̂^{(2)}_{g,1}), and the true functions, f_{11}(⋅) and f_{21}(⋅), from which the original values for the latent variables were simulated (red curves, equations 6.3 and 6.4).

[Figure 6.1 here: "Recovered Latent Variables", scatter plot of ξ_1 against the recovered values for η_1.]

Figure 6.1. Recovered values (ξ^{(1)}_{ig,1}, η^{(1)}_{ig,1}, black dots), fitted values (η̂^{(1)}_{ig,1}, blue dots), and true functions (f_{11}(⋅), red curve) for the nonlinear causal relationship between latent variables varying at Level 1. Source: Author's own calculations.

6.1.2. Bayesian Model Comparison and Goodness-of-fit Statistics

A key question when using complex and computationally intensive models (like the SPHSEM proposed herein) is whether such a model produces a better fit, following some statistical criterion, than somewhat simpler, more elementary alternatives (like the HSEM in Lee and Tang, 2006). One common Bayesian model comparison statistic is the Bayes Factor (Kass and Raftery, 1995). Assume that the observed data a arose from one of two competing (nonlinear or semiparametric) models, M_0 and M_1. Let p(a ∣ M_i) be the probability density of a given M_i, for i = 0, 1. The Bayes Factor (BF) statistic for evaluating M_1 against M_0 is defined by

  BF_{10} = \frac{p(\mathbf{a} \mid M_1)}{p(\mathbf{a} \mid M_0)}   (6.5)

It is clear that the BF is a summary of the statistical evidence provided by theobserved data in favor of a model M1, as opposed to the competing model M0. Themarginal densities involved in the computation of BF10 are obtained by integrating overthe parameter space as p(a ∣ Mi) = ∫ p(a ∣ Mi, θMi) p(θMi ∣ Mi) dθMi . However, thelatter densities are difficult to obtain analytically. Lee and Song (2003b) demonstratedthat the path-sampling procedure presented in Gelman and Meng (1998) can be usefulfor computing the BF.

Difficulties arise when establishing a link between the SPHSEM defined here and conventional HSEM models within the path-sampling procedure. Moreover, even if the empirical computation of the marginal densities is a byproduct of the Bayesian estimation method,



[Figure 6.2 here: "Recovered Latent Variables", scatter plot of ξ_2 against the recovered values for η_2.]

Figure 6.2. Recovered values (ξ^{(2)}_{g,1}, η^{(2)}_{g,1}, black dots), fitted values (η̂^{(2)}_{g,1}, blue dots), and true functions (f_{21}(⋅), red curve) for the nonlinear causal relationship between latent variables varying at Level 2. Source: Author's own calculations.

it is highly computationally intensive. As pointed out by Song et al. (2013), the computational cost for semiparametric SEMs is even higher. Therefore, we follow the Bayesian model comparison section in Song et al. (2013) and present an index based on the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002). The DIC balances model complexity and goodness-of-fit to the data. The model with the smallest DIC is usually preferred in the model-comparison procedure. An extensive analysis of the DIC for models with missing data and latent variables (such as SEMs) is presented in Celeux et al. (2006). They demonstrate that the complete DIC, a statistic based on the complete-data log-likelihood (log p(D, Ω_{(L)} ∣ θ), with D = (a, Z*, W), see Chapter 5), is the best version of the DIC for Bayesian model comparison with latent variables. The complete DIC (cDIC) is defined as:

  cDIC = -4\, E_{\theta,\Omega_{(L)}}\big\{\log p(\mathbf{D}, \Omega_{(L)} \mid \theta) \mid \mathbf{D}\big\} + 2\, E_{\Omega_{(L)}}\big\{\log p\big(\mathbf{D}, \Omega_{(L)} \mid E_{\theta}[\theta \mid \mathbf{D}, \Omega_{(L)}]\big) \mid \mathbf{D}\big\}.   (6.6)

For the proposed generalized SPHSEM, log p(D, Ω_{(L)} ∣ θ) is computed as

  \log p(\mathbf{D}, \Omega_{(L)} \mid \theta) = \sum_{l=1}^{L} \sum_{g=1}^{G} \sum_{i=1}^{n_g} \log p(\mathbf{a}_{ig}, \mathbf{z}^*_{ig}, \mathbf{w}_{ig}, \omega^{(l)}_{ig} \mid \theta),   (6.7)

with p(a_{ig}, z*_{ig}, w_{ig}, ω^{(l)}_{ig} ∣ θ) as in equations (5.20) and (5.21) (see Chapter 5). The first expectation in equation (6.6) can be approximated as:

first expectation in equation (6.6) can be approximated as:

Eθ,Ω(L) log p(D,Ω(L) ∣ θ) ∣ D = ∫ log p(D,Ω(L) ∣ θ) p(Ω(L), θ ∣ D) dΩ(L) dθ

Page 84: ¡rdenasHurtado... · 2017-06-06 · 3 Title in English Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach

CHAPTER 6. SIMULATIONS & APPLICATION 73

≈ 1

T

T

∑t=1

log p(D,Ω(t)(L) ∣ θ(t)) (6.8)

where \{(\Omega^{(t)}_{(L)}, \theta^{(t)}) : t = 1, ..., T\} are the MCMC samples drawn from the posterior distributions outlined in Chapter 5. Also, let \{\theta^{(m,t)} : m = 1, ..., M\} be a chain of M ≤ T samples from the parameters' posterior, p(\theta \mid \mathbf{D}, \Omega^{(t)}_{(L)}). We have that

  E_{\theta}\big[\theta \mid \mathbf{D}, \Omega^{(t)}_{(L)}\big] \approx \bar{\theta}^{(t)} = \frac{1}{M} \sum_{m=1}^{M} \theta^{(m,t)}.

Therefore, the second expectation in equation (6.6) can be approximated by

  E_{\Omega_{(L)}}\big\{\log p\big(\mathbf{D}, \Omega_{(L)} \mid E_{\theta}[\theta \mid \mathbf{D}, \Omega_{(L)}]\big) \mid \mathbf{D}\big\} \approx \frac{1}{T} \sum_{t=1}^{T} \log p\big(\mathbf{D}, \Omega^{(t)}_{(L)} \mid \bar{\theta}^{(t)}\big).   (6.9)

It is important to bear in mind that the sample averages in equations (6.8) and (6.9) are computed from the MCMC samples obtained with the Gibbs sampler and the MH algorithm presented in the previous chapter; therefore, the cost of computing complete-data log-likelihood DIC values is not as high as that of computing other Bayesian model comparison statistics.
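How the cDIC is assembled from MCMC output can be sketched as follows, with one simplification: the per-draw plug-in term of (6.9) is replaced by a single evaluation at a fixed parameter estimate. All log-likelihood values below are invented and are not those behind Table 6.1:

```python
def complete_dic(loglik_draws, loglik_at_mean):
    """cDIC (equation 6.6) from MCMC output: -4 times the posterior average
    of the complete-data log-likelihood, plus 2 times its value at a
    plug-in estimate of theta (a simplification of the E-bar term in 6.9)."""
    T = len(loglik_draws)
    return -4.0 * sum(loglik_draws) / T + 2.0 * loglik_at_mean

# Invented values of log p(D, Omega^(t) | theta^(t)) and of the plug-in term.
cdic = complete_dic([-7610.0, -7598.5, -7605.2], loglik_at_mean=-7590.3)
```

Because both terms reuse the same MCMC draws, no additional sampling passes are needed, which is the cost advantage over path-sampling-based Bayes Factors noted above.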

In order to assess the performance of the SPHSEM, we compare the results obtained from our model against an alternative, simpler, linear specification of an HSEM, similar to that in Lee and Tang (2006) and Lee (2007, section 9.7) (see the proposal distribution for the MH step therein). Afterwards, the cDIC for each model is computed. We fit the HSEM to the data simulated from the linear measurement equations and nonlinear structural equations presented at the beginning of this chapter. We assume the same linear structure for the measurement equations, as in (6.2), but instead of the nonlinear structural equations in (6.3) and (6.4), we assume a linear function for both levels, with cross-level effects, as:

  \eta^{(1)}_{ig,1} = \gamma_1 \xi^{(1)}_{ig,1} + \gamma_2 \eta^{(2)}_{g,1} + \delta^{(1)}_{ig,1}   (6.10)
  \eta^{(2)}_{g,1} = \gamma_3 \xi^{(2)}_{g,1} + \delta^{(2)}_{g,1}   (6.11)

For the HSEM, we also set σ²_ω = 1.5, and obtain average acceptance rates of 46% for latent variables at level 1 (individual level) and 48% for latent variables at level 2 (group level). These rates are lower than those reported for the SPHSEM at both levels. However, these numbers are misleading, since the nonlinear (causal) relationships between latent variables are not recovered by the HSEM estimates. Table A2 in the Appendix section reports the estimation results for the parameters, comparing both the SPHSEM and the HSEM.

Results reported therein provide evidence of bias in most of the HSEM parameter estimates. It is of particular interest how the parameters related to the cross-level effects and the second-level endogenous latent variables, i.e. μ_4, ψ_{ε,4}, λ_{6,1}, and both ψ^{(1)}_{δ,1} and ψ^{(2)}_{δ,1}, are highly biased in the HSEM case. As a summary measure, the sum of the absolute values of the bias of each estimated parameter is


CHAPTER 6. SIMULATIONS & APPLICATION 74

reported. For the SPHSEM that sum is relatively low (3.471), while for the HSEM it is significantly higher (27.026), driven mostly by the bias of ψ^{(1)}_{δ,1} and ψ^{(2)}_{δ,1}.

We also compute the cDIC statistic for a third model (in addition to the SPHSEM and the HSEM), which we call the "oracle" model. In this exercise we fix the parameters in θ and the latent variables in Ω at their original, true values. The results for the cDIC statistics are presented in Table 6.1.

Table 6.1. Complete-data log-likelihood DIC

Model      cDIC
SPHSEM     15,207.16
HSEM       16,160.01
Oracle     12,967.09

From these results it is clear that the SPHSEM outperforms the classical HSEM in terms of goodness-of-fit for the simulated dataset. The cDIC for the SPHSEM (15,207.16) is higher than that of the oracle model (12,967.09), but lower than that of the HSEM (16,160.01).

6.1.3. An Intervention to the Simulated Causal System

We present an intervention in the structural system of equations through the do-operator presented in Pearl (2009b). Since exogenous, explanatory variables are understood to be the Markovian parents (causes) of endogenous, random variables (effects), we intervene in the system by setting the exogenous latent variables at an arbitrarily chosen level, and then compare the outcome against the non-intervened scenario presented above.

First, a group-level intervention is performed on the exogenous latent variable ξ^{(2)}_{g,1}. The do-operator acts by modifying the values entering f_{21}(⋅) in the structural equation (6.4), replacing them by a fixed value equal to 0.5 for those groups with observed pre-intervention values below zero, i.e., assume an intervention of the form do(ξ^{(2)}_{g,1} = 0.5) ∀ g ∈ {g : ξ^{(2)}_{g,1} < 0}. This group-level intervention has causal effects on the individual-level latent variables as well, through the cross-level term f_{12}(⋅).

Using the potential outcome notation for the endogenous latent variables, the treatment effect (i.e. do(ξ^{(2)}_{g,1} = 0.5) for the selected g's) can be expressed as

τ_g = η^{(2)}_{g,1}(do(ξ^{(2)}_{g,1} = 0.5), δ^{(2)}_{g,1}) − η^{(2)}_{g,1}(ξ^{(2)}_{g,1}, δ^{(2)}_{g,1}),    (6.12)

τ_{ig} = η^{(1)}_{ig,1}(ξ^{(1)}_{ig,1}, do(ξ^{(2)}_{g,1} = 0.5), δ^{(1)}_{ig,1}) − η^{(1)}_{ig,1}(ξ^{(1)}_{ig,1}, ξ^{(2)}_{g,1}, δ^{(1)}_{ig,1}),    (6.13)

for the group and individual levels, respectively. The treatment effect is computed by comparing the non-treated individuals/groups against the potential outcomes of the treated ones. The counterfactual potential outcomes correspond to the estimated,



semi-parametric structural functions f_{11}(⋅), f_{12}(⋅) and f_{21}(⋅), evaluated at the proposed intervention value, i.e. f_{21}(do(ξ^{(2)}_{g,1} = 0.5)), and both f_{11}(⋅) and f_{12}(⋅) evaluated at the resulting hypothetical value η^{(2)}_{g,1}(do(ξ^{(2)}_{g,1} = 0.5)).
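To make these mechanics concrete, the following sketch propagates a group-level do-intervention through stand-in structural functions. Everything here is illustrative: f11, f12, f21 are arbitrary smooth functions, not the estimated P-spline fits, and the noise terms are omitted because the same δ's enter both potential outcomes in (6.12) and (6.13) and cancel.

```python
import numpy as np

# Stand-ins for the fitted structural functions (arbitrary, for illustration).
f21 = np.tanh                        # group level:      eta2 = f21(xi2) + delta2
f11 = np.sin                         # individual level: eta1 = f11(xi1) + f12(eta2) + delta1
def f12(eta2):                       # cross-level effect
    return 0.5 * eta2

rng = np.random.default_rng(0)
xi2 = rng.normal(size=50)            # group-level exogenous latent variable
xi1 = rng.normal(size=(50, 20))      # 20 individuals per group

treated = xi2 < 0                    # groups selected for do(xi2 = 0.5)
xi2_do = np.where(treated, 0.5, xi2)

eta2_pre = f21(xi2)                  # structural part before the intervention
eta2_post = f21(xi2_do)              # potential outcome under the intervention

tau_g = eta2_post - eta2_pre         # group-level effect, eq. (6.12): deltas cancel
# individual-level effect, eq. (6.13): f11(xi1) and the deltas cancel too,
# so only the cross-level term changes
tau_ig = (f12(eta2_post) - f12(eta2_pre))[:, None] * np.ones_like(xi1)

print(bool(np.allclose(tau_g[~treated], 0.0)), bool((tau_g[treated] > 0).all()))
```

Untreated groups show a zero effect by construction, while every treated group (negative pre-intervention value pushed up to 0.5) shows a positive one.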

Figures 6.3a and 6.3b show how the intervention do(ξ^{(2)}_{g,1} = 0.5) has direct and indirect causal effects on the endogenous latent variables. Black dots represent the non-intervened groups/individuals. Blue dots represent the observed values of the endogenous latent variable for the intervened units prior to the intervention; i.e., in Figure 6.3a, blue dots are those groups for which ξ^{(2)}_{g,1} < 0 holds, and therefore those to which the treatment do(ξ^{(2)}_{g,1} = 0.5) was applied. Dotted lines (black and blue) are the average values of η^{(2)}_{g,1} for the non-treated and to-be-treated groups, which take values of 1.441 and -0.875, respectively. The red dots are the values that each of the treated groups would attain, evaluated at the function f_{21}(⋅) estimated through the SPHSEM. The dotted red line is the post-intervention average value of the endogenous latent variable for the treated groups, and takes a value of 1.352. It is clear that the difference between the average values of η^{(2)}_{g,1} for the non-treated (black dotted line) and the treated (red dotted line) is lower after the intervention.

[Figure 6.3 here: two scatter panels of recovered latent variables. (a) Direct effect of do(ξ^{(2)}_{g,1} = 0.5) on η^{(2)}_{g,1}: recovered values of η_2 against ξ_2. (b) Indirect effect of do(ξ^{(2)}_{g,1} = 0.5) on η^{(1)}_{ig,1}: recovered values of η_1 against ξ_1.]

Figure 6.3. Direct and indirect causal effects of do(ξ^{(2)}_{g,1} = 0.5) for g ∈ {g : ξ^{(2)}_{g,1} < 0}. Source: Author's own calculations.

A similar picture is displayed in Figure 6.3b. Black dots represent the non-treated individuals. The black dotted line is the average value of η^{(1)}_{ig,1} for the non-treated individuals. The blue dots and the blue dotted line correspond to the recovered values, and their average, for the treated individuals prior to receiving the treatment. Red dots, and the red dotted line, are the counterfactual values, and their average, of the endogenous latent variable η^{(1)}_{ig,1} for the treated individuals. Again, the difference between the averages for the non-treated (black dotted line) and the treated (blue dotted line) shrinks once the group-level intervention, do(ξ^{(2)}_{g,1} = 0.5) for the selected g's, is supplied (red dotted line).

Before the intervention, the differences in the latent endogenous variable between the non-treated (black dotted lines) and the to-be-treated (blue dotted lines) groups and individuals were 2.316 and 0.362, respectively. A difference-in-means t-test suggests that these differences are statistically different from zero (p-values equal to 0.000 for both the group and individual samples). After the intervention do(ξ^{(2)}_{g,1} = 0.5), the differences (black dotted line versus red dotted line) were reduced to 0.090 and 0.113 for the group and individual levels, respectively. These differences are statistically equal to zero, according to a difference-in-means t-test (p-values are 0.357 and 0.0992 for the group and individual levels). The statistical tests suggest that this particular intervention eliminates the differences between the selected groups and individuals in terms of the mean values of η^{(2)}_{g,1} and η^{(1)}_{ig,1}.

Now, assume an individual-level intervention. In this case, the exogenous latent variable ξ^{(1)}_{ig,1} is manipulated through the do-operator for selected individuals and set to a fixed value equal to 0.5. We pick those individuals with values of ξ^{(1)}_{ig,1} below -1, i.e., assume an intervention of the form do(ξ^{(1)}_{ig,1} = 0.5) ∀ i ∈ {i : ξ^{(1)}_{ig,1} ∈ (−∞, −1]}.

Again, the potential outcomes allow for estimating the treatment effect at the individual level (as in equation (6.13)). Given that lower-level interventions do not have causal effects on higher-level variables, the treatment effect at the group level is not estimated.

[Figure 6.4 here: scatter panel of recovered latent variables, η_1 against ξ_1, at the individual level.]

Figure 6.4. Direct effect of do(ξ^{(1)}_{ig,1} = 0.5) for i ∈ {i : ξ^{(1)}_{ig,1} ∈ (−∞, −1]}. Source: Author's own calculations.

In Figure 6.4, black dots represent the non-intervened individuals, and the black dotted line the average value of the latent variable for the non-treated individuals. Blue dots represent the pre-treatment values for the to-be-intervened individuals, and the blue dotted line the average value of the latent variable for the to-be-treated individuals. The red dots correspond to the hypothetical, counterfactual values that treated individuals would attain if they were indeed subject to the proposed intervention. The red dotted line corresponds to the average value η^{(1)}_{ig,1} would attain for individuals under the treatment regime. It is computed using the estimated semiparametric function of the SPHSEM.

Before the intervention, the difference between the non-treated (black dotted line) and the to-be-treated (blue dotted line) individuals was 0.938. After the intervention, the difference is now −0.284, which means that the counterfactual average value of η^{(1)}_{ig,1} for the treated individuals would be even higher than that for the non-treated. However, given the sample variance, both differences are statistically equal to zero according to a t-test performed on the non-intervened and intervened individuals. Results are summarized in Table 6.2.

Table 6.2. Intervention Results

Intervention                           η^{(l)}_T − η^{(l)}_{NT}
                                       Level 1     Level 2
No intervention (group treatment)      0.362       2.316
do(ξ^{(2)}_{g,1} = 0.5)                0.113       0.090
No intervention (unit treatment)       0.938       -
do(ξ^{(1)}_{ig,1} = 0.5)               -0.284      -

Sensitivity Analysis: Due to the nonlinear nature of the simulated causal relationships between latent variables, a change in the intensity of the intervention is not matched by a change in the output of the same magnitude. An important feature of the SPHSEM is that it acknowledges this characteristic of the data generating process. Using the same simulated example, Table 6.3 presents a sensitivity analysis of a group-level intervention, do(ξ^{(2)}_{g,1} = α), for different values α of the exogenous, second-level latent variable ξ^{(2)}_{g,1}. Results suggest that interventions of greater magnitude produce increasingly large treatment effects at the group level.

Table 6.3. Sensitivity Analysis to Group Interventions

Value of intervention    η^{(l)}_T − η^{(l)}_{NT} (Level 2)
No intervention          2.316
α = 0.0                  1.317
α = 0.5                  0.090
α = 1.0                  -0.873
α = 1.5                  -1.568
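The pattern in Table 6.3 (larger interventions yield larger effects, but not proportionally so) is exactly what a nonlinear structural function produces. A toy check, with an arbitrary smooth function standing in for the fitted f_{21}(⋅):

```python
import numpy as np

f21 = np.tanh                           # arbitrary nonlinear stand-in for the fitted f21
rng = np.random.default_rng(1)
xi2 = rng.normal(size=200)
sel = xi2 < 0                           # groups receiving do(xi2 = alpha)

alphas = (0.0, 0.5, 1.0, 1.5)
effects = [float((f21(a) - f21(xi2[sel])).mean()) for a in alphas]
steps = np.diff(effects)
print(bool((steps > 0).all()), bool(np.allclose(steps, steps[0])))
```

The mean effect grows with α, but the increments between consecutive α values are unequal, which is the signature of the nonlinearity the SPHSEM captures.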

6.2. Empirical Application of the SPHSEM: Hostility, Leadership, and Task Significance Perceptions

We present the advantages of the Bayesian SPHSEM over more conventional, simplermodels. In this example, we use the dataset described in Bliese et al. (2002), available



within the distribution of the R package multilevel (Bliese, 2016). The original dataset consists of 21 item measures related to individuals' perceptions of Leadership Climate (LEAD), Task Significance (TSIG), and Hostility (HOSTIL) for 2,042 U.S. Army soldiers grouped within 49 companies (groups). For simplicity, in this application we restrict our attention to those companies with 25 or more soldiers. As a result, we end up with a sample of 1,723 soldiers (individuals) belonging (heterogeneously) to 29 companies (groups).
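The sample restriction can be sketched as follows (the column names below are hypothetical; the actual data ship with the R package multilevel):

```python
import pandas as pd

# One row per soldier; `company` is a hypothetical group-identifier column.
df = pd.DataFrame({
    "company": ["A"] * 30 + ["B"] * 10 + ["C"] * 26,
    "hostil01": range(66),
})

# Keep only companies with 25 or more soldiers, as in the application.
kept = df.groupby("company").filter(lambda g: len(g) >= 25)
print(sorted(kept["company"].unique()), len(kept))
```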

This dataset was used to assess theoretical models of stress related to working conditions and job situations. The authors wanted to test whether or not higher levels of leadership climate in a group (company), and of task significance perception at the individual (soldier) and group levels, caused better responses on personal job-related wellbeing items, measured by the degree to which individuals are prone to exhibit hostile behaviors. Core to this analysis is the contrast between the individual-level and nomothetic perspectives of job stress discussed in Bliese and Halverson (1996). The former emphasizes the role of individual perceptions, based on unique personality traits, beliefs, goals, abilities, etc., in the formation of self-reported wellbeing perceptions. The latter approach is based on the role of environmental variables in individual job stress.

In order to assess the causal relationships of the individual and group (latent) variables of leadership climate and task significance on individual, self-reported hostile behaviors, we set up a SPHSEM: a two-level SEM with structural equations following a semi-parametric structure, as explained in previous chapters. The dataset consists of 21 variables for each unit, as described in the Appendix section. The implicit plate diagram we propose for this exercise is also presented in Figure 13, in the Appendix section.

We model individual-level Hostile Behavior (HOSTIL_ig) as an endogenous latent variable, caused by the individual-level latent exogenous variables Leadership Climate (LEAD_ig) and Task Significance (TSIG_ig), and by group-level HOSTIL_g (a cross-level effect). The latter is in turn modeled as an endogenous group-level latent variable, caused by group-level measurements of Leadership (LEAD_g) and Task Significance (TSIG_g). That said, we assume the structural equations of the proposed SPHSEM to be:

HOSTIL^{(1)}_{ig} = f_{11}(LEAD^{(1)}_{ig}) + f_{12}(TSIG^{(1)}_{ig}) + f_{13}(HOSTIL^{(2)}_{g}) + δ^{(1)}_{ig,1}    (6.14)

HOSTIL^{(2)}_{g} = f_{21}(LEAD^{(2)}_{g}) + f_{22}(TSIG^{(2)}_{g}) + δ^{(2)}_{g,1}    (6.15)

We expect both the LEAD and TSIG latent variables (at any level) to have a negative (causal) relationship with individual-level Hostile Behavior scales; i.e., f_{11}(⋅), f_{12}(⋅), f_{13}(⋅) and f_{21}(⋅), f_{22}(⋅) should display negative trends. As explained in the Appendix section, questions are based on a five-point Likert scale. Following the advice of Preston and Colman (2000), who show that SEM and factor analyses based on items with fewer than seven points can be highly unreliable, we transform the items in our example into continuous, standardized random variables.
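The transformation itself is plain column-wise standardization of the item scores; the toy items below are hypothetical:

```python
import pandas as pd

# Hypothetical five-point Likert responses for two items.
items = pd.DataFrame({"lead01": [1, 3, 5, 2, 4], "tsig01": [2, 2, 4, 5, 3]})

# Treat the scores as continuous and standardize each column to mean 0, sd 1.
z = (items - items.mean()) / items.std()
print(z.round(3))
```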

Again, given the lack of information about the parameters involved in the application study, we use the same uninformative priors as in the simulation example above. Also, based on the meaning of the latent constructs and the identifiability constraints, we consider a non-overlapping structure for the corresponding loading matrix, similar to that displayed in equation (6.1). In order to avoid over-fitting, a total of 6 equidistant knots were used, and the second-order random-walk penalty was used for the Bayesian P-splines when estimating the unknown, smooth functions. Due to the convergence issues addressed previously and the computational burden, we collected 5,000 observations after an initial burn-in phase of 45,000 cycles of the algorithm for the Bayesian inference.
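For concreteness, here is a minimal sketch of the two ingredients just described: a cubic B-spline basis on 6 equidistant knots and the second-order difference penalty behind the random-walk prior. The knot placement and boundary handling are assumptions; the thesis's exact basis construction may differ.

```python
import numpy as np

def bspline_basis(x, t, k):
    """Cox-de Boor recursion: B-spline basis matrix for knot vector t, degree k."""
    x = np.asarray(x, dtype=float)
    B = np.zeros((len(x), len(t) - 1))
    for j in range(len(t) - 1):                     # degree-0 indicator functions
        B[:, j] = (x >= t[j]) & (x < t[j + 1])
    last = np.searchsorted(t, t[-1], side="left") - 1
    B[x == t[-1], last] = 1.0                       # close the last interval
    for d in range(1, k + 1):                       # raise the degree recursively
        nxt = np.zeros((len(x), len(t) - d - 1))
        for j in range(len(t) - d - 1):
            left = (x - t[j]) / (t[j + d] - t[j]) if t[j + d] > t[j] else 0.0
            right = (t[j + d + 1] - x) / (t[j + d + 1] - t[j + 1]) if t[j + d + 1] > t[j + 1] else 0.0
            nxt[:, j] = left * B[:, j] + right * B[:, j + 1]
        B = nxt
    return B

xl, xr = -2.0, 2.0
inner = np.linspace(xl, xr, 6)                      # 6 equidistant knots
t = np.r_[[xl] * 3, inner, [xr] * 3]                # clamped cubic knot vector
x = np.linspace(xl, xr, 101)
B = bspline_basis(x, t, 3)                          # 101 x 8 design matrix

D2 = np.diff(np.eye(B.shape[1]), n=2, axis=0)       # second-difference matrix
K = D2.T @ D2                                       # random-walk penalty: gamma' K gamma
print(B.shape, K.shape, bool(np.allclose(B.sum(axis=1), 1.0)))
```

The basis rows sum to one (partition of unity), and K penalizes second differences of the spline coefficients γ, which is exactly the quadratic form implied by a second-order random-walk prior.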

6.2.1. Results

Results for the empirical example are satisfactory. The average acceptance rates for the latent variables at the individual and group levels are 38.7% and 26.7%, respectively. Given the complexity of the model, these acceptance rates are acceptable. The Bayesian estimates of the structural parameters of the SPHSEM example using the Bliese et al. (2002) dataset are presented in Table A3 in the Appendix section. For simplicity, we do not present results for Ω, τ, or the γ parameters in the B-spline representation of the structural equations.

As expected, the relationship between the latent construct Hostile Behavior (HOSTIL) and both Leadership Climate (LEAD) and Task Significance Perception (TSIG) is negative. However, the SPHSEM results suggest that these relationships are nonlinear at both the individual and group levels. The latter can be inferred from the recovered values of the latent variables, as displayed in Figures 6.5a to 6.5d.

Results for the structural equations at the individual level suggest that higher Leadership Climate in the company causes individual Hostile Behavior to decline. However, as evidenced by the fitted values of the function f_{11}(⋅) in Figure 6.5a, the slope of this causal relationship is steeper for lower scores of LEAD relative to higher ones. This might provide evidence of overall lower job-related stress levels (measured by hostile behavior) for individuals who perceive higher, positive leadership attitudes in the environment in which they work. Also, for those (high-performing, low-stress) individuals, marginal improvements in leadership might not further reduce job-related stress levels, but might have a different impact on other variables (not measured in this study).

On the other hand, it is less clear how a higher individual Task Significance perception causes Hostile Behavior to decline. A positive, but possibly non-significant, relationship is recovered for individuals with lower TSIG scores, suggesting that very low TSIG values might actually cause HOSTIL to increase, which is plausible. Mid-to-high scores, however, display a negative relationship, as expected. For individuals with the highest task significance perceptions there is no evidence of TSIG causing decreases in HOSTIL, which is already low, as shown in Figure 6.5b.

Results for the structural equations at the group level are more robust, and do, in fact, reflect a negative causal relationship between both LEAD and TSIG, and HOSTIL. The fitted values (red lines) of the functions f_{21}(⋅) and f_{22}(⋅) suggest that higher group-level Leadership Climate scores reduce environmental job-related stress levels, measured by

[Figure 6.5 here: four scatter panels of recovered values for HOSTIL against recovered values for LEAD and TSIG, with fitted curves. (a) HOSTIL_ig vs LEAD_ig, f_{11}(⋅), individual level. (b) HOSTIL_ig vs TSIG_ig, f_{12}(⋅), individual level. (c) HOSTIL_g vs LEAD_g, f_{21}(⋅), group level. (d) HOSTIL_g vs TSIG_g, f_{22}(⋅), group level.]

Figure 6.5. Causal relationships between latent variables at both the individual and group levels. Source: Author's own calculations.

the group indicator of Hostile Behavior, as shown in Figures 6.5c and 6.5d.

The complete-data log-likelihood Deviance Information Criterion (cDIC) statistic for the SPHSEM is 18,118.22. This cDIC is lower than that of the simpler, linear HSEM (20,056.88), suggesting that the SPHSEM is the better-fitting model and reaffirming the case for its usage. These results are illustrative in the sense that they demonstrate the power and versatility of the Semiparametric Hierarchical SEM (SPHSEM) when it comes to estimating and recovering nonlinear patterns in the functional relationships between latent variables at both the individual and group levels. They also suggest that the SPHSEM can be used both as an exploratory tool for investigating functional forms and as a confirmatory tool for selecting models through statistically based criteria.

6.2.2. Analysis of Intervention: What if soldiers reported higher Leadership perceptions?

In this particular example, we want to know what would happen to individual self-perception of Hostile behavior if soldiers perceived higher levels of Leadership climate



at the company level. Would it decrease if, say, a coaching program were introduced to improve the Leadership attitudes of Non-Commissioned Officers (NCOs) and Officers? These are policy questions that can be answered through the do-operator and the SPHSEM model.

For example, assume the following intervention: NCOs and Officers in command of companies with low Leadership rankings are assigned to a coaching program, such that their Leadership abilities and attitudes improve until they reach at least the group-level Leadership score observed at the 75th percentile (0.568). That is, assume an intervention do(LEAD^{(2)}_g = 0.568) ∀ g ∈ {g : LEAD^{(2)}_g < 0}. This intervention is expected to directly reduce Hostile perception at the group level (HOSTIL^{(2)}_g), and to indirectly reduce Hostile perception at the soldier level (HOSTIL^{(1)}_{ig}). In this case, the value in the do-operator is evaluated in the estimated functions (red lines in Figures 6.5c and 6.5a). Given the estimated functional forms of these lines, an evaluated value greater than zero predicts a negative causal impact of Leadership perception on Hostile behavior display.

Figures 6.6a and 6.6b show the direct and indirect effects of the intervention do(LEAD^{(2)}_{g,1} = 0.568) on the group- and individual-level Hostile behavior variables, HOSTIL^{(2)}_{g,1} and HOSTIL^{(1)}_{ig,1}, respectively. In Figure 6.6a, the black dots represent the non-intervened companies (the black dotted line is the average of their HOSTIL scores, which takes a value of -0.057). The blue dots represent the to-be-intervened groups, i.e., low-ranked companies to which the intervention do(LEAD^{(2)}_{g,1} = 0.568) is to be supplied (the blue dotted line is their average, and takes a value of 0.044). The red dots are the counterfactual values of both Leadership climate and Hostile behavior that these companies would display had they reported Leadership climate scores at least as good as the 75th percentile (i.e., 0.568). This counterfactual is computed by evaluating the intervention value do(LEAD^{(2)}_{g,1} = 0.568) on the estimated semiparametric structural equation of the SPHSEM (red dotted line, with value -0.048). It is clear how the hypothetical treatment would cause lower levels of perceived hostility at the group level.

[Figure 6.6 here: two scatter panels of recovered values for HOSTIL against recovered values for LEAD. (a) Direct effect of do(LEAD^{(2)}_{g,1} = 0.568) on HOSTIL^{(2)}_{g,1}, group level. (b) Indirect effect of do(LEAD^{(2)}_{g,1} = 0.568) on HOSTIL^{(1)}_{ig,1}, individual level.]

Figure 6.6. Direct and indirect causal effects of do(LEAD^{(2)}_{g,1} = 0.568) for companies with low Leadership rankings (i.e., LEAD^{(2)}_{g,1} < 0). Source: Author's own calculations.



In Figure 6.6b, the black dots represent the non-intervened individuals (the average of the endogenous latent variable of interest is displayed as a black dotted line, and takes a value of -0.039). The blue dots are the individuals belonging to the to-be-intervened companies (their average is displayed as a blue dotted line, with a value of 0.050). The red dots are the counterfactual values of the unit-level endogenous latent variable that the treated individuals would display had they been treated, i.e., had they belonged to a company whose NCOs and Officers were subject to the coaching program. The average hypothetical Hostility score for the treated individuals (red dotted line, with a value of -0.0006) is in fact similar to that of the non-treated.

Before the intervention, the differences between the non-treated and the to-be-treated groups and individuals were −0.102 and −0.089, respectively. After the group-level intervention, the differences were reduced to −0.008 and −0.038 at the group and soldier levels, respectively. As expected, raising a group's Leadership scores (through a hypothetical coaching program) causes both individual- and group-level perceptions of Hostile behavior to decrease. However, these differences were not statistically different from zero, according to a difference-in-means t-test.


Conclusions

Causal inference is a crucial step in scientific discovery, since it allows for estimating the causal effects of interventions (variables) on outcomes of interest from non-experimental, observational data. Most of the literature on causal inference explicitly assumes no causal connections or interactions among individuals, an assumption that is frequently violated in the Social, Behavioral, Biomedical and Epidemiological Sciences. It has been formally shown that the presence of causally connected individuals produces biased estimates of the causal effects of the desired intervention (Rosenbaum, 2007; Sekhon, 2008). Despite this issue, as acknowledged by scholars in the field of causal inference, a formal theory and statistical tools for empirical research are yet to be developed (VanderWeele and An, 2013).

This thesis contributes to filling this gap by presenting a Semi-Parametric Hierarchical Structural Equation Model (SPHSEM) that accounts for the presence of non-independent, causally connected units clustered in groups that are organized following a multilevel structure. The SPHSEM builds upon the work on multilevel Structural Equation Models (SEM) of Rabe-Hesketh et al. (2004, 2012) and Lee and Tang (2006); Lee (2007), among others, in the sense that it accommodates a set of semi-parametric structural equations when modeling the latent variables' means in an HSEM, as in Song et al. (2013). The SPHSEM thus allows for modeling nonlinear, cross-level, causal relationships between latent variables within the theoretical Structural Causal Model (SCM) framework for causal discovery presented and developed by Pearl (2009b) and others.

We use Bayesian techniques, namely a hybrid algorithm that combines the Gibbs sampler (Geman and Geman, 1984) and the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), to estimate the unknown parameters in the SPHSEM. After presenting the formal derivations of the model's distributions, a simulation study shows that the Bayesian SPHSEM is capable of recovering the nonlinear causal relationships between latent variables with cross-level effects. These results were contrasted with those of a linear, hierarchical SEM (HSEM). We conclude that the SPHSEM provides better goodness-of-fit indexes than the HSEM when there are nonlinear causal relationships between latent variables, cross-level effects, and non-independent units clustered within groups.



Further Research

The following is a list of possible further research topics related to the SPHSEM:

A formal definition of identification conditions in the P-splines specification: As suggested by Song et al. (2013), a formal study of identification constraints in the proposed SPHSEM is needed, perhaps following the methodologies developed by Jara et al. (2008) and San-Martín et al. (2011).

Model comparison and Bayesian selection: This item is not covered in this thesis. For a more rigorous assessment of the empirical performance of the SPHSEM, it has to be tested against the null hypothesis of the data following a non-hierarchical or a linear structure for the latent variable equations. A theoretical test has not been developed; model comparison can, however, be performed using Bayesian criteria.

Multivariate spline functions: It could be useful to test whether the goodness-of-fit indexes improve by fitting multivariate spline functions in place of separate, univariate cubic-spline bases for each latent variable entering the structural part of the SPHSEM, at any level of aggregation.

Mediation and computation of (in)direct effects: In order to make causal claims using the SPHSEM, the researcher has to report a measure of the direct or indirect causal effect of a given intervention. Very little has been written on this aspect. Bollen (1987) and Sobel (1987) present theoretical grounds, but it was not until Muthén and Asparouhov (2015) that a way to model mediation and to compute direct and indirect effects within the linear SEM framework was proposed. However, to the best of the author's knowledge, no paper has provided a clear explanation of how to compute direct and indirect effects within a multilevel and/or nonparametric SEM framework.



APPENDIX

Some Causal Quantities of Interest

In this Appendix we present a description of some of the most popular causal target quantities of interest, Q(P′). Throughout this subsection we use the potential outcome notation of Rubin (1974, 1978).

Average Treatment Effect (ATE)

Recall the potential outcome notation introduced in Chapter 1, where Y_i(X_i, t_i) is the value an outcome variable of interest Y, for an individual i, would attain under treatment T_i = t_i, with pre-treatment (or baseline) covariates X_i. For ease of notation, for now we omit the baseline covariates from the potential outcome notation. The causal effect of treatment T_i = t_i over control T_i = c_i for individual i is defined as τ_i = Y_i(t_i) − Y_i(c_i). Since population-based causal claims are derived from a sample of treated and untreated individuals, the average causal effect (ACE) or average treatment effect (ATE) is defined as

ATE = E(τ) = E[Y(t) − Y(c)]
    = E[Y(t)] − E[Y(c)]
    ≈ (1/N) Σ_{i=1}^{N} Y_i(t_i) − (1/N) Σ_{i=1}^{N} Y_i(c_i),

and it is estimated using the observed difference in means (after some matching procedure):

ÂTE = (1/N_t) Σ_{i=1}^{N} Y_i(t_i) · I(i ∈ N_t) − (1/N_c) Σ_{i=1}^{N} Y_i(c_i) · I(i ∈ N_c),

where N_t is the portion of the sample assigned to the treatment regime, and N_c is the portion assigned to the control regime, i.e., N_c = N − N_t.
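Under randomization the estimator is just the difference in sample means; a toy simulation with a known ATE of 1.0:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 10_000
t = rng.integers(0, 2, size=n)          # randomized treatment indicator
y = 1.0 * t + rng.normal(size=n)        # outcomes with a true ATE of 1.0

ate_hat = y[t == 1].mean() - y[t == 0].mean()   # observed difference in means
print(bool(abs(ate_hat - 1.0) < 0.1))
```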

Conditional Average Treatment Effect (CATE)

The conditional average treatment effect is the causal quantity to be estimated when the researcher is interested in the causal effect of a treatment T on specific strata of the




sample, according to ancillary, observed covariates. The CATE statistic is defined as

CATE = E[Y_i(X_i, t_i) − Y_i(X_i, c_i) | X_i = X_a],

for X_a ∈ X, a particular realization or set of possible values of the covariates X.

Complier Average Treatment Effects (CoATE)

In some quasi-experimental settings, the treatment is not actually taken by every individual in the sample. Let Z be an indicator of whether an observation was assigned to the treatment, and D another indicator of whether that observation actually received the treatment. In settings with non-compliers, the sample is divided into always-takers (D_i = 1 regardless of Z_i), never-takers (D_i = 0 regardless of Z_i), and compliers (D_i = 1 when Z_i = 1, and D_i = 0 when Z_i = 0). The complier average treatment effect is then defined as

\[
\mathrm{CoATE} = \frac{E(Y_i\mid Z_i=1)-E(Y_i\mid Z_i=0)}{E(D_i\mid Z_i=1)-E(D_i\mid Z_i=0)}
\]
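This ratio is the intention-to-treat effect on Y divided by the effect of assignment on take-up (the Wald estimator). A hedged sketch on simulated data (the compliance shares and the effect size of 2.0 are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical encouragement design: Z is random assignment, D is actual take-up.
N = 5_000
z = rng.integers(0, 2, size=N)
u = rng.normal(size=N)                       # extra individual-level noise
# Compliance types: ~60% compliers, ~20% always takers, ~20% never takers.
ctype = rng.choice(["c", "a", "n"], size=N, p=[0.6, 0.2, 0.2])
d = np.where(ctype == "c", z, np.where(ctype == "a", 1, 0))
y = 2.0 * d + u + rng.normal(size=N)         # illustrative complier effect = 2.0

# CoATE (Wald) estimator: ITT on Y divided by ITT on D.
coate = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())
print(round(coate, 2))
```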

Average Treatment Effects on the Treated (ATT) and the Control (ATC)

We often need to know the effects of the treatment not just on the whole sample but specifically for those to whom the treatment is actually administered. We define the average effects of treatment among the treated (ATT) and the control (ATC) as simple counterfactual comparisons

\[
\begin{aligned}
\mathrm{ATT} &= E\left(Y_i(t)-Y_i(c)\mid D_i=1\right) = E\left(Y_i(t)\mid D_i=1\right)-E\left(Y_i(c)\mid D_i=1\right)\\
\mathrm{ATC} &= E\left(Y_i(t)-Y_i(c)\mid D_i=0\right) = E\left(Y_i(t)\mid D_i=0\right)-E\left(Y_i(c)\mid D_i=0\right)
\end{aligned}
\]

Bear in mind that when treatment is randomly assigned and there is full compliance, then ATE = ATT = ATC, since E(Yi(c) ∣ Di = 1) = E(Yi(c) ∣ Di = 0) and E(Yi(t) ∣ Di = 0) = E(Yi(t) ∣ Di = 1).

Causal Effects within the SCM

Given a causal model M, and its associated graph G, it is important to make a distinction between total and direct effects of an intervention of the type do(X = x) on a variable of interest Y. Assume a set of intermediate variables Z in the paths that connect X and Y. Following Pearl (2005), and using the potential outcome notation in Rubin (1974, 1978), the total effect of an intervention do(X = x) over an alternative regime do(X = x∗) is defined as P(Yx = y) − P(Yx∗ = y). However, X might have some effect over Z, and to identify the direct effect of X on Y we should isolate every other indirect effect by controlling for Z. That is, the direct effect of an intervention do(X = x) over an alternative regime do(X = x∗), controlling for Z, is defined as P(Yxz = y) − P(Yx∗z = y), where Z = z stands for a specified level of the intermediate variables in Z. First, we define the difference between controlled and natural effects.

Definition 1 (Controlled unit-level direct effect; Pearl, 2005): A variable X has a controlled direct effect on Y in a model M, and situation U = u, if there exists a setting Z = z of the other variables in the model, and values of X, x and x∗, such that Yxz(u) ≠ Yx∗z(u). In words, Y under do(X = x) differs from its value under do(X = x∗) when we keep Z = z fixed.

Definition 2 (Natural unit-level direct effect; Pearl, 2005): An event X = x has a natural direct effect on Y in situation U = u if Yx∗(u) ≠ Yx,Zx∗(u)(u) holds. In words, the value of Y under do(X = x∗) differs from its value under do(X = x) even when we keep Z at the same value that it would attain under X = x∗ and U = u, i.e. Zx∗(u).

We now present definitions for some causal quantities of interest found in the SCM literature.

Average Controlled Direct Effects (ACDE)

Given a causal model M with causal graph G, the controlled direct effect of X = x on Y for a unit with covariates U = u, and setting Z = z, is given by

CDEz(x,x∗;Y,u) = Yxz(u) − Yx∗z(u)

where Z stands for all parents of Y in G, excluding X. Therefore, at the sample/population level, the average controlled direct effect (ACDE) is defined as

ACDEz(x,x∗;Y,u) = EU(Yxz − Yx∗z),

where the expectation is taken over U .

Average Natural Direct Effects (ANDE)

Again, given a causal model M, the natural direct effect of do(X = x) over do(X = x∗) on Y for unit U = u is given by

NDEz(x,x∗;Y,u) = Yx,Zx∗(u)(u) − Yx∗(u)

and the average natural direct effect (ANDE) is defined as

ANDEz(x,x∗;Y,u) = EU(Yx,Zx∗(u) − Yx∗).


APPENDIX

Derivation of posterior distributions

The derivations presented here are for the case of continuous variables, i.e. yig,k = g(y∗ig,k) = y∗ig,k, but with subtle changes the algebra is applicable to every underlying variable k ∈ p. Also, for simplicity, in some cases we present the derivations for L = 2, but the extension to an arbitrary number of levels is straightforward.

Posterior joint distribution for threshold parameters α and underlying latent variables for ordered categorical variables Z∗ (equations 5.17 and 5.18)

From equation (5.4), the posterior distribution for (Z∗, α) can be expressed as the product p(Z∗, α ∣ θ, Ω(L), Z) ∝ p(Z∗ ∣ α, θ, Ω(L), Z) × p(α ∣ θ, Ω(L), Z). The first part of this product is reported in equation (5.16), which is obtained from (4.7). However, the second part, as reported in equation (5.17), is not straightforwardly derived.

First, bear in mind that we assume independent posteriors for every ordered categorical variable. Using the prior distribution in equation (5.15), we recall that for every k = 1, ..., r1,

\[
\begin{aligned}
p(\alpha_k \mid \theta,\Omega^{(L)},Z_k,Z^*_k) &\propto p(\alpha_k)\times p(Z^*_k \mid \alpha_k,\theta,\Omega^{(L)},Z_k)\\
&= c\times\prod_{g=1}^{G}\prod_{i=1}^{n_g}p(z^*_{ig,k}\mid \alpha_k,\theta,\Omega^{(L)},z_{ig,k})\\
&= c^*
\end{aligned} \tag{A.1}
\]

Note that p(z∗ig,k ∣ αk, θ, Ω(L), zig,k) ≠ 0 only when z∗ig,k ∈ [αk,zig,k , αk,zig,k+1). Therefore, p(αk ∣ θ, Ω(L), Zk, Z∗k) ≠ 0 only if Z∗k ∈ Zk, where Zk is the N-dimensional Euclidean space generated when z∗ig,k ∈ [αk,zig,k , αk,zig,k+1), for every i ∈ ng and g ∈ G.

Now, given the Gaussian assumption on the underlying variables, the standardized value z̃∗ig,k ≡ (z∗ig,k − µk − Λ′k,(l)ωig,(l))/ψ¹ᐟ²ε,k follows a standard normal distribution, truncated to the interval [ψ⁻¹ᐟ²ε,k(αk,zig,k − µk − Λ′k,(l)ωig,(l)), ψ⁻¹ᐟ²ε,k(αk,zig,k+1 − µk − Λ′k,(l)ωig,(l))). Also, let Φ∗(ι) be the univariate cumulative distribution function of a standard Gaussian random variable, evaluated at an arbitrary set in the real line, ι. As Φ∗(⋅) is a monotonically increasing transformation, the order statistics and the interpretation remain unaltered.

Following this reasoning, it is clear that p(αk ∣ θ, Ω(L), Zk, Z∗k) ≠ 0 only if each standardized z̃∗ig,k falls in its truncation interval, for every i ∈ ng and g ∈ G. Therefore, c∗ in equation (A.1) can be approximated as the product of independent segments of the set [0, 1], as in equation (5.17). The segment

\[
\Phi^*\left(\psi_{\varepsilon,k}^{-\frac{1}{2}}\left[\alpha_{k,z_{ig,k}+1}-\mu_k-\Lambda'_{k,(l)}\omega_{ig,(l)}\right]\right)-\Phi^*\left(\psi_{\varepsilon,k}^{-\frac{1}{2}}\left[\alpha_{k,z_{ig,k}}-\mu_k-\Lambda'_{k,(l)}\omega_{ig,(l)}\right]\right)
\]

is a measure of how likely it is for the standardized underlying value to belong to the interval [αk,zig,k , αk,zig,k+1) once centered and scaled, and can therefore be used to approximate the value of c∗. The latter argument is similar to that presented in Shi and Lee (1998).
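The segment above is simply a normal-interval probability. A self-contained sketch (the threshold values, residual variance and linear predictor below are illustrative, not from the text) computes it via the error function:

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Illustrative values: thresholds alpha_{k,0},...,alpha_{k,3} for a 3-category
# ordered item, residual variance psi_eps, and linear predictor mu + Lambda'omega.
alpha = [-math.inf, -0.5, 0.8, math.inf]
psi_eps = 1.5
lin_pred = 0.3

def segment_prob(z):
    """P(z*_{ig,k} in [alpha_z, alpha_{z+1})) under the probit measurement model."""
    scale = math.sqrt(psi_eps)
    return (norm_cdf((alpha[z + 1] - lin_pred) / scale)
            - norm_cdf((alpha[z] - lin_pred) / scale))

probs = [segment_prob(z) for z in range(3)]
print([round(p, 3) for p in probs])   # the three segment probabilities sum to 1
```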

Posterior distributions for intercepts µ (equations 5.22 to 5.25)

From equation (5.4), let

\[
\prod_{g=1}^{G}\prod_{i=1}^{n_g} p(y^*_{ig,k}\mid \cdot)\,p(\mu_k)
= \prod_{g=1}^{G}\prod_{i=1}^{n_g} \frac{1}{\sqrt{2\pi\psi_{\varepsilon,k}}}\exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\left(y^*_{ig,k}-\mu_k-\Lambda'_{k,(l)}\omega_{ig,(l)}\right)^2\right\}
\times \frac{1}{\sqrt{2\pi\sigma_{k0}}}\exp\left\{-\frac{1}{2\sigma_{k0}}\left(\mu_k-\mu_{k0}\right)^2\right\} \tag{A.2}
\]

The latter expression is proportional to:

\[
\begin{aligned}
&\propto \exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(y^*_{ig,k}-\mu_k-\Lambda'_{k,(l)}\omega_{ig,(l)}\right)^2-\frac{1}{2\sigma_{k0}}\left(\mu_k-\mu_{k0}\right)^2\right\}\\
&= \exp\left\{-\frac{1}{2}\left(\psi_{\varepsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\left[y^*_{ig,k}-\Lambda'_{k,(l)}\omega_{ig,(l)}\right]-\mu_k\right)^2+\sigma_{k0}^{-1}\left(\mu_k-\mu_{k0}\right)^2\right)\right\}
\end{aligned} \tag{A.3}
\]

Letting \(\tilde{y}^*_{ig,k}=y^*_{ig,k}-\Lambda'_{k,(l)}\omega_{ig,(l)}\), and after some algebra,

\[
\begin{aligned}
&= \exp\left\{-\frac{1}{2}\left[\mu_k^2\left(N\psi_{\varepsilon,k}^{-1}+\sigma_{k0}^{-1}\right)-2\mu_k\left(\psi_{\varepsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\tilde{y}^*_{ig,k}+\sigma_{k0}^{-1}\mu_{k0}\right)+\left(\psi_{\varepsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\tilde{y}^{*2}_{ig,k}+\sigma_{k0}^{-1}\mu_{k0}^{2}\right)\right]\right\}\\
&= \exp\left\{-\frac{N\psi_{\varepsilon,k}^{-1}+\sigma_{k0}^{-1}}{2}\left[\mu_k^2-2\mu_k\left(N\psi_{\varepsilon,k}^{-1}+\sigma_{k0}^{-1}\right)^{-1}\left(\psi_{\varepsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\tilde{y}^*_{ig,k}+\sigma_{k0}^{-1}\mu_{k0}\right)\right.\right.\\
&\qquad\left.\left.+\left(N\psi_{\varepsilon,k}^{-1}+\sigma_{k0}^{-1}\right)^{-1}\left(\psi_{\varepsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\tilde{y}^{*2}_{ig,k}+\sigma_{k0}^{-1}\mu_{k0}^{2}\right)\right]\right\}
\end{aligned} \tag{A.4}
\]

By letting \(\sigma^*_k=\left(N\psi_{\varepsilon,k}^{-1}+\sigma_{k0}^{-1}\right)^{-1}\) and \(\mu^*_k=\sigma^*_k\left(\psi_{\varepsilon,k}^{-1}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\tilde{y}^*_{ig,k}+\sigma_{k0}^{-1}\mu_{k0}\right)\), and after completing the square (the remaining multiplicative factor does not involve \(\mu_k\)), the latter expression can be summarized as

\[
\propto \exp\left\{-\frac{1}{2\sigma^*_k}\left(\mu_k^2-2\mu_k\mu^*_k+\mu^{*2}_k\right)\right\}
\propto \exp\left\{-\frac{1}{2\sigma^*_k}\left(\mu_k-\mu^*_k\right)^2\right\} \tag{A.5}
\]

which is the kernel of the distribution \(p(\mu_k\mid\cdot)\overset{d}{=}N[\mu^*_k,\sigma^*_k]\).
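The resulting update is the standard conjugate normal-mean posterior. As a numerical sanity check on simulated residuals (all values below are illustrative; variable names are ours), a diffuse prior makes the posterior mean essentially reproduce the sample mean of the adjusted observations:

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative conjugate update: residuals ytilde = y* - Lambda'omega with
# variance psi_eps, and prior mu_k ~ N(mu_k0, sigma_k0).
psi_eps, mu_k0, sigma_k0 = 1.5, 0.0, 10.0
ytilde = rng.normal(loc=0.5, scale=np.sqrt(psi_eps), size=500)   # true mean 0.5
N = ytilde.size

# Posterior N(mu_star, sigma_star), following the derived formulas.
sigma_star = 1.0 / (N / psi_eps + 1.0 / sigma_k0)
mu_star = sigma_star * (ytilde.sum() / psi_eps + mu_k0 / sigma_k0)
print(round(mu_star, 2), round(sigma_star, 4))
```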

Posterior distributions for coefficients Λk,(l) and variances ψε,k (equations 5.26 to 5.29)

From equation (5.4), and with \(x=\sum_{l=1}^{L}r^{(l)}_x+\sum_{l=1}^{L}q^{(l)}\), let

\[
\begin{aligned}
\prod_{g=1}^{G}\prod_{i=1}^{n_g} p(y^*_{ig,k}\mid\cdot)\,p(\Lambda_{k,(l)}\mid\psi_{\varepsilon,k})\,p(\psi_{\varepsilon,k})
={}& \prod_{g=1}^{G}\prod_{i=1}^{n_g}\frac{1}{\sqrt{2\pi\psi_{\varepsilon,k}}}\exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\left(y^*_{ig,k}-\mu_k-\Lambda'_{k,(l)}\omega_{ig,(l)}\right)^2\right\}\\
&\times (2\pi)^{-\frac{x}{2}}\left|\psi_{\varepsilon,k}\mathbf{H}_{\Lambda k0}\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\left(\Lambda_{k,(l)}-\Lambda_{k0}\right)'\mathbf{H}^{-1}_{\Lambda k0}\left(\Lambda_{k,(l)}-\Lambda_{k0}\right)\right\}\\
&\times \frac{\beta_{\Lambda k0}^{\alpha_{\Lambda k0}}}{\Gamma(\alpha_{\Lambda k0})}\,\psi_{\varepsilon,k}^{-(\alpha_{\Lambda k0}-1)}\exp\left\{-\frac{\beta_{\Lambda k0}}{\psi_{\varepsilon,k}}\right\}
\end{aligned} \tag{A.6}
\]

The latter expression can be organized as follows:

\[
\begin{aligned}
\propto{}& \prod_{g=1}^{G}\prod_{i=1}^{n_g}\psi_{\varepsilon,k}^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\left(y^*_{ig,k}-\mu_k-\Lambda'_{k,(l)}\omega_{ig,(l)}\right)^2\right\}\left|\psi_{\varepsilon,k}\mathbf{H}_{\Lambda k0}\right|^{-\frac{1}{2}}\\
&\times\exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\left(\Lambda_{k,(l)}-\Lambda_{k0}\right)'\mathbf{H}^{-1}_{\Lambda k0}\left(\Lambda_{k,(l)}-\Lambda_{k0}\right)\right\}\times\psi_{\varepsilon,k}^{-(\alpha_{\Lambda k0}-1)}\exp\left\{-\frac{\beta_{\Lambda k0}}{\psi_{\varepsilon,k}}\right\}\\
={}& \psi_{\varepsilon,k}^{-\frac{N}{2}-\alpha_{\Lambda k0}+1}\left|\psi_{\varepsilon,k}\mathbf{H}_{\Lambda k0}\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\psi_{\varepsilon,k}}\left(\left[\mathbf{Y}^*_k-\Omega'\Lambda_{k,(l)}\right]'\left[\mathbf{Y}^*_k-\Omega'\Lambda_{k,(l)}\right]\right.\right.\\
&\qquad\left.\left.+\left[\Lambda_{k,(l)}-\Lambda_{k0}\right]'\mathbf{H}^{-1}_{\Lambda k0}\left[\Lambda_{k,(l)}-\Lambda_{k0}\right]\right)-\frac{\beta_{\Lambda k0}}{\psi_{\varepsilon,k}}\right\}
\end{aligned} \tag{A.7}
\]

by allowing the N × 1 vector \(\mathbf{Y}^*_k=(y^*_{11,k}-\mu_k,\dots,y^*_{n_G G,k}-\mu_k)'\) and \(\Omega=[\omega_{11,(l)},\dots,\omega_{n_G G,(l)}]\), with vectors \(\omega_{ig,(l)}\) defined as in equation (4.6). After some matrix algebra, and by defining \(A=\mathbf{Y}^{*\prime}_k\mathbf{Y}^*_k+\Lambda'_{k0}\mathbf{H}^{-1}_{\Lambda k0}\Lambda_{k0}\), the expression inside the curly brackets can be further expressed as:

\[
\begin{aligned}
&= \Lambda'_{k,(l)}\left(\mathbf{H}^{-1}_{\Lambda k0}+\Omega\Omega'\right)\Lambda_{k,(l)}+A-\Lambda'_{k,(l)}\left[\mathbf{H}^{-1}_{\Lambda k0}\Lambda_{k0}+\Omega\mathbf{Y}^*_k\right]-\left[\mathbf{Y}^{*\prime}_k\Omega'+\Lambda'_{k0}\mathbf{H}^{-1}_{\Lambda k0}\right]\Lambda_{k,(l)}\\
&= \Lambda'_{k,(l)}\mathbf{H}^{*-1}_k\Lambda_{k,(l)}+A-\Lambda'_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}-\Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda_{k,(l)}\\
&= \Lambda'_{k,(l)}\mathbf{H}^{*-1}_k\Lambda_{k,(l)}+A-\Lambda'_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}-\Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda_{k,(l)}+\Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}-\Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}\\
&= \left[\Lambda_{k,(l)}-\Lambda^*_{k,(l)}\right]'\mathbf{H}^{*-1}_k\left[\Lambda_{k,(l)}-\Lambda^*_{k,(l)}\right]+A-\Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}
\end{aligned}
\]

after defining \(\mathbf{H}^*_k=(\mathbf{H}^{-1}_{\Lambda k0}+\Omega\Omega')^{-1}\) and \(\Lambda^*_{k,(l)}=\mathbf{H}^*_k[\mathbf{H}^{-1}_{\Lambda k0}\Lambda_{k0}+\Omega\mathbf{Y}^*_k]\).

By replacing the last equality inside the curly brackets in the exp expression above, we end up with the kernel of the product of a normal and a gamma distribution, such that \(\Lambda_{k,(l)}\overset{d}{=}N[\Lambda^*_{k,(l)},\psi_{\varepsilon,k}\mathbf{H}^*_k]\) and \(\psi^{-1}_{\varepsilon,k}\overset{d}{=}\mathrm{Gamma}\left[\frac{N}{2}+\alpha_{\Lambda k0},\beta^*_k\right]\), with \(\beta^*_k=\beta_{\Lambda k0}+\frac{1}{2}\left[A-\Lambda^{*\prime}_{k,(l)}\mathbf{H}^{*-1}_k\Lambda^*_{k,(l)}\right]\).
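This is the familiar Normal-Gamma conjugate update for regression coefficients. A compact numerical sketch (dimensions, priors and true values are ours; the intercept is taken as zero so that Y*k needs no centering) computes H∗k, Λ∗k,(l) and β∗k on simulated data:

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative model: ystar = Omega' Lambda + eps, eps ~ N(0, psi), with
# Lambda | psi ~ N(Lambda0, psi*H0) and 1/psi ~ Gamma(a0, b0).
N, p = 400, 2
Omega = rng.normal(size=(p, N))                # p x N design, as in the derivation
Lambda_true = np.array([0.8, -0.5])
ystar = Omega.T @ Lambda_true + rng.normal(scale=1.0, size=N)

Lambda0, H0 = np.zeros(p), 100.0 * np.eye(p)
a0, b0 = 2.0, 1.0

H0_inv = np.linalg.inv(H0)
H_star = np.linalg.inv(H0_inv + Omega @ Omega.T)
Lambda_star = H_star @ (H0_inv @ Lambda0 + Omega @ ystar)
A = ystar @ ystar + Lambda0 @ H0_inv @ Lambda0
b_star = b0 + 0.5 * (A - Lambda_star @ np.linalg.inv(H_star) @ Lambda_star)
# Posterior: Lambda | psi ~ N(Lambda_star, psi*H_star), 1/psi ~ Gamma(N/2 + a0, b_star).
print(np.round(Lambda_star, 2))
```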

Posterior distributions for variances ψ(l)δ,k (equation 5.30)

For any level \(l\), and for any endogenous latent variable \(k=1,\dots,q^{(l)}_1\), we define:

\[
\begin{aligned}
A_{k,(l)} ={}& \sum_{j_1=1}^{m^{(l)}}\sum_{k=1}^{K_{c,j_1}}\gamma_{j_1,k}B^{c}_{j_1,k}\!\left(c^{(l)}_{ig,j_1}\right)-\sum_{j_2=1}^{q^{(l)}_2}\sum_{k=1}^{K_{\xi,j_2}}\gamma_{j_2,k}B^{\xi}_{j_2,k}\!\left(\Phi^*\!\left(\xi^{(l)}_{ig,j_2}\right)\right)\\
&-\sum_{l^*>l}^{L}\left[\sum_{j^*_1=1}^{m^{(l^*)}}\sum_{k=1}^{K_{c_{l^*},j^*_1}}\gamma_{j^*_1,k}B^{c_{l^*}}_{j^*_1,k}\!\left(c^{(l^*)}_{ig,j^*_1}\right)+\sum_{j^*_2=1}^{q^{(l^*)}}\sum_{k=1}^{K_{\omega,j^*_2}}\gamma_{j^*_2,k}B^{\omega_{l^*}}_{j^*_2,k}\!\left(\Phi^*\!\left(\omega^{(l^*)}_{ig,j^*_2}\right)\right)\right]
\end{aligned}
\]

From equation (5.4), it can be shown that, after accounting for the appropriate \(A_{k,(l)}\) for a given \(k\) and a given \(N\) (\(N=N_1=N\) for \(l=1\), \(N_2=G\) for \(l=2\), etc.):

\[
\begin{aligned}
\prod_{l=1}^{L}\prod_{g=1}^{G}\prod_{i=1}^{n_g}p(\eta^{(l)}_{ig,k}\mid\cdot)\,p(\psi^{(l)-1}_{\delta,k})
&\propto \psi^{(l),-N/2}_{\delta,k}\exp\left\{-\frac{1}{2\psi^{(l)}_{\delta,k}}\sum_{l=1}^{L}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\eta^{(l)}_{ig,k}-A_{k,(l)}\right)^2\right\}\times\psi^{(l),-(\alpha_{\delta k0}-1)}_{\delta,k}\exp\left\{-\frac{\beta_{\delta k0}}{\psi^{(l)}_{\delta,k}}\right\}\\
&= \psi^{(l),-(N/2+\alpha_{\delta k0}-1)}_{\delta,k}\exp\left\{-\frac{\beta^*_{\delta}}{\psi^{(l)}_{\delta,k}}\right\}
\end{aligned} \tag{A.8}
\]

which is the kernel of a Gamma distribution, such that \(\psi^{(l)-1}_{\delta,k}\overset{d}{=}\mathrm{Gamma}\left[\frac{N}{2}+\alpha_{\delta k0},\beta^*_{\delta}\right]\), with \(\beta^*_{\delta}=\beta_{\delta k0}+\frac{1}{2}\sum_{l=1}^{L}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\eta^{(l)}_{ig,k}-A_{k,(l)}\right)^2\).
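As a quick numerical check of this Gamma update (all values below are illustrative; variable names are ours), the implied posterior mean of the precision recovers the residual variance:

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative check: residuals e = eta - A with variance psi_delta, and
# posterior 1/psi ~ Gamma(N/2 + a0, b_star) with b_star = b0 + sum(e^2)/2.
a0, b0 = 2.0, 1.0
psi_true = 1.0
e = rng.normal(scale=np.sqrt(psi_true), size=2_000)    # stands in for eta - A_k,(l)
N = e.size

b_star = b0 + 0.5 * np.sum(e ** 2)
post_mean_precision = (N / 2 + a0) / b_star            # E[1/psi | data]
print(round(1.0 / post_mean_precision, 2))             # roughly psi_true
```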


Posterior distributions for covariance matrices Φ(l), for every l = 1, ..., L (equation 5.31)

From equation (5.4), it can be shown that,

\[
\begin{aligned}
p(\xi\mid\Phi)\,p(\Phi) &= \prod_{l=1}^{L}\prod_{g=1}^{G}\prod_{i=1}^{n_g}p(\xi^{(l)}_{ig}\mid\Phi^{(l)})\,p(\Phi^{(l)})\\
&\propto \prod_{l=1}^{L}\prod_{g=1}^{G}\prod_{i=1}^{n_g}\left|\Phi^{(l)}\right|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}\xi^{(l)\prime}_{ig}\Phi^{(l)-1}\xi^{(l)}_{ig}\right\}\times\left|\Phi^{(l)}\right|^{-\frac{\rho_0+q^{(l)}_2+1}{2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\left(\mathbf{R}_0\Phi^{(l)-1}\right)\right\}\\
&\propto \prod_{l=1}^{L}\left[\left|\Phi^{(l)}\right|^{-\frac{N}{2}}\exp\left\{-\frac{1}{2}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\xi^{(l)\prime}_{ig}\Phi^{(l)-1}\xi^{(l)}_{ig}\right\}\times\left|\Phi^{(l)}\right|^{-\frac{\rho_0+q^{(l)}_2+1}{2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\left(\mathbf{R}_0\Phi^{(l)-1}\right)\right\}\right]
\end{aligned} \tag{A.9}
\]

We present the derivations for level l = 1, so we set N = N1 = N and a q(1)2 × N matrix ξ(l) = (ξ(1)11, ..., ξ(1)ngG). The extensions to levels l > 1 are straightforward (with N2 = G, and so on). Bearing in mind that \(\sum_{g=1}^{G}\sum_{i=1}^{n_g}\xi^{(l)\prime}_{ig}\Phi^{(l)-1}\xi^{(l)}_{ig}=\mathrm{tr}\left(\xi^{(l)\prime}\Phi^{(l)-1}\xi^{(l)}\right)\), the expression above can be summarized as:

\[
\begin{aligned}
&\propto \left|\Phi^{(l)}\right|^{-\frac{N+\rho_0+q^{(l)}_2+1}{2}}\exp\left\{-\frac{1}{2}\left[\mathrm{tr}\left(\xi^{(l)\prime}\Phi^{(l)-1}\xi^{(l)}\right)+\mathrm{tr}\left(\mathbf{R}_0\Phi^{(l)-1}\right)\right]\right\}\\
&= \left|\Phi^{(l)}\right|^{-\frac{N+\rho_0+q^{(l)}_2+1}{2}}\exp\left\{-\frac{1}{2}\mathrm{tr}\left(\left[\xi^{(l)}\xi^{(l)\prime}+\mathbf{R}_0\right]\Phi^{(l)-1}\right)\right\}
\end{aligned} \tag{A.10}
\]

which is the kernel of an Inverse Wishart distribution, \(\Phi^{(l)}\overset{d}{=}\mathrm{IW}\left[N+\rho_0,\;\xi^{(l)}\xi^{(l)\prime}+\mathbf{R}_0\right]\).
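A short numerical sketch of the inverse-Wishart update (dimensions, prior values and names are ours): the posterior mean S/(df − q − 1) of an IW(df, S) distribution approximately recovers the true covariance when N is large:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative update: latent xi_ig ~ N(0, Phi), prior Phi ~ IW(rho0, R0),
# posterior Phi ~ IW(N + rho0, xi xi' + R0).
q, N = 2, 3_000
Phi_true = np.array([[1.0, 0.3], [0.3, 1.0]])
xi = np.linalg.cholesky(Phi_true) @ rng.normal(size=(q, N))   # q x N, as in the text

rho0, R0 = 5.0, np.eye(q)
df_post = N + rho0
S_post = xi @ xi.T + R0
# Posterior mean of IW(df, S) is S / (df - q - 1); close to Phi_true for large N.
Phi_post_mean = S_post / (df_post - q - 1)
print(np.round(Phi_post_mean, 2))
```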

Posterior distributions for variances τp (equation 5.32)

According to Song et al. (2013), the posterior distribution for every \(\tau_p\), \(p\in J_1, J_2, J^*_1\) or \(J^*_2\), is derived from the conjugation between the prior distributions \(p(\tau^{-1}_p)\) and \(p(\gamma_p\mid\tau^{-1}_p)\). Therefore, for any \(p\),

\[
\begin{aligned}
p(\gamma_p\mid\tau^{-1}_p)\,p(\tau^{-1}_p) &\propto \tau_p^{-K/2}\exp\left\{-\frac{1}{2\tau_p}\gamma'_p\mathbf{M}_{\gamma_p}\gamma_p\right\}\times\left(\tau^{-1}_p\right)^{\alpha_{\gamma 0}-1}\exp\left\{-\frac{\beta_{\gamma 0}}{\tau_p}\right\}\\
&= \left(\tau^{-1}_p\right)^{\alpha_{\gamma 0}+K/2-1}\exp\left\{-\left(\beta_{\gamma 0}+\frac{\gamma'_p\mathbf{M}_{\gamma_p}\gamma_p}{2}\right)\tau^{-1}_p\right\}
\end{aligned} \tag{A.11}
\]

which is the kernel of a Gamma distribution, such that \(\tau^{-1}_p\overset{d}{=}\mathrm{Gamma}\left[\alpha_{\gamma 0}+\frac{K}{2},\;\beta_{\gamma 0}+\frac{\gamma'_p\mathbf{M}_{\gamma_p}\gamma_p}{2}\right]\).


Posterior distributions for coefficients γp (equation 5.33)

Let \(\eta^{(l)*}_k\), \(\eta^{(l)*}_{ig,k}\), and \(\mathbf{B}_p\) be defined as in subsection 5.2. For any level \(l\), for \(k=1,\dots,q^{(l)}_1\), and for a generic variable \(\kappa_p\) entering the structural equation, let

\[
\begin{aligned}
\prod_{g=1}^{G}\prod_{i=1}^{n_g}p(\eta^{(l)*}_{ig,k}\mid\cdot)\,p(\gamma_p\mid\tau^{-1}_p)
\propto{}& \left(\psi^{(l)}_{\delta,k}\right)^{-\frac{N}{2}}\exp\left\{-\frac{1}{2\psi^{(l)}_{\delta,k}}\sum_{g=1}^{G}\sum_{i=1}^{n_g}\left(\eta^{(l)*}_{ig,k}-\sum_{k=1}^{K}\gamma_{p,k}B_{p,k}\!\left(\Phi^*(\kappa_{ig,p})\right)\right)^2\right\}\\
&\times \tau_p^{-\frac{1}{2}}\exp\left\{-\frac{1}{2\tau_p}\gamma'_p\mathbf{M}_{\gamma_p}\gamma_p\right\}I(\mathbf{1}'_N\mathbf{B}_p\gamma_p=0)\\
={}& \left(\psi^{(l)}_{\delta,k}\right)^{-\frac{N}{2}}\tau_p^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}\left[\frac{\left(\eta^{(l)*}_k-\mathbf{B}_p\gamma_p\right)'\left(\eta^{(l)*}_k-\mathbf{B}_p\gamma_p\right)}{\psi^{(l)}_{\delta,k}}+\frac{\gamma'_p\mathbf{M}_{\gamma_p}\gamma_p}{\tau_p}\right]\right\}I(\cdot)
\end{aligned} \tag{A.12}
\]

After simple matrix algebra, and by defining \(A_{\gamma_p}=(\psi^{(l)}_{\delta,k})^{-N/2}\tau_p^{-1/2}\), the latter expression can be reformulated as

\[
\begin{aligned}
&\propto A_{\gamma_p}\exp\left\{-\frac{1}{2}\left(\frac{\gamma'_p\mathbf{B}'_p\mathbf{B}_p\gamma_p}{\psi^{(l)}_{\delta,k}}-\frac{\eta^{(l)*\prime}_k\mathbf{B}_p\gamma_p+\gamma'_p\mathbf{B}'_p\eta^{(l)*}_k}{\psi^{(l)}_{\delta,k}}+\frac{\gamma'_p\mathbf{M}_{\gamma_p}\gamma_p}{\tau_p}\right)\right\}I(\cdot)\\
&= A_{\gamma_p}\exp\left\{-\frac{1}{2}\left[\gamma'_p\left(\frac{\mathbf{B}'_p\mathbf{B}_p}{\psi^{(l)}_{\delta,k}}+\frac{\mathbf{M}_{\gamma_p}}{\tau_p}\right)\gamma_p-\frac{\eta^{(l)*\prime}_k\mathbf{B}_p\gamma_p+\gamma'_p\mathbf{B}'_p\eta^{(l)*}_k}{\psi^{(l)}_{\delta,k}}\right]\right\}I(\cdot)
\end{aligned} \tag{A.13}
\]

By allowing \(\Sigma^*_{\gamma_p}=\left(\mathbf{B}'_p\mathbf{B}_p/\psi^{(l)}_{\delta,k}+\mathbf{M}_{\gamma_p}/\tau_p\right)^{-1}\) and \(\gamma^*_p=\left[\Sigma^*_{\gamma_p}\mathbf{B}'_p\eta^{(l)*}_k\right]\left(\psi^{(l)}_{\delta,k}\right)^{-1}\), and after adding and subtracting \(\gamma^{*\prime}_p\Sigma^{*-1}_{\gamma_p}\gamma^*_p\), we have

\[
\begin{aligned}
&\propto A_{\gamma_p}\exp\left\{-\frac{1}{2}\left[\gamma'_p\Sigma^{*-1}_{\gamma_p}\gamma_p-\gamma^{*\prime}_p\Sigma^{*-1}_{\gamma_p}\gamma_p-\gamma'_p\Sigma^{*-1}_{\gamma_p}\gamma^*_p+\gamma^{*\prime}_p\Sigma^{*-1}_{\gamma_p}\gamma^*_p-\gamma^{*\prime}_p\Sigma^{*-1}_{\gamma_p}\gamma^*_p\right]\right\}I(\cdot)\\
&\propto A^*_{\gamma_p}\exp\left\{-\frac{1}{2}\left(\gamma_p-\gamma^*_p\right)'\Sigma^{*-1}_{\gamma_p}\left(\gamma_p-\gamma^*_p\right)\right\}I(\cdot)
\end{aligned} \tag{A.14}
\]

with \(A^*_{\gamma_p}=A_{\gamma_p}\exp\left\{-\frac{1}{2}\left(-\gamma^{*\prime}_p\Sigma^{*-1}_{\gamma_p}\gamma^*_p\right)\right\}\). The expression above is the kernel of a truncated Normal distribution, such that \(\gamma_p\overset{D}{=}N[\gamma^*_p,\Sigma^*_{\gamma_p}]\,I(\mathbf{1}'_N\mathbf{B}_p\gamma_p=0)\). Samples from this truncated distribution are obtained following the procedure explained at the end of subsection 5.2, as presented in Song et al. (2013).
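One standard way to draw from a Gaussian subject to a linear equality constraint, sketched below on purely illustrative values (the thesis instead follows the procedure in Song et al. (2013); all names here are ours), is to condition the normal distribution on the constraint:

```python
import numpy as np

rng = np.random.default_rng(8)

# Sketch: sample gamma ~ N(gamma_star, Sigma_star) subject to c'gamma = 0,
# where c stands in for B_p' 1_N, by conditioning the Gaussian on the constraint.
K = 4
gamma_star = rng.normal(size=K)
A_ = rng.normal(size=(K, K))
Sigma_star = A_ @ A_.T + np.eye(K)          # a valid (positive definite) covariance
c = rng.normal(size=K)

# Conditional mean and covariance of gamma given c'gamma = 0.
Sc = Sigma_star @ c
denom = c @ Sc
mean_c = gamma_star - Sc * (c @ gamma_star) / denom
cov_c = Sigma_star - np.outer(Sc, Sc) / denom   # singular along direction c

draw = rng.multivariate_normal(mean_c, cov_c, check_valid="ignore")
print(round(float(c @ draw), 6))                # constraint holds up to round-off
```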


APPENDIX

Results of Simulation Study

Bayesian Estimates for the SPHSEM parameters

Table A1. Bayesian SPHSEM estimates, and their 90% probability interval.

Parameter | True Value | Empirical Value | Bayes Est. Value | 5th pctle. bound | 95th pctle. bound
µ1 | 0.5 | - | 0.308 | 0.138 | 0.503
µ2 | 0.5 | - | 0.363 | 0.214 | 0.527
µ3 | 0.5 | - | 0.413 | 0.263 | 0.574
µ4 | 0.5 | - | 0.268 | -0.021 | 0.584
µ5 | 0.5 | - | 0.226 | -0.005 | 0.474
µ6 | 0.5 | - | 0.134 | -0.101 | 0.387
λ2,1 | 0.8 | - | 0.817 | 0.734 | 0.899
λ2,2 | 0.8 | - | 0.824 | 0.715 | 0.944
λ3,1 | 0.8 | - | 0.727 | 0.613 | 0.841
λ3,2 | 0.8 | - | 0.819 | 0.710 | 0.941
λ5,1 | 0.8 | - | 0.733 | 0.646 | 0.822
λ5,2 | 0.8 | - | 0.790 | 0.738 | 0.846
λ6,1 | 0.8 | - | 0.656 | 0.572 | 0.741
λ6,2 | 0.8 | - | 0.794 | 0.741 | 0.850
ψε,1 | 1.5 | - | 1.409 | 1.255 | 1.576
ψε,2 | 1.5 | - | 1.455 | 1.308 | 1.605
ψε,3 | 1.5 | - | 1.562 | 1.392 | 1.747
ψε,4 | 1.5 | - | 1.188 | 0.973 | 1.399
ψε,5 | 1.5 | - | 1.402 | 1.247 | 1.570
ψε,6 | 1.5 | - | 1.469 | 1.316 | 1.632
Φ(1) | 1.0 | 0.958 | 1.073 | 0.959 | 1.197
Φ(2) | 1.0 | 0.712 | 0.706 | 0.414 | 1.143
ψ(1)δ,1 | 1.0 | 1.825 | 1.590 | 1.299 | 1.951
ψ(2)δ,1 | 1.0 | 2.707 | 1.227 | 0.615 | 2.219
τ(1)1 | - | - | 2.290 | 1.039 | 4.173
τ(1)2 | - | - | 0.011 | 0.002 | 0.035
τ(2)1 | - | - | 0.004 | 0.001 | 0.012
τ(2)2 | - | - | 0.005 | 0.001 | 0.013


Comparison between SPHSEM and HSEM estimates

Table A2. Bayesian SPHSEM and HSEM estimates for simulated parameters.

Parameter | True Value | SPHSEM Estimate | Bias | HSEM Estimate | Bias
µ1 | 0.5 | 0.308 | -0.192 | 0.148 | -0.352
µ2 | 0.5 | 0.363 | -0.137 | 0.229 | -0.271
µ3 | 0.5 | 0.413 | -0.087 | 0.280 | -0.220
µ4 | 0.5 | 0.268 | -0.232 | -0.611 | -1.111
µ5 | 0.5 | 0.226 | -0.274 | -0.456 | -0.956
µ6 | 0.5 | 0.134 | -0.366 | -0.539 | -1.039
λ2,1 | 0.8 | 0.817 | 0.017 | 1.073 | 0.273
λ2,2 | 0.8 | 0.824 | 0.024 | 0.804 | 0.004
λ3,1 | 0.8 | 0.727 | -0.073 | 0.909 | 0.109
λ3,2 | 0.8 | 0.819 | 0.019 | 0.851 | 0.051
λ5,1 | 0.8 | 0.733 | -0.067 | 0.464 | -0.336
λ5,2 | 0.8 | 0.790 | -0.010 | 0.647 | -0.153
λ6,1 | 0.8 | 0.656 | -0.144 | 0.417 | -0.383
λ6,2 | 0.8 | 0.794 | -0.006 | 0.622 | -0.178
ψε,1 | 1.5 | 1.409 | -0.091 | 1.677 | 0.177
ψε,2 | 1.5 | 1.455 | -0.045 | 1.395 | -0.105
ψε,3 | 1.5 | 1.562 | 0.062 | 1.557 | 0.057
ψε,4 | 1.5 | 1.188 | -0.312 | 0.236 | -1.268
ψε,5 | 1.5 | 1.402 | -0.098 | 1.752 | 0.252
ψε,6 | 1.5 | 1.469 | -0.031 | 1.751 | 0.251
Φ(1) | 1.0 | 1.073 | 0.073 | 0.729 | -0.271
Φ(2) | 1.0 | 0.706 | -0.294 | 0.545 | -0.455
ψ(1)δ,1 | 1.0 | 1.590 | 0.590 | 5.476 | 4.476
ψ(2)δ,1 | 1.0 | 1.227 | 0.227 | 15.278 | 14.278
γ1 | - | - | - | -0.001 | -
γ2 | - | - | - | -0.734 | -
γ3 | - | - | - | 6.844 | -
∑|BIAS| | | | 3.471 | | 27.026


APPENDIX

MCMC Results for the Simulation Example

[Figure omitted in this text version: trace plots over 5,000 MCMC repetitions. Panel (a): MCMC chains for µk (µ1 to µ6); panel (b): MCMC chains for λk (λ1 to λ8).]

Figure 7. MCMC chains for µ and λ. Source: Author's own calculations.


[Figure omitted in this text version: trace plots of the MCMC chains for ψε,1 to ψε,6 over 5,000 MCMC repetitions.]

Figure 8. MCMC chains for ψε. Source: Author's own calculations.

[Figure omitted in this text version: trace plots and posterior histograms. Panel (a): MCMC chain for ψ(1)δ; panel (b): MCMC chain for ψ(2)δ.]

Figure 9. MCMC chains and histograms for ψδ (levels 1 and 2). Source: Author's own calculations.

[Figure omitted in this text version: trace plots and posterior histograms. Panel (a): MCMC chain for Φ(1); panel (b): MCMC chain for Φ(2).]

Figure 10. MCMC chains and histograms for Φ (levels 1 and 2). Source: Author's own calculations.

Page 109: ¡rdenasHurtado... · 2017-06-06 · 3 Title in English Causal Inference in the Presence of Causally Connected Units: A Semi-Parametric Hierarchical Structural Equation Model Approach

MCMC RESULTS FOR THE SIMULATION EXAMPLE 98

[Figure omitted in this text version: trace plots and posterior histograms. Panel (a): MCMC chains for τ(1) (level 1); panel (b): MCMC chains for τ(2) (level 2).]

Figure 11. MCMC chains and histograms for τ (levels 1 and 2). Source: Author's own calculations.

[Figure omitted in this text version: Bayesian SPHSEM estimated values of γ plotted against the total number of knots.]

Figure 12. MCMC results for γ. Source: Author's own calculations.


APPENDIX

Description of the Empirical Exercise

In this Appendix we present some useful results for the empirical exercise in Chapter 6.

Bliese et al. (2002) dataset

The items reported in the dataset in Bliese et al. (2002) are organized in 21 variables: two identifying variables, five (5) items for the Hostility behavior latent variable (HOSTIL), eleven (11) for the Leadership Climate perception (LEAD), and three (3) for the Task Significance perception (TSIG). The item responses are on a five-point Likert scale (ordered categorical variables), with the ends anchored by "none" and "extreme" for the HOSTIL items, and "strongly disagree" and "strongly agree" for both the LEAD and TSIG items. The survey questions are outlined as follows:

Soldiers were asked to respond “I think/feel/believe ...”

COMPID: Army Company (group) identifying variable.

SUB: Soldier (individual) identifying number.

HOSTIL01: I am easily annoyed or irritated.

HOSTIL02: I have temper outbursts that I cannot control.

HOSTIL03: I have urges to harm someone.

HOSTIL04: I have urges to break things.

HOSTIL05: I get into frequent arguments.

LEAD01: Officers get cooperation from company.

LEAD02: Non-commissioned Officers (NCOs) get cooperation from company.

LEAD03: I am impressed by Leadership.

LEAD04: I would go for help within chain of command.


LEAD05: Officers would lead well in combat.

LEAD06: NCOs would lead well in combat.

LEAD07: Officers are interested in welfare.

LEAD08: NCOs are interested in welfare.

LEAD09: Officers are interested in what I think.

LEAD10: NCOs are interested in what I think.

LEAD11: Chain of command works well.

TSIG01: What I am doing is important.

TSIG02: I am contributing to the mission.

TSIG03: What I am doing helps to accomplish the mission.

The proposed model path diagram

[Path diagram omitted in this text version: at level 1, the latent variables LEADig, TSIGig and HOSTig are measured by their respective items (Lig, Tig, Hig) through loadings λ1, with one loading per block fixed to 1; at level 2, the group-level latent variables LEADg, TSIGg and HOSTg load on the items through λ2; smooth functions f11, f12 (level 1), f21, f22 (level 2) and f13g link the latent variables across levels.]

Figure 13. Proposed path diagram (DAG) for the SPHSEM model for Hostile Behavior, Leadership Climate and Task Significance. Source: Proposed by the author.


SPHSEM Estimation Results

Table A3. SPHSEM Bayesian estimates of the structural parameters in the study of Hostility and Leadership (Bliese et al., 2002).

Par. | Est. | SE. | Par. | Est. | SE. | Par. | Est. | SE.
µ1 | -0.158 | 0.463 | λ5,5 | 0.395 | 0.037 | ψε,2 | 0.727 | 0.026
µ2 | -0.046 | 0.135 | λ6,2 | 0.334 | 0.013 | ψε,3 | 0.445 | 0.016
µ3 | -0.066 | 0.194 | λ6,5 | 0.320 | 0.039 | ψε,4 | 0.592 | 0.021
µ4 | -0.057 | 0.166 | λ7,2 | 0.441 | 0.012 | ψε,5 | 0.502 | 0.018
µ5 | -0.063 | 0.185 | λ7,5 | 0.424 | 0.037 | ψε,6 | 0.664 | 0.024
µ6 | -0.051 | 0.151 | λ8,2 | 0.366 | 0.013 | ψε,7 | 0.429 | 0.017
µ7 | -0.067 | 0.198 | λ8,5 | 0.351 | 0.039 | ψε,8 | 0.602 | 0.023
µ8 | -0.056 | 0.164 | λ9,2 | 0.442 | 0.011 | ψε,9 | 0.426 | 0.016
µ9 | -0.067 | 0.198 | λ9,5 | 0.425 | 0.037 | ψε,10 | 0.600 | 0.022
µ10 | -0.056 | 0.165 | λ10,2 | 0.367 | 0.012 | ψε,11 | 0.406 | 0.016
µ11 | -0.068 | 0.201 | λ10,5 | 0.352 | 0.038 | ψε,12 | 1.291 | 0.069
µ12 | -3.363 | 0.525 | λ11,2 | 0.450 | 0.011 | ψε,13 | 0.313 | 0.020
µ13 | -1.838 | 0.269 | λ11,5 | 0.432 | 0.037 | ψε,14 | 0.305 | 0.017
µ14 | -1.849 | 0.278 | λ13,3 | 0.551 | 0.015 | ψε,15 | 0.549 | 0.021
µ15 | 0.004 | 0.021 | λ13,6 | 0.546 | 0.025 | ψε,16 | 0.512 | 0.022
µ16 | 0.003 | 0.020 | λ14,3 | 0.554 | 0.015 | ψε,17 | 0.304 | 0.015
µ17 | 0.004 | 0.018 | λ14,6 | 0.548 | 0.023 | ψε,18 | 0.218 | 0.016
µ18 | 0.004 | 0.017 | λ16,1 | 1.111 | 0.043 | ψε,19 | 0.495 | 0.021
µ19 | 0.003 | 0.020 | λ16,4 | 0.807 | 0.327 | ψ(1)δ,1 | 0.063 | 0.017
λ2,2 | 0.300 | 0.013 | λ17,1 | 1.332 | 0.049 | ψ(2)δ,1 | 0.002 | 0.001
λ2,5 | 0.287 | 0.040 | λ17,4 | 0.975 | 0.306 | Φ(1)1,1 | 3.214 | 0.133
λ3,2 | 0.435 | 0.011 | λ18,1 | 1.415 | 0.053 | Φ(1)2,2 | 5.201 | 0.647
λ3,5 | 0.417 | 0.036 | λ18,4 | 1.022 | 0.299 | Φ(1)1,2 | 1.316 | 0.132
λ4,2 | 0.371 | 0.012 | λ19,1 | 1.130 | 0.044 | Φ(2)1,1 | 0.622 | 0.222
λ4,5 | 0.355 | 0.038 | λ19,4 | 0.822 | 0.324 | Φ(2)2,2 | 5.067 | 1.855
λ5,2 | 0.411 | 0.012 | ψε,1 | 1.952 | 0.087 | Φ(2)1,2 | -0.054 | 0.924


APPENDIX

R Codes

The simulations presented here were computed using the statistical software R (R Development Core Team, 2017). Codes are available upon request to the author. Please write to [email protected].


Bibliography

Angrist, J. (1990). Lifetime Earnings and the Vietnam Era Draft Lottery: Evidence from Social Security Administrative Records. American Economic Review, 80(3):313–336.

Angrist, J., Imbens, G., and Rubin, D. B. (1996). Identification of Causal Effects Using Instrumental Variables. Journal of the American Statistical Association, 91(434):444–455.

Ansari, A. and Jedidi, K. (2000). Bayesian Factor Analysis for Multilevel Binary Observations. Psychometrika, 65(4):475–496.

Ansari, A., Jedidi, K., and Jagpal, S. (2000). A Hierarchical Bayesian Methodology for Treating Heterogeneity in Structural Equation Models. Marketing Science, 19(4):328–347.

Arminger, G. and Muthén, B. (1998). A Bayesian Approach to Nonlinear Latent Variable Models Using the Gibbs Sampler and the Metropolis-Hastings Algorithm. Psychometrika, 63(3):271–300.

Balke, A. and Pearl, J. (1997). Bounds on Treatment Effects from Studies with Imperfect Compliance. Journal of the American Statistical Association, 92(439):1171–1176.

Baron, R. M. and Kenny, D. A. (1986). The Moderator-Mediator Variable Distinction in Social Psychological Research: Conceptual, Strategic, and Statistical Considerations. Journal of Personality and Social Psychology, 51(6):1173–1182.

Barringer, S., Eliason, S., and Leahey, E. (2013). A History of Causal Analysis in the Social Sciences. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 2, pages 9–26. New York, NY, US: Springer-Verlag.

Besag, J. (1974). Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society - Series B, 36(2):192–236.

Besag, J., York, J., and Mollié, A. (1991). Bayesian Image Restoration with Two Applications in Spatial Statistics. Annals of the Institute of Statistical Mathematics, 43(1):1–20.

Bliese, P. D. (2016). multilevel: Multilevel Functions. R package version 2.6. https://CRAN.R-project.org/package=multilevel.


Bliese, P. D. and Halverson, R. R. (1996). Individual and Nomothetic Models of Job Stress: An Examination of Work Hours, Cohesion, and Well Being. Journal of Applied Social Psychology, 26(13):1171–1189.

Bliese, P. D., Halverson, R. R., and Schriesheim, C. A. (2002). Benchmarking multilevel methods in leadership: The articles, the model and the data set. The Leadership Quarterly, 13:3–14.

Bollen, K. A. (1987). Total, Direct, and Indirect Effects in Structural Equation Models. Sociological Methodology, 17:39–67.

Bollen, K. A. (1989). Structural Equations with Latent Variables. New York, NY, US: John Wiley & Sons, Ltd., 1st edition.

Bollen, K. A. and Pearl, J. (2013). Eight Myths about Causality and Structural Equation Models. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 16, pages 301–328. New York, NY, US: Springer-Verlag.

Bollen, K. A. and Stine, R. (1990). Direct and Indirect Effects: Classical and Bootstrap Estimates of Variability. Sociological Methodology, 20:115–140.

Bowers, J., Fredrickson, M. M., and Panagopoulos, C. (2013). Reasoning about Interference Between Units: A General Framework. Political Analysis, 21:97–124.

Brezger, A. and Lang, S. (2006). Generalized structured additive regression based on Bayesian P-splines. Computational Statistics & Data Analysis, 50:967–991.

Brookhart, M. A., Schneeweiss, S., Rothman, K. J., Glynn, R. J., Avorn, J., and Stürmer, T. (2006). Variable Selection for Propensity Score Models. American Journal of Epidemiology, 163(12):1149–1156.

Browne, M. W. (1982). Covariance structures. In Hawkins, D. M., editor, Topics in Applied Multivariate Analysis, chapter 2, pages 72–133. Cambridge, UK: Cambridge University Press.

Browne, M. W. (1984). Asymptotically distribution-free methods for the analysis of covariance structures. British Journal of Mathematical and Statistical Psychology, 37(1):62–83.

Cartwright, N. (1983). How the Laws of Physics Lie. Oxford, UK: Clarendon Press - Oxford University Press.

Cartwright, N. (1989). Nature's Capacities and their Measurement. Oxford, UK: Clarendon Press - Oxford University Press.

Celeux, G., Forbes, F., Robert, C. P., and Titterington, D. M. (2006). Deviance Information Criteria for Missing Data Models. Bayesian Analysis, 1(4):651–674.

Chou, C.-P. and Bentler, P. M. (1995). Estimates and Tests in Structural Equation Modeling. In Hoyle, R. H., editor, Structural Equation Modeling: Concepts, Issues and Applications, chapter 3, pages 37–55. Thousand Oaks, CA, US: Sage Publications.

Chou, C.-P., Bentler, P. M., and Satorra, A. (1991). Scaled test statistics and robust standard errors for non-normal data in a covariance structure analysis: A Monte Carlo study. British Journal of Mathematical and Statistical Psychology, 44:347–357.


Clarke, K. A., Kenkel, B., and Rueda, M. R. (2011). Misspecification and the Propensity Score: The Possibility of Overadjustment. University of Rochester, Department of Political Science Working Paper. https://www.rochester.edu/college/psc/clarke/MissProp.pdf.

Cochran, W. and Rubin, D. B. (1973). Controlling Bias in Observational Studies: A Review. Sankhyā: The Indian Journal of Statistics, Series A, 35(4):417–446.

Cowles, M. (1996). Accelerating Monte Carlo Markov Chain convergence for cumulative-link generalized linear models. Statistics and Computing, 6:101–111.

Crespo-Tenorio, A. and Montgomery, J. (2013). A Bayesian Approach to Inference with Instrumental Variables: Improving Estimation of Treatment Effects with Weak Instruments and Small Samples. Unpublished manuscript.

Daniel, R. M., Cousens, S. N., De Stavola, B. L., Kenward, M. G., and Sterne, J. A. C. (2013). Methods for dealing with time-dependent confounding. Statistics in Medicine, 32(9):1584–1648.

Dawid, A. P. (1979). Conditional Independence in Statistical Theory. Journal of the Royal Statistical Society, Series B, 41(1):1–31.

Dawid, A. P. (2008). Beware of the DAG! Journal of Machine Learning Research: Workshop and Conference Proceedings, 6:59–86.

Dawid, A. P. (2010). Seeing and Doing: The Pearlian Synthesis. In Dechter, R., Geffner, H., and Halpern, J., editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, chapter 18, pages 309–325. London, UK: College Publications.

De Boor, C. (1978). A Practical Guide to Splines. New York, NY, US: Springer-Verlag, 1st edition.

Duncan, O. D. (1975). Introduction to Structural Equation Models. New York, NY, US: Academic Press, Inc., 1st edition.

Duncan, O. D., Haller, A. O., and Portes, A. (1968). Peer Influences on Aspirations: A Reinterpretation. American Journal of Sociology, 74(2):119–137.

Dunson, D. B. (2000). Bayesian Latent Variable Models for Clustered Mixed Outcomes. Journal of the Royal Statistical Society - Series B, 62(2):355–366.

Eilers, P. and Marx, B. (1996). Flexible Smoothing with B-splines and Penalties. Statistical Science, 11(2):89–121.

Eilers, P. and Marx, B. (1998). Direct generalized additive modeling with penalized likelihood. Computational Statistics & Data Analysis, 28:193–209.

Fahrmeir, L. and Raach, A. (2007). A Bayesian Semiparametric Latent Variable Model for Mixed Responses. Psychometrika, 72(3):327–346.

Fox, J. (1984). Linear Statistical Models and Related Methods with Applications to Social Research. New York, NY, US: John Wiley & Sons, Ltd., 1st edition.

Geiger, D. and Pearl, J. (1990). Logical and Algorithmic Properties of Conditional Independence and their Application to Bayesian Networks. Annals of Mathematics and Artificial Intelligence, 2:165–178.


Geiger, D. and Pearl, J. (1993). Logical and Algorithmic Properties of Conditional Independence and Graphical Models. The Annals of Statistics, 21(4):2001–2021.

Geiger, D., Verma, T., and Pearl, J. (1990). Identifying Independence in Bayesian Networks. Networks, 20:507–534.

Gelman, A. (2009). Resolving disputes between J. Pearl and D. Rubin on causal inference. http://andrewgelman.com/2009/07/05/disputes_about/.

Gelman, A. (2011). Review Essay: Causality and Statistical Learning. American Journal of Sociology, 117(3):955–966.

Gelman, A. and Meng, X.-L. (1998). Simulating Normalizing Constants: From Importance Sampling to Bridge Sampling to Path Sampling. Statistical Science, 13(2):163–185.

Gelman, A., Roberts, G. O., and Gilks, W. R. (1995). Efficient Metropolis Jumping Rules. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F. M., editors, Bayesian Statistics 5, pages 599–607. Oxford, UK: Oxford University Press.

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

Glymour, C. and Spirtes, P. (1988). Latent Variables, Causal Models and Overidentifying Constraints. Journal of Econometrics, 39:175–198.

Goldberger, A. (1972). Structural Equation Models in the Social Sciences. Econometrica, 40:979–1001.

Goldberger, A. (1973). Structural equation models: an overview. In Goldberger, A. and Duncan, O., editors, Structural Equation Models in the Social Sciences, pages 1–18. New York, NY, US: Seminar Press.

Goldberger, A. and Duncan, O. (1973). Structural Equation Models in the Social Sciences. New York, NY, US: Seminar Press.

Goldstein, H. and McDonald, R. (1988). A General Model for the Analysis of Multilevel Data. Psychometrika, 53(4):455–467.

Goldszmidt, M. and Pearl, J. (1992). Rank-based systems: A simple approach to belief revision, belief update, and reasoning about evidence and actions. In Nebel, B., Rich, C., and Swartout, W., editors, Proceedings of the Third International Conference on Principles of Knowledge Representation and Reasoning, pages 661–672. San Mateo, CA, US: Morgan Kaufmann.

Greenland, S. (2003). Quantifying Biases in Causal Models: Classical Confounding vs Collider-Stratification Bias. Epidemiology, 14(3):300–306.

Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal Diagrams for Epidemiologic Research. Epidemiology, 10(1):37–48.

Haavelmo, T. (1943). The Statistical Implications of a System of Simultaneous Equations. Econometrica, 11(1):1–12.


Halloran, M. E. and Struchiner, C. J. (1995). Causal Inference in Infectious Diseases. Epidemiology, 6:142–151.

Hammersley, J. and Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.

Hancock, G. R. and Mueller, R. O., editors (2006). Structural Equation Modeling: A Second Course. Quantitative Methods in Education and the Behavioral Sciences. Greenwich, CT, US: Information Age Publishing, 1st edition.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Heckman, J. (2005). The Scientific Model of Causality. Sociological Methodology, 35:1–97.

Heckman, J. and Vytlacil, E. (2007a). Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation. In Heckman, J. and Leamer, E., editors, Handbook of Econometrics, Volume 6B, chapter 70, pages 4779–4874. Amsterdam, NL: North-Holland Publishing Co.

Heckman, J. and Vytlacil, E. (2007b). Econometric Evaluation of Social Programs, Part II: Using the Marginal Treatment Effect to Organize Alternative Econometric Estimators to Evaluate Social Programs, and to Forecast their Effects in New Environments. In Heckman, J. and Leamer, E., editors, Handbook of Econometrics, Volume 6B, chapter 71, pages 4875–5143. Amsterdam, NL: North-Holland Publishing Co.

Hernán, M. A., Brumback, B., and Robins, J. M. (2000). Marginal Structural Models to Estimate the Causal Effect of Zidovudine on the Survival of HIV-Positive Men. Epidemiology, 11(5):561–570.

Hernán, M. A., Hernández-Díaz, S., and Robins, J. M. (2004). A Structural Approach to Selection Bias. Epidemiology, 15(5):615–625.

Hernán, M. A. and Robins, J. M. (2016). Causal Inference. New York, NY, US: Chapman & Hall, CRC, 1st edition.

Holland, P. (1986). Statistics and Causal Inference. Journal of the American Statistical Association, 81(396):945–960.

Holland, P. (2003). Causation and Race. Educational Testing Service Research Reports, RR-03-03.

Holland, P. W. (1988). Causal Inference, Path Analysis, and Recursive Structural Equation Models. Sociological Methodology, 18:449–484.

Hong, G. and Raudenbush, S. (2003). Causal Inference for Multi-level Observational Data with Application to Kindergarten Retention Study. 2003 Proceedings of the American Statistical Association, Social Statistics Section, pages 1849–1856.

Hong, G. and Raudenbush, S. (2005). Effects of Kindergarten Retention Policy on Children’s Cognitive Growth in Reading and Mathematics. Educational Evaluation and Policy Analysis, 27(3):205–224.

Hong, G. and Raudenbush, S. (2006). Evaluating Kindergarten Retention Policy. Journal of the American Statistical Association, 101(475):901–910.


Hong, G. and Raudenbush, S. (2013). Heterogeneous Agents, Social Interactions, and Causal Inference. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 16, pages 331–352. New York, NY, US: Springer-Verlag.

Hoogland, J. J. and Boomsma, A. (1998). Robustness Studies in Covariance Structure Modeling. Sociological Methods and Research, 26(3):329–367.

Hoyle, R. H., editor (1995). Structural Equation Modeling: Concepts, Issues and Applications. Thousand Oaks, CA, US: Sage Publications, 1st edition.

Hudgens, M. and Halloran, E. (2008). Toward Causal Inference with Interference. Journal of the American Statistical Association, 103(482):832–842.

Imai, K., Keele, L., and Tingley, D. (2010a). A General Approach to Causal Mediation Analysis. Psychological Methods, 15(4):309–334.

Imai, K., Keele, L., Tingley, D., and Yamamoto, T. (2010b). Causal Mediation Analysis Using R. In Vinod, H. D., editor, Advances in Social Science Research Using R, chapter 8, pages 129–154. New York, NY, US: Springer-Verlag.

Imai, K., Keele, L., and Yamamoto, T. (2010c). Identification, Inference and Sensitivity Analysis for Causal Mediation Effects. Statistical Science, 25(1):51–71.

Imai, K. and van Dyk, D. (2005). A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics, 125:311–334.

Imbens, G. and Angrist, J. (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica, 62(2):467–475.

Imbens, G. W. (2004). Nonparametric Estimation of Average Treatment Effects under Exogeneity: A Review. The Review of Economics and Statistics, 86(1):4–29.

Jara, A., Quintana, F., and San-Martín, E. (2008). Linear Mixed Models with Skew-Elliptical Distributions: A Bayesian Approach. Computational Statistics and Data Analysis, 52:5033–5045.

Johnston, J. and DiNardo, J. (1997). Econometric Methods. New York, NY, US: McGraw-Hill Higher Education, 4th edition.

Jöreskog, K. G. (1967). Some Contributions to Maximum Likelihood Factor Analysis. Psychometrika, 32(4):443–482.

Jöreskog, K. G. (1973). A general method for estimating a linear structural equation system. In Goldberger, A. and Duncan, O., editors, Structural Equation Models in the Social Sciences, pages 85–113. New York, NY, US: Seminar Press.

Jöreskog, K. G. (1978). Statistical Analysis of Covariance and Correlation Matrices. Psychometrika, 43(4):443–477.

Jöreskog, K. G. and Sörbom, D. (1996). LISREL 8: Structural Equation Modeling with the SIMPLIS Command Language. London, UK: Scientific Software International, 1st edition.

Kass, R. E. and Raftery, A. E. (1995). Bayes Factors. Journal of the American Statistical Association, 90(430):773–795.


Keesling, J. W. (1972). Maximum likelihood approaches to causal flow analysis. PhD thesis, Department of Education, University of Chicago.

Kline, R. B. (2011). Principles and Practice of Structural Equation Modeling. Methodology in the Social Sciences. New York, NY, US: The Guilford Press, 3rd edition.

Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA, US: The MIT Press, 1st edition.

Lang, S. and Brezger, A. (2004). Bayesian P-Splines. Journal of Computational and Graphical Statistics, 13(1):183–212.

Laplace, P.-S. (1814). Essai philosophique sur les probabilités. Paris, FR: Courcier, 1st edition.

Lauritzen, S. L. (1996). Graphical Models. Number 17 in Oxford Statistical Science Series. Oxford, UK: Clarendon Press - Oxford University Press.

Lee, S.-Y. (1990). Multilevel analysis of structural equation models. Biometrika, 77(4):763–772.

Lee, S.-Y. (2007). Structural Equation Modeling: A Bayesian Approach. New York, NY, US: John Wiley & Sons, Ltd., 1st edition.

Lee, S.-Y. and Shi, J.-Q. (2001). Maximum Likelihood Estimation of Two-Level Latent Variable Models with Mixed Continuous and Polytomous Data. Biometrics, 57(3):787–794.

Lee, S.-Y. and Song, X.-Y. (2003a). Bayesian analysis of structural equation models with dichotomous variables. Statistics in Medicine, 22:3073–3088.

Lee, S.-Y. and Song, X.-Y. (2003b). Model Comparison of Nonlinear Structural Equation Models with Fixed Covariates. Psychometrika, 68(1):27–47.

Lee, S.-Y. and Song, X.-Y. (2004). Evaluation of the Bayesian and Maximum Likelihood Approaches in Analyzing Structural Equation Models with Small Sample Sizes. Multivariate Behavioral Research, 39(4):653–686.

Lee, S.-Y. and Tang, N.-S. (2006). Bayesian Analysis of Two-level Structural Equation Models with Cross Level Effects. Unpublished manuscript.

Lee, S.-Y. and Zhu, H.-T. (2000). Statistical analysis of nonlinear structural equation models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 50:209–232.

Lee, S.-Y. and Zhu, H.-T. (2002). Maximum Likelihood Estimation of Nonlinear Structural Equation Models. Psychometrika, 67(2):189–210.

Liu, L. and Hudgens, M. G. (2014). Large Sample Randomization Inference of Causal Effects in the Presence of Interference. Journal of the American Statistical Association, 109(505):288–301.

Liu, W., Brookhart, M. A., Schneeweiss, S., Mi, X., and Setoguchi, S. (2012). Implications of M-Bias in Epidemiological Studies: A Simulation Study. American Journal of Epidemiology, 176(10):938–948.


Longford, N. and Muthén, B. (1992). Factor Analysis for Clustered Observations. Psychometrika, 57(4):581–597.

Lundin, M. and Karlsson, M. (2014). Estimation of causal effects in observational studies with interference between units. Statistical Methods & Applications, 23(3):417–433.

Manski, C. (2003). Partial Identification of Probability Distributions. Springer Series in Statistics. New York, NY, US: Springer-Verlag, 1st edition.

Manski, C. (2013). Identification of treatment response with social interactions. Econometrics Journal, 16(1):1–23.

Marcoulides, G. A. and Schumacker, R. E. (2009). New Developments and Techniques in Structural Equation Modeling. Mahwah, NJ, US: Lawrence Erlbaum Associates Publishers, 2nd edition.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. New York, NY, US: Chapman & Hall, CRC, 2nd edition.

McDonald, R. and Goldstein, H. (1989). Balanced versus Unbalanced Designs for Linear Structural Relations in Two-level Data. British Journal of Mathematical and Statistical Psychology, 42:215–232.

Mehta, P. and Neale, M. (2005). People Are Variables Too: Multilevel Structural Equation Modeling. Psychological Methods, 10(3):259–284.

Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. (1953). Equation of State Calculations by Fast Computing Machines. The Journal of Chemical Physics, 21(6):1087–1092.

Mill, J. S. (1843). A System of Logic. London, UK: Parker.

Morgan, S. and Winship, C. (2007). Counterfactuals and Causal Inference. New York, NY: Cambridge University Press, 1st edition.

Mulaik, S. (2009). Linear Causal Modeling with Structural Equations. Statistics in the Social and Behavioral Sciences Series. New York, NY, US: Chapman & Hall, CRC, 1st edition.

Muthén, B. (2011). Applications of Causally Defined Direct and Indirect Effects in Mediation Analysis using SEM in Mplus. Technical Report. Los Angeles, CA, US: Muthén & Muthén.

Muthén, B. O. (1984). A General Structural Equation Model with Dichotomous, Ordered Categorical, and Continuous Latent Variable Indicators. Psychometrika, 49(1):115–132.

Muthén, B. O. (1989). Latent Variable Modeling in Heterogeneous Populations. Psychometrika, 54(4):557–585.

Muthén, B. O. (1994). Multilevel Covariance Structure Analysis. Sociological Methods and Research, 22(3):376–398.

Muthén, B. O. and Asparouhov, T. (2015). Causal Effects in Mediation Modeling: An Introduction with Applications to Latent Variables. Structural Equation Modeling: A Multidisciplinary Journal, 22:12–23.


Neuberg, L. G. (2003). Causality: Models, Reasoning and Inference: A Review. Econometric Theory, 19:675–685.

Neugebauer, R. and van der Laan, M. (2006). Causal effects in longitudinal studies: Definition and maximum likelihood estimation. Computational Statistics & Data Analysis, 51:1664–1675.

Neugebauer, R. and van der Laan, M. (2007). Nonparametric causal effects based on marginal structural models. Journal of Statistical Planning and Inference, 137:419–434.

Neyman, J. (1923). On the Application of Probability Theory to Agricultural Experiments. Essay on Principles. Section 9. Statistical Science, 5(4):465–480.

Nowzohour, C. (2015). Estimating Causal Networks from Multivariate Observational Data. PhD thesis, ETH Zurich.

Ogburn, E. and VanderWeele, T. J. (2014). Causal Diagrams for Interference. Statistical Science, 29(4):559–578.

Pearl, J. (1985). Bayesian networks: A model of self-activated memory for evidential reasoning. Proceedings, Cognitive Science Society, pages 329–334.

Pearl, J. (1986). Markov and Bayes Networks: a Comparison of Two Graphical Representations of Probabilistic Knowledge. UCLA Technical Report R-46-I. http://ftp.cs.ucla.edu/tech-report/198 -reports/860024.pdf.

Pearl, J. (1988a). On the definition of actual cause. UCLA Technical Report R-259. ftp://ftp.cs.ucla.edu/pub/stat_ser/R259.pdf.

Pearl, J. (1988b). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA, US: Morgan Kaufmann, 1st edition.

Pearl, J. (1993). Comment: Graphical models, causality and intervention. Statistical Science, 8(3):266–269.

Pearl, J. (1995). Causal diagrams for empirical research (with discussions). Biometrika, 82(4):669–710.

Pearl, J. (1998). Graphs, Causality and Structural Equation Models. Sociological Methods and Research, 27(2):226–284.

Pearl, J. (2000). Causality: Models, Reasoning and Inference. New York, NY, US: Cambridge University Press, 1st edition.

Pearl, J. (2001a). Causal Inference in the Health Sciences: A Conceptual Introduction. Health Services & Outcomes Research Methodology, 2:189–220.

Pearl, J. (2001b). Direct and Indirect Effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pages 411–420. San Mateo, CA, US: Morgan Kaufmann.

Pearl, J. (2005). Direct and Indirect Effects. In Proceedings of American Statistical Association Joint Statistical Meetings, pages 1572–1581. Minneapolis, MN, US: MIRA.

Pearl, J. (2009a). Causal inference in statistics: An overview. Statistics Surveys, 3:96–146.


Pearl, J. (2009b). Causality: Models, Reasoning and Inference. New York, NY, US: Cambridge University Press, 2nd edition.

Pearl, J. (2009c). Letter to the Editor: Remarks on the method of propensity score. Statistics in Medicine, 28(9):1415–1416.

Pearl, J. (2009d). Myth, Confusion and Science in Causal Analysis. UCLA Technical Report R-348. http://web.cs.ucla.edu/~kaoru/r348.pdf.

Pearl, J. (2010a). The Foundations of Causal Inference. Sociological Methodology, 40(1):75–149.

Pearl, J. (2010b). An Introduction to Causal Inference. The International Journal of Biostatistics, 6(2):1–58.

Pearl, J. (2012a). The Causal Foundations of Structural Equation Modeling. In Hoyle, R. H., editor, Handbook of Structural Equation Modeling, chapter 5, pages 68–91. New York, NY, US: The Guilford Press.

Pearl, J. (2012b). Interpretable conditions for identifying direct and indirect effects. UCLA Technical Report R-389. http://ftp.cs.ucla.edu/pub/stat_ser/r389-tr.pdf.

Pearl, J. (2013). Linear Models: A Useful “Microscope” for Causal Analysis. Journal of Causal Inference, 1(1):155–170.

Pearl, J. (2016). Causal Inference in Statistics: A Gentle Introduction. Tutorial at the Joint Statistical Meetings (JSM-16), Chicago, IL, August 1, 2016.

Pearl, J. and Paz, A. (1987). Graphoids: A Graph-based Logic for Reasoning about Relevance Relations. In Duboulay, B., Hogg, D., and Steels, L., editors, Advances in Artificial Intelligence II, pages 357–363. Amsterdam, NL: North-Holland Publishing Co.

Pearl, J. and Verma, T. (1991). A theory of inferred causation. In Allen, J. A., Fikes, R., and Sandewall, E., editors, Proceedings of the Second International Conference on Principles of Knowledge Representation and Reasoning, pages 441–452. San Mateo, CA, US: Morgan Kaufmann.

Petersen, M. L. and van der Laan, M. J. (2014). Causal Models and Learning from Data: Integrating Causal Modeling and Statistical Estimation. Epidemiology, 25(3):418–426.

Preacher, K., Zhang, Z., and Zyphur, M. (2010). A General Multilevel SEM Framework for Assessing Multilevel Mediation. Psychological Methods, 15(3):209–233.

Preacher, K., Zhang, Z., and Zyphur, M. (2011). Alternative Methods for Assessing Mediation in Multilevel Data: The Advantages of Multilevel SEM. Structural Equation Modeling, 18(2):161–182.

Preston, C. C. and Colman, A. M. (2000). Optimal Number of Response Categories in Rating Scales: Reliability, Validity, Discriminating Power, and Respondent Preferences. Acta Psychologica, 104:1–15.

R Development Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.


Rabe-Hesketh, S. and Skrondal, A. (2012). Multilevel and Longitudinal Modeling Using Stata. College Station, TX, US: Stata Press, 3rd edition.

Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2002). Reliable estimation of generalized linear mixed models using adaptive quadrature. The Stata Journal, 2(1):1–21.

Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2004). Generalized Multilevel Structural Equation Modeling. Psychometrika, 69(2):167–190.

Rabe-Hesketh, S., Skrondal, A., and Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128:301–323.

Rabe-Hesketh, S., Skrondal, A., and Zheng, X. (2007). Multilevel Structural Equation Modelling. In Lee, S.-Y., editor, Handbook of Latent Variables and Related Models, chapter 10, pages 209–227. Amsterdam, NL: North-Holland Publishing Co.

Rabe-Hesketh, S., Skrondal, A., and Zheng, X. (2012). Multilevel Structural Equation Modeling. In Hoyle, R., editor, Handbook of Structural Equation Modeling, chapter 30, pages 512–531. New York, NY, US: The Guilford Press.

Raudenbush, S. and Bryk, A. (2002). Hierarchical Linear Models: Applications and Data Analysis Methods. Thousand Oaks, CA, US: Sage Publications, 2nd edition.

Richardson, T. and Robins, J. M. (2013a). Single World Intervention Graphs: A Primer. UAI Workshop on Causal Structural Learning, Bellevue, Washington. http://www.statslab.cam.ac.uk/~rje42/uai13/Richardson.pdf.

Richardson, T. and Robins, J. M. (2013b). Single World Intervention Graphs: A Unification of the Counterfactual and Graphical Approaches to Causality. Center for Statistics and the Social Sciences, University of Washington, Technical Report 128. https://www.csss.washington.edu/Papers/wp128.pdf.

Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing, 5:121–125.

Robins, J. M. (1986). A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512.

Robins, J. M. (1987). Addendum to “A New Approach to Causal Inference in Mortality Studies with a Sustained Exposure Period: Application to control of the healthy worker survivor effect”. Computers and Mathematics with Applications, 14(9):917–921.

Robins, J. M. (1992). Estimation of the time-dependent accelerated failure time model in the presence of confounding factors. Biometrika, 79(2):321–334.

Robins, J. M. (1993). Analytic Methods for Estimating HIV-Treatment and Cofactor Effects. In Ostrow, D. G. and Kessler, R. C., editors, Methodological Issues in AIDS Behavioral Research, chapter 9, pages 213–290. New York, NY, US: Plenum Press.

Robins, J. M. (1994). Correcting for Non-compliance in Randomized Trials Using Structural Nested Mean Models. Communications in Statistics - Theory and Methods, 23(8):2379–2412.


Robins, J. M. (1995). Discussion in “Causal diagrams for empirical research”. Biometrika, 82(4):695–698.

Robins, J. M. (1997). Causal Inference from Complex Longitudinal Data. In Berkane, M., editor, Latent Variable Modeling and Applications to Causality, volume 120 of Lecture Notes in Statistics, pages 69–117. New York, NY, US: Springer-Verlag.

Robins, J. M. (1998). Marginal Structural Models. In 1997 Proceedings of the American Statistical Association, pages 1–10. Alexandria, VA, US: American Statistical Association. Section on Bayesian Statistical Science.

Robins, J. M. (1999a). Association, Causation and Marginal Structural Models. Synthese, 121:151–179.

Robins, J. M. (1999b). Marginal Structural Models versus Structural Nested Models as Tools for Causal Inference. In Halloran, M. E. and Berry, D., editors, Statistical Models in Epidemiology, The Environment and Clinical Trials, volume 116 of The IMA Volumes in Mathematics and its Applications, pages 95–133. New York, NY, US: Springer-Verlag.

Robins, J. M. (1999c). Testing and Estimation of Direct Effects by Reparameterizing Directed Acyclic Graphs with Structural Nested Models. In Glymour, C. and Cooper, G., editors, Computation, Causation and Discovery, chapter 12, pages 349–405. Menlo Park, CA; Cambridge, MA, US; London, UK: AAAI Press & The MIT Press.

Robins, J. M., Blevins, D., Ritter, G., and Wulfsohn, M. (1992). G-Estimation of the Effect of Prophylaxis Therapy for Pneumocystis carinii Pneumonia on the Survival of AIDS Patients. Epidemiology, 3(4):319–336.

Robins, J. M. and Hernán, M. A. (2009). Estimation of the causal effects of time-varying exposures. In Fitzmaurice, G., Davidian, M., Verbeke, G., and Molenberghs, G., editors, Longitudinal Data Analysis, Handbooks of Modern Statistical Methods, chapter 23, pages 553–599. New York, NY, US: Chapman & Hall, CRC.

Robins, J. M., Hernán, M. A., and Brumback, B. (2000). Marginal Structural Models and Causal Inference in Epidemiology. Epidemiology, 11(5):550–560.

Robins, J. M., Hernán, M. A., and Siebert, U. (2004). Effects of Multiple Interventions. In Ezzati, M., Murray, C., Lopez, A. D., and Rodgers, A., editors, Comparative quantification of health risks: Global and regional burden of disease attributable to selected major risk factors, volume 2, chapter 28, pages 2191–2230. Geneva, CH: World Health Organization Press.

Rosenbaum, P. (2005). Heterogeneity and Causality: Unit Heterogeneity and Design Sensitivity in Observational Studies. The American Statistician, 59(2):147–152.

Rosenbaum, P. (2007). Interference Between Units in Randomized Experiments. Journal of the American Statistical Association, 102(477):191–200.

Rosenbaum, P. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.

Rosenbaum, P. and Rubin, D. B. (1984). Reducing Bias in Observational Studies Using Subclassification on the Propensity Score. Journal of the American Statistical Association, 79(387):516–524.


Rubin, D. B. (1973). Matching to Remove Bias in Observational Studies. Biometrics, 29:159–183.

Rubin, D. B. (1974). Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies. Journal of Educational Psychology, 66(5):688–701.

Rubin, D. B. (1977). Assignment to Treatment Group on the Basis of a Covariate. Journal of Educational Statistics, 2(1):1–26.

Rubin, D. B. (1978). Bayesian Inference for Causal Effects: The Role of Randomization. The Annals of Statistics, 6(1):34–58.

Rubin, D. B. (1979). Using Multivariate Matched Sampling and Regression Adjustment to Control Bias in Observational Studies. Journal of the American Statistical Association, 74(366):318–328.

Rubin, D. B. (1980). Bias Reduction Using Mahalanobis-Metric Matching. Biometrics, 36(2):293–298.

Rubin, D. B. (2005). Causal Inference Using Potential Outcomes: Design, Modeling, Decisions. Journal of the American Statistical Association, 100(469):322–331.

Rubin, D. B. (2006). Matched Sampling for Causal Effects. New York, NY, US: Cambridge University Press, 1st edition.

Rubin, D. B. (2007). The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26(1):20–36.

Rubin, D. B. (2008). Author’s Reply. Statistics in Medicine, 27(14):2741–2742.

Rubin, D. B. (2009). Author’s Reply: Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups? Statistics in Medicine, 28(9):1420–1423.

San-Martín, E., Jara, A., Rolin, J.-M., and Mouchart, M. (2011). On the Bayesian Nonparametric Generalization of IRT-type Models. Psychometrika, 76(3):385–409.

Scheines, R., Spirtes, P., and Glymour, C. (1991). Building Latent Variable Models. Carnegie Mellon University Working Papers Series - Department of Philosophy, 19. http://repository.cmu.edu/cgi/viewcontent.cgi?article=1249&context=philosophy.

Sekhon, J. (2008). The Neyman-Rubin Model of Causal Inference and Estimation via Matching Methods. In Box-Steffensmeier, J., Brady, H., and Collier, D., editors, The Oxford Handbook of Political Methodology, chapter 11, pages 271–299. Oxford, UK: Oxford University Press.

Sekhon, J. (2009). Opiates for the Matches: Matching Methods for Causal Inference. Annual Review of Political Science, 12:487–508.

Setoguchi, S., Schneeweiss, S., Brookhart, M. A., Glynn, R. J., and Cook, E. F. (2008). Evaluating uses of data mining techniques in propensity score estimation: a simulation study. Pharmacoepidemiology and Drug Safety, 17(6):546–555.


Shi, J.-Q. and Lee, S.-Y. (1998). Bayesian sampling-based approach for factor analysismodels with continuous and polytomous data. British Journal of Mathematical andStatistical Psychology, 51:233–252.

Shi, J.-Q. and Lee, S.-Y. (2000). Latent variable models with mixed, continuous andpolytomous data. Journal of the Royal Statistical Society - Series B, 62(1):77–87.

Shpitser, I. and Pearl, J. (2006). Identification of Conditional Interventional Distributions. In Dechter, R. and Richardson, T. S., editors, Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence, pages 437–444. Corvallis, OR, US: AUAI Press.

Shrier, I. (2008). Letter to the Editor. Statistics in Medicine, 27(14):2740–2741.

Sjölander, A. (2009). Letter to the Editor: Propensity scores and M-structures. Statistics in Medicine, 28(9):1416–1420.

Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable Modeling: multilevel, longitudinal, and structural equation models. Boca Raton, FL, US: Chapman & Hall, CRC.

Sobel, M. E. (1987). Direct and Indirect Effects in Linear Structural Equation Models. Sociological Methods & Research, 16(1):155–176.

Sobel, M. E. (1990). Effect Analysis and Causation in Linear Structural Equation Models.Psychometrika, 55(3):495–515.

Sobel, M. E. (2006). What Do Randomized Studies of Housing Mobility Demonstrate?: Causal Inference in the Face of Interference. Journal of the American Statistical Association, 101(476):1398–1407.

Song, X.-Y. and Lee, S.-Y. (2004). Bayesian analysis of two-level nonlinear structural equation models with continuous and polytomous data. British Journal of Mathematical and Statistical Psychology, 57:29–52.

Song, X.-Y. and Lee, S.-Y. (2012a). Basic and Advanced Bayesian Structural Equation Modeling with Applications in the Medical and Behavioral Sciences. New York, NY, US: John Wiley & Sons, Ltd., 1st edition.

Song, X.-Y. and Lee, S.-Y. (2012b). A tutorial on the Bayesian approach for analyzing structural equation models. Journal of Mathematical Psychology, 56:135–148.

Song, X.-Y., Lee, S.-Y., Ng, M., So, W.-Y., and Chan, J. (2007). Bayesian analysis of structural equation models with multinomial variables and an application to type 2 diabetic nephropathy. Statistics in Medicine, 26:2348–2369.

Song, X.-Y. and Lu, Z.-H. (2010). Semiparametric Latent Variable Models with Bayesian P-Splines. Journal of Computational and Graphical Statistics, 19(3):590–608.

Song, X.-Y. and Lu, Z.-H. (2012). Semiparametric transformation models with Bayesian P-splines. Statistics and Computing, 22(5):1085–1098.

Song, X.-Y., Lu, Z.-H., Cai, J.-H., and Ip, E. H.-S. (2013). A Bayesian Modeling Approach for Generalized Semiparametric Structural Equation Models. Psychometrika, 78(4):624–647.

Spearman, C. (1904). “General Intelligence”, Objectively Determined and Measured. The American Journal of Psychology, 15(2):201–292.

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society - Series B, 64(4):583–639.

Spirtes, P. (2001). An Anytime Algorithm for Causal Inference. AI and Statistics 2001 Conference.

Spirtes, P. (2005). Graphical models, causal inference, and econometric models. Journal of Economic Methodology, 12(1):1–33.

Spirtes, P. (2010). Introduction to Causal Inference. Journal of Machine Learning Research, 11:1643–1662.

Spirtes, P. and Glymour, C. (1991). An Algorithm for Fast Recovery of Sparse Causal Graphs. Social Science Computer Review, 9(1):67–72.

Spirtes, P., Glymour, C., and Scheines, R. (1991). From Probability to Causality. Philosophical Studies, 64(1):1–36.

Spirtes, P., Glymour, C., and Scheines, R. (2000). Causation, Prediction, and Search. Cambridge, MA, US: The MIT Press, 2nd edition.

Spirtes, P., Glymour, C., Scheines, R., and Tillman, R. (2010). Automated Search for Causal Relations: Theory and Practice. In Dechter, R., Geffner, H., and Halpern, J., editors, Heuristics, Probability and Causality: A Tribute to Judea Pearl, chapter 27, pages 467–506. London, UK: College Publications.

Spirtes, P., Meek, C., and Richardson, T. (1995). Causal Inference in the Presence of Latent Variables and Selection Bias. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 499–506. San Mateo, CA, US: Morgan Kaufmann.

Spirtes, P. and Richardson, T. (1997). A Polynomial Time Algorithm for Determining DAG Equivalence in the Presence of Latent Variables and Selection Bias. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, FL, US: Society for Artificial Intelligence and Statistics.

Spirtes, P., Richardson, T., and Meek, C. (1997). Heuristic Greedy Search Algorithms for Latent Variable Models. In Proceedings of the Sixth International Workshop on Artificial Intelligence and Statistics. Fort Lauderdale, FL, US: Society for Artificial Intelligence and Statistics.

Suppes, P. (1970). A Probabilistic Theory of Causality. Amsterdam, NL: North-Holland Publishing Co., 1st edition.

Tanner, M. and Wong, W. H. (1987). The Calculation of Posterior Distributions by Data Augmentation (with discussion). Journal of the American Statistical Association, 82(398):528–540.

Tchetgen Tchetgen, E. J. and VanderWeele, T. J. (2012). On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75.

Thulasiraman, K. and Swamy, M. N. S. (1992). Graphs: Theory and Algorithms. Toronto, CA: John Wiley & Sons, Ltd.

Toulis, P. and Kao, E. (2013). Estimation of Causal Peer Influence Effects. Journal of Machine Learning Research: Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 28(3):1489–1497.

van der Laan, M. and Rose, S. (2011). Targeted Learning: Causal Inference for Observational and Experimental Data. New York, NY, US: Springer-Verlag, 1st edition.

van der Laan, M. J. (2010a). Targeted Maximum Likelihood Based Causal Inference: Part I. The International Journal of Biostatistics, 6(2). Article 2.

van der Laan, M. J. (2010b). Targeted Maximum Likelihood Based Causal Inference: Part II. The International Journal of Biostatistics, 6(2). Article 3.

van der Laan, M. J. (2012). Causal Inference for Networks. U.C. Berkeley Division of Biostatistics Working Paper Series. WP 300. http://biostats.bepress.com/ucbbiostat/paper300/.

van der Laan, M. J. (2014). Causal Inference for a Population of Causally Connected Units. Journal of Causal Inference, 2(1):13–74.

van der Laan, M. J. and Rubin, D. (2006). Targeted Maximum Likelihood Learning. The International Journal of Biostatistics, 2(1). Article 11.

VanderWeele, T. J. and An, W. (2013). Social Networks and Causal Inference. In Morgan, S. L., editor, Handbook of Causal Analysis for Social Research, chapter 17, pages 353–374. New York, NY, US: Springer-Verlag.

VanderWeele, T. J. and Tchetgen Tchetgen, E. J. (2011). Effect partitioning under interference in two-stage randomized vaccine trials. Statistics and Probability Letters, 81:861–869.

VanderWeele, T. J., Vandenbroucke, J. P., Tchetgen Tchetgen, E. J., and Robins, J. M. (2012). A Mapping Between Interactions and Interference: Implications for Vaccine Trials. Epidemiology, 23(2):285–292.

Vansteelandt, S. and Joffe, M. (2014). Structural Nested Models and G-estimation: The partially realized promise. Statistical Science, 29(4):707–731.

Verbitsky-Savitz, N. and Raudenbush, S. (2012). Causal Inference Under Interference in Spatial Settings: A Case Study Evaluating Community Policing Program in Chicago. Epidemiologic Methods, 1(1):107–130.

Verma, T. (1993). Graphical Aspects of Causal Models. UCLA Technical Report R-191. http://ftp.cs.ucla.edu/pub/stat_ser/r191.pdf.

Verma, T. and Pearl, J. (1988). Causal Networks: Semantics and Expressiveness. Proceedings of the Fourth Workshop on Uncertainty in Artificial Intelligence, pages 352–359.

Verma, T. S. and Pearl, J. (1990). Equivalence and Synthesis of Causal Models. Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, pages 220–227.

Wagner, H. and Tüchler, R. (2010). Bayesian estimation of random effects models for multivariate responses of mixed data. Computational Statistics & Data Analysis, 54:1206–1218.

Wiley, D. E. (1973). The identification problem for structural equation models with unmeasured variables. In Goldberger, A. and Duncan, O., editors, Structural Equation Models in the Social Sciences, pages 69–84. New York, NY, US: Seminar Press.

Winship, C. and Morgan, S. (1999). The Estimation of Causal Effects from Observational Data. Annual Review of Sociology, 25:659–706.

Witteman, J. C. M., D’Agostino, R. B., Stijnen, T., Kannel, W. B., Cobb, J. C., de Ridder, M. A. J., Hofman, A., and Robins, J. M. (1998). G-Estimation of Causal Effects: Isolated Systolic Hypertension and Cardiovascular Death in the Framingham Heart Study. American Journal of Epidemiology, 148(4):390–401.

Wright, S. (1920). The Relative Importance of Heredity and Environment in Determining the Piebald Pattern of Guinea-Pigs. Proceedings of the National Academy of Sciences, 6:320–332.

Wright, S. (1921). Correlation and Causation. Journal of Agricultural Research, 20(7):557–585.

Wright, S. (1934). The Method of Path Coefficients. The Annals of Mathematical Statistics, 5(3):161–215.

Xie, Y. (2013). Population Heterogeneity and Causal Inference. Proceedings of the National Academy of Sciences, 110(16):6262–6268.

Zhu, H.-T. and Lee, S.-Y. (1999). Statistical analysis of nonlinear factor analysis models.British Journal of Mathematical and Statistical Psychology, 52:225–242.