Causal Inference in Observational Data - arXiv · techniquest to determine the e cacy of display advertise-ments. Even extending association rules mining to causal rule mining has

Causal Inference in Observational Data

Pranjul YadavDept. of Computer Science

University of MinnesotaMinneapolis, MN

[email protected]

Michael SteinbachDept. of Computer Science


[email protected]

Vipin KumarDept. of Computer Science

University of MinnesotaMinneapolis, USA

[email protected] WestraSchool of Nursing


[email protected]

Alexander HoffDept. of Computer Science


[email protected]

Connie DelaneySchool of Nursing


[email protected] PrunelliSchool of Nursing


[email protected]

Gyorgy SimonDept. of Health Sciences

ResearchMayo Clinic, Rochester, MN

[email protected]

ABSTRACTOur aging population increasingly suffers from multiple chronicdiseases simultaneously, necessitating the comprehensive treat-ment of these conditions. Finding the optimal set of drugsand interventions for a combinatorial set of diseases is acombinatorial pattern exploration problem. Association rulemining is a popular tool for such problems, but the require-ment of health care for finding causal, rather than asso-ciative, patterns renders association rule mining unsuitable.One of the purpose of this study was to apply SSC guide-line recommendations to EHR data for patients with severesepsis or septic shock and determine guideline complianceas well as its impact on inpatient mortality and sepsis com-plications. Propensity Score Matching in conjuction withBootstrap Simulation were used to match patients with andwithout exposure to the SCC recommendations. Findingsshowed that EHR data could be used to estimate compli-ance with SCC recommendations as well as the effect ofcompliance on outcomes. Further, we propose a novel frame-work based on the Rubin-Neyman causal model for extract-ing causal rules from observational data, correcting for anumber of common biases. Specifically, given a set of in-terventions and a set of items that define subpopulations(e.g., diseases), we wish to find all subpopulations in whicheffective intervention combinations exist and in each suchsubpopulation, we wish to find all intervention combina-tions such that dropping any intervention from this com-bination will reduce the efficacy of the treatment. A keyaspect of our framework is the concept of closed interven-

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full cita-tion on the first page. Copyrights for components of this work owned by others thanACM must be honored. Abstracting with credit is permitted. To copy otherwise, or re-publish, to post on servers or to redistribute to lists, requires prior specific permissionand/or a fee. Request permissions from [email protected].

c© 2016 ACM. ISBN 978-1-4503-2138-9.

DOI: 10.1145/1235

tion sets which extend the concept of quantifying the effectof a single intervention to a set of concurrent interventions.Closed intervention sets also allow for a pruning strategythat is strictly more efficient than the traditional pruningstrategy used by the Apriori algorithm. To implement ourideas, we introduce and compare five methods of estimatingcausal effect from observational data and rigorously evalu-ate them on synthetic data to mathematically prove (whenpossible) why they work. We also evaluated our causal rulemining framework on the Electronic Health Records (EHR)data of a large cohort of patients from Mayo Clinic andshowed that the patterns we extracted are sufficiently richto explain the controversial findings in the medical literatureregarding the effect of a class of cholesterol drugs on Type-IIDiabetes Mellitus (T2DM).

KeywordsCausal Inference, Confounding, Counterfactual Estimation.

1. RELATED WORKCausation has received substantial research interest in many

areas. In computer science, Pearl [5] and Rosenbaum[6]laid the foundation for causal inference, upon which severalfields, cognitive science, econometrics, epidemiology, philos-ophy and statistics have built their respective methodologies[7, 8, 9].

At the center of causation is a causal model. Arguably,one of the earliest and popular models is the Rubin-Neymancausal model [3]. Under this model X causes Y , if X occusrbefore Y ; and without X, Y would be different. Besidethe Rubin-Neyman model, there are several other causalmodels, including the Granger causality [10] for time se-ries, Bayes Networks [11], Structural Equation Modeling [8],causal graphical models [12], and more generally, probabilis-tic graphical models [13]. In our work, we use the potentialoutcome framework from the Rubin-Neyman model and we

arX

iv:1

611.

0466

0v1

[cs

.AI]

15

Nov

201

6

10.1145/1235

use causal graphical models to identify and correct for bi-ases.

Causal graphical models are tools to visualize causal re-lationships among variables. Nodes of the causal graph arevariables and edges are causal relationships. Most methodsassume that the causal graph structure is a priori given, how-ever, methods have been proposed for discovering the struc-ture of the causal graph [14, 15]. In our work, the structureis partially given: we know the relationships among groupsof variables, however we have to assign each variable to thecorrect group based on data.

Knowing the correct graph structure is important, becausesubstructures in the graph are suggestive of sources of bias.To correct for biases, we are looking for specific substruc-tures. For example, causal chains can be sources of overcor-rection bias and ”V”-shaped structures can be indicative ofconfounding or endogenous selection bias [9]. Many otherinteresting substructures have been studied [16, 17, 18]. Inour work, we consider three fundamental such structures:direct causal effect, indirect causal effect and confounding.Of these, confounding is the most severe and received themost research interest.

Numerous methods exist to handle confounding, whichincludes propensity score matching (PSM) [19], structuralmarginal models [9] and g-estimation [8]. The latter twoextend PSM for various situations, for example, for time-varying interventions [9].

Propensity score matching is used to estimate the effectof an intervention on an outcome. The propensity score isthe propensity (probability) of a patient receiving the inter-vention given his baseline characteristics and the propensityscore is used to create a new population that is free of con-founding. Many PSM techniques exist and they typicallydiffer in how they use the propensity score to create thisnew population [20, 21, 22, 23].

Applications of causal modeling is not exclusive to socialand life sciences. In data mining, Lambert et al. [24] in-vestigated the causal effect of new features on click throughrates and Chan et al. [25] used doubly robust estimationtechniquest to determine the efficacy of display advertise-ments.

Even extending association rules mining to causal rulemining has been attempted before [26, 27, 28]. Li et al.[26] used odds ratio to identify causal patterns and later ex-tended their technique [28] to handle large data set. Theirtechnique, however, is not rooted in a causal model andhence offers no protection against computing systematicallybiased estimates. In their proposed causal decision trees[29], they used the potential outcome framework, but stillhave not addressed correction for various biases, includingconfounding.

2. SIMPLE CAUSAL RULE MINING IN IR-REGULAR TIME-SERIES DATA

2.1 IntroductionAccording to the Center for Disease Control and Preven-

tion, the incidence of sepsis or septicemia has doubled from2000 through 2008, and hospitalizations have increased by70% for these diagnoses1. In addition, severe sepsis andshock have higher mortality rates than other sepsis diag-noses, accounting for an estimated mortality between 18%

and 40%. During the first 30 days of hospitalization, mor-tality can range from 10% to 50% depending on the patientsrisk factors. Patients with severe sepsis or septic shock aresicker, have longer hospital stays, are more frequently dis-charged to other short-term hospital or long-term care insti-tutions, and represent the most expensive hospital conditiontreated in 20112.

The use of evidence-based practice (EBP) guidelines, suchas the Surviving Sepsis Campaign (SSC), could lead to anearlier diagnosis, and consequently, earlier treatment. How-ever, these guidelines have not been widely incorporatedinto clinical practice. The SSC is a compilation of interna-tional recommendations for the management of severe sep-sis and shock. Many of these recommendations are inter-ventions to prevent further system deterioration during andafter diagnosis. Even when the presence of sepsis or progres-sion to sepsis is suspected early in the course of treatment,timely implementation of adequate treatment managementand guideline compliance are still a challenge. Therefore,the effectiveness of the guideline in preventing clinical com-plications for this population is still unclear to clinicians andresearchers alike.

The majority of studies have focused on early detectionand prevention of sepsis and little is known about the com-pliance rate to SSC and the impact of compliance on the pre-vention of sepsis-related complications. Further, the mea-surement of adherence to individual SSC recommendationsrather than the entire SSC is, to our knowledge, limited. Themajority of studies have used traditional randomized controltrials with analytic techniques such as regression modeling toadjust for risk factors known from previous research. Data-driven methodologies, such as data mining techniques andmachine learning, have the potential to identify new insightsfrom electronic health records (EHRs) that can strengthenexisting EBP guidelines.

The national mandate for all health professionals to imple-ment interoperable EHRs by 2015 provides an opportunityfor the reuse of potentially large amounts of EHR data to ad-dress new research questions that explore patterns of patientcharacteristics, evidence-based guideline interventions, andimprovement in health. Furthermore, expanding the rangeof variables documented in EHRs to include team-based as-sessment and intervention data can increase our understand-ing of the compliance with EBP guidelines and the influenceof these guidelines on patient outcomes. In the absence ofsuch data elements, adherence to guidelines can only be in-ferred; it cannot be directly observed.

In this section, we present a methodology for using EHRdata to estimate the compliance with the SSC guideline rec-ommendations and also estimate the effect of the individ-ual recommendations in the guideline on the prevention ofin-hospital mortality and sepsis-related complications in pa-tients with severe sepsis and septic shock.

2.2 MethodsData from the EHR of a health system in the Midwest

was transferred to a clinical data repository (CDR) at theUniversity of Minnesota which is funded through a Clini-cal Translational Science Award. After IRB approval, de-identified data for all adult patients hospitalized between1/1/09 to 12/31/11 with a severe sepsis or shock diagnosiswas obtained for this study.

2.2.1 Data and cohort selectionThe sample included 186 adult patients age 18 years or

older with an ICD-9 diagnosis code of severe sepsis or shock(995.92 and 785.5*) identified from billing data. Since 785.*codes corresponding to shock can capture patients withoutsepsis, patients without severe sepsis or septic shock, and pa-tients who did not receive antibiotics were excluded. Theseexclusions aimed to capture only those patients who had se-vere sepsis and septic shock, and were treated for that clin-ical condition. The final sample consisted of 177 patients.

2.2.2 Variables of interestFifteen predictor variables (baseline characteristics) were

collected. These include sociodemographics and health dis-parities data: age, gender, race, ethnicity, and payer (Medi-caid represents low income); laboratory results: lactate andwhite blood cells count (WBC); vital signs: heart rate (HR),respiratory rate (RR), temperature (Temp), mean arterialblood pressure (MAP); and diagnoses for respiratory, car-diovascular, cerebrovascular, and kidney-related comorbidconditions. ICD-9 codes for comorbid conditions were se-lected according to evidence in the literature . Comorbiditieswere aggregated from the patientaAZs prior problem list todetect preexisting (upon admission) respiratory, cardiovas-cular, cerebrovascular, and kidney problems. Each categorywas treated as yes/no if any of the ICD-9 codes in that cat-egory were present.

The outcomes of interest were inhospital mortality and de-velopment of new complications (respiratory, cardiovascular,cerebrovascular, and kidney) during the hospital encounter.New complications were determined as the presence of ICD-9 codes on the patients billing data that did not exist at thetime of the admission.

2.2.3 Study designThis study aimed to analyze compliance with the SSC

guideline recommendations in patients with severe sepsis orseptic shock. Therefore, the baseline (TimeZero) was de-fined as the onset of sepsis and the patients were under ob-servation until discharged. Unfortunately, the timestampfor the diagnoses is dated back to the time of admission;hence the onset of sepsis needs to be estimated. The onsettime for sepsis was defined as the earliest time during a hos-pital encounter when the patient meets at least two of thefollowing six criteria: MAP < 65, HR >100, RR >20, tem-perature < 95 or >100.94, WBC < 4 or > 12, and lactate >2.0. The onset time was established based on current clinicalpractice and literature on sepsis5. The earliest time whentwo or more of these aforementioned conditions were met,a TimeZero flag was added to the time of first occurrenceof that abnormality, and the timing of the SSC compliancecommenced.

2.2.4 Guideline complianceSSC guideline recommendations were translated into a

readily computable set of rules. These rules have condi-tions related to an observation (e.g. MAP < 65 Hgmm) andan intervention to administer (e.g. give vasopressors) if thepatient meets the condition of the rule. The SSC guidelinewas transformed into 15 rules in a computational format,one for each recommendation in the SSC guideline recom-mendations, and each rule was evaluated for each patient(see Figure 1). After each rule is an abbreviated name sub-

sequently used in this paper.

Figure 1: SSC rules for measuring guideline compli-ances

We call the treatment of a patient compliant (exposed) fora specific recommendation, if the patient meets the condi-tion of the corresponding rule any time after TimeZero andthe required intervention was administered; the treatmentis non-compliant (unexposed) if the patient meets the con-dition of the corresponding rule after TimeZero, but the in-tervention was not administered (any time after TimeZero);and the recommendtion is not applicable to a treatment ifthe patient does not meet the condition of the correspondingrule. In estimating compliance (as a metric) with a specificrecommendation, we simply measure the number of compli-ant encounters to which the recommendation is applicable.In this phase of the study, the time when a recommendationwas administered was not incorporated in the analysis.

We also estimate the effect of the recommendation on theoutcomes. We call a patient exposed to a recommendation,if the recommendation is applicable to the patient and thecorresponding intervention was administered to the patient.We call a patient unexposed to a recommendation if the rec-ommendation is applicable but was not applied (the treat-ment was non-compliant). The incidence fraction in exposedpatients with respect to an outcome is the fraction of pa-tients with the outcome among the exposed patients. Theincidence fraction of the unexposed patients can be definedanalogously. We define the effect of the recommendationon an outcome as the difference in the incidence fractionsbetween the unexposed and exposed patients. The recom-mendation is beneficial (protective against an outcome) ifthe effect is positive, namely, the incidence faction in theunexposed is higher than the incidence fraction in the unex-

posed patients.

2.2.5 Data qualityIncluded variables were assessed for data quality regard-

ing accuracy and completeness based on the literature anddomain knowledge. Constraints were determined for plausi-ble values, e.g., a CVP reading could not be greater than 50.Values outside of constraints were recoded as missing values.Any observation that took place before the estimated onsetof sepsis (TimeZero) was considered a baseline observation.Simple mean imputation was the method of choice for im-puting missing values. Imputation was necessary for lactate(7.7%), temperature (3%), and WBC (3%). There was nomissing data for the other variables and for the outcomesof interest. Central venous pressure was not included as abaseline characteristic due to the high number of missingvalues (54%).

2.2.6 Propensity score matchingPatients who received SSC recommendations may be in

worse health than patient who did not receive SSC rec-ommendations. For example, patients whose lactate wasmeasured may have more apparent (and possibly advanced)sepsis than patients whose lactate was not measured. Tocompensate for such disparities, propensity score matching(PSM) was employed. The goal of PSM is to balance thedata set in terms of the covariates between patients exposedand unexposed to the SSC guideline recommendations. Thisis achieved by matching exposed patients with unexposedpatients on their propensity (probability) of receiving therecommendations. This ensures that at TimeZero, pairs ofpatients, one exposed and one unexposed, are at the samestate of health and they only differs in their exposure to therecommendation. PSM is a popular technique for estimatingtreatment effects.

To compute the propensity of patients to receive treat-ment, a logistic regression model was used, where the de-pendent variable is exposure to the recommendation andthe independent variables are the covariates. The linearprediction (propensity score) of this model was computedfor every patient. A new (matched) population was createdfrom pairs of exposed and unexposed patients with match-ing propensity scores. Two scores match if they differ by nomore than a certain caliper (.1 in our study). The effect ofthe recommendation was estimated by comparing the inci-dent fraction among the exposed and unexposed patients inthe matched population.

2.2.7 PSM nested inside bootstrapping simulationIn order to incorporate the effect of additional sources of

variability arising due to estimation in the propensity scoremodel and variability in the propensity score matched sam-ple, 500 bootstrap samples were drawn from the originalsample. In each of these bootstrap iterations, the propensityscore model was estimated using the above caliper matchingtechniques and the effect of the recommendation was com-puted with respect to all outcomes. In recent years, boot-strap simulation has been widely employed in conjunctionwith PSM to better handle bias and confounding variables.For each recommendation and outcome, the 500 bootstrapiterations result in 500 estimates of the effect (of the recom-mendation on the outcome), approximating the samplingdistribution of the effect.

2.3 ResultsTable 1 shows the baseline characteristics of the study

population. Results are reported as total count for categor-ical variables, and mean with inter-quartile (25% to 75%)range for continuous variables. As shown in Table 1, themajority of patients were male, Caucasian, and had Medi-caid as the payer. Before the onset of sepsis, Cardiovascularcomorbidities (56.4%) were common, the mean HR (101.3)was slightly above the normal, as well as lactate (2.8), andWBC (15.8). The mean length of stay for the sample was 15days, ranging from less than 24 hours to 6 months. TimeZerowas within the first 24 hours of admission, and patients atthat time were primarily (86.4%) in the emergency depart-ment.

Feature MeanTotal Number of Patients 177Average Age 61Gender(Male) 102Race(Caucasian) 97Ethnicity(Latino) 11Payer(Medicaid) 102White Blood cell 15.8Lactate 28Mean blood Pressure 73.9Temperature 98.4Heart Rate 101.3Respiratory Rate 20.6Cardiovascular 100Cerebrovascular 66Respiratory 69Kidney 62

Table 1: Demographics statistics of patient popula-tion

In Figure 2, the effects of various rule-combination pairsare depicted. An effect is defined as the difference in themean rate of progression to complications between the ex-posed and unexposed groups. Since we used bootstrap simu-lation, for each rule-complication pair, 500 replications wereperformed resulting in a sampling distribution for the ef-fect. Sampling distribution for each rule-association pair ispresented as boxplots. The boxplots represent the statis-tic measured, i.e. in this study, the differential impact ofa recommendation on mortality between the exposed andunexposed population. When this statistic is 0, the recom-mendation has no effect. If the recommendation is greaterthan 0, it means that the recommendation is protective forthat specific condition; and if the recommendation is below0, the recommendation may even increase the risk for theoutcome for that specific condition.

The panes (groups of boxplots) correspond to the compli-cations and the boxes within each pane correspond to therecommendation (rule). For example, the effect of the Ven-tilator rule (Recommendation 15: patients in respiratorydistress should be put on ventilator) on mortality (Death) isshown in the rightmost box (Ventilator) in the bottom-mostpane (Death). Since all effects in the boxplot are above 0,namely the number of observed complications in the unex-posed group is higher than in the exposed, compliance withthe Ventilator rule reduces the number of deaths. Therefore,the corresponding recommendation is beneficial to protectpatients from Death (mortality). In Table 3, we present the95% Confidence Intervals for various rule-outcome pairs.

95% Confidence intervals for various rule-outcome pairs.

Figure 2: Box-plots of the mean difference betweengroups (unexposed - exposed) to the guideline rec-ommendations and each of the outcomes of interest.

To further ensure the validity of the results, we examine thepropensity score distribution in the exposed and unexposedgroup. As an example, Figure 3 illustrates the propensityscore distribution for a randomly selected bootstrap iter-ation to measure the effect of Ventilator on Death. Thehorizontal axis represents the propensity score, which is theprobability of receiving the interventions, and the verticalaxis represents the density distribution, namely the propor-tion of patients in each group with a particular propensityfor being put on Ventilator. Figure 3 shows substantialoverlap between the propensity scores in the exposed andunexposed group. The propensity score overlap representsthe distribution; the predictor Ventilator across the exposedand unexposed populations regarding the outcome Death;the balance was successful when the propensity score wasapplied for this population. Other rule-complication pairsexhibit similar propensity score distribution.

3. CONCLUSIONThe overall purpose of this study was to use EHR data to

determine compliance with the Surviving Sepsis Campaign

Figure 3: Distribution of the propensity scores be-tween exposed and unexposed groups for the out-come Death when patients and the SSC recommen-dation was Ventilator..

(SSC) guideline and measure its impact on inpatient mor-tality and sepsis complications in patients with severe sep-sis and septic shock. Results showed that compliance withmany of the recommendations was> 95% for MAP and CVPwith fluid resuscitation given for low readings. Other highcompliance (greater than 80%) recommendations were: in-sulin given for high blood glucose and evaluating respiratorydistress. The recommendations with the lowest compliance(< 30%) were: vasopressor or albumin for continuing lowMAP or CVP readings. This may be due to a study de-sign artifact, where the rule only considered interventionsinitiated after TimeZero (estimated onset of sepsis) whilethe fluid resuscitation may have taken place earlier. Al-ternatively, the apparently poor compliance could also beexplained with issues related to the coding of fluids: duringdata validation, we found that it was difficult to track fluids.

Our study also demonstrates that retrospective EHR datacan be used to evaluate the effect of compliance with guide-line recommendations on outcomes. We found a numberof SSC recommendations that were significantly protectiveagainst more than one complication: Ventilator was protec-tive against Cardiovascular and Respiratory complicationsas well as Death; use of Vasopressors was protective for Res-piratory complications.

Other recommendations, BCulture, Antibiotic, Vasopres-sor, Lactate, CVP, and RespDistress, showed results lessconsistent with our expectation. For instance, Vasopressorused to treat low MAP, appears to increase cerebrovascularcomplications. While this finding is not statistically signif-icant, it may be congruent with the fact that small brainvessels are very sensitive to changes in blood pressure. LowMAP can cause oxygen deprivation, and consequently braindamage.

Ventilator, Vasopressor, and BGlucose showed protectiveeffects against Respiratory complications. The SSC guide-line recommends the implementation of ventilator therapyas soon as any change in respiratory status is noticed. Thisintervention aims to protect the patient against further sys-tem stress, restore hypoxia, help with perfusion across themain respiratory-cardio vessels, and decrease release of tox-

ins due to respiratory efforts.Our study is a proof-of-concept study demonstrating that

EHR data can be used to estimate the effect of guidelinerecommendations. However, for several combinations of rec-ommendations and outcomes, the effect was not significant.We believe that the reason is that guidelines represent work-flows and the effect of the workflow goes beyond the effects ofthe individual guideline recommendations. For example, byconsidering the recommendations outside the context of theworkflow, we may ignore whether the intervention addressedthe condition that triggered its administration. If low MAPtriggered the administration of vasopressors, without consid-ering the workflow, we do not know whether MAP returnedto the normal levels thereafter. Thus we cannot equate anadverse outcome with the failure of the guideline, it may bethe result of the insufficiency of the intervention. Movingforward, we are going to model the workflows behind theguidelines and apply the same principles that we developedin this work to estimate the effect of the entire workflow.

This phase of our study did not address the timing ofrecommendations nor the time prior to TimeZero. For thisanalysis, guideline compliance was considered only after Time-Zero (the estimated onset), since compliance with SSC isonly necessary in the presence of suspected or confirmed sep-sis. There is no reason to suspect sepsis before TimeZero.However, some interventions may have started earlier, with-out respect to sepsis. For example, 100% of the patientsin this sample had antibiotics (potentially preventive antibi-otics), but only 99 (55%) patients received it after TimeZero.

The EHR does not provide date and time for certain ICD-9 diagnoses. During a hospital stay, all new diagnoses arerecorded with the admission date. We know whether a diag-nosis was present on admission or not, thus we know whetherit is a preexisting or new condition, but do not know pre-cisely when the patient developed this condition during thehospitalization. For this reason, we are unable to detectwhether the SSC guideline was applied before or after a com-plication occurred, thus we may underestimate the beneficialeffect of some of the recommendations. For example, highlevels of lactate is highly related to hypoxia and pulmonarydamage. If these patients were checked for lactate after pul-monary distress, we would consider the treatment compliantwith the Lactate recommendation, but we would not knowthat the respiratory distress was already present at the timeof the lactate measurement and we would incorrectly countit as a complication that the guideline failed to prevent.

4. COMPLEX CAUSAL RULE MINING INIRREGULAR TIME-SERIES DATA

4.1 IntroductionEffective management of human health remains a major

societal challenge as evidenced by the rapid growth in thenumber of patients with multiple chronic conditions. Type-II Diabetes Mellitus (T2DM), one of those conditions, af-fects 25.6 million (11.3%) Americans of age 20 or older andis the seventh leading cause of death in the United States [1].Effective treatment of T2DM is frequently complicated bydiseases comorbid to T2DM, such as high blood pressure,high cholesterol, and abdominal obesity. Currently, thesediseases are treated in isolation, which leads to wasteful du-plicate treatments and suboptimal outcomes. The recent

rise in the number of patients with multiple chronic condi-tions necessitates comprehensive treatment of these condi-tions to reduce medical waste and improve outcomes.

Finding optimal treatment for patients who suffer frommultiple associated diseases, each of which can have mul-tiple available treatments is a complex problem. We couldsimply use techniques based on association, but a reasonablealgorithm would likely find that the use of a drug is associ-ated with some unfavorable outcome. This does not meanthat the drug is harmful; in fact in many cases, it simplymeans that patients who take the drug are sicker than thosewho do not and thus they have a higher chance of the unfa-vorable outcome. What we really wish to know is whethera treatment causes an unfavorable outcome, as opposed tobeing merely associated with it.

The difficulty in quantifying the effect of interventions onoutcomes stems from subtle biases. Suppose we wish toquantify the effect of a cholesterol-lowering agent, statin, ondiabetes. We could simply compare the proportion of dia-betic patients in the subpopulation that takes statin and thesubpopulation that does not and estimate the effect of statinas the difference between the two proportions. This methodwould give the correct answer only if the statin-taking andnon-statin-taking patients are identical in all respects thatinfluence the diabetes outcome. We refer to this situationas treated and untreated patients being comparable. Unfor-tunately, statin taking patients are not comparable to non-statin-taking patients, because they take statin to treat highcholesterol, which by and in itself increases the risk of dia-betes. High cholesterol confounds the effect of statin. Manydifference sources of bias exist, confounding is just one of themany. In this manuscript, we are going to address severaldifferent sources of bias, including confounding.

Techniques to address such biases in causal effect estima-tion exist. However, these techniques have been designed toquantify the effect of a single intervention. In trying to applythese techniques to our problem of finding optimal treatmentfor patients suffering from varying sets of diseases, we facetwo challenges.

First, patients with multiple conditions will likely needa combination of drugs. Quantifying the effect of multi-ple concurrent interventions is semantically different fromconsidering only a single intervention. The key concept inestimating the effect of an intervention is comparability : toestimate the effect of intervention, we need two groups ofpatients who are identical in all relevant aspects except thatone group receives the intervention and the other group doesnot. For a single intervention, the first group is typically thesickest patients who still do not get treated and the secondgroup consists of the healthiest patient who get treatment.They are reasonably in the same state of health. However,when we go from a single intervention to multiple interven-tion and try to estimate their joint effect, comparability nolonger exists. A patient requiring multiple simultaneous in-terventions is so fundamentally different from a patient whodoes not need any intervention that they are not compara-ble.

The other key challenge in finding optimal interventionsets for patients with combinatorial sets of diseases is thecombinatorial search space. Even if we could trivially extendthe methods for quantifying the effect of a single interven-tion to a set of concurrent interventions, we would have tosystematically explore a combinatorially large search space.

The association rule mining framework [2] provides an ef-ficient solution for exploring combinatorial search spaces,however, it only detects associative relationships. Our in-terest is in causal relationships.

In this manuscript, we propose causal rule mining, a frame-work for transitioning from association rule mining towardscausal inference in subpopulations. Specifically, given a setof interventions and a set of items to define subpopulations,we wish to find all subpopulations in which effective inter-vention combinations exist and in each such subpopulation,we wish to find all intervention combinations such that drop-ping any intervention from this combination will reduce theefficacy of the treatment. We call these closed interventionsets, which are not be confused with closed item sets. Asa concrete example, interventions can be drugs, subpopula-tions can be defined in terms of their diseases and for eachsubpopulations (set of diseases), our algorithm would returneffective drug cocktails of increasing number of constituentdrugs. Leaving out any drug from the cocktail will reducethe efficacy of the treatment. Closed intervention sets allowus to go from estimating a single intervention to multipleinterventions.

To address the exploration of the combinatorial searchspace, we propose a novel frequency-based anti monotonicpruning strategy enable by the closed intervention set con-cept. The essence of antimonotonic property is that if a setI of interventions does not satisfy a criterion, none of itssupersets will. The proposed pruning strategy based on theclosed intervention is strictly more efficient than the tradi-tional pruning strategy used by the Apriori algorithm [2].

Underneath our combinatorial exploration algorithm, weutilize the Rubin-Neyman model of causation [3]. This modelsets two conditions for causation: a set X of interventionscauses a change in Y iff X happens before Y and Y would bedifferent had X not occurred. The unobservable outcome ofwhat would happen had a treated patient not received treat-ment is a potential outcome and needs to be estimated. Wepresent and compare five methods for estimating these po-tential outcomes and describe the biases these methods cancorrect.

Typically the ground truth for the effect of drugs is notknown. In order to assess the quality of the estimates, weconduct a simulation study utilizing five different syntheticdata set that introduce a new source of bias. We will eval-uate the effect of the bias on the five proposed methodsunderscoring the statements with rigorous proofs when pos-sible.

We also evaluate our work on a real clinical data set fromMayo Clinic. We have data for over 52,000 patients with 13years of follow-up time. Our outcome of interest is 5-yearincident T2DM and we wish to extract patterns of interven-tions for patients suffering from combinations of commoncomorbidities of T2DM. First, we evaluate our methodologyin terms of the computational cost, demonstrating the effec-tiveness of the pruning methodologies. Next, we evaluate thepatterns qualitatively, using patterns involving statins. Weshow that our methodology extracted patterns that allow usto explain the controversial patterns surrounding statin [4].

Contributions. (1) We propose a novel framework for ex-tracting causal rules from observational data correcting for anumber of common biases. (2) We introduce the concept ofclosed intervention sets to extend the concept of quantifyingthe effect of a single intervention to a set of concurrent in-

terventions sidestepping the patient comparability problem.Closed intervention sets also allow for a pruning strategythat is strictly more efficient than the traditional pruningstrategy used by the Apriori algorithm [2]. (3) We comparefive methods of estimating causal effect from observationaldata that are applicable to our problem and rigorously eval-uate them on synthetic data to mathematically prove (whenpossible) why they work.

4.2 Background: Association Rule MiningWe first briefly review the fundamental concepts of asso-

ciation rule mining and extend these concepts to causal rulemining in the next section. Consider a set I of items, whichare single-term predicates evaluating to ‘true’ or ‘false’. Forexample, {age > 55} can be in item. A k-itemset is a setof k items, evaluated as the conjunction (logical ’and’) ofits constituent items. Consider a dataset D = { d1, d2.....dn}, which consists of n observations. Each observation, de-noted by Dj is a set of items. An itemset I = i1, i2, . . . , ik(I ⊂ I) supports an observation Dj if all items in I eval-uate to ‘true’ in the observation. The support of I is thefraction of the observations in D that support I. An item-set is frequent if its support exceeds a pre-defined minimumsupport threshold.

A association rule is a logical implication of form X ⇒ Y ,where X and Y are disjoint itemsets. The support of a ruleis support(XY ) and the confidence of the rule is

conf(X ⇒ Y ) =support(XY )

support(X)= P(Y |X).

4.2.1 Causal Rule MiningGiven an intervention itemset X and an outcome item

Y , such that X and Y are disjoint, a causal rule is an impli-cation of form X → Y , suggesting that X causes a changein Y . Let the itemset S define a subpopulation, consist-ing of all observations that support S. This subpopulationconsists of all observations for which all items in S evalu-ate to ‘true’. The causal rule X → Y |S implies that theintervention X has causal effect on Y in the subpopulationdefined by S. The quantity of interest is the causal effect,which is the change in Y in the subpopulation S caused byX. We will formally define the metric used to quantify thecausal effect shortly.

Rubin-Neyman Causal Model. X has a causal effecton Y if (i) X happens earlier than Y and (ii) if X had nothappened, Y would be different [3].

Our study design ensures that the intervention X precedesthe outcome Y , but fulfilling the second conditions requiresthat we estimate the outcome for the same patient bothunder intervention and without intervention.

Potential Outcomes. Every patient in the dataset has twopotential outcomes: Y0 denotes their outcome had they nothad the intervention X; and Y1 denotes the outcome hadthey had the intervention. Typically, only one of the two po-tential outcomes can be observed. The observable outcomeis the actual outcome (denoted by Y ) and the unobservablepotential outcome is called the counterfactual outcome.

Using the definition of counterfactual outcome, we cannow define the metric for estimating the change in Y causedby X. Average Treatment response on the Treated(ATT) is a widely known metric in the causal literature and

is computed as follows:

ATT(X → Y |S) = E[Y1 − Y0]X=1 = E[Y1]X=1 − E[Y0]X=1,

where E denotes the expectation and the X = 1 in the sub-script signals that we only evaluate the expectation in thetreated patients (X = 1).

ATT aims to compute an average per-patient change causedby the intervention. Y0 = Y1, indicates that the interventionresulted in no change in outcome for the patient.

Biases. Beside X, numerous other variables can also exertinfluence over Y , leading to biases in the estimates. Tocorrect for these biases, we have correctly account for theseother effects. The quintessential tool for this purpose is thecausal graph, depicted in Figure 4. The nodes of this graphare sets of variables that play a causal role and edges arecausal effects. This is not a correlation graph (or dependencegraph), because for example, U and Z are dependent givenX, yet there is no edge between them.

Variables (items in I) can exert influence on the effect ofX on Y in three way: they may only influence X, they mayonly influence Y or them may influence both X and Y . Ac-cordingly, variables can be categorized into four categories:

V are variables that directly influence Y and thus havedirect effect on Y

U are variables that only influence Y through X andthus have indirect effect on Y ;

Z are variables that influence both X and Y and arecalled confounders; and finally

O are variables that do not influence either X or Yand hence can be safely ignored.

Figure 4: Rubin-Neyman Causal Model

Most of the causal inference literature assumes that thecausal graph is known and true. In other words, we knowapriori which variables fall into each of the categories, U ,Z, V and O. In our case, only X and Y are specified andwe have to infer which category each other variable (item)belongs to. Since this inference relies on association (de-pendence) rather than causation, the discovered graph mayhave errors, misclassifications of variables into the wrongcategory. For example, because of the marginal dependencebetween U and Y , variables in U can easily get misclassi-fied as Z. Such misclassifications do not necessarily lead tobiases, but they can cause loss of efficiency.

Problem Formulation. Given a data set D, a set S ofsubpopulation-defining items, a set X of interventionitems, a minimal support threshold θ and a minimum effectthreshold η, we wish to find all subpopulations S (S ⊂ S)and all intervetions X (X ⊂ X ), X and S are disjoint, suchthat the causal rule X → Y |S is frequent and its interventionset X is closed w.r.t. our metric of causal effect, ATT.

Note that the meaning of θ, the minimum support thresh-old, is different than in association rule mining literature.

Typically, rules with support less than θ are considered un-interesting, in other cases, it is simply a computational con-venience, but in our case, we set θ to a minimum value suchthat ATT is estimable for the discovered patterns.

We call a causal rule frequent iff its support exceeds theuser-specified minimum threshold θ

support(X → Y |S) = support(XY S) = P(XY S) > θ

and we call an intervention set X closed w.r.t. to ATT iff

∀x ∈ X, |ATT (x→ Y |S,X\x)| > η,

where η is the user-specified minimum causal effect thresh-old. In other words, a causal rule is closed in a subpopu-lation, if its (absolute) effect is greater than any of its sub-rules.

Example. In a medical setting, X may be drugs, S couldbe comorbid diseases. Then X is a drug-combination thathopefully treats set of diseases S. This set of drugs beingclosed w.r.t. ATT means that dropping any drug from Xwill reduce the overall efficacy of the treatment; the patientis not taking unnecessary drugs.

An itemset is closed if its support is strictly higher thanall of its subitemsets’. Analogously, an intervention set isclosed if its absolute causal effect is strictly higher than allof its subitemsets’.

4.2.2 Frequent Causal Pattern Mining AlgorithmWe can now present our algorithm for causal pattern min-

ing. At a very high level, the algorithm comprises of twonested frequent pattern enumeration [30] loops. The outerloop enumerates subpopulation-defining itemsets S usingitems in S, while the inner loop enumerates interventioncombinations using items in X \ S. More generally, X andS can overlap but we do not consider that in this paper. Ef-fective algorithms to this end exists [31, 32], we simply useApriori [2].

Once the patterns are discovered, the ATT of the interven-tions are computed, using one of the methods from Section4.3 and the frequent, effective patterns are returned.

On the surface, this approach appears very expensive,however several novel, extremely effective pruning strategiesare possible and we describe them below.

Potential Outcome Support Pruning. Let X be anintervention k-itemset, S be a subpopulation-defining item-set, and let X and S be disjoint. Further, X−i be an itemsetthat evaluates to ‘true’ iff all items except the ith are ‘true’but the ith item is ‘false’. Using association rule mining ter-minology, all items in X except the ith are present in thetransaction.

Definition 1 (Potential Outcome Support Pruning).We only need to consider itemsets X such that

min{support(S,X), support({S,X−1), . . . ,

support(S,X−k)} > θ.

In order to be able to estimate the effect of x ∈ X in thesubpopulation S, we need to have observations with x ‘true’and also with x ‘false’ in S.

Lemma 1. Potential Outcome Support Pruning is anti-monotonic.

Proof: Consider a causal rule X → Y |S . If the causal ruleX → Y |S is infrequent, then

support(XS) < θ ∨ ∃i, support(X−iS) < θ.

If support(X−iS) had insufficient support, then any exten-sion of it with an intervention item x will continue to haveinsufficient support, thus the Xx → Y |S rule will have in-sufficient support. Likewise, if support(XS) had insufficientsupport, then any extension of it with an intervention itemx will also have insufficient support.

Pruning based on Causal Effect.

Proposition 1. Effective causal rule pruning conditionis anti-monotonic.

Rationale: To explain the rational, let us return to themedical example, where X is a combination of drugs forminga treatment. Assuming that the effects of drugs are additive,if a casual rule X → Y |S is ineffective because

∃xi ∈ X, |ATT(xi → Y |S,X\xi)| < η,

then forming a new rule Xxj → Y |S will also be ineffectivebecause

|ATT(xi → Y |S,xj ,X\xi)|

will be ineffective. In the presence of positive interactions(that reinforce each other’s effect) among the drugs, thisstatement may not hold true. Beside statistical reasoning,one can question why a patient should receive a drug thathas no effect in a combination.

4.3 Causal Estimation MethodsATT, our metric of interest, with respect to a single in-

tervention x in a subpopulation S is defined as

ATT(x→ Y |S) = E [Y1 − Y0]S,X=1 ,

which is the expected difference between the potential out-come under treatment Y1 and the potential outcome with-out treatment Y0 in patients with S who actually receivedtreatment. Since we consider treated patients, the potentialoutcome Y1 can be observed, the potential outcome Y0 can-not. Thus at least one of the two must be estimated. Themethods we present below differ in which potential outcomethey estimate and how they estimate it.

For the discussion below, we consider the variables X, Z,U and V from the causal graph in Figure 4. X is a singleintervention, U , V and Z can be sets of items. For regressionmodels, we will denote the matrix defined by U , V and Zin the subpopulation S as U , V and Z (same letter as thevariable sets).

Counterfactual Confidence (CC). This is the simplestmethod. We simply assume that the patients who receiveintervention X = 1 and those who do not X = 0, do not dif-fer in any important respect that would influence Y . Underthis assumption, Y1 in the treated is simply the actual out-come in the treated and the potential outcome Y0 is simplythe actual outcome in the non-treated (X = 0). Thus

ATT = conf((X = 1)→ Y |S)− conf((X = 0)→ Y |S),

= P(Y |S,X = 1)− P(Y |S,X = 0)

In the followings, to improve readability, we drop the Ssubscript. All evaluations take place in the S subpopula-tions.

Direct Adjustment (DA). We cannot estimate Y0 in thetreated (X = 1) as the actual outcome Y in the untreated,because the treated and untreated populations can signifi-cantly differ in variables such as Z and V that influence Y .In Direct Adjustment, we attempt to directly remove the ef-fect of V and Z by including them into a regression model.Since a regression model relates the means of the predictorswith the mean of the outcome, we can remove the effect ofV and Z by making their means 0.

Let R be a generalized linear regression model, predictingY via a link function g

g(Y |V,Z,X) = β0 + βV V + βZZ + βXX.

Then the (link-transformed) potential outcome under treat-ment is g(Y1) = β0 + βV V + βZZ + βX and the potentialoutcome without treatment is g(Y0) = β0 + βV V + βZZ.The ATT is then

ATT = E[g−1(Y1|V,Z,X = 1)

]X=1−

E[g−1(Y0|V,Z,X = 0)

]X=1

.

where g−1(Y1|V,Z,X = 1) is prediction for an observationwith the observed V and Z but with X set to 1. The E(·)X=1

notation signifies that these expectation of the predictionsare taken only over patients who actually received the treat-ment.

The advantage of DA (over CC) is manyfold. First, it canadjust for Z and V as long the model specification is correct,namely the interaction terms that may exist among Z and Vare specified correctly. Second, we get correct estimates evenif we ignore U , because U is conditionally independent of Ygiven X. This unfortunately only is a theoretical advantage,because we have to infer from the data whether a variableis a predictor of Y and U is marginally dependent on Y , sowe will likely adjust for U , even if we don’t need to.

Counterfactual Model (CM). In this technique, we buildan explicit model for the potential outcome without treat-ment Y0 using patients with X = 0. Specifically, we build amodel

g(Y |V,Z,X = 0) = β0 + βV V + βZZ.

and estimate the potential outcome as

g(Y0|V,Z) = g(Y |V,Z,X = 0).

The ATT is then

ATT = P(Y |X = 1)− E[g−1(Y0|V,Z)

]X=1

.

Similarly to Direct Adjustment, the Counterfactual Modeldoes not depend on U . However, in case of the Counterfac-tual Model, we are only considering the population withX = 0. In this population, U and Y are independent, thuswe will not include U variables into the model.

Propensity Score Matching (PSM). The central ideaof Propensity Score Matching is to create a new population,such that patients in this new population are comparable inall relevant respects and thus the expectation of the poten-tial outcome in the untreated equals the expectation of theactual outcome in the untreated.

Patients are matched based on their propensity of receiv-ing treatment. This propensity is computed as a logisticregression model with treatment as the dependent variable

logP(X)

1− P(X)= β0 + βV V + βZZ.

Patient pairs are formed, such that in each pair, one patientreceived treatment and the other did not and their propen-sities for treatment differ by no more than a user-definedcaliper difference ρ.

The matched population has an equal number of treatedand untreated patients, is balanced on V and Z, thus thepatients are comparable in terms of their baseline risk of Y .Hopefully, the only factor causing a difference in outcome isthe treatment.

For estimating ATT, the potential outcome without treat-ment is estimated from the actual outcomes of the patientsin the matched population who did not receive treatment:

ATT = E [Y1 − Y0]

− P(Y |X = 1,M)− P(Y |X = 0,M),

where M denotes the matched population.Among the methods we consider, propensity score match-

ing most strictly enforces the patient comparability crite-rion, however, it is susceptible to misspecification of thepropensity regression model, which can erode the qualityof the matching.

Stratified Non-Parametric (SN). In the stratified esti-mation, we directly compute the expectation via stratifica-tion. The assumption is that the patient in each stratumare comparable in all relevant respects and only differ in thepresence or absence of intervention. In each stratum, wecan estimate the potential outcome Y0 in the treated as theactual outcome Y in the untreated.

ATT = E [Y1 − Y0]X=1

=∑l

P (l|X = 1) [P (Y1|l,X = 1)− P (Y0|l,X = 1)]

=∑l

P (l|X = 1) [P (Y |X = 1)− P (Y |X = 0)] ,

where l iterates over the combined levels of V and Z. If wecan identify the items that fall into U , then we can ignorethem, otherwise, we should include them as well into thestratification.

The stratified method makes very few assumptions andshould arrive at the correct estimate as long as each of thestrata are sufficiently large. The key disadvantage of thestratified method lies in stratification itself: when the num-ber of items across which we need to stratify is too large, wemay end up dividing the population into excessively manysmall subpopulations (strata) and become unable to esti-mate the causal effect in many of them thus introducingbias into the estimate.

4.4 ResultsAfter describing our data and study design, we present

three evaluations of the proposed methodology. The firstevaluation demonstrates the computational efficiency of ourpruning methodologies, isolating the effect of each prun-ing methods: (i) Apriori support-based pruning, (ii) Po-tential Outcome Support Pruning, and (iii) Potential Out-come Support Pruning in conjunction with Effective CausalRule Pruning. In the second section, we provide a quali-tative evaluation, looking at patterns involving statin. Weattempt to use the extracted patterns to explain the con-troversial findings that exist in the literature regarding theeffect of statin on diabetes. Finally, in order to compare the

treatment effect estimates to a ground truth, which does notexits for real drugs, we simulate a data set using proportionswe derived from the Mayo Clinic data set.

Data and Study Design. In this study we utilized alarge cohort of Mayo Clinic patients with data between 1999and 2013. We included all adult patients (69,747) with re-search consent. The baseline of our study was set at Jan. 1,2005. We collected lab results, medications, vital signs andstatus, and medication orders during a 6-year retrospectiveperiod between 1999 and the baseline to ascertain the pa-tient’s baseline comorbidities. From this cohort, we excludedall patients with a diagnosis of diabetes before the baseline(478 patients), missing fasting plasma glucose measurements(14,559 patients), patients whose lipid health could not bedetermined (1,023 patients) and patients with unknown hy-pertension status (498 patients). Our final study cohort con-sists of 52,139 patients who were followed until the summerof 2013.

Patients were phenotyped during the retrospective period.Comorbidities of interest include Impaired Fasting Glucose(IFG), abdominal obesity, Hypertension (HTN; high bloodpressure) and hyperlipidemia (HLP; high cholesterol). Foreach comorbidity, the phenotyping algorithm classified pa-tients into three broad levels of severity: normal, mild andsevere. Normal patients show no sign of disease; mild pa-tients are either untreated and out of control or are con-trolled using first-line therapy; severe patients require moreaggressive therapy. IFG is categorized into normal and pre-diabetic, the latter indicating impaired fasting plasma glu-cose levels but not meeting the diabetes criteria yet. Forthis study, progression to T2DM within 5 years from base-line (i.e. Jan 1, 2005) was chosen as our outcome of interest.Out of 52,139 patients 3627 patients progressed to T2DM ,41028 patients did not progressed to T2DM and the remain-ing patients (7484) dropped out of the study. In Table 2 wepresent statistics about our patient population.

4.4.1 Pruning EfficiencyIn our work, we proposed two new pruning methods. First,

we have the Potential Outcome Support Pruning, whichaims to eliminate patterns for which the ATT is not es-timable. Second, we have the Effective Causal Rule Pruning,where we eliminate patterns that do not improve treatmenteffectiveness relative to the subitemsets.

In Figure 5 we present the number of patterns discoveredusing (i) the traditional Apriori support based pruning, (ii)our proposed Potential Outcome Support Pruning (POSP),and (iii) POSP in conjunction with Effective Causal RulePruning (ECRP).

The number of patterns discovered by POSP is strictlyless than the number of patterns discovered by the Aprioripruning. POSP in conjunction with ECRP is very effective.

4.4.2 StatinIn this section, we demonstrate that the proposed causal

rule mining methodology can be used to discover non-trivialpatterns from the above diabetes data set.

In recent years, the use of statins, a class of cholesterol-lowering agents, have been prescribed increasingly. Highcholesterol (hyperlipidemia) is linked to cardio-vascular mor-tality and the efficacy of statins in reducing cardio-vascularmortality is well documented. However, as evidenced by a2013 BMJ editorial [4] devoted to this topic, statins are sur-

T2DMPresent Absent

Total Number of Patients 3627 41028Average Age 44.73 35.58Male(%) 51 41Female(%) 49 59

Patient Diagnosis Status (%)NormFG 42 84PreDM 58 16Normal Obesity 29 59Mild Obesity 25 30Severe Obesity 46 11Normal Hypertension 48 69Mild Hypertension 33 23Severe Hypertension 19 08Normal Hyperlipidemia 12 29Mild Hyperlipidemia 72 64Severe Hyperlipidemia 16 07

Patient Medication Status(%)Statin 26 11Fibrates 03 01Cholesterol.Other 02 01Acerab 17 07Diuret 18 07CCB 08 04BetaBlockers 22 10HTN.Others 01 01

Table 2: Demographics statistics of patient popula-tion

Figure 5: Comparison of Pruning Techniques

rounded in controversy. In patients with normal blood sugarlevels (labeled as NormalFG), statins have a detrimental ef-fect, they increase the risk of diabetes; yet in pre-diabeticpatients (PreDM), it appears to have no effect. What wedemonstrate below is that this phenomenon is simply dis-ease heterogeneity.

First, we describe how this problem maps to the causalrule mining problem. Our set of interventions (X ) consistsof statin and our subpopulation defining variables consist ofthe various levels of HTN, HLP and IFG (S). Our inter-est is the effect of statin (x) on T2DM (Y ) in all possiblesubpopulations S, S ⊂ S.

In this setup, HTN, which is associated with both hy-perlipidemia (and statin use), as well as with T2DM, isa confounder (Z). A cholesterol drug, other than statin,(say) fibrates, are in the U category: they are predictive ofstatin (patients on monotherapy who take fibrates do nottake statins), but have no effect on Y , because its effect isalready incorporated into the hyperlipidemia severity vari-

ables that defined the subpopulation. Variables that onlyinfluence diabetes but not statin use (say a diabetes drug)would fall into the V category. All subpopulations have vari-ables that fall into Z and U and some subpopulation mayalso have V .

The HLP variable in Table 2 uses statin as part of its defi-nition, thus we constructed two new variables. The first oneis HLP1, a variable at the borderline between HLP-Normaland HLP-Mild, consisting of untreated patients with mildlyabnormal lab results (these fall into HLP-Normal) and pa-tients who are diagnosed and receive a first-line treatment(they fall into HLP-Mild). Comparability is the central con-cept of estimating causal effects and these patients are com-parable at baseline. Similarly, we also created another vari-able, HLP2, which is at the border of HLP-Mild and HLP-Severe, again consisting of patients who are comparable inrelevant aspects of their health at baseline.

S CC DA CM PSM SNPreDM 0.145 0.022 0.010 0.022 0.017NormFG 0.060 0.023 0.034 0.017 0.029HLP1 0.078 0.019 0.014 0.010 0.010HLP2 0.021 -0.013 -0.010 -0.021 -0.015PreDM,HLP1 0.067 0.018 0.021 0.004 0.002PreDM,HLP2 0.001 -0.038 -0.031 -0.048 -0.043NormFG,HLP1 0.043 0.020 0.015 0.014 0.013NormFG,HLP2 0.017 -0.002 -0.002 -0.005 -0.004

Table 3: ATT due to statin in various subpopula-tions S as estimated by the 5 proposed methods.

Table 3 presents the ATT estimates obtained by the var-ious methods proposed in Section 3.4 for some of the mostrelevant subpopulations. Negative ATT indicates beneficialeffect and positive ATT indicates detrimental effect.

Counterfactual confidence (CC) estimates statin to be detri-mental in all subpopulations. While statins are known tohave detrimental effect in patients with normal glucose lev-els [4], it is unlikely that statins are universally detrimental,even in patients with severe hyperlipidemia, the very diseaseit is supposed to treat.

The results between DA, CM, PSM and SN are similar,with PSM and SN having larger effect sizes in general. Thepicture that emerges from these results is that patients withsevere hyperlipidemia appear to benefit from statin treat-ment even in terms of their diabetes outcomes, while statintreatment is moderately detrimental for patients with mildhyperlipidemia.

Bootstrap estimation was used to compute the statisti-cal significance of these results. For brevity, we report theresults only for PSM. The estimates are significant in the fol-lowing subpopulations: NormFG, PreDM+HLP2 (p-valuesare <.001) and NormFG+HLP1 (p-value .05).

The true ATT in these subpopulations is not know. Toinvestigate the accuracy that the various methods achieve,we use simulated that is largely based on this example [4,33].

4.4.3 Synthetic DataIn this section, we describe four experiments utilizing syn-

thetic data sets, each of which introduces a new potentialsource of bias. Our objective is to illustrate the ability of thefive methods from Section 3.4 for adjusting for these biases.

We compare their ATT estimates to the true ATT we usedto generate the data set and discuss reasons for their successor failure.The rows of Table 4 correspond to the synthetic data sets inincreasing order of the biases we introduced and the columnscorresponds to the methods: Conf (confidence), CC (Coun-terfactual Confidence), DA (Direct Adjustment), CM (Coun-terfactual Model), PSM (Propensity Score Matching) andSNP (Stratified Non-Parametric).Some of these methods, DA, CM, PSM and SNP take thecausal graph structure into account while estimating ATT.Specifically, they require the information whether a variableis a confounder (Z), has a direct effect (V ), an indirect effect(V ), or no effect (O). PSM and SNP are not sensitive to theinclusion of superfluous variables, they simply decrease themethod’s efficiency.In all of the data sets, we use a notation consistent withFigure 1: Z is the central disease with outcome Y ; X is theintervention of interest that treats Z; V is another diseasewith direct causal effect on Y , but V is independent of X;and U is a third disease, which can be treated with X, buthas no impact on Y . All data sets contain 5000 observations.

I. Direct Causal Effect from V . We assume that every pa-tient in the cohort has disease Z at the same severity. Theyare all comparable w.r.t. Z. 30% of the patients are sub-ject to the intervention X aimed at treating Z, while othersare not. Untreated patients face a 25% chance of having Y ,while treated patients only have 10% chance. Some patients,20% of the population, also have disease V , which directlyaffects Y : it increases the probability of Y by 5%.In this example the true ATT is -.15, as X reduces thechance of Y by 15%. Our causal graph dictates that Xand V be marginally independent, hence this this effect ishomogeneous across the levels of V . (Otherwise V wouldbecome predictive of X and it would become a confounder.Confounding is discussed in experiments III-V.) All methodsestimated the ATT correctly, because ATT does not dependon V . We can demonstrate this by stratifying on V andusing the marginal independence of X and V .

ATT = E [P(Y |X = 1)− P(Y |X = 0)]

=∑v∈V

P(V = v) [P(Y |V = v,X = 1)− P(Y |V = v,X = 0)]

=∑v∈V

[P(Y, V = v|X = 1)− P(Y, V = v|X = 0)]

= P(Y |X = 1)− P(Y |X = 0)

where v denotes the levels of V . The marginal independenceof X and V is used in step three:

P(Y |V,X) =P(Y, V,X)

P(V,X)=

P(Y, V |X)P(X)

P(X,V )=

P(Y, V |X)

P(V ).

II. Indirect Causal Effect. The setup for this experiment isthe same as for the ’Direct Causal Effect’ experiment, exceptwe have disease U instead of V . Just like Z, disease U isalso treated by X, but U has no direct effect on Y ; its effectis indirect through X. U is thus independent of Y given X.The true ATT continues to be -.15.Again, the ATT does not depend on U , hence all methodsestimated it correctly. To demonstrate that ATT does notdepend on U , we use stratification and the conditional inde-

pendence of Y and U .

ATT = E [P(Y |X = 1)− P(Y |X = 0)]

=∑u∈U

[P(Y |U = u,X = 1)P(U = u|X = 1)

−P(Y |U = u,X = 0)P(U = u|X = 0)]

=∑u∈U

[P(Y |X = 1)P(U = u|X = 1)

−P(Y |X = 0)P(U = u|X = 0)]

= P(Y |X = 1)∑u

P(U = u|X = 1)−

P(Y |X = 0)∑u

P(U = u|X = 0)

= P(Y |X = 1)− P(Y |X = 0)

III. Confounding. In this experiment, we consider the sim-plest case of confounding, involving a single disease Z, asingle treatment X and outcome Y . 20% of the patientshave disease Z and 95% of the diseased patients are treatedwith X, while 5% are not. All treated patients have Z. 25%of the untreated patients (Z = 1 and X = 0) have outcomeY ; 10% of the treated patients (Z = 1 and X = 1) have theoutcome; and only 5% of the healthy patients (Z = 0) haveit. The true ATT is -.15.In the presence of confounding, the counterfactual confi-dence and ATT do not coincide. With z denoting the levelsof Z and P(z) being a shorthand for P(Z = z),

ATT = E [P(Y |X = 1)− P(Y |X = 0)]

=∑z

P(z) [P(Y |X = 1, z)− P(Y |X = 0, z)] ,

while the counterfactual confidence (CC) is

CC = P(Y |X = 1)− P(Y |X = 0)

=∑z

[P(Y |X = 1, z)P(z|X = 1)

−P(Y |X = 0, z)P(z|X = 0)] .

When P(z|X) 6= P(z), these quantities do not coincide.However, any method that can estimate P(Y |X,Z) for alllevels of Z and X will arrive at the correct ATT estimate.We used logistic regression in our implementation of theDirect Adjustment method, which can estimate P(Y |X,Z)when X and Z have no interactions. Note that the causalgraph admits interaction between X and Z, thus model mis-specification can cause biases in the estimate.

IV. Confounding with Indirect Effect. In addition to theConfounding experiment, we also have an indirect causaleffect from U . We now have two diseases, Z and U , eachof which can be treated with X. 20% of the population hasZ and independently, 20% has U . 25% of the patients whohave Z and have no treatment (X = 0) have Y , while only10% of the treated (X = 1) patients have it, regardless ofwhether the patient has U . (If the probability of Y wasaffected by U , it would be another confounder, rather thanhave an indirect effect.)X has a beneficial ATT of -.15 in patients with Z == 1 (andX == 1) and has no effect in patients with Z = 0 (who getX because of U). Thus the true ATT=-.0833.In this experiment, the counterfactual model was the best-performing model. The counterfactual model estimates the

ATT through the definition

ATT = E [P(Y1|X = 1)− P(Y0|X = 1)] ,

where Y0 is the potential outcome the patient would havewithout treatment X = 0 and P(Y0|X = 1) is the coun-terfactual probability of Y (the probability of Y had theynot received X) in the population who actually got X = 1.Note that the potential outcome Y1|X = 1 in the patientswho actually got X = 1 is the observed outcome Y |X = 1.With u and z denoting the levels of U and Z, respectivelyand P(u) being a shorthand for P(U = u),

ATT = E [P(Y |X = 1)− P(Y0|X = 1)]

=∑u

∑z

P(u, z) [P(Y |X = 1, u, z)− P(Y0|X = 1, u, z)]

=∑z

P(z)∑

[P(Y |X = 1, z)− P(Y0|X = 1, z)]

=∑z

P(z)∑

[P(Y |X = 1, z)− P(Y |X = 0, z)] ,

which coincides with the data generation mechanism, hencethe estimate is correct.

In the derivation, step 2 holds because U and Z are inde-pendent givenX and step 3 uses the fact that the counterfac-tual model estimates P0(Y |X = 1, z, u) from the untreatedpatients, thus

P(Y0|X = 1, z, u) = P(Y |X = 0, z, u) = P(Y |X = 0, z).

V. Confounding with Direct and Indirect Effects. In this ex-

periment, we have three diseases: our index disease Z, whichis a confounder; U having an indirect effect on Y via X; andV having a direct effect on Y . 20% of the population haseach of Z, V and U independently. 95% of patients with Zor U get the intervention X. 25% of the untreated patientswith Z get Y , while only 10% of the treated patients do,regardless of whether they have U . Patients with V face a5% in their chance of experiencing outcome Y .X has a beneficial ATT of -.15 in patients with Z = 1 andhave no effect in patients with Z = 0 (who get X because ofU). Whether a patient has V does not influence the effectof X. The true ATT is thus -.0833.None of the methods estimated the effect correctly, but Propen-sity Score Matching came closest. Analytic derivation ofwhy it performed well is outside the scope of this paper, butin essence, its success is driven by its ability to maximally ex-ploit the independence relationships encoded in the causalgraph. It can ignore V when it constructs the propensityscore model, because X and V are independent (when Ynot given); and it can ignore U and V when it computes theATT in the propensity matched population. On the otherhand, the causal graph admits interaction among U , Z andX, thus a logistic regression model as the propensity scoremodel can be subject to model misspecification.The Stratified Non-Parametric method, which is essentiallyjust a direct implementation of the definition of ATT, un-derestimated the ATT by almost 25%. The reason lies in theexcessive stratification across all combinations of the levelsof U , V , and Z. Even with just three variables, most stratadid not have sufficiently many patients (either treated oruntreated) to estimate P(Y |X,u, v, z). In the discussion, wewill describe remedies to overcome this problem.

Conf CC DA CM PSM SNI. +.110 -.150 -.150 -.150 NA -.150II. +.099 -.150 -.150 -.150 -.151 -.149III. +.099 +.047 -.136 -.136 -.136 -.136IV. +.077 +.024 -.019 -.083 -.068 -.064V. +.072 +.038 -.037 -.105 -.074 -.067

Table 4: The ATT estimates by the 6 methods in thefive experiments. The experimental conditions, thefull names of the methods and the true ATT valueare specified in the text.

5. DISCUSSION AND CONCLUSIONWe proposed the causal rule mining framework, which

transitions pattern mining from finding patterns that areassociated with an outcome towards patterns that causechanges in the outcome. Finding causal relationships in-stead of associations is absolutely critical in health care, butalso has appeal beyond health care.

The numerous biases that arise in establishing causationmake quantifying causal effects difficult. We use the Neyman-Rubin causal model to define causation and use the poten-tial outcome framework to estimate the causal effects. Wecorrect for three kinds of potential biases: those stemmingfrom direct causal effect, indirect causal effect and confound-ing. We compared five different methods for estimating thecausal effect, evaluated them on real and synthetic data andfound that three of these methods gave very similar results.

We have demonstrated on real clinical data that our pro-posed method can effectively enumerate causal patterns in alarge combinatorial search space due to the two new pruningmethods we developed for this work. We also demonstratedthat the patterns discovered from the data were very richand we managed to illustrate how the effect of statin is dif-ferent in various subpopulations. The results we found areconsistent with the literature but go beyond what is alreadyknown about statin’s effect on the risk of diabetes.

The discussions and experimental results provided in thispaper provide some general guidance on when to use thedifferent methods we described. We recommend counterfac-tual confidence if no confounding is suspected as counterfac-tual confidence is computationally efficient and can arriveat the correct solution even when direct effects and indirecteffects are present. In the presence of confounding, propen-sity score matching gave the most accurate results, but dueto the need to create a matched population, it has built-inrandomness, increasing its variance. Moreover, the counter-factual model as well as the propensity score model are sus-ceptible to model misspecification. If unknown interactionsamong variables are suspected, we recommend the strati-fied non-parametric method. With this technique, modelmisspecification is virtually impossible, however, its samplesize requirement is high. The stratified model is subopti-mal if we need to stratify across many variables. Stratify-ing across many variables can fragment the population intomany strata too small to afford us with the ability to esti-mate the effects correctly. If the estimates use some stratabut not others, they may be biased.

AcknowledgmentsThis study is supported by National Science Foundation(NSF) grant: IIS-1344135. Contents of this document are

the sole responsibility of the authors and do not necessarilyrepresent official views of the NSF.

6. ADDITIONAL AUTHORS

7. REFERENCES[1] National diabetes fact sheet: national estimates and

general information on diabetes and prediabetes in theunited states, 2011. Atlanta, GA: US Department ofHealth and Human Services, Centers for DiseaseControl and Prevention, 201, 2011.

[2] R. Agrawal and R. Srikant. Fast algorithms for miningassociation rules. In Proc. 20th int. conf. very largedata bases, VLDB, volume 1215, pages 487–499, 1994.

[3] J. Sekhon. The neyman-rubin model of causalinference and estimation via matching methods. TheOxford handbook of political methodology, pages271–299, 2008.

[4] R. Huupponen and J. Viikari. Statins and the risk ofdeveloping diabetes. BMJ, 346, 2013.

[5] V. Didelez and I. Pigeot. Judea pearl: Causality:Models, reasoning, and inference. PolitischeVierteljahresschrift, 42(2):313–315, 2001.

[6] P. Rosenbaum. Observational studies. Springer, 2002.

[7] D. Freedman. From association to causation viaregression. Advances in applied mathematics,18(1):59–110, 1997.

[8] W. Bielby and R. Hauser. Structural equation models.Annual review of sociology, pages 137–161, 1977.

[9] J. Robins, M. Hernan, and B. Brumback. Marginalstructural models and causal inference inepidemiology. Epidemiology, pages 550–560, 2000.

[10] C. Granger. Some recent development in a concept ofcausality. Journal of econometrics, 39(1):199–211,1988.

[11] F. Jensen. An introduction to Bayesian networks,volume 210. UCL press London, 1996.

[12] F. Elwert. Graphical causal models. In Handbook ofcausal analysis for social research, pages 245–273.Springer, 2013.

[13] K. Murphy. Dynamic bayesian networks. ProbabilisticGraphical Models, M. Jordan, 2002.

[14] D. Heckerman. A bayesian approach to learning causalnetworks. In UAI, pages 285–295. Morgan KaufmannPublishers Inc., 1995.

[15] D. Heckerman. Bayesian networks for data mining.Data mining and knowledge discovery, 1(1):79–119,1997.

[16] G. Cooper. A simple constraint-based algorithm forefficiently mining observational databases for causalrelationships. Data Mining and Knowledge Discovery,1(2):203–224, 1997.

[17] C. Silverstein, S. Brin, R. Motwani, and J. Ullman.Scalable techniques for mining causal structures. DataMining and Knowledge Discovery, 4(2-3):163–192,2000.

[18] S. Mani, P. Spirtes, and G. Cooper. A theoreticalstudy of y structures for causal discovery. arXivpreprint arXiv:1206.6853, 2012.

[19] P. Austin. An introduction to propensity scoremethods for reducing the effects of confounding in

observational studies. Multivariate behavioral research,46(3):399–424, 2011.

[20] J. Lunceford and M. Davidian. Stratification andweighting via the propensity score in estimation ofcausal treatment effects: a comparative study.Statistics in medicine, 23(19):2937–2960, 2004.

[21] P. Austin and E. Stuart. Estimating the effect oftreatment on binary outcomes using full matching onthe propensity score. Statistical methods in medicalresearch, page 0962280215601134, 2015.

[22] P. Rosenbaum and D. Rubin. The central role of thepropensity score in observational studies for causaleffects. Biometrika, 70(1):41–55, 1983.

[23] P. Austin and E. Stuart. Moving towards best practicewhen using inverse probability of treatment weighting(iptw) using the propensity score to estimate causaltreatment effects in observational studies. Statistics inmedicine, 34(28):3661–3679, 2015.

[24] D. Lambert and D. Pregibon. More bang for theirbucks: Assessing new features for online advertisers.In Workshop on Data mining and audienceintelligence for advertising, pages 7–15. ACM, 2007.

[25] D. Chan, R. Ge, O. Gershony, T. Hesterberg, andD. Lambert. Evaluating online ad campaigns in apipeline: causal models at scale. In ACM-SIGKDD,pages 7–16. ACM, 2010.

[26] J. Li, T. Le, L. Liu, J. Liu, Z. Jin, and B. Sun. Miningcausal association rules. In Data Mining Workshops(ICDMW), 2013 IEEE 13th International Conferenceon, pages 114–123. IEEE, 2013.

[27] P. Holland and D. Thayer. Differential itemperformance and the mantel-haenszel procedure. Testvalidity, pages 129–145, 1988.

[28] J. Li, T. Le, L. Liu, J. Liu, Z. Jin, B. Sun, and S. Ma.From observational studies to causal rule mining.ACM Transactions on Intelligent Systems andTechnology (TIST), 7(2):14, 2015.

[29] J. Li, S. Ma, T. Le, L. Liu, and J. Liu. Causal decisiontrees. arXiv preprint arXiv:1508.03812, 2015.

[30] B. Goethals. Survey on frequent pattern mining. Univ.of Helsinki, 2003.

[31] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal,and M. Hsu. Freespan: frequent pattern-projectedsequential pattern mining. In ACM-SIGKDD, pages355–359. ACM, 2000.

[32] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequentpatterns without candidate generation: Afrequent-pattern tree approach. Data mining andknowledge discovery, 8(1):53–87, 2004.

[33] M Regina Castro, Gyorgy Simon, Stephen S Cha,Barbara P Yawn, L Joseph Melton III, and Pedro JCaraballo. Statin use, diabetes incidence and overallmortality in normoglycemic and impaired fastingglucose patients. Journal of General InternalMedicine, pages 1–7.

Documents

Causal Inference in Observational Data - arXiv · techniquest to determine the e cacy of display advertise-ments. Even extending association rules mining to causal rule mining has