Interpretable Almost-Exact Matching for Causal Inference

Awa Dieng, Yameng Liu, Sudeepa Roy, Cynthia Rudin, Alexander Volfovsky∗
Duke University

Durham, NC 27708
{awa.dieng, alexander.volfovsky}@duke.edu, {ymliu, sudeepa, cynthia}@cs.duke.edu

Abstract

Matching methods are heavily used in the social and health sciences due to their interpretability. We aim to create the highest possible quality of treatment-control matches for categorical data in the potential outcomes framework. The method proposed in this work aims to match units on a weighted Hamming distance, taking into account the relative importance of the covariates; the algorithm aims to match units on as many relevant variables as possible. To do this, the algorithm creates a hierarchy of covariate combinations on which to match (similar to downward closure), in the process solving an optimization problem for each unit in order to construct the optimal matches. The algorithm uses a single dynamic program to solve all of the units’ optimization problems simultaneously. Notable advantages of our method over existing matching procedures are its high-quality interpretable matches, versatility in handling different data distributions that may have irrelevant variables, and ability to handle missing data by matching on as many available covariates as possible.

1 INTRODUCTION

In observational causal inference where the scientist does not control the randomization of individuals into treatment, an ideal approach matches each treatment unit to a control unit with identical covariates. However, in high dimensions, few such “identical twins” exist, since it becomes unlikely that any two units have identical covariates in high dimensions. In that case, how might we construct a match assignment that would lead to accurate estimates of conditional average treatment effects (CATEs)?

∗ Equal contribution from all authors.

Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS) 2019, Naha, Okinawa, Japan. PMLR: Volume 89. Copyright 2019 by the author(s).

For categorical variables, we might choose a Hamming distance to measure similarity between covariates. Then, the goal is to find control units that are similar to the treatment units on as many covariates as possible. However, the fact that not all covariates are equally important has serious implications for CATE estimation. Matching methods generally suffer in the presence of many irrelevant covariates (covariates that are not related to either treatment or outcome): the irrelevant variables would dominate the Hamming distance calculation, so that the treatment units would mainly be matched to the control units on the irrelevant variables. This means that matching methods do not always pass an important sanity check in that irrelevant variables should be irrelevant. To handle this issue with irrelevant covariates, in this work we choose to match units based on a weighted Hamming distance, where the weights can be learned from machine learning on a hold-out training set. These weights act like variable importance measures for defining the Hamming distance.

The choice to optimize matches using Hamming distance leads to a serious computational challenge: how does one compute optimal matches on Hamming distance? In this work, we define a matched group for a given unit as the solution to a constrained discrete optimization problem, which is to find the weighted Hamming distance of each treatment unit to the nearest control unit (and vice versa). There is one such optimization problem for each unit, and we solve all of these optimization problems efficiently with a single dynamic program. Our dynamic programming algorithm has the same basic monotonicity property (downwards closure) as that of the apriori algorithm (Agrawal and Srikant, 1994) used in data mining for finding frequent itemsets.


However, frequency of itemsets is irrelevant here; instead the goal is to find a largest (weighted) set of covariates that both a treatment and control unit have in common. The algorithm, Dynamic Almost Matching Exactly – DAME – is efficient, owing to the use of bit-vector computations to match units in groups, and does not require an integer programming solver.

A more general version of our formulation (Full Almost Matching Exactly) adaptively chooses the features for matching in a data-driven way. Instead of using a fixed weighted Hamming distance, it uses the hold-out training set to determine how useful a set of variables is for prediction out of sample. For each treatment unit, it finds a set of variables that (i) allows a match to at least one control unit; (ii) together have the best out-of-sample prediction ability among all subsets of variables for which a match can be created (to at least one control unit). Again, even though for each unit we are searching for the best subset of variables, we can solve all of these optimization problems at once with our single dynamic program.

2 RELATED WORK

As mentioned earlier, exact matching is not possible in high dimensions, as “identical twins” in treatment and control samples are not likely to exist. Early on, this led to techniques that reduce dimension using propensity score matching (Rubin, 1973b,a, 1976; Cochran and Rubin, 1973), which extend to penalized regression approaches (Schneeweiss et al., 2009; Rassen and Schneeweiss, 2012; Belloni et al., 2014; Farrell, 2015). Propensity score matching methods project the entire dataset to one dimension and thus cannot be used for estimating CATE (conditional average treatment effect), since units within the matched groups often differ on important covariates. In “optimal matching” (Rosenbaum, 2016), an optimization problem is formed to choose matches according to a pre-defined distance measure, though as discussed above, this distance measure can be dominated by irrelevant covariates, leading to poor matched groups and biased estimates. Coarsened exact matching (Iacus et al., 2012, 2011) has the same problem, since again, the distance metric is pre-defined, rather than learned. Recent integer-programming-based methods consider extreme matches for all possible reasonable distance metrics, but this is computationally expensive and relies on manual effort to create the ranges (Morucci et al., 2018; Noor-E-Alam and Rudin, 2015); in contrast we use machine learning to create a single good match assignment.

In the framework of almost-exact matching (Wang et al., 2017), each matched group contains units that are close on covariates that are important for predicting outcomes. For example, Coarsened Exact Matching (Iacus et al., 2012, 2011) is almost-exact if one were to use an oracle (should one ever become available) that bins covariates according to importance for estimating causal effects. DAME’s predecessor, the FLAME algorithm (Wang et al., 2017), is an almost-exact matching method that adapts the distance metric to the data using machine learning. It starts by matching “identical twins,” and proceeds by eliminating less important covariates one by one, attempting to match individuals on the largest set of covariates that produce valid matched groups. FLAME can handle huge datasets, even datasets that are too large to fit in memory, and scales well with the number of covariates, but removing covariates in exactly one order (rather than all possible orders as in DAME) means that many high-quality matches will be missed.

DAME tends to match on more covariates than FLAME; the distances between matched units are smaller in DAME than in FLAME, thus its matches are distinctly higher quality. This has implications for missing data, where DAME can find matched groups that FLAME cannot.

3 ALMOST MATCHING EXACTLY (AME) FRAMEWORK

Consider a dataframe D = [X, Y, T] where X ∈ {0, 1, . . . , k}^(n×p), Y ∈ R^n, and T ∈ {0, 1}^n respectively denote the categorical covariates for all units, the outcome vector, and the treatment indicator (1 for treated, 0 for control). The j-th covariate of unit i is denoted x_ij ∈ {0, 1, . . . , k}. Notation x_i ∈ {0, 1, . . . , k}^p indicates covariates for the i-th unit, and T_i ∈ {0, 1} is an indicator for whether or not unit i is treated.
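As an illustrative sketch of this layout (sizes, distributions, and seed are placeholders, not from the paper):

```python
# Illustrative data layout only (all values here are placeholders).
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 1000, 15, 2
X = rng.integers(0, k + 1, size=(n, p))  # categorical covariates in {0,...,k}
T = rng.integers(0, 2, size=n)           # treatment indicator (1 treated, 0 control)
Y = rng.normal(size=n)                   # outcome vector (placeholder values)
```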

Throughout we make SUTVA and ignorability assumptions (Rubin, 1980). The goal is to match treatment and control units on as many relevant covariates as possible. Relevance of covariate j is denoted by w_j ≥ 0 and it is determined using a hold-out training set. The w_j’s can either be fixed beforehand or adjusted dynamically inside the algorithm (see Full-AME in Section 5).

For now, assuming that we have a fixed nonnegative weight w_j for each covariate j, we would like to find a match for each treatment unit t that matches at least one control unit on as many relevant covariates as possible. Thus we consider the following problem:

Almost Matching Exactly with Fixed Weights (AME): For each treatment unit t,

θ*_t ∈ argmax_{θ ∈ {0,1}^p} θᵀw   such that   ∃ ℓ with T_ℓ = 0 and x_ℓ ∘ θ = x_t ∘ θ,

where ∘ denotes the Hadamard product.


The solution to the AME problem is an indicator of the optimal set of covariates for the matched group of treatment unit t. The constraint says that the optimal matched group contains at least one control unit. When the solution of the AME problem is the same for multiple treatment units, they form a single matched group. For treatment unit t, the main matched group for t contains all units ℓ so that x_t ∘ θ*_t = x_ℓ ∘ θ*_t. If any unit ℓ (either control or treatment) within t’s main matched group has its own different main matched group, then t’s matched group is an auxiliary matched group for ℓ. In this case, ℓ could have been matched to other units on more covariates than it was matched to t. Estimation of CATE for a unit should always be done on the main matched group for that unit.

The formulation of the AME and main matched group is symmetric for control units. There are two straightforward (but inefficient) approaches to solving the AME problem for all units.

AME Solution 1 (quadratic in n, linear in p): Brute force pairwise comparison of treatment points to control points. (Detailed in the appendix.)

AME Solution 2 (order n log n, exponential in p): Brute force iteration over all 2^p subsets of the p covariates. (Detailed in the appendix.)
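For concreteness, here is a minimal sketch of Solution 2’s enumeration (the function name and interface are ours, not the paper’s); the monotonicity property discussed next is what makes this enumeration practical:

```python
# A minimal sketch of AME Solution 2: iterate over covariate subsets in
# decreasing order of weighted sum; the first feasible subset found for a
# treated unit is its AME solution. Interface and names are illustrative.
from itertools import combinations

def brute_force_ame(X, T, w):
    """X: n x p list-of-lists of categorical values, T: length-n 0/1 list,
    w: length-p nonnegative weights. Returns {treated index: best subset}."""
    n, p = len(X), len(w)
    subsets = [c for k in range(p, 0, -1) for c in combinations(range(p), k)]
    subsets.sort(key=lambda c: -sum(w[j] for j in c))  # descending theta^T w
    best = {}
    for cols in subsets:
        groups = {}
        for i in range(n):
            groups.setdefault(tuple(X[i][j] for j in cols), []).append(i)
        for members in groups.values():
            if any(T[i] == 0 for i in members):        # feasible: has a control
                for i in members:
                    if T[i] == 1 and i not in best:    # first match is optimal
                        best[i] = cols
    return best
```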

If n is in the millions, the first solution, or any simple variation of it, is practically infeasible. A straightforward implementation of the second solution is also inefficient. However, a monotonicity property (downward closure) allows us to prune the search space so that the second solution can be modified to be completely practical. The DAME algorithm does not enumerate all θ’s; monotonicity reduces the number of θ’s it considers.

Proposition 3.1. (Monotonicity of θ* in AME solutions) Fix treatment unit t. Consider feasible θ, meaning ∃ ℓ with T_ℓ = 0 and x_ℓ ∘ θ = x_t ∘ θ. Then,

• Any feasible θ′ such that θ′ < θ elementwise will have θ′ᵀw ≤ θᵀw.
• Consequently, consider feasible vectors θ and θ′. Define θ̃ as the elementwise min(θ, θ′). Then θ̃ᵀw < θᵀw, and θ̃ᵀw < θ′ᵀw.

These follow from the fact that the elements of θ are binary and the elements of w are non-negative. The first property means that if we have found a feasible θ, we do not need to consider any θ′ with fewer 1’s as a possible solution of the AME for unit t. Thus, the DAME algorithm starts from θ being all 1’s (consider all covariates). It systematically drops one element of θ to zero at a time, then two, then three, ordered according to values of θᵀw. The second property implies that we must evaluate both θ and θ′ as possible AME solutions before evaluating θ̃. Conversely, a new subset of variables defined by θ̃ cannot be considered unless all of its supersets have been considered. These two properties form the basis of the DAME algorithm.

The algorithm must be stopped early to avoid creating low quality matches. A useful stopping criterion is if the weighted sum of covariates θᵀw used for matching becomes too low (perhaps lower than a prespecified percentage of the total sum of weights ‖w‖₁).

Note that matching does not produce estimates, it produces a partition of the covariate space, based on which we can estimate CATEs. Within each main matched group, we use the difference of the average outcome of the treated units and the average outcome of the control units as an estimate of the CATE value, given the covariate values for that group. Smoothing the CATE estimates could be useful after matching.
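A minimal sketch of this within-group estimate (the helper name is ours, assuming the group has at least one unit from each arm):

```python
# Sketch: CATE estimate for one main matched group as the difference of
# average outcomes, treated minus control.
import numpy as np

def group_cate(Y, T, members):
    """Y, T: length-n arrays; members: indices of one main matched group
    containing at least one treated and one control unit."""
    y, t = np.asarray(Y)[members], np.asarray(T)[members]
    return y[t == 1].mean() - y[t == 0].mean()
```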

4 DYNAMIC ALMOST MATCHING EXACTLY (DAME)

We call a covariate-set any set of covariates. We denote by J the original set of all covariates from the input dataset, where p = |J|. When we drop a set of covariates s, it means we will match on J ∖ s. For any covariate-set s, we associate an indicator-vector θ_s ∈ {0, 1}^p defined as follows:

θ_{s,j} = 𝟙[j ∉ s]   ∀ j ∈ {1, ..., p}   (1)

that is, the value is 1 if the covariate is not in s, implying that it is being used for matching.
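In code, a sketch of Eq. (1) (function name is ours; covariates are 0-indexed here):

```python
# Indicator-vector theta_s of a dropped covariate-set s; 1 means the
# covariate is used for matching, per Eq. (1).
import numpy as np

def indicator_vector(s, p):
    theta = np.ones(p, dtype=int)
    theta[list(s)] = 0  # covariates in s are dropped
    return theta
```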

Algorithm 1 gives the pseudocode of the DAME algorithm. It uses the monotonicity property stated in Proposition 3.1 and ideas from the apriori algorithm for association rule mining (Agrawal and Srikant, 1994). Instead of looping over all possible 2^p vectors to solve the AME, it considers a covariate-set s for being dropped only if it satisfies the monotonicity property of Proposition 3.1. For example, if {1} has been considered for being dropped to form matched groups, it would not process {1,2,3} next because the monotonicity property requires {1,2}, {1,3}, and {2,3} to have been considered previously for being dropped.

The DAME algorithm uses the GroupedMR (Grouped Matching with Replacement) subroutine given in Algorithm 2 to form all valid main matched groups having at least one treated and one control unit. GroupedMR takes a given subset of covariates and finds all subsets of treatment and control units that have identical values of those covariates. We use an efficient implementation of the group-by operation in the algorithm from Wang et al. (2017) that uses bit-vectors.


Algorithm 1: The DAME algorithm
Input: Data D, pre-computed weight vector w for all covariates (from machine learning)
Output: {D^m(h), MG(h)}_{h≥1}, all matched units and all the matched groups from all iterations h
Notation: h: iterations; D(h) (resp. D^m(h)) = unmatched (resp. matched) units at the end of iteration h; MG(h) = matched groups at the end of iteration h; Λ(h) = set of active covariate-sets at the end of iteration h that are eligible to be dropped to form matched groups; Δ(h) = set of covariate-sets at the end of iteration h that have been processed (i.e., have been considered to be dropped and for formation of matched groups).
Initialize: D(0) = D, D^m(0) = ∅, MG(0) = ∅, Λ(0) = {{1}, ..., {p}}, Δ(0) = ∅, h = 1
while there is at least one treatment unit to match in D(h−1) do
    (find the ‘best’ covariate-set to drop from the set of active covariate-sets)
    Let s*(h) ∈ argmax_{s ∈ Λ(h−1)} θ_sᵀw (θ_s ∈ {0,1}^p denotes the indicator-vector of s as in (1))
    if early stopping condition is met then exit while loop
    (D^m(h), MG(h)) = GroupedMR(D, D(h−1), J ∖ s*(h)) (find matched units and main groups)
    Z(h) = GenerateNewActiveSets(Δ(h−1), s*(h)) (generate new active covariate-sets)
    Λ(h) = Λ(h−1) ∖ {s*(h)} (remove s*(h) from the set of active sets)
    Λ(h) = Λ(h) ∪ Z(h) (update the set of active sets)
    Δ(h) = Δ(h−1) ∪ {s*(h)} (update the set of already processed covariate-sets)
    D(h) = D(h−1) ∖ D^m(h) (remove matched units)
    h = h + 1
return {D^m(h), MG(h)}_{h≥1}

To keep track of main matched groups, GroupedMR takes the entire set of units D as well as the set of unmatched units from the previous iteration D(h−1) as input, along with the covariate-set J ∖ s*(h) to match on in this iteration. Instead of matching only the unmatched units in D(h−1) using the group-by procedure, it matches all units in D to allow for matching with replacement as in the AME objective. It keeps track of the main matched groups for the unmatched units D(h−1).

DAME keeps track of two sets of covariate-sets: (1) The set of processed sets Δ contains the covariate-sets whose main matched groups (if any exist) have already been formed. That is, Δ contains s if matches have been constructed on J ∖ s by calling the GroupedMR procedure. (2) The set of active sets Λ contains the covariate-sets s that are eligible to be dropped according to Proposition 3.1. For any iteration h, Λ(h) ∩ Δ(h) = ∅, i.e., the sets are disjoint, where Λ(h), Δ(h) denote the states of Λ, Δ at the end of iteration h.

Algorithm 2: Procedure GroupedMR
Input: Data D, unmatched data D^um ⊆ D = (X, Y, T), subset of indexes of covariates J^s ⊆ {1, ..., p}
Output: Newly matched units D^m using covariates indexed by J^s where groups have at least one treated and one control unit, and main matched groups for D^m
M_raw = group-by(D, J^s) (form groups on D by exact matching on J^s)
M = prune(M_raw) (remove groups without at least one treatment and one control unit)
D^m = subset of D^um where the covariates match with some group in M (find newly matched units and their main matched groups)
return {D^m, M} (newly matched units and main matched groups)
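A sketch of GroupedMR under assumed data structures (pandas DataFrames with categorical covariate columns and a 'T' treatment column); the bit-vector group-by of Wang et al. (2017) is replaced here by a plain hash-based group-by:

```python
# Hash-based stand-in for the bit-vector group-by; D is all units,
# D_um the still-unmatched ones, covs the covariates to match on.
import pandas as pd

def grouped_mr(D, D_um, covs):
    covs = list(covs)
    grouped = D.groupby(covs)
    # prune: keep only groups with at least one treated and one control unit
    M = grouped.filter(lambda g: g['T'].nunique() == 2)
    keys = set(map(tuple, M[covs].itertuples(index=False)))
    mask = [tuple(r) in keys for r in D_um[covs].itertuples(index=False)]
    D_m = D_um[mask]  # newly matched units among the unmatched ones
    return D_m, M
```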

Due to the monotonicity property stated in Proposition 3.1, if s ∈ Λ(h), then each proper subset r ⊂ s belonged to Λ(h′) in an earlier iteration h′ < h. Once an active set s ∈ Λ(h−1) is chosen as the optimal subset to drop (i.e., s is s*(h) in iteration h), s is excluded from Λ(h) (it is no longer active) and is included in Δ(h) as a processed set. In that sense, the active sets are generated and included in Λ(h) in a hierarchical manner similar to the apriori algorithm. A set s is included in Λ(h) only if all of its proper subsets of one less size r ⊂ s, |r| = |s| − 1, have been processed.

The procedure GenerateNewActiveSets gives an efficient implementation of generation of new active sets in each iteration of DAME, and takes the currently processed sets Δ = Δ(h−1) and a newly processed set s = s*(h) as input. Let |s| = k. In this procedure, Δ_k ⊆ Δ ∪ {s} denotes the set of all processed covariate-sets in Δ of size k, and also includes s. Inclusion of s in Δ_k may lead to generation of a new active set r of size k+1 only if all of r’s subsets of size k (one less) have been previously processed. The new active sets triggered by inclusion of s in Δ_k would be supersets r of s of size k+1 if all subsets s′ ⊂ r of size |s′| = k belong to Δ_k. To generate such candidate supersets r, we can append s with all covariates appearing in some covariate-set in Δ except those in s. However, this naive approach would iterate over many superfluous candidates for active sets. Instead, GenerateNewActiveSets safely prunes some such candidates that cannot be valid active sets using the support of each covariate e in Δ_k, which is the number of sets in Δ_k containing e. Indeed, for any covariate that is not frequent enough in Δ_k, the monotonicity property ensures that any covariate-set that contains that covariate cannot be active.


The following proposition shows that this pruning step does not eliminate any valid active set (proof is in the appendix):

Proposition 4.1. If for a superset r of a newly processed set s, where |s| = k and |r| = k + 1, all subsets s′ of r of size k have been processed (i.e., r is eligible to be active after s is processed), then r is included in the set Z returned by GenerateNewActiveSets.

The explicit verification step of whether all possible subsets of r of one less size belong to Δ_k is necessary, i.e., the above optimization only prunes some candidate sets that are guaranteed not to be active. For instance, consider s = {2,3}, k = 2, and Δ_2 = {{1,2},{1,3},{3,5},{5,6}} ∪ {{2,3}}. For the superset r = {2,3,5} of s, all of 2, 3, 5 have support ≥ 2 in Δ_2, but this r cannot become active yet, since the subset {2,5} of r does not belong to Δ_2.

Finally, the following theorem states the correctness of the DAME algorithm (proof is in the appendix).

Theorem 4.2. (Correctness) The DAME algorithm solves the AME problem.

Once the problem is solved, the main matched groups can be used to estimate treatment effects, by considering the difference in outcomes between treatment and control units in each group, and possibly smoothing the estimates from the matched groups to prevent overfitting of treatment effect estimates.

5 ALMOST MATCHING EXACTLY WITH ADAPTIVE WEIGHTS

We now generalize the AME framework so the weights are adjusted adaptively for each subset of variables. The weights are chosen using machine learning on a hold-out training set. Let us consider a trivial variation of the AME problem with fixed weights and then generalize it to handle adaptive weights.

Almost Matching Exactly with Fixed Weights, Revisited: We will use squared rewards w_j² this time. For a given treatment unit u with covariates x_u, compute the following, which is the maximum sum of rewards {w_j²}_{j=1,..,p} we can attain for a valid matched group (that contains at least one control unit):

θ*_u ∈ argmax_{θ ∈ {0,1}^p} θᵀ(w ∘ w)   s.t.   ∃ ℓ with T_ℓ = 0 and x_ℓ ∘ θ = x_u ∘ θ.   (2)

The solution to this is an indicator of the optimal set of covariates to match unit u on. For treatment unit u, again, the main matched group for u contains all units ℓ so that x_u ∘ θ*_u = x_ℓ ∘ θ*_u. Now we provide the (more general) adaptive version of AME.

Algorithm 3: Procedure GenerateNewActiveSets

1. Input: s, a newly dropped set of size k; Δ, the set of previously processed sets
2. Initialize: Z = ∅ (stores new active sets)
3. Δ_k = {δ ∈ Δ | size(δ) = k} ∪ {s} (compute all subsets of Δ of size k and also include s)
4. Γ = {α | α ∈ δ and δ ∈ Δ_k} (get all the covariates contained in sets in Δ_k)
5. S_e = support of covariate e in Δ_k
6. Ω = {α | α ∈ Γ and S_α ≥ k} ∖ s (get the covariates not in s that have enough support)
7. if {∀e ∈ s : S_e ≥ k} (if all covariates in s have enough support in Δ_k) then
8.     for all α ∈ Ω (generate new active set) do
9.         r = s ∪ {α}
10.        if all subsets s′ ⊂ r, |s′| = k, belong to Δ_k then
11.            add r to Z (add newly active set r to Z)
12. return Z

Example (follow line number correspondence)
1. s = {2,3}, k = 2, Δ = {{1},{2},{3},{5},{1,2},{1,3},{1,5}}
2. Z = ∅
3. Δ_2 = {{1,2},{1,3},{2,3},{1,5}}
4. Γ = {1,2,3,5}
5. S_1 = 3, S_2 = 2, S_3 = 2, S_5 = 1
6. Ω = {1,2,3} ∖ {2,3} = {1}
7. True: both 2 and 3 have support ≥ 2
8. α = 1 (only one value)
9. r = {2,3} ∪ {1} = {1,2,3}
10. True (subsets of r of size 2 are {1,2}, {1,3}, {2,3})
11. Z = {{1,2,3}}
12. return Z = {{1,2,3}}
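A sketch of this procedure in Python (step-number comments follow Algorithm 3; covariate-sets are represented as frozensets):

```python
# Sketch of GenerateNewActiveSets; Delta is the collection of previously
# processed covariate-sets, s the newly dropped set.
from collections import Counter

def generate_new_active_sets(s, Delta):
    k = len(s)
    Z = set()                                            # step 2
    Delta_k = {d for d in Delta if len(d) == k} | {s}    # step 3
    support = Counter(e for d in Delta_k for e in d)     # steps 4-5
    omega = {a for a in support if support[a] >= k} - s  # step 6
    if all(support[e] >= k for e in s):                  # step 7
        for a in omega:                                  # step 8
            r = s | {a}                                  # step 9
            if all(frozenset(r - {x}) in Delta_k for x in r):  # step 10
                Z.add(frozenset(r))                      # step 11
    return Z                                             # step 12

# Reproduces the worked example above:
# generate_new_active_sets(frozenset({2, 3}),
#     {frozenset(x) for x in [{1}, {2}, {3}, {5}, {1, 2}, {1, 3}, {1, 5}]})
# -> {frozenset({1, 2, 3})}
```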

Full Almost Matching Exactly (Full-AME): Denote θ ∈ {0,1}^p as an indicator vector for a subset of covariates to match on. Define the matched group for unit u with respect to covariates θ as the units that match u exactly on the covariates θ:

MG_θ(u) = {v : x_v ∘ θ = x_u ∘ θ}.

The usefulness of a set of covariates θ is now determined by how well they can be used together to make out-of-sample predictions. Specifically, the prediction error PE(θ) is defined with respect to a class of functions F as: PE_F(θ) = min_{f∈F} E(f(X ∘ θ, T) − Y)², where the expectation is taken over X, T and Y. Its empirical counterpart is defined with respect to a separate random sample from the distribution, used as a training set {x_i^tr, T_i^tr, y_i^tr}_{i∈training}, specifically:

PE_F(θ) = min_{f∈F} Σ_{i∈training} (f(x_i^tr ∘ θ, T_i^tr) − y_i^tr)².


The training set is only used to calculate prediction error, not for matching. Using this, the best prediction error we could hope to achieve for a nontrivial matched group containing treatment unit u uses the following covariates for matching:

θ*_u ∈ argmin_θ PE_F(θ)   s.t.   ∃ ℓ ∈ MG_θ(u) where T_ℓ = 0.

The main matched group for u is defined as MG_{θ*_u}(u). The goal of the Full-AME problem is to find the main matched group MG_{θ*_u}(u) for all units u.

The class of functions F can include nonlinear functions. We can use variable importance measures for prediction on PE_F, such as permutation importance (also called model reliance), to determine the variable’s weight. If F includes linear models, the weight w_j for feature j would be the absolute value of feature j’s coefficient.

The Full-AME problem reduces to the fixed-squared-weights version under specific conditions, such as when F is a single function f, which is linear with fixed linear weights (w, w_T) and f(x ∘ θ, T) = (w ∘ θ)ᵀ(x ∘ θ) + w_T T, where w is the ground-truth coefficient vector that generates y, and PE(θ) is determined by the sum of w_j² weights for covariates determined by the feature-selector vector θ. This reduction is discussed formally by Wang et al. (2017).

In order to solve Full-AME, a step is needed in Algorithm 1 at the top of the while loop that updates the weights for each covariate-set we could choose at that iteration. In particular, we let

s*(h) ∈ argmin_{s ∈ Λ(h−1)} PE(θ_s),

where Λ(h−1) is the active set of covariates, and the predictive error is computed over the training set with respect to a pre-specified class of models, F. In the implementation in this paper we consider linear functions fit separately on the treated and the control units in the training set using ridge regression (that is, we add a ridge penalty to Eq (5)).
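A sketch of this criterion under assumed interfaces (scikit-learn’s Ridge; the function name and alpha are ours):

```python
# Empirical prediction error of a covariate subset, with ridge regression
# fit separately on treated and control units of the hold-out training set.
import numpy as np
from sklearn.linear_model import Ridge

def prediction_error(theta, X_tr, T_tr, y_tr, alpha=1.0):
    """theta: 0/1 vector of length p; X_tr: (n, p); T_tr, y_tr: (n,)."""
    Xs = X_tr * theta                      # mask out dropped covariates
    pe = 0.0
    for arm in (0, 1):                     # fit the two arms separately
        idx = (T_tr == arm)
        model = Ridge(alpha=alpha).fit(Xs[idx], y_tr[idx])
        pe += np.sum((model.predict(Xs[idx]) - y_tr[idx]) ** 2)
    return pe
```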

5.1 Early Stopping of DAME

It is important that DAME be stopped early when the quality of matches produced falls. In dropping covariates, its prediction error PE_F should never increase too far above its original value using all the covariates. This ensures the quality of every matched group: the covariates θ*_u for every matched group thus obey PE_F(θ*_u) < min_θ PE_F(θ) + ε, where the choice of ε (perhaps 5%) determines stopping. As such, the while loop in Algorithm 1 should not only check whether there are more units to match, but also whether the predictive error has increased too much.
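As a sketch of this check (reading ε as a fraction of the baseline prediction error, one plausible interpretation of “perhaps 5%”; names are ours):

```python
# One plausible reading of the early-stopping rule: stop once the PE of the
# best remaining covariate-set exceeds the all-covariates baseline by eps.
def should_stop(pe_candidate, pe_baseline, eps=0.05):
    return pe_candidate > pe_baseline * (1.0 + eps)
```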

5.2 Hybrid FLAME-DAME

The DAME algorithm solves the Full-AME problem, whereas FLAME (Wang et al., 2017) approximates its solution. This is because FLAME uses backwards feature selection, whereas DAME calculates the solution without approximation. For problems with many features, we can use FLAME to remove the less relevant features, and then switch to DAME when we start to remove some of the more influential features. This hybrid algorithm scales substantially better, possibly without any noticeable loss in the quality of matches.

Matching-after-learning-to-stretch (MALTS) (Parikh et al., 2018) has been combined with FLAME and DAME to handle mixed real and categorical covariates.

5.3 Other Estimands

While CATEs are the most granular estimands, aggregate estimands such as the Average Treatment Effect (ATE) and the Average Treatment Effect on the Treated (ATT) may be of interest. Since DAME matches with replacement, standard techniques (e.g., frequency weights) should be used (Stuart, 2010; Abadie et al., 2004).
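As a hedged sketch (not the paper’s estimator), one standard aggregation weights each group’s CATE estimate by its number of treated units:

```python
# Aggregate per-group CATE estimates into an ATT, weighting each group by
# its count of treated units; one common choice under matching with
# replacement (see Stuart, 2010).
import numpy as np

def att_from_groups(group_cates, n_treated_per_group):
    w = np.asarray(n_treated_per_group, dtype=float)
    return float(np.average(np.asarray(group_cates), weights=w))
```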

6 SIMULATIONS

We present results under several data generating processes. We show that DAME produces higher quality matches than popular matching methods such as 1-PSNNM (propensity score nearest neighbor matching) and Mahalanobis distance nearest neighbor matching, and better treatment effect estimates than black box machine learning methods such as Causal Forest (which is not a matching method, and is not interpretable). The ‘MatchIt’ R-package (Ho et al., 2011) was used to perform 1-PSNNM and Mahalanobis distance nearest neighbor matching (‘Mahalanobis’). For Causal Forest, we used the ‘grf’ R-package (Athey et al., 2019). DAME also improves over FLAME (Wang et al., 2017) with regards to the quality of matches. Other matching methods (optmatch, cardinality match) do not scale to large problems and thus needed to be omitted.

Throughout this section, the outcome is generated with y = Σ_i α_i x_i + T Σ_i β_i x_i + T · U Σ_{i,γ: γ>i} x_i x_γ, where T ∈ {0,1} is the binary treatment indicator. This generation process includes a baseline linear effect, linear treatment effect, and quadratic (nonlinear) treatment effect. We vary the distribution of covariates, coefficients (α’s, β’s, U), and the fraction of treated units. We report conditional average treatment effects on the treated.
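A sketch of this data generating process (coefficient names follow the text; the helper itself is ours):

```python
# Outcome generation: baseline linear effect + linear treatment effect +
# quadratic treatment effect over pairs of covariates.
import numpy as np

def generate_outcomes(X, T, alpha, beta, U):
    baseline = X @ alpha                 # sum_i alpha_i x_i
    linear_te = T * (X @ beta)           # T * sum_i beta_i x_i
    # sum over pairs i < gamma of x_i x_gamma, via ((sum x)^2 - sum x^2)/2
    pair_sum = (X.sum(axis=1) ** 2 - (X ** 2).sum(axis=1)) / 2.0
    return baseline + linear_te + T * U * pair_sum
```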


6.1 Presence of Irrelevant Covariates

A basic sanity check for matching algorithms is how sensitive they are to irrelevant covariates. To that end, we run experiments with a majority of the covariates being irrelevant to the outcome. For important covariates 1 ≤ i ≤ 5, let α_i ∼ N(10s, 1) with s ∼ Uniform{−1,1}, β_i ∼ N(1.5, 0.15), x_i ∼ Bernoulli(0.5). For unimportant covariates 5 < i ≤ 15, x_i ∼ Bernoulli(0.1) in the control group and x_i ∼ Bernoulli(0.9) in the treatment group, so there is little overlap between treatment and control distributions. This simulation generates 15000 control units, 15000 treatment units, 5 important covariates and 10 irrelevant covariates. Results: In Figure 1, DAME (even with early stopping) runs to the end and matches on all units because the stopping criterion is never met. In this figure, DAME finds all high-quality matches even after important covariates are dropped. In contrast, FLAME achieves the optimal result before dropping any important covariates and generates some poor matches after dropping important covariates. However, even FLAME’s worst case scenario is better than the comparative methods, all of which perform poorly in the presence of irrelevant covariates. Causal Forest is especially ill suited for this case.

6.2 Exponentially Decaying Covariates

An advantage of DAME over FLAME is that it produces more high quality matches before resorting to lower quality matches. To test this, we considered covariates of decaying importance, letting the α parameters decrease exponentially as α_i = 64 × (1/2)^i. We evaluated performance when ≈ 30% and 50% of units were matched. Results: As Figure 2 shows, DAME matches on more covariates, yielding better estimates than FLAME.

6.3 Imbalanced Data

Imbalance is common in observational studies: there are often substantially more control than treatment units. The data for this experiment has covariates with decreasing importance. A fixed batch of 2000 treatment and 40000 control units were generated. We sampled from the controls to construct different imbalance ratios: 40000 in the most imbalanced case (Ratio 1), then 20000 (Ratio 2), and 10000 (Ratio 3). Results: Table 1 reveals that FLAME and DAME outperform the nearest neighbor matching methods. DAME is distinctly better than FLAME. Additionally, DAME has an average of 4 covariates not matched on, with ≈ 84% of units matched on all but 2 covariates, whereas FLAME averages 7 covariates not matched on and only ≈ 25% of units matched on all but 2 covariates. Detailed results are in the longer version (Liu et al., 2018).

Table 1: MSE for different imbalance ratios

                 Mean Squared Error (MSE)
                 Ratio 1   Ratio 2   Ratio 3
DAME                0.47      0.83      1.39
FLAME               0.52      0.88      1.55
Mahalanobis        26.04     48.65     64.80
1-PSNNM           246.08    304.06    278.87

6.4 Run Time Evaluation

We compare the run time of DAME with a brute force solution (AME Solution 1 described in Section 3). All experiments were run on an Ubuntu 16.04.01 system with an Intel Core i7 processor (Cores: 8, Speed: 3.6 GHz) and 8 GB RAM. Results: As shown in Figure 3, FLAME provides the best run-time performance because it incrementally reduces the number of covariates, rather than solving Full-AME. On the other hand, as shown in the previous simulations, DAME produces high quality matches that the other methods do not. It solves the AME much faster than brute force. The run time for DAME could be further optimized through simple parallelization of the checking of active sets.

6.5 Missing Data

Missing data problems are complicated in matching. Normally one would impute missing values, but matches become less interpretable when matching on imputed values. If we match only on the raw values, DAME has an advantage over FLAME because it can simply match on as many non-missing relevant covariates as possible. When data are imputed, DAME still maintains an advantage over FLAME, possibly because it can match on more raw covariate values and fewer imputed values. Details of the experiments are in the longer version (Liu et al., 2018).

7 BREAKING THE CYCLE OF DRUGS AND CRIME

Breaking The Cycle (BTC) (Harrell et al., 2006) is a social program conducted in several U.S. states designed to reduce criminal involvement and substance abuse among current offenders. We study the effect of participating in the program on reducing non-drug future arrest rates. The details of the data and our results are in Appendix D. We compared CATE predictions of DAME and FLAME to double check the performance of a black box support vector machine (SVM) approach that predicts positive, neutral, or negative treatment effect for each individual. The result is that DAME and the SVM approach agreed on most of the exactly matched units.


Figure 1: Estimated CATT vs. True CATT (Conditional Average Treatment Effect on the Treated). DAME and FLAME perfectly estimate the CATTs before dropping important covariates. DAME matches all units without dropping important covariates, but FLAME needs to stop early in order to avoid poor matches. All other methods are sensitive to irrelevant covariates and give poor estimates. The two numbers on each plot are the number of matched units and MSE.

Figure 2: DAME makes higher quality matches early on. Rows correspond to stopping thresholds (top row 30%, bottom row 50%). DAME matches on more covariates than FLAME, yielding lower MSE from matched groups.

Figure 3: Run-time comparison between DAME, FLAME, and brute force. Left: varying number of units. Right: varying number of covariates.

All of the units for which exact matching predicted approximately zero treatment effect have a “neutral” treatment effect predicted label from the SVM. Most other predictions were similar between the two methods. There were only a few disagreements between the methods. Upon further investigation, we found that the differences are due to the fact that DAME is a matching method and not a modeling method; the estimates could be smoothed afterwards if desired to create a model. In particular, one of the two disagreeing predictions between the SVM and DAME has a positive treatment CATE prediction, but it was closer in Hamming distance to units predicted to have negative treatment effects. With smoothing, its predicted CATE may have also become negative.

8 CONCLUSIONS

DAME produces matches that are of high quality. Its estimates of individualized treatment effects are as good as the (black box) machine learning methods we have tried. Other methods can match individuals together whose covariates look nothing alike, whereas the matches from DAME are interpretable and meaningful, because they are almost exact; units are matched on covariates that together can be used to predict outcomes accurately. Code is publicly available at: https://github.com/almost-matching-exactly/DAME.

Acknowledgements:

This work was supported in part by NIH award 1R01EB025021-01, NSF awards IIS-1552538 and IIS-1703431, a DARPA award under the L2M program, and a Duke University Energy Initiative Energy Research Seed Fund (ERSF).


References

A. Abadie, D. Drukker, J. L. Herr, and G. W. Imbens. Implementing matching estimators for average treatment effects in Stata. The Stata Journal, 4(3):290–311, 2004.

R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB '94, pages 487–499, 1994.

S. Athey, J. Tibshirani, and S. Wager. Generalized random forests. The Annals of Statistics, 47(2):1148–1178, 2019.

A. Belloni, V. Chernozhukov, and C. Hansen. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650, 2014.

W. G. Cochran and D. B. Rubin. Controlling bias in observational studies: A review. Sankhya: The Indian Journal of Statistics, Series A, pages 417–446, 1973.

M. H. Farrell. Robust inference on average treatment effects with possibly more covariates than observations. Journal of Econometrics, 189(1):1–23, 2015.

S. Goh and C. Rudin. A minimax surrogate loss approach to conditional difference estimation. arXiv e-prints: arXiv:1803.03769, Mar. 2018.

A. V. Harrell, D. Marlowe, and J. Merrill. Breaking the cycle of drugs and crime in Birmingham, Alabama, Jacksonville, Florida, and Tacoma, Washington, 1997-2001. Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2006-03-30, 2006.

D. Ho, K. Imai, G. King, and E. Stuart. MatchIt: Nonparametric preprocessing for parametric causal inference. Journal of Statistical Software, Articles, 42(8):1–28, 2011.

S. M. Iacus, G. King, and G. Porro. Multivariate matching methods that are monotonic imbalance bounding. Journal of the American Statistical Association, 106(493):345–361, 2011.

S. M. Iacus, G. King, and G. Porro. Causal inference without balance checking: Coarsened exact matching. Political Analysis, 20:1–24, 2012.

Y. Liu, A. Dieng, S. Roy, C. Rudin, and A. Volfovsky. Interpretable almost-exact matching for causal inference. arXiv e-prints: arXiv:1806.06802, Jun 2018.

M. Morucci, M. Noor-E-Alam, and C. Rudin. Hypothesis tests that are robust to choice of matching method. arXiv e-prints: arXiv:1812.02227, Dec. 2018.

M. Noor-E-Alam and C. Rudin. Robust nonparametric testing for causal inference in observational studies. Optimization Online, Dec. 2015.

H. Parikh, C. Rudin, and A. Volfovsky. MALTS: Matching After Learning to Stretch. arXiv e-prints: arXiv:1811.07415, Nov 2018.

J. A. Rassen and S. Schneeweiss. Using high-dimensional propensity scores to automate confounding control in a distributed medical product safety surveillance system. Pharmacoepidemiology and Drug Safety, 21(S1):41–49, 2012.

P. R. Rosenbaum. Imposing minimax and quantile constraints on optimal matching in observational studies. Journal of Computational and Graphical Statistics, 26(1), 2016.

D. B. Rubin. Matching to remove bias in observational studies. Biometrics, 29(1):159–183, Mar. 1973a.

D. B. Rubin. The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29(1):185–203, Mar. 1973b.

D. B. Rubin. Multivariate matching methods that are equal percent bias reducing, I: Some examples. Biometrics, 32(1):109–120, Mar. 1976.

D. B. Rubin. Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371):591–593, 1980.

S. Schneeweiss, J. A. Rassen, R. J. Glynn, J. Avorn, H. Mogun, and M. A. Brookhart. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology (Cambridge, Mass.), 20(4):512, 2009.

E. A. Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science: a Review Journal of the Institute of Mathematical Statistics, 25(1):1, 2010.

T. Wang, S. Roy, C. Rudin, and A. Volfovsky. FLAME: A fast large-scale almost matching exactly approach to causal inference. arXiv e-prints: arXiv:1707.06315, July 2017.