
Exploratory Causal Analysis in Bivariate Time Series Data

Abstract

Many scientific disciplines rely on observational data of systems for which it is difficult (or impossible) to implement controlled experiments, and data analysis techniques are required for identifying causal information and relationships directly from observational data. This need has led to the development of many different time series causality approaches and tools, including transfer entropy, convergent cross-mapping (CCM), and Granger causality statistics.

A practicing analyst can explore the literature to find many proposals for identifying drivers and causal connections in time series data sets, but little research exists on how these tools compare to each other in practice. This work introduces and defines exploratory causal analysis (ECA) to address this issue. The motivation is to provide a framework for exploring potential causal structures in time series data sets.

J. M. McCracken

Defense talk for PhD in Physics, Department of Physics and Astronomy

10:00 AM November 20, 2015; Exploratory Hall, 3301

Advisor: Dr. Robert Weigel; Committee: Dr. Paul So, Dr. Tim Sauer

J. M. McCracken (GMU) ECA w/ time series causality November 20, 2015 1 / 50

Exploratory Causal Analysis in Bivariate Time Series Data

J. M. McCracken
Department of Physics and Astronomy
George Mason University, Fairfax, VA

November 20, 2015

Outline

1. Motivation

2. Causality studies

3. Data causality

4. Exploratory causal analysis

5. Making an ECA summary
   • Transfer entropy difference
   • Granger causality statistic
   • Pairwise asymmetric inference
   • Weighted mean observed leaning
   • Lagged cross-correlation difference

6. Computational tools for the ECA summary

7. Empirical examples
   • Cooling/Heating System Data
   • Snowfall Data

8. Time series causality as data analysis

Motivation

Consider two sets of time series measurements, X and Y.

Question

Is there evidence that X “drives” Y?

We were looking for a data analysis approach, i.e., we were looking for analysis tools that

• worked with time series data,

• had straightforward, preferably well-established, interpretations,

• were reliable,

• and did not require studying the (vast) philosophical causality literature.

Essentially, we were looking for a “plug-and-play” analysis tool.

This work stems from our search for such a tool.


Causality studies

The study of causality is as old as science itself

• Modern historians credit Aristotle with both the first theory of causality (“four causes”) and an early version of the scientific method.

• The modern study of causality is broadly interdisciplinary, far too broad to review in a short talk.

Illari and Russo’s textbook¹ provides an overview of causality studies.

¹ Illari, P., & Russo, F. (2014). Causality: Philosophical theory meets scientific practice. Oxford University Press.

Towards a taxonomy of causal studies

Paul Holland identified four types of causal questions²:

• the ultimate meaningfulness of the notion of causality

• the details of causal mechanisms

• the causes of a given effect

• the effects of a given cause

Foundational causality: “Is a cause required to precede an effect?” or “How are causes and effects related in space-time?”

Data causality: “Does smoking cause lung cancer?” or “Are traffic accidents caused by rain storms?”

² Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81(396), 945-960.

Data causality

Data causality is data analysis to draw causal inferences

Approaches to data causality studies include

• design of experiments (e.g., Fisher randomization)

• potential outcomes (Rubin’s counterfactuals)

• directed acyclic graphs (DAGs) with structural equation models (SEMs); popularized by Pearl as “structural causal models (SCMs)”

• time series causality

There is no consensus on the best approach to data causality.

Many authors consider their favored approach to be the exclusive correct approach.

Time series causality

Time series causality is data causality with time series data.

Approaches to time series causality can be roughly divided into five categories:

• Granger (model-based approaches)

• Information-theoretic

• State space reconstruction (SSR)

• Correlation

• Penchant

Exploratory causal analysis: Language

Exploring causal structures in data sets is distinct from confirming causal structures in data sets.

Causal language used in ECA should not be conflated with other typical uses; i.e., “cause”, “effect”, “drive”, etc. are used as technical terms with definitions unrelated to their common, everyday definitions.

→ and ← will be used as shorthand for causal statements; e.g., “A drives B” will be written as A → B.

Exploratory causal analysis: Assumptions

A cause always precedes an effect.

This assumption is required for the operational definitions of causality.

A driver may be present in the data being analyzed.

This assumption may lead to issues of confounding.

Exploratory causal analysis: ECA summary vector approach

We will not favor a specific operational definition of causality ⇒ we do not favor any particular tool.

Consider a time series pair (X, Y).

ECA summary vector

Define a vector $\vec{g}$ where each element $g_i$ is defined as either 0 if X → Y, 1 if X ← Y, or 2 if no causal inference can be made. The value of each $g_i$ comes from a specific time series causality tool.

ECA summary

The ECA summary is either X → Y, Y → X, or undefined, with $g_i = 0 \;\forall g_i \in \vec{g} \Rightarrow \mathbf{X} \to \mathbf{Y}$ and $g_i = 1 \;\forall g_i \in \vec{g} \Rightarrow \mathbf{Y} \to \mathbf{X}$.
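The rule above can be sketched in a few lines of Python. This is only an illustration: the function name `eca_summary`, the constant names, and the string labels are assumptions made for the example and do not come from the talk's code.

```python
from typing import Sequence

# Encoding of each element g_i, as defined for the ECA summary vector.
X_DRIVES_Y, Y_DRIVES_X, NO_INFERENCE = 0, 1, 2


def eca_summary(g: Sequence[int]) -> str:
    """Return 'X -> Y', 'Y -> X', or 'undefined' for an ECA summary vector g."""
    if g and all(gi == X_DRIVES_Y for gi in g):
        return "X -> Y"
    if g and all(gi == Y_DRIVES_X for gi in g):
        return "Y -> X"
    # Any disagreement between tools, or any g_i = 2, leaves the summary undefined.
    return "undefined"
```

The summary is deliberately strict: a single disagreeing tool, or a single tool that makes no inference, is enough to make the summary undefined.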


Making an ECA summary

Our focus is on time series, so each causal inference $g_i \in \vec{g}$ will be drawn from a tool in each of the five time series causality categories.

• $g_1$: transfer entropy difference (information-theoretic)

• $g_2$: Granger log-likelihood statistics (Granger)

• $g_3$: pairwise asymmetric inference (PAI) (SSR)

• $g_4$: average weighted mean observed leaning (penchant)

• $g_5$: lagged cross-correlation difference (correlation)

Transfer entropy (g1): Shannon entropy

The uncertainty that a random variable X takes some specific value $X_n$ is given by the Shannon (or information) entropy,

$$H_X = -\sum_{n=1}^{N_X} P(X = X_n) \log_2 P(X = X_n)$$

• $P(X = X_n)$ is the probability that X takes the specific value $X_n$.

• The sum is over all possible values of $X_n$; $n = 1, 2, \ldots, N_X$.

• The base of the logarithm sets the entropy units, which is “bits” here.

Transfer entropy (g1): Shannon entropy example

Binary example (to help with intuition)

Consider a coin C that takes the value H with probability $p_H$ and T with probability $p_T$. The Shannon entropy is

$$H_C = -\left(p_H \log_2 p_H + p_T \log_2 p_T\right)$$

• Completely uncertain of outcome: fair coin ⇒ $p_H = p_T = 0.5 \Rightarrow H_C = 1$

• Completely certain of outcome: always heads (or tails) ⇒ $p_{H(T)} = 1,\ p_{T(H)} = 0 \Rightarrow H_C = 0$

(Entropy calculations almost always assume $0 \log_2 0 := 0$.)
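The coin example can be checked numerically with a minimal sketch. The function name `shannon_entropy` is an assumption made for this illustration; the probabilities are passed in directly rather than estimated from data.

```python
from math import log2


def shannon_entropy(probabilities):
    """Shannon entropy in bits, using the convention 0 * log2(0) := 0."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


fair_coin = shannon_entropy([0.5, 0.5])    # completely uncertain outcome
two_headed = shannon_entropy([1.0, 0.0])   # completely certain outcome
```

Skipping zero-probability terms is how the sketch implements the $0 \log_2 0 := 0$ convention.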

Transfer entropy (g1): Mutual information

A pair of random variables (X, Y) have some mutual information given by

$$I_{X;Y} = H_X + H_Y - H_{X,Y} = \sum_{n=1}^{N_X} \sum_{m=1}^{N_Y} P(X = X_n, Y = Y_m) \log_2 \frac{P(X = X_n, Y = Y_m)}{P(X = X_n) P(Y = Y_m)}$$

• $P(X = X_n, Y = Y_m)$ is the probability that X takes the specific value $X_n$ and Y takes the specific value $Y_m$.

• If X and Y are independent, then $P(X = X_n, Y = Y_m) = P(X = X_n) P(Y = Y_m) \Rightarrow I_{X;Y} = 0$.

• The mutual information is symmetric; i.e., $I_{X;Y} = I_{Y;X}$.

Schreiber proposed an extension of the mutual information to measure “information flow” by making it conditional and including assumptions about the temporal behavior of X and Y.
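As a small illustration of the mutual information formula, the sketch below evaluates it from a joint probability table. The function name `mutual_information` and the table layout (`joint[n, m]` holding $P(X = X_n, Y = Y_m)$) are assumptions made for this example.

```python
import numpy as np


def mutual_information(joint):
    """I_{X;Y} in bits from a table joint[n, m] = P(X = X_n, Y = Y_m)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)   # marginal P(X = X_n), shape (N_X, 1)
    p_y = joint.sum(axis=0, keepdims=True)   # marginal P(Y = Y_m), shape (1, N_Y)
    product = p_x @ p_y                      # P(X = X_n) P(Y = Y_m) for every (n, m)
    nonzero = joint > 0                      # convention: 0 * log2(0) := 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / product[nonzero])))


independent = mutual_information([[0.25, 0.25], [0.25, 0.25]])
fully_dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```

Two independent fair coins share no information, while two coins that always match share one full bit.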

Transfer entropy (g1): Information flow

Suppose X and Y are both Markov processes. The directed flow of information from Y to X is given by the transfer entropy,

$$T_{Y \to X} = \sum_{n=1}^{N_X} \sum_{m=1}^{N_Y} p_{n+1,n,m} \log_2 \frac{p_{n+1|n,m}}{p_{n+1|n}}$$

• $p_{n+1,n,m} = P(X(t+1) = X_{n+1}, X(t) = X_n, Y(\tau) = Y_m)$

• $p_{n+1|n,m} = P(X(t+1) = X_{n+1} \mid X(t) = X_n, Y(\tau) = Y_m)$

• $p_{n+1|n} = P(X(t+1) = X_{n+1} \mid X(t) = X_n)$

There is no directed information flow from Y to X if X is conditionally independent of Y; i.e.,

$$p_{n+1|n,m} = p_{n+1|n} \Rightarrow T_{Y \to X} = 0$$

Operational causality (information-theoretic)

X causes Y if the directed information flow from X to Y is higher than the directed information flow from Y to X; i.e.,

$$T_{X \to Y} - T_{Y \to X} > 0 \Rightarrow \mathbf{X} \to \mathbf{Y}$$

$$T_{X \to Y} - T_{Y \to X} < 0 \Rightarrow \mathbf{Y} \to \mathbf{X}$$

$$T_{X \to Y} - T_{Y \to X} = 0 \Rightarrow \text{no causal inference}$$
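A rough sketch of how the transfer entropy might be estimated from already-discretized data follows. It makes several assumptions that are not taken from the talk: probabilities are plain frequency counts, the history length is one step, $Y(\tau)$ is taken at $\tau = t$, and the sum runs over every observed triple of values. It is not the estimator used by JIDT, and the function name `transfer_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target):
    """Plug-in estimate (bits) of T_{source -> target} with one-step histories."""
    n = len(target) - 1
    # Counts of (target_{t+1}, target_t, source_t), and of the marginals needed below.
    triples = Counter(zip(target[1:], target[:-1], source[:-1]))
    target_source = Counter(zip(target[:-1], source[:-1]))
    target_pairs = Counter(zip(target[1:], target[:-1]))
    target_now = Counter(target[:-1])

    total = 0.0
    for (nxt, now, src), count in triples.items():
        p_joint = count / n                                   # p_{n+1,n,m}
        p_given_both = count / target_source[(now, src)]      # p_{n+1|n,m}
        p_given_self = target_pairs[(nxt, now)] / target_now[now]  # p_{n+1|n}
        total += p_joint * log2(p_given_both / p_given_self)
    return total
```

With this helper, the operational definition amounts to comparing `transfer_entropy(x, y)` with `transfer_entropy(y, x)`. Frequency-count estimates are biased for short series, so small differences should be read with caution.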


Granger causality (g2): Granger’s axioms

Consider a discrete universe with two time series $\mathbf{X} = \{X_t \mid t = 1, \ldots, n\}$ and $\mathbf{Y} = \{Y_t \mid t = 1, \ldots, n\}$, where t = n is considered the present time. All knowledge available in the universe at all times t ≤ n is denoted as $\Omega_n$.

Axiom 1

The past and present may cause the future, but the future cannot cause the past.

Axiom 2

$\Omega_n$ contains no redundant information, so that if some variable Z is functionally related to one or more other variables, in a deterministic fashion, then Z should be excluded from $\Omega_n$.

Granger’s definition of causality

Given some set A, Y causes X if

$$P(X_{n+1} \in A \mid \Omega_n) \neq P(X_{n+1} \in A \mid \Omega_n - \mathbf{Y})$$

Granger’s original goal was to make this notion of causality “operational”.

Granger causality (g2): VAR models

Consider a time series pair (X, Y). Suppose there is a vector autoregressive (VAR) model that describes the pair,

$$\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \sum_{i=1}^{n} \begin{pmatrix} A^i_{11} & A^i_{12} \\ A^i_{21} & A^i_{22} \end{pmatrix} \begin{pmatrix} X_{t-i} \\ Y_{t-i} \end{pmatrix} + \begin{pmatrix} \varepsilon_{1,t} \\ \varepsilon_{2,t} \end{pmatrix}$$

The current time step t of X and Y is modeled as a sum of n past steps of X and Y, plus uncorrelated noise terms.

Granger causality (g2): Comparison of VAR models

Consider two different VAR models for the pair (X, Y),

$$\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \sum_{i=1}^{n} \begin{pmatrix} A_{xx,i} & A_{xy,i} \\ A_{yx,i} & A_{yy,i} \end{pmatrix} \begin{pmatrix} X_{t-i} \\ Y_{t-i} \end{pmatrix} + \begin{pmatrix} \varepsilon_{x,t} \\ \varepsilon_{y,t} \end{pmatrix}$$

and

$$\begin{pmatrix} X_t \\ Y_t \end{pmatrix} = \sum_{i=1}^{n} \begin{pmatrix} A'_{xx,i} & 0 \\ 0 & A'_{yy,i} \end{pmatrix} \begin{pmatrix} X_{t-i} \\ Y_{t-i} \end{pmatrix} + \begin{pmatrix} \varepsilon'_{x,t} \\ \varepsilon'_{y,t} \end{pmatrix}$$

The G-causality log-likelihood statistic is defined as

$$F_{Y \to X} = \ln \frac{|\Sigma'_{xx}|}{|\Sigma_{xx}|}$$

• $\Sigma'_{xx}$ is the covariance of the X model residuals given no dependence on Y.

• $\Sigma_{xx}$ is the covariance of the X model residuals given a possible dependence on Y.

Granger causality (g2): G-causality log-likelihood statistic

If both VAR models fit (or “forecast”) the data equally well, then there is no G-causality; i.e.,

$$|\Sigma'_{xx}| = |\Sigma_{xx}| \Rightarrow F_{Y \to X} = 0$$

Operational causality (Granger)

X causes Y if the X-dependent forecast of Y decreases the Y model residual covariance (as compared to the X-independent forecast) more than the Y-dependent forecast of X decreases the X model residual covariance (as compared to the Y-independent forecast); i.e.,

$$F_{X \to Y} - F_{Y \to X} > 0 \Rightarrow \mathbf{X} \to \mathbf{Y}$$

$$F_{X \to Y} - F_{Y \to X} < 0 \Rightarrow \mathbf{Y} \to \mathbf{X}$$

$$F_{X \to Y} - F_{Y \to X} = 0 \Rightarrow \text{no causal inference}$$
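For a pair of scalar series, the statistic can be sketched with ordinary least squares, since the residual covariance of each model equation reduces to a residual variance. This is an illustrative sketch under stated assumptions (mean-removed series, a fixed model order `n_lags`, no model-order selection) and is not the algorithm implemented in the MVGC toolbox; all function names here are hypothetical.

```python
import numpy as np


def _lagged(series, n_lags):
    """Design matrix whose columns are series[t-1], ..., series[t-n_lags]."""
    T = len(series)
    return np.column_stack([series[n_lags - i:T - i] for i in range(1, n_lags + 1)])


def _residual_variance(target, regressors):
    """Mean squared residual of a least-squares fit of target on regressors."""
    coeffs, *_ = np.linalg.lstsq(regressors, target, rcond=None)
    return np.mean((target - regressors @ coeffs) ** 2)


def granger_statistic(source, target, n_lags=1):
    """F_{source -> target} = ln(restricted residual variance / full residual variance)."""
    source = np.asarray(source, float) - np.mean(source)
    target = np.asarray(target, float) - np.mean(target)
    present = target[n_lags:]
    own_past = _lagged(target, n_lags)                        # restricted model
    both_pasts = np.hstack([own_past, _lagged(source, n_lags)])  # full model
    return float(np.log(_residual_variance(present, own_past)
                        / _residual_variance(present, both_pasts)))
```

The causal inference then follows from the sign of `granger_statistic(x, y) - granger_statistic(y, x)`, which plays the role of $F_{X \to Y} - F_{Y \to X}$.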


Pairwise asymmetric inference (g3): State space reconstruction

Consider an embedding of the time series $\mathbf{X} = \{x_t \mid t = 0, 1, \ldots, L-1, L\}$ constructed from delayed time steps as

$$\tilde{\mathbf{X}} = \{\tilde{x}_t \mid t = 1 + (E-1)\tau, \ldots, L\}$$

with

$$\tilde{x}_t = \left(x_t, x_{t-\tau}, x_{t-2\tau}, \ldots, x_{t-(E-1)\tau}\right)$$

• τ is the delay time step

• E is the embedding dimension
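A delay embedding of this kind can be sketched with NumPy slicing. The function name `delay_embed` is an assumption for this example, and indexing is zero-based, so the first embedded row corresponds to index $(E-1)\tau$.

```python
import numpy as np


def delay_embed(x, E, tau):
    """Rows are (x_t, x_{t-tau}, ..., x_{t-(E-1)tau}) for every t with a full history."""
    x = np.asarray(x)
    start = (E - 1) * tau
    # Column k holds the series delayed by k * tau steps.
    return np.column_stack([x[start - k * tau:len(x) - k * tau] for k in range(E)])


embedded = delay_embed(np.arange(6), E=3, tau=2)
# Each row reads (x_t, x_{t-2}, x_{t-4}); only t = 4 and t = 5 have a full history.
```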

Pairwise asymmetric inference (g3): Cross-mapping

Consider a time series pair (X, Y). The shadow manifold of X (labeled $\tilde{\mathbf{X}}$) is constructed from the points

$$\tilde{x}_t = \left(x_t, x_{t-\tau}, x_{t-2\tau}, \ldots, x_{t-(E-1)\tau}, y_t\right)$$

1. Find the n nearest neighbors to $\tilde{x}_t$ (in $\tilde{\mathbf{X}}$), where “nearest” means smallest Euclidean distance, d; i.e., $d_1 < d_2 < \ldots < d_n$.

2. Create weights, w, from the nearest neighbors as

$$w_i = \frac{e^{-d_i/d_1}}{\sum_{j=1}^{n} e^{-d_j/d_1}}$$

3. Construct the cross-mapped estimate of Y using the weights as

$$\mathbf{Y}|\tilde{\mathbf{X}} = \left\{ \sum_{i=1}^{n} w_i Y_{t_i} \;\middle|\; t = 1 + (E-1)\tau, \ldots, L \right\}$$

Each cross-mapped point in the estimate of Y, i.e.,

$$Y_t|\tilde{\mathbf{X}} = \sum_{i=1}^{n} \frac{e^{-d_i/d_1}}{\sum_{j=1}^{n} e^{-d_j/d_1}} Y_{t_i}$$

depends on comparisons of the pasts of X and the presents of X and Y.
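The three steps can be sketched as a brute-force nearest-neighbor search. Several details below are assumptions made for the illustration rather than statements about the author's C++ implementation: the number of neighbors is a free parameter `n_neighbors`, each point is excluded from its own neighbor list, and a small floor on $d_1$ guards against division by zero. The function name `pai_cross_map` is hypothetical.

```python
import numpy as np


def pai_cross_map(x, y, E, tau, n_neighbors):
    """Cross-mapped estimate of y from the shadow manifold (x_t, ..., x_{t-(E-1)tau}, y_t)."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    start = (E - 1) * tau
    columns = [x[start - k * tau:len(x) - k * tau] for k in range(E)]
    shadow = np.column_stack(columns + [y[start:]])   # one row per shadow-manifold point
    observed = y[start:]

    estimate = np.empty(len(observed))
    for t, point in enumerate(shadow):
        dist = np.linalg.norm(shadow - point, axis=1)  # Euclidean distances
        dist[t] = np.inf                               # assumption: skip the point itself
        nearest = np.argsort(dist)[:n_neighbors]       # step 1: d_1 <= ... <= d_n
        d = dist[nearest]
        weights = np.exp(-d / max(d[0], 1e-12))        # step 2: exponential weights
        weights /= weights.sum()
        estimate[t] = weights @ observed[nearest]      # step 3: weighted estimate
    return estimate, observed
```

The returned pair can be passed to a correlation function to compare the cross-mapped estimate with the original series. The loop is quadratic in the series length, which is acceptable for a sketch but slow for long series.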

Pairwise asymmetric inference (g3): Cross-mapped correlation

A good cross-mapped estimate is defined as one that is strongly correlated with the original time series. The cross-mapped correlation is

$$C_{YX} = \left[\rho\left(\mathbf{Y}, \mathbf{Y}|\tilde{\mathbf{X}}\right)\right]^2$$

where $\rho(\cdot)$ is Pearson’s correlation coefficient.

Cross-mapping interpretation

If similar histories of X (i.e., nearest neighbors in the shadow manifold) capably estimate Y (i.e., lead to $C_{YX} \approx 1$, or at least $C_{YX} \neq 0$), then the presence (or action) of Y in the system has been recorded in X.


Pairwise asymmetric inference (g3): Cross-mapping interpretation of causality

A time series pair (X, Y) will have two cross-mapped correlations, $C_{YX}$ and $C_{XY}$.

Operational causality (SSR)

X causes Y if similar histories of Y estimate X better than similar histories of X estimate Y, where the “similar histories” of one time series are used to estimate another time series through shadow manifold nearest neighbor weighting (cross-mapping); i.e.,

$$C_{YX} - C_{XY} < 0 \Rightarrow \mathbf{X} \to \mathbf{Y}$$

$$C_{YX} - C_{XY} > 0 \Rightarrow \mathbf{Y} \to \mathbf{X}$$

$$C_{YX} - C_{XY} = 0 \Rightarrow \text{no causal inference}$$


Weighted mean observed leaning (g4): Causal penchant

The causal penchant $\rho_{EC} \in [-1, 1]$ is

$$\rho_{EC} = P(E|C) - P(E|\bar{C})$$

• $P(E|C)$ is the probability of some effect E given some cause C.

• $P(E|\bar{C})$ is the probability of some effect E given no cause C.

So, the penchant is the probability of an effect E given a cause C minus the probability of that effect without the cause.

In the psychology/medical literature, the causal penchant is known as the Eells measure of causal strength or probability contrast.

If C drives E, then it is expected that $\rho_{EC} > 0$.

The second term, $P(E|\bar{C})$, can be eliminated from the penchant formula using Bayes’ theorem:

$$\rho_{EC} = P(E|C)\left(1 + \frac{P(C)}{1 - P(C)}\right) - \frac{P(E)}{1 - P(C)}$$

If E and C are independent, then $P(E|C) = P(E)$, which implies

$$\rho_{EC} = P(E) + \frac{P(E)P(C) - P(E)}{1 - P(C)} = P(E) - P(E) = 0$$

Example (to help with intuition)

Consider C and E to be two fair coins, $c_1$ and $c_2$, being “heads”; i.e., $P(c_1 = \text{“heads”}) = 0.5$ and $P(c_2 = \text{“heads”}) = 0.5$. If the coins are independent, then

$$P(c_2 = \text{“heads”} \mid c_1 = \text{“heads”}) = P(c_2 = \text{“heads”}) = 0.5 \Rightarrow \rho_{EC} = 0$$

If they are completely dependent, then

$$P(c_2 = \text{“heads”} \mid c_1 = \text{“heads”}) = 1 \text{ or } 0 \Rightarrow \rho_{EC} = 1 \text{ or } -1$$

This formula has the additional benefit of only needing to estimate one conditional probability from the data.
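The two-coin example can be reproduced with a short sketch of the rewritten penchant formula. The function name `penchant` and its argument names are assumptions made for this illustration.

```python
def penchant(p_e_given_c, p_c, p_e):
    """Causal penchant rho_EC from P(E|C), P(C), and P(E); requires P(C) < 1."""
    return p_e_given_c * (1 + p_c / (1 - p_c)) - p_e / (1 - p_c)


independent_coins = penchant(0.5, 0.5, 0.5)   # P(E|C) = P(E)
always_match = penchant(1.0, 0.5, 0.5)        # c2 is heads whenever c1 is heads
never_match = penchant(0.0, 0.5, 0.5)         # c2 is tails whenever c1 is heads
```

The three cases cover the range of the penchant: no dependence, and the two forms of complete dependence.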

Weighted mean observed leaning (g4): Causal leaning

A difference of penchants can be used to compare different cause-effect assignments (i.e., different assumptions of what should be considered a cause and what should be considered an effect). The leaning is

$$\lambda_{EC} = \rho_{EC} - \rho_{CE}$$

Leaning interpretation

If $\lambda_{EC} > 0$, then C drives E more than E drives C.

Weighted mean observed leaning (g4): Usefulness of the leaning

The usefulness of the leaning depends on two things:

1. Operational definitions of C and E (called the cause-effect assignment)

2. Estimations of $P(C)$, $P(E)$, $P(C|E)$, and $P(E|C)$ from the data

The primary cause-effect assignment will be the l-standard assignment.

l-standard assignment

Consider a time series pair (X, Y). The l-standard assignment initially assumes the cause is the l-lagged time step of X and the effect is the current time step of Y; i.e., $\{C, E\} = \{x_{t-l}, y_t\}$.

Probabilities will be estimated using data frequency counts.

Weighted mean observed leaning (g4): Leaning from the data

The cause-effect assignment must be specific if the probabilities are to be estimated with frequency counts, and it needs to include tolerance domains to account for noise in the measurements.

Consider the time series pair (X, Y). The penchant calculation depends on the conditional $P(y_t = a \mid x_{t-l} = b)$, where $a \in \mathbf{Y}$ and $b \in \mathbf{X}$. This conditional will be estimated as

$$P\left(y_t \in [a - \delta^L_y, a + \delta^R_y] \;\middle|\; x_{t-l} \in [b - \delta^L_x, b + \delta^R_x]\right) = \frac{n_{a \cap b}}{n_b}$$

• $n_{a \cap b}$ is the number of times $y_t \in [a - \delta^L_y, a + \delta^R_y]$ and $x_{t-l} \in [b - \delta^L_x, b + \delta^R_x]$ in (X, Y).

• $n_b$ is the number of times $x_{t-l} \in [b - \delta^L_x, b + \delta^R_x]$ in X.

The tolerance domains are usually considered symmetric; i.e., $\delta^L_x = \delta^R_x$ and $\delta^L_y = \delta^R_y$.

The causal inference implied by the leaning calculations is dependent on both the cause-effect assignment and the tolerance domains.
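The frequency-count estimate of the conditional can be sketched as follows, assuming symmetric tolerance domains and the l-standard assignment with a lag of at least one step. The function name `conditional_probability` and its signature are hypothetical and do not come from the talk's MATLAB code.

```python
import numpy as np


def conditional_probability(x, y, lag, a, b, delta_x, delta_y):
    """Estimate P(y_t in [a - delta_y, a + delta_y] | x_{t-lag} in [b - delta_x, b + delta_x]).

    Assumes lag >= 1 and symmetric tolerance domains.
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    cause = np.abs(x[:-lag] - b) <= delta_x    # x_{t-lag} falls in its tolerance domain
    effect = np.abs(y[lag:] - a) <= delta_y    # y_t falls in its tolerance domain
    n_b = cause.sum()
    n_a_and_b = (cause & effect).sum()
    # The conditional is undefined if the assumed cause is never observed.
    return n_a_and_b / n_b if n_b > 0 else float("nan")


x = [0, 1, 0, 1, 0, 1]
y = [0, 0, 1, 0, 1, 0]          # y_t = x_{t-1} in this toy example
p = conditional_probability(x, y, lag=1, a=1, b=1, delta_x=0.1, delta_y=0.1)
```

In the toy data every occurrence of $x_{t-1} = 1$ is followed by $y_t = 1$, so the estimated conditional is one.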

Weighted mean observed leaning (g4): Weighted mean

Any time series pair (X, Y) will have many leanings; e.g., an l-standard assignment of $\{C, E\} = \{x_{t-l} = b \pm \delta_x, y_t = a \pm \delta_y\}$ will have a different leaning calculation for each $x_{t-l} \in [b - \delta_x, b + \delta_x]$ and $y_t \in [a - \delta_y, a + \delta_y]$.

Consider a time series pair (X, Y) and some cause-effect assignment $\{C, E\}$ for which reasonable tolerance domains have been defined.

Any penchant calculation for which the (estimated) conditional $P(E|C) \neq 0$ (or $P(C|E) \neq 0$) is called an observed penchant.

The weighted mean observed penchant, $\langle \rho_{EC} \rangle_w$, is the weighted algebraic mean of the observed penchants.

The weighted mean observed leaning, $\langle \lambda_{EC} \rangle_w$, is the difference of the weighted mean observed penchants; i.e., $\langle \lambda_{EC} \rangle_w = \langle \rho_{EC} \rangle_w - \langle \rho_{CE} \rangle_w$.

Weighted mean observed leaning (g4): Causal inference

Operational causality (penchant)

X causes Y if the weighted mean observed leaning is positive given a cause-effect assignment (and reasonable tolerance domains) in which the assumed cause X precedes the assumed effect Y; i.e.,

$$\langle \lambda_{EC} \rangle_w > 0 \Rightarrow \mathbf{X} \to \mathbf{Y}$$

$$\langle \lambda_{EC} \rangle_w < 0 \Rightarrow \mathbf{Y} \to \mathbf{X}$$

$$\langle \lambda_{EC} \rangle_w = 0 \Rightarrow \text{no causal inference}$$

given $C \in \mathbf{X}$, $E \in \mathbf{Y}$, and C precedes E.

Lagged cross-correlation difference (g5): Cross-correlation

The cross-correlation between two time series X and Y is

$$\rho_{xy} = \frac{E\left[(x_t - \mu_X)(y_t - \mu_Y)\right]}{\sqrt{\sigma^2_X \sigma^2_Y}}$$

• Every point in X is compared to the mean of X.

• Every point in Y is compared to the mean of Y.

• The product of the individual variances of X and Y is used as a normalization.

Example (to help with intuition)

$$\mathbf{X} = \mathbf{Y} \Rightarrow \rho_{xy} = \frac{E\left[(x_t - \mu_X)(y_t - \mu_Y)\right]}{\sqrt{\sigma^2_X \sigma^2_Y}} = \frac{E\left[(x_t - \mu_X)^2\right]}{\sigma^2_X} = 1$$

Lagged cross-correlation difference (g5): Lagged cross-correlation

Consider a time series pair (X, Y). The past of Y may be compared to the present of X by introducing a lag l into the cross-correlation calculation,

$$\rho^l_{xy} = \frac{E\left[(x_t - \mu_X)(y_{t-l} - \mu_Y)\right]}{\sqrt{\sigma^2_X \sigma^2_Y}}$$

Operational causality (correlation)

X causes Y (at lag l) if the past of X (i.e., X lagged by l time steps) is more strongly correlated with the present of Y than the past of Y (i.e., Y lagged by l time steps) is with the present of X; i.e.,

$$|\rho^l_{xy}| - |\rho^l_{yx}| < 0 \Rightarrow \mathbf{X} \to \mathbf{Y}$$

$$|\rho^l_{xy}| - |\rho^l_{yx}| > 0 \Rightarrow \mathbf{Y} \to \mathbf{X}$$

$$|\rho^l_{xy}| - |\rho^l_{yx}| = 0 \Rightarrow \text{no causal inference}$$
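The lagged cross-correlation difference can be sketched directly from the formula. The function names below are assumptions for this illustration; the means and variances are taken over the full series, the expectation is replaced by a sample mean over the overlapping time steps, and the lag is assumed to be at least one.

```python
import numpy as np


def lagged_cross_correlation(x, y, lag):
    """rho^l_xy: the present of x compared with y lagged by `lag` >= 1 steps."""
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    covariance = np.mean((x[lag:] - x.mean()) * (y[:-lag] - y.mean()))
    return covariance / np.sqrt(x.var() * y.var())


def lagged_correlation_difference(x, y, lag):
    """|rho^l_xy| - |rho^l_yx|; negative suggests X -> Y, positive suggests Y -> X."""
    return (abs(lagged_cross_correlation(x, y, lag))
            - abs(lagged_cross_correlation(y, x, lag)))
```

Note the sign convention: because $\rho^l_{xy}$ compares the present of X with the past of Y, a positive difference points to Y → X rather than X → Y.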


Computational tools for the ECA summary

Open source packages are available for some of the mentioned time series causality tools; others required code to be developed from scratch.

• Java Information Dynamics Toolkit (JIDT)

• Multivariate Granger Causality (MVGC) MATLAB toolbox

• (C++)

• (MATLAB)

• (MATLAB)

All the code is available at https://github.com/jmmccracken

Cooling/Heating System Data: Time series data

Consider a time series pair (X, Y) where X is the indoor temperature (in degrees Celsius) in a house with “experimental” environmental controls and Y is the temperature outside of that house, measured at the same time intervals (168 measurements in each series)³.

[Figure: X and Y plotted against measurement index.]

The intuitive causal inference is Y → X.

³ This data was originally presented at a time series conference. The abstract is available at http://www.osti.gov/scitech/biblio/5231321 . The data is also available as part of the UCI Machine Learning Repository.


Cooling/Heating System Data: ECA summary preliminaries

An ECA summary requires several parameters be set from the data, including

• embedding dimension and time delay for g3 (PAI)

• cause-effect assignment and tolerance domains for g4 (leaning)

• lags for g5 (cross-correlation)

The embedding dimension will be set (somewhat arbitrarily) to E = 10 and the time delay will be τ = 1.

The tolerance domains will be the f-width tolerance domains; i.e., $\pm\delta_x = f(\max(\mathbf{X}) - \min(\mathbf{X}))$ and $\pm\delta_y = f(\max(\mathbf{Y}) - \min(\mathbf{Y}))$. For this example, f = 1/4.

The cause-effect assignment will be the l-standard assignment, but there is still the problem of determining relevant lags l.
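The f-width tolerance domain is a one-line computation. In the sketch below the function name `f_width_tolerance` and the sample temperatures are hypothetical; they are not values from the cooling/heating data set.

```python
def f_width_tolerance(series, f):
    """Symmetric tolerance half-width: delta = f * (max(series) - min(series))."""
    return f * (max(series) - min(series))


# Hypothetical indoor temperatures in degrees Celsius, for illustration only.
delta_x = f_width_tolerance([20.0, 22.0, 24.0], f=0.25)
```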

Cooling/Heating System Data: Autocorrelations

There are autocorrelations in both time series (only 50 lags are shown).

[Figure: squared autocorrelations $|\rho(x_{t-l}, x_t)|^2$ and $|\rho(y_{t-l}, y_t)|^2$ plotted against lag l from 0 to 50.]

The autocorrelations appear cyclic and initially drop to zero around l = 7 for both time series.

This observation will be used to justify using lags of l = 1, 2, ..., 7 for both g4 (leaning) and g5 (cross-correlation).


Cooling/Heating System Data: Lagged cross-correlations and leanings

The lagged cross-correlations and leanings (using the l-standard assignment) can be plotted for each tested lag.

[Figure: lagged cross-correlation differences and leanings $\langle \lambda_l \rangle$ for lags l = 1, ..., 7.]

There are 7 different causal inferences in this plot, all of which agree except l = 7.

A single causal inference (for each tool) will be found with the algebraic mean across all the tested lags.

Cooling/Heating System Data: Making an ECA summary

Each of the five time series tools leads to a causal inference in the ECA summary vector,

$T_{X \to Y} - T_{Y \to X} = -0.14 \Rightarrow \mathbf{Y} \to \mathbf{X} \Rightarrow g_1 = 1$

$F_{X \to Y} - F_{Y \to X} = -0.35 \Rightarrow \mathbf{Y} \to \mathbf{X} \Rightarrow g_2 = 1$

$C_{YX} - C_{XY} = 3.1 \times 10^{-4} \Rightarrow \mathbf{Y} \to \mathbf{X} \Rightarrow g_3 = 1$

$\langle \langle \lambda_{EC} \rangle_w \rangle = -0.20 \Rightarrow \mathbf{Y} \to \mathbf{X} \Rightarrow g_4 = 1$

$\langle |\rho^l_{xy}| - |\rho^l_{yx}| \rangle = 0.40 \Rightarrow \mathbf{Y} \to \mathbf{X} \Rightarrow g_5 = 1$

∴ the ECA summary is Y → X, which agrees with intuition.
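The mapping from the five reported values to the summary vector can be sketched as follows. The values are those reported for the cooling/heating data; the helper name `infer` and the boolean flag recording each tool's sign convention are assumptions made for this illustration. The flag is needed because a positive value means X → Y for the transfer entropy, Granger, and leaning tools, but Y → X for the PAI and cross-correlation differences.

```python
X_DRIVES_Y, Y_DRIVES_X, NO_INFERENCE = 0, 1, 2


def infer(difference, positive_means_x_drives_y):
    """Map one tool's signed value onto g_i (0: X -> Y, 1: Y -> X, 2: no inference)."""
    if difference == 0:
        return NO_INFERENCE
    if (difference > 0) == positive_means_x_drives_y:
        return X_DRIVES_Y
    return Y_DRIVES_X


# (reported value, whether a positive value means X -> Y for that tool)
cooling_heating = [
    (-0.14, True),     # g1: transfer entropy difference
    (-0.35, True),     # g2: Granger log-likelihood statistic difference
    (3.1e-4, False),   # g3: PAI cross-mapped correlation difference
    (-0.20, True),     # g4: mean weighted mean observed leaning
    (0.40, False),     # g5: mean lagged cross-correlation difference
]
g = [infer(value, convention) for value, convention in cooling_heating]

if all(gi == X_DRIVES_Y for gi in g):
    summary = "X -> Y"
elif all(gi == Y_DRIVES_X for gi in g):
    summary = "Y -> X"
else:
    summary = "undefined"
```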

Snowfall Data: Time series data

Consider a time series pair (X, Y) where X is the mean daily temperature (in degrees Celsius) at Whistler, BC, Canada, and Y is the total snowfall (in centimeters) (7,753 measurements in each series)⁴.

[Figure: X and Y plotted against measurement index.]

The intuitive causal inference is X → Y.

⁴ This data is available as part of the UCI Machine Learning Repository. The data was recorded from July 1, 1972 to December 31, 2009.


Snowfall Data: ECA summary preliminaries

The ECA summary can be made with parameters similar to those of the previous example:

• The embedding dimension will be E = 100 with a time delay of τ = 1

• The cause-effect assignment will be the l-standard assignment

• The tolerance domains will be the 1/4-width domains

• The tested lags will be l = 1, 2, ..., 20

Snowfall DataMaking an ECA summary

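$T_{X \to Y} - T_{Y \to X} = 2.1 \times 10^{-2} \Rightarrow X \to Y \Rightarrow g_1 = 0$

$F_{X \to Y} - F_{Y \to X} = -2.6 \times 10^{-3} \Rightarrow Y \to X \Rightarrow g_2 = 1$

$C_{YX} - C_{XY} = -3.4 \times 10^{-2} \Rightarrow X \to Y \Rightarrow g_3 = 0$

$\langle \langle \lambda_{EC} \rangle_w \rangle = 3.7 \times 10^{-2} \Rightarrow X \to Y \Rightarrow g_4 = 0$

$\langle |\rho^{xy}_l| - |\rho^{yx}_l| \rangle = 2.3 \times 10^{-2} \Rightarrow Y \to X \Rightarrow g_5 = 1$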

∴ the ECA summary is undefined

Three of the five causal inferences (g1, g3, and g4) agree with the intuitive inference X → Y.
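
The step from the five statistics to the summary can be sketched as follows. The sign conventions are read off the examples on these slides: positive values of the transfer entropy difference, the Granger statistic difference, and the weighted mean observed leaning are read as X → Y, while positive values of the PAI difference and the lagged cross-correlation difference are read as Y → X. The function name `eca_summary` and the code 2 for an exact zero are assumptions made for illustration.

```python
def eca_summary(t_diff, f_diff, c_diff, leaning, corr_diff):
    """Map the five ECA statistics to (g, summary).

    Encoding as on the slides: g_i = 0 means X -> Y and g_i = 1 means Y -> X.
    The value 2 for an exact zero ("no inference") is an assumption.
    """

    def encode(value, positive_means_x_drives):
        if value == 0:
            return 2  # assumed "no inference" code, not shown on the slides
        x_drives = (value > 0) == positive_means_x_drives
        return 0 if x_drives else 1

    g = [
        encode(t_diff, True),      # T_{X->Y} - T_{Y->X} > 0 is read as X -> Y
        encode(f_diff, True),      # F_{X->Y} - F_{Y->X} > 0 is read as X -> Y
        encode(c_diff, False),     # C_YX - C_XY < 0 is read as X -> Y
        encode(leaning, True),     # weighted mean observed leaning > 0 is read as X -> Y
        encode(corr_diff, False),  # lagged cross-correlation difference < 0 is read as X -> Y
    ]
    if all(gi == 0 for gi in g):
        summary = "X -> Y"
    elif all(gi == 1 for gi in g):
        summary = "Y -> X"
    else:
        summary = "undefined"  # the five inferences do not all agree
    return g, summary


# Values reported on the slides for the snowfall and cooling/heating examples.
snowfall = eca_summary(2.1e-2, -2.6e-3, -3.4e-2, 3.7e-2, 2.3e-2)
cooling = eca_summary(-0.14, -0.35, 3.1e-4, -0.20, 0.40)
print(snowfall)  # ([0, 1, 0, 0, 1], 'undefined')
print(cooling)   # ([1, 1, 1, 1, 1], 'Y -> X')
```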

Time series causality as data analysis: Objections to causal studies

Data analysis often ignores causality.

Two primary objections to time series causality:

1. Correlation is not causation. Many different tools have been developed that go beyond correlation, and ignoring such tools means ignoring potentially useful inferences that can be drawn from the data.

2. Confounding cannot be controlled. True, but this is an issue of defining “causality”. Exploring potential causal relationships within data sets can be done with operational definitions of causality. These different causalities may provide deeper insight into the system dynamics.

ECA summaries as practical tools: Exploratory causal inference as a practical part of data analysis

Consider a recent result presented in Unraveling the cause-effect relation between time series [Phys. Rev. E 90, 052150]:

“... El Niño and IOD [Indian Ocean Dipole] are mutually causal, and the causality is asymmetric, with the one from the latter to the former larger than its counterpart ...” (Liang; Section V, ibid.)

In the language of ECA: given the time series pair (E, I), the dominant potential driver is I; i.e., I → E.

This conclusion is drawn by deriving the “Liang information flow” from the transfer entropy and then applying this new formula to E and I. However, the same conclusion can be drawn from the ECA summary vectors of these time series pairs, using the code presented previously with naive algorithm parameters (specifically, the parameters used in the snowfall example).

BACK-UP

Impulse with linear response

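Consider $\{X, Y\} = \{x_t, y_t\}$ where $t = 0, 1, \ldots, L$,

$$x_t = \begin{cases} 2 & t = 1 \\ A\eta_t & \forall\, t \in \{t \mid t \neq 1 \text{ and } t \bmod 5 \neq 0\} \\ 2 & \forall\, t \in \{t \mid t \bmod 5 = 0\} \end{cases}$$

and $y_t = x_{t-1} + B\eta_t$ with $y_0 = 0$, $A, B \in \mathbb{R}_{\geq 0}$, and $\eta_t \sim \mathcal{N}(0, 1)$. Specifically, consider $L = 500$, $A = 0.1$, and $B = 0.4$.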

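$T_{X \to Y} - T_{Y \to X} = 5.3 \times 10^{-1} \Rightarrow X \to Y \Rightarrow g_1 = 0$

$F_{X \to Y} - F_{Y \to X} = 4.5 \times 10^{-1} \Rightarrow X \to Y \Rightarrow g_2 = 0$

$C_{YX} - C_{XY} = -8.3 \times 10^{-3} \Rightarrow X \to Y \Rightarrow g_3 = 0$

$\langle \langle \lambda_{EC} \rangle_w \rangle = 6.6 \times 10^{-3} \Rightarrow X \to Y \Rightarrow g_4 = 0$

$\langle |\rho^{xy}_l| - |\rho^{yx}_l| \rangle = -2.8 \times 10^{-3} \Rightarrow X \to Y \Rightarrow g_5 = 0$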

ECA summary is X→ Y, which agrees with intuition.
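
A minimal sketch of how this synthetic pair might be generated. It reads the piecewise definition as: x_t = 2 at t = 1 and whenever t mod 5 = 0, and x_t = Aη_t otherwise. The function name is hypothetical, and drawing independent noise samples for the x and y equations is an assumption, since the slides use the same symbol η_t in both.

```python
import random


def impulse_linear_response(L=500, A=0.1, B=0.4, seed=0):
    """Generate (x, y) for t = 0, 1, ..., L with y_t = x_{t-1} + B * eta_t."""
    rng = random.Random(seed)
    x = []
    for t in range(L + 1):
        if t == 1 or t % 5 == 0:
            x.append(2.0)  # impulse of height 2
        else:
            x.append(A * rng.gauss(0.0, 1.0))  # low-amplitude noise between impulses
    y = [0.0]  # initial condition y_0 = 0
    for t in range(1, L + 1):
        # Linear response to the previous driver value plus independent noise (assumed).
        y.append(x[t - 1] + B * rng.gauss(0.0, 1.0))
    return x, y


x, y = impulse_linear_response()
print(len(x), len(y))  # 501 501
```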

Cyclic driving with linear response

$$x_t = a \sin(bt + c) + A\eta_t$$

and

$$y_t = x_{t-1} + B\eta_t$$

with $y_0 = 0$, $A \in [0, 1]$, $B \in [0, 1]$, $\eta_t \sim \mathcal{N}(0, 1)$, and with the amplitude $a$, the frequency $b$, and the phase $c$ all in the appropriate units. Specifically, consider $L = 500$, $A = 0.1$, $B = 0.4$, $a = b = 1$, and $c = 0$.

$\langle |\rho^{xy}_l| - |\rho^{yx}_l| \rangle = -2.9 \times 10^{-2} \Rightarrow X \to Y \Rightarrow g_5 = 0$

Cyclic driving with non-linear response

Consider $\{X, Y\} = \{x_t, y_t\}$ where $t = 0, 1, \ldots, L$,

$$x_t = a \sin(bt + c) + A\eta_t$$

and

$$y_t = B x_{t-1} (1 - C x_{t-1}) + D\eta_t,$$

with $y_0 = 0$, with $A, B, C, D \in [0, 1]$, $\eta_t \sim \mathcal{N}(0, 1)$, and with the amplitude $a$, the frequency $b$, and the phase $c$ all in the appropriate units given $t = 0, f\pi, 2f\pi, 3f\pi, \ldots, 6\pi$ with $f = 1/30$, which implies $L = 181$. Specifically, consider $A = 0.1$, $B = 0.3$, $C = 0.4$, $D = 0.5$, $a = b = 1$, and $c = 0$.

$\langle |\rho^{xy}_l| - |\rho^{yx}_l| \rangle = -6.8 \times 10^{-2} \Rightarrow X \to Y \Rightarrow g_5 = 0$

ECA summary is X → Y, which agrees with intuition.
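
The time grid and the count L = 181 in this example can be checked with a short sketch. The function name is hypothetical. Two assumptions are made: "t − 1" in the response equation is taken to mean the previous sample on the grid, and independent noise draws are used for x and y.

```python
import math
import random


def cyclic_nonlinear_response(A=0.1, B=0.3, C=0.4, D=0.5,
                              a=1.0, b=1.0, c=0.0, f=1 / 30, seed=0):
    """Generate (times, x, y) on the grid t = 0, f*pi, 2*f*pi, ..., 6*pi."""
    rng = random.Random(seed)
    n_steps = round(6 / f)  # number of grid steps between 0 and 6*pi
    times = [k * f * math.pi for k in range(n_steps + 1)]
    # Sinusoidal driver with additive Gaussian noise.
    x = [a * math.sin(b * t + c) + A * rng.gauss(0.0, 1.0) for t in times]
    y = [0.0]  # initial condition y_0 = 0
    for k in range(1, len(times)):
        # Quadratic (non-linear) response to the previous driver sample.
        y.append(B * x[k - 1] * (1 - C * x[k - 1]) + D * rng.gauss(0.0, 1.0))
    return times, x, y


times, x, y = cyclic_nonlinear_response()
print(len(times))  # 181 samples, matching L = 181 on the slide
```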

Coupled logistic map

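$$x_t = x_{t-1} (r_x - r_x x_{t-1} - \beta_{xy} y_{t-1})$$

and

$$y_t = y_{t-1} (r_y - r_y y_{t-1} - \beta_{yx} x_{t-1})$$

where the parameters $r_x, r_y, \beta_{xy}, \beta_{yx} \in \mathbb{R}_{\geq 0}$. Specifically, consider $L = 500$, $\beta_{xy} = 0.5$, $\beta_{yx} = 1.5$, $r_x = 3.8$, and $r_y = 3.2$ with initial conditions $x_0 = y_0 = 0.4$.

$\langle |\rho^{xy}_l| - |\rho^{yx}_l| \rangle = -2.6 \times 10^{-1} \Rightarrow X \to Y \Rightarrow g_5 = 0$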
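
The coupled logistic map can be iterated directly from the two update equations. The sketch below is a plain transcription using the parameter values stated above; the function name and the list-based layout are hypothetical choices, not the code used for the talk.

```python
def coupled_logistic_map(L=500, r_x=3.8, r_y=3.2, beta_xy=0.5, beta_yx=1.5,
                         x0=0.4, y0=0.4):
    """Iterate the coupled logistic map for t = 1, ..., L from (x0, y0)."""
    x, y = [x0], [y0]
    for _ in range(L):
        x_prev, y_prev = x[-1], y[-1]
        # Both updates use the values from the previous time step.
        x.append(x_prev * (r_x - r_x * x_prev - beta_xy * y_prev))
        y.append(y_prev * (r_y - r_y * y_prev - beta_yx * x_prev))
    return x, y


x, y = coupled_logistic_map()
print(x[1], y[1])  # approximately 0.832 and 0.528 after the first step
```

Because beta_yx is larger than beta_xy, the influence of X on Y is stronger than the reverse in this parameter setting, which is consistent with the X → Y reading of the statistic above.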
