
Local Independence Tests for Point Processes: Learning causality in event models

Nikolaj Thams, University of Copenhagen
November 21st, 2019

Time to Event Data and Machine Learning Workshop
Joint work with Niels Richard Hansen

Hawkes Processes

Causality

Local independence test

Experimental results

Conclusion

Learning causality in event models?

[Figure: event times of four processes a, b, c, h on the interval [0, T].]

Hawkes Processes

Point processes
A point process with marks V = 1, …, d is a collection of random measures

$$N^k = \sum_i \delta_{T^k_i},$$

where $T^k_i$ is the i'th event of type k. This defines counting processes $t \mapsto N^k_t := N^k(0, t]$.

[Figure: a counting process $N_t$ jumping at event times $T_1, T_2, T_3$.]

If the compensator $A^k_t$ of $N^k_t$ equals $\int_0^t \lambda^k_s \, ds$ for some $\lambda^k$, then $\lambda^k$ is the intensity of $N^k$. Observe that $E N^k_t = \int_0^t E\lambda^k_s \, ds$.

Famous examples: the Poisson process ($\lambda_t$ constant) and the Hawkes process (next slide).

Hawkes processes

Hawkes process
The process with intensity

$$\lambda^k_t = \beta^k_0 + \sum_{v \in V} \int_{-\infty}^{t-} g_{vk}(t - s)\, N^v(ds) = \beta^k_0 + \sum_{v \in V} \sum_{s < t} g_{vk}(t - s)$$

(the inner sum running over the event times s of $N^v$) is called the (linear) Hawkes process, for some integrable kernel functions $g_{vk}$, e.g. $g_{vk}(x) = \beta^{vk}_1 e^{-\beta^{vk}_2 x}$.

This motivates using graphs for summarizing dependencies:

[Figure: intensities of the processes $N^1$ and $N^2$ over time, next to the graph 1 → 2.]

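As a concrete illustration (not from the talk), a univariate linear Hawkes process with the exponential kernel above can be simulated by Ogata's thinning algorithm; all parameter values below are made up.

```python
import numpy as np

def simulate_hawkes(beta0, beta1, beta2, T, seed=0):
    """Simulate a univariate linear Hawkes process with intensity
    lambda_t = beta0 + sum_{t_i < t} beta1 * exp(-beta2 * (t - t_i))
    by Ogata's thinning algorithm."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while True:
        # Between events the intensity only decays, so the current
        # intensity is a valid upper bound for thinning.
        lam_bar = beta0 + sum(beta1 * np.exp(-beta2 * (t - s)) for s in events)
        t += rng.exponential(1.0 / lam_bar)
        if t >= T:
            return events
        lam_t = beta0 + sum(beta1 * np.exp(-beta2 * (t - s)) for s in events)
        if rng.uniform() < lam_t / lam_bar:  # accept with prob. lam_t / lam_bar
            events.append(t)

times = simulate_hawkes(beta0=0.5, beta1=0.8, beta2=1.2, T=50.0)
```

Since $\beta_1/\beta_2 < 1$ here, the simulated process is stable (subcritical).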

Causality

Causal inference

Static system
Structural Causal Models (SCMs) consist of functional assignments, summarized by parents in a graph:

$$X_i = f_i(X_{pa_i}, \epsilon_i), \quad i \in V$$

[Figure: a graph over $X_1, X_2, X_3$, shown before and after an intervention $X_i := c$.]

Essential assumption: the model also describes the system under interventions $X_i := c$.

A graph, in conjunction with a separation criterion ⊥, satisfies:

• the global Markov property if A ⊥ B | C implies A is independent of B given C in P;
• faithfulness if A is independent of B given C in P implies A ⊥ B | C.

The global Markov property and faithfulness are the motivation for developing conditional independence tests in causality. See (Peters et al. 2017) for details.

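To make the intervention semantics concrete, here is a minimal sketch (not from the talk) of sampling a hypothetical linear SCM with graph X1 → X2, X1 → X3; the coefficients are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

def sample(intervene_x2=None):
    """Sample a hypothetical linear SCM X1 -> X2, X1 -> X3.
    Passing intervene_x2 replaces the assignment of X2 by X2 := c."""
    x1 = rng.normal(size=n)
    if intervene_x2 is None:
        x2 = 0.8 * x1 + rng.normal(size=n)
    else:
        x2 = np.full(n, float(intervene_x2))
    x3 = -0.5 * x1 + rng.normal(size=n)  # mechanism for X3 is untouched
    return x1, x2, x3

x1, x2, x3 = sample()                   # observational distribution
_, x2i, x3i = sample(intervene_x2=2.0)  # distribution under X2 := 2
```

Only the assignment of the intervened variable changes; all other mechanisms stay as they were, which is exactly the essential assumption above.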

Causal inference: Dynamical system

Causal ideas have been generalized to the dynamical setting, e.g. (Didelez 2008; Mogensen, Malinsky, et al. 2018; Mogensen and Hansen 2018).

[Figure: the unrolled time graph over $X^1_{t_i}, X^2_{t_i}, X^3_{t_i}$ for $i = 1, 2, 3, \dots$, and its summary graph over $X^1, X^2, X^3$.]

Local independence
Let N be a marked point process. For subsets A, B, C ⊆ V, we say that B is locally independent of A given C if for every b ∈ B the process

$$\lambda^{b, A\cup C}_t = E[\lambda^b_t \mid \mathcal{F}^{A\cup C}_t]$$

has a version that is adapted to $\mathcal{F}^C_t$, and we write A ↛ B | C. Heuristically, the intensity of b, when observing A ∪ C, depends only on events of C.

Under faithfulness assumptions, there exist algorithms for learning the causal graph (Meek 2014; Mogensen and Hansen 2018), by removing the edge a → b if a ↛ b | C for some C. In practice, this requires an empirical test for local independence!

Local independence test

Local independence test

We want to test:

H0 : j ↛ k | C

Equivalently, to test whether $\lambda^{k,C}_t$ is a version of $\lambda^{k,C\cup j}_t$. We propose to fit:

$$\lambda^{k,C\cup j}_t = \beta^k_0 + \int_0^t g_{jk}(t - s)\, N^j(ds) + \lambda^{k,C}_t$$

Then the test of

H0 : $g_{jk} = 0$

will have the right level if we estimate the true $\lambda^{k,C}$.

Problem: If there are latent variables, the marginalized model may not be a Hawkes process. So how do we estimate $\lambda^C$ generally, to retain level?

[Figure: a graph with latent node h and nodes c, j, k.]


Volterra approximations

To develop a non-parametric fit for $\lambda^C$, we prove the following theorem, resembling Volterra series for continuous systems.

Theorem
Suppose that N is a stationary point process. There exists a sequence of functions $h^\alpha_N$ such that, letting

$$\lambda^N_t = h^0_N + \sum_{n=1}^{N} \sum_{|\alpha|=n} \int_{-\infty}^{t} \cdots \int_{-\infty}^{t} h^\alpha_N(t - s_1, \dots, t - s_n)\, N^{\alpha_1}(ds_1) \cdots N^{\alpha_n}(ds_n),$$

we have $\lambda^N_t \xrightarrow{P} \lambda^C_t$ for $N \to \infty$.

Approximating intensity

λ^C approximations
A1: Approximate by 2nd-order iterated integrals.
A2: Approximate kernels using tensor splines:

$$h^\alpha(x_1, \dots, x_n) \approx \sum_{j_1=1}^{d} \cdots \sum_{j_n=1}^{d} \beta^\alpha_{j_1,\dots,j_n}\, b_{j_1}(x_1) \cdots b_{j_n}(x_n)$$

In vector notation:

$$\lambda^C_t(\beta) = \beta_0 + \sum_{v \in C} \int_{-\infty}^{t-} (\beta^v)^T \Phi_1(t - s)\, N^v(ds) + \sum_{\substack{v_1, v_2 \in C \\ v_2 \ge v_1}} \int_{-\infty}^{t-} \int_{-\infty}^{t-} (\beta^{v_1 v_2})^T \Phi_2(t - s_1, t - s_2)\, N^{(v_1, v_2)}(ds_1\, ds_2) =: \beta_C^T x^C_t$$

Similarly for $g_{jk}$, such that

$$\lambda^{k, C \cup j}_t = \beta^k_0 + \int_0^t g_{jk}(t - s)\, N^j(ds) + \lambda^{k,C}_t = \beta_{jk}^T x^{jk}_t + \beta_C^T x^C_t =: \beta^T x_t$$

Maximum Likelihood Estimation

The likelihood is concave for linear intensities!

$$\log L_T(\beta) = \int_0^T \log\!\big(\beta^T x_t\big)\, N^k(dt) - \beta^T \int_0^T x_t\, dt$$

We penalize with a roughness penalty:

$$\max_\beta\ \log L_T(\beta) - \kappa_0\, \beta^T \Omega \beta \quad \text{s.t. } X\beta \ge 0$$

The distribution of the maximum likelihood estimate is approximately normal:

$$\hat\beta \overset{\text{approx}}{\sim} \mathcal{N}\!\left((I + 2\kappa_0 J_T^{-1}\Omega)\beta_0,\ J_T^{-1} K_T J_T^{-1}\right), \quad \text{with } K_T = \int_0^T \frac{x_t x_t^T}{\beta^T x_t}\, dt \ \text{ and } J_T = K_T - 2\kappa_0\Omega$$

Local Independence Test

Given the distribution of $\hat\beta = (\hat\beta_{jk}, \hat\beta_C)$, we can test the hypothesis H0 : j ↛ k | C.

How do we test $\Phi^T\beta \equiv 0$?

• First idea: $\hat\beta$ is approximately normal, so test $\beta = 0$ directly.
• Better idea (see Wood 2012): evaluate the basis Φ in a grid G = $x_1, \dots, x_M$. The fitted function values over the grid are then $\Phi(G)^T\hat\beta$.

If $\hat\beta \sim \mathcal{N}(\mu_j, \Sigma_j)$, then the Wald test statistic for the null hypothesis $\Phi(G)^T\mu_j = 0$ is:

$$T_\alpha = \hat\beta^T \Phi(G)\left[\Phi(G)^T \Sigma_j \Phi(G)\right]^{-1} \Phi(G)^T \hat\beta \quad (1)$$

This is approximately $\chi^2(M)$-distributed, and we can test for significance of components!
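A minimal numerical sketch of the Wald statistic (1); the dimensions, basis matrix and covariance below are all made up for illustration, not taken from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

d, M = 6, 4                      # illustrative: d coefficients, M grid points (M <= d)
Phi_G = rng.normal(size=(d, M))  # basis Phi evaluated on the grid G (assumed given)
Sigma = 0.1 * np.eye(d)          # assumed covariance of beta_hat
beta_hat = rng.normal(scale=0.05, size=d)  # fitted coefficients

# T = b' Phi(G) [Phi(G)' Sigma Phi(G)]^{-1} Phi(G)' b, compared to chi^2(M)
v = Phi_G.T @ beta_hat
T = v @ np.linalg.solve(Phi_G.T @ Sigma @ Phi_G, v)

crit = 9.488  # 95% quantile of chi^2(4)
reject = T > crit
```

Under H0, T is approximately χ²(M)-distributed, so rejecting when T exceeds the 95% quantile gives an (approximate) level-5% test.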

Summary of test

We summarize our proposed test. To test j ↛ k | C:

• Approximate $\lambda^C$ by a Volterra expansion of degree 2 with spline kernels.
• Fit $\lambda^{k,C\cup j}(\beta)$ within the model class by penalized MLE.
• Test $\Phi^T\beta \equiv 0$ using grid evaluation and the Wald approximation.
• If the test accepts, conclude local independence.

Experimental results

Experiment 1: Testing various structures

In each of the following structures, we test a ↛ b | b, C:

[Figure: six structures L1, L2, L3 and P1, P2, P3 over nodes a, b, c, h.]

We obtain acceptance rates:

[Figure: H0 acceptance rates (0–100%) for first- and second-order tests (labeled 1 and 2) in each of the structures L1, L2, L3, P1, P2, P3.]


Causal discovery

We evaluate the performance in the CA algorithm, which estimates the causal graph.

[Figure: event data for a, b, c, d on [0, T]; starting from the fully connected graph, tests such as a ↛ b | b, c, d remove edges one by one until the estimated graph remains.]

Experiment 2: Causal discovery

We simulate random graphs, simulate a dataset from each graph, recover the graph from the dataset, and measure the Structural Hamming Distance (SHD) to the true graph.

SHD between G1 and G2
The minimum number of actions among flipping, adding or removing an edge needed to turn G1 into G2.

[Figure: SHD to the true graph (0–5) against the dimension of the graph (3–6), for the baseline and the LI test.]
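A small sketch (not from the talk) of computing the SHD between two directed graphs given as adjacency matrices, counting a flipped edge as one action:

```python
import numpy as np

def shd(A, B):
    """Structural Hamming Distance between directed graphs A and B
    (adjacency matrices): for each unordered pair of nodes, count 1
    if the edge status between them differs (so a flip costs 1)."""
    A = np.asarray(A, dtype=bool)
    B = np.asarray(B, dtype=bool)
    n = A.shape[0]
    dist = 0
    for i in range(n):
        for j in range(i + 1, n):
            if (A[i, j], A[j, i]) != (B[i, j], B[j, i]):
                dist += 1
    return dist

G_true = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # a -> b -> c
G_est  = [[0, 0, 0], [1, 0, 1], [0, 0, 0]]  # a <- b -> c (one flipped edge)
print(shd(G_true, G_est))  # -> 1
```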

Conclusion

Conclusion

• Causal inference is possible in point process models, using conditional independence tests!

• Facing latent components in a Hawkes model, the marginal process may not be Hawkes.

• Volterra expansions can overcome this model misspecification by fitting a general functional form of intensities.

• We propose a testing framework based on splines, and have promising experimental results.

References i

Daley, Daryl J and David Vere-Jones (2007). An Introduction to the Theory of Point Processes: Volume II: General Theory and Structure. Springer Science & Business Media.

Didelez, Vanessa (2008). "Graphical models for marked point processes based on local independence". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70.1, pp. 245–264.

Meek, Christopher (2014). "Toward learning graphical and causal process models". In: Proceedings of the UAI 2014 Conference on Causal Inference: Learning and Prediction, Volume 1274. CEUR-WS.org, pp. 43–48.

References ii

Mogensen, Søren Wengel and Niels Richard Hansen (2018). "Markov equivalence of marginalized local independence graphs". In: arXiv preprint arXiv:1802.10163. To appear in Ann. Statist.

Mogensen, Søren Wengel, Daniel Malinsky, and Niels Richard Hansen (2018). "Causal learning for partially observed stochastic dynamical systems". In: Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI 2018). AUAI, pp. 350–360.

Peters, Jonas, Dominik Janzing, and Bernhard Schölkopf (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

Wood, Simon N (2012). "On p-values for smooth components of an extended generalized additive model". In: Biometrika 100.1, pp. 221–228.

Questions?

Volterra: Sketch of proof I

First we show the representation at time 0, i.e. for $\lambda_0$:

1. For any $\lambda_0$, use that $1_{|\lambda_0| < N}\,\lambda_0 \xrightarrow{P} \lambda_0$ and $1_{|\lambda_0| < N}\,\lambda_0 \in L^1(\mathcal{F})$.

2. Define $\mathcal{F}_\tau = \sigma(T_1 \wedge \tau, T_2 \wedge \tau, \dots)$, and show that $\cup_{\tau \le 0} L^1(\mathcal{F}_\tau)$ is dense in $L^1(\mathcal{F})$, where $\mathcal{F} = \sigma(N_t, t < 0)$, via martingale convergence.

3. Through a combinatorial argument, show that for $\lambda_0 \in L^1(\mathcal{F}_\tau)$, $1_{N([\tau,0])=1}\,\lambda_0$ has an additive decomposition

$$\sum_{n=1}^{N} \beta_n \int_{[\tau,0]} f(t_1)\, 1_{D_n}\, dN(t_n) \xrightarrow{\text{a.s.}} \lambda_0\, 1_{N([\tau,0])=1} \quad (N \to \infty)$$

4. Extend to $1_{N([\tau,0])=M}\,\lambda_0$ and sum all terms.

Volterra: Sketch of proof II

5. Using time-homogeneity¹, we extend the result to every time t.
6. The extension to multivariate point processes is simple, using:

$$\int_{((-\infty,0] \times V)^n} h_n(t_1, v_1, \dots, t_n, v_n)\, N(dt_n \times dv_n) = \sum_{|\alpha|=n} \int_{(-\infty,0]^n} h^\alpha(t_1, \dots, t_n)\, N^\alpha(dt_n)$$

¹ $\lambda(\pi, (N_s)_{s<\pi}) = \lambda(0, (N^\pi_s)_{s<0})$

Local independence graphs

Local independence graph
For a point process with coordinates V = 1, …, d, define the local independence graph G = (V, E) by letting (a, b) ∈ E exactly when b is not locally independent of a given V\{a}.

Example
[Figure: an example local independence graph over a, b, c.]

Graphs and µ-separation

[Figure: a graph over a, c1, c2, b, and a walk in it from a to b containing a collider.]

µ-connection and separation
For G = (V, E), let a, b ∈ V and C ⊆ V. A µ-connecting walk p from a to b given C is a walk from a to b such that:

1. p is non-trivial and its final edge points into b.
2. a ∉ C.
3. coll(p) ⊆ An(C).
4. noncoll(p) ∩ C = ∅.

If no walk from a to b is µ-connecting given C, then a and b are µ-separated and we write a ⊥µ b | C.

Global Markov property

The following concepts relate local independence to a graph G:

Global Markov property: A ⊥µ B | C implies A ↛ B | C.
Faithfulness: A ↛ B | C implies A ⊥µ B | C.

The global Markov property makes the local independence graph "relevant" for understanding the underlying point process.

Recovering the graph using independence tests
Assuming faithfulness and the global Markov property, (Meek 2014) proposes an algorithm which is guaranteed to return the true local independence graph, essentially by testing a ↛ b | C for all a, b and sets C of increasing size.

Backup: The CA algorithm

Algorithm 1: Causal Analysis algorithm
  Initialize G = (V, E_CA) as a fully connected graph
  for v ∈ V do:
      n = 0
      while n < |pa(v)| do:
          for v′ ∈ pa(v) do:
              for C ⊆ pa(v)\{v′} with |C| = n do:
                  if v′ ↛ v | C then remove (v′, v) from E_CA
          n = n + 1
  return G = (V, E_CA)

In short: for each pair (v′, v), remove the edge v′ → v if there exists a set C such that v′ ↛ v | C.
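The loop above can be sketched in Python, assuming an oracle `is_locally_independent(v_prime, v, C)` for the test v′ ↛ v | C (here a toy oracle that just reads off a ground-truth parent set; a real implementation would plug in the empirical test):

```python
from itertools import combinations

def ca_algorithm(V, is_locally_independent):
    """Sketch of the Causal Analysis (CA) algorithm: start from the
    fully connected graph and remove v' -> v whenever some subset C
    of pa(v)\\{v'} renders v locally independent of v'."""
    parents = {v: set(V) - {v} for v in V}  # self-edges omitted for simplicity
    for v in V:
        n = 0
        while n < len(parents[v]):
            for v_prime in list(parents[v]):
                for C in combinations(parents[v] - {v_prime}, n):
                    if is_locally_independent(v_prime, v, set(C)):
                        parents[v].discard(v_prime)
                        break
            n += 1
    return {(p, v) for v in V for p in parents[v]}

# Toy oracle: the true graph is a -> b -> c.
true_pa = {"a": set(), "b": {"a"}, "c": {"b"}}
oracle = lambda vp, v, C: vp not in true_pa[v]
edges = ca_algorithm(["a", "b", "c"], oracle)
print(edges)  # -> {("a", "b"), ("b", "c")}
```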

Backup: P1 and P2

Definition

• ⊥ satisfies P1 if separation v′ ⊥ v | C for some v′ ∉ C implies (v′, v) ∉ E.
• ⊥ satisfies P2 if the lack of an edge (v′, v) implies the existence of a set C ⊆ pa(v) such that v′ ⊥ v | C.

The CA algorithm assumes both P1 and P2. d-separation satisfies P1, and δ- and µ-separation satisfy both P1 and P2.

We show that for ⊥ satisfying P1 and P2, two graphs have the same separations exactly if they are equal.

Backup: Example of Local independence

Example
Three children (a, b, c) throw a ball, a → b → c → a → ⋯, each holding it for an exp(1) waiting time. $N^v$ counts the number of times child v has thrown the ball. This has intensities:

$$\lambda^a_t = 1_{N^a_t = N^c_t}, \quad \lambda^b_t = 1_{N^b_t < N^a_t}, \quad \lambda^c_t = 1_{N^c_t < N^b_t}$$

We find b ↛ a | a, c but not a ↛ b | b, c, because:

$$\lambda^{a,\{a,b,c\}}_t = E\big[\lambda^a_t \mid \mathcal{F}^{a,b,c}_t\big] = 1_{N^a_t = N^c_t} \in \mathcal{F}^{a\cup c}_t$$

$$\lambda^{b,\{a,b,c\}}_t = E\big[\lambda^b_t \mid \mathcal{F}^{a,b,c}_t\big] = 1_{N^b_t < N^a_t} \notin \mathcal{F}^{b\cup c}_t$$

Also a is not locally independent of itself given b, c, because

$$\lambda^{a,\{a,b,c\}}_t = 1_{N^a_t = N^c_t} \notin \mathcal{F}^{b\cup c}_t$$
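A quick simulation (not from the talk) of the ball-passing process, recording the throw counts to check that the indicator intensities above always pick out exactly the current holder of the ball:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate the passing a -> b -> c -> a with exp(1) waiting times and
# record (holder, Na, Nb, Nc) just after every throw.
states = []
Na = Nb = Nc = 0
holder, t, T = "a", 0.0, 100.0
while True:
    t += rng.exponential(1.0)
    if t >= T:
        break
    if holder == "a":
        Na += 1; holder = "b"
    elif holder == "b":
        Nb += 1; holder = "c"
    else:
        Nc += 1; holder = "a"
    states.append((holder, Na, Nb, Nc))
```

In every recorded state, $\lambda^a = 1$ iff Na == Nc, $\lambda^b = 1$ iff Nb < Na, and $\lambda^c = 1$ iff Nc < Nb, matching the intensities in the example.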

Backup: Runtime

[Figure: runtime (s) of the test against the number of points (250–1250), for first- and second-order tests in the structures S1, S3 and S4i.]

Figure 1: Runtime of 300 invocations of the local empirical independence test. a ↛ b | b, C was tested 100 times in each of the structures S1, S3 and S4i.

Backup: Tuning κ0

[Figure: boxplots of test p-values against the roughness-penalty scale κ0 (from 10⁻⁴ to 10⁴) for the structures S1, S2, S3, S4i, S4ii, S4iii and S4iv.]

Figure 2: Boxplots of p-values from the 7 structures. From each structure, 100 Hawkes processes were simulated, and the local empirical independence test was run, each with the roughness penalty at various levels of κ0. Each simulation produced a p-value, which is plotted. The red dotted line shows the 5% level. The headers show the ground truth of whether a ↛ b | b, C. The dark-green line shows the fraction of the simulated p-values falling below the 5% level.

Backup: Latent experiment

[Figure: SHD of estimated graphs against the number of observed nodes (2–5), in panels with 0, 1 and 2 latent variables, for first- and second-order tests.]

Figure 3: Structural Hamming Distances of graphs estimated using the ECA algorithm with a first- and second-order local empirical independence test (the second being the standard one used above). The panels 0, 1 and 2 indicate the number |V\O| of latent variables. The lines represent the average SHD within each group.