Mining Correlations on Massive Bursty Time Series Collections
Tomasz Kuśmierczyk and Kjetil Nørvåg
DASFAA 2015


Problem statement

● bursty streams: the input time series
● streams of bursts: bursts extracted from each stream with one of many available burst detection methods
● correlated bursts: bursts from different streams that overlap in time
● goal: identify correlated bursty streams


Problem: Massive Collections

● identify pairs: correlation ≥ threshold
● N ~ millions of streams
● naive (all-pairs) solution complexity ~ N^2

● pruning
● indexing


Motivation

● any source of a large number of streams:
  ○ social media
  ○ web page view counts
  ○ traffic monitoring sensors
  ○ smart grid (electricity consumption meters)
  ○ and more


Correlated bursts

Correlated bursts may have:
● different lengths
● different heights
● slight shifts

but
● should overlap


Correlated bursty streams

For two streams i and j:
● $e_i$, $e_j$: number of bursts per stream
● $o_i^j$: number of bursts from stream i overlapping with j
● $o_j^i$: number of bursts from stream j overlapping with i

$J(E_i, E_j) = \frac{\min(o_i^j, o_j^i)}{e_i + e_j - \min(o_i^j, o_j^i)}$

Example (streams $E_i$ and $E_j$ drawn on a common time axis):
● $e_i = 4$, $o_i^j = 3$
● $e_j = 3$, $o_j^i = 2$
● $\min(o_i^j, o_j^i) = \min(3, 2) = 2$
● $J(E_i, E_j) = 2 / (4 + 3 - 2) = 2/5$
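The worked example can be checked with a few lines of Python. This is only a sketch: it assumes each burst is a closed `(start, end)` interval, and the function names and the sample intervals are hypothetical, chosen so that the counts match the slide (4 and 3 bursts, 3 and 2 overlapping).

```python
from typing import List, Tuple

Burst = Tuple[float, float]  # (start, end), closed interval; an assumed representation


def overlaps(a: Burst, b: Burst) -> bool:
    """True if the two closed intervals share at least one time point."""
    return a[0] <= b[1] and b[0] <= a[1]


def count_overlapping(src: List[Burst], other: List[Burst]) -> int:
    """Number of bursts in `src` that overlap at least one burst in `other`."""
    return sum(1 for s in src if any(overlaps(s, o) for o in other))


def burst_jaccard(ei: List[Burst], ej: List[Burst]) -> float:
    """Jaccard-style measure: min of the overlap counts over the 'union' size."""
    m = min(count_overlapping(ei, ej), count_overlapping(ej, ei))
    denom = len(ei) + len(ej) - m
    return m / denom if denom else 0.0


# Hypothetical data reproducing the slide's numbers.
E_i = [(0, 2), (3, 4), (6, 7), (10, 11)]
E_j = [(1, 3.5), (6.5, 8), (20, 21)]
print(burst_jaccard(E_i, E_j))  # 0.4
```

The pairwise scan is quadratic in the number of bursts per stream, which is acceptable for illustration but not for the collection sizes the talk targets.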


Enumerating pairs

Order streams according to number of bursts:

❏ FOREACH base count b
    ❏ FOREACH b’ IN connected counts of b
        ❏ compare streams with b and b’ bursts
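The loop structure above might look as follows in Python. This is a sketch under assumptions: streams are identified by strings, `burst_counts` maps each stream to its number of bursts, and `connected_counts` is a hypothetical callback that returns the counts b’ to pair with a base count b. It only enumerates the pairs to compare; the comparison itself is left out.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def enumerate_candidate_pairs(
    burst_counts: Dict[str, int],
    connected_counts: Callable[[int], Iterable[int]],
) -> Iterator[Tuple[str, str]]:
    """Yield stream pairs grouped by (base count b, connected count b')."""
    # Order streams according to their number of bursts.
    by_count: Dict[int, List[str]] = defaultdict(list)
    for stream, count in burst_counts.items():
        by_count[count].append(stream)

    for b in sorted(by_count):                  # FOREACH base count b
        for b2 in connected_counts(b):          # FOREACH b' IN connected counts of b
            if b2 not in by_count:
                continue
            for s in by_count[b]:               # compare streams with b and b' bursts
                for t in by_count[b2]:
                    if b2 != b or s < t:        # skip self-pairs and mirrored pairs when b' == b
                        yield (s, t)


# Hypothetical usage: pair each count with itself and the count just below it.
counts = {"a": 4, "b": 4, "c": 3, "d": 1}
pairs = list(enumerate_candidate_pairs(counts, lambda b: range(max(1, b - 1), b + 1)))
print(pairs)
```

Each unordered pair is produced once as long as `connected_counts(b)` only returns counts no larger than b.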


Pruning

● for each base count b we need to consider only connected counts b’ such that:

$J_T \cdot b \le b' \le b$

where $J_T$ is the threshold, b is a particular base count, and b’ ranges over the possible connected counts
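One way to read the bound: a stream with b’ ≤ b bursts can contribute at most b’ overlapping bursts, so J is at most b’ / (b + b’ - b’) = b’ / b, and a pair can reach the threshold only if b’ ≥ J_T · b. A small sketch of the resulting range follows; the function name is hypothetical, and the threshold is passed as a `Fraction` to avoid floating-point rounding at the boundary.

```python
from fractions import Fraction
from math import ceil


def connected_counts(b: int, jt: Fraction) -> range:
    """Counts b' with J_T * b <= b' <= b (the pruning condition)."""
    lower = max(1, ceil(jt * b))  # smallest integer b' with b' >= J_T * b
    return range(lower, b + 1)


print(list(connected_counts(20, Fraction(95, 100))))  # [19, 20]
print(list(connected_counts(10, Fraction(1, 2))))     # [5, 6, 7, 8, 9, 10]
```

With a high threshold such as 0.95, only a narrow band of counts remains for each base count.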


Interval Boxes (IB) index

● k-subset of bursts = k-dimensional box
● k-dimensional R-trees
● k-dimensional boxes overlapping = at least k bursts overlap

[Figure: example for k=2, the representation of stream $E_i$ as 2-dimensional boxes; both axes are labelled 1 to 4]

An indexed box overlapping a query box implies $\min(o_i^j, o_j^i) \ge k$
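The box idea can be illustrated without an R-tree. The sketch below is an assumption-laden stand-in: bursts are closed `(start, end)` intervals, every time-ordered k-subset of a stream's bursts becomes one box, and a brute-force scan replaces the R-tree query. All names and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple

Burst = Tuple[float, float]   # (start, end), an assumed representation
Box = Tuple[Burst, ...]       # one interval per dimension


def k_boxes(bursts: Sequence[Burst], k: int) -> Iterator[Box]:
    """Every k-subset of a stream's bursts, in time order, as a k-dimensional box."""
    return combinations(sorted(bursts), k)


def boxes_overlap(a: Box, b: Box) -> bool:
    """Boxes overlap when their intervals overlap in every dimension."""
    return all(x[0] <= y[1] and y[0] <= x[1] for x, y in zip(a, b))


def has_overlapping_boxes(ei: Sequence[Burst], ej: Sequence[Burst], k: int) -> bool:
    """Brute-force stand-in for querying an index of E_i's boxes with E_j's boxes."""
    indexed = list(k_boxes(ei, k))  # plays the role of the R-tree contents
    return any(boxes_overlap(q, box) for q in k_boxes(ej, k) for box in indexed)


E_i = [(0, 2), (3, 4), (6, 7), (10, 11)]
E_j = [(1, 3.5), (6.5, 8), (20, 21)]
print(has_overlapping_boxes(E_i, E_j, 2))  # True
print(has_overlapping_boxes(E_i, E_j, 3))  # False
```

A hit for a given k means k bursts on each side take part in overlaps. The number of boxes grows combinatorially with k, which is the index-size cost that comes with higher dimensionality.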


Interval Boxes (IB) index: mining

● mining:
  ○ for each base count b, maintain an IB (R-tree) index
  ○ query it with streams having connected counts b’

[Figure: one IB index per base count (b=1, b=2, b=3, b=4); querying yields candidate pairs of streams with $\min(o_i^j, o_j^i) \ge k$, from which the correlated output pairs are produced]


IB index: what dimensionality k?

● small k (IB Low Dimensional = IBLD)
  ○ small indexes
  ○ large number of candidate pairs

● high k (IB High Dimensional = IBHD)
  ○ large indexes
  ○ small number of candidate pairs
  ○ $k_{max} = J_T \cdot b$ (correlation ≥ threshold guaranteed)


IBHD index in practice

● to speed things up, some k-subsets are skipped
● some pairs may be missed in cases of multiple overlaps
● efficiency-effectiveness tradeoff


List-based (LS) index: bins

separate bin for each (not pruned) pair b, b’

b=1, b’=2 | b=2, b’=3 | b=3, b’=4 | b=4, b’=5
b=1, b’=3 | b=2, b’=4 | b=3, b’=5 | b=4, b’=6
b=1, b’=4 | b=2, b’=5 | b=3, b’=6 | b=4, b’=7
b=1, b’=5 |


List-based (LS) index: single bin

[Figure: a single bin laid out along the time axis, divided according to the time granularity]


LS index: mining algorithm

● returns $o_i^j$ and $o_j^i$
● only for pairs $E_i$, $E_j$ that have at least one overlap
● immediate validation of each pair's correlation J


LS index: mining algorithm

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ maintain the map OVERLAPS = burst → set of overlapping streams
  ○ update counts $o_i^j$ and $o_j^i$

[Figure: timeline showing the current time moment (set of pointers), the bursts active in the current moment and in the previous moment, and the resulting NEW, OLD and ENDING sets]
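The sweep over time moments can be sketched in Python as below. Several things here are assumptions rather than the authors' implementation: bursts are `(stream id, first slot, last slot)` tuples over integer time slots (the time granularity), a single bin is processed, NEW bursts are paired with OLD bursts and with other NEW bursts, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Burst = Tuple[str, int, int]  # (stream id, first slot, last slot); assumed representation


def ls_overlap_counts(bursts: List[Burst]) -> Dict[Tuple[str, str], int]:
    """Return o[(i, j)] = number of bursts of stream i overlapping stream j."""
    # List-based layout: for every time slot, the set of bursts active in it.
    active_at: Dict[int, Set[Burst]] = defaultdict(set)
    for burst in bursts:
        for t in range(burst[1], burst[2] + 1):
            active_at[t].add(burst)
    if not active_at:
        return {}

    overlaps: Dict[Burst, Set[str]] = {}  # OVERLAPS: burst -> overlapping streams
    counts: Dict[Tuple[str, str], int] = defaultdict(int)

    def record(x: Burst, y: Burst) -> None:
        # Count burst x once per overlapping stream, not once per overlapping burst.
        if x[0] != y[0] and y[0] not in overlaps[x]:
            overlaps[x].add(y[0])
            counts[(x[0], y[0])] += 1

    previous: Set[Burst] = set()
    for t in range(min(active_at), max(active_at) + 1):
        current = active_at.get(t, set())
        new, old, ending = current - previous, current & previous, previous - current
        for x in ending:                  # finished bursts leave the map
            overlaps.pop(x, None)
        for x in new:
            overlaps[x] = set()
        new_list = list(new)
        for idx, x in enumerate(new_list):
            for y in list(old) + new_list[idx + 1:]:  # NEW x OLD and NEW x NEW
                record(x, y)
                record(y, x)
        previous = current
    return dict(counts)


data = [("i", 0, 2), ("i", 3, 4), ("i", 6, 7), ("i", 10, 11),
        ("j", 1, 3), ("j", 7, 8), ("j", 20, 21)]
o = ls_overlap_counts(data)
m = min(o[("i", "j")], o[("j", "i")])
print(o[("i", "j")], o[("j", "i")], m / (4 + 3 - m))  # 3 2 0.4
```

Only pairs with at least one overlap ever appear in the result, and J can be validated for a pair as soon as both counts are known. The sketch walks every slot, including empty ones, which a real implementation would skip.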


Hybrid index

● LS index works well when:
  ○ low number of overlaps
  ○ high number of bursts per stream

● IBHD index works well when:
  ○ low number of bursts per stream
  ○ high number of overlaps

● Solution: Hybrid index: IBHD index for low and LS index for high base counts


Experimental evaluation

● Wikipedia page views from the years 2011-2013
● Kleinberg’s burst extraction
● streams having at least 5 bursts
● 2.1M streams and 43M bursts in total
● 10 bursts per stream on average
● mean burst length 28h


Mining & building

Threshold: $J_T = 0.95$


Hybrid mining

Number of streams: N = 2.1M


Number of generated pairs

Threshold: $J_T = 0.95$ (<10% pairs missing)

Example pairs of streams shown on the slide:
● How_I_Met_Your_Mother_(season_7) / Two_and_a_Half_Men_(season_9)
● Process_(computing) / Central_processing_unit
● Endoplasmic_reticulum / Ribosome
● Greatest_Hits,_Vol._2_(Ronnie_Milsap_album) / Greatest_Hits,_Vol._3_(Ronnie_Milsap_album)
● DigiTech_JamMan / Lexicon_JamMan
● Humanistic_psychology / Positive_psychology

[Plot annotation: computational limits for Naive/LS index]


What’s more in the paper?

● formal definitions and proofs
● considerations of combinatorial aspects
● multiple overlap cases
● on-line maintenance of indexes


Questions?

Tomasz Kuśmierczyk, [email protected]


Thank you!


LS index: mining

● For each set of burst pointers (time moment):
  ○ identify NEW, OLD, ENDING (simple set operations)
  ○ new overlapping bursts: NEW × OLD ∪ NEW × NEW
  ○ remove ENDING from, and add the new overlapping bursts to, the map OVERLAPS = burst → set of overlapping streams
  ○ update counts $o_i^j$ and $o_j^i$ for the new overlapping bursts with the help of OVERLAPS
● For each i and j in o: calculate $\min(o_i^j, o_j^i)$ and J

[Figure: timeline showing the current time moment (set of pointers), the bursts active in the current moment and in the previous moment, and the resulting NEW, OLD and ENDING sets]