Streaming Pattern Discovery in Multiple Time-Series Spiros Papadimitriou Jimeng Sun Christos Faloutsos Carnegie Mellon University VLDB 2005, Trondheim, Norway


Streaming Pattern Discovery in Multiple Time-Series

Spiros Papadimitriou, Jimeng Sun, Christos Faloutsos

Carnegie Mellon University

VLDB 2005, Trondheim, Norway


Motivation

Several settings where many deployed sensors measure some quantity, e.g.:
– Traffic in a network
– Temperatures in a large building
– Chlorine concentration in water distribution network

Values are typically correlated

Would be very useful if we could summarize them on the fly


Motivation

May have hundreds of measurements, but it is unlikely they are completely unrelated!

[Figure: chlorine concentrations in a water distribution network across Phase 1, Phase 2, and Phase 3, for sensors near the leak and sensors away from the leak; annotation: normal operation]


Motivation

May have hundreds of measurements, but it is unlikely they are completely unrelated!

[Figure: chlorine concentrations in a water distribution network across Phase 1, Phase 2, and Phase 3, for sensors near the leak and sensors away from the leak; annotations: normal operation, major leak]


Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams) of chlorine concentrations and k hidden variable(s) during Phase 1; k = 1]


Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams) of chlorine concentrations and k hidden variable(s) during Phase 1 and Phase 2; k = 2]


Motivation

We would like to discover a few “hidden (latent) variables” that summarize the key trends

[Figure: actual measurements (n streams) of chlorine concentrations and k hidden variable(s) during Phase 1, Phase 2, and Phase 3; k = 1]


Goals

Discover “hidden” (latent) variables for:
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies

Incremental, real-time computation
Limited memory requirements


Related work

Stream mining
– Stream SVD [Guha, Gunopulos, Koudas / KDD03]
– StatStream [Zhu, Shasha / VLDB02]
– Clustering: [Aggarwal, Han, Yu / VLDB03], [Guha, Meyerson, et al / TKDE], [Lin, Vlachos, Keogh, Gunopulos / EDBT04]
– Classification: [Wang, Fan, et al / KDD03], [Hulten, Spencer, Domingos / KDD01]
– Piecewise approximations: [Palpanas, Vlachos, Keogh, et al / ICDE 2004]
– Queries on streams: [Dobra, Garofalakis, Gehrke, et al / SIGMOD02], [Madden, Franklin, Hellerstein, et al / OSDI02], [Considine, Li, Kollios, et al / ICDE04], [Hammad, Aref, Elmagarmid / SSDBM03]


Overview

Method outline
Experiments


Stream correlations

Step 1: How to capture correlations?

Step 2: How to do it incrementally, when we have a very large number of points?

Step 3: How to dynamically adjust the number of hidden variables?


1. How to capture correlations?

[Figure: temperature T1 of the first sensor plotted over time; temperature axis marked at 20°C and 30°C]


1. How to capture correlations?

[Figure: temperature T2 of the second sensor plotted over time alongside the first sensor; temperature axis marked at 20°C and 30°C]


1. How to capture correlations

Correlations: Let’s take a closer look at the first three value-pairs…

[Figure: scatter plot of temperature T2 against temperature T1; both axes marked at 20°C and 30°C]


1. How to capture correlations

First three lie (almost) on a line in the space of value-pairs…
– O(n) numbers for the slope, and
– One number for each value-pair (offset on line)

[Figure: scatter plot of temperature T2 against temperature T1 with the value-pairs at time=1, time=2, and time=3 lying on a line; the offset along the line is the “hidden variable”]


1. How to capture correlations

Other pairs also follow the same pattern: they lie (approximately) on this line

[Figure: scatter plot of temperature T2 against temperature T1; both axes marked at 20°C and 30°C]


Stream correlations

Step 1: How to capture correlations?

Step 2: How to do it incrementally, when we have a very large number of points?

Step 3: How to dynamically adjust the number of hidden variables?


2. Incremental update

For each new point:
– Project onto current line
– Estimate error

[Figure: scatter plot of temperature T2 against temperature T1 showing a new value and its error relative to the current line]


2. Incremental update

For each new point:
– Project onto current line
– Estimate error
– Rotate line in the direction of the error and in proportion to its magnitude

O(n) time

[Figure: scatter plot of temperature T2 against temperature T1 showing a new value and its error relative to the current line]


2. Incremental update

For each new point:
– Project onto current line
– Estimate error
– Rotate line in the direction of the error and in proportion to its magnitude

[Figure: scatter plot of temperature T2 against temperature T1; both axes marked at 20°C and 30°C]


Stream correlations
Principal Component Analysis (PCA)

The “line” is the first principal component (PC) vector

This line is optimal: it minimizes the sum of squared projection errors
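To make the optimality claim concrete, here is a minimal batch sketch (not the streaming method of the slides). It takes the first right singular vector of the data matrix as the line direction and measures the sum of squared projection errors. The synthetic two-sensor data, the function names, and the choice not to center the data are assumptions made for illustration.

```python
import numpy as np

def first_principal_component(X):
    """Return the unit-length first PC direction of the rows of X (no centering)."""
    # Rows of X are value-tuples (one per time tick); columns are streams.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[0]

def squared_projection_error(X, w):
    """Sum of squared distances from each row of X to the line spanned by unit vector w."""
    y = X @ w                      # offsets along the line (the hidden variable)
    residual = X - np.outer(y, w)  # what the line fails to explain
    return float(np.sum(residual ** 2))

# Hypothetical data: two temperature sensors that track each other closely.
rng = np.random.default_rng(0)
t1 = rng.uniform(20.0, 30.0, size=200)
t2 = t1 + rng.normal(0.0, 0.2, size=200)
X = np.column_stack([t1, t2])

w = first_principal_component(X)
best = squared_projection_error(X, w)
```

Any other unit direction through the origin should give an error at least as large as `best`, which is what "optimal" means here.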


2. Incremental update
Given number of hidden variables k

Assuming k is known, we know how to update the slope (detailed equations in paper)

For each new point $x$ and for $i = 1, \dots, k$:
– $y_i := w_i^T x$ (projection onto $w_i$)
– $d_i \leftarrow d_i + y_i^2$ (energy $\propto$ $i$-th eigenvalue)
– $e_i := x - y_i w_i$ (error)
– $w_i \leftarrow w_i + (1/d_i)\, y_i e_i$ (update estimate)
– $x \leftarrow x - y_i w_i$ (repeat with remainder)

[Figure: projection $y_1$ of $x$ onto $w_1$, the error $e_1$, and the updated $w_1$]
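The per-point update above can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: the function name `incremental_update`, the array layout, the in-place update, and the starting values in the usage (a unit vector for w_1 and a small positive d_1) are choices made here. Like the slide's equations, the sketch has no forgetting factor.

```python
import numpy as np

def incremental_update(W, d, x):
    """One update step, following the slide's equations.

    W : (k, n) float array, row i is the current estimate w_i
    d : (k,) float array, running energy d_i of each hidden variable
    x : (n,) array, the new value-tuple
    Returns the k hidden variables y for this tuple; W and d are updated in place.
    """
    x = np.asarray(x, dtype=float).copy()  # remainder after removing earlier components
    y = np.zeros(len(W))
    for i in range(len(W)):
        y[i] = W[i] @ x             # project onto w_i
        d[i] += y[i] ** 2           # accumulate energy
        e = x - y[i] * W[i]         # projection error
        W[i] += (y[i] / d[i]) * e   # rotate w_i toward the error
        x -= y[i] * W[i]            # repeat with the remainder
    return y

# Hypothetical usage: two correlated streams, k = 1.
rng = np.random.default_rng(1)
t = rng.uniform(20.0, 30.0, size=500)
X = np.column_stack([t, t + rng.normal(0.0, 0.2, size=500)])
W = np.eye(1, 2)        # assumed start: w_1 = first unit vector
d = np.full(1, 1e-3)    # assumed start: small positive energy
for x in X:
    y = incremental_update(W, d, x)
```

Each call touches every row of `W` once, so the cost per tuple is proportional to n times k and does not depend on how many tuples have been seen.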


Stream correlations

Step 1: How to capture correlations?

Step 2: How to do it incrementally, when we have a very large number of points?

Step 3: How to dynamically adjust k, the number of hidden variables?


3. Number of hidden variables

If we had three sensors with similar measurements
Again: points would lie on a line (i.e., one hidden variable, k=1), but in 3-D space

[Figure: value-tuple space with axes T1, T2, T3]


3. Number of hidden variables

Assume one sensor intermittently gets stuck
Now, no line can give a good approximation

[Figure: value-tuple space with axes T1, T2, T3]


3. Number of hidden variables

Assume one sensor intermittently gets stuck
Now, no line can give a good approximation
But a plane will do (two hidden variables, k = 2)

[Figure: value-tuple space with axes T1, T2, T3]


Number of hidden variables (PCs)

Keep track of energy maintained by approximation with k variables (PCs):
– Reconstruction accuracy, w.r.t. total squared error

Increment (or decrement) k if fraction of energy maintained goes below (or above) a threshold
– If below 95%, k ← k + 1
– If above 98%, k ← k − 1
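The threshold rule can be written as a small helper. This is a sketch under assumptions: the function name `adjust_k`, the guard that keeps k at 1 or more, and the optional `k_max` cap are additions for illustration and do not appear on the slide. The 95% and 98% defaults are the slide's thresholds.

```python
def adjust_k(k, total_energy, kept_energy, low=0.95, high=0.98, k_max=None):
    """Return the new number of hidden variables given the energy retained so far.

    total_energy : running sum of squared input values
    kept_energy  : running sum of squared hidden-variable values
    """
    if total_energy <= 0:
        return k  # nothing observed yet, so there is no fraction to compare
    fraction = kept_energy / total_energy
    if fraction < low and (k_max is None or k < k_max):
        return k + 1  # approximation too coarse: add a hidden variable
    if fraction > high and k > 1:
        return k - 1  # more than enough energy kept: drop a hidden variable
    return k
```

The gap between the two thresholds presumably acts as a dead band, so k stays put while the retained fraction sits between them.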


Missing values

[Figure: scatter plot of temperature T2 against temperature T1 showing the true values (pair), all possible value pairs (given only t1), and the best guess (given correlations: intersection)]
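One geometric reading of the picture, sketched under assumptions: fit the hidden variables to the coordinates that were observed, then read the missing coordinates off the line (or plane). The function name `fill_missing`, the NaN convention for missing entries, and the least-squares fit are choices made here for illustration; the slides do not spell out the procedure.

```python
import numpy as np

def fill_missing(W, x):
    """Estimate NaN entries of x from the observed ones via the hidden variables.

    W : (k, n) array whose rows span the current line or plane (the w_i)
    x : (n,) array with np.nan marking missing entries
    """
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    # Least-squares hidden variables that best explain the observed coordinates.
    y, *_ = np.linalg.lstsq(W[:, observed].T, x[observed], rcond=None)
    filled = x.copy()
    filled[~observed] = y @ W[:, ~observed]  # point on the line at that offset
    return filled

# Hypothetical usage: two sensors that move together, second reading missing.
W = np.array([[1.0, 1.0]]) / np.sqrt(2.0)
filled = fill_missing(W, [25.0, np.nan])
```

With a single observed coordinate and k = 1, the fit is exact and the estimate is the intersection of the line with the set of value pairs consistent with the known reading.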


Forecasting

Assume we want to forecast the next value for a particular stream (e.g. auto-regression)

[Figure: n streams, with the next value marked “?”]


Forecasting

Option 1: One complex model per stream
– Next value = function of previous values on all streams
– Captures correlations
– Too costly! [ ~ O(n^3) ]

[Figure: n streams]


Forecasting

Option 1: One complex model per stream

Option 2: One simple model per stream
– Next value = function of previous value on same stream
– Worse accuracy, but maybe acceptable
– But, still need n models

[Figure: n streams]


Forecasting

– k hidden vars: k << n and already capture correlations
– Only k simple models
– Efficiency & robustness

[Figure: n streams summarized by k hidden variables]
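A minimal sketch of forecasting through the hidden variables, assuming one AR(1) model per hidden variable as the "simple model". The AR(1) choice, the function names, and the array layout are assumptions made here; the slides only say that k simple models suffice.

```python
import numpy as np

def fit_ar1(series):
    """Least-squares AR(1) coefficient a in y[t] ~ a * y[t-1]."""
    prev, nxt = series[:-1], series[1:]
    return float(prev @ nxt / (prev @ prev))

def forecast_streams(Y, W):
    """Forecast the next value-tuple of all n streams from k hidden-variable histories.

    Y : (t, k) array of past hidden variables, one column per hidden variable
    W : (k, n) array of weights (the w_i) mapping hidden variables back to streams
    """
    coeffs = np.array([fit_ar1(Y[:, i]) for i in range(Y.shape[1])])
    y_next = coeffs * Y[-1]   # k one-step forecasts, one per hidden variable
    return y_next @ W         # map back to the n streams

# Hypothetical usage: one decaying hidden variable driving three streams.
Y = (0.9 ** np.arange(10)).reshape(-1, 1)
W = np.array([[1.0, 2.0, 3.0]])
forecast = forecast_streams(Y, W)
```

Only k coefficients are fitted, however many streams there are, and the mapping through `W` carries the correlations between streams.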


Time/space requirements
Incremental PCA

O(nk) space (total) and time (per tuple), i.e.,
– Independent of # points (t)
– Linear w.r.t. # streams (n)
– Linear w.r.t. # hidden variables (k)

In fact, can be done in real time [demo]


Overview

Method outline
Experiments


Experiments
Chlorine concentration

166 streams
2 hidden variables (~4% error)

[Figure: measurements and reconstruction] [CMU Civil Engineering]


Experiments
Chlorine concentration

[Figure: hidden variables] [CMU Civil Engineering]

– Both capture global, periodic pattern
– Second: ~ first, but “phase-shifted”
– Can express any “phase-shift”…
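One way to read the last remark, assuming for illustration that the two hidden variables behave like $\sin(\omega t)$ and $\cos(\omega t)$ (an idealization not stated on the slide; $a$, $\omega$, and $\varphi$ are symbols introduced here): any sinusoid of the same frequency, whatever its amplitude and phase, is a linear combination of the two.

```latex
a \sin(\omega t + \varphi) = (a \cos\varphi)\,\sin(\omega t) + (a \sin\varphi)\,\cos(\omega t)
```

Under that assumption, a stream with its own phase shift only needs its own pair of weights on the two hidden variables.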


Experiments
Light measurements

54 sensors
2-4 hidden variables (~6% error)

[Figure: measurement and reconstruction]


Experiments
Light measurements

– 1 & 2: main trend (as before)
– 3 & 4: potential anomalies and outliers

[Figure: hidden variables, with two labeled “intermittent”]


Experiments
Missing values

Correlations already captured by hidden variables
Provide information about missing values
– Quickly back on track, if mis-estimated

[Figure: reconstruct sensor 7 given everything else (via hidden variables)] [CMU ECE]


Experiments
Missing values

Correlations already captured by hidden variables
Provide information about missing values
– Quickly back on track, if mis-estimated

[Figure: reconstruct sensor 8 given everything else (via hidden variables)] [CMU ECE]


Wall-clock times

constant time per tuple and per stream

[Figure: three panels of wall-clock time (sec): time vs. stream size (time ticks t), time vs. # of streams (n), time vs. # of PCs (k)]


Conclusion

Many settings with hundreds of streams, but
– Stream values are, by nature, related
– In reality, there are only a few variables

Discover hidden variables for
– Summarization of main trends for users
– Efficient forecasting, spotting outliers/anomalies

Incremental, real-time computation
With limited memory


End

Thank you