Time Series Analysis

Data Warehousing and Data Mining: Time Series Analytics

Berlin, February 5, 2018

Patrick Schäfer

Lecture: https://hu.berlin/vl_dwhdm17
Exercise: https://hu.berlin/ue_dwhdm17

Motivation

• Temporal data is common in many data mining applications and data warehouses
• There is a time dimension in all M/ER models
• Application domains include:
  • Sensor data: environmental sensors measure temperature, pressure, humidity
  • Medical devices: electrocardiogram (ECG) and electroencephalogram (EEG)
  • Financial markets: stock prices, economic indicators, product sales
  • Meteorological data: sediments from drill holes, earth observation satellite data

Motivation

• Data analytics based on similarity is the tool for exploring these datasets
• Examples of similarity queries include:
  • Find all stocks that show similar trends
  • Find the most unusual heartbeat in a patient's ECG recording
  • Find frequent patterns in a bird sound recording
  • Find the patients with the most similar ECG recording
  • Find products with a similar sales pattern

Time Series Definition

• Definition: A time series is a sequence (ordered collection) of $n$ real values at timestamps $t_1, \dots, t_n$:

$T = (y_1, \dots, y_n)$

• Time series may be univariate or multivariate
  • Univariate: a single value $y_i$ is associated with each timestamp $t_i$.
  • Multivariate: $m$ values $y_i = (y_{i,1}, \dots, y_{i,m})$ are associated with each timestamp $t_i$.
• The dimensionality of a time series refers to the number of values at each timestamp

Time Series vs Sequence Data

• Temporal data may be discrete or real-valued (continuous)
• Time series contain real values, whereas sequence data (e.g. web click streams, DNA sequences, documents) refers to discrete data
• Different algorithms are applicable to sequence data: hashing, tries, Naïve Bayes, Boyer-Moore, vector space model, tf-idf, word embeddings, …
• In some cases a time series can be converted to a sequence by discretization, which makes sequence mining algorithms applicable

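Pre-processing: Missing Data

• It is convenient to have time series that are equally spaced and synchronized across timestamps
• However, it is common for time series to contain missing data
• One method is to apply linear interpolation, i.e. to estimate the (missing) values at the desired timestamps
• Linear interpolation: $y_{i-1}$ and $y_i$ are the values of the time series at times $t_{i-1}$ and $t_i$, and $t_{new}$ with $t_{i-1} \le t_{new} \le t_i$ is the timestamp of the missing value:

$y_{new} = y_{i-1} + \frac{t_{new} - t_{i-1}}{t_i - t_{i-1}} \cdot (y_i - y_{i-1})$

[Figure: the missing data point $(t_{new}, y_{new})$ is estimated on the line between $(t_{i-1}, y_{i-1})$ and $(t_i, y_i)$, yielding an equally spaced time series]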
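The interpolation formula can be illustrated with a minimal Python sketch. The function name `interpolate_linear` and its parameter names are assumptions made for this example and do not come from the lecture.

```python
def interpolate_linear(t_prev, y_prev, t_next, y_next, t_new):
    """Estimate y_new at t_new from the neighbouring samples
    (t_prev, y_prev) and (t_next, y_next)."""
    if t_next == t_prev:
        raise ValueError("the two known timestamps must differ")
    # Fraction of the way from t_prev to t_next at which t_new lies.
    fraction = (t_new - t_prev) / (t_next - t_prev)
    return y_prev + fraction * (y_next - y_prev)


# A reading is missing at t = 2; its neighbours are (1, 10.0) and (3, 14.0).
print(interpolate_linear(1, 10.0, 3, 14.0, 2))  # 12.0
```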

Pre-processing: Noise Removal

• Removes short-term fluctuation and noise
• Binning:
  • Divide the data into disjoint intervals of size $k$. Then calculate the mean values $\bar{T} = (\bar{y}_1, \dots, \bar{y}_w)$, with $w = \frac{n}{k}$, in each interval:

$\bar{y}_i = \frac{1}{k} \sum_{j=k(i-1)+1}^{k \cdot i} y_j$

  • This reduces the number of points by a factor of $k$
• Smoothing (moving averages):
  • Divide the data into overlapping intervals of size $k$ over which the averages are calculated
  • Thus, the average is computed at each timestamp, over the windows $[t_1, t_k], [t_2, t_{k+1}], \dots$, rather than only at the interval boundaries $[t_1, t_k], [t_{k+1}, t_{2k}], \dots$
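The difference between binning and moving averages can be seen in a short Python sketch. The function names are hypothetical, and dropping a trailing partial interval in `binning` is an assumption of this example.

```python
def binning(values, k):
    """Mean of each disjoint interval of size k.
    Assumption: a trailing partial interval is dropped."""
    w = len(values) // k
    return [sum(values[i * k:(i + 1) * k]) / k for i in range(w)]


def moving_average(values, k):
    """Mean of every overlapping window of size k."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


series = [1.0, 3.0, 2.0, 4.0, 6.0, 8.0]
print(binning(series, 2))         # [2.0, 3.0, 7.0]
print(moving_average(series, 2))  # [2.0, 2.5, 3.0, 5.0, 7.0]
```

Binning shrinks the series from 6 to 3 points, while the moving average keeps almost all timestamps and only smooths them.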

Pre-processing: Normalization

• Time series need to be normalized, especially when different sensors are used to record the data
• Normalization puts the data on the same scale to make comparisons meaningful (e.g. Fahrenheit and Celsius)
• Two methods are commonly used:
  • Z-normalization (generally preferred)
  • Range-based normalization

Pre-processing: Z-Normalization

• Z-normalization: Let $\mu$ and $\sigma$ be the mean and standard deviation of the time series $T = (y_1, \dots, y_n)$, then

$znorm(T) = (y'_1, \dots, y'_n)$ with $y'_i = \frac{y_i - \mu}{\sigma}$

Pre-processing: Range-based Normalization

• Range-based normalization: The minimum $min$ and maximum $max$ values over all time series are determined, then each value of the time series $T = (y_1, \dots, y_n)$ is mapped to the range $[0, 1]$ by:

$range\_norm(T) = (y'_1, \dots, y'_n)$ with $y'_i = \frac{y_i - min}{max - min}$
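Both normalizations can be sketched in a few lines of Python. The function names are hypothetical. The sketch assumes the population standard deviation, maps a constant series to zeros, and takes the minimum and maximum from the single series passed in rather than from a whole dataset.

```python
import math


def z_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((y - mean) ** 2 for y in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]  # assumption: constant series maps to zeros
    return [(y - mean) / std for y in values]


def range_normalize(values):
    """Map the values linearly to [0, 1] using the series' own min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # assumption: constant series maps to zeros
    return [(y - lo) / (hi - lo) for y in values]


celsius = [0.0, 10.0, 20.0]
fahrenheit = [32.0, 50.0, 68.0]  # the same temperatures on another scale
print(range_normalize(celsius))     # [0.0, 0.5, 1.0]
print(range_normalize(fahrenheit))  # [0.0, 0.5, 1.0]
```

After either normalization the Celsius and Fahrenheit recordings coincide (up to floating-point error for the z-normalization), which is the comparison the slides motivate.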

Time Series Similarity

• At the core of time series data analytics there are:
  • a time series representation.
  • a similarity measure to compare two time series.

Similarity Measures (Distance Measures)

• The similarity of two time series $Q$ and $T$ is expressed in terms of a real value using a distance measure: $D(Q, T) \to \mathbb{R}_0^+$
• A similarity measure is the inverse of a distance measure: a distance measure assigns a small (large) value to similar (dissimilar) time series
• Time series similarity measures are designed with application-specific goals
• The most common measures are the Euclidean distance (ED) and the Dynamic Time Warping distance (DTW)
• Other common distance measures are Longest Common Subsequence and Edit Distance

Euclidean Distance (ED)

• Definition: The Euclidean distance between two time series $Q = (q_1, \dots, q_n)$ and $C = (c_1, \dots, c_n)$, both of length $n$, is defined as:

$D_{ED}(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$

• The ED applies a linear alignment of the time axis
• ED cannot cope with variable-length time series
• ED runtime is $O(n)$
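A direct Python transcription of the definition might look as follows. The function name is an assumption of this sketch, and the length check reflects the statement that ED cannot cope with variable-length time series.

```python
import math


def euclidean_distance(q, c):
    """Euclidean distance between two equal-length time series, O(n)."""
    if len(q) != len(c):
        raise ValueError("ED requires time series of equal length")
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))


print(euclidean_distance([0.0, 3.0], [4.0, 0.0]))  # 5.0
```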

Dynamic Time Warping (DTW)

• Dynamic Time Warping applies an elastic transformation of the time axis to detect similar shapes that have a different phase
• This is essentially a peak-to-peak and valley-to-valley alignment of two time series
• Intuition: DTW can be thought of as an extension of the ED that uses two indices $i$ and $j$, one for each time axis
• These indices are incremented independently:

$D_{DTW}(Q, C) = \sqrt{\sum_{(i,j)} (q_i - c_j)^2}$

Dynamic Time Warping (Cost Matrix)

• DTW starts by computing a cost matrix $M$ with the distances between all pairs of values in $Q = (q_1, \dots, q_n)$ and $C = (c_1, \dots, c_n)$
• This matrix has $n^2$ entries and is given by:

$M_{i,j} = (q_i - c_j)^2, \quad i, j \in 1 \dots n$

• DTW then searches for the optimal warping path

Dynamic Time Warping (Warping Path)

• Definition: A warping path $p$ is defined as a sequence of tuples $(i, j)$ that describes a traversal of the cost matrix $M$, where $i$ and $j$ represent indices of the values in $Q = (q_1, \dots, q_n)$ and $C = (c_1, \dots, c_n)$. Two examples:

$p_1 = ((1,1), (2,2), \dots, (n-1,n-1), (n,n))$
$p_2 = ((1,1), (1,2), (\underbrace{1}_{i}, \underbrace{3}_{j}), (2,3), \dots, (n-1,n), (n,n))$

• A valid warping path has to satisfy two conditions:
  • The start has to be $(1,1)$ and the end has to be $(n,n)$
  • The path may proceed by a maximum of one index per step: $0 \le i_{a+1} - i_a \le 1$ and $0 \le j_{a+1} - j_a \le 1$ for all consecutive path positions $a$ and $a+1$
• ED is the special case of the diagonal in the cost matrix

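Dynamic Time Warping (Distance)

• Definition: The DTW distance is defined as the cost of the warping path with the minimal total cost through the cost matrix $M$:

$D_{DTW}(Q, C) = \min \left\{ \sqrt{\sum_{(i,j) \in p} (q_i - c_j)^2} \;\middle|\; p \text{ is a valid warping path through } M \right\}$

• As DTW is a peak-to-peak and valley-to-valley alignment of two time series, it may produce suboptimal results if the two time series have a different number of peaks and valleys
• DTW runtime is $O(n^2)$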

Dynamic Time Warping (Algorithm)

real DTWDistance(TimeSeries q[1..n], TimeSeries c[1..n]) {
    // cost matrix; row 0 and column 0 act as a border
    M := array [0..n, 0..n]
    for i := 1 to n
        M[i, 0] := infinity
    for i := 1 to n
        M[0, i] := infinity
    M[0, 0] := 0

    // find optimal path
    for i := 1 to n
        for j := 1 to n
            cost := D(q[i], c[j])        // measure distance, e.g. (q[i] - c[j])^2
            M[i, j] := cost + minimum(
                M[i-1, j  ],             // insertion
                M[i  , j-1],             // deletion
                M[i-1, j-1])             // match
    return M[n, n]                       // take the square root if D is the squared difference
}
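The pseudocode translates almost line by line into Python. The following is a minimal sketch, assuming the squared difference as the point-wise cost and a final square root so that the result matches the DTW distance definition; the function name is hypothetical.

```python
import math


def dtw_distance(q, c):
    """DTW distance via dynamic programming over the cost matrix, O(n^2)."""
    n, m = len(q), len(c)
    # Row 0 and column 0 form an infinite border; M[0][0] is the start.
    M = [[math.inf] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2  # assumption: squared difference
            M[i][j] = cost + min(M[i - 1][j],      # insertion
                                 M[i][j - 1],      # deletion
                                 M[i - 1][j - 1])  # match
    return math.sqrt(M[n][m])


# Same shape, shifted by one timestamp.
q = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
print(dtw_distance(q, c))  # 0.0, whereas the Euclidean distance is 2.0
```

The example shows the peak-to-peak alignment: the phase shift costs nothing under DTW, while the linear alignment of the ED penalizes it.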

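Dynamic Time Warping (Warping Window)

• A warping window constraint can be set to reduce the search space (computational complexity)
• It defines the amount of warping allowed between each pair of points on the warping path $p$:

$|i - j| \le r \cdot n \quad \forall (i, j) \in p$

• DTW with a warping window constraint $r$ has a computational complexity of $O(n \cdot rn)$, since each of the $n$ rows of the cost matrix contains at most $2rn + 1$ admissible cells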
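A warping window only changes the range of the inner loop. The sketch below is one possible reading of the constraint $|i - j| \le r \cdot n$. The function name, the rounding of $r \cdot n$ down to an integer band width, and the squared-difference cost are assumptions. For brevity it still allocates the full matrix, so only the running time, not the memory, is reduced.

```python
import math


def dtw_distance_windowed(q, c, r):
    """DTW restricted to cells with |i - j| <= r * n (r is a fraction of n)."""
    n = len(q)
    if len(c) != n:
        raise ValueError("this sketch assumes equal-length time series")
    band = int(r * n)  # assumption: band width is rounded down
    M = [[math.inf] * (n + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit the columns inside the warping window of row i.
        for j in range(max(1, i - band), min(n, i + band) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            M[i][j] = cost + min(M[i - 1][j], M[i][j - 1], M[i - 1][j - 1])
    return math.sqrt(M[n][n])


q = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
c = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
print(dtw_distance_windowed(q, c, 0.0))  # 2.0, only the diagonal: equals the ED
print(dtw_distance_windowed(q, c, 0.2))  # 0.0, a shift by one index is allowed
```

With $r = 0$ the path is forced onto the diagonal, which reproduces the statement that ED is a special case of DTW.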

Similarity: Whole vs. Subsequence Matching

• Whole matching: the distance is calculated between two whole time series
• Subsequence matching: searches for the best-matching subsequence within a longer time series
• Complexity for time series length $n$ and subsequence length $w$ with Euclidean distance:
  • Whole matching: $O(n)$
  • Subsequence matching: $O(w \cdot (n - w))$

[Figure: Whole matching compares a query Q with each time series T of a dataset DS using the similarity D(Q, T). Subsequence matching slides a window S along a long time series C and computes the similarity D(Q, S) to find the most similar subsequence.]
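Subsequence matching with a sliding window can be sketched as follows. The function name is hypothetical and offsets are 0-based, unlike the 1-based indices on the slides.

```python
import math


def best_subsequence_match(series, query):
    """Slide a window of len(query) along the series and return the
    (0-based offset, Euclidean distance) of the most similar subsequence."""
    w = len(query)
    best_offset, best_dist = None, math.inf
    for offset in range(len(series) - w + 1):
        window = series[offset:offset + w]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(window, query)))
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist


series = [0.0, 0.1, 0.0, 1.0, 2.0, 1.0, 0.0, 0.1]
query = [1.0, 2.0, 1.0]
print(best_subsequence_match(series, query))  # (3, 0.0)
```

The loop visits $n - w + 1$ windows and spends $O(w)$ per window, which gives the stated complexity for subsequence matching.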

Time Series Data Transformations

• A variety of methods exist for reducing the dimensionality of time series (e.g. for noise filtering or faster processing)
• Real-valued methods transform a time series into a smaller number of numeric values, and symbolic methods transform it into discrete values (sequences)

Piecewise Aggregate Approximation (PAA)

• Intuition: to represent a time series of length $n$, the data is divided into $w$ equal-sized intervals and the mean value in each interval is calculated
• PAA: a time series $T = (y_1, \dots, y_n)$ of length $n$ is represented by a $w$-dimensional sequence of mean values $\bar{T} = (\bar{y}_1, \dots, \bar{y}_w)$, where the $i$-th element is calculated as:

$\bar{y}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} y_j$

PAA example values (from the figure): 0.08, -0.08, -0.24, 0.18, 0.45, 0.34, 0.00, -0.19, 0.05, 0.90, 0.03, -0.32, -0.70, -0.21, 0.30, -1.72
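The PAA formula amounts to averaging $\frac{n}{w}$ consecutive values per segment. A minimal Python sketch, with a hypothetical function name and the simplifying assumption that $w$ divides $n$:

```python
def paa(values, w):
    """Piecewise Aggregate Approximation: w segment means of a series."""
    n = len(values)
    if n % w != 0:
        raise ValueError("this sketch assumes that w divides the series length")
    k = n // w  # number of values per segment
    return [sum(values[i * k:(i + 1) * k]) / k for i in range(w)]


print(paa([1.0, 3.0, 2.0, 4.0, 6.0, 8.0, 5.0, 5.0], 4))  # [2.0, 3.0, 7.0, 5.0]
```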

Discrete Fourier Transform (DFT)

• Intuition: each series of length $n$ can be expressed as a linear combination of smooth periodic sinusoidal series
• Each wave is represented by a complex number (Fourier coefficient)
• This representation is called the frequency domain
• The DFT concentrates most of the energy in the first few Fourier coefficients
• Low-pass filter: one can approximate a time series by its first $w$ Fourier coefficients

DFT example values (from the figure): 0, -8.81, -20.7, -11.9, -6.28, -8.02, -0.67, 15.31, -18.7, -18.36, -5.67, 16.84, -8.919, -23.80, 10.70, -21.92, 25.255, -1.321, ...


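• The DFT decomposes a time series $T$ of length $n$ into a sum of $n$ orthogonal basis functions using sinusoid waves
• A Fourier coefficient (sinusoid wave) is represented by the complex number $X_k = (real_k, imag_k)$, for $k = 0, 1, \dots, n-1$
• The $n$-point DFT of a time series $T = (y_1, \dots, y_n)$ is then given by:

$DFT(T) = X_0, \dots, X_{n-1} = (real_0, imag_0, \dots, real_{n-1}, imag_{n-1})$

• with $X_k = \frac{1}{n} \sum_{i=1}^{n} y_i \cdot e^{-\mathrm{j} 2 \pi i k / n}$, for $k \in [0, n)$, where $\mathrm{j} = \sqrt{-1}$ is the imaginary unit (not to be confused with the summation index $i$)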


• The first Fourier coefficient is equal to the mean of a time series and can be discarded to obtain offset invariance:

$X_0 = \frac{1}{n} \sum_{i=1}^{n} y_i \cdot e^{0}$

• Using only the first few Fourier coefficients is equivalent to low-pass filtering (smoothing) a time series
• Computational complexity: The Fast Fourier Transform has a computational complexity of $O(n \log n)$ to compute the DFT of a time series of length $n$
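The definition can be checked with a naive $O(n^2)$ Python sketch; it is not the FFT. The function names are hypothetical, and the sketch uses a 0-based sample index, which differs from the 1-based sum above only by a phase factor and leaves $X_0$ unchanged.

```python
import cmath


def dft(values):
    """Naive DFT with 1/n normalization, O(n^2). Assumption: 0-based sample index."""
    n = len(values)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * t * k / n) for t, y in enumerate(values)) / n
        for k in range(n)
    ]


def low_pass(values, w):
    """Approximate a series by its first w Fourier coefficients."""
    return dft(values)[:w]


series = [1.0, 2.0, 3.0, 4.0]
coefficients = dft(series)
print(coefficients[0])  # approximately (2.5+0j), the mean of the series
print(len(low_pass(series, 2)))  # 2
```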

Symbolic Aggregate Approximation (SAX)

• SAX converts the time series into a discrete sequence of symbols by:
  • X-axis: applying dimensionality reduction using the PAA transformation
  • Y-axis: transforming each PAA value to a symbol using discretization
• Values sampled from a z-normalized time series approximately follow a Gaussian distribution
• SAX applies discretization based on the Gaussian distribution to produce equi-probable symbols
• A real-valued time series is then represented by a word

Excerpt from: Lin et al.: Experiencing SAX: a novel symbolic representation of time series (p. 116)

Fig. 5: A time series is discretized by first obtaining a PAA approximation and then using predetermined breakpoints to map the PAA coefficients into SAX symbols. In the example above, with n = 128, w = 8 and a = 3, the time series is mapped to the word baabccbc.

[...] the time series to a binary vector. They demonstrated that discretizing the time series before clustering significantly improves the accuracy in the presence of outliers. We note that “clipping” is actually a special case of SAX, where a = 2.

3.3 Distance measures

Having introduced the new representation of time series, we can now define a distance measure on it. By far the most common distance measure for time series is the Euclidean distance (Keogh and Kasetty 2002; Reinert et al. 2000). Given two time series Q and C of the same length n, Eq. 3 defines their Euclidean distance, and Fig. 6A illustrates a visual intuition of the measure.

$D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$ (3)

If we transform the original subsequences into PAA representations, $\bar{Q}$ and $\bar{C}$, using Eq. 1, we can then obtain a lower bounding approximation of the Euclidean distance between the original subsequences by

$DR(\bar{Q}, \bar{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2}$ (4)

This measure is illustrated in Fig. 6B. If we further transform the data into the symbolic representation, we can define a MINDIST function that returns the minimum distance between the original time series of two words:

$MINDIST(\hat{Q}, \hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (dist(\hat{q}_i, \hat{c}_i))^2}$ (5)

The function resembles Eq. 4 except for the fact that the distance between the two PAA coefficients has been replaced with the sub-function dist(). The dist() function can be implemented using a table lookup as illustrated in Table 4.
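The two SAX steps (PAA on the X-axis, Gaussian breakpoints on the Y-axis) can be combined in a short Python sketch. The function name, the default alphabet, the use of the population standard deviation, and the requirement that $w$ divides $n$ are assumptions of this example. The breakpoints are derived from the standard normal quantiles via `statistics.NormalDist` rather than from a fixed lookup table.

```python
import math
from statistics import NormalDist


def sax(values, w, alphabet="abc"):
    """Convert a time series into a SAX word of length w."""
    n = len(values)
    if n % w != 0:
        raise ValueError("this sketch assumes that w divides the series length")
    # z-normalize so that the Gaussian breakpoints are meaningful
    mean = sum(values) / n
    std = math.sqrt(sum((y - mean) ** 2 for y in values) / n)
    normed = [(y - mean) / std if std > 0 else 0.0 for y in values]
    # X-axis: PAA segment means
    k = n // w
    means = [sum(normed[i * k:(i + 1) * k]) / k for i in range(w)]
    # Y-axis: breakpoints that cut N(0, 1) into equi-probable regions
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / a) for i in range(1, a)]
    # The symbol index is the number of breakpoints below the segment mean.
    return "".join(alphabet[sum(bp < v for bp in breakpoints)] for v in means)


print(sax([-2.0, -2.0, 0.0, 0.0, 2.0, 2.0, 0.0, 0.0], w=4))  # abcb
```

For an alphabet of size 3 the breakpoints are roughly -0.43 and 0.43, so low segments map to `a`, segments near zero to `b`, and high segments to `c`.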


Symbolic Fourier Approximation (SFA)

• SFA represents each real-valued time series by a word
• SFA is composed of
  a) approximation using the Fourier transform and
  b) a data-adaptive discretization
• The discretization intervals are learned from the distribution of the Fourier-transformed data rather than using fixed intervals

Example (from the figure):
Raw: 0.2679, 0.2480, 0.1828, 0.0817, 0.0051, -0.023, -0.052, -0.082, -0.111, -0.075, -0.032, -0.022, -0.029, [...]
DFT: 0, -8.81, -20.7, -11.9, -6.28, -8.02, -0.67, 15.31, -18.7, -18.36, -5.67, -16.84, -8.919, [...]
Discretization: C B B C C D C B B C B C B [...]
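The idea of learning discretization intervals from the data can be sketched in Python. This is only an illustration under stated assumptions, not the lecture's implementation. Equi-depth (quantile) binning per feature dimension is one possible way to learn the intervals, the mean coefficient $X_0$ is skipped, and all function names are hypothetical.

```python
import cmath


def dft_features(values, num_coefficients):
    """Real and imaginary parts of the first Fourier coefficients.
    Assumption: X_0 (the mean) is skipped."""
    n = len(values)
    features = []
    for k in range(1, num_coefficients + 1):
        x_k = sum(y * cmath.exp(-2j * cmath.pi * t * k / n)
                  for t, y in enumerate(values)) / n
        features.extend([x_k.real, x_k.imag])
    return features


def learn_breakpoints(training_features, alphabet_size):
    """Assumption: equi-depth breakpoints, learned separately per dimension."""
    breakpoints = []
    for dimension in zip(*training_features):
        ordered = sorted(dimension)
        step = len(ordered) / alphabet_size
        breakpoints.append([ordered[int(i * step)] for i in range(1, alphabet_size)])
    return breakpoints


def sfa_word(features, breakpoints, alphabet="ABCD"):
    """Map each feature to the symbol of the learned interval it falls into."""
    return "".join(
        alphabet[sum(bp <= value for bp in bps)]
        for value, bps in zip(features, breakpoints)
    )


training = [[0.0, 1.0, 0.0, -1.0], [0.0, 2.0, 0.0, -2.0],
            [1.0, 0.0, -1.0, 0.0], [0.0, 0.5, 1.0, 0.5]]
features = [dft_features(series, 1) for series in training]
breakpoints = learn_breakpoints(features, alphabet_size=2)
print([sfa_word(f, breakpoints, alphabet="AB") for f in features])
```

In contrast to the SAX sketch, the breakpoints here depend on the training data, so the same Fourier value can map to different symbols for different datasets.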

Comparison of SFA and SAX

• Discretization and approximation cause loss of information
• The higher the number of symbols and the alphabet size, the more exact the representation

Properties of Symbolic Representations

• Noise removal
  • SAX: using PAA and quantization.
  • SFA: using the DFT (low-pass filter) and quantization.
• String representation
  • Allows string-domain algorithms such as hashing or the bag-of-words model to be applied
• Dimensionality reduction
  • Allows indexing high-dimensional data using the iSAX index (SAX) or the SFA trie (SFA)
• Storage reduction
  • Sequences have a much lower memory footprint than real-valued time series, e.g. less than 1 byte (“Char”) vs 8 bytes (“Double”) for each timestamp

Time Series Data Analytics Tasks

[Figure: overview of time series analytics tasks with the panels Motif Discovery, Classification, Clustering, Discords (abnormal subsequences) and Query. The panels mention Euclidean Distance, DTW and BOSS, the classes Bell, Cylinder and Funnel, and the labels caffeine and chlorogenic acid.]

Motifs

• A motif is a frequently occurring pattern or shape in a time series
• Distance-based (“exact” motifs)
  • A subsequence of a time series is said to support a motif if the distance between the subsequence and the motif is less than a threshold
• Sequential pattern mining (“approximate” motifs)
  • Discretization is applied to convert the time series into a sequence. Motif discovery borrows methods from sequential pattern mining

[Figure, from Jonas Spenger's Bachelor Thesis: “Results - Quality”. 85 planted motif time series with 8 planted sequences (red) per time series. Task: recover the planted motifs (red regions). Sequences (red) from: Chen, Yanping, et al. "The UCR time series classification archive."]

Distance-based Motifs

• These are commonly defined as contiguous subsequences of a time series
• Approximate distance match
  • A motif $S = (s_1, \dots, s_w)$ is said to approximately match a contiguous subsequence of the time series $T = (y_1, \dots, y_n)$ at position $i$, with $w \le n$, if the distance between $(s_1, \dots, s_w)$ and $(y_i, \dots, y_{i+w-1})$ is at most $\epsilon$.
  • The most common distance measure used is the Euclidean distance
• Motif count: the number of positions in the time series at which the motif matches within the threshold $\epsilon$
• A typical goal is to find the most frequent motifs

Naïve Algorithm

Motif findBestMotif(TimeSeries (y1,…,yn), WindowLength w, Threshold epsilon)
begin
    for i := 1 to n-w+1 do
        candidate_motif := (yi,…,yi+w-1);
        count of candidate_motif := 0;
        for j := 1 to n-w+1 do
            D := computeDistance(candidate_motif, (yj,…,yj+w-1));
            if (D <= epsilon) and (non-trivial match) then
                increment count of candidate_motif by 1
            end if
        end for
        if (candidate_motif has highest count so far) then
            update best_candidate to candidate_motif;
        end if
    end for
    return best_candidate;
end

• The naïve algorithm extracts all candidate motifs of length $w$ from a time series and computes the distance to all offsets
• The number of matches is counted
• Trivial matches are overlapping subsequences (e.g. $i = j$)
• The approach requires a nested loop in which each loop runs over the $n - w + 1$ offsets of the time series, thus $O(n^2)$ distance computations
• Total complexity for the Euclidean distance: $O(n^2 w)$
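The naïve algorithm can be written as a runnable Python sketch. The function name and return value are assumptions of this example. Offsets are 0-based, and a match counts as trivial whenever the two windows overlap, i.e. $|i - j| < w$, which is one common reading of the non-trivial match condition.

```python
import math


def find_best_motif(series, w, epsilon):
    """Return (offset, count) of the length-w subsequence with the most
    non-trivial matches within Euclidean distance epsilon."""
    n = len(series)
    best_offset, best_count = None, -1
    for i in range(n - w + 1):
        candidate = series[i:i + w]
        count = 0
        for j in range(n - w + 1):
            if abs(i - j) < w:  # trivial match: the windows overlap
                continue
            dist = math.sqrt(sum((a - b) ** 2
                                 for a, b in zip(candidate, series[j:j + w])))
            if dist <= epsilon:
                count += 1
        if count > best_count:  # ties keep the earliest candidate
            best_offset, best_count = i, count
    return best_offset, best_count


# The pattern (0, 1, 0) occurs at offsets 0, 5 and 10.
series = [0.0, 1.0, 0.0, 5.0, 5.0, 0.0, 1.0, 0.0, 7.0, 7.0, 0.0, 1.0, 0.0]
print(find_best_motif(series, w=3, epsilon=0.1))  # (0, 2)
```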

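Ideas to Speed Up Motif Discovery

• Compute a fast lower bound for the distance
  • If the lower bound is greater than epsilon, then the real distance does not have to be computed
• Approaches:
  • Use PAA and compute the distance only for the mean values, where $X'$ and $Y'$ are the PAA representations of $X$ and $Y$ and $m$ is the interval length:

$Dist(X, Y) \ge \sqrt{m} \cdot Dist(X', Y')$

  • Use SAX and a precomputed lookup table for distances between symbols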
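The PAA lower bound can be used to skip expensive distance computations. The following Python sketch uses hypothetical function names and assumes that the number of segments $w$ divides the series length, so that $m = n / w$ is the interval length.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa(values, w):
    """Segment means; assumes that w divides len(values)."""
    k = len(values) // w
    return [sum(values[i * k:(i + 1) * k]) / k for i in range(w)]


def paa_lower_bound(x, y, w):
    """sqrt(m) * Dist(X', Y'), which never exceeds Dist(X, Y)."""
    m = len(x) // w  # interval length
    return math.sqrt(m) * euclidean(paa(x, w), paa(y, w))


def within_threshold(x, y, w, epsilon):
    """Compute the real distance only if the lower bound cannot rule out a match."""
    if paa_lower_bound(x, y, w) > epsilon:
        return False  # pruned without computing the real distance
    return euclidean(x, y) <= epsilon


x = [1.0, 3.0, 2.0, 4.0]
y = [2.0, 2.0, 5.0, 5.0]
print(paa_lower_bound(x, y, 2) <= euclidean(x, y))  # True
print(within_threshold(x, y, 2, epsilon=1.0))       # False
```

In the example the lower bound is about 2.83 and the real distance about 3.46, so a threshold of 1.0 is rejected by the cheap bound alone.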

Relation to Sequential Pattern Mining

• First, convert the time series into a sequence using, for example, SAX
• Now apply string mining algorithms
• Itemset mining: discover frequent itemsets using association rule mining (ARM) algorithms. The A-Priori algorithm will be presented in the lecture

Questions?
