Data Warehousing and Data Mining: Time Series Analytics
Berlin, February 5, 2018
Patrick Schäfer
Lecture: https://hu.berlin/vl_dwhdm17
Exercise: https://hu.berlin/ue_dwhdm17
Motivation
• Temporal data is common in many data mining applications and data warehouses
• There is a time dimension in all M/ER models
• Application domains include:
  • Sensor data: environmental sensors measure temperature, pressure, humidity
  • Medical devices: electrocardiogram (ECG) and electroencephalogram (EEG)
  • Financial market: stock prices, economic indicators, product sales
  • Meteorological data: sediments from drill holes, earth observation satellite data
• Data analytics based on similarity is the tool for exploring these datasets
• Examples of similarity queries include:
  • Find all stocks that show similar trends
  • Find the most unusual heartbeat in a patient's ECG recording
  • Find frequent patterns in a bird sound recording
  • Find the patients with the most similar ECG recording
  • Find products with a similar sales pattern
Time Series Definition
• Definition: A time series is a sequence (ordered collection) of $n$ real values recorded at timestamps $t_1, \ldots, t_n$:
  $T = (y_1, \ldots, y_n)$
• Time series may be univariate or multivariate
  • Univariate: a single value $y_i$ is associated with each timestamp $t_i$.
  • Multivariate: $m$ values $y_i = (y_{i,1}, \ldots, y_{i,m})$ are associated with each timestamp $t_i$.
• The dimensionality of a time series refers to the number of values at each timestamp
Time Series vs Sequence Data
• Temporal data may be discrete or real-valued (continuous)
• Time series contain real values, whereas sequence data (e.g. web click streams, DNA sequences, documents) refers to discrete data
• Different algorithms are applicable to sequence data: hashing, tries, Naïve Bayes, Boyer-Moore, vector space model, tf-idf, word embeddings, …
• In some cases a time series can be converted to a sequence by discretization, which makes sequence mining algorithms applicable
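Pre-processing: Missing Data
• It is convenient to have time series that are equally spaced and synchronized across timestamps
• However, it is common for time series to contain missing data
• One method is to apply linear interpolation, i.e. to estimate the (missing) values at the desired timestamps
• Linear interpolation: let $y_{i-1}$ and $y_i$ be the values of the time series at times $t_{i-1}$ and $t_i$; the value at a new timestamp $t_{new}$ with $t_{i-1} < t_{new} < t_i$ is estimated as
  $y_{new} = y_{i-1} + (t_{new} - t_{i-1}) \cdot \frac{y_i - y_{i-1}}{t_i - t_{i-1}}$
[Figure: an equally spaced time series with missing data; the missing value $y_{new}$ at $t_{new}$ lies on the straight line between $(t_{i-1}, y_{i-1})$ and $(t_i, y_i)$.]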
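
The interpolation formula can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture material: the function names `interpolate_linear` and `fill_missing` are hypothetical, missing values are assumed to be marked with `None`, and the first and last value are assumed to be known (no extrapolation).

```python
def interpolate_linear(t_prev, y_prev, t_next, y_next, t_new):
    """Estimate the value at t_new from the two neighbouring observations."""
    if t_next == t_prev:
        raise ValueError("timestamps must differ")
    slope = (y_next - y_prev) / (t_next - t_prev)
    return y_prev + (t_new - t_prev) * slope


def fill_missing(timestamps, values):
    """Replace None entries by linear interpolation between known neighbours.

    Assumption: the first and the last value are known, so every gap
    has a known value on both sides.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            filled[i] = interpolate_linear(
                timestamps[left], values[left],
                timestamps[right], values[right],
                timestamps[i],
            )
    return filled
```

For example, `fill_missing([0, 1, 2, 3], [1.0, None, None, 4.0])` fills the two gaps with values on the line from 1.0 to 4.0.
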
Pre-processing: Noise Removal
• Removes short-term fluctuations and noise
• Binning:
  • Divide the data into disjoint intervals of size $k$. Then calculate the mean values $\bar{T} = (\bar{y}_1, \ldots, \bar{y}_w)$, with $w = \frac{n}{k}$, in each interval:
    $\bar{y}_i = \frac{1}{k} \sum_{j=(i-1)k+1}^{ik} y_j$
  • This reduces the number of points by a factor of $k$
• Smoothing (moving averages):
  • Divide the data into overlapping intervals of size $k$ over which the averages are calculated
  • Thus, the average is computed over the windows $[t_1, t_k], [t_2, t_{k+1}], \ldots$ starting at each timestamp, rather than only over the disjoint intervals $[t_1, t_k], [t_{k+1}, t_{2k}], \ldots$
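
The difference between the two methods is easiest to see side by side. The following is a small sketch with hypothetical function names; dropping a trailing partial interval in `binning` is an assumption made here for simplicity.

```python
def binning(values, k):
    """Mean of each disjoint block of k values.

    Assumption: a trailing block with fewer than k values is dropped.
    """
    w = len(values) // k
    return [sum(values[i * k:(i + 1) * k]) / k for i in range(w)]


def moving_average(values, k):
    """Mean of every overlapping window of k consecutive values."""
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]
```

Binning shrinks a series of length n to n / k points, while the moving average keeps n - k + 1 points.
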
Pre-processing: Normalization
• Time series need to be normalized, especially when different sensors are used to record the data
• Normalization puts the data on the same scale to make comparisons meaningful (e.g. Fahrenheit and Celsius)
• Two methods are commonly used:
  • Z-normalization (generally preferred)
  • Range-based normalization
Pre-processing: Z-Normalization
• Z-normalization: Let $\mu$ and $\sigma$ be the mean and standard deviation of the time series $T = (y_1, \ldots, y_n)$, then
  $znorm(T) = (y'_1, \ldots, y'_n)$ with $y'_i = \frac{y_i - \mu}{\sigma}$
Pre-processing: Range-based Normalization
• Range-based normalization: The minimum and maximum values $min$ and $max$ over all time series are determined, then each value of the time series $T = (y_1, \ldots, y_n)$ is mapped to the range $[0, 1]$ by:
  $range\_norm(T) = (y'_1, \ldots, y'_n)$ with $y'_i = \frac{y_i - min}{max - min}$
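
Both normalizations can be sketched as follows. The function names are hypothetical. Two choices are assumptions of this sketch rather than statements from the slides: the population standard deviation (division by n) is used, and a constant series is mapped to all zeros to avoid a division by zero.

```python
import math


def z_normalize(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in values) / n)
    if sigma == 0:
        return [0.0] * n  # assumption: constant series becomes all zeros
    return [(y - mu) / sigma for y in values]


def range_normalize(values, lo=None, hi=None):
    """Map values to [0, 1]; lo and hi default to the min and max of this series.

    Pass lo and hi explicitly to use the global minimum and maximum
    over a whole collection of time series.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    if hi == lo:
        return [0.0] * len(values)  # assumption: constant series becomes all zeros
    return [(y - lo) / (hi - lo) for y in values]
```
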
Time Series Similarity
• At the core of time series data analytics there are:
  • a time series representation.
  • a similarity measure to compare two time series.
Similarity Measures (Distance Measures)
• The similarity of two time series $Q$ and $T$ is expressed in terms of a real value using a distance measure: $D(Q, T) \rightarrow \mathbb{R}_0^+$
• A similarity measure is the inverse of a distance measure: a distance measure assigns a small value to similar time series and a large value to dissimilar ones
• Time series similarity measures are designed with application-specific goals
• The most common methods are Euclidean distance (ED) and Dynamic Time Warping distance (DTW)
• Other common distance measures are Longest Common Subsequence and Edit Distance
Euclidean Distance (ED)
• Definition: The Euclidean distance between two time series $Q = (q_1, \ldots, q_n)$ and $C = (c_1, \ldots, c_n)$, both of length $n$, is defined as:
  $D_{ED}(Q, C) = \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$
• The ED applies a linear (one-to-one) alignment of the time axis
• ED cannot cope with time series of different lengths
• ED runtime is $O(n)$
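
As a minimal sketch, the definition translates directly into Python. The function name is hypothetical, and raising an error for unequal lengths is one possible way to reflect that ED is undefined in that case.

```python
import math


def euclidean_distance(q, c):
    """Euclidean distance between two equal-length time series."""
    if len(q) != len(c):
        raise ValueError("ED requires time series of equal length")
    return math.sqrt(sum((qi - ci) ** 2 for qi, ci in zip(q, c)))
```
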
Dynamic Time Warping (DTW)
• Dynamic Time Warping applies an elastic transformation of the time axis to detect similar shapes that have a different phase
• This is essentially a peak-to-peak and valley-to-valley alignment of two time series
• Intuition: DTW can be thought of as an extension of the ED that uses two indices $i$ and $j$, one for each time axis
• These indices are incremented independently:
  $D_{DTW}(Q, C) = \sqrt{\sum_{(i,j)} (q_i - c_j)^2}$
Dynamic Time Warping (Cost Matrix)
• DTW starts by computing a cost matrix $M$ with the distances between all pairs of values in $Q = (q_1, \ldots, q_n)$ and $C = (c_1, \ldots, c_n)$
• This matrix has $n^2$ entries and is given by: $M_{i,j} = (q_i - c_j)^2$, $i, j \in 1 \ldots n$
• DTW then searches for the optimal warping path
Dynamic Time Warping (Warping Path)
• Definition: A warping path $p$ is a sequence of tuples $(i, j)$ that defines a traversal of the cost matrix $M$, where $i$ and $j$ are indices of the values in $Q = (q_1, \ldots, q_n)$ and $C = (c_1, \ldots, c_n)$:
  $p_1 = ((1,1), (2,2), \ldots, (n-1,n-1), (n,n))$
  $p_2 = ((1,1), (1,2), (1,3), (2,3), \ldots, (n-1,n), (n,n))$
• A valid warping path has to satisfy two conditions:
  • The start has to be $(1,1)$ and the end has to be $(n,n)$
  • In each step, each index may advance by at most one: $0 \le i_{a+1} - i_a \le 1$ and $0 \le j_{a+1} - j_a \le 1$ for all consecutive path elements $a$ and $a+1$
• ED is the special case of the diagonal in the cost matrix
Dynamic Time Warping (Distance)
• Definition: The DTW distance is the total cost of the warping path with minimal total cost through the cost matrix $M$:
  $D_{DTW}(Q, C) = \min_{p} \sqrt{\sum_{(i,j) \in p} (q_i - c_j)^2}$, where $p$ ranges over all valid warping paths through $M$
• As DTW is a peak-to-peak and valley-to-valley alignment of two time series, it may produce suboptimal results if the two time series have a different number of peaks and valleys
• DTW runtime is $O(n^2)$
Dynamic Time Warping (Algorithm)
int DTWDistance(TimeSeries q[1..n], TimeSeries c[1..n]) {
    // cost matrix
    M := array [0..n, 0..n]
    for i := 1 to n
        M[i, 0] := infinity
    for i := 1 to n
        M[0, i] := infinity
    M[0, 0] := 0

    // find optimal path
    for i := 1 to n
        for j := 1 to n
            cost := D(q[i], c[j])      // measure distance
            M[i, j] := cost + minimum(
                M[i-1, j  ],           // insertion
                M[i  , j-1],           // deletion
                M[i-1, j-1])           // match
    return M[n, n]
}
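Dynamic Time Warping (Warping Window)
• A warping window constraint can be set to reduce the search space (computational complexity)
• It defines the amount of warping allowed between each pair of points on the warping path $p$, as a fraction $r$ of the length $n$:
  $|i - j| \le r \cdot n \quad \forall (i, j) \in p$
• DTW with a warping window constraint $r$ evaluates at most $2rn + 1$ cells per row of the cost matrix, i.e. it has a computational complexity of $O(n \cdot rn)$ instead of $O(n^2)$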
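
The dynamic program with a warping window might look as follows in Python. This is a sketch under stated assumptions: the function name and the parameter `r` (window as a fraction of the length, `r = 1.0` meaning unconstrained) are illustrative, squared differences are used as the local cost, and the square root is taken at the end to match the distance definition. For readability the sketch still allocates the full matrix, so only the running time benefits from the window, not the memory.

```python
import math


def dtw_distance(q, c, r=1.0):
    """DTW distance of two equal-length series with window |i - j| <= r * n."""
    n = len(q)
    if len(c) != n:
        raise ValueError("this sketch assumes equal-length time series")
    band = int(r * n)  # maximal allowed index difference
    M = [[math.inf] * (n + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        # only cells inside the warping window are evaluated
        for j in range(max(1, i - band), min(n, i + band) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            M[i][j] = cost + min(M[i - 1][j],      # insertion
                                 M[i][j - 1],      # deletion
                                 M[i - 1][j - 1])  # match
    return math.sqrt(M[n][n])
```

With `r = 0` only the diagonal is allowed, so the result equals the Euclidean distance; a shifted copy of a shape, which ED penalizes, can get a DTW distance of zero once the window is wide enough.
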
Similarity: Whole vs. Subsequence Matching
• Whole matching: the distance is calculated between two whole time series
• Subsequence matching: searches for the best-matching subsequence within a longer time series
• Complexity for time series length $n$ and subsequence length $w$ with Euclidean distance:
  • Whole matching: $O(n)$
  • Subsequence matching: $O(w \cdot (n - w + 1))$, i.e. one distance computation of cost $O(w)$ for each of the $n - w + 1$ window positions
[Figure: Whole Matching compares a query Q with the time series T of a dataset DS. Subsequence Matching slides a window S along a long time series C, computes the similarity D(Q, S) at each position, and reports the most similar subsequence.]
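
Subsequence matching with a sliding window can be sketched as below. The function name is hypothetical and offsets are 0-based here, unlike the 1-based indices on the slides.

```python
import math


def best_subsequence_match(series, query):
    """Return (offset, distance) of the window of len(query) closest to query."""
    n, w = len(series), len(query)
    if w > n:
        raise ValueError("query must not be longer than the series")
    best_offset, best_dist = -1, math.inf
    for offset in range(n - w + 1):  # n - w + 1 window positions
        d = math.sqrt(sum((series[offset + k] - query[k]) ** 2
                          for k in range(w)))  # O(w) per position
        if d < best_dist:
            best_offset, best_dist = offset, d
    return best_offset, best_dist
```
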
Time Series Data Transformations
• A variety of methods exist for reducing the dimensionality of time series (e.g. for noise filtering or faster processing)
• Real-valued methods transform a time series into a smaller number of numeric values; symbolic methods transform it into discrete values (sequences)
Piecewise Aggregate Approximation (PAA)
• Intuition: to represent a time series of length $n$, the data is divided into $w$ equal-sized intervals and the mean value in each interval is calculated
• PAA: a time series $T = (y_1, \ldots, y_n)$ of length $n$ is represented by a $w$-dimensional sequence of mean values $\bar{T} = (\bar{y}_1, \ldots, \bar{y}_w)$, where the $i$-th element is calculated as:
  $\bar{y}_i = \frac{w}{n} \sum_{j=\frac{n}{w}(i-1)+1}^{\frac{n}{w} i} y_j$
• Example PAA values (from the figure): 0.08, -0.08, -0.24, 0.18, 0.45, 0.34, 0.00, -0.19, 0.05, 0.90, 0.03, -0.32, -0.70, -0.21, 0.30, -1.72
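
A minimal sketch of the PAA transformation, assuming for simplicity that $w$ divides $n$ (the function name `paa` is illustrative):

```python
def paa(values, w):
    """Piecewise Aggregate Approximation: w interval means of a series.

    Assumption: w divides len(values), so all intervals have equal size.
    """
    n = len(values)
    if w <= 0 or n % w != 0:
        raise ValueError("this sketch assumes that w divides n")
    size = n // w  # number of values per interval
    return [sum(values[i * size:(i + 1) * size]) / size for i in range(w)]
```

For instance, `paa([1, 3, 2, 4, 6, 8], 3)` averages the three pairs (1, 3), (2, 4) and (6, 8).
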
Discrete Fourier Transform (DFT)
• Intuition: each series of length $n$ can be expressed as a linear combination of smooth periodic sinusoidal series
• Each wave is represented by a complex number (Fourier coefficient)
• This representation is called the frequency domain
• For typical (smooth) time series, the DFT concentrates most of the signal energy in the first few Fourier coefficients
• Low-pass filter: one can approximate a time series by its first $w$ Fourier coefficients
• Example DFT values (from the figure): 0, -8.81, -20.7, -11.9, -6.28, -8.02, -0.67, 15.31, -18.7, -18.36, -5.67, 16.84, -8.919, -23.80, 10.70, -21.92, 25.255, -1.321, ...
• The DFT decomposes a time series $T$ of length $n$ into a sum of $n$ orthogonal basis functions using sinusoid waves
• A Fourier coefficient (sinusoid wave) is represented by the complex number $X_k = (real_k, imag_k)$, for $k = 0, 1, \ldots, n-1$
• The $n$-point DFT of a time series $T = (y_1, \ldots, y_n)$ is then given by:
  $DFT(T) = (X_0, \ldots, X_{n-1}) = (real_0, imag_0, \ldots, real_{n-1}, imag_{n-1})$
• with $X_k = \frac{1}{n} \sum_{i=1}^{n} y_i \cdot e^{-\mathrm{j} 2 \pi k i / n}$, for $k \in [0, n)$, $\mathrm{j} = \sqrt{-1}$
• The first Fourier coefficient is equal to the mean of the time series and can be discarded to obtain offset invariance:
  $X_0 = \frac{1}{n} \sum_{i=1}^{n} y_i \cdot e^{0}$
• Using only the first few Fourier coefficients is equivalent to low-pass filtering (smoothing) a time series
• Computational complexity: the Fast Fourier Transform computes the DFT of a time series of length $n$ in $O(n \log n)$
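
To make the coefficients and the low-pass idea concrete, here is a naive $O(n^2)$ sketch using only the standard library; it is meant for illustration, and an FFT implementation would be used in practice. The function names are hypothetical. Two assumptions differ from the slide notation: samples are indexed from 0 (which changes the coefficients for $k > 0$ by a phase factor only, $X_0$ is unaffected), and `low_pass` rebuilds a real-valued series from the first $w$ coefficients together with their mirrored complex conjugates.

```python
import cmath


def dft(values):
    """Naive DFT with 1/n scaling; returns n complex Fourier coefficients."""
    n = len(values)
    return [
        sum(y * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, y in enumerate(values)) / n
        for k in range(n)
    ]


def low_pass(values, w):
    """Approximate a real series from its first w Fourier coefficients."""
    n = len(values)
    if w < 1 or 2 * (w - 1) >= n:
        raise ValueError("this sketch requires 1 <= w and 2 * (w - 1) < n")
    coeffs = dft(values)
    approx = []
    for i in range(n):
        total = coeffs[0].real  # X_0 is the mean of the series
        for k in range(1, w):
            # for real input, X_k and X_(n-k) are complex conjugates,
            # so their joint contribution is twice the real part
            total += 2 * (coeffs[k] * cmath.exp(2j * cmath.pi * k * i / n)).real
        approx.append(total)
    return approx
```

With `w = 1` the approximation is the constant mean; increasing `w` adds higher frequencies until the series is reproduced exactly.
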
Symbolic Aggregate Approximation (SAX)
• SAX converts the time series into a discrete sequence of symbols:
  • X-axis: applies dimensionality reduction using the PAA transformation
  • Y-axis: transforms each PAA value into a symbol using discretization
• Values sampled from a z-normalized time series are assumed to follow a Gaussian distribution
• SAX applies discretization based on the Gaussian distribution to produce equi-probable symbols
• A real-valued time series is then represented by a word
[Figure (Lin et al., Fig. 5): A time series is discretized by first obtaining a PAA approximation and then using predetermined breakpoints to map the PAA coefficients into SAX symbols. In the example, with n = 128, w = 8 and a = 3, the time series is mapped to the word baabccbc.]
[...] the time series to a binary vector. They demonstrated that discretizing the time series before clustering significantly improves the accuracy in the presence of outliers. We note that "clipping" is actually a special case of SAX, where a = 2.

3.3 Distance measures

Having introduced the new representation of time series, we can now define a distance measure on it. By far the most common distance measure for time series is the Euclidean distance (Keogh and Kasetty 2002; Reinert et al. 2000). Given two time series Q and C of the same length n, Eq. 3 defines their Euclidean distance, and Fig. 6A illustrates a visual intuition of the measure.

$D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}$ (3)

If we transform the original subsequences into PAA representations, $\bar{Q}$ and $\bar{C}$, using Eq. 1, we can then obtain a lower bounding approximation of the Euclidean distance between the original subsequences by

$DR(\bar{Q}, \bar{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2}$ (4)

This measure is illustrated in Fig. 6B. If we further transform the data into the symbolic representation, we can define a MINDIST function that returns the minimum distance between the original time series of two words:

$MINDIST(\hat{Q}, \hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \left(dist(\hat{q}_i, \hat{c}_i)\right)^2}$ (5)

The function resembles Eq. 4 except for the fact that the distance between the two PAA coefficients has been replaced with the sub-function dist(). The dist() function can be implemented using a table lookup as illustrated in Table 4.

From: Lin et al.: Experiencing SAX: a novel symbolic representation of time series
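
The SAX pipeline (z-normalization, PAA, Gaussian breakpoints) can be sketched as follows. The function name `sax_word` is hypothetical; the sketch assumes that the word length $w$ divides $n$, that the alphabet has at most 26 lowercase letters, and that the population standard deviation is used for normalization. The breakpoints are taken from the inverse CDF of the standard normal distribution via `statistics.NormalDist` (Python 3.8+).

```python
import math
from statistics import NormalDist


def sax_word(values, w, a):
    """Map a time series to a SAX word of length w over an alphabet of size a."""
    n = len(values)
    if n % w != 0 or not 2 <= a <= 26:
        raise ValueError("this sketch assumes w divides n and 2 <= a <= 26")
    # 1) z-normalize (population standard deviation)
    mu = sum(values) / n
    sigma = math.sqrt(sum((y - mu) ** 2 for y in values) / n)
    z = [(y - mu) / sigma if sigma > 0 else 0.0 for y in values]
    # 2) PAA: mean of each of the w equal-sized intervals
    size = n // w
    means = [sum(z[i * size:(i + 1) * size]) / size for i in range(w)]
    # 3) a - 1 breakpoints that split N(0, 1) into a equi-probable regions
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]
    letters = "abcdefghijklmnopqrstuvwxyz"
    # the symbol index is the number of breakpoints below the PAA value
    return "".join(letters[sum(m > b for b in breakpoints)] for m in means)
```

For an alphabet of size 3 the two breakpoints lie at roughly -0.43 and 0.43, so low, medium and high PAA values map to `a`, `b` and `c`.
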
Symbolic Fourier Approximation (SFA)
• SFA represents each real-valued time series by a word
• SFA is composed of
  a) approximation using the Fourier transform and
  b) a data-adaptive discretization
• The discretization intervals are learned from the distribution of the Fourier-transformed data rather than using fixed intervals
• Example (from the figure):
  Raw: 0.2679, 0.2480, 0.1828, 0.0817, 0.0051, -0.023, -0.052, -0.082, -0.111, -0.075, -0.032, -0.022, -0.029 [...]
  DFT: 0, -8.81, -20.7, -11.9, -6.28, -8.02, -0.67, 15.31, -18.7, -18.36, -5.67, -16.84, -8.919 [...]
  Discretization: CBBCCDCBBCBCB [...]
Comparison of SFA and SAX
• Discretization and approximation cause loss of information
• The higher the number of symbols (word length) and the larger the alphabet size, the more exact the representation
Properties of Symbolic Representations
• Noise removal
  • SAX: using PAA and quantization
  • SFA: using the DFT (low-pass filter) and quantization
• String representation
  • Allows string-domain algorithms such as hashing or bag-of-words to be applied
• Dimensionality reduction
  • Allows indexing high-dimensional data using the iSAX index (SAX) or the SFA trie (SFA)
• Storage reduction
  • Sequences have a much lower memory footprint than real-valued time series, i.e. less than 1 byte ("char") vs. 8 bytes ("double") for each timestamp
Time Series Data Analytics Tasks
[Figure: overview of time series analytics tasks, with panels labelled Query, Motif Discovery, Classification (labels: Bell, Cylinder, Funnel; caffeine, chlorogenic acid, ?), Clustering (series 1 to 6 under Euclidean Distance, DTW, and BOSS), and Discords (abnormal subsequences).]
Motifs
• A motif is a frequently occurring pattern or shape in a time series
• Distance-based ("exact" motifs)
  • A subsequence of a time series is said to support a motif if the distance between the subsequence and the motif is less than a threshold
• Sequential pattern mining ("approximate" motifs)
  • Discretization is applied to convert the time series into a sequence. Motif discovery then borrows methods from sequential pattern mining
[Figure, from Jonas Spenger's Bachelor Thesis, "Results - Quality": 85 planted motif time series; 8 planted sequences (red) per time series; task: recover the planted motifs (red regions). Sequences (red) from: Chen, Yanping, et al. "The UCR time series classification archive."]
Distance-based Motifs
• These are commonly defined as contiguous subsequences of a time series
• Approximate distance match
  • A motif $S = (s_1, \ldots, s_w)$ is said to approximately match a contiguous subsequence of the time series $T = (y_1, \ldots, y_n)$ at position $i$, with $w \le n$, if the distance between $(s_1, \ldots, s_w)$ and $(y_i, \ldots, y_{i+w-1})$ is at most $\epsilon$.
  • The most common distance measure used is the Euclidean distance
• Motif count: the number of positions at which a motif matches the time series within the threshold $\epsilon$
• A typical goal is to find the most frequent motifs
Naïve Algorithm
Motif findBestMotif(TimeSeries (y1,…,yn), WindowLength w, Threshold epsilon)
begin
    for i := 1 to n-w+1 do
        candidate_motif := (yi,…,yi+w-1);
        for j := 1 to n-w+1 do
            D := computeDistance(candidate_motif, (yj,…,yj+w-1));
            if (D < epsilon) and (non-trivial match) then
                increment count of candidate_motif by 1
            end if
        end for
        if (candidate_motif has highest count so far) then
            update best_candidate to candidate_motif;
        end if
    end for
    return best_candidate;
end
• The naïve algorithm extracts all candidate motifs of length $w$ from a time series and computes the distance to all offsets
• The number of matches is counted
• Trivial matches are overlapping subsequences (e.g. $i = j$)
• The approach requires a nested loop, and each loop runs over all $n - w + 1$ offsets of the time series, thus $O(n^2)$ distance computations
• Total complexity for Euclidean distance: $O(n^2 w)$
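
A runnable Python sketch of the naïve search is given below. The function name and return value are illustrative. Treating every pair of overlapping windows (offset difference smaller than $w$) as a trivial match is an assumption of this sketch, since the slides only give $i = j$ as an example; offsets are 0-based.

```python
import math


def find_best_motif(series, w, epsilon):
    """Return (offset, count) of the length-w window with the most matches."""
    n = len(series)
    windows = [series[i:i + w] for i in range(n - w + 1)]
    best_offset, best_count = -1, -1
    for i, candidate in enumerate(windows):
        count = 0
        for j, other in enumerate(windows):
            if abs(i - j) < w:
                continue  # assumption: overlapping windows are trivial matches
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(candidate, other)))
            if d < epsilon:
                count += 1
        if count > best_count:  # keeps the first candidate among ties
            best_offset, best_count = i, count
    return best_offset, best_count
```

The two nested loops over all offsets, each with an $O(w)$ distance computation, make the $O(n^2 w)$ cost directly visible.
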
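Ideas to Speed Up Motif Discovery
• Compute a fast lower bound for the distance
  • If the lower bound is greater than epsilon, then the real distance does not have to be computed
• Approaches:
  • Use PAA and compute the distance only for the mean values: $Dist(X, Y) \ge \sqrt{m} \cdot Dist(X', Y')$, where $X'$ and $Y'$ are the PAA representations of $X$ and $Y$ and $m$ is the number of values averaged per interval
  • Use SAX and a precomputed lookup table for distances between symbols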
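
The pruning idea with the PAA lower bound can be sketched as follows. All names are hypothetical, `m` denotes the number of values per PAA interval and is assumed to divide the series length. In a real search the PAA representations would be precomputed once per window rather than on every comparison, which is where the speed-up comes from.

```python
import math


def paa_blocks(values, m):
    """Means of consecutive blocks of m values (assumes m divides len(values))."""
    return [sum(values[i:i + m]) / m for i in range(0, len(values), m)]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def paa_lower_bound(x, y, m):
    """sqrt(m) * Dist(X', Y'), which never exceeds the true Dist(X, Y)."""
    return math.sqrt(m) * euclidean(paa_blocks(x, m), paa_blocks(y, m))


def matches_within(x, y, epsilon, m):
    """True if Dist(x, y) <= epsilon; skips the full distance when possible."""
    if paa_lower_bound(x, y, m) > epsilon:
        return False  # pruned: the real distance cannot be within epsilon
    return euclidean(x, y) <= epsilon
```
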
Relation to Sequential Pattern Mining
• First, convert the time series into a sequence using, for example, SAX
• Now apply string mining algorithms
• Itemset mining: discover frequent itemsets using association rule mining (ARM) algorithms. The A-Priori algorithm will be presented in the lecture
Questions?