
Optimal Window Change Detection

Jan Peter Patist
Vrije Universiteit Amsterdam, Artificial Intelligence
Amsterdam, The Netherlands

[email protected]

Abstract

It is recognized that change detection is an important feature in many data stream applications. An appealing approach is to reformulate the problem of change detection in data streams as the successive application of two-sample tests, as proposed in [7]. Usually the underlying data-generating process is unknown. Consequently, non-parametric tests like the Kolmogorov-Smirnov (KS) test are desirable. The KS-test statistic can be maintained efficiently in O(log(n)) per example, where n is the window size. However, this can only be achieved by assuming a fixed window size. Because there exists no anytime-optimal window size, a variable-size window algorithm is highly desirable. In this paper we propose an efficient approximate algorithm for the maintenance of the KS-test statistic under the optimal window size.

1 Introduction

The emergence of new applications involving massive data sets, such as customer click streams, telephone records, or electronic transactions, has stimulated the development of new algorithms for analyzing massive streams of data.

The desiderata for any data stream analysis system are small time and space requirements per example, such that the system can 'keep up' with the data and the analysis is up-to-date at any time.

For an overview of data stream mining, see [1]. One of the exciting challenges in data stream mining is change detection. Change detection is used to signal that the underlying data-generating process has changed significantly, so that one can act upon it. Besides example applications such as fraud and fault detection, it is a tool for keeping models up-to-date [5].

Many different solutions to the problem have been proposed, originating from different domains. Existing solutions can cope with different kinds of data, such as multivariate data, and are based upon models, Bayesian statistics, classical hypothesis testing, rules, time series models, etc.

One appealing framework is the approach of [7], in which the problem of change detection is transformed into successive non-parametric two-sample testing. The two samples correspond to a fixed reference sample and a sliding window. A sliding window of size n is a buffer of the n most recent data points. On the arrival of a new point, the oldest point is removed from the window and the newest point is added. One powerful statistical test is the KS-test. In [7] similar tests are used, and it is shown that the KS-statistic can be maintained using a balanced tree with a time complexity of O(log(m + n)) per example, where m and n are the sizes of the samples.

One drawback of this approach is that it uses a fixed sliding window and does not extend naturally and efficiently to a variable-size window. Because there does not exist an anytime-optimal window size, a variable-size window would be highly desirable: too small a window results in loss of power, while too large a window results in a longer detection time.

In this paper we present a method for maintaining the normalized KS-distance, normalized by the numbers of points of the two samples. In effect, we maximize the Z-score of the KS-statistic. To make the procedure more efficient we settle for an approximate KS-test.

The KS-statistic is defined via the distance between the two ECDFs, where the maximum distance is attained at one of the points q_i belonging to one of the samples. We approximate the KS-statistic by fixing the points Q = ∪ q_i, elements of the reference set. If |Q| is sufficiently large, the distance at arg max q_i will be equal to the true KS-distance. We maintain at any point in time the maximum distance per q_i ∈ Q over all possible window sizes, thus over all possible n most recent points. The maximum over the maxima over all q's defines the new statistic.

The main questions are how we can do this efficiently and what the cost is with respect to memory and time.

We address these questions as follows. First, we show that the KS-statistic can be rewritten as the maximum over maximum quantile differences. Then these differences can be rewritten as cumulative sums. Storing these sums plus an index of the datum is sufficient to reconstruct the KS-distance normalized by n. Hereafter, we show that the maximum quantile difference can be obtained efficiently at any time, given that the number of quantiles is sufficiently small.

2 Related work

Early work on change detection, from the domain of statistics, can be found in [11] and [9], which cover sequential hypothesis testing and the cusum test. A good reference for abrupt change detection is [2], which treats Bayesian change detection, generalized maximum-likelihood testing and more. However, all these methods assume a known underlying family of distributions.

In [4] a sliding window is maintained which adapts to the rate of change in the data stream. The window size is decreased when the sliding window is not coherent with respect to the means of its sub-windows, and increased when new data comes in. The paper guarantees constant time complexity per example. In [3] the approach is extended by combining it with a Kalman filter, which improves detection accuracy.

In [5] by Gama et al., change detection is used as a meta-algorithm for classification systems to cope with concept drift. The classification error of a particular system is monitored, and two levels are defined, called the warning level and the drift level. When the error rate increases to the warning level, the system starts learning a new concept. When it increases to the drift level, a change is reported. One possible drawback of this approach is that it can only be applied when true class labels are available.

In [10] a change detection algorithm is proposed based on wavelet footprints. Transforming the data into footprints creates non-zero coefficients at times of possible change. This fact is utilized to recover the time and magnitude of change using a lazy and an absolute method.

In [6] a framework for change detection is proposed based on martingales. Although the martingales are calculated on univariate data, the framework can be used in the case of multivariate data. This is achieved by using univariate model residuals or model performance measures. As new data arrives, martingales are calculated, resulting in a sequence. The approach is efficient with respect to time and is not necessarily dependent on a classifier's performance.

In [7] a new framework is proposed based on non-parametric two-sample tests. A fixed reference sample is successively compared with the newest n data points, also called the sliding window. When a new data example comes in, the oldest data point is removed from the window and the newest data example is added. Non-parametric two-sample tests, like the KS-test, Relativized Discrepancy and the Mann-Whitney test, are used to detect significant deviation. It is shown that the KS-statistic can be maintained efficiently using balanced trees with a time complexity per example of O(log(m + n)), where m and n are the sizes of the fixed reference sample and the sliding window.

In [8] a sketch-based detection algorithm is proposed for the detection of attacks and failures in network traffic. Sketches are efficient structures for compressing the data. Change is detected by monitoring the errors of forecast models built upon these sketches.

3 Problem definition

We define a data stream as a possibly infinite sequence of numbers from the real domain, where data points come in one after another; thus the data is one-dimensional. Let S_ref be a reference set representative of the current distribution. Moreover, the data is assumed to be independently and identically distributed under the null hypothesis.

The objective is to signal an alarm when the underlying distribution has probably changed.

Signal an alarm when the KS-test rejects the null hypothesis, that S_ref and S_n are from the same distribution, given a certain confidence, for any n most recent points.

The KS-statistic is equal to the maximum distance between the empirical cumulative distribution functions (ECDF) of the reference sequence S_ref and of S_n, respectively. The maximum distance is attained at a point x in the union of the reference set and the n most recent points.

KS = max_x | ECDF_ref(x) − ECDF_n(x) |,  x ∈ (S_ref ∪ S_n)

The KS-statistic is normalized by a factor depending on n and m, where n is the number of most recent points and m is the size of the reference set. In the case of a fixed window this factor is constant and does not change the point of the maximum. In our case, however, it complicates matters.

KS_norm = √(mn / (n + m)) · KS,  m = |S_ref|,  n = |S_n|

For practical reasons it is assumed that m >> n, so that the normalization factor reduces to:

KS_norm ≈ √n · KS,  if m >> n
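
To make these definitions concrete, the following sketch (our illustration, not the paper's implementation; the function names are ours) computes the KS-distance and its normalized variant for a reference sample and a window:

```python
import numpy as np

def ecdf(sample, xs):
    """Empirical CDF of `sample` evaluated at the points `xs`."""
    return np.searchsorted(np.sort(sample), xs, side="right") / len(sample)

def ks_distance(s_ref, s_n):
    """Two-sample KS distance: maximum ECDF difference over the union of points."""
    xs = np.concatenate([s_ref, s_n])
    return np.max(np.abs(ecdf(s_ref, xs) - ecdf(s_n, xs)))

def ks_normalized(s_ref, s_n):
    """KS distance scaled by sqrt(mn / (n + m)); roughly sqrt(n) * KS when m >> n."""
    m, n = len(s_ref), len(s_n)
    return np.sqrt(m * n / (n + m)) * ks_distance(s_ref, s_n)
```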

We are interested in the window of the n most recent points that maximizes the normalized KS. The maximum is called F̂.

F̂ ≈ max_n √n max_q | ECDF_ref(q) − ECDF_n(q) |


q ∈ Q, where Q ⊆ S_ref. F̂ is the approximation of KS_norm obtained by reducing the number of points q at which the maximum can be attained. The points q resemble quantiles of ECDF_ref. From now on, the maximum distance is not determined over all distances defined by x ∈ (S_ref ∪ S_n) but over x ∈ Q, where Q ⊆ S_ref. The set and size of Q are to be set by the system administrator. A reasonable choice is equal-frequency quantiles.
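
A minimal way to pick such a Q, as equal-frequency quantiles of the reference set together with their reference probabilities α_q, is sketched below (our own construction; the paper only states that Q and its size are chosen by the system administrator):

```python
import numpy as np

def pick_quantiles(s_ref, num_q=20):
    """Return equal-frequency quantile points q of the reference sample and the
    corresponding target probabilities, used as alpha_q = cdf(q)."""
    alphas = np.arange(1, num_q + 1) / (num_q + 1)  # interior probabilities
    qs = np.quantile(s_ref, alphas)                 # quantile points from the reference set
    return qs, alphas
```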

The distance between a quantile and the ECDF can be rewritten as a cumulative sum:

max_q | α_q − ECDF_n(x) | = max_q | (1/n) Σ_{i=1..n} (α_q − δ(x_i < q)) |

where α_q corresponds to cdf(q), and the function δ(x < q) returns 1 if x < q and 0 otherwise. Because the maximization with respect to q is independent of √n, this term can be moved outside the max terms.

F̂ = max_q max_n (1/√n) | Σ_{i=1..n} (α_q − δ(x_i < q)) |

An alarm is raised when F̂ > h. This is the case when the maximum distance over the fixed quantiles q is larger than h, where for each q the distance is maximized over n.

The above definition of F̂ expresses the intuitive idea of our algorithm: the algorithm finds the maximum per quantile q and, during this process, keeps the current maximum, resulting in the maximum over q and n.

At first sight this algorithm seems expensive. However, if the number of q's is small and

d = max_n (1/√n) | Σ_{i=1..n} (α_q − δ(x_i < q)) |

can be maintained efficiently, then we obtain an efficient procedure. In the following section we show that d can be maintained efficiently.
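
For reference, d (and hence F̂) can be evaluated by brute force directly from the cumulative-sum form by rescanning the stream per quantile. This O(c)-per-example baseline (our illustrative code, not the paper's algorithm) is exactly what the snapshot buffer of the next section avoids:

```python
import numpy as np

def f_hat_bruteforce(stream, qs, alphas):
    """F-hat = max over q and n of (1/sqrt(n)) |sum_{i=1..n} (alpha_q - delta(x_i < q))|,
    where the sum runs over the n most recent points of `stream`."""
    recent_first = np.asarray(stream)[::-1]       # index 0 is the newest point
    n = np.arange(1, len(recent_first) + 1)       # candidate window sizes
    best = 0.0
    for q, a in zip(qs, alphas):
        csum = np.cumsum(a - (recent_first < q))  # partial sums over the n newest points
        best = max(best, np.max(np.abs(csum) / np.sqrt(n)))
    return best
```

An alarm then corresponds to f_hat_bruteforce(stream, qs, alphas) > h.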

4 Efficient maintenance of the maximum distance for fixed quantiles

In this section our goal is to present an algorithm for the maintenance of:

d = max_n (1/√n) | Σ_{i=1..n} (α_q − δ(x_i < q)) |

Note that the quantile q and α_q are now fixed. The above problem is split into finding the maximum positive deviation and the maximum negative deviation from α_q. From now on we deal only with the problem of finding the maximum positive deviation; the maximum negative deviation can be obtained in the same way.

In short, the algorithm works as follows. We maintain a buffer of snapshots; these snapshots are tuples of an index and the partial sum Σ_{i=1..n} (δ(x_i < q) − α_q). Every time a new data point x arrives, a new snapshot is added to the buffer. The added snapshot is equal to the most recent snapshot plus δ(x < q) − α_q. The index is equal to the age of the snapshot; the age of the oldest snapshot is 1. Note that using these snapshots we can reconstruct the distance d from the most recent snapshot. See Figure 1 for an example of the evolution of these snapshots. We show that it is not necessary to maintain a buffer equal to the size of the data stream; in practice we only need a very small number of snapshots. This follows from the observation that certain snapshots can never define the point n of the maximum positive distance in the presence of certain other snapshots. This observation is made explicit by three theorems. A snapshot is not maintained when it cannot define the maximizing n. This keeps the buffer small and makes the calculation of the maximum distance more efficient.

First we fix the notation. The index n = 1 is the index of the first data point of the stream, and n = c that of the last received. n′ = c − n is the difference between the index of a point and the current index c. q_α equals the α-quantile of the reference distribution, and α_q the cdf(q_α). The indicator function δ(x < q) returns 1 if x < q, and 0 otherwise. A snapshot s_n is the tuple (Σ_{i=1..n} (δ(x_i < q) − α_q), n). The distances d_af and d_a are defined as (s_f − s_a)/√(f − a) and (s_c − s_a)/√(c − a), and the mean μ_af is defined as (s_f − s_a)/(f − a).

4.1 Theorem 1

For all snapshots (s_n, n), (s_m, m):

|= if s_n ≤ s_m and m′ ≥ n′, then d_n > d_m

Figure 1 shows an example application of Theorem 1: for all points m falling in the rectangle it holds that d_n > d_m.

Proof of Theorem 1. Each snapshot (s_m, m) can be reached from (s_n, n) by successively applying one of the two operations O_i∈{1,2}: O_1(s_n, n) = (s_n, n − 1) and O_2(s_n, n) = (s_n + η, n), where η is any positive real number. We prove that d(s_n, n) > d(O(s_n, n)). For O_1: d(s_n, n) > d(O_1(s_n, n)) ⇔ d(s_n, n) > d(s_n, n − 1) ⇔ (s_c − s_n)/√(c − n) > (s_c − s_n)/√(c − n + 1) ⇔ 1/√(c − n) > 1/√(c − n + 1). Now we prove that O_2 decreases the distance d: d(s_n, n) > d(O_2(s_n, n)) ⇔ d(s_n, n) > d(s_n + η, n) ⇔ (s_c − s_n)/√n′ > (s_c − s_n − η)/√n′ ⇔ (s_c − s_n) > (s_c − s_n − η). Consequently a snapshot which is reachable from (s_n, n) by successive operations O_i∈{1,2} can never define the maximum distance d.


Figure 1. Theorem 1. For every point n within the rectangle it holds that d_M > d_n.

Figure 2. An illustrative example of Theorem 2. For every point n on an arrow it holds that, for the corresponding base N, M, L, d_base > d_n.

Lemma 1 is used in the proofs of Theorems 2 and 3; it states the conditions under which d_n < d_m. For all snapshots (s_n, n), (s_m, m), (s_c, c), where s_n < s_m, s_n < s_c and n′ > m′:

|= μ_nc > μ_nm(1 + √(m′/n′)) ⇔ d_n < d_m

Proof of Lemma 1. By the definitions of d and μ: d_n = s_c/√n′, d_m = (s_c − s_m)/√m′, and μ_nc = s_c/n′. We assume that s_n = 0 for convenience; evidently this does not affect the validity of the proof. Using some simple algebra:

d_m > d_n ⇔ (s_c − s_m)/√m′ > s_c/√n′ ⇔ s_c (1/√m′ − 1/√n′) > s_m/√m′ ⇔ s_c > √n′ s_m / (√n′ − √m′) ⇔ s_c/n′ > √n′ s_m / (n′(√n′ − √m′))

After substituting s_m = μ_nm(n′ − m′) we obtain:

s_c/n′ > √n′ μ_nm(n′ − m′) / (n′(√n′ − √m′)) ⇔ μ_nc > μ_nm (1 − m′/n′) / (1 − √(m′/n′)) ⇔ μ_nc > μ_nm(1 + √(m′/n′))

4.2 Theorem 2

For all snapshots (s_n, n) and (s_m, m), assuming s_n < s_m and n′ > m′:

|= if μ_nm > (1 − α_q)/(1 + √(m′/n′)), then d_n > d_m

Figure 2 gives an illustrative example of Theorem 2. It shows that if the mean μ_nm between points n and m is too large, then m can never define the maximum, because that would require μ_mc > (1 − α_q), the maximum possible μ.

Proof of Theorem 2. First of all, according to Lemma 1, (d_m ≥ d_n) ⇔ (μ_nc ≥ μ_nm(1 + √(m′/n′))); note that > is changed into ≥. By the definition of the snapshot, the maximum increment per example is 1 − α_q, and consequently any μ is bounded by 1 − α_q. If μ_nm > (1 − α_q)/(1 + √(m′/n′)), then μ_nc ≥ μ_nm(1 + √(m′/n′)) cannot hold, because then μ_nc > (1 − α_q); consequently d_n > d_m.

Figure 3. Theorem 3. Points P1 and P2 can never define the maximum distance.

4.3 Theorem 3

For all (s_n, n), (s_m, m), (s_k, k) for which n′ > m′ > k′ and s_n < s_m < s_k:

|= if μ_nm > μ_mk, then either d_n > d_m or d_k > d_m

An application of Theorem 3 is illustrated in Figure 3.

Proof of Theorem 3. First of all, by applying Lemma 1 twice: μ_nc > μ_nm(1 + √(m′/n′)) and μ_mc > μ_mk(1 + √(k′/m′)), which correspond to d_m > d_n and d_m < d_k respectively. Using the definition of μ the following holds: μ_nc = β μ_nm + (1 − β) μ_mc, with (1 − β) = m′/n′. Substituting (1 − β) and μ_nc we obtain: μ_mc > μ_nm((1 + √(m′/n′)) − β)/(1 − β) ⇔ μ_mc > μ_nm(1 + √(n′/m′)). Recapitulating, we have obtained two inequalities: μ_mc > μ_nm(1 + √(n′/m′)) and μ_mc > μ_mk(1 + √(k′/m′)). If the first inequality does not hold, then d_m < d_n; this proves the first part of the theorem. Let us assume that μ_mc > μ_nm(1 + √(n′/m′)) holds. It then follows that if μ_nm(1 + √(n′/m′)) > μ_mk(1 + √(k′/m′)), then d_k > d_m. If μ_nm > μ_mk this inequality holds, because (1 + √(n′/m′)) > (1 + √(k′/m′)). Concluding: if μ_nm > μ_mk and d_m > d_n, then d_k > d_m; otherwise d_n > d_m.

4.4 Transforming Theorems 1, 2 and 3 into an algorithm

In the following section we show how the theorems are materialized into an algorithm for the maintenance of the maximum distance. Recapitulating, the theorems show that not all points in the buffer have to be maintained: some can be pruned. Consequently, each theorem corresponds to a pruning rule which reduces the buffer of potential points of maxima. We maintain two FIFO buffers per pre-defined quantile, one for the positive and one for the negative deviation. When a new data example comes in, the snapshot as well as the time index is stored in the corresponding buffer. Hereafter, three pruning rules corresponding to the three theorems are invoked to reduce the buffer size. The theorems are applied in the chronological order of the snapshots, starting at the most recently stored snapshot. Theorem 1 is applied by simply comparing the new snapshot with the stored ones: if the theorem applies between the two snapshots, the stored snapshot is deleted; if it does not, we can stop pruning with Theorem 1. We prove later that this procedure exhaustively prunes all snapshots satisfying Theorem 1. Theorem 2 is applied in a slightly modified form, for efficiency reasons: the condition for deletion is μ ≥ 1 − α_q. Theorem 3 is applied to the top 3 snapshots of the FIFO buffer; if the theorem cannot be applied, the pruning stops. Like the first pruning procedure, it exhaustively prunes all points which cannot define the maximum distance for a quantile. When a distance exceeds a certain threshold, an alarm is signaled. A sketch of this update step is given after this paragraph.
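
The following sketch shows one possible reading of this update step for a single quantile and the positive deviation (class and method names are ours; in particular, the handling of the simplified Theorem 2 condition is our interpretation of the text):

```python
import math
from collections import deque

class QuantileMonitor:
    """Snapshot buffer for one fixed quantile q, positive deviation only; the
    negative deviation is handled symmetrically by a second buffer on the
    mirrored sums (alpha_q - delta(x < q))."""

    def __init__(self, q, alpha_q):
        self.q, self.alpha_q = q, alpha_q
        self.c = 0          # index of the last received point
        self.s = 0.0        # running sum of (delta(x < q) - alpha_q)
        self.buf = deque()  # snapshots (s_n, n), oldest first

    def mu(self, a, b):
        """Mean increment between snapshots a = (s_a, ia) and b = (s_b, ib), ia < ib."""
        return (b[0] - a[0]) / (b[1] - a[1])

    def update(self, x):
        self.c += 1
        self.s += (1.0 if x < self.q else 0.0) - self.alpha_q
        snap = (self.s, self.c)

        # Pruning rule 1 (Theorem 1): stored snapshots whose value is at least
        # the new snapshot's value are dominated by it.
        while self.buf and self.buf[-1][0] >= snap[0]:
            self.buf.pop()

        # Pruning rule 2 (Theorem 2, simplified): if the mean increment since the
        # last stored snapshot attains its maximum possible value 1 - alpha_q,
        # the new snapshot can never define the maximum and is not stored.
        if self.buf and self.mu(self.buf[-1], snap) >= 1.0 - self.alpha_q:
            return
        self.buf.append(snap)

        # Pruning rule 3 (Theorem 3): among the three most recent snapshots,
        # drop the middle one while the older mean exceeds the newer mean.
        while (len(self.buf) >= 3
               and self.mu(self.buf[-3], self.buf[-2]) > self.mu(self.buf[-2], self.buf[-1])):
            newest = self.buf.pop()
            self.buf.pop()            # the dominated middle snapshot
            self.buf.append(newest)

    def max_distance(self):
        """d = max over stored snapshots (s_n, n), n < c, of (s_c - s_n) / sqrt(c - n)."""
        return max(((self.s - s) / math.sqrt(self.c - n)
                    for s, n in self.buf if n < self.c), default=0.0)
```

A full detector would keep two such monitors per quantile and raise an alarm whenever any max_distance() exceeds the threshold h.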

4.5 Proof of exhaustiveness of the pruning strategies with respect to Theorems 1 and 3

In this section we prove that, by applying the pruning rules as defined in the algorithm description, all points are pruned which are irrelevant according to Theorems 1 and 3. In the case of Theorem 2 this does not hold; the reason is efficiency, since to fully apply Theorem 2 we would possibly have to compare every snapshot.

Pruning rule 1. We prove, by contradiction, that for two snapshots s_n and s_m with n′ > m′, if s_n is pruned by a new snapshot s_l then so is s_m. Let s_n be pruned and assume s_m is not pruned. Consequently s_n ≥ s_l and s_m < s_l, hence s_m < s_n while m is more recent than n; but then s_m and s_n cannot both be in the buffer B.

Pruning rule 3. We prove, by contradiction, that for the means μ_nm, μ_ml, μ_lk between snapshots s_n, s_m, s_l, s_k with n′ > m′ > l′ > k′, if s_m is pruned then so is s_l. Let s_m be pruned and s_l not pruned. Consequently μ_nm > μ_ml and μ_ml ≤ μ_lk; but s_l can only remain in B while s_m is prunable if μ_ml > μ_lk, which yields μ_ml > μ_ml, a contradiction. Hence s_l cannot be in B.

4.6 Efficient implementation of the calculation of the maximum

In practice we are not always interested in the maximum distance itself but in whether the maximum distance exceeds a user-specified threshold h. Consequently, if δ(x < q) − α_q is positive, the distance of the negative deviation does not have to be updated, because it can never come to exceed the threshold h, and vice versa. This follows from the fact that the maximal negative deviation decreases while the most recent maximal distance did not exceed h.

Figure 4. The buffer size as a function of the quantile α_q for different distributions. The median buffer size is approximately 10 and the maximum 20. From top to bottom the data was generated from a uniform, exponential and normal distribution. The data stream size is 10^5 examples.

Figure 5. The average total buffer size as a function of the data stream size (exponential, uniform and normal data).

When we are interested in the maximum distance itself, the computational effort is equal to the cost of pruning plus q times the average buffer size.

4.7 Space and Time complexity

The space complexity is O(cq), where c is the average buffer size and q the number of pre-specified quantiles. In Figure 4 the buffer size is shown as a function of the quantile for different distributions. From these figures we observe that at no moment does the buffer size exceed 25. Note that the buffer size is equal to the sum of the sizes of the buffers needed to maintain both the upper and the lower maximum deviation. For the different distributions and different quantiles there is almost no difference in the buffer sizes. In Figure 5 the average total buffer size is shown, for q = 21. The average total buffer size seems to 'converge'. Although it is not proved that the buffer size remains small, in experiments we observe small buffer sizes.


Figure 6. A box plot of the number of operations per quantile performed using the 3 pruning rules. The median number of operations is 4 and the maximum 7. From top to bottom the data was generated from a normal, uniform and exponential distribution.

Figure 7. The average total number of operations as a function of the size of the data stream (normal, uniform and exponential data).

The computational effort is bounded by 3cq per example: the total number of pruning steps multiplied by the number of quantiles, plus the average buffer size c times q distance calculations per quantile. Experiments with different quantiles and distributions indicate that both c and the number of pruning steps are very small. See Figure 6 for the number of operations per quantile that have to be performed; note that this includes the maintenance of both the maximum positive and the maximum negative deviation. The total number of computations is shown in Figure 7. Thus, in the case of q = 20, the number of computations is around 120 per example and can be performed within a millisecond. Furthermore, in the experiments the computational effort appears independent of the quantile and the distribution.

5 Conclusion and Future Work

We presented an algorithm for approximately performing the KS-test over the optimal window size. The proposed algorithm is space efficient, and experiments indicate that its space usage is independent of the size of the data stream. The time complexity is small for a sufficiently small number of quantiles and is independent of the size of the stream. In future research we want to adjust the algorithm by bounding the optimal window size. Secondly, we want to generalize our procedure to other non-parametric tests.

References

[1] C. Aggarwal. Data Streams: Models and Algorithms. Springer, 2007.

[2] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes - Theory and Application. Prentice-Hall, Inc., 1993.

[3] A. Bifet and R. Gavaldà. Kalman filters and adaptive windows for learning in data streams. In Proc. 9th International Conference on Discovery Science, Lecture Notes in Artificial Intelligence 4265, pages 29–40. Springer-Verlag, 2006.

[4] A. Bifet and R. Gavaldà. Learning from time-changing data with adaptive windowing. In SIAM International Conference on Data Mining (SDM'07), 2007.

[5] J. Gama, P. Medas, G. Castillo, and P. Rodrigues. Learning with drift detection. In S. Labidi and A. L. C. Bazzan, editors, Advances in Artificial Intelligence - SBIA 2004, pages 286–295, Berlin/Heidelberg, 2004. Springer.

[6] S.-S. Ho. A martingale framework for concept change detection in time-varying data streams. In L. D. Raedt and S. Wrobel, editors, ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 321–327, New York, NY, USA, 2005. ACM Press.

[7] D. Kifer, S. Ben-David, and J. Gehrke. Detecting change in data streams. In Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3, 2004, pages 180–191. Morgan Kaufmann, 2004.

[8] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: methods, evaluation, and applications. In IMC '03: Proceedings of the 3rd ACM SIGCOMM Conference on Internet Measurement, pages 234–247. ACM Press, 2003.

[9] E. Page. On problems in which a change in a parameter occurs at an unknown point. Biometrika, 44:248–252, 1957.

[10] M. Sharifzadeh, F. Azmoodeh, and C. Shahabi. Change detection in time series data using wavelet footprints. In C. B. Medeiros, M. J. Egenhofer, and E. Bertino, editors, SSTD 2005, pages 127–144, 2005.

[11] A. Wald. Sequential Analysis. Wiley, New York, 1947.
