
Concept Drift Detection with Hierarchical Hypothesis Testing

Shujian Yu (University of Florida, FL)    Zubin Abraham (Robert Bosch Research and Technology Center, USA)

January 25, 2017

Abstract

When using statistical models (such as a classifier) in a streaming environment, there is often a need to detect and adapt to concept drifts to mitigate any deterioration in the model's predictive performance over time. Unfortunately, the ability of popular concept drift approaches to detect drifts in the relationship between the response and predictor variables often depends on the distribution characteristics of the data streams, as well as on their sensitivity to parameter tuning. This paper presents Hierarchical Linear Four Rates (HLFR), a framework that detects concept drifts for different data stream distributions (including imbalanced data) by leveraging a hierarchical set of hypothesis tests in an online setting. The performance of HLFR is compared to benchmark approaches using both simulated and real-world datasets spanning the breadth of concept drift types. HLFR significantly outperforms benchmark approaches in terms of accuracy, G-mean, recall, delay in detection and adaptability across the various datasets.

1 Introduction

Numerous real-world applications such as fraud detection, user preference prediction, and email filtering [5] intrinsically involve changes in the relationship underlying the incoming data streams. Current approaches to address these concept drifts fall into two categories [17, 7]. The first is to automatically adapt the parameters of the statistical model in an incremental fashion as new data are observed. The second is to pair the statistical model with a concept drift detector, whose purpose is to signal the need for retraining the model once a concept drift has been observed in the data. Unlike the first approach, which focuses only on mitigating the effect of concept drift, the second approach also helps identify when the concept drift has occurred. This paper focuses on concept drift detection approaches.

A popular concept drift detection approach is the Drift Detection Method (DDM) [5]. The test statistic DDM monitors is the overall classification error $\hat{P}^{(t)}_{error}$ and its empirical standard deviation $\hat{S}^{(t)}_{error} = \sqrt{\hat{P}^{(t)}_{error}(1-\hat{P}^{(t)}_{error})/t}$. Since DDM focuses on the overall error rate, it fails to detect a drift unless the sum of false positives and false negatives changes. This limitation is accentuated when detecting concept drift in imbalanced classification tasks. The Early Drift Detection Method (EDDM) [1] was proposed to achieve better detection of slow gradual changes by monitoring the distance between two consecutive classification errors. However, it requires waiting for a minimum of 30 classification errors before calculating the monitoring statistic at each decision point, which is not well suited to imbalanced data. A third loss-monitoring method, STEPD [16], compares the accuracy on a recent window with the overall accuracy excluding that window by applying a test of equal proportions.

Drift Detection Method for Online Class Imbalance (DDM-OCI) [20] addresses the limitation of DDM when data is imbalanced. However, DDM-OCI triggers a number of false positives due to an inherent weakness in the model. DDM-OCI assumes that concept drift in imbalanced streaming data classification is indicated by a change in the underlying true positive rate (i.e., minority-class recall). This hypothesis unfortunately does not account for scenarios in which concept drift occurs without affecting minority-class recall. For instance, it is quite possible that the underlying concept drifts from imbalanced data to balanced data while the true positive rate (TPR), positive predictive value (PPV) and F-measure remain unchanged. This type of drift is unlikely to be detected by DDM-OCI.

The test statistic used by DDM-OCI, $\hat{R}^{(t)}_{TPR}$, is also not approximately distributed as $\mathcal{N}\left(P^{(t)}_{TPR},\ P^{(t)}_{TPR}(1-P^{(t)}_{TPR})/t\right)$ under the stable concept¹. Hence, the rationale for constructing confidence levels specified in [5] is not suitable for the null distribution of $\hat{R}^{(t)}_{TPR}$. This is why DDM-OCI triggers false positives quickly and frequently. Linear Four Rates (LFR) was proposed to address this limitation of DDM-OCI by monitoring the four rates associated with the confusion matrix of the data stream [19]. Although LFR performs much better than DDM-OCI, it still triggers false alarms.

To address the limitations of existing approaches, we present a two-layered hierarchical hypothesis testing framework (HLFR) for concept drift detection. Unlike other approaches, HLFR not only detects all possible variants of concept drift with the fewest false alarms, it can do

¹$\hat{R}^{(t)}_{TPR}$ is a modified estimator of $P^{(t)}_{TPR}$, which satisfies $\hat{R}^{(t)}_{TPR} = \eta \hat{R}^{(t-1)}_{TPR} + (1-\eta)\mathbf{1}_{\{\hat{y}_t = y_t\}}$, where $\eta$ denotes a time decaying factor [20].


so even in the presence of imbalanced class labels. HLFR is independent of the underlying classifier and outperforms existing approaches in terms of earliest detection of concept drift, with the fewest false alarms and the best precision.

1.1 Problem Formulation We are given a continuous stream of labeled samples $\{X_t, y_t\}$, $t = 1, 2, \ldots$, where $X_t$ is a d-dimensional vector in a pre-defined vector space $\mathcal{X} = \mathbb{R}^d$ and $y_t \in \{0, 1\}$². At every time point t, we split the samples into a set $S_A$ of the $n_A$ most recent ones and a set $S_B$ containing $n_B$ examples that appeared prior to those in $S_A$. We would now like to know whether or not the mapping f from $X_t$ to $y_t$ in $S_A$ is the same as in $S_B$.

A closely related topic to concept drift detection is the well-known change-point detection problem in the machine learning and statistics communities, where the goal of the latter is to detect changes in the generating distribution $P(X_t)$ of the streaming data. The standard tools for change-point detection are methods from statistical decision theory. These methods usually compute a statistic from the available data that is sensitive to changes between the two sets of examples. The measured value of the statistic is then compared to its expected value under the null hypothesis that both samples come from the same distribution. The resulting p-value can be seen as a measure of the strength of the drift. A good statistic must be sensitive to data properties that are likely to change by a large margin between samples from differing distributions.

Although a drift in the generating distribution $P(X_t)$ may result in a change in the learning problem, the detection of any type of distributional change remains a challenge, especially when $X_t$ is high dimensional [7, 19]. To achieve high detection accuracy, we adopt the principle of risk minimization [18] and solve the problem directly by monitoring "significant" drift in the prediction risk (i.e., classification loss) of the underlying predictor rather than the intermediate problem of change-point detection. This is motivated by the fact that any drift of $P(f(X_t), y_t)$ implies a drift in $P(X_t, y_t)$ with probability 1, where f is the incrementally learned predictor (or classifier).

2 Hierarchical Linear Four Rates (HLFR)

This paper presents a two-layered hierarchical hypothesis testing framework (HLFR) for concept drift detection. Once a potential drift is detected by the Layer-I test of HLFR, the Layer-II test is performed to confirm (or deny) the validity of the suspected drift. The result of Layer-II feeds back to Layer-I of the framework, reconfiguring and restarting Layer-I as needed. The HLFR framework differs from the mere execution of two subsequent tests in that the two layers cooperate by exchanging information about the detected drift to improve online detection ability³, as shown in Fig. 1.

²This paper only considers binary classification.
³Note that the Layer-I and Layer-II tests can also be executed separately and independently: merely operating the Layer-I test, the HLFR framework reduces to our previously proposed LFR [19]; merely operating the Layer-II test may also achieve satisfactory results, at the cost of expensive computation.

[Figure 1 schematic: the input stream {X(t), y(t)} feeds the Layer-I Hypothesis Testing module; a potential detection (with information about the drift) is passed to the Layer-II Hypothesis Testing module, which either confirms the detection or restarts the testing; confirmed detections yield the detection results and a model (classifier) update.]

Figure 1: The architecture of the proposed hierarchical hypothesis testing framework for concept drift detection.

HLFR treats the underlying classifier as a black box and does not make use of any of its intrinsic properties. This modular property of the framework allows it to be deployed alongside any classifier (k-nearest neighbors, multilayer perceptron, etc.), unlike concept drift detectors that are designed to work only with, e.g., linear discriminant classifiers [14] or support vector machines (SVM) [13]. This paper selects the soft margin SVM as the baseline classifier due to its universality and stability [3].

Given the robustness of detecting concept drift by monitoring the four rates of the confusion matrix stream (more specifically, the true positive rate (tpr), true negative rate (tnr), positive predictive value (ppv) and negative predictive value (npv)), even in the presence of imbalanced class labels [19], HLFR uses hypothesis tests that monitor these four rates in its Layer-I test. The second layer of HLFR uses a test statistic (or quantity) strictly related to that used at Layer-I to conduct a permutation test. If a drift is confirmed, the HLFR framework signals a detection; otherwise the Layer-I detection output is considered to be a false positive and the test (eventually retrained) restarts to assess forthcoming data.

Experiments on synthetic datasets and real-world applications demonstrate that HLFR outperforms state-of-the-art methods by significantly reducing false positives while guaranteeing low false negatives and short detection delays. HLFR is summarized in Algorithm 1.

2.1 Layer-I Hypothesis Testing Layer-I uses a drift detection algorithm that works in an online fashion. Linear Four Rates (LFR) [19] is used as the Layer-I test, as it has been shown experimentally that LFR outperforms benchmarks in terms of recall, accuracy and detection delay in a majority of cases.

The underlying idea of the LFR strategy is straightforward: under a stable concept (i.e., $P(X_t, y_t)$ remains unchanged), $P_{tpr}$, $P_{tnr}$, $P_{ppv}$ and $P_{npv}$ remain the same. Thus, a significant change of any $P_\star$ ($\star \in \{tpr, tnr, ppv, npv\}$) implies a change in the underlying joint distribution $P(X_t, y_t)$, i.e., in the concept.



Algorithm 1 Hierarchical Linear Four Rates (HLFR)

Require: Data $\{X_t, y_t\}_{t=0}^{\infty}$ where $X_t \in \mathbb{R}^d$, $y_t \in \{0, 1\}$.
Ensure: Concept drift time points $\{T_{cd}\}$.
1: for t = 1 to ∞ do
2:    Perform Layer-I hypothesis testing.
3:    if Layer-I detects a potential drift point $T_{pot}$ then
4:        Perform Layer-II hypothesis testing on $T_{pot}$.
5:        if Layer-II confirms $T_{pot}$ then
6:            $\{T_{cd}\} \leftarrow T_{pot}$
7:        else
8:            Discard $T_{pot}$; reconfigure and restart Layer-I.
9:        end if
10:   end if
11: end for
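The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: layer1_update, layer2_permutation_test and reconfigure_layer1 are hypothetical callables standing in for the Layer-I LFR test, the Layer-II permutation test (Algorithm 2) and the reconfiguration step.

def hlfr_stream(stream, layer1_update, layer2_permutation_test, reconfigure_layer1):
    # stream: iterable of (X_t, y_t) pairs
    # layer1_update(x, y): consumes one sample, returns a potential drift time T_pot or None
    # layer2_permutation_test(t_pot): returns True if the suspected drift is confirmed
    # reconfigure_layer1(): resets and restarts the Layer-I test
    confirmed_drifts = []                       # {T_cd} in Algorithm 1
    for t, (x, y) in enumerate(stream):
        t_pot = layer1_update(x, y)             # Layer-I hypothesis testing
        if t_pot is not None:                   # potential drift detected
            if layer2_permutation_test(t_pot):  # Layer-II confirms the drift
                confirmed_drifts.append(t_pot)
            else:
                reconfigure_layer1()            # discard T_pot, restart Layer-I
    return confirmed_drifts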

More specifically, at each time instant t, LFR conducts statistical tests with the following null and alternative hypotheses:

$H_0: \forall \star,\ P(P^{(t-1)}_{\star}) = P(P^{(t)}_{\star})$
$H_A: \exists \star,\ P(P^{(t-1)}_{\star}) \neq P(P^{(t)}_{\star})$
$\star \in \{tpr, tnr, ppv, npv\}$

The concept is stable under $H_0$ and is considered to have potentially drifted if $H_A$ holds. LFR modifies $P^{(t)}_{\star}$ with $\hat{R}^{(t)}_{\star}$ as employed in [20, 21] (also see footnote 1). $\hat{R}^{(t)}_{\star}$ is essentially a weighted linear combination of the classifier's previous and current performance. Given the fact that $\hat{R}^{(t)}_{\star}$ follows a weighted i.i.d. Bernoulli distribution (see proof in [19]), we are able to obtain the bound table BoundTable by Monte Carlo simulation. Having computed the bounds, the framework considers that a concept drift is likely to have occurred and sets the warning signal (warn.time ← t) when any $\hat{R}^{(t)}_{\star}$ crosses the corresponding warning bound (warn.bd) for the first time. If any $\hat{R}^{(t)}_{\star}$ reaches the corresponding detection bound (detect.bd), the concept drift is affirmed at detect.time ← t. Interested readers can refer to [19] for more details.

Linear Four Rates testing (the Layer-I hypothesis test) takes as input a data stream $\{(X_t, y_t)\}_{t=1}^{\infty}$, where $X_t \in \mathbb{R}^d$ and $y_t \in \{0, 1\}$, and a binary classifier f, and returns the potential concept drift times $\{T_{pot}\}$. The time decaying factors $\eta_\star$, warning significance levels $\delta_\star$ and detection significance levels $\epsilon_\star$ are user-defined parameters. The test is initialized by setting $\hat{P}^{(0)}_{\star} \leftarrow 0.5$ and $\hat{R}^{(0)}_{\star} \leftarrow 0.5$ for $\star \in \{tpr, tnr, ppv, npv\}$, and the confusion matrix $C^{(0)} \leftarrow [1, 1; 1, 1]$. For each incoming data point,

$\hat{y}_t \leftarrow f(X_t)$ and $C^{(t)}[y_t][\hat{y}_t] \leftarrow C^{(t-1)}[y_t][\hat{y}_t] + 1.$

$\hat{R}^{(t)}_{\star}$ is updated using the following rule:

$\hat{R}^{(t)}_{\star} = \begin{cases} \hat{R}^{(t-1)}_{\star} & (y_t, \hat{y}_t) \text{ has no impact on } \star \\ \eta_\star \hat{R}^{(t-1)}_{\star} + (1-\eta_\star)\mathbf{1}_{\{\hat{y}_t = y_t\}} & \text{otherwise} \end{cases}$

$\hat{P}^{(t)}_{\star}$ is updated using the following rule:

$\hat{P}^{(t)}_{\star} = \begin{cases} C^{(t)}[\mathbf{1}_{\{\star=tpr\}}][\mathbf{1}_{\{\star=tpr\}}]/N_\star & \star \in \{tpr, tnr\} \\ C^{(t)}[\mathbf{1}_{\{\star=ppv\}}][\mathbf{1}_{\{\star=ppv\}}]/N_\star & \star \notin \{tpr, tnr\} \end{cases}$

where $N_\star$ is the corresponding row total (for tpr, tnr) or column total (for ppv, npv) of $C^{(t)}$. Warnings and detections are triggered under the following conditions. If any $\hat{R}^{(t)}_{\star}$ exceeds warn.bd$_\star$ and warn.time is NULL, then warn.time ← t; else, if no $\hat{R}^{(t)}_{\star}$ exceeds warn.bd$_\star$ and warn.time is not NULL, then warn.time ← NULL. The warning and detection bounds are set as follows:

warn.bd$_\star$ ← BoundTable($\hat{P}^{(t)}_{\star}$, $\eta_\star$, $\delta_\star$, $N_\star$)
detect.bd$_\star$ ← BoundTable($\hat{P}^{(t)}_{\star}$, $\eta_\star$, $\epsilon_\star$, $N_\star$)

Once a detection is triggered, i.e., any $\hat{R}^{(t)}_{\star}$ exceeds detect.bd$_\star$, set detect.time ← t, relearn f using $\{(X_t, y_t)\}_{t=\text{warn.time}}^{\text{detect.time}}$, reset $\hat{R}^{(t)}_{\star}$, $\hat{P}^{(t)}_{\star}$ and $C^{(t)}$, and return $\{T_{pot}\} \leftarrow t$.
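The Layer-I bookkeeping described above can be condensed into the following Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: bound_table is a hypothetical stand-in for the Monte Carlo BoundTable lookup (returning lower and upper bounds for R), f is any object whose predict returns a 0/1 label for a single sample, a single eta is used for all four rates for brevity (the paper allows a per-rate eta_star), and state is initialized with C = [[1, 1], [1, 1]], R = P = 0.5, t = 0 and warn_time = None, as in the text.

def layer1_lfr_step(x, y, state, f, eta, delta, epsilon, bound_table):
    # One Layer-I (LFR) update for sample (x, y); returns a potential drift time or None.
    y_hat = f.predict(x)
    state["t"] += 1
    C = state["C"]                      # rows: true label, columns: prediction
    C[y][y_hat] += 1

    # Which rates this observation influences, and their sample sizes N_star.
    influenced = {"tpr": y == 1, "tnr": y == 0, "ppv": y_hat == 1, "npv": y_hat == 0}
    N = {"tpr": C[1][1] + C[1][0], "tnr": C[0][0] + C[0][1],
         "ppv": C[1][1] + C[0][1], "npv": C[0][0] + C[1][0]}
    P_emp = {"tpr": C[1][1] / N["tpr"], "tnr": C[0][0] / N["tnr"],
             "ppv": C[1][1] / N["ppv"], "npv": C[0][0] / N["npv"]}

    detected, warned = None, False
    for star in ("tpr", "tnr", "ppv", "npv"):
        if influenced[star]:            # otherwise R_star is left unchanged
            state["R"][star] = eta * state["R"][star] + (1 - eta) * float(y_hat == y)
        lo_w, hi_w = bound_table(P_emp[star], eta, delta, N[star])    # warning bounds
        lo_d, hi_d = bound_table(P_emp[star], eta, epsilon, N[star])  # detection bounds
        if not (lo_w <= state["R"][star] <= hi_w):
            warned = True
        if not (lo_d <= state["R"][star] <= hi_d):
            detected = state["t"]

    if warned and state["warn_time"] is None:
        state["warn_time"] = state["t"]
    elif not warned and state["warn_time"] is not None:
        state["warn_time"] = None
    # On detection the caller relearns f on the data since warn_time and resets state.
    return detected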

HLFR includes two modifications to LFR. First, we update the time decaying factor $\eta_\star$ with⁴:

$\eta^{(t)}_{\star} = \begin{cases} (\eta^{(t-1)}_{\star} - 1)\, e^{-(\hat{R}^{(t)}_{\star} - \hat{R}^{(t-1)}_{\star})} + 1 & \hat{R}^{(t)}_{\star} \ge \hat{R}^{(t-1)}_{\star} \\ (1 - \eta^{(t-1)}_{\star})\, e^{\hat{R}^{(t)}_{\star} - \hat{R}^{(t-1)}_{\star}} + (2\eta^{(t-1)}_{\star} - 1) & \hat{R}^{(t)}_{\star} < \hat{R}^{(t-1)}_{\star} \end{cases}$

This adaptation is motivated by the adaptive signal processing domain [8]. The key idea is that once $\hat{R}_\star$ increases, the system tends to perform better on recent data, which suggests that a larger time decaying factor is feasible, and vice versa. Second, we replaced the BoundTable with an explicit mathematical expression in only a few parameters, obtained by surface fitting, which saves memory and increases table resolution. Numerous experiments (results not shown in this paper) validated the effectiveness of these modifications, especially in low-memory environments. Unless otherwise specified, the LFR mentioned in this paper refers to the modified one.

⁴Note that this time-varying representation of $\eta^{(t)}_{\star}$ is hyperparameter free and self-bounded between $(2\eta^{(t-1)}_{\star} - 1, 1)$ with a "sigmoid"-like curve.
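A direct transcription of this update rule in Python (a sketch; the function and variable names are ours):

import math

def update_eta(eta_prev, r_curr, r_prev):
    # Adaptive time decaying factor. Rises toward 1 when the monitored rate R
    # improves, and decays toward 2*eta_prev - 1 when R drops, matching the
    # self-bounding property stated in footnote 4.
    delta = r_curr - r_prev
    if delta >= 0:
        return (eta_prev - 1.0) * math.exp(-delta) + 1.0
    return (1.0 - eta_prev) * math.exp(delta) + (2.0 * eta_prev - 1.0)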

2.2 Layer-II Hypothesis Testing The second-level test aims at validating detections raised by Layer-I; as such, it is activated only when a Layer-I detection occurs. In particular, we rely on the value $T_{pot}$ provided by the Layer-I test to partition the streaming observations into two subsequences (representing observations before and after the suspected drift instant $T_{pot}$), and we then apply another statistical hypothesis test



for comparing the inherent properties of these two subsequences, assessing possible variations in the joint distribution $P(f(X_t), y_t)$.

In this section, we present the employed permutation test procedure (see Algorithm 2). Permutation tests are theoretically well founded and do not require a priori information about the monitored process or the nature of the drift [6]. We emphasize that the test statistic (or quantity) used at Layer-II should be strictly related to that used at Layer-I. To this end, we choose the zero-one loss over the ordered train-test split as our test statistic in the Layer-II test. The zero-one loss is easy to calculate and is also directly related to the aforementioned four rates. The intuition behind this scheme is that if no concept drift has occurred, the prediction loss on the ordered split should not deviate much from that of the shuffled splits, especially if the learning algorithm has algorithmic stability.

Algorithm 2 Permutation Test (Layer-II)

Require: Potential drift time $T_{pot}$; permutation window size W; learning algorithm A; number of permutations P; significance rate $\eta$.
Ensure: Test decision (is $T_{pot}$ a true drift or not?).
1: $S_{ord}$ ← streaming segment of length W before $T_{pot}$.
2: $S'_{ord}$ ← streaming segment of length W after $T_{pot}$.
3: Train classifier $f_{ord}$ on $S_{ord}$ using A.
4: Test $f_{ord}$ on $S'_{ord}$ to obtain the zero-one loss $E_{ord}$.
5: for i = 1 to P do
6:    $(S_i, S'_i)$ ← random split of $S_{ord} \cup S'_{ord}$.
7:    Train classifier $f_i$ on $S_i$ using A.
8:    Test $f_i$ on $S'_i$ to obtain the zero-one loss $E_i$.
9: end for
10: if $\left(1 + \sum_{i=1}^{P} \mathbf{1}[E_{ord} \le E_i]\right) / (1 + P) \le \eta$ then
11:    decision ← True
12: else
13:    decision ← False
14: end if
15: return decision
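A compact Python sketch of Algorithm 2 follows. It assumes in-memory windows and a scikit-learn-style estimator factory make_classifier() as a hypothetical stand-in for algorithm A; it illustrates the ordered-versus-shuffled zero-one-loss comparison, not the authors' exact implementation.

import numpy as np

def permutation_test(X_before, y_before, X_after, y_after, make_classifier,
                     n_permutations=1000, eta=0.05, seed=None):
    # X_before/y_before and X_after/y_after are the two windows of length W
    # around the potential drift point T_pot. Returns True if the drift is confirmed.
    rng = np.random.default_rng(seed)

    def zero_one_loss(clf, X, y):
        return np.mean(clf.predict(X) != y)

    # Ordered split: train before T_pot, test after T_pot.
    clf = make_classifier().fit(X_before, y_before)
    e_ord = zero_one_loss(clf, X_after, y_after)

    # Shuffled splits of the pooled windows.
    X_all = np.vstack([X_before, X_after])
    y_all = np.concatenate([y_before, y_after])
    w = len(y_before)
    count = 0
    for _ in range(n_permutations):
        idx = rng.permutation(len(y_all))
        clf_i = make_classifier().fit(X_all[idx[:w]], y_all[idx[:w]])
        if e_ord <= zero_one_loss(clf_i, X_all[idx[w:]], y_all[idx[w:]]):
            count += 1

    p_value = (1 + count) / (1 + n_permutations)
    return p_value <= eta   # small p-value: the ordered loss is unusually large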

3 Experiments

Four sets of experiments are presented to demonstrate the effectiveness and superiority of HLFR over baseline approaches. First, we demonstrate the benefits of a two-layered architecture for concept drift detection over single-layer approaches. Second, quantitative metrics and plots are presented to show HLFR's superior performance over benchmark approaches such as DDM [5], EDDM [1], DDM-OCI [20], STEPD [16] and LFR [19]. Third, experiments with various classifiers (soft-margin SVM, k-nearest neighbors (KNN) and quadratic discriminant analysis (QDA)) are presented to demonstrate that the superiority of HLFR is independent of the choice of classifier. Finally, we validate, via an application to spam email filtering, the effectiveness of the proposed HLFR on streaming data classification and the accuracy of its detected concept drift points.

3.1 Benefits of a Hierarchical Architecture Before evaluating the HLFR framework, we evaluated the benefits offered by the proposed hierarchical architecture itself. Note that the hierarchical architecture may be integrated with an existing concept drift approach by incorporating the second-layer test to reduce false positives. This section compares single-layer approaches such as LFR and DDM to their hierarchical-architecture counterparts in terms of false positives and false negatives. Even though parameter tuning of single-layer approaches (such as decreasing the warning and detection thresholds of DDM) can be used to control the number of detected potential drift points, the reduction in false positives often comes at the cost of fewer true positives. In the hierarchical architecture, however, for a given parameter setting of the single-layer approach, it is often possible to reduce false positives with no decrease in true positives.

[Figure 2 panels: (a) HLFR vs. LFR with (δ = 0.01, ε = 1/1k) and (δ = 0.01, ε = 1/100k); (b) DDM vs. DDM with permutation test with (α = 2, β = 3) and (α = 1.5, β = 1.8); (c) HLFR vs. LFR with (δ = 0.01, ε = 1/10k) and (δ = 0.01, ε = 1/20k); (d) EDDM vs. EDDM with permutation test with (α = 0.95, β = 0.9) and (α = 0.9, β = 0.8).]

Figure 2: Comparison of the histograms of detected drift points over the USENET1 dataset for LFR [19], DDM [5] and EDDM [1] combined with the permutation test. The red columns denote the ground truth drift points; the blue columns represent the histogram of detected drift points generated from 100 Monte Carlo simulations.

For the purpose of this evaluation, the benchmark USENET1 [12] dataset is used. This dataset simulates a stream of email messages from different topics that are sequentially presented to a user, who then labels them as


interesting or junk according to his/her personal interests. It consists of 5 time periods of 300 examples. At the end of each period, the user's interest in a topic alters in order to simulate the occurrence of concept drift.

We considered four learning scenarios of the LFR framework, with warning and detection significance levels set to {δ⋆ = 0.01, ε⋆ = 0.001}, {δ⋆ = 0.01, ε⋆ = 0.0001}, {δ⋆ = 0.01, ε⋆ = 0.00005} and {δ⋆ = 0.01, ε⋆ = 0.00001}, respectively. For each scenario, we compared the corresponding HLFR performance in detecting concept drifts (Fig. 2(a) and Fig. 2(c)). Similarly, Fig. 2(b) and Fig. 2(d) plot the results of the benchmark DDM [5] and EDDM [1] approaches for different learning scenarios, along with their respective hierarchical-architecture counterparts that use an additional permutation test layer. The first and third rows of all the subplots are the results of the hierarchical-architecture equivalents of the baseline approaches presented in the second and fourth rows, respectively.

From Fig. 2, it can be concluded that the hierarchical architecture offers an intrinsic advantage over single-layer methods. The hierarchical architecture does not affect the performance of a given single-layer method if it already performs well, and it reduces the false positives made by the single-layer method. The second-layer test is a flexible module within the hierarchical architecture and can be combined with any other single-layer method in practice. It is also worth noting that the second-layer test is not limited to the permutation test.

3.2 Concept Drift Detection with HLFR In this section, we compare the performance of the HLFR framework against popular concept drift benchmark approaches. Five benchmark algorithms are considered for evaluation: DDM [5], EDDM [1], DDM-OCI [20], STEPD [16]⁵, as well as our recently proposed LFR [19]. The datasets used include both synthetic and real-world data. Drifts are synthesized in the data, thus controlling the ground truth drift points and allowing precise quantitative analysis. For the comparison, we set the warning and detection thresholds of DDM (EDDM) to α = 3 (α = 0.95) and β = 2 (β = 0.90), the warning and detection significance levels of STEPD to w = 0.05 and d = 0.01⁶, and the parameter of DDM-OCI is varied depending on data properties. The warning and detection significance levels of LFR, i.e., δ⋆ and ε⋆, are set to 0.01 and 0.0001, respectively. The significance rate η in HLFR is fixed to 0.05. Quantitative comparisons are performed by evaluating detection quality.

⁵We re-implemented STEPD using the chi-square test with Yates's continuity correction, as it is equivalent to the Fisher's exact test used in the original paper [15].
⁶These hyperparameters were not optimized. The parameter d in STEPD corresponds to a p-value and is thus set to a standard significance value. The parameters used for DDM and EDDM were taken as recommended by their authors.

To this end, a True Positive (TP) detection is defined as a detection within a fixed delay range after the precise concept change time. A False Negative (FN) is defined as missing a detection within the delay range, and a False Positive (FP) as a detection outside this range or an extra detection within the range. The detection quality is then measured by both the Recall = TP/(TP + FN) and the Precision = TP/(TP + FP) of the detector.
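This detection-quality computation can be made concrete with the following sketch (our own helper, not from the paper); it counts, for a given delay range, which true drifts are matched and which detections are spurious.

def detection_quality(true_drifts, detections, delay_range):
    # A detection in [d, d + delay_range] of a true drift d counts as a TP (one per
    # true drift); a missed true drift is an FN; every other detection (outside all
    # ranges, or an extra one within a range) is an FP.
    tp, fn, used = 0, 0, set()
    for d in true_drifts:
        matches = [t for t in detections
                   if d <= t <= d + delay_range and t not in used]
        if matches:
            tp += 1
            used.add(min(matches))   # credit the earliest matching detection
        else:
            fn += 1
    fp = len(detections) - len(used)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall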

Table 1: Summary of properties of selected datasets

Data property      SEA   Checkerboard   Hyperplane   USENET1
high dimensional   no    no             no           yes
imbalance          no    no             yes          no
recurrent          yes   no             yes          yes
abrupt             yes   yes            yes          yes
gradual            no    no             yes          no

In the following drift detection tasks the base classifier is a soft margin SVM with a linear kernel (except for USENET1 [12], where a soft margin SVM with an RBF kernel (σ = 1) is selected as the base classifier). The first stream, denoted "SEA" [5], represents abrupt drift with label noise. This dataset has 60000 examples and 3 attributes. The attributes are numeric between 0 and 10, and only two are relevant. There are 4 concepts of 15000 examples each, with different thresholds for the concept function: an example's label is 0 if the sum of the two relevant features is larger than the threshold. The threshold values are 8, 9, 7 and 9.5. The second stream, denoted "Checkerboard" [4], presents a more challenging concept drift with label noise, where the examples are sampled uniformly from the unit square and the labels are set by a checkerboard with tile width 0.2. At each concept drift, the checkerboard is rotated by an angle of φ/8 radians. The third stream, the most challenging synthetic dataset, denoted "Rotating hyperplane", comprises 60000 examples and 5 uniformly spaced abrupt concept drift points. Between any two adjacent abrupt drift points, there are 10000 examples exhibiting a slow gradual concept drift specified by a (k, t) pair, where k denotes the total number of dimensions whose weights are changing and t measures the magnitude of the change in the attributes. In our experiments, the (k, t) pairs are set to (2, 0.1), (2, 0.5), (2, 1.0), (5, 0.1), (5, 0.5) and (5, 1.0) successively. The last stream, USENET1 [12], represents drifts synthesized from real data. Table 1 summarizes the data properties and drift types for each stream⁷: a "yes" indicates that the stream has the corresponding data property or concept drift type. Clearly, the selected datasets span the breadth of concept drift types.

⁷We call a stream high dimensional if the number of attributes of the streaming samples, i.e., the dimensionality of $X_t$, is larger than 10, as the majority of benchmark datasets in the concept drift detection community have fewer than 5 attributes [5, 17]. Imbalance means that the ratio of the number of samples in the minority class to the number of samples in the majority class is less than 20%.
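For reference, a minimal Python sketch generating a SEA-style stream as described above (four concepts of 15000 examples, thresholds 8, 9, 7 and 9.5, third attribute irrelevant). The noise level is a free parameter here, since the text states that label noise is present but does not specify its rate; this is an illustrative generator, not the original data.

import numpy as np

def generate_sea(n_per_concept=15000, thresholds=(8, 9, 7, 9.5), noise=0.1, seed=0):
    # 3 uniform attributes in [0, 10]; only the first two are relevant.
    # Label is 0 iff x1 + x2 exceeds the current concept's threshold.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 10, size=(n_per_concept * len(thresholds), 3))
    y = np.empty(len(X), dtype=int)
    for k, thr in enumerate(thresholds):
        sl = slice(k * n_per_concept, (k + 1) * n_per_concept)
        y[sl] = (X[sl, 0] + X[sl, 1] <= thr).astype(int)
    flip = rng.random(len(y)) < noise          # label noise (rate assumed)
    y[flip] = 1 - y[flip]
    return X, y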


[Figure 3 panels: histograms of detected drift points for HLFR, LFR, DDM, EDDM, STEPD and DDM-OCI on (a) the SEA dataset; (b) the Checkerboard dataset; (c) the Rotating hyperplane dataset; (d) the USENET1 dataset.]

Figure 3: Comparison of the histograms of detected drift points over (a) the SEA dataset; (b) the Checkerboard dataset; (c) the Rotating hyperplane dataset; and (d) the USENET1 dataset. The red columns denote the ground truth drift points; the blue columns represent the histogram of detected drift points generated from 100 Monte Carlo simulations.

[Figure 4 panels: Precision-Range curves over (a) SEA, (b) Checkerboard, (c) Hyperplane and (d) USENET1, and Recall-Range curves over (e) SEA, (f) Checkerboard, (g) Hyperplane and (h) USENET1; each panel compares HLFR, LFR, DDM, EDDM, STEPD and DDM-OCI.]

Figure 4: Summary of Precision and Recall over the SEA, Checkerboard, Rotating hyperplane and USENET1 datasets. The X-axis in each figure represents the pre-defined detection delay range, whereas the Y-axis denotes the corresponding Precision and Recall values. For a specific delay range, a higher Precision or Recall value suggests better performance.


Each stream was independently generated 100 times, and P = 1000 reshuffling splits were used in HLFR. We summarized the detected concept drift points for each method over these 100 independent trials. Fig. 3 compares the detection results of the various approaches. As can be seen, HLFR and LFR significantly outperform the other four approaches in their ability to detect concept drifts early. The two approaches also significantly outperform the others by triggering fewer false detections while missing fewer concept drift points. HLFR further improves on LFR by removing even the few false positives triggered by LFR.

Quantitative evaluations of Precision and Recall are presented in Fig. 4: the rows correspond to performance measurements and the columns to different datasets. The detection Precision is significantly improved with HLFR, while the Recall of HLFR and LFR are similar (except on the Rotating hyperplane dataset). This is not surprising, as the purpose of the Layer-II test is to confirm or deny the Layer-I detection results; Layer-II cannot compensate for detections missed by the Layer-I test. The relatively lower Recall is explained by the fact that the Layer-II test is conservative, i.e., it may deny true positives, although the probability is very low. STEPD seems to provide much better Recall on the SEA and Rotating hyperplane datasets. However, these results are meaningless in practice, as STEPD triggers significantly more false alarms (as seen in the fifth rows of Fig. 3(a) and Fig. 3(c)). Additionally, the detection Precision of STEPD on these two datasets is consistently below 0.15, which further discredits its high Recall values. Table 2 summarizes the detection delays for all competing algorithms; the values outside brackets indicate ensemble averages, while the numbers in brackets denote standard deviations. The quantitative evaluations corroborate the qualitative observations.

Table 2: Detection delay for all competing algorithms

Algorithms   SEA         Checkerboard   Hyperplane   USENET1
HLFR         482 (502)   55 (37)        120 (85)     17 (7)
LFR          458 (486)   56 (36)        127 (112)    17 (7)
DDM          1209 (450)  69 (56)        125 (128)    26 (17)
EDDM         939 (550)   93 (50)        166 (121)    36 (8)
DDM-OCI      844 (461)   58 (40)        198 (112)    26 (17)
STEPD        463 (423)   57 (58)        140 (118)    19 (7)

3.3 Performance of HLFR is Independent of the Classifier In this section, we show that the superiority of HLFR over other approaches is independent of the classifier. To this end, instead of a soft-margin SVM, two other classifiers, namely a k-nearest neighbor (KNN) classifier and quadratic discriminant analysis (QDA), are applied separately with all the competing algorithms. Fig. 5 shows the concept drift detection results over the USENET1 and Checkerboard datasets using these two classifiers, which coincide with the simulation results in Section 3.2. HLFR consistently produces the best performance compared to the baseline approaches. Note that, although almost all methods perform poorly on the Checkerboard dataset with the QDA classifier, only HLFR provides reasonable detection results on the third, fifth, sixth and seventh drifts.

[Figure 5 panels: detection results of HLFR, LFR, DDM, EDDM, STEPD and DDM-OCI for (a) USENET1 with KNN; (b) Checkerboard with KNN; (c) USENET1 with QDA; (d) Checkerboard with QDA.]

Figure 5: Comparison of the concept drift detection results over the USENET1 and Checkerboard datasets using the KNN or QDA classifier. The red lines denote the ground truth drift points. The blue columns represent the histogram of detected drift points generated from 100 Monte Carlo simulations.

3.4 Classification With Concept Drift Using HLFR Finally, we demonstrate the effectiveness of the proposed HLFR on a real-world problem of classifying streaming data with concept drifts. To this end, a case study is performed on a representative real-world concept-drifting dataset from the email domain: the spam filtering dataset [11], consisting of 9324 instances and 500 attributes, with a spam ratio of approximately 20%. It has been demonstrated that the spam filtering dataset contains natural concept drifts [11, 10]. A soft-margin SVM classifier is learned from the training set and evaluated (and adapted) on the test set. The DDM-OCI results are omitted because it fails to detect any reasonable concept drift points.

On Parameter Tuning and Experimental Settings. A common phenomenon in the classification of real-world streaming data with concept drifts and temporal dependency


is that "the more random change alarms the classifier fires, the better the accuracy" [22]. Thus, to provide a fair comparison, the parameters of all competing algorithms are tuned to detect a similar number of concept drifts (except for LFR [19], where slightly more potential drift points are allowed). Table 3 summarizes the key parameters regarding the significance levels (or thresholds) of the different algorithms. An extensive search for an appropriate partition into training and testing sets was performed based on two criteria: 1) the training set should be sufficient to achieve "significant" classification performance on both the majority and minority classes; and 2) there should be no strong autocorrelation in the classification residual sequence of the training set. With these two considerations, the length of the training set is set to 600.

Table 3: Parameter settings in spam email filtering.

Algorithms   Parameter settings (significance level, threshold)
HLFR         δ⋆ = 0.01, ε⋆ = 0.00001, η = 0.01
LFR          δ⋆ = 0.01, ε⋆ = 0.00001
DDM          α = 3, β = 2.5
EDDM         α = 0.95, β = 0.90
STEPD        w = 0.005, d = 0.0003

Fig. 6 plots the concept drift detection results. Before evaluating them, we recommend referring to Fig. 6 of [11], in which the k-means and expectation maximization (EM) clustering algorithms were applied to the conceptual vectors of the spam filtering dataset. According to [11], there are three dominating clusters (i.e., concepts) distributed over different time periods, and concept drifts occur approximately in the neighborhoods of point 200 in Region I, point 8000 in Region III, the ending location (point 1800) of Region I, and the start and ending locations (points 2300 and 6200) of Region II. In addition, there are many abrupt drifts in Region II; a possible reason for these abrupt and frequent drifts may be batches of outliers or noisy messages. The detection results of HLFR clearly match these descriptions best, except that it missed a potential drift point around point 1800. LFR detected this point, but it also adds some false positives. Other methods, such as DDM or EDDM, not only miss obvious drift points, but also report unreasonable drift locations in Region I or Region III.

To further bridge our detection results and the clustering results in [11], we adopt a recently developed measurement, the Kappa Plus Statistic (KPS) [2, 23]. KPS, defined as $\kappa^+ = \frac{p_0 - p'_e}{1 - p'_e}$, aims to evaluate data stream classifier performance taking into account temporal dependence as well as the effectiveness (or rationality) of classifier adaptation, where $p_0$ is the classifier's prequential accuracy and $p'_e$ is the accuracy of the No-Change classifier. We segment the training set into approximately 30 periods. The KPS prequential representation over these periods is shown in Fig. 7.
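A small Python sketch of the KPS computation, under the usual assumption (from [2, 23]) that the No-Change classifier always predicts the previously observed true label; y_true and y_pred are the prequential labels and predictions over one period.

import numpy as np

def kappa_plus(y_true, y_pred):
    # p0: prequential accuracy of the evaluated classifier over the period.
    # pe: accuracy of the No-Change classifier (predicts the previous true label).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p0 = np.mean(y_pred[1:] == y_true[1:])
    pe = np.mean(y_true[1:] == y_true[:-1])
    return (p0 - pe) / (1.0 - pe) if pe < 1.0 else 0.0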

[Figure 6 panels: drift detections of HLFR, LFR, DDM, EDDM and STEPD over the spam stream, with Regions I, II and III marked.]

Figure 6: Concept drift detection results over all competing algorithms.

[Figure 7: KPS curves for HLFR, LFR, DDM, EDDM, STEPD and no adaptation.]

Figure 7: Kappa Plus Statistic prequential representations.

As can be seen, the HLFR adaptation is most effective in periods 1-5 but suffers a performance drop in periods 6-10. These observations coincide with the detection results, as HLFR accurately detected the first drift point with no false positives in Region I, but missed a target in Region II.

Regarding the problem of streaming data classification, several quantitative measurements were used for a thorough evaluation. While overall accuracy (OAC) is an important and commonly used metric, it is not an adequate evaluation measure for streaming data classification (especially for imbalanced data). Therefore, we also included the F-measure and the minority-class G-mean. All metrics are computed at each time step, creating a time-series representation (just like the learning curves used in the general field of adaptive learning systems [9]). Fig. 8 plots the time series of OAC, F-measure and G-mean in this learning scenario. Some key observations can be drawn from these figures: 1) There are severe concept drifts in the given data, as the performance of a non-adaptive classifier deteriorates significantly as time evolves. 2) HLFR typically provides a significant improvement in F-measure and G-mean compared to its DDM, EDDM and LFR counterparts, while maintaining good accuracy. 3) STEPD seems to exhibit the best overall classification performance in this case, but HLFR provides more accurate (or rational) concept drift detections, which coincide with the cluster assignment results in [11].
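For completeness, a sketch of how the three reported metrics can be computed from running confusion-matrix counts at each time step (our own helper; G-mean is taken as the common sqrt(TPR x TNR) form).

import math

def streaming_metrics(tp, tn, fp, fn):
    # Overall accuracy (OAC), F-measure and G-mean from running counts.
    total = tp + tn + fp + fn
    oac = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0          # minority-class recall (TPR)
    specificity = tn / (tn + fp) if tn + fp else 0.0     # TNR
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    g_mean = math.sqrt(recall * specificity)
    return oac, f_measure, g_mean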


[Figure 8 panels: OAC, F-measure and G-mean time series for HLFR, LFR, DDM, EDDM, STEPD and no adaptation.]

Figure 8: The time series representations for all competing algorithms over OAC, F-measure and G-mean.

4 Conclusions

This paper presents a novel concept drift detection framework (HLFR) that detects concept drifts for different data stream distributions (including imbalanced data) by leveraging a hierarchical set of hypothesis tests in an online setting. By using a permutation test in the second layer, HLFR significantly reduces false alarms. Experimental results show HLFR significantly outperforming benchmark approaches in terms of accuracy, G-mean, recall and delay in detection of concept drift across the various datasets.

References

[1] M. Baena-García, J. del Campo-Ávila, R. Fidalgo, A. Bifet, R. Gavaldà, and R. Morales-Bueno, Early drift detection method, in Fourth International Workshop on Knowledge Discovery from Data Streams, vol. 6, 2006, pp. 77-86.
[2] A. Bifet, J. Read, I. Zliobaite, B. Pfahringer, and G. Holmes, Pitfalls in benchmarking data stream classification and how to avoid them, in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2013, pp. 465-479.
[3] O. Bousquet and A. Elisseeff, Stability and generalization, Journal of Machine Learning Research, 2 (2002), pp. 499-526.
[4] R. Elwell and R. Polikar, Incremental learning of concept drift in nonstationary environments, IEEE Transactions on Neural Networks, 22 (2011), pp. 1517-1531.
[5] J. Gama, P. Medas, G. Castillo, and P. Rodrigues, Learning with drift detection, in Brazilian Symposium on Artificial Intelligence, Springer, 2004, pp. 286-295.
[6] P. Good, Permutation tests: a practical guide to resampling methods for testing hypotheses, Springer Science & Business Media, 2013.
[7] M. Harel, S. Mannor, R. El-Yaniv, and K. Crammer, Concept drift detection through resampling, in ICML, 2014, pp. 1009-1017.
[8] S. S. Haykin, Adaptive filter theory, Pearson Education India, 2008.
[9] S. S. Haykin, Neural networks and learning machines, vol. 3, Pearson, Upper Saddle River, NJ, USA, 2009.
[10] I. Katakis, G. Tsoumakas, and I. Vlahavas, Dynamic feature space and incremental feature selection for the classification of textual data streams, Knowledge Discovery from Data Streams, (2006), pp. 107-116.
[11] I. Katakis, G. Tsoumakas, and I. Vlahavas, Tracking recurring contexts using ensemble classifiers: an application to email filtering, Knowledge and Information Systems, 22 (2010), pp. 371-391.
[12] I. Katakis, G. Tsoumakas, and I. P. Vlahavas, An ensemble of classifiers for coping with recurring contexts in data streams, in ECAI, 2008, pp. 763-764.
[13] R. Klinkenberg and T. Joachims, Detecting concept drift with support vector machines, in ICML, 2000, pp. 487-494.
[14] L. I. Kuncheva and C. O. Plumpton, Adaptive learning rate for online linear discriminant classifiers, in Joint IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2008, pp. 510-519.
[15] E. L. Lehmann and J. P. Romano, Testing statistical hypotheses, Springer Science & Business Media, 2006.
[16] K. Nishida and K. Yamauchi, Detecting concept drift using statistical testing, in International Conference on Discovery Science, Springer, 2007, pp. 264-269.
[17] G. J. Ross, N. M. Adams, D. K. Tasoulis, and D. J. Hand, Exponentially weighted moving average charts for detecting concept drift, Pattern Recognition Letters, 33 (2012), pp. 191-198.
[18] V. Vapnik, Principles of risk minimization for learning theory, in NIPS, 1991, pp. 831-838.
[19] H. Wang and Z. Abraham, Concept drift detection for streaming data, in 2015 International Joint Conference on Neural Networks (IJCNN), IEEE, 2015, pp. 1-9.
[20] S. Wang, L. L. Minku, D. Ghezzi, D. Caltabiano, P. Tino, and X. Yao, Concept drift detection for online class imbalance learning, in The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, 2013, pp. 1-10.
[21] S. Wang, L. L. Minku, and X. Yao, A learning framework for online class imbalance learning, in 2013 IEEE Symposium on Computational Intelligence and Ensemble Learning (CIEL), IEEE, 2013, pp. 36-45.
[22] I. Zliobaite, How good is the electricity benchmark for evaluating concept drift adaptation, arXiv preprint arXiv:1301.3524, 2013.
[23] I. Zliobaite, A. Bifet, J. Read, B. Pfahringer, and G. Holmes, Evaluation methods and decision theory for classification of streaming data with temporal dependence, Machine Learning, 98 (2015), pp. 455-482.
