
Background Subtraction with Dirichlet Process Mixture Models

Tom S.F. Haines and Tao Xiang

T.S.F. Haines is with the Department of Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom. E-mail: [email protected].

T. Xiang is with the School of Electronic Engineering and Computer Science, Queen Mary University of London, Mile End Road, London E1 4NS, United Kingdom. E-mail: [email protected].

Manuscript received 19 Mar. 2013; revised 4 Sept. 2013; accepted 12 Nov. 2013; date of publication 4 Dec. 2013; date of current version 19 Mar. 2014. Recommended for acceptance by J. Jia. Digital Object Identifier no. 10.1109/TPAMI.2013.239

Abstract—Video analysis often begins with background subtraction. This problem is often approached in two steps—a background model followed by a regularisation scheme. A model of the background allows it to be distinguished on a per-pixel basis from the foreground, whilst the regularisation combines information from adjacent pixels. We present a new method based on Dirichlet process Gaussian mixture models, which are used to estimate per-pixel background distributions. It is followed by probabilistic regularisation. Using a non-parametric Bayesian method allows per-pixel mode counts to be automatically inferred, avoiding over-/under-fitting. We also develop novel model learning algorithms for continuous update of the model in a principled fashion as the scene changes. These key advantages enable us to outperform the state-of-the-art alternatives on four benchmarks.

Index Terms—Background subtraction, Dirichlet processes, non-parametric Bayesian methods, confidence capping, video analysis


1 INTRODUCTION

BACKGROUND subtraction can be defined as a binary segmentation of a video stream, into the foreground, which is unique to a particular moment in time, and the background, which is always present. It is typically used as an interest detector for higher level problems, such as automated surveillance, action recognition, intelligent environments and motion analysis. The etymology of background subtraction derives from the original approach, where a single static image of just the background is subtracted from the current frame, to generate a difference image. If the absolute difference exceeds a threshold the pixel in question is declared to belong to the foreground. This approach performs poorly because it assumes a static background with well-behaved objects—in practice this is almost never the case. Many situations break these assumptions [1], [2]:

Dynamic background, where objects such as trees blow in the wind, escalators move and traffic lights change colour. These objects, whilst moving, still belong to the background as they are of no interest to further analysis.

Noise, as caused by the image capturing process. It can vary over the image due to photon noise and varying brightness. In some cases, such as low light/thermal, it can dominate.

Camouflage, where a foreground object looks very much like the background, e.g., a sniper in a ghillie suit.

Camera shake often exists, a symptom of mount points that are subject to wind or vibrations. This can be considered to be a type of highly correlated global noise.

Moved object (class transitions), where the foreground becomes part of the background, or vice versa; e.g., a car could be parked in the scene, and after sufficient time considered part of the background, only to later become foreground again when driven off.

Bootstrapping. As it is often not possible to get a frame with no foreground, an algorithm should be capable of being initialised with foreground objects in the scene. It has to then learn the correct background model over time.

Shadows are cast by the foreground objects, but later processing is typically not interested in them.

Illumination changes, both gradual, e.g., from the sun moving during the day, and rapid, such as from a light switch being toggled.

Addressing these challenges, particularly the first three, leads to a number of considerations in designing a background model. Specifically, on account of the dynamic nature of backgrounds the model has to classify a range of pixels as belonging to the background. This can be represented with a multi-modal distribution, e.g., a tree waving against the sky generates pixels that are sky one moment and tree another, but both sets need to be modelled as background. How many modes exist will need to be learnt, and can potentially change with time, e.g., a windy day versus a calm day. Noise is generated by both the image capture process and typical small scale variations in the background colour. It is primarily modelled using the variance of a probability distribution, which needs to be set high enough to capture the variation. Camouflage on the other hand is when a foreground object looks like it belongs to the background, either deliberately or accidentally. The effect of camouflage is interleaved with that of noise: noise can increase the range of values considered to belong to the background, allowing a camouflaged object to remain undetected. Consequentially the variance of the background representation is a tradeoff—too low and noise will trigger detection; too high and camouflaged objects will be missed. Variance therefore needs to be learnt from the data. Noting that noise can vary with both location and time (e.g., photon noise only matters in the dark areas of an image, at night), this needs to be continuously calculated from the video feed, on a per-region basis. A desirable background model thus should learn the number of modes required to represent the background, including the variance of each mode, such that it handles noise without compromising its ability to detect camouflaged entities.


To this end we propose to use a Dirichlet process Gaussian mixture model (DP-GMM) [3] to provide a per-pixel density estimate (DE). This is a non-parametric Bayesian method [4] that automatically estimates the number of mixture components required to model the pixel's background colour distribution, e.g., a tree waving backwards and forwards in front of the sky will generate single mode pixels at the trunk and in the sky, but two mode pixels in the area where the branches wave, such that the pixels transition between leaf and sky regularly. If it needs more modes to represent multiple leaf colours this will occur automatically, and, of great importance for long term surveillance, the model will update with time as new evidence becomes available, e.g., on a calm day when the tree is not moving it will revert to single mode pixels throughout. As it is fully Bayesian it avoids over-/under-fitting its model.

However, there are two issues that prevent a standard DP-GMM from being used as a per-pixel background model: 1) the existing model update techniques cannot cope with the scene changes common in real-world applications; 2) using the model for continuous video streams is unscalable both in terms of memory usage and computational cost. To solve the first issue and make the DP-GMM suitable for background modelling, a set of novel model update techniques are developed in this work. More specifically, each mode in the model represents a range of colours, using a Gaussian distribution (the Gaussian is actually integrated out, such that a student-t distribution is used in practice). The extent of this range is learnt from the data stream, and is again updated as the scene changes. During the day when pixels are almost constant it will select a tight distribution, whilst at night it will expand as photon noise pushes the range of expected values wider. A visual scene typically evolves as time goes by, e.g., the day/night cycle, a car parking, road works remodelling a traffic junction's layout, with different speeds of change. Consequentially the model needs to forget what it has previously learnt and model the new background effectively and efficiently. For this we introduce a novel model update concept that lets old information degrade in a principled way, which we refer to as confidence capping. It limits how confident the model can be about any given mode, which puts a constraint on how certain it can be about the shape of the background distribution. As a result, when a mode is no longer used it will slowly fade away, whilst the new mode(s) start to dominate the model. If during this period the scene reverts to the old state it can start using its old modes again. This allows a stationary object to remain part of the foreground for a long time, as it takes a lot of information for the new component to obtain the confidence of pre-existing components, but when an object moves on and the background changes to a component it has seen before, even if a while ago, it can use that component immediately.

Updating the components for gradual background changes (e.g., gradual illumination changes) continues to happen, making sure the model is never left behind by gradual changes. This also helps with bootstrap performance, as it can quickly learn the model, even with foreground objects.

To overcome the second issue of model scalability, two measures are taken. First, we fit the model by Gibbs sampling each sample only once. This would normally compromise model convergence, but because of confidence capping and the availability of a never-ending stream of data the model will converge regardless. In addition to its computational saving it avoids the need to store and process hundreds of previous video frames, which can consume vast amounts of memory [1], [5]. This memory consumption problem is particularly noticeable for approaches based on kernel density estimation [6], [7], [8], [9], [10]. Second, to obtain real time performance, we introduce a GPU based implementation.

To summarise, the contributions are: 1) a new background subtraction technique is proposed [11], built on a non-parametric Bayesian density estimate, which avoids over-/under-fitting; 2) a novel model update is formulated that is both principled and practical—the approach is real time with a constant memory requirement; 3) an exhaustive evaluation using four data sets [1], [2], [12], [13] demonstrates top performance in comparison with the state-of-the-art alternatives. The code, both a C version and a GPU version, has been made available at http://www.thaines.com.

1.1 Related Work

The background subtraction field is vast, and has many review papers [2], [14], [15], [16]. Stauffer and Grimson [17] is one of the best known approaches—it uses a Gaussian mixture model (GMM) for a per-pixel density estimate followed by connected components for regularisation. This model improves on using a background plate because it can handle a dynamic background and noise, by using multimodal probability distributions. As it is continuously updated it can bootstrap. Its mixture model includes both foreground and background components—it classifies values based on their mixture component, which is assigned to the foreground or the background based on the assumption that the majority of the larger components belong to the background, with the remainder foreground. This assumption fails if objects hang around for very long, as they quickly dominate the distribution. It has a fixed component count, which does not match real variability, where the number of modes differs between pixels. The model is updated linearly using a fixed learning rate parameter—consequentially the moved object problem is an unhelpful trade between ghosting and forgetting stationary objects too quickly. Connected components converts the intermediate foreground mask into regions via pixel adjacency, and culls all regions below a certain size, to remove spurious detections. This approach to noise handling, combined with its somewhat primitive density estimation method, undermines camouflage handling, as weak camouflage detections are often mistaken for noise, and also prevents it from tracking small objects. No capacity exists for it to handle illumination changes or shadows. The above can be divided into four parts—the model, updating the model, how pixels are classified, and regularisation; alternative approaches for each will now be considered in turn.

The model. Alternative DE methods exist, including different GMM implementations [9] and kernel density estimate (KDE) methods, either using Gaussian kernels [6], [7] or step kernels [8], [9], [10]. Histograms have also been used [18], and alternatives to DE include models that predict the next value [1], use neural networks [19], [20], [21], or hidden Markov models [22]. Alternatives to the GMM background model can improve handling of dynamic background, noise and camouflage by better reflecting the underlying variability of the background. This can be particularly important with relation to over-/under-fitting, as the model needs to accurately generalise the data stream. All models suffer from some degree of over-/under-fitting. KDE and histogram methods are particularly vulnerable, as they implicitly assume a constant density by using fixed size kernels/bins. GMM methods should do better, but the heuristics required for online learning, particularly regarding the creation of new components, can result in local minima in the optimisation, which is just as problematic. He et al. [5] recently used DP-GMMs for background subtraction, in common with the presented approach. They failed to leverage the potential advantages however (discussed below), and used computationally unscalable methods. Their approach is block-based, which is to say they ran the model at very low resolution to save computation, and it has unbounded memory consumption. Consequentially it is not applicable to real world scenarios, and, despite their efforts, poor results were obtained. Global models should in principle be better, with techniques based on principal component analysis (PCA) [23] having shown the most promise. They attempt to model the background as a low dimensional subspace of the vectorised input, with the foreground identified as outliers. In practice such approaches have struggled, due to high computational requirements and limited capability to deal with many common problems, e.g., camouflage. They tend to excel at camera shake however, and recent variants have resolved many of the issues [24], [25].

Model update. Most methods use a constant learning rate to update the model, but some use adaptive heuristics [9], [10], [26], whilst others are history based [1], [5], [10], and build a model from the last n frames directly. Adapting the learning rate affects the moved object issue—if it is too fast then stationary objects become part of the background too quickly, if it is too slow it takes too long to recover from changes to the background. Adaptation aims to adjust the rate depending on what is happening. There is invariably a tradeoff between quickly learning when the background model has changed and accepting foreground objects into the background when they stop moving for a while (cars at traffic lights)—trying to minimise this tradeoff is a key goal and no perfect solution exists. In addition continuously learning the model is required to handle the bootstrapping issue. We present confidence capping (see above), which works because non-parametric Bayesian models, such as DP-GMMs, have a rigorous concept of a new mixture component forming. Parametric models [9], [17] have to use heuristics to simulate this, whilst a confidence cap is not possible for KDE based approaches [6], [7], [8], [9], [10] as they lack a measure of confidence.

Pixel classification. The use of a single density estimate that includes both foreground (fg) and background (bg), as done by Stauffer and Grimson [17], is somewhat unusual—most methods stick to separate models and apply Bayes rule [18], with the foreground model set to be the uniform distribution as it is unknown. We follow this convention, which results in a probability of being bg or fg, rather than a hard classification, which is passed through to the regularisation step. Instead of using Bayes rule some works use a threshold [6], [10], which can be dynamically learnt [10]. Attempts at learning a foreground model also exist [7], and some models generate a binary classification directly [19], [20], [21].

Regularisation. Some approaches have no regularisation step [27], others have information sharing between adjacent pixels [19], [20], [21] but no explicit regularisation. Techniques such as eroding then dilating are common [2], and more advanced techniques have, for instance, tried to match pixels against neighbouring pixels, to compensate for background motion [6]. When dealing with a probabilistic fg/bg assignment probabilistic methods should be used, such as the use of Markov random fields (MRF) by Migdal and Grimson [28], Sheikh and Shah [7] and Schick et al. [29]. We also use an MRF—the pixels all have a random variable which can take on one of two labels, fg or bg. The data term is provided by the model whilst pairwise potentials encourage adjacent pixels to share the same label. Differences exist—previous works use Gibbs sampling [28] and graph cuts [7], [29], whilst we choose loopy belief propagation [30], as its runtime can be capped and it can be accelerated with a GPU implementation. We also use an edge preserving cost between pixels, rather than a constant cost, which proves to be beneficial with high levels of noise. Cohen [31] has also used a Markov random field, to generate a background image by selecting pixels from a frame sequence, rather than for regularisation. Parks and Fels [32] provide a review of regularisation for background subtraction.

Input. The vast majority of approaches, including this one [11], use the colours of the individual pixels directly. Other approaches use alternative information sources however, including optical flow [33], depth [34], tracking [35] and object detection [35]. Additionally, only foreground/background classification is provided by the presented approach—other approaches may mark areas as being in shadow [6], [35] or use other labelling strategies [33].

2 METHODOLOGY

The presented method naturally splits into two—a per-pixel background model and a regularisation step. The GPU implementation and further details are also given. For an overview of the system Fig. 1 gives a block diagram, whilst Algorithm 1 gives pseudocode of the core structure.

672 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 4, APRIL 2014

Page 4: 670 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND …txiang/publications/Haines_Xiang_backgroun… · To this end we propose to use a Dirichlet process Gaussian mixture model (DP-GMM)

2.1 Per-Pixel Background Model

Each pixel has its own multi-modal density estimate, used to model P(x | bg), where x is the vector of pixel colour channels. The Dirichlet process (DP) Gaussian mixture model [3] is used. It can be viewed as the Dirichlet distribution extended to an infinite number of components, which allows it to learn the true number of mixtures required to represent the data. For each pixel a stream of values arrives, one with each frame—the model has to be continuously updated using incremental learning.

We start by introducing the Dirichlet process, first using the stick breaking construction, then using the Chinese restaurant process (CRP). The stick breaking construction provides a clean explanation of the concept, whilst the Chinese restaurant process integrates out irrelevant variables and provides the formulation we actually solve. Fig. 2a represents the DP-GMM graphically using the stick breaking construction. It can be split into three columns—on the left the priors, in the middle the entities representing the Dirichlet process and on the right the data for which a density estimate is being constructed. This last column contains the feature vectors (pixel colours) to which the model is being fitted, x_i, which come from all previous frames, i ∈ I. It is a generative model—each sample comes from a specific mixture component, indexed by Z_i ∈ K, which consists of its probability of being selected, V_k, and the Gaussian distribution from which the value was drawn, η_k. The conjugate prior, consisting of μ, a Gaussian over its mean, and Λ, a Wishart distribution over its precision (inverse covariance matrix), is applied to all η_k. So far this is just a mixture model; the interesting part is that K, the set of mixture components, is infinite. Conceptually the stick breaking construction is very simple—we have a stick of length 1, representing the entire probability mass, which we keep breaking into two parts. Each time it is broken one of the parts becomes the probability mass for a mixture component—a value of V_k, whilst the other is kept for the next break. This continues forever. α is the concentration parameter, which controls how the stick is broken—a low value puts most of the probability mass in a few mixture components, whilst a high value spreads it out over many. Stick lengths are represented by the random variable V_k, which is defined as

V_k = y_k \prod_{i \in [0,k)} (1 - y_i), \qquad y_k \sim \mathrm{Beta}(1, \alpha),    (1)

where Beta(·,·) is the beta distribution and k ∈ K indexes the infinite set of sticks. In effect, the y_k values are how much of the remaining stick to use for probability mass k, whilst V_k is the remaining stick length, calculated from the previous breaks, multiplied by the amount of the remainder to use for stick k.
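To make the construction concrete, here is a minimal sketch (ours, not the authors' released code) that draws a truncated set of stick weights; the truncation level n is an illustrative assumption, since the true construction continues forever:

    import numpy as np

    def stick_breaking(alpha, n):
        # y_k ~ Beta(1, alpha); V_k = y_k * prod_{i<k} (1 - y_i)  (Eq. 1)
        y = np.random.beta(1.0, alpha, size=n)
        remaining = np.concatenate(([1.0], np.cumprod(1.0 - y)[:-1]))
        return y * remaining

    # A small alpha concentrates mass in a few sticks; a large one spreads it out.
    print(stick_breaking(0.5, 10).round(3))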

Fig. 2. Two representations of the DP-GMM graphical model, each corresponding to a specific implementation technique.

Fig. 1. Block diagram of the approach. The solid arrows show the path of a frame through the system, whilst the dotted arrows show the update order of the model, which loops around for the next frame.

HAINES AND XIANG: BACKGROUND SUBTRACTION WITH DIRICHLET PROCESS MIXTURE MODELS 673

Page 5: 670 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND …txiang/publications/Haines_Xiang_backgroun… · To this end we propose to use a Dirichlet process Gaussian mixture model (DP-GMM)

Orthogonal to the stick length, each stick is associated with a draw, η_k, from the DP's base measure, which is the already mentioned conjugate prior over the Gaussian.

Whilst the stick breaking construction offers a clean explanation of the model, the Chinese restaurant process is used for the implementation (variational methods [36] offer one approach to using the stick breaking construction directly, but this is impractical here as historic pixel values would need to be kept). This is the model with the middle column of Fig. 2a integrated out, to give Fig. 2b. It is named by an analogy. Specifically, each sample is represented by a customer, which turns up and sits at a table in a Chinese restaurant. Tables represent the mixture components, and a customer chooses either to sit at a table where customers are already sitting, with probability proportional to the number of customers at that table, or to sit at a new table, with probability proportional to α. At each table (component) only one dish is consumed, which is chosen from the menu (base measure) by the first customer to sit at that table. Integrating out the draw from the DP leads to better convergence, but more importantly replaces the infinite set of sticks with a computationally tractable finite set of tables. Whilst the theory behind Dirichlet processes is deep, the actual implementation of a CRP is fairly trivial, as it can be treated as the addition of a dummy table, representing the creation of a new table, to an otherwise standard mixture model. Note that when the dummy table is used a new one is made available.
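As an illustration of how trivial the CRP really is, the seating probabilities can be sketched as follows (a hypothetical helper, not taken from the paper's released code):

    import numpy as np

    def crp_probabilities(table_counts, alpha):
        # Existing tables are weighted by their occupancy; the dummy
        # "new table" carries pseudo-count alpha. Normalising gives
        # the seating probabilities.
        w = np.append(np.asarray(table_counts, dtype=float), alpha)
        return w / w.sum()

    # Three tables with 5, 2 and 1 customers; the last entry is "sit at a new table".
    print(crp_probabilities([5, 2, 1], 0.5))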

The DP-GMM density estimate for each pixel is updated with each new frame. Updating proceeds by first calculating the probability of the current pixel value, x, given the current background model, then updating the model with x, weighted by the calculated probability—these steps are now detailed.

Mixture components. The per-pixel model is a set of weighted mixture components, such that the weights sum to 1, where each component is a Gaussian distribution. They are integrated out however, using the Chinese restaurant process for the mixture weights and the conjugate prior for the Gaussians (student-t distribution). Whilst the literature [37] already details this second part it is included for completeness. x_i ∈ [0,1], i ∈ {0,1,2} represents the pixel's colour, and independence is assumed between the colour channels for reasons of speed (equivalent to a covariance matrix where the off-diagonal entries are zero). This simplifies the Wishart prior to a gamma prior for each channel i, such that

\sigma_i^{-2} \sim \Gamma\!\left(\frac{\nu_{i,0}}{2}, \frac{\sigma_{i,0}^2}{2}\right), \qquad \mu_i \mid \sigma_i^2 \sim \mathcal{N}\!\left(\mu_{i,0}, \frac{\sigma_i^2}{\kappa_{i,0}}\right),    (2)

x_i \sim \mathcal{N}\!\left(\mu_i, \sigma_i^2\right),    (3)

where N(μ, σ²) represents the normal distribution and Γ(α, β) the gamma distribution. The parameters ν_{i,0} and σ_{i,0}, i ∈ {0,1,2}, are the Λ prior from the graphical model, whilst μ_{i,0} and κ_{i,0} are the μ prior.

Evidence, x, is provided incrementally, one sample at a time, which will be weighted by w, the probability that it belongs to the background model. The model is then updated from having m samples to m+1 samples using

\nu_{i,m+1} = \nu_{i,m} + w,    (4)

\kappa_{i,m+1} = \kappa_{i,m} + w,    (5)

\mu_{i,m+1} = \frac{\kappa_{i,m}\,\mu_{i,m} + w\,x_i}{\kappa_{i,m} + w},    (6)

\sigma_{i,m+1}^2 = \sigma_{i,m}^2 + \frac{\kappa_{i,m}\,w}{\kappa_{i,m} + w}\,(x_i - \mu_{i,m})^2.    (7)

Note that ν_{i,m} and κ_{i,m} have the same update, so one value can be stored to cover both, for all i. Given the above parameters, updated with the available evidence, a Gaussian may be drawn, to sample the probability of a colour being drawn from this mixture component. Instead of drawing it the Gaussian is integrated out, to give

x_i \sim \mathcal{T}\!\left(\nu_{i,m},\; \mu_{i,m},\; \frac{\kappa_{i,m} + 1}{\kappa_{i,m}\,\nu_{i,m}}\,\sigma_{i,m}^2\right),    (8)

where T(ν, μ, σ²) denotes the three parameter student-t distribution. Consequentially each mixture component is actually a student-t distribution, to avoid the problem of sampling the Gaussian distribution that is actually being represented.
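As a concrete illustration of Equations (4)-(8) for a single colour channel, consider the following sketch; it is our paraphrase of the maths, not the authors' released C/GPU code, and SciPy's student-t is used purely for convenience:

    import numpy as np
    from scipy.stats import t as student_t

    def update_channel(nu, kappa, mu, var, x, w):
        # Weighted conjugate-prior update, Eqs. (4)-(7);
        # w is the probability the sample belongs to the background.
        kappa_new = kappa + w
        mu_new = (kappa * mu + w * x) / kappa_new
        var_new = var + (kappa * w / kappa_new) * (x - mu) ** 2
        return nu + w, kappa_new, mu_new, var_new

    def predictive_pdf(nu, kappa, mu, var, x):
        # Posterior predictive, Eq. (8): a three-parameter student-t with
        # scale^2 = (kappa + 1) / (kappa * nu) * sigma^2.
        scale = np.sqrt((kappa + 1.0) / (kappa * nu) * var)
        return student_t.pdf(x, df=nu, loc=mu, scale=scale)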

Background probability. The Chinese restaurant process is used to calculate the probability of a pixel, x ∈ [0,1]³, belonging to the background (bg) model. It integrates out the stick weights, unlike the stick breaking construction, which is essential to obtaining good performance. The probability of x given component (table) t ∈ T is

P(x \mid t, \mathrm{bg}) = \frac{s_t}{\sum_{i \in T} s_i}\, P\!\left(x \mid \nu_t, \kappa_t, \mu_t, \sigma_t^2\right),    (9)

P\!\left(x \mid \nu_t, \kappa_t, \mu_t, \sigma_t^2\right) = \prod_{i \in \{0,1,2\}} \mathcal{T}\!\left(x_i \mid \nu_{t,i},\; \mu_{t,i},\; \frac{\kappa_{t,i} + 1}{\kappa_{t,i}\,\nu_{t,i}}\,\sigma_{t,i}^2\right),    (10)

where s_t is the number of samples assigned to component t, and ν_t, κ_t, μ_t and σ_t are the parameters of the prior updated with the samples currently assigned to the component. Note that s_t is the offset from the prior values for ν_t and κ_t, so these latter two do not need to be stored. By including a dummy component, t = new ∈ T, which represents creating a new component (sitting at a new table) with s_new = α, this is the Chinese restaurant process. The student-t parameters for this dummy component are the prior without update. Finally, the mixture components can be summed out:

P(x \mid \mathrm{bg}) = \sum_{t \in T} P(x \mid t, \mathrm{bg}).    (11)


The goal is to calculate P(bg | x), not P(x | bg), hence Bayes rule is applied:

P(\mathrm{bg} \mid x) = \frac{P(x \mid \mathrm{bg})\,P(\mathrm{bg})}{P(x \mid \mathrm{bg})\,P(\mathrm{bg}) + P(x \mid \mathrm{fg})\,P(\mathrm{fg})}.    (12)

Note that pixels can only belong to either the background or the foreground (fg), hence the denominator, with P(fg) = 1 − P(bg). P(x | bg) is given above, leaving P(bg) and P(x | fg). P(bg) is an implicit threshold on what is considered background and what is considered foreground, and is thus considered to be a parameter. P(x | fg) is unknown and hard to estimate, so the uniform distribution is used, which is a value of 1, as the volume of the colour space is 1 (see the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2013.239).
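Pulling Equations (9)-(12) together, per-pixel classification might be sketched as follows, reusing predictive_pdf from the earlier sketch; the component representation is an assumption made for illustration:

    import numpy as np

    def p_bg_given_x(components, prior, alpha, x, p_bg):
        # components: list of (s_t, per-channel (nu, kappa, mu, var)) pairs.
        # The dummy "new table" component uses the raw prior with weight alpha.
        tables = components + [(alpha, prior)]
        total = sum(s for s, _ in tables)
        p_x_bg = 0.0
        for s, params in tables:  # Eqs. (9)-(11)
            lik = np.prod([predictive_pdf(*params[i], x[i]) for i in range(3)])
            p_x_bg += (s / total) * lik
        # Eq. (12), with P(x | fg) = 1 under the uniform foreground model.
        return p_x_bg * p_bg / (p_x_bg * p_bg + (1.0 - p_bg))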

Model update. To update the model at each pixel the current value is assigned to a mixture component, which is then updated—s_t is increased and the posterior for the Gaussian updated with the new evidence (Equations (4)-(7)). Assignment is done probabilistically, by randomly drawing a component, weighted by P(x | t, bg) (Equation (9)). The possibility of assignment to a new mixture component is included—in the context of background subtraction this is the possibility that the pixel represents something we have not seen before. This is equivalent to Gibbs sampling the density estimate, except we only sample each value once, on arrival. In consequence samples do not need to be stored. Updates are weighted by their probability of belonging to the background (Equation (12)), i.e., w is set to P(bg | x) in Equations (4)-(7), and s_t is incremented by P(bg | x). Sampling each value just once is not an issue, as the continuous stream of data combined with confidence capping means the model soon converges.
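The single-sweep Gibbs step could then look like the sketch below (simplified bookkeeping; the helpers come from the earlier sketches, and the component-replacement logic described later is omitted):

    import numpy as np

    def gibbs_update(components, prior, alpha, x, w):
        # Draw a component (or the dummy "new" one) with probability
        # proportional to Eq. (9), then update it with weight w = P(bg | x);
        # the sample itself is discarded afterwards.
        tables = components + [(alpha, prior)]
        weights = np.array([s * np.prod([predictive_pdf(*p[i], x[i])
                                         for i in range(3)])
                            for s, p in tables])
        idx = np.random.choice(len(tables), p=weights / weights.sum())
        if idx == len(components):  # the pixel shows something not seen before
            components.append(
                (w, [update_channel(*prior[i], x[i], w) for i in range(3)]))
        else:
            s, params = components[idx]
            components[idx] = (
                s + w, [update_channel(*params[i], x[i], w) for i in range(3)])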

Confidence capping. A learning rate is not used; instead, unique to a DP-GMM, we introduce the notion of capping the confidence of the model. This can be interpreted as an adaptive update [9], [26], but it is both principled and very effective. In effect we are building an online density estimate with the ability to selectively forget, allowing newer data to take over when the background changes. If this was not done the model would get more confident as time goes on—if it had seen the background in its current state for three days then it would require three days of data showing the background in a different state before the new model became dominant. Confidence capping effectively limits the confidence so that a new model can always replace the old one in a short amount of time; it also avoids numerical overflow and allows single-sample Gibbs sampling to converge. The cap is applied to the largest individual component, rather than the distribution as a whole—this avoids mixtures with more components being required to use wider distributions to compensate.

Mathematically it works by capping how high s_t can go. When the cap is exceeded by any s_t for a pixel then a multiplier is applied to all s_t, scaling the highest s_t down to equal the cap. Note that s_t is tied to ν_t and κ_t, so they are also adjusted. As σ²_t is dependent on κ_t (it includes κ_t as a multiplier) an update is avoided by storing σ²_t/κ_t instead. The effectiveness is such that it can learn the initial model with less than 30 frames of data, yet objects can remain still for many minutes before being merged into the background. This does not impede the ability of the model to update as the background changes. Finally, in the limit as an infinite number of frames arrives the Dirichlet process will have an infinite number of components. This will exhaust a computer's memory; therefore the component count is also capped. When a new component is created the existing component with the lowest s_t is replaced.
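A minimal sketch of the capping step, omitting the prior/offset bookkeeping the text describes:

    def cap_confidence(components, cap):
        # Scale all counts so the largest s_t equals the cap; in the full
        # method nu_t and kappa_t, being tied to s_t, shrink by the same
        # multiplier (represented here only through the stored counts).
        largest = max(s for s, _ in components)
        if largest <= cap:
            return components
        scale = cap / largest
        return [(s * scale, params) for s, params in components]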

2.2 Probabilistic Regularisation

The per-pixel background model ignores information from the neighbourhood of a pixel, leaving it susceptible to locally occurring noise and camouflage. Additionally, Gibbs sampling introduces a certain amount of noise, meaning that the boundary between foreground and background is subject to a dithering effect. To resolve these issues a Markov random field is constructed, with a node for each pixel, connected using a four-way neighbourhood. It is a binary labelling problem, where each pixel either belongs to the foreground or to the background.

The task is to select the most probable solution, where the probability can be separated into two terms. First, each pixel has a probability of belonging to the background or foreground (data term), directly obtained from the model as P(bg | x) and 1 − P(bg | x), respectively. Second, there is the probability of being the same (smoothing term), which indicates how likely adjacent pixels are to have the same assignment,

P(l_a = l_b) = \frac{h}{h + m \cdot d(a, b)},    (13)

where l_y is the label of pixel y, h is a parameter that sets the half life, i.e., the distance at which the probability becomes 0.5, and d(a, b) is the euclidean distance between the two pixels in colour space. Noise suppression is greatly improved if at least one of the neighbours of a pixel has a high probability of being assigned the same label—this is the purpose of m. m is typically 1, but is decreased if a pixel is sufficiently far from its neighbours that none provides a P(l_a = l_b) value above a threshold. The decrease is such that at least one of them matches the threshold.
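The pairwise term is a one-liner; a sketch with illustrative names:

    import numpy as np

    def smoothing_prob(colour_a, colour_b, h, m=1.0):
        # Eq. (13): probability that adjacent pixels a and b share a label;
        # h is the half life, the colour distance at which (for m = 1)
        # the probability falls to 0.5.
        d = np.linalg.norm(np.asarray(colour_a) - np.asarray(colour_b))
        return h / (h + m * d)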

Various methods can be considered for solving this model. Graph cuts [38] would give the MAP solution, however we use loopy belief propagation instead [30], as it runs in constant time given an iteration cap, which is important for a real time implementation; it also runs considerably better on a GPU. A red-black [30] update schedule is used to save memory—a checker-board pattern is assigned to the pixels and then all pixels of one colour send their messages, followed by all pixels of the other colour, and so on, until convergence. This results in half the memory consumption, as only messages going in one direction need to be stored; it additionally accelerates convergence [30]. Furthermore, a hierarchical implementation is used, whereby the model is first solved at a low resolution, before being scaled up, such that the transferred messages accelerate convergence at the higher resolution. This occurs over multiple levels—the number is decided by halving the resolution of the frame until either dimension drops below 32.
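For instance, under our reading of that rule the number of levels could be computed as follows (a sketch, not the authors' code):

    def bp_levels(width, height):
        # Halve the frame until either dimension would drop below 32;
        # each halving adds one level to the belief propagation hierarchy.
        levels = 1
        while width // 2 >= 32 and height // 2 >= 32:
            width //= 2
            height //= 2
            levels += 1
        return levels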


2.3 GPU Implementation

A background subtraction algorithm will have limited practical use if it cannot be run in real time on a video feed as it arrives. Accordingly, a GPU based implementation has been constructed. Because we have selected a naturally parallel structure for the presented approach, adapting it to run on a GPU is relatively straightforward—the model for each pixel is independent, whilst regularisation with belief propagation can be implemented via message passing. The following phases exist:

1. Calculate the probability of the current pixel being drawn from each mixture component in the model, and also the new component probability. These computations are all independent.

2. Pull the probabilities together, to generate the probability of each pixel coming from the current model. Update the model using Gibbs sampling—random number generation is handled via the Philox algorithm [39]. This is independent on a per-pixel basis.

3. Calculate the costs for belief propagation—this is naturally parallel.

4. Do message passing, over multiple hierarchical levels. The use of a red-black update schedule means that we can update half the pixels at each given step.

5. Extract the final mask—an independent calculation for each pixel.

Some approximations are made in the calculations to accelerate runtime—these are detailed in the online supplementary material.

Using the GPU implementation with a GeForce GTX 580 on a standard PC platform with a 4.4 GHz CPU, it can obtain 28.5 frames per second on input that has a resolution of 320 × 240 (calculated for the problem cd-baseline-highway—see Section 3). Note that only 41 percent of the time is spent doing the background subtraction—another 19 percent is expended on colour conversion whilst 25 percent is spent estimating the lighting change, with the remainder on input/output. Neither colour conversion nor lighting change estimation were run on the GPU, and all processing occurs in sequence without any parallelism between the CPU and GPU.

2.4 Further Details

The core algorithmic details have now been given, but other pertinent details remain.

The prior. The background model includes a prior on the Gaussian associated with each mixture component. Instead of treating this as a parameter to be set, it is learnt from the data. Specifically, the mean and standard deviation (SD) of the prior are matched with the mean and SD of the pixels in the current frame, with the weight set so it is equivalent to a single sample,

\nu_{i,0} = \kappa_{i,0} = 1, \qquad \mu_{i,0} = \frac{1}{|F|}\sum_{x \in F} x_i,    (14)

\sigma_{i,0}^2 = \frac{1}{|F|}\sum_{x \in F} (x_i - \mu_{i,0})^2,    (15)

where F is the set of pixels in the current frame. To change the prior between frames the posterior parameters must not be stored directly; instead offsets from the prior are stored, which are then adjusted after each update such that the model is equivalent.
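Equations (14)-(15) amount to a few lines; a sketch assuming the frame is an H × W × 3 array of values in [0, 1]:

    import numpy as np

    def frame_prior(frame):
        # Match the prior to the current frame's per-channel mean and
        # variance, weighted as a single sample (Eqs. 14-15).
        pixels = frame.reshape(-1, 3)
        mu0 = pixels.mean(axis=0)
        var0 = pixels.var(axis=0)
        return 1.0, 1.0, mu0, var0  # nu_0, kappa_0, mu_0, sigma_0^2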

Illumination change. The above helps by updating the prior, but it does nothing to update the evidence. To update the evidence a multiplicative model is used, whereby the lighting change between frames is estimated as a multiplier, then the entire model is updated by multiplying the means, μ_{i,m}, of the components accordingly. The light level change is estimated as in Loy et al. [40]. This takes every pixel in the frame and divides its value by the same pixel in the previous frame, as an estimate of the lighting change. The mode of these estimates is then found using mean shift [41], which is robust to the many outliers.
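A rough sketch of that estimator follows; the fixed-bandwidth mean shift below is our simplification of the procedure in [40], [41], and the bandwidth and iteration count are illustrative assumptions:

    import numpy as np

    def light_multiplier(prev_frame, frame, bandwidth=0.05, iters=20):
        # Per-pixel brightness ratios; their mode is taken as the global
        # lighting multiplier, which is robust to the many foreground outliers.
        ratios = (frame / np.maximum(prev_frame, 1e-6)).ravel()
        mode = np.median(ratios)  # initialisation
        for _ in range(iters):    # mean shift towards the mode
            window = np.abs(ratios - mode) < bandwidth
            if window.any():
                mode = ratios[window].mean()
        return mode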

Robustness to shadows. Shadows are typically of no interest to further processing, and hence should be ignored by background subtraction. A simple approach is to ignore luminance information [6], as this is the only channel in which (white light) shadows appear, but this throws away too much information. Instead a novel step is taken which involves reducing the importance of luminance, such that a wider range of brightness levels are accepted as belonging to the background. Luminance remains available for modelling by the background model. This is achieved with a new shadow-insensitive colour model—it is detailed in the online supplementary material. It allows the approach to classify softer shadows as belonging to the background, but proves ineffective for hard/deep shadows.

Camera shake. With few exceptions [42], [43] background subtraction algorithms assume a fixed camera. In practice even fixed cameras shake due to vibrations in their mounting structure, which can be caused, for instance, by the wind or passing traffic. To improve vibration handling when updating the model we use a region around each pixel, rather than the pixel value directly. Going back to Equation (7), it updates the model assuming a single point value—it can be modified to accept a Gaussian distribution instead, to get

\sigma_{i,m+1}^2 = \sigma_{i,m}^2 + w\,\varsigma_{i,m}^2 + \frac{\kappa_{i,m}\,w}{\kappa_{i,m} + w}\,(x_i - \mu_{i,m})^2,    (16)

where ς is the standard deviation of the measured pixel value. Effectively it can be interpreted as the colour variance in the region around the pixel, from which we can randomly sample due to camera shake.

Assuming linear interpolation between adjacent values allows us to calculate the standard deviation of an image region, which we do for a single 1 × 1 pixel region around the point sampled. If we treat the image surface as being made of square regions rather than triangular regions we only need the values of three corners for each square, which halves the number of pixels that need to be fetched. Using Fig. 3 the region being calculated is given by the black square, which is split into four sub-squares. Three corners of the top right sub-square are given by x, h_n = (x + x_n)/2 and h_e = (x + x_e)/2; the same pattern applies to the other three. The standard deviation of these four squares is hence given by

676 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 36, NO. 4, APRIL 2014

Page 8: 670 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND …txiang/publications/Haines_Xiang_backgroun… · To this end we propose to use a Dirichlet process Gaussian mixture model (DP-GMM)

\varsigma^2 = \frac{x^2 + h_n^2 + h_e^2 + h_s^2 + h_w^2}{6} + \frac{h_n h_e + h_e h_s + h_s h_w + h_w h_n}{8} + \frac{x h_n + x h_e + x h_s + x h_w}{12} - m^2,    (17)

where m is the mean of the four squares, given by m = (h_n + h_e + h_s + h_w)/4. This has to be done for each of the three colour channels. As a further detail we scale the standard deviation with a parameter—a value of zero indicates that there is no camera shake, whilst 1 indicates a camera shake of about 1 pixel across. This approach will not scale for large amounts of shake, but proves effective for the small quantities introduced by vibrations from, for instance, passing traffic.

Bootstrapping. The algorithm learns continuously—it does not need a learning period and can update as the scene changes. However, during initialisation the algorithm is not confident, and assumes everything belongs to the background. This makes the algorithm appear to be bad at bootstrapping—to obtain good performance it is forced to be over-confident. This is the opposite of confidence capping, in that it increases the confidence to the cap by multiplication. It is only applied whilst calculating P(bg | x) however—the actual parameters are not updated. This bias term exists purely to make the algorithm overconfident during initial model training, because some of the tests in Section 3 require results immediately from the first frame. For real world usage this would be switched off.

3 EXPERIMENTS

Four data sets are used for the experiments:

• sabs from Brutzer et al. [2]. This one is synthetic. (Online at http://www.vis.uni-stuttgart.de/index.php?id=sabs.)

• wallflower from Toyama et al. [1]. (Available at http://research.microsoft.com/en-us/um/people/jckrumm/WallFlower/TestImages.htm.)

• star from Li et al. [12]. (Available at http://perception.i2r.a-star.edu.sg/bk_model/bk_index.html.)

• change detection (cd) from Goyette et al. [13]. This has an online chart which is actively updated. (Data set and chart online at http://www.changedetection.net.)

We perform extensive experiments using these data sets, allowing us to compare against a large number of alternative approaches. Brief summaries of many of the approaches we compare against can be found in Table 1 of the online supplementary material. Results for the wallflower data set are also available in the online supplementary material, as are parameter sweeps to demonstrate that performance remains high regardless of parameters. Table 1 gives the approach's parameters.

3.1 SABS

Brutzer et al. [2] presented a synthetic evaluation of background subtraction algorithms, consisting of a 3D rendering of a road junction, traversed by both cars and people—see Fig. 5. Despite being synthetic it simulates, fairly accurately, nine real world problems, and has the advantage of having ground truth for all frames. The nine real world problems are:

• basic, a baseline test.

• dynamic background, which crops the area of analysis to be the area of a waving tree.

• bootstrap, which has no training period, such that a clean background plate is never seen.

• darkening, which simulates the sun setting.

• light switch, which has the street at night and a shop switching its light off then on again.

• noisy night, which is the scene at night with heavy noise.

• camouflage, where the cars and people have been coloured similarly to the background, so they blend in.

• no camouflage, same as camouflage, but with easy to see colours, for comparison.

• H264, which compresses the video heavily, to generate typical compression artefacts.

The f-measure is reported for the various approaches in Table 2, and is defined as the harmonic mean of the recall and precision:

\mathrm{recall} = \frac{tp}{tp + fn}, \qquad \mathrm{precision} = \frac{tp}{tp + fp},    (18)

\text{f-measure} = \frac{2 \cdot \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}},    (19)

using fp as the number of false positives, tn as the number of true negatives, etc.
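For completeness, the metric as code:

    def f_measure(tp, fp, fn):
        # Harmonic mean of recall and precision (Eqs. 18-19).
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        return 2.0 * recall * precision / (recall + precision)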

TABLE 1. Descriptions of the Parameters for the DP-GMM. The final column contains the values used for the cd data set.

Fig. 3. Geometry of the interpolation used to handle camera shake. Each black dot is the centre of a pixel, for which we know a value—the five values used are labelled with the compass directions. The standard deviation is calculated for the region inside the black square.


For this test we used one set of parameters for all scenarios, rather than tuning per problem (the test procedure allows tuning one parameter per test—we have thus put ourselves at a disadvantage). As can be seen, the presented approach takes first place for all scenarios, and is on average 27 percent better than its nearest competitor. In doing so it demonstrates that it is not sensitive to the parameters chosen, as it can produce top results in many scenarios without per-problem tuning. Running without regularisation is also included in the chart, denoted as DP, no post (the other algorithms on the chart have had their post-processing removed, so it can be argued that this is the fairer comparison to make, though Brutzer et al. [2] define post-processing such that our regularisation method is allowed)—in all cases a lack of regularisation does not undermine its significant lead over the competition, demonstrating that the DP-GMM is doing most of the work, but that regularisation always improves the score, on average by 13 percent. It can be noted that the largest performance gaps between regularisation being off and being on appear for the noisiest inputs, e.g., noisy night, light switch, darkening and h264. These are the kinds of problems encountered in typical surveillance applications.

As a further point of comparison DP, con com is included, where our MRF-based post-processing has been swapped for the connected components method of Stauffer and Grimson [17]. Interestingly, for the simpler problems it does very well, sometimes better than the presented method, but when it comes to the trickier scenarios the presented method is clearly better. Fig. 4 shows all the variants for a frame from noisy night. This demonstrates the key advantage of our post-processor—given a weak detection that falls below the implicit threshold it can convert it into a complete object, by utilising both colour consistency and model uncertainty.

The frame shown in Fig. 5 has been chosen to demonstrate two areas in which the algorithm performs poorly. Specifically, its robustness to shadows is not very effective—whilst this could be improved by reducing the importance of luminance in the colour space, this has the effect of reducing its overall ability to distinguish between colours, and undermines performance elsewhere. The second issue can be seen in the small blobs at the top of the image—they are actually the reflections of cars and people in the scene. Using a DP-GMM allows it to learn a very precise model, so much so that it can detect the slight deviation caused by a reflection, when it would be preferable to ignore it. Further processing could potentially avoid this.

Fig. 4. Frame 990 from the noisy night sequence.

TABLE 2. SABS [2] Results: F-Measures for Each of the Nine Challenges. The results for other algorithms were obtained from the website (as of February 2013), though algorithms that never got a top score in the original chart have been omitted. Algorithm rank is given by the numbers in brackets. The mean column gives the average for all tests—the presented approach is 27 percent higher than its nearest competitor.

Fig. 5. Frame 545 from the bootstrap sequence of the SABS data set [2].


3.2 Star

The star data set [12] is very similar to the wallflower data set—bootstrap from wallflower is the same video as br in this data set. To its advantage star has an improved testing procedure however, as it provides multiple test frames per problem, with performance measured using the average similarity score for all test frames, where similarity = tp/(tp + fn + fp). In addition the sequences are generally much harder, due to text overlays, systemic noise and some camera shake. Fewer algorithms have been run on it. The presented approach (tuned per-problem, as all the competitors have done the same; results for a single set of parameters are also presented) takes first place seven times out of nine, beaten twice by Maddalena and Petrosino [19], though in both cases by a small margin. The light switch test in this data set does not trip it up this time—the library where it occurs has a high ceiling and diffuse lighting, making multiplicative lighting much more reasonable. Complex dynamic backgrounds clearly demonstrate the strength of a DP-GMM, as evidenced by its three largest improvements (cam, ft and ws).

3.3 Change Detection (cd)

The change detection [13] data set is much larger than any of the others, and includes a dense ground truth—masks are provided for all frames past some initial training period. It additionally limits parameter tuning such that a single parameter set must be used for all 31 videos. Video resolution is not great however, and many of the videos are of dubious quality; many appear to have been post-processed, often with a low quality de-interlacing algorithm that leaves ghosts (in Fig. 7 the traffic video demonstrates ghosting; the abandoned box video is an example of another kind of defect—it should be noted that the other real data sets have similar issues). The videos are divided into six categories, which can be seen alongside the results in Fig. 7.

TABLE 4. CD [13] Results: This Chart Only Contains the Most Competitive Algorithms—the Online Chart Includes Many More. The rank after each f-measure takes into account the omitted approaches. Refer to Fig. 7 for example frames.

Fig. 6. Star results: On the top row is the image, on the second row the ground truth and on the third row the output of the presented algorithm. We use the same frames as Culibrk et al. [44] and Maddalena and Petrosino [19], for a qualitative comparison. The videos are named using abbreviations of their locations.

TABLE 3. Star [12], [19] Results: Refer to Fig. 6 for Exemplar Frames, Noting that lb Has Abrupt Lighting Changes

The average improvement of the tuned DP over its nearest competitor is 5 percent.

10. We tune per-problem, as all the competitors have done. Results for a single set of parameters are also presented.

11. From Fig. 7 the traffic video demonstrates ghosting; the abandoned box video is an example of another kind of defect. It should be noted that the other real data sets have similar issues.


Fig. 7. CD results: Layout is identical to Fig. 6, with multiple rows. The ground truth includes various shades of grey: a dark grey to indicate shadowed regions, a mid grey for areas that are ignored when scoring for all frames, and a light grey for areas ignored for just this frame. Mid grey is used to focus on the action being tested, whilst light grey indicates areas where foreground/background assignment is ambiguous (typically presenting as a thin line around each segment).


Quantitative results can be found in Table 4.¹² The online version considers many different scoring mechanisms, and then combines them in a non-linear rank-based system.¹³ Instead we present only the f-measure scores, as the f-measure is the most widely used metric.
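For reference, the f-measure is the harmonic mean of precision and recall; the following is a minimal sketch of the standard definition (ours, not the benchmark's code):

    def f_measure(tp, fp, fn):
        """F-measure from true positive, false positive and false
        negative pixel counts; equivalent to 2*tp / (2*tp + fp + fn)."""
        precision = tp / float(tp + fp)
        recall = tp / float(tp + fn)
        return 2.0 * precision * recall / (precision + recall)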

We now consider each test in turn.

• Baseline. A basic set of videos, with nothing unusual, though pedestrians has saturated colour. All the top algorithms get comparable scores for this test.

• Dynamic background. These videos have complex multi-modal backgrounds, such as water. As demonstrated previously, the presented approach excels at such input, outperforming the competitors significantly.

• Camera jitter. The camera is not properly mounted for these videos, and as such shakes. This does reduce the quality of our output: the output frames tend to miss parts of objects, due to the wider background model induced by the shaking. It also incorrectly classifies the background during the worst shakes, but we still manage a close second place.

• Intermittent motion. This involves objects being left alone (static), and is our weakest result. Other approaches often have a specific code path to handle this situation, whilst we do not, putting us at a disadvantage.

• Shadow. These videos have shadows, and test the ability of the approach to ignore them. In our case we have some capability to handle soft shadows, as seen in cubicle, but will typically detect hard shadow regions. Interestingly, our result remains among the top approaches despite its limited capabilities.

• Thermal. A somewhat unusual video source, this demonstrates cameras for low light recording. Such sources have a large amount of noise, which the presented approach is expected to handle well. Unsurprisingly, it demonstrates a strong lead over the competition. It actually suffers from being too sensitive: in corridor it detects the thermal reflections, when these should be considered background.

Overall we win on average for the cd data set (Table 4). This is because the DP-GMM can handle backgrounds that are complex and dynamic, due to its ability to estimate mixture components automatically and avoid over-/under-fitting. This ability, in combination with the regularisation based on a Markov random field, allows it to cope effectively with heavy noise.

3.4 Mixture Count

The approach implicitly calculates a mixture count for each pixel¹⁴; Fig. 8 visualises this. Figs. 8a, 8b, 8c, and 8d show two frames from the star data set and their corresponding per-pixel mixture counts for the thousandth frame. The images are coloured such that black is 0 components and white is 4; the majority of pixels in both frames have been estimated to have one component. Trees moving in cam create components where the movement is large, though note that most areas can be handled using a single wide-variance component. In ss the escalators again generate extra components, though, being mostly black, they can often be covered by a single component. The heavily used escalator on the right gets many components because the approach forms one for the people going past, in case they stop and become part of the background. These components are never considered to be part of the background as they do not obtain enough weight. Fig. 8e demonstrates the mixture count changing over time, with a synthetic video. The video shows 10 seconds of one mode, then two modes, and so on up to four, then back again, at 24 fps; the ground truth mode count is plotted.¹⁵ We also plot the mode count averaged over all pixels: it follows the ground truth, with lag, as expected. Note that the lag is very short when learning a new mixture component (only about 2 seconds) but very long when it comes to forgetting a background component, which is desirable.
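As a rough illustration of how such a count can be read off the model, the following sketch thresholds the per-pixel component weights; the threshold value and all names are our assumptions, as the paper only states that a threshold is used (footnote 14):

    import numpy as np

    def effective_mode_count(weights, threshold=0.01):
        """Approximate mode count for one pixel's DP-GMM: the number
        of components whose weight exceeds the threshold. In principle
        a DPMM has infinitely many components, but all except a few
        carry negligible weight. The threshold value is hypothetical."""
        return int(np.sum(np.asarray(weights) > threshold))

    # Example: one dominant background mode plus a weak secondary mode.
    print(effective_mode_count([0.85, 0.10, 0.003]))  # prints 2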

4 CONCLUSIONS

We have presented a new background subtraction method and validated its effectiveness with extensive testing. The method is based on an existing model, namely DP-GMMs, but with new model learning algorithms developed to make it both suitable for background modelling and computationally scalable. Whilst the basic concept of per-pixel density estimation can be traced back to Stauffer and Grimson [17], the use of a non-parametric Bayesian model gives it a significant advantage when it comes to handling dynamic backgrounds.

Fig. 8. Representations of the per-pixel mixture count; see text for details.

12. The table is accurate as of February 2013; an up-to-date version can be found at http://changedetection.net.

13. This is problematic: the addition of a new approach can cause two unrelated algorithms to switch places. This can even occur when the new algorithm is much better or worse than those it reorders.

14. Because it uses a DPMM the mixture count is, in principle, always infinite, but a reasonable approximation can be obtained using a threshold.

15. The video changes colour repeatedly, cycling through all modes every second and showing each for an equal amount of time. It includes noise.


Specifically, our model is able to learn the number of mixture components more accurately, and hence better cope with a variety of scene changes, in comparison with the existing, more heuristic density estimation methods. It also proves to perform well in many other areas, particularly in dealing with heavy noise. Despite its thorough theoretical basis, implementation remains relatively simple.¹⁶

A number of improvements can be considered. Combining information between pixels only as a regularisation step does not fully exploit the information available; a more rigorous method of spatial information transmission would be desirable, and a dependent Dirichlet process could provide this. Sudden complex lighting changes are not handled, so some indoor lighting changes still cause failure. Furthermore, a more sophisticated model of the foreground and an explicit model of left objects could further improve our method. Solutions to these problems from many of the existing methods could be adapted to ours [6], [33], [35].

ACKNOWLEDGMENTS

Both authors are funded by EPSRC under the EP/G063974/1 grant.

REFERENCES

[1] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and Practice of Background Maintenance," Proc. Seventh IEEE Int'l Conf. Computer Vision (ICCV), pp. 255-261, 1999.

[2] S. Brutzer, B. Höferlin, and G. Heidemann, "Evaluation of Background Subtraction Techniques for Video Surveillance," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2011.

[3] M.D. Escobar and M. West, "Bayesian Density Estimation and Inference Using Mixtures," J. Am. Statistical Assoc., vol. 90, no. 430, pp. 577-588, 1995.

[4] Y.W. Teh and M.I. Jordan, "Hierarchical Bayesian Nonparametric Models with Applications," Bayesian Nonparametrics, Cambridge Univ. Press, 2010.

[5] Y. He, D. Wang, and M. Zhu, "Background Subtraction Based on Nonparametric Bayesian Estimation," Proc. Int'l Conf. Digital Image Processing, 2011.

[6] A. Elgammal, D. Harwood, and L. Davis, "Non-Parametric Model for Background Subtraction," Proc. Sixth European Conf. Computer Vision, pp. 751-767, 2000.

[7] Y. Sheikh and M. Shah, "Bayesian Modeling of Dynamic Scenes for Object Detection," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 11, pp. 1778-1792, Nov. 2005.

[8] O. Barnich and M.V. Droogenbroeck, "ViBE: A Powerful Random Technique to Estimate the Background in Video Sequences," Proc. IEEE Int'l Conf. Acoustics, Speech and Signal Processing, pp. 945-948, 2009.

[9] Z. Zivkovic and F. van der Heijden, "Efficient Adaptive Density Estimation Per Image Pixel for the Task of Background Subtraction," Pattern Recognition Letters, vol. 27, pp. 773-780, 2006.

[10] M. Hofmann, P. Tiefenbacher, and G. Rigoll, "Background Segmentation with Feedback: The Pixel-Based Adaptive Segmenter," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 38-43, 2012.

[11] T.S.F. Haines and T. Xiang, "Background Subtraction with Dirichlet Processes," Proc. 12th European Conf. Computer Vision (ECCV), 2012.

[12] L. Li, W. Huang, I.Y.-H. Gu, and Q. Tian, "Statistical Modeling of Complex Backgrounds for Foreground Object Detection," IEEE Trans. Image Processing, vol. 13, no. 11, pp. 1459-1472, Nov. 2004.

[13] N. Goyette, P.-M. Jodoin, F. Porikli, J. Konrad, and P. Ishwar, "Changedetection.net: A New Change Detection Benchmark Dataset," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 16-21, 2012.

[14] S.-C.S. Cheung and C. Kamath, "Robust Techniques for Background Subtraction in Urban Traffic Video," Proc. SPIE, vol. 5308, pp. 881-892, 2004.

[15] M. Karaman, L. Goldmann, D. Yu, and T. Sikora, "Comparison of Static Background Segmentation Methods," Proc. SPIE, vol. 5960, pp. 2140-2151, 2005.

[16] S. Herrero and J. Bescos, "Background Subtraction Techniques: Systematic Evaluation and Comparative Analysis," Proc. 11th Int'l Conf. Advanced Concepts for Intelligent Vision Systems (ACIVS), pp. 33-42, 2009.

[17] C. Stauffer and W.E.L. Grimson, "Adaptive Background Mixture Models for Real-Time Tracking," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), vol. 2, pp. 637-663, 1999.

[18] L. Li, W. Huang, I.Y.H. Gu, and Q. Tian, "Foreground Object Detection from Videos Containing Complex Background," Proc. 11th ACM Int'l Conf. Multimedia (MULTIMEDIA '03), pp. 2-10, 2003.

[19] L. Maddalena and A. Petrosino, "A Self-Organizing Approach to Background Subtraction for Visual Surveillance Applications," IEEE Trans. Image Processing, vol. 17, no. 7, pp. 1168-1177, July 2008.

[20] L. Maddalena and A. Petrosino, "A Fuzzy Spatial Coherence-Based Approach to Background/Foreground Separation for Moving Object Detection," Neural Computing and Applications, vol. 19, no. 2, pp. 179-186, 2010.

[21] L. Maddalena and A. Petrosino, "The SOBS Algorithm: What Are the Limits?" Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 21-26, 2012.

[22] J. Kato, T. Watanabe, S. Joga, J. Rittscher, and A. Blake, "An HMM-Based Segmentation Method for Traffic Monitoring Movies," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 9, pp. 1291-1296, Sept. 2002.

[23] N. Oliver, B. Rosario, and A.P. Pentland, "A Bayesian Computer Vision System for Modeling Human Interactions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 831-843, Aug. 2000.

[24] J. He, L. Balzano, and A. Szlam, "Incremental Gradient on the Grassmannian for Online Foreground and Background Separation in Subsampled Video," Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2012.

[25] F. Seidel, C. Hage, and M. Kleinsteuber, "pROST: A Smoothed Lp-Norm Robust Online Subspace Tracking Method for Realtime Background Subtraction in Video," unpublished, preprint in CoRR, 2013.

[26] D.-S. Lee, "Effective Gaussian Mixture Learning for Video Background Subtraction," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 827-832, May 2005.

[27] K. Kim, T.H. Chalidabhongse, D. Harwood, and L. Davis, "Background Modeling and Subtraction by Codebook Construction," Proc. Int'l Conf. Image Processing (ICIP), vol. 5, pp. 3061-3064, 2004.

[28] J. Migdal and W.E.L. Grimson, "Background Subtraction Using Markov Thresholds," Proc. Seventh IEEE Workshop Application of Computer Vision, pp. 58-65, 2005.

[29] A. Schick, M. Bauml, and R. Stiefelhagen, "Improving Foreground Segmentation with Probabilistic Superpixel Markov Random Fields," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 27-31, 2012.

[30] P.F. Felzenszwalb and D.P. Huttenlocher, "Efficient Belief Propagation for Early Vision," Int'l J. Computer Vision, vol. 70, no. 1, pp. 41-54, 2004.

[31] S. Cohen, "Background Estimation as a Labeling Problem," Proc. 10th IEEE Int'l Conf. Computer Vision (ICCV), pp. 1034-1041, 2005.

[32] D.H. Parks and S.S. Fels, "Evaluation of Background Subtraction Algorithms with Post-Processing," Proc. IEEE Fifth Int'l Conf. Advanced Video and Signal Based Surveillance, pp. 192-199, 2008.

[33] R. Cucchiara, C. Grana, M. Piccardi, and A. Prati, "Detecting Moving Objects, Ghosts and Shadows in Video Streams," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 25, no. 10, pp. 1337-1342, Oct. 2003.

16. 186 lines of C for the DP model and 239 lines for the post-processing. The GPU implementation comes in at 814 lines, due to the complex processing structure needed for an efficient implementation.


[34] M. Harville, G.G. Gordon, and J. Woodfill, "Adaptive Video Background Modeling Using Color and Depth," Proc. Int'l Conf. Image Processing (ICIP), vol. 3, pp. 90-93, 2001.

[35] A. Morde, X. Ma, and S. Guler, "Learning a Background Model for Change Detection," Proc. IEEE Conf. Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 15-20, 2012.

[36] D.M. Blei and M.I. Jordan, "Variational Inference for Dirichlet Process Mixtures," Bayesian Analysis, vol. 1, pp. 121-144, 2005.

[37] A. Gelman, J.B. Carlin, H.S. Stern, and D.B. Rubin, Bayesian Data Analysis. Chapman & Hall, 2004.

[38] Y. Boykov, O. Veksler, and R. Zabih, "Fast Approximate Energy Minimization via Graph Cuts," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1222-1239, Nov. 2001.

[39] J.K. Salmon, M.A. Moraes, R.O. Dror, and D.E. Shaw, "Parallel Random Numbers: As Easy as 1, 2, 3," Proc. Int'l Conf. High Performance Computing, Networking, Storage and Analysis, 2011.

[40] C.C. Loy, T. Xiang, and S. Gong, "Time-Delayed Correlation Analysis for Multi-Camera Activity Understanding," Int'l J. Computer Vision, vol. 90, no. 1, pp. 106-129, 2010.

[41] D. Comaniciu and P. Meer, "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, pp. 603-619, May 2002.

[42] X. Cui, J. Huang, S. Zhang, and D. Metaxas, "Background Subtraction Using Group Sparsity and Low Rank Constraint," Proc. 12th European Conf. Computer Vision (ECCV), 2012.

[43] A. Elqursh and A. Elgammal, "Online Moving Camera Background Subtraction," Proc. 12th European Conf. Computer Vision (ECCV), 2012.

[44] D. Culibrk, O. Marques, D. Socek, H. Kalva, and B. Furht, "Neural Network Approach to Background Modeling for Video Object Segmentation," Neural Networks, vol. 18, no. 6, pp. 1614-1627, 2007.

[45] R.H. Evangelio, M. Pätzold, and T. Sikora, "Splitting Gaussians in Mixture Models," Proc. IEEE Ninth Int'l Conf. Advanced Video and Signal-Based Surveillance, pp. 300-305, 2012.

[46] F. Hernandez-Lopez and M. Rivera, "Change Detection by Probabilistic Segmentation from Monocular View," Machine Vision and Applications, submitted for publication.

[47] R.H. Evangelio and T. Sikora, "Complementary Background Models for the Detection of Static and Moving Objects in Crowded Environments," Proc. IEEE Eighth Int'l Conf. Advanced Video and Signal-Based Surveillance, pp. 71-76, 2011.

Tom S.F. Haines received the PhD degree in shape from shading and stereopsis from the University of York in 2009. Following this, he did a post doctorate at Queen Mary University of London, where he worked on topic models, active learning, and background subtraction. He is currently at University College London, doing research on handwriting. His interests include machine learning, non-parametric Bayesian methods, belief propagation, and density estimation.

Tao Xiang received the PhD degree in electrical and computer engineering from the National University of Singapore in 2002. He is currently a senior lecturer (associate professor) in the School of Electronic Engineering and Computer Science, Queen Mary University of London. His research interests include computer vision, statistical learning, video processing, and machine learning, with a focus on interpreting and understanding human behaviour. He has published more than 100 papers in international journals and conferences and coauthored a book, Visual Analysis of Behaviour: From Pixels to Semantics.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
