

ASEE 2014 Zone I Conference, April 3-5, 2014, University of Bridgeport, Bridgeport, CT, USA.

Video Segmentation into Background and Foreground Using Simplified Mean Shift Filter and

K-Means Clustering

Sudhanshu Sinha, Doctoral Student,

Bowie State University, Bowie, MD 20715

[email protected]

Manohar Mareboyana, Professor

Bowie State University, Bowie, MD 20715

[email protected]

Abstract—Video segmentation decomposes image frames into background and foreground. In this paper, a combination of a simplified mean-shift filter and K-Means clustering is used to model the background. The most common models used for background estimation are the mixture of Gaussians (MOG) and kernel density estimation (KDE). A comparison of the proposed approach with some of these models shows that a relatively simple model using a simplified mean-shift computation and K-Means clustering can produce results comparable to those obtained by other methods. The proposed approach was tested on video data from the Wallflower test image set obtained from its source website. The results are encouraging and show the validity of this approach for background modeling.

Keywords—image processing; video segmentation; foreground/background modeling; estimation; detection; mean shift; K-Means clustering.

I. INTRODUCTION

Video segmentation models can broadly be divided into parametric and nonparametric models. Parametric models include the mixture of Gaussians (MOG) [2, 5, 7, 10, 12, 13, 14, 15, 16, 17, 18]. Stauffer and Grimson [2, 7] used an adaptive background mixture model to address the multi-modality of the background. Its expectation maximization is incremental, but convergence is slow. McKenna, Raja, and Gong [31, 32] tried to improve convergence with long temporal windows, but at the cost of memory. Hanzi Wang and David Suter [10] re-evaluated the mixture of Gaussians by quantitatively evaluating its performance. Jin Wang and Lanfang Dong [12] sped up the MOG model by using "additive increase/decrease" to adjust the weights of the matched and unmatched distributions, respectively. Most of these methods require expensive computation: it is hard to choose the initial values for the Gaussians, and each input value has to be evaluated against every Gaussian before a decision is made, after which the weights are adjusted.

The nonparametric approaches include estimation of the probability density function (pdf) using kernel functions, referred to as kernel density estimation (KDE) [1, 4, 11, 19, 20, 21, 22]. Elgammal, Duraiswami, Harwood, and Davis [4] proposed a background model based on KDE: a kernel function estimates the probability density of a pixel at a given location from a training set, and this density decides whether a new pixel is background or foreground. Mittal and Paragios [11] use a sliding window to adapt the model. Kim, Harwood, and Davis [33] proposed a layered modeling technique to update the background for scene changes. Tang, Miao, Wan, and Li [19] proposed a method that coarsely locates the foreground object with a salient-detection algorithm and then refines it by weighted kernel density estimation (WKDE). Ridder, Munkelt, and Kirchner [8] used Kalman filtering. The Bayesian background modeling of Han, Zhu, Comaniciu, and Davis [6] is another example of nonparametric estimation: it maintains a frequency distribution based on the probability of a pixel being background in previous frames and keeps updating it. In the Bayesian method, the training set has to be very large or the results are poor. Han, Zhu, Comaniciu, and Davis [6] used a kernel-based Bayesian approach in which an analytic approximation of the density function is employed. Some models use pixel features such as color [4, 34], while others use consistent motion [29, 30]. Haritaoglu, Harwood, and Davis [9] stored the minimum and maximum intensity values and the maximum temporal derivative for each pixel. Niranjil Kumar and Sureshkumar [28] used a modified K-Means algorithm for background subtraction. Histograms have been used for video segmentation in [3, 24, 26, 27]. Freedman and Kisilev [23] obtained a fast mean shift by compacting the density representation. Sinha and Mareboyana [39] (in press) used mean shift and histogram clustering for video segmentation.

II. METHODS

In this paper, a mean-shift filter is used to compute the local peaks (number of clusters) in the background at each pixel position, using a predetermined number of training image frames. The mean-shift filter uses every pixel value of the video frames as an initial point and converges to the local peak (densest point). Once convergence is complete for all pixels, the mean values are clustered using K-Means. Each input value from a test frame is compared against these clusters by computing the cluster distances. If the lowest distance is greater than a threshold, the pixel is taken as foreground; otherwise it is taken as background. After each decision, if the pixel is background, the training data is refreshed by inducting the decided pixel into it. This way the training data becomes more representative.

A Java program was written to test this video segmentation approach. The sample video images are taken from the Wallflower video sets at their source site [38]. The method was also tested with video images captured at Bowie State University, but those results are not included in this paper due to space limitations.

In a video, a pixel is represented by x_ijp, where (i, j) are the row and column numbers of frame p. There are m rows and n columns in each frame, and p frames are used for background model estimation. We assume the color channels are independent of each other and treat them independently for notational simplicity.

A. Computing the Mean Shift

Mean shift is a procedure for locating the maxima of a density function given discrete data sampled from that function. It is an iterative method that starts with an initial estimate x. Let a kernel function K(x_i − x) be given; this function determines the weight of nearby points for re-estimation of the mean. Typically, a Gaussian kernel is used:

K(x_i − x) = e^(−c ‖x_i − x‖²)    (1)

The weighted mean of the density in the window determined by K is

m(x) = [ Σ_{x_i ∈ N(x)} K(x_i − x) · x_i ] / [ Σ_{x_i ∈ N(x)} K(x_i − x) ]    (2)

where N(x) is the neighborhood of x, the set of points for which K(x_i − x) ≠ 0. The mean-shift algorithm now sets x = m(x) and repeats the estimation until m(x) converges. Rewriting in terms of a discrete set of pixel values, we proceed as follows.

For notational simplicity, ignoring i and j, let the temporal values at an arbitrary location (i, j) of the training set be (x_1, x_2, …, x_p). Take the kernel as

c · e^(−(1/(2h²)) (x − x_i)²)    (3)

where h is the bandwidth of the kernel, c is a constant factor, and p is the number of pixels in the training set. Ignoring the constant factors, as they cancel between numerator and denominator, the mean-shift formula can be written as:

m(x) = [ Σ_{i=1}^{p} e^(−(1/(2h²)) (x − x_i)²) · x_i ] / [ Σ_{i=1}^{p} e^(−(1/(2h²)) (x − x_i)²) ]    (4)

Simplifying the formula for computation, taking the numerator and the denominator separately, the numerator is

Σ_{i=1}^{p} e^(−(1/(2h²)) (x − x_i)²) · x_i    (5)

= Σ_{i=1}^{p} e^(k (x − x_i)²) · x_i    (6)

where k = −1/(2h²)

= Σ_{i=1}^{p} e^(k (x² − 2 x x_i + x_i²)) · x_i    (7)

= Σ_{i=1}^{p} x_i · e^(k x²) · e^(−2 k x x_i) · e^(k x_i²)    (8)

Taking out the constant term,

= e^(k x²) Σ_{i=1}^{p} x_i · e^(k x_i²) · e^(−2 k x x_i)    (9)

Similarly, the denominator is

= e^(k x²) Σ_{i=1}^{p} e^(k x_i²) · e^(−2 k x x_i)    (10)

Finally, the term e^(k x²) cancels in the division:

m(x) = [ Σ_{i=1}^{p} x_i · e^(k x_i²) · e^(−2 k x x_i) ] / [ Σ_{i=1}^{p} e^(k x_i²) · e^(−2 k x x_i) ]    (11)

or

m(x) = [ Σ_{i=1}^{p} x_i · A_i · B_i ] / [ Σ_{i=1}^{p} A_i · B_i ]    (12)

where A_i = e^(k x_i²) and B_i = e^(−2 k x x_i). For one set of temporal values, A_i stays the same until convergence because it does not contain x, so it does not need to be recomputed in each iteration. This makes the computation simple, as only B_i must be recomputed in each iteration. At the start, the first pixel value x_1 is taken as x for computing m(x); then x is replaced by m(x), and the process repeats until m(x) converges within a limit. The same process is repeated for x_2, x_3, …, x_p. Once all the weighted mean values are computed, a K-Means clustering of the means is applied as detailed below.

B. Clustering the Mean Data with K-Means

For the initial seed values, any arbitrary values can be taken. For this study, the mean (m) and standard deviation (sd) of the weighted mean (mean-shift) values are computed, and three seed values are taken as m1 = m − sd, m2 = m, and m3 = m + sd.


Now the clusters are populated by assigning each mean-shift value to the closest of m1, m2, or m3. After the first iteration, the mean value of each cluster is recomputed to replace m1, m2, and m3. The elements are then compared again with the newly computed means and reassigned to the K clusters by closest distance. The process is repeated until m1, m2, and m3 converge, which gives the final clusters. For this study, K = 3 clusters were used, as mean-shift data does not have much variability.
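The seeding and reassignment loop just described can be sketched in Python (the paper's program was in Java; the sample data here is invented for illustration):

```python
def kmeans_1d(data, max_iter=100):
    """K-Means with K=3 on the mean-shift peak values, seeded at
    m - sd, m, and m + sd as described above."""
    n = len(data)
    m = sum(data) / n
    sd = (sum((v - m) ** 2 for v in data) / n) ** 0.5
    centers = [m - sd, m, m + sd]
    for _ in range(max_iter):
        # assign each value to the closest center
        clusters = [[] for _ in centers]
        for v in data:
            c = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[c].append(v)
        # recompute centers; an empty cluster keeps its old center
        new = [sum(cl) / len(cl) if cl else centers[i]
               for i, cl in enumerate(clusters)]
        if new == centers:
            break
        centers = new
    return centers, clusters

centers, clusters = kmeans_1d([10, 11, 12, 100, 101, 200, 201])
```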

After finishing the clustering of the training data, the test frames are processed. Each pixel value x_ij(p+1) from the new (p+1)th frame is compared with each cluster by computing the cluster distances using the following formula:

d_c = (x_ij(p+1) − m_c)² / (n_c σ_c²)    (13)

where m_c is the mean of cluster c, σ_c² is the variance of cluster c, and n_c is the number of elements in cluster c.

If the closest distance is greater than a threshold thr, the pixel is taken as foreground and a value of 255 is put in the output frame Out[i, j]; otherwise, 0 is put there. That is,

Out[i, j] = 0 if d_c < thr; 255 otherwise    (14)

This Out[i, j] is put in the decision frame S that is displayed as the result of the segmentation.
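The decision rule of Eqs. (13) and (14) can be sketched as below. This is an illustrative Python sketch; the threshold value and the zero-variance guard are assumptions of the sketch, since the paper does not specify them:

```python
def classify_pixel(x, clusters, thr):
    """Compute the normalized distance of Eq. (13) to each cluster and
    return 255 (foreground) or 0 (background) per Eq. (14)."""
    best = float("inf")
    for vals in clusters:
        nc = len(vals)
        mc = sum(vals) / nc
        var = sum((v - mc) ** 2 for v in vals) / nc
        var = max(var, 1e-6)            # guard against zero variance (assumption)
        d = (x - mc) ** 2 / (nc * var)
        best = min(best, d)
    return 0 if best < thr else 255

clusters = [[10, 11, 12], [100, 101, 102]]
```

A value close to a cluster mean falls below the threshold and is marked background; a value far from every cluster is marked foreground.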

C. Updating the Training Frame Set

The frames in the training set are updated after each test-frame decision. At the pixel locations marked as background (0) in the decision frame S, the values are shifted forward by one frame in the training set, and the new pixel value x_ij(p+1) is inducted into the first frame. For example, the pixel value in frame p − 1 is assigned to frame p, and the value in frame p − 2 is assigned to frame p − 1 at the same location; frame 1 is thereby freed to receive the new test-frame value.
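The frame-shifting update can be sketched as follows (a Python illustration; representing the training set as a list of 2-D frames is an assumption of this sketch):

```python
def update_training(training, new_frame, decision):
    """Shift each background pixel's history forward by one frame so the
    newest observed value enters the training set at frame index 0.
    training: list of p frames, each a 2-D list of pixel values.
    decision: decision frame S with 0 = background, 255 = foreground."""
    p = len(training)
    rows, cols = len(new_frame), len(new_frame[0])
    for i in range(rows):
        for j in range(cols):
            if decision[i][j] == 0:            # background pixel only
                for f in range(p - 1, 0, -1):  # frame p-1 receives frame p-2's value, etc.
                    training[f][i][j] = training[f - 1][i][j]
                training[0][i][j] = new_frame[i][j]

training = [[[1]], [[2]], [[3]]]
update_training(training, [[9]], [[0]])       # background: history shifts, 9 enters
```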

III. RESULTS

Although the method was tested with varying parameters, the results shown here use the best values. The input was taken in two ways: (1) all colors independently, and (2) color ratios, meaning the input frame data for red (R), green (G), and blue (B) were changed to the ratios R/(R+G+B), G/(R+G+B), and B/(R+G+B), respectively, each multiplied by 100. The output was captured in two ways: (1) all colors taken independently, and (2) the three color decisions added, with any nonzero value taken as 255. This gives the 4 cases shown in the results discussed below. The results are shown for the evaluation frame as detailed at the video source location [38].
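The color-ratio transformation of input mode (2) can be sketched as below (the guard for an all-black pixel is an assumption of this sketch; the paper does not discuss that case):

```python
def color_ratios(r, g, b):
    """Convert an RGB pixel to the ratio features of input mode (2):
    100*R/(R+G+B), 100*G/(R+G+B), 100*B/(R+G+B)."""
    s = r + g + b
    if s == 0:                      # all-black pixel: undefined ratio (assumption)
        return (0.0, 0.0, 0.0)
    return (100.0 * r / s, 100.0 * g / s, 100.0 * b / s)
```

The ratios discount overall brightness, which helps against illumination changes at the cost of discarding intensity information.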

A. Mean Shift Filter with K-Means Clustering

The original frame is shown in Fig. 1.

Figure 1: Original Frame

The output images captured for the 4 cases detailed above are as follows.

Case 1: With input ratio of 3 colors and output sum of colors: Result shown in Fig. 2

Figure 2: Output Image for Tree Case 1

Case 2: With input ratio of 3 colors and output 3 colors: Result shown in Fig. 3

Figure 3: Output Image for Case 2

Case 3: With input 3 colors and output 3 colors: Result shown in Fig. 4

Figure 4: Output Image for Case 3

Case 4: With input 3 colors and output sum of colors: Result shown in Fig. 5


Figure 5: Output Image for Case 4

B. Comparing with Mixture of Gaussians

A Java program was written to compare the method used in this research with the results given by the mixture of Gaussians. The program was written mainly in accordance with the method in the paper by Stauffer and Grimson [2]. The results of the 4 cases are shown below:

Case 1: With input ratio of 3 colors and output sum of colors: Result shown in Fig. 6

Figure 6: Output Image for Case 1

Case 2: With input ratio of 3 colors and output 3 colors: Result shown in Fig. 7

Figure 7: Output Image for Case 2

Case 3: With input 3 colors and output 3 colors: Result shown in Fig. 8

Figure 8: Output Image for Case 3

Case 4: With input 3 colors and output sum of colors: Result shown in Fig. 9

Figure 9: Output Image for Case 4

C. Comparing with Kernel Density Estimation

A Java program was written to compare the method used in this research with the results given by the kernel density estimator (KDE). The program was written in accordance with the method in the paper by Elgammal, Duraiswami, Harwood, and Davis [4]. Case 2 has the same result as Case 1, and Case 4 has the same result as Case 3, because the color factors are already incorporated in the computation.
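For reference, the kind of per-pixel KDE background test being compared against can be sketched as follows. This is only an illustrative Python sketch in the spirit of Elgammal et al. [4], with the bandwidth and probability threshold chosen arbitrarily:

```python
import math

def kde_prob(x, samples, h=5.0):
    """Kernel density estimate of pixel value x from its temporal
    samples, using a Gaussian kernel of bandwidth h."""
    norm = 1.0 / (len(samples) * math.sqrt(2 * math.pi) * h)
    return norm * sum(math.exp(-((x - s) ** 2) / (2 * h * h)) for s in samples)

def kde_is_background(x, samples, h=5.0, thr=1e-4):
    """A pixel is background when its estimated density exceeds thr."""
    return kde_prob(x, samples, h) > thr
```

Compared with the proposed method, the KDE test sums a kernel over every training sample for every test pixel, which is part of why the simplified mean-shift model is cheaper.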

Case 1: With input ratio of 3 colors and output sum of colors: Result shown in Fig. 10

Figure 10: Output Image for Case 1

Case 3: With input 3 colors and output sum of colors: Result shown in Fig. 11

Figure 11: Output Image for Waving Tree Case 3

IV. CONCLUSION

The proposed method gives comparable results even though it does not require complicated computations or manipulations. The method is well grounded, as it combines two well-known techniques: mean shift and K-Means. Convergence of the mean values takes more iterations if the threshold for terminating the mean-shift convergence is kept low. Testing showed a gain of about 28% in time when the simplified computation for the mean shift was used. The method is easily understandable to non-mathematicians.

REFERENCES

[1] A. Elgammal, D. Harwood, and L. Davis, "Non-parametric model for background subtraction," in Proc. 6th European Conference on Computer Vision, Dublin, Ireland, 2000.

[2] C. Stauffer and W. Grimson, “Adaptive background mixture models for real-time tracking,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, (Fort Collins, CO), pp. 246–252, 1999.

[3] S. Jehan-Besson, M. Barlaud, G. Aubert, and O. Faugeras. Shape gradients for histogram segmentation using active contours. Proceedings of the IEEE International Conference on Computer Vision 2003.

[4] Ahmed Elgammal, Ramani Duraiswami, David Harwood, and Larry S. Davis, "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proceedings of the IEEE, Vol. 90, No. 7, July 2002.

[5] C. Stauffer and W.E.L. Grimson, “Learning Patterns of Activity Using Real-time Tracking”, IEEE Transactions on Pattern Analysis &Machine Intelligence, Vol.22, Issue.8, pp. 747-757, 2000.

[6] Bohyung Han, Ying Zhu, Dorin Comaniciu, and Larry Davis, "Kernel-Based Bayesian Filtering for Object Tracking", Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[7] W. Grimson, C. Stauffer, R. Romano, and L. Lee, “Using adaptive tracking to classify and monitor activities in a site,” in Proc. IEEE Conf. on Computer Vision and Pattern Recognition, (Santa Barbara, CA), pp. 22–29, 1998.

[8] Christof Ridder, Olaf Munkelt, and Harald Kirchner, "Adaptive background estimation and foreground detection using Kalman filtering," Proceedings of the International Conference on Recent Advances in Mechatronics (ICRAM '95), UNESCO Chair on Mechatronics, pp. 193-199, 1995.

[9] I. Haritaoglu, D. Harwood, and L. Davis, “W4: Who? When? Where? What? A real time system for detecting and tracking people,” in Proc. 3rd International Conference on Face and Gesture Recognition, (Nara, Japan), 1998.

[10] Hanzi Wang and David Suter, "A re-evaluation of mixture-of-Gaussian background modeling," IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '05), 2005.

[11] A. Mittal and N. Paragios, "Motion-based background subtraction using adaptive kernel density estimation," in Proceedings of CVPR, Washington, DC, pp. 302-309, 2004.

[12] Jin Wang and Lanfang Dong, "Moving objects detection method based on a fast convergence Gaussian mixture model," 3rd International Conference on Computer Research and Development (ICCRD), 2011.

[13] M. S. Allili, N. Bouguila, and D. Ziou, "Online video foreground segmentation using general Gaussian mixture modeling," IEEE International Conference on Signal Processing and Communications (ICSPC), 2007.

[14] T. Supasuteekul Charoenpong and C. A. Nuthong, "Adaptive background modeling from an image sequence by using K-Means clustering", International Conference on Electrical Engineering/Electronics Computer Telecommunications and Information Technology (ECTI-CON), 2010

[15] Long-hui Guo, Liang He and Huai-zhong Li, "Traffic video image segmentation based on mixture of Gaussian model", International

Conference on Electric Information and Control Engineering (ICEICE), 2011

[16] Baoyan Ding, Ran Shi, Zhi Liu and Zhaoyang Zhang, "Human object segmentation using Gaussian mixture model and graph cuts" International Conference on Audio Language and Image Processing (ICALIP), 2010

[17] Ming-Shou An and Dae-Seong Kang, "Motion estimation with histogram distribution for visual surveillance", 19th Annual Wireless and Optical Communications Conference (WOCC), 2010

[18] N. Greggio, A. Bernardino, C. Laschi, P. Dario and J. Santos-Victor, "Self-adaptive Gaussian mixture models for real-time video segmentation and background subtraction", 10th International Conference on Intelligent Systems Design and Applications (ISDA), 2010

[19] Zhen Tang, Zhenjiang Miao, Yanli Wan and Jia Li, "Automatic foreground extraction for images and videos", 17th IEEE International Conference on Image Processing (ICIP), 2010

[20] Junwei Hsieh, Sin-Yu Chen, Chi-Hung Chuang, Yung-Sheng Chen, Zhong-Yi Guo and Kuo-Chin Fan, "Pedestrian segmentation using deformable triangulation and kernel density estimation", 2009 International Conference on Machine Learning and Cybernetics,

[21] Wei Yang, Junshan Li, Deqin Shi and Shuangyan Hu, "Mean Shift Based Target Tracking in FLIR Imagery via Adaptive Prediction of Initial Searching Points", Second International Symposium on Intelligent Information Technology Application (IITA), 2008.

[22] Dongbin Xu, Changping Liu and Lei Huang, "An Adaptive Kernel Density Estimation for Motion Detection", Second International Symposium on Intelligent Information Technology Application (IITA), 2008.

[23] D. Freedman and P. Kisilev. Fast Mean Shift by Compact Density Representation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.

[24] Lo Chi-Chun and Wang Shuenn-Jyi. Video segmentation using a histogram-based fuzzy c-means clustering algorithm. IEEE International Conference on Fuzzy Systems, 2001.

[25] Rafael C. Gonzalez, Richard E. Woods, "Digital Image Processing", Third edition 2008, published by Pearson Prentice Hall.

[26] Robert A. Joyce and Bede Liu. Temporal Segmentation of Video Using Frame and Histogram Space. IEEE Transactions on Multimedia, Vol. 8, No. 1, February 2006.

[27] Xu Jianfeng, T. Yamasaki, and K. Aizawa. Temporal Segmentation of 3-D Video by Histogram-Based Feature Vectors. IEEE Transactions on Circuits and Systems for Video Technology 2009.

[28] A. Niranjil Kumar and C. Sureshkumar. Background Subtraction Based on Threshold detection using Modified K-Means Algorithm. Proceedings of the 2013 International Conference on Pattern Recognition, Informatics and Mobile Engineering, 27(5):827–832, February 2013.

[29] R. Pless, T. Brodsky, and Y. Aloimonos. Detecting independent motion: The statistics of temporal continuity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):68–73, 2000.

[30] L. Wixson. Detecting salient motion by accumulating directionally-consistent flow. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:774–780, August, 2000.

[31] S.J. McKenna, Y. Raja, and S. Gong. Object tracking using adaptive color mixture models. In proceedings of Asian Conference on Computer Vision, 1:615–622, January 1998.

[32] S.J. McKenna, Y. Raja, and S. Gong. Tracking color objects using adaptive mixture models. Image and Vision Computing, 17:223–229, 1999.


[33] K. Kim, D. Harwood, and L. S. Davis. Background updating for visual surveillance. In proceedings of the International Symposium on Visual Computing, 1:337–346, December 2005.

[34] Y. Sheikh and M. Shah. Bayesian object detection in dynamic scenes. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 1:74–79, June 2005.

[35] K. P. Karman and A. von Brandt. Moving object recognition using an adaptive background memory. Time-varying Image Processing and Moving Object Recognition, 1990.

[36] K. P. Karman, A. von Brandt and R. Gerl. Moving object segmentation based on adaptive reference images. European Signal Processing Conference, 1990

[37] Wang Yunlong, Jiang Guang, and Jiang Changlong. Mean shift tracking with graph cuts based image segmentation. International Congress on Image and Signal Processing (CISP), 2012.

[38] Microsoft Research website, retrieved May 2, 2011: http://research.microsoft.com/en-us/um/people/jckrumm/wallflower/testimages.htm

[39] Sudhanshu Sinha and Manohar Mareboyana. Video segmentation into background and foreground using simplified mean shift filter and clustering. International Conference on Image and Signal Processing (ICSIP), 2014.

Author Biography: Sudhanshu Sinha

He holds two Master's degrees, one in Statistics and one in Computer Science. He currently provides technical support in planning, developing, modifying, testing, implementing, and supporting customer systems at a U.S. federal government department. Part-time, he teaches Visual C# 2010 at Bowie State University, Maryland, as an adjunct faculty member. He is also a doctoral student researching image processing. He has over 22 years of experience in client/server, Internet, intranet, mobile, database, and mainframe applications using various programming languages and platforms, including Visual Basic.NET, C#.NET, ASP.NET, Java, and SAS.