
Journal of Computational Information Systems 9: 20 (2013) 8307–8315
Available at http://www.Jofcis.com

Bag of Features with Dense Sampling for Visual Tracking

Pingyang DAI, Weisheng LIU, Lan WANG, Cuihua LI∗, Yi XIE

Computer Science Department, Xiamen University, Xiamen 361005, China

Abstract

The bag-of-features model has become a state-of-the-art method for visual classification. Visual codebooks, built by quantizing robust appearance descriptors extracted from local image patches, can capture image statistics for object detection and classification. In this paper, more information about target objects is captured by dense sampling rather than sparse sampling, and a robust visual tracking method based on dense sampling and bag of features is proposed. Firstly, local image patches are densely extracted by sliding windows and represented as invariant descriptors. Secondly, visual codebooks are generated by fast clustering algorithms such as hierarchical k-means. The object region and candidate regions are then represented by the bag-of-features model with the learnt codebooks. After that, tracking operates in a Bayesian inference framework. The bag-of-features tracking method with dense sampling is adaptive and flexible, and works independently in many situations without the assistance of existing tracking algorithms. Experiments on various challenging videos demonstrate that the proposed tracker outperforms several state-of-the-art algorithms.

Keywords: Visual Tracking; Dense Sampling; Bag of Features

1 Introduction

Object tracking has been widely studied in computer vision because it is important in many applications such as surveillance, augmented reality and human-computer interaction. However, most state-of-the-art tracking algorithms cannot meet practical requirements because of many challenges. The most important one is that the appearance of an object varies substantially with changes in pose, lighting conditions, occlusions and shape. How to design a robust appearance model that can adapt to the factors mentioned above is a key task in visual tracking. Without an effective appearance model, it is difficult in practice to achieve robust tracking in unconstrained real-world environments.

Project supported by the National Defense Basic Scientific Research Program of China, the National Defense Science and Technology Key Laboratory Fund, the Doctoral Program of Higher Specialized Research Fund (No. 20110121110020), the Fundamental Research Funds for the Central Universities of the People's Republic of China (No. 2010121066), and the Shenzhen City Special Fund for Strategic Emerging Industries (JCYJ2012).

∗Corresponding author. Email address: [email protected] (Cuihua LI).

1553–9105 / Copyright © 2013 Binary Information Press
DOI: 10.12733/jcis8389
October 15, 2013


The representation based on collections of appearance descriptors extracted from local image patches has become very popular for image categorization and visual recognition [1–4]. The bag-of-features model and visual codebooks based on the quantization of robust appearance descriptors are efficient at capturing image statistics. The basic idea is to treat images as collections of independent patches and to sample a representative set of patches from the images. The visual descriptor vector is computed for each patch independently, and the resulting distribution of samples in descriptor space represents the image. This model is resistant to occlusions, geometric deformations and illumination variations, since it can capture a significant proportion of the complex statistics of images and visual classes in a local form.

The idea of representing images as collections of independent local patches has shown its worth in object recognition and image classification. But questions arise: how should patches be sampled from a set of training images? Should the descriptor vectors of patches be sampled densely or sparsely? The sampler is a critical component of the bag-of-features method, and a significant factor in performance is the number of patches sampled from the image. Samplers based on interest points or random sampling operate with small numbers of patches but cannot provide enough patches to find the most informative representation. Previous research [5] has shown that dense sampling followed by explicit discriminative feature selection captures more information and obtains better results, and that performance generally increases with the number of sampled patches.

This paper proposes a novel robust visual tracker based on bag of features with dense sampling. The patches within a target region are positive samples and the ones around the target region are negative samples; both are densely extracted from the training frames. Each patch is represented independently as a visual descriptor vector. The collection of these descriptors is clustered to construct a visual codebook. Each template region, as well as each candidate region, can then be represented as a distribution of codewords in descriptor space. Tracking then operates in a Bayesian inference framework in which a particle filter propagates sample distributions over time. Empirical results on challenging video sequences demonstrate the superior performance of our method in terms of robustness and stability compared with state-of-the-art methods.

The rest of the paper is organized as follows. Related work on visual object tracking is reviewed in Section 2. The details of the proposed approach are presented in Section 3. Experiments on publicly available challenging sequences are analyzed in Section 4. Finally, conclusions and discussion are provided in Section 5.

2 Related Work

Visual tracking has been an active research topic since the early 1980s. Many different tracking methods have been proposed, such as global template-based trackers, shape-based methods, probabilistic models using mean-shift [6], particle filtering [7], local keypoint-based trackers [8] and flow-based trackers.

The appearance model is a prerequisite for the success of a tracking system. It is important to design a robust appearance model which is discriminative and adaptive. To deal with appearance variations, the IVT method [9] utilizes an incremental subspace model to adapt to changes of appearance. This method performs well when target objects encounter illumination changes and pose variations. Adam et al. [10] utilized multiple fragments to design an appearance model which is robust to partial occlusions. Inspired by the multiple instance learning method used to solve the ambiguity problem in face detection, an online multiple instance learning (MIL) method [11] was proposed. The MIL tracker learns a classifier as the appearance model to alleviate the drift problem. Recently, sparse representation has been used in the L1 tracker [12], where an object is modeled by a sparse linear combination of target templates and trivial templates. The template set is dynamically updated according to the similarity between the tracking result and the template set. The L1 tracker demonstrates good robustness to partial occlusions, illumination changes and pose variations.

Discriminative appearance models are generally associated with classifiers which treat tracking as a binary classification task to separate an object from its surrounding background. Kalal et al. [13] proposed the P-N learning algorithm to exploit the underlying structure of positive and negative samples, where effective classifiers are learnt for object tracking. The ensemble tracker [14] formulates tracking as a binary classification problem, where an ensemble of weak classifiers is trained online to distinguish the object from the background. Grabner et al. [15] proposed an online boosting method to update discriminative features, and a semi-supervised online boosting algorithm [16] was proposed to handle the drift problem.

Our method bears a certain similarity to the work in [17] in its use of the bag-of-features model. However, there are many differences. Firstly, we sample local image patches densely in both object regions and non-object regions, so the codebook contains both positive and negative visual words, whereas the work in [17] samples the object region randomly and constructs only positive visual words. Secondly, SIFT features are densely sampled to construct visual codebooks and are used to match robustly across different object appearances. Thirdly, our proposed method is adaptive, flexible and independent, while the work in [17] requires the cooperation of existing tracking algorithms.

3 Tracking with Dense Sampling Bag of Features

3.1 Bayesian object tracking

The particle filter is a Bayesian sequential importance sampling technique which describes a dynamic system by recursively estimating the posterior distribution of state variables. It has been a dominant framework for estimating and propagating the posterior probability density function of state variables without assumptions on the underlying distribution. A particle filter simulates this distribution by a well-known two-step recursion: prediction and update.

Let x_t denote the state variable describing the motion parameters of an object at time t. Given all observations y_{1:t-1} = {y_1, y_2, ..., y_{t-1}} up to time t-1, the predicting distribution of x_t is denoted by p(x_t | y_{1:t-1}), which is recursively computed as

p(x_t | y_{1:t-1}) = ∫ p(x_t | x_{t-1}) p(x_{t-1} | y_{1:t-1}) dx_{t-1}. (1)

At time t, the observation y_t becomes available and the state is updated using the Bayes rule in Equation (2), where p(y_t | x_t) denotes the observation likelihood:

p(x_t | y_{1:t}) = p(y_t | x_t) p(x_t | y_{1:t-1}) / p(y_t | y_{1:t-1}). (2)


Given all observations up to the t-th frame, the optimal state of the tracked target is obtained by maximum a posteriori estimation over N samples at time t by Equation (3), where x_t^i indicates the i-th sample of the state x_t, and y_t^i denotes the image patch predicted by x_t^i:

x̂_t = arg max_{x_t^i} p(x_t^i | y_{1:t}). (3)
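The prediction–update recursion of Eqs. (1)–(3) can be sketched as a generic bootstrap particle filter. This is an illustrative sketch, not the paper's C++ implementation; the function name, the toy likelihood, and the choice of a scalar motion variance are all assumptions:

```python
import numpy as np

def particle_filter_step(particles, weights, likelihood, sigma=5.0, rng=None):
    """One prediction/update cycle of a bootstrap particle filter.

    particles:  (N, d) array of state samples x_{t-1}^i
    weights:    (N,) normalized importance weights
    likelihood: callable mapping an (N, d) array to p(y_t | x_t^i)
    sigma:      std. dev. of the Gaussian motion model (cf. Eq. (4))
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(particles)
    # Prediction: resample by the previous weights, then perturb each
    # particle with the Gaussian motion model p(x_t | x_{t-1}).
    idx = rng.choice(n, size=n, p=weights)
    particles = particles[idx] + rng.normal(0.0, sigma, particles.shape)
    # Update: reweight by the observation likelihood p(y_t | x_t).
    weights = likelihood(particles)
    weights = weights / weights.sum()
    # MAP estimate over the N samples, as in Eq. (3).
    return particles, weights, particles[np.argmax(weights)]
```

Iterating this step with a likelihood peaked at the true target location pulls the particle cloud, and hence the MAP estimate, toward that location.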

3.2 Dense sampling

Appearance representation based on invariant descriptors is a popular way to describe objects. The image is treated as a loose collection of independent local patches. Patches can be extracted either densely or sparsely according to local informative criteria. We utilize the information of every pixel to obtain densely sampled keypoints rather than using sparse feature points. The dense sampling procedure is illustrated in Fig. 1.

Fig. 1: Illustration of dense sampling in our proposed framework

A sliding-window scheme is performed and a dense keypoint set is extracted within the object region and its surrounding region. Then the sampled patch set P = {p_1, p_2, ..., p_n} is obtained according to the dense keypoint set. The sampled keypoints capture useful discriminative information about the tracked object and its surrounding regions. The SIFT descriptor of each densely sampled keypoint is computed with identical size and orientation, representing the features of these keypoints. A feature pool F = {f_1, f_2, ..., f_n} is then obtained, constructed from the set of SIFT descriptors.
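The sliding-window extraction amounts to placing keypoints on a regular pixel grid over the region. A minimal sketch; the 5-pixel step matches the setting reported in Section 4, while the function name and the patch size are assumptions:

```python
import numpy as np

def dense_keypoints(region, step=5, patch_size=16):
    """Densely sample keypoints and patches from an image region.

    region:     2-D grayscale image array
    step:       sliding-window stride in pixels
    patch_size: side length of each sampled square patch
    """
    h, w = region.shape
    keypoints, patches = [], []
    # Slide the window over every grid position that fits in the region.
    for y in range(0, h - patch_size + 1, step):
        for x in range(0, w - patch_size + 1, step):
            keypoints.append((x, y))
            patches.append(region[y:y + patch_size, x:x + patch_size])
    return keypoints, np.array(patches)
```

In the paper, each such patch is then described by a SIFT descriptor computed with identical size and orientation, which fills the feature pool F.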

3.3 Bag-of-feature based discriminative appearance model

Given the training frames, the descriptor set and the feature pool F are formed by densely sampling patches. The visual codebook C = {c_1, c_2, ..., c_k} is constructed using a fast clustering algorithm such as hierarchical k-means. The descriptors densely sampled from an image region, which can be a template image region or a candidate image region, are then coded by vector quantization against the visual codebook. Afterwards, the image region is represented as a distribution in descriptor space by a histogram H, which forms the discriminative appearance representation. The discriminative appearance representation based on the dense-sampling bag-of-features model is illustrated in Fig. 2.

Fig. 2: Illustration of the discriminative appearance representation based on dense-sampling bag of features

The appearance representation is more discriminative for two reasons. Firstly, densely sampled patches extract more information from the image region, yielding a more discriminative visual codebook than methods using sparse keypoints or randomly sampled patches. Secondly, a visual codebook constructed only from patches of the object region contains only positive information, so descriptors of candidate regions that contain non-object information would be coded by hard assignment to the nearest codeword during vector quantization, which decreases the discriminability of the appearance representation. Our method achieves a discriminative appearance model based on the dense-sampling bag-of-features model because the visual codebook is constructed from patches extracted from both object regions and non-object regions.
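Codebook construction and hard-assignment coding can be sketched as follows. A flat Lloyd-style k-means stands in for the hierarchical k-means the paper uses, and the function names are illustrative; the codebook size of 30 reported in Section 4 would be passed as k:

```python
import numpy as np

def build_codebook(descriptors, k=30, iters=10, rng=None):
    """Cluster descriptors into k codewords (flat k-means sketch)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Initialize codewords with k distinct descriptors.
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Hard-assign each descriptor to its nearest codeword.
        dists = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Move each codeword to the mean of its assigned descriptors.
        for j in range(k):
            if (labels == j).any():
                centers[j] = descriptors[labels == j].mean(axis=0)
    return centers

def bof_histogram(descriptors, codebook):
    """Represent a region as the normalized codeword histogram H."""
    dists = np.linalg.norm(descriptors[:, None] - codebook[None], axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(codebook))
    return hist / hist.sum()
```

When the codebook is built from both object and non-object patches, as the paragraph above argues, candidate descriptors from the background map to negative codewords instead of being forced onto positive ones.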

3.4 Dense sampling bag-of-feature tracking

In the tracking framework, we apply Gaussian perturbations with different variances to model the object motion between two consecutive frames:

p(x_t | x_{t-1}) = G(x_{t-1}, σ²), (4)

where G(x_{t-1}, σ²) denotes the Gaussian distribution with mean x_{t-1} and variance σ².

Our algorithm relies on a visual codebook which must be trained on labeled data, so a method of obtaining labeled frames for training is required to construct the visual codebook. We have found that our proposed algorithm performs well with as few as 5 labeled frames per sequence. These labels can be provided manually or taken from ground truth.

Given labeled frames, positive patches and non-object patches are densely sampled using sliding windows and represented as invariant descriptors. A visual codebook containing both object and non-object visual words is then constructed by k-means. The object template and candidate image regions can therefore each be represented as a histogram. The likelihood function for the filtering distribution is defined by

p(y_t | x_t) = (1/Γ) exp(−λ D(H_T, H_{x_t})²), (5)

where λ is a constant controlling the shape of the Gaussian kernel, Γ is a normalization factor, H_T is the target template model, H_{x_t} is the candidate model, and D is a similarity metric such as the χ² or Bhattacharyya distance. The tracking result is the sample state that obtains the largest probability.
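The likelihood of Eq. (5) then reduces to a few lines. The Bhattacharyya distance is one of the two metrics the paper allows; the function names and the values of λ and Γ here are illustrative, not the paper's settings:

```python
import numpy as np

def bhattacharyya(h1, h2):
    """Bhattacharyya distance between two normalized histograms."""
    bc = np.sum(np.sqrt(h1 * h2))  # Bhattacharyya coefficient in [0, 1]
    return np.sqrt(max(0.0, 1.0 - bc))

def observation_likelihood(h_template, h_candidate, lam=10.0, gamma=1.0):
    """Eq. (5): p(y_t | x_t) = (1/Gamma) * exp(-lambda * D(H_T, H_xt)^2)."""
    d = bhattacharyya(h_template, h_candidate)
    return np.exp(-lam * d ** 2) / gamma
```

An identical candidate histogram gives distance 0 and hence the maximal likelihood 1/Γ; more dissimilar candidates decay exponentially with λ.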

3.5 Update of observation model

It is essential to update the observation model to handle the appearance changes of target objects during visual tracking. However, the update frequency of templates should be appropriate: each template update introduces small errors in its location, and frequent updates accumulate these errors until the tracker drifts away from the target. We solve this problem by dynamically updating templates, combining the original template with the current template.

In order to update the current template, we accumulate the latest k frames to generate a new visual codebook C*. Given C*, the current target template and candidate regions can be represented as histograms H_update and H'_{x_t} respectively. Meanwhile, H_0 and H_{x_t} are the histogram representations of the original template and candidate regions given the codebook C_0. The probability is then calculated by combining the similarity of H_0 and H_{x_t} with the similarity of H_update and H'_{x_t}:

p(y_t | x_t) = (1/Γ) {α exp(−λ D(H_0, H_{x_t})²) + (1 − α) exp(−λ D(H_update, H'_{x_t})²)}. (6)

The tracking result is the optimal state x*_t of frame t, obtained by

x*_t = arg max_{x_t^i} p(y_t | x_t^i). (7)

4 Experiments

Our proposed algorithm is implemented in C++. For each sequence, the location of the target is manually labeled or given by ground truth in the first 5 frames. The codebook size is set to 30, the sliding-window step to 5 pixels, and the number of particles to 800. All parameters are fixed across all sequences.

To evaluate the performance of the proposed method, we compare our method with several state-of-the-art methods, using the source code released by the authors with their finely tuned parameter settings. Firstly we compare our tracking method with the method in [17] on the sequences David outdoor, Lemming and Occlusion1. The results shown in Fig. 3 demonstrate that our method is more robust.

We then compare our tracking algorithm with six state-of-the-art methods on 4 challenging sequences. The six trackers are the fragment (Frag) tracker [10], the multiple instance learning (MIL)


Fig. 3: Tracking results compared with the method in [17]

tracker [11], the L1 tracker [12], the P-N learning (P-N) tracker [13], the visual tracking decomposition (VTD) method [18] and the IVT method [9]. Our method is quantitatively and qualitatively compared with these algorithms. The comparison results are shown in Fig. 4.

The first experiment uses the caviar1 sequence; the frame indexes shown are 107, 122, 132, 189, 201 and 373. In this sequence, the L1 tracker, MIL tracker and IVT method fail to track the target because similar objects surround the target when heavy occlusion occurs. Thanks to the bag-of-features representation, our method locates the target well. Compared with the other trackers, ours is more robust to partial occlusion.

The davidoutdoor sequence is very challenging for visual tracking since the tracked object suffers from occlusion in an outdoor environment. Some resulting frames are given in Fig. 4, with indexes 57, 83, 154, 191, 206 and 251. The other trackers fail because the target undergoes occlusion, pose variation and background clutter. Our tracker successfully keeps track of the target and achieves the best performance throughout the whole sequence.

The third experiment uses the lemming sequence. The main challenges of this sequence are 3D rotation, occlusion and scale change. Some tracking results are shown in Fig. 4, with indexes 150, 203, 235, 316, 341 and 400. The fast motion of the target, which leads to blurred image appearance, is difficult to handle. As the result for frame 235 shows, most tracking algorithms drift or fail to follow the target due to its fast motion. Frames 316 and 341 show the tracking results when the target encounters heavy occlusion; all tracking algorithms except our method drift or fail to follow the target.

The stone sequence is challenging since numerous stones have different shapes and colors. As frames 397, 480 and 521 show, the VTD, Frag and MIL trackers lose the target and the other trackers drift when the target stone is occluded, while the IVT method and our method successfully keep track of the target.

Quantitative evaluation results of the above-mentioned algorithms are shown in Fig. 5. The relative center position error (in pixels) between the ground-truth center and the tracking results is presented. The quantitative comparisons also verify that our tracker is superior to most of the other trackers.

Fig. 4: The qualitative comparison results

Fig. 5: The quantitative comparison results

5 Conclusion

In this paper, we have proposed an effective and robust tracking method based on the bag-of-features model with dense sampling. In our tracker, image features are extracted densely to learn visual codebooks, and the template and candidate regions are represented using the resulting distribution of visual codewords in descriptor space. The proposed appearance model is used for object tracking to account for large appearance changes due to pose variation, occlusion and drift. Extensive experiments on public benchmark sequences have demonstrated that our method can track objects very well under large appearance changes and outperforms state-of-the-art algorithms.

References

[1] C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka, Visual categorization with bags of keypoints, in: Proc. ECCV International Workshop on Statistical Learning in Computer Vision, 2004.

[2] R. Fergus, P. Perona, and A. Zisserman, Object class recognition by unsupervised scale-invariant learning, in: Proc. Computer Vision and Pattern Recognition, 2003, pp. II-264-II-271.

[3] G. Dorko and C. Schmid, Selection of scale-invariant parts for object class recognition, in: Proc. International Conference on Computer Vision, 2003, pp. 634-639.

[4] J. Sivic and A. Zisserman, Video Google: a text retrieval approach to object matching in videos, in: Proc. International Conference on Computer Vision, 2003, pp. 1470-1477.

[5] E. Nowak, F. Jurie, and B. Triggs, Sampling Strategies for Bag-of-Features Image Classification, in: Proc. ECCV, 2006, pp. 490-503.

[6] D. Comaniciu, V. Ramesh, and P. Meer, Real-time tracking of non-rigid objects using mean shift, in: Proc. Computer Vision and Pattern Recognition, 2000, pp. 142-149.

[7] M. Isard and A. Blake, CONDENSATION - Conditional Density Propagation for Visual Tracking, International Journal of Computer Vision, 29 (1998) 5-28.

[8] M. Ozuysal, P. Fua, and V. Lepetit, Fast Keypoint Recognition in Ten Lines of Code, in: Proc. Computer Vision and Pattern Recognition, 2007, pp. 1-8.

[9] D. Ross, J. Lim, R.-S. Lin, and M.-H. Yang, Incremental Learning for Robust Visual Tracking, International Journal of Computer Vision, 77 (2008) 125-141.

[10] A. Adam, E. Rivlin, and I. Shimshoni, Robust Fragments-based Tracking using the Integral Histogram, in: Proc. Computer Vision and Pattern Recognition, 2006, pp. 798-805.

[11] B. Babenko, M. Yang, and S. Belongie, Visual Tracking with Online Multiple Instance Learning, Pattern Analysis and Machine Intelligence, 33 (2011) 1619-1632.

[12] X. Mei and H. Ling, Robust Visual Tracking and Vehicle Classification via Sparse Representation, Pattern Analysis and Machine Intelligence, 33 (2011) 2259-2272.

[13] Z. Kalal, K. Mikolajczyk, and J. Matas, Tracking-Learning-Detection, Pattern Analysis and Machine Intelligence, 34 (2012) 1409-1422.

[14] S. Avidan, Ensemble Tracking, Pattern Analysis and Machine Intelligence, 29 (2007) 261-271.

[15] H. Grabner and H. Bischof, On-line Boosting and Vision, in: Proc. Computer Vision and Pattern Recognition, 2006, pp. 260-267.

[16] H. Grabner, C. Leistner, and H. Bischof, Semi-supervised On-Line Boosting for Robust Tracking, in: Proc. ECCV, 2008, pp. 234-247.

[17] Y. Fan, L. Huchuan, and C. Yen-Wei, Bag of Features Tracking, in: Proc. International Conference on Pattern Recognition, 2010, pp. 153-156.

[18] K. Junseok and L. Kyoung Mu, Visual tracking decomposition, in: Proc. Computer Vision and Pattern Recognition, 2010, pp. 1269-1276.