DESCRIPTOR-BASED ADAPTIVE TRACKING-BY-DETECTION FOR VISUAL SENSOR NETWORKS

Berner Panti¹, Pedro Monteiro², Fernando Pereira³, João Ascenso⁴
Instituto Superior Técnico – Instituto de Telecomunicações

[email protected], [email protected], [email protected], [email protected]

ABSTRACT

Local descriptors represent a powerful tool, which is exploited in several applications such as visual search, object recognition and visual tracking. Real-valued visual descriptors such as SIFT and SURF achieve state-of-the-art accuracy for a large set of visual analysis tasks. However, such algorithms are demanding in terms of computational capabilities and bandwidth, being unsuitable for scenarios where resources are constrained. In this context, binary descriptors provide an efficient alternative to real-valued descriptors, due to their low computational complexity, limited memory footprint and fast matching algorithms. In this paper, binary descriptors are used to perform visual tracking of an object over time. The proposed visual tracker performs descriptor matching between consecutive frames, applies filtering techniques to remove undesirable outliers and employs a suitable model to characterize the object appearance. In addition, techniques to code and transmit these description streams are employed, thus reducing the amount of data that must be transmitted to perform accurate object tracking. The efficiency of the proposed visual tracker is evaluated in terms of rate-accuracy, i.e. using the bitrate associated with the compressed binary descriptors and a quantitative metric to assess the accuracy of the visual tracker.

Index Terms— object tracking, binary descriptors, feature coding, Hough transform, tracking-by-detection.

1. INTRODUCTION

Visual sensor networks (VSN) benefit several applications, such as object recognition, traffic/habitat monitoring and visual surveillance [1]. In a VSN, a large number of sensing (camera) nodes are able to acquire and process image data locally, collaborate with other sensing nodes and provide a description of the captured events. Typically, wireless battery-operated nodes are used to sense the visual scene and have severe constraints in terms of energy, bandwidth and processing capabilities [1]. A possible approach to transmit visual data in this scenario requires the sensing nodes to acquire the visual data, and to compress and transmit a pixel-domain representation to a central node where further analysis is performed; this paradigm is called compress-then-analyze (CTA). However, the coding and transmission of a pixel-level visual scene representation must be avoided, due to the high energy and bandwidth resources needed.

An alternative and promising approach is to extract, at the sensing nodes, compact visual descriptors that are coded to meet the bandwidth and power requirements of the underlying network and devices. This approach is associated with a novel paradigm called analyze-then-compress (ATC) [2], where visual content is acquired and analyzed directly in geographically distributed sensing nodes (smart cameras), which collaborate to perform efficient visual analysis, hence reversing the order in which the compression and analysis tasks are applied in the CTA paradigm. In ATC, visual scene analysis tasks, such as background modeling, object recognition, visual tracking and activity recognition, are performed on a succinct representation of the image, without access to a pixel-level visual scene representation.

In this paper, the target is to efficiently perform the challenging task of object tracking in the context of a visual sensor network. Thus, it is proposed to extract computationally efficient descriptors at the sensing node and to compute a suitably compact and robust representation that can be efficiently transmitted to a central (sink) node. At the sink, object detection is performed and a visual tracker is employed to successively identify the object location accurately. Considering that the accurate estimation of the object location and size in a video frame requires a large number of descriptors, it is essential to employ coding schemes that represent the required information with minimal bitrate while being robust to changes in illumination, viewpoint and occlusions. In addition, inspired by recent advances in tracking-by-detection [3], an ATC visual tracking algorithm that meets the VSN computational and bandwidth constraints is proposed. With an ATC visual tracker, a rate-accuracy tradeoff exists, where the highest tracking accuracy should be obtained for a certain cost (bitrate-wise) of the descriptor representation. This is analogous to video coding, where the target distortion must be minimized considering a known rate constraint (i.e. the available channel bandwidth). The evaluation of the proposed ATC visual tracker shows that significant rate savings can be obtained, notably up to 5 times in comparison to a CTA solution where a pixel-level representation is coded and transmitted.

Considering that the goal is to develop a distributed visual tracker, real-valued local descriptors such as SIFT [4] and SURF [5] may seem an appropriate choice. However, since these tools are demanding in terms of computational capabilities and bandwidth, they are unsuitable for scenarios where resources are severely constrained. In the ATC scenario, low complexity local feature detection and extraction algorithms are better suited and should be employed to match the computational constraints of visual sensor networks. In this context, binary descriptors [6-8] provide an efficient alternative to real-valued descriptors, due to their low computational complexity, limited memory footprint and fast matching algorithms.

The rest of the paper is organized as follows. Section 2 reviews the related work, while Section 3 describes the key methods for binary descriptor compression. Then, Sections 4 and 5 describe the main components of the proposed tracking system, namely the coding of binary descriptors and the tracking-by-detection framework, respectively. Section 6 reports the experimental results and, finally, Section 7 presents the conclusions and future work.

2. RELATED WORK

Many algorithms and systems have been proposed in the field of visual tracking in the past decades [9]. In visual tracking, the objective is to accurately localize an object in every frame of the video sequence while exploiting only the information about the object position and size in the first frame, e.g. acquired with a suitable object recognition technique. Visual tracking plays a crucial role in several applications such as video surveillance, augmented reality, etc. Most state-of-the-art visual trackers either rely on intensity, texture information or simple color space transformations [9].

Recent advances in visual feature representation, object detection and recognition have led to a new and promising framework called tracking-by-detection, where object motion is estimated using frame-by-frame feature matching and geometric consistency verification [3]. In a tracking-by-detection framework, the coherency of the object characteristics along time is exploited by detecting interest (salient) points in each frame, computing local descriptors (high-dimensional feature vectors) and employing suitable filtering and database updating modules [10]. Typically, the descriptors obtained are used to model the appearance of the object for which the trajectory is calculated. Using such an approach and following the ATC paradigm, a distributed visual tracker can be developed, where the sensing nodes perform image acquisition and feature extraction, and the central node performs tracking using the received data and a database of objects. This is rather different from conventional centralized tracking schemes, where the pixel representation is often required to perform computationally expensive feature extraction using online training methods to select the best features to track, i.e. the outcome of the tracking itself influences which features should be tracked. On the contrary, distributed tracking does not require any learning algorithm at the sensing node, which means that the transmitted descriptors can also be used for object recognition, activity classification, etc.

In [11], a comprehensive evaluation of detectors and descriptors for visual tracking is performed, including SIFT, SURF and randomized trees. However, binary descriptors were not considered, nor was the compactness (rate) of the evaluated representations. In [12], a method is proposed that performs simultaneous tracking and recognition using a low complexity descriptor which can be computed at reasonable frame rates on a mobile device. To perform tracking, the FAST interest point detector is applied on an image pyramid and a Histogram of Gradients-type descriptor is employed; methods to achieve rotation invariance and an affine motion model are also proposed. Another popular approach is to model the object as a set of keypoints and associated descriptors, which are matched independently for every image of the video sequence. Robust estimation methods such as RANSAC [13] are often used to determine geometrically consistent sets of matches, which are then used to detect the presence and the transformation of an object.

SURFTrac [14], a widely popular tracking-by-detection method, proposes to track objects using interest point matching and updating, while still continuously extracting descriptors for recognition. The main novelty of SURFTrac is the incremental interest point detection method, which uses a motion model (homography) to indicate the regions where keypoints should be extracted. However, this method can only be employed in a centralized visual tracking setting, since the sensing nodes must perform highly computationally expensive operations and the feature detection technique depends on the object model (database). In [15], a linear structured SVM is used to perform online learning and construct an appearance model that exploits binary descriptors and their efficiency. The tracking-by-detection framework proposed in [15] combines the matching, i.e. the computation of correspondences between a model and any input image, with the estimation of the object geometric transformation; a successful detection is used to perform an online learning step that updates the object model for the next frames. However, this framework was only applied to track the 3D camera pose in a SLAM system and depends on the number of inliers obtained in the correspondence matching, which can be quite low for smaller objects or under rapidly changing conditions.

In all the previously surveyed work, the compactness of the feature representation was never considered, nor were any coding techniques proposed to obtain lower rate representations optimized for visual tracking. Techniques to code state-of-the-art binary descriptors, such as BRIEF [6], BRISK [7] and ORB [8], have been proposed in the past. In previous work [16,17], rate-accuracy trade-offs were studied for the case of binary descriptors, adopting different coding schemes, while considering an object recognition scenario. It was also shown that a performance comparable to state-of-the-art real-valued descriptors such as SIFT [4] and SURF [5] can be obtained at a much lower bitrate and computational cost. For video, a sequence of sets of descriptors can also be coded with appropriate techniques. In [18], Intra and Inter coding techniques were proposed to code the SIFT and SURF descriptors extracted from video sequences; in addition, a coding mode decision based on rate-distortion optimization was proposed. However, this framework was mainly proposed for content-based video retrieval and homography estimation, notably targeting camera calibration and reconstruction. Leveraging the results achieved in the past, this paper addresses the problem of object tracking in the context of an ATC scenario. Thus, binary descriptor coding must be performed in such a way that high tracking performance can be obtained with a much more compact representation.

3. BINARY DESCRIPTOR COMPRESSION

Following the ATC paradigm, binary descriptors such as BRIEF, BRISK and ORB provide compact representations that are easy to compute. However, binary descriptors must be further compressed to save bandwidth in a visual sensor network scenario, possibly using a lossy representation of the features. In this sense, a rate-distortion tradeoff emerges, since lower rate representations come at a certain cost in tracking accuracy. Several methods to code binary descriptors that exploit the Inter-descriptor [16] and Intra-descriptor redundancies [17] have been proposed in the past. To exploit the Inter correlation between sets of descriptors extracted from successive frames of a video sequence, it is necessary to select, for each descriptor to be coded in the current frame, an already decoded descriptor (the reference) from previous frames; in this case, the residual between the reference descriptor and the current descriptor is coded. However, this method increases the computational complexity at the sensing node, as it involves an expensive search process (similar to motion estimation). In addition, since most keypoint detection methods lead to keypoints that are not very stable over time, the temporal correlation between descriptors may be surprisingly low [19].
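To make the Inter coding idea concrete, the following is a minimal sketch (an illustration under the stated assumptions, not the exact scheme of [16] or [19]): for each descriptor of the current frame, the encoder searches the previously decoded set for the reference with minimum Hamming distance and produces the XOR residual, which is sparse when the temporal correlation is high.

```python
import numpy as np

def inter_code(curr, prev):
    """Toy Inter-descriptor coding: 'curr' and 'prev' are arrays of binary
    descriptors (one row per descriptor, elements in {0, 1}). For each
    current descriptor, the closest previously decoded descriptor (minimum
    Hamming distance) is chosen as reference; the reference index and the
    XOR residual would then be entropy coded."""
    refs, residuals = [], []
    for d in curr:
        dists = np.count_nonzero(prev != d, axis=1)   # Hamming distance search
        r = int(np.argmin(dists))                     # best reference index
        refs.append(r)
        residuals.append(np.bitwise_xor(prev[r], d))  # sparse if descriptors are similar
    return refs, np.array(residuals)
```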

Alternatively, Intra-descriptor coding methods maximize the correlation between the descriptor elements (DEs) by finding a suitable coding order for them; each DE corresponds to the outcome of an intensity test inside a patch, and DEs are correlated according to the patch characteristics. Thus, each DE can be predicted from the previous DEs and the residual DE is entropy coded; an offline training step computes the 'optimal' DE order [20], which guarantees some correlation between neighboring DEs. However, this method is lossless, meaning that the decoded descriptor has no errors with respect to the original descriptor, and it thus achieves only about 30% bitrate savings. To obtain higher compression ratios, it is necessary to adopt a lossy coding scheme that selects the most discriminative DEs and discards the remaining ones [21].
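As a minimal sketch of this Intra approach (the DE order is a hypothetical permutation standing in for the one learned offline in [20]):

```python
import numpy as np

def intra_code(desc, order):
    """Toy Intra-descriptor coding: reorder the DEs with an offline-learned
    order so that correlated DEs become neighbors, then predict each DE from
    the previous one, i.e. R_i = D_{i-1} XOR D_i; the residual would then be
    entropy coded."""
    d = desc[order]                               # apply the learned DE order
    residual = d.copy()
    residual[1:] = np.bitwise_xor(d[:-1], d[1:])  # first DE is sent as-is
    return residual

def intra_decode(residual, order):
    """Invert the prediction and the reordering (lossless)."""
    d = residual.copy()
    for i in range(1, len(d)):
        d[i] = d[i - 1] ^ residual[i]             # undo the XOR prediction
    return d[np.argsort(order)]                   # undo the reordering
```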

4. CODING BINARY DESCRIPTORS FOR OBJECT TRACKING

In the ATC scenario, visual content is acquired at the sensing node, a feature representation is computed for each frame, and the descriptors are coded and transmitted while the pixel representation is discarded. The solution adopted in this paper combines the two Intra coding solutions [20][21], briefly described in Section 3, to obtain the best performance. Thus, it is first necessary to perform an offline processing step to calculate the N most discriminative DEs using the solution in [21]. Then, the best order for the N selected DEs is computed, to maximize the correlation between neighboring DEs [20]. This offline step is performed for some predefined values of N. After obtaining the list of DEs for each value of N and their corresponding order, the following steps are performed at the sensing node:

1. Binary Feature Detection and Extraction: First, keypoints are detected using any feature detector available; after, the descriptor of each keypoint is computed using any state-of-the-art binary descriptor. In this case, the BRISK-BRISK detector and extractor pair was adopted. However, to further accelerate the feature detection process, keypoints are detected without the multi-scale approach typical of BRISK, i.e. the number of octaves is set to 0; this speeds up detection since the scale-space pyramid creation and search procedure are avoided.

2. Feature Selection: Next, keypoints are ranked according to their FAST score, a reliable measure of saliency, to keep only M keypoints, thus coding and transmitting only the most discriminative descriptors. In addition, the sink node provides the sender with a bounding box that limits the region where descriptors should be extracted, i.e. a rectangular region where the object should appear. This way, it is not necessary to extract features for a significant part of the background, thus avoiding the coding and transmission of descriptors that are not relevant to track the object being followed (steps 1 and 2 are sketched in code after this list).

3. Quantization: Here, the N most discriminative DEs of each descriptor are selected, determined using the importance computed offline with an asymmetric pairwise boosting technique [21].

4. Prediction: The DEs are sorted according to the prediction order computed offline [20]. Then, each $i$-th DE is predicted from the previous DEs, thus obtaining the residual $R_i = D_{i-1} \oplus D_i$.

5. Entropy Coding: Finally, binary arithmetic coding is applied to the residual $R$ to construct the final bitstream.
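As announced above, a minimal sketch of steps 1 and 2 follows, assuming OpenCV's BRISK implementation (function name and parameter values other than those stated in the text are illustrative):

```python
import cv2
import numpy as np

def sense_features(frame_gray, roi, max_kpts=300):
    """Sketch of steps 1-2 at the sensing node: single-scale BRISK detection
    (octaves=0), restricted to the ROI fed back by the sink, followed by
    ranking on the FAST score and keeping the top M keypoints."""
    x, y, w, h = roi
    mask = np.zeros(frame_gray.shape, dtype=np.uint8)
    mask[y:y + h, x:x + w] = 255                     # detect only inside the ROI
    brisk = cv2.BRISK_create(thresh=30, octaves=0)   # octaves=0: no scale pyramid
    kpts, desc = brisk.detectAndCompute(frame_gray, mask)
    if desc is None:
        return [], None
    ranked = sorted(zip(kpts, desc), key=lambda kd: kd[0].response, reverse=True)
    kpts, desc = zip(*ranked[:max_kpts])             # most salient keypoints first
    return list(kpts), np.array(desc)
```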

With this coding framework, it is possible to obtain significant descriptor rate savings, even if at the cost of somewhat lower discriminative power. The parameter N plays the usual role of a quantization step: when higher values are selected, more accurate descriptors are obtained, and vice-versa. Also, the keypoint locations are compressed at the sensing node using a lossy technique based on histogram maps and counts [22]. Keypoint location data is essential to perform object tracking at the sink node and must be transmitted, although it is not very demanding in terms of bandwidth.
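A toy version of such grid-based location coding, in the spirit of [22] (the block size and the exact map/count coding are assumptions made here for illustration):

```python
import numpy as np

def location_histogram(kpts, img_w, img_h, block=8):
    """Toy location coding: keypoint coordinates are quantized to a block
    grid; the sparse occupancy map and the per-block counts are what would
    be entropy coded, instead of the raw (x, y) pairs."""
    rows, cols = img_h // block + 1, img_w // block + 1
    hist = np.zeros((rows, cols), dtype=np.uint16)
    for kp in kpts:
        x, y = kp.pt                          # OpenCV keypoint coordinates
        hist[int(y) // block, int(x) // block] += 1
    return hist                               # map + counts -> entropy coder
```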

5. ANALYZE-THEN-COMPRESS OBJECT TRACKER

Local descriptors bring several advantages to visual tracking, such as the possibility to track an object even when it is occluded or when it disappears from the camera view for a short period of time. The proposed tracking technique localizes the object in a video sequence given only the bounding box defining the object position and size in the first frame. Then, new bounding boxes surrounding the object are estimated with the proposed visual tracking algorithm. With this information, a region of interest is defined and sent back to the sensing node by means of a feedback channel.

5.1. Region of Interest Computation

The objective of the region of interest (ROI) computation is to limit the area in which keypoints are detected and descriptors extracted. By constraining the detection area, the matching and outlier filtering processes may become more reliable and fewer descriptors need to be coded and transmitted, thus saving rate. To define the ROI, a single bounding box is used, which means that only two spatial coordinates need to be transmitted from the sink to the sensing node for each frame. To compute the ROI, the object location in the previous frame is used: when a correct detection has been performed in frame i, the bounding box BB(i+1) for frame i+1 is equal to BB(i) enlarged by a constant offset 𝜆, i.e. height and width increase, but the center is maintained. In case the detector has failed in the previous k frames (criterion defined in Section 5.2), BB(i+1) will be enlarged by (𝑘+1)𝜆. This way, it is possible to increase the detection rate in case of sporadic detection errors. Naturally, the bounding box estimated for frame i+1 cannot exceed the image borders, i.e. it is limited by the spatial resolution of the video sequence.
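A minimal sketch of this update rule, with bounding boxes as (x, y, width, height) tuples and illustrative names:

```python
def next_roi(bb, failures, lam, img_w, img_h):
    """Enlarge the previous bounding box BB(i) to obtain the ROI BB(i+1):
    by lambda after a correct detection (failures == 0), by (k+1)*lambda
    after k consecutive failures, clamped to the image borders."""
    x, y, w, h = bb
    off = (failures + 1) * lam                  # constant offset, grown on failure
    x0, y0 = max(0, x - off), max(0, y - off)   # keep the same center
    x1 = min(img_w, x + w + off)
    y1 = min(img_h, y + h + off)
    return (x0, y0, x1 - x0, y1 - y0)
```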

5.2. Proposed Tracking-by-Detection Solution

Figure 1 shows the architecture of the proposed tracking-by-detection solution, detailing the key modules. Following ATC, only the decoded binary descriptors and corresponding keypoint locations are used as input after the bitstream is decoded. In this case, the scale and orientation of each keypoint are not available since, to avoid spending additional rate and computational power, the sensing node does not compute or transmit this information. In contrast to conventional tracking solutions that assume some temporal coherence, a detection algorithm is continuously applied to each frame of the video. Thus, a learning procedure is proposed that models the object appearance and its variations (pose, scale, illumination) using a positive database (P-DB), and the background using a negative database (N-DB). Note that each database image corresponds to a representation based on a set of descriptors and not to the usual pixel-level representation. The database is dynamically built by adding descriptors of the target object in different poses, together with the bounding box information; the database creation and updating procedure is described in Section 5.3. The main advantage of this approach is that it is less influenced by abrupt object motion, changes of appearance and low frame rate video.

Fig. 1. Proposed tracking-by-detection solution: the ATC bitstream is decoded (keypoints and descriptors) and goes through feature matching and correspondence filtering, negative database outlier filtering, object detection, scale estimation and object center estimation, with online updating of the positive (P-DB) and negative (N-DB) databases.

The proposed tracking-by-detection algorithm proceeds as follows:

1. Feature matching and correspondence filtering: First, pair-wise matching between the descriptors of each query (received) frame and the descriptors of each frame in the positive database (P-DB) is performed. The correspondences between descriptors are found using K-nearest neighbor (k-NN) search with K=2 and the Hamming distance metric. After, each query frame descriptor is filtered using a ratio test [4] that compares the distances of the best and the 2nd best correspondences found for every keypoint of the query frame; the best correspondence is removed if this ratio is above a threshold of 0.7, as proposed in [4].

2. Negative database outlier filtering: Next, matching against a database containing descriptors belonging to the background (negative database) is applied to reduce the probability of wrong matches. The matching method is again k-NN search with a ratio test, but with a threshold equal to 0.8: a query frame descriptor is discarded when a correspondence is found between it and any descriptor in the negative database. Using a larger ratio test threshold allows to discard more background descriptors, but it may also discard descriptors belonging to the object, which is undesirable. A sketch of steps 1 and 2 follows below.

3. Object detection: Next, the Generalized Hough Transform (GHT) is applied to remove wrong descriptor matches and to detect the object being followed, using the images available in the positive database.
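As referenced in step 2, a minimal sketch of steps 1 and 2 follows, assuming OpenCV's brute-force Hamming matcher (the function name and the single-image database layout are illustrative simplifications):

```python
import cv2

def match_and_filter(query_desc, pdb_desc, ndb_desc):
    """2-NN Hamming matching of the query descriptors against one P-DB image
    with the 0.7 ratio test (step 1), then removal of query descriptors that
    also match the negative database with the 0.8 ratio test (step 2)."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)
    pairs = [p for p in bf.knnMatch(query_desc, pdb_desc, k=2) if len(p) == 2]
    matches = [m for m, n in pairs if m.distance < 0.7 * n.distance]  # ratio test [4]
    neg = [p for p in bf.knnMatch(query_desc, ndb_desc, k=2) if len(p) == 2]
    background = {m.queryIdx for m, n in neg if m.distance < 0.8 * n.distance}
    return [m for m in matches if m.queryIdx not in background]       # inlier candidates
```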

To employ the GHT, it is necessary to have an accumulator array, i.e. a discrete representation of the parameter space, which in this case corresponds to the object center (x, y) 2D coordinates; the accumulator corresponds to a square grid (of 4×4-pixel cells) overlaid on the image. Then, the object center is reliably estimated by accumulating votes for each grid location; thus, each descriptor match corresponds to a vote in the 2D parameter space. In detail, the GHT consists of three main steps: i) for every match between the query and each of the database images, a candidate object center is estimated using the object center and keypoint locations stored in the database: from a correspondence between two keypoints, one in the database image and another in the query image, a vector r representing the displacement between the object center and the keypoint location in the database image is computed and applied to the keypoint location in the query image to obtain a candidate object center; ii) one vote is assigned to the grid cell in which the candidate object center lies; iii) the grid cell with the most votes is assumed to be the object center, and all matches (keypoint correspondences) that have contributed to this object center are considered inliers, while the remaining matches are discarded. Since each match contributes one vote for the object center, wrong matches that do not vote consistently are removed; moreover, as long as there are enough matches in agreement on an accurate object center, this solution is robust to missing matches.

4. Scale (size) estimation: Since it is not possible to rely on any scale estimation done locally for each keypoint, a novel scale estimation method is proposed here, which estimates the scale (or size) of the object using just the keypoint locations. The scale $S_q$ is estimated as the ratio between the geometric distance of pairs of matching keypoints in the query image and the distance of the corresponding keypoint pairs in the database image. This process is repeated for all the database images matching the query and averaged using:

$$S_q = \frac{1}{N}\sum_{n=1}^{N}\frac{1}{J}\sum_{k=1}^{J}\frac{\left\|P_k^{q}-P_{t(k)}^{q}\right\|}{\left\|P_k^{db}-P_{t(k)}^{db}\right\|} \qquad (1)$$

where N is the number of matching database images, J is the number of pairs of matching keypoints, and $P_k^{q}$ and $P_k^{db}$ are corresponding keypoints in the query and database images, respectively. For the distance computation, each keypoint in the database image is coupled with its most distant keypoint, and this most distant partner is denoted by the index $t(k)$.

5. Object center estimation: The previous steps are performed for all the positive database images (i.e. the descriptors that characterize the object) and allow to obtain: i) the number of inliers (matches) for each database image; and ii) an estimate of the scale of the target in the query frame. Then, the image with the highest number of inliers is selected and the GHT step is performed again just for the selected image, this time considering the scale estimated in step 4; the vector r is multiplied by the scale $S_q$, which allows to obtain a more accurate object center. Finally, a tracking failure is declared when the number of inliers estimated by the GHT algorithm is less than 8 for all the database images; experimentally, it was found that below this value the object center cannot be robustly estimated, and it is then expected that a new frame arrives for which the detection process can be performed more reliably. When correct detection is achieved, the output of the tracking-by-detection algorithm is the object center for the current frame, which can be used together with the scale to obtain a bounding box surrounding the object being followed.
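A minimal sketch of the GHT voting (steps i-iii) and of the scale estimate of equation (1); the match representation (query point, database point, database object center) and the helper names are assumptions made here for illustration:

```python
import numpy as np

def ght_center(matches, grid=4):
    """GHT voting: each match casts a vote for a candidate object center,
    obtained by displacing the query keypoint by the vector r (database
    object center minus database keypoint); votes are accumulated on a grid
    of 4x4-pixel cells and the matches voting for the winning cell are the
    inliers. 'matches' holds (query_pt, db_pt, db_center) tuples."""
    votes = {}
    for q, p, c in matches:
        center = (q[0] + c[0] - p[0], q[1] + c[1] - p[1])     # apply vector r
        cell = (int(center[0]) // grid, int(center[1]) // grid)
        votes.setdefault(cell, []).append((q, p, c))
    best = max(votes, key=lambda cell: len(votes[cell]))      # most-voted cell
    return ((best[0] + 0.5) * grid, (best[1] + 0.5) * grid), votes[best]

def scale_estimate(query_pts, db_pts):
    """Equation (1) for a single database image (at least two matched
    keypoints assumed): each database keypoint is paired with its most
    distant database keypoint, and the scale is the average ratio between
    the query and database pair distances."""
    q, p = np.asarray(query_pts, float), np.asarray(db_pts, float)
    dists = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=2)
    t = dists.argmax(axis=1)                   # most distant partner t(k)
    num = np.linalg.norm(q - q[t], axis=1)     # query pair distances
    den = np.linalg.norm(p - p[t], axis=1)     # database pair distances
    return float(np.mean(num / den))
```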

5.3. Online Database Updating

To perform visual tracking, there are several ways to construct the database: i) generative approaches, which just model the object being followed and search for similar regions; and ii) discriminative approaches, which attempt to differentiate the object from the background. Here, the latter approach is followed, which means that two models are needed: one to characterize the object and another to characterize the surrounding background. The proposed solution is able to efficiently learn the object model online and adapt it to a particular environment.

First, to handle the changes that an object may face along time, it is necessary to characterize the object appearance by storing in the positive database (P-DB) some decoded descriptors belonging to the target object. A database image corresponds simply to a set of decoded descriptors acquired at the same time instant, with the entire database representing the object captured in different poses, thus providing robustness to viewpoint changes and occlusions. Note also that, following the ATC paradigm, the pixel-level representation of the object is not available at the sink node, and thus it is not possible to re-compute descriptors for any of the database images. This database is initialized with the set of descriptors obtained from the first image where the object is recognized, typically the first image of the video sequence. In the database, similar images (thus, similar sets of descriptors) should be avoided, since they would saturate the P-DB, slowing down the tracking procedure. Thus, to accurately locate the object and estimate its size, the P-DB is only updated when significant changes of the object appearance occur. To determine when to trigger the database update, a recognition threshold T is defined as the minimum number of inliers necessary to locate the target in the frame. When the number of inliers is between T and 2T, the keypoints inside the estimated bounding box and the corresponding descriptors are selected and added to the P-DB, growing its size. To prevent adding potential background or occluding keypoints while still obtaining a good object representation, only the descriptors inside the bounding box are added; the remaining ones are filtered out.

In addition, a negative database (N-DB) is used to model the surrounding background and to filter out any descriptors of the current image that belong to the background. For the N-DB, a similar online updating scheme is used. In this case, to avoid classifying as negative some descriptors belonging to the target object, the background keypoints are added to the negative database only when a correct bounding box estimation is obtained. Thus, the N-DB is updated when the number of inliers is greater than 2T, which, considering the P-DB criterion, means that descriptors can only be added to one of the databases. All the transmitted descriptors outside the object bounding box are background descriptors; notice that these descriptors do not cover the entire image, since the sensing node only transmits descriptors near the previous object location (see Section 5.1). To keep the databases at a reasonable size, a pruning approach is applied to both databases to remove old descriptors: when the P-DB has more than 25 images (sets of descriptors), the oldest object representation is removed. The same reasoning is applied to the N-DB.
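A minimal sketch of this updating rule (a hypothetical helper operating on OpenCV-style keypoints, with P-DB and N-DB as lists of descriptor sets):

```python
def update_databases(n_inliers, T, kpts, descs, bb, p_db, n_db, max_images=25):
    """Online updating: with T <= inliers < 2T the descriptors inside the
    bounding box are added to P-DB (new object appearance); with
    inliers > 2T the descriptors outside it are added to N-DB (reliable
    background sample); both databases are pruned to 25 images, oldest first."""
    x, y, w, h = bb
    in_bb = lambda k: x <= k.pt[0] < x + w and y <= k.pt[1] < y + h
    inside = [d for k, d in zip(kpts, descs) if in_bb(k)]
    outside = [d for k, d in zip(kpts, descs) if not in_bb(k)]
    if T <= n_inliers < 2 * T and inside:
        p_db.append(inside)                  # significant appearance change
    elif n_inliers > 2 * T and outside:
        n_db.append(outside)                 # bounding box deemed reliable
    for db in (p_db, n_db):
        while len(db) > max_images:
            db.pop(0)                        # drop the oldest representation
```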

6. PERFORMANCE EVALUATION

In this section, the experimental evaluation of the proposed tracking-by-detection solution (Sections 4 and 5) is presented in terms of tracking accuracy and rate-accuracy tradeoff. The tracking accuracy results evaluate the algorithm performance in a long-term object tracking scenario, while the rate-accuracy results show the tradeoff between the compression rate of the binary descriptors and the tracking accuracy.

6.1. Test Conditions and Evaluation Metrics

The proposed tracking algorithm is evaluated on several video sequences extracted from public and well-known datasets, such as MILTrack [23], BoBoT [24] and TLD [3]. These sequences have different challenging attributes, resolutions and object sizes. Table 1 reports the selected test sequences and their characteristics, namely the number of frames, the number of frames in which the target is visible, the spatial resolution and the main sequence characteristic. All video sequences were processed at full frame rate, i.e. 25 fps.

Table 1. Test video sequences and attributes.

Sequence              Nº Frames   Visible Target Fr.   Resolution   Main Characteristic
Juice (BoBoT)         404         404                  320×240      Fast motion
Cup (BoBoT)           629         629                  640×480      Low resolution
Fish (MILTrack)       476         476                  320×240      Illumination change
Dog1 (MILTrack)       1353        1350                 320×240      Scale/rotation change
FaceOcc2 (MILTrack)   815         815                  640×480      Partial occlusion
Car (TLD)             945         860                  640×480      Full occlusion

To apply our tracking-by-detection scheme, some of the video sequences with low resolutions, such as 320×240, were up-sampled by a factor of two to extract a higher number of binary descriptors. The sequence lengths range from 404 to 1353 frames, and the system is initialized only with the information of the target object in the first frame, i.e. the ideal bounding box, set as the first entry of the ground truth. The P-DB is updated dynamically and can store up to 25 images of the target. The performance of the proposed ATC tracking solution is compared to the performance of the CTA solution, where the pixel-level representation is coded and transmitted to the sink node and the same tracking algorithm is applied. This is, to the best of our knowledge, the first time that rate-accuracy results are presented for a distributed tracking scenario, showing the impact of the image quality on the tracking performance (CTA) and comparing it with a solution where features are coded and transmitted (ATC).

The performance evaluation framework resorts to simple metrics, widely used in the tracking literature, where the overlap between the estimated and the ground truth bounding boxes is computed. In this case, the evaluation procedure of the Visual Object Tracking 2014 challenge (VOT2014) [25], which targets single-object, single-camera tracking using the same initialization as our tracking system, has been adopted. Thus, to evaluate the proposed tracking solution, the following three metrics were used:

1. Bounding Box Overlap (BBO): BBO is computed from the estimated and ground truth bounding boxes as:

$$BBO = \frac{I_{area}}{GT_{area} + BB_{area} - I_{area}} \qquad (2)$$

where $GT_{area}$ and $BB_{area}$ are the areas of the bounding boxes given by the ground truth file and by the proposed tracking solution, respectively, and $I_{area}$ is the intersection area between the two bounding boxes. BBO can be intuitively described as the intersection over the union of the estimated and true bounding boxes: 0 means no overlap and 1 means that both bounding boxes are identical.

2. Euclidean distance between object centers (EDOC): EDOC is computed as the geometric distance between the estimated object center and the ground truth object center.

3. Failure rate (FR): FR is computed as the fraction of frames, among those where the target is visible, in which a correct localization of the object was obtained (so that FR = 1 corresponds to no failures). The most common threshold [10] for defining a correct detection is 0.25, i.e. when BBO > 0.25 tracking is considered successful.
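A minimal sketch of the BBO (equation (2)) and EDOC computations, with bounding boxes as (x, y, width, height) tuples:

```python
import math

def bbo(gt, bb):
    """Equation (2): intersection over union of the ground truth and the
    estimated bounding boxes."""
    ix = max(0, min(gt[0] + gt[2], bb[0] + bb[2]) - max(gt[0], bb[0]))
    iy = max(0, min(gt[1] + gt[3], bb[1] + bb[3]) - max(gt[1], bb[1]))
    inter = ix * iy                          # intersection area
    return inter / (gt[2] * gt[3] + bb[2] * bb[3] - inter)

def edoc(gt_center, bb_center):
    """Euclidean distance between the estimated and ground truth centers."""
    return math.hypot(gt_center[0] - bb_center[0], gt_center[1] - bb_center[1])
```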

While the first two metrics, BBO and EDOC, represent the precision of the tracking system, the robustness is represented by the failure rate. To apply these metrics to a video sequence, the BBO and EDOC values are averaged over all frames where detection was successful, which in many cases corresponds to all frames in which the target is visible (i.e. FR close to 1).

Nine rate-accuracy points have been defined for both the CTA and ATC tests. In CTA, the different bitrates were obtained by means of JPEG compression with quality factor (QF) values ranging from 10 to 100. Notice that the complexity must be kept at acceptable levels for a visual sensing node scenario with strict energy requirements and thus, more complex video codecs cannot be applied. In ATC, to obtain the several rate points, the number of DEs selected for transmission was varied from 8 to 512.
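As an illustration of how such CTA rate points can be generated (the exact QF set is an assumption, since the paper only states the 10-100 range), each frame is JPEG compressed at several quality factors using OpenCV and the bitstream size gives the rate:

```python
import cv2

def cta_rate_points(frame, qfs=(10, 20, 30, 40, 50, 60, 70, 80, 100)):
    """Sketch of CTA rate point generation: JPEG compress the frame at nine
    illustrative quality factors and report the rate in kbit/frame."""
    points = []
    for qf in qfs:
        ok, buf = cv2.imencode('.jpg', frame, [cv2.IMWRITE_JPEG_QUALITY, qf])
        points.append((qf, 8 * len(buf) / 1000.0))   # bitstream size -> kbit
    return points
```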

6.2. Object Tracking Assessment

The tracking performance results are reported in Table 2. These results have been obtained with the BRISK feature detector and descriptor, but other binary descriptors may be used. The maximum number of descriptors per frame was set to 300, but a smaller number of descriptors was sometimes extracted, notably when the number of salient points in the video frame drops (e.g. due to a low amount of texture). Nevertheless, it is possible to observe that the average overlap for all sequences is above 60%, with 100% correct detections (FR = 1.0) for all sequences except Dog1, where the bounding box was incorrectly estimated in a few frames, due to the target object being extremely close to the camera.

Table 2. Tracking performance for FR (failure rate), BBO (bounding box overlap) and EDOC (Euclidean distance).

Sequence    FR     BBO [%]   EDOC [pixels]
Juice       1.0    74.61     5.46
Cup         1.0    63.4      9.15
Fish        1.0    81.45     4.19
Dog1        0.97   72.39     10.10
FaceOcc2    1.0    73.25     10.12
Car         1.0    65.75     3.23

In addition, the tracking rate-accuracy performance obtained with different feature detectors and extractors is shown in Figure 2 for the CTA paradigm. To avoid showing three separate plots for the FR, BBO and EDOC metrics, accuracy is defined as the BBO averaged over the number of frames of a sequence in which the target is visible, thus accounting for both tracking drifts and failures. Moreover, the accuracy obtained for all the sequences in Table 1 is averaged to obtain a more compact performance assessment. Since, in a CTA scenario, descriptors are extracted at the sink from a decoded image (thus, no rate is associated with the descriptors), any descriptor can be used, including real-valued descriptors such as SURF. Naturally, this CTA tracking performance evaluation is useful to understand which descriptors perform better and provides the best benchmark for the ATC scenario. As expected, Figure 2 shows that the SURF descriptor has the best performance, in line with its wide recognition as one of the best state-of-the-art local descriptors with moderate complexity.

Fig. 2. CTA rate-accuracy performance assessment (average accuracy versus rate in kbit/frame for SURF, BRISK and ORB).

Figure 2 also shows that BRISK has better performance than ORB, another popular binary descriptor with a more compact representation (256 versus 512 bits). Notice that, for a rate of 250 kbit/frame, corresponding to a JPEG compressed image with quality factor 80, the gap between BRISK and SURF is rather small. Another conclusion is that the query image quality plays an important role in the CTA scenario: for low qualities, the accuracy drops significantly, as the quantization noise can severely affect the tracking performance.

Figure 3 shows the rate-accuracy performance for the CTA paradigm using SURF (the best performer in Figure 2) and for the ATC paradigm using BRISK. In this case, the results were obtained by averaging the accuracy for all the sequences, using both the original resolution (320×240) and the up-sampled resolution. While CTA applies JPEG compression to code the pixel-level representation before transmission, ATC codes the binary descriptors using the techniques described in Section 4; in both cases, the tracking framework proposed in Section 5 is applied. As shown, the ATC-BRISK paradigm allows significant bitrate savings with respect to CTA-SURF. It can also be concluded that: i) ATC-BRISK is able to achieve the maximum accuracy at a much lower bitrate; and ii) the maximum accuracy of ATC-BRISK is similar to that of CTA-SURF, and thus lossy descriptor compression does not impair the tracking accuracy.

Fig. 3. CTA (SURF) versus ATC (BRISK) rate-accuracy assessment.

Surprisingly, accuracy values above 0.6 can be obtained with ATC by selecting just 128 DEs, with a bitrate of 30 kbit/frame; the same score can only be obtained in CTA with a JPEG quality factor of 40, at five times the bitrate. Finally, Figure 4 shows the rate-accuracy performance obtained for the ATC-BRISK solution when the number of descriptor elements is kept fixed and the number of descriptors is varied. The following conclusions can be drawn: i) it is not worthwhile to use more than 256 descriptor elements, since this just adds bitrate without any accuracy improvement; and ii) the number of descriptors plays a more important role than the number of descriptor elements in obtaining accurate tracking performance.

Fig. 4. ATC rate-accuracy performance for a varying number of descriptors (200 to 400) and of descriptor elements (128, 256, 512 and uncompressed).

7. CONCLUSIONS AND FUTURE WORK

This paper proposes a tracking-by-detection solution to address the challenging problem of tracking an object in a visual sensor network scenario. Thus, sensing nodes transmit binary visual descriptors compressed with suitable techniques, while the sink node performs tracking just with this compact representation, without any pixel-level information. Rate-accuracy experimental results show that it is possible to achieve better performance than conventional schemes where each image is coded and transmitted to the sink node. As future work, online learning schemes will be proposed to discriminate which descriptors belong to the background and which to the tracked object.

The project GreenEyes acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the Seventh Framework Programme for Research of the European Commission, under FET-Open grant number 296676.

REFERENCES

[1] S. Soro and W. Heinzelman, "A Survey of Visual Sensor Networks," Advances in Multimedia, vol. 2009, Article ID 640386, 2009.

[2] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, "Compress-then-Analyze vs. Analyze-then-Compress: Two Paradigms for Image Analysis in Visual Sensor Networks," IEEE International Workshop on Multimedia Signal Processing, Pula, Italy, September 2013.

[3] Z. Kalal, K. Mikolajczyk, and J. Matas, "Tracking-Learning-Detection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 7, pp. 1409-1422, July 2012.

[4] D. G. Lowe, "Distinctive Image Features from Scale-Invariant Keypoints," International Journal of Computer Vision, vol. 60, no. 2, November 2004.

[5] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "SURF: Speeded Up Robust Features," Computer Vision and Image Understanding, vol. 110, no. 3, June 2008.

[6] M. Calonder, V. Lepetit, C. Strecha, and P. Fua, "BRIEF: Binary Robust Independent Elementary Features," European Conference on Computer Vision, Crete, Greece, September 2010.

[7] S. Leutenegger, M. Chli, and R. Siegwart, "BRISK: Binary Robust Invariant Scalable Keypoints," IEEE International Conference on Computer Vision, Barcelona, Spain, November 2011.

[8] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: an Efficient Alternative to SIFT or SURF," IEEE International Conference on Computer Vision, Barcelona, Spain, November 2011.

[9] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, "Recent Advances and Trends in Visual Tracking: a Review," Neurocomputing, vol. 74, no. 18, pp. 3823-3831, November 2011.

[10] M. E. Maresca and A. Petrosino, "Matrioska: A Multi-level Approach to Fast Tracking by Learning," International Conference on Image Analysis and Processing, Naples, Italy, September 2013.

[11] S. Gauglitz, T. Höllerer, and M. Turk, "Evaluation of Interest Point Detectors and Feature Descriptors for Visual Tracking," International Journal of Computer Vision, vol. 94, no. 3, pp. 335-360, September 2011.

[12] G. Takacs, V. Chandrasekhar, S. Tsai, D. Chen, R. Grzeszczuk, and B. Girod, "Unified Real-time Tracking and Recognition with Rotation Invariant Fast Features," IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, USA, June 2010.

[13] P. H. S. Torr and A. Zisserman, "MLESAC: A New Robust Estimator with Application to Estimating Image Geometry," Computer Vision and Image Understanding, vol. 78, no. 1, pp. 138-156, April 2000.

[14] D.-N. Ta, W.-C. Chen, N. Gelfand, and K. Pulli, "SURFTrac: Efficient Tracking and Continuous Object Recognition using Local Feature Descriptors," IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, April 2009.

[15] S. Hare, A. Saffari, and P. H. S. Torr, "Efficient Online Structured Output Learning for Keypoint-based Object Tracking," IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 2012.

[16] J. Ascenso and F. Pereira, "Lossless Compression of Binary Image Descriptors for Visual Sensor Networks," IEEE/EURASIP Digital Signal Processing Conference, Santorini, Greece, July 2013.

[17] A. Redondi, L. Baroffio, J. Ascenso, M. Cesana, and M. Tagliasacchi, "Rate-accuracy Optimization of Binary Descriptors," IEEE International Conference on Image Processing, Melbourne, Australia, September 2013.

[18] L. Baroffio, M. Cesana, A. Redondi, M. Tagliasacchi, and S. Tubaro, "Coding Visual Features Extracted from Video Sequences," IEEE Transactions on Image Processing, vol. 23, no. 5, pp. 2226-2276, May 2014.

[19] L. Baroffio, J. Ascenso, M. Cesana, A. Redondi, and M. Tagliasacchi, "Coding Binary Local Features Extracted From Video Sequences," IEEE International Conference on Image Processing, Paris, France, October 2014.

[20] P. Monteiro and J. Ascenso, "Coding Mode Decision Algorithm for Binary Descriptor Coding," European Signal Processing Conference, Lisbon, Portugal, September 2014.

[21] L. Baroffio, M. Cesana, A. Redondi, and M. Tagliasacchi, "BAMBOO: A Fast Descriptor Based on Asymmetric Pairwise Boosting," IEEE International Conference on Image Processing, Paris, France, October 2014.

[22] S. S. Tsai, D. Chen, G. Takacs, V. Chandrasekhar, J. P. Singh, and B. Girod, "Location Coding For Mobile Image Retrieval," ICST Mobile Multimedia Communications Conference, London, UK, September 2009.

[23] B. Babenko, M.-H. Yang, and S. Belongie, "Robust Object Tracking with Online Multiple Instance Learning," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1619-1632, August 2011.

[24] D. A. Klein, D. Schulz, S. Frintrop, and A. B. Cremers, "Adaptive Real-time Video-tracking for Arbitrary Objects," IEEE/RSJ International Conference on Intelligent Robots and Systems, Taipei, Taiwan, October 2010.

[25] M. Kristan et al., "Visual Object Tracking Challenge Results," European Conference on Computer Vision (ECCV) Visual Object Tracking Challenge Workshop, Zurich, Switzerland, September 2014.
