
Prediction-Tracking-Segmentation

Jianren Wang  Yihui He  Xiaobo Wang  Xinjia Yu  Xia Chen
Carnegie Mellon University

{jianrenw,he2,xiaobow,xinjiay,xiac}@andrew.cmu.edu

Figure 1: We propose a prediction-driven method for tracking and segmentation in videos. The first row shows the results of the state-of-the-art tracker SiamMask [66] (red) and the ground truth (green). The second row shows the motion prediction (black arrow) and the tracking results (blue bounding box and red segmentation mask) of our Prediction-Tracking-Segmentation (PTS) model. Our method improves the robustness against distractors and occlusions. (best viewed in color)

Abstract

We introduce a prediction-driven method for visual tracking and segmentation in videos. Instead of solely relying on matching with appearance cues for tracking, we build a predictive model which guides finding more accurate tracking regions efficiently. With the proposed prediction mechanism, we improve the model's robustness against distractions and occlusions during tracking. We demonstrate significant improvements over state-of-the-art methods not only on visual tracking tasks (VOT 2016 and VOT 2018) but also on video segmentation datasets (DAVIS 2016 and DAVIS 2017).

1. Introduction

A human can track, segment, or interact with fast-moving objects with surprising accuracy [54], even when the objects undergo deformations, occlusions, and illumination changes [68]. What is the key component of human perception that makes this possible?

In fact, tracking and segmenting moving objects appears at a very early stage of human perception. Even 4-month-old infants can track moving objects with their eyes and reach for them [61]. A professional athlete can even interact with objects moving at breakneck speed (e.g., a baseball player can hit a 100-mph fastball). To do so, the human brain has to overcome its delays in neuronal transmission through prediction [45, 25, 6], and it saves processing time by using only local information [59, 5]. Besides prediction, researchers further point out that humans use multiple temporal scales to track objects moving at different speeds [26].

Inspired by these observations from human perception, in this paper we propose a two-stage tracking method driven by prediction. Given a tracking result at time t, we first predict the approximate object location in the next frame (time t+1) without seeing it. Based on the prediction, we then refine the localization as well as the segmentation by using the appearance input at time t+1. The refined tracking and segmentation results in turn help us update a better prediction model, which is applied again in the successive frames.

Concretely, in the first stage (prediction), we use an extrapolation method to estimate the object position in the future frame (time t+1) and simulate the multiple-temporal-scales effect of human perception through an adaptive search region in that frame. With prediction-driven tracking and segmentation, we name our method Prediction-Tracking-Segmentation (PTS).

Our approach offers several unique advantages. First, through prediction of the object position, we free tracking from relying only on appearance information. As most tracking and segmentation methods can only discriminate the foreground from a non-semantic background [71], their performance suffers significantly when the target object is surrounded by similar objects (known as distractors [71]). Prediction also improves the robustness of our model against occlusions. Note that occlusions largely prevent most appearance-based methods from extracting useful information. We show in the experiments that our method improves tracking performance by a large margin in both cases. We visualize part of the results in Fig. 1.

Second, through the use of an adaptive search region around the predicted area, PTS significantly decreases the amount of information that must be processed and thus has large potential to increase inference speed. To achieve this, we propose to focus on smaller local regions when objects have lower speeds and smaller sizes, and vice versa. This approach allows better segmentation performance, fewer misses and fewer identity switches.

We evaluate our framework on the major tracking benchmarks VOT 2016 [36] and VOT 2018 [35]. We demonstrate that our framework achieves state-of-the-art performance, both qualitatively and quantitatively. We also show competitive results against semi-supervised VOS approaches on DAVIS 2016 [48] and DAVIS 2017 [51].

To summarize, our main contributions are three-fold. First, inspired by visual-cognition theory, we propose PTS to unify prediction, tracking and segmentation in a single framework. Second, we propose an adaptive search-region module to process information effectively. Third, we show that our proposed method achieves competitive performance on VOT and VOS datasets.

2. Related Works

In this section, we briefly overview three research areas related to our proposed method.

Video Object Tracking In the tracking community, significant attention has been paid to discriminative correlation filter (DCF) based methods [3, 42, 39, 14]. These methods allow discriminating between the template of an arbitrary target and its 2D translations at breakneck speed.

MOSSE [3] is the pioneering work, proposing a fast correlation tracker by minimizing the squared error. The performance of DCF-based trackers has since been notably improved through the use of multi-channel features [24, 12, 32], robust scale estimation [8, 9], reduction of boundary effects [10, 33] and fusion of multi-resolution features in the continuous spatial domain [11].

Tracking with Siamese networks is also an important approach [34, 56, 2, 58]. Instead of learning a discriminative classifier online, the idea is to train a deep Siamese similarity function offline on pairs of video frames. At test time, this function is used, once per frame, to search a new video for the candidate most similar to the template given in the starting frame. The pioneering work is SINT [56]. Similarly, GOTURN [23] used a deep regression network to predict the motion between successive frames. SiamFC [2] implemented a fully convolutional network that outputs a correlation response map with high values at target locations, which set the basic form of the modern Siamese framework. Many following works have improved the accuracy while maintaining fast inference speed by adding a semantic branch [17], using region proposals [38], hard negative mining [71], ensembling [16], deeper backbones [37] and high-fidelity object representations [66].

Under the assumption that objects undergo only minor displacement and size change between consecutive frames, most modern trackers, including all the ones mentioned above, use a steady search region centered on the last estimated position of the target with a fixed ratio. Although very straightforward, this oversimplified prior often fails under occlusion, motion change, size change and camera motion, as is evident in the examples of Figure 1. This motivated us to propose a tracker able to adaptively set the search region.

Video Forecasting The ability to predict, and therefore to anticipate, the future is an important attribute of intelligence. Many methods have been developed to improve the temporal stability of semantic video segmentation. Luc et al. [41] develop an autoregressive convolutional neural network that learns to generate multiple future frames iteratively. Similarly, Walker et al. [62] use a VAE to model the possible future movements of humans in pose space. Instead of generating future states directly, many methods attempt to propagate segmentation from preceding input frames [30, 46, 27].

Unlike previous work, and inspired by human perception, we extract a motion model for each object and set up a new search region for segmentation according to the motion model.

Video Object Segmentation Video object segmentation (VOS) has been divided into three categories based on the level of supervision required: unsupervised, semi-supervised and supervised. We briefly review VOS focusing on the semi-supervised setting, which is usually formulated as a temporal label-propagation problem. To exploit consistency between video frames, many methods propagate the first segmentation mask through temporally adjacent frames [1, 43, 57] or even the entire video [28, 29]. Another approach is to process video frames independently [47], usually relying heavily on fine-tuning [4], data augmentation [31] and model adaptation [60].

3. Method

To unify prediction, tracking and segmentation with an adaptive search region, our model consists of (i) a prediction module, which estimates the object position and velocity in an unseen frame (Section 3.1); (ii) a tracking module, which adaptively limits the search region for further processing (Section 3.2); and (iii) a segmentation module, a fully-convolutional Siamese framework that segments the foreground object from the given search region (Section 3.3). We show our framework in Figure 2.

3.1. Prediction Module

In video object tracking and segmentation, most methods do not consider the temporal consistency of object motion. In other words, most methods effectively predict a zero-velocity object and thus set a local search region centered on the last estimated position of the target [2, 38, 71, 66].

However, these methods only consider appearance features of the current frame and hardly benefit from motion information. This leads to great difficulty in distinguishing between instances that look like the template (known as distractors [71]), or under occlusion, fast motion and camera motion. To solve this problem, our proposed tracker takes full advantage of the motion information.

Object motion in a given image is the superposition of camera motion and object motion. The former is random, while the latter should satisfy Newton's First Law [44]. We first pick a reference frame (F_{r_k}, where r_k denotes the k-th reference frame) every n frames and thus separate the long video into several short n-frame clips.

Second, we adopt the method proposed by ARIT [63] to decouple the camera motion and the object motion within each short clip. ARIT assumes that the pending detection frame (F_{r_k+t}) and its reference frame (F_{r_k}) are related by a homography (H_{r_k, r_k+t}). This assumption holds in most cases, as the global motion between neighboring frames is usually small. To estimate the homography, the first step is to find correspondences between the two frames. As in ARIT, we combine SURF features [67] and motion vectors from optical flow to generate sufficient and complementary candidate matches, which has been shown to be robust [15, 63]. Here we use PWC-Net [55] for dense flow generation.

As a homography matrix contains 8 free variables, at least 4 pairs of background points are needed. We compute the least-squares solution of eq. 1 and refine it to a robust solution with RANSAC [13], where p^{bp}_{r_k} and p^{bp}_{r_k+t} denote randomly selected background matching pairs in F_{r_k} and F_{r_k+t} obtained with the features mentioned above. Under the assumption that the background occupies more than half of the image, we partition the matched points between frames into 4 pieces and randomly choose one point inside each piece, which improves the efficiency of the RANSAC algorithm.

H_{r_k, r_k+t} × p^{bp}_{r_k} = p^{bp}_{r_k+t}    (1)
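As a rough illustration of this decoupling step (not the authors' implementation), the sketch below estimates the homography of eq. 1 with OpenCV's RANSAC-based solver from matched background candidates and projects a point between frames; the function names and the 3-pixel reprojection threshold are illustrative choices.

```python
import cv2
import numpy as np

def estimate_homography(pts_ref, pts_cur, ransac_thresh=3.0):
    """Estimate H mapping reference-frame points to the current frame (eq. 1),
    robust to foreground outliers via RANSAC.
    pts_ref, pts_cur: (N, 2) arrays of matched background candidates,
    e.g., from SURF keypoints plus optical-flow correspondences."""
    pts_ref = np.asarray(pts_ref, dtype=np.float32)
    pts_cur = np.asarray(pts_cur, dtype=np.float32)
    H, mask = cv2.findHomography(pts_ref, pts_cur, cv2.RANSAC, ransac_thresh)
    if H is None:
        return None, None
    return H, mask.ravel().astype(bool)

def warp_point(H, p):
    """Project a 2D point with a homography (homogeneous coordinates)."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[:2] / q[2]
```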

For simplicity, all following calculations are performed in the reference coordinate frame and projected back to the newly arriving frame without further notice.

Fig. 3 illustrates the working principle of the decoupling step. The original video for Fig. 3 is a handheld video with a trembling background. The motion of the pedestrians in the original video is highly unpredictable because of the large background uncertainty. However, by mapping the target frame to the reference frame, the movement of the pedestrians becomes more predictable and continuous.

To find the most representative point of the object position, we calculate the "center of mass" of the object segmentation using eq. 2:

P = average(p_o)    (2)
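A minimal sketch of eq. 2, assuming the mask is a binary NumPy array; the helper name is illustrative.

```python
import numpy as np

def mask_center(mask):
    """'Center of mass' of a binary segmentation mask (eq. 2):
    the mean (x, y) coordinate over all foreground pixels."""
    ys, xs = np.nonzero(mask)      # foreground pixel coordinates
    if xs.size == 0:               # empty mask: no measurement available
        return None
    return np.array([xs.mean(), ys.mean()])
```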

Random noise from the background-motion estimation and the mask segmentation might be introduced into the object-position prediction, which could hurt its accuracy. To achieve a better estimate of the object state, we use a Kalman filter to provide accurate position information based on the measurements from the current and former frames. As a classical tracking algorithm, the Kalman filter estimates the position of the object in two steps: prediction and correction. In the prediction step, it predicts the target state based on the dynamic model (eq. 3) and generates the search region for the Siamese network, which performs the object segmentation. The measurement of the object position in the next frame is then computed with eq. 2. In the correction step, the state estimate is updated with higher certainty given this position measurement from the Siamese network, which benefits the accuracy of predictions for future frames.

The dynamic model for the object-position update can be formulated as:

x_{t|t-1} = F_t x_{t-1|t-1} + w_t    (3)

In eq. 3, x_{t|t-1} is the a priori state estimate given observations up to time t-1, a 4-dimensional vector ([x, y, dx, dy]) containing the position and velocity information.


Figure 2: An overview of our method, which is composed of a prediction part, a tracking part and a segmentation part.

Figure 3: An example of decoupling the background motion and mapping the object motion to the reference frame (arrows illustrate the movement of the object center).

It is worth mentioning that the velocity terms (dx, dy) are predicted by extrapolation between the information at times t-1 and t, w_t is the random noise in the system, and F_t is the state-transition matrix from time t-1 to t.

After predicting the state, the Kalman filter uses measurements to correct its prediction in the correction step using eq. 4. In the equation, y_t is the residual between the prediction and the measurement, and K_t is the optimal Kalman gain computed from the predicted error covariance (P_{t|t-1}), the measurement matrix (H_t) and the innovation covariance (S_t), as shown in eq. 5. It is worth mentioning that, as the Kalman filter is a recursive algorithm, the predicted error covariance P is also updated based on the estimation results.

x_{t|t} = x_{t|t-1} + K_t y_t    (4)

K_t = P_{t|t-1} H_t^T S_t^{-1}    (5)

The motion consistency between video frames in different clips with different reference frames could be an issue, because the initialization of the velocity for a reference frame is critical to the accuracy of the position update. To maintain motion consistency, we choose the n-th frame, i.e., the last frame of the current clip, as the next reference frame, with the refined position and velocity estimates from the Kalman filter based on the former reference frame. The velocity of the object with respect to the new clip can therefore be initialized by mapping the refined velocity to the new reference frame.

3.2. Tracking Module

Inspired by human perception, we dynamically set up a new search region in the coming frame, centered at the predicted object position. We project the estimated object-center position back to the pending detection frame using eq. 6.

H_{r_k, r_k+t} × P_{r_k} = P_{r_k+t}    (6)

Given the estimated position, we set up the search region S using a method similar to [24]:

S = k · sqrt((w + p)(h + p))    (7)

k = 1 + 2 × sigmoid(||v||_2 − T)    (8)

where p = (w + h)/2. To make the search region adaptive, its size S is modified with respect to the predicted velocity using eq. 8, where v is the velocity predicted by the Kalman filter and T is a velocity threshold. The search region is cropped centered at P_{r_k+t} on frame F_{r_k+t} and then resized to 255 × 255.

To make the one-shot segmentation framework suitable for the tracking task, we adopt the optimization strategy used for automatic bounding-box generation proposed in VOT-2016 [36], as it offers the highest IoU and mAP reported in [66].


3.3. Segmentation Module

We adopt the SiamMask framework [66], which achieves a good balance between accuracy and speed. SiamMask uses an offline-trained fully-convolutional network to simultaneously produce a binary segmentation mask, a detection bounding box and an objectness score. First, the Siamese network compares a template image z (w × h × 3) against a (larger) search image x to obtain a dense response map g_θ(z, x). The two inputs are processed by the same CNN f_θ, yielding two feature maps that are cross-correlated:

g_θ(z, x) = f_θ(z) ⋆ f_θ(x)    (9)

Each spatial element of the response map, g_θ^n(z, x), represents the similarity between the template image z and the n-th candidate window in x. Second, a three-branch head computes the binary segmentation mask, the detection bounding box and the objectness score, respectively. The mask branch predicts a (w × h) binary mask m^n from each spatial element g_θ^n(z, x). The box branch regresses k bounding boxes from each spatial element g_θ^n(z, x), where k is the number of anchors. The score branch estimates the corresponding objectness scores.

m^n = h_φ(g_θ^n(z, x))    (10)

b^n_{i=1,...,k} = h_σ(g_θ^n(z, x))    (11)

s^n_{i=1,...,k} = h_ϕ(g_θ^n(z, x))    (12)

A multi-task loss is used to optimise the whole framework.

L = λ_1 L_mask + λ_2 L_box + λ_3 L_score    (13)

We refer readers to [49, 50] for details of the mask branch and to [53, 38] for details of the region-proposal branch.
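The comparison in eq. 9 is a plain cross-correlation between the two feature maps. The sketch below shows the single-channel SiamFC-style form of this operation in PyTorch (SiamMask itself uses a depth-wise variant that keeps the channels separate); the backbone and the three-branch head are omitted.

```python
import torch
import torch.nn.functional as F

def siamese_response(f_z, f_x):
    """Cross-correlate template features f_z (1, C, hz, wz) with search
    features f_x (1, C, hx, wx), as in eq. 9: g = f(z) * f(x).
    Returns a (1, 1, hx-hz+1, wx-wz+1) response map; each spatial element
    scores one candidate window of the search image against the template."""
    # PyTorch's conv2d performs cross-correlation, so using the template
    # features as the convolution kernel gives the response map directly.
    return F.conv2d(f_x, f_z)
```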

4. Experiments

In this section, we evaluate our approach on three tasks: motion prediction, visual object tracking (VOT 2016 and VOT 2018) and semi-supervised video object segmentation (DAVIS 2016 and DAVIS 2017). It is worth noting that our method does not depend on the particular choice of tracking and segmentation method. To better evaluate the efficiency of our method, we adopt SiamMask [66] with its provided pretrained model as our online tracking and segmentation method.

4.1. Evaluation for motion prediction accuracy

Datasets and settings We adopt two widely used benchmark datasets to evaluate the performance of the motion prediction: VOT 2016 [36] and VOT 2018 [35]. Both are annotated with rotated bounding boxes. Both datasets contain 60 public sequences with different challenging factors (camera motion, object motion change, object size change, occlusion and illumination change), which makes object motion prediction extremely challenging [35]. We use eq. 2 to predict the object position and eq. 3 to extrapolate the object velocity. For our baseline, the predicted position in the next frame (t+1) is always the same as in the current frame (t), while the object velocity is always predicted as 0. The ground-truth position is the center of the annotated rotated bounding box, and the ground-truth velocity is the difference between two consecutive positions. We evaluate the position error against the ground truth with the Euclidean distance, and the velocity error with the Euclidean distance, the cosine distance and the magnitude distance. The cosine distance is the cosine of the angle between the predicted and ground-truth velocities (higher is better). The magnitude distance is the absolute difference between the magnitudes of the predicted and ground-truth velocities. We adopt the re-initialization mechanism of the official VOT toolkit: when the segmentation has no overlap with the ground truth, we re-initialize the tracker with the ground truth after five frames.
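The error measures described above can be written down directly. The sketch below is an illustrative implementation of the position and velocity metrics, not the official toolkit code.

```python
import numpy as np

def position_error(p_pred, p_gt):
    """Euclidean distance between predicted and ground-truth centers."""
    return np.linalg.norm(np.asarray(p_pred, float) - np.asarray(p_gt, float))

def velocity_errors(v_pred, v_gt, eps=1e-8):
    """Euclidean, cosine and magnitude errors between velocity vectors.
    Cosine is the cosine of the angle between the two vectors (higher is
    better); magnitude is | ||v_pred|| - ||v_gt|| |."""
    v_pred, v_gt = np.asarray(v_pred, float), np.asarray(v_gt, float)
    euclid = np.linalg.norm(v_pred - v_gt)
    cosine = v_pred @ v_gt / (np.linalg.norm(v_pred) * np.linalg.norm(v_gt) + eps)
    magnitude = abs(np.linalg.norm(v_pred) - np.linalg.norm(v_gt))
    return euclid, cosine, magnitude
```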

Figure 4: Object-center predictions generated by SiamMask and PTS (red for PTS, yellow for SiamMask, blue cross for ground truth). (best viewed in color)

Dataset     Tracker       Pos. Err.
VOT 2016    SiamMask      16.281
            PTS (ours)     8.198
VOT 2018    SiamMask      14.593
            PTS (ours)     8.744

Table 1: Position prediction error on VOT 2016 and VOT 2018.

Results on VOT 2016 and VOT 2018 Table 1 presents the comparison of position-prediction results between SiamMask and PTS on the VOT 2016 and VOT 2018 datasets.

As shown in the table, on both datasets the PTS method dramatically reduces the prediction error of the object position. On VOT 2016, the position error is roughly halved, from about 16 pixels to 8 pixels.


Dataset     Tracker       MSE Err.   Cosine   Mag.
VOT 2016    SiamMask      8.274      -        8.274
            PTS (ours)    4.596      0.667    3.190
VOT 2018    SiamMask      7.006      -        7.006
            PTS (ours)    4.298      0.793    2.929

Table 2: Velocity prediction error on VOT 2016 and VOT 2018.

Figure 5: Velocity predictions generated by PTS (white for ground truth, black for prediction; both scaled 5× for better visualization). (best viewed in color)

Meanwhile, Fig. 4 shows that when the object velocity is high, PTS provides a more accurate prediction than SiamMask, which does not consider the influence of object motion. The results show that the decoupling strategy reduces the background uncertainty and that the Kalman filter provides a relatively reliable prediction of the object position in the next frame. Higher accuracy of the position prediction benefits the generation of search regions for object tracking and eventually improves the performance of object segmentation.

For velocity, as can be seen in Table 2, our method significantly reduces the estimation error. On VOT 2018, PTS achieves a cosine distance of 0.793, which corresponds to about 37 degrees of divergence from the ground-truth velocity direction. The main cause of the remaining error is that objects are not always rigid, so the "center of mass" can only approximate the overall motion of the object (Fig. 5). Size changes of the object further increase the prediction error. However, with the correction procedure of the Kalman filter, this error (noise) can be stabilized. One possible way to decrease the velocity-prediction error is to track each part of a non-rigid object and group all parts together to obtain the final prediction [52].

4.2. Evaluation for visual object tracking

Datasets and settings Similarly, we adopt two widely used benchmarks for the evaluation of the object tracking task, VOT 2016 and VOT 2018, and compare against the state of the art using the official metric, Expected Average Overlap (EAO).

Trackers      A       R       EAO
SiamMask      0.639   0.214   0.433
PTS (ours)    0.642   0.144   0.471

Table 3: Comparison with SiamMask on VOT 2016.

EAO considers both the accuracy and the robustness of a tracker [35]. We use VOT 2018 to further analyze performance under different conditions.

Results on VOT 2016 Table 3 presents the comparison of tracking performance between PTS and the SiamMask baseline on the VOT 2016 dataset. Our model improves the robustness by 30% and provides an 8.8% gain in EAO, reaching 0.471.

Results on VOT 2018 In Table 4 we compare PTS against eleven recently published state-of-the-art trackers on the VOT 2018 benchmark. We establish a new state of the art with 0.397 EAO and 0.612 accuracy. In particular, our accuracy outperforms all existing correlation-filter-based trackers. This is easy to understand, since our baseline SiamMask relies on deep feature extraction, which is much richer than in existing correlation-filter-based methods. More interestingly, PTS also outperforms the SiamMask baseline itself. Previous research shows that Siamese-based trackers have a strong center bias regardless of the appearance of the test target [37]. Thus, by estimating the center of the search region more accurately, Siamese trackers can also achieve better regression results (e.g., bounding-box detection or object segmentation). Besides, PTS achieves the highest robustness among all Siamese-based trackers. This is even more encouraging, because low robustness is one of the key vulnerabilities of Siamese-based trackers. The main reason is that most Siamese networks can only discriminate the foreground from a non-semantic background [71]. In other words, Siamese-based trackers are not appearance-sensitive enough and often fail to distinguish surrounding objects. Our proposed PTS adopts a straightforward strategy and shows a large improvement in robustness, from 0.276 to 0.220, which provides another route to better robustness: setting a more accurate and targeted search region.

To further analyze where the improvements come from, we show the qualitative results of PTS and our baseline in Fig. 6. As mentioned above, the robustness comes from fewer object switches and misses. For example, in the car scenario in Figure 6, when the camera shakes, the center of the search region of SiamMask shifts to the left of the tracked car and finally latches onto the truck.


Figure 6: Qualitative results of our method: the green box is the ground truth, the red box is the bounding box from SiamMask, and the blue box is our bounding box fitted to the mask.

Tracker       EAO     Accuracy   Robustness
DaSiamRPN     0.326   0.569      0.337
SA_Siam_R     0.337   0.566      0.258
CPT           0.339   0.506      0.239
DeepSTRCF     0.345   0.523      0.215
DRT           0.356   0.519      0.201
RCO           0.376   0.507      0.155
UPDT          0.378   0.536      0.184
SiamMask      0.380   0.609      0.276
SiamRPN       0.383   0.586      0.276
MFT           0.385   0.505      0.140
LADCF         0.389   0.503      0.159
PTS (Ours)    0.397   0.612      0.220

Table 4: Comparison with the state of the art under EAO, Accuracy, and Robustness on the VOT 2018 benchmark.

On the contrary, since our model considers camera motion, the center of our search region stays on the tracked car. This stability comes from the decoupling of camera motion. Another example is Bolt, in the third row of Figure 6. When Bolt accelerates, SiamMask is easily distracted by the other runners, but our PTS model does not fail because it considers Bolt's speed. This stability comes from the object-velocity estimation. These features greatly help the performance of PTS under large camera motion, fast object motion and occlusion. Simply speaking, by predicting the object position accurately, we can focus on a more targeted search region and thus achieve better detection and segmentation performance.

Table 5 presents the comparison of VOS results between SiamMask and PTS on the DAVIS 2016 and DAVIS 2017 datasets.

Dataset       Method        J       F
DAVIS 2016    SiamMask      0.713   0.674
              PTS (ours)    0.732   0.692
DAVIS 2017    SiamMask      0.543   0.585
              PTS (ours)    0.554   0.604

Table 5: J and F results on DAVIS 2016 and DAVIS 2017.

4.3. Evaluation for video object segmentation

Datasets and settings We report the performance of PTS on the standard VOS datasets DAVIS 2016 [48] and DAVIS 2017 [51]. For both datasets, we use the official performance measures: the Jaccard index (J) to express region similarity and the F-measure (F) to express contour accuracy. Since we use SiamMask as our baseline, we adopt the semi-supervised setup. We fit bounding boxes to the object masks in the first frame and use these bounding boxes to initialize our proposed PTS.


Figure 7: Qualitative results of SiamMask and PTS on DAVIS: the first and third rows are results from SiamMask; the second and fourth rows are results on the same videos using PTS. (best viewed in color)


Results on DAVIS 2016 and DAVIS 2017 The effect of our approach is more limited on the DAVIS 2016 and DAVIS 2017 datasets. The main reason is that the DAVIS videos contain little camera motion or fast object motion, which is where most of the gain from our method comes. However, segmentation can still benefit from a more accurately cropped search region. For example, the dog in the third frame of the "Dogs-Jump" video is segmented more completely by PTS, whereas SiamMask misses the tail of the same dog. Another example is the person in the fourth frame of the "Soapbox" video: PTS separates this person from the soapbox, whereas SiamMask mixes its segmentation with the surrounding pixels. Further, SiamMask fails to distinguish the person's mask from the drum of the soapbox, because the drum occupies the previous position of the person, which cannot be handled without a motion assumption. Through our pre-tracking procedure, PTS can separate a specific instance from its neighboring instances and thus obtain a more accurate segmentation. We show that our proposed PTS has large potential, especially for the segmentation of crowded scenarios. For more qualitative results, please refer to Fig. 7.

4.4. Ablation studies

Table 6 compares the influence of each module in our model. On the VOT 2018 dataset, we evaluate the influence of the tracking and prediction modules and compare their performance with the baseline (SiamMask) and with the full PTS. It can be observed from Table 6 that the prediction module, which uses the Kalman filter to update the position of objects, plays the most important role in PTS: most of the EAO improvement appears to come from the prediction module.

Method                   EAO     Accuracy   Robustness
SiamMask                 0.380   0.609      0.276
SiamMask + Tracking      0.382   0.610      0.268
SiamMask + Prediction    0.394   0.611      0.234
PTS                      0.397   0.612      0.220

Table 6: Ablation studies for the Tracking and Prediction modules on the VOT 2018 dataset.

Moreover, for the tracking module, i.e., the adaptive search-region update module, the influence is more limited, with only a 0.002 EAO increase on its own. However, as Table 6 shows, both modules have the potential to improve accuracy.

5. Conclusion

In conclusion, we introduce a prediction-driven method for visual tracking and segmentation in videos. Instead of solely relying on matching with appearance cues for tracking, we build a predictive model which provides guidance for finding more accurate tracking regions efficiently. We show that this simple idea significantly improves robustness on the VOT and VOS challenges and achieves state-of-the-art performance on both tasks. We hope our work can inspire more studies of the relationship among three main challenges in video understanding: prediction, tracking and segmentation.


References

[1] Linchao Bao, Baoyuan Wu, and Wei Liu. CNN in MRF: Video object segmentation via inference in a CNN-based higher-order spatio-temporal MRF. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018. 3

[2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, AndreaVedaldi, and Philip HS Torr. Fully-convolutional siamesenetworks for object tracking. In European conference oncomputer vision, pages 850–865. Springer, 2016. 2, 3

[3] David S Bolme, J Ross Beveridge, Bruce A Draper, andYui Man Lui. Visual object tracking using adaptive cor-relation filters. In Computer Vision and Pattern Recogni-tion (CVPR), 2010 IEEE Conference on, pages 2544–2550.IEEE, 2010. 2

[4] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset,Laura Leal-Taixe, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In Proceedings of theIEEE conference on computer vision and pattern recogni-tion, pages 221–230, 2017. 3

[5] Carlos R Cassanello, Abhay T Nihalani, and Vincent P Fer-rera. Neuronal responses to moving targets in monkey frontaleye fields. Journal of neurophysiology, 2008. 1

[6] Patrick Cavanagh and Stuart Anstis. The flash grab effect.Vision Research, 91:8–20, 2013. 1

[7] Shuyang Chen, Jianren Wang, and Peter Kazanzides. Inte-gration of a low-cost three-axis sensor for robot force con-trol. In 2018 Second IEEE International Conference onRobotic Computing (IRC), pages 246–249. IEEE, 2018.

[8] Martin Danelljan, Gustav Hager, Fahad Khan, and MichaelFelsberg. Accurate scale estimation for robust visual track-ing. In British Machine Vision Conference, Nottingham,September 1-5, 2014. BMVA Press, 2014. 2

[9] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, andMichael Felsberg. Discriminative scale space tracking. IEEEtransactions on pattern analysis and machine intelligence,39(8):1561–1575, 2017. 2

[10] Martin Danelljan, Gustav Hager, Fahad Shahbaz Khan, andMichael Felsberg. Learning spatially regularized correlationfilters for visual tracking. In Proceedings of the IEEE Inter-national Conference on Computer Vision, pages 4310–4318,2015. 2

[11] Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan,and Michael Felsberg. Beyond correlation filters: Learn-ing continuous convolution operators for visual tracking. InEuropean Conference on Computer Vision, pages 472–488.Springer, 2016. 2

[12] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg,and Joost Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 1090–1097, 2014. 2

[13] Martin A Fischler and Robert C Bolles. Random sampleconsensus: a paradigm for model fitting with applications toimage analysis and automated cartography. Communicationsof the ACM, 24(6):381–395, 1981. 3

[14] Hamed Kiani Galoogahi, Ashton Fagg, and Simon Lucey.Learning background-aware correlation filters for visualtracking. In ICCV, pages 1144–1152, 2017. 2

[15] Steffen Gauglitz, Tobias Hollerer, and Matthew Turk. Eval-uation of interest point detectors and feature descriptors forvisual tracking. International journal of computer vision,94(3):335, 2011. 3

[16] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. To-wards a better match in siamese network based visual objecttracker. In European Conference on Computer Vision, pages132–147. Springer, 2018. 2

[17] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng.A twofold siamese network for real-time object tracking.In The IEEE Conference on Computer Vision and PatternRecognition (CVPR), June 2018. 2

[18] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, andSong Han. Amc: Automl for model compression and ac-celeration on mobile devices. In Proceedings of the Euro-pean Conference on Computer Vision (ECCV), pages 784–800, 2018.

[19] Yihui He, Xianggen Liu, Huasong Zhong, and Yuchun Ma.Addressnet: Shift-based primitives for efficient convolu-tional neural networks. In 2019 IEEE Winter Conference onApplications of Computer Vision (WACV), pages 1213–1222.IEEE, 2019.

[20] Yihui He, Xiaobo Ma, Xiapu Luo, Jianfeng Li, MengchenZhao, Bo An, and Xiaohong Guan. Vehicle traffic drivencamera placement for better metropolis security surveillance.IEEE Intelligent Systems, 33(4):49–61, 2018.

[21] Yihui He, Xiangyu Zhang, Marios Savvides, and Kris Ki-tani. Softer-nms: Rethinking bounding box regression foraccurate object detection. arXiv preprint arXiv:1809.08545,2018.

[22] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruningfor accelerating very deep neural networks. In Proceedingsof the IEEE International Conference on Computer Vision,pages 1389–1397, 2017.

[23] David Held, Sebastian Thrun, and Silvio Savarese. Learn-ing to track at 100 fps with deep regression networks. InEuropean Conference on Computer Vision, pages 749–765.Springer, 2016. 2

[24] Joao F Henriques, Rui Caseiro, Pedro Martins, and JorgeBatista. High-speed tracking with kernelized correlation fil-ters. IEEE Transactions on Pattern Analysis and MachineIntelligence, 37(3):583–596, 2015. 2, 4

[25] Hinze Hogendoorn and Anthony N Burkitt. Predictive cod-ing of visual object position ahead of moving objects re-vealed by time-resolved eeg decoding. Neuroimage, 171:55–61, 2018. 1

[26] Alex O Holcombe. Seeing slow and seeing fast: two limitson perception. Trends in cognitive sciences, 13(5):216–221,2009. 1

[27] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al.Spatial transformer networks. In Advances in neural infor-mation processing systems, pages 2017–2025, 2015. 2

[28] Varun Jampani, Raghudeep Gadde, and Peter V Gehler.Video propagation networks. In Proceedings of the IEEE


Conference on Computer Vision and Pattern Recognition,pages 451–461, 2017. 3

[29] Won-Dong Jang and Chang-Su Kim. Online video objectsegmentation via convolutional trident network. In Proceed-ings of the IEEE Conference on Computer Vision and PatternRecognition, pages 5849–5858, 2017. 3

[30] Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin,Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, ZequnJie, et al. Video scene parsing with predictive feature learn-ing. In Proceedings of the IEEE International Conference onComputer Vision, pages 5580–5588, 2017. 2

[31] Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox,and Bernt Schiele. Lucid data dreaming for object track-ing. In The DAVIS Challenge on Video Object Segmentation,2017. 3

[32] Hamed Kiani Galoogahi, Terence Sim, and Simon Lucey.Multi-channel correlation filters. In Proceedings of the IEEEinternational conference on computer vision, pages 3072–3079, 2013. 2

[33] Hamed Kiani Galoogahi, Terence Sim, and Simon Lucey.Correlation filters with limited boundaries. In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition, pages 4630–4638, 2015. 2

[34] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov.Siamese neural networks for one-shot image recognition. InICML Deep Learning Workshop, volume 2, 2015. 2

[35] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Fels-berg, Roman Pfugfelder, Luka Cehovin Zajc, Tomas Vojir,Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, Gus-tavo Fernandez, and et al. The sixth visual object trackingvot2018 challenge results, 2018. 2, 5, 6

[36] Matej Kristan, Jiri Matas, Ales Leonardis, Michael Felsberg,Gustavo Fernandez, and et al. The visual object trackingvot2016 challenge results. In Proceedings of the EuropeanConference on Computer Vision Workshop, pages 777–823.Springer International Publishing, 2016. 2, 4, 5

[37] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, JunliangXing, and Junjie Yan. Siamrpn++: Evolution of siamesevisual tracking with very deep networks. arXiv preprintarXiv:1812.11703, 2018. 2, 6

[38] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu.High performance visual tracking with siamese region pro-posal network. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 8971–8980, 2018. 2, 3, 5

[39] Yang Li and Jianke Zhu. A scale adaptive kernel correlationfilter tracker with feature integration. In European confer-ence on computer vision, pages 254–265. Springer, 2014. 2

[40] Yudong Liang, Ze Yang, Kai Zhang, Yihui He, Jinjun Wang,and Nanning Zheng. Single image super-resolution via alightweight residual convolutional neural network. arXivpreprint arXiv:1703.08173, 2017.

[41] Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Ver-beek, and Yann LeCun. Predicting deeper into the futureof semantic segmentation. In Proceedings of the IEEE In-ternational Conference on Computer Vision, pages 648–657,2017. 2

[42] Chao Ma, Xiaokang Yang, Chongyang Zhang, and Ming-Hsuan Yang. Long-term correlation tracking. In Proceed-ings of the IEEE conference on computer vision and patternrecognition, pages 5388–5396, 2015. 2

[43] Nicolas Marki, Federico Perazzi, Oliver Wang, and Alexan-der Sorkine-Hornung. Bilateral space video segmentation.In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, pages 743–751, 2016. 3

[44] Isaac Newton. The Principia: mathematical principles ofnatural philosophy. Univ of California Press, 1999. 3

[45] Romi Nijhawan. Motion extrapolation in catching. Nature,1994. 1

[46] David Nilsson and Cristian Sminchisescu. Semantic videosegmentation by gated recurrent flow propagation. In Pro-ceedings of the IEEE Conference on Computer Vision andPattern Recognition, pages 6819–6828, 2018. 2

[47] Federico Perazzi, Anna Khoreva, Rodrigo Benenson, BerntSchiele, and Alexander Sorkine-Hornung. Learning videoobject segmentation from static images. In Proceedingsof the IEEE Conference on Computer Vision and PatternRecognition, pages 2663–2672, 2017. 3

[48] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, LucVan Gool, Markus Gross, and Alexander Sorkine-Hornung.A benchmark dataset and evaluation methodology for videoobject segmentation. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 724–732, 2016. 2, 7

[49] Pedro O Pinheiro, Ronan Collobert, and Piotr Dollar. Learn-ing to segment object candidates. In Advances in NeuralInformation Processing Systems, pages 1990–1998, 2015. 5

[50] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and PiotrDollar. Learning to refine object segments. In EuropeanConference on Computer Vision, pages 75–91. Springer,2016. 5

[51] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Ar-belaez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017davis challenge on video object segmentation. arXiv preprintarXiv:1704.00675, 2017. 2, 7

[52] Deva Ramanan, David A Forsyth, and Andrew Zisser-man. Tracking people by learning their appearance. IEEEtransactions on pattern analysis and machine intelligence,29(1):65–81, 2007. 6

[53] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun.Faster r-cnn: Towards real-time object detection with regionproposal networks. In Advances in neural information pro-cessing systems, pages 91–99, 2015. 5

[54] Jeroen BJ Smeets, Eli Brenner, and Marc HE de Lus-sanet. Visuomotor delays when hitting running spiders. InEWEP 5-Advances in perception-action coupling, pages 36–40. Editions EDK, Paris, 1998. 1

[55] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz.Pwc-net: Cnns for optical flow using pyramid, warping,and cost volume. In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition, pages 8934–8943, 2018. 3

[56] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders.Siamese instance search for tracking. In Proceedings of the


IEEE conference on computer vision and pattern recogni-tion, pages 1420–1429, 2016. 2

[57] Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J Black.Video segmentation via object flow. In Proceedings of theIEEE conference on computer vision and pattern recogni-tion, pages 3899–3908, 2016. 3

[58] Jack Valmadre, Luca Bertinetto, Joao Henriques, AndreaVedaldi, and Philip HS Torr. End-to-end representationlearning for correlation filter based tracking. In ComputerVision and Pattern Recognition (CVPR), 2017 IEEE Confer-ence on, pages 5000–5008. IEEE, 2017. 2

[59] Elle van Heusden, Martin Rolfs, Patrick Cavanagh, andHinze Hogendoorn. Motion extrapolation for eye move-ments predicts perceived motion-induced position shifts.Journal of Neuroscience, 38(38):8243–8250, 2018. 1

[60] Paul Voigtlaender and Bastian Leibe. Online adaptation ofconvolutional neural networks for video object segmenta-tion. arXiv preprint arXiv:1706.09364, 2017. 3

[61] Claes Von Hofsten. Eye–hand coordination in the newborn.Developmental psychology, 18(3):450, 1982. 1

[62] Jacob Walker, Kenneth Marino, Abhinav Gupta, and MartialHebert. The pose knows: Video forecasting by generatingpose futures. In Proceedings of the IEEE International Con-ference on Computer Vision, pages 3332–3341, 2017. 2

[63] H. Wang and C. Schmid. Action recognition with improvedtrajectories. In 2013 IEEE International Conference onComputer Vision, pages 3551–3558, Dec 2013. 3

[64] Jianren Wang, Long Qian, Ehsan Azimi, and PeterKazanzides. Prioritization and static error compensationfor multi-camera collaborative tracking in augmented reality.In 2017 IEEE Virtual Reality (VR), pages 335–336. IEEE,2017.

[65] Jianren Wang, Junkai Xu, and Peter B Shull. Vertical jumpheight estimation algorithm based on takeoff and landingidentification via foot-worn inertial sensing. Journal ofbiomechanical engineering, 140(3):034502, 2018.

[66] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, andPhilip HS Torr. Fast online object tracking and segmentation:A unifying approach. arXiv preprint arXiv:1812.05050,2018. 1, 2, 3, 4, 5

[67] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An effi-cient dense and scale-invariant spatio-temporal interest pointdetector. In European conference on computer vision, pages650–663. Springer, 2008. 3

[68] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object track-ing benchmark. IEEE Transactions on Pattern Analysis andMachine Intelligence, 37(9):1834–1848, 2015. 1

[69] Haisheng Xia, Junkai Xu, Jianren Wang, Michael A Hunt,and Peter B Shull. Validation of a smart shoe for estimat-ing foot progression angle during walking gait. Journal ofbiomechanics, 61:193–198, 2017.

[70] Chenchen Zhu, Yihui He, and Marios Savvides. Feature se-lective anchor-free module for single-shot object detection.arXiv preprint arXiv:1903.00621, 2019.

[71] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, andWeiming Hu. Distractor-aware siamese networks for visualobject tracking. In European Conference on Computer Vi-sion, pages 103–119. Springer, 2018. 2, 3, 6
