Multimodal Deep Feature Fusion (MMDFF) for RGB-D Tracking

Research ArticleMultimodal Deep Feature Fusion (MMDFF) for RGB-D Tracking

Ming-xin Jiang 12 Chao Deng 3 Ming-min Zhang 45

Jing-song Shan 6 and Haiyan Zhang 6

1 Jiangsu Laboratory of Lake Environment Remote Sensing Technologies Huaiyin Institute of Technology Huaian 223003 China2Faculty of Electronic information Engineering Huaiyin Institute of Technology Huaian 223003 China3School of Physics amp Electronic Information Engineering Henan Polytechnic University Jiaozuo 454000 China4School of Computer Science amp Technology Zhejiang University 310058 China5Institute of VR and Intelligent System Hangzhou Normal University 310012 China6Faculty of Computer and Software Engineering Huaiyin Institute of Technology Huaian 223003 China

Correspondence should be addressed to Chao Deng dengchao hpu163com and Ming-min Zhang zhangmm95zjueducn

Received 22 July 2018 Accepted 5 November 2018 Published 28 November 2018

Guest Editor Li Zhang

Copyright copy 2018 Ming-xin Jiang et alThis is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Visual tracking is still a challenging task due to occlusion appearance changes complex motion etc We propose a novel RGB-Dtracker based on multimodal deep feature fusion (MMDFF) in this paper MMDFF model consists of four deep ConvolutionalNeural Networks (CNNs) Motion-specific CNN RGB- specific CNN Depth-specific CNN and RGB-Depth correlated CNNThedepth image is encoded into three channels which are sent into depth-specific CNN to extract deep depth featuresThe optical flowimage is calculated for every frame and then is fed to motion-specific CNN to learn deep motion features Deep RGB depth andmotion information can be effectively fused atmultiple layers viaMMDFFmodel Finally multimodal fusion deep features are sentinto the C-COT tracker to obtain the tracking result For evaluation experiments are conducted on two recent large-scale RGB-Ddatasets and results demonstrate that our proposed RGB-D tracking method achieves better performance than other state-of-artRGB-D trackers

1 Introduction

Asone of themost fundamental problems in computer visionvisual object tracking which aims to estimate the trajectoryof a object in a video has been successfully addressed innumerous applications including intelligent traffic controlartificial intelligence and autonomous driving [1ndash3] Despitethe research on visual object tracking has made outstandingachievements many challenges remain when seeking to trackobjects effectively in practice For example it is still quitedifficult to track objects when occlusion occurs frequentlyappearance changes the motion of object is complex andillumination varies [4ndash6]

The major drawback of the tracking methods only usingRGB data is that they are not robust to appearance changesThanks to the availability of consumer-grade RGB-D sen-sors such as Inter RealSense Microsoft Kinect and AsusXtion more accurate depth information of objects can be

obtained to revisit the existing problems of tracking [7 8]Compared with RGB data the RGB-D data can remarkablyimprove the performance of tracking due to the access tothe depth information complementary to RGB [9]The depthinformation is invariant to illumination or color variations[10 11] and can provide geometrical cues and spatial structureinformation thus it shows powerful benefits in visual objecttracking However how to effectively utilize the depth dataprovided by RGB-D sensors is still a challenging issue

However RGB and depth only encode static informationfrom a single frame so that tracking often fails when themotion of object is complexUnder these circumstances deepmotion features can provide high-level motion informationto distinguish the target object [12ndash14] The dynamic infor-mation can be captured by deepmotion features and that willbe complementary to static features extracted from RGB anddepth image

HindawiComplexityVolume 2018 Article ID 5676095 8 pageshttpsdoiorg10115520185676095

2 Complexity

Ourmotivation is to design a RGB-D object tracker basedon multimodal deep feature fusion Specific emphasis of thispaper is placed on exploring three scientific questions howto fuse deep motion features with static features provided byRGB and depth image how to fuse the deep RGB and depthfeatures sufficiently how to effectively derive geometricalcues and spatial structure information from depth data Insummary the key technical contributions of this study arethree-fold

(i) A novel MMDFFmodel is designed for RGB-D track-ing including four deep CNNmotion-specific CNNRGB- specific CNN Depth-specific CNN and RGB-Depth correlated CNN InMMDFF we can fuse RGBdepth and motion information at multiple layers viaCNN effectively The proposedmethod can separatelyextract RGB and depth features by using RGB-specificCNNandDepth-specificCNNand adequately exploitthe correlated relationship between RGB and depthmodality by using RGB-Depth correlated CNN Atthe same time Motion-specific CNN can provide animportant high-level information about motion fortracking

(ii) The depth image is encoded into three channelshorizontal disparity height above ground and anglewith gravity Then the three channel images are sentinto depth-specific CNN to extract deep depth fea-tures The strong correlation between RGB and depthmodality can be learnt by RGB-Depth correlatedCNN In contrast to only using depth informationas one channel in many existing RGB-D trackerswe can obtain more useful information for trackingsuch as geometrical features and spatial structureinformation by encoding the depth image into threechannels

(iii) To evaluate the performance of our proposed RGB-Dtracker we conduct extensive experiments on the tworecent challenging RGB-D benchmark datasets thelarge-scale Princeton RGB-D Tracking Benchmark(PTB) Dataset [13] and the University of BirminghamRGB-D Tracking Benchmark (BTB) [15] The exper-imental results show that our proposed approach issuperior to the state-of-the-art RGB-D trackers

The remainder of this paper is organized as follows Therelated work is discussed in Section 2 Section 3 describesthe overview and the details of our proposed method InSection 4 we demonstrate experimental results to evaluateour proposed RGB-D tracker We conclude our work inSection 5

2 Related Work

21 RGB-D Object Tracking With the emergence of RGB-Dsensors there has been great interest in visual object trackingusing RGB-D data [15ndash17] to improve tracking performancesince depthmodality can provide useful information comple-mentary to RGB modality

ARGB-D trackingmethod using depth scaling kernelisedcorrelation filters and occlusion handling was proposed in[18] In [19] authors used Haar-like features and HOGfeatures based on RGB and depth to form a boosting trackingapproach Zheng et al [20] presented an object tracker basedon sparse representation of depth image Hannuna et alproposed a RGB-D tracker built upon the KCF trackerthey exploit depth information to handle scale changesocclusions and shape changes

Although the existing RGB-D trackers have made greatcontributions to promote RGB-D tracking most of themused hand-crafted features and fused simply RGB and depthinformation ignored the strong correlation between RGBand depth modality

22 Deep RGB Features Owing to the superiority in featureextraction CNN has been increasingly used in RGB trackers[21 22] The CNN includes a number of convolutional layerpooling layer and fully connected layer the features atdifferent layers have different properties The higher layerscan capture the senmantic features and the lower layers cancapture the spatial features and they are both important inthe tracking problem [23 24]

In [25] Song et al proposed the CREST algorithm whichtreated the correlation filter as one convolutional layer andapplied residual learning to capture appearance changes C-COT was presented in [26] which employed an implicitinterpolation model to solve the learning problem in thecontinuous spatial domain Zhu et al [27] proposed a trackernamed UCT which designed an end-to-end framework tolearn the convolutional features and perform the trackingprocess simultaneously

All these trackers only considered that RGB appearancedeep features in current frame cannot benefit from geomet-rical features extracted from depth image and interframedynamic information Thus it is important to get rid of thisproblem by fusing deep features from RGB-D data and deepmotion features

23 Deep Depth Features In the recent years deep depthfeatures have received a lot of attention in object recognition[28 29] object detection [30 31] indoor semantic segmenta-tion [32 33] etc Unfortunately few existing RGB-D trackersuse CNN to extract deep depth features to improve thetracking performance In [34] Jiang et al proposed a RGB-D tracker based on cross-modality deep learning in whichGaussian-Bernoulli deep BoltzmannMachines (DBMs) wereadopted to extract the features of RGB and depth image Asfar as the drawbacks of DBMs are concerned one of the mostimportant ones is the high computational cost of inference

In recent years CNN has witnessed great success incomputer vision In this context we will focus on how to useCNN to fuse deep depth feature with deep RGB and motionfeatures effectively for RGB-D tracker

24 Deep Motion Features Deep motion feature has beensuccessfully applied for action recognition [35] and videoclassification [36] but it is rarely applied to visual tracking

Complexity 3

RGB image

Disparity

Height

Angle

HHA encoding

Fused

Deep

features

Conv1_2Conv2_2 Conv3_3 Conv4_3 Conv5_3

Conv1_2 Conv2_2 Conv3_3 Conv4_3 Conv5_3

Conv1_2 Conv2_2 Conv3_3Conv4_3 Conv5_3

Conv3_3 Conv4_3 Conv5_3

C-COT Tracker

Optical flow image

Motion-specific CNN

RGB-specific CNN

RGB-Depth correlated CNN

Depth-specific CNNDepth image

Figure 1 Our MMDFF model for RGB-D Tracking

Most existing tracking methods only extract appearance fea-tures ignore motion information In [37] Zhu et al proposedan end-to-end flow correlation tracker which focuses onmaking use of the rich flow information in consecutiveframes to improve the feature representation and the trackingaccuracy Danelljan [38] investigated the influence of deepmotion features in a RGB tracker But they have not takendepth information into account

To the best of our knowledge deep motion features areyet to be applied for RGB-D tracking In this paper wewill discuss how to fuse deep motion features with RGBappearance features and geometrical features provided bydepth image to improve the performance of RGB-D tracking

3 RGB-D Tracking Based on MMDFF

31 Multimodal Deep Feature Fusion (MMDFF) Model Inthis section a novel MMDFF model is proposed for RGB-D tracking aiming at fusing deep motion features withstatic appearance and geometrical features provided by RGBand depth data The overall architecture of our approachis illustrated as Figure 1 and our end-to-end MMDFFmodel is composed of four deep CNNmotion-specific CNNRGB-specific CNN depth-specific CNN and RGB-Depthcorrelated CNN A final fully connected (FC) fusion layer isproposed to effectively fuse deep RGB-D features and deepmotion features then the fused deep features are sent intothe C-COT tracker and the tracking result can be obtainedat last

As described in Section 1 CNN has been shown to signif-icantly outperform traditional machine learning approachesin a wide range of computer vision tasks It has been widelyacknowledged that CNN features extracted from different

layers play different important roles in the tracking A lowerlayer captures spatial detailed features which is helpfulto precisely localize the target object at the same timea higher layer provides semantic features which is robustto occlusion and deformation In our MMDFF model weseparately adopt hierarchical convolutional features extractedfrom RGB-specific CNN and depth-specific CNN To bemore specific two independent CNN networks are adoptedthe RGB-specific CNN is for RGB data and the depth-specificCNN is for depth features And the features on Conv3-3 andConv5-3 in the two CNNs are sent to the highest fusing FClayer in our experiments

It is believed that more features extracted from differentmodalities can be helpful to accurately descript the objectsand improve the tracking performance As abovementionedmost of existing RGB-D trackers directly concatenate featuresextracted from RGB and depth modalities not adequatelyexploiting the correlation between the two modalities Inour method the strong correlation between RGB and depthmodality can be learnt by RGB-Depth correlated CNN

For human visual system geometrical and spatial struc-ture information plays an important role in tracking objectsIn order to more explicitly derive geometrical and spatialstructure information from depth data we encode it intothree channels horizontal disparity height above groundand angle with gravity using the encoding approach pro-posed in [39] Then the three channel image is sent intodepth-specific CNN to extract deep depth features

The optical flow image is calculated for every frameand then is fed to motion-specific CNN to learn deepmotion features which can capture high-level informationabout the movement of the object A pretrained optical flownetwork provided by [13] is used as ourmotion-specificCNN

4 Complexity

Table 1 Comparison results SR using different deep features fusions on the PTB dataset

Deep Features All SR Target type Target size Movement Occlusion Motion typehuman animal rigid large small slow fast yes no passive active

Only RGB 073 075 069 074 080 072 071 072 074 085 078 073RGB+Depth 077 079 071 077 081 077 076 073 081 086 079 075RGB+Depth +RGB-depth correlated 079 081 074 079 083 079 078 074 083 086 081 077

RGB+Depth +RGB-depth correlated+Motion

084 083 086 085 085 086 082 083 087 087 082 083

which is pretrained on the UCF101 dataset and includes fiveconvolutional layers

Thus far we have obtained multimodal deep features torepresent the rich information of the object including theRGB horizontal disparity height above ground angle withgravity and motion Next we attempt to explore how to fusethe multimodal deep features using the CNN To solve thisproblem we conduct extensive experiments to evaluate theperformance using different fusion schemes each experimentfuses the multimodal deep features at different layer Inspiredby the working mechanism of human visual cortex in thebrain which indicates the features should be fused in the highlevel so we test fusing at several relatively high layers such aspool 5 fc 6 and fc 7 We find that fusing the multimodal deepfeatures from fc 6 and fc 7 can obtain better performance

Let 119901119895119894 represent feature map of 119895modalities and 119894 denotesthe spatial position 119895 = 1 2 119869 In our paper 119869 = 4as we adopt the feature maps from Conv3-3 and Conv5-3in RGB depth RGB-depth correlated and motion modalityThe fusing feature map 119875119891119906sion119894 is the weighted sum of featuremaps for three levels in four modalities

119875119891119906119904119894119900119899119894 = 119869sum119895=1

119908119895119894119901119895119894 (1)

where the weigh 119908119895119894 can be computed as follows

119908119895119894 = exp (p119895119894 )sum119869119896=1 exp (p119896119894 )

(2)

32 C-COTTracker Themultimodal fusion deep features aresent into the C-COT tracker which was proposed in [26]A brief review of the C-COT tracker will be provided in thefollowing of this section and we will use the same symbols asin [33] for convenience and more detailed description andproofs can be found in [26]

The C-COT transfers multimodal the fusion feature mapto the continuous spatial domain 119905 isin [0T) by defining theinterpolation operator 119869119889 R119873119889 997888rarr 1198712(T) as follows

119869119889 119909119889 (119905) =119873119889minus1sum119899=0

119909119889 [119899] 119887119889 (119905 minus 119879119873119889 119899) (3)

The convolution operator is defined as

119878119891 119909 = 119891 lowast 119869 119909 minus 119863sum119889=1

119891119889 lowast 119869119889 119909119889 (4)

The objective function is

119864 (119891) = 119872sum119895=1

120572119895 10038171003817100381710038171003817119878119891 119909119895 minus 1199101198951003817100381710038171003817100381721198712 minus119863sum119889=1

100381710038171003817100381710038171205961198911198891003817100381710038171003817100381721198712 (5)

Equation (5) can be minimized in the Fourier domain tolearn the filter The following can be obtained by applyingParsevalrsquos formula to (5)

119864 (119891) = 119872sum119895=1

1205721198951003817100381710038171003817100381710038171003817100381710038171003817119863sum119889=1

1006704119891119889119883119889119895 1006704119887119889 minus 11991011989510038171003817100381710038171003817100381710038171003817100381710038172

1198712

minus 119863sum119889=1

10038171003817100381710038171003817 lowast 1198911198891003817100381710038171003817100381721198712 (6)

The desired convolution output 119910119895 can be provide by thefollowing expression

119910119895 [119896] = radic21205871205902119879 exp(minus21205902 (120587119896119879 )2 minus 1198942120587119879 119906119895119896) (7)

4 Experimental Results

The experiments are conducted on two challenging bench-mark datasets BTB dataset [17] with 36 videos and PTBdataset [7] with 100 videos to test the proposed RGB-Dtracker using MATLAB R2016b platform with the Caffetoolbox [40] on a PC with an Intel(R) Core(TM) i7-4712MQCPU340GHz (with 16G memory) and TITAN GPU (1200GB memory)

41 Deep Features Comparison The impacts of deep motionfeatures are evaluated by conducting different experimentson PTB dataset Table 1 indicates the comparison resultsof different deep features fusions using success rate (SR) asmeasurements FromTable 1 we find that SRs are lowestwhenonly using deep RGB feautures SRs increase by fusing deepdepth features and RGB-depth correlated features especiallywhen occlusion occurs Further the performance is obviouslyimproved when deep motion features are added especiallywhen the movement is fast motion type is active and targetsize is small

Complexity 5

Figure 2 The comparison results on athlete move video from BTB dataset

Figure 3 The comparison results on cup book video from PTB dataset

Figure 4 The comparison results on zballpat no1 video from PTB dataset

To intuitively show the contribution of fusing deep depthfeatures and deep motion features for RGB-D tracking in thefollowing we demonstrate part of experimental results onthe BTB dataset and PTB dataset that only using deep RGBfeatures obtains unsatisfactory performance The trackingresult only using deep RGB features is in yellow fusing deepRGB and depth is in blue adding deep motion features toRGB and depth is in red

In Figure 2 the major challenge of the athlete move videois that the target object moves fast from left to right Asillustrated in Figure 3 the cup is fully occluded by the bookThe zballpat no1 sequence is challenging due to the changeof moving direction (Figure 4) The deep motion features areable to improve the tracking performance as they can exploitthe motion patterns of the target object

42 Quantitative Evaluation We present the results of com-paring our proposed tracker with four state-of-the-art RGB-D trackers on the BTB dataset and PTB dataset Prin Tracker[7] (2013) DS-KCFlowast Tracker [41] (2016) GBM Tracker [34](2017) and Berming Tracker [17] (2018) We provide the

results in terms of success rate (SR) and area-under-curve(AUC) as measurements

Figure 5 shows the comparison results of SR of differenttrackers on the PTB dataset The results illustrates that theoverall SR of our tracker is 87 SR is 86 when theobject moves fast and SR is 84 when the motion type isactive These values are higher than other trackers obviouslyespecially when the object moves fast

Figure 6 indicates the AUC comparison results on theBTB dataset The overall AUC of our tracker is 930 the AUCis 984 when the camera is stationary and the AUC is 827when the camera is moving

From Figures 5-6 we can see that our tracker obtains thebest performance especially when the object is moving fastor the camera ismovingThese results show that deepmotionfeature is helpful to improve the tracking performance

5 Conclusion

We study the problems in the existing visual object trackingalgorithms and find that existing trackers cannot benefit

6 Complexity

1

09

08

07

06

05

04

03

02

01

0

Succ

ess R

ate (

SR)

Overall FastMove

ActiveMotion

Our TrackerPrin TrackerDS-KC

lowast TrackerGBM TrackerBerming Tracker

Figure 5 The comparison results of SR on the PTB dataset

10

9

8

7

6

5

4

3

2

1

0

Are

a Und

er C

urve

s (AU

Cs)

Over All Stationary Moving



Figure 6 The comparison results of AUC on the BTB dataset

from geometrical features extracted from depth image andinterframe dynamic information by fusing deep RGB depthand motion information We propose MMDFF model tosolve these problems In this model RGB depth and motioninformation are fused at multiple layers via CNN the corre-lated relationship betweenRGB anddepthmodality exploitedby using RGB-Depth correlated CNN The experimentalresults show that deep depth feature and deep motion featureprovide complementary information to RGB data and thefusing strategy promotes the tracking performance signifi-cantly especially when occlusion occurs the movement isfast motion type is active and target size is small

Data Availability

The data used to support the findings of this study areavailable from the corresponding author upon request

Conflicts of Interest

The authors declare no conflicts of interest

Acknowledgments

This work was supported by the National Key RampD projectunder Grant no 2017YFB1002803Major Program of NaturalScience Foundation of the Higher Education Institutionsof Jiangsu Province under Grant no 18KJA520002 Projectfunded by the Jiangsu Laboratory of Lake EnvironmentRemote Sensing Technologies under Grant no JSLERS-2018-005 Six-Talent Peak Project in Jiangsu Province under Grantno 2016XYDXXJS-012 the Natural Science Foundation ofJiangsu Province under Grant no BK20171267 533 TalentsEngineering Project in Huaian under Grant no HAA201738project funded by Jiangsu Overseas Visiting Scholar ProgramforUniversity Prominent YoungampMiddle-aged Teachers andPresidents and the fifth issue 333 high-level talent trainingproject of Jiangsu province

References

[1] S Duffner and C Garcia ldquoUsing Discriminative Motion Con-text for Online Visual Object Trackingrdquo IEEE Transactions onCircuits and Systems for Video Technology vol 26 no 12 pp2215ndash2225 2016

[2] X Zhang G-S Xia Q Lu W Shen and L Zhang ldquoVisualobject tracking by correlation filters and online learningrdquo ISPRSJournal of Photogrammetry and Remote Sensing 2017

[3] E DiMarco S P Gray and K Jandeleit-Dahm ldquoDiabetes altersactivation and repression of pro- and anti-inflammatory signal-ing pathways in the vasculaturerdquoFrontiers in Endocrinology vol4 p 68 2013

[4] M Jiang Z Tang and L Chen ldquoTracking multiple targetsbased on min-cost network flows with detection in RGB-D datardquo International Journal of Computational Sciences andEngineering vol 15 no 3-4 pp 330ndash339 2017

[5] K Chen W Tao and S Han ldquoVisual object tracking viaenhanced structural correlation filterrdquo Information Sciences vol394-395 pp 232ndash245 2017

[6] M Jiang D Wang and T Qiu ldquoMulti-person detecting andtracking based on RGB-D sensor for a robot vision systemrdquoInternational Journal of Embedded Systems vol 9 no 1 pp 54ndash60 2017

[7] S Song and J Xiao ldquoTracking Revisited Using RGBD CameraUnified Benchmark and Baselinesrdquo in Proceedings of the 2013IEEE International Conference on Computer Vision (ICCV) pp233ndash240 Sydney Australia December 2013

[8] N An X-G Zhao and Z-G Hou ldquoOnline RGB-D trackingvia detection-learning-segmentationrdquo in Proceedings of the 23rdInternational Conference on Pattern Recognition ICPR 2016 pp1231ndash1236 Mexico December 2016

[9] D Michel A Qammaz and A A Argyros ldquoMarkerless 3Dhuman pose estimation and tracking based on RGBD camerasAn experimental evaluationrdquo in Proceedings of the 10th ACMInternational Conference on PErvasive Technologies Related toAssistive Environments PETRA 2017 pp 115ndash122 Greece June2017

[10] AWang J Cai J Lu andT-J Cham ldquoModality and componentaware feature fusion for RGB-D scene classificationrdquo inProceed-ings of the 2016 IEEEConference onComputerVision and PatternRecognition CVPR 2016 pp 5995ndash6004 USA July 2016

Complexity 7

[11] S Hou Z Wang and F Wu ldquoObject detection via deeplyexploiting depth informationrdquo Neurocomputing vol 286 pp58ndash66 2018

[12] A Dosovitskiy P Fischery E Ilg et al ldquoFlowNet Learningoptical flow with convolutional networksrdquo in Proceedings of the15th IEEE International Conference on Computer Vision ICCV2015 pp 2758ndash2766 Chile December 2015

[13] G Gkioxari and J Malik ldquoFinding action tubesrdquo in Proceedingsof the IEEE Conference on Computer Vision and Pattern Recog-nition CVPR 2015 pp 759ndash768 USA June 2015

[14] S Karen and Z Andrew ldquoTwo-StreamConvolutional Networksfor Action Recognition in Videos [C]rdquo Andrew Z Two-StreamConvolutional Networks for Action Recognition in Videos [C]NIPS 2014

[15] L Spinello M Luber and K O Arras ldquoTracking people in 3Dusing a bottom-up top-down detectorrdquo in Proceedings of the2011 IEEE International Conference on Robotics and Automation(ICRA) pp 1304ndash1310 Shanghai China May 2011

[16] K Meshgi S-I Maeda S Oba H Skibbe Y-Z Li andS Ishii ldquoAn occlusion-aware particle filter tracker to handlecomplex and persistent occlusionsrdquoComputer Vision and ImageUnderstanding vol 150 pp 81ndash94 2016

[17] J Xiao R Stolkin Y Gao and A Leonardis ldquoRobust Fusionof Color and Depth Data for RGB-D Target Tracking UsingAdaptive Range-Invariant Depth Models and Spatio-TemporalConsistency Constraintsrdquo IEEE Transactions on Cybernetics2017

[18] MCamplani SHannunaMMirmehdi et al ldquoReal-timeRGB-D Tracking with Depth Scaling Kernelised Correlation Filtersand Occlusion Handlingrdquo in Proceedings of the British MachineVision Conference 2015 pp 1451-14511 Swansea

[19] M Luber L Spinello andK OArras ldquoPeople tracking in RGB-D data with on-line boosted target modelsrdquo in Proceedings ofthe 2011 IEEERSJ International Conference on Intelligent Robotsand Systems Celebrating 50 Years of Robotics IROSrsquo11 pp 3844ndash3849 USA September 2011

[20] W-L Zheng S-C Shen and B-L Lu ldquoOnline Depth Image-Based Object Tracking with Sparse Representation and ObjectDetectionrdquoNeural Processing Letters vol 45 no 3 pp 745ndash7582017

[21] H Li Y Li and F Porikli ldquoDeepTrack learning discriminativefeature representations online for robust visual trackingrdquo IEEETransactions on Image Processing vol 25 no 4 pp 1834ndash18482016

[22] L Bertinetto J Valmadre J F Henriques A Vedaldi and PH S Torr ldquoFully-convolutional siamese networks for objecttrackingrdquo LectureNotes in Computer Science (including subseriesLecture Notes in Artificial Intelligence and Lecture Notes inBioinformatics) Preface vol 9914 pp 850ndash865 2016

[23] LWangWOuyang XWang andH Lu ldquoVisual tracking withfully convolutional networksrdquo in Proceedings of the 15th IEEEInternational Conference on Computer Vision ICCV 2015 pp3119ndash3127 Chile December 2015

[24] C Ma J-B Huang X Yang and M-H Yang ldquoHierarchicalconvolutional features for visual trackingrdquo in Proceedings of the15th IEEE International Conference on Computer Vision ICCV2015 pp 3074ndash3082 Chile December 2015

[25] Y SongCMa LGong J ZhangRWH Lau andM-HYangldquoCREST Convolutional Residual Learning for Visual Trackingrdquoin Proceedings of the 16th IEEE International Conference on

Computer Vision ICCV 2017 pp 2574ndash2583 Italy October2017

[26] M Danelljan A Robinson F Shahbaz Khan and M FelsbergldquoBeyondCorrelation Filters LearningContinuousConvolutionOperators for Visual Trackingrdquo in Computer Vision ndash ECCV2016 vol 9909 of Lecture Notes in Computer Science pp 472ndash488 Springer International Publishing Cham 2016

[27] Z Zhu G Huang W Zou D Du and C Huang ldquoUCTLearningUnifiedConvolutionalNetworks for Real-TimeVisualTrackingrdquo in Proceedings of the 2017 IEEE International Confer-ence on Computer Vision Workshop (ICCVW) pp 1973ndash1982Venice October 2017

[28] Y Cheng X Zhao K Huang and T Tan ldquoSemi-supervisedlearning and feature evaluation for RGB-D object recognitionrdquoComputer Vision and Image Understanding vol 139 pp 149ndash160 2015

[29] A Eitel J T Springenberg L Spinello M Riedmiller andW Burgard ldquoMultimodal deep learning for robust RGB-Dobject recognitionrdquo in Proceedings of the IEEERSJ InternationalConference on Intelligent Robots and Systems IROS 2015 pp681ndash687 Germany October 2015

[30] J Hoffman S Gupta J Leong S Guadarrama and T DarrellldquoCross-modal adaptation for RGB-D detectionrdquo in Proceedingsof the 2016 IEEE International Conference on Robotics andAutomation ICRA 2016 pp 5032ndash5039 Sweden May 2016

[31] X Xu Y Li G Wu and J Luo ldquoMulti-modal deep featurelearning for RGB-D object detectionrdquo Pattern Recognition vol72 pp 300ndash313 2017

[32] Y Cheng R Cai Z Li X Zhao and K Huang ldquoLocality-sensitive deconvolution networks with gated fusion for RGB-Dindoor semantic segmentationrdquo in Proceedings of the 30th IEEEConference on Computer Vision and Pattern Recognition CVPR2017 pp 1475ndash1483 USA July 2017

[33] S Lee S-J Park and K-S Hong ldquoRDFNet RGB-DMulti-levelResidual Feature Fusion for Indoor Semantic Segmentationrdquoin Proceedings of the 16th IEEE International Conference onComputer Vision ICCV 2017 pp 4990ndash4999 Italy October2017

[34] M Jiang Z Pan and Z Tang ldquoVisual object tracking based oncross-modality Gaussian-Bernoulli deep Boltzmann machineswith RGB-D sensorsrdquo Sensors vol 17 no 1 2017

[35] G Cheron I Laptev andC Schmid ldquoP-CNN Pose-basedCNNfeatures for action recognitionrdquo in Proceedings of the 15th IEEEInternational Conference on Computer Vision ICCV 2015 pp3218ndash3226 Chile December 2015

[36] A Chadha A Abbas and Y Andreopoulos ldquoVideo Classifi-cation With CNNs Using The Codec As A Spatio-TemporalActivity Sensorrdquo IEEE Transactions on Circuits and Systems forVideo Technology 2017

[37] Z Zhu W Wu and W Zou ldquoEnd-to-end Flow CorrelationTracking with Spatial-temporal Attention[J]rdquo in Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recogni-tion 2018

[38] M Danelljan G Bhat S Gladh F S Khan and M FelsbergldquoDeepmotion and appearance cues for visual trackingrdquo PatternRecognition Letters 2018

[39] S Gupta R Girshick P Arbelaez and J Malik ldquoLearningrich features from RGB-D images for object detection andsegmentationrdquo Lecture Notes in Computer Science (includingsubseries LectureNotes in Artificial Intelligence and LectureNotesin Bioinformatics) Preface vol 8695 no 7 pp 345ndash360 2014

8 Complexity

[40] Y Jia E Shelhamer J Donahue et al ldquoCaffe convolutionalarchitecture for fast feature embeddingrdquo in Proceedings of theACM Conference on Multimedia (MM rsquo14) pp 675ndash678 ACMOrlando Fla USA November 2014

[41] S Hannuna M Camplani J Hall et al ldquoDS-KCF a real-timetracker for RGB-D datardquo Journal of Real-Time Image Processingpp 1ndash20 2016

Hindawiwwwhindawicom Volume 2018

MathematicsJournal of


Mathematical Problems in Engineering

Applied MathematicsJournal of


Probability and StatisticsHindawiwwwhindawicom Volume 2018

Journal of


Mathematical PhysicsAdvances in

Complex AnalysisJournal of


OptimizationJournal of



Engineering Mathematics

International Journal of


Operations ResearchAdvances in

Journal of


Function SpacesAbstract and Applied AnalysisHindawiwwwhindawicom Volume 2018

International Journal of Mathematics and Mathematical Sciences


Hindawi Publishing Corporation httpwwwhindawicom Volume 2013Hindawiwwwhindawicom

The Scientific World Journal

Volume 2018

Hindawiwwwhindawicom Volume 2018Volume 2018

Numerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisNumerical AnalysisAdvances inAdvances in Discrete Dynamics in

Nature and SocietyHindawiwwwhindawicom Volume 2018

Hindawiwwwhindawicom

Dierential EquationsInternational Journal of

Volume 2018


Decision SciencesAdvances in


AnalysisInternational Journal of


Stochastic AnalysisInternational Journal of

Submit your manuscripts atwwwhindawicom

2 Complexity

Ourmotivation is to design a RGB-D object tracker basedon multimodal deep feature fusion Specific emphasis of thispaper is placed on exploring three scientific questions howto fuse deep motion features with static features provided byRGB and depth image how to fuse the deep RGB and depthfeatures sufficiently how to effectively derive geometricalcues and spatial structure information from depth data Insummary the key technical contributions of this study arethree-fold

(i) A novel MMDFFmodel is designed for RGB-D track-ing including four deep CNNmotion-specific CNNRGB- specific CNN Depth-specific CNN and RGB-Depth correlated CNN InMMDFF we can fuse RGBdepth and motion information at multiple layers viaCNN effectively The proposedmethod can separatelyextract RGB and depth features by using RGB-specificCNNandDepth-specificCNNand adequately exploitthe correlated relationship between RGB and depthmodality by using RGB-Depth correlated CNN Atthe same time Motion-specific CNN can provide animportant high-level information about motion fortracking

(ii) The depth image is encoded into three channelshorizontal disparity height above ground and anglewith gravity Then the three channel images are sentinto depth-specific CNN to extract deep depth fea-tures The strong correlation between RGB and depthmodality can be learnt by RGB-Depth correlatedCNN In contrast to only using depth informationas one channel in many existing RGB-D trackerswe can obtain more useful information for trackingsuch as geometrical features and spatial structureinformation by encoding the depth image into threechannels

(iii) To evaluate the performance of our proposed RGB-Dtracker we conduct extensive experiments on the tworecent challenging RGB-D benchmark datasets thelarge-scale Princeton RGB-D Tracking Benchmark(PTB) Dataset [13] and the University of BirminghamRGB-D Tracking Benchmark (BTB) [15] The exper-imental results show that our proposed approach issuperior to the state-of-the-art RGB-D trackers

The remainder of this paper is organized as follows Therelated work is discussed in Section 2 Section 3 describesthe overview and the details of our proposed method InSection 4 we demonstrate experimental results to evaluateour proposed RGB-D tracker We conclude our work inSection 5

2 Related Work

21 RGB-D Object Tracking With the emergence of RGB-Dsensors there has been great interest in visual object trackingusing RGB-D data [15ndash17] to improve tracking performancesince depthmodality can provide useful information comple-mentary to RGB modality

ARGB-D trackingmethod using depth scaling kernelisedcorrelation filters and occlusion handling was proposed in[18] In [19] authors used Haar-like features and HOGfeatures based on RGB and depth to form a boosting trackingapproach Zheng et al [20] presented an object tracker basedon sparse representation of depth image Hannuna et alproposed a RGB-D tracker built upon the KCF trackerthey exploit depth information to handle scale changesocclusions and shape changes

Although the existing RGB-D trackers have made greatcontributions to promote RGB-D tracking most of themused hand-crafted features and fused simply RGB and depthinformation ignored the strong correlation between RGBand depth modality

22 Deep RGB Features Owing to the superiority in featureextraction CNN has been increasingly used in RGB trackers[21 22] The CNN includes a number of convolutional layerpooling layer and fully connected layer the features atdifferent layers have different properties The higher layerscan capture the senmantic features and the lower layers cancapture the spatial features and they are both important inthe tracking problem [23 24]

In [25] Song et al proposed the CREST algorithm whichtreated the correlation filter as one convolutional layer andapplied residual learning to capture appearance changes C-COT was presented in [26] which employed an implicitinterpolation model to solve the learning problem in thecontinuous spatial domain Zhu et al [27] proposed a trackernamed UCT which designed an end-to-end framework tolearn the convolutional features and perform the trackingprocess simultaneously

All these trackers only considered that RGB appearancedeep features in current frame cannot benefit from geomet-rical features extracted from depth image and interframedynamic information Thus it is important to get rid of thisproblem by fusing deep features from RGB-D data and deepmotion features

23 Deep Depth Features In the recent years deep depthfeatures have received a lot of attention in object recognition[28 29] object detection [30 31] indoor semantic segmenta-tion [32 33] etc Unfortunately few existing RGB-D trackersuse CNN to extract deep depth features to improve thetracking performance In [34] Jiang et al proposed a RGB-D tracker based on cross-modality deep learning in whichGaussian-Bernoulli deep BoltzmannMachines (DBMs) wereadopted to extract the features of RGB and depth image Asfar as the drawbacks of DBMs are concerned one of the mostimportant ones is the high computational cost of inference

In recent years CNN has witnessed great success incomputer vision In this context we will focus on how to useCNN to fuse deep depth feature with deep RGB and motionfeatures effectively for RGB-D tracker

24 Deep Motion Features Deep motion feature has beensuccessfully applied for action recognition [35] and videoclassification [36] but it is rarely applied to visual tracking

Complexity 3

RGB image

Disparity

Height

Angle

HHA encoding

Fused

Deep

features





C-COT Tracker

Optical flow image

Motion-specific CNN

RGB-specific CNN













4 Complexity





084 083 086 085 085 086 082 083 087 087 082 083




119875119891119906119904119894119900119899119894 = 119869sum119895=1

119908119895119894119901119895119894 (1)


119908119895119894 = exp (p119895119894 )sum119869119896=1 exp (p119896119894 )

(2)



119869119889 119909119889 (119905) =119873119889minus1sum119899=0

119909119889 [119899] 119887119889 (119905 minus 119879119873119889 119899) (3)


119878119891 119909 = 119891 lowast 119869 119909 minus 119863sum119889=1

119891119889 lowast 119869119889 119909119889 (4)


119864 (119891) = 119872sum119895=1

120572119895 10038171003817100381710038171003817119878119891 119909119895 minus 1199101198951003817100381710038171003817100381721198712 minus119863sum119889=1

100381710038171003817100381710038171205961198911198891003817100381710038171003817100381721198712 (5)


119864 (119891) = 119872sum119895=1

1205721198951003817100381710038171003817100381710038171003817100381710038171003817119863sum119889=1

1006704119891119889119883119889119895 1006704119887119889 minus 11991011989510038171003817100381710038171003817100381710038171003817100381710038172

1198712

minus 119863sum119889=1

10038171003817100381710038171003817 lowast 1198911198891003817100381710038171003817100381721198712 (6)






Complexity 5











5 Conclusion


6 Complexity

1

09

08

07

06

05

04

03

02

01

0

Succ

ess R

ate (

SR)

Overall FastMove

ActiveMotion




10

9

8

7

6

5

4

3

2

1

0

Are

a Und

er C

urve

s (AU

Cs)






Data Availability




Acknowledgments


References











Complexity 7































8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018








Complexity 3

RGB image

Disparity

Height

Angle

HHA encoding

Fused

Deep

features





C-COT Tracker

Optical flow image

Motion-specific CNN

RGB-specific CNN













4 Complexity





084 083 086 085 085 086 082 083 087 087 082 083




119875119891119906119904119894119900119899119894 = 119869sum119895=1

119908119895119894119901119895119894 (1)


119908119895119894 = exp (p119895119894 )sum119869119896=1 exp (p119896119894 )

(2)



119869119889 119909119889 (119905) =119873119889minus1sum119899=0

119909119889 [119899] 119887119889 (119905 minus 119879119873119889 119899) (3)


119878119891 119909 = 119891 lowast 119869 119909 minus 119863sum119889=1

119891119889 lowast 119869119889 119909119889 (4)


119864 (119891) = 119872sum119895=1

120572119895 10038171003817100381710038171003817119878119891 119909119895 minus 1199101198951003817100381710038171003817100381721198712 minus119863sum119889=1

100381710038171003817100381710038171205961198911198891003817100381710038171003817100381721198712 (5)


119864 (119891) = 119872sum119895=1

1205721198951003817100381710038171003817100381710038171003817100381710038171003817119863sum119889=1

1006704119891119889119883119889119895 1006704119887119889 minus 11991011989510038171003817100381710038171003817100381710038171003817100381710038172

1198712

minus 119863sum119889=1

10038171003817100381710038171003817 lowast 1198911198891003817100381710038171003817100381721198712 (6)






Complexity 5











5 Conclusion


6 Complexity

1

09

08

07

06

05

04

03

02

01

0

Succ

ess R

ate (

SR)

Overall FastMove

ActiveMotion




10

9

8

7

6

5

4

3

2

1

0

Are

a Und

er C

urve

s (AU

Cs)






Data Availability




Acknowledgments


References











Complexity 7































8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018








4 Complexity





084 083 086 085 085 086 082 083 087 087 082 083




119875119891119906119904119894119900119899119894 = 119869sum119895=1

119908119895119894119901119895119894 (1)


119908119895119894 = exp (p119895119894 )sum119869119896=1 exp (p119896119894 )

(2)



119869119889 119909119889 (119905) =119873119889minus1sum119899=0

119909119889 [119899] 119887119889 (119905 minus 119879119873119889 119899) (3)


119878119891 119909 = 119891 lowast 119869 119909 minus 119863sum119889=1

119891119889 lowast 119869119889 119909119889 (4)


119864 (119891) = 119872sum119895=1

120572119895 10038171003817100381710038171003817119878119891 119909119895 minus 1199101198951003817100381710038171003817100381721198712 minus119863sum119889=1

100381710038171003817100381710038171205961198911198891003817100381710038171003817100381721198712 (5)


119864 (119891) = 119872sum119895=1

1205721198951003817100381710038171003817100381710038171003817100381710038171003817119863sum119889=1

1006704119891119889119883119889119895 1006704119887119889 minus 11991011989510038171003817100381710038171003817100381710038171003817100381710038172

1198712

minus 119863sum119889=1

10038171003817100381710038171003817 lowast 1198911198891003817100381710038171003817100381721198712 (6)






Complexity 5











5 Conclusion


6 Complexity

1

09

08

07

06

05

04

03

02

01

0

Succ

ess R

ate (

SR)

Overall FastMove

ActiveMotion




10

9

8

7

6

5

4

3

2

1

0

Are

a Und

er C

urve

s (AU

Cs)






Data Availability




Acknowledgments


References











Complexity 7































8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018








Complexity 5











5 Conclusion


6 Complexity

1

09

08

07

06

05

04

03

02

01

0

Succ

ess R

ate (

SR)

Overall FastMove

ActiveMotion




10

9

8

7

6

5

4

3

2

1

0

Are

a Und

er C

urve

s (AU

Cs)






Data Availability




Acknowledgments


References











Complexity 7































8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018








6 Complexity

1

09

08

07

06

05

04

03

02

01

0

Succ

ess R

ate (

SR)

Overall FastMove

ActiveMotion




10

9

8

7

6

5

4

3

2

1

0

Are

a Und

er C

urve

s (AU

Cs)






Data Availability




Acknowledgments


References











Complexity 7































8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018








Complexity 7































8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018








8 Complexity










Journal of












Journal of







Volume 2018






Volume 2018















Journal of












Journal of







Volume 2018






Volume 2018








Documents

Multimodal Deep Feature Fusion (MMDFF) for RGB-D Tracking