
Real-time Object Tracking on Mobile Phones

Juan Lei
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Email: [email protected]

Youji Feng
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Email: [email protected]

Lixin Fan
Nokia Research Center
Email: [email protected]

Yihong Wu
National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
Email: [email protected]

Abstract—For a mobile landmark recognition and Augmented Reality system, real-time object tracking is crucial for the user experience. In this paper, an effective natural feature (low-dimensional Haar-like feature) based method is proposed for real-time object tracking on mobile phones. The method can cope with image variations and alleviate the drift problem. The main idea is to use an online two-template switching strategy for estimating the geometrical transformation. Experiments on a NOKIA N900 phone show the good performance of the proposed method and support our claim that the tracking method is effective and robust.

I. INTRODUCTION

Augmented Reality (AR) is a technology which allows computer-generated virtual objects to be combined with physical objects in real time. In recent years, AR has made significant progress in many domains [1], such as entertainment, maintenance, and education.

With the increase of processing capabilities and the integration of a camera, high-resolution touch screen, GPS, and compass, the mobile phone has become an ideal platform for AR applications ([2]–[4]). Most mobile AR applications overlay virtual objects on the live video sequence captured by the mobile phone’s camera.

As a popular type of mobile AR system, a mobile landmark recognition and AR system provides a mobile tourist guide for travelers ([5]–[7]). A server-client architecture is adopted in which the demanding recognition task is outsourced to the server while real-time object tracking runs on the mobile client. An illustration of such a server-client system is shown in Figure 1. First, the user specifies a region of interest in the current video frame captured by the mobile phone. Next, the specified image region is sent to a server, which stores a visual database for image matching, while a procedure running on the mobile phone tracks the specified region. When the server finds images corresponding to the region, it sends the related information back to the mobile phone over wireless links. Finally, the related information is accurately overlaid on the region.

Since the position of the augmented virtual information on the screen has to be updated according to the camera movement, a real-time, accurate 2D tracking method is a crucial component for a high-quality user experience. In this paper, an effective, real-time method is proposed for object tracking on mobile phones for 2D augmentation. Specifically, an online two-template switching strategy is used for estimating the geometrical transformation. Experiments are carried out on a NOKIA N900 phone. The results support our claim that the method is real-time and effective. It is also shown that the proposed method can cope with object appearance variations caused by scale changes, rotation, viewpoint changes, or partial occlusions, and alleviate the drift problem.

Fig. 1. The work-flow of a mobile landmark recognition and AR system

The rest of this paper is organized as follows. In section 2, related work is reviewed. In section 3, the proposed tracking method is described in detail, and in section 4 the implementation optimizations for mobile phone programming are summarized. Experiments on real scenes are reported in section 5, and we conclude the paper in section 6.

II. RELATED WORK

Tracking an object in a video sequence means finding the object’s location in every frame [8]. For mobile AR systems, two main kinds of vision-based tracking methods, marker-based ([9], [10]) and natural feature based ([11], [12]), have been developed. Due to its convenience and flexibility, the natural feature based approach is the proper choice for a mobile landmark recognition and AR system.

Although much effort has been devoted to tracking research, it is still difficult to design a tracking method that can cope with variations such as scale changes, rotation, viewpoint or illumination changes, and partial occlusions. To improve tracking performance, a number of methods have been proposed in the literature. Grabner et al. [13] demonstrated that the tracking problem can be considerably simplified by using online feature selection. [14]–[17] update the initial template such that new object appearances are properly incorporated in the matching stage. Since every template update can be slightly misaligned from the previous one, the accumulated errors will eventually lead to a tracking failure. Matthews et al. [18] proposed a drift correction strategy to align the updated template with the initial one. A geometric model for verification is used in [19], in which the template is updated only when the geometric model is verified. Inspired by these methods, we propose a novel two-template switching strategy for tracking an object on a mobile phone which can cope with the above-mentioned variations.

978-1-4577-0121-4/11/$26.00 ©2011 IEEE

III. TRACKING METHOD

Our tracking method aims to estimate a similarity transformation S_t (t = 0, 1, ..., n) from the normalized image of a template to the object region in each video frame (the bounding box of the object region is determined by transforming the four vertices of the normalized image of the template using the estimated similarity transformation). The template consists of a normalized image and a set of reference features. Two templates (an initial fixed one and a dynamically updated one) are employed to deal with viewpoint changes. They are switched according to the tracking quality. Thanks to the stored initial template, the drift problem can be alleviated whenever the incoming frame is similar to the initial frame.

Figure 2 illustrates the flow chart of the proposed tracking method. An initialization stage is first carried out to obtain the initial template and similarity transformation. Meanwhile, the second template is set to null. For each incoming video frame, features are extracted from a normalized image of the tracked region in this frame and matched against the reference features in the two templates to produce two sets of feature correspondences. The set with the larger number of correspondences is then used to estimate the similarity transformation. Finally, the tracking quality is measured by the ratio (denoted as r) of the matched features to the reference features of the chosen template. According to the tracking quality, one of three operations (template switching, recovery, or grabbing a new frame directly) is carried out. Further details of the method are discussed below.
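The three-way decision driven by the ratio r can be sketched as follows. This is an illustrative reading of the flow chart, not the paper's code; the default thresholds are the T_l = 0.35 and T_h = 0.55 reported in the experiments section, and the action names are our own.

```python
def quality_action(n_matched, n_reference, t_low=0.35, t_high=0.55):
    """Map the tracking-quality ratio r to one of the three operations.

    r >= t_high:          tracking is good, just grab the next frame;
    t_low < r < t_high:   quality is normal, update the second template;
    r <= t_low:           tracking is poor, trigger the recovery procedure.
    """
    r = n_matched / n_reference
    if r <= t_low:
        return "recover"
    if r < t_high:
        return "update_second_template"
    return "grab_next_frame"
```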

A. Initialization

When the system is started, the user specifies a bounding box of the target object. Then, the specified region of the current frame is normalized to an image of preset area using a similarity transformation. This transformation is saved as the initial transformation S_0. After that, Haar-like features are extracted from the normalized image and saved as reference features. The normalized image and the Haar-like features form the initial template. Meanwhile, the second template is set to null. Figure 3(a) gives a brief illustration of the normalization in the initialization stage.
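A minimal sketch of the normalization geometry, assuming the preset area of 120 × 120 = 14400 pixels from the experiments section: a uniform scale maps the user's box to a normalized image of that fixed area while preserving the box's aspect ratio. Note the sketch computes the frame-to-normalized-image direction; by the paper's convention S_0 maps the normalized image back to the frame, i.e. the inverse of the transform returned here.

```python
import math

def normalization_transform(x, y, w, h, area=120 * 120):
    """Similarity taking the specified box (x, y, w, h) to a normalized
    image of fixed `area` with the same aspect ratio as the box.

    Returns (s, tx, ty), mapping a point p to s*p + (tx, ty), and the
    normalized image size (W, H).
    """
    s = math.sqrt(area / (w * h))      # uniform scale, so W * H == area
    W, H = round(s * w), round(s * h)  # normalized image dimensions
    return (s, -s * x, -s * y), (W, H)
```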

Fig. 2. The flow chart of the tracking method

Haar-like feature: In our tracking method, points with local maxima of gradient magnitude are extracted as feature points, and the Haar-like feature descriptors of [20] with reduced dimensionality (from 36 to 8) are adopted, which makes a compromise between real-time computation and high distinctiveness. The 8-dimensional descriptor consists of the differences of average intensities between the central block and its 8 adjacent blocks, and can be computed very fast using an integral image.
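The 8-dimensional descriptor can be sketched with an integral image as below. The block size b and the sign convention (neighbour mean minus centre mean) are assumptions; the layout follows our reading of the reduced descriptor, not necessarily the exact one in [20].

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x], with a zero first row and column."""
    return np.pad(img, ((1, 0), (1, 0))).cumsum(axis=0).cumsum(axis=1)

def haar8(img, cx, cy, b=4):
    """8-D Haar-like descriptor at (cx, cy): difference between the
    average intensity of each of the 8 neighbouring b-by-b blocks and
    that of the central block, all read off one integral image."""
    ii = integral_image(np.asarray(img, dtype=np.float64))

    def block_mean(x0, y0):
        # Average intensity of the b*b block whose top-left corner is (x0, y0).
        return (ii[y0 + b, x0 + b] - ii[y0, x0 + b]
                - ii[y0 + b, x0] + ii[y0, x0]) / (b * b)

    x0, y0 = cx - b // 2, cy - b // 2  # centre block's top-left corner
    c = block_mean(x0, y0)
    return np.array([block_mean(x0 + dx * b, y0 + dy * b) - c
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if (dx, dy) != (0, 0)])
```

Each of the 9 block sums costs only four lookups in the integral image, which is what makes the descriptor cheap enough for per-frame use.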

B. Feature tracking

The goal of feature tracking is to find correspondences between the reference features in the templates and the features extracted from the normalized image of the tracked region in the current frame. Since the Haar-like feature used is neither scale nor rotation invariant, using features extracted directly from the original frame usually produces poor matching results whenever scale changes or rotation occur. To make the tracking robust to these changes, a normalization operation is employed: as illustrated in Figure 3(b), a normalized image which has the same size as the normalized image of the template is obtained by transforming the previous object region in the current frame through the inverse transformation S_{t-1}^{-1}; then Haar-like features are extracted from the normalized image of the current frame. Feature correspondences between the features in the normalized image of the object region in the current frame and the two sets of reference features are established by a typical local search using the Nearest Neighbor Distance Ratio (NNDR) matching strategy. Of the two resulting sets of correspondences, the one with the larger number of correspondences is selected to calculate the similarity transformation from the normalized image of the chosen template to the normalized image of the tracked region in the current frame.
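The NNDR step can be sketched as follows. This is the generic ratio test; the 0.8 threshold is an assumption (the paper does not state one), and the paper additionally restricts candidates to a local search window rather than scanning all reference features as done here.

```python
import numpy as np

def match_nndr(query, reference, ratio=0.8):
    """Nearest Neighbor Distance Ratio matching.

    query: (n, d) descriptors from the current normalized image;
    reference: (m, d) reference descriptors of a template.
    Returns (query_index, reference_index) pairs whose best match is
    clearly closer than the second-best one.
    """
    matches = []
    for i, q in enumerate(query):
        d = np.linalg.norm(reference - q, axis=1)
        j = int(np.argmin(d))
        d_best = d[j]
        d[j] = np.inf
        d_second = d.min()
        if d_best < ratio * d_second:  # ratio test rejects ambiguous matches
            matches.append((i, j))
    return matches
```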

C. Geometrical transformation estimation

Since the Least Median of Squares (LMedS) estimation [21] is robust to false matches, given the feature correspondences,


(a) Normalization in the initialization stage.

(b) Normalization in the tracking procedure.

Fig. 3. The red box in the left image denotes the initially specified region (a) and the estimated tracking region in the current frame (b). The right image shows the normalized image. The yellow box in the lower-left image denotes the tracked region in the previous frame. The current frame is cropped, rotated and resized to the same fixed-size region as the normalized image of the template.

an iterative LMedS procedure is adopted to estimate the optimal solution of the similarity transformation S_µ from the normalized image of the chosen template to the normalized image of the tracked region in the current frame. Let p_i be the i-th Haar-like feature position in the normalized image of the template and p'_i be the corresponding feature position in the normalized image of the current frame; the optimization problem can be mathematically defined as:

µ* = argmin_µ median(e_1, ..., e_i, ..., e_n)    (1)

where e_i = ‖S_µ(p_i) − p'_i‖ is the residual error and µ = (s, θ, t) are the parameters of the similarity transformation S_µ. The initial LMedS solution is obtained by using all the correspondences to estimate the transformation parameter µ. Then, the correspondences whose residual errors are below the median error are used to re-estimate µ. This re-estimation process is repeated until no further improvement is achieved in 10 consecutive iterations. Once S_µ is computed, the similarity transformation S_t from the normalized image of the template to the current frame is obtained with little effort: S_t = S_{t−1} · S_µ.
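The iterative procedure can be sketched as below, assuming a closed-form least-squares similarity fit in complex coordinates (a point (x, y) becomes x + iy, and a = s·e^{iθ} encodes scale and rotation together). The 10-iteration stopping rule follows the text; re-fitting on the at-or-below-median residuals is as described, while the fit itself is a standard technique, not necessarily the paper's.

```python
import numpy as np

def fit_similarity(p, q):
    """Least-squares similarity fit q ~ a*p + b in complex coordinates,
    where a = s*exp(i*theta) encodes scale and rotation."""
    pc, qc = p[:, 0] + 1j * p[:, 1], q[:, 0] + 1j * q[:, 1]
    pm, qm = pc.mean(), qc.mean()
    # np.vdot conjugates its first argument, giving the complex regression slope.
    a = np.vdot(pc - pm, qc - qm) / np.vdot(pc - pm, pc - pm)
    return a, qm - a * pm

def lmeds_similarity(p, q, max_stall=10):
    """Iterative LMedS sketch of Eq. (1): fit on all correspondences,
    then repeatedly re-fit on those with residuals at or below the
    median, stopping after `max_stall` iterations with no improvement."""
    pc, qc = p[:, 0] + 1j * p[:, 1], q[:, 0] + 1j * q[:, 1]
    a, b = fit_similarity(p, q)
    best = np.median(np.abs(qc - (a * pc + b)))
    stall = 0
    while stall < max_stall:
        e = np.abs(qc - (a * pc + b))
        keep = e <= np.median(e)          # below-median residuals only
        a2, b2 = fit_similarity(p[keep], q[keep])
        med = np.median(np.abs(qc - (a2 * pc + b2)))
        if med < best:
            a, b, best, stall = a2, b2, med, 0
        else:
            stall += 1
    return a, b  # a point p maps to a * (p.x + i*p.y) + b
```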

D. Template Switching

As tracking proceeds, the appearance of the tracked object in incoming frames may significantly differ from the normalized image of the template due to gradual viewpoint variations, which degrades the feature tracking quality and eventually leads to tracking loss. To cope with this problem, the second template is updated when the tracking quality is deemed normal (T_l < r < T_h). The process is as follows: a normalized image is first obtained by applying the inverse transformation S_t^{-1} to the current frame; then Haar-like features are extracted from the normalized image and kept as reference features. The normalized image and the features form the dynamically updated template. After the update, feature matching between the normalized image of the current frame and the reference features of the second template usually produces good results due to the continuity between video frames. However, continually updating the template during tracking leads to the drift problem. To alleviate this problem, features in the normalized image of the current frame are always matched against the reference features in both the initial template and the second template, as described in Section III-B. When the appearance of the tracked object in an incoming frame approximates that in the initial frame, the template chosen to compute the geometric transformation is switched back to the initial template. Hence the drift is rectified.

E. Recovery

Due to fast motion or complete object occlusion, tracking can be of poor quality or even get lost. Thus, a recovery procedure is needed to re-locate the object region in these situations.

In our tracking algorithm, a ‘full-search’ method is first employed to find the image region in the current frame that is most similar to the specified image region in the initial frame. The process is as follows: first, ZNCC [22] values are calculated between the initial specified region and a region of the same size in the current frame, which scans over the whole frame at a step of 20 pixels both vertically and horizontally (for acceleration, the ZNCC computation is performed on down-sampled image regions); then the region with the highest ZNCC value is deemed the candidate region. Once the region most similar to the initial one is found, a regular tracking procedure (i.e. treating this region as the tracked region in the previous frame) is carried out to complete the recovery stage. The whole recovery process is concisely illustrated in Figure 4.
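The recovery scan can be sketched as follows; it implements the 20-pixel grid search with plain ZNCC on full-resolution patches (the down-sampling the paper uses for acceleration is omitted here for clarity).

```python
import numpy as np

def zncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def full_search(frame, template, step=20):
    """Slide a template-sized window over the frame on a `step`-pixel grid
    and return the top-left corner with the highest ZNCC score."""
    th, tw = template.shape
    fh, fw = frame.shape
    best_score, best_xy = -2.0, (0, 0)  # ZNCC lies in [-1, 1]
    for y in range(0, fh - th + 1, step):
        for x in range(0, fw - tw + 1, step):
            score = zncc(frame[y:y + th, x:x + tw], template)
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score
```

Because ZNCC subtracts the patch means and divides by the patch norms, the score is invariant to affine brightness changes, which is why it is a reasonable choice for re-detection after tracking loss.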

Fig. 4. The recovery procedure.


IV. IMPLEMENTATION DETAILS ON MOBILE PHONES

Considering the hardware limitations and instruction set characteristics of mobile phones, some programming techniques are used to speed up the program.

1) Fixed-point operations: As the computational capabilities of the floating-point unit (FPU) on mobile phones are limited, in our implementation we convert floating-point operations to fixed-point operations to the greatest extent possible. This strategy makes a trade-off between accuracy and time cost.

2) Lookup tables: For pre-calculated results, lookup tables are used wherever possible, which speeds up the computation. For example, during the feature tracking stage, lookup tables speed up finding candidate features during local feature searching.

3) Inline expansion and direct operation: When calling a function repeatedly, the overhead of function calls is relatively large. To reduce this cost, we make use of inline expansion and direct basic operations instead of calling functions.
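To illustrate the fixed-point conversion in item 1, here is a minimal Q16.16 sketch; the actual Q format and the set of converted operations are not specified in the paper.

```python
Q = 16  # fractional bits (Q16.16; an assumed format)

def to_fixed(x: float) -> int:
    """Encode a real value as an integer scaled by 2^Q."""
    return int(round(x * (1 << Q)))

def to_float(x: int) -> float:
    """Decode a fixed-point integer back to a real value."""
    return x / (1 << Q)

def fx_mul(a: int, b: int) -> int:
    # The raw product carries 2*Q fractional bits; shift back down by Q.
    return (a * b) >> Q

def fx_div(a: int, b: int) -> int:
    # Pre-shift the numerator so the quotient keeps Q fractional bits.
    return (a << Q) // b
```

All four routines stay entirely in integer arithmetic; the cost is bounded precision, here roughly 1/2^16 per operation.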

V. EXPERIMENTS

All experiments are carried out on a NOKIA N900 phone with an ARM Cortex-A8 600 MHz processor and 256 MB RAM. The parameters in all experiments are set as follows: the template size is set to a fixed area of 120 × 120 pixels during tracking and 30 × 30 pixels during recovery, with the same aspect ratio as the initially specified region; T_l is set to 0.35 and T_h to 0.55. The experiments show the performance of the proposed tracking method on video sequences captured by mobile phones.

To illustrate the effectiveness and robustness of our tracking method, an example landmark recognition and AR application based on the method is presented. In the experiment, a building far from the user is tracked and augmented on the mobile phone. Under outdoor natural light conditions, illumination changes occur rarely. Meanwhile, rotation and translation are the main transformations of the tracked region between video frames because of the long distance between the building and the mobile phone. Figure 6 shows some typical results under these conditions. The building is tracked fairly well and the picture for augmentation is overlaid on the target region accurately and stably throughout the video frames. Occasionally, tracking gets lost due to fast motion. Nevertheless, it is quickly recovered when the phone moves smoothly again.

In the experiment, tracking is performed almost in real time. Figure 5 shows the time cost of tracking for each of 300 consecutive frames. The horizontal axis is the frame index and the vertical axis is the processing time of that frame in ms. The average time cost of tracking is about 30 ms, which is less than the time the N900 needs to capture two consecutive frames (40 ms). The recovery procedure costs nearly 58 ms per frame, but since recovery does not happen frequently, its latency does not significantly hurt the overall real-time performance of the tracking method.

Fig. 5. Tracking time measurement.

In our experiments, different objects have also been chosen to illustrate the tracking performance. Figure 7 shows some screenshots from a video of two typical objects (a non-planar object and a planar object).

Scale changes and rotation: Scale changes of the object image are obtained by varying the relative distance between the camera and the target object. Image rotations are obtained by rotating the camera around its optical axis. Thanks to the normalization operation described in Section III-B, our method can handle arbitrary rotation angles and scale changes from 0.3 to 3. Since the feature points are not scale invariant, the tracker gets lost when the scale change exceeds this range.

Viewpoint changes: In this case, the camera position varies gradually from a fronto-parallel view to one of about 60 degrees. Our method tolerates changes of approximately 40 degrees. When the viewpoint change exceeds 40 degrees, the tracking quality degrades quickly. The reason is that the geometrical transformation cannot be approximated by a similarity transformation for large viewpoint changes.

Partial occlusions: The tracking method is robust to partial occlusions even when 70% of the object surface is occluded by a book. When the occluded area exceeds 70%, the number of matched features decreases quickly, so the tracked region becomes unstable and is easily lost.

VI. CONCLUSION

In this paper, a natural feature based tracking method is proposed for a mobile landmark recognition and AR system. A normalization operation is used to deal with rotation and scale changes in every new frame. The dimensionality of the Haar-like feature used is reduced for fast matching, and an iterative LMedS is used for robust geometrical transformation estimation. By employing an online two-template switching strategy, the proposed tracking method is capable of dealing with variations including partial occlusions as well as viewpoint and illumination changes. The experiments on a NOKIA N900 phone show that the proposed real-time tracking method can cope with scale changes, rotation, viewpoint and illumination changes, and partial occlusions.


(a) Landmark augmentation (b) Rotation and translation (c) Tracking failure by motion blur (d) Recovery

Fig. 6. The performance of augmenting a specified building in the frame.

(a) Original region (b) Scale and rotation (c) Viewpoint changes (d) Partial occlusions

(e) Original region (f) Scale and rotation (g) Viewpoint changes (h) Partial occlusions

Fig. 7. The performance of tracking a non-planar object (top) and a planar object (bottom).

ACKNOWLEDGMENT

This work is supported by the Nokia Research Foundation and the National Natural Science Foundation of China under grant No. 61070107.

REFERENCES

[1] D. W. F. van Krevelen, R. Poelman. A Survey of Augmented Reality Technologies, Applications and Limitations. The International Journal of Virtual Reality, 9(2), pp. 1-20, 2010.

[2] Ann Morrison, Antti Oulasvirta, Peter Peltonen, Saija Lemmelä, Giulio Jacucci, Gerhard Reitmayr, Jaana Näsänen, Antti Juustila. Like Bees Around the Hive: A Comparative Study of a Mobile Augmented Reality. ACM, pp. 1889-1898, 2009.

[3] M. Mohring, C. Lessig, O. Bimber. Video See-Through AR on Consumer Cell-Phones. Third IEEE and ACM International Symposium on Mixed and Augmented Reality, pp. 252-253, Nov. 2004.

[4] Xiang Zhang, Y. Genc, N. Navab. Mobile Computing and Industrial Augmented Reality for Real-time Data Access. Proceedings of the 8th IEEE International Conference on Emerging Technologies and Factory Automation, vol. 2, pp. 583-588, 2001.

[5] Tao Chen, Zhen Li, Kim-Hui Yap, Kui Wu, Lap-Pui Chau. A Multi-scale Learning Approach for Landmark Recognition Using Mobile Devices. International Conference on Information, Communications and Signal Processing, Macau, 2009.

[6] Jean-Pierre Chevallet, Joo-Hwee Lim, Mun-Kew Leong. Object Identification and Retrieval from Efficient Image Matching: Snap2Tell with the STOIC Dataset. Information Processing & Management, 43(2), pp. 515-530, March 2007.

[7] Kim-Hui Yap, Tao Chen, Zhen Li, Kui Wu. A Comparative Study of Mobile-Based Landmark Recognition Techniques. IEEE Intelligent Systems, vol. 25, no. 1, pp. 48-57, Jan-Feb 2010.

[8] Alper Yilmaz, Omar Javed, Mubarak Shah. Object Tracking: A Survey. ACM Computing Surveys, vol. 38, issue 4, 2006.

[9] H. Kato, M. Billinghurst. Marker Tracking and HMD Calibration for a Video-based Augmented Reality Conferencing System. 2nd IEEE and ACM International Workshop on Augmented Reality, pp. 85-94, 1999.

[10] Mark Fiala. ARTag, a Fiducial Marker System Using Digital Techniques. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 590-596, 2005.

[11] Georg Klein, David Murray. Parallel Tracking and Mapping on a Camera Phone. Proc. International Symposium on Mixed and Augmented Reality, Orlando, October 2009.

[12] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, D. Schmalstieg. Pose Tracking from Natural Features on Mobile Phones. International Symposium on Mixed and Augmented Reality, pp. 125-134, Sept. 2008.

[13] H. Grabner, H. Bischof. On-line Boosting and Vision. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 260-267, June 2006.

[14] L. Ellis, N. Dowson, J. Matas, R. Bowden. Linear Predictors for Fast Simultaneous Modeling and Tracking. IEEE 11th International Conference on Computer Vision, Oct. 2007.

[15] R. T. Collins, Yanxi Liu, M. Leordeanu. Online Selection of Discriminative Tracking Features. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 10, pp. 1631-1643, Oct. 2005.

[16] David Ross, Jongwoo Lim, Ruei-Sung Lin, Ming-Hsuan Yang. Incremental Learning for Robust Visual Tracking. International Journal of Computer Vision, vol. 77, no. 1-3, pp. 125-141, 2008.

[17] L. Fan. A Feature-based Object Tracking Method Using Online Template Switching and Feature Adaptation. 6th International Conference on Image and Graphics, 2011.

[18] L. Matthews, T. Ishikawa, S. Baker. The Template Update Problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 810-815, June 2004.

[19] M. Grabner, H. Grabner, H. Bischof. Learning Features for Tracking. IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-8, June 2007.

[20] Lixin Fan. What a Single Template Can Do in Recognition. 4th International Conference on Image and Graphics, pp. 586-591, 2007.

[21] P. J. Rousseeuw. Least Median of Squares Regression. Journal of the American Statistical Association, vol. 79, no. 388, pp. 871-880, Dec. 1984.

[22] L. Di Stefano, S. Mattoccia, F. Tombari. ZNCC-based Template Matching Using Bounded Partial Correlation. Pattern Recognition Letters, vol. 26, no. 14, pp. 2129-2134, October 2005.
