[IEEE 2011 IEEE Virtual Reality (VR) - Singapore, Singapore (2011.03.19-2011.03.23)] 2011 IEEE Virtual Reality Conference - Stable vision-aided navigation for large-area augmented reality

  • Published on

  • View

  • Download

Embed Size (px)


<ul><li><p>Stable Vision-Aided Navigation for Large-Area Augmented Reality</p><p>Taragay Oskiper Han-Pang Chiu Zhiwei Zhu Supun Samaresekera Rakesh Kumar</p><p>SRI International Sarnoff</p><p>ABSTRACTIn this paper, we present a unified approach for a drift-free andjitter-reduced vision-aided navigation system. This approach isbased on an error-state Kalman filter algorithm using both relative(local) measurements obtained from image based motion estima-tion through visual odometry, and global measurements as a resultof landmark matching through a pre-built visual landmark database.To improve the accuracy in pose estimation for augmented realityapplications, we capture the 3D local reconstruction uncertainty ofeach landmark point as a covariance matrix and implicity rely moreon closer points in the filter. We conduct a number of experimentsaimed at evaluating different aspects of our Kalman filter frame-work, and show our approach can provide highly-accurate and sta-ble pose both indoors and outdoors over large areas. The resultsdemonstrate both the long term stability and the overall accuracyof our algorithm as intended to provide a solution to the cameratracking problem in augmented reality applications.</p><p>Index Terms: H.5.1 [Information Interfaces and Presentation]:Multimedia Information SystemsArtificial, Augmented, and Vir-tual Reality; I.4.8 [Image Processing and Computer Vision]: SceneAnalysisSensor Fusion</p><p>1 INTRODUCTIONNavigation for augmented reality systems using head mounted dis-plays (HMDs) has very demanding requirements: It must estimatehighly accurate 3D location and 3D orientation of the users head inreal time. This is required by the system to know where to insert thesynthetic actors and objects in the HMD. In addition, the insertedobjects must appear stable. The presence of drift or jitter on in-serted objects will disturb the illusion of mixture between renderedand real world for the user.</p><p>In this paper, we present our work on the real-time navigationcomponent of an augmented reality training and gaming system,which can be used both indoors and outdoors over large areas. Thesystem uses computer graphics and see-through head mounted dis-plays to insert virtual actors, objects and sound effects into thescene as viewed by each user. The virtual actors respond in real-istic ways to actions of the user, taking cover, or firing back, ormilling as crowds etc. In this system, there is no need to instrumentthe training/gaming facility; the individual users wear the primaryhardware. The augmented reality interface subsystem worn by eachindividual user consists of a helmet mounted sensor platform (Fig-ure 1) with front and back facing stereo cameras, inertial measure-ment unit (IMU), a compact computer mounted on his backpackand a see through HMD.</p><p>We introduce a new Kalman filter framework to fuse IMU data,the local measurements from the distributed aperture visual odom-</p><p>e-mail:taragay.oskiper@sri.come-mail:han-pang.chiu@sri.come-mail:zhiwei.zhu@sri.come-mail:supun.samarasekera@sri.come-mail:rakesh.kumar@sri.com</p><p>IMU Back-Stereo</p><p>Front-Stereo</p><p>Figure 1: Our helmet mounted sensor platform.</p><p>etry algorithm with front and back facing stereo cameras [17], andthe global measurements from the visual landmark matching mod-ule [27]. Our Kalman filter eliminates the constant velocity processmodel used in [17], and is able to incorporate the global measure-ments which are 3D to 2D feature point correspondences betweenthe pre-built landmark database and given query image. As shownin Figure 2, the final estimated global pose is generated from thefilter instead of the landmark matcher as was done in [27].</p><p>This unified approach to fuse all the measurement data allows forbetter handling of the uncertainty propagation through the wholesystem. It is not possible in the framework of [27] in which theKalman filter output was used to locally propagate the navigationsolution from one landmark match instance to another with the posesolution based on landmark matching effectively resetting the filteroutput. By fusing both inertial and vision measurements, our sys-tem is also more robust under challenging conditions where thereare insufficient visual clues to rely on.</p><p>The other advantage of our work is to eliminate a trade-off prob-lem in [27]: landmark matching between the pre-built database andthe current query frame provides global fixes to prevent the esti-mated poses from drifting during online tracking, but often lacksprecision which results in jitter. We found that the accuracy of poseestimation from the landmark matcher decreases if there are fewerlandmark point matches closer to the camera where the depth es-timation is more accurate. To reduce the jitter, we capture the 3Dlocal reconstruction uncertainty of each landmark point as a co-variance matrix and implicity rely more on closer points as globalmeasurements in the Kalman filter. This provides more accurateand stable pose estimation to fulfill the demanding requirementsfor augmented reality systems.</p><p>2 PREVIOUS WORK</p><p>Prototype augmented-reality training/gaming systems require sig-nificant infrastructure. For example, a few systems use video pro-jectors to project images of virtual actors on walls of rooms withina facility. Existing systems also have a limited ability to track users,</p><p>63</p><p>IEEE Virtual Reality 2011</p><p>19 - 23 March, Singapore</p><p>978-1-4577-0038-5/11/$26.00 2011 IEEE </p></li><li><p>IMUIMU</p><p>Back-StereoBack-Stereo</p><p>DistributedApertureVisual </p><p>Odometry</p><p>DistributedApertureVisual </p><p>Odometry</p><p>VisualLandmarkMatcher</p><p>VisualLandmarkMatcherLandmark</p><p>DatabaseLandmarkDatabase</p><p>KalmanFilter</p><p>KalmanFilter</p><p>Front-StereoFront-Stereo</p><p>3D Head Pose(3d location &amp; 3d orientation)</p><p>Figure 2: System block diagram.</p><p>and to adapt virtual actions or their projection to the movements ofthe users. GPS-based systems [22, 10] may be used for providinglocation outdoors. However, the performance of these outdoor-onlysystems decreases in challenging situations. Recent vision-basedsystems [2, 12, 25, 6, 11] can provide accurate tracking indoorseven on mobile phones. However, they still have difficulties tohandle occlusion or feature-less environment. Overall, providinghigh accuracy tracking over large indoor and outdoor areas (multi-ple square miles) is a very challenging problem.</p><p>Real-time tracking by fusing visual and inertial sensors has beenstudied for many years with numerous applications in robotics, ve-hicle navigation and augmented reality. However, it is still unclearhow to best combine the information from these sensors. Sinceinertial sensors are suited for handling no vision situations dueto fast motion or occlusion, many researchers use inertial data asbackup [1] or take only partial information (gyroscopes) from IMU[26, 19, 9] to support vision-based tracking systems.</p><p>To better exploit inertial data, several researchers use an ex-tended Kalman filter to fuse all measurements uniformly to a poseestimate. They combine the filter with vision-tracking techniquesbased on artificial markers [5], feature points, or lines. These sys-tems show that the vision measurements effectively reduce the er-rors accumulated from IMU. However, most of them conduct ex-periments on either synthetic data [18] or simulated vision mea-surements [8]. Some systems provide results on realistic data, butwithin simple test environments [23] or small rooms [3]. Moreover,they cannot eliminate the problem of long term drift over large areasinherent in inertial-based navigation platform.</p><p>Due to recent advances in the image searching techniques, real-time landmark matching with a large landmark database has be-come possible [16, 24]. Zhu et al. [27] integrated visual landmarkmatching to a pre-built landmark database in a visual-inertial navi-gation system. The continuously updating landmark matching cor-rects the long term drift in the system, and thus improves the overallperformance. However, they only used IMU data for transition be-tween views where visual features are lost due to fast motion orbad illumination. Moreover, landmark matching often lacks high-precision in pose estimation, which is required for augmented real-ity applications.</p><p>There are two major differences between our work and othervisual-inertial navigation systems. First, we adopt the so callederror-state formulation [21] in the extended Kalman filter. Un-der this representation, there is no need to specify an explicit dy-namic motion model which was used in [17] for a given sensor plat-form. The filter dynamics follow from the IMU error propagation</p><p>IMUMechanization</p><p>Extended Kalman Filter</p><p>visual odometry based relative pose</p><p>3D-2D tiepoints via landmark matching</p><p>Error estimates(corrections to inertial system)</p><p>corrected navigation solution</p><p>Figure 3: Error-state Extended Kalman Filter block diagram with localand global external measurements.</p><p>equations which evolve slowly over time and therefore are moreamenable to linearization. The measurements to the filter consistof the differences between the inertial navigation solution as ob-tained by solving the IMU mechanization equations and the exter-nal source data, which in our case is the relative pose informationprovided by visual odometry algorithm and global measurementsprovided by the visual landmark matching process (Figure 3).</p><p>Second, our Kalman filter framework incorporates two comple-mentary vision measurements based on state-of-the-art vision track-ing techniques. Relative pose measurements based on feature track-ing between adjacent frames are usually located very precisely.Therefore, they do not jitter but suffer from drift or loss of track.Landmark matching [27] provides correspondences between fixed3D features in a pre-built database and 2D points on the queryframe. These measurements avoid drift but cause jitter. To makethe outputted pose not only accurate but also stable, we fuse bothlocal and global information in the filter.</p><p>3 EXTENDED KALMAN FILTER PROCESS MODEL</p><p>In our extended Kalman filter, we denote the ground (global coordi-nate frame) to camera pose as PGC = [RGC TGC] such that a pointXG expressed in the ground frame can be transferred to the cam-era coordinates by XC = RGCXG+TGC. Accordingly, TGC repre-sents the ground origin expressed in the camera coordinate frame,whereas TCG =RTGCTGC is the camera location in the ground co-ordinate frame.</p><p>In this paper, without loss of generality and to keep the notationsimple, we will assume that the camera and IMU coordinate sys-tems coincide so that PGI = PGC. In reality we use an extrinsiccalibration procedure to determine the camera to IMU pose PCI ,(front left stereo camera is chosen as the master) as developed in[15] and distinguish between PGI = PCIPGC and PGC.</p><p>The total (full) states of the filter consist of the camera locationTCG, the gyroscope bias vector bg, velocity vector v in global co-ordinate frame, accelerometer bias vector ba and ground to cameraorientation qGC, expressed in terms of the quaternion representa-tion for rotation, such that RGC = (|q0|2</p><p>q 2)I33+2qq T </p><p>2q0[q ], with qGC = [q0 q T ]T and [q ] denoting the skew-</p><p>symmetric matrix formed by q . For quaternion algebra, we followthe notation and use the frame rotation perspective as described in[13]. Hence, the total (full) state vector is given by</p><p>s= [qTGC bTg v</p><p>T bTa TTCG]</p><p>T . (1)</p><p>64</p></li><li><p>We use the corresponding system model for the state time evolution</p><p>qGC(t) =12(qGC(t)(t)), bg(t) = nwg(t)</p><p>v(t) = a(t), ba(t) = nwa(t), TCG(t) = v(t)</p><p>where nwg and nwa are modeled as white Gaussian noise, and a(t)is camera acceleration in global coordinate frame, and (t) is therotational velocity in camera coordinate frame. Gyroscope and ac-celerometer measurements of these two vectors are modeled as:</p><p>m(t) = (t)+bg(t)+ng(t) (2)</p><p>am(t) = RGC(t)(a(t)g)+ba(t)+na(t) (3)</p><p>where ng and na are modeled as white Gaussian noise and g is thegravitational acceleration expressed in the global coordinate frame.</p><p>State estimate propagation is obtained by the IMU mechaniza-tion equations</p><p>qGC(t) =12(qGC(t) (t)) (4)</p><p>v(t) = RTGC(t)(t)+g, (5)</p><p>x(t) = v(t), bg(t) = 0,ba(t) = 0 (6)</p><p>with (t) = m(t) ba(t), and (t) = am(t) ba(t).We solve the above system by fourth-order Runge-Kutta numer-</p><p>ical integration method. The Kalman filter error state consists of</p><p> s= [T bTg vT bTa T</p><p>TCG]</p><p>T (7)</p><p>according to the following relation between the total state and itsinertial estimate</p><p>qGC = qGCqGC, with qGC [1 T /2]T</p><p>bg(t) = bg(t)+bg(t), ba(t) = ba(t)+ba(t)</p><p>v(t) = v(t)+v(t), TCG(t) = TCG(t)+TCG(t)</p><p>based on which we obtain (after some algebra) the following dy-namic process model for the error state:</p><p> s= F s+Gn (8)</p><p>where</p><p>F=</p><p> [] I3 03x3 03x3 03x303x3 03x3 03x3 03x3 03x3</p><p>RTGC [] 03x3 03x3 RTGC 03x3</p><p>03x3 03x3 03x3 03x3 03x303x3 03x3 I3 03x3 03x3</p><p> ,</p><p>n=</p><p>ngnwgnanwa</p><p> , and G=</p><p>I3 03x3 03x3 03x303x3 I3 03x3 03x303x3 03x3 RTGC 03x303x3 03x3 03x3 I303x3 03x3 03x3 03x3</p><p>4 VISUAL ODOMETRY AND LANDMARK MATCHING MEA-SUREMENT MODEL</p><p>The two vision measurements to our Kalman filter are from visualodometry and landmark matching. Our system captures the syn-chronized stereo image pairs from cameras and IMUmeasurementsin real time. The captured stereo pairs are sent to the visual odom-etry module that extracts Harris corners [7] from each image andestimates the initial camera pose (local pose) from the extractedHarris corners. Once the local visual odometry poses are estimated,</p><p>the visual odometry module sends the left images together with itsextracted Harris corners and the estimated pose to the landmarkmatching module.</p><p>To incorporate visual odometry poses that are relative in nature,we apply the same stochastic cloning approach developed in [20]for our measurement model. In particular, we denote P1,2 as thevisual odometry pose estimate between two time instances 1 and 2,and let the corresponding pose components of the state be denotedby PG,1 and PG,2. Then defining T2,1 = RG,1(T2,G T1,G), andq1,2 = q1G,1qG,2, and after lengthy algebra as similar to [20], weobtain the following measurement equations</p><p>zT =[RG,1(T2,G T1,G)</p><p>] G,1+ RG,1T2,G (9)</p><p> RG,1T1,G+T (10)</p><p>andzq = 1/2RT1,2G,21/2G,1+q (11)</p><p>where T and q are the Gaussian noise in translation and rota-tion associated with the visual odometry pose solution. These mea-surements are a function of the propagated error-state s2 and thecloned error-state s1 from previous time instance, which requiremodifications to the Kalman filter update equations (cf. [20]).</p><p>For landmark matching, after receiving the extracted Harris cor-ners from visual odometry module, the HOG descriptor [4] is ex-tracted for each Harris corner from the received left image. Oncethe HOG descriptors are extracted, the normalized 2D image co-ordinates, 3D coordinates, and the HOG descriptors of the cornerstogether with the camera pose is formed as a landmark shot. Thelandmark shot will serve as a query landmark shot and search thelandmark database for a list of potential similar landmark shots andthen an image-based 2D matching is performed to refine the poten-tial landmark shot list. Once a matched la...</p></li></ul>