Image matching techniques for vision-based indoor navigation systems: a 3D map-based approach 1

  • Published on

  • View

  • Download

Embed Size (px)


<ul><li><p>This article was downloaded by: [Aston University]On: 29 January 2014, At: 02:48Publisher: Taylor &amp; FrancisInforma Ltd Registered in England and Wales Registered Number: 1072954 Registeredoffice: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK</p><p>Journal of Location Based ServicesPublication details, including instructions for authors andsubscription information:</p><p>Image matching techniques for vision-based indoor navigation systems: a 3Dmap-based approachXun Li &amp; Jinling Wangaa School of Surveying and Geospatial Engineering, University ofNew South Wales, Sydney, AustraliaPublished online: 20 Jan 2014.</p><p>To cite this article: Xun Li &amp; Jinling Wang , Journal of Location Based Services (2014): Imagematching techniques for vision-based indoor navigation systems: a 3D map-based approach, Journalof Location Based Services, DOI: 10.1080/17489725.2013.837201</p><p>To link to this article:</p><p>PLEASE SCROLL DOWN FOR ARTICLE</p><p>Taylor &amp; Francis makes every effort to ensure the accuracy of all the information (theContent) contained in the publications on our platform. However, Taylor &amp; Francis,our agents, and our licensors make no representations or warranties whatsoever as tothe accuracy, completeness, or suitability for any purpose of the Content. Any opinionsand views expressed in this publication are the opinions and views of the authors,and are not the views of or endorsed by Taylor &amp; Francis. The accuracy of the Contentshould not be relied upon and should be independently verified with primary sourcesof information. Taylor and Francis shall not be liable for any losses, actions, claims,proceedings, demands, costs, expenses, damages, and other liabilities whatsoeveror howsoever caused arising directly or indirectly in connection with, in relation to orarising out of the use of the Content.</p><p>This article may be used for research, teaching, and private study purposes. Anysubstantial or systematic reproduction, redistribution, reselling, loan, sub-licensing,systematic supply, or distribution in any form to anyone is expressly forbidden. Terms &amp;Conditions of access and use can be found at</p></li><li><p>Image matching techniques for vision-based indoor navigationsystems: a 3D map-based approach1</p><p>Xun Li* and Jinling Wang</p><p>School of Surveying and Geospatial Engineering, University of New South Wales, Sydney, Australia</p><p>(Received 8 March 2013; final version received 13 August 2013; accepted 15 August 2013)</p><p>In this paper, we give an overview of image matching techniques for variousvision-based navigation systems: stereo vision, structure from motion and map-based approach. Focused on map-based approach, which generally uses feature-based matching for mapping and localisation, and based on our early developedsystem, a performance analysis has been carried out and three major problemshave been identified: inadequate geo-referencing and imaging geometry formapping, vulnerability to drastic viewpoint changes and big percentage ofmismatches at positioning. We introduce multi-image matching for mapping. Byusing affine-scale-invariant feature transform for viewpoint changes, the majorimprovement takes place on the epoch with large viewpoint changes. In order todeal with mismatches that were unable to be removed by Random SampleConsensus (RANSAC), we propose new method that use cross-correlationinformation to evaluate the quality of homography model and select the properone. The conducted experiments have proved that such an approach can reducethe chances of mismatches being included by RANSAC and final positioningaccuracy can be improved.</p><p>Keywords: indoor user localisation; map matching; navigation technologies;positioning systems; vision-based localisation</p><p>1. Introduction</p><p>Image matching techniques have been used in a variety of applications, such as 3D</p><p>modelling, image stitching, motion tracking, object recognition and vision-based</p><p>localisation. Over the past few years, many different methods have been developed, which</p><p>can be generally classified into two groups: area-based matching [intensity-based, such as</p><p>cross-correlation and least-squares matching (Gruen 1985), and feature-based matching</p><p>(e.g. scale-invariant feature transform (SIFT); Lowe 2004)]. Area-based methods (Trucco</p><p>and Verri 1998) may be comparatively more accurate, because they take into account a</p><p>whole neighbourhood around the feature points being analysed to establish</p><p>correspondences. Feature-based methods, on the other hand, use symbolic descriptions</p><p>of the images that contain certain local image information to establish correspondence. No</p><p>single algorithm, however, has been universally regarded as optimal for all applications</p><p>since they all have their pros and cons. In this paper, we explore the performance of</p><p>q 2014 Taylor &amp; Francis</p><p>*Corresponding author. Email:</p><p>Journal of Location Based Services, 2014</p><p></p><p>Dow</p><p>nloa</p><p>ded </p><p>by [A</p><p>ston U</p><p>nivers</p><p>ity] a</p><p>t 02:4</p><p>8 29 J</p><p>anua</p><p>ry 20</p><p>14 </p></li><li><p>various image matching techniques according to the specific needs of vision-based</p><p>positioning systems for indoor navigation applications.</p><p>The main purpose of vision-based navigation is to determine the position (and</p><p>orientation) of the vision sensor. Then mobile vehicles (the platform carrying the vision</p><p>sensor) motion/trajectory can be recovered. However, it is not without its limitation. Vision</p><p>sensor canmeasure relative positionwith derivative order of 0 but senses only a 2Dprojection</p><p>of the 3D world direct depth information is lost. Several approaches have been made to</p><p>tackle such a problem. One way is to use stereo cameras by which the distance to a landmark</p><p>can be directlymeasured.Anotherway is to usemonocular visionwith the integration of data</p><p>frommultiple viewpoints (structure frommotion; SFM) or rely on the prior knowledge of the</p><p>navigation environment (e.g. map or models). Despite the variety form of vision-based</p><p>navigation systems, they all need the same basic but essential function to support their</p><p>self-localisation mechanism: image matching. The way they use such function is however</p><p>different, which in terms affect their choice of image matching methods.</p><p>For stereo vision-based approaches, stereo matching is employed to create a depth map</p><p>(i.e. disparity map) for navigation. Area-based algorithms solve the stereo correspondence</p><p>problem for every single pixel in the image. Therefore, these algorithms result in dense</p><p>depth maps as the depth is known for each pixel (Kuhl 2004). Typical methods include</p><p>Census (Zabih and Woodfill 1994), sum of absolute differences and sum of squared</p><p>differences. The common drawback is that they are computationally demanding. To deal</p><p>with the problem, some efforts have been made. Nalpantidis, Chrysostomou, and</p><p>Gasteratos (2009) proposed a quad-camera-based system which used a custom-tailored</p><p>correspondence algorithm to keep the computation load within reasonable limits.</p><p>Meanwhile, feature-based methods are less error sensitive and require less work load. But</p><p>the resulting maps will be less detailed as the depth is not calculated for every pixel (Kuhl</p><p>2004). Therefore, how to achieve a disparity map, which is both dense and accurate, while</p><p>the system maintains reasonable refresh rate is the cornerstone of its success and still</p><p>remains to be an open question.</p><p>For monocular vision sensor, the two approaches also differ from each other. For SFM,</p><p>consecutive frames present a very small parallax and small camera displacement. Given</p><p>the location of a feature in one frame, a common strategy for SFM is to use feature tracker</p><p>to find its correspondence in the consecutive image frame. KanadeLucasTomasi</p><p>tracker (Zhang et al. 2010) is widely used for small baseline matching. The methodology</p><p>for a feature tracker to track interest points through image sequence is usually based on a</p><p>combined use of feature- and area-based image matching methods. First interest points are</p><p>extracted by operators from the first image (e.g. Forstner 1986; Harris and Stephens 1988).</p><p>Then due to the very short baseline, positions of corresponding interest points in the</p><p>second image are predicted and matched with cross-correlation, which can be further</p><p>refined using least squares matching. Some approaches perform outlier rejection based</p><p>either on epipolar geometry (Remondino and Ressl 2006) or Random Sample Consensus</p><p>(RANSAC) (Fischler and Bolles 1981) for the last step (Zhang et al. 2010).</p><p>For the map-based approach, first a map in the form of image collection (or model) is</p><p>built by a learning step. Then, self-localisation is realised by matching the query image</p><p>with the corresponding scene in the database/map. Whenever a match is found, the</p><p>position information of this reference image is transferred directly to the query image and</p><p>used as user position. A major difference between map-based approach and two previous</p><p>methods lies in that the real-time query image might be taken at substantially different</p><p>viewpoint, distance or different illumination conditions from the map images in the</p><p>X. Li and J. Wang2</p><p>Dow</p><p>nloa</p><p>ded </p><p>by [A</p><p>ston U</p><p>nivers</p><p>ity] a</p><p>t 02:4</p><p>8 29 J</p><p>anua</p><p>ry 20</p><p>14 </p></li><li><p>database. Besides, the two images might be taken using different optical devices. In other</p><p>words, the two matching images may have a very large baseline, large-scale difference and</p><p>big perspective effects, which lead to a wide range of image transformation while</p><p>transformation parameters are unknown. Due to such significant changes, most image</p><p>corresponding algorithms working well for short baseline (e.g. stereo, or video sequence)</p><p>images will fail in this case. For area-based approach, cross-correlation method cannot get</p><p>a good performance when rotation is greater than 208 or scale difference is greater than30% (Lang and Forstner 1995); an iterative search for least squares matching will require a</p><p>good initial guess of the two corresponding locations, which is not applicable in situations</p><p>where image transformation parameters are unknown. Moreover, early matching methods</p><p>based on corner detectors (Harris and Stephens 1988) would fail because of the big</p><p>perspective effects (Remondino and Ressl 2006). Therefore, more distinctive and</p><p>invariant features are needed. The first work in the area was by Schmid and Mohr (1997).</p><p>They used a jet of Gaussian derivatives to form a rotationally invariant descriptor around a</p><p>Harris corner. Lowe extended this approach to incorporate scale invariance (Lowe 2004).</p><p>Then, these newly developed invariant features were applied to image matching systems</p><p>(e.g. Bay, Tuytelaars, and Gool 2006; Boris, Effrosyni, and Marcin 2008; Se, Lowe, and</p><p>Little 2002) for accurate location estimation.</p><p>Map-based visual system using invariant feature matching for localisation is a</p><p>common approach in todays research domain. The most popular algorithm is Lowes</p><p>SIFT (2004). However, certain limitations still exist. In this paper, we mainly discuss the</p><p>performance of various imagematchingmethods used by such systems and their influence on</p><p>final positioning. The aim is to find the bottleneck and possible improvement. Moreover, we</p><p>propose to integrate intensity-based method into the feature matching process to strengthen</p><p>the robustness of the matching algorithm against mismatches and noise.</p><p>The paper is constructed as follows: the first section discusses image matching</p><p>methods in the context of vision-based navigation systems and the selection of different</p><p>algorithms to cope with different needs of various systems; in the second section, image</p><p>matching methods used in map-based visual systems have been discussed and certain</p><p>limitations have been revealed; in the third section, ASIFT (Morel and Yu 2009) is</p><p>introduced into the system to address viewpoint changes; in the next section, we propose</p><p>SIFT-basedmethodwith cross-correlation information to reduce the chancemismatches to be</p><p>included for positioning; we present our conclusion and final thoughts in the last section.</p><p>2. Map-based visual positioning with the use of geo-referenced SIFT features</p><p>The development of map-based positioning and navigation system mainly consists of two</p><p>steps: mapping, positioning and navigation. Both steps contain image matching procedure.</p><p>The SIFT features are invariant to image translation, scaling, rotation and partially</p><p>invariant to illumination changes and affine or 3D projection; therefore, we believe SIFT is</p><p>a suitable choice for image matching at positioning stage. Due to the nature of our</p><p>systems methodology: the very same geo-referenced features need to be matched for</p><p>positioning; the matching for tie point extraction at the mapping stage has to be consistent</p><p>with the later one. Therefore, we choose SIFT-based matching for both mapping and</p><p>positioning stage. In the following sections, we will first focus on the image matching</p><p>techniques used in mapping and introduce improvement we made; then the performance of</p><p>image matching at positioning stage will be evaluated and certain deficiency will be</p><p>identified.</p><p>Journal of Location Based Services 3</p><p>Dow</p><p>nloa</p><p>ded </p><p>by [A</p><p>ston U</p><p>nivers</p><p>ity] a</p><p>t 02:4</p><p>8 29 J</p><p>anua</p><p>ry 20</p><p>14 </p></li><li><p>2.1. Feature-based multi-image matching for 3D mapping</p><p>More detailed explanation of the system is as follows. First, mapping is carried out. Images</p><p>of the navigational environment are collected and SIFT matching between images with</p><p>overlapped areas is performed.</p><p>The theory behind 3D mapping is photogrammetric indirect geo-referencing. It is used</p><p>to establish relationship between images and object coordinate systems. In aerial</p><p>photogrammetry, it makes satellite, aerial as well as terrestrial imagery useful for mapping.</p><p>In close-range photogrammetry, it enables the production of accurate as-builtmeasurements</p><p>and 3D reconstruction. In our case, the method is used to geo-reference SIFT features on</p><p>map images. The main idea is to locate an object point in three dimensions through</p><p>photogrammetric bundle adjustment, which uses central projection imaging as fundamental</p><p>mathematical model. More specifically, SIFT feature points on map images will be</p><p>geo-referenced through bundle adjustment (indirect geo-referencing).</p><p>There are two core issues for mapping: feature extraction and bundle adjustment. The</p><p>first one is used to identify common feature points (tie points) in overlapped areas, while</p><p>the latter is to estimate the 3D coordinates of these points.</p><p>However, when identifying common feature points to extract 3D information for</p><p>mapping, most of the works in the literature use a pair-wise approach for image matching.</p><p>The major drawback is that the image geometry is weak especially when only two</p><p>neighbouring images are considered. Therefore,we introducemultiple imagematching based</p><p>on the SIFT algorithm into the mapping procedure. The whole process takes five steps.</p><p>First, a feature database is generated. It is a collection of all SIFT features extracted</p><p>fr...</p></li></ul>


View more >