Transcript

Image matching techniques for vision-based indoor navigation systems: a 3D map-based approach1

Xun Li* and Jinling Wang

School of Surveying and Geospatial Engineering, University of New South Wales, Sydney, Australia

(Received 8 March 2013; final version received 13 August 2013; accepted 15 August 2013)

In this paper, we give an overview of image matching techniques for various vision-based navigation systems: stereo vision, structure from motion and the map-based approach. Focusing on the map-based approach, which generally uses feature-based matching for mapping and localisation, and based on our earlier developed system, a performance analysis has been carried out and three major problems have been identified: inadequate geo-referencing and imaging geometry for mapping, vulnerability to drastic viewpoint changes, and a large percentage of mismatches at positioning. We introduce multi-image matching for mapping. By using the affine-scale-invariant feature transform for viewpoint changes, the major improvement takes place on the epoch with large viewpoint changes. In order to deal with mismatches that cannot be removed by Random Sample Consensus (RANSAC), we propose a new method that uses cross-correlation information to evaluate the quality of the homography model and select the proper one. The conducted experiments have proved that such an approach can reduce the chances of mismatches being included by RANSAC and that the final positioning accuracy can be improved.

Keywords: indoor user localisation; map matching; navigation technologies; positioning systems; vision-based localisation

1. Introduction

Image matching techniques have been used in a variety of applications, such as 3D

modelling, image stitching, motion tracking, object recognition and vision-based

localisation. Over the past few years, many different methods have been developed, which

can be generally classified into two groups: area-based matching [intensity-based, such as cross-correlation and least-squares matching (Gruen 1985)] and feature-based matching (e.g. scale-invariant feature transform (SIFT); Lowe 2004). Area-based methods (Trucco

and Verri 1998) may be comparatively more accurate, because they take into account a

whole neighbourhood around the feature points being analysed to establish

correspondences. Feature-based methods, on the other hand, use symbolic descriptions

of the images that contain certain local image information to establish correspondence. No

single algorithm, however, has been universally regarded as optimal for all applications

since they all have their pros and cons. In this paper, we explore the performance of

© 2014 Taylor & Francis

*Corresponding author. Email: [email protected]

Journal of Location Based Services, 2014

http://dx.doi.org/10.1080/17489725.2013.837201


various image matching techniques according to the specific needs of vision-based

positioning systems for indoor navigation applications.

The main purpose of vision-based navigation is to determine the position (and orientation) of the vision sensor, from which the mobile vehicle's (the platform carrying the vision sensor) motion/trajectory can be recovered. However, it is not without limitations. A vision sensor can measure relative position with derivative order of 0, but it senses only a 2D projection of the 3D world: direct depth information is lost. Several approaches have been proposed to tackle this problem. One way is to use stereo cameras, with which the distance to a landmark can be directly measured. Another way is to use monocular vision with the integration of data from multiple viewpoints (structure from motion; SFM), or to rely on prior knowledge of the navigation environment (e.g. maps or models). Despite the varied forms of vision-based navigation systems, they all need the same basic but essential function to support their self-localisation mechanism: image matching. The way they use this function is, however, different, which in turn affects their choice of image matching methods.

For stereo vision-based approaches, stereo matching is employed to create a depth map

(i.e. disparity map) for navigation. Area-based algorithms solve the stereo correspondence

problem for every single pixel in the image. Therefore, these algorithms result in dense

depth maps as the depth is known for each pixel (Kuhl 2004). Typical methods include

Census (Zabih and Woodfill 1994), sum of absolute differences and sum of squared

differences. The common drawback is that they are computationally demanding. To deal

with the problem, some efforts have been made. Nalpantidis, Chrysostomou, and

Gasteratos (2009) proposed a quad-camera-based system which used a custom-tailored

correspondence algorithm to keep the computation load within reasonable limits.

Meanwhile, feature-based methods are less sensitive to error and require less computational load, but the resulting maps will be less detailed as the depth is not calculated for every pixel (Kuhl 2004). Therefore, how to obtain a disparity map that is both dense and accurate while the system maintains a reasonable refresh rate is the cornerstone of such systems' success and remains an open question.
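For illustration, the following is a minimal sketch of the dense, area-based block matching described above, using OpenCV's StereoBM (a sum-of-absolute-differences block matcher). It is not taken from the cited works; the file names and parameter values are placeholders.

```python
import cv2

# Rectified stereo pair (placeholder file names), loaded as 8-bit greyscale.
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Block matching compares a (blockSize x blockSize) window around every pixel,
# which yields a dense disparity (depth) map but is computationally demanding.
matcher = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = matcher.compute(left, right)   # fixed-point disparities (scaled by 16)

# Normalise for visualisation and save.
disp_vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", disp_vis)
```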

For a monocular vision sensor, the two approaches also differ from each other. For SFM,

consecutive frames present a very small parallax and small camera displacement. Given

the location of a feature in one frame, a common strategy for SFM is to use feature tracker

to find its correspondence in the consecutive image frame. Kanade–Lucas–Tomasi

tracker (Zhang et al. 2010) is widely used for small baseline matching. The methodology

for a feature tracker to track interest points through image sequence is usually based on a

combined use of feature- and area-based image matching methods. First interest points are

extracted by operators from the first image (e.g. Forstner 1986; Harris and Stephens 1988).

Then due to the very short baseline, positions of corresponding interest points in the

second image are predicted and matched with cross-correlation, which can be further

refined using least squares matching. Some approaches perform outlier rejection based

either on epipolar geometry (Remondino and Ressl 2006) or Random Sample Consensus

(RANSAC) (Fischler and Bolles 1981) for the last step (Zhang et al. 2010).
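A minimal sketch of this tracking strategy, using OpenCV's Shi-Tomasi corner operator, the pyramidal Lucas-Kanade tracker and RANSAC on the fundamental matrix for outlier rejection; the frame names and parameter values are illustrative assumptions, not those of the cited systems.

```python
import cv2
import numpy as np

# Two consecutive video frames (placeholder file names), as greyscale images.
prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# 1) Extract interest points in the first frame with a corner operator.
pts_prev = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)

# 2) Predict/match their positions in the next frame with pyramidal Lucas-Kanade,
#    exploiting the very small parallax between consecutive frames.
pts_curr, status, _err = cv2.calcOpticalFlowPyrLK(prev, curr, pts_prev, None,
                                                  winSize=(21, 21), maxLevel=3)

# 3) Keep successfully tracked pairs and reject outliers with RANSAC on the
#    epipolar geometry (fundamental matrix), as in the cited approaches.
good_prev = pts_prev[status.ravel() == 1].reshape(-1, 2)
good_curr = pts_curr[status.ravel() == 1].reshape(-1, 2)
_F, inlier_mask = cv2.findFundamentalMat(good_prev, good_curr, cv2.FM_RANSAC, 1.0, 0.99)
tracked = good_curr[inlier_mask.ravel() == 1]
print(f"{len(tracked)} points tracked after outlier rejection")
```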

For the map-based approach, first a map in the form of image collection (or model) is

built by a learning step. Then, self-localisation is realised by matching the query image

with the corresponding scene in the database/map. Whenever a match is found, the

position information of this reference image is transferred directly to the query image and

used as user position. A major difference between map-based approach and two previous

methods lies in that the real-time query image might be taken at substantially different

viewpoint, distance or different illumination conditions from the map images in the


database. Besides, the two images might be taken using different optical devices. In other

words, the two matching images may have a very large baseline, large-scale difference and

big perspective effects, which lead to a wide range of image transformation while

transformation parameters are unknown. Due to such significant changes, most image

corresponding algorithms working well for short baseline (e.g. stereo, or video sequence)

images will fail in this case. For the area-based approach, the cross-correlation method cannot achieve good performance when the rotation is greater than 20° or the scale difference is greater than 30% (Lang and Forstner 1995); an iterative search for least squares matching will require a

good initial guess of the two corresponding locations, which is not applicable in situations

where image transformation parameters are unknown. Moreover, early matching methods

based on corner detectors (Harris and Stephens 1988) would fail because of the big

perspective effects (Remondino and Ressl 2006). Therefore, more distinctive and

invariant features are needed. The first work in the area was by Schmid and Mohr (1997).

They used a jet of Gaussian derivatives to form a rotationally invariant descriptor around a

Harris corner. Lowe extended this approach to incorporate scale invariance (Lowe 2004).

Then, these newly developed invariant features were applied to image matching systems

(e.g. Bay, Tuytelaars, and Gool 2006; Boris, Effrosyni, and Marcin 2008; Se, Lowe, and

Little 2002) for accurate location estimation.

Map-based visual systems using invariant feature matching for localisation are a common approach in today's research domain. The most popular algorithm is Lowe's SIFT (2004). However, certain limitations still exist. In this paper, we mainly discuss the performance of various image matching methods used by such systems and their influence on final positioning. The aim is to find the bottlenecks and possible improvements. Moreover, we

propose to integrate intensity-based method into the feature matching process to strengthen

the robustness of the matching algorithm against mismatches and noise.

The paper is organised as follows: the first section discusses image matching methods in the context of vision-based navigation systems and the selection of different algorithms to cope with the different needs of various systems; in the second section, the image matching methods used in map-based visual systems are discussed and certain limitations are revealed; in the third section, ASIFT (Morel and Yu 2009) is introduced into the system to address viewpoint changes; in the next section, we propose a SIFT-based method with cross-correlation information to reduce the chance of mismatches being included for positioning; we present our conclusions and final thoughts in the last section.

2. Map-based visual positioning with the use of geo-referenced SIFT features

The development of a map-based positioning and navigation system mainly consists of two stages: mapping, and positioning/navigation. Both stages contain an image matching procedure.

The SIFT features are invariant to image translation, scaling, rotation and partially

invariant to illumination changes and affine or 3D projection; therefore, we believe SIFT is

a suitable choice for image matching at the positioning stage. Due to the nature of our system's methodology, the very same geo-referenced features need to be matched for positioning, so the matching used for tie point extraction at the mapping stage has to be consistent with the later one. Therefore, we choose SIFT-based matching for both the mapping and positioning stages. In the following sections, we will first focus on the image matching techniques used in mapping and introduce the improvements we made; then the performance of image matching at the positioning stage will be evaluated and certain deficiencies will be identified.


2.1. Feature-based multi-image matching for 3D mapping

More detailed explanation of the system is as follows. First, mapping is carried out. Images

of the navigational environment are collected and SIFT matching between images with

overlapped areas is performed.

The theory behind 3D mapping is photogrammetric indirect geo-referencing. It is used

to establish relationship between images and object coordinate systems. In aerial

photogrammetry, it makes satellite, aerial as well as terrestrial imagery useful for mapping.

In close-range photogrammetry, it enables the production of accurate as-built measurements

and 3D reconstruction. In our case, the method is used to geo-reference SIFT features on

map images. The main idea is to locate an object point in three dimensions through

photogrammetric bundle adjustment, which uses central projection imaging as fundamental

mathematical model. More specifically, SIFT feature points on map images will be

geo-referenced through bundle adjustment (indirect geo-referencing).

There are two core issues for mapping: feature extraction and bundle adjustment. The

first one is used to identify common feature points (tie points) in overlapped areas, while

the latter is to estimate the 3D coordinates of these points.

However, when identifying common feature points to extract 3D information for

mapping, most of the works in the literature use a pair-wise approach for image matching.

The major drawback is that the image geometry is weak especially when only two

neighbouring images are considered. Therefore, we introduce multiple image matching based

on the SIFT algorithm into the mapping procedure. The whole process takes five steps.

First, a feature database is generated. It is a collection of all SIFT features extracted

from map images. The database in fact consists of two sub-databases: keypoint database

(F_database) and descriptor database (D_database). As shown in Figure 1, each extracted

keypoint creates a four-parameter record, indicating its 2D location relative to the training

image in which it was found (the first and second columns of F_database), the scale (the

third column of F_database) and orientation (the fourth column of F_database).

Meanwhile, each feature (keypoint) is associated with a descriptor (a 128-dimensional vector). These descriptors are put together to form the descriptor database; Figure 1 shows the first 5 dimensions of the 128-dimensional vectors.

Figure 1. Feature database (snapshot from the program).
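As a rough illustration (not the actual program shown in Figure 1), the F_database and D_database could be assembled with OpenCV's SIFT implementation roughly as follows; the map-image file names and count are placeholders.

```python
import cv2
import numpy as np

sift = cv2.SIFT_create()
F_records = []    # rows: [x, y, scale, orientation] per keypoint
D_records = []    # rows: 128-dimensional SIFT descriptors
image_ids = []    # which map image each feature came from

# "map_00.png" ... "map_15.png" are placeholder names for the map images.
for img_id in range(16):
    img = cv2.imread(f"map_{img_id:02d}.png", cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = sift.detectAndCompute(img, None)
    for kp in keypoints:
        F_records.append([kp.pt[0], kp.pt[1], kp.size, kp.angle])
    D_records.append(descriptors)
    image_ids.extend([img_id] * len(keypoints))

F_database = np.array(F_records)      # N x 4 keypoint records (location, scale, orientation)
D_database = np.vstack(D_records)     # N x 128 descriptor records
image_index = np.array(image_ids)     # N image labels, used later for voting
```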

After the features have been extracted, multi-image matching is performed. The main difference from pair-wise matching is that a single feature may have several matches, since multiple images may overlap a single ray. Therefore, a K-NN search function is used to find the nearest neighbours of every feature from all the map images in the database. We use an exhaustive search since it works well in high dimensions; it is performed on the 128-dimensional feature descriptors in the D_database. In our approach, a 4-NN search is performed to find potential matches for each feature. Note that even for feature points that have no true correspondence in the database, some random points will still be found as potential candidates. As shown in Figure 2, for every feature in the database, its four nearest neighbours (including itself) have been found and listed in a row; the first column has the same id as the reference feature point, indicating that the nearest neighbour is the feature point itself.
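A minimal sketch of the exhaustive 4-NN search over the 128-dimensional descriptors, in NumPy; it assumes the D_database array from the previous sketch and is not the authors' implementation.

```python
import numpy as np

def knn_search(D_database, k=4, chunk=512):
    """Exhaustive (brute-force) k-NN search of every descriptor against the database.

    Because every descriptor is itself in the database, its first neighbour is
    normally the feature itself, as seen in Figure 2."""
    db = D_database.astype(np.float32)
    sq_db = (db ** 2).sum(axis=1)
    n = len(db)
    neighbours = np.empty((n, k), dtype=np.int64)
    for start in range(0, n, chunk):                 # chunked to limit memory use
        block = db[start:start + chunk]
        # Squared Euclidean distances between the block and the whole database.
        d2 = ((block ** 2).sum(axis=1)[:, None] + sq_db[None, :]
              - 2.0 * block @ db.T)
        neighbours[start:start + chunk] = np.argsort(d2, axis=1)[:, :k]
    return neighbours

nn = knn_search(D_database, k=4)    # D_database from the previous sketch
```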

The next step is to find images with large overlapping areas to form an image group; that is, we aim to identify the images that have a large number of matches with each other. A voting strategy is used. First, we choose a query image from the map image database, then group together the nearest neighbours (candidate matches) associated with features that belong to the query image. Each candidate match votes for the image it comes from; therefore, the one with the largest number of votes is the image that has the largest number of matching features with the current query image. We find the three images with the highest rank for each query image and place them in an image group. It can be expected that the map image itself comes first. It is also noted that points with no corresponding feature points, as well as false matches, are still involved in the vote; since they are randomly distributed over the images, the final rank will not be affected when the database has a large number of features. After the image groups have been generated, a geometric constraint is used to remove mismatches from the data-set: within each image group, RANSAC is applied pair-wise.
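The voting strategy for forming image groups could be sketched as follows; nn and image_index are the hypothetical arrays from the previous sketches, and the group size of three follows the system described in the text.

```python
import numpy as np

def image_group(query_img_id, nn, image_index, group_size=3):
    """Form an image group for one query map image by voting.

    nn: (N, 4) nearest-neighbour indices from the 4-NN search;
    image_index: (N,) id of the map image each feature belongs to."""
    votes = np.zeros(image_index.max() + 1, dtype=np.int64)
    for feature in np.where(image_index == query_img_id)[0]:
        for neighbour in nn[feature]:
            votes[image_index[neighbour]] += 1   # each candidate match votes for its image
    ranked = np.argsort(votes)[::-1]             # the query image itself should rank first
    return ranked[:group_size]

group = image_group(0, nn, image_index)          # group of three images for map image 0
```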

Finally, bundle adjustment is used to geo-reference common feature points of the map

images. Different from the previous attempt which imports only two images into a bundle

adjustment, here bundle adjustment is performed on each image group (more than two

images included). With additional images and greater redundancy for the least squares

Figure 2. A 4-NN search is performed to find potential matches for each feature (snapshot from the program).


adjustment, the geometric strength of the network is enhanced and the accuracy of the final

results, that is, the camera external parameters and the 3D coordinates of feature points, is

enhanced. The final positioning accuracy will benefit from this. It is noted that

the geo-referencing accuracy will not always be improved by increasing the number of

images involved. If we use bundle adjustment only once on the whole image sequence,

increasing errors could produce an initial solution too far from the optimal one for the

adjustment to converge. Therefore, we choose images in groups of three for the system.
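A full photogrammetric bundle adjustment is beyond a short example, but the core operation of recovering a tie point's 3D coordinates from its measurements in several images of a group can be illustrated with a linear (DLT) multi-view triangulation. This is only a sketch of the underlying geometry, not the adjustment used in the system; the projection matrices are assumed to come from the estimated exterior orientations.

```python
import numpy as np

def triangulate(projections, points_2d):
    """Linear (DLT) triangulation of one tie point observed in several images.

    projections: list of 3x4 camera projection matrices (central projection model)
    points_2d:   list of corresponding (x, y) image measurements
    Returns the 3D point in object coordinates."""
    rows = []
    for P, (x, y) in zip(projections, points_2d):
        rows.append(x * P[2] - P[0])   # each observation contributes two linear equations
        rows.append(y * P[2] - P[1])
    A = np.array(rows)
    _, _, Vt = np.linalg.svd(A)        # least-squares solution of A X = 0
    X = Vt[-1]
    return X[:3] / X[3]                # dehomogenise

# With three overlapping images in a group, the extra redundancy strengthens the
# geometry; P1, P2, P3 and the measured image points would come from the adjustment.
```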

2.2. Evaluating the performance of SIFT matching for the vision-based positioning system

At the real-time positioning stage, when real-time images are taken by the vision sensor

mounted on the (moving) vehicle, another image matching based on SIFT is carried out

between the real-time image and the map images. When any of the SIFT feature points from the map image(s) finds its correspondence on the real-time image, the geo-information it carries can be transferred to its counterpart. Therefore, matched SIFT features on the real-time image obtain both image coordinates from the matching process and 3D coordinates from the map images, which can later serve as pseudo ground control points (PGCPs) for space resection-

based positioning at the final stage. More specifically, modified space resection is utilised to

calculate vision sensor’s external orientation in 6 DOF. Outliers including mismatches are

detected and removed by the system outlier detection mechanism based on RANSAC.
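As an illustrative analogue of this step (not the authors' modified space resection), OpenCV's RANSAC-based PnP solver recovers the 6-DOF camera pose from PGCPs while flagging outliers; the function name, thresholds and argument layout below are assumptions for the sketch.

```python
import cv2
import numpy as np

def estimate_pose_from_pgcps(pgcp_xyz, pgcp_uv, camera_matrix):
    """Recover the 6-DOF camera pose from pseudo ground control points (PGCPs).

    pgcp_xyz: (N, 3) object-space coordinates transferred from the map
    pgcp_uv:  (N, 2) matched image coordinates on the real-time image
    camera_matrix: 3x3 intrinsic matrix of the calibrated camera
    """
    object_points = np.asarray(pgcp_xyz, dtype=np.float64)
    image_points = np.asarray(pgcp_uv, dtype=np.float64)
    # RANSAC-based PnP stands in here for the modified space resection with
    # outlier detection described in the text; it flags mismatched PGCPs as outliers.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        object_points, image_points, camera_matrix, None,
        reprojectionError=3.0, confidence=0.99)
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)                   # rotation matrix (exterior orientation)
    camera_position = (-R.T @ tvec).ravel()      # camera centre in the object frame
    return camera_position, R, inliers
```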

A controlled experiment is designed to evaluate the performance of SIFT matching for

the image-based positioning system. The aim is to identify the reason behind positioning

failure and inaccurate position results. In our indoor measurement set-up, factors such as illumination can be easily controlled. A calibrated CCD camera (Canon EOS4500, Canon

Inc., Japan) with a fixed focal length at 24.2mm was used as vision sensor of the

positioning system. First, three stable camera sites were deployed facing different mapping

areas with X, Y, Z coordinates of the three camera stations surveyed by a total station, and

angular changes at each camera site were roughly measured. A total of 16 images were taken at the three sites, and the results for the images taken with the lights turned on are shown in Table 1. For

each image matching process, matches before and after RANSAC (used to reject

mismatches) are compared. Furthermore, the number of PGCP generated by each matching

process is also studied along with the geometry of these points, which will directly affect

the precision of positioning. Finally, a manual check for the PGCP locations from the two

matched images is performed to verify the correctness of image matching after RANSAC.

The third column in Table 1 indicates the angular change of viewpoint in kappa (around the Z-axis) for each camera site; the other angles remain stable. The viewing direction perpendicular to the mapping area (the wall with geo-referenced features) is taken as 0, and a clockwise rotation as a positive change. It is easy to note that the only epoch that fails to give a positioning result is the one with the most drastic angular change, real-time image No. 5. Not only has the total number of SIFT matches decreased, but the percentage of correct matches retained by RANSAC has also been reduced; it is in fact the image epoch with the lowest correct rate. As a result, too few PGCPs are generated for positioning. This indicates that when the two matched images suffer from large viewpoint variation, fewer SIFT matches will be found, and there is a higher chance of generating false matches. But if we look at the other images with smaller angular changes, such a rule does not apply. The reason is that the performance of SIFT-based matching only drops under substantial viewpoint changes.


Secondly, it is noticed that after using RANSAC to reject mismatches, there is still a small chance that mismatches are left untreated, which might later generate false PGCPs that jeopardise the final positioning process.

In summary, two major weaknesses of image matching in the system have been found: invariant feature matching cannot deal with drastic viewpoint shifts, and the currently popular outlier detection mechanism, RANSAC, cannot guarantee the correctness of every pair. These problems affect the final positioning by deteriorating the precision through an inadequate number of matches or by bringing in false matches.

3. Using ASIFT for viewpoint changes

In order to tackle the unsatisfactory performance of SIFT under dramatic viewpoint differences, some approaches have recently been proposed to extend scale and rotation invariance to affine invariance, such as maximally stable extremal regions (Matas et al. 2004) and Harris/Hessian affine (Mikolajczyk and Schmid 2004). Although these methods have been proved to enable matching under a stronger viewpoint change, all of them are prone to fail at a certain point (Nalpantidis, Chrysostomou, and Gasteratos 2009). A better idea is to simulate viewpoint changes in order to reach affine invariance; the most successful algorithm using such a method is named ASIFT (affine-SIFT). It was introduced by Morel and Yu (2009) to explicitly deal with extreme angle changes (up to 36 and higher). SIFT is only partially invariant to viewpoint changes because it is invariant

to four out of the six parameters of an affine transform. ASIFT, on the other hand,

simulates all image views obtainable by varying the two camera axis orientation parameters,

namely, the latitude and the longitude angles, left over by the SIFT method. Then, it covers

the other four parameters by using the SIFT method itself (Morel and Yu 2009).

In this paper, we introduce ASIFT into our vision-based navigation system to replace

SIFT in order to achieve a more robust positioning result against viewpoint variation. At

both mapping and positioning stage, ASIFT-based image matching is used in the same

way SIFT is utilised. In order to evaluate its performance and compare it with that of SIFT,

data-sets from the same controlled experiment are used.
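To make the simulation idea concrete, the following is a much-simplified sketch in the spirit of ASIFT: the first image is warped with a few sampled tilts and rotations, SIFT features are extracted from each simulated view and matched, and the match coordinates are mapped back to the original image. The sampling scheme and ratio-test threshold are illustrative assumptions, not the published ASIFT algorithm (and no ORSA step is included).

```python
import cv2
import numpy as np

def simulated_sift_matches(img1, img2, tilts=(1.0, 2.0, 4.0), rot_step=45.0):
    """Simplified affine-simulation matching in the spirit of ASIFT."""
    sift = cv2.SIFT_create()
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    kp2, des2 = sift.detectAndCompute(img2, None)
    h, w = img1.shape[:2]
    matches = []  # list of ((x1, y1), (x2, y2)) tentative correspondences
    for t in tilts:
        rotations = np.arange(0.0, 180.0, rot_step) if t > 1.0 else [0.0]
        for phi in rotations:
            # Rotation followed by a tilt (compression along y) simulates a viewpoint change.
            A = cv2.getRotationMatrix2D((w / 2, h / 2), float(phi), 1.0)
            A[1, :] /= t
            warped = cv2.warpAffine(img1, A, (w, h))
            kp1, des1 = sift.detectAndCompute(warped, None)
            if des1 is None or des2 is None:
                continue
            Ainv = cv2.invertAffineTransform(A)
            for pair in matcher.knnMatch(des1, des2, k=2):
                if len(pair) < 2:
                    continue
                m, n = pair
                if m.distance < 0.8 * n.distance:        # Lowe's ratio test
                    x, y = kp1[m.queryIdx].pt
                    p1 = Ainv @ np.array([x, y, 1.0])    # back to original img1 coordinates
                    matches.append((tuple(p1), kp2[m.trainIdx].pt))
    return matches
```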

In Figures 3 and 4, we show the matching between real-time query image No. 5 and map image No. 10 using SIFT and ASIFT, respectively. Under dramatic view changes, ASIFT produces more reliable matches (inliers); although SIFT obtains as many tentative matches, most of them are mismatches and are filtered out by RANSAC. When dealing with images without much angular difference, however, ASIFT had a rather unstable

Table 1. Performance of SIFT matching in the system.

ST ID | ID | Angular change | All SIFT matches | Reliable matches | Number of PGCPs | Number of false PGCPs | Correct rate of PGCPs (%) | PDOP | ADOP
1 | 1 | 0 | 578 | 202 | 51 | 1 | 98 | 2399 | 583
  | 3 | -30 | 372 | 103 | 32 | 2 | 94 | 1149 | 223
  | 5 | 50 | 267 | 59 | 2 | 0 | 100 | - | -
2 | 7 | 0 | 427 | 145 | 35 | 0 | 100 | 2106 | 530
  | 9 | -20 | 401 | 193 | 59 | 1 | 98 | 969 | 235
3 | 11 | 0 | 417 | 112 | 11 | 0 | 100 | 5711 | 1177
  | 13 | -30 | 457 | 146 | 22 | 0 | 100 | 372 | 2133
  | 15 | 20 | 241 | 65 | 5 | 0 | 100 | 5813 | 28,658


performance, as shown in Figures 5–8, with the former pair favouring SIFT and the latter favouring ASIFT. The reason lies in the fact that the authors of ASIFT have already included an outlier detection mechanism (Optimized Random Sampling Algorithm, ORSA) in the ASIFT algorithm. When the angular change is high, SIFT tends to produce more mismatches and is thus outperformed by ASIFT in terms of both the number of inliers and the number of PGCPs. On the other hand,

Figure 3. Matching between real-time query image No. 5 with map image No. 10 using SIFT + RANSAC; dramatic angular change; 25 reliable matches (inliers).

Figure 4. Matching between real-time query image No. 5 with map image No. 10 using ASIFT + RANSAC; dramatic angular change; 47 reliable matches (inliers).


if the angular change is low, then because of the pre-filtering mechanism of ASIFT, the inliers of ASIFT largely depend on the combined effect of ORSA and RANSAC. In order to further compare the performance of the two algorithms for our system, ASIFT is used without RANSAC at the positioning stage. The results are shown in Table 2.

Comparing Tables 1 and 2, the two main factors taken into consideration are the geometry given by the PGCPs, which is related to positioning precision, and the outliers

Figure 5. Matching between real-time query image No. 1 with map image No. 9 using SIFT + RANSAC; small angular change; 100 reliable matches (inliers).

Figure 6. Matching between real-time query image No. 1 with map image No. 9 using ASIFT + RANSAC; small angular change; 16 reliable matches (inliers).


remaining in the PGCP data-set. An obvious improvement took place on epoch No. 5, the one with the dramatic angular change, where ASIFT has been able to generate a reasonable number of PGCPs for a positioning solution. On the other epochs, ASIFT has yet to outperform SIFT in terms of the geometry (represented by Position DOP (PDOP) and Attitude DOP (ADOP))

Figure 7. Matching between real-time query image No. 11 with map image No. 10 using SIFT + RANSAC; small angular change; 47 reliable matches (inliers).

Figure 8. Matching between real-time query image No. 11 with map image No. 10 using ASIFT + RANSAC; small angular change; 190 reliable matches (inliers).


provided by the PGCPs. It has also been observed that the correct rate of PGCPs has not always reached 100% in the two tables, which means that both RANSAC for SIFT and ORSA for ASIFT have left some mismatches untreated. The reason is that all RANSAC-like methods (e.g. RANSAC and ORSA) have the same bottleneck: when mismatches are near their epipolar line, no matter how far away they are from their true correspondences, it is hard for these mismatches to be detected. Our test even revealed a lower correct rate of PGCPs from ASIFT. Considering that ASIFT is more computationally expensive than SIFT, and that for real-time applications like vision-based navigation it is important to reduce the complexity of the algorithms being used, we conclude that ASIFT is less efficient than SIFT when dealing with low angular changes. We only choose to use ASIFT as a backup plan when SIFT fails to get a result because of dramatic viewpoint changes.

4. Feature-based matching with integration of cross-correlation information

It has been noticed that both SIFT and ASIFT produce a substantial number of

mismatches. The reason is that such feature-based methods base the choice of correspondence on local information and fail to consider the global context. When an image has repeated patterns, ambiguities will occur because the local information for the similar parts is identical.

A general solution is to use robust estimation method like RANSAC to remove

outliers. For a number of iterations, a random sample of four correspondences is selected

and a homography H is computed from those four correspondences. Every other

correspondence is then classified as an inlier or outlier depending on its concurrence with

H. After all of the iterations have finished, the iteration that contained the largest number

of inliers is selected. H can then be recomputed from all of the correspondences that were

considered as inliers in that iteration. While this method can remove most of the

mismatches, experiments have proved that it cannot guarantee the correctness of every 'inlier'. The reason is that when mismatches are near their epipolar line, it is easier for them to be treated as correct matches. Moreover, the iteration can only run a limited number of times (1,000 in our system), and there is a certain chance that the iteration that has the largest number of inliers still produces an erroneous or inaccurate H, since most of the inliers it includes are mistaken. As a result, mismatches are selected as reliable matches, which will further affect the positioning accuracy.
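The procedure just described can be summarised in a minimal RANSAC-homography sketch. OpenCV is used only for the four-point homography estimate; the 1,000 iterations follow the text, while the inlier threshold is an assumption.

```python
import cv2
import numpy as np

def ransac_homography(pts1, pts2, n_iter=1000, thresh=3.0):
    """Minimal RANSAC over tentative correspondences pts1 <-> pts2 (N x 2 arrays).

    Returns the homography recomputed from the best inlier set and the inlier mask."""
    pts1 = np.asarray(pts1, dtype=np.float64)
    pts2 = np.asarray(pts2, dtype=np.float64)
    n = len(pts1)
    best_inliers = np.zeros(n, dtype=bool)
    rng = np.random.default_rng(0)
    hom1 = np.hstack([pts1, np.ones((n, 1))])          # homogeneous source points
    for _ in range(n_iter):
        sample = rng.choice(n, 4, replace=False)       # random sample of 4 correspondences
        try:
            H = cv2.getPerspectiveTransform(pts1[sample].astype(np.float32),
                                            pts2[sample].astype(np.float32))
        except cv2.error:
            continue                                    # degenerate (e.g. collinear) sample
        proj = hom1 @ H.T
        zs = np.where(np.abs(proj[:, 2:3]) < 1e-12, 1e-12, proj[:, 2:3])
        err = np.linalg.norm(proj[:, :2] / zs - pts2, axis=1)
        inliers = err < thresh                          # classify each correspondence
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Recompute H from all correspondences judged as inliers in the best iteration.
    H, _ = cv2.findHomography(pts1[best_inliers], pts2[best_inliers], 0)
    return H, best_inliers
```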

In this study, we propose to integrate intensity-based method into the feature matching

process to strengthen the robustness of the matching algorithm against mismatches and

Table 2. Performance of ASIFT matching in the system.

ID | Angular change | All ASIFT matches | PGCP number | Number of false PGCPs | Correct rate of PGCPs (%) | PDOP | ADOP
1 | 0 | 296 | 13 | 3 | 77 | 2112 | 516
3 | -30 | 218 | 22 | 2 | 91 | 1353 | 276
5 | 50 | 381 | 14 | 0 | 100 | 1761 | 369
7 | 0 | 257 | 17 | 1 | 94 | 670 | 165
9 | -20 | 178 | 12 | 1 | 92 | 1746 | 405
11 | 0 | 716 | 15 | 0 | 100 | 9371 | 1913
13 | -30 | 471 | 10 | 2 | 80 | 12,801 | 2400
15 | 20 | 407 | 21 | 0 | 100 | 9490 | 1914


noise. More specifically, cross-correlation information is used as an analysis and selection criterion for the matching. Instead of identifying a mismatch after it has been generated, the criterion determines how good the homography model (H) is for the two matching images and discards bad ones to reduce the chance that mismatches are included. In RANSAC, we use a 2D projective transformation H (planar homography) to approximate the geometric transformation between two images (e.g. I and I′). Any two corresponding SIFT features in images I and I′ passing RANSAC will comply with the model, which ideally can be expressed as:

$$\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} = \begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \qquad (1)$$

In Equation (1), $p = [x, y, 1]^T$ and $p' = [x', y', 1]^T$ denote the two points expressed using homogeneous coordinates, and $\begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & 1 \end{bmatrix}$ represents the homography model H. The submatrix $\begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \end{bmatrix}$ parameterises the affine changes, $[a_3 \; b_3]^T$ the shift parameters and $[c_1 \; c_2]$ the projective deformation. After the homography model has been generated by RANSAC processing, a local square patch in I with a size of $(2w + 1) \times (2w + 1)$ centred on p is generated, denoted N(p). By using the estimated H, N(p) is transformed into N(p′) and resampled on image I′. Then, the cross-correlation between the two window patches is calculated using Equation (2), where $G_{uv}$ and $G'_{uv}$ represent the intensity values of the two correlation windows, respectively, whereas $m(G)$ and $m(G')$ denote their average intensities:

$$r(G, G') = \frac{\sum_{u=-w}^{w}\sum_{v=-w}^{w}\bigl(G_{uv} - m(G)\bigr)\bigl(G'_{uv} - m(G')\bigr)}{\sqrt{\sum_{u=-w}^{w}\sum_{v=-w}^{w}\bigl(G_{uv} - m(G)\bigr)^{2}\cdot\sum_{u=-w}^{w}\sum_{v=-w}^{w}\bigl(G'_{uv} - m(G')\bigr)^{2}}} \qquad (2)$$

In Equation (2), r(G, G′) varies from −1 to 1; the closer it is to 1, the higher the correlation, which here indicates greater similarity between the two patches and a greater possibility of being correct corresponding points. Therefore, we calculate the cross-correlation for each pair of reliable matches (inliers) in every single matching process. An average correlation r̄ is calculated over all the matched (reliable) points produced by one matching (one H is generated). If it is close to 1, it means the estimated homography model H is very accurate, and vice versa. An example is shown in Figures 9 and 10 (both from our experiment), which illustrate that a bad r̄ could include a mismatch as an inlier and, on the contrary, an r̄ closer to 1 has less chance of including a mismatch as an inlier.

A certain threshold is set for r̄ (0.75 in the experiment). After every matching process, an r̄ is calculated. If it is smaller than the threshold, the matched points generated by the process will be discarded and the two images will be re-matched.
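A minimal sketch of the proposed check: for every inlier p, the patch N(p) is taken from I, its grid is mapped through the estimated H and resampled on I′ to give N(p′), Equation (2) is evaluated, and the average r̄ over all inliers is compared with the 0.75 threshold. The window half-size w is an assumption (it is not stated here in the text).

```python
import cv2
import numpy as np

def homography_quality(img1, img2, H, pts1, w=10, threshold=0.75):
    """Average normalised cross-correlation (Equation (2)) over all inlier points.

    img1, img2: greyscale images I and I'; H: estimated homography;
    pts1: inlier points p in I. Returns (r_bar, accept)."""
    img1 = img1.astype(np.float32)
    img2 = img2.astype(np.float32)
    offsets = np.arange(-w, w + 1, dtype=np.float32)
    gx, gy = np.meshgrid(offsets, offsets)               # (2w+1) x (2w+1) grid of offsets
    scores = []
    for (x, y) in np.asarray(pts1, dtype=np.float32):
        # N(p): intensities of the square patch centred on p in image I.
        patch1 = cv2.getRectSubPix(img1, (2 * w + 1, 2 * w + 1), (float(x), float(y)))
        # Map every grid point of N(p) through H and resample image I' there: N(p').
        pts = np.stack([gx + x, gy + y, np.ones_like(gx)], axis=-1).reshape(-1, 3) @ H.T
        map_xy = (pts[:, :2] / pts[:, 2:3]).astype(np.float32).reshape(gx.shape + (2,))
        patch2 = cv2.remap(img2, map_xy[..., 0], map_xy[..., 1], cv2.INTER_LINEAR)
        # Equation (2): normalised cross-correlation of the two windows.
        g1 = patch1 - patch1.mean()
        g2 = patch2 - patch2.mean()
        denom = np.sqrt((g1 ** 2).sum() * (g2 ** 2).sum())
        if denom > 0:
            scores.append(float((g1 * g2).sum() / denom))
    r_bar = float(np.mean(scores)) if scores else 0.0
    # If r_bar falls below the threshold, the matches from this H are discarded
    # and the two images are re-matched, as described in the text.
    return r_bar, r_bar >= threshold
```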

By using such a criterion in the system, the number of false PGCPs has been reduced. In order to evaluate the impact on the final positioning accuracy, the root mean square error for every camera station is calculated, and the results before and after using the cross-correlation information are compared in Table 3, which shows that the positioning accuracy has been improved.


5. Concluding remarks

In this paper, we give an overview of image matching techniques in the context of vision-based navigation systems. Based on a variety of systems (stereo vision, SFM and the map-based approach), different methods are discussed. Focusing on the map-based approach, which generally uses feature-based matching for localisation, and based on our earlier developed system, a performance analysis has been carried out. Three major problems have been identified: inadequate mapping; and, at the positioning stage, vulnerability to drastic viewpoint changes and a large percentage of mismatches.

First, multi-image matching has been introduced at the mapping stage. Compared with the pair-wise approach, more features have been geo-referenced with better imaging geometry. The vision-based positioning based on the 3D map benefits from this in terms of a greater number and better distribution of PGCPs.

Then, by using ASIFT, the major improvement takes place on the epoch with large viewpoint changes: more reliable matches have been produced and the system has been able to obtain a positioning result. However, our experiments have also revealed that ASIFT has yet to outperform SIFT in terms of PGCP geometry and the correct rate of PGCPs. Taking the computation cost into consideration, SIFT is still the favoured option; ASIFT is used as a back-up plan when SIFT fails to get a result because of large viewpoint changes.

Using RANSAC to remove mismatches has been a popular approach for feature-based

matching in visual systems. But as has been proved by the performance analysis, some

Figure 9. Matching between real-time query image No. 13 with map image No. 8 using SIFT + RANSAC + cross-correlation: r̄ = 0.23.


mismatches may still be included as inliers (reliable matches) in the final positioning process. The reason for this is that the iterations in RANSAC can only run a limited number of times, and there is a chance that the iteration having the largest number of inliers still produces an erroneous or very inaccurate homography model; as a consequence, mismatches are selected as reliable matches. In order to deal with the problem, we have proposed to use cross-correlation information to evaluate the quality of the homography model, and the conducted experiments have proved that such an approach can reduce the chances of mismatches being included and that the final positioning accuracy can be improved.

Further research will be focused on improving the reliability and precision of image

matching in an effort to improve the overall positioning accuracy. Meanwhile, the

computation load of image matching will be considered for such real-time system operations.

Table 3. Comparison of the RMSE of the positioning results.

RMSE | X (m) | Y (m) | Z (m)
Station 1, SIFT | 0.0477 | 0.0838 | 0.0598
Station 1, SIFT + cross-correlation | 0.0469 | 0.0824 | 0.0613
Station 2, SIFT | 0.0641 | 0.1192 | 0.2710
Station 2, SIFT + cross-correlation | 0.0644 | 0.0582 | 0.2481
Station 3, SIFT | 0.3397 | 0.7191 | 0.3278
Station 3, SIFT + cross-correlation | 0.3164 | 0.6439 | 0.2883

Figure 10. Matching between real-time query image No. 13 with map image No. 8 using SIFT + RANSAC + cross-correlation: r̄ = 0.79.


Note

1. Part of this paper was previously published by IEEE for, and presented at, the International Conference on Indoor Positioning and Indoor Navigation (IPIN 2012), 13–15th November 2012.

References

Bay, H., T. Tuytelaars, and L. V. Gool. 2006. "SURF: Speeded-Up Robust Features." In 9th European Conference on Computer Vision, Graz, Austria, May 7–13, 404–417. Berlin/Heidelberg: Springer. doi: 10.1007/11744023_32

Boris, R., K. Effrosyni, and D. Marcin. 2008. "Mobile Museum Guide Based on Fast SIFT Recognition." In 6th International Workshop on Adaptive Multimedia Retrieval, Berlin, Germany, Vol. 5811, June 26–27. Springer LNCS.

Fischler, M. A., and R. C. Bolles. 1981. "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography." Communications of the ACM 24 (6): 381–395.

Forstner, W. 1986. A Feature Based Correspondence Algorithm for Image Matching. International Archives of Photogrammetry. Vol. 26-III. Rovaniemi, Finland: IAPRS.

Gruen, A. W. 1985. "Adaptive Least Squares Correlation: A Powerful Image Matching Technique." South African Journal of Photogrammetry, Remote Sensing and Cartography 14: 175–187.

Harris, C., and M. Stephens. 1988. "A Combined Corner and Edge Detector." In Fourth Alvey Vision Conference, Manchester, UK, 31 August–2 September, 147–151.

Kuhl, A. 2004. "Comparison of Stereo Matching Algorithms for Mobile Robots." M.Sc. thesis, Fakultat fur Informatik und Automatisierung, Technische Universitat Ilmenau, Ilmenau, Germany.

Lang, F., and W. Forstner. 1995. "Matching Techniques." In Second Course in Digital Photogrammetry. Landesvermessungsamt NRW.

Lowe, D. 2004. "Distinctive Image Features from Scale Invariant Key Points." International Journal of Computer Vision 60 (2): 91–110.

Matas, J., O. Chum, M. Urban, and T. Pajdla. 2004. "Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions." Image and Vision Computing 22: 761–767.

Mikolajczyk, K., and C. Schmid. 2004. "Scale & Affine Invariant Interest Point Detectors." International Journal of Computer Vision 60: 63–86.

Morel, J. M., and G. Yu. 2009. "ASIFT: A New Framework for Fully Affine Invariant Image Comparison." SIAM Journal on Imaging Sciences 2 (2): 438–469.

Nalpantidis, L., D. Chrysostomou, and A. Gasteratos. 2009. "Obtaining Reliable Depth Maps for Robotic Applications with a Quad-Camera System." In ICRA09 Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, Kobe, Japan, May 12.

Remondino, F., and C. Ressl. 2006. Overview and Experiences in Automated Markerless Image Orientation. Vol. 36, Part 3. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (IAPRS), Archives no. 1682-1750.

Schmid, C., and R. Mohr. 1997. "Local Gray Value Invariants for Image Retrieval." IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (5): 530–535.

Se, S., D. Lowe, and J. Little. 2002. "Mobile Robot Localization and Mapping with Uncertainty Using Scale-Invariant Visual Landmarks." International Journal of Robotics Research 21 (8): 735–760.

Trucco, E., and A. Verri. 1998. Introductory Techniques for 3-D Computer Vision. Vol. 93. Englewood Cliffs: Prentice Hall.

Zabih, R., and J. Woodfill. 1994. "Non-parametric Local Transforms for Computing Visual Correspondence." In Third European Conference on Computer Vision, Stockholm, Sweden, May 2–6.

Zhang, G., Z. Dong, J. Jia, T. T. Wong, and H. Bao. 2010. "Efficient Non-Consecutive Feature Tracking for Structure-from-Motion." In Proceedings of the European Conference on Computer Vision (ECCV '10), Crete, Greece, September 5–11.
