This article was downloaded by: [Aston University] on 29 January 2014, at 02:48. Publisher: Taylor & Francis. Informa Ltd, Registered in England and Wales, Registered Number: 1072954. Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK.
Journal of Location Based Services. Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/tlbs20
Image matching techniques for vision-based indoor navigation systems: a 3D map-based approach. Xun Li & Jinling Wang (a)
(a) School of Surveying and Geospatial Engineering, University of New South Wales, Sydney, Australia. Published online: 20 Jan 2014.
To cite this article: Xun Li & Jinling Wang, Journal of Location Based Services (2014): Image matching techniques for vision-based indoor navigation systems: a 3D map-based approach, Journal of Location Based Services, DOI: 10.1080/17489725.2013.837201
To link to this article: http://dx.doi.org/10.1080/17489725.2013.837201
Image matching techniques for vision-based indoor navigation systems: a 3D map-based approach
Xun Li* and Jinling Wang
School of Surveying and Geospatial Engineering, University of New South Wales, Sydney, Australia
(Received 8 March 2013; final version received 13 August 2013; accepted 15 August 2013)
In this paper, we give an overview of image matching techniques for various vision-based navigation systems: stereo vision, structure from motion and the map-based approach. Focusing on the map-based approach, which generally uses feature-based matching for mapping and localisation, and based on our earlier developed system, a performance analysis has been carried out and three major problems have been identified: inadequate geo-referencing and imaging geometry for mapping, vulnerability to drastic viewpoint changes and a large percentage of mismatches at positioning. We introduce multi-image matching for mapping. By using the affine-scale-invariant feature transform for viewpoint changes, the major improvement takes place on the epoch with large viewpoint changes. In order to deal with mismatches that could not be removed by Random Sample Consensus (RANSAC), we propose a new method that uses cross-correlation information to evaluate the quality of the homography model and select the proper one. The conducted experiments have proved that such an approach can reduce the chance of mismatches being included by RANSAC, and the final positioning accuracy can be improved.
Keywords: indoor user localisation; map matching; navigation technologies; positioning systems; vision-based localisation
1. Introduction
Image matching techniques have been used in a variety of applications, such as 3D
modelling, image stitching, motion tracking, object recognition and vision-based
localisation. Over the past few years, many different methods have been developed, which
can be generally classified into two groups: area-based matching (intensity-based, such as cross-correlation and least-squares matching; Gruen 1985) and feature-based matching (e.g. the scale-invariant feature transform (SIFT); Lowe 2004). Area-based methods (Trucco
and Verri 1998) may be comparatively more accurate, because they take into account a
whole neighbourhood around the feature points being analysed to establish
correspondences. Feature-based methods, on the other hand, use symbolic descriptions
of the images that contain certain local image information to establish correspondence. No
single algorithm, however, has been universally regarded as optimal for all applications
since they all have their pros and cons. In this paper, we explore the performance of
© 2014 Taylor & Francis
*Corresponding author. Email: [email protected]
various image matching techniques according to the specific needs of vision-based
positioning systems for indoor navigation applications.
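As a concrete illustration of the feature-based side of this split, a minimal nearest-neighbour descriptor matcher can be sketched in a few lines. This is an illustrative Python/NumPy sketch, not the code of any cited system; the acceptance rule follows Lowe's (2004) ratio test, and the names and thresholds are assumptions for illustration:

```python
import numpy as np

def match_descriptors(desc_a, desc_b, ratio=0.8):
    """Feature-based matching sketch: for each descriptor in desc_a, find its
    nearest neighbour in desc_b and accept it only if it is sufficiently
    closer than the second-nearest candidate (Lowe's ratio test).
    Returns a list of (index_in_a, index_in_b) pairs."""
    matches = []
    for i, d in enumerate(desc_a):
        dists = np.linalg.norm(desc_b - d, axis=1)   # Euclidean distances
        j, k = np.argsort(dists)[:2]                 # nearest and second-nearest
        if dists[j] < ratio * dists[k]:
            matches.append((i, int(j)))
    return matches
```

Area-based methods would instead correlate intensity windows directly, as discussed for the map-based system in Section 4.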
The main purpose of vision-based navigation is to determine the position (and orientation) of the vision sensor, from which the motion/trajectory of the mobile vehicle (the platform carrying the vision sensor) can be recovered. However, it is not without limitations. A vision sensor can measure relative position with derivative order of 0, but it senses only a 2D projection of the 3D world: direct depth information is lost. Several approaches have been developed to tackle this problem. One way is to use stereo cameras, by which the distance to a landmark can be directly measured. Another way is to use monocular vision, either integrating data from multiple viewpoints (structure from motion; SFM) or relying on prior knowledge of the navigation environment (e.g. a map or models). Despite the varied forms of vision-based navigation systems, they all need the same basic but essential function to support their self-localisation mechanism: image matching. The way they use this function is, however, different, which in turn affects their choice of image matching methods.
For stereo vision-based approaches, stereo matching is employed to create a depth map
(i.e. disparity map) for navigation. Area-based algorithms solve the stereo correspondence
problem for every single pixel in the image. Therefore, these algorithms result in dense
depth maps as the depth is known for each pixel (Kuhl 2004). Typical methods include
Census (Zabih and Woodfill 1994), sum of absolute differences and sum of squared
differences. The common drawback is that they are computationally demanding. To deal
with the problem, some efforts have been made. Nalpantidis, Chrysostomou, and
Gasteratos (2009) proposed a quad-camera-based system which used a custom-tailored
correspondence algorithm to keep the computation load within reasonable limits.
Meanwhile, feature-based methods are less error-sensitive and require a smaller workload. But the resulting maps will be less detailed as the depth is not calculated for every pixel (Kuhl
2004). Therefore, how to obtain a disparity map that is both dense and accurate, while the system maintains a reasonable refresh rate, is the cornerstone of such systems' success and still remains an open question.
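To make the area-based stereo idea concrete, a minimal sum-of-absolute-differences (SAD) block matcher might look as follows. This is an illustrative Python/NumPy sketch under simplifying assumptions (rectified grey-scale images, arbitrary window size and disparity range), not the algorithm of any cited system:

```python
import numpy as np

def sad_disparity(left, right, max_disp=8, w=2):
    """Area-based stereo sketch: for every pixel of the left image, slide a
    window along the same scanline of the right image and pick the disparity
    with the smallest sum of absolute differences. Returns a dense
    (if naive and slow) disparity map."""
    h, wd = left.shape
    disp = np.zeros((h, wd), dtype=int)
    for y in range(w, h - w):
        for x in range(w + max_disp, wd - w):
            patch = left[y - w:y + w + 1, x - w:x + w + 1]
            costs = [np.abs(patch - right[y - w:y + w + 1,
                                          x - d - w:x - d + w + 1]).sum()
                     for d in range(max_disp + 1)]
            disp[y, x] = int(np.argmin(costs))
    return disp
```

The nested per-pixel loops make the computational burden discussed above obvious; real systems vectorise or approximate this search.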
For monocular vision sensors, the two approaches also differ from each other. For SFM,
consecutive frames present a very small parallax and small camera displacement. Given
the location of a feature in one frame, a common strategy for SFM is to use feature tracker
to find its correspondence in the consecutive image frame. Kanade–Lucas–Tomasi
tracker (Zhang et al. 2010) is widely used for small baseline matching. The methodology
for a feature tracker to track interest points through image sequence is usually based on a
combined use of feature- and area-based image matching methods. First interest points are
extracted by operators from the first image (e.g. Forstner 1986; Harris and Stephens 1988).
Then due to the very short baseline, positions of corresponding interest points in the
second image are predicted and matched with cross-correlation, which can be further
refined using least squares matching. Some approaches perform outlier rejection based
either on epipolar geometry (Remondino and Ressl 2006) or Random Sample Consensus
(RANSAC) (Fischler and Bolles 1981) for the last step (Zhang et al. 2010).
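The core of one tracking iteration described above (image gradients in a window around the interest point, then a least-squares solve for its displacement) can be sketched as follows. This is a single translation-only Lucas–Kanade step in Python/NumPy, a simplified sketch of the principle; a real KLT tracker adds image pyramids, iteration and the outlier rejection mentioned above:

```python
import numpy as np

def lucas_kanade_step(I0, I1, x, y, w=7):
    """One Lucas-Kanade iteration: estimate the displacement d = (dx, dy) of
    the (2w+1)x(2w+1) window centred on (x, y) between images I0 and I1 by
    solving the linearised brightness-constancy equations in least squares."""
    ys, xs = np.mgrid[y - w:y + w + 1, x - w:x + w + 1]
    Ix = (I0[ys, xs + 1] - I0[ys, xs - 1]) / 2.0     # spatial gradients
    Iy = (I0[ys + 1, xs] - I0[ys - 1, xs]) / 2.0
    It = I1[ys, xs] - I0[ys, xs]                     # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    d, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return d                                         # approx. (dx, dy)
```

Because the step linearises the image around the current position, it only works for the very small inter-frame displacements that make short-baseline tracking feasible in the first place.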
For the map-based approach, first a map in the form of image collection (or model) is
built by a learning step. Then, self-localisation is realised by matching the query image
with the corresponding scene in the database/map. Whenever a match is found, the
position information of this reference image is transferred directly to the query image and
used as user position. A major difference between map-based approach and two previous
methods lies in that the real-time query image might be taken at substantially different
viewpoint, distance or different illumination conditions from the map images in the
database. Besides, the two images might be taken using different optical devices. In other words, the two matching images may have a very large baseline, a large scale difference and strong perspective effects, which lead to a wide range of image transformations whose parameters are unknown. Due to such significant changes, most image correspondence algorithms that work well for short-baseline images (e.g. stereo or video sequences) will fail in this case. For the area-based approach, the cross-correlation method cannot achieve good performance when the rotation is greater than 20° or the scale difference is greater than 30% (Lang and Forstner 1995); an iterative search for least squares matching will require a
good initial guess of the two corresponding locations, which is not applicable in situations
where image transformation parameters are unknown. Moreover, early matching methods
based on corner detectors (Harris and Stephens 1988) would fail because of the big
perspective effects (Remondino and Ressl 2006). Therefore, more distinctive and
invariant features are needed. The first work in the area was by Schmid and Mohr (1997).
They used a jet of Gaussian derivatives to form a rotationally invariant descriptor around a
Harris corner. Lowe extended this approach to incorporate scale invariance (Lowe 2004).
Then, these newly developed invariant features were applied to image matching systems
(e.g. Bay, Tuytelaars, and Gool 2006; Boris, Effrosyni, and Marcin 2008; Se, Lowe, and
Little 2002) for accurate location estimation.
Map-based visual systems using invariant feature matching for localisation are a common approach in today's research domain. The most popular algorithm is Lowe's SIFT (2004). However, certain limitations still exist. In this paper, we mainly discuss the performance of various image matching methods used by such systems and their influence on final positioning. The aim is to find the bottleneck and possible improvements. Moreover, we propose to integrate an intensity-based method into the feature matching process to strengthen the robustness of the matching algorithm against mismatches and noise.
The paper is organised as follows: the first section discusses image matching methods in the context of vision-based navigation systems and the selection of different algorithms to cope with the different needs of various systems; in the second section, image matching methods used in map-based visual systems are discussed and certain limitations are revealed; in the third section, ASIFT (Morel and Yu 2009) is introduced into the system to address viewpoint changes; in the next section, we propose a SIFT-based method with cross-correlation information to reduce the chance of mismatches being included for positioning; we present our conclusions and final thoughts in the last section.
2. Map-based visual positioning with the use of geo-referenced SIFT features
The development of a map-based positioning and navigation system mainly consists of two steps: mapping, and positioning/navigation. Both steps contain an image matching procedure.
The SIFT features are invariant to image translation, scaling and rotation, and partially invariant to illumination changes and affine or 3D projection; therefore, we believe SIFT is a suitable choice for image matching at the positioning stage. Due to the nature of our system's methodology, in which the very same geo-referenced features need to be matched for positioning, the matching for tie point extraction at the mapping stage has to be consistent with the later one. Therefore, we choose SIFT-based matching for both the mapping and positioning stages. In the following sections, we will first focus on the image matching techniques used in mapping and introduce the improvements we made; then the performance of image matching at the positioning stage will be evaluated and certain deficiencies will be identified.
2.1. Feature-based multi-image matching for 3D mapping
A more detailed explanation of the system is as follows. First, mapping is carried out: images of the navigational environment are collected, and SIFT matching between images with overlapping areas is performed.
The theory behind 3D mapping is photogrammetric indirect geo-referencing. It is used to establish the relationship between image and object coordinate systems. In aerial
photogrammetry, it makes satellite, aerial as well as terrestrial imagery useful for mapping.
In close-range photogrammetry, it enables the production of accurate as-built measurements
and 3D reconstruction. In our case, the method is used to geo-reference SIFT features on
map images. The main idea is to locate an object point in three dimensions through
photogrammetric bundle adjustment, which uses central projection imaging as fundamental
mathematical model. More specifically, SIFT feature points on map images will be
geo-referenced through bundle adjustment (indirect geo-referencing).
There are two core issues for mapping: feature extraction and bundle adjustment. The
first one is used to identify common feature points (tie points) in overlapped areas, while
the latter is to estimate the 3D coordinates of these points.
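The geometric core of the second issue, intersecting the observation rays of a tie point seen in several images, can be illustrated with a linear (DLT) triangulation. This is a Python/NumPy sketch assuming the camera projection matrices are already known; in the actual system, they are estimated jointly with the points inside the bundle adjustment:

```python
import numpy as np

def triangulate(views):
    """Linear (DLT) triangulation sketch: intersect the rays of one tie point.
    `views` is a list of (P, (u, v)) pairs, where P is a 3x4 projection
    matrix and (u, v) the observed image point. Returns the 3D point."""
    rows = []
    for P, (u, v) in views:
        rows.append(u * P[2] - P[0])   # each observation contributes
        rows.append(v * P[2] - P[1])   # two linear equations
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    X = Vt[-1]                         # null-space vector = homogeneous point
    return X[:3] / X[3]
```

With two or more views the stacked system is solved in a least-squares sense via the SVD, which is why adding images strengthens the geometry of the estimate.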
However, when identifying common feature points to extract 3D information for
mapping, most of the works in the literature use a pair-wise approach for image matching.
The major drawback is that the image geometry is weak especially when only two
neighbouring images are considered. Therefore, we introduce multiple image matching based on the SIFT algorithm into the mapping procedure. The whole process takes five steps.
First, a feature database is generated. It is a collection of all SIFT features extracted
from map images. The database in fact consists of two sub-databases: keypoint database
(F_database) and descriptor database (D_database). As shown in Figure 1, each extracted
keypoint creates a four-parameter record, indicating its 2D location relative to the training
image in which it was found (the first and second columns of F_database), the scale (the
third column of F_database) and orientation (the fourth column of F_database).
Meanwhile, each feature (keypoint) is associated with a descriptor (128-dimension
Figure 1. Feature database (snapshot from the program).
vector). We put them together to form a descriptor database. Figure 1 shows the first five dimensions of the 128-dimension vector.
After the features have been extracted, multi-image matching is performed. The main difference from pair-wise matching is that a single feature may have several matches, since multiple images may overlap a single ray. Therefore, a K-NN search function is used to find the nearest neighbours for all the features from all the map images in the database. We use an exhaustive search for the K-NN search since it works well in high dimensions. It is performed on the 128-dimension feature descriptors in the D_database. In our approach, a 4-NN search is performed to find potential matches for each feature. It is noted that for feature points that have no correspondence in the database, some random points will still be found as potential candidates. As shown in Figure 2, for every feature in the database, its four nearest neighbours (including itself) have been found and listed in a row: the first column has the same id as the reference feature point, indicating that the nearest neighbour is the feature point itself.
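The exhaustive K-NN step can be sketched as follows (Python/NumPy; `descriptors` is an illustrative stand-in for the D_database as an n × 128 array, with k = 4 as in the text):

```python
import numpy as np

def knn_search(descriptors, k=4):
    """Exhaustive K-NN sketch over the descriptor database: for every
    descriptor, return the ids of its k nearest neighbours. Each descriptor's
    own id appears first, since its distance to itself is zero, matching the
    first column observed in Figure 2."""
    # Pairwise squared Euclidean distances, shape (n, n)
    d2 = ((descriptors[:, None, :] - descriptors[None, :, :]) ** 2).sum(-1)
    return np.argsort(d2, axis=1)[:, :k]   # one row of k candidate ids per feature
```

The O(n^2) distance table is what "exhaustive" means here; it avoids the accuracy loss of approximate index structures in high dimensions at the cost of memory and time.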
The next step is to find images with large overlapping areas to form an image group. Therefore, we aim to identify the images that have a large number of matches with each other. A voting strategy is used. First, we choose a query image from the map image database, then group together the nearest neighbours (candidate matches) associated with features that belong to the query image. Each candidate match votes for the image it comes from; therefore, the one with the biggest number of votes is the image that has the largest number of matching features with the current query image. We find the three images with the highest rank for each query image and place them in an image group. It can be expected that the map image itself comes first. It is also noted that points with no corresponding feature points, or false matches, are still involved in the vote; since they are randomly distributed across the images, the final rank will not be affected when the database has a large number of features. After the image groups have been generated, a geometric constraint is used to remove mismatches from the data-set: within each image group, RANSAC is used pair-wise.
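The voting step can be sketched as follows (Python/NumPy; the names `neighbour_ids` and `feature_to_image` are illustrative stand-ins for the 4-NN table of Figure 2 and a feature-id-to-image-id lookup, not names from the actual program):

```python
import numpy as np

def top_overlapping_images(neighbour_ids, feature_to_image, query_image, top=3):
    """Voting sketch: each candidate match votes for the map image it comes
    from, and the highest-voted images (after the query image itself) form
    the image group. `neighbour_ids` holds candidate feature ids for every
    feature of the query image; `feature_to_image` maps feature id -> image id."""
    votes = np.bincount(feature_to_image[neighbour_ids.ravel()])
    ranked = np.argsort(votes)[::-1]          # image ids, most votes first
    return [img for img in ranked if img != query_image][:top]
```

As the text notes, the query image itself always tops the raw vote count (every feature finds itself), so it is excluded before taking the top-ranked images.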
Finally, bundle adjustment is used to geo-reference the common feature points of the map images. Different from the previous attempt, which imports only two images into a bundle adjustment, here bundle adjustment is performed on each image group (more than two images included). With additional images and greater redundancy for the least squares
Figure 2. A 4-NN search is performed to find potential matches for each feature (snapshot from the program).
adjustment, the geometric strength of the network is enhanced and the accuracy of the final results, that is, the camera external parameters and the 3D coordinates of the feature points, is improved. The final positioning accuracy will benefit from this. It is noted that the geo-referencing accuracy will not always be improved by increasing the number of images involved. If we used bundle adjustment only once on the whole image sequence, accumulating errors could produce an initial solution too far from the optimal one for the adjustment to converge. Therefore, we choose images in groups of three for the system.
2.2. Evaluating the performance of SIFT matching for the vision-based positioning system
At the real-time positioning stage, when real-time images are taken by the vision sensor
mounted on the (moving) vehicle, another image matching based on SIFT is carried out
between the real-time image and the map images. When any of the SIFT feature points from the map image(s) finds its correspondence on the real-time image, the geo-information it carries can be transferred to its counterpart. Therefore, matched SIFT features on the real-time image obtain both image coordinates from the matching process and 3D coordinates from the map images, which can later serve as pseudo ground control points (PGCPs) for space resection-based positioning at the final stage. More specifically, a modified space resection is utilised to calculate the vision sensor's external orientation in 6 DOF. Outliers, including mismatches, are detected and removed by the system's outlier detection mechanism based on RANSAC.
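The positioning step can be illustrated with a linear form of space resection. This Python/NumPy sketch estimates the projection matrix from PGCP correspondences by the direct linear transform and recovers the camera position as its null space; it is a simplified stand-in, since the actual system uses a modified space resection inside its RANSAC mechanism rather than this plain DLT:

```python
import numpy as np

def resection_dlt(points_3d, points_2d):
    """Space-resection sketch: estimate the 3x4 projection matrix P from
    >= 6 PGCP correspondences via the DLT, then recover the camera position
    C as the null space of P (since P @ [C, 1] = 0)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(points_3d, points_2d):
        Xh = [X, Y, Z, 1.0]
        rows.append(Xh + [0, 0, 0, 0] + [-u * c for c in Xh])
        rows.append([0, 0, 0, 0] + Xh + [-v * c for c in Xh])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    P = Vt[-1].reshape(3, 4)
    _, _, Vt2 = np.linalg.svd(P)
    C = Vt2[-1]                       # homogeneous camera centre
    return P, C[:3] / C[3]
```

The full 6-DOF orientation would be obtained by decomposing P further; the sketch only shows why the quality and geometry of the PGCPs directly bound the quality of the pose.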
A controlled experiment is designed to evaluate the performance of SIFT matching for
the image-based positioning system. The aim is to identify the reason behind positioning
failure and inaccurate position results. In our indoor measurement set-up, factors such as illumination can be easily controlled. A calibrated CCD camera (Canon EOS4500, Canon Inc., Japan) with a fixed focal length of 24.2 mm was used as the vision sensor of the positioning system. First, three stable camera sites were deployed facing different mapping areas, with the X, Y, Z coordinates of the three camera stations surveyed by a total station, and the angular changes at each camera site were roughly measured. A total of 16 images were taken at the three sites, and the results for the images taken with the lights turned on are shown in Table 1. For
each image matching process, matches before and after RANSAC (used to reject
mismatches) are compared. Furthermore, the number of PGCPs generated by each matching process is also studied, along with the geometry of these points, which will directly affect the precision of positioning. Finally, a manual check of the PGCP locations in the two matched images is performed to verify the correctness of image matching after RANSAC.
The third column in Table 1 indicates the angular change of viewpoint in κ (around the Z-axis) for each camera site; the other angles remain stable. The viewing direction perpendicular to the mapping area (a wall with geo-referenced features) is taken to be 0, and a clockwise rotation is taken as a positive change. It is easy to note that the only epoch that fails to give a positioning result is the one with the most drastic angular change, real-time image No. 5. Not only has the total number of SIFT matches decreased, the percentage of correct matches filtered using RANSAC has also been reduced; it is actually the image epoch with the lowest correct rate. As a result, too few PGCPs are generated for positioning. This indicates that when the two matched images suffer from a large viewpoint variation, fewer SIFT matches will be found, and there is a higher chance of generating false matches. If we take a look at the other images with smaller angular changes, however, such a rule does not apply. The reason is that the performance of SIFT-based matching only drops under substantial viewpoint changes.
Secondly, it is noticed that even after using RANSAC to reject mismatches, there is still a small chance that some mismatches are left untreated, which might later generate false PGCPs that jeopardise the final positioning process.
In summary, two major weaknesses of image matching in the system have been found: invariant feature matching cannot deal with drastic viewpoint shifts, and the currently popular outlier detection mechanism, RANSAC, cannot guarantee the correctness of every pair. These problems affect the final positioning either by deteriorating the precision through an inadequate number of matches or by bringing in false matches.
3. Using ASIFT for viewpoint changes
In order to tackle the unsatisfactory performance of SIFT under dramatic viewpoint differences, some approaches have recently been proposed to extend scale and rotation invariance to affine invariance, such as maximally stable extremal regions (Matas et al. 2004) and Harris/Hessian affine (Mikolajczyk and Schmid 2004). Although these methods have been proved to enable matching under stronger viewpoint changes, all of them are prone to fail at a certain point (Nalpantidis, Chrysostomou, and Gasteratos 2009). A better idea is to simulate viewpoint changes in order to reach affine invariance; the most successful algorithm using such a method is named ASIFT (affine-SIFT). It was introduced by Morel and Yu (2009) to explicitly deal with extreme angle changes (transition tilts up to 36 and higher). SIFT is only partially invariant to viewpoint changes because it is invariant to only four of the six parameters of an affine transform. ASIFT, on the other hand, simulates all image views obtainable by varying the two camera axis orientation parameters, namely the latitude and longitude angles, left over by the SIFT method. It then covers the other four parameters by using the SIFT method itself (Morel and Yu 2009).
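The grid of simulated views can be sketched as follows (plain Python; the sampling rule t = sqrt(2)^k with longitude step 72°/t follows Morel and Yu (2009), while the tilt range and the 180° longitude span used here are illustrative choices, not the exact defaults of the reference implementation):

```python
import math

def asift_view_samples(max_k=4):
    """Sketch of the ASIFT view-simulation grid: latitude is sampled through
    tilts t = sqrt(2)^k and, for each tilt, the longitude angle phi is
    sampled with a step of 72/t degrees. Returns (tilt, phi_degrees) pairs;
    SIFT is then run between every simulated view pair."""
    views = [(1.0, 0.0)]                    # the untilted image itself
    for k in range(1, max_k + 1):
        t = 2.0 ** (k / 2.0)                # sqrt(2)^k
        step = 72.0 / t                     # finer longitude sampling as tilt grows
        n = math.ceil(180.0 / step)         # 180 degrees suffices by symmetry
        views += [(t, i * step) for i in range(n)]
    return views
```

The quadratic growth of simulated view pairs is the source of ASIFT's extra computational cost discussed below.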
In this paper, we introduce ASIFT into our vision-based navigation system to replace
SIFT in order to achieve a more robust positioning result against viewpoint variation. At
both mapping and positioning stage, ASIFT-based image matching is used in the same
way SIFT is utilised. In order to evaluate its performance and compare it with that of SIFT, data-sets from the same controlled experiment are used.
In Figures 3 and 4, we show the matching between real-time query image No. 5 and map image No. 10 using SIFT and ASIFT, respectively. Under dramatic view changes, ASIFT produces more reliable matches (inliers); although SIFT finds as many tentative matches, most of them are mismatches and are filtered out by RANSAC. When dealing with images without much angular difference, however, ASIFT had a rather unstable
Table 1. Performance of SIFT matching in the system.
ST ID   Image ID   Angular change (deg)   All SIFT matches   Reliable matches   No. of PGCPs   No. of false PGCPs   Correct rate of PGCPs (%)   PDOP    ADOP
1       1            0                    578                202                51             1                    98                          2399    583
1       3          -30                    372                103                32             2                    94                          1149    223
1       5           50                    267                 59                 2             0                    100                            -      -
2       7            0                    427                145                35             0                    100                         2106    530
2       9          -20                    401                193                59             1                    98                           969    235
3       11           0                    417                112                11             0                    100                         5711    1177
3       13         -30                    457                146                22             0                    100                          372    2133
3       15          20                    241                 65                 5             0                    100                         5813    28,658

(No PDOP/ADOP is listed for image No. 5, the epoch for which positioning failed.)
performance, as shown in Figures 5–8, with the former pair favouring SIFT and the latter pair favouring ASIFT. The reason lies in the fact that the authors of ASIFT have already included an outlier detection mechanism (the Optimized Random Sampling Algorithm, ORSA) in the ASIFT algorithm. When the angular change is high, SIFT tends to produce more mismatches and is thus outperformed by ASIFT in terms of both the number of inliers and the number of PGCPs. On the other hand,
Figure 3. Matching between real-time query image No. 5 and map image No. 10 using SIFT + RANSAC: dramatic angular change, 25 reliable matches (inliers).
Figure 4. Matching between real-time query image No. 5 and map image No. 10 using ASIFT + RANSAC: dramatic angular change, 47 reliable matches (inliers).
if the angular change is low, then because of this pre-filtering mechanism of ASIFT, the inliers of ASIFT largely depend on the combined effect of ORSA and RANSAC. In order to further compare the performance of the two algorithms for our system, ASIFT is used without RANSAC at the positioning stage. The results are shown in Table 2.
Comparing Tables 1 and 2, the two main factors taken into consideration are the geometry given by the PGCPs, which is related to positioning precision, and the outliers
Figure 5. Matching between real-time query image No. 1 and map image No. 9 using SIFT + RANSAC: small angular change, 100 reliable matches (inliers).
Figure 6. Matching between real-time query image No. 1 and map image No. 9 using ASIFT + RANSAC: small angular change, 16 reliable matches (inliers).
remaining in the PGCP data-set. An obvious improvement took place at epoch No. 5, the one with the dramatic angular change, where ASIFT was able to generate a reasonable number of PGCPs for a positioning solution. At the other epochs, ASIFT has yet to outperform SIFT in terms of the geometry (represented by position DOP (PDOP) and attitude DOP (ADOP))
Figure 7. Matching between real-time query image No. 11 and map image No. 10 using SIFT + RANSAC: small angular change, 47 reliable matches (inliers).
Figure 8. Matching between real-time query image No. 11 and map image No. 10 using ASIFT + RANSAC: small angular change, 190 reliable matches (inliers).
provided by the PGCPs. It has also been observed that the correct rate of PGCPs has yet to reach 100% in either table, which means that both RANSAC for SIFT and ORSA for ASIFT have left some mismatches untreated. The reason is that all RANSAC-like methods (e.g. RANSAC and ORSA) share the same bottleneck: when mismatches are near their epipolar line, no matter how far away they are from their true correspondences, it is hard for these mismatches to be detected. Our test even revealed a lower correct rate of PGCPs from ASIFT. Considering the fact that ASIFT is more computationally expensive than SIFT, and that for real-time applications like vision-based navigation it is important to reduce the complexity of the algorithms being used, we conclude that ASIFT is less efficient than SIFT when dealing with low angular changes. We choose to use ASIFT only as a backup when SIFT fails to get a result because of dramatic viewpoint changes.
4. Feature-based matching with integration of cross-correlation information
It has been noticed that both SIFT and ASIFT produce a substantial number of mismatches. The reason is that such feature-based methods base the choice of correspondence on local information and fail to consider the global context. When an image has repeated patterns, ambiguities will occur because the local information for the similar parts is identical.
A general solution is to use robust estimation method like RANSAC to remove
outliers. For a number of iterations, a random sample of four correspondences is selected
and a homography H is computed from those four correspondences. Every other
correspondence is then classified as an inlier or outlier depending on its concurrence with
H. After all of the iterations have finished, the iteration that contained the largest number
of inliers is selected. H can then be recomputed from all of the correspondences that were
considered as inliers in that iteration. While this method can remove most of the
mismatches, experiments have proved that it cannot guarantee the correctness of every
'inlier'. The reason is that when mismatches are near their epipolar line, it is easier for them to be treated as correct matches. Moreover, the iteration can only run a limited number of times (1000 in our system), and there is a certain chance that the iteration with the largest number of inliers still produces an erroneous or inaccurate H, since most of the inliers it includes are mistaken. As a result, mismatches are selected as reliable matches, which will
further affect the positioning accuracy.
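The RANSAC loop described above can be sketched as follows (Python/NumPy, with a DLT homography fit; the threshold and iteration count are illustrative, though the text states 1000 iterations for our system):

```python
import numpy as np

def fit_homography(src, dst):
    """Direct linear transform: H from >= 4 point correspondences."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_h(H, pts):
    """Apply a homography to an (n, 2) array of points."""
    p = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return p[:, :2] / p[:, 2:3]

def ransac_homography(src, dst, iters=1000, thresh=2.0, rng=None):
    """RANSAC as described in the text: repeatedly sample four
    correspondences, fit H, classify the rest by reprojection error, keep
    the largest consensus set and refit H on all of its inliers."""
    rng = rng or np.random.default_rng(0)
    best = None
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        err = np.linalg.norm(apply_h(H, src) - dst, axis=1)
        inliers = err < thresh
        if best is None or inliers.sum() > best.sum():
            best = inliers
    return fit_homography(src[best], dst[best]), best
```

The sketch also shows the failure mode discussed above: the loop only maximises the consensus count, so a mismatch that happens to satisfy the winning H within the threshold is kept as an "inlier".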
In this study, we propose to integrate an intensity-based method into the feature matching process to strengthen the robustness of the matching algorithm against mismatches and
Table 2. Performance of ASIFT matching in the system.
ID   Angular change (deg)   All ASIFT matches   PGCP number   No. of false PGCPs   Correct rate of PGCPs (%)   PDOP     ADOP
1      0                    296                 13            3                    77                          2112     516
3    -30                    218                 22            2                    91                          1353     276
5     50                    381                 14            0                    100                         1761     369
7      0                    257                 17            1                    94                           670     165
9    -20                    178                 12            1                    92                          1746     405
11     0                    716                 15            0                    100                         9371     1913
13   -30                    471                 10            2                    80                          12,801   2400
15    20                    407                 21            0                    100                         9490     1914
noise. More specifically, cross-correlation information is used as an analysis and selection criterion for the matching. Instead of identifying mismatches after they have been generated, it determines how good the homography model (H) is for the two matching images and discards bad ones, to reduce the chance that mismatches are included. In RANSAC, we use a 2D projective transformation H (a planar homography) to approximate the geometric transformation between the two images (e.g. I and I'). Any two corresponding SIFT features in images I and I' passing RANSAC will comply with the model, which ideally can be expressed as:
expressed as:
\[
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix}
=
\begin{bmatrix} a_1 & a_2 & a_3 \\ b_1 & b_2 & b_3 \\ c_1 & c_2 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.
\quad (1)
\]
In Equation (1), $p = [x, y, 1]^T$ and $p' = [x', y', 1]^T$ denote the two points expressed
in homogeneous coordinates, and the $3 \times 3$ matrix represents the homography model H,
in which $\begin{bmatrix} a_1 & a_2 \\ b_1 & b_2 \end{bmatrix}$ parameterises the affine
changes, $[a_3 \; b_3]^T$ the shift parameters and $[c_1 \; c_2]$ the projective
deformation. After the homography model has been generated by RANSAC processing, a local
square patch in I with a size of (2w + 1) × (2w + 1) centred on p is generated, denoted
N(p). Using the estimated H, N(p) is transformed into N(p′) and resampled on image I′.
Then, the cross-correlation between the two window patches is calculated using
Equation (2), where $G_{uv}$ and $G'_{uv}$ represent the intensity values of the two
correlation windows, and $m(G)$ and $m(G')$ denote their average intensities:
\[
r(G, G') =
\frac{\sum_{u=-w}^{w}\sum_{v=-w}^{w}\bigl(G_{uv} - m(G)\bigr)\bigl(G'_{uv} - m(G')\bigr)}
{\sqrt{\sum_{u=-w}^{w}\sum_{v=-w}^{w}\bigl(G_{uv} - m(G)\bigr)^2 \cdot
\sum_{u=-w}^{w}\sum_{v=-w}^{w}\bigl(G'_{uv} - m(G')\bigr)^2}}.
\quad (2)
\]
In Equation (2), r(G, G′) varies from −1 to 1; the closer it is to 1, the higher the
correlation, indicating greater similarity between the two patches and a greater
possibility that they are correct corresponding points. Therefore, we calculate the
cross-correlation for each pair of reliable matches (inliers) in every matching process.
An average correlation r̄ is calculated over all the matched (reliable) points produced by
one matching process (one generated H). A value close to 1 means the estimated homography
model H is very accurate, and vice versa. An example is shown in Figures 9 and 10 (both
from our experiment), which illustrate that a bad r̄ can admit mismatches as inliers,
whereas an r̄ closer to 1 has less chance of doing so.
A threshold is set for r̄ (0.75 in the experiment). After every matching process, r̄ is
calculated; if it is smaller than the threshold, the matched points generated by that
process are discarded and the two images are re-matched.
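Under the definitions above, the patch-verification step can be sketched as follows. This is an illustrative reimplementation, not the authors' code: the window size w = 7 and the 0.75 threshold follow the text, while the nearest-neighbour resampling and all function names are assumptions.

```python
import numpy as np

def ncc(G, Gp):
    # Equation (2): normalised cross-correlation of two equally sized patches.
    g, gp = G - G.mean(), Gp - Gp.mean()
    denom = np.sqrt((g * g).sum() * (gp * gp).sum())
    return float((g * gp).sum() / denom) if denom > 0 else 0.0

def warp_patch(img_p, H, p, w):
    # Transform the (2w+1)x(2w+1) neighbourhood N(p) with H and resample it
    # on image I' (nearest-neighbour interpolation for brevity).
    patch = np.zeros((2 * w + 1, 2 * w + 1))
    for dv in range(-w, w + 1):
        for du in range(-w, w + 1):
            q = H @ np.array([p[0] + du, p[1] + dv, 1.0])
            xp, yp = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= yp < img_p.shape[0] and 0 <= xp < img_p.shape[1]:
                patch[dv + w, du + w] = img_p[yp, xp]
    return patch

def mean_correlation(img, img_p, H, inliers, w=7, thresh=0.75):
    # Average r over all inlier points; below the threshold the whole match
    # set would be discarded and the two images re-matched.
    rs = []
    for x, y in inliers:
        n_p = img[y - w:y + w + 1, x - w:x + w + 1]
        if n_p.shape != (2 * w + 1, 2 * w + 1):
            continue  # skip points too close to the image border
        rs.append(ncc(n_p, warp_patch(img_p, H, (x, y), w)))
    r_bar = float(np.mean(rs)) if rs else 0.0
    return r_bar, r_bar >= thresh
```

With an accurate H the warped patches align and r̄ approaches 1; a poor model drags r̄ down, which is exactly the signal used here to reject the entire match set rather than individual matches.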
By using this criterion in the system, the number of false PGCPs has been reduced. To
evaluate the impact on final positioning accuracy, the root mean square error for every
camera station was calculated; the results before and after using cross-correlation
information are compared in Table 3, which shows that the positioning accuracy has been
improved.
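For reference, the per-axis RMSE values reported in Table 3 can be computed as in the following sketch (the function name and array layout are illustrative assumptions):

```python
import numpy as np

def rmse_per_axis(estimated, truth):
    # Root mean square error in X, Y, Z over all epochs at one camera station.
    # `estimated` and `truth` are (n_epochs, 3) arrays of positions in metres.
    d = np.asarray(estimated, float) - np.asarray(truth, float)
    return np.sqrt((d * d).mean(axis=0))
```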
5. Concluding remarks
In this paper, we have given an overview of image matching techniques in the context of
vision-based navigation systems. Different methods are discussed for a variety of systems:
stereo vision, SFM and the map-based approach. Focusing on the map-based approach, which
generally uses feature-based matching for localisation, a performance analysis has been
carried out on our earlier developed system. Three major problems have been identified:
inadequate mapping; vulnerability of the matching to drastic viewpoint changes at the
positioning stage; and a large percentage of mismatches.
First, multi-image matching has been introduced at the mapping stage. Compared with the
pair-wise approach, more features have been geo-referenced with better imaging
geometry. The vision-based positioning based on the 3D map benefits from this in terms
of a greater number and better distribution of PGCPs.
Then, by using ASIFT, the major improvement occurs at the epochs with large
viewpoint changes. More reliable matches have been produced and the system has been
able to obtain a positioning result. However, our experiments have also revealed that
ASIFT has not yet been able to outperform SIFT in terms of PGCP geometry and the correct
rate of PGCPs. Taking the computation cost into consideration, SIFT is still the preferred
option; ASIFT is used as a back-up when SIFT fails to produce a result because of large
viewpoint changes.
Using RANSAC to remove mismatches has been a popular approach for feature-based
matching in visual systems. But as the performance analysis has shown, some
Figure 9. Matching between real-time query image No. 13 and map image No. 8 using SIFT + RANSAC + cross-correlation: r̄ = 0.23.
mismatches may still be included as inliers (reliable matches) in the final positioning
process. The reason is that the iteration in RANSAC can only run a limited number of
times, and there is a chance that the iteration with the largest number of inliers still
produces an erroneous or very inaccurate homography model; as a consequence, mismatches
are selected as reliable matches. To deal with this problem, we have proposed using
cross-correlation information to evaluate the quality of the homography model, and the
conducted experiments have shown that such an approach can reduce the chance of
mismatches being included and improve the final positioning accuracy.
Further research will focus on improving the reliability and precision of image
matching in an effort to improve the overall positioning accuracy. Meanwhile, the
computational load of image matching will be considered for real-time system operations.
Table 3. Comparison of the RMSE of the positioning results.

Station    Method                      X (m)    Y (m)    Z (m)
Station 1  SIFT                        0.0477   0.0838   0.0598
           SIFT + cross-correlation    0.0469   0.0824   0.0613
Station 2  SIFT                        0.0641   0.1192   0.2710
           SIFT + cross-correlation    0.0644   0.0582   0.2481
Station 3  SIFT                        0.3397   0.7191   0.3278
           SIFT + cross-correlation    0.3164   0.6439   0.2883
Figure 10. Matching between real-time query image No. 13 and map image No. 8 using SIFT + RANSAC + cross-correlation: r̄ = 0.79.
Note
1. Part of this paper was previously published by IEEE for, and presented at, the International Conference on Indoor Positioning and Indoor Navigation (IPIN 2012), 13–15 November 2012.
References

Bay, H., T. Tuytelaars, and L. V. Gool. 2006. “SURF: Speeded Up Robust Features.” In 9th European Conference on Computer Vision, Graz, Austria, May 7–13, 404–417. Berlin/Heidelberg: Springer. doi:10.1007/11744023_32.
Boris, R., K. Effrosyni, and D. Marcin. 2008. “Mobile Museum Guide Based on Fast SIFT Recognition.” In 6th International Workshop on Adaptive Multimedia Retrieval, Berlin, Germany, Vol. 5811, June 26–27. Springer LNCS.
Fischler, M. A., and R. C. Bolles. 1981. “Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography.” Communications of the ACM 24 (6): 381–395.
Forstner, W. 1986. A Feature Based Correspondence Algorithm for Image Matching. International Archives of Photogrammetry. Vol. 26-III. Rovaniemi, Finland: IAPRS.
Gruen, A. W. 1985. “Adaptive Least Squares Correlation: A Powerful Image Matching Technique.” South African Journal of Photogrammetry, Remote Sensing and Cartography 14: 175–187.
Harris, C., and M. Stephens. 1988. “A Combined Corner and Edge Detector.” In Fourth Alvey Vision Conference, Manchester, UK, 31 August–2 September, 147–151.
Kuhl, A. 2004. “Comparison of Stereo Matching Algorithms for Mobile Robots.” M.Sc. thesis, Fakultät für Informatik und Automatisierung, Technische Universität Ilmenau, Ilmenau, Germany.
Lang, F., and W. Forstner. 1995. “Matching Techniques.” In Second Course in Digital Photogrammetry. Landesvermessungsamt NRW.
Lowe, D. 2004. “Distinctive Image Features from Scale Invariant Key Points.” International Journal of Computer Vision 60 (2): 91–110.
Matas, J., O. Chum, M. Urban, and T. Pajdla. 2004. “Robust Wide-Baseline Stereo from Maximally Stable Extremal Regions.” Image and Vision Computing 22: 761–767.
Mikolajczyk, K., and C. Schmid. 2004. “Scale & Affine Invariant Interest Point Detectors.” International Journal of Computer Vision 60: 63–86.
Morel, J. M., and G. Yu. 2009. “ASIFT: A New Framework for Fully Affine Invariant Image Comparison.” SIAM Journal on Imaging Sciences 2 (2): 438–469.
Nalpantidis, L., D. Chrysostomou, and A. Gasteratos. 2009. “Obtaining Reliable Depth Maps for Robotic Applications with a Quad-Camera System.” In ICRA09 Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, Kobe, Japan, May 12.
Remondino, F., and C. Ressl. 2006. Overview and Experiences in Automated Markerless Image Orientation. Vol. 36, Part 3. The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences (IAPRS), Archives no. 1682-1750.
Schmid, C., and R. Mohr. 1997. “Local Gray Value Invariants for Image Retrieval.” IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (5): 530–535.
Se, S., D. Lowe, and J. Little. 2002. “Mobile Robot Localization and Mapping with Uncertainty Using Scale-Invariant Visual Landmarks.” International Journal of Robotics Research 21 (8): 735–760.
Trucco, E., and A. Verri. 1998. Introductory Techniques for 3-D Computer Vision. Englewood Cliffs: Prentice Hall.
Zabih, R., and J. Woodfill. 1994. “Non-parametric Local Transforms for Computing Visual Correspondence.” In Third European Conference on Computer Vision, Stockholm, Sweden, May 2–6.
Zhang, G., Z. Dong, J. Jia, T. T. Wong, and H. Bao. 2010. “Efficient Non-Consecutive Feature Tracking for Structure-from-Motion.” In Proceedings of the European Conference on Computer Vision (ECCV ’10), Crete, Greece, September 5–11.