UNIVERSITY OF OULU P.O.B. 7500 FI-90014 UNIVERSITY OF OULU FINLAND
ACTA UNIVERSITATIS OULUENSIS
SERIES EDITORS
A SCIENTIAE RERUM NATURALIUM: Professor Mikko Siponen
B HUMANIORA: University Lecturer Elise Kärkkäinen
C TECHNICA: Professor Pentti Karjalainen
D MEDICA: Professor Helvi Kyngäs
E SCIENTIAE RERUM SOCIALIUM: Senior Researcher Eila Estola
F SCRIPTA ACADEMICA: Information officer Tiina Pistokoski
G OECONOMICA: University Lecturer Seppo Eriksson
EDITOR IN CHIEF: University Lecturer Seppo Eriksson
PUBLICATIONS EDITOR: Publications Editor Kirsti Nurkkala
ISBN 978-951-42-6150-3 (Paperback)
ISBN 978-951-42-6151-0 (PDF)
ISSN 0355-3213 (Print)
ISSN 1796-2226 (Online)
ACTA UNIVERSITATIS OULUENSIS C TECHNICA
OULU 2010
C 353
Juho Kannala
MODELS AND METHODS FOR GEOMETRIC COMPUTER VISION
FACULTY OF TECHNOLOGY, DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING, UNIVERSITY OF OULU; INFOTECH OULU, UNIVERSITY OF OULU
ACTA UNIVERSITATIS OULUENSIS C Technica 353
JUHO KANNALA
MODELS AND METHODS FOR GEOMETRIC COMPUTER VISION
Academic dissertation to be presented with the assent of the Faculty of Technology of the University of Oulu for public defence in OP-sali (Auditorium L10), Linnanmaa, on 7 May 2010, at 12 noon
UNIVERSITY OF OULU, OULU 2010
Copyright © 2010
Acta Univ. Oul. C 353, 2010
Supervised by
Doctor Sami Brandt
Professor Janne Heikkilä
Reviewed by
Doctor Peter Sturm
Professor Kalle Åström
ISBN 978-951-42-6150-3 (Paperback)
ISBN 978-951-42-6151-0 (PDF)
http://herkules.oulu.fi/isbn9789514261510/
ISSN 0355-3213 (Printed)
ISSN 1796-2226 (Online)
http://herkules.oulu.fi/issn03553213/
Cover design
Raimo Ahonen
JUVENES PRINT
TAMPERE 2010
Kannala, Juho, Models and methods for geometric computer vision
Faculty of Technology, Department of Electrical and Information Engineering, University of Oulu, P.O.Box 4500, FI-90014 University of Oulu, Finland; Infotech Oulu, University of Oulu, P.O.Box 4500, FI-90014 University of Oulu, Finland
Acta Univ. Oul. C 353, 2010
Oulu, Finland
Abstract
Automatic three-dimensional scene reconstruction from multiple images is a central problem in geometric computer vision. This thesis considers topics that are related to this problem area. New models and methods are presented for various tasks in such specific domains as camera calibration, image-based modeling and image matching. In particular, the main themes of the thesis are geometric camera calibration and quasi-dense image matching. In addition, a topic related to the estimation of two-view geometric relations is studied, namely, the computation of a planar homography from corresponding conics. Further, as an example of a reconstruction system, a structure-from-motion approach is presented for modeling sewer pipes from video sequences.
In geometric camera calibration, the thesis concentrates on central cameras. A generic camera model and a plane-based camera calibration method are presented. The experiments with various real cameras show that the proposed calibration approach is applicable for conventional perspective cameras as well as for many omnidirectional cameras, such as fish-eye lens cameras. In addition, a method is presented for the self-calibration of radially symmetric central cameras from two-view point correspondences.
In image matching, the thesis proposes a method for obtaining quasi-dense pixel matches between two wide baseline images. The method extends the match propagation algorithm to the wide baseline setting by using an affine model for the local geometric transformations between the images. Further, two adaptive propagation strategies are presented, where local texture properties are used for adjusting the local transformation estimates during the propagation. These extensions make the quasi-dense approach applicable for both rigid and non-rigid wide baseline matching.
In this thesis, quasi-dense matching is additionally applied for piecewise image registration problems which are encountered in specific object recognition and motion segmentation. The proposed object recognition approach is based on grouping the quasi-dense matches between the model and test images into geometrically consistent groups, which are supposed to represent individual objects, whereafter the number and quality of grouped matches are used as recognition criteria. Finally, the proposed approach for dense two-view motion segmentation is built on a layer-based segmentation framework which utilizes grouped quasi-dense matches for initializing the motion layers, and is applicable under wide baseline conditions.
Keywords: camera calibration, image registration, image-based modeling, motion segmentation, object recognition, structure from motion
To my late brother Jaakko
Preface
My first contact with computer vision occurred during a summer traineeship at the
Helsinki University of Technology almost ten years ago. Since that time, I have had
opportunities to learn from many experts in the field, both in Finland and abroad. Now,
upon completing the research for my thesis, I feel that it is time to acknowledge
several people who have helped in bringing this thesis to its completion.
I would like to express my gratitude to my instructors, Doctor Sami Brandt and
Professor Janne Heikkilä, who have been great sources of ideas and advice through the
years. In addition, I am grateful that they have given me freedom to pursue my own
ideas in research. I am also indebted to my co-authors, Doctors Esa Rahtu and Mikko
Salo, whose broad expertise and open-minded, innovative attitude to research have been
a good basis for fruitful collaboration.
I am grateful to the reviewers of the thesis, Doctor Peter Sturm and Professor Kalle
Åström, for their constructive comments and feedback. I would also like to acknowl-
edge Gordon Roberts for his help with the language revision of the manuscript.
The Machine Vision Group of the University of Oulu has been an excellent place for
doing research. This is due to the helpful attitude and efforts of all the personnel, both
research and support staff. In particular, I am grateful to Professor Matti Pietikäinen
for his long-term work on advancing computer vision research in Oulu and for offer-
ing me the possibility to work in this group. Further, considering the research topics
of this thesis, I would like to acknowledge Jukka Holappa for carefully optimizing the
implementations of several algorithms studied in the thesis. Finally, as there are many
important aspects to life other than research, I would like to thank Doctor Jani Boutel-
lier and Pekka Koskenkorva for various discussions during the daily lunch and coffee
breaks.
In addition to my home university in Oulu, I have had opportunities to interact
with scientists in other research institutes. I am grateful to Doctors Charles Bouveyron,
Stéphane Girard and Cordelia Schmid for their hospitality during my stay in INRIA
Grenoble in 2005. Further, I would like to thank Doctor Jiří Matas for hosting my
visit to the Center for Machine Perception (CMP) at the Czech Technical University
in Prague in 2009. I am also grateful for the interesting discussions with many CMP
members. For the collaboration with the sewer imaging application, I would like to
acknowledge Professor Jouko Lampinen and Doctor Aki Vehtari from Aalto Univer-
sity, Juhani Korkealaakso and Hannu Maula from VTT Technical Research Centre of
Finland, and Priit Uleksin from DigiSewer Productions Ltd.
The financial support provided by the Graduate School in Electronics, Telecommu-
nication and Automation (GETA), the Emil Aaltonen Foundation, the Finnish Founda-
tion for Technology Promotion, the Kaute Foundation, the Nokia Foundation, the Seppo
Säynäjäkangas Science Foundation, and the Tauno Tönning Foundation is gratefully ac-
knowledged.
Last but not least, I want to express my deepest gratitude to my family and friends
for all the support during these years. Especially, I would like to thank Noora for her
important support during the last stages of this work.
Oulu, February 2010
Juho Kannala
List of original articles
This dissertation is based on the following articles, which are referred to in the text by
their Roman numerals (I–VIII):
I Kannala J, Salo M & Heikkilä J (2006) Algorithms for computing a planar homography from conics in correspondence. Proc British Machine Vision Conference (BMVC) 1: 77–86.
II Kannala J & Brandt SS (2006) A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(8): 1335–1340.
III Kannala J, Heikkilä J & Brandt SS (2008) Geometric camera calibration. In Wah B (ed) Wiley Encyclopedia of Computer Science and Engineering. Hoboken, John Wiley & Sons Inc.
IV Kannala J, Brandt SS & Heikkilä J (2009) Self-calibration of central cameras from point correspondences by minimizing angular error. VISIGRAPP 2008, Revised Selected Papers. Communications in Computer and Information Science 24: 109–122.
V Kannala J, Brandt SS & Heikkilä J (2008) Measuring and modelling sewer pipes from video. Machine Vision and Applications 19(2): 73–83.
VI Kannala J & Brandt SS (2007) Quasi-dense wide baseline matching using match propagation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
VII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2008) Object recognition and segmentation by non-rigid quasi-dense matching. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
VIII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2009) Dense and deformable motion segmentation for wide baseline images. Proc Scandinavian Conference on Image Analysis (SCIA). Lecture Notes in Computer Science 5575: 379–389.
The author of the dissertation had the main responsibility for preparing all of the
articles I–VIII. However, many of the ideas presented in the publications were
developed as teamwork, as detailed in the following.
In Paper I, the first algorithm was developed by the author, whereas the original
idea and implementation of the second algorithm were Prof. Heikkilä’s. The paper was
mainly written by the author and Dr. Salo, who helped with the formulations and proofs.
Paper II was written by the author, who also carried out the experiments. Dr. Brandt
was closely involved in developing the ideas, and gave valuable comments and detailed
suggestions during the writing process.
The author wrote Papers III-VI and performed the related experiments. The ideas
were developed together with the co-authors, who gave advice and guidance throughout
the work.
The experiments in Papers VII and VIII were carried out by the author and Dr. Rahtu.
Paper VII was written together by the author and Dr. Rahtu, whereas Paper VIII was
mainly written by the author. The other co-authors participated by providing ideas and
advice.
List of symbols and abbreviations
|·|  Absolute value
‖·‖  L2-norm
↔  Correspondence between two entities
≃  Equality up to scale
A  Affine transformation matrix or complex symmetric matrix
$A^{-1}$  Inverse of matrix A
$A^{\top}$  Transpose of matrix A
$A^{-\top}$  Inverse transpose of matrix A
B  Complex symmetric matrix
C  Conic coefficient matrix
C, C′  A pair of corresponding conics in two views
D  Asymmetric lens distortion function
det(A)  Determinant of matrix A
$F_p$  Central projection of a pinhole camera
$F_r$  Central projection of a radially symmetric camera
F, F′  The two focal points of an elliptic or hyperbolic mirror
f  Image intensity function or focal length parameter
f′  Image intensity function of the other view
g, g′  Positive window functions for two images
H  Planar projective transformation
H  Homography matrix
h  A vector containing the elements of H
I, I′  Image planes of two cameras
i  Scalar index variable
j  Scalar index variable
K  Mapping from the virtual image plane to the real image plane
$k_1, \dots, k_5$  The parameters of a generic radial projection function
l  Distortion parameter for central catadioptric cameras
m  Coordinates of a point in the image plane
$M_{ij}$  Matrix of size 9×9
M  A matrix obtained by stacking several matrices $M_{ij}$
n  A scalar variable
O  Projection center of a camera
P  Camera projection function
$P_c$  Internal camera projection function
$P_i$  A point in space
$p_i$  A point in plane
$\mathbb{P}^d$  Projective space of dimension d
$\mathbb{R}^d$  Real space of dimension d
R  Orthogonal matrix
R  Composition of a rigid transformation and a projection onto sphere
r  Radial projection function
$S_{f,g}$, $S_{f',g'}$  Symmetric intensity moment matrices of two images
S, S′  Simplifying notations for $S_{f,g}(0)$ and $S_{f',g'}(0)$, respectively
$S'^{1/2}$  Square root of S′
$S^{-1/2}$  Inverse square root of S
s  Camera skew factor
$u_r$, $u_\varphi$  Unit vectors in the radial and tangential directions
u, v  Two-dimensional Cartesian coordinate vectors
(u, v)  Pixel coordinates
$(u_0, v_0)$  Principal point of a camera
V  Virtual image plane of a camera
X  Coordinates of a point in space
x  Coordinates of a point in plane
(x, x′)  A pair of corresponding points in two views
(x, y)  Cartesian plane coordinates
X, Y, Z  Cartesian coordinates
$\Delta_r$  Asymmetric radial distortion function
$\Delta_t$  Asymmetric tangential distortion function
γ  Aspect ratio of a camera
$\zeta_1, \zeta_2, \zeta_3$  Parameters of the asymmetric radial distortion function
$\eta_1, \eta_2, \eta_3$  Parameters of the asymmetric tangential distortion function
θ  Inclination angle
$\iota_1, \iota_2, \iota_3, \iota_4$  Parameters of the asymmetric radial distortion function
$\xi_1, \xi_2, \xi_3, \xi_4$  Parameters of the asymmetric tangential distortion function
Φ  Spherical angle coordinates
ϕ  Azimuth angle
2-D  Two-dimensional
3-D  Three-dimensional
DLT  Direct linear transformation
ETHZ  Die Eidgenössische Technische Hochschule Zürich
RANSAC  Random sample consensus
ROC  Receiver operating characteristic
SVD  Singular value decomposition
ZNCC  Zero-mean normalized cross-correlation
Contents
Abstract
Preface
List of original articles
List of symbols and abbreviations
Contents
1 Introduction
1.1 Background and motivation
1.2 Scope of the thesis
1.3 Contributions
1.4 Summary of the original articles
1.5 Outline of the thesis
2 Geometry in computer vision
2.1 Introduction and background
2.2 Case study: Homography computation from corresponding conics
2.2.1 Related work
2.2.2 Algorithms
2.3 Discussion
3 Geometric camera calibration
3.1 Camera models
3.1.1 Taxonomy of camera models
3.1.2 Perspective cameras
3.1.3 Central omnidirectional cameras
3.2 Calibration methods
3.2.1 Photogrammetric calibration
3.2.2 Self-calibration
3.3 Discussion
4 Image-based scene reconstruction
4.1 Brief review of related work
4.1.1 Structure from motion
4.1.2 Image-based modeling
4.2 Application: Modeling sewer pipes from video
4.2.1 Motivation
4.2.2 Overview of the approach
4.3 Discussion
5 Quasi-dense matching
5.1 Introduction
5.2 Match propagation in the wide baseline case
5.3 Non-rigid quasi-dense matching
5.4 Application in object recognition
5.5 Framework for two-view motion segmentation
5.6 Discussion
6 Summary and conclusion
References
Original articles
1 Introduction
1.1 Background and motivation
Computer vision, as a discipline, relates to the automatic analysis and interpretation
of images. Hence, computer vision provides methods for extracting meaningful in-
formation from image data. In addition, computer vision deals with the construction
of artificial vision systems, which take the methods from theory to practice by soft-
ware and hardware implementations. Today, computer vision systems are used in var-
ious applications in different fields, such as medical imaging, human-computer inter-
action, industrial inspection, photogrammetry, visual surveillance and robot navigation.
Since the image data can be in many different forms (e.g. still photographs, video se-
quences, acoustic images, X-ray images, multidimensional medical images) and there
is a wide variety of different application areas (e.g. content-based image and video
retrieval, image-based modeling, image registration, object detection and recognition,
motion segmentation), computer vision is a broad and diverse subject and it is closely
connected to other fields of science and technology.
Like many other sub-fields of computer science, computer vision has a close
relationship to mathematics. For example, many methods in computer vision are based
on geometry, statistics and optimization. Besides mathematics, physics is also sometimes
important in the development of computer vision methods. For instance, the laws
of geometric optics describe the process of image formation in a photographic camera
and their understanding is useful when computer vision techniques are used for mea-
surement purposes. In addition, the field of signal processing is important to computer
vision since typically the high-level computer vision methods are based on low-level sig-
nal processing techniques. Furthermore, computer graphics and machine learning are
two sub-fields of computer science that are closely related to computer vision. Com-
puter graphics studies the problem of how to render realistic images of a scene given a
model of it, whereas computer vision studies the corresponding inverse problem, that
is, how to extract a description of a scene from its images. However, despite the different
viewpoint, many basic concepts are common to both fields. On the
other hand, the relationship between machine learning and computer vision lies in the
fact that machine learning techniques can often be applied to computer vision problems.
Actually, many tasks in computer vision can be approached by learning a function from
training data. For example, object categorization typically involves learning a classifier
from pre-categorized training images. Finally, fields that study biological vision,
such as neurobiology and computational neuroscience, are also connected to computer
vision, since an understanding of the human visual system can be useful for designing
artificial vision systems and vice versa.
Currently, computer vision is a very active field of research, and there has recently
been rapid development in several problem areas. Many things that were not possible
ten years ago are now reality. For example, related to applications in robotics, it has
been demonstrated that purely vision-based robot localization and mapping is possible
in real time (Davison et al., 2007) and, within the area of image-based modeling, it
has been shown that realistic and automatic 3-D model building from Internet photo
collections is possible (Snavely et al., 2008; Goesele et al., 2007). In the field of object
recognition, there are recent works which, for instance, use text retrieval techniques for
efficient object matching in videos (Sivic & Zisserman, 2009), utilize decision trees for
recognizing specific objects from large databases (Obdržálek & Matas, 2005; Nistér &
Stewénius, 2006) or apply machine learning techniques for object category recognition
(Fergus et al., 2003; Sivic et al., 2005; Lazebnik et al., 2006). In addition, there has been
significant progress within biometric applications of computer vision. For example,
methods for face detection (Viola & Jones, 2001) and recognition (Zhao et al., 2003;
Bowyer et al., 2006; Ahonen et al., 2006) have matured to the stage where they could
be used in practical systems.
Many of the aforementioned advanced applications are based on some relatively re-
cent advances in basic research, such as the understanding of the geometry of multiple
views (Hartley & Zisserman, 2000; Faugeras et al., 2001) or the emergence of methods
for viewpoint-invariant detection and description of local image features (Mikolajczyk
et al., 2005; Mikolajczyk & Schmid, 2005; Lowe, 2004; Heikkilä et al., 2009). Besides
the progress of the field itself, computer vision has benefited from advances in
related fields. In part, the development has been driven by the increase in available
computational power, which has made many computationally intensive techniques more
tractable and attractive in practical applications. This has in turn resulted in growing
interest in the field and intensified research efforts, which have further accelerated
its development. The ever-increasing demand for new applications has also been
an important driving force for these advances. For example, the amount of digital
imagery that is being created and stored is growing fast, partly due to the increasing
number of low-cost cameras (e.g. in mobile phones), and there is an increasing need for
automatic methods to manage and process this kind of image data.
As described, computer vision is a broad subject and only a small part of it can be
covered in the scope of a doctoral thesis. In this thesis, the focus is on geometric aspects
of computer vision. The topics discussed here are related to such specific problem
areas as multiple view geometry, geometric camera calibration, structure from motion,
image-based modeling, image registration, object recognition and motion segmentation.
However, the field of computer vision, in a broad sense, provides the motivation and
background as well as the broader context of application for the models and methods
studied in this work.
1.2 Scope of the thesis
The aim of this thesis is to contribute to the knowledge of geometric computer vision
and to provide practical methods for the problems of the field. The discussion covers a
variety of different problems which involve geometric aspects. Even though the topics
and problems considered in the original articles I-VIII may seem diverse at first, they
are all related to the context of a 3-D reconstruction pipeline, which is schematically
illustrated in Figure 1 and which largely defines the framework for the thesis. In the
following, we briefly describe research problems within the 3-D reconstruction pipeline
and their connection to the work of the thesis.
Recovering a three-dimensional model of a scene from multiple camera views is a
classical problem in computer vision (Marr, 1982; Horn, 1986; Faugeras, 1993; Hartley
& Zisserman, 2000; Faugeras et al., 2001). Typically, the problem is divided into
several subproblems, such as sparse feature extraction and matching, structure from
motion, photogrammetric camera calibration or self-calibration, and dense surface
reconstruction (i.e. dense stereo or multi-view stereo). Each of these subproblems is a
broad research area in its own right. However, despite the extent
of the problem area, recent research has produced many satisfactory solutions to the
different subproblems, so that nowadays the construction of a generic and automatic
3-D reconstruction pipeline is possible in practice, and several such systems have been
built (Fitzgibbon & Zisserman, 1998; Pollefeys, 1999; Nistér, 2001; Lhuillier & Quan,
2005). The first systems of this kind used continuous video sequences as their input, but
some of the more recent approaches are able to produce realistic reconstructions using
only a few sparsely captured images (Martinec, 2008), or even unorganized image sets
from Internet photo collections (Snavely et al., 2008; Goesele et al., 2007).
[Figure 1: Input images → Sparse matching → Structure-from-motion → Self-calibration → Dense reconstruction → Model out]
Fig. 1. A standard pipeline for scene reconstruction from multiple perspective
images. The self-calibration stage may be omitted for precalibrated cameras.
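As an illustration of one step inside the structure-from-motion stage of such a pipeline, the following sketch triangulates a single 3-D point from two views with the standard linear (DLT) method described in Hartley & Zisserman (2000). It assumes known 3×4 projection matrices and noise-free correspondences; the cameras and the point are hypothetical values chosen for the example, and a practical system would refine such linear estimates, for example by bundle adjustment.

```python
import numpy as np


def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a single 3-D point from two views.

    P1, P2: 3x4 camera projection matrices.
    x1, x2: corresponding image points (x, y) in view 1 and view 2.
    Returns the inhomogeneous 3-D point as a length-3 array.
    """
    # Each view contributes two linear equations in the homogeneous point X:
    #   x * (p3 . X) - (p1 . X) = 0  and  y * (p3 . X) - (p2 . X) = 0,
    # where p1, p2, p3 are the rows of the projection matrix.
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project an inhomogeneous 3-D point with a 3x4 camera matrix."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Hypothetical cameras: the second camera centre is shifted one unit along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.3, -0.2, 4.0])
X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)
```

With exact correspondences the linear estimate recovers the point up to rounding; with noisy image points it only minimizes an algebraic error, which is why it is normally used as an initialization.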
In this thesis, the aim is not to describe a generic system for multi-view reconstruction,
but to consider some specific topics that are related to the problem area. Central themes
of the thesis are geometric camera calibration and quasi-dense image matching. In
addition, the topics discussed involve theoretically oriented research, which is related
to methods for determining a planar homography from corresponding conics in two
images, as well as applied research in a sewer imaging application.
The theoretical framework for the thesis is the projective geometry of multiple views,
which is covered from the viewpoint of computer vision in the books by Hartley &
Zisserman (2000) and Faugeras et al. (2001). Within the area of multiple view geometry,
the research of the thesis touches on the estimation of two-view relations.
That is, we study algorithms for computing a planar projective transformation (i.e. a
homography) between two images using corresponding conics, such as ellipses. This
topic is the most theoretically oriented research topic of the thesis.
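The relation that makes conic-based homography estimation possible is how a conic transforms: if points map as x′ ≃ Hx and a point conic satisfies xᵀCx = 0, then the corresponding conic is C′ ≃ H^{-⊤}CH^{-1}. The sketch below checks this constraint numerically for a hypothetical homography and the unit circle; it illustrates only the correspondence relation for C and C′, not the estimation algorithms proposed in Paper I.

```python
import numpy as np


def transfer_conic(C, H):
    """Map a point conic C (x^T C x = 0) through the homography x' ~ H x."""
    Hinv = np.linalg.inv(H)
    Cp = Hinv.T @ C @ Hinv
    # Conics are defined only up to scale; normalize for readability.
    return Cp / np.linalg.norm(Cp)


# Unit circle x^2 + y^2 - 1 = 0 written as a symmetric 3x3 matrix.
C = np.diag([1.0, 1.0, -1.0])

# A hypothetical homography (any invertible 3x3 matrix would do).
H = np.array([[1.2, 0.1, 0.5],
              [-0.2, 0.9, 0.3],
              [0.05, 0.02, 1.0]])
Cp = transfer_conic(C, H)

# A point on the circle maps to a point on the transferred conic.
t = 0.7
x = np.array([np.cos(t), np.sin(t), 1.0])
xp = H @ x
residual = xp @ Cp @ xp
print(abs(residual) < 1e-12)
```

Because each conic correspondence fixes C′ only up to an unknown scale, a single pair of conics gives fewer constraints on H than its entry count suggests, which is why the number of required correspondences matters in the algorithms discussed later.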
A central part of the thesis is geometric camera calibration, which is the process of
determining the imaging geometry of a camera, and is a prerequisite for image-based
metric three-dimensional measurements. Here, we concentrate on the photogrammetric
calibration of omnidirectional central cameras, such as fish-eye lens cameras. However,
we also briefly discuss the self-calibration of central cameras.
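For a conventional perspective camera, calibration amounts to estimating internal parameters such as the focal length f, the aspect ratio γ, the skew s and the principal point (u0, v0) that appear in the list of symbols. The sketch below projects points given in camera coordinates with the common upper-triangular intrinsic matrix. This particular parameterization is an assumption for illustration (conventions for γ and s vary between texts), and both lens distortion and the rigid camera pose are left out.

```python
import numpy as np


def pinhole_project(X_cam, f, gamma, s, u0, v0):
    """Project points in camera coordinates with a pinhole camera model.

    Assumes the parameterization K = [[f, s, u0], [0, gamma*f, v0], [0, 0, 1]].
    X_cam: (N, 3) array of points with positive depth Z.
    Returns an (N, 2) array of pixel coordinates.
    """
    K = np.array([[f, s, u0],
                  [0.0, gamma * f, v0],
                  [0.0, 0.0, 1.0]])
    x = (K @ X_cam.T).T
    # Perspective division by the third homogeneous coordinate.
    return x[:, :2] / x[:, 2:3]


# Hypothetical intrinsics for a 640x480 image.
pts = np.array([[0.0, 0.0, 2.0],
                [0.5, -0.25, 2.0]])
uv = pinhole_project(pts, f=800.0, gamma=1.0, s=0.0, u0=320.0, v0=240.0)
print(uv)
```

A point on the optical axis lands on the principal point, here (320, 240). Note that this model cannot image rays at 90° or more from the optical axis, which is one reason why omnidirectional cameras need a different model.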
Camera calibration is required for image-based measurements, and the metrology
application studied in this thesis is the measurement and modeling of sewer pipes from
video. Our approach to this application problem is to use structure from motion tech-
niques to recover the shape of a sewer pipe from a video sequence, which is captured
by a precalibrated fish-eye lens camera moving through the pipe. We additionally study
methods for building a tubular model of the pipe based on the reconstructed interest
points. Hence, the sewer imaging system provides a practical, application-specific
example of a complete 3-D reconstruction pipeline, and it is the most applied theme in
the thesis.
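The tubular modeling method itself is not described in this section. As a much simpler, hypothetical building block, the sketch below fits a circle to points from a single cross-section of a pipe, assuming the reconstructed points have already been projected onto a plane perpendicular to the pipe axis. It uses the algebraic (Kåsa) least-squares circle fit, a standard technique that is not necessarily the one used in Paper V; the pipe radius, centre and noise level are invented for the example.

```python
import numpy as np


def fit_circle(points):
    """Algebraic (Kasa) least-squares circle fit to 2-D points.

    Solves x^2 + y^2 + D x + E y + F = 0 for (D, E, F) in the least-squares
    sense and converts the result to a centre and a radius.
    """
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    b = -(x**2 + y**2)
    (D, E, F), *_ = np.linalg.lstsq(A, b, rcond=None)
    center = np.array([-D / 2.0, -E / 2.0])
    radius = np.sqrt(center @ center - F)
    return center, radius


# Hypothetical noisy cross-section samples of a pipe with radius 0.3 m.
rng = np.random.default_rng(0)
angles = rng.uniform(0.0, 2.0 * np.pi, 200)
pts = np.column_stack([0.1 + 0.3 * np.cos(angles),
                       -0.05 + 0.3 * np.sin(angles)])
pts += rng.normal(scale=0.002, size=pts.shape)
center, radius = fit_circle(pts)
print(center, radius)
```

The algebraic fit is linear and needs no initialization, but it is known to be biased when only a short arc is observed; a geometric (distance-based) refinement is the usual next step.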
An important part of the thesis is a quasi-dense approach to image matching. Our
work is built on previous works (Lhuillier & Quan, 2002) and (Lhuillier & Quan, 2005)
which have shown that growing a tentative set of sparse keypoint correspondences into a
set of quasi-dense matches between two images provides a good basis for two-view
geometry estimation and surface reconstruction. In this thesis, we extend the quasi-dense
approach to the case where the viewpoints of the two cameras differ substantially (the
so-called wide baseline case), and study its applications in piecewise image registration
problems, such as specific object recognition and motion segmentation. Thus, here the
problem of image matching provides a connection between recognition and reconstruction,
which are two classical problems in vision. This connection has recently been
studied by other researchers as well, from different points of view (Rothganger et al.,
2006; Cornelis et al., 2008), and, overall, the integration of recognition and reconstruction
is an interesting and timely topic.
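The core idea of match propagation, in the spirit of Lhuillier & Quan (2002), is to grow matches best-first from seed correspondences: candidates are ranked by a similarity score such as ZNCC, each pixel may be matched only once, and a neighbour's displacement may differ only slightly from that of an already matched pixel. The sketch below implements a simplified version under stated assumptions: grayscale images as NumPy arrays, ZNCC over square windows and a purely translational local model. It does not include the affine local transformations or the adaptive strategies of Papers VI and VII, and the parameter values (w, thresh, max_disp) are hypothetical.

```python
import heapq

import numpy as np


def zncc(I1, I2, p1, p2, w):
    """ZNCC between (2w+1)x(2w+1) windows centred at p1 in I1 and p2 in I2."""
    a = I1[p1[0] - w:p1[0] + w + 1, p1[1] - w:p1[1] + w + 1].astype(float).ravel()
    b = I2[p2[0] - w:p2[0] + w + 1, p2[1] - w:p2[1] + w + 1].astype(float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a @ a) * (b @ b))
    return float(a @ b / denom) if denom > 0 else -1.0


def propagate(I1, I2, seeds, w=2, thresh=0.8, max_disp=1):
    """Best-first match propagation from seed matches (simplified sketch).

    seeds: list of ((r1, c1), (r2, c2)) pixel matches.
    Returns a dict mapping pixels of I1 to their matched pixels in I2.
    """
    def inside(p, shape):
        # The correlation window must fit entirely inside the image.
        return w <= p[0] < shape[0] - w and w <= p[1] < shape[1] - w

    heap = []
    for p1, p2 in seeds:
        if inside(p1, I1.shape) and inside(p2, I2.shape):
            heapq.heappush(heap, (-zncc(I1, I2, p1, p2, w), p1, p2))

    matches, used2 = {}, set()
    while heap:
        _, p1, p2 = heapq.heappop(heap)   # best remaining candidate first
        if p1 in matches or p2 in used2:  # uniqueness constraint
            continue
        matches[p1] = p2
        used2.add(p2)
        # New candidates: 8-neighbours of p1 paired with pixels near the
        # equally shifted neighbour of p2 (displacement may change by at
        # most max_disp pixels).
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                q1 = (p1[0] + dr, p1[1] + dc)
                if q1 in matches or not inside(q1, I1.shape):
                    continue
                for er in range(-max_disp, max_disp + 1):
                    for ec in range(-max_disp, max_disp + 1):
                        q2 = (p2[0] + dr + er, p2[1] + dc + ec)
                        if q2 in used2 or not inside(q2, I2.shape):
                            continue
                        score = zncc(I1, I2, q1, q2, w)
                        if score > thresh:
                            heapq.heappush(heap, (-score, q1, q2))
    return matches


# Synthetic example: I2 is a random texture I1 translated by (2, 3) pixels.
rng = np.random.default_rng(1)
I1 = rng.random((32, 32))
I2 = np.roll(I1, shift=(2, 3), axis=(0, 1))
matches = propagate(I1, I2, seeds=[((15, 15), (17, 18))])
correct = sum(1 for (r, c), (r2, c2) in matches.items() if (r2 - r, c2 - c) == (2, 3))
frac_correct = correct / len(matches)
print(len(matches), frac_correct)
```

A single seed suffices here because the whole image undergoes one translation; with real wide baseline images the local displacement varies, which is exactly where a richer local model such as an affine one becomes necessary.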
1.3 Contributions
The main contributions of the thesis are listed below.
– Two algorithms are proposed for computing a planar homography from correspond-
ing conics in two images.
– A generic camera model and camera calibration method are presented. The proposed
calibration approach is applicable for both conventional cameras and omnidirectional
cameras, such as fish-eye lens cameras.
– The calibration method is implemented as a Matlab toolbox and tested in practice
with real cameras of various types.
– A method is proposed for the self-calibration of central cameras from two-view point
correspondences.
– An error analysis is performed for the structure from motion approach that recovers
the interior structure of a sewer pipe from a video sequence which is scanned by a
robot moving inside the pipe. In addition, a method for modeling tubular surfaces
from sets of three-dimensional points is presented.
– A method for quasi-dense wide baseline image matching is proposed. The method is
an extension of the previously published match propagation algorithm (Lhuillier &
Quan, 2002).
– A non-rigid variant of the quasi-dense matching method is presented. In addition, a
method is described for grouping the quasi-dense two-view matches into geometri-
cally consistent groups, which are supposed to lie on smooth surfaces that represent
individual objects. The proposed techniques are applied to the simultaneous recogni-
tion and segmentation of specific objects in photographs.
– A layer-based framework for dense and deformable two-view motion segmentation
is described. The proposed framework utilizes quasi-dense matching for initializing
the motion layers, and is applicable under wide baseline conditions.
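Several of the contributions above rely on local geometric transformations between two images; for wide baseline matching, the abstract notes that an affine model is used for these local transformations. As a minimal sketch, assuming a handful of point matches from a small neighbourhood, an affine map x′ ≈ Ax + t can be estimated by linear least squares as follows. This is a generic textbook estimator for illustration; how the thesis initializes and updates its local affine estimates during propagation is not reproduced here, and the transformation values are hypothetical.

```python
import numpy as np


def fit_affine(x, xp):
    """Least-squares affine map xp ~ A @ x + t from N >= 3 point matches.

    x, xp: (N, 2) arrays of corresponding points in the two images.
    Returns (A, t): a 2x2 matrix and a 2-vector.
    """
    M = np.column_stack([x, np.ones(len(x))])        # rows [x, y, 1]
    params, *_ = np.linalg.lstsq(M, xp, rcond=None)  # shape (3, 2)
    A = params[:2].T  # linear part
    t = params[2]     # translation
    return A, t


# Hypothetical local transformation and a few matches generated from it.
A_true = np.array([[1.1, 0.2], [-0.1, 0.9]])
t_true = np.array([5.0, -3.0])
x = np.array([[0.0, 0.0], [4.0, 1.0], [2.0, 5.0], [-3.0, 2.0]])
xp = x @ A_true.T + t_true
A_est, t_est = fit_affine(x, xp)
print(A_est, t_est)
```

Three non-collinear matches determine the six affine parameters exactly; additional matches make the estimate less sensitive to localization noise.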
1.4 Summary of the original articles
This thesis is based on eight articles in which the contributions listed above were origi-
nally published. The articles are reprinted in the appendix of the thesis, and their content
is summarized below.
Paper I presents two new algorithms for determining a planar homography from
corresponding conics in two images. The first algorithm is based on solving a set of
overdetermined linear equations, which are derived from the correspondence equations
for the elements of the homography matrix, and it can be applied when there are three
or more conic correspondences. The second algorithm is for the minimal case, i.e., it
allows one to compute the homography from only two conic correspondences. Unlike
some earlier approaches, this algorithm involves only linear algebra, e.g. eigendecom-
positions of complex symmetric matrices, and does not require solving high-degree
polynomial equations. Hence, both of the proposed algorithms are relatively easy to
implement.
Paper II describes a generic camera model which is suitable for both conventional
perspective cameras and central omnidirectional cameras, such as fish-eye and wide-
angle lens cameras and catadioptric cameras. The key idea of the article is to present
a flexible geometric camera model which allows accurate modeling of various types of
real cameras. Because real lenses and mirrors may deviate from precise radial symme-
try, the proposed model contains an asymmetric part whose purpose is to account for the
imperfections of the optical system. Also, a method is described for the computation of
the inverse camera model by using a first-order approximation for the asymmetric dis-
tortion function. In addition to the generic camera model, Paper II presents a four-step
algorithm for determining the parameters of the model. The algorithm utilizes images
of a planar calibration object, which contains control points in known positions. Paper
II is partly based on earlier works (Kannala & Brandt, 2004; Kannala, 2004).
Nevertheless, unlike the method of Paper II, the calibration procedure proposed in these
earlier works is not suitable for omnidirectional cameras whose field of view exceeds
the hemisphere.
Paper III is a review article about geometric camera calibration. The article provides
an overview of camera models and calibration methods used in the field. The focus is
on conventional calibration techniques in which the parameters of the camera model
are determined by using images of a calibration object whose geometric properties are
known. Additionally, Paper III contains calibration experiments where the method of
Paper II is quantitatively evaluated with various types of real cameras. In the experi-
ments, we used three dioptric cameras equipped with a narrow-angle lens, a wide-angle
lens, and a fish-eye lens, and two catadioptric cameras, which were constructed by plac-
ing two different mirrors in front of the narrow-angle lens camera. The mirrors used
were a hyperbolic mirror and an equiangular mirror.
Paper IV proposes a self-calibration method for central cameras. The method uses
two-view point correspondences, and estimates the camera parameters by minimizing
the two-view angular error, which does not depend on the 3-D coordinates of the point
correspondences. A low-parameter camera model, which is suitable for different kinds
of radial distortions, is used in the minimization. Hence, the self-calibration problem
results in a small-scale optimization problem. However, the cost function may have
many local minima and, in order to avoid them, a multi-step optimization approach is
proposed.
Paper V presents a system for modeling sewer pipes from videos acquired by a fish-
eye lens camera moving in the pipe. The approach is based on tracking interest points
across the video sequence and addressing the related structure from motion problem.
Naturally, the proposed method requires that the interior surface of the pipe is suffi-
ciently textured so that interest points can be extracted. The experiments with a real
sewer video, scanned inside an eroded concrete pipe, show that the structure of the pipe
can be reliably reconstructed despite the forward motion of the camera. The tubular ar-
rangement of the reconstructed points allows us to build a parametric model of the pipe
by surface fitting. In fact, Paper V additionally proposes a practical method for model-
ing tubular surfaces with a locally cylindrical model. Paper V is partly based on earlier
works (Kannala & Brandt, 2005) and (Kannala, 2004). However, these earlier works discuss
neither the modeling issue nor the error analysis of the recovered structure.
Paper VI describes a method for quasi-dense wide baseline image matching. The
method extends the match propagation algorithm, which is a technique for expanding
matching regions using local pixel-wise propagation, and makes it suitable for wide
baseline cases, where the camera pose may vary considerably between the two views.
There are basically two main extensions that are proposed in the article. The first ex-
tension is to use an affine model for the local geometric transformations between the
images. The local affine transformations are initialized from corresponding interest
regions which are used as seed matches. The second extension is to use the second
order intensity moments for adapting the estimates of the local affine transformations
during the match propagation. Besides intensity moments, the adaptation requires that
the epipolar geometry is known. However, even the locally constant affine
transformation model typically improves matching and provides a good result if the seed
matches are dense enough to cover the different surfaces that are to be matched.
Paper VII presents a non-rigid quasi-dense matching approach by extending the
adaptive method of Paper VI so that the adjustment of the local affine transformations
is based on local image gradients and second order intensity moments. This implies
that the epipolar constraint is not needed for adjusting the transformations and, hence,
adaptive match propagation can be used in non-rigid image registration where global
geometric constraints are not available. Additionally, Paper VII applies quasi-dense
matching for object recognition and segmentation. Specifically, a method is proposed
for grouping the quasi-dense pixel matches between the model and test images into
geometrically consistent groups which are supposed to represent the common objects
in the images. The grouping is based on a local grouping criterion which utilizes the
local affine transformation estimates acquired during the propagation. The number and
quality of matches in the obtained groups are used as decision criteria for recognition
whereas the location of matching pixels gives the segmentation. The experiments in
Paper VII show that the quasi-dense approach improves the reliability of recognition
compared to approaches that use only a sparse set of interest regions for matching.
Paper VIII proposes a dense and deformable motion segmentation method for wide
baseline image pairs. The method is based on a bottom-up segmentation strategy, which
starts from a sparse set of seed matches, then proceeds to quasi-dense matching and, fi-
nally, uses the grouped quasi-dense matches to initialize the motion layers for the dense
segmentation stage, where the geometric and photometric transformations of the layers
are refined simultaneously with the segmentation. Thus, because the quasi-dense approach is used for initializing the motion layers, Paper VIII builds on the work of Paper
VII. In fact, the problem of two-view motion segmentation can be seen as a generalization of the object recognition problem, in which one of the images (i.e. the model image)
is typically presegmented. So, in general, two-view motion segmentation requires
solving the following two subproblems simultaneously: (a) recognition of groups of
pixels that move together (in both images), and (b) estimation of the motion fields
associated with each group. The key contribution of Paper VIII is a motion segmentation
method which can deal with large non-rigid motions and large illumination changes.
1.5 Outline of the thesis
This thesis consists of an overview and an appendix, which contains the original articles.
The rest of the overview is organized as follows. First, Chapter 2 describes the background
of geometric computer vision and discusses the work on conic-based homography estimation. Chapter 3 deals with geometric camera calibration. Chapter 4 concentrates on
image-based scene reconstruction and introduces the sewer imaging application. The
topics related to quasi-dense matching are presented in Chapter 5 and conclusions are
in Chapter 6.
2 Geometry in computer vision
2.1 Introduction and background
Geometry is an important aspect of computer vision. The laws of geometry and optics
describe how the three-dimensional world is imaged on the camera sensor and, hence,
an understanding of imaging geometry is important for the development of automatic
image analysis methods. For example, a classical problem in geometric computer vision
is to determine automatically the three-dimensional structure of a scene from several
two-dimensional images. This geometric inverse problem is called the structure from
motion problem, and the approaches for solving it are based on geometric knowledge.
In fact, many current structure from motion systems are based on relatively recent ad-
vances in the theory and practice of geometric computer vision. Thus, although there is
a long history of using photographic images for measurement purposes in photogram-
metry (Slama, 1980), there has been significant progress in this area during the past
twenty years.
On the theoretical side, the progress has been achieved by applying concepts from
classical projective and algebraic geometry to problems in computer vision. Besides
just applying classical tools to new computational problems, the research has addition-
ally produced new theoretical results. For instance, the theory related to multiple view
tensors, which characterize the matching constraints for multiple views of a rigid scene,
was largely developed during the 1990s. The advances made in geometric computer
vision during the last two decades of the 20th century are summarized in (Hartley &
Zisserman, 2000) and (Faugeras et al., 2001). These works give a comprehensive introduction to the projective geometry of multiple views, as well as several algorithms
for applying the theory to practical computations. The viewpoint in (Hartley & Zisserman, 2000) and (Faugeras et al., 2001) is application oriented, and the emphasis is on
relatively recent results. A more formal treatment of the classical mathematical foundations of algebraic projective geometry can be found in (Semple & Kneebone, 1952),
whereas (Stolfi, 1991) deals with oriented projective geometry. An early study of three-dimensional computer vision is (Faugeras, 1993), and the geometry of stereo vision is
discussed in (Xu & Zhang, 1996). In addition, (Kanatani, 1993) and (Kanatani, 1996)
discuss many computational and statistical aspects of geometric computer vision.
Most of the research on geometric vision problems during the 1980s and 1990s was related to the scenario where a single perspective camera moves in a rigid scene. A characteristic feature of the developed theory and algorithms is the uncalibrated approach,
where the estimation of geometric entities can be performed in a projective coordinate
frame without knowing the internal camera parameters. Compared to the traditional
photogrammetric approach, which requires precalibrated cameras, the uncalibrated ap-
proach provides a more complete understanding of the geometry of multiple views, and
allows automatic metric scene reconstruction from image sequences via self-calibration;
this is the process that automatically determines the internal camera parameters from
the images (Hartley & Zisserman, 2000). Overall, the research that has led to the cur-
rent knowledge in geometric computer vision involves numerous contributions from
several researchers, and not all of them can be listed here. Hence, we mainly refer to
the aforementioned books which provide further references.
As briefly reviewed above, the multiple view geometry of rigid scenes is currently
relatively well understood, and several books have been written about the topic. Therefore, recent research has also expanded to other areas of geometric computer vision.
Such areas include, for example, the application of algebraic geometry for solving sets
of nonlinear polynomial equations which arise from minimal problems in computer
vision (Stewénius, 2005; Byröd et al., 2009; Kukelova et al., 2008), omnidirectional
vision and generic camera models (Daniilidis & Klette, 2006), and the field of dynamic
vision (Vidal et al., 2007), which can be seen to involve various vision problems related
to dynamic environments, such as deformable image registration, non-rigid structure
from motion, and motion segmentation.
Besides the advances in theoretical understanding, methodological advances
related to practical computational issues have also been significant for the development of
vision systems. In particular, such issues as stability, robustness and precision of esti-
mation algorithms, which are used to estimate geometric entities from image data, are
relevant for the performance of real-world vision systems. Actually, since the purpose
of geometric computer vision is to provide the means for extracting geometric scene
information automatically from images, the estimation of geometric models is an essen-
tial part of most problems in the field. Typical estimation problems in rigid structure
from motion involve such geometric entities as planar homographies, fundamental ma-
trices, trifocal tensors, and camera projection matrices.
However, the data that is used in geometric estimation problems usually contains
measurement noise and outliers, i.e., observations which do not fit into the model. For
example, the automatically matched tentative interest point correspondences, which are
commonly used for structure and motion recovery, are localized with limited precision
and may contain false matches. Thus, the estimation algorithms must be stable with re-
spect to noise and robust to outliers (Hartley & Zisserman, 2000). Often the robustness
to outliers is achieved by using the Random Sample Consensus paradigm (RANSAC)
(Fischler & Bolles, 1981). In fact, there is a lot of early and recent research related to
different minimal problems which can be efficiently solved in the RANSAC framework,
e.g. (Nistér, 2004; Brown et al., 2007). Finally, it is also desirable that estimation procedures are precise so that estimation errors decrease when the amount of data increases.
Hence, often maximum likelihood estimates are computed after RANSAC-based outlier
rejection (Hartley & Zisserman, 2000). Overall, general advances in numerical methods
have also turned out to be useful in many vision problems, which are often computationally intensive and involve large-scale optimization (Triggs et al., 2000; Kahl et al.,
2008).
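To make the RANSAC-then-refine pattern concrete, the following minimal Python/NumPy sketch fits a 2-D line rather than a homography, a deliberate simplification; the functions `ransac_line` and `fit_line_tls` are hypothetical illustrations, not code from the cited works. Each hypothesis comes from a minimal sample of two points, the hypothesis with the largest consensus set is kept, and the final line is refit to the inliers by total least squares, which gives the maximum likelihood line under isotropic Gaussian noise.

```python
import numpy as np

def fit_line_tls(pts):
    """Total least squares line {p : n @ p = d} with ||n|| = 1."""
    c = pts.mean(axis=0)
    _, _, Vt = np.linalg.svd(pts - c)
    n = Vt[-1]                          # direction of least spread = line normal
    return n, n @ c

def ransac_line(pts, thresh, iters=200, seed=0):
    """Robust line fit: minimal 2-point samples, consensus scoring, TLS refit."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(pts), dtype=bool)
    for _ in range(iters):
        i, j = rng.choice(len(pts), size=2, replace=False)
        t = pts[j] - pts[i]
        length = np.linalg.norm(t)
        if length == 0:
            continue                    # degenerate minimal sample
        n = np.array([-t[1], t[0]]) / length
        inliers = np.abs((pts - pts[i]) @ n) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    n, d = fit_line_tls(pts[best])      # refine on the consensus set only
    return n, d, best

# Usage: 20 points on y = 0.5 x + 1 plus 5 gross outliers.
xs = np.arange(20.0)
pts = np.vstack([np.c_[xs, 0.5 * xs + 1.0],
                 [[0, 10], [5, -8], [10, 20], [-3, 15], [7, -10]]])
n, d, inliers = ransac_line(pts, thresh=0.1)
```

In a homography setting, the minimal sample would be four point correspondences and the refinement a nonlinear maximum likelihood fit, but the control flow is the same.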
As described above, estimation of geometric entities from image data is a central
theme in geometric computer vision. One particular estimation problem studied in
this thesis is the computation of a planar homography from corresponding coplanar
conics. Paper I proposes two algorithms for this problem and they are summarized in
the following section.
2.2 Case study: Homography computation from corresponding conics
In perspective imaging, a scene plane is mapped to the image plane by a planar ho-
mography. Most often the homography between two planes is determined from point
correspondences (Hartley & Zisserman, 2000) but, in this thesis, Paper I studies the
problem of determining the homography from conic correspondences.
Paper I presents two algorithms for homography computation from corresponding
coplanar conics. In general, the first algorithm is for the case where there are three
or more conic correspondences and the second algorithm is for the minimal case of
two conic correspondences. The algorithms can be used to compute the homography
between a known scene plane and its perspective image or between two different per-
spective views of an unknown scene plane, as illustrated in Figure 2. Hence, in addition
to image registration, the algorithms might be useful in applications where homographies have to be estimated in order to extract either the internal or external parameters
of a perspective camera. For instance, plane-based camera calibration is one example
of such an application (Sturm & Maybank, 1999).
Fig. 2. An image registration example. Two perspective views of a plane contain-
ing white circles; the detected ellipses are in cyan. The homography was esti-
mated by the method of Paper I and is illustrated with the difference image on the
right.
The algorithms of Paper I are summarized in Section 2.2.2. First, however, the follow-
ing subsection provides a brief review of previous works that are related to conics and
their applications within computer vision.
2.2.1 Related work
Estimation of a planar homography from image data is an important part of many com-
puter vision algorithms. The most common approach to homography estimation is to
use point or line correspondences. In this case, a well-known technique, called the Di-
rect Linear Transformation method (DLT), allows us to formulate the estimation prob-
lem as a system of linear equations which can be solved by using at least four corre-
spondences (Hartley & Zisserman, 2000). However, in addition to points and lines,
correspondences of planar contours, such as conics, have also been used for homography estimation (Jain & Jawahar, 2006). In fact, conics can be seen as one
of the most fundamental image features, together with points and lines, because each of
these feature types is preserved under projective transformations: the image of a point, a line, or a conic is again a point, a line, or a conic. Actually, there has been
quite a lot of research on the geometry of conics in computer vision. The motivation
for studying conics arises from both scientific curiosity and practical problems. For
example, in some cases, reliable point or line features may not be available, whereas
higher order parametric curves, such as conics, can be recovered robustly.
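The point-based DLT mentioned above can be sketched in a few lines of Python/NumPy. This is a minimal, unnormalized version for illustration, and the function names `dlt_homography` and `apply_homography` are hypothetical. Each point pair contributes two rows of the cross-product constraint $x' \times Hx = 0$, and the solution is the right singular vector of the stacked system for the smallest singular value. For noisy data, the normalized DLT described by Hartley & Zisserman (2000) is preferable.

```python
import numpy as np

def dlt_homography(x, xp):
    """Estimate H with xp ~ H x from n >= 4 point pairs (n x 2 arrays)."""
    rows = []
    for (u, v), (up, vp) in zip(x, xp):
        X = np.array([u, v, 1.0])
        zero = np.zeros(3)
        # Two independent rows of cross(xp, H X) = 0, with h = rows of H stacked
        rows.append(np.concatenate([zero, -X, vp * X]))
        rows.append(np.concatenate([X, zero, -up * X]))
    _, _, Vt = np.linalg.svd(np.array(rows))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                  # fix the scale (assumes H[2, 2] != 0)

def apply_homography(H, x):
    """Map n x 2 inhomogeneous points through H."""
    xh = np.c_[x, np.ones(len(x))] @ H.T
    return xh[:, :2] / xh[:, 2:]

# Usage with synthetic, noise-free correspondences
H_true = np.array([[1.0, 0.2, 3.0],
                   [0.1, 0.9, -2.0],
                   [0.001, 0.002, 1.0]])
x = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [0.5, 0.3]], dtype=float)
H_est = dlt_homography(x, apply_homography(H_true, x))
```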
Detection and estimation of conics from images is a prerequisite for applying conic-based algorithms. Generic segmentation paradigms, such as the Hough transform (Ballard, 1981) or the RANSAC technique (Fischler & Bolles, 1981), can be used for conic
detection, i.e. for segmenting edge points into sets that lie on conics (Rosin & West,
1995), and there are several estimation methods for fitting conics to segmented sets of
2-D points. In fact, the problem of conic fitting has been well studied, and there are
various non-iterative fitting methods (Bookstein, 1979; Rosin, 1993; Fitzgibbon et al.,
1999) which typically minimize some kind of an algebraic distance, as well as iterative
approaches which minimize the geometrical error (Sturm & Gargallo, 2007), i.e. the
sum of squared distances between the 2-D points and the conic, or an approximation
of it (Sampson, 1982; Kanatani, 1994). The rationale for minimizing the geometrical
error is the fact that it provides the maximum likelihood estimate of the conic under the
assumption of isotropic Gaussian noise in the point measurements.
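As an illustration of the non-iterative algebraic approach, the following Python/NumPy sketch fits a conic by minimizing the algebraic residual $\|Dc\|$ subject to $\|c\| = 1$, where each row of $D$ is $(x^2, xy, y^2, x, y, 1)$. The unit-norm constraint is only one possible normalization, assumed here for simplicity; the cited methods of Bookstein (1979) and Fitzgibbon et al. (1999) use different constraints, and such a fit does not replace geometric-error refinement. The helper name `fit_conic_algebraic` is hypothetical.

```python
import numpy as np

def fit_conic_algebraic(pts):
    """Symmetric 3x3 conic C minimizing sum (a x^2 + b xy + c y^2 + d x + e y + f)^2
    subject to ||(a, b, c, d, e, f)|| = 1."""
    x, y = pts[:, 0], pts[:, 1]
    D = np.column_stack([x**2, x * y, y**2, x, y, np.ones_like(x)])
    _, _, Vt = np.linalg.svd(D)
    a, b, c, d, e, f = Vt[-1]
    return np.array([[a, b / 2, d / 2],
                     [b / 2, c, e / 2],
                     [d / 2, e / 2, f]])

# Usage: noise-free points on the ellipse ((x-2)/3)^2 + ((y+1)/1.5)^2 = 1
t = np.linspace(0, 2 * np.pi, 30, endpoint=False)
pts = np.column_stack([2 + 3 * np.cos(t), -1 + 1.5 * np.sin(t)])
C = fit_conic_algebraic(pts)
X = np.c_[pts, np.ones(len(pts))]
residuals = np.einsum("ni,ij,nj->n", X, C, X)   # x^T C x for every point
```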
As there are means for extracting conics from images, several conic-based proce-
dures have been proposed for various problem areas, such as object recognition (Forsyth
et al., 1991; Carlsson, 1993), structure from motion (Quan, 1996; Kahl & Heyden, 1998;
Schmid & Zisserman, 2000), and camera calibration (Yang et al., 2000; Wu et al., 2004;
Chen et al., 2004; Gurdjos et al., 2006; Ying & Zha, 2007). For instance, Forsyth et al.
(1991) and Carlsson (1993) derive two projective invariants for a pair of coplanar conics
and use them, together with other invariant descriptors of planar shapes, for recognizing curved plane objects irrespective of their pose. In addition to pose invariant object
recognition, Forsyth et al. (1991) describe an algorithm that utilizes two coplanar conic
correspondences for determining the relative pose of a scene plane with respect to the
camera. The pose recovery problem is equivalent to determining the homography between the scene plane and the image plane for a calibrated camera. The approach
presented by Forsyth et al. (1991) requires the solution of quartic equations.
Within the area of structure from motion, conics have been studied from several
points of view. For example, Quan (1996) describes methods for projective and metric
reconstruction of plane conics from two images assuming that the camera projection ma-
trices are known. On the other hand, Kahl & Heyden (1998) and Kaminski & Shashua
(2004) utilize conic correspondences for epipolar geometry estimation. Furthermore,
Schmid & Zisserman (2000) deal with the geometry and matching of lines and curves
over multiple views. In particular, it is shown by Schmid & Zisserman (2000) that,
given the epipolar geometry, the homography induced by a plane can be determined
from one conic correspondence.
In camera calibration, conics can be utilized in various ways. One alternative is to
use conic correspondences to compute the homographies between a planar calibration
pattern and its images, as described in Paper I, and then determine the internal cam-
era parameters by utilizing the geometric constraints derived from the homographies
(Sturm & Maybank, 1999). This is also the approach by Yang et al. (2000), but the
method proposed there is restricted to concentric conics. However, in certain cases, if
the values of some internal camera parameters are known, or if a special calibration
pattern is used, one may derive constraints directly for the internal camera parameters
and avoid computing the homographies for each calibration image. In fact, there are
many works which study the calibration constraints that can be used with different calibration patterns containing conics. For example, there are studies which use coplanar
circles (Chen et al., 2004; Wu et al., 2004), axis-aligned conics (Ying & Zha, 2007),
or confocal conics (Gurdjos et al., 2006). Besides plane-based calibration, images
of spheres have also been used for the calibration of both perspective cameras (Zhang et al.,
2007) and catadioptric cameras (Ying & Zha, 2005). The latter approach utilizes the
fact that, in addition to perspective cameras, certain catadioptric cameras also project
the occluding contour of a sphere onto a conic in the image (Ying & Hu, 2004b).
As described above, there are many problem areas where conics have been utilized
as image features. In addition to the aforementioned studies, there are previous works
that concentrate on the same problem as Paper I. The closest works to Paper I that we are
aware of are (Sugimoto, 2000), (Mudigonda et al., 2004) and (Ma, 1993). In (Sugimoto,
2000), a linear algorithm is proposed for solving the homography from correspondences
of coplanar conics in a general configuration. The algorithm is based on considering
conics as points in the projective space $\mathbb{P}^5$, and the homography is determined from
the corresponding conic-based transformation, which is a linear mapping from $\mathbb{P}^5$ to $\mathbb{P}^5$.
This approach requires at least seven correspondences, whereas the linear algorithm
presented in Paper I requires only three correspondences.
The minimum number of conic correspondences that is required for solving the
homography is two. A method for computing the homography from two conic cor-
respondences is proposed by Mudigonda et al. (2004). This method requires solving
polynomial equations, whereas the approach of Paper I requires eigendecompositions
of symmetric matrices. However, recently, after the publication of Paper I, we became
aware of the work by Ma (1993), which also discusses the problem of determining a
planar homography from two coplanar conics. The approach proposed by Ma (1993)
is similar in spirit to that of Paper I. Nevertheless, the algorithms are independently
developed and different in their details. Furthermore, Paper I discusses the effect of
measurement errors in the conic coefficients, which is a topic that is not covered by Ma
(1993).
2.2.2 Algorithms
Both algorithms proposed in Paper I are based on linear algebra and utilize the frame-
work of algebraic projective geometry (Semple & Kneebone, 1952; Hartley & Zisserman, 2000). The basic ideas of the algorithms are summarized below; the details can
be found in the original article.
Problem setting
A conic is a second-degree curve in the plane and, by using homogeneous coordinates
$x$, it is defined by an equation of the form $x^\top C x = 0$, where $C$ is a real symmetric $3\times 3$
matrix, which contains the parameters of the conic.
In the problem of Paper I, it is assumed that one has identified the conic correspondences $C_i \leftrightarrow C'_i$, $i = 1, \ldots, n$, between two planes which are related by a homography
represented by a non-singular $3\times 3$ matrix $H$. Further, it is assumed that the conics
are non-degenerate, so that $\det C_i \neq 0$ and $\det C'_i \neq 0$.
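It is well known that, under a point transformation $x' \simeq Hx$, a conic $C_i$ transforms
to
$$C'_i \simeq H^{-\top} C_i H^{-1}, \tag{1}$$
where $\simeq$ denotes equality up to scale (Hartley & Zisserman, 2000). Since the scale of
matrix $H$ is insignificant, one may fix it by $(\det H)^2 = 1$. Then, by scaling the conic
coefficient matrices $C_i$ so that $\det C_i = \det C'_i$, the transformation rule (1) implies that
$$C_i = H^\top C'_i H, \qquad i = 1, \ldots, n. \tag{2}$$
(Taking determinants of $H^\top C'_i H = \lambda C_i$ gives $\lambda^3 = 1$, so the real scale factor is $\lambda = 1$.)
Conversely, taking determinants of (2) gives $\det C_i = (\det H)^2 \det C'_i$. Hence, since $\det C_i = \det C'_i \neq 0$ by construction, every $H$ that satisfies (2) must also
satisfy $(\det H)^2 = 1$. Thus, one may directly focus on solving (2) without the need to
consider any additional constraints for $H$.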
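The scale normalization leading from (1) to (2) can be checked numerically. The sketch below (Python/NumPy; the helper names `transform_conic` and `match_determinant` are hypothetical) maps a conic with (1) using an arbitrary extra scale factor, rescales $H$ so that $(\det H)^2 = 1$ and $C_i$ so that $\det C_i = \det C'_i$, and confirms that (2) then holds. The real cube root is used because $\det C'_i / \det C_i$ may be negative.

```python
import numpy as np

def transform_conic(C, H):
    """Eq. (1): conic C maps to H^{-T} C H^{-1} under x' ~ H x."""
    Hinv = np.linalg.inv(H)
    return Hinv.T @ C @ Hinv

def match_determinant(C, Cp):
    """Rescale C so that det C = det C' (real cube root keeps the sign)."""
    return C * np.cbrt(np.linalg.det(Cp) / np.linalg.det(C))

H = np.array([[2.0, 0.3, 1.0],
              [-0.4, 1.5, 0.2],
              [0.01, 0.02, 1.0]])
H = H / np.cbrt(abs(np.linalg.det(H)))   # now (det H)^2 = 1
C = np.diag([1.0, 1.0, -4.0])            # circle of radius 2
Cp = -3.7 * transform_conic(C, H)        # arbitrary (here negative) scale
C_scaled = match_determinant(C, Cp)      # should equal H^T Cp H, Eq. (2)
```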
The algorithms for solving $H$ are derived from equations (2). In the following
sections, the two cases, $n = 2$ and $n > 2$, are discussed separately.
The case n > 2
If the number of conic correspondences is greater than two, one may proceed as follows.
By considering any two of the equations (2), i.e.
$$C_i = H^\top C'_i H, \tag{3}$$
$$C_j = H^\top C'_j H, \tag{4}$$
and by multiplying the second equation from the left with the inverse of the first, one obtains
$$C_i^{-1} C_j = H^{-1} {C'_i}^{-1} C'_j H \tag{5}$$
and further
$${C'_i}^{-1} C'_j H - H C_i^{-1} C_j = 0, \tag{6}$$
which is a set of linear equations in the elements of $H$. That is, (6) is equivalent to
$$M_{ij} h = 0, \tag{7}$$
where $h$ is a $9\times 1$ vector containing the elements of $H$ and $M_{ij}$ is a $9\times 9$ matrix determined by the conics. Thus, the solution $h$ belongs to the null space of $M_{ij}$. However,
in general, the dimension of the null space of $M_{ij}$ is greater than 1 and, hence, the constraints (7) are not sufficient for determining the homography. Nevertheless, if $n \geq 3$,
one may choose another pair of equations to get another set of linear constraints. By considering all ordered pairs, one has in total $n(n-1)$ pairs of equations, and by stacking the
matrices $M_{ij}$ one obtains an overdetermined set of $9n(n-1)$ equations
$$M h = 0, \tag{8}$$
so that the null space is usually one-dimensional. It was observed in Paper I that, in
general, already three conic correspondences allow us to solve $h$ from (8).
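A minimal Python/NumPy sketch of the linear algorithm for $n \geq 3$ follows. It is written from the description above rather than from the code of Paper I, and the function name `homography_from_conics` is hypothetical. With $h$ holding the entries of $H$ in row-major order, $\operatorname{vec}(AH) = (A \otimes I)h$ and $\operatorname{vec}(HB) = (I \otimes B^\top)h$, so each ordered pair $(i, j)$ contributes the block $M_{ij} = A \otimes I - I \otimes B^\top$ with $A = {C'_i}^{-1}C'_j$ and $B = C_i^{-1}C_j$. No coordinate normalization is applied.

```python
import itertools
import numpy as np

def homography_from_conics(C, Cp):
    """Linear estimate of H from n >= 3 pairs C[i] <-> Cp[i] with
    C[i] = H^T Cp[i] H (Eq. 2). Returns H scaled to unit Frobenius norm."""
    I = np.eye(3)
    blocks = []
    for i, j in itertools.permutations(range(len(C)), 2):  # all ordered pairs
        A = np.linalg.solve(Cp[i], Cp[j])   # C'_i^{-1} C'_j
        B = np.linalg.solve(C[i], C[j])     # C_i^{-1} C_j
        # Eq. (6) in row-major vec form: (kron(A, I) - kron(I, B^T)) h = 0
        blocks.append(np.kron(A, I) - np.kron(I, B.T))
    M = np.vstack(blocks)                   # 9 n (n - 1) x 9, Eq. (8)
    _, _, Vt = np.linalg.svd(M)
    return Vt[-1].reshape(3, 3)             # minimizer of ||M h|| with ||h|| = 1

# Usage with exact synthetic data. Eq. (6) is homogeneous in H,
# so H_true need not be normalized to (det H)^2 = 1 here.
rng = np.random.default_rng(0)
H_true = rng.standard_normal((3, 3))
Cp = [M + M.T for M in rng.standard_normal((3, 3, 3))]   # three symmetric conics
C = [H_true.T @ Ci @ H_true for Ci in Cp]
H_est = homography_from_conics(C, Cp)
```

The estimate is recovered only up to scale and sign, so comparisons with a reference homography should normalize both.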
In practice, $M$ may have full rank due to measurement errors in the conic coefficients, but in this case the solution minimizing $\|Mh\|$ subject to $\|h\| = 1$ is obtained as the
right singular vector corresponding to the smallest singular value of $M$ (Hartley & Zisserman, 2000). In fact, as reported in Paper I and illustrated in Figure 2, the proposed
approach, which is based on the SVD of $M$, works in practice and is able to provide
a reasonable estimate of the homography also in cases where there is no exact
solution to (2) due to noise and measurement errors in the conic coefficients. However,
our experiments in Paper I additionally indicate that the solution provided by the SVD
depends on the choice of the Euclidean coordinate frame in which the conics are represented. This is expected, since $\|Mh\|$ is an algebraic error, which is not a geometrically
or statistically optimal cost function (Hartley & Zisserman, 2000). Thus, as discussed
in Paper I, it might be advantageous to fix the coordinate frames of the two planes by us-
ing some normalization procedure analogous to the normalized DLT algorithm, which
is described for linear point-based homography estimation in (Hartley & Zisserman,
2000).
Minimal case, n = 2
If there are only two conic correspondences, the linear SVD-based algorithm above
is not useful, because the equations of the form (6) do not give sufficient constraints
for H. However, in general, the original nonlinear matrix equations can be used to
determine the homography up to a four-fold ambiguity, i.e., there exist at most four
distinct projective transformations which satisfy the equations.
In fact, it is shown in Paper I that, when the conic coefficient matrices are real,
invertible and symmetric, the equations
$$C_1 = H^\top C'_1 H, \tag{9}$$
$$C_2 = H^\top C'_2 H, \tag{10}$$
are equivalent to the equations
$$R^\top R = I, \tag{11}$$
$$R^\top A R = B, \tag{12}$$
where $R$ is an unknown complex orthogonal matrix (bijectively related to $H$), and the
known complex symmetric matrices $A$ and $B$ are defined by the conics $C'_1$, $C'_2$ and $C_1$,
$C_2$, respectively. Hence, instead of equations (9) and (10), one may first confine oneself
to solving $R$ from (11) and (12), whereafter $H$ can be determined using $R$. As stated in
Paper I, the properties of complex symmetric matrices imply that the system (11), (12)
(and thereby (9), (10)) has a solution if and only if the matrices $A$ and $B$ are similar
(Horn & Johnson, 1985).
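One way to see the equivalence of (9), (10) with (11), (12) is sketched below. This is a reconstruction under an assumption, namely that each invertible real symmetric conic matrix is factored as $C = S^\top S$ with a complex $3\times 3$ matrix $S$ (for example, $S = \operatorname{diag}(\sqrt{d_k})\,Q^\top$ from the eigendecomposition $C = Q\operatorname{diag}(d_k)Q^\top$); the exact construction used in Paper I may differ in its details.

$$
\begin{aligned}
C_1 &= S_1^\top S_1, \qquad C'_1 = {S'_1}^\top S'_1, \\
R &= S'_1 H S_1^{-1}, \qquad A = {S'_1}^{-\top} C'_2 \, {S'_1}^{-1}, \qquad B = S_1^{-\top} C_2 \, S_1^{-1}, \\
R^\top R &= S_1^{-\top} \left( H^\top C'_1 H \right) S_1^{-1}, \\
R^\top A R &= S_1^{-\top} \left( H^\top C'_2 H \right) S_1^{-1}.
\end{aligned}
$$

Hence $R^\top R = I = S_1^{-\top} C_1 S_1^{-1}$ holds exactly when (9) holds, and $R^\top A R = B$ holds exactly when (10) holds. The map $H \mapsto R$ is inverted by $H = {S'_1}^{-1} R \, S_1$, and $A$ and $B$ are complex symmetric because $C'_2$ and $C_2$ are symmetric.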
When $A$ and $B$ are similar and have distinct eigenvalues, they are diagonalizable
and there are only finitely many solutions to (11), (12). The solutions can be obtained
by computing the eigenvalues and eigenvectors of $A$ and $B$, as described in Paper I.
In this case, there are in total eight solutions for $R$. However, these solutions provide
only four distinct homographies, because if $H$ is a solution, then so is $-H$, and both
represent the same transformation. One of the solutions represents the geometrically
correct transformation, but additional information is required to determine which one.
If $A$ and $B$ have a multiple eigenvalue, there may be infinitely many solutions to (11),
(12). In fact, in some degenerate cases the configuration of the two conics may be such
that it does not constrain all the degrees of freedom of the homography. For example,
this is the case if $C_1$ and $C_2$ are two concentric circles, because their correspondence
information is not sufficient for determining the homography, i.e. there are infinitely
many solutions since the circles are invariant to any rotation around their center. However, Paper I concentrates on the case where the eigenvalues of $A$ and $B$ are distinct and
the homography can be determined up to a four-fold ambiguity.
In practice, due to measurement noise, the eigenvalues of matrices $A$ and $B$ are never
exactly the same. Hence, typically there is no exact solution to (9), (10), but one would
still like to recover a homography estimate that is reasonably close to the underlying
true homography. Therefore, Paper I proposes an algorithm which uses the eigendecompositions of $A$ and $B$ for determining $R$, and thereby $H$, as if $A$ and $B$ were similar.
However, since $A$ and $B$ are not exactly similar in practice, the algorithm contains an
additional step in which the eigenvalues of $A$ and $B$ are ordered in such a manner that
the corresponding eigenvalues are close to each other. Thereafter, $R$ is determined from
the correspondingly ordered eigenvectors of $A$ and $B$. This strategy is theoretically justified because, in general, the eigenvalues of a diagonalizable matrix are stable under
small perturbations of the matrix elements (Horn & Johnson, 1985). Furthermore, the
experiments of Paper I show that the proposed algorithm is sufficiently stable to be used
with real measurement data, and usually provides a reasonable estimate of the homography, also in cases where no exact solution exists due to measurement errors in
the conic coefficients.
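The minimal-case procedure can be sketched in Python/NumPy as follows. This is a hypothetical reconstruction from the summary above, not the code of Paper I: it assumes the factorization $C = S^\top S$ with $S = \operatorname{diag}(\sqrt{d_k})\,Q^\top$ obtained from `numpy.linalg.eigh`, and it uses $R = S'_1 H S_1^{-1}$. Eigenvectors of the complex symmetric matrices are scaled so that $v^\top v = 1$ (a bilinear, not Hermitian, normalization), the eigenvalues are paired by trying all $3! = 6$ orderings and keeping the one with the smallest total mismatch, and fixing the first sign of $D$ leaves four sign patterns, i.e. the four candidate homographies. The function names are hypothetical, and the sketch assumes distinct eigenvalues and no eigenvector with $v^\top v = 0$.

```python
import itertools
import numpy as np

def factor_conic(C):
    """Complex S with S^T S = C for a real symmetric invertible C."""
    d, Q = np.linalg.eigh(C)
    return np.diag(np.sqrt(d.astype(complex))) @ Q.T

def symmetric_eig(M):
    """Eigenpairs of a complex symmetric M, eigenvectors scaled so V^T V = I."""
    w, V = np.linalg.eig(M)
    V = V / np.sqrt(np.sum(V * V, axis=0))   # bilinear (not Hermitian) norm
    return w, V

def homographies_from_two_conics(C1, C2, C1p, C2p):
    """Four candidate H with C_k ~= H^T C'_k H, k = 1, 2 (Eqs. 9-10)."""
    S1, S1p = factor_conic(C1), factor_conic(C1p)
    S1_inv, S1p_inv = np.linalg.inv(S1), np.linalg.inv(S1p)
    A = S1p_inv.T @ C2p @ S1p_inv
    B = S1_inv.T @ C2 @ S1_inv
    wA, VA = symmetric_eig(A)
    wB, VB = symmetric_eig(B)
    # Pair eigenvalues so that corresponding ones are as close as possible.
    perm = min(itertools.permutations(range(3)),
               key=lambda p: np.abs(wA[list(p)] - wB).sum())
    VA = VA[:, list(perm)]
    candidates = []
    for signs in itertools.product([1.0, -1.0], repeat=2):
        D = np.diag([1.0, *signs])           # -D would only give -H
        R = VA @ D @ VB.T                    # R^T R = I, R^T A R = B
        candidates.append(S1p_inv @ R @ S1)  # H = S1'^{-1} R S1
    return candidates

# Usage with exact synthetic data
rng = np.random.default_rng(1)
H_true = rng.standard_normal((3, 3))
C1p, C2p = [M + M.T for M in rng.standard_normal((2, 3, 3))]
C1, C2 = H_true.T @ C1p @ H_true, H_true.T @ C2p @ H_true
cands = homographies_from_two_conics(C1, C2, C1p, C2p)
```

The candidates are returned as complex arrays; with exact data, one of them should coincide with the true homography up to sign, and in practice additional information is needed to select it, as noted above.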
2.3 Discussion
In this chapter, Section 2.1 briefly described some general background to the field of
geometric computer vision, and then, Section 2.2 concentrated on a particular geometric
estimation problem, namely, on the computation of a planar homography from conic
correspondences. The algorithms of Paper I were described and put into the context
of related research literature. In the following, the contributions of Paper I are further
summarized, and their significance is discussed.
In summary, Paper I describes two algorithms for computing a planar homography
from correspondences of coplanar conics. The first algorithm can be used when there
are at least three conic correspondences and, in general, it provides a unique solution
up to scale. The second algorithm is for the minimal case of two conic correspondences
and, in general, it provides a solution up to a four-fold ambiguity. In addition, Paper
I investigates the stability of the proposed algorithms with respect to noise. Although
both algorithms involve algebraic error measures, which are not statistically optimal,
the experiments with synthetic and real data show that the algorithms are stable enough
to be used in practical computations. Further, the proposed algorithms are easy to im-
plement because they are based on linear algebra, and the required matrix factorizations
are available in most numerical libraries and mathematical software packages.
Because homography estimation is a classical problem and several approaches exist, one might think that the significance of Paper I is rather limited. However, we believe that, besides providing new aspects to the problem, the algorithms may also have practical significance. In fact, we are not aware of any other linear method for homography computation from fewer than seven conic correspondences, which is the minimum number of correspondences required by the method of Sugimoto (2000). Moreover, many recent approaches to wide baseline image matching utilize affine covariant region detectors, which are able to detect interest regions from images in such a manner that the pre-image of the region is invariant to changes in the viewpoint of the camera. These regions are typically represented by ellipses (Mikolajczyk et al., 2005), but often they are simply represented by their centroids and considered as points in geometry estimation (Matas et al., 2004; Chum et al., 2005). However, it might be advantageous to utilize the ellipses directly in homography estimation. For example, in principle, only two region correspondences would be sufficient for hypothesis generation in a RANSAC-based estimation framework, whereas point-based approaches require four correspondences. This fact might be useful for some applications, such as plane detection (Lourakis et al., 2002; Chum et al., 2005; Choi et al., 2007).
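The practical benefit of a smaller minimal sample can be quantified with the standard RANSAC estimate of the number of samples N needed to draw, with probability p, at least one outlier-free minimal sample of size s when the inlier ratio is w: N = log(1 − p) / log(1 − w^s). The sketch below simply evaluates this textbook formula; the function name and the example values p = 0.99 and w = 0.5 are illustrative choices, not figures reported in Paper I.

```python
import math


def ransac_iterations(inlier_ratio: float, sample_size: int,
                      confidence: float = 0.99) -> int:
    """Number of random minimal samples needed so that, with the given
    confidence, at least one sample contains only inliers."""
    p_all_inliers = inlier_ratio ** sample_size
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_all_inliers))


# Two ellipse (region) correspondences vs. four point correspondences.
print(ransac_iterations(0.5, 2))  # 17
print(ransac_iterations(0.5, 4))  # 72
```

The comparison assumes the same inlier ratio for ellipse and point matches, which need not hold in practice; it only shows how strongly the sample size enters the exponent.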
Finally, since the proposed techniques use algebraic error measures, they may not be suitable for precise homography estimation from noisy observations. However, given just a set of corresponding coplanar conics, it is not obvious how to choose a suitable error measure. In other words, it is not obvious what would be a reasonable model for the uncertainty in the conic coefficients. On the other hand, if the conics are defined by noisy edge points and the point-to-conic distances are assumed to be normally distributed, it would be a justified approach to determine the homography and the conics simultaneously, so that the conics are related by the recovered homography and the point-to-conic distances are minimized in both images. Nevertheless, this kind of approach would require iterative methods, which need a good initialization for both the homography and the conics (Sturm & Gargallo, 2007; Hartley & Zisserman, 2000). Hence, the noniterative methods of Paper I might be useful also in this case.
3 Geometric camera calibration
Geometric camera calibration is the process whereby the geometric properties of a cam-
era are determined. In other words, geometric camera calibration establishes the map-
ping between image points and scene points. Hence, at least implicitly, calibration
defines both the forward-projection, which maps a given 3-D scene point to its corre-
sponding 2-D image point, and the back-projection, which determines the set of 3-D
points which map to a given image point.
Geometric camera calibration is a prerequisite for image-based metrology, because
the geometric characteristics of the camera have to be quantified in order to measure
scene properties such as angles and length ratios from the images. A central issue in
camera calibration is the choice of a suitable camera model which is sufficiently generic
for modeling the particular camera in question, and which allows convenient and stable
calibration. In fact, camera calibration can usually be seen as a parameter estimation
problem, where the parameters of the camera model are estimated from image data.
Hence, in the following, we review different camera models as well as methods for
determining their parameters. In addition to surveying previous calibration literature,
we briefly describe the contributions of Papers II-IV.
3.1 Camera models
A photographic camera can be seen as a ray-based imaging device, where the image
points are associated with 3-D lines which represent the light rays that arrive at the
camera. Thus, one may consider the image as a collection of pixels, where each pixel
may observe light only from those scene points that lie on the ray associated with it. In
the most general setting, the rays are unconstrained and camera calibration is the pro-
cess where the coordinates of all the rays, corresponding to the pixels, are determined
in some common coordinate system (Grossberg & Nayar, 2001; Sturm & Ramalingam,
2004). Hence, in this case, the number of parameters to be estimated is large. On the
other hand, in more constrained settings, which are quite common in practice, cameras
may often be modeled by low-parameter models. This is typically the case for the class
of central cameras in which the projection rays are constrained to meet at a single point
in space.
In the following, we present a taxonomy of camera models according to the classification by Sturm (2005) and Ramalingam (2006). Thereafter, in Sections 3.1.2 and 3.1.3, we focus on central cameras because our work concentrates on them. The discussion below aims to give an overview of the subject; a more detailed survey of central camera models can be found in Paper III.
3.1.1 Taxonomy of camera models
Sturm (2005) and Ramalingam (2006) describe a three-level hierarchy of camera mod-
els, which consists of the following classes, listed in the order of decreasing generality:
(1) generic cameras, (2) axial cameras, and (3) central cameras. At the most general
level of the hierarchy, cameras are modeled by unconstrained sets of projection rays,
whereas in the more specific classes, the projection rays are constrained to go through a
single line (axial cameras) or a single point (central cameras). Some examples of cam-
eras that belong to the different classes are given below and are schematically illustrated
in Figure 3.
Fig. 3. Three catadioptric cameras which consist of a perspective camera and a mirror. The scene points $P_i$ are projected to points $p_i$ on the image plane. (a) A generic camera with a curved mirror. (b) An axial camera with a spherical mirror. All projection rays intersect a single line. (c) A central camera with a hyperbolic mirror. All projection rays go through the focal point $F$ of the mirror when the perspective camera is placed at the other focal point $F'$.
Generic model
A generic imaging model for photographic cameras consists of a set of pixels and pro-
jection rays so that each pixel of the image is associated with a unique ray, which is
represented by a 3-D line in the camera coordinate frame. In the general case the pro-
jection rays may be arbitrary and, hence, the model can describe essentially any camera
that captures light rays which travel along straight lines between the camera and the
observed opaque surfaces of the scene. In fact, here the concept of a camera is not
limited to a single physical device, but it also includes camera systems, which consist of several physical camera units and which may be considered as one generic camera by combining all the pixels of the individual images into a single image.
Examples of cameras which obey the generic imaging model, but do not satisfy the
additional constraints of axial or central cameras, include, for instance, multi-camera
systems consisting of cameras whose optical centers are not all collinear, oblique cameras where no two captured light rays intersect (Pajdla, 2002), non-central mosaics acquired under circular motion (Swaminathan et al., 2003), and various other non-central
cameras (Bakstein & Pajdla, 2001; Ponce, 2009).
Axial model
In an axial camera, all the projection rays go through a single line in space and this
line is called the camera axis. Examples of axial cameras include crossed-slit cameras
(Feldman et al., 2003; Gupta & Hartley, 1997), stereo camera rigs, and other multi-camera systems which consist of central cameras with collinear optical centers. In addition, a catadioptric camera consisting of a mirror and a central camera is an axial camera if the mirror is any surface of revolution and the camera center lies on the mirror's axis of revolution (Ramalingam, 2006; Chahl & Srinivasan, 1997; Ollis et al., 1999). For example, the catadioptric camera designs illustrated in Figures 3(b) and 3(c)
are axial cameras.
Central model
A central camera is a camera where all the projection rays go through a single point,
which is called the optical center of the camera. A common example of a central camera is the perspective camera. The perspective imaging model is tenable for most conventional cameras which have narrow-angle lenses. However, cameras equipped with wide-angle and fish-eye lenses can also be approximated by the central model, even though the perspective model is not suitable for them (Miyamoto, 1964). Further, there
are certain catadioptric camera configurations which have a single viewpoint (Baker &
Nayar, 1999). For example, the catadioptric system in Figure 3(c), which contains a
hyperbolic mirror and a perspective camera placed at the focal point of the mirror, is a
central camera.
3.1.2 Perspective cameras
The perspective camera model is applicable for most conventional photographic cam-
eras, and it is the most widely used and studied camera model in the literature (Hartley &
Zisserman, 2000). In the following, the perspective model is briefly described in order
to illustrate its limitations, and to motivate the flexible central camera model, proposed
in Paper II and summarized in Section 3.1.3 below.
Geometrically, a perspective camera is defined by a plane and a point in space. That
is, the plane is the image plane of the camera, and the point is the optical projection
center. As shown in Figure 4, the projection ray associated with a particular image point
is the line joining the image point and the projection center.
Mathematically, as images of lines are lines under perspective projection, a perspective camera is a linear mapping from a three-dimensional world coordinate frame, represented by $\mathbb{P}^3$, to a two-dimensional image coordinate frame, represented by $\mathbb{P}^2$. Hence, by using homogeneous coordinates, a perspective camera can be represented by a $3\times 4$ matrix (Hartley & Zisserman, 2000).
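As a concrete illustration of the 3×4 matrix representation, the sketch below builds the camera matrix in the usual factorized form P = K[R | t] (Hartley & Zisserman, 2000) and projects a scene point through homogeneous coordinates. The numerical values of K, R and t are arbitrary examples chosen for readability.

```python
import numpy as np


def camera_matrix(K, R, t):
    """Return the 3x4 perspective camera matrix P = K [R | t]."""
    return K @ np.hstack([R, t.reshape(3, 1)])


def project(P, X):
    """Project an inhomogeneous 3-D point X with the camera matrix P."""
    x_h = P @ np.append(X, 1.0)   # homogeneous image point
    return x_h[:2] / x_h[2]       # inhomogeneous pixel coordinates (u, v)


K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([0.0, 0.0, 5.0])
P = camera_matrix(K, R, t)
print(project(P, np.array([0.5, -0.25, 0.0])))  # [400. 200.]
```

Because the division by the third homogeneous coordinate removes any overall scale, P and any nonzero multiple of it describe the same camera.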
However, besides being a linear mapping from $\mathbb{P}^3$ to $\mathbb{P}^2$, a perspective camera can be considered as a direction sensor. From this point of view, it is illustrative to use inhomogeneous pixel coordinates and to represent the camera projection $\mathcal{P}$ as a composition of two non-linear functions, namely,

$$\mathbf{m} = \mathcal{P}(\mathbf{X}) = (\mathcal{P}_c \circ \mathcal{R})(\mathbf{X}), \tag{13}$$

where $\mathbf{m} = (u, v)^\top$ is the image point corresponding to the scene point $\mathbf{X}$, $\mathcal{P}_c$ defines the internal properties of the camera and $\mathcal{R}$ relates the camera pose to the world frame. Given the scene point $\mathbf{X}$ in the world coordinate frame, the function $\mathcal{R}$ provides the direction of the corresponding projection ray in the camera coordinate frame, whose origin is at the projection center. Hence, $\mathcal{R}$ involves a rigid transformation, which maps the scene point from the world frame to the camera frame. Formally, $\Phi = \mathcal{R}(\mathbf{X})$, where $\Phi = (\theta, \varphi)^\top$ and $\theta$, $\varphi$ are the spherical angle coordinates illustrated in Figure 4.
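For a perspective camera, the camera $Z$-axis is defined perpendicular to the image plane and $\mathcal{P}_c$ has the following form

$$\mathbf{m} = \mathcal{P}_c(\Phi) = (\mathcal{K} \circ \mathcal{F}_p)(\Phi), \tag{14}$$

where $\mathcal{F}_p$ is a central projection to the virtual image plane, illustrated by $V$ in Figure 4, i.e.,

$$\begin{pmatrix} x \\ y \end{pmatrix} = \mathcal{F}_p(\Phi) = \tan\theta \begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix}, \tag{15}$$

and $\mathcal{K}$ is an affine transformation from the virtual image plane to the real image plane,

$$\begin{pmatrix} u \\ v \end{pmatrix} = \mathcal{K}(x, y) = \begin{bmatrix} f & s f \\ 0 & \gamma f \end{bmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u_0 \\ v_0 \end{pmatrix}, \tag{16}$$

where $f$, $\gamma$, $s$, $u_0$ and $v_0$ are the internal camera parameters. There are only five degrees of freedom in $\mathcal{K}$ because, without loss of generality, one may fix the camera coordinate frame so that the $X$-axis is parallel to the $u$-axis of the pixel coordinate frame.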
Fig. 4. A perspective camera whose projection center is $O$. The scene point $\mathbf{X}$ is projected to point $\mathbf{m}$ on the image plane $I$. The plane $Z = 1$ is the virtual image plane $V$. The mapping from $V$ to $I$ is defined by the internal camera parameters.
It is evident from (15) and Figure 4 that the field of view of a perspective camera cannot exceed a hemisphere. Indeed, when $\theta$ approaches $90^\circ$, the image point approaches infinity. Thus, due to this singularity, the perspective camera model is not suitable for omnidirectional cameras.
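The composition (13)–(16) can be written out directly. The sketch below maps a scene point to the direction angles $\Phi = (\theta, \varphi)$ using an assumed world-to-camera rotation `R_wc` and translation `t_wc`, and then applies $\mathcal{F}_p$ and $\mathcal{K}$; the function names and all parameter values are illustrative assumptions. Evaluating $\mathcal{F}_p$ for $\theta$ close to $90^\circ$ exposes the singularity discussed above.

```python
import numpy as np


def direction_angles(X, R_wc, t_wc):
    """The function R in (13): rigid transform to the camera frame, then
    spherical angles theta (from the Z-axis) and phi (around it)."""
    Xc = R_wc @ X + t_wc
    theta = np.arctan2(np.hypot(Xc[0], Xc[1]), Xc[2])
    phi = np.arctan2(Xc[1], Xc[0])
    return theta, phi


def perspective_Fp(theta, phi):
    """Equation (15): central projection to the virtual plane Z = 1."""
    return np.tan(theta) * np.array([np.cos(phi), np.sin(phi)])


def affine_K(x, f, gamma, s, u0, v0):
    """Equation (16): affine map from the virtual to the real image plane."""
    A = np.array([[f, s * f], [0.0, gamma * f]])
    return A @ x + np.array([u0, v0])


params = dict(f=800.0, gamma=1.0, s=0.0, u0=320.0, v0=240.0)
theta, phi = direction_angles(np.array([0.5, -0.25, 0.0]),
                              np.eye(3), np.array([0.0, 0.0, 5.0]))
print(affine_K(perspective_Fp(theta, phi), **params))  # approx. [400. 200.]

# As theta approaches 90 degrees the virtual image point runs off to infinity.
for deg in (60.0, 85.0, 89.9):
    print(deg, np.linalg.norm(perspective_Fp(np.deg2rad(deg), 0.0)))
```

The first printout agrees with the homogeneous 3×4 formulation for the same camera, while the loop shows the radial coordinate $\tan\theta$ growing without bound.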
3.1.3 Central omnidirectional cameras
The class of central cameras includes a wider range of projection models than just
perspective projection. In particular, many real omnidirectional cameras, such as fish-
eye lens cameras (Miyamoto, 1964) and certain catadioptric cameras (Baker & Nayar,
1999), can be approximated as central cameras. The flexible central camera model,
proposed in Paper II, is described below, after a brief review of the related literature.
Previous work
Various kinds of central cameras have been described in the literature. In principle, all these cameras are covered by a model of the form (13), where $\mathcal{P}_c$ is a generic, possibly nonlinear mapping from the unit sphere of $\mathbb{R}^3$ to $\mathbb{R}^2$. Hence, a central camera maps the directions of incoming light rays to points in the image plane. The subset of the unit sphere where $\mathcal{P}_c$ is injective defines the upper limit for the field of view of a particular camera projection.
In practice, the mapping $\mathcal{P}_c$ is usually continuous. In addition, many central omnidirectional cameras employ a projection that is radially symmetric about an axis, i.e., the optical axis. Hence, instead of (14), $\mathcal{P}_c$ is often modeled in the following more generic form

$$\mathbf{m} = \mathcal{P}_c(\Phi) = (\mathcal{H} \circ \mathcal{F}_r)(\Phi), \tag{17}$$

where $\mathcal{F}_r$ is a radially symmetric projection to the virtual image plane, which is orthogonal to the optical axis, and $\mathcal{H}$ is a planar projective transformation from the virtual image plane to the real image plane. In detail,

$$\begin{pmatrix} x \\ y \end{pmatrix} = \mathcal{F}_r(\Phi) = r(\theta) \begin{pmatrix} \cos\varphi \\ \sin\varphi \end{pmatrix}, \tag{18}$$

where $r(\theta)$ defines the radially symmetric part of the projection and it is again assumed that the camera $Z$-axis is the optical axis, i.e., $\theta$ is the angle between the incoming light ray and the optical axis. Several models have been used for $r(\theta)$ in the literature. For example, instead of the perspective model $r(\theta) = \tan\theta$, the stereographic projection model $r(\theta) = 2\tan(\theta/2)$ has been used for some fish-eye lenses (Miyamoto, 1964).
In general, the image plane of a central camera is not necessarily orthogonal to the optical axis and, therefore, the mapping $\mathcal{H}$ in (17) is a planar projective transformation instead of an affine transformation. However, in the case of a perspective camera, one may always fix the optical axis to be perpendicular to the image plane because the pinhole projection, illustrated in Figure 4, is symmetric about any axis. Thus, the affine transformation model is sufficient in (14).
Many central cameras that appear in the literature can be represented by the model (17). The model covers various dioptric cameras where the lens system provides a single effective viewpoint and is radially symmetric about the optical axis. Examples of such cameras include wide-angle (Frank et al., 2007) and fish-eye lens cameras (Miyamoto, 1964), in addition to conventional narrow-angle lens cameras. Several different parametric expressions have been used for the radial projection function $r$ in (18). Many approaches take the perspective projection $r(\theta) = \tan\theta$ as a starting point and model the observed radial distortion with respect to it (Heikkilä, 2000; Zhang, 2000). One example of such an approach is the so-called division model (Bräuer-Burchardt & Voss, 2001; Fitzgibbon, 2001), which implicitly defines a relation between $r$ and $\theta$. However, due to the singularity of the perspective projection at $\theta = 90^\circ$, these approaches are not suitable for cameras whose field of view exceeds a hemisphere. Hence, other models have also been proposed, e.g. (Mičušík, 2004; Mičušík & Pajdla, 2006). Finally, in addition to the parametric approaches, a parameter-free method for determining the radial distortion was proposed in (Hartley & Kang, 2007).
Besides dioptric cameras, central catadioptric cameras, which consist of a camera and a mirror whose profile is a conic section (Baker & Nayar, 1999), can also be represented in the form (17). In fact, it has been shown (Geyer & Daniilidis, 2001) that the model (18), where $r(\theta)$ has the one-parameter form

$$r(\theta) = \frac{(l+1)\sin\theta}{l + \cos\theta}, \tag{19}$$

is a unified model for central catadioptric projections. The properties of central catadioptric image formation have been extensively studied, e.g. (Ying & Hu, 2004b; Barreto & Araujo, 2005). Further, it has been demonstrated that the radial projection model (19) is applicable also to many dioptric cameras, including conventional perspective cameras and fish-eye lens cameras (Ying & Hu, 2004a). For example, when $l = 0$, formula (19) gives the perspective projection, and the value $l = 1$ corresponds to the stereographic projection.
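The two special cases of (19) mentioned above can be checked numerically. The sketch below evaluates (19) with a hypothetical helper `r_unified` and compares it with $\tan\theta$ for $l = 0$ and with $2\tan(\theta/2)$ for $l = 1$; the latter equality is the half-angle identity $2\sin\theta/(1+\cos\theta) = 2\tan(\theta/2)$.

```python
import numpy as np


def r_unified(theta, l):
    """Radial projection (19) of the unified central catadioptric model."""
    return (l + 1.0) * np.sin(theta) / (l + np.cos(theta))


theta = np.linspace(0.0, np.deg2rad(80.0), 50)
# l = 0 reproduces the perspective projection r = tan(theta).
assert np.allclose(r_unified(theta, 0.0), np.tan(theta))
# l = 1 reproduces the stereographic projection r = 2 tan(theta / 2).
assert np.allclose(r_unified(theta, 1.0), 2.0 * np.tan(theta / 2.0))
# For l = 1 the projection stays finite beyond 90 degrees.
print(r_unified(np.deg2rad(120.0), 1.0))
```

For $l = 0$ the denominator vanishes at $\theta = 90^\circ$, reproducing the perspective singularity, whereas larger values of $l$ move the pole beyond the hemisphere.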
The proposed model
One of the contributions of this thesis is the generic central camera model presented in Paper II. This section motivates the proposed model and describes its key components.
Many omnidirectional camera devices are designed to have a single viewpoint and to be radially symmetric, as described in the previous section. However, a radially symmetric central camera is an idealized model, which is not always sufficient for precise modeling of real cameras. Hence, a common approach in photogrammetric camera calibration is to augment the idealized model with an additional distortion component, which models the deviations from precise radial symmetry (Slama, 1980). So, the idea
is to adhere to the assumption of a single viewpoint but to drop the strict requirement
of radial symmetry. The distortion model traditionally used in photogrammetry (Slama,
1980) is based on an analytic approximation for the geometric distortion in a decentered
lens system (Conrady, 1919; Brown, 1971). However, the problem with this distortion
model is that it is not suitable for cameras with a very wide field of view because it is
built on the perspective imaging model.
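In order to obtain a flexible camera model, which is applicable also to omnidirectional cameras, Paper II takes the expression (17) as a starting point and models $\mathcal{F}_r$ in a generic form, instead of assuming the restrictive perspective projection model. Thereafter, model (17) is complemented by replacing the radially symmetric part $\mathcal{F}_r$ with $\mathcal{D} \circ \mathcal{F}_r$, where the asymmetric distortion function $\mathcal{D}$ adds two distortion terms to $\mathcal{F}_r$. In detail, the virtual image point $\mathbf{x}$ corresponding to the direction angles $\Phi$ is given by

$$\mathbf{x} = (\mathcal{D} \circ \mathcal{F}_r)(\Phi) = r(\theta)\,\mathbf{u}_r(\varphi) + \Delta_r(\theta, \varphi)\,\mathbf{u}_r(\varphi) + \Delta_t(\theta, \varphi)\,\mathbf{u}_\varphi(\varphi), \tag{20}$$

where $\mathbf{u}_r(\varphi)$ and $\mathbf{u}_\varphi(\varphi)$ are the unit vectors in the radial and tangential directions,

$$r(\theta) = k_1\theta + k_2\theta^3 + k_3\theta^5 + k_4\theta^7 + k_5\theta^9, \tag{21}$$

and the asymmetric radial and tangential distortion terms are

$$\Delta_r(\theta, \varphi) = (\zeta_1\theta + \zeta_2\theta^3 + \zeta_3\theta^5)(\iota_1\cos\varphi + \iota_2\sin\varphi + \iota_3\cos 2\varphi + \iota_4\sin 2\varphi), \tag{22}$$

$$\Delta_t(\theta, \varphi) = (\eta_1\theta + \eta_2\theta^3 + \eta_3\theta^5)(\xi_1\cos\varphi + \xi_2\sin\varphi + \xi_3\cos 2\varphi + \xi_4\sin 2\varphi), \tag{23}$$

respectively. Thus, here the radial projection function (21) contains five parameters and both of the asymmetric distortion terms, (22) and (23), contain seven parameters.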
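A direct transcription of (20)–(23) is sketched below to make the structure of the model explicit. The function name, the argument layout and the example coefficient values are illustrative assumptions; the unit vectors are taken as $\mathbf{u}_r(\varphi) = (\cos\varphi, \sin\varphi)^\top$ and $\mathbf{u}_\varphi(\varphi) = (-\sin\varphi, \cos\varphi)^\top$.

```python
import numpy as np


def virtual_point(theta, phi, k, zeta, iota, eta, xi):
    """Evaluate x = (D o F_r)(Phi) following equations (20)-(23).

    k    : 5 coefficients of the odd radial series (21)
    zeta : 3 coefficients, iota : 4 coefficients of the radial term (22)
    eta  : 3 coefficients, xi   : 4 coefficients of the tangential term (23)
    """
    odd5 = np.array([theta, theta**3, theta**5, theta**7, theta**9])
    odd3 = odd5[:3]
    harmonics = np.array([np.cos(phi), np.sin(phi),
                          np.cos(2 * phi), np.sin(2 * phi)])

    r = np.dot(k, odd5)                                     # (21)
    delta_r = np.dot(zeta, odd3) * np.dot(iota, harmonics)  # (22)
    delta_t = np.dot(eta, odd3) * np.dot(xi, harmonics)     # (23)

    u_r = np.array([np.cos(phi), np.sin(phi)])     # radial unit vector
    u_phi = np.array([-np.sin(phi), np.cos(phi)])  # tangential unit vector
    return (r + delta_r) * u_r + delta_t * u_phi   # (20)


# Hypothetical coefficients: mild radial series, small tangential distortion.
k = [1.0, -0.02, 0.001, 0.0, 0.0]
print(virtual_point(0.8, 0.3, k,
                    zeta=[0.0] * 3, iota=[0.0] * 4,
                    eta=[0.01, 0.0, 0.0], xi=[1.0, 0.0, 0.0, 0.0]))
```

Setting all $\zeta$, $\iota$, $\eta$ and $\xi$ to zero reduces the function to the radially symmetric projection (18) with $r(\theta)$ from (21).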
In other words, Paper II proposes to model the radially symmetric term (21) as a truncated power series consisting of the odd powers of $\theta$. The power series model, including the even powers, has been used for fish-eye lenses before (Xiong & Turkowski, 1997). However, Paper II suggests dropping the even powers, based on the fact that any continuous odd function can be approximated arbitrarily well on a bounded interval by polynomials containing only odd powers. Because $\theta$ is always nonnegative and $r(0) = 0$, one may regard $r(\theta)$ as the restriction of an odd function. Thus, in principle, one may approximate any kind of continuous radially symmetric projection by increasing the number of terms in (21).
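The approximation argument can be checked numerically with a small experiment. The sketch below fits the first n odd-power terms of (21) by linear least squares to a known radially symmetric projection, here the stereographic model $2\tan(\theta/2)$ over $0$–$100^\circ$; the target function, angular range and sampling grid are arbitrary choices for illustration, and the helper name `fit_odd_series` is hypothetical.

```python
import numpy as np


def fit_odd_series(theta, r, n_terms=5):
    """Least-squares fit of r(theta) ~ k1*theta + k2*theta**3 + ...
    using only odd powers, as in (21)."""
    powers = 2 * np.arange(n_terms) + 1          # 1, 3, 5, ...
    M = theta[:, None] ** powers[None, :]        # design matrix
    k, *_ = np.linalg.lstsq(M, r, rcond=None)
    return k, powers


theta = np.linspace(0.0, np.deg2rad(100.0), 200)
target = 2.0 * np.tan(theta / 2.0)   # stereographic projection as an example
for n in (1, 3, 5):
    k, powers = fit_odd_series(theta, target, n)
    err = np.max(np.abs((theta[:, None] ** powers) @ k - target))
    print(n, err)
```

The maximum residual shrinks as terms are added, which is the behavior the series argument predicts for a smooth projection on a bounded range of angles.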
The expressions (22) and (23), suggested for the asymmetric distortion terms in Paper II, are separable in the variables $\theta$ and $\varphi$, and the dependence on each variable is again modeled as part of a mathematical series. Further, because the Fourier series of any $2\pi$-periodic continuous function converges in the $L^2$-norm and any continuous odd function can be approximated by a series of odd polynomials, one could model increasingly complex continuous distortions by simply adding more terms to (22) and (23). Hence, formulation (20) suggests a flexible and generic central camera model, where the number of parameters could be determined automatically in the calibration process by using, for example, cross-validation to avoid overfitting. However, automatic complexity selection of the camera model is beyond the scope of this thesis.
As described above, the generic camera model of Paper II can be represented in the form $\mathbf{m} = \mathcal{P}_c(\Phi)$, where the camera projection is a composition of three functions, i.e., $\mathcal{P}_c(\Phi) = (\mathcal{H} \circ \mathcal{D} \circ \mathcal{F}_r)(\Phi)$. Usually, one also needs the inverse model $\Phi = \mathcal{P}_c^{-1}(\mathbf{m})$. In our case, it is straightforward to compute the inverse of the homography $\mathcal{H}$, but $\mathcal{D} \circ \mathcal{F}_r$ cannot be inverted analytically due to its relatively complicated parametric form. Hence, the inverse mapping $\mathcal{P}_c^{-1}$ cannot be computed explicitly from $\mathcal{P}_c$. One alternative is to estimate it numerically in a discrete form by using some kind of interpolation process. For example, the direction angles $\Phi_k$ corresponding to each image pixel may be stored in a look-up table. However, Paper II additionally proposes a direct method to approximate $(\mathcal{D} \circ \mathcal{F}_r)^{-1}(\mathbf{x})$ for a given $\mathbf{x}$. The proposed approximation method is based on a local linearization of the asymmetric distortion function $\mathcal{D}$.
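For the radially symmetric part alone, back-projection can be sketched with a plain Newton iteration. This is a swapped-in technique for illustration, not the linearization method of Paper II, and it ignores the asymmetric term $\mathcal{D}$: the homography is inverted exactly, $\varphi$ is read off directly, and $r(\theta) = \rho$ from (21) is solved for $\theta$. The helper names, the coefficients and the matrix `H` below are hypothetical.

```python
import numpy as np


def r_odd(theta, k):
    """Radial projection (21) and its derivative for coefficients k."""
    powers = 2 * np.arange(len(k)) + 1
    r = np.sum(np.asarray(k) * theta ** powers)
    dr = np.sum(np.asarray(k) * powers * theta ** (powers - 1))
    return r, dr


def forward(theta, phi, k, H):
    """m = (H o F_r)(Phi) for the radially symmetric part only."""
    r, _ = r_odd(theta, k)
    x_h = H @ np.array([r * np.cos(phi), r * np.sin(phi), 1.0])
    return x_h[:2] / x_h[2]


def backward(m, k, H, iters=20):
    """Recover Phi = (theta, phi) from an image point m: invert H exactly,
    then solve r(theta) = rho by Newton's method."""
    x_h = np.linalg.solve(H, np.array([m[0], m[1], 1.0]))
    x, y = x_h[:2] / x_h[2]
    rho, phi = np.hypot(x, y), np.arctan2(y, x)
    theta = rho / k[0]                     # initial guess from the linear term
    for _ in range(iters):
        r, dr = r_odd(theta, k)
        step = (r - rho) / dr
        theta -= step
        if abs(step) < 1e-12:
            break
    return theta, phi


def ray_direction(theta, phi):
    """Unit vector of the projection ray in the camera frame."""
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi), np.cos(theta)])


H = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 480.0], [0.0, 0.0, 1.0]])
k = [1.0, 1.0 / 12, 1.0 / 120, 0.0, 0.0]   # close to 2*tan(theta/2)
m = forward(1.2, -0.7, k, H)
print(backward(m, k, H))
```

With monotonically increasing $r(\theta)$ the Newton iteration converges quickly from the linear initial guess; the asymmetric distortion would require an additional correction step.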
3.2 Calibration methods
Camera calibration is the process whereby the camera parameters are determined from
image data. Several calibration methods have been proposed in the literature. The conventional photogrammetric approach to camera calibration uses images
of objects, whose geometry is known (Heikkilä, 2000; Zhang, 2000). A typical cali-
bration object used in this context consists of one to three planes which contain visible
control points in known positions. In contrast, camera self-calibration methods do not
use objects of known geometry or, in general, any other metric information about the
scene (Hartley & Zisserman, 2000). Instead, they typically use point correspondences
over multiple views of a rigid scene, and utilize assumptions about the internal camera
parameters, e.g. constant internal parameters or unit aspect ratio (Faugeras et al., 1992;
Heyden & Åström, 1997), or about the camera motion, e.g. pure rotation or translation
between the views (Hartley, 1994; Moons et al., 1994).
In addition, besides the pure self-calibration methods, there are calibration methods
which utilize some prior information about the scene, but do not necessarily require
a calibration pattern with completely known world coordinates. For example, some
approaches use images of lines (Barreto & Araujo, 2005) or spheres (Ying & Hu, 2004b)
or assume that the observed scene is planar (Tardif et al., 2007).
An example of a photogrammetric calibration process is illustrated in Figure 5
where a camera observes a planar calibration pattern, which contains circular control
points. Given several images of the pattern, the calibration is performed by fitting a cam-
era model to the observations, which are the measured positions of the control points
in the calibration images. In the case of central cameras, the camera images can be
rectified after the calibration by back-projecting them onto a cube whose centroid is at
the camera center (Tardif et al., 2007). If the calibration is accurate, the back-projected
images of scene lines are straight on the faces of the cube, as illustrated in Figure 6.
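The remapping step behind such rectified images can be sketched for one face of the cube. Given any calibrated central model written as a function `project(theta, phi)` that returns image coordinates (a hypothetical interface), the sketch computes, for every sample on the face $Z = 1$ of a cube centred at the viewpoint, the image location from which the rectified pixel would be resampled. The other five faces follow by rotating the sampled directions, and interpolation of the image values is omitted.

```python
import numpy as np


def front_face_directions(n):
    """Sample the face Z = 1 of a cube centred at the viewpoint on an n x n
    grid over [-1, 1]^2 and return the direction angles (theta, phi)."""
    a = np.linspace(-1.0, 1.0, n)
    X, Y = np.meshgrid(a, a)                  # face coordinates
    theta = np.arctan2(np.hypot(X, Y), 1.0)   # angle from the +Z axis
    phi = np.arctan2(Y, X)
    return theta, phi


def rectification_map(n, project):
    """For each face sample, return the source image point given by a
    calibrated central camera model project(theta, phi) -> (u, v)."""
    theta, phi = front_face_directions(n)
    return project(theta, phi)


def perspective(theta, phi):
    """Example model: ideal perspective camera with identity intrinsics."""
    r = np.tan(theta)
    return r * np.cos(phi), r * np.sin(phi)


u, v = rectification_map(5, perspective)
print(np.round(u, 3))
```

With an ideal perspective model and identity intrinsics the map reduces to the face coordinates themselves, which is a convenient sanity check; with a fish-eye or catadioptric model the same map undoes the distortion.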
Fig. 5. A photogrammetric camera calibration setup where a camera views a planar calibration pattern displayed on a flat screen.
Fig. 6. Two images of a calibration pattern (left) and their undistorted versions
(right). The original images were taken with a catadioptric camera (top) and a wide-
angle lens camera (bottom), and the calibration was performed by the method of
Paper III. The undistorted images were computed by back-projecting the original
images onto a cube whose centroid was at the estimated viewpoint of the camera.
Typically, the camera calibration process involves nonlinear optimization, where the
camera parameters are estimated by minimizing a cost function, which quantifies the
model fitting error. In photogrammetric calibration, the control point coordinates are
known in the world frame and, hence, the minimization needs to be done only over
the internal and external camera parameters, where the latter relate the camera pose
to the world frame. However, in self-calibration, the 3-D coordinates of the observed
points are unknown, and they have to be estimated simultaneously with the camera
parameters. In both cases, the nonlinear minimization requires a good initial guess for
the parameters in order to avoid local minima. Hence, much of the research in camera
calibration is devoted to developing direct methods for parameter recovery. In fact,
many calibration techniques are based on geometric quantities which are invariant to
camera pose and depend only on the internal camera parameters (Geyer & Daniilidis,
2002; Ying & Hu, 2004b). For example, the image of the absolute conic, which is often
utilized in camera self-calibration (Pollefeys & Van Gool, 1997; Triggs, 1997; Barreto
& Araujo, 2005), is such an invariant.
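The pose invariance of the image of the absolute conic can be made concrete with the well-known plane-based constraints of Zhang (2000): for a homography $H = K[\mathbf{r}_1\ \mathbf{r}_2\ \mathbf{t}]$ from a calibration plane, the conic $\omega = K^{-\top}K^{-1}$ satisfies $\mathbf{h}_1^\top\omega\,\mathbf{h}_2 = 0$ and $\mathbf{h}_1^\top\omega\,\mathbf{h}_1 = \mathbf{h}_2^\top\omega\,\mathbf{h}_2$ for every pose of the plane. The sketch verifies this numerically for randomly drawn poses; the intrinsic values and helper names are arbitrary illustrative choices.

```python
import numpy as np


def random_rotation(rng):
    """Random 3x3 rotation from the QR decomposition of a Gaussian matrix."""
    Q, R = np.linalg.qr(rng.normal(size=(3, 3)))
    Q = Q @ np.diag(np.sign(np.diag(R)))   # make the factorization unique
    if np.linalg.det(Q) < 0:               # ensure a proper rotation
        Q[:, 0] = -Q[:, 0]
    return Q


K = np.array([[800.0, 2.0, 320.0],
              [0.0, 780.0, 240.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)
omega = K_inv.T @ K_inv   # image of the absolute conic, depends on K only

rng = np.random.default_rng(1)
for _ in range(3):
    Rm = random_rotation(rng)
    t = rng.normal(size=3) + np.array([0.0, 0.0, 10.0])
    H = K @ np.column_stack([Rm[:, 0], Rm[:, 1], t])   # plane Z = 0 -> image
    h1, h2 = H[:, 0], H[:, 1]
    print(h1 @ omega @ h2, h1 @ omega @ h1 - h2 @ omega @ h2)
```

Both printed quantities vanish up to rounding for every pose, so each view of the plane yields two constraints on $\omega$, and hence on $K$, independently of where the plane was placed.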
The papers of this thesis address both photogrammetric calibration and self-calibration. Paper II describes a plane-based calibration procedure for
central cameras, and Paper IV studies the self-calibration of radially symmetric central
cameras from point correspondences under general camera motion. In the following,
calibration methods from the literature are introduced and categorized. Also, the meth-
ods of Papers II and IV are briefly summarized in this context.
3.2.1 Photogrammetric calibration
Perspective cameras
Photogrammetric calibration of perspective cameras is a well-studied topic, and many
approaches exist. The approaches can be divided into methods which use non-coplanar
calibration objects, and those which use planar patterns. In general, if a non-coplanar
calibration object is used, the camera parameters can be determined from a single view
by using the classical Direct Linear Transform (DLT) method (Abdel-Aziz & Karara,
1971; Sutherland, 1974; Hartley & Zisserman, 2000). On the other hand, in the case
of a planar calibration object, several views are needed and the camera parameters can
be recovered from the planar homographies which relate the calibration plane to its
images (Sturm & Maybank, 1999; Zhang, 2000). Overall, the initial estimates of camera parameters provided by a direct method, such as the DLT method or the method of Zhang (2000), should finally be refined by minimizing a geometrically justified error function. The
cost function typically used in the minimization is the sum of squared distances between
the measured and modeled control point projections (Hartley & Zisserman, 2000).
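The planar homographies used by plane-based methods are commonly estimated from control-point correspondences with the normalized DLT algorithm (Hartley & Zisserman, 2000). The sketch below is a minimal version of that algorithm, with isotropic normalization and an SVD null-space solution, assuming at least four correspondences in general position; the helper names are hypothetical. It provides only a linear initial estimate, which would then be refined by minimizing the geometric error described above.

```python
import numpy as np


def normalize(pts):
    """Similarity transform T moving the centroid to the origin and scaling
    the mean distance to sqrt(2); returns T and the transformed points."""
    c = pts.mean(axis=0)
    s = np.sqrt(2.0) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0.0, -s * c[0]],
                  [0.0, s, -s * c[1]],
                  [0.0, 0.0, 1.0]])
    pts_h = np.column_stack([pts, np.ones(len(pts))]) @ T.T
    return T, pts_h


def homography_dlt(src, dst):
    """Estimate H with dst ~ H src from n >= 4 point pairs (n x 2 arrays)."""
    T1, p = normalize(src)
    T2, q = normalize(dst)
    rows = []
    for (x, y, w), (u, v, t) in zip(p, q):
        # Two independent rows of the cross-product constraint q x (H p) = 0.
        rows.append([0, 0, 0, -t * x, -t * y, -t * w, v * x, v * y, v * w])
        rows.append([t * x, t * y, t * w, 0, 0, 0, -u * x, -u * y, -u * w])
    _, _, Vt = np.linalg.svd(np.asarray(rows))
    Hn = Vt[-1].reshape(3, 3)              # null vector of the design matrix
    H = np.linalg.inv(T2) @ Hn @ T1        # undo the normalizations
    return H / H[2, 2]


# Synthetic check with a known homography and noise-free correspondences.
H_true = np.array([[1.2, 0.1, 5.0], [-0.05, 0.9, -3.0], [1e-3, 2e-3, 1.0]])
src = np.array([[0, 0], [100, 0], [100, 100], [0, 100], [50, 30], [20, 80]],
               dtype=float)
dst_h = np.column_stack([src, np.ones(len(src))]) @ H_true.T
dst = dst_h[:, :2] / dst_h[:, 2:]
print(np.round(homography_dlt(src, dst), 4))
```

Normalizing both point sets before building the linear system keeps the design matrix well conditioned, which matters when pixel coordinates are in the hundreds or thousands.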
Central cameras
Most of the early research in camera calibration focused on the perspective imaging
model, even though extremely wide-angle lenses have been available for a long time (Miyamoto, 1964). One reason for this might be that the low sensor resolution of early digital cameras limited the usefulness of very wide-angle optics. Nevertheless, during the past decade, there has been an increase in applications and research efforts within omnidirectional vision (Daniilidis & Klette, 2006). One such effort is our work in Paper II, which aims to present a practical plane-based approach for precise photogrammetric calibration of generic central cameras. In addition to Paper II, other plane-based methods have also been proposed for omnidirectional cameras (Scaramuzza et al., 2006; Mei & Rives, 2007; Ramalingam & Sturm, 2008).
The calibration approach of Paper II is based on viewing a planar pattern, which
contains control points in known positions. Given the locations of observed control
points in several views, the camera parameters are determined by a multi-step procedure
which requires the user to provide a rough initialization for a few internal parameters,
and then iteratively refines all the parameters. Although the procedure is not completely
automatic, it is relatively robust against bad initialization and, in practice, satisfactory
initial values can always be provided. Further, due to the generality of the camera model
used, the proposed approach allows convenient calibration of various types of central
cameras, as shown by the experiments in Papers II and III.
Noncentral cameras
Although most omnidirectional cameras are strictly speaking noncentral, many of them
can be well approximated by the central model. In particular, if the camera is relatively
far from the viewed objects, the central model may be useful even for such cameras that
are clearly noncentral when observed from a closer distance (Swaminathan & Nayar,
2000). Also, the assumption of a single viewpoint may be necessary for stabilizing
the initial calibration of cameras that are only slightly noncentral (Ramalingam et al., 2005; Mičušík & Pajdla, 2006). However, there are cases where the central camera model is not sufficient. For example, this is the case with many catadioptric cameras (Chahl & Srinivasan, 1997; Swaminathan et al., 2004). Hence, calibration methods have also been proposed for completely generic cameras, where the pixels' projection rays are unconstrained (Grossberg & Nayar, 2001; Sturm & Ramalingam, 2004). In contrast to methods which use low-parameter models, these approaches require that all the calibrated pixels are matched to the calibration object in several images. Thus, dense matching is required. Finally, the calibration of radially symmetric noncentral cameras is addressed in (Tardif et al., 2009), and the case of axial cameras is discussed in (Ramalingam et al., 2006c).
3.2.2 Self-calibration
Perspective cameras
Self-calibration of perspective cameras has been widely studied since the early work
by Faugeras et al. (1992). Typically, the problem setting is such that one has an uncalibrated projective multi-view reconstruction, computed from a set of point correspondences over the views, and the task is to utilize constraints on the camera parameters in order to determine the metric properties of the cameras and the scene. The early papers on self-calibration assume that the internal camera parameters are constant, e.g. (Faugeras et al., 1992; Hartley, 1994), whereas some later works use less restrictive constraints (Heyden & Åström, 1998; Pollefeys et al., 1999). Further, besides the case of generic camera motion, self-calibration methods have been proposed for different types of degenerate motions. For example, a method for planar motions is described by Armstrong et al. (1996), and the case of a rotating camera is discussed by Hartley (1994). Overall, the literature related to self-calibration of perspective cameras is broad, and further references can be found in (Hartley & Zisserman, 2000).
Central cameras
Various self-calibration methods have been proposed for central cameras in
recent years, e.g. (Mičušík & Pajdla, 2006; Claus & Fitzgibbon, 2005; Thirthala &
Pollefeys, 2005; Li & Hartley, 2006; Ramalingam et al., 2006b; Tardif et al., 2006,
2007). These approaches do not require a completely known calibration pattern, but
many of them still utilize some prior knowledge about the scene, such as straight lines
or coplanar points, or about the camera, such as the location of the distortion center. In
addition, the methods are often limited in robustness or generality, or they do
not provide a full metric calibration.
Hence, despite recent research efforts, there is still a need for robust and generic
self-calibration methods that are suitable for central cameras under general
conditions, i.e., under general camera motion in an unknown scene. In this thesis, Paper
IV addresses the problem by proposing a method which uses two-view point
correspondences and estimates the parameters of a general radially symmetric camera model
by minimizing the angular error. Thus, the key idea in Paper IV is to use the exact
expression for the angular image reprojection error (Oliensis, 2002), and to write the
self-calibration problem as a small-scale optimization problem where the cost function
depends only on the camera parameters. Further, since the cost function appears
to have many local minima, a multi-step approach is proposed for the minimization.
Although the approach does not completely remove the problem of local minima, the
experiments in Paper IV show that successful self-calibration can be achieved if
reasonable constraints are provided for the camera parameters.
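To make the notion of an angular error concrete, the sketch below measures, in a single image, the angle between the ray back-projected from a pixel and the direction of a known 3D point. It assumes a hypothetical radially symmetric model r = k1·θ + k2·θ³, with pixel scales mu, mv and principal point (u0, v0); the function names and parameter values are illustrative only. It is not the two-view expression of Oliensis (2002) used in Paper IV, which additionally eliminates the unknown structure and motion; it only illustrates measuring errors between rays instead of in pixels.

```python
import numpy as np

# Hypothetical radially symmetric model: r(theta) = k1*theta + k2*theta**3,
# where theta is the angle between the incoming ray and the optical axis.

def project(X, k1, k2, mu, mv, u0, v0):
    """Project 3D points X (N x 3, camera frame) to pixels (N x 2)."""
    X = np.atleast_2d(np.asarray(X, dtype=float))
    theta = np.arccos(X[:, 2] / np.linalg.norm(X, axis=1))
    phi = np.arctan2(X[:, 1], X[:, 0])
    r = k1 * theta + k2 * theta**3
    return np.column_stack([mu * r * np.cos(phi) + u0, mv * r * np.sin(phi) + v0])


def backproject(m, k1, k2, mu, mv, u0, v0, iters=20):
    """Map pixels m (N x 2) to unit rays by solving r(theta) = r with Newton's method."""
    m = np.atleast_2d(np.asarray(m, dtype=float))
    x = (m[:, 0] - u0) / mu
    y = (m[:, 1] - v0) / mv
    r = np.hypot(x, y)
    phi = np.arctan2(y, x)
    theta = r / k1  # starting guess from the linear term
    for _ in range(iters):
        theta = theta - (k1 * theta + k2 * theta**3 - r) / (k1 + 3 * k2 * theta**2)
    s = np.sin(theta)
    return np.column_stack([s * np.cos(phi), s * np.sin(phi), np.cos(theta)])


def angular_error(m, X, params):
    """Angle (radians) between each back-projected ray and the direction to X."""
    rays = backproject(m, *params)
    d = X / np.linalg.norm(X, axis=1, keepdims=True)
    # atan2(|a x b|, a . b) stays accurate for very small angles, unlike arccos
    return np.arctan2(np.linalg.norm(np.cross(rays, d), axis=1), np.sum(rays * d, axis=1))
```

The Newton iteration is adequate here only because r(θ) is increasing over the angles used; a model whose r(θ) is not monotonic would need a guarded solver.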
Noncentral cameras
Self-calibration of generic noncentral cameras is a difficult and little-studied topic.
This thesis concentrates on central cameras, but the paper by Ramalingam
et al. (2006b), for example, deals with the self-calibration of radially symmetric
noncentral cameras. Their method utilizes dense image matches between several views
of an unknown planar scene. However, without the assumption of radial symmetry, the
self-calibration of noncentral cameras is probably a very difficult problem because little
information is available, especially if both the scene structure and camera motion are
completely unknown (Ramalingam, 2006).
3.3 Discussion
The camera model and calibration method presented in Paper II are perhaps the most
practically useful contributions of this thesis in the area of camera calibration. The
proposed plane-based calibration approach allows precise modeling of various kinds of
real cameras, as illustrated by the examples in Paper III. In addition to
photogrammetric calibration, the thesis also touches upon camera self-calibration.
However, despite some encouraging results, the local minima problem discussed
in Paper IV is not yet fully solved and, hence, further improvements are needed before
the suggested self-calibration approach is ready for practical use in applications. Indeed,
as also pointed out by Ramalingam & Sturm (2008), self-calibration of generic cameras
is a challenging topic with many unresolved problems.
The plane-based calibration algorithm proposed in Paper II has been implemented as a
Matlab toolbox, which is publicly available and includes a semi-automatic procedure
for localizing control points in images of planar dot patterns. The toolbox is an
additional practical contribution of the thesis, and it aims to facilitate the use of
omnidirectional cameras in metrology applications. In fact, there has previously been a lack of
publicly available, easy-to-use calibration tools for omnidirectional cameras. However,
the situation has recently improved as several new calibration programs have been made
available. For example, implementations of the methods described in (Barreto
& Araujo, 2005), (Scaramuzza et al., 2006), (Mei & Rives, 2007), and (Tardif et al.,
2006) are currently provided by the respective authors. Nevertheless, most of the
available calibration tools assume a fully radially symmetric central camera model,
whereas our toolbox additionally allows the modeling of asymmetric distortions.
Examples of application areas where calibrated omnidirectional cameras may be
utilized include structure from motion (Mouragnon et al., 2009), three-dimensional
modeling (Lhuillier, 2008b), panoramic imaging (Xiong & Turkowski, 1997), medical
imaging (Stehle et al., 2007) and image-based lighting in computer graphics (Fuchs
et al., 2007; Pronk, 2006). For instance, in (Stehle et al., 2007), the calibration approach
of Paper II is used for an endoscopic camera and, in (Pronk, 2006), it is used for a
fish-eye lens camera which captures lighting data for photorealistic computer graphics.
Overall, the various innovative and relatively recent applications of omnidirectional
cameras indicate that calibration of generic cameras is a relevant research topic, and the
results can often be directly utilized in practice.
4 Image-based scene reconstruction
Image-based scene reconstruction is a classical topic in geometric computer vision. In
this thesis, it can be seen as a high-level topic which connects the two central themes
of the thesis, geometric camera calibration and image matching. In fact, the goal of
using cameras as measurement devices for scene reconstruction has been an important
motivation for much of the work in geometric camera calibration discussed in Chapter 3.
On the other hand, image matching, which is the topic of Chapter 5, is an essential task
in image-based scene reconstruction, as well as in many other problems of computer
vision.
In this chapter, the problem area of image-based scene reconstruction is
briefly discussed as a whole, and the related sewer imaging application of Paper V
is introduced. The approach for modeling sewer pipes from video, outlined
in Section 4.2, can be considered an application-specific example of a multi-view
reconstruction pipeline, which is discussed on a general level in Section 4.1.
4.1 Brief review of related work
This section gives a brief introduction to the research literature whose results
are applied in the sewer imaging system of Paper V. Because image-based scene
reconstruction is a wide topic, only a small part of the previous research is
covered here.
The standard pipeline for multi-view reconstruction typically contains three main
stages, namely, sparse matching, structure from motion, and image-based modeling
(Pollefeys, 1999; Nistér, 2001; Gargallo, 2008; Martinec, 2008). In the sparse matching
stage, a number of salient features, such as edges or corners, are detected in
each image and then matched between the different images. Thereafter, the structure
from motion stage uses the feature correspondences to compute the camera poses and
the three-dimensional structure of the features. Finally, given the camera poses, the
image-based modeling stage computes a dense reconstruction of the scene. Here it is
assumed that the structure from motion stage includes self-calibration where necessary,
so that the pipeline outlined above is consistent with the one presented in Figure 1.
The sewer modeling approach of Paper V mainly follows the standard three-stage
pipeline, but the usual generic dense reconstruction stage is replaced by a model-based
approach, where a parametric tubular model is directly fitted to the reconstructed in-
terest points. An overview of the sewer imaging system is presented in Section 4.2,
but first, the following subsections list some general background literature on structure
from motion and image-based modeling.
4.1.1 Structure from motion
Structure from motion, i.e. the simultaneous computation of camera motion and sparse
scene structure from multiple images, is a well-studied problem, and it is extensively
described in (Hartley & Zisserman, 2000) and (Faugeras et al., 2001), for example. In
this thesis, we mainly confine ourselves to applying known structure from motion
methods in the sewer imaging framework. However, the field is still the subject of
active methodological research, as briefly discussed below.
Structure from motion methods require feature correspondences between the
images as their input. In the case of continuous video sequences, feature extraction
and matching are usually relatively straightforward (Mouragnon et al., 2009), and
powerful methods have also emerged for reliable matching of wide baseline images
(Mikolajczyk et al., 2005; Lowe, 2004). Given the feature correspondences, the
standard approach to structure from motion is to first determine the two-view geometry
between neighboring pairs of views, then merge the local sparse reconstructions
together, either incrementally or hierarchically, and finally refine the reconstruction by
global bundle adjustment (Hartley & Zisserman, 2000). The basic principles of these
methods have been known for a long time, but recent research has focused on efficiency
in order to achieve real-time performance (Nistér, 2005; Pollefeys et al.,
2008; Mouragnon et al., 2009). Further, besides fast video-based reconstruction
systems, there are recent approaches that concentrate on robust scene reconstruction from
unorganized image collections (Martinec, 2008; Snavely et al., 2008).
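For calibrated central cameras, every image point can be back-projected to a unit ray, and the two-view geometry of a neighboring pair of views can then be described by an essential matrix E with qᵢᵀ E pᵢ = 0, independently of the lens type. The sketch below shows the textbook linear eight-point estimate of E from such rays. It is an illustrative assumption, not the implementation of Paper V: it omits robust sampling to reject mismatches as well as nonlinear refinement, and the function names are hypothetical.

```python
import numpy as np

def skew(t):
    """Cross-product matrix: skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def essential_from_rays(p, q):
    """Linear (eight-point) estimate of E with q_i^T E p_i = 0.

    p, q: N x 3 arrays (N >= 8) of back-projected rays in views 1 and 2.
    """
    # q^T E p = sum_ij q_i p_j E_ij, i.e. one linear equation in the 9 entries of E
    M = np.einsum('ni,nj->nij', q, p).reshape(len(p), 9)
    _, _, Vt = np.linalg.svd(M)
    E = Vt[-1].reshape(3, 3)  # least-squares null vector of M
    # Enforce the essential-matrix structure: singular values (s, s, 0)
    U, s, Vt = np.linalg.svd(E)
    sigma = 0.5 * (s[0] + s[1])
    return U @ np.diag([sigma, sigma, 0.0]) @ Vt
```

With the convention X₂ = R X₁ + t, the true matrix is E = skew(t) @ R, recovered only up to scale and sign; rotation and translation would then be extracted from E before merging and bundle adjustment.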
In addition to such issues as computational efficiency and robustness to varying
imaging conditions, the development of generic methods which are applicable to
various types of cameras has been an active research topic in structure from motion
(Ramalingam et al., 2006a; Mouragnon et al., 2009). This line of research is also related
to the self-calibration of generic cameras, which was discussed in the previous chapter. For
example, some studies attempt to generalize the concept of a fundamental
matrix to non-perspective cameras (Claus & Fitzgibbon, 2005; Barreto & Daniilidis,
2006; Sturm & Barreto, 2008).
4.1.2 Image-based modeling
The term image-based modeling refers to the task of computing a three-dimensional
geometric model of a scene from multiple images which are taken from known camera
viewpoints. In general, this task is also called multi-view stereo reconstruction, and it
has been extensively studied (Seitz et al., 2006).
Multi-view stereo algorithms aim to produce a dense scene reconstruction which is
visually compatible with the input images. Visual compatibility is usually measured by
some photo-consistency measure, such as normalized cross-correlation of
corresponding image patches (Seitz et al., 2006; Furukawa & Ponce, 2007). Fully automatic
multi-view reconstruction of complex scenes is a challenging task, especially if the input
images are taken under varying lighting conditions, and the scene contains partially
occluded objects or low-textured surfaces. However, the multi-view stereo problem
has been under active study in recent years, e.g. (Strecha, 2007; Gargallo, 2008),
and there are currently systems capable of producing accurate, dense and
photorealistic reconstructions under challenging real-world conditions (Goesele et al., 2007;
Furukawa & Ponce, 2007).
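The normalized cross-correlation measure mentioned above can be written in a few lines. The sketch assumes grayscale patches of equal size; subtracting the means and normalizing by the norms makes the score invariant to gain and offset changes in intensity between the two patches. Returning 0 for textureless patches is a design choice of this sketch, not part of the cited methods.

```python
import numpy as np

def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized patches, in [-1, 1]."""
    a = np.asarray(a, dtype=float).ravel()
    b = np.asarray(b, dtype=float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
    # Textureless (constant) patches carry no evidence; score them as 0
    return float(np.dot(a, b) / denom) if denom > eps else 0.0
```

A score near 1 indicates a photo-consistent pair of patches, while scores near 0 or below indicate unrelated or inverted content.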
In addition to generic multi-view stereo methods, which are applicable to arbitrarily
shaped objects, there are works that concentrate on modeling certain specific object
classes, such as faces (Xin et al., 2005), architectural scenes (Furukawa et al., 2009) or
human bodies (Starck & Hilton, 2003). In this thesis, the focus is on modeling sewer
pipes.
4.2 Application: Modeling sewer pipes from video
4.2.1 Motivation
The proper functioning of sewerage systems is essential for modern infrastructure.
However, in many countries sewer networks are deteriorating due to their age, and their
restoration and maintenance require significant investments (Kuntze & Haffner, 1998;
Cooper et al., 1998; Chae & Abraham, 2001). Thus, for economic reasons, there have
been attempts to develop automatic methods for condition assessment of sewer pipes.
Traditionally, the condition of sewer pipes is evaluated by visual inspection of video
sequences recorded by a camera moving inside the pipe. However, manual
inspection has some disadvantages, such as subjectivity and high costs, and hence
several approaches have been suggested for automating sewer surveys. For
instance, the idea of reconstructing the three-dimensional structure of sewer pipes from
survey videos was introduced by Cooper et al. (1998), and automatic detection of pipe
joints and surface cracks in digital sewer images was studied by Xu et al. (1998),
Chae & Abraham (2001) and Sinha & Fieguth (2006).
In this thesis, Paper V describes a method for modeling sewer pipes from survey
videos. An overview of the method is described in the following section.
4.2.2 Overview of the approach
The system described in Paper V recovers the interior shape of a sewer pipe from a
survey video which is obtained by moving a precalibrated fish-eye lens camera and a
light source through the pipe. The reconstruction approach is based on tracking interest
points across successive video frames, and computing their three-dimensional arrange-
ment by structure from motion techniques. The structure from motion stage is followed
by a modeling stage, which robustly estimates the shape of the pipe by fitting a paramet-
ric tubular model to the reconstructed points. Hence, there are two main stages in the
proposed measurement approach, structure recovery and modeling, and they are briefly
summarized below. The outline of the approach is illustrated in Figure 7, where a typ-
ical inspection system is shown on the left, and the obtained tubular model is on the
right.
Fig. 7. An illustration of the sewer modeling system. Left: A video sequence is
acquired by a remote controlled camera which moves inside a sewer pipe. Middle:
Structure and motion estimation is based on interest points, which are extracted
from the textured surface of the pipe and tracked across successive video frames.
Right: A tubular surface model is fitted to the reconstructed interest points.
Revised from Paper V. © 2008 Springer
Structure recovery
The structure recovery of sewer pipes is based on a standard structure from motion ap-
proach (Fitzgibbon & Zisserman, 1998), which is adapted to fish-eye image sequences.
Thus, the proposed method requires that the inner surface of the pipe has some texture
so that interest points can be extracted and reconstructed from the images. In Paper
V, the interest points were detected by the Harris corner detector (Harris & Stephens,
1988), and the experiments with a real sewer video showed that there are plenty of such
features in eroded concrete pipes. Hence, in this case, the set of reconstructed feature
points was dense enough to allow pipe shape estimation by surface fitting
in the modeling stage. Further, the error analysis presented in Paper V suggests that a
relatively low reconstruction uncertainty can be achieved despite the forward motion
of the camera.
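For reference, the Harris detector cited above scores each pixel by det(M) − k·trace(M)², where M is a Gaussian-weighted structure tensor of image gradients. The sketch below is a generic textbook form; the smoothing scale sigma, the constant k = 0.04 and the non-maximum suppression window are common defaults assumed here, not the settings used in Paper V.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, sobel

def harris_response(img, sigma=1.5, k=0.04):
    """Harris corner measure det(M) - k * trace(M)**2 for every pixel."""
    img = np.asarray(img, dtype=float)
    ix = sobel(img, axis=1)  # derivative along columns (x)
    iy = sobel(img, axis=0)  # derivative along rows (y)
    # Gaussian-weighted structure tensor M = [[ixx, ixy], [ixy, iyy]]
    ixx = gaussian_filter(ix * ix, sigma)
    iyy = gaussian_filter(iy * iy, sigma)
    ixy = gaussian_filter(ix * iy, sigma)
    return ixx * iyy - ixy**2 - k * (ixx + iyy) ** 2


def harris_corners(img, num=100, sigma=1.5, k=0.04, nms_size=7):
    """Return up to `num` corner positions (x, y), strongest first."""
    R = harris_response(img, sigma, k)
    # Keep local maxima with a positive response (straight edges give R < 0)
    peaks = (R == maximum_filter(R, size=nms_size)) & (R > 0)
    rows, cols = np.nonzero(peaks)
    order = np.argsort(R[rows, cols])[::-1][:num]
    return np.column_stack([cols[order], rows[order]])
```

On fish-eye frames the detected corners would then be tracked and back-projected to rays with the precalibrated camera model before structure from motion.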
The wide field of view of a fish-eye lens camera is essential in the sewer imaging
application, as it enables a high-resolution scan of the whole pipe with a
single pass. Our approach assumes that the camera is precalibrated; in the experiments
of Paper V, the camera calibration was performed by using the approach of Paper II.
The large radial distortion of fish-eye lenses requires modifications to conventional
structure from motion techniques, which are designed for perspective cameras. Thus,
Paper V gives a relatively detailed explanation of our implementation, which is suitable
for calibrated central cameras with a field of view of up to 180 degrees. However, any other
generic structure from motion system could be used as well. For example, the recent
work of Lhuillier (2008a) describes a structure from motion framework for calibrated
omnidirectional cameras.
Modeling
In the modeling stage, the shape of the pipe is estimated by fitting a piecewise cylindrical
model to the reconstructed points. To model the bending of pipes, Paper
V proposes a tubular model which is concatenated from short cylindrical
pieces and then smoothed along the pipe. Further, Paper V describes a robust cylinder
fitting procedure, where the reconstructed points are first divided into several sections
along the pipe, and a cylinder with an elliptical cross-section is fitted to each section.
Finally, the parameters of the cylindrical pieces are interpolated along the pipe so that
the obtained tubular surface is also smooth in the main axis direction.
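The sectioning-and-interpolation idea can be illustrated with a heavily simplified sketch. It assumes that the point cloud is already expressed in a frame where the pipe axis runs roughly along z (an assumption not made in Paper V), fits a circle rather than an ellipse to each section with a non-robust algebraic least-squares fit (the Kåsa fit), and interpolates the section parameters linearly along the axis. Paper V instead uses elliptical cross-sections, a geometric cost and robust estimation; all function names here are hypothetical.

```python
import numpy as np

def fit_circle(xy):
    """Algebraic (Kasa) circle fit: x^2 + y^2 + a*x + b*y + c = 0 -> (cx, cy, r)."""
    x, y = xy[:, 0], xy[:, 1]
    M = np.column_stack([x, y, np.ones_like(x)])
    (a, b, c), *_ = np.linalg.lstsq(M, -(x**2 + y**2), rcond=None)
    cx, cy = -a / 2.0, -b / 2.0
    return np.array([cx, cy, np.sqrt(cx**2 + cy**2 - c)])


def piecewise_tube(points, n_sections=5):
    """Split points (N x 3) into slabs along z and fit one circle per slab."""
    z = points[:, 2]
    edges = np.linspace(z.min(), z.max(), n_sections + 1)
    centers_z, params = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (z >= lo) & (z <= hi)
        if sel.sum() < 3:  # a circle needs at least three points
            continue
        centers_z.append(0.5 * (lo + hi))
        params.append(fit_circle(points[sel, :2]))
    return np.array(centers_z), np.array(params)


def tube_at(zq, centers_z, params):
    """Interpolate (cx, cy, r) along the axis; np.interp clamps beyond the ends."""
    return np.array([np.interp(zq, centers_z, params[:, j]) for j in range(3)])
```

Linear interpolation only makes the surface continuous along the axis; a spline through the section parameters would be a natural choice for a smoother tube.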
The robust cylinder fitting procedure proposed in Paper V has the following
properties: (a) it minimizes a geometric cost function, (b) it is robust to outliers, and (c) it
can be applied to elliptical cylinders. There are also other approaches to cylinder fitting,
e.g. (Faber & Fisher, 2001; Lukács et al., 1998; Werghi et al., 1998), but the approach
of Paper V was used in the sewer modeling system because it includes a method for the
robust initialization of the cylinder parameters and has all the aforementioned desirable
properties.
4.3 Discussion
Image-based scene reconstruction can be seen as a unifying high-level theme for the top-
ics of this thesis, and this chapter briefly described the general background of the field.
At a more concrete level, the purpose of this chapter was to present the contributions of
Paper V, which are further summarized and discussed below.
The main contribution of Paper V is a complete system for acquiring
a three-dimensional model of a sewer pipe from a survey video. The experiments with
a real sewer video show that the proposed approach can recover the shape of a sewer
pipe from a fish-eye video sequence recorded in a single pass through the pipe.
We are not aware of any previous work in which such a modeling system has been
demonstrated in practice. Hence, Paper V presents a new application of structure from
motion techniques to the sewer inspection domain, where computer vision methods
are not yet widely used. An additional contribution of Paper V is a robust and
practical method for modeling tubular surfaces from three-dimensional point clouds.
Overall, the methods presented in Paper V are not necessarily limited to the sewer
imaging application, but could be useful in other applications as well. The
wide field of view of an omnidirectional camera typically provides advantages in structure and
motion estimation (Mičušík & Pajdla, 2006), and therefore omnidirectional structure
from motion methods, such as the implementation of Paper V, are likely to be useful
in many applications. In addition, the modeling of tubular surfaces could also be
utilized in other problem areas. In particular, the methods presented might be useful in
the field of medical imaging, as there have already been some attempts to use computer
vision techniques for structure recovery from endoscopic images (Burschka et al., 2004;
Schmidt et al., 2002; Wang et al., 2008).
5 Quasi-dense matching
5.1 Introduction
Image matching, i.e. the determination of corresponding pixels between two different
images of the same scene, is a vision task that is needed in many problems. For example,
in the area of structure from motion, computation of camera motion from image
sequences requires sparse matching of interest points between the different images
(Hartley & Zisserman, 2000). Further, many recent approaches to viewpoint-invariant object
recognition are based on matching of local image regions (Obdržálek & Matas, 2002;
Lowe, 2004). However, matching a sparse set of interest points or regions is not always
sufficient. For instance, such tasks as surface reconstruction (Strecha, 2007) or object
recognition (Ferrari et al., 2006) may require dense or quasi-dense matching of images.
This thesis studies a quasi-dense approach to image matching. The work builds on
the previous work by Lhuillier & Quan (2002), and extends their approach to the wide
baseline case. In addition, the thesis presents applications of quasi-dense matching to
object recognition and two-view motion segmentation.
The key contribution in the paper by Lhuillier & Quan (2002) is a quasi-dense
matching algorithm, which is based on the match propagation principle. The algo-
rithm starts from a sparse set of seed matches between two images, then propagates to
the neighboring pixels by the best-first strategy, and finally produces a quasi-dense set
of matching pixels. The quasi-dense correspondences typically provide a good basis
for such tasks as two-view geometry estimation and surface reconstruction (Lhuillier
& Quan, 2005). Compared to conventional dense stereo correspondence algorithms
(Scharstein & Szeliski, 2002), the quasi-dense approach has the advantage that it can
also be used for uncalibrated image pairs where the epipolar constraint is not known.
Further, the match propagation algorithm is efficient in terms of time and memory, and
hence provides a tenable alternative for cases where a completely dense matching is not
required.
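The best-first propagation principle can be sketched as follows, in a simplified form inspired by Lhuillier & Quan (2002): matches are kept in a priority queue ordered by ZNCC score, and each popped match proposes matches for its 8-neighbors while allowing the displacement to change by at most one pixel. Assumptions of this sketch: square (purely translational) ZNCC windows, integer pixel positions, a fixed acceptance threshold tau and a simple uniqueness rule; further details of the original algorithm are omitted, and the function names are hypothetical.

```python
import heapq
import numpy as np

def zncc(a, b):
    a = a - a.mean()
    b = b - b.mean()
    d = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / d) if d > 1e-12 else -1.0


def propagate(I1, I2, seeds, w=2, tau=0.8):
    """Grow matches from seeds [(x1, y1, x2, y2), ...]; returns {(x1, y1): (x2, y2)}."""
    def inside(I, x, y):
        return w <= x < I.shape[1] - w and w <= y < I.shape[0] - w

    def score(x1, y1, x2, y2):
        return zncc(I1[y1 - w:y1 + w + 1, x1 - w:x1 + w + 1],
                    I2[y2 - w:y2 + w + 1, x2 - w:x2 + w + 1])

    matches, used2, heap = {}, set(), []
    for x1, y1, x2, y2 in seeds:
        if inside(I1, x1, y1) and inside(I2, x2, y2):
            matches[(x1, y1)] = (x2, y2)
            used2.add((x2, y2))
            heapq.heappush(heap, (-score(x1, y1, x2, y2), x1, y1, x2, y2))

    nbrs = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)]
    window = nbrs + [(0, 0)]
    while heap:
        _, x1, y1, x2, y2 = heapq.heappop(heap)  # best remaining match first
        for dx, dy in nbrs:
            u1 = (x1 + dx, y1 + dy)
            if u1 in matches or not inside(I1, *u1):
                continue
            best, best_u2 = tau, None
            for ex, ey in window:
                # candidate keeps the seed's displacement up to +-1 pixel
                u2 = (x2 + dx + ex, y2 + dy + ey)
                if u2 in used2 or not inside(I2, *u2):
                    continue
                s = score(*u1, *u2)
                if s > best:
                    best, best_u2 = s, u2
            if best_u2 is not None:  # accept and use it as a new seed
                matches[u1] = best_u2
                used2.add(best_u2)
                heapq.heappush(heap, (-best, *u1, *best_u2))
    return matches
```

Because the windows are compared without any geometric normalization, this form implicitly assumes that the local transformation between the images is close to a translation.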
Thus, our work builds on (Lhuillier & Quan, 2002), but there are also several other
works that utilize the idea of region growing in image matching. For example, the early
paper by Otto & Chau (1989) is perhaps the first work to use this kind of idea. The
work by Chen & Medioni (1999) also utilizes the growing principle, but without the
best-first matching strategy. Further, among the more recent works, Megyesi & Chetverikov
(2004) present a match propagation method which allows affine deformations of
corresponding image patches, and Čech & Šára (2007) modify the method by Lhuillier &
Quan (2002) so that the accuracy and correctness of matching are guaranteed even in
the presence of repeating texture patterns. However, most of the previous approaches,
including (Megyesi & Chetverikov, 2004) and (Čech & Šára, 2007), are designed for
conventional rectified stereo image pairs where the scene is rigid and the corresponding
epipolar lines are parallel.
A significant limitation with the algorithm by Lhuillier & Quan (2002) is that it is
not directly applicable in the wide baseline case. Hence, Papers VI and VII propose
extensions which make the algorithm suitable for the matching of wide baseline images
of both rigid and non-rigid scenes. Further, Paper VII additionally presents an approach
for utilizing quasi-dense matches for reliable object recognition in the presence of geo-
metric deformations and extensive background clutter. Finally, Paper VIII proposes the
use of quasi-dense matching as an initialization for a dense and deformable two-view
motion segmentation method. The following section summarizes the ideas of Paper
VI, whereas the contributions of Paper VII are described in Sections 5.3 and 5.4. An
overview of the approach of Paper VIII is described in Section 5.5, and the discussion
in Section 5.6 concludes this chapter.
5.2 Match propagation in the wide baseline case
In the case of wide baseline stereo images, the camera viewpoint differs greatly between
the two views. An example of such an image pair is shown in Figure 8 (Mikolajczyk
et al., 2005). There are certain issues in the match propagation algorithm by Lhuillier
& Quan (2002) which limit its applicability for wide baseline matching. In particular,
at each step of propagation, small image patches are extracted around the current seed
point in both images and the new candidate matches are scored according to the zero-
mean normalized cross-correlation (ZNCC) of the corresponding patches. Thus, it is
implicitly assumed that the local transformation between the images is effectively a
translation, and this assumption is not necessarily valid for wide baseline image pairs.
Hence, in order to widen the applicability of the method, Paper VI proposes the use of a
general affine model for the geometric transformation between the local image patches.
[Figure 8 image panels not reproduced. The bottom panels show matched pixels over the image area (x: 100 to 800 px, y: 100 to 600 px), colored on a 0 to 7 scale by the distance from the true match.]
Fig. 8. Top: Two views of a wall from the dataset of Mikolajczyk et al. (2005).
The homography between the views is known. The ellipses denote matched re-
gions, whose centroids are used as a seed for match propagation. Bottom: The
matched pixels, computed by propagation with (right) and without (left) affine nor-
malization, are illustrated in the second view by coloring them according to their
distance from the true match. (The distances over 5 are suppressed to 5 and the
gray-value for the noncommon area is 6.)
The main stages of quasi-dense wide baseline matching are as follows. First, a sparse set
of initial matches is obtained by matching affine covariant regions (Mikolajczyk et al.,
2005) between the two images. Hence, the output of the initial matching stage is a set
of corresponding points $\{(\mathbf{x}_i,\mathbf{x}'_i)\}_i$ (the centroids of the matched regions),
accompanied by the affine transformation matrices $A_i$, which represent approximations
of the local geometric transformations between the images. The initial matches are used as seed
points for the match propagation, which searches for new matches in the surrounding
image areas by using zero-mean normalized cross-correlation (ZNCC) as a similarity
measure. The matches obtained are stored in a disparity map, which is filled in by
iterating the following steps:
(i) the seed point $(\mathbf{x}_i,\mathbf{x}'_i)$ with the highest ZNCC score is removed from the list of seed points
(ii) new candidate matches are searched for in the surroundings of $(\mathbf{x}_i,\mathbf{x}'_i)$ by using $A_i$ for the geometric normalization of local image neighborhoods
(iii) the candidate matches with a sufficiently high ZNCC score are stored in the disparity map and added to the list of seed points.
In this manner, the number of correspondences in the disparity map increases until the
list of seeds becomes empty.
As summarized above, the main difference between the original method by Lhuillier
& Quan (2002), and the method of Paper VI is the fact that in the latter approach a seed
match is always associated with a local affine transformation, which allows us to trans-
form the image patches to a common coordinate frame at each propagation step. This
geometric normalization process is schematically illustrated in Figure 9. The addition
of an affine transformation model is a relatively straightforward extension to (Lhuillier
& Quan, 2002). However, in practice, this extension is essential for the performance of
the method in wide baseline cases, as shown by the experiments in Paper VI. This is
also illustrated in the example of Figure 8, where the affine normalization significantly
increases the number and quality of matches. Moreover, by following the detailed
implementation guidelines of Paper VI, the improvement in matching performance can be
achieved while preserving the efficiency of the quasi-dense approach. An additional
advantage of the proposed extension is that it can directly utilize various
types of affine covariant interest regions (Mikolajczyk et al., 2005) as seed matches.
Hence, the proposed approach naturally supplements recent techniques for sparse
wide baseline matching.
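The geometric normalization of local neighborhoods can be sketched as follows: the patch around a point in the first image is sampled on a regular grid of offsets d, while the patch in the second image is sampled at x′ + A d, so that both patches are expressed in a common frame before ZNCC is computed. The sketch assumes bilinear interpolation via `scipy.ndimage.map_coordinates`, points in (x, y) order and hypothetical function names; it is not the implementation described in Paper VI.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def normalized_patch(I, center, A, half=3):
    """Sample I at center + A @ d for integer offsets d in [-half, half]^2 (points in (x, y) order)."""
    r = np.arange(-half, half + 1, dtype=float)
    dx, dy = np.meshgrid(r, r)                      # offsets in the common (first-image) frame
    d = np.vstack([dx.ravel(), dy.ravel()])         # 2 x n
    p = np.asarray(center, dtype=float)[:, None] + A @ d
    # map_coordinates expects (row, column) = (y, x); order=1 is bilinear interpolation
    vals = map_coordinates(np.asarray(I, dtype=float), [p[1], p[0]], order=1, mode='nearest')
    return vals.reshape(dx.shape)


def affine_zncc(I1, I2, u, u_prime, A, half=3):
    """ZNCC between the patch around u in I1 and the A-normalized patch around u' in I2."""
    a = normalized_patch(I1, u, np.eye(2), half)
    b = normalized_patch(I2, u_prime, A, half)
    a, b = a - a.mean(), b - b.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))
```

During propagation, a candidate (u, u′) near a seed with affine matrix A would be scored as `affine_zncc(I1, I2, u, u_prime, A)`; with A equal to the identity the score reduces to the translational comparison. Textureless patches would make the denominator zero and need a guard in practice.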
Besides introducing the affine model for the local geometric transformations, Paper
VI additionally proposes an adaptive propagation method, where the current estimate
of the affine transformation can be adjusted during the propagation by using the second
order intensity moments locally, together with the epipolar geometry. Hence, the adap-
tive version of the propagation method can be used for image pairs of rigid scenes when
the epipolar geometry is known. The experiments in Paper VI show that the adaptive
method allows a single seed match to propagate into regions where the local transforma-
tion between the views differs from the initial one. However, the adaptation principle
of Paper VI is not applicable for non-rigid scenes, and therefore Paper VII describes
a method where the adaptation is based entirely on the local texture properties of the
images. The details are briefly summarized in the following section.
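[Figure 9 diagram: images I and I′ with the seed match x, x′ and a candidate match u, u′; the two neighborhoods are related by the affine map A and compared by ZNCC.]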
Fig. 9. Affine normalization of local image neighborhoods around a seed match
(x,x′). The candidate match (u,u′) is evaluated by computing the ZNCC measure
between the corresponding normalized patches.
5.3 Non-rigid quasi-dense matching
Paper VII describes a non-rigid match propagation method which uses the local image
gradients and the second order intensity moments to adjust the estimate of the local
affine transformation during the propagation. Hence, unlike in Paper VI, epipolar ge-
ometry is not required for adaptive matching, and therefore the proposed approach is
particularly suitable for such cases where the epipolar geometry is not known, or the
scene is deforming.
As in Paper VI, the adaptation is based on the windowed second moment matrix of
the image intensity function $f$, which is defined by
$$
S_{f,g}(\mathbf{u}) = \int \mathbf{v}\mathbf{v}^{\top} f(\mathbf{v})\, g(\mathbf{u}-\mathbf{v})\, d\mathbf{v}, \tag{24}
$$
where the function $g$ is a positive window function. By assuming that the intensity
function $f'$ and the window function $g'$ of the other image are affine transformed versions
of $f$ and $g$, i.e. $f'(\mathbf{u}) = f(A^{-1}\mathbf{u})$ and $g'(\mathbf{u}) = g(A^{-1}\mathbf{u})/|\det A|$, a change of variables
in (24) gives the following transformation rule
$$
S_{f',g'}(\mathbf{u}) = A\, S_{f,g}(A^{-1}\mathbf{u})\, A^{\top}. \tag{25}
$$
Here it is assumed that the coordinate systems in both images are centered at the
points under consideration, which causes the translational part of the affine
transformation to vanish. Then, by using the simplifying notations $S' = S_{f',g'}(\mathbf{0})$ and
$S = S_{f,g}(\mathbf{0})$, evaluating (25) at $\mathbf{u} = \mathbf{0}$ gives $S' = A S A^{\top}$, and since the
matrices defined by (24) are positive definite, this implies that
$$
A = S'^{1/2} R\, S^{-1/2}, \tag{26}
$$
where $R$ is an arbitrary orthogonal matrix. Hence, given $S$ and $S'$, the matrix $A$ can be
determined up to a rotation.
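Equation (26) can be checked numerically. The sketch below evaluates a discrete version of (24) at u = 0 with a Gaussian window (an assumed choice of g), forms A = S′^{1/2} R S^{−1/2} via symmetric eigendecompositions, and illustrates that any A satisfying S′ = A S Aᵀ differs from the R = I solution only by an orthogonal factor. The function names are hypothetical.

```python
import numpy as np

def second_moment(f, center, sigma):
    """Discrete analogue of (24) at u = 0: sum over pixels v (relative to
    `center`, in (x, y) order) of v v^T f(v) g(-v), with a Gaussian window g."""
    h, w = f.shape
    ys, xs = np.mgrid[0:h, 0:w]
    vx, vy = xs - center[0], ys - center[1]
    wgt = f * np.exp(-(vx**2 + vy**2) / (2.0 * sigma**2))  # g is symmetric, so g(-v) = g(v)
    sxx, syy, sxy = np.sum(vx * vx * wgt), np.sum(vy * vy * wgt), np.sum(vx * vy * wgt)
    return np.array([[sxx, sxy], [sxy, syy]])


def sym_power(S, p):
    """S**p for a symmetric positive definite matrix S."""
    evals, evecs = np.linalg.eigh(S)
    return evecs @ np.diag(evals**p) @ evecs.T


def affine_from_moments(S, S_prime, R=None):
    """A = S'^(1/2) R S^(-1/2) as in (26); R defaults to the identity."""
    R = np.eye(2) if R is None else R
    return sym_power(S_prime, 0.5) @ R @ sym_power(S, -0.5)
```

Given any true transformation A, the matrix R = S′^{−1/2} A S^{1/2} is orthogonal, which is exactly the remaining degree of freedom in (26).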
The idea in adaptive match propagation is to use the affine transformation of the
current seed match to compute the local windows for a new candidate match, and to
estimate $S$ and $S'$ using these windows. Then the affine transformation for the new match
is computed by (26), where the remaining rotational degree of freedom is determined
either by using the epipolar lines of the matching points, as described in Paper VI, or
by computing the dominant directions of image gradients in the local neighborhoods
of the new match. The latter approach is described in Paper VII, and it has the advan-
tage that the adjustment is based solely on local texture properties, and this allows the
propagation to adapt to smooth non-rigid deformations of the imaged surfaces.
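The remaining rotation can be fixed from image gradients. In the frames normalized by S^{−1/2} and S′^{−1/2}, the two neighborhoods differ only by R, so R can be taken as the rotation by the difference of their dominant gradient orientations. In the sketch below, which is an assumption rather than the exact estimator of Paper VII, the dominant orientation is taken from the gradient structure tensor. It is therefore defined only modulo 180 degrees, and resolving that ambiguity is left out. The function names are hypothetical.

```python
import numpy as np

def dominant_orientation(patch):
    """Dominant gradient orientation (radians, modulo pi) from the structure tensor."""
    gy, gx = np.gradient(np.asarray(patch, dtype=float))  # derivatives along rows, columns
    jxx, jyy, jxy = np.sum(gx * gx), np.sum(gy * gy), np.sum(gx * gy)
    # Angle of the principal eigenvector of [[jxx, jxy], [jxy, jyy]]
    return 0.5 * np.arctan2(2.0 * jxy, jxx - jyy)


def rotation_between(patch, patch_prime):
    """Rotation R taking the dominant orientation of `patch` to that of `patch_prime`."""
    a = dominant_orientation(patch_prime) - dominant_orientation(patch)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s], [s, c]])
```

In the adaptive scheme, both patches would first be resampled in their normalized frames, and the resulting R would be inserted into (26).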
An example of non-rigid image registration by quasi-dense matching is shown in
Figure 10, where a single seed match is grown by using both the non-adaptive and
adaptive version of the propagation algorithm. Since here the artificial deformation
between the two images is known, the accuracy of the obtained quasi-dense matches
can be evaluated and it is illustrated by the color coding. It can be seen that in this case,
the adaptive method produces more matches than the non-adaptive method, which keeps
the affine transformation constant during the propagation. Further, the adaptation does
not essentially reduce the accuracy of the matches. Hence, the example shows that the
proposed adaptation principle works in practice, and efficiently improves the matching
of non-rigid scenes.
[Figure 10 image panels not reproduced. The bottom panels cover the image area (x: 100 to 600 px, y: 50 to 450 px), colored on a 1 to 5 scale by the distance between each matched point and its true position.]
Fig. 10. Top: A pair of images where the artificial deformation is known. The
centroids of the yellow ellipses are used as a seed for match propagation. Bottom:
The matched pixels obtained by the non-adaptive (left) and adaptive (right)
propagation methods. The distance between the matched point and its true position in
the deformed image is used for the color coding. Revised from Paper VII. © 2008 IEEE.
5.4 Application in object recognition
Besides proposing the non-rigid quasi-dense matching method, Paper VII additionally
presents its application to object recognition and segmentation. The inspiration for this
kind of application arises from previous work (Ferrari et al., 2006), which shows that a
correspondence growing method may be used to improve the performance of an object
recognition approach which is based on local features.
The problem setting in the object recognition application is as follows: it is assumed
that some presegmented model images of objects are given and the task is to recognize
the given object instances from unknown test images where the illumination or camera
viewpoint may differ greatly from those in the model images. In addition, the test
images may contain occlusions and background clutter. Some examples of challenging
recognition tasks are shown in Figure 11.
Fig. 11. An example of a recognition and segmentation task where the given model
objects (left) are recognized and segmented from query images (right). The
colored contours illustrate the segmentation results. The example images are from
the ETHZ Toys dataset (Ferrari et al., 2006). Revised from Paper VII. © 2008 IEEE.
Many recent approaches to object recognition are based on local viewpoint invariant
image features (Schmid & Mohr, 1997; Obdržálek & Matas, 2002; Lowe, 2004). Typically,
these features are extracted by using a region detector, which adapts to the local
shape of the intensity surface and is hence able to detect corresponding regions in
the model and test images despite the change in camera viewpoint (Mikolajczyk et al.,
2005). Given the detected regions in the model and test images, the most straightforward
approach for recognition is to represent the regions with features which allow
reliable matching and then use the number of matched features as a recognition
criterion (Obdržálek & Matas, 2002). However, the performance of this kind of approach
is limited in the presence of extensive background clutter, because the background may
produce many incorrect feature matches which disturb the recognition process. Further,
occlusion and large scale or viewpoint changes reduce the probability that a model
feature is correctly extracted from the test image. Hence, in order to counter these
problems, some recent works have applied the principle of correspondence growing in object
recognition (Ferrari et al., 2006; Cech et al., 2008). The key idea in these works is to
utilize the fact that correctly matched regions typically grow better than false ones.
This fact allows one to improve the discrimination between correct and incorrect
matches, and thereby the performance of the recognition system.
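For contrast with correspondence growing, the baseline criterion described above (counting matched features) can be sketched as follows. The nearest-neighbour ratio test of Lowe (2004) is used here as one common way of deciding which descriptor matches count; the descriptors, the ratio 0.8 and the function name are assumptions, not the setup of any cited work.

```python
import math

def count_ratio_test_matches(model_desc, test_desc, ratio=0.8):
    """Count model descriptors whose nearest test descriptor is clearly closer
    than the second nearest (ratio test). Requires >= 2 test descriptors."""
    count = 0
    for d in model_desc:
        dists = sorted(math.dist(d, t) for t in test_desc)
        if dists[0] < ratio * dists[1]:
            count += 1
    return count
```

Clutter in the test image adds descriptors that can either create false matches or make the ratio test reject true ones, which is the weakness that correspondence growing aims to address.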
Thus, inspired by Ferrari et al. (2006), Paper VII proposes an approach for object
recognition and segmentation which is directly built on quasi-dense matching.
The proposed approach has three main stages: match propagation, match grouping,
and recognition. The first stage performs the non-rigid quasi-dense matching procedure
between the model and test images. The second stage groups the quasi-dense
pixel matches into geometrically consistent groups by using a method which utilizes
the local affine transformation estimates obtained during the propagation. That is,
neighboring triplets of quasi-dense matches, connected by Delaunay triangulation, are
merged into the same group if the motion of the connecting triangle is consistent with
the affine motions of its vertices. Finally, in the third stage, the number and quality of
geometrically consistent matches are used to define the decision criterion
for recognition. Further, if the object is recognized to be present in the test image, the
locations of the matching pixels directly provide the segmentation. A more detailed
description of the three stages can be found in Paper VII.
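The grouping rule of the second stage can be sketched as follows, as a simplified reading of the description above rather than the exact criterion of Paper VII. For each Delaunay triangle, the affine map defined by its three vertex correspondences is compared with the local affine estimates of the vertices, and consistent triangles merge their vertices using a union-find structure. The triangulation is passed in (it could come from, e.g., `scipy.spatial.Delaunay`), and the entrywise tolerance `tol` is a hypothetical parameter.

```python
def _affine_linear_part(p, q):
    """Linear part of the affine map sending triangle p (3 points) onto q."""
    e1 = (p[1][0] - p[0][0], p[1][1] - p[0][1])
    e2 = (p[2][0] - p[0][0], p[2][1] - p[0][1])
    f1 = (q[1][0] - q[0][0], q[1][1] - q[0][1])
    f2 = (q[2][0] - q[0][0], q[2][1] - q[0][1])
    det = e1[0] * e2[1] - e2[0] * e1[1]
    # A E = F with E = [e1 e2], F = [f1 f2] (as columns), so A = F E^{-1}.
    inv = [[e2[1] / det, -e2[0] / det], [-e1[1] / det, e1[0] / det]]
    F = [[f1[0], f2[0]], [f1[1], f2[1]]]
    return [[sum(F[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def group_matches(pts1, pts2, local_A, triangles, tol=0.2):
    """Return a group label per match. local_A[i] is the 2x2 local affine
    estimate of match i; triangles are index triplets (e.g. Delaunay)."""
    parent = list(range(len(pts1)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for tri in triangles:
        A_tri = _affine_linear_part([pts1[i] for i in tri], [pts2[i] for i in tri])
        # The triangle is consistent if its motion agrees with every vertex.
        consistent = all(
            max(abs(A_tri[r][c] - local_A[i][r][c]) for r in range(2) for c in range(2)) <= tol
            for i in tri)
        if consistent:
            root = find(tri[0])
            for i in tri[1:]:
                parent[find(i)] = root
    return [find(i) for i in range(len(pts1))]
```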
As summarized above, the approach of Paper VII is conceptually similar to that of
Ferrari et al. (2006). However, since the proposed approach contains only one match
propagation stage, it is more straightforward than the method by Ferrari et al. (2006), which
involves repeated expansion and contraction phases, where the number and ratio of
correct matches are gradually increased. Further, the approach of Paper VII does not use any
global constraints and handles the images symmetrically. This implies that the method
can also be applied in cases where both the model image and the test image contain
background clutter, as discussed in more detail in the following section. Besides
Ferrari et al. (2006), Vedaldi & Soatto (2006) and Cech et al. (2008) also utilize
the idea of correspondence growing to improve the discrimination between correct and
incorrect region correspondences. However, these approaches concentrate on verifying
the correctness of a single region match at a time and do not discuss the grouping of
matching regions. Hence, they cannot be directly used for the segmentation problem.
In the experimental part of Paper VII, the performance of the proposed recognition
approach was evaluated by performing the same experiment as in (Ferrari et al.,
2006) with the publicly available ETHZ Toys dataset. The dataset contains 9 objects
and 23 challenging test scenes, and the task is to determine which objects are present
in the test scenes and to find the corresponding segmentations. Some examples of the
model and test images are shown in Figure 11, together with the obtained segmentation
results. Each colored contour in Figure 11 illustrates the boundary of a matching
segment, which represents the support domain of a single group of quasi-dense matches.
Only the most reliable segments are shown, as they correspond to recognized objects.
The reliability of a matching segment was measured by its correlation-and-coverage-weighted
area in the model image, as described in Paper VII. Besides the qualitative
evaluation, the recognition performance was quantified by computing the ROC curve as
in (Ferrari et al., 2006). As reported in Paper VII, the obtained results were comparable
to those of Ferrari et al. (2006).
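As a reminder of how such a curve is obtained, the sketch below computes ROC points and the area under the curve from recognition scores (for example, a hypothetical per model-and-scene reliability score) and ground-truth presence labels. This is the generic ROC computation; the exact evaluation protocol of Ferrari et al. (2006) may differ in how scores and positives are defined.

```python
def roc_curve(scores, labels):
    """ROC points (false positive rate, true positive rate), sweeping a decision
    threshold from the highest score downwards. labels: 1 = object present.
    Both classes must occur at least once."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i, (s, y) in enumerate(pairs):
        if y:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample sharing this score (ties).
        if i == len(pairs) - 1 or pairs[i + 1][0] != s:
            points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```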
5.5 Framework for two-view motion segmentation
The problem of motion segmentation typically arises in a situation where one has a
sequence of images containing differently moving objects, and the task is to extract
the objects from the images using the motion information. In this context, the motion
segmentation problem consists of the following two subproblems: (1) determination of
groups of pixels in two or more images that move together, and (2) estimation of the
motion fields associated with each group (Wills et al., 2006). Hence, in the case of
two images, the motion segmentation problem is typically more challenging than the
object recognition problem because neither one of the two images may be assumed to
be presegmented.
Many early approaches to motion segmentation assume small motion between the
images, and use dense optical flow techniques for motion estimation (Wang & Adelson,
1994; Weiss, 1997). The main limitation of optical flow based methods is that they
are not suitable for large motions (Wills et al., 2006). Hence, in this thesis, Paper
VIII studies the motion segmentation problem in the context of wide baseline image
pairs. That is, the focus is on such cases where the motion of the objects between the
two images may be very large due to non-rigid deformations and viewpoint variations.
Further, spatially varying illumination changes, such as shadows, are another challenge
for motion segmentation that may often occur in the wide baseline setting. In order
to address the challenges posed by deforming motions and varying illumination, Paper
VIII proposes a bottom-up motion segmentation approach which gradually expands
and merges the initial matching regions into smooth motion layers and finally provides
a dense assignment of pixels into these layers. Besides segmentation, the proposed
method provides the geometric and photometric transformations for each layer.
The bottom-up segmentation method of Paper VIII utilizes the non-rigid quasi-
dense matching approach of Paper VII for initializing the motion layers. In detail, the
method starts from a sparse set of seed matches between the two images and then proceeds
to quasi-dense matching, which expands the initial seed regions by local propagation.
Then, the quasi-dense matches are grouped into coherently moving segments by
using a local grouping technique similar to that in Paper VII. The resulting segments
are used to initialize the motion layers for the final dense segmentation stage, where the
geometric and photometric transformations of the layers are iteratively refined together
with the segmentation. This alternating minimization scheme is formulated by using
a probabilistic model somewhat similar to that of Simon & Seitz (2007), and the
pixel-level segmentation is obtained by graph cut based optimization. However, unlike
Simon & Seitz (2007), who concentrate on the object recognition problem, the proposed
method does not use any presegmented reference images, but detects and segments the
common regions automatically from both images. Further, Paper VIII uses a spatially
varying photometric transformation model which is more expressive than the global
model used by Simon & Seitz (2007).
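The per-pixel data term of the layer assignment can be illustrated with a deliberately simplified sketch: each pixel takes the layer whose geometric and photometric transformation best predicts its color, or an "unmatched" label if no layer fits. This is not the method of Paper VIII. The graph-cut smoothness term, the iterative refinement and the spatially varying photometric model are all omitted, the warped colors are assumed to be precomputed, and all names and the threshold are hypothetical.

```python
def assign_layers(colors, warped_colors, gains, max_residual):
    """Winner-take-all layer labels from the photometric data term only.
    colors[p]: RGB of pixel p in the first image.
    warped_colors[l][p]: RGB of the corresponding point in the second image
        under layer l's geometric transformation (assumed precomputed).
    gains[l]: per-channel photometric gains of layer l (assumed global).
    Pixels whose best residual is not below max_residual get label -1."""
    labels = []
    for p, c in enumerate(colors):
        best_label, best_cost = -1, max_residual
        for l, (warped, g) in enumerate(zip(warped_colors, gains)):
            # Squared color residual after the photometric correction.
            cost = sum((gk * wk - ck) ** 2 for gk, wk, ck in zip(g, warped[p], c))
            if cost < best_cost:
                best_label, best_cost = l, cost
        labels.append(best_label)
    return labels
```

In the full scheme, such a data term would be combined with a spatial smoothness prior and minimized by graph cuts, alternating with re-estimation of the layer transformations.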
Besides Paper VIII, the problem of dense two-view motion segmentation in the
presence of multiple large motions has been studied by Bhat et al. (2006) and Wills et al.
(2006). However, these earlier approaches do not model varying lighting conditions
between the two images, and they require that the motions are either rigid (Bhat et al.,
2006) or approximately planar (Wills et al., 2006). Hence, due to its ability to deal with
deforming motions and large illumination changes, the approach of Paper VIII provides
a wider range of applicability than the previous methods.
The proposed motion segmentation method is illustrated with the example in Figure
12, where the two input images contain two common objects, the magazines. Besides
the magazines, both images contain a lot of background clutter. Moreover, there are also
other challenges present in the image pair: the illumination is different in the images,
the motion of the magazines is non-rigid, and the foremost magazine appears at sub-
stantially different scales in the two images. The results obtained are illustrated in the
last two columns of Figure 12, where the middle column shows the segmentations, and
the last column shows the estimated geometric and photometric transformation fields.
The meshes illustrate the geometric transformations, and the colors visualize the
photometric transformations. The colors show how the gray color, shown on the background
layer, would be transformed from the other image to the colored image. The result in-
dicates that the white balance is different in the two images, i.e., the first image is more
blue. In addition, it can be seen that the model has correctly captured the shadow on the
corner of the foremost magazine in the first image.
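In the simplest case, a white-balance difference like the one visible in Figure 12 can be captured by one gain per color channel. The sketch below estimates such gains by least squares from colors at matched pixels. This global diagonal model is our simplifying assumption and is much less expressive than the spatially varying photometric model of Paper VIII.

```python
def estimate_channel_gains(src_colors, dst_colors):
    """Least-squares gains g_c minimising sum_p (g_c * src_c - dst_c)^2 for each
    RGB channel c, i.e. g_c = sum_p src_c * dst_c / sum_p src_c^2.
    src_colors, dst_colors: RGB triples at matched pixels."""
    gains = []
    for c in range(3):
        num = sum(s[c] * d[c] for s, d in zip(src_colors, dst_colors))
        den = sum(s[c] * s[c] for s in src_colors)
        gains.append(num / den)
    return tuple(gains)
```

Applying the gains to a neutral gray mirrors the visualization in Figure 12: with gains of 0.9, 1.0 and 1.2 for R, G and B, the gray (128, 128, 128) becomes (115.2, 128, 153.6), i.e. bluer.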
Fig. 12. A pair of images from the ETHZ Toys dataset (left) and the extracted
objects (middle) with the associated geometric and photometric transformations
(right). Reprinted from Paper VIII. © 2009 Springer.
5.6 Discussion
Quasi-dense matching and geometric camera calibration are the two main themes of
this thesis. Accordingly, one of the key contributions of the thesis is the set of extensions which
make the quasi-dense approach applicable for wide baseline images of both rigid and
non-rigid scenes. These extensions include the use of an affine normalization step as
an integral part of the propagation algorithm, and the use of second order intensity
moments, together with either epipolar geometry or local image gradients, for adaptive
propagation. Importantly, the experiments in Papers VII and VIII show that, in the
adaptive propagation mode, the parameters of the local affine transformation can be
efficiently adjusted during the propagation without using any nonlinear optimization.
As described, the adjustment is based on local texture properties and allows the match
propagation to adapt to smooth variations of the imaged surfaces. This is an interesting
observation and, to the best of our knowledge, this kind of adaptation principle has
not been utilized in image matching before.
Overall, the proposed quasi-dense matching techniques can be seen as basic tools,
which may be utilized in different vision tasks. For example, Papers VII and VIII
propose approaches for object recognition and motion segmentation, which are based
on quasi-dense matching. In addition, the earlier work by Lhuillier & Quan (2005)
applies quasi-dense matching to surface reconstruction from image sequences.
The locality of matching is a characteristic feature of the match propagation
algorithm and, depending on the point of view, this can be seen as a disadvantage or as
an advantage. For example, in the case of conventional stereo matching of rectified
images, it might be better to use some global approach, such as (Cech & Sára, 2007)
or (Kolmogorov & Zabih, 2001), since these approaches are not as prone to suboptimal
solutions and can better deal with repetitive texture patterns. On the other hand, due
to its local nature, quasi-dense matching is widely applicable and can often be used in
cases where other methods are not applicable. For example, rectified images are not
always available in stereo matching. In fact, in the recent work by Xiao et al. (2008),
the approach of Paper VII has been utilized in the initialization stage of a global stereo
matching method which can deal with uncalibrated wide baseline images. Further, the
non-rigid quasi-dense matching method of Paper VII can also be used for
omnidirectional images (Lu & Wu, 2008).
Finally, considering the possible directions for future research, it is useful to relate
the methods of Papers VI-VIII to recent works that address similar problems or use
somewhat similar ideas. Particularly interesting recent works include (Cech et al.,
2008) and (Cho et al., 2009). For example, it might be advantageous to combine ideas
from Paper VII and the papers (Cech et al., 2008) and (Cho et al., 2009) in order to
improve deformable object matching further. That is, the sequential correspondence
selection method by Cech et al. (2008) could first be used to efficiently increase the
proportion of correct matches among tentative feature correspondences, whereupon the
correspondence clustering method by Cho et al. (2009) and the combined correspondence
growing and grouping approach of Paper VII could be used together for accurate
recognition and segmentation. Further, as an additional topic for future research, it
might be useful to study ways of applying quasi-dense matching to image retrieval
from large databases. The first steps in this direction have been taken by Cech et al.
(2008), who use correspondence growing to re-rank images retrieved by efficient sparse
matching techniques which are based on vocabulary trees.
On the other hand, one additional challenge for recognition, not addressed in the
aforementioned papers, is posed by significant non-linear intensity variations. This
problem is discussed by Yang et al. (2007), who also propose an image registration
method which is based on correspondence growing and is remarkably robust to large
illumination variations. However, their registration framework is not applicable to
arbitrary non-rigid scenes, and hence, there is room for further development.
Lastly, the work by Furukawa & Ponce (2007) proposes a generic multi-view stereo
method, which is based on the local expansion of matching image patches. Although
the multi-view stereo problem is quite different from the matching problems discussed in
this thesis, there is a somewhat similar basic idea of correspondence growing behind
the approaches. On the other hand, a conceptual difference between the approaches is
the fact that the best-first propagation strategy is not used by Furukawa & Ponce (2007).
Instead, repeated expansion and filtering steps are used, where the reconstruction is
gradually grown and erroneously reconstructed patches are filtered out. Hence, there
remains a question whether the best-first propagation principle could be utilized in some
form to further improve the efficiency of patch-based multi-view stereo methods.
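For reference, the best-first principle itself can be sketched compactly. The Python below is a minimal variant in the spirit of the match propagation used in this thesis, not the implementation of Papers VI-VIII: matches are scored by zero-mean normalized cross-correlation (ZNCC) on fixed square windows (no affine adaptation), the highest-scoring candidate is always expanded first, and neighboring matches may differ in disparity by at most one pixel. The window half-size `w` and the `threshold` are illustrative values.

```python
import heapq
import math

def zncc(I, J, p, q, w):
    """ZNCC between the (2w+1)x(2w+1) windows centred at p in I and q in J
    (points are (row, col)). Returns -1.0 if a window leaves its image or
    has zero variance."""
    for (r, c), img in ((p, I), (q, J)):
        if not (w <= r < len(img) - w and w <= c < len(img[0]) - w):
            return -1.0
    offs = [(i, j) for i in range(-w, w + 1) for j in range(-w, w + 1)]
    a = [I[p[0] + i][p[1] + j] for i, j in offs]
    b = [J[q[0] + i][q[1] + j] for i, j in offs]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return sum(x * y for x, y in zip(da, db)) / den if den > 0 else -1.0

NEIGHBOURS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]

def propagate(I, J, seeds, w=2, threshold=0.8):
    """Best-first propagation: always accept the best remaining candidate,
    then spawn candidates around it whose disparity differs by <= 1 pixel."""
    heap = [(-zncc(I, J, p, q, w), p, q) for p, q in seeds]
    heapq.heapify(heap)
    used1, used2, matches = set(), set(), []
    while heap:
        neg_score, p, q = heapq.heappop(heap)
        if -neg_score < threshold or p in used1 or q in used2:
            continue
        used1.add(p)
        used2.add(q)
        matches.append((p, q))
        for dr, dc in NEIGHBOURS:
            if (dr, dc) == (0, 0):
                continue
            u = (p[0] + dr, p[1] + dc)
            if u in used1:
                continue
            for er, ec in NEIGHBOURS:
                v = (q[0] + dr + er, q[1] + dc + ec)
                if v in used2:
                    continue
                s = zncc(I, J, u, v, w)
                if s >= threshold:
                    heapq.heappush(heap, (-s, u, v))
    return matches
```

Because candidates are expanded strictly in order of decreasing score, reliable matches claim pixels before ambiguous ones; this ordering is what distinguishes best-first propagation from repeated expand-and-filter schemes.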
6 Summary and conclusion
This thesis has presented new models and methods for various computer vision prob-
lems which involve geometric aspects. The proposed techniques are related to such
specific problem areas as homography estimation, geometric camera calibration, image-
based modeling, image matching, object recognition, and motion segmentation. Hence,
there is a wide range of topics considered in the thesis and, at first sight, they may ap-
pear unrelated to each other. However, as described in the previous chapters, there is a
common background to the themes. In fact, in most cases each of the particular prob-
lems can be seen as a geometric inference problem in which geometric information,
either about the scene or the camera, is extracted from the images.
First, Chapter 2 considered methods for determining a planar homography from
corresponding conics. The two proposed algorithms provide new aspects and tools
for the problem of homography estimation, which is a commonly occurring low-level
task in many areas of computer vision. The techniques developed might be used as a
part of other methods which address higher-level tasks in different application domains,
including plane-based camera calibration, image registration, and structure and motion
recovery (e.g. plane detection).
An important part of this thesis is geometric camera calibration, which was dis-
cussed in Chapter 3. A flexible parametric model for central cameras and a plane-based
camera calibration method for determining the model parameters were introduced. The
calibration experiments, conducted with several real cameras, show that the proposed
camera model is suitable for various types of central cameras, including many catadiop-
tric cameras and dioptric cameras equipped with narrow-angle, wide-angle or fish-eye
lenses. Further, the procedures developed were implemented in a Matlab toolbox,
which is publicly available. From a practical viewpoint, this is also a useful contribu-
tion as there has been a lack of calibration tools for omnidirectional cameras.
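As an illustration of what a radially symmetric central model looks like in code, the sketch below maps a 3D point to pixels through the incidence angle θ, using an odd polynomial r(θ) = k1θ + k2θ³ + ... for the radial distance on the image plane. The polynomial form, the pixel-scaling parameters `mu`, `mv`, `u0`, `v0`, and the omission of asymmetric distortion terms are assumptions for illustration; Chapter 3 gives the actual model.

```python
import math

def radial_distance(theta, k):
    """r(theta) = k[0]*theta + k[1]*theta**3 + k[2]*theta**5 + ..."""
    return sum(kj * theta ** (2 * j + 1) for j, kj in enumerate(k))

def project(X, k, mu, mv, u0, v0):
    """Project X = (x, y, z) in camera coordinates to pixel coordinates.
    theta is the angle from the optical axis, so points with theta > pi/2
    (behind the image plane of a pinhole camera) can still be imaged."""
    x, y, z = X
    theta = math.atan2(math.hypot(x, y), z)
    phi = math.atan2(y, x)
    r = radial_distance(theta, k)
    return (u0 + mu * r * math.cos(phi), v0 + mv * r * math.sin(phi))
```

With `k = (1.0,)` the model reduces to an equidistance projection (r = θ), one of the classical fish-eye models.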
In addition to photogrammetric calibration, which utilizes images of a known cali-
bration object, the problem of camera self-calibration was studied. Self-calibration of
central cameras from multi-view point correspondences under a general camera motion
is a challenging task which typically suffers from the local minima problem. In this
thesis, a multi-step calibration procedure was proposed which determines the parame-
ters of a radially symmetric central camera model by minimizing angular errors. Given
only a rough initialization for the internal camera parameters, the proposed approach re-
fines the camera parameters so that local minima are usually avoided. However, despite
promising results with different kinds of radial distortion models, the approach does
not completely solve the problem of local minima, and hence there is still room for
improvement in order to make self-calibration from general camera motion sufficiently
robust and stable for practical use.
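An angular error of the kind minimized in the self-calibration procedure can be sketched as the angle between a 3D direction and the ray obtained by back-projecting an image point. The sketch assumes the same odd-polynomial radial model as an illustration, inverts it with Newton's method, and works on the normalized image plane; the actual cost function and parameterization used in the thesis may differ.

```python
import math

def radial_distance(theta, k):
    return sum(kj * theta ** (2 * j + 1) for j, kj in enumerate(k))

def radial_derivative(theta, k):
    return sum((2 * j + 1) * kj * theta ** (2 * j) for j, kj in enumerate(k))

def backproject(x, y, k, iterations=20):
    """Unit ray for a normalized image point (x, y), obtained by solving
    r(theta) = hypot(x, y) with Newton's method."""
    r = math.hypot(x, y)
    theta = r / k[0]  # initial guess from the linear term
    for _ in range(iterations):
        theta -= (radial_distance(theta, k) - r) / radial_derivative(theta, k)
    phi = math.atan2(y, x)
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def angular_error(x, y, direction, k):
    """Angle (radians) between the back-projected ray of (x, y) and a 3D
    direction; atan2(|a x b|, a . b) stays accurate for small angles."""
    a = backproject(x, y, k)
    b = direction
    cross = (a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0])
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return math.atan2(math.sqrt(sum(c * c for c in cross)), dot)
```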
Since geometric camera calibration is a prerequisite for measuring metric scene
properties from images, it is closely related to the topic of image-based scene recon-
struction. The metrology example used in this thesis is the sewer imaging application
described in Chapter 4. In this context, an error analysis was performed for the struc-
ture from motion system that recovers the interior structure of a sewer pipe from a
video sequence, and it was observed that a relatively accurate reconstruction can be
achieved despite the forward motion of the camera. Additionally, a method was pre-
sented for modeling tubular surfaces from the reconstructed three-dimensional point
clouds. Hence, the sewer imaging system serves as an application specific example of a
complete reconstruction pipeline, where a compact model of the scene is automatically
acquired from the video.
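As a much simplified stand-in for the tubular modeling step, the sketch below fits a straight cylinder to a reconstructed point cloud: the axis direction is the dominant principal direction (found by power iteration on the covariance matrix), and the radius is the mean distance of the points from that axis. It assumes a straight, roughly uniformly sampled pipe segment without outliers, which real sewer reconstructions will not always satisfy; it is not the method of Chapter 4.

```python
import math

def fit_straight_cylinder(points, iterations=100):
    """Return (centroid, unit axis direction, radius) of a straight cylinder
    fitted to 3D points given as (x, y, z) tuples."""
    n = len(points)
    c = [sum(p[i] for p in points) / n for i in range(3)]
    d = [[p[i] - c[i] for i in range(3)] for p in points]
    C = [[sum(q[i] * q[j] for q in d) / n for j in range(3)] for i in range(3)]
    # Power iteration converges to the eigenvector of the largest eigenvalue,
    # i.e. the direction of largest spread (the pipe axis for a long segment).
    a = [1.0, 1.0, 1.0]
    for _ in range(iterations):
        a = [sum(C[i][j] * a[j] for j in range(3)) for i in range(3)]
        norm = math.sqrt(sum(x * x for x in a))
        a = [x / norm for x in a]
    # Radius: mean distance of the points from the fitted axis line.
    radius = 0.0
    for q in d:
        t = sum(q[i] * a[i] for i in range(3))
        radius += math.sqrt(sum((q[i] - t * a[i]) ** 2 for i in range(3)))
    return c, a, radius / n
```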
The quasi-dense image matching, discussed in Chapter 5, is a central theme of the
thesis. The proposed methods provide the means for computing a quasi-dense set of
matching pixels between two images of a textured scene, given a sparse set of matching
regions between the images as seed matches. A key contribution in this thesis is that
the previously proposed match propagation principle was extended to the wide baseline
case where the camera pose may vary greatly between the two views. In addition,
methods were proposed for adjusting the local transformation parameters during the
propagation so that the matching process is able to adapt to both rigid and non-rigid
deformations of the imaged surfaces.
Quasi-dense matching can be utilized in many problems. Surface reconstruction
is perhaps one of the most obvious applications. However, in this thesis, quasi-dense
matching was additionally applied to specific object recognition and dense two-view
motion segmentation. Both applications are based on match grouping, where neigh-
boring quasi-dense matches are grouped together if their group-wise transformation
appears to be consistent with the associated local transformations. In the proposed ob-
ject recognition method, the quasi-dense matches between the model and test images
are first grouped, and then the quality and number of matching pixels in the groups are
used as recognition criteria. On the other hand, in the proposed motion segmentation
method, the grouped matches provide an initialization for the motion layers, which are
then refined iteratively. The ability to deal with background clutter and geometric defor-
mations is an advantage of the quasi-dense approach in the aforementioned applications
where the common image regions have to be recognized and matched simultaneously.
Overall, because quasi-dense matching conveniently supplements modern sparse wide
baseline matching techniques which are based on affine covariant regions and have
proven useful in many applications, it can be seen as a potentially useful approach for
various problems which involve image matching.
In summary, the topics studied in this thesis touch on two classical themes of com-
puter vision, namely, reconstruction and recognition. Traditionally these two problems
have been considered as separate but recently there has been discussion about consid-
ering them together, because they often appear concurrently in real-world conditions
and the approaches for their solution might benefit from each other. In this thesis, the
connection between reconstruction and recognition is provided by quasi-dense match-
ing which can be utilized in both problems. In fact, the results of the thesis show that,
compared to sparse keypoint-based approaches, the quasi-dense approach improves the
recognition of specific objects from photographs, and earlier studies have shown that
the quasi-dense approach improves robustness and accuracy of geometry estimation in
structure from motion. Further, the quasi-dense approach also seems intuitively reasonable
in cases where the problems of image matching and recognition are coupled: there
is no point in attempting dense matching if the image regions do not represent the same
object, and, conversely, reliable recognition is difficult without a good hypothesis for the
pose and position of the object in the image.
Finally, the themes considered in this thesis suggest some directions for future re-
search. Firstly, there are still many challenging problems related to generic cameras
and their calibration. In particular, practical and robust methods for generic calibra-
tion and self-calibration, which include automatic camera model selection, are needed.
Also, the case of varying camera parameters is a challenge for self-calibration. Sec-
ondly, as there have recently emerged patch-based multi-view stereo methods which
utilize match expansion for multi-view reconstruction, it would be interesting to study
whether the best-first match propagation principle, used here for two-view stereo, could
be efficiently used for multi-view stereo. Thirdly, in the context of motion segmentation,
it would be useful if the proposed dense and deformable two-view motion segmentation
method could be extended to work with multi-frame image sequences.
References
Abdel-Aziz YI & Karara HM (1971) Direct linear transformation from comparator to object space coordinates in close-range photogrammetry. Proc Symposium on Close-Range Photogrammetry.
Ahonen T, Hadid A & Pietikäinen M (2006) Face description with local binary patterns: application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28(12): 2037–2041.
Armstrong M, Zisserman A & Hartley RI (1996) Self-calibration from image triplets. Proc European Conference on Computer Vision (ECCV): 3–16.
Baker S & Nayar SK (1999) A theory of single-viewpoint catadioptric image formation. International Journal of Computer Vision (IJCV) 35(2): 175–196.
Bakstein H & Pajdla T (2001) An overview of non-central cameras. Proc Computer Vision Winter Workshop (CVWW).
Ballard DH (1981) Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2): 111–122.
Barreto JP & Araujo H (2005) Geometric properties of central catadioptric line images and their application in calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(8): 1327–1333.
Barreto JP & Daniilidis K (2006) Epipolar geometry of central projection systems using Veronese maps. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 1258–1265.
Bhat P, Zheng KC, Snavely N, Agarwala A, Agrawala M, Cohen MF & Curless B (2006) Piecewise image registration in the presence of multiple large motions. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 2491–2497.
Bookstein FL (1979) Fitting conic sections to scattered data. Computer Graphics and Image Processing (9): 56–71.
Bowyer KW, Chang K & Flynn P (2006) A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition. Computer Vision and Image Understanding (CVIU) 101(1): 1–15.
Bräuer-Burchardt C & Voss K (2001) A new algorithm to correct fish-eye- and strong wide-angle-lens-distortion from single images. Proc International Conference on Image Processing (ICIP): 225–228.
Brown DC (1971) Close-range camera calibration. Photogrammetric Engineering 37(8): 855–866.
Brown M, Hartley R & Nistér D (2007) Minimal solutions for panoramic stitching. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Burschka D, Li M, Taylor R & Hager GD (2004) Scale-invariant registration of monocular endoscopic images to CT-scans for sinus surgery. Proc Medical Imaging Computing and Computer Assisted Intervention (MICCAI): 413–421.
Byröd M, Josephson K & Åström K (2009) Fast and stable polynomial equation solving and its application to computer vision. International Journal of Computer Vision (IJCV) 84(3): 237–256.
Carlsson S (1993) Projectively invariant decomposition and recognition of planar shapes. Proc International Conference on Computer Vision (ICCV): 471–475.
Cech J, Matas J & Perd'och M (2008) Efficient sequential correspondence selection by cosegmentation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Cech J & Sára R (2007) Efficient sampling of disparity space for fast and accurate matching. Proc BenCOS Workshop in conjunction with CVPR.
Chae MJ & Abraham DM (2001) Neuro-fuzzy approaches for sanitary sewer pipeline condition assessment. Journal of Computing in Civil Engineering 15(1): 4–14.
Chahl JS & Srinivasan MV (1997) Reflective surfaces for panoramic imaging. Applied Optics 36(31): 8275–8285.
Chen Q & Medioni GG (1999) A volumetric stereo matching method: application to image-based modeling. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Chen Q, Wu H & Wada T (2004) Camera calibration with two arbitrary coplanar circles. Proc European Conference on Computer Vision (ECCV), (3): 521–532.
Cho M, Lee J & Lee KM (2009) Feature correspondence and deformable object matching via agglomerative correspondence clustering. Proc International Conference on Computer Vision (ICCV).
Choi O, Kim H & Kweon IS (2007) Simultaneous plane extraction and 2D homography estimation using local feature transformations. Proc Asian Conference on Computer Vision (ACCV), (2): 269–278.
Chum O, Werner T & Matas J (2005) Two-view geometry estimation unaffected by a dominant plane. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 772–779.
Claus D & Fitzgibbon AW (2005) A rational function lens distortion model for general cameras. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 213–219.
Conrady A (1919) Decentering lens systems. Monthly Notices of the Royal Astronomical Society 79: 384–390.
Cooper D, Pridmore TP & Taylor N (1998) Towards the recovery of extrinsic camera parameters from video records of sewer surveys. Machine Vision and Applications 11: 53–63.
Cornelis C, Leibe B, Cornelis K & Van Gool L (2008) 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision (IJCV) 78(2-3): 121–141.
Daniilidis K & Klette R (eds) (2006) Imaging Beyond the Pinhole Camera, volume 33 of Computational Imaging and Vision. Springer.
Davison AJ, Reid ID, Molton ND & Stasse O (2007) MonoSLAM: real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(6): 1052–1067.
Faber P & Fisher B (2001) A buyer's guide to Euclidean elliptical cylindrical and conical surface fitting. Proc British Machine Vision Conference (BMVC): 521–530.
Faugeras O (1993) Three-Dimensional Computer Vision. The MIT Press.
Faugeras O, Luong QT & Papadopoulo T (2001) The Geometry of Multiple Images. The MIT Press.
Faugeras OD, Luong QT & Maybank SJ (1992) Camera self-calibration: Theory and experiments. Proc European Conference on Computer Vision (ECCV): 321–334.
Feldman D, Pajdla T & Weinshall D (2003) On the epipolar geometry of the crossed-slits projection. Proc International Conference on Computer Vision (ICCV): 988–995.
Fergus R, Perona P & Zisserman A (2003) Object class recognition by unsupervised scale-invariant learning. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 264–271.
Ferrari V, Tuytelaars T & Van Gool LJ (2006) Simultaneous object recognition and segmentation from single or multiple model views. International Journal of Computer Vision (IJCV) 67(2): 159–188.
Fischler MA & Bolles RC (1981) Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6): 381–395.
Fitzgibbon A (2001) Simultaneous linear estimation of multiple view geometry and lens distortion. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 125–132.
Fitzgibbon AW, Pilu M & Fisher RB (1999) Direct least square fitting of ellipses. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 21(5): 476–480.
Fitzgibbon AW & Zisserman A (1998) Automatic 3D model acquisition and generation of new images from video sequences. Proc European Signal Processing Conference.
Forsyth D, Mundy JL, Zisserman A, Coelho C, Heller A & Rothwell C (1991) Invariant descriptors for 3-D object recognition and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 13(10).
Frank O, Katz R, Tisse CL & Durrant-Whyte H (2007) Camera calibration for miniature, low-cost, wide-angle imaging systems. Proc British Machine Vision Conference (BMVC).
Fuchs M, Blanz V, Lensch HPA & Seidel HP (2007) Adaptive sampling of reflectance fields. ACM Transactions on Graphics 26(2).
Furukawa Y, Curless B, Seitz SM & Szeliski R (2009) Manhattan-world stereo. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Furukawa Y & Ponce J (2007) Accurate, dense, and robust multi-view stereopsis. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Gargallo P (2008) Contributions to the Bayesian approach to multi-view stereo. Ph.D. thesis, Institut National Polytechnique de Grenoble.
Geyer C & Daniilidis K (2001) Catadioptric projective geometry. International Journal of Computer Vision (IJCV) 45(3).
Geyer C & Daniilidis K (2002) Paracatadioptric camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24(5): 687–695.
Goesele M, Snavely N, Curless B, Hoppe H & Seitz S (2007) Multi-view stereo for community photo collections. Proc International Conference on Computer Vision (ICCV).
Grossberg MD & Nayar SK (2001) A general imaging model and a method for finding its parameters. Proc International Conference on Computer Vision (ICCV).
Gupta R & Hartley RI (1997) Linear pushbroom cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 19(9): 963–975.
Gurdjos P, Kim JS & Kweon IS (2006) Euclidean structure from confocal conics: theory and application to camera calibration. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 1214–1222.
Harris C & Stephens M (1988) A combined corner and edge detector. Proc Alvey Vision Conference.
Hartley R & Zisserman A (2000) Multiple View Geometry in Computer Vision. Cambridge University Press.
Hartley RI (1994) Self-calibration from multiple views with a rotating camera. Proc European Conference on Computer Vision (ECCV), (1): 471–478.
Hartley RI & Kang SB (2007) Parameter-free radial distortion correction with center of distortion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(8): 1309–1321.
Heikkilä J (2000) Geometric camera calibration using circular control points. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(10): 1066–1077.
Heikkilä M, Pietikäinen M & Schmid C (2009) Description of interest regions with local binary patterns. Pattern Recognition 42(3): 425–436.
Heyden A & Åström K (1997) Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 438–443.
Heyden A & Åström K (1998) Minimal conditions on intrinsic parameters for Euclidean reconstruction. Proc Asian Conference on Computer Vision (ACCV): 169–176.
Horn BKP (1986) Robot Vision. The MIT Press.
Horn RA & Johnson CR (1985) Matrix Analysis. Cambridge University Press.
Jain PK & Jawahar CV (2006) Homography estimation from planar contours. Proc International Symposium on 3D Data Processing Visualization and Transmission (3DPVT).
Kahl F, Agarwal S, Chandraker MK, Kriegman DJ & Belongie S (2008) Practical global optimization for multiview geometry. International Journal of Computer Vision (IJCV) 79(3): 271–284.
Kahl F & Heyden A (1998) Using conic correspondences in two images to estimate the epipolar geometry. Proc International Conference on Computer Vision (ICCV): 761–766.
Kaminski JY & Shashua A (2004) Multiple view geometry of general algebraic curves. International Journal of Computer Vision (IJCV) 56(3): 195–219.
Kanatani K (1993) Geometric Computation for Machine Vision. Oxford University Press.
Kanatani K (1994) Statistical bias of conic fitting and renormalization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 16(3): 320–326.
Kanatani K (1996) Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science.
Kannala J (2004) Measuring the shape of sewer pipes from video. Master's thesis, Helsinki University of Technology.
Kannala J & Brandt S (2004) A generic camera calibration method for fish-eye lenses. Proc International Conference on Pattern Recognition (ICPR), (1): 10–13.
Kannala J & Brandt SS (2005) Measuring the shape of sewer pipes from video. Proc IAPR Conference on Machine Vision Applications: 237–240.
Kolmogorov V & Zabih R (2001) Computing visual correspondence with occlusions via graph cuts. Proc International Conference on Computer Vision (ICCV): 508–515.
Kukelova Z, Bujnak M & Pajdla T (2008) Automatic generator of minimal problem solvers. Proc European Conference on Computer Vision (ECCV), (3): 302–315.
Kuntze HB & Haffner H (1998) Experiences with the development of a robot for smart multisensoric pipe inspection. Proc IEEE International Conference on Robotics and Automation (ICRA): 1773–1778.
Lazebnik S, Schmid C & Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 2169–2178.
Lhuillier M (2008a) Automatic scene structure and camera motion using a catadioptric system. Computer Vision and Image Understanding (CVIU) 109(2): 186–203.
Lhuillier M (2008b) Toward automatic 3D modeling of scenes using a generic camera model. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Lhuillier M & Quan L (2002) Match propagation for image-based modeling and rendering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24(8): 1140–1146.
Lhuillier M & Quan L (2005) A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(3): 418–433.
Li H & Hartley RI (2006) Plane-based calibration and auto-calibration of a fish-eye camera. Proc Asian Conference on Computer Vision (ACCV): 21–30.
Lourakis MIA, Argyros AA & Orphanoudakis SC (2002) Detecting planes in an uncalibrated image pair. Proc British Machine Vision Conference (BMVC).
Lowe D (2004) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV) 60(2): 91–110.
Lu L & Wu Y (2008) Quasi-dense matching between perspective and omnidirectional images. Proc Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications in conjunction with ECCV.
Lukács G, Martin R & Marshall D (1998) Faithful least-squares fitting of spheres, cylinders, cones and tori for reliable segmentation. Proc European Conference on Computer Vision (ECCV): 671–686.
Ma SD (1993) Conics-based stereo, motion estimation, and pose determination. International Journal of Computer Vision (IJCV) 10(1): 7–25.
Marr D (1982) Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company.
Martinec D (2008) Robust multiview reconstruction. Ph.D. thesis, Czech Technical University.
Matas J, Chum O, Urban M & Pajdla T (2004) Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22(10): 761–767.
Megyesi Z & Chetverikov D (2004) Enhanced surface reconstruction from wide baseline images. Proc International Symposium on 3D Data Processing Visualization and Transmission (3DPVT): 463–469.
Mei C & Rives P (2007) Single view point omnidirectional camera calibration from planar grids. Proc IEEE International Conference on Robotics and Automation (ICRA): 3945–3950.
Mičušík B (2004) Two-view geometry of omnidirectional cameras. Ph.D. thesis, Czech Technical University.
Mičušík B & Pajdla T (2006) Structure from motion with wide circular field of view cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28(7).
Mikolajczyk K & Schmid C (2005) A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(10): 1615–1630.
Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T & Van Gool L (2005) A comparison of affine region detectors. International Journal of Computer Vision (IJCV) 65(1-2): 43–72.
Miyamoto K (1964) Fish eye lens. Journal of the Optical Society of America (JOSA) 54(8): 1060–1061.
Moons T, Van Gool L, Van Diest M & Pauwels E (1994) Affine reconstruction from perspective image pairs obtained by a translating camera, volume 825 of Lecture Notes in Computer Science: 297–316. Springer.
Mouragnon E, Lhuillier M, Dhome M, Dekeyser F & Sayd P (2009) Generic and real-time structure from motion using local bundle adjustment. Image and Vision Computing 27: 1178–1193.
Mudigonda P, Jawahar CV & Narayanan PJ (2004) Geometric structure computation from conics. Proc Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP).
Nistér D (2001) Automatic dense reconstruction from uncalibrated video sequences. Ph.D. thesis, Royal Institute of Technology, Stockholm.
Nistér D (2004) An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 26(6): 756–777.
Nistér D (2005) Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications 16(5): 321–329.
Nistér D & Stewénius H (2006) Scalable recognition with a vocabulary tree. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 2161–2168.
Obdržálek S & Matas J (2002) Object recognition using local affine frames on distinguished regions. Proc British Machine Vision Conference (BMVC).
Obdržálek S & Matas J (2005) Sub-linear indexing for large scale object recognition. Proc British Machine Vision Conference (BMVC).
Oliensis J (2002) Exact two-image structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24(12): 1618–1633.
Ollis M, Herman H & Singh S (1999) Analysis and design of panoramic stereo vision using equi-angular pixel cameras. CMU-RI-TR-99-04, Carnegie Mellon University.
Otto GP & Chau TKW (1989) Region-growing algorithm for matching of terrain images. Image and Vision Computing 7(2): 83–94.
Pajdla T (2002) Stereo with oblique cameras. International Journal of Computer Vision (IJCV) 47(1-3): 161–170.
Pollefeys M (1999) Self-calibration and metric 3D reconstruction from uncalibrated image sequences. Ph.D. thesis, Katholieke Universiteit Leuven.
Pollefeys M, Koch R & Van Gool LJ (1999) Self-calibration and metric reconstruction inspite of varying and unknown intrinsic camera parameters. International Journal of Computer Vision (IJCV) 32(1): 7–25.
Pollefeys M, Nistér D, Frahm JM, Akbarzadeh A, Mordohai P, Clipp B, Engels C, Gallup D, Kim SJ, Merrell P, Salmi C, Sinha SN, Talton B, Wang L, Yang Q, Stewénius H, Yang R, Welch G & Towles H (2008) Detailed real-time urban 3D reconstruction from video. International Journal of Computer Vision (IJCV) 78(2-3): 143–167.
Pollefeys M & Van Gool LJ (1997) Self-calibration from the absolute conic on the plane at infinity. Proc International Conference on Computer Analysis of Images and Patterns (CAIP): 175–182.
Ponce J (2009) What is a camera? Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Pronk J (2006) Spatially variant real world light for computer graphics. B.Sc. thesis, University of New South Wales.
Quan L (1996) Conic reconstruction and correspondence from two views. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 18(2): 151–160.
Ramalingam S (2006) Generic imaging models: calibration and 3D reconstruction algorithms. Ph.D. thesis, Institut National Polytechnique de Grenoble.
Ramalingam S, Lodha SK & Sturm PF (2006a) A generic structure-from-motion framework. Computer Vision and Image Understanding (CVIU) 103(3): 218–228.
Ramalingam S & Sturm PF (2008) Minimal solutions for generic imaging models. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Ramalingam S, Sturm PF & Boyer E (2006b) A factorization based self-calibration for radially symmetric cameras. Proc International Symposium on 3D Data Processing Visualization and Transmission (3DPVT): 480–487.
Ramalingam S, Sturm PF & Lodha SK (2005) Towards complete generic camera calibration. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 1093–1098.
Ramalingam S, Sturm PF & Lodha SK (2006c) Theory and calibration for axial cameras. Proc Asian Conference on Computer Vision (ACCV), (1): 704–713.
Rosin PL (1993) A note on the least squares fitting of ellipses. Pattern Recognition Letters 14(10): 799–808.
Rosin PL & West GAW (1995) Nonparametric segmentation of curves into various representations. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 17(12): 1140–1153.
Rothganger F, Lazebnik S, Schmid C & Ponce J (2006) 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision (IJCV) 66(3): 231–259.
Sampson PD (1982) Fitting conic sections to very scattered data: an iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing 18: 97–108.
Scaramuzza D, Martinelli A & Siegwart R (2006) A flexible technique for accurate omnidirectional camera calibration and structure from motion. Proc International Conference on Computer Vision Systems (ICVS).
Scharstein D & Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision (IJCV) 47(1-3): 7–42.
Schmid C & Mohr R (1997) Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 19(5): 530–535.
Schmid C & Zisserman A (2000) The geometry and matching of lines and curves over multiple views. International Journal of Computer Vision (IJCV) 40(3): 199–233.
Schmidt J, Vogt F & Niemann H (2002) Nonlinear refinement of camera parameters using an endoscopic surgery robot. Proc IAPR Workshop on Machine Vision Applications: 40–43.
Seitz SM, Curless B, Diebel J, Scharstein D & Szeliski R (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 519–528.
Semple JG & Kneebone GT (1952) Algebraic Projective Geometry. Oxford University Press.
Simon I & Seitz SM (2007) A probabilistic model for object recognition, segmentation, and non-rigid correspondence. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Sinha SK & Fieguth PW (2006) Morphological segmentation and classification of underground pipe images. Machine Vision and Applications 17: 21–31.
Sivic J, Russell BC, Efros AA, Zisserman A & Freeman WT (2005) Discovering objects and their location in images. Proc International Conference on Computer Vision (ICCV): 370–377.
Sivic J & Zisserman A (2009) Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(4): 591–606.
Slama CC (ed) (1980) Manual of Photogrammetry. American Society of Photogrammetry, fourth edition.
Snavely N, Seitz SM & Szeliski R (2008) Modeling the world from internet photo collections. International Journal of Computer Vision (IJCV) 80(2): 189–210.
Starck J & Hilton A (2003) Model-based multiple view reconstruction of people. Proc International Conference on Computer Vision (ICCV): 915–922.
Stehle T, Truhn D, Aach T, Trautwein C & Tischendorf J (2007) Camera calibration for fish-eye lenses in endoscopy with an application to 3D reconstruction. Proc International Symposium on Biomedical Imaging (ISBI): 1176–1179.
Stewénius H (2005) Gröbner basis methods for minimal problems in computer vision. Ph.D. thesis, Lund University.
Stolfi J (1991) Oriented Projective Geometry. Academic Press.
Strecha C (2007) Multi-view stereo as an inverse inference problem. Ph.D. thesis, Katholieke Universiteit Leuven.
Sturm P & Barreto JP (2008) General imaging geometry for central catadioptric cameras. Proc European Conference on Computer Vision (ECCV), (4): 609–622.
Sturm P & Maybank S (1999) On plane based camera calibration: A general algorithm, singularities, applications. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 432–437.
Sturm PF (2005) Multi-view geometry for general camera models. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 206–212.
Sturm PF & Gargallo P (2007) Conic fitting using the geometric distance. Proc Asian Conference on Computer Vision (ACCV), (2): 784–795.
Sturm PF & Ramalingam S (2004) A generic concept for camera calibration. Proc European Conference on Computer Vision (ECCV), (2): 1–13.
Sugimoto A (2000) A linear algorithm for computing the homography from conics in correspondence. Journal of Mathematical Imaging and Vision 13: 115–130.
Sutherland I (1974) Three-dimensional data input by tablet. Proc IEEE 62: 453–461.
Swaminathan R, Grossberg MD & Nayar SK (2003) A perspective on distortions. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 594–601.
Swaminathan R, Grossberg MD & Nayar SK (2004) Designing mirrors for catadioptric systems that minimize image errors. Proc Workshop on Omnidirectional Vision, Camera Networks and Non-Classical Cameras (OMNIVIS).
Swaminathan R & Nayar SK (2000) Nonmetric calibration of wide-angle lenses and polycameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(10): 1172–1178.
Tardif JP, Sturm P, Trudeau M & Roy S (2009) Calibration of cameras with radially symmetric distortion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(9): 1552–1566.
Tardif JP, Sturm PF & Roy S (2006) Self-calibration of a general radially symmetric distortion model. Proc European Conference on Computer Vision (ECCV), (4): 186–199.
Tardif JP, Sturm PF & Roy S (2007) Plane-based self-calibration of radial distortion. Proc International Conference on Computer Vision (ICCV).
Thirthala S & Pollefeys M (2005) The radial trifocal tensor: a tool for calibrating the radial distortion of wide-angle cameras. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 321–328.
Triggs B (1997) Autocalibration and the absolute quadric. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 609–614.
Triggs B, McLauchlan P, Hartley R & Fitzgibbon A (2000) Bundle Adjustment – A Modern Synthesis, volume 1883 of Lecture Notes in Computer Science: 298–372.
Vedaldi A & Soatto S (2006) Local features, all grown up. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 1753–1760.
Vidal R, Heyden A & Ma Y (eds) (2007) Dynamical Vision, volume 4358 of Lecture Notes in Computer Science.
Viola P & Jones M (2001) Rapid object detection using a boosted cascade of simple features. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 511–518.
Wang H, Mirota D, Ishii M & Hager GD (2008) Robust motion estimation and structure recovery from endoscopic image sequences with an adaptive scale kernel consensus estimator. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Wang JYA & Adelson EH (1994) Representing moving images with layers. IEEE Transactions on Image Processing 3(5): 625–638.
Weiss Y (1997) Smoothness in layers: motion segmentation using nonparametric mixture estimation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 520–526.
Werghi N, Fisher R, Robertson C & Ashbrook A (1998) Modelling objects having quadric surfaces incorporating geometric constraints. Proc European Conference on Computer Vision (ECCV): 185–201.
Wills J, Agarwal S & Belongie S (2006) A feature-based approach for dense segmentation and estimation of large disparity motion. International Journal of Computer Vision (IJCV) 68(2): 125–143.
Wu Y, Zhu H, Hu Z & Wu F (2004) Camera calibration from the quasi-affine invariance of two parallel circles. Proc European Conference on Computer Vision (ECCV), (1): 190–202.
Xiao J, Chen J, Yeung DY & Quan L (2008) Learning two-view stereo matching. Proc European Conference on Computer Vision (ECCV), (3): 15–27.
Xin L, Wang Q, Tao J, Tang X, Tan T & Shum H (2005) Automatic 3D face modeling from video. Proc International Conference on Computer Vision (ICCV): 1193–1199.
Xiong Y & Turkowski K (1997) Creating image-based VR using a self-calibrating fisheye lens. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Xu G & Zhang Z (1996) Epipolar Geometry in Stereo, Motion and Object Recognition. Kluwer.
Xu K, Luxmoore AR & Davies T (1998) Sewer pipe deformation assessment by image analysis of video surveys. Pattern Recognition 31(2): 169–180.
Yang C, Sun F & Hu Z (2000) Planar conic based camera calibration. Proc International Conference on Pattern Recognition (ICPR): 1555–1558.
Yang G, Stewart CV, Sofka M & Tsai CL (2007) Registration of challenging image pairs: Initialization, estimation, and decision. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(11): 1973–1989.
Ying X & Hu Z (2004a) Can we consider central catadioptric cameras and fisheye cameras within a unified imaging model. Proc European Conference on Computer Vision (ECCV): 442–455.
Ying X & Hu Z (2004b) Catadioptric camera calibration using geometric invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 26(10).
Ying X & Zha H (2005) Linear catadioptric camera calibration from sphere images. Proc Workshop on Omnidirectional Vision, Camera Networks and Non-Classical Cameras (OMNIVIS).
Ying X & Zha H (2007) Camera calibration using principal-axes aligned conics. Proc Asian Conference on Computer Vision (ACCV), (1): 138–148.
Zhang H, Wong K & Zhang G (2007) Camera calibration from images of spheres. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(3): 499–502.
Zhang Z (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(11): 1330–1334.
Zhao W, Chellappa R, Phillips PJ & Rosenfeld A (2003) Face recognition: a literature survey. ACM Computing Surveys 35(4): 399–458.
Original articles
I Kannala J, Salo M & Heikkilä J (2006) Algorithms for computing a planar homography from conics in correspondence. Proc British Machine Vision Conference (BMVC) 1: 77–86.
II Kannala J & Brandt SS (2006) A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(8): 1335–1340.
III Kannala J, Heikkilä J & Brandt SS (2008) Geometric camera calibration. In Wah B (ed) Wiley Encyclopedia of Computer Science and Engineering. Hoboken, John Wiley & Sons Inc.
IV Kannala J, Brandt SS & Heikkilä J (2009) Self-calibration of central cameras from point correspondences by minimizing angular error. VISIGRAPP 2008, Revised Selected Papers. Communications in Computer and Information Science 24: 109–122.
V Kannala J, Brandt SS & Heikkilä J (2008) Measuring and modelling sewer pipes from video. Machine Vision and Applications 19(2): 73–83.
VI Kannala J & Brandt SS (2007) Quasi-dense wide baseline matching using match propagation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
VII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2008) Object recognition and segmentation by non-rigid quasi-dense matching. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
VIII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2009) Dense and deformable motion segmentation for wide baseline images. Proc Scandinavian Conference on Image Analysis (SCIA). Lecture Notes in Computer Science 5575: 379–389.
Reprinted with permission from IEEE (II, VI and VII), John Wiley & Sons (III), and Springer-Verlag (IV, V and VIII).
Original publications are not included in the electronic version of the dissertation.
ACTA UNIVERSITATIS OULUENSIS
Book orders: Granum: Virtual book store http://granum.uta.fi/granum/
SERIES C TECHNICA
337. Leinonen, Jouko (2009) Analysis of OFDMA resource allocation with limited feedback
338. Tick, Timo (2009) Fabrication of advanced LTCC structures for microwave devices
339. Ojansivu, Ville (2009) Blur invariant pattern recognition and registration in the Fourier domain
340. Suikkanen, Pasi (2009) Development and processing of low carbon bainitic steels
341. García, Verónica (2009) Reclamation of VOCs, n-butanol and dichloromethane, from sodium chloride containing mixtures by pervaporation. Towards efficient use of resources in the chemical industry
342. Boutellier, Jani (2009) Quasi-static scheduling for fine-grained embedded multiprocessing
343. Vallius, Tero (2009) An embedded object approach to embedded system development
344. Chung, Wan-Young (2009) Ubiquitous healthcare system based on a wireless sensor network
345. Väisänen, Tero (2009) Sedimentin kemikalointikäsittely. Tutkimus rehevän ja sisäkuormitteisen järven kunnostusmenetelmän mitoituksesta sekä sen tuloksellisuuden mittaamisesta [Chemical treatment of sediment. A study on dimensioning a restoration method for a eutrophic, internally loaded lake and on measuring its effectiveness]
346. Mustonen, Tero (2009) Inkjet printing of carbon nanotubes for electronic applications
347. Bennis, Mehdi (2009) Spectrum sharing for future mobile cellular systems
348. Leiviskä, Tiina (2009) Coagulation and size fractionation studies on pulp and paper mill process and wastewater streams
349. Casteleijn, Marinus G. (2009) Towards new enzymes: protein engineering versus Bioinformatic studies
350. Haapola, Jussi (2010) Evaluating medium access control protocols for wireless sensor networks
351. Haverinen, Hanna (2010) Inkjet-printed quantum dot hybrid light-emitting devices: towards display applications
352. Bykov, Alexander (2010) Experimental investigation and numerical simulation of laser light propagation in strongly scattering media with structural and dynamic inhomogeneities