Models and methods for geometric computer vision


UNIVERSITY OF OULU, P.O.B. 7500, FI-90014 UNIVERSITY OF OULU, FINLAND

ISBN 978-951-42-6150-3 (Paperback), ISBN 978-951-42-6151-0 (PDF), ISSN 0355-3213 (Print), ISSN 1796-2226 (Online)


Juho Kannala

MODELS AND METHODS FOR GEOMETRIC COMPUTER VISION

FACULTY OF TECHNOLOGY, DEPARTMENT OF ELECTRICAL AND INFORMATION ENGINEERING, UNIVERSITY OF OULU; INFOTECH OULU, UNIVERSITY OF OULU


ACTA UNIVERSITATIS OULUENSIS C Technica 353

JUHO KANNALA

MODELS AND METHODS FOR GEOMETRIC COMPUTER VISION

Academic dissertation to be presented with the assent of the Faculty of Technology of the University of Oulu for public defence in OP-sali (Auditorium L10), Linnanmaa, on 7 May 2010, at 12 noon

UNIVERSITY OF OULU, OULU 2010

Copyright © 2010
Acta Univ. Oul. C 353, 2010

Supervised by
Doctor Sami Brandt
Professor Janne Heikkilä

Reviewed by
Doctor Peter Sturm
Professor Kalle Åström

ISBN 978-951-42-6150-3 (Paperback)
ISBN 978-951-42-6151-0 (PDF)
http://herkules.oulu.fi/isbn9789514261510/
ISSN 0355-3213 (Printed)
ISSN 1796-2226 (Online)
http://herkules.oulu.fi/issn03553213/

Cover design
Raimo Ahonen

JUVENES PRINT
TAMPERE 2010

Kannala, Juho, Models and methods for geometric computer vision
Faculty of Technology, Department of Electrical and Information Engineering, University of Oulu, P.O.Box 4500, FI-90014 University of Oulu, Finland; Infotech Oulu, University of Oulu, P.O.Box 4500, FI-90014 University of Oulu, Finland
Acta Univ. Oul. C 353, 2010
Oulu, Finland

Abstract

Automatic three-dimensional scene reconstruction from multiple images is a central problem in geometric computer vision. This thesis considers topics that are related to this problem area. New models and methods are presented for various tasks in such specific domains as camera calibration, image-based modeling and image matching. In particular, the main themes of the thesis are geometric camera calibration and quasi-dense image matching. In addition, a topic related to the estimation of two-view geometric relations is studied, namely, the computation of a planar homography from corresponding conics. Further, as an example of a reconstruction system, a structure-from-motion approach is presented for modeling sewer pipes from video sequences.

In geometric camera calibration, the thesis concentrates on central cameras. A generic camera model and a plane-based camera calibration method are presented. The experiments with various real cameras show that the proposed calibration approach is applicable for conventional perspective cameras as well as for many omnidirectional cameras, such as fish-eye lens cameras. In addition, a method is presented for the self-calibration of radially symmetric central cameras from two-view point correspondences.

In image matching, the thesis proposes a method for obtaining quasi-dense pixel matches between two wide baseline images. The method extends the match propagation algorithm to the wide baseline setting by using an affine model for the local geometric transformations between the images. Further, two adaptive propagation strategies are presented, where local texture properties are used for adjusting the local transformation estimates during the propagation. These extensions make the quasi-dense approach applicable for both rigid and non-rigid wide baseline matching.

In this thesis, quasi-dense matching is additionally applied for piecewise image registration problems which are encountered in specific object recognition and motion segmentation. The proposed object recognition approach is based on grouping the quasi-dense matches between the model and test images into geometrically consistent groups, which are supposed to represent individual objects, whereafter the number and quality of grouped matches are used as recognition criteria. Finally, the proposed approach for dense two-view motion segmentation is built on a layer-based segmentation framework which utilizes grouped quasi-dense matches for initializing the motion layers, and is applicable under wide baseline conditions.

Keywords: camera calibration, image registration, image-based modeling, motion segmentation, object recognition, structure from motion

To my late brother Jaakko


Preface

My first contact with computer vision occurred during a summer traineeship at the Helsinki University of Technology almost ten years ago. Since that time, I have had opportunities to learn from many experts in the field, both in Finland and abroad. Now, upon completing the research for my thesis, I feel that it is the time to acknowledge several people who have helped in bringing this thesis to its completion.

I would like to express my gratitude to my instructors, Doctor Sami Brandt and Professor Janne Heikkilä, who have been great sources of ideas and advice through the years. In addition, I am grateful that they have given me freedom to pursue my own ideas in research. I am also indebted to my co-authors, Doctors Esa Rahtu and Mikko Salo, whose broad expertise and open-minded innovative attitude to research have been a good basis for fruitful collaboration.

I am grateful to the reviewers of the thesis, Doctor Peter Sturm and Professor Kalle Åström, for their constructive comments and feedback. I would also like to acknowledge Gordon Roberts for his help with the language revision of the manuscript.

The Machine Vision Group of the University of Oulu has been an excellent place for doing research. This is due to the helpful attitude and efforts of all the personnel, both research and support staff. In particular, I am grateful to Professor Matti Pietikäinen for his long-term work on advancing computer vision research in Oulu and for offering me the possibility to work in this group. Further, considering the research topics of this thesis, I would like to acknowledge Jukka Holappa for carefully optimizing the implementations of several algorithms studied in the thesis. Finally, as there are many important aspects to life other than research, I would like to thank Doctor Jani Boutellier and Pekka Koskenkorva for various discussions during the daily lunch and coffee breaks.

In addition to my home university in Oulu, I have had opportunities to interact with scientists in other research institutes. I am grateful to Doctors Charles Bouveyron, Stéphane Girard and Cordelia Schmid for their hospitality during my stay in INRIA Grenoble in 2005. Further, I would like to thank Doctor Jiří Matas for hosting my visit to the Center for Machine Perception (CMP) at the Czech Technical University in Prague in 2009. I am also grateful for the interesting discussions with many CMP members. For the collaboration with the sewer imaging application, I would like to acknowledge Professor Jouko Lampinen and Doctor Aki Vehtari from Aalto University, Juhani Korkealaakso and Hannu Maula from VTT Technical Research Centre of Finland, and Priit Uleksin from DigiSewer Productions Ltd.

The financial support provided by the Graduate School in Electronics, Telecommunication and Automation (GETA), the Emil Aaltonen Foundation, the Finnish Foundation for Technology Promotion, the Kaute Foundation, the Nokia Foundation, the Seppo Säynäjäkangas Science Foundation, and the Tauno Tönning Foundation is gratefully acknowledged.

Last but not least, I want to express my deepest gratitude to my family and friends for all the support during these years. Especially, I would like to thank Noora for her important support during the last stages of this work.

Oulu, February 2010

Juho Kannala


List of original articles

This dissertation is based on the following articles, which are referred to in the text by their Roman numerals (I–VIII):

I Kannala J, Salo M & Heikkilä J (2006) Algorithms for computing a planar homography from conics in correspondence. Proc British Machine Vision Conference (BMVC) 1: 77–86.

II Kannala J & Brandt SS (2006) A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(8): 1335–1340.

III Kannala J, Heikkilä J & Brandt SS (2008) Geometric camera calibration. In Wah B (ed) Wiley Encyclopedia of Computer Science and Engineering. Hoboken, John Wiley & Sons Inc.

IV Kannala J, Brandt SS & Heikkilä J (2009) Self-calibration of central cameras from point correspondences by minimizing angular error. VISIGRAPP 2008, Revised Selected Papers. Communications in Computer and Information Science 24: 109–122.

V Kannala J, Brandt SS & Heikkilä J (2008) Measuring and modelling sewer pipes from video. Machine Vision and Applications 19(2): 73–83.

VI Kannala J & Brandt SS (2007) Quasi-dense wide baseline matching using match propagation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

VII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2008) Object recognition and segmentation by non-rigid quasi-dense matching. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

VIII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2009) Dense and deformable motion segmentation for wide baseline images. Proc Scandinavian Conference on Image Analysis (SCIA). Lecture Notes in Computer Science 5575: 379–389.

The main responsibility of preparing all of the articles I–VIII was carried by the author of the dissertation. However, many ideas presented in the publications have been developed as team work as detailed in the following.

In Paper I, the first algorithm was developed by the author, whereas the original idea and implementation of the second algorithm was Prof. Heikkilä’s. The paper was mainly written by the author and Dr. Salo, who helped with the formulations and proofs.

Paper II was written by the author, who also carried out the experiments. Dr. Brandt was closely involved in developing the ideas, and gave valuable comments and detailed suggestions during the writing process.

The author wrote Papers III–VI and performed the related experiments. The ideas were developed together with the co-authors, who gave advice and guidance throughout the work.

The experiments in Papers VII and VIII were carried out by the author and Dr. Rahtu. Paper VII was written together by the author and Dr. Rahtu, whereas Paper VIII was mainly written by the author. The other co-authors participated by providing ideas and advice.


List of symbols and abbreviations

| · | Absolute value

|| · || L2-norm

↔ Correspondence between two entities

≃ Equality up to scale

A Affine transformation matrix or complex symmetric matrix

A−1 Inverse of matrix A

A⊤ Transpose of matrix A

A−⊤ Inverse transpose of matrix A

B Complex symmetric matrix

C Conic coefficient matrix

C,C′ A pair of corresponding conics in two views

D Asymmetric lens distortion function

det(A) Determinant of matrix A

Fp Central projection of a pinhole camera

Fr Central projection of a radially symmetric camera

F,F ′ The two focal points of an elliptic or hyperbolic mirror

f Image intensity function or focal length parameter

f ′ Image intensity function of the other view

g,g′ Positive window functions for two images

H Planar projective transformation

H Homography matrix

h A vector containing the elements of H

I,I ′ Image planes of two cameras

i Scalar index variable

j Scalar index variable

K Mapping from the virtual image plane to the real image plane

k1, . . . ,k5 The parameters of a generic radial projection function

l Distortion parameter for central catadioptric cameras

m Coordinates of a point in the image plane

Mij Matrix of size 9×9

M A matrix obtained by stacking several matrices Mij

n A scalar variable

O Projection center of a camera

P Camera projection function

Pc Internal camera projection function

Pi A point in space

pi A point in plane

Pd Projective space of dimension d

Rd Real space of dimension d

R Orthogonal matrix

R Composition of a rigid transformation and a projection onto sphere

r Radial projection function

Sf,g, Sf′,g′ Symmetric intensity moment matrices of two images

S, S′ Simplifying notations of Sf,g(0) and Sf′,g′(0), respectively

S′1/2 Square root of S′

S−1/2 Inverse square root of S

r Radial projection function

s Camera skew factor

ur,uϕ Unit vectors in the radial and tangential directions

u,v Two-dimensional Cartesian coordinate vectors

(u,v) Pixel coordinates

(u0,v0) Principal point of a camera

V Virtual image plane of a camera

X Coordinates of a point in space

x Coordinates of a point in plane

(x,x′) A pair of corresponding points in two views

(x,y) Cartesian plane coordinates

X ,Y,Z Cartesian coordinates

∆r Asymmetric radial distortion function

∆t Asymmetric tangential distortion function

γ Aspect ratio of a camera

ζ1,ζ2,ζ3 Parameters of the asymmetric radial distortion function

η1,η2,η3 Parameters of the asymmetric tangential distortion function

θ Inclination angle

ι1, ι2, ι3, ι4 Parameters of the asymmetric radial distortion function

ξ1,ξ2,ξ3,ξ4 Parameters of the asymmetric tangential distortion function


Φ Spherical angle coordinates

ϕ Azimuth angle

2-D Two-dimensional

3-D Three-dimensional

DLT Direct linear transformation

ETHZ Die Eidgenössische Technische Hochschule Zürich

RANSAC Random sample consensus

ROC Receiver operating characteristic

SVD Singular value decomposition

ZNCC Zero-mean normalized cross-correlation


Contents

Abstract
Preface
List of original articles
List of symbols and abbreviations
Contents

1 Introduction
1.1 Background and motivation
1.2 Scope of the thesis
1.3 Contributions
1.4 Summary of the original articles
1.5 Outline of the thesis

2 Geometry in computer vision
2.1 Introduction and background
2.2 Case study: Homography computation from corresponding conics
2.2.1 Related work
2.2.2 Algorithms
2.3 Discussion

3 Geometric camera calibration
3.1 Camera models
3.1.1 Taxonomy of camera models
3.1.2 Perspective cameras
3.1.3 Central omnidirectional cameras
3.2 Calibration methods
3.2.1 Photogrammetric calibration
3.2.2 Self-calibration
3.3 Discussion

4 Image-based scene reconstruction
4.1 Brief review of related work
4.1.1 Structure from motion
4.1.2 Image-based modeling
4.2 Application: Modeling sewer pipes from video
4.2.1 Motivation
4.2.2 Overview of the approach
4.3 Discussion

5 Quasi-dense matching
5.1 Introduction
5.2 Match propagation in the wide baseline case
5.3 Non-rigid quasi-dense matching
5.4 Application in object recognition
5.5 Framework for two-view motion segmentation
5.6 Discussion

6 Summary and conclusion

References

Original articles


1 Introduction

1.1 Background and motivation

Computer vision, as a discipline, relates to the automatic analysis and interpretation of images. Hence, computer vision provides methods for extracting meaningful information from image data. In addition, computer vision deals with the construction of artificial vision systems, which take the methods from theory to practice by software and hardware implementations. Today, computer vision systems are used in various applications in different fields, such as medical imaging, human-computer interaction, industrial inspection, photogrammetry, visual surveillance and robot navigation. Since the image data can be in many different forms (e.g. still photographs, video sequences, acoustic images, X-ray images, multidimensional medical images) and there is a wide variety of different application areas (e.g. content-based image and video retrieval, image-based modeling, image registration, object detection and recognition, motion segmentation), computer vision is a broad and diverse subject and it is closely connected to other fields of science and technology.

In fact, like many other sub-fields of computer science, computer vision has a close relationship to mathematics. For example, many methods in computer vision are based on geometry, statistics and optimization. Besides mathematics, physics is also sometimes important in the development of computer vision methods. For instance, the laws of geometric optics describe the process of image formation in a photographic camera and their understanding is useful when computer vision techniques are used for measurement purposes. In addition, the field of signal processing is important to computer vision since typically the high-level computer vision methods are based on low-level signal processing techniques. Furthermore, computer graphics and machine learning are two sub-fields of computer science that are closely related to computer vision. Computer graphics studies the problem of how to render realistic images of a scene given a model of it, whereas computer vision studies the corresponding inverse problem, that is, how to extract a description of a scene from its images. However, despite the different viewpoint, many basic concepts are common between these two fields. On the other hand, the relationship between machine learning and computer vision lies in the fact that machine learning techniques can often be applied to computer vision problems. Actually, many tasks in computer vision can be approached by learning a function from training data. For example, object categorization typically involves learning a classifier from pre-categorized training images. Finally, fields that study biological vision, such as neurobiology and computational neuroscience, also have connections to computer vision, since the understanding of the human visual system can be useful for designing artificial vision systems and vice versa.

Currently, computer vision is a very active field of research, and there has recently been rapid development in several problem areas. Many things that were not possible ten years ago are now reality. For example, related to applications in robotics, it has been demonstrated that purely vision-based robot localization and mapping is possible in real-time (Davison et al., 2007) and, within the area of image-based modeling, it has been shown that realistic and automatic 3-D model building from Internet photo collections is possible (Snavely et al., 2008; Goesele et al., 2007). In the field of object recognition, there are recent works which, for instance, use text retrieval techniques for efficient object matching in videos (Sivic & Zisserman, 2009), utilize decision trees for recognizing specific objects from large databases (Obdržálek & Matas, 2005; Nistér & Stewénius, 2006) or apply machine learning techniques for object category recognition (Fergus et al., 2003; Sivic et al., 2005; Lazebnik et al., 2006). In addition, there has been significant progress within biometric applications of computer vision. For example, methods for face detection (Viola & Jones, 2001) and recognition (Zhao et al., 2003; Bowyer et al., 2006; Ahonen et al., 2006) have matured to the stage where they could be used in practical systems.

Many of the aforementioned advanced applications are based on some relatively recent advances in basic research, such as the understanding of the geometry of multiple views (Hartley & Zisserman, 2000; Faugeras et al., 2001) or the emergence of methods for viewpoint invariant detection and description of local image features (Mikolajczyk et al., 2005; Mikolajczyk & Schmid, 2005; Lowe, 2004; Heikkilä et al., 2009). Besides the progress of the field itself, computer vision has benefited from advances in related fields. In part, the development has been driven by the increase in the available computational power which has made many computationally intensive techniques more tractable and attractive also in practical applications. This has again resulted in growing interest in the field and intensified research efforts, which has further expedited the development. In fact, the ever increasing demand for new applications has also been an important driving force for the advancements. For example, the amount of digital imagery that is being created and stored is growing fast, partly due to the increasing number of low-cost cameras (e.g. in mobile phones), and there is an increasing need for automatic methods to manage and process this kind of image data.

As described, computer vision is a broad subject and only a small part of it can be covered in the scope of a doctoral thesis. In this thesis, the focus is on geometric aspects of computer vision. The topics discussed here are related to such specific problem areas as multiple view geometry, geometric camera calibration, structure from motion, image-based modeling, image registration, object recognition and motion segmentation. However, the field of computer vision, in a broad sense, provides the motivation and background as well as the broader context of application for the models and methods studied in this work.

1.2 Scope of the thesis

The aim of this thesis is to contribute to the knowledge of geometric computer vision and to provide practical methods for the problems of the field. The discussion covers a variety of different problems which involve geometric aspects. Even though the topics and problems considered in the original articles I–VIII may seem diverse at first, they are all related to the context of a 3-D reconstruction pipeline, which is schematically illustrated in Figure 1 and which largely defines the framework for the thesis. In the following, we briefly describe research problems within the 3-D reconstruction pipeline and their connection to the work of the thesis.

Recovering a three-dimensional model of a scene from multiple camera views is a classical problem in computer vision (Marr, 1982; Horn, 1986; Faugeras, 1993; Hartley & Zisserman, 2000; Faugeras et al., 2001). Typically, the problem is divided into several subproblems, such as sparse feature extraction and matching, structure from motion, photogrammetric camera calibration or self-calibration, and dense surface reconstruction (i.e. dense stereo or multi-view stereo). These subproblems are themselves broad subjects and involve a lot of research. However, despite the extent of the problem area, recent research has produced many satisfactory solutions to the different subproblems, so that nowadays the construction of a generic and automatic 3-D reconstruction pipeline is possible in practice, and several such systems have been built (Fitzgibbon & Zisserman, 1998; Pollefeys, 1999; Nistér, 2001; Lhuillier & Quan, 2005). The first systems of this kind used continuous video sequences as their input, but some of the more recent approaches are able to produce realistic reconstructions using only a few sparsely captured images (Martinec, 2008), or even unorganized image sets from Internet photo collections (Snavely et al., 2008; Goesele et al., 2007).

[Figure 1: block diagram with the stages Input images, Sparse matching, Structure-from-motion, Self-calibration, Dense reconstruction, Model out.]

Fig. 1. A standard pipeline for scene reconstruction from multiple perspective images. The self-calibration stage may be omitted for precalibrated cameras.

In this thesis, the aim is not to describe a generic system for multi-view reconstruction, but to consider some specific topics that are related to the problem area. Central themes of the thesis are geometric camera calibration and quasi-dense image matching. In addition, the topics discussed involve theoretically oriented research, which is related to methods for determining a planar homography from corresponding conics in two images, as well as applied research in a sewer imaging application.

The theoretical framework for the thesis is the projective geometry of multiple views which is covered in the books (Hartley & Zisserman, 2000) and (Faugeras et al., 2001), from the viewpoint of computer vision. Within the area of multiple view geometry, the research of the thesis touches on a subject in the estimation of two-view relations. That is, we study algorithms for computing a planar projective transformation (i.e. a homography) between two images using corresponding conics, such as ellipses. This topic is the most theoretically oriented research topic of the thesis.

A central part of the thesis is geometric camera calibration, which is the process of determining the imaging geometry of a camera, and is a prerequisite for image-based metric three-dimensional measurements. Here, we concentrate on the photogrammetric calibration of omnidirectional central cameras, such as fish-eye lens cameras. However, we also briefly discuss the self-calibration of central cameras.
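To illustrate why fish-eye lens cameras call for a projection model other than the perspective one, the following sketch compares two textbook radial projection functions, where r is the distance of the image point from the principal point and θ is the angle between the incoming ray and the optical axis. It is only an illustration under stated assumptions: the function names are hypothetical, the focal length value is arbitrary, and the odd-polynomial form is assumed here as one way to use radial parameters such as the k1, ..., k5 listed among the symbols; this section itself does not state the formula.

```python
import math


def perspective_radius(theta, f):
    """Perspective (pinhole) projection: r = f * tan(theta)."""
    return f * math.tan(theta)


def equidistance_radius(theta, f):
    """Equidistance fish-eye projection: r = f * theta."""
    return f * theta


def polynomial_radius(theta, k):
    """Odd polynomial r = k[0]*theta + k[1]*theta**3 + k[2]*theta**5 + ...

    Assumed form for illustration; k is a sequence of coefficients.
    """
    return sum(ki * theta ** (2 * i + 1) for i, ki in enumerate(k))


if __name__ == "__main__":
    f = 1.0  # arbitrary focal length, chosen only for this demo
    for degrees in (10, 45, 80, 89):
        theta = math.radians(degrees)
        print(degrees,
              round(perspective_radius(theta, f), 3),
              round(equidistance_radius(theta, f), 3))
```

The perspective radius grows without bound as θ approaches 90 degrees, whereas the equidistance radius stays finite, which is why a single perspective model cannot cover a field of view of 180 degrees or more.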

Camera calibration is required for image-based measurements, and the metrology application studied in this thesis is the measurement and modeling of sewer pipes from video. Our approach to this application problem is to use structure from motion techniques to recover the shape of a sewer pipe from a video sequence, which is captured by a precalibrated fish-eye lens camera moving through the pipe. We additionally study methods for building a tubular model of the pipe based on the reconstructed interest points. Hence, the sewer imaging system provides a practical, application-specific example of a complete 3-D reconstruction pipeline, and it is the most applied theme in the thesis.

An important part of the thesis is a quasi-dense approach to image matching. Our work is built on previous works (Lhuillier & Quan, 2002) and (Lhuillier & Quan, 2005) which have shown that growing a tentative set of sparse keypoint correspondences into a set of quasi-dense matches between two images provides a good basis for two-view geometry estimation and surface reconstruction. In this thesis, we extend the quasi-dense approach to the case where the viewpoints of the two cameras differ substantially (the so-called wide baseline case), and study its applications in piecewise image registration problems, such as specific object recognition and motion segmentation. Thus, here the problem of image matching provides a connection between recognition and reconstruction, which are two classical problems in vision. In fact, this connection has been recently studied also by other researchers from different points of view (Rothganger et al., 2006; Cornelis et al., 2008) and, overall, integration of recognition and reconstruction is an interesting and timely topic.
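Growing sparse correspondences into quasi-dense matches requires some score for deciding which candidate pixel pairs look alike. The list of abbreviations includes zero-mean normalized cross-correlation (ZNCC), so the sketch below shows how such a score can be computed for two equally sized grey-level patches. It is a minimal illustration, not the matching method of the thesis: the function name, the nested-list patch representation and the handling of flat patches are assumptions made here.

```python
import math


def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches.

    Patches are nested lists of grey values. The score lies in [-1, 1] and is
    unchanged by a positive gain and an offset applied to either patch.
    """
    a = [value for row in patch_a for value in row]
    b = [value for row in patch_b for value in row]
    if not a or len(a) != len(b):
        raise ValueError("patches must be non-empty and of equal size")

    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    dev_a = [value - mean_a for value in a]
    dev_b = [value - mean_b for value in b]

    denominator = math.sqrt(sum(x * x for x in dev_a) * sum(y * y for y in dev_b))
    if denominator == 0.0:
        # A constant patch has no texture; this sketch treats it as "no evidence".
        return 0.0
    return sum(x * y for x, y in zip(dev_a, dev_b)) / denominator
```

In a propagation scheme of the kind described above, candidate matches near an accepted match could be ranked by such a score and the best ones accepted first; the acceptance thresholds and neighbourhood rules are design choices that this sketch does not cover.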

1.3 Contributions

The main contributions of the thesis are listed below.

– Two algorithms are proposed for computing a planar homography from corresponding conics in two images.

– A generic camera model and camera calibration method are presented. The proposed calibration approach is applicable for both conventional cameras and omnidirectional cameras, such as fish-eye lens cameras.

– The calibration method is implemented as a Matlab toolbox and tested in practice with real cameras of various types.

– A method is proposed for the self-calibration of central cameras from two-view point correspondences.

– An error analysis is performed for the structure from motion approach that recovers the interior structure of a sewer pipe from a video sequence which is scanned by a robot moving inside the pipe. In addition, a method for modeling tubular surfaces from sets of three-dimensional points is presented.

– A method for quasi-dense wide baseline image matching is proposed. The method is an extension to the match propagation algorithm, which has been described earlier.

– A non-rigid variant of the quasi-dense matching method is presented. In addition, a method is described for grouping the quasi-dense two-view matches into geometrically consistent groups, which are supposed to lie on smooth surfaces that represent individual objects. The proposed techniques are applied to the simultaneous recognition and segmentation of specific objects in photographs.

– A layer-based framework for dense and deformable two-view motion segmentation is described. The proposed framework utilizes quasi-dense matching for initializing the motion layers, and is applicable under wide baseline conditions.

1.4 Summary of the original articles

This thesis is based on eight articles in which the contributions listed above were originally published. The articles are reprinted in the appendix of the thesis, and their content is summarized below.

Paper I presents two new algorithms for determining a planar homography from

corresponding conics in two images. The first algorithm is based on solving a set of

overdetermined linear equations, which are derived from the correspondence equations

for the elements of the homography matrix, and it can be applied when there are three

or more conic correspondences. The second algorithm is for the minimal case, i.e., it

allows one to compute the homography from only two conic correspondences. Unlike

some earlier approaches, this algorithm involves only linear algebra, e.g. eigendecom-

positions of complex symmetric matrices, and does not require solving high-degree

polynomial equations. Hence, both of the proposed algorithms are relatively easy to

implement.

Paper II describes a generic camera model which is suitable for both conventional

perspective cameras and central omnidirectional cameras, such as fish-eye and wide-

angle lens cameras and catadioptric cameras. The key idea of the article is to present

a flexible geometric camera model which allows accurate modeling of various types of

real cameras. Because real lenses and mirrors may deviate from precise radial symme-

try, the proposed model contains an asymmetric part whose purpose is to account for the

imperfections of the optical system. Also, a method is described for the computation of

the inverse camera model by using a first-order approximation for the asymmetric dis-

tortion function. In addition to the generic camera model, Paper II presents a four-step

algorithm for determining the parameters of the model. The algorithm utilizes images

of a planar calibration object, which contains control points in known positions. Paper


II is partly based on earlier works (Kannala & Brandt, 2004; Kannala, 2004). Nevertheless, unlike the method of Paper II, the calibration procedure proposed in these

earlier works is not suitable for omnidirectional cameras whose field of view exceeds

the hemisphere.

Paper III is a review article about geometric camera calibration. The article provides

an overview of camera models and calibration methods used in the field. The focus is

on conventional calibration techniques in which the parameters of the camera model

are determined by using images of a calibration object whose geometric properties are

known. Additionally, Paper III contains calibration experiments where the method of

Paper II is quantitatively evaluated with various types of real cameras. In the experi-

ments, we used three dioptric cameras equipped with a narrow-angle lens, a wide-angle

lens, and a fish-eye lens, and two catadioptric cameras, which were constructed by plac-

ing two different mirrors in front of the narrow-angle lens camera. The mirrors used

were a hyperbolic mirror and an equiangular mirror.

Paper IV proposes a self-calibration method for central cameras. The method uses

two-view point correspondences, and estimates the camera parameters by minimizing

the two-view angular error, which does not depend on the 3-D coordinates of the point

correspondences. A low-parameter camera model, which is suitable for different kinds

of radial distortions, is used in the minimization. Hence, the self-calibration problem

results in a small-scale optimization problem. However, the cost function may have

many local minima and, in order to avoid them, a multi-step optimization approach is

proposed.

Paper V presents a system for modeling sewer pipes from videos acquired by a fish-

eye lens camera moving in the pipe. The approach is based on tracking interest points

across the video sequence and addressing the related structure from motion problem.

Naturally, the proposed method requires that the interior surface of the pipe is suffi-

ciently textured so that interest points can be extracted. The experiments with a real

sewer video, scanned inside an eroded concrete pipe, show that the structure of the pipe

can be reliably reconstructed despite the forward motion of the camera. The tubular ar-

rangement of the reconstructed points allows us to build a parametric model of the pipe

by surface fitting. In fact, Paper V additionally proposes a practical method for model-

ing tubular surfaces with a locally cylindrical model. Paper V is partly based on earlier

works (Kannala & Brandt, 2005) and (Kannala, 2004). However, these earlier works do

not discuss the modeling issue nor the error analysis of the recovered structure.


Paper VI describes a method for quasi-dense wide baseline image matching. The

method extends the match propagation algorithm, which is a technique for expanding

matching regions using local pixel-wise propagation, and makes it suitable for wide

baseline cases, where the camera pose may vary considerably between the two views.

There are basically two main extensions that are proposed in the article. The first ex-

tension is to use an affine model for the local geometric transformations between the

images. The local affine transformations are initialized from corresponding interest

regions which are used as seed matches. The second extension is to use the second

order intensity moments for adapting the estimates of the local affine transformations

during the match propagation. Besides intensity moments, the adaptation requires that

the epipolar geometry is known. However, typically already the locally constant affine

transformation model improves matching and provides a good result if the seed matches

are dense enough to cover the different surfaces that are to be matched.

Paper VII presents a non-rigid quasi-dense matching approach by extending the

adaptive method of Paper VI so that the adjustment of the local affine transformations

is based on local image gradients and second order intensity moments. This implies

that the epipolar constraint is not needed for adjusting the transformations and, hence,

adaptive match propagation can be used in non-rigid image registration where global

geometric constraints are not available. Additionally, Paper VII applies quasi-dense

matching for object recognition and segmentation. In detail, a method is proposed

for grouping the quasi-dense pixel matches between the model and test images into

geometrically consistent groups which are supposed to represent the common objects

in the images. The grouping is based on a local grouping criterion which utilizes the

local affine transformation estimates acquired during the propagation. The number and

quality of matches in the obtained groups are used as decision criteria for recognition

whereas the location of matching pixels gives the segmentation. The experiments in

Paper VII show that the quasi-dense approach improves the reliability of recognition

compared to such approaches that only use a sparse set of interest regions for matching.

Paper VIII proposes a dense and deformable motion segmentation method for wide

baseline image pairs. The method is based on a bottom-up segmentation strategy, which

starts from a sparse set of seed matches, then proceeds to quasi-dense matching and, fi-

nally, uses the grouped quasi-dense matches to initialize the motion layers for the dense

segmentation stage, where the geometric and photometric transformations of the layers

are refined simultaneously with the segmentation. Thus, because the quasi-dense ap-

proach is used for initializing the motion layers, Paper VIII builds on the work of Paper


VII. In fact, the problem of two-view motion segmentation can be seen as a generalization of the object recognition problem, where the other image (i.e. model image)

is typically presegmented. So, in general, two-view motion segmentation requires the

solving of the following two subproblems simultaneously: (a) recognition of groups of

pixels that move together (from both images), and (b) estimation of the motion fields

associated to each group. The key contribution of Paper VIII is a motion segmentation

method which can deal with large non-rigid motions and large illumination changes.

1.5 Outline of the thesis

This thesis consists of an overview and an appendix, which contains the original articles.

The rest of the overview is organized as follows. First, Chapter 2 describes background

for geometric computer vision and discusses the work on conic-based homography esti-

mation. Chapter 3 deals with geometric camera calibration. Chapter 4 concentrates on

image-based scene reconstruction and introduces the sewer imaging application. The

topics related to quasi-dense matching are presented in Chapter 5 and conclusions are

in Chapter 6.


2 Geometry in computer vision

2.1 Introduction and background

Geometry is an important aspect of computer vision. The laws of geometry and optics

describe how the three-dimensional world is imaged on the camera sensor and, hence,

an understanding of imaging geometry is important for the development of automatic

image analysis methods. For example, a classical problem in geometric computer vision

is to determine automatically the three-dimensional structure of a scene from several

two-dimensional images. This geometric inverse problem is called the structure from

motion problem, and the approaches for solving it are based on geometric knowledge.

In fact, many current structure from motion systems are based on relatively recent ad-

vances in the theory and practice of geometric computer vision. Thus, although there is

a long history of using photographic images for measurement purposes in photogram-

metry (Slama, 1980), there has been significant progress in this area during the past

twenty years.

On the theoretical side, the progress has been achieved by applying concepts from

classical projective and algebraic geometry to problems in computer vision. Besides

just applying classical tools to new computational problems, the research has addition-

ally produced new theoretical results. For instance, the theory related to multiple view

tensors, which characterize the matching constraints for multiple views of a rigid scene,

was largely developed during the 1990s. The advances acquired in geometric computer

vision during the last two decades of the 20th century are summarized in (Hartley &

Zisserman, 2000) and (Faugeras et al., 2001). These works give a comprehensive introduction to the projective geometry of multiple views, as well as several algorithms for applying the theory to practical computations. The viewpoint in (Hartley & Zisserman, 2000) and (Faugeras et al., 2001) is application oriented, and the emphasis is on relatively recent results. A more formal approach to the classical mathematical foundations of algebraic projective geometry can be found in (Semple & Kneebone, 1952),

whereas (Stolfi, 1991) deals with oriented projective geometry. An early study of three-

dimensional computer vision is (Faugeras, 1993), and the geometry of stereo vision is

discussed in (Xu & Zhang, 1996). In addition, (Kanatani, 1993) and (Kanatani, 1996)

discuss many computational and statistical aspects of geometric computer vision.


Most of the research on geometric vision problems during the 1980s and 1990s was related to the scenario where a single perspective camera moves in a rigid scene. A characteristic feature of the developed theory and algorithms is the uncalibrated approach,

where the estimation of geometric entities can be performed in a projective coordinate

frame without knowing the internal camera parameters. Compared to the traditional

photogrammetric approach, which requires precalibrated cameras, the uncalibrated ap-

proach provides a more complete understanding of the geometry of multiple views, and

allows automatic metric scene reconstruction from image sequences via self-calibration;

this is the process that automatically determines the internal camera parameters from

the images (Hartley & Zisserman, 2000). Overall, the research that has led to the cur-

rent knowledge in geometric computer vision involves numerous contributions from

several researchers, and not all of them can be listed here. Hence, we mainly refer to

the aforementioned books which provide further references.

As briefly reviewed above, the multiple view geometry of rigid scenes is currently

relatively well understood, and several books have been written about the topic. There-

fore, recent research has expanded also to other areas of geometric computer vision.

Such areas include, for example, the application of algebraic geometry for solving sets

of nonlinear polynomial equations which arise from minimal problems in computer

vision (Stewénius, 2005; Byröd et al., 2009; Kukelova et al., 2008), omnidirectional

vision and generic camera models (Daniilidis & Klette, 2006), and the field of dynamic

vision (Vidal et al., 2007), which can be seen to involve various vision problems related

to dynamic environments, such as deformable image registration, non-rigid structure

from motion, and motion segmentation.

Besides the advances in theoretical understanding, also the methodological advances

related to practical computational issues have been significant for the development of

vision systems. In particular, such issues as stability, robustness and precision of esti-

mation algorithms, which are used to estimate geometric entities from image data, are

relevant for the performance of real-world vision systems. Actually, since the purpose

of geometric computer vision is to provide the means for extracting geometric scene

information automatically from images, the estimation of geometric models is an essen-

tial part of most problems in the field. Typical estimation problems in rigid structure

from motion involve such geometric entities as planar homographies, fundamental ma-

trices, trifocal tensors, and camera projection matrices.

However, the data that is used in geometric estimation problems usually contains

measurement noise and outliers, i.e., observations which do not fit into the model. For


example, the automatically matched tentative interest point correspondences, which are

commonly used for structure and motion recovery, are localized with limited precision

and may contain false matches. Thus, the estimation algorithms must be stable with re-

spect to noise and robust to outliers (Hartley & Zisserman, 2000). Often the robustness

to outliers is achieved by using the Random Sample Consensus paradigm (RANSAC)

(Fischler & Bolles, 1981). In fact, there is a lot of early and recent research related to

different minimal problems which can be efficiently solved in the RANSAC framework,

e.g. (Nistér, 2004; Brown et al., 2007). Finally, it is also desirable that estimation procedures are precise so that estimation errors decrease when the amount of data increases.

Hence, often maximum likelihood estimates are computed after RANSAC-based outlier

rejection (Hartley & Zisserman, 2000). Overall, general advances in numerical methods

have turned out to be useful also in many vision problems, which are often computationally intensive and involve large scale optimization (Triggs et al., 2000; Kahl et al., 2008).

As described above, estimation of geometric entities from image data is a central

theme in geometric computer vision. One particular estimation problem studied in

this thesis is the computation of a planar homography from corresponding coplanar

conics. Paper I proposes two algorithms for this problem and they are summarized in

the following section.

2.2 Case study: Homography computation from corresponding conics

In perspective imaging, a scene plane is mapped to the image plane by a planar ho-

mography. Most often the homography between two planes is determined from point

correspondences (Hartley & Zisserman, 2000) but, in this thesis, Paper I studies the

problem of determining the homography from conic correspondences.

Paper I presents two algorithms for homography computation from corresponding

coplanar conics. In general, the first algorithm is for the case where there are three

or more conic correspondences and the second algorithm is for the minimal case of

two conic correspondences. The algorithms can be used to compute the homography

between a known scene plane and its perspective image or between two different per-

spective views of an unknown scene plane, as illustrated in Figure 2. Hence, in addition

to image registration, the algorithms might be useful in applications where homographies have to be estimated in order to extract either the internal or external parameters

of a perspective camera. For instance, plane-based camera calibration is one example

of such an application (Sturm & Maybank, 1999).

Fig. 2. An image registration example. Two perspective views of a plane contain-

ing white circles; the detected ellipses are in cyan. The homography was esti-

mated by the method of Paper I and is illustrated with the difference image on the

right.

The algorithms of Paper I are summarized in Section 2.2.2. First, however, the follow-

ing subsection provides a brief review of previous works that are related to conics and

their applications within computer vision.

2.2.1 Related work

Estimation of a planar homography from image data is an important part of many com-

puter vision algorithms. The most common approach to homography estimation is to

use point or line correspondences. In this case, a well-known technique, called the Di-

rect Linear Transformation method (DLT), allows us to formulate the estimation prob-

lem as a system of linear equations which can be solved by using at least four corre-

spondences (Hartley & Zisserman, 2000). However, in addition to points and lines,

correspondences of planar contours, such as conics, have also been used for homogra-

phy estimation (Jain & Jawahar, 2006). In fact, in a sense, conics can be seen as one

of the most fundamental image features, together with points and lines, because all of

these features are invariant under projective transformations. Actually, there has been

quite a lot of research on the geometry of conics in computer vision. The motivation

for studying conics arises from both scientific curiosity and practical problems. For

example, in some cases, reliable point or line features may not be available, whereas

higher order parametric curves, such as conics, can be recovered robustly.
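The point-based DLT mentioned above can be sketched as follows. This is a minimal illustration, assuming NumPy and exact or low-noise correspondences; the function name `homography_dlt` is hypothetical, and the coordinate normalization that is usually recommended for noisy data is omitted for brevity.

```python
import numpy as np

def homography_dlt(src, dst):
    """Estimate H (3x3, up to scale) such that dst ~ H @ src.

    src, dst: (n, 2) arrays of corresponding points, n >= 4.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        p = np.array([x, y, 1.0])
        # Each correspondence gives two independent linear equations in
        # the nine entries of H (stacked row by row in the vector h).
        rows.append(np.concatenate([np.zeros(3), -p, v * p]))
        rows.append(np.concatenate([p, np.zeros(3), -u * p]))
    A = np.vstack(rows)
    # The right singular vector of the smallest singular value minimizes
    # ||A h|| subject to ||h|| = 1.
    _, _, vt = np.linalg.svd(A)
    return vt[-1].reshape(3, 3)
```

Four correspondences in general position give eight equations, which is enough to fix the nine entries of H up to scale; additional correspondences make the system overdetermined.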


Detection and estimation of conics from images is a prerequisite for applying conic-based algorithms. Generic segmentation paradigms, such as the Hough transform (Ballard, 1981) or RANSAC technique (Fischler & Bolles, 1981), can be used for conic

detection, i.e. for segmenting edge points into sets that lie on conics (Rosin & West,

1995), and there are several estimation methods for fitting conics to segmented sets of

2-D points. In fact, the problem of conic fitting has been well studied, and there are

various non-iterative fitting methods (Bookstein, 1979; Rosin, 1993; Fitzgibbon et al.,

1999) which typically minimize some kind of an algebraic distance, as well as iterative

approaches which minimize the geometrical error (Sturm & Gargallo, 2007), i.e. the

sum of squared distances between the 2-D points and the conic, or an approximation

of it (Sampson, 1982; Kanatani, 1994). The rationale for minimizing the geometrical

error is the fact that it provides the maximum likelihood estimate of the conic under the

assumption of isotropic Gaussian noise in the point measurements.
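As a minimal illustration of non-iterative conic fitting, the sketch below minimizes the algebraic distance under a plain unit-norm constraint on the six conic parameters. This is only the simplest normalization and is not claimed to be the method of any of the cited works; the function name `fit_conic_algebraic` is hypothetical and NumPy is assumed.

```python
import numpy as np

def fit_conic_algebraic(points):
    """Fit a conic x^T C x = 0 to 2-D points by minimizing algebraic distance.

    points: (n, 2) array, n >= 5. Returns the symmetric 3x3 matrix C, scaled
    so that the six-parameter vector (a, b, c, d, e, f) has unit norm.
    """
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Design matrix for a x^2 + b xy + c y^2 + d x + e y + f = 0.
    D = np.column_stack([x * x, x * y, y * y, x, y, np.ones_like(x)])
    # Smallest right singular vector minimizes ||D p|| subject to ||p|| = 1.
    _, _, vt = np.linalg.svd(D)
    a, b, c, d, e, f = vt[-1]
    return np.array([[a, b / 2, d / 2],
                     [b / 2, c, e / 2],
                     [d / 2, e / 2, f]])
```

The returned matrix can be used directly as a conic coefficient matrix, since the scale of C does not affect the curve it defines.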

As there are means for extracting conics from images, several conic-based proce-

dures have been proposed for various problem areas, such as object recognition (Forsyth

et al., 1991; Carlsson, 1993), structure from motion (Quan, 1996; Kahl & Heyden, 1998;

Schmid & Zisserman, 2000), and camera calibration (Yang et al., 2000; Wu et al., 2004; Chen et al., 2004; Gurdjos et al., 2006; Ying & Zha, 2007). For instance, Forsyth et al.

(1991) and Carlsson (1993) derive two projective invariants for a pair of coplanar conics

and use them, together with other invariant descriptors of planar shapes, for recogniz-

ing curved plane objects irrespective of their pose. In addition to pose invariant object

recognition, Forsyth et al. (1991) describe an algorithm that utilizes two coplanar conic

correspondences for determining the relative pose of a scene plane with respect to the

camera. The pose recovery problem is equivalent to determining the homography be-

tween the scene plane and the image plane for a calibrated camera. The approach

presented by Forsyth et al. (1991) requires the solution of quartic equations.

Within the area of structure from motion, conics have been studied from several

points of view. For example, Quan (1996) describes methods for projective and metric

reconstruction of plane conics from two images assuming that the camera projection ma-

trices are known. On the other hand, Kahl & Heyden (1998) and Kaminski & Shashua

(2004) utilize conic correspondences for epipolar geometry estimation. Furthermore,

Schmid & Zisserman (2000) deal with the geometry and matching of lines and curves

over multiple views. In particular, it is shown by Schmid & Zisserman (2000) that,

given the epipolar geometry, the homography induced by a plane can be determined

from one conic correspondence.


In camera calibration, conics can be utilized in various ways. One alternative is to

use conic correspondences to compute the homographies between a planar calibration

pattern and its images, as described in Paper I, and then determine the internal cam-

era parameters by utilizing the geometric constraints derived from the homographies

(Sturm & Maybank, 1999). This is also the approach by Yang et al. (2000) but the

method proposed there is restricted to concentric conics. However, in certain cases, if

the values of some internal camera parameters are known, or if a special calibration

pattern is used, one may derive constraints directly for the internal camera parameters

and avoid computing the homographies for each calibration image. In fact, there are

many works which study the calibration constraints that can be used with different cal-

ibration patterns containing conics. For example, there are studies which use coplanar

circles (Chen et al., 2004; Wu et al., 2004), axis-aligned conics (Ying & Zha, 2007), or confocal conics (Gurdjos et al., 2006). Besides plane-based calibration, also images of spheres have been used for the calibration of both perspective cameras (Zhang et al.,

2007) and catadioptric cameras (Ying & Zha, 2005). The latter approach utilizes the

fact that, in addition to perspective cameras, also certain catadioptric cameras project

the occluding contour of a sphere onto a conic in the image (Ying & Hu, 2004b).

As described above, there are many problem areas where conics have been utilized

as image features. In addition to the aforementioned studies, there are previous works

that concentrate on the same problem as Paper I. The closest works to Paper I that we are

aware of are (Sugimoto, 2000), (Mudigonda et al., 2004) and (Ma, 1993). In (Sugimoto, 2000) a linear algorithm is proposed for solving the homography from correspondences of coplanar conics in a general configuration. The algorithm is based on considering conics as points in the projective space $\mathbb{P}^5$ and the homography is determined from the corresponding conic-based transformation which is a linear mapping from $\mathbb{P}^5$ to $\mathbb{P}^5$.

This approach requires at least seven correspondences, whereas the linear algorithm

presented in Paper I requires only three correspondences.

The minimum number of conic correspondences that is required for solving the

homography is two. A method for computing the homography from two conic correspondences is proposed by Mudigonda et al. (2004). This method requires solving

polynomial equations, whereas the approach of Paper I requires eigendecompositions

of symmetric matrices. However, recently, after the publication of Paper I, we became

aware of the work by Ma (1993), which also discusses the problem of determining a

planar homography from two coplanar conics. The approach proposed by Ma (1993)

is similar in spirit to that of Paper I. Nevertheless, the algorithms are independently


developed and different in their details. Furthermore, Paper I discusses the effect of

measurement errors in the conic coefficients, which is a topic that is not covered by Ma

(1993).

2.2.2 Algorithms

Both algorithms proposed in Paper I are based on linear algebra and utilize the frame-

work of algebraic projective geometry (Semple & Kneebone, 1952; Hartley & Zisser-

man, 2000). The basic ideas of the algorithms are summarized below; the details can

be found in the original article.

Problem setting

A conic is a second degree curve in the plane and, by using homogeneous coordinates $\mathbf{x}$, it is defined by an equation of the form $\mathbf{x}^\top C \mathbf{x} = 0$, where $C$ is a real symmetric $3 \times 3$ matrix, which contains the parameters of the conic.

In the problem of Paper I, it is assumed that one has identified the conic correspondences $C_i \leftrightarrow C'_i$, $i = 1, \ldots, n$, between two planes which are related by a homography represented with a non-singular $3 \times 3$ matrix $H$. Further, it is assumed that the conics are non-degenerate, so that $\det C_i \neq 0$, $\det C'_i \neq 0$.

It is well known that, under a point transformation $\mathbf{x}' \simeq H\mathbf{x}$, a conic $C_i$ transforms to

$$C'_i \simeq H^{-\top} C_i H^{-1}, \tag{1}$$

where $\simeq$ denotes equality up to scale (Hartley & Zisserman, 2000). Since the scale of matrix $H$ is insignificant, one may fix it by $(\det H)^2 = 1$. Then, by scaling the conic coefficient matrices $C_i$ so that $\det C_i = \det C'_i$, the transformation rule (1) implies that

$$C_i = H^\top C'_i H, \quad i = 1, \ldots, n. \tag{2}$$

Hence, since $\det C_i = \det C'_i \neq 0$ by construction, every $H$ that satisfies (2) must also satisfy $(\det H)^2 = 1$. Thus, one may directly focus on solving (2) without the need to consider any additional constraints for $H$.

The algorithms for solving $H$ are derived from the equations (2). In the following sections, the two cases, $n = 2$ and $n > 2$, are discussed separately.
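The transformation rule (1) and the determinant normalization that leads to (2) can be checked numerically with a short sketch. NumPy is assumed, and the helper names `transform_conic` and `match_determinant` are hypothetical. Here the primed conic is the one that is rescaled, which is an arbitrary choice since only the ratio of the determinants matters.

```python
import numpy as np

def transform_conic(C, H):
    """Map a conic C under the point transformation x' ~ H x, as in rule (1)."""
    H_inv = np.linalg.inv(H)
    return H_inv.T @ C @ H_inv

def match_determinant(C_prime, C):
    """Rescale C_prime so that det(C_prime) equals det(C).

    For 3x3 matrices det(s * C') = s**3 * det(C'), so the required scale
    factor s is a real cube root and is therefore unique.
    """
    s = np.cbrt(np.linalg.det(C) / np.linalg.det(C_prime))
    return s * C_prime
```

Because the cube root of a real number is unique, the normalization removes the unknown scale of each measured conic completely, including its sign.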


The case n > 2

If the number of conic correspondences is greater than two, one may proceed as follows. By considering any two of the equations (2), i.e.

$$C_i = H^\top C'_i H, \tag{3}$$

$$C_j = H^\top C'_j H, \tag{4}$$

and by multiplying the second equation with the inverse of the first, one obtains

$$C_i^{-1} C_j = H^{-1} C_i'^{-1} C'_j H \tag{5}$$

and further

$$C_i'^{-1} C'_j H - H C_i^{-1} C_j = 0, \tag{6}$$

which is a set of linear equations in the elements of $H$. That is, (6) is equivalent to

$$M_{ij} \mathbf{h} = 0, \tag{7}$$

where $\mathbf{h}$ is a $9 \times 1$ vector containing the elements of $H$ and $M_{ij}$ is a $9 \times 9$ matrix determined by the conics. Thus, the solution $\mathbf{h}$ belongs to the null space of $M_{ij}$. However, in general, the dimension of the null space of $M_{ij}$ is greater than 1 and, hence, the constraints (7) are not sufficient for determining the homography. Nevertheless, if $n \geq 3$, one may choose other two equations to get another set of linear constraints. By considering all ordered pairs, one has in total $n(n-1)$ pairs of equations, and by stacking the matrices $M_{ij}$ one obtains an overdetermined set of $9n(n-1)$ equations

$$M \mathbf{h} = 0, \tag{8}$$

so that the null space is usually one dimensional. It was observed in Paper I that, in general, already three conic correspondences allow us to solve $\mathbf{h}$ from (8).
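The linear system (6)-(8) can be assembled with Kronecker products, as in the following sketch. It assumes NumPy, a row-major stacking of the elements of H into h, and noise-free or mildly noisy conics; the function name is hypothetical and this is not the Matlab implementation used in Paper I.

```python
import numpy as np
from itertools import permutations

def homography_from_conics_linear(conics, conics_prime):
    """Estimate H (up to scale) with x' ~ H x from n >= 3 conic pairs.

    conics[i] and conics_prime[i] are corresponding symmetric 3x3 matrices.
    """
    C = [np.asarray(c, dtype=float) for c in conics]
    # Rescale each primed conic so that det C_i = det C'_i, as needed for (2).
    Cp = [np.cbrt(np.linalg.det(c) / np.linalg.det(cp)) * np.asarray(cp, dtype=float)
          for c, cp in zip(C, conics_prime)]
    I3 = np.eye(3)
    blocks = []
    for i, j in permutations(range(len(C)), 2):  # all ordered pairs (i, j)
        P = np.linalg.solve(Cp[i], Cp[j])        # C'_i^{-1} C'_j
        Q = np.linalg.solve(C[i], C[j])          # C_i^{-1} C_j
        # Row-major vectorization: vec(P H) = kron(P, I) h, vec(H Q) = kron(I, Q^T) h.
        blocks.append(np.kron(P, I3) - np.kron(I3, Q.T))
    M = np.vstack(blocks)                        # 9 n (n - 1) rows, 9 columns
    _, _, vt = np.linalg.svd(M)
    return vt[-1].reshape(3, 3)                  # minimizes ||M h|| with ||h|| = 1
```

For three correspondences the stacked matrix has 54 rows, and for conics in general position its null space is spanned by the vectorized homography.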

In practice, $M$ may have full rank due to measurement errors in the conic coefficients but in this case the solution minimizing $\|M\mathbf{h}\|$ with $\|\mathbf{h}\| = 1$ is obtained as the singular vector corresponding to the smallest singular value of $M$ (Hartley & Zisserman, 2000). In fact, as reported in Paper I, and illustrated in Figure 2, the proposed approach, which is based on the SVD of $M$, is tenable in practice, and is able to provide a reasonable estimate of the homography also in such cases where there is no exact solution to (2) due to noise and measurement errors in conic coefficients. However, our experiments in Paper I additionally indicate that the solution, provided by the SVD, depends on the choice of the Euclidean coordinate frame in which the conics are represented. This is expected since $\|M\mathbf{h}\|$ is an algebraic error, which is not a geometrically or statistically optimal cost function (Hartley & Zisserman, 2000). Thus, as discussed in Paper I, it might be advantageous to fix the coordinate frames of the two planes by using some normalization procedure analogous to the normalized DLT algorithm, which is described for linear point-based homography estimation in (Hartley & Zisserman, 2000).

Minimal case, n = 2

If there are only two conic correspondences, the linear SVD-based algorithm above

is not useful, because the equations of the form (6) do not give sufficient constraints

for H. However, in general, the original nonlinear matrix equations can be used to

determine the homography up to a four-fold ambiguity, i.e., there exist at most four

distinct projective transformations which satisfy the equations.

In fact, it is shown in Paper I that, when the conic coefficient matrices are real, invertible and symmetric, the equations

$$C_1 = H^\top C'_1 H, \tag{9}$$

$$C_2 = H^\top C'_2 H, \tag{10}$$

are equivalent to the equations

$$R^\top R = I, \tag{11}$$

$$R^\top A R = B, \tag{12}$$

where $R$ is an unknown complex orthogonal matrix (bijectively related to $H$) and the known complex symmetric matrices $A$ and $B$ are defined by the conics $C'_1$, $C'_2$ and $C_1$, $C_2$, respectively. Hence, instead of equations (9) and (10) one may first confine oneself to solve $R$ from (11) and (12) whereafter $H$ can be determined using $R$. As stated in Paper I, the properties of complex symmetric matrices imply that the system (11), (12) (and thereby (9), (10)) has a solution if and only if the matrices $A$ and $B$ are similar (Horn & Johnson, 1985).

When $A$ and $B$ are similar and have distinct eigenvalues, they are diagonalizable and there are only finitely many solutions to (11), (12). The solutions can be obtained by computing the eigenvalues and eigenvectors of $A$ and $B$, as described in Paper I.


In this case, there are in total eight solutions for $R$. However, these solutions provide only four distinct homographies, because if $H$ is a solution then so is $-H$, and both represent the same transformation. One of the solutions represents the geometrically correct transformation, but additional information is required to determine which one.
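The minimal case can be illustrated with the sketch below. The overview does not spell out how $A$, $B$ and $R$ are constructed, so the construction used here is an assumption and may differ from Paper I: with complex factors $S^\top S = C_1$ and $S'^\top S' = C'_1$, the matrix $R = S' H S^{-1}$ satisfies $R^\top R = I$, and setting $A = S'^{-\top} C'_2 S'^{-1}$ and $B = S^{-\top} C_2 S^{-1}$ gives $R^\top A R = B$. NumPy is assumed and all function names are hypothetical.

```python
import numpy as np
from itertools import permutations, product

def complex_factor(C):
    """Return S with S.T @ S == C (plain transpose, no conjugation)."""
    w, Q = np.linalg.eigh(C)
    return np.diag(np.sqrt(w.astype(complex))) @ Q.T

def complex_orthogonal_eig(A):
    """Eigenvalues w and eigenvectors U of a complex symmetric A, U.T @ U = I."""
    w, U = np.linalg.eig(A)
    # Normalize with the bilinear form u.T @ u (not the Hermitian norm).
    return w, U / np.sqrt(np.sum(U * U, axis=0))

def homographies_from_two_conics(C1, C2, C1p, C2p):
    """Return four candidate H (complex dtype), one per solution pair {H, -H}."""
    C1, C2 = np.asarray(C1, dtype=float), np.asarray(C2, dtype=float)
    # Rescale the primed conics so that det C_i = det C'_i.
    C1p = np.cbrt(np.linalg.det(C1) / np.linalg.det(C1p)) * np.asarray(C1p, dtype=float)
    C2p = np.cbrt(np.linalg.det(C2) / np.linalg.det(C2p)) * np.asarray(C2p, dtype=float)
    S_inv = np.linalg.inv(complex_factor(C1))
    Sp_inv = np.linalg.inv(complex_factor(C1p))
    A = Sp_inv.T @ C2p @ Sp_inv      # assumed construction, see lead-in
    B = S_inv.T @ C2 @ S_inv
    wa, U = complex_orthogonal_eig(A)
    wb, V = complex_orthogonal_eig(B)
    # Order the eigenvalues of B so that they are close to those of A.
    perm = min(permutations(range(3)),
               key=lambda p: np.abs(wa - wb[list(p)]).sum())
    V = V[:, list(perm)]
    S = np.linalg.inv(S_inv)
    candidates = []
    for signs in product([1.0, -1.0], repeat=2):
        D = np.diag([1.0, *signs])   # first sign fixed: D and -D give H and -H
        R = U @ D @ V.T              # satisfies (11) and (12) for exact data
        candidates.append(Sp_inv @ R @ S)
    return candidates
```

With exact data one of the four candidates equals the true homography up to sign and has a negligible imaginary part; selecting it requires additional information, as noted above.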

If $A$ and $B$ have a multiple eigenvalue there may be infinitely many solutions to (11), (12). In fact, in some degenerate cases the configuration of the two conics may be such that it does not constrain all the degrees of freedom of the homography. For example, this is the case if $C_1$ and $C_2$ are two concentric circles, because their correspondence information is not sufficient for determining the homography, i.e. there are infinitely many solutions since the circles are invariant to any rotation around their center. However, Paper I concentrates on the case where the eigenvalues of $A$ and $B$ are distinct and the homography can be determined up to a four-fold ambiguity.

In practice, due to measurement noise, the eigenvalues of matricesA andB are never

exactly the same. Hence, typically there is no exact solution to (9), (10) but one would

still like to recover a homography estimate that is reasonably close to the underlying

true homography. Therefore, Paper I proposes an algorithm which uses the eigende-

compositions ofA andB for determiningR, and therebyH, as ifA andB were similar.

However, sinceA andB are not exactly similar in practice, the algorithm contains an

additional step in which the eigenvalues ofA andB are ordered in such a manner that

the corresponding eigenvalues are close to each other. ThereafterR is determined from

the correspondingly ordered eigenvectors ofA andB. This strategy is theoretically jus-

tified because, in general, the eigenvalues of a diagonalizable matrix are stable under

small perturbations of the matrix elements (Horn & Johnson, 1985). Furthermore, the

experiments of Paper I show that the proposed algorithm is sufficiently stable to be used

with real measurement data, and usually provides a reasonable estimate of the homog-

raphy, also in such cases where no exact solution exists due to measurement errors in

the conic coefficients.
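
The eigenvalue ordering step can be sketched as follows. The sketch assumes that the eigenvalues of A and B have already been computed with a numerical library and brought to a common scale. The function name match_eigenvalues and the exhaustive search over permutations are illustrative assumptions, and the actual procedure of Paper I may differ.

```python
from itertools import permutations

def match_eigenvalues(eig_a, eig_b):
    """Return the ordering of eig_b that best matches eig_a.

    eig_a, eig_b: equally long sequences of (possibly complex) eigenvalues.
    The result is a tuple `order` such that eig_b[order[i]] is paired
    with eig_a[i].
    """
    if len(eig_a) != len(eig_b):
        raise ValueError("eigenvalue lists must have the same length")
    best_order, best_cost = None, float("inf")
    # For 3x3 matrices there are only 3! = 6 candidate orderings.
    for order in permutations(range(len(eig_b))):
        cost = sum(abs(a - eig_b[j]) ** 2 for a, j in zip(eig_a, order))
        if cost < best_cost:
            best_order, best_cost = order, cost
    return best_order

# Hypothetical noisy eigenvalues of two nearly similar 3x3 matrices.
eig_a = [2.00, -0.50, 1.00]
eig_b = [1.03, 1.98, -0.47]
order = match_eigenvalues(eig_a, eig_b)
```

The eigenvectors of B would then be reordered with the same permutation before R is determined.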

2.3 Discussion

In this chapter, Section 2.1 briefly described some general background to the field of

geometric computer vision, and then, Section 2.2 concentrated on a particular geometric

estimation problem, namely, on the computation of a planar homography from conic

correspondences. The algorithms of Paper I were described and put into the context

of related research literature. In the following, the contributions of Paper I are further

summarized, and their significance is discussed.

In summary, Paper I describes two algorithms for computing a planar homography

from correspondences of coplanar conics. The first algorithm can be used when there

are at least three conic correspondences and, in general, it provides a unique solution

up to scale. The second algorithm is for the minimal case of two conic correspondences

and, in general, it provides a solution up to a four-fold ambiguity. In addition, Paper

I investigates the stability of the proposed algorithms with respect to noise. Although

both algorithms involve algebraic error measures, which are not statistically optimal,

the experiments with synthetic and real data show that the algorithms are stable enough

to be used in practical computations. Further, the proposed algorithms are easy to im-

plement because they are based on linear algebra, and the required matrix factorizations

are available in most numerical libraries and mathematical software packages.

Because homography estimation is a classical problem and several approaches exist,

one might think that the significance of Paper I is rather limited. However, we believe

that, besides providing new aspects to the problem, the algorithms may also have prac-

tical significance. In fact, we are not aware of any other linear method for homography

computation from fewer than seven conic correspondences, which is the minimum number

of correspondences required by the method of Sugimoto (2000). Moreover, many

recent approaches to wide baseline image matching utilize affine covariant region de-

tectors, which are able to detect interest regions from images in such a manner that the

pre-image of the region is invariant to changes in the viewpoint of the camera. Typically,

these regions may be represented by ellipses (Mikolajczyk et al., 2005), but often

they are simply represented by their centroids and considered as points in geometry estimation

(Matas et al., 2004; Chum et al., 2005). However, it might be advantageous

to directly utilize the ellipses in homography estimation. For example, in principle,

only two region correspondences would be sufficient for hypothesis generation in a

RANSAC-based estimation framework, whereas point-based approaches require four

correspondences. This fact might be useful for some applications, such as plane detection

(Lourakis et al., 2002; Chum et al., 2005; Choi et al., 2007).
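
The potential benefit of a two-correspondence minimal sample can be illustrated with the standard RANSAC trial-count formula N = log(1 − p) / log(1 − w^s), where s is the sample size, w is the inlier ratio and p is the desired probability of drawing at least one all-inlier sample. The sketch below is only an illustration; the inlier ratio 0.5 and the confidence 0.99 are assumed values, not figures from Paper I.

```python
import math

def ransac_trials(sample_size, inlier_ratio, confidence=0.99):
    """Number of random samples needed so that, with probability
    `confidence`, at least one sample consists of inliers only."""
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        return 1  # every sample is outlier-free
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))

# Two ellipse correspondences versus four point correspondences.
trials_regions = ransac_trials(sample_size=2, inlier_ratio=0.5)
trials_points = ransac_trials(sample_size=4, inlier_ratio=0.5)
```

With these assumed values the two-region sample needs 17 trials and the four-point sample 72. Note that each two-conic sample yields up to four candidate homographies, so part of the saving is spent on verifying the additional hypotheses.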

Finally, since the proposed techniques use algebraic error measures, they may not be

suitable for precise homography estimation from noisy observations. However, given

just a set of corresponding coplanar conics, it is not obvious how to choose a suitable

error measure. In other words, it is not obvious what would be a reasonable model

for the uncertainty in the conic coefficients. On the other hand, if the conics are defined

by noisy edge points and the point-to-conic distances are assumed to be normally

distributed, it would be a justified approach to simultaneously determine the homog-

raphy and the conics, so that the conics are related by the recovered homography and

the point-to-conic distances are minimized in both images. Nevertheless, this kind of

an approach would require iterative methods, which need a good initialization for both

the homography and the conics (Sturm & Gargallo, 2007; Hartley & Zisserman, 2000).

Hence, the noniterative methods of Paper I might be useful also in this case.

3 Geometric camera calibration

Geometric camera calibration is the process whereby the geometric properties of a cam-

era are determined. In other words, geometric camera calibration establishes the map-

ping between image points and scene points. Hence, at least implicitly, calibration

defines both the forward-projection, which maps a given 3-D scene point to its corre-

sponding 2-D image point, and the back-projection, which determines the set of 3-D

points which map to a given image point.

Geometric camera calibration is a prerequisite for image-based metrology, because

the geometric characteristics of the camera have to be quantified in order to measure

scene properties such as angles and length ratios from the images. A central issue in

camera calibration is the choice of a suitable camera model which is sufficiently generic

for modeling the particular camera in question, and which allows convenient and stable

calibration. In fact, usually camera calibration can be seen as a parameter estimation

problem, where the parameters of the camera model are estimated from image data.

Hence, in the following, we review different camera models as well as methods for

determining their parameters. In addition to surveying previous calibration literature,

we briefly describe the contributions of Papers II-IV.

3.1 Camera models

A photographic camera can be seen as a ray-based imaging device, where the image

points are associated with 3-D lines which represent the light rays that arrive at the

camera. Thus, one may consider the image as a collection of pixels, where each pixel

may observe light only from those scene points that lie on the ray associated with it. In

the most general setting, the rays are unconstrained and camera calibration is the pro-

cess where the coordinates of all the rays, corresponding to the pixels, are determined

in some common coordinate system (Grossberg & Nayar, 2001; Sturm & Ramalingam,

2004). Hence, in this case, the number of parameters to be estimated is large. On the

other hand, in more constrained settings, which are quite common in practice, cameras

may often be modeled by low-parameter models. This is typically the case for the class

of central cameras in which the projection rays are constrained to meet at a single point

in space.

In the following, we present a taxonomy of camera models according to the classification

by Sturm (2005) and Ramalingam (2006). Thereafter, in Sections 3.1.2 and 3.1.3,

we focus on central cameras because our work concentrates on them. The discussion

below aims to give an overview of the subject; a more detailed survey of central camera

models can be found in Paper III.

3.1.1 Taxonomy of camera models

Sturm (2005) and Ramalingam (2006) describe a three-level hierarchy of camera mod-

els, which consists of the following classes, listed in the order of decreasing generality:

(1) generic cameras, (2) axial cameras, and (3) central cameras. At the most general

level of the hierarchy, cameras are modeled by unconstrained sets of projection rays,

whereas in the more specific classes, the projection rays are constrained to go through a

single line (axial cameras) or a single point (central cameras). Some examples of cam-

eras that belong to the different classes are given below and are schematically illustrated

in Figure 3.

Fig. 3. Three catadioptric cameras which consist of a perspective camera and a

mirror. The scene points Pi are projected to points pi on the image plane. (a) A

generic camera with a curved mirror. (b) An axial camera with a spherical mirror.

All projection rays intersect a single line. (c) A central camera with a hyperbolic

mirror. All projection rays go through the focal point F of the mirror when the

perspective camera is placed at the other focal point F′.

Generic model

A generic imaging model for photographic cameras consists of a set of pixels and pro-

jection rays so that each pixel of the image is associated with a unique ray, which is

represented by a 3-D line in the camera coordinate frame. In the general case the pro-

jection rays may be arbitrary and, hence, the model can describe essentially any camera

that captures light rays which travel along straight lines between the camera and the

observed opaque surfaces of the scene. In fact, here the concept of a camera is not

limited to a single physical device, but it includes also camera systems, which consist

of several physical camera units and which may be considered as one generic camera

by combining all the pixels of the individual images into a single image.

Examples of cameras which obey the generic imaging model, but do not satisfy the

additional constraints of axial or central cameras, include, for instance, multi-camera

systems consisting of cameras whose optical centers are not all collinear, oblique cameras

where no two captured light rays intersect (Pajdla, 2002), non-central mosaics acquired

under circular motion (Swaminathan et al., 2003), and various other non-central

cameras (Bakstein & Pajdla, 2001; Ponce, 2009).

Axial model

In an axial camera, all the projection rays go through a single line in space and this

line is called the camera axis. Examples of axial cameras include crossed-slit cameras

(Feldman et al., 2003; Gupta & Hartley, 1997), stereo camera rigs, and other multi-camera

systems which consist of central cameras with collinear optical centers. In

addition, a catadioptric camera consisting of a mirror and a central camera is an axial

camera if the mirror is any surface of revolution, and the camera center lies on the

mirror axis of revolution (Ramalingam, 2006; Chahl & Srinivasan, 1997; Ollis et al.,

1999). For example, the catadioptric camera designs illustrated in Figures 3(b) and 3(c)

are axial cameras.

Central model

A central camera is a camera where all the projection rays go through a single point,

which is called the optical center of the camera. A common example of a central camera

is the perspective camera. The perspective imaging model is tenable for most conventional

cameras which have narrow-angle lenses. However, cameras equipped

with wide-angle and fish-eye lenses can also be approximated by the central model, even

though the perspective model is not suitable for them (Miyamoto, 1964). Further, there

are certain catadioptric camera configurations which have a single viewpoint (Baker &

Nayar, 1999). For example, the catadioptric system in Figure 3(c), which contains a

hyperbolic mirror and a perspective camera placed at the second focal point F′ of the mirror, is a

central camera.

3.1.2 Perspective cameras

The perspective camera model is applicable for most conventional photographic cam-

eras, and it is the most widely used and studied camera model in the literature (Hartley &

Zisserman, 2000). In the following, the perspective model is briefly described in order

to illustrate its limitations, and to motivate the flexible central camera model, proposed

in Paper II and summarized in Section 3.1.3 below.

Geometrically, a perspective camera is defined by a plane and a point in space. That

is, the plane is the image plane of the camera, and the point is the optical projection

center. As shown in Figure 4, the projection ray associated with a particular image point

is the line joining the image point and the projection center.

Mathematically, as images of lines are lines under perspective projection, a perspective

camera is a linear mapping from a three-dimensional world coordinate frame,

represented by P^3, to a two-dimensional image coordinate frame, represented by P^2.

Hence, by using homogeneous coordinates, a perspective camera can be represented by

a 3×4 matrix (Hartley & Zisserman, 2000).

However, besides being a linear mapping from P^3 to P^2, a perspective camera can

be considered as a direction sensor. From this point of view, it is illustrative to use inhomogeneous

pixel coordinates and to represent the camera projection P as a composition

of two non-linear functions, namely,

$$ m = P(X) = (P_c \circ R)(X), \tag{13} $$

where m = (u, v)⊤ is the image point corresponding to the scene point X, Pc defines

the internal properties of the camera and R relates the camera pose to the world frame.

Given the scene point X in the world coordinate frame, the function R provides the

direction of the corresponding projection ray in the camera coordinate frame, whose

origin is at the projection center. Hence, R involves a rigid transformation, which maps

the scene point from the world frame to the camera frame. Formally, Φ = R(X), where

Φ = (θ, ϕ)⊤ and θ, ϕ are the spherical angle coordinates illustrated in Figure 4.

For a perspective camera, the camera Z-axis is defined perpendicular to the image

plane and Pc has the following form

$$ m = P_c(\Phi) = (K \circ F_p)(\Phi), \tag{14} $$

where Fp is a central projection to the virtual image plane, illustrated by V in Figure 4, i.e.,

$$ \begin{pmatrix} x \\ y \end{pmatrix} = F_p(\Phi) = (\tan\theta) \begin{pmatrix} \cos\phi \\ \sin\phi \end{pmatrix} \tag{15} $$

and K is an affine transformation from the virtual image plane to the real image plane,

$$ \begin{pmatrix} u \\ v \end{pmatrix} = K(x, y) = \begin{bmatrix} f & s f \\ 0 & \gamma f \end{bmatrix} \begin{pmatrix} x \\ y \end{pmatrix} + \begin{pmatrix} u_0 \\ v_0 \end{pmatrix}, \tag{16} $$

where f, γ, s, u0 and v0 are the internal camera parameters. There are only five degrees

of freedom in K because, without loss of generality, one may fix the camera coordinate

frame so that the X-axis is parallel to the u-axis of the pixel coordinate frame.

Fig. 4. A perspective camera whose projection center is O. The scene point X is

projected to point m on the image plane I. The plane Z = 1 is the virtual image

plane V. The mapping from V to I is defined by the internal camera parameters.

It is evident from (15) and Figure 4 that the field of view of a perspective camera can

not exceed a hemisphere. Indeed, when θ approaches the value 90° the image point

approaches infinity. Thus, due to this singularity, the perspective camera model is not

suitable for omnidirectional cameras.
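
As an illustration, equations (13) to (16) can be written out as a short sketch. It assumes that the scene point is already expressed in the camera frame, i.e. that the rigid transformation involved in R has been applied. The function names and all parameter values are hypothetical.

```python
import math

def direction_angles(X, Y, Z):
    """Spherical angles (theta, phi) of the ray through the camera-frame
    point (X, Y, Z): theta is measured from the Z-axis (optical axis),
    phi is the azimuth in the XY-plane."""
    theta = math.atan2(math.hypot(X, Y), Z)
    phi = math.atan2(Y, X)
    return theta, phi

def project_perspective(theta, phi, f, gamma, s, u0, v0):
    """Perspective projection m = (K o Fp)(Phi) following (14)-(16)."""
    # Fp: central projection to the virtual image plane Z = 1, eq. (15).
    x = math.tan(theta) * math.cos(phi)
    y = math.tan(theta) * math.sin(phi)
    # K: affine map from the virtual image plane to pixels, eq. (16).
    u = f * x + s * f * y + u0
    v = gamma * f * y + v0
    return u, v

theta, phi = direction_angles(0.2, -0.1, 2.0)
u, v = project_perspective(theta, phi, f=800.0, gamma=1.0, s=0.0,
                           u0=320.0, v0=240.0)
# (u, v) agrees with the pinhole formula u = f*X/Z + u0, v = f*Y/Z + v0.

# The radius tan(theta) grows without bound as theta approaches 90 degrees.
radii = [math.tan(math.radians(d)) for d in (45, 80, 89, 89.9)]
```

The list radii makes the singularity concrete: the radial distance on the virtual image plane increases rapidly as the ray direction approaches the plane perpendicular to the optical axis.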

3.1.3 Central omnidirectional cameras

The class of central cameras includes a wider range of projection models than just

perspective projection. In particular, many real omnidirectional cameras, such as fish-

eye lens cameras (Miyamoto, 1964) and certain catadioptric cameras (Baker & Nayar,

1999), can be approximated as central cameras. The flexible central camera model,

proposed in Paper II, is described below, after a brief review of the related literature.

Previous work

Various kinds of central cameras have been described in the literature. In principle, all

these cameras are covered by a model of the form (13), where Pc is a generic, possibly

nonlinear mapping from the unit sphere of R^3 to R^2. Hence, a central camera maps the

directions of incoming light rays to points in the image plane. The subset of the unit

sphere, where Pc is injective, defines the upper limit for the field of view of a particular

camera projection.

In practice, the mapping Pc is usually continuous. In addition, many central omnidirectional

cameras employ a projection that is radially symmetric about an axis, i.e., the

optical axis. Hence, instead of (14), Pc is often modeled in the following more generic

form

$$ m = P_c(\Phi) = (H \circ F_r)(\Phi), \tag{17} $$

where Fr is a radially symmetric projection to the virtual image plane, which is orthogonal

to the optical axis, and H is a planar projective transformation from the virtual

image plane to the real image plane. In detail,

$$ \begin{pmatrix} x \\ y \end{pmatrix} = F_r(\Phi) = r(\theta) \begin{pmatrix} \cos\phi \\ \sin\phi \end{pmatrix}, \tag{18} $$

where r(θ) defines the radially symmetric part of the projection and it is again assumed

that the camera Z-axis is the optical axis, i.e., θ is the angle between the incoming light

ray and the optical axis. Several models have been used for r(θ) in the literature. For

example, instead of the perspective model r(θ) = tan θ, the stereographic projection

model, r(θ) = 2 tan(θ/2), has been used for some fish-eye lenses (Miyamoto, 1964).

In general, the image plane of a central camera is not necessarily orthogonal to the

optical axis and therefore the mapping H in (17) is a planar projective transformation

instead of an affine transformation. However, in the case of a perspective camera, one

may always fix the optical axis to be perpendicular to the image plane because the

pinhole projection, illustrated in Figure 4, is symmetric about any axis. Thus, the affine

transformation model is sufficient in (14).

Many central cameras, which appear in the literature, can be represented by the model

(17). The model covers various dioptric cameras, where the lens system provides a sin-

gle effective viewpoint and is radially symmetric about the optical axis. Examples

of such cameras include wide-angle (Frank et al., 2007) and fish-eye lens cameras

(Miyamoto, 1964), in addition to conventional narrow-angle lens cameras. Several different

parametric expressions have been used for the radial projection function r in (18).

Many approaches take the perspective projection r(θ) = tan θ as a starting point and

model the observed radial distortion with respect to it (Heikkilä, 2000; Zhang, 2000).

One example of such an approach is the so-called division model (Bräuer-Burchardt &

Voss, 2001; Fitzgibbon, 2001) which implicitly defines a relation between r and θ. However,

due to the singularity of the perspective projection at θ = 90°, these approaches

are not suitable for cameras whose field of view exceeds a hemisphere. Hence, other

models have also been proposed, e.g. (Micušík, 2004; Micušík & Pajdla, 2006). Finally,

in addition to the parametric approaches, a parameter-free method for determining the

radial distortion was proposed in (Hartley & Kang, 2007).

Additionally, besides dioptric cameras, also central catadioptric cameras, which con-

sist of a conical mirror and a camera (Baker & Nayar, 1999), can be represented in the

form (17). In fact, it has been shown (Geyer & Daniilidis, 2001) that the model (18),

where r(θ) has the following one-parameter form

$$ r(\theta) = \frac{(l+1)\sin\theta}{l+\cos\theta}, \tag{19} $$

is a unified model for central catadioptric projections. The properties of central cata-

dioptric image formation have been extensively studied, e.g. (Ying & Hu, 2004b; Bar-

reto & Araujo, 2005). Further, it has been demonstrated that the radial projection model

(19) is applicable also for many dioptric cameras, including conventional perspective

cameras and fish-eye lens cameras (Ying & Hu, 2004a). Indeed, for example, when

l = 0, formula (19) gives the perspective projection, and the value l = 1 corresponds to

the stereographic projection.
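
The relation between (19) and the two classical projections can be checked numerically. The sketch below is a plain illustration with hypothetical function names; it evaluates the unified model for l = 0 and l = 1 and compares the result with tan θ and 2 tan(θ/2).

```python
import math

def r_perspective(theta):
    """Perspective projection, r = tan(theta)."""
    return math.tan(theta)

def r_stereographic(theta):
    """Stereographic projection, r = 2 tan(theta / 2)."""
    return 2.0 * math.tan(theta / 2.0)

def r_unified(theta, l):
    """One-parameter radial projection of eq. (19)."""
    return (l + 1.0) * math.sin(theta) / (l + math.cos(theta))

theta = math.radians(60.0)
diff_perspective = abs(r_unified(theta, 0.0) - r_perspective(theta))
diff_stereographic = abs(r_unified(theta, 1.0) - r_stereographic(theta))
# Both differences vanish up to floating-point rounding.

# Unlike tan(theta), the stereographic radius stays finite at 90 degrees.
r_at_90 = r_stereographic(math.pi / 2.0)
```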

The proposed model

One of the contributions of this thesis is the generic central camera model presented in

Paper II. This section motivates the proposed model and describes its key components.

Many omnidirectional camera devices are designed to have a single viewpoint and

to be radially symmetric, as described in the previous section. However, the radially symmetric

central camera is an idealized model, which is not always sufficient for precise

modeling of real cameras. Hence, a common approach in photogrammetric camera

calibration is to append the idealized model with an additional distortion component,

which models the deviations from precise radial symmetry (Slama, 1980). So, the idea

is to adhere to the assumption of a single viewpoint but to drop the strict requirement

of radial symmetry. The distortion model traditionally used in photogrammetry (Slama,

1980) is based on an analytic approximation for the geometric distortion in a decentered

lens system (Conrady, 1919; Brown, 1971). However, the problem with this distortion

model is that it is not suitable for cameras with a very wide field of view because it is

built on the perspective imaging model.

In order to obtain a flexible camera model, which is applicable also for omnidirectional

cameras, Paper II takes the expression (17) as a starting point, and models Fr in

a generic form, instead of assuming the restrictive perspective projection model. Thereafter,

model (17) is complemented by replacing the radially symmetric part Fr with

D∘Fr, where the asymmetric distortion function D appends two distortion terms to Fr.

In detail, the virtual image point x corresponding to the direction angle Φ is given by

$$ x = (D \circ F_r)(\Phi) = r(\theta)\,u_r(\phi) + \Delta_r(\theta,\phi)\,u_r(\phi) + \Delta_t(\theta,\phi)\,u_\phi(\phi), \tag{20} $$

where u_r(ϕ) and u_ϕ(ϕ) are the unit vectors in the radial and tangential directions,

$$ r(\theta) = k_1\theta + k_2\theta^3 + k_3\theta^5 + k_4\theta^7 + k_5\theta^9, \tag{21} $$

and the asymmetric radial and tangential distortion terms are

$$ \Delta_r(\theta,\phi) = (\zeta_1\theta + \zeta_2\theta^3 + \zeta_3\theta^5)(\iota_1\cos\phi + \iota_2\sin\phi + \iota_3\cos 2\phi + \iota_4\sin 2\phi), \tag{22} $$

$$ \Delta_t(\theta,\phi) = (\eta_1\theta + \eta_2\theta^3 + \eta_3\theta^5)(\xi_1\cos\phi + \xi_2\sin\phi + \xi_3\cos 2\phi + \xi_4\sin 2\phi), \tag{23} $$

respectively. Thus, here the radial projection function (21) contains five parameters and

both of the asymmetric distortion terms, (22) and (23), contain seven parameters.
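
A direct transcription of (20) to (23) is sketched below to make the structure of the model concrete. The function names and all coefficient values are hypothetical, and the unit vectors are taken as u_r = (cos ϕ, sin ϕ) and u_ϕ = (−sin ϕ, cos ϕ), which is the usual polar-coordinate convention and an assumption here.

```python
import math

def odd_poly(coeffs, theta):
    """c1*theta + c2*theta**3 + c3*theta**5 + ... (odd powers only)."""
    return sum(c * theta ** (2 * i + 1) for i, c in enumerate(coeffs))

def fourier2(coeffs, phi):
    """c1*cos(phi) + c2*sin(phi) + c3*cos(2*phi) + c4*sin(2*phi)."""
    c1, c2, c3, c4 = coeffs
    return (c1 * math.cos(phi) + c2 * math.sin(phi)
            + c3 * math.cos(2 * phi) + c4 * math.sin(2 * phi))

def virtual_image_point(theta, phi, k, zeta, iota, eta, xi):
    """Virtual image point x = (D o Fr)(Phi) of eq. (20)."""
    r = odd_poly(k, theta)                                 # eq. (21)
    delta_r = odd_poly(zeta, theta) * fourier2(iota, phi)  # eq. (22)
    delta_t = odd_poly(eta, theta) * fourier2(xi, phi)     # eq. (23)
    u_r = (math.cos(phi), math.sin(phi))      # radial unit vector
    u_phi = (-math.sin(phi), math.cos(phi))   # tangential unit vector
    x = (r + delta_r) * u_r[0] + delta_t * u_phi[0]
    y = (r + delta_r) * u_r[1] + delta_t * u_phi[1]
    return x, y

# Hypothetical coefficients: r(theta) = theta and no asymmetric distortion.
k = [1.0, 0.0, 0.0, 0.0, 0.0]
zero3, zero4 = [0.0] * 3, [0.0] * 4
x, y = virtual_image_point(math.radians(90.0), 0.0, k, zero3, zero4, zero3, zero4)
```

With these coefficients a ray at θ = 90° maps to the finite point (π/2, 0), which shows that the polynomial form (21) has no singularity at the edge of a hemispherical field of view.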

Hence, Paper II proposes to model the radially symmetric term (21) as a part of a

power series consisting of the odd powers of θ. The power series model, including the

even powers, has been used for fish-eye lenses also before (Xiong & Turkowski, 1997).

However, Paper II suggests dropping the even powers, based on the fact that an arbitrary

continuous odd function can be approximated by a series of odd polynomials. Because θ

is always non-negative and r(0) = 0, one may consider r(θ) as an odd function. Thus, in

principle, one may approximate any kind of continuous radially symmetric projection

by increasing the number of terms in (21).

The expressions (22) and (23), suggested for the asymmetric distortion terms in

Paper II, are separable in the variables θ and ϕ, and the dependence on both variables

is again modeled as a part of a mathematical series. Further, because the Fourier series

of any 2π-periodic continuous function converges in the L2-norm and any continuous

odd function can be approximated by a series of odd polynomials, one could model

increasingly complex continuous distortions by simply adding more terms to (22) and

(23). Hence, formulation (20) suggests a flexible and generic central camera model,

where the number of parameters could be determined automatically in the calibration

process by using, for example, cross-validation to avoid overfitting. However, automatic

complexity selection of the camera model is beyond the scope of this thesis.

As described above, the generic camera model of Paper II can be represented in

the form m = Pc(Φ), where the camera projection is a composition of three functions,

i.e., Pc(Φ) = (H∘D∘Fr)(Φ). Usually, one also needs to know the inverse model

Φ = Pc⁻¹(m). In our case, it is straightforward to compute the inverse of the homography

H but (D∘Fr) cannot be analytically inverted due to its relatively complicated

parametric form. Hence, the inverse mapping Pc⁻¹ cannot be explicitly computed from

Pc. One alternative is to estimate it numerically in a discrete form by using some kind

of an interpolation process. For example, the direction angles Φk corresponding to each

image pixel may be stored into a look-up table. However, Paper II additionally proposes

a direct method to approximate (D∘Fr)⁻¹(x) for a given x. The proposed approximation

method is based on a local linearization of the asymmetric distortion function D.
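
For the radially symmetric part alone, the inversion reduces to solving r(θ) = ρ for θ, which can be done numerically. The sketch below uses simple bisection under the assumption that r is monotonically increasing on the interval considered. It is not the linearization-based method of Paper II, and the coefficients and function names are hypothetical.

```python
import math

def radial_poly(k, theta):
    """r(theta) = k1*theta + k2*theta**3 + ... as in eq. (21)."""
    return sum(c * theta ** (2 * i + 1) for i, c in enumerate(k))

def invert_radial(k, rho, theta_max=math.pi, tol=1e-12):
    """Find theta in [0, theta_max] with r(theta) = rho by bisection.
    Assumes r is increasing on the interval."""
    lo, hi = 0.0, theta_max
    if not 0.0 <= rho <= radial_poly(k, hi):
        raise ValueError("rho is outside the range of r on [0, theta_max]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if radial_poly(k, mid) < rho:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

k = [1.0, -0.05, 0.002]  # hypothetical coefficients, r is increasing
theta_true = math.radians(75.0)
theta_rec = invert_radial(k, radial_poly(k, theta_true),
                          theta_max=math.radians(100.0))
```

In the radially symmetric case the angle ϕ follows directly from the polar angle of x, so only θ requires a numerical solution.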

3.2 Calibration methods

Camera calibration is the process whereby the camera parameters are determined from

image data. There are several calibration methods that have been proposed in the liter-

ature. The conventional photogrammetric approach for camera calibration uses images

of objects, whose geometry is known (Heikkilä, 2000; Zhang, 2000). A typical cali-

bration object used in this context consists of one to three planes which contain visible

control points in known positions. In contrast, camera self-calibration methods do not

use objects of known geometry or, in general, any other metric information about the

scene (Hartley & Zisserman, 2000). Instead, they typically use point correspondences

over multiple views of a rigid scene, and utilize assumptions about the internal camera

parameters, e.g. constant internal parameters or unit aspect ratio (Faugeras et al., 1992;

Heyden & Åström, 1997), or about the camera motion, e.g. pure rotation or translation

between the views (Hartley, 1994; Moons et al., 1994).

In addition, besides the pure self-calibration methods, there are calibration methods

which utilize some prior information about the scene, but do not necessarily require

a calibration pattern with completely known world coordinates. For example, some

approaches use images of lines (Barreto & Araujo, 2005) or spheres (Ying & Hu, 2004b)

or assume that the observed scene is planar (Tardif et al., 2007).

An example of a photogrammetric calibration process is illustrated in Figure 5

where a camera observes a planar calibration pattern, which contains circular control

points. Given several images of the pattern, the calibration is performed by fitting a cam-

era model to the observations, which are the measured positions of the control points

in the calibration images. In the case of central cameras, the camera images can be

rectified after the calibration by back-projecting them onto a cube whose centroid is at

the camera center (Tardif et al., 2007). If the calibration is accurate, the back-projected

images of scene lines are straight on the faces of the cube, as illustrated in Figure 6.

Fig. 5. A photogrammetric camera calibration setup where a camera views a planar

calibration pattern displayed on a flat screen.

Fig. 6. Two images of a calibration pattern (left) and their undistorted versions

(right). The original images were taken with a catadioptric camera (top) and a wide-

angle lens camera (bottom), and the calibration was performed by the method of

Paper III. The undistorted images were computed by back-projecting the original

images onto a cube whose centroid was at the estimated viewpoint of the camera.

Typically, the camera calibration process involves nonlinear optimization, where the

camera parameters are estimated by minimizing a cost function, which quantifies the

model fitting error. In photogrammetric calibration, the control point coordinates are

known in the world frame and, hence, the minimization needs to be done only over

the internal and external camera parameters, where the latter relate the camera pose

to the world frame. However, in self-calibration, the 3-D coordinates of the observed

points are unknown, and they have to be estimated simultaneously with the camera

parameters. In both cases, the nonlinear minimization requires a good initial guess for

the parameters in order to avoid local minima. Hence, much of the research in camera

calibration is devoted to developing direct methods for the parameter recovery. In fact,

many calibration techniques are based on geometric invariants which are invariant to

camera pose, and depend only on the internal camera parameters (Geyer & Daniilidis,

2002; Ying & Hu, 2004b). For example, the image of the absolute conic, which is often

utilized in camera self-calibration (Pollefeys & Van Gool, 1997; Triggs, 1997; Barreto

& Araujo, 2005), is such an invariant.

The papers related to this thesis consider issues related to both photogrammetric calibration

and self-calibration. Paper II describes a plane-based calibration procedure for

central cameras, and Paper IV studies the self-calibration of radially symmetric central

cameras from point correspondences under general camera motion. In the following,

calibration methods from the literature are introduced and categorized. Also, the meth-

ods of Papers II and IV are briefly summarized in this context.

3.2.1 Photogrammetric calibration

Perspective cameras

Photogrammetric calibration of perspective cameras is a well studied topic, and many

approaches exist. The approaches can be divided into methods which use non-coplanar

calibration objects, and those which use planar patterns. In general, if a non-coplanar

calibration object is used, the camera parameters can be determined from a single view

by using the classical Direct Linear Transform (DLT) method (Abdel-Aziz & Karara,

1971; Sutherland, 1974; Hartley & Zisserman, 2000). On the other hand, in the case

of a planar calibration object, several views are needed and the camera parameters can

be recovered from the planar homographies which relate the calibration plane to its

images (Sturm & Maybank, 1999; Zhang, 2000). Overall, the initial estimates of camera

parameters provided by some direct method, such as the DLT method or (Zhang, 2000),

should be finally refined by minimizing a geometrically justified error function. The

cost function typically used in the minimization is the sum of squared distances between

the measured and modeled control point projections (Hartley & Zisserman, 2000).
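
The cost function mentioned above can be sketched as follows. The project argument stands for any camera model that maps a control point to pixel coordinates given a parameter vector; it and the toy one-parameter model in the usage lines are hypothetical placeholders.

```python
def reprojection_cost(params, control_points, measurements, project):
    """Sum of squared distances between measured and modeled image points."""
    cost = 0.0
    for point, (u_meas, v_meas) in zip(control_points, measurements):
        u_mod, v_mod = project(params, point)
        cost += (u_meas - u_mod) ** 2 + (v_meas - v_mod) ** 2
    return cost

# Toy model for illustration only: pinhole with focal length f, no offset.
def toy_project(params, point):
    (f,) = params
    X, Y, Z = point
    return f * X / Z, f * Y / Z

points = [(0.1, 0.0, 1.0), (0.0, 0.2, 2.0)]
observed = [(50.0, 0.0), (0.0, 51.0)]
cost = reprojection_cost((500.0,), points, observed, toy_project)
```

In practice such a cost would be passed to a nonlinear least-squares routine, with the estimate from the direct method as the starting point.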

Central cameras

Most of the early research in camera calibration focused on the perspective imaging

model, even though extremely wide-angle lenses have been available for a long time

(Miyamoto, 1964). One reason for this might be the fact that the low sensor resolution

of early digital cameras limited the usefulness of very wide-angle optics. Neverthe-

less, during the past decade, there has been an increase of applications and research

efforts within omnidirectional vision (Daniilidis & Klette, 2006). One such effort is our

work in Paper II, which aims to present a practical plane-based approach for precise

photogrammetric calibration of generic central cameras. In addition to Paper II,

other plane-based methods have also been proposed for omnidirectional cameras (Scaramuzza

et al., 2006; Mei & Rives, 2007; Ramalingam & Sturm, 2008).

The calibration approach of Paper II is based on viewing a planar pattern, which

contains control points in known positions. Given the locations of observed control

points in several views, the camera parameters are determined by a multi-step procedure

which requires the user to provide a rough initialization for a few internal parameters,

and then iteratively refines all the parameters. Although the procedure is not completely

automatic, it is relatively robust against bad initialization and, in practice, satisfactory

initial values can always be provided. Further, due to the generality of the camera model

used, the proposed approach allows convenient calibration of various types of central

cameras, as shown by the experiments in Papers II and III.

Noncentral cameras

Although most omnidirectional cameras are strictly speaking noncentral, many of them

can be well approximated by the central model. In particular, if the camera is relatively

far from the viewed objects, the central model may be useful even for such cameras that

are clearly noncentral when observed from a closer distance (Swaminathan & Nayar,

2000). Also, the assumption of a single viewpoint may be necessary for stabilizing

the initial calibration of cameras that are only slightly noncentral (Ramalingam et al.,

2005; Micušík & Pajdla, 2006). However, there are cases where the central camera

model is not sufficient. For example, this is the case with many catadioptric cameras

(Chahl & Srinivasan, 1997; Swaminathan et al., 2004). Hence, calibration methods

have been proposed also for completely generic cameras, where the pixels’ projection

rays are unconstrained (Grossberg & Nayar, 2001; Sturm & Ramalingam, 2004). In

contrast to methods which use low-parameter models, these approaches require that

all the calibrated pixels are matched to the calibration object in several images. Thus,

dense matching is required. Finally, the calibration of radially symmetric noncentral

cameras is addressed in (Tardif et al., 2009), and the case of axial cameras is discussed

in (Ramalingam et al., 2006c).


3.2.2 Self-calibration

Perspective cameras

Self-calibration of perspective cameras has been widely studied since the early work

by Faugeras et al. (1992). Typically, the problem setting is such that one has an

uncalibrated projective multi-view reconstruction, computed from a set of point

correspondences over the views, and the task is to utilize constraints on the camera

parameters in order to determine the metric properties of the cameras and the scene. The

early papers on self-calibration assume that the internal camera parameters are constant,

e.g. (Faugeras et al., 1992; Hartley, 1994), whereas some later works use less restrictive

constraints (Heyden & Åström, 1998; Pollefeys et al., 1999). Further, besides the case

of generic camera motion, self-calibration methods have been proposed for different

types of degenerate motions. For example, a method for planar motions is described

by Armstrong et al. (1996), and the case of a rotating camera is discussed by Hartley

(1994). Overall, the literature related to self-calibration of perspective cameras is broad,

and further references can be found in (Hartley & Zisserman, 2000).

Central cameras

Various self-calibration methods have been proposed for central cameras in

recent years, e.g. (Micušík & Pajdla, 2006; Claus & Fitzgibbon, 2005; Thirthala &

Pollefeys, 2005; Li & Hartley, 2006; Ramalingam et al., 2006b; Tardif et al., 2006,

2007). These approaches do not require a completely known calibration pattern, but

many of them still utilize some prior knowledge about the scene, such as straight lines

or coplanar points, or about the camera, such as the location of the distortion center. In

addition, the robustness and generality of the methods may often be limited, or they do

not provide a full metric calibration.

Hence, despite the recent research efforts, there is still a need for robust and generic

self-calibration methods which would be suitable for central cameras under general con-

ditions, i.e., under general camera motion in an unknown scene. In this thesis, Paper

IV addresses the problem by proposing a method which uses two-view point correspon-

dences, and estimates the parameters of a general radially symmetric camera model

by minimizing the angular error. Thus, the key idea in Paper IV is to use the exact

expression for the angular image reprojection error (Oliensis, 2002), and write the

self-calibration problem as a small-scale optimization problem where the cost function

depends only on the parameters of the camera. Further, since the cost function appears

to have many local minima, a multi-step approach is proposed for the minimization.

Although the approach does not completely remove the problem of local minima, the

experiments in Paper IV show that successful self-calibration can be achieved if reason-

able constraints are provided for the camera parameters.

Noncentral cameras

Self-calibration of generic noncentral cameras is a difficult and little-studied topic.

This thesis concentrates on central cameras but, for example, the paper by Ramalingam

et al. (2006b) deals with the self-calibration of radially symmetric noncentral cameras.

Their method utilizes dense image matches between several views of an unknown pla-

nar scene. However, without the assumption of radial symmetry, the self-calibration

of noncentral cameras is probably a very difficult problem because there is not much

information to be utilized, especially if both the scene structure and camera motion are

completely unknown (Ramalingam, 2006).

3.3 Discussion

The camera model and calibration method presented in Paper II are perhaps the most

practically useful contributions of this thesis in the area of camera calibration. The

proposed plane-based calibration approach allows precise modeling of various kinds of

real cameras, as illustrated by the examples in Paper III. On the other hand, besides

photogrammetric calibration, the thesis also touches on camera self-calibration.

However, despite some encouraging results, the local minima problem discussed

in Paper IV is not yet fully solved and, hence, further improvements are needed before

the suggested self-calibration approach is ready for practical use in applications. Indeed,

as pointed out also by Ramalingam & Sturm (2008), self-calibration of generic cameras

is a challenging topic and there are still many unresolved problems related to it.

The plane-based calibration algorithm, proposed in Paper II, is implemented in a

Matlab toolbox, which is publicly available and includes a semi-automatic procedure

for localizing control points from images of planar dot patterns. The toolbox is an

additional practical contribution of the thesis, and it aims to facilitate the use of omnidi-

rectional cameras in metrology applications. In fact, there has previously been a lack of


publicly available, easy-to-use calibration tools for omnidirectional cameras. However,

recently the situation has improved as several new calibration programs have been made

available. For example, implementations of the methods described in papers (Barreto

& Araujo, 2005), (Scaramuzza et al., 2006), (Mei & Rives, 2007), and (Tardif et al.,

2006) are currently provided by the respective authors. Nevertheless, most of the avail-

able calibration tools assume a central camera model, which is fully radially symmetric,

whereas our toolbox additionally allows modeling asymmetric distortions.

Examples of application areas, where calibrated omnidirectional cameras may be

utilized, include structure from motion (Mouragnon et al., 2009), three-dimensional

modeling (Lhuillier, 2008b), panoramic imaging (Xiong & Turkowski, 1997), medical

imaging (Stehle et al., 2007) and image-based lighting in computer graphics (Fuchs

et al., 2007; Pronk, 2006). For instance, in (Stehle et al., 2007), the calibration approach

of Paper II is used for an endoscopic camera and, in (Pronk, 2006), it is used for a

fish-eye lens camera which captures lighting data for photorealistic computer graphics.

Overall, the various innovative and relatively recent applications of omnidirectional

cameras indicate that calibration of generic cameras is a relevant research topic, and the

results can often be directly utilized in practice.


4 Image-based scene reconstruction

Image-based scene reconstruction is a classical topic in geometric computer vision. In

this thesis, it can be seen as a high-level topic which provides a connection between

such subtopics as geometric camera calibration and image matching, which are the

two central themes of the thesis. In fact, the goal of using cameras as measurement

devices for scene reconstruction has been an important motivation for the many efforts

in geometric camera calibration which were discussed in Chapter 3. On the other hand,

image matching, which is the topic of Chapter 5, is also an essential task that is needed

in image-based scene reconstruction, as well as in many other problems of computer

vision.

However, in this chapter, the problem area of image-based scene reconstruction is

briefly discussed as a whole, and the related sewer imaging application of Paper V

is introduced. Overall, the approach for modeling sewer pipes from video, outlined

in Section 4.2, can be considered as an application specific example of a multi-view

reconstruction pipeline, which is discussed on a general level in Section 4.1.

4.1 Brief review of related work

This section aims to give a brief introduction to the research literature, the results of

which are applied in the sewer imaging system of Paper V. However, because image-

based scene reconstruction is a wide topic, only a small part of the previous research is

covered here.

The standard pipeline for multi-view reconstruction typically contains three main

stages, namely, sparse matching, structure from motion, and image-based modeling

(Pollefeys, 1999; Nistér, 2001; Gargallo, 2008; Martinec, 2008). In the sparse matching

stage a number of interesting features, such as edges or corners, are detected from

each image and then matched between the different images. Thereafter, the structure

from motion stage uses the feature correspondences to compute the camera poses and

the three-dimensional structure of the features. Finally, given the camera poses, the

image-based modeling stage computes a dense reconstruction of the scene. Hence, it is

assumed here that the structure from motion stage includes self-calibration if necessary

so that the pipeline outlined above is consistent with the one presented in Figure 1.


The sewer modeling approach of Paper V mainly follows the standard three-stage

pipeline, but the usual generic dense reconstruction stage is replaced by a model-based

approach, where a parametric tubular model is directly fitted to the reconstructed in-

terest points. An overview of the sewer imaging system is presented in Section 4.2,

but first, the following subsections list some general background literature on structure

from motion and image-based modeling.

4.1.1 Structure from motion

Structure from motion, i.e. the simultaneous computation of camera motion and sparse

scene structure from multiple images, is a well studied problem, and it is extensively

described in (Hartley & Zisserman, 2000) and (Faugeras et al., 2001), for example. In

this thesis, we mainly confine ourselves to applying known structure from motion meth-

ods in the sewer imaging framework. However, there is still also active methodological

research in the field, as briefly discussed below.

Structure from motion methods require feature correspondences between the im-

ages as their input. In the case of continuous video sequences, the feature extraction

and matching is usually relatively straightforward (Mouragnon et al., 2009), and powerful

methods have also emerged for reliable matching of wide baseline images

(Mikolajczyk et al., 2005; Lowe, 2004). Given the feature correspondences, the stan-

dard approach to structure from motion is to first determine the two-view geometry

between neighboring pairs of views and then merge the local sparse reconstructions

together, either incrementally or hierarchically, and finally refine the reconstruction by

global bundle adjustment (Hartley & Zisserman, 2000). The basic principles of the

methods have been known for a long time, but recent research has focused on efficiency

issues so that real-time performance could be achieved (Nistér, 2005; Pollefeys et al.,

2008; Mouragnon et al., 2009). Further, besides the fast video-based reconstruction

systems, there are recent approaches that concentrate on robust scene reconstruction from

unorganized image collections (Martinec, 2008; Snavely et al., 2008).

In addition to such issues as computational efficiency and robustness to varying

imaging conditions, the development of generic methods which are applicable for

various types of cameras has been an active research topic in structure from motion

(Ramalingam et al., 2006a; Mouragnon et al., 2009). This line of research is also related to the

self-calibration of generic cameras, which was discussed in the previous chapter. For

example, there are some studies that attempt to generalize the concept of a fundamental


matrix for non-perspective cameras (Claus & Fitzgibbon, 2005; Barreto & Daniilidis,

2006; Sturm & Barreto, 2008).

4.1.2 Image-based modeling

The term image-based modeling refers to the task of computing a three-dimensional

geometric model of a scene from multiple images which are taken from known camera

viewpoints. In general, this task is also called multi-view stereo reconstruction, and it

has been extensively studied (Seitz et al., 2006).

Multi-view stereo algorithms aim to produce a dense scene reconstruction which is

visually compatible with the input images. Visual compatibility is usually measured by

some photo-consistency measure, such as normalized cross-correlation of corresponding

image patches (Seitz et al., 2006; Furukawa & Ponce, 2007). Fully automatic multi-view

reconstruction of complex scenes is a challenging task, especially if the input

images are taken under varying lighting conditions, and the scene contains partially

occluded objects or low textured surfaces. However, the multi-view stereo problem

has been under active study during recent years, e.g. (Strecha, 2007; Gargallo, 2008),

and currently there are systems that are capable of producing accurate and dense

photorealistic reconstructions under challenging real world conditions (Goesele et al., 2007;

Furukawa & Ponce, 2007).

In addition to generic multi-view stereo methods which are applicable for arbitrarily

shaped objects, there are works that concentrate on modeling certain specific object

classes, such as faces (Xin et al., 2005), architectural scenes (Furukawa et al., 2009) or

human bodies (Starck & Hilton, 2003). In this thesis, the focus is on modeling sewer

pipes.

4.2 Application: Modeling sewer pipes from video

4.2.1 Motivation

The proper functioning of sewerage systems is essential for modern infrastructure. How-

ever, in many countries sewer networks are deteriorating due to their high age, and their

restoration and maintenance requires significant investments (Kuntze & Haffner, 1998;

Cooper et al., 1998; Chae & Abraham, 2001). Thus, motivated by the economic

reasons, there have been attempts to develop automatic methods for condition assessment

of sewer pipes.

Traditionally the condition of sewer pipes is evaluated by visual inspection of video

sequences which are scanned by a camera moving inside the pipe. However, manual

inspection has some disadvantages, such as subjectivity and high costs, and hence, there

are several approaches that have been suggested for automation of sewer surveys. For

instance, an idea of reconstructing the three-dimensional structure of sewer pipes from

survey videos was introduced by Cooper et al. (1998), and automatic detection of pipe

joints and surface cracks from digital sewer images was studied by Xu et al. (1998),

Chae & Abraham (2001) and Sinha & Fieguth (2006).

In this thesis, Paper V describes a method for modeling sewer pipes from survey

videos. An overview of the method is described in the following section.

4.2.2 Overview of the approach

The system described in Paper V recovers the interior shape of a sewer pipe from a

survey video which is obtained by moving a precalibrated fish-eye lens camera and a

light source through the pipe. The reconstruction approach is based on tracking interest

points across successive video frames, and computing their three-dimensional arrange-

ment by structure from motion techniques. The structure from motion stage is followed

by a modeling stage, which robustly estimates the shape of the pipe by fitting a paramet-

ric tubular model to the reconstructed points. Hence, there are two main stages in the

proposed measurement approach, structure recovery and modeling, and they are briefly

summarized below. The outline of the approach is illustrated in Figure 7, where a typ-

ical inspection system is shown on the left, and the obtained tubular model is on the

right.


Fig. 7. An illustration of the sewer modeling system. Left: A video sequence is

acquired by a remote controlled camera which moves inside a sewer pipe. Middle:

Structure and motion estimation is based on interest points, which are extracted

from the textured surface of the pipe and tracked across successive video frames.

Right: A tubular surface model is fitted to the reconstructed interest points.

Revised from Paper V. © 2008 Springer

Structure recovery

The structure recovery of sewer pipes is based on a standard structure from motion ap-

proach (Fitzgibbon & Zisserman, 1998), which is adapted to fish-eye image sequences.

Thus, the proposed method requires that the inner surface of the pipe has some texture

so that interest points can be extracted and reconstructed from the images. In Paper

V, the interest points were detected by the Harris corner detector (Harris & Stephens,

1988), and the experiments with a real sewer video showed that there are plenty of such

features in eroded concrete pipes. Hence, in this case, the set of reconstructed feature

points was sufficiently dense in order to allow pipe shape estimation by surface fitting

in the modeling stage. Further, the error analysis, presented in Paper V, suggests that a

relatively low uncertainty of reconstruction can be achieved despite the forward motion

of the camera.

The wide field of view of a fish-eye lens camera is essential in the sewer imaging

application as it enables us to obtain a high resolution scan of the whole pipe with a

single pass. Our approach assumes that the camera is precalibrated. In the experiments

of Paper V, the camera calibration was performed by using the approach of Paper II.

The large radial distortion of fish-eye lenses requires modifications to the conventional


structure from motion techniques, which are designed for perspective cameras. Thus,

Paper V gives a relatively detailed explanation of our implementation, which is suitable

for calibrated central cameras with a field of view up to 180 degrees. However, any other

generic structure from motion system could be used as well. For example, the recent

work by Lhuillier (2008a) describes a structure from motion framework for calibrated

omnidirectional cameras.

Modeling

In the modeling stage the shape of the pipe is estimated by fitting a piecewise cylindrical

model to the reconstructed points. Hence, in order to model the bending of pipes, Paper

V proposes the use of a tubular model which is concatenated from short cylindrical

pieces and then smoothed along the pipe. Further, Paper V describes a robust cylinder

fitting procedure, where the reconstructed points are first divided into several sections

along the pipe, and a cylinder with an elliptical cross-section is fitted to each section.

Finally, the parameters of the cylindrical pieces are interpolated along the pipe so that

the obtained tubular surface is smooth also in the main axis direction.
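
The idea of dividing the reconstructed points into sections and fitting one cross-section per section can be sketched as follows. This is a deliberately simplified stand-in and not the procedure of Paper V: it assumes that the pipe axis is already aligned with the z-axis, uses circular instead of elliptical cross-sections, minimizes an algebraic rather than a geometric cost, and does no outlier rejection. The names `fit_circle` and `fit_sections` and the synthetic test data are hypothetical.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_circle(xy):
    """Algebraic least-squares circle fit; returns (cx, cy, r).

    Solves the normal equations of x^2 + y^2 = a*x + b*y + c by
    Cramer's rule. Needs at least three non-collinear points.
    """
    n = float(len(xy))
    sx = sum(x for x, _ in xy)
    sy = sum(y for _, y in xy)
    sxx = sum(x * x for x, _ in xy)
    syy = sum(y * y for _, y in xy)
    sxy = sum(x * y for x, y in xy)
    sz = sum(x * x + y * y for x, y in xy)
    sxz = sum(x * (x * x + y * y) for x, y in xy)
    syz = sum(y * (x * x + y * y) for x, y in xy)
    m = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(m)
    sol = []
    for col in range(3):
        mc = [[rhs[i] if j == col else m[i][j] for j in range(3)]
              for i in range(3)]
        sol.append(det3(mc) / d)
    a, b, c = sol
    cx, cy = a / 2.0, b / 2.0
    return cx, cy, math.sqrt(c + cx * cx + cy * cy)

def fit_sections(points, section_length):
    """Group (x, y, z) points into sections along z and fit each one."""
    sections = {}
    for x, y, z in points:
        key = math.floor(z / section_length)
        sections.setdefault(key, []).append((x, y))
    return {k: fit_circle(v) for k, v in sorted(sections.items())
            if len(v) >= 3}

# Synthetic, noise-free points on a cylinder (assumed example data):
# axis through (1, -2) parallel to z, radius 3.
points = []
for i in range(20):
    z = 0.1 * i
    for j in range(8):
        theta = 2.0 * math.pi * j / 8 + 0.1 * i
        points.append((1.0 + 3.0 * math.cos(theta),
                       -2.0 + 3.0 * math.sin(theta), z))

fits = fit_sections(points, section_length=1.0)
```

In a pipeline closer to the one described above, the per-section parameters would then be interpolated along the pipe, and the algebraic fit would at most serve as an initial guess for a robust geometric fit.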

The robust cylinder fitting procedure, proposed in Paper V, has the following prop-

erties: (a) it minimizes a geometric cost function, (b) it is robust to outliers, and (c) it

can be applied to elliptical cylinders. There are also other approaches to cylinder fitting,

e.g. (Faber & Fisher, 2001; Lukács et al., 1998; Werghi et al., 1998), but the approach

of Paper V was used in the sewer modeling system as it includes a method for the ro-

bust initialization of the cylinder parameters, and has all the aforementioned desirable

properties.

4.3 Discussion

Image-based scene reconstruction can be seen as a unifying high-level theme for the top-

ics of this thesis, and this chapter briefly described the general background of the field.

At a more concrete level, the purpose of this chapter was to present the contributions of

Paper V, which are further summarized and discussed below.

The main contribution of Paper V is that it describes a complete system for acquiring

a three-dimensional model of a sewer pipe from a survey video. The experiments with

a real sewer video show that the proposed approach can recover the shape of a sewer

pipe from a fish-eye video sequence which is scanned by a single pass through the pipe.


We are not aware of any previous work where such a modeling system would have been

demonstrated in practice. Hence, Paper V presents a new application of structure from

motion techniques into the sewer inspection domain, where computer vision methods

are not yet widely used. Further, an additional contribution of Paper V is a robust and

practical method for modeling tubular surfaces from three-dimensional point clouds.

Overall, the methods presented in Paper V are not necessarily limited to the sewer imaging

application, but they could be useful in other applications as well. Typically the

wide field of view of an omnidirectional camera provides advantages in structure and

motion estimation (Micušík & Pajdla, 2006), and therefore omnidirectional structure

from motion methods, such as the implementation of Paper V, are likely to be useful

in many applications. In addition, the modeling of tubular surfaces could also be

utilized in other problem areas. In particular, the methods presented might be useful in

the field of medical imaging, as there have already been some attempts to use computer

vision techniques for structure recovery from endoscopic images (Burschka et al., 2004;

Schmidt et al., 2002; Wang et al., 2008).


5 Quasi-dense matching

5.1 Introduction

Image matching, i.e. determination of corresponding pixels between two different im-

ages of the same scene, is a vision task that is needed in many problems. For example,

in the area of structure from motion, computation of camera motion from image se-

quences requires sparse matching of interest points between the different images (Hart-

ley & Zisserman, 2000). Further, many recent approaches to viewpoint invariant object

recognition are based on matching of local image regions (Obdrzálek & Matas, 2002;

Lowe, 2004). However, matching a sparse set of interest points or regions is not always

sufficient. For instance, such tasks as surface reconstruction (Strecha, 2007) or object

recognition (Ferrari et al., 2006) may require dense or quasi-dense matching of images.

This thesis studies a quasi-dense approach to image matching. The work builds on

the previous work by Lhuillier & Quan (2002), and extends their approach to the wide

baseline case. In addition, the thesis presents applications of quasi-dense matching to

object recognition and two-view motion segmentation.

The key contribution in the paper by Lhuillier & Quan (2002) is a quasi-dense

matching algorithm, which is based on the match propagation principle. The algo-

rithm starts from a sparse set of seed matches between two images, then propagates to

the neighboring pixels by the best-first strategy, and finally produces a quasi-dense set

of matching pixels. The quasi-dense correspondences typically provide a good basis

for such tasks as two-view geometry estimation and surface reconstruction (Lhuillier

& Quan, 2005). Compared to conventional dense stereo correspondence algorithms

(Scharstein & Szeliski, 2002), the quasi-dense approach has the advantage that it can

also be used for uncalibrated image pairs where the epipolar constraint is not known.

Further, the match propagation algorithm is efficient in terms of time and memory, and

hence provides a tenable alternative for cases where a completely dense matching is not

required.

Thus, our work builds on (Lhuillier & Quan, 2002), but there are also several other

works that utilize the idea of region growing in image matching. For example, the early

paper (Otto & Chau, 1989) is perhaps the first work that uses this kind of idea. Also

the work by Chen & Medioni (1999) utilizes the growing principle but without the

best-first matching strategy. Further, among the more recent works, Megyesi & Chetverikov

(2004) present a match propagation method, which allows affine deformations of

corresponding image patches, and Cech & Sára (2007) modify the method by Lhuillier &

Quan (2002) so that the accuracy and correctness of matching is guaranteed even in

the presence of repeating texture patterns. However, most of the previous approaches,

including (Megyesi & Chetverikov, 2004) and (Cech & Sára, 2007), are designed for

conventional rectified stereo image pairs where the scene is rigid and the corresponding

epipolar lines are parallel.

A significant limitation of the algorithm by Lhuillier & Quan (2002) is that it is

not directly applicable in the wide baseline case. Hence, Papers VI and VII propose

extensions which make the algorithm suitable for the matching of wide baseline images

of both rigid and non-rigid scenes. Further, Paper VII additionally presents an approach

for utilizing quasi-dense matches for reliable object recognition in the presence of geo-

metric deformations and extensive background clutter. Finally, Paper VIII proposes the

use of quasi-dense matching as an initialization for a dense and deformable two-view

motion segmentation method. The following section summarizes the ideas of Paper

VI, whereas the contributions of Paper VII are described in Sections 5.3 and 5.4. An

overview of the approach of Paper VIII is described in Section 5.5, and the discussion

in Section 5.6 concludes this chapter.

5.2 Match propagation in the wide baseline case

In the case of wide baseline stereo images, the camera viewpoint differs greatly between

the two views. An example of such an image pair is shown in Figure 8 (Mikolajczyk

et al., 2005). There are certain issues in the match propagation algorithm by Lhuillier

& Quan (2002) which limit its applicability for wide baseline matching. In particular,

at each step of propagation, small image patches are extracted around the current seed

point in both images and the new candidate matches are scored according to the zero-

mean normalized cross-correlation (ZNCC) of the corresponding patches. Thus, it is

implicitly assumed that the local transformation between the images is effectively a

translation, and this assumption is not necessarily valid for wide baseline image pairs.

Hence, in order to widen the applicability of the method, Paper VI proposes the use of a

general affine model for the geometric transformation between the local image patches.
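
As a minimal illustration of the similarity measure discussed above, the following Python sketch computes the ZNCC of two equally sized patches given as flat lists of intensities. The function name `zncc` and the convention of returning 0 for a patch without intensity variation are assumptions made for this example, not details taken from Lhuillier & Quan (2002) or Paper VI.

```python
import math

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two patches.

    Patches are flat sequences of intensities of equal length.
    Returns a value in [-1, 1], or 0.0 when either patch is constant
    (an assumed convention for the undefined case).
    """
    if len(patch_a) != len(patch_b) or len(patch_a) == 0:
        raise ValueError("patches must be non-empty and of equal size")
    mean_a = sum(patch_a) / len(patch_a)
    mean_b = sum(patch_b) / len(patch_b)
    da = [a - mean_a for a in patch_a]
    db = [b - mean_b for b in patch_b]
    norm = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if norm == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(da, db)) / norm
```

Because the mean is subtracted and the result is normalized, the score does not change under a gain and offset of the intensities. It still compares pixels at the same patch coordinates, however, which is why the patches must first be brought to a common geometric frame in the wide baseline case.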


Fig. 8. Top: Two views of a wall from the dataset of Mikolajczyk et al. (2005).

The homography between the views is known. The ellipses denote matched re-

gions, whose centroids are used as a seed for match propagation. Bottom: The

matched pixels, computed by propagation with (right) and without (left) affine nor-

malization, are illustrated in the second view by coloring them according to their

distance from the true match. (The distances over 5 are suppressed to 5 and the

gray-value for the noncommon area is 6.)

The main stages of quasi-dense wide baseline matching are as follows. First, a sparse set

of initial matches is obtained by matching affine covariant regions (Mikolajczyk et al.,

2005) between the two images. Hence, the output of the initial matching stage is a set

of corresponding points $\{(\mathbf{x}_i, \mathbf{x}'_i)\}_i$ (the centroids of the matched regions) accompanied

by the affine transformation matrices $\mathbf{A}_i$, which represent approximations for the local

geometric transformations between the images. The initial matches are used as seed

points for the match propagation, which searches new matches from the surrounding

image areas by using zero-mean normalized cross-correlation (ZNCC) as a similarity

measure. The matches obtained are stored in a disparity map which is filled in by

iterating the following steps:

(i) the seed point $(\mathbf{x}_i, \mathbf{x}'_i)$ with the highest ZNCC score is removed from the list of seed points

(ii) new candidate matches are searched from the surroundings of $(\mathbf{x}_i, \mathbf{x}'_i)$ by using $\mathbf{A}_i$ for the geometric normalization of local image neighborhoods

(iii) the candidate matches with a sufficiently high ZNCC score are stored in the disparity map, and added to the list of seed points.

In this manner, the number of correspondences in the disparity map increases until the

list of seeds becomes empty.
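
The three steps above can be sketched as a best-first loop over a priority queue. The sketch below is only an outline under stated assumptions: the function `propagate`, its callback arguments `candidates` and `score`, the one-to-one bookkeeping, and the toy one-dimensional "images" are all hypothetical and stand in for the affine-normalized ZNCC scoring and neighborhood search of Paper VI.

```python
import heapq

def propagate(seeds, candidates, score, threshold):
    """Best-first match propagation following steps (i)-(iii).

    seeds      : iterable of (p, q) matches, p in image 1, q in image 2
    candidates : function (p, q) -> iterable of nearby (p, q) pairs
    score      : function (p, q) -> similarity, higher is better
    threshold  : minimum score for accepting a candidate
    Returns a dict p -> q playing the role of the disparity map.
    """
    heap = []
    matched = {}
    used = set()  # assumption: each pixel of image 2 is matched once
    for p, q in seeds:
        heapq.heappush(heap, (-score(p, q), p, q))
        matched[p] = q
        used.add(q)
    while heap:
        _, p, q = heapq.heappop(heap)        # (i) best seed first
        for cp, cq in candidates(p, q):      # (ii) search nearby
            if cp in matched or cq in used:
                continue
            s = score(cp, cq)
            if s >= threshold:               # (iii) store and re-seed
                matched[cp] = cq
                used.add(cq)
                heapq.heappush(heap, (-s, cp, cq))
    return matched

# Toy data (assumed): img2 is img1 shifted right by two pixels.
img1 = [5, 1, 4, 2, 8, 3, 7, 6, 0, 9]
img2 = [0, 0, 5, 1, 4, 2, 8, 3, 7, 6]

def neighbours(p, q):
    """Candidate pairs one pixel to the left and right of a seed."""
    return [(p + d, q + d) for d in (-1, 1)
            if 0 <= p + d < len(img1) and 0 <= q + d < len(img2)]

def equal_intensity(p, q):
    """Stand-in similarity; a real system would use ZNCC of patches."""
    return 1.0 if img1[p] == img2[q] else 0.0

matched = propagate([(4, 6)], neighbours, equal_intensity, threshold=0.5)
```

Starting from the single seed (4, 6), the loop grows the match set to all pixels of `img1` that have a counterpart in `img2`. In the method of Paper VI, each heap entry would additionally carry the affine matrix used to normalize the patches before scoring.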

As summarized above, the main difference between the original method by Lhuillier

& Quan (2002), and the method of Paper VI is the fact that in the latter approach a seed

match is always associated with a local affine transformation, which allows us to trans-

form the image patches to a common coordinate frame at each propagation step. This

geometric normalization process is schematically illustrated in Figure 9. The addition

of an affine transformation model is a relatively straightforward extension to (Lhuillier

& Quan, 2002). However, in practice, this extension is essential for the performance of

the method in wide baseline cases, as shown by the experiments in Paper VI. This is

also illustrated in the example of Figure 8, where the affine normalization significantly

increases the number and quality of matches. Moreover, by following the detailed

implementation guidelines of Paper VI, the improvement in matching performance can be

achieved while preserving the efficiency of the quasi-dense approach. An additional

advantage of the proposed extension is the fact that it is able to directly utilize various

types of affine covariant interest regions (Mikolajczyk et al., 2005) as seed matches.

Hence, it can be seen that the proposed approach naturally supplements the recent tech-

niques for sparse wide baseline matching.

Besides introducing the affine model for the local geometric transformations, Paper

VI additionally proposes an adaptive propagation method, where the current estimate

of the affine transformation can be adjusted during the propagation by using the second

order intensity moments locally, together with the epipolar geometry. Hence, the adap-

tive version of the propagation method can be used for image pairs of rigid scenes when

the epipolar geometry is known. The experiments in Paper VI show that the adaptive

method allows a single seed match to propagate into regions where the local transforma-

tion between the views differs from the initial one. However, the adaptation principle

of Paper VI is not applicable for non-rigid scenes, and therefore Paper VII describes

a method where the adaptation is based entirely on the local texture properties of the

images. The details are briefly summarized in the following section.


[Figure 9 diagram, labels: A; x, x′; u, u′; I, I′; "Comparison by ZNCC"]

Fig. 9. Affine normalization of local image neighborhoods around a seed match

(x,x′). The candidate match (u,u′) is evaluated by computing the ZNCC measure

between the corresponding normalized patches.

5.3 Non-rigid quasi-dense matching

Paper VII describes a non-rigid match propagation method which uses the local image

gradients and the second order intensity moments to adjust the estimate of the local

affine transformation during the propagation. Hence, unlike in Paper VI, epipolar ge-

ometry is not required for adaptive matching, and therefore the proposed approach is

particularly suitable for such cases where the epipolar geometry is not known, or the

scene is deforming.

As in Paper VI, the adaptation is based on the windowed second moment matrix of

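the image intensity function $f$, which is defined by

$$\mathbf{S}_{f,g}(\mathbf{u}) = \int \mathbf{v}\mathbf{v}^\top f(\mathbf{v})\, g(\mathbf{u}-\mathbf{v})\, d\mathbf{v}, \tag{24}$$

where the function $g$ is a positive window function. By assuming that the intensity

function $f'$ and the window function $g'$ of the other image are affine transformed versions

of $f$ and $g$, i.e. $f'(\mathbf{u}) = f(\mathbf{A}^{-1}\mathbf{u})$ and $g'(\mathbf{u}) = g(\mathbf{A}^{-1}\mathbf{u})/|\det\mathbf{A}|$, a change of variables

in (24) gives the following transformation rule

$$\mathbf{S}_{f',g'}(\mathbf{u}) = \mathbf{A}\,\mathbf{S}_{f,g}(\mathbf{A}^{-1}\mathbf{u})\,\mathbf{A}^\top. \tag{25}$$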

Thus, it is assumed here that the coordinate systems in both images are centered to the

points under consideration which causes the translational part of the affine transforma-

tion to vanish. Then, by using the simplifying notationsS′ = S f ′,g′(0) andS = S f ,g(0),

the positive definiteness of (24) together with (25) implies that

A = S′1/2RS−1/2, (26)

whereR is an arbitrary orthogonal matrix. Hence, givenS andS′, the matrixA can be

determined up to a rotation.
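The relation (26) can be checked numerically. At the origin, (25) reads S' = A S A^T, so the matrix S'^{-1/2} A S^{1/2} multiplied by its own transpose gives the identity, which is why it must be orthogonal. The sketch below is only an illustration under stated assumptions: it is restricted to 2x2 matrices stored as nested lists, all function names are hypothetical, the free orthogonal factor is taken to be a proper rotation by an angle theta, and the matrix square root uses the standard closed form for symmetric positive definite 2x2 matrices.

```python
import math


def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[0][0], a[1][0]], [a[0][1], a[1][1]]]


def inverse(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def sqrt_spd(m):
    """Principal square root of a symmetric positive definite 2x2 matrix.

    Uses sqrt(M) = (M + sqrt(det M) I) / sqrt(trace M + 2 sqrt(det M)).
    """
    s = math.sqrt(m[0][0] * m[1][1] - m[0][1] * m[1][0])
    t = math.sqrt(m[0][0] + m[1][1] + 2.0 * s)
    return [[(m[0][0] + s) / t, m[0][1] / t],
            [m[1][0] / t, (m[1][1] + s) / t]]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def affine_from_moments(s, s_prime, theta):
    """Evaluate Eq. (26): A = S'^(1/2) R(theta) S^(-1/2)."""
    return matmul(matmul(sqrt_spd(s_prime), rotation(theta)),
                  inverse(sqrt_spd(s)))
```

For any value of theta, the returned matrix A satisfies A S A^T = S', which illustrates that the second moment matrices alone leave one rotational degree of freedom undetermined.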

The idea in adaptive match propagation is to use the affine transformation of the

current seed match to compute the local windows for a new candidate match, and esti-

mateS andS′ using these windows. Then the affine transformation for the new match

is computed by (26), where the remaining rotational degree of freedom is determined

either by using the epipolar lines of the matching points, as described in Paper VI, or

by computing the dominant directions of image gradients in the local neighborhoods

of the new match. The latter approach is described in Paper VII, and it has the advan-

tage that the adjustment is based solely on local texture properties, and this allows the

propagation to adapt to smooth non-rigid deformations of the imaged surfaces.
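The control flow of such an adaptive propagation can be sketched as a best-first loop over a priority queue. This is a simplified illustration and not the algorithm of Papers VI and VII as published: the callbacks `score`, `neighbors` and `adapt` are hypothetical placeholders for the similarity measure, the generation of candidate matches, and the re-estimation of the local affine transformation, and details such as constraints on match uniqueness are omitted.

```python
import heapq
from itertools import count


def propagate(seeds, score, neighbors, adapt, threshold):
    """Grow matches best-first from a list of (match, transform) seeds.

    score(match, transform)  -> similarity of the match under a local transform
    neighbors(match)         -> candidate matches near an accepted match
    adapt(match, transform)  -> re-estimated local transform for a new match
    All three callbacks are hypothetical and supplied by the caller.
    """
    heap = []        # max-heap emulated with negated scores
    tie = count()    # tie-breaker so that transforms are never compared
    queued = set()   # matches that have already been pushed to the heap
    accepted = {}    # match -> local transform

    for match, transform in seeds:
        s = score(match, transform)
        if s >= threshold and match not in queued:
            queued.add(match)
            heapq.heappush(heap, (-s, next(tie), match, transform))

    while heap:
        _, _, match, transform = heapq.heappop(heap)
        accepted[match] = transform
        for cand in neighbors(match):
            if cand in queued:
                continue
            # The candidate is evaluated with the transform of its parent.
            s = score(cand, transform)
            if s < threshold:
                continue
            # Once it passes, it is queued with its own adapted transform.
            queued.add(cand)
            heapq.heappush(heap, (-s, next(tie), cand, adapt(cand, transform)))
    return accepted
```

In a non-adaptive variant, `adapt` would simply return the transform of the parent unchanged, so that the affine transformation stays constant during the propagation.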

An example of non-rigid image registration by quasi-dense matching is shown in

Figure 10, where a single seed match is grown by using both the non-adaptive and

adaptive version of the propagation algorithm. Since here the artificial deformation

between the two images is known, the accuracy of the obtained quasi-dense matches

can be evaluated and it is illustrated by the color coding. It can be seen that in this case,

the adaptive method produces more matches than the non-adaptive method, which keeps

the affine transformation constant during the propagation. Further, the adaptation does

not essentially reduce the accuracy of the matches. Hence, the example shows that the

proposed adaptation principle works in practice, and efficiently improves the matching

of non-rigid scenes.


Fig. 10. Top: A pair of images where the artificial deformation is known. The centroids of the yellow ellipses are used as a seed for match propagation. Bottom: The matched pixels obtained by the non-adaptive (left) and adaptive (right) propagation methods. The distance between the matched point and its true position in the deformed image is used for the color coding. Revised from Paper VII. © 2008 IEEE.

5.4 Application in object recognition

Besides proposing the non-rigid quasi-dense matching method, Paper VII additionally presents its application to object recognition and segmentation. The inspiration for this kind of application arises from previous work (Ferrari et al., 2006), which shows that a correspondence growing method may be used to improve the performance of an object recognition approach which is based on local features.

The problem setting in the object recognition application is as follows: it is assumed that some presegmented model images of objects are given and the task is to recognize the given object instances from unknown test images where the illumination or camera viewpoint may differ greatly from those in the model images. In addition, the test images may contain occlusions and background clutter. Some examples of challenging recognition tasks are shown in Figure 11.

Fig. 11. An example of a recognition and segmentation task where the given model objects (left) are recognized and segmented from query images (right). The colored contours illustrate the segmentation results. The example images are from the ETHZ Toys dataset (Ferrari et al., 2006). Revised from Paper VII. © 2008 IEEE.

Many recent approaches to object recognition are based on local viewpoint invariant image features (Schmid & Mohr, 1997; Obdrzálek & Matas, 2002; Lowe, 2004). Typically, these features are extracted by using a region detector, which adapts to the local shape of the intensity surface, and is hence able to detect corresponding regions from the model and test images, despite the change in camera viewpoint (Mikolajczyk et al., 2005). Given the detected regions in the model and test images, the most straightforward approach for recognition is to represent the regions with features which allow reliable matching and then use the number of matched features as a recognition criterion (Obdrzálek & Matas, 2002). However, the performance of this kind of approach is limited in the presence of extensive background clutter, because the background may produce many incorrect feature matches which disturb the recognition process. Further, occlusion and large scale or viewpoint changes reduce the probability that a model feature is correctly extracted from the test image. Hence, in order to counter these problems, some recent works have applied the principle of correspondence growing in object recognition (Ferrari et al., 2006; Cech et al., 2008). The key idea in these works is to utilize the fact that correctly matched regions typically grow better than the false ones. This fact allows one to improve the discrimination between the correct and incorrect matches, and thereby the performance of the recognition system.

Thus, inspired by Ferrari et al. (2006), Paper VII proposes an approach for object recognition and segmentation which is built directly on quasi-dense matching.

The proposed approach has three main stages: match propagation, match grouping,

and recognition. The first stage performs the non-rigid quasi-dense matching proce-

dure between the model and test images. The second stage groups the quasi-dense

pixel matches into geometrically consistent groups by using a method which utilizes

the local affine transformation estimates obtained during the propagation. That is, the

neighboring triplets of quasi-dense matches, connected by Delaunay triangulation, are

merged to the same group if the motion of the connecting triangle is consistent with

the affine motions of its vertices. Finally, in the third stage, the number and quality of

geometrically consistent matches are used as factors for defining the decision criterion

for recognition. Further, if the object is recognized to be present in the test image, the

location of the matching pixels directly provides the segmentation. A more detailed description of the three stages can be found in Paper VII.
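To make the grouping test of the second stage more concrete, the sketch below compares the affine motion of a single triangle with the local affine estimates stored at its three vertices. It is a simplified illustration only: the relative Frobenius norm criterion and the tolerance value are assumptions made here, the function names are hypothetical, and the actual consistency measure is the one defined in Paper VII.

```python
import math


def affine_from_triangle(src, dst):
    """Linear part (2x2) of the affine map taking triangle src to triangle dst.

    src and dst are lists of three (x, y) points in corresponding order.
    """
    (x0, y0), (x1, y1), (x2, y2) = src
    (u0, v0), (u1, v1), (u2, v2) = dst
    # Edge vectors as matrix columns: F = A E, hence A = F E^(-1).
    e = [[x1 - x0, x2 - x0], [y1 - y0, y2 - y0]]
    f = [[u1 - u0, u2 - u0], [v1 - v0, v2 - v0]]
    det = e[0][0] * e[1][1] - e[0][1] * e[1][0]
    if det == 0:
        raise ValueError("degenerate source triangle")
    e_inv = [[e[1][1] / det, -e[0][1] / det],
             [-e[1][0] / det, e[0][0] / det]]
    return [[sum(f[i][k] * e_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def frobenius(m):
    return math.sqrt(sum(v * v for row in m for v in row))


def triangle_is_consistent(src, dst, vertex_affines, tol=0.2):
    """True if the triangle motion agrees with all three vertex estimates.

    The relative Frobenius criterion and tol are illustrative assumptions.
    """
    a_tri = affine_from_triangle(src, dst)
    scale = frobenius(a_tri)
    for a_v in vertex_affines:
        diff = [[a_tri[i][j] - a_v[i][j] for j in range(2)] for i in range(2)]
        if frobenius(diff) > tol * scale:
            return False
    return True
```

Triangles that pass such a test could then be merged into groups, for example with a union-find structure over the Delaunay triangulation; that step is not shown here.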

As summarized above, the approach of Paper VII is conceptually similar to that of Ferrari et al. (2006). However, since the proposed approach contains only one match propagation stage, it is more straightforward than the method by Ferrari et al. (2006), which involves repeated expansion and contraction phases, where the number and ratio of correct matches are gradually increased. Further, the approach of Paper VII does not use any global constraints, and handles the images symmetrically. This implies that the method can also be applied in cases where both the model image and the test image contain background clutter, as discussed in more detail in the following section. Besides Ferrari et al. (2006), the papers by Vedaldi & Soatto (2006) and Cech et al. (2008) also utilize the idea of correspondence growing to improve the discrimination between correct and incorrect region correspondences. However, these approaches concentrate on verifying the correctness of a single region match at a time and do not discuss the grouping of matching regions. Hence, they cannot be directly used for the segmentation problem.

In the experimental part of Paper VII, the performance of the proposed recognition approach was evaluated by performing the same experiment as in (Ferrari et al., 2006) with the publicly available ETHZ Toys dataset. The dataset contains 9 objects and 23 challenging test scenes, and the task is to determine which objects are present in the test scenes and to find the corresponding segmentations. Some examples of the model and test images are shown in Figure 11, together with the obtained segmentation results. Each colored contour in Figure 11 illustrates the boundary of a matching segment, which represents the support domain of a single group of quasi-dense matches. Only the most reliable segments are shown as they correspond to recognized objects. The reliability of a matching segment was measured by its correlation-and-coverage-weighted area in the model image, as described in Paper VII. Besides the qualitative evaluation, the recognition performance was quantified by computing the ROC curve as in (Ferrari et al., 2006). As reported in Paper VII, the obtained results were comparable to those of Ferrari et al. (2006).
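For readers unfamiliar with this kind of evaluation, the following sketch shows one common way of computing the points of an ROC curve from a list of recognition scores and ground-truth labels. It is a generic illustration with hypothetical names and is not the evaluation code of Paper VII; in the present setting, a score would be the reliability of the best matching segment for a given pair of model object and test image, and the label would tell whether the object is actually present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs.

    The detection threshold is swept from the highest score downwards;
    samples with equal scores are added to the counts together.
    """
    pos = sum(1 for label in labels if label)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both positive and negative samples are required")
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points
```

The curve always starts at (0, 0) and ends at (1, 1); a method whose correct detections receive higher scores than its false ones produces a curve that rises quickly towards the top-left corner.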

5.5 Framework for two-view motion segmentation

The problem of motion segmentation typically arises in a situation where one has a sequence of images containing differently moving objects, and the task is to extract the objects from the images using the motion information. In this context, the motion segmentation problem consists of the following two subproblems: (1) determination of groups of pixels in two or more images that move together, and (2) estimation of the motion fields associated with each group (Wills et al., 2006). Hence, in the case of two images, the motion segmentation problem is typically more challenging than the object recognition problem because neither one of the two images may be assumed to be presegmented.

Many early approaches to motion segmentation assume small motion between the images, and use dense optical flow techniques for motion estimation (Wang & Adelson, 1994; Weiss, 1997). The main limitation of optical flow based methods is that they are not suitable for large motions (Wills et al., 2006). Hence, in this thesis, Paper VIII studies the motion segmentation problem in the context of wide baseline image pairs. That is, the focus is on such cases where the motion of the objects between the two images may be very large due to non-rigid deformations and viewpoint variations. Further, spatially varying illumination changes, such as shadows, are another challenge for motion segmentation that may often occur in the wide baseline setting. In order to address the challenges posed by deforming motions and varying illumination, Paper VIII proposes a bottom-up motion segmentation approach which gradually expands and merges the initial matching regions into smooth motion layers and finally provides a dense assignment of pixels into these layers. Besides segmentation, the proposed method provides the geometric and photometric transformations for each layer.

The bottom-up segmentation method of Paper VIII utilizes the non-rigid quasi-

dense matching approach of Paper VII for initializing the motion layers. In detail, the

method starts from a sparse set of seed matches between the two images and then pro-

ceeds to quasi-dense matching, which expands the initial seed regions by local propaga-

tion. Then, the quasi-dense matches are grouped into coherently moving segments by

using a similar local grouping technique to that in Paper VII. The resulting segments

are used to initialize the motion layers for the final dense segmentation stage, where the

geometric and photometric transformations of the layers are iteratively refined together

with the segmentation. This alternating minimization scheme is formulated by using

a somewhat similar probabilistic model to that in (Simon & Seitz, 2007) and the pixel

level segmentation is obtained by graph cut based optimization. However, unlike Si-

mon & Seitz (2007), who concentrate on the object recognition problem, the proposed

method does not use any presegmented reference images, but detects and segments the

common regions automatically from both images. Further, Paper VIII uses a spatially

varying photometric transformation model which is more expressive than the global

model used by Simon & Seitz (2007).

Besides Paper VIII, the problem of dense two-view motion segmentation in the presence of multiple large motions has been studied by Bhat et al. (2006) and Wills et al. (2006). However, these earlier approaches do not model varying lighting conditions between the two images and they require that the motions are either rigid (Bhat et al., 2006) or approximately planar (Wills et al., 2006). Hence, due to its ability to deal with deforming motions and large illumination changes, the approach of Paper VIII provides a wider range of applicability than the previous methods.

The proposed motion segmentation method is illustrated with the example in Figure 12, where the two input images contain two common objects, the magazines. Besides the magazines, both images contain a lot of background clutter. Moreover, there are also other challenges present in the image pair: the illumination is different in the images, the motion of the magazines is non-rigid, and the foremost magazine appears at substantially different scales in the two images. The results obtained are illustrated in the last two columns of Figure 12, where the middle column shows the segmentations, and the last column shows the estimated geometric and photometric transformation fields. The meshes illustrate the geometric transformations, and the colors visualize the photometric transformations. The colors show how the gray color, shown on the background layer, would be transformed from the other image to the colored image. The result indicates that the white balance is different in the two images, i.e., the first image is bluer. In addition, it can be seen that the model has correctly captured the shadow on the corner of the foremost magazine in the first image.

Fig. 12. A pair of images from the ETHZ Toys dataset (left) and the extracted objects (middle) with the associated geometric and photometric transformations (right). Reprinted from Paper VIII. © 2009 Springer.

5.6 Discussion

Quasi-dense matching and geometric camera calibration are the two main themes of this thesis. Hence, one of the key contributions of the thesis is the set of extensions which make the quasi-dense approach applicable to wide baseline images of both rigid and non-rigid scenes. These extensions include the use of an affine normalization step as an integral part of the propagation algorithm, and the use of second order intensity moments, together with either epipolar geometry or local image gradients, for adaptive propagation. Importantly, the experiments in Papers VII and VIII show that, in the adaptive propagation mode, the parameters of the local affine transformation can be efficiently adjusted during the propagation without using any nonlinear optimization. As described, the adjustment is based on local texture properties and allows the match propagation to adapt to smooth variations of the imaged surfaces. This is an interesting observation and, to the best of our knowledge, this kind of adaptation principle has not been utilized in image matching before.


Overall, the proposed quasi-dense matching techniques can be seen as basic tools, which may be utilized in different vision tasks. For example, Papers VII and VIII propose approaches for object recognition and motion segmentation, which are based on quasi-dense matching. In addition, the earlier work by Lhuillier & Quan (2005) applies quasi-dense matching to surface reconstruction from image sequences.

The locality of matching is a characteristic feature of the match propagation algorithm and, depending on the point of view, this can be seen as a disadvantage or as an advantage. For example, in the case of conventional stereo matching of rectified images, it might be better to use some global approach, such as (Cech & Sára, 2007) or (Kolmogorov & Zabih, 2001), since these approaches are not as prone to suboptimal solutions, and can better deal with repetitive texture patterns. On the other hand, due to its local nature, quasi-dense matching is widely applicable, and can often be used in cases where other methods are not applicable. For example, rectified images are not always available in stereo matching. In fact, in the recent work by Xiao et al. (2008), the approach of Paper VII has been utilized in the initialization stage of a global stereo matching method which can deal with uncalibrated wide baseline images. Further, the non-rigid quasi-dense matching method of Paper VIII can also be used for omnidirectional images (Lu & Wu, 2008).

Finally, considering the possible directions for future research, it is useful to relate the methods of Papers VI-VIII to recent works that address similar problems or use somewhat similar ideas. Particularly interesting recent works include (Cech et al., 2008) and (Cho et al., 2009). For example, it might be advantageous to combine ideas from Paper VII and the papers (Cech et al., 2008) and (Cho et al., 2009) in order to improve deformable object matching further. That is, the sequential correspondence selection method by Cech et al. (2008) could be first used to efficiently increase the proportion of correct matches among tentative feature correspondences, whereupon the correspondence clustering method by Cho et al. (2009) and the combined correspondence growing and grouping approach of Paper VII could be used together for accurate recognition and segmentation. Further, as an additional topic for future research, it might be useful to study ways of applying quasi-dense matching to image retrieval from large databases. The first steps in this direction have been taken by Cech et al. (2008), who use correspondence growing to re-rank images retrieved by efficient sparse matching techniques which are based on vocabulary trees.

On the other hand, one additional challenge for recognition, not addressed in the aforementioned papers, is posed by significant non-linear intensity variations. This problem is discussed by Yang et al. (2007), who also propose an image registration method which is based on correspondence growing and is remarkably robust to large illumination variations. However, their registration framework is not applicable to arbitrary non-rigid scenes, and hence, there is room for further developments.

Lastly, the work by Furukawa & Ponce (2007) proposes a generic multi-view stereo

method, which is based on the local expansion of matching image patches. Although

the multi-view stereo problem is quite different to the matching problems discussed in

this thesis, there is a somewhat similar basic idea of correspondence growing behind

the approaches. On the other hand, a conceptual difference between the approaches is

the fact that the best-first propagation strategy is not used by Furukawa & Ponce (2007).

Instead, repeated expansion and filtering steps are used, where the reconstruction is

gradually grown and erroneously reconstructed patches are filtered out. Hence, there

remains a question whether the best-first propagation principle could be utilized in some

form to further improve the efficiency of patch-based multi-view stereo methods.


6 Summary and conclusion

This thesis has presented new models and methods for various computer vision prob-

lems which involve geometric aspects. The proposed techniques are related to such

specific problem areas as homography estimation, geometric camera calibration, image-

based modeling, image matching, object recognition, and motion segmentation. Hence,

there is a wide range of topics considered in the thesis and, at first sight, they may ap-

pear unrelated to each other. However, as described in the previous chapters, there is a

common background to the themes. In fact, in most cases each of the particular prob-

lems can be seen as a geometric inference problem in which geometric information,

either about the scene or the camera, is extracted from the images.

First, Chapter 2 considered methods for determining a planar homography from

corresponding conics. The two proposed algorithms provide new aspects and tools

for the problem of homography estimation, which is a commonly occurring low-level

task in many areas of computer vision. The techniques developed might be used as a

part of other methods which address higher-level tasks in different application domains,

including plane-based camera calibration, image registration, and structure and motion

recovery (e.g. plane detection).

An important part of this thesis is geometric camera calibration, which was dis-

cussed in Chapter 3. A flexible parametric model for central cameras and a plane-based

camera calibration method for determining the model parameters were introduced. The

calibration experiments, conducted with several real cameras, show that the proposed

camera model is suitable for various types of central cameras, including many catadiop-

tric cameras and dioptric cameras equipped with narrow-angle, wide-angle or fish-eye

lenses. Further, the procedures developed were implemented into a Matlab toolbox,

which is publicly available. From a practical viewpoint, this is also a useful contribu-

tion as there has been a lack of calibration tools for omnidirectional cameras.

In addition to photogrammetric calibration, which utilizes images of a known calibration object, the problem of camera self-calibration was studied. Self-calibration of central cameras from multi-view point correspondences under a general camera motion is a challenging task which typically suffers from the local minima problem. In this thesis, a multi-step calibration procedure was proposed which determines the parameters of a radially symmetric central camera model by minimizing angular errors. Given only a rough initialization for the internal camera parameters, the proposed approach refines the camera parameters so that local minima are usually avoided. However, despite promising results with different kinds of radial distortion models, the approach does not completely solve the problem of local minima and, hence, there is still room for improvements in order to make the self-calibration from general camera motion sufficiently robust and stable for practical use.

Since geometric camera calibration is a prerequisite for measuring metric scene

properties from images, it is closely related to the topic of image-based scene recon-

struction. The metrology example used in this thesis is the sewer imaging application

described in Chapter 4. In this context, an error analysis was performed for the struc-

ture from motion system that recovers the interior structure of a sewer pipe from a

video sequence, and it was observed that a relatively accurate reconstruction can be

achieved despite the forward motion of the camera. Additionally, a method was pre-

sented for modeling tubular surfaces from the reconstructed three-dimensional point

clouds. Hence, the sewer imaging system serves as an application specific example of a

complete reconstruction pipeline, where a compact model of the scene is automatically

acquired from the video.

The quasi-dense image matching, discussed in Chapter 5, is a central theme of the

thesis. The proposed methods provide the means for computing a quasi-dense set of

matching pixels between two images of a textured scene, given a sparse set of matching

regions between the images as seed matches. A key contribution in this thesis is that

the previously proposed match propagation principle was extended to the wide baseline

case where the camera pose may vary greatly between the two views. In addition,

methods were proposed for adjusting the local transformation parameters during the

propagation so that the matching process is able to adapt to both rigid and non-rigid

deformations of the imaged surfaces.

Quasi-dense matching can be utilized in many problems. Surface reconstruction is perhaps one of the most obvious applications. However, in this thesis, quasi-dense matching was additionally applied to specific object recognition and dense two-view motion segmentation. Both applications are based on match grouping, where neighboring quasi-dense matches are grouped together if their group-wise transformation appears to be consistent with the associated local transformations. In the proposed object recognition method, the quasi-dense matches between the model and test images are first grouped, and then the quality and number of matching pixels in the groups are used as recognition criteria. On the other hand, in the proposed motion segmentation method, the grouped matches provide an initialization for the motion layers, which are then refined iteratively. The ability to deal with background clutter and geometric deformations is an advantage of the quasi-dense approach in the aforementioned applications where the common image regions have to be recognized and matched simultaneously. Overall, because quasi-dense matching conveniently supplements modern sparse wide baseline matching techniques which are based on affine covariant regions, and have proven to be useful in many applications, it can be seen as a potential approach for various problems which involve image matching.

In summary, the topics studied in this thesis touch on two classical themes of com-

puter vision, namely, reconstruction and recognition. Traditionally these two problems

have been considered as separate but recently there has been discussion about consid-

ering them together, because they often appear concurrently in real-world conditions

and the approaches for their solution might benefit from each other. In this thesis, the

connection between reconstruction and recognition is provided by quasi-dense match-

ing which can be utilized in both problems. In fact, the results of the thesis show that,

compared to sparse keypoint-based approaches, the quasi-dense approach improves the

recognition of specific objects from photographs, and earlier studies have shown that

the quasi-dense approach improves robustness and accuracy of geometry estimation in

structure from motion. Further, the quasi-dense approach also seems intuitively reasonable in cases where the problems of image matching and recognition are coupled: there is no sense in attempting dense matching if the image regions do not represent the same object, and, conversely, reliable recognition is difficult without a good hypothesis for the pose and position of the object in the image.

Finally, the themes considered in this thesis suggest some directions for future re-

search. Firstly, there are still many challenging problems related to generic cameras

and their calibration. In particular, practical and robust methods for generic calibra-

tion and self-calibration, which include automatic camera model selection, are needed.

Also, the case of varying camera parameters is a challenge for self-calibration. Sec-

ondly, as there have recently emerged patch-based multi-view stereo methods which

utilize match expansion for multi-view reconstruction, it would be interesting to study

whether the best-first match propagation principle, used here for two-view stereo, could

be efficiently used for multi-view stereo. Thirdly, in the context of motion segmentation,

it would be useful if the proposed dense and deformable two-view motion segmentation

method could be extended to work with multi-frame image sequences.


References

Abdel-Aziz YI & Karara HM (1971) Direct linear transformation from comparator to object space coordinates in close-range photogrammetry. Proc Symposium on Close-Range Photogrammetry.

Ahonen T, Hadid A & Pietikäinen M (2006) Face description with local binary patterns: application to face recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28(12): 2037–2041.

Armstrong M, Zisserman A & Hartley RI (1996) Self-calibration from image triplets. Proc European Conference on Computer Vision (ECCV): 3–16.

Baker S & Nayar SK (1999) A theory of single-viewpoint catadioptric image formation. International Journal of Computer Vision (IJCV) 35(2): 175–196.

Bakstein H & Pajdla T (2001) An overview of non-central cameras. Proc Computer Vision Winter Workshop (CVWW).

Ballard DH (1981) Generalizing the Hough transform to detect arbitrary shapes. Pattern Recognition 13(2): 111–122.

Barreto JP & Araujo H (2005) Geometric properties of central catadioptric line images and their application in calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(8): 1327–1333.

Barreto JP & Daniilidis K (2006) Epipolar geometry of central projection systems using veronese maps. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 1258–1265.

Bhat P, Zheng KC, Snavely N, Agarwala A, Agrawala M, Cohen MF & Curless B (2006) Piecewise image registration in the presence of multiple large motions. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 2491–2497.

Bookstein FL (1979) Fitting conic sections to scattered data. Computer Graphics and Image Processing (9): 56–71.

Bowyer KW, Chang K & Flynn P (2006) A survey of approaches and challenges in 3D and multi-modal 3D+2D face recognition. Computer Vision and Image Understanding (CVIU) 101(1): 1–15.

Bräuer-Burchardt C & Voss K (2001) A new algorithm to correct fish-eye- and strong wide-angle-lens-distortion from single images. Proc International Conference on Image Processing (ICIP): 225–228.

Brown DC (1971) Close-range camera calibration. Photogrammetric Engineering 37(8): 855–866.

Brown M, Hartley R & Nistér D (2007) Minimal solutions for panoramic stitching. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Burschka D, Li M, Taylor R & Hager GD (2004) Scale-invariant registration of monocular endoscopic images to CT-scans for sinus surgery. Proc Medical Imaging Computing and Computer Assisted Intervention (MICCAI): 413–421.

Byröd M, Josephson K & Åström K (2009) Fast and stable polynomial equation solving and its application to computer vision. International Journal of Computer Vision (IJCV) 84(3): 237–256.

Carlsson S (1993) Projectively invariant decomposition and recognition of planar shapes. Proc International Conference on Computer Vision (ICCV): 471–475.

Cech J, Matas J & Perd’och M (2008) Efficient sequential correspondence selection by cosegmentation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Cech J & Sára R (2007) Efficient sampling of disparity space for fast and accurate matching. Proc BenCOS Workshop in conjunction with CVPR.

Chae MJ & Abraham DM (2001) Neuro-fuzzy approaches for sanitary sewer pipeline condition assessment. Journal of Computing in Civil Engineering 15(1): 4–14.

Chahl JS & Srinivasan MV (1997) Reflective surfaces for panoramic imaging. Applied Optics 36(31): 8275–8285.

Chen Q & Medioni GG (1999) A volumetric stereo matching method: application to image-based modeling. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Chen Q, Wu H & Wada T (2004) Camera calibration with two arbitrary coplanar circles. Proc European Conference on Computer Vision (ECCV), (3): 521–532.

Cho M, Lee J & Lee KM (2009) Feature correspondence and deformable object matching via agglomerative correspondence clustering. Proc International Conference on Computer Vision (ICCV).

Choi O, Kim H & Kweon IS (2007) Simultaneous plane extraction and 2D homography estimation using local feature transformations. Proc Asian Conference on Computer Vision (ACCV), (2): 269–278.

Chum O, Werner T & Matas J (2005) Two-view geometry estimation unaffected by a dominant plane. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 772–779.

Claus D & Fitzgibbon AW (2005) A rational function lens distortion model for general cameras. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 213–219.

Conrady A (1919) Decentering lens systems. Monthly Notices of the Royal Astronomical Society 79: 384–390.

Cooper D, Pridmore TP & Taylor N (1998) Towards the recovery of extrinsic camera parameters from video records of sewer surveys. Machine Vision and Applications 11: 53–63.

Cornelis C, Leibe B, Cornelis K & Van Gool L (2008) 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision (IJCV) 78(2-3): 121–141.

Daniilidis K & Klette R (eds) (2006) Imaging Beyond the Pinhole Camera, volume 33 of Computational Imaging and Vision. Springer.

Davison AJ, Reid ID, Molton ND & Stasse O (2007) MonoSLAM: real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(6): 1052–1067.

Faber P & Fisher B (2001) A buyer’s guide to Euclidean elliptical cylindrical and conical surface fitting. Proc British Machine Vision Conference (BMVC): 521–530.

Faugeras O (1993) Three-Dimensional Computer Vision. The MIT Press.

Faugeras O, Luong QT & Papadopoulo T (2001) The Geometry of Multiple Images. The MIT Press.

Faugeras OD, Luong QT & Maybank SJ (1992) Camera self-calibration: Theory and experiments. Proc European Conference on Computer Vision (ECCV): 321–334.

Feldman D, Pajdla T & Weinshall D (2003) On the epipolar geometry of the crossed-slits projection. Proc International Conference on Computer Vision (ICCV): 988–995.

Fergus R, Perona P & Zisserman A (2003) Object class recognition by unsupervised scale-invariant learning. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 264–271.

Ferrari V, Tuytelaars T & Van Gool LJ (2006) Simultaneous object recognition and segmentation from single or multiple model views. International Journal of Computer Vision (IJCV) 67(2): 159–188.

Fischler MA & Bolles RC (1981) Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24(6): 381–395.

Fitzgibbon A (2001) Simultaneous linear estimation of multiple view geometry and lens distortion. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 125–132.

Fitzgibbon AW, Pilu M & Fisher RB (1999) Direct least square fitting of ellipses. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 21(5): 476–480.

Fitzgibbon AW & Zisserman A (1998) Automatic 3D model acquisition and generation of new images from video sequences. Proc European Signal Processing Conference.

Forsyth D, Mundy JL, Zisserman A, Coelho C, Heller A & Rothwell C (1991) Invariant descriptors for 3-D object recognition and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 13(10).

Frank O, Katz R, Tisse CL & Durrant-Whyte H (2007) Camera calibration for miniature, low-cost, wide-angle imaging systems. Proc British Machine Vision Conference (BMVC).

Fuchs M, Blanz V, Lensch HPA & Seidel HP (2007) Adaptive sampling of reflectance fields. ACM Transactions on Graphics 26(2).

Furukawa Y, Curless B, Seitz SM & Szeliski R (2009) Manhattan-world stereo. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Furukawa Y & Ponce J (2007) Accurate, dense, and robust multi-view stereopsis. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Gargallo P (2008) Contributions to the Bayesian approach to multi-view stereo. Ph.D. thesis, Institut National Polytechnique de Grenoble.

Geyer C & Daniilidis K (2001) Catadioptric projective geometry. International Journal of Computer Vision (IJCV) 45(3).

Geyer C & Daniilidis K (2002) Paracatadioptric camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24(5): 687–695.

Goesele M, Snavely N, Curless B, Hoppe H & Seitz S (2007) Multi-view stereo for community photo collections. Proc International Conference on Computer Vision (ICCV).

Grossberg MD & Nayar SK (2001) A general imaging model and a method for finding its parameters. Proc International Conference on Computer Vision (ICCV).

Gupta R & Hartley RI (1997) Linear pushbroom cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 19(9): 963–975.

Gurdjos P, Kim JS & Kweon IS (2006) Euclidean structure from confocal conics: theory and application to camera calibration. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 1214–1222.

Harris C & Stephens M (1988) A combined corner and edge detector. Proc Alvey Vision Conference.

Hartley R & Zisserman A (2000) Multiple View Geometry in Computer Vision. Cambridge University Press.

Hartley RI (1994) Self-calibration from multiple views with a rotating camera. Proc European Conference on Computer Vision (ECCV), (1): 471–478.

Hartley RI & Kang SB (2007) Parameter-free radial distortion correction with center of distortion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(8): 1309–1321.

Heikkilä J (2000) Geometric camera calibration using circular control points. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(10): 1066–1077.

Heikkilä M, Pietikäinen M & Schmid C (2009) Description of interest regions with local binary patterns. Pattern Recognition 42(3): 425–436.

Heyden A & Åström K (1997) Euclidean reconstruction from image sequences with varying and unknown focal length and principal point. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 438–443.

Heyden A & Åström K (1998) Minimal conditions on intrinsic parameters for Euclidean reconstruction. Proc Asian Conference on Computer Vision (ACCV): 169–176.

Horn BKP (1986) Robot Vision. The MIT Press.

Horn RA & Johnson CR (1985) Matrix Analysis. Cambridge University Press.

Jain PK & Jawahar CV (2006) Homography estimation from planar contours. Proc International Symposium on 3D Data Processing Visualization and Transmission (3DPVT).

Kahl F, Agarwal S, Chandraker MK, Kriegman DJ & Belongie S (2008) Practical global optimization for multiview geometry. International Journal of Computer Vision (IJCV) 79(3): 271–284.

Kahl F & Heyden A (1998) Using conic correspondences in two images to estimate the epipolar geometry. Proc International Conference on Computer Vision (ICCV): 761–766.

Kaminski JY & Shashua A (2004) Multiple view geometry of general algebraic curves. International Journal of Computer Vision (IJCV) 56(3): 195–219.

Kanatani K (1993) Geometric Computation for Machine Vision. Oxford University Press.

Kanatani K (1994) Statistical bias of conic fitting and renormalization. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 16(3): 320–326.

Kanatani K (1996) Statistical Optimization for Geometric Computation: Theory and Practice. Elsevier Science.

Kannala J (2004) Measuring the shape of sewer pipes from video. Master’s thesis, Helsinki University of Technology.

Kannala J & Brandt S (2004) A generic camera calibration method for fish-eye lenses. Proc International Conference on Pattern Recognition (ICPR), (1): 10–13.

Kannala J & Brandt SS (2005) Measuring the shape of sewer pipes from video. Proc IAPR Conference on Machine Vision Applications: 237–240.

Kolmogorov V & Zabih R (2001) Computing visual correspondence with occlusions via graph cuts. Proc International Conference on Computer Vision (ICCV): 508–515.

Kukelova Z, Bujnak M & Pajdla T (2008) Automatic generator of minimal problem solvers. Proc European Conference on Computer Vision (ECCV), (3): 302–315.

Kuntze HB & Haffner H (1998) Experiences with the development of a robot for smart multisensoric pipe inspection. Proc IEEE International Conference on Robotics and Automation (ICRA): 1773–1778.

Lazebnik S, Schmid C & Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 2169–2178.

Lhuillier M (2008a) Automatic scene structure and camera motion using a catadioptric system. Computer Vision and Image Understanding (CVIU) 109(2): 186–203.

Lhuillier M (2008b) Toward automatic 3D modeling of scenes using a generic camera model. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Lhuillier M & Quan L (2002) Match propagation for image-based modeling and rendering. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24(8): 1140–1146.

Lhuillier M & Quan L (2005) A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(3): 418–433.

Li H & Hartley RI (2006) Plane-based calibration and auto-calibration of a fish-eye camera. Proc Asian Conference on Computer Vision (ACCV): 21–30.

Lourakis MIA, Argyros AA & Orphanoudakis SC (2002) Detecting planes in an uncalibrated image pair. Proc British Machine Vision Conference (BMVC).

Lowe D (2004) Distinctive image features from scale invariant keypoints. International Journal of Computer Vision (IJCV) 60(2): 91–110.

Lu L & Wu Y (2008) Quasi-dense matching between perspective and omnidirectional images. Proc Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications in conjunction with ECCV.

Lukács G, Martin R & Marshall D (1998) Faithful least-squares fitting of spheres, cylinders, cones and tori for reliable segmentation. Proc European Conference on Computer Vision (ECCV): 671–686.

Ma SD (1993) Conics-based stereo, motion estimation, and pose determination. International Journal of Computer Vision (IJCV) 10(1): 7–25.

Marr D (1982) Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. W. H. Freeman and Company.

Martinec D (2008) Robust multiview reconstruction. Ph.D. thesis, Czech Technical University.

Matas J, Chum O, Urban M & Pajdla T (2004) Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22(10): 761–767.

Megyesi Z & Chetverikov D (2004) Enhanced surface reconstruction from wide baseline images. Proc International Symposium on 3D Data Processing Visualization and Transmission (3DPVT): 463–469.

Mei C & Rives P (2007) Single view point omnidirectional camera calibration from planar grids. Proc IEEE International Conference on Robotics and Automation (ICRA): 3945–3950.

Mičušík B (2004) Two-view geometry of omnidirectional cameras. Ph.D. thesis, Czech Technical University.

Mičušík B & Pajdla T (2006) Structure from motion with wide circular field of view cameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 28(7).

Mikolajczyk K & Schmid C (2005) A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 27(10): 1615–1630.

Mikolajczyk K, Tuytelaars T, Schmid C, Zisserman A, Matas J, Schaffalitzky F, Kadir T & Van Gool L (2005) A comparison of affine region detectors. International Journal of Computer Vision (IJCV) 65(1-2): 43–72.

Miyamoto K (1964) Fish eye lens. Journal of the Optical Society of America (JOSA) 54(8): 1060–1061.

Moons T, Van Gool L, Van Diest M & Pauwels E (1994) Affine reconstruction from perspective image pairs obtained by a translating camera, volume 825 of Lecture Notes in Computer Science: 297–316. Springer.

Mouragnon E, Lhuillier M, Dhome M, Dekeyser F & Sayd P (2009) Generic and real-time structure from motion using local bundle adjustment. Image and Vision Computing 27: 1178–1193.

Mudigonda P, Jawahar CV & Narayanan PJ (2004) Geometric structure computation from conics. Proc Indian Conference on Computer Vision, Graphics and Image Processing (ICVGIP).

Nistér D (2001) Automatic dense reconstruction from uncalibrated video sequences. Ph.D. thesis, Royal Institute of Technology, Stockholm.

Nistér D (2004) An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 26(6): 756–777.

Nistér D (2005) Preemptive RANSAC for live structure and motion estimation. Machine Vision and Applications 16(5): 321–329.

Nistér D & Stewénius H (2006) Scalable recognition with a vocabulary tree. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 2161–2168.

Obdrzálek S & Matas J (2002) Object recognition using local affine frames on distinguished regions. Proc British Machine Vision Conference (BMVC).

Obdrzálek S & Matas J (2005) Sub-linear indexing for large scale object recognition. Proc British Machine Vision Conference (BMVC).

Oliensis J (2002) Exact two-image structure from motion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 24(12): 1618–1633.

Ollis M, Herman H & Singh S (1999) Analysis and design of panoramic stereo vision using equi-angular pixel cameras. CMU-RI-TR-99-04, Carnegie Mellon University.

Otto GP & Chau TKW (1989) Region-growing algorithm for matching of terrain images. Image and Vision Computing 7(2): 83–94.

Pajdla T (2002) Stereo with oblique cameras. International Journal of Computer Vision (IJCV) 47(1-3): 161–170.

Pollefeys M (1999) Self-calibration and metric 3D reconstruction from uncalibrated image sequences. Ph.D. thesis, Katholieke Universiteit Leuven.

Pollefeys M, Koch R & Van Gool LJ (1999) Self-calibration and metric reconstruction inspite of varying and unknown intrinsic camera parameters. International Journal of Computer Vision (IJCV) 32(1): 7–25.

Pollefeys M, Nistér D, Frahm JM, Akbarzadeh A, Mordohai P, Clipp B, Engels C, Gallup D, Kim SJ, Merrell P, Salmi C, Sinha SN, Talton B, Wang L, Yang Q, Stewénius H, Yang R, Welch G & Towles H (2008) Detailed real-time urban 3D reconstruction from video. International Journal of Computer Vision (IJCV) 78(2-3): 143–167.

Pollefeys M & Van Gool LJ (1997) Self-calibration from the absolute conic on the plane at infinity. Proc International Conference on Computer Analysis of Images and Patterns (CAIP): 175–182.

Ponce J (2009) What is a camera? Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Pronk J (2006) Spatially variant real world light for computer graphics. B.Sc. thesis, University of New South Wales.

Quan L (1996) Conic reconstruction and correspondence from two views. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 18(2): 151–160.

Ramalingam S (2006) Generic imaging models: calibration and 3D reconstruction algorithms. Ph.D. thesis, Institut National Polytechnique de Grenoble.

Ramalingam S, Lodha SK & Sturm PF (2006a) A generic structure-from-motion framework. Computer Vision and Image Understanding (CVIU) 103(3): 218–228.

Ramalingam S & Sturm PF (2008) Minimal solutions for generic imaging models. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Ramalingam S, Sturm PF & Boyer E (2006b) A factorization based self-calibration for radially symmetric cameras. Proc International Symposium on 3D Data Processing Visualization and Transmission (3DPVT): 480–487.

Ramalingam S, Sturm PF & Lodha SK (2005) Towards complete generic camera calibration. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 1093–1098.

Ramalingam S, Sturm PF & Lodha SK (2006c) Theory and calibration for axial cameras. Proc Asian Conference on Computer Vision (ACCV), (1): 704–713.

Rosin PL (1993) A note on the least squares fitting of ellipses. Pattern Recognition Letters 14(10): 799–808.

Rosin PL & West GAW (1995) Nonparametric segmentation of curves into various representations. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 17(12): 1140–1153.

Rothganger F, Lazebnik S, Schmid C & Ponce J (2006) 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. International Journal of Computer Vision (IJCV) 66(3): 231–259.

Sampson PD (1982) Fitting conic sections to very scattered data: an iterative refinement of the Bookstein algorithm. Computer Graphics and Image Processing 18: 97–108.

Scaramuzza D, Martinelli A & Siegwart R (2006) A flexible technique for accurate omnidirectional camera calibration and structure from motion. Proc International Conference on Computer Vision Systems (ICVS).

Scharstein D & Szeliski R (2002) A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision (IJCV) 47(1-3): 7–42.

Schmid C & Mohr R (1997) Local grayvalue invariants for image retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 19(5): 530–535.

Schmid C & Zisserman A (2000) The geometry and matching of lines and curves over multiple views. International Journal of Computer Vision (IJCV) 40(3): 199–233.

Schmidt J, Vogt F & Niemann H (2002) Nonlinear refinement of camera parameters using an endoscopic surgery robot. Proc IAPR Workshop on Machine Vision Applications: 40–43.

Seitz SM, Curless B, Diebel J, Scharstein D & Szeliski R (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 519–528.

Semple JG & Kneebone GT (1952) Algebraic Projective Geometry. Oxford University Press.

Simon I & Seitz SM (2007) A probabilistic model for object recognition, segmentation, and non-rigid correspondence. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Sinha SK & Fieguth PW (2006) Morphological segmentation and classification of underground pipe images. Machine Vision and Applications 17: 21–31.

Sivic J, Russell BC, Efros AA, Zisserman A & Freeman WT (2005) Discovering objects and their location in images. Proc International Conference on Computer Vision (ICCV): 370–377.

Sivic J & Zisserman A (2009) Efficient visual search of videos cast as text retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(4): 591–606.

Slama CC (ed) (1980) Manual of Photogrammetry. American Society of Photogrammetry, fourth edition.

Snavely N, Seitz SM & Szeliski R (2008) Modeling the world from internet photo collections. International Journal of Computer Vision (IJCV) 80(2): 189–210.

Starck J & Hilton A (2003) Model-based multiple view reconstruction of people. Proc International Conference on Computer Vision (ICCV): 915–922.

Stehle T, Truhn D, Aach T, Trautwein C & Tischendorf J (2007) Camera calibration for fish-eye lenses in endoscopy with an application to 3D reconstruction. Proc International Symposium on Biomedical Imaging (ISBI): 1176–1179.

Stewénius H (2005) Gröbner basis methods for minimal problems in computer vision. Ph.D. thesis, Lund University.

Stolfi J (1991) Oriented Projective Geometry. Academic Press.

Strecha C (2007) Multi-view stereo as an inverse inference problem. Ph.D. thesis, Katholieke Universiteit Leuven.

Sturm P & Barreto JP (2008) General imaging geometry for central catadioptric cameras. Proc European Conference on Computer Vision (ECCV), (4): 609–622.

Sturm P & Maybank S (1999) On plane based camera calibration: A general algorithm, singularities, applications. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 432–437.

Sturm PF (2005) Multi-view geometry for general camera models. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 206–212.

Sturm PF & Gargallo P (2007) Conic fitting using the geometric distance. Proc Asian Conference on Computer Vision (ACCV), (2): 784–795.

Sturm PF & Ramalingam S (2004) A generic concept for camera calibration. Proc European Conference on Computer Vision (ECCV), (2): 1–13.

Sugimoto A (2000) A linear algorithm for computing the homography from conics in correspondence. Journal of Mathematical Imaging and Vision 13: 115–130.

Sutherland I (1974) Three-dimensional data input by tablet. Proc IEEE 62: 453–461.

Swaminathan R, Grossberg MD & Nayar SK (2003) A perspective on distortions. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 594–601.

Swaminathan R, Grossberg MD & Nayar SK (2004) Designing mirrors for catadioptric systems that minimize image errors. Proc Workshop on Omnidirectional Vision, Camera Networks and Non-Classical Cameras (OMNIVIS).

Swaminathan R & Nayar SK (2000) Nonmetric calibration of wide-angle lenses and polycameras. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(10): 1172–1178.

Tardif JP, Sturm P, Trudeau M & Roy S (2009) Calibration of cameras with radially symmetric distortion. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(9): 1552–1566.

Tardif JP, Sturm PF & Roy S (2006) Self-calibration of a general radially symmetric distortion model. Proc European Conference on Computer Vision (ECCV), (4): 186–199.

Tardif JP, Sturm PF & Roy S (2007) Plane-based self-calibration of radial distortion. Proc International Conference on Computer Vision (ICCV).

Thirthala S & Pollefeys M (2005) The radial trifocal tensor: a tool for calibrating the radial distortion of wide-angle cameras. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 321–328.

Triggs B (1997) Autocalibration and the absolute quadric. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 609–614.

Triggs B, McLauchlan P, Hartley R & Fitzgibbon A (2000) Bundle Adjustment – A Modern Synthesis, volume 1883 of Lecture Notes in Computer Science: 298–372.

Vedaldi A & Soatto S (2006) Local features, all grown up. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2): 1753–1760.

Vidal R, Heyden A & Ma Y (eds) (2007) Dynamical Vision, volume 4358 of Lecture Notes in Computer Science.

Viola P & Jones M (2001) Rapid object detection using a boosted cascade of simple features. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (1): 511–518.

Wang H, Mirota D, Ishii M & Hager GD (2008) Robust motion estimation and structure recovery from endoscopic image sequences with an adaptive scale kernel consensus estimator. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Wang JYA & Adelson EH (1994) Representing moving images with layers. IEEE Transactions on Image Processing 3(5): 625–638.

Weiss Y (1997) Smoothness in layers: motion segmentation using nonparametric mixture estimation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 520–526.

Werghi N, Fisher R, Robertson C & Ashbrook A (1998) Modelling objects having quadric surfaces incorporating geometric constraints. Proc European Conference on Computer Vision (ECCV): 185–201.

Wills J, Agarwal S & Belongie S (2006) A feature-based approach for dense segmentation and estimation of large disparity motion. International Journal of Computer Vision (IJCV) 68(2): 125–143.

Wu Y, Zhu H, Hu Z & Wu F (2004) Camera calibration from the quasi-affine invariance of two parallel circles. Proc European Conference on Computer Vision (ECCV), (1): 190–202.

Xiao J, Chen J, Yeung DY & Quan L (2008) Learning two-view stereo matching. Proc European Conference on Computer Vision (ECCV), (3): 15–27.

Xin L, Wang Q, Tao J, Tang X, Tan T & Shum H (2005) Automatic 3D face modeling from video. Proc International Conference on Computer Vision (ICCV): 1193–1199.

Xiong Y & Turkowski K (1997) Creating image-based VR using a self-calibrating fisheye lens. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Xu G & Zhang Z (1996) Epipolar Geometry in Stereo, Motion and Object Recognition. Kluwer.

Xu K, Luxmoore AR & Davies T (1998) Sewer pipe deformation assessment by image analysis of video surveys. Pattern Recognition 31(2): 169–180.

Yang C, Sun F & Hu Z (2000) Planar conic based camera calibration. Proc International Conference on Pattern Recognition (ICPR): 1555–1558.

Yang G, Stewart CV, Sofka M & Tsai CL (2007) Registration of challenging image pairs: Initialization, estimation, and decision. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(11): 1973–1989.

Ying X & Hu Z (2004a) Can we consider central catadioptric cameras and fisheye cameras within a unified imaging model. Proc European Conference on Computer Vision (ECCV): 442–455.

Ying X & Hu Z (2004b) Catadioptric camera calibration using geometric invariants. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 26(10).

Ying X & Zha H (2005) Linear catadioptric camera calibration from sphere images. Proc Workshop on Omnidirectional Vision, Camera Networks and Non-Classical Cameras (OMNIVIS).

Ying X & Zha H (2007) Camera calibration using principal-axes aligned conics. Proc Asian Conference on Computer Vision (ACCV), (1): 138–148.

Zhang H, Wong K & Zhang G (2007) Camera calibration from images of spheres. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 29(3): 499–502.

Zhang Z (2000) A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22(11): 1330–1334.

Zhao W, Chellappa R, Phillips PJ & Rosenfeld A (2003) Face recognition: a literature survey. ACM Computing Surveys 35(4): 399–458.

Original articles

I Kannala J, Salo M & Heikkilä J (2006) Algorithms for computing a planar homography from conics in correspondence. Proc British Machine Vision Conference (BMVC) 1: 77–86.

II Kannala J & Brandt SS (2006) A generic camera model and calibration method for conventional, wide-angle and fish-eye lenses. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(8): 1335–1340.

III Kannala J, Heikkilä J & Brandt SS (2008) Geometric camera calibration. In Wah B (ed) Wiley Encyclopedia of Computer Science and Engineering. Hoboken, John Wiley & Sons Inc.

IV Kannala J, Brandt SS & Heikkilä J (2009) Self-calibration of central cameras from point correspondences by minimizing angular error. VISIGRAPP 2008, Revised Selected Papers. Communications in Computer and Information Science 24: 109–122.

V Kannala J, Brandt SS & Heikkilä J (2008) Measuring and modelling sewer pipes from video. Machine Vision and Applications 19(2): 73–83.

VI Kannala J & Brandt SS (2007) Quasi-dense wide baseline matching using match propagation. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

VII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2008) Object recognition and segmentation by non-rigid quasi-dense matching. Proc IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

VIII Kannala J, Rahtu E, Brandt SS & Heikkilä J (2009) Dense and deformable motion segmentation for wide baseline images. Proc Scandinavian Conference on Image Analysis (SCIA). Lecture Notes in Computer Science 5575: 379–389.

Reprinted with permission from IEEE (II, VI and VII), John Wiley & Sons (III), and Springer-Verlag (IV, V and VIII).

Original publications are not included in the electronic version of the dissertation.
