
DEGREE PROJECT IN COMPUTER VISION, SECOND LEVEL
STOCKHOLM, SWEDEN 2015

Improving motion tracking using gyroscope data in Augmented Reality applications

FREDRIK BYSTAM

KTH ROYAL INSTITUTE OF TECHNOLOGY

SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION (CSC)


Improving motion tracking using gyroscope data in Augmented Reality applications

Att förbättra motion tracking med hjälp av gyroskopdata i Augmented Reality-applikationer

Thesis report

April 22, 2015

FREDRIK BYSTAM ([email protected])

Master’s Thesis in Computer Science (30 credits)

Supervisor: Stefan Carlsson ([email protected])

Examiner: Danica Kragic ([email protected])

Project commissioned by

André Strindby ([email protected])

Linus Åkerlund ([email protected])

Bontouch AB


Abstract

As commissioned by Bontouch AB, this project attempts to create an Augmented Reality application for smartphones, in which a parcel is visualised so that a user can get a comprehensive image of the parcel size. The main focus of the project was to create an efficiently running engine that can process images for computer vision, in order to determine the pose of the smartphone camera. The experiment explored the possibilities of using gyroscope measurements between input images in order to formulate valid assumptions about upcoming images using homographies. These assumptions were meant to unburden the computer vision engine, creating very high performance. The ultimate goal of the engine was to track the motion of a known reference object in the image with high precision.

The proposed method performed adequately, improving the reliability of the motion tracking algorithms. The resulting mobile application, run on an iPhone 5S, could perform camera pose estimation at up to 60 times per second, at a video camera feed resolution of 1280x720 pixels. This high performance resulted in a very stable rendition of the parcel.


Referat

As an assignment from Bontouch AB, this project contains an attempt to create an Augmented Reality application for smartphones, in which parcels are visualised to give users an augmented image of the parcel's size. The main focus of the project was to build an efficient engine that can process images for computer vision, in order to calculate the position and orientation of the mobile camera. The experiment explored the possibilities of using gyroscope data between images to formulate valid assumptions about upcoming images with the help of homographies. These assumptions were intended to unburden the computer vision engine, in order to achieve high performance. The goal was for the engine to follow the motion of a given reference object in the image, and to do so with high precision.

The proposed method performed adequately, and improved the reliability of the algorithms that follow the object's motion. The mobile application that was built, run on an iPhone 5S, could calculate the camera's position and orientation up to 60 times per second, when the video camera supplied the engine with images at a resolution of 1280x720 pixels. The high performance resulted in a very stable rendition of the parcel.


Contents

1 Introduction
   1.1 Background
   1.2 Problem statement
       1.2.1 Objective
       1.2.2 Purpose
       1.2.3 Aim
       1.2.4 Delimitation
       1.2.5 Bontouch
       1.2.6 The Postnord app

2 Related work
   2.1 Augmented Reality on handheld devices
   2.2 Motion tracking
   2.3 Real time pose estimation
   2.4 Involving non-camera sensors

3 Theory
   3.1 The projection camera model
       3.1.1 The internal camera model
       3.1.2 Camera orientation and translation
       3.1.3 A linear relation between two images of a rotated camera
   3.2 Camera calibration
       3.2.1 General camera calibration
       3.2.2 Determining the intrinsic camera parameters
       3.2.3 Determining camera pose
   3.3 Image processing
       3.3.1 Image binarisation using Canny's algorithm
       3.3.2 Image binarisation using Otsu's thresholding method
       3.3.3 Finding joint contours in an image
       3.3.4 Polygon shape approximation

4 Method
   4.1 OpenCV
   4.2 Method overview
   4.3 Considered methods
   4.4 Implementation
       4.4.1 Object detection
       4.4.2 Pose estimation
       4.4.3 Visualisation
       4.4.4 Perform motion tracking based on results from object detection
       4.4.5 Include motion sensors when tracking motion
       4.4.6 Internal calibration of the camera
   4.5 Image sequences
   4.6 Gyroscope benchmark program
   4.7 Optimal region of interest
   4.8 Hardware
       4.8.1 Smartphone
       4.8.2 Evaluation tests

5 Results
   5.1 Gyroscope benchmarking output
       5.1.1 Motion intensity test
       5.1.2 Success rate test
       5.1.3 Region of interest margin test
       5.1.4 ROI coverage test
       5.1.5 CPU time measurement test
   5.2 Optimal region of interest output
   5.3 The finished application

6 Discussion
   6.1 Observations from the gyroscope benchmarking
       6.1.1 Sequence motion
       6.1.2 Success rate
       6.1.3 Region margins and coverage
       6.1.4 CPU time
   6.2 Observations from the optimal region of interest test
   6.3 The parcel visualisation application
   6.4 Future work
   6.5 Conclusion

Bibliography


List of Figures

1.1 A simple illustration showing how different image processing methods are executed on different frames/images

3.1 A geometric representation of the pinhole camera, where the point X is connected to the projection center C, intercepting the image plane at point x. This figure is taken from ref. [23].

3.2 Hysteresis thresholding applied to two edges. Weak edge pixels of A will be considered an edge, as they are connected to strong edge pixels. Weak edge pixels of B will not be considered an edge.

4.1 A flow chart displaying an overview of the entire application system.

4.2 Images describing some of the different stages of object detection.

4.3 Image displaying the three axes which describe the complete rotation of the iPhone. Taken from [2].

5.1 A plot displaying how total CPU time spent on a sequence varies with the chosen ROI growth amount.

5.2 A few screenshots taken when using the iPhone application.

6.1 Tracking attempt suffering from motion blur. The object location (inner red bounds) is clearly inaccurate.

List of Tables

5.1 Motion intensity test results
5.2 Success rate test results
5.3 ROI margin test results
5.4 ROI coverage test results
5.5 CPU time test results


Definitions

AR Augmented Reality. The process in which 3D objects are rendered in real time into a camera feed from reality.

px Short for pixel, the smallest coloured portion of an image.

ROI Region of Interest. A rectangular portion of an image where only the pixels inside this region are considered when processing it.

Motion tracking The idea of following the motion of a specific object in images, attempting to determine its movement.

Translation The position of an object in world coordinates.

Orientation In this thesis, orientation refers to how an object is rotated in 3D space, i.e. in what direction the object is facing.

Camera pose The translation and orientation of a camera.

Homogeneous coordinates A type of coordinate representation common in projective geometry (instead of the Cartesian coordinates used in regular, Euclidean geometry). Homogeneous coordinates in an image can be seen as ”vectors pointing at a pixel in the image” rather than the coordinates of the pixel themselves. Formulas expressed using homogeneous coordinates are often simpler than their Cartesian counterparts.

Homography A linear mapping between points in two different images in 2D projective space, see 3.1.3.


Chapter 1

Introduction

1.1 Background

Augmented Reality (AR) is the process of blending real world imagery with 3D rendered environments in real time. The authors of the paper State of the art and future developments of the Augmented Reality for measurement applications predict that AR will play a vital role on handheld devices in the future [13]. They mention that many already existing prototypes include usage in both industrial business and medical care.

A very interesting usage of AR is in trying to picture the size and shape of items, especially before a purchase or similar. In the case of this project, we consider the case where a person receives a delivery of significant size, and would like to picture this size beforehand.

1.2 Problem statement

One problem a user could encounter before collecting a parcel is picturing its size. Such a situation could raise questions like: ”If a parcel has the dimensions 14x34x110 cm, will it fit on the carrier of my bike, or in the trunk of my car?”

We would like to use modern computer vision technologies to render parcels in a camera view upon a given reference object, such as a printed marker or a plain A4 paper. To be considered a good enough user experience, this has to be done very efficiently. 3D rendering in Augmented Reality has been done before, but doing image processing in order to find a reference object can be very expensive. The idea is to evaluate whether we can improve performance by making assumptions about the environment, and then strengthen these assumptions using input from the smartphone motion sensors. This is done in order to improve performance, reducing lag and frame rate issues.

More specifically, imagine that object detection in one image takes time equal to five frames from the camera, given some frame rate. If object detection is performed on every available frame, the frame rate would be cut to one fifth. Now, the assumption is made that a user will not move significantly during the five frames where the object is being found. If that is the case, then the surroundings of the most recently known location of the object can be used as the search space for the object in subsequent frames. Processing only this space to track the object can hopefully result in higher frame rates.

Figure 1.1: A simple illustration showing how different image processing methods are executed on different frames/images.

Detection frames are the images where heavier, more exhaustive object detection is performed. Between these frames, estimation is done on the tracking frames, based on the knowledge from the detection frames. These tracking methods are the focus of this project. The program can rely on the tracking method alone until the object is lost. At that point, a new detection attempt on the entire image is performed.
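A minimal sketch of this interleaving, in C++ with OpenCV, is shown below. The helpers detectObject and trackObject are hypothetical placeholders for the detection and tracking methods developed later in this thesis, and the margin value is an arbitrary illustration, not a value used in the project.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Hypothetical helpers: detectObject scans the whole frame (expensive),
// trackObject searches only a region of interest (cheap). Both report
// whether the object was found and output its four corners.
bool detectObject(const cv::Mat& frame, std::vector<cv::Point2f>& corners);
bool trackObject(const cv::Mat& frame, const cv::Rect& roi,
                 std::vector<cv::Point2f>& corners);

void processFrame(const cv::Mat& frame) {
    static bool tracking = false;
    static std::vector<cv::Point2f> corners;

    if (tracking) {
        // Tracking frame: search only around the last known object location.
        cv::Rect roi = cv::boundingRect(corners);
        const int margin = 40; // px, illustrative value
        roi.x -= margin; roi.y -= margin;
        roi.width += 2 * margin; roi.height += 2 * margin;
        roi &= cv::Rect(0, 0, frame.cols, frame.rows); // clamp to the image
        tracking = trackObject(frame, roi, corners);
    }
    if (!tracking) {
        // Detection frame: the object is lost (or was never found), so an
        // exhaustive detection over the entire image is performed.
        tracking = detectObject(frame, corners);
    }
}
```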

1.2.1 Objective

The objective of this project is to explore the possibilities of using motion sensor input, more precisely from the gyroscope, to optimise image processing methods used on real time video camera input. The hope is that gyroscope data, combined with existing knowledge of previous observations, can create robust assumptions about new scenes and images to be processed. Good assumptions about the image can result in less computational load, due to the use of methods tailored for the situation.


One method is to only search for the object in a rectangular region of interest (ROI), based on the last known location of the object. The concept of how to improve the construction of these ROIs, taking gyroscope data into consideration, will be evaluated.

1.2.2 Purpose

The purpose of this project is to evaluate an attempt to optimise solutions to the general object detection and camera pose problem in augmented reality applications. This attempt includes using non-computer-vision data sources such as the gyroscope to create a reliable motion tracking engine.

Investigating the usage of only gyroscope input is interesting and valuable, since it is worth knowing if the rotational data is enough to effectively predict object movement when motion tracking. If the camera moves in translation, but not in orientation, then the gyroscope data will not be of help. Therefore, it is rewarding to evaluate the effectiveness of this method under typical circumstances (in a smartphone AR application).

1.2.3 Aim

The aim is to contribute scientifically to the field of augmented reality by evaluating new and innovative ways of optimising environment perception methods. This could be of great significance if camera hardware evolves faster than processing hardware. Between just the latest two versions of the Apple iPhone, camera recording capabilities have gone from a 30 fps standard to a 60 fps standard, thus doubling the maximum camera input to computer vision applications [1]. At the same time, Moore's Law seems to have encountered a physical, stagnating factor [5]. Compensating by exploring efficient methods of image processing is therefore of great importance.

This thesis will propose a method which can be used efficiently when pose tracking using a known reference object. By suggesting several useful optimisations, this method serves as a complement to many existing motion and pose tracking solutions from existing research. Anybody interested in implementing Augmented Reality features which require heavy image processing should find this valuable.

1.2.4 Delimitation

This evaluation will consist of the following phases. First of all, some sort of object detection method has to be developed to find a known object in an image. When a detection method is in place, a similar tracking method will be implemented based on prior knowledge of the object location, using ROIs. The project will mainly be limited to exploring the usefulness of motion sensors when calculating this prior knowledge. If the use of gyroscope data proves useful when motion tracking, the optimal size of the ROI used when motion tracking will be investigated. The known reference object used will be a regular white A4 paper.


Ultimately, the new techniques will be prototyped as an application that will use the image processing to estimate the camera pose from the detected object. This will be done in order to render parcels in the scenery, creating an augmented reality experience.

1.2.5 Bontouch

Bontouch is a software consulting firm with focus on the mobile market. Their services consist of developing and selling complete smartphone apps to their clients. Being a company eager to compete on the market with futuristic solutions, their products involve a significant amount of experimentation when it comes to Augmented Reality (AR). Thus, they have also offered aid through the course of this project.

1.2.6 The Postnord app

Postnord AB is a union between the Swedish and Danish postal services, and their mobile application is developed by Bontouch. The app is a tool used by individual users to track and locate their incoming mail and parcels. In the application, a user is presented with all the information she or he needs before collecting arrivals at their local postal service station. The implementations of this project are meant to be an experimental prototype connected to the Postnord app services.


Chapter 2

Related work

In this chapter, previous scientific material related to this thesis project is addressed and discussed. The idea is to highlight the similarities and differences between what has been done before and what will be done in this thesis.

The application is built upon the use of methods and algorithms from mostly computer vision research. Computer vision consists of fields such as image processing, object recognition and camera geometry. All of these are necessary knowledge in this project. Image processing and object recognition relate to detection and tracking of the reference object, and camera geometry and calibration relate to estimating the camera pose in real time.

Rudimentary image processing methods will be part of the upcoming theory chapter (see 3). This chapter will focus more on recent computer vision research, such as motion tracking and camera pose estimation. Also, since the goal of the project is to create an AR application for mobile devices, scientific material treating that topic will be discussed as well.

2.1 Augmented Reality on handheld devices

Over ten years ago, in 2003, Daniel Wagner and Dieter Schmalstieg, at Vienna University in Austria, wrote an article called First Steps Towards Handheld Augmented Reality [42]. The article proposed the idea that image processing techniques and hardware had become powerful enough to feature Augmented Reality with a minimal need for infrastructure. Working several years before the big commercial break of the so-called smartphone, this was all done with a Hewlett-Packard PocketPC and a 320x240 camera [42]. With their very limited setup, they created a marker-based indoor road map application, with a 3D compass arrow overlay to display ”which way to go” [42]. This was regarded as a successful proof of concept, despite the lack of hardware compared to today's standards.

In 2005, Wagner and Schmalstieg, together with Thomas Pintaric and Florian Ledermann, further explored the handheld AR experience [41]. Exploiting the convenience of pocket-sized hardware, they created a multi-user application as yet another AR proof of concept. A four-player game based on running virtual toy trains on a real world track was built and showcased at a computer graphics convention [41]. They estimated that a total of five to six thousand people tried the game. Despite the fact that no formative user study was performed, the conclusion was that handheld AR applications are considered an impressive and exciting user experience, despite application complexity [41].

In their paper Anywhere Interfaces Using Handheld Augmented Reality from 2012, Michael Gervautz and Dieter Schmalstieg describe the potential of AR applications in the future. About camera pose tracking, they write «The most significant technical challenge on mobile devices is measuring the camera's pose (that is, its position and orientation) in relation to relevant objects in the environment» [19]. They describe how different applications of this kind, on smartphones as well as gaming stations such as the Playstation 3, are battling on the AR market. They conclude in the paper that «Although the first commercial AR success stories are now available, much remains to be done in research», further stressing the importance of exploring AR applications [19].

In his master's thesis, Tommy Marshall experimented with AR as a digital guide for tourists visiting a city [32]. The experiment consisted of an Android application used to improve the experience when visiting the Nobel museum in Stockholm. When testing the finished version of the application on user subjects, he got reviews describing the AR as ”fun” and ”simple to use” [32]. This was the general user opinion of the Android app, despite the fact that it only contained basic, experimental functionality. This was presented in a master's thesis, not in a scientific article. Still, it strengthens the belief that augmented reality applications are greatly anticipated among possible consumers.

2.2 Motion tracking

Motion tracking is the process of following an object through a sequence of images, carefully differentiating the movement of that object from other objects. The A4 paper used as a reference object in this thesis is only defined by four points, equal to its corners. In order to get accurate estimations of these corners in the image, very precise image processing methods have to be used. This is further explained in section 3.3.

In the paper Integrated h.264 Region-of-Interest detection, tracking and compression for surveillance scenes [16], the authors describe how isolating moving objects inside regions of interest is a very effective way of tracking motion in surveillance systems. Since cameras used in surveillance are placed in a fixed pose, the images taken by the camera will contain a background suffering from little to no change at all. Any changes in the scene can therefore safely be assumed to belong to moving objects [16]. This is the prerequisite for effective motion tracking based on regions of interest. In this project, the camera will typically not be in a fixed pose, but assumptions about background characteristics can still be made if a very narrow but valid region of interest is chosen. This is where the gyroscope measurements hopefully will play a major role in keeping these regions valid in between frames.

Many other methods used for motion tracking exist, such as [29] [11] [26]. However, many of these methods are based on tracking the general motion of any pixels in the image, instead of a very specific object. The accuracy requirement of this thesis therefore makes most of them unfitting, which is why motion tracking will be based on more immediate image processing.

2.3 Real time pose estimation

The idea behind AR applications is to discover new ways of expressing information to a user in a real world environment. It is therefore of great interest to determine the pose of the camera in real time, expressed in world coordinates. The camera pose problem is considered nonlinear and is commonly solved iteratively with Gauss-Newton or Levenberg-Marquardt optimisation [31]. This problem is a very common research topic, and the solutions proposed are often optimal under very specific circumstances [37] [30] [36]. Many of these solutions are purely analytical, solving the problem algebraically as opposed to the iterative ones.

In order to determine the pose of the camera gathering the visual information to augment, a known object, shape or image is required [23] [37]. This is called the reference object. In the paper A performance study for camera pose estimation using visual marker based tracking, the authors compare several methods for pose estimation using an easily detected marker as the known reference object [31]. They propose new methods, combining iterative and analytical solutions with Extended Kalman Filters, to achieve higher accuracy and lower estimation errors. However, they also conclude that combined, more complex methods cause slower execution [31].

In the paper PTZ camera pose estimation by tracking a 3D target, the authors propose a method for determining the pose of a Pan-Tilt-Zoom (PTZ) camera. The pose is estimated by having the camera track the motion of an Unmanned Aerial Vehicle (UAV) with known 3D location, which is recorded using a GPS monitor [25]. The problem is unconventional since the camera is not rigidly fixed, but offers two degrees of rotational freedom (zoom is considered fixed). Also, normally the camera pose is calculated from several 2D-3D point correspondences in a single image. The pose in [25] is instead found by tracking a single point which translates in 3D through several images. In this thesis, the pose estimation scenario is much simpler and more conventional.

As an alternative, in the paper Markerless Pose Tracking for Augmented Reality, C. Yuan describes an approach where no known object is involved [44]. Instead, the pose is estimated by detecting features in an image, and then assuming that certain features are planar. That way, the six-degree-of-freedom pose can be determined, up to an unknown scale, which in this case can only be guessed or set to match the scenery. In the case of this report, scale is of great importance. Therefore, it is crucial that the target object has a known 3D model. However, this approach comes with two downsides which make it unsuitable for this project. The first one is that it is scale invariant [44], meaning that there is no possible way to know the distance to any detected object, or the actual size of the object in the real world 3D space. These dimensions have to be assumed in order to get realistic simulations.

The approach suggested in [44] can be considered fast enough to run in real time. However, real time in that context is considered to be 5 frames per second on a DELL M20 Precision laptop [44]. Therefore, the algorithms used in that article for image processing and tracking are not suitable in this project, since a much higher frame rate is desired.

Since camera pose tracking consists mainly of object motion tracking and pose estimation, these two are the possible bottlenecks when it comes to optimising the application performance. The computational load of doing image processing to detect an object is proportional to the resolution of the image. This is not the case when estimating the camera pose, where only the number of point correspondences is relevant [30]. The prediction in this project is that the total performance of the pose tracking will be much more affected by the choice of image processing methodology than by the choice of pose estimation algorithm. Therefore, a very common pose estimation algorithm will be used (see 3.2.3).

2.4 Involving non-camera sensors

General object and pose tracking is sometimes considered to be unsuitable for real-time AR applications, due to its computational complexity [44]. High performance solutions are often built from very explicit assumptions about the environment, to allow special case algorithms to solve the heaviest tasks. Since part of the pose estimation problem is finding the rotation of the camera device, the image processing algorithms could be unburdened by involving motion sensors such as the gyroscope.

As previously mentioned regarding [25], known orientation data of the camera can be used to improve the solution. In that case, the known pan and tilt values of the camera are used in order to build point correspondences as if the PTZ camera was fixed. In this thesis, similar rotational data will instead be used to predict the content of upcoming images.

In Sequential Pose Estimation Using Linearized Rotation Matrices, a method very similar to the one in this thesis is proposed, where gyroscope measurements are used to create a camera pose prediction, which is later corrected using vision based measurements [14]. In that article however, the gyroscope measurements are used as a prediction of camera orientation, rather than being used for the actual visual motion tracking.

Whilst researching Pedestrian Localisation for Indoor Environments, Oliver Woodman addressed gyroscope accuracy when relying on consecutive measurements [43]. The ”white measurement noise” can be described as a constant bias value ε, which integrated over time produces a total angular error that grows linearly with time [43]. Other sources, such as electrical interference and even temperature, can also affect this error in a ”white noise” fashion [43]. Similar results were published in the article Real-time image processing for motion planning based on realistic sensor data, where the gyroscope measurements of an iPhone 4 and 4S gave accumulating drift values over time [10].

The hope in this project is that many of these errors will be of little importance to the total error, since only the relative gyroscope measurements between frames are considered. Thus, the error will not propagate through subsequent estimations.
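As a rough illustration of how such relative measurements can be used, the sketch below integrates the gyroscope samples recorded between two consecutive frames into a single relative rotation matrix. This is a minimal sketch in C++ with OpenCV, assuming a hypothetical GyroSample type; on iOS the samples themselves would come from the motion framework, and mapping the device axes to the camera axes is left out.

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// One gyroscope sample: angular velocities (rad/s) around the device's
// x, y and z axes, plus a timestamp in seconds. Hypothetical type.
struct GyroSample { double wx, wy, wz, t; };

// Integrate the samples recorded between two frames into one relative
// rotation. Because only the inter-frame rotation is used, integration
// errors do not accumulate across subsequent frame pairs.
cv::Mat relativeRotation(const std::vector<GyroSample>& samples) {
    cv::Mat R = cv::Mat::eye(3, 3, CV_64F);
    for (size_t i = 1; i < samples.size(); ++i) {
        double dt = samples[i].t - samples[i - 1].t;
        // Small-angle step: rotation vector = angular velocity * dt.
        cv::Vec3d rvec(samples[i].wx * dt,
                       samples[i].wy * dt,
                       samples[i].wz * dt);
        cv::Mat dR;
        cv::Rodrigues(rvec, dR);   // axis-angle to rotation matrix
        R = dR * R;                // compose the incremental rotations
    }
    return R;
}
```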


Chapter 3

Theory

The engine that was designed and implemented in this project relies upon several known mathematical principles and methods, which will be briefly explained in this chapter. First, basic camera geometry and mathematical image representations are explained in section 3.1. In section 3.2, camera calibration is explained: camera calibration consists of finding the internal parameters of a camera, such as the focal length, as well as determining the external pose of the camera. Afterwards, in section 3.3, the different image processing algorithms used for object detection are explained.

The formulas and methods explained in this chapter are all referred to in the method chapter, where the engine implementation is outlined and described.

When mathematical formulas and equations are referred to in this chapter, the notation used for mathematical entities may vary from that in the source.

3.1 The projection camera model

3.1.1 The internal camera model

Before moving on to anything else, we will exercise some rudimentary projective geometry. First and foremost, this includes the general camera model, as described in Multiple View Geometry in Computer Vision by Hartley and Zisserman [23].

In the book, the authors propose a camera model where points in space are projected onto an image surface, or image plane, towards a projection center. This model, often called the pin-hole camera [23], joins a point P = (X, Y, Z)^T in the world coordinate 3D space with the projection center C, also in world coordinates. This connection will intercept the image plane at a point we call p = (x, y)^T. In the simplest case, we can define the projection center to coincide with the Euclidean space origin C = (0, 0, 0)^T. The axis running from the projection center towards the facing direction of the camera is called the principal axis. This axis has an orthogonal intersection with the image plane at the so-called principal point (x0, y0). The distance between the projection center and the image plane is called the focal length f.


Figure 3.1: A geometric representation of the pinhole camera, where the point X isconnected to the projection center C, intercepting the image plane at point x. Thisfigure is taken from ref. [23].

Using homogeneous coordinates when representing points in 2D and 3D space gives us a convenient way to write the projection as a linear transformation:

\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
= \lambda
\begin{pmatrix} f & 0 & x_0 & 0 \\ 0 & f & y_0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\]

where λ is an unknown scale factor. The 3x3 matrix containing the focal length f and the principal point (x0, y0) (omitting the last zero column vector) is often referred to as the camera calibration matrix K [23]. The content of this matrix is often referred to as the intrinsic parameters of the camera [38].

For added generality, we consider the possibility of having non-square pixels. This is done by introducing scale factors α_x = f m_x and α_y = f m_y, representing the focal length in terms of pixel dimensions. This provides the final intrinsic model which will be used in this thesis:

\[
K = \begin{pmatrix} \alpha_x & 0 & x_0 \\ 0 & \alpha_y & y_0 \\ 0 & 0 & 1 \end{pmatrix}
\]

3.1.2 Camera orientation and translation

In reality, the coordinate system of the camera and that of the world may not coincide. Instead, they will be related by an orientation and a translation of the camera. The translation describes the position of the camera relative to the Euclidean origin of the world 3D space. The orientation, in turn, describes the facing of the camera, and thus of the image plane, at that given point.

If we denote the general camera orientation as a 3x3 rotation matrix R = R_x R_y R_z, and the translation as the 3x1 vector t̄, we get the following camera representation [23]:

\[
\begin{pmatrix} x \\ y \\ 1 \end{pmatrix}
= \lambda K R
\begin{pmatrix} 1 & 0 & 0 & -t_x \\ 0 & 1 & 0 & -t_y \\ 0 & 0 & 1 & -t_z \end{pmatrix}
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
= \lambda K R \left[\, I \mid -\bar{t}\, \right]
\begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}
\]

This is the complete camera model we will be using. The orientation and position of the camera relative to the world coordinate space are commonly referred to as the extrinsic parameters of the camera [38].
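As a concrete illustration (not part of the thesis pipeline itself), the short sketch below projects a world point through this model using OpenCV matrix types. All numeric values are placeholders, not calibrated parameters.

```cpp
#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    // Intrinsic matrix K with illustrative focal lengths and principal point.
    cv::Matx33d K(1200, 0,    640,
                  0,    1200, 360,
                  0,    0,    1);

    cv::Matx33d R = cv::Matx33d::eye();  // camera orientation (no rotation)
    cv::Vec3d   t(0.1, 0.0, 0.0);        // camera position in world units

    cv::Vec3d P(0.0, 0.0, 2.0);          // a world point in front of the camera
    cv::Vec3d p = K * (R * (P - t));     // homogeneous image point, scale included

    // Dividing by the third coordinate removes the unknown scale factor.
    std::cout << "pixel: (" << p[0] / p[2] << ", " << p[1] / p[2] << ")\n";
    return 0;
}
```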

3.1.3 A linear relation between two images of a rotated camera

Assume that the intrinsic parameters of a camera are known and defined as the camera matrix K. If a scene is projected onto two images from the same camera, from the same location in 3D space but with different rotations, the following equations can be created:

\[
p = KRP \qquad p' = KR'P
\]

Here p and p′ are the image points in the two images, and P is the real-world 3D point being projected. The translation vector t̄ is omitted in the equations, as it is unchanged. Since the vector P is the same in both equations, one of the equations can be solved for P in order to obtain the following:

\[
p' = KR'(KR)^{-1}p = KR'R^{-1}K^{-1}p = Hp
\]

The matrix H is a homography, and describes a linear relationship between image points in the two images [23, p. 202-203]. In short, if the camera matrix and the rotation between two images taken from the same location are known, then the translation of points from p to p′ can be calculated using the homography H = KR′R^{-1}K^{-1}. This can be used to formulate a valid guess of the contents of image p′, if p is known.
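Under the assumption that the relative rotation between two frames is known (for instance, integrated from gyroscope samples as sketched in chapter 2), this homography is straightforward to apply. A minimal sketch:

```cpp
#include <opencv2/opencv.hpp>

// Predict where an image point moves between two frames related by a pure
// rotation. K is the intrinsic matrix; dR is the relative rotation R'R^-1,
// e.g. integrated from gyroscope measurements between the frames.
cv::Point2f predictPoint(const cv::Matx33d& K, const cv::Matx33d& dR,
                         const cv::Point2f& p) {
    cv::Matx33d H = K * dR * K.inv();           // H = K R' R^-1 K^-1
    cv::Vec3d q = H * cv::Vec3d(p.x, p.y, 1.0); // homogeneous image point
    return cv::Point2f(float(q[0] / q[2]), float(q[1] / q[2]));
}
```

Applying this prediction to the four corners of the reference object gives a guess of where the object will appear in the next frame, which is exactly what is needed to place the tracking ROI.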

3.2 Camera calibration

3.2.1 General camera calibration

In augmented reality applications, we are often interested in estimating the camera projection matrix in order to determine the pose of a camera. This is called camera calibration, or resectioning [23]. If we simply express our camera model as p = MP, we can try to solve a linear system of equations in order to find M. This is done using corresponding points p_i ↔ P_i, where P_i is a known point of a 3D model and p_i the projection of that point in an image from the camera. M is a 3x4 matrix, but possesses only 11 degrees of freedom, due to the unknown scale. With 11 equations (five and a half corresponding points, each coordinate producing one equation), this will result in one exact solution. In the general case, though, it is more common to solve an over-determined version of this system using direct linear transformation [23].

3.2.2 Determining the intrinsic camera parameters

Finding the intrinsic parameters of a camera is a necessary step in order to extract any metric information about a 3D world from 2D images [45]. The following calibration method is described by Zhengyou Zhang [45]. This method is commonly used, as it is very flexible in comparison to older methods [45] (e.g. the one described by Roger Y. Tsai in 1987 [40]). One of its primary strengths is that it only requires observations of a planar pattern, easily printed and photographed using any regular camera.

The complete camera model containing camera matrix, translation and orientation can be viewed in its compact form as a single 3x4 matrix M. Given an estimation of the matrix M and the abbreviated camera model:

\[
p = \begin{pmatrix}
m_{11} & m_{12} & m_{13} & m_{14} \\
m_{21} & m_{22} & m_{23} & m_{24} \\
m_{31} & m_{32} & m_{33} & m_{34}
\end{pmatrix} P
\]

we now seek to decompose M back into intrinsic camera parameters, orientation and translation:

\[
M = \lambda K R \left[\, I \mid -\bar{t}\, \right]
\]

Zhengyou Zhang found that by relating the model plane and its image with a homography H = (h_1 h_2 h_3), we can set up a system of equations relating the homography to the intrinsic parameters by two fundamental constraints:

\[
h_1^T K^{-T} K^{-1} h_2 = 0 \qquad h_1^T K^{-T} K^{-1} h_1 = h_2^T K^{-T} K^{-1} h_2
\]

where K^{-T} = (K^{-1})^T. B = K^{-T} K^{-1} is a 3x3 matrix with 6 degrees of freedom (from the 6 unknown intrinsic parameters in the general case). This matrix is symmetric [45] and can therefore also be written in vector form b = (B11, B12, B22, B13, B23, B33). If the ith column of the homography is h_i = (h_i1, h_i2, h_i3)^T, then we have

\[
h_i^T B h_j = v_{ij}^T b
\]

with

\[
v_{ij} = \left(h_{i1}h_{j1},\; h_{i1}h_{j2} + h_{i2}h_{j1},\; h_{i2}h_{j2},\; h_{i3}h_{j1} + h_{i1}h_{j3},\; h_{i3}h_{j2} + h_{i2}h_{j3},\; h_{i3}h_{j3}\right)^T
\]

This way, we can write the fundamental constraints from a homography as two homogeneous equations:

\[
\begin{pmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{pmatrix} b = 0
\]


which can be solved for the vector b. This solution can in turn be used in the equation

\[
B = K^{-T} K^{-1}
\]

which has six unknowns and six equations. The solution for b is obtained by minimising an algebraic distance, and then refined using maximum-likelihood inference [45].
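In practice, this whole procedure (corner detection, per-view homographies, the closed-form solution and the maximum-likelihood refinement) is packaged in OpenCV's cv::calibrateCamera. A minimal sketch, assuming a set of photographs of a printed chessboard; the pattern size (9x6 inner corners) and square size (25 mm) are illustrative choices, not the ones used in this project:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat calibrateIntrinsics(const std::vector<cv::Mat>& images) {
    const cv::Size patternSize(9, 6);
    const float squareSize = 25.0f; // mm

    // The planar model points are identical for every view of the pattern.
    std::vector<cv::Point3f> model;
    for (int r = 0; r < patternSize.height; ++r)
        for (int c = 0; c < patternSize.width; ++c)
            model.emplace_back(c * squareSize, r * squareSize, 0.0f);

    std::vector<std::vector<cv::Point3f>> objectPoints;
    std::vector<std::vector<cv::Point2f>> imagePoints;
    for (const cv::Mat& img : images) {
        std::vector<cv::Point2f> corners;
        if (cv::findChessboardCorners(img, patternSize, corners)) {
            objectPoints.push_back(model);
            imagePoints.push_back(corners);
        }
    }

    cv::Mat K, distortion;
    std::vector<cv::Mat> rvecs, tvecs; // per-view extrinsics, unused here
    cv::calibrateCamera(objectPoints, imagePoints, images.front().size(),
                        K, distortion, rvecs, tvecs);
    return K; // the 3x3 intrinsic camera matrix
}
```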

3.2.3 Determining camera pose

If the internal parameters of the camera are already known, finding the pose of the camera can be reformulated as the Perspective-n-Point (PnP) problem [37]: given n points of a scene mapped to a real world model, determine the position and orientation of a camera. Several propositions have been given to solve different variants of this problem. The simplest version of the problem which yields finite solutions is PnP with n = 3, also called P3P [37]. The P3P problem requires four corresponding image and model points.

Several approaches have been presented to estimate solutions to the PnP problem, including [24] [30] [37]. One of the simpler methods is based on Levenberg-Marquardt optimisation. This method iteratively finds the six degree-of-freedom pose that minimises the re-projection error of the corresponding points [18] in the system of equations. Since this method improves the solution iteratively, an already known estimation of the pose from previous runs can be used as the initial guess for the pose. This reduces execution time significantly in the average case [27].
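OpenCV exposes this iterative solver through cv::solvePnP. A minimal sketch for the A4 case, where setting useExtrinsicGuess to true reuses the pose from the previous frame as the initial guess; the corner ordering is an assumption that must match the detection output:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Estimate the camera pose from the four detected A4 corners using the
// iterative (Levenberg-Marquardt) PnP solver. rvec/tvec hold the previous
// pose on input and the refined pose on output.
bool estimatePose(const std::vector<cv::Point2f>& corners,
                  const cv::Mat& K, const cv::Mat& distortion,
                  cv::Mat& rvec, cv::Mat& tvec, bool havePreviousPose) {
    // A4 model in millimetres; the corner order must match the detection.
    static const std::vector<cv::Point3f> model = {
        {0, 0, 0}, {210, 0, 0}, {210, 297, 0}, {0, 297, 0}};

    return cv::solvePnP(model, corners, K, distortion, rvec, tvec,
                        havePreviousPose /* useExtrinsicGuess */,
                        cv::SOLVEPNP_ITERATIVE);
}
```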

3.3 Image processing

Image processing consists of methods and algorithms that handle bitmap images in order to extract information about what they represent. This section explains the different methods used in order to accurately detect the corners of an A4 paper.

3.3.1 Image binarisation using Canny's algorithm

Edge detection is the method of processing a given input image, outputting another binary image of equal size, where pixels are determined to be of class edge or non-edge. Pixels that are considered to be part of an edge are set, and non-edge pixels are unset. Edge detection can be an effective first step when processing an image to find particular shapes.

One very common and efficient approach to detecting potential edges in an image is the so-called Canny edge detection algorithm. The algorithm was proposed by John Canny in his paper A computational approach to edge detection [12]. In the paper, a method is described as the solution to the edge detection problem based on a comprehensive set of goals. Two goals, low error rate and good localisation, naturally come in the form of a trade-off: in order to sort out false edges (image noise) using a Gaussian filter [20], information about the exact position of an edge can be lost [20] [12]. The third and final goal is that a true edge point should only generate one edge point response.

The Canny algorithm works as follows: first we compute the gradient of the image, using the Sobel operators (proposed by Irwin Sobel in a talk for the Stanford Artificial Intelligence Project, later described in Pattern Classification and Scene Analysis by Richard O. Duda [15]). Then, by comparing the gradient of each pixel to its neighbourhood, pixels which do not contain a local maximum in the direction of the gradient are discarded. Lastly, something called hysteresis thresholding is performed to classify the remaining pixels into the edge and non-edge classes based on their gradient magnitude [20].

The gradient of an image, (G_x and G_y), can be estimated by convolving the Sobel operator matrices with the input image A [20]:

\[
G_x = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} * A
\qquad
G_y = \begin{pmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{pmatrix} * A
\]

Afterwards, we find the magnitude M and angle α of each gradient to be

\[
M(x, y) = \sqrt{g_x^2 + g_y^2} \qquad \alpha(x, y) = \arctan(g_y / g_x)
\]

We can then describe the direction d_k of a possible edge as a quantisation of the angle α. By quantisation, we mean that we simplify the direction to be one of horizontal, vertical, +45° or −45° (d_1, d_2, d_3, d_4). If a pixel (x, y) has a higher magnitude than its neighbours along the quantised direction d_k, it remains a candidate. This is called non-maxima suppression [20].

Finally, hysteresis thresholding requires two user-given thresholds; one upper T_H and one lower T_L. We define g_N(x, y) to be the output from the non-maxima suppression, containing the gradient magnitudes of the remaining candidates. Then we create another two categories based on the thresholds:

\[
g_{NH}(x, y) = g_N(x, y) \ge T_H \qquad g_{NL}(x, y) = T_H > g_N(x, y) \ge T_L
\]

Pixels in g_NH(x, y) and g_NL(x, y) are considered ”strong” and ”weak” edge pixels respectively. Strong pixels will be marked as valid edges immediately. Weak ones will be marked as edges if they are the neighbour of a strong edge pixel, or of a weak edge pixel that has already been marked as a valid edge [20].

The pixels marked in this final step constitute the binary edge detection output.
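For reference, OpenCV's implementation of this algorithm (the library used later in this project) can be invoked as in the minimal sketch below; the two hysteresis thresholds are illustrative values, not the ones used in the thesis:

```cpp
#include <opencv2/opencv.hpp>

cv::Mat cannyEdges(const cv::Mat& frame) {
    cv::Mat gray, edges;
    cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
    cv::GaussianBlur(gray, gray, cv::Size(5, 5), 1.4); // suppress noise first
    cv::Canny(gray, edges, 50 /* T_L */, 150 /* T_H */);
    return edges; // binary image: edge pixels set, non-edge pixels zero
}
```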

3.3.2 Image binarisation using Otsu's thresholding method

Due to its high accuracy, Canny's algorithm can be quite computationally expensive. If the image contains clear background/foreground scenery, a much faster method called thresholding can be applied. Here, binary images are used to differentiate between the image's foreground and background. This is very commonly the case in Optical Character Recognition (OCR) [33]. Thresholding is a type of classification, where pixels are considered to be of foreground (above the threshold) or background (below the threshold) type, based on their grayscale intensity [20, p. 738] [33] [6]. Thresholding holds a very central position in image segmentation and binarisation, both because of its intuitive properties and because of its computational speed [20, p. 738].

Figure 3.2: Hysteresis thresholding applied to two edges. Weak edge pixels of A will be considered an edge, as they are connected to strong edge pixels. Weak edge pixels of B will not be considered an edge.

In the simplest form, by inferring a threshold value T, a segmented image B is created from a grayscale image F as:

\[
B(x, y) = \begin{cases} 1 & \text{if } F(x, y) > T \\ 0 & \text{if } F(x, y) \le T \end{cases}
\]

where the value 1 indicates high and 0 low intensity values [20, p. 738]. These values can then be directly translated into a foreground/background classification. Global thresholding, in contrast to adaptive or variable thresholding [20, p. 741, 742, 756], means considering only one threshold value T for the entire image.


The simplest form comes with a set of issues, though. One of them is determining what value to use for T. A good threshold value, which creates a satisfactory binarisation, needs to take the overall brightness of the image into account. It is therefore sometimes effective to consider a dynamic threshold, computed from the context of the pixel intensities [20, p. 741].

In his paper A Threshold Selection Method from Gray-Level Histograms, Nobuyuki Otsu covers his method for creating an optimum global threshold [6]. By analysing the intensities of a grayscale image, a normalised histogram of these values can be created. Normalised means that the sum of all values in the histogram is equal to 1. This histogram can then be used as a discrete probability distribution function over each intensity level. Assuming the image contains a reasonably distinct foreground/background case, this distribution should be in the form of two separable, rigid heaps. By introducing the foreground and background classes as c ∈ {1, 2}, we can describe P_c as the probability of a pixel belonging to class c. Furthermore, σ_B² is the between-class variance of the histogram (between the heaps). If the threshold variable is defined as k, m(k) is the mean intensity below the threshold and m_G is the global mean intensity, the following expression for the between-class variance can be formed [20, p. 746]:

\[
\sigma_B^2(k) = \frac{\left[m_G P_1(k) - m(k)\right]^2}{P_1(k)\left[1 - P_1(k)\right]}
\]

The threshold value k* that maximises this variance is the global optimum as proposed by Otsu [6] [20, p. 746]. Hence, a good binary segmentation of a grayscale image is:

\[
B(x, y) = \begin{cases} 1 & \text{if } F(x, y) > k^* \\ 0 & \text{if } F(x, y) \le k^* \end{cases}
\]
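OpenCV bundles Otsu's method into its thresholding routine. A minimal sketch: with THRESH_OTSU set, cv::threshold ignores the supplied threshold argument, computes the optimum k* from the grayscale histogram, and returns the chosen value.

```cpp
#include <opencv2/opencv.hpp>

cv::Mat otsuBinarise(const cv::Mat& gray) {
    cv::Mat binary;
    double kStar = cv::threshold(gray, binary, 0 /* ignored */, 255,
                                 cv::THRESH_BINARY | cv::THRESH_OTSU);
    (void)kStar; // the automatically selected global threshold k*
    return binary;
}
```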

3.3.3 Finding joint contours in an image

Given a binary image segmentation, such as the one produced by thresholding or with the Canny algorithm [12], we can proceed by building joint contours from the edges. This is a quite effective method when attempting to detect compound shapes within an image.

A contour can be defined as an ordered sequence of points. Choosing clockwise ordering, for example, is often a requirement when using other image processing algorithms with shape contours as input [20, p. 796].

One particular method for creating this point sequence was presented by S. Suzuki and K. Abe in their paper Topological Structural Analysis of Digitized Binary Images by Border Following from 1985 [39]. The authors propose an algorithm which can perform border following (another phrase for contour finding) and which also creates a topological structure in the image, making some borders belong to others in an inner-outer relationship. This topological structure can be used to mark each shape candidate which could possibly be the sought object.
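This algorithm is available in OpenCV as cv::findContours; a minimal sketch (assuming OpenCV 3.2 or later, where the input image is left unmodified):

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// cv::findContours implements Suzuki-Abe border following. RETR_TREE also
// recovers the inner/outer topology described above: hierarchy[i] stores
// the indices {next, previous, first child, parent} for contour i.
void findShapeCandidates(const cv::Mat& binary,
                         std::vector<std::vector<cv::Point>>& contours,
                         std::vector<cv::Vec4i>& hierarchy) {
    cv::findContours(binary, contours, hierarchy,
                     cv::RETR_TREE, cv::CHAIN_APPROX_SIMPLE);
}
```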


3.3.4 Polygon shape approximation

Given the contours of a shape in an image, it is often essential to interpret what the contours actually represent. A rectangle can be represented with a minimal set of points, using only its four corners. However, due to noise and distortion in the image, a contour finding algorithm is likely to produce a shape with more points than there actually are. In order to extract the actual rectangle visible to a human being, some sort of polygon approximation is needed.

One way to represent a polygonal approximation is by dividing the area encompassing the contour into square segments. Consecutively, by generalising movement between these segments as one of eight directions (north, east, south, west and intermediate combinations), a polygonal simplification of the contour is achieved. This representation is called Freeman chain codes [20, p. 798].

Approximating polygon structures with Freeman chain codes results in a very large set of edges, which is unfortunate if we are interested in more accurately determining the actual shape of the polygon [35]. Another method, which tries to produce approximations with a small number of edges without compromising on computational complexity, was presented by Urs Ramer in 1972 [35]. This algorithm takes a contour representation (a large set of points) and recursively tries to eliminate points. First, a segment of points is chosen as a subset S_k of the entire contour, and a straight line l_k between the start and end points of that segment is formed. Given a constant tolerance ε and a distance function f(S_k), measuring the distance between the point in S_k furthest from the line l_k and that line, the following criterion is formed:

\[
f(S_k) \le \varepsilon
\]

If this criterion holds for the segment S_k, the entire segment is replaced with the endpoint line approximation l_k. This method is repeated until no subset segment of the remaining contour fulfils the criterion, and a polygon with a minimal set of edges is created [35].
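OpenCV implements this scheme as cv::approxPolyDP. A minimal sketch of how a contour could be reduced and tested as a candidate for the four-cornered paper; the tolerance of 2% of the contour perimeter is an illustrative choice:

```cpp
#include <opencv2/opencv.hpp>
#include <vector>

// Reduce a contour with the Ramer scheme and accept it as a paper
// candidate only if it simplifies to a convex quadrilateral.
bool isQuadCandidate(const std::vector<cv::Point>& contour,
                     std::vector<cv::Point>& quad) {
    double epsilon = 0.02 * cv::arcLength(contour, true /* closed */);
    cv::approxPolyDP(contour, quad, epsilon, true /* closed */);
    return quad.size() == 4 && cv::isContourConvex(quad);
}
```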


Chapter 4

Method

In this chapter, the different steps of the total application are explained. First, an overview of the different parts of the implementation is presented, followed by some considered methods that were rejected. Afterwards, the different steps of the implementation are explained in more detail, together with possible problems encountered in association with them. Lastly, the method for testing and evaluating the implementation is explained, and the test data is presented.

4.1 OpenCV

OpenCV is an open source computer vision library, which is maintained by the Itseez group [28]. The library features over 2500 optimised algorithms for image processing, and has become an industry standard on many platforms. Popular usage includes face recognition, camera movement tracking and object identification. OpenCV is commonly used both by smaller software companies and by giants such as Google, Yahoo, Microsoft, Intel and IBM [28].

Image processing and camera calibration algorithms described in the different articles cited in this report have corresponding C++ implementations in the OpenCV library. Usage of any of these algorithms in this project is therefore entrusted to these library implementations.

4.2 Method overview

In order to evaluate the efficiency of gyroscope usage in image processing and motion tracking, a typical image processing and motion tracking case had to be chosen. Since the ultimate goal of the entire project, from perhaps a product perspective, was to visualise parcels in ”real world” size, the image processing case was chosen accordingly. The reference object to detect and track was chosen to be a white A4 paper (210x297 mm), which is a very common object.

The entire program consists of the application providing a live camera feed from the smartphone camera. This feed is sent both to a camera feed preview on the screen, and to the camera pose engine, which is where most of the work is performed. The camera pose engine does both object detection and tracking, as well as determining the camera pose from the results. This engine also takes gyroscope input from the smartphone along with the images. Once the camera pose is determined, it is used to render parcels in 3D on top of the camera feed preview.

Figure 4.1: A flow chart displaying an overview of the entire application system.

4.3 Considered methods

As part of the official OpenCV documentation, there is a guide describing how to estimate the camera pose using a textured object (in that case a parlour game box) [4]. When performing pose estimation based on a model detected in a scene, it is common to use general feature detection algorithms such as SIFT, SURF or FAST. Detecting a large number of features on a textured object provides powerful robustness: since very many feature points are used, together with a RANSAC scheme for pose estimation, parts of the object can fall outside the image without greatly affecting the results. However, since the ultimate goal is to create an AR engine that estimates its pose based on more common objects, relying on very specific textures was ruled out.

Since the A4 paper is an object that is only represented by its four corners, it is vital that these four points are determined with very high precision: any loss in corner precision will cause great errors in the pose estimation. Four points is also the minimum number of points from which the camera pose can be estimated at all (see 3.2.3). Therefore, the entire paper has to be visible in the image in order to find the pose. These properties of the A4 paper object make the approach in the OpenCV guide unfitting, since the reliability of feature detection combined with RANSAC-based pose estimation comes from the vast number of points. The A4 paper must be detected specifically as a whole, instead of relying on feature detection.

One existing library meant to detect single-coloured blobs is cvblob [34]. However, since CPU efficiency is very important, the generality of cvblob could make it too slow for the purposes of this project. Instead, a new method, tailored for this special case, will be implemented.

4.4 Implementation

An application doing the parcel visualisation was developed as a part of this project. Since the purpose of the project is to evaluate Augmented Reality on mobile platforms, the application was developed in the form of a mobile app. Most of the app consists of a core built in C++; more specifically, this involves all the heavy image processing algorithms and camera calibration methods.

Writing the application in C++ enables portability between different mobile platforms, such as the Apple iPhone or any smartphone running the Android ART engine [21]. Android apps are normally built using Java running on the ART engine (which replaced the old Dalvik [21]), but can access native implementations using the Android Native Development Kit (NDK) [22].

To limit the scope of the project, the prototype was built for an Apple iPhone 5S running the iOS 8 operating system [8].

4.4.1 Object detection

Since the goal of the project is to efficiently and accurately render a parcel based on its true size, camera calibration needs a known object as reference. Detecting the A4 paper in a given image can be divided into the following steps (a code sketch follows the list):


Creating grayscale image The image is converted from a three-channel RGB format into a single-channel grayscale format

Reducing image noise In order to avoid fake edges, a Gaussian blur [20, p. 257] is applied to remove image noise

Image binarisation A binary edge map is created from the blurred grayscale image using Canny's edge detection algorithm (see 3.3.1)

Contour finding A set of all closed contours in the edge map is determined with Suzuki's algorithm (see 3.3.3)

Polygon approximation The contours are reduced to low-edge polygon approximations using the Ramer-Douglas-Peucker algorithm, and only polygons with four corners are kept (see 3.3.4)

Dropping low area candidates If a polygon has a very low area (less than about a few thousand pixels), it is removed

Rectangle sieving Probable rectangles are found by testing if opposing sides of the four-corner polygons are close to being parallel

Aspect ratio sieving Probable A4 papers are found by testing if the aspect ratio is close to that of an A4 paper

Possible failure If no polygon makes it this far, the detection is considered to have failed

Selecting a winner The few remaining candidates are tested for the amount of "probable white" pixels they contain, based on interpreted HSV values, and the whitest is elected as the A4 paper
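As a rough illustration, the first steps of this pipeline map onto standard OpenCV calls as in the sketch below. The blur kernel size, Canny thresholds and area limit are assumed values chosen for illustration (the report does not state them), and the rectangle, aspect-ratio and whiteness sieves are omitted:

    #include <opencv2/imgproc.hpp>
    #include <cmath>
    #include <vector>

    // Sketch of the detection pipeline above, up to the four-corner and
    // area sieves. Parameter values are assumptions, not from the report.
    std::vector<std::vector<cv::Point>> findPaperCandidates(const cv::Mat& bgr)
    {
        cv::Mat gray, blurred, edges;
        cv::cvtColor(bgr, gray, cv::COLOR_BGR2GRAY);        // grayscale image
        cv::GaussianBlur(gray, blurred, cv::Size(5, 5), 0); // noise reduction
        cv::Canny(blurred, edges, 50, 150);                 // binary edge map

        std::vector<std::vector<cv::Point>> contours;
        cv::findContours(edges, contours,                   // Suzuki's algorithm
                         cv::RETR_LIST, cv::CHAIN_APPROX_SIMPLE);

        std::vector<std::vector<cv::Point>> candidates;
        for (const auto& contour : contours) {
            std::vector<cv::Point> approx;
            cv::approxPolyDP(contour, approx,
                             0.02 * cv::arcLength(contour, true), true);
            // Keep only four-corner polygons with a non-trivial area.
            if (approx.size() == 4 && std::fabs(cv::contourArea(approx)) > 2000.0)
                candidates.push_back(approx);
        }
        return candidates;
    }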

4.4.2 Pose estimation

Given a detected set of points, and their corresponding 3D points in a given model, such as the A4 paper, we can solve the PnP problem, giving us the pose and position of the camera. This can be done by applying Levenberg-Marquardt optimisation, minimising the re-projection error of the corresponding points [18]. This method solves the problem iteratively, and is very fitting when estimating pose based on object tracking, since it can use the previously known pose as an initial estimation.
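In OpenCV, this corresponds to cv::solvePnP with the iterative (Levenberg-Marquardt) solver. The sketch below is an assumed illustration of how the four A4 corners could be fed to it; the corner ordering and the metre-based model coordinates are assumptions, not taken from the report:

    #include <opencv2/calib3d.hpp>
    #include <vector>

    // Estimate the camera pose from the four detected A4 corners.
    void estimatePose(const std::vector<cv::Point2f>& corners,
                      const cv::Mat& cameraMatrix, const cv::Mat& distCoeffs,
                      cv::Mat& rvec, cv::Mat& tvec, bool havePreviousPose)
    {
        // A4 paper lying in the Z = 0 plane, sides 0.210 m x 0.297 m.
        const std::vector<cv::Point3f> model = {
            {0.000f, 0.000f, 0.0f}, {0.210f, 0.000f, 0.0f},
            {0.210f, 0.297f, 0.0f}, {0.000f, 0.297f, 0.0f}
        };
        // SOLVEPNP_ITERATIVE minimises the re-projection error with
        // Levenberg-Marquardt; when havePreviousPose is true, the current
        // rvec/tvec values are used as the initial estimation.
        cv::solvePnP(model, corners, cameraMatrix, distCoeffs,
                     rvec, tvec, havePreviousPose, cv::SOLVEPNP_ITERATIVE);
    }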

4.4.3 Visualisation

With the pose correctly estimated, the given rotation and translation can be applied to the camera of a 3D scene, which has a parcel placed at the origin of the coordinate system. Modelling a static parcel placed at the origin, with a camera that changes pose, is preferable to having the parcel change pose, since it better fits the real-world scenario. From the user's point of view there is no difference, but the application becomes more intuitive and simpler to maintain.


Figure 4.2: Images describing some of the different stages of object detection: (a) the original image before processing; (b) binary image created by Canny's algorithm; (c) contours created from the binary image; (d) winner contour after filtering.


3D rendering is built upon the Apple SceneKit framework [9].


4.4.4 Perform motion tracking based on results from object detection

Assuming the camera pose will not change significantly after object detection, the projection of the detected object will probably suffer from little to no translation in the image. A region of interest around the object from the last detection frame can then be used to create a cropped image, where the object is likely to be found. That way, the input data to any object detection algorithm is greatly reduced, thus improving performance. This method will be referred to as motion tracking, and will be run on tracking frames.

The region of interest is created by finding the minimum rectangle that can contain the detected object in the image, and then enlarging it by increasing the height and width by a fixed amount.
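A minimal sketch of this ROI construction follows, assuming the fixed growth of 200 px used in the evaluation (section 4.5); splitting the growth evenly between the two sides of each dimension is an assumption for illustration:

    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Tracking ROI: minimal bounding rectangle of the last known corners,
    // grown by a fixed amount in both width and height.
    cv::Rect makeTrackingRoi(const std::vector<cv::Point>& lastCorners,
                             const cv::Size& imageSize, int growth = 200)
    {
        cv::Rect roi = cv::boundingRect(lastCorners);
        roi.x -= growth / 2;
        roi.y -= growth / 2;
        roi.width += growth;
        roi.height += growth;
        // Clamp the enlarged rectangle to the image bounds.
        return roi & cv::Rect(cv::Point(0, 0), imageSize);
    }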

Motion tracking is similar to the object detection, but with a greatly reduced search space. Since the tracking is done in order to quickly determine the camera pose, the true position of the object must still be found with high accuracy. Therefore, the tracking method shares all the steps with the detection method, except for the image binarisation. Since the scene with the reference object is greatly simplified inside the ROI, a much simpler and faster image binarisation algorithm can be chosen: instead of Canny's edge detection algorithm, Otsu's thresholding algorithm is used (see 3.3.2).
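In OpenCV, Otsu's method is available through cv::threshold with the THRESH_OTSU flag; a minimal sketch of this swapped-in binarisation step could look as follows:

    #include <opencv2/imgproc.hpp>

    // Binarise the cropped tracking ROI with Otsu's thresholding method.
    // Assumes the ROI is a simple foreground/background scene.
    cv::Mat binariseRoi(const cv::Mat& gray, const cv::Rect& roi)
    {
        cv::Mat binary;
        // With THRESH_OTSU, the threshold value is computed automatically
        // from the grayscale histogram; the explicit 0 is ignored.
        cv::threshold(gray(roi), binary, 0, 255,
                      cv::THRESH_BINARY | cv::THRESH_OTSU);
        return binary;
    }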

Another approach for motion tracking, especially in the case of tracking an A4 paper, is to only pay attention to the corners of the paper. This reduces the search space even more effectively, since only four relatively small areas have to be examined. On the other hand, it could make it more difficult to verify that the true corners of the A4 paper have been found, since some of the context of the image is lost when the search areas are divided. This corner-only approach will not be attempted in this method.

4.4.5 Include motion sensors when tracking motion

Suppose the input simplification method works well when the camera is almost still, but performs poorly if the camera pose suffers from rotation changes. Rotation can translate the object in the image quite drastically, invalidating the assumption about the region of interest where the object is likely to be found. In that case, rotation measurements from the gyroscope of the smartphone can assist in translating the region of interest as well. This way, the input simplification method can become resistant to non-trivial rotation changes.

Assume the camera is calibrated (the camera matrix is known). Given a ROI in one image, and the rotation state of the camera in two images, the ROI can be translated from one picture to the other using a homography (see 3.1.3). If the rotation readings from the gyroscope are accurate, then the translated ROI will be an accurate guess for where to find the sought object in the new image. Therefore, gyroscope data is stored when the object is found, marking the state of the scene at a specific point in time.


The translation is done by redefining the ROI as a polygon with four points, then transforming these four points individually using the homography. Once the transformation is done, a new ROI is created by enclosing the four points with the minimal rectangle containing all of them. This will also alter the size of the ROI, often making it larger, since the four transformed points will be translated differently.
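A minimal sketch of this ROI transfer follows, assuming 3x3 world-to-camera rotation matrices R1 and R2 built from the stored and current gyroscope readings, and the pure-rotation homography H = K R2 R1^T K^(-1) of section 3.1.3 (the exact rotation conventions are an assumption):

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Move the ROI from the detection frame to the current frame using
    // the pure-rotation homography H = K * R2 * R1^T * K^(-1).
    cv::Rect transferRoi(const cv::Rect& roi, const cv::Mat& K,
                         const cv::Mat& R1, const cv::Mat& R2)
    {
        const cv::Mat H = K * R2 * R1.t() * K.inv();

        // The ROI as a four-point polygon, transformed corner by corner.
        std::vector<cv::Point2f> src = {
            {(float)roi.x,               (float)roi.y},
            {(float)(roi.x + roi.width), (float)roi.y},
            {(float)(roi.x + roi.width), (float)(roi.y + roi.height)},
            {(float)roi.x,               (float)(roi.y + roi.height)}
        };
        std::vector<cv::Point2f> dst;
        cv::perspectiveTransform(src, dst, H);

        // Enclose the transformed points in the minimal containing
        // rectangle; this often grows the ROI, since the four corners
        // are translated differently.
        return cv::boundingRect(dst);
    }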

The gyroscope measurements are not 100% accurate, since they are affected by drift error. The hope is that this drift error will be too small to heavily affect rotation changes between two individual images. Rotation data errors should not be able to break the ROI assumption when the camera is still.

4.4.6 Internal calibration of the camera

In order to find the camera matrix as described in 3.1.1, internal camera calibration was performed based on the OpenCV camera calibration example program [3]. A chessboard pattern was printed and photographed from different angles, and then fed to the calibration engine, which wrote the camera matrix to a file.
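The essence of that example program is sketched below; the board dimensions and square size are assumed values, since the report does not state which pattern was printed:

    #include <opencv2/calib3d.hpp>
    #include <opencv2/imgproc.hpp>
    #include <vector>

    // Chessboard-based calibration: collect corner points from each
    // photographed view, then solve for the camera matrix.
    bool calibrate(const std::vector<cv::Mat>& views, cv::Mat& cameraMatrix,
                   cv::Mat& distCoeffs)
    {
        const cv::Size boardSize(9, 6);   // inner corners (assumed pattern)
        const float squareSize = 0.025f;  // 25 mm squares (assumed)

        // 3D positions of the board corners in the board's own plane.
        std::vector<cv::Point3f> board;
        for (int y = 0; y < boardSize.height; ++y)
            for (int x = 0; x < boardSize.width; ++x)
                board.emplace_back(x * squareSize, y * squareSize, 0.0f);

        std::vector<std::vector<cv::Point2f>> imagePoints;
        std::vector<std::vector<cv::Point3f>> objectPoints;
        for (const cv::Mat& view : views) {
            std::vector<cv::Point2f> corners;
            if (cv::findChessboardCorners(view, boardSize, corners)) {
                imagePoints.push_back(corners);
                objectPoints.push_back(board);
            }
        }
        if (imagePoints.empty()) return false;

        std::vector<cv::Mat> rvecs, tvecs;  // per-view poses (unused here)
        cv::calibrateCamera(objectPoints, imagePoints, views[0].size(),
                            cameraMatrix, distCoeffs, rvecs, tvecs);
        return true;
    }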

4.5 Image sequences

Evaluation data will consist of sequences of images taken by a camera, along with the rotation state of the camera as read by the gyroscope. In every image, there will be a stationary reference object which is to be detected. Each image sequence will be recorded while performing a unique type of movement with the camera. The movement could represent careful translation of the camera with very mild rotation changes, or more rapid movement with great rotation changes. Sequences could also contain only rotation of a certain type (only around the X-, Y- or Z-axis), together with almost no translation at all.

The different sequences are chosen in order to expose the motion tracking engine to very different circumstances. They are described as follows:

Basic movement A set of images where the reference object is being observed while walking around it with moderate speed. The camera suffers from mild rotation changes.

High rotation movement A set of images where the reference object is being observed while walking around it with moderate speed. The camera is subject to quite aggressive rotation changes.

Zoom only movement A set of images where the reference object is kept in the middle of the image, while physically zooming in and out by moving closer to and further away from it. The camera suffers from little to no rotation changes.

Elevation only movement A set of images where the reference object is being filmed while elevating the camera by moving it closer to and further away from the ground (altering only its virtual Y-axis position). The camera suffers from mild rotation changes.

Yaw rotation A set of images where the camera remains stationary, while suffering from heavy rotational changes in the yaw orientation (rotation around the camera's virtual Z-axis).

Pitch rotation A set of images where the camera remains stationary, while suffering from heavy rotational changes in the pitch orientation (rotation around the camera's virtual X-axis).

Roll rotation A set of images where the camera remains stationary, while suffering from heavy rotational changes in the roll orientation (rotation around the camera's virtual Y-axis).

All images are 960x540 pixels and 3-channel BGR-formatted. All sequences consist of 60 images. A ROI created from a previously known object location is equal to the minimal bounding rectangle of the known object, resized by adding 200 px in both width and height.

4.6 Gyroscope benchmark program

Ultimately, the goal of the thesis is to create a fluent and impressive user experience, while battling expensive image processing. Image processing accounts for most of the computational complexity in these types of applications, and will therefore also be the primary target of evaluation. This evaluation will benchmark the performance of the tracking method, which can be seen in the overview (figure 4.1).

In the benchmarking program, every previously mentioned image sequence will be subject to a series of measurements, comparing the efficiency of image processing methods with and without the gyroscope data. The gyroscope data is used only when a tracking attempt is performed, based on a ROI computed from the last known location of the object.

These measurements consist of the following tests:

Motion intensity test
For each image in the sequence, try to track the object. If the tracking failed, try to detect it instead. If an object was found, solve the camera pose problem (see 3.2.3) to obtain the pose difference between images. Use the calculated pose to give average measurements on camera translation between images, and gyroscope data to give average measurements on rotation between images.

Success rate test
For each image in the sequence, try to track the object. If the tracking failed, try to detect it instead. Count the total number of successful tracking attempts, fallback detection attempts and failed attempts.


Region of interest margin test
Based on the latest known location of the object to track, create a region of interest where the object is likely to be found. Afterwards, track/detect the object based on the region of interest. If the new location of the object was found, measure the value of the smallest margin between any of the object points and the region of interest. Negative margins are accepted (the point in question being outside the region). For example, if the four points of the object have margins 50, 40, 200 and -30 pixels (3 inside, 1 outside) compared to the ROI, the -30 is stored. The test then presents the average of the minimal margins.

Region of interest coverage test
Based on the latest known location of the object to track, create a region of interest where the object is likely to be found. Afterwards, track/detect the object based on the region of interest. If the new location of the object was found, find the correct region of interest of the object. Then calculate how much of the correct region is covered by the tracking assumption. This measurement is equal to the area of the intersection of the two ROIs, divided by the area of the correct ROI (a code sketch of this ratio follows the list of tests):

ratio = area(R_correct ∩ R_guess) / area(R_correct)

CPU time measurement test
For each image in the sequence, try to track the object, or detect it if the tracking failed. The total CPU time spent either tracking or detecting until the object has been found is recorded. The total time spent for the sequence is presented. The average execution time for one tracking or detection attempt is also printed, as a reference value on how much more efficient the tracking is compared to detection. This is a measurement of the total speed-up achieved by including gyroscope measurements in this type of image processing.
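Returning to the coverage ratio above: with OpenCV's cv::Rect, whose & operator yields the intersection rectangle of two regions, it reduces to a few lines. A minimal sketch:

    #include <opencv2/core.hpp>

    // Coverage ratio as defined in the coverage test: the fraction of the
    // correct ROI's area covered by the guessed (tracking) ROI.
    double roiCoverage(const cv::Rect& correct, const cv::Rect& guess)
    {
        const cv::Rect overlap = correct & guess;  // intersection rectangle
        return static_cast<double>(overlap.area()) / correct.area();
    }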

Each type of measurement performs the same test twice, with and without gyroscope data when doing tracking attempts. This is how the gyroscope efficiency will be evaluated. The motion tracking, especially in the non-gyroscope case, is highly dependent upon the size of the region of interest. Therefore, the different tests will be performed on regions with different margins, measuring the impact of these margin sizes in terms of accuracy and execution speed.

4.7 Optimal region of interest

Assuming the gyroscope method works in practice, it could be worth examining the choice of ROI size, instead of just using a fixed pixel growth. Selecting a bigger ROI will naturally cause a higher chance of performing successful tracking attempts. However, larger ROIs take longer to process, and if the ROI is too big, it might ruin the assumption of the ROI scene containing a clear background/foreground case, thus giving bad results from Otsu's thresholding method (see 3.3.2). A test will therefore be performed, measuring the total execution time of an image sequence as a function of ROI size. The ROI size will be measured as a percentage of growth in width and height, compared to the minimal bounds of the latest known object position.

As an example, if a known location of the object can be contained in a rectangle with width 100 px and height 80 px, and we use a 10% growth for tracking, the ROI used for tracking will be 110 px wide and 88 px high.

The aim is to find some sort of relationship between ROI growth and the total execution time spent on a sequence. Hopefully, this relationship will be similar across the sequences. Tracking in this test will only be done with gyroscope data applied.

4.8 Hardware

4.8.1 Smartphone

The application will be built upon an Apple iPhone 5S, which runs on the Apple A7 processing chip. It is equipped with a video camera capable of 30 fps 1080p HD recording, as well as a three-axis gyroscope [7].

The gyroscope measures rotation around three internal axes. The X-axis runs horizontally across the screen surface, with the positive direction pointing right; counterclockwise rotation around this axis is called pitch. Running north across the screen surface is the Y-axis, with its rotation called roll. Finally, running through the screen, positively towards the user observing it, is the Z-axis, with its rotation called yaw.

4.8.2 Evaluation tests

The test sequences were recorded, along with gyroscope data, by the same Apple iPhone as mentioned in 4.8.1. The actual tests on these sequences were run on a MacBook Pro Retina 15" from mid 2014 running OS X Yosemite. It has a 2.8 GHz Intel Core i7 processor, 16 GB of 1600 MHz DDR3 internal memory and an NVIDIA GeForce GT 750M 2048 MB graphics card.


Figure 4.3: Image displaying the three axes which describe the complete rotation of the iPhone. Taken from [2].


Chapter 5

Results

5.1 Gyroscope benchmarking output

5.1.1 Motion intensity test

Average translation is the measurement of absolute distance traveled between every frame of the sequence (meters/frame). Average angular intensities are the measurements of absolute angular rotation around every axis between every frame of the sequence (°/frame).

Table 5.1: Motion intensity test results

Sequence        Average Translation  Average yaw  Average pitch  Average roll
Basic           0.183048             3.56152      2.13641        2.49864
High rotation   0.338489             6.29521      10.7688        4.14102
Zoom            0.28821              1.32546      6.15959        2.60276
Elevation       0.550592             1.04715      3.07357        1.44375
Yaw             0.330875             16.9989      2.28565        2.01913
Pitch           0.134763             0.416974     12.2647        0.607089
Roll            0.366743             1.19815      1.3348         12.2657


5.1.2 Success rate test

Each measurement is the number of frames (out of 60 total) that resulted in a successful tracking attempt, a successful detection attempt (if the tracking failed), or a failed attempt.

Table 5.2: Success rate test results

Sequence        Track  Track(gyro)  Detect  Detect(gyro)  Failed  Failed(gyro)
Basic           55     59           4       1             1       0
High rotation   8      38           24      13            28      8
Zoom            34     38           17      13            9       9
Elevation       27     37           22      14            11      9
Yaw             49     58           9       1             2       1
Pitch           15     59           15      1             30      0
Roll            13     59           18      1             29      0

5.1.3 Region of interest margin test

All measurements are the average distance in pixels between the region of interest and the point which is "least inside" the ROI. Negative measurements mean the point is outside the ROI.

Table 5.3: ROI margin test results

Sequence        Average Margin  Average Margin (gyro)
Basic           48.727          59.782
High rotation   -148.074        51.037
Zoom            26.721          31.000
Elevation       -2.512          12.805
Yaw             39.125          112.857
Pitch           -234.333        83.429
Roll            -103.348        69.304

5.1.4 ROI coverage test

All measurements are the average fraction of the correct object ROI that was covered by the ROI calculated using the previously known object location, given as a ratio between 0 and 1 (1.0 meaning full coverage).


Table 5.4: ROI coverage test results

Sequence        Average Coverage  Average Coverage (gyro)
Basic           0.844             0.909
High rotation   0.424             0.862
Zoom            0.774             0.811
Elevation       0.720             0.780
Yaw             0.819             0.999
Pitch           0.460             0.953
Roll            0.549             0.949

5.1.5 CPU time measurement test

All measurements are the CPU time spent on finding the object, either by tracking or by detecting. Both the total time and the average for each method are given. Time is measured in milliseconds.

Total Total CPU time spent

Avg. T. Average CPU time to perform one tracking attempt

Avg. D. Average CPU time to perform one detection attempt

Table 5.5: CPU time test results

Sequence        Total    Total(gyro)  Avg. T.  Avg. T.(gyro)  Avg. D.  Avg. D.(gyro)
Basic           76.084   63.830       0.913    0.999          4.439    4.516
High rotation   247.437  96.143       0.653    0.929          3.994    4.029
Zoom            177.884  162.331      0.910    0.958          4.471    4.432
Elevation       200.568  156.689      0.785    0.835          4.663    4.614
Yaw             96.216   77.586       0.825    1.152          4.376    4.357
Pitch           210.859  56.857       0.638    0.879          3.912    4.033
Roll            224.857  48.088       0.512    0.733          4.133    3.821


5.2 Optimal region of interest output

The optimal ROI test was run on all sequences, measuring ROI width and height growth figures between 0% and 200%. The results for all the executions are displayed in the figure below.

Figure 5.1: A plot displaying how the total CPU time spent on a sequence varies with the chosen ROI growth amount.

The sequences that contain almost only rotational movement all seem to achieve optimal execution performance when the ROI grows by around 40-80%. The same applies to the basic sequence, which suffers from very mild movement. Sequences with more intensive movement, such as zoom, elevation and high rotation, all seem to benefit from ROIs that grow up to about 150% of their original width and height.


5.3 The finished application

As previously mentioned, the ultimate goal of all these image processing optimisations was to create a smooth user experience when visualising parcels in AR. The application was built using the image processing engine, and visualises a parcel when a user targets the A4 paper using the camera. A settings view can be opened where the user can change a few things about the execution, including:

Parcel dimensions Three slider bars let the user change the size of the parcel between 0 and 2 meters

Toggle flashlight A switch can be toggled to light the flashlight/torch of the iPhone

Toggle FPS label A switch can be toggled to display a label with the rate at which the pose engine processes images, measured in frames per second

Set processing method A three-state button group can be changed, setting one of three methods which the pose engine uses to find the object: pure detection, detection + tracking, or detection + gyro tracking

When running the app, the camera feeds the engine with 1280x720 images at a rate of 60 Hz. Using gyro tracking and performing less intensive movement, the engine manages to process images at a frequency of 50-60 fps when observing the A4 paper from about 1.5 meters distance. Pure detection can only perform at about 12 fps under the same circumstances. When tracking without gyro data, the app performs unevenly, often dropping down to about 30 fps when the camera suffers from non-trivial rotation changes.

The rendered parcel seems to be of very accurate size compared to the real world. When rendered beside a real-world parcel of a certain size, there is barely any difference in size visible to the naked eye.

If the camera is placed such that the longer sides of the A4 paper appear shorter than the actual short sides, due to the very extreme perspective, it becomes very difficult to identify the correct order of the corners. Thus, estimating the pose from such an angle also becomes difficult. This occurs, for example, when the camera is held close to the same surface as the A4 paper is placed on, viewing it from the side.


Figure 5.2: A few screenshots taken when using the iPhone application: (a) a 33x35x41 (cm) box; (b) the same box from another angle; (c) a 62x44x31 (cm) box; (d) the settings view.


Chapter 6

Discussion

6.1 Observations from the gyroscope benchmarking

6.1.1 Sequence motion

Figure 6.1: Tracking attempt suffering from motion blur. The object location (inner red bounds) is clearly inaccurate.

The measurements of the different sequences were mostly consistent with expectations, regarding the motion intensity test. In the basic, high rotation, zoom and elevation sequences, the camera suffers from translation of between around 0.2 and 0.6 meters between frames, which seems reliable when looking directly at the sequence. The yaw, pitch, and roll intensive sequences seem to have relatively high translation measurements, despite being recorded from a fixed position. This is likely the error of motion blur. Blurriness in the image may cause the found points of the detected objects to be inaccurate. That in turn will result in poor pose estimation, since the inaccurate point measurements will create an illusion of movement.

6.1.2 Success rate

Motion tracking based on a "sliding" region of interest is a method that can produce very good and accurate results when using a still camera. When the camera is moving, especially rotating, the hypothesis was that this method would struggle. The evaluation tests seem to support this hypothesis. Test sequences with little to no significant rotation, such as basic, zoom and even yaw, resulted in successful tracking attempts in the majority of frames. The object was successfully tracked 55 out of 59 times (omitting the first frame, where detection has to be done) in the basic sequence. The elevation sequence was successfully tracked 27 out of 59 times without the gyroscope, which can also be considered a moderately good result.

Once some more intensive rotation changes were applied, the very simple ROI tracking method started to fail. In the high rotation sequence, only 8 frames resulted in successful tracking, and a total of 28 frames were dropped as totally failed attempts at finding the object. Taking gyroscope data into account made a huge impact: in the same sequence, the number of successful tracking attempts went from 8 to 38, and failed attempts from 28 to 8. It is a bit curious why the gyro tracking resulted in such a drop in failed frames in this case, since the detection method is supposed to pick up frames where tracking fails. Here, the most significant difference between tracking and detecting the object is the binarisation method. Detection uses Canny's edge detection algorithm, while tracking uses Otsu's thresholding method on only a ROI. Canny's algorithm is much more vulnerable to motion-blurred images, as the blur "unsharpens" the edges. Otsu's method, on the other hand, does not have the same problem separating the paper from the background, despite the motion blur. Therefore, tracking the object can succeed in cases that the detection would not necessarily be able to handle.

The "rotation only" sequences pitch and roll received massive improvements from gyro tracking. They went from 15/15/30 and 13/18/29 (tracked/detected/failed) respectively to 59/1/0 (a perfect score). In these cases, the same reasoning applies for how tracking can perform better than detection on motion-blurred images. In the yaw sequence, the rotation of the camera did not invalidate the ROI in the same way as in the pitch and roll sequences, since the object remains in the middle of the image. Even in this case, the yaw rotation could be of use when determining the expected size of the ROI.

In total, applying gyroscope data resulted in more successful tracking attempts and fewer total failed attempts in every single sequence. This also seems to indicate that the drift error from gyroscope measurements is not enough to sabotage the tracking in sequences with little to no rotation changes, such as basic or zoom.

6.1.3 Region margins and coverage

In the margin test case, one can easily note that a negative result is equivalent to the detected object not being completely covered by the tracking ROI on average. Not very surprisingly at this point, these negative results appear drastically in the very rotation-heavy sequences high rotation, pitch and roll when not using the gyroscope. All of these sequences had better margin results when gyroscope data was applied. Some of the less rotation-heavy sequences had a positive average margin even in the non-gyroscope case, getting only very small improvements from the gyroscope.

It is worth mentioning that only a few very bad frames, where the margin falls to very large negative values, could drag down the average quite heavily. In the average case, the highest possible positive margin is of much lower magnitude than the highest possible negative margin. Therefore, a score closer to zero can be the result of many successful attempts together with a few very unsuccessful ones. This could be the case in the yaw sequence, scoring 39.125 on the average margin without gyro, with 49 successful tracking attempts.

The region of interest coverage test also shows great improvements when applying gyroscope data to the tracking attempts. The yaw sequence resulted in a staggering 0.999 (99.9%) coverage in the gyroscope case. This could be the result of yaw rotation data creating unusually large ROIs, which almost always cover the correct one. That could in turn affect the computation speed negatively.

6.1.4 CPU time

The CPU time test is in many ways the ultimate evaluation of this project, since the attempted optimisation method is meant to increase the performance (and the image processing frequency) of the AR application. The same type of correlation can be seen in these tests as in all the tests above: increased rotation intensity in the sequence produces a greater performance gap between measurements when tracking with and without the gyroscope.

The average CPU time spent on tracking differs fairly significantly between not using and using the gyroscope. On average, using gyroscope tracking takes a bit more time. This is due to the fact that ROIs translated by the homography often grow in size, since all four corners of the original ROI are transformed individually.

Looking at the average times for performing the tracking and detection methods, one can deduce that detection is about five times more expensive in the typical case. This indicates a massive gap between the frame rate possibilities when tracking successfully and when falling back to detection very frequently. These numbers verify that if the tracking works well, then the overall performance and image processing frequency will benefit greatly from this method.

6.2 Observations from the optimal region of interest test

The graphs in figure 5.1 display some very valuable results of this project. They offer a whole other perspective on how to consider the gyroscope measurement accuracy. One of the most rotation intensive sequences, roll, performs at stunningly high performance when using only about 50% ROI growth. This can be interpreted as the gyroscope data being accurate enough to, when used together with homography ROI transformation, produce an assumption that already covers 1/1.5 = 2/3 of the object. Performing image processing on a ROI that small takes very little time in comparison to processing the entire image.

When the camera suffers from intensive translation as well, such as in the high rotation sequence, the optimum naturally ends up at a higher value. Even in this extreme case, the total CPU time drops to less than half of the highest value (at 0% growth).

If the overall optimal growth percentage is sought, one could use the mean value of all the sequence local minima. However, many of these sequences are meant to represent different types of extreme camera movement, the basic one excepted. The basic sequence represents typical user movement rather well, so using the mean value of all the local minima would produce a growth percentage that might not reflect the typical case at all.

In the basic sequence, the optimal ROI growth is found at around 80%. This can probably be considered the most realistic result from seeking the optimal ROI growth.

6.3 The parcel visualisation application

The resulting application of this project was far above expectations. The hope was that the improved tracking method could enable the camera pose to be re-estimated at closer to 30 times per second, when running a video resolution of 960x540 pixels. Ultimately, the iPhone 5S managed to run at up to 60 times per second, at a 1280x720 resolution (about 78% more pixels). Visualisation runs smoothly and is stable, and due to the high update frequency, it often truly looks as if the visualised parcel is stuck to the A4 paper reference object.

The camera pose estimation works poorly if the A4 paper is placed in very extreme perspectives, but that is to be expected. In order to actually estimate the pose from the 3D model of the A4 paper, each 2D corner point must be paired with the correct 3D model point. If the perspective causes the longer sides of the A4 paper to shrink, becoming shorter than the actual short sides, then the corners of the paper will be incorrectly paired with their 3D correspondences. This is not considered to be a failure.

The application was presented to both the project supervisor at Bontouch and employees from Postnord, from where the original idea of the application sprouted. Both parties were satisfied with the project outcome.


6.4 Future work

The method proposed in this project does not take any translation of the camera into account, which is its greatest weakness when motion tracking. The method in [16] describes how regions of interest can be used when tracking moving objects in a surveillance camera feed, since the camera is in a fixed location. In this project, a method was proposed for tracking a still target with a moving camera. This method, using gyroscope data for tracking, would probably perform very poorly if the object and the camera move at the same time.

Imagine a scenario where the user wants to track a moving object by following it with the camera, keeping the projection of the object in the middle of the image. If the camera suffers from rotation in this case, the ROI translated by rotation will probably become invalid, since the assumption about the object being still no longer holds. In this case, tracking with a ROI but without gyroscope data would probably be a lot more powerful.

If both the object and the camera are moving, and the camera is not necessarily trying to follow the object, then a very difficult case is created. Gyroscope data could probably help to interpret the movement of the camera somewhat, but it would be even more effective if the motion tracking method could "predict" the movement of the object. In this case, one could probably attempt using some kind of linear or extended Kalman filter to predict movement. This is done by Zhaoxia Fu and Yan Han in their paper Centroid weighted Kalman filter for visual object tracking [17]. Combining Kalman filters with motion sensor data could possibly create a powerful object tracking method.

6.5 Conclusion

In this report, an engine for object motion tracking and pose estimation was described and implemented. The purpose of the engine was to find and keep track of a common object, chosen as an A4 paper, using it as a reference when determining the pose of the camera in world coordinates.

Motion tracking based on very precise image processing was implemented efficiently by letting a region of interest "follow" the object being tracked, updating the bounds of this region with every frame. In the average use case, it is quite rare for any user to move the camera with a translation velocity that invalidates the assumption from a region of interest. However, rotation can very easily invalidate any ROI if the camera is tilted only slightly. Applying gyroscope sensor data to predict the location of the object when creating this region seems to have been a very effective improvement. This was done by translating the predicted location, and thus the ROI, by creating a rotational homography from the gyroscope rotation matrices.

This method often made tracking possible on over 90% of the frames in a sequence. In the best case, this causes image processing time to be cut by up to five times, compared to doing exhaustive detection on the entire image. Using the gyroscope data when tracking, the optimal ROI to process was also sought and roughly determined. The combination of searching only in a certain ROI and considering rotation changes creates a robust and reliable performance upgrade, which proved to work on a semi-high-end smartphone (iPhone 5S). The high performance was very notable when run in the parcel visualisation mobile app, which managed to run at a 720p resolution with 60 fps while tracking, falling down to about 12 fps when performing no tracking whatsoever.

Motion tracking with gyroscope input has proven to be a very significant building block when tracking stationary targets, though it may suffer if the goal is to track a moving target by following it with a camera, since the gyroscope rotation may accidentally invalidate the region of interest.


Bibliography

[1] Apple iPhone comparison. https://www.apple.com/iphone/compare/.

[2] Motion Events Part 2: Core Motion. http://www.techrepublic.com/blog/software-engineer/motion-events-part-2-core-motion/.

[3] OpenCV camera calibration example. http://docs.opencv.org/doc/tutorials/calib3d/camera_calibration/camera_calibration.html.

[4] OpenCV tutorial: Real Time pose estimation of a textured object. http://docs.opencv.org/trunk/doc/tutorials/calib3d/real_time_pose/real_time_pose.html.

[5] TIME Magazine: The Collapse of Moore's Law: Physicist Says It's Already Happening. http://techland.time.com/2012/05/01/the-collapse-of-moores-law-physicist-says-its-already-happening/.

[6] Nobuyuki Otsu. A threshold selection method from gray-level histograms. Systems, Man and Cybernetics, IEEE Transactions on, 9(1):62–66, Jan 1979.

[7] Apple. Apple iPhone 5S specifications. http://www.apple.com/iphone-5s/specs/.

[8] Apple. iOS 8 for developers. https://developer.apple.com/ios8/.

[9] Apple. SceneKit Framework reference, Apple developer. https://developer.apple.com/library/ios/documentation/SceneKit/Reference/SceneKit_Framework/index.html.

[10] Jeffrey R. Blum, Daniel G. Greencorn, and Jeremy R. Cooperstock. Smartphone sensor reliability for augmented reality applications. In Kan Zheng, Mo Li, and Hongbo Jiang, editors, Mobile and Ubiquitous Systems: Computing, Networking, and Services, volume 120 of Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, pages 127–138. Springer Berlin Heidelberg, 2013.

[11] Lars Bretzner. Multi-scale feature tracking and motion estimation. PhD thesis, KTH, Numerical Analysis and Computer Science, NADA, 1999. QC 20100519.


[12] John Canny. A computational approach to edge detection. Pattern Analysis and Machine Intelligence, IEEE Transactions on, PAMI-8(6):679–698, Nov 1986.

[13] Pasquale Daponte, Luca De Vito, Francesco Picariello, and Maria Riccio. State of the art and future developments of the augmented reality for measurement applications. Measurement, 57(0):53–70, 2014.

[14] T.M. Drews, P.G. Kry, J.R. Forbes, and C. Verbrugge. Sequential pose estimation using linearized rotation matrices. In Computer and Robot Vision (CRV), 2013 International Conference on, pages 113–120, May 2013.

[15] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. A Wiley-Interscience publication. Wiley, 1973.

[16] I.A. Fernandez, P. Rondao Alface, Tong Gan, R. Lauwereins, and C. De Vleeschouwer. Integrated H.264 region-of-interest detection, tracking and compression for surveillance scenes. In Packet Video Workshop (PV), 2010 18th International, pages 17–24, Dec 2010.

[17] Zhaoxia Fu and Yan Han. Centroid weighted Kalman filter for visual object tracking. Measurement, 45(4):650–655, 2012.

[18] Henri P. Gavin. The Levenberg-Marquardt method for nonlinear least squares curve-fitting problems. Department of Civil and Environmental Engineering, October 2013.

[19] M. Gervautz and D. Schmalstieg. Anywhere interfaces using handheld augmented reality. Computer, 45(7):26–31, July 2012.

[20] R.C. Gonzalez. Digital Image Processing. Pearson Education, 2009.

[21] Google. Google ART Engine for Android. https://source.android.com/devices/tech/dalvik/art.html.

[22] Google. Google NDK for Android. https://developer.android.com/tools/sdk/ndk/index.html.

[23] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[24] J.A. Hesch and S.I. Roumeliotis. A direct least-squares (DLS) method for PnP. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 383–390, Nov 2011.

[25] S. Hrabar, P. Corke, and V. Hilsenstein. PTZ camera pose estimation by tracking a 3D target. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pages 240–247, May 2011.


[26] P. Huber, C. Cagran, and W. Müller. An algorithm to correct for camera vibrations in optical motion tracking systems. Journal of Biomechanics, 44(11):2172–2176, 2011.

[27] itseez. OpenCV Documentation, Camera Calibration. http://docs.opencv.org/trunk/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html, 2014.

[28] itseez. OpenCV Official About Page. http://opencv.org/about.html, 2014.

[29] Yeon-Ho Kim and Soo-Yeong Yi. Articulated body motion tracking using illumination invariant optical flow. International Journal of Control, Automation and Systems, 8(1):73–80, 2010.

[30] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. International Journal of Computer Vision, 81(2), 2009.

[31] Madjid Maidi, Jean-Yves Didier, Fakhreddine Ababsa, and Malik Mallem. A performance study for camera pose estimation using visual marker based tracking. Machine Vision and Applications, 21(3):365–376, 2010.

[32] Tommy Marshall. Moving the museum outside its walls: An augmented reality mobile experience. Master's thesis, KTH, Communication Systems, CoS, 2011.

[33] Sergey Milyaev, Olga Barinova, Tatiana Novikova, Pushmeet Kohli, and Victor Lempitsky. Image binarization for end-to-end text understanding in natural images. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, ICDAR '13, pages 128–132, Washington, DC, USA, 2013. IEEE Computer Society.

[34] Cristóbal Carnero Liñán. cvBlob. http://cvblob.googlecode.com.

[35] Urs Ramer. An iterative procedure for the polygonal approximation of plane curves. Computer Graphics and Image Processing, 1(3):244–256, 1972.

[36] G. Schweighofer and A. Pinz. Robust pose estimation from a planar target. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(12):2024–2030, Dec 2006.

[37] Xiao-Shan Gao, Xiao-Rong Hou, Jianliang Tang, and Hang-Fei Cheng. Complete solution classification for the perspective-three-point problem. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 25(8):930–943, Aug 2003.

[38] Pradip K. Sinha. Image Acquisition and Preprocessing for Machine Vision Systems. SPIE, 2012.


[39] Satoshi Suzuki and Keiichi Abe. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.

[40] R.Y. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. Robotics and Automation, IEEE Journal of, 3(4):323–344, August 1987.

[41] Daniel Wagner, Thomas Pintaric, Florian Ledermann, and Dieter Schmalstieg. Towards massively multi-user augmented reality on handheld devices. In Hans-W. Gellersen, Roy Want, and Albrecht Schmidt, editors, Pervasive Computing, volume 3468 of Lecture Notes in Computer Science, pages 208–219. Springer Berlin Heidelberg, 2005.

[42] Daniel Wagner and D. Schmalstieg. First steps towards handheld augmented reality. In Wearable Computers, 2003. Proceedings. Seventh IEEE International Symposium on, pages 127–135, Oct 2003.

[43] Oliver Woodman and Robert Harle. Pedestrian localisation for indoor environments. In Proceedings of the 10th International Conference on Ubiquitous Computing, UbiComp '08, pages 114–123, New York, NY, USA, 2008. ACM.

[44] Chunrong Yuan. Markerless pose tracking for augmented reality. In George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Paolo Remagnino, Ara Nefian, Gopi Meenakshisundaram, Valerio Pascucci, Jiri Zara, Jose Molineros, Holger Theisel, and Tom Malzbender, editors, Advances in Visual Computing, volume 4291 of Lecture Notes in Computer Science, pages 721–730. Springer Berlin Heidelberg, 2006.

[45] Zhengyou Zhang. A flexible new technique for camera calibration. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 22(11):1330–1334, Nov 2000.


www.kth.se