
Autonomous Vision-based Micro Air Vehicle for Indoor and Outdoor Navigation


Korbinian Schmid, Perception and Cognition, Robotics and Mechatronics Center (RMC), German Aerospace Center (DLR), Oberpfaffenhofen, Germany; e-mail: [email protected]
Philipp Lutz, Autonomy and Teleoperation, Robotics and Mechatronics Center (RMC), German Aerospace Center (DLR), Oberpfaffenhofen, Germany; e-mail: [email protected]
Teodor Tomić, Analysis and Control of Advanced Robotic Systems, Robotics and Mechatronics Center (RMC), German Aerospace Center (DLR), Oberpfaffenhofen, Germany; e-mail: [email protected]
Elmar Mair, Perception and Cognition, Robotics and Mechatronics Center (RMC), German Aerospace Center (DLR), Oberpfaffenhofen, Germany; e-mail: [email protected]
Heiko Hirschmüller, Perception and Cognition, Robotics and Mechatronics Center (RMC), German Aerospace Center (DLR), Oberpfaffenhofen, Germany; e-mail: [email protected]

Received 6 July 2013; accepted 5 December 2013

Micro air vehicles have become very popular in recent years. Autonomous navigation of such systems plays an important role in many industrial applications as well as in search-and-rescue scenarios. We present a quadrotor that performs autonomous navigation in complex indoor and outdoor environments. An operator selects target positions in the onboard map, and the system autonomously plans an obstacle-free path and flies to these locations. An onboard stereo camera and inertial measurement unit are the only sensors. The system is independent of external navigation aids such as GPS. No assumptions are made about the structure of the unknown environment. All navigation tasks are implemented onboard the system. A wireless connection is only used for sending images and a three-dimensional (3D) map to the operator and to receive target locations. We discuss the hardware and software setup of the system in detail. Highlights of the implementation are the field-programmable-gate-array-based dense stereo matching of 0.5 Mpixel images at a rate of 14.6 Hz using semiglobal matching, locally drift-free visual odometry with key frames, and sensor data fusion with compensation of measurement delays of 220 ms. We show the robustness of the approach in simulations and experiments with ground truth. We present the results of a complex, autonomous indoor/outdoor flight and the exploration of a coal mine with obstacle avoidance and 3D mapping. © 2014 Wiley Periodicals, Inc.

1. INTRODUCTION

In recent years, micro air vehicles (MAVs) have found their way from research laboratories into many civil applications. With regard to civil markets, there is a wide range of available MAVs. Many low-budget platforms for the consumer market already exist, and many more applications of MAVs for industrial purposes can be expected in the coming decades.

MAVs can also be useful tools in disaster management and search-and-rescue (SAR) scenarios. After the Fukushima Daiichi nuclear disaster in March 2011, MAVs were used to explore the site. Fire fighters employ MAVs to get an overview of fire locations. In mountain rescue, the time to find missing persons or casualties in dangerous areas such as avalanche regions could be reduced by using flying platforms. Additionally, MAVs could serve as mobile radio relays in regions with missing or destroyed communication infrastructure. In terms of robots for SAR missions, MAVs can greatly enhance the abilities of ground robots. In such missions, ground robots could realize energy-efficient transportation and ground manipulation, while MAVs mainly cover the task of exploration.

Most MAVs are directly controlled by pilots. However, manual control requires highly trained personnel and reduces the area of application. A pilot using direct line of sight cannot fly inside houses or behind obstacles. Flying via first-person view (FPV) requires a stable, low-latency, high-bandwidth wireless video link, which is difficult to guarantee, especially inside buildings. To overcome these limitations, some (autonomous) functionality has to be realized onboard the system. This includes onboard control, environment perception, collision avoidance, environment modeling, and path planning. With this functionality, the MAV is able to navigate autonomously even in unknown, cluttered, GPS-denied environments. The operator can inspect regions of interest and interact with the system via a low-bandwidth radio link, without hard constraints on the operator reaction time or the radio link latency. If the radio link fails, the system can autonomously return to a recovery area or its starting point.

Nevertheless, achieving this level of autonomy is a challenging task. Most vertical take-off and landing (VTOL) MAVs are inherently unstable and highly dynamic. Even though orientation can be stabilized by proprioceptive sensors such as inertial measurement units (IMUs), information from exteroceptive sensors such as cameras is needed to stabilize position. This information is also essential for all other navigation tasks. In contrast to ground robots, flying robots have to navigate three dimensionally using onboard sensor information only. Processing high amounts of exteroceptive data using limited computational resources introduces measurement delays. Another challenge is the compensation of these delays if used for control. Additionally, the system has to be robust against failures of the exteroceptive sensors, for example in the case of missing environment features. For flying close to humans (i.e., in SAR scenarios) or inside buildings, MAVs have to be safe for their environment. Freely spinning rotors should be protected and small to limit their inertia. Such constraints result in a significant payload limitation and, hence, a strong restriction of computational resources. It is obvious that the demands of the mentioned applications are highly conflicting with the available system resources.

Considering these limitations and application requirements, suitable exteroceptive sensors have to be found. Depth sensors provide ideal input for obstacle avoidance, environmental modeling, as well as egomotion estimation. While laser scanners or time-of-flight (TOF) cameras provide accurate depth measurements, their weight is rather high. The Kinect has become very popular for robot navigation, but it depends on recognizing a pseudorandom infrared pattern that is projected onto the scene. This technique does not work outdoors in sunlight. In contrast, passive stereo vision works outdoors as well as indoors under a large range of lighting conditions. However, sufficient processing power must be available for stereo matching.

Even though there is a wide range of commercially available MAVs, for research purposes the system equipment usually has to be adapted carefully to fulfill the project's needs. The respective concept should therefore consider the required modularity and flexibility. Designing such a platform, as illustrated in Figure 1, is a time-consuming process with many pitfalls. We address this challenge in the first part of this article. We give an overview of existing systems and their algorithmic concepts in Section 2. We introduce our carefully designed MAV system concept under aspects of common hardware and software issues in Section 3. We describe the design of a modular perception and navigation device considering typical constraints and characteristics of MAVs, such as weight, power consumption, high system dynamics, and vibrations. This tutorial can be helpful to everyone setting up and equipping an autonomous MAV.

Figure 1. Experimental quadrotor platform for autonomous, vision-based indoor/outdoor exploration.


In the second part of the article, in Section 4, we introduce the algorithms running on our quadrotor for autonomous flights in unknown, unstructured indoor/outdoor environments. We focus in particular on stereo-vision-based inertial navigation in resource-limited systems. Additionally, we briefly introduce the algorithms used for control, mapping, and dynamic path planning. The system is extensively tested in the simulations and experiments presented in Section 5. The influence of frame rate and time delays on the navigation solution is analyzed in Monte Carlo simulations. The robustness of the navigation system is demonstrated in real system experiments with forced vision dropouts. We show the interaction of all components in an autonomous indoor/outdoor exploration flight and in a possible SAR scenario in a coal mine. We discuss the results and highlight some lessons learned during development and experiments in Section 6. Finally, we conclude the article in Section 7.

2. RELATED WORK

There is an increasing body of literature regarding MAVs with different autonomous navigation solutions. In the following, we focus on work that is highly similar to ours. We will also briefly discuss the related work on the topics tackled in the following sections, namely sensor registration, depth image processing, visual odometry, and system state estimation.

2.1. Autonomous MAVs

In recent years, autonomous MAV navigation in GPS-denied environments has shown great progress. Bachrach et al. presented onboard pose estimation of MAVs using a laser scanner (Bachrach et al., 2010). The accuracy was increased offboard by a pose-graph optimization simultaneous localization and mapping (SLAM) algorithm. They demonstrated autonomous exploration and a flight through a window, while planning was realized offboard. Shen et al. demonstrated laser-based multifloor navigation and mapping, while all algorithms including SLAM with loop closure were running onboard the MAV (Shen, Michael, & Kumar, 2011). Based on this work, they presented an autonomous MAV operation in a large multifloor indoor environment with an additional RGB-D sensor for mapping (Shen, Michael, & Kumar, 2012). However, vertical wall constraints in the egomotion calculation were used, which limits the operational area.

Huang et al. used visual odometry estimates from an RGB-D sensor (i.e., Kinect) with IMU measurements for local onboard navigation (Huang et al., 2011). Global navigation and loop closure with SLAM was realized offboard. Due to the use of the Kinect sensor, the system cannot operate in outdoor environments with direct sunlight.

Heng et al. realized onboard mapping and path planning based on stereo cameras (Heng, Meier, Tanskanen, Fraundorfer, & Pollefeys, 2011). The autonomy of the system is limited by the usage of artificial markers or an external motion-tracking system that is required for pose estimation.

Our system is most closely related to the work of Fraundorfer et al. (2012). A forward-looking stereo camera is used as the main sensor for building a global three-dimensional (3D) occupancy map with 0.1 m resolution onboard the MAV. A 2D slice of the 3D octomap is evaluated for onboard planning and obstacle avoidance as well as frontier-based exploration and wall following. In contrast to our system, the MAV is limited to environments with flat ground. This is due to the optical flow sensor, which drives the system state estimation.

2.2. Sensor Registration

Typically, MAVs use several sensors, such as cameras and IMUs, that complement each other regarding error characteristics and dropouts. For getting robust and reliable measurements by fusion, the spatial alignment between the sensors has to be known. Microelectromechanical system (MEMS)-based IMUs provide bias-prone and noisy measurements of the rotational velocity and translational acceleration, which leads to rather poor motion estimation after integration. To avoid integration, one requires complex, error-prone setups (Lobo & Dias, 2007) or needs to estimate the motion of the system in an optimization framework that combines both sensor measurements.

Typical optimization approaches use Kalman filters such as the extended Kalman filter (EKF) (Mirzaei & Roumeliotis, 2007a) or the unscented Kalman filter (UKF) (Kelly & Sukhatme, 2009). Another possibility is the use of nonlinear batch optimization (Mair, Fleps, Ruepp, Suppa, & Burschka, 2011a; Mirzaei & Roumeliotis, 2007b). The maximum likelihood estimation (MLE) of Kalman filters is highly computationally efficient and often used for real-time pose estimation (Section 4.4). The drawback is that its mathematical model requires piecewise linear motion and white Gaussian noise. Nonlinear least-squares batch optimization has, in general, significantly higher computational costs. However, it only requires the noise to be zero mean, but neither independent nor identically distributed (i.i.d.), in order to find the best solution in terms of the lowest variance of the estimate. Such a framework also allows motion models of arbitrary dynamics. Some methods couple both techniques by estimating the trajectory within a Kalman filter and computing the spatial alignment in an outer, nonlinear optimization loop (Hol, Schon, & Gustafsson, 2010). These so-called gray-box approaches try to combine the efficiency of Kalman filters with the accuracy of nonlinear optimization. However, in that way the nonlinear optimization can only use a computationally expensive finite-difference approximation for gradient computation while maintaining the constraints implied by the Kalman filter on the system's motion and noise.

Our system consists of rigidly mounted sensors, which do not require an online estimation of the spatial alignment. In our opinion, sensor registration should be performed as an offline step where applicable. It increases the alignment accuracy and reduces the state space of the navigation filter, which leads to lower computational complexity and fewer modeling errors. Hence, in Section 3.3 we describe an approach for computing the spatial alignment by nonlinear batch optimization of the measurements.

2.3. Depth Image Processing

Dense depth images are an important basis for detecting obstacles, environment modeling, and egomotion estimation. We focus on passive stereo cameras since they are lightweight and work in indoor and outdoor environments under a wide range of lighting conditions.

The drawback of stereo matching is the required processing power, which is limited on mobile and especially on flying systems. Correlation-based stereo methods have been an obvious choice for flying systems (Goldberg & Matthies, 2011; Honegger, Greisen, Meier, Tanskanen, & Pollefeys, 2012) due to their simple design and rather low processing requirements. However, correlation methods are known to blur object boundaries and are unable to detect small or thin structures (Hirschmuller, Innocent, & Garibaldi, 2002b). Global methods that optimize a cost function offer much higher spatial resolution since they can perform pixel-wise matching (Scharstein & Szeliski, 2002). Methods that are based on this principle are known as dynamic programming, graph cuts, and belief propagation. However, these methods require much higher processing resources. More recently, local methods that are based on adaptive weighting (Yoon & Kweon, 2006) and slanted support windows (Bleyer, Rhemann, & Rother, 2011) proved to be competitive with global methods, but again at an increased runtime.

Semiglobal matching (SGM) (Hirschmuller, 2008) is based on the principle of pixel-wise matching, supported by a global cost function. The quality is comparable to other global methods. However, the algorithm uses simple operations and is regular and easy to parallelize. This makes real-time implementations on the graphics processing unit (GPU) (Banz, Blume, & Pirsch, 2011; Ernst & Hirschmuller, 2008) and the field-programmable gate array (FPGA) (Banz, Hesselbarth, Flatt, Blume, & Pirsch, 2010; Gehrig, Eberli, & Meyer, 2009) possible. FPGA implementations are particularly suitable for mobile robotics due to their low weight and energy consumption in comparison to central processing units (CPUs) and GPUs.
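The core of SGM is a pixel-wise matching cost that is aggregated along several one-dimensional paths with two smoothness penalties: P1 for disparity changes of one pixel and P2 for larger jumps. The following sketch (our own illustration, not the FPGA implementation used on the MAV; function and parameter names are hypothetical) shows this aggregation for a single left-to-right path over a precomputed cost volume. The full algorithm sums the aggregated costs of 8 or 16 such paths and selects the disparity with the minimal total cost per pixel.

    import numpy as np

    def sgm_aggregate_left_to_right(cost, P1=10.0, P2=120.0):
        """Aggregate matching costs along one path direction (left to right).

        cost: (H, W, D) pixel-wise matching cost volume.
        Returns the aggregated cost L_r for this single path direction.
        """
        H, W, D = cost.shape
        L = np.empty(cost.shape, dtype=np.float64)
        L[:, 0, :] = cost[:, 0, :]
        for x in range(1, W):
            prev = L[:, x - 1, :]                        # L_r at the previous pixel, (H, D)
            prev_min = prev.min(axis=1, keepdims=True)   # min_k L_r(p - r, k)
            same = prev                                  # keep the same disparity
            minus = np.full_like(prev, np.inf)
            minus[:, 1:] = prev[:, :-1] + P1             # disparity d - 1, penalty P1
            plus = np.full_like(prev, np.inf)
            plus[:, :-1] = prev[:, 1:] + P1              # disparity d + 1, penalty P1
            jump = prev_min + P2                         # any larger disparity jump, penalty P2
            L[:, x, :] = (cost[:, x, :]
                          + np.minimum(np.minimum(same, minus), np.minimum(plus, jump))
                          - prev_min)
        return L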

In the Middlebury benchmark (http://vision.middlebury.edu/stereo), SGM shows an average performance. However, the strength of SGM is its robustness in practice without parameter tuning, which can be achieved in real time. Therefore, SGM and its adaptations belong to the top performing methods in the KITTI benchmark (http://www.cvlibs.net/datasets/kitti). Further tests confirm the advantages of SGM in real-world applications (Steingrube, Gehrig, & Franke, 2009). Daimler is using SGM as part of its 6D Vision system (http://www.6d-vision.com) for driver assistance, which has been commercially available in cars since the summer of 2013. In our system, we are using the same FPGA implementation of SGM.

2.4. Visual Odometry

For flying robots, it is very important to know their current pose accurately and at any time. For our application, we cannot use any external tracking system or infrastructure. GPS is also not always available for combined indoor/outdoor flights. Therefore, the system has to determine its pose and movements through its own sensors, which are the cameras.

There are many possibilities for computing the pose through visual odometry. Monocular camera-based approaches (Davison, 2003) can only determine the motion with five degrees of freedom (DOF) in an unknown environment. However, the missing scale information is very important for the control of flying systems.

Modern approaches are often based on matching dense depth images (Newcombe, Lovegrove, & Davison, 2011b; Newcombe et al., 2011a; Sturm, Burgard, & Cremers, 2012), but they require high processing power and are often implemented on the GPU. Furthermore, these methods depend on small movements between successive frames, which makes a high frame rate necessary for highly dynamic systems. Our approach is based on that of Hirschmuller, Innocent, & Garibaldi (2002a) and Hirschmuller (2003), which utilizes the available depth images from stereo matching but works only on features to save processing time and to allow large movements or rather low frame rates. We explain the algorithm in more detail in Section 4.3.
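A feature-based visual odometry of this kind ultimately estimates the rigid-body motion between two sets of corresponding 3D feature points reconstructed from the stereo depth images. The following sketch (an illustration of that building block under our own assumptions, not the authors' implementation) computes the least-squares rotation and translation between matched point sets with the standard SVD-based method; in practice it would be wrapped in feature matching and an outlier rejection scheme such as RANSAC.

    import numpy as np

    def rigid_transform_3d(P, Q):
        """Least-squares rigid motion (R, t) such that Q ~ R @ P + t.

        P, Q: (N, 3) arrays of corresponding 3D feature points,
        e.g., reconstructed from two stereo frames.
        """
        p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
        X, Y = P - p_mean, Q - q_mean
        U, _, Vt = np.linalg.svd(X.T @ Y)            # cross-covariance H = X^T Y
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                           # enforce det(R) = +1
        t = q_mean - R @ p_mean
        return R, t

Applied to the inlier correspondences between the current frame and a key frame, the resulting (R, t) is the kind of relative pose a feature-based visual odometry provides.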

2.5. System State Estimation

Combining visual with inertial sensing has been extensively demonstrated for mobile robot navigation. Kalman filters are often used for sensor data fusion. Considering the fast dynamics of MAVs, measurement time delays can become problematic for control if an ill-posed estimation strategy is chosen.

Time delays are often ignored. The resulting negative effect is mitigated by separating attitude and position estimation (Fraundorfer et al., 2012). Using stereo cameras, visual odometry is computed at 10 Hz. A reference frame is maintained as long as it is feasible to avoid local drift. A downward-facing optical flow camera (in conjunction with a sonar sensor for altitude) provides velocity measurements, which are used in a simple Kalman filter to estimate the partial state. The integrated velocities are then combined with the output of the visual odometry via a low-pass filter for providing a complete pose estimate. Nevertheless, in this configuration, attitude estimation cannot profit from visual information. Furthermore, the bandwidth of the position controller is limited by the time delay.

A common approach for delay compensation is measurement buffering (Huang et al., 2011; Weiss, Achtelik, Chli, & Siegwart, 2012). Measurements of a modified parallel tracking and mapping (PTAM) monocular SLAM algorithm are used as absolute pose measurements. During processing time, the system state is propagated by the IMU while all measurements are buffered. At the arrival of the delayed pose measurement, buffered data are reprocessed. This approach is optimal in the sense of filtering, but it can introduce CPU load peaks that depend on the delay duration and, therefore, it can influence real-time behavior. Furthermore, processing of several time-delayed measurements is problematic.

Another method for delay compensation in Kalman filters extrapolates measurements using past and present estimates of the Kalman filter (Larsen, Andersen, Ravn, & Poulsen, 1998). An optimal gain is calculated for processing the delayed measurement. While the method is well suited for real-time processing, as it prevents CPU load peaks, only one delayed measurement source can be incorporated.

We use a third approach for delay compensation, namely delay compensation by state augmentation. A special form of state augmentation is stochastic cloning, which was used for fusing delta poses of a vision system with IMU measurements (Roumeliotis, Johnson, & Montgomery, 2002). We use this technique combined with hardware triggers that start the filter augmentation to implicitly compensate for delays. Furthermore, we keep some augmentations within the filter to realize key-frame-based odometry for locally drift-free navigation. The algorithm is described in detail in Section 4.4.
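To illustrate the idea of stochastic cloning, the following is a minimal linear sketch under our own simplifying assumptions (it is not the filter described in Section 4.4, and the pose is assumed to occupy the first entries of the state): when the camera is triggered, the current pose part of the state is cloned together with its cross-covariances; when the delayed delta-pose measurement arrives, it relates the current state to the clone and can be processed as an ordinary Kalman update, without reprocessing buffered data.

    import numpy as np

    def clone_state(x, P, n_pose=3):
        """Augment the filter state with a copy of the current pose sub-state."""
        J = np.eye(len(x))[:n_pose]               # selects the pose part of x
        x_aug = np.concatenate([x, J @ x])
        P_aug = np.block([[P,      P @ J.T],
                          [J @ P,  J @ P @ J.T]])
        return x_aug, P_aug

    def delta_pose_update(x_aug, P_aug, z, R_meas, n_state, n_pose=3):
        """Kalman update with a delayed relative measurement z = pose_now - pose_clone."""
        H = np.zeros((n_pose, len(x_aug)))
        H[:, :n_pose] = np.eye(n_pose)                    # current pose (assumed first in the state)
        H[:, n_state:n_state + n_pose] = -np.eye(n_pose)  # minus the cloned pose
        S = H @ P_aug @ H.T + R_meas                      # innovation covariance
        K = P_aug @ H.T @ np.linalg.inv(S)
        x_new = x_aug + K @ (z - H @ x_aug)
        P_new = (np.eye(len(x_aug)) - K @ H) @ P_aug
        return x_new, P_new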

3. NAVIGATION BOX SETUP

Commercially available MAV platforms are a good starting point for developing autonomous flying robots. Our system is based on an Ascending Technologies (http://www.asctec.de/) Pelican quadrotor. The modified MAV is depicted in Figure 1. All the sensing and processing required for the fully autonomous operation of the robot is realized onboard, as illustrated in Figure 2. The necessary low-level adaptation of a commercial platform is a complex task with many pitfalls. Hardware and software aspects as well as sensor calibration have to be considered in the system design.

3.1. Hardware

On the hardware side, we consider an electronic and mechanical design which is primarily driven by algorithmic and corresponding sensor requirements. The hardware has to be lightweight, robust, and of limited size. Furthermore, the energy consumption of the electronics and the rigid mounting of sensors have to be taken into account. The electronics include sensors, embedded computers, and associated communication hardware. These are assembled into a stack that enables simple mechanical integration with the commercial MAV. We call the resulting stack the navigation box.

3.1.1. Electronics

The most limiting factors for MAV electronics are weight and power consumption, which have an important impact on the choice of sensors and computational hardware.

We employ PointGrey Firefly FMVU-03MTM/C-CS cameras in a stereo configuration as the main exteroceptive sensor. The cameras are equipped with Computar HM0320KP lenses [f = 3 mm, horizontal field of view (FOV) = 80.5°, manual aperture]. As an inertial measurement unit, we chose an Analog Devices ADIS16407 MEMS IMU, which has factory-calibrated and temperature-compensated sensors. It measures all three spatial axes with its accelerometer, gyroscope, and magnetometer, plus atmospheric pressure.

The autonomy functionality on the MAV is split into low-level, real-time (RT) tasks and high-level, computationally intensive tasks without real-time constraints (Non-RT). The system robustness can be increased by further separating safety-critical from non-critical tasks in hardware.

The setup should be lightweight and flexible, while at the same time offering a wide range of hardware interfaces. The resulting demands can be summarized as follows:

• Separation between RT and Non-RT processing.
• Availability of common embedded interfaces (I2C, UARTs, SPI, GPIOs, and ADCs) as well as high-level interfaces such as Ethernet, Firewire, and USB.
• Flexibility of the hardware setup to allow fast modifications and development cycles.
• High computing power at low weight and power consumption.

Building upon COTS (commercial off-the-shelf) hardware, which is inexpensive and readily available, allows for fast prototype development cycles. Today, most industrial embedded computing manufacturers offer systems with separate computer modules (CPU + RAM) and peripheral boards. By means of this scheme, it is possible to update the computer module while keeping the peripheral board. This allows us to keep up with the latest improvements in computing power without time-consuming and error-prone customized hardware designs. In our opinion, flexibility is more valuable for a research platform than higher efficiency through time-consuming individual designs.


Figure 2. Hardware, software, and data flow overview of the presented system. The system consists of a ground segment (Ground Station) and a flight segment (navigation box and MAV). The navigation box consists of three computation units (Core2Duo, Gumstix, and FPGA), as well as sensors and communication hardware. It is linked mechanically and electrically (in terms of power supply) to the MAV. Furthermore, attitude commands are sent to the MAV's autopilot. The computers must be time-synchronized for sensor data fusion. We use hardware triggers for camera synchronization and exact image time-stamping. All autonomous functionality, including mapping and planning, runs onboard the flight segment.

Therefore, as the RT embedded processing component, we chose a combination of the Gumstix OVERO WaterStorm COM and the Gumstix Tobi peripheral board due to their compact size and high computing power. The Non-RT system component builds upon the COM Express industrial standard and is comprised of an Intel 1.8 GHz Core2Duo-based COM module and the Ascending Technologies Mastermind peripheral board. For real-time stereo vision processing, we use a Xilinx Spartan 6 LX75T FPGA development board, which is connected to the Core2Duo board via PCIe. Table I lists the features of the RT and Non-RT processing units.

The RT and Non-RT computers are connected via Ethernet. Our current system architecture allows for easy extension with more computing hardware by simply adding further RT and Non-RT computers together with a custom-designed small-footprint Ethernet switch (Tomic et al., 2012). An overview of the hardware setup and interconnection is given in Figure 3. The weight of all components is listed in Table II.

A crucial problem in mobile robotics is reliable wireless communication. Besides the mandatory remote control (RC) link, which is operated by a safety pilot, it is useful to have a high-bandwidth wireless communication link for online system debugging. Applications further require a robust data link, suitable for indoor and outdoor environments.

To satisfy these requirements, we employ two different communication channels. A WLAN module (IEEE 802.11bgn) provides high data rate communication, while an XBee module (IEEE 802.15.4) provides a robust, low-bandwidth data link. Because the latter has a lower bandwidth, it allows for a significantly higher system gain (system gain = transmitter power + antenna gain + receiver sensitivity) and, therefore, facilitates long-distance communication.


Table I. Processing hardware specification.

                                 Gumstix                                Mastermind
Form factor/standard             Proprietary                            COM Express Compact
COM module/peripheral board      OVERO WaterStorm / OVERO Tobi          Kontron COMe-cPC2 / AscTec Mastermind
Processor(s)                     OMAP3730 SoC (1 GHz Cortex A8,         Core2Duo L9400 @ 1.8 GHz
                                 800 MHz DSP, SGX530 GPU)
Memory                           512 MB DDR                             4 GB DDR3
Interfaces                       GPIOs, I2C, SPI, UART, ADC/DAC,        GPIOs, UART, USB,
                                 USB, Fast Ethernet                     Gigabit Ethernet, Firewire
Max. power consumption           3 W                                    20 W
Weight                           37 g                                   345 g


Figure 3. Hardware connection overview.

Table II. Component weights.

Component                              Weight (g)
Mastermind board + peripherals         345
FPGA board                             95
Gumstix OVERO (COM + Tobi board)       37
IMU and baseboard                      22
Cameras and lenses                     2 × 33
Mounts and cables                      171
Total                                  739

Our WLAN module supports a 5.8 GHz operation mode, which enables robust communication in indoor areas with a crowded 2.4 GHz frequency band. Nevertheless, for outdoor usage we consider the 2.4 GHz band more appropriate because of its lower signal attenuation (free-space path loss). The features of both interfaces are summarized in Table III.

3.1.2. Mechanical Integration

All the sensors, computers, and communication hardware are integrated into the stand-alone navigation box as illustrated in Figure 4. The only physical connections with the MAV are four screws, the power source, and the data link with the autopilot. In that way, the electronic and software components can be easily integrated and tested independently of the MAV.

The design of the navigation box is driven by sensor requirements. First, the egomotion sensors, i.e., the IMU and the stereo cameras, have to be rigidly mounted relative to each other with high stiffness. The stereo-matching algorithm is particularly sensitive to rotational deflections of the cameras. We have mounted the cameras on the same plate as the IMU, which guarantees a stiff connection between them. Second, we want to reduce the influence of the propeller vibrations on the IMU. The propellers rotate at 3,000–6,000 rpm, introducing mechanical vibrations to the frame that can lower the quality of motion estimation. Hence, the cameras must not be mechanically connected to a load-bearing structure, such as the MAV frame. Our solution was to use the navigation box as the mass of a mechanical low-pass filter between the MAV and the IMU by connecting it only at four damped connectors to the quadrotor frame. We can use a simplified representation of the navigation box as a second-order mass-spring-damper system, with input at the MAV frame mounting point. Its undamped natural frequency is then ω0 = √(k/m), where k is the stiffness of the mounting and m is the mass of the navigation box. Therefore, we must lower the natural frequency of the navigation box with respect to the frame in order to lower the amplitude response at higher frequencies. This is achieved by exploiting the weight of the navigation box and by using dampers to additionally lower the mounting stiffness.

The cameras are placed on top of the computing stack so that they provide a useful image for stabilizing the egomotion estimation while the MAV is standing on the ground. The Gumstix computer, Core2Duo, and FPGA boards are vertically stacked to fit into the bottom of the Pelican MAV. We use milled 1.1-mm-thick carbon fiber plates for the intermediate plates to minimize weight. A carbon fiber plate below the FPGA board allows mounting of additional sensors, such as sonar or a downward-facing camera.

In our previous system, the computing and sensor equipment was tightly integrated with the MAV frame (Tomic et al., 2012).


Table III. Specification of onboard communication hardware.

                              WLAN Module                       XBee Module
Model                         Ubiquity SR71-E                   Digi XBee Pro
Communication standard        IEEE 802.11n (MCS15 & HT20)       IEEE 802.15.4
Used frequency band           ISM 2.4 / 5.8 GHz                 ISM 2.4 GHz
Approx. system gain           95 dBm                            118 dBm
Antenna                       printed, 2× diversity             wire antenna
Max. net data rate            4 MB/s                            80 kB/s
Interface                     PCI-Express                       UART
Max. power consumption        4.3 W                             1 W

Figure 4. Exploded and assembled view of the processing stack, showing (1) FPGA card, (2) Core2Duo board, (3) Gumstix, (4) plate that carries the stack, (5) IMU (not visible in assembled view), and (7) stereo camera assembly. The stack is mounted to the quadrotor via (6), which is rigidly fastened to the plate (4). Dampers are placed on (4), so that the whole stack is damped with respect to the quadrotor frame. This allows fast (dis)assembly of the quadrotor without losing the IMU-to-camera calibration.

We have found that separating the sensors from the MAV frame has greatly improved sensor signal quality. For motion estimation, this is due to lower vibrations and the stiff camera fixture. Mechanical robustness has also increased, i.e., our equipment is better protected if a crash occurs. Due to the enclosed setup, sensor recalibration can be done independently of the MAV.

3.2. Software

The low-level software system should facilitate development on distributed, embedded systems. Scheduling constraints and transparent communication have to be taken into account.

3.2.1. Operating Systems

The navigation box contains different CPU boards with RT and Non-RT constraints. Considering RT operating systems (RTOSs), we have to ensure software interoperability toward the Non-RT systems. The requirements can be summarized as follows:

• Same communication infrastructure (middleware) on RT and Non-RT OS.
• Availability of open-source software for maximum flexibility and community support.
• Device driver availability for the used sensor/system configuration (ARM, x86).
• POSIX (Portable Operating System Interface) compatible API on both systems for consistency and portability.

We found that these requirements can best be satisfied by using a uniform OS on both systems. Our solution is based on a standard off-the-shelf Linux distribution (Ubuntu 12.04 LTS) with a real-time capable kernel for the RT system.


Figure 5. Measuring scheduler latency between triggered and actual wakeup time.

We considered several approaches to realize a real-time capable Linux system:

• PREEMPT_RT: The PREEMPT_RT patches make sections of the Linux kernel and its drivers preemptable that are ordinarily blocking. That also includes IRQ routines, which become preemptable kernel threads. Most spinlocks are converted into mutexes, and priority inheritance also works for kernel spinlocks and semaphores. While the kernel itself is fully preemptable (OSADL, 2012), the conversion of various device drivers is still a work in progress.

• RTAI Linux: RTAI makes use of a real-time cokernel that puts all incoming interrupts into an interrupt pipeline and dispatches interrupts to real-time tasks (scheduled by the cokernel) first. After completion of all real-time tasks, interrupts get transferred to the non-real-time tasks associated with the actual Linux kernel. Although it is similar to Xenomai, it focuses more on the lowest technically feasible latencies while exposing only its native API to the programmer.

• Xenomai: Xenomai is a spin-off of RTAI Linux that makes use of a real-time cokernel as well. Unlike RTAI, it aims for clean extensibility and portability by offering a variety of different programming APIs (skins) such as POSIX, VxWorks, pSOS, or RTAI. Also, Xenomai is embracing PREEMPT_RT as a means of offering a single-kernel approach with its rich set of RTOS APIs.

For our setup, the PREEMPT_RT kernel approach provides the most benefits. One advantage over RTAI is its POSIX API. Because it builds directly upon the vanilla kernel, its support for different hardware architectures is superior to that of a dual-kernel approach. A cokernel approach introduces the need to also maintain the real-time kernel and to keep up with the latest architecture support improvements that come with upstream kernel updates. Furthermore, the RT and Non-RT systems only differ in their kernel, which facilitates system maintenance.

Crucial aspects of real-time operating systems are scheduler latency (also known as dispatch latency, i.e., the latency caused by thread context switching) and interrupt latency. Our system should be able to run threads at a maximum sample rate of 1 kHz on an ARM Cortex architecture. To verify whether we can obtain latencies well below the required sampling period for control and sensor data fusion, we use the following measure for benchmarking.

Periodic high-resolution timers are assigned to a set of threads with different scheduling priorities and a sampling period ts. The only duty of every task is to save the current time stamp t1, sleep until the assigned timer expires, save another time stamp t2, and calculate the deviation from the sampling period as Δt = t2 − (t1 + ts), which is an estimate of the scheduling latency. Figure 5 shows how the two time stamps are defined.
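The following sketch is our own minimal analogue of this measurement (not the cyclictest tool itself). It runs one periodic task under the SCHED_FIFO policy and records Δt; it requires Linux and sufficient privileges to set a real-time priority, and the use of time.sleep adds some overhead compared to an absolute clock_nanosleep.

    import os, time

    PERIOD_NS = 1_000_000            # ts = 1 ms sampling period
    N_SAMPLES = 10_000

    # Request a real-time FIFO priority (needs CAP_SYS_NICE or root).
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(75))

    latencies = []
    next_wakeup = time.monotonic_ns() + PERIOD_NS      # t1 + ts
    for _ in range(N_SAMPLES):
        # sleep until the assigned timer expires
        sleep_s = max(0.0, (next_wakeup - time.monotonic_ns()) / 1e9)
        time.sleep(sleep_s)
        t2 = time.monotonic_ns()
        latencies.append(t2 - next_wakeup)             # Delta t = t2 - (t1 + ts)
        next_wakeup += PERIOD_NS

    print("max/avg latency [us]:",
          max(latencies) / 1e3, sum(latencies) / len(latencies) / 1e3)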

For benchmarking, we use the tool cyclictest (part of the rt-tests benchmarking suite; see https://www.osadl.org/Realtime-Preempt-Kernel.kernel-rt.0.html#externaltestingtool), which implements exactly the described behavior. We demonstrate the timing differences between a vanilla Linux kernel and one with the PREEMPT_RT patch set applied. The program was invoked as ./cyclictest -t 4 -p 75 -h 500 -i1000, i.e., with four concurrent threads, the priority of the highest prioritized thread set to 75, 500 histogram bins for data output, and a wakeup interval of 1 ms. Every run took about 1 h, which makes a total sample count of around 3.6 × 10^6. During benchmarking, a set of computation- (FPU), file I/O-, and peripheral I/O-intensive tasks (traversing the file system with find /, the Whetstone floating-point benchmark, and polling the IMU at 819 Hz) was used to maintain a constant load average close to 1.0. The load scenario was carefully chosen to reproduce the actual system load. Figure 6 depicts two histograms showing the results of two cyclictest runs on the same ARM-based Ubuntu Linux 12.04 system, the first on a vanilla kernel (version 3.0.22) and the second on a patched kernel (version 3.0.22-rt37).


(a) Non-RT histogram. Kernel version: 3.0.22; max./avg./min. latency: 26383 (out of range)/192/0 µs.
(b) RT histogram. Kernel version: 3.0.22-rt37; max./avg./min. latency: 334/168/0 µs.

Figure 6. Scheduler latency histograms of the RT and Non-RT kernels based on ARM kernel version 3.0.22. The scheduler tick clock was running at 1.3 MHz. The system load average was close to 1.0.

While the unpatched kernel shows comparable latencies around the average, its highest latency exceeds 26 ms. The highest latency occurring on the RT kernel was 334 μs.

These tests prove the suitability of the PREEMPT_RT approach: scheduling constraints well below 1 ms are fulfilled even under high sensor I/O and CPU load, while convenient system maintenance and a clean software architecture that exhibits the same API as the Non-RT operating system are provided.

3.2.2. Communication

To exchange time-stamped sensor data in a distributed system, all system clocks have to be synchronized. We employ an open-source implementation of PTPd2 (Precision Time Protocol daemon). Running the synchronization daemon at a rate of 4 Hz is sufficient to maintain an average deviation of the system clocks well below 500 μs. The Non-RT computer serves as the master clock, while the RT computer is configured as a slave. The PTPd daemon runs with a high real-time priority to minimize the influence of operating system scheduling latency.
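The clock offset that such a daemon corrects is estimated from a two-way exchange of time stamps. The sketch below is a generic illustration of the underlying IEEE 1588/NTP-style computation (not PTPd2 code) and assumes a symmetric network path:

    def ptp_offset_and_delay(t1, t2, t3, t4):
        """Two-way time transfer.

        t1: master sends Sync          (master clock)
        t2: slave receives Sync        (slave clock)
        t3: slave sends Delay_Req      (slave clock)
        t4: master receives Delay_Req  (master clock)
        Assumes the same path delay in both directions.
        """
        offset = ((t2 - t1) - (t4 - t3)) / 2.0   # slave clock minus master clock
        delay = ((t2 - t1) + (t4 - t3)) / 2.0    # one-way path delay
        return offset, delay

    # Example: slave clock 250 us ahead, one-way delay 100 us (values in seconds)
    print(ptp_offset_and_delay(0.000000, 0.000350, 0.001000, 0.000850))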

For communication within a distributed system, a consistent and transparent interprocess communication infrastructure is needed. We found that the robot operating system (ROS) is suitable for our purposes: it is versatile, runs on both x86 and ARM architectures, and has flexible and dynamic data messages. It is a widespread middleware in the robotics community, which makes it very easy to exchange code with other research facilities and groups. As ROS is not intended for real-time applications, we extended the interprocess communication by modified nodelets to fit our real-time communication requirements.

As shown in Figure 3, onboard communication between the RT and Non-RT computing units is realized by an Ethernet connection, while the downlink to the ground station is established via WLAN. These two independent network segments need to be unified to enable the middleware to communicate transparently between all computers. Therefore, we employ a software bridge on the Non-RT computer, which connects WLAN and Ethernet peers in the same subnet. Since forwarding is done at the MAC layer, all IP-based protocols pass the bridge transparently.

3.3. Spatial Sensor Registration

Especially in the context of highly dynamic systems, accurate temporal and spatial alignments play a crucial role for reliable data fusion. Temporal synchronization is assured by actively triggering the cameras while storing the respective time stamps. Hence, for the spatial alignment, which we discuss in the following, we assume accurately synchronized measurements. In our setup, we have to estimate the rigid-body transformation between the two cameras and the IMU. The extrinsic calibration between the stereo camera pair is estimated together with the intrinsic parameters in a nonlinear optimization framework (Hirschmuller, 2003; Strobl, Sepp, Fuchs, Paredes, & Arbter, 2005); a publicly available tool for intrinsic and extrinsic camera calibration, called Callab, can be found at www.robotic.dlr.de/callab. It remains to compute the spatial alignment between the cameras and the IMU.

A straightforward solution is to extend the state space of the navigation filter by the 6D spatial transformation parameters and, hence, to estimate the spatial alignment in real time, together with the system pose. No offline calibration step is required. However, estimating the spatial registration together with the actual pose may result in poor pose estimation and spatial registration in the case of bad initialization or ill-posed motion at the beginning. Therefore, an initialization phase is advisable, where the system performs a motion that allows for a well-posed state estimation. Furthermore, when processing the data in real time, linearization errors and sequential data processing are introduced. These drawbacks can be overcome by nonlinear batch optimization. It comes with higher processing costs and hence is not real-time capable anymore. As in Kalman filters, batch optimization techniques also require a representation of the trajectory to combine measurements with different time stamps. A significant advantage of batch optimization, beside the batch-based processing of the data, is its higher flexibility in choosing an appropriate representation of the trajectory.

We will now focus on computing the spatial alignment by batch optimization of the measurements (Mair et al., 2011a). To generalize the framework for arbitrary exteroceptive sensors that provide relative motion measurements, we use the visual odometry results as input instead of triangulated landmarks or feature rays. As a representation of the trajectory, we use a cubic B-spline. This implies the assumption that the trajectory associated with the calibration sequence can be modeled by means of twice differentiable smooth parametric curves p(t) and r(t) describing position and orientation at time t, respectively. Quaternions are used to represent orientations; thus r(t) is a four-dimensional curve, while p(t) is obviously three-dimensional. Using this curve model certainly constitutes a restriction, since it imposes a number of constraints on the trajectory, e.g., piecewise linearity of the acceleration between knots. Note, however, that it is still more general than the commonly encountered assumption of piecewise linearity of motion, which is usually made in filter-based approaches.

3.3.1. Objective Function

The relationships between the measurements of the different sensors can be established as (Craig, 2006; Wendel, 2007)

\omega_B = {}_{I}\omega_B - e_{{}_{I}\omega_B} = R^B_C \left( {}_{C}\omega_C - e_{{}_{C}\omega_C} \right) \qquad (1)

and

{}_{I}a_B - e_{{}_{I}a_B} + R^B_E\, g_E = R^B_C \left( {}_{C}\dot{v}_C - e_{{}_{C}\dot{v}_C} \right) - \omega_B \times \left( \omega_B \times p^B_{BC} \right) - \dot{\omega}_B \times p^B_{BC}. \qquad (2)

In the above formulas, ω_B denotes the real, error-free angular velocity and e_x represents the error of a specific measurement x. The relative pose measurements of the camera are interpreted as velocity measurements {}_C v_C. For clarity, the reader is referred to Appendix A for the notation. Unfortunately, a direct comparison as outlined above is usually not possible, because the sensor measurements occur at different time instances. The measurement errors δ_x at time instance t can be formulated by the following equations:

\delta_{I\omega} = {}_{I}\omega - b_{I\omega} - \omega(t), \qquad (3)

\delta_{Ia} = {}_{I}a - b_{Ia} + r(t) \bullet \left( I_{vq}^T\, g_E \right) \bullet \bar{r}(t) - \ddot{p}(t), \qquad (4)

\delta_{C\omega} = R^B_C\, {}_{C}\omega_C - \omega(t), \qquad (5)

\delta_{Cv} = R^B_C\, {}_{C}v_C + \omega(t) \times t^B_{BC} - \dot{p}(t), \qquad (6)

with

\omega(t) = 2\, I_{vq} \left( r(t) \bullet \dot{\bar{r}}(t) \right), \qquad (7)

I_{vq} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad (8)

\bar{r}(t) = \operatorname{diag}(-1, -1, -1, 1)\, r(t), \qquad (9)

where b_{Iω} and b_{Ia} denote the IMU biases. The errors are expressed in the IMU frame (body frame) to make the expressions simpler, whereas the frame index is neglected for the sake of readability. The only measurement revealing information about the real orientation of the IMU relative to the world coordinate frame is the gravity vector. However, this becomes irrelevant if we also estimate the orientation of the gravity vector relative to the initial pose of the IMU, g^{B_0}. That way, the trajectory starts at the origin of the IMU coordinate frame, aligned to its axes, and it can be estimated independently of the world coordinate frame.
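Equation (7) extracts the angular velocity from the quaternion curve and its time derivative, with I_{vq} selecting the vector part. As a small numerical check of this kind of identity (our own sketch, using one common Hamilton convention with the quaternion stored vector-part-first as in Eqs. (8)-(9) and taken as a body-to-world attitude; the paper's exact frame conventions are defined in its Appendix A), a constant rotation about z is recovered correctly:

    import numpy as np

    def qmul(a, b):
        """Hamilton product of quaternions stored as (x, y, z, w)."""
        av, aw = a[:3], a[3]
        bv, bw = b[:3], b[3]
        return np.concatenate([aw * bv + bw * av + np.cross(av, bv),
                               [aw * bw - av @ bv]])

    def qconj(q):
        return np.array([-q[0], -q[1], -q[2], q[3]])   # diag(-1,-1,-1,1) q, cf. Eq. (9)

    # Constant rotation about z with rate 0.5 rad/s, evaluated at t = 2 s.
    rate, t = 0.5, 2.0
    theta = rate * t
    r = np.array([0.0, 0.0, np.sin(theta / 2), np.cos(theta / 2)])
    r_dot = np.array([0.0, 0.0, (rate / 2) * np.cos(theta / 2),
                      -(rate / 2) * np.sin(theta / 2)])

    omega = 2.0 * qmul(qconj(r), r_dot)[:3]   # vector part, as selected by I_vq
    print(omega)                               # -> [0. 0. 0.5]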

The errors are weighted with the pseudo-Huber cost function h(δ) (Hartley & Zisserman, 2003), which is differentiable and robust against outliers. It is defined as

h(\delta_x) = 2\, b_x^2 \left( \sqrt{ \frac{\delta_x^T \delta_x}{b_x^2} + 1 } - 1 \right). \qquad (10)

A common choice for b_x is 3σ_x, where σ_x denotes the standard deviation of the noise and x denotes the kind of measurement, i.e., Iω, Ia, Cω, or Cv, respectively.

Concatenating the matrices Δ_x of all the error vectors for each measurement type yields the total error matrix Δ = (Δ_{Iω} Δ_{Ia} Δ_{Cω} Δ_{Cv}). Thus, the objective function g(Δ) can be formulated as

g(\Delta) = \sum_{x \in \{I\omega,\, Ia,\, C\omega,\, Cv\}} \left( \frac{1}{\sigma_x^2} \sum_{\delta_x \in \Delta_x} h(\delta_x) \right). \qquad (11)


3.3.2. B-spline-based Trajectory Modeling

The model for our trajectory representation needs to be twice differentiable to provide the second derivative of the position for acceleration estimation, p̈(t) [see Eq. (4)]. Therefore, a B-spline curve of third order is required, but higher-order curves could also be used as a trajectory model. We are now going to introduce our notation for such curves. More detailed information about B-splines can be found in the literature (De Boor, 2001; Prautzsch, Boehm, & Paluszny, 2002). We define a B-spline curve as

s(t) = \sum_{i=1}^{M} c_i\, b_i^k(t), \qquad (12)

where c_i ∈ R^d are the d-dimensional control point values, b_i^k(t) denotes the ith basis function of order k (for a cubic B-spline, k = 4), and M is the number of knots. With c, we denote the vector (c_1^T c_2^T ··· c_M^T)^T ∈ R^{Md} of concatenated control point values. Furthermore, we assume that an equidistant knot sequence is used.

It is well known that B-splines are linear in parameters, which means that the evaluation of the above equation at several parameter locations t = (t_0, t_1, ..., t_n) is equivalent to computing the matrix-vector product B(t)c for a suitable basis matrix B(t). More formally, this is expressed as

\left( s(t_1)^T, s(t_2)^T, \ldots, s(t_n)^T \right)^T = B(t)\, c. \qquad (13)

For d-dimensional control point values, the basis matrix has the shape

B(t) = \begin{pmatrix}
b_1^k(t_1) & b_2^k(t_1) & \cdots & b_M^k(t_1) \\
b_1^k(t_2) & b_2^k(t_2) & \cdots & b_M^k(t_2) \\
\vdots & \vdots & \ddots & \vdots \\
b_1^k(t_n) & b_2^k(t_n) & \cdots & b_M^k(t_n)
\end{pmatrix} \otimes I_d, \qquad (14)

where I_d is the d × d identity matrix. At this point, we want to emphasize the importance of synchronized measurements as discussed in Section 3.2.2. It is obvious that if the vector of parameter locations t remains constant, so does the matrix B = B(t). It has a significant impact on the processing complexity of each optimization step if the matrix B has to be computed only once. Furthermore, B-spline derivatives are again B-splines and as such are again linear in parameters. In our optimization process, we are going to evaluate the spline and its derivatives at the respective measurement time stamps. This means that spline derivatives can also be computed by simply evaluating B′c for some appropriate matrix B′ representing the basis matrix of the derived spline.
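A small numerical illustration of this property (our own sketch using SciPy 1.8 or newer; the dimensions follow the trajectory model, all other values are arbitrary):

    import numpy as np
    from scipy.interpolate import BSpline

    k = 3                                    # cubic B-spline (degree 3, order 4)
    d = 7                                    # position (3) + quaternion (4)
    M = 12                                   # number of control points
    # Clamped, equidistant knot vector; len(knots) = M + k + 1
    knots = np.concatenate(([0.0] * k, np.linspace(0.0, 1.0, M - k + 1), [1.0] * k))
    C = np.random.randn(M, d)                # stacked control point values (M x d)

    t_meas = np.linspace(0.0, 0.95, 50)      # fixed measurement time stamps

    # Basis matrix B(t): constant as long as the time stamps do not change,
    # so it has to be computed only once per optimization problem.
    B = BSpline.design_matrix(t_meas, knots, k).toarray()   # shape (50, M)
    s = B @ C                                # trajectory samples, shape (50, d)

    # Derivatives are again B-splines and hence also linear in the parameters.
    spl = BSpline(knots, C, k)
    assert np.allclose(spl(t_meas), s)
    vel = spl.derivative(1)(t_meas)          # p'(t), r'(t)
    acc = spl.derivative(2)(t_meas)          # p''(t) used in Eq. (4); piecewise linear for k = 3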

In our implementation, we need a B-spline of dimension d = 7, and thus c_i ∈ R^7, to model the trajectory,

s(t) = \left( p(t)^T \;\; r(t)^T \right)^T. \qquad (15)

Note that the quaternions are constrained to be of unit length, which yields the expected six degrees of freedom for rigid-body motion.

The control point vector of the B-spline is part of the parameter vector θ, which is subject to optimization. Furthermore, this vector contains two IMU bias terms, which are assumed to be constant for the short calibration sequence. The initial direction of the gravity vector, the quaternion, and the vector describing the transformation between the IMU and the camera coordinate system are also included. Hence, the parameter vector in our optimization is defined as follows:

\theta = \left( \left( c_1^T \; c_2^T \; \cdots \; c_M^T \right) \; b_{I\omega}^T \; b_{Ia}^T \; q_C^{B,T} \; g^{B_0,T} \; t_{BC}^{B,T} \right)^T. \qquad (16)

3.3.3. Constraints and Optimization Details

There are some constraints on a few parameters that have to be satisfied for optimization. The unit-length property of the quaternions, i.e., of the rotational part of the B-spline control points and of the alignment quaternion, has to be ensured. Furthermore, the gravity vector has to be of length 9.81, and the first control point of the spline is constrained to represent zero rotation and zero translation to avoid dependencies in the optimization parameters. This is because both the direction of the gravity vector and the pose of the IMU are going to be optimized; at the beginning of the trajectory, one of these parameters has to be fixed to prevent redundant degrees of freedom.

We use sequential quadratic programming (SQP) as the optimization algorithm because it is known to work well for nonlinear optimization problems and it allows implementing equality constraints (Nocedal & Wright, 2000). For optimization purposes, it is generally necessary to compute the gradient of the objective function as well as an appropriate Hessian approximation. The gradient of the objective function can be computed by applying the chain rule, and for approximating the Hessian of the system, the popular Broyden-Fletcher-Goldfarb-Shanno (BFGS) method (Nocedal & Wright, 2000) is used.
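The structure of such a constrained problem can be sketched with an off-the-shelf SQP solver. The snippet below uses a toy quadratic as a stand-in for the actual objective of Eq. (11) and only illustrates how equality constraints such as unit quaternion length and gravity magnitude enter; SciPy's SLSQP maintains its own quasi-Newton approximation, so it is not identical to the described SQP/BFGS setup.

    import numpy as np
    from scipy.optimize import minimize

    def objective(theta):
        # Toy stand-in for g(Delta) in Eq. (11): a simple quadratic.
        return np.sum((theta - np.array([0.3, 0.4, 0.5, 1.0, 9.0, 1.0, 2.0])) ** 2)

    constraints = [
        # quaternion part (here theta[0:4]) must have unit length
        {"type": "eq", "fun": lambda th: th[0:4] @ th[0:4] - 1.0},
        # gravity part (here theta[4:7]) must have length 9.81
        {"type": "eq", "fun": lambda th: th[4:7] @ th[4:7] - 9.81 ** 2},
    ]

    theta0 = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, -9.81])   # initial guess
    res = minimize(objective, theta0, method="SLSQP", constraints=constraints)
    print(res.x, np.linalg.norm(res.x[0:4]), np.linalg.norm(res.x[4:7]))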

Nonlinear optimization does not guarantee that the global optimum will be found. Whether a local or the global optimum is found depends on the convexity of the problem and on the initial values of the parameters to be optimized. We do not want to relax the problem to become convex, which would introduce relaxation errors; instead, we try to find good starting points for the parameters, which also speeds up the optimization process.

3.3.4. Parameter Initialization

The gravity vector $g_0^B$ can be initialized as the mean over the first accelerometer measurements, assuming that the translational acceleration at the beginning is negligibly small, so that the sensor measurements consist mainly of the gravity force. The resulting vector has to be scaled to a length of 9.81 m/s². The bias terms $b_{I\omega}$ and $b_{Ia}$ can be initialized with zero vectors or with the mean over the first measurements in the case of a static period at the beginning of the calibration sequence. For the accelerometer measurements, the estimated gravity vector has to be subtracted.
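A minimal sketch of this initialization (NumPy; array names are illustrative), assuming the first few hundred IMU samples are recorded while the device is approximately static:

```python
import numpy as np

def initialize_from_static(acc, gyro, g_norm=9.81):
    """acc, gyro: (N, 3) arrays of the first, approximately static IMU samples."""
    g0 = acc.mean(axis=0)
    g0 *= g_norm / np.linalg.norm(g0)   # rescale to |g| = 9.81 m/s^2
    b_gyro = gyro.mean(axis=0)          # gyroscope bias from the static mean
    b_acc = acc.mean(axis=0) - g0       # accelerometer bias after removing gravity
    return g0, b_gyro, b_acc
```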

An initialization of the translation is rather complicated to achieve. Either one can extract a first estimate for $t_{BC}^B$ from a CAD drawing, or a complex setup as described in Lobo and Dias (2007) has to be used. However, experiments have shown that once the angular alignment is properly initialized, the translation converges reliably; hence it does not have to be initialized accurately and can also be set to zero.

Actually, the angular alignment can be estimated without ever considering the translational alignment between the sensors. This can be done by applying conventional hand-eye calibration algorithms, such as the closed-form solution introduced in Tsai and Lenz (1989), to the delta rotations measured by the cameras and integrated from the gyroscopes. In case the biases of the gyroscopes cannot be estimated in advance, e.g., as described earlier, they can be computed together with the alignment in a nonlinear optimization framework (Mair, Fleps, Suppa, & Burschka, 2011b).

Finally, the number of B-spline knots and their location on the timeline have to be chosen, and the B-spline control points have to be initialized. A high number of control points means less smoothing and higher flexibility, but it also increases the complexity of the computation and the possibility of overfitting. In our setup, we use an empirically chosen value of 0.6 times the number of camera measurements. A more general solution would be to evaluate the measurements of the accelerometers and the gyroscopes in order to adapt the knot vector and the number of control points to the requirements given by the motion dynamics. The B-spline can then be initialized by an approximation over the camera measurements (Mair, 2012).

3.3.5. Comparative Evaluation

For validating the assumption that a nonlinear batch optimization for spatial calibration is more accurate than Kalman-filter-based approaches, we show a brief comparison of the results of our solution to those of a UKF and a gray-box implementation. In the experiment, we processed seven different runs with an IMU-camera setup. Figure 7 illustrates the results for translation estimation as separate box plots for each algorithm and axis. Due to the fact that we do not have any CAD drawings for ground truth, we assume that the consistency over the runs correlates with the achieved accuracy.

The plots show that in particular the translation estimates of our approach are significantly more consistent than those of the UKF, and they also outperform the gray-box implementation. For the rotation estimation, we achieve results similar to those of the gray-box approach, whereas the Kalman-filter-based solutions are significantly worse. More comprehensive results can be found in the literature (Mair et al., 2011a).

4. HIGH-LEVEL SYSTEM DESIGN

The introduced navigation box is the basis for high-level software modules implementing perception, sensor data fusion, control, mapping, and path planning. Perception is split into three threads, as shown in Figure 8.

Figure 7. Registration results of seven different runs acquired with the same setup. Plot (a) shows the translational alignment and plot (b) the Euler angles ψ of the rotational alignment. The box plots on the left (B) represent the results of the discussed nonlinear optimization technique, the box plots in the middle (G) denote the estimates of a gray-box implementation, and the box plots on the right (K) denote a UKF-based estimation.


Figure 8. Overview of the stereo image processing pipeline.

4.1. Capturing Stereo Image Pairs

The first perception thread is responsible for capturing synchronized stereo images. Synchronization is especially important for a highly dynamic system. It is guaranteed by using the start of the integration time of the left camera as a hardware trigger for the right camera. Additionally, we require a precise time stamp, which is difficult to get from USB 2-based cameras. Therefore, we send a software trigger to the left camera and store the system time as a time stamp for the next image pair. Explicitly triggering the left camera ensures that only those images are captured that can be processed by stereo matching and visual odometry. This is important for reducing the processing workload of our sensor data fusion algorithm, which augments the current system pose every time a new image is captured (see Section 4.4).

In addition to capturing the images, the thread also controls the exposure time and gain settings of both cameras, because the auto exposure does not work together with the external trigger on our cameras. The images have a resolution of 752 × 480 pixels. While the thread requires almost no processing time, the latency for capturing and transmitting the images to the main memory is about 65 ms.

4.2. Stereo Processing

To simplify the processing of stereo images, the image pair is rectified according to the intrinsic and extrinsic camera parameters that are determined in advance. Rectification is implemented by a lookup table, reducing the processing time to just a few milliseconds.

Dense stereo matching is implemented by SGM (Hirschmuller, 2008). The method performs pixelwise matching, which is optimized by pathwise optimization of a global cost function. The cost function prefers piecewise smooth surfaces by penalizing neighboring pixels with different disparity values. Pathwise optimization is performed symmetrically from eight directions through the image. A left-right consistency check is used for identifying and invalidating occlusions and mismatches. A segmentation filter on the disparity image is used for finding and invalidating small disparity segments that are unconnected to the surroundings. This cleans up any possible clutter remaining in the disparity image.

A previous study (Hirschmuller & Scharstein, 2009) has identified Census (Zabih & Woodfill, 1994) as the best and most robust matching cost with respect to radiometric differences. Census computes a bit vector that encodes the relative order of intensities in the neighborhood of each pixel. We commonly work with a 9 × 7 pixel window. For each pixel in the window that is brighter than the center pixel, a corresponding bit is set. The resulting 64-bit vector then characterizes the center pixel. Pixelwise matching for SGM is done by computing the Hamming distance between the bit vectors of corresponding pixels.
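The per-pixel cost can be sketched as follows (a schematic NumPy version of the Census transform and the Hamming matching cost; function names are ours, and border handling and the FPGA specifics are ignored):

```python
import numpy as np

def census_transform(img, win=(9, 7)):
    """For each pixel, encode which neighbors in a win[0] x win[1] window are
    brighter than the center pixel as a bit vector (62 bits in a 64-bit word)."""
    rx, ry = win[0] // 2, win[1] // 2
    desc = np.zeros(img.shape, dtype=np.uint64)
    bit = 0
    for dy in range(-ry, ry + 1):
        for dx in range(-rx, rx + 1):
            if dx == 0 and dy == 0:
                continue                      # the center pixel carries no bit
            shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)  # borders wrap
            desc |= (shifted > img).astype(np.uint64) << np.uint64(bit)
            bit += 1
    return desc

def hamming_cost(d1, d2):
    """Pixelwise matching cost: Hamming distance between Census bit vectors."""
    x = np.bitwise_xor(d1, d2)
    bytes_ = x[..., np.newaxis].view(np.uint8)        # 8 bytes per 64-bit word
    return np.unpackbits(bytes_, axis=-1).sum(axis=-1)
```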

In our experience, SGM behaves robustly and is not susceptible to the parameter settings, in contrast to some other stereo matching methods. Due to its regular algorithmic structure and the fact that it just adds and compares integer values in the inner loop, it can easily be parallelized by multithreading as well as with vector commands, such as the SSE intrinsics that are available on modern x86 CPUs. Nevertheless, SGM requires a lot of processing resources and would not be suitable for real-time processing on a resource-limited system. Therefore, we use an FPGA implementation of SGM that has been implemented by Supercomputing Systems12 for our former cooperation partner Daimler. An earlier prototype has been described by Gehrig et al. (2009). The implementation runs on a Spartan 6 FPGA that has a very low energy consumption. It expects images of 1024 × 508 pixels with 12-bit radiometric depth and uses a disparity range of 128 pixels. Since our cameras have a lower resolution, the remaining space is simply filled with black pixels. Currently, we use only eight-bit images. The runtime for SGM-based stereo matching is about 68 ms, which results in a maximum frame rate of 14.6 Hz.

The stereo processing thread requests new images from the capture thread, performs rectification, sends the images to the FPGA via PCI Express, receives the disparity image, and performs the segmentation filter for cleaning up the images. Since the FPGA implements a double buffer, rectification and sending of images to the FPGA are done while the FPGA processes the previous image pair. Similarly, postfiltering is performed while the next image pair is matched on the FPGA. Therefore, at a frame rate of 14.6 Hz, the stereo processing thread causes a CPU load of 46%.

12 www.scs.ch

4.3. Visual Odometry

Our visual odometry is based on feature matching in subsequent left images. For feature detection, we used the Harris corner detector for a long time (Harris & Stephens, 1988). Recently, we switched to AGAST (Mair, Hager, Burschka, Suppa, & Hirzinger, 2010) to reduce the processing time from more than 20 ms to about 3 ms per image. The threshold of AGAST is continuously adapted such that the detector delivers approximately 300 features per image. However, only corners for which a disparity is available are maintained, so that all corners can be reconstructed in three dimensions.

For computing feature descriptors with radiometric robustness, we apply a Rank filter with a window size of 15 × 15 pixels to the image and then extract a region of 19 × 19 pixels around each corner. The feature descriptors are matched by the sum of absolute differences. Computing and matching feature descriptors is implemented using SSE intrinsics. The resulting descriptor is very robust against radiometric differences and tolerant to small rotations (e.g., up to 15 degrees) around the optical axis of the camera (Hirschmuller, 2003). It is not invariant to larger rotations or to scaling differences. However, even with a highly dynamic system, there will not be large rotations or scaling differences at an anticipated frame rate of 10-15 Hz. Therefore, we do not need rotation- and scaling-invariant feature descriptors such as SIFT (Lowe, 2004), which are computationally expensive and potentially less discriminative due to their invariance.

Since the rotation of the system around the vertical axis can quickly cause large image shifts, we do not try to track features. Instead, an initial matching of all features of the current image to all features of the previous image is performed to find the most likely correspondences. Matching features from the previous to the current image is used to sort out unstable matches. Outliers in the correspondences are identified by building all possible pairs of corners and comparing the distances of their 3D reconstructions in the current view to the distances of the 3D reconstructions of the corresponding corners in the previous view. A fast algorithm finds a large cluster of corners in which all relative distances between the 3D reconstructions match the relative distances of their correspondences (Hirschmuller et al., 2002a; Hirschmuller, 2003). In comparison to RANSAC, this method does not use any randomization, but it can still deal with far more than 50% outliers.
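The idea behind this consistency check can be illustrated with a simplified greedy variant (Python; this is not the exact algorithm of Hirschmuller et al., 2002a): two correspondences support each other if the 3D distance between their corners is preserved between the previous and current view, and the largest mutually consistent cluster is kept as inliers.

```python
import numpy as np

def consistent_cluster(P_prev, P_curr, tol=0.05):
    """P_prev, P_curr: (N, 3) reconstructed corner positions of N tentative
    correspondences in the previous and current view. Returns inlier indices."""
    d_prev = np.linalg.norm(P_prev[:, None] - P_prev[None, :], axis=-1)
    d_curr = np.linalg.norm(P_curr[:, None] - P_curr[None, :], axis=-1)
    consistent = np.abs(d_prev - d_curr) < tol     # rigid-motion consistency
    np.fill_diagonal(consistent, False)

    # Greedy growth: start at the best-supported correspondence and add
    # candidates that are consistent with every current cluster member.
    cluster = [int(np.argmax(consistent.sum(axis=1)))]
    for i in np.argsort(-consistent.sum(axis=1)):
        if i not in cluster and all(consistent[i, j] for j in cluster):
            cluster.append(int(i))
    return np.array(cluster)
```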

As a side effect of switching from Harris corners to AGAST, we noticed that after the outlier detection, more correspondences are available for computing the rigid motion, although both corner detectors have been set to deliver the same number of corners. We cannot yet explain this positive effect.

The rigid motion is computed by singular value decomposition (Arun, Huang, & Blostein, 1987; Haralick et al., 1989), followed by minimizing the ellipsoid error (Matthies & Shafer, 1987) using nonlinear optimization with the Levenberg-Marquardt algorithm.

For estimating the error of the calculated rigid motion, an image error of 0.5 pixels is assumed in all corner positions. This error is propagated into the six parameters of the rigid motion using the Jacobian matrix (Stelzer, Hirschmuller, & Gorner, 2012).

Incrementally adding the rigid motions of subsequent images for calculating the visual odometry leads to an increasing error, even if the system is moving slowly or standing still. To avoid this, we use a history of a small number (e.g., three to five) of key frames (Stelzer et al., 2012). The motion is calculated not only with respect to the previous image, but also to all key frames. This leads to n estimates of the current pose, with n corresponding error estimates. The pose with the lowest overall error (i.e., the sum of the errors up to the key frame and from the key frame to the current image) is used as the final estimate. After this determination, the current image replaces the key frame in the history with the largest overall error. In this way, the algorithm seeks to minimize the overall error. If the system is standing still, the error does not increase at all, thus making the visual odometry locally drift-free.

The literature contains details of the original method (Hirschmuller et al., 2002a; Hirschmuller, 2003) and the extensions for estimating the error and maintaining a key frame list (Stelzer et al., 2012).

The calculation is implemented as an additional thread that gets the left stereo images and the corresponding disparity images from the stereo processing thread. We have optimized the visual odometry for speed since earlier publications (Schmid & Hirschmuller, 2013), without changing the algorithm. The CPU load of the visual odometry thread with 300 features and three key frames is now about 36% at a frame rate of 14.6 Hz. This leads to a total CPU load of 82% of one CPU core for the capture, stereo, and visual odometry threads. The latency between capturing the images and getting the final delta pose estimate is about 220 ms.

4.4. Sensor Data Fusion

We fuse the time-delayed delta measurements from the visual odometry system with IMU measurements. The onboard calculated state estimate is used for control. Since MAVs are inherently unstable and have fast system dynamics, they require a high controller bandwidth for stabilization. Therefore, the time delays introduced by the vision pipeline have to be taken into account. Furthermore, the algorithm has to run in hard real-time, which requires a guaranteed maximum processing time. Therefore, the processor load should be balanced, independently of measurement time delays.

We implemented the sensor data fusion algorithm as an extended Kalman filter (EKF). It is practical to realize the estimator as an indirect filter in a feedback configuration. This means that it is not the system states themselves (direct states) but only the errors of the system states (indirect states) that are estimated. The indirect states are used to correct the direct states immediately after calculation. This decoupling has several advantages (Maybeck, 1979, Chap. 6, p. 296):

• Fast system dynamics are tracked within the direct system state, using a computationally cheap algorithm. The slower error dynamics can be accurately tracked by a computationally more expensive filter running at a lower frequency.

• Time delays in additional sensor measurements have to be considered only within the filter.

• The direct system state can still be calculated even in the case of a failure of the filter.

• As the system state is corrected after each filter update, we can assume small angles in the attitude error, which can be efficiently represented by an error angle vector of size 3 (instead of 4 for a gimbal-lock-free quaternion representation).

Our vision-aided INS filter framework, first introduced in Schmid, Ruess, Suppa, and Burschka (2012), is depicted in Figure 9. Sensor data preprocessing can easily be realized on a system without real-time constraints, whereas the actual filter runs on a real-time system. The filter framework can process any combination of relative or absolute measurements with or without (varying) time delays. For measurements with considerable time delays as well as for relative measurements, the exact moment of the measurement start is registered in the EKF by a hardware trigger signal. The system state error calculated by the EKF (δ) is used for correction of the direct system state (x). For clarity, the reader is referred to Appendix A for the notation. We omit the indices for the sensors and the reference frames where negligible. We define the direct and indirect system states, respectively, as

$$ x = \bigl( p_{EB}^{E,T} \;\; v_{EB}^{E,T} \;\; q_B^{E,T} \;\; b_a^T \;\; b_\omega^T \bigr)^T \in \mathbb{R}^{16}, \tag{17} $$

$$ \delta = \bigl( \delta p^T \;\; \delta v^T \;\; \delta\sigma^T \;\; \delta b_a^T \;\; \delta b_\omega^T \bigr)^T \in \mathbb{R}^{15}, \tag{18} $$

including position, velocity, orientation, and the IMU acceleration and gyroscope biases. As we assume small-angle errors between filter updates, we can parametrize the orientation error as an orientation vector. x is calculated by the Strap Down Algorithm (SDA) at a frequency up to the full IMU sampling rate.

4.4.1. Strap Down Algorithm (SDA)

The SDA is used to calculate the motion of the robot from linear acceleration and angular velocity measurements. Due to the limited accuracy of the used MEMS IMU and the assumption of an approximately flat earth in the limited area of flight, earth rotation and the transport rate can be neglected. Furthermore, it can be assumed that the IMU measurements represent sufficiently well the specific force $f_{EB}^B$ and the angular rates $\omega_{EB}^B$, respectively. Employing rigid body kinematics and Bortz' orientation vector differential equation, we find the following motion equations:

$$ \dot{p}_{EB}^E = v_{EB}^E, \tag{19} $$

$$ a_{EB}^B = f_{EB}^B + g^B, \tag{20} $$

$$ \dot{v}_{EB}^E = R(q_B^E)\, f_{EB}^B + g^E, \tag{21} $$

$$ q_B^E = q(\boldsymbol{\sigma}) = \begin{pmatrix} \cos(\sigma/2) \\ \dfrac{\boldsymbol{\sigma}}{\sigma}\,\sin(\sigma/2) \end{pmatrix}, \tag{22} $$

$$ \dot{\boldsymbol{\sigma}} = \boldsymbol{\omega}_{EB}^B + \frac{1}{2}\,\boldsymbol{\sigma}\times\boldsymbol{\omega} + \frac{1}{\sigma^2}\left( 1 - \frac{\sigma\sin\sigma}{2\,[1-\cos\sigma]} \right) \boldsymbol{\sigma}\times(\boldsymbol{\sigma}\times\boldsymbol{\omega}). \tag{23} $$

Figure 9. State estimation system design. The direct system state (x) is calculated at a high rate by the strap down algorithm (SDA) using acceleration and gyroscope measurements (a, ω) of the IMU. An EKF uses (time-delayed, relative) measurements to calculate, at a lower rate, the state errors (δ), which are immediately used for state correction.


With the assumption of a nearly constant orientation vector σ between the discrete time steps $T_{k-1}$ and $T_k$, the delta orientation vector can be approximated from Eq. (23) by

$$ \Delta\boldsymbol{\sigma} \approx \boldsymbol{\omega}_{EB}^B\,(T_k - T_{k-1}). \tag{24} $$

Using Eq. (22), the orientation quaternion at time $T_k$ is calculated from the orientation at $T_{k-1}$ by

$$ q_{B_k}^E = q_{B_{k-1}}^E \bullet q(\Delta\boldsymbol{\sigma}). \tag{25} $$

Finally, discretizing Eq. (21) under the assumption of constant acceleration between time steps $T_{k-1}$ and $T_k$, we get

$$ v_{EB_k}^E \approx v_{EB_{k-1}}^E + R(q_{B_{k-1}}^E)\left( a_{EB}^B + \tfrac{1}{2}\,\Delta\boldsymbol{\sigma}\times a_{EB}^B \right)(T_k - T_{k-1}). \tag{26} $$
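A minimal sketch of one discrete SDA step corresponding to Eqs. (24)-(26) (Python/NumPy; the quaternion is stored as [w, x, y, z], and gravity is applied in the world frame, which is equivalent to Eq. (20); helper names are illustrative):

```python
import numpy as np

def quat_from_rotvec(sigma):
    """Eq. (22): unit quaternion from an orientation (rotation) vector."""
    angle = np.linalg.norm(sigma)
    if angle < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate(([np.cos(angle / 2.0)],
                           sigma / angle * np.sin(angle / 2.0)))

def quat_mult(q1, q2):
    w1, x1, y1, z1 = q1
    w2, x2, y2, z2 = q2
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rotmat(q):
    w, x, y, z = q
    return np.array([[1-2*(y*y+z*z), 2*(x*y - w*z), 2*(x*z + w*y)],
                     [2*(x*y + w*z), 1-2*(x*x+z*z), 2*(y*z - w*x)],
                     [2*(x*z - w*y), 2*(y*z + w*x), 1-2*(x*x+y*y)]])

def sda_step(p, v, q, f_b, omega_b, g_e, dt):
    """One strap-down propagation step from T_{k-1} to T_k."""
    d_sigma = omega_b * dt                               # Eq. (24)
    q_new = quat_mult(q, quat_from_rotvec(d_sigma))      # Eq. (25)
    R = quat_to_rotmat(q)
    # Eq. (26), with gravity added in the world frame instead of the body frame
    v_new = v + (R @ (f_b + 0.5 * np.cross(d_sigma, f_b)) + g_e) * dt
    p_new = p + v * dt                                   # Eq. (19), first order
    return p_new, v_new, q_new / np.linalg.norm(q_new)
```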

4.4.2. Extended Kalman Filter

The EKF is used to estimate the indirect system state by processing time-delayed, relative (key frame) measurements coming from the stereo visual odometry system.

For the inertial navigation system error propagation, we employ the following linearized, continuous-time error transition model (Wendel, 2007):

$$
\dot{\delta} = F\,\delta + G\,n =
\begin{pmatrix}
0_{3\times3} & I_3 & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\
0_{3\times3} & 0_{3\times3} & -\lfloor a_{EB}^E \rfloor & -R_B^E & 0_{3\times3} \\
0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & -R_B^E \\
0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\
0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3}
\end{pmatrix} \delta
+
\begin{pmatrix}
0_{3\times3} & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\
R_B^E & 0_{3\times3} & 0_{3\times3} & 0_{3\times3} \\
0_{3\times3} & R_B^E & 0_{3\times3} & 0_{3\times3} \\
0_{3\times3} & 0_{3\times3} & I_3 & 0_{3\times3} \\
0_{3\times3} & 0_{3\times3} & 0_{3\times3} & I_3
\end{pmatrix} n. \tag{27}
$$

The uncertainties in the error propagation for translation and rotation are modeled as additive zero-mean, white Gaussian noise (AWGN). The accelerometer and gyroscope biases are modeled as random-walk processes driven by AWGN. The noise vector n has the spectral density Q, such that

$$ Q = \mathrm{diag}(Q_a, Q_\omega, Q_{b_a}, Q_{b_\omega}), \tag{28} $$

$$ Q_s = E[n_s n_s^T] \;\big|\; s \in \{a, \omega, b_a, b_\omega\}. \tag{29} $$

We use the following (approximated) discretizations of F, G, and Q for the time interval T at time step k, which are calculated as

$$ \Phi_k = e^{FT}, \tag{30} $$

$$ G_k Q_k G_k^T = \tfrac{1}{4}\,(I + \Phi_k)\, G\, Q\, G^T (I + \Phi_k)^T\, T. \tag{31} $$
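These discretizations can be computed directly, for example as in the following sketch (Python with NumPy/SciPy; the matrix names follow Eqs. (27)-(31), while the numerical values are placeholders):

```python
import numpy as np
from scipy.linalg import expm

def discretize(F, G, Q, T):
    """Eqs. (30) and (31): state transition and discrete process noise."""
    I = np.eye(F.shape[0])
    Phi = expm(F * T)                                    # Phi_k = exp(F T)
    Qd = 0.25 * (I + Phi) @ G @ Q @ G.T @ (I + Phi).T * T
    return Phi, Qd

# Example with the 15-state error model and a 12-dimensional noise vector.
F = np.zeros((15, 15)); F[0:3, 3:6] = np.eye(3)          # placeholder coupling
G = np.zeros((15, 12))
G[3:6, 0:3] = np.eye(3);  G[6:9, 3:6] = np.eye(3)
G[9:12, 6:9] = np.eye(3); G[12:15, 9:12] = np.eye(3)
Q = np.eye(12) * 1e-4                                    # placeholder densities
Phi_k, Qd_k = discretize(F, G, Q, T=0.005)
```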

For processing relative (key frame) pose measurements in an optimal way, we use state augmentation by stochastic cloning (Roumeliotis et al., 2002). In its original form, time delays are not considered. The arrival of a relative measurement triggers the start of the next one. Hence, the end of the preceding relative measurement coincides with the start of the new one. The state at the end of a relative measurement is augmented to the state vector so that it can be referenced by the succeeding measurement.

Nevertheless, considering systems with fast dynamics, the time span between the real end of a relative measurement and the arrival of the measurement data can be unacceptably high due to communication and processing time delays. To compensate for these delays and to achieve a nondelayed, optimal state estimate, we trigger the filter state cloning by sensor hardware signals indicating the exact time of a measurement. When the delayed relative measurement arrives, it can reference the exact start and end states of the measurement and correct all states, including the augmentations. The filter framework can directly process all possible combinations of time-delayed, overlapping, and relative measurements at the time of arrival. This results in a delay-free system state estimate available at any time, which is crucial for stable system control.

Furthermore, it is possible, depending on the available processor resources, to hold augmented states in the filter instead of deleting them directly after the arrival of the measurement data. If a relative state sensor is able to recognize a measurement start or end point that is augmented in the filter, the proposed navigation filter is equivalent to an indirect EKF position SLAM, but it can compensate for measurement time delays. We use this mechanism to hold a fixed number of augmented poses corresponding to stereo odometry key frames.

For a general representation of filter augmentation, we use the following terms: main state for the estimated states at the current time and augmented states for the rest of the state vector. It is not necessary to augment the whole main state for the processing of relative measurements, but only the parts to which a relative measurement refers. In general, state augmentation and removal for the direct and indirect states at time k can be written as

$$ \breve{x}_k = S_k\, x_k, \tag{32} $$

$$ \breve{\delta}_k = S_k\, \delta_k, \tag{33} $$

where $S_k$ is a state selection matrix with dimension $(n + a_k + a) \times (n + a_k)$, n denotes the size of the main state vector, $a_k$ is the number of initially augmented states at time k, and a is the number of states to augment or remove from the state vector.


For the augmentation of a part of the main state to the state vector, in between the main state and the already augmented states, $S_k$ can be written as

$$ S_k = \begin{pmatrix} I_{n\times n} & 0_{n\times a_k} \\ I_{a\times n} & 0_{a\times a_k} \\ 0_{a_k\times n} & I_{a_k\times a_k} \end{pmatrix}, \tag{34} $$

where $I_{i\times i}$ is the i × i identity matrix, $I_{i\times j}$ denotes an identity matrix containing only the rows that correspond to states that should be augmented, and $0_{a_k\times n}$ is the $a_k \times n$ zero matrix. To remove an augmentation from the state vector, $S_k$ can be written as

$$ S_k = I_{(n+a_k+a)\times(n+a_k)}, \tag{35} $$

where a is negative and I is an identity matrix of size $(n + a_k) \times (n + a_k)$ with the rows removed that correspond to the states that should be removed.

With this notation, the augmented/deaugmented state covariance matrix can be written as

$$ \breve{P} = E[\breve{\delta}_k \breve{\delta}_k^T] = E[S_k \delta_k \delta_k^T S_k^T] = S_k\, E[\delta_k \delta_k^T]\, S_k^T = S_k\, P\, S_k^T. \tag{36} $$

As we have a closed-loop error state space filter representation, only the filter covariance is involved in the prediction step. The augmented error propagation matrix ($\Phi_{\mathrm{aug},k}$) and the noise propagation matrix ($G_{\mathrm{aug},k}$) are defined as

$$ \Phi_{\mathrm{aug},k} = \mathrm{diag}\bigl(\Phi_k,\; I_{a_k^+ \times a_k^+}\bigr), \tag{37} $$

$$ G_{\mathrm{aug},k} = \begin{pmatrix} G_k \\ 0_{a_k^+ \times n} \end{pmatrix}, \tag{38} $$

where $a_k^+$ is the number of augmented states at the time of prediction k. The filter covariance prediction can be realized by the standard Kalman filter prediction step:

$$ P_{k+1}^- = \Phi_{\mathrm{aug},k}\, P_k^+\, \Phi_{\mathrm{aug},k}^T + G_{\mathrm{aug},k}\, Q_k\, G_{\mathrm{aug},k}^T, \tag{39} $$

where $P_{k+1}^-$ is the a priori error state covariance at time step k + 1 and $P_k^+$ is the a posteriori covariance matrix at time step k. The special form of $\Phi_{\mathrm{aug},k}$ and $G_{\mathrm{aug},k}$ can be exploited in the filter implementation to obtain a prediction step with a complexity rising only linearly with the number of augmented states.
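The block structure can be exploited as in the following simplified illustration (NumPy; not the authors' implementation): only the main-state block and its cross-covariances with the clones have to be touched, so the cost grows linearly with the number of augmented states.

```python
import numpy as np

def predict_covariance(P, Phi, Qd, n):
    """Prediction of Eq. (39) exploiting Phi_aug = diag(Phi, I) and
    G_aug = [G; 0]: the clone-clone block stays untouched."""
    P = P.copy()
    P_mm = P[:n, :n]              # main-state covariance
    P_ma = P[:n, n:]              # cross-covariance main state / clones
    P[:n, :n] = Phi @ P_mm @ Phi.T + Qd
    P[:n, n:] = Phi @ P_ma
    P[n:, :n] = P[:n, n:].T
    return P                      # clone block P[n:, n:] is unchanged
```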

The augmented filter update is realized as a standard EKF update:

$$ K_k = P_k^- H_k^T \bigl( H_k P_k^- H_k^T + R_k \bigr)^{-1}, \tag{40} $$

$$ P_k^+ = (I - K_k H_k)\, P_k^-, \tag{41} $$

$$ \delta = K_k\, y_k, \tag{42} $$

where $K_k$ is the Kalman filter gain, $H_k$ is the measurement matrix, $R_k$ is the measurement noise covariance matrix, and $y_k$ is the measurement residual.

The measurement matrix $H_k$ is of dimension $m \times (n + a_k^+)$, where m is the number of sensor measurements. The first n columns of the $H_k$ matrix correspond to the main state. For time-delayed, relative measurements, the columns of $H_k$ corresponding to the referenced augmented states are filled with the relative measurement matrices.

We use a pseudomeasurement to exploit the gravity vector for roll and pitch stabilization (Wendel, 2007). The measurement equation is given by

$$ a_{EB}^B + R(q_E^B)\, g^E = R(q_B^E)^T \begin{pmatrix} 0 & -g & 0 \\ g & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} \delta\sigma + n_a, \tag{43} $$

where $n_a$ denotes the AWGN of the measurement with variance $R = E[n_a n_a^T]$. The pseudomeasurement is valid as long as the acceleration is dominated by gravity. This can be reflected by a measurement variance depending on the currently commanded flight dynamics.

A general odometry sensor provides noisy position and orientation changes of the sensor between two points in time $T_1$ and $T_2$,

$$ \tilde{x}_{T_1}^{T_2} = \begin{pmatrix} p_{C_1,C_2}^{C_1} \\ q_{C_2}^{C_1} \end{pmatrix} + n_x, \tag{44} $$

where $n_x$ is an AWGN random variable modeling the sensor noise (with the rotational noise in quaternion representation). Position and orientation states at time $T_x$ are augmented to process the measurement. The augmented substates can be written as

$$ \delta_{\mathrm{aug},T_x} = \bigl( \delta p_x^T \;\; \delta\sigma_x^T \bigr)^T, \tag{45} $$

$$ x_{\mathrm{aug},T_x} = \bigl( p_{E,B_x}^{E,T} \;\; q_{B_x}^{E,T} \bigr)^T. \tag{46} $$

Considering the known, static camera-to-IMU calibration $(t_{B,C}^B, R_B^C)$, the estimated sensor delta pose between $T_1$ and $T_2$ is

$$
\hat{x}_{T_1}^{T_2} =
\begin{pmatrix}
R_B^C R_E^{B_1} \bigl[\, p_{E,B_2}^E - p_{E,B_1}^E + (R_{B_2}^E - R_{B_1}^E)\, t_{B,C}^B \,\bigr] \\
q_B^C \bullet q_E^{B_1} \bullet q_{B_2}^E \bullet q_C^B
\end{pmatrix}
=
\begin{pmatrix}
p_{C_1,C_2}^{C_1} \\
q_{C_2}^{C_1}
\end{pmatrix}. \tag{47}
$$

The measurement equations can be derived as

$$ \tilde{x}_{T_1}^{T_2} \ominus \hat{x}_{T_1}^{T_2} = \bigl( H_1 \;\; H_2 \bigr) \begin{pmatrix} \delta_{\mathrm{aug},T_1} \\ \delta_{\mathrm{aug},T_2} \end{pmatrix} + n_x \tag{48} $$


with the measurement submatrices as

$$ H_1 = R_B^C R_E^{B_1} \begin{pmatrix} I_{3\times3} & \bigl\lfloor p_{E,B_2}^E - p_{E,B_1}^E + R_{B_2}^E\, t_{B,C}^B \bigr\rfloor \\ 0_{3\times3} & I_{3\times3} \end{pmatrix}, \qquad H_2 = R_B^C R_E^{B_1} \begin{pmatrix} -I_{3\times3} & -\bigl\lfloor R_{B_2}^E\, t_{B,C}^B \bigr\rfloor \\ 0_{3\times3} & -I_{3\times3} \end{pmatrix}, \tag{49} $$

and the definition of the difference operator $\ominus$ as

$$ \tilde{x}_{T_1}^{T_2} \ominus \hat{x}_{T_1}^{T_2} = \begin{pmatrix} \tilde{p}_{C_1,C_2}^{C_1} - \hat{p}_{C_1,C_2}^{C_1} \\ \bigl( \mathrm{diag}(2\;\,2\;\,2) \;\; 0_{3\times1} \bigr) \bigl( \tilde{q}_{C_2}^{C_1} \bullet \hat{q}_{C_1}^{C_2} \bigr) \end{pmatrix}, $$

which is a subtraction of the position part and, for the attitude part, the delta rotation computed between the two delta orientations, approximated by the error angle vector representation.

The measurement noise $n_x$ is again characterized by its covariance matrix $R_x = E[n_x n_x^T]$, calculated by the odometry algorithm. Rotational noise is parametrized by Euler angles corresponding to $n_x$.

4.4.3. Hard Real-time Requirements

The state estimation framework is used for system control and therefore has to run in hard real-time. The maximum processing time has to be determined and limited to the sampling period. For the introduced estimator, the maximum execution time can be determined for a constant number of states by measuring the execution time of one filter augmentation, one filter prediction, and two updates of constant measurement vector size. Furthermore, we have to add the execution time for one SDA calculation step, the maximum time the direct state vector might be locked by the SDA thread. By limiting the maximum number of states, hard real-time execution can be guaranteed independently of varying measurement time delays.

The maximum number of filter states determines the maximum number of key frames that can be used by the visual odometry system under real-time requirements. The key frame buffer of the visual odometry system and the corresponding augmented states within the EKF framework have to be synchronized to keep the number of state augmentations at a minimum. For synchronization, the visual odometry system sends, with every calculated delta pose measurement at the current time tn, a list of time stamps of the currently buffered key frames, starting at tp1 and ending at tp2. The EKF framework uses this information to remove all state clones and buffered direct states with time stamp t < tp2 that are not in the key frame list. In this way, we limit the number of clones in the filter to a minimum, which is the number of used key frames plus the number of image triggers between tp2 and tn.
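A sketch of this synchronization step (Python; the data structures are illustrative placeholders, not the authors' implementation):

```python
def prune_clones(clones, keyframe_stamps, t_p2):
    """Remove all state clones whose time stamp is older than the newest
    key frame (t_p2) and that are not referenced by any buffered key frame.

    clones: dict mapping time stamp -> cloned filter substate."""
    keep = set(keyframe_stamps)
    return {t: clone for t, clone in clones.items()
            if t >= t_p2 or t in keep}
```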

4.5. Position Tracking Control and Reference Generation

The MAV's position and yaw are controlled based on the motion estimate of the sensor data fusion. For this purpose, a cascaded position tracking controller is implemented, along with a reference generator and a low-level state machine responsible for keeping the system in a known state. The reference generator is used to create smooth position, velocity, acceleration, and yaw commands for the controller to track, based on a list of waypoints. The flown path then consists of straight-line segments between the waypoints.

Good position tracking performance is crucial when transitioning between indoor and outdoor areas, and when flying indoors in tight spaces. Flying in this multitude of environments makes the robot susceptible to a range of time-varying external disturbances. When flying outdoors, especially in the vicinity of buildings, sudden wind gusts can occur. Furthermore, flying through openings and in hallways can introduce air vortices that can be detrimental to the flight performance. To compensate for this, we employ an acceleration-based disturbance observer in the position control loop. Lastly, we do not consider time delays in the controller, as these are handled by the fusion algorithm.

4.5.1. Position Tracking Control

We employ a classical cascaded control structure, depicted in Figure 10, for position tracking of the quadrotor (Hoffmann, Wasl, & Tomlin, 2008). Attitude stabilization and control are achieved by the Asctec Autopilot board using a PD Euler angle controller, which uses its own IMU and attitude estimate. We implemented a tracking position controller on the real-time Gumstix computer, using the current position and attitude estimate provided by the fusion. Internally, the position controller calculates a desired force f on the quadrotor in the fixed world frame. This must be converted to a valid attitude and thrust input u = [φ, θ, r, T]^T for the autopilot, where φ is the desired roll angle, θ is the desired pitch angle, r is the desired yaw rate, and T is the desired thrust.

Figure 10. Block diagram of the position controller with disturbance observer.

Figure 11. Overview of reference trajectory generation and control structure.

Using the autopilot attitude controller has several implications. First, the PD controller cannot achieve perfect tracking, and no feed-forward angular velocity can be sent. Second, the orientation estimate used for control will differ from that of our data fusion. Therefore, we add an integrator for the roll and pitch inputs to compensate for both effects. Lastly, the yaw control input is the desired rate, so yaw control is achieved by a proportional controller.

We want to track a reference position $x_d$, velocity $\dot{x}_d$, acceleration $\ddot{x}_d$, and yaw angle $\psi_d$. This is achieved by a PD feedback controller with acceleration feedforward and a disturbance observer (DOB). The DOB structure originates from Nakao, Ohnishi, & Miyachi (1987) and Youcef-Toumi and Ito (1988), and it has more recently been applied to quadrotors in Kondak, Bernard, Meyer, and Hommel (2007) and Jeong and Jung (2012). Here we assume a simplified model of the quadrotor, $m\ddot{x} = f + h$, where m is the quadrotor mass, $\ddot{x}$ is the acceleration, f is the control input, and h is the external disturbance and model error acting on the system. The disturbance h cannot be used directly due to noise in the acceleration signal and possible high-frequency disturbances that cannot be compensated by the system. Therefore, we filter the pseudomeasurement with a first-order lowpass filter $(T_d s + 1)^{-1}$, where $T_d$ is the filter time constant. Our controller then has the form

$$ f_0 = m\bigl( \ddot{x}_d - K_D\,\dot{\tilde{x}} - K_P\,\tilde{x} \bigr) - \hat{h}, \qquad \dot{\hat{h}} = \frac{1}{T_d}\bigl( m\ddot{x} - f - \hat{h} \bigr), \tag{50} $$

where $\tilde{x} = x - x_d$ denotes the position tracking error and $\hat{h}$ the filtered disturbance estimate.

With this approach, it is easy to tune the response of the disturbance observer, as the time constant is its only parameter. The control inputs must also be saturated to prevent wind-up of the disturbance estimate. Using the observer also removes the need for an integrator in the position controller, greatly simplifying the tuning.
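A discrete-time sketch of this control law with a forward-Euler disturbance observer (Python; the gains, saturation limit, and measured-acceleration input are placeholders):

```python
import numpy as np

class PositionControllerDOB:
    """PD position controller with acceleration feedforward and a
    first-order disturbance observer, cf. Eq. (50)."""

    def __init__(self, mass, kp, kd, td, f_max):
        self.m, self.kp, self.kd, self.td, self.f_max = mass, kp, kd, td, f_max
        self.h_hat = np.zeros(3)          # disturbance estimate
        self.f = np.zeros(3)              # last commanded force

    def update(self, x, v, a_meas, x_d, v_d, a_d, dt):
        e, e_dot = x - x_d, v - v_d       # tracking errors
        f0 = self.m * (a_d - self.kd * e_dot - self.kp * e) - self.h_hat
        # Saturate the command, which also prevents observer wind-up.
        self.f = np.clip(f0, -self.f_max, self.f_max)
        # First-order lowpass of the pseudomeasurement m*a - f (Euler step).
        self.h_hat += dt * (self.m * a_meas - self.f - self.h_hat) / self.td
        return self.f
```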

4.5.2. Reference Generation

The input to the controller is a list of N waypoints $p_1, p_2, \ldots, p_N$. To prevent control jumps and ensure predictable behavior, the list passes through several steps to obtain the reference trajectory, as depicted in Figure 11. First, a waypoint monitor determines whether the current waypoint $p_i$ has been reached. We consider a waypoint reached when the estimated position has been within a radius of tolerance around the waypoint for a minimum amount of time. When the condition is fulfilled, we switch to the next waypoint. Second, the path is a line between two waypoints. We interpolate the path with a varying velocity v. Lastly, the reference trajectory (position $x_d$, velocity $\dot{x}_d$, and acceleration $\ddot{x}_d$) is obtained by an asymptotic Mth-order lowpass filter $\lambda^M/(s+\lambda)^M$, where M > 3 to obtain acceleration. The maximum velocity can be specified by $v_{max}$, and the aggressiveness of the path can be set by the filter parameter λ. By flying to waypoints in this way, transitions between waypoints are smooth. Our approach is similar to that presented in Achtelik, Achtelik, Weiss, and Siegwart (2011).

Our path planner does not plan any trajectory velocity, but instead only sends a list of waypoints to be flown. Therefore, a method to compute the required velocity on the path is needed, for example to fly the path with a constant velocity. However, this might not always be possible. In some situations, such as strong wind gusts that cannot be compensated, the quadrotor can lose track of the reference. To ensure that the reference position does not move too far away from the real MAV position, we change the interpolation velocity according to the current position tracking error. The input $p_d$ to the lowpass filter does not change if the tracking error is above a threshold R. Hence, we change the velocity according to $v(\lVert\tilde{x}\rVert) = \mathrm{sat}\bigl([1 - (\lVert\tilde{x}\rVert/R)]^2\bigr) \cdot v_{max}$. This quadratic function results in near maximum velocity close to the trajectory and slows down only on large errors. Once the error reaches or exceeds the threshold R, the reference stops.
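This velocity scheduling can be written compactly (Python; names are illustrative):

```python
import numpy as np

def reference_velocity(tracking_error, v_max, radius):
    """Scale the interpolation velocity with the position tracking error:
    v = sat([1 - (||e|| / R)]^2) * v_max, zero once the error reaches R."""
    e = np.linalg.norm(tracking_error)
    if e >= radius:
        return 0.0                      # reference stops; the MAV catches up first
    return float(np.clip((1.0 - e / radius) ** 2, 0.0, 1.0)) * v_max
```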

Journal of Field Robotics DOI 10.1002/rob

Page 21: Autonomous Vision-based Micro Air Vehicle for Indoor and Outdoor Navigation

Schmid et al.: Autonomous Vision Based MAV for Indoor and Outdoor Navigation • 557

Figure 12. Data-processing pipeline for map generation.

4.5.3. Low-level State Machine

To ensure that the system is always in a known state, a simple state machine is implemented at the controller level. Most importantly, this prevents any integrators in the controller from updating when the system is not flying, i.e., when the control loop is not closed. It also manages the state of the reference generator, e.g., to reset the reference when the quadrotor has landed and was manually moved to another location during experiments. Additionally, it provides safety against wrong operator actions, e.g., it ensures that the system must first be flying before any paths can be sent. The state machine is described in detail in our previous work (Tomic et al., 2012).

4.6. Mapping, Planning, and Mission Control

The basis for our mobile robot navigation in cluttered environments is a metric representation of the environment. While for ground-based robots a 2.5D representation is usually sufficient, flying systems need a full 3D representation. Only with such a representation can planning through windows or under bridges be realized. Therefore, we use a probabilistic voxel map implemented as an octree, which is provided by the Octomap library (Hornung, Wurm, Bennewitz, Stachniss, & Burgard, 2013). Figure 12 depicts the processing pipeline for map building. Considering a static environment, we decrease the processing time by resampling the depth images at a lower frequency than is available from the FPGA. The resulting depth images are converted to a point cloud. Inserting high-resolution point clouds into a probabilistic voxel map is computationally expensive, as for each point a ray-cast through the map has to be calculated. In the near range, a map voxel may be represented by several points of the point cloud. Therefore, we use a voxel grid prefilter with the same resolution as the voxel map. The resulting, filtered point cloud is inserted into the map by ray-casts using the robot pose calculated by the sensor data fusion.

The path-planning module uses the map to calculate an obstacle-free path from the current robot position to a given waypoint. We cut a horizontal slice out of the octomap at the current altitude of the robot, with the robot's height as the slice thickness. The voxels of this slice are projected down into a 2D horizontal map. We employ the ROS Navfn package, implementing Dijkstra's algorithm, for 2D path planning at the current height of the robot. By running the planner at the frequency of the mapping module, newly appearing obstacles are avoided. The calculated path consists of a list of densely spaced waypoints, discretized in the map grid.

Before commanding the plan to the controller, we first prune the plan to reduce the number of waypoints. Pruning is based on comparing the arc length of a path segment to its straight-line approximation. We iterate through all plan waypoints and compare the arc length between the last accepted waypoint $p_m$ and the currently considered waypoint $p_k$ to the length of a straight line between the two. Hence, we compare the values

$$ L_{\mathrm{arc}} = \sum_{i=m+1}^{k} \lVert p_i - p_{i-1} \rVert, \qquad L_{\mathrm{straight}} = \lVert p_k - p_m \rVert $$

and accept the waypoint $p_{k-1}$ (thereby setting m = k − 1) if the relative difference exceeds a threshold, i.e., if $L_{\mathrm{arc}} > (1 + \varepsilon) L_{\mathrm{straight}}$ for a small ε. By filtering the path in this way, straight paths are approximated by a small number of waypoints, whereas path segments with high curvature (i.e., around obstacles) are represented by a higher number of waypoints. Using the reference generation strategy introduced in Section 4.5, the flight velocity is effectively increased in obstacle-free areas, whereas it is decreased in cluttered environments. Lastly, we calculate the required yaw orientation such that the MAV is always oriented toward the next waypoint, in order to obtain a map in the flight direction.
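A sketch of this pruning rule (Python; a direct transcription of the criterion above, not the authors' code):

```python
import numpy as np

def prune_path(waypoints, eps=0.05):
    """Keep only waypoints where the path bends: accept p_{k-1} as soon as the
    arc length from the last accepted waypoint to p_k exceeds the straight-line
    distance by a relative threshold eps."""
    pts = np.asarray(waypoints, dtype=float)
    pruned = [pts[0]]
    m, arc = 0, 0.0
    for k in range(1, len(pts)):
        arc += np.linalg.norm(pts[k] - pts[k - 1])
        straight = np.linalg.norm(pts[k] - pts[m])
        if arc > (1.0 + eps) * straight:
            pruned.append(pts[k - 1])               # accept the previous waypoint
            m = k - 1
            arc = np.linalg.norm(pts[k] - pts[k - 1])
    pruned.append(pts[-1])                          # always keep the final goal
    return np.array(pruned)
```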

We realize mission control by employing the ROS SMACH library. It facilitates the implementation of complex robot behaviors using the concept of hierarchical state machines.

5. EXPERIMENTS

We verify the introduced system concept in experiments considering several aspects. First, we analyze the influence of vision dropouts as well as measurement time delays and frequency on the quality of the sensor data fusion. To this end, we simulate a quadrotor and its sensors flying highly dynamic maneuvers. Next, the quality and robustness of the stereo odometry inertial navigation is evaluated on a hand-held device with the same hardware configuration as described in Section 3.1.1 and on the navigation box mounted on the quadrotor. Finally, we demonstrate the combined operation of all modules, realizing the autonomous functionalities needed for MAV flight in cluttered indoor/outdoor SAR and disaster management scenarios.


5.1. Influence of Odometry Measurement Frequency and Delays

As presented previously (Schmid et al., 2012), pose measurement frequency and time delays influence the quality of the system state estimation, which is essential for accurate MAV control. To study this effect, we simulated a quadrotor on a predefined trajectory. A virtual IMU measures quadrotor accelerations and angular rates on three axes. The corresponding measurement is modeled as the real measurement plus AWGN and a bias driven by a random-walk process. Additionally, we simulated the IMU barometric sensor as a height measurement plus AWGN. We used noise parameters available from the datasheet of our real IMU. Stereo odometry is simulated as a virtual delta pose sensor. The key frame system was simulated to hold one key frame for one second. The delta pose is disturbed by AWGN with a base standard deviation of σpos = 0.01 m for position and σatt = 0.02 rad for attitude measurements. The standard deviation is varied over time by multiplication with a factor between 1 and 100 to simulate bad lighting conditions or missing environment features. The delta pose measurement is delayed by a variable time delay.

To cover different dynamic flight conditions, the trajectory consists of three parts. In the first part after takeoff (12 s), a short, highly dynamic maneuver is conducted by commanding a flip. Accelerations of up to 3 g are measured. In the second part (16-150 s), a long and slow flight path on a square is commanded. Finally, in the third part (150-300 s), the randomly generated trajectory shows long-term fast dynamics with up to 1 g accelerations in the horizontal plane, corresponding to roll and pitch angles of up to 50°. Velocities go up to 4 m/s. The trajectory and its estimate are depicted in Figure 13. For this example plot, the delta pose sensor runs at f = 9.5 Hz with a measurement time delay d = 100 ms. Figure 14 depicts the corresponding error plot. It can be seen that in phases of uncertain delta pose measurements, the covariances show peaks as expected. The uncertainties for the x and y axes rise slowly, whereas the corresponding velocity errors are bounded to about 3 cm/s. The errors for the z axis are bounded by the simulated barometric height measurement.

We varied the frequency of the simulated key frame odometry sensor from f = 15 to 1 Hz as well as the delays of the measurement arrival from d = 1/f − dt to d = dt with a sampling time of dt = 5 ms. For each dataset, 20 Monte Carlo simulations were performed, resulting in a total of 800 runs. For control applications, we are especially interested in the velocity estimate in the body frame. In the absence of an absolute measurement of the yaw angle, we transformed the ground truth velocities from the simulated world frame to the estimated world frame. In this way, the velocities are directly comparable. Figure 15 depicts the dependency of the mean of the root-mean-square errors (RMSEs) over all runs for position and velocity.

The position RMSE is almost constant for varying measurement delays. The direct delta pose measurement has a strong influence on the position compared to the doubly integrated acceleration. The constant key frame hold time of 1 s has the effect of an absolute position measurement for the period of the key frame. Therefore, the effect of varying measurement time delays with constant frequency on the position estimate is small. From Figure 15(c) it can be seen that the position RMSE decreases exponentially with the frequency of the measurement. The shorter the intervals between the measurement updates, the lower the influence of the doubly integrated accelerations.

In contrast to the position RMSE, the RMSE of the velocity estimate depends linearly on the delay of the measurement update. This can be explained by the fact that there are no direct sensor measurements of velocity; the estimate is calculated by integration of acceleration measurements and indirect corrections via delta poses. The noise of the acceleration sensors therefore has a stronger influence. The longer the measurement time delay, the greater the influence of integrated acceleration noise on the velocity. As seen for the position, the accuracy of the velocity estimate also rises exponentially with the measurement frequency [Figure 15(d)].

5.2. System Validation on a Hand-held Device

For the second experiment, we built the navigation box (see Section 3.1) in the form of a hand-held device (see Figure 16) to validate the system concept on real hardware. By using a hand-held device, we separate the estimation problem from the control problem. In the following, we present some experimental results (Schmid & Hirschmuller, 2013).

We conducted several outdoor experiments using a tachymeter (Leica TP1200) for ground truth validation. Toward that end, we mounted a 360° prism on top of the device. The tachymeter can automatically track the position of the prism with an accuracy of about ±5 mm. We carried the device on a triangular trajectory of about 70 m, as illustrated in Figure 17. Thereby, we considered three different experimental setups:

• Setup 1: To analyze the influence of visual odometry key frames on the estimation accuracy, we disabled key frames, continuously integrating the motion between consecutive images.

• Setup 2: To validate the system configuration used on our MAV, we set the parameters of the visual odometry system to 300 features and 5 key frames.

• Setup 3: Robustness of the state estimation in the case of vision dropouts is essential for the control of MAVs. Therefore, we forced two vision dropouts during the run by holding a hand in front of the cameras. During the first, lasting about 5 s, we conducted a mainly translational motion. During the second, we turned the device quickly by 130° to guarantee losing the connection to all key frames.


Figure 13. Simulated flight with delta pose sensor running at f = 9.5 Hz with a measurement time delay d = 100 ms and a key frame hold of 1 s. Plots from top to bottom: standard deviation of the delta pose measurement; reference position (px,y,z) as dashed, estimated position as solid line; reference velocity (vx,y,z) as dashed, estimated velocity as solid line; reference angles (roll, pitch, yaw) as dashed, estimated angles as solid line. The trajectory includes a flip (t = 12 s), a slow passage (t = 12-150 s), and a highly dynamic passage (t = 150-300 s) with velocities of up to 4 m/s and accelerations of up to 1 g resulting from roll and pitch angles of up to 50°. The scale of the plots' y axes changes at t = 150 s.

For each setup, we conducted four runs. Table IV summarizes the experimental data, while Figure 18 depicts the error plots for each setup. During run 2, the tachymeter tracking was lost but could be recovered. During run 8, the tracking was completely lost, so no ground truth is available. Analyzing the influence of key frames (setup 1 versus setup 2), the visual odometry becomes, as expected, more accurate. During the starting phase, while the device is standing on the table, the effect of key frames can be clearly seen. While there is a position drift in setup 1, the drift is completely compensated in setup 2. It has to be mentioned that the positive effect of key frames during the experimental run is limited because, due to the lack of nearby obstacles, we pointed the stereo cameras mainly to the ground. With a camera-to-ground distance of about 80 cm, only a few key frames can be re-referenced while walking. The ratio of key frames to the total number of frames (KF/TF) is therefore relatively high, at a value of around 50%.


Figure 14. Simulated flight with delta pose sensor running at f = 9.5 Hz with a measurement time delay d = 100 ms and a key frame hold of 1 s. Plots from top to bottom: standard deviation of the delta pose measurement; errors for position (ep,x,y,z) and velocity (ev,x,y,z) and the corresponding estimated 3σ bounds. The estimate covariances rise during phases of bad odometry measurements. The x, y position error rises slowly, while the z position error is bounded by an absolute barometric height measurement. The velocity errors are bounded by the position measurements.

In addition to accuracy, robustness of the state estimate in the case of vision dropouts is essential for the control of inherently unstable MAVs. Setup 3 demonstrates the effect of vision dropouts. During the first dropout of about 5 s with mainly translational motion, the position error of the visual odometry system rises drastically, while the effect is strongly limited for the sensor data fusion estimate. At the second dropout with a fast 130° turn, the orientation is completely lost by the visual odometry, resulting in a fast rise of errors with the distance moved. The orientation was accurately tracked by the IMU during the vision dropout, which results in almost no influence on the estimate.

The sensor data fusion errors show a similar behavior for all three setups. They are all smaller than 1.5 m after a run of 70 m, corresponding to worst-case experimental errors of less than 2.2%. On MAVs, vision dropouts can easily occur during fast maneuvers, provoking motion blur. The introduced visual/inertial navigation system was shown to deliver a robust state estimate in such situations as well. A video of the experiments is available online (Schmid, 2013).

5.3. System Validation on the Quadrotor Platform

We evaluated the quality of the estimation results of the navigation box and the effectiveness of its damped mounting in quadrotor flight experiments. Due to the fast movements of the MAV and the limited dynamics of the tachymeter tracking, we could not provide a ground truth for the flown trajectory. Instead, we started and landed the quadrotor at the same position and used the position difference as a quality criterion.

The manually flown trajectory (see Figure 19) starts outside a building in the center of a gully (radius 0.38 m) surrounded by grass. The quadrotor is flown around the building to the entrance door. After entering the building, the left corridor (corridor 1) is traversed to return to the entrance hall via a laboratory on the right side. From the entrance hall, the quadrotor is flown to the second floor via the staircase. The corridor above corridor 1 is traversed to leave the building through a window close to the starting point. The quadrotor is landed on the center of the gully to close the trajectory loop after a path length of 110 m. A video of the visualized onboard data is available online (Schmid, 2013).


Figure 15. Root-mean-square errors (RMSEs) for varying frequencies and measurement delays. For each frequency/delay combination, the mean of 20 runs is depicted. (a) Position RMSE vs. measurement delay. (b) Velocity RMSE vs. measurement delay. (c) Position RMSE for varying measurement delays and frequencies. (d) Velocity RMSE for varying measurement delays and frequencies.


The trajectory includes several challenges for the navigation solution: Outlier rejection for visual features has to be robust, as leaves on the ground are stirred up by the quadrotor downwash. The lighting conditions between indoor and outdoor areas vary strongly. In poorly textured indoor areas (white walls in the corridors), the visual odometry easily drops out. The building is left through the window at a height of about 4 m above grass with a self-similar texture.

The position estimation errors of ten experimental runs are summarized in Table V. All ten runs resulted in an estimated position loop closure error smaller than 2.20%. Pure visual odometry has a mean loop closure position error of 8.15%, while the fusion with the IMU results in a mean error of 1.27%. The resulting onboard maps are precise enough for path planning and obstacle avoidance. The experiments proved the robustness of the navigation solution in geometrically unconstrained scenarios (no flat-ground or vertical-wall assumptions).

5.4. Autonomous Indoor and Outdoor Flight

We verified the autonomous flight behavior in an additional mixed indoor and outdoor flight using our quadrotor platform (Schmid, Tomic, Ruess, Hirschmuller, & Suppa, 2013). Initially, the quadrotor was placed in the 1.8-m-wide corridor 1 of the building described in Section 5.3.

Figure 16. Navigation box as a hand-held device.

Figure 17. Experimental outdoor trajectory measured by tachymeter (ground truth), visual odometry, and sensor data fusion.

After the commanded takeoff, the operator could select goals on the onboard 3D map transmitted to a ground station. The commanded waypoints led the quadrotor from its starting point through a window to the outside of the building. Figure 21 illustrates the onboard world representation just before crossing the window. After circling the building, the MAV was commanded to reenter via a door and to return through the corridor to its starting point. Appearing obstacles were autonomously avoided due to continuous replanning. The transition from indoor to outdoor and vice versa is challenging in several respects: the lighting conditions change quickly, and usually the visual odometry shows dropouts for several images until the camera shutter is readapted. In contrast to our odometry system, feature-based SLAM methods can easily lose the correspondence connection and need reinitialization. Wind conditions change suddenly between a narrow indoor corridor with self-induced turbulence and a wide outdoor free space with possible wind gusts. In our case, we conducted the indoor/outdoor transition through a 1.25-m-wide window, while the quadrotor has a diameter of 0.77 m. Therefore, the obstacle map has to be accurate enough to find a valid path through the inflated window frame, and the controller has to follow the planned path precisely to prevent collisions.

The experiment was successfully repeated three times. Figure 20(a) shows a 3D reconstruction of the flown area, recomputed offline at a higher resolution of 2.5 cm using only onboard calculated depth images and fused poses. No offline optimization or loop closure was applied. In Figure 20(b), the environment representation is shown as it is calculated onboard and used for path planning and obstacle avoidance.

The experiment showed the feasibility of autonomous quadrotor navigation in unconstrained, mixed indoor/outdoor environments.

Table IV. Outdoor experiments with ground truth. From left to right: number of run, distance of run, number of key frames (KF) divided by total number of images (TF), mean visual odometry position error (Ecam_abs), mean estimation position error (Eest_abs), maximum visual odometry error (Ecam_max), maximum estimation error (Eest_max), final visual odometry position error (Ecam_end), final estimation position error (Eest_end), relative final visual odometry error (Ecam_rel), and relative final estimation error (Eest_rel).

Run  Dist. (m)  KF/TF  Ecam_abs (m)  Eest_abs (m)  Ecam_max (m)  Eest_max (m)  Ecam_end (m)  Eest_end (m)  Ecam_rel (%)  Eest_rel (%)
1    68         1.00   0.64          0.32          1.69          0.89          1.69          0.88          2.46          1.29
2    63         1.00   0.92          0.57          1.66          1.13          1.66          1.10          2.64          1.75
3    66         1.00   0.61          0.39          1.44          0.90          1.32          0.86          2.00          1.30
4    68         1.00   0.71          0.51          1.78          1.49          1.70          1.45          2.51          2.13
5    68         0.42   0.32          0.36          0.81          0.70          0.59          0.63          0.88          0.93
6    69         0.53   0.64          0.58          1.32          1.18          1.27          0.88          1.84          1.27
7    67         0.51   0.48          0.40          1.01          0.95          0.90          0.77          1.35          1.16
8    –          –      –             –             –             –             –             –             –             –
9    70         0.71   8.52          0.57          26.47         1.14          26.47         0.84          38.06         1.21
10   69         0.69   7.66          0.55          20.28         1.86          20.28         0.67          29.22         0.97
11   70         0.71   7.65          0.63          23.25         1.75          23.25         0.97          33.26         1.39
12   70         0.72   8.13          0.38          25.86         1.28          25.85         1.13          37.16         1.62


Figure 18. Error of visual odometry and fusion estimate. Top: visual odometry without key frames and 300 features; middle: visual odometry with five key frames and 300 features; bottom: visual odometry with five key frames and 300 features, forced interruptions marked as solid areas.

The navigation solution is accurate enough for safe flight in cluttered environments. Computationally expensive SLAM algorithms can increase global positioning accuracy, especially in the case of loop closures. Nevertheless, they are not necessary for safe navigation. A video of the experiment is available online (Schmid, 2013).

5.5. Exploration in a Coal Mine

We tested the introduced MAV in a possible disaster management scenario, namely a coal mine in Recklinghausen, Germany. We used the waypoint navigation system introduced in Section 5.4 to enable the operator to explore two corridors in the mine. At the end of the second corridor, the quadrotor was commanded back to its starting point. Figures 22(a)–22(c) show the experimental setup with the two corridors and depict the onboard map and the flight path of about 100 m. A video of the experiments is available online (Schmid, 2013).

The environment is challenging in several aspects: First, the lighting conditions below ground were poor. Therefore, we equipped the system with camera-synchronized LED flash lights. Second, flying in narrow corridors causes air turbulence, which necessitates a fast control strategy. Third, the downwash of the MAV raises dust, which can be problematic, especially for the vision system. Fourth, the navigation algorithm cannot rely on flat ground or vertical wall assumptions. Finally, heavy machines and generators in the corridors produce a strong magnetic field, which is why magnetic compass measurements are not reliable.

The experiments showed that the introduced navigation system is robust under rough environmental conditions, such as dust and poor lighting, which might occur during search-and-rescue missions. Despite multiple dropouts of the stereo depth images due to dust, the navigation solution could be used for accurate system control and onboard mapping.


Figure 19. Path and onboard map of a manually flown indoor and outdoor trajectory of 110 m: (a) top view, (b) front view, (c) perspective view. Grid resolution is 1 m.

Table V. Manually flown indoor and outdoor trajectory of 110 m. From top to bottom: final visual odometry position error (Ecam_end), relative final visual odometry error (Ecam_rel), final estimation position error (Eest_end), and relative final estimation error (Eest_rel).

Run            1     2     3     4     5      6      7     8      9      10    Mean
Ecam_end (m)   5.38  6.14  7.83  3.50  12.80  13.61  7.88  11.26  16.25  4.95  8.96
Ecam_rel (%)   4.89  5.59  7.12  3.18  11.63  12.37  7.16  10.24  14.77  4.50  8.15
Eest_end (m)   2.42  1.29  1.41  1.31  0.96   1.37   0.89  1.98   1.56   0.76  1.40
Eest_rel (%)   2.20  1.17  1.28  1.19  0.87   1.25   0.81  1.80   1.42   0.69  1.27

6. DISCUSSION

In this section, we summarize key issues and reveal some lessons learned. Furthermore, we discuss the results of the conducted experiments and give an outlook on future work.

6.1. Low-level System Setup, Lessons Learned

As MAVs are inherently unstable, robustness of the computer that is running the controller is essential. Distributing high-level and low-level tasks on different computers improves the robustness of the system, as they can influence each other only by predefined communication interfaces. Especially during controller development, redundancy using two independent autopilots helps a safety pilot to maintain control at all times.

While transferring algorithm tests from simulations to real hardware, we realized that the bulk of the problems that occurred were caused by timing issues. Considering sensor data fusion, the processing of delta measurements makes the algorithm strongly sensitive to small errors in the timing of the task scheduler. The sensitivity to relative timing errors of the system scheduler increases with the frequency of the executed task. Considering the fast dynamics of a flying system, the controller has to run at a high rate compared to ground-based mobile robots. The use of a real-time OS can help to lower the influence of timing errors. Furthermore, using an operating system simplifies the development of complex, multithreaded algorithms such as, for example, our sensor data fusion. Employing the same operating system for RT and non-RT tasks simplifies system maintenance and improves the interoperability of algorithms. As shown in Section 3.2.1, Linux with the PREEMPT_RT kernel patch is suitable as such an operating system, also on embedded ARM-based computers.
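As an illustration of running such a high-rate task under Linux with PREEMPT_RT, the following sketch creates a periodic control thread with a real-time scheduling policy and locked memory; the 200 Hz rate and the priority value are illustrative assumptions, not the parameters of our system:

    #include <pthread.h>
    #include <sched.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <cstdio>

    // Periodic real-time control loop under Linux/PREEMPT_RT (sketch).
    static void* controlLoop(void*) {
      const long period_ns = 5000000L;  // 200 Hz control rate (assumption)
      timespec next;
      clock_gettime(CLOCK_MONOTONIC, &next);
      for (int i = 0; i < 2000; ++i) {
        // ... read the latest state estimate, compute and send motor commands ...
        next.tv_nsec += period_ns;
        while (next.tv_nsec >= 1000000000L) { next.tv_nsec -= 1000000000L; ++next.tv_sec; }
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, nullptr);  // wake at absolute deadline
      }
      return nullptr;
    }

    int main() {
      mlockall(MCL_CURRENT | MCL_FUTURE);  // avoid page faults on the real-time path
      pthread_attr_t attr;
      pthread_attr_init(&attr);
      sched_param param{};
      param.sched_priority = 80;  // high, but below kernel interrupt threads (assumption)
      pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
      pthread_attr_setschedparam(&attr, &param);
      pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
      pthread_t thread;
      if (pthread_create(&thread, &attr, controlLoop, nullptr) != 0)
        std::perror("pthread_create (real-time privileges required)");
      pthread_join(thread, nullptr);
      return 0;
    }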

A further pitfall in distributed systems is sensor data time-stamping. Employing hardware triggers for sensors guarantees an exact registration of measurement times on all systems. Nevertheless, the data themselves have to be time-stamped and associated with the corresponding trigger. Therefore, all communicating systems have to use a common time base for accurate synchronization.
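A common way to realize this is to count trigger pulses on the triggering unit and to match each arriving sensor packet to the trigger timestamp carrying the same counter value; the following sketch illustrates that pattern (class and member names are ours, not from our actual software):

    #include <cstdint>
    #include <map>
    #include <optional>

    // Associates sensor data with hardware-trigger timestamps via a shared
    // trigger counter. All timestamps come from one common time base.
    class TriggerTimestamper {
     public:
      // Called when the trigger unit reports pulse 'seq' taken at time 't_ns'.
      void onTrigger(uint32_t seq, int64_t t_ns) { trigger_times_[seq] = t_ns; }

      // Called when a sensor packet tagged with trigger counter 'seq' arrives
      // (possibly much later, e.g., after image transfer). Returns the exact
      // acquisition time if the matching trigger has already been seen.
      std::optional<int64_t> stampData(uint32_t seq) {
        auto it = trigger_times_.find(seq);
        if (it == trigger_times_.end()) return std::nullopt;
        const int64_t t = it->second;
        trigger_times_.erase(trigger_times_.begin(), ++it);  // drop consumed and older entries
        return t;
      }

     private:
      std::map<uint32_t, int64_t> trigger_times_;  // trigger counter -> time [ns]
    };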

Hence, the timing behavior and flexibility of the communication channel are of great importance. Using standard communication interfaces such as Ethernet, even for the system running the controller, gives great flexibility in the development phase.


Figure 20. (a) Offline 3D reconstruction with 2.5 cm resolution using only onboard calculated egomotion estimates and depth data; (b) onboard computed octomap with 10 cm resolution. The indoor trajectory is marked in green, while the outdoor trajectory is marked in red. Both views are annotated with the start position, the window, the door, and the indoor and outdoor areas.

Figure 21. Stereo depth image, onboard camera view, onboard computed octomap, and flown path before flying through the window.

Figure 22. Autonomous flight in a coal mine: (a) first corridor, quadrotor flying over a hand car; (b) second corridor, flight in a dusty surrounding; (c) visualization of the onboard map (grid resolution 1 m). The operator guides the quadrotor by setting waypoints in the onboard map. Based on the map, an obstacle-free path is calculated onboard. Waypoints are approached autonomously.


Communication latencies can be reduced by applying quality of service (QoS). Standard software network bridges realize transparent communication and debugging of all systems. Changing communication routes from wired to wireless does not affect the development process.

An FPGA extension of the onboard computers can drastically reduce CPU load if used for computationally intensive but parallelizable algorithms. Its power consumption is also lower than that of CPUs or GPUs. Nevertheless, developing complex algorithms for FPGAs is a time-consuming task compared to CPU programming. Hence, the algorithms should be well established and tested before implementing them on an FPGA to limit expensive changes.

Combining all electronics and sensors in a modular unit, separate from the quadrotor frame, has several advantages. The combined weight of all components can be used for vibration damping. In case of crashes, most of the energy is absorbed by the quadrotor frame, while the electronics are protected by the dampers. Furthermore, the box can be easily exchanged or used on different platforms. Mechanical stiffness of the sensor mounting can be realized and maintained more easily as there is no direct connection to load-bearing structures. This is especially important for the calibration of the extrinsic stereo camera configuration, but also for the camera/IMU registration.

The accurate estimation of the spatial alignment between the stereo cameras and the IMU has been shown to be only partially critical. While the rotational alignment significantly influences the accuracy of the motion estimation, an inaccurate translational alignment has only a small impact on it. The poor signal-to-noise ratio of the accelerometers results in small weights for the acceleration measurements in the update step of the navigation filter. Hence, an erroneous translation calibration cannot strongly affect the motion estimation result. This fact is also the reason for the poor observability of this parameter during registration. An improved signal-to-noise ratio, e.g., by applying highly dynamic motions during registration, is beneficial for the estimation result. If a quick spatial alignment is required, the rotation can be computed in closed form as explained in Mair et al. (2011b), while the translation can simply be measured with a ruler. In our experience, the resulting accuracy is sufficient for exploration flights with slow dynamics.
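Using the notation of Appendix A, such a quick calibration yields the rotation q^B_C and the hand-measured translation p^B_BC, which are simply composed into a fixed camera-to-IMU transform; a minimal sketch with Eigen (the numerical values are placeholders, not our calibration):

    #include <Eigen/Geometry>
    #include <iostream>

    int main() {
      // Rotation from the camera frame C to the body/IMU frame B (e.g., from
      // the closed-form solution) and a hand-measured translation; the values
      // here are placeholders only.
      const double kHalfPi = 1.5707963267948966;
      Eigen::Quaterniond q_B_C(Eigen::AngleAxisd(kHalfPi, Eigen::Vector3d::UnitZ()));
      Eigen::Vector3d p_B_BC(0.10, 0.00, -0.05);  // camera position in the body frame [m]

      Eigen::Isometry3d T_B_C = Eigen::Isometry3d::Identity();
      T_B_C.linear() = q_B_C.toRotationMatrix();
      T_B_C.translation() = p_B_BC;

      // Express a point given in the camera frame in the body frame.
      Eigen::Vector3d p_C(1.0, 0.0, 0.0);
      std::cout << "point in B: " << (T_B_C * p_C).transpose() << std::endl;
      return 0;
    }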

6.2. Experiments

Simulation results of Section 5.1 show that the proposed vision-inertial navigation framework delivers robust and consistent system state estimates even during failures of the visual odometry system (VO). Using key frames, the position estimate is locally drift-free. The simulated trajectory shows fast dynamics that are accurately tracked while measurement time delays of the VO are compensated. The introduced sensor data fusion algorithm complies with the requirements of robustness and accuracy for navigation of highly dynamic flying platforms.

Variation of the odometry frequency and delay is shown to have an influence on the state estimation quality. Position accuracy is hardly influenced by delays, while velocity accuracy decreases linearly with increasing delay. On the other hand, position and velocity accuracy increase exponentially with frequency. Considering these results, parallelization of the vision pipeline on an FPGA is a useful means for accuracy improvement. The processing frequency is strongly increased by parallelization, while the delays introduced by pipelining have a comparably small effect on the accuracy.
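The delay compensation discussed here follows the general pattern of buffering past filter states and inputs and reprocessing once a delayed measurement arrives; the following heavily simplified 1D sketch illustrates that pattern with a plain scalar blend instead of our actual filter equations:

    #include <cstdio>
    #include <deque>

    // Much-simplified 1D constant-velocity example of handling delayed
    // measurements: past states are buffered; a measurement stamped in the
    // past corrects the buffered state and the filter is re-propagated.
    struct State { double t, pos, vel; };

    class DelayedFusion1D {
     public:
      DelayedFusion1D() { history_.push_back({0.0, 0.0, 0.0}); }

      void predict(double t, double acc) {
        const State& s = history_.back();
        const double dt = t - s.t;
        history_.push_back({t, s.pos + s.vel * dt + 0.5 * acc * dt * dt, s.vel + acc * dt});
        inputs_.push_back(acc);
        if (history_.size() > 200) { history_.pop_front(); inputs_.pop_front(); }
      }

      // Position measurement taken at time t_meas but arriving only now.
      void updateDelayed(double t_meas, double meas_pos, double gain) {
        size_t k = 0;  // first buffered state at or after the measurement time
        while (k + 1 < history_.size() && history_[k].t < t_meas) ++k;
        history_[k].pos += gain * (meas_pos - history_[k].pos);  // correct the past state
        // Re-propagate the correction forward with the stored inputs.
        for (size_t i = k + 1; i < history_.size(); ++i) {
          const double dt = history_[i].t - history_[i - 1].t;
          const double acc = inputs_[i - 1];
          history_[i].pos = history_[i - 1].pos + history_[i - 1].vel * dt + 0.5 * acc * dt * dt;
          history_[i].vel = history_[i - 1].vel + acc * dt;
        }
      }

      const State& current() const { return history_.back(); }

     private:
      std::deque<State> history_;
      std::deque<double> inputs_;  // inputs_[i] drove the step history_[i] -> history_[i+1]
    };

    int main() {
      DelayedFusion1D fusion;
      for (int k = 1; k <= 100; ++k) fusion.predict(0.01 * k, 0.5);  // 100 Hz prediction
      fusion.updateDelayed(0.78, 0.16, 0.5);                         // measurement delayed by ~0.22 s
      std::printf("pos = %.3f m\n", fusion.current().pos);
      return 0;
    }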

Our visual odometry/IMU fusion system is well suited for autonomous navigation tasks of small mobile robots. The system is demonstrated to be robust and accurate using real sensor data while all information is processed onboard. Even in the case of long vision dropouts, the conducted experiments on a hand-held device show a stable and accurate estimation behavior. The position accuracy results are comparable to runs without vision dropouts. The position accuracy using the navigation box as a hand-held device or mounted on the quadrotor is comparable due to the damped mounting.

The introduced system realizes a robust, vision-based MAV autonomy concept for SAR and disaster management scenarios. This is demonstrated by the presented exploration missions in an unconstrained indoor/outdoor environment and in a coal mine. The active LED illumination enables the system to fly in poorly illuminated scenarios such as the coal mine. Furthermore, it reduces the duration of vision dropouts during indoor/outdoor transitions.

6.3. Future Work

There are still many challenges that have to be solved before an MAV can be used in real SAR scenarios. For our platform, one of the most limiting factors is the short flight time of less than 10 min. We are working on prolonging it by lowering the payload weight and developing a more efficient flying platform. Another limitation of our current system is the limited field of view of the camera system. We are working on a new quadrotor design that allows the mounting of cameras with a wider field of view.

Considering path planning, the current 2D planner does not exploit the possibilities of the 3D onboard map. We are working on a full 3D trajectory planner. Furthermore, dynamic obstacles should be handled by reactive collision avoidance.

7. CONCLUSION

We presented an MAV for vision-based autonomous operation in search-and-rescue and disaster management scenarios. The presented quadrotor test platform can navigate in cluttered indoor and outdoor environments. No radio link to a ground station is needed for safe navigation.


All algorithms, including stereo image processing, visual odometry, fusion of odometry and inertial sensor data, control, mapping, and path planning, are realized onboard. With regard to SAR scenarios, complex exploration missions can be accomplished by an operator selecting goals of interest in the onboard map.

Designing autonomous MAVs on the basis of commercially available platforms is challenging and time-consuming. Aspects of computational resources, weight, as well as sensor equipment and placement have to be considered carefully. We delivered in-depth insight into the design of our MAV platform. Our system includes a Core2Duo board for non-real-time tasks and an OMAP3730-based processor board for real-time tasks. Depth image calculation of the onboard stereo camera rig is implemented on an FPGA. Using kernel scheduler benchmarks, we demonstrated that Linux with an applied PREEMPT_RT kernel patch is suitable as a real-time operating system for low-level tasks, and together with an unpatched non-real-time Linux we achieve a unified operating system interface on all used computer boards. Furthermore, we introduced our nonlinear batch-optimization-based algorithms for extrinsic camera-to-IMU calibration, which outperform conventional Kalman-filter-based approaches.

Based on the presented low-level system, we introduced our high-level MAV design. Depth images with a resolution of 0.5 Mpixel are calculated at a rate of 14.6 Hz using an FPGA implementation of semiglobal matching (SGM). The results are used for stereo odometry calculation with key frame support. Our sensor data fusion algorithm combines inertial measurements with key frame odometry. Delays of about 220 ms introduced by the vision pipeline are compensated. The key frame system allows locally drift-free navigation. The estimated system state is used for control and mapping. For mapping, depth images are combined in a probabilistic octree. The onboard map is the basis for path planning and obstacle avoidance.
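The probabilistic octree is an OctoMap (Hornung et al., 2013); a minimal sketch of how depth points, already transformed into the map frame with the fused pose, could be inserted at the 10 cm onboard resolution (points and file name are placeholders, not our mapping code):

    #include <octomap/octomap.h>

    int main() {
      // Probabilistic octree map at 10 cm resolution, as used onboard.
      octomap::OcTree tree(0.1);

      // Points from one depth image, already expressed in the map frame
      // via the fused pose estimate (placeholder values).
      octomap::Pointcloud scan;
      scan.push_back(2.0f, 0.5f, 1.2f);
      scan.push_back(2.1f, 0.4f, 1.1f);

      octomap::point3d sensor_origin(0.0f, 0.0f, 1.0f);  // camera position in the map
      tree.insertPointCloud(scan, sensor_origin);        // ray-casts free and occupied space

      tree.updateInnerOccupancy();
      tree.writeBinary("onboard_map.bt");  // can be inspected with octovis
      return 0;
    }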

We demonstrated the robustness and accuracy of the odometry-based fusion algorithm in Monte Carlo simulations. A highly dynamic quadrotor trajectory including a flip is used to evaluate the influence of odometry dropouts as well as the influence of measurement frequency and delay variations of 1–15 Hz and 0–1 s, respectively. The simulation accuracy results demonstrate that the fusion algorithm fits well to the timing properties of the FPGA vision pipeline. In real system experiments, the odometry/inertial fusion was tested extensively. We demonstrated accurate results using a hand-held device including the same electronic and sensor hardware as employed on our MAV. Forced vision dropouts in the experimental setup demonstrated the robustness of the system. In further experimental scenarios, we used our quadrotor platform to prove the introduced MAV autonomy concept in two exploration settings: a challenging mixed indoor/outdoor scenario and a coal mine.

The next important step is to improve the endurance of the MAVs, e.g., by hardware optimization or cooperation with ground-based robots. We are confident that this is the last big step toward flying platforms becoming helpful and important tools for SAR missions and disaster management scenarios.

ACKNOWLEDGMENTS

The authors thank Michael Suppa and Darius Burschka for the fruitful discussions and their helpful comments on our work. Furthermore, we thank the whole XRotor team and the people in our workshops for contributions to our platform not explicitly mentioned in this article.

Appendix A: MATHEMATICAL NOTATION

The notation and definitions described here will be used in the following paragraphs. Vectors are written in boldface. The following coordinate frames will be used directly, as a succeeding sub- or superscript, but might also be neglected in some equations for the sake of readability:

- E: Earth-fixed frame with the z axis aligned with the gravity vector
- B: Body frame aligned with the IMU coordinate system
- C: Frame of the left camera
- Xk: Frame X at time step k

The following preceding superscripts are used to reference the sensors by which the measurements are acquired:

- C: cameras
- I: IMU

Matrices/Vectors:

- R^Y_X: Rotation matrix from frame X to frame Y
- q^Y_X: Rotation quaternion from frame X to frame Y
- σ^Y_X: Rotation vector from frame X to frame Y in orientation vector representation
- R(q^Y_X), R(σ^Y_X): Rotation matrix of the corresponding orientation quaternion or vector, respectively
- p^X_XY: Translation vector from X to Y expressed in X
- v^X_XY: Velocity of Y relative to X expressed in X
- b_x: IMU bias of sensor x
- a^X_XY: Acceleration of Y relative to X expressed in X
- ^Xω_Y: Rotational velocity measured by sensor X expressed in frame Y
- g^E: Gravity vector

Operators:

- ⌊a⌋: 3 × 3 skew matrix such that ⌊a⌋ b is the cross product of a and b
- diag(X1, X2, ...): diagonal matrix with X1, X2, ... as diagonal elements


- E[n]: mean of the stochastic variable n
- •: quaternion multiplication
- ⊗: Kronecker matrix product
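As a small illustration of the skew-matrix operator, the following Eigen sketch builds ⌊a⌋ and verifies that ⌊a⌋ b equals the cross product of a and b (the function name is ours):

    #include <Eigen/Dense>
    #include <iostream>

    // 3x3 skew-symmetric matrix of a, so that skew(a) * b == a.cross(b).
    Eigen::Matrix3d skew(const Eigen::Vector3d& a) {
      Eigen::Matrix3d S;
      S <<    0.0, -a.z(),  a.y(),
           a.z(),    0.0, -a.x(),
          -a.y(),  a.x(),    0.0;
      return S;
    }

    int main() {
      Eigen::Vector3d a(1.0, 2.0, 3.0), b(-1.0, 0.5, 2.0);
      std::cout << (skew(a) * b).transpose() << std::endl;  // equals a x b
      std::cout << a.cross(b).transpose() << std::endl;     // cross product for comparison
      return 0;
    }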

REFERENCES

Achtelik, M., Achtelik, M., Weiss, S., & Siegwart, R. (2011). Onboard IMU and monocular vision based control for MAVs in unknown in- and outdoor environments. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 3056–3063), Shanghai, China.

Arun, K. S., Huang, T. S., & Blostein, S. D. (1987). Least-squares fitting of two 3-D point sets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5), 698–700.

Bachrach, A., De Winter, A., He, R., Hemann, G., Prentice, S., & Roy, N. (2010). RANGE: Robust autonomous navigation in GPS-denied environments. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 1096–1097), Anchorage, AK.

Banz, C., Blume, H., & Pirsch, P. (2011). Real-time semi-global matching disparity estimation on the GPU. In Workshop on GPU in Computer Vision Applications at the International Conference on Computer Vision (ICCV) (pp. 514–521).

Banz, C., Hesselbarth, S., Flatt, H., Blume, H., & Pirsch, P. (2010). Real-time stereo vision system using semi-global matching disparity estimation: Architecture and FPGA implementation. In IEEE Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS X).

Bleyer, M., Rhemann, C., & Rother, C. (2011). PatchMatch stereo—stereo matching with slanted support windows. In British Machine Vision Conference.

Craig, J. (2006). Introduction to robotics: Mechanics and control, 3rd ed. New York: Addison-Wesley.

Davison, A. J. (2003). Real-time simultaneous localisation and mapping with a single camera. In International Conference on Computer Vision.

De Boor, C. (2001). A practical guide to splines, Vol. 27. Berlin: Springer-Verlag.

Ernst, I., & Hirschmuller, H. (2008). Mutual information based semi-global stereo matching on the GPU. In International Symposium on Visual Computing (ISVC08) (vol. LNCS 5358, Part 1, pp. 228–239), Las Vegas, NV.

Fraundorfer, F., Heng, L., Honegger, D., Lee, G., Meier, L., Tanskanen, P., & Pollefeys, M. (2012). Vision-based autonomous mapping and exploration using a quadrotor MAV. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura, Algarve, Portugal.

Gehrig, S., Eberli, F., & Meyer, T. (2009). A real-time low-power stereo vision engine using semi-global matching. In International Conference on Computer Vision Systems (ICVS) (vol. LNCS 5815, pp. 134–143), Liege, Belgium.

Goldberg, S. B., & Matthies, L. (2011). Stereo and IMU assisted visual odometry on an OMAP3530 for small robots. In IEEE Computer Vision and Pattern Recognition Workshops (pp. 169–176).

Haralick, R., Joo, H., Lee, C.-N., Zhuang, X., Vaidya, V. G., & Kim, M. B. (1989). Pose estimation from corresponding point data. IEEE Transactions on Systems, Man, and Cybernetics, 19(6), 1426–1446.

Harris, C., & Stephens, M. (1988). A combined corner and edge detector. In Proceedings of the 4th Alvey Vision Conference (pp. 147–151).

Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge University Press.

Heng, L., Meier, L., Tanskanen, P., Fraundorfer, F., & Pollefeys, M. (2011). Autonomous obstacle avoidance and maneuvering on a vision-guided MAV using on-board processing. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 2472–2477), Shanghai, China. IEEE.

Hirschmuller, H. (2003). Stereo vision based mapping and immediate virtual walkthroughs. Ph.D. thesis, School of Computing, De Montfort University, Leicester, UK.

Hirschmuller, H. (2008). Stereo processing by semi-global matching and mutual information. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2), 328–341.

Hirschmuller, H., Innocent, P. R., & Garibaldi, J. M. (2002a). Fast, unconstrained camera motion estimation from stereo without tracking and robust statistics. In Proceedings of the 7th International Conference on Control, Automation, Robotics and Vision (pp. 1099–1104), Singapore.

Hirschmuller, H., Innocent, P. R., & Garibaldi, J. M. (2002b). Real-time correlation-based stereo vision with reduced border errors. International Journal of Computer Vision, 47(1/2/3), 229–246.

Hirschmuller, H., & Scharstein, D. (2009). Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9), 1582–1599.

Hoffmann, G. M., Waslander, S. L., & Tomlin, C. J. (2008). Quadrotor helicopter trajectory tracking control. In Proceedings of the AIAA Guidance, Navigation and Control Conference.

Hol, J., Schon, T., & Gustafsson, F. (2010). Modeling and calibration of inertial and vision sensors. The International Journal of Robotics Research, 29(2-3), 231.

Honegger, D., Greisen, P., Meier, L., Tanskanen, P., & Pollefeys, M. (2012). Real-time velocity estimation based on optical flow and disparity matching. In IEEE International Conference on Intelligent Robots and Systems (IROS).

Hornung, A., Wurm, K. M., Bennewitz, M., Stachniss, C., & Burgard, W. (2013). OctoMap: An efficient probabilistic 3D mapping framework based on octrees. Autonomous Robots (pp. 1–18).

Huang, A. S., Bachrach, A., Henry, P., Krainin, M., Fox, D., & Roy, N. (2011). Visual odometry and mapping for autonomous flight using an RGB-D camera. In International Symposium on Robotics Research (ISRR) (pp. 1–16), Shanghai, China.


Jeong, S.-H., & Jung, S. (2012). Experimental studies of a disturbance observer for attitude control of a quad-rotor system. In ICCAS '12 (pp. 579–583).

Kelly, J., & Sukhatme, G. (2009). Visual-inertial simultaneous localization, mapping and sensor-to-sensor self-calibration. In Computational Intelligence in Robotics and Automation (CIRA), 2009 IEEE International Symposium (pp. 360–368). IEEE.

Kondak, K., Bernard, M., Meyer, N., & Hommel, G. (2007). Autonomously flying VTOL-robots: Modeling and control. In ICRA (pp. 736–741).

Larsen, T. D., Andersen, N. A., Ravn, O., & Poulsen, N. K. (1998). Incorporation of time delayed measurements in a discrete-time Kalman filter. In Decision and Control, 1998. Proceedings of the 37th IEEE Conference (Vol. 4, pp. 3972–3977). IEEE.

Lobo, J., & Dias, J. (2007). Relative pose calibration between visual and inertial sensors. The International Journal of Robotics Research, 26(6), 561.

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

Mair, E. (2012). Efficient and robust pose estimation based on inertial and visual sensing. Ph.D. thesis, Technische Universität München.

Mair, E., Fleps, M., Ruepp, O., Suppa, M., & Burschka, D. (2011a). Optimization based IMU camera calibration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, San Francisco. IEEE.

Mair, E., Fleps, M., Suppa, M., & Burschka, D. (2011b). Spatio-temporal initialization for IMU to camera registration. In Robotics and Biomimetics (ROBIO), 2011 IEEE International Conference. IEEE.

Mair, E., Hager, G., Burschka, D., Suppa, M., & Hirzinger, G. (2010). Adaptive and generic corner detection based on the accelerated segment test. In Computer Vision (ECCV), 2010 European Conference (pp. 183–196).

Matthies, L., & Shafer, S. A. (1987). Error modeling in stereo navigation. IEEE Journal on Robotics and Automation, 3(3), 239–248.

Maybeck, P. S. (1979). Stochastic models, estimation, and control, volume 1 of Mathematics in Science and Engineering. Academic.

Mirzaei, F., & Roumeliotis, S. (2007a). A Kalman filter-based algorithm for IMU-camera calibration. In Intelligent Robots and Systems, 2007. IROS 2007. IEEE/RSJ International Conference (pp. 2427–2434). IEEE.

Mirzaei, F., & Roumeliotis, S. (2007b). IMU-camera calibration: Bundle adjustment implementation. Technical report, Department of Computer Science, University of Minnesota.

Nakao, M., Ohnishi, K., & Miyachi, K. (1987). A robust decentralized joint control based on interference estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (vol. 4, pp. 326–331).

Newcombe, R. A., Izadi, S., Hilliges, O., Molyneaux, D., Kim, D., Davison, A. J., Kohli, P., Shotton, J., Hodges, S., & Fitzgibbon, A. (2011a). KinectFusion: Real-time dense surface mapping and tracking. In ISMAR.

Newcombe, R. A., Lovegrove, S., & Davison, A. J. (2011b). DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision.

Nocedal, J., & Wright, S. (2000). Numerical optimization. Springer.

OSADL (2012). Open Source Automation Development Lab eG—Realtime Linux road map. Retrieved June 12, 2013, from https://www.osadl.org/Realtime-Linux.projects-realtime-linux.0.html.

Prautzsch, H., Boehm, W., & Paluszny, M. (2002). Bézier and B-spline techniques. Berlin: Springer-Verlag.

Roumeliotis, S. I., Johnson, A. E., & Montgomery, J. F. (2002). Augmenting inertial navigation with image-based motion estimation. In Proceedings of the IEEE International Conference on Robotics and Automation (pp. 4326–4333), Washington, D.C.

Scharstein, D., & Szeliski, R. (2002). A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. International Journal of Computer Vision, 47(1/2/3), 7–42.

Schmid, K., Tomic, T., Ruess, F., Hirschmuller, H., & Suppa, M. (2013). Stereo vision based indoor/outdoor navigation for flying robots. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan.

Schmid, K. (2013). DLR XRotor webpage, http://mobilerobots.dlr.de/systems/multicopters.

Schmid, K., & Hirschmuller, H. (2013). Stereo vision and IMU based real-time ego-motion and depth image computation on a handheld device. In Proceedings of the IEEE International Conference on Robotics and Automation, Karlsruhe, Germany.

Schmid, K., Ruess, F., Suppa, M., & Burschka, D. (2012). State estimation for highly dynamic flying systems using key frame odometry with varying time delays. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 2997–3004), Vilamoura, Algarve, Portugal.

Shen, S., Michael, N., & Kumar, V. (2011). Autonomous multi-floor indoor navigation with a computationally constrained MAV. In Proceedings of the IEEE International Conference on Robotics and Automation, March (pp. 20–25), Shanghai, China. GRASP Laboratory, University of Pennsylvania, Philadelphia, IEEE.

Shen, S., Michael, N., & Kumar, V. (2012). Stochastic differential equation-based exploration algorithm for autonomous indoor 3D exploration with a micro-aerial vehicle. The International Journal of Robotics Research, 31(12), 1431–1444.

Steingrube, P., Gehrig, S., & Franke, U. (2009). Performance evaluation of stereo algorithms for automotive applications. In International Conference on Computer Vision Systems (ICVS) (vol. LNCS 5815), Liege, Belgium.

Stelzer, A., Hirschmuller, H., & Gorner, M. (2012). Stereo-vision-based navigation of a six-legged walking robot in unknown rough terrain. International Journal of Robotics Research: Special Issue on Robot Vision, 31(4), 381–402.


Strobl, K., Sepp, W., Fuchs, S., Paredes, C., & Arbter, K. (2005). DLR CalLab and DLR CalDe, http://www.robotic.dlr.de/callab/.

Sturm, J., Burgard, W., & Cremers, D. (2012). Evaluating egomotion and structure-from-motion approaches using the TUM RGB-D benchmark. In Proceedings of the Workshop on Color-Depth Camera Fusion in Robotics at the International Conference on Intelligent Robots and Systems (IROS).

Tomic, T., Schmid, K., Lutz, P., Domel, A., Kassecker, M., Mair, E., Grixa, I., Ruess, F., Suppa, M., & Burschka, D. (2012). Toward a fully autonomous UAV: Research platform for indoor and outdoor urban search and rescue. IEEE Robotics & Automation Magazine, 19(3), 46–56.

Tsai, R., & Lenz, R. (1989). A new technique for fully autonomous and efficient 3D robotics hand/eye calibration. IEEE Transactions on Robotics and Automation, 5(3), 345–358.

Weiss, S., Achtelik, M., Chli, M., & Siegwart, R. (2012). Versatile distributed pose estimation and sensor self-calibration for an autonomous MAV. In Proceedings of the IEEE International Conference on Robotics and Automation, St. Paul, MN.

Wendel, J. (2007). Integrierte Navigationssysteme: Sensordatenfusion, GPS und inertiale Navigation [Integrated navigation systems: Sensor data fusion, GPS, and inertial navigation]. Oldenbourg.

Yoon, K.-J., & Kweon, I.-S. (2006). Adaptive support-weight approach for correspondence search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 650–656.

Youcef-Toumi, K., & Ito, O. (1988). A time delay controller for systems with unknown dynamics. In Proceedings of the American Control Conference (pp. 904–913), Atlanta.

Zabih, R., & Woodfill, J. (1994). Non-parametric local transforms for computing visual correspondence. In Proceedings of the European Conference on Computer Vision (pp. 151–158), Stockholm, Sweden.
