
PVS: A System For Large Scale Outdoor Perception Performance Evaluation

Cristian Dima1, Levi Lister1, Joan Campoy1, Carl Wellington1, Carlos Vallespi1, Boyoon Jung2, Michio Kise2, Zachary Bonefas2 and Stewart Moorehead2

Abstract— This paper describes the motivation, design and implementation of the Perception Validation System (PVS), a system for measuring the performance of outdoor perception systems for autonomous vehicles. The system relies on using large amounts of real world data and ground truth information along with customizable metrics in order to quantify performance aspects such as the rate of false positive or negative detections of an obstacle detection system. Our system relies on a relational database infrastructure to achieve a high degree of flexibility in the type of analyses it can support.

We discuss the main steps required for going from a raw data log of the data produced by the sensors mounted on a vehicle to numerical estimates describing the performance of the perception system, including the generation of ground truth information and one performance metric which we found to be very useful for comparing the perception system’s outputs to the ground truth.

Several examples of the type of information that can be obtained using the Perception Validation System are presented.

I. INTRODUCTION

A. Motivation

Significant progress in the area of outdoor mobile robotics has brought real world applications of autonomous navigation in military, agricultural, mining and transportation domains closer to reality than ever before. In this context, the ability of an autonomous vehicle to perceive and understand the environment in which it needs to operate remains one of the main obstacles on the path to commercialization: without a capable and reliable perception system the ability of a mobile robot to safely and efficiently execute tasks is severely limited.

The robotics community has attacked this obstacle vigorously, deploying an impressive array of techniques and technologies. It is now almost common knowledge that obtaining good perception performance in unstructured outdoor environments requires at least some reliance on machine learning based approaches. We also know that the need to build systems that operate robustly despite significant changes in their environment often leads to perception systems relying on several different sensing modalities and data processing algorithms. While these types of approaches led to significant capability improvements in this area, the complexity of the perception systems being developed has also increased significantly.

1: National Robotics Engineering Center, The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15201, USA. Corresponding author: [email protected]

2: John Deere, Moline, Illinois, USA

Fig. 1. A high level view of the PVS system. The system relies on processing a large amount of pre-recorded raw sensor data with a perception system, and comparing its outputs with a corresponding set of ground truth data using configurable performance metrics. A database system manages the entire process, and supports the necessary tools for performing detailed performance analyses.

The importance of the perception performance levels coupled with the seemingly inescapable complexity of the solutions developed in order to maximize them leads to a fundamental question that needs to be answered before any large scale commercial applications can be considered: how can one understand, describe and make justified guarantees –even in a probabilistic sense– about the performance of a complex perception system that is critical to the safe operation of a robot?

Our belief is that any satisfactory solution to this fundamental question must rely at its core on the ability to measure performance in a relevant and trustworthy manner. We contend that in our context, a performance measurement is rendered relevant and trustworthy by the following characteristics:

• The performance metric being employed must be linked as directly as possible to the actual purpose of the perception system. As an example, a perception system designed for avoiding collisions with trees should be measured primarily by its ability to prevent such collisions while also measuring its impact on the overall productivity of the larger robotic system.

• The raw data on which the measurements are based must be collected in environments and conditions that are directly relevant to the application of interest, and the amount of data used must be sufficient for the purpose of the analysis: it needs to encompass all the major environmental variations that can be expected over the lifetime of the system being developed. As a rule of thumb, we consider a data set to be sufficiently large and representative if it contains several examples of the situations the perception system is expected to function correctly in, under all major environmental variations. Unless this data is included in the tests, one is likely to make wrong assumptions about the generalization properties of the perception system.

This paper describes a system for quantifying the performance level of the perception subsystem of autonomous mobile robots. In addition to facilitating the understanding of the capabilities of the perception system, we have designed the system so that it is also a useful tool for accelerating perception research and development. We will introduce the three main usage models for the Perception Validation System (PVS) by describing the problems it is designed to solve and the type of questions that it can help answer:

1) Determine perception system capabilities. This ability is important for understanding the mapping between capabilities and potential applications. Potential users might be test engineers or marketing experts, who might be interested in determining if a given system is capable enough to support applications that require driving through a farm, an orchard or corn field, a road, surface mines or other environments of interest. Test engineers could use the system to efficiently verify that a given perception system meets the system requirements. For example, they can verify performance in challenging lighting conditions (dawn, dusk or night time) or in the presence of air obscurants (dust, rain, smoke or snow) or compressible ground vegetation such as weeds.

2) Compare multiple perception systems. Measuring the performance values obtained when using different sensor combinations, amounts of computing resources, algorithms or parameters for those algorithms with relatively low effort is a tremendously useful capability for the design and development phases of building a perception system. A developer can make design decisions based on hard data regarding their impact on the performance of the overall system, instead of relying on intuition. For example, the PVS system enables one to measure if for a particular application using a laser range finder is more effective than using stereo vision, if using a higher resolution camera is justified by performance improvements or if using lower powered but more rugged computing hardware has any impact on perception performance. Similarly, one can measure the impact of using a new algorithm, or changing a particular parameter value from A to B.

3) Since the PVS measures perception performance at the highest level, it can also be used to measure the capabilities of an entire perception system at a given moment in time. This enables it to be used as a system-level regression tool that can capture the effects of the often non-trivial interactions that arise in complex perception systems designed for real world environments. A manager can use such regression tools for tracking development progress over the course of a project. A development team might track the exact same values in order to detect software changes that actually result in a degradation of the overall system performance.

B. Related Work

The idea of measuring perception performance has been around for many years, even in the relatively narrow domain of outdoor autonomous navigation. We can conceptually group the previous efforts in the area of perception performance evaluation into system-level and low-level approaches, based on the level at which the measurements are done.

The system level approaches depend on methods for measuring performance at the system level, and have been widely promoted by DARPA1. Between 2000 and 2003, the DARPA sponsored PerceptOR program ([1]) used system level Evaluation Experiments (EEs) in challenging real world environments to differentiate several teams participating in the program. The quantitative metrics employed were based on testing in unrehearsed environments. A government testing team defined obstacle courses by selecting a series of waypoints provided to the participating teams by their GPS coordinates. A performance baseline was established by having an expert human driver drive the courses on an ATV platform similar to those on which the autonomous vehicles were based. A second baseline was established by having a remote operator teleoperate the same ATV platform based on video streaming at 2.5Mb/s from a vehicle-mounted camera. During robot testing, the testing team monitored variables such as the distance traveled by the robot, the duration of the traverse, the median speed, the number of remote operators supporting the traverse, the uplink and downlink data volumes (as an estimate of the amount of autonomy vs. teleoperation), the uplink and downlink data volumes per meter traveled, the number of emergency stops and failed traverses. Over the four years of the program, the autonomous vehicles performed 296 runs, totaling a traveled distance of 130km and a navigation time of 110hr. The testing team was able to aggregate numerous metrics such as the number of emergency stops per km (reduced by a factor of 22 during the course of the program) and the uplink data volume per unit distance (reduced by a factor of 46 over the same period of time), demonstrating the value of the program for accelerating technology development.

A somewhat similar approach was adopted for DARPA’s Learning Applied to Ground Robots (LAGR) program which occurred between 2005 and 2008 ([2], [3]), as a follow-up on PerceptOR geared specifically towards machine learning approaches applied to autonomous navigation. For this program (also structured as a competition), all the participating teams had to use identical vehicles, and all the testing was performed independently by a government team without any members of the competing teams being involved in the tests. A great overview of the LAGR program is provided in [2].

DARPA took the multi-team competition to an extreme during its two off-road “Grand Challenge” events in March 2004 and October 2005 and the “Urban Grand Challenge” event in November 2007 (see [4]). These events were all structured as races, in which all teams had to compete over the same course with the only performance metric being the time required to complete the course while respecting a set of rules primarily designed for the safety of the race. The trade-offs related to choosing particular algorithms, sensing modalities or race strategies were primarily presented in post-race publications by the participating teams based on independent testing or on processing data logs recorded during the race (see [5], [6], [7], [8], [9]).

1: The United States Defense Advanced Research Projects Agency

A final example of robotics autonomy programs which measured performance at the system level comes from DARPA’s UGCV-PerceptOR Integration (UPI) program, designed to achieve a leap in autonomous navigation performance by combining an advanced off-road vehicle platform with novel approaches to the design of perception and vehicle control software. Similarly to PerceptOR, the field tests consisted of performing autonomous traverses following series of waypoints spaced between 0.2-2km in terrains varying from temperate forests to high deserts and presenting obstacles such as high slopes, large boulders and dense vegetation (see [10]).

The program followed aggregate performance measures such as the average travel speed over the course, the number of operator interventions per distance traveled and the total distance traveled. Effort was also dedicated to understanding the impact of using different perception algorithms on the performance of the overall system. Autonomous runs were performed repeatedly using prior data at different resolutions, and using different configurations of the main components of the autonomy software. In [10], the authors define additional metrics in order to document the improvements in navigation performance obtained as a result of applying different machine learning based approaches.

These competitive programs centered around testing in challenging real world environments pushed the autonomous navigation capabilities to the limit and beyond, and the efforts towards measuring performance in an objective manner played a major role in their success. Unfortunately, the exorbitant costs of testing on real robots in different environments put a severe limit on the amount of information that can be extracted by only measuring performance at the system level. Each system configuration to be tested required an independent run over the entire course. The UPI program was perhaps the most advanced in this respect, and yet it was quite limited with respect to the number of configurations that were tested on the vehicle, and the amount of experimentation in different environmental conditions.

An opposite alternative to the system level testing approach is to use standard techniques from the statistical machine learning literature to compare the outputs of the different components of a perception system to ground truth data. Typical performance measurements in this area rely on false positive or false negative detection rates, receiver operating characteristic (ROC) curves, precision/recall curves, confusion matrices and “area under the –ROC– curve” (AUC) measurements (see [11] or any standard texts on statistical learning). The use of such metrics is relatively simple and it makes intuitive sense: one presents the system with pre-labeled examples of data, and then estimates the probability of making different types of mistakes. Although the assumptions required are strong –theoretically the test data needs to come from the same distribution as the data from the real environment where the perception system is to be used–, these measures have become a standard not only inside the machine learning field but also in robotics and computer vision applications.
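As a minimal illustration of the standard metrics mentioned above, the false positive/negative rates and the points of a ROC curve can be computed directly from their definitions. The labels and detector scores below are made-up values, not data from any PVS experiment:

```python
# Sketch: false positive / false negative rates and ROC curve points
# computed from ground truth labels and detector confidence scores.
# All data below is illustrative only.

def rates(labels, predictions):
    """Return (false_positive_rate, false_negative_rate)."""
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    return fp / labels.count(0), fn / labels.count(1)

def roc_points(labels, scores, thresholds):
    """One (FPR, TPR) point per detection threshold."""
    points = []
    for t in thresholds:
        preds = [1 if s >= t else 0 for s in scores]
        fpr, fnr = rates(labels, preds)
        points.append((fpr, 1.0 - fnr))  # TPR = 1 - FNR
    return points

labels = [1, 1, 0, 0, 1, 0]              # 1 = obstacle present in ground truth
scores = [0.9, 0.4, 0.2, 0.7, 0.8, 0.1]  # detector confidence per example
print(rates(labels, [1 if s >= 0.5 else 0 for s in scores]))  # (1/3, 1/3)
```

Sweeping the threshold in `roc_points` traces the usual trade-off: a permissive threshold drives the false negative rate down at the cost of more false positives, and vice versa.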

While flexible in terms of the analyses that can be performed since each component can be tested separately, low-level testing has significant limitations related to the difficulty of producing large amounts of labeled data that is meaningful for the perception tasks to be performed. As an example, in an obstacle detection application we not only want to know the false positive and false negative rates expected for the system based on all of our data, but we want to understand what those rates are for different types of obstacles located at different ranges, in different weather conditions, in different locations relative to the vehicle, etc. For a vision based perception system, having an expert label important images in a dataset and then comparing the outputs of the system to the labels is relatively easy –though still expensive–, but it does not easily capture any of the context aspects discussed above. Another challenge for these standard tools is that in reality, the penalties that apply to perception errors are highly non-uniform: committing perception errors very far from the vehicle should ideally have less impact on the overall perception numbers than errors committed close to the vehicle where the decisions made by the perception system are critical. It is hard to capture such non-uniformities using just the standard measures described above.
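One way to picture such a non-uniform penalty is a range-weighted error count. The sketch below is our own illustration of the idea, not the metric used by the PVS itself; the linear weighting function and the near/far cutoffs are arbitrary assumptions:

```python
# Sketch of a range-weighted error count: errors close to the vehicle,
# where planning decisions are critical, count more than distant ones.
# The weighting function and cutoff distances are illustrative
# assumptions, not the PVS metric.

def range_weight(distance_m, near_m=5.0, far_m=50.0):
    """Weight 1.0 at/inside near_m, decaying linearly to 0.0 at far_m."""
    if distance_m <= near_m:
        return 1.0
    if distance_m >= far_m:
        return 0.0
    return (far_m - distance_m) / (far_m - near_m)

def weighted_error(cells):
    """cells: list of (is_error, distance_from_vehicle_m) tuples."""
    return sum(range_weight(d) for err, d in cells if err)

cells = [(True, 2.0), (True, 27.5), (False, 10.0), (True, 60.0)]
print(weighted_error(cells))  # 1.0 + 0.5 + 0.0 = 1.5
```

Under this scheme an error at 2m counts fully, one at 27.5m counts half, and one at 60m not at all, which is exactly the kind of non-uniformity a plain false positive rate cannot express.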

The system we describe in this paper is meant to fill the gap between the system level testing (objective, measuring critical end performance but extremely expensive and hence typically rough in the type of information extracted) and low-level component testing (flexible in terms of the analyses that can be performed, but challenging in terms of assuring that the right data distributions are used and the right conclusions are drawn). As indicated on Figure 1, the system relies on processing a large amount of pre-recorded raw sensor data with a perception system, and comparing its outputs –generally two dimensional output maps describing the environment around the vehicle– with a corresponding set of ground truth data using configurable performance metrics. A database system manages the entire process, and supports the necessary tools for performing detailed performance analyses.

A somewhat similar approach was recently and independently proposed in the meta-learning area of machine learning by [12]. The authors describe a large scale system for accumulating data sets, algorithms and parameters from the current machine learning literature, and using it to make discoveries about the behavior of different learners. In addition to this, the authors convincingly argue that having such a system available will improve the repeatability of the experimental results presented in the literature, and protect against a false sense of progress that can result from testing new algorithms on relatively small sets of data.

Fig. 2. User interface for using a GPS survey to obtain ground truth information for an obstacle course.

To summarize this section, we have discussed the motivation for developing the PVS system which enables the efficient and detailed exploration/verification of perception capabilities critical for commercial applications of outdoor mobile robotics. We have also argued that being able to easily perform large scale experiments with different variants of a perception system can lead to a significant acceleration of research and development work, and that current performance estimation tools –whether at the system level or at the component level– fall short of providing the necessary capabilities.

In Section II we will discuss the requirements that were considered during the design phase of our system, and in Section III its implementation. Sample analysis results obtained with the PVS are presented in Section IV; our conclusions and future work are presented in Section V.

II. DESIGN REQUIREMENTS

In the introduction we have hinted that two important characteristics of the validation system must be the ease and reasonable cost of collecting large amounts of raw and ground truth data, and the ease of use and flexibility of the analysis tools and the infrastructure used for measuring performance. In this section we present in more detail the main design requirements in these two areas of the system.

A. Data Collection Requirements

• Collecting raw data needs to be fast and simple. For the case of a vehicle which can be both manually and computer-driven, the process should consist of simply having an operator traverse the environment of interest.

• Collecting ground truth data should be cost-effective in terms of the effort required for labeling large amounts of data.

• For a system with limited hardware changes, it should be possible to reuse old data logs such that a large body of real-world data is accumulated over time.

B. Analysis Requirements

• The PVS should support multiple systems, with variations in vehicle platforms, hardware and software configurations, operating environments and intended applications.

• One should be able to perform analyses in an intuitive manner. Users without perception or programming expertise should be able to use the PVS and understand the capabilities of the system being analyzed at a high level. Examples of such potential user groups are quality assurance testing engineers or marketing experts.

• The analysis system should offer very fine control of the configuration of the systems being tested, in order to support the low level analysis needs of the development group. One should be able to measure the impact of very fine changes deep inside the perception system being analyzed.

• Offer very fine control over the set of data logs that are used to measure the performance of the perception system, in order to support analyses focused on different environmental conditions (e.g. illumination, presence or absence of rain or dust), different types of obstacles (e.g. vehicles, people, trees, buildings) and different operating environments (e.g. open fields, roads, orchards).

• The analysis system needs to support processing massive amounts of data, which are typically required in order to draw reliable conclusions regarding the performance in highly variable outdoor environments.

• Since it is virtually impossible to anticipate all the types of analyses that one might want to use the PVS system for, it should be possible to easily change the types of questions one attempts to answer based on the PVS, or the performance metrics.

Collecting huge amounts of data while obtaining ground truth in a cost effective manner, supporting multiple different perception systems and also future types of analyses while at the same time having a user interface that is non-expert friendly yet powerful enough for developers is challenging. The next section will present the implementation of the PVS system, which we believe satisfies the majority of the system requirements.

III. IMPLEMENTATION

A. Raw and Ground Truth Data Collection

Our raw data collection process relies on recording almost all of the data streams produced by the sensors in a perception system: positioning sensors such as IMUs, GPS units and wheel encoders, perception sensors such as laser range finders and video cameras, and information produced by the vehicle platform. In addition to this, with each data log we store meta-data such as the location where the data is recorded and the atmospheric and illumination conditions.
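The per-log meta-data described above can be pictured as a simple record; the field names and values below are our own illustrative assumptions, not the actual PVS schema:

```python
from dataclasses import dataclass, field

# Sketch of the per-log meta-data described in the text. The field
# names and example values are illustrative assumptions only.

@dataclass
class DataLogMetadata:
    log_id: str
    location: str        # e.g. name of the test site
    illumination: str    # e.g. "dawn", "noon", "dusk", "night"
    atmospheric: str     # e.g. "clear", "rain", "dust", "smoke"
    vehicle: str
    sensors: list = field(default_factory=list)

log = DataLogMetadata(
    log_id="run-03",
    location="orchard test course",
    illumination="dusk",
    atmospheric="dust",
    vehicle="tractor-1",
    sensors=["ladar-front", "stereo-pair", "imu", "gps-rtk"],
)
print(log.illumination)  # dusk
```

Recording such attributes alongside each raw log is what later allows analyses to be restricted to, say, only the dusty or low-light data.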

The perception systems we develop are capable of processing previously captured data logs in playback mode and recomputing their outputs using software identical to the configuration used on the vehicle during a real autonomous run. As a result, we can reprocess the same data using different versions of the perception system an unlimited number of times. Current data storage technology makes it easy to record large amounts of raw data at very low cost.

The greatest challenge lies in generating the ground truth information corresponding to the raw data. Our system relies on two methods for generating ground truth information:

1) Surveyed obstacle courses. For many applications, it is possible to construct courses which are highly representative of the type of environment that the robot is expected to perform in. Such courses are typically set up for a particular application, and can be densely populated with the types of obstacles and terrain features a perception system is expected to identify. The data logs recorded while traversing them will contain many examples of situations that are interesting for analyzing perception performance. For example, for a farm application an obstacle course might contain different types of fences, poles, human mannequins, ditches of different sizes and other vehicles.

Once an obstacle course is set up, a hand-held RTK GPS unit is used to survey the contours of each obstacle, which are inserted into a database along with information about the nature and size of each obstacle. The idea behind using a surveyed obstacle course is that although a significant amount of effort goes into setting it up, the ground truth information for any data collected over that course will automatically be available. This is possible if the vehicles used for recording data are also equipped with high accuracy positioning systems –even if those position estimates are not used by the perception system– such that at any moment in time we know where the vehicle is located on the course and what obstacles it should be detecting based on their GPS surveyed contours.

One of the user interfaces we developed for managing the course survey information and inserting it into a database is shown in Figure 2. The advantage of using the surveyed course approach is that massive amounts of data can be collected in a cost-effective manner after the initial investment in the course. This enables us to collect data with a large set of varying parameters such as the environmental conditions, the speed at which the vehicles are driven, the orientation or location of the sensors, etc.
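The lookup step above can be sketched as follows: given the surveyed 2-D contour of each obstacle and the vehicle position at some timestamp, list the obstacles whose contours fall within sensor range. The flat-ground, circular-range model and all names are our own simplifying assumptions, not the PVS implementation:

```python
import math

# Sketch: which surveyed obstacles should the perception system be
# detecting, given the vehicle position? Contours are closed polygons in
# 2-D east/north coordinates. The circular sensor-range model and all
# names/values here are illustrative assumptions.

def point_to_segment_distance(p, a, b):
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == dy == 0.0:
        return math.hypot(px - ax, py - ay)
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def contour_distance(vehicle_xy, contour):
    """Shortest distance from the vehicle to a closed obstacle contour."""
    return min(point_to_segment_distance(vehicle_xy, contour[i],
                                         contour[(i + 1) % len(contour)])
               for i in range(len(contour)))

def expected_obstacles(vehicle_xy, surveyed, sensor_range_m=30.0):
    return [name for name, contour in surveyed.items()
            if contour_distance(vehicle_xy, contour) <= sensor_range_m]

surveyed = {
    "fence-A": [(10, 0), (12, 0), (12, 20), (10, 20)],
    "mannequin-3": [(100, 100), (101, 100), (101, 101), (100, 101)],
}
print(expected_obstacles((0.0, 0.0), surveyed))  # ['fence-A']
```

Running this per timestamp along a recorded vehicle trajectory yields, for every perception output frame, the set of obstacles the system should have reported.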

2) Labeling by human experts. In certain cases it isimpractical to set up a surveyed obstacle course, or it isnecessary to perform an analysis based on data collected in aless constrained fashion in the real application environment.For these situations we have developed the user interfaceand the procedure described in Figure 3, which imposes tworestrictions to vehicle sensor suite:

• It must contain a forward looking camera
• It must contain a range sensing device (stereo camera or laser range finder)2.

This approach leverages the global positioning system of the vehicle and the precise estimates of the 3-D rigid transforms between the vehicle and the camera and the range sensor in order to obtain the locations of the obstacles in a global GPS referenced frame prior to inserting them in an obstacle database.

2 Unless the user is willing to make certain assumptions about the planarity of the operating environment and the possible shapes of the obstacles.

Fig. 3. Obtaining ground truth data from human experts. The user chooses an image offering a good view of an obstacle, and marks the contour of the obstacle. The 3-D points which project inside the obstacle contour in the image plane are marked as obstacles (lower-right). A range filtering tool (upper-right) allows the user to discard points that project inside the obstacle contour but do not physically belong to the object. Using the known position of the vehicle in the GPS global frame one can map the 3-D obstacle points to the global frame as well and insert them into an obstacle database.
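The labeling geometry can be sketched as follows. This hypothetical snippet (not the authors' actual tool) projects 3-D range points into the image with a pinhole model and keeps those that fall inside the hand-drawn contour; the camera matrix `K` and the 4x4 camera-from-vehicle transform are assumed inputs.

```python
import numpy as np

def _inside(u, v, polygon):
    """Ray-casting point-in-polygon test in pixel coordinates."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > v) != (y2 > v) and u < x1 + (v - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def points_in_labeled_contour(points_3d, K, T_cam_from_vehicle, contour_px):
    """Return the 3-D points (vehicle frame) whose pinhole projection
    falls inside the hand-drawn obstacle contour (pixel coordinates)."""
    pts_h = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    pts_cam = (T_cam_from_vehicle @ pts_h.T).T[:, :3]
    keep = []
    for p in pts_cam:
        if p[2] <= 0:            # behind the camera plane: not visible
            continue
        u, v, w = K @ p          # pinhole projection
        if _inside(u / w, v / w, contour_px):
            keep.append(p)
    return np.array(keep)
```

The kept points would then be mapped into the global GPS frame with the vehicle pose, as described above.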

While more labor intensive than the previous ground truth acquisition approach, hand labeling is a good solution for medium and short data logs which are collected in situations where time constraints do not allow for a prior survey of the obstacles. As an alternative to exhaustively labeling all data sets or manually selecting the representative scenes, one could apply the techniques described in [13] to larger data sets and only label those scenes that are different from scenes already represented in the obstacle database.

Regardless of the method used for acquiring ground truth data, the end result is that a database is populated with a list of obstacles corresponding to each driven path, whether on an obstacle course or not. The GPS coordinates of the obstacle contours are stored in the database as well, along with descriptions of the obstacle nature and size.

B. Analysis Support Infrastructure

In order to satisfy the challenging set of requirements for the analysis component of the PVS, we have decided to build the entire system around a relational database that manages almost all of the data related to the validation experiments: the input data logs, the parameters describing the configuration of the perception system to be tested, the parameters describing which components of the perception system are analyzed, the performance metrics to be applied, their parameters, and the results of applying them to the output of the perception system. Our current implementation uses PostgreSQL, a widely used and mature open-source relational database system.

One of the main reasons a relational database engine is the correct choice as the backbone of our validation system is the need for extreme flexibility in the supported queries. Given that the necessary relationships between the different types of data are reflected in the structure of the database, one can always formulate new types of queries even long after the experiments were performed. Furthermore, we keep track of the outputs produced by the perception system several times per second as opposed to just keeping track of the results for an entire run. This gives our system much more flexibility in the type of analyses it can perform, but it results in several tens of thousands of new data entries even for a medium-sized set of data logs. Modern database systems are specifically designed to support huge amounts of data and queries, which makes them a great fit for our needs.
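To make this concrete, here is a heavily simplified, hypothetical schema sketch. It uses SQLite for self-containment (the real PVS runs on PostgreSQL), and the table and column names are invented for illustration, not taken from the actual system.

```python
import sqlite3

schema = """
CREATE TABLE data_log      (id INTEGER PRIMARY KEY, site TEXT, weather TEXT,
                            distance_m REAL);
CREATE TABLE config_set    (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE output_map    (id INTEGER PRIMARY KEY,
                            log_id    INTEGER REFERENCES data_log(id),
                            config_id INTEGER REFERENCES config_set(id),
                            stamp REAL, vehicle_x REAL, vehicle_y REAL);
CREATE TABLE metric_result (id INTEGER PRIMARY KEY,
                            map_id INTEGER REFERENCES output_map(id),
                            metric TEXT, value REAL);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# Because the relationships are explicit, new questions can be asked long
# after the runs, e.g. the average metric value per data log for one
# perception configuration:
query = """
SELECT l.id, AVG(r.value)
FROM metric_result r
JOIN output_map m ON r.map_id = m.id
JOIN data_log   l ON m.log_id = l.id
WHERE m.config_id = ? AND r.metric = ?
GROUP BY l.id;
"""
```

Storing one `output_map` row per perception update (several per second) is what enables the fine-grained, after-the-fact queries described above.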

While the scope of this paper precludes us from describing the inner structure of the database in detail, we will present the main steps involved in going from a set of raw data logs with ground truth information to a set of performance measurements. Unless otherwise specified, all the described functionality is available through a PHP-based web interface to the PostgreSQL server.

(1) Selection of data logs

Our data logs contain meta-data about the environmental and illumination conditions, the data collection site, the length in time and distance traveled, etc. Since the log files and the meta-data are managed by the database, the user is able to search for data logs that fit the intended analysis based on any combination of these characteristics and select the ones that will be used. The interface allows the user to upload new data logs to the database, and the meta-data fields can be easily extended.
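A sketch of such a metadata search, with hypothetical column names (the actual PVS schema is not published):

```python
import sqlite3

# A tiny in-memory metadata table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data_log "
             "(id INTEGER, site TEXT, weather TEXT, distance_m REAL)")
conn.executemany("INSERT INTO data_log VALUES (?, ?, ?, ?)", [
    (1, "orchard", "sunny", 1200.0),
    (2, "orchard", "rain",  800.0),
    (3, "field",   "sunny", 2500.0),
])

# The kind of filtered search the web interface performs: any combination
# of meta-data fields can be used to narrow down the candidate logs.
rows = conn.execute(
    "SELECT id FROM data_log WHERE site = ? AND weather = ? AND distance_m > ?",
    ("orchard", "sunny", 1000.0),
).fetchall()
```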

(2) Selection of perception outputs to monitor

In its current form the PVS relies on the fact that our perception systems produce 2-D grid maps of the environment surrounding the vehicle, and they can be compared to the ground truth information. This assumption is in reality not a strong one: most mobile robots have some representation of their surroundings for the purpose of motion planning or obstacle detection; the PVS can be adapted to other representations in order to support systems that do not use 2-D maps.

Our perception systems have several algorithm modules which can produce different types of maps: ground elevation maps, obstacle, tree, slope or ditch maps (see [14], [15]). The PVS enables the user to select which maps will be captured during log playback and hence what data will be available for analysis. For example, one could choose to capture the obstacle map produced by using both laser range finders and cameras and the one produced only by using laser data, so that the performance benefits obtained by adding vision capabilities are measured. It is also possible to select a subset of interest out of the ground truth obstacles which are recorded in the database, to support analyses focused on a particular capability (e.g. detecting ditches).

(3) Selection of perception software configurations

Any perception system has some parameters. In our case they are all stored in a directory of text files, which together determine the configuration and behaviors of our perception system. Archived versions of parameter sets are managed by the database, and the user can choose to use one of the existing sets, upload a new set of parameters or edit an existing one.

Fig. 4. Web interface for visualizing the output maps captured after the playback stage. The user can see the driven path overlaid on an overhead image and the pairs of output and ground truth maps generated at any point along the path.

As a result, a user with in-depth knowledge of the perception system can have full control over any parameter, while a user with less expertise can use one of the parameter sets already existing in the system.

(4) Running the perception system in playback mode

Once the data logs and the parameters to use are selected, the PVS can start playing back all the selected raw logs and feeding them into the perception system. The perception system produces 2-D grid output maps which are then captured by a PVS specific tool along with the time stamp and the vehicle position at the time the map was generated. This information is stored in the database.

When the log playback process is completed, the PVS reprocesses all the output maps, and it uses the vehicle position information and the ground truth information from the database in order to generate synthetic ground truth maps which are perfectly aligned with each output map. These ground truth maps are directly linked to the output maps for future queries.
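One plausible way to generate such an aligned synthetic ground truth map is to rasterize the stored obstacle contours onto the same grid as the output map. The sketch below assumes axis-aligned grids and polygonal contours; the actual PVS implementation may differ.

```python
import numpy as np

def _inside(x, y, polygon):
    """Ray-casting point-in-polygon test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def rasterize_ground_truth(contours, origin, cell_size, shape):
    """Build a synthetic ground truth grid aligned with an output map.
    `contours` are obstacle polygons in global coordinates, `origin` is the
    global position of cell (0, 0), `cell_size` is meters per cell."""
    grid = np.zeros(shape, dtype=np.uint8)
    for row in range(shape[0]):
        for col in range(shape[1]):
            # Test the center of each cell against every obstacle contour.
            x = origin[0] + (col + 0.5) * cell_size
            y = origin[1] + (row + 0.5) * cell_size
            if any(_inside(x, y, poly) for poly in contours):
                grid[row, col] = 1
    return grid
```

Because the grid origin comes from the logged vehicle position, the resulting ground truth map is cell-for-cell comparable with the captured output map.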

(5) Inspecting perception outputs manually

After the playback stage is completed, the PVS user can visualize the output and the ground truth maps using the interface shown in Figure 4. This is a fast way to get a first glimpse at the comparison between the desired and the actual outputs in a georeferenced context.

(6) Adding, selecting and configuring performance metrics

A very nice feature of the PVS is that it allows users to independently implement new performance metrics which are best for their particular analysis needs and upload them to the PVS server. The only constraints on the metric “modules” are the following:

• they need to be executable from the command line and take two parameters (the output map file and a corresponding ground truth file) as input

• they need to generate their output in XML format

• the user needs to define the type and names of the input parameters for the metrics and the type and names of the output variables
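A minimal metric module obeying these constraints might look like the following sketch. The map file format (`np.loadtxt`) and the XML field names are invented for illustration; only the command-line and XML-output conventions come from the constraints above.

```python
import sys
import xml.etree.ElementTree as ET
import numpy as np

def cell_error_counts(output_map, truth_map):
    """Count false positive / false negative obstacle cells between an
    output map and its aligned ground truth map (1 = obstacle cell)."""
    fp = int(np.sum((output_map == 1) & (truth_map == 0)))
    fn = int(np.sum((output_map == 0) & (truth_map == 1)))
    return fp, fn

def to_xml(fp, fn):
    """Serialize the metric outputs in the required XML format."""
    root = ET.Element("metric", name="cell_errors")
    ET.SubElement(root, "false_positives").text = str(fp)
    ET.SubElement(root, "false_negatives").text = str(fn)
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    # Invoked by the PVS as: metric <output_map_file> <ground_truth_file>
    out = np.loadtxt(sys.argv[1])
    truth = np.loadtxt(sys.argv[2])
    print(to_xml(*cell_error_counts(out, truth)))
```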


Fig. 5. Interface for analyzing the results produced by the Safe Speed metric described in Section III-C.

As a result, metrics can be implemented in any programming language. Typical examples would be metrics that count the number of missed or false obstacle cells, or the Safe Speed metric discussed in Section III-C.

(7) Analyzing perception outputs using performance metrics

Once one or more performance metrics are selected, the PVS system retrieves all the pairs of output and ground truth maps from the database which apply to the current analysis. The metrics are applied to the map pairs and their outputs are aggregated in the database.

(8) Analyzing the experiment results

The PVS provides two ways in which the aggregated performance metric outputs can be analyzed. The first consists of the web interface, which provides simple summaries such as histograms and plots of the outputs of the metrics, along with the possibility to visualize pairs of output and ground truth maps. However, given the huge flexibility offered in terms of the outputs of the performance metrics, it is hard to pre-program any in-depth analysis tools. To address this challenge, the PVS supports a second analysis mode: the users can download through the web interface all the outputs of the performance metrics in various formats such as Matlab files, Excel spreadsheets or CSV files. The users can then employ any software of choice to analyze the data. One such analysis tool is shown in Figure 5.
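For example, an exported CSV of per-map metric values (the column names here are hypothetical) can be summarized with a few lines in any environment:

```python
import csv
import io
import statistics

# Stand-in for a downloaded CSV export: one row per output/ground-truth
# map pair, with the safe speed computed on each.
exported = io.StringIO(
    "log_id,truth_safe_speed,actual_safe_speed\n"
    "1,5.0,5.0\n"
    "1,5.0,2.5\n"
    "2,0.0,0.0\n"
)
rows = list(csv.DictReader(exported))

# Per-pair discrepancy between the ground truth and the actual safe speed.
errors = [float(r["truth_safe_speed"]) - float(r["actual_safe_speed"])
          for r in rows]
print("mean safe speed error:", statistics.mean(errors))
```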

It is important to note that since the PVS system is based on a standard PostgreSQL database, it is possible and relatively simple to develop standalone tools, independent from our web interface, for performing any of the steps above. This has already happened with some of the early testers of our system.

C. Safe Speed: The Performance Metric That Matters

Based on our experience with the PVS, the most useful performance metric we developed is based on the Safe Speed estimate, presented in [14]. We will discuss why this type of metric is more useful than comparing detections between the output and the ground truth maps at the map cell level.

(a) (b)

Fig. 6. Sample analysis results: (a) false positive detections per kilometer in orchard rows with and without weeds; (b) false positive detections per kilometer for a laser-only configuration and one that adds vision-based features.

The concept of safe speed is simple: given an obstacle map of the area surrounding a vehicle, compute the maximum speed at which the vehicle can move while still being able to stop safely for the obstacles present in the map. The computation is non-trivial, but in essence it takes into account the deceleration capability of the vehicle, its shape, the distance to obstacles and their location relative to the intended path. This makes intuitive sense: there is no need to stop for an obstacle that is very far from the vehicle or outside its path, and it is possible to drive faster if the braking distance is shorter. The safe speed makes the vehicle come to a gentle stop as it nears an obstacle, and it reduces speed when it has to drive close to them.
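The full computation from [14] also accounts for vehicle shape and the intended path. As an illustration only, a simplified point-vehicle version reduces to solving reaction distance plus braking distance against the distance to the nearest obstacle on the path; the parameter values below are invented defaults.

```python
import math

def safe_speed(distance_m, decel_mps2=3.0, reaction_time_s=0.5,
               v_max_mps=8.0):
    """Largest speed v such that v * t_r + v^2 / (2a) <= distance to the
    nearest obstacle on the path. Simplified point-vehicle model."""
    d, a, tr = distance_m, decel_mps2, reaction_time_s
    if d <= 0:
        return 0.0
    # Positive root of v^2 / (2a) + tr * v - d = 0.
    v = -a * tr + math.sqrt((a * tr) ** 2 + 2 * a * d)
    return min(v, v_max_mps)
```

A distant obstacle leaves the speed capped at `v_max_mps`; as the distance shrinks, the allowed speed tapers smoothly to zero, matching the gentle-stop behavior described above.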

The main problem with comparing output and ground truth maps cell-to-cell comes from inconsequential detection errors (false positives or false negatives). The performance of most perception systems degrades with distance, and as a result it is often the case that far range obstacle detections are unreliable. However, these errors are also inconsequential: a spurious obstacle detected 50m away from a slow moving vehicle is typically cleared once the vehicle gets closer and it has no effect on any driving decision. Another type of error with very small consequences relates to possible misalignment of the obstacles with the ground truth. The contour of an obstacle can be inaccurately recovered (for example due to occlusions, or small errors in the 3-D reconstruction), yet the behaviour of the system will be virtually unchanged; a perception system that locates all obstacles 5% closer than it should will still stop safely in all the necessary situations.

The safe speed metric is based on comparing the estimates generated by performing the computation over the ground truth map (i.e. the map generated if perception was perfect) and over the actual output map produced by the perception system. The metric results can be aggregated in a plot similar to the one shown in Figure 5. Discrepancies between the ground truth and the actual safe speed typically indicate perception errors. We have developed tools which can analyze these plots and extract statistics about the frequencies and the types of errors encountered by the perception system. Two simple examples are shown in the next section.

IV. SAMPLE ANALYSIS RESULTS

This paper is focused on the Perception Validation System and not on any perception system in particular. As a result, we only present results from two simple experiments designed to showcase the types of analyses one might perform.

In Figure 6 we present the results of using a set of data logs captured with an autonomous vehicle in an orchard environment. For both experiments we use as a metric the number of false positive detections per kilometer traveled, as determined using the safe speed metric.

In Figure 6(a) we compare the false positive frequency of the same perception system when operating in orchard rows where tall weeds are present with the performance observed in rows that had been mowed recently and did not have any weeds. This type of analysis could be performed to determine if a given system is capable of functioning in a work environment where weeds are present.

In Figure 6(b) we compare the performance of two different configurations of a perception system over a set of data logs containing orchard rows with weeds. The two configurations correspond to a system that uses only laser range finder information to detect obstacles, and one that also adds vision based features to the range information. This type of data could be used to analyze trade-offs in autonomous navigation sensor package design.

V. CONCLUSIONS AND FUTURE WORK

We have described a system for measuring the performance of perception systems for autonomous mobile robots. The system is centered around a relational database infrastructure which makes it extremely flexible yet relatively easy to use.

The development of the Perception Validation System addresses two critical needs of the mobile robotics field on the path to commercialization:

• It enables researchers and system developers to make design decisions based on large amounts of data instead of smaller amounts of data and intuition. We believe this will lead to accelerated innovation in our field.

• Perception performance verification is necessary prior to any large scale deployment of autonomous vehicles in the commercial or military domains. This is particularly true in the case of safeguarding systems for large vehicles such as cars, construction and farm equipment.

The primary limitation of the PVS system is related to the cost of acquiring ground truth data, which is still non-trivial. Either setting up obstacle courses or labeling data by hand requires a significant amount of work, especially if done on a large scale. We are planning to develop additional methods for acquiring ground truth data that would make the process less labor intensive.

An additional limitation comes from the fact that the PVS system is designed to use pre-driven paths. As a result, it is useful for measuring perception performance, but in its current form it cannot be used for analyzing the performance of other mobile robot subsystems such as motion planning.

As future work, we intend to continue to acquire data logs for the PVS system and to investigate and address scalability limits when we encounter them. An additional intention is to package the PVS as a standard set of tools in order to make it applicable to any other perception system that adheres to a minimal interface. We believe that this capability would have a significant positive impact on the repeatability and quality of mobile perception research.

VI. ACKNOWLEDGMENTS

The authors gratefully acknowledge the support received from Jose Gonzalez Mora for adapting the hand-labeled ground truth user interface to the needs of the perception validation system. This work was supported by Deere & Co. under contract XXXXXX.

REFERENCES

[1] E. Krotkov, et al., “The DARPA PerceptOR evaluation experiments,” Autonomous Robots, vol. 22, no. 1, pp. 19–35, 2006.

[2] L. D. Jackel, et al., “The DARPA LAGR program: Goals, challenges, methodology, and phase I results,” Journal of Field Robotics, vol. 23, pp. 945–973, 2006.

[3] L. Jackel, D. Hackett, E. Krotkov, et al., “How DARPA structures its robotics programs to improve locomotion and navigation,” Communications of the ACM, vol. 50, no. 11, pp. 55–59, 2007.

[4] G. Seetharaman, A. Lakhotia, and E. P. Blasch, “Unmanned vehicles come of age: The DARPA Grand Challenge,” Computer, vol. 39, no. 12, 2006.

[5] C. Urmson, et al., “High speed navigation of unrehearsed terrain: Red Team technology for Grand Challenge 2004,” Robotics Institute, Carnegie Mellon University, Tech. Rep., 2004.

[6] S. Thrun, et al., “Stanley: The robot that won the DARPA Grand Challenge,” Journal of Field Robotics, vol. 23, no. 9, pp. 661–692, 2006.

[7] M. Montemerlo, et al., “Winning the DARPA Grand Challenge with an AI robot,” in Proceedings of the AAAI National Conference on Artificial Intelligence, 2006, pp. 982–987.

[8] C. Urmson, et al., “A robust approach to high-speed navigation for unrehearsed desert terrain,” Journal of Field Robotics, vol. 23, no. 8, pp. 467–508, 2006.

[9] ——, “Autonomous driving in urban environments: Boss and the Urban Challenge,” Journal of Field Robotics, vol. 25, pp. 425–466, 2008.

[10] J. A. Bagnell, et al., “Learning for autonomous navigation,” IEEE Robotics & Automation Magazine, vol. 17, no. 2, pp. 74–84, 2010.

[11] T. Fawcett, “ROC graphs: Notes and practical considerations for researchers,” Machine Learning, vol. 31, pp. 1–38, 2004.

[12] J. Vanschoren, “Understanding machine learning performance with experiment databases,” Ph.D. dissertation, Katholieke Universiteit Leuven, 2010.

[13] C. Dima and M. Hebert, “Active learning for outdoor obstacle detection,” in Proceedings of the Robotics: Science and Systems Conference, Cambridge, MA, 2005.

[14] S. Moorehead, C. Wellington, H. Paulino, et al., “R-Gator: An unmanned utility vehicle,” in Proceedings of SPIE, vol. 7692, 2010.

[15] D. Johnson, D. Naffin, J. Puhalla, et al., “Development and implementation of a team of robotic tractors for autonomous peat moss harvesting,” Journal of Field Robotics, 2009.