
Towards Automated System and Experiment Reproduction in Robotics

Florian Lier1, Marc Hanheide2, Lorenzo Natale3, Simon Schulz1, Jonathan Weisz4, Sven Wachsmuth1 and Sebastian Wrede1

Abstract— Even though research on autonomous robots and human-robot interaction has made great progress in recent years, and reusable soft- and hardware components are available, many of the reported findings are hardly reproducible by fellow scientists. Reproducibility is usually impeded because required information, such as the specification of software versions and their configuration, required data sets, and experiment protocols, is not mentioned or referenced in most publications. In order to address these issues, we recently introduced an integrated tool chain and its underlying development process to facilitate reproducibility in robotics. In this contribution we instantiate the complete tool chain in a unique user study in order to assess its applicability and usability. To this end, we chose three different robotic systems from independent institutions and modeled them in our tool chain, including three exemplary experiments. Subsequently, we asked twelve researchers to reproduce one of the formerly unknown systems and the associated experiment. We show that all twelve scientists were able to replicate a formerly unknown robotics experiment using our tool chain.

I. INTRODUCTION

Research on autonomous robots and human-robot interaction has achieved great progress over recent years. In order to conduct research in these fields, the development of highly integrated robotics systems that incorporate a broad set of skills into a single architecture or demonstrate exceptional performance in a single domain is a prerequisite. Furthermore, established “off the shelf” robot hardware, simulators and programming frameworks already provide a sound basis for development and reuse across different system setups [7]. On the downside, given the inherent complexity of these systems, which consist of soft- and hardware components and are capable of solving nontrivial tasks, many of the reported results and experiments cannot easily be reproduced by interested researchers and reviewers in order to confirm reported findings [2]. In addition, most current robotics systems are realized by implementing a component-based architecture [5] that integrates not only in-house, but also third-party components from diverse publicly available software repositories such as ROS [15] or OROCOS [6]. Therefore, these components do not necessarily share the same ecosystem, build system, integration model, deployment and execution environment.

1 Bielefeld University (CITEC/CoR-Lab), 33615 Bielefeld, Germany, [flier,sschulz,swachsmu,swrede]@techfak.uni-bielefeld.de

2 University of Lincoln (L-CAS), Lincoln, LN6 7TS, United Kingdom, [email protected]

3 Italian Institute of Technology (IIT), 16163 Genova, Italy, [email protected]

4 Columbia University, NY 10027, United States, [email protected]

This “zoo of tools” introduces additional challenges to (robotics) software development and especially to the reproduction of these systems. In order to pinpoint these challenges, in [10] we introduced four key problem statements with respect to reproducibility: a) information retrieval and aggregation, b) semantic relationships, c) software deployment, and d) experiment testing, execution and evaluation. These challenges will be further explained with a representative example in Section II.

In [10] we also introduced an integrated software tool chain and its underlying development process, the Cognitive Interaction Toolkit (CITK). It supports system developers and experiment designers from an early stage of system development onwards, in order to automate tasks and address the above challenges. Hence, in this contribution we validate the complete tool chain in a practical user study testing three hypotheses: (i) the tool chain is applicable to different ecosystems and facilitates the reproduction of robotic systems as well as associated experiments, (ii) the tool chain is usable for non-expert users, and (iii) the tool chain provides enough information about relevant artifacts of reproduced experiments. To this end we chose three different robotics systems, available in simulation, from three independent and spatially distributed institutions. We modeled all three systems, each including an exemplary experiment, using the CITK.

In the user study, we asked researchers at Bielefeld University to replicate a formerly unknown system using the CITK process and tools. After having potentially replicated the assigned experiment, we asked the researchers to assess the CITK based on a structured set of questions with regard to reproducibility issues identified in the current literature and the tool chain’s usability. To the best of our knowledge, this is the first user study of this kind: the investigation of a reproduction methodology for robotics that incorporates usability metrics and is based on actual “production” systems in a practical user study.

The remainder of this paper is structured as follows: In Section II we introduce the problem statements. In Section III we present related work from different fields of study. In Section IV we briefly recapitulate the concept and tools constituting the CITK. Section V gives an overview of the reproduced systems and example experiments, and Sections VI and VII introduce the study design and results. Finally, we discuss and conclude this contribution in Section VIII.

2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Daejeon Convention Center, October 9-14, 2016, Daejeon, Korea



II. PROBLEM STATEMENTS

A publication and the results presented therein are typically the only starting point for technically reproducing and thus practically verifying reported findings. On the one hand, due to the commonly established peer-review process of journals and conferences, the theoretical (formal or mathematical) soundness of results can be assumed to be assured. On the other hand, the software artifacts, which are the true embodiments of the ideas and contributions, cannot easily be verified and reproduced. Why is that?

Even if the paper of interest includes links or references to source code repositories or other relevant artifacts, it is difficult and time consuming to identify and aggregate all artifacts that are required for the reproduction of a software-intensive robotics system, for instance all required software components, data sets, and their documentation [problem statement a) information retrieval and aggregation]. This is not surprising, since regular publications are not (yet) intended to precisely list and describe this kind of information. A few journals and conferences already offer the opportunity to submit code alongside the corresponding publication, but this does not scale to entire robotics systems. Moreover, in order to practically replicate the system that produced the presented results, it is indispensable to also provide information about the relationships between significant artifacts, e.g., the exact versions (i.e., commit hashes) of software components in combination with the experiment-specific data set (recorded sensory input and actuator output, for instance), hardware or setup variant utilized in the presented study [problem statement b) semantic relationships]. Thus, simply referencing source code repositories or binary downloads is not sufficient. Moreover, the entire software stack must be deployed, which is often non-trivial and time consuming for diverse ecosystems [problem statement c) software deployment]. Linux containers, such as Docker1, are a possible solution to mitigate versioning and system deployment issues. However, there are still more obstacles to overcome. One of them is system and experiment execution. If a download link to a Docker container that contains the system of interest is referenced in a publication, it is still unclear how and what to execute inside the container in order to obtain and verify the published results [problem statement d) experiment testing, execution and evaluation].

Lastly, it is well known that a massive amount of time is spent on robotics experiments because they require significant effort with respect to initial prototyping, system development, integration testing, evaluation and conservation of results. These tasks are usually labour intensive, especially when performed manually. Thus, in our opinion, after completing all the work that is necessary to publish results, it is often hard to find the time and to motivate researchers to “make their work fully replicable”, which implies gathering and providing all the required information described above in a way that is “ready to use”.

1 https://www.docker.com

Tools are required that support researchers in automating these tasks [problem statement d) experiment testing, execution and evaluation].

III. COMPUTATIONAL EXPERIMENTS & ENGINEERING

In this section we present related work from different fields of study with respect to the reproducibility of computational experiments, which include experiments in the robotics domain as well. As a first case in point, Schiaffonati et al. [16] point out that computer science, in contrast to fields like physics or biology, focuses on the abstract concept of computation and therefore on the creation of technological artifacts that implement these concepts. Thus, before reproducing an experiment, the subject of study, software in this case, must often be produced beforehand. Furthermore, Schiaffonati et al. emphasize the inherently different disciplinary nature of computing, which incorporates scientific and, at the same time, engineering efforts.

Not surprisingly, this circumstance opens up additional issues when reflecting on the role of reproducible results in the robotics domain and the way findings are published. Unfortunately, the impact of the engineering aspect is mostly neglected in the current literature. Thus, the provision of “technical artifacts” as a useful instrument for reproduction is to be considered crucial, e.g., a description of the software used, its parameters and configuration. This demand is also discussed by Jill Mesirov [13]. In her work, she discusses experiments whose results were derived from simulation in chemistry, materials science and climate modeling. She claims that the scientists are often skilled and innovative programmers who develop and release sophisticated software packages (which sounds familiar to scientists in the robotics domain). Fellow scientists, who may not be software experts themselves, conduct studies utilizing these tools. Usually, the “standard usage” of these tools varies in the sense that researchers combine them in novel ways. According to Mesirov this provides an enormous advantage. On the downside, the lack of implementation details and parametrization of follow-up experiments poses new challenges for scientific publications, replication and especially for the traceability of intermediate and resulting artifacts. In this context, Schwab et al. [17] assert that publications lacking those technical details are:

“[...] merely the advertisement of scholarship whereas the computer programs, input data, parameter values, etc. embody the scholarship itself.”

Unfortunately, references to the actual source code, parametrization, configurations and related artifacts of the “mashup” that produced this scholarship are usually omitted or unknown and therefore lost, and thus not reproducible. Fortunately, a few approaches tackling these issues can be found in the current literature. Most recently, in “The Real Software Crisis: Repeatability as a Core Value”, Krishnamurthi et al. [8] stated that:

“[...] software artifacts play a central role: they are the embodiments of our ideas and contributions.”


Subsequently, they describe a newly initiated artifact evaluation process that allows authors of accepted papers to submit software alongside different kinds of non-software artifacts, e.g., data sets, test suites, and models that back their findings.

In “Reproducible Research: Addressing the Need for Data and Code Sharing in Computational Science”, Stodden [18] proposed, among many others, the following goals for reproducible research: foster the development of web-based tools to facilitate the use of software repositories and related artifacts; publish code accompanied by software routines that permit testing of the software, including unit tests and/or regression tests; develop tools to facilitate both routine and standardized citation of code and data; and develop deeper communities that maintain code and data, ensure ongoing reproducibility, and perhaps offer tech support to users. Without maintenance, changes beyond an individual’s control (computer hardware, operating systems, libraries, programming languages, and so on) will break reproducibility.

Besides these more general approaches, researchers in the robotics domain also investigate current challenges with regard to reproducibility. In “Defining the Requisites of a Replicable Robotics Experiment”, Bonsignorio et al. [3] list requisites based on an exemplary experiment in visual servoing. These requisites are, amongst others, the provision of a) technical specifications of the robot used, b) technical specifications of the camera with respect to frame rate, resolution, [...], c) a description of the computers used (memory, computation power, operating system, [...]), and d) the software libraries and configuration used. Moreover, Bonsignorio et al. investigated how “Good Experimental Methodology” in diverse fields of robotics must be carried out, towards the publication of findings, in order to help reviewers recognize, and authors write, high-quality publications of replicable experimental and theoretical work. These methods include, for example, that experimental results on real data sets must be publicly available, and that experiments in simulation should be made available including the simulator, the setup used for the experiments, and the set of objects or items that were tested.

Another approach has been realized by the ROS community. Here, scientific reports are linked to their associated engineering components, software packages in this case; the ROS wiki provides dedicated web sites2 for a few publications. These sites aggregate source code, data sets and usage examples to improve the potential of reproducing the results published in the related papers. Lastly, in order to provide a repository of aggregated artifacts, the ROS community has recently released a newly created package index for software incorporated in the ROS ecosystem. This approach was mainly motivated by the question “how do you know what ROS software is available, where to find it, and how to use it?”. To address this issue, a community site3 was created that aims at helping users to search across all ROS repositories at once.

2 http://wiki.ros.org/Papers
3 http://rosindex.github.io/about/

To summarize: the examination of related work from diverse fields clearly indicates that the reproducibility of computational experiments is an important and “hot” topic, and not only in robotics research. The growing notion of software, input data, parameter values, etc. as the embodiment of scholarship itself is shared across disciplines. However, robotics systems introduce additional complexity due to their diversity and the amount of required (often non-standardized) artifacts.

IV. CITK IN A NUTSHELL

Since reproducibility is considered a critical challenge in robotics research, we developed an integrated tool chain that covers the complete development, reproduction, and refinement process of robotics systems: the Cognitive Interaction Toolkit (CITK). The CITK consists of a software tool set that is tied to an underlying artifact model and a software development and reproduction process. This process defines how users, system engineers and empirical scientists interact with our tool chain. It has been inspired by current best practices in research and software development, especially in robotics. While the CITK tool chain and its associated development process are explained in greater detail in [10], the following paragraphs introduce the reader to the basic concepts required in order to assess the study presented in Section VI.

The basis of the CITK is a domain model for reproducible robotics systems. This model is built around the concept of a system version. A system version consists of multiple software component versions (mandatory), the functional building blocks. Versioning is enforced by, e.g., referencing specific source code repository tags, branches or binary release identifiers. Moreover, documentation such as wiki pages, API docs, etc., publications, and obtained data sets are also associated with the system version. An experiment protocol that was adhered to in order to acquire the published results must also be provided along with a system version. Additionally, the system’s software components and their configuration are frequently and automatically built, deployed and tested on a continuous integration (CI) server to provide functional monitoring. If feasible, a simulated experiment must also be executed regularly on a CI server, based on the experiment’s protocol and data sets, to prevent regressions over time.
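To make this domain model more tangible, the following minimal Python sketch illustrates the relations described above (a system version aggregating component versions, documentation, and experiments with protocols and data sets). The class and field names are our own simplification for illustration and do not reflect the actual CITK schema or implementation.

    from dataclasses import dataclass, field
    from typing import List

    # Simplified illustration of the CITK domain model described above;
    # names are ours and do not mirror the toolkit's actual schema.

    @dataclass
    class ComponentVersion:
        name: str
        repository: str
        version: str                 # tag, branch, or binary release identifier

    @dataclass
    class Experiment:
        name: str
        protocol: str                # how the published results were acquired
        data_sets: List[str] = field(default_factory=list)

    @dataclass
    class SystemVersion:
        name: str
        components: List[ComponentVersion]                       # mandatory building blocks
        experiments: List[Experiment] = field(default_factory=list)
        documentation: List[str] = field(default_factory=list)   # wiki pages, API docs, papers

    # Hypothetical example instance:
    system = SystemVersion(
        name="example-system-0.1",
        components=[ComponentVersion("perception", "https://example.org/perception.git", "v0.3")],
        experiments=[Experiment("patrol-demo", "patrol two waypoints within 360 s")],
    )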

From a researcher’s point of view (illustrated by the orange path in Figure 1), the starting point for the reproduction and development process of a robotics system is the source code of her/his software components (Figure 1 (1)). These components are often written in different programming languages and thus make use of diverse build environments: CMake, Maven, Catkin, to name only a few. Managing many different environments either requires convention over configuration and the use of ecosystem-specific tooling (cf. ROS catkin) or is a rather cumbersome and time-consuming task.

Fig. 1: Cognitive Interaction Toolkit: tool chain and work flow

We address this issue by applying a generator-based solution that utilizes minimalistic template-based descriptions of the different components that belong to a system distribution; we call them recipes (Figure 1 (2)). Recipes are written by researchers and describe a software component by inheriting a template for, e.g., CMake-based software, for downloading data set archives, for an experiment description, or for any other publicly available artifact that is required in order to replicate a system. The composition of different recipe types constitutes a distribution file (Figure 1 (3)).
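The actual recipe and distribution file syntax is not reproduced here; the Python sketch below only illustrates the underlying idea, i.e., that a recipe inherits a template for a class of artifacts and that a distribution is the composition of such recipes. All names, URLs and fields are hypothetical.

    # Illustration only: real CITK recipes use their own template-based file format.
    cmake_component = {
        "template": "cmake",              # template knows how to configure, build and install
        "name": "image-processing",
        "repository": "https://example.org/image-processing.git",
        "version": "v1.2",
    }

    data_set = {
        "template": "archive-download",   # template that only fetches and unpacks an archive
        "name": "reference-run-data",
        "url": "https://example.org/reference-run.tar.gz",
    }

    distribution = {
        "name": "example-system-0.1",
        "recipes": [cmake_component, data_set],   # composition constitutes the distribution
    }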

Distribution files are interpreted by a newly implemented, so-called “build-generator” tool. First, the build-generator analyzes all included recipes based on the nature of their inherited template and the downloaded source code. Based on this information, the build-generator automatically derives a) how to build and deploy a component and b) the sequence in which components must be built in order to satisfy inter-component dependencies. Afterwards, the build-generator creates per-component build jobs and uploads them to a running (local or remote) CI server instance (Figure 1 (3)). At this point the researcher is provided with a set of build jobs that reflect her/his system. Additionally, a special build job is created that, if triggered, orchestrates the complete build and deployment process of the system. After all jobs are finished, the system is deployed in the file system (Figure 1 (4)) and is ready to use (Figure 1 (5)). Since setting up a CI server and the required plugins takes time and requires expert knowledge, we provide pre-packaged installations for CITK users. Moreover, we recently introduced the deployment of CITK-based systems using Linux containers. For this, we integrated Docker [12], the leading application for using Linux containers, to create, package, and distribute robotics systems. System descriptions and their meta data, e.g., source code locations, wiki pages, issue trackers, current build status, experiment descriptions, and so forth, are frequently synchronized to a web-based catalog that also implements the CITK data model, providing a globally available, human-readable and searchable platform.
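The dependency handling mentioned above can be illustrated with a topological sort over a component dependency graph; the sketch below is a simplification with hypothetical component names and is not the build-generator's actual implementation.

    from graphlib import TopologicalSorter   # Python >= 3.9

    # Hypothetical dependency graph: each key depends on the listed components.
    dependencies = {
        "robot-bringup": {"middleware", "perception"},
        "perception":    {"middleware"},
        "middleware":    set(),
        "experiment":    {"robot-bringup"},
    }

    # One valid build sequence that satisfies all inter-component dependencies;
    # the real build-generator additionally creates one CI build job per component.
    build_order = list(TopologicalSorter(dependencies).static_order())
    print(build_order)   # e.g. ['middleware', 'perception', 'robot-bringup', 'experiment']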

Besides gathering information about a system and automatically deploying it, successful reproduction also includes repeating tests and experiments. To address this in the context of robotics systems, we suggest transferring the concept of an experiment protocol to the orchestration of software components involved in an experiment, and to execute, test and evaluate software-intensive experiments in an automated manner. To this end, CITK users utilize the FSMT [9] framework. FSMT is a lightweight and event-driven software tool that implements the aforementioned suggestions, formalizing experiment execution with respect to the configuration and orchestration of the invoked software components. It supports automated bootstrapping, evaluation and shutdown of a software system used in an experiment. An FSMT experiment description includes three mandatory parts: a) an environment definition, e.g., executable paths, required environment variables and runtime configurations, b) software component descriptions, such as the path to an executable plus command line arguments (the configuration!), and c) success criteria. After specifying the environment, the components and their success criteria, researchers must provide the execution order of their components. Here, empirical scientists may provide scripts (R, Python) to evaluate the data produced in each run or trial, e.g., to generate plots. These scripts are defined and executed like regular system components once the system has been completely bootstrapped by FSMT and is thus producing data. Existing solutions like workflow engines (e.g., Taverna) are not sufficient here, since they do not provide deterministic execution behavior and mostly do not support component health checks. Moreover, workflow engines are usually data-driven rather than event-driven. This formalization of an experiment protocol allows test results to be reproduced and acquired consistently in an automated fashion. FSMT experiment descriptions and related material such as evaluation scripts are also deployed using the build-generator and the CI server. Furthermore, the required hardware/robots can be connected to the CI server, respectively the software stack, that runs an experiment. Thus, live sensor data can be evaluated directly when running an experiment incorporating hardware-in-the-loop. Alternatively, the CITK tool chain can also be instantiated directly on a robot itself, if feasible (cf. Figure 1).
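As an illustration of the three mandatory parts of such an experiment description and the ordered bring-up of components, consider the following Python sketch. It is a strongly simplified stand-in; FSMT uses its own configuration format and an event-driven execution engine, and all command names shown here are hypothetical.

    import os
    import subprocess

    # Simplified stand-in for an FSMT experiment description (not FSMT's actual format).
    experiment = {
        # a) environment definition
        "environment": {"SIM_SCENE": "lab_scene", "RESULT_DIR": "/tmp/run-001"},
        # b) component descriptions: executable plus command line arguments,
        #    started in the listed order (all commands are hypothetical)
        "components": [
            {"name": "simulator",  "cmd": ["run_simulator", "--scene", "lab_scene"]},
            {"name": "navigation", "cmd": ["run_navigation"]},
            {"name": "evaluation", "cmd": ["python3", "evaluate_run.py"]},
        ],
        # c) success criteria checked in the evaluation phase
        "success": {"log_pattern": "waypoint reached", "timeout_s": 360},
    }

    def bring_up(experiment, dry_run=True):
        """Start all components in the given order; with dry_run=True only print the plan."""
        env = {**os.environ, **experiment["environment"]}
        procs = []
        for comp in experiment["components"]:
            if dry_run:
                print("would start", comp["name"], ":", " ".join(comp["cmd"]))
            else:
                procs.append(subprocess.Popen(comp["cmd"], env=env))
        return procs

    bring_up(experiment)   # a real run would also enforce the timeout and shut down cleanly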

From a user’s perspective (illustrated by the orange path in Figure 1), i.e., someone who is interested in replicating an experiment, the starting point is the experiment representation in the web catalog that has been referenced in a corresponding publication (Figure 1 (1)). There, she or he is directly pointed to a system version.


The first step for the user is to browse the catalog and to load the page that contains a general textual description of the system and meta information. Subsequently, she/he follows the link to the build-generator recipes comprising the system and downloads the required recipes4. Following the provided installation instructions for a pre-packaged CI server, the user bootstraps the CITK environment on her/his computer (Figure 1 (2)). After starting the local CI server instance, the generator is invoked in order to configure the CI server with build jobs for the system. The last step is to start the installation process: the user starts the aforementioned orchestration build job. After this job has finished, the system is deployed on the user’s computer (Figure 1 (3)). Now, the user can focus on executing experiments that have been defined and deployed alongside the system. These experiments are also available as build jobs on the CI server; simply starting the associated build job is enough. Moreover, the web catalog entry for the system version lists all available experiments, and each experiment comes with a description of how to execute it. After an experiment has been executed successfully, the resulting output, e.g., in the form of a plot or textual report, can be found on the user’s computer (as well as all required logs and data files). Besides the FSMT report, the generated plot can be compared to a reference plot from the catalog entry of the experiment (Figure 1 (4)).

4 https://opensource.cit-ec.de/projects/citk/repository

V. TARGET SYSTEMS AND EXPERIMENTS

After having briefly explained the CITK approach, we now describe the systems and example experiments that were modeled using the CITK tool chain. It is noteworthy that the three systems are of different nature when it comes to their deployment strategies and incorporated software parts. Even though the STRANDS system is mostly based on Ubuntu packages, a few components, which utilize the ROS catkin build system, were modeled. The iCub system is consistently built from source using CMake. The Flobi system is probably the most “complex” with regard to its deployment: here, the different software components are built from source using CMake, catkin, Python setuptools, Maven and even some legacy shell scripts. Thus, by instantiating the CITK tool chain, the desired target system can be deployed including all relevant software components, required data sets, and a formalized and executable example experiment.

A. STRANDS

As a system aimed at the long-term deployment of intelligent robotic functionalities in the application domains of security and care, the STRANDS project has a high demand for robustness and, consequently, high software quality, to achieve its objective of deploying robots for autonomous operation over six months. The scientific aims of STRANDS are not only to achieve long-term autonomous behavior of its deployed robots, but to actually exploit the experience these robots gather over such long run-times.

The system realized for replication is based on the con-tinuous patrolling demo, inspired by STRANDS’ security

4https://opensource.cit-ec.de/projects/citk/repository

deployment system, where a SCITOS G5 robot equippedwith two Kinect-type sensors, and a SICK laser scanner,continuously patrols a known environment to a) gather long-term data sets, and b) spot deviations from the normalroutine. The system studied for this paper is composed ofapproximately 30 continuously running ROS components,enabling the robot to localize, plan path utilizing metric andtopological representation, recover from navigation failures,and perceive an environment. The overall area patrolledby the robot is approximately 500m2 in an environmentsimulated using the MORSE simulator.

Task and expected result: initially, the system as described above must be completely and successfully deployed, and the example experiment must be started by the test subject using the CITK tool chain. As soon as the system is running, the robot is requested to patrol two predefined waypoints in the simulated environment within 360 seconds. After the given time period, the system’s functional components are shut down and the evaluation phase starts. In the automated evaluation phase of the FSMT run, the robot’s system logs are processed and evaluated against the predefined success criteria. Only if the waypoints have been reached (“waypoint reached” reported by the waypoint-request component) is the experiment evaluated as successful. Finally, the evaluation results and associated log files are automatically presented to the subject. Based on the available information in the web catalog and the FSMT output, the subject must assess the outcome of the experiment.
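A success check of this kind essentially amounts to scanning the recorded component log for the expected event; the Python sketch below is our own simplified illustration with a hypothetical log path and message format, not the evaluation code shipped with the experiment.

    # Simplified illustration of the waypoint success criterion described above.
    def waypoints_reached(log_path, expected=2):
        """Count 'waypoint reached' events in the component log."""
        with open(log_path) as log:
            hits = sum(1 for line in log if "waypoint reached" in line)
        return hits >= expected

    # Hypothetical log location produced during the FSMT run.
    if waypoints_reached("/tmp/run-001/waypoint_request.log"):
        print("experiment evaluated as successful")
    else:
        print("experiment failed: not all waypoints reached within the time limit")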

B. iCub

In this case, the system investigated for replication is part of the example applications distributed with the iCub software [14]. This application is meant to help new iCub users approach the software. In short, the investigated iCub system replays the complete sensory data of the iCub as it was recorded during a former experiment. In the original experiment the robot was programmed to track a colored object with its gaze. In this example experiment, a tool is executed that replays the data set with the original timing and format. Furthermore, the GUIs for visualizing the original camera images and a 3D representation of the iCub are utilized. The latter is animated starting from the encoder values and includes a representation of the forces at the arms and the inertial feedback from the head IMU.

Task and expected result: again, at first the system must be installed and brought up by the test subject. As soon as the system is running, the currently produced joint values of the iCub’s right arm are collected (via the YARP middleware) and saved in a log file. After exactly 70 seconds all components are shut down. Next, in the automated evaluation phase of the FSMT run, the recorded joint values are processed and evaluated. In this case, the recently recorded joint values are compared to a reference run (distributed with the experiment). If the total axis angle offset between the reference run and the current run for the right arm is smaller than 5 degrees (for the first and last data point in the trajectory), the experiment is evaluated as successful.


Fig. 2: Systems (ltr): STRANDS navigation, iCub ball tracking and Flobi face identification

Finally, the evaluation results and a plot of the compared joint angles are automatically presented to the subject. Based on the available information in the web catalog and the FSMT output, the subject must assess the outcome of the experiment.
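The comparison against the reference run can be illustrated as follows; the data layout and helper functions in this Python sketch are assumptions made for illustration and do not correspond to the evaluation script distributed with the experiment.

    # Illustration of the criterion above: the summed right-arm joint angle offset
    # must stay below 5 degrees for the first and the last data point of the trajectory.
    def total_offset(sample_a, sample_b):
        """Sum of absolute per-joint differences (in degrees) for one data point."""
        return sum(abs(a - b) for a, b in zip(sample_a, sample_b))

    def run_matches_reference(current, reference, threshold_deg=5.0):
        first_ok = total_offset(current[0], reference[0]) < threshold_deg
        last_ok = total_offset(current[-1], reference[-1]) < threshold_deg
        return first_ok and last_ok

    # Toy data: two short trajectories of right-arm joint angles (degrees).
    reference = [[0.0, 10.0, -5.0], [1.0, 12.0, -4.5]]
    current = [[0.5, 10.5, -5.5], [1.2, 12.5, -4.0]]
    print(run_matches_reference(current, reference))   # True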

C. Flobi

The robot Flobi was designed to facilitate social interaction: its exterior allows facial expression and emotional feedback, while at the same time avoiding the uncanny valley effect through a comic-style design [11]. The robot’s neck features three degrees of freedom (DOF) for pan, tilt, and roll motion. In order to facilitate a variety of facial expressions, a total of 14 actuators are responsible for moving the eyes, the eyebrows, the individual eyelids, and the mouth. In this system setup, Flobi’s emotional feedback as well as its face detection and recognition capabilities are demonstrated using the MORSE-based Flobi simulator. To this end, a total of four images is presented to the robot. The images are processed by a person identification component. Based on the classification result, a corresponding emotion is triggered and executed by the robot’s high- and low-level control framework and visualized in the simulator.

Task and expected result: as usual, the system must first be deployed and instantiated by the subject, and the example experiment must be started. As soon as the system is running, the virtual Flobi starts to identify the presented persons and a corresponding emotion is triggered. After 25 seconds all components are shut down. In the evaluation phase, the robot’s emotion states are processed and evaluated. In this case, the system is supposed to report “all four persons found”. Finally, the evaluation results and associated log files are presented to the subject. Based on the available information in the web catalog and the FSMT output, the subject must assess the outcome of the experiment.

VI. STUDY GOALS & DESIGN

The main objective of this study was to evaluate the CITK tool chain from a user’s perspective (cf. Section IV). In order to test our hypotheses as introduced in Section I and to gain further insight, we defined three key questions targeting different aspects of our tool chain: i) are the users able to technically reproduce an example experiment at all? ii) how is the usability of the tool chain perceived? iii) does the tool chain provide enough information to confirm and assess the expected outcome of an experiment?

Besides uncovering potentially unidentified disadvantages or benefits of our tool chain, these three key questions also address issues identified in the current literature (cf. Section III). For instance, before actually reproducing an experiment, the target system, including all required software component versions as well as their configuration parameters and execution sequence, must be identified and deployed on the user’s machine. Thus, question (i) addresses the “frequently neglected” impact of the engineering aspect on the replication of computational experiments. Subsequently, in order to support users in the replication process, suitable transparent and open tools are required (cf. Stodden). Hence, the usability and effectiveness of these tools play an important role in reducing the effort spent on reproducing reported results, for reviewers as a case in point. This issue is addressed in question (ii). Lastly, the outcome and success criteria of an experiment must be clearly stated and traceable (cf. Bonsignorio); this issue is addressed in question (iii). In order to test our hypotheses and to answer the aforementioned questions, we conducted this study as follows.

We recruited twelve subjects from the Faculty of Technology at Bielefeld University. One subject was a postdoctoral researcher, nine were PhD candidates and two subjects were in their final year of a master’s degree program. Regarding academic background, nine participants had a computer science background, while mechatronics, interaction design and technologies, and chemistry were reported by the remaining three subjects. Two participants were women, ten were men. All subjects confirmed that they were neither involved in the development of the CITK tool chain nor familiar with the assigned system or experiment. The subjects were instructed and supervised by a student worker who was also not part of the CITK development team.

The subjects were asked to reproduce one of the experiments presented in Section V by utilizing the CITK tool chain. Before actively starting the replication process, the subjects were instructed to read a documentation manual (PDF file) that specified all necessary steps, e.g., the required shell commands in order to deploy the assigned system (cf. Section IV). The replication process incorporated the following tasks: 1) bootstrap the tool chain and deploy the entire target system; 2) navigate to the CITK web catalog (a dedicated link was provided in the instruction set) and gather information about the referenced experiment, its expected outcome and how to execute the experiment; 3) execute the experiment and fill in a questionnaire. In the interest of maintaining consistency across all experiments, the instruction manuals, as well as the web catalog sites, were structured alike.


All subjects utilized the same computer to provide a uniform technical basis. The manual PDF was accessible throughout the whole process and the subjects were allowed to copy & paste the required shell commands. Eventually, every experiment was expected to be reproduced four times by three different subjects.

VII. EVALUATION METHODS & OBTAINED RESULTS

In this section we present the applied evaluation methods and the study results. In order to also make this study traceable, the instruction manuals and the questionnaire results, as well as an accompanying video, can be found on the CITK catalog website5. The discussion of these results and their potential implications is presented in the final section of this contribution. In the questionnaire, we used a 5-point Likert scale ranging from strongly disagree (1) to strongly agree (5) where applicable. The remaining categories were either “Yes/No” answers or statements such as “one”, “more than one”, or “more than five”.

All participants had no or only little knowledge about the topic of computational reproducibility. Surprisingly, however, 84% of the subjects strongly agree that reproducible and traceable research is an important topic. One subject neither agrees nor disagrees (3) and one subject disagrees (2) with this statement. Thus, altogether the subjects had no or at most little theoretical and practical experience in reproducing third-party systems and experiments.

The first notable result, based on the subjects’ self-assessment and the supervisor’s evaluation, is that all 12 subjects were able to reproduce their assigned experiment (key question i).

The second set of questions in the questionnaire targeted the tool chain’s usability (key question ii). For this purpose, we applied the System Usability Scale (SUS) as proposed by John Brooke [4]. The SUS is an elementary, standardized ten-item attitude scale often used in systems engineering to provide an overview of a system’s usability. The resulting average score of 77.5, as well as the average score of each item, is depicted in Table I.

TABLE I: CITK SUS Evaluation

Subjects    SUS questions 1 - 10 (average score per item)                     SUS
1 - 12      4.0   1.8   4.1   2.1   4.0   1.3   3.9   2.1   4.0   1.7         77.5

Individual and overall scores are average values

According to Bangor et al. [1], an average SUS score above 70 can be considered “good”. Thus, based on the subjects’ ratings, the usability of the CITK tool chain is more than acceptable. However, the predicate “excellent” would have required a score equal to or above 85 (cf. Figure 3).
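For reference, the standard SUS scoring procedure [4] can be reproduced in a few lines. The Python sketch below applies it to the per-item averages from Table I; since the per-item mapping is linear, scoring the averaged responses yields the same value as averaging the individual SUS scores.

    # Standard SUS scoring (cf. Brooke [4]): odd items contribute (response - 1),
    # even items contribute (5 - response); the sum is scaled by 2.5 to a 0-100 range.
    def sus_score(responses):
        assert len(responses) == 10
        total = sum((r - 1) if i % 2 == 1 else (5 - r)
                    for i, r in enumerate(responses, start=1))
        return total * 2.5

    # Average item scores reported in Table I.
    item_averages = [4.0, 1.8, 4.1, 2.1, 4.0, 1.3, 3.9, 2.1, 4.0, 1.7]
    print(sus_score(item_averages))   # -> 77.5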

In the third and last section of the questionnaire we evaluated key question iii. The results are depicted in Figures 4 and 5.

5 https://toolkit.cit-ec.de/citk-evaluation-study-i


Fig. 3: Acceptability scores, school grading scales and adjective ratings in relation to the average SUS score [1]

First of all, it is noticeable that the majority of the participants assessed that it was clearly stated what was measured in the experiment. The median value for this statement is 4 (agree) and the majority of all answers lie within a range of 3.5 to 5 (first and third quartiles). Furthermore, it was also clear to the subjects how the results were measured (median value is 4). However, it is also noteworthy that the results for this statement are inferior to those for the previous statement. Most of the participants also perceived that the results obtained in the experiment addressed the system’s capabilities in a reasonable manner. Lastly, the participants strongly agreed that it was easy to assess whether the experiment was successfully replicated, which is a promising result: here, the median value is 5, and the first and third quartiles lie between 4.5 and 5. There is, however, one outlier assessing this statement with neither agree nor disagree (3). Moreover, all subjects approved that the automatable execution of experiments is essential for replication (strongly agree 75% / agree 25%).

Furthermore, we asked the subjects about their notion of the included software components and their configuration, as well as about the information available on the hardware (computer) used in the original experiment. With regard to the included software components, the median value is 4. The median value for the included software configurations and parameters is comparatively low at 3.5 (first and third quartiles between 3 and 4). Surprisingly, the subjects could neither agree nor disagree (median 3) with the statement that sufficient information about the hardware used in the original experiment was available. It is noteworthy that the first and third quartiles range from 2 to 5, which indicates a wide spread of ratings. Lastly, the participants disagreed with the statement (median 2) that they could have achieved the same results without using the tool chain.

Fig. 4: CITK evaluation: experiment outcome and metrics


Fig. 5: CITK evaluation: information about included software components and configuration

VIII. DISCUSSION & CONCLUSION

The goal of this study was to test the following hypotheses: (i) the tool chain is applicable to different ecosystems and facilitates the reproduction of robotic systems as well as associated experiments, (ii) the tool chain is usable for non-expert users, and (iii) the tool chain provides enough information about relevant artifacts of reproduced example experiments. Concerning (i): since all systems and experiments were successfully reproduced by all 12 subjects using the CITK tool chain, we assert that this hypothesis is verified. Since all subjects had no or little experience in reproducing third-party systems, and an average SUS score of 77.5 was reached, we also assert that (ii) is verified. The verification of (i) and (ii) is additionally backed by the facts that it was easy for the subjects to assess whether the experiment was successfully replicated and that it was clear how and what was measured. Moreover, the subjects could associate their experiment result with the system’s capabilities. Lastly, the subjects stated that they could not have achieved the same results without the tool chain. Unfortunately, hypothesis (iii) is not as well supported by our results as the other hypotheses. While the subjects agreed that there was sufficient information about the included software components, it seems that roughly one half of the subjects could not determine whether there was enough information about component configuration. However, since the component configuration is explicitly modeled in the tool chain (cf. FSMT), we assume that the available information source, the web catalog in this case, needs a refined visual representation for this kind of artifact. A first assumption is that, because FSMT configurations are “only” represented as hyperlinks on the experiment web sites, only a minority of the subjects actually followed that link. Thus, although there are positive tendencies, we assume this hypothesis cannot be clearly verified.

Current limitations: since our approach heavily relies on the provision of previously recorded, or otherwise accessible, sensor data (e.g., live via network/middleware), highly interactive experiments, such as those in HRI, are difficult (and sometimes impossible) to capture and thus difficult to replay and reproduce. Although hardware is explicitly modeled in the CITK, the integration of an (open) standardized hardware description format is still lacking. The CITK tool chain is directly runnable on, or at least connectable to, robots that are based on “consumer hardware”, such as the PR2 or the iCub control PCs. However, specialized embedded systems, for instance, are currently locked out, if not available in simulation.

To conclude: in this contribution we assessed the applicability and usability of a novel tool chain that addresses current issues with respect to reproducibility in the robotics domain in an applied and, so far, unique study. We have shown that scientists were able to replicate a formerly unknown exemplary experiment using our approach. While we were not able to clearly verify all of our hypotheses, we consider the results promising. We hope to expose our approach to the general community in order to gather additional data and use cases, as well as further fields of application.

REFERENCES

[1] A. Bangor, P. Kortum, and J. Miller. Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3):114–123, 2009.

[2] F. Bonsignorio, J. Hallam, and A. del Pobil. Good experimental methodology (GEM) guidelines, 2007.

[3] F. Bonsignorio, J. Hallam, and A. del Pobil. Defining the requisites of a replicable robotics experiment. In RSS 2009 Workshop on Good Experimental Methodologies in Robotics, 2009.

[4] J. Brooke. SUS: A quick and dirty usability scale. Usability Evaluation in Industry, 189(194):4–7, 1996.

[5] D. Brugali and P. Scandurra. Component-based robotic engineering (Part I) [Tutorial]. IEEE Robotics & Automation Magazine, 16(4):84–96, December 2009.

[6] H. Bruyninckx. Open robot control software: The OROCOS project. In Proceedings of the 2001 IEEE International Conference on Robotics and Automation (ICRA), volume 3, pages 2523–2528. IEEE, 2001.

[7] S. Cousins, B. Gerkey, and K. Conley. Sharing software with ROS [ROS Topics]. IEEE Robotics & Automation Magazine, 17(2):12–14, 2010.

[8] S. Krishnamurthi and J. Vitek. The real software crisis: Repeatability as a core value. Communications of the ACM, 58(3):34–36, February 2015.

[9] F. Lier, I. Lutkebohle, and S. Wachsmuth. Towards automated execution and evaluation of simulated prototype HRI experiments. In Proc. 2014 ACM/IEEE Int. Conf. on Human-Robot Interaction, pages 230–231. ACM, 2014.

[10] F. Lier, J. Wienke, A. Nordmann, S. Wachsmuth, and S. Wrede. The Cognitive Interaction Toolkit: Improving reproducibility of robotic systems experiments. In SIMPAR 2014, LNAI, pages 400–411. Springer International Publishing Switzerland, 2014.

[11] I. Lutkebohle, F. Hegel, S. Schulz, M. Hackel, B. Wrede, S. Wachsmuth, and G. Sagerer. The Bielefeld anthropomorphic robot head. In 2010 IEEE International Conference on Robotics and Automation, pages 3384–3391. IEEE, 2010.

[12] D. Merkel. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014(239), March 2014.

[13] J. P. Mesirov. Accessible reproducible research. Science, 327(5964), 2010.

[14] L. Natale, F. Nori, G. Metta, M. Fumagalli, S. Ivaldi, U. Pattacini, M. Randazzo, A. Schmitz, and G. Sandini. The iCub platform: A tool for studying intrinsically motivated learning, 2013.

[15] M. Quigley et al. ROS: an open-source robot operating system. InICRA workshop on open source software, volume 3, 2009.

[16] V. Schiaffonati and M. Verdicchio. Computing and experiments.Philosophy & Technology, pages 1–18, 2013.

[17] M. Schwab, M. Karrenbach, and J. Claerbout. Making scientific computations reproducible. Computing in Science & Engineering, 2(6):61–67, 2000.

[18] V. C. Stodden. Reproducible research: Addressing the need for data and code sharing in computational science. Computing in Science & Engineering, 12(5):8–12, 2010.
