Popper: Making Reproducible Systems Performance Evaluation ...


<ul><li><p>Popper: Making Reproducible Systems Performance Evaluation Practical</p><p>Abstract: Independent validation of experimental results in the field of parallel and distributed systems research is a challenging task, mainly due to changes and differences in software and hardware in computational environments. Recreating an environment that resembles the original systems research is difficult and time-consuming. In this paper we introduce the Popper Convention, a set of principles for producing scientific publications. Concretely, we make the case for treating an article as an open source software (OSS) project, applying software engineering best practices to manage its associated artifacts and maintain the reproducibility of its findings. We leverage existing cloud-computing infrastructure and modern OSS development tools to produce academic articles that are easy to validate. We present our prototype file system, GassyFS, as a use case for illustrating the usefulness of this approach. We show how, by following Popper, re-executing experiments on multiple platforms is more practical, allowing reviewers and students to quickly get to the point of getting results without relying on the authors' intervention.</p><p>I. INTRODUCTION</p><p>A key component of the scientific method is the ability to revisit and replicate previous experiments. Managing information about an experiment allows scientists to interpret and understand results, as well as verify that the experiment was performed according to acceptable procedures. Additionally, reproducibility plays a major role in education since the amount of information that a student has to digest increases as the pace of scientific discovery accelerates. By having the ability to repeat experiments, a student learns by looking at provenance information about the experiment, which allows them to re-evaluate the questions that the original experiment addressed. 
Instead of wasting time managing package conflicts and learning the paper authors' ad-hoc experimental setups, the student can immediately run the original experiments and build on the results in the paper, thus allowing them to stand on the shoulders of giants.</p><p>Fig. 1. The OSS development model. A version-control system is used to maintain the changes to code. The software is packaged and those packages are used in either testing or deployment. The testing environment ensures that the software behaves as expected. When the software is deployed in production, or when it needs to be checked for performance integrity, it is monitored and metrics are analyzed in order to determine any problems.</p><p>Independently validating experimental results in the field of computer systems research is a challenging task. Recreating an environment that resembles the one where an experiment was originally executed is a challenging endeavour. Version-control systems give authors, reviewers and readers access to the same code base [1], but the availability of source code does not guarantee reproducibility [2]; code may not compile, and even if it does, the results may differ. In this case, validating the outcome is a subjective task that requires domain-specific expertise in order to determine the differences between original and recreated environments that might be the root cause of any discrepancies in the results [3-5]. Additionally, reproducing experimental results when the underlying hardware environment changes is challenging, mainly due to the inability to predict the effects of such changes on the outcome of an experiment [6,7]. A Virtual Machine (VM) can be used to partially address this issue, but the overheads in terms of performance (the hypervisor tax) and management (creating, storing and transferring) can be high and, in some fields of computer science such as systems research, cannot be accounted for easily [8,9]. 
OS-level virtualization can help in mitigating the performance penalties associated with VMs [10].</p><p>One central issue in reproducibility is how to organize an article's experiments so that readers or students can easily repeat them. The current practice is to make the code available in a public repository and leave readers with the daunting task of recompiling, reconfiguring, deploying and re-executing an experiment. In this work, we revisit the idea of an executable paper [11], which proposes the integration of executables and data with scholarly articles to help facilitate their reproducibility, but look at implementing it in today's cloud-computing world by treating an article as an open source software (OSS) project. We introduce Popper, a convention for organizing an article's artifacts following the OSS development model that allows researchers to make all the associated artifacts publicly available with the goal of easing the re-execution of experiments and validation of results. There are two main goals for this convention:</p><p>1. It should apply to as many research projects as possible, regardless of their domain. While the use case shown in Section IV pertains to the area of distributed storage systems, our goal is to encompass any project with a computational component in it.</p><p>Ivo Jimenez, Michael Sevilla, Noah Watkins, Carlos Maltzahn, UC Santa Cruz</p><p>{ivo, msevilla, jayhawk, carlosm}@soe.ucsc.edu</p></li><li><p>2. It should be applicable, regardless of the underlying technologies. In general, Popper relies on software-engineering practices like continuous integration (CI), which are implemented in multiple existing tools. 
Applying this convention should work, for example, regardless of what CI tool is being used.</p><p>If, from an article's inception, researchers make use of version-control systems, lightweight OS-level virtualization, automated multi-node orchestration, continuous integration and web-based data visualization, then re-executing and validating an experiment becomes practical. This paper makes the following contributions:</p><p>• An analysis of how the OSS development process can be repurposed for an academic article;</p><p>• Popper: a convention for writing academic articles and associated experiments following the OSS model; and</p><p>• GassyFS: a scalable in-memory file system that adheres to the Popper convention.</p><p>GassyFS, while simple in design, is complex in terms of compilation and configuration. Using it as a use case for Popper illustrates the benefits of following this convention: it becomes practical for others to re-execute experiments on multiple platforms with minimal effort, without having to speculate on what the original authors (ourselves) did to compile and configure the system; and it shows how automated performance regression testing aids in maintaining the reproducibility integrity of experiments.</p><p>The rest of the paper is organized as follows. Section II analyzes the traditional OSS development model and how it applies to academic articles. Section III describes Popper in detail and gives an overview of the high-level workflow that a researcher goes through when writing an article following the convention. In Section IV we present a use case of a project following Popper. We discuss some of the limitations of Popper and lessons learned in Section V. Lastly, we review related work in Section VI and conclude.</p><p>II. THE OSS DEVELOPMENT MODEL FOR ACADEMIC ARTICLES</p><p>In practice, the open-source software (OSS) development process is applied to software projects (Figure 1). 
In the following section, we list the key reasons why the process of writing scientific papers is so amenable to OSS methodologies. The goal of our work is to apply these in the academic setting in order to enjoy the same benefits. We use the generic OSS workflow in Figure 1 to guide our discussion.</p><p>A. Version Control</p><p>Traditionally the content managed in a version-control system (VCS) is the project's source code; for an academic article the equivalent is the article's content: article text, experiments (code and data) and figures. The idea of keeping an article's source in a VCS is not new and in fact many people follow this practice [1,12]. However, this only considers automating the generation of the article in its final format (usually PDF). While this is useful, here we make the distinction between changing the prose of the paper and changing the parameters of the experiment (both its components and its configuration).</p><p>Ideally, one would like to version-control the entire end-to-end pipeline for all the experiments contained in an article. With the advent of cloud computing, this is possible for most research articles¹. One of the mantras of the DevOps movement [13] is to make infrastructure as code. In a sense, having all the article's dependencies in the same repository is analogous to how large cloud companies maintain monolithic repositories to manage their internal infrastructure [14,15], but at a smaller scale.</p><p>Tools and services: git, svn and mercurial are popular VCS tools. GitHub and BitBucket are web-based Git repository hosting services. They offer all of the distributed revision control and source code management (SCM) functionality of Git as well as adding their own features. They give new users the ability to look at the entire history of the project and its artifacts.</p><p>B. Package Management</p><p>Availability of code does not guarantee reproducibility of results [2]. 
The second main component of the OSS development model is the packaging of applications so that users don't have to do it themselves. Software containers (e.g. Docker, OpenVZ or FreeBSD's jails) complement package managers by packaging all the dependencies of an application in an entire filesystem snapshot that can be deployed in systems as is, without having to worry about problems such as package dependencies or specific OS versions. From the point of view of an academic article, these tools can be leveraged to package the dependencies of an experiment. Software containers like Docker have great potential for use in the computational sciences [16].</p><p>Tools and services: Docker [17] automates the deployment of applications inside software containers by providing an additional layer of abstraction and automation of operating-system-level virtualization on Linux. Alternatives to Docker are modern package managers such as Nix [18] or Spack [19], or even virtual machines.</p><p>C. Continuous Integration</p><p>Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository frequently with the purpose of catching errors as early as possible. The experiments associated with an article are not exempt from this type of issue. If an experiment's findings can be codified in the form of a unit test, they can be verified on every change to the article's repository.</p><p>Tools and services: Travis CI is an open-source, hosted, distributed continuous integration service used to build and test software projects hosted at GitHub. Alternatives to Travis CI are CircleCI and CodeShip. On-premises solutions such as Jenkins also exist.</p><p>¹ For large-scale experiments or those that run on specialized platforms, re-executing an experiment might be difficult. However, this doesn't exclude such research projects from being able to version-control the article's associated assets.</p></li><li><p>D. 
Multi-node Orchestration</p><p>Experiments that require a cluster need a tool that automatically manages binaries and updates packages across machines. Serializing this by having an administrator manage all the nodes in a cluster is impossible in HPC settings. Traditionally, this is done with an ad-hoc bash script, but for experiments that are continually tested there needs to be an automated solution.</p><p>Tools and services: Ansible is a configuration management utility for configuring and managing computers, as well as deploying and orchestrating multi-node applications. Similar tools include Puppet, Chef and Salt, among others.</p><p>E. Bare-metal-as-a-Service</p><p>For experiments that cannot run on consolidated infrastructures due to noisy-neighborhood phenomena, bare-metal as a service is an alternative.</p><p>Tools and services: CloudLab [20], Chameleon and PRObE [21] are NSF-sponsored infrastructures for research on cloud computing that allow users to easily provision bare-metal machines to execute multi-node experiments. Some cloud service providers such as Amazon allow users to deploy applications on bare-metal instances.</p><p>F. Automated Performance Regression Testing</p><p>OSS projects such as the Linux kernel go through rigorous performance testing [22] to ensure that newer versions don't introduce any problems. Performance regression testing is usually an ad-hoc activity but can be automated using high-level languages [23] or statistical techniques [24]. Another important aspect of performance testing is making sure that baselines are reproducible, since if they are not, then there is no point in re-executing an experiment.</p><p>Tools and services: Aver is a language and tool that allows authors to express and validate statements on top of metrics gathered at runtime. Baseliner is a tool that can be used for obtaining baselines.</p><p>G. Data Visualization</p><p>Once an experiment runs, the next task is to analyze and visualize results. 
This is a task that is usually not done in OSS projects.</p><p>Tools and services: Jupyter notebooks run in a web-based application that facilitates the sharing of documents containing live code (in Julia, Python or R), equations, visualizations and explanatory text. Other domain-specific visualization tools can also fit into this category. Binder is an online service that allows one to turn a GitHub repository into a collection of interactive Jupyter notebooks so that readers don't need to deploy web servers themselves.</p><p>III. THE POPPER CONVENTION</p><p>Popper is a convention for articles that are developed as an OSS project. In the remainder of this paper we use GitHub, Docker, Binder, CloudLab, Travis CI and Aver as the tools/services for every component described in the previous section. As stated in goal 2, any of these should be swappable for other tools, for example: VMs instead of Docker; Puppet instead of Ansible; Jenkins instead of Travis CI; and so on. Our approach can be summarized as follows:</p><p>• A GitHub repository stores all details for the paper. It stores the metadata necessary to build the paper and re-run experiments.</p><p>• Docker images capture the experimental environment, packages and tunables.</p><p>• Ansible playbooks deploy and execute the experiments.</p><p>• Travis CI tests the integrity of all experiments.</p><p>• Jupyter notebooks analyze and visualize experimental data produced by the authors.</p><p>• Every image in an article has a link in its caption that takes the reader to a Jupyter notebook that visualizes the experimental results.</p><p>• Every experiment involving performance metrics can be launched in CloudLab, Chameleon or PRObE.</p><p>• The reproducibility of every experiment can be checked by running assertions of the Aver language on top of the newly obtained results.</p><p>Fig. 2. End-to-end workflow for an article that follows the Popper convention.</p><p>Figure 2 shows the end-to-end workflow for reviewers and authors. 
Given all the elements listed above, readers of a paper can look at a figure and click the associated link that takes them to a notebook. Then, if desire...</p></li></ul>
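<p>As a minimal sketch of the idea in Sections II-C and II-F, that an experiment's finding can be codified as a test and re-checked on every change to the article's repository, the following plain-Python example compares newly obtained runtimes against a stored baseline. It is not written in Aver and does not reproduce any GassyFS experiment: the runtime values, the function name within_tolerance and the 10% tolerance are all hypothetical and chosen only for illustration.</p>

```python
import statistics

# Hypothetical runtimes in seconds; in a real repository these would be
# read from the experiment's stored results rather than hard-coded.
BASELINE_RUNTIMES = [10.2, 9.8, 10.1, 10.0, 9.9]
NEW_RUNTIMES = [10.3, 10.0, 10.4, 10.1, 10.2]


def within_tolerance(baseline, candidate, tolerance=0.10):
    """Return True if the candidate median is at most `tolerance`
    (as a fraction) slower than the baseline median.

    The 10% default is an assumption made for this sketch."""
    base = statistics.median(baseline)
    cand = statistics.median(candidate)
    return cand <= base * (1.0 + tolerance)


def test_no_performance_regression():
    # The codified "finding": re-executing the experiment does not
    # degrade the median runtime beyond the chosen tolerance.
    assert within_tolerance(BASELINE_RUNTIMES, NEW_RUNTIMES)
```

<p>A CI service could run a file like this with a test runner such as pytest on every commit, so that a change which degrades the result beyond the tolerance fails the build. The median is used here only because it is less sensitive to a single noisy run than the mean; any other statistic could be substituted.</p>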

