Upload
logilab
View
163
Download
0
Embed Size (px)
DESCRIPTION
L'autre poster présenté par Logilab concerne Simulagora, un service en ligne de simulation numérique collaborative, qui permet de lancer des calculs dans les nuages (donc sans investissement dans du matériel ou d'administration système), qui met l'accent sur la traçabilité et la reproductibilité des calculs, ainsi que sur le travail collaboratif (partage de logiciel, de données et d'études numériques complètes).
Citation preview
The present poster first reviews what reproducibility in escience is all about[1][2][3]. Three important features of a reproducible study are listed and the challenges to be addressed exhibited, so that a research work can really be reproduced by its author itself (e.g. to improve upon it or just to check results obtained in the past), by reviewers (e.g. for careful analysis before publication), or by collaborators (e.g. to derive a new work). The Simulagora platform is then presented to show how it can help escientists produce or reproduce a research work. Future developments of the Simulagora platform are also quoted.
Reproducibility in escience: the Simulagora platform
Replication
Replication is the first step towards reproducibility. It aims at making it as easy as possible to replay the complete research study without any modification of the input data or the software.
The first challenge is to find available hardware resources that are compatible with the workload to be replicated. Among other things, the computing power and storage capacity must be sufficient. Ideally, the hardware resources used at this step are the same as those used to carry out the initial study.A second challenge is to replicate the complete software environment of the initial study, including all the dependencies. Note that proprietary software may be an intractable issue at this step, as licenses might not be available to the people who replicate the study at the time they do it.These two challenges may take a significant amount of time, making the replication impracticable. The global ease of the replication process is critical, thus requiring dedicated tools and methods.
Reappropriation
For a true reproducibility, the reappropriation of the process described in a research paper is even more important. It requires a thorough analysis of the data workflow and of the software. A final goal should be to derive some work from the initial one in order to test the robustness of the proposed approach.
Documenting the data workflow of the initial study is of great help but not sufficient here as errors may occur. A tool that ensures the traceability of the data and software throughout the process would be highly valuable.To deeply analyze a given step of the process, dedicated tools like statistics libraries or data visualization software are necessary and must be provided or easily installable in the initial study's environment: the data volume may be too big to be transferred.When it comes to software review, a distinction may be made between highlevel, studyspecific software and widely accepted, communitysupported, reliable libraries: auditing the studyspecific software should be as easy as possible, which might require code or input data editing.
http://[email protected]
Cloud resource usageVirtualization is an ideal work replication technology. Although usable on smallscale data centers, public clouds allow access to virtually infinite computing power and storage capacity, very high availability and always lower prices thanks to economies of scale.
When used on a typical public cloud Simulagora gets a new virtual machine within a few minutes after the user's request, gets the input data and launches the required program, either for computationintensive tasks or an interactive working session.
Sciencetargeted machine imagesSimulagora uses Debian[5] Linux as a basis for its machine images. Debian is a natural choice for scientists, thanks to its very large open source scientific software repository.Logilab actively contributes to Debian, notably by packaging free scientific software (recently, the Code_ASTER[6] finite element solver for mechanics).
Simulagora's machine images are produced using Saltstack[7] to make it easy to install and configure Simulagora's specific software.
Collaborate on software – MercurialSimulagora uses Mercurial[14] repositories to store the program that launches the computations and that may use the preinstalled software. The Mercurial version of this program is stored in Simulagora's database along with the machine image's version, the computation's input data, and results, … These repositories are exposed to power users, who may then write their own programs, share, clone and modify them before pushing back to the platform for use in a new study.
Modelbased applicationThe traceability is achieved using a WEB application based on the Open Source CubicWeb[8] framework, written in Python[9]. The input, output and software of each processing step is described and stored in a PostgreSQL[10] database before being executed and become immutable.
Collaborate on documentation – vcwikiThe wikilike functionality from vcwiki[15], is being integrated into Simulagora. Documents can be edited through the WEB interface or pushed using the Mercurial version control system. Vcwiki currently uses RestructuredText but could be extended to other textual markups if users ask.
https://www.simulagora.com
Remote access and visualizationSimulagora users are granted full root access to their virtual machines. They can simply use OpenSSH[11] from a terminal and monitor their work or install software. Or they can take advantage of NoVNC[12] and remotely view a full XFCE[13] desktop environment directly in their websocket capable browser, for a zeroinstall usage of Simulagora.
Collaboration
Strictly speaking, this is not a requirement of research reproducibility, but rather an extension. Collaborationlike features are however often required during the reviewing process, if the reviewer wants to check the results with slightly modified algorithms or input data. In other contexts, collaboration features are essential to make it easier for escientists to base their efforts on previous successful ones, and might considerably improve the collective efficiency.
The first need for proper collaboration in escience is the ability to share code with collaborators in a secure way, as WEBbased software forges do.However, sharing data is more difficult: depending on the size of the data involved in the considered study, getting a copy might be impracticable. In this case, it is more suitable to push the software required by the new work to the initial data than the other way round.
Collaborate on studies – Clone, modify, compareSimulagora users can allow people of their choice to access a complete study, including input data and results. They may also freeze the study so that collaborators can clone it as a start for a derived work, then perhaps compare the new results with those of the original study.
[1] http://recomputation.org[2] http://ropensci.org/blog/2014/06/09/reproducibility[3] http://www.nature.com/nature/focus/reproducibility
Further reading:
[4] http://www.openstack.org[5] http://www.debian.org[6] http://www.codeaster.org[7] http://www.saltstack.com[8] http://www.cubicweb.org[9] http://www.python.org
[10] http://www.postgresql.org[11] http://www.openssh.org[12] http://novnc.com[13] http://www.xfce.org[14] http://mercurial.selenic.com[15] http://www.cubicweb.org/project/cubicwebvcwiki
References: