Research Objects for FAIRer Science - Shared Keynote presentation at VIVO and Science of Team Science Joint Conference, 6-8 August 2014, Austin Texas
Research Objects for FAIRer Science
Professor Carole Goble CBE FREng FBCS
The University of Manchester
VIVO/SciTS Conferences 6-8 August 2014, Austin, TX
Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct
…..papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension
Jill Mesirov Accessible Reproducible Research
Science 22 Jan 2010: 327(5964): 415-416 DOI: 10.1126/science.1179653
Virtual Witnessing*
*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.
Capturing, representing, sharing the information needed to understand how a research result came about.
Context of results
• Inputs, outputs, process…
Context of resources
• Instruments, data, software, people…
“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995
datasets, data collections, standard operating procedures, software, algorithms, configurations, tools and apps, codes, workflows, scripts, code libraries, services, system software infrastructure, compilers, hardware
Morin et al, Shining Light into Black Boxes, Science 13 April 2012: 336(6078) 159-160
Ince et al The case for open computer programs, Nature 482, 2012
“I can’t immediately reproduce the research in my own laboratory. It took an estimated 280 hours for an average user to approximately reproduce the paper.”
Phil Bourne, NIH Big Wig for Data Science
a reproducibility paradox
big, fast, complicated, multi-step, multi-type, multi-field
greater expectations of reproducibility
diy publishing
greater access
Systems Biology Collaborations
Modelling Cycle
45 organisations 112 organisations
Data
Models
Articles
External Databases
http://www.seek4science.org
Metadata
http://www.isatools.org
Ontology-driven Aggregated Content Infrastructure (Framework) for building Sys Bio Commons
Sharing and interlinking multi-stewarded, mixed methods, models, data, samples…
Standards
DCAT, FOAF
Yellow Pages
Careful Sharing Options
Commons
Investigations
Assays
Studies
Towards Interoperable Bioscience Data, Nature Genetics, 2012
Standards, Structure, Interlink
Just Enough Results Model for things produced and used in experiments
Construction data
Validation data
Metabolomics
Mass Spec
Transcriptomics
Proteomics
Fluxomics
Publications
Mix of locally & remotely hosted content
Open Modelling Exchange Format Archive
Wolstencroft et al, Proc ISWC 2013
Just Enough Results Model for stuff in experiments
Common elements
Data type specific elements
Experimentalists, modellers & developers
Cross-site, cross project collaboration
Knowledge network
Building the System: Building a Cult
TRUST
VISION
SETTING EXPECTATIONS
Drink together
Work together
• Collaboration – Complementarity correlation
• Modellers share more than Experimentalists
• Experimentalists reuse models more than Modellers
• Active enclave sharing
• Public sharing tricky even after publication, bribery and threats
• Data Hugging, Flirting and Voyeurism
• Playground rules apply
• Fluid, transient collaborations > membership mgt pain in a*se
• Shameless exploitation of PI competitiveness & vanity
• PI & Funder leadership
• Pan project spawned collaborations – YES!!!!
• But not necessarily visible to us.
Data discovery
Data assembly, cleaning, and refinement
Ecological Niche Modeling
Statistical analysis
Data collection
Insights
Scholarly Communication & Reporting
Enclosed sea problem (Ready et al., 2010)
Pilumnus hirtellus
Scientific Workflows
BioSTIF
method
instruments and laboratory
materials
Method Matters!
Workflow Commons
"Mapping present and future predicted distribution patterns for a meso-grazer guild in the Baltic Sea" by Sonja Leidenberger et al
1st International Workshop on Social Object Networks (SocialObjects 2011), Boston, October 9th 2011.
Find, Click ‘n’ Go
File ‘n’ Forget
Specialist Curators
Properties: What would you ask a publication if you could?
Identity and Description, Uniqueness, Authenticity: Who are you? Where and when were you born? Who were your parents (creators)?
Review, Reuse, and Repurpose: For which purpose were you conceived and have been used?
Inspection, Visualization, Annotations: What do you have inside?
Representation: How is your content structured?
Access Rights: May I access all your parts?
Adaptability: Which parts can I replace?
Evolution & Versioning, Provenance: What have they done to you? Who and when? Why did they do that?
Quality: Why are you relevant to me? Can I believe what you are saying or trust your results?
Reproducibility: Do you still produce the same results?
Fitness: Are you still working? How could I repair you?
Credit and attribution: How could I thank you? How could I talk about you?
From Manuscripts
to “Research Objects”
A meme
The multi-dimensional paper
Packs
www.datafairport.org
What is a Research Object?
Howard Ratner, STM Innovations Seminar 2012
was: Chair STM Future Labs Committee, CEO EVP Nature Publishing Group,
now: Director of Development for CHORUS (Clearinghouse for the Open Research of US)
http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5
http://www.myexperiment.org/packs/196.html
What The Commons* Is and Is Not
Is Not:
– A database
– Confined to one physical location
– A new large infrastructure
– Owned by any one group
Is:
– A conceptual framework
– Analogous to the Internet
– A collaboratory
– A few shared rules
• All research objects have unique identifiers
• All research objects have limited provenance
Philip E. Bourne Ph.D.
Associate Director for Data Science, National Institutes of Health
http://www.slideshare.net/pebourne
*The NIH BD2K Commons Framework: $100 million in 2015
Social Objects
carriers of discourse
http://www.researchobject.org/
A Framework to Bundle and Relate multi-hosted (digital) resources of a scientific experiment or investigation using standard mechanisms & uniform access protocols. Carriers of Research Context
Outputs are first class citizens to be managed, credited and tracked: data, software
Research Objects
Links
• Recording & linking together the components of an experiment
• Linking across experiments.
Preserve Archive
Reproduce*
Recompute
Reuse
Train & Explain
Exchange
Remix
Fix
* a word that means many things…..
re-compute, replicate, rerun, repeat, re-examine, repurpose, recreate, reuse, restore, reconstruct, review, regenerate, revise, recycle, regenerate the figure, redo
Results may vary
repeat replicate
Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online
Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.
Methods (techniques, algorithms, spec. of the steps)
Materials (datasets, parameters, algorithm seeds)
Experiment
Instruments (codes, services, scripts, underlying libraries)
Laboratory (sw and hw infrastructure, systems software, integrative platforms)
Setup
reuse reproduce
Executable Research Object
same experiment, same set up, same lab
same experiment, same set up, different lab
same experiment, different set up
different experiment, some of same
Validate
reuse reproduce
repeat replicate
http://www.biomedcentral.com/biome/carole-goble-on-reproducible-research-what-it-really-means-how-to-reach-it/
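One possible reading of this slide pairs the four labels (repeat, replicate, reproduce, reuse) with the four listed conditions in the order they appear. The slide text does not state that pairing explicitly, so the mapping below, the function name classify_rerun and its boolean parameters are all assumptions made for illustration.

```python
def classify_rerun(same_experiment: bool, same_setup: bool, same_lab: bool) -> str:
    """Return one of the four labels under the assumed pairing.

    Assumed pairing (not stated explicitly on the slide):
      repeat    -> same experiment, same set up, same lab
      replicate -> same experiment, same set up, different lab
      reproduce -> same experiment, different set up
      reuse     -> different experiment, some of same
    """
    if not same_experiment:
        # A different experiment that borrows some of the same parts.
        return "reuse"
    if not same_setup:
        # Same experiment, but the set up has changed.
        return "reproduce"
    if not same_lab:
        # Same experiment and set up, run in a different lab.
        return "replicate"
    # Everything held constant.
    return "repeat"
```

Under this reading the checks run from the loosest condition to the strictest, so a changed experiment is always classed as reuse regardless of the other two flags.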
Design
Execution
Result Analysis
Collection
Publish / Report
Peer Review
Peer Reuse
Modelling
Can I repeat & defend my method?
Can I review / reproduce and compare my results / method with your results / method?
Can I review / replicate and certify your method?
Can I transfer your results into my research and reuse this method?
* Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)
Research Report
Prediction
Monitoring
Cleaning
specialist codes libraries, platforms, tools
services
(cloud) hosted services
commodity platforms
data collections
catalogues
software repositories
my data
my process
my codes
integrative frameworks
gateways
data carpentry
http://software-carpentry.org/
Components & Dependencies
• 35 kinds of annotations
• 5 Main Workflows
• 14 Nested Workflows
• 25 Scripts
• 11 Configuration files
• 10 Software dependencies
• 1 Web Service
• Dataset: 90 galaxies observed in 3 bands
• Multiple platforms
• Multiple systems
José Enrique Ruiz (IAA-CSIC)
Galaxy Luminosity Profiling
Executable Instrument Entropy
Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble. Why workflows break - Understanding and combating decay in Taverna workflows, 8th Intl Conf e-Science 2012
Mitigate
Detect, Repair
Preserve
Partial replication
Approx. reproduction
Verification
Benchmarks
Executable Instrument Entropy
Prepare to Repair
Reproducibility by Inspection: Read It
Reproducibility by Invocation: Run It
Document Instrument
[Adapted Freire, 2013]
provenance
gather dependencies
capture steps
track & keep results
portability
variability tolerance
preservation
packaging
versioning
open
accessible
available
machine actionable
description
intelligible
machine-readable
[Adapted Freire, 2013]
Authoring
Exec. Papers
Link docs to experiment
Sweave
Provenance
Tracking, Versioning
Replay, Record, Repair
Workflows, makefiles
ProvStore
[Adapted Freire, 2013]
packaging
portability
variability tolerance
preservation
versioning
host
service
Open Source/Store
Sci as a Service
Integrative fws
Virtual Machines: Recompute, limited installation, Black Box, Byte execution, copies
Descriptive read, White Box, Archived record
Read & Run, Co-location, No installation
Portable Package: White Box, Installation, Archived record
[Adapted Freire, 2013]
host
service
ReproZip
No Green Fields No One System
Find, Access, Interop, Reuse
Porting across Platforms
Exchange between Systems
Comparing across Labs
Identity
Description
Packaging
Refer to aggregations and their resource contents
Interpretation: What does it mean? How can I compare with others? How is it linked together and linked to others?
Describe aggregation structure and its constituent parts
Container regardless of host
FAIR RO Core Model
manifest
Uniform and first class handling of diverse types (data, software, workflows…)
FAIR RO Core Model
Identity: DOIs, URIs, Handles, ORCID
Annotation: W3C OAM (Open Annotation Model)
Aggregation: OAI-ORE (OAI-Object Reuse and Exchange)
FAIR RO Core Model
Identity: persistence and resolution, Citation (DOIs, URIs, Handles, ORCID)
Annotation: first class and stand-off (W3C OAM)
Aggregation: Aggregations, Resource maps, Proxies (OAI-ORE)
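The three parts of the core model (identity, stand-off annotation, aggregation) can be pictured with a small in-memory sketch. This is only an illustration of the idea: the class and field names are hypothetical, the example.org identifiers are placeholders, and nothing here is the OAI-ORE or Open Annotation vocabulary itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Annotation:
    """Stand-off annotation: kept beside the thing it describes, not inside it."""
    target: str   # identifier of the annotated resource or of the RO itself
    body: str     # annotation content (or a pointer to it)
    creator: str  # who made the annotation, e.g. a person identifier


@dataclass
class ResearchObject:
    """Hypothetical container: an identifier, aggregated resources, annotations."""
    identifier: str  # e.g. a DOI, URI or Handle
    aggregates: List[str] = field(default_factory=list)
    annotations: List[Annotation] = field(default_factory=list)

    def aggregate(self, resource_id: str) -> None:
        # Resources are referenced by identifier, so they can live on any host.
        if resource_id not in self.aggregates:
            self.aggregates.append(resource_id)

    def annotate(self, target: str, body: str, creator: str) -> None:
        # Only the RO itself or one of its aggregated resources can be a target.
        if target != self.identifier and target not in self.aggregates:
            raise ValueError(f"{target} is not part of {self.identifier}")
        self.annotations.append(Annotation(target, body, creator))

    def annotations_for(self, target: str) -> List[Annotation]:
        return [a for a in self.annotations if a.target == target]
```

Keeping annotations as separate records that point at identifiers is what lets data, software and workflows be handled uniformly, whatever their type or location.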
Identity
Annotation
Aggregation
FAIR RO Core Platforms
DOIs
URIs
Handles
ORCID
Data Citation Implementation
W3C OAM
OAI-ORE
Distributed Third Party Tenancy
Alien Store
Aggregation
Carrier of Research Context
• Identifiable, citable, resolvable
• Uniform Management
• Mixed Stewardship
• Decay & Graceful Degrade
• Content & Aggregation Lifecycles
• Annotations
• Manifests, Recipes, Permissions, Discourse
Aggregations
• Dispersed / Encapsulated
• External (linked) / Local
• Mixed types
• Blackboxes
• Virtual / Materialised
Content Resources
• Aggregations themselves
• In many aggregations
• Virtual / Materialised
• Open / Closed
TARDIS: Time and Relative Dimension in SpaceScience
RO Model Ontology
• RO Management
– Transportation / Access / Citation
– Id location of RO “container”
– Provenance of RO & contents
– Behaviour/lifecycle of RO & contents
– Policies
• RO Interpretation
– What the RO and its content mean
– How they can be compared and validated
– How they can be used, executed, linked
• Interpretation variations
– Type (e.g. Workflows)
– Discipline (e.g. Biology)
– Task (e.g. Discovery, Execution)
– Activity (e.g. Experiment)
Progression Levels
Management and Interpretation for Integrated Applications
Checklists
Versioning
Provenance
Dependencies
More Stakeholders & Services
Citation minimum
More specialised detail
Fewer but more specialised stakeholders & services
Annotation Profiles
Depth: how deeply described
Coverage: how much is covered.
Progression levels
Semantic Framework
Checklists
Versioning
Provenance
Dependencies
NISO-JATS
EXPO, ISA, JERM, OBI
MIAME, SBML
GIT
MIM Ontology
PROV, PAV, VoID
Puppet, Docker
Make
PAV
RO Model: roevo, wfprov, wfdesc
SysBio Workflows
DCAT
Annotation Profiles
Depth: how deeply described
Coverage: how much is covered.
Progression levels
Semantic Framework
Experiment
VIVO-ISF
DC
Checklists aka Minimum Information Models
Safety, quality, consistency
Validation, monitoring
Common in experimental science
Checklists defined in terms of the RO model and its annotations
Services execute against model and an RO’s annotations
Zhao et al. A Checklist-Based Approach for Quality Assessment of Scientific Information, 3rd Intl. Workshop on Linked Science, 2013
Minim Checklist Ontology to describe checklists
Must, Should… Cardinalities… Rules…
http://purl.org/net/mim/ns
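The idea of a service executing a checklist against an RO's annotations can be sketched in a few lines. This is a toy illustration only: the Requirement class, the evaluate function and the annotation keys are hypothetical, and the sketch does not use or reproduce the Minim ontology. It only mirrors the Must / Should levels and a minimum cardinality named on the slide.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Requirement:
    level: str          # "MUST" or "SHOULD"
    key: str            # annotation key that has to be present (hypothetical names)
    min_count: int = 1  # minimum cardinality


def evaluate(checklist: List[Requirement],
             annotations: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return the keys of unmet requirements, grouped by level."""
    report: Dict[str, List[str]] = {"MUST": [], "SHOULD": []}
    for req in checklist:
        found = len(annotations.get(req.key, []))
        if found < req.min_count:
            report.setdefault(req.level, []).append(req.key)
    return report


def satisfied(report: Dict[str, List[str]]) -> bool:
    # Only unmet MUST requirements fail the checklist; SHOULDs act as warnings.
    return not report.get("MUST")


checklist = [
    Requirement("MUST", "title"),
    Requirement("MUST", "input-data", min_count=2),
    Requirement("SHOULD", "licence"),
]
annotations = {"title": ["Galaxy luminosity profiling"], "input-data": ["bands.csv"]}
report = evaluate(checklist, annotations)
```

Here the report lists "input-data" as an unmet MUST (one value where two are required) and "licence" as an unmet SHOULD, so the same checklist can drive both validation and softer monitoring.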
Towards Smart Integrated Applications & Mediation
1. Id & Cite fluid things
2. First class citizenship & uniform handling of artifacts
3. Compound
4. Mixed, leaky Containers
5. Span outcomes, evolve outputs, emergence
6. Layered interpretation and management profiles using standards
7. Machine-processable
8. Technology Independent
Bechhofer, Why linked data is not enough for scientists, DOI: 10.1016/j.future.2011.08.004
Research Objects Framework
a systematic approach to representing a different unit of scholarship
“development” view
“logical” view
“process” view
“physical” view
SERVICES
POLICIES
LIFECYCLES
METADATA PROFILES
Let’s Bake Research Objects!
Open Archival Information System Pilot
ROs are “Information Packages”
ROManager
RODL
• A single, transferable object encapsulates description and resources
– Download, transfer, publish
• ZIP-based format + manifest describes aggregation and annotations
– Unpack with standard tooling
• JSON-LD for manifest
– Lightweight linked-data format
– Use JSON tooling and services
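The bundle idea above (a ZIP file carrying a JSON-LD manifest that lists the aggregated resources) can be tried out with nothing more than the Python standard library. The sketch below is a rough illustration, not the actual bundle specification: the manifest path, the function names and the choice of manifest keys are assumptions, and the example.org identifier is a placeholder.

```python
import io
import json
import zipfile

# Assumed location of the manifest inside the archive (illustrative choice).
MANIFEST_PATH = ".ro/manifest.json"


def build_bundle(ro_id: str, resources: dict) -> bytes:
    """Pack resources (path -> str or bytes) plus a manifest into an in-memory ZIP."""
    manifest = {
        "@context": {"ore": "http://www.openarchives.org/ore/terms/"},
        "@id": ro_id,
        # One entry per aggregated resource, in a stable order.
        "ore:aggregates": [{"@id": path} for path in sorted(resources)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
        for path, content in resources.items():
            archive.writestr(path, content)
    return buffer.getvalue()


def read_manifest(bundle: bytes) -> dict:
    """Read the manifest back using only standard ZIP and JSON tooling."""
    with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
        return json.loads(archive.read(MANIFEST_PATH))
```

Because the manifest is plain JSON, a consumer that knows nothing about linked data can still list the bundle's contents, which is the "use JSON tooling and services" point on the slide.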
Baking with off the shelf platforms
OMEX archive
bundle
Adobe UCF
ORE
PROV
ODF
• Work with local folder structure.
– Version: github.
– Metadata: Local tooling
– Metadata about aggregation and its resources: “hidden folder”
• Zenodo/figshare pull snapshot from github
– DOIs for aggregation
– new DOIs: release cycles
Baking with off the shelf platforms
http://dx.doi.org/10.6084/m9.figshare.1031591
FARSITE
coded descriptions of clinical study cohorts
an NHS tool to assess the feasibility of gathering a cohort
packages codes, study, and metadata
Home Baking
In the Wild
Safari
integrated database and journal
http://www.gigasciencejournal.com
galaxy.cbiit.cuhk.edu.hk
[Peter Li]
Nanopub: represents structured data along with its provenance in a single publishable and citable entry
Galaxy workflows: re-enact the analysis
Research Object: aggregates the (digital) resources contributing to findings of (computational) research (results, data and software) as citable compound digital objects
http://isa-tools.github.io/soapdenovo2/
http://sandbox.wf4ever-project.org/portal/ro?ro=http://sandbox.wf4ever-project.org/rodl/ROs/SOAP2denovo2-Aureus/
[Alejandra Gonzalez-Beltran, Philippe Rocca-Serra]
what’s the least we can do?
how might ROs be minted and used by science teams?
how might ROs be implemented and used by developer teams?
Standards
Models
Platforms
Id Schemes
Resolution
Light touch
Extensible
Infiltration
Mapping
Making,Curating, Using
Nudging
Sharing
Linking
Infiltration
Embedding into and changing work practices
TOOLS
Citing
Technical Social
Reward
Mixed stewardship
Citation Schemes
Fragility
[Norman Morrison]
(meta)Data Capture Platforms
Process Capture Platforms
Stealthy not Sneaky
to reduce the friction
instrument the world
Incremental
JIJIT not JIC
Focus on Personal Productivity
not Public Good
Auto-magical
From made reproducible to born reproducible
What’s the least we can do?
Knowledge Turns
Transportation & Mediation
Unit of Scholarly Currency
Context, Comparison
Distributed: Search, Discover, Index, Harvest, Port
Research Turns
Release model: Evolution, Emergence, Discourse, Comparison, Historical review
Forks, Merges & Fixivity
Flow across groups, projects and articles
Anti-Salami, Threaded Publications
Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012
Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, I3CK, 2013
Profile Focus
Body of knowledge around methods, workflows, software, data, person, rather than publication.
First class citation, credit and respect
Open Research Practice is (increasingly) like Open Source Software Practice.
(Which we know a lot about)
FAIR research practice benefits from a shared and principled approach for identification, aggregation and annotation of research components of all kinds.
– Using existing standards, vocabularies, frameworks, platforms, infrastructures. Using linked data and semantic interoperability
VIVO - to represent the full context of researchers’ work.
SciTS – to study the research process and research collaboration
http://www.researchobject.org
• Barend Mons
• Sean Bechhofer
• Philip Bourne
• Matthew Gamble
• Raul Palma
• Jun Zhao
• Alan Williams
• Stian Soiland-Reyes
• Paul Groth
• Tim Clark
• Juliana Freire
• Alejandra Gonzalez-Beltran
• Philippe Rocca-Serra
• Ian Cottam
All the members of the Wf4Ever team
iSOCO: Intelligent Software Components S.A., Spain
University of Manchester, School of Computer Science, Manchester, United Kingdom
University of Oxford, Department of Zoology, Oxford, UK
Poznan Supercomputing and Networking Center, Poznan, Poland
IAA: Instituto de Astrofísica de Andalucía, Granada, Spain
Leiden University Medical Centre, Centre for Human and Clinical Genetics, The Netherlands
Colleagues in Manchester’s Information Management Group
RO Advisory Board Members
http://www.researchobject.org
http://www.wf4ever-project.org