Upload
gigascience-bgi-hong-kong
View
107
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Scott Edmunds talk on Big Data Publishing at the "What Bioinformaticians need to know about digital publishing beyond the PDF" workshop at ISMB 2013, July 22nd 2013
Citation preview
Big Data publishingBeyond dead trees, and a case study
Scott EdmundsISMB, 22nd July 2013@gigascience
The problems with publishing
• Scholarly articles are merely advertisement of scholarship . The actual scholarly artefacts, i.e. the data and computational methods, which support the scholarship, remain largely inaccessible --- Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
• Core scientific statements or assertions are intertwined and hidden in the conventional scholarly narratives
• Lack of transparency, lack of credit for anything other than “regular” dead tree publication
Time to move beyond:
18121665 1869
Problem: growing replication gap
1. Ioannidis et al., 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 142. Science publishing: The trouble with retractions http://www.nature.com/news/2011/111005/full/478026a.html3. Bjorn Brembs: Open Access and the looming crisis in science https://theconversation.com/open-access-and-the-looming-crisis-in-science-14950
Out of 18 microarray papers, resultsfrom 10 could not be reproduced
Out of 18 microarray papers, resultsfrom 10 could not be reproduced
More retractions: >15X increase in last decadeAt current % > by 2045 as many papers published as retracted
Motivation• Scholarly artefacts must be
– Treated as first-class objects, in scientific investigations and in scholarly communications
– Made machine-readable for the convenience of reasoning– Represented in an interoperable manner
• Truly “add value” to publishing – Provide infrastructure to aid & reward replication
• Trial of ISA+Nanopublication+RO– Three similar approaches that should complement each
other for the representation of scholarly artifacts
Motivation• Scholarly artefacts must be
– Treated as first-class objects, in scientific investigations and in scholarly communications
?
:“data* generated in the course of research are just as valuable to the ongoing academic discourse as papers and monographs”.
“increase acceptance of research data* as legitimate, citable contributions to the scholarly record”.
Data* Citation (*and more)
• Data• Software• Re-use…
= Credit}
GigaSolution: deconstructing the paperNeed to credit and reward:
•Data/software availability
•Metadata/curation
•Interoperability
•Availability of workflows
•Transparent analyses
Data
Metadata
Methods
Analyses
GigaSolution: deconstructing the paper
www.gigadb.orgwww.gigasciencejournal.com
20PB storage, 20.5K cores, 212TFlops, >1000 bioinformaticians
Utilizes big-data infrastructure and expertise from:
Combines and integrates:Open-access journal
Data Publishing Platform
Data Analysis Platform
What should we reward?
Different levels of granularity:
Experiment(e.g. Rice 10K project)
Datasets(e.g. species, variety)
Sample(e.g. specimen xyz)
e.g. doi:10.5524/100001
e.g. doi:10.5524/100001-2
e.g. doi:10.5524/100001-2000or doi:10.5524/100001_xyz
Smaller still?
Papers
Data/Micropubs
NanopubsFacts/Assertions (~1014 in literature)
Reward different shaped publishable objects
Rewarding open data
Valid
ation
chec
ks
Fail – submitter is provided error report
Pass – dataset is uploaded to GigaDB.
Submission Workflow
Curator makes dataset public (can be set as future date if required)
DataCite XML file
Excel submission file
Submitter logs in to GigaDB website and uploads Excel submission
GigaDB
DOI assigned
FilesSubmitter provides files by ftp or Aspera
XML is generated and registered with DataCite
Curator Review
Curator contacts submitter with DOI citation and to arrange file transfer (and resolve any other questions/issues).
DOI 10.5524/100003Genomic data from the crab-eating macaque/cynomolgus monkey (Macaca fascicularis) (2011)
Public GigaDB dataset
Reward open & transparent review
End reviewer 3 Download parody videos, now!
Real-time open-review = paper in arXiv + blogged reviews
Reward open & transparent review
http://tmblr.co/ZzXdssfOMJfywww.gigasciencejournal.com/content/2/1/10
Cloud solutions?
Reward better handling of metadata…Novel tools/formats for data interoperability/handling.
BMC Research Awards 2013Winner of open data award
Rewarding transparent methods/results
Software availability
Open code
Accessible pipelines
Sharable workflows
Research Objects
Ease o
f replicatio
n
Implement workflows in a community-accepted format
http://galaxyproject.org
Over 36,000 main Galaxy server users
Over 500 papersciting Galaxy use
Over 55 Galaxyservers deployed
Open source
galaxy.cbiit.cuhk.edu.hk
Research ObjectsAn aggregation of scholarly artefacts: • Data used or results produced in an
experiment study• Methods employed to produce and
analyse that data• Provenance and setting information
about the experiments• People involved in the investigation• Annotations about these resources, that
are essential to the understanding and interpretation of the scientific outcomes captured by a research object.
Example
How are we supporting data reproducibility?
Data sets
Analyses
Linked to
Linked to
DOI
DOI
Open-Paper
Open-Review
DOI:10.1186/2047-217X-1-18>11000 accesses
Open-Code
8 reviewers tested data in ftp server & named reports published
DOI:10.5524/100044
Open-PipelinesOpen-Workflows
DOI:10.5524/100038Open-Data
78GB CC0 data
Code in sourceforge under GPLv3: http://soapdenovo2.sourceforge.net/>5000 downloads
Enabled code to being picked apart by bloggers in wiki http://homolog.us/wiki/index.php?title=SOAPdenovo2
8 referees downloaded & tested data, then signed reports
Reward open & transparent review
Post publication: bloggers pull apart code/reviews in blogs + wiki:
SOAPdenov2 wiki: http://homolog.us/wiki1/index.php?title=SOAPdenovo2Homologus blogs: http://www.homolog.us/blogs/category/soapdenovo/
Reward open & transparent review
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
SOAPdenovo2 workflows implemented in
galaxy.cbiit.cuhk.edu.hk
Implemented entire workflow in our Galaxy server, inc.:
• 3 pre-processing steps
• 4 SOAPdenovo modules
• 1 post processing steps
• Evaluation and visualization tools
Also will be available to download by >36K Galaxy users in
How much further can we take this?
How much further can we take this?ISA + RO + Nanopub case study
Understand how each of the three models can support representation of the actual scholarly artefacts, which are essential first-class objects in scholarly communication
Demonstrate added value to life science, publishing and scholarly communication communities on how these models should be used together to describe scholarly artefacts from life sciences domains
Data models (instructions for authors for digital publishing)
• Research Object– An encapsulation of essential information related to experiments
and investigations
• The ISA (Investigation + Study + Assay) framework– includes a format and a set of software tools that enable its
international user community to provide rich description of the experimental workflows in life science, environmental and biomedical domains.
• Nanopublication– Dissemination of individual data (assertions) with/without an
accompanying scholarly articles– Enables attribution to the scientists for sharing these their data
DataData
Method/Experimental
protocol
Method/Experimental
protocol
FindingsFindings
Types of resources in an RO
Wfdesc/ISA-TAB/ISA2OWL
Wfdesc/ISA-TAB/ISA2OWL
Models to describe each resource type
The Big Picture
The SOAPdenovo2 Case study
The DataThe method, as a Galaxy workflow
The findings, Table2 in the paper (doi:10.1186/2047-217X-1-18)
investigation
Investigation/Study/Assay infrastructure
The investigation file is a high-level aggregator for related studies, contains all the information to understand the overall goals of an experiment, including investigators involved, associated publications, the experimental design, experimental factors, protocols, funding agencies and so on…
investigation
Investigation/Study/Assay infrastructure
An investigation can have one or more studies. A study is the central unit of the experimental description and it contains information on the subject(s) under study, their characteristics, and any treatments applied.
study study
investigation
Investigation/Study/Assay infrastructure
Each study has one or more associated assays. The assay is the test performed either on the subject or on material taken from the subject, which produce qualitative and/or quantitative measurements.
study study
assay assay assay assay
investigation
Investigation/Study/Assay infrastructure
study study
assay assay assay assay
data analysis method scriptdata data
ISA
investigation
Study Design
ISA framework SOAPdenovo2
investigation
study
Study Samples: A table rendering the sample collection workflow
ISA framework SOAPdenovo2
investigation
study
assay
ftp://public.genomics.org.cn/BGI/SOAPdenovo2
http://galaxy.cbiit.cuhk.edu.hk/
FTP
ISA framework SOAPdenovo2
investigation
study
assay
Representation available in:•Tabular format
• Spreadsheet-like format• For biologists/experimentalists
•RDF/OWL format for Semantic Web/Linked Data users• For bioinformaticians/software developers• Facilitating data integration, querying, reasoning
•Support for submission to public repositories and data publication platforms
•Tools support for curation, creation, storage, analysis…
•Large and diverse life science user/collaborator communities
ISA framework
RO + ISA
Scientific Workflow-specific ROs ISA experiment and data description
Scientific, computationalExperiments, non-wet lab
protocols
Scientific, computationalExperiments, non-wet lab
protocols
Focus on web-lab or non-
computational experimental
protocols
Focus on web-lab or non-
computational experimental
protocols
An RO for the Case Study
A Galaxy workflowA Galaxy workflow
Some nanopub statements
Some nanopub statements
Input sequence
data
Input sequence
data
A Research Object
The Research Object contains the following artefacts:
• The inputs sequence data that are represented in ISA-TAB format
• The Galaxy workflow that reflects the computational steps taken for generating the results used to produce Table 2 • Machine-readable descriptions about the workflow
• The nanopublication statements that represents claims based on the content of Table 2
Descriptions about the workflow
Descriptions about the workflow
An RO for the Case Study
Assertion
Nanopublication URL
Provenance PublicationInfo
assertion
assertion
opm:was
DerivedFrom
opm:was
DerivedFrom
opm:wasGene-ratedBy
opm:wasGene-ratedBy
thisnanopub
thisnanopub
dcterms:created
dcterms:created
pav:authored-
By
pav:authored-
By
associa-tion
associa-tion aa
sio:statis-ticalAssoci
ation
sio:statis-ticalAssoci
ation
sio:has-measurementValue
sio:has-measurementValue
Association_1_p_val
ue
Association_1_p_val
ue
aa
Sio:probability-value
Sio:probability-value
sio:has-value
sio:has-value
6.56 e-5
^^xsd:float
6.56 e-5
^^xsd:float
sio:refers-to
sio:refers-to
dcterms:DOI
dcterms:DOI
…
Integrity KeyIntegrity Key
An Individual association between concepts:•statement or declaration•measurement•hypothetical inference•quantitative or qualitative
An Individual association between concepts:•statement or declaration•measurement•hypothetical inference•quantitative or qualitative
Guarantee immutabilityafter publication
Guarantee immutabilityafter publication
Unique, persistent and resolvable identifier
Unique, persistent and resolvable identifier
How this assertion came to be, methods,
evidence, context, etc.
How this assertion came to be, methods,
evidence, context, etc.
• Detailed attribution for authors, institutions, lab technicians, curators
• License info• Publication date
• Detailed attribution for authors, institutions, lab technicians, curators
• License info• Publication date
A Nanopublication-Centric View• Improvements of SOAPdenovo2 have also been observed in assembling GAGE [8]
dataset (see Additional file 1: Supplementary Method 6 and Tables 2 and 3). As shown in Tables 2 and 3, the correct assembly length of SOAPdenovo2 increased by approximately 3 to 80-fold comparing with that of SOAPdenovo1.
SOAPdenovo2 S. aureus pipeline
How do we generate a nanopub from this?
…stay tuned for Tech Track talk #34 by Marco Roos
ICC Lounge 81, Tuesday 23rd: 3.40pm-4.05pm
Final step: visualizationFinal step: visualization
NC_010079.pdf
gi_161510924_ref_NC_010063.1_.pdf
CONTIGuator 2 (thanks Marco Galardini)CONTIGuator 2 (thanks Marco Galardini)
https://github.com/combogenomics/CONTIGuator
Lessons learned:
• Is possible to push button(s) & recreate a result from a paper
• Reproducibility is COSTLY. How much are you willing to spend?
• Learn a huge amount about the study, and provides lots of information not present in the paper
• Much easier to do this before rather than after publication
steps
• Complete the case study on the release of ISA-OWL, nanopubs & ROs
• Extend the case study by including more than one datasets or ROs, in order to show how related or conflicting information can be more easily interlinked
• Create community guidelines on how these three models should be used together, e.g. recommended patterns or vocabulary terms
www.gigasciencejournal.com
Give us your data & pipelines!*
Want to go beyond dead trees & the PDF?
[email protected]@[email protected]
Contact us:
* APC’s currently generously covered by BGI in 2013
Ruibang Luo (BGI/HKU)Shaoguang Liang (BGI-SZ)Tin-Lap Lee (CUHK)Qiong Luo (HKUST)Senghong Wang (HKUST)Yan Zhou (HKUST)
Thanks to:
@gigasciencefacebook.com/GigaScienceblogs.openaccesscentral.com/blogs/gigablog/
Peter LiHuayan Gao Chris HunterJesse Si ZheNicole NogoyLaurie Goodman
Marco Roos (LUMC)Mark Thompson (LUMC)Jun Zhao (Oxford)Susanna Sansone (Oxford)Philippe Rocca-Serra (Oxford) Alejandra Gonzalez-Beltran (Oxford)
www.gigadb.orggalaxy.cbiit.cuhk.edu.hk
www.gigasciencejournal.com
CBIITFunding from:
Our collaborators:team: Case study:
DataData
Method/Experimental
protocol
Method/Experimental
protocol
FindingsFindings
Types of resources in an RO
Wfdesc/ISA-TAB/ISA2OWL
Wfdesc/ISA-TAB/ISA2OWL
Models to describe each resource type
The Big Picture