What was the plan? A role for data standards, models and computational workflows in scholarly data publishing
Alejandra González-Beltrán, PhD; Philippe Rocca-Serra, PhD
Oxford e-Research Centre, University of Oxford
{alejandra.gonzalezbeltran,philippe.rocca-serra}@oerc.ox.ac.uk
ISMB Workshop: What Bioinformaticians need to know about digital publishing beyond the PDF2
July 15th, 2014, Boston, USA

ISMB Workshop 2014

DESCRIPTION

This talk explores how principles derived from experimental design practice, data and computational models can greatly enhance data quality, data generation, data reporting, data publication and data review.

Page 2: ISMB Workshop 2014

The experimental workflow

[Figure: experimental workflow diagram. Labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment]

Page 3: ISMB Workshop 2014

The experimental workflow

[Figure: the experimental workflow diagram from the previous slide, now annotated with: metadata]

Page 5: ISMB Workshop 2014

The experimental workflow

[Figure: the experimental workflow diagram (Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment), annotated with: Data Interoperability; Reproducibility; Data Review]

Page 6: ISMB Workshop 2014

The experimental workflow

[Figure: the experimental workflow diagram shown twice (Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment), annotated with: Data Reusability]

Page 7: ISMB Workshop 2014

The experimental plan - life sciences case

• experimental design
• sample characteristic(s)
• experimental variable(s)

InnoMed PredTox Project

2-week systemic rat study using male Wistar rats (N=15 per dose group)

14 proprietary drug candidates from participating companies and 2 reference toxic compounds

Page 8: ISMB Workshop 2014

The experimental plan - life sciences case

• experimental design
• sample characteristic(s)
• experimental variable(s)
• technology(s)
• measurement(s)
• protocol(s)
• data file(s)
• …

Page 9: ISMB Workshop 2014

The experimental plan - computational case

Evaluation of the SOAPdenovo2 tool for the de novo assembly of genomes from short DNA reads produced by next-generation sequencing, implementing improvements over the SOAPdenovo1 assembler.

• open peer-review
• availability of
  • data
  • analysis scripts
  • documentation

Page 10: ISMB Workshop 2014

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm
• genome size

Page 11: ISMB Workshop 2014

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size

Page 12: ISMB Workshop 2014

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size: bacterial genome, insect genome, human genome

Page 13: ISMB Workshop 2014

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size: bacterial genome, insect genome, human genome

Each assembly algorithm is crossed with each genome size: 3x3 factorial design, 9 study groups
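
The 3x3 factorial design on this slide can be enumerated mechanically. The following is a minimal sketch, not the authors' tooling: the factor names and levels are taken from the slide, while the function name `study_groups` and the dictionary layout are assumptions made for illustration.

```python
from itertools import product

# Factor levels exactly as listed on the slide.
ASSEMBLERS = ["SOAPdenovo2", "SOAPdenovo1", "ALL-PATHS-LG"]
GENOME_SIZES = ["bacterial genome", "insect genome", "human genome"]


def study_groups(assemblers, genome_sizes):
    """Return one dict per combination of factor levels (full factorial design).

    The function name and the dict layout are hypothetical, chosen for this sketch.
    """
    return [
        {"genome assembly algorithm": assembler, "genome size": size}
        for assembler, size in product(assemblers, genome_sizes)
    ]


groups = study_groups(ASSEMBLERS, GENOME_SIZES)

if __name__ == "__main__":
    # List every study group with a running number.
    for number, group in enumerate(groups, start=1):
        print(number, group["genome assembly algorithm"], "x", group["genome size"])
```

Crossing three levels with three levels yields the nine study groups stated on the slide; adding a level to either factor would extend the design without further changes to the code.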

Page 14: ISMB Workshop 2014

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size: bacterial genome, insect genome, human genome

Genomes used:

• bacterial genome: S. aureus, R. sphaeroides
• insect genome: B. impatiens
• human genome: Chinese Han genome (or YH genome)

Page 15: ISMB Workshop 2014

The experimental plan - computational case

Predictor Variables (Factor Name, Factor Type)

• genome assembly algorithm: SOAPdenovo2, SOAPdenovo1, ALL-PATHS-LG
• genome size: bacterial genome, insect genome, human genome

Response Variables

• genome coverage
• computation run time
• memory consumption

Page 16: ISMB Workshop 2014

http://www.ama-rochester.org/WP/wp-content/uploads/2013/01/three-pillars.png

Page 17: ISMB Workshop 2014

A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework (ISA-Tab and/or tools) to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

• stem cell discovery
• systems biology
• transcriptomics
• toxicogenomics
• environmental health
• environmental genomics
• metabolomics
• metagenomics
• nanotechnology
• proteomics

The framework is also used by communities working to build a library of cellular signatures.

Page 18: ISMB Workshop 2014

General-purpose, configurable format designed to support:

• description of the experimental metadata, making the annotation explicit and discoverable
• provenance tracking
• use of community standards, such as minimal reporting guidelines and terminologies
• conversion to a growing number of other metadata formats, e.g. those used by the European Bioinformatics Institute (EBI) repositories

Page 19: ISMB Workshop 2014

[Figure: the same experiment shown twice, first as a table with one row per sample and then as a graph in which repeated values appear once. Sources: H1 (H. sapiens, 35 Years) and H2 (H. sapiens, 33 Years). Samples: H1.sample1, H1.sample2, H2.sample1. Labeling steps produce H1.sample1.labeled, H2.sample1.labeled, ... Scanning steps produce h1-s1.cel, h1-s2.cel, h2-s1.cel, ...]
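
The move from a table with repeated cells to a graph with shared nodes can be sketched in a few lines. This is an illustrative sketch only: the row values are read off the slide, the pairing of each sample with a `.cel` file is an assumption based on the file names, the labeling step is left out because the slide elides part of it, and the names `ROWS`, `source_nodes`, `rows_to_edges` and the edge label `yields` are hypothetical.

```python
# Rows as they appear in the tabular view of the slide.
# ASSUMPTION: each sample is paired with the .cel file whose name matches it.
ROWS = [
    {"source": "H1", "organism": "H. sapiens", "age_years": 35,
     "sample": "H1.sample1", "raw_file": "h1-s1.cel"},
    {"source": "H1", "organism": "H. sapiens", "age_years": 35,
     "sample": "H1.sample2", "raw_file": "h1-s2.cel"},
    {"source": "H2", "organism": "H. sapiens", "age_years": 33,
     "sample": "H2.sample1", "raw_file": "h2-s1.cel"},
]


def source_nodes(rows):
    """Collapse repeated source cells into a single node per source."""
    nodes = {}
    for row in rows:
        nodes[row["source"]] = (row["organism"], row["age_years"])
    return nodes


def rows_to_edges(rows):
    """Turn each row into source -> sample -> raw file edges, without duplicates."""
    edges = set()
    for row in rows:
        edges.add((row["source"], "yields", row["sample"]))
        edges.add((row["sample"], "yields", row["raw_file"]))
    return sorted(edges)


if __name__ == "__main__":
    print(source_nodes(ROWS))
    for edge in rows_to_edges(ROWS):
        print(edge)
```

The three table rows mention source H1 twice, but the graph view holds it as one node with two outgoing edges, which is what makes the provenance of each data file easy to trace.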

Page 20: ISMB Workshop 2014

[Figure: the table and graph views from the previous slide (sources H1 and H2, H. sapiens, 35 and 33 Years; samples H1.sample1, H1.sample2, H2.sample1; Labeling; H1.sample1.labeled, H2.sample1.labeled, ...; Scanning; h1-s1.cel, h1-s2.cel, h2-s1.cel, ...), annotated with ontology terms:]

• obi:material entity
• obi:material sample
• obi:material processing
• obi:processed material
• obi:planned process
• isa:raw data file
• bfo:derives from

Page 21: ISMB Workshop 2014
Page 22: ISMB Workshop 2014
Page 23: ISMB Workshop 2014

http://gigasciencejournal.com

http://gigadb.org/dataset/100035

Page 25: ISMB Workshop 2014

Experimental metadata or structured component (in-house curated, machine-readable formats)

Article or narrative component (PDF and HTML)

A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these

Credit for sharing your data

Focused on reuse and reproducibility

Peer reviewed, curated

Promoting Community Data Repositories

Open Access

Page 26: ISMB Workshop 2014

SOAPdenovo2

http://isa-tools.github.io/soapdenovo2

Page 28: ISMB Workshop 2014

SOAPdenovo2

http://isa-tools.github.io/soapdenovo2

Galaxy workflows to re-enact the data analysis

Page 29: ISMB Workshop 2014

http://isa-tools.github.io/soapdenovo2

SOAPdenovo2

Nanopub: represents structured data along with its provenance in a single publishable and citable entity
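
As a rough illustration of that definition, a nanopublication can be modelled as a small record holding a statement, its provenance, and the information needed to cite it. This is a sketch under stated assumptions, not the format used in the SOAPdenovo2 case study: the class `Nanopub`, its field names, the predicate wording and the `urn:example:` identifier are all hypothetical; only the tool names and the project URL come from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nanopub:
    """Hypothetical in-memory model of a nanopublication."""

    assertion: tuple   # (subject, predicate, object) statement
    provenance: dict   # where the assertion came from
    pubinfo: dict      # who packaged it, and under which identifier

    def is_citable(self) -> bool:
        """A record is citable here only if all three parts are present."""
        return (
            len(self.assertion) == 3
            and bool(self.provenance)
            and "identifier" in self.pubinfo
        )


# Example statement paraphrasing a finding quoted later in the talk.
example = Nanopub(
    assertion=("SOAPdenovo2",
               "has higher genome coverage on human data than",
               "SOAPdenovo1"),
    provenance={"derived_from": "http://isa-tools.github.io/soapdenovo2"},
    pubinfo={"identifier": "urn:example:nanopub-1"},  # placeholder identifier
)

if __name__ == "__main__":
    print(example.is_citable())
```

Keeping the statement, its provenance and its publication details in one immutable record mirrors the "single publishable and citable entity" wording on the slide.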

Page 30: ISMB Workshop 2014

http://isa-tools.github.io/soapdenovo2

SOAPdenovo2

ResearchObject: enables the aggregation of the digital resources contributing to findings of computational research, including results, data and software, as citable compound digital objects
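
The aggregation idea can be sketched as a container that groups resources by the kinds named on the slide (results, data, software). This is an illustrative sketch, not the Research Object model itself: the classes `Resource` and `ResearchObject`, the method names, and every `urn:example:` identifier are hypothetical, and treating the GigaDB URL shown on an earlier slide as a data resource is an assumption.

```python
from dataclasses import dataclass, field

# Kinds of resource named on the slide.
KINDS = ("result", "data", "software")


@dataclass(frozen=True)
class Resource:
    uri: str
    kind: str


@dataclass
class ResearchObject:
    """Hypothetical compound object aggregating related resources."""

    identifier: str
    resources: list = field(default_factory=list)

    def aggregate(self, uri: str, kind: str) -> "ResearchObject":
        """Add a resource; reject kinds the sketch does not know about."""
        if kind not in KINDS:
            raise ValueError(f"unknown resource kind: {kind}")
        self.resources.append(Resource(uri, kind))
        return self

    def by_kind(self, kind: str) -> list:
        """Return the URIs of all aggregated resources of one kind."""
        return [r.uri for r in self.resources if r.kind == kind]


ro = ResearchObject("urn:example:soapdenovo2-research-object")
ro.aggregate("http://gigadb.org/dataset/100035", "data")       # role assumed
ro.aggregate("urn:example:galaxy-workflow", "software")        # placeholder
ro.aggregate("urn:example:assembly-statistics", "result")      # placeholder

if __name__ == "__main__":
    for kind in KINDS:
        print(kind, ro.by_kind(kind))
```

Because the object carries its own identifier, the whole bundle can be cited as one unit while each aggregated resource stays individually addressable.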

Page 31: ISMB Workshop 2014

Reproducing SOAPdenovo2 results: Galaxy workflows

S. aureus pipeline

Page 34: ISMB Workshop 2014

Reproducing SOAPdenovo2 results: Galaxy workflows

[Figure: screenshot with numeric results; the values are not recoverable from the extracted text]

Page 35: ISMB Workshop 2014

“genome coverage increased over the human data when comparing SOAPdenovo2 against SOAPdenovo1”

Response Variables

• genome coverage
• computation run time
• memory consumption

Page 36: ISMB Workshop 2014

OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets. Maguire et al., 2013, Bioinformatics

A widget for ontology annotation and tagging on Google spreadsheets, relying on BioPortal and Linked Open Vocabularies services

Page 37: ISMB Workshop 2014

OntoMaton: a BioPortal powered ontology widget for Google Spreadsheets. Maguire et al., 2013, Bioinformatics

NanoMaton https://github.com/ISA-tools/NanoMaton

Ontology for Biomedical Investigations

Semanticscience Integrated Ontology

Page 38: ISMB Workshop 2014

FAIR data: Findable, Accessible, Interoperable, Reusable

[Figure: experimental workflow diagram. Labels: Data Scientist; Planning; Data Collection; Data Management; Analysis; Visualization; Publication; Use existing data; Perform new experiment]

Page 39: ISMB Workshop 2014

Contributing to MetaboLights and ISA

• BBSRC UK-China Award & BGI funded Hackathon
• venue: BGI Hong-Kong
• Participants:
  • MetaboLights/BGI/ISA/Birmingham/Hong-Kong University
• Outcome:
  • ISAtab web viewer code
  • Functional Specifications & Code for DoE Wizard API
  • 4 datasets coded in ISA format
  • Conversion of MetaboLights datasets to RDF

Page 40: ISMB Workshop 2014
Page 41: ISMB Workshop 2014

funders

acknowledgements

Scott Edmunds, GigaScience

Peter Li, GigaScience

Jun Zhao, Lancaster University

María Susana Avila García, Oxford University

Marco Roos, Leiden University

Mark Thompson, Leiden University

Ruibang Luo, University of Hong Kong

Tin-Lap Lee, Chinese University of Hong Kong

Tak-wah Lam, University of Hong Kong

Page 42: ISMB Workshop 2014

Questions? You can email us...

[email protected]

View our blog http://isatools.wordpress.com

Follow us on Twitter @isatools

View our websites

View our Git repo & contribute http://github.com/ISA-tools

Thanks for your attention!