Delivering reproducible bioscience data by enabling biocuration at the source Susanna-Assunta Sansone, PhD Principal Investigator and Team Leader, University of Oxford e-Research Centre, Oxford, UK Academic Consultant, Open Access Data Products, Nature Publishing Group Digital Curation Centre (DCC) 13th Regional Data Management Roadshow, London, 20 November 2012 www.slideshare.net/SusannaSansone


http://www.dcc.ac.uk/events/data-management-roadshows/dcc-roadshow-london-2


Page 1

Delivering reproducible bioscience data by enabling biocuration at the source

Susanna-Assunta Sansone, PhD

Principal Investigator and Team Leader, University of Oxford e-Research Centre, Oxford, UK

Academic Consultant, Open Access Data Products, Nature Publishing Group

Digital Curation Centre (DCC) 13th Regional Data Management Roadshow, London, 20 November 2012

www.slideshare.net/SusannaSansone

Page 2

University of Oxford e-Research Centre

Page 3

Integrating with national and international infrastructure

Supporting leading edge facilities through education and training

University of Oxford e-Research Centre

Providing research computing, high-performance computing

Page 4

University of Oxford e-Research Centre

Collaborating with European and wider international groups in, e.g.:

•  energy
•  radio astronomy
•  biological data federation
•  life sciences simulation
•  biodiversity
•  computational chemistry
•  neuroscience
•  digital humanities tools
•  digital music analysis
•  visualization
•  …

Page 5

My team’s activities and the stakeholders we work with

Data management and biocuration; collaborative development of software and databases, standards and ontologies

•  environmental genomics
•  metabolomics
•  metagenomics
•  nanotechnology
•  proteomics
•  stem cell discovery
•  systems biology
•  transcriptomics
•  toxicogenomics
•  environmental health

Page 6

Outline

“The buzz around reproducible bioscience data: the communities and the standards”

“The reality from the buzz: challenges and exemplar project”

Page 7

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

Page 8

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

COMPREHENSIBLE

Page 9

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

COMPREHENSIBLE

INTEROPERABLE

Page 10

http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY

COMPREHENSIBLE

INTEROPERABLE

REUSABLE

Page 11

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project

sample characteristic(s)

experimental design

experimental variable(s)

technology(s)

measurement(s)

protocol(s)

data file(s)

......

Page 12

§  Capture all salient features of the experimental workflow
§  Make annotation explicit and discoverable
§  Structure the descriptions for consistency, tracking
§  We must strike a balance between
   •  depth and breadth of information; and
   •  sufficient information required to reuse the data

Page 13

Growing, worldwide movement for reproducible research

§  Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that community-developed standards are pivotal to structure and enrich the annotation of
   •  entities of interest (e.g., genes, metabolites, phenotypes) and
   •  experimental steps (e.g., provenance of study materials, technology and measurement types)

esoteric formats

ad hoc or proprietary terminologies

lack of sufficient contextual information

comprehensible?

interoperable?

reusable?

Source: http://ebbailey.wordpress.com

Page 14

Community mobilization to develop standards, e.g.:

•  report the same core, essential information
•  use the same word and refer to the same ‘thing’
•  allow data to flow from one system to another
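
As a rough illustration of these three roles (a minimum reporting checklist, a shared terminology and an exchange format), the sketch below checks a record for required fields, maps free-text terms to one shared identifier and writes the record as tab-delimited text. The field names, the term map and its identifiers are hypothetical placeholders and are not taken from any actual checklist, ontology or format.

```python
import csv
import io

# Hypothetical minimum-information checklist: fields every record must report.
REQUIRED_FIELDS = ["sample", "organism", "technology", "data_file"]

# Hypothetical controlled vocabulary: free-text synonyms map to one shared identifier.
TERM_MAP = {
    "mouse": "EX:0001",
    "mus musculus": "EX:0001",
    "human": "EX:0002",
    "homo sapiens": "EX:0002",
}


def missing_fields(record):
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


def normalise_term(value):
    """Map a free-text value to its shared identifier, or None if unknown."""
    return TERM_MAP.get(value.strip().lower())


def to_tsv(records):
    """Serialise records as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=REQUIRED_FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow({field: record.get(field, "") for field in REQUIRED_FIELDS})
    return buffer.getvalue()


# Example record with an empty data_file entry (illustrative values only).
record = {
    "sample": "s1",
    "organism": "Mouse",
    "technology": "mass spectrometry",
    "data_file": "",
}

print(missing_fields(record))
print(normalise_term(record["organism"]))
print(to_tsv([record]))
```

Keeping the three checks as separate functions mirrors the point of the slide: each kind of standard addresses a different problem, and a dataset can satisfy one while failing another.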

Page 15

Is this general mobilization good or bad?

•  report the same core, essential information
•  use the same word and refer to the same ‘thing’
•  allow data to flow from one system to another

§  Fragmentation of the standards is a major issue
   •  Being focused on particular communities’ interests, be they individual technologies or biological/biomedical disciplines, leads to duplication of effort and, more seriously, the development of (largely arbitrarily) different standards
   •  This severely hinders the interoperability of databases and tools and ultimately the integration of datasets

Page 16

Growing number of reporting standards

MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT

MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB

AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO

Page 17

Growing number of reporting standards

+ 130

Estimated

+ 150

Source: MIBBI, EQUATOR

+ 303

Source: BioPortal

Databases, annotation,

curation tools

MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE…, REMARK, CONSORT

MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML…, GELML, ISA-Tab, CML, MITAB

AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO…, TEDDY, PRO, XAO, DO, VO

Page 18

But how much do we know about these standards?

Which ones are mature enough for me to use or recommend?

I work on plants; are these just for biomedical applications?

What are the criteria to evaluate their status and value?

How can I get involved to propose extensions or modifications?

Which tools and databases implement which standards?

I use high-throughput sequencing technologies; which ones are applicable to me?

Page 19

A catalogue to map the landscape of standards and the systems implementing them: Over 400 bio-standards (public and in curation)

Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598

Page 20

•  A coherent, curated and searchable catalogue of data sharing resources
•  Bioscience standards and associated data-sharing policies, publications, tools and databases
•  Assessment criteria for usability and popularity of standards
•  Relationships among standards
•  Encouragement for communication & interaction among groups
•  Promoting interoperability & informed decisions about standards

Page 21
Page 22

Social engineering

Page 23

Ownership of open standards can be problematic in broad, grass-roots collaborations; it requires improved models to encourage maintenance of and contributions to these efforts, supporting their evolution

Page 24

The extensive community liaison needs to be managed and funded; rewards and incentives need to be identified for all contributors

Page 25

The cost of implementing a standards-supported data sharing vision is as large as the number of stakeholders that must operate synchronously

Page 26

Funders are actively developing data policies

Page 27

Similar trend in the regulatory arena…

Page 28

… and in the commercial sector

Page 29

… the rise of data-driven journals, e.g.:

partnering with:

Page 30
Page 31

UK node work in progress

core organization in the

UK Node

Page 32

Reproducible & Reusable

Bioscience Research

Well-annotated & Structured Data

reasoning

analysis

exchange

visualization

retrieval browsing integration

Community Standards

Software Tools

Page 33

An exemplar approach to the status quo

§  A grass-roots collaborative that works to facilitate collection, curation and sharing of experiments using a common, structured representation of the experiments that
   •  transcends individual biological and technological domains and
   •  can be ‘configured’ to implement (several of) the community standards

Page 34

user community

metadata tracking framework

Page 35

collection, curation and sharing of bioscience experiments

Page 36

A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:

•  environmental health
•  environmental genomics
•  metabolomics
•  metagenomics
•  nanotechnology
•  proteomics
•  stem cell discovery
•  systems biology
•  transcriptomics
•  toxicogenomics
•  also by communities working to build a library of cellular signatures

TOWARDS INTEROPERABLE BIOSCIENCE DATA Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios J, Hide W.

Feb 2012

Page 37

Importance of a local community

Implementations at Harvard

Page 38

Importance of a local community

Implementations at Harvard

data sharing in ISA-Tab

Page 39

Importance of a local community

Implementations at Harvard

data sharing in ISA-Tab

Page 40

Implementation at the EBI

Page 41

Nanotechnology Informatics Working Group

Extensions of the

Page 42

We must increase the level of annotation

Notes in Lab Books (information for humans)

Spreadsheets and Tables (the compromise)

Facts as RDF statements (information for machines)

•  Invest in curating and managing data at the source using:
   •  a common metadata tracking framework, such as ISA
   •  publicly available and community-developed terminologies
   •  recording of sufficient contextual information about the experimental steps

§  Progressively, datasets will become more comprehensible, interoperable, reproducible and (re)usable, underpinning future investigations
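
To make the step from "spreadsheets and tables" to "facts as RDF statements" concrete, here is a minimal sketch that turns one table row into subject-predicate-object statements and prints them in an N-Triples-like form. The column names, the values and the http://example.org/ namespace are illustrative assumptions; this is not how the ISA tools perform their RDF conversion, and a real conversion would use community terminologies for the predicates.

```python
# One spreadsheet-style row describing a sample (hypothetical columns and values).
row = {"sample": "sample_1", "organism": "Mus musculus", "tissue": "liver"}

# Hypothetical namespace used for illustration only; not a real vocabulary.
BASE = "http://example.org/"


def row_to_triples(row, subject_column="sample"):
    """Turn one table row into (subject, predicate, object) statements."""
    subject = BASE + row[subject_column]
    triples = []
    for column, value in row.items():
        # The subject column names the thing described; empty cells state nothing.
        if column == subject_column or value == "":
            continue
        triples.append((subject, BASE + column, value))
    return triples


def to_ntriples(triples):
    """Render triples in an N-Triples-like form, with objects as plain literals."""
    lines = []
    for subject, predicate, obj in triples:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return "\n".join(lines)


triples = row_to_triples(row)
print(to_ntriples(triples))
```

Each cell becomes one explicit statement about the sample, which is what makes the annotation readable by machines as well as by people.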

Page 43

Development timeline, 2007 to 2012

Collaborative approaches are highly valuable but take time

Tracks: Community involvement and uptake; Core developments; Publications

Milestones:
•  Strawman ISA-Tab spec
•  1st ISA-Tab workshop
•  2nd ISA-Tab workshop
•  3rd ISA-Tab workshop
•  Final ISA-Tab spec
•  Database instance at EBI
•  ISA software v1
•  1st public instance: Harvard Stem Cell Discovery Engine
•  RDF format starts
•  Conversions to Pride-XML/SRA-XML/MAGE-Tab and more
•  User workshops/visits start
•  Growing number of systems start to adopt ISA framework
•  Other tools implement ISA-Tab
•  Links to analysis tools start

Publications:
•  ‘Omics data sharing (Science)
•  ISA-Tab and ISA software suite (Bioinformatics)
•  Stem Cell Discovery Engine (NAR)
•  Workshop reports
•  ISA Commons (Nature Genetics)

Page 44