http://www.dcc.ac.uk/events/data-management-roadshows/dcc-roadshow-london-2
Delivering reproducible bioscience data by enabling
biocuration at the source
Susanna-Assunta Sansone, PhD
Principal Investigator and Team Leader, University of Oxford e-Research Centre, Oxford, UK
Academic Consultant, Open Access Data Products, Nature Publishing Group
Digital Curation Centre (DCC) 13th Regional Data Management Roadshow, London, 20 November 2012
www.slideshare.net/SusannaSansone
University of Oxford e-Research Centre
Integrating with national and international infrastructure
Supporting leading edge facilities through education and training
University of Oxford e-Research Centre
Providing research computing, high-performance computing
University of Oxford e-Research Centre
Collaborating with European and wider international groups in, e.g.:
• energy, • radio astronomy, • biological data federation, • life sciences simulation, • biodiversity, • computational chemistry, • neuroscience, • digital humanities tools, • digital music analysis • visualization • …
My team’s activities and the stakeholders we work with: data management and biocuration; collaborative development of software, databases, standards and ontologies
• environmental genomics • metabolomics • metagenomics • nanotechnology • proteomics
• stem cell discovery • systems biology • transcriptomics • toxicogenomics • environmental health
Outline
“The buzz around reproducible bioscience data: the communities and the standards”
“The reality from the buzz: challenges and exemplar project”
[Image: http://www.flickr.com/photos/notbrucelee/8016189356/ CC BY, overlaid across successive slides with the words COMPREHENSIBLE, INTEROPERABLE, REUSABLE]
The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project
sample characteristic(s)
experimental design
experimental variable(s)
technology(ies)
measurement(s)
protocol(s)
data file(s)
......
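The elements listed above can be read as the fields of one structured record per experiment. A minimal sketch in Python, assuming hypothetical names throughout (`ExperimentDescription`, its attributes and `missing_elements` are illustrative and are not taken from any published standard or from the ISA specification):

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class ExperimentDescription:
    """Illustrative record; the field names are assumptions, not a standard."""
    experimental_design: str = ""
    sample_characteristics: List[str] = field(default_factory=list)
    experimental_variables: List[str] = field(default_factory=list)
    technologies: List[str] = field(default_factory=list)
    measurements: List[str] = field(default_factory=list)
    protocols: List[str] = field(default_factory=list)
    data_files: List[str] = field(default_factory=list)


def missing_elements(description: ExperimentDescription) -> List[str]:
    """Return the names of the fields that were left empty."""
    return [f.name for f in fields(description) if not getattr(description, f.name)]


# Made-up example values, for illustration only.
example = ExperimentDescription(
    experimental_design="dose response",
    sample_characteristics=["organism: Mus musculus"],
    technologies=["mass spectrometry"],
    data_files=["run01.mzML"],
)
print(missing_elements(example))
```

Keeping every element in one typed record makes gaps in the description visible at the point of data entry rather than at submission time.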
§ Capture all salient features of the experimental workflow
§ Make annotation explicit and discoverable
§ Structure the descriptions for consistency, tracking
§ We must strike a balance between
• depth and breadth of information; and
• sufficient information required to reuse the data
§ Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that community-developed standards are pivotal to structure and enrich the annotation of
• entities of interest (e.g., genes, metabolites, phenotypes) and
• experimental steps (e.g., provenance of study materials, technology and measurement types)
esoteric formats
ad hoc or proprietary terminologies
lack of sufficient contextual information
comprehensible?
interoperable?
reusable?
Growing, worldwide movement for reproducible research
Source: http://ebbailey.wordpress.com
Community mobilization to develop standards, e.g.:
• report the same core, essential information
• use the same word and refer to the same ‘thing’
• allow data to flow from one system to another
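The three roles of standards named above (a checklist of core information, a shared terminology, and an exchange format) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: `REQUIRED_FIELDS` stands in for a minimum-information checklist, `PREFERRED_TERMS` for a controlled vocabulary, and the tab-delimited output for an exchange format; none of them reproduces a real standard.

```python
import csv
import io

# Hypothetical checklist: the fields every record must report.
REQUIRED_FIELDS = ["sample", "organism", "measurement"]

# Hypothetical vocabulary: free-text labels mapped to one preferred term.
PREFERRED_TERMS = {
    "mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
    "m. musculus": "Mus musculus",
}


def check_required(record):
    """Checklist: list the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def normalise_organism(record):
    """Terminology: replace a free-text organism label with the preferred term."""
    label = record.get("organism", "")
    updated = dict(record)
    updated["organism"] = PREFERRED_TERMS.get(label.strip().lower(), label)
    return updated


def to_tab_delimited(records):
    """Format: serialise records as tab-delimited text for another system to read."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=REQUIRED_FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Made-up records: two labels for the same organism.
records = [
    {"sample": "S1", "organism": "mouse", "measurement": "metabolite profiling"},
    {"sample": "S2", "organism": "M. musculus", "measurement": "metabolite profiling"},
]
clean = [normalise_organism(r) for r in records]
print(check_required({"sample": "S3"}))
print(to_tab_delimited(clean))
```

After normalisation both records refer to the organism with the same term, so a query for one label finds both samples.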
§ Fragmentation of the standards is a major issue
• Being focused on particular communities’ interests, be it individual technologies or biological/biomedical disciplines, leads to duplication of effort and, more seriously, the development of (largely arbitrarily) different standards
• This severely hinders the interoperability of databases and tools and ultimately the integration of datasets
Is this general mobilization good or bad?
MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE, …, REMARK, CONSORT
MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, mzML, SBRML, SEDML, …, GELML, ISA-Tab, CML, MITAB
AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, …, TEDDY, PRO, XAO, DO, VO
Growing number of reporting standards
+ 130
Estimated
+ 150
Source: MIBBI, EQUATOR
+ 303
Source: BioPortal
Databases, annotation,
curation tools
Which ones are mature enough for me to use or recommend?
I work on plants; are these just for biomedical applications?
What are the criteria to evaluate their status and value?
How can I get involved to propose extensions or modifications?
Which tools and databases implement which standards?
I use high-throughput sequencing technologies; which ones are applicable to me?
But how much do we know about these standards?
A catalogue to map the landscape of standards and the systems implementing them: Over 400 bio-standards (public and in curation)
Field*, Sansone* et al., Omics data sharing. Science 326, 234-36 (2009) doi:10.1126/science.1180598
• A coherent, curated and searchable catalogue of data sharing resources
• Bioscience standards and associated data-sharing policies, publications, tools and databases
• Assessment criteria for usability and popularity of standards
• Relationships among standards
• Encouragement for communication & interaction among groups
• Promoting interoperability & informed decisions about standards
Social engineering
Ownership of open standards can be problematic in broad, grass-roots collaborations; it
requires improved models to encourage maintenance of and contributions to these efforts,
supporting their evolution
The extensive community liaison needs to be managed
and funded; rewards and incentives need to be identified
for all contributors
The cost of implementing a standards-supported data
sharing vision is as large as the number of stakeholders that must operate synchronously
Funders are actively developing data policies
Similar trend in the regulatory arena…
… and in the commercial sector
….the rise of data-driven journals, e.g.:
partnering with:
UK node work in progress
core organization in the
UK Node
[Diagram labels: Reproducible & Reusable Bioscience Research; Well-annotated & Structured Data; reasoning, analysis, exchange, visualization, retrieval, browsing, integration; Community Standards; Software Tools]
§ A grass-roots collaborative that works to facilitate collection, curation and sharing of experiments using a common, structured representation of the experiments that
• transcends individual biological and technological domains and
• can be ‘configured’ to implement (several of) the community standards
An exemplar approach to the status quo
user community
metadata tracking framework
collection, curation and sharing of bioscience experiments
A growing ecosystem of over 30 public and internal resources using the ISA metadata tracking framework to facilitate standards-compliant collection, curation, management and reuse of investigations in an increasingly diverse set of life science domains, including:
• environmental health • environmental genomics • metabolomics • metagenomics • nanotechnology • proteomics
• stem cell discovery • systems biology • transcriptomics • toxicogenomics
• also by communities working to build a library of cellular signatures
TOWARDS INTEROPERABLE BIOSCIENCE DATA Sansone SA, Rocca-Serra P, Field D, Maguire E, Taylor C, Hofmann O, Fang H, Neumann S, Tong W, Amaral-Zettler L, Begley K, Booth T, Bougueleret L, Burns G, Chapman B, Clark T, Coleman LA, Copeland J, Das S, de Daruvar A, de Matos P, Dix I, Edmunds S, Evelo C, Forster M, Gaudet P, Gilbert J, Goble C, Griffin J, Jacob D, Kleinjans J, Harland L, Haug K, Hermjakob H, Sui S, Laederach A, Liang S, Marshall S, Merrill E, McGrath A, Reilly D, Roux M, Shamu C, Shang C, Steinbeck C, Trefethen A, Williams-Jones B, Wolstencroft K, Xenarios J, Hide W.
Feb 2012
Importance of a local community
Implementations at Harvard
data sharing in ISA-Tab
Implementation at the EBI
Nanotechnology Informatics Working Group
Extensions of the
Notes in Lab Books (information for humans)
Spreadsheets and Tables (the compromise)
Facts as RDF statements (information for machines)
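The step from a spreadsheet row to machine-readable statements can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the ISA conversion: the base IRI `http://example.org/`, the column names and the helper functions are hypothetical, every value is treated as a plain literal, and column names are assumed to be safe to use in an IRI.

```python
# Hypothetical base IRI, used only for illustration.
BASE = "http://example.org/"


def row_to_triples(row, id_column):
    """Turn one spreadsheet row (a dict) into (subject, predicate, object) triples.

    Empty cells are skipped; column names are assumed to be IRI-safe.
    """
    subject = BASE + "sample/" + row[id_column]
    return [
        (subject, BASE + "property/" + column, value)
        for column, value in row.items()
        if column != id_column and value != ""
    ]


def to_ntriples(triples):
    """Serialise triples as N-Triples lines, with every object as a plain literal.

    Only backslashes and double quotes are escaped in this sketch.
    """
    lines = []
    for subject, predicate, obj in triples:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return "\n".join(lines)


# Made-up row, as it might appear in a sample spreadsheet.
row = {"sample_id": "S1", "organism": "Mus musculus", "tissue": "liver", "note": ""}
triples = row_to_triples(row, "sample_id")
print(to_ntriples(triples))
```

Each non-empty cell becomes one statement about the sample, which is what lets software query across datasets without knowing the layout of the original table.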
We must increase the level of annotation
§ Invest in curating and managing data at the source:
• use a common metadata tracking framework, such as ISA
• use publicly available and community-developed terminologies
• record sufficient contextual information about the experimental steps
§ Progressively, datasets will become more comprehensible, interoperable, reproducible and (re)usable, underpinning future investigations
Development timeline
[Timeline figure, 2007 to 2012, with three tracks: Community involvement and uptake; Core developments; Publications. Labels shown:]
• Strawman ISA-Tab spec
• 1st, 2nd and 3rd ISA-Tab workshops
• Final ISA-Tab spec
• Database instance at EBI
• ISA software v1
• 1st public instance: Harvard Stem Cell Discovery Engine
• RDF format starts
• Conversions to Pride-XML/SRA-XML/MAGE-Tab and more
• User workshops/visits start
• Growing number of systems start to adopt the ISA framework
• Other tools implement ISA-Tab
• Links to analysis tools start
• Workshop reports
• ‘Omics data sharing (Science)
• ISA-Tab and ISA software suite (Bioinformatics)
• Stem Cell Discovery Engine (NAR)
• ISA Commons (Nature Genetics)
Collaborative approaches are highly valuable but take time