Data Consultant, Honorary Academic Editor Associate Director, Principal Investigator The rise of the data-centric research and publication enterprises Susanna-Assunta Sansone, PhD RIKEN Yokohama, 25 June, 2014 http://www.slideshare.net/SusannaSansone

FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014


DESCRIPTION

Overview of FAIR data concept (http://www.dtls.nl/dtl/news/fairport-workshop.html) and my related activities + overview of NPG Scientific Data


Page 1: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Data Consultant, Honorary Academic Editor

Associate Director, Principal Investigator

The rise of the data-centric research and publication enterprises

Susanna-Assunta Sansone, PhD

RIKEN Yokohama, 25 June, 2014

http://www.slideshare.net/SusannaSansone

Page 2: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Outline

•  About myself
   o  activities and interests
•  FAIR data
   o  concept
   o  my related projects
•  Scientific Data
   o  rationale
   o  Data Descriptors
   o  examples

Page 3: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

My areas of activity:
•  Data capture and curation
•  Data (nano)publication
•  Data provenance
•  Open, community ontologies and standards
•  Semantic web
•  Software development
•  Training

Communities I work with/for:

As part of:
•  UK, European and international consortia
•  Pre-competitive informatics public-private partnerships
•  Standardization initiatives
with e.g.:

Page 4: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Notes and narrative: notes in lab books (information for humans)

Spreadsheets and tables: the compromise

Linked data and nanopublications: facts as RDF statements (information for machines)

Enabling reproducible research and open science, driving science and discoveries

Increase the level of annotation at the source, tracking provenance and using community standards
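
To make "facts as RDF statements" concrete, here is a minimal sketch of one fact written as subject-predicate-object triples and serialized in N-Triples syntax. It uses only plain Python; every URI under `http://example.org/` and both helper functions are hypothetical placeholders for illustration, not identifiers from any real vocabulary or project.

```python
# A fact about a sample, expressed as subject-predicate-object triples.
# All URIs under example.org are hypothetical placeholders.

EX = "http://example.org/"


def iri(value: str) -> str:
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value: str) -> str:
    """Quote a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    # "Sample 1 has organism Mus musculus"
    (iri(EX + "sample/1"), iri(EX + "vocab/organism"), literal("Mus musculus")),
    # "Sample 1 was derived from source A" (a provenance statement)
    (iri(EX + "sample/1"), iri(EX + "vocab/derivedFrom"), iri(EX + "source/A")),
]

print(to_ntriples(triples))
```

Each line of the output is a self-contained statement that a machine can merge with statements from other sources, which is the contrast with lab-book notes that the slide draws.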

Page 5: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Credit to: https://projects.ac/blog/five-top-reasons-to-protect-your-data-and-practise-safe-science/

Page 6: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

A great start, but not enough!

image by Greg Emmerich

Page 7: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Worldwide movement for FAIR data

§  Researchers and bioinformaticians in both academic and commercial science, along with funding agencies and publishers, embrace the concept that both
•  DATA: entities of interest, e.g. genes, metabolites, phenotypes, and
•  METADATA: experimental steps, e.g. provenance of study materials, technology and measurement types
should be Findable, Accessible, Interoperable and Reusable

Page 8: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

The International Conference on Systems Biology (ICSB), 22-28 August, 2008 Susanna-Assunta Sansone www.ebi.ac.uk/net-project


sample characteristic(s)

experimental design

experimental variable(s)

technology(s)

measurement(s)

protocols(s)

data file(s)

......

Page 9: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014


§  To make this dataset ‘FAIR’, one must have tools, standards and best practices to:
•  report sufficient details
•  capture all salient features of the experimental workflow
•  make annotation explicit and discoverable
•  structure the descriptions for consistency
•  ensure/regulate access
•  deposit and publish
•  etc.
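
As a rough illustration of "report sufficient details", the sketch below checks a dataset description against the metadata elements listed on the earlier slide (sample characteristics, experimental design, experimental variables, technologies, measurements, protocols, data files). The field names, the `missing_elements` helper and the example record are all hypothetical; a real checklist would come from a community reporting guideline.

```python
# Metadata elements taken from the slide listing what describes a dataset.
# The snake_case field names are an assumption made for this sketch.
REQUIRED_ELEMENTS = [
    "sample_characteristics",
    "experimental_design",
    "experimental_variables",
    "technologies",
    "measurements",
    "protocols",
    "data_files",
]


def missing_elements(record: dict) -> list:
    """Return the required elements that are absent or empty in the record."""
    return [name for name in REQUIRED_ELEMENTS if not record.get(name)]


# Hypothetical dataset description with one element left empty.
record = {
    "sample_characteristics": {"organism": "Mus musculus"},
    "experimental_design": "time course",
    "experimental_variables": ["time point"],
    "technologies": ["mass spectrometry"],
    "measurements": ["metabolite profiling"],
    "protocols": [],
    "data_files": ["run_01.mzML"],
}

print(missing_elements(record))
```

A check like this only confirms that each element is present; whether the content is detailed enough for reuse still needs curation or review.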

Page 10: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014
Page 11: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

General-purpose, configurable format, designed to support:
•  description of the experimental metadata, making the annotation explicit and discoverable
•  provenance tracking
•  use of community standards, such as minimal reporting guidelines and terminologies
•  conversion to a growing number of other metadata formats, e.g. those used by EBI repositories

[Diagram labels: analysis method, script; data file or record in a database]
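
Assuming the format described on this slide is ISA-Tab (named later in the deck), the sketch below writes a heavily simplified, study-style table as tab-delimited text. The column headers follow the ISA-Tab naming pattern (`Source Name`, `Characteristics[...]`, `Protocol REF`, `Sample Name`), but the rows, the `write_study_table` helper and the reduced column set are illustrative assumptions; this is not a complete or validated ISA-Tab file.

```python
import csv
import io

# A reduced set of study-table columns in the ISA-Tab naming style.
STUDY_COLUMNS = [
    "Source Name",
    "Characteristics[organism]",
    "Protocol REF",
    "Sample Name",
]

# Hypothetical rows: each links a source material to a sample via a protocol.
rows = [
    {
        "Source Name": "source1",
        "Characteristics[organism]": "Mus musculus",
        "Protocol REF": "sample collection",
        "Sample Name": "sample1",
    },
    {
        "Source Name": "source2",
        "Characteristics[organism]": "Mus musculus",
        "Protocol REF": "sample collection",
        "Sample Name": "sample2",
    },
]


def write_study_table(table_rows) -> str:
    """Serialize the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=STUDY_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(table_rows)
    return buffer.getvalue()


table = write_study_table(rows)
print(table)
```

Reading a row left to right gives the provenance chain the slide refers to: which source a sample came from and which protocol produced it.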

Page 12: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014
Page 13: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014


ISA powers data collection, curation resources and repositories, e.g.:

Page 14: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Reporting standards and data interoperability

Including minimum information reporting requirements, or checklists to report the same core, essential information

Including controlled vocabularies, taxonomies, thesauri, ontologies etc. to use the same word and refer to the same ‘thing’

Including conceptual model, conceptual schema from which an exchange format is derived to allow data to flow from one system to another

Community-developed standards are pivotal to structure and enrich the description of datasets and to share them, facilitating understanding and reuse

Page 15: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Growing number of reporting standards

Counts shown: +130, +150, +303

Source: BioPortal

Databases, annotation and curation tools implementing standards

Minimum information checklists: MIAME, MIAPA, MIRIAM, MIQAS, MIX, MIGEN, CIMR, MIAPE, MIASE, MIQE, MISFISHIE, REMARK, CONSORT…

Exchange formats: MAGE-Tab, GCDML, SRAxml, SOFT, FASTA, DICOM, MzML, SBRML, SEDML, GELML, ISA-Tab, CML, MITAB…

Terminologies: AAO, CHEBI, OBI, PATO, ENVO, MOD, BTO, IDO, TEDDY, PRO, XAO, DO, VO…

Source: BioSharing

Page 16: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Which standards and databases can we use/recommend?

I work in the field of cell migration research; which ones are applicable to me?

I use cell migration in translational research; are there specific clinical standards?

Page 17: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014
Page 18: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Registering and cataloging is just step one; the next steps are:
•  Develop assessment criteria for usability and popularity of standards
•  Associate standards to data policies and databases
•  Assemble journal and funder policies regarding data storage
•  Make everything fully cross-searchable
•  Intended goal: help stakeholders make informed decisions

Page 19: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Outline

•  About myself
   o  activities and interests
•  FAIR data
   o  concept
   o  my related projects
•  Scientific Data
   o  rationale
   o  Data Descriptors
   o  examples

Page 20: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

FAIR data - roles and responsibilities

•  Data has to become an integral part of scholarly communications

•  Responsibilities lie across several stakeholder groups: researchers, data centers, librarians, funding agencies and publishers

•  But publishers occupy a “leverage point” in this process

Page 21: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Journal publishing - changing landscape

Human Genome 2001: 62 pages, 150 authors, 49 figures, 27 tables

ENCODE Project 2012: 30 papers, 3 journals

Page 22: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Helping you publish, discover and reuse research data

Visit nature.com/scientificdata
Email [email protected]
Tweet @ScientificData

Supported by:

Honorary Academic Editor: Susanna-Assunta Sansone, PhD
Managing Editor: Andrew L Hufton, PhD
Editorial Curator: Victoria Newman
Advisory Panel and Editorial Board including senior researchers, funders, librarians and curators

Page 23: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Launched on May 27th, 2014

A new online-only publication for descriptions of scientifically valuable datasets in the life, environmental and biomedical sciences, but not limited to these

Credit for sharing your data

Focused on reuse and reproducibility

Peer reviewed, curated

Promoting Community Data Repositories

Open Access

Page 24: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Data Descriptor: narrative and structure

Article or narrative component (PDF and HTML)

Experimental metadata or structured component (in-house curated, machine-readable formats)

Page 25: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Data Descriptor: narrative

Sections:
•  Title
•  Abstract
•  Background & Summary
•  Methods
•  Technical Validation
•  Data Records
•  Usage Notes
•  Figures & Tables
•  References
•  Data Citations

Focus on data reuse
Detailed descriptions of the methods and technical analyses supporting the quality of the measurements.
Does not contain tests of new scientific hypotheses

In traditional publications this information is not provided in sufficient detail

However, this information is essential for understanding, reusing, and reproducing datasets

Page 26: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Data Descriptor: narrative

Joint Declaration of Data Citation Principles by the Data Citation Synthesis Group, incl.:
-  CODATA
-  Research Data Alliance (RDA)
-  Force11

Page 27: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Data Descriptor: structure (CC0)

In-house curation team:
•  assists users to submit the structured content via simple templates and an internal authoring tool
•  performs value-added semantic annotation of the experimental metadata

For advanced users/service providers willing to export ISA-Tab for direct submission, we will release a technical specification.

[Diagram labels: analysis method, script; data file or record in a database]

Page 28: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Making data discoverable

Export to various formats (ISA-Tab, RDF, etc.)

Linking between research papers, Data Descriptors, and data records

Page 29: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Helping authors find the right place for the data

•  We currently recognize over 50 public data repositories
•  We have integrated systems with both:

“Omics” is emphasized among basic life-sciences repositories

[Chart: repositories by category. Categories: DNA and protein sequence; Functional genomics; Genetic association and genome variation; Metagenomics; Molecular interactions; Organism- or disease-specific; Proteomics; Taxonomy and species diversity; Traces and sequencing reads. Values shown: 24, 3, 10, 4, 1, 4, 3, 4]

A complete list of repositories is at nature.com/scientificdata/for-authors/data-deposition-policies/#recommended-repositories

Page 30: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Data repositories - criteria

1.  Have broad support and recognition within their scientific community
2.  Ensure long-term persistence and preservation of datasets in their published form
3.  Provide expert curation
4.  Implement relevant, community-endorsed reporting requirements
5.  Provide for confidential review of submitted datasets
6.  Provide stable identifiers for submitted datasets
7.  Allow public access to data without unnecessary restrictions
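
The seven criteria can be read as a simple checklist. The sketch below encodes them as flags and reports which ones a candidate repository does not yet meet; the key names, the `unmet_criteria` helper and the example repository are hypothetical, and this is not how the journal actually assesses repositories.

```python
# The seven repository criteria from the slide, as short keys.
# Key names are an assumption made for this sketch.
CRITERIA = [
    "community_recognition",
    "long_term_preservation",
    "expert_curation",
    "reporting_requirements",
    "confidential_review",
    "stable_identifiers",
    "public_access",
]


def unmet_criteria(repository: dict) -> list:
    """Return the criteria that the repository does not satisfy."""
    return [name for name in CRITERIA if not repository.get(name, False)]


# Hypothetical self-assessment for an imaginary repository.
example_repository = {
    "community_recognition": True,
    "long_term_preservation": True,
    "expert_curation": True,
    "reporting_requirements": True,
    "confidential_review": False,
    "stable_identifiers": True,
    "public_access": True,
}

print(unmet_criteria(example_repository))
```

In this example the only gap is confidential review, i.e. the ability to share a dataset privately with referees before publication.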

Page 31: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Open Access - APC supported

Data: the primary datasets will reside in public repositories. Partnering with FigShare and Dryad, which are both CC0

Data Descriptor - structured component (ISA-Tab): as NPG has already done with its existing Linked Data Portal, the metadata about Data Descriptors in Scientific Data will be CC0

Data Descriptor - narrative component: describing the methodology of data generation/collection and processing, will be licensed under either of the following, by author choice:

Page 32: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Relation with traditional articles - content and time

Scientific hypotheses:
•  Synthesis
•  Analysis
•  Conclusions

Methods and technical analyses supporting the quality of the measurements:
•  What did I do to generate the data?
•  How was the data processed?
•  Where is the data?
•  Who did what, when?

BEFORE: get your data to the community as soon as possible (see NPG pre-publication policy)
AT THE SAME TIME: publish your Data Descriptor(s) alongside research article(s)
AFTER: expand on your research articles, adding further information for reuse of the data

Page 33: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Current content is diverse – bimonthly releases

•  Neuroscience, ecology, epidemiology, environmental science, functional genomics, metabolomics, toxicology
•  New datasets and previously published datasets
•  Datasets in figshare, OpenfMRI, GEO, GenomeRNAi, ArrayExpress and MetaboLights
•  Code deposited in figshare and GitHub
•  Individual datasets, compendium and citizen science
•  First dataset part of a collection
•  Academic and industry authors

Page 34: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Hanke: Neuroscience

New Dataset
Data in OpenfMRI
Source code in GitHub

Big Data

Page 35: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Stefano: Stem Cells

Associated Nature Article
Data:
-  figshare
-  NCBI GEO
Integrated figshare data viewer

Page 36: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Kirwan: MS metabolomics

richer ISA-Tab

New Dataset Data in EBI MetaboLights

Page 37: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Yu: RNA-Seq transcriptomics (part of the SEQC collection)

Associated Nature Communications article
Data in NCBI GEO

Page 38: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Baud: genomes and phenomes

Associated Nature Genetics article
Data:
-  EBI ArrayExpress
-  figshare

Page 39: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Baud: genomes and phenomes

Associated Nature Genetics article
Data:
-  EBI ArrayExpress
-  figshare

Page 40: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Peer review process focused on quality and reuse

Evaluation is not based on the perceived impact or novelty of the findings

•  Experimental Rigour and Technical Data Quality
   o  Were the data produced in a rigorous and methodologically sound manner?
   o  Was the technical quality of the data supported convincingly with technical validation experiments and statistical analyses of data quality or error, as needed?
   o  Are the depth, coverage, size, and/or completeness of these data sufficient for the types of applications or research questions outlined by the authors?

•  Completeness of the Description
   o  Are the methods and any data-processing steps described in sufficient detail to allow others to reproduce these steps?
   o  Did the authors provide all the information needed for others to reuse this dataset or integrate it with other data?
   o  Is this Data Descriptor, in combination with any repository metadata, consistent with relevant minimum information or reporting standards?

•  Integrity of the Data Files and Repository Record
   o  Have you confirmed that the data files deposited by the authors are complete and match the descriptions in the Data Descriptor?
   o  Have these data files been deposited in the most appropriate available data repository?

Page 41: FAIR data and NPG Scientific Data: RIKEN Yokohama, 25 June, 2014

Interested in collaborating and/or submitting?

•  Do you run a data resource we should recognize?
   o  See on our website the list of criteria databases should meet

•  Are you interested in facilitating submission to us?
   o  See our ISA-Tab specification on the website
      -  you can implement and export in this format from your authoring/curation tool, or from your database

•  Do you want to submit Data Descriptor(s)?
   o  Check suitability by sending a pre-submission enquiry; we accept:
      -  Submissions in the life, environmental and biomedical sciences, but not limited to these
      -  Experimental, observational and computational datasets
      -  Individual datasets, curated aggregations, and collections
      -  Unpublished data and follow-ups, with additional information for wider reuse, e.g.:
         ✓  a fuller, more in-depth look at the data processing steps, supported by additional data files and code from each step
         ✓  additional tutorial-like information for scientists interested in reusing or integrating the data with their own