Enabling Open Science: Data Discoverability, Access and Use · Europe PMC • Abstracts: 30 million • Full-text articles: 3 million • Article citation counts • Grants • ORCIDs

Enabling Open Science: Data

Discoverability, Access and Use

Jo McEntyre

Head of Literature Services

www.ebi.ac.uk

About EMBL-EBI

• Part of the European Molecular Biology Laboratory

• International, non-profit research institute

• Europe’s hub for biological data services and research

A stable, freely available, shared repository

Europe PMC

• Abstracts: 30 million

• Full-text articles: 3 million

• Article citation counts

• Grants

• ORCIDs

• Semantic annotation

• Data citations

• Data integration

Europe PMC is a member of the PMC

International Collaboration.

Funded by 26 European funders of life science research

Degrees of Data

Unstructured/semi-structured

Structured

Added Value

Metadata

A picture of a graph

A spreadsheet of my results

A record in a DNA

sequence

database

A graphical display of a genome

A narrative with

citations, pictures

and attachments Article

The scent of information

It's healthy to remember that

users are selfish, lazy, and

ruthless in applying their

cost-benefit analyses Information Foraging: by JAKOB NIELSEN, June 30, 2003

http://www.nngroup.com/articles/information-scent/

• Data placement: where people will find it

• Scent alone is not scientific …

“retaining files and being prepared to share them” ≠ accessible data

1. Discoverability through accessibility

• Deposit in a public/open database

• Where possible, structured archive (e.g. PDB,

ENA) >> unstructured archive (e.g. Zenodo,

Figshare)

• Uniquely identify it: PID, Accession number, DOI,

ROI

• Give it context: metadata (and more)

• All of the above = citable =

Data Citation Principles

• Data as legitimate, citable products

of research

• Attribution and credit

• Cited as evidence for a claim

• Unique identification

• Access

• Persistence

• Specificity and verifiability

(provenance)

• Interoperability and flexibility

https://www.force11.org/datacitation

2. Discoverability through structured data

structured data is one of the true

enablers of life science

- Discovery of homology between genes across species

- Predicting function based on protein folds

• Structured data can be cross-analysed, compared by

algorithm, and encourages development of new products

and tools

Discoverability through structured data

Metadata – critical to discoverability

Generic: title, submitters, date, file format, version.

citation basic search

Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1

gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006

(Rel. 89, Last updated, Version 7). BN000065

Specific: organism, tissue, assay, page number …

deep search analysis computation

Structured data is good value for money

Annual cost of generating new protein

structure data in labs around the world

Annual cost of

maintaining it

in a central

database

• In-built QA: otherwise you couldn’t do this!

3. Discoverable through added value

Literature-Data Integration

Quality control and trustworthiness

Pre publication

• QA: a range of activities from “is this file valid” to

“is it described adequately/richly by metadata”?

Post publication

• Data are immutable: time and repetition required

to show outliers.

• Openness: enabling assessment & reuse

• Provenance & connectivity with related data,

methods, code, tools, articles …

Text and Data Mining (TDM)

• The more structure there is – the more reusable data

becomes

• TDM builds bridges between unstructured (including

articles) and structured data worlds

• Such analyses can drive adoption of more structure ….

• Open Access infrastructure is a critical enabler

Graphic kindly supplied by the National Centre for Text Mining (NaCTeM), UK

BioStudy EBI

BioStudy database for unstructured data

Study

Publications

Ontologies

Data files

Other DBs

Metadata

Other DBs

Making data discoverable

Labs around the

world deposit

data and we…

Archive it

Classify it Share it with

other data

providers

Analyse, add

value and

integrate it

…provide

tools to help

researchers

use it

A collaborative

enterprise

Elixir: An international distributed infrastructure for • Data

• Standards

• Tools

• Compute

• Training

• Industry

Some … of many … Open Questions

• What can stakeholders do to capitalize on the research

investment?

• Incentives and credit for good data management

• Can all data outputs to be peer reviewed (like articles)?

How far does the article model apply to data?

• Negative results?

• What are we prepared to spend: costs—benefits?

• How does this apply to big data?

• What is the public interest in discoverable data?

Documents

Enabling Open Science: Data Discoverability, Access and Use · Europe PMC • Abstracts: 30 million • Full-text articles: 3 million • Article citation counts • Grants • ORCIDs