Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
Enabling Open Science: Data
Discoverability, Access and Use
Jo McEntyre
Head of Literature Services
www.ebi.ac.uk
About EMBL-EBI
• Part of the European Molecular Biology Laboratory
• International, non-profit research institute
• Europe’s hub for biological data services and research
A stable, freely available, shared repository
Europe PMC
• Abstracts: 30 million
• Full-text articles: 3 million
• Article citation counts
• Grants
• ORCIDs
• Semantic annotation
• Data citations
• Data integration
Europe PMC is a member of the PMC
International Collaboration.
Funded by 26 European funders of life science research
Degrees of Data
Unstructured/semi-structured
Structured
Added Value
Metadata
A picture of a graph
A spreadsheet of my results
A record in a DNA
sequence
database
A graphical display of a genome
A narrative with
citations, pictures
and attachments Article
The scent of information
It's healthy to remember that
users are selfish, lazy, and
ruthless in applying their
cost-benefit analyses Information Foraging: by JAKOB NIELSEN, June 30, 2003
http://www.nngroup.com/articles/information-scent/
• Data placement: where people will find it
• Scent alone is not scientific …
“retaining files and being prepared to share them” ≠ accessible data
1. Discoverability through accessibility
• Deposit in a public/open database
• Where possible, structured archive (e.g. PDB,
ENA) >> unstructured archive (e.g. Zenodo,
Figshare)
• Uniquely identify it: PID, Accession number, DOI,
ROI
• Give it context: metadata (and more)
• All of the above = citable =
Data Citation Principles
• Data as legitimate, citable products
of research
• Attribution and credit
• Cited as evidence for a claim
• Unique identification
• Access
• Persistence
• Specificity and verifiability
(provenance)
• Interoperability and flexibility
https://www.force11.org/datacitation
2. Discoverability through structured data
structured data is one of the true
enablers of life science
- Discovery of homology between genes across species
- Predicting function based on protein folds
• Structured data can be cross-analysed, compared by
algorithm, and encourages development of new products
and tools
Discoverability through structured data
Metadata – critical to discoverability
Generic: title, submitters, date, file format, version.
citation basic search
Wagner F.F., 23-APR-2002, TPA: Homo sapiens SMP1
gene, RHD gene and RHCE gene, INSDC, 14-NOV-2006
(Rel. 89, Last updated, Version 7). BN000065
Specific: organism, tissue, assay, page number …
deep search analysis computation
Structured data is good value for money
Annual cost of generating new protein
structure data in labs around the world
Annual cost of
maintaining it
in a central
database
• In-built QA: otherwise you couldn’t do this!
3. Discoverable through added value
Literature-Data Integration
Quality control and trustworthiness
Pre publication
• QA: a range of activities from “is this file valid” to
“is it described adequately/richly by metadata”?
Post publication
• Data are immutable: time and repetition required
to show outliers.
• Openness: enabling assessment & reuse
• Provenance & connectivity with related data,
methods, code, tools, articles …
Text and Data Mining (TDM)
• The more structure there is – the more reusable data
becomes
• TDM builds bridges between unstructured (including
articles) and structured data worlds
• Such analyses can drive adoption of more structure ….
• Open Access infrastructure is a critical enabler
Graphic kindly supplied by the National Centre for Text Mining (NaCTeM), UK
BioStudy EBI
BioStudy database for unstructured data
Study
Publications
Ontologies
Data files
Other DBs
Metadata
Other DBs
Making data discoverable
Labs around the
world deposit
data and we…
Archive it
Classify it Share it with
other data
providers
Analyse, add
value and
integrate it
…provide
tools to help
researchers
use it
A collaborative
enterprise
Elixir: An international distributed infrastructure for • Data
• Standards
• Tools
• Compute
• Training
• Industry
Some … of many … Open Questions
• What can stakeholders do to capitalize on the research
investment?
• Incentives and credit for good data management
• Can all data outputs to be peer reviewed (like articles)?
How far does the article model apply to data?
• Negative results?
• What are we prepared to spend: costs—benefits?
• How does this apply to big data?
• What is the public interest in discoverable data?