Data Publication COASP 2012. Publications 26 million abstracts 2.2 million full text articles...


Data Publication

COASP 2012

Publications

26 million abstracts

2.2 million full text articles

Citation networks

Database links

Text-mining

Timeline: 2006, 2011, 2012, 2016?

Europe PubMed Central

How many open access articles in UKPMC?

PubMed (995K)

UKPMC (18%, 182K)

OA (9.6%, 96K)
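A quick check of the percentages on this slide, assuming both are expressed relative to the PubMed figure of 995K (the slide does not say what period or subset that figure covers). The variable names are illustrative only.

```python
# Counts as shown on the slide (K = thousand).
pubmed = 995_000
ukpmc = 182_000
open_access = 96_000

# Assumption: both percentages use the PubMed count as the denominator.
ukpmc_share = ukpmc / pubmed
oa_share = open_access / pubmed

# Derived figure, not on the slide: open access share within UKPMC.
oa_within_ukpmc = open_access / ukpmc

print(f"UKPMC / PubMed: {ukpmc_share:.1%}")
print(f"OA / PubMed:    {oa_share:.1%}")
print(f"OA / UKPMC:     {oa_within_ukpmc:.1%}")
```

Under that assumption the computed shares agree with the 18% and 9.6% quoted above, and roughly half of the UKPMC articles in this sample are open access.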

Managing the public data ecosystem

[Diagram labels: Big Data: Deposition; Primary; Research articles; Big Data: Curated; Annotation; Unstructured Data; numbered markers 1, 2, 3]

Literature citation from data (data annotation)

Links from Literature to Databases

• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments

800 K

370 K

110 K

Database crosslinks

Bibliography from P25106

Data citation from literature (provenance)

Text Mining in UKPMC (2.2 million articles)

Semantic Type    Unique Terms    Articles     Annotations
Accession No.    233,017         66,356       387,787
Chemical         76,712          1,694,385    83,923,066
Disease          171,692         1,768,214    57,821,871
Gene/Protein     227,318         1,310,382    77,189,022
GO Terms         32,664          1,832,294    65,061,579
Organism         180,637         1,713,280    70,832,222
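The table is easier to read as rates. The sketch below derives annotations per article and the share of the 2.2 million articles for each semantic type. It assumes the "Articles" column counts articles with at least one annotation of that type, which the slide does not state; the names `TABLE` and `summarize` are illustrative.

```python
# Counts copied from the text-mining table:
# semantic type -> (unique terms, articles, annotations).
TABLE = {
    "Accession No.": (233_017, 66_356, 387_787),
    "Chemical": (76_712, 1_694_385, 83_923_066),
    "Disease": (171_692, 1_768_214, 57_821_871),
    "Gene/Protein": (227_318, 1_310_382, 77_189_022),
    "GO Terms": (32_664, 1_832_294, 65_061_579),
    "Organism": (180_637, 1_713_280, 70_832_222),
}
TOTAL_ARTICLES = 2_200_000  # "2.2 million articles" from the slide title


def summarize(table, total_articles):
    """Return per-type annotation density and article coverage."""
    rows = {}
    for semantic_type, (_terms, articles, annotations) in table.items():
        rows[semantic_type] = {
            "annotations_per_article": annotations / articles,
            "share_of_articles": articles / total_articles,
        }
    return rows


summary = summarize(TABLE, TOTAL_ARTICLES)
for name, row in summary.items():
    print(f"{name:14s} {row['annotations_per_article']:6.1f} per article, "
          f"{row['share_of_articles']:.1%} of articles")
```

Under that reading, accession numbers occur in about 3% of the articles, far fewer than any other semantic type in the table.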

Accession number stories: data citation in OA articles

Senay Kafkas, Jee-Hyub Kim

Annotation of accession numbers (OA)

[Figure: two bar charts, each with an axis from 0 to 100. Legend: publisher-annotated, text-mined. First chart, databases: gen, pdb, sprot, genpept, geo, omim, pir, emblalign, pubchem, pmc. Second chart, databases: gen, pdb, sprot, arrayexpress, pfam, interpro.]

~10,000 articles; >25,000 articles
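To make the idea of text-mined accession numbers concrete, here is a minimal sketch of pattern-based tagging. It is not the UKPMC pipeline: the patterns are simplified assumptions (a classic GenBank nucleotide shape, the published UniProt accession shape, and GEO series/sample/dataset/platform prefixes), and the names `PATTERNS` and `find_accessions` are hypothetical.

```python
import re

# Simplified, illustrative patterns; real accession formats have more variants.
PATTERNS = {
    # One letter + five digits, or two letters + six digits.
    "genbank": re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b"),
    # UniProt accession shape (6 or 10 characters).
    "uniprot": re.compile(
        r"\b(?:[OPQ]\d[A-Z0-9]{3}\d|[A-NR-Z]\d(?:[A-Z][A-Z0-9]{2}\d){1,2})\b"
    ),
    # GEO series, samples, datasets and platforms.
    "geo": re.compile(r"\bG(?:SE|SM|DS|PL)\d+\b"),
}


def find_accessions(text):
    """Return (database, accession) pairs for every pattern match in text."""
    hits = []
    for database, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((database, match.group(0)))
    return hits


print(find_accessions("The sequence was deposited under accession AY387398."))
print(find_accessions("Bibliography from P25106"))
```

The second call returns two candidate databases for the same string, because P25106 fits both the GenBank and the UniProt shape. Pattern matching alone is therefore ambiguous, and a practical tagger would also need surrounding context (for example a nearby database name) to decide.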

• Névéol A, Wilbur WJ, Lu Z (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database 2012:bas026 (PMC3371192)   

• Névéol A, Wilbur WJ, Lu Z (2011) Extraction of data deposition statements from the literature: a method for automatically tracking research results. Bioinformatics 27, 3306-3312 (PMC3223368)

Most publisher tags

BMC Genomics
BMC Evolutionary Biology
The Journal of Cell Biology
Virology Journal
BMC Microbiology
The Journal of Experimental Medicine
BMC Bioinformatics
BMC Plant Biology
The Journal of Biological Chemistry
BMC Molecular Biology

Most articles

• PLoS One
Acta Crystallographica Section E:
British Journal of Cancer
The Journal of Cell Biology
Environmental Health Perspectives
• Nucleic Acids Research
The Journal of Experimental Medicine
Critical Care
• Emerging Infectious Diseases
BMC Bioinformatics

Most text-mined tags

• PLoS One
• Nucleic Acids Research
BMC Genomics
BMC Evolutionary Biology
The Journal of Cell Biology
PLoS Pathogens
BMC Bioinformatics
Virology Journal
BMC Microbiology
• Emerging Infectious Diseases

BMC Genomics: 1,484 TM tags*, 4,337 articles

PLoS One: 4,226 TM tags*, 42,888 articles

Efficacy of Accession number tagging (OA)
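The two journal figures quoted above can be compared as tags per article. This is a small arithmetic sketch using only the numbers on the slide; the footnote behind "TM tags*" is not available here, so the exact definition of a tag is an open assumption.

```python
# Figures from the slide: journal -> (text-mined tags, articles).
JOURNALS = {
    "BMC Genomics": (1_484, 4_337),
    "PLoS One": (4_226, 42_888),
}

# Text-mined tags per article for each journal.
density = {name: tags / articles for name, (tags, articles) in JOURNALS.items()}

# How much denser BMC Genomics is than PLoS One.
ratio = density["BMC Genomics"] / density["PLoS One"]

for name, value in density.items():
    print(f"{name}: {value:.3f} TM tags per article")
print(f"Ratio: {ratio:.1f}x")
```

PLoS One has more text-mined tags in total, but BMC Genomics has roughly three and a half times as many per article, so ranking by raw tag count mostly reflects journal size.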

Scientific:

Linking articles that cite the same data

Citation:

Data Citation as measure of impact (Thomson: Data citation index)

Context of data citation: submission, reuse, analysis

Operational:

Services for publishers to improve Accession number tagging

Editorial policies and adherence

Extension of NLM DTD

Lessons learned for considering unstructured data

Why is this important? Implications

That we can perform this analysis at all highlights a benefit of Open Access

AY387398: needle in a haystack
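The "linking articles that cite the same data" point can be sketched as an inverted index from accession number to articles. The article IDs and citing relationships below are invented for illustration; only the two accession strings appear elsewhere in this deck, and `link_articles` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations


def link_articles(citations):
    """Map each pair of articles to the set of accessions both of them cite."""
    # Inverted index: accession -> articles that cite it.
    by_accession = defaultdict(set)
    for article, accessions in citations.items():
        for accession in accessions:
            by_accession[accession].add(article)

    # Every pair of articles sharing an accession becomes a link.
    links = defaultdict(set)
    for accession, articles in by_accession.items():
        for pair in combinations(sorted(articles), 2):
            links[pair].add(accession)
    return dict(links)


# Hypothetical articles and citations, for illustration only.
example = {
    "article-A": {"AY387398", "P25106"},
    "article-B": {"AY387398"},
    "article-C": {"P25106"},
}

for pair, shared in sorted(link_articles(example).items()):
    print(pair, sorted(shared))
```

Articles B and C are not linked to each other directly, but both are linked to A through the data it cites, which is the kind of connection a literature citation network alone would miss.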

Unstructured data

Articles with supplemental data (UKPMC)

• 235,000 articles (50K+ in 2011)

• 718,511 files

• 459 extensions

• 0.8 TB (1,200 CDs)

• However, most data is in ~60 extension types

[Chart: % (y-axis) by publication year (x-axis)]
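A back-of-the-envelope check of the supplemental data figures, assuming decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes). The derived averages are not on the slide.

```python
# Figures from the slide.
articles = 235_000
files = 718_511
total_bytes = 0.8e12  # 0.8 TB, assuming decimal terabytes
cds = 1_200

files_per_article = files / articles
mb_per_file = total_bytes / files / 1e6
mb_per_cd = total_bytes / cds / 1e6

print(f"Files per article: {files_per_article:.1f}")
print(f"Average file size: {mb_per_file:.1f} MB")
print(f"Implied CD capacity: {mb_per_cd:.0f} MB")
```

The implied capacity of about 667 MB per disc is close to a standard CD, so the "1,200 CDs" comparison is consistent with 0.8 TB. On average each article carries about three supplemental files of roughly 1 MB each.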

Managing the public data ecosystem

[Diagram labels: Big Data: Deposition; Primary; Research articles; Big Data: Curated; Annotation; Structured links; Unstructured Data; reuse; analysis; provenance]

• Open
• Citable
• Discoverable
• Reusable

People

• Paula Buttery
• Andrew Caines
• Norman Cobley
• Yuci Gou
• Senay Kafkas
• Jyothi Katuri
• Oliver Kilian
• Jee-Hyub Kim
• Nikos Marinos
• Jo McEntyre
• Xingjun Pi
• Philip Rossiter

• Rebholz Group
• Peter Stoehr

• University of Manchester
• British Library

• OpenAIRE/OpenAIRE Plus

• NCBI, NLM
