Data Publication COASP 2012

Data Publication COASP 2012. Publications 26 million abstracts 2.2 million full text articles Citation networks Database links Text-mining 2012 200620112016?


Data Publication

COASP 2012


Publications


26 million abstracts

2.2 million full text articles

Citation networks
Database links
Text-mining

2006 2011 2012 2016?

Europe PubMed Central


How many open access articles in UKPMC?

PubMed (995K)

UKPMC (18%, 182K)

OA (9.6%, 96K)


Managing the public data ecosystem

[Diagram: Big Data: Deposition; Primary; Research articles; Big Data: Curated; Annotation; Unstructured Data. Numeric labels: 1, 2, 12, 3]


Literature citation from data (data annotation)


Links from Literature to Databases

• Proteins
• Nucleotides
• OMIM
• Chemicals
• Structure
• Clinical reviews
• Protein families
• Protein-protein interactions
• Gene expression experiments

800 K

370 K

110 K


Database crosslinks

Bibliography from P25106


Data citation from literature (provenance)


Text Mining in UKPMC (2.2 million articles)

Semantic Type    Unique Terms    Articles     Annotations
Accession No.    233,017         66,356       387,787
Chemical         76,712          1,694,385    83,923,066
Disease          171,692         1,768,214    57,821,871
Gene/Protein     227,318         1,310,382    77,189,022
GO Terms         32,664          1,832,294    65,061,579
Organism         180,637         1,713,280    70,832,222
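The text-mining figures can be turned into two derived measures: annotations per annotated article, and the share of the corpus that carries each semantic type. The sketch below only divides the slide's own numbers; the 2,200,000 total is the rounded "2.2 million" from the slide title, and the names `TABLE`, `TOTAL_ARTICLES` and `density` are illustrative, not part of any UKPMC tool.

```python
# Figures copied from the "Text Mining in UKPMC" slide.
# Each entry: semantic type -> (unique terms, articles, annotations).
TABLE = {
    "Accession No.": (233_017, 66_356, 387_787),
    "Chemical": (76_712, 1_694_385, 83_923_066),
    "Disease": (171_692, 1_768_214, 57_821_871),
    "Gene/Protein": (227_318, 1_310_382, 77_189_022),
    "GO Terms": (32_664, 1_832_294, 65_061_579),
    "Organism": (180_637, 1_713_280, 70_832_222),
}

# Assumption: the rounded "2.2 million articles" from the slide title.
TOTAL_ARTICLES = 2_200_000


def density(table=TABLE, total=TOTAL_ARTICLES):
    """Return annotations per article and corpus share for each type."""
    rows = {}
    for name, (_terms, articles, annotations) in table.items():
        rows[name] = {
            "annotations_per_article": annotations / articles,
            "share_of_corpus": articles / total,
        }
    return rows


if __name__ == "__main__":
    for name, row in density().items():
        print(
            f"{name:<14} {row['annotations_per_article']:8.1f} per article, "
            f"{row['share_of_corpus']:6.1%} of corpus"
        )
```

On these numbers, accession numbers stand apart from the other types: they occur in roughly 3% of the articles, at about six annotations per article, while each of the other types is found in well over half of the corpus.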


Accession numbers stories: data citation in OA articles

Senay Kafkas Jee-Hyub Kim


Annotation of accession numbers (OA)

[Two bar charts, axis scale 0 to 100. First chart categories: gen, pdb, sprot, genpept, geo, omim, pir, emblalign, pubchem, pmc. Second chart categories: gen, pdb, sprot, arrayexpress, pfam, interpro. Legend: publisher-annotated, text-mined. Labels: ~10,000 articles; >25,000 articles]

• Névéol A, Wilbur WJ, Lu Z (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database 2012:bas026 (PMC3371192)   

• Névéol A, Wilbur WJ, Lu Z (2011) Extraction of data deposition statements from the literature: a method for automatically tracking research results. Bioinformatics 27, 3306-3312 (PMC3223368)


Efficacy of Accession number tagging (OA)

Most publisher tags

BMC Genomics
BMC Evolutionary Biology
The Journal of Cell Biology
Virology Journal
BMC Microbiology
The Journal of Experimental Medicine
BMC Bioinformatics
BMC Plant Biology
The Journal of Biological Chemistry
BMC Molecular Biology

Most articles

• PLoS One
Acta Crystallographica Section E:
British Journal of Cancer
The Journal of Cell Biology
Environmental Health Perspectives
• Nucleic Acids Research
The Journal of Experimental Medicine
Critical Care
• Emerging Infectious Diseases
BMC Bioinformatics

Most text-mined tags

• PLoS One
• Nucleic Acids Research
BMC Genomics
BMC Evolutionary Biology
The Journal of Cell Biology
PLoS Pathogens
BMC Bioinformatics
Virology Journal
BMC Microbiology
• Emerging Infectious Diseases

BMC Genomics: 1,484 TM tags*, 4,337 articles
PLoS One: 4,226 TM tags*, 42,888 articles
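The two journal figures on this slide are easier to compare as a rate. The sketch below divides text-mined (TM) tags by article count for each journal, using only the slide's numbers. It assumes "TM tags" counts tags rather than tagged articles, so the result is tags per article and not the fraction of articles tagged; the slide's asterisk footnote is not available to confirm this.

```python
# Figures copied from the slide; the reading of "TM tags" as a raw tag
# count is an assumption, since the footnote marked "*" is not in the text.
JOURNALS = {
    "BMC Genomics": {"tm_tags": 1_484, "articles": 4_337},
    "PLoS One": {"tm_tags": 4_226, "articles": 42_888},
}


def tags_per_article(tm_tags, articles):
    """Text-mined tags divided by the number of articles in the journal."""
    return tm_tags / articles


RATES = {name: tags_per_article(**counts) for name, counts in JOURNALS.items()}

if __name__ == "__main__":
    for name, rate in RATES.items():
        print(f"{name}: {rate:.3f} text-mined tags per article")
```

Under that assumption, PLoS One contributes more tags in absolute terms, but BMC Genomics has roughly three and a half times as many tags per article.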


Scientific:

Linking articles that cite the same data

Citation:

Data Citation as measure of impact (Thomson: Data citation index)

Context of data citation: submission, reuse, analysis

Operational:

Services for publishers to improve Accession number tagging

Editorial policies and adherence

Extension of NLM DTD

Lessons learned for considering unstructured data

Why is this important? Implications

That we can perform this analysis at all highlights a benefit of Open Access
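"Linking articles that cite the same data" can be sketched as an inverted index from accession number to the articles that mention it. This is a minimal illustration, not the UKPMC implementation: the article identifiers and the accession "1ABC" are hypothetical placeholders, while AY387398 and P25106 are reused from elsewhere in these slides purely as sample strings.

```python
from collections import defaultdict
from itertools import combinations


def build_index(article_accessions):
    """Invert {article: accessions} into {accession: set of articles}."""
    index = defaultdict(set)
    for article_id, accessions in article_accessions.items():
        for accession in accessions:
            index[accession].add(article_id)
    return index


def linked_pairs(index):
    """Map each pair of articles to the accessions they have in common."""
    pairs = defaultdict(set)
    for accession, articles in index.items():
        for first, second in combinations(sorted(articles), 2):
            pairs[(first, second)].add(accession)
    return dict(pairs)


# Hypothetical input: article identifiers are placeholders, not real PMCIDs.
EXAMPLE = {
    "article-1": {"AY387398", "P25106"},
    "article-2": {"AY387398"},
    "article-3": {"P25106", "1ABC"},
}

if __name__ == "__main__":
    for pair, shared in sorted(linked_pairs(build_index(EXAMPLE)).items()):
        print(pair, sorted(shared))
```

Accessions cited by a single article produce no link, so the quality of the linking depends directly on how completely accession numbers are tagged.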


AY387398: needle in a haystack
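One reason an accession number is hard to find in running text is that its format alone rarely identifies the database. The sketch below uses two patterns: the traditional nucleotide accession shapes (one letter plus five digits, or two letters plus six digits) and the accession pattern published by UniProt. It is a simplified illustration under those assumptions, not the pipeline used for these slides; newer accession formats and version suffixes are ignored, and `PATTERNS` and `find_candidates` are hypothetical names.

```python
import re

# Assumption: only the traditional nucleotide formats are covered here.
PATTERNS = {
    "nucleotide": re.compile(r"\b(?:[A-Z]{2}\d{6}|[A-Z]\d{5})\b"),
    "uniprot": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
}


def find_candidates(text):
    """Return (accession, database, offset) for every pattern match."""
    hits = []
    for database, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.group(0), database, match.start()))
    # Sort by position so hits read in document order.
    return sorted(hits, key=lambda hit: hit[2])


if __name__ == "__main__":
    sentence = (
        "The sequence was deposited in GenBank (AY387398); "
        "see also UniProt P25106."
    )
    for accession, database, offset in find_candidates(sentence):
        print(f"{offset:3d}  {accession:<10} {database}")
```

In the sample sentence, P25106 matches both patterns, so a tagger built this way would still need context, such as a nearby database name, to decide which resource is being cited.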


Unstructured data


Articles with supplemental data (UKPMC)

• 235,000 articles (50K+ in 2011)

• 718,511 files

• 459 extensions

• 0.8 TB (1200 CDs)
• (However most data in ~60 extension types)

[Chart: % by Pub Year]
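The supplemental-data totals can be sanity-checked with simple division. The sketch below derives files per article, mean file size, and the implied capacity per CD from the slide's figures (235,000 articles, 718,511 files, 0.8 TB, 1200 CDs). It assumes decimal units (1 TB = 10^12 bytes), which the slide does not specify, and `supplemental_summary` is an illustrative name.

```python
ARTICLES = 235_000
FILES = 718_511
TOTAL_BYTES = 0.8e12  # assumption: 0.8 TB in decimal units
CDS = 1_200


def supplemental_summary(articles=ARTICLES, files=FILES,
                         total_bytes=TOTAL_BYTES, cds=CDS):
    """Derive per-article and per-file averages from the slide totals."""
    return {
        "files_per_article": files / articles,
        "mean_file_mb": total_bytes / files / 1e6,
        "mb_per_cd": total_bytes / cds / 1e6,
    }


if __name__ == "__main__":
    for key, value in supplemental_summary().items():
        print(f"{key}: {value:.2f}")
```

This works out to about three supplemental files per article at a little over 1 MB each on average, and roughly 667 MB per CD, which is in line with the slide's CD comparison.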


Managing the public data ecosystem

[Diagram: Big Data: Deposition; Primary; Research articles; Big Data: Curated; Annotation; Unstructured Data. Link labels: Structured links; reuse; analysis; provenance]

• Open
• Citable
• Discoverable
• Reusable


People

• Paula Buttery
• Andrew Caines
• Norman Cobley
• Yuci Gou
• Senay Kafkas
• Jyothi Katuri
• Oliver Kilian
• Jee-Hyub Kim
• Nikos Marinos
• Jo McEntyre
• Xingjun Pi
• Philip Rossiter

• Rebholz Group
• Peter Stoehr

• University of Manchester
• British Library

• OpenAIRE/OpenAIRE Plus

• NCBI, NLM