
Scott Edmunds: Data publication in the data deluge

Scott Edmunds's talk "Data publication in the data deluge" at COASP 2012 in Budapest, 20th September 2012

Page 1: Scott Edmunds: Data publication in the data deluge

www.gigasciencejournal.com

Scott Edmunds, GigaScience/BGI Hong Kong COASP 2012, Budapest, 20th September 2012

Data publication in the data deluge

Page 2: Scott Edmunds: Data publication in the data deluge

The Data Challenge:

• 1.2 zettabytes (10^21 bytes) of electronic data generated globally each year

• >Exponential growth of genomics data (& growth in imaging and MS data following)

• Issues with reproducibility, hosting, curation, interoperability

• Need for better incentives to overcome these

Source: http://www.genome.gov/sequencingcosts/ (with apologies)

Source: 1. Mervis J. U.S. science policy. Agencies rally to tackle big data. Science. 2012 Apr 6;336(6077):22.

Page 3: Scott Edmunds: Data publication in the data deluge

www.gigasciencejournal.com

Large-Scale Data Journal/Database

Editor-in-Chief: Laurie Goodman, PhD
Editor: Scott Edmunds, PhD
Commissioning Editor: Nicole Nogoy, PhD
Lead Curator: Tam Sneddon, DPhil
Data Platform: Peter Li, PhD

In conjunction with:

Page 4: Scott Edmunds: Data publication in the data deluge

GigaDB is a new database integrated with the GigaScience journal to meet the needs of a new generation of biological and biomedical research as it enters the era of “big-data”… (see more)

Page 5: Scott Edmunds: Data publication in the data deluge

Genomic Data Submission and Analytical platform

GDSAP:

Page 6: Scott Edmunds: Data publication in the data deluge

Anatomy of a Publication

Data

Idea

Study

Analysis

Answer

Metadata

Page 7: Scott Edmunds: Data publication in the data deluge

Anatomy of a Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Page 8: Scott Edmunds: Data publication in the data deluge

Issues for Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Technical issues:

Cultural issues:

Page 9: Scott Edmunds: Data publication in the data deluge

Issues for Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Cultural issues:

* T-Shirts available from Graham Steel / http://www.zazzle.co.uk/steelgraham

Adoption held back by: journal policies, citation, tracking…

Page 10: Scott Edmunds: Data publication in the data deluge

Issues for Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Technical issues:

What do we do with the data?

Page 11: Scott Edmunds: Data publication in the data deluge

Issues for Data Publication

Data

Idea

Study

Analysis

Answer

Metadata

Technical issues:

What do we do with the data?

Lightweight:
• Metadata-only journals
• Get someone else to host

Heavyweight:
• Become a repository

Page 12: Scott Edmunds: Data publication in the data deluge

To host or not to host?

Against: supplementary files argument

Average size of a Journal of Neuroscience article and supplemental material in megabytes.

Maunsell J J. Neurosci. 2010;30:10599-10600

Announcement Regarding Supplemental Material: Beginning November 1, 2010, The Journal of Neuroscience will no longer allow authors to include supplemental material when they submit new manuscripts and will no longer host supplemental material on its web site for those articles.

The Journal of Neuroscience

“While the size of articles has grown gradually over the past decade, the supplemental material associated with a typical Journal article appears to be growing exponentially and is rapidly approaching the size of an article. The sheer volume of supplemental material is adversely affecting peer review.”

Page 13: Scott Edmunds: Data publication in the data deluge

To review: (>6TBp, >1500 datasets)

S3 (storage) = $15,000

EC2 (analysis w/ BLASTx) = $500,000

$1000 genome = million $ peer-review?

Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops

Page 14: Scott Edmunds: Data publication in the data deluge

To review: (>6TBp, >1500 datasets)

S3 (storage) = $15,000

EC2 (analysis w/ BLASTx) = $500,000

$1000 genome = million $ peer-review?

Source: Folker Meyer/Wilkening et al. 2009, CLUSTER'09. IEEE International Conference on Cluster Computing and Workshops

ENCODE analysis Virtual Machine:

Containing: input data, code bundles with scripts and processing steps, outputs

AWS = ~$5,000

Source: James Taylor / http://encodeproject.org/ENCODE/integrativeAnalysis/VM
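To put the quoted figures side by side, the arithmetic can be sketched as below. Only the dollar amounts shown on these slides are used. The storage and BLASTx figures and the ENCODE virtual machine figure describe different workloads, so the ratios are an order-of-magnitude illustration and not a like-for-like comparison. The dictionary keys and the `ratio` helper are illustrative names, not anything from the talk.

```python
# Dollar figures exactly as quoted on the slides; labels are illustrative.
COSTS_USD = {
    "S3 storage": 15_000,
    "EC2 analysis (BLASTx)": 500_000,
    "ENCODE VM on AWS": 5_000,  # the slide gives this as "~$5,000"
}


def ratio(item_a: str, item_b: str) -> float:
    """Return how many times more item_a costs than item_b."""
    return COSTS_USD[item_a] / COSTS_USD[item_b]


if __name__ == "__main__":
    # Re-running the analysis dwarfs the cost of simply storing the data.
    print(round(ratio("EC2 analysis (BLASTx)", "S3 storage"), 1))
    # A pre-packaged VM is far cheaper than the quoted BLASTx re-analysis.
    print(ratio("EC2 analysis (BLASTx)", "ENCODE VM on AWS"))
```

On these numbers, compute rather than storage dominates the cost of reviewing large datasets.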

Page 15: Scott Edmunds: Data publication in the data deluge

To host or not to host?

For: reproducibility

The Guardian, 14th September 2012: Replication is the only solution to scientific fraud. http://www.guardian.co.uk/commentisfree/2012/sep/14/solution-scientific-fraud-replication

For: “data is the new oil”

William Gibson: “Information is the currency of the future world”

Sir Tim Berners-Lee: “Data is a precious thing and will last longer than the systems themselves”

Move compute to the data: think EC2 rather than S3

DNAnexus + 0.5 PB SRA data = $15 million given by Google

Source: DNAnexus/SRA http://techcrunch.com/2011/10/12/dnanexus-raises-15-million-teams-with-google-to-host-massive-dna-database/

Page 16: Scott Edmunds: Data publication in the data deluge

?

Overcoming cultural hurdles…

Page 17: Scott Edmunds: Data publication in the data deluge

Adventures in Data Citation

doi:10.5524/100001

Overcoming cultural hurdles…

Page 18: Scott Edmunds: Data publication in the data deluge

For data citation to work, needs:

1. Proven utility/potential user base.

2. Acceptance/inclusion by journals.

3. Data+Citation: inclusion in the references.

4. Tracking by citation indexes.

5. Usage of the metrics by the community…

Page 19: Scott Edmunds: Data publication in the data deluge

Data Citation 1: utility/user base.

Establishment of data DOIs and use by databases:

Shackleton NJ, Hall MA, Vincent E (2001): Mean stable carbon isotope ratios of Cibicidoides wuellerstorfi from sediment core MD95-2042 on the Iberian margin, North Atlantic. PANGAEA - Data Publisher for Earth & Environmental Science. http://doi.pangaea.de/10.1594/PANGAEA.58229

Cited in:

Pahnke K, Zahn R: Southern Hemisphere Water Mass Conversion Linked with North Atlantic Climate Variability. Science 2005, 307:1741-1746.

Nocek B, Xu X, Savchenko A, Edwards A, Joachimiak A. 2007. PDB ID: 2P06 Crystal structure of a predicted coding region AF_0060 from Archaeoglobus fulgidus DSM 4304. 10.2210/pdb2p06/pdb.

Cited in:

Andreeva A, Howorth D, Chandonia J-M, Brenner SE, Hubbard TJP, Chothia C, Murzin AG: Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res. 2008, 36:D419-425.

Page 20: Scott Edmunds: Data publication in the data deluge

BGI Datasets Get DOI®s

Plants:
• Chinese cabbage
• Cucumber
• Foxtail millet
• Pigeonpea
• Potato
• Sorghum

Microbe:
• E. coli O104:H4 TY-2482
• T2D gut metagenome

Cell-Lines:
• Chinese Hamster Ovary
• Mouse methylomes

H:
• Asian individual (YH)
  - DNA Methylome
  - Genome Assembly
  - Transcriptome
• Cancer (14TB)
• Single cell bladder cancer
• HBV infected exomes
• Ancient DNA
  - Saqqaq Eskimo
  - Aboriginal Australian

Vertebrates:
• Darwin’s Finch
• Giant panda
• Macaque
  - Chinese rhesus
  - Crab-eating
• Mini-Pig
• Naked mole rat
• Parrot, Puerto Rican
• Penguin
  - Emperor penguin
  - Adelie penguin
• Pigeon, domestic
• Polar bear
• Sheep
• Tibetan antelope

Invertebrate:
• Ant
  - Florida carpenter ant
  - Jerdon’s jumping ant
  - Leaf-cutter ant
• Roundworm
• Schistosoma
• Silkworm

Released pre-publication

Paper published in GigaScience

Page 21: Scott Edmunds: Data publication in the data deluge

To maximize its utility to the research community and aid those fighting the current epidemic, genomic data is released here into the public domain under a CC0 license. Until the publication of research papers on the assembly and whole-genome analysis of this isolate we would ask you to cite this dataset as:

Li, D; Xi, F; Zhao, M; Liang, Y; Chen, W; Cao, S; Xu, R; Wang, G; Wang, J; Zhang, Z; Li, Y; Cui, Y; Chang, C; Cui, C; Luo, Y; Qin, J; Li, S; Li, J; Peng, Y; Pu, F; Sun, Y; Chen, Y; Zong, Y; Ma, X; Yang, X; Cen, Z; Zhao, X; Chen, F; Yin, X; Song, Y; Rohde, H; Li, Y; Wang, J; Wang, J and the Escherichia coli O104:H4 TY-2482 isolate genome sequencing consortium (2011) Genomic data from Escherichia coli O104:H4 isolate TY-2482. BGI Shenzhen. doi:10.5524/100001 http://dx.doi.org/10.5524/100001

Our first DOI:

To the extent possible under law, BGI Shenzhen has waived all copyright and related or neighboring rights to Genomic Data from the 2011 E. coli outbreak. This work is published from: China.
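The requested citation follows a simple pattern: creators, year, title, publisher, identifier. As a minimal sketch, assuming that pattern is all a data citation needs, it could be assembled from metadata fields like this. The function name and parameters are hypothetical, and the author list in the usage example is shortened to three names purely for brevity.

```python
def format_data_citation(authors, year, title, publisher, doi):
    """Build a citation string in the pattern used on the slide:
    'Authors (Year) Title. Publisher. doi:DOI http://dx.doi.org/DOI'.
    """
    author_str = "; ".join(authors)
    return (
        f"{author_str} ({year}) {title}. {publisher}. "
        f"doi:{doi} http://dx.doi.org/{doi}"
    )


if __name__ == "__main__":
    # Author list truncated for illustration; the full list is on the slide.
    print(
        format_data_citation(
            authors=["Li, D", "Xi, F", "Zhao, M"],
            year=2011,
            title="Genomic data from Escherichia coli O104:H4 isolate TY-2482",
            publisher="BGI Shenzhen",
            doi="10.5524/100001",
        )
    )
```

Keeping the DOI as a separate field, and not embedding it in free text, is what lets a reference list carry it in a form that indexes could in principle pick up.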

Page 22: Scott Edmunds: Data publication in the data deluge
Page 23: Scott Edmunds: Data publication in the data deluge
Page 24: Scott Edmunds: Data publication in the data deluge
Page 25: Scott Edmunds: Data publication in the data deluge

1.3 The power of intelligently open data

The benefits of intelligently open data were powerfully illustrated by events following an outbreak of a severe gastro-intestinal infection in Hamburg in Germany in May 2011. This spread through several European countries and the US, affecting about 4000 people and resulting in over 50 deaths. All tested positive for an unusual and little-known Shiga-toxin–producing E. coli bacterium. The strain was initially analysed by scientists at BGI-Shenzhen in China, working together with those in Hamburg, and three days later a draft genome was released under an open data licence. This generated interest from bioinformaticians on four continents. 24 hours after the release of the genome it had been assembled. Within a week two dozen reports had been filed on an open-source site dedicated to the analysis of the strain. These analyses provided crucial information about the strain’s virulence and resistance genes – how it spreads and which antibiotics are effective against it. They produced results in time to help contain the outbreak. By July 2011, scientists published papers based on this work. By opening up their early sequencing results to international collaboration, researchers in Hamburg produced results that were quickly tested by a wide range of experts, used to produce new knowledge and ultimately to control a public health emergency.

Page 26: Scott Edmunds: Data publication in the data deluge

Data Citation 2: acceptance by journals

Page 27: Scott Edmunds: Data publication in the data deluge

Data Citation 2: acceptance by journals

Page 28: Scott Edmunds: Data publication in the data deluge

Data+Citation 3: inclusion in the references

Page 29: Scott Edmunds: Data publication in the data deluge
Page 30: Scott Edmunds: Data publication in the data deluge

In the references…

Page 31: Scott Edmunds: Data publication in the data deluge

Is the DOI…

* Certain types of genomics data must also be deposited in INSDC databases (SRA & Genbank).

Page 32: Scott Edmunds: Data publication in the data deluge

And in more journals…

Hodkinson BP, Uehling JK, Smith ME (2012) Data from: Lepidostroma vilgalysii, a new basidiolichen from the New World. Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23

Cited in:

Hodkinson BP, Uehling JK, Smith ME: Lepidostroma vilgalysii, a new basidiolichen from the New World. Mycological Progress 2012. Advance Online Publication.

Roberts SB (2012) Herring Hepatic Transcriptome 34300 contigs.fa. Figshare. Available: hdl.handle.net/10779/084d34370fbda29bbc67b3c5ecb02575. Accessed 2012 Jan 20.

Cited in:

Roberts SB, Hauser L, Seeb LW, Seeb JE (2012) Development of Genomic Resources for Pacific Herring through Targeted Transcriptome Pyrosequencing. PLoS ONE 7(2): e30908. doi:10.1371/journal.pone.0030908
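Once data DOIs appear in reference lists, spotting them is mechanically simple. The sketch below scans reference text for DOIs and keeps those whose prefix belongs to a data repository. The prefix table is an assumption built only from the examples in these slides (GigaDB, Dryad, PANGAEA, PDB), and the function name is hypothetical; a real tracker would need a far fuller registry.

```python
import re

# DOI prefixes seen in the examples on these slides (not an exhaustive list).
DATA_DOI_PREFIXES = {
    "10.5524": "GigaDB",
    "10.5061": "Dryad",
    "10.1594": "PANGAEA",
    "10.2210": "PDB",
}

# A DOI is "10.<registrant>/<suffix>"; the suffix runs to the next whitespace.
DOI_PATTERN = re.compile(r'\b(10\.\d{4,9})/[^\s"<>]+')


def find_data_dois(reference_text):
    """Return (doi, repository) pairs for DOIs with a known data-repository prefix."""
    hits = []
    for match in DOI_PATTERN.finditer(reference_text):
        doi = match.group(0).rstrip(".,;)")  # drop trailing sentence punctuation
        repository = DATA_DOI_PREFIXES.get(match.group(1))
        if repository:
            hits.append((doi, repository))
    return hits


if __name__ == "__main__":
    refs = (
        "Hodkinson BP et al. (2012) Data from: Lepidostroma vilgalysii. "
        "Dryad Digital Repository. doi:10.5061/dryad.j1g5dh23\n"
        "Roberts SB et al. (2012) PLoS ONE 7(2): e30908. "
        "doi:10.1371/journal.pone.0030908"
    )
    print(find_data_dois(refs))
```

The journal article DOI in the second reference is skipped because its prefix is not in the table, which is the distinction a citation index would have to make.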

Page 33: Scott Edmunds: Data publication in the data deluge

For data citation to work, needs:

1. Proven utility/potential user base.

2. Acceptance/inclusion by journals.

3. Data+Citation: inclusion in the references.

4. Tracking by citation indexes.

5. Usage of the metrics by the community…

Page 34: Scott Edmunds: Data publication in the data deluge

Data Citation 4: tracking?

Page 35: Scott Edmunds: Data publication in the data deluge

DataCite metadata in harvestable form (OAI-PMH)

Data Citation 4: tracking?

✗ FAIL

- lists some DataCite DOIs, but says:

Datasets listed are the “result of approximations in the indexing algorithms.”

“Google Scholar's intended coverage is for scholarly articles. At this point, we don't include datasets.”
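"Harvestable form (OAI-PMH)" means an indexer can pull records over plain HTTP. A minimal sketch follows, assuming a repository that exposes the standard `ListRecords` verb with the `oai_dc` metadata format, which the OAI-PMH specification requires every repository to support. The base URL is a placeholder and the sample response is hand-written for illustration; neither comes from the slides.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NAMESPACES = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hand-written, minimal stand-in for a ListRecords response (illustrative only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Genomic data from Escherichia coli O104:H4 isolate TY-2482</dc:title>
          <dc:identifier>doi:10.5524/100001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for the given endpoint."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


def extract_identifiers(xml_text):
    """Return every Dublin Core identifier found in a ListRecords response."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.findall(".//dc:identifier", NAMESPACES)]


if __name__ == "__main__":
    print(list_records_url("https://example.org/oai"))  # placeholder endpoint
    print(extract_identifiers(SAMPLE_RESPONSE))
```

The harvesting side is therefore not the obstacle; as the slide notes, the gap is that the citation indexes were not yet consuming this metadata.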

Page 36: Scott Edmunds: Data publication in the data deluge

DataCite metadata in harvestable form (OAI-PMH)

Data Citation 4: tracking?

✗ FAIL

✗ Working on it. Coming soon…

Page 37: Scott Edmunds: Data publication in the data deluge
Page 38: Scott Edmunds: Data publication in the data deluge

“As a result of diverse practices and tool limitations, data citations are currently very difficult to track.”

Data Citation 5: metrics?

Page 39: Scott Edmunds: Data publication in the data deluge

“I’m afraid we are making promises to data creators about attribution and reward that we can’t keep. “Make your data citeable!” is the cry. OK. So citeable is step one. Cited is step two. But for the citation to be useful, it has to be indexed so that citation metrics can be tracked and admired and used.

Who is indexing data citations right now? As far as I can tell: absolutely no one.”

Research Remix, 29th May 2012: http://researchremix.wordpress.com/2012/05/29/dear-research-data-advocate-please-sign-the-petition-oamonday/

Data Citation 5: metrics?

✗ FAIL

Page 40: Scott Edmunds: Data publication in the data deluge

Where data citation is in 2012:

1. Proven utility/potential user base.

2. Acceptance/inclusion by journals.

3. Data+Citation: inclusion in the references.

4. Tracking by citation indexes.

5. Usage of the metrics by the community…

✔/✗

Page 41: Scott Edmunds: Data publication in the data deluge

?

Overcoming technical hurdles…

Page 42: Scott Edmunds: Data publication in the data deluge

Addressing the reproducibility gap:

Computable methods/workflow systems

Bioinformatics

Development Publishing

Biomedical and bioinformatics research

Page 43: Scott Edmunds: Data publication in the data deluge

Redefining what is a paper in the era of big-data?

Analysis Data

Tools/Workflows

Compute

goal: Executable Research Objects

Citable DOI

Page 44: Scott Edmunds: Data publication in the data deluge

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

doi:10.1186/2047-217X-1-3

Publication

Page 45: Scott Edmunds: Data publication in the data deluge

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

doi:10.1186/2047-217X-1-3

doi:10.5524/100035

Publication

Data

Page 46: Scott Edmunds: Data publication in the data deluge

• Background

• Methods

• Results (Data)

• Conclusions/Discussion

doi:10.1186/2047-217X-1-3

doi:10.5524/100035

Publication

Data +

Methods +

DOI for workflows?

Page 47: Scott Edmunds: Data publication in the data deluge

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

Page 48: Scott Edmunds: Data publication in the data deluge

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

DOI: B + DOI: X = DOI: 2

Page 49: Scott Edmunds: Data publication in the data deluge

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

DOI: B + DOI: X = DOI: 2

DOI: A + DOI: Y = DOI: 3

Page 50: Scott Edmunds: Data publication in the data deluge

Data + Methods = Analysis

doi:10.5524/100035 + DOI: x = doi:10.1186/2047-217X-1-3

DOI: A + DOI: X = DOI: 1

DOI: B + DOI: X = DOI: 2

DOI: A + DOI: Y = DOI: 3

A, B, C… + X, Y, Z… = 4, 5, 6…
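The scheme on this slide, in which a citable dataset combined with a citable method yields a citable analysis, can be sketched as a simple pairing. The labels "DOI:A", "DOI:X" and so on are the slide's own placeholders and not real identifiers. The function name and the sequential numbering of the results are assumptions made for illustration.

```python
from itertools import product


def enumerate_analyses(datasets, methods):
    """Pair every dataset with every method; each pair is a candidate
    citable analysis, numbered in the order the slide introduces them."""
    return [
        {"data": data, "methods": method, "analysis": f"DOI:{number}"}
        # Iterate methods in the outer loop so that A+X, B+X come before A+Y.
        for number, (method, data) in enumerate(product(methods, datasets), start=1)
    ]


if __name__ == "__main__":
    for row in enumerate_analyses(["DOI:A", "DOI:B"], ["DOI:X", "DOI:Y"]):
        print(f'{row["data"]} + {row["methods"]} = {row["analysis"]}')
```

With n datasets and m methods there are n × m possible analyses, which is why giving data and methods their own identifiers multiplies the number of citable objects.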

Page 51: Scott Edmunds: Data publication in the data deluge

Different shaped publishable objects

Data Papers

Executable (Methods) Papers

Analysis Papers

Page 52: Scott Edmunds: Data publication in the data deluge

Different levels of granularity

Experiment (e.g. ACRG project): e.g. doi:10.5524/100001

Datasets (e.g. cancer type): e.g. doi:10.5524/100001-2

Sample (e.g. specimen xyz): e.g. doi:10.5524/100001-2000 or doi:10.5524/100001_xyz

Smaller still?

Papers

Data/Micropubs

Nanopubs: Facts/Assertions (~10^13 in literature)

Different shaped publishable objects

Page 53: Scott Edmunds: Data publication in the data deluge

DOIs are cheap*, data is precious: maximise its use

Adding “value” publishing data

* ish

• Scope for different shaped publishable objects
• Scope for publishing methods/executable papers
• Peer review of data problematic
  – Post publication peer review
  – Change criteria (assess on transparency/access only)
  – Better use of workflows/cloud/VMs

Page 54: Scott Edmunds: Data publication in the data deluge

DOIs are cheap*, data is precious: maximise its use

Adding “value” publishing data

* ish

Source: Ross Mounce CC-BY http://rossmounce.co.uk/2012/09/04/the-gold-oa-plot-v0-2/

Page 55: Scott Edmunds: Data publication in the data deluge

Thanks to:

Laurie Goodman, Tam Sneddon, Nicole Nogoy, Alexandra Basford, Peter Li, Jesse Si Zhe

Shaoguang Liang (BGI-SZ), Tin-Lap Lee (CUHK), Huayen Gao (CUHK), Qiong Luo (HKUST), Senghong Wang (HKUST), Yan Zhou (HKUST), Cogini

Contact us:

[email protected]@gigasciencejournal.com

Follow us:

@gigascience
facebook.com/GigaScience
blogs.openaccesscentral.com/blogs/gigablog/

www.gigadb.org
www.gigasciencejournal.com