Revolutionising the Journal through Big Data Computational Research

Amye Kenall, Journal Development Manager, Open Data


DataCite Annual Conference, Inist-CNRS

Vandoeuvre-lès-Nancy, France, 26 August 2014


Who are we?

• Founded in 2000 (bought by Springer in 2008)

• Publish over 260 open access journals

• ~25,000 peer-reviewed research articles published annually

• Genomics and computational biology are a significant fraction, e.g. Genome Biology, BMC Genomics, BMC Bioinformatics

• Other key fields include Public Health / Global Health / Infectious Disease, and Cancer

• All research articles are CC-BY licensed for reuse

• Since mid-2013, all data is covered by a CC0 rights waiver


Data reuse @BioMedCentral

• Authors of all journals are strongly encouraged to provide underlying datasets; a select number of journals require them (e.g. Genome Biology, Genome Medicine, GigaScience)

• CC0 + CC-BY 4.0 by default

• Availability of Data section and Data Citation

• Encourage use of ISA-TAB (especially GigaScience and BMC Research Notes)

In the works…

• Interactive tabular data

• DOIs for all additional files

• Searchability of additional files

• Data Citation clearly tagged in XML to aid harvesting, e.g. by the Data Citation Index


Journal, data platform and database for large-scale data

In conjunction with


Linking and Citation


Publishing Reproducible Science: SOAPdenovo2, a case study


Lessons Learned?

• With enough work, results can be replicated with a push of a button.

• But a lot of work costs a lot of money! No one would pay an APC that reflects that cost.

• You learn a huge amount about the study, and the process provides a lot of information not present in the paper.

• Needs to happen before publication.


Reproducibility of computational research

• Computational research in principle should be easier to replicate/reproduce than bench studies

• However, practical issues get in the way

• Even if source code is shared, reproducing the entire technical setup, porting software, gathering appropriate input data, and rerunning the analysis take significant effort

• This means readers and even reviewers don’t bother

• We would like to reduce this ‘activation energy’


Strong interest from potential partners


Key technologies


Partners + Technologies + Journal Article


Flexible management/deployment of packaged data/analysis suites using VM infrastructure


Complementary roles of publishers, academia, and cloud providers

• Publishers have a role in enforcing community standards

• Public/academic databases can provide credible long-term archiving for key data, with a focus on curation and metadata standards

• Academic grid computing infrastructure can give researchers access to large-scale computing resources

• Commercial cloud providers universalize/democratize access to large-scale computing. Even if you are not at an institution with its own facilities, you can carry out high-end computations. No bureaucracy/politics: simply pay per CPU-hour.


Specific challenges with respect to data

• To what extent can/should datasets be included in the VM/suite or pulled in externally?

• How can we avoid the cost of moving data around as it gets bigger and bigger?

• To what extent are cross-domain standards for referring to and pulling in underlying datasets feasible? Dataset DOIs typically point to metadata.

• Multiple versions of datasets. To what extent is it practical, when dealing with evolving datasets/databases, to make them available as reproducible snapshots?

• Culture of data sharing. How do we get authors to share their data?
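One way to make the "reproducible snapshots" question concrete is a checksum manifest: record a hash for every file in a dataset version, then check later whether a copy still matches. The sketch below is illustrative only and is not a workflow described in this talk; the function names (`file_sha256`, `build_manifest`, `changed_files`) and the directory layout are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(snapshot_dir):
    """Map each file's path (relative to the snapshot) to its SHA-256."""
    root = Path(snapshot_dir)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(snapshot_dir, manifest):
    """Return relative paths that are missing, added, or altered."""
    current = build_manifest(snapshot_dir)
    return sorted(
        name
        for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )
```

A manifest stored alongside a dataset version lets anyone confirm that the files they pulled in are the ones an analysis was run on; an empty list from `changed_files` means the copy matches the recorded snapshot. It does not address the cost of moving or storing large snapshots.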


Conclusions

• With big data and computational tools, research is becoming more “reproducible/reusable”

• The infrastructure is out there; we need to do a better job of using it

• What authors need to communicate their research is also changing, and as publishers we must respond

• Clearly, publishers have a role, with other organisations, in setting some community standards

• It took a few hundred years, but publishing is now getting exciting


Questions?

“One reason that the worldwide web worked was because people reused each other’s content in ways never imagined or achieved by those who created it. The same will be true of open data.”

– Tim Berners-Lee and Nigel Shadbolt, The Times, New Year’s Eve 2011

Amye Kenall, Journal Development Manager (Open Data), BioMed Central

@AmyeKenall (also @OpenDataBMC)