TODAY’S SCIENTIFIC ARTICLE – HOW REPRODUCIBLE IS IT?

MICHAEL MARKIE
Associate Publisher, F1000Research

@mmmarksman

f1000research.com
@f1000research

PRODUCING A TYPICAL ARTICLE

The process of science in a nutshell:

1. Collect data

2. Evaluate the data

3. Present the result(s) in a scientific paper

Articles are normally written in a Word document (or perhaps a LaTeX file) and then typically converted to the JATS XML format.

ARTICLE METADATA

“The backing singer”: it’s not part of the main body text/graphics

Its job is to identify and describe the article

In the XML, the standard bibliographic elements are:

1. Authorship

2. Article title

3. Copyright year and publication date

4. Descriptive material such as keywords and abstracts

5. Persistent identifiers (DOIs, PMIDs etc.)

Metadata makes an article discoverable: easily shared and interoperable
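
As a rough illustration of the list above, the sketch below reads those bibliographic fields out of a heavily trimmed JATS-style fragment using Python's standard library. The fragment, its values and the `extract_metadata` helper are hypothetical; a real JATS file carries many more elements and usually a DOCTYPE declaration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed JATS-style fragment; all values are placeholders.
JATS_SAMPLE = """\
<article>
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.0000/example.0001</article-id>
      <title-group>
        <article-title>An example article</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
        <contrib contrib-type="author">
          <name><surname>Roe</surname><given-names>Richard</given-names></name>
        </contrib>
      </contrib-group>
      <pub-date pub-type="epub"><year>2014</year></pub-date>
      <kwd-group><kwd>reproducibility</kwd><kwd>open data</kwd></kwd-group>
    </article-meta>
  </front>
</article>
"""


def extract_metadata(jats_xml: str) -> dict:
    """Collect authorship, title, year, keywords and DOI from <article-meta>."""
    root = ET.fromstring(jats_xml)
    meta = root.find("./front/article-meta")
    # Build "Given Surname" strings for every contributor marked as an author.
    authors = [
        f"{name.findtext('given-names')} {name.findtext('surname')}"
        for name in meta.findall(
            "./contrib-group/contrib[@contrib-type='author']/name"
        )
    ]
    return {
        "doi": meta.findtext("./article-id[@pub-id-type='doi']"),
        "title": meta.findtext("./title-group/article-title"),
        "authors": authors,
        "year": meta.findtext("./pub-date/year"),
        "keywords": [kwd.text for kwd in meta.findall("./kwd-group/kwd")],
    }


metadata = extract_metadata(JATS_SAMPLE)
print(metadata)
```

Because the fields sit in predictable places, an indexing service can pull them out without reading the article body, which is what makes the article discoverable.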

ARTICLE METADATA 2.

Benefits of Open Access journals

Greater visibility: dissemination is free and can be achieved via a simple Internet connection

The OAI protocol for metadata harvesting (OAI-PMH) allows metadata to be harvested for inclusion in many digital archives

Most journals provide the full text of the XML for data mining purposes – Why?

1. Drives users to content

2. Stimulates collaboration

3. Allows for the creation of new services for discovery
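
A minimal sketch of what metadata harvesting can look like in practice, assuming the archive exposes an OAI-PMH endpoint with Dublin Core (`oai_dc`) records. The base URL, the sample response and the helper names are hypothetical; no network request is made here.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint; a real archive publishes its own base URL.
BASE_URL = "https://archive.example.org/oai"

# XML namespaces used by OAI-PMH 2.0 responses and Dublin Core elements.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Build a ListRecords request URL, optionally limited to recent records."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical, trimmed response of the kind a harvester would receive.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:archive.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example article</dc:title>
          <dc:identifier>doi:10.0000/example.0001</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def parse_records(response_xml):
    """Return (OAI identifier, title) pairs for each harvested record."""
    root = ET.fromstring(response_xml)
    return [
        (
            record.findtext("./oai:header/oai:identifier", namespaces=NS),
            record.findtext(".//dc:title", namespaces=NS),
        )
        for record in root.iterfind(".//oai:record", NS)
    ]


request_url = list_records_url(BASE_URL, from_date="2014-01-01")
records = parse_records(SAMPLE_RESPONSE)
print(request_url)
print(records)
```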

A NEW CHALLENGE - PUBLISHING DATA

• Outputs are more than just text – data and code are involved in the process too!

BUT IT’S NOT THAT EASY...

• Data is heterogeneous depending on the discipline

• Datasets are often generated with incomplete metadata

• Scientific user-communities are often small and specialised - so is their data

• Scientific metadata are more extensive and less standardised than non-scientific metadata

WHY SHOULD WE MAKE DATA AVAILABLE IN ARTICLES?

“We conclude there is a direct effect of third-party data reuse that persists for years beyond the time when researchers have published most of the papers reusing their own data...We further conclude that...a substantial fraction of archived datasets are reused, and that the intensity of dataset reuse has been steadily increasing since 2003.”

Piwowar HA., Vision TA. Data reuse and the open data citation advantage. PeerJ (2013) doi: 10.7717/peerj.175

1. Correlates with higher citations

BUT WHY SHOULD WE MAKE DATA AVAILABLE IN ARTICLES?

“• We examined the availability of data from 516 studies between 2 and 22 years old: the odds of a data set being reported as extant fell by 17% per year

• Broken e-mails and obsolete storage devices were the main obstacles to data sharing

• Policies mandating data archiving at publication are clearly needed”

Vines TH. et al. The availability of research data declines rapidly with article age. Curr Biol 24, 94–7 (2014)

2. Research data becomes harder to access with age
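
To get a feel for what "odds fell by 17% per year" means, the sketch below compounds that yearly drop. The 17% figure comes from the quote above; the starting odds of 9:1 are purely an illustrative assumption and are not taken from the paper.

```python
# Odds fall by 17% per year (figure quoted from Vines et al.).
ANNUAL_ODDS_MULTIPLIER = 1 - 0.17


def odds_after(initial_odds, years):
    """Odds that a dataset is still extant after the given number of years."""
    return initial_odds * ANNUAL_ODDS_MULTIPLIER ** years


def odds_to_probability(odds):
    """Convert odds (e.g. 9.0 for 9:1) into a probability between 0 and 1."""
    return odds / (1 + odds)


# ASSUMPTION: starting odds of 9:1 (a 90% chance the data are extant).
# This baseline is hypothetical and chosen only to make the decline visible.
INITIAL_ODDS = 9.0

for years in (0, 5, 10, 20):
    odds = odds_after(INITIAL_ODDS, years)
    probability = odds_to_probability(odds)
    print(f"{years:>2} years: odds {odds:.2f}, probability {probability:.0%}")
```

Note that the decline applies to the odds, not directly to the probability, so the probability drops slowly at first and faster once the odds approach 1:1.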


“We evaluated the replication of data analyses in 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005–2006...We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced. The main reason for failure to reproduce was data unavailability.”

Ioannidis JPA. et al. Repeatability of published microarray gene expression analyses. Nature Genetics 41, 149–55 (2009)

3. Sharing data allows replication


Increasing government and funding mandates to do so

New research:

• testing new hypotheses

• new analysis methods

• meta-analyses to create new datasets

• studies on data collection methods

Diversity of analyses and opinion

Reduction of error and fraud

Education for new researchers

4. Lots of other reasons!

PUBLISHING DATA IN ARTICLES TODAY – RISE OF THE DATA PAPER

The Data Paper - describes a particular dataset and is peer-reviewed – can it provide the missing link between the data and the research article?

THE DATA ENFORCERS!

• Reproducible research or data sharing statements in published papers (Annals of Internal Medicine, BMJ)

• Data sharing implied by submission (BioMed Central)

• Data sharing as a condition of publication (PLoS, NPG) - AND data must be available to reviewers/editors

• Open data as a condition of submission (F1000Research) - papers will be rejected if data are not made freely available*

MAKING DATA ACCESSIBLE

• ‘Openly accessible’ – apply the principles of the Budapest Open Access Initiative (originally created for scholarly articles) to scholarly data too, i.e.:

• Free to view/access

• Free to download

• Free to re-analyse

• Free to modify

• Use a license that facilitates the ease of sharing and reuse: CC0

• Apply community norms regarding acknowledgement and citation of data.

HOW TO MAKE DATA USABLE/REPRODUCIBLE

• Present data in a usable format (e.g. not in a supplemental PDF)

• Share data in non-proprietary formats

• Specify how the data was generated (context)

• Provide quality assurance (what were the limitations?)

• Specify access to software required to view the data

• Specify parameters in the software to analyze the data
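
One possible way to follow the points above is to ship the data as plain CSV with a small JSON "sidecar" file describing it. This is only a sketch: the file names, field names and every value in `METADATA` are hypothetical, and real repositories or communities may prescribe their own metadata schema.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_dataset(directory, rows, fieldnames, metadata):
    """Write rows to a CSV file and the descriptive metadata to a JSON sidecar."""
    directory = Path(directory)
    data_path = directory / "measurements.csv"
    meta_path = directory / "measurements.metadata.json"
    with data_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    meta_path.write_text(json.dumps(metadata, indent=2), encoding="utf-8")
    return data_path, meta_path


# Hypothetical example data and description; none of it comes from a real study.
FIELDNAMES = ["sample_id", "distance_mm"]
ROWS = [
    {"sample_id": "A1", "distance_mm": 12.4},
    {"sample_id": "A2", "distance_mm": 9.8},
]
METADATA = {
    "format": "CSV, UTF-8, comma-separated (non-proprietary)",
    "context": "How the data were generated, e.g. instrument and protocol",
    "limitations": "Known sources of error or missing values",
    "software": "Any spreadsheet or statistics package that reads CSV",
    "analysis_parameters": {"smoothing_window": 5},
    "license": "CC0-1.0",
}

with tempfile.TemporaryDirectory() as tmp:
    data_path, meta_path = write_dataset(tmp, ROWS, FIELDNAMES, METADATA)
    csv_text = data_path.read_text(encoding="utf-8")
    reloaded = json.loads(meta_path.read_text(encoding="utf-8"))

print(csv_text)
print(reloaded["license"])
```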

THE MORE INFORMATION ABOUT A DATASET THE BETTER

White EP. et al. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution 6(2), 1–10 (2013)

DATA REVIEW AT F1000RESEARCH

INTERNAL EDITORIAL CHECKS

• Where to store the data – a discipline-specific repository where possible; if not, a general-science repository such as figshare, Dryad Digital Repository or Dataverse

• Format – what file type

• How the data is presented – layout, labels

• Is there adequate data?

• Is there adequate protocol information?

DATA REVIEW AT F1000RESEARCH

EXTERNAL PEER REVIEW CHECKS

• Were the methods used appropriate?

• Was the format/structure usable?

• Were the limitations and sources of error included?

• Is there adequate information to enable potential replication?

DISCOVERING DATA AT F1000RESEARCH

DISCOVERING CODE AT F1000RESEARCH

STANDARDISING THE METADATA – WHAT’S BEING DONE

• Lists of recommended repositories and standards are being developed with community-driven efforts such as:

• Force11 Data Citation Implementation Group (DCIG) – Aiming to revise the NISO/JATS XML schema for direct data citation

• The Research Data Alliance forms focused Working Groups and Interest Groups to discuss the social and technical bridges of sharing data

• The Data Citation Index – data in repositories, with or without them being linked to papers, will be recognised as independent evidence products

• Training – Lots of initiatives to help standardise how we work.

BUT THE ARTICLE AS WE KNOW IT NEEDS TO CHANGE!

An article for the digital age...?

[Diagram labels: Publication; Preprint; Reagents; Workflows; Software; Data; Results; Discussion; Open Peer Review and commenting; Alternative Metrics; Interactive tools for analysis]

IN-ARTICLE DATA MANIPULATION

FIGURES THAT DON’T EXIST

Simply data + code

Creates opportunities to change the definition of a figure

Colomb J and Brembs B. Sub-strains of Drosophila Canton-S differ markedly in their locomotor behavior [v1; ref status: indexed, http://f1000r.es/3is] F1000Research 2014, 3:176

LIVING FIGURES

Colomb J and Brembs B.

Sub-strains of Drosophila Canton-S differ markedly in their locomotor behavior [v1; ref status: indexed, http://f1000r.es/3is]

F1000Research 2014, 3:176

Other labs can attempt to replicate the study and then submit their data directly onto the figure in the article (with associated metadata).

Provides a new way to show reproducibility attempts and could change fundamentally what an article is.
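
The idea of a figure that is "simply data + code" can be sketched as a function that regenerates the image whenever the data change. The example below is not the F1000Research implementation: `render_figure`, the lab labels and all measurements are hypothetical, and the figure is a bare SVG bar chart of per-lab means built with the standard library only.

```python
from statistics import mean
from xml.sax.saxutils import escape


def render_figure(datasets, width=400, height=200):
    """Render one bar per dataset (its mean value) as an SVG string.

    Assumes every dataset is non-empty and all means are positive.
    """
    means = {label: mean(values) for label, values in datasets.items()}
    tallest = max(means.values())
    bar_width = width / len(means)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height + 20}">'
    ]
    for index, (label, value) in enumerate(means.items()):
        bar_height = height * value / tallest  # scale bars to the tallest mean
        x = index * bar_width
        parts.append(
            f'<rect x="{x + 5:.1f}" y="{height - bar_height:.1f}" '
            f'width="{bar_width - 10:.1f}" height="{bar_height:.1f}" '
            f'fill="steelblue"/>'
        )
        parts.append(
            f'<text x="{x + 5:.1f}" y="{height + 15}">{escape(label)}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


# Hypothetical measurements; not taken from Colomb and Brembs.
datasets = {"Original lab": [4.1, 3.9, 4.3]}
figure_v1 = render_figure(datasets)

# A replicating lab submits its own data and the figure is regenerated.
datasets["Replication lab"] = [3.2, 3.6, 3.4]
figure_v2 = render_figure(datasets)

print(figure_v2)
```

Because the figure is never stored as a static image, each new submission simply produces a new rendering with one more bar.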
