VO Sandpit, November 2009

Data Citation in the Earth and Physical Sciences

Sarah Callaghan [[email protected]]

with thanks and acknowledgement to a lot of other people!

Developing Data Attribution and Citation Practices and Standards: An International Symposium and Workshop

August 22-23, 2011

Who are we? And why do we care?... And what do we know about data?

We’re one of the NERC data centres

Some BADC numbers for context

Dataset: A collection of files sharing some administrative and/or project heritage.

BADC has approximately 150 real datasets (and thousands of virtual datasets).

BADC has approx 200 million files containing thousands of measured or simulated parameters.

BADC tries to deploy information systems that describe those data, parameters, projects and files, along with services that allow one to manipulate them …

Calendar year 2010: 2800 active users (of 12000 registered), downloaded 64 TB data in 16 million files from 165 datasets.

Less than half of the BADC data consumers are “atmospheric science” users!
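The 2010 usage figures above imply a few derived numbers that are easy to check. The following is only a rough sketch: it assumes decimal units (1 TB = 1,000,000 MB), and the variable and function names are illustrative, not part of any BADC system.

```python
# Figures quoted on the slide for calendar year 2010.
ACTIVE_USERS = 2_800
REGISTERED_USERS = 12_000
DOWNLOADED_TB = 64
DOWNLOADED_FILES = 16_000_000

# Assumption: decimal units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def usage_summary():
    """Return figures derived from the quoted 2010 usage numbers."""
    return {
        "active_fraction": ACTIVE_USERS / REGISTERED_USERS,
        "mean_file_size_mb": DOWNLOADED_TB * MB_PER_TB / DOWNLOADED_FILES,
        "files_per_active_user": DOWNLOADED_FILES / ACTIVE_USERS,
    }


summary = usage_summary()
print(f"Active share of registered users: {summary['active_fraction']:.1%}")
print(f"Mean downloaded file size: {summary['mean_file_size_mb']:.1f} MB")
print(f"Files per active user: {summary['files_per_active_user']:.0f}")
```

Under these assumptions, roughly a quarter of registered users were active, and the average downloaded file was about 4 MB.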

What does data mean to us?

Data can be anything from:

• A measurement taken at a single place and time (e.g. water sample, crystal structure, particle collision)

• Measurements taken at a point over a period of time (e.g. rain gauge measurements, temperature)

• Measurements taken across an area at multiple times by a static instrument (e.g. meteorological radar, satellite radiometer measurements)

• Measurements taken over an area and a time by a moving instrument (e.g. ocean traces, air quality measurements taken during an airplane flight, biodiversity measurements)

• Results from computer models (e.g. climate models, ocean circulation models)

• Video and images (e.g. cloud camera images, photos and video from flood events, wildlife camera traps)

• Physical samples (e.g. rock cores, tree ring samples, ice cores)

Suber cells and mimosa leaves. Robert Hooke, Micrographia, 1665

Case Study: CMIP5

CMIP5: Fifth Coupled Model Intercomparison Project

• Global community activity under the auspices of the World Meteorological Organization (WMO) via the World Climate Research Programme (WCRP)

• Aim:

– to address outstanding scientific questions that arose as part of the AR4 process,

– to improve understanding of climate, and

– to provide estimates of future climate change that will be useful to those considering its possible consequences.

Method: a standard set of model simulations in order to:

• evaluate how realistic the models are in simulating the recent past,

• provide projections of future climate change on two time scales, near term (out to about 2035) and long term (out to 2100 and beyond), and

• understand some of the factors responsible for differences in model projections, including quantifying some key feedbacks such as those involving clouds and the carbon cycle.

IPCC assessment reports: FAR (1990), SAR (1995), TAR (2001), AR4 (2007), AR5 (2013)

CMIP5 numbers

Simulations:
• ~90,000 years
• ~60 experiments
• ~20 modelling centres (from around the world) using
• ~30 major(*) model configurations
• ~2 million output “atomic” datasets
• ~10s of petabytes of output
• ~2 petabytes of CMIP5 requested output
• ~1 petabyte of CMIP5 “replicated” output, which will be replicated at a number of sites (including ours).

Of the replicants:
• ~220 TB decadal
• ~540 TB long term
• ~220 TB atmosphere-only

• ~80 TB of 3-hourly data
• ~215 TB of ocean 3D monthly data!
• ~250 TB for the cloud feedbacks!
• ~10 TB of land-biochemistry (from the long term experiments alone).

(May 2011: All these data output volumes are probably a factor of two too low!)
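As a quick consistency check, the three replicated components quoted above can be summed and compared with the "~1 petabyte" headline figure. This is a sketch only: it assumes decimal units (1 PB = 1000 TB) and simply applies the "factor of two" caveat as a multiplier.

```python
# Replicated CMIP5 output by experiment family, in TB, as quoted on the slide.
replicated_tb = {
    "decadal": 220,
    "long term": 540,
    "atmosphere-only": 220,
}

# Assumption: decimal units, 1 PB = 1000 TB.
TB_PER_PB = 1000

total_tb = sum(replicated_tb.values())
total_pb = total_tb / TB_PER_PB

# The May 2011 note suggests the volumes may be a factor of two too low.
revised_pb = 2 * total_pb

print(f"Replicated output: {total_tb} TB (about {total_pb:.2f} PB)")
print(f"With the factor-of-two caveat: about {revised_pb:.2f} PB")
```

The components add up to 980 TB, which agrees with the quoted figure of roughly 1 petabyte of replicated output.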

CMIP5 and Data Citation

CMIP5 will produce a lot of data! It’s an international effort, with everyone involved wanting to ensure proper citation, attribution and location of the data produced.

From http://cmip-pcmdi.llnl.gov/cmip5/citation.html?submenuheader=3 :

“Digital Object Identifiers will be assigned to various subsets of the CMIP5 multi-model dataset and, when available and as appropriate, users should cite these references in their publications. These DOI’s will provide a traceable record of the analyzed model data, as tangible evidence of their scientific value. Instructions will be forthcoming on how to cite the data using DOI’s.”

There are also plans to work with journal publishers to publish data papers about various key model runs and ensembles (more about data publication later!)

Earth Sciences: BADC

It is possible to reference our datasets using a specific citation given on the main dataset information page.

We’re currently working on assigning DOIs to certain datasets which meet our technical quality standards.

Earth Sciences: Pangaea

Physics and Life Science: ISIS

The ISIS pulsed neutron and muon source produces beams of neutrons and muons that allow scientists to study materials at the atomic level using a suite of instruments, often described as ‘super-microscopes’. It supports a national and international community of more than 2000 scientists who use neutrons and muons for research in physics, chemistry, materials science, geology, engineering and biology.

ISIS is now issuing DOIs for experiment data to allow easy citation. Principal Investigators will be sent DOIs shortly before their experiment is due to start.

DOIs issued by ISIS are in the form: 10.5286/ISIS.E.1234567

The recommended format for citation is:

Author, A N. et al; (2010): RB123456, STFC ISIS Facility, doi:10.5286/ISIS.E.1234567
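The recommended citation format above is regular enough to generate programmatically. The sketch below is illustrative only: the function names are hypothetical, and the DOI pattern (the prefix shown on the slide followed by digits) is an assumption inferred from the single example given, not a documented ISIS rule.

```python
import re

# Assumption: ISIS experiment DOIs are "10.5286/ISIS.E." followed by digits,
# inferred from the one example on the slide.
ISIS_DOI_PATTERN = re.compile(r"10\.5286/ISIS\.E\.\d+")


def is_isis_doi(doi: str) -> bool:
    """Return True if the string matches the assumed ISIS DOI form."""
    return ISIS_DOI_PATTERN.fullmatch(doi) is not None


def format_isis_citation(first_author: str, year: int, rb_number: str, doi: str) -> str:
    """Build a citation string in the recommended format from the slide."""
    if not is_isis_doi(doi):
        raise ValueError(f"Not in the assumed ISIS DOI form: {doi!r}")
    return f"{first_author} et al; ({year}): {rb_number}, STFC ISIS Facility, doi:{doi}"


citation = format_isis_citation(
    "Author, A N.", 2010, "RB123456", "10.5286/ISIS.E.1234567"
)
print(citation)
```

Called with the placeholder values from the slide, this reproduces the recommended citation string exactly.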

Identifying materials for hydrogen storage

Chemistry: PubChem

Astronomy: Seamless Astronomy and Dataverse

Dataverse data citation standard:

• offers proper recognition to authors

• provides permanent identification through the use of global, persistent identifiers in place of URLs

• uses universal numerical fingerprints (UNFs) to guarantee that future researchers will be able to verify that data retrieved is identical to that used in a publication decades earlier, even if it has changed storage media, operating systems, hardware, and statistical program format.

Following is an authentic example of a replication data-set citation (from International Studies Quarterly, King and Zeng, 2007, p.209):

Gary King; Langche Zeng, 2006, "Replication Data Set for 'When Can History be Our Guide? The Pitfalls of Counterfactual Inference'" hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== Murray Research Archive [distributor]

http://projects.iq.harvard.edu/seamlessastronomy/

The Seamless Astronomy Group at the Harvard-Smithsonian Center for Astrophysics brings together astronomers, computer scientists, information scientists, librarians and visualization experts involved in the development of tools and systems to study and enable the next generation of online astronomical research. 

They are evaluating the Dataverse, an open data archive hosted by Harvard University and managed by the Institute for Quantitative Social Science (IQSS), as a project-based repository for the storage, access, and citation of reduced astronomical data.
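The King and Zeng citation quoted on this slide carries two machine-readable identifiers: a handle (hdl:) and a UNF. A minimal sketch of pulling them out of the citation string follows. The regular expressions and the function name are illustrative assumptions, not part of the Dataverse software, and the code only extracts the fingerprint; it does not recompute or verify it.

```python
import re

# The example citation from the slide.
CITATION = (
    "Gary King; Langche Zeng, 2006, \"Replication Data Set for "
    "'When Can History be Our Guide? The Pitfalls of Counterfactual Inference'\" "
    "hdl:1902.1/DXRXCFAWPK UNF:3:DaYlT6QSX9r0D50ye+tXpA== "
    "Murray Research Archive [distributor]"
)

# Assumption: identifiers are whitespace-delimited tokens with these prefixes.
HANDLE_RE = re.compile(r"hdl:(\S+)")
UNF_RE = re.compile(r"UNF:(\d+):(\S+)")


def extract_identifiers(citation: str) -> dict:
    """Extract the handle and UNF components from a citation string."""
    handle = HANDLE_RE.search(citation)
    unf = UNF_RE.search(citation)
    return {
        "handle": handle.group(1) if handle else None,
        "unf_version": int(unf.group(1)) if unf else None,
        "unf_fingerprint": unf.group(2) if unf else None,
    }


identifiers = extract_identifiers(CITATION)
print(identifiers)
```

The handle gives the persistent location of the data set, while the UNF is what lets a later reader check that the retrieved data matches what was originally cited.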

(Scientific) Communication through the ages

Science, as a process, requires the exchange of information and ideas.

We can make this exchange face-to-face (conferences, meetings, seminars) or through another medium (text, video, images), or both.

No matter what method we use, we wind up telling each other stories about what we’ve discovered.

Technology has given us new tools, but it’s also provided new challenges

http://www.intoon.com/#68559

The Data Deluge

“the amount of data generated worldwide...is growing by 58% per year; in 2010 the world generated 1250 billion gigabytes of data”

The Digital Universe Decade – Are You Ready? IDC White Paper, May 2010
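To get a feel for what 58% annual growth means, the quoted figures can be turned into a doubling time and a simple projection. This is a sketch under a strong assumption that the source does not make: that the growth rate stays constant from year to year. The function name is illustrative.

```python
import math

# Figures quoted from the white paper.
GROWTH_RATE = 0.58          # 58% per year
VOLUME_2010_GB = 1250e9     # 1250 billion gigabytes generated in 2010


def projected_volume_gb(year: int, base_year: int = 2010) -> float:
    """Project annual data volume, assuming a constant growth rate."""
    return VOLUME_2010_GB * (1 + GROWTH_RATE) ** (year - base_year)


# Time for the annual volume to double at a constant 58% per year.
doubling_time_years = math.log(2) / math.log(1 + GROWTH_RATE)

print(f"Doubling time: {doubling_time_years:.2f} years")
for year in (2010, 2011, 2012):
    print(f"{year}: {projected_volume_gb(year) / 1e9:.0f} billion GB")
```

At that rate the annual volume doubles roughly every year and a half.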

Journals can no longer communicate everything we need to know about a scientific event, whether that’s an observation, simulation, development of a theory, or any combination of these.

Data has always been the foundation of scientific progress – without it, we can’t test any of our assertions.

Previously data was hard to capture, but could be (relatively) easily published in image or table format.

We need to publish data – but how?

Serving, Citing and Publishing Data

Citation forms an important part of the scientific record.

We draw a clear distinction between:

publishing = making available for consumption (e.g. on the web), and

Publishing = publishing after some formal process which adds value for the consumer:

• e.g. PLoS ONE type review, or
• EGU journal type public review, or
• More traditional peer review.

AND

• provides commitment to persistence

0. Serving of data sets

This is what data centres do as our day job: take in data supplied by scientists and make it available to other interested parties.

1. Data set citation

This is our first step for this project: formulate and formalise a way of citing data sets. This will provide benefits to our users, and a carrot to get them to provide data to us!

2. Publication of data sets

This involves the peer review of data sets, and gives the “stamp of approval” associated with traditional journal publications. It can’t be done without effective linking/citing of the data sets.


Final remarks

• There is obviously a need for data citation, not only for scientists, but also to provide traceability and accountability for the general public (cf. issues surrounding Climategate)

• There is serious pressure in the Earth and climate sciences to publish data

• but there is also a need to ensure proper accreditation

• How we communicate scientific findings is changing – data citation is a big part of that.

http://www.keepcalm-o-matic.co.uk/default.aspx#createposter