Data Standards & Best Practices for the Stratigraphic Record


Data Standards & Best Practices
Kerstin Lehnert
Lamont-Doherty Earth Observatory

iedadata.org


Vouchering the Stratigraphic Record

A synthesis database?
• Aggregates data that are published in articles or in data repositories
• Requirements: integration, quality (trusted data!)
• Needs standardized metadata, semantics, and persistent unique identifiers

A trusted repository?
• Publishes and ensures persistent access to data
• Requirements: compliance with international data curation and repository standards
• Long-term preservation, data identification (DOI), editorial procedures, etc.


Data Standards

“documented agreements on representation, format, definition, structuring, tagging, transmission, manipulation, use, and management of data.”

• Discipline specific
• Data type specific
• Application specific


Data Standards: Why?

Re-usability of data

Reproducibility of science

Integration/interoperability of data


Reproducibility in the Field Sciences

Workshop in May 2015, organized by AAAS (M. McNutt), AGU, and ESA, funded by the Arnold Foundation. Report in preparation.

Technical Requirements for Transparent, Reproducible Data

1. The data themselves must be publicly available in machine-readable, non-proprietary formats with accurate and precise descriptive metadata;
2. Data provenance (the processes by which usable datasets were generated or derived from raw, often streaming or machine-readable-only data) must be accurately and precisely specified;
3. Computer code (“scripts”) and software with which datasets were analyzed must be available and adequately described to ensure their repeated use and be publicly available in non-proprietary formats; and
4. Version control should be used to ensure that the original data and code are maintained.

(from draft workshop report)
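A minimal sketch of how requirements 1, 2, and 4 might look in practice: a table written as CSV (machine-readable, non-proprietary) with a small provenance record that names the raw source, the processing script, a version label, and a checksum. The file names, field names, and measurements are hypothetical and are not taken from the workshop report.

```python
import csv
import hashlib
import io
import json


def sha256_of(text: str) -> str:
    """Return the SHA-256 hex digest of a text payload (UTF-8 encoded)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV, a machine-readable, non-proprietary format."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(rows[0].keys()), lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def provenance_record(csv_text: str, source: str, script: str, version: str) -> dict:
    """Describe how a dataset was derived and fingerprint its content."""
    return {
        "derived_from": source,       # raw data the table was derived from
        "processing_script": script,  # code used to generate the table
        "version": version,           # version label shared by data and code
        "sha256": sha256_of(csv_text),
    }


# Hypothetical measurements, for illustration only.
rows = [
    {"sample": "S-001", "depth_m": 1.5, "delta18O": -2.1},
    {"sample": "S-002", "depth_m": 3.0, "delta18O": -1.8},
]
csv_text = to_csv(rows)
record = provenance_record(csv_text, "raw/run_01.dat", "reduce.py", "v1.0")
print(json.dumps(record, indent=2))
```

Anyone who later receives the CSV can recompute the checksum and compare it with the record to confirm that the data are unchanged.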


Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)

Joint initiative of Earth Science publishers and Data Facilities to better help translate the aspirations of open, available, and useful data from policy into practice.

• Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
• Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.

Statement of Commitment signed by all major Earth & Space Science publishers


www.copdess.org


Repository Standards

Open access

Data quality assurance (editorial process)

Persistence (long-term preservation)

Persistent & unique identification of data (DOI registration)

Standards-based metadata (ISO) & APIs (OAI-PMH)
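As an illustration of the API item above, OAI-PMH harvesting requests are plain HTTP queries built from a verb and a few arguments. The sketch below only constructs the request URL. The repository base URL is a hypothetical placeholder, and `oai_dc` is used because it is the metadata format every OAI-PMH repository must support; a repository may also offer ISO-based formats.

```python
from urllib.parse import urlencode

# The six request verbs defined by the OAI-PMH 2.0 protocol.
OAI_VERBS = {
    "Identify",
    "ListMetadataFormats",
    "ListSets",
    "ListIdentifiers",
    "ListRecords",
    "GetRecord",
}


def oai_pmh_url(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    # Skip arguments that were passed as None.
    query.update({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{urlencode(query)}"


# Hypothetical endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"

url = oai_pmh_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
print(url)
```

Because `from` is a reserved word in Python, the protocol's `from` argument would have to be passed as `**{"from": "2015-01-01"}`.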

Figure (diagram labels): Adding Value, from small data to BIG DATA. Repository types: Generic Repositories, Community Data Collections, Domain Repositories. Data qualities: accessible; findable (identification, persistence); protection, protocols; context, provenance; re-usable (harmonized, machine-readable); interoperable.


Distributed Data Curation

Alert: Stratigraphy is multi-disciplinary. There are many data types that already have homes:

• Paleobio Database
• Macrostrat/Digital Crust
• Geochron (@IEDA)
• MagIC
• Open Core Data (@IEDA, under development)
• EarthChem (@IEDA)
• System for Earth Sample Registration (@IEDA)

Don’t reinvent, but leverage, link, & integrate!

EarthCube

EarthCube: A Process

Get all the info at: http://earthcube.org

Figure (diagram labels): Computer Sciences, Software Engineers, Scientific Vision, Technical Architecture, Engagement, Funded Projects


Back to Data Standards

Metadata Content Structure (data model) Vocabularies & Taxonomies

Identifiers

(API = Application Programming Interface)


Metadata Standards

Geospatial

Scientific Context

Object classifications

Methods (instrumentation, computation, etc.)

Actions (dates, actors)

Data provenance (references, authors, etc.)


Open Geospatial Consortium (OGC): Observations & Measurements

Figure (diagram labels): Observation, Result, Feature of Interest, Sampling, Sampling Feature (e.g. Station, Transect, Section)

“Observations commonly involve sampling of an ultimate feature of interest. This International Standard defines a common set of sampling feature types classified primarily by topological dimension, as well as samples for ex-situ observations.” (OGC O&M 2.0.0 / ISO 19156; editor: Simon Cox)
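The relationship between an observation, its sampling feature, and the ultimate feature of interest can be sketched with a few data classes. This is a heavily simplified illustration, not the O&M schema itself: the class and attribute names are chosen for readability, and the formation, section, and measurement are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FeatureOfInterest:
    """The ultimate real-world feature an observation is about."""
    name: str


@dataclass
class SamplingFeature:
    """A proxy used to observe the feature, e.g. Station, Transect, Section."""
    name: str
    kind: str
    sampled_feature: FeatureOfInterest


@dataclass
class Observation:
    """One act of observing a property on a sampling feature."""
    feature: SamplingFeature
    observed_property: str
    procedure: str
    result: float
    unit: str

    def ultimate_feature(self) -> FeatureOfInterest:
        """Follow the sampling feature back to the feature it samples."""
        return self.feature.sampled_feature


# Hypothetical values, for illustration only.
formation = FeatureOfInterest("Example Formation")
section = SamplingFeature("Measured section A", "Section", formation)
obs = Observation(section, "bed thickness", "tape measure", 0.42, "m")

print(obs.ultimate_feature().name)
```

Keeping the sampling feature separate from the feature of interest lets many observations on different sections or stations point back to the same geological unit.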

Observation Data Model v2

Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain"

ODM2 Team: J S Horsburgh, A K Aufdenkampe, L Hsu, A Jones, K Lehnert, E Mayorga, L Song, D Tarboton, I Zaslavsky


Data Templates

LPSC 2015 Workshop: Restoration and Synthesis of Planetary Geochemical Data

Persistent Unique Identifiers

• Samples: IGSN
• Dataset: DOI
• Article publication: DOI
• Awards & grants: FundRef
• Researchers: ORCID
• Field program: Cruise ID

Data DOI Metadata
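Some persistent identifiers carry a built-in check that software can verify before storing a link. As one example, the last character of an ORCID iD is a check character computed with ISO 7064 MOD 11-2. The sketch below validates that character only; it does not confirm that the iD is actually registered.

```python
def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 base digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the format and check character of an ORCID iD."""
    compact = orcid.replace("-", "").upper()
    base, check = compact[:15], compact[15:]
    if len(compact) != 16 or not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_character(base) == check


# 0000-0002-1825-0097 is the sample iD used in ORCID's own documentation.
print(is_valid_orcid("0000-0002-1825-0097"))
print(is_valid_orcid("0000-0002-1825-0098"))
```

A check like this catches most transcription errors at data entry, before a broken identifier is published in metadata.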


Internet of Samples in the Earth Sciences

Physical samples need to be linked to the digital data generated by their study.
• Reproducibility! Access to the physical samples is required to verify & reproduce observations.
• Re-usability! Access to information about samples is required for proper evaluation & interpretation of sample-based data.

Physical samples need to be shared broadly for use & re-use.
• Samples are often expensive to collect (drilling, remote locations).
• Many samples are unique and irreplaceable.
• Re-analysis augments utility of existing data.
• Samples often serve in ways that the collectors and repositories could not have imagined.

3/26/2015


Unique Sample Identification

Imagine the possibilities …
• Easily find a specific sample and contact its owner
• Find all publications that mention a specific sample
• Find all data for that sample across the literature and distributed databases
• Find other samples with similar properties (geospatial, temporal, compositional)


Sample Identification Until Now

Samples have ambiguous and non-persistent names and cannot be properly cited.

The EarthChem Portal shows 75 publications with geochemical data referenced to a sample with the name M1 (or M-1). (www.earthchem.org)

Figure: Names of dredge sample 3 of the Amphitrite cruise (PetDB database, www.petdb.org)


Sample Identification From Now: IGSN (International Geo Sample Number)

• Persistent unique identifier for physical objects in the Earth Sciences
• Global uniqueness guaranteed via governance by the IGSN e.V.
• Persistent access and preservation of sample metadata
• Cataloguing services of IGSN e.V. members
• Allows building a central search engine
• Resolving service of the IGSN central registry
• Does not replace personal or institutional naming protocols

IGSN: Examples

Oriented Core Drill Hole (ODP)

Soil Section Rock Specimen


IGSN Status

• International governance established in 2011
• 14 members (organizations) in the IGSN e.V. (www.igsn.org)
• ca. 4 million samples registered (registration tripled in 2014)
• >350 active users, including
  • increasing number of individual scientists
  • sample repositories & museums (Smithsonian, marine cores)
  • geological surveys (USGS, Geoscience Australia, BGR)
  • large-scale observatories and sampling campaigns (ICDP, IODP, CZO, DCO, GeoPRISMS, etc.)

IGSN Adoption


COPDESS Statement of Commitment

IGSN in Action


IGSN in Action:

Publications


Metadata

• Identification: sample name(s), registrant
• Description: material, classification, age, size, comments
• Geospatial information: geographical names, coordinates
• Collection: expedition/cruise, platform, date, collector, technique
• Archiving/access: physical location of sample (repository), contact
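The metadata groups above can be sketched as a simple record with a completeness check. The field names below are illustrative choices that loosely follow the slide; they are not the official IGSN metadata schema, and the identifier and values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SampleMetadata:
    # Identification
    igsn: str
    sample_name: str
    registrant: str
    # Description
    material: Optional[str] = None
    classification: Optional[str] = None
    # Geospatial information
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    # Collection
    collector: Optional[str] = None
    collection_date: Optional[str] = None
    # Archiving/access
    repository: Optional[str] = None


def missing_fields(record: SampleMetadata) -> list[str]:
    """List the optional fields that have not been filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Hypothetical sample record, for illustration only.
record = SampleMetadata(
    igsn="HYPO00001",
    sample_name="M1",
    registrant="Example Lab",
    material="Rock",
    latitude=10.0,
    longitude=-45.0,
)
print(missing_fields(record))
```

A report of missing fields gives curators a quick way to see which records need more context before they are useful to others.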


IGSN Sample “Genealogy”
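A sample genealogy can be read as a parent-child chain, for example a core, a section cut from it, and a subsample of that section. The sketch below assumes each sample record stores the identifier of its parent; the identifiers and the mapping are hypothetical and do not come from the IGSN registry.

```python
from typing import Optional

# Hypothetical parent links: child identifier -> parent identifier.
PARENT_OF: dict[str, Optional[str]] = {
    "HYPO00001": None,         # core (no parent)
    "HYPO00002": "HYPO00001",  # section cut from the core
    "HYPO00003": "HYPO00002",  # subsample taken from the section
}


def lineage(identifier: str, parent_of: dict[str, Optional[str]]) -> list[str]:
    """Return the chain from a sample up to its top-level ancestor."""
    chain: list[str] = []
    seen: set[str] = set()
    current: Optional[str] = identifier
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = parent_of[current]
    return chain


print(lineage("HYPO00003", PARENT_OF))
```

Walking the chain lets data measured on a subsample be traced back to the core and drill hole it came from.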


Extended IGSN Metadata

• Images
• Documents (.pdf, .xls, .doc)
• References
• URLs for related data resources
• User defined metadata

Internet of Samples in the Earth Sciences

iSamples RCN

Advance use of innovative CI to connect physical samples across the Earth Sciences with digital data infrastructure

Goals:
• Improve discovery, access, and re-usability of physical samples
• Improve re-usability and reproducibility of the data generated by their study

Figure (diagram labels): Registries & Catalogs, Metadata, Identifiers, Citation, Repositories, Software Tools, Taxonomies

C4P: Collaboration & Cyberinfrastructure for Paleoscience
An EarthCube Research Coordination Network

Unravel the large-scale, long-term evolution of the Earth-Life System through the study of the geological record

Major challenges C4P addresses:
• Heterogeneous & dispersed data
• Modeling of age & time
• Legacy & ‘dark’ data
• Limited interoperability among resources
• Variable semantics & ontologies

A diverse community: paleobiology, paleoclimate, paleoceanography, geochemistry, dendrochronology, stratigraphy, geochronology, sample curation, data management, bioinformatics, semantics, software architecture, and more ...

C4P achievements:
• New resources
  • Data & software catalogs
  • Educational materials (webinars)
• New collaborations
• Convergence on best practices (samples, age, taxonomy)

Take Away Messages

develop leading practices for data

get community buy-in

align & coordinate with existing leading practices

leverage existing infrastructure

get started and don’t let the challenges stop you

“The Hitchhiker’s Guide to Geoinformatics” (Lee Allison, LISTMG Workshop 2004)

“Building an International Collaboration for Geoinformatics” (Walter Snyder, AGU 2005)

“Cyberinfrastructure for Solid Earth Geochemistry” (Kerstin Lehnert, GSA 2003)

The Cultural Challenges


Thank You!

"The wonderful thing about standards is that there are so many of them to

choose from." (Grace Hopper)
