Making small data BIG: Insights from a Long-tail Geoscience Domain. Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY, 10964. www.iedadata.org

Lehnert: Making Small Data Big, IACS, April2015


Page 1: Lehnert: Making Small Data Big, IACS, April2015

Making small data BIG: Insights from a Long-tail Geoscience Domain

Kerstin Lehnert, Lamont-Doherty Earth Observatory of Columbia University, Palisades, NY, 10964

www.iedadata.org

Page 2: Lehnert: Making Small Data Big, IACS, April2015

Outline

• The (super-fast) Introduction to Geochemistry

• Achievements & Challenges in Geochemical Data Management

• Sustainable data infrastructure in the Long Tail

• EarthCube

Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 2

Page 3: Lehnert: Making Small Data Big, IACS, April2015

Geochemistry

• Puts real numbers on geologic times.

• Fingerprints sources of material involved in geological processes.

• Reveals the history of climate and the circulations of the atmosphere and ocean.

• Constrains theories of the Earth’s deep interior.


Page 4: Lehnert: Making Small Data Big, IACS, April2015

Geochemical Observations

• Hundreds of chemical properties of different Earth materials
  • elemental or oxide concentrations
  • isotopes and isotopic ratios

• Thermodynamic properties

• Kinetics

Page 5: Lehnert: Making Small Data Big, IACS, April2015

Geochemical Data Types

• Analytical (observational)
  • Sample-based measurements
  • Sensor data

• Experimental data

• Derived data (models)

• (Samples)


Page 6: Lehnert: Making Small Data Big, IACS, April2015

Materials & Samples


Page 7: Lehnert: Making Small Data Big, IACS, April2015

Geochemistry Methods


Page 8: Lehnert: Making Small Data Big, IACS, April2015

How a Geochemist Generates Data: “Did New Zealand Dust Influence the Last Ice Age?”

Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)

http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/

Page 9: Lehnert: Making Small Data Big, IACS, April2015

Get Samples in the Field


Page 10: Lehnert: Making Small Data Big, IACS, April2015

Get Samples in the Lab/Repository


Page 11: Lehnert: Making Small Data Big, IACS, April2015

Analyze Samples in the Lab


Page 12: Lehnert: Making Small Data Big, IACS, April2015

The Data!


Note how few data points this study generated (the yellow dots), given an effort that ranged from collecting samples in New Zealand to operating expensive equipment in the lab.

Page 13: Lehnert: Making Small Data Big, IACS, April2015

Data “Sharing”


Page 14: Lehnert: Making Small Data Big, IACS, April2015

Long-tail Research Data

• heterogeneous

• customized & optimized for research questions

• lack of data standards

• data sharing limited

• lack of data infrastructure (facilities)


Page 15: Lehnert: Making Small Data Big, IACS, April2015

The Value of Long-tail Data

“While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.”

“The long tail is a breeding ground for new ideas and never before attempted science.”

(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)

BUT: Long-tail data have no value if they are not re-usable!

Page 16: Lehnert: Making Small Data Big, IACS, April2015

What Makes Data BIG?

The sixth ‘V’: Value

R "Ray" Wang, “Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality”, published February 27, 2012: http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/

Page 17: Lehnert: Making Small Data Big, IACS, April2015

Adding VALUE

small data → BIG DATA

• findable: identification, persistence

• accessible: authorization, protocols

• interoperable: harmonized, machine-readable

• re-usable: context, provenance

“… data have no value or meaning in isolation; they exist within a knowledge infrastructure — an ecology of people, practices, technologies, institutions, material objects, and relationships.” (C.L. Borgman)

https://www.force11.org/group/fairgroup/fairprinciples

Generic Repositories, Domain Repositories
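
The facets named on this slide (findable, accessible, interoperable, re-usable) can be read as a checklist that a curated record either meets or does not. The sketch below is only an illustration of that reading: the field names in `FAIR_CHECKS`, the `fair_gaps` helper, and the sample record are all hypothetical and do not come from EarthChem, FORCE11, or any real schema.

```python
# Hypothetical mapping from each facet on the slide to metadata fields
# a repository might require. Field names are assumptions for illustration.
FAIR_CHECKS = {
    "findable": ("identifier", "title"),
    "accessible": ("access_url", "license"),
    "interoperable": ("format", "vocabulary"),
    "re-usable": ("provenance", "method"),
}


def fair_gaps(record):
    """Return, per facet, the required fields that are missing or empty."""
    gaps = {}
    for facet, fields in FAIR_CHECKS.items():
        missing = [name for name in fields if not record.get(name)]
        if missing:
            gaps[facet] = missing
    return gaps


# A made-up record: it can be found and downloaded, but says nothing about
# its license, vocabulary, or provenance.
record = {
    "identifier": "doi:10.0000/example",  # placeholder, not a real DOI
    "title": "Dust geochemistry, example data set",
    "access_url": "https://example.org/data/1",
    "format": "csv",
    "method": "ICP-MS",
}

print(fair_gaps(record))
```

A record with no gaps returns an empty dictionary; the facets that remain in the result are the places where curation would still have to add value.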

Page 18: Lehnert: Making Small Data Big, IACS, April2015

Domain-specific Data Facilities

Science Community ↔ Domain-specific Data Facility

Partners shown in the diagram: Libraries, Archives; CI, Computer Science; Publishers, editors; Funding Agencies; Data Facilities; Registries

Functions shown in the diagram:

• Metadata registration
• Software (tool) development
• Interoperability
• Data policies
• Persistent access
• Bibliometrics
• Data Curation
• Data access & discovery (optimized for domain)
• Data products (synthesis)
• Data harmonization (standards)
• User Support

AGU FM 2014: IN14B-01

Page 19: Lehnert: Making Small Data Big, IACS, April2015

Small Data Gone BIG

IEDA Repositories
• >500,000 files
• 47 TB
• 4 × 10^6 samples

IEDA Syntheses
• 19 × 10^6 analytical values in EarthChem
• 2.63 × 10^6 miles of data from 808 cruises in the Global Multi-Resolution Topography (GMRT)

Page 20: Lehnert: Making Small Data Big, IACS, April2015

EarthChem: Big Data for Geochemistry

• EarthChem Library
  • DOI registration
  • Long-term archiving
  • CC license
  • Data templates & guidelines for data documentation
  • QC by data managers

• Synthesis Databases (PetDB, EarthChem Portal)
  • QA/QC by data managers
  • Data & metadata harmonization
  • Standards-compliant data model
  • Service Oriented Architecture (ECP)

Page 21: Lehnert: Making Small Data Big, IACS, April2015

EarthChem Data Systems

[Diagram labels: Investigators; EarthChem Library (Data Repository); Metadata; Data; Search]

Page 22: Lehnert: Making Small Data Big, IACS, April2015

DOI to allow proper citation

Link to publications

Link to funding source

Page 23: Lehnert: Making Small Data Big, IACS, April2015

Data Templates


Page 24: Lehnert: Making Small Data Big, IACS, April2015

ECL Challenges

• Metadata guidelines/templates for an increasing diversity of data

• Need extended metadata for meaningful searches
  • Geospatial
  • Variables
  • Sample name

• Integration with publication workflow


Page 25: Lehnert: Making Small Data Big, IACS, April2015

Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)

• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
  • Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
  • Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.

• Statement of Commitment signed by all major Earth & Space Science publishers

• Build an online community directory of appropriate Earth science community repositories for data, tools, and models that meet leading standards on curation, quality, and access

www.copdess.org


Page 26: Lehnert: Making Small Data Big, IACS, April2015


Presentation at EarthCube workshop “Scope & Vision”, March 2015

Page 27: Lehnert: Making Small Data Big, IACS, April2015

EarthChem Data Systems

[Diagram labels: Investigators; EarthChem Data Managers; EarthChem Library (Data Repository); PetDB, SedDB, EarthChem Portal (Data Synthesis); Metadata; Data & Metadata; Data; DB; Search; [XML]; [.xls]]

Page 28: Lehnert: Making Small Data Big, IACS, April2015


Example of success:

This study showed new relationships between noble gases and the elemental and isotope geochemistry of the deep mantle, with implications for mantle structure and evolution.

It was possible through a synthesis of the global data set, only because the scattered data were made available by the online databases PetDB and GEOROC.

This entire community now depends on this cyberinfrastructure.

Page 29: Lehnert: Making Small Data Big, IACS, April2015

The PetDB Database


Map shows locations of mafic volcanic rock samples. The color of the symbols is scaled to the ⁸⁷Sr/⁸⁶Sr isotope ratio in the rocks, illustrating the difference in the composition of the Earth’s mantle under the Indian and Pacific Oceans.

Data are from >300 publications, retrieved from the PetDB database in ca. 2 minutes.

Page 30: Lehnert: Making Small Data Big, IACS, April2015

PetDB Concept: BIG Data

• Data Mining

• Fine-grained data access: Database structure ‘disintegrates’ data sets into individual values

• Context & provenance metadata to search and filter

• Harmonized data: controlled vocabularies, data compilation & QC by data managers

• Data Integration
  • User-defined across data sets
  • By sample (use of unique sample ID)
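
The two ideas on this slide, breaking published tables into individual values and re-joining them by unique sample ID, can be sketched in a few lines. This is a toy model under stated assumptions, not PetDB's schema or code: the `datasets` structure, the sample IDs, the values, and both function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical per-publication tables. Sample IDs, values, and methods
# are made up; each analysis is stored as (value, analytical method).
datasets = {
    "paper_A": [
        {"sample_id": "S1", "analyses": {"SiO2": (50.1, "XRF")}},
        {"sample_id": "S2", "analyses": {"SiO2": (48.7, "XRF")}},
    ],
    "paper_B": [
        {"sample_id": "S1", "analyses": {"87Sr/86Sr": (0.7031, "TIMS")}},
    ],
}


def disintegrate(tables):
    """Flatten per-publication tables into single values tagged with provenance."""
    values = []
    for source, rows in tables.items():
        for row in rows:
            for variable, (value, method) in row["analyses"].items():
                values.append({
                    "sample_id": row["sample_id"],
                    "variable": variable,
                    "value": value,
                    "method": method,
                    "source": source,
                })
    return values


def integrate_by_sample(values, method=None):
    """Group values by sample ID, optionally keeping only one analytical method."""
    merged = defaultdict(dict)
    for v in values:
        if method is None or v["method"] == method:
            merged[v["sample_id"]][v["variable"]] = (v["value"], v["source"])
    return dict(merged)


values = disintegrate(datasets)
by_sample = integrate_by_sample(values)
tims_only = integrate_by_sample(values, method="TIMS")
print(by_sample["S1"])
print(tims_only)
```

Here sample `S1` ends up with values from two different publications on one row, which only works because both tables refer to it by the same identifier.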


Page 31: Lehnert: Making Small Data Big, IACS, April2015

Data Mining: Search & Filter


Filter by method or concentration

Page 32: Lehnert: Making Small Data Big, IACS, April2015


Page 33: Lehnert: Making Small Data Big, IACS, April2015

PetDB Impact

• 500 - 800 downloads per quarter

• >550 citations in the literature

• many fundamental new discoveries & insights

• new scientific approaches

Meyzen et al., 2007, Isotopic portrayal of the Earth's upper mantle flow field. Nature 447, 1069

A. W. Hofmann: “Mantle Myths, Reservoirs, and Databases”, Goldschmidt Conf. 2008

Page 34: Lehnert: Making Small Data Big, IACS, April2015

Technical Challenges

• scalability/flexibility of database schema
  • accommodate new sample and data types (time series, non-numeric data, etc.)

• track relationships among samples

• diverse context for new sample and data types

• track provenance of metadata

• performance of search application

• usability & functionality of search application

• interoperability interfaces

• data ingestion & quality control


Page 35: Lehnert: Making Small Data Big, IACS, April2015

ODM2

ODM2 Team: J S Horsburgh, A K Aufdenkampe, L Hsu, A Jones, K Lehnert, E Mayorga, L Song, D Tarboton, I Zaslavsky

Challenges:
• migration of db content
• new user interface
• new data entry & QA/QC tools
• resources

Page 36: Lehnert: Making Small Data Big, IACS, April2015

ODM2 Problem

from: http://techdistrict.kirkk.com/2009/10/07/the-usereuse-paradox/

“In general, the more reusable we choose to make a software module, the more difficult that same software module is to use.”

Page 37: Lehnert: Making Small Data Big, IACS, April2015

New User Interface (under development)


Page 38: Lehnert: Making Small Data Big, IACS, April2015

Challenge: User Expectations


C.H. Langmuir (Harvard): “Geochemical Databases: What is needed now?” Presentation at EarthCube Domain End-user Workshop for Petrology & Geochemistry, March 2013

Page 39: Lehnert: Making Small Data Big, IACS, April2015

Access to Samples is a Community Concern

• Poor and uneven access and management of sample collections

• Incomplete sample tracking and linking of samples to analyses in the literature and databases

• Poor discoverability of existing samples

• Insufficient or uneven sample density through space and time for most geological terrains of interest

From the Executive Summary of the EarthCube Domain End-user Workshop for Petrology & Geochemistry, held at the National Museum of Natural History, Smithsonian Institution, March 2013

Page 40: Lehnert: Making Small Data Big, IACS, April2015

The Internet of Samples

• Central or federated online catalogs for discovery & access of samples.

• Best practices for sample identification, documentation, and citation.

• Software tools that support personal or institutional sample management & curation.


(And facilities to provide access to curated samples!)

Page 41: Lehnert: Making Small Data Big, IACS, April2015

IGSN: International GeoSample Number

• persistent unique identifier for physical objects in the Earth Sciences; centralized control mechanism via IGSN e.V.

• resolves to virtual sample representations (sample metadata profiles) managed at federated IGSN Allocating Agents.
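
The split described here, one central point of control plus metadata held by federated Allocating Agents, amounts to a two-step lookup. The sketch below uses in-memory dictionaries only; the identifiers, agent names, metadata fields, and the `resolve` and `annotate` helpers are all hypothetical and are not the real IGSN or SESAR service interface.

```python
# Step 1 data: a central registry that only knows which agent registered an IGSN.
# Identifiers and agent names are made up for illustration.
CENTRAL_REGISTRY = {
    "XXX0001": "agent_a",
    "YYY0042": "agent_b",
}

# Step 2 data: each allocating agent manages its own sample metadata profiles.
ALLOCATING_AGENTS = {
    "agent_a": {"XXX0001": {"name": "NZ-dust-01", "material": "sediment"}},
    "agent_b": {"YYY0042": {"name": "MORB-17", "material": "rock"}},
}


def resolve(igsn):
    """Find the agent that registered the IGSN, then return that agent's metadata profile."""
    agent = CENTRAL_REGISTRY.get(igsn)
    if agent is None:
        raise KeyError(f"unknown IGSN: {igsn}")
    return ALLOCATING_AGENTS[agent][igsn]


def annotate(rows):
    """Attach resolved sample metadata to data-table rows that cite an IGSN."""
    return [{**row, "sample": resolve(row["igsn"])} for row in rows]


table = [{"igsn": "YYY0042", "SiO2": 50.2}]
print(annotate(table))
```

The design point is that a data table only needs to carry the identifier; the descriptive metadata stays with whichever agent curates it.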

Page 42: Lehnert: Making Small Data Big, IACS, April2015


Use of the IGSN

IGSNs in data table resolve to sample metadata in IGSN registry

Page 43: Lehnert: Making Small Data Big, IACS, April2015

SESAR (www.geosamples.org)

System for Earth Sample Registration

• Allocating Agent for individual investigators, sample repositories, and science programs
  • tools and services for users to catalog and manage sample metadata (MySESAR)

• personal (authenticated) workspace

• metadata template creator

• label creation & printing (including QR code)

• transfer of sample ownership

• web services for client systems

• register sample metadata & obtain IGSNs

• access to IGSN metadata

• preservation & persistent access of sample metadata

• Global Sample Catalog (harvest metadata from other AAs)

Page 44: Lehnert: Making Small Data Big, IACS, April2015

Challenges:
• scalability of architecture for a rapidly growing number of registrations
• service-oriented architecture
• handle registrations
• software tools that support investigators with metadata capture in the field & lab
• flexibility for user specific metadata & new sample types
• inclusion of sample images (storage!)

Page 45: Lehnert: Making Small Data Big, IACS, April2015

[Diagram labels: Institutions: Collection Mgmt; Public ‘Virtual Museum’; Investigators: Sample Mgmt (storage, software solutions, & services); Visualization; Publications; Data Systems; Sample Registries; APIs; GUIs]

Page 46: Lehnert: Making Small Data Big, IACS, April2015

Internet of Samples Initiatives

• CODATA Task Group “Physical Samples in the Digital Era”

• SciColl: Scientific Collections International (Consortium)

• iSamples (Internet of Samples in the Earth Sciences)
  • Funded EarthCube Research Coordination Network (RCN)
  • advance access and re-use of physical samples through use of innovative cyberinfrastructure

• DESC: Digital Environment for Sample Curation

• IGSN e.V.

• National Data Services test-bed


Page 47: Lehnert: Making Small Data Big, IACS, April2015

DATA FACILITIES FOR THE LONG TAIL

Scalability, Sustainability


Page 48: Lehnert: Making Small Data Big, IACS, April2015

Many Earth Science Data Communities

• Atmospheric Chemistry
• Climate & Large Scale Dynamics
• Paleo-Climate
• Meteorology
• Aeronomy
• Space Weather
• Magnetospheric Physics
• Solar Terrestrial
• Igneous Petrology & Volcanology
• Geo Ed & Workforce Training
• NCAR
• Geophysics & Geodynamics
• Geobiology & Paleontology
• Cryosphere & Ice Dynamics
• Critical Zone & Soil Science
• Chemical Oceanography
• Geomorphology
• Hydrology
• Sedimentology & Stratigraphy
• Marine Geophysics
• Physical Oceanography
• Marine Geology
• Biological Oceanography
• Ocean Education
• Ocean Drilling & Engineering
• Software & Modeling
• Bioinformatics
• Ecosystems
• Biology
• High Perf Computing
• Semantics & Ontologies
• Algorithms & Data Mining
• EarthCube CI
• Solid and Aqueous Geochemistry

Page 49: Lehnert: Making Small Data Big, IACS, April2015

IEDA: A “Long-Tail” Data Facility

www.iedadata.org

• Multiple core disciplines (focus: solid earth)
  • High-T Geochemistry
  • Low-T Geochemistry
  • Petrology
  • Marine Geophysics & Geology
  • Geochronology

• Cross-disciplinary tools & services
  • Sample registry SESAR
  • IEDA Data Browser
  • Portals (GeoPRISMS, USAP-DCC, etc.)
  • GeoMapApp
  • Data management support

Page 50: Lehnert: Making Small Data Big, IACS, April2015

From Research Data Collections to Data Facility


Formal Governance

Robust Infrastructure

Stable Expert Team

Accreditation

Adherence to Community Standards

Page 51: Lehnert: Making Small Data Big, IACS, April2015

Scalable Infrastructure


The ALLIANCE Model

Page 52: Lehnert: Making Small Data Big, IACS, April2015

Alliance Development


Proposal “Interdisciplinary Earth Data Alliance as a Model for Integrating EC Technology Resources and Engaging the Broad Community” submitted March 2015

MetPetDB, Mineral Physics, Deep Submergence, IcePod

Challenges:
• Social & organizational engineering
• Diversity of data needs
• Diversity of systems
• Business models

Page 53: Lehnert: Making Small Data Big, IACS, April2015


Page 54: Lehnert: Making Small Data Big, IACS, April2015

Conclusions

• Long-tail data can grow BIG through domain-specific data curation.

• Partnerships among data efforts can provide a solution for the sustainability of data infrastructure in long-tail communities.

• Partnerships with the computer and information sciences are necessary to build the cyberinfrastructure.


Page 55: Lehnert: Making Small Data Big, IACS, April2015

EarthCube Motivations

To transform geosciences research by supporting community-driven cyberinfrastructure to integrate data and information.

Tech. Drivers
• Supports science and other User Needs
• Create a dynamic, community-driven cyberinfrastructure
• Open, evolvable, sustainable
• Easy interface with existing capabilities

Challenges
• Diversity of the geosciences
• Interdisciplinary Science Questions
• Big, Heterogeneous Data issues
• Communities that are poorly served/have no community resources

Page 56: Lehnert: Making Small Data Big, IACS, April2015

Towards an Architecture for EarthCube

• Under purview of the EarthCube Technology and Architecture Committee (TAC)

– Coordinating with Council of Data Facilities, Science Committee, and Liaison Team

• Ongoing Working Groups (since Fall 2014):

– Architecture WG

– Standards WG

– Use Cases WG

– Funded Projects and Gap Analysis WG

– Testbed WG

[Figure: EarthCube Funded Projects (2013 and 2014 Awards): Building Blocks, Architecture, Governance, Research Coordination Networks]

Page 57: Lehnert: Making Small Data Big, IACS, April2015

TAC Workshop (ongoing now)

Learn more at:

http://earthcube.org/group/technology-architecture-committee

http://earthcube.org/document/2014/earthcube-past-present-future