Making small data BIG: Insights from a Long-tail Geoscience Domain
Kerstin Lehnert
Lamont-Doherty Earth Observatory of Columbia University
Palisades, NY, 10964
www.iedadata.org
Outline
• The (super-fast) Introduction to Geochemistry
• Achievements & Challenges in Geochemical Data Management
• Sustainable data infrastructure in the Long Tail
• EarthCube
Kerstin Lehnert: "Making small data BIG: Insights from a Long-tail Geoscience Domain" 2
Geochemistry
• Puts real numbers on geologic times.
• Fingerprints sources of material involved in geological processes.
• Reveals the history of climate and the circulations of the atmosphere and ocean.
• Constrains theories of the Earth’s deep interior
Geochemical Observations
• Hundreds of chemical properties of different Earth materials
– elemental or oxide concentrations
– isotopes and isotopic ratios
• Thermodynamic properties
• Kinetics
Geochemical Data Types
• Analytical (observational)
– Sample-based measurements
– Sensor data
• Experimental data
• Derived data (models)
• (Samples)
Materials & Samples
Geochemistry Methods
How a Geochemist Generates Data: “Did New Zealand Dust Influence the Last Ice Age?”
Bess Koffman, Michael Kaplan, Steven Goldstein, Gisela Winckler (LDEO), Natalie Mahowald (Cornell)
http://blogs.ei.columbia.edu/2014/03/13/did-new-zealand-dust-influence-the-last-ice-age/
Get Samples in the Field
Get Samples in the Lab/Repository
Analyze Samples in the Lab
The Data!
Note the number of data points generated in this study (the yellow dots) in light of the effort involved, from collecting samples in New Zealand to operating expensive equipment in the lab.
Data “Sharing”
Long-tail Research Data
• heterogeneous
• customized & optimized for research questions
• lack of data standards
• data sharing limited
• lack of data infrastructure (facilities)
The Value of Long-tail Data
“While the data volumes are small when viewed individually, in total they represent a very significant portion of the country’s scientific output.”
“The long tail is a breeding ground for new ideas and never before attempted science.”
(Heidorn, B. 2008: “Shedding Light on the Dark Data in the Long Tail of Science”)
BUT: Long-tail data have no value if they are not re-usable!
Monday’s Musings: Beyond The Three V’s of Big Data – Viscosity and Virality
Published on February 27, 2012 by R "Ray" Wang
http://blog.softwareinsider.org/2012/02/27/mondays-musings-beyond-the-three-vs-of-big-data-viscosity-and-virality/
What Makes Data BIG?
The sixth ‘V’: Value
Adding VALUE
small data → BIG DATA
• findable: identification, persistence
• accessible: authorization, protocols
• interoperable: harmonized, machine-readable
• re-usable: context, provenance
“… data have no value or meaning in isolation; they exist within a knowledge infrastructure – an ecology of people, practices, technologies, institutions, material objects, and relationships.”
C.L. Borgman
https://www.force11.org/group/fairgroup/fairprinciples
Domain-specific Data Facilities
Diagram labels: Generic Repositories; Domain Repositories; Domain-specific Data Facility; Science Community; Libraries; Archives; CI, Computer Science; Publishers, editors; Funding Agencies; Data Facilities; Registries
Functions named in the diagram:
• Data curation
• Data access & discovery (optimized for domain)
• Data products (synthesis)
• Data harmonization (standards)
• User support
• Metadata registration
• Software (tool) development
• Interoperability
• Data policies
• Persistent access
• Bibliometrics
AGU FM 2014: IN14B-01
Small Data Gone BIG
IEDA Repositories
• >500,000 files
• 47 TB
• 4 × 10⁶ samples
IEDA Syntheses
• 19 × 10⁶ analytical values in EarthChem
• 2.63 × 10⁶ miles of data from 808 cruises in the Global Multi-Resolution Topography (GMRT)
EarthChem: Big Data for Geochemistry
• EarthChem Library
– DOI registration
– Long-term archiving
– CC license
– Data templates & guidelines for data documentation
– QC by data managers
• Synthesis Databases (PetDB, EarthChem Portal)
– QA/QC by data managers
– Data & metadata harmonization
– Standards-compliant data model
– Service Oriented Architecture (ECP)
EarthChem Data Systems
Diagram labels: Investigators; Data; Metadata; EarthChem Library (Data Repository); Search
DOI to allow proper citation
Link to publications
Link to funding source
Data Templates
ECL Challenges
• Metadata guidelines/templates for an increasing diversity of data
• Need extended metadata for meaningful searches
– Geospatial
– Variables
– Sample name
• Integration with publication workflow
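The extended metadata named above (geospatial extent, variables, sample names) could support searches along the following lines. This is a minimal Python sketch, not the EarthChem Library's actual schema or API: the `DatasetRecord` fields, the `search` function, and the example records are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical metadata record; field names are illustrative, not the ECL schema.
@dataclass
class DatasetRecord:
    title: str
    bbox: Tuple[float, float, float, float]  # (west, south, east, north) in degrees
    variables: List[str] = field(default_factory=list)
    sample_names: List[str] = field(default_factory=list)

def bbox_overlaps(a, b):
    """True if two (west, south, east, north) boxes intersect (no antimeridian handling)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(records, bbox=None, variable=None, sample_name=None):
    """Return the records that match every criterion that was supplied."""
    hits = []
    for rec in records:
        if bbox is not None and not bbox_overlaps(rec.bbox, bbox):
            continue
        if variable is not None and variable not in rec.variables:
            continue
        if sample_name is not None and sample_name not in rec.sample_names:
            continue
        hits.append(rec)
    return hits

# Made-up example records.
records = [
    DatasetRecord("Dust provenance, South Island", (166.0, -47.0, 175.0, -40.0),
                  ["87Sr/86Sr", "Nd"], ["NZ-01", "NZ-02"]),
    DatasetRecord("Ridge basalt glasses", (-46.0, 20.0, -44.0, 25.0),
                  ["SiO2", "MgO"], ["RB-7"]),
]

found = search(records, bbox=(160.0, -50.0, 180.0, -30.0), variable="87Sr/86Sr")
print([r.title for r in found])
```

Without the geospatial, variable, and sample-name fields, none of the three search criteria could be evaluated, which is the point of the challenge listed above.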
Coalition for Publishing Data in the Earth & Space Sciences (COPDESS)
• Joint initiative of Earth Science publishers and Data Facilities to help translate the aspirations of open, available, and useful data from policy into practice.
• Reaffirm and ensure adherence to existing journal and publishing policies and society position statements regarding open data sharing and archiving of data, tools, and models.
• Ensure that Earth science data will, to the greatest extent possible, be stored in community approved repositories that can provide additional data services.
• Statement of Commitment signed by all major Earth & Space Science publishers
• Build an online community directory of appropriate Earth science community repositories for data, tools, and models that meet leading standards on curation, quality, and access
www.copdess.org
Presentation at EarthCube workshop “Scope & Vision”, March 2015
EarthChem Data Systems
Diagram labels: Investigators [.xls]; EarthChem Data Managers; Data Repository (EarthChem Library): Metadata, Data, Search; Data Synthesis (PetDB, SedDB, EarthChem Portal): DB, Data & Metadata [XML], Search
Example of success:
This study showed new relationships between noble gases and the elemental and isotope geochemistry of the deep mantle, with implications for mantle structure and evolution.
It was possible through a synthesis of the global data set, only because the scattered data were made available by the online databases PetDB and GEOROC.
This entire community now depends on this cyberinfrastructure.
The PetDB Database
Map shows locations of mafic volcanic rock samples. Color of symbols is scaled to the ⁸⁷Sr/⁸⁶Sr isotope ratio in the rocks, illustrating the difference in the composition of the Earth’s mantle under the Indian and the Pacific Oceans.
Data are from >300 publications, retrieved from the PetDB database in ca. 2 minutes.
PetDB Concept: BIG Data
• Data Mining
– Fine-grained data access: Database structure ‘disintegrates’ data sets into individual values
– Context & provenance metadata to search and filter
– Harmonized data: controlled vocabularies, data compilation & QC by data managers
• Data Integration
– User-defined across data sets
– By sample (use of unique sample ID)
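The two ideas above, breaking data sets down into individual values and integrating them again by unique sample ID, can be sketched in a few lines of Python. This is an illustration only, not PetDB's schema: the `Value` record, the sample IDs, the references, and the numbers are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# One row per individual analytical value, carrying its own context and provenance.
# Hypothetical structure; not the actual PetDB data model.
@dataclass(frozen=True)
class Value:
    sample_id: str   # unique sample identifier
    variable: str    # harmonized variable name from a controlled vocabulary
    value: float
    unit: str
    method: str      # analytical method
    reference: str   # publication the value came from

# Made-up values from two different publications on overlapping samples.
values = [
    Value("S1", "SiO2", 50.1, "wt%", "EMP", "Ref A"),
    Value("S1", "MgO", 8.2, "wt%", "EMP", "Ref A"),
    Value("S1", "87Sr/86Sr", 0.7026, "ratio", "TIMS", "Ref B"),
    Value("S2", "SiO2", 49.5, "wt%", "XRF", "Ref A"),
]

def integrate_by_sample(rows, variables):
    """Build one user-defined table row per sample across all source data sets."""
    table = defaultdict(dict)
    for row in rows:
        if row.variable in variables:
            table[row.sample_id][row.variable] = row.value
    return dict(table)

table = integrate_by_sample(values, {"SiO2", "87Sr/86Sr"})
for sample_id, columns in sorted(table.items()):
    print(sample_id, columns)
```

Because every value keeps its sample ID, the SiO2 value from one publication and the isotope ratio from another end up in the same row for sample S1.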
Data Mining: Search & Filter
Filter by method or concentration
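Filtering by method or concentration amounts to a simple predicate over individual values. A minimal sketch, assuming each value is a dictionary with `method` and `value` keys; the records and the `filter_values` helper are hypothetical and are not part of the PetDB search application.

```python
# Made-up MgO measurements; keys and values are illustrative only.
measurements = [
    {"sample_id": "S1", "variable": "MgO", "value": 8.2, "unit": "wt%", "method": "EMP"},
    {"sample_id": "S2", "variable": "MgO", "value": 6.9, "unit": "wt%", "method": "XRF"},
    {"sample_id": "S3", "variable": "MgO", "value": 9.4, "unit": "wt%", "method": "XRF"},
]

def filter_values(rows, method=None, min_value=None, max_value=None):
    """Keep rows that match the method (if given) and fall inside the value range."""
    kept = []
    for row in rows:
        if method is not None and row["method"] != method:
            continue
        if min_value is not None and row["value"] < min_value:
            continue
        if max_value is not None and row["value"] > max_value:
            continue
        kept.append(row)
    return kept

xrf_high_mgo = filter_values(measurements, method="XRF", min_value=8.0)
print([row["sample_id"] for row in xrf_high_mgo])
```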
PetDB Impact
• 500 - 800 downloads per quarter
• >550 citations in the literature
• many fundamental new discoveries & insights
• new scientific approaches
Meyzen et al., 2007, Isotopic portrayal of the Earth's upper mantle flow field. Nature 447, 1069
A. W. Hofmann: “Mantle Myths, Reservoirs, and Databases”, Goldschmidt Conf. 2008
Technical Challenges
• scalability/flexibility of database schema
– accommodate new sample and data types (time series, non-numeric data, etc.)
– track relationships among samples
– diverse context for new sample and data types
– track provenance of metadata
• performance of search application
• usability & functionality of search application
• interoperability interfaces
• data ingestion & quality control
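One item in the list above, tracking relationships among samples, is often handled with parent links between sample records. The sketch below assumes a simple parent/child model; the `parents` mapping and the sample names are hypothetical and do not describe the PetDB or ODM2 schema.

```python
# Child sample -> parent sample. Names are made up for illustration.
parents = {
    "CORE-1": None,                           # a core collected in the field
    "CORE-1/SEC-3": "CORE-1",                 # a section cut from the core
    "CORE-1/SEC-3/SPLIT-A": "CORE-1/SEC-3",   # a split prepared for analysis
}

def lineage(sample, parent_of):
    """Walk from a sample up to its root, returning the chain of ancestors."""
    chain = []
    seen = set()
    current = parent_of.get(sample)
    while current is not None:
        if current in seen:
            raise ValueError("cycle in sample relationships at " + current)
        seen.add(current)
        chain.append(current)
        current = parent_of.get(current)
    return chain

def children(sample, parent_of):
    """Return the direct subsamples of a sample."""
    return sorted(child for child, parent in parent_of.items() if parent == sample)

print(lineage("CORE-1/SEC-3/SPLIT-A", parents))
print(children("CORE-1", parents))
```

With links like these, an analysis made on a split can be traced back to the section and core it came from.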
ODM2
ODM2 Team: J S Horsburgh, A K Aufdenkampe, L Hsu, A Jones, K Lehnert, E Mayorga, L Song, D Tarboton, I Zaslavsky
Challenges:
• migration of db content
• new user interface
• new data entry & QA/QC tools
• resources
ODM2 Problem
from: http://techdistrict.kirkk.com/2009/10/07/the-usereuse-paradox/
“In general, the more reusable we choose to make a software module, the more difficult that same software module is to use.”
New User Interface (under development)
Challenge: User Expectations
C.H. Langmuir (Harvard): “Geochemical Databases: What is needed now?” Presentation at EarthCube Domain End-user workshop for Petrology & Geochemistry, March 2013
Access to Samples is a Community Concern
• Poor and uneven access and management of sample collections
• Incomplete sample tracking and linking of samples to analyses in the literature and databases
• Poor discoverability of existing samples
• insufficient or uneven sample density through space and time for most geological terrains of interest
From Executive Summary of EarthCube Domain End-user Workshop Petrology & Geochemistry 2013
EarthCube Domain End-user Workshop for Petrology & Geochemistry at the National Museum of Natural History, Smithsonian Institution, March 2013
The Internet of Samples
• Central or federated online catalogs for discovery & access of samples.
• Best practices for sample identification, documentation, and citation.
• Software tools that support personal or institutional sample management & curation.
(And facilities to provide access to curated samples!)
IGSN: International GeoSample Number
• persistent unique identifier for physical objects in the Earth Sciences; centralized control mechanism via IGSN e.V.
• resolves to virtual sample representations (sample metadata profiles) managed at federated IGSN Allocating Agents.
Use of the IGSN
IGSNs in data table resolve to sample metadata in IGSN registry
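The resolution step can be pictured as a lookup from the identifier in each data row to a sample metadata profile. The following Python sketch uses an in-memory dictionary as a stand-in for a registry; the identifiers, metadata fields, and `resolve` function are invented for illustration and are not the IGSN or SESAR web services.

```python
# Stand-in for a sample registry: identifier -> sample metadata profile.
# Identifiers and fields are placeholders, not real IGSNs.
registry = {
    "XXX000001": {"name": "NZ-01", "material": "sediment", "latitude": -43.5, "longitude": 170.2},
    "XXX000002": {"name": "NZ-02", "material": "rock", "latitude": -44.1, "longitude": 169.8},
}

# A data table whose rows refer to samples by identifier only.
data_table = [
    {"igsn": "XXX000001", "variable": "87Sr/86Sr", "value": 0.7102},
    {"igsn": "XXX000002", "variable": "87Sr/86Sr", "value": 0.7088},
    {"igsn": "XXX000009", "variable": "87Sr/86Sr", "value": 0.7095},
]

def resolve(igsn, reg):
    """Return the metadata profile for an identifier, or None if it is unregistered."""
    return reg.get(igsn)

def attach_metadata(rows, reg):
    """Pair each data row with its sample metadata; unresolved rows get None."""
    return [{**row, "sample": resolve(row["igsn"], reg)} for row in rows]

enriched = attach_metadata(data_table, registry)
unresolved = [row["igsn"] for row in enriched if row["sample"] is None]
print(enriched[0]["sample"]["name"], unresolved)
```

Rows whose identifier is not registered stay visible as unresolved, which is one way to surface the incomplete sample tracking noted earlier.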
SESAR (www.geosamples.org)
System for Earth Sample Registration
• Allocating Agent for individual investigators, sample repositories, and science programs
• tools and services for users to catalog and manage sample metadata (MySESAR)
• personal (authenticated) workspace
• metadata template creator
• label creation & printing (including QR code)
• transfer of sample ownership
• web services for client systems
• register sample metadata & obtain IGSNs
• access to IGSN metadata
• preservation & persistent access of sample metadata
• Global Sample Catalog (harvest metadata from other AAs)
Challenges:
• scalability of architecture for a rapidly growing number of registrations
• service-oriented architecture
• handle registrations
• software tools that support investigators with metadata capture in the field & lab
• flexibility for user-specific metadata & new sample types
• inclusion of sample images (storage!)
Diagram labels: Institutions (Collection Mgmt); Public ‘Virtual Museum’; Investigators (Sample Mgmt); (storage, software solutions, & services); Visualization; Publications; Data Systems; Sample Registries; APIs; GUIs
Internet of Samples Initiatives
• CODATA Task Group “Physical Samples in the Digital Era”
• SciColl: Scientific Collections International (Consortium)
• iSamples (Internet of Samples in the Earth Sciences)
– Funded EarthCube Research Coordination Network (RCN)
– advance access and re-use of physical samples through use of innovative cyberinfrastructure
• DESC: Digital Environment for Sample Curation
• IGSN e.V.
• National Data Services test-bed
DATA FACILITIES FOR THE LONG TAIL
Scalability, Sustainability
Many Earth Science Data Communities
• Atmospheric Chemistry
• Climate & Large Scale Dynamics
• Paleo-Climate
• Meteorology
• Aeronomy
• Space Weather
• Magnetospheric Physics
• Solar Terrestrial
• Igneous Petrology & Volcanology
• Geo Ed & Workforce Training
• NCAR
• Geophysics & Geodynamics
• Geobiology & Paleontology
• Cryosphere & Ice Dynamics
• Critical Zone & Soil Science
• Chemical Oceanography
• Geomorphology
• Hydrology
• Sedimentology & Stratigraphy
• Marine Geophysics
• Physical Oceanography
• Marine Geology
• Biological Oceanography
• Ocean Education
• Ocean Drilling & Engineering
• Software & Modeling
• Bioinformatics
• Ecosystems
• Biology
• High Perf Computing
• Semantics & Ontologies
• Algorithms & Data Mining
• EarthCube CI
• Solid and Aqueous Geochemistry
IEDA: A “Long-Tail” Data Facility
www.iedadata.org
• Multiple core disciplines (focus: solid earth)
– High-T Geochemistry
– Low-T Geochemistry
– Petrology
– Marine Geophysics & Geology
– Geochronology
• Cross-disciplinary tools & services
– Sample registry SESAR
– IEDA Data Browser
– Portals (GeoPRISMs, USAP-DCC, etc.)
– GeoMapApp
– Data management support
From Research Data Collections to Data Facility
Formal Governance
Robust Infrastructure
Stable Expert Team
Accreditation
Adherence to Community Standards
Scalable Infrastructure
The ALLIANCE Model
Alliance Development
Proposal “Interdisciplinary Earth Data Alliance as a Model for Integrating EC Technology Resources and Engaging the Broad Community” submitted March 2015
MetPetDB
Mineral Physics
Deep Submergence
IcePod
Challenges:
• Social & organizational engineering
• Diversity of data needs
• Diversity of systems
• Business models
Conclusions
• Long-tail data can grow BIG through domain-specific data curation.
• Partnerships among data efforts can provide a solution for sustainability of data infrastructure in long-tail communities.
• Partnerships with the computer and information sciences are necessary to build the cyberinfrastructure.
EarthCube Motivations
To transform geosciences research by supporting community-driven cyberinfrastructure to integrate data and information.
Tech. Drivers
• Supports science and other User Needs
• Create a dynamic, community-driven cyberinfrastructure
• Open, evolvable, sustainable
• Easy interface with existing capabilities
Challenges
• Diversity of the geosciences
• Interdisciplinary Science Questions
• Big, Heterogeneous Data issues
• Communities that are poorly served/have no community resources
Towards an Architecture for EarthCube
• Under purview of the EarthCube Technology and Architecture Committee (TAC)
– Coordinating with Council of Data Facilities, Science Committee, and Liaison Team
• Ongoing Working Groups (since Fall 2014):
– Architecture WG
– Standards WG
– Use Cases WG
– Funded Projects and Gap Analysis WG
– Testbed WG
EarthCube Funded Projects (2013 and 2014 Awards)
Diagram labels: Building Blocks; Architecture; Governance; Research Coordination Networks; Funded Projects
TAC Workshop (ongoing now)
Learn more at:
http://earthcube.org/group/technology-architecture-committee
http://earthcube.org/document/2014/earthcube-past-present-future