25
Navigational Data Management Martin Greenwald, Josh Stillerman, John Wright MIT – Plasma Science & Fusion Center IAEA TM on Fusion Data Processing June 2, 2017

Navigational Data Management

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Navigational Data Management

Navigational Data Management

Martin Greenwald, Josh Stillerman, John WrightMIT – Plasma Science & Fusion Center

IAEA TM on Fusion Data Processing June 2, 2017

Page 2: Navigational Data Management

“I have a system for storing my data and getting it back, aren’t I done?”

IAEA TM 2017 Navigational Data Management 2

● In general, our existing approach to capturing and exploiting this class of metadata has been ad hoc and inadequate

● This hampers data discovery and the ability to assemble coherent, complete, useful data sets.

● Probably not.

● Collecting data has never been easier, but…

● We’re struggling to keep up with the rapidly growing volume and complexity of scientific data.

Our Thesis

● The challenge is all about giving this mountain of data meaning and putting it into context

● Context requires metadata describing relationships among data objects – “navigational metadata”

Page 3: Navigational Data Management

Discovering and Understanding Data Depends On Context

IAEA TM 2017 Navigational Data Management 3

● Historical attempts at organizing human knowledge: systems for ordering and categorizing

● Data discovery relies on “adjacency” to find other interesting data

● Historically we’ve each build a set of ad-hoc, domain specific tools to store, explore, and retrieve this relationship metadata.

● Similar issues confront all data intensive areas of research.

● Can we solve these problems in our own domain?

● Can we generalize these to provide solutions across a broader set of domains?

It’s no accidentOrganizing knowledge is an old problem

Define adjacencies – ease searching & browsing

Page 4: Navigational Data Management

Background: Our Approach To This Problem Arose From A Decades Long Process of Generalization & Abstraction

IAEA TM 2017 Navigational Data Management 4

● Each step made the collection and organization of data easier – salient examples include

● MDSplus (www.mdsplus.org)

– Provides a well-characterized, hierarchical description of diagnostics, setup, calibration measurements, analyzed data, metadata, data acquisition workflow

– A single organizational hierarchy dominates (tree)

● MPO (Metadata, Provenance, Ontology) (mpo.psfc.mit.edu )

– Captures experimental, analysis and simulation workflows

– A single organization schema dominates (directed acyclic graph)

● Electronic Logbook database & schema

● These are particular examples of a general, graph-representation of data and relationships

Page 5: Navigational Data Management

Complexity: What Sorts Of Data Might Exist From A Typical Experiment?

IAEA TM 2017 Navigational Data Management 5

● Hierarchical data stores with raw and processed data (~105 named data objects per shot)

● Relational databases with “high level” results

● Electronic logbooks & annotation

● Experimental proposals

● Run Plans & Summaries

● Data provenance systems

● Data catalogs

● Data dictionaries, name space management

● Information about experimental campaigns & plans

● Publications & presentations

● Information about researchers, authors

● Simulation inputs & outputs

● Source code management systems

● Facility information, with details of experiment, measurement systems

● Document, drawing management systems

● QA, QC information

● WBS for projects

Page 6: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 7

List of all experimental proposals with links to runs where they were executed

Page 7: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 8

Full text of experimental proposals

Page 8: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 9

List of all experimental runs and dates with links to proposals

Page 9: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 10

Summary information about each run

Page 10: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 11

Summary information about each shot

Page 11: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 12

Logbook – shot by shot annotation

Page 12: Navigational Data Management

All Of Those Data Are Linked In Multiple and Complex Ways

DIBBs Award 1640829 IAEA TM2017 - Navigational Data Management - MIT 13

MDSplus tree with links to all of the raw and processed data

Page 13: Navigational Data Management

Currently, Our Capture of This Relationship Web Is Incomplete, Ad Hoc and Asymmetric

IAEA TM 2017 Navigational Data Management 14

● Incomplete

– Some relationships are explicitly represented in databases

– Some are implicit in data or text

– Some are only known or accessible by particular users

– Some are not recorded and can be forgotten and lost forever

● Ad Hoc

– We’ve added this information as needs arise

– Schemas, vocabulary are not always consistent

– Level of detail is uneven

● Asymmetric

– Example: We point to interesting data from the logbook (annotation); but do not point to annotation from data (many, many other examples)

Page 14: Navigational Data Management

Organization of Data – By Diagnostic System

IAEA TM 2017 Navigational Data Management 15

Top

Thomson Scattering InterferometerECE

Te

ResultsHardware ResultsHardwareResultsHardware Raw Data

ne neTe

VB

Results

Raw Brightness

Zeff The C-Mod data system has ~105 such nodes with significant metadata for each node

Page 15: Navigational Data Management

Organization of Data – By Physics Parameter

IAEA TM 2017 Navigational Data Management 16

Top

Thomson Scattering InterferometerECE

Te

ResultsHardware ResultsHardwareResultsHardware Raw Data

ne neTe

Plasma Parameters

VB

Results

DensityTemperatures

Raw Brightness

Zeff Such a graph would also be linked to descriptions of experiments, annotation, etc.

Page 16: Navigational Data Management

Organization of Data – By Data Provenance

IAEA TM 2017 Navigational Data Management 17

Top

Thomson Scattering InterferometerECE

Te

ResultsHardware ResultsHardwareResultsHardware Raw Data

ne neTe

Zeff Calculation

VB

Results

Raw Brightness

Zeff Such a graph would also be linked to descriptions of analysis codes, annotation, etc.

Page 17: Navigational Data Management

As We Navigate, The Data May Be Organized Into Different Topologies

IAEA TM 2017 Navigational Data Management 18

Page 18: Navigational Data Management

We Need A Systematic Approach To Represent These Relationships

IAEA TM 2017 Navigational Data Management 19

● A complex data store can be represented as a collections of data objects

– With attributes (metadata)

– With relationships to other data objects (navigational metadata)

● These relationships organize the data objects into multiple organizational paradigms

– Graphs of different topologies

– Trees, Lists, DAGs, Clouds, etc

● Note: Each data object is typically a member of several organizational schemes

– Example: We could organize a data tree by diagnostic hardware (e.g. Thomson Scattering) or physical measurement (e.g. Te as measured by TS, ECE, etc) or by it’s use in an analysis chain (e.g. kinetic MHD equilibrium)

Page 19: Navigational Data Management

We’ve Just Begun A Small Project To Add These Capabilities

IAEA TM 2017 Navigational Data Management 20

● Goals

– Systematically represent and expose data relationships

– Data driven

o Meta-Schema (the collection of relationship types and properties) defined as data

o Instances (actual relationships between specific records) stored as data

– Extensible - meet new needs without refactoring or writing (much) new code

– Broadly applicable, domain independent

● Work Products: a tool-set to build and navigate relationship web - schema and instances

– API (For building, populating and interrogating databases)

– GUI (For traversing, browsing, searching and displaying)

● Targeting data managers, not directly end users

– Build complex data system from existing & new components

Page 20: Navigational Data Management

System Needs To Represent Only Two Kinds Of Elements

IAEA TM 2017 Navigational Data Management 21

● Data objects

– With attributes (= metadata)

– Includes pointer (URI) to data & protocol

● Relations

– With attributes

● We can recognize these as the nodes and edges that define a mathematical graph

● Example from MPO system – data preparation for GYRO code

Page 21: Navigational Data Management

Not Intended As Stand-alone: Would Not Replace All Other Data Stores

IAEA TM 2017 Navigational Data Management 22

● Point to data objects in these external stores via URI – agnostic to the type of data being referred to

● URI includes protocol – indicates software ecosystem needed to read & navigate data store

– MDSplus “mdsplus://cmod/1050426022/electrons/Thomson/results/te”

– HDF5

– File system “file://server_name.domain/home/username/gs2/run_1234/inputs/grad_te”

– RDB

– Etc.

● Full functionality requires that each data store allow access to data, metadata, schema and navigation via API

– External data could range from transparent or opaque depending on capabilities of its underlying system

– Some systems could be retro-fit to improve access to information

Page 22: Navigational Data Management

Initial Implementation, R&D

IAEA TM 2017 Navigational Data Management 23

● Refactor existing annotations (logbook), Experimental Proposals, Runs, Shots, etc into new system as proof of principle

● Compare various database technologies (Relational, NoSQL, Graph (neo4j, orientdb))

– How hard is it to set up?

– Can it represent our objects?

– How hard is it to populate?

– How hard is it to query? How fast to execute?

– How hard to write applications against?

– How robust? Can it scale?

● Write an initial Single Page Application (SPA) using modern web front-end (VueJS, Angular, Polymer,…)

● Iterate

Page 23: Navigational Data Management

Challenges: Potentially Infinite Scope

IAEA TM 2017 Navigational Data Management 24

● Taken to its limits, this representation of information can grow without bounds

– Semantic web running into difficulties – trying to bring “meaning” to information on the web

– Truncation will be key, limit scope to useful relationships

– Granularity – some data objects must be treated as opaque, without internal structure as far as this system is concerned (example – long time series)

– Follow 80:20 rule – 80% of the value attained with the first 20% of the effort

● These issues also impact applications and visualization

– Visual browsing & data discovery is only practical if it matches human cognition

– Adopt a “filter, then browse” paradigm

Page 24: Navigational Data Management

Challenges: Costs & Mitigations

IAEA TM 2017 Navigational Data Management 25

● For this class of metadata to be useful they have to be populated

– Effort required for data managers and for users

– Costs are localized, benefits are general

● Ease the slope to entry

– Drive development from acknowledged use cases

– Encode existing data relationship systems

– Mine existing data sets

– Automate metadata capture wherever possible

– Create an easy to use API and provide plenty of examples

– Build example applications, templates

Page 25: Navigational Data Management

END

IAEA TM 2017 Navigational Data Management 26