Transcript
Page 1:

Solutions for Today | Options for Tomorrow

Jennifer Bauer, Jenny DiGiulio, Devin Justman, Lucy Romeo, Kelly Rose, Patrick Wingo

Starting “small” to go Big: Building a Living Database

Michael Sabbatino1,2, Baker, D.V. “Vic”3,4, Rose, K.1, Romeo, L.1,2, Bauer, J.1, and Barkhurst, A.3,4

POC: [email protected]

1 US Department of Energy, National Energy Technology Laboratory, Albany, OR; 2 AECOM, Albany, OR; 3 Mid-Atlantic Technology Research & Innovation Center (MATRIC), Morgantown, WV; 4 US Department of Energy, National Energy Technology Laboratory, Morgantown, WV

Page 2:


Challenges & Needs of Scientific Data

Data Access
• ~80% loss of published data after 20 years

Data Discovery
• 20% public data versus 80% private

Data Interoperability
• Variety of data makes it difficult to create, exchange, & use data across different applications and systems

Data Analytics & Visualization
• Requires advanced computational capabilities, algorithms, & large data stores to analyze these data

80% Dark Data

Working up the Data Science Hierarchy of Needs (pyramid, bottom to top):
• Collect: instrumentation, logging, sensors, external data, user-generated content
• Move & Store: reliable data flow, infrastructure, pipelines, structured and unstructured data storage
• Explore & Transform: cleaning, anomaly detection, prep
• Aggregate & Label: analytics, metrics, segments, aggregates, features, training data
• Learn & Optimize: A/B testing, experimentation, simple ML algorithms
• AI & Deep Learning (top of the pyramid)

Page 3:


“small” Beginnings: Developing a Global Oil & Gas Database

Collect

Discovered & integrated open data sources of information related to oil & gas infrastructure across the globe.

Machine Learning Automated Approach: a tool that scans “seed” resources and identifies relevant keywords, then crawls the web and parses the data for integration. (A minimal sketch of this seed-and-crawl pattern follows below.)

>700 datasets, >4 million features

https://edx.netl.doe.gov/dataset/global-oil-gas-features-database
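The seed-and-crawl pattern described above can be illustrated with a short Python sketch. This is a hypothetical illustration, not the actual GOGI tool: the stopword list, frequency-based keyword ranking, and crawl policy are all assumptions made for the example.

```python
# Hypothetical sketch of a "seed" keyword-extraction + web-crawl loop;
# not the actual NETL/GOGI implementation.
import collections
import re

import requests
from bs4 import BeautifulSoup

STOPWORDS = {"the", "and", "for", "with", "from", "that", "this"}

def seed_keywords(seed_texts, top_n=20):
    """Rank frequent terms in the seed documents as candidate keywords."""
    counts = collections.Counter()
    for text in seed_texts:
        for token in re.findall(r"[a-z]{4,}", text.lower()):
            if token not in STOPWORDS:
                counts[token] += 1
    return {word for word, _ in counts.most_common(top_n)}

def crawl(start_urls, keywords, max_pages=50):
    """Fetch pages, keep those mentioning the keywords, follow their links."""
    seen, queue, hits = set(), list(start_urls), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            page = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        soup = BeautifulSoup(page.text, "html.parser")
        text = soup.get_text(" ").lower()
        if any(kw in text for kw in keywords):
            hits.append(url)  # candidate dataset page for later parsing
            queue.extend(a["href"] for a in soup.find_all("a", href=True)
                         if a["href"].startswith("http"))
    return hits
```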

Page 4:


Move & Store

EDX - A Virtual Library & Laboratory for Energy Science
• Virtualizing team analytics
• Continuing innovations to connect researchers to online Earth-Energy system resources
• Increasing number of tools & apps for use in team workspaces

Data Workflows & Structure
• Custom “smart search” tool in development
• Digital spatial team “notebook”
• Auto-indexing algorithm provides analysis of your search and helps recommend other items

EDX vs Dark Data: 80% Dark Data

EDX Smart Search: a machine learning, big data tool for rapid, online, .zip, & FTP spatial & non-spatial data mining with Hadoop + Bing + ESRI. (A hedged sketch of one small piece of this idea follows below.)

https://edx.netl.doe.gov
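As one hedged illustration of the “.zip data mining” idea, the sketch below downloads an archive and buckets its contents into spatial vs. non-spatial file types. The extension list and classification rule are assumptions for the example, not how EDX Smart Search actually works.

```python
# Illustrative sketch: classify members of a fetched .zip archive as
# spatial vs. non-spatial files; not the EDX Smart Search implementation.
import io
import zipfile

import requests

SPATIAL_EXT = {".shp", ".shx", ".dbf", ".kml", ".kmz", ".tif", ".geojson"}

def classify_zip(url):
    """Download a .zip and bucket its members by spatial file extension."""
    payload = requests.get(url, timeout=30).content
    archive = zipfile.ZipFile(io.BytesIO(payload))
    spatial, other = [], []
    for name in archive.namelist():
        ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
        (spatial if ext in SPATIAL_EXT else other).append(name)
    return spatial, other
```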

Page 5:


Explore & Transform

The Living Database
• Store & Share Data in a Structured, Secure Database Environment
• Reduce Redundant Acquisition
• Direct Data Access (not file-based storage)
• Consistent Data with Staff Turnover
• Enhance Collaboration
• Curation of data and knowledge
• Allows Direct Analysis from the Database

Storing databases with different data types, formats, & resolutions. Includes data workflow, infrastructure, pipelines, structured & unstructured data. (An illustrative direct-query sketch follows below.)

https://edx.netl.doe.gov
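A minimal sketch of “direct data access”: pull only the rows of interest straight from a structured database instead of downloading whole files. The connection string, table, and column names below are hypothetical, not the real EDX/Living Database schema.

```python
# Hypothetical direct-query example against a structured database;
# the schema and connection details are illustrative assumptions.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    "postgresql://user:password@dbhost:5432/living_db")  # hypothetical

def wells_in_bbox(min_lon, min_lat, max_lon, max_lat):
    """Fetch only features inside a bounding box, straight from the DB."""
    query = sqlalchemy.text("""
        SELECT well_id, operator, lon, lat
        FROM oil_gas_features          -- hypothetical table
        WHERE lon BETWEEN :min_lon AND :max_lon
          AND lat BETWEEN :min_lat AND :max_lat
    """)
    return pd.read_sql(query, engine, params={
        "min_lon": min_lon, "max_lon": max_lon,
        "min_lat": min_lat, "max_lat": max_lat})
```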

(Diagram: People, Data Lifecycle Apps, Research External Apps)

Page 6:

https://edx.netl.doe.gov

Aggregate & Label

Tools, Analytics, & Metrics
• Developing tools & approaches to manage multiple heterogeneous datasets
• Develop a probabilistic approach to assess scientific data using big data analyses
• Develop stochastic approaches to reduce uncertainty

Improve joint analysis of multiple datasets; focus on advancing “Big Data” mining, machine learning, and advanced geoprocessing computing.

Workflow: select relevant datasets; combine data and tools to evaluate correlations and spatio-temporal trends; highlight resultant data and analysis for reuse in further research. (A minimal sketch of the combine-and-correlate step follows below.)
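The combine-and-correlate step above can be sketched in a few lines, assuming two hypothetical datasets already keyed by a shared grid cell and month; the real analyses involve probabilistic and stochastic methods beyond this simple correlation.

```python
# Minimal joint-analysis sketch: join two hypothetical datasets on a
# shared spatial grid cell and month, then correlate their measurements.
import pandas as pd

def joint_correlation(df_a, df_b):
    """df_a and df_b each carry grid_id, month, and a 'value' column."""
    merged = pd.merge(df_a, df_b, on=["grid_id", "month"],
                      suffixes=("_a", "_b"))
    # Spatio-temporal proxy: correlation of the two variables across
    # all co-located, co-timed observations.
    return merged["value_a"].corr(merged["value_b"])
```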

Page 7:


Learn & Optimize

Future Big Data Development & Analysis

EDX continues to evolve in response to the needs of its users and NETL’s knowledge transfer goals.

• EDX Cloud Services
• Living Database
• Common Operating Platform for Data Analytics
• Geocube Spatial Data Viewer
• Fuzzy Logic Analytics (SIMPA)
• AWS Development
• Integration with decades of DOE R&D
• Federating Open Source data
• EDX & GeoCube (search and location)
• ID Data gaps in subsurface puzzle

Data resources*:
• GOGI: Millions of Records
• Oil & Gas, Geothermal Data & Resources: Millions of Records
• Carbon Storage Data & Resources: Millions of Records
• Employing “smart” search tools to include open resources: Billions of Records

*These attributes & data sources are evolving quickly with implementation of new tools and engagement of key stakeholders.

https://edx.netl.doe.gov

Page 8:


Advanced computer science & research

https://edx.netl.doe.gov

Developing Schema Matching AI
• Variety of data sources with diverse data schemas
• Manual schema matching is time consuming & inefficient
• Plan to develop and use existing machine learning algorithms to match disparate data schemas at the:
  • Schema level
  • Element level
  • Structure level
• Linguistic Matching
• Syntactic Techniques

(A hedged sketch of element-level linguistic matching follows below.)
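As one hedged illustration of element-level linguistic matching, the snippet below scores column-name pairs from two schemas with simple string similarity; the synonym table and threshold are illustrative assumptions, not NETL's planned algorithms.

```python
# Illustrative element-level schema matching via string similarity;
# synonym table and threshold are assumptions for the example.
import difflib

SYNONYMS = {"lat": "latitude", "lon": "longitude", "long": "longitude"}

def normalize(name):
    """Lowercase, de-underscore, and expand known abbreviations."""
    name = name.lower().strip().replace("_", " ")
    return SYNONYMS.get(name, name)

def match_schemas(cols_a, cols_b, threshold=0.7):
    """Return (col_a, col_b, score) pairs above the similarity threshold."""
    matches = []
    for a in cols_a:
        for b in cols_b:
            score = difflib.SequenceMatcher(
                None, normalize(a), normalize(b)).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return sorted(matches, key=lambda m: -m[2])

# Example: match_schemas(["WELL_ID", "Lat", "Lon"],
#                        ["well_identifier", "latitude", "longitude"])
```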

~ Thank you! ~

Michael Sabbatino, [email protected]

Page 9:


Questions?

Come check out this awesome poster!

