Starting “small” to go Big: Building a Living Database


  • Solutions for Today | Options for Tomorrow

    Jennifer Bauer, Jenny DiGiulio, Devin Justman, Lucy Romeo, Kelly Rose, Patrick Wingo

    Starting “small” to go Big: Building a Living Database

    Michael Sabbatino [1,2]; Baker, D.V. “Vic” [3,4]; Rose, K. [1]; Romeo, L. [1,2]; Bauer, J. [1]; and Barkhurst, A. [3,4]

    [1] US Department of Energy, National Energy Technology Laboratory, Albany, OR
    [2] AECOM, Albany, OR
    [3] Mid-Atlantic Technology Research & Innovation Center (MATRIC), Morgantown, WV
    [4] US Department of Energy, National Energy Technology Laboratory, Morgantown, WV

  • 2

    Challenges & Needs of Scientific Data

    Data Access
    • ~80% loss of published data after 20 years

    Data Discovery
    • 20% public data versus 80% private

    Data Interoperability
    • Variety of data makes it difficult to create, exchange, & use data across different applications and systems

    Data Analytics & Visualization
    • Requires advanced computational capabilities, algorithms, & large data stores to analyze these data

    80% Dark Data

    Working up the Data Science Hierarchy of Needs

    • Collect: instrumentation, logging, sensors, external data, user-generated content
    • Move & Store: reliable data flow, infrastructure, pipelines, structured and unstructured data storage
    • Explore & Transform: cleaning, anomaly detection, prep
    • Aggregate & Label: analytics, metrics, segments, aggregates, features, training data
    • Learn & Optimize: A/B testing, experimentation, simple ML algorithms
    • AI & Deep Learning

  • 3

    “small” Beginnings: Developing a Global Oil & Gas Database

    Discovered & integrated open data sources of information related to oil & gas infrastructure across the globe

    Machine Learning Automated Approach: a tool that scans “seed” resources and identifies relevant keywords, then crawls the web and parses the data for integration
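The seed-and-crawl idea can be illustrated with a minimal sketch. This is not the tool described on the slide: the stop-word list, the function names (`seed_keywords`, `relevance`), and the sample texts are all hypothetical, and simple term frequency stands in for whatever machine learning method the real tool uses. The sketch only shows how keywords taken from seed resources might be used to rank candidate pages before parsing them.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real tool would use a much fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "for", "to", "is", "are", "on", "with"}

def tokenize(text):
    """Lowercase the text and return alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def seed_keywords(seed_texts, top_n=5):
    """Return the top_n most frequent tokens across the seed resources."""
    counts = Counter()
    for text in seed_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(top_n)]

def relevance(page_text, keywords):
    """Fraction of the seed keywords that appear in a candidate page."""
    if not keywords:
        return 0.0
    tokens = set(tokenize(page_text))
    return sum(1 for k in keywords if k in tokens) / len(keywords)

# Made-up seed resources and candidate pages, for illustration only.
seeds = [
    "Pipeline locations and pipeline capacity for offshore platforms",
    "Offshore platforms, wells, and pipeline infrastructure",
]
keywords = seed_keywords(seeds, top_n=3)

candidates = {
    "page_a": "Map of offshore platforms and pipeline routes",
    "page_b": "Recipes for a quick weeknight dinner",
}
# Rank candidate pages so the most relevant ones are parsed first.
ranked = sorted(candidates, key=lambda name: relevance(candidates[name], keywords), reverse=True)
print(keywords, ranked)
```

A crawler built on this idea would fetch pages linked from the seeds, score each with `relevance`, and pass only high-scoring pages on to parsing and integration.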

    >700 datasets >4 million features

  • 4

    EDX - A Virtual Library & Laboratory for Energy Science

    • Virtualizing team analytics
    • Continuing innovations to connect researchers to online Earth-Energy system resources
    • Increasing number of tools & apps for use in team workspaces

    Move & Store

    Data Workflows & Structure

    • Custom “smart search” tool in development
    • Digital spatial team “notebook”
    • Auto-indexing algorithm that analyzes your search and helps recommend other items

    EDX vs Dark Data

    80% Dark Data

    EDX Smart Search - A machine learning, big data tool for rapid online, .zip, & FTP spatial & non-spatial data mining with Hadoop + Bing + ESRI
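As a rough illustration of the auto-indexing and recommendation idea, the sketch below builds an inverted index over item descriptions, answers a search, and then suggests other items that share terms with the results. It is an assumption-laden toy, not the EDX implementation: the catalog, the item ids, and the functions `build_index`, `search`, and `recommend` are hypothetical, and it does not use Hadoop, Bing, or ESRI.

```python
from collections import defaultdict

def build_index(catalog):
    """Map each lowercase term to the set of item ids whose description contains it."""
    index = defaultdict(set)
    for item_id, description in catalog.items():
        for term in description.lower().split():
            index[term].add(item_id)
    return index

def search(index, query):
    """Return the ids of items whose description contains every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    hits = [index.get(t, set()) for t in terms]
    return set.intersection(*hits)

def recommend(catalog, index, results):
    """Suggest items outside the results, ordered by how many terms they share with them."""
    scores = defaultdict(int)
    for item_id in results:
        for term in set(catalog[item_id].lower().split()):
            for other in index[term]:
                if other not in results:
                    scores[other] += 1
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))

# Made-up catalog entries, for illustration only.
catalog = {
    "ds1": "offshore pipeline map",
    "ds2": "offshore well locations",
    "ds3": "carbon storage sites",
    "ds4": "pipeline capacity table",
}
index = build_index(catalog)
results = search(index, "offshore pipeline")
suggestions = recommend(catalog, index, results)
print(results, suggestions)
```

Here the search matches only `ds1`, and the recommendation step surfaces `ds2` and `ds4` because each shares one term with it, while the unrelated `ds3` is left out.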

  • 5

    Explore & Transform

    The Living Database

    • Store & Share Data in a Structured Secure Database Environment
    • Reduce Redundant Acquisition
    • Direct Data Access (not file based storage)
    • Consistent Data with Staff Turnover
    • Enhance Collaboration
    • Curation of data and knowledge
    • Allows Direct Analysis from Database

    Storing Databases with different data types, formats, & resolutions

    Includes Data workflow, infrastructure, pipelines, structured & unstructured data

    Diagram labels: Data Lifecycle, Apps, Research, External Apps
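To make “direct data access” and “direct analysis from database” concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The slides do not say which database engine the Living Database uses, so SQLite is only a stand-in, and the `wells` table, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database as a stand-in for a shared, structured database server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE wells ("
    "id INTEGER PRIMARY KEY, basin TEXT NOT NULL, depth_m REAL NOT NULL)"
)

# Made-up records; in practice these would be loaded once and shared by the team.
rows = [("Basin A", 1200.0), ("Basin A", 1800.0), ("Basin B", 950.0)]
conn.executemany("INSERT INTO wells (basin, depth_m) VALUES (?, ?)", rows)
conn.commit()

# Analysis runs directly against the database instead of on exported files.
summary = conn.execute(
    "SELECT basin, COUNT(*), AVG(depth_m) FROM wells GROUP BY basin ORDER BY basin"
).fetchall()

for basin, count, mean_depth in summary:
    print(f"{basin}: {count} wells, mean depth {mean_depth:.0f} m")

conn.close()
```

Because every analyst queries the same tables, there is a single current copy of the data rather than diverging file exports, which is the point the bullets make about redundant acquisition and staff turnover.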

  • 6

    • Developing tools & approaches to manage multiple heterogeneous datasets
    • Develop a probabilistic approach to assess scientific data using big data analyses
    • Develop stochastic approaches to reduce uncertainty

    Improve joint analysis of multiple datasets, with a focus on advancing “Big Data” mining, machine learning, and advanced geoprocessing computing

    Aggregate & Label

    Tools, Analytics, & Metrics

    • Select relevant datasets
    • Combine data and tools to…
    • Highlight resultant data and analysis and reuse in further research
    • Evaluate correlations and spatio-temporal


  • 7

    EDX continues to evolve in response to the needs of its users and NETL’s knowledge transfer goals

    Future Big Data Development &

    Learn & Optimize

    • EDX Cloud Services
    • Living Database
    • Common Operating Platform for Data Analytics
    • GeoCube Spatial Data Viewer
    • Fuzzy Logic Analytics (SIMPA)
    • AWS Development
    • Integration with decades of DOE R&D
    • Federating Open Source data
    • EDX & GeoCube (search and location)
    • ID Data gaps in subsurface puzzle


    Oil & Gas, Geothermal Data & Resources

    Carbon Storage Data & Resources

    Millions of Records

    Millions of Records

    Millions of Records

    Employing “smart” search tools to include open resources

    Billions of Records

    *These attributes and data sources are evolving quickly with the implementation of new tools and the engagement of key stakeholders

  • 8

    Advanced computer science & research

    Developing Schema Matching AI

    • Variety of data sources with diverse data schemas

    • Manual schema matching is time consuming & inefficient

    • Plan to develop and use existing machine learning algorithms to match disparate data schemas:

        • Schema level
        • Element level
        • Structure level
        • Linguistic matching
        • Syntactic techniques
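Element-level linguistic matching can be sketched with nothing more than string similarity on column names. The example below is an illustration under stated assumptions, not the schema-matching AI planned in the slides: it uses `difflib.SequenceMatcher` from the Python standard library in place of a trained model, and the column names, the `match_schemas` function, and the 0.5 threshold are hypothetical.

```python
import re
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a column name and collapse separators into single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", name.lower()).strip()

def similarity(a, b):
    """String similarity between two normalized column names, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_schemas(source_cols, target_cols, threshold=0.5):
    """Pair each source column with its most similar target column.

    Columns whose best similarity falls below the (assumed) threshold are
    left unmatched so that a person can review them.
    """
    matches = {}
    for src in source_cols:
        best = max(target_cols, key=lambda tgt: similarity(src, tgt))
        if similarity(src, best) >= threshold:
            matches[src] = best
    return matches

# Made-up schemas from two hypothetical data sources.
source = ["Well_Name", "LAT", "Total-Depth"]
target = ["well name", "latitude", "total depth", "operator"]
matches = match_schemas(source, target)
print(matches)
```

A name-only matcher like this handles spelling and separator differences but not synonyms or structure, which is why the plan above also lists schema-level and structure-level matching.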

    ~ Thank you! ~

    Michael Sabbatino

  • 9


    Come check out this awesome poster!
