Shaun Gleason, Ph.D. Director Computational Sciences & Engineering Division [email protected] . 865-574-8521 ORAU 70th Annual Meeting Big Data Analytics March 4, 2015 Workforce Needs for Next Generation Big Data Analytics

Workforce Needs for Next Generation Big Data Analytics

Page 1: Workforce Needs for Next Generation Big Data Analytics

Shaun Gleason, Ph.D.

Director

Computational Sciences & Engineering Division [email protected]. 865-574-8521

ORAU 70th Annual Meeting

Big Data Analytics

March 4, 2015

Workforce Needs for Next Generation Big Data Analytics

Page 2: Workforce Needs for Next Generation Big Data Analytics


Outline

• What are Big Data and Data Science?
• Big Data & Computational Science at ORNL
• ORNL Institutes: Tackling multi-disciplinary, big data science challenges
• Big Data Analytics Workforce
  – Job statistics
  – The data scientist view
  – Gaps
  – Opportunities
• Conclusion

Page 4: Workforce Needs for Next Generation Big Data Analytics

Data Science Jobs

Page 5: Workforce Needs for Next Generation Big Data Analytics

5 Managed by UT-Battelle for the U.S. Department of Energy Business Sensitive_1202

What is data science?

Page 6: Workforce Needs for Next Generation Big Data Analytics


What is data science?

• Data science: the study of the generalizable extraction of knowledge from data

• Technical disciplines cover a broad range:

  – Signal & Image Processing
  – Mathematics
  – Data Mining
  – Machine Learning
  – Statistics
  – Computer Programming
  – Data Engineering
  – Pattern Recognition
  – Visualization
  – Uncertainty Modeling
  – Database Design
  – High-performance Computing

Page 7: Workforce Needs for Next Generation Big Data Analytics

Actions Mission Driven Outcomes

Sources - Real and Simulated

Applied data science as a workflow…

Acquiring New Data
• Multimodal streaming data
• Smart sensors
• Quantum noise reduction
• Quantum compressive sensing

Generating New Data
• Experimental data
• Modeling and simulation of physical systems
• Discrete event simulation (DES)

Processing Data
• Data mining
• Data fusion
• Disambiguation
• Dimensionality reduction
• Visualization
• Inverse problems
• Data analytics
• Machine learning
• Data-driven M&S (agent-based and DES)
• Advanced statistics
• HPC & reversible computing

Managing, Transmitting & Protecting Data
• Data Management
• Provenance
• Cyber-security
• Quantum comms

Existing Data
• Archived data
• Social media

Enabling and Applying Data Discoveries
• Knowledge discovery
• Power system control
• Reactor safety
• Energy assurance
• Population distribution and dynamics
• Transportation
• Disaster and emergency response
• Biomedical and healthcare applications
• National security
• Fraud and crime detection
• Behavioral sciences
• Climate science

Page 8: Workforce Needs for Next Generation Big Data Analytics


Computational Sciences and Engineering Division: Scientific Capabilities

• Computational Science & Engineering
• Big Data Analytics & Systems
• Geographic Information Science
• Cyber Sciences
• Biomedical & Health Data Science
• Large Scale Systems M&S
• Discrete Computation & HPC
• Quantum Information Science

Page 9: Workforce Needs for Next Generation Big Data Analytics

The Oak Ridge Leadership Computing Facility is one of the world’s most powerful computing resources

Titan
• Peak performance: 27 PF/s
• Memory: 710 TB
• Disk bandwidth: 240 GB/s
• Square feet: 5,000
• Power: 8.8 MW

Gaea
• Peak performance: 1.1 PF/s
• Memory: 240 TB
• Disk bandwidth: 104 GB/s
• Square feet: 1,600
• Power: 2.2 MW

Darter
• Peak performance: 241 TF/s
• Memory: 23 TB
• Disk capacity: 1.3 PB
• Square feet: 72
• Power: 0.5 MW

Data Storage
• Spider File System
  – 40 PB capacity
  – 1+ TB/s bandwidth
• HPSS Archive
  – 240 PB capacity
  – 6 Tape libraries

Data Analytics & Visualization
• LENS cluster
• Ewok cluster
• EVEREST visualization facility
• uRiKA data appliance

Networks
• ESnet – 100 Gbps
• Internet2 – 10 Gbps
• XSEDEnet – 10 Gbps
• Private dark fibre

Page 10: Workforce Needs for Next Generation Big Data Analytics


ORNL organizes institutes to solve grand challenge problems

• Nuclear Science and Engineering
• Global Security
• Energy and Environmental Sciences
• Neutron Sciences
• Physical Sciences
• Consortium for Advanced Simulation of LWRs
• Institute for Functional Imaging of Materials
• Urban Dynamics Institute
• Climate Change Science Institute
• BioEnergy Science Center
• Institute for Advanced Composites Manufacturing Innovation
• CNMS Nanomaterials Theory Institute
• Health Data Sciences Institute
• Computing and Computational Sciences

Page 11: Workforce Needs for Next Generation Big Data Analytics


Oak Ridge Urban Dynamics Institute

• Science and informatics for energy and urban infrastructures
  – Data from individual components (sensors) of infrastructure networks (energy, water, transportation, telecommunication,…)
  – Data from users of infrastructure (human network)
• Characterization of the interaction between human dynamics and integrated infrastructures
  – Discovering emerging behavior of urban systems over large spatial and temporal scales (at unprecedented resolution)
• Efficient data management, analysis, creation, and visualization of meaningful information within a useful timeframe
• Developing an interdisciplinary bridge between foundational R&D, operational communities, and industry

Population
• Distribution and dynamics
• Demographic change
• Citizen science

Mobility
• Connected vehicles
• Driver-assistance systems
• Safety

Energy
• Efficiency
• Pollution
• Sustainability

Resiliency
• Cyber security
• Communication
• Disaster management

Delivering transformational science and technology capabilities

Page 12: Workforce Needs for Next Generation Big Data Analytics

Health Data Sciences Institute
Advancing the Utility of Data to Achieve Better Health Outcomes at Lower Cost

History
• Formed in 2013 to integrate ORNL’s data-driven, data-intensive biomedical research programs.
• HDSI members include biomedical researchers, system architects, data scientists, computer scientists, IT services, HPC operation experts

Vision
• Accelerate data-driven biomedical discoveries and healthcare delivery advancement

Mission
• Develop innovative, scalable, and robust technologies for organizing, integrating, and analyzing complex data at scale

Priorities:
• Deliver methodological and applied scientific innovations, informatics tools, and computing infrastructure to enable effective use of data for individual and public benefit.
• Advance a broad range of sponsor and health policy priorities while serving as a neutral entity.
• Build health data science community capacity via a User Facility for collaborative engagement and targeted education and training.

Innovate

Incubate

Accelerate

Page 13: Workforce Needs for Next Generation Big Data Analytics

Climate Change Science Institute
Advancing the Knowledge of Climate Change and Understanding its Consequences

History
• Formed in 2009 to integrate ORNL’s climate research programs.
• 130 collocated computational scientists, modelers, ecosystem field researchers, and data experts.

Mission
• Advance the understanding of the Earth system
• Describe the consequences of climate change
• Evaluate and inform policy on climate change responses

Priorities: Creating the science, experiments, data, and community capacity needed to:
• Improve predictive capabilities of Earth system and biogeochemical models.
• Identify and understand how climate change impacts the resiliency of human and natural land-energy-water systems.
• Participate in national and international climate assessments and policy analysis.
• Develop useful climate adaptation and mitigation tools and information in collaboration with key stakeholders.

Page 14: Workforce Needs for Next Generation Big Data Analytics

Exploratory Data analysis ENvironment (EDEN)

Problem Statement:
• Determination of significant associations between interrelated climate simulation parameters and outputs is a challenge due to the rapid increases in data quantity, quality, and the number of different variables.
• Classical approaches restrict exploration.

Technical Approach:
• Integrate automated statistical analytics with interactive information visualization techniques to guide the analyst to significant associations.
• Exploratory analysis of large CLM4 ensembles in close collaboration with model researchers from ORNL, PNNL, and LANL.
• Dynamic visual queries provide “live” access to data behind the visualization.

Advantage over the State-of-the-Art:
• Interactive exploratory analysis provides intuitive dynamic visual queries to allow hypothesis generation and validation.
• Facilitates simultaneous analysis of large multi-dimensional data in a single 2-D display.
• Can reveal unexpected relationships and serendipitous discoveries.

Exploratory analysis of multiple land model simulations (CLM4)

1000 CLM4 simulations, 81 parameters, 7 output variables analyzed with EDEN in the OLCF EVEREST laboratory.
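
As a rough illustration of the "automated statistical analytics" idea on this slide, the sketch below ranks parameter/output pairs of a simulation ensemble by the strength of their linear correlation, so an analyst could look at the strongest associations first. This is not EDEN's implementation, which the slide does not describe. The Pearson statistic, the column names (`param_0`, `output_a`), and the synthetic data are all assumptions made for the example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no association
    return cov / math.sqrt(var_x * var_y)


def rank_associations(params, outputs):
    """Rank (parameter, output) pairs by absolute correlation, strongest first.

    params and outputs each map a column name to a list holding one value
    per simulation run.
    """
    scored = [
        (p_name, o_name, pearson(p_vals, o_vals))
        for p_name, p_vals in params.items()
        for o_name, o_vals in outputs.items()
    ]
    return sorted(scored, key=lambda row: abs(row[2]), reverse=True)


# Synthetic stand-in for an ensemble: 1000 runs, 3 hypothetical parameters.
rng = random.Random(0)
runs = 1000
params = {f"param_{i}": [rng.random() for _ in range(runs)] for i in range(3)}
# Hypothetical output driven mostly by param_1, plus a little noise.
outputs = {
    "output_a": [2.0 * v + 0.1 * rng.gauss(0, 1) for v in params["param_1"]],
}

ranking = rank_associations(params, outputs)
for p_name, o_name, r in ranking:
    print(f"{p_name} vs {o_name}: r = {r:+.3f}")
```

A linear correlation only catches linear relationships. In an interactive tool, a ranking like this would be a starting point for visual exploration, not a replacement for it.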

Page 15: Workforce Needs for Next Generation Big Data Analytics

Institute for Functional Imaging of Materials

Goal: guide the design of materials tailored for functionality via probing, understanding, and designing local structure-property relationships on atomic and nanometer level

Means:
• Linking theory and imaging on the level of microscopic degrees of freedom via data analytics
• Big, deep, and smart data in materials exploration and design
• Synergy and coordination between imaging disciplines

[Figure labels: Theory; Big data; Imaging; Static, Functional, Dynamic, Controlled; Unsupervised learning, Correlative learning, Image recognition, In-situ control; Ab Initio, Electronic Structure, Molecular Dynamics, Multiscale; New probes, New analysis, New control]

Page 16: Workforce Needs for Next Generation Big Data Analytics


IoT presents unique challenges & opportunities for data science

IoT Challenges              Necessary ML Capabilities
Multi-modal                 Data integration
Growing exponentially       Scalable
Evolving                    Adaptable
Streaming                   Real time
Distributed intelligence    High-level awareness
Unreliable                  Resilient
Uncertain                   Quantifiable uncertainty
Complex systems             Interconnected analytics
Emergent behavior           Multi-scale ML

Page 17: Workforce Needs for Next Generation Big Data Analytics


Internet of Things Opportunities

• Urban Systems
• Industrial Systems
• Transportation Systems
• Healthcare Systems
• Energy Systems
• Social Systems

Big Question: What are the national and global challenges that an IoT-based machine learning system can enable solutions for?

Page 18: Workforce Needs for Next Generation Big Data Analytics

An Internet-of-Things (IoT) Science Collaboration Laboratory (ISciCL)

Sensors and measurement

Embedded computing and systems

Communications, standards, networks

Data

Streaming Analytics & Machine Learning

Modeling & Simulation

DOD, DOE, IC, DOT, Industry, NIH

Healthcare, Manufacturing, Transportation, Environment, Energy

Knowledge (past, present, future)

Virtual Systems

Cyber security

Page 19: Workforce Needs for Next Generation Big Data Analytics

The data scientist’s view*… How do data scientists view themselves?

Many different tools are employed for data analytics

Big data “cleanup” is the most time-consuming and least satisfying aspect of the job

*Source: “Data Scientist Report,” Crowdflower. 2015.

Page 20: Workforce Needs for Next Generation Big Data Analytics

Gaps

• Shortage of data scientists AND “larger” shortage of data scientists with advanced degrees (only 22% think they need one).
• Lack of true “big data” trained students and staff.
• Limited number of graduates come with domain science experience.
• Streaming data analytics expertise is difficult to find.
• Machine learning at an HPC scale is a skill set that needs attention.

Page 21: Workforce Needs for Next Generation Big Data Analytics

Opportunities

• Create domain-inspired data science graduate programs
  – Traditional, core PhD programs with domain science focus
  – Domain-inspired “data science” PhD programs
• Establish better pipelines of big data analytics students from universities to ORNL
  – Access to HPC
  – Access to true big data sources
  – Access to domain science challenges and experts

Page 22: Workforce Needs for Next Generation Big Data Analytics

Conclusion

• There is a shortage of data science professionals with advanced degrees.

• The data can tell you anything: Big data challenges require trained research scientists.

• Interdisciplinary skills are beneficial for domain-science specific big data analytics.