Shaun Gleason, Ph.D.
Director
Computational Sciences & Engineering Division
865-574-8521
ORAU 70th Annual Meeting
Big Data Analytics
March 4, 2015
Workforce Needs for Next Generation Big Data Analytics
Outline
• What are Big Data and Data Science?
• Big Data & Computational Science at ORNL
• ORNL Institutes: Tackling multi-disciplinary, big data science challenges
• Big Data Analytics Workforce
  – Job statistics
  – The data scientist view
  – Gaps
  – Opportunities
• Conclusion
The Big Data Explosion
The emergence of the field of “Data Science”
Volume, Variety, Velocity, Veracity
Knowledge discovery within the data deluge
Data Science Jobs
5 Managed by UT-Battelle for the U.S. Department of Energy Business Sensitive_1202
What is data science?
• Data science: the study of the generalizable extraction of knowledge from data
• Technical disciplines cover a broad range:
  – Signal & Image Processing
  – Mathematics
  – Data Mining
  – Machine Learning
  – Statistics
  – Computer Programming
  – Data Engineering
  – Pattern Recognition
  – Visualization
  – Uncertainty Modeling
  – Database Design
  – High-performance Computing
Actions Mission Driven Outcomes
Sources - Real and Simulated
Applied data science as a workflow…
Acquiring New Data
• Multimodal streaming data
• Smart sensors
• Quantum noise reduction
• Quantum compressive sensing
Generating New Data
• Experimental data
• Modeling and simulation of physical systems
• Discrete event simulation (DES)
Processing Data
• Data mining
• Data fusion
• Disambiguation
• Dimensionality reduction
• Visualization
• Inverse problems
• Data analytics
• Machine learning
• Data-driven M&S (agent-based and DES)
• Advanced statistics
• HPC & reversible computing
Managing, Transmitting & Protecting Data
• Data Management
• Provenance
• Cyber-security
• Quantum comms
Existing Data
• Archived data
• Social media
Enabling and Applying Data Discoveries
• Knowledge discovery
• Power system control
• Reactor safety
• Energy assurance
• Population distribution and dynamics
• Transportation
• Disaster and emergency response
• Biomedical and healthcare applications
• National security
• Fraud and crime detection
• Behavioral sciences
• Climate science
Computational Sciences and Engineering Division: Scientific Capabilities
• Computational Science & Engineering
• Big Data Analytics & Systems
• Geographic Information Science
• Cyber Sciences
• Biomedical & Health Data Science
• Large Scale Systems M&S
• Discrete Computation & HPC
• Quantum Information Science
The Oak Ridge Leadership Computing Facility is one of the world’s most powerful computing resources
Titan
• Peak performance: 27 PF/s
• Memory: 710 TB
• Disk bandwidth: 240 GB/s
• Square feet: 5,000
• Power: 8.8 MW
Gaea
• Peak performance: 1.1 PF/s
• Memory: 240 TB
• Disk bandwidth: 104 GB/s
• Square feet: 1,600
• Power: 2.2 MW
Darter
• Peak performance: 241 TF/s
• Memory: 23 TB
• Disk capacity: 1.3 PB
• Square feet: 72
• Power: 0.5 MW
Data Storage
• Spider File System
  – 40 PB capacity
  – 1+ TB/s bandwidth
• HPSS Archive
  – 240 PB capacity
  – 6 tape libraries
Data Analytics & Visualization
• LENS cluster
• Ewok cluster
• EVEREST visualization facility
• uRiKA data appliance
Networks
• ESnet – 100 Gbps
• Internet2 – 10 Gbps
• XSEDEnet – 10 Gbps
• Private dark fibre
ORNL organizes institutes to solve grand challenge problems
• Nuclear Science and Engineering
• Global Security
• Energy and Environmental Sciences
• Neutron Sciences
• Physical Sciences
• Consortium for Advanced Simulation of LWRs
• Institute for Functional Imaging of Materials
• Urban Dynamics Institute
• Climate Change Science Institute
• BioEnergy Science Center
• Institute for Advanced Composites Manufacturing Innovation
• CNMS Nanomaterials Theory Institute
• Health Data Sciences Institute
• Computing and Computational Sciences
Oak Ridge Urban Dynamics Institute
• Science and informatics for energy and urban infrastructures
  – Data from individual components (sensors) of infrastructure networks (energy, water, transportation, telecommunication,…)
  – Data from users of infrastructure (human network)
• Characterization of the interaction between human dynamics and integrated infrastructures
  – Discovering emerging behavior of urban systems over large spatial and temporal scales (at unprecedented resolution)
• Efficient data management, analysis, creation, and visualization of meaningful information within a useful timeframe
• Developing an interdisciplinary bridge between foundational R&D, operational communities, and industry
Population
• Distribution and dynamics
• Demographic change
• Citizen science
Mobility
• Connected vehicles
• Driver-assistance systems
• Safety
Energy
• Efficiency
• Pollution
• Sustainability
Resiliency
• Cyber security
• Communication
• Disaster management
Delivering transformational science and technology capabilities
Health Data Sciences Institute
Advancing the Utility of Data to Achieve Better Health Outcomes at Lower Cost
History
• Formed in 2013 to integrate ORNL’s data-driven, data-intensive biomedical research programs.
• HDSI members include biomedical researchers, system architects, data scientists, computer scientists, IT services, HPC operation experts
Vision
• Accelerate data-driven biomedical discoveries and healthcare delivery advancement
Mission
• Develop innovative, scalable, and robust technologies for organizing, integrating, and analyzing complex data at scale
Priorities:
• Deliver methodological and applied scientific innovations, informatics tools, and computing infrastructure to enable effective use of data for individual and public benefit.
• Advance a broad range of sponsor and health policy priorities while serving as a neutral entity.
• Build health data science community capacity via a User Facility for collaborative engagement and targeted education and training.
Innovate
Incubate
Accelerate
Climate Change Science Institute
Advancing the Knowledge of Climate Change and Understanding its Consequences
History
• Formed in 2009 to integrate ORNL’s climate research programs.
• 130 collocated computational scientists, modelers, ecosystem field researchers, and data experts.
Mission
• Advance the understanding of the Earth system
• Describe the consequences of climate change
• Evaluate and inform policy on climate change responses
Priorities: Creating the science, experiments, data, and community capacity needed to:
• Improve predictive capabilities of Earth system and biogeochemical models.
• Identify and understand how climate change impacts the resiliency of human and natural land-energy-water systems.
• Participate in national and international climate assessments and policy analysis.
• Develop useful climate adaptation and mitigation tools and information in collaboration with key stakeholders.
Exploratory Data analysis ENvironment (EDEN)
Problem Statement:
• Determination of significant associations between interrelated climate simulation parameters and outputs is a challenge due to the rapid increases in data quantity, quality, and the number of different variables.
• Classical approaches restrict exploration.
Technical Approach:
• Integrate automated statistical analytics with interactive information visualization techniques to guide the analyst to significant associations.
• Exploratory analysis of large CLM4 ensembles in close collaboration with model researchers from ORNL, PNNL, and LANL.
• Dynamic visual queries provide “live” access to data behind the visualization.
Advantage over the State-of-the-Art:
• Interactive exploratory analysis provides intuitive dynamic visual queries to allow hypothesis generation and validation.
• Facilitates simultaneous analysis of large multi-dimensional data in a single 2-D display.
• Can reveal unexpected relationships and serendipitous discoveries.
Exploratory analysis of multiple land model simulations (CLM4)
1000 CLM4 simulations, 81 parameters, 7 output variables analyzed with EDEN in the OLCF EVEREST laboratory.
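The "automated statistical analytics" step in the technical approach can be illustrated with a small sketch. This is not EDEN's actual method, which the slides do not specify; it assumes a plain Pearson correlation as the association measure, and the function names (`pearson`, `rank_associations`, `make_synthetic_ensemble`) and the synthetic data are hypothetical. Only the ensemble shape (1000 runs, 81 parameters, 7 outputs) comes from the slide.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_associations(params, outputs, top_k=5):
    """Score every (parameter, output) pair by |r| and return the strongest top_k.

    params / outputs: dict mapping a name to a list with one value per simulation run.
    Returns a list of (abs_correlation, parameter_name, output_name) tuples.
    """
    scored = []
    for p_name, p_vals in params.items():
        for o_name, o_vals in outputs.items():
            scored.append((abs(pearson(p_vals, o_vals)), p_name, o_name))
    scored.sort(reverse=True)
    return scored[:top_k]


def make_synthetic_ensemble(n_runs=1000, n_params=81, n_outputs=7, seed=0):
    """Hypothetical stand-in for an ensemble: random data with one planted relationship."""
    rng = random.Random(seed)
    params = {
        f"param_{i}": [rng.random() for _ in range(n_runs)] for i in range(n_params)
    }
    outputs = {
        f"output_{j}": [rng.gauss(0, 1) for _ in range(n_runs)] for j in range(n_outputs)
    }
    # Planted association (an assumption for the demo): output_0 depends on param_3.
    outputs["output_0"] = [2.0 * p + rng.gauss(0, 0.1) for p in params["param_3"]]
    return params, outputs


if __name__ == "__main__":
    params, outputs = make_synthetic_ensemble()
    for score, p_name, o_name in rank_associations(params, outputs, top_k=3):
        print(f"{p_name} -> {o_name}: |r| = {score:.3f}")
```

A ranked list like this is the kind of cue an interactive display could use to point the analyst toward pairs worth inspecting; a linear measure such as Pearson's r would miss non-monotonic relationships, which is one reason to keep the analyst in the loop.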
Institute for Functional Imaging of Materials
Goal: Guide the design of materials tailored for functionality by probing, understanding, and designing local structure-property relationships at the atomic and nanometer level
Means:
• Linking theory and imaging on the level of microscopic degrees of freedom via data analytics
• Big, deep, and smart data in materials exploration and design
• Synergy and coordination between imaging disciplines
[Figure: diagram linking Theory, Big data, and Imaging. Labels: Static, Functional, Dynamic, Controlled; Unsupervised learning, Correlative learning, Image recognition, In-situ control; Electronic Structure, Molecular Dynamics, Multiscale, Ab Initio; New probes, New analysis, New control]
IoT presents unique challenges & opportunities for data science
IoT Challenge: Necessary ML Capability
• Multi-modal: Data integration
• Growing exponentially: Scalable
• Evolving: Adaptable
• Streaming: Real time
• Distributed intelligence: High-level awareness
• Unreliable: Resilient
• Uncertain: Quantifiable uncertainty
• Complex systems: Interconnected analytics
• Emergent behavior: Multi-scale ML
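Two of these pairings, streaming data needing real-time processing and uncertain data needing quantifiable uncertainty, can be made concrete with a one-pass running estimate. The sketch below uses Welford's online algorithm for mean and variance; the class name `RunningStats` and the sample readings are hypothetical and are not taken from the presentation.

```python
import math


class RunningStats:
    """One-pass (streaming) mean and variance using Welford's online algorithm."""

    def __init__(self):
        self.n = 0          # number of readings seen so far
        self.mean = 0.0     # running mean
        self._m2 = 0.0      # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new reading into the running statistics in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two readings have arrived."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    @property
    def std(self):
        return math.sqrt(self.variance)


if __name__ == "__main__":
    stats = RunningStats()
    # Hypothetical sensor readings arriving one at a time.
    for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(reading)
        print(f"n={stats.n} mean={stats.mean:.3f} std={stats.std:.3f}")
```

Because each update touches only three numbers, the same estimator could run on an embedded device, with the running standard deviation serving as a simple uncertainty measure attached to each reported mean.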
Internet of Things Opportunities
• Urban Systems
• Industrial Systems
• Transportation Systems
• Healthcare Systems
• Energy Systems
• Social Systems
Big Question: Which national and global challenges can an IoT-based machine learning system help solve?
An Internet-of-Things (IoT) Science Collaboration Laboratory (ISciCL)
[Figure: ISciCL diagram. Labels: Sensors and measurement; Embedded computing and systems; Communications, standards, networks; Data; Streaming Analytics & Machine Learning; Modeling & Simulation; Virtual Systems; Cyber security; Knowledge (past, present, future). Sponsors and sectors: DOD, DOE, IC, DOT, Industry, NIH; Healthcare, Manufacturing, Transportation, Environment, Energy]
The data scientist’s view*…
• How do data scientists view themselves?
• Many different tools are employed for data analytics
• Big data “cleanup” is the most time-consuming and least satisfying aspect of the job
*Source: “Data Scientist Report,” Crowdflower. 2015.
Gaps
• Shortage of data scientists AND “larger” shortage of data scientists with advanced degrees (only 22% think they need one).
• Lack of true “big data” trained students and staff.
• Limited number of graduates come with domain science experience.
• Streaming data analytics expertise is difficult to find.
• Machine learning at an HPC scale is a skill set that needs attention.
Opportunities
• Create domain-inspired data science graduate programs
  – Traditional, core PhD programs with domain science focus
  – Domain-inspired “data science” PhD programs
• Establish better pipelines of big data analytics students from universities to ORNL
  – Access to HPC
  – Access to true big data sources
  – Access to domain science challenges and experts
Conclusion
• There is a shortage of data science professionals with advanced degrees.
• The data can tell you anything: Big data challenges require trained research scientists.
• Interdisciplinary skills are beneficial for domain-science specific big data analytics.