WorDS of Data Science in the Presence of Heterogeneous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas, Founder and Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data"
1985 → today
Scientific Workflow Automation Technologies:
• Research
• Workflows for Cloud Systems
• Big Data Applications
• Reproducible Science
• Workforce Training and Education
• Development and Consulting Services
Workflows for Data Science Center
Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.
So, what is a workflow?
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
Let's make pasta this evening!
Workflow: Shop → Prepare → Cook → Store (step times of 30, 15, and 3 minutes shown on the diagram)
How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management, the essential principle of fast cooking, is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read, and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
What if you have more than one cook?
MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie
REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine based on veggie type as key
• Output: a bowl of veggies per veggie kind (a pure-Python sketch of this pattern follows)
Thanksgiving dinner preparation: more planning and tasks?
Menu Item       | Preparation Time | Cooking Time | Cooling Time
Turkey          | 30 minutes       | 4 hours      | 15 minutes
Veggies         | 30 minutes       | 45 minutes   | None
Cranberry Sauce | 5 minutes        | 30 minutes   | 2 hours
Soup            | 20 minutes       | 30 minutes   | None
Pie             | 30 minutes       | 5 minutes    | 1 day
• When do you start cooking?
• In what order do you cook?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
A small scheduling sketch below works through the first three questions.
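One way to reason about this is to work backwards from dinner time. A minimal sketch, assuming every dish can be worked on in parallel (enough cooks and ovens) and using the times from the menu table; "1 day" of cooling is taken as 1440 minutes.

```python
# Times from the menu table, in minutes.
menu = {
    "Turkey":          {"prep": 30, "cook": 240, "cool": 15},
    "Veggies":         {"prep": 30, "cook": 45,  "cool": 0},
    "Cranberry Sauce": {"prep": 5,  "cook": 30,  "cool": 120},
    "Soup":            {"prep": 20, "cook": 30,  "cool": 0},
    "Pie":             {"prep": 30, "cook": 5,   "cool": 1440},
}

# Work backwards from dinner: each dish must start (prep + cook + cool)
# minutes before serving; the longest chain decides when cooking begins.
for item, t in sorted(menu.items(), key=lambda kv: -sum(kv[1].values())):
    lead = sum(t.values())
    print(f"{item}: start {lead // 60}h {lead % 60:02d}m before dinner")
```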
Data Science Workflows - Programmable, Reusable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org | WorDS.sdsc.edu
Why scalable and reproducible data science?
The Big Picture is Supporting the Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF sketch becomes an executable SWF.
(Example workflow: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS)
The Big Picture is Supporting the Data Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF sketch becomes an executable SWF.
(SBNL workflow: Big Data → Quality Evaluation & Data Partitioning → Local Learner (Data Quality Evaluation, Local Ensemble Learning) → Master Learner (Master Ensemble Learning) → Final BN Structure)
Insurance and traffic data analytics using big data Bayesian network learning.
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for scientific workflows
KEPLER = "Ptolemy II + X" for scientific workflows
Kepler is a Scientific Workflow System
• A cross-project collaboration… initiated August 2003
• Version 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
A Toolbox with Many Tools
Needs expertise to identify which tool to use, when, and how! Requires computation models to schedule and optimize execution!
• Data: search, database access, I/O operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
So, how does this relate to data science, big data and supercomputing?
Distributed Computing
Computing using more than one computer connected through a network.
• Types of distributed computing:
  – Computers in a local area network
  – Cluster or high-performance computing
  – Grid
  – Cloud
Cluster or High-Performance Computing
• Built from multiple computers
• May have a parallel file system and a high-speed network
• Provides a scheduler to manage the machines and submitted jobs: SGE/OGE, PBS, Condor, LSF, SLURM (a minimal job-submission sketch follows)
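As a small illustration, here is one way a job might be handed to one of these schedulers, SLURM in this case, from Python. It assumes the sbatch command is on PATH; the job script, its resource requests, and the file name are hypothetical.

```python
import subprocess

# A hypothetical batch script: request 4 tasks for 5 minutes and
# run one command on the allocated nodes.
job_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
srun hostname
"""

with open("job.sh", "w") as f:
    f.write(job_script)

# sbatch queues the job and prints its id; the scheduler decides
# when and where the four tasks actually run.
result = subprocess.run(["sbatch", "job.sh"], capture_output=True, text=True)
print(result.stdout.strip())
```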
Parallelization
Multiple processes or threads running at the same time.
• Execution environments: one machine, or distributed machines
• Parallelism types:
  – Computation/task parallelism
  – Data parallelism
  – Pipeline parallelism
(Diagrams: three styles of parallel execution.)
Task Parallelism: independent tasks run concurrently, e.g., Task 1 finished while Tasks 2 and 4 are running and Tasks 3 and 5 are waiting.
Data Parallelism: the same tasks run in parallel over different partitions of the input data set.
Pipeline Parallelism: consecutive stages (Task 1, Task 2, Task 3) all run at the same time, each on a different piece of the input data set.
There are different styles of parallelism!
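A minimal sketch of the three styles using plain Python threads; the task functions are placeholders for real workflow steps, and a workflow system would schedule these across distributed machines rather than one thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

def task_a(x): return x + 1   # stand-in workflow step
def task_b(x): return x * 2   # another independent step

with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: independent tasks run at the same time.
    fa, fb = pool.submit(task_a, 10), pool.submit(task_b, 10)
    print(fa.result(), fb.result())

    # Data parallelism: the same task runs over partitions of the input.
    print(list(pool.map(task_a, data)))

    # Pipeline parallelism: task_b consumes task_a's outputs as they
    # stream out, so both stages are active at once.
    stage1 = pool.map(task_a, data)        # yields results as they finish
    print(list(pool.map(task_b, stage1)))
```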
Big Data: Short Definition
• Some features ("V's") of big data:
  – Volume: amount of data
  – Velocity: speed of data in and out
  – Variety: range of data types and sources
  – Veracity: trustworthiness of data
Picture credit: IBM 2012
Distributed Data-Parallel Computing
MapReduce: move the program to the data!
• A parallel and scalable programming model for Big Data
  – Input data is automatically partitioned onto multiple nodes
  – Programs are distributed and executed in parallel on the partitioned data blocks
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
Distributed Data-Parallel (DDP) Patterns
Patterns for data distribution and parallel data processing.
• A higher-level programming model:
  – Moving computation to data
  – Good scalability and performance acceleration
  – Run-time features such as fault tolerance
  – Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
Hadoop
• Open-source implementation of MapReduce
• A distributed file system across compute nodes (HDFS):
  – Automatic data partitioning
  – Automatic data replication
• Master and workers/slaves architecture
• Automatic task re-execution for failed tasks (a word-count sketch follows)
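A classic word-count sketch in the Hadoop Streaming style, where Hadoop pipes each data block's lines into a mapper and the sorted key/value pairs into a reducer over stdin/stdout. In a real job the two functions would live in separate scripts passed to the hadoop-streaming jar; the structure here is illustrative.

```python
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word in the input block.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so per-word counts can be summed
    # in one streaming pass without holding everything in memory.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")
```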
Spark
• Fast big data engine: keeps data in memory as much as possible
• Resilient Distributed Datasets (RDDs):
  – Evaluated lazily
  – Keep track of lineage for fault tolerance
• More operators than just Map and Reduce
• Can run on YARN (Hadoop v2) (a minimal RDD sketch follows)
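A minimal PySpark sketch of the two RDD points above, lazy evaluation and in-memory reuse; it assumes a local Spark installation and a hypothetical input.txt.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

lines = sc.textFile("input.txt")                   # lazy: nothing read yet
words = lines.flatMap(lambda line: line.split())   # lazy transformation
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()            # keep the result in memory once computed
print(counts.take(5))     # first action triggers the whole lineage
print(counts.count())     # second action reuses the cached RDD

sc.stop()
```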
Getting Value out of All This
My favorite definition of Data Science
"By 'Data Science', we mean almost everything that has something to do with data: collecting, analyzing, modeling... yet the most important part is its applications -- all sorts of applications." Journal of Data Science (http://www.jds-online.com/about)
Implies: programming, data analysis, and problem solving.
Some P’s of Data Science
People
Process
Platforms
Purpose
Programmability
There are more: provenance, publication, product, performance, policy, profit, ...
People…
People
Data Scientist Skill Set: http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
Unicorn? http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
Solution: Scale the Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
Some P’s of Data Science
Process
Defining a Typical Data Science Process
• Find data, access data, acquire data, move data
• Pre-process data: clean data, integrate data, subset data
• Analyze data, process data
• Post-process results: interpret results, summarize results, visualize results
Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data, and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
The goal: configurable, automated analysis.
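A minimal sketch of what "configurable, automated analysis" can look like: the stages above chained as parameterized steps. The CSV path, column names, and config values are hypothetical placeholders.

```python
import pandas as pd

config = {
    "source": "data.csv",     # where and how do I get the data?
    "subset": "value > 0",    # how do I subset the dataset?
    "group_by": "region",     # parameter customizing the analysis
}

def acquire(cfg):
    return pd.read_csv(cfg["source"])           # find/access/acquire/move

def preprocess(df, cfg):
    return df.dropna().query(cfg["subset"])     # clean/integrate/subset

def analyze(df, cfg):
    return df.groupby(cfg["group_by"]).mean()   # analyze/process

def postprocess(result, cfg):
    print(result)                               # interpret/summarize/visualize

postprocess(analyze(preprocess(acquire(config), config), config), config)
```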
Some P’s of Data Science
People
Process
Purpose
Purpose…
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
Use cases => purpose and value
Need toolboxes with many tools for:
• data access
• analysis
• scalable execution
• fault tolerance
• provenance tracking
• reporting
• ...
Business Analysis
Operations Research
Adapted from: B. Tierney, 2013
Integration of Many Tools to Serve a Purpose
Many Alternatives
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages
(Cycle: Build → Explore → Scale → Report)
Build Once, Run Many Times…
• The data science process should support experimental work and dynamic scalability on many platforms
• Scalability based on:
  – data volume and velocity
  – dynamic modeling needs
  – highly-optimized HPC codes
  – changes in network, storage and computing availability
Scalability across platforms…
People
Process
Platforms
Purpose
Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
• Local cluster resources (Gordon, Trestles)
• NSF/DOE TeraScale resources (XSEDE): Gordon, Comet, Stampede, Lonestar
• Private cluster: user-owned resources
Different executables have different computing architecture needs, e.g., memory-intensive, compute-intensive, I/O-intensive!
Challenges for Heterogeneous Computing
• Dynamic scheduling optimization needed, based on:
  – network availability
  – data transfer and locality
  – energy efficiency
  – availability of exascale memory hierarchies
  – workload changes
• Better programmable communication between workflow systems and the infrastructure for computing, storage and network
Programmability for scalability, reusability and reproducibility
People
Process
Platforms
Purpose
Programmability
Using Big Data Computing in Bioinformatics - Improving Programmability, Scalability and Reproducibility -
biokepler.org
(Architecture stack: gateways and other user environments → bioKepler → Kepler and provenance framework → BioLinux, Galaxy, CloVR, Hadoop, … → cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE)
www.bioKepler.org
A coordinated ecosystem of biological and technological packages for bioinformatics!
The same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
Flexible programming of K-means (a sketch follows)
• R: programming language and software environment for statistical computing and graphics.
• KNIME: platform for data analytics.
• MLlib: scalable machine learning library running on the Spark cluster computing framework.
• Mahout: scalable machine learning library based on MapReduce.
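As one concrete instance of this flexibility, here is a minimal k-means sketch against the MLlib option, using the pyspark.mllib API on hypothetical 2-D points; in a workflow the same step could be swapped for the R, KNIME, or Mahout implementation.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "kmeans-sketch")

# Hypothetical 2-D points; a real workflow would read these from HDFS.
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # learned centers, one per cluster
print(model.predict([0.5, 0.5]))   # cluster assignment for a new point

sc.stop()
```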
Scalable Bayesian Network Learning: SBNL workflow
(Workflow: Big Data → Quality Evaluation & Data Partitioning → Local Learner (Data Quality Evaluation, Local Ensemble Learning) → Master Learner (Master Ensemble Learning) → Final BN Structure)
Kepler Workflow
BN Workflow
• Top-level workflow:
  – PartitionData: RExpression actor that contains the R script for the data partitioning step
  – DDPNetworkLearner: composite actor using MapReduce to perform parallel ensemble learning
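To make the shape of DDPNetworkLearner concrete, here is a heavily simplified map/reduce sketch of partition-wise ensemble learning; the local learner and the combine step are hypothetical stand-ins, not the actual SBNL algorithm.

```python
from functools import reduce

# Partitioned big data, as produced by a PartitionData-like step.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def local_learn(part):
    # MAP: one learner per partition returns a placeholder
    # "local BN structure" summary.
    return {"edges": len(part)}

def combine(a, b):
    # REDUCE: the master ensemble step merges local results.
    return {"edges": a["edges"] + b["edges"]}

local_models = map(local_learn, partitions)      # parallel in a real DDP run
final_structure = reduce(combine, local_models)  # master learner
print(final_structure)
```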
WorDS – Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, YARN, Cascading
• Technology briefings and applied classes on end-to-end support for data science
Using Workflows and Cyberinfrastructure for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire.
What is lacking in disaster management today is a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers… before, during and after a firestorm.
http://nbcr.ucsd.edu/
Integrated Multi-Scale Biomedical Modeling Workflows in NBCR
(Diagram: a molecular dynamics CADD workflow built on the Amber Molecular Dynamics Package, with a local execution option, a user MD-parameter configuration option, and a GPU or Gordon execution option; it runs on local NBCR cluster resources, NSF/DOE TeraScale resources (XSEDE: Stampede, Comet), and NBCR- and user-owned cloud resources.)
BENEFITS:
• Enable users to configure MD job parameters through a command-line, GUI or web interface
• Scale for multiple compounds in parallel
• Run on multiple computing platforms
• Increase reuse
• Provenance
http://hpc.pnl.gov/IPPD/
Predicting Workflow Performance from Provenance
https://smartmanufacturingcoalition.org/
Workflows-as-a-Service
To Sum Up
• Workflows and provenance are well-adopted in scientific infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  – Many diverse environments and requirements
  – Need to orchestrate at a higher level
  – Higher-level programming components for each domain
• Lots of future challenges:
  – Optimized execution on heterogeneous platforms
  – Programmable interface to workload, storage and network needed
  – Increasing reuse within and across application domains
  – Querying and integration of workflow provenance data into performance prediction
Questions?
WorDS Director: Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu