Bridging Big Data and Data Science using Scalable Workflows
WorDS.sdsc.edu
ILKAY ALTINTAS, Ph.D. [email protected]
Director, Workflows for Data Science (WorDS) Center of Excellence San Diego Supercomputer Center, UC San Diego
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data"
Workflows for Data Science Center
• Scientific Workflow Automation Technologies Research
• Workflows for Cloud Systems
• Big Data Applications
• Reproducible Science
• Workforce Training and Education
• Development and Consulting Services
Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.
Why Data Science Workflows?
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
use cases => purpose and value
So, what is a workflow?
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
Let's make pasta this evening! Shop → Prepare → Cook → Store, with a duration attached to each step (e.g., 30 minutes, 15 minutes, 3 minutes).
How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management— the essential principle of fast cooking— is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read—and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
What if you have more than one cook?
MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie

REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine, using veggie type as key
• Output: a bowl of veggies per veggie kind
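The kitchen analogy above maps directly onto the two MapReduce phases. Here is a minimal sketch in plain Python; the specific veggie names and the "chopped" marker are illustrative, not from the talk.

```python
from collections import defaultdict

def map_chop(veggie):
    """UDF for the MAP phase: chop one veggie, keyed by its kind."""
    return (veggie, f"chopped {veggie}")

def reduce_combine(key, values):
    """UDF for the REDUCE phase: combine chopped batches of one kind."""
    return (key, list(values))

veggies = ["carrot", "onion", "carrot", "celery", "onion"]

# MAP: each cook chops independently (this loop could run in parallel).
mapped = [map_chop(v) for v in veggies]

# SHUFFLE: group intermediate results by key (veggie kind).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# REDUCE: one bowl per veggie kind.
bowls = dict(reduce_combine(k, vs) for k, vs in groups.items())
print(bowls)
# e.g., {'carrot': ['chopped carrot', 'chopped carrot'], ...}
```

The point of the analogy: the framework owns the shuffle, while the user only supplies the two UDFs, exactly as in Hadoop-style systems.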
Thanksgiving dinner preparation: more planning and tasks?

Menu Item        Preparation Time  Cooking Time  Cooling Time
Turkey           30 minutes        4 hours       15 minutes
Veggies          30 minutes        45 minutes    None
Cranberry Sauce  5 minutes         30 minutes    2 hours
Soup             20 minutes        30 minutes    None
Pie              30 minutes        5 minutes     1 day

• When do you start cooking?
• What order do you cook in?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
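The "when do I start?" question above is a small scheduling problem: work backward from a common serving time. A minimal sketch, using the durations from the table converted to minutes; the 18:00 serving time is an assumption for illustration.

```python
durations = {  # (prep, cook, cool) in minutes, from the table above
    "Turkey":          (30, 240, 15),
    "Veggies":         (30,  45,  0),
    "Cranberry Sauce": ( 5,  30, 120),
    "Soup":            (20,  30,  0),
    "Pie":             (30,   5, 1440),
}

serve_at = 18 * 60  # assumed serving time: 18:00, in minutes since midnight

def start_time(dish):
    """Latest start so the dish is ready exactly at serving time."""
    return serve_at - sum(durations[dish])

# Dishes run in parallel if you have enough cooks; each just needs
# its own latest start. A negative value means "start the day before".
for dish in sorted(durations, key=start_time):
    t = start_time(dish)
    day = "today" if t >= 0 else "the day before"
    print(f"{dish}: start at {t % 1440 // 60:02d}:{t % 1440 % 60:02d} {day}")
```

This is the simplest version; a full answer would also model shared resources (one oven, limited cooks), which is exactly what workflow schedulers do.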
Data Science Workflows - Programmable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org | WorDS.sdsc.edu
Why scalable and reproducible data science?
The Big Picture is Supporting the Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF becomes an executable SWF.
Example pipeline: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
The Big Picture is Supporting the Data Scientist
From "Napkin Drawings" to Executable Workflows: the SBNL workflow, again from conceptual SWF to executable SWF.
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning. Components: Big Data, Quality Evaluation & Data Partitioning, Data Quality Evaluation, Local Learner, Local Ensemble Learning, Master Learner, Master Ensemble Learning, Final BN Structure.
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflows. KEPLER = "Ptolemy II + X" for Scientific Workflows.
Kepler is a Scientific Workflow System
• A cross-project collaboration, initiated August 2003
• 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
A Toolbox with Many Tools
• Data: search, database access, IO operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
Need expertise to identify which tool to use when and how! Requires computation models to schedule and optimize execution!
So, how does this relate to data science and big data?
Toolboxes with many tools for:
• data access
• analysis
• execution
• fault tolerance
• provenance tracking
• reporting
• …
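One of the toolbox capabilities listed above, provenance tracking, can be illustrated with a small sketch: a decorator that records each step's inputs, outputs, and timing so a run can be audited or replayed. The log format and the two sample steps are illustrative, not Kepler's actual provenance API.

```python
import functools
import time

PROVENANCE = []  # append-only run log

def tracked(step):
    """Wrap a workflow step so every invocation is recorded."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        t0 = time.time()
        result = step(*args, **kwargs)
        PROVENANCE.append({
            "step": step.__name__,
            "inputs": (args, kwargs),
            "output": result,
            "seconds": time.time() - t0,
        })
        return result
    return wrapper

@tracked
def subset(data, threshold):
    # Keep only values above the threshold.
    return [x for x in data if x > threshold]

@tracked
def mean(data):
    return sum(data) / len(data)

result = mean(subset([1, 5, 9, 3], threshold=2))
print(result)
print([p["step"] for p in PROVENANCE])  # ['subset', 'mean']
```

Real workflow systems persist this kind of record to a database, which is what makes the reuse and reproducibility claims later in the talk concrete.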
Business Analysis | Operations Research (Adapted from: B. Tierney, 2013)
Workflows integrate data science building blocks!
Data Scientist Skill Set: http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
Unicorn?
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
Solution: Scale Your Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
Conceptualizing a Computational Data Science Workflow
1: Start with the Workflow as a Blackbox
• Treat the whole workflow as a blackbox
– What is the use case/application?
• What is the science question this workflow is solving?
– What is the input data?
– What are the expected outcomes?
Input data → My workflow → Outputs
• Give the workflow a title based on the initial assessment!
2: Conceptualization of Scientific Steps
Example pipeline: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
Bake Pie: … → Cook → Chill → …
Bake Turkey: … → Prepare → Cook → …
Make Side Dishes: … → Make Cranberry Sauce → Cut Veggies → Prepare Stuffing → …
3: Treat Each Step Like a Workflow - until you reach an atomic functional step -
• Acquire data: find data, access data, move data
• Pre-process data: clean data, integrate data, subset data
• Process data: analyze data
• Post-process results: interpret results, summarize results, visualize results
(In the cooking analogy: SHOP, PREPARE, COOK, STORE.)

Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?

configurable automated analysis
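The decomposition in step 3 can be sketched as a configurable pipeline: each phase is a function, and the parameters that customize each step live in a config object rather than in the code. The phase names follow the lists above; the function bodies and config keys are placeholders, not a real analysis.

```python
def acquire(config):
    # Find / access / move data (here: a stand-in list of records).
    return [{"value": v} for v in config["raw_values"]]

def preprocess(records, config):
    # Clean / integrate / subset: drop records outside the configured range.
    lo, hi = config["valid_range"]
    return [r for r in records if lo <= r["value"] <= hi]

def analyze(records, config):
    # Process: apply the configured analysis function to each record.
    fn = config["analysis_fn"]
    return [fn(r["value"]) for r in records]

def postprocess(results, config):
    # Interpret / summarize: produce a report for the next step.
    return {"n": len(results), "mean": sum(results) / len(results)}

config = {
    "raw_values": [2, 15, 7, 99, 4],
    "valid_range": (0, 20),
    "analysis_fn": lambda x: x * 2,
}

# The workflow itself: each step's output feeds the next, and every
# step is customized through parameters rather than code changes.
report = postprocess(analyze(preprocess(acquire(config), config), config), config)
print(report)  # {'n': 4, 'mean': 14.0}
```

Swapping a data source, a filter range, or an analysis function then only touches the config, which is the "configurable automated analysis" idea in miniature.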
4: Start Building Each Step, Including the Alternatives
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages
Build → Explore → Scale → Report
Running on Heterogeneous Computing Resources - Execution of models where they run most efficiently -
• Local: NBCR Cluster Resources
• NSF/DOE: TeraScale Resources (XSEDE): Gordon, Trestles, Stampede, Lonestar
• Private Cluster: User Owned Resources
Different models have different computing architecture needs! e.g., memory-intensive, compute-intensive, I/O-intensive
5: Save and Share Reports and Final Products with Your Team
• The data scientist is in the middle, bridging the gap between business and development. So, the data scientist defines the business value and the steps to achieve the results as a workflow
• Developers/computer scientists use their favorite tools to implement the methods in the workflow
• The process is kept sharable, standardized, scalable and accountable
WorDS - Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, Yarn, Cascading
• Technology briefings and applied classes on end-to-end support for data science
Using Big Data Computing in Bioinformatics - Improving Programmability, Scalability and Reproducibility - biokepler.org
• Gateways and other user environments
• bioKepler, Kepler and Provenance Framework
• BioLinux, Galaxy, Clovr, Hadoop, …
• CLOUD AND OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
A coordinated ecosystem of biological and technological packages for bioinformatics!
Status of bioActors: 500+ bioActors are listed under the current bioKepler release; ~40 of them are parallelized.
Using Workflows and Cyberinfrastructure for Wildfire Resilience - A Scalable Data-Driven Monitoring and Dynamic Prediction Approach - wifire.ucsd.edu
Fire is Part of the Natural Ecology … but requires Monitoring, Prediction and Resilience
• Wildfires are critical for ecology, but volatile
• Fuel load is high due to fire suppression over the last century
• Changes in rainfall, wind, seasons, and thus wildfires, potentially induced by climate change
• Better prevention, prediction and maintenance of wildfires is needed
Photo of Harris Fire (2007) by former Fire Captain Bill Clayton
Disaster management of (ongoing) wildfires heavily relies on understanding their Direction and Rate of Spread (RoS).
Decision making for wildfire fighting and disaster management is based on heterogeneous data: satellite data, wildfire perimeter, wind, vegetation, terrain.
Photograph by Mark Thiessen
Fire Data Today
What is lacking in disaster management today is a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers … before, during and after a firestorm.
A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire
Data to Modeling in WIFIRE: real-time remote sensor data → modeling, data assimilation and dynamic wildfire behavior prediction
System integration of sensor data, data assimilation, dynamic models and fire direction and RoS predictions (computations) is based on Scientific and Engineering Workflows (Kepler):
• Visual programming
• Scalable parallel execution
• Standardized data interfaces
• Reuse and reproducibility
WIFIRE System Integration
Training and Consulting Services in the WorDS Center
• Ongoing programs for workflow bootcamps and hackathons
• Technology briefings for industrial partners
• Industry labs for undergraduate student researchers
• Consulting projects on workflow technologies
To Sum Up
• Workflows and provenance are well-adopted in scientific data science infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  • Many diverse environments and requirements
  • Need to orchestrate at a higher level
  • Higher-level programming components for each domain
• Lots of future challenges on:
  • Optimized execution on heterogeneous platforms
  • Increasing reuse within and across application domains
  • Querying and integration of workflow provenance data
Questions?
WorDS Director: Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu