Bridging Big Data and Data Science using Scalable Workflows
WorDS.sdsc.edu
ILKAY ALTINTAS, Ph.D. [email protected]
Director, Workflows for Data Science (WorDS) Center of Excellence San Diego Supercomputer Center, UC San Diego
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data"
Workflows for Data Science Center
• Scientific Workflow Automation Technologies Research
• Workflows for Cloud Systems
• Big Data Applications
• Reproducible Science
• Workforce Training and Education
• Development and Consulting Services
Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.
Why Data Science Workflows?
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
use cases => purpose and value
So, what is a workflow?
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
Let's make pasta this evening! Shop → Prepare → Cook → Store, with a duration attached to each step (e.g., 30 minutes, 15 minutes, 3 minutes).
How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management— the essential principle of fast cooking— is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read—and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
What if you have more than one cook?
MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie

REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine, using veggie type as key
• Output: a bowl of veggies per veggie kind
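The kitchen analogy above maps directly onto the two MapReduce phases. Here is a minimal sketch in plain Python; the specific veggie names and the "chopped" marker are illustrative, not from the talk.

```python
from collections import defaultdict

def map_chop(veggie):
    """UDF for the MAP phase: chop one veggie, keyed by its kind."""
    return (veggie, f"chopped {veggie}")

def reduce_combine(key, values):
    """UDF for the REDUCE phase: combine chopped batches of one kind."""
    return (key, list(values))

veggies = ["carrot", "onion", "carrot", "celery", "onion"]

# MAP: each cook chops independently (this loop could run in parallel).
mapped = [map_chop(v) for v in veggies]

# SHUFFLE: group intermediate results by key (veggie kind).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# REDUCE: one bowl per veggie kind.
bowls = dict(reduce_combine(k, vs) for k, vs in groups.items())
print(bowls)
# e.g., {'carrot': ['chopped carrot', 'chopped carrot'], ...}
```

The point of the analogy: the framework owns the shuffle, while the user only supplies the two UDFs, exactly as in Hadoop-style systems.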
Thanksgiving dinner preparation: more planning and tasks?

Menu Item        Preparation Time  Cooking Time  Cooling Time
Turkey           30 minutes        4 hours       15 minutes
Veggies          30 minutes        45 minutes    None
Cranberry Sauce  5 minutes         30 minutes    2 hours
Soup             20 minutes        30 minutes    None
Pie              30 minutes        5 minutes     1 day

• When do you start cooking?
• What order do you cook in?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
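The "when do I start?" question above is a small scheduling problem: work backward from a common serving time. A minimal sketch, using the durations from the table converted to minutes; the 18:00 serving time is an assumption for illustration.

```python
durations = {  # (prep, cook, cool) in minutes, from the table above
    "Turkey":          (30, 240, 15),
    "Veggies":         (30,  45,  0),
    "Cranberry Sauce": ( 5,  30, 120),
    "Soup":            (20,  30,  0),
    "Pie":             (30,   5, 1440),
}

serve_at = 18 * 60  # assumed serving time: 18:00, in minutes since midnight

def start_time(dish):
    """Latest start so the dish is ready exactly at serving time."""
    return serve_at - sum(durations[dish])

# Dishes run in parallel if you have enough cooks; each just needs
# its own latest start. A negative value means "start the day before".
for dish in sorted(durations, key=start_time):
    t = start_time(dish)
    day = "today" if t >= 0 else "the day before"
    print(f"{dish}: start at {t % 1440 // 60:02d}:{t % 1440 % 60:02d} {day}")
```

This is the simplest version; a full answer would also model shared resources (one oven, limited cooks), which is exactly what workflow schedulers do.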
Data Science Workflows - Programmable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org | WorDS.sdsc.edu
Why scalable and reproducible data science?
The Big Picture is Supporting the Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF becomes an executable SWF.
Example pipeline: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
The Big Picture is Supporting the Data Scientist
From "Napkin Drawings" to Executable Workflows: the SBNL workflow, again from conceptual SWF to executable SWF.
Insurance and Traffic Data Analytics using Big Data Bayesian Network Learning. Components: Big Data, Quality Evaluation & Data Partitioning, Data Quality Evaluation, Local Learner, Local Ensemble Learning, Master Learner, Master Ensemble Learning, Final BN Structure.
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflows. KEPLER = "Ptolemy II + X" for Scientific Workflows.
Kepler is a Scientific Workflow System
• A cross-project collaboration, initiated August 2003
• 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
A Toolbox with Many Tools
• Data: search, database access, IO operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
Need expertise to identify which tool to use when and how! Requires computation models to schedule and optimize execution!
So, how does this relate to data science and big data?
Toolboxes with many tools for:
• data access
• analysis
• execution
• fault tolerance
• provenance tracking
• reporting
• …
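One of the toolbox capabilities listed above, provenance tracking, can be illustrated with a small sketch: a decorator that records each step's inputs, outputs, and timing so a run can be audited or replayed. The log format and the two sample steps are illustrative, not Kepler's actual provenance API.

```python
import functools
import time

PROVENANCE = []  # append-only run log

def tracked(step):
    """Wrap a workflow step so every invocation is recorded."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        t0 = time.time()
        result = step(*args, **kwargs)
        PROVENANCE.append({
            "step": step.__name__,
            "inputs": (args, kwargs),
            "output": result,
            "seconds": time.time() - t0,
        })
        return result
    return wrapper

@tracked
def subset(data, threshold):
    # Keep only values above the threshold.
    return [x for x in data if x > threshold]

@tracked
def mean(data):
    return sum(data) / len(data)

result = mean(subset([1, 5, 9, 3], threshold=2))
print(result)
print([p["step"] for p in PROVENANCE])  # ['subset', 'mean']
```

Real workflow systems persist this kind of record to a database, which is what makes the reuse and reproducibility claims later in the talk concrete.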
Business Analysis | Operations Research (Adapted from: B. Tierney, 2013)
Workflows integrate data science building blocks!
Data Scientist Skill Set: http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
Unicorn?
http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
Solution: Scale Your Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
Conceptualizing a Computational Data Science Workflow
1: Start with the Workflow as a Blackbox
• Treat the whole workflow as a blackbox
– What is the use case/application?
• What is the science question this workflow is solving?
– What is the input data?
– What are the expected outcomes?
Input data → My workflow → Outputs
• Give the workflow a title based on the initial assessment!
2: Conceptualization of Scientific Steps
Example pipeline: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
Bake Pie: … → Cook → Chill → …
Bake Turkey: … → Prepare → Cook → …
Make Side Dishes: … → Make Cranberry Sauce → Cut Veggies → Prepare Stuffing → …
3: Treat Each Step Like a Workflow - until you reach an atomic functional step -
• Acquire data: find data, access data, move data
• Pre-process data: clean data, integrate data, subset data
• Process data: analyze data
• Post-process results: interpret results, summarize results, visualize results
(In the cooking analogy: SHOP, PREPARE, COOK, STORE.)

Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?

configurable automated analysis
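The decomposition in step 3 can be sketched as a configurable pipeline: each phase is a function, and the parameters that customize each step live in a config object rather than in the code. The phase names follow the lists above; the function bodies and config keys are placeholders, not a real analysis.

```python
def acquire(config):
    # Find / access / move data (here: a stand-in list of records).
    return [{"value": v} for v in config["raw_values"]]

def preprocess(records, config):
    # Clean / integrate / subset: drop records outside the configured range.
    lo, hi = config["valid_range"]
    return [r for r in records if lo <= r["value"] <= hi]

def analyze(records, config):
    # Process: apply the configured analysis function to each record.
    fn = config["analysis_fn"]
    return [fn(r["value"]) for r in records]

def postprocess(results, config):
    # Interpret / summarize: produce a report for the next step.
    return {"n": len(results), "mean": sum(results) / len(results)}

config = {
    "raw_values": [2, 15, 7, 99, 4],
    "valid_range": (0, 20),
    "analysis_fn": lambda x: x * 2,
}

# The workflow itself: each step's output feeds the next, and every
# step is customized through parameters rather than code changes.
report = postprocess(analyze(preprocess(acquire(config), config), config), config)
print(report)  # {'n': 4, 'mean': 14.0}
```

Swapping a data source, a filter range, or an analysis function then only touches the config, which is the "configurable automated analysis" idea in miniature.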
4: Start Building Each Step, Including the Alternatives
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages
Build → Explore → Scale → Report
Running on Heterogeneous Computing Resources - Execution of models where they run most efficiently -
• Local: NBCR Cluster Resources
• NSF/DOE: TeraScale Resources (XSEDE): Gordon, Trestles, Stampede, Lonestar
• Private Cluster: User Owned Resources
Different models have different computing architecture needs! e.g., memory-intensive, compute-intensive, I/O-intensive
5: Save and Share Reports and Final Products with Your Team
• The data scientist is in the middle, bridging the gap between business and development. So, the data scientist defines the business value and the steps to achieve the results as a workflow
• Developers/computer scientists use their favorite tools to implement the methods in the workflow
• The process is kept sharable, standardized, scalable and accountable
WorDS - Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, Yarn, Cascading
• Technology briefings and applied classes on end-to-end support for data science
Using Big Data Computing in Bioinformatics - Improving Programmability, Scalability and Reproducibility - biokepler.org
• Gateways and other user environments
• bioKepler, Kepler and Provenance Framework
• BioLinux, Galaxy, Clovr, Hadoop, …
• CLOUD AND OTHER COMPUTING RESOURCES, e.g., SGE, Amazon, FutureGrid, XSEDE
www.bioKepler.org
A coordinated ecosystem of biological and technological packages for bioinformatics!
Status of bioActors: 500+ bioActors are listed under the current bioKepler release; ~40 of them are parallelized.
Using Workflows and Cyberinfrastructure for Wildfire Resilience - A Scalable Data-Driven Monitoring and Dynamic Prediction Approach - wifire.ucsd.edu
Fire is Part of the Natural Ecology … but requires Monitoring, Prediction and Resilience
• Wildfires are critical for ecology, but volatile
• Fuel load is high due to fire suppression over the last century
• Changes in rainfall, wind, seasons, and thus wildfires, potentially induced by climate change
• Better prevention, prediction and maintenance of wildfires is needed
Photo of Harris Fire (2007) by former Fire Captain Bill Clayton
Disaster management of (ongoing) wildfires heavily relies on understanding their Direction and Rate of Spread (RoS).
Decision making for wildfire fighting and disaster management is based on heterogeneous data: satellite data, wildfire perimeter, wind, vegetation, terrain.
Photograph by Mark Thiessen
Fire Data Today
What is lacking in disaster management today is a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers … before, during and after a firestorm.
A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire
Data to Modeling in WIFIRE: real-time remote sensor data → modeling, data assimilation and dynamic wildfire behavior prediction
System integration of sensor data, data assimilation, dynamic models and fire direction and RoS predictions (computations) is based on Scientific and Engineering Workflows (Kepler):
• Visual programming
• Scalable parallel execution
• Standardized data interfaces
• Reuse and reproducibility
WIFIRE System Integration
Training and Consulting Services in the WorDS Center
• Ongoing programs for workflow bootcamps and hackathons
• Technology briefings for industrial partners
• Industry labs for undergraduate student researchers
• Consulting projects on workflow technologies
To Sum Up
• Workflows and provenance are well-adopted in scientific data science infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  • Many diverse environments and requirements
  • Need to orchestrate at a higher level
  • Higher-level programming components for each domain
• Lots of future challenges on:
  • Optimized execution on heterogeneous platforms
  • Increasing reuse within and across application domains
  • Querying and integration of workflow provenance data
Questions?
WorDS Director: Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu