WorDS of Data Science in the Presence of Heterogeneous Computing Architectures
WorDS.sdsc.edu
Dr. Ilkay Altintas, Founder and Director, Workflows for Data Science (WorDS) Center of Excellence, San Diego Supercomputer Center, UC San Diego
SAN DIEGO SUPERCOMPUTER CENTER at UC San Diego: Providing Cyberinfrastructure for Research and Education
• Established as a national supercomputer resource center in 1985 by NSF
• A world leader in HPC, data-intensive computing, and scientific data management
• Current strategic focus on "Big Data"
1985 → today
Scientific Workflow Automation Technologies:
• Research
• Workflows for Cloud Systems
• Big Data Applications
• Reproducible Science
• Workforce Training and Education
• Development and Consulting Services
Workflows for Data Science Center
Focus on the question, not the technology! 10+ years of data science R&D experience as a Center.
So, what is a workflow?
Source: http://www.fastcodesign.com/1663557/how-a-kitchen-design-could-make-it-easier-to-bond-with-neighbors
Let's make pasta this evening!
Workflow: Shop → Prepare → Cook → Store (step times of 30, 15, and 3 minutes shown on the diagram)
How to Cook Everything Fast
"How to Cook Everything Fast is a book of kitchen innovations. Time management, the essential principle of fast cooking, is woven into revolutionary recipes that do the thinking for you. You'll learn how to take advantage of downtime to prepare vegetables while a soup simmers or toast croutons while whisking a dressing. Just cook as you read, and let the recipes guide you quickly and easily toward a delicious result."
Image and quote source: amazon.com
What if you have more than one cook?
MAP
• Input: veggies
• User-defined function (UDF): chop
• Output: chopped groups of each kind of veggie
REDUCE
• Input: chopped batches for each veggie type
• User-defined function (UDF): combine based on veggie type as key
• Output: a bowl of veggies per veggie kind (a pure-Python sketch of this pattern follows)
Thanksgiving dinner preparation: more planning and tasks?
Menu Item       | Preparation Time | Cooking Time | Cooling Time
Turkey          | 30 minutes       | 4 hours      | 15 minutes
Veggies         | 30 minutes       | 45 minutes   | None
Cranberry Sauce | 5 minutes        | 30 minutes   | 2 hours
Soup            | 20 minutes       | 30 minutes   | None
Pie             | 30 minutes       | 5 minutes    | 1 day
• When do you start cooking?
• In what order do you cook?
• Can you cook some menu items in parallel?
• Who cooks what?
• …
A small scheduling sketch below works through the first three questions.
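One way to reason about this is to work backwards from dinner time. A minimal sketch, assuming every dish can be worked on in parallel (enough cooks and ovens) and using the times from the menu table; "1 day" of cooling is taken as 1440 minutes.

```python
# Times from the menu table, in minutes.
menu = {
    "Turkey":          {"prep": 30, "cook": 240, "cool": 15},
    "Veggies":         {"prep": 30, "cook": 45,  "cool": 0},
    "Cranberry Sauce": {"prep": 5,  "cook": 30,  "cool": 120},
    "Soup":            {"prep": 20, "cook": 30,  "cool": 0},
    "Pie":             {"prep": 30, "cook": 5,   "cool": 1440},
}

# Work backwards from dinner: each dish must start (prep + cook + cool)
# minutes before serving; the longest chain decides when cooking begins.
for item, t in sorted(menu.items(), key=lambda kv: -sum(kv[1].values())):
    lead = sum(t.values())
    print(f"{item}: start {lead // 60}h {lead % 60:02d}m before dinner")
```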
Data Science Workflows - Programmable, Reusable and Reproducible Scalability -
• Access and query data
• Scale computational analysis
• Increase reuse
• Save time, energy and money
• Formalize and standardize
Real-Time Hazards Management: wifire.ucsd.edu
Data-Parallel Bioinformatics: bioKepler.org
Scalable Automated Molecular Dynamics and Drug Discovery: nbcr.ucsd.edu
kepler-project.org | WorDS.sdsc.edu
Why scalable and reproducible data science?
The Big Picture is Supporting the Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF sketch becomes an executable SWF.
(Example workflow: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS)
The Big Picture is Supporting the Data Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF sketch becomes an executable SWF.
(SBNL workflow: Big Data → Quality Evaluation & Data Partitioning → Local Learner (Data Quality Evaluation, Local Ensemble Learning) → Master Learner (Master Ensemble Learning) → Final BN Structure)
Insurance and traffic data analytics using big data Bayesian network learning.
Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for scientific workflows
KEPLER = "Ptolemy II + X" for scientific workflows
Kepler is a Scientific Workflow System
• A cross-project collaboration… initiated August 2003
• Version 2.4 released 04/2013
• Builds upon the open-source Ptolemy II framework
www.kepler-project.org
A Toolbox with Many Tools
Needs expertise to identify which tool to use, when, and how! Requires computation models to schedule and optimize execution!
• Data: search, database access, I/O operations, streaming data in real time, …
• Compute: data-parallel patterns, external execution, …
• Network operations
• Provenance and fault tolerance
So, how does this relate to data science, big data and supercomputing?
Distributed Computing
Computing using more than one computer connected through a network.
• Types of distributed computing:
  – Computers in a local area network
  – Cluster or high-performance computing
  – Grid
  – Cloud
Cluster or High-Performance Computing
• Built from multiple computers
• May have a parallel file system and a high-speed network
• Provides a scheduler to manage the machines and submitted jobs: SGE/OGE, PBS, Condor, LSF, SLURM (a minimal job-submission sketch follows)
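As a small illustration, here is one way a job might be handed to one of these schedulers, SLURM in this case, from Python. It assumes the sbatch command is on PATH; the job script, its resource requests, and the file name are hypothetical.

```python
import subprocess

# A hypothetical batch script: request 4 tasks for 5 minutes and
# run one command on the allocated nodes.
job_script = """#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --ntasks=4
#SBATCH --time=00:05:00
srun hostname
"""

with open("job.sh", "w") as f:
    f.write(job_script)

# sbatch queues the job and prints its id; the scheduler decides
# when and where the four tasks actually run.
result = subprocess.run(["sbatch", "job.sh"], capture_output=True, text=True)
print(result.stdout.strip())
```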
Parallelization
Multiple processes or threads running at the same time.
• Execution environments: one machine, or distributed machines
• Parallelism types:
  – Computation/task parallelism
  – Data parallelism
  – Pipeline parallelism
(Diagrams: three styles of parallel execution.)
Task Parallelism: independent tasks run concurrently, e.g., Task 1 finished while Tasks 2 and 4 are running and Tasks 3 and 5 are waiting.
Data Parallelism: the same tasks run in parallel over different partitions of the input data set.
Pipeline Parallelism: consecutive stages (Task 1, Task 2, Task 3) all run at the same time, each on a different piece of the input data set.
There are different styles of parallelism!
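A minimal sketch of the three styles using plain Python threads; the task functions are placeholders for real workflow steps, and a workflow system would schedule these across distributed machines rather than one thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(8))

def task_a(x): return x + 1   # stand-in workflow step
def task_b(x): return x * 2   # another independent step

with ThreadPoolExecutor(max_workers=4) as pool:
    # Task parallelism: independent tasks run at the same time.
    fa, fb = pool.submit(task_a, 10), pool.submit(task_b, 10)
    print(fa.result(), fb.result())

    # Data parallelism: the same task runs over partitions of the input.
    print(list(pool.map(task_a, data)))

    # Pipeline parallelism: task_b consumes task_a's outputs as they
    # stream out, so both stages are active at once.
    stage1 = pool.map(task_a, data)        # yields results as they finish
    print(list(pool.map(task_b, stage1)))
```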
Big Data: Short Definition
• Some features ("V's") of big data:
  – Volume: amount of data
  – Velocity: speed of data in and out
  – Variety: range of data types and sources
  – Veracity: trustworthiness of data
Picture credit: IBM 2012
Distributed Data-Parallel Computing
MapReduce: move the program to the data!
• A parallel and scalable programming model for Big Data
  – Input data is automatically partitioned onto multiple nodes
  – Programs are distributed and executed in parallel on the partitioned data blocks
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
Distributed Data-Parallel (DDP) Patterns
Patterns for data distribution and parallel data processing.
• A higher-level programming model:
  – Moving computation to data
  – Good scalability and performance acceleration
  – Run-time features such as fault tolerance
  – Easier parallel programming than MPI and OpenMP
Images from: http://www.stratosphere.eu/projects/Stratosphere/wiki/PactPM
Hadoop
• Open-source implementation of MapReduce
• A distributed file system across compute nodes (HDFS):
  – Automatic data partitioning
  – Automatic data replication
• Master and workers/slaves architecture
• Automatic task re-execution for failed tasks (a word-count sketch follows)
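A classic word-count sketch in the Hadoop Streaming style, where Hadoop pipes each data block's lines into a mapper and the sorted key/value pairs into a reducer over stdin/stdout. In a real job the two functions would live in separate scripts passed to the hadoop-streaming jar; the structure here is illustrative.

```python
import sys

def mapper():
    # Emit one "word<TAB>1" pair per word in the input block.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Input arrives sorted by key, so per-word counts can be summed
    # in one streaming pass without holding everything in memory.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current and current is not None:
            print(f"{current}\t{count}")
            count = 0
        current = word
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")
```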
Spark
• Fast big data engine: keeps data in memory as much as possible
• Resilient Distributed Datasets (RDDs):
  – Evaluated lazily
  – Keep track of lineage for fault tolerance
• More operators than just Map and Reduce
• Can run on YARN (Hadoop v2) (a minimal RDD sketch follows)
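A minimal PySpark sketch of the two RDD points above, lazy evaluation and in-memory reuse; it assumes a local Spark installation and a hypothetical input.txt.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-sketch")

lines = sc.textFile("input.txt")                   # lazy: nothing read yet
words = lines.flatMap(lambda line: line.split())   # lazy transformation
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()            # keep the result in memory once computed
print(counts.take(5))     # first action triggers the whole lineage
print(counts.count())     # second action reuses the cached RDD

sc.stop()
```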
Getting Value out of All This
My favorite definition of Data Science
"By 'Data Science', we mean almost everything that has something to do with data: collecting, analyzing, modeling... yet the most important part is its applications -- all sorts of applications." Journal of Data Science (http://www.jds-online.com/about)
Implies: programming, data analysis, and problem solving.
Some P’s of Data Science
People
Process
Platforms
Purpose
Programmability
There are more: provenance, publication, product, performance, policy, profit, ...
People…
People
Data Scientist Skill Set: http://datasciencedojo.com/what-are-the-key-skills-of-a-data-scientist/
Unicorn? http://www.anlytcs.com/2014/01/data-science-venn-diagram-v20.html
Solution: Scale the Data Scientists
Standardize the data science process, not the tools!
Standardized processes enable data scientists to communicate with business and programming partners.
Also, what these definitions really mean is "computational and data scientists".
Some P’s of Data Science
Process
Defining a Typical Data Science Process
• Find data, access data, acquire data, move data
• Pre-process data: clean data, integrate data, subset data
• Analyze data, process data
• Post-process results: interpret results, summarize results, visualize results
Some questions to ask:
• Where and how do I get the data?
• What is the format and frequency of the data, e.g., structured, textual, real-time, image, …?
• How do I integrate or subset datasets, e.g., knowledge representation, …?
• How do I analyze the data, and what is the analysis function?
• What are the parameters to customize each step?
• What are the computing needs to schedule and run each step?
• How do I make sure the results are useful for the next step or as scientific products, e.g., standards compliance, reporting, …?
The goal: configurable, automated analysis.
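A minimal sketch of what "configurable, automated analysis" can look like: the stages above chained as parameterized steps. The CSV path, column names, and config values are hypothetical placeholders.

```python
import pandas as pd

config = {
    "source": "data.csv",     # where and how do I get the data?
    "subset": "value > 0",    # how do I subset the dataset?
    "group_by": "region",     # parameter customizing the analysis
}

def acquire(cfg):
    return pd.read_csv(cfg["source"])           # find/access/acquire/move

def preprocess(df, cfg):
    return df.dropna().query(cfg["subset"])     # clean/integrate/subset

def analyze(df, cfg):
    return df.groupby(cfg["group_by"]).mean()   # analyze/process

def postprocess(result, cfg):
    print(result)                               # interpret/summarize/visualize

postprocess(analyze(preprocess(acquire(config), config), config), config)
```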
Some P’s of Data Science
People
Process
Purpose
Purpose…
"You've got to think about big things while you're doing small things, so that all the small things go in the right direction." – Alvin Toffler
Use cases => purpose and value
Need toolboxes with many tools for:
• data access
• analysis
• scalable execution
• fault tolerance
• provenance tracking
• reporting
• ...
Business Analysis
Operations Research
Adapted from: B. Tierney, 2013
Integration of Many Tools to Serve a Purpose
Many Alternatives
• Alternative tools
• Multiple modes of scalability
• Support for each step of the development and production process
• Different reporting needs for exploration and production stages
(Cycle: Build → Explore → Scale → Report)
Build Once, Run Many Times…
• The data science process should support experimental work and dynamic scalability on many platforms
• Scalability based on:
  – data volume and velocity
  – dynamic modeling needs
  – highly-optimized HPC codes
  – changes in network, storage and computing availability
Scalability across platforms…
People
Process
Platforms
Purpose
Running on Heterogeneous Computing Resources
- Execution of programs where they run most efficiently -
• Local cluster resources (Gordon, Trestles)
• NSF/DOE TeraScale resources (XSEDE): Gordon, Comet, Stampede, Lonestar
• Private cluster: user-owned resources
Different executables have different computing architecture needs, e.g., memory-intensive, compute-intensive, I/O-intensive!
Challenges for Heterogeneous Computing
• Dynamic scheduling optimization needed, based on:
  – network availability
  – data transfer and locality
  – energy efficiency
  – availability of exascale memory hierarchies
  – workload changes
• Better programmable communication between workflow systems and the infrastructure for computing, storage and network
Programmability for scalability, reusability and reproducibility
People
Process
Platforms
Purpose
Programmability
Using Big Data Computing in Bioinformatics - Improving Programmability, Scalability and Reproducibility -
biokepler.org
(Architecture stack: gateways and other user environments → bioKepler → Kepler and provenance framework → BioLinux, Galaxy, CloVR, Hadoop, … → cloud and other computing resources, e.g., SGE, Amazon, FutureGrid, XSEDE)
www.bioKepler.org
A coordinated ecosystem of biological and technological packages for bioinformatics!
The same approach can be applied to machine learning and other application areas!
- REUSABILITY and REPURPOSABILITY -
Flexible programming of K-means (a sketch follows)
• R: programming language and software environment for statistical computing and graphics.
• KNIME: platform for data analytics.
• MLlib: scalable machine learning library running on the Spark cluster computing framework.
• Mahout: scalable machine learning library based on MapReduce.
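As one concrete instance of this flexibility, here is a minimal k-means sketch against the MLlib option, using the pyspark.mllib API on hypothetical 2-D points; in a workflow the same step could be swapped for the R, KNIME, or Mahout implementation.

```python
from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local[*]", "kmeans-sketch")

# Hypothetical 2-D points; a real workflow would read these from HDFS.
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])

model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)        # learned centers, one per cluster
print(model.predict([0.5, 0.5]))   # cluster assignment for a new point

sc.stop()
```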
Scalable Bayesian Network Learning: SBNL workflow
(Workflow: Big Data → Quality Evaluation & Data Partitioning → Local Learner (Data Quality Evaluation, Local Ensemble Learning) → Master Learner (Master Ensemble Learning) → Final BN Structure)
Kepler Workflow
BN Workflow
• Top-level workflow:
  – PartitionData: RExpression actor that contains the R script for the data partitioning step
  – DDPNetworkLearner: composite actor using MapReduce to perform parallel ensemble learning
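To make the shape of DDPNetworkLearner concrete, here is a heavily simplified map/reduce sketch of partition-wise ensemble learning; the local learner and the combine step are hypothetical stand-ins, not the actual SBNL algorithm.

```python
from functools import reduce

# Partitioned big data, as produced by a PartitionData-like step.
partitions = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

def local_learn(part):
    # MAP: one learner per partition returns a placeholder
    # "local BN structure" summary.
    return {"edges": len(part)}

def combine(a, b):
    # REDUCE: the master ensemble step merges local results.
    return {"edges": a["edges"] + b["edges"]}

local_models = map(local_learn, partitions)      # parallel in a real DDP run
final_structure = reduce(combine, local_models)  # master learner
print(final_structure)
```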
WorDS – Simple and Scalable Big Data Solutions using Workflows
Focus on the use case, not the technology!
• Develop new big data science technologies and infrastructure
• Develop data science workflow applications through a combination of tools, technologies and best practices
• Hands-on consulting on workflow technologies for big data and cloud systems, e.g., MapReduce, Hadoop, YARN, Cascading
• Technology briefings and applied classes on end-to-end support for data science
Using Workflows and Cyberinfrastructure for Wildfire Resilience
- A Scalable Data-Driven Monitoring and Dynamic Prediction Approach -
wifire.ucsd.edu
A Scalable Data-Driven Monitoring, Dynamic Prediction and Resilience Cyberinfrastructure for Wildfires (WIFIRE)
Development of "cyberinfrastructure" for "analysis of large dimensional heterogeneous real-time sensed data" for fire resilience before, during and after a wildfire.
What is lacking in disaster management today is a system integration of real-time sensor networks, satellite imagery, near-real-time data management tools, wildfire simulation tools, and connectivity to emergency command centers… before, during and after a firestorm.
http://nbcr.ucsd.edu/
Integrated Multi-Scale Biomedical Modeling Workflows in NBCR
(Diagram: a molecular dynamics CADD workflow built on the Amber Molecular Dynamics Package, with a local execution option, a user MD-parameter configuration option, and a GPU or Gordon execution option; it runs on local NBCR cluster resources, NSF/DOE TeraScale resources (XSEDE: Stampede, Comet), and NBCR- and user-owned cloud resources.)
BENEFITS:
• Enable users to configure MD job parameters through a command-line, GUI or web interface
• Scale for multiple compounds in parallel
• Run on multiple computing platforms
• Increase reuse
• Provenance
http://hpc.pnl.gov/IPPD/
Predicting Workflow Performance from Provenance
https://smartmanufacturingcoalition.org/
Workflows-as-a-Service
To Sum Up
• Workflows and provenance are well-adopted in scientific infrastructures today, with success
• The WorDS Center applies these concepts to advanced dynamic data-driven analytics applications
• One size does not fit all!
  – Many diverse environments and requirements
  – Need to orchestrate at a higher level
  – Higher-level programming components for each domain
• Lots of future challenges:
  – Optimized execution on heterogeneous platforms
  – Programmable interface to workload, storage and network needed
  – Increasing reuse within and across application domains
  – Querying and integration of workflow provenance data into performance prediction
Questions?
WorDS Director: Ilkay Altintas, Ph.D.
Email: altintas@sdsc.edu