Active and Accelerated Learning of Cost Models for Optimizing Scientific Applications


Piyush Shivam, Shivnath Babu, Jeffrey Chase

Duke University

Networked Computing Utility

[Figure: Sites A, B, and C with compute resources C1, C2, and C3, a task scheduler, and a task workflow]

• A network of clusters or grid sites

•Each site is a pool of heterogeneous resources

•Jobs are task workflows

•Challenge: choose good resource assignments for the jobs

Example: Assigning Resources to Run Tasks

[Figure: Sites A, B, and C with compute resources C1, C2, and C3, a home file server, and candidate plans P1, P2, and P3]

• A workflow with a single task

• Task input data at Site A

• Execution plan ≡ resource assignment

Plan   CPU      Storage
P1     Site A   Site A
P2     Site B   Site A
P3     Site B   Site B

Plan Selection Problem

[Figure: a task workflow goes through plan enumeration, each plan is assigned a cost, and the best plan is chosen]

Plans  CPU      Storage  Cost
P1     Site A   Site A   T1
P2     Site B   Site A   T2
…      …        …        …

Cost: plan execution time

Challenge: Need cost models to estimate plan execution time
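A minimal sketch of plan selection, assuming a cost model is already available as a function from plan to estimated execution time. The function names and the numbers in `toy_cost_model` are hypothetical and only illustrate the enumerate-then-minimize structure; they are not taken from NIMO.

```python
from itertools import product


def enumerate_plans(cpu_sites, storage_sites):
    """Return every (cpu_site, storage_site) resource assignment."""
    return list(product(cpu_sites, storage_sites))


def choose_best_plan(plans, cost_model):
    """Pick the plan with the lowest estimated execution time."""
    return min(plans, key=cost_model)


def toy_cost_model(plan):
    """Hypothetical cost model: made-up compute times plus a remote I/O penalty."""
    cpu_site, storage_site = plan
    compute_time = {"Site A": 100.0, "Site B": 60.0}[cpu_site]
    remote_io_penalty = 0.0 if cpu_site == storage_site else 50.0
    return compute_time + remote_io_penalty


# Input data lives at Site A, so storage is fixed and only the CPU site varies.
plans = enumerate_plans(["Site A", "Site B"], ["Site A"])
best_plan = choose_best_plan(plans, toy_cost_model)
```

With these made-up numbers the faster CPU at Site B does not pay for the remote I/O, so the plan that keeps compute next to the data wins. The point is only that plan quality depends on having a trustworthy cost model.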

Generating Cost Models is Hard

• Non-declarative

– Scientific workflow tasks are usually scripts (MATLAB, Perl)

– Such tasks are not database operators like join or select

– Hence: task is a black box with no prior knowledge

• Heterogeneous resources

– Computational grid setting

– Performance varies a lot across resource assignments

• Data dependency

– Performance can vary significantly based on properties of input data & parameters to scripts

Problem Setting

• Scientific workflows at DSCR (Duke Shared Cluster Resource)

• Important scientific workflows are run repeatedly

– Opportunity to observe & learn task behavior

– Better plan selection for subsequent runs

• Sequential scientific workflows

– Each task runs on a single node

– >90% of workflows at DSCR are sequential

NIMO System: NonInvasive Modeling for Optimization

NIMO learns cost models for task workflows

– End-to-end cost models

• Incorporate properties of tasks, resources, & data

– Non-invasive

• No changes to tasks

– Automated and active

• Automatically collects training data for learning cost models

[Figure: NIMO alongside the scheduler, serving Sites A, B, and C (compute resources C1, C2, and C3)]

NIMO Fills a Gap

• Workflow Management Systems (WFMSs)

– WFMSs use database technology for managing all aspects of scientific workflows [Liu ’04, Shankar ’05]

• Batch scheduling systems

– Knowledge of plan execution time is assumed for optimizing resource assignments [Casanova ’00, Phan ’05, Kelly ’03]

NIMO generates cost models for these systems

Roadmap

• Cost models

• NIMO: active learning of cost models

• Experimental evaluation

• Related work

• Conclusions

• Future work

Cost Model

[Figure: the cost model for a task takes a resource assignment and input data and produces the execution time]

Total workflow execution time can be derived using the cost models for individual tasks
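One way to derive a workflow's total time from per-task estimates is a critical-path computation over the task graph. The sketch below assumes each task starts as soon as all its predecessors finish and that resources are never contended. The slides do not state this rule, and the task names and times are hypothetical.

```python
from graphlib import TopologicalSorter


def workflow_time(task_times, predecessors):
    """Finish time of the last task, given per-task execution time estimates."""
    finish = {}
    for task in TopologicalSorter(predecessors).static_order():
        # A task starts once every predecessor has finished.
        start = max((finish[p] for p in predecessors[task]), default=0.0)
        finish[task] = start + task_times[task]
    return max(finish.values())


# Hypothetical workflow: task_b and task_c both consume the output of task_a.
task_times = {"task_a": 30.0, "task_b": 45.0, "task_c": 20.0}
predecessors = {"task_a": set(), "task_b": {"task_a"}, "task_c": {"task_a"}}
total = workflow_time(task_times, predecessors)
```

For a simple chain of tasks this reduces to summing the per-task estimates.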

Task Cost Model

T = D * (Oa + On + Od)

• T: execution time

• D: total data

• Oa (compute occupancy): compute phase (compute resource busy)

• Os (stall occupancy): stall phase (compute resource stalled on I/O), made up of On (network occupancy) and Od (storage occupancy)

• Occupancy: average time spent per unit of data
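The formula T = D * (Oa + On + Od) can be evaluated directly once the occupancies are known. The sketch below uses made-up occupancy values (seconds per unit of data) and treats the stall occupancy as the sum of network and storage occupancy, as the slide's grouping suggests.

```python
def execution_time(total_data, compute_occ, network_occ, storage_occ):
    """T = D * (Oa + On + Od); each occupancy is time per unit of data."""
    stall_occ = network_occ + storage_occ  # Os, assumed to be On + Od
    return total_data * (compute_occ + stall_occ)


# Illustrative values only: 1000 units of data, occupancies in seconds per unit.
t = execution_time(total_data=1000, compute_occ=0.02,
                   network_occ=0.005, storage_occ=0.01)
```

Here the task spends 20 s computing and 15 s stalled on I/O, for about 35 s in total.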

Cost Model

[Figure: the cost model for a task takes a resource assignment and input data and produces the execution time; labels: resource profile, data profile, task profile]

T = D * (Oa + On + Od)

Learning Cost Models

Learning the cost model = Learning profiles + Learning predictors

[Figure: independent variables, namely resource profile ( ) and data profile ( ), feed statistical learning of predictors, which yields the dependent variables]

Ex: Learn each predictor as a regression model from the training data
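As an illustration of learning a predictor as a regression model, the sketch below fits a one-attribute least-squares line. The choice of attribute (CPU cycle time) and the training values are hypothetical, and a real predictor would typically take several profile attributes.

```python
def fit_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training samples: CPU cycle time (ns) from the resource profile
# and the compute occupancy (ms per MB) observed for each run.
cycle_time_ns = [1.0, 0.5, 0.33, 0.25]
compute_occ = [4.0, 2.0, 1.32, 1.0]
slope, intercept = fit_linear(cycle_time_ns, compute_occ)


def predict_compute_occ(cycle_time):
    """Predicted compute occupancy for an unseen resource assignment."""
    return slope * cycle_time + intercept
```

The fitted predictor can then estimate occupancy for assignments that were never run, which is what plan selection needs.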

Challenges in Learning

• Cost of sample acquisition

• Coverage of system operating range

• Curse of dimensionality

– Suppose 10 profile attributes × 10 values per attribute, and 5 minutes per task run (sample). Sampling just 1% of the space to build the cost model would take 951 years!

[Figure: accuracy of the current best model versus elapsed time, comparing passive learning with active and accelerated learning, relative to the best accuracy possible]
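The 951-year figure follows from the numbers on the slide, assuming the runs execute one after another and a 365-day year:

```python
attributes = 10
values_per_attribute = 10
minutes_per_run = 5
sample_percent = 1

space_size = values_per_attribute ** attributes      # 10**10 possible assignments
samples = space_size * sample_percent // 100         # 10**8 task runs
total_minutes = samples * minutes_per_run            # 5 * 10**8 minutes
years = total_minutes / (60 * 24 * 365)              # a little over 951 years
```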

Active (and Accelerated) Learning

• Which predictors are important?

• Which profile attributes should each predictor have?

• What values to consider for each profile attribute during training?

[Figure: resource profile and data profile]

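NIMO System

[Figure: NIMO architecture. A scheduler serves Sites A, B, and C (compute resources C1, C2, and C3). Labeled components: NIMO workbench, WAN emulator (nistnet), training set database, active and accelerated learning, task profiler, resource profiler, and data profiler. Annotation: run standard benchmarks.]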

Active Learning Algorithm

Initialization

While( ) {
    Pick a new assignment
    Run task on chosen assignment
    Relearn predictors
}

Relearn Predictors

• Relearn predictors with the new set of training samples

• Compute current prediction error of each predictor

– Fixed test set

– Cross-validation

[Figure: training set table; sample row: 10ms 256M 1GHz 1G 512MB 6 8 T 44]


Initialization

While( ) {
    Choose a predictor to refine
    Choose attributes for the predictor
    Choose attribute values for the run
    Relearn predictors
}
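The loop can be sketched end to end under simplifying assumptions that are not from the slides: each predictor is a one-attribute linear model (so the attribute-choice step is trivial), `run_task` returns the occupancy for one predictor at a time, the predictor with the largest current error is refined first, and the next value is the untried one farthest from those already sampled. All names are hypothetical.

```python
def fit_linear(xs, ys):
    """Least-squares line y = slope * x + intercept (flat if x never varies)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pct_error(model, test_set):
    """Mean absolute % error of a (slope, intercept) model on a fixed test set."""
    slope, intercept = model
    errs = [abs(slope * x + intercept - y) / y for x, y in test_set]
    return 100.0 * sum(errs) / len(errs)


def active_learning(run_task, candidates, test_sets, target_error, max_runs):
    # Initialization: one run per predictor at its first candidate value.
    samples = {p: [(v[0], run_task(p, v[0]))] for p, v in candidates.items()}
    models = {p: fit_linear(*zip(*s)) for p, s in samples.items()}
    runs = len(samples)
    while runs < max_runs:
        errors = {p: pct_error(models[p], test_sets[p]) for p in models}
        # Choose a predictor to refine: here, the one with the largest error.
        p = max(errors, key=errors.get)
        if errors[p] <= target_error:
            break
        # Choose an attribute value: the untried value farthest from tried ones.
        tried = [x for x, _ in samples[p]]
        untried = [v for v in candidates[p] if v not in tried]
        if not untried:
            break
        value = max(untried, key=lambda v: min(abs(v - t) for t in tried))
        # Run the task on the chosen assignment, then relearn the predictor.
        samples[p].append((value, run_task(p, value)))
        runs += 1
        models[p] = fit_linear(*zip(*samples[p]))
    return models, runs


def simulated_run(predictor, value):
    """Stand-in for a real task run; the 'true' behavior is made up."""
    return 4.0 * value if predictor == "compute" else 0.1 * value + 0.5


candidates = {"compute": [0.25, 0.5, 1.0], "network": [0, 5, 10, 20]}
test_sets = {"compute": [(0.4, 1.6), (0.8, 3.2)],
             "network": [(2, 0.7), (15, 2.0)]}
models, runs = active_learning(simulated_run, candidates, test_sets,
                               target_error=5.0, max_runs=10)
```

In this toy setting the loop stops after four runs instead of sampling all seven candidate values, because each new sample is placed where it most reduces the current error.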

Predictor Choice

• Predictors – fa, fn, fd, fD

• Order predictors + Traverse this order

– Ex: relevance-based order (Plackett-Burman)

– Ex: choose predictor with current max. error






Attribute Choice

• Each predictor takes profile attributes as input

• Not all attributes are equally relevant

• Order attributes + Traverse this order






Value Choice

• Cover the operating range of attributes

• Expose main interactions with other attributes

Experimental Results

• Biomedical workflows (from DSCR)

– BLAST, fMRI, NAMD, CardioWave

– Single task workflows

• Plan space in the heterogeneous networked utility

– 5 CPU speeds, 6 Network latencies, 5 Memory sizes

– 5 X 6 X 5 = 150 resource plans

• Goal: Converge quickly to a fairly accurate cost model

– We use regression models for the predictors

– Model validation details in previous work (ICAC 2005)
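The 150-plan space is the Cartesian product of the three attribute ranges. The slides give only the counts (5, 6, and 5), so the concrete values below are hypothetical placeholders.

```python
from itertools import product

# Hypothetical attribute values; only the number of values per attribute
# (5 CPU speeds, 6 network latencies, 5 memory sizes) comes from the slides.
cpu_speeds_mhz = [500, 800, 1000, 1400, 2000]
network_latencies_ms = [0, 2, 5, 10, 15, 20]
memory_sizes_mb = [64, 128, 256, 512, 1024]

resource_plans = list(product(cpu_speeds_mhz, network_latencies_ms, memory_sizes_mb))
plan_count = len(resource_plans)  # 5 * 6 * 5
```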

Performance Summary

• Error: Mean absolute % error in predicted execution time

• A separate test set for evaluating the error
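For reference, mean absolute percentage error over a test set can be computed as follows. The predicted and actual times are made-up values, not results from the experiments.

```python
def mean_absolute_percentage_error(predicted, actual):
    """Average of |predicted - actual| / actual, expressed as a percentage."""
    ratios = [abs(p - a) / a for p, a in zip(predicted, actual)]
    return 100.0 * sum(ratios) / len(ratios)


# Illustrative execution times in seconds for three test-set assignments.
predicted_times = [110.0, 45.0, 200.0]
actual_times = [100.0, 50.0, 200.0]
error = mean_absolute_percentage_error(predicted_times, actual_times)
```

The three runs are off by 10%, 10%, and 0%, so the error is their average.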

BLAST Application: Predictor Choice

BLAST Application: Attribute Choice

Related Work

• Workflow Management Systems (WFMSs)

– [Shankar ’05, Liu ’04 etc.]

• Performance prediction in scientific applications

– [Carrington ’05, Rosti ’02, etc.]

• Learning cost models using statistical techniques

– [Zhang ’05, Zhu ’96, etc.]

• NIMO is end-to-end, noninvasive, and active (acquires model learning data automatically)

Conclusions

• NIMO:

– Learns cost models for scientific workflows

– Noninvasive and end-to-end

– Active and accelerated learning: Learns accurate cost models quickly

– Fills a gap in Workflow Management Systems

Future Work

• NIMO + SHIRAKO

– A policy-based resource-leasing system that can slice-and-dice virtualized resources

• NIMO + Fa

– Processing system-management queries (e.g., root-cause diagnosis, forecasting performance problems, capacity-planning)

[Figure: NIMO alongside the scheduler, serving Sites A, B, and C (compute resources C1, C2, and C3)]

Backup Slides for Explanation

See Paper for Details of Steps

• Each algorithm step has sub-algorithms

• Example: Choosing the predictor to refine in the current step

– Goal: learn most relevant predictors first

– Static vs. dynamic ordering

• Static:

– Define total order: a priori or using estimates of influence (Plackett-Burman)

– Traverse the order: round-robin vs. improvement-threshold-based

• Dynamic: choose the predictor with maximum current prediction error
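The three policies can be contrasted in a few lines. This is one plausible reading of the bullets, not the paper's exact sub-algorithms: the threshold rule below stays on a predictor while its last error reduction is at least the threshold, and the order and error values are hypothetical.

```python
def round_robin(order, current):
    """Static order, round-robin traversal: always move to the next predictor."""
    return order[(order.index(current) + 1) % len(order)]


def improvement_threshold(order, current, last_improvement, threshold):
    """Static order: keep refining the current predictor while it still improves."""
    if last_improvement >= threshold:
        return current
    return round_robin(order, current)


def dynamic_choice(errors):
    """Dynamic ordering: pick the predictor with the maximum current error."""
    return max(errors, key=errors.get)


# Hypothetical total order over the predictors and current errors (in %).
order = ["fa", "fn", "fd", "fD"]
errors = {"fa": 12.0, "fn": 30.0, "fd": 8.0, "fD": 3.0}

after_fa = round_robin(order, "fa")
stay = improvement_threshold(order, "fa", last_improvement=5.0, threshold=2.0)
move = improvement_threshold(order, "fa", last_improvement=0.5, threshold=2.0)
worst = dynamic_choice(errors)
```

Static orders need relevance estimates up front, while the dynamic rule needs an up-to-date error for every predictor at each step.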

Active and Accelerated Learning

Latency hiding

Saturation
