Run-time Adaptation of Grid Data Placement Jobs
George Kola, Tevfik Kosar and Miron Livny
Condor Project, University of Wisconsin


Page 1: Run-time Adaptation of Grid Data Placement Jobs

George Kola, Tevfik Kosar and Miron Livny
Condor Project, University of Wisconsin

Page 2: Introduction

• The Grid presents a continuously changing environment
• Data-intensive applications are being run on the Grid
• Data-intensive applications have two parts
– Data placement part
– Computation part

Page 3: Data Placement

[Figure: A Data Intensive Application, with stages labeled Stage in data, Compute and Stage out data, and two Data placement labels]

• Data placement encompasses data transfer, staging, replication, data positioning, space allocation and de-allocation

Page 4: Problems

• Insufficient automation
– Failures
– No tuning: tuning is difficult!
• Lack of adaptation to changing environment
– Failure of one protocol while others are functioning
– Changing network characteristics

Page 5: Current Approach

• FedEx
• Hand tuning
• Network Weather Service
– Not useful for high-bandwidth, high-latency networks
• TCP auto-tuning
– 16-bit window size and window scale option limitations
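The window-size limitation mentioned above can be illustrated with a little arithmetic. The sketch below is not from the slides: it assumes a 60 ms round-trip time purely for illustration and uses the standard bound that a TCP connection can have at most one window of data in flight per round trip. The function name is hypothetical.

```python
# Illustrative arithmetic only; the 60 ms round-trip time is an assumed value.
MAX_UNSCALED_WINDOW = 2**16 - 1   # largest value of the 16-bit TCP window field, in bytes
MAX_WINDOW_SHIFT = 14             # largest shift allowed by the TCP window scale option


def window_limited_throughput(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on throughput (bytes/s) with one window in flight per round trip."""
    return window_bytes / rtt_seconds


rtt = 0.060  # assumed round-trip time in seconds
unscaled = window_limited_throughput(MAX_UNSCALED_WINDOW, rtt)
scaled = window_limited_throughput(MAX_UNSCALED_WINDOW << MAX_WINDOW_SHIFT, rtt)

print(f"Without window scaling: {unscaled / 1e6:.2f} MB/s")
print(f"With maximum window scaling: {scaled / 1e9:.2f} GB/s")
```

On a high-latency path, a single stream without window scaling is therefore capped at roughly 1 MB/s under this assumed round-trip time, regardless of link bandwidth.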

Page 6: Our Approach

• Full automation
• Continuously monitor environment characteristics
• Perform tuning whenever characteristics change
• Ability to dynamically and automatically choose an appropriate protocol
• Ability to switch to an alternate protocol in case of failure

Page 7: The Big Picture

[Figure: architecture diagram. Monitoring Infrastructure @ Host 1 and Monitoring Infrastructure @ Host 2 each contain a Memory Profiler, a Disk Profiler and a Network Profiler. Labeled flows: Memory Parameters, Disk Parameters and Network Parameters into the Tuning Infrastructure, and Data Transfer Parameters to the Data Placement Scheduler.]

Page 8: The Big Picture

[Figure: the same architecture diagram (Monitoring Infrastructure @ Host 1 and @ Host 2, Tuning Infrastructure, Data Placement Scheduler)]

Page 9: Profilers

• Memory Profiler
– Optimal memory block-size and incremental block-size
• Disk Profiler
– Optimal disk block-size and incremental block-size
• Network Profiler
– Determines bandwidth, latency and the number of hops between a given pair of hosts
– Uses pathrate, traceroute and the DiskRouter bandwidth test tool

Page 10: The Big Picture

[Figure: the same architecture diagram (Monitoring Infrastructure @ Host 1 and @ Host 2, Tuning Infrastructure, Data Placement Scheduler), with the label Parameter Tuner 1 added]

Page 11: Parameter Tuner

• Generates optimal parameters for data transfer between a given pair of hosts
• Calculates TCP buffer size as the bandwidth-delay product
• Calculates the optimal disk buffer size based on TCP buffer size
• Uses a heuristic to calculate the number of TCP streams
– Number of streams = 1 + number of hops with latency > 10 ms
– Rounded to an even number
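The two calculations described for the parameter tuner can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the units and the sample inputs are assumptions, and because the slide does not say in which direction the stream count is rounded, the sketch assumes rounding up to the next even number.

```python
from typing import Sequence


def tcp_buffer_size(bandwidth_bits_per_s: int, rtt_ms: int) -> int:
    """TCP buffer size in bytes, taken as the bandwidth-delay product."""
    return bandwidth_bits_per_s * rtt_ms // (8 * 1000)


def number_of_streams(hop_latencies_ms: Sequence[float],
                      threshold_ms: float = 10.0) -> int:
    """1 + number of hops slower than the threshold, rounded to an even number."""
    streams = 1 + sum(1 for latency in hop_latencies_ms if latency > threshold_ms)
    # Assumption: round up, since the rounding direction is not stated.
    return streams + (streams % 2)


# Hypothetical measurements: a 100 Mb/s path with a 60 ms round-trip time,
# and five hops of which three exceed 10 ms.
buffer_bytes = tcp_buffer_size(100_000_000, 60)
streams = number_of_streams([0.4, 12.0, 25.5, 3.1, 18.0])
print(buffer_bytes, streams)
```

In practice the inputs would come from the network profiler (bandwidth, latency and per-hop measurements) rather than from constants.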

Page 12: The Big Picture

[Figure: the same architecture diagram (Monitoring Infrastructure @ Host 1 and @ Host 2, Tuning Infrastructure, Data Placement Scheduler)]

Page 13: Data Placement Scheduler

• Data placement is a real job
• A meta-scheduler (e.g. DAGMan) is used to coordinate data placement and computation
• Sample data placement job

[
  dap_type = "transfer";
  src_url = "diskrouter://slic04.sdsc.edu/s/s1";
  dest_url = "diskrouter://quest2.ncsa.uiuc.edu/d/d1";
]

Page 14: Data Placement Scheduler

• Used Stork, a prototype data placement scheduler
• Tuned parameters are fed to Stork
• Stork uses the tuned parameters to adapt data placement jobs

Page 15: Implementation

• Profilers are run as remote batch jobs on respective hosts
• Parameter tuner is also a batch job
• An instance of parameter tuner is run for every pair of nodes involved in data transfer
• Monitoring and tuning infrastructure is coordinated by DAGMan

Page 16: Coordinating DAG

[Figure: coordinating DAG with nodes Disk Profiler, Memory Profiler, Network Profiler, Lightweight Network Profiler and two Parameter Tuner nodes. One part is annotated "This part executes at startup and whenever requested", the other "This part executes periodically".]

Page 17: Scalability

• There is no centralized server
• Parameter tuner can be run on any computation resource
• Profiler data is 100s of bytes per host
• There can be multiple data placement schedulers

Page 18: The Big Picture

[Figure: the same architecture diagram (Monitoring Infrastructure @ Host 1 and @ Host 2, Tuning Infrastructure, Data Placement Scheduler)]


Page 20: Dynamic Protocol Selection

• Determines the protocols available on the different hosts
• Creates a list of hosts and protocols in ClassAd format, e.g.

[ hostname = "quest2.ncsa.uiuc.edu"; protocols = "diskrouter,gridftp,ftp" ]
[ hostname = "nostos.cs.wisc.edu"; protocols = "gridftp,ftp,http" ]

Page 21: Dynamic Protocol Selection

[
  dap_type = "transfer";
  src_url = "any://slic04.sdsc.edu/s/data1";
  dest_url = "any://quest2.ncsa.uiuc.edu/d/data1";
]

• Stork determines an appropriate protocol to use for the transfer
• In case of failure, Stork chooses another protocol
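One way to picture the selection step is as an intersection of the protocol lists advertised by the two hosts. The sketch below is only an illustration under that assumption and does not describe how Stork actually chooses. It reuses the two host entries from the ClassAd example; the function names and the idea of preferring protocols in the order the source host lists them are hypothetical.

```python
from typing import Dict, List


def parse_protocols(value: str) -> List[str]:
    """Split a comma-separated protocol list such as "gridftp,ftp,http"."""
    return [name.strip() for name in value.split(",") if name.strip()]


def common_protocols(src: List[str], dest: List[str]) -> List[str]:
    """Protocols supported by both hosts, kept in the source host's order."""
    supported_by_dest = set(dest)
    return [name for name in src if name in supported_by_dest]


# Host entries copied from the ClassAd example in the slides.
host_protocols: Dict[str, List[str]] = {
    "quest2.ncsa.uiuc.edu": parse_protocols("diskrouter,gridftp,ftp"),
    "nostos.cs.wisc.edu": parse_protocols("gridftp,ftp,http"),
}

candidates = common_protocols(host_protocols["quest2.ncsa.uiuc.edu"],
                              host_protocols["nostos.cs.wisc.edu"])
print(candidates)
```

The first candidate would be tried first, and the remaining ones would be available if that transfer fails.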

Page 22: Alternate Protocol Fallback

[
  dap_type = "transfer";
  src_url = "diskrouter://slic04.sdsc.edu/s/data1";
  dest_url = "diskrouter://quest2.ncsa.uiuc.edu/d/data1";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]

• In case of DiskRouter failure, Stork will switch to other protocols in the order specified
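The fallback behaviour can be sketched as a loop over the primary protocol pair followed by the pairs listed in alt_protocols. This is a hedged illustration rather than Stork's code: the helper names are hypothetical, and `do_transfer` stands in for whatever actually moves the data, returning True on success.

```python
from typing import Callable, List, Optional, Tuple

ProtocolPair = Tuple[str, str]  # (source protocol, destination protocol)


def protocol_order(src_url: str, dest_url: str, alt_protocols: str) -> List[ProtocolPair]:
    """Primary pair taken from the URLs, then the alternates in the order given."""
    order = [(src_url.split("://", 1)[0], dest_url.split("://", 1)[0])]
    for pair in alt_protocols.split(","):
        pair = pair.strip()
        if pair:
            src_proto, dest_proto = pair.split("-", 1)
            order.append((src_proto.strip(), dest_proto.strip()))
    return order


def transfer_with_fallback(src_url: str, dest_url: str, alt_protocols: str,
                           do_transfer: Callable[[str, str], bool]) -> Optional[ProtocolPair]:
    """Try each protocol pair in turn; return the pair that succeeded, or None."""
    src_path = src_url.split("://", 1)[1]
    dest_path = dest_url.split("://", 1)[1]
    for src_proto, dest_proto in protocol_order(src_url, dest_url, alt_protocols):
        if do_transfer(f"{src_proto}://{src_path}", f"{dest_proto}://{dest_path}"):
            return (src_proto, dest_proto)
    return None


# Simulated transfer function: pretend the DiskRouter server is down.
def fake_transfer(src: str, dest: str) -> bool:
    return not src.startswith("diskrouter://")


used = transfer_with_fallback(
    "diskrouter://slic04.sdsc.edu/s/data1",
    "diskrouter://quest2.ncsa.uiuc.edu/d/data1",
    "nest-nest, gsiftp-gsiftp",
    fake_transfer,
)
print(used)
```

A real scheduler would also need to decide when to switch back to the primary protocol once it recovers, which this sketch does not cover.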

Page 23: Real World Experiment

• DPOSS data had to be transferred from SDSC, located in San Diego, to NCSA, located in Chicago

[Figure: transfer from SDSC (slic04.sdsc.edu) to NCSA (quest2.ncsa.uiuc.edu)]

Page 24: Real World Experiment

[Figure: experiment setup. Nodes: SDSC (slic04.sdsc.edu), StarLight (ncdm13.sl.startap.net), NCSA (quest2.ncsa.uiuc.edu) and Management Site (skywalker.cs.wisc.edu). Other labels: GridFTP, DiskRouter, WAN. Legend: control flow, data flow.]

Page 25: Data Transfer from SDSC to NCSA using Run-time Protocol Auto-tuning

[Figure: plot of Transfer Rate (MB/s) against Time, annotated "Auto-tuning turned on" and "Network outage"]

Page 26: Parameter Tuning

Parameter         Before auto-tuning   After auto-tuning
Parallelism       1 TCP stream         4 TCP streams
Block size        1 MB                 1 MB
TCP buffer size   64 KB                256 KB

Page 27: Testing Alternate Protocol Fall-back

[
  dap_type = "transfer";
  src_url = "diskrouter://slic04.sdsc.edu/s/data1";
  dest_url = "diskrouter://quest2.ncsa.uiuc.edu/d/data1";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]

Page 28: Testing Alternate Protocol Fall-back

[Figure: plot of Transfer Rate (MB/s) against Time, annotated "DiskRouter server killed" and "DiskRouter server restarted", with points marked 1 to 4]

Page 29: Conclusion

• Run-time adaptation has a significant impact (20 times improvement in our test case)
• The profiling data has the potential to be used for data mining
– Network misconfigurations
– Network outages
• Dynamic protocol selection and alternate protocol fall-back increase resilience and improve overall throughput

Page 30: Questions?

For more information you can contact:
• George Kola: [email protected] (cs.wisc.edu address)
• Tevfik Kosar: [email protected] (cs.wisc.edu address)

Project web pages:
• Stork: http://cs.wisc.edu/condor/stork
• DiskRouter: http://cs.wisc.edu/condor/diskrouter