A Flexible GridFTP Client for Implementation of Intelligent Cloud Data Scheduling Services
Esma Yildirim, Department of Computer Engineering, Fatih University, Istanbul, Turkey
DATACLOUD 2013


Page 1

A Flexible GridFTP Client for Implementation of

Intelligent Cloud Data Scheduling Services

Esma Yildirim, Department of Computer Engineering, Fatih University, Istanbul, Turkey

DATACLOUD 2013

Page 2

Outline

Data Scheduling Services in the Cloud

File Transfer Scheduling Problem History

Implementation Details of the Client

Example Algorithms

Amazon EC2 Experiments

Conclusions

Page 3

Cloud Data Scheduling Services

Data Clouds strive for novel services for management, analysis, access and scheduling of Big Data

Application-level protocols that provide high performance on high-speed networks are an integral part of data scheduling services

GridFTP and UDP-based protocols are used frequently in modern-day schedulers (e.g. GlobusOnline, StorkCloud)

Page 4

Bottlenecks in Data Scheduling Services

Data is large, diverse and complex

Transferring large datasets faces many bottlenecks:
The transport protocol's underutilization of the network
End-system limitations (e.g. CPU, NIC and disk speed)
Dataset characteristics: many short-duration transfers, connection startup and teardown overhead

Page 5

Optimizations in GridFTP Protocol: Pipelining, Parallelism and Concurrency

Page 6

Application in Data Scheduling Services

Setting optimal parameters for different datasets is a challenging task

Data scheduling services set static values based on experience

The provided tools do not comply with dynamic, intelligent algorithms that might change settings on the fly

Page 7

Goals of the Flexible Client

Flexibility for scalable data scheduling algorithms

On the fly changes to the optimization parameters

Reshaping the dataset characteristics

Page 8

File Transfer Scheduling Problem

Lies at the origin of the data scheduling services

Dates back to 1980s

Earliest approaches: list scheduling
Sort the transfers based on size, bandwidth of the path or duration of the transfer
Near-optimal solution

Integer programming – not feasible to implement

Page 9

File Transfer Scheduling Problem

Scalable approaches:
Transferring from multiple replicas
Divided datasets sent over different paths to make use of additional network bandwidth

Adaptive approaches:
Divide files into multiple portions to send over parallel streams
Divide the dataset into multiple portions and send them at the same time
Adaptively change the level of concurrency or parallelism based on network throughput

Optimization algorithms:
Find optimal settings via modeling and set the optimal parameters once and for all

Page 10

File Transfer Scheduling Problem

Modern-day data scheduling service examples:

Globus Online
Hosted SaaS
Statically set pipelining, concurrency and parallelism

Stork
Multi-protocol support
Finds the optimal parallelism level based on modeling
Static job concurrency

Page 11

Ideal Client Interface

Allow dataset transfers to be:
Enqueued, dequeued
Sorted based on a property
Divided into, and combined from, chunks
Grouped by source-destination paths
Done from multiple replicas
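The interface above can be sketched as a small queue-like dataset abstraction. This is an illustrative sketch, not the client's actual API: the `FileEntry` and `Dataset` names and method signatures are assumptions based on the operations listed.

```python
# Hypothetical sketch of the ideal client interface; class and method
# names are illustrative, not the real client's API.
from dataclasses import dataclass

@dataclass
class FileEntry:
    name: str       # file name, used to reconstruct full paths
    size: int       # file size in bytes
    src: str        # source path
    dst: str        # destination path

class Dataset:
    def __init__(self, files=None):
        self.files = list(files or [])

    def enqueue(self, f):                 # add a transfer to the queue
        self.files.append(f)

    def dequeue(self):                    # remove the next transfer
        return self.files.pop(0)

    def sort_by(self, prop):              # sort based on a property, e.g. "size"
        self.files.sort(key=lambda f: getattr(f, prop))

    def divide(self, n):                  # split into chunks of n files each
        return [Dataset(self.files[i:i + n])
                for i in range(0, len(self.files), n)]

    @staticmethod
    def combine(chunks):                  # merge chunks back into one dataset
        merged = Dataset()
        for c in chunks:
            merged.files.extend(c.files)
        return merged
```

Grouping by source-destination path or switching replicas then reduces to sorting/filtering on the `src` and `dst` fields.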

Page 12

Implementation Details

Shortcomings of globus-url-copy:

Does not allow even a static setting of the pipelining level; it uses its own default value, invisible to the user

globus-url-copy -pp -p 5 -cc 4 src_url dest_url

A directory of files cannot be divided and given different optimization parameters per chunk
The filelist option helps, but it cannot apply pipelining on the list, as the developers indicate

globus-url-copy -pp -p 5 -cc 4 -f filelist.txt

Page 13

Implementation Details

File data structure properties:

File size: used to construct data chunks based on total size, for throughput calculation and for transfer duration calculation

Source and destination paths: necessary for combining and dividing datasets, and for changing the source path based on replica location

File name: necessary to reconstruct full paths
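As a minimal sketch of how the size property feeds the calculations mentioned above, the helper names below are assumptions (the slides do not name these functions); throughput here is expressed in Mbit/s.

```python
# Illustrative helpers built on the file-size property: chunk totals,
# achieved throughput, and estimated transfer duration.
def chunk_size_bytes(sizes):
    """Total bytes in a chunk, given its files' sizes."""
    return sum(sizes)

def throughput_mbps(total_bytes, seconds):
    """Achieved throughput of a finished chunk transfer, in Mbit/s."""
    return (total_bytes * 8) / (seconds * 1e6)

def estimated_duration(total_bytes, mbps):
    """Predicted transfer time in seconds at a given throughput."""
    return (total_bytes * 8) / (mbps * 1e6)
```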

Page 14

Implementation Details

Listing the files for a given path:
Contacts the GridFTP server
Pulls information about the files in the given path
Provides a list of file data structures, including the number of files
Makes it easier to divide, combine, sort, enqueue and dequeue on a list of files
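The listing step resembles parsing an MLSD-style directory listing (RFC 3659), which FTP/GridFTP servers can return. The sketch below only shows the parsing into file data structures; the server contact itself is omitted, and the dict keys are illustrative.

```python
# Sketch: turn MLSD-style listing lines ("fact=value;...; name") into
# a list of file data structures for a given source/destination path.
def parse_mlsd(lines, src_path, dst_path):
    files = []
    for line in lines:
        facts_part, _, name = line.partition(" ")
        facts = dict(f.split("=", 1)
                     for f in facts_part.rstrip(";").split(";") if "=" in f)
        if facts.get("Type", "").lower() == "file":   # skip directories
            files.append({"name": name,
                          "size": int(facts.get("Size", 0)),
                          "src": src_path, "dst": dst_path})
    return files
```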

Page 15

Implementation Details

Performing the actual transfer:
Sets the optimization parameters on a list of files returned by the list function and manipulated by different algorithms
For a data chunk, it sets the parallel stream, concurrency and pipelining values

Page 16

Example Algorithms 1: Adaptive Concurrency

Takes a file list structure returned by the list function as input

Divides the file list into chunks based on the number of files in a chunk

Starting with a concurrency level of 1, transfers each chunk with an exponentially increasing concurrency level as long as the throughput increases with each chunk transfer

If the throughput drops, the concurrency level is adaptively decreased for the subsequent chunk transfer
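The loop above can be sketched as follows. This is a reconstruction from the slide's prose, assuming "exponentially increasing" means doubling and "decreased" means halving; the `transfer` callback (chunk, cc) -> throughput is hypothetical.

```python
# Sketch of Algorithm 1 (adaptive concurrency): double cc while the
# measured throughput rises, halve it when throughput drops.
def adaptive_concurrency(chunks, transfer):
    """transfer(chunk, cc) -> achieved throughput; returns (cc, thr) history."""
    cc, prev_thr = 1, 0.0
    history = []
    for chunk in chunks:
        thr = transfer(chunk, cc)        # transfer one chunk at level cc
        history.append((cc, thr))
        if thr > prev_thr:
            cc *= 2                      # throughput increased: double cc
        else:
            cc = max(1, cc // 2)         # throughput dropped: back off
        prev_thr = thr
    return history
```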

Page 17

Example Algorithms 1: Adaptive Concurrency

Page 18

Example Algorithm 2: Optimal Pipelining

Mean-based algorithm to construct clusters of files with different optimal pipelining levels

Calculates the optimal pipelining level by dividing the BDP (bandwidth-delay product) by the mean file size of the chunk

The dataset is recursively divided at the mean-file-size index as long as the following conditions are met:
A chunk can only be divided further if its pipelining level differs from its parent chunk's
A chunk cannot be smaller than a preset minimum chunk size
The optimal pipelining level for a chunk cannot be greater than a preset maximum pipelining level
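A sketch of the mean-based recursive clustering, reconstructed from the conditions above. The minimum chunk size is expressed here in file count and the constants (`min_chunk`, `pp_max`) are illustrative; the slides do not give concrete values.

```python
import math

def pp_opt(bdp, sizes, pp_max):
    """Optimal pipelining level: ceil(BDP / mean file size), capped at pp_max."""
    mean = sum(sizes) / len(sizes)
    return min(pp_max, max(1, math.ceil(bdp / mean)))

def cluster(sizes, bdp, min_chunk=2, pp_max=30, parent_pp=None):
    """Recursively split a sorted size list at the mean-file-size index.
    Returns a list of (sizes, pp) chunks."""
    pp = pp_opt(bdp, sizes, pp_max)
    # stop: same pp as the parent chunk, or chunk too small to split
    if pp == parent_pp or len(sizes) < 2 * min_chunk:
        return [(sizes, pp)]
    mean = sum(sizes) / len(sizes)
    idx = next((i for i, s in enumerate(sizes) if s > mean), len(sizes))
    if idx < min_chunk or len(sizes) - idx < min_chunk:
        return [(sizes, pp)]             # a half would be below the minimum
    return (cluster(sizes[:idx], bdp, min_chunk, pp_max, pp)
            + cluster(sizes[idx:], bdp, min_chunk, pp_max, pp))
```

Small files end up in chunks with a high pipelining level, large files in chunks with a low one, matching the intent of the algorithm.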

Page 19

Example Algorithm 2-a: Optimal Pipelining

Page 20

Example Algorithm 2-b: Optimal Pipelining and Concurrency

After the recursive division of chunks, the optimal pipelining level (pp_opt) is set for each chunk

Chunks go through a revision phase in which smaller chunks are combined and larger chunks are further divided

Starting with cc = 1, each chunk is transferred with exponentially increased cc levels until the throughput drops

The rest of the chunks are transferred with the optimal cc level
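The revision phase might look like the sketch below. The byte thresholds and the merge-into-previous policy are assumptions; the slides only state that small chunks are combined and large ones divided.

```python
# Sketch of the revision phase: merge chunks below a minimum byte count
# into their predecessor, split chunks above a maximum in half.
def revise(chunks, min_bytes, max_bytes):
    """chunks: list of lists of file sizes (bytes)."""
    out = []
    for c in chunks:
        total = sum(c)
        if total > max_bytes and len(c) > 1:
            mid = len(c) // 2
            out += [c[:mid], c[mid:]]       # split an oversized chunk
        elif out and total < min_bytes:
            out[-1] = out[-1] + c           # merge a small chunk backward
        else:
            out.append(c)
    return out
```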

Page 21

Example Algorithm 2-b: Optimal Pipelining and Concurrency

Page 22

Amazon EC2 Experiments

Large nodes with 2 vCPUs, 8 GB storage, 7.5 GB memory and moderate network performance

50 ms artificial delay

Globus Provision is used for automatic setup of the servers

Datasets consist of many small files (the most difficult optimization case):
5000 1 MB files
1000 random-size files in the range 1 byte to 10 MB

Page 23

Amazon EC2 Experiments: 5000 1 MB files

Baseline performance: default pipelining + data channel caching

The throughput achieved is higher than the baseline for the majority of cases

Page 24

Amazon EC2 Experiments: 1000 random size files

Page 25

Conclusions

The flexible GridFTP client is able to accommodate data scheduling algorithms of different natures

Adaptive and optimization algorithms can easily sort, divide and combine datasets

It becomes possible to implement intelligent cloud scheduling services in an easier way

Page 26

Questions?