Survey on Programming and Tasking in Cloud Computing Environments

PhD Qualifying Exam
Zhiqiang Ma
Supervisor: Lin Gu
Feb. 18, 2011
Outline
- Introduction
- Approaches
  - Application framework level approach
  - Language level approach
  - Instruction level approach
- Our work: MRlite
- Conclusion
Cloud computing
- Internet services are the most popular applications nowadays
  - Millions of users
  - The computation is large and complex: Google already processed 20 TB of data in 2004
- Cloud computing provides massive computing resources
  - Available on demand
Cloud computing is a promising model for processing large datasets stored on clusters.

How to program and task?
- Challenges
  - Parallelizing the execution
  - Scheduling the large-scale distributed computation
  - Handling faults
  - Achieving high performance
  - Ensuring fairness
- Programming models for the Grid
  - Do not automatically parallelize users' programs
  - Pass the fault-tolerance work on to applications
Approaches

The approaches are compared at three levels, by advantage and disadvantage:
- Application framework level
- Language level
- Instruction level
MapReduce
- MapReduce: a parallel computing framework for large-scale data processing
- Successfully used in datacenters comprising commodity computers
- A fundamental piece of software in the Google architecture for many years
- An open-source variant already exists: Hadoop
- Widely used in solving data-intensive problems
MapReduce
- Map and Reduce are higher-order functions
  - Map: apply an operation to all elements in a list
  - Reduce: like "fold", aggregate the elements of a list
Example: compute 1² + 2² + 3² + 4² + 5².
- Map with m: x → x² turns the list 1, 2, 3, 4, 5 into 1, 4, 9, 16, 25.
- Reduce with r: + folds the squares, starting from the initial value 0, through the running sums 1, 5, 14, 30, to the final value 55.
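As a quick illustration (not part of the original slides), the same computation in Python uses the built-in higher-order functions map and reduce:

    from functools import reduce

    xs = [1, 2, 3, 4, 5]
    squares = map(lambda x: x * x, xs)              # map m: x -> x^2 gives 1, 4, 9, 16, 25
    total = reduce(lambda a, b: a + b, squares, 0)  # fold with r: +, initial value 0
    print(total)                                    # 55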
MapReduce's data flow

(Figure: the MapReduce data flow.)
MapReduce: massive parallel processing made simple
- Example: word count
  - Map: parse a document and generate <word, 1> pairs
  - Reduce: receive all pairs for a specific word, and count them
Map:
    // D is a document
    for each word w in D
        output <w, 1>

Reduce for key w:
    count = 0
    for each input item
        count = count + 1
    output <w, count>
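A minimal single-machine sketch of this word count (illustrative only; a real Hadoop job would implement this as Mapper and Reducer classes, with the framework performing the shuffle):

    from collections import defaultdict

    def map_doc(doc):
        # Map: emit a <word, 1> pair for every word in the document
        for word in doc.split():
            yield word, 1

    def reduce_word(word, counts):
        # Reduce: sum all counts received for one word
        return word, sum(counts)

    docs = ["the quick brown fox", "the lazy dog"]
    # Shuffle: group intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for doc in docs:
        for word, one in map_doc(doc):
            groups[word].append(one)
    print(dict(reduce_word(w, cs) for w, cs in groups.items()))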
MapReduce easily scales up

(Figure: input files → map phase → intermediate files → reduce phase → output files.)
(Figure: MapReduce viewed as input → computation → output.)
Dryad
- A general-purpose execution environment for distributed, data-parallel applications
- Concentrates on throughput, not latency
- An application written in Dryad is modeled as a directed acyclic graph (DAG)
- Many programs can be represented as a distributed execution graph
Dryad

(Figure: a Dryad job graph; processing vertices connected by channels (file, pipe, shared memory), with inputs and outputs at the edges of the graph.)
Dryad
- Concurrency arises from vertices running simultaneously across multiple machines
- Vertex subroutines are usually quite simple sequential programs
- Users have control over the communication graph
  - Each vertex can have multiple inputs and outputs
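To make the DAG model concrete, here is a hypothetical sketch (not Dryad's actual API) of a job expressed as vertices connected by channels, where a vertex runs once all of its inputs are available:

    # Hypothetical dataflow-DAG sketch; Dryad's real interface differs.
    class Vertex:
        def __init__(self, name, func, inputs=()):
            self.name, self.func, self.inputs = name, func, inputs

    def run(vertices):
        # Topological execution: a vertex runs after all its inputs have run.
        results = {}
        def evaluate(v):
            if v.name not in results:
                results[v.name] = v.func(*(evaluate(i) for i in v.inputs))
            return results[v.name]
        for v in vertices:
            evaluate(v)
        return results

    a = Vertex("read", lambda: [3, 1, 2])
    b = Vertex("sort", lambda xs: sorted(xs), inputs=(a,))
    c = Vertex("sum", lambda xs: sum(xs), inputs=(a,))
    d = Vertex("join", lambda s, t: (s, t), inputs=(b, c))  # a vertex with two inputs
    print(run([d])["join"])  # ([1, 2, 3], 6)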
Approaches

Approach: Application framework level
  Advantage: relieves users of the details of distributing the execution; automatically parallelizes users' programs
  Disadvantage: programs must follow the specific model
Approach: Language level (next)
Approach: Instruction level (later)
Tasking of execution
- Performance
  - Locality is crucial
  - Speculative execution
- Fairness
  - The same cluster is shared by multiple users
  - Small jobs require short response times, while throughput is important for big jobs
- Correctness
  - Fault tolerance
Locality and fairness
- Locality is crucial
  - Bandwidth is a scarce resource
  - Input data, with duplicates, are stored in the same cluster that runs the executions
- Fairness
  - Short jobs require short response times

Locality and fairness conflict with each other.
FIFO scheduler in Hadoop
- Jobs wait in a queue in priority order
  - FIFO by default
- When there are available slots:
  - Assign slots, in priority order, to tasks that have local data
  - Limit the assignment of non-local tasks to optimize locality; a toy sketch follows
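A toy sketch of this policy (the job and task structures here are hypothetical; Hadoop's real JobTracker logic is more involved):

    # Toy FIFO scheduler with locality optimization; structures are hypothetical.
    def assign_slot(job_queue, node):
        # First pass: a task whose input data lives on this node, in priority order.
        for job in job_queue:
            for task in job.pending_tasks:
                if node in task.local_nodes:
                    return task
        # Fallback: dispatch a non-local task; since one call fills one slot,
        # at most one non-local task is dispatched per scheduling round.
        for job in job_queue:
            if job.pending_tasks:
                return job.pending_tasks[0]
        return None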
FIFO scheduler

(Figure: a job queue holding a 2-task job and a 1-task job being dispatched onto nodes 1-4.)
FIFO scheduler - locality optimization

(Figure: a 4-task job and a 1-task job in the queue; one node is far away in the network topology, so the scheduler dispatches only one non-local task at a time.)
Problem: fairness

(Figure: two 3-task jobs occupy the slots on nodes 1-4, so jobs behind them in the queue get no share of the cluster.)
Problem: response time

(Figure: a small job with only 1 task waits in the queue behind two 3-task jobs occupying nodes 1-4, so its response time suffers.)
Fair scheduling
- Assign free slots to the job that has the fewest running tasks
- Strict fairness
  - Running jobs get nearly equal numbers of slots
  - Small jobs finish quickly (a minimal sketch follows)
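A minimal sketch of the fair policy, with the same hypothetical structures as the FIFO sketch above:

    # Toy fair scheduler: give each free slot to the job with the fewest
    # currently running tasks, so slots spread nearly evenly across jobs.
    def assign_slot_fair(jobs):
        candidates = [j for j in jobs if j.pending_tasks]
        if not candidates:
            return None
        job = min(candidates, key=lambda j: j.running_tasks)
        return job.pending_tasks[0]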
Fair scheduling

(Figure: slots on nodes 1-4 are shared nearly equally among the jobs in the queue.)
Problem: locality

(Figure: under fair sharing, the job that is due the next slot has no input data on the free node, so it must launch a non-local task.)
Delay scheduling
- Skip a job that cannot launch a local task
  - Relaxes fairness slightly
- Allow a job to launch non-local tasks if it has been skipped long enough
  - Avoids starvation (see the sketch after the example below)
Delay scheduling

(Figure: each job keeps a skip count against a threshold of 2; a job skipped twice is finally allowed to launch a non-local task on nodes 1-4. The waiting time is short: tasks finish quickly, and the skipped job stays at the head of the queue.)
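A compact sketch of delay scheduling layered on the fair policy (again with hypothetical structures; jobs are assumed to start with skip_count = 0):

    # Toy delay scheduling: skip a job that would run non-locally, but only
    # up to a threshold, after which it may launch a non-local task.
    SKIP_THRESHOLD = 2

    def assign_slot_delay(jobs, node):
        candidates = sorted((j for j in jobs if j.pending_tasks),
                            key=lambda j: j.running_tasks)   # fair order
        for job in candidates:
            local = [t for t in job.pending_tasks if node in t.local_nodes]
            if local:
                job.skip_count = 0
                return local[0]
            if job.skip_count >= SKIP_THRESHOLD:   # skipped long enough
                job.skip_count = 0
                return job.pending_tasks[0]        # allow a non-local task
            job.skip_count += 1                    # relax fairness, try the next job
        return None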
"Fault" tolerance
- Nodes fail
  - Re-run tasks
- Nodes are slow (stragglers)
  - Run backup tasks (speculative execution)
  - Minimizes the job's response time
  - Important for short jobs
Speculative execution
- The scheduler schedules backup executions of the remaining in-progress tasks
- A task is marked as completed whenever either the primary or the backup execution completes
- Improves job response time by 44% according to Google's experiments
Speculative execution mechanism
- Seems a simple problem, but
  - Resources for speculative tasks are not free
  - How to choose the nodes that run speculative tasks?
  - How to distinguish stragglers from nodes that are only slightly slower?
    - Stragglers should be found early
Hadoop's scheduler
- Starts speculative tasks based on a simple heuristic
  - Compares each task's progress to the average
- Assumes a homogeneous environment
  - There, the default scheduler works well
  - Broken in utility computing, i.e. virtualized "utility computing" environments such as EC2

How can speculative execution (backup tasks) be performed robustly in heterogeneous environments?
Speculative execution in Hadoop
- When there are no "higher priority" tasks, the scheduler looks for a task to execute speculatively
  - Assumption: there is no cost to launching a speculative task
- Compares each task's progress to the average progress
  - Assumption: nodes perform similarly ("a slow node is faulty"; "nodes that ask for new tasks are fast")
  - In "utility computing", nodes may be only slightly (2-3x) slower, which may not hurt the response time, and a node that asks for tasks is not necessarily fast
Speculative execution in Hadoop
- Threshold for speculative execution
  - (Average progress score of each category of tasks) - 0.2
  - Tasks below the threshold are treated as "equally slow"
  - Candidates are ranked by locality
- The wrong tasks may be chosen
  - A 35%-completed, 2x-slower task with data available on an idle node, or a 5%-completed, 10x-slower task?
- Too many speculative tasks cause thrashing
  - Taking resources away from useful tasks
Speculative execution in Hadoop
- Progress score
  - Map: the fraction of input data processed
  - Reduce: three phases (1/3 of the score each), plus the fraction of data processed within the phase
- Incorrect speculation of reduce tasks
  - The copy phase takes most of the time but accounts for only 1/3 of the score
  - If 30% of tasks finish quickly and 70% are in the copy phase: average progress = 30% x 1 + 70% x 1/3 ≈ 53%, so the threshold is 33%
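To make the arithmetic concrete, a quick check in Python (only the figures above come from the slides):

    # 30% of reduce tasks are done (score 1.0); 70% sit in the copy phase,
    # which contributes at most 1/3 of the progress score.
    avg = 0.3 * 1.0 + 0.7 * (1 / 3)   # ~0.533
    threshold = avg - 0.2             # ~0.333
    copy_phase_score = 1 / 3          # ~0.333
    # Copy-phase reducers hover right at the threshold, so the slightest lag
    # marks a healthy task as slow and triggers needless speculation.
    print(round(avg, 3), round(threshold, 3), copy_phase_score <= threshold + 1e-9)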
LATE
- Longest Approximate Time to End
- Principles
  - Rank candidates by the longest time to end
    - Choose the task that really hurts the job's response time; slow nodes can be utilized as long as they don't hurt the response time
  - Only launch speculative tasks on fast nodes
    - Not every node that asks for a task is fast
  - Cap the number of speculative tasks
    - Limits resource contention and thrashing
LATE algorithm
- If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running:
  - Ignore the request if the node's total progress is below SlowNodeThreshold
  - Rank currently running tasks by estimated time left
  - Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
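A sketch of that decision logic (task and node fields are hypothetical; the threshold names follow the slide, and the time-left estimate is the one derived in the appendix):

    # Toy LATE speculation decision; see the appendix for the time-left estimate.
    def estimated_time_left(task):
        progress_rate = task.progress_score / task.execution_time
        return (1 - task.progress_score) / progress_rate

    def maybe_speculate(node, running_tasks, num_speculative,
                        speculative_cap, slow_node_thresh, slow_task_thresh):
        if num_speculative >= speculative_cap:
            return None                      # cap reached: no new backup tasks
        if node.total_progress < slow_node_thresh:
            return None                      # only speculate on fast nodes
        # Rank running tasks by longest estimated time to end.
        for task in sorted(running_tasks, key=estimated_time_left, reverse=True):
            rate = task.progress_score / task.execution_time
            if rate < slow_task_thresh and not task.has_backup:
                return task                  # launch a backup of this task
        return None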
Language level approach
- Programming frameworks
  - Still not clear and compact enough
- Traditional programming languages
  - Give no special focus to high parallelism on large computing clusters
- A new language
  - Clear, compact and expressive
  - Automatically parallelizes "normal" programs
  - A comfortable way for users to think about data processing problems on large distributed datasets
Sawzall
- An interpreted, procedural, high-level programming language
- Exploits high parallelism
- Automates the analysis of very large data sets
- Gives users a way to clearly and expressively design distributed data processing programs
Overall flow
- Filtering (the Map side)
  - Analyzes each record individually
  - Expressed in Sawzall
- Aggregation (the Reduce side)
  - Collates and reduces the intermediate values
  - Predefined aggregators
An example: find the most-linked-to page of each domain

    max_pagerank_url: table maximum(1)[domain:string] of url:string weight pagerank:int;
    doc: Document = input;
    emit max_pagerank_url[domain(doc.url)] <- doc.url weight doc.pagerank;

- The table is a maximum aggregator keeping the highest-weight value: it stores a url, indexed by domain and weighted by pagerank
- input: a pre-defined variable, initialized by Sawzall and interpreted into the Document type
- emit: sends an intermediate value to the aggregator
Unusual features
- Sawzall runs on one record at a time
  - Nothing in the language lets one input record influence another
- The emit statement is the only output primitive
  - Draws an explicit line between filtering and aggregation

This enables a high degree of parallelism, even though it is hidden from the language.
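For readers unfamiliar with Sawzall, here is a rough Python analogue of what the maximum(1) table does with emitted values; the records are made up, and the real runtime performs this aggregation across many machines:

    # Emulate: table maximum(1)[domain: string] of url: string weight pagerank: int
    best = {}  # domain -> (pagerank, url), keeping the highest-weight entry

    def emit(domain, url, pagerank):
        if domain not in best or pagerank > best[domain][0]:
            best[domain] = (pagerank, url)

    records = [("a.com", "http://a.com/x", 10),
               ("a.com", "http://a.com/y", 42),
               ("b.org", "http://b.org/z", 7)]
    for domain, url, pagerank in records:   # one record at a time, independently
        emit(domain, url, pagerank)
    print(best)  # {'a.com': (42, 'http://a.com/y'), 'b.org': (7, 'http://b.org/z')}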
Approaches

Approach: Application framework level
  Advantage: relieves users of the details of distributing the execution; automatically parallelizes users' programs
  Disadvantage: programs must follow the specific model
Approach: Language level
  Advantage: clearer, more expressive; a comfortable way to program
  Disadvantage: a more restrictive programming model
Approach: Instruction level (next)
Instruction level approach
- Provides an instruction-level abstraction and compatibility for users' applications
- May choose a traditional ISA such as x86/x86-64
- Runs traditional applications without any modification
- Makes it easier to migrate applications to cloud computing environments
Amazon Elastic Compute Cloud (EC2)
- Provides virtual machines that run traditional OSs
  - Traditional programs can work on EC2
- Amazon Machine Image (AMI)
  - Boots instances
  - The unit of deployment; a packaged-up environment
  - Users design and implement the application logic in an AMI; EC2 handles the deployment and resource allocation
vNUMA
- A virtual shared-memory multiprocessor machine built from commodity workstations
- Makes the computational power available to legacy applications and OSs
(Figure: ordinary virtualization runs many VMs on one physical machine; vNUMA runs one VM across many physical machines.)
Architecture
- Hypervisor
  - On each node
- CPU
  - Virtual CPUs are mapped to real CPUs on the nodes
- Memory
  - Divided between the nodes in equal-sized portions
  - Each node manages a subset of the pages
Memory mapping

(Figure: an application on the VM issues read *a; the address a is translated, within the application's virtual address space, to the VM's physical memory address b; the VMM maps b to the real physical address c on a node and finds *c.)
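A schematic of that two-level translation (purely illustrative; vNUMA implements it in the hypervisor with page tables and a distributed-shared-memory protocol, not lookup tables):

    # Guest page table: application virtual address -> VM "physical" address.
    guest_page_table = {0x1000: 0x8000}
    # VMM map: VM physical address -> (node, machine address) in the cluster.
    vmm_map = {0x8000: ("node2", 0x30000)}
    # Memory contents on each physical node.
    node_memory = {"node2": {0x30000: 42}}

    def vm_read(vaddr):
        b = guest_page_table[vaddr]   # translate a -> b (guest OS)
        node, c = vmm_map[b]          # map b -> real address c on some node (VMM)
        return node_memory[node][c]   # fetch *c, possibly over the network

    print(vm_read(0x1000))  # 42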
Approaches

Approach: Application framework level
  Advantage: relieves users of the details of distributing the execution; automatically parallelizes users' programs
  Disadvantage: programs must follow the specific model
Approach: Language level
  Advantage: clearer, more expressive; a comfortable way to program
  Disadvantage: a more restrictive programming model
Approach: Instruction level
  Advantage: supports traditional applications
  Disadvantage: users handle the tasking; hard to scale up
Our work
- Analyze MapReduce's design and use a case study to probe its limitations
  - One-way scalability
  - Difficult to handle dynamic, interactive and semantic-rich applications
- Design a new parallelization framework: MRlite
  - Able to scale "up" like MapReduce, and to scale "down" to process moderate-size data
  - Low latency and massive parallelism
  - Small run-time system overhead

Goal: design a general parallelization framework and programming paradigm for cloud computing.
Architecture of MRlite

(Figure: the MRlite architecture. An application is linked with the MRlite client library, which accepts calls from the app and submits jobs to the master. The MRlite master accepts jobs from clients and schedules them to execute on the slaves; the distributed slave nodes accept tasks from the master and execute them. A high-speed distributed storage holds the intermediate files. The arrows distinguish data flow from command flow.)
Result

The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty handling.
Conclusion
- Cloud computing needs a general programming framework
  - Cloud computing shall not be a platform that runs only simple OLAP applications; it is important to support complex computation, and even OLTP, on large data sets
- We design MRlite: a general parallelization framework for cloud computing
  - Handles applications with complex logic flow and data dependencies
  - Mitigates the one-way scalability problem
  - Able to handle all MapReduce tasks with comparable (if not better) performance
Conclusion
- Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
- MRlite respects applications' natural logic flow and data dependencies
- This decoupling of parallelization capability from application logic enables MRlite to integrate GPGPU processing very easily (future work)
Thank you!
Appendix
LATE: estimating finish times

progress rate = progress score / execution time

estimated time left = (1 - progress score) / progress rate
                    = execution time × (1 / progress score - 1)

The smaller the progress score, the longer the estimated time left.
LATE: solving the problems in Hadoop's default scheduler
- In "utility computing", nodes may be only slightly (2-3x) slower, which may not hurt the response time, and a node that asks for tasks is not necessarily fast
- Too many speculative tasks and thrashing
- Candidates ranked only by locality
  - The wrong tasks may be chosen
- Incorrect speculation of reducers