Survey on Programming and Tasking in Cloud Computing Environments

PhD Qualifying Exam
Zhiqiang Ma
Supervisor: Lin Gu
Feb. 18, 2011
Outline
- Introduction
- Approaches
  - Application framework level approach
  - Language level approach
  - Instruction level approach
- Our work: MRlite
- Conclusion
Cloud computing
- Internet services are the most popular applications nowadays
  - Millions of users
  - The computation is large and complex: Google already processed 20 TB of data in 2004
- Cloud computing provides massive computing resources
  - Available on demand
Cloud computing is a promising model for processing large datasets stored on clusters.

How to program and task?
- Challenges
  - Parallelizing the execution
  - Scheduling the large-scale distributed computation
  - Handling faults
  - Achieving high performance
  - Ensuring fairness
- Programming models for the Grid
  - Do not automatically parallelize users' programs
  - Pass the fault-tolerance work on to applications
Approaches

The approaches are compared at three levels, by advantage and disadvantage:
- Application framework level
- Language level
- Instruction level
MapReduce
- MapReduce: a parallel computing framework for large-scale data processing
- Successfully used in datacenters comprising commodity computers
- A fundamental piece of software in the Google architecture for many years
- An open-source variant already exists: Hadoop
- Widely used in solving data-intensive problems
MapReduce
- Map and Reduce are higher-order functions
  - Map: apply an operation to all elements in a list
  - Reduce: like "fold", aggregate the elements of a list
Example: compute 1² + 2² + 3² + 4² + 5².
- Map with m: x → x² turns the list 1, 2, 3, 4, 5 into 1, 4, 9, 16, 25.
- Reduce with r: + folds the squares, starting from the initial value 0, through the running sums 1, 5, 14, 30, to the final value 55.
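As a quick illustration (not part of the original slides), the same computation in Python uses the built-in higher-order functions map and reduce:

    from functools import reduce

    xs = [1, 2, 3, 4, 5]
    squares = map(lambda x: x * x, xs)              # map m: x -> x^2 gives 1, 4, 9, 16, 25
    total = reduce(lambda a, b: a + b, squares, 0)  # fold with r: +, initial value 0
    print(total)                                    # 55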
MapReduce's data flow

(Figure: the MapReduce data flow.)
MapReduce: massive parallel processing made simple
- Example: word count
  - Map: parse a document and generate <word, 1> pairs
  - Reduce: receive all pairs for a specific word, and count them
Map:
    // D is a document
    for each word w in D
        output <w, 1>

Reduce for key w:
    count = 0
    for each input item
        count = count + 1
    output <w, count>
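A minimal single-machine sketch of this word count (illustrative only; a real Hadoop job would implement this as Mapper and Reducer classes, with the framework performing the shuffle):

    from collections import defaultdict

    def map_doc(doc):
        # Map: emit a <word, 1> pair for every word in the document
        for word in doc.split():
            yield word, 1

    def reduce_word(word, counts):
        # Reduce: sum all counts received for one word
        return word, sum(counts)

    docs = ["the quick brown fox", "the lazy dog"]
    # Shuffle: group intermediate pairs by key, as the framework would
    groups = defaultdict(list)
    for doc in docs:
        for word, one in map_doc(doc):
            groups[word].append(one)
    print(dict(reduce_word(w, cs) for w, cs in groups.items()))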
MapReduce easily scales up

(Figure: input files → map phase → intermediate files → reduce phase → output files.)
(Figure: MapReduce viewed as input → computation → output.)
Dryad
- A general-purpose execution environment for distributed, data-parallel applications
- Concentrates on throughput, not latency
- An application written in Dryad is modeled as a directed acyclic graph (DAG)
- Many programs can be represented as a distributed execution graph
Dryad

(Figure: a Dryad job graph; processing vertices connected by channels (file, pipe, shared memory), with inputs and outputs at the edges of the graph.)
Dryad
- Concurrency arises from vertices running simultaneously across multiple machines
- Vertex subroutines are usually quite simple sequential programs
- Users have control over the communication graph
  - Each vertex can have multiple inputs and outputs
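To make the DAG model concrete, here is a hypothetical sketch (not Dryad's actual API) of a job expressed as vertices connected by channels, where a vertex runs once all of its inputs are available:

    # Hypothetical dataflow-DAG sketch; Dryad's real interface differs.
    class Vertex:
        def __init__(self, name, func, inputs=()):
            self.name, self.func, self.inputs = name, func, inputs

    def run(vertices):
        # Topological execution: a vertex runs after all its inputs have run.
        results = {}
        def evaluate(v):
            if v.name not in results:
                results[v.name] = v.func(*(evaluate(i) for i in v.inputs))
            return results[v.name]
        for v in vertices:
            evaluate(v)
        return results

    a = Vertex("read", lambda: [3, 1, 2])
    b = Vertex("sort", lambda xs: sorted(xs), inputs=(a,))
    c = Vertex("sum", lambda xs: sum(xs), inputs=(a,))
    d = Vertex("join", lambda s, t: (s, t), inputs=(b, c))  # a vertex with two inputs
    print(run([d])["join"])  # ([1, 2, 3], 6)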
Approaches

Approach: Application framework level
  Advantage: relieves users of the details of distributing the execution; automatically parallelizes users' programs
  Disadvantage: programs must follow the specific model
Approach: Language level (next)
Approach: Instruction level (later)
Tasking of execution
- Performance
  - Locality is crucial
  - Speculative execution
- Fairness
  - The same cluster is shared by multiple users
  - Small jobs require short response times, while throughput is important for big jobs
- Correctness
  - Fault tolerance
Locality and fairness
- Locality is crucial
  - Bandwidth is a scarce resource
  - Input data, with duplicates, are stored in the same cluster that runs the executions
- Fairness
  - Short jobs require short response times

Locality and fairness conflict with each other.
FIFO scheduler in Hadoop
- Jobs wait in a queue in priority order
  - FIFO by default
- When there are available slots:
  - Assign slots, in priority order, to tasks that have local data
  - Limit the assignment of non-local tasks to optimize locality; a toy sketch follows
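A toy sketch of this policy (the job and task structures here are hypothetical; Hadoop's real JobTracker logic is more involved):

    # Toy FIFO scheduler with locality optimization; structures are hypothetical.
    def assign_slot(job_queue, node):
        # First pass: a task whose input data lives on this node, in priority order.
        for job in job_queue:
            for task in job.pending_tasks:
                if node in task.local_nodes:
                    return task
        # Fallback: dispatch a non-local task; since one call fills one slot,
        # at most one non-local task is dispatched per scheduling round.
        for job in job_queue:
            if job.pending_tasks:
                return job.pending_tasks[0]
        return None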
FIFO scheduler

(Figure: a job queue holding a 2-task job and a 1-task job being dispatched onto nodes 1-4.)
FIFO scheduler - locality optimization

(Figure: a 4-task job and a 1-task job in the queue; one node is far away in the network topology, so the scheduler dispatches only one non-local task at a time.)
Problem: fairness

(Figure: two 3-task jobs occupy the slots on nodes 1-4, so jobs behind them in the queue get no share of the cluster.)
Problem: response time

(Figure: a small job with only 1 task waits in the queue behind two 3-task jobs occupying nodes 1-4, so its response time suffers.)
Fair scheduling
- Assign free slots to the job that has the fewest running tasks
- Strict fairness
  - Running jobs get nearly equal numbers of slots
  - Small jobs finish quickly (a minimal sketch follows)
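A minimal sketch of the fair policy, with the same hypothetical structures as the FIFO sketch above:

    # Toy fair scheduler: give each free slot to the job with the fewest
    # currently running tasks, so slots spread nearly evenly across jobs.
    def assign_slot_fair(jobs):
        candidates = [j for j in jobs if j.pending_tasks]
        if not candidates:
            return None
        job = min(candidates, key=lambda j: j.running_tasks)
        return job.pending_tasks[0]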
Fair scheduling

(Figure: slots on nodes 1-4 are shared nearly equally among the jobs in the queue.)
Problem: locality

(Figure: under fair sharing, the job that is due the next slot has no input data on the free node, so it must launch a non-local task.)
Delay scheduling
- Skip a job that cannot launch a local task
  - Relaxes fairness slightly
- Allow a job to launch non-local tasks if it has been skipped long enough
  - Avoids starvation (see the sketch after the example below)
Delay scheduling

(Figure: each job keeps a skip count against a threshold of 2; a job skipped twice is finally allowed to launch a non-local task on nodes 1-4. The waiting time is short: tasks finish quickly, and the skipped job stays at the head of the queue.)
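A compact sketch of delay scheduling layered on the fair policy (again with hypothetical structures; jobs are assumed to start with skip_count = 0):

    # Toy delay scheduling: skip a job that would run non-locally, but only
    # up to a threshold, after which it may launch a non-local task.
    SKIP_THRESHOLD = 2

    def assign_slot_delay(jobs, node):
        candidates = sorted((j for j in jobs if j.pending_tasks),
                            key=lambda j: j.running_tasks)   # fair order
        for job in candidates:
            local = [t for t in job.pending_tasks if node in t.local_nodes]
            if local:
                job.skip_count = 0
                return local[0]
            if job.skip_count >= SKIP_THRESHOLD:   # skipped long enough
                job.skip_count = 0
                return job.pending_tasks[0]        # allow a non-local task
            job.skip_count += 1                    # relax fairness, try the next job
        return None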
"Fault" tolerance
- Nodes fail
  - Re-run tasks
- Nodes are slow (stragglers)
  - Run backup tasks (speculative execution)
  - Minimizes the job's response time
  - Important for short jobs
Speculative execution
- The scheduler schedules backup executions of the remaining in-progress tasks
- A task is marked as completed whenever either the primary or the backup execution completes
- Improves job response time by 44% according to Google's experiments
Speculative execution mechanism
- Seems a simple problem, but
  - Resources for speculative tasks are not free
  - How to choose the nodes that run speculative tasks?
  - How to distinguish stragglers from nodes that are only slightly slower?
    - Stragglers should be found early
Hadoop's scheduler
- Starts speculative tasks based on a simple heuristic
  - Compares each task's progress to the average
- Assumes a homogeneous environment
  - There, the default scheduler works well
  - Broken in utility computing, i.e. virtualized "utility computing" environments such as EC2

How can speculative execution (backup tasks) be performed robustly in heterogeneous environments?
Speculative execution in Hadoop
- When there are no "higher priority" tasks, the scheduler looks for a task to execute speculatively
  - Assumption: there is no cost to launching a speculative task
- Compares each task's progress to the average progress
  - Assumption: nodes perform similarly ("a slow node is faulty"; "nodes that ask for new tasks are fast")
  - In "utility computing", nodes may be only slightly (2-3x) slower, which may not hurt the response time, and a node that asks for tasks is not necessarily fast
Speculative execution in Hadoop
- Threshold for speculative execution
  - (Average progress score of each category of tasks) - 0.2
  - Tasks below the threshold are treated as "equally slow"
  - Candidates are ranked by locality
- The wrong tasks may be chosen
  - A 35%-completed, 2x-slower task with data available on an idle node, or a 5%-completed, 10x-slower task?
- Too many speculative tasks cause thrashing
  - Taking resources away from useful tasks
Speculative execution in Hadoop
- Progress score
  - Map: the fraction of input data processed
  - Reduce: three phases (1/3 of the score each), plus the fraction of data processed within the phase
- Incorrect speculation of reduce tasks
  - The copy phase takes most of the time but accounts for only 1/3 of the score
  - If 30% of tasks finish quickly and 70% are in the copy phase: average progress = 30% x 1 + 70% x 1/3 ≈ 53%, so the threshold is 33%
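To make the arithmetic concrete, a quick check in Python (only the figures above come from the slides):

    # 30% of reduce tasks are done (score 1.0); 70% sit in the copy phase,
    # which contributes at most 1/3 of the progress score.
    avg = 0.3 * 1.0 + 0.7 * (1 / 3)   # ~0.533
    threshold = avg - 0.2             # ~0.333
    copy_phase_score = 1 / 3          # ~0.333
    # Copy-phase reducers hover right at the threshold, so the slightest lag
    # marks a healthy task as slow and triggers needless speculation.
    print(round(avg, 3), round(threshold, 3), copy_phase_score <= threshold + 1e-9)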
LATE
- Longest Approximate Time to End
- Principles
  - Rank candidates by the longest time to end
    - Choose the task that really hurts the job's response time; slow nodes can be utilized as long as they don't hurt the response time
  - Only launch speculative tasks on fast nodes
    - Not every node that asks for a task is fast
  - Cap the number of speculative tasks
    - Limits resource contention and thrashing
LATE algorithm
- If a node asks for a new task and there are fewer than SpeculativeCap speculative tasks running:
  - Ignore the request if the node's total progress is below SlowNodeThreshold
  - Rank currently running tasks by estimated time left
  - Launch a copy of the highest-ranked task whose progress rate is below SlowTaskThreshold
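A sketch of that decision logic (task and node fields are hypothetical; the threshold names follow the slide, and the time-left estimate is the one derived in the appendix):

    # Toy LATE speculation decision; see the appendix for the time-left estimate.
    def estimated_time_left(task):
        progress_rate = task.progress_score / task.execution_time
        return (1 - task.progress_score) / progress_rate

    def maybe_speculate(node, running_tasks, num_speculative,
                        speculative_cap, slow_node_thresh, slow_task_thresh):
        if num_speculative >= speculative_cap:
            return None                      # cap reached: no new backup tasks
        if node.total_progress < slow_node_thresh:
            return None                      # only speculate on fast nodes
        # Rank running tasks by longest estimated time to end.
        for task in sorted(running_tasks, key=estimated_time_left, reverse=True):
            rate = task.progress_score / task.execution_time
            if rate < slow_task_thresh and not task.has_backup:
                return task                  # launch a backup of this task
        return None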
Language level approach
- Programming frameworks
  - Still not clear and compact enough
- Traditional programming languages
  - Give no special focus to high parallelism on large computing clusters
- A new language
  - Clear, compact and expressive
  - Automatically parallelizes "normal" programs
  - A comfortable way for users to think about data processing problems on large distributed datasets
Sawzall
- An interpreted, procedural, high-level programming language
- Exploits high parallelism
- Automates the analysis of very large data sets
- Gives users a way to clearly and expressively design distributed data processing programs
Overall flow
- Filtering (the Map side)
  - Analyzes each record individually
  - Expressed in Sawzall
- Aggregation (the Reduce side)
  - Collates and reduces the intermediate values
  - Predefined aggregators
An example: find the most-linked-to page of each domain

    max_pagerank_url: table maximum(1)[domain:string] of url:string weight pagerank:int;
    doc: Document = input;
    emit max_pagerank_url[domain(doc.url)] <- doc.url weight doc.pagerank;

- The table is a maximum aggregator keeping the highest-weight value: it stores a url, indexed by domain and weighted by pagerank
- input: a pre-defined variable, initialized by Sawzall and interpreted into the Document type
- emit: sends an intermediate value to the aggregator
Unusual features
- Sawzall runs on one record at a time
  - Nothing in the language lets one input record influence another
- The emit statement is the only output primitive
  - Draws an explicit line between filtering and aggregation

This enables a high degree of parallelism, even though it is hidden from the language.
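For readers unfamiliar with Sawzall, here is a rough Python analogue of what the maximum(1) table does with emitted values; the records are made up, and the real runtime performs this aggregation across many machines:

    # Emulate: table maximum(1)[domain: string] of url: string weight pagerank: int
    best = {}  # domain -> (pagerank, url), keeping the highest-weight entry

    def emit(domain, url, pagerank):
        if domain not in best or pagerank > best[domain][0]:
            best[domain] = (pagerank, url)

    records = [("a.com", "http://a.com/x", 10),
               ("a.com", "http://a.com/y", 42),
               ("b.org", "http://b.org/z", 7)]
    for domain, url, pagerank in records:   # one record at a time, independently
        emit(domain, url, pagerank)
    print(best)  # {'a.com': (42, 'http://a.com/y'), 'b.org': (7, 'http://b.org/z')}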
Approaches

Approach: Application framework level
  Advantage: relieves users of the details of distributing the execution; automatically parallelizes users' programs
  Disadvantage: programs must follow the specific model
Approach: Language level
  Advantage: clearer, more expressive; a comfortable way to program
  Disadvantage: a more restrictive programming model
Approach: Instruction level (next)
Instruction level approach
- Provides an instruction-level abstraction and compatibility for users' applications
- May choose a traditional ISA such as x86/x86-64
- Runs traditional applications without any modification
- Makes it easier to migrate applications to cloud computing environments
Amazon Elastic Compute Cloud (EC2)
- Provides virtual machines that run traditional OSs
  - Traditional programs can work on EC2
- Amazon Machine Image (AMI)
  - Boots instances
  - The unit of deployment; a packaged-up environment
  - Users design and implement the application logic in an AMI; EC2 handles the deployment and resource allocation
vNUMA
- A virtual shared-memory multiprocessor machine built from commodity workstations
- Makes the computational power available to legacy applications and OSs
(Figure: ordinary virtualization runs many VMs on one physical machine; vNUMA runs one VM across many physical machines.)
Architecture
- Hypervisor
  - On each node
- CPU
  - Virtual CPUs are mapped to real CPUs on the nodes
- Memory
  - Divided between the nodes in equal-sized portions
  - Each node manages a subset of the pages
Memory mapping

(Figure: an application on the VM issues read *a; the address a is translated, within the application's virtual address space, to the VM's physical memory address b; the VMM maps b to the real physical address c on a node and finds *c.)
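A schematic of that two-level translation (purely illustrative; vNUMA implements it in the hypervisor with page tables and a distributed-shared-memory protocol, not lookup tables):

    # Guest page table: application virtual address -> VM "physical" address.
    guest_page_table = {0x1000: 0x8000}
    # VMM map: VM physical address -> (node, machine address) in the cluster.
    vmm_map = {0x8000: ("node2", 0x30000)}
    # Memory contents on each physical node.
    node_memory = {"node2": {0x30000: 42}}

    def vm_read(vaddr):
        b = guest_page_table[vaddr]   # translate a -> b (guest OS)
        node, c = vmm_map[b]          # map b -> real address c on some node (VMM)
        return node_memory[node][c]   # fetch *c, possibly over the network

    print(vm_read(0x1000))  # 42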
Approaches

Approach: Application framework level
  Advantage: relieves users of the details of distributing the execution; automatically parallelizes users' programs
  Disadvantage: programs must follow the specific model
Approach: Language level
  Advantage: clearer, more expressive; a comfortable way to program
  Disadvantage: a more restrictive programming model
Approach: Instruction level
  Advantage: supports traditional applications
  Disadvantage: users handle the tasking; hard to scale up
Our work
- Analyze MapReduce's design and use a case study to probe its limitations
  - One-way scalability
  - Difficult to handle dynamic, interactive and semantic-rich applications
- Design a new parallelization framework: MRlite
  - Able to scale "up" like MapReduce, and to scale "down" to process moderate-size data
  - Low latency and massive parallelism
  - Small run-time system overhead

Goal: design a general parallelization framework and programming paradigm for cloud computing.
Architecture of MRlite

(Figure: the MRlite architecture. An application is linked with the MRlite client library, which accepts calls from the app and submits jobs to the master. The MRlite master accepts jobs from clients and schedules them to execute on the slaves; the distributed slave nodes accept tasks from the master and execute them. A high-speed distributed storage holds the intermediate files. The arrows distinguish data flow from command flow.)
Result

The evaluation shows that MRlite is one order of magnitude faster than Hadoop on problems that MapReduce has difficulty handling.
Conclusion
- Cloud computing needs a general programming framework
  - Cloud computing shall not be a platform that runs only simple OLAP applications; it is important to support complex computation, and even OLTP, on large data sets
- We design MRlite: a general parallelization framework for cloud computing
  - Handles applications with complex logic flow and data dependencies
  - Mitigates the one-way scalability problem
  - Able to handle all MapReduce tasks with comparable (if not better) performance
Conclusion
- Emerging computing platforms increasingly emphasize parallelization capability, such as GPGPU
- MRlite respects applications' natural logic flow and data dependencies
- This decoupling of parallelization capability from application logic enables MRlite to integrate GPGPU processing very easily (future work)
Thank you!
Appendix
LATE: estimating finish times

progress rate = progress score / execution time

estimated time left = (1 - progress score) / progress rate
                    = execution time × (1 / progress score - 1)

The smaller the progress score, the longer the estimated time left.
LATE: solving the problems in Hadoop's default scheduler
- In "utility computing", nodes may be only slightly (2-3x) slower, which may not hurt the response time, and a node that asks for tasks is not necessarily fast
- Too many speculative tasks and thrashing
- Candidates ranked only by locality
  - The wrong tasks may be chosen
- Incorrect speculation of reducers