Location-aware MapReduce in Virtual Cloud
2011 IEEE Computer Society
International Conference on Parallel Processing
Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1
Reporter: Yu Chih Lin
Outline
Introduction
Background
Model and New Strategy
Implementation
Experiment
Conclusion
Introduction
MapReduce is an important programming model for
• Processing large data sets
• Generating large data sets
Commonly used in applications
• Web indexing
• Data mining
• Machine learning
Introduction
Multi-core CPUs support virtualization technology
• Run two or more virtual machines (VMs) simultaneously
• Share I/O resources among users
MapReduce is set up on a distributed file system
• Google uses GFS
• Hadoop uses HDFS
Introduction
Running MapReduce in a virtual environment raises three major problems
• Disk sharing leads to unbalanced data distribution and unbalanced workload
• Data imbalance and load imbalance cause I/O interference
• Disk sharing reduces data redundancy
Introduction
Purpose of this paper
• Abstract a model
• Define evaluation metrics
• Analyze the data pattern and task pattern
For Hadoop
• propose a location-aware file block allocation strategy
Introduction
Three main benefits of the proposed strategy
• MapReduce’s workload is more balanced
• I/O interference is reduced and HDFS’s performance improves
• Data redundancy is retained
Background
Two kinds of traditional I/O interference
• Disk interference: occurs when multiple processes try to access the same disk simultaneously
• Network interference: mainly concerns latency and throughput
Background
Two kinds of I/O virtualization
• KVM
• Paravirtualization
Virtual machines share CPUs and memory well, but not I/O.
Background
Virtualized Hadoop architecture
Model and New Strategy
Build a general model to analyze different allocation strategies
• Data pattern
• Task pattern
To simplify the analysis, make four assumptions
Model and New Strategy
All physical machines have the same I/O devices and host the same number of virtual machines
All virtual machines are on one local area network with a flat topology
Workload can be randomly assigned to any virtual machine without restriction
All file blocks have the same size
Model and New Strategy
actualReplicaNum (a) :
average number of unique file blocks in a physical machine
Ideal value is 3 (when the replica number is 3)
Model and New Strategy
maxBlockNum (b) :
shows the maximum number of blocks in a physical machine
Model and New Strategy
blockNumSigma (c) :
shows the variation of the data pattern (block counts across physical machines)
Ideal value is 0
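The three data-pattern metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: actualReplicaNum is read here as the average number of distinct physical machines holding a replica of each block, maxBlockNum counts every stored replica on a machine, and blockNumSigma is taken as the population standard deviation of those counts. The function name and the input layout are hypothetical, not taken from the paper.

```python
from statistics import mean, pstdev

def data_pattern_metrics(placement, num_machines):
    """placement[i] lists the physical machine index of every replica of block i."""
    # (a) assumed reading: distinct physical machines per block, averaged over blocks
    actual_replica_num = mean([len(set(replicas)) for replicas in placement])
    # count stored replicas on each physical machine
    blocks_per_machine = [0] * num_machines
    for replicas in placement:
        for machine in replicas:
            blocks_per_machine[machine] += 1
    # (b) the most loaded machine, (c) spread of the per-machine counts
    max_block_num = max(blocks_per_machine)
    block_num_sigma = pstdev(blocks_per_machine)
    return actual_replica_num, max_block_num, block_num_sigma

# Two blocks on four machines; block 1 has two replicas co-located on machine 2.
a, b, c = data_pattern_metrics([[0, 1, 2], [2, 2, 3]], num_machines=4)
```

In this toy input the co-located replicas lower (a) below 3 and raise both (b) and (c), which is the effect the metrics are meant to expose.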
Model and New Strategy
maxAssignedNum (d) :
shows the maximum number of tasks assigned to a physical machine
Model and New Strategy
assignedNumSigma (e) :
reveals the load balance of the task pattern
Model and New Strategy
A new allocation strategy
• Places the replicas of a file block on different physical machines
• Keeps the number of blocks on each physical machine balanced
Two intuitive ways to do this
• Round-robin allocation
• Serpentine allocation
For example, take p = 8, n = 8 (p: physical machines, n: file blocks)
An example of round-robin allocation
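The figure itself is not recoverable from the text, so the following is only one plausible reading of round-robin allocation: replicas are dealt out to physical machines in cyclic order, three per block. The function name and the replica count default are assumptions made for illustration; the serpentine variant is not reproduced here because its exact ordering cannot be inferred from the slides.

```python
def round_robin_placement(num_blocks, num_machines, replicas=3):
    """Deal replicas to machines 0, 1, ..., p-1, 0, 1, ... in order (assumed layout)."""
    placement = []
    slot = 0
    for _ in range(num_blocks):
        # consecutive slots map to distinct machines as long as replicas <= num_machines
        placement.append([(slot + r) % num_machines for r in range(replicas)])
        slot += replicas
    return placement

placement = round_robin_placement(num_blocks=8, num_machines=8)
# replicas stored on each physical machine
per_machine = [sum(block.count(m) for block in placement) for m in range(8)]
```

With p = 8 and n = 8 every block lands on three distinct machines and every machine stores three replicas, so the layout is balanced and keeps full redundancy.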
Model and New Strategy
For example, take p = 8, n = 8 (p: physical machines, n: file blocks)
An example of serpentine allocation
Model and New Strategy
Evaluation metrics for data pattern
actualReplicaNum=3, maxBlockNum=3, blockNumSigma=0
Average results over all enumerated task patterns
Round-robin allocation:
maxAssignedNum=2.2724, assignedNumSigma=0.7943
Serpentine allocation:
maxAssignedNum=2.2705, assignedNumSigma=0.79323
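How such enumeration averages might be computed can be sketched as follows. The task model is an assumption: each block's task is taken to run on one of the physical machines holding a replica of that block, with every combination of choices weighted equally. The sketch is not claimed to reproduce the figures reported above, and the function name is hypothetical.

```python
from itertools import product
from statistics import mean, pstdev

def task_pattern_averages(placement, num_machines):
    """Average maxAssignedNum and assignedNumSigma over all enumerated task patterns."""
    max_assigned, sigmas = [], []
    # one task pattern = one choice of replica holder for every block
    for choice in product(*placement):
        load = [0] * num_machines
        for machine in choice:
            load[machine] += 1
        max_assigned.append(max(load))   # (d) for this pattern
        sigmas.append(pstdev(load))      # (e) for this pattern
    return mean(max_assigned), mean(sigmas)

# Tiny example: 2 blocks, 3 machines, 2 replicas per block (4 task patterns).
d, e = task_pattern_averages([[0, 1], [1, 2]], num_machines=3)
```

For p = 8, n = 8 and three replicas per block the enumeration covers 3^8 = 6561 patterns, which is small enough to run exhaustively.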
Implementation
Choose serpentine allocation
Add the location information of virtual node into the network topology
For example, when the physical machines share one rack
• a node's location may be changed from /default-rack to /Phy0
For example, when the physical machines span several racks
• a node's location may be changed from /rack1 to /rack1/Phy0
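The two path rewrites above can be expressed as a small helper. This is a sketch of the naming rule only; the function name and the idea of passing a numeric physical-machine index are assumptions for illustration, not the paper's code.

```python
def location_path(rack, phys_index):
    """Build a topology path that ends in a per-host 'Phy<k>' label."""
    label = f"Phy{phys_index}"
    # the default rack is replaced by the label, as in /default-rack -> /Phy0
    if rack in ("", "/default-rack"):
        return f"/{label}"
    # a named rack keeps its path and gains one extra level, as in /rack1 -> /rack1/Phy0
    return f"{rack.rstrip('/')}/{label}"
```

For instance, `location_path("/rack1", 0)` yields `/rack1/Phy0`, and `location_path("/default-rack", 0)` yields `/Phy0`.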
Implementation
This mechanism is easy to adopt in Hadoop
• It keeps compatibility with native Hadoop
• It uses a special label starting with “Phy”
• The label identifies the locations of virtual machines
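Reading the label back is equally simple. The sketch below assumes the label is the last component of the location path and that virtual machines sharing a label share a physical machine; the helper names and the VM names in the example are hypothetical.

```python
from collections import defaultdict

def physical_label(location):
    """Return the trailing 'Phy...' component of a topology path, or None if absent."""
    last = location.rstrip("/").rsplit("/", 1)[-1]
    return last if last.startswith("Phy") else None

def group_by_physical_machine(vm_locations):
    """Group virtual machines by the physical-machine label in their location."""
    groups = defaultdict(list)
    for vm, location in vm_locations.items():
        groups[physical_label(location)].append(vm)
    return dict(groups)

groups = group_by_physical_machine(
    {"vm0": "/Phy0", "vm1": "/Phy0", "vm2": "/rack1/Phy1"}
)
```

Paths without a "Phy" component return None, so nodes configured the native way are left unlabelled.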
Implementation
Maintain block information for each virtual node
• In Hadoop’s NameNode, add a list of virtual nodes sorted by number of blocks
On each update
• First, update the block number of the virtual node
• Second, update its position in the sorted list
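The two-step update can be illustrated with a sorted list kept in order by block count. This is a Python sketch of the idea, not Hadoop's NameNode code (which is written in Java); the class and method names are hypothetical.

```python
import bisect

class BlockCountIndex:
    """Virtual nodes kept in a list sorted by (block count, node name)."""

    def __init__(self, nodes):
        self.count = {node: 0 for node in nodes}
        self.order = sorted((0, node) for node in nodes)

    def add_blocks(self, node, delta=1):
        old = self.count[node]
        # find the node's current entry by binary search and remove it
        i = bisect.bisect_left(self.order, (old, node))
        self.order.pop(i)
        # first: update the block number of the virtual node
        self.count[node] = old + delta
        # second: put the node back at its new position in the sorted list
        bisect.insort(self.order, (old + delta, node))

    def least_loaded(self):
        return self.order[0][1]

index = BlockCountIndex(["vm0", "vm1", "vm2"])
index.add_blocks("vm0")
index.add_blocks("vm1", 2)
```

Keeping the list sorted means the node with the fewest blocks is always at the front, which is what a balance-keeping allocator needs to look up.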
Evaluation
Simulation to compare
• New strategy (serpentine allocation) and Hadoop’s original strategy
Parameters
n = 256
p = [8,16,32,64,128,256]
sampling number = 1,000,000
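A sampling run of this kind might look like the sketch below. It is a deliberately simplified stand-in: replicas are placed on three distinct virtual machines chosen uniformly at random, which is not Hadoop's actual default placement policy, and the number of virtual machines per host (`vms_per_host`) is a hypothetical parameter that the slides do not state. It is not expected to reproduce the reported figures, and the sample count is kept small so it runs quickly.

```python
import random

def sample_actual_replica_num(num_blocks, num_hosts, vms_per_host, samples, seed=0):
    """Estimate actualReplicaNum when replicas go to random distinct VMs (assumed model)."""
    rng = random.Random(seed)
    vms = range(num_hosts * vms_per_host)
    total = 0.0
    for _ in range(samples):
        distinct = 0
        for _ in range(num_blocks):
            chosen = rng.sample(vms, 3)  # three different virtual machines
            # VMs with the same quotient live on the same physical host
            distinct += len({vm // vms_per_host for vm in chosen})
        total += distinct / num_blocks
    return total / samples

shared = sample_actual_replica_num(num_blocks=256, num_hosts=8, vms_per_host=4, samples=20)
dedicated = sample_actual_replica_num(num_blocks=256, num_hosts=8, vms_per_host=1, samples=20)
```

With one virtual machine per host the estimate stays at 3; once several virtual machines share a host it drops below 3, because distinct virtual machines no longer guarantee distinct disks.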
Evaluation
Comparison of maxBlockNum: Hadoop’s original strategy vs. the new strategy (by sampling)
Evaluation
Comparison of actualReplicaNum: original vs. new strategy
Evaluation
Comparison of blockNumSigma: original vs. new strategy
Evaluation
Comparison of maxAssignedNum: original vs. new strategy
Evaluation
Comparison of assignedNumSigma: original vs. new strategy
Experiment
n = 224, p = 8, sampling number = 1,000,000
Metric                          Original    New
Average of actualReplicaNum     2.0657      3
Average of maxBlockNum          90.5798     84
Average of blockNumSigma        4.1722      0
Average of maxAssignedNum       33.7660     34.5946
Average of assignedNumSigma     3.6256      4.14939
Experiment
Experiment results of RandomWriter’s execution time
Red: SC off; Blue: SC on
Experiment
Experiment results of TextSort’s execution time
Red: SC off; Blue: SC on
Experiment
Experiment results of WordCount’s execution time
Red: SC off; Blue: SC on
Conclusion
Addresses the problems of data allocation and their impact on MapReduce systems
Builds a model and evaluation metrics to evaluate the data and task patterns
Proposes a new strategy for file block allocation in Hadoop
Simulation and real experiment results
• indicate that the new allocation strategy retains data redundancy and balances block distribution