Location-aware MapReduce in Virtual Cloud 2011 IEEE computer society International Conference on Parallel Processing Yifeng Geng1,2, Shimin Chen3, YongWei

Location-aware MapReduce in Virtual Cloud

2011 IEEE Computer Society

International Conference on Parallel Processing

Yifeng Geng1,2, Shimin Chen3, YongWei Wu1*, Ryan Wu3, Guangwen Yang1,2, Weimin Zheng1

Reporter: Yu Chih Lin

Outline

Introduction

Background

Model and New Strategy

Implementation

Experiment

Conclusion

Introduction

MapReduce is an important programming model for

• Processing large data sets

• Generating large data sets

Commonly used in applications

• web indexing

• Data mining

• machine learning

Introduction

Multi-core CPU supporting virtualization technology

• Run two or more virtual machines (VMs) simultaneously

• Share I/O resources among users

MapReduce is set up on a distributed file system

• Google uses GFS

• Hadoop uses HDFS

Introduction

Running MapReduce in a virtual environment raises three major problems

• Disk sharing results in unbalanced data distribution and unbalanced workload

• I/O interference caused by data imbalance and load imbalance

• Disk sharing reduces the data redundancy

Introduction

Purpose of this paper

• Abstract a model

• Define evaluation metrics

• Analyze the data pattern and task pattern

For Hadoop

• propose a location-aware file block allocation strategy

Introduction

Three main benefits of using the paper's strategy

• MapReduce’s workload is more balanced

• Reduces I/O interference and improves HDFS’s performance

• Retains data redundancy

Background

Traditional I/O interference comes in two kinds

• Disk interference: occurs when multiple processes try to access the same disk simultaneously

• Network interference: mainly concerns latency and throughput

Background

I/O virtualization comes in two kinds

• KVM

• Paravirtualization

Virtual machines share CPUs and memory well, but not I/O.

Background

Virtualized Hadoop architecture


Build a generation model to analyze different allocation strategies

• Data pattern

• Task pattern

To simplify the analysis, make four assumptions

All physical machines have the same I/O devices and host the same number of virtual machines

All virtual machines are in one local area network and the network topology is flat

Workload can be randomly assigned to any virtual machine without restriction

All file blocks have the same size


actualReplicaNum (a) :

average number of distinct physical machines that hold the replicas of a file block

Ideal value is 3 (when the replica number is 3)


maxBlockNum (b) :

shows the maximum number of blocks in a physical machine


blockNumSigma (c) :

shows the variation of the pattern

Ideal value is 0
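The three data-pattern metrics can be sketched in a few lines. This is only an illustration under stated assumptions: actualReplicaNum is read as the average number of distinct physical machines per block (consistent with the ideal value of 3), blockNumSigma is taken to be the population standard deviation of the per-machine block counts, and the function name and the `placement` layout are hypothetical.

```python
from collections import Counter
from statistics import pstdev


def data_pattern_metrics(placement, p):
    """placement[i] lists the physical machine id of each replica of block i."""
    # actualReplicaNum: distinct physical machines per block, averaged over blocks.
    actual_replica_num = sum(len(set(hosts)) for hosts in placement) / len(placement)
    # Replica count on every physical machine, including machines that hold nothing.
    counts = Counter(h for hosts in placement for h in hosts)
    per_machine = [counts.get(m, 0) for m in range(p)]
    # maxBlockNum is the largest count; blockNumSigma is the spread of the counts.
    return actual_replica_num, max(per_machine), pstdev(per_machine)


# Two blocks with 3 replicas each on p = 4 physical machines.
# Block 1 has two replicas on machine 2 (two VMs sharing one host).
placement = [[0, 1, 2], [2, 2, 3]]
a, b, c = data_pattern_metrics(placement, p=4)
print(a, b, c)
```

In this toy placement the shared host lowers actualReplicaNum below 3 and raises both maxBlockNum and blockNumSigma, which is the effect the metrics are meant to expose.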


maxAssignedNum (d) :

shows the maximum number of tasks assigned to a physical machine


assignedNumSigma (e) :

reveals the load balance of the task pattern
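The two task-pattern metrics can be illustrated by brute-force enumeration. The sketch below assumes one task per block, with each task running on a machine that holds one of the block's replicas and every choice equally likely. That task model, the use of the population standard deviation for assignedNumSigma, and all names are assumptions, not the paper's definitions.

```python
from itertools import product
from statistics import pstdev


def task_pattern_metrics(placement, p):
    """Average maxAssignedNum and assignedNumSigma over all task patterns."""
    total_max = 0.0
    total_sigma = 0.0
    patterns = 0
    # product() picks one replica location per block, i.e. one task pattern.
    # A machine listed twice for a block is simply twice as likely to be picked.
    for choice in product(*placement):
        assigned = [0] * p
        for host in choice:
            assigned[host] += 1
        total_max += max(assigned)
        total_sigma += pstdev(assigned)
        patterns += 1
    return total_max / patterns, total_sigma / patterns


# Two blocks, each with replicas on physical machines 0 and 1.
placement = [[0, 1], [0, 1]]
d, e = task_pattern_metrics(placement, p=2)
print(d, e)
```

Enumeration grows exponentially with the number of blocks, so it only suits small cases; larger cases call for sampling.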


A new allocation strategy

• Place the replicas of a file block on different physical machines

• Keep the number of blocks on each physical machine balanced

Present two intuitive ways

• Round-robin allocation

• Serpentine allocation

For example, take p = 8, n = 8 (p: physical machines, n: file blocks)

An example of round-robin allocation
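Since the slide's figure is not reproduced here, the following is a minimal sketch of one plausible round-robin scheme: replicas are handed to physical machines 0, 1, ..., p-1 in a repeating cycle. The exact order used by the authors is an assumption, as are the function name and the default of 3 replicas.

```python
from collections import Counter


def round_robin(n, p, replicas=3):
    """Return placement[i] = physical machines holding the replicas of block i."""
    placement = []
    k = 0  # running replica index across all blocks
    for _ in range(n):
        placement.append([(k + r) % p for r in range(replicas)])
        k += replicas
    return placement


placement = round_robin(n=8, p=8)
per_machine = Counter(h for hosts in placement for h in hosts)
for block, hosts in enumerate(placement):
    print(block, hosts)
```

With p = 8 and n = 8, every block lands on three different machines and every machine holds three replicas, so this reading gives a fully balanced data pattern.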



For example, take p = 8, n = 8 (p: physical machines, n: file blocks)

An example of serpentine allocation


Evaluation metrics for data pattern

actualReplicaNum=3, maxBlockNum=3, blockNumSigma=0

Average results over all enumerated task patterns

Round-robin allocation:

maxAssignedNum = 2.2724, assignedNumSigma = 0.7943

Serpentine allocation:

maxAssignedNum = 2.2705, assignedNumSigma = 0.79323

Implementation

Choose serpentine allocation

Add the location information of virtual node into the network topology

For example, when all physical machines are in one rack

• a node's location may change from /default-rack to /Phy0

For example, when the physical machines span several racks

• a node's location may change from /rack1 to /rack1/Phy0

Implementation

The mechanism is easy to apply to Hadoop

• It stays compatible with native Hadoop

• A special label starting with “Phy” marks the physical machine

• The label identifies the location of each virtual machine
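The path changes and the “Phy” label can be mimicked with two small helpers. These are hypothetical functions written for illustration; they are not part of Hadoop's API and only reproduce the two path examples given on the slides.

```python
def location_with_host(rack_path, phy_index):
    """Append a physical-machine label such as Phy0 to a node's rack path."""
    label = f"Phy{phy_index}"
    # With the default single rack, the label replaces the rack name.
    if rack_path == "/default-rack":
        return f"/{label}"
    # Otherwise the label becomes one more level below the rack.
    return f"{rack_path.rstrip('/')}/{label}"


def physical_host(location):
    """Return the Phy label of a location, or None for a plain rack path."""
    last = location.rsplit("/", 1)[-1]
    return last if last.startswith("Phy") else None


print(location_with_host("/default-rack", 0))
print(location_with_host("/rack1", 0))
print(physical_host("/rack1/Phy0"))
```

Because a path without a “Phy” component is still a valid location, nodes that carry no label behave as they do in native Hadoop.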

Implementation

To maintain the block information of each virtual node

• In Hadoop's NameNode, add a list of virtual nodes sorted by number of blocks

On each update

• First, update the block number of the virtual node

• Second, update its position in the sorted list
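The two-step update can be sketched with a list kept in sorted order. The class below is a hypothetical stand-in for the NameNode bookkeeping, not Hadoop code; the names and the use of `bisect` are assumptions.

```python
import bisect


class BlockCountIndex:
    """Virtual nodes kept sorted by the number of blocks they hold."""

    def __init__(self, nodes):
        self.count = {node: 0 for node in nodes}
        self.order = sorted((0, node) for node in nodes)

    def add_blocks(self, node, delta=1):
        old = self.count[node]
        # Step 1: update the block number of the virtual node.
        self.count[node] = old + delta
        # Step 2: update its position in the sorted list.
        i = bisect.bisect_left(self.order, (old, node))
        self.order.pop(i)
        bisect.insort(self.order, (old + delta, node))

    def least_loaded(self):
        """Return the virtual node that currently holds the fewest blocks."""
        return self.order[0][1]


idx = BlockCountIndex(["vm0", "vm1", "vm2"])
idx.add_blocks("vm0")
idx.add_blocks("vm1", 2)
print(idx.order, idx.least_loaded())
```

Keeping the list sorted lets the allocator find the least-loaded node directly, at the cost of one list removal and one insertion per update.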

Evaluation

Simulation compares

• The new strategy (serpentine allocation) with Hadoop’s original strategy

Parameters

n = 256

p = [8,16,32,64,128,256]

sampling number is set to 1,000,000
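A sampling run of this kind can be sketched as a small Monte Carlo loop. The version below places each block's replicas on distinct virtual machines chosen uniformly at random, which is a simplification and not Hadoop's actual placement policy. The `vms_per_host` parameter is hypothetical (the slides do not give it), and the sample count is kept small for speed.

```python
import random


def sample_actual_replica_num(n, p, vms_per_host, samples, replicas=3, seed=0):
    """Estimate the average actualReplicaNum under random VM-level placement."""
    rng = random.Random(seed)
    vms = range(p * vms_per_host)
    total = 0.0
    for _ in range(samples):
        distinct = 0
        for _ in range(n):
            # Replicas go to distinct VMs, but those VMs may share one host.
            chosen = rng.sample(vms, replicas)
            distinct += len({vm // vms_per_host for vm in chosen})
        total += distinct / n
    return total / samples


shared = sample_actual_replica_num(n=256, p=8, vms_per_host=2, samples=20)
dedicated = sample_actual_replica_num(n=256, p=8, vms_per_host=1, samples=20)
print(shared, dedicated)
```

With one VM per host the estimate is exactly 3, while shared hosts push it below 3, which illustrates how disk sharing reduces redundancy.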

Evaluation

Comparison of maxBlockNum between Hadoop’s original strategy and the new strategy (by sampling)

Evaluation

Comparison of actualReplicaNum between the original and the new strategy

Evaluation

Comparison of blockNumSigma between the original and the new strategy

Evaluation

Comparison of maxAssignedNum between the original and the new strategy

Evaluation

Comparison of assignedNumSigma between the original and the new strategy

Experiment

n = 224, p = 8, sampling number = 1,000,000

Metric (average)      Original    New
actualReplicaNum      2.0657      3
maxBlockNum           90.5798     84
blockNumSigma         4.1722      0
maxAssignedNum        33.7660     34.5946
assignedNumSigma      3.6256      4.14939

Experiment

Experiment results of RandomWriter’s execution time

Red: SC off; Blue: SC on

Experiment

Experiment results of TextSort’s execution time


Experiment

Experiment results of WordCount’s execution time


Conclusion

Address problems of data allocation and its impact on MapReduce system

Build a model and evaluation metrics to evaluate the data and task pattern

Propose a new strategy for file block allocation in Hadoop

Simulation and real experiment results

• show that the new allocation strategy balances data placement and retains redundancy
