Upload
yu-liu
View
304
Download
1
Embed Size (px)
Citation preview
A Homomorphism-based Framework forSystematic Parallel Programming with MapReduce
Yu Liu1, Zhenjiang Hu2
1 The Graduate University for Advanced Studies,Tokyo, [email protected]
2 National Institute of Informatics,Tokyo, Japan2 [email protected]
Mar. 10th, 2011
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Background
MapReduce
Google’s MapReduce is a popular parallel-distributed programmingmodel, for processing large data sets. It has been the de factostandard for large scale data analysis.
Concepts from functional programming languages
Automatic parallel processing, fault tolerance
Good scalability
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
MapReduce
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
MapReduce
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
MapReduce
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
MapReduce
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
MapReduce
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Programming with MapReduce
A user has to
design a D&C algorithm that fits MapReduce paradigm
map this algorithm to MapReduce.
Difficulties of programming with MapReduce
How to resolve the constrains on computing order.
How to resolve the data dependency.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Programming with MapReduce
A user has to
design a D&C algorithm that fits MapReduce paradigm
map this algorithm to MapReduce.
Difficulties of programming with MapReduce
How to resolve the constrains on computing order.
How to resolve the data dependency.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Example
The Maximum Prefix Sum problem
mps [3,−1, 4,−1,−5, 9, 2,−6, 5,−10] = 11
A sequential program for MPS in O(n) time
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Example
The Maximum Prefix Sum problem
mps [3,−1, 4,−1,−5, 9, 2,−6, 5,−10] = 11
Hard to compute MPS with MapReduce
Computation has order.
MPS of sub-lists cannot be conquered directly.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Questions
Is there a systematic way to resolving such problems withMapReduce ?
How to handle the problems with district order ?
How to systematically design the divide-and-conqueralgorithm ?
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Questions
Is there a systematic way to resolving such problems withMapReduce ?
How to handle the problems with district order ?
How to systematically design the divide-and-conqueralgorithm ?
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Motivation and objective
We propose a systematic approach to automatically generate fullyparallelized and scalable MapReduce programs.A new framework which provides algorithmic programminginterfaces has been implemented.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A systematic approach for programming with MapReduce
Firstly, derive a function h.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A systematic approach for programming with MapReduce
Then write a inverse function h◦.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A systematic approach for programming with MapReduce
D&C algorithm can be gotten.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A systematic approach for programming with MapReduce
Map it to MapReduce paradigm.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A systematic approach for programming with MapReduce
Parallelization is in a black box.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A systematic approach for programming with MapReduce
Implemented by multi-phases MapReduce processing.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Conditions of this f function
Theorem
If there exists a binary operator � such that
f (xs ++ ys) = f xs � f ysthen such � can be defined as :x � y = f (f ◦x ++ f ◦x)
where ++ islistconcatenation.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Iff a function can be defined both rightwards and leftwards, such �exists. We can derive a divide-and-conquer algorithm like this:
Divide-and-conquer
f (xs ++ ys) = f (f ◦ (f xs) ++ f ◦ (f ys))
Such functions are so called: homomorphisms.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Programming Interface
Fold and unfold
fold :: [α]→ βunfold :: β → [α].
The implementation in Java
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A function which computes MPS and its right inverse can bewritten as followings:
fold xs = mps 4 sum xsunfold (m, s) = [m, s −m]
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
The computation inside framework
Use fold and unfold functions doing the computation:
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Autonomous intermediate data
Each record of the intermediate data has the information ofposition, thus the distribution of data is indifferent.
< id , val > → << parId , id >, val >
By taking use of sorting and grouping mechanism of MapReduceframework, lists can be reconstructed when necessary.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
A formal definitation
homMR
homMR :: (α→ β)→ (β → β → β)→ {(ID, α)} → βhomMR f (⊕) = getValue ◦MapReduce mapper2 reducer2
◦MapReduce mapper1 reducer1where
mapper1 :: (ID, α)→ [((PID, ID), α)]mapper1 (i , a) = [(pid , i), a))]
where pid = makePid ireducer1 :: ((PID, ID), [α])→ ((PID, ID), β)reducer1 ((pid , j), ias) = ((pid , j), hom f (⊕) ias)
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
continued
mapper2 :: ((PID, ID), β)→ ((PID, ID), β)mapper2 ((pid , j), b) = ((c0, j), b)
where c0 is a predefined constant pidreducer2 :: ((PID, ID), [β])→ ((PID, ID), β)reducer2 ((c0, k), jbs) = ((c0, k), hom f (⊕) jbs)
getValue :: ((PID, ID), β)→ βgetValue ((c0, k), c) = c
Where, hom f (⊕) denotes a sequential version of ([f ,⊕]).
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Actual user-program for MPS
http://screwdriver.googlecode.com
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Performance evaluation
Environment: hardware
We configured clusters with 2, 4, 8, and 16 nodes. Eachcomputing/data node has two Xeon CPUs (Nocona, single-core,2.8 GHz), 2 GB memory. The nodes are connected with GigabitEthernet.
Environment: software
Linux2.6.26 ,Hadoop 0.21.0 +HDFS
Hadoop configuration: heap size= 1024MB
maximum mapper per node: 2
maximum reducer per node: 1
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Test cases
We implemented several programs for three problems on ourframework and Hadoop:
1 the maximum-prefix-sum problem.
MPS-lh is implemented using our framework’ API.MPS-mr is implemented by Hadoop API.
2 parallel sum of 64-bit integers
SUM-lh is implemented by our framework’ API.SUM-mr is implemented by Hadoop API.
3 VAR-lh computes the variance of 32-bit floating-pointnumbers;
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Test cases
Test data
100 million 64-bit integers (2.87 GB) for MPS, SUM.100 million 32-bit floating-point numbers (2.76 GB) for VAR.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Performance
The experiment results are summarized :
With 16 nodes speedup of all cases are more than 7.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Performance
Time curves:
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Concluding remarks
In this research:
Introduced a systematic way of parallel programming onMapReduce.
Developed a framework on top of Hadoop.
Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Concluding remarks
In this research:
Introduced a systematic way of parallel programming onMapReduce.
Developed a framework on top of Hadoop.
Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Concluding remarks
In this research:
Introduced a systematic way of parallel programming onMapReduce.
Developed a framework on top of Hadoop.
Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.
Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Concluding remarks
In this research:
Introduced a systematic way of parallel programming onMapReduce.
Developed a framework on top of Hadoop.
Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.
Achieved good scalability and parallelism.Automatic optimization can be equipped.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Concluding remarks
In this research:
Introduced a systematic way of parallel programming onMapReduce.
Developed a framework on top of Hadoop.
Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.
Automatic optimization can be equipped.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Concluding remarks
In this research:
Introduced a systematic way of parallel programming onMapReduce.
Developed a framework on top of Hadoop.
Algorithmic programming interfaces let user can focus on thealgebraic properties of problem.Details of MapReduce are hidden.Achieved good scalability and parallelism.Automatic optimization can be equipped.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Future work
Decrease the system overhead and do more optimization.
Extend to more complex data structure such as tree andgraph.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Related work
Parallel programming with list homomorphisms (M.Cole 95)
The Third Homomorphism Theorem(J.Gibbons 96).
Systematic extraction and implementation ofdivide-and-conquer parallelism (Gorlatch PLILP96).
Automatic inversion generates divide-and-conquer parallelprograms(Morita et.al., PLDI07).
The third homomorphism theorem on trees: downward &upward lead to divide-and-conquer (Morihata, POPL09)
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Thank you very much.Questions?
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
List Homomorphism
Function h is said to be a list homomorphism
If there are a function f and an associative operator � such thatfor any list x and list y
h [a] = f ah (x ++ y) = h(x)� h(y).
Where ++ is the list concatenation.
Instance of a list homomorphism
sum [a] = asum (x ++ y) = sum x + sum y .
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Theorem (The Third Homomorphism Theorem (Gibbons,96) )
Let h be a given function and and � be binary operators. If thefollowing two equations hold for any element a and list y
h ([a] ++ y) = a h yh (y ++ [a]) = h y � a
then the function h is a homomorphism.
In fact, for a function h, if we have one of its right inverse h◦ thatsatisfies h ◦ h◦ ◦ h = h, then we can obtain the list-homomorphicdefinition as follows.
h = ([f ,�]) wheref a = h [a]l � r = h (h◦ l ++ h◦ r)
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
MapReduce programs can be automatically obtained bytwo sequential functions
homomorphism ([f ,⊕])
f :: a→ b⊕ :: b → b → b(a⊕ b)⊕ c = a⊕ (b ⊕ c).
fold and unfold, that compose leftwards and rightwards functions
fold([a] ++ x) = fold([a] ++ unfold(fold(x)))fold(x ++ [a]) = fold(unfold(fold(x)) ++ [a]).
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Currently, Screwdriver provides two kinds of programminginterfaces:
Programming interface corresponding to definition of listhomomorphism;
Programming interface corresponding to the 3rdhomomorphism theorem.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Basic Homomorphism-Programming Interface
Two functions which define an homomorphism
filter :: a→ bplus :: b → b → b.
The implementation in Java
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Programming Interface based on the 3rd homomorphismtheorem
A function and its right inverse
fold :: [a]→ bunfold :: b → [a].
The implementation in Java
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
The implementation of Screwdriver : list representation
To implement our programming interface with Hadoop, we need toconsider how to represent lists in a distributed manner.
Input data: index-value pairs
We use integer as the index’s type, the list [a, b, c , d , e] isrepresented by {(3, d), (1, b), (2, c), (0, a), (4, e)}.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Partition of input list
The pid(partition-id) of type PID is the index of a partial list. Theframework produces a same pid for the records which will begrouped together. These records have continues id .
Intermediate data: nested pairs ((pid , id), val)
Suppose the above list was divided to two parts and in differentnodes, then they are represented as{((0, 1), b), ((0, 2), c), ((0, 0), a)} and {((1, 3), d), ((1, 4), e)}.
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Grouping and sorting of intermediate data
We defined two functions: the comparatorG and comparatorS asfollows:
comparatorG (pid1, id1) (pid2, id2) = if pid1 == pid2
then 0else − 1
comparatorS (pid1, id1) (pid2, id2) = if id1 > id2then 1else − 1
for grouping intermediate records with same pid and sorting themby id .
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce
Data partition
1 In MAP task,
intermediate records with same pid are grouped together andsorted by id .a partitioner dispatches the groups to different reducers.
2 In REDUCE task, reducers apply merge-sort on all groupswith same pid
Yu Liu1, Zhenjiang Hu2 A Homomorphism-based Framework for Systematic Parallel Programming with MapReduce