IBM Research
© 2007 IBM Corporation
Introduction to Map-Reduce and Join Processing
IBM Research | India Research Lab
Hadoop – A Very Brief Introduction
A framework for creating distributed applications that process huge amounts of data.
Scalability, Fault Tolerance, Ease of Programming
Two main components:
• HDFS – Hadoop Distributed File System
• Map-Reduce
How is data organized on HDFS?
How is data processed using Map-Reduce?
HDFS
Stores files in blocks across many nodes in a cluster (default block size – 64 MB)
Replicates the blocks across nodes for durability
Master/Slave Architecture
HDFS Master – NameNode
• Runs on a single node as a master process
• Directs client access to files in HDFS
HDFS Slave – DataNode
• Runs on all nodes in the cluster
• Handles block creation/replication/deletion
• Takes orders from the NameNode
HDFS
A table with columns A, B, C and rows R1–R15, stored as three 64 MB blocks of five rows each:
Block 1: R1 1 2 3 | R2 2 3 5 | R3 2 4 6 | R4 6 4 2 | R5 1 3 6
Block 2: R6 8 9 1 | R7 2 3 1 | R8 9 9 2 | R9 1 7 4 | R10 1 2 2
Block 3: R11 2 3 4 | R12 4 5 6 | R13 6 7 8 | R14 9 8 3 | R15 3 2 1
Replication Factor = 3
All these blocks are distributed across the cluster
HDFS
[Diagram: a "Put File" request for File1.txt. The client contacts the NameNode, which directs the write; the file's blocks 1–6 are spread over the DataNodes, each holding a subset (blocks 1, 4, 5; 2, 5, 6; 2, 3, 4).]
HDFS
Blocks are read in parallel, so aggregate read bandwidth = per-node transfer rate × number of machines
[Diagram: a "Read File" request. The NameNode tells the client which DataNodes hold blocks 1–6, and the client reads the blocks (1, 4, 5; 2, 5, 6; 2, 3, 4) from the DataNodes in parallel.]
HDFS
Fault-Tolerant – handles node failures
Self-Healing – rebalances files across the cluster; data from the remaining two replicas is automatically copied
Scalable – grows just by adding new nodes
[Diagram: a DataNode fails during a read. The NameNode detects the failure and has the surviving replicas copied to other nodes, so every block returns to three copies (block lists such as 1, 4, 5; 2, 5, 6; 2, 3, 4; 3, 5, 6; 2, 3, 6).]
Map-Reduce
Logical Functions : Mappers and Reducers
Developers write map and reduce functions, then submit a jar to the Hadoop cluster
Hadoop handles distributing the Map and Reduce tasks across the cluster
Map-Reduce
A map task is started for each split / 64 MB block. Each map task generates some intermediate data.
Hadoop collects the output of all map tasks, reorganizes them and passes the reorganized data to Reduce tasks
Reduce tasks process this re-organized data and generate the final output
Flow:
• HDFS block to Map task
• Map task output to the Hadoop engine
• Hadoop shuffles and sorts the map output
• Hadoop engine to Reduce tasks, which perform the reduce processing
HDFS to Map Tasks
Records are read one by one from each block and passed to the map function for processing.
This component is called the InputFormat / RecordReader.
A record is passed as a key-value pair: the key is the record's byte offset and the value is the record itself.
The offset is usually ignored by the map.
MAP-1
MAP-2
MAP-3
( 0, R1 1 2 3)(10, R2 2 3 5)(20, R3 2 4 6)(30, R4 6 4 2)(40, R5 1 3 6)
( 50, R6 8 9 1)(60, R7 2 3 1)(70, R8 9 9 2)(80, R9 1 7 4)(90, R10 1 2 2)
(100, R11 2 3 4)(110, R12 4 5 6)(120, R13 6 7 8)(130, R14 9 8 3)(140, R15 3 2 1)
Input-Format
Input-Format
Input-Format
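The record-reading step above can be sketched in Python. This is a simplified stand-in for Hadoop's InputFormat/RecordReader (which are Java classes); the function name `read_records` is ours.

```python
def read_records(block):
    """Yield (key, value) pairs for one HDFS block.

    The key is the byte offset of the record; the value is the
    record text itself (the map function usually ignores the key).
    """
    offset = 0
    for line in block.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

# A tiny block with two 10-byte records:
block = "R1 1 2 3 \nR2 2 3 5 \n"
print(list(read_records(block)))
# [(0, 'R1 1 2 3 '), (10, 'R2 2 3 5 ')]
```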
Map Task
Takes in a key-value pair and transforms it to a set of key-value pairs
{K1, V1} ==> [{K2, V2}]
( 0, R1 1 2 3)(10, R2 2 3 5)(20, R3 2 4 6)(30, R4 6 4 2)(40, R5 1 3 6)
( 50, R6 8 9 1)(60, R7 2 3 1)(70, R8 9 9 2)(80, R9 1 7 4)(90, R10 1 2 2)
(100, R11 2 3 4)(110, R12 4 5 6)(120, R13 6 7 8)(130, R14 9 8 3)(140, R15 3 2 1)
MAP-1
MAP-2
MAP-3
(2, 3)(2, 4)(2, 4)(6, 4)
(2, 9)(4, 9)(8, 9)(2, 3)
(2, 3)(2, 5)(4, 5)(2, 7)
Example: if the second column is odd, emit nothing. If the second column is even, emit one key-value pair per even divisor of the second-column value: the key is the divisor and the value is the third column.
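A minimal Python sketch of the map rule as stated (the function name `divisor_map` is ours, and this simulates, rather than reproduces, a Hadoop Mapper):

```python
def divisor_map(key, record):
    """Map function for the divisor example.

    record is a tuple (a, b, c). If b is odd, emit nothing; if b is
    even, emit one (divisor, c) pair for every even divisor of b.
    """
    a, b, c = record
    if b % 2 != 0:
        return []
    return [(d, c) for d in range(2, b + 1, 2) if b % d == 0]

print(divisor_map(0, (1, 2, 3)))   # [(2, 3)]
print(divisor_map(10, (2, 3, 5)))  # [] -- second column is odd
print(divisor_map(20, (2, 4, 6)))  # [(2, 6), (4, 6)]
```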
Hadoop Sorting And Shuffling
Hadoop processes the key-value pairs output by map in a fashion so that the values in all pairs with the same key are grouped together
These groups are then passed to reducers for processing
MAP-1
MAP-2
MAP-3
(2, 3)(2, 4)(2, 4)(6, 4)
(2, 9)(4, 9)(8, 9)(2, 3)
(2, 3)(2, 5)(4, 5)(2, 7)
(2, [3, 3, 3, 4, 4, 5, 7, 9])(4, [5, 9])(6, [4])(8, [9])
Hadoop
Shuffle
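The grouping shown above can be simulated with a dictionary. This is only a sketch of the effect; Hadoop's real shuffle is a distributed merge-sort, not an in-memory dict.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group the values of all (key, value) pairs by key, as
    Hadoop's shuffle/sort phase does, returning groups sorted
    by key (values sorted within each group for readability)."""
    groups = defaultdict(list)
    for pairs in map_outputs:
        for k, v in pairs:
            groups[k].append(v)
    return sorted((k, sorted(vs)) for k, vs in groups.items())

map1 = [(2, 3), (2, 4), (2, 4), (6, 4)]
map2 = [(2, 9), (4, 9), (8, 9), (2, 3)]
map3 = [(2, 3), (2, 5), (4, 5), (2, 7)]
print(shuffle([map1, map2, map3]))
# [(2, [3, 3, 3, 4, 4, 5, 7, 9]), (4, [5, 9]), (6, [4]), (8, [9])]
```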
Hadoop Engine to Reduce Tasks and Reduce Processing
Let the number of distinct keys (groups) be m, and the number of reduce tasks be k.
The m groups are distributed across the k reduce tasks using a hash function.
Each reduce task processes its groups and generates the output. Example – sum all the values in each group.
REDUCER 1
(2, [3, 4, 4, 9, 3, 3, 5, 7])(6, [4])
REDUCER 2
(4, [9, 5])(8, [9])
(2, 38)(6, 4)
(4, 14)(8, 9)
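The summing reducer above, plus a hash-style group-to-reducer assignment, can be sketched in Python (function names are ours; Hadoop's default partitioner computes the key's hash code modulo the number of reducers, which is what `assign_reducer` imitates):

```python
def assign_reducer(key, k):
    """Pick one of k reduce tasks for a group, hash-partition
    style; the exact assignment depends on the hash function."""
    return hash(key) % k

def sum_reduce(key, values):
    """Example reduce function from the slide: sum all values."""
    return (key, sum(values))

groups = [(2, [3, 4, 4, 9, 3, 3, 5, 7]), (4, [9, 5]), (6, [4]), (8, [9])]
print([sum_reduce(k, vs) for k, vs in groups])
# [(2, 38), (4, 14), (6, 4), (8, 9)]
```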
Word-Count
Hadoop Uses Map-Reduce
There is a Map-Phase
There is a Reduce Phase
(Hadoop, 1)(Uses, 1)(Map, 1)(Reduce, 1)
(There, 1)(is, 1)(a, 1)(Map, 1)(Phase, 1)
(There, 1)(is, 1)(a, 1)(Reduce, 1)(Phase, 1)
(a, [1,1])(Hadoop, 1)
(is, [1,1])
(map, [1,1])(phase, [1,1])
(reduce, [1,1])(there, [1,1])
(uses, 1)
A-I
J-Q
R-Z
(a, 2)(hadoop, 1)
(is, 2)
(map, 2)(phase, 2)
(reduce, 2)(there, 2)(uses, 1)
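The whole word-count pipeline above fits in a few lines of Python. This simulates the three phases in one process (the real job is distributed Java code); splitting on letters makes "Map-Reduce" yield two words and lower-casing merges variants, matching the grouped output above.

```python
import re
from collections import defaultdict

def wc_map(doc):
    """Map: emit (word, 1) for each word in one document."""
    return [(w.lower(), 1) for w in re.findall(r"[A-Za-z]+", doc)]

def wc_reduce(word, counts):
    """Reduce: total occurrences of one word."""
    return (word, sum(counts))

docs = ["Hadoop Uses Map-Reduce", "There is a Map-Phase",
        "There is a Reduce phase"]
groups = defaultdict(list)
for doc in docs:                   # map phase
    for w, one in wc_map(doc):
        groups[w].append(one)      # shuffle: group by word
print(sorted(wc_reduce(w, c) for w, c in groups.items()))
# [('a', 2), ('hadoop', 1), ('is', 2), ('map', 2), ('phase', 2),
#  ('reduce', 2), ('there', 2), ('uses', 1)]
```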
Map-Reduce Example: Aggregation
Compute the average of B for each distinct value of A
A B C
R1 1 10 12
R2 2 20 34
R3 1 10 22
R4 1 30 56
R5 3 40 17
R6 2 10 49
R7 1 20 44
MAP 1
MAP 2
(1, 10)(2, 20)(1, 10)
(1, 30)(3, 40)(2, 10)(1, 20)
(1, 17.5)
(2, 15) (3, 40)
(1, [10, 10, 30, 20])
(2, [10, 20])(3, [40])
Reducer 1
Reducer 2
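The averaging job above can be sketched as follows (helper names are ours). Note that, unlike a sum, an average cannot be pre-combined per mapper without carrying (sum, count) pairs, which is why the reducer here sees the full list of B values.

```python
from collections import defaultdict

def avg_map(row):
    """Map: key on column A, value is column B."""
    a, b, c = row
    return (a, b)

def avg_reduce(key, values):
    """Reduce: average of B for one distinct value of A."""
    return (key, sum(values) / len(values))

rows = [(1, 10, 12), (2, 20, 34), (1, 10, 22), (1, 30, 56),
        (3, 40, 17), (2, 10, 49), (1, 20, 44)]
groups = defaultdict(list)
for row in rows:
    k, v = avg_map(row)
    groups[k].append(v)
print(sorted(avg_reduce(k, vs) for k, vs in groups.items()))
# [(1, 17.5), (2, 15.0), (3, 40.0)]
```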
Designing a Map-Reduce Algorithm
Thinking in terms of Map and Reduce:
• What data should be the key?
• What data should be the values?
Minimizing cost:
• Reading and map processing cost
• Communication cost
• Processing cost at the reducers
Load balancing:
• All reducers should receive a similar volume of traffic
• It should not happen that a few machines are overloaded while the rest sit idle
Join On Point Data
Select R.A, R.B, S.D where R.A == S.A
A B C
R1 1 10 12
R2 2 20 34
R3 1 10 22
R4 1 30 56
R5 3 40 17
A D E
S1 1 20 22
S2 2 30 36
S3 2 10 29
S4 3 50 16
S5 3 40 37
MAP 1
MAP 2
(1, [R, 10])(2, [R, 20])(1, [R, 10])(1, [R, 30])(3, [R, 40])
(1, [S, 20])(2, [S, 30])(2, [S, 10])(3, [S, 50])(3, [S, 40])
(1, 10, 20)(1, 10, 20)(1, 30, 20)
(2, 20, 30)(2, 20, 10)(3, 40, 50)(3, 40, 40)
(1, [(R, 10), (R, 10),(R, 30), (S, 20)] )
(2, [(R, 20), (S, 30),(S, 10)] )
(3, [(R, 40), (S, 50), (S, 40)])
Reducer 1
Reducer 2
Join On Point Data
Select R.A, R.B, S.D where R.A == S.A
The range of attribute A is divided into k parts (buckets 1, 2, …, k); a hash function maps each value of A to one bucket.
A reducer is defined for each of the k buckets.
A tuple from R or S is sent to reducer i if its A value hashes to bucket i.
Each reducer computes its part of the join output.
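This hash-partitioned equi-join can be simulated in Python (a sketch with our own names; the "R"/"S" tags mark the source relation so the reducer can tell the two sides apart, as in the example that follows):

```python
from collections import defaultdict

def join_map(tag, tuples):
    """Map: key each tuple on the join attribute A and tag it
    with its relation name."""
    return [(t[0], (tag, t)) for t in tuples]

def join_reduce(key, tagged):
    """Reduce: cross the R-tuples with the S-tuples that share
    this join key, emitting (A, R.B, S.D)."""
    r_side = [t for tag, t in tagged if tag == "R"]
    s_side = [t for tag, t in tagged if tag == "S"]
    return [(key, r[1], s[1]) for r in r_side for s in s_side]

R = [(1, 10, 12), (2, 20, 34), (1, 10, 22), (1, 30, 56), (3, 40, 17)]
S = [(1, 20, 22), (2, 30, 36), (2, 10, 29), (3, 50, 16), (3, 40, 37)]
groups = defaultdict(list)
for key, tv in join_map("R", R) + join_map("S", S):
    groups[key].append(tv)            # shuffle on the join key
out = [row for k in sorted(groups) for row in join_reduce(k, groups[k])]
print(out)
# [(1, 10, 20), (1, 10, 20), (1, 30, 20),
#  (2, 20, 30), (2, 20, 10), (3, 40, 50), (3, 40, 40)]
```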
Join On Point Data
Assume k = 3, h(1) = 0, h(2) = 1, h(3) = 2
A B C
R1 1 10 12
R2 2 20 34
R3 1 10 22
R4 1 30 56
R5 3 40 17
A D E
S1 1 20 22
S2 2 30 36
S3 2 10 29
S4 3 50 16
S5 3 40 37
Bucket 0: R1 1 10 12 | R3 1 10 22 | R4 1 30 56 | S1 1 20 22
Bucket 1: R2 2 20 34 | S2 2 30 36 | S3 2 10 29
Bucket 2: R5 3 40 17 | S4 3 50 16 | S5 3 40 37
Join output at bucket 0: (R1, S1) (R3, S1) (R4, S1)
Join output at bucket 1: (R2, S2) (R2, S3)
Join output at bucket 2: (R5, S4) (R5, S5)
Map-Reduce Example: Inequality Join
Select R.A, R.B, S.D where R.A <= S.A
Consider a 3-node cluster
A B C
R1 1 10 12
R2 2 20 34
R3 1 10 22
R4 1 30 56
R5 3 40 17
A D E
S1 1 20 22
S2 2 30 36
S3 2 10 29
S4 3 50 16
S5 3 40 37
MAP 2
(r1, [S, 1, 20])(r2, [S, 2, 30])(r2, [S, 2, 10])(r3, [S, 3, 50])(r3, [S, 3, 40])
(1, 10, 20)(1, 10, 20)(1, 30, 20)
(1, 10, 50)(1, 10, 40)(2, 20, 50)(2, 20, 40)(1, 10, 50)(1, 10, 40)(1, 30, 50)(1, 30, 40)(3, 40, 50)(3, 40, 40)
MAP 1
(r1, [R, 1, 10])(r2, [R, 1, 10])(r3, [R, 1, 10])(r2, [R, 2, 20])(r3, [R, 2, 20]) ….. …..(r3, [R, 3, 40])
(r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]))
(r3, ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]))
Reducer 1
Reducer 3
……
Reducer 2
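The replication pattern above can be sketched in Python: an S-tuple goes only to the reducer for its own bucket, while an R-tuple with value a is replicated to that bucket and every higher one, so each reducer sees all R-tuples that can join with its S-tuples. Names and the bucket function `h` are ours; with a three-value domain the identity function matches the slide's reducers r1–r3.

```python
def ineq_map_R(t, k, h):
    """An R-tuple with join value a can match any S-tuple in its
    own bucket or a higher one, so replicate it to reducers
    h(a)..k."""
    return [(j, ("R", t)) for j in range(h(t[0]), k + 1)]

def ineq_map_S(t, k, h):
    """An S-tuple is needed only at the reducer for its bucket."""
    return [(h(t[0]), ("S", t))]

def ineq_reduce(tagged):
    """Join R-tuples against S-tuples at one reducer, keeping
    pairs with R.A <= S.A; emit (R.A, R.B, S.D)."""
    r_side = [t for tag, t in tagged if tag == "R"]
    s_side = [t for tag, t in tagged if tag == "S"]
    return [(r[0], r[1], s[1]) for r in r_side for s in s_side
            if r[0] <= s[0]]

R = [(1, 10, 12), (2, 20, 34), (1, 10, 22), (1, 30, 56), (3, 40, 17)]
S = [(1, 20, 22), (2, 30, 36), (2, 10, 29), (3, 50, 16), (3, 40, 37)]
k, h = 3, lambda a: a                 # 3 reducers, identity bucketing
buckets = {j: [] for j in range(1, k + 1)}
for t in R:
    for j, tv in ineq_map_R(t, k, h):
        buckets[j].append(tv)
for t in S:
    for j, tv in ineq_map_S(t, k, h):
        buckets[j].append(tv)
total = sum(len(ineq_reduce(buckets[j])) for j in buckets)
print(total)   # 21 output tuples satisfy R.A <= S.A
```

The R-side replication is what makes inequality joins expensive: here each R-tuple is copied to up to k reducers, which is the extra communication cost the next slide discusses.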
Why Join On Map-Reduce Is A Complex Task?
Data for multiple relations is distributed across different machines, while Map-Reduce is inherently designed for processing a single dataset.
An output tuple can be generated only when all of its input tuples are collected at a common machine.
This needs to happen for every output tuple, which is non-trivial.
A priori, we don't know which tuples are going to join to form an output tuple; that is precisely the join problem.
Ensuring it may involve a lot of replication and hence a lot of communication.
Tuples from every candidate combination need to be collected at the reducers, and the join predicates need to be checked there.