Hadoop MapReduce
How to Survive Out-of-Memory Errors
Members: Yoonseung Choi, Soyeong Park
Faculty Mentor: Prof. Harry Xu
Student Mentor: Khanh Nguyen
The International Summer Undergraduate Research Fellowship
Outline
• Introduction
• What is MapReduce?
• How does MapReduce work?
• Limitations of MapReduce
• What are our goals?
• Operation test
• Conclusions
“There were 5 exabytes of information created between the dawn of civilization and 2003,
but that much information is now created every two days, and the pace is increasing...”
- Eric Schmidt, former Google CEO
Data scientists want to analyze these large data sets
But single machines have limitations in processing these data sets
How can we handle that?
Furthermore, data sets are now growing very rapidly
We don’t want to understand parallelization, fault tolerance, data distribution, and load balancing!
Distributed processing
Therefore, we propose ‘MapReduce’, which takes care of parallelization, fault tolerance, data distribution, and load balancing.
MapReduce is a programming model for processing large data sets
Many real world tasks are expressible in this model
The model is easy to use, even for programmers without experience with parallel and distributed systems
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”.
* https://en.wikipedia.org/wiki/Apache_Hadoop
MapReduce Layer
HDFS Layer
What is MapReduce?
The Mapper takes an input and produces a set of intermediate key/value pairs.
The Reducer merges together the intermediate values associated with the same intermediate key.
[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. p.12
How does MapReduce work?
- Wordcount program -
Input: “The cat sees the dog, and the dog sees the cat.”

Map Phase (the sentence is split into two map tasks):
Split 1: “The cat sees the dog” → cat, 1 / dog, 1 / sees, 1 / the, 2
Split 2: “and the dog sees the cat” → and, 1 / cat, 1 / dog, 1 / sees, 1 / the, 2

Reduce Phase:
and, 1 / cat, 2 / dog, 2 / sees, 2 / the, 4
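The wordcount flow above can be sketched in plain Python. This is a simplification for illustration only: real Hadoop jobs implement Mapper and Reducer classes in Java, and the function names here are hypothetical.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in line.lower().replace(",", " ").replace(".", " ").split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: merge all counts emitted for the same word."""
    return word, sum(counts)

def run(lines):
    groups = defaultdict(list)  # stands in for the framework's shuffle step
    for line in lines:
        for word, count in map_fn(line):
            groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(run(["The cat sees the dog", "and the dog sees the cat"]))
# → {'the': 4, 'cat': 2, 'sees': 2, 'dog': 2, 'and': 1}
```

The counts match the slide: two map tasks each emit per-word pairs, and the reduce step sums the values grouped under each key.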
Limitations of MapReduce
There are many reasons for poor performance, and even experts sometimes can’t figure them out.
What are our goals?
• Research Out-of-Memory Error (OOM) cases
• Document OOM cases
• Implement and simulate StackOverflow OOM cases
• Develop solutions for such OOM cases
… all done!!
Two Categories
1. Inappropriate Configuration: a configuration that causes poor performance
2. Large Intermediate Results: a temporary data structure grows too large
[3] Lijie Xu, “An Empirical Study on Real-World OOM Cases in MapReduce Jobs,” Chinese Academy of Sciences.
Operation test environments
1. Standalone & pseudo-distributed mode
- ‘14 MacBook Pro: 2.8 GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB HDD
- ‘12 MacBook Air: 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB HDD
2. Fully-distributed mode
- Raspberry Pi 2 Model B (3 nodes): quad-core ARM Cortex-A7 CPU (1GHz overclock), 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
Split size variation [Single node]
* ‘14 MacBook Pro: 2.8 GHz Intel Core i5, 8GB 1600 MHz DDR3, 500GB SSD
Input: StackOverflow’s users profiles (1GB)
Runtime (sec) vs. split size (MB); series: Standalone, Pseudo-distributed (2 Mapper 2 Reducer), Pseudo-distributed (4 Mapper 4 Reducer).

[ Distributed grep (no Reducer) ]
Split size (MB)       16      32      64     128    256
Standalone         173.3    88.3    47.3    26.7   24.3
Pseudo (2M 2R)     204     117.3    86.3    64.7   56.3
Pseudo (4M 4R)     169.3   117.3    78.7    59     55

[ Standard deviation of users’ age ]
Split size (MB)       16      32      64     128    256
Standalone         169.7    85.7    43      23     23.3
Pseudo (2M 2R)     172.7   103.7    64.7    48.7   37.7
Pseudo (4M 4R)     129.7    77.7    55      39     32.7
Split size variation [Single node]
* ‘12 MacBook Air: 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
Runtime (sec) vs. split size (MB); series: Standalone, Pseudo-distributed (2 Mapper 2 Reducer), Pseudo-distributed (4 Mapper 4 Reducer).

[ Standard deviation of comment’s text length ]
Split size (MB)       16      32      64     128     256
Standalone        1577.7   807.7   425     411     312.3
Pseudo (2M 2R)    1586.3   831     634     454.3   299
Pseudo (4M 4R)    1590     803.7   540.3   397.7   323

[ Count Min and Max value ]
Split size (MB)       16      32      64     128     256
Standalone        1469     783     398     389.3   281.3
Pseudo (2M 2R)    1614     610.7   612     418.7   294.3
Pseudo (4M 4R)    1598     609     488     362.7   254.3
Split size variation [Fully-distributed] Input: StackOverflow’s users profiles (1GB)
Runtime (sec) vs. split size (MB); series: 6 Mapper, 12 Mapper.

[ Distributed grep (no Reducer) ]
Split size (MB)    32     64    128    256
6 Mapper          375    396    442    548
12 Mapper         313    296    350    557

[ average users’ age based on countries ]
Split size (MB)    16     32     64    128    256
6 Mapper        462.7  428.7  476.7  561.7  604
12 Mapper       333.3  303    345    339.7  603

* Raspberry Pi 2 Model B (3 nodes): quad-core ARM Cortex-A7 CPU (1GHz overclock), 1GB 500MHz SDRAM, 64GB HDD, 100Mbps Ethernet
io.sort.mb variation [Single node]
* ‘12 MacBook Air: 1.4 GHz Intel Core i5, 4GB 1600 MHz DDR3, 256GB SSD
Input: StackOverflow’s Comments (8.5GB)
Test program: Standard deviation of comment’s text length
Runtime (sec) vs. io.sort.mb (MB):

io.sort.mb (MB)             20      40      80     160     320
Standalone                 872     827     814     798     803.7
Pseudo-distributed 2M2R    661     638.7   632     629.7   629.7
Pseudo-distributed 4M2R    633.7   641     635.7   629.3   629
My job works well with small datasets of 200-500MB, but for datasets above 1GB, I get an error like this:
* http://stackoverflow.com/questions/23042829/getting-java-heap-space-error-while-running-a-mapreduce-code-for-large-dataset
2. Large Intermediate Results
Problem Investigation
The Mapper: the split input files (1.3GB) are processed by five map tasks (Task 1 - Task 5), each emitting intermediate [K, V] pairs. The intermediate key/value pairs grow to 4.8GB, while the Java heap is almost 1 GB.
Problem Investigation
The Reducer: it must consume the 4.8GB of intermediate key/value pairs, but “I just have 1GB heap space!”
The Java heap can’t contain the intermediate data structure.
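One way to dodge this, sketched here in Python as an assumption about what “altering the program’s algorithm” could look like for the standard-deviation jobs: compute the statistic with Welford’s online algorithm, which keeps constant state instead of buffering every value into the heap.

```python
import math

def streaming_std(values):
    """Population standard deviation via Welford's online algorithm.

    Keeps only three numbers of running state (count, mean, M2),
    so the reducer never materializes the whole value list in memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # second factor uses the updated mean
    return math.sqrt(m2 / n) if n else 0.0

# Works on a generator, i.e. one comment length at a time:
lengths = (len(text) for text in ["short", "a much longer comment", "mid"])
print(round(streaming_std(lengths), 2))
```

The same idea applies to any aggregate that has an incremental form (sum, mean, min/max, variance): the reducer stays within a 1GB heap regardless of how large the intermediate data grows.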
The configuration was: 1.3GB input, 256MB split size, 1024MB Java heap space
Error: Java heap space
Summary of Solutions
• Modify the configuration parameters
• Alter the program’s algorithm: an alternative solution suggested on the site succeeded under the configuration that originally failed (256MB split size & 1024MB Java heap size)

Split size \ Java heap size    1024MB        2048MB
128 MB                         Successful    Successful
256 MB                         Failed        Successful
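The parameter changes in the table can be expressed in mapred-site.xml. The property names below follow Hadoop 1.x conventions (matching the io.sort.mb name used in these experiments) and the values are one example combination that the table marks successful; check the names against the cluster’s Hadoop version:

```xml
<configuration>
  <!-- Per-task JVM heap: the 2048MB setting that succeeded above -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>
  <!-- Upper bound on the input split size: 128MB, in bytes -->
  <property>
    <name>mapred.max.split.size</name>
    <value>134217728</value>
  </property>
  <!-- Sort buffer used while spilling map output, in MB -->
  <property>
    <name>io.sort.mb</name>
    <value>160</value>
  </property>
</configuration>
```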
Conclusions
• How to solve the poor performance:
1. Adjust ‘split size’ & ‘sort space’ - the larger they are, the less time is spent
2. Adjust the number of Mappers - utilize all CPU cores, but a larger number of mappers is not always better
• If an intermediate data structure is too large:
- Modify the configuration parameters, or
- Alter the program’s algorithm
References[1] Jeffrey Dean and Sanjay Ghemawat. (2004). “MapReduce: Simplified Data Processing on Large Clusters”. [Online]. Available: http://static.googleusercontent.com/media/research.google.com/ko//archive/mapreduce-osdi04.pdf
[2] Ki-yong Han, Do it! Hands-on Hadoop Programming. Seoul: Easys Publishing, 2013.
[3] Lijie Xu, “An Empirical Study on Real-World OOM Cases in MapReduce Jobs,” Chinese Academy of Sciences.
[4] Donald Miner and Adam Shook, MapReduce Design Patterns. O’Reilly Media, Inc., 2012.
Thank You
If you want more technical information, please visit our GitHub repository. Our project is open source.
https://github.com/I-SURF-Hadoop/MapReduce
Appendix: How does MapReduce really work?
How does MapReduce work?
[ Map Phase ]
The MapReduce library first splits the input into M pieces.
A map worker processes these pieces using a user-defined Map function, which produces intermediate key/value pairs.

Input: “The cat sees the dog, and the dog sees the cat.”
Split: “The cat sees the dog” → the, 1 / cat, 1 / sees, 1 / the, 1 / dog, 1
Combining & Sorting → cat, 1 / dog, 1 / sees, 1 / the, 2
How does MapReduce work?
[ Reduce Phase ]
When a reduce worker has read all intermediate data, it sorts the data by the intermediate keys.
The reduce worker then iterates over the sorted data and, for each unique intermediate key, passes the key and its values to the user’s Reduce function.

Input: “The cat sees the dog, and the dog sees the cat.”
Shuffling: the map outputs (cat, 1 / dog, 1 / sees, 1 / the, 2 and and, 1 / cat, 1 / dog, 1 / sees, 1 / the, 2) are shuffled to two independent reducers.
Reducer outputs: and, 1 / cat, 2 / dog, 2 / sees, 2 / the, 4
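The sort-then-group step of the reduce phase can be sketched in Python (a simplification: Hadoop actually merge-sorts spilled map output on disk, and the pair list below is just the wordcount example's intermediate data):

```python
from itertools import groupby
from operator import itemgetter

# Intermediate pairs as emitted by the two map tasks in the slides.
intermediate = [("the", 1), ("cat", 1), ("sees", 1), ("the", 1), ("dog", 1),
                ("and", 1), ("the", 1), ("dog", 1), ("sees", 1), ("the", 1), ("cat", 1)]

# Sort by key so every key's values become adjacent; groupby then yields
# one (key, values) group per unique key, i.e. one Reduce invocation each.
intermediate.sort(key=itemgetter(0))
counts = {key: sum(v for _, v in group)
          for key, group in groupby(intermediate, key=itemgetter(0))}
print(counts)   # → {'and': 1, 'cat': 2, 'dog': 2, 'sees': 2, 'the': 4}
```

Note that `groupby` only merges adjacent items, which is exactly why the sort must happen first; this mirrors the reduce worker sorting all intermediate data before calling the user’s Reduce function.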