Gearing Hadoop towards HPC systems
HPC Advisory Council China Workshop 2014, Guangzhou
Xuanhua Shi
http://grid.hust.edu.cn/xhshi Huazhong University of Science and
Technology
Internet
2014/11/5 2
Typical router: • 42 KB/second • 3.5 gigabytes/day
[Diagram: sources of big data — Internet traffic, the Web (8 billion pages), social networks, mobile apps]
Hadoop
• a well-known processing platform for large data sets, widely used
in many domains
• Hadoop cluster @ Yahoo
• Hadoop cluster @ Facebook
• Hadoop cluster @ Twitter
• Hadoop cluster @ Linkedin
• Hadoop cluster @ Alibaba
• …
Data processing on HPC
• IDC Report, 25 Oct 2013[1]
• High Performance Data Analysis Report
• …67% of the sites in the 2013 study said they perform Big Data
analysis on their HPC systems, with 30% of the available
computing cycles devoted on average to Big Data analysis work.
IDC forecasts that revenue for high performance data analysis
(HPDA) servers will grow robustly during the 2012–2017 forecast
period…
[1] New IDC Worldwide HPC End-User Study Identifies Latest Trends in High Performance Computing Usage and Spending.
http://www.idc.com/getdoc.jsp?containerId=prUS24409313
Hadoop on HPC
• How HPC is Hacking Hadoop, HPCwire, 2014, by Nicole Hemsoth
• Utilizing HPC resources for data-intensive computing (e.g., UCSD
Gordon)
• Hadoop/MapReduce in bioinformatics
• Data-intensive applications on HDFS
• molecular dynamics simulation utilizing Hadoop
• Hadoop for remote sensing analysis
• Genome resequencing using Hadoop
• Hadoop and MPI, e.g., Data-MPI
• …
• Others
• Hadoop and HPC storage systems
• …
A closer look at Hadoop on HPC
• 320 GB data set, WordCount
• 17 nodes, each node is equipped with two 8-core 2.6GHz
Intel(R) Xeon(R) E5-2670 CPUs, 32GB memory and a
300GB 10,000RPM SAS disk
Typical Hadoop cluster
• Hadoop cluster @Twitter[1]
• 20 servers /rack
• CPU: Intel(R) Xeon(R) E5645 (dual 6-core)
• Disk: 12 × 2 TB HDD
• Mem: 72 GB
• Network: 2 × 1 Gb Ethernet
• Standard 2U server
[1]http://www.slideshare.net/ye.mikez/rottinghuis-shenoyjune270140pmhall1v2130711161716phpapp02
Typical Hadoop cluster
• Hadoop cluster @Alibaba[1]
• Over 10,000 nodes
• CPU: Intel(R) Xeon(R) E5645 × 4
• Disk numbers: 12
• Mem: 48GB
[1] http://wenku.it168.com/d_926264.shtml
Typical Hadoop cluster
• Hadoop cluster @Tencent[1]
• 4400 nodes
• CPU: over 100,000 cores
• Memory: 275 TB
• Disk: 100PB (52800 disks)
[1] http://share.csdn.net/slides/4214
Typical HPC cluster
Name         Cores       Mem (GB)    Mem/core (GB/core)
Tianhe-2     3,120,000   1,024,000   0.33
K computer     705,024   1,410,048   2
Stampede       462,462     192,192   0.42
SuperMUC       155,656     300,000   1.9
TSUBAME         74,358      74,358   1
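As a sanity check on the last column, Mem/core is simply total memory divided by core count (figures taken from the table above):

```java
public class MemPerCore {
    public static void main(String[] args) {
        // Mem/core (GB/core) = total memory (GB) / number of cores
        System.out.printf("Tianhe-2:   %.2f%n", 1_024_000.0 / 3_120_000); // ~0.33
        System.out.printf("K computer: %.2f%n", 1_410_048.0 / 705_024);   // 2.00
        System.out.printf("Stampede:   %.2f%n", 192_192.0 / 462_462);     // ~0.42
    }
}
```

Note how little memory per core these machines offer compared with a typical Hadoop node (e.g., Twitter's 72 GB over 12 cores, i.e., 6 GB/core).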
The Hadoop execution engine
[Diagram: a MapReduce job in the Hadoop execution engine performs seven separate rounds of I/O operations, leaving tasks stalled in I/O wait]
Mammoth: gearing Hadoop on HPC[1]
• A multi-threaded execution engine based on Hadoop, but running in a
single JVM on each node
• Global memory management
• a novel rule-based heuristic prioritizes memory allocation and
revocation among execution units (mapper, shuffler, reducer, etc.) to
maximize the holistic benefit to the MapReduce job when scheduling
each memory unit
• Disk access serialization, multi-cache, shuffling from memory
• Mitigates the full garbage collection (GC) problem in the JVM
[1]X. Shi, et al. Mammoth: Gearing Hadoop towards Memory-intensive applications,
TPDS, 2014
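The rule-based heuristic itself is detailed in the TPDS paper; as a rough sketch of the general idea only (all class, method, and rule names below are invented for illustration, not Mammoth's actual code), a shared pool can grant memory by unit priority and revoke from the lowest-priority holder when space runs out:

```java
import java.util.*;

// Illustrative sketch only: a shared pool that grants memory to execution
// units by priority and revokes from the lowest-priority holder first.
public class PriorityMemoryPool {
    // Higher number = higher priority (e.g., a sort buffer over a receive buffer).
    private final Map<String, Integer> priority = new HashMap<>();
    private final Map<String, Long> granted = new HashMap<>();
    private long free;

    public PriorityMemoryPool(long capacity) { this.free = capacity; }

    public void register(String unit, int prio) {
        priority.put(unit, prio);
        granted.put(unit, 0L);
    }

    // Try to grant; if short of space, revoke from lower-priority holders.
    public boolean allocate(String unit, long amount) {
        while (free < amount) {
            String victim = lowestPriorityHolderBelow(priority.get(unit));
            if (victim == null) return false;           // nothing left to revoke
            long take = Math.min(granted.get(victim), amount - free);
            granted.put(victim, granted.get(victim) - take);
            free += take;                               // victim spills to disk
        }
        free -= amount;
        granted.put(unit, granted.get(unit) + amount);
        return true;
    }

    private String lowestPriorityHolderBelow(int prio) {
        String victim = null;
        int best = prio;
        for (Map.Entry<String, Integer> e : priority.entrySet())
            if (e.getValue() < best && granted.get(e.getKey()) > 0) {
                best = e.getValue();
                victim = e.getKey();
            }
        return victim;
    }

    public long grantedTo(String unit) { return granted.get(unit); }

    public static void main(String[] args) {
        PriorityMemoryPool pool = new PriorityMemoryPool(100);
        pool.register("sort", 3);
        pool.register("send", 2);
        pool.register("receive", 1);
        pool.allocate("receive", 80);
        pool.allocate("sort", 50);   // revokes 30 units from "receive"
        System.out.println(pool.grantedTo("receive")); // 50
    }
}
```

Here a high-priority unit can reclaim memory that a lower-priority unit holds, forcing the victim to spill to disk — mirroring the allocation/revocation trade-off described above.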
Architecture of Mammoth
Mammoth: Map phase
[Diagram: Map phase — the splitReader reads the input split from HDFS into an Element Queue backed by the Element Pool; the MSorter sorts data in the Sort Buffer; the Merger merges cache files drawn from the Cache Pool into a final cache file in the Send Buffer; disk accesses go through the I/O Scheduler]
Shuffle phase
[Diagram: Shuffle phase (push model) — the Sender reads final cache files from the Send Buffer and pushes them over the network; the Receiver places incoming data into the Receive Buffer, backed by the Cache Pool; MergeSort and the Spiller spill to disk through the I/O Scheduler]
Reduce phase
[Diagram: Reduce phase (pipelined) — the RSorter sorts data from the Receive Buffer and the Cache Pool, the Reducer consumes it, and the Pusher/Writer writes the results to HDFS; disk reads go through the I/O Scheduler]
Memory usage types
[Diagram: memory usage types, ordered by priority — the Element Pool, the multi-buffer (Sort Buffer, Send Buffer, Receive Buffer), and the Cache Pool, used across the Map, Shuffle, and Reduce phases]
Memory management
Memory management
[Diagram: memory is handed over among the Map, Shuffle, and Reduce phases over time t1..t5]
Buffer priority: Sort Buffer > Send Buffer > Receive Buffer
I/O scheduling
[Diagram: I/O scheduling with multi-buffering]
Evaluations – Intermediate Data Size

Job execution time in different phases (seconds):

               Total   Map   Shuffle   Reduce
Mammoth, WCC     417    70       300        7
Hadoop, WCC      700   122       531       11
Mammoth, Sort   1583   123       635      672
Hadoop, Sort    3052   309      1965      918
Mammoth, WC     1890   249      1331      331
Hadoop, WC      4754   684      3606      563

[Chart: the same data plotted as execution time in 1,000 s]
Evaluations – available memory
Hadoop runs at different memory sizes (headers translated from Chinese):

Memory   job.child   Job ID                  Running time
64 GB    55 GB       job_201307302348_0002   16 min 30 s
48 GB    43 GB       job_201307310021_0001   19 min 49 s
32 GB    28 GB       job_201307310052_0001   31 min 44 s
16 GB    14 GB       job_201307310130_0001   45 min 31 s

Completion time (s) by available memory (node204):

Memory (GB)   Mammoth   Hadoop
64                990     1925
48               1189     2171
32               1890     4754
16               2731     6414

[Chart: the completion times above plotted in 1,000 s]
Evaluations – Real applications
• CloudBurst in bioinformatics
• an implementation of the RMAP algorithm for short-read gene
alignment
• Graph computing
• Pegasus: the Peta Graph Mining library, computing the diameter of
the graph
• Concmpt: an application for connected-component detection
• Radius: an application for graph radius computation
• DegDist: an application for vertex degree counting
• PageRank: widely used by search engines to rank web pages
Evaluations – Real applications
Does Spark run well on HPC?
In the first experiment, Hadoop, Mammoth and Spark all execute the WordCount
application; in the second, all three execute Sort. The cluster configuration
and the input datasets are the same as in the submitted paper: 17 nodes, one
master and 16 slaves, each configured with 32 GB of memory, two 8-core
Xeon E5-2670 CPUs and one SAS disk. Spark was deployed in standalone mode. The
original input dataset and the final results are stored in HDFS. For WordCount,
the input dataset was generated with the Hadoop built-in program
RandomTextWriter, 20 GB per node; for Sort, it was generated with the built-in
program RandomWriter, also 20 GB per node. To make the comparison more
convincing, we optimized Spark's configuration parameters according to the
official documents (http://spark.apache.org/docs/0.9.0/tuning.html): we set
"spark.serializer" to "org.apache.spark.serializer.KryoSerializer" and
"spark.shuffle.consolidateFiles" to "true".
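Expressed as Spark properties (shown here as a generic config fragment; the exact file or mechanism for applying them varies by Spark version), the two tuning settings are:

```properties
# Spark 0.9.x tuning settings used in this comparison
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.shuffle.consolidateFiles  true
```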
Fig. 1. Performance comparison with WordCount
Figure 1 illustrates the performance of the three systems in the first
experiment. Mammoth and Spark achieve almost the same performance, roughly a
1.7x speedup over Hadoop. All three systems aggregate the intermediate data in
the same way: <key, value> pairs with the same key are summed into a single
pair; for example, two <word, 1> pairs become one <word, 2>. This greatly
reduces the volume of intermediate data, so memory is relatively plentiful. In
Mammoth, most of the intermediate data is processed in memory, the Map tasks'
results being spilled to disk only for fault tolerance; Spark shuffles via
disk and uses a hash table rather than sorting to implement the aggregation.
Both utilize memory better than Hadoop, which explains their performance
improvements.
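The aggregation just described can be sketched in plain Java (this is the idea only, not the actual Hadoop combiner API):

```java
import java.util.*;

public class Combine {
    // Sum up <key, value> pairs sharing a key, as a map-side combiner
    // does for WordCount: two <word, 1> pairs become one <word, 2>.
    public static Map<String, Integer> combine(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        return out;
    }

    public static void main(String[] args) {
        // Two <word, 1> pairs and one <hpc, 1> pair collapse to two pairs.
        System.out.println(combine(List.of(
            Map.entry("word", 1), Map.entry("word", 1), Map.entry("hpc", 1))));
    }
}
```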
Fig. 2. Performance Comparison with Sort
If you are interested…
• Open-source:
• Project: http://grid.hust.edu.cn/xhshi/projects/mammoth.htm
• Source code: https://github.com/mammothcm/mammoth
• The patch at Apache Software Foundation:
https://issues.apache.org/jira/browse/MAPREDUCE-5605
• Publications and technical reports are available on the
project website
• Thanks!