
Page 1

Gearing Hadoop towards HPC systems

HPC Advisory Council China Workshop 2014, Guangzhou

Xuanhua Shi

http://grid.hust.edu.cn/xhshi

Huazhong University of Science and Technology

Page 2

Internet

Internet traffic:

• Typical router: 42 KB/second, 3.5 gigabytes/day

• Web: 8 billion pages

• Social networks

• Mobile apps

Page 3

Hadoop

• a well-known processing platform for large data sets, widely used in many domains

• Hadoop cluster @ Yahoo

• Hadoop cluster @ Facebook

• Hadoop cluster @ Twitter

• Hadoop cluster @ LinkedIn

• Hadoop cluster @ Alibaba

• …


Page 4

Data processing on HPC

• IDC Report, 25 Oct 2013[1]

• High Performance Data Analysis Report

• …67% of the sites in the 2013 study said they perform Big Data analysis on their HPC systems, with 30% of the available computing cycles devoted on average to Big Data analysis work. IDC forecasts that revenue for high performance data analysis (HPDA) servers will grow robustly during the 2012–2017 forecast period…


[1] New IDC Worldwide HPC End-User Study Identifies Latest Trends in High Performance Computing Usage and Spending. http://www.idc.com/getdoc.jsp?containerId=prUS24409313

Page 5

Hadoop on HPC

• How HPC is Hacking Hadoop, HPCwire, 2014, by Nicole Hemsoth

• Utilizing HPC resources for data-intensive computing (e.g., UCSD Gordon)

• Hadoop/MapReduce in bioinformatics

• Data-intensive applications on HDFS

• molecular dynamics simulation utilizing Hadoop

• Hadoop for remote sensing analysis

• Genome resequencing using Hadoop

• Hadoop and MPI, e.g., Data-MPI

• …

• Others

• Hadoop and HPC storage systems

• …


Page 6

A look at Hadoop on HPC

• 320 GB data set, WordCount

• 17 nodes; each node is equipped with two 8-core 2.6GHz Intel(R) Xeon(R) E5-2670 CPUs, 32GB memory, and a 300GB 10,000RPM SAS disk


Page 7

Typical Hadoop cluster

• Hadoop cluster @Twitter[1]

• 20 servers/rack

• CPU: Intel(R) Xeon(R) E5645 (dual 6-core)

• Disk: 12 × 2TB HDD

• Mem: 72 GB

• Network: 2 × 1Gb Ethernet

• Standard 2U server


[1]http://www.slideshare.net/ye.mikez/rottinghuis-shenoyjune270140pmhall1v2130711161716phpapp02

Page 8

Typical Hadoop cluster

• Hadoop cluster @ Alibaba[1]

• Over 10,000 nodes

• CPU: Intel(R) Xeon(R) E5645 × 4

• Disks: 12

• Mem: 48 GB


[1] http://wenku.it168.com/d_926264.shtml

Page 9

Typical Hadoop cluster

• Hadoop cluster @ Tencent[1]

• 4,400 nodes

• CPU: over 100,000 cores

• Memory: 275 TB

• Disk: 100 PB (52,800 disks)


[1] http://share.csdn.net/slides/4214

Page 10

Typical HPC cluster

Name         Number of cores   Mem (GB)    Mem/core (GB/core)
Tianhe-2     3,120,000         1,024,000   0.33
K computer   705,024           1,410,048   2
Stampede     462,462           192,192     0.42
SuperMUC     155,656           300,000     1.9
TSUBAME      74,358            74,358      1


Page 11

The Hadoop execution engine

[Diagram: the Hadoop execution engine performs seven successive I/O operations along a job's execution path, leaving tasks in I/O wait.]

Page 12

Mammoth: gearing Hadoop towards HPC[1]

• a multi-threaded execution engine based on Hadoop that runs in a single JVM on each node (a minimal sketch follows below)

• global memory management

• a novel rule-based heuristic to prioritize memory allocation and revocation among execution units (mapper, shuffler, reducer, etc.), to maximize the holistic benefit to the Map/Reduce job when scheduling each memory unit

• disk access serialization, multi-cache, and shuffling from memory

• solves the problem of full garbage collection (GC) in the JVM

[1] X. Shi, et al. Mammoth: Gearing Hadoop towards Memory-intensive Applications. IEEE TPDS, 2014.
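To make the single-JVM point concrete, here is a minimal Java sketch (our illustration, not Mammoth's actual code): map tasks run as threads inside one JVM per node instead of each task forking its own child JVM as in stock Hadoop 1.x.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Hypothetical runner: every task on a node shares this JVM's heap, so a
    // global memory manager can arbitrate among all of them.
    public class SingleJvmTaskRunner {
        private final ExecutorService pool =
                Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

        public void runMapTasks(List<Runnable> mapTasks) {
            for (Runnable task : mapTasks) {
                pool.submit(task);  // stock Hadoop 1.x would fork a child JVM here
            }
        }

        public void shutdown() {
            pool.shutdown();
        }
    }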


Page 13

Architecture of Mammoth


Page 14

Mammoth: Map phase

[Diagram: Map-phase data flow. The Reader takes an input split from HDFS; the Mapper's output elements pass through the Element Pool and Element Queue into the Sort Buffer, where the MSorter sorts them into cache files; the Merger merges these into a final cache file in the Send Buffer. Buffers are obtained from and returned to the Cache Pool, and disk accesses go through the I/O Scheduler.]

Page 15

Shuffle phase

[Diagram: Shuffle-phase data flow. On the map side, the Sender pushes the final cache file from the Send Buffer over the network while the Spiller spills to disk through the I/O Scheduler; on the reduce side, the Receiver fills the Receive Buffer (backed by the Cache Pool), and a MergeSort Reader merges the received data, spilling and reading via the I/O Scheduler as needed.]

Push model: map output is pushed to the reducers from memory rather than pulled by them.
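To make the push model concrete, a toy Java sketch (our illustration; the raw-socket transport and the method names are assumptions):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.net.Socket;

    // Hypothetical sender: pushes a finished map-output partition straight
    // from memory to the reducer, instead of waiting for the reducer to pull
    // it over HTTP as stock Hadoop does.
    public class ShuffleSender {
        public void push(String reducerHost, int reducerPort, byte[] partition)
                throws IOException {
            try (Socket socket = new Socket(reducerHost, reducerPort);
                 OutputStream out = socket.getOutputStream()) {
                out.write(partition);  // no intermediate disk hop on the map side
            }
        }
    }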

Page 16

Reduce phase

[Diagram: Reduce-phase data flow. The RSorter sorts data from the Receive Buffer, with the Reader bringing spills back from disk through the I/O Scheduler; the Reducer consumes the sorted stream, and the Pusher/Writer writes the final output to HDFS. Buffers are returned to the Cache Pool.]

Pipeline: sorting, reducing, and writing to HDFS are overlapped rather than run as separate stages.

Page 17

Memory usage types

[Diagram: memory usage types, ordered by priority. Memory is divided into the Element Pool, the multi-buffer (Sort Buffer, Send Buffer, Receive Buffer), and the Cache Pool; each is labeled with the phase it serves (Map, Shuffle, or Map and Reduce).]

Page 18

Memory management


Page 19

Memory management

[Diagram: memory held by the Map, Shuffle, and Reduce phases over time (t1 to t5).]

Allocation priority: Sort Buffer > Send Buffer > Receive Buffer
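The priority rule above can be read as the following minimal Java sketch (an assumption of ours, not Mammoth's implementation): an allocation request may revoke memory from lower-priority buffers, never from an equal or higher priority.

    import java.util.EnumMap;
    import java.util.Map;

    // Hypothetical arbiter: serves allocation requests in the priority order
    // above, revoking memory from the lowest-priority holder first.
    public class MemoryArbiter {
        enum Buffer { SORT, SEND, RECEIVE }   // declared from highest to lowest priority

        private final long capacity;          // total bytes managed on this node
        private final Map<Buffer, Long> held = new EnumMap<>(Buffer.class);

        public MemoryArbiter(long capacityBytes) {
            this.capacity = capacityBytes;
            for (Buffer b : Buffer.values()) held.put(b, 0L);
        }

        private long used() {
            long sum = 0;
            for (long h : held.values()) sum += h;
            return sum;
        }

        /** Grant the request, revoking lower-priority memory if necessary. */
        public synchronized boolean allocate(Buffer who, long bytes) {
            long deficit = used() + bytes - capacity;
            Buffer[] order = Buffer.values();
            for (int i = order.length - 1; i >= 0 && deficit > 0; i--) {
                Buffer victim = order[i];
                if (victim.ordinal() <= who.ordinal()) break;  // respect the priority rule
                long taken = Math.min(held.get(victim), deficit);
                held.put(victim, held.get(victim) - taken);    // real code would spill this data
                deficit -= taken;
            }
            if (deficit > 0) return false;                     // still not enough memory
            held.put(who, held.get(who) + bytes);
            return true;
        }
    }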

Page 20

I/O scheduling

[Diagram: disk requests are funneled through the I/O Scheduler's multi-buffer, serializing disk accesses.]
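A minimal Java sketch of the multi-buffer idea (our illustration, assuming a bounded queue of four buffers):

    import java.io.IOException;
    import java.io.OutputStream;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Hypothetical writer: compute threads hand filled buffers to one I/O
    // thread, so all disk writes are serialized through a single point while
    // computation keeps running.
    public class MultiBufferWriter implements AutoCloseable {
        private final BlockingQueue<byte[]> full = new ArrayBlockingQueue<>(4);
        private final Thread ioThread;

        public MultiBufferWriter(OutputStream out) {
            ioThread = new Thread(() -> {
                try {
                    while (true) {
                        byte[] buf = full.take();    // wait for a filled buffer
                        if (buf.length == 0) break;  // empty buffer = shutdown signal
                        out.write(buf);              // only this thread touches the disk
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } catch (IOException e) {
                    throw new RuntimeException(e);
                }
            });
            ioThread.start();
        }

        /** Called by compute threads; blocks only when all buffers are in flight. */
        public void submit(byte[] filledBuffer) throws InterruptedException {
            full.put(filledBuffer);
        }

        @Override
        public void close() throws InterruptedException {
            full.put(new byte[0]);
            ioThread.join();
        }
    }

The bounded queue provides back-pressure: compute threads stall only when every buffer is already queued for the disk.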

Page 21

Evaluations – Intermediate Data Size

Job execution time in different phases (seconds):

                    Total    Map   Shuffle   Reduce
Mammoth for WCC       417     70       300        7
Hadoop for WCC        700    122       531       11
Mammoth for Sort     1583    123       635      672
Hadoop for Sort      3052    309      1965      918
Mammoth for WC       1890    249      1331      331
Hadoop for WC        4754    684      3606      563

[Bar chart of the same data, time axis in 1000 s.]

Page 22

Evaluations – available memory

Hadoop runs under varying memory sizes (translated from the original table):

Memory size   job.child   Job ID                  Running time
64 GB         55 GB       job_201307302348_0002   16 min 30 s
48 GB         43 GB       job_201307310021_0001   19 min 49 s
32 GB         28 GB       job_201307310052_0001   31 min 44 s
16 GB         14 GB       job_201307310130_0001   45 min 31 s

Completion time (s) under varying memory sizes:

Memory (GB)   Mammoth   Hadoop
64                990     1925
48               1189     2171
32               1890     4754
16               2731     6414

[Bar chart of the same data (Mammoth, node204), completion time in 1000 s.]

Page 23

Evaluations – Real applications

• CloudBurst in bioinformatics

• an implementation of the RMAP algorithm for short-read gene alignment

• Graph computing with Pegasus, the Peta Graph Mining library, computing the diameter of the graph

• ConCmpt: an application for connected-component detection

• Radius: an application for graph radius computation

• DegDist: an application for vertex degree counting

• PageRank: widely used by search engines to rank web pages


Page 24

Evaluations – Real applications


Page 25

Does Spark run well on HPC?

In the first experiment, Hadoop, Mammoth, and Spark all execute the WordCount application; in the second experiment, all three execute the Sort application. The configuration of the cluster and the input datasets are the same as in the submitted paper: there are 17 nodes in the cluster, with one node as the master and the other 16 as slaves, and each node is configured with 32GB memory, two 8-core Xeon E5-2670 CPUs, and one SAS disk. Spark was deployed in standalone mode. The original input dataset and the final results are stored in HDFS. For WordCount, the input dataset was generated with the Hadoop built-in program RandomTextWriter, 20GB per node; for Sort, it was generated with the Hadoop built-in program RandomWriter, likewise 20GB per node. To make the comparison more convincing, we optimized Spark's configuration parameters according to the official documentation (http://spark.apache.org/docs/0.9.0/tuning.html): we set “spark.serializer” to “org.apache.spark.serializer.KryoSerializer” and “spark.shuffle.consolidateFiles” to “true”.
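For reference, the two parameters above can be set programmatically (a sketch against Spark 0.9's Java API; the application name and master URL are placeholders of ours):

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    // Builds a context with the two tuning properties named in the text.
    public class TunedContext {
        public static JavaSparkContext create() {
            SparkConf conf = new SparkConf()
                    .setAppName("sort-benchmark")           // placeholder app name
                    .setMaster("spark://master:7077")       // placeholder standalone master URL
                    .set("spark.serializer",
                         "org.apache.spark.serializer.KryoSerializer")
                    .set("spark.shuffle.consolidateFiles", "true");
            return new JavaSparkContext(conf);
        }
    }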

Fig. 1. Performance comparison with WordCount

Figure 1 illustrates the performance of the three systems in the first experiment. Mammoth and Spark obtain almost the same performance, roughly a 1.7x speedup over Hadoop. All three systems aggregate the intermediate data in the same way: <key, value> pairs with the same key are summed into a single pair; for example, one <word, 1> and another <word, 1> become one <word, 2>. This aggregation sharply reduces the quantity of intermediate data, so memory is relatively plentiful. In Mammoth, most of the intermediate data is processed in memory, with the Map tasks' results spilled to disk only for fault tolerance; Spark shuffles through the disk and uses a hash table rather than sorting to implement the aggregation. Both use memory better than Hadoop, which is the reason for their performance improvements.
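On the Hadoop side, this map-side aggregation is what a standard combiner does; a minimal sketch (class name ours):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Standard Hadoop combiner: pairs sharing a key are summed before the
    // shuffle, e.g. <word,1> and <word,1> become <word,2>.
    public class WordCountCombiner
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            ctx.write(word, new IntWritable(sum));
        }
    }

It would be attached with job.setCombinerClass(WordCountCombiner.class), so Hadoop applies it before map output is shuffled.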

Fig. 2. Performance Comparison with Sort

[Bar charts for Fig. 1 and Fig. 2: execution time (s) of Hadoop, Mammoth, and Spark for WordCount and Sort.]


Page 26

If you are interested…

• Open-source:

• Project: http://grid.hust.edu.cn/xhshi/projects/mammoth.htm

• Source code: https://github.com/mammothcm/mammoth

• The patch at the Apache Software Foundation: https://issues.apache.org/jira/browse/MAPREDUCE-5605

• Publications and technical reports are available on the project website


Page 27

Thanks!
