
Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies


Page 1: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Dhabaleswar K. (DK) Panda, The Ohio State University

E-mail: panda@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~panda

Talk at Hadoop Summit, Dublin, Ireland (April 2016)

by

Xiaoyi Lu, The Ohio State University

E-mail: luxi@cse.ohio-state.edu, http://www.cse.ohio-state.edu/~luxi

Page 2: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 2Network Based Computing Laboratory

• Substantial impact on designing and utilizing data management and processing systems in multiple tiers
  – Front-end data accessing and serving (Online)
    • Memcached + DB (e.g., MySQL), HBase
  – Back-end data analytics (Offline)
    • HDFS, MapReduce, Spark

Data Management and Processing on Modern Clusters

Page 3: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 3Network Based Computing Laboratory

[Chart: Trends for Commodity Computing Clusters in the Top 500 List (http://www.top500.org) — number of clusters (0–500) and percentage of clusters (0–100%) plotted over the timeline Nov-96 to Nov-15, with commodity clusters reaching 85% of the list by the end of the period.]

Page 4: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 4Network Based Computing Laboratory

Drivers of Modern HPC Cluster Architectures

• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (SSDs), Non-Volatile Random-Access Memory (NVRAM), NVMe-SSD
• Accelerators (NVIDIA GPGPUs and Intel Xeon Phi)

[Figure labels: Multi-core Processors; High-Performance Interconnects – InfiniBand (<1 usec latency, 100 Gbps bandwidth); Accelerators/Coprocessors (high compute density, high performance/watt, >1 TFlop DP on a chip); SSD, NVMe-SSD, NVRAM. Example systems: Tianhe-2, Titan, Stampede, Tianhe-1A.]

Page 5: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 5Network Based Computing Laboratory

• Advanced Interconnects and RDMA protocols
  – InfiniBand
  – 10-40 Gigabit Ethernet/iWARP
  – RDMA over Converged Enhanced Ethernet (RoCE)
• Delivering excellent performance (Latency, Bandwidth, and CPU Utilization)
• Has influenced re-designs of enhanced HPC middleware
  – Message Passing Interface (MPI) and PGAS

– Parallel File Systems (Lustre, GPFS, ..)

• SSDs (SATA and NVMe)

• NVRAM and Burst Buffer

Trends in HPC Technologies

Page 6: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 6Network Based Computing Laboratory

Interconnects and Protocols in OpenFabrics Stack for HPC(http://openfabrics.org)

Application / Middleware → Interface → Protocol → Adapter → Switch, for each stack:

– 1/10/40/100 GigE: Sockets → TCP/IP (kernel space) → Ethernet adapter → Ethernet switch
– 10/40 GigE-TOE: Sockets → TCP/IP (hardware offload) → Ethernet adapter → Ethernet switch
– IPoIB: Sockets → IPoIB (kernel space) → InfiniBand adapter → InfiniBand switch
– RSockets: Sockets → RSockets (user space) → InfiniBand adapter → InfiniBand switch
– SDP: Sockets → SDP → InfiniBand adapter → InfiniBand switch
– iWARP: Sockets → TCP/IP (user space) → iWARP adapter → Ethernet switch
– RoCE: Verbs → RDMA (user space) → RoCE adapter → Ethernet switch
– IB Native: Verbs → RDMA (user space) → InfiniBand adapter → InfiniBand switch

Page 7: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 7Network Based Computing Laboratory

Large-scale InfiniBand Installations

• 235 IB Clusters (47%) in the Nov '15 Top500 list (http://www.top500.org)
• Installations in the Top 50 (21 systems):
  – 462,462 cores (Stampede) at TACC (10th)
  – 185,344 cores (Pleiades) at NASA/Ames (13th)
  – 72,800 cores Cray CS-Storm in US (15th)
  – 72,800 cores Cray CS-Storm in US (16th)
  – 265,440 cores SGI ICE at Tulip Trading Australia (17th)
  – 124,200 cores (Topaz) SGI ICE at ERDC DSRC in US (18th)
  – 72,000 cores (HPC2) in Italy (19th)
  – 152,692 cores (Thunder) at AFRL/USA (21st)
  – 147,456 cores (SuperMUC) in Germany (22nd)
  – 86,016 cores (SuperMUC Phase 2) in Germany (24th)
  – 76,032 cores (Tsubame 2.5) at Japan/GSIC (25th)
  – 194,616 cores (Cascade) at PNNL (27th)
  – 76,032 cores (Makman-2) at Saudi Aramco (32nd)
  – 110,400 cores (Pangea) in France (33rd)
  – 37,120 cores (Lomonosov-2) at Russia/MSU (35th)
  – 57,600 cores (SwiftLucy) in US (37th)
  – 55,728 cores (Prometheus) at Poland/Cyfronet (38th)
  – 50,544 cores (Occigen) at France/GENCI-CINES (43rd)
  – 76,896 cores (Salomon) SGI ICE in Czech Republic (47th)
  – and many more!

Page 8: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 8Network Based Computing Laboratory

• Introduced in Oct 2000
• High-Performance Data Transfer
  – Interprocessor communication and I/O
  – Low latency (<1.0 microsec), high bandwidth (up to 12.5 GigaBytes/sec -> 100 Gbps), and low CPU utilization (5-10%)
• Multiple Operations
  – Send/Recv
  – RDMA Read/Write
  – Atomic Operations (very unique)
    • High-performance and scalable implementations of distributed locks, semaphores, and collective communication operations
• Leading to big changes in designing
  – HPC clusters
  – File systems
  – Cloud computing systems
  – Grid computing systems

Open Standard InfiniBand Networking Technology

Page 9: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 9Network Based Computing Laboratory

Communication in the Memory Semantics (RDMA Model)

[Diagram: initiator and target processors, each with memory segments and an InfiniBand device with send/receive queue pair (QP) and completion queue (CQ); the transfer completes with a hardware ACK.]

• The send WQE contains information about the send buffer (multiple segments) and the receive buffer (single segment)
• The initiator processor is involved only to:
  1. Post the send WQE
  2. Pull the completed CQE from the send CQ
• No involvement from the target processor (a minimal verbs sketch of this flow follows below)
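To make this flow concrete, here is a minimal sketch in C against the standard libibverbs API: it posts a single signaled RDMA WRITE work request and busy-polls the send completion queue. It assumes the queue pair, completion queue, and memory region have already been created and connected, and that the remote address and rkey were exchanged out of band; error handling is reduced to the bare minimum.

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

int rdma_write_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                        struct ibv_mr *mr, void *local_buf, size_t len,
                        uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t) local_buf,   /* registered local buffer */
        .length = (uint32_t) len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided write */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* generate a CQE */
    wr.wr.rdma.remote_addr = remote_addr;         /* target memory  */
    wr.wr.rdma.rkey        = rkey;

    /* Step 1: post the send WQE -- the only CPU involvement so far. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Step 2: pull the completed CQE; the target CPU is never involved. */
    struct ibv_wc wc;
    int n;
    while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
        ;                                         /* busy-poll for brevity */
    return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
}
```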

Page 10: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 10Network Based Computing Laboratory

How Can HPC Clusters with High-Performance Interconnect and Storage Architectures Benefit Big Data Applications?

Bring HPC and Big Data processing into a “convergent trajectory”!

• What are the major bottlenecks in current Big Data processing middleware (e.g., Hadoop, Spark, and Memcached)?
• Can the bottlenecks be alleviated with new designs by taking advantage of HPC technologies?
• Can RDMA-enabled high-performance interconnects benefit Big Data processing?
• Can HPC clusters with high-performance storage systems (e.g., SSD, parallel file systems) benefit Big Data applications?
• How much performance benefit can be achieved through enhanced designs?
• How do we design benchmarks for evaluating the performance of Big Data middleware on HPC clusters?

Page 11: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 11Network Based Computing Laboratory

Designing Communication and I/O Libraries for Big Data Systems: Challenges

[Layered architecture diagram:]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
• Programming Models (Sockets) — Other Protocols?
• Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, Virtualization, I/O and File Systems, QoS, Fault-Tolerance, Benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators); Storage Technologies (HDD, SSD, and NVMe-SSD)
• Upper-level Changes?

Page 12: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 12Network Based Computing Laboratory

Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

• Sockets not designed for high performance
  – Stream semantics often mismatch for upper layers
  – Zero-copy not available for non-blocking sockets
• Current Design: Application → Sockets → 1/10/40/100 GigE Network
• Our Approach: Application → OSU Design → Verbs Interface → 10/40/100 GigE or InfiniBand

Page 13: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 13Network Based Computing Laboratory

• RDMA for Apache Spark
• RDMA for Apache Hadoop 2.x (RDMA-Hadoop-2.x)
  – Plugins for Apache, Hortonworks (HDP), and Cloudera (CDH) Hadoop distributions
• RDMA for Apache Hadoop 1.x (RDMA-Hadoop)
• RDMA for Memcached (RDMA-Memcached)
• OSU HiBD-Benchmarks (OHB)
  – HDFS and Memcached Micro-benchmarks
• http://hibd.cse.ohio-state.edu
• Users base (based on voluntary registration): 160 organizations from 22 countries
• More than 16,000 downloads from the project site
• RDMA for Apache HBase and Impala (upcoming)

The High-Performance Big Data (HiBD) Project

Available for InfiniBand and RoCE

Page 14: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 14Network Based Computing Laboratory

• High-Performance Design of Hadoop over RDMA-enabled Interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for the HDFS, MapReduce, and RPC components
  – Enhanced HDFS with in-memory and heterogeneous storage
  – High-performance design of MapReduce over Lustre
  – Plugin-based architecture supporting RDMA-based designs for Apache Hadoop, CDH, and HDP
  – Easily configurable for different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre) and different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.9
  – Based on Apache Hadoop 2.7.1
  – Compliant with Apache Hadoop 2.7.1, HDP 2.3.0.0, and CDH 5.6.0 APIs and applications
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • Different file systems with disks and SSDs, and Lustre
  – http://hibd.cse.ohio-state.edu

RDMA for Apache Hadoop 2.x Distribution

Page 15: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 15Network Based Computing Laboratory

• High-Performance Design of Spark over RDMA-enabled Interconnects
  – High-performance RDMA-enhanced design with native InfiniBand and RoCE support at the verbs level for Spark
  – RDMA-based data shuffle and SEDA-based shuffle architecture
  – Non-blocking and chunk-based data transfer
  – Easily configurable for different protocols (native InfiniBand, RoCE, and IPoIB)
• Current release: 0.9.1
  – Based on Apache Spark 1.5.1
  – Tested with
    • Mellanox InfiniBand adapters (DDR, QDR, and FDR)
    • RoCE support with Mellanox adapters
    • Various multi-core platforms
    • RAM disks, SSDs, and HDD
  – http://hibd.cse.ohio-state.edu

RDMA for Apache Spark Distribution

Page 16: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 16Network Based Computing Laboratory

• Micro-benchmarks for Hadoop Distributed File System (HDFS) (see the sketch after this slide)
  – Sequential Write Latency (SWL) Benchmark, Sequential Read Latency (SRL) Benchmark, Random Read Latency (RRL) Benchmark, Sequential Write Throughput (SWT) Benchmark, Sequential Read Throughput (SRT) Benchmark
  – Support benchmarking of
    • Apache Hadoop 1.x and 2.x HDFS, Hortonworks Data Platform (HDP) HDFS, Cloudera Distribution of Hadoop (CDH) HDFS
• Micro-benchmarks for Memcached
  – Get Benchmark, Set Benchmark, and Mixed Get/Set Benchmark
• Current release: 0.8
• http://hibd.cse.ohio-state.edu

OSU HiBD Micro-Benchmark (OHB) Suite – HDFS & Memcached
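As a rough illustration of what a sequential-write-latency run measures, the sketch below times a stream of sequential writes through the stock libhdfs C API. It is not the OHB benchmark itself; the file path, chunk size, and repetition count are placeholders.

```c
#include <hdfs.h>      /* libhdfs C API shipped with Apache Hadoop */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    const size_t chunk = 4 * 1024 * 1024;          /* 4 MB per write (placeholder) */
    const int    reps  = 256;                      /* 1 GB total (placeholder)     */
    char *buf = malloc(chunk);
    memset(buf, 'x', chunk);

    hdfsFS fs = hdfsConnect("default", 0);         /* use the configured fs.defaultFS */
    if (!fs) return 1;
    hdfsFile f = hdfsOpenFile(fs, "/tmp/ohb_swl.dat", O_WRONLY, 0, 0, 0);
    if (!f) return 1;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        hdfsWrite(fs, f, buf, (int)chunk);         /* sequential writes */
    hdfsCloseFile(fs, f);                          /* flush to the DataNodes */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("wrote %.0f MB sequentially in %.2f s\n", reps * chunk / 1e6, sec);

    hdfsDisconnect(fs);
    free(buf);
    return 0;
}
```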

Page 17: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 17Network Based Computing Laboratory

• HHH: Heterogeneous storage devices with hybrid replication schemes are supported in this mode of operation for better fault-tolerance as well as performance. This mode is enabled by default in the package.

• HHH-M: A high-performance, in-memory-based setup has been introduced in this package that can be utilized to perform all I/O operations in memory and obtain as much performance benefit as possible.

• HHH-L: With parallel file systems integrated, HHH-L mode can take advantage of the Lustre available in the cluster.

• MapReduce over Lustre, with/without local disks: Besides HDFS-based solutions, this package also provides support for running MapReduce jobs on top of Lustre alone. Here, two different modes are introduced: with local disks and without local disks.

• Running with Slurm and PBS: Supports deploying RDMA for Apache Hadoop 2.x with Slurm and PBS in different running modes (HHH, HHH-M, HHH-L, and MapReduce over Lustre).

Different Modes of RDMA for Apache Hadoop 2.x

Page 18: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 18Network Based Computing Laboratory

• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – OSU HiBD Benchmarks (OHB)

Acceleration Case Studies and Performance Evaluation

Page 19: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 19Network Based Computing Laboratory

Design Overview of HDFS with RDMA

[Architecture diagram: Applications → HDFS (Write/Others) → Java Socket Interface over 1/10/40/100 GigE, IPoIB networks, or Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE, ...)]

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based HDFS with a communication library written in native code (see the JNI sketch after this slide)
• Design Features
  – RDMA-based HDFS write
  – RDMA-based HDFS replication
  – Parallel replication support
  – On-demand connection setup
  – InfiniBand/RoCE support

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy, and D. K. Panda, High Performance RDMA-Based Design of HDFS over InfiniBand, Supercomputing (SC), Nov 2012

N. Islam, X. Lu, W. Rahman, and D. K. Panda, SOR-HDFS: A SEDA-based Approach to Maximize Overlapping in RDMA-Enhanced HDFS, HPDC '14, June 2014
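To illustrate the JNI bridging idea (this is not the RDMA-for-Apache-Hadoop source), the native side of such a plugin can accept a direct ByteBuffer from Java and pass its off-heap address to a verbs-based engine without copying through the JVM heap. The Java class, method name, and rdma_engine_post_write function below are hypothetical.

```c
#include <jni.h>
#include <stddef.h>

/* Assumed native engine API -- stands in for a verbs-based library. */
extern int rdma_engine_post_write(void *buf, size_t len);

JNIEXPORT jint JNICALL
Java_org_example_RdmaBridge_postWrite(JNIEnv *env, jobject obj,
                                      jobject directBuffer, jint length)
{
    (void) obj;

    /* Resolve the off-heap address backing the direct ByteBuffer. */
    void *addr = (*env)->GetDirectBufferAddress(env, directBuffer);
    jlong cap  = (*env)->GetDirectBufferCapacity(env, directBuffer);
    if (addr == NULL || cap < length)
        return -1;                    /* not a direct buffer, or too small */

    /* Hand the buffer straight to the native communication engine. */
    return (jint) rdma_engine_post_write(addr, (size_t) length);
}
```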

Page 20: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 20Network Based Computing Laboratory

Enhanced HDFS with In-Memory and Heterogeneous Storage (Triple-H)

[Architecture diagram: Applications → Triple-H (data placement policies, eviction/promotion, hybrid replication) → heterogeneous storage: RAM Disk, SSD, HDD, and Lustre]

• Design Features
  – Three modes
    • Default (HHH)
    • In-Memory (HHH-M)
    • Lustre-Integrated (HHH-L)
  – Policies to efficiently utilize the heterogeneous storage devices (a toy placement sketch follows this slide)
    • RAM, SSD, HDD, Lustre
  – Eviction/Promotion based on data usage pattern
  – Hybrid replication
  – Lustre-Integrated mode:
    • Lustre-based fault-tolerance

N. Islam, X. Lu, M. W. Rahman, D. Shankar, and D. K. Panda, Triple-H: A Hybrid Approach to Accelerate HDFS on HPC Clusters with Heterogeneous Storage Architecture, CCGrid ’15, May 2015

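As a toy illustration of tiered placement across RAM disk, SSD, HDD, and Lustre: the thresholds and the "hotness" score below are invented for illustration only; the actual Triple-H policies are described in the CCGrid '15 paper above.

```c
#include <stdint.h>

enum tier { TIER_RAMDISK, TIER_SSD, TIER_HDD, TIER_LUSTRE };

struct tier_state {
    uint64_t free_bytes[4];   /* free capacity per tier, indexed by enum tier */
};

/* Place a new block: prefer the fastest tier with room; hot data gets
 * priority for RAM disk, colder data starts lower in the hierarchy. */
enum tier place_block(const struct tier_state *s, uint64_t block_bytes,
                      double hotness /* 0.0 = cold .. 1.0 = hot */)
{
    if (hotness > 0.7 && s->free_bytes[TIER_RAMDISK] >= block_bytes)
        return TIER_RAMDISK;
    if (hotness > 0.3 && s->free_bytes[TIER_SSD] >= block_bytes)
        return TIER_SSD;
    if (s->free_bytes[TIER_HDD] >= block_bytes)
        return TIER_HDD;
    return TIER_LUSTRE;       /* fall back to the parallel file system */
}
```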

Page 21: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 21Network Based Computing Laboratory

Design Overview of MapReduce with RDMA

[Architecture diagram: Applications → MapReduce (JobTracker, TaskTracker, Map, Reduce) → Java Socket Interface over 1/10/40/100 GigE, IPoIB networks, or Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE, ...)]

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based MapReduce with a communication library written in native code
• Design Features
  – RDMA-based shuffle
  – Prefetching and caching of map output
  – Efficient shuffle algorithms
  – In-memory merge
  – On-demand shuffle adjustment
  – Advanced overlapping
    • map, shuffle, and merge
    • shuffle, merge, and reduce
  – On-demand connection setup
  – InfiniBand/RoCE support

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Page 22: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 22Network Based Computing Laboratory

Advanced Overlapping among Different Phases

• A hybrid approach to achieve the maximum possible overlapping in MapReduce across all phases compared to other approaches
  – Efficient shuffle algorithms
  – Dynamic and efficient switching
  – On-demand shuffle adjustment

[Timelines compared: Default Architecture, Enhanced Overlapping with In-Memory Merge, Advanced Hybrid Overlapping]

M. W. Rahman, X. Lu, N. S. Islam, and D. K. Panda, HOMR: A Hybrid Approach to Exploit Maximum Overlapping in MapReduce over High Performance Interconnects, ICS, June 2014

Page 23: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 23Network Based Computing Laboratory

Design Overview of Hadoop RPC with RDMA

[Architecture diagram: Applications → Hadoop RPC (Default / Our Design) → Java Socket Interface over 1/10/40/100 GigE, IPoIB networks, or Java Native Interface (JNI) → OSU Design → Verbs → RDMA-capable networks (IB, iWARP, RoCE, ...)]

• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Java-based RPC with a communication library written in native code
• Design Features
  – JVM-bypassed buffer management
  – RDMA or send/recv based adaptive communication
  – Intelligent buffer allocation and adjustment for serialization
  – On-demand connection setup
  – InfiniBand/RoCE support

X. Lu, N. Islam, M. W. Rahman, J. Jose, H. Subramoni, H. Wang, and D. K. Panda, High-Performance Design of Hadoop RPC with RDMA over InfiniBand, Int'l Conference on Parallel Processing (ICPP '13), October 2013.

Page 24: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 24Network Based Computing Laboratory

Performance Benefits – RandomWriter & TeraGen in TACC-Stampede

[Two charts: execution time (s) vs. data size (80, 100, 120 GB) for IPoIB (FDR) vs. OSU-IB (FDR); left panel RandomWriter (reduced by 3x), right panel TeraGen (reduced by 4x)]

Cluster with 32 nodes with a total of 128 maps

• RandomWriter
  – 3-4x improvement over IPoIB for 80-120 GB file size
• TeraGen
  – 4-5x improvement over IPoIB for 80-120 GB file size

Page 25: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 25Network Based Computing Laboratory

Performance Benefits – Sort & TeraSort in TACC-Stampede

[Two charts: execution time (s) vs. data size (80, 100, 120 GB) for IPoIB (FDR) vs. OSU-IB (FDR); left panel Sort (reduced by 52%), right panel TeraSort (reduced by 44%)]

Cluster configurations: 32 nodes with a total of 128 maps and 64 reduces; 32 nodes with a total of 128 maps and 57 reduces

• Sort with single HDD per node
  – 40-52% improvement over IPoIB for 80-120 GB data
• TeraSort with single HDD per node
  – 42-44% improvement over IPoIB for 80-120 GB data

Page 26: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 26Network Based Computing Laboratory

Evaluation of HHH and HHH-L with Applications

[Left: CloudBurst execution time — HDFS (FDR): 60.24 s, HHH (FDR): 48.3 s. Right: MR-MSPolygraph chart — execution time (s) vs. concurrent maps per host (4, 6, 8) for HDFS, Lustre, and HHH-L; reduced by 79%]

• MR-MSPolygraph on OSU RI with 1,000 maps
  – HHH-L reduces the execution time by 79% over Lustre and 30% over HDFS
• CloudBurst on TACC Stampede
  – With HHH: 19% improvement over HDFS

Page 27: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 27Network Based Computing Laboratory

Evaluation with Spark on SDSC Gordon (HHH vs. Tachyon/Alluxio)

• For 200 GB TeraGen on 32 nodes
  – Spark-TeraGen: HHH has 2.4x improvement over Tachyon and 2.3x over HDFS-IPoIB (QDR)
  – Spark-TeraSort: HHH has 25.2% improvement over Tachyon and 17% over HDFS-IPoIB (QDR)

[Two charts: execution time (s) vs. cluster size : data size (8:50, 16:100, 32:200 GB) for IPoIB (QDR), Tachyon, and OSU-IB (QDR); left panel TeraGen (reduced by 2.4x), right panel TeraSort (reduced by 25.2%)]

N. Islam, M. W. Rahman, X. Lu, D. Shankar, and D. K. Panda, Performance Characterization and Acceleration of In-Memory File Systems for Hadoop and Spark Applications on HPC Clusters, IEEE BigData ’15, October 2015

Page 28: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 28Network Based Computing Laboratory

Design Overview of Shuffle Strategies for MapReduce over Lustre

[Diagram: map tasks (Map 1, Map 2, Map 3) write to the intermediate data directory on Lustre; reduce tasks (Reduce 1, Reduce 2) fetch map output via Lustre read or RDMA, then perform in-memory merge/sort and reduce]

• Design Features (a toy sketch of the hybrid path selection follows this slide)
  – Two shuffle approaches
    • Lustre read based shuffle
    • RDMA based shuffle
  – Hybrid shuffle algorithm to take benefit from both shuffle approaches
  – Dynamically adapts to the better shuffle approach for each shuffle request based on profiling values for each Lustre read operation
  – In-memory merge and overlapping of different phases are kept similar to the RDMA-enhanced MapReduce design

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, High Performance Design of YARN MapReduce on Modern HPC Clusters with Lustre and RDMA, IPDPS, May 2015

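The hybrid shuffle idea can be pictured as routing each shuffle request to whichever path currently looks cheaper according to profiled costs. The sketch below is a toy illustration of that idea; the exponentially weighted update rule and constants are invented and do not reflect the published algorithm.

```c
enum shuffle_path { LUSTRE_READ, RDMA_FETCH };

struct hybrid_state {
    double lustre_ms;   /* smoothed cost of a Lustre-read shuffle request */
    double rdma_ms;     /* smoothed cost of an RDMA shuffle request       */
};

/* Pick the path for the next request based on current estimates. */
enum shuffle_path choose_path(const struct hybrid_state *s)
{
    return (s->lustre_ms <= s->rdma_ms) ? LUSTRE_READ : RDMA_FETCH;
}

/* Fold an observed request latency back into the estimate (EWMA). */
void record_latency(struct hybrid_state *s, enum shuffle_path p, double ms)
{
    const double alpha = 0.2;                    /* smoothing factor */
    double *est = (p == LUSTRE_READ) ? &s->lustre_ms : &s->rdma_ms;
    *est = (1.0 - alpha) * (*est) + alpha * ms;
}
```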

Page 29: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 29Network Based Computing Laboratory

Performance Improvement of MapReduce over Lustre on TACC-Stampede

• For 500 GB Sort on 64 nodes: 44% improvement over IPoIB (FDR)
• For 640 GB Sort on 128 nodes: 48% improvement over IPoIB (FDR)

[Two charts of job execution time (sec) for IPoIB (FDR) vs. OSU-IB (FDR): left, data sizes 300, 400, 500 GB (reduced by 44%); right, weak scaling from 20 GB on 4 nodes to 640 GB on 128 nodes (reduced by 48%)]

M. W. Rahman, X. Lu, N. S. Islam, R. Rajachandrasekar, and D. K. Panda, MapReduce over Lustre: Can RDMA-based Approach Benefit?, Euro-Par, August 2014.

• Local disk is used as the intermediate data directory

Page 30: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 30Network Based Computing Laboratory

Case Study – Performance Improvement of MapReduce over Lustre on SDSC-Gordon

• For 80 GB Sort on 8 nodes: 34% improvement over IPoIB (QDR)
• For 120 GB TeraSort on 16 nodes: 25% improvement over IPoIB (QDR)
• Lustre is used as the intermediate data directory

[Two charts of job execution time (sec) vs. data size for IPoIB (QDR), OSU-Lustre-Read (QDR), OSU-RDMA-IB (QDR), and OSU-Hybrid-IB (QDR): left, 40, 60, 80 GB (reduced by 34%); right, 40, 80, 120 GB (reduced by 25%)]

Page 31: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 31Network Based Computing Laboratory

• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – OSU HiBD Benchmarks (OHB)

Acceleration Case Studies and Performance Evaluation

Page 32: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 32Network Based Computing Laboratory

HBase-RDMA Design Overview

[Architecture diagram: Applications → HBase → Java Socket Interface over 1/10/40/100 GigE, IPoIB networks, or Java Native Interface (JNI) → OSU-IB Design → IB Verbs → RDMA-capable networks (IB, iWARP, RoCE, ...)]

• JNI layer bridges Java-based HBase with a communication library written in native code
• Enables high-performance RDMA communication while supporting the traditional socket interface

Page 33: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 33Network Based Computing Laboratory

HBase – YCSB Read-Write Workload

• HBase Get latency (QDR, 10 GigE)
  – 64 clients: 2.0 ms; 128 clients: 3.5 ms
  – 42% improvement over IPoIB for 128 clients
• HBase Put latency (QDR, 10 GigE)
  – 64 clients: 1.9 ms; 128 clients: 3.5 ms
  – 40% improvement over IPoIB for 128 clients

[Two charts of latency (us) vs. number of clients (8, 16, 32, 64, 96, 128) for IPoIB (QDR), 10 GigE, and OSU-IB (QDR): left, read latency; right, write latency]

J. Huang, X. Ouyang, J. Jose, M. W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy, and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12

Page 34: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 34Network Based Computing Laboratory

HBase – YCSB Get Latency and Throughput on SDSC-Comet

• HBase Get average latency (FDR)
  – 4 client threads: 38 us
  – 59% improvement over IPoIB for 4 client threads
• HBase Get total throughput (FDR)
  – 4 client threads: 102 Kops/sec
  – 2.4x improvement over IPoIB for 4 client threads

[Two charts vs. number of client threads (1-4) for IPoIB (FDR) and OSU-IB (FDR): left, Get average latency (ms), 59% improvement; right, Get total throughput (Kops/sec), 2.4x improvement]

Page 35: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 35Network Based Computing Laboratory

• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – OSU HiBD Benchmarks (OHB)

Acceleration Case Studies and Performance Evaluation

Page 36: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 36Network Based Computing Laboratory

Design Overview of Spark with RDMA

• Design Features
  – RDMA based shuffle
  – SEDA-based plugins
  – Dynamic connection management and sharing
  – Non-blocking data transfer
  – Off-JVM-heap buffer management
  – InfiniBand/RoCE support
• Enables high-performance RDMA communication while supporting the traditional socket interface
• JNI layer bridges Scala-based Spark with a communication library written in native code

X. Lu, M. W. Rahman, N. Islam, D. Shankar, and D. K. Panda, Accelerating Spark with RDMA for Big Data Processing: Early Experiences, Int'l Symposium on High Performance Interconnects (HotI'14), August 2014

Page 37: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 37Network Based Computing Laboratory

Performance Evaluation on SDSC Comet – SortBy/GroupBy

• InfiniBand FDR, SSD, 64 worker nodes, 1536 cores (1536M 1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 1536 concurrent tasks, single SSD per node
  – SortBy: total time reduced by up to 80% over IPoIB (56 Gbps)
  – GroupBy: total time reduced by up to 57% over IPoIB (56 Gbps)

[Two charts of time (sec) vs. data size (64, 128, 256 GB) for IPoIB vs. RDMA on 64 worker nodes / 1536 cores: left, SortByTest total time (80% reduction); right, GroupByTest total time (57% reduction)]

Page 38: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 38Network Based Computing Laboratory

Performance Evaluation on SDSC Comet – HiBench PageRank

• InfiniBand FDR, SSD, 32/64 worker nodes, 768/1536 cores (768/1536M 768/1536R)
• RDMA-based design for Spark 1.5.1
• RDMA vs. IPoIB with 768/1536 concurrent tasks, single SSD per node
  – 32 nodes/768 cores: total time reduced by 37% over IPoIB (56 Gbps)
  – 64 nodes/1536 cores: total time reduced by 43% over IPoIB (56 Gbps)

[Two charts of time (sec) vs. data size (Huge, BigData, Gigantic) for IPoIB vs. RDMA: left, 32 worker nodes / 768 cores PageRank total time (37% reduction); right, 64 worker nodes / 1536 cores PageRank total time (43% reduction)]

Page 39: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 39Network Based Computing Laboratory

• RDMA-based Designs and Performance Evaluation
  – HDFS
  – MapReduce
  – RPC
  – HBase
  – Spark
  – OSU HiBD Benchmarks (OHB)

Acceleration Case Studies and Performance Evaluation

Page 40: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 40Network Based Computing Laboratory

• The current benchmarks provide some performance behavior
• However, they do not provide any information to the designer/developer on:
  – What is happening at the lower layer?
  – Where are the benefits coming from?
  – Which design is leading to benefits or bottlenecks?
  – Which component in the design needs to be changed, and what will be its impact?
  – Can performance gain/loss at the lower layer be correlated to the performance gain/loss observed at the upper layer?

Are the Current Benchmarks Sufficient for Big Data?

Page 41: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 41Network Based Computing Laboratory

Challenges in Benchmarking of RDMA-based Designs

[Layered architecture diagram:]
• Applications
• Big Data Middleware (HDFS, MapReduce, HBase, Spark and Memcached)
• Programming Models (Sockets) — Other Protocols? — RDMA Protocols
• Communication and I/O Library: Point-to-Point Communication, Threaded Models and Synchronization, Virtualization, I/O and File Systems, QoS, Fault-Tolerance, Benchmarks
• Networking Technologies (InfiniBand, 1/10/40/100 GigE and Intelligent NICs); Commodity Computing System Architectures (Multi- and Many-core architectures and accelerators); Storage Technologies (HDD, SSD, and NVMe-SSD)
• Current Benchmarks exist at the applications level; No Benchmarks at the communication and I/O library level — Correlation?

Page 42: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 42Network Based Computing Laboratory

Iterative Process – Requires Deeper Investigation and Design for Benchmarking Next-Generation Big Data Systems and Applications

[Same layered architecture diagram as the previous slide — Applications, Big Data Middleware, Programming Models (Sockets, other protocols, RDMA protocols), Communication and I/O Library, and the networking/compute/storage technologies beneath — now annotated with Applications-Level Benchmarks at the applications/middleware level and Micro-Benchmarks at the communication and I/O library level, connected as an iterative process]

Page 43: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 43Network Based Computing Laboratory

OSU HiBD Benchmarks (OHB)

• HDFS Benchmarks
  – Sequential Write Latency (SWL) Benchmark
  – Sequential Read Latency (SRL) Benchmark
  – Random Read Latency (RRL) Benchmark
  – Sequential Write Throughput (SWT) Benchmark
  – Sequential Read Throughput (SRT) Benchmark
• Memcached Benchmarks
  – Get Benchmark
  – Set Benchmark
  – Mixed Get/Set Benchmark
• Available as a part of OHB 0.8

N. S. Islam, X. Lu, M. W. Rahman, J. Jose, and D. K. Panda, A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters, Int'l Workshop on Big Data Benchmarking (WBDB '12), December 2012

D. Shankar, X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop MapReduce on High-Performance Networks, BPOE-5 (2014)

X. Lu, M. W. Rahman, N. Islam, and D. K. Panda, A Micro-Benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks, Int'l Workshop on Big Data Benchmarking (WBDB '13), July 2013

(MapReduce and RPC micro-benchmarks: to be released)

Page 44: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 44Network Based Computing Laboratory

• Upcoming releases of RDMA-enhanced packages will support
  – HBase
  – Impala
• Upcoming releases of OSU HiBD Micro-Benchmarks (OHB) will support
  – MapReduce
  – RPC
• Advanced designs with upper-level changes and optimizations
  – Memcached with non-blocking API
  – HDFS + Memcached-based burst buffer

On-going and Future Plans of OSU High Performance Big Data (HiBD) Project

Page 45: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 45Network Based Computing Laboratory

• Discussed challenges in accelerating Hadoop and Spark with HPC technologies

• Presented initial designs to take advantage of InfiniBand/RDMA for HDFS, MapReduce, RPC, and Spark

• Results are promising

• Many other open issues need to be solved

• Will enable the Big Data community to take advantage of modern HPC technologies to carry out their analytics in a fast and scalable manner

• Looking forward to collaboration with the community

Concluding Remarks

Page 46: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 46Network Based Computing Laboratory

Funding Acknowledgments

Funding Support by

Equipment Support by

Page 47: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 47Network Based Computing Laboratory

Personnel Acknowledgments

Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– S. Chakraborthy (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
– K. Kulkarni (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)

Past Students
– P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.)
– R. Rajachandrasekar (Ph.D.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)

Current Research Scientists
– H. Subramoni
– X. Lu

Current Senior Research Associate
– K. Hamidouche

Past Research Scientist
– S. Sur

Current Post-Docs
– J. Lin
– D. Banerjee

Past Post-Docs
– H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– E. Mancini
– S. Marcarelli
– J. Vienne

Current Programmer
– J. Perkins

Past Programmers
– D. Bureddy

Current Research Specialist
– M. Arnold

Page 48: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 48Network Based Computing Laboratory

Second International Workshop on High-Performance Big Data Computing (HPBDC)

HPBDC 2016 will be held with IEEE International Parallel and Distributed Processing Symposium (IPDPS 2016), Chicago, Illinois USA, May 27th, 2016

Keynote Talk: Dr. Chaitanya Baru, Senior Advisor for Data Science, National Science Foundation (NSF); Distinguished Scientist, San Diego Supercomputer Center (SDSC)

Panel Moderator: Jianfeng Zhan (ICT/CAS)
Panel Topic: Merge or Split: Mutual Influence between Big Data and HPC Techniques

Six Regular Research Papers and Two Short Research Papers

http://web.cse.ohio-state.edu/~luxi/hpbdc2016

HPBDC 2015 was held in conjunction with ICDCS’15

http://web.cse.ohio-state.edu/~luxi/hpbdc2015

Page 49: Accelerating Apache Hadoop through High-Performance Networking and I/O Technologies

Hadoop Summit@Dublin (April ‘16) 49Network Based Computing Laboratory

{panda, luxi}@cse.ohio-state.edu

Thank You!

The High-Performance Big Data Project
http://hibd.cse.ohio-state.edu/

Network-Based Computing Laboratory
http://nowlab.cse.ohio-state.edu/