How to Increase Performance of Your Hadoop Cluster

NoSQL Now!

Aug 21, 2013

Ben Wen, Joyent

Renat Khasanshyn, Altoros

About Joyent

The high-performance public cloud infrastructure provider

Cloud IaaS Virtual Machines: Linux, Windows, BSD, SmartOS (fka Solaris) with Zones

Core founding sponsors of Node.js

Four global datacenters

Key markets: Big data, mobile, e-commerce, financial services, SaaS

Open Source contributions: Node.js, KVM, DTrace, ZFS, SmartOS

Running on bare metal is only practical for some organizations

Performance varies significantly across various job types

In fact, for many jobs, less = more

Utilization of most clusters in production is low

Optimizing Hadoop/MapReduce performance is hard

Get upset when truth comes out!

Biased (to the shiny side of the coin)

Often add controversy and confusion

- For Hadoop, what is the impact of container-based virtualization vs hardware emulation (KVM)?*

- What are the Hadoop optimization strategies? Is there a “rule of thumb” when it comes to determining the optimization approach?

- What are the optimal Hadoop cluster settings for the 1TB TeraSort benchmark on 100 and 400 node clusters running Linux and SmartOS on the Joyent Public Cloud?

Physical (disks, cpu, network)

OS/Hypervisor (especially for virtualized environments)

Hadoop/MapReduce (tons of settings)

Algorithmic (data structures, join strategies, big-O…)

Implementation (code efficiency, architecture decisions that fit all other factors)

SmartOS: Open source Unix operating system for the cloud, based on illumos, the active fork of OpenSolaris technology. Uses containerized OS virtualization, called Zones (think a mature LXC with secure RBAC and auditing).

Ubuntu: Operating system based on the Debian Linux distribution and distributed as free and open source software.

Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. Derived from Google's MapReduce and Google File System (GFS) papers, Hadoop enables applications to work with thousands of computation-independent computers and petabytes of data.

Written by Opscode and released as open source under the Apache License 2.0, Chef is a DevOps tool used to configure cloud services or to streamline the configuration of a company's internal servers. Chef automatically sets up and tweaks the operating systems and programs that run in massive data centers.

Developed by creators of the Starfish project from Duke University, Unravel brings run-time profiling of Hadoop jobs followed by a cost-based database query optimization. Unravel connects to streams of Hadoop and system instrumentation data, and applies statistical machine learning to optimize cost of Hadoop jobs and increase cluster utilization.

Comparing I/O Path on Bare Metal Unix vs Zones vs KVM

• Code path is essentially the same as bare metal

• Zones partition at the OS level

• Performance is higher

• KVM is encapsulated by hypervisor

• Code path is much more circuitous in a KVM process.

• Performance is impacted

[Diagram: I/O path for Bare-metal, OS Virtualization, and Kernel Virtualization]

No overhead for Zones: Stack traces show how a network packet is transmitted from Bare Metal vs Joyent Zone vs Fedora VM on KVM

Bare Metal Joyent Zone (aka SmartMachine) Fedora VM on KVM VM

Start Start Start

1 kernel`start_xmit

2 kernel`dtrace_int3_handler+0xd2

3 kernel`kmem_cache_free+0x2f

4 kernel`dtrace_int3+0x3a

5 kernel`eth_header

6 kernel`__kfree_skb+0x47

7 kernel`start_xmit+0x1

8 kernel`dev_hard_start_xmit+0x322

9 kernel`sch_direct_xmit+0xef

10 kernel`dev_queue_xmit+0x184

11 kernel`eth_header+0x3a

12 kernel`neigh_resolve_output+0x11e

13 kernel`nf_hook_slow+0x75

14 kernel`ip_finish_output

15 kernel`ip_finish_output+0x17e

16 kernel`ip_output+0x98

17 kernel`__ip_local_out+0xa4

18 kernel`ip_local_out+0x29

19 kernel`ip_queue_xmit+0x14f

20 kernel`tcp_transmit_skb+0x3e4

21 kernel`__kmalloc_node_track_caller+0x185

22 kernel`sk_stream_alloc_skb+0x41

23 kernel`tcp_write_xmit+0xf7

24 kernel`__alloc_skb+0x8c

25 kernel`__tcp_push_pending_frames+0x26

26 kernel`tcp_sendmsg+0x895

27 kernel`inet_sendmsg+0x64

28 kernel`sock_aio_write+0x13a

29 kernel`do_sync_write+0xd2

30 kernel`security_file_permission+0x2c

31 kernel`rw_verify_area+0x61

32 kernel`vfs_write+0x16d

33 kernel`sys_write+0x4a

34 kernel`sys_rt_sigprocmask+0x84

35 kernel`system_call_fastpath+0x16

36 igb`igb_tx_ring_send+0x33

37 mac`mac_hwring_tx+0x1d

38 mac`mac_tx_send+0x5dc

39 mac`mac_tx_single_ring_mode+0x6e

mac`mac_tx+0xda mac`mac_tx+0xda mac`mac_tx+0xda

dld`str_mdata_fastpath_put+0x53 dld`str_mdata_fastpath_put+0x53 dld`str_mdata_fastpath_put+0x53

ip`ip_xmit+0x82d ip`ip_xmit+0x82d ip`ip_xmit+0x82d

ip`ire_send_wire_v4+0x3e9 ip`ire_send_wire_v4+0x3e9 ip`ire_send_wire_v4+0x3e9

ip`conn_ip_output+0x190 ip`conn_ip_output+0x190 ip`conn_ip_output+0x190

ip`tcp_send_data+0x59 ip`tcp_send_data+0x59 ip`tcp_send_data+0x59

ip`tcp_output+0x58c ip`tcp_output+0x58c ip`tcp_output+0x58c

ip`squeue_enter+0x426 ip`squeue_enter+0x426 ip`squeue_enter+0x426

ip`tcp_sendmsg+0x14f ip`tcp_sendmsg+0x14f ip`tcp_sendmsg+0x14f

sockfs`so_sendmsg+0x26b sockfs`so_sendmsg+0x26b sockfs`so_sendmsg+0x26b

sockfs`socket_sendmsg+0x48 sockfs`socket_sendmsg+0x48 sockfs`socket_sendmsg+0x48

sockfs`socket_vop_write+0x6c sockfs`socket_vop_write+0x6c sockfs`socket_vop_write+0x6c

genunix`fop_write+0x8b genunix`fop_write+0x8b genunix`fop_write+0x8b

genunix`write+0x250 genunix`write+0x250 genunix`write+0x250

genunix`write32+0x1e genunix`write32+0x1e genunix`write32+0x1e

unix`_sys_sysenter_post_swapgs+0x14 unix`_sys_sysenter_post_swapgs+0x14 unix`_sys_sysenter_post_swapgs+0x149

Skips stepping through 39 functions required when Fedora is running on KVM/qemu

Note that a Joyent Zone is exactly the same as “Bare Metal”

Three identical Apache Hadoop 1.0.4 clusters were provisioned on Joyent infrastructure using the Joyent REST API and Opscode Chef.

Each cluster was tuned for optimal performance following best practices for the TeraSort benchmark.

A custom script launches virtual machines using the Joyent API and stores information about them in a JSON file.
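The provisioning step could be sketched roughly as follows. This is a minimal illustration and not the script used for the benchmark: `create_machine` is a hypothetical stand-in for whatever call wraps the Joyent API, and the role names, machine naming scheme, and record fields are assumptions.

```python
import json
from typing import Callable, Dict, List


def provision_cluster(
    create_machine: Callable[[str, str], Dict[str, str]],
    roles: Dict[str, int],
    inventory_path: str,
) -> List[Dict[str, str]]:
    """Launch machines per role and save their details to a JSON file.

    create_machine(name, role) is a hypothetical wrapper around the cloud
    API; it is expected to return a dict describing the new machine.
    """
    machines = []
    for role, count in roles.items():
        for index in range(1, count + 1):
            name = f"hadoop-{role}-{index}"  # naming scheme is an assumption
            info = create_machine(name, role)
            machines.append({"name": name, "role": role, **info})
    # Persist the inventory so later steps (e.g. configuration) can read it.
    with open(inventory_path, "w", encoding="utf-8") as handle:
        json.dump(machines, handle, indent=2)
    return machines


def fake_create_machine(name: str, role: str) -> Dict[str, str]:
    """Stub for illustration only; returns a made-up machine record."""
    return {"id": f"id-{name}", "ip": "192.0.2.10"}
```

Passing the API call in as a function keeps the sketch independent of any particular client library; a real script would replace `fake_create_machine` with a call to the provider's API.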

Each machine in the cluster is configured according to its role using Chef cookbooks.

As part of the TeraSort benchmark, a dataset is generated using the TeraGen utility included in Apache Hadoop.
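As a rough sketch of the dataset sizing: TeraGen takes a row count rather than a byte size, and each row is 100 bytes. The snippet below assumes a decimal terabyte (10^12 bytes); the jar file name and the HDFS output path are assumptions for illustration, not values from the benchmark.

```python
from typing import List

TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(dataset_bytes: int) -> int:
    """Number of 100-byte rows needed to reach the requested dataset size."""
    return dataset_bytes // TERAGEN_ROW_BYTES


def teragen_command(dataset_bytes: int, examples_jar: str, output_dir: str) -> List[str]:
    """Build the argument list for a TeraGen run (paths are assumptions)."""
    return ["hadoop", "jar", examples_jar, "teragen",
            str(teragen_rows(dataset_bytes)), output_dir]


ONE_TB = 10 ** 12  # assumes a decimal terabyte
command = teragen_command(ONE_TB, "hadoop-examples-1.0.4.jar", "/terasort/input")
print(" ".join(command))
```

The resulting argument list can be handed to `subprocess.run` on a node where the `hadoop` launcher is on the path.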

A Hadoop TeraSort job is then submitted on one of the nodes, using the previously generated dataset.

See: Hadoop job_201210261134_0010 on hadoop-smartos-r-1.html

The key difference between the two clusters emerged when monitoring I/O and CPU utilization. The Ubuntu cluster spent too much time in the OS kernel while performing I/O operations, as shown in Figure 1.

The SmartOS cluster used CPU much more efficiently and was able to run a larger number of Hadoop mappers and reducers, which are key configuration parameters for Hadoop.

For copies of config files and job reports, email [email protected]

1) Basic cluster configuration is key (one time effort for typical workloads)

DATA DISK SCALING

COMPRESSION

JVM REUSE POLICY

HDFS BLOCK SIZE

MAP-SIDE SPILLS

COPY/SHUFFLE PHASE TUNING

REDUCE-SIDE SPILLS

2) Tune the number of map and reduce tasks appropriately

3) Consider GPU for some workloads
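To make points 1 and 2 more concrete, here is a minimal sketch of how task counts might be estimated and how a few of the listed knobs map to Hadoop 1.x property names. The 0.95 factor is the rule of thumb from the Hadoop MapReduce tutorial, not a result from this benchmark; the node count, slot count, block size, and all property values are illustrative assumptions rather than the tuned settings.

```python
from typing import Dict


def suggest_task_counts(input_bytes: int, block_bytes: int,
                        nodes: int, reduce_slots_per_node: int) -> Dict[str, int]:
    """Estimate map and reduce task counts for a job.

    Maps: one per HDFS block of input (ceiling division).
    Reduces: 95% of the total reduce slots, so that all reduces can
    run in a single wave (rule of thumb, integer arithmetic).
    """
    map_tasks = (input_bytes + block_bytes - 1) // block_bytes
    reduce_tasks = max(1, (95 * nodes * reduce_slots_per_node) // 100)
    return {"map_tasks": map_tasks, "reduce_tasks": reduce_tasks}


def job_properties(block_bytes: int, reduce_tasks: int) -> Dict[str, str]:
    """Hadoop 1.x property names for some of the knobs listed above.

    The values are placeholders for illustration, not tuned settings.
    """
    return {
        "dfs.block.size": str(block_bytes),        # HDFS block size
        "mapred.reduce.tasks": str(reduce_tasks),  # number of reduce tasks
        "mapred.compress.map.output": "true",      # compression
        "mapred.job.reuse.jvm.num.tasks": "-1",    # JVM reuse (-1 = no limit)
        "io.sort.mb": "256",                       # map-side sort buffer (spills)
        "mapred.reduce.parallel.copies": "20",     # copy/shuffle phase
    }


counts = suggest_task_counts(10 ** 12, 256 * 1024 ** 2,
                             nodes=100, reduce_slots_per_node=2)
properties = job_properties(256 * 1024 ** 2, counts["reduce_tasks"])
print(counts)
```

Estimates like these are only a starting point; as noted earlier, performance varies significantly across job types, so the values should be checked against measurements on the actual cluster.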

• Forthcoming in October

• Includes cloud performance

• Co-author DTrace book

• More here on his techniques: http://dtrace.org/blogs/brendan/

Thank you!

Ben Wen: [email protected]

Renat Khasanshyn: [email protected]

@renatco (650) 395-7002


