Making sense of performance and identifying stragglers in Data Analytics Framework

CSCI 8780 Advanced Distributed Systems

Manish Ranjan and Narita Pandhe

Introduction

- Large-scale data analytics has become widespread

- Research devoted to improving the performance of data analytics frameworks

- But comparatively little effort has been spent on identifying the performance bottlenecks

More resource efficient

Faster

Experiments

What cluster configuration did we use?

- 1 master, 6 slaves

- Master config: 64-bit, 8 GB RAM, 2 cores, 50 GB SSD

- Slave config (each): 64-bit, 2 GB RAM, 1 core, 30 GB SSD

Config-related modifications: e.g. replication + SSDs

Benchmarking the NameNode first

NNBench is run first to test the NameNode hardware and configuration.

What it does:

Generates a large number of HDFS-related requests

Why:

To put high HDFS management stress on the NameNode

How:

Simulates requests for creating, reading, renaming and deleting files on HDFS
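As a rough illustration of how those four request types map onto NNBench runs, the sketch below only builds (and prints) one command line per operation; it does not execute anything. The jar path, the default values and the helper name `nnbench_command` are assumptions made for this example, and the flag spellings should be checked against the usage message of the installed Hadoop version.

```python
import shlex
from typing import List

# NNBench operations corresponding to the create / read / rename / delete
# requests described above.
NNBENCH_OPERATIONS = ("create_write", "open_read", "rename", "delete")


def nnbench_command(tests_jar: str, operation: str, maps: int = 6,
                    reduces: int = 3, number_of_files: int = 1000,
                    base_dir: str = "/benchmarks/NNBench") -> List[str]:
    """Build the argv for one NNBench run (hypothetical helper).

    tests_jar is the path to the Hadoop tests jar; its name depends on the
    Hadoop version, so it is left to the caller. The default numbers are
    illustrative assumptions, not the values used in the experiments.
    """
    if operation not in NNBENCH_OPERATIONS:
        raise ValueError(f"unknown NNBench operation: {operation}")
    return [
        "hadoop", "jar", tests_jar, "nnbench",
        "-operation", operation,
        "-maps", str(maps),
        "-reduces", str(reduces),
        "-numberOfFiles", str(number_of_files),
        "-baseDir", base_dir,
    ]


# Print one command per operation; "hadoop-tests.jar" is a placeholder path.
for op in NNBENCH_OPERATIONS:
    print(shlex.join(nnbench_command("hadoop-tests.jar", op)))
```

Running `create_write` before the other operations matters in practice, because reading, renaming and deleting all need files that already exist under the base directory.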

What Workload did we use?

- TeraSort benchmark suite

- Goal of TeraSort: sort 1TB of data (or any other amount of data you want) as fast as possible.

- Limited by our cluster configuration, we performed several experiments with data of size 1GB, 5GB and 10GB.

- TeraSort benchmark can be utilized to iron out your Hadoop configuration
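TeraGen, the data generator in the TeraSort suite, takes a number of rows rather than a byte count, and each row is 100 bytes. A minimal sketch of the arithmetic for the three dataset sizes used here, assuming decimal gigabytes (the helper name `teragen_rows` is made up for this example):

```python
# TeraGen writes fixed-size records of 100 bytes each.
TERAGEN_ROW_BYTES = 100


def teragen_rows(size_gb: float, bytes_per_gb: int = 10**9) -> int:
    """Number of 100-byte rows needed for a dataset of size_gb gigabytes.

    bytes_per_gb defaults to 10**9 (decimal GB), which is an assumption;
    pass 2**30 to size the data in GiB instead.
    """
    return int(size_gb * bytes_per_gb) // TERAGEN_ROW_BYTES


# Row counts for the 1 GB, 5 GB and 10 GB experiments.
for size_gb in (1, 5, 10):
    print(f"{size_gb:>2} GB -> {teragen_rows(size_gb):,} rows")
```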

Hadoop

i-6c76c1da (M), i-40684ef0 (s1), i-41684ef1 (s2), i-42684ef2 (s3), i-43684ef3 (s4), i-4e684efe (s5), i-4f684eff (s6)

Red: s6, Dark Green: s4

Observations for 10GB

Red: s6, Dark Green: s4

Identified Stragglers

Spark

Orange: s2, Red: s6

Hadoop vs. Spark

Red: s6, Bright Blue: s5, Orange: s2

Conclusions

- A straggler task spends an unusually long time in one particular part of task execution.

- It is usually not too hard to find a straggler in a specific execution; what is hard is finding one consistently.

- We were nevertheless able to spot a few even on a cluster of modest strength, which emphasizes the need to understand the cluster's meta information well.

E.g.: DFS disk read time, shuffle write time, shuffle read time, and Java's garbage collection

- Spark:

- often breaks jobs into many more tasks

- has much lower task launch overhead than Hadoop
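One way to make the straggler idea concrete is to compare each task's total time with the median for its stage, then look at which phase (DFS read, shuffle write, shuffle read, garbage collection) accounts for the excess. The sketch below is only an illustration: the 1.5x-median threshold is an assumed rule of thumb, the function names are made up for this example, and the timings are hypothetical numbers, not measurements from the experiments.

```python
from statistics import median
from typing import Dict, List

# One task is a mapping from phase name to seconds spent in that phase.
Task = Dict[str, float]


def find_stragglers(tasks: Dict[str, Task], threshold: float = 1.5) -> List[str]:
    """Return ids of tasks whose total time exceeds threshold x the median.

    The default threshold of 1.5 is an assumption chosen for illustration.
    """
    totals = {tid: sum(phases.values()) for tid, phases in tasks.items()}
    cutoff = threshold * median(totals.values())
    return sorted(tid for tid, total in totals.items() if total > cutoff)


def dominant_phase(tasks: Dict[str, Task], task_id: str) -> str:
    """Return the phase where the task exceeds the per-phase median the most."""
    phases = tasks[task_id]
    medians = {p: median(t.get(p, 0.0) for t in tasks.values()) for p in phases}
    return max(phases, key=lambda p: phases[p] - medians[p])


# Hypothetical per-task timings in seconds (not measured data).
tasks = {
    "t1": {"dfs_read": 4.0, "shuffle_write": 2.0, "shuffle_read": 3.0, "gc": 0.5},
    "t2": {"dfs_read": 4.2, "shuffle_write": 2.1, "shuffle_read": 3.1, "gc": 0.4},
    "t3": {"dfs_read": 3.9, "shuffle_write": 2.0, "shuffle_read": 2.9, "gc": 0.6},
    "t4": {"dfs_read": 4.1, "shuffle_write": 1.9, "shuffle_read": 3.0, "gc": 0.5},
    "t5": {"dfs_read": 4.0, "shuffle_write": 2.2, "shuffle_read": 3.2, "gc": 0.5},
    "t6": {"dfs_read": 12.5, "shuffle_write": 2.1, "shuffle_read": 3.0, "gc": 0.6},
}

for tid in find_stragglers(tasks):
    print(tid, "is a straggler; dominant phase:", dominant_phase(tasks, tid))
```

Using the median rather than the mean keeps a single very slow task from raising the cutoff that is supposed to detect it.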
