Spark Understanding & Performance Issues
Key Points of Spark
• A better implementation of the MapReduce paradigm
• Handles batch, iterative and real-time applications
within a single framework.
• Most computations map into many maps and
reduces with dependencies among them.
• Spark’s RDD programming model captures these
dependencies as a DAG.
Spark Goals
• Generality: diverse workloads, operators, job sizes
• Low latency: sub-second
• Fault tolerance: faults shouldn’t be a special case
• Simplicity: offer a high-level API without boilerplate
code
Programming Point of View
• High-level API (accessible to data scientists)
• Native integration with Java, Python and Scala
• Thanks to the flexible programming model,
applications can customize how shuffles or aggregations are done.
• Optionally, applications can choose to keep datasets
in memory.
Engineering Point of View
• Uses RPCs for task dispatching and scheduling.
• Uses a thread pool for task execution (rather than the pool of JVM processes that Hadoop uses).
• The above two enable Spark to schedule tasks in milliseconds, whereas MapReduce scheduling takes
seconds or minutes in busy clusters.
• Supports checkpoint-based recovery (like Hadoop) plus lineage-based recovery (much faster).
• Spark caches the data to be processed.
• Each application gets its own executor processes, which stay up for the duration of the whole
application and run tasks in multiple threads.
• Benefit: applications are isolated from each other, on both the scheduling side (each driver
schedules its own tasks) and the executor side (tasks from different applications run in different
JVMs).
• Disadvantage: data cannot be shared across different Spark applications
(instances of SparkContext) without writing it to an external storage system.
Spark Jargon (1/2)
• Driver: the program/process responsible for
running the job over the Spark engine
• Executor: the process responsible for executing a
task
• Master: the machine where the Driver runs
• Slave/Worker: the machine where the Executor
runs
Spark’s Master/Slave Architecture
Spark Jargon (2/2)
• Job: a parallel computation consisting of multiple tasks that
gets spawned in response to a Spark action (e.g. save,
collect)
• Stages: each job gets divided into smaller sets of tasks
called stages that depend on each other (similar to the map
and reduce stages in MapReduce)
• Tasks: each stage has some tasks, one per partition. One
task is executed on one partition of data on one executor.
• DAG: stands for Directed Acyclic Graph; in the present
context it is a DAG of operators
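The job/stage/task hierarchy above can be sketched with a toy calculation in plain Python (not Spark code): since each stage launches one task per partition, a job's task count is just the sum over its stages.

```python
# Toy illustration (not Spark API): a job is split into stages,
# and each stage runs one task per partition.
def tasks_in_job(stage_partition_counts):
    """Total tasks for a job, given the partition count of each stage."""
    return sum(stage_partition_counts)

# A job with a map-like stage over 8 partitions followed by a
# reduce-like stage over 4 partitions launches 12 tasks in total.
print(tasks_in_job([8, 4]))  # 12
```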
RDDs
• Resilient Distributed Datasets are the primary abstraction in Spark: a fault-tolerant collection of
elements that can be operated on in parallel.
• There are currently two types:
1. Parallelized collections: take an existing Scala collection and run functions on it in parallel.
2. Hadoop datasets: run functions on each record of a file in HDFS or any other storage
supported by Hadoop.
• They support two types of operations, transformations and actions:
1. Transformations are lazy operations on an RDD that create one or many new RDDs, e.g.
map, filter, reduceByKey, join, randomSplit.
2. Actions are computed immediately. They consist of running all the previous transformations
in order to get back an actual result. In other words, an RDD operation that returns a value of
any type but RDD[T] is an action. (Actions are synchronous.)
• An RDD can be persisted to disk or cached in memory.
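The lazy-transformation / eager-action split can be sketched with a toy lineage class in plain Python. This is not the real Spark API; `ToyRDD`, its fields, and its methods are all illustrative names, but the mechanism (transformations only record a parent and a function, actions replay the chain) mirrors the description above.

```python
# Toy sketch (plain Python, not the real Spark API): transformations
# are recorded lazily; an action walks the lineage and computes a result.
class ToyRDD:
    def __init__(self, data=None, parent=None, func=None):
        self.data, self.parent, self.func = data, parent, func

    # Transformations: return a new ToyRDD; nothing is computed yet.
    def map(self, f):
        return ToyRDD(parent=self, func=lambda xs: [f(x) for x in xs])

    def filter(self, p):
        return ToyRDD(parent=self, func=lambda xs: [x for x in xs if p(x)])

    # Action: replay the lineage from the root data to get a value.
    def collect(self):
        if self.parent is None:
            return list(self.data)
        return self.func(self.parent.collect())

rdd = ToyRDD(data=[1, 2, 3, 4])
doubled_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 2)
print(doubled_evens.collect())  # [4, 8]
```

Note that this recorded lineage is also what makes lineage-based recovery possible: a lost result can be recomputed by replaying the chain from the root data.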
Transformations
• There are two kinds of transformations:
1. Narrow transformations: each output partition depends on data from a single
parent partition only, e.g. map, filter.
• Spark groups narrow transformations into one stage, which is called pipelining.
2. Wide/shuffle transformations: e.g. groupByKey and reduceByKey.
The data required to compute the records in a single partition may exist in many
partitions of the parent RDD.
• All of the tuples with the same key must end up in the same partition,
processed by the same task.
• To satisfy these operations, Spark must execute an RDD shuffle, which transfers
data across the cluster and results in a new stage with a new set of partitions.
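Pipelining of narrow transformations can be sketched in plain Python (not Spark API): because map and filter each read from one parent partition only, they can be fused and applied per partition with no data movement.

```python
# Toy sketch: narrow transformations (filter, then map) fused into one
# stage and applied per partition; no record ever crosses partitions.
def pipelined_stage(partition):
    # keep odd values, then multiply: both steps in a single pass
    return [x * 10 for x in partition if x % 2 == 1]

partitions = [[1, 2, 3], [4, 5, 6]]
# Each partition is processed independently -- this is why Spark can
# run the fused stage without any shuffle.
result = [pipelined_stage(p) for p in partitions]
print(result)  # [[10, 30], [50]]
```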
Transformations | Actions
Transformations:
• map(function)
• filter(function)
• flatMap(function)
• sample(withReplacement, fraction, [seed])
• union(otherDataset)
• distinct([numTasks])
• groupByKey([numTasks])
• reduceByKey(function, [numTasks])
• sortByKey([ascending], [numTasks])
• join(otherDataset, [numTasks]) etc.
Actions:
• reduce(function)
• collect()
• count()
• first()
• take(n)
• takeSample(..)
• saveAsTextFile(path)
• saveAsSequenceFile(path)
• countByKey()
• foreach(function) etc.
RDD Shuffle
• Shuffling is the process of redistributing data across
partitions (aka repartitioning), which may or may not
involve moving data across JVM processes or even
over the wire (between executors on separate
machines).
• “This typically involves copying data across
executors and machines, making the shuffle a
complex and costly operation.” (from Spark’s
documentation)
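A shuffle can be sketched as follows in plain Python (not Spark code): map-side output partitions are redistributed so that every record with the same key lands in the same output partition. A simple modulo partitioner over integer keys stands in for Spark's hash partitioner.

```python
# Toy sketch of a shuffle: redistribute (key, value) records so that
# all records with the same key end up in the same output partition.
def shuffle(map_outputs, num_partitions):
    out = [[] for _ in range(num_partitions)]
    for partition in map_outputs:
        for key, value in partition:
            # modulo partitioner for integer keys; real Spark uses a
            # hash partitioner
            out[key % num_partitions].append((key, value))
    return out

map_outputs = [[(0, 1), (1, 1)], [(0, 2), (2, 1)]]
shuffled = shuffle(map_outputs, 2)
# All key-0 records are now co-located, ready for a reduceByKey-style task.
print(shuffled)  # [[(0, 1), (0, 2), (2, 1)], [(1, 1)]]
```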
Spark’s System
Layers
Different Deployment Modes
• Spark Stand-alone
• Spark on Yarn
• Spark on Mesos
Common Performance Issues
• Adequate parallelism / partitioning: smaller/more numerous partitions allow work to be distributed
among more workers, but larger/fewer partitions allow work to be done in larger chunks, which may
get the work done more quickly thanks to reduced overhead, as long as all workers are kept
busy.
• Re-partitioning on the fly: each execution stage may have a different optimal degree of
parallelism, and the data shuffles between stages become opportunities to adjust the partitioning
accordingly.
• Wrong ordering of transformations: shuffling more data than necessary.
• Data layout: object-oriented languages add a layer of abstraction, but this increases memory
overhead. Furthermore, these frameworks run on top of the JVM, and its garbage collector is known to be
sensitive to memory layout and access patterns.
• Task placement: co-allocating heterogeneous tasks has the potential to create unexpected
performance issues.
• Load balancing: assuming applications execute stages sequentially, any imbalance
among a stage’s tasks leads to resource idleness.
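The load-balancing point can be quantified with a small plain-Python sketch: since a stage finishes only when its slowest task does, every faster task leaves its worker idle for the difference.

```python
# Toy sketch: with sequential stages, a stage ends when its slowest
# task ends, so imbalance translates directly into idle worker time.
def stage_idle_time(task_durations):
    """Total worker-seconds spent waiting for the slowest task."""
    longest = max(task_durations)
    return sum(longest - d for d in task_durations)

# Balanced tasks waste nothing; one straggler idles the other workers.
print(stage_idle_time([5, 5, 5]))  # 0
print(stage_idle_time([5, 5, 9]))  # 8
```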
Other Issues (1/2): Too Many Shuffle Files
• It has been observed that the bottleneck Spark
currently faces is specific to the existing
implementation of how shuffle files are defined.
• Each map task creates one shuffle file for each reducer, so in case we
have 5000 maps and 1024 reducers we end up with over 5 million
shuffle files in total.
• This can lead to:
1. Poor performance due to communication via sockets
2. Suffering from random I/O
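The slide's arithmetic is worth making explicit: one shuffle file per (map task, reducer) pair means the file count grows multiplicatively.

```python
# One shuffle file per (map task, reducer) pair.
def shuffle_file_count(num_maps, num_reducers):
    return num_maps * num_reducers

print(shuffle_file_count(5000, 1024))  # 5120000 -- over 5 million files
```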
Solutions
• Unsuccessful:
• Extra processing stage
• TritonSort (tries to bottleneck every resource at the same time)
• Optimizations from a static point of view, or when the structure of the data is
known
• Successful:
1. Shuffle file consolidation, proposed by A. Davidson et al., “Optimizing Shuffle Performance in Spark”
2. RDMA in Spark, proposed by W. Rahman et al., “Accelerating Spark with RDMA for Big Data Processing:
Early Experiences”
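The effect of shuffle file consolidation can be sketched numerically: as described by Davidson et al., map tasks that run on the same core append to the same per-reducer file, so the file count scales with cores rather than with map tasks. The 100-core cluster size below is a hypothetical figure for illustration.

```python
# Sketch of shuffle file consolidation: file count depends on the
# number of cores, not on the number of map tasks.
def consolidated_file_count(total_cores, num_reducers):
    return total_cores * num_reducers

# Assuming a hypothetical cluster with 100 cores and 1024 reducers:
print(consolidated_file_count(100, 1024))  # 102400, down from 5120000
```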
Other Issues (2/2): Data Shuffle Blocks
• It is not feasible to gather all the shuffle data before it is
consumed, because:
1. The data transfers would take a long time to complete.
2. A large amount of memory and local storage would be needed to
cache it.
• So a producer/consumer model of shuffling-reducing is adopted.
1. This creates a complex all-to-all communication pattern that puts a
significant burden on the networking infrastructure.
2. Trade-off: the CPU blocks on a missing shuffle block vs. memory
utilization explodes due to accumulating many shuffle blocks.
Evaluation Tools
• Spark monitoring web UI (offers a precise event timeline, DAG visualisation and other
monitoring tools)
• sar to report IOPS for I/O usage (provided as part of the
sysstat package)
• iostat
• htop
• free
Evaluation Applications/Benchmarks
• SparkBench (benchmark suite), by IBM: “SparkBench: A Comprehensive
Benchmarking Suite for In-Memory Data Analytic Platform Spark”
• GroupBy Test (commonly used Spark benchmark), used by “Accelerating Spark
with RDMA for Big Data Processing: Early Experiences”
• Twidd (application), used by “Diagnosing Performance Bottlenecks in Massive Data
Parallel Programs”
• Elcat (application), used by “Diagnosing Performance Bottlenecks in Massive Data Parallel
Programs”
• PageRank (application), used by “Diagnosing Performance Bottlenecks in Massive Data
Parallel Programs”
• BDBench (benchmark)
• TPC-DS (benchmark)
Blocked Time Analysis
• Issues:
1. Per-task utilization cannot be measured in
Spark because all tasks run in a single process.
2. Instrumentation should be light in terms of
memory.
3. Instrumentation shouldn’t add to job time.
4. Logging needed to be added in HDFS.
Usage of Memory
• Execution: memory used for shuffles, sorts and aggregations
• Storage: memory used to cache data that will be reused
later
• 1st approach: static assignment
• 2nd approach: unified memory (storage always spills to
disk)
• 3rd approach: dynamic assignment across tasks
(each task is now assigned 1/N of the memory) -> helps
with stragglers.
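The dynamic-assignment policy above can be sketched with a line of arithmetic: each of the N currently running tasks is entitled to 1/N of the execution memory, so a task that starts late still gets a fair share. The 8 GiB pool size below is an illustrative assumption, not a Spark default.

```python
# Toy sketch of dynamic assignment: each of the N running tasks may
# claim 1/N of the execution-memory pool.
def per_task_memory(total_bytes, num_running_tasks):
    return total_bytes // num_running_tasks

total = 8 * 1024**3                # assume an 8 GiB execution pool
print(per_task_memory(total, 4))   # each of 4 tasks may claim 2 GiB
```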
References
• https://jaceklaskowski.gitbooks.io/mastering-apache-spark/
• “SparkBench: A Comprehensive Benchmarking Suite for In-Memory
Data Analytic Platform Spark”
• W. Rahman et al., “Accelerating Spark with RDMA for Big Data Processing:
Early Experiences”
• “Diagnosing Performance Bottlenecks in Massive Data Parallel Programs”
• A. Davidson et al., “Optimizing Shuffle Performance in Spark”