Big Data Engineering: Recent Performance Enhancements in JVM-based Frameworks
Mayuresh Kunjir
Modern Data Analytics Frameworks
Process external data: great for unstructured data
Easy to scale out: suited for cloud infrastructure
Fast recovery: suited for commodity nodes
JVM based: better adaptability
Hadoop
Hbase
Spark
Flink
How do they stack up?
[Figure: chart comparing modern data analytics frameworks (Hadoop, Spark, Flink) with relational and MPP databases (Netezza, Pivotal, Vertica) along five axes: extensibility, fault tolerance, scalability, performance, expressiveness]
SQL-on-Hadoop solutions bring the power of SQL to modern data analytics
Java vs. C?
Why can’t performance match?
1. Java objects have storage overheads
2. Garbage collection: “stop the world” pauses
JVM
JVM Memory Management
Young generation
New objects are created here.
Parallel GC is run frequently
GC “tenures” objects that are alive for a long time
Old generation
GC is run less frequently and takes longer
Uses Concurrent-Mark-Sweep (CMS) or G1 GC
Capable of compacting memory to avoid memory fragmentation
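Which collectors actually serve the young and old generations depends on the JVM version and its flags. As a rough sketch, the standard java.lang.management API can list them along with their collection counts and accumulated pause time. The class name GcInspector is made up for this illustration, and the collector names printed will differ from one JVM to another.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any framework discussed here.
final class GcInspector {
    /** Returns one line per collector: name, collection count, accumulated time. */
    static List<String> describeCollectors() {
        List<String> lines = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            lines.add(gc.getName()
                    + ": collections=" + gc.getCollectionCount()
                    + ", timeMs=" + gc.getCollectionTime());
        }
        return lines;
    }

    public static void main(String[] args) {
        describeCollectors().forEach(System.out::println);
    }
}
```

Running this before and after a workload gives a first impression of how often each generation is collected and how much time the collections cost.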
In-memory processing in conflict with GC
Spark caches RDDs in memory between iterations
Implies less memory for user program’s custom objects
Implies more strain on Garbage collection
Performance Guideline: Keep as little data in the JVM-managed heap as possible
Serialization schemes
Performance Guideline: Rather than generic serialization schemes, build semantic and schema-specific schemes
Enhancement #1: Custom Serialization
Store the exact schema only once per dataset.
Store byte streams per tuple, with offsets to instance attributes.
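The two points above can be sketched as follows. This is an illustration only, not the layout used by Tungsten or Flink: the schema (int id, double score, String name), the class name RowCodec and all offsets are assumptions made for the example. The schema lives once in the codec as constants, and each tuple is a plain byte array whose fixed-width header holds the offset and length of the variable-length attribute.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical codec for an assumed schema: (int id, double score, String name).
final class RowCodec {
    // The schema is described once here, not repeated inside every tuple.
    static final int ID_OFFSET = 0;         // 4-byte int
    static final int SCORE_OFFSET = 4;      // 8-byte double
    static final int NAME_REF_OFFSET = 12;  // 4-byte start offset + 4-byte length
    static final int FIXED_SIZE = 20;       // size of the fixed-width header

    static byte[] encode(int id, double score, String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(FIXED_SIZE + nameBytes.length);
        buf.putInt(ID_OFFSET, id);
        buf.putDouble(SCORE_OFFSET, score);
        buf.putInt(NAME_REF_OFFSET, FIXED_SIZE);            // where the name starts
        buf.putInt(NAME_REF_OFFSET + 4, nameBytes.length);  // how many bytes it spans
        buf.position(FIXED_SIZE);
        buf.put(nameBytes);
        return buf.array();
    }

    // Each accessor reads one attribute directly, without deserializing the rest.
    static int readId(byte[] row) {
        return ByteBuffer.wrap(row).getInt(ID_OFFSET);
    }

    static double readScore(byte[] row) {
        return ByteBuffer.wrap(row).getDouble(SCORE_OFFSET);
    }

    static String readName(byte[] row) {
        ByteBuffer buf = ByteBuffer.wrap(row);
        int start = buf.getInt(NAME_REF_OFFSET);
        int length = buf.getInt(NAME_REF_OFFSET + 4);
        return new String(row, start, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] row = encode(7, 2.5, "Ada");
        System.out.println(readId(row) + " " + readScore(row) + " " + readName(row));
    }
}
```

Compared with generic Java serialization, no class names or field descriptors are written per tuple, and a single attribute can be read without materializing an object for the whole row.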
Project Tungsten [4]
Serialization in Flink [3]
Custom class ‘TypeInformation’ represents any data type.
Each implementation of TypeInformation provides a custom serializer
e.g. To serialize Tuple3<Integer, Double, Person> where Person is a POJO
Enhancement #2: Custom Memory Management
HBase is used in latency-sensitive applications, where GC delays are hazardous.
HBase MemStore allocates memory in chunks of 2MB, which makes GC sweeps more efficient. [2]
MemStore-Local Allocation Buffers help avoid memory fragmentation. [2]
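The chunk idea can be sketched in a few lines. This is not HBase's code: ChunkAllocator, Slice and copyInto are hypothetical names, and the sketch ignores concurrency. Small values are copied into a large 2MB array by bumping an offset, so the values that live and die together occupy one contiguous block instead of being scattered across the old generation.

```java
// Hypothetical, single-threaded sketch of chunk-based allocation.
final class ChunkAllocator {
    static final int CHUNK_SIZE = 2 * 1024 * 1024; // 2MB, as stated above

    /** Location of a copied value: the backing chunk plus offset and length. */
    static final class Slice {
        final byte[] chunk;
        final int offset;
        final int length;

        Slice(byte[] chunk, int offset, int length) {
            this.chunk = chunk;
            this.offset = offset;
            this.length = length;
        }
    }

    private byte[] current = new byte[CHUNK_SIZE];
    private int nextFree = 0;
    private int chunksAllocated = 1;

    /** Copies data into the current chunk, starting a new chunk when it is full. */
    Slice copyInto(byte[] data) {
        if (data.length > CHUNK_SIZE) {
            throw new IllegalArgumentException("value larger than a chunk");
        }
        if (nextFree + data.length > CHUNK_SIZE) {
            current = new byte[CHUNK_SIZE]; // the old chunk stays alive via its slices
            nextFree = 0;
            chunksAllocated++;
        }
        System.arraycopy(data, 0, current, nextFree, data.length);
        Slice slice = new Slice(current, nextFree, data.length);
        nextFree += data.length;
        return slice;
    }

    int chunksAllocated() {
        return chunksAllocated;
    }

    public static void main(String[] args) {
        ChunkAllocator allocator = new ChunkAllocator();
        Slice first = allocator.copyInto(new byte[] {1, 2, 3});
        Slice second = allocator.copyInto(new byte[] {4, 5});
        System.out.println(first.offset + " " + second.offset);
    }
}
```

When all slices of a chunk become unreachable, the collector reclaims one large, uniformly sized block, which leaves far fewer small holes than freeing thousands of individually allocated values.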
Flink Managed Memory
A pool of 32KB buffers managed by MemoryManager and never released to GC. [3]
No major GC ever takes place
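A pool of this kind can be sketched as below. This is an illustration under stated assumptions, not Flink's MemoryManager: the class SegmentPool and its methods are hypothetical, and the sketch is single-threaded. All 32KB buffers are allocated up front and recycled, so the working memory is never handed back to the garbage collector during processing.

```java
import java.util.ArrayDeque;

// Hypothetical fixed-size buffer pool; names are not taken from Flink.
final class SegmentPool {
    static final int SEGMENT_SIZE = 32 * 1024; // 32KB, as stated above

    private final ArrayDeque<byte[]> free = new ArrayDeque<>();

    /** Allocates all segments once; they stay reachable for the pool's lifetime. */
    SegmentPool(int segments) {
        for (int i = 0; i < segments; i++) {
            free.push(new byte[SEGMENT_SIZE]);
        }
    }

    /** Hands out a free segment. A real system would spill or wait instead of failing. */
    byte[] acquire() {
        byte[] segment = free.poll();
        if (segment == null) {
            throw new IllegalStateException("pool exhausted");
        }
        return segment;
    }

    /** Returns a segment to the pool for reuse. */
    void release(byte[] segment) {
        if (segment.length != SEGMENT_SIZE) {
            throw new IllegalArgumentException("not a pool segment");
        }
        free.push(segment);
    }

    int available() {
        return free.size();
    }

    public static void main(String[] args) {
        SegmentPool pool = new SegmentPool(4);
        byte[] segment = pool.acquire();
        System.out.println(pool.available());
        pool.release(segment);
        System.out.println(pool.available());
    }
}
```

Because the buffers are long-lived and few in number, they settle in the old generation once and give the collector almost nothing to trace or move afterwards.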
Utilizing off-heap memory
The java.nio package and the sun.misc.Unsafe class allow C-style memory management in Java.
Project Tachyon supports in-memory storage for Spark RDDs in off-heap space. [1]
Project Tungsten supports storing shuffle objects in off-heap space. [4]
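A minimal sketch of off-heap storage using only the public java.nio API. It deliberately avoids sun.misc.Unsafe, which is an internal class; the name OffHeapLongArray is made up for the example. The values live in a direct buffer outside the garbage-collected heap, so the collector sees one small wrapper object rather than the data itself.

```java
import java.nio.ByteBuffer;

// Hypothetical fixed-length array of longs backed by off-heap memory.
final class OffHeapLongArray {
    private final ByteBuffer buffer;
    private final int length;

    OffHeapLongArray(int length) {
        this.length = length;
        // allocateDirect reserves native memory outside the Java heap.
        this.buffer = ByteBuffer.allocateDirect(Math.multiplyExact(length, Long.BYTES));
    }

    void set(int index, long value) {
        buffer.putLong(index * Long.BYTES, value); // absolute put, bounds-checked
    }

    long get(int index) {
        return buffer.getLong(index * Long.BYTES);
    }

    int length() {
        return length;
    }

    boolean isOffHeap() {
        return buffer.isDirect();
    }

    public static void main(String[] args) {
        OffHeapLongArray values = new OffHeapLongArray(3);
        values.set(0, 10L);
        values.set(2, 30L);
        System.out.println(values.get(0) + values.get(1) + values.get(2));
    }
}
```

The trade-off is that every access goes through explicit offsets, much like pointer arithmetic in C, and the data must be laid out by hand instead of relying on Java objects.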
Enhancement #3: Cache-Sensitive Data Structures
Sort buffer in Flink [3]
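One common way to make sorting cache-friendly is to sort compact (key, pointer) pairs stored contiguously instead of chasing references to full records. The sketch below only illustrates that general idea and does not reproduce Flink's sort buffer: PrefixSort and sortedOrder are hypothetical names, and the key is assumed to fit entirely in 32 bits, so no fallback comparison on the full record is needed.

```java
import java.util.Arrays;

// Hypothetical sketch: sort packed (key, record index) pairs in a primitive array.
final class PrefixSort {
    /** Returns record indices ordered by key; equal keys keep their original order. */
    static int[] sortedOrder(int[] keys) {
        long[] packed = new long[keys.length];
        for (int i = 0; i < keys.length; i++) {
            // High 32 bits: the key. Low 32 bits: the record's position.
            packed[i] = ((long) keys[i] << 32) | (i & 0xFFFFFFFFL);
        }
        // Comparisons touch only this contiguous array, never the records.
        Arrays.sort(packed);
        int[] order = new int[keys.length];
        for (int i = 0; i < packed.length; i++) {
            order[i] = (int) packed[i]; // the low 32 bits are the record index
        }
        return order;
    }

    public static void main(String[] args) {
        int[] order = sortedOrder(new int[] {30, -5, 30, 7});
        System.out.println(Arrays.toString(order));
    }
}
```

With longer keys, only a fixed-length prefix would fit next to the pointer, and ties on the prefix would have to be resolved by comparing the full records.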
Case study: Sort in Spark
Spark won the Daytona GraySort contest in 2014. [5]
• Compared to Yahoo’s earlier record for 100TB, Spark sorted the same data 3X faster using 10X fewer machines.
• Spark managed to also sort 1PB on 190 machines in under 4 hours. Earlier record: 3800 machines, 16 hours.
Sort in Spark: Optimizations
Sort-based shuffle: lower memory overhead compared to hash-based shuffle
New network module: uses JNI and bypasses JVM’s memory allocator
External shuffle service: shuffle continues even during GC pauses
TimSort: performs better than Quicksort for most real-world datasets
Cache locality
Are we there yet?
Alexey Grishchenko [6] compared the performance of Spark, with all sun.misc.Unsafe magic enabled, against Pivotal HAWQ (written in C). Here are the runtimes:
             Spark      Pivotal HAWQ
First run    7.33 sec   0.25 sec
Second run   2.12 sec
Third run    2.04 sec
* Tested using a query “select a, avg(b) from test group by a order by a;”
Huge gap still!
What are we doing?
Memory management in Spark: understanding the impact of various memory tuning parameters on GC
References
1. Big Data Performance Engineering: Examples from Hadoop, Pig, HBase, Flink and Spark
2. Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Three part series
3. Juggling with Bits and Bytes: Flink memory management
4. Project Tungsten: Bringing Spark Closer to Bare Metal
5. Spark the fastest open source engine for sorting a petabyte
6. Spark DataFrames are faster, aren’t they?
7. Byte Buffers and Non-Heap Memory
8. JavaOne 2013: Memory Efficient Java
9. Java Garbage Collection Basics