Helsinki Spark Meetup Nov 20 2015

IBM Spark spark.tc

After Dark 1.5

High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing, Text Analytics, and Recommendations

Chris Fregly Principal Data Solutions Engineer IBM Spark Technology Center

** We’re Hiring -- Only Nice People, Please!! **

November 20, 2015

Who Am I?

Streaming Data Engineer Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer, IBM Spark Technology Center

Founder, Advanced Apache Spark Meetup

Author Advanced .

Due 2016

My Ma’s First Time in California

Random Slide: More Ma “First Time” Pics

In California Using Chopsticks Using “New” iPhone

Upcoming Meetups and Conferences
  London Spark Meetup (Oct 12th)
  Scotland Data Science Meetup (Oct 13th)
  Dublin Spark Meetup (Oct 15th)
  Barcelona Spark Meetup (Oct 20th)
  Madrid Big Data Meetup (Oct 22nd)
  Paris Spark Meetup (Oct 26th)
  Amsterdam Spark Summit (Oct 27th)
  Brussels Spark Meetup (Oct 30th)
  Zurich Big Data Meetup (Nov 2nd)
  Geneva Spark Meetup (Nov 5th)
  San Francisco Datapalooza.io (Nov 10th)
  San Francisco Advanced Spark (Nov 12th)
  Oslo Big Data Hadoop Meetup (Nov 19th)
  Helsinki Spark Meetup (Nov 20th)
  Stockholm Spark Meetup (Nov 23rd)
  Copenhagen Spark Meetup (Nov 25th)
  Budapest Spark Meetup (Nov 26th)
  Singapore Strata Conference (Dec 1st)
  San Francisco Advanced Spark (Dec 8th)
  Mountain View Advanced Spark (Dec 10th)
  Toronto Spark Meetup (Dec 14th)
  Austin Data Days Conference (Jan 2016)

Advanced Apache Spark Meetup
  Meetup Metrics
    1600+ members in just 4 months!
    Top 5 most active Spark Meetup!!
  Meetup Goals
    Dig deep into the codebase of Spark and related projects
    Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
    Surface and share patterns and idioms of these well-designed, distributed, big data components

All Slides and Code Are Available!

advancedspark.com slideshare.net/cfregly

github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor

What is “After Dark”?
  Spark-based, Advanced Analytics Reference App
  End-to-End, Scalable, Real-time Big Data Pipeline
  Demonstration of Spark & Related Big Data Projects

github.com/fluxcapacitor

Tools of This Talk
  Kafka
  Redis
  Docker
  Ganglia
  Cassandra
  Parquet, JSON, ORC, Avro
  Apache Zeppelin Notebooks
  Spark SQL, DataFrames, Hive
  ElasticSearch, Logstash, Kibana
  Spark ML, GraphX, Stanford CoreNLP

github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor

Themes of this Talk
  Filter
  Off-Heap
  Parallelize
  Approximate
  Find Similarity
  Minimize Seeks
  Maximize Scans
  Customize for Workload
  Tune Performance At Every Layer
  Be Nice, Collaborate! Like a Mom!!

Presentation Outline
  Spark Core: Tuning & Mechanical Sympathy
  Spark SQL: Query Optimizing & Catalyst
  Spark Streaming: Scaling & Approximations
  Spark ML: Featurizing & Recommendations

Spark Core: Tuning & Mechanical Sympathy
  Understand and Acknowledge Mechanical Sympathy
  Study AlphaSort and 100 TB GraySort Challenge
  Dive Deep into Project Tungsten

Mechanical Sympathy

"Hardware and software working together in harmony."
  - Martin Thompson, http://mechanical-sympathy.blogspot.com

"Whatever your data structure, my array will beat it."
  - Scott Meyers (every C++ book, basically)

Hair Sympathy

- Bruce Jenner

Spark and Mechanical Sympathy

Project Tungsten (Spark 1.4-1.6+)

GraySort Challenge (Spark 1.1-1.2)

Minimize Memory and GC Maximize CPU Cache Locality

Saturate Network I/O Saturate Disk I/O

AlphaSort Technique: Sort 100-Byte Records

AlphaSort: List[(Key, Pointer)]
  Key is directly available for comparison
  Pointer dereference not required!

Naïve: List[Pointer]
  Must dereference the pointer to reach the key for every comparison

CPU Cache Line and Memory Sympathy

Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
  Not CPU cache-line friendly!
  * 4-byte pointers assume Compressed OOPs

Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
  CPU cache-line friendly!

Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes
  2x CPU cache-line friendly!
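
A minimal sketch of the key-prefix layout above, in plain Scala. The names (AlphaSortSketch, Record, sortedPointers) are hypothetical, and this is not the AlphaSort or Spark implementation. It packs a 4-byte key prefix and a 4-byte array index into one 8-byte Long, compares prefixes first, and dereferences the full key only when two prefixes tie.

```scala
object AlphaSortSketch {
  // A record with a binary key; the "pointer" in this sketch is just an array index.
  final case class Record(key: Array[Byte], payload: String)

  // First 4 key bytes as an unsigned, big-endian prefix (short keys are zero-padded).
  def keyPrefix(key: Array[Byte]): Long = {
    var prefix = 0L
    var i = 0
    while (i < 4) {
      val b = if (i < key.length) key(i) & 0xFF else 0
      prefix = (prefix << 8) | b
      i += 1
    }
    prefix
  }

  // Prefix in the high 32 bits, pointer in the low 32 bits: 8 bytes per entry.
  def pack(prefix: Long, pointer: Int): Long = (prefix << 32) | (pointer & 0xFFFFFFFFL)
  def prefixOf(packed: Long): Long = packed >>> 32
  def pointerOf(packed: Long): Int = packed.toInt

  // Unsigned lexicographic comparison of full keys (the expensive path).
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val c = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  // Returns record indices in key order; full keys are touched only on prefix ties.
  def sortedPointers(records: Array[Record]): Array[Int] = {
    val entries = Array.tabulate(records.length)(i => pack(keyPrefix(records(i).key), i))
    val sorted = entries.sortWith { (a, b) =>
      val byPrefix = java.lang.Long.compare(prefixOf(a), prefixOf(b))
      if (byPrefix != 0) byPrefix < 0
      else compareKeys(records(pointerOf(a)).key, records(pointerOf(b)).key) < 0
    }
    sorted.map(p => pointerOf(p))
  }
}
```

sortWith boxes the Longs, so this shows only the comparison logic; a real implementation would sort the primitive array in place to keep the cache benefit.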

Performance Comparison

Similar Trick: Direct Cache Access (DCA) Pull out packet header along side pointer to payload

CPU Cache Line Sizes

MyLaptop

MySoftLayerBareMetal

Cache Hits: Sequential v Random Access

Mechanical Sympathy CPU Cache Lines and Matrix Multiplication

CPU Cache Naïve Matrix Multiplication

    // Dot product of each row & column vector
    for (i <- 0 until numRowsA)
      for (j <- 0 until numColsB)
        for (k <- 0 until numColsA)
          res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: the inner loop walks matB column-wise (k varies fastest in matB(k)(j)), so each access lands on a different CPU cache line and pre-fetching is ineffective

CPU Cache Friendly Matrix Multiplication

    // Transpose B (matBT has numColsB rows and numRowsB columns)
    for (i <- 0 until numRowsB)
      for (j <- 0 until numColsB)
        matBT(j)(i) = matB(i)(j)

    // Modify dot product calculation for B Transpose
    for (i <- 0 until numRowsA)
      for (j <- 0 until numColsB)
        for (k <- 0 until numColsA)
          res(i)(j) += matA(i)(k) * matBT(j)(k)

Good: Full CPU cache line, effective prefetching

OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
NEW: the second operand references j before k
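
The two loops above can be sketched as a small runnable comparison. This is an illustrative version, not the demo code from the talk: MatMulSketch and its method names are made up here, and it only checks that both traversal orders give the same product. Timing and perf counters are left out.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j), striding down a column of b.
  def multiplyNaive(a: Matrix, b: Matrix): Matrix = {
    val res = Array.ofDim[Double](a.length, b(0).length)
    for (i <- a.indices; j <- b(0).indices; k <- b.indices)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // bt(j)(i) = b(i)(j); works for non-square matrices too.
  def transpose(b: Matrix): Matrix = {
    val bt = Array.ofDim[Double](b(0).length, b.length)
    for (i <- b.indices; j <- b(0).indices) bt(j)(i) = b(i)(j)
    bt
  }

  // Cache-friendly: both operands are read sequentially along a row.
  def multiplyCacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val bt = transpose(b)
    val res = Array.ofDim[Double](a.length, bt.length)
    for (i <- a.indices; j <- bt.indices) {
      val rowA = a(i)
      val rowBT = bt(j)
      var sum = 0.0
      var k = 0
      while (k < rowA.length) {
        sum += rowA(k) * rowBT(k)
        k += 1
      }
      res(i)(j) = sum
    }
    res
  }
}
```

On the JVM each row of an Array[Array[Double]] is a separate heap object, so a single flat Array[Double] would give tighter locality than this sketch.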

Instrumenting and Monitoring CPU Use Linux perf command!

http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication

Results of Matrix Multiply Comparison

perf stat -XX:-Inline --event \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, \
  LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

Event                        Cache-Friendly           Naïve    Ratio (Naïve / Cache-Friendly)
L1-dcache-load-misses            7916379222    214482863052    27.09355591
L1-dcache-prefetch-misses        4248568884     11487838928     2.70393142
LLC-load-misses                   336612743      4492612296    13.3465306
LLC-prefetch-misses              4300544980        49758059     0.01157017
cache-misses                      320086472      4447068200    13.8933338
stalled-cycles-frontend       1575227969401   5246311477291     3.33050934
elapsed time                           1073            2298     2.14163124

Highlights: ~27x (L1-dcache-load-misses), ~13x (LLC-load-misses), ~13x (cache-misses), ~2x (elapsed time)

~10x: 55 hp vs. 550 hp

Mechanical Sympathy CPU Cache Lines and Lock-Free Thread Sync

CPU Cache Naïve Tuple Counters

    object CacheNaiveTupleIncrement {
      var tuple = (0,0)
      …
      def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
        this.synchronized {
          tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
          tuple
        }
      }
    }

CPU Cache Naïve Case Class Counters

    case class MyTuple(left: Int, right: Int)

    object CacheNaiveCaseClassCounters {
      var tuple = new MyTuple(0,0)
      …
      def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
        this.synchronized {
          tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)
          tuple
        }
      }
    }

CPU Cache Friendly Lock-Free Counters

    object CacheFriendlyLockFreeCounters {
      // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
      val tuple = new AtomicLong()
      …
      def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
        var originalLong = 0L
        var updatedLong = 0L
        do {
          originalLong = tuple.get()
          val originalRightInt = originalLong.toInt                // cast originalLong to Int to get right counter
          val originalLeftInt = (originalLong >>> 32).toInt        // shift right to get left counter
          val updatedRightInt = originalRightInt + rightIncrement  // increment right counter
          val updatedLeftInt = originalLeftInt + leftIncrement     // increment left counter

          updatedLong = updatedLeftInt                    // update the new long with the left counter
          updatedLong = updatedLong << 32                 // shift the new long left
          updatedLong |= (updatedRightInt & 0xFFFFFFFFL)  // mask to 32 bits so a negative right counter cannot corrupt the left one
        } while (tuple.compareAndSet(originalLong, updatedLong) == false)
        updatedLong
      }
    }

Q: Why not @volatile long?

A: @volatile makes each 64-bit read and write atomic and visible, but not the read-modify-write sequence above. Without it, the Java Memory Model does not even guarantee atomic updates of 64-bit longs or doubles. compareAndSet() covers both.
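
A self-contained sketch of the same packed-counter technique that can be run as-is. The names (PackedLong, PackedCounters, Hammer) are hypothetical and not from the talk's demo code. Two Ints share one AtomicLong and are updated together with a compare-and-set retry loop instead of a lock.

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.annotation.tailrec

object PackedLong {
  // Left counter in the high 32 bits, right counter in the low 32 bits.
  def pack(left: Int, right: Int): Long = (left.toLong << 32) | (right & 0xFFFFFFFFL)
  def left(packed: Long): Int = (packed >>> 32).toInt
  def right(packed: Long): Int = packed.toInt
}

final class PackedCounters {
  private val packed = new AtomicLong(0L)

  // Retry until no other thread changed the value between get() and compareAndSet().
  @tailrec
  def increment(leftInc: Int, rightInc: Int): (Int, Int) = {
    val original = packed.get()
    val updated = PackedLong.pack(
      PackedLong.left(original) + leftInc,
      PackedLong.right(original) + rightInc)
    if (packed.compareAndSet(original, updated))
      (PackedLong.left(updated), PackedLong.right(updated))
    else increment(leftInc, rightInc)
  }

  def current: (Int, Int) = {
    val p = packed.get()
    (PackedLong.left(p), PackedLong.right(p))
  }
}

object Hammer {
  // Runs `threads` workers that each apply (+1, -1) `perThread` times.
  def run(counters: PackedCounters, threads: Int, perThread: Int): Unit = {
    val workers = (1 to threads).map { _ =>
      new Thread(new Runnable {
        def run(): Unit = {
          var i = 0
          while (i < perThread) {
            counters.increment(1, -1)
            i += 1
          }
        }
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
  }
}
```

Decrementing the right counter exercises the 32-bit mask: without it, a negative right value would sign-extend into the left half.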

Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync

Results of Counters Comparison Naïve Tuple Counters

Naïve Case Class Counters

Cache Friendly Lock-Free Counters

~2x ~1.5x

~3.5x ~2x ~2x

Profiling Visualizations: Flame Graphs

Example: Spark Word Count

Java Stack Traces (-XX:+PreserveFramePointer)

Plateaus are Bad!!

100 TB Daytona GraySort Challenge
  Focus on Network and Disk I/O Optimizations
  Improve Data Structs/Algos for Sort & Shuffle
  Saturate Network and Disk Controllers

Winning Results

Spark Goals   Saturate Network I/O   Saturate Disk I/O

(2013) (2014)

Winning Hardware Configuration
  Compute
    206 Workers, 1 Master (AWS EC2 i2.8xlarge)
    32 Intel Xeon CPU E5-2670 @ 2.5 GHz
    244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
    3 GB/s mixed read/write disk I/O per node
  Network
    AWS Placement Groups, VPC, Enhanced Networking
    Single Root I/O Virtualization (SR-IOV)
    10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)

Winning Software Configuration
  Spark 1.2, OpenJDK 1.7
  Disable caching, compression, speculative execution, shuffle spill
  Force NODE_LOCAL task scheduling for optimal data locality
  HDFS 2.4.1 short-circuit local reads, 2x replication
  Empirically chose between 4-6 partitions per CPU core
    206 nodes * 32 cores = 6592 cores
    6592 cores * 4 = 26,368 partitions
    6592 cores * 6 = 39,552 partitions
    6592 cores * 4.25 ~= 28,000 partitions (empirical best)
  Range partitioning takes advantage of sequential keyspace
    Required ~10s of sampling 79 keys from each partition
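
The partition arithmetic and the sample-based range partitioning above can be sketched in a few lines. This is an illustration under simple assumptions (Long keys, a sample that fits in memory); RangePartitionSketch and its methods are hypothetical names, not Spark's RangePartitioner.

```scala
object RangePartitionSketch {
  // 206 nodes * 32 cores, as on the slide.
  val cores: Int = 206 * 32

  def partitionsFor(perCore: Double): Long = math.round(cores * perCore)

  // Pick numPartitions - 1 split points from a sorted sample of keys.
  def boundaries(sample: Seq[Long], numPartitions: Int): Array[Long] = {
    require(numPartitions > 0 && sample.nonEmpty)
    val sorted = sample.sorted.toArray
    Array.tabulate(numPartitions - 1)(i => sorted(((i + 1) * sorted.length) / numPartitions))
  }

  // Binary search: the partition id is the number of boundaries <= key,
  // so ascending keys map to non-decreasing partition ids.
  def partitionFor(key: Long, bounds: Array[Long]): Int = {
    var lo = 0
    var hi = bounds.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (key < bounds(mid)) hi = mid else lo = mid + 1
    }
    lo
  }
}
```

Because partition ids follow key order, concatenating the individually sorted partitions yields a globally sorted output.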

New Sort Shuffle Manager for Spark 1.2
  Original “hash-based”
  New “sort-based”
    ① Use less OS resources (socket buffers, file descriptors)
    ② TimSort partitions in-memory
    ③ MergeSort partitions on-disk into a single master file
    ④ Serve partitions from master file: seek once, sequential scan

Asynchronous Network Module
  Switch to asynchronous Netty vs. synchronous java.nio
  Switch to zero-copy epoll
    Use only kernel-space between disk and network controllers
  Custom memory management
    spark.shuffle.blockTransferService=netty

Spark-Netty Performance Tuning
  spark.shuffle.io.preferDirectBufs=true
    Reuse off-heap buffers
  spark.shuffle.io.numConnectionsPerPeer=8 (for example)
    Increase to saturate hosts with multiple disks (8 x 800 GB SSD)

Details in SPARK-2468

Custom Algorithms and Data Structures
  Optimized for sort & shuffle workloads

  o.a.s.util.collection.TimSort[K,V]
    Based on JDK 1.7 TimSort
    Performs best with partially-sorted runs
    Optimized for elements of (K,V) pairs
    Sorts impl of SortDataFormat (i.e. KVArraySortDataFormat)

  o.a.s.util.collection.AppendOnlyMap
    Open addressing hash, quadratic probing
    Array of [(K, V), (K, V)]
    Good memory locality
    Keys never removed, values only append
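
A minimal sketch of an append-only, open-addressing map in the spirit of the bullets above: keys and values alternate in one flat array, collisions are resolved by quadratic (triangular) probing, and keys are never removed. AppendOnlySketch is a hypothetical name and this is not Spark's AppendOnlyMap. It assumes non-null keys and a power-of-two capacity.

```scala
final class AppendOnlySketch[K, V](initialCapacity: Int = 8) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")

  private var capacity = initialCapacity
  // Flat layout: key i at index 2*i, value i at index 2*i + 1.
  private var data = new Array[Any](2 * capacity)
  private var count = 0

  def size: Int = count

  // Slot holding `key`, or the first empty slot on its probe sequence.
  private def findSlot(table: Array[Any], cap: Int, key: K): Int = {
    val mask = cap - 1
    var pos = key.## & mask
    var delta = 1
    while (table(2 * pos) != null && table(2 * pos) != key) {
      pos = (pos + delta) & mask // offsets 1, 3, 6, 10, ... visit every slot
      delta += 1
    }
    pos
  }

  def put(key: K, value: V): Unit = {
    val pos = findSlot(data, capacity, key)
    if (data(2 * pos) == null) {
      data(2 * pos) = key
      count += 1
    }
    data(2 * pos + 1) = value
    if (count > capacity * 0.7) grow()
  }

  def get(key: K): Option[V] = {
    val pos = findSlot(data, capacity, key)
    if (data(2 * pos) == null) None else Some(data(2 * pos + 1).asInstanceOf[V])
  }

  // Double the table and re-insert every key; nothing is ever deleted.
  private def grow(): Unit = {
    val newCapacity = capacity * 2
    val newData = new Array[Any](2 * newCapacity)
    var i = 0
    while (i < capacity) {
      val k = data(2 * i)
      if (k != null) {
        val pos = findSlot(newData, newCapacity, k.asInstanceOf[K])
        newData(2 * pos) = k
        newData(2 * pos + 1) = data(2 * i + 1)
      }
      i += 1
    }
    data = newData
    capacity = newCapacity
  }
}
```

Skipping deletion is what keeps probing simple: no tombstones are needed, so a lookup can stop at the first empty slot.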

Daytona GraySort Challenge Goal: Success

  1.1 GB/s per node network I/O (Reducers)
    Theoretical max = 1.25 GB/s for 10 Gbps Ethernet
  3 GB/s per node disk I/O (Mappers)

Aggregate Cluster Network I/O!
  220 GB/s / 206 nodes ~= 1.1 GB/s per node

Shuffle Performance Tuning Tips
  Hash Shuffle Manager (Deprecated)
    spark.shuffle.consolidateFiles (Mapper)
    o.a.s.shuffle.FileShuffleBlockResolver
  Intermediate Files
    Increase spark.shuffle.file.buffer
    (Reducer) Increase spark.reducer.maxSizeInFlight if memory allows
  Use Smaller Number of Larger Executors
    Minimizes intermediate files and overall shuffle
    More opportunity for PROCESS_LOCAL
    Many Threads (1 per CPU)
  SQL: BroadcastHashJoin vs. ShuffledHashJoin
    spark.sql.autoBroadcastJoinThreshold
    Use DataFrame.explain(true) or EXPLAIN to verify

Project Tungsten

Data Structs & Algos Operate Directly on Byte Arrays

Maximize CPU Cache Locality, Minimize GC

Utilize Dynamic Code Generation

SPARK-7076 (Spark 1.4)

Quick Review of Project Tungsten Jiras

SPARK-7076 (Spark 1.4)

Why is CPU the Bottleneck?
  CPU is used for serialization, hashing, compression!
  Network and Disk I/O bandwidth are relatively high
  GraySort optimizations improved network & shuffle
  Partitioning, pruning, and predicate pushdowns
  Binary, compressed, columnar file formats (Parquet)

Yet Another Spark Shuffle Manager!

spark.shuffle.manager =
  hash (Deprecated)
    < 10,000 reducers
    Output partition file hashes the key of (K,V) pair
    Mapper creates an output file per partition
    Leads to M*P output files for all partitions
  sort (GraySort Challenge)
    > 10,000 reducers
    Default from Spark 1.2-1.5
    Mapper creates single output file for all partitions
    Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
    Uses custom data structures and algorithms for sort-shuffle workload
    Wins Daytona GraySort Challenge
  tungsten-sort (Project Tungsten)
    Default since 1.5
    Modification of existing sort-based shuffle
    Uses sun.misc.Unsafe for self-managed memory and garbage collection
    Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
    Perform joins, sorts, and other operators on both serialized and compressed byte buffers

CPU & Memory Optimizations
  Custom Managed Memory
    Reduces GC overhead
    Both on and off heap
    Exact size calculations
  Direct Binary Processing
    Operate on serialized/compressed arrays
    Kryo can reorder/sort serialized records
    LZF can reorder/sort compressed records
  More CPU Cache-aware Data Structs & Algorithms
    o.a.s.sql.catalyst.expressions.UnsafeRow
    o.a.s.unsafe.map.BytesToBytesMap
  Code Generation (default in 1.5)
    Generate source code from overall query plan
    100+ UDFs converted to use code generation

Related classes
  UnsafeFixedWidthAggregationMap, TungstenAggregationIterator
  CodeGenerator, GenerateUnsafeRowJoiner
  UnsafeSortDataFormat, UnsafeShuffleSortDataFormat
  PackedRecordPointer, UnsafeRow
  UnsafeInMemorySorter, UnsafeExternalSorter, UnsafeShuffleWriter
  Mostly Same Join Code, UnsafeProjection
  UnsafeShuffleManager, UnsafeShuffleInMemorySorter, UnsafeShuffleExternalSorter

Details in SPARK-7075

sun.misc.Unsafe

Info addressSize() pageSize()

Objects allocateInstance() objectFieldOffset()

Classes staticFieldOffset() defineClass() defineAnonymousClass() ensureClassInitialized()

Synchronization monitorEnter() tryMonitorEnter() monitorExit() compareAndSwapInt() putOrderedInt()

Arrays arrayBaseOffset() arrayIndexScale()

Memory allocateMemory() copyMemory() freeMemory() getAddress() – not guaranteed after GC getInt()/putInt() getBoolean()/putBoolean() getByte()/putByte() getShort()/putShort() getLong()/putLong() getFloat()/putFloat() getDouble()/putDouble() getObjectVolatile()/putObjectVolatile()

Used by Tungsten

Spark + sun.misc.Unsafe

org.apache.spark.sql.execution. aggregate.SortBasedAggregate aggregate.TungstenAggregate aggregate.AggregationIterator aggregate.udaf aggregate.utils SparkPlanner rowFormatConverters UnsafeFixedWidthAggregationMap UnsafeExternalSorter UnsafeExternalRowSorter UnsafeKeyValueSorter UnsafeKVExternalSorter local.ConvertToUnsafeNode local.ConvertToSafeNode local.HashJoinNode local.ProjectNode local.LocalNode local.BinaryHashJoinNode local.NestedLoopJoinNode joins.HashJoin joins.HashSemiJoin joins.HashedRelation joins.BroadcastHashJoin joins.ShuffledHashOuterJoin (not yet converted) joins.BroadcastHashOuterJoin joins.BroadcastLeftSemiJoinHash joins.BroadcastNestedLoopJoin joins.SortMergeJoin joins.LeftSemiJoinBNL joins.SortMergeOuterJoin Exchange SparkPlan UnsafeRowSerializer SortPrefixUtils sort basicOperators aggregate.SortBasedAggregationIterator aggregate.TungstenAggregationIterator datasources.WriterContainer datasources.json.JacksonParser datasources.jdbc.JDBCRDD Window

org.apache.spark. unsafe.Platform unsafe.KVIterator unsafe.array.LongArray unsafe.array.ByteArrayMethods unsafe.array.BitSet unsafe.bitset.BitSetMethods unsafe.hash.Murmur3_x86_32 unsafe.map.BytesToBytesMap unsafe.map.HashMapGrowthStrategy unsafe.memory.TaskMemoryManager unsafe.memory.ExecutorMemoryManager unsafe.memory.MemoryLocation unsafe.memory.UnsafeMemoryAllocator unsafe.memory.MemoryAllocator (trait/interface) unsafe.memory.MemoryBlock unsafe.memory.HeapMemoryAllocator unsafe.memory.ExecutorMemoryManager unsafe.sort.RecordComparator unsafe.sort.PrefixComparator unsafe.sort.PrefixComparators unsafe.sort.UnsafeSorterSpillWriter serializer.DummySerializationInstance shuffle.unsafe.UnsafeShuffleManager shuffle.unsafe.UnsafeShuffleSortDataFormat shuffle.unsafe.SpillInfo shuffle.unsafe.UnsafeShuffleWriter shuffle.unsafe.UnsafeShuffleExternalSorter shuffle.unsafe.PackedRecordPointer shuffle.ShuffleMemoryManager util.collection.unsafe.sort.UnsafeSorterSpillMerger util.collection.unsafe.sort.UnsafeSorterSpillReader util.collection.unsafe.sort.UnsafeSorterSpillWriter util.collection.unsafe.sort.UnsafeShuffleInMemorySorter util.collection.unsafe.sort.UnsafeInMemorySorter util.collection.unsafe.sort.RecordPointerAndKeyPrefix util.collection.unsafe.sort.UnsafeSorterIterator network.shuffle.ExternalShuffleBlockResolver scheduler.Task rdd.SqlNewHadoopRDD executor.Executor

org.apache.spark.sql.catalyst.expressions. regexpExpressions BoundAttribute SortOrder SpecializedGetters ExpressionEvalHelper UnsafeArrayData UnsafeReaders UnsafeMapData Projection LiteralGenerator UnsafeRow JoinedRow InputFileName SpecificMutableRow codegen.CodeGenerator codegen.GenerateProjection codegen.GenerateUnsafeRowJoiner codegen.GenerateSafeProjection codegen.GenerateUnsafeProjection codegen.BufferHolder codegen.UnsafeRowWriter codegen.UnsafeArrayWriter complexTypeCreator rows literals misc stringExpressions

Over 200 source files affected!!

Traditional Java Object Row Layout 4-byte String

Multi-field Object

Custom Data Structures for Workload
  UnsafeRow (Dense Binary Row)
    Dense, 8 bytes per field (word-aligned)
  BytesToBytesMap (Dense Binary HashMap)
    AlphaSort-Style (Key + Pointer)
  TaskMemoryManager (Virtual Memory Address)
    OS-Style Memory Paging

UnsafeRow Layout Example

Pre-Tungsten

Tungsten

Custom Memory Management
  o.a.s.memory.TaskMemoryManager & MemoryConsumer
    Memory management: virtual memory allocation, paging
    Off-heap: direct 64-bit address
    On-heap: 13-bit page num + 27-bit page offset
  o.a.s.shuffle.sort.PackedRecordPointer
    64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset))
  o.a.s.unsafe.types.UTF8String
    Primitive Array[Byte]

2^13 pages * 2^27 page size = 1 TB RAM per Task
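
The 24/13/27-bit layout described above can be sketched with plain bit arithmetic. PackedPointerSketch is a hypothetical name and this is not Spark's PackedRecordPointer. It only shows how the three fields share one 64-bit word and why 13 + 27 bits address 2^40 bytes.

```scala
object PackedPointerSketch {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27

  // Layout, high to low: [24-bit partition][13-bit page number][27-bit offset in page].
  def pack(partition: Int, page: Int, offset: Int): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits), "partition out of range")
    require(page >= 0 && page < (1 << PageBits), "page out of range")
    require(offset >= 0 && offset < (1 << OffsetBits), "offset out of range")
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset.toLong
  }

  def partition(packed: Long): Int = (packed >>> (PageBits + OffsetBits)).toInt
  def page(packed: Long): Int = ((packed >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offset(packed: Long): Int = (packed & ((1L << OffsetBits) - 1)).toInt

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes (1 TiB) addressable per task.
  val addressableBytes: Long = (1L << PageBits) * (1L << OffsetBits)
}
```

With the partition id in the top bits, sorting the packed words groups records by partition without touching the records themselves. Ids of 2^23 and above set the sign bit, so such a sort would need an unsigned comparison.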

UnsafeFixedWidthAggregationMap

Aggregations
  o.a.s.sql.execution.UnsafeFixedWidthAggregationMap
    Uses BytesToBytesMap
    In-place updates of serialized data
    No object creation on hot-path
    Improved external agg support
    No OOMs for large, single key aggs
  o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
    Combine 2 UnsafeRows into 1
  o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator
    Operates directly on serialized, binary UnsafeRow
    2 Steps: hash-based agg (grouping), then sort-based agg
    Supports spilling and external merge sorting

Equality
  Bitwise comparison on UnsafeRow
  No need to calculate equals(), hashCode()

Row 1 and Row 2: Equals!

Joins
  Surprisingly, not many code changes
  o.a.s.sql.catalyst.expressions.UnsafeProjection
    Converts InternalRow to UnsafeRow

Sorting
  o.a.s.util.collection.unsafe.sort.
    UnsafeSortDataFormat
    UnsafeInMemorySorter
    UnsafeExternalSorter
    RecordPointerAndKeyPrefix
    UnsafeShuffleWriter
  AlphaSort-Style Cache Friendly
    Key-Prefix + Ptr: 2x CPU Cache-line Friendly!
  Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.
  Supports merging compressed records if the compression codec supports it (LZF)

Spilling
  Efficient Spilling
    Exact data size is known
    No need to maintain heuristics & approximations
    Controls amount of spilling
  Spill merge on compressed, binary records!
    If the compression codec supports it
  Exact Peak Memory for Spark Jobs
    UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()

Code Generation
  Problem
    Boxing causes excessive object creation
    Expensive expression tree evals per row
    JVM can’t inline polymorphic impls
  Solution
    Codegen by-passes virtual function calls
    Defer source code generation to each operator, UDF, UDAF
    Use Scala quasiquote macros for Scala AST source code gen
    Rewrite and optimize code for overall plan, 8-byte align, etc
    Use Janino to compile generated source code into bytecode
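
A toy illustration of the problem and the solution above, in plain Scala. CodegenSketch and its types are hypothetical and far simpler than Catalyst's Expression.genCode(). The interpreted path walks the expression tree per row with boxed values and one virtual call per node, while the generated path emits a single flat Java expression as source text. Compiling that text (with Janino in Spark's case) is left out.

```scala
object CodegenSketch {
  sealed trait Expr {
    def eval(row: Array[Any]): Any // interpreted: virtual call + boxing per node
    def genCode: String            // generated: source text for the same computation
  }

  final case class Col(ordinal: Int) extends Expr {
    def eval(row: Array[Any]): Any = row(ordinal)
    def genCode: String = s"row[$ordinal]"
  }

  final case class Lit(value: Int) extends Expr {
    def eval(row: Array[Any]): Any = value
    def genCode: String = value.toString
  }

  final case class Add(left: Expr, right: Expr) extends Expr {
    def eval(row: Array[Any]): Any =
      left.eval(row).asInstanceOf[Int] + right.eval(row).asInstanceOf[Int]
    def genCode: String = s"(${left.genCode} + ${right.genCode})"
  }

  final case class Mul(left: Expr, right: Expr) extends Expr {
    def eval(row: Array[Any]): Any =
      left.eval(row).asInstanceOf[Int] * right.eval(row).asInstanceOf[Int]
    def genCode: String = s"(${left.genCode} * ${right.genCode})"
  }

  // Wraps the generated expression in a Java class operating on primitive ints.
  def generateJavaSource(className: String, expr: Expr): String =
    s"""public final class $className {
       |  public static int apply(int[] row) { return ${expr.genCode}; }
       |}""".stripMargin
}
```

For example, Add(Mul(Col(0), Lit(2)), Col(1)) evaluates by walking five tree nodes per row, but generates the single expression ((row[0] * 2) + row[1]) that the JIT can inline.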

Spark SQL UDF Code Generation 100+ UDFs now generating code

More to come in Spark 1.6+

Details in SPARK-8159, SPARK-9571

Each Implements Expression.genCode() !

Creating a Custom UDF with Codegen
  Study existing implementations
    https://github.com/apache/spark/pull/7214/files
  Extend base trait
    o.a.s.sql.catalyst.expressions.Expression.genCode()
  Register the function
    o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
  Augment DataFrame with new UDF (Scala implicits)
    o.a.s.sql.functions.scala
  Don’t forget about Python!
    python/pyspark/sql/functions.py

Who Benefits from Project Tungsten? Users of DataFrames All Spark SQL Queries Catalyst

All RDDs Serialization, Compression, and Aggregations

Project Tungsten Performance Results Query Time

Garbage Collection

OOM’d on Large Dataset!

Spark SQL: Query Optimizing & Catalyst

Explore DataFrames/Datasets/DataSources, Catalyst

Review Partitions, Pruning, Pushdowns, File Formats

Create a Custom DataSource API Implementation

DataFrames
  Inspired by R and Pandas DataFrames
  Schema-aware
  Cross language support
    SQL, Python, Scala, Java, R
  Levels performance of Python, Scala, Java, and R
    Generates JVM bytecode vs serializing to Python
  DataFrame is container for logical plan
    Lazy transformations represented as tree
    Only logical plan is sent from Python -> JVM
    Only results returned from JVM -> Python
  UDF and UDAF Support
    Custom UDF support using registerFunction()
    Experimental UDAF support (e.g. HyperLogLog)
  Supports existing Hive metastore if available
    Small, file-based Hive metastore created if not available

  * DataFrame.rdd returns underlying RDD if needed

Use DataFrames instead of RDDs!!

Spark and Hive
  Early days, Shark was “Hive on Spark”
  Hive Optimizer slowly replaced with Catalyst
  Always use HiveContext, even if not using Hive!
    If no Hive, a small Hive metastore file is created
  Spark 1.5+ supports all Hive versions 0.12+
    Separate classloaders for isolation
    Breaks dependency between Spark's internal Hive version and the user’s external Hive version

Catalyst Optimizer
  Optimize DataFrame Transformation Tree
    Subquery elimination: use aliases to collapse subqueries
    Constant folding: replace expression with constant
    Simplify filters: remove unnecessary filters
    Predicate/filter pushdowns: avoid unnecessary data load
    Projection collapsing: avoid unnecessary projections
  Create Custom Rules
    Rules are Scala Case Classes
      Implements o.a.s.sql.catalyst.rules.Rule
    val newPlan = MyFilterRule(analyzedPlan)
      Apply to any plan stage
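
To make "constant folding" and "rules as pattern matches over a tree" concrete, here is a toy version in plain Scala. CatalystSketch, its expression types, and transformUp are hypothetical stand-ins; Catalyst's real Rule and TreeNode APIs are richer. A rule is a partial function applied bottom-up to every node.

```scala
object CatalystSketch {
  sealed trait Expr
  final case class Literal(value: Int) extends Expr
  final case class Attribute(name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Multiply(left: Expr, right: Expr) extends Expr

  // Rewrite children first, then try the rule on the rebuilt node.
  def transformUp(e: Expr)(rule: PartialFunction[Expr, Expr]): Expr = {
    val withNewChildren = e match {
      case Add(l, r)      => Add(transformUp(l)(rule), transformUp(r)(rule))
      case Multiply(l, r) => Multiply(transformUp(l)(rule), transformUp(r)(rule))
      case leaf           => leaf
    }
    if (rule.isDefinedAt(withNewChildren)) rule(withNewChildren) else withNewChildren
  }

  // Constant folding plus two algebraic simplifications.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Add(Literal(a), Literal(b))      => Literal(a + b)
    case Multiply(Literal(a), Literal(b)) => Literal(a * b)
    case Multiply(e, Literal(1))          => e
    case Multiply(Literal(1), e)          => e
    case Add(e, Literal(0))               => e
    case Add(Literal(0), e)               => e
  }

  def optimize(e: Expr): Expr = transformUp(e)(constantFolding)
}
```

For example, x + ((2 * 3) * 1) is rewritten to x + 6 in a single bottom-up pass.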

DataSources API
  Relations (o.a.s.sql.sources.interfaces.scala)
    BaseRelation (abstract class): Provides schema of data
    TableScan (impl): Read all data from source
    PrunedFilteredScan (impl): Column pruning & predicate pushdowns
    InsertableRelation (impl): Insert/overwrite data based on SaveMode
    RelationProvider (trait/interface): Handle options, BaseRelation factory
  Execution (o.a.s.sql.execution.commands.scala)
    RunnableCommand (trait/interface): Common commands like EXPLAIN
    ExplainCommand (impl: case class)
    CacheTableCommand (impl: case class)
  Filters (o.a.s.sql.sources.filters.scala)
    Filter (abstract class): Handles all predicates/filters supported by this source
    EqualTo (impl)
    GreaterThan (impl)
    StringStartsWith (impl)
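
The idea behind PrunedFilteredScan can be shown without a Spark dependency. In this sketch the filter classes and InMemoryRelation are local, hypothetical stand-ins that only borrow the names above. The source receives the required columns and the pushed-down filters and returns only matching rows with only those columns.

```scala
object PrunedFilteredScanSketch {
  // Local stand-ins for the filter classes listed above (not Spark's own classes).
  sealed trait Filter
  final case class EqualTo(attribute: String, value: Any) extends Filter
  final case class GreaterThan(attribute: String, value: Int) extends Filter
  final case class StringStartsWith(attribute: String, prefix: String) extends Filter

  type Row = Map[String, Any]

  def matches(row: Row, filter: Filter): Boolean = filter match {
    case EqualTo(a, v)          => row.get(a).contains(v)
    case GreaterThan(a, v)      => row.get(a).exists { case i: Int => i > v; case _ => false }
    case StringStartsWith(a, p) => row.get(a).exists(_.toString.startsWith(p))
  }

  final class InMemoryRelation(data: Seq[Row]) {
    // Predicate pushdown (filters) and column pruning (requiredColumns) happen at the source.
    def buildScan(requiredColumns: Seq[String], filters: Seq[Filter]): Seq[Seq[Any]] =
      data
        .filter(row => filters.forall(f => matches(row, f)))
        .map(row => requiredColumns.map(c => row(c)))
  }
}
```

A real source would translate the filters into its own query language (a WHERE clause, an index lookup) instead of filtering in memory.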

Native Spark SQL DataSources

Query Plan Debugging

gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)

DataFrame.queryExecution.logical

DataFrame.queryExecution.analyzed

DataFrame.queryExecution.optimizedPlan

DataFrame.queryExecution.executedPlan

Query Plan Visualization & Metrics

Effectiveness of Filter

CPU Cache Friendly

Binary Format Cost-based Join Optimization

Similar to MapReduce

Map-side Join

Peak Memory for Joins and Aggs

JSON Data Source
  DataFrame
    val ratingsDF = sqlContext.read.format("json")
      .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
    -- or --
    val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")   // json() convenience method
  SQL Code
    CREATE TABLE genders USING json
    OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

JDBC Data Source
  Add Driver to Spark JVM System Classpath
    $ export SPARK_CLASSPATH=<jdbc-driver.jar>
  DataFrame
    val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
                         "url" -> "jdbc:postgresql://hostname:port/database",
                         "dbtable" -> "schema.tablename")
    sqlContext.read.format("jdbc").options(jdbcConfig).load()
  SQL
    CREATE TABLE genders USING jdbc
    OPTIONS (url, dbtable, driver, …)

Parquet Data Source
  Configuration
    spark.sql.parquet.filterPushdown=true
    spark.sql.parquet.mergeSchema=true
    spark.sql.parquet.cacheMetadata=true
    spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
  DataFrames
    val gendersDF = sqlContext.read.format("parquet")
      .load("file:/root/pipeline/datasets/dating/genders.parquet")
    gendersDF.write.format("parquet").partitionBy("gender")
      .save("file:/root/pipeline/datasets/dating/genders.parquet")
  SQL
    CREATE TABLE genders USING parquet
    OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

ORC Data Source
  Configuration
    spark.sql.orc.filterPushdown=true
  DataFrames
    val gendersDF = sqlContext.read.format("orc")
      .load("file:/root/pipeline/datasets/dating/genders")
    gendersDF.write.format("orc").partitionBy("gender")
      .save("file:/root/pipeline/datasets/dating/genders")
  SQL
    CREATE TABLE genders USING orc
    OPTIONS (path "file:/root/pipeline/datasets/dating/genders")

Third-Party Spark SQL DataSources

spark-packages.org

CSV DataSource (Databricks)
  Github
    https://github.com/databricks/spark-csv
  Maven
    com.databricks:spark-csv_2.10:1.2.0
  Code
    val gendersCsvDF = sqlContext.read
      .format("com.databricks.spark.csv")
      .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
      .toDF("id", "gender")   // toDF() is required if the CSV does not contain a header

ElasticSearch DataSource (Elastic.co)
  Github
    https://github.com/elastic/elasticsearch-hadoop
  Maven
    org.elasticsearch:elasticsearch-spark_2.10:2.1.0
  Code
    val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>")
    df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
      .options(esConfig).save("<index>/<document-type>")

Elasticsearch Tips

Change id field to not_analyzed to avoid indexing

Use term filter to build and cache the query

Perform multiple aggregations in a single request

Adapt scoring function to current trends at query time

AWS Redshift Data Source (Databricks)
  Github
    https://github.com/databricks/spark-redshift
  Maven
    com.databricks:spark-redshift:0.5.0
  Code
    val df: DataFrame = sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
      .option("query", "select x, count(*) from my_table group by x")
      .option("tempdir", "s3n://tmpdir")
      .load(...)

  UNLOAD and copy to a tmp bucket in S3 enables parallel reads

DB2 and BigSQL DataSources (IBM) Coming Soon!

Cassandra DataSource (DataStax)
  Github
    https://github.com/datastax/spark-cassandra-connector
  Maven
    com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
  Code
    ratingsDF.write
      .format("org.apache.spark.sql.cassandra")
      .mode(SaveMode.Append)
      .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)

Cassandra Pushdown Support
  spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala

Pushdown Predicate Rules
  1. Only push down non-partition key column predicates with =, >, <, >=, <= predicate.
  2. Only push down primary key column predicates with = or IN predicate.
  3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
  4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
  5. For cluster column predicates, only the last predicate can be a non-EQ predicate (including an IN predicate), and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicate can be any non-IN predicate.
  6. There are no pushdown predicates if there is any OR condition or NOT IN condition.
  7. We're not allowed to push down multiple predicates for the same column if any of them is an equality or IN predicate.

New Cassandra DataSource
  By-pass CQL, which is optimized for transactional data
  Instead, do bulk reads/writes directly on SSTables
  Similar to the 5-year-old Netflix open source project Aegisthus
  Promotes Cassandra to first-class Analytics Option
  Potentially only part of DataStax Enterprise?!
  Please mail a nasty letter to your local DataStax office

Rumor of REST DataSource (Databricks) Coming Soon?

Ask Michael Armbrust Spark SQL Lead @ Databricks

Custom DataSource (Me and You!) Coming Right Now!

DEMO ALERT!!

Create a Custom DataSource
  Study Existing Native & Third-Party Data Sources
    Native: Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
      class JDBCRelation extends BaseRelation
        with PrunedFilteredScan
        with InsertableRelation
    Third-Party: DataStax Cassandra (o.a.s.sql.cassandra)
      class CassandraSourceRelation extends BaseRelation
        with PrunedFilteredScan
        with InsertableRelation

Demo! Create a Custom DataSource

Contribute a Custom Data Source
  spark-packages.org
    Managed by Databricks
    Contains links to external github projects
    Ratings and comments
    Declare Spark version support for each package
  Examples
    https://github.com/databricks/spark-csv
    https://github.com/databricks/spark-avro
    https://github.com/databricks/spark-redshift

Parquet Columnar File Format Based on Google Dremel

Collaboration with Twitter and Cloudera

Self-describing, evolving schema

Fast columnar aggregation

Supports filter pushdowns

Columnar storage format

Excellent compression

Types of Compression

Run Length Encoding: Repeated data

Dictionary Encoding: Fixed set of values

Delta, Prefix Encoding: Sorted data
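
The three encodings can each be shown in a few lines of plain Scala. EncodingSketch and its methods are illustrative names only; Parquet's real encoders work on binary pages and differ in detail.

```scala
object EncodingSketch {
  // Run-length encoding: collapse runs of repeated values into (value, runLength).
  def runLength[A](xs: Seq[A]): List[(A, Int)] =
    xs.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
      case (acc, x)                      => (x, 1) :: acc
    }.reverse

  // Dictionary encoding: store each distinct value once, then small integer codes.
  def dictionary(xs: Seq[String]): (Vector[String], Seq[Int]) = {
    val dict = xs.distinct.toVector
    val codeOf = dict.zipWithIndex.toMap
    (dict, xs.map(x => codeOf(x)))
  }

  // Delta encoding: keep the first value, then differences between neighbours.
  // Sorted data yields small deltas that need fewer bits.
  def delta(xs: Seq[Long]): Seq[Long] =
    if (xs.isEmpty) Seq.empty
    else xs.head +: xs.zip(xs.tail).map { case (a, b) => b - a }
}
```

A low-cardinality column such as gender suits dictionary plus run-length encoding, while sorted ids or timestamps suit delta encoding.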

Demo! Demonstrate File Formats, Partition Schemes, and Query Plans

Hive JDBC ODBC ThriftServer
  Allow BI Tools to Query and Process Spark Data
  Register Permanent Table
    CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT)
    USING org.apache.spark.sql.json
    OPTIONS (path "datasets/dating/ratings.json.bz2")
  Register Temp Table
    ratingsDF.registerTempTable("ratings_temp")
  Configuration
    spark.sql.thriftServer.incrementalCollect=true
    spark.driver.maxResultSize > 10gb (default)

Demo! Query and Process Spark Data from BI Tools

Spark Streaming: Scaling & Approximations Discuss Delivery Guarantees, Parallelism, and Stability

Compare Receiver and Receiver-less Impls

Demonstrate Stream Approximations

Non-Parallel Receiver Implementation

Receiver Implementation (Kinesis)
  KinesisRDD partitions store relevant offsets
  Single receiver required to see all data/offsets
  Kinesis offsets not deterministic like Kafka
  Partitions rebuild from Kinesis using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee

Parallel Receiver-less Implementation (Kafka)

Receiver-less Implementation (Kafka)
  KafkaRDD partitions store relevant offsets
  Each partition acts as a Receiver
  Tasks/Executors pull from Kafka in parallel
  Partitions rebuild from Kafka using offsets
  No Write Ahead Log (WAL) needed
  Optimizes happy path by avoiding the WAL
  At least once delivery guarantee
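
A rough Python sketch of why storing offsets in each partition makes a WAL unnecessary: a lost partition can be recomputed by re-reading the same offset range from the source. The in-memory `LOG` is a hypothetical stand-in for Kafka, and `OffsetRange` is only loosely modeled on the offset ranges a KafkaRDD carries.

```python
from collections import namedtuple

# Hypothetical stand-in for Kafka: (topic, partition) -> append-only list of records
LOG = {
    ("ratings", 0): ["a", "b", "c", "d", "e"],
    ("ratings", 1): ["v", "w", "x", "y"],
}

OffsetRange = namedtuple("OffsetRange", "topic partition from_offset until_offset")

def compute_partition(offset_range):
    """Read exactly the records in [from_offset, until_offset) from the source."""
    log = LOG[(offset_range.topic, offset_range.partition)]
    return log[offset_range.from_offset:offset_range.until_offset]

# One batch = one offset range per Kafka partition, each read by its own task
batch = [OffsetRange("ratings", 0, 1, 4), OffsetRange("ratings", 1, 0, 2)]

first_attempt = [compute_partition(r) for r in batch]
# Simulate losing the results: the same ranges yield the same records again
retry = [compute_partition(r) for r in batch]
```

Because the retry may re-run output actions that already happened, this replay gives at-least-once delivery rather than exactly-once.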

Maintain Stability of Stream Processing
  Rate Limiting
    Since Spark 1.2
    Fixed limit on number of messages per second
    Potential to drop messages on the floor

  Back Pressure
    Since Spark 1.5 (Typesafe contribution)
    More dynamic than rate limiting
    Push back on reliable, buffered source (Kafka, Kinesis)
    Fundamentals of Control Theory and Observability
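
A toy Python simulation contrasting the two approaches. It assumes a simplified proportional-only feedback rule (next rate = throughput actually sustained in the last batch), which is not Spark's actual rate estimator. All names and numbers are hypothetical.

```python
def fixed_rate_limit(arrivals_per_batch, max_per_batch):
    """Static limit: accept at most max_per_batch records per batch.
    Returns (accepted, excess) per batch; the excess is what may get dropped."""
    return [(min(n, max_per_batch), max(n - max_per_batch, 0))
            for n in arrivals_per_batch]

def simulate_back_pressure(seconds_per_record_by_batch,
                           batch_interval_s=1.0, initial_rate=1000.0):
    """Feedback loop: after each batch, set the ingestion rate (records/sec)
    to the throughput the job sustained while processing that batch."""
    rate = initial_rate
    history = []
    for seconds_per_record in seconds_per_record_by_batch:
        records = rate * batch_interval_s
        processing_seconds = records * seconds_per_record
        history.append((rate, processing_seconds))
        rate = records / processing_seconds  # sustained records per second
    return history, rate

limited = fixed_rate_limit([800, 1500, 400], max_per_batch=1000)

# Processing slows to 2 ms/record for two batches, then recovers to 1 ms/record
history, final_rate = simulate_back_pressure([0.002, 0.002, 0.001])
```

In the simulation the first batch takes 2 seconds against a 1 second interval; the feedback halves the rate, so the second batch fits its interval again.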

Streaming Approximations HyperLogLog and CountMin Sketch

HyperLogLog (HLL) Approx Distinct Count
  Approximate count distinct
  Twitter’s Algebird
  Better than HashSet
  Low, fixed memory
  Only 1.5K, 2% error, 10^9 counts (tunable)
    Redis HLL: 12K per key, 0.81%, 2^64 counts
  Spark’s countApproxDistinctByKey()
  Streaming example in Spark codebase

http://research.neustar.biz/
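
A compact Python sketch of the HyperLogLog idea, assuming 2^10 registers and a SHA-1 based 64-bit hash. It is a textbook-style illustration, not the Algebird, Redis, or Spark implementation; the class and the sample keys are hypothetical.

```python
import hashlib
import math

class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers (fixed memory)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # first p bits: register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining 64-p bits
        # rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # bias constant, m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range correction
        return raw

hll = HyperLogLog(p=10)
for i in range(50_000):
    hll.add(f"user-{i}")
estimate = hll.count()
```

With 1024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%, and the memory used does not grow with the number of distinct items. Re-adding an item never changes the estimate.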

CountMin Sketch (CMS) Approx Count
  Approximate count
  Twitter’s Algebird
  Better than HashMap
  Low, fixed memory
  Known error bounds
  Large num counters
  Streaming example in Spark codebase
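
A small Python sketch of a Count-Min Sketch, assuming 4 hash rows of 256 counters and SHA-1 based hashing. The class is illustrative only and is not the Algebird implementation.

```python
import hashlib

class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth = depth              # number of hash rows
        self.width = width              # counters per row (fixed memory)
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        """Yield one (row, column) cell per hash row for this item."""
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum is the best guess
        return min(self.table[row][col] for row, col in self._cells(item))

stream = ["ali"] * 50 + ["matei"] * 20 + ["andy"] * 5
cms = CountMinSketch()
for name in stream:
    cms.add(name)
```

The estimate can overcount when items collide, but it never undercounts; widening the rows tightens the error bound.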

Demo! Using HLL and CMS for Streaming Count Approximations

Monte Carlo Simulations
  From Manhattan Project (Atomic bomb)
    Simulate movement of neutrons
  Law of Large Numbers (LLN)
    Average of results of many trials
    Converge on expected value
  SparkPi example in Spark codebase
    1 Argument: # of trials
    Pi ~= (# red dots / # total dots) * 4
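
A single-machine Python sketch of the same estimate: throw random points into the unit square and count those that land inside the quarter circle (the "red dots"). The function name, seed, and trial count are assumptions for illustration; SparkPi distributes the trials across the cluster.

```python
import math
import random

def estimate_pi(trials, seed=42):
    """Pi ~= (# points inside the quarter circle / # total points) * 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / trials

pi_estimate = estimate_pi(200_000)
error = abs(pi_estimate - math.pi)
```

By the Law of Large Numbers the error shrinks roughly with the square root of the number of trials, so 100x more trials buys about one more decimal digit.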

Demo! Using a Monte Carlo Simulation to Estimate Pi

Streaming Best Practices
  Get Data Out of Streaming ASAP
    Processing time may exceed batch interval
    Leads to unstable streaming system

  Please Don’t…
    Use updateStateByKey() like an in-memory DB
    Put streaming jobs on the request/response hot path

  Use Separate Jobs for Different Batch Intervals
    Small Batch Interval: Store raw data (Redis, Cassandra, etc)
    Medium Batch Interval: Transform, join, process data
    Large Batch Interval: Model training

  Gotchas
    Tune streamingContext.remember()

  Use Approximations!!

Spark ML: Featurizing & Recommendations Understand Similarity and Dimension Reduction

Demonstrate Sampling and Bucketing

Generate Recommendations

Live, Interactive Demo! sparkafterdark.com

Audience Participation Needed!!

-> You are here ->

Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors
  Wait for us to analyze together!
  Note: This is totally anonymous!!

Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor

Similarity

Types of Similarity
  Euclidean
    Linear-based measure
    Suffers from magnitude bias
  Cosine
    Angle-based measure
    Adjusts for magnitude bias
  Jaccard
    Set intersection / union
    Suffers from popularity bias
  Log Likelihood
    Netflix “Shawshank” Problem
    Adjusts for popularity bias

[Like matrix: rows Kimberly, Leslie, Meredith, Lisa, Holden; columns Ali, Matei, Reynold, Patrick, Andy; a 1 marks a like]
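
A short Python sketch of the first three measures, with hypothetical rating vectors and like sets chosen to show the magnitude bias. Log likelihood is left out to keep the example small.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance; grows with the magnitude of the ratings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Angle between the vectors; ignores their magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard_similarity(a, b):
    """Size of the set intersection divided by the size of the set union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

light_rater = [1, 2, 1]
heavy_rater = [2, 4, 2]    # same taste, double the magnitude

distance = euclidean_distance(light_rater, heavy_rater)
cosine = cosine_similarity(light_rater, heavy_rater)
jaccard = jaccard_similarity({"Ali", "Matei", "Andy"}, {"Matei", "Andy", "Patrick"})
```

The two raters agree perfectly in direction, so cosine similarity treats them as identical while Euclidean distance still reports them as far apart.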

All-Pairs Similarity Comparison
  Compare everything to everything
    a.k.a. “pair-wise similarity” or “similarity join”
  Naïve shuffle: O(m*n^2); m=rows, n=cols
  Minimize shuffle through approximations!
    Reduce m (rows)
      Sampling and bucketing
    Reduce n (cols)
      Remove most frequent value (i.e. 0)
      Principal Component Analysis

  Dimension reduction!!

Dimension Reduction Sampling and Bucketing

Reduce m: DIMSUM Sampling
  “Dimension Independent Matrix Square Using MapReduce”
  Remove rows with low similarity probability
  MLlib: RowMatrix.columnSimilarities(…)
  Twitter: 40% efficiency gain vs. Cosine Similarity

Reduce m: LSH Bucketing
  “Locality Sensitive Hashing”
  Split m into b buckets
    Use similarity hash algorithm
    Requires pre-processing of data
  Parallel compare bucket contents
  O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
    e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash
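
A minimal Python sketch of the bucketing step, assuming MinHash as the similarity hash (one common choice for Jaccard similarity). It is not the spark-hash implementation; the rows and helper names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def stable_hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(tokens, num_hashes=2):
    """For each seed, keep the smallest hash over the set: similar sets tend to agree."""
    return tuple(min(stable_hash(seed, t) for t in tokens)
                 for seed in range(num_hashes))

def lsh_buckets(rows, num_hashes=2):
    """Pre-processing pass: group row ids by their MinHash signature."""
    buckets = defaultdict(list)
    for row_id, tokens in rows.items():
        buckets[minhash_signature(tokens, num_hashes)].append(row_id)
    return buckets

def candidate_pairs(buckets):
    """Only rows that share a bucket are compared with each other."""
    pairs = set()
    for members in buckets.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs

rows = {
    "kimberly": {"ali", "matei", "andy"},
    "lisa": {"ali", "matei", "andy"},
    "holden": {"reynold"},
    "leslie": {"patrick"},
}
candidates = candidate_pairs(lsh_buckets(rows))
```

Four rows would need six comparisons all-pairs; after bucketing only the rows that landed together are compared. Each bucket can be processed in parallel.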

Reduce n: Remove Most Frequent Value
  Eliminate most-frequent value
  Represent other values with (index,value) pairs
  Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
  Note: Choose most frequent value (may not be 0)

(index,value)
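
A small Python sketch of this representation, using a hypothetical dense row; the helper names are illustrative.

```python
from collections import Counter

def to_sparse(dense_row):
    """Drop the most frequent value; keep the rest as (index, value) pairs."""
    most_frequent, _ = Counter(dense_row).most_common(1)[0]
    pairs = [(i, v) for i, v in enumerate(dense_row) if v != most_frequent]
    return most_frequent, len(dense_row), pairs

def to_dense(most_frequent, length, pairs):
    """Rebuild the original row from the default value and the pairs."""
    row = [most_frequent] * length
    for i, v in pairs:
        row[i] = v
    return row

dense = [0, 0, 5, 0, 0, 0, 3, 0]
default, length, pairs = to_sparse(dense)
restored = to_dense(default, length, pairs)
```

Only the two non-default entries survive, so any pair-wise work on this row scales with nnz instead of n.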

Recommendations Summary Statistics and Top-K Historical Analysis

Collaborative Filtering and Clustering

Text Featurization and NLP

Types of Recommendations
  Non-personalized
    No preference or behavior data for user, yet
    a.k.a. “Cold Start Problem”
  Personalized
    User-Item Similarity
      Items that others with similar prefs have liked
    Item-Item Similarity
      Items similar to your previously-liked items

Recommendation Terminology
  Feedback
    Explicit: like, rating
    Implicit: search, click, hover, view, scroll
  Feature Engineering
    Dimension reduction, polynomial expansion
  Hyper-parameter Tuning
    K-Folds Cross Validation, Grid Search
  Pipelines/Workflows
    Chaining together Transformers and Estimators

Single Machine ML Algorithms Stay Local, Distribute As Needed

Helps migration of existing single-node algos to Spark

Convert between Spark and Pandas DataFrames

New “pdspark” package: integration w/ scikit-learn, R

Non-Personalized Recommendations Use Aggregate Data to Generate Recommendations

  Top Users by Like Count

“I might like users who have the most-likes overall based on historical data.”

SparkSQL, DataFrames: Summary Stat, Aggs
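
A plain-Python sketch of the aggregation behind this recommendation: count likes per liked user and take the top K. The `likes` pairs are hypothetical; in Spark this would be a group-by and sort on a DataFrame.

```python
from collections import Counter

def top_k_liked(likes, k):
    """likes is a list of (from_user, to_user) pairs; rank users by likes received."""
    counts = Counter(to_user for _, to_user in likes)
    return counts.most_common(k)

likes = [
    ("kimberly", "ali"), ("leslie", "ali"), ("meredith", "ali"),
    ("kimberly", "matei"), ("lisa", "matei"),
    ("holden", "andy"),
]
top_two = top_k_liked(likes, k=2)
```

The result is the same for every visitor, which is what makes it usable before any preference data exists for a new user.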

  Top Influencers by Like Graph “I might like the most-influential users in overall like graph.”

GraphX: PageRank
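
A textbook power-iteration PageRank in plain Python over a tiny, hypothetical like graph. It assumes a damping factor of 0.85 and spreads the rank of users with no outgoing likes evenly; it is a sketch of the idea, not GraphX's implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges is a list of (from_user, to_user) likes; returns user -> rank."""
    nodes = sorted({user for edge in edges for user in edge})
    out_links = {user: [] for user in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {user: 1.0 / len(nodes) for user in nodes}
    for _ in range(iterations):
        new_rank = {user: (1.0 - damping) / len(nodes) for user in nodes}
        for src in nodes:
            # Users who like nobody spread their rank evenly over everyone
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

likes = [("kimberly", "ali"), ("leslie", "ali"),
         ("meredith", "ali"), ("meredith", "matei")]
ranks = pagerank(likes)
most_influential = max(ranks, key=ranks.get)
```

Unlike a raw like count, a like from a highly ranked user is worth more than a like from an obscure one.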

Demo! Generate Non-Personalized Recommendations

Personalized Recommendations Understand Similarity and Personalized Recommendations

  Like Behavior of Similar Users

“I like the same people that you like. What other people did you like that I haven’t seen?”

MLlib: Matrix Factorization, User-Item Similarity
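
A neighborhood-based Python sketch of the quoted idea: find users whose likes overlap with mine, then suggest what they liked that I have not seen. This swaps in Jaccard similarity between users as a simple stand-in and is not the matrix factorization that MLlib provides. All names and data are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two like sets: intersection size over union size."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(user, likes, k=3):
    """Score unseen candidates by the summed similarity of the users who liked them."""
    scores = {}
    for other, other_likes in likes.items():
        if other == user:
            continue
        similarity = jaccard(likes[user], other_likes)
        if similarity == 0.0:
            continue  # no shared taste, so no evidence
        for candidate in other_likes - likes[user]:
            scores[candidate] = scores.get(candidate, 0.0) + similarity
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [candidate for candidate, _ in ranked[:k]]

likes = {
    "me": {"ali", "matei"},
    "you": {"ali", "matei", "reynold"},
    "stranger": {"patrick", "andy"},
}
suggestions = recommend("me", likes)
```

The stranger shares no likes with me, so only your extra like is suggested.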

Demo! Generate Personalized Recommendations using Collaborative Filtering & Matrix Factorization

  Similar Text-based Profiles as Me

“Our profiles have similar keywords and named entities. We might like each other!”

MLlib: Word2Vec, TF/IDF, k-skip n-grams
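
A plain-Python sketch of the TF/IDF part: weight each profile's words, then compare profiles with cosine similarity. It assumes whitespace tokenization and the smoothed IDF log((n + 1) / (df + 1)). The profile texts are hypothetical, and Word2Vec and n-grams are left out.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs maps name -> text; returns name -> {term: tf-idf weight}."""
    tokenized = {name: text.lower().split() for name, text in docs.items()}
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    n = len(docs)
    vectors = {}
    for name, tokens in tokenized.items():
        term_counts = Counter(tokens)
        vectors[name] = {
            term: (count / len(tokens)) * math.log((n + 1) / (doc_freq[term] + 1))
            for term, count in term_counts.items()
        }
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

profiles = {
    "me": "i love hiking and spark",
    "her": "i love hiking and cooking",
    "other": "i love chess and poker",
}
vectors = tf_idf_vectors(profiles)
sim_her = cosine(vectors["me"], vectors["her"])
sim_other = cosine(vectors["me"], vectors["other"])
```

Words that appear in every profile ("i", "love", "and") get zero weight, so only the distinctive shared keyword ("hiking") contributes to the similarity.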

  Similar Profiles to Previous Likes

“Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!”

MLlib: Word2Vec, TF/IDF, Doc Similarity

  Relevant, High-Value Emails

“Your initial email references a lot of things in my profile. I might like you for making the effort!”

MLlib: Word2Vec, TF/IDF, Entity Recognition

^ Her Email < My Profile

Demo! Feature Engineering for Text/NLP Use Cases

The Future of Recommendations

  Eigenfaces: Facial Recognition

“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

  NLP Conversation Starter Bot! “If your responses to my generic opening lines are positive, I may read your profile.”

MLlib: TF/IDF, DecisionTrees, Sentiment Analysis

Positive Negative

Maintaining the Spark

  Recommendations for Couples

“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path

[Figure: similar plots -> <- similar actors]

Final Recommendation!

  Get Off the Computer & Meet People!
Thank you, Helsinki!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, CA, USA

Relevant Links
  advancedspark.com
    Sign up for the book & global meetup!
  github.com/fluxcapacitor/pipeline
    Clone, contribute, and commit code!
  hub.docker.com/r/fluxcapacitor/pipeline/wiki
    Run all demos in your own environment with Docker!

More Relevant Links
  http://meetup.com/Advanced-Apache-Spark-Meetup
  http://advancedspark.com
  http://github.com/fluxcapacitor/pipeline
  http://hub.docker.com/r/fluxcapacitor/pipeline
  http://sortbenchmark.org/ApacheSpark2014.pdf
  https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
  http://0x0fff.com/spark-architecture-shuffle/
  http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
  http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
  http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
  http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
  http://docs.scala-lang.org/overviews/quasiquotes/intro.html
  http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
  http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
  https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
  http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
  http://www.brendangregg.com/perf.html
  https://perf.wiki.kernel.org/index.php/Tutorial
  http://techblog.netflix.com/2015/07/java-in-flames.html
  http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
  http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java

What’s Next?

What’s Next?
  Autoscaling Spark Workers
    Completely Docker-based
    Docker Compose and Docker Machine

  Lots of Demos and Examples!
    Zeppelin & IPython/Jupyter notebooks
    Advanced streaming use cases
    Advanced ML, Graph, and NLP use cases

  Performance Tuning and Profiling
    Work closely with Brendan Gregg & Netflix
    Surface & share more low-level details of Spark internals