IBM Spark spark.tc
After Dark 1.5
High Performance, Real-time, Streaming, Machine Learning, Natural Language Processing,
Text Analytics, and Recommendations
Chris Fregly Principal Data Solutions Engineer IBM Spark Technology Center
** We’re Hiring -- Only Nice People, Please!! **
November 20, 2015
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Who Am I?
Streaming Data Engineer, Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center
Founder, Advanced Apache Spark Meetup
Author Advanced .
Due 2016
My Ma’s First Time in California
Random Slide: More Ma “First Time” Pics
In California Using Chopsticks Using “New” iPhone
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)
Advanced Apache Spark Meetup
Meetup Metrics
  1600+ Members in just 4 mos!
  Top 5 Most Active Spark Meetup!!
Meetup Goals
  Dig deep into codebase of Spark and related projects
  Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface and share patterns and idioms of these well-designed, distributed, big data components
All Slides and Code Are Available!
advancedspark.com slideshare.net/cfregly
github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor
What is “After Dark”?
Spark-based, Advanced Analytics Reference App
End-to-End, Scalable, Real-time Big Data Pipeline
Demonstration of Spark & Related Big Data Projects
github.com/fluxcapacitor
Tools of This Talk
Kafka
Redis
Docker
Ganglia
Cassandra
Parquet, JSON, ORC, Avro
Apache Zeppelin Notebooks
Spark SQL, DataFrames, Hive
ElasticSearch, Logstash, Kibana
Spark ML, GraphX, Stanford CoreNLP
…
github.com/fluxcapacitor hub.docker.com/r/fluxcapacitor
Themes of this Talk
Filter
Off-Heap
Parallelize
Approximate
Find Similarity
Minimize Seeks
Maximize Scans
Customize for Workload
Tune Performance At Every Layer
Be Nice, Collaborate! Like a Mom!!
Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
Spark Core: Tuning & Mechanical Sympathy
Understand and Acknowledge Mechanical Sympathy
Study AlphaSort and 100 TB GraySort Challenge
Dive Deep into Project Tungsten
Mechanical Sympathy
“Hardware and software working together in harmony.” - Martin Thompson, http://mechanical-sympathy.blogspot.com
“Whatever your data structure, my array will beat it.” - Scott Meyers, Every C++ Book, basically
Hair Sympathy
- Bruce Jenner
Spark and Mechanical Sympathy
Project Tungsten (Spark 1.4-1.6+)
  Minimize Memory and GC
  Maximize CPU Cache Locality
GraySort Challenge (Spark 1.1-1.2)
  Saturate Network I/O
  Saturate Disk I/O
AlphaSort Technique: Sort 100-Byte Records
Naïve: List[Pointer]
  Must dereference each pointer to reach the key for comparison
AlphaSort: List[(Key, Pointer)]
  Key is directly available for comparison; dereference not required!
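The (Key, Pointer) idea can be sketched in a few lines of Scala. This is a minimal illustration, not code from the talk: it assumes 100-byte records whose first 10 bytes are the key, and the names `AlphaSortSketch`, `keyPrefix`, `compareKeys`, and `sortedPointers` are hypothetical. Tuples are used for readability; a production version would pack prefix and pointer into a primitive array to avoid boxing.

```scala
object AlphaSortSketch {
  // Hypothetical record layout: bytes 0-9 are the key, the rest is the value.
  type Record = Array[Byte]

  // Pack the first 4 key bytes into a non-negative Long so that numeric
  // order matches unsigned lexicographic byte order.
  def keyPrefix(rec: Record): Long = {
    var p = 0L
    var i = 0
    while (i < 4) { p = (p << 8) | (rec(i) & 0xFFL); i += 1 }
    p
  }

  // Full 10-byte key comparison, needed only when two prefixes tie.
  def compareKeys(a: Record, b: Record): Int = {
    var i = 0
    while (i < 10) {
      val c = (a(i) & 0xFF) - (b(i) & 0xFF)
      if (c != 0) return c
      i += 1
    }
    0
  }

  // Sort small (prefix, pointer) entries; the 100-byte records never move.
  def sortedPointers(records: Array[Record]): Array[Int] = {
    val entries: Array[(Long, Int)] =
      records.indices.map(i => (keyPrefix(records(i)), i)).toArray
    entries.sortWith { (x, y) =>
      if (x._1 != y._1) x._1 < y._1                       // no dereference
      else compareKeys(records(x._2), records(y._2)) < 0  // dereference on tie
    }.map(_._2)
  }
}
```

Most comparisons are decided by the prefix alone, so the sort touches the records themselves only when two prefixes are equal.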
CPU Cache Line and Memory Sympathy
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: Not CPU Cache-line Friendly!
Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU Cache-line Friendly!
Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU Cache-line Friendly!
*4 bytes with Compressed OOPs
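The arithmetic behind these three layouts can be checked with a small Scala sketch. It assumes a 64-byte cache line (common on x86-64, but verify on your own hardware) and a contiguous, line-aligned array of entries; `CacheLineMath` and its methods are illustrative names, not part of Spark.

```scala
object CacheLineMath {
  // Assumption: 64-byte cache lines.
  val CacheLineBytes = 64

  // Whole entries that fit in one cache line.
  def entriesPerLine(entryBytes: Int): Int = CacheLineBytes / entryBytes

  // How many of the first n entries straddle a cache-line boundary,
  // i.e. need two cache lines to be read.
  def straddling(entryBytes: Int, n: Int): Int =
    (0 until n).count { i =>
      val start = i * entryBytes
      val end = start + entryBytes - 1
      start / CacheLineBytes != end / CacheLineBytes
    }
}
```

With 14-byte entries some entries cross a line boundary; padding to 16 bytes removes the straddling, and the 8-byte key-prefix layout additionally doubles the number of entries per line.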
Performance Comparison
Similar Trick: Direct Cache Access (DCA)
Pull out the packet header alongside a pointer to the payload
CPU Cache Line Sizes
MyLaptop
MySoftLayerBareMetal
Cache Hits: Sequential v Random Access
Mechanical Sympathy CPU Cache Lines and Matrix Multiplication
CPU Cache Naïve Matrix Multiplication
  // Dot product of each row & column vector
  for (i <- 0 until numRowsA)
    for (j <- 0 until numColsB)
      for (k <- 0 until numColsA)
        res(i)(j) += matA(i)(k) * matB(k)(j)
Bad: the inner loop steps down a column of matB (a different row on every iteration), so it does not use the full CPU cache line and pre-fetching is ineffective
CPU Cache Friendly Matrix Multiplication
  // Transpose B (matBT has numColsB rows and numRowsB columns)
  for (i <- 0 until numRowsB)
    for (j <- 0 until numColsB)
      matBT(j)(i) = matB(i)(j)
  // Modify dot product calculation for B Transpose
  for (i <- 0 until numRowsA)
    for (j <- 0 until numColsB)
      for (k <- 0 until numColsA)
        res(i)(j) += matA(i)(k) * matBT(j)(k)
Good: Full CPU cache line, effective prefetching
OLD: res(i)(j) += matA(i)(k) * matB(k)(j)
Reference j before k
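A self-contained version of both loops, for experimenting outside the slides. This is a sketch under simple assumptions (dense `Array[Array[Double]]` matrices, non-empty inputs); `MatMulSketch`, `naive`, and `cacheFriendly` are illustrative names. It also handles non-square matrices, where the transpose has `numColsB` rows and `numRowsB` columns.

```scala
object MatMulSketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j), jumping to a new row of b each step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p; k <- 0 until m)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose b once, then both operands of the inner loop
  // are read sequentially along a row.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val bT = Array.ofDim[Double](p, m)
    for (k <- 0 until m; j <- 0 until p) bT(j)(k) = b(k)(j)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p) {
      val rowA = a(i)
      val rowBT = bT(j)
      var sum = 0.0
      var k = 0
      while (k < m) { sum += rowA(k) * rowBT(k); k += 1 }
      res(i)(j) = sum
    }
    res
  }
}
```

Both methods return the same product; only the memory access pattern differs, which is what the perf counters in the demo measure.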
Instrumenting and Monitoring CPU
Use Linux perf command!
http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html
Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication
Results of Matrix Multiply Comparison
Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply
Chart annotations: ~27x, ~13x, ~13x, ~2x
~10x (55 hp vs. 550 hp)
perf stat -XX:-Inline --event \
  L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses, \
  LLC-prefetch-misses,cache-misses,stalled-cycles-frontend

Event                      First column    Second column   Ratio (second / first)
L1-dcache-load-misses      7916379222      214482863052    27.09355591
L1-dcache-prefetch-misses  4248568884      11487838928     2.703931428
LLC-load-misses            336612743       4492612296      13.34653066
LLC-prefetch-misses        4300544980      49758059        0.011570175
cache-misses               320086472       4447068200      13.8933338
stalled-cycles-frontend    1575227969401   5246311477291   3.330509348
elapsed time               1073            2298            2.141631246
Mechanical Sympathy CPU Cache Lines and Lock-Free Thread Sync
CPU Cache Naïve Tuple Counters
  object CacheNaiveTupleIncrement {
    var tuple = (0,0)
    …
    def increment(leftIncrement: Int, rightIncrement: Int) : (Int, Int) = {
      this.synchronized {
        tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
        tuple
      }
    }
  }
CPU Cache Naïve Case Class Counters
  case class MyTuple(left: Int, right: Int)
  object CacheNaiveCaseClassCounters {
    var tuple = new MyTuple(0,0)
    …
    def increment(leftIncrement: Int, rightIncrement: Int) : MyTuple = {
      this.synchronized {
        tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)
        tuple
      }
    }
  }
CPU Cache Friendly Lock-Free Counters
  import java.util.concurrent.atomic.AtomicLong

  object CacheFriendlyLockFreeCounters {
    // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
    val tuple = new AtomicLong()
    …
    def increment(leftIncrement: Int, rightIncrement: Int) : Long = {
      var originalLong = 0L
      var updatedLong = 0L
      var swapped = false
      while (!swapped) {
        originalLong = tuple.get()
        val originalRightInt = originalLong.toInt // cast originalLong to Int to get right counter
        val originalLeftInt = (originalLong >>> 32).toInt // shift right to get left counter
        val updatedRightInt = originalRightInt + rightIncrement // increment right counter
        val updatedLeftInt = originalLeftInt + leftIncrement // increment left counter
        updatedLong = updatedLeftInt.toLong << 32 // left counter goes in the high 32 bits
        updatedLong |= (updatedRightInt & 0xFFFFFFFFL) // right counter in the low 32 bits; the mask stops a negative value from sign-extending into the left counter
        swapped = tuple.compareAndSet(originalLong, updatedLong) // retry if another thread got there first
      }
      updatedLong
    }
  }
Q: Why not @volatile long?
A: @volatile makes each individual read and write of a 64-bit long atomic and visible, but an increment is a read-modify-write sequence, which @volatile does not make atomic. (Without @volatile, the Java Memory Model does not even guarantee atomic writes of 64-bit longs or doubles.)
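The packing and unpacking can be isolated into small helpers, which makes the bit layout easier to test, including negative counter values. This is a minimal sketch; `PackedCounters`, `pack`, `left`, `right`, and `current` are hypothetical names, and the retry loop is written as a tail-recursive method rather than the loop shown on the slide.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCounters {
  private val packed = new AtomicLong(0L)

  // High 32 bits: left counter. Low 32 bits: right counter.
  // The mask keeps a negative right value from overwriting the left half.
  def pack(left: Int, right: Int): Long =
    (left.toLong << 32) | (right & 0xFFFFFFFFL)

  def left(p: Long): Int = (p >>> 32).toInt
  def right(p: Long): Int = p.toInt

  def current: Long = packed.get()

  // Lock-free update: recompute and retry until compareAndSet succeeds.
  @annotation.tailrec
  def increment(leftIncrement: Int, rightIncrement: Int): Long = {
    val cur = packed.get()
    val next = pack(left(cur) + leftIncrement, right(cur) + rightIncrement)
    if (packed.compareAndSet(cur, next)) next
    else increment(leftIncrement, rightIncrement)
  }
}
```

Because both counters live in one 8-byte word, a single compare-and-set updates them together and a reader always sees a consistent pair.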
Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync
Results of Counters Comparison
[Charts: Naïve Tuple Counters vs. Naïve Case Class Counters vs. Cache Friendly Lock-Free Counters; annotated differences: ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x, ~1.5x]
Profiling Visualizations: Flame Graphs
Example: Spark Word Count
Java Stack Traces (-XX:+PreserveFramePointer)
Plateaus are Bad!!
100TB Daytona GraySort Challenge
Focus on Network and Disk I/O Optimizations
Improve Data Structs/Algos for Sort & Shuffle
Saturate Network and Disk Controllers
Winning Results
Spark Goals
  Saturate Network I/O
  Saturate Disk I/O
(2013) (2014)
Winning Hardware Configuration
Compute
  206 Workers, 1 Master (AWS EC2 i2.8xlarge)
  32 Intel Xeon CPU E5-2670 @ 2.5 GHz
  244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
  3 GBps mixed read/write disk I/O per node
Network
  AWS Placement Groups, VPC, Enhanced Networking
  Single Root I/O Virtualization (SR-IOV)
  10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)
Winning Software Configuration
Spark 1.2, OpenJDK 1.7
Disable caching, compression, speculative execution, shuffle spill
Force NODE_LOCAL task scheduling for optimal data locality
HDFS 2.4.1 short-circuit local reads, 2x replication
Empirically chose between 4-6 partitions per CPU
  206 nodes * 32 cores = 6592 cores
  6592 cores * 4 = 26,368 partitions
  6592 cores * 6 = 39,552 partitions
  6592 cores * 4.25 ≈ 28,000 partitions (empirical best)
Range partitioning takes advantage of sequential keyspace
  Required ~10s of sampling 79 keys from each partition
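The partition arithmetic and the idea behind sampled range partitioning can be sketched in plain Scala. This is an illustration only and not Spark's RangePartitioner: `RangePartitionSketch`, `partitionsFor`, `boundaries`, and `partitionOf` are hypothetical names, keys are assumed to be Longs, and the sample is simply sorted and cut at evenly spaced positions.

```scala
object RangePartitionSketch {
  // Cluster size from the slide: 206 nodes * 32 cores.
  val cores: Int = 206 * 32

  // Partition count for a given partitions-per-core factor.
  def partitionsFor(factor: Double): Long = math.round(cores * factor)

  // Choose numPartitions - 1 boundary keys from a sample of the keyspace.
  def boundaries(sample: Seq[Long], numPartitions: Int): Array[Long] = {
    val sorted = sample.sorted.toArray
    (1 until numPartitions)
      .map(i => sorted(i * sorted.length / numPartitions))
      .toArray
  }

  // Partition id = number of boundaries <= key, found by binary search.
  def partitionOf(key: Long, bounds: Array[Long]): Int = {
    var lo = 0
    var hi = bounds.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (bounds(mid) <= key) lo = mid + 1 else hi = mid
    }
    lo
  }
}
```

Because partition ids follow key order, concatenating the sorted partitions in id order yields a globally sorted output, which is why range partitioning suits a sort benchmark.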
New Sort Shuffle Manager for Spark 1.2
Original “hash-based”
New “sort-based”
  ① Use less OS resources (socket buffers, file descriptors)
  ② TimSort partitions in-memory
  ③ MergeSort partitions on-disk into a single master file
  ④ Serve partitions from master file: seek once, sequential scan
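The "single master file, seek once, sequential scan" idea can be modelled with an in-memory byte array standing in for the file. This is a simplified sketch, not Spark's shuffle writer: `SortShuffleSketch`, `MapOutput`, `write`, and `read` are hypothetical names, and records are assumed to arrive already tagged with their partition id.

```scala
object SortShuffleSketch {
  // One data "file" per mapper plus an index of partition start offsets.
  // offsets has numPartitions + 1 entries; partition p is [offsets(p), offsets(p + 1)).
  final case class MapOutput(data: Array[Byte], offsets: Array[Int])

  def write(records: Seq[(Int, Array[Byte])], numPartitions: Int): MapOutput = {
    val sorted = records.sortBy(_._1) // stable sort by partition id
    val out = new java.io.ByteArrayOutputStream()
    val offsets = new Array[Int](numPartitions + 1)
    var p = 0
    for ((pid, bytes) <- sorted) {
      // Record the start offset of every partition up to and including pid.
      while (p < pid) { p += 1; offsets(p) = out.size() }
      out.write(bytes, 0, bytes.length)
    }
    // Close out any trailing (possibly empty) partitions.
    while (p < numPartitions) { p += 1; offsets(p) = out.size() }
    MapOutput(out.toByteArray, offsets)
  }

  // Serving a partition is one offset lookup ("seek") and one contiguous copy ("scan").
  def read(output: MapOutput, pid: Int): Array[Byte] =
    java.util.Arrays.copyOfRange(output.data, output.offsets(pid), output.offsets(pid + 1))
}
```

Each mapper produces one data file and one index instead of one file per partition, which is where the savings in file descriptors and buffers come from.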
Asynchronous Network Module
Switch to asynchronous Netty vs. synchronous java.nio
Switch to zero-copy epoll
  Use only kernel-space between disk and network controllers
Custom memory management
  spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
  spark.shuffle.io.preferDirectBuffers=true
    Reuse off-heap buffers
  spark.shuffle.io.numConnectionsPerPeer=8 (for example)
    Increase to saturate hosts with multiple disks (8 x 800GB SSD)
Details in SPARK-2468
Custom Algorithms and Data Structures
Optimized for sort & shuffle workloads
o.a.s.util.collection.TimSort[K,V]
  Based on JDK 1.7 TimSort
  Performs best with partially-sorted runs
  Optimized for elements of (K,V) pairs
  Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
o.a.s.util.collection.AppendOnlyMap
  Open addressing hash, quadratic probing
  Array of [(K, V), (K, V)]
  Good memory locality
  Keys never removed, values only append
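A much-simplified sketch of an append-only, open-addressing map with keys and values interleaved in one flat array. It is not Spark's AppendOnlyMap: `AppendOnlyMapSketch` and `LongCountMap` are hypothetical names, keys and values are restricted to Long, there is no resizing, and a separate `used` array marks occupied slots.

```scala
object AppendOnlyMapSketch {
  final class LongCountMap(capacity: Int = 64) {
    require(capacity > 0 && (capacity & (capacity - 1)) == 0,
      "capacity must be a power of two")
    private val mask = capacity - 1
    // Keys and values interleaved for locality: [k0, v0, k1, v1, ...]
    private val data = new Array[Long](2 * capacity)
    private val used = new Array[Boolean](capacity)
    private var size = 0

    // Open addressing with quadratic (triangular-number) probing:
    // steps of 1, 2, 3, ... visit every slot when capacity is a power of two.
    private def slotFor(key: Long): Int = {
      var pos = java.lang.Long.hashCode(key) & mask
      var step = 1
      while (used(pos) && data(2 * pos) != key) {
        pos = (pos + step) & mask
        step += 1
      }
      pos
    }

    // Keys are only ever added; values are updated in place.
    def add(key: Long, delta: Long): Unit = {
      val pos = slotFor(key)
      if (!used(pos)) {
        require(size < capacity - 1, "map is full (no resizing in this sketch)")
        used(pos) = true
        data(2 * pos) = key
        size += 1
      }
      data(2 * pos + 1) += delta
    }

    def get(key: Long): Option[Long] = {
      val pos = slotFor(key)
      if (used(pos)) Some(data(2 * pos + 1)) else None
    }
  }
}
```

Never removing keys means no tombstones are needed, so a probe can stop at the first empty slot.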
Daytona GraySort Challenge Goal Success
1.1 GB/s per node network I/O (Reducers)
  Theoretical max = 1.25 GB/s for 10 Gbps Ethernet
3 GB/s per node disk I/O (Mappers)
Aggregate Cluster Network I/O!
  220 GB/s / 206 nodes ≈ 1.1 GB/s per node
Shuffle Performance Tuning Tips
Hash Shuffle Manager (Deprecated)
  spark.shuffle.consolidateFiles (Mapper)
  o.a.s.shuffle.FileShuffleBlockResolver
Intermediate Files
  Increase spark.shuffle.file.buffer (Reducer)
  Increase spark.reducer.maxSizeInFlight if memory allows
Use Smaller Number of Larger Executors
  Minimizes intermediate files and overall shuffle
  More opportunity for PROCESS_LOCAL
SQL: BroadcastHashJoin vs. ShuffledHashJoin
  spark.sql.autoBroadcastJoinThreshold
  Use DataFrame.explain(true) or EXPLAIN to verify
Many Threads (1 per CPU)
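One way to keep these settings together is to hold them as data and render them as `--conf key=value` arguments for spark-submit. The property names below are the ones named on the slides; the values are illustrative assumptions only and should be tuned for the actual cluster. `ShuffleTuningSketch` and `toSubmitArgs` are hypothetical names.

```scala
object ShuffleTuningSketch {
  // Property names come from the slides; values are examples, not recommendations.
  val settings: Seq[(String, String)] = Seq(
    "spark.shuffle.io.preferDirectBuffers"   -> "true",
    "spark.shuffle.io.numConnectionsPerPeer" -> "8",
    "spark.shuffle.file.buffer"              -> "64k",
    "spark.reducer.maxSizeInFlight"          -> "96m",
    "spark.sql.autoBroadcastJoinThreshold"   -> "10485760"
  )

  // Render as command-line arguments: --conf key=value
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

Keeping the list in one place makes it easy to change one setting at a time and re-measure.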
Project Tungsten
Data Structs & Algos Operate Directly on Byte Arrays
Maximize CPU Cache Locality, Minimize GC
Utilize Dynamic Code Generation
SPARK-7076 (Spark 1.4)
Quick Review of Project Tungsten Jiras
SPARK-7076 (Spark 1.4)
Why is CPU the Bottleneck?
CPU is used for serialization, hashing, compression!
Network and Disk I/O bandwidth are relatively high
GraySort optimizations improved network & shuffle
Partitioning, pruning, and predicate pushdowns
Binary, compressed, columnar file formats (Parquet)
Yet Another Spark Shuffle Manager!
spark.shuffle.manager =
hash (Deprecated)
  < 10,000 reducers
  Output partition file hashes the key of (K,V) pair
  Mapper creates an output file per partition
  Leads to M*P output files for all partitions
sort (GraySort Challenge)
  > 10,000 reducers
  Default from Spark 1.2-1.5
  Mapper creates single output file for all partitions
  Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
  Uses custom data structures and algorithms for sort-shuffle workload
  Wins Daytona GraySort Challenge
tungsten-sort (Project Tungsten)
  Default since 1.5
  Modification of existing sort-based shuffle
  Uses sun.misc.Unsafe for self-managed memory and garbage collection
  Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
  Perform joins, sorts, and other operators on both serialized and compressed byte buffers
CPU & Memory Optimizations
Custom Managed Memory
  Reduces GC overhead
  Both on and off heap
  Exact size calculations
Direct Binary Processing
  Operate on serialized/compressed arrays
  Kryo can reorder/sort serialized records
  LZF can reorder/sort compressed records
More CPU Cache-aware Data Structs & Algorithms
  o.a.s.sql.catalyst.expressions.UnsafeRow
  o.a.s.unsafe.map.BytesToBytesMap
Code Generation (default in 1.5)
  Generate source code from overall query plan
  100+ UDFs converted to use code generation
UnsafeFixedWidthAggregationMap, TungstenAggregationIterator
CodeGenerator, GenerateUnsafeRowJoiner
UnsafeSortDataFormat, UnsafeShuffleSortDataFormat
PackedRecordPointer, UnsafeRow
UnsafeInMemorySorter, UnsafeExternalSorter, UnsafeShuffleWriter
Mostly Same Join Code, UnsafeProjection
UnsafeShuffleManager, UnsafeShuffleInMemorySorter, UnsafeShuffleExternalSorter
Details in SPARK-7075
sun.misc.Unsafe
Info
  addressSize(), pageSize()
Objects
  allocateInstance(), objectFieldOffset()
Classes
  staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized()
Synchronization
  monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt()
Arrays
  arrayBaseOffset(), arrayIndexScale()
Memory
  allocateMemory(), copyMemory(), freeMemory()
  getAddress() – not guaranteed after GC
  getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort()
  getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble()
  getObjectVolatile()/putObjectVolatile()
Used by Tungsten
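A minimal sketch of the Memory group in use: allocate off-heap memory, write and read longs at raw addresses, then free it. It assumes a JDK where `sun.misc.Unsafe` is still reachable by reflection (it is an unsupported internal API, and recent JDKs warn about or restrict its memory-access methods). `UnsafeSketch` and `offHeapSum` are hypothetical names, and this is not how Spark wraps Unsafe.

```scala
object UnsafeSketch {
  // Obtain the singleton through reflection; Unsafe.getUnsafe() rejects application code.
  private val unsafe: sun.misc.Unsafe = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Copy the values into off-heap memory, read them back, and sum them.
  def offHeapSum(values: Array[Long]): Long = {
    val base = unsafe.allocateMemory(values.length * 8L)
    try {
      var i = 0
      while (i < values.length) {
        unsafe.putLong(base + i * 8L, values(i))
        i += 1
      }
      var sum = 0L
      i = 0
      while (i < values.length) {
        sum += unsafe.getLong(base + i * 8L)
        i += 1
      }
      sum
    } finally {
      unsafe.freeMemory(base) // off-heap memory is invisible to the GC
    }
  }
}
```

Nothing here is tracked by the garbage collector, which is the point: the caller owns allocation, layout, and release.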
Spark + sun.misc.Unsafe
org.apache.spark.sql.execution:
  aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD, Window
org.apache.spark:
  unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor
org.apache.spark.sql.catalyst.expressions:
  regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions
Over 200 source files affected!!
Traditional Java Object Row Layout
4-byte String
Multi-field Object
Custom Data Structures for Workload
UnsafeRow (Dense Binary Row)
  Dense, 8-bytes per field (word-aligned)
TaskMemoryManager (Virtual Memory Address)
  OS-Style Memory Paging
BytesToBytesMap (Dense Binary HashMap)
  AlphaSort-Style (Key + Pointer)
UnsafeRow Layout Example
Pre-Tungsten
Tungsten
Custom Memory Management
o.a.s.memory.TaskMemoryManager & MemoryConsumer
  Memory management: virtual memory allocation, paging
  Off-heap: direct 64-bit address
  On-heap: 13-bit page num + 27-bit page offset
o.a.s.shuffle.sort.PackedRecordPointer
  64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset))
o.a.s.unsafe.types.UTF8String
  Primitive Array[Byte]
2^13 pages * 2^27 page size = 1 TB RAM per Task
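The 64-bit layout described above (24-bit partition, 13-bit page number, 27-bit page offset) can be expressed with a few shifts and masks. This is a sketch of the layout as stated on the slide, not Spark's PackedRecordPointer source; `PackedPointerSketch` and its members are hypothetical names.

```scala
object PackedPointerSketch {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27

  // [ 24-bit partition | 13-bit page number | 27-bit offset in page ]
  def pack(partition: Int, page: Int, offset: Int): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits))
    require(page >= 0 && page < (1 << PageBits))
    require(offset >= 0 && offset < (1 << OffsetBits))
    (partition.toLong << (PageBits + OffsetBits)) |
      (page.toLong << OffsetBits) |
      offset.toLong
  }

  def partition(p: Long): Int = (p >>> (PageBits + OffsetBits)).toInt
  def page(p: Long): Int = ((p >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offset(p: Long): Int = (p & ((1L << OffsetBits) - 1)).toInt

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes addressable per task.
  val addressableBytes: Long = 1L << (PageBits + OffsetBits)
}
```

Putting the partition in the highest bits means a sorter can order records by partition while moving only 8-byte words.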
UnsafeFixedWidthAggregationMap
Aggregations
o.a.s.sql.execution.UnsafeFixedWidthAggregationMap
  Uses BytesToBytesMap
  In-place updates of serialized data
  No object creation on hot-path
  Improved external agg support
  No OOM’s for large, single key aggs
o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
  Combine 2 UnsafeRows into 1
o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator
  Operates directly on serialized, binary UnsafeRow
  2 Steps: hash-based agg (grouping), then sort-based agg
  Supports spilling and external merge sorting
Equality
Bitwise comparison on UnsafeRow
No need to calculate equals(), hashCode()
Row 1 Equals! Row 2
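The idea of comparing rows as raw bytes can be shown with plain byte arrays. This is only an analogy for UnsafeRow, under the assumption that each field is encoded as one fixed-width 8-byte word; `BinaryRowEquality`, `encode`, and `sameRow` are hypothetical names.

```scala
import java.nio.ByteBuffer

object BinaryRowEquality {
  // Hypothetical row encoding: every field is one 8-byte word.
  def encode(fields: Long*): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Two rows are equal exactly when their bytes are equal;
  // no per-field equals() calls and no object graph to walk.
  def sameRow(a: Array[Byte], b: Array[Byte]): Boolean =
    java.util.Arrays.equals(a, b)
}
```

This works because the encoding is canonical: the same field values always produce the same bytes.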
Joins
Surprisingly, not many code changes
o.a.s.sql.catalyst.expressions.UnsafeProjection
  Converts InternalRow to UnsafeRow
Sorting
o.a.s.util.collection.unsafe.sort.
  UnsafeSortDataFormat
  UnsafeInMemorySorter
  UnsafeExternalSorter
  RecordPointerAndKeyPrefix
  UnsafeShuffleWriter
AlphaSort-Style Cache Friendly
  Ptr + Key-Prefix: 2x CPU Cache-line Friendly!
Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.
Supports merging compressed records if compression CODEC supports it (LZF)
Spilling
Efficient Spilling
  Exact data size is known
  No need to maintain heuristics & approximations
  Controls amount of spilling
Spill merge on compressed, binary records!
  If compression CODEC supports it
Exact Peak Memory for Spark Jobs
  UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()
Code Generation
Problem
  Boxing causes excessive object creation
  Expensive expression tree evals per row
  JVM can’t inline polymorphic impls
Solution
  Codegen by-passes virtual function calls
  Defer source code generation to each operator, UDF, UDAF
  Use Scala quasiquote macros for Scala AST source code gen
  Rewrite and optimize code for overall plan, 8-byte align, etc
  Use Janino to compile generated source code into bytecode
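The contrast between interpreting an expression tree per row and generating flat source code for it can be sketched with a toy expression language. This is not Catalyst: `CodegenSketch`, `Expr`, `eval`, `genCode`, and `genMethod` are hypothetical, the tree only supports Long columns, literals, addition, and multiplication, and the final compile step (which the slide attributes to Janino) is left out.

```scala
object CodegenSketch {
  sealed trait Expr
  final case class Col(index: Int) extends Expr
  final case class Lit(value: Long) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr
  final case class Mul(left: Expr, right: Expr) extends Expr

  // Interpreted path: the whole tree is walked again for every row.
  def eval(e: Expr, row: Array[Long]): Long = e match {
    case Col(i)    => row(i)
    case Lit(v)    => v
    case Add(l, r) => eval(l, row) + eval(r, row)
    case Mul(l, r) => eval(l, row) * eval(r, row)
  }

  // Codegen path: walk the tree once and emit one flat Java expression
  // over primitives, with no dispatch left at run time.
  def genCode(e: Expr): String = e match {
    case Col(i)    => s"row[$i]"
    case Lit(v)    => s"${v}L"
    case Add(l, r) => s"(${genCode(l)} + ${genCode(r)})"
    case Mul(l, r) => s"(${genCode(l)} * ${genCode(r)})"
  }

  // Wrap the expression in a method body that a runtime compiler could compile.
  def genMethod(e: Expr): String =
    s"public long eval(long[] row) { return ${genCode(e)}; }"
}
```

The generated method does the same arithmetic as `eval`, but as straight-line code over primitives that the JIT can inline.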
Spark SQL UDF Code Generation 100+ UDFs now generating code
More to come in Spark 1.6+
Details in SPARK-8159, SPARK-9571
Each Implements Expression.genCode() !
Creating a Custom UDF with Codegen
Study existing implementations
  https://github.com/apache/spark/pull/7214/files
Extend base trait
  o.a.s.sql.catalyst.expressions.Expression.genCode()
Register the function
  o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
Augment DataFrame with new UDF (Scala implicits)
  o.a.s.sql.functions.scala
Don’t forget about Python!
  python.pyspark.sql.functions.py
Who Benefits from Project Tungsten?
Users of DataFrames
All Spark SQL Queries
  Catalyst
All RDDs
  Serialization, Compression, and Aggregations
Project Tungsten Performance Results
Query Time
Garbage Collection
OOM’d on Large Dataset!
Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
Spark SQL: Query Optimizing & Catalyst
Explore DataFrames/Datasets/DataSources, Catalyst
Review Partitions, Pruning, Pushdowns, File Formats
Create a Custom DataSource API Implementation
DataFrames
Inspired by R and Pandas DataFrames
Schema-aware
Cross language support
  SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R
  Generates JVM bytecode vs serializing to Python
DataFrame is container for logical plan
  Lazy transformations represented as tree
  Only logical plan is sent from Python -> JVM
  Only results returned from JVM -> Python
UDF and UDAF Support
  Custom UDF support using registerFunction()
  Experimental UDAF support (e.g. HyperLogLog)
Supports existing Hive metastore if available
  Small, file-based Hive metastore created if not available
*DataFrame.rdd returns underlying RDD if needed
Use DataFrames instead of RDDs!!
Spark and Hive
Early days, Shark was “Hive on Spark”
Hive Optimizer slowly replaced with Catalyst
Always use HiveContext, even if not using Hive!
  If no Hive, a small Hive metastore file is created
Spark 1.5+ supports all Hive versions 0.12+
  Separate classloaders for isolation
  Breaks dependency between Spark internal Hive version and User’s external Hive version
Catalyst Optimizer
Optimize DataFrame Transformation Tree
  Subquery elimination: use aliases to collapse subqueries
  Constant folding: replace expression with constant
  Simplify filters: remove unnecessary filters
  Predicate/filter pushdowns: avoid unnecessary data load
  Projection collapsing: avoid unnecessary projections
Create Custom Rules
  Rules are Scala Case Classes
    val newPlan = MyFilterRule(analyzedPlan)
  Implements o.a.s.sql.catalyst.rules.Rule
  Apply to any plan stage
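To make "constant folding" and "rules applied to a plan" concrete, here is a small plain-Scala sketch of a rewrite rule over a toy tree. It assumes nothing from Spark: `Node`, `Attr`, `Const`, `Plus`, `Gt`, and `ConstantFolding` are hypothetical names, whereas a real rule would extend o.a.s.sql.catalyst.rules.Rule and transform a LogicalPlan.

```scala
// Hypothetical mini expression tree; not Catalyst's classes.
sealed trait Node
final case class Attr(name: String) extends Node
final case class Const(value: Int) extends Node
final case class Plus(left: Node, right: Node) extends Node
final case class Gt(left: Node, right: Node) extends Node

object ConstantFolding {
  // Bottom-up rewrite: fold the children first, then collapse
  // additions whose operands are both constants.
  def apply(n: Node): Node = n match {
    case Plus(l, r) =>
      (apply(l), apply(r)) match {
        case (Const(a), Const(b)) => Const(a + b)
        case (fl, fr)             => Plus(fl, fr)
      }
    case Gt(l, r) => Gt(apply(l), apply(r))
    case leaf     => leaf
  }
}
```

Applied to `Gt(Attr("age"), Plus(Const(1), Const(2)))`, the rule returns `Gt(Attr("age"), Const(3))`, so the addition is evaluated once at planning time instead of once per row.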
DataSources API
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class): Provides schema of data
  TableScan (impl): Read all data from source
  PrunedFilteredScan (impl): Column pruning & predicate pushdowns
  InsertableRelation (impl): Insert/overwrite data based on SaveMode
  RelationProvider (trait/interface): Handle options, BaseRelation factory
Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface): Common commands like EXPLAIN
  ExplainCommand (impl: case class)
  CacheTableCommand (impl: case class)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class): Handles all predicates/filters supported by this source
  EqualTo (impl)
  GreaterThan (impl)
  StringStartsWith (impl)
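What a PrunedFilteredScan is asked to do can be sketched in plain Scala. The types below are hypothetical stand-ins that only borrow the names of the filters listed above; they are not the o.a.s.sql.sources classes, and `PrunedFilteredScanSketch`, `Row`, and this `buildScan` signature are assumptions for illustration. The point is that the source receives the required columns and the filters, so it can avoid returning data the query does not need.

```scala
// Hypothetical stand-ins mirroring the shape of the filters above;
// these are NOT the org.apache.spark.sql.sources classes.
sealed trait SourceFilter
final case class EqualTo(attribute: String, value: Any) extends SourceFilter
final case class GreaterThan(attribute: String, value: Int) extends SourceFilter
final case class StringStartsWith(attribute: String, prefix: String) extends SourceFilter

object PrunedFilteredScanSketch {
  type Row = Map[String, Any]

  private def matches(row: Row, f: SourceFilter): Boolean = f match {
    case EqualTo(a, v) =>
      row.get(a).contains(v)
    case GreaterThan(a, v) =>
      row.get(a).exists { case i: Int => i > v; case _ => false }
    case StringStartsWith(a, p) =>
      row.get(a).exists { case s: String => s.startsWith(p); case _ => false }
  }

  // Apply the filters at the "source", then keep only the requested columns.
  def buildScan(
      data: Seq[Row],
      requiredColumns: Seq[String],
      filters: Seq[SourceFilter]): Seq[Seq[Any]] =
    data
      .filter(row => filters.forall(matches(row, _)))
      .map(row => requiredColumns.map(row))
}
```

A real source would translate the filters into its own query language (a SQL WHERE clause, a CQL predicate, and so on) rather than filtering in memory as this sketch does.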
Native Spark SQL DataSources
Query Plan Debugging
gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)
DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan
Query Plan Visualization & Metrics
Effectiveness of Filter
CPU Cache Friendly Binary Format
Cost-based Join Optimization
  Similar to MapReduce Map-side Join
Peak Memory for Joins and Aggs
JSON Data Source
DataFrame
  val ratingsDF = sqlContext.read.format("json")
    .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
  -- or --
  // json() convenience method
  val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")
SQL Code
  CREATE TABLE genders USING json
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")
JDBC Data Source
Add Driver to Spark JVM System Classpath
  $ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
  val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
    "url" -> "jdbc:postgresql://hostname:port/database",
    "dbtable" -> "schema.tablename")
  val df = sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
  CREATE TABLE genders USING jdbc OPTIONS (url, dbtable, driver, …)
Parquet Data Source
Configuration
  spark.sql.parquet.filterPushdown=true
  spark.sql.parquet.mergeSchema=true
  spark.sql.parquet.cacheMetadata=true
  spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
  val gendersDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")
  gendersDF.write.format("parquet").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
  CREATE TABLE genders USING parquet
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")
ORC Data Source
Configuration
  spark.sql.orc.filterPushdown=true
DataFrames
  val gendersDF = sqlContext.read.format("orc")
    .load("file:/root/pipeline/datasets/dating/genders")
  gendersDF.write.format("orc").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders")
SQL
  CREATE TABLE genders USING orc
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders")
Third-Party Spark SQL DataSources
spark-packages.org
CSV DataSource (Databricks)
Github
  https://github.com/databricks/spark-csv
Maven
  com.databricks:spark-csv_2.10:1.2.0
Code
  val gendersCsvDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
    .toDF("id", "gender")
toDF() is required if CSV does not contain header
Elasticsearch DataSource (Elastic.co)
Github
  https://github.com/elastic/elasticsearch-hadoop
Maven
  org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
  val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>")
  df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
    .options(esConfig).save("<index>/<document-type>")
Elasticsearch Tips
Change id field to not_analyzed to avoid indexing
Use term filter to build and cache the query
Perform multiple aggregations in a single request
Adapt scoring function to current trends at query time
AWS Redshift Data Source (Databricks)
Github
  https://github.com/databricks/spark-redshift
Maven
  com.databricks:spark-redshift:0.5.0
Code
  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
    .option("query", "select x, count(*) from my_table group by x")
    .option("tempdir", "s3n://tmpdir")
    .load(...)
UNLOAD and copy to tmp bucket in S3 enables parallel reads
DB2 and BigSQL DataSources (IBM) Coming Soon!
Cassandra DataSource (DataStax)
Github
  https://github.com/datastax/spark-cassandra-connector
Maven
  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
  ratingsDF.write
    .format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace"->"<keyspace>", "table"->"<table>")).save(…)
Cassandra Pushdown Support
spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala
Pushdown Predicate Rules
1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate
2. Only push down primary key column predicates with = or IN predicate.
3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.
6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.
New Cassandra DataSource
Bypass CQL, which is optimized for transactional data
Instead, do bulk reads/writes directly on SSTables
Similar to the 5-year-old Netflix open source project Aegisthus
Promotes Cassandra to first-class Analytics Option
Potentially only part of DataStax Enterprise?!
Please mail a nasty letter to your local DataStax office
Rumor of REST DataSource (Databricks) Coming Soon?
Ask Michael Armbrust Spark SQL Lead @ Databricks
Custom DataSource (Me and You!) Coming Right Now!
DEMO ALERT!!
Create a Custom DataSource
Study Existing Native & Third-Party Data Sources
Native
  Spark JDBC (o.a.s.sql.execution.datasources.jdbc)
    class JDBCRelation extends BaseRelation
      with PrunedFilteredScan
      with InsertableRelation
Third-Party
  DataStax Cassandra (o.a.s.sql.cassandra)
    class CassandraSourceRelation extends BaseRelation
      with PrunedFilteredScan
      with InsertableRelation
Demo! Create a Custom DataSource
Contribute a Custom Data Source
spark-packages.org
  Managed by
  Contains links to external github projects
  Ratings and comments
  Declare Spark version support for each package
Examples
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift
Parquet Columnar File Format
Based on Google Dremel
Collaboration with Twitter and Cloudera
Self-describing, evolving schema
Fast columnar aggregation
Supports filter pushdowns
Columnar storage format
Excellent compression
Types of Compression
Run Length Encoding: Repeated data
Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
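The three encodings above are easy to try on small collections. The following plain-Scala sketch is illustrative only (it is not how Parquet implements them, and `EncodingSketch` with its method names is a hypothetical helper), but it shows why each encoding suits the kind of data named on the slide.

```scala
object EncodingSketch {
  // Run-length encoding: collapse runs of repeated values into (value, count).
  def runLength[A](xs: Seq[A]): List[(A, Int)] =
    xs.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: rest, x) if v == x => (v, n + 1) :: rest
      case (acc, x)                      => (x, 1) :: acc
    }.reverse

  // Dictionary encoding: store each distinct value once, then small integer codes.
  def dictionary[A](xs: Seq[A]): (Vector[A], Seq[Int]) = {
    val dict = xs.distinct.toVector
    val code = dict.zipWithIndex.toMap
    (dict, xs.map(code))
  }

  // Delta encoding: keep the first value, then differences between neighbours.
  def delta(sorted: Seq[Long]): Seq[Long] =
    if (sorted.isEmpty) Seq.empty
    else sorted.head +: sorted.zip(sorted.tail).map { case (a, b) => b - a }
}
```

For example, a sorted id column such as 100, 101, 103, 106 becomes 100, 1, 2, 3 under delta encoding, and the small differences need fewer bits than the original values.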
Demo! Demonstrate File Formats, Partition Schemes, and Query Plans
Hive JDBC ODBC ThriftServer
Allow BI Tools to Query and Process Spark Data
Register Permanent Table
  CREATE TABLE ratings(fromuserid INT, touserid INT, rating INT)
  USING org.apache.spark.sql.json
  OPTIONS (path "datasets/dating/ratings.json.bz2")
Register Temp Table
  ratingsDF.registerTempTable("ratings_temp")
Configuration
  spark.sql.thriftServer.incrementalCollect=true
  spark.driver.maxResultSize > 10gb (default)
Demo! Query and Process Spark Data from BI Tools
Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
Spark Streaming: Scaling & Approximations
Discuss Delivery Guarantees, Parallelism, and Stability
Compare Receiver and Receiver-less Impls
Demonstrate Stream Approximations
Non-Parallel Receiver Implementation
Receiver Implementation (Kinesis)
KinesisRDD partitions store relevant offsets
Single receiver required to see all data/offsets
Kinesis offsets not deterministic like Kafka
Partitions rebuild from Kinesis using offsets
No Write Ahead Log (WAL) needed
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
Parallel Receiver-less Implementation (Kafka)
Receiver-less Implementation (Kafka)
KafkaRDD partitions store relevant offsets
Each partition acts as a Receiver
Tasks/Executors pull from Kafka in parallel
Partitions rebuild from Kafka using offsets
No Write Ahead Log (WAL) needed
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
Maintain Stability of Stream Processing
Rate Limiting
  Since Spark 1.2
  Fixed limit on number of messages per second
  Potential to drop messages on the floor
Back Pressure
  Since Spark 1.5 (Typesafe contribution)
  More dynamic than rate limiting
  Push back on reliable, buffered source (Kafka, Kinesis)
  Fundamentals of Control Theory and Observability
Streaming Approximations
HyperLogLog and CountMin Sketch
HyperLogLog (HLL) Approx Distinct Count
Approximate count distinct
Twitter’s Algebird
Better than HashSet
Low, fixed memory
  Only 1.5K, 2% error, 10^9 counts (tunable)
  Redis HLL: 12K per key, 0.81%, 2^64 counts
Spark’s countApproxDistinctByKey()
Streaming example in Spark codebase
http://research.neustar.biz/
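To show why HLL memory stays low and fixed, here is a simplified plain-Scala HyperLogLog. It is a sketch under stated assumptions, not Algebird's or Spark's implementation: the class name, the 32-bit MurmurHash3 hash, and the parameter `p` (2^p one-byte registers) are choices made for this illustration, and only the standard small-range correction is applied.

```scala
import scala.util.hashing.MurmurHash3

// Simplified HyperLogLog over a 32-bit hash; illustrative only.
final class HyperLogLog(p: Int) {
  require(p >= 4 && p <= 16, "p must be between 4 and 16")
  private val m = 1 << p                       // number of registers
  private val registers = new Array[Byte](m)   // fixed memory: m bytes
  private val alpha = m match {
    case 16 => 0.673
    case 32 => 0.697
    case 64 => 0.709
    case _  => 0.7213 / (1 + 1.079 / m)
  }

  def add(item: String): Unit = {
    val h = MurmurHash3.stringHash(item)
    val idx = h >>> (32 - p)                   // top p bits choose the register
    val rest = h << p                          // remaining bits, left-aligned
    // Rank = position of the first 1 bit, capped at the bits available.
    val rank = math.min(Integer.numberOfLeadingZeros(rest) + 1, 32 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  def estimate: Double = {
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting.
    if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
  }
}
```

With `p = 10` the sketch holds 1024 registers (about 1 KB) regardless of how many items are added, and the expected relative error is roughly 1.04 / sqrt(1024), about 3%. Adding the same item again never changes the estimate, which is what makes it suitable for distinct counts.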
CountMin Sketch (CMS) Approx Count
Approximate count
Twitter’s Algebird
Better than HashMap
Low, fixed memory
Known error bounds
Large num counters
Streaming example in Spark codebase
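A Count-Min Sketch fits in a few lines of plain Scala. This is an illustrative sketch, not Algebird's CMS API: the class name, the `depth` and `width` parameters, and the use of seeded MurmurHash3 as the per-row hash family are assumptions made for this example.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative Count-Min Sketch; Algebird's CMS has a richer API.
final class CountMinSketch(depth: Int, width: Int) {
  private val table = Array.ofDim[Long](depth, width)

  // One hash function per row, derived by seeding MurmurHash3 with the row index.
  private def bucket(item: String, row: Int): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), width)

  def add(item: String, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(bucket(item, row)) += count

  // Collisions only ever inflate counters, so the row minimum never undercounts.
  def estimate(item: String): Long =
    (0 until depth).map(row => table(row)(bucket(item, row))).min
}
```

Memory is fixed at `depth * width` counters no matter how many distinct keys arrive, unlike a HashMap that grows with the key count. The trade-off is the one-sided error: estimates can be too high, never too low.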
Demo! Using HLL and CMS for Streaming Count Approximations
Monte Carlo Simulations
From Manhattan Project (Atomic bomb)
  Simulate movement of neutrons
Law of Large Numbers (LLN)
  Average of results of many trials
  Converge on expected value
SparkPi example in Spark codebase
  1 Argument: # of trials
  Pi ~= # red dots / # total dots * 4
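The Pi formula above can be tried on a single machine. This plain-Scala sketch is not the SparkPi source; `MonteCarloPi`, `estimate`, and the explicit `seed` parameter are assumptions made so the run is repeatable. Darts are thrown at the unit square, and the "red dots" are the ones that land inside the quarter circle of radius 1.

```scala
import scala.util.Random

object MonteCarloPi {
  // Throw `trials` random darts at the unit square and count those that
  // land inside the quarter circle of radius 1 (the "red dots").
  def estimate(trials: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to trials).count { _ =>
      val x = rng.nextDouble()
      val y = rng.nextDouble()
      x * x + y * y <= 1.0
    }
    4.0 * inside / trials
  }
}
```

By the Law of Large Numbers the estimate tightens as `trials` grows. In Spark the same count would be split across partitions and summed with a reduce.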
Demo! Using a Monte Carlo Simulation to Estimate Pi
Streaming Best Practices
Get Data Out of Streaming ASAP
  Processing interval may exceed batch interval
  Leads to unstable streaming system
Please Don’t…
  Use updateStateByKey() like an in-memory DB
  Put streaming jobs on the request/response hot path
Use Separate Jobs for Different Batch Intervals
  Small Batch Interval: Store raw data (Redis, Cassandra, etc)
  Medium Batch Interval: Transform, join, process data
  High Batch Interval: Model training
Gotchas
  Tune streamingContext.remember()
Use Approximations!!
Presentation Outline
Spark Core: Tuning & Mechanical Sympathy
Spark SQL: Query Optimizing & Catalyst
Spark Streaming: Scaling & Approximations
Spark ML: Featurizing & Recommendations
Spark ML: Featurizing & Recommendations
Understand Similarity and Dimension Reduction
Demonstrate Sampling and Bucketing
Generate Recommendations
Live, Interactive Demo! sparkafterdark.com
Audience Participation Needed!!
Audience Instructions
  Navigate to sparkafterdark.com
  Click 3 actresses and 3 actors
  Wait for us to analyze together!
  Note: This is totally anonymous!!
Project Links
  https://github.com/fluxcapacitor/pipeline
  https://hub.docker.com/r/fluxcapacitor
Similarity
Types of Similarity
Euclidean
  Linear-based measure
  Suffers from magnitude bias
Cosine
  Angle-based measure
  Adjusts for magnitude bias
Jaccard
  Set intersection / union
  Suffers from popularity bias
Log Likelihood
  Netflix “Shawshank” Problem
  Adjusts for popularity bias
[Example like matrix: rows Kimberly, Leslie, Meredith, Lisa, Holden; columns Ali, Matei, Reynold, Patrick, Andy; cell positions not recoverable]
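Three of the measures above are short enough to write out. This plain-Scala sketch uses hypothetical names (`SimilaritySketch` and its methods) and leaves out log likelihood, which needs co-occurrence counts across the whole dataset.

```scala
object SimilaritySketch {
  // Euclidean distance: straight-line distance, sensitive to vector magnitude.
  def euclidean(a: Seq[Double], b: Seq[Double]): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine similarity: compares direction only, so scaling a vector has no effect.
  def cosine(a: Seq[Double], b: Seq[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Jaccard similarity: size of the intersection over size of the union.
  def jaccard[A](a: Set[A], b: Set[A]): Double =
    if (a.isEmpty && b.isEmpty) 0.0
    else (a intersect b).size.toDouble / (a union b).size
}
```

The magnitude bias shows up directly: the vectors (1, 1) and (10, 10) are far apart by Euclidean distance but have a cosine similarity of 1, because they point in the same direction.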
All-Pairs Similarity Comparison
Compare everything to everything
  aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols
Minimize shuffle through approximations!
  Reduce m (rows)
    Sampling and bucketing
  Reduce n (cols)
    Remove most frequent value (e.g. 0)
    Principal Component Analysis
Dimension reduction!!
Dimension Reduction
Sampling and Bucketing
Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MR”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)
Twitter: 40% efficiency gain vs. Cosine Similarity
Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
  Use similarity hash algorithm
  Requires pre-processing of data
Parallel compare bucket contents
O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
github.com/mrsqueeze/spark-hash
Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
Note: Choose most frequent value (may not be 0)
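The (index,value) representation can be sketched in plain Scala. `SparseSketch`, `encode`, and `decode` are hypothetical names for this illustration; MLlib's own sparse vectors always treat 0 as the implicit value, whereas this sketch picks whichever value is most frequent, as the note on the slide suggests. It assumes a non-empty row.

```scala
object SparseSketch {
  // Drop the most frequent value and keep the rest as (index, value) pairs.
  // Returns (mostFrequentValue, originalLength, pairs). Assumes a non-empty row.
  def encode(row: Seq[Int]): (Int, Int, Seq[(Int, Int)]) = {
    val mostFrequent = row.groupBy(identity).maxBy(_._2.size)._1
    val pairs = row.zipWithIndex.collect {
      case (v, i) if v != mostFrequent => (i, v)
    }
    (mostFrequent, row.length, pairs)
  }

  // Rebuild the dense row by filling every missing index with the dropped value.
  def decode(mostFrequent: Int, length: Int, pairs: Seq[(Int, Int)]): Seq[Int] = {
    val byIndex = pairs.toMap
    (0 until length).map(i => byIndex.getOrElse(i, mostFrequent))
  }
}
```

A row with 8 entries and 2 non-default values is stored as 2 pairs, so pairwise comparisons only need to touch the entries that carry information.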
Recommendations
Summary Statistics and Top-K Historical Analysis
Collaborative Filtering and Clustering
Text Featurization and NLP
Types of Recommendations
Non-personalized
  No preference or behavior data for user, yet
  aka “Cold Start Problem”
Personalized
  User-Item Similarity
    Items that others with similar prefs have liked
  Item-Item Similarity
    Items similar to your previously-liked items
Recommendation Terminology
Feedback
  Explicit: like, rating
  Implicit: search, click, hover, view, scroll
Feature Engineering
  Dimension reduction, polynomial expansion
Hyper-parameter Tuning
  K-Folds Cross Validation, Grid Search
Pipelines/Workflows
  Chaining together Transformers and Estimators
Single Machine ML Algorithms
Stay Local, Distribute As Needed
Helps migration of existing single-node algos to Spark
Convert between Spark and Pandas DataFrames
New “pdspark” package: integration w/ scikit-learn, R
Non-Personalized Recommendations
Use Aggregate Data to Generate Recommendations
Top Users by Like Count
“I might like users who have the most-likes overall based on historical data.”
SparkSQL, DataFrames: Summary Stat, Aggs
Top Influencers by Like Graph
“I might like the most-influential users in overall like graph.”
GraphX: PageRank
Demo! Generate Non-Personalized Recommendations
Personalized Recommendations
Understand Similarity and Personalized Recommendations
Like Behavior of Similar Users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
Demo! Generate Personalized Recommendations using Collaborative Filtering & Matrix Factorization
Similar Text-based Profiles as Me
“Our profiles have similar keywords and named entities. We might like each other!”
MLlib: Word2Vec, TF/IDF, k-skip n-grams
Similar Profiles to Previous Likes
“Your profile text has similar keywords and named entities to other profiles of people I like. I might like you, too!”
MLlib: Word2Vec, TF/IDF, Doc Similarity
Relevant, High-Value Emails
“Your initial email references a lot of things in my profile. I might like you for making the effort!”
MLlib: Word2Vec, TF/IDF, Entity Recognition
[Image labels: Her Email; My Profile]
Demo! Feature Engineering for Text/NLP Use Cases
The Future of Recommendations
Eigenfaces: Facial Recognition
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
NLP Conversation Starter Bot!
“If your responses to my generic opening lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTrees, Sentiment Analysis
[Image labels: Positive; Negative]
Maintaining the Spark
Recommendations for Couples
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
[Diagram labels: similar plots -> <- similar actors]
Final Recommendation!
Get Off the Computer & Meet People! Thank you, Helsinki!! Chris Fregly @cfregly IBM Spark Technology Center San Francisco, CA, USA
Relevant Links advancedspark.com Signup for the book & global meetup! github.com/fluxcapacitor/pipeline Clone, contribute, and commit code! hub.docker.com/r/fluxcapacitor/pipeline/wiki Run all demos in your own environment with Docker!
More Relevant Links
http://meetup.com/Advanced-Apache-Spark-Meetup
http://advancedspark.com
http://github.com/fluxcapacitor/pipeline
http://hub.docker.com/r/fluxcapacitor/pipeline
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
http://stackoverflow.com/questions/763262/how-does-one-write-code-that-best-utilizes-the-cpu-cache-to-improve-performance
http://www.aristeia.com/TalkNotes/ACCU2011_CPUCaches.pdf
http://mishadoff.com/blog/java-magic-part-4-sun-dot-misc-dot-unsafe/
http://docs.scala-lang.org/overviews/quasiquotes/intro.html
http://lwn.net/Articles/252125/ (Memory Part 2: CPU Caches)
http://lwn.net/Articles/255364/ (Memory Part 5: What Programmers Can Do)
https://www.safaribooksonline.com/library/view/java-performance-the/9781449363512/ch04.html
http://web.eece.maine.edu/~vweaver/projects/perf_events/perf_event_open.html
http://www.brendangregg.com/perf.html
https://perf.wiki.kernel.org/index.php/Tutorial
http://techblog.netflix.com/2015/07/java-in-flames.html
http://techblog.netflix.com/2015/04/introducing-vector-netflixs-on-host.html
http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html#Java
What’s Next?
What’s Next?
Autoscaling Spark Workers
  Completely Docker-based
  Docker Compose and Docker Machine
Lots of Demos and Examples!
  Zeppelin & IPython/Jupyter notebooks
  Advanced streaming use cases
  Advanced ML, Graph, and NLP use cases
Performance Tuning and Profiling
  Work closely with Brendan Gregg & Netflix
  Surface & share more low-level details of Spark internals
Upcoming Meetups and Conferences
London Spark Meetup (Oct 12th)
Scotland Data Science Meetup (Oct 13th)
Dublin Spark Meetup (Oct 15th)
Barcelona Spark Meetup (Oct 20th)
Madrid Big Data Meetup (Oct 22nd)
Paris Spark Meetup (Oct 26th)
Amsterdam Spark Summit (Oct 27th)
Brussels Spark Meetup (Oct 30th)
Zurich Big Data Meetup (Nov 2nd)
Geneva Spark Meetup (Nov 5th)
San Francisco Datapalooza.io (Nov 10th)
San Francisco Advanced Spark (Nov 12th)
Oslo Big Data Hadoop Meetup (Nov 19th)
Helsinki Spark Meetup (Nov 20th)
Stockholm Spark Meetup (Nov 23rd)
Copenhagen Spark Meetup (Nov 25th)
Budapest Spark Meetup (Nov 26th)
Singapore Strata Conference (Dec 1st)
San Francisco Advanced Spark (Dec 8th)
Mountain View Advanced Spark (Dec 10th)
Toronto Spark Meetup (Dec 14th)
Austin Data Days Conference (Jan 2016)