What's new in Spark Streaming
Tathagata "TD" Das, Strata NY 2015
@tathadas
Who am I?
Project Management Committee (PMC) member of Spark
Started Spark Streaming in AMPLab, UC Berkeley
Current technical lead of Spark Streaming
Software engineer at Databricks
What is Databricks?
Founded by the creators of Spark; remains the largest contributor
Offers a hosted service:
• Spark on EC2
• Notebooks
• Plot visualizations
• Cluster management
• Scheduled jobs
Spark Streaming
Scalable, fault-tolerant stream processing system
[Diagram: data from Kafka, Flume, Kinesis, and HDFS/S3 flows into Spark Streaming; results go to file systems, databases, and dashboards]
High-level API: joins, windows, … often 5x less code
Fault-tolerant: exactly-once semantics, even for stateful ops
Integration: integrates with MLlib, SQL, DataFrames, GraphX
What can you use it for?
Real-time fraud detection in transactions
React to anomalies in sensors in real-time
Cat videos in tweets as soon as they go viral
Spark Streaming
Receivers receive data streams and chop them up into batches
Spark processes the batches and pushes out the results
[Diagram: data streams → receivers → batches → results]
Word Count with Kafka

// entry point of streaming functionality
val context = new StreamingContext(conf, Seconds(1))
// create DStream from Kafka data
val lines = KafkaUtils.createStream(context, ...)
// split lines into words
val words = lines.flatMap(_.split(" "))
// count the words
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
// print some counts on screen
wordCounts.print()
// start receiving and transforming the data
context.start()
// keep the application running while the stream is processed
context.awaitTermination()
Integrates with Spark Ecosystem
[Diagram: Spark Streaming, Spark SQL + DataFrames, MLlib, and GraphX all run on Spark Core]
Combine batch and streaming processing
Join data streams with static data sets:

// Create data set from Hadoop file
val dataset = sparkContext.hadoopFile("file")
// Join each batch in stream with the dataset
kafkaStream.transform { batchRDD =>
  batchRDD.join(dataset).filter( ... )
}
Combine machine learning with streaming
Learn models offline, apply them online:

// Learn model offline
val model = KMeans.train(dataset, ...)
// Apply model online on stream
kafkaStream.map { event =>
  model.predict(event.feature)
}
Combine SQL with streaming
Interactively query streaming data with SQL and DataFrames:

// Register each batch in stream as table
kafkaStream.foreachRDD { batchRDD =>
  batchRDD.toDF.registerTempTable("events")
}
// Interactively query table
sqlContext.sql("select * from events")
Spark Streaming Adoption
Spark Survey by Databricks
A survey of 1417 individuals from 842 organizations found a 56% increase in Spark Streaming users since 2014, making it the fastest-rising component in Spark.
https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html
Feedback from community
We have learned a lot from our rapidly growing user base. Most of the development in the last few releases has been driven by community demands.
What have we added recently?
Libraries
Streaming MLlib algorithms
val model = new StreamingKMeans()
  .setK(10)
  .setDecayFactor(1.0)
  .setRandomCenters(4, 0.0)

// Train on one DStream
model.trainOn(trainingDStream)

// Predict on another DStream
model.predictOnValues(
  testDStream.map { lp => (lp.label, lp.features) }
).print()
Continuous learning and prediction on streaming data
StreamingLinearRegression [Spark 1.1]
StreamingKMeans [Spark 1.2]
StreamingLogisticRegression [Spark 1.3]
https://databricks.com/blog/2015/01/28/introducing-streaming-k-means-in-spark-1-2.html
Python API Improvements
Added Python API for Streaming ML algos [Spark 1.5]
Added Python API for various data sources:
• Kafka [Spark 1.3 - 1.5]
• Flume, Kinesis, MQTT [Spark 1.5]
lines = KinesisUtils.createStream(
    streamingContext, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST, 2)
counts = lines.flatMap(lambda line: line.split(" "))
Ease of use
New Visualizations [Spark 1.4-1.5]
Stats over the last 1000 batches. For stability, the scheduling delay should be approximately 0 and the processing time should be less than the batch interval.
New Visualizations [Spark 1.4-1.5]
Details of individual batches: the Kafka offsets processed in each batch (which can help in debugging bad data) and the list of Spark jobs in each batch.
New Visualizations [Spark 1.4-1.5]
Full DAG of RDDs and stages generated by Spark Streaming
New Visualizations [Spark 1.4-1.5]
Memory usage of received data, which can be used to understand memory consumption across executors.
Infrastructure
Zero data loss
System stability
Zero data loss: Two cases

Replayable sources: sources that allow data to be replayed from any position (e.g. Kafka, Kinesis). Spark Streaming saves only the record identifiers and replays the data directly from the source.

Non-replayable sources: sources that do not support replay from an arbitrary position (e.g. Flume). Spark Streaming saves received data to a Write Ahead Log (WAL) and replays data from the WAL on failure.
Write Ahead Log (WAL) [Spark 1.3]
Save received data in a WAL in a fault-tolerant file system
[Diagram: the driver runs user code, launches receivers on executors, and runs tasks to process received data; each receiver buffers data in memory and writes it to a WAL in HDFS]
Write Ahead Log (WAL) [Spark 1.3]
Replay unprocessed data from WAL if driver fails and restarts
[Diagram: after the failed driver restarts, tasks read the unprocessed data back from the WAL in HDFS, and failed tasks are rerun on restarted executors]
Write Ahead Log (WAL) [Spark 1.3]
The WAL is enabled by setting the Spark configuration spark.streaming.receiver.writeAheadLog.enable to true. Use a reliable receiver, which ensures data is written to the WAL before acknowledging the source. A reliable receiver + WAL gives an at-least-once guarantee.
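As a minimal sketch of enabling this (not code from the deck; the app name and checkpoint path are assumptions), note that the WAL is written under the checkpoint directory, so a checkpoint directory must be set:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("wal-example") // assumed app name
  // Persist received blocks to the write ahead log before processing
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(1))
// The WAL lives under the checkpoint directory, so one must be configured
ssc.checkpoint("hdfs:///checkpoints/wal-example") // assumed path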
Kinesis [Spark 1.5]
Save the Kinesis sequence numbers instead of raw data
[Diagram: the receiver reads data using the KCL; sequence number ranges are sent to the driver and saved to HDFS]
Kinesis [Spark 1.5]
Recover unprocessed data directly from Kinesis using recovered sequence numbers
[Diagram: the restarted driver recovers the sequence number ranges from HDFS and reruns tasks with those ranges, re-reading the data directly from Kinesis using the AWS SDK]
Kinesis [Spark 1.5]
After any failure, records are either recovered from the saved sequence numbers or replayed via the KCL. There is no need to replicate received data in Spark Streaming. This provides an end-to-end at-least-once guarantee.
Kafka [1.3, graduated in 1.5]
A priori decide the offset ranges to consume in the next batch
[Diagram: every batch interval, the driver fetches the latest offset info for each Kafka partition; the offset ranges for the next batch are decided and saved to HDFS]
[Diagram: tasks run on executors to read each offset range in parallel, directly from the Kafka brokers]
Direct Kafka API [Spark 1.5]
Does not use receivers, so there is no need for Spark Streaming to replicate data. Can provide up to 10x higher throughput than the earlier receiver-based approach.
https://spark-summit.org/2015/events/towards-benchmarking-modern-distributed-streaming-systems/
Can provide exactly-once semantics, provided the output operation to external storage is idempotent or transactional.
Can run Spark batch jobs directly on Kafka
# RDD partitions = # Kafka partitions, so it is easy to reason about
https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
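As a hedged sketch of the direct API (per the Spark 1.3-1.5 Kafka integration; the broker address and topic name are assumptions, not from the deck):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

// No receiver: the stream is defined by Kafka offset ranges per batch
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  context, kafkaParams, Set("events")) // assumed topic

stream.foreachRDD { rdd =>
  // The exact offset ranges of each batch are available, which is what
  // makes idempotent or transactional (exactly-once) output possible
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic} ${r.partition}: ${r.fromOffset} -> ${r.untilOffset}"))
}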
System stability
Streaming applications may have to deal with variations in data rates and processing rates. For stability, a streaming application must receive data only as fast as it can process it. Since 1.1, Spark Streaming has allowed setting static limits on receiver ingestion rates to guard against spikes.
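For illustration, a minimal sketch of such a static limit (the value is an assumption, in records per second per receiver):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Static per-receiver ingestion cap, available since Spark 1.1
  .set("spark.streaming.receiver.maxRate", "10000") // assumed value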
Backpressure [Spark 1.5]
System automatically and dynamically adapts rate limits to ensure stability under any processing conditions If sinks slow down, then the system automatically pushes back on the source to slow down receiving
[Diagram: sources → receivers → Spark Streaming → sinks, with backpressure flowing upstream from the sinks back to the sources]
Backpressure [Spark 1.5]
The system uses batch processing times and scheduling delays to set rate limits. Well-known PID controller theory (used in industrial control systems) is used to calculate appropriate rate limits. Contributed by Typesafe.
Backpressure [Spark 1.5]
[Charts: the dynamic rate limit prevents receivers from receiving too fast, and the scheduling delay is kept in check by the rate limits]
Backpressure [Spark 1.5]
Experimental, so disabled by default in Spark 1.5. Enabled by setting the Spark configuration spark.streaming.backpressure.enabled to true. Will be enabled by default in future releases.
https://issues.apache.org/jira/browse/SPARK-7398
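A minimal sketch of enabling it (a single configuration flag; the surrounding SparkConf is for context):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Dynamic, feedback-driven rate limits; off by default in Spark 1.5
  .set("spark.streaming.backpressure.enabled", "true")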
What’s next?
API and Libraries
Support for operations on event time and out-of-order data: the most demanded feature from the community
Tighter integration between Streaming and SQL + DataFrames, which helps leverage Project Tungsten
Infrastructure
Add native support for Dynamic Allocation for streaming: dynamically scale cluster resources based on the processing load. This will work in collaboration with backpressure to scale up/down while maintaining stability.
Note: as of 1.5, the existing Dynamic Allocation is not optimized for streaming, but users can build their own scaling logic using the developer API (see the sketch below):
sparkContext.requestExecutors(), sparkContext.killExecutors()
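A hedged sketch of what such custom scaling logic could look like, combining a StreamingListener with the developer API above; the delay threshold and the one-executor increment are illustrative assumptions, not a tested recipe:

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class ScalingListener(ssc: StreamingContext) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // Scheduling delay (ms) of the batch that just finished
    val delayMs = batch.batchInfo.schedulingDelay.getOrElse(0L)
    if (delayMs > 10000) { // assumed threshold
      // Falling behind: ask the cluster manager for one more executor
      ssc.sparkContext.requestExecutors(1)
    }
  }
}

ssc.addStreamingListener(new ScalingListener(ssc)) // register on the context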
Infrastructure
Higher throughput and lower latency by leveraging Project Tungsten, specifically improved performance of stateful ops.
Fastest growing component in the Spark ecosystem
Significant improvements in fault-tolerance, stability, visualizations and Python API
More community requested features to come
@tathadas