
CHAPTER 10: SPARK STREAMING

Learning Spark by Holden Karau et al.


Overview: Spark Streaming

- A Simple Example
- Architecture and Abstraction
- Transformations
  - Stateless
  - Stateful
- Output Operations
- Input Sources
  - Core Sources
  - Additional Sources
  - Multiple Sources and Cluster Sizing
- 24/7 Operation
  - Checkpointing
  - Driver Fault Tolerance
  - Worker Fault Tolerance
  - Receiver Fault Tolerance
  - Processing Guarantees
- Streaming UI
- Performance Considerations
  - Batch and Window Sizes
  - Level of Parallelism
  - Garbage Collection and Memory Usage
- Conclusion


10.1 A Simple Example

Before we dive into the details of Spark Streaming, let’s consider a simple example. We will receive a stream of newline-delimited lines of text from a server running at port 7777, filter only the lines that contain the word error, and print them.
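A minimal sketch of this example in Scala (the application name is illustrative; the master is supplied externally, e.g., via spark-submit):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Create a StreamingContext with a 1-second batch interval
  val conf = new SparkConf().setAppName("ErrorFilter")
  val ssc = new StreamingContext(conf, Seconds(1))

  // Create a DStream from text received on port 7777, keep only
  // the lines that contain "error", and print them
  val lines = ssc.socketTextStream("localhost", 7777)
  val errorLines = lines.filter(_.contains("error"))
  errorLines.print()

  // Start the streaming computation and wait for it to terminate
  ssc.start()
  ssc.awaitTermination()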

Spark Streaming programs are best run as standalone applications built using Maven or sbt. Spark Streaming, while part of Spark, ships as a separate Maven artifact and has some additional imports you will want to add to your project.
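With sbt, for example, the dependency can be declared as follows (the version shown is illustrative; match it to your Spark version):

  libraryDependencies += "org.apache.spark" %% "spark-streaming" % "1.2.0"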


10.2 Architecture and Abstraction


edX and Coursera Courses

- Introduction to Big Data with Apache Spark
- Spark Fundamentals I
- Functional Programming Principles in Scala


10.2 Architecture and Abstraction (cont.)


10.3 Transformations

Stateless:
- the processing of each batch does not depend on the data of its previous batches
- include the common RDD transformations like map(), filter(), and reduceByKey()

Stateful:
- use data or intermediate results from previous batches to compute the results of the current batch
- include transformations based on sliding windows and on tracking state across time


10.3.1 Stateless Transformations
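A minimal sketch of stateless transformations, assuming logLines is a DStream[String] of web-server log lines; each batch is transformed independently of all others:

  // Extract the IP address (first field) of each line, then count
  // occurrences of each IP within each batch
  val ipDStream = logLines.map(line => (line.split(" ")(0), 1))
  val ipCountsDStream = ipDStream.reduceByKey((x, y) => x + y)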


10.3.2 Stateful Transformations

Windowed transformations:
- compute results across a longer time period than the StreamingContext's batch interval, by combining results from multiple batches

Figure: a windowed stream with a window duration of 3 batches and a slide duration of 2 batches; every two time steps, we compute a result over the previous 3 time steps.
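In code, a windowed transformation might look like the following sketch, where accessLogsDStream is an assumed existing DStream; both durations must be multiples of the batch interval:

  import org.apache.spark.streaming.Seconds

  // Window duration of 30 seconds, sliding every 10 seconds
  val accessLogsWindow = accessLogsDStream.window(Seconds(30), Seconds(10))
  val windowCounts = accessLogsWindow.count()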


10.3.2 Stateful Transformations (cont.)

UpdateStateByKey transformation:
- updateStateByKey() maintains state across the batches in a DStream by providing access to a state variable for DStreams of key/value pairs
- update(events, oldState) returns a newState
  - events is a list of events that arrived in the current batch (may be empty)
  - oldState is an optional state object, stored within an Option; it might be missing if there was no previous state for the key
  - newState is also an Option; we can return an empty Option to specify that we want to delete the state
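A sketch of this pattern keeping a running count per key, where responseCodeDStream is an assumed DStream of (response code, 1L) pairs; checkpointing must be enabled for updateStateByKey() to work:

  // Merge the values that arrived in this batch into the key's running sum
  def updateRunningSum(values: Seq[Long], state: Option[Long]): Option[Long] = {
    Some(state.getOrElse(0L) + values.sum)
  }

  val responseCodeCountDStream = responseCodeDStream.updateStateByKey(updateRunningSum _)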


10.4 Output Operations

Specify what needs to be done with the final transformed data in a stream

Examples: print(), save()

Saving a DStream to text files in Scala:

  ipAddressRequestCount.saveAsTextFiles("outputDir", "txt")

Saving SequenceFiles from a DStream in Scala:

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.SequenceFileOutputFormat

  val writableIpAddressRequestCount = ipAddressRequestCount.map {
    case (ip, count) => (new Text(ip), new LongWritable(count))
  }
  writableIpAddressRequestCount.saveAsHadoopFiles[
    SequenceFileOutputFormat[Text, LongWritable]]("outputDir", "txt")


10.5 Input Sources

Spark Streaming has built-in support for a number of different data sources:
- "core" sources are built into the Spark Streaming Maven artifact
- others are available through additional artifacts, e.g., spark-streaming-kafka


10.5.1 Core Sources

Stream of files:
- allows a stream to be created from files written to a directory of a Hadoop-compatible filesystem
- needs a consistent date format for the directory names, and the files have to be created atomically
- Example: streaming text files written to a directory in Scala

    val logData = ssc.textFileStream(logDirectory)

Akka actor stream:
- allows using Akka actors as a source for streaming
- To construct an actor stream:
  - create an Akka actor
  - implement the org.apache.spark.streaming.receiver.ActorHelper interface


10.5.2 Additional Sources

- Apache Kafka (see the sketch below)
- Apache Flume
  - push-based receiver
  - pull-based receiver
- Custom input sources
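A sketch of the receiver-based Kafka source; it requires the spark-streaming-kafka artifact, and zkQuorum and group are assumed strings (the ZooKeeper quorum address and a consumer group name):

  import org.apache.spark.streaming.kafka.KafkaUtils

  // Map of topic name -> number of receiver threads for that topic
  val topics = Map("pandas" -> 1, "logs" -> 1)
  val topicLines = KafkaUtils.createStream(ssc, zkQuorum, group, topics)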


10.5.3 Multiple Sources and Cluster Sizing

- We can combine multiple DStreams using operations like union() to combine data from multiple input DStreams (see the sketch below)
- To use multiple receivers, they are executed in the Spark cluster; each receiver runs as a long-running task within Spark's executors, and hence occupies CPU cores allocated to the application
- Note: do not run Spark Streaming programs locally with master configured as "local" or "local[1]"; this allocates only one CPU core, and if a receiver occupies it, no resources are left to process the received data
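A minimal union() sketch, with two receivers on hypothetical hosts:

  // Each socketTextStream() creates its own receiver; union() merges
  // the two DStreams into one
  val stream1 = ssc.socketTextStream("host1", 7777)
  val stream2 = ssc.socketTextStream("host2", 7777)
  val combined = stream1.union(stream2)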


10.6 “24/7” Operations

Spark provides strong fault tolerance guarantees: as long as the input data is stored reliably, Spark Streaming will always compute the correct result from it, offering "exactly once" semantics, even if workers or the driver fail.

To run Spark Streaming applications 24/7:
1. set up checkpointing to a reliable storage system, such as HDFS or Amazon S3
2. worry about the fault tolerance of the driver program and of unreliable input sources


10.6.1 Checkpointing

The main mechanism that needs to be set up for fault tolerance

Allows periodically saving data about the application to a reliable storage system, such as HDFS or Amazon S3, for use in recovery

Two purposes:
- limiting the state that must be recomputed on failure
- providing fault tolerance for the driver
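Enabling checkpointing is a single call on the StreamingContext (the path is illustrative):

  // Checkpoint to a reliable, replicated filesystem such as HDFS
  ssc.checkpoint("hdfs://...")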


10.6.2 Driver Fault Tolerance

Requires a special way of creating our StreamingContext, which takes in the checkpoint directory: use the StreamingContext.getOrCreate() function

After writing the initialization code with getOrCreate(), you still need to actually restart your driver program when it crashes
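A sketch of this pattern; checkpointDir and the application name are assumed values:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  def createStreamingContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("FaultTolerantApp")
    val ssc = new StreamingContext(conf, Seconds(1))
    // Set up all DStreams and output operations here, then checkpoint
    ssc.checkpoint(checkpointDir)
    ssc
  }

  // Reuses the checkpointed state if present; otherwise calls the function
  val ssc = StreamingContext.getOrCreate(checkpointDir, createStreamingContext _)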


10.6.3 Worker Fault Tolerance

Spark Streaming uses the same techniques as Spark for its fault tolerance.

All the data received from external sources is replicated among the Spark workers

All RDDs created through transformations of this replicated input data are tolerant to failure of a worker node, as the RDD lineage allows the system to recompute the lost data all the way from the surviving replica of the input data.


10.6.4 Receiver Fault Tolerance

Spark Streaming restarts the failed receivers on other nodes in the cluster

Receivers provide the following guarantees:
- All data read from a reliable filesystem (e.g., with StreamingContext.hadoopFiles) is reliable, because the underlying filesystem is replicated.
- For unreliable sources such as Kafka, push-based Flume, or Twitter, Spark replicates the input data to other nodes, but it can briefly lose data if a receiver task is down.


10.6.5 Processing Guarantees

Spark Streaming provides exactly-once semantics for all transformations: even if a worker fails and some data gets reprocessed, the final transformed result (that is, the transformed RDDs) will be the same as if the data were processed exactly once.

However, when the transformed result is pushed to external systems using output operations, the task pushing the result may get executed multiple times due to failures, and some data can get pushed multiple times.


10.7 Streaming UI

Spark Streaming provides a UI page that lets us look at what our applications are doing (typically at http://<driver>:4040).


10.8 Performance Considerations

- Batch and Window Sizes
- Level of Parallelism
- Garbage Collection and Memory Usage


10.8.1 Batch and Window Sizes

The minimum batch size Spark Streaming can use: 500 milliseconds

The best approach:
- start with a larger batch size (around 10 seconds)
- work your way down to a smaller batch size

If the processing times reported in the Streaming UI remain consistent, you can continue to decrease the batch size. Note: if they are increasing, you may have reached the limit for your application.


10.8.2 Level of Parallelism

Increasing the parallelism is a common way to reduce the processing time of batches.

Three ways:
- increasing the number of receivers
- explicitly repartitioning received data (see the sketch below)
- increasing parallelism in aggregation
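A sketch of the repartitioning option, where inputDStream is an assumed existing DStream; the partition count is illustrative:

  // Spread the received data across more of the cluster before processing
  val repartitioned = inputDStream.repartition(10)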


10.8.3 Garbage Collection and Memory Usage

Java's garbage collection is an aspect that can cause problems.

To minimize large pauses due to GC, enable Java's Concurrent Mark-Sweep garbage collector (see the sketch below): it consumes more resources overall, but introduces fewer pauses.

To reduce GC pressure:
- cache RDDs in serialized form
- use Kryo serialization
- use an LRU cache
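One way to enable CMS is through the executor JVM options; a sketch via SparkConf (in practice this is often passed with spark-submit --conf instead):

  import org.apache.spark.SparkConf

  // Ask executor JVMs to use the Concurrent Mark-Sweep collector
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")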




10.9 Conclusion

In this chapter, we have seen how to work with streaming data using DStreams.

Since DStreams are composed of RDDs, the techniques and knowledge you have gained from the earlier chapters remain applicable for streaming and real-time applications.

In the next chapter, we will look at machine learning with Spark.