Apache Storm vs. Spark Streaming – Two Stream Processing Platforms compared
June 2015

Guido Schmutz

Java Forum Stuttgart: 2015.java-forum-stuttgart.de/_data/C6_Schmutz.pdf

2015 © Trivadis

BASEL BERN BRUGG LAUSANNE ZÜRICH DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. HAMBURG MUNICH STUTTGART VIENNA

July 2015 Apache Storm vs. Spark Streaming - Two Stream Processing Platforms compared


Guido Schmutz

•  Working for Trivadis for more than 18 years
•  Oracle ACE Director for Fusion Middleware and SOA
•  Co-author of several books
•  Consultant, Trainer, Software Architect for Java, Oracle, SOA and Big Data / Fast Data
•  Member of Trivadis Architecture Board
•  Technology Manager @ Trivadis
•  More than 25 years of software development experience
•  Contact: [email protected]
•  Blog: http://guidoschmutz.wordpress.com
•  Twitter: gschmutz


Trivadis is a market leader in IT consulting, system integration, solution engineering and the provision of IT services focusing on and technologies in Switzerland, Germany and Austria.

We offer our services in the following strategic business fields: Trivadis Services takes over the interacting operation of your IT systems.

Trivadis

OPERATION


Agenda

1.  Introduction / Motivation

2.  Apache Storm

3.  Apache Spark (Streaming)

4.  Stream Processing in the Architecture


What is Stream Processing?

Infrastructure for continuous data processing

Computational model can be as general as MapReduce but with the ability to produce low-latency results

Data collected continuously is naturally processed continuously

a.k.a. Event Processing / Complex Event Processing (CEP)


Trivadis Stream Processing Demo System


Use Hashtag #JFS2015 plus #storm and/or #spark


How to design a Stream Processing System?

[Figure: three ways to wire a stream processing system.
1. Event Stream -> Collecting -> Queue (Persist) -> Processing -> result
2. Event Stream -> Collecting -> Processing -> result
3. Event Stream -> Collecting/Processing -> result]
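
The first design (collecting and processing decoupled by a queue) can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `collect`, `process` and `run_pipeline` are made up for this example, the in-memory `queue.Queue` stands in for a persistent queue, and "processing" is just upper-casing each event.

```python
import queue
import threading

SENTINEL = None  # marks the end of the (finite, illustrative) event stream


def collect(event_stream, q):
    """Collecting: read events from the source and put them on the queue."""
    for event in event_stream:
        q.put(event)
    q.put(SENTINEL)


def process(q, results):
    """Processing: take events off the queue and turn each one into a result."""
    while True:
        event = q.get()
        if event is SENTINEL:
            break
        results.append(event.upper())  # placeholder for real processing logic


def run_pipeline(event_stream):
    q = queue.Queue()  # in-memory stand-in for a persistent queue
    results = []
    collector = threading.Thread(target=collect, args=(event_stream, q))
    processor = threading.Thread(target=process, args=(q, results))
    collector.start()
    processor.start()
    collector.join()
    processor.join()
    return results
```

Because the collector never calls the processor directly, either side could be replaced or scaled without touching the other, which is the point of putting the queue in between.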


How to scale a Stream Processing System?

[Figure sequence: three steps of scaling out.
1. Threads: Collecting Thread 1 to n read the Event Stream and write events to a Queue (Persist); Processing Thread 1 to n read events from the queue and emit results.
2. Processes: several Collecting Processes (each with collecting threads) write to Queue 1 to Queue n (Persist); several Processing Processes read from the queues and emit results.
3. Stages: Collecting Process 1 and 2 feed the stages Processing A and Processing B; each stage runs as Thread 1 to n with its own queue (Q1 to Qn) inside Process 1 and Process 2.]


How to make a (stateful) Stream Processing System reliable?

Faults and stragglers are inevitable in large clusters running big data applications

Streaming applications must recover from them quickly

[Figure: the multi-stage pipeline (Event Stream, Collecting Process, Processing A and Processing B running as threads with queues inside processes), drawn twice.]

How to make a (stateful) Stream Processing System reliable?

Solution 1: using active/passive system (hot replication)

•  Both systems process the full load

•  In case of a failure, automatically switch and use the “passive” system

•  Stragglers slow down both active and passive system

State = State in-memory and/or on-disk

[Figure: two copies of the multi-stage pipeline, one labelled Active and one labelled Passive, each with its own State, both fed by the Event Stream.]

How to make a (stateful) Stream Processing System reliable?

Solution 2: Upstream backup

•  Nodes buffer messages and replay them to a new node in case of failure

•  Stragglers are treated as failures
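
A minimal sketch of the upstream-backup idea, assuming an acknowledgement-based protocol: the upstream node keeps every message in a buffer until the downstream node acknowledges it, and replays whatever is still buffered to a replacement node. The class names `UpstreamNode` and `CountingNode` and the `healthy` flag are hypothetical and exist only for this illustration.

```python
class UpstreamNode:
    """Buffers every message it sends until the downstream node acknowledges it."""

    def __init__(self):
        self.buffer = {}   # message id -> message, kept until acknowledged
        self.next_id = 0

    def send(self, message, downstream):
        msg_id = self.next_id
        self.next_id += 1
        self.buffer[msg_id] = message
        downstream.receive(msg_id, message, self)

    def ack(self, msg_id):
        self.buffer.pop(msg_id, None)

    def replay(self, new_downstream):
        """Resend all unacknowledged messages, in order, to a replacement node."""
        for msg_id, message in sorted(self.buffer.items()):
            new_downstream.receive(msg_id, message, self)


class CountingNode:
    """Stateful downstream node; healthy=False simulates a failed or straggling node."""

    def __init__(self, healthy=True):
        self.healthy = healthy
        self.total = 0     # the node's state

    def receive(self, msg_id, message, upstream):
        if not self.healthy:
            return         # nothing is processed, so no ack is sent
        self.total += message
        upstream.ack(msg_id)
```

For example, sending 1, 2 and 3 to a node created with `healthy=False` leaves all three messages in the buffer; calling `replay` with a fresh `CountingNode` rebuilds the total of 6 on the new node and empties the buffer.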

State = State in-memory and/or on-disk

buffer = Buffer for replay in-memory and/or on-disk

[Figure: the multi-stage pipeline (Event Stream, Collecting Process, Processing A, Processing B) with State attached to the processing nodes.]

Processing Models

Batch Processing
•  Familiar concept of processing data en masse
•  Generally incurs high latency

(Event-) Stream Processing
•  A one-at-a-time processing model
•  A datum is processed as it arrives
•  Sub-second latency
•  Difficult to process state data efficiently

Micro-Batching
•  A special case of batch processing with very small (tiny) batch sizes
•  A nice mix between batching and streaming
•  At the cost of latency
•  Allows stateful computation, making windowing an easy task
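
The difference between one-at-a-time processing and micro-batching can be made concrete with a small sketch. It assumes events are `(timestamp, value)` pairs with timestamps in seconds; the function names are hypothetical and the "processing" (doubling, summing) is a placeholder.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into batches of batch_interval seconds."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batch_start = int(timestamp // batch_interval) * batch_interval
        batches[batch_start].append(value)
    return [(start, batches[start]) for start in sorted(batches)]


def process_one_at_a_time(events):
    """Event-stream style: each datum is handled on its own as it arrives."""
    return [value * 2 for _, value in events]


def process_micro_batched(events, batch_interval):
    """Micro-batch style: a result is produced once per (tiny) batch."""
    return [(start, sum(values))
            for start, values in micro_batches(events, batch_interval)]
```

Aggregates such as a per-interval sum fall out naturally in the micro-batched version, while the result for an event only becomes available once its batch is closed, which is where the extra latency comes from.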


Message Delivery Semantics

At most once [0,1]
•  Messages may be lost
•  Messages are never redelivered

At least once [1 .. n]
•  Messages will never be lost
•  But messages may be redelivered (might be ok if the consumer can handle it)

Exactly once [1]
•  Messages are never lost
•  Messages are never redelivered
•  Perfect message delivery
•  Incurs higher latency for transactional semantics
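
One common way for a consumer to "handle" redelivery is to deduplicate by message id. The sketch below simulates at-least-once delivery with lost acknowledgements and shows the effect with and without deduplication. `Consumer`, `send_at_least_once` and the `lost_acks` parameter are hypothetical names for this illustration and do not correspond to any Storm or Spark API.

```python
class Consumer:
    def __init__(self, deduplicate):
        self.deduplicate = deduplicate
        self.seen_ids = set()
        self.processed = []

    def deliver(self, msg_id, payload):
        if self.deduplicate and msg_id in self.seen_ids:
            return                   # duplicate: already processed, ignore it
        self.seen_ids.add(msg_id)
        self.processed.append(payload)


def send_at_least_once(messages, consumer, lost_acks):
    """Redeliver each message until its ack arrives.

    lost_acks maps a message id to how many of its acks get lost (simulated).
    """
    for msg_id, payload in messages:
        remaining_lost = lost_acks.get(msg_id, 0)
        while True:
            consumer.deliver(msg_id, payload)
            if remaining_lost == 0:
                break                # ack received, stop redelivering
            remaining_lost -= 1      # ack lost: the sender has to retry
```

With one lost ack for message 1, a plain consumer processes it twice, while the deduplicating consumer sees each message once; the price is the extra bookkeeping of seen ids.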


Agenda

1.  Introduction / Motivation

2.  Apache Storm

3.  Apache Spark (Streaming)

4.  Stream Processing in the Architecture


Apache Storm

A platform for doing analysis on streams of data as they come in, so you can react to data as it happens.

•  Highly distributed real-time computation system

•  Provides general primitives to do real-time computation

•  To simplify working with queues & workers

•  scalable and fault-tolerant

Originated at Backtype, acquired by Twitter in 2011

Open Sourced late 2011

Part of Apache since September 2013


Apache Storm – Core concepts

Tuple
•  Immutable set of key/value pairs

Stream
•  An unbounded sequence of tuples that can be processed in parallel by Storm

Topology
•  Wires data and functions via a DAG (directed acyclic graph)
•  Executes on many machines, similar to a MR job in Hadoop

Spout
•  Source of data streams (tuples)
•  Can be run in “reliable” and “unreliable” mode

Bolt
•  Consumes 1+ streams and produces new streams
•  Complex operations often require multiple steps and thus multiple bolts

[Figure: example topology with two spouts and four bolts. One spout is labelled "Source of Stream B"; the bolts are labelled "Subscribes: A, Emits: C", "Subscribes: A, Emits: D", "Subscribes: A & B, Emits: -" and "Subscribes: C & D, Emits: -". T marks the tuples flowing between them.]

Apache Storm – Core concepts

Each Spout or Bolt runs N instances in parallel

[Figure: word-count topology. Text Spout -> (Shuffle) -> Split Text, 1st to nth instance -> (Fields) -> Word Count, 1st to nth instance -> (Global) -> Report.]

Shuffle grouping is random grouping

Fields grouping is grouped by value, such that equal values result in the same task

All grouping replicates to all tasks

Global grouping makes all tuples go to one task

None grouping means you don't care how the stream is grouped; it currently behaves like shuffle grouping, with the intent that the bolt eventually runs in the same thread as the bolt/spout it subscribes to

Direct grouping lets the producer (the task that emits) control which consumer task will receive the tuple

Local or Shuffle grouping is similar to shuffle grouping, but shuffles tuples among bolt tasks running in the same worker process, if any; otherwise it falls back to shuffle grouping behavior
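
The routing rules behind three of these groupings can be sketched as functions that map a tuple to a task index. This is a toy model and not Storm's implementation: tuples are plain dicts, round-robin stands in for Storm's random choice in shuffle grouping, and CRC32 is an arbitrary stable hash chosen for the example. All function names are hypothetical.

```python
import itertools
import zlib


def shuffle_grouping(num_tasks):
    """Spread tuples evenly; round-robin stands in for a random choice."""
    counter = itertools.count()
    return lambda tup: next(counter) % num_tasks


def fields_grouping(field, num_tasks):
    """Equal values of `field` always map to the same task."""
    def choose(tup):
        return zlib.crc32(str(tup[field]).encode("utf-8")) % num_tasks
    return choose


def global_grouping():
    """All tuples go to one task."""
    return lambda tup: 0


def route(tuples, grouping):
    """Return {task index: [tuples routed to that task]}."""
    assignment = {}
    for tup in tuples:
        assignment.setdefault(grouping(tup), []).append(tup)
    return assignment
```

Routing the words barca, real, barca, juve with `fields_grouping("word", 2)` puts both barca tuples on the same task, which is what lets a word counter keep a correct count per word.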


Storm – How does it work?

[Figure sequence: four build steps of a word-count topology on a tweet stream.
1. The Twitter Spout emits tweets such as "Who will win: Barca, Real, Juve or Bayern? … bit.ly/1yRsPmE #fcb #barca", "… #barca" and "… #fcb" to three Sentence Splitter bolts via Shuffle Grouping; the splitters emit the words bayern, fcb, juve, real, barca, barca.
2. The words are routed to two Word Counter bolts via Fields Grouping.
3. The Word Counters increment their counts (INCR barca, INCR real, INCR juve, INCR barca, INCR bayern, INCR fcb), giving real = 1, juve = 1, bayern = 1, fcb = 1 and barca = 1, then barca = 2.
4. The counts are forwarded via Global Grouping to a single Report bolt (the step is labelled "30sec" in the figure).]


Using a NoSQL datastore for persisting results

Keep state in a NoSQL datastore

Using counter type columns of Cassandra

[Figure: the word-count topology (Twitter Stream -> Twitter Spout -> Sentence Splitter -> Word Counter) where the Word Counters issue INCR operations (INCR barca, INCR real, INCR juve, INCR bayern, INCR fcb) against the datastore, which holds real = 1, juve = 1, barca = 2, bayern = 1, fcb = 1.]

Storm Trident

High-level abstraction on top of Storm
•  Processing as a series of batches (micro-batches)
•  Stream is partitioned among nodes in cluster

5 kinds of operations in Trident
•  Operations that apply locally to each partition and cause no network transfer
•  Repartitioning operations that don't change the contents
•  Aggregation operations that do network transfer
•  Operations on grouped streams
•  Merges and Joins

[Figure: Trident topology on the Twitter Stream. Twitter Spout -> (tweet) -> Sentence Splitter -> (hashtag) -> Sentence Normalizer -> (hashtag, groupBy) -> Persistent Aggregate. Further labels in the figure: "local", "Bolt", "Bolt".]

Storm Core vs. Storm Trident

                    | Storm Core                            | Storm Trident
Community           | > 100 contributors                    | > 100 contributors
Adoption            | ***                                   | *
Language Options    | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala
Processing Models   | Event-Streaming                       | Micro-Batching
Processing DSL      | No                                    | Yes
Stateful Ops        | No                                    | Yes
Distributed RPC     | Yes                                   | Yes
Delivery Guarantees | At most once / At least once          | Exactly Once
Latency             | sub-second                            | seconds
Platform            | Storm Cluster, YARN                   | Storm Cluster, YARN

Agenda

1.  Introduction

2.  Apache Storm

3.  Apache Spark (Streaming)

4.  Unified Log (Enterprise Event Bus)

5.  Stream Processing in the Architecture


Apache Spark

Apache Spark is a fast and general engine for large-scale data processing

•  The hot trend in Big Data!

•  Based on 2007 Microsoft Dryad paper

•  Written in Scala, supports Java, Python, SQL and R

•  Can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk

•  Runs everywhere – runs on Hadoop, Mesos, standalone or in the cloud

•  One of the largest OSS communities in big data with over 200 contributors in 50+ organizations

•  Originally developed in 2009 at UC Berkeley’s AMPLab

•  Open sourced in 2010 – since 2014 part of the Apache Software Foundation


Apache Spark

Spark Core
•  General execution engine for the Spark platform
•  In-memory computing capabilities deliver speed
•  General execution model supports wide variety of use cases
•  DAG-based
•  Ease of development – native APIs in Java, Scala and Python

Spark Streaming
•  Run a streaming computation as a series of very small, deterministic batch jobs
•  Batch size as low as ½ sec, latency of about 1 sec
•  Exactly-once semantics
•  Potential for combining batch and streaming processing in same system
•  Started in 2012, first alpha release in 2013


Apache Spark - Generality

Libraries: Spark SQL (Batch Processing), Blink DB (Approximate Querying), Spark Streaming (Real-Time), MLlib, Spark R (Machine Learning), GraphX (Graph Processing)

Core Runtime: Spark Core API and Execution Model

Cluster Resource Managers: Spark Standalone, MESOS, YARN

Data Stores: HDFS, Elastic Search, Cassandra, S3 / DynamoDB

Adapted from C. Fregly: http://slidesha.re/11PP7FV

Apache Spark – Core concepts

Resilient Distributed Dataset (RDD)
•  Core Spark abstraction
•  Collections of objects (partitions) spread across cluster
•  Can be stored in-memory or on-disk (local)
•  Enables parallel processing on data sets
•  Built through parallel transformations
•  Immutable, re-computable, fault tolerant
•  Contains transformation history (“lineage”) for whole data set

Operations
•  Stateless Transformations (map, filter, groupBy)
•  Actions (count, collect, save)
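
The interplay of immutability, lineage, lazy transformations and actions can be illustrated with a toy class. `ToyRDD` is hypothetical and is not the Spark API: it is single-process, unpartitioned, and only mimics the idea that transformations record what to do while actions actually do it.

```python
class ToyRDD:
    """A toy stand-in for an RDD: immutable, lazy, and aware of its lineage."""

    def __init__(self, source, lineage=()):
        self._source = source          # zero-argument function producing the data
        self.lineage = tuple(lineage)  # names of the transformations applied so far

    @classmethod
    def from_list(cls, items):
        data = list(items)
        return cls(lambda: iter(data), ("parallelize",))

    # Transformations: return a new ToyRDD, compute nothing yet.
    def map(self, fn):
        return ToyRDD(lambda: (fn(x) for x in self._source()),
                      self.lineage + ("map",))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._source() if predicate(x)),
                      self.lineage + ("filter",))

    # Actions: walk the lineage and execute the transformations.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())
```

Because every `ToyRDD` only holds a recipe that leads back to the original data, calling `collect()` twice recomputes the same result, which is the property lineage-based recovery relies on.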


RDD Lineage Example

[Figure: RDD lineage.
HDFS File Input 1 -> SparkContext.hadoopFile() -> HadoopRDD -> filter() -> FilteredRDD -> map() -> MappedRDD
HDFS File Input 2 -> SparkContext.hadoopFile() -> HadoopRDD -> map() -> MappedRDD
Both MappedRDDs -> join() -> ShuffledRDD -> SparkContext.saveAsHadoopFile() -> HDFS File Output
The steps up to join() are labelled Transformations (Lazy); saveAsHadoopFile() is labelled Action (Execute Transformations).]

Adapted from Chris Fregly: http://slidesha.re/11PP7FV

Apache Spark Streaming – Core concepts

Discretized Stream (DStream)
•  Core Spark Streaming abstraction
•  Micro-batches of RDDs
•  Operations similar to RDD

Input DStreams
•  Represents the stream of raw data received from streaming sources
•  Data can be ingested from many sources: Kafka, Kinesis, Flume, Twitter, ZeroMQ, TCP Socket, Akka actors, etc.
•  Custom sources can be easily written for custom data sources

Operations
•  Same as Spark Core
•  Additional stateful transformations (window, reduceByWindow)
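
What a windowed operation does on a sequence of micro-batches can be sketched without Spark. In this illustration a DStream is modelled as a plain list of batches, and window length and slide interval are counted in batches rather than in time units (an assumption made to keep the example small). The functions `window` and `reduce_by_window` are hypothetical helpers that only borrow the names of the operations mentioned above.

```python
from functools import reduce


def window(batches, window_length, slide_interval=1):
    """Yield the items of the last window_length batches, every slide_interval batches."""
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        yield [item for batch in batches[start:end] for item in batch]


def reduce_by_window(batches, reduce_fn, window_length, slide_interval=1):
    """Reduce each non-empty window to a single value."""
    return [reduce(reduce_fn, items)
            for items in window(batches, window_length, slide_interval)
            if items]
```

The computation is stateful in the sense that each result depends on earlier batches as well as the current one, which is what distinguishes it from the per-batch operations inherited from Spark Core.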


Discretized Stream (DStream)

[Figure: an Input Stream of messages is cut along the time axis (time 1, time 2, time 3, …, time n; Time Increasing) into a DStream with one RDD per interval (RDD @time 1 to RDD @time n, each holding message 1 to message n). map() turns it into a MappedDStream whose RDDs hold f(message 1) to f(message n), and saveAsHadoopFiles() writes result 1 to result n for each interval. Side labels: "DStream Transformation Lineage" and "Actions Trigger Spark Jobs".]

Adapted from Chris Fregly: http://slidesha.re/11PP7FV

Storm Core vs. Storm Trident vs. Spark Streaming

                    | Storm Core                            | Storm Trident        | Spark Streaming
Community           | > 100 contributors                    | > 100 contributors   | > 280 contributors
Adoption            | ***                                   | *                    | *
Language Options    | Java, Clojure, Scala, Python, Ruby, … | Java, Clojure, Scala | Java, Scala, Python (coming)
Processing Models   | Event-Streaming                       | Micro-Batching       | Micro-Batching, Batch (Spark Core)
Processing DSL      | No                                    | Yes                  | Yes
Stateful Ops        | No                                    | Yes                  | Yes
Distributed RPC     | Yes                                   | Yes                  | No
Delivery Guarantees | At most once / At least once          | Exactly Once         | Exactly Once
Latency             | sub-second                            | seconds              | seconds
Platform            | Storm Cluster, YARN                   | Storm Cluster, YARN  | YARN, Mesos, Standalone, DataStax EE

Agenda

1.  Introduction / Motivation

2.  Apache Storm

3.  Apache Spark (Streaming)

4.  Stream Processing in the Architecture


Architectural Pattern: Standalone Event Stream Processing

[Figure: components of the standalone pattern. Sources: Internet of Things, Social Media Streams, Event Cloud. An Enterprise Event Bus (Ingress) feeds Event Processing (ESP / CEP), which is shown with a State Store / Event Store, a Business Rule Management System (Rules) and a Result Store. Consumers: Analytical Applications and DB, attached via Enterprise Event Bus and Enterprise Service Bus.]

Architectural Pattern: Event Stream Processing as part of Lambda Architecture

[Figure: components of the Lambda pattern. Sources: Internet of Things, Social Media Streams, Event Cloud. An Enterprise Event Bus (Ingress) feeds Event Processing (ESP / CEP) with a State Store / Event Store and a Result Store. In parallel, a Hadoop Big Data Infrastructure holds HDFS, Map/Reduce and its own Result Store. Consumers: Analytical Applications and DB, attached via Enterprise Event Bus and Enterprise Service Bus.]

Architectural Pattern: Event Stream Processing as part of Kappa Architecture

[Figure: components of the Kappa pattern. Sources: Internet of Things, Social Media Streams, Event Cloud. An Enterprise Event Bus (Ingress) feeds Event Processing (ESP / CEP) with a State Store / Event Store and a Result Store. The Hadoop Big Data Infrastructure holds HDFS, connected to Event Processing by a Replay path. Consumers: Analytical Applications and DB, attached via Enterprise Service Bus.]

Unified Log (Event) Architecture

Stream processing allows for computing feeds off of other feeds

Derived feeds are no different from the original feeds they are computed from

Single deployment of “Unified Log” but logically different feeds

[Figure: derived feeds in a unified log. Processing steps: Meter Readings Collector, Enrich / Transform, Aggregate by Minute, Customer Aggregate by Minute, Persist (twice). Feeds: Raw Meter Readings, Meter with Customer, Meter by Minute, Meter by Customer by Minute. The Persist steps are attached to Raw Meter Readings and Meter by Minute.]

Unified Log/Event Architecture for Trivadis Streaming Demo System

[Figure: Distribution Layer and Speed Layer built with Kafka, Storm, Cassandra, Elasticsearch and Titan. Feeds: Tweets, Filtered Tweets, Sensor Readings, Feature Occurrences, Matches. Processing steps: Filter, Persist (twice), Feature extractor, Feature counter (Count), Skill Matcher.]

Unified Log/Event Architecture for Trivadis Streaming Demo System

[Figure: the same architecture with one Storm Topology expanded: Kafka -> Kafka Spout -> (Shuffle) -> Splitter -> (Fields) -> Word Remover -> Kafka, with several Splitter and Word Remover instances.]

Central Unified Log for (real-time) subscription

Take all the organization’s data and put it into a central log for subscription

Properties of the Unified Log:

•  Unified: “Enterprise”, single deployment

•  Append-Only: events are appended, no update in place => immutable

•  Ordered: each event has an offset, which is unique within a shard

•  Fast: should be able to handle thousands of messages / sec

•  Distributed: lives on a cluster of machines
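
The append-only and ordered properties, together with consumers that each read at their own position, can be sketched as follows. This is a single-process toy, so the "Distributed" and "Fast" properties are out of scope; `UnifiedLog` and `LogConsumer` are hypothetical names used only for this illustration.

```python
class UnifiedLog:
    """Append-only log: events are never updated in place, each gets an offset."""

    def __init__(self):
        self._events = []

    def append(self, event):
        self._events.append(event)
        return len(self._events) - 1   # offset of the appended event

    def read(self, offset, max_events=None):
        end = len(self._events) if max_events is None else offset + max_events
        return self._events[offset:end]


class LogConsumer:
    """Each consumer tracks its own offset, independent of other consumers."""

    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_events=None):
        events = self.log.read(self.offset, max_events)
        self.offset += len(events)
        return events
```

Since the log never changes behind a consumer, a slow consumer and a fast consumer can subscribe to the same feed without coordinating with each other or with the writer.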

[Figure: a log with offsets 0 to 11. The Collector writes at the end of the log; Consumer System A reads at time = 6 and Consumer System B reads at time = 10.]

Apache Kafka - Overview

•  A distributed publish-subscribe messaging system

•  Designed for processing of real time activity stream data (logs, metrics collections, social media streams, …)

•  Initially developed at LinkedIn, now part of Apache

•  Does not follow JMS Standards and does not use JMS API

•  Kafka maintains feeds of messages in topics
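
How a topic can be split into partitions, each with its own sequence of offsets, is sketched below. `ToyTopic` is a hypothetical in-memory model and not the Kafka client API; in particular, choosing the partition with a CRC32 hash of the key is an assumption made for the example.

```python
import zlib


class ToyTopic:
    """A topic as a list of partitions; each partition is an append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def publish(self, key, message):
        """Append the message to the partition chosen by the key; return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1

    def fetch(self, partition, offset):
        """Return all messages of one partition starting at the given offset."""
        return self.partitions[partition][offset:]
```

Messages with the same key land in the same partition in publish order, while offsets in different partitions are unrelated to each other.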

[Figure: Producers write to the Kafka Cluster and Consumers read from it. Anatomy of a topic: Partition 0 (offsets 0 to 12), Partition 1 (offsets 0 to 9) and Partition 2 (offsets 0 to 12); Writes are appended at the new end of each partition, ordered from old to new.]

Trivadis Stream Processing Demo System - Update


Questions and answers ...


Guido Schmutz Technology Manager
