Page 1: Building Big Data Streaming Architectures

Building streaming architectures
David Martinez Rego, BigD@ta Coruña, 25-26 April 2016

Page 2: Building Big Data Streaming Architectures

Index

• Why?

• When?

• What?

• How?

Page 3: Building Big Data Streaming Architectures

Why?

Page 4: Building Big Data Streaming Architectures

The world does not wait

• Big data applications are built with the sole purpose of serving a business case: gathering an understanding about the world that gives an advantage.

• The need for streaming applications arises from the fact that, in many cases, the value of the information gathered drops dramatically with time.

Page 5: Building Big Data Streaming Architectures
Page 6: Building Big Data Streaming Architectures
Page 7: Building Big Data Streaming Architectures
Page 8: Building Big Data Streaming Architectures
Page 9: Building Big Data Streaming Architectures

Batch/streaming duality

• Streaming applications can bring value by giving an approximate answer just in time. If timing is not an issue (e.g. daily), batch pipelines can provide a good solution.

(chart: value of information over time, comparing streaming and batch)

Page 10: Building Big Data Streaming Architectures

When?

Page 11: Building Big Data Streaming Architectures

Start big, grow small

• Despite vendor advertising, jumping into a streaming application is not always advisable

• It is harder to get right and you run into limitations: probabilistic data structures, guarantees, …

• The value of the data you are about to gather is not clear in a discovery phase.

• Some new libraries provide the same set of primitives both for batch and streaming. It is possible to develop the core of the idea and just translate that to a streaming pipeline later.
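For example, a sketch assuming Spark 2.x, where RDDs (batch) and DStreams (streaming) expose many of the same transformations: the word-count logic is written once and only the source and the execution context change.

  import java.util.Arrays;
  import org.apache.spark.SparkConf;
  import org.apache.spark.api.java.*;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.*;
  import scala.Tuple2;

  public class BatchVsStreaming {
    public static void main(String[] args) throws Exception {
      SparkConf conf = new SparkConf().setAppName("duality").setMaster("local[2]");
      JavaSparkContext sc = new JavaSparkContext(conf);

      // Batch: word count over a static file
      JavaPairRDD<String, Integer> batchCounts = sc.textFile("input.txt")
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      batchCounts.saveAsTextFile("batch-counts");

      // Streaming: the same primitives over a socket source, in 5-second micro-batches
      JavaStreamingContext ssc = new JavaStreamingContext(sc, Durations.seconds(5));
      JavaPairDStream<String, Integer> streamCounts = ssc.socketTextStream("localhost", 9999)
          .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
          .mapToPair(word -> new Tuple2<>(word, 1))
          .reduceByKey(Integer::sum);
      streamCounts.print();

      ssc.start();
      ssc.awaitTermination();
    }
  }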

Page 12: Building Big Data Streaming Architectures

Not always practical

• As a developer, you can face any of the following situations

• It is mandatory

• It is doubtful

• It will never be necessary

Page 13: Building Big Data Streaming Architectures
Page 14: Building Big Data Streaming Architectures
Page 15: Building Big Data Streaming Architectures
Page 16: Building Big Data Streaming Architectures
Page 17: Building Big Data Streaming Architectures

What?

Page 18: Building Big Data Streaming Architectures

(architecture diagram: Gathering → Brokering → Processing/Analysis → Sink, running on an execution engine with a coordination service; gathering and brokering are persistent, processing is ~persistent, and sinks feed external systems)

Page 19: Building Big Data Streaming Architectures

https://www.mapr.com/developercentral/lambda-architecture

Page 20: Building Big Data Streaming Architectures

Lambda architecture

• Batch layer (e.g. Spark, HDFS): processes the master dataset (append only) to precompute batch views (the views the front end will query)

• Speed layer (streaming): calculates ephemeral views based only on recent data

• Motto: take reprocessing and recovery into account

Page 21: Building Big Data Streaming Architectures

Lambda architecture

• Problems:

• Maintaining two code bases in sync (they often differ because the speed layer cannot reproduce the same computation)

• Synchronising the two layers in the query layer is an additional problem

Page 22: Building Big Data Streaming Architectures

(diagram: Gathering and Brokering feed a reservoir of master data; the production pipeline keeps serving while a new pipeline catches up)

Page 23: Building Big Data Streaming Architectures

(diagram: once the new pipeline has caught up, it becomes the production pipeline and the old pipeline is retired)

Page 24: Building Big Data Streaming Architectures

Kappa approach

• Only one code base to maintain, which reduces the accidental complexity of juggling too many technologies.

• You can roll back if something goes wrong

• Not a silver bullet and not a prescription of technologies, just a framework.

Page 25: Building Big Data Streaming Architectures

(architecture diagram repeated: Gathering → Brokering → Processing/Analysis → Sink, with execution engine, coordination, persistence labels and external systems)

Page 26: Building Big Data Streaming Architectures

How?

Page 27: Building Big Data Streaming Architectures

Concepts are basic

• There are multiple frameworks available nowadays that change terminology in an attempt to differentiate themselves.

• It makes starting on streaming a bit confusing…

Page 28: Building Big Data Streaming Architectures

Concepts are basic

• It makes starting on streaming a bit confusing…

• In reality, many concepts are shared between them, and they are quite logical.

Page 29: Building Big Data Streaming Architectures

Step 1: data structure

• The basic data structure is made of 4 elements

• Sink: where is this thing going?

• Partition key: to which shard?

• Sequence id: when was this produced?

• Data: anything that can be serialised (JSON, Avro, photo, …)

(Sink, Partition key, Sequence id, Data)
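A minimal sketch of that envelope as a plain Java class (field names are illustrative, not tied to any particular framework):

  // the 4-element message envelope described above
  public final class StreamRecord {
    public final String sink;         // where is this thing going? (topic / stream name)
    public final String partitionKey; // to which shard?
    public final long sequenceId;     // when was this produced? (offset or timestamp)
    public final byte[] data;         // anything serialisable: JSON, Avro, a photo, ...

    public StreamRecord(String sink, String partitionKey, long sequenceId, byte[] data) {
      this.sink = sink;
      this.partitionKey = partitionKey;
      this.sequenceId = sequenceId;
      this.data = data;
    }
  }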

Page 30: Building Big Data Streaming Architectures

Step 2: hashing

• The holy grail trick of big data for splitting work, and also a major building block of streaming

• We use hashing in reverse of its classical use: we force the clashing of the things that are of interest to us

(Sink, Partition key, Sequence id, Data) → h(k) mod N
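A minimal sketch of that routing step (the hash function and key are illustrative; real brokers use murmur-style hashes):

  // route a record to one of N partitions with h(k) mod N; records sharing a
  // partition key always "clash" into the same partition, i.e. the same worker
  static int partitionFor(String partitionKey, int numPartitions) {
    return Math.floorMod(partitionKey.hashCode(), numPartitions);
  }

  // e.g. partitionFor("sherlock", 8) always returns the same partition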

Page 31: Building Big Data Streaming Architectures

Step 3: fault tolerance

“Distributed computing is parallel computing when you cannot trust anything or anyone”

Page 32: Building Big Data Streaming Architectures

Step 3: fault tolerance

• At any point any node producing the data in the source can stop working

• Non persistent: data is lost

• Persistent: data is replicated so it can always be recovered from another node

Page 33: Building Big Data Streaming Architectures

Step 3: fault tolerance

• At any point, any node computing our pipeline can go down

• at most once: we let data be lost; once delivered, it is not reprocessed.

• at least once: we ensure delivery; messages can be reprocessed.

• exactly once: we ensure delivery with no reprocessing
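As one concrete example of how these guarantees surface in configuration, a sketch of Kafka producer settings (values are illustrative; the consumer side and the processing engine also have to cooperate for an end-to-end guarantee):

  import java.util.Properties;

  Properties props = new Properties();
  props.put("bootstrap.servers", "localhost:9092");
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

  // at most once: fire and forget, never retry -> messages can be lost
  props.put("acks", "0");
  props.put("retries", "0");

  // at least once: wait for acknowledgement and retry -> duplicates possible
  // props.put("acks", "all");
  // props.put("retries", "3");

  // exactly once (producer side): idempotent writes, plus transactions if needed
  // props.put("enable.idempotence", "true");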

Page 34: Building Big Data Streaming Architectures

Step 3: fault tolerance

• At any point, any node computing our pipeline can go down

• checkpointing: If we have been running the pipeline for hours and something goes wrong, do I have to start from the beginning?

• Streaming systems put in place mechanisms to checkpoint progress so the new worker knows previous state and where to start from.

• Usually involves other systems to save checkpoints and synchronise.
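A minimal sketch of such a mechanism, assuming Flink as the execution engine (the interval is chosen for illustration):

  import org.apache.flink.streaming.api.CheckpointingMode;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
  env.enableCheckpointing(60_000);  // snapshot the pipeline state every 60 s
  env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);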

Page 35: Building Big Data Streaming Architectures

Step 4: delivery

• One at a time: we process each message individually, which improves response time per message.

• Micro-batch: we always process data in batches gathered over a specified time interval or up to a given size. This makes it impossible to reduce per-message latency below the batch interval.
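As a concrete illustration, a sketch with Spark Streaming (one micro-batch engine): the batch interval is fixed when the streaming context is created, so no result can be produced faster than that interval.

  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.JavaStreamingContext;

  SparkConf conf = new SparkConf().setAppName("micro-batch").setMaster("local[2]");
  // every transformation in this context operates on 5-second batches of data,
  // so per-message latency cannot drop below roughly 5 seconds
  JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));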

Page 36: Building Big Data Streaming Architectures

(architecture diagram repeated: Gathering → Brokering → Processing/Analysis → Sink, with execution engine, coordination, persistence labels and external systems)

Page 37: Building Big Data Streaming Architectures

Gathering

Page 38: Building Big Data Streaming Architectures

(diagram: producers emit (Topic, Partition key, Data) messages)

Page 39: Building Big Data Streaming Architectures

(diagram: each (Topic, Partition key, Data) message is assigned to a partition by h(k) mod N)

Page 40: Building Big Data Streaming Architectures

(diagram: consumers reading partitions, coordinated through Zookeeper)

Page 41: Building Big Data Streaming Architectures

(diagram: consumer groups scaling out with Consumer 1, Consumer 2, …, coordinated through Zookeeper)

Page 42: Building Big Data Streaming Architectures

Produce to Kafka, consume from Kafka
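A minimal produce/consume sketch with the Kafka Java clients (topic, key and group names are illustrative; assumes a recent client with the Duration-based poll):

  import java.time.Duration;
  import java.util.Collections;
  import java.util.Properties;
  import org.apache.kafka.clients.consumer.ConsumerRecord;
  import org.apache.kafka.clients.consumer.KafkaConsumer;
  import org.apache.kafka.clients.producer.KafkaProducer;
  import org.apache.kafka.clients.producer.ProducerRecord;

  public class KafkaRoundTrip {
    public static void main(String[] args) {
      Properties p = new Properties();
      p.put("bootstrap.servers", "localhost:9092");
      p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
      try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
        // the partition key ("baker-street") is hashed to pick the partition
        producer.send(new ProducerRecord<>("sightings", "baker-street", "{\"who\":\"holmes\"}"));
      }

      Properties c = new Properties();
      c.put("bootstrap.servers", "localhost:9092");
      c.put("group.id", "analysis");
      c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
      try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
        consumer.subscribe(Collections.singletonList("sightings"));
        while (true) {
          for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofMillis(500))) {
            System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
          }
        }
      }
    }
  }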

Page 43: Building Big Data Streaming Architectures

(architecture diagram repeated: Gathering → Brokering → Processing/Analysis → Sink, with execution engine, coordination, persistence labels and external systems)

Page 44: Building Big Data Streaming Architectures

Framework comparison (columns: one-at-a-time | mini-batch | exactly once | deploy | windowing | functional | catch):

• Storm: Yes | Yes * | Yes * | Custom, YARN | Yes * | ~ | DRPC

• Spark: No | Yes | Yes | YARN, Mesos | Yes | Yes | MLlib, ecosystem

• Flink: Yes | Yes | Yes | YARN | Yes | Yes | Flexible windowing

• Samza: Yes | ~ | No | YARN | ~ | No | DB update log plugin

• Datastream (Google): Yes | Yes | Yes | Google | Yes | ~ | Google ecosystem

• Kinesis: Yes | you | No | AWS | you | No | AWS ecosystem

* with Trident

Page 45: Building Big Data Streaming Architectures

Flink basic concepts

• Stream: source of data that feeds computations (a batch dataset is a bounded stream)

• Transformations: operations that take one or more streams as input and compute an output stream. They can be stateless or stateful (with exactly once semantics).

• Sink: endpoint that receives the output stream of a transformation

• Dataflow: DAG of streams, transformations and sinks.
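A small sketch of such a dataflow with the Flink DataStream API (the socket source, window size and key are illustrative):

  import org.apache.flink.api.common.typeinfo.Types;
  import org.apache.flink.api.java.tuple.Tuple2;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
  import org.apache.flink.streaming.api.windowing.time.Time;

  public class SightingsDataflow {
    public static void main(String[] args) throws Exception {
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

      env.socketTextStream("localhost", 9999)                 // stream (unbounded source)
         .map(word -> Tuple2.of(word, 1))                     // transformation
         .returns(Types.TUPLE(Types.STRING, Types.INT))       // type hint for the lambda
         .keyBy(t -> t.f0)                                    // hash-partition by key
         .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
         .sum(1)                                              // stateful aggregation
         .print();                                            // sink

      env.execute("sightings dataflow");                      // the DAG is the dataflow
    }
  }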

Page 46: Building Big Data Streaming Architectures

Flink basic concepts

Page 47: Building Big Data Streaming Architectures

Flink basic concepts

Page 48: Building Big Data Streaming Architectures
Page 49: Building Big Data Streaming Architectures

Samza basic concepts

• Streams: persistent sets of immutable messages of similar type and category, with transactional semantics.

• Jobs: code that performs logical transformations on a set of input streams to append to a set of output streams.

• Partitions: each stream breaks into partitions, each a totally ordered sequence of messages.

• Tasks: Each task consumes data from one partition.

• Dataflow: composition of jobs that connects a set of streams.

• Containers: physical unit of parallelism.
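A minimal sketch of a job's task using Samza's low-level API (system and stream names are illustrative):

  import org.apache.samza.system.IncomingMessageEnvelope;
  import org.apache.samza.system.OutgoingMessageEnvelope;
  import org.apache.samza.system.SystemStream;
  import org.apache.samza.task.MessageCollector;
  import org.apache.samza.task.StreamTask;
  import org.apache.samza.task.TaskCoordinator;

  public class LocationTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "enriched-locations");

    @Override
    public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                        TaskCoordinator coordinator) {
      String message = (String) envelope.getMessage();              // one message from one partition
      collector.send(new OutgoingMessageEnvelope(OUTPUT, message)); // append to the output stream
    }
  }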

Page 50: Building Big Data Streaming Architectures

Samza basic concepts

Page 51: Building Big Data Streaming Architectures
Page 52: Building Big Data Streaming Architectures

Storm basic concepts

• Spout: source of data from any external system.

• Bolts: transformations of one or more streams into another set of output streams.

• Stream grouping: shuffling of streaming data between bolts.

• Topology: set of spouts and bolts that process a stream of data.

• Tasks and Workers: units of work deployable into one container. A worker can run one or more tasks; each task is deployed to one worker.
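A minimal sketch of wiring a topology with the core Storm API (SentenceSpout, SplitBolt and CountBolt are hypothetical components, not taken from the talk's repositories):

  import org.apache.storm.Config;
  import org.apache.storm.StormSubmitter;
  import org.apache.storm.topology.TopologyBuilder;
  import org.apache.storm.tuple.Fields;

  public class WordCountTopology {
    public static void main(String[] args) throws Exception {
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("sentences", new SentenceSpout(), 1);                     // spout: external source
      builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences"); // random stream grouping
      builder.setBolt("count", new CountBolt(), 2)
             .fieldsGrouping("split", new Fields("word"));                       // group tuples by "word"
      StormSubmitter.submitTopology("word-count", new Config(), builder.createTopology());
    }
  }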

Page 53: Building Big Data Streaming Architectures

Storm basic concepts

(diagram: a Trident topology compiles down to a basic Storm topology)

Page 54: Building Big Data Streaming Architectures
Page 55: Building Big Data Streaming Architectures

Spark basic concepts

• DStream: a continuous stream of data represented by a series of RDDs. Each RDD contains the data for a specific time interval.

• Input DStream and Receiver: the source of data that feeds a DStream.

• Transformations: operations that transform one DStream into another DStream (stateless and stateful, with exactly once semantics).

• Output operations: operations that periodically push data of a DStream to a specific output system.
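A minimal sketch of these pieces with the DStream API (the socket source, window sizes and names are illustrative):

  import org.apache.spark.SparkConf;
  import org.apache.spark.streaming.Durations;
  import org.apache.spark.streaming.api.java.*;
  import scala.Tuple2;

  public class SightingsStream {
    public static void main(String[] args) throws Exception {
      SparkConf conf = new SparkConf().setAppName("sightings").setMaster("local[2]");
      JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

      JavaPairDStream<String, Long> counts = ssc.socketTextStream("localhost", 9999) // input DStream + receiver
          .mapToPair(place -> new Tuple2<>(place, 1L))                               // transformation
          .reduceByKeyAndWindow(Long::sum, Durations.minutes(10), Durations.seconds(30)); // windowed aggregation

      counts.print();   // output operation: push each batch's result to the output system
      ssc.start();
      ssc.awaitTermination();
    }
  }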

Page 56: Building Big Data Streaming Architectures

Spark basic concepts

Page 57: Building Big Data Streaming Architectures
Page 58: Building Big Data Streaming Architectures

Conclusions…

• Think of streaming when there is a hard constraint on time-to-information

• Use a queue system as your place of orchestration

• Select the processing system that best suits your use case

• Samza: early stage, more to come in the near future.

• Spark: a good option if mini-batch will always work for you.

• Storm: a good option if you can set up the infrastructure. DRPC provides an interesting pattern for some use cases.

• Flink: a reduced ecosystem because of its shorter history, but its design learnt from all past frameworks and it is the most flexible.

• Datastream: the original inspiration for Flink. A good and flexible model if you want to go the managed route and make use of the Google toolbox (Bigtable, etc.)

• Kinesis: only if you have some legacy; probably better off using the Spark connector on AWS EMR.

Page 59: Building Big Data Streaming Architectures

Where to go…

• All code examples are available on GitHub

• Kafka https://github.com/torito1984/kafka-playground.git, https://github.com/torito1984/kafka-doyle-generator.git

• Spark https://github.com/torito1984/spark-doyle.git

• Storm https://github.com/torito1984/trident-doyle.git

• Flink https://github.com/torito1984/flink-sherlock.git

• Samza https://github.com/torito1984/samza-locations.git

Page 60: Building Big Data Streaming Architectures

Building streaming architectures
David Martinez Rego, BigD@ta Coruña, 25-26 April 2016

