Building Big Data Streaming Architectures

  • Building streaming architectures. David Martinez Rego, BigD@ta Coruña, 25-26 April 2016

  • Index

    Why?

    When?

    What?

    How?

  • Why?

  • The world does not wait

    Big data applications are built with the sole purpose of serving a business case: gathering an understanding about the world that gives an advantage.

    The need for streaming applications arises from the fact that, in many applications, the value of the information gathered drops dramatically with time.

  • Batch/streaming duality

    Streaming applications can bring value by giving an approximate answer just in time. If timing is not an issue (e.g. daily), batch pipelines can provide a good solution.

    [Chart: value of information vs. time, comparing streaming and batch answers]

  • When?

  • Start big, grow small

    Despite vendor advertising, jumping into a streaming application is not always advisable.

    It is harder to get right and you run into limitations: probabilistic data structures, guarantees, ...

    The value of the data you are about to gather is not clear during a discovery phase.

    Some newer libraries provide the same set of primitives for both batch and streaming, so it is possible to develop the core of the idea in batch and translate it to a streaming pipeline later.

  • Not always practical

    As a developer, you can face any of the following situations

    It is mandatory

    It is doubtful

    It will never be necessary

  • What?

  • [Diagram] Reference architecture: Gathering → Brokering → Processing/Analysis → Sink, running on an execution engine with a coordination service; the stages are persistent or ~persistent, and the sinks connect to external systems.

  • https://www.mapr.com/developercentral/lambda-architecture

  • Lambda architecture

    Batch layer (e.g. Spark, HDFS): processes the master dataset (append only) to precompute batch views (the views the front end will query).

    Speed layer (streaming): computes ephemeral views based only on recent data.

    Motto: take reprocessing and recovery into account.

  • Lambda architecture

    Problems:

    Maintaining two code bases in sync (they often diverge because the speed layer cannot reproduce exactly the same computation).

    Synchronising the two layers at the query layer is an additional problem.

  • [Diagram] Kappa reprocessing, step 1: gathering feeds the brokering layer (the reservoir of master data); the production pipeline keeps serving production while a new pipeline replays the log and catches up.

  • [Diagram] Kappa reprocessing, step 2: once the new pipeline has caught up, it becomes the production pipeline and the old pipeline is retired.

  • Kappa approach

    Maintain only one code base and reduce the accidental complexity that comes from using too many technologies.

    You can roll back if something goes wrong.

    Not a silver bullet and not a prescription of technologies, just a framework.

  • [Diagram] The reference architecture again: Gathering → Brokering → Processing/Analysis → Sink, with execution engine, coordination and external systems.

  • How?

  • Concepts are basic

    There are multiple frameworks available nowadays, and each changes terminology to differentiate itself.

    This makes getting started with streaming a bit confusing.

  • Concepts are basic

    This makes getting started with streaming a bit confusing.

    In reality, many concepts are shared between the frameworks, and they are quite logical.

  • Step 1: data structure

    The basic data structure is made of 4 elements:

    Sink: where is this thing going?

    Partition key: to which shard?

    Sequence id: when was this produced?

    Data: anything that can be serialised (JSON, Avro, a photo, ...)

    (Sink, Partition key, Sequence id, Data)
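    As a rough illustration, the 4-tuple can be written down as a plain Java class; the field names below are illustrative and not tied to any particular framework.

```java
// Minimal sketch of the 4-element streaming record described above.
// Field names are illustrative, not taken from any specific framework.
public final class StreamRecord {
    final String sink;          // where is this record going? (topic / endpoint)
    final String partitionKey;  // decides which shard receives it
    final long sequenceId;      // when was it produced? (offset or timestamp)
    final byte[] data;          // any serialisable payload (JSON, Avro, a photo, ...)

    public StreamRecord(String sink, String partitionKey, long sequenceId, byte[] data) {
        this.sink = sink;
        this.partitionKey = partitionKey;
        this.sequenceId = sequenceId;
        this.data = data;
    }
}
```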

  • Step 2: hashing

    The holy grail trick of big data for splitting up work, and also a major building block of streaming.

    We use hashing in the reverse of the classical way: we force the items we are interested in grouping to collide on the same partition.

    (Sink, Partition key, Sequence id, Data) → h(k) mod N
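    A minimal sketch of the partitioning trick, assuming a CRC32 hash purely for illustration (real brokers each use their own hash function):

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Sketch: map a partition key to one of N partitions with h(k) mod N.
public final class Partitioner {
    static int partitionFor(String partitionKey, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(partitionKey.getBytes(StandardCharsets.UTF_8));
        // Records with the same key always land in the same partition,
        // so the events we care about deliberately "collide" on one shard.
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("user-42", 8)); // same key -> same partition every run
    }
}
```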

  • Step 3: fault tolerance

    Distributed computing is parallel computing when you cannot trust anything or anyone

  • Step 3: fault tolerance

    At any point, any node producing data at the source can stop working.

    Non persistent: the data is lost.

    Persistent: the data is replicated so it can always be recovered from another node.

  • Step 3: fault tolerance

    At any point, any node computing our pipeline can go down.

    At most once: we let data be lost; once delivered, it is not reprocessed.

    At least once: we ensure delivery; messages can be reprocessed.

    Exactly once: we ensure delivery and no reprocessing.

  • Step 3: fault tolerance

    At any point, any node computing our pipeline can go down.

    Checkpointing: if the pipeline has been running for hours and something goes wrong, do we have to start from the beginning?

    Streaming systems put mechanisms in place to checkpoint progress, so a new worker knows the previous state and where to start from.

    This usually involves other systems to save checkpoints and synchronise.
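    As an example of how an engine exposes this, Flink's DataStream API switches checkpointing on with a single call (the 5 second interval below is an arbitrary choice):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Sketch: enable periodic checkpoints so a restarted job resumes from the
// last completed snapshot instead of reprocessing everything from the start.
public class CheckpointingSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000, CheckpointingMode.EXACTLY_ONCE); // snapshot every 5 s
        // ... sources, transformations and sinks would be defined here before env.execute(...)
    }
}
```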

  • Step 4: delivery

    One at a time: we process each message individually. This increases the processing overhead per message.

    Micro-batch: we always process data in batches gathered at specified time intervals or sizes. This makes it impossible to push per-message latency below the batch interval.
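    A framework-free sketch of the two models, using an in-memory queue as a stand-in for the broker; all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class DeliveryModels {
    // One at a time: handle each message as soon as it arrives (per-message overhead).
    static void oneAtATime(BlockingQueue<String> queue) throws InterruptedException {
        while (true) {
            process(List.of(queue.take()));
        }
    }

    // Micro-batch: buffer messages for a fixed interval, then process them together.
    // End-to-end latency can never drop below intervalMs.
    static void microBatch(BlockingQueue<String> queue, long intervalMs) throws InterruptedException {
        while (true) {
            List<String> batch = new ArrayList<>();
            long deadline = System.currentTimeMillis() + intervalMs;
            while (System.currentTimeMillis() < deadline) {
                String msg = queue.poll(deadline - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
                if (msg != null) batch.add(msg);
            }
            process(batch);
        }
    }

    static void process(List<String> messages) {
        messages.forEach(System.out::println);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(List.of("hello", "world"));
        microBatch(queue, 1000); // runs forever; stop with Ctrl-C
    }
}
```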

  • [Diagram] The reference architecture again: Gathering → Brokering → Processing/Analysis → Sink.

  • Gathering

  • [Diagram] Producers emit messages of the form (Topic, Partition key, Data) into the brokering layer.

  • [Diagram] Each (Topic, Partition key, Data) message is routed to one of the topic's partitions by h(k) mod N on its partition key.

  • [Diagram] Consumer 1 reads from the topic's partitions; Zookeeper coordinates which consumer owns which partition.

  • [Diagram] When Consumer 2 joins the group, Zookeeper rebalances the partitions between Consumer 1 and Consumer 2.

  • Produce to Kafka, consume from Kafka
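    A minimal round trip with the Kafka Java clients; the broker address, topic name and group id below are assumptions for a local setup:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaRoundTrip {
    public static void main(String[] args) {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        p.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(p)) {
            // The record key acts as the partition key: h(key) mod N picks the partition.
            producer.send(new ProducerRecord<>("events", "user-42", "{\"clicked\":\"buy\"}"));
        }

        Properties c = new Properties();
        c.put("bootstrap.servers", "localhost:9092");
        c.put("group.id", "demo-consumers");
        c.put("auto.offset.reset", "earliest");
        c.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        c.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(c)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> r : records) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        r.partition(), r.offset(), r.key(), r.value());
            }
        }
    }
}
```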

  • [Diagram] The reference architecture once more: Gathering → Brokering → Processing/Analysis → Sink.

  • Framework comparison

    | Framework       | One-at-a-time | Mini-batch | Exactly-once | Deploy       | Windowing | Functional | Catch                |
    |-----------------|---------------|------------|--------------|--------------|-----------|------------|----------------------|
    | Storm           | Yes           | Yes *      | Yes *        | Custom, YARN | Yes *     | ~          | DRPC                 |
    | Spark Streaming | No            | Yes        | Yes          | YARN, Mesos  | Yes       | Yes        | MLlib, ecosystem     |
    | Flink           | Yes           | Yes        | Yes          | YARN         | Yes       | Yes        | Flexible windowing   |
    | Samza           | Yes           | ~          | No           | YARN         | ~         | No         | DB update log plugin |
    | Google Dataflow | Yes           | Yes        | Yes          | Google       | Yes       | ~          | Google ecosystem     |
    | Kinesis         | Yes           | you        | No           | AWS          | you       | No         | AWS ecosystem        |

    * with Trident

  • Flink basic concepts

    Stream: a source of data that feeds computations (a batch dataset is a bounded stream).

    Transformations: operations that take one or more streams as input and compute an output stream. They can be stateless or stateful (exactly once).

    Sink: an endpoint that receives the output stream of a transformation.

    Dataflow: a DAG of streams, transformations and sinks.
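    A short DataStream sketch tying these concepts together (a socket source and a word count, with print() as the sink; host and port are placeholders):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class FlinkWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stream: an unbounded source (a plain socket, purely for illustration).
        DataStream<String> lines = env.socketTextStream("localhost", 9999);

        // Transformations: a stateless flatMap followed by a stateful keyed sum.
        DataStream<Tuple2<String, Integer>> counts = lines
                .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                        for (String word : line.toLowerCase().split("\\W+")) {
                            out.collect(Tuple2.of(word, 1));
                        }
                    }
                })
                .keyBy(t -> t.f0)
                .sum(1);

        // Sink: where the output stream ends up (stdout here).
        counts.print();

        // Dataflow: the DAG built above only runs when execute() is called.
        env.execute("flink-wordcount-sketch");
    }
}
```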

  • Flink basic concepts

  • Flink basic concepts

  • Samza basic concepts

    Streams: persistent sets of immutable messages of a similar type and category, with a transactional nature.

    Jobs: code that performs logical transformations on a set of input streams and appends to a set of output streams.

    Partitions: each stream breaks into partitions, each a totally ordered sequence of messages.

    Tasks: each task consumes data from one partition.

    Dataflow: a composition of jobs that connects a set of streams.

    Containers: the physical unit of parallelism.
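    The low-level Samza API boils down to implementing StreamTask; the "kafka" system name and output topic below are assumptions that would normally live in the job's configuration:

```java
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

// Sketch of a Samza job: each task instance consumes one partition of the
// input stream and appends transformed messages to an output stream.
public class UppercaseTask implements StreamTask {
    private static final SystemStream OUTPUT = new SystemStream("kafka", "events-uppercased");

    @Override
    public void process(IncomingMessageEnvelope envelope,
                        MessageCollector collector,
                        TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        collector.send(new OutgoingMessageEnvelope(OUTPUT, message.toUpperCase()));
    }
}
```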

  • Samza basic concepts

  • Storm basic concepts

    Spout: a source of data from any external system.

    Bolts: transformations of one or more streams into another set of output streams.

    Stream grouping: how streaming data is shuffled between bolts.

    Topology: the set of spouts and bolts that processes a stream of data.

    Tasks and Workers: the units of work deployable into one container. A worker can process one or more tasks; a task is deployed to one worker.
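    A compact topology sketch (a toy spout and one bolt, wired with a stream grouping and run on a LocalCluster); the package names assume Storm 1.x or later:

```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.utils.Utils;

public class StormSketch {

    // Spout: emits a sentence once per second (stand-in for an external source).
    public static class SentenceSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector collector) {
            this.collector = collector;
        }
        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("the quick brown fox"));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("sentence"));
        }
    }

    // Bolt: splits each incoming sentence into words.
    public static class SplitBolt extends BaseBasicBolt {
        public void execute(Tuple input, BasicOutputCollector collector) {
            for (String word : input.getStringByField("sentence").split(" ")) {
                collector.emit(new Values(word));
            }
        }
        public void declareOutputFields(OutputFieldsDeclarer d) {
            d.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("sentences", new SentenceSpout(), 1);
        // The stream grouping decides how tuples are shuffled between bolt tasks.
        builder.setBolt("split", new SplitBolt(), 2).shuffleGrouping("sentences");

        new LocalCluster().submitTopology("storm-sketch", new Config(), builder.createTopology());
    }
}
```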

  • Storm basic concepts

    [Diagram] A Trident topology compiles down to a regular Storm topology.

  • Spark basic concepts

    DStream: a continuous stream of data represented by a series of RDDs; each RDD contains the data for a specific time interval.

    Input DStream and Receiver: the source of data that feeds a DStream.

    Transformations: operations that transform one DStream into another DStream (stateless, and stateful with exactly-once semantics).

    Output operations: operations that periodically push the data of a DStream to a specific output system.
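    A minimal DStream sketch (socket input, word count, print() as the output operation; the 2 second batch interval and the host/port are placeholders):

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class SparkStreamingSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]");
        // Each micro-batch covers 2 seconds of data; the DStream is the series of those RDDs.
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(2));

        // Input DStream backed by a receiver (a socket source, for illustration).
        JavaReceiverInputDStream<String> lines = ssc.socketTextStream("localhost", 9999);

        // Stateless transformations applied to every micro-batch.
        JavaPairDStream<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        // Output operation: periodically push each batch's result to an external system (stdout here).
        counts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```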

  • Spark basic concepts

  • Conclusions

    Think about streaming when there is a hard constraint on time-to-information.

    Use a queue system as your place of orchestration.

    Select the processing system that best suits your use case:

    Samza: early stage, more to come in the near future.

    Spark: a good option if mini-batch will always work for you.

    Storm: a good option if you can set up the infrastructure. DRPC provides an interesting pattern for some use cases.

    Flink: a reduced ecosystem because it has a shorter history, but its design learnt from all past frameworks and it is the most flexible.

    Dataflow (Google): the original inspiration for Flink. A good and flexible model if you want to go the managed route and make use of the Google toolbox (Bigtable, etc.).

    Kinesis: only if you have some legacy. You are probably better off using the Spark connector instead.