Università degli Studi di Roma “Tor Vergata” Dipartimento di Ingegneria Civile e Ingegneria Informatica
DSP Frameworks
Corso di Sistemi e Architetture per Big Data A.A. 2017/18
Valeria Cardellini
DSP frameworks we consider
• Apache Storm (with lab)
• Twitter Heron
  – From Twitter, like Storm, and compatible with Storm
• Apache Spark Streaming (lab)
  – Splits each stream into small batches and processes them (micro-batch processing)
• Apache Flink
• Apache Samza
• Cloud-based frameworks
  – Google Cloud Dataflow
  – Amazon Kinesis Streams
1 Valeria Cardellini - SABD 2017/18
Apache Storm
• Apache Storm
  – Open-source, real-time, scalable streaming system
  – Provides an abstraction layer to execute DSP applications
  – Initially developed by Twitter
• Topology
  – DAG of spouts (sources of streams) and bolts (operators and data sinks)
Stream grouping in Storm
• Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)?
• Shuffle grouping
  – Randomly partitions the tuples
• Fields grouping
  – Hashes on a subset of the tuple attributes
Stream grouping in Storm
• All grouping (i.e., broadcast)
  – Replicates the entire stream to all the consumer tasks
• Global grouping
  – Sends the entire stream to a single task of a bolt
• Direct grouping
  – The producer of the tuple decides which task of the consumer will receive this tuple
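The grouping strategies can be illustrated with a minimal Python sketch. It is not Storm's API: the function names, the dict-based tuples, and the use of CRC32 as hash function are assumptions made only for illustration. Each function returns the list of consumer task indices that receive a tuple.

```python
import random
import zlib

def shuffle_grouping(tup, num_tasks, rng=random):
    """Shuffle grouping: pick one consumer task at random."""
    return [rng.randrange(num_tasks)]

def fields_grouping(tup, num_tasks, fields):
    """Fields grouping: hash the selected attributes, so tuples with
    equal values for those fields always reach the same task."""
    key = "|".join(str(tup[f]) for f in fields).encode("utf-8")
    return [zlib.crc32(key) % num_tasks]  # CRC32 is an assumption of this sketch

def all_grouping(tup, num_tasks):
    """All grouping (broadcast): every consumer task gets a copy."""
    return list(range(num_tasks))

def global_grouping(tup, num_tasks):
    """Global grouping: the whole stream goes to one task (here, task 0)."""
    return [0]

def direct_grouping(tup, num_tasks, chosen_task):
    """Direct grouping: the producer names the receiving task explicitly."""
    if not 0 <= chosen_task < num_tasks:
        raise ValueError("chosen_task out of range")
    return [chosen_task]

if __name__ == "__main__":
    t = {"word": "storm", "count": 1}
    print(fields_grouping(t, 4, ["word"]), all_grouping(t, 3))
```

Fields grouping is what makes stateful operators such as a word counter correct under data parallelism: all occurrences of the same word are counted by the same task.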
Storm architecture
• Master-worker architecture
Storm components: Nimbus and ZooKeeper
• Nimbus
  – The master node
  – Clients submit topologies to it
  – Responsible for distributing and coordinating the topology execution
• ZooKeeper
  – Nimbus uses a combination of the local disk(s) and ZooKeeper to store state about the topology
[Figure: a worker process (Java process) contains executors (threads); each executor runs one or more tasks]
Storm components: worker
• Task: operator instance
  – The actual work for a bolt or a spout is done in the task
• Executor: smallest schedulable entity
  – Executes one or more tasks related to the same operator
• Worker process: Java process running one or more executors
• Worker node: computing resource, a container for one or more worker processes
Storm components: supervisor
• Each worker node runs a supervisor
The supervisor:
• receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
• sends a periodic heartbeat to Nimbus (through ZooKeeper)
• advertises the topologies that it is currently running, and any vacancies that are available to run more topologies
Twitter Heron
• Realtime, distributed, fault-tolerant stream processing engine from Twitter
• Developed as direct successor of Storm
  – Released as open source in 2016: https://twitter.github.io/heron/
  – De facto stream data processing engine inside Twitter
• Goal of overcoming Storm’s performance, reliability, and other shortcomings
• Compatibility with Storm
  – API compatible with Storm: no code change is required for migration
Heron: in common with Storm
• Same terminology as Storm
  – Topology, spout, bolt
• Same stream groupings
  – Shuffle, fields, all, global
• Example: WordCount topology
Heron: design goals
• Isolation
  – Process-based topologies rather than thread-based
  – Each process should run in isolation (easy debugging, profiling, and troubleshooting)
  – Goal: overcoming Storm’s performance, reliability, and other shortcomings
• Resource constraints
  – Safe to run in shared infrastructure: topologies use only initially allocated resources and never exceed bounds
• Compatibility
  – Fully API and data model compatible with Storm
Heron: design goals
• Backpressure
  – Built-in rate control mechanism to ensure that topologies can self-adjust in case components lag
  – Heron dynamically adjusts the rate at which data flows through the topology using backpressure
• Performance
  – Higher throughput and lower latency than Storm
  – Enhanced configurability to fine-tune potential latency/throughput trade-offs
• Semantic guarantees
  – Support for both at-most-once and at-least-once processing semantics
• Efficiency
  – Minimum possible resource usage
Heron topology architecture
• Master-worker architecture
• One Topology Master (TM)
  – Manages a topology throughout its entire lifecycle
• Multiple Containers
  – Each Container runs multiple Heron Instances, a Stream Manager, and a Metrics Manager
  – A Heron Instance is a process that handles a single task of a spout or bolt
  – Containers communicate with the TM to ensure that the topology forms a fully connected graph
Heron topology architecture
• Stream Manager (SM): routing engine for data streams
  – Each Heron Instance connects to its local SM, while all of the SMs in a given topology connect to one another to form a network
  – Responsible for propagating backpressure
Topology submit sequence
Self-adaptation in Heron
• Dhalion: framework on top of Heron to autonomously reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed
• Phases in Dhalion:
  – Symptom detection (backpressure, skew, …)
  – Diagnosis generation
  – Resolution
• Adaptation actions: parallelism changes
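The three Dhalion phases can be read as a small control loop. The sketch below is hypothetical and is not Dhalion's API: the metric name `backpressure_ratio`, the threshold, and the "add one instance" policy are assumptions chosen only to make the detect, diagnose, resolve sequence concrete.

```python
def detect_symptoms(metrics, backpressure_threshold=0.1):
    """Phase 1: turn raw metrics into symptoms (only backpressure here)."""
    return [("backpressure", bolt)
            for bolt, m in metrics.items()
            if m["backpressure_ratio"] > backpressure_threshold]

def diagnose(symptoms):
    """Phase 2: explain symptoms; here backpressure is assumed to mean
    that the bolt is under-provisioned."""
    return [("under_provisioned", bolt)
            for kind, bolt in symptoms if kind == "backpressure"]

def resolve(diagnoses, parallelism):
    """Phase 3: produce a new parallelism plan (one more instance per
    under-provisioned bolt); the input plan is left untouched."""
    plan = dict(parallelism)
    for kind, bolt in diagnoses:
        if kind == "under_provisioned":
            plan[bolt] += 1
    return plan

if __name__ == "__main__":
    metrics = {"splitter": {"backpressure_ratio": 0.4},
               "counter": {"backpressure_ratio": 0.0}}
    current = {"splitter": 2, "counter": 2}
    print(resolve(diagnose(detect_symptoms(metrics)), current))
```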
Heron environment
• Heron supports deployment on Apache Mesos
• Can also run on Mesos using Apache Aurora as a scheduler, or using a local scheduler
Batch processing vs. stream processing
• Batch processing is just a special case of stream processing
Batch processing vs. stream processing
• Batched/stateless: scheduled in batches
  – Short-lived tasks (Hadoop, Spark)
  – Distributed streaming over batches (Spark Streaming)
• Dataflow/stateful: continuous/scheduled once (Storm, Flink, Heron)
  – Long-lived task execution
  – State is kept inside tasks
Native vs. non-native streaming
Apache Flink
• Distributed data flow processing system
• One common runtime for DSP applications and batch processing applications
  – Batch processing applications run efficiently as special cases of DSP applications
• Integrated with many other projects in the open-source data processing ecosystem
• Derives from the Stratosphere project by TU Berlin, Humboldt University and Hasso Plattner Institute
• Supports a Storm-compatible API
Flink: software stack
• On top: libraries with high-level APIs for different use cases, still in beta
Flink: programming model
• Data stream
  – An unbounded, partitioned, immutable sequence of events
• Stream operators
  – Stream transformations that generate new output data streams from input ones
Flink: some features
• Supports stream processing and windowing with event-time semantics
  – Event time makes it easy to compute over streams where events arrive out of order or delayed
• Exactly-once semantics for stateful computations
• Highly flexible streaming windows
Flink: some features
• Continuous streaming model with backpressure
• Flink's streaming runtime has natural flow control: slow data sinks backpressure faster sources
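As a rough analogy for this flow control (not Flink's actual implementation), a bounded buffer between a fast source and a slow sink is enough to produce backpressure: once the buffer is full, the source blocks until the sink frees a slot. The function name, buffer size, and sink delay below are assumptions of the sketch.

```python
import queue
import threading
import time

def run_pipeline(num_items=20, buffer_size=4, sink_delay=0.001):
    """Fast source -> bounded buffer -> slow sink.
    Returns the consumed items and the largest buffer depth observed."""
    buf = queue.Queue(maxsize=buffer_size)
    consumed = []

    def sink():
        while True:
            item = buf.get()
            if item is None:          # end-of-stream marker
                break
            time.sleep(sink_delay)    # simulate a slow sink
            consumed.append(item)

    worker = threading.Thread(target=sink)
    worker.start()

    max_depth = 0
    for i in range(num_items):
        buf.put(i)                    # blocks while the buffer is full
        max_depth = max(max_depth, buf.qsize())
    buf.put(None)
    worker.join()
    return consumed, max_depth

if __name__ == "__main__":
    items, depth = run_pipeline()
    print(len(items), depth)
```

No record is dropped and the buffer never grows beyond its bound: the source is simply slowed down to the rate of the sink.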
Flink: APIs and libraries
• Streaming data applications: DataStream API
  – Supports functional transformations on data streams, with user-defined state, and flexible windows
  – Example: how to compute a sliding histogram of word occurrences of a data stream of texts
WindowWordCount in Flink's DataStream API
Sliding time window of 5 sec length and 1 sec trigger interval
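The logic of such a sliding-window word count can be sketched in plain Python. This is not Flink code and does not use the DataStream API: the function name and the `(timestamp, text)` event format are assumptions, and windows are assigned from the timestamp carried by each event. Window `start` covers the interval `[start, start + size)`.

```python
from collections import Counter

def sliding_word_counts(events, size=5, slide=1):
    """events: iterable of (timestamp_in_seconds, text), integer timestamps.
    Returns {window_start: Counter of words} for sliding windows of
    length `size` that start every `slide` seconds."""
    windows = {}
    for ts, text in events:
        words = text.lower().split()
        # An event belongs to every window whose start lies in (ts - size, ts].
        start = (ts // slide) * slide
        while start > ts - size:
            windows.setdefault(start, Counter()).update(words)
            start -= slide
    return windows

if __name__ == "__main__":
    stream = [(0, "to be or not"), (1, "to be"), (6, "be")]
    for start, counts in sorted(sliding_word_counts(stream).items()):
        print(start, dict(counts))
```

With a 5-second window and a 1-second slide, every event contributes to five overlapping windows, which is why sliding windows cost more state than tumbling ones.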
Flink: APIs and libraries
• Batch processing applications: DataSet API
• Supports a wide range of data types beyond key/value pairs, and a wealth of operators
Core loop of the PageRank algorithm for graphs
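For reference, the iteration at the heart of PageRank can be written in a few lines of plain Python. This is a sketch of the algorithm, not the Flink DataSet program: the function name, the damping factor of 0.85, the fixed number of iterations, and the uniform redistribution of rank from pages without outgoing links are assumptions.

```python
def pagerank(links, damping=0.85, iterations=20):
    """links: {page: [pages it links to]}. Returns {page: rank}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        contrib = {p: 0.0 for p in pages}
        dangling = 0.0
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Each page splits its rank evenly among its out-links.
                share = ranks[page] / len(targets)
                for t in targets:
                    contrib[t] += share
            else:
                # Pages without out-links spread their rank over all pages.
                dangling += ranks[page]
        ranks = {p: (1 - damping) / n + damping * (contrib[p] + dangling / n)
                 for p in pages}
    return ranks

if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(graph))
```

In a dataflow engine the inner loop becomes a join between ranks and links followed by a group-by-sum on the target page, repeated as an iteration operator.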
Flink: program optimization
• Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and when intermediate data should be cached
Flink: control events
• Control events: special events injected in the data stream by operators
• Periodically, the data source injects checkpoint barriers into the data stream, dividing the stream into a pre-checkpoint part and a post-checkpoint part
  – More coarse-grained approach than Storm: acks sequences of records instead of individual records
• Watermarks signal the progress of event time within a stream partition
• Flink does not provide ordering guarantees after any form of stream repartitioning or broadcasting
  – Dealing with out-of-order records is left to the operator implementation
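How a watermark lets an operator cope with out-of-order records can be sketched as follows. This is a simplified model, not Flink's implementation: the function name, the tumbling windows, and the "largest timestamp seen minus a fixed delay" watermark are assumptions. A window is emitted once the watermark passes its end; records for already closed windows are reported as late.

```python
def tumbling_counts_with_watermark(timestamps, window=10, max_delay=3):
    """timestamps: event times in arrival order (possibly out of order).
    Returns (emitted, late): emitted is a list of (window_start, count)
    in emission order, late is the list of timestamps that arrived after
    their window had been closed."""
    open_windows = {}
    emitted, late = [], []
    watermark = float("-inf")
    for ts in timestamps:
        start = (ts // window) * window
        if start + window <= watermark:
            late.append(ts)                      # window already closed
        else:
            open_windows[start] = open_windows.get(start, 0) + 1
        # Watermark: no event older than this is expected anymore.
        watermark = max(watermark, ts - max_delay)
        for s in sorted(open_windows):
            if s + window <= watermark:
                emitted.append((s, open_windows.pop(s)))
    return emitted, late

if __name__ == "__main__":
    print(tumbling_counts_with_watermark([1, 4, 12, 9, 14, 2, 25]))
```

In the example, the record with timestamp 9 arrives after 12 but is still counted in window 0, whereas the record with timestamp 2 arrives after that window was emitted and is treated as late. Windows still open when the input ends are not flushed in this sketch.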
Flink: fault-tolerance
• Based on Chandy-Lamport distributed snapshots
• Lightweight mechanism
  – Maintains high throughput rates while providing strong consistency guarantees
Flink: performance and memory management
• High performance and low latency
• Memory management
  – Flink implements its own memory management inside the JVM
Flink: architecture
• The usual master-worker architecture
Flink: architecture
• Master (Job Manager): schedules tasks, coordinates checkpoints, coordinates recovery on failures, etc.
• Workers (Task Managers): JVM processes that execute tasks of a dataflow, and buffer and exchange the data streams
  – Each worker uses task slots to control the number of tasks it accepts
  – Each task slot represents a fixed subset of resources of the worker
Flink: application execution
• Jobs are expressed as data flows
• The job graph is transformed into the execution graph
• The execution graph contains the information needed to schedule and execute a job
Flink: infrastructure
• Designed to run on large-scale clusters with many thousands of nodes
• Provides support for YARN and Mesos
A recent need
• A common need for many companies
  – Run both batch and stream processing
• Alternative solutions
  – Lambda architecture
  – Unified frameworks: Spark, Flink, Samza
  – Unified programming model: Apache Beam
Lambda architecture
• Data-processing design pattern to integrate batch and real-time processing
• Streaming framework used to process real-time events and, in parallel, batch framework (e.g., Hadoop/Spark) deployed to process the entire dataset (perhaps periodically)
• Results from the two parallel pipelines are then merged
Source: https://voltdb.com/products/alternatives/lambda-architecture
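The merge step can be illustrated with a toy example in Python. The function names and the page-view counting use case are assumptions: the batch layer recomputes counts over the whole dataset available at the last batch run, the streaming layer counts only the events that arrived since then, and a query combines the two views.

```python
def count_by_user(events):
    """Shared counting logic; in a real Lambda deployment this is typically
    implemented twice, once per framework."""
    counts = {}
    for user in events:
        counts[user] = counts.get(user, 0) + 1
    return counts

def merge_views(batch_view, realtime_view):
    """Serving layer: combine the (complete but stale) batch view with
    the (fresh but partial) real-time view."""
    merged = dict(batch_view)
    for user, n in realtime_view.items():
        merged[user] = merged.get(user, 0) + n
    return merged

if __name__ == "__main__":
    already_batched = ["alice", "bob", "alice"]   # covered by last batch run
    since_last_batch = ["alice", "carol"]         # seen only by the stream job
    print(merge_views(count_by_user(already_batched),
                      count_by_user(since_last_batch)))
```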
Lambda architecture: example
• Lambda architecture used at LinkedIn before Samza development
Lambda architecture
• Pros:
  – Flexibility in the choice of frameworks
• Cons:
  – Implementing and maintaining two separate frameworks for batch and stream processing can be hard and error-prone
  – Overhead of developing and managing multiple code bases
    • The logic in each fork evolves over time, and keeping them in sync involves duplicated and complex manual effort, often with different languages
Unified frameworks
• Use a unified (Lambda-less) design for processing both real-time and batch data using the same data structure
• Spark, Flink and Samza follow this trend
Apache Beam
• A new layer of abstraction
• Provides an advanced unified programming model
  – Allows defining batch and streaming data processing pipelines that run on any supported execution engine (for now: Flink, Spark, Google Cloud Dataflow)
  – Java, Python and Go as programming languages
• Translates the data processing pipeline defined by the user with the Beam program into the API compatible with the chosen distributed processing engine
• Developed by Google and released as open-source top-level project
Apache Samza
• A distributed system for stateful and fault-tolerant stream processing
  – A unified framework for batch and stream processing: similarly to Flink, streams as first-class citizen, batch as special case of streaming
  – In production at LinkedIn since 2014
• Uses Kafka for messaging and YARN to provide fault tolerance, processor isolation, security, and resource management
Apache Samza
• Why stateful and fault-tolerant processing? User profiles, email digests, aggregate counts, …
• Example: Email Digestion System
  – Production application running at LinkedIn using Samza to digest updates into one email
Apache Samza: features
• Provides distributed and scalable data processing with
  – Configurable and heterogeneous data sources and sinks (e.g., Kafka, HDFS, AWS Kinesis)
  – Efficient state management: local state (in memory or on disk) partitioned among tasks (rather than using a remote data store) and incremental checkpointing (more efficient than full state checkpointing)
  – Unified processing API for stream and batch processing
    • Differently from Flink
  – Flexible deployment models
Apache Samza: architecture
• Samza is made up of three layers:
  – A streaming layer: Kafka
  – An execution layer: YARN
  – A processing layer: Samza API
Apache Samza: architecture
• Samza, YARN and Kafka interaction
• RM: YARN ResourceManager
• NM: YARN NodeManager
• AM: ApplicationMaster
• Samza client uses YARN to run a Samza job
• YARN starts and supervises one or more SamzaContainers, and the processing code (using the StreamTask API) runs inside those containers
• The input and output for the Samza StreamTasks come from Kafka brokers
Apache Samza: running an application
• Count the number of page views aggregated by user ID
• Two Samza jobs: one to group messages by user ID, the other to do the counting
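The two-job structure can be mimicked in plain Python. This sketch does not use the Samza API: the function names, the dict-based messages, and the CRC32 hash are assumptions. The first stage repartitions page views by user ID, so that the second stage can count each partition independently, knowing that it sees all the views of its users.

```python
import zlib

def partition_by_user(page_views, num_partitions):
    """Job 1: route each page view to a partition chosen by hashing the
    user ID, so all views of a user land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for view in page_views:
        p = zlib.crc32(view["user_id"].encode("utf-8")) % num_partitions
        partitions[p].append(view)
    return partitions

def count_views(partition):
    """Job 2: count page views per user within one partition."""
    counts = {}
    for view in partition:
        counts[view["user_id"]] = counts.get(view["user_id"], 0) + 1
    return counts

if __name__ == "__main__":
    views = [{"user_id": "alice", "page": "/home"},
             {"user_id": "bob", "page": "/jobs"},
             {"user_id": "alice", "page": "/feed"}]
    for i, part in enumerate(partition_by_user(views, 2)):
        print(i, count_views(part))
```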
Apache Samza: state management
• Common approach (e.g., in Storm) to deal with large amounts of state: use a remote data store (e.g., Redis)
• Samza approach: keep state local to each node and make it robust to machine failures by replicating state changes across multiple machines
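A minimal sketch of this idea, assuming a key-value store and an append-only changelog: every write goes to the local store and is also appended to the log, which stands in here for a replicated stream (for example a Kafka topic). After a failure, a new task rebuilds its state by replaying the log. The class and method names are hypothetical and are not Samza's API.

```python
class LocalStore:
    """Task-local key-value state backed by a replicated changelog."""

    def __init__(self, changelog):
        self.state = {}
        self.changelog = changelog   # a list standing in for a replicated log

    def put(self, key, value):
        self.state[key] = value              # fast local write
        self.changelog.append((key, value))  # make the change durable

    def get(self, key, default=None):
        return self.state.get(key, default)  # local read, no remote call

    @classmethod
    def restore(cls, changelog):
        """Rebuild the state on another machine by replaying the log."""
        store = cls(changelog)
        for key, value in list(changelog):
            store.state[key] = value         # later entries overwrite earlier ones
        return store

if __name__ == "__main__":
    log = []
    store = LocalStore(log)
    store.put("alice", 1)
    store.put("alice", 2)
    store.put("bob", 1)
    recovered = LocalStore.restore(log)      # e.g., after a machine failure
    print(recovered.state)
```

Reads and writes stay local, which avoids one remote call per message to an external store; the price is the time needed to replay the log on recovery.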
Apache Samza: high-level API
Apache Samza: example application
• Samza job to find top-k trending tags
  – DAG for the logical representation of the application
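The computation performed by such a job can be sketched in Python as three steps: extract the tags from each message, count them, and keep the k most frequent. This is not the Samza application: the function name, the `#tag` syntax, and the absence of windowing are assumptions of the sketch.

```python
import re
from collections import Counter

def top_k_tags(messages, k=3):
    """Extract #tags from each message, count them (case-insensitively),
    and return the k most frequent as (tag, count) pairs."""
    counts = Counter()
    for message in messages:
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", message))
    return counts.most_common(k)

if __name__ == "__main__":
    stream = ["#flink rocks #storm", "#Flink again", "#samza and #flink"]
    print(top_k_tags(stream, k=2))
```

In a distributed job the counting step would be partitioned by tag, with a final operator merging the partial top-k lists.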
Towards strict delivery guarantees
• Most frameworks provide at-least-once delivery guarantees (e.g., Storm, Samza)
  – For stateful non-idempotent operators such as counting, at-least-once delivery guarantees can give incorrect results
• Flink, Storm with Trident, and Google’s MillWheel offer stronger delivery guarantees (i.e., exactly-once)
  – Exactly-once low-latency stream processing in MillWheel works as follows:
    • The record is checked against de-duplication data from previous deliveries; duplicates are discarded
    • User code is run for the input record, possibly resulting in pending changes to timers, state, and productions
    • Pending changes are committed to the backing store
    • Senders are acked
    • Pending downstream productions are sent
Source: MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.
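The steps listed above can be followed in a small Python model of a counting operator. It is only a sketch under strong assumptions: in-memory structures stand in for the backing store, every record carries a unique ID, and the commit is treated as atomic. The class and attribute names are hypothetical; acking a duplicate so that the sender stops retrying is also an assumption of the sketch.

```python
class ExactlyOnceCounter:
    """Counts records per key; redelivered records do not change the count."""

    def __init__(self):
        self.seen_ids = set()   # de-duplication data from previous deliveries
        self.counts = {}        # operator state (backing-store stand-in)
        self.acked = []         # acks returned to senders
        self.sent = []          # productions sent downstream

    def deliver(self, record_id, key):
        # 1. Check against de-duplication data; discard duplicates.
        if record_id in self.seen_ids:
            self.acked.append(record_id)   # assumption: re-ack the duplicate
            return False
        # 2. Run user code, producing pending changes and productions.
        new_count = self.counts.get(key, 0) + 1
        production = (key, new_count)
        # 3. Commit pending changes (assumed atomic in this sketch).
        self.seen_ids.add(record_id)
        self.counts[key] = new_count
        # 4. Ack the sender.
        self.acked.append(record_id)
        # 5. Send pending downstream productions.
        self.sent.append(production)
        return True

if __name__ == "__main__":
    op = ExactlyOnceCounter()
    op.deliver("r1", "x")
    op.deliver("r1", "x")   # retry after a lost ack: ignored
    op.deliver("r2", "x")
    print(op.counts)
```

Without the de-duplication check, the retried record would be counted twice, which is exactly the incorrect result that at-least-once delivery can give for non-idempotent operators.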
DSP in the Cloud
• Data streaming systems are also offered as Cloud services
  – Amazon Kinesis Data Streams
  – Google Cloud Dataflow
  – IBM Streaming Analytics
  – Microsoft Azure Stream Analytics
• Abstract the underlying infrastructure and support dynamic scaling of the computing resources
• Appear to execute in a single data center
Google Cloud Dataflow
• Fully-managed data processing service, supporting both stream and batch execution of pipelines
  – Transparently handles resource lifetime and can dynamically provision resources to minimize latency while maintaining high utilization efficiency
  – On-demand and auto-scaling
• Provides a unified programming model and a managed service for developing and executing a wide range of data processing patterns including ETL, batch computation, and continuous computation
  – Apache Beam model
• Provides exactly-once processing
  – MillWheel is Google’s internal version of Cloud Dataflow
Google Cloud Dataflow
• Seamlessly integrates with other Google cloud services
  – Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery
• Apache Beam SDKs, available in Java and Python
  – Enable developers to implement custom extensions and choose other execution engines
Amazon Kinesis Data Streams
• Allows building applications that process and analyze streaming data
Kinesis Data Streams: components
• Data record
  – Unit of data stored by Kinesis
• Stream: group of data records
  – Data producers write data records to Kinesis streams
  – Data records in the stream are distributed into shards
• Shard: sequence of data records in a stream
  – Number of shards in a stream specified when the stream is created
    • Can then be increased or decreased as needed, but manually
  – Base unit of capacity: up to 1 MB/s of written data and 2 MB/s of read data
  – Partition key used to group data by shard within a stream
  – Used also for service pricing: http://amzn.to/2szRTkG
  – Data records are stored in shards temporarily (24 hours by default)
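Two small calculations make the role of shards concrete. The sketch below does not call the AWS SDK, and its function names are hypothetical. The first function maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer, under the assumption that the shards split the hash space into equal contiguous ranges. The second estimates how many shards a workload needs from the per-shard limits stated above (1 MB/s written, 2 MB/s read).

```python
import hashlib
import math

HASH_SPACE = 2 ** 128   # MD5 yields a 128-bit value

def shard_for_key(partition_key, num_shards):
    """Map a partition key to a shard index, assuming the shards own
    equal-sized, contiguous ranges of the hash space."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") * num_shards // HASH_SPACE

def shards_needed(write_mb_per_s, read_mb_per_s):
    """Smallest shard count satisfying both the write and read limits."""
    return max(1,
               math.ceil(write_mb_per_s / 1.0),   # 1 MB/s written per shard
               math.ceil(read_mb_per_s / 2.0))    # 2 MB/s read per shard

if __name__ == "__main__":
    print(shard_for_key("user-42", 4))
    print(shards_needed(write_mb_per_s=3.5, read_mb_per_s=4.0))
```

Records with the same partition key always map to the same shard, so a skewed key distribution can overload one shard even when the total throughput is within the limits.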
Kinesis Data Streams: consuming data
• Kinesis Data Streams is used to capture streaming data
• An application reads data from a Kinesis stream as data records, then uses the Kinesis Client Library (KCL) for the processing logic
  – KCL takes care of: load-balancing across multiple EC2 instances, responding to instance failures, checkpointing processed records, reacting to re-sharding (that adjusts the number of shards)
References
• Overview on DSP frameworks
  – Wingerath et al., “Real-time stream processing for Big Data”, 2016. http://bit.ly/2rUhXXG
• Kulkarni et al., “Twitter Heron: stream processing at scale”, ACM SIGMOD'15. http://bit.ly/2rUXkux
• Carbone et al., “Apache Flink: Stream and batch processing in a single engine”, Bulletin IEEE Comp. Soc. Tech. Comm. on Data Eng., 2015. http://bit.ly/2sYzoGb
• Noghabi et al., “Samza: Stateful scalable stream processing at LinkedIn”, VLDB Endowment, 2017. https://bit.ly/2LushvF
• Akidau et al., “MillWheel: fault-tolerant stream processing at Internet scale”, VLDB'13. http://bit.ly/2rE7Fa3