Tpl dataflow

Pipelines. Tpl dataflow. Use cases

Page 1: Tpl dataflow
Page 2: Tpl dataflow

Pipeline. TPL Dataflow.

Usage.

by Alexey Kursov, http://www.linkedin.com/in/kursov

Page 3: Tpl dataflow

TPL Dataflow

The Task Parallel Library (TPL) provides dataflow components to help increase the robustness of concurrency-enabled applications. These dataflow components are collectively referred to as the TPL Dataflow Library. The dataflow model provides in-process message passing for coarse-grained dataflow and pipelining tasks...

Page 4: Tpl dataflow

WTF?

Pipeline? Dataflow?

Page 5: Tpl dataflow

Pipeline basics

In software engineering, a pipeline consists of a chain of processing elements (processes, threads, coroutines, etc.), arranged so that the output of each element is the input of the next. Usually some amount of buffering is provided between consecutive elements. The information that flows in these pipelines is often a stream of records, bytes or bits.

The concept is also called the pipes and filters design pattern. It was named by analogy to a physical pipeline.

Simple example:
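As a minimal sketch of the idea (not the figure from the original slide), a chain of processing elements can be written with plain C# iterators, where the output of each stage is the input of the next. The stage names `Numbers`, `Square` and `OnlyEven` are made up for illustration, and the code assumes C# 9 or later for top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stage 1 (hypothetical): produce a stream of records.
static IEnumerable<int> Numbers(int count)
{
    for (int i = 1; i <= count; i++) yield return i;
}

// Stage 2 (hypothetical): transform each record.
static IEnumerable<int> Square(IEnumerable<int> input)
{
    foreach (int x in input) yield return x * x;
}

// Stage 3 (hypothetical): filter the stream.
static IEnumerable<int> OnlyEven(IEnumerable<int> input)
{
    foreach (int x in input)
        if (x % 2 == 0) yield return x;
}

// Chain the stages: each one consumes the output of the previous one.
List<int> result = OnlyEven(Square(Numbers(6))).ToList();
Console.WriteLine(string.Join(", ", result)); // 4, 16, 36
```

Iterators pull one record at a time and have no buffering between stages; the dataflow blocks described later add buffering and run stages concurrently.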

Page 6: Tpl dataflow

Pipeline basics

A linear pipeline is a series of processing stages arranged linearly to perform a specific function over a data stream. The basic uses of a linear pipeline are instruction execution, arithmetic computation and memory access.

A non-linear pipeline (also called a dynamic pipeline) can be configured to perform various functions at different times. A dynamic pipeline also has feed-forward or feedback connections. A non-linear pipeline also allows very long instruction words.

Page 7: Tpl dataflow

Pipelines in real life

Page 8: Tpl dataflow

Pipelines in real life

Page 9: Tpl dataflow

Dataflow programming

Dataflow programming is a programming paradigm that models a program as a directed graph of the data flowing between operations, thus implementing dataflow principles and architecture.

● emphasizes the movement of data

● the program is a series of connections

● explicitly defined inputs and outputs connect operations

Page 10: Tpl dataflow

Popular in

● parallel computing frameworks

● database engine designs

● digital signal processing

● network routing

● graphics processing

Page 11: Tpl dataflow

Usage

In Unix-like computer operating systems, a pipeline is the original software pipeline: a set of processes chained by their standard streams, so that the output of each process (stdout) feeds directly as input (stdin) to the next one. Each connection is implemented by an anonymous pipe. Filter programs are often used in this configuration.

The concept was invented by Douglas McIlroy for Unix shells and it was named by analogy to a physical pipeline.

Abstract and concrete examples:

% program1 | program2 | program3

% ls | grep xxx

Page 12: Tpl dataflow

Usage

Cascading is a Java application framework that enables typical developers to quickly and easily develop rich data analytics and data management applications that can be deployed and managed across a variety of computing environments. Cascading works seamlessly with Apache Hadoop and API-compatible distributions. It follows a ‘source-pipe-sink’ paradigm: data is captured from sources, flows through reusable ‘pipes’ that perform data analysis processes, and the results are stored in output files or ‘sinks’.

Page 13: Tpl dataflow

Usage

Cascading pipeline example:

Page 14: Tpl dataflow

Usage

Apache Crunch (Simple and Efficient MapReduce Pipelines by Cloudera)

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Storm

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple and can be used with any programming language.

Page 15: Tpl dataflow

TPL Dataflow

The Task Parallel Library (TPL) provides dataflow components to help increase the robustness of concurrency-enabled applications. These dataflow components are collectively referred to as the TPL Dataflow Library.

[Diagram: layered stack with the labels Data Flow, Tasks, Coordination data structures, Task Parallel Library, and Threads]

Page 16: Tpl dataflow

What does it provide for me?

● provides a foundation for message passing and parallelizing CPU-intensive and I/O-intensive applications

● gives you explicit control over how data is buffered and moves around the system

● improves responsiveness and throughput by efficiently managing the underlying threads

● allows you to easily create a mesh through which your data flows

● meshes can split and join the data flows, and even contain data flow loops

● allows you to create custom blocks and extend functionality

Page 17: Tpl dataflow

Type of blocks

Dataflow blocks are data structures that buffer and process data.

1. source blocks (act as a source of data): ISourceBlock<TOutput>

2. target blocks (act as a receiver of data): ITargetBlock<TInput>

3. propagator blocks (act as both a source block and a target block): IPropagatorBlock<TInput, TOutput>
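The three roles can be seen on a single block. The sketch below assumes the `System.Threading.Tasks.Dataflow` library is referenced and uses a `BufferBlock<int>` (introduced on the next slide) only because it implements all three interfaces; the variable names are illustrative.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// BufferBlock<int> is a propagator, so it can be viewed as a target or as a source.
IPropagatorBlock<int, int> block = new BufferBlock<int>();
ITargetBlock<int> target = block;   // the side that receives data
ISourceBlock<int> source = block;   // the side that produces data

target.Post(42);                    // Post is an extension method for ITargetBlock<T>
int received = source.Receive();    // Receive is an extension method for ISourceBlock<T>

Console.WriteLine(received); // 42
```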

Page 18: Tpl dataflow

Buffering blocks

● BufferBlock<T> - stores a first in, first out (FIFO) queue of messages that can be written to by multiple sources and read from by multiple targets. Each message is delivered to only one target: when a target receives a message from the BufferBlock<T>, that message is removed from the queue

● BroadcastBlock<T> - broadcasts a message to multiple components; every target gets the message (the original or a copy)

● WriteOnceBlock<T> - resembles the BroadcastBlock<T> class, except that a WriteOnceBlock<T> object can be written to one time only; the first written value then stays read-only

[Diagrams of the three buffering blocks. Labels: input, Task, Current, First written value (read-only), output (original), output (originals or copies)]
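A short sketch of how the three buffering blocks behave, assuming the `System.Threading.Tasks.Dataflow` library is referenced. The cloning delegate `s => s` simply returns the same string, which is acceptable here only because strings are immutable; all variable names are illustrative.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// BufferBlock<T>: FIFO queue; each message goes to exactly one receiver.
var buffer = new BufferBlock<int>();
buffer.Post(1);
buffer.Post(2);
int first = buffer.Receive();    // 1, removed from the queue
int second = buffer.Receive();   // 2

// BroadcastBlock<T>: keeps the current value and hands it to every reader.
var broadcast = new BroadcastBlock<string>(s => s);
broadcast.Post("hello");
string readerA = broadcast.Receive();   // "hello"
string readerB = broadcast.Receive();   // "hello" again, the value is not consumed

// WriteOnceBlock<T>: only the first write is accepted.
var once = new WriteOnceBlock<string>(s => s);
bool acceptedFirst = once.Post("first");     // true
bool acceptedSecond = once.Post("second");   // false, the block already has a value
string stored = once.Receive();              // "first"

Console.WriteLine($"{first} {second} {readerA} {readerB} {stored}");
```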

Page 19: Tpl dataflow

Execution blocks

● ActionBlock<TInput> - a target block that calls a delegate when it receives data

● TransformBlock<TInput, TOutput> - acts as both a source and a target; the delegate that you pass must return a value of type TOutput

● TransformManyBlock<TInput, TOutput> - resembles the TransformBlock, except that it produces zero or more output values for each input value instead of exactly one.

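[Diagrams: ActionBlock<TInput>: input -> Task; TransformBlock<TInput, TOutput>: input -> Task -> output; TransformManyBlock<TInput, TOutput>: input -> Task -> output]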
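The three execution blocks can be chained into a small pipeline. This is a sketch under the assumption that the `System.Threading.Tasks.Dataflow` library is referenced; the block names `split`, `upper` and `collect` and the sample sentences are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

var results = new List<string>();

// TransformManyBlock: one input line produces zero or more words.
var split = new TransformManyBlock<string, string>(
    line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries));

// TransformBlock: exactly one output per input.
var upper = new TransformBlock<string, string>(word => word.ToUpperInvariant());

// ActionBlock: a target that ends the pipeline (default parallelism is 1,
// so adding to the plain List<string> is safe here).
var collect = new ActionBlock<string>(word => results.Add(word));

// PropagateCompletion passes Complete() down the chain.
var link = new DataflowLinkOptions { PropagateCompletion = true };
split.LinkTo(upper, link);
upper.LinkTo(collect, link);

split.Post("make each block");
split.Post("do one thing");
split.Complete();
await collect.Completion;

Console.WriteLine(string.Join(" ", results)); // MAKE EACH BLOCK DO ONE THING
```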

Page 20: Tpl dataflow

Grouping blocks

● BatchBlock<T> - combines sets of input data, which are known as batches, into arrays of output data.

● JoinBlock<T1, T2> and JoinBlock<T1, T2, T3> - collect input elements and propagate out System.Tuple<T1, T2> or System.Tuple<T1, T2, T3> objects that contain those elements

● BatchedJoinBlock<T1, T2> and BatchedJoinBlock<T1, T2, T3> - collect batches of input elements and propagate out System.Tuple<IList<T1>, IList<T2>> or System.Tuple<IList<T1>, IList<T2>, IList<T3>> objects that contain those elements

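[Diagrams: BatchBlock<T>: input -> Task -> output; JoinBlock<T1, T2>: input (T1) and input (T2) -> Task -> output; BatchedJoinBlock<T1, T2>: input (T1) and input (T2) -> Task -> output]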
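A brief sketch of batching and joining, assuming the `System.Threading.Tasks.Dataflow` library is referenced. The batch size of 3 and the sample values are arbitrary choices for illustration.

```csharp
using System;
using System.Threading.Tasks.Dataflow;

// BatchBlock<T>: groups inputs into arrays of up to 3 elements.
var batch = new BatchBlock<int>(3);
for (int i = 1; i <= 7; i++) batch.Post(i);
batch.Complete();                 // releases the last, incomplete batch

int[] b1 = batch.Receive();       // 1, 2, 3
int[] b2 = batch.Receive();       // 4, 5, 6
int[] b3 = batch.Receive();       // 7

// JoinBlock<T1, T2>: waits for one element on each target, then emits a tuple.
var join = new JoinBlock<int, string>();
join.Target1.Post(1);
join.Target2.Post("one");
Tuple<int, string> pair = join.Receive();

Console.WriteLine($"{b1.Length} {b2.Length} {b3.Length} {pair.Item1}={pair.Item2}");
```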

Page 21: Tpl dataflow

LinkTo and Predicate

Link/UnLink

The ISourceBlock<TOutput>.LinkTo method links a source dataflow block to a target block and returns an IDisposable. To unlink the blocks, call Dispose on the object returned by LinkTo. The predefined dataflow block types handle all thread-safety aspects of linking and unlinking. A link can also remove itself automatically: if you set MaxMessages in DataflowLinkOptions to a positive value (the default, -1, means unbounded), the source is unlinked from the target after that number of messages has passed over the link.

Predicate

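When you link a target block, you can supply a “predicate” that checks each message before it is added to the target's input buffer. Pass the delegate to the LinkTo overload that takes a Predicate<TOutput> (optionally together with DataflowLinkOptions); it receives a message of the source's output type and returns a bool value. The target is offered only the messages for which the predicate returns true.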
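Linking with a predicate might look like the sketch below, which routes even and odd numbers to different targets. It assumes the `System.Threading.Tasks.Dataflow` library is referenced; the names `evenTarget` and `oddTarget` are invented, and the second, unfiltered link is there so that messages declined by the predicate still have somewhere to go.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

var evens = new List<int>();
var odds = new List<int>();

var source = new BufferBlock<int>();
var evenTarget = new ActionBlock<int>(n => evens.Add(n));
var oddTarget = new ActionBlock<int>(n => odds.Add(n));

var options = new DataflowLinkOptions { PropagateCompletion = true };

// The predicate is an argument of LinkTo; LinkTo returns an IDisposable.
IDisposable evenLink = source.LinkTo(evenTarget, options, n => n % 2 == 0);

// Targets are offered messages in link order, so this link gets what the first declined.
source.LinkTo(oddTarget, options);

for (int i = 1; i <= 6; i++) source.Post(i);
source.Complete();
await Task.WhenAll(evenTarget.Completion, oddTarget.Completion);

Console.WriteLine(string.Join(",", evens)); // 2,4,6
Console.WriteLine(string.Join(",", odds));  // 1,3,5

// Disposing the result of LinkTo unlinks the two blocks.
evenLink.Dispose();
```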

Page 22: Tpl dataflow

Other options

You can specify:

● degree of parallelism for block

● maximum number of messages that may be buffered by the block

● task scheduler

● number of messages per task

● cancellation

● greedy behavior

● completion
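Most of these settings live on the options objects passed to a block's constructor. The sketch below assumes the `System.Threading.Tasks.Dataflow` library is referenced; the concrete numbers (4, 100, 10, 20 messages) are arbitrary values chosen for illustration.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Dataflow;

using var cts = new CancellationTokenSource();

var options = new ExecutionDataflowBlockOptions
{
    MaxDegreeOfParallelism = 4,             // degree of parallelism for the block
    BoundedCapacity = 100,                  // maximum number of buffered messages
    TaskScheduler = TaskScheduler.Default,  // task scheduler
    MaxMessagesPerTask = 10,                // number of messages per task
    CancellationToken = cts.Token           // cancellation
};

// A thread-safe collection, because the delegate may run on several threads at once.
var processed = new ConcurrentBag<int>();
var worker = new ActionBlock<int>(n => processed.Add(n * 2), options);

// SendAsync waits when a bounded block is full, whereas Post would return false.
for (int i = 0; i < 20; i++)
    await worker.SendAsync(i);

// Completion: signal that no more input will arrive, then wait for the block to finish.
worker.Complete();
await worker.Completion;
Console.WriteLine(processed.Count); // 20

// Greedy behavior is configured on grouping blocks.
var nonGreedyJoin = new JoinBlock<int, string>(
    new GroupingDataflowBlockOptions { Greedy = false });
```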

Page 23: Tpl dataflow

Recommendations

Recommendations for building TPL Dataflow pipelines:

● make each block do one thing well

● design for composition

● be stateless where you can

Page 24: Tpl dataflow

Use cases

1. Prototyping pipelines for use in more complex systems

2. Development of flexible asynchronous applications that process some data, like:

○ Web-crawlers

○ Image processors

○ Sound processors

○ Pipelines in mobile phone apps

○ Data analysis/mining services

○ etc.

3. Studying pipeline-based development

Page 25: Tpl dataflow

Practice

Page 27: Tpl dataflow

Thanks for your attention!