High Performance Stream Processing and Optimizations

University of Iowa | Mobile Sensing Laboratory

May 8, 2015

Farley Lai

Advisor: Octav Chipara

Department of Computer Science


• A class of applications that process continuous input data streams and may produce continuous output streams

– High performance for real-time processing

– Long-term efficient resource management

Stream Applications

[Figure: speaker identification example with the stages Speech Recording, VAD, Feature Extraction, HTTP Upload, and Speaker Identifier, plus Speaker Models]

Develop compiler optimizations and efficient runtime environments for scalable streaming systems

Introduction


• Challenges

– Multi-core architectures

– High and variable memory and I/O workloads

– Energy efficiency

• Traditional approach: programmer-specified parallelization

– Imperative languages expose limited parallelism to exploit

– Error-prone concurrency primitives

– Energy efficiency with aggressive power management

Stream Processing on Modern Architectures



• Synchronous Data Flow (SDF)

– Programs are described as directed graphs

• Nodes: processes with sequential code

• Edges: FIFO communication channels

– Pipeline and data parallelism are explicitly exposed

– Amount of data consumed/produced by a process is fixed & known

• Limited expressivity

– Periodic static schedules

– Memory requirements are bounded

– Memory behavior of entire program may be characterized

Model of Computation


• Static schedules

– Order and number of invocations of each process in one period

– Two phases: initialization phase + steady phase

– Existence of a static schedule can be determined from the production and consumption rates of all processes

• Sequential schedules

– Built by simulating the process executions iteratively and tracking the channel buffer sizes:

• A process is schedulable if there is sufficient data to consume

• The simulation continues until the initial buffer size is restored

– Memory requirements are determined during scheduling

• Parallel schedules

– Derived by partitioning sequential schedules
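The scheduling steps above can be sketched in a few lines. This is a minimal illustration, not the StreamIt scheduler: it assumes a connected graph with no initial tokens and no peeking (so there is no initialization phase), and the names `repetitions`, `sequential_schedule`, and the example graph are hypothetical. It first solves the balance equations from the production and consumption rates, then simulates firings while tracking the peak size of every channel buffer.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetitions(nodes, edges):
    """Solve the balance equations r[src] * produce == r[dst] * consume.

    edges is a list of (src, dst, produce, consume) tuples.
    Assumes the graph is connected (an assumption of this sketch).
    """
    rate = {nodes[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, produce, consume in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produce / consume
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consume / produce
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produce != rate[dst] * consume:
                    raise ValueError("inconsistent rates: no static schedule")
    # Scale the fractional rates to the smallest integer vector.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    return {n: int(rate[n] * scale) for n in nodes}


def sequential_schedule(nodes, edges):
    """Simulate one period and record the peak size of each channel buffer."""
    reps = repetitions(nodes, edges)
    fired = {n: 0 for n in nodes}
    buffered = {(s, d): 0 for s, d, _, _ in edges}
    peak = dict(buffered)
    schedule = []
    while any(fired[n] < reps[n] for n in nodes):
        for n in nodes:
            # A process is schedulable if every input has enough data.
            if fired[n] < reps[n] and all(
                    buffered[(s, d)] >= c for s, d, _, c in edges if d == n):
                break
        else:
            raise RuntimeError("deadlock: no process is schedulable")
        for s, d, p, c in edges:
            if d == n:
                buffered[(s, d)] -= c
            if s == n:
                buffered[(s, d)] += p
                peak[(s, d)] = max(peak[(s, d)], buffered[(s, d)])
        fired[n] += 1
        schedule.append(n)
    # All buffers are back to their initial (empty) state here.
    return schedule, peak


if __name__ == "__main__":
    nodes = ["src", "appender", "sink"]
    edges = [("src", "appender", 1, 4), ("appender", "sink", 8, 1)]
    print(sequential_schedule(nodes, edges))
```

The greedy firing order used here is only one valid sequential schedule; other orders can yield smaller peak buffer sizes.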

Synchronous Data Flow



• Potential memory inefficiency

– Per channel buffer allocations and pass-by-value semantics

– Pass-by-reference only works on read-only chunks of data

• Aggregate update problem [Haven85]

• What if a process tries to insert new samples in the middle?

• Can we still prevent copying unchanged data portions by capturing the semantics of memory operations at compile-time?

Synchronous Data Flow


MY RESEARCH

ESMS: Efficient Static Memory Analysis on Stream Programs

CSense: A Stream-Processing Toolkit for Mobile Sensing Applications


• StreamIt: an SDF language

• Per-channel allocation and pass-by-value semantics lead to significant memory usage and copying

• One single global allocation

– Reuses memory and reduces memory requirements

– Avoids unnecessary memory copies

Memory Optimizations: ESMS

My Research: ESMS


• Component analysis on filter work functions in the logical space for each filter

– peek(i), pop(), push(v) are assumed to access contiguous memory allocations in FIFO channels

– Interprets push(v) in filter work functions as

• PASS: moves an unchanged data sample between channels

• UPDATE: otherwise, a new value is pushed

– Splitters, joiners and reordering filters are pass-only

Static Analysis

int->int filter appender() {
  work pop 4 push 8 {
    for (int i = 0; i < 4; i++) push(pop());       // pass
    for (int i = 0; i < 4; i++) push(compute(i));  // update
  }
}
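To make the PASS/UPDATE distinction concrete, the sketch below mimics the appender filter in Python and labels each push. It is only an illustration: ESMS performs this classification statically at compile time, whereas this stand-in runs the work function on symbolic input tokens. The names `Token`, `classify_pushes`, and the `i * i` placeholder for `compute(i)` are assumptions, not part of ESMS or StreamIt.

```python
class Token:
    """Symbolic stand-in for an unchanged sample read from the input channel."""

    def __init__(self, index):
        self.index = index


def classify_pushes(work, pop_rate):
    """Run `work` once on symbolic inputs and label every push it makes."""
    inputs = [Token(i) for i in range(pop_rate)]
    labels = []

    def pop():
        return inputs.pop(0)

    def push(value):
        # Pushing an input token unchanged is a PASS; anything else is an UPDATE.
        labels.append("PASS" if isinstance(value, Token) else "UPDATE")

    work(pop, push)
    return labels


def appender(pop, push):
    """Python rendering of the appender work function (pop 4, push 8)."""
    for _ in range(4):
        push(pop())      # pass
    for i in range(4):
        push(i * i)      # update; i * i is a placeholder for compute(i)


if __name__ == "__main__":
    print(classify_pushes(appender, 4))
```

A filter whose pushes are all labeled PASS, such as a splitter, joiner, or reordering filter, never creates a new value and therefore never needs a fresh memory location.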


• Relate logical positions to physical locations in the global allocation

– Remaps peek(i), pop(), push(v) to access possibly non-contiguous memory

– Live range analysis by reference counting in a schedule period

• Layout starts with size zero and expands when necessary

• Each location represents a live variable with its live range

• The live range begins when a value is pushed for the first time

– A splitter pushes a value multiple times for sharing

• The live range ends when its value is popped for the last time

• A location is free if its live variable is out of range

– Complete memory behaviors and sound approximation

• No pointer aliasing

• Terminates in one schedule period
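The reference-counting idea above can be sketched as a small simulation over the push and pop events of one schedule period. This is an assumption-laden illustration rather than the ESMS implementation: the event format, the function name `build_layout`, and the first-free-slot reuse policy are all hypothetical.

```python
def build_layout(events):
    """Assign each value a location in one global layout.

    events is a list of ("push", value_id) or ("pop", value_id) pairs taken
    from one simulated schedule period. Returns (layout_size, location_of).
    """
    slots = []   # slots[i] holds the live value at location i, or None if free
    refs = {}    # value id -> pushes not yet matched by a pop
    where = {}   # value id -> location index
    for op, value in events:
        if op == "push":
            if value in refs:
                # A splitter pushing the same value again shares its location.
                refs[value] += 1
            else:
                # Live range begins: reuse a free location or grow the layout.
                if None in slots:
                    index = slots.index(None)
                else:
                    index = len(slots)
                    slots.append(None)
                slots[index] = value
                refs[value] = 1
                where[value] = index
        elif op == "pop":
            refs[value] -= 1
            if refs[value] == 0:
                # Live range ends on the last pop: the location becomes free.
                slots[where[value]] = None
                del refs[value]
    return len(slots), where


if __name__ == "__main__":
    period = [("push", "a"), ("push", "b"), ("push", "a"),
              ("pop", "a"), ("pop", "b"), ("push", "c"),
              ("pop", "a"), ("pop", "c")]
    print(build_layout(period))
```

In the example, value `c` reuses the location freed by `b`, so three values fit in a layout of two locations, while `a` stays live until its second pop.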

Whole Program Analysis



• Simulate one period of the static schedule

– case PASS: reuse memory locations in the layout

– case UPDATE: follow one of the three strategies

• Always-Append (AA) | Append-on-Conflict (AoC) | Insert-in-Place (IP)

Layout Stitching

[Figure: layout stitching, example 1. Panel labels: Input (2 updates), AA, AoC & IP]

MEM Layout: MEM[0, 0]: D0, MEM[1, 0]: D1

MEM Layout: MEM[0, 0]: I0, MEM[1, 0]: I1

MEM Layout: MEM[0, 0]: D0, MEM[1, 0]: D1, MEM[2, 1]: I0, MEM[3, 1]: I1

[Figure: layout stitching, example 2. Panel labels: Input (2 updates), AA & AoC, IP]

MEM Layout: MEM[0, 0]: D0, MEM[1, 1]: D1

MEM Layout: MEM[0, 1]: I0, MEM[1, 1]: I1, MEM[2, 1]: D1

MEM Layout: MEM[0, 0]: D0, MEM[1, 1]: D1, MEM[2, 1]: I0, MEM[3, 1]: I1


– ESMS reduces both channel buffer sizes and the number of memory operations for reordering and duplicating data streams (splitters, joiners, reordering filters).

Memory Usage Reductions

45% to 96% reductions

73% reduction on average


– The average speedups of AA, AoC, and IP are 3x, 3.1x, and 3x, while the average speedup of CacheOpt is only 1.07x.

– ESMS improves performance by eliminating unnecessary memory operations and by shrinking the working set so that it fits in the cache.

Speedup



• Challenges

– Mobile sensing applications are difficult to implement on Android devices

• High frame rates

• Concurrency

• Robustness

• Energy efficiency

– Resource limitations and the Java VM worsen these problems

• Additional cost of virtualization

• Significant overhead of garbage collection

• Integrates SDF with dynamic scheduling

– Conditional dataflow paths by partitioning the SDF

– Asynchronous event processing, e.g., network access and UI

– Android-specific power management

Mobile Sensing Applications: CSense

My Research: CSense


• Speaker Identifier

– Conditional dataflow paths result in SDF subgraphs

– Bounded memory requirements

Example Application

// create components
addComponent("audio", new AudioComponentC(rateInHz, 16));
addComponent("rmsClassifier", new RMSClassifierC(rms));
addComponent("mfcc", new MFCCFeaturesG(speechT, featureT));
...

// wire components
link("audio", "rmsClassifier");
toTap("rmsClassifier::below");
link("rmsClassifier::above", "mfcc::sin");
fromMemory("mfcc::fin");
...


Concurrency Model


getComponent("audio").setThreading(Threading.NEW_DOMAIN);

getComponent("httpPost").setThreading(Threading.NEW_DOMAIN);

getComponent("mfcc").setThreading(Threading.SAME_DOMAIN);

Compiler transformation


• Static analysis

– Composition errors, memory usage errors, race conditions

• Flow analysis

– Whole-application configuration and optimization

• Stream Flow Graph transformations

– Domain partitioning, type conversions, MATLAB component coalescing

• Code generation

– Android application/service, MATLAB (C code + JNI stubs)

CSense Compiler



• Components exchange data using push/pull semantics

• Runtime includes a scheduler for each domain

– task queue + event queue

– wake lock for power management
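A per-domain scheduler with a task queue and an event queue might look roughly like the sketch below. It is a single-threaded Python illustration, not the CSense runtime (which targets Android and Java): the class name `DomainScheduler`, the policy of draining events before tasks, and the boolean that stands in for an Android wake lock are all assumptions.

```python
from collections import deque


class DomainScheduler:
    """Toy scheduler for one domain: a task queue plus an event queue."""

    def __init__(self):
        self.tasks = deque()    # component invocations (dataflow work)
        self.events = deque()   # asynchronous events, e.g. network or UI
        self.wake_lock_held = False  # stand-in for a platform wake lock

    def post_task(self, fn):
        self.tasks.append(fn)

    def post_event(self, fn):
        self.events.append(fn)

    def run_until_idle(self):
        """Hold the wake lock only while there is work to do."""
        self.wake_lock_held = True
        try:
            while self.tasks or self.events:
                # Assumed policy: pending events run before pending tasks.
                queue = self.events if self.events else self.tasks
                queue.popleft()()
        finally:
            # Releasing the lock lets the device sleep when the domain is idle.
            self.wake_lock_held = False


if __name__ == "__main__":
    log = []
    scheduler = DomainScheduler()
    scheduler.post_task(lambda: log.append("task"))
    scheduler.post_event(lambda: log.append("event"))
    scheduler.run_until_idle()
    print(log, scheduler.wake_lock_held)
```

In a multi-domain setting, each domain would run its own scheduler on its own thread, and components in different domains would exchange data by posting work to each other's queues.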

CSense Runtime


[Figure: CSense runtime with Scheduler1 and Scheduler2, each with a Task Queue and an Event Queue, and a Memory Pool]


• Garbage collection overhead limits scalability

• Concurrency primitives have a significant impact on performance
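One common way to limit garbage collection pressure in a streaming runtime is to preallocate buffers and recycle them through a pool. The sketch below shows the idea only; the class name `FramePool` and its methods are hypothetical and are not the CSense memory pool API.

```python
class FramePool:
    """Fixed set of preallocated buffers that are recycled, never reallocated."""

    def __init__(self, count, size):
        # All allocation happens once, up front.
        self._free = [bytearray(size) for _ in range(count)]

    def acquire(self):
        """Hand out a free buffer; fail loudly instead of allocating a new one."""
        if not self._free:
            raise RuntimeError("pool exhausted")
        return self._free.pop()

    def release(self, frame):
        """Return a buffer to the pool so a later acquire() can reuse it."""
        self._free.append(frame)


if __name__ == "__main__":
    pool = FramePool(count=2, size=16)
    first = pool.acquire()
    pool.release(first)
    second = pool.acquire()
    print(first is second)
```

Because the pool size is fixed, exhausting it signals that the memory bound assumed for the application was wrong, which is easier to diagnose than unbounded allocation.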

Producer-Consumer Throughput

[Figure: producer-consumer throughput chart; annotations: 30%, 13.8x, 19x]

MY RESEARCH IN CONTEXT

More energy savings in the dataflow model

Dynamic optimizations in cloud stream processing


• Energy consumption in mobile sensing applications

– Energy bugs introduced by power management primitives

• Program analysis on code paths and potential races

– Improper usage of I/O components elongates tail power states

• Defer and batch I/O operations to execute in a short interval

– Intensive computations

• Code offloading to cloud, AMP cores or GPGPU

Energy Efficiency and Stream Processing

Research of Interest: Energy Efficiency


• Inconsistent throughputs

– Critical path with bottleneck processes

• Dynamic Voltage and Frequency Scaling (DVFS)

– Dispatch partitions to cores at different frequencies

• Save energy while maintaining performance

DVFS: GreenStreams


[Bartenstein2013]


• New challenges

– Changing input structure

• Computations on the graph of tweet mentions

• Distributed storage of the graph

• Consistent graph representation

– Communication overhead

• Conventional stream processing is less concerned with large-scale communication and focuses on local computation

– Changing performance criteria

• Statically made decisions cannot be optimal the whole time

• Better performance requires making decisions dynamically

– Fault-tolerance

Cloud Stream Processing

Research of Interest: Cloud Computing


• Goal

– Compute timely properties on the changing graph input

• Challenges

– High rate of graph updates

– Consistent graph structure

– Static graph mining algorithms

• Global progress tracking protocol

– Graph updates are queued and progress is tracked in a global table

– Progress snapshots are taken and distributed to perform transactions of graph updates and associated computations

• Pros

– Decouples graph updates from graph computations

• Cons

– Centralized progress tracking

– No analysis of updates that may cancel each other, and no aggregation of the communications they may propagate

Changing Graph: Kineograph



• Goal

– High-throughput, timely, and low-latency processing

• Challenges

– Communication overhead dominates local computing resources

– Massive communications cause media contention and high latency

• Efficient batching and localized communications

– Users decide whether to process input synchronously or asynchronously

– Distributed progress tracking based on partially ordered timestamps

• Pros

– Effective aggregation of communications

• Cons

– Flow control might be a concern for asynchronous delivery
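Progress tracking over partially ordered timestamps can be illustrated generically. The sketch below compares timestamps component-wise, which is one standard way to obtain a partial order; it is not Naiad's actual timestamp structure or protocol, and the names `leq`, `comparable`, and `is_complete` are assumptions.

```python
def leq(a, b):
    """Component-wise order: a precedes or equals b in every coordinate."""
    return len(a) == len(b) and all(x <= y for x, y in zip(a, b))


def comparable(a, b):
    """In a partial order, some pairs are ordered in neither direction."""
    return leq(a, b) or leq(b, a)


def is_complete(t, outstanding):
    """A timestamp is complete once no outstanding work could still affect it,
    i.e. no outstanding timestamp precedes or equals it."""
    return not any(leq(w, t) for w in outstanding)


if __name__ == "__main__":
    # Timestamps here are (epoch, loop iteration) pairs.
    print(comparable((1, 2), (2, 0)))
    print(is_complete((1, 0), [(1, 1), (2, 0)]))
    print(is_complete((1, 2), [(1, 1)]))
```

Because incomparable timestamps do not block each other, work at one timestamp can complete and be reported without waiting for unrelated work, which is what makes distributed tracking cheaper than a single global order.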

Timely Dataflow: Naiad



• Goal

– Efficiently switching between Sync and Async modes for better performance and early termination

• Predict the next better execution mode online

• Pros

– Frees programmers from coding the execution modes explicitly

• Cons

– Separate snapshots and checkpointing under both modes

• Additional space usage

• Unclear performance impact

Execution Modes: PowerSwitch



• Energy efficiency in stream processing

– Dataflow paths as code paths facilitate program analysis

– Graph manipulation for better I/O component access aggregations

– Precise workload requests for computing resources

– Adapting to runtime information, e.g., user activity predictions

• Incorporate dynamic optimizations

– Static optimizations based on fixed resources

– Reevaluations once in a while, e.g., execution mode switching

– Changing input structures and distributed state sharing

• Less changing information flow paths for aggregating communications

• Localized multicast for reconfiguration and termination in partial order

Conclusions and Future Work


• Potential limitation on stream processing optimizations

– Non-trivial transformation between SDF graphs at different granularities

Conclusions and Future Work

[Figure: stream graphs of a coarse-grained FFT (FFTTestSource, split, FFTReorderSimple, CombineDFT, join, FloatPrinter) and a fine-grained FFT (FloatSource, split, Butterfly, join, BitReverse, FloatPrinter), with numeric annotations on each node]


Any Questions?

Thank You
