



Window-based Data Processing with Stratosphere

Diploma thesis (Diplomarbeit)

submitted for the academic degree of Diplominformatiker

Humboldt-Universität zu Berlin
Mathematisch-Naturwissenschaftliche Fakultät II

Institut für Informatik

submitted by: Fabian Fier
born on: 12.01.84
in: Rheinfelden

Reviewers: Prof. Johann-Christoph Freytag, Ph.D.
Prof. Dr. Odej Kao

submitted on: . . . . . .

defended on: . . . . . .


Contents

1 Introduction
  1.1 Stratosphere
  1.2 Ordered data and sliding windows
  1.3 Sequential data processors
  1.4 Problem statement
  1.5 Approach

2 Preliminaries
  2.1 Stratosphere
    2.1.1 Map-Reduce
    2.1.2 Parallelization Contracts (PACTs)
    2.1.3 PACT execution by Nephele
  2.2 Ordered Data Model
    2.2.1 Assumptions for ordered data processing
  2.3 Continuous query operators
    2.3.1 Properties of continuous query operators
    2.3.2 Sliding windows
    2.3.3 Types of aggregate functions
  2.4 Discussion

3 Sliding Window PACTs
  3.1 General considerations
  3.2 Order and sliding window records
  3.3 Sliding window Map
  3.4 Sliding window Reduce
    3.4.1 Semantic approach
    3.4.2 Considerations for implementation
    3.4.3 Discussion
  3.5 Sliding window Cross
    3.5.1 Semantic approach
    3.5.2 Considerations for implementation
    3.5.3 Discussion
  3.6 Sliding window Match


    3.6.1 Semantic approach
    3.6.2 Considerations for implementation
    3.6.3 Discussion
  3.7 Sliding window CoGroup
    3.7.1 Semantic approach
    3.7.2 Considerations for implementation
    3.7.3 Discussion

4 Evaluation
  4.1 Implementation of sliding window Reduce
  4.2 Example sliding window calculation
  4.3 Example PACT using workarounds
  4.4 Example PACT using sliding window Reduce
  4.5 Evaluation
  4.6 Discussion

5 Conclusions
  5.1 Summary
  5.2 Open issues

A Appendix
  A.1 SeismoBusAnalysis algorithm
  A.2 Experiments
  A.3 Data models
    A.3.1 Relational data model
    A.3.2 SEQ data model
  A.4 Join algorithms
    A.4.1 Common / Simple Hash Join
    A.4.2 Symmetric / (Double) Pipelined Hash Join


List of Figures

1.1 Comparison between DSMS, Stratosphere, and RDBMS.

2.1 Parallelization Contract (PACT)
2.2 The PU split assigned by the Map second-order function.
2.3 The PU split and grouping assigned by the Reduce second-order function.
2.4 Cross.
2.5 CoGroup.
2.6 Match.
2.7 Compiling and running a program with PACT and Nephele

3.1 Sliding window Reduce: split of Parallelization Units into sliding windows.
3.2 Example PACT program using sliding window Reduce.
3.3 Example Job Graph with two Map and three Reduce instances.
3.4 Sorted queue for time-based semantics.
3.5 Sorted queue for count-based semantics.
3.6 Reduce algorithm for time-based semantics.
3.7 Reduce algorithm.
3.8 Sliding window Cross: split of ordered multisets into sliding windows and cartesian products over window pairs 1 and 2.
3.9 Example 1: Split of time axis into sliding windows.
3.10 Example 2: Split of time axis into sliding windows.
3.11 Example PACT program using sliding window Cross.
3.12 Example job graph of PACT program using sliding window Cross.
3.13 Sliding window Cross data structures and potential next window for time-based semantics.
3.14 Sliding window Cross second-order algorithm for time-based semantics.
3.15 Sliding window Match: split of Parallelization Units into sliding windows.
3.16 Example PACT program using sliding window Match.
3.17 Example job graph using sliding window Match.
3.18 Sliding window CoGroup: split of Parallelization Units into sliding windows.

4.1 Overview on Stratosphere classes related to sliding window Reduce.
4.2 Reduce second-order function.


4.3 Example sliding window Calculation PACT program without sliding window Reduce.
4.4 DelimitedInputFormat.
4.5 SerialDelimitedInputFormat.
4.6 Workaround: sliding window emulation within records.
4.7 Example sliding window Calculation PACT program using sliding window Reduce.
4.8 Experiments with 4 calculation nodes and parallelization degree 4.
4.9 Experiments with 16 calculation nodes and parallelization degree 32.

A.1 Time usage with 4 calculation instances for parallelization degree 4.
A.2 Time usage with 4 calculation instances for parallelization degree 8.
A.3 Time usage with 4 calculation instances for parallelization degree 16.
A.4 Time usage with 16 calculation instances for parallelization degree 16.
A.5 Time usage with 16 calculation instances for parallelization degree 32.
A.6 Time usage with 16 calculation instances for parallelization degree 64.
A.7 Hash Join.
A.8 Symmetric Hash Join.


List of Tables

1.1 Comparison of DSMS, Stratosphere, and RDBMS.

3.1 Input example.
3.2 Example for sliding window split.
3.3 Input example for special case regarding the timely availability of information about the number of predecessors.
3.4 Input example for special case regarding non-strict total order.
3.5 Input example for special case regarding non-strict total order.
3.6 Example for application of sliding window Cross routing algorithm.

4.1 Input example.


Abstract

Analyzing large amounts of ordered data is a common task in research and industry. The usual ordering domain is time: examples of time-ordered data are sensor data, communication network data, or financial data. Besides online monitoring, it is common to investigate patterns or special events in the data after capturing it. Such analyses are traditionally performed within Data Stream Management Systems or Relational Database Management Systems. We decided to use the parallelization framework Stratosphere: by design, Stratosphere provides scalability by using clusters or clouds for computations. Ordered data analysis requires sliding window semantics, which are not yet included in the operators of Stratosphere. In this work, we describe sliding window semantics from streaming databases and define Stratosphere operators with sliding window semantics. We introduce an exemplary implementation of one sliding window operator and evaluate its performance. The results show that Stratosphere with sliding window operators is a good choice for analyses on large amounts of ordered data. Augmented with the proposed sliding window operators, the applicability of Stratosphere is broadened, turning it into an even more general-purpose parallelization framework.


Chapter 1

Introduction

The analysis of large amounts of ordered data is a common task in research and industry. Consider sensor data that is used for the early warning of earthquakes, network data that is analysed for security reasons, or financial data that is scanned for correlations with other financial data. All these examples have time as their ordering domain. Assume that data is captured and saved for subsequent analysis. Analyses on ordered data include the use of sliding windows: the data is partitioned into overlapping or disjoint sets of subsequent data records. Operators like AVERAGE() are applied to sliding windows instead of the complete set of data. Users may choose a Relational Database Management System (RDBMS) for the data analysis. Usually, RDBMS do not provide sliding window support. Another choice are Data Stream Management Systems (DSMS). They provide support for sliding windows, but they are usually implemented to actively react to data (push-based) instead of obtaining the data (pull-based). Neither system type usually supports distributed computing on a cluster or a cloud. Thus, both systems are not ideal for the analysis of large amounts of ordered data. We chose the parallelization framework Stratosphere for this task. In our previous research, we ported the offline analysis of seismographic data from a non-distributed evaluation environment to Stratosphere using a compute cluster. We observed that Stratosphere is suitable for the analysis of ordered data, but its operators do not support sliding window semantics. Users have to implement the semantics in the user layer, which is troublesome, potentially resource intensive, and hard to optimize by the Stratosphere system. In this work, we introduce sliding window operators to Stratosphere to overcome these weaknesses.

In the following, we briefly introduce Stratosphere and our notion of ordered data. We compare DSMS, RDBMS, and Stratosphere. Finally, we motivate this work and describe our approach.


1.1 Stratosphere

Stratosphere is a research project on "Information Management in the Cloud". It is a joint project of Humboldt-Universität zu Berlin, Hasso-Plattner-Institut Potsdam, and Technische Universität Berlin. Its aim is to advance the way large unstructured or semi-structured data is processed in parallel on distributed systems. A new database-inspired approach is developed to run analyses, aggregations, and queries on this kind of data in parallel on cluster and cloud architectures. The research focuses on three parts: parallel programming models, parallel data processing engines, and optimizations of data flow programs.

Within the project, the programming model PACT and the parallel execution engine Nephele are developed and implemented. We refer to "Stratosphere" as the implemented framework rather than the project. Stratosphere is written in Java and available under an open source licence. The current official release is version 0.2. In this work, the development version 0.2 is discussed. In Chapter 2.1, we introduce Stratosphere and its operators in detail.

1.2 Ordered data and sliding windows

Within ordered data, also referred to as sequential data, every atomic data record is associated with a position in an ordering domain. Usually, the ordering domain is time. Ordered data is either analyzed in real time (continuous data streaming) or after capturing it. In this work, we focus on the analysis of already captured data. We assume that the amount of input data is generally larger than the available main memory of the system that performs an analysis on it. The main implication of this assumption is that an analysis cannot use the entire input at once.

This problem can be overcome by employing sliding windows. Calculations are performed on subsequent portions of the data that fit into main memory. These portions can be either overlapping or non-overlapping. Depending on the type of analysis, overlapping sliding windows can be used to optimize the calculation: results of preceding calculations can be reused instead of running the calculation on each portion from scratch.

In Chapter 2.2, we introduce our ordered data model, further assumptions about processing ordered data, and sliding windows.

1.3 Sequential data processors

For the analysis of ordered data, three general types of processors are available: Relational Database Management Systems (RDBMS), Data Stream Management Systems (DSMS), and other systems, e.g., special-purpose systems for network monitoring or the parallelization framework Stratosphere. In the following, we give an overview of the suitability of DSMS, RDBMS, and Stratosphere with regard to the analysis of ordered data.

By design, RDBMS are not well suited for ordered data and sliding window operations on them. They are optimized for queries on relational business data [ACC+03]: RDBMS treat data collections as sets, not sequences. Data records do not have a notion of ordering, unless ordering attributes are added explicitly.

DSMS, also called stream processing engines (SPE), provide support for real-time (online) sequential data analysis and sliding window operations [CcC+02, ACC+03, SCS03]. They are optimized for the streaming data model [BBD+02]. This optimization includes that, in the presence of overload, a streaming system applies load shedding techniques like record dropping or window parameter optimizations and delivers inexact results.

For this work, we assume that data is analysed after it is captured (offline), so there is no overload and inexact results are not intended. When running analyses on large amounts of offline data, DSMS usually do not provide good performance compared to a distributed system: depending on the DSMS design, parallelization and distribution is either not considered or used exclusively for purposes other than faster execution, for example to provide a higher quality of service [SZS+03, DJAZ05, MCZ03].

For using parallelization and distribution for accelerated execution, there are frameworks that directly support it by design: Stratosphere is a general-purpose parallelization framework featuring an extension of the Map-Reduce programming model [BEH+10]. In our previous research, we utilized Stratosphere for sequential data analysis on a compute cluster. By using Stratosphere, the computation time of sliding window analyses on arbitrary amounts of sequential data can be scaled by the number of computation nodes. The downside of using Stratosphere for this purpose is that sliding window semantics can only be accomplished by applying workarounds like data redundancy: atomic data records are enlarged to contain all values for the calculation of a window. If the calculation includes overlapping windows, the same values are copied into several records, resulting in data transfer overhead.

Figure 1.1: Comparison between DSMS, Stratosphere, and RDBMS (schematic plot of data update speed, ranging from static to high speed, against query complexity, ranging from simple to complex).

In Figure 1.1, the suitability of the described systems is depicted schematically. For data arriving online at a high speed, a DSMS is usually necessary for its analysis. The queries within the analysis must be rather simple compared to queries that can be performed in RDBMS. Data within RDBMS is usually static compared to data of DSMS. Stratosphere lies in between both system types: it allows for analyses on rather static data, because it is implemented for offline data, similar to RDBMS. Queries within Stratosphere are performed by user-defined code. Thus, Stratosphere queries can potentially be as complex as RDBMS queries, but the user has to implement the query semantics himself.


Table 1.1: Comparison of DSMS, Stratosphere, and RDBMS (partly adapted from [GO10]).

                               | RDBMS                        | Stratosphere                 | DSMS
Data                           | persistent relations         | persistent structured data   | streams
Data access                    | random                       | sequential                   | sequential, one-pass
Index usage                    | yes                          | no                           | no
Updates                        | arbitrary                    | append-only                  | append-only
Update rates of incoming data  | relatively low               | relatively low               | high, bursty
Processing model               | query-driven (pull-based)    | query-driven (pull-based)    | data-driven (push-based)
Queries                        | one-time                     | one-time                     | continuous, ad-hoc
Query answers                  | exact                        | exact                        | approximate, exact
HDD and RAM usage              | hybrid                       | hybrid                       | RAM-only
Interaction model              | human-active, system-passive | human-active, system-passive | system-active, human-passive
Executable on cluster or cloud | no (originally)              | yes                          | no

In Table 1.1, we list the main differences between the three system types. All listed properties have a direct or indirect impact on the system designs and particularly on the semantics of the systems' operators. We refer to some of the properties subsequently.

1.4 Problem statement

Stratosphere, as a general-purpose parallel data processing framework, is a promising alternative to existing RDBMS and DSMS for the analysis of large amounts of sequential data in a short time. Yet, it is not well suited for problems that employ sliding window operations, because the user needs to implement the sliding window semantics himself. The goal of this thesis is to enhance the suitability of Stratosphere for this class of problems. The suitability can be improved in three ways. First, Stratosphere operators can be extended by sliding window semantics. With these semantics, operators like Reduce can operate on bags of records once they contain all values for single windows. Second, if the order of records between subsequent Stratosphere operators (PACTs) can be controlled, user-implemented operator functions can potentially become stateful [ABE+10, AEH+11]. If the user-defined algorithm allows for it, results of preceding calculations can be reused in order to optimize execution time. Lastly, the degree of parallelization can be enhanced by using sliding windows as Parallelization Units (PUs): Stratosphere uses PUs in order to achieve data parallelization [BEH+10]. PUs are subsets of data that can be processed independently. Stratosphere currently uses keys for the creation of PUs. Sliding windows provide an additional way to create PUs.

1.5 Approach

Sliding windows are highly discussed in the field of DSMS, which are designed for real-time continuous data processing. Stratosphere, on the other hand, is a batch system explicitly running on offline data. In this diploma thesis, we investigate additions to the Stratosphere design in order to improve its suitability for sliding window operations on sequential data, following the DSMS literature. However, it is not intended to change the design of Stratosphere to support real-time data processing.

Chapter 2 contains an introduction of concepts that we use throughout this work. We describe Stratosphere and its operators. Furthermore, we define the ordered data model as an extension of the existing data model Stratosphere is based on. We describe further assumptions about the processing of ordered data within Stratosphere. The data model and the assumptions about processing are important in the subsequent discussion of continuous query operators, including sliding windows, and types of aggregate functions. In Chapter 3, we define sliding window Stratosphere operators. For each operator, we start with a semantic approach that contains a detailed definition of the operator semantics. Furthermore, we describe how to implement the respective operator and finally discuss semantic and implementation alternatives that can be considered. In Chapter 4, we describe our exemplary implementation of one sliding window operator in Stratosphere and evaluate its performance. In Chapter 5, we draw conclusions about our approach of introducing sliding window semantics to Stratosphere and discuss open issues.


Chapter 2

Preliminaries

In this chapter, we introduce concepts that we subsequently use for the sliding window operator definitions. This involves a description of the parallelization framework Stratosphere with a focus on its programming model PACT. PACT defines the semantics of the existing Stratosphere operators. For our aim of enhancing the existing Stratosphere operators with sliding window semantics, we introduce our ordered data model and general sliding window concepts.

2.1 Stratosphere

Stratosphere is a parallelization framework. It consists of the programming model PACT and the parallel execution engine Nephele. PACT stands for parallelization contract and is a generalization of the Map-Reduce programming model. Before we describe PACT and its interaction with Nephele, we start with a brief introduction of Map-Reduce.

2.1.1 Map-Reduce

The Map-Reduce programming model was introduced by Dean and Ghemawat [DG04]. It enables users to write parallel programs in a predefined way. A Map-Reduce program consists of a Map step followed by a Reduce step. The Map step reads the input data entities one after another and generates one or more intermediate key-value pairs. The pairs are sorted and grouped by their key automatically. The subsequent Reducer reads the sorted groups of key-value pairs and generates the final output from each group.

The Map and Reduce functions each consist of a predefined second-order function. The user implements custom code within the first-order functions, one for Map and one for Reduce. The Map second-order function passes the input data to the first-order function. The sorting and grouping of the intermediate key-value pairs is performed by the second-order function of Reduce. Parallelization is achieved by data parallelization: both the Mapper and the Reducer can be executed on subsets of the data on several computing instances in parallel. The Map-Reduce implementation is responsible for all details of parallelization. This involves, e.g., the distribution of the program code and the data.
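To illustrate the division of labour between the second-order functions and the user-implemented first-order functions, the following Java sketch simulates a word count in the Map-Reduce style on a single node. The class and method names are chosen for illustration only and do not correspond to any particular Map-Reduce implementation; in a real system, the grouping step and the calls to the first-order functions would be distributed over several nodes.

import java.util.*;

// Minimal single-node simulation of the Map-Reduce data flow (word count).
// The "framework" part (grouping by key) stands in for the second-order
// functions; mapRecord() and reduceGroup() play the role of user code.
public class MapReduceSketch {

    // First-order Map function: one input line -> intermediate (word, 1) pairs.
    static List<Map.Entry<String, Integer>> mapRecord(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // First-order Reduce function: all values of one key -> final output.
    static int reduceGroup(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("to be or not to be", "be brief");

        // "Second-order" Map: call the first-order function per input record.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : input) intermediate.addAll(mapRecord(line));

        // "Second-order" Reduce: sort and group the pairs by key, then call
        // the first-order function once per group.
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (Map.Entry<String, Integer> p : intermediate) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        for (Map.Entry<String, List<Integer>> g : groups.entrySet()) {
            System.out.println(g.getKey() + " -> " + reduceGroup(g.getKey(), g.getValue()));
        }
    }
}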


A major drawback of the Map-Reduce programming model is that it is not suited for more complex operations. In [BEH+10], example problems are shown that are difficult to implement in Map-Reduce. The examples include joins (inner, outer, anti, and theta joins), pairwise correlation computation, and K-means clustering. Key-value pairs and the given functions Map and Reduce are not sufficient for these and other problems in parallel programming. Some implementation issues can be worked around by using custom key or value structures that consolidate several values into one key-value pair. Another workaround is to execute two or more Map-Reduce programs one after another to achieve functionality that cannot be implemented within one program. Nevertheless, workarounds like the described ones obstruct optimization possibilities that can only be leveraged in a more powerful framework. In order to overcome these Map-Reduce shortcomings, the PACT programming model was introduced as a generalization of Map-Reduce. We describe it in the following.

2.1.2 Parallelization Contracts (PACTs)

Similar to the streaming database Aurora [ACC+03], PACT follows the boxes-and-arrows dataflow paradigm. A PACT program consists of operators (boxes) and directed connections (arrows) between them. Besides Map and Reduce, PACT offers further operators. The additional operators are multi-input, which means that they require two input connections. In contrast to Map-Reduce, PACT allows for arbitrary acyclic connections between its operators. As in Map-Reduce, each PACT operator consists of a user-implemented first-order function and a second-order function that implements the parallelization semantics of the operator. Before we describe the Stratosphere PACT operator semantics, we introduce records and keys.

Figure 2.1: Parallelization Contract (PACT): the Input Contract splits the input data into independent data subsets, on which the user code (first-order function) is applied to produce the output data (adopted from [AEH+11]).

Stratosphere operates on multisets (bags) of records. Records can be viewed as a generalization of key-value pairs. A record consists of an arbitrary number of attributes. Each attribute has a data type such as integer, string, etc. A key of a record is a non-empty subset of its attributes. Keys are used to achieve data parallelization: certain data processing tasks only require a part of a given data set for execution. For each distinct key value, a data subset is built containing all records that have this key value. These independently executable subsets of data are called Parallelization Units (PUs). The generation of Parallelization Units is defined in the operator semantics, also referred to as the Input Contract. The processing of records within a PACT operator is schematically depicted in Figure 2.1. The arrow on the left shows four input records. The key of each record is depicted by the color of the left square (the same color indicates an equal key value).


The Input Contract determines which data subsets (PUs) can be processed independently. For each PU, the first-order function is called, which returns the output data of the operator.

In the following, we describe the existing Input Contracts. As in Map-Reduce, PACT comprises a Map and a Reduce Input Contract, which operate on a single input set. In addition, there are Input Contracts that operate on multiple input sets. They are called Match, CoGroup, and Cross.

Map

The Parallelization Unit split of the single-input Input Contract Map is depicted in Figure 2.2. Map does not use a key. This means that each input record is assigned to a separate PU. In the figure, records are illustrated by the rectangles with colored squares on the left. Separate PUs are indicated by dashed rectangles in the right part of the figure. For each PU, the user-defined first-order function is called.

Figure 2.2: The PU split assigned by the Map second-order function (adopted from [Str12]).

Figure 2.3: The PU split and grouping assigned by the Reduce second-order function (adopted from [Str12]).

Reduce

The second-order semantics of Reduce are illustrated in Figure 2.3. Reduce is a single-input Input Contract and makes use of a key. The different colors red, violet, and yellow within the squares in the figure represent different keys. For each distinct key, a Parallelization Unit is created. Each Parallelization Unit is embraced by a dashed rectangle in the right part of the figure. Each PU contains all records of the input data sharing a particular key.
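As an illustration of the key-based split, the following Java sketch groups records into Parallelization Units in the way just described. The record type and the grouping code are simplified placeholders and not Stratosphere classes.

import java.util.*;

// Illustrative only: building Reduce Parallelization Units by grouping
// records on their key attribute. The record type is a placeholder.
public class ReducePuSketch {
    record Rec(String key, int value) {}   // simplified record: one key, one value

    public static void main(String[] args) {
        List<Rec> input = List.of(new Rec("red", 1), new Rec("yellow", 2),
                                  new Rec("red", 3), new Rec("violet", 4));

        // One PU per distinct key value, containing all records with that key.
        Map<String, List<Rec>> parallelizationUnits = new LinkedHashMap<>();
        for (Rec r : input) {
            parallelizationUnits.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }

        // Each PU could be handed to a separate instance of the first-order function.
        parallelizationUnits.forEach((key, pu) ->
                System.out.println("PU for key " + key + ": " + pu));
    }
}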


Cross

Cross is a multiple-input Input Contract: it has two inputs, A and B. It does not make use of keys. The Parallelization Units are all elements of the cartesian product of inputs A and B. In Figure 2.4, input A is depicted by the upper rectangle, containing two records. Input B is illustrated by the left rectangle with four records. The generated PUs are depicted by the dashed rectangles in the middle, containing two records each.

Figure 2.4: Cross (adopted from [Str12]).
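A corresponding sketch for Cross, again with placeholder types rather than Stratosphere classes: every element of the cartesian product of the two inputs forms one Parallelization Unit.

import java.util.*;

// Illustrative only: Cross builds one Parallelization Unit per element of
// the cartesian product A x B; no keys are involved.
public class CrossPuSketch {
    public static void main(String[] args) {
        List<String> inputA = List.of("a1", "a2");
        List<String> inputB = List.of("b1", "b2", "b3", "b4");

        for (String a : inputA) {
            for (String b : inputB) {
                // Each pair (a, b) is one PU; the first-order function is
                // called once per pair.
                System.out.println("PU: (" + a + ", " + b + ")");
            }
        }
    }
}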

CoGroup

CoGroup has two inputs, A and B, and uses a key. It merges both inputs and generates Parallelization Units based on the key: one PU contains all records of inputs A and B sharing one particular key instance. The semantics of CoGroup are depicted in Figure 2.5: input A is depicted by the rectangle in the upper part of the figure and contains two records. The rectangle on the left of the figure contains four records of input B. Records with the same key, represented by the color, form a PU. In the figure, the red, violet, and yellow keys form one PU each.

Match

Match is a multiple-input Input Contract with two inputs, A and B. It uses a key. In Figure 2.6, the operator logic is depicted: input A is represented by the square in the upper part of the figure and contains two records with a yellow and a red key. Input B is represented by the square in the left part of the figure and contains four records. For each key instance that exists in both inputs A and B, sub-multisets of records S_A and S_B are created (not shown in the figure). Records with key instances that only appear in one input are omitted. In the figure, the record with the violet key is an example of an omitted record. Over the sub-multisets S_A and S_B, a cartesian product is computed. Each element of the cartesian product is one Parallelization Unit. The resulting Parallelization Units are depicted by the dashed rectangles in the figure.
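The following sketch mirrors these Match semantics with placeholder types (it does not use the Stratosphere API): records are grouped per input by key, keys that occur in only one input are dropped, and each pair of the per-key cartesian product S_A x S_B becomes one Parallelization Unit.

import java.util.*;

// Illustrative only: Match forms, for every key present in both inputs,
// the cartesian product of that key's records from A and from B; each
// resulting pair is one Parallelization Unit.
public class MatchPuSketch {
    record Rec(String key, String payload) {}

    static Map<String, List<Rec>> groupByKey(List<Rec> input) {
        Map<String, List<Rec>> groups = new LinkedHashMap<>();
        for (Rec r : input) {
            groups.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Rec> inputA = List.of(new Rec("yellow", "a1"), new Rec("red", "a2"));
        List<Rec> inputB = List.of(new Rec("red", "b1"), new Rec("violet", "b2"),
                                   new Rec("yellow", "b3"), new Rec("red", "b4"));

        Map<String, List<Rec>> sA = groupByKey(inputA);
        Map<String, List<Rec>> sB = groupByKey(inputB);

        for (String key : sA.keySet()) {
            if (!sB.containsKey(key)) continue;   // keys missing in one input are omitted
            for (Rec a : sA.get(key)) {
                for (Rec b : sB.get(key)) {
                    System.out.println("PU for key " + key + ": (" + a + ", " + b + ")");
                }
            }
        }
    }
}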


Figure 2.5: CoGroup (adopted from [Str12]).

Figure 2.6: Match (adopted from [Str12]).


2.1.3 PACT execution by Nephele

A user-written PACT program is a data flow program with operators and directed connections between them, as described before. In order to execute a PACT program on a compute cluster or cloud, Stratosphere compiles it into a Nephele data flow program, which is a directed acyclic graph (Figure 2.7). The graph essentially defines how many times an operator (first- and second-order function) is instantiated, on which physical machines these instances run, and how the operator instances are physically interconnected.

Figure 2.7: Compiling and running a program with PACT and Nephele: the compiler translates a PACT program into a Nephele data flow, which is executed on a cluster or cloud (adapted from [BEH+10]).

2.2 Ordered Data Model

For our subsequent discussion of sliding windows, we introduce an ordered data model. It reflects the data model Stratosphere relies on and extends it by an additional notion of order. In the literature, similar definitions of data models are called the streaming data model [BW01, BBD+02, ACC+03, GO03b] and the sequential data model [SLR94, PSR95]. We adapt the following definitions from [SLR94, PSR95].

Ordered data is a sequence of relational records. Each record is associated with an instance of an ordering domain. The ordering domain is usually time. Assume a set of atomic data types T_Basic. We define a record schema as R = <A_1 : T_1, ..., A_N : T_N> with N ∈ ℕ, T_i ∈ T_Basic, and each A_i representing a named attribute.

Definition 2.1 (Total Order) Let S be a multiset of records of schema R. Let O be a totally ordered domain. O_S is a total order of S by O if for every record r_i ∈ S, there exists p ∈ O such that O_S(p, r_i).

O is called the ordering domain. The elements of the ordering domain are called positions. Definition 2.1 specifies that every record is associated with a position in the ordering domain. Note that the definition allows for a position being assigned to several records and vice versa. Each record has a position, but not every position is necessarily assigned to a record.

Definition 2.2 (Strict Total Order) A total order O_S is strict if for all r, s ∈ S with r ≠ s, their corresponding positions p_r and p_s are not equal.

The strict total order in Definition 2.2 restricts the assignment of positions: each position is associated with at most one record.


Definition 2.3 (Sequence) A sequence is a tuple <S, O, O_S> with S being a multiset of records of schema R_S, O being an ordering domain, and O_S being a strict total order of S by O.

In our subsequent operator definitions, we assume a sequence with a strict total order as described in Definition 2.3. We also discuss semantic implications if the total order is not strict.

2.2.1 Assumptions for ordered data processing

Besides the data model, further assumptions on the processing of ordered data are important when defining operator semantics. Since we refer to operator semantics of streaming databases, we describe common data processing assumptions from streaming databases. Furthermore, we introduce our assumptions for ordered data processing in Stratosphere.

Streaming databases assume that data arrives in "multiple, continuous, rapid, time-varying data streams" [BBD+02]. The data is produced, e.g., by sensors and processed by the DSMS in real time, also referred to as online. DSMS react to arriving data, which is called the DBMS-active, human-passive model [ACC+03]. Because the data is assumed to be time-varying, overload is an issue. The solution is inexact computation: within DSMS, partial data loss or approximated results are acceptable, especially in times of overload. The continuity assumption implies that the amount of input data grows unboundedly, unless, e.g., the user manually interrupts the production of the data. A data stream is assumed to be append-only.

For Stratosphere, we assume a human-active, DBMS-passive model. When a user executes a PACT program, Stratosphere reads the input data from a data store and processes it. This is also referred to as offline data processing. Since Stratosphere reads the input data from a data store, it controls the amount of input data per unit of time. This eliminates the overload issue that streaming databases have to cope with. Inexact computations due to overload are not needed, so we assume that partial data loss or inexact results are not acceptable for Stratosphere. Furthermore, we assume that the total amount of input data exceeds the amount of main memory that is available to Stratosphere. Finally, we assume that data is append-only. This means that Stratosphere writes data only once and cannot update the same data later.

2.3 Continuous query operators

In the previous sections, we described assumptions about ordered data processing in streaming databases and in Stratosphere. The assumptions have in common that the input data is larger than the available main memory. This assumption limits query operators on ordered data to requiring only a part of the input data at once. In the context of streaming databases, these operators are referred to as continuous query operators. In the following, we describe properties of these operators, because the properties are important for Stratosphere operators on ordered data as well. Furthermore, we introduce sliding windows and argue why the concrete semantics of existing DSMS operators are not suitable to serve as a basis for sliding window operators in Stratosphere.

2.3.1 Properties of continuous query operators

By the term operator, we refer to an operator that is provided by a system like Stratosphere or a DSMS. In DSMS, operators usually implement functions or algorithms such as SUM() or JOIN(). In Stratosphere, operators like Reduce or Cross consist of a first- and a second-order function. By "operator", we refer to the second-order function.

In [GO10], the authors distinguish between stateful and stateless continuous query operators. Examples of stateless operators are selection, projection, or the Stratosphere operator Map. Stateless operators process records on the fly and independently from each other. In particular, there is no state that an operator keeps when processing subsequent records. Stateful operators, on the other hand, keep a state by using intermediate variables or buffers. An example of a stateful operator is SUM(): in order to calculate the sum over a set of records that can only be read sequentially, a temporary variable is needed. It holds the sum up to the record that is currently read. The Stratosphere operators Reduce, CoGroup, and Match are implemented statefully: they buffer records until the end of the input data is reached in order to ensure that the Parallelization Units are complete when calling the first-order function. In general, stateless operators are unproblematic when applying them to continuous data. For stateful operators, on the other hand, two properties decide their suitability for continuous data: they have to be non-blocking and use bounded space. For the further discussion of these two properties, we distinguish between the semantic definition of operators and their implementation: the semantics can be implemented in various ways, which influences these properties.

Operators are blocking if they require the complete input before producing any output [BBD+02]. One example of a blocking operator is SUM(): independent of its implementation, it requires all input records in order to compute the output. Functions like SORT(), COUNT(), or MIN() are defined with blocking semantics. Operators on streaming data must not be blocking, because the input data potentially never ends and blocking operators would not produce output until the data ends. The Stratosphere operators Reduce and CoGroup are implemented in a blocking way: they wait until the input data is read entirely before calling the first-order function with the completed Parallelization Units.

As an example of unbounded space usage, consider the operator SORT(). Assume that an implementation of SORT() keeps a temporary history of all records until the last record arrives. This implementation uses space proportional to the amount of input data. In our data processing assumptions, we state that the data is larger than the available main memory and hard disk space. Thus, continuous query operators are required to use only a bounded amount of space that fits into the available memory. Within the Stratosphere operator implementations, Reduce, CoGroup, Cross, and Match use unbounded space: they keep a history of partitions of the complete input data.

An operator that is blocking does not necessarily use unbounded space. Unbounded space usage of an operator does not imply that it is blocking. The two operator properties are not linked to each other. Consider SUM() as an example. It is a blocking operator by semantic definition, but it can be implemented to use bounded space. SORT() is an example of an operator that is both blocking and unbounded in space, if we consider an implementation that keeps the complete data history in a buffer and if we assume that data is processed append-only. There are also operators that are non-blocking, but still unbounded in space. Consider a symmetric hash-based join. For the details of this algorithm, please refer to Appendix A.4.2. For both inputs, a hash table is created. Once a new record is read, it is inserted into its corresponding hash table and probed against the hash table of the other input. If it matches, output is created. Thus, the operator is not blocking, but it keeps hash tables of both relations and uses space proportional to the amount of input data.
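To make the last point concrete, the following sketch implements the symmetric hash join idea for a single integer key; the record type and the explicit interleaving of the two inputs are simplifying assumptions.

import java.util.*;

// Illustrative symmetric hash join: every arriving record is inserted into
// the hash table of its own input and probed against the other input's
// table. Output is produced immediately (non-blocking), but both tables
// grow with the input (unbounded space).
public class SymmetricHashJoinSketch {
    record Rec(int key, String payload) {}

    private final Map<Integer, List<Rec>> tableA = new HashMap<>();
    private final Map<Integer, List<Rec>> tableB = new HashMap<>();

    void insertFromA(Rec r) { process(r, tableA, tableB, "A"); }
    void insertFromB(Rec r) { process(r, tableB, tableA, "B"); }

    private void process(Rec r, Map<Integer, List<Rec>> own,
                         Map<Integer, List<Rec>> other, String side) {
        own.computeIfAbsent(r.key(), k -> new ArrayList<>()).add(r);
        for (Rec match : other.getOrDefault(r.key(), List.of())) {
            System.out.println("join result: " + side + "=" + r + ", other=" + match);
        }
    }

    public static void main(String[] args) {
        SymmetricHashJoinSketch join = new SymmetricHashJoinSketch();
        join.insertFromA(new Rec(1, "a1"));
        join.insertFromB(new Rec(1, "b1"));   // emits (a1, b1) immediately
        join.insertFromB(new Rec(2, "b2"));
        join.insertFromA(new Rec(2, "a2"));   // emits (a2, b2) immediately
    }
}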

In order to obtain non-blocking operators that use bounded space, sliding window semantics can be applied to existing operators: sliding windows remove the blocking property, because the operator only waits for a single window to complete in order to produce output. If we assume that sliding windows can be computed independently, operators need not keep a history larger than the window size, which causes them to use a bounded amount of space. If the operator is an aggregate function, bounded space usage can also be achieved by using properties of the function. We discuss this after the introduction of sliding windows.

2.3.2 Sliding windows

There are two types of sliding windows: count-based (also referred to as fixed-size) and time-based (also referred to as variable-size) sliding windows. The latter is called time-based because the ordering domain of ordered data usually is time. The main parameters of sliding windows are the window size and the window slack: if the window type is count-based, the window size defines how many records are contained in one window. If the window type is time-based, the parameter defines a timespan within the ordering domain, e.g., "3 days". Similarly, the window slack defines how far to slide ahead once a window is processed. Usually, both parameters are either time-based or count-based. A mixture, e.g., a count-based window size combined with a time-based window slack, is possible, but to our knowledge practically not relevant. If the window size is smaller than the window slack, the windows are called jumping windows. If the window size is equal to the window slack, the sliding windows are called tumbling windows [BDD+10]. If the window slack is smaller than the window size, the resulting windows are overlapping. Other sliding window types like landmark windows (usually with a fixed starting point and a moving end point) are also discussed in the literature [GO10]. To our knowledge, landmark windows are not commonly used.
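The interaction of window size and window slack can be illustrated with the following sketch, which enumerates count-based windows over an in-memory list; the batch, in-memory setting is a simplification.

import java.util.*;

// Illustrative enumeration of count-based sliding windows: 'size' records
// per window, advancing by 'slack' records between consecutive windows.
public class CountBasedWindows {
    static List<List<Integer>> windows(List<Integer> data, int size, int slack) {
        List<List<Integer>> result = new ArrayList<>();
        for (int start = 0; start + size <= data.size(); start += slack) {
            result.add(data.subList(start, start + size));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(windows(data, 3, 1)); // overlapping: [1,2,3], [2,3,4], ...
        System.out.println(windows(data, 3, 3)); // tumbling:    [1,2,3], [4,5,6]
        System.out.println(windows(data, 2, 3)); // jumping:     [1,2],   [4,5]
    }
}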

2.3.3 Types of aggregate functions

In the previous sections, we discussed properties of continuous operators and described sliding windows. If a continuous query operator is an aggregate function like COUNT(), MAX(), or SUM(), its implementation can potentially be optimized regarding space usage and parallelizability. We discuss these optimization possibilities by introducing three categories of aggregate functions: distributive, algebraic, and holistic [GCB+97, CS01, GO10].

An aggregate function F() is distributive if it can be decomposed into an early aggregate function E() that works on subsets of the data and a consolidating function C(), so that F({X_{i,j}}) = C({E({X_{i,j}})}). Furthermore, F() and E() are represented by the same function. Distributive aggregate functions are, e.g., COUNT(), MAX(), and SUM(). As an example, consider SUM(): both E() and C() are represented by the original aggregate function SUM(). Distributive aggregate functions can be computed incrementally using constant space and time [GO10].

If both the early aggregate function E() and the consolidating function C() are different from the original aggregate function F(), then F() is called algebraic. An example of an algebraic aggregate function is AVERAGE(): F() divides the sum of all records by the total count of records. If we split the function into an early aggregate function and a consolidating function, the early aggregate function computes the sum and the count on subsets of the data. For brevity, we still call this a "function", even though it has two outputs. The consolidating function for AVERAGE() divides the sum over all subsets by the sum of the counts of all subsets. Like distributive functions, algebraic functions can be incrementally computed in constant space and time.
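The following sketch illustrates the decomposition for the algebraic function AVERAGE(): the early aggregate E() produces a (sum, count) pair per subset, and the consolidating function C() combines the partial results. The hard-coded split into two subsets stands in for subsets processed on different compute nodes. SUM() would be the distributive special case in which E() and C() are the original function itself.

import java.util.*;

// Illustrative decomposition of AVERAGE() into an early aggregate E()
// (per subset: sum and count) and a consolidating function C()
// (over all partial results: total sum / total count).
public class AlgebraicAverage {
    record Partial(double sum, long count) {}

    // E(): early aggregate on one subset of the data.
    static Partial early(List<Double> subset) {
        double sum = 0;
        for (double v : subset) sum += v;
        return new Partial(sum, subset.size());
    }

    // C(): consolidation of the partial results.
    static double consolidate(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) { sum += p.sum(); count += p.count(); }
        return sum / count;
    }

    public static void main(String[] args) {
        // Two subsets, e.g. processed on two different compute nodes.
        List<Double> subset1 = List.of(1.0, 2.0, 3.0);
        List<Double> subset2 = List.of(4.0, 5.0);
        double avg = consolidate(List.of(early(subset1), early(subset2)));
        System.out.println("AVERAGE = " + avg);   // 3.0
    }
}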

Aggregate functions that are neither distributive nor algebraic are called holistic. For holistic functions, the complete input is needed before any computation is started: "these aggregate functions cannot be decomposed into sub-aggregate functions and their computations depend on the entire set of the input" [CS01]. Examples of holistic aggregate functions are MEDIAN(), QUANTILE(), or COUNTDISTINCT(): these functions require space proportional to the input, independent of the concrete implementation of the function.

The computation of distributive and algebraic aggregate functions can be optimized in two ways. First, the early aggregate function can be computed on distinct subsets of the input data. On a non-parallelized database system, this partial calculation saves memory: it is not necessary to keep a history of all previous records in main memory; a constant data structure like a single variable is sufficient. On a parallelized system, the early aggregate function additionally enables parallelization, since a different compute node can potentially be used for each independent subset of data. Second, if we consider overlapping sliding windows, the result of the previous sliding window can potentially be reused: consider the distributive function SUM(). Instead of re-calculating the sum over the complete new window, the sum of the previous window can be updated by subtracting the values of records that belong only to the previous window and by adding the values of new records in the current window. Neither of these optimizations can be applied to holistic functions.
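The second optimization can be sketched as follows for SUM() over count-based windows with slack 1: instead of recomputing each window from scratch, the value of the expired record is subtracted and the value of the newly entering record is added. Time-based windows and numerical issues are ignored in this simplification.

// Illustrative incremental SUM() over count-based sliding windows
// (window size 'size', slack 1): each new window reuses the previous
// window's sum instead of summing all of its records again.
public class IncrementalWindowSum {
    public static void main(String[] args) {
        int[] values = {4, 1, 7, 3, 9, 2};
        int size = 3;

        // Sum of the first window, computed from scratch once.
        long sum = 0;
        for (int i = 0; i < size; i++) sum += values[i];
        System.out.println("window ending at index " + (size - 1) + ": sum = " + sum);

        // Every following window: subtract the expired value, add the new one.
        for (int end = size; end < values.length; end++) {
            sum = sum - values[end - size] + values[end];
            System.out.println("window ending at index " + end + ": sum = " + sum);
        }
    }
}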

2.4 Discussion

Our original intention was to reuse the exact semantics of existing operators from streaming databases for the definitions of continuous operators in Stratosphere. The discussed concept of non-blocking, bounded-space operators using sliding windows is common amongst DSMS. We reuse these concepts for Stratosphere. In order to reuse concrete semantics of operators, common semantic operator definitions from DSMS are needed.

In [BDD+10], the SECRET model is developed, which allows for the comparison of operator semantics of DSMS. The model focuses on single-input operators that use time-based windows. Three common academic and industrial streaming systems are compared: STREAM, Coral8, and StreamBase. The result is that all of them implement different semantics for the same operations: "There are no standards today for querying streams; each system has its own semantics and syntax." [BDD+10] For this reason, we solely apply the sliding window concepts to Stratosphere operators and define the semantic details without regard to the heterogeneous examples from DSMS.


Chapter 3

Sliding Window PACTs

In this chapter, we define sliding window operators for Stratosphere. Our aim is to enable Stratosphere users to run sliding window operations on the system without the need to implement the corresponding semantics themselves. We start with general considerations on where sliding window semantics can be implemented within Stratosphere. Subsequently, we define preconditions that are important for all sliding window operators: order and sliding window records. Each following section describes one sliding window operator. An operator definition starts with a semantic approach that defines requirements for the second-order function and partitioning strategies, if applicable. It is followed by a description of how to implement the operator. The focus of this implementation part is the parallelization strategy. Finally, we discuss the operator semantics and their implementation regarding alternatives and optimization possibilities.

3.1 General considerations

When implementing sliding window PACT programs with the given operators of Stratosphere, users are required to implement the sliding window semantics in the first-order functions. The approach of this thesis is to move sliding window semantics into the second-order functions. From a user's perspective, the usage of sliding windows within Stratosphere becomes easier, because instead of implementing sliding window semantics, he can apply existing operators. From a conceptual point of view, functionality within second-order functions can be implemented more efficiently than in user-defined first-order functions. Furthermore, second-order functions enable internal optimizations, potentially leading to higher execution speed and lower consumption of compute resources compared to a user-implemented solution.

In contrast to streaming systems, our aim is to provide deterministic operator semantics. Within DSMS, one main source of nondeterminism is the arrival time of records, including the inter-operator arrival times. The arrival times cannot be controlled, because streaming data processing is assumed to be DBMS-active, human-passive, as we described in Chapter 2.2.1: the assumption implies that operators react to new records that arrive at the operator. The window slide is usually made dependent on the record arrival time.


This potentially leads to different results in each run of the same program with the same data, as described for the DSMS StreamBase in [BDD+10]. For Stratosphere, on the other hand, we assume human-active, DBMS-passive data processing. Record arrival times are under the control of operators, because Stratosphere deliberately reads the input data from a data store. For each sliding window operator in Stratosphere, we define the conditions that trigger window slides, so that repeated runs of one PACT program with the same data lead to identical results.

3.2 Order and sliding window records

We assume that the input data for a sliding window PACT program is provided as one or many ordered multisets R: we define an ordered multiset R as a tuple (R, O) with R being a multiset of records and O being an ordering function. The ordering function assigns instances of an ordering domain to each record. Let the ordering relation be ≥. We assume that for each pair a and b of instances of the ordering domain, either a ≥ b or b ≥ a. This is the main property of total orders. It ensures that the timestamps of any pair of records within R are comparable by ≥, which is important when sorting records by their ordering domain. We further assume that the total order is strict and denote the strict order by >. A strict total order means that for every pair of timestamps a and b within a multiset, either a > b or b > a, but not both. If an order is not strict, records with equal timestamps occur: within the operator subchapters, we discuss the implications of non-strictly totally ordered data on the operator semantics.

If we define new PACT operators with sliding window semantics, we also need to extend the record data model. We call this extension sliding window record. The reason for a dedicated record type is that the sliding window operators need additional meta information attributes within each record. These attributes need to be readable and writable by PACT second-order functions. Except for keys, the existing records are transparent to second-order functions. Sliding window records provide sliding window second-order functions read and write access to three predefined attributes: First, the timestamp, which holds one instance of the ordering domain. Second, the operator ID, holding an ID of the operator instance that emitted the record. The second-order function instances apply their ID to all sliding window records that are emitted by first-order functions before handing them to Nephele. The reason for an operator ID is explained in the sliding window Reduce subchapter. Third, sliding window records contain an attribute that holds repartitioning information for sliding window Cross and sliding window Match. The details of this attribute are explained in the sliding window Cross subchapter.
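A minimal Java sketch of such a record type could look as follows. The class and field names are illustrative and do not correspond to the actual Stratosphere record API; we assume a long-valued ordering domain and model the repartitioning information as a boolean punctuation flag, as used later by sliding window Cross.

// Sketch of a sliding window record; names are illustrative, not Stratosphere API.
public class SlidingWindowRecord {
    private long timestamp;        // instance of the ordering domain
    private String operatorId;     // ID of the operator instance that emitted the record
    private boolean punctuated;    // repartitioning info used by sliding window Cross/Match
    private final Object[] fields; // user payload, opaque to the second-order function

    public SlidingWindowRecord(long timestamp, String operatorId, Object... fields) {
        this.timestamp = timestamp;
        this.operatorId = operatorId;
        this.punctuated = false;
        this.fields = fields;
    }
    public long getTimestamp()            { return timestamp; }
    public void setTimestamp(long t)      { this.timestamp = t; }
    public String getOperatorId()         { return operatorId; }
    public void setOperatorId(String id)  { this.operatorId = id; }
    public boolean isPunctuated()         { return punctuated; }
    public void setPunctuated(boolean p)  { this.punctuated = p; }
    public Object getField(int i)         { return fields[i]; }
}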

The timestamp of sliding window records is initially set when reading a record from a data source: In the implementation of a dedicated reader class, the user can choose to either apply a timestamp taken from the data itself (application time), to apply the Stratosphere system time, or to assign instances of any other ordering domain, such as natural numbers in ascending order. If the system time is applied, non-strictly ordered data can result: Stratosphere parallelizes the reading of the data source, so it is likely that the same timestamp is assigned to two or more different records. By design, this cannot be influenced: the reader classes cannot communicate with each other and can thus not be synchronized. The user can decide to use more than one ordering domain within one sliding window PACT program: operators can output records with instances of a different ordering domain than the ordering domain of their input records. The only restriction is that each operator instance generates sorted output: the semantic definitions of succeeding operators rely on this assumption.
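The three timestamp choices can be illustrated with the following hypothetical reader helper (the enum and class are our own sketch, not part of Stratosphere; the system-time branch is exactly the case that may yield a non-strict order under parallel reading):

// Sketch of timestamp assignment in a sliding window input reader.
enum TimestampMode { APPLICATION_TIME, SYSTEM_TIME, SEQUENCE_NUMBER }

class TimestampAssigner {
    private final TimestampMode mode;
    private long sequence = 0;

    TimestampAssigner(TimestampMode mode) { this.mode = mode; }

    long assign(long applicationTime) {
        switch (mode) {
            case APPLICATION_TIME: return applicationTime;            // taken from the data itself
            case SYSTEM_TIME:      return System.currentTimeMillis(); // may produce equal timestamps
            default:               return sequence++;                 // ascending natural numbers
        }
    }
}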

3.3 Sliding window Map

In the semantic definition of Map, sliding window semantics are not applicable: Map operates on each record of the input independently. We call this operator property stateless (Chapter 2.3.1). Nevertheless, a sliding window Map is necessary to support sliding window records: its second-order function accepts sliding window records rather than the standard records and passes them to the first-order function. Each instance of the second-order function has a globally unique ID: it assigns this ID to each output sliding window record of the first-order function before passing it back to Nephele.
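A minimal sketch of this behavior, reusing the SlidingWindowRecord sketch from Section 3.2 (names and the UUID-based instance ID are our own illustration, not the Stratosphere API):

import java.util.List;
import java.util.UUID;
import java.util.function.Function;

// Sketch of the sliding window Map second-order function: it forwards records
// to the user first-order function and stamps every output record with the
// globally unique ID of this operator instance.
class SlidingWindowMap {
    private final String instanceId = UUID.randomUUID().toString();
    private final Function<SlidingWindowRecord, List<SlidingWindowRecord>> firstOrderFunction;

    SlidingWindowMap(Function<SlidingWindowRecord, List<SlidingWindowRecord>> udf) {
        this.firstOrderFunction = udf;
    }

    List<SlidingWindowRecord> processRecord(SlidingWindowRecord input) {
        List<SlidingWindowRecord> output = firstOrderFunction.apply(input);
        for (SlidingWindowRecord r : output) {
            r.setOperatorId(instanceId); // tag before handing the records back to Nephele
        }
        return output;
    }
}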

3.4 Sliding window Reduce

Figure 3.1: Sliding window Reduce: split of Parallelization Units into sliding windows (example with window size 3 and window slack 1: the input 1, 2, 3, 4, 5 within one PU is split into the sliding windows (1,2,3), (2,3,4), (3,4,5)).

The existing non-sliding-window Reduce partitions its input data based on keys. These partitions are called Parallelization Units (PUs). For each partition, the first-order function is called. Because of the PU creation, the implementation of Reduce is stateful, blocking, and uses unbounded space (see Chapter 2.3.1). Thus, Reduce is not suitable for ordered data under the data processing assumptions we described in Chapter 2.2.1. In order to make Reduce compatible with these assumptions, we apply sliding window semantics to its semantic definition and describe a corresponding implementation. We refer to the new operator as sliding window Reduce. In addition to the partitioning by key, sliding window Reduce splits the partitions into sliding windows. The first-order function is called on each sliding window instead of on the PUs (Figure 3.1): Up to the PU split, the operator has the same functionality as its non-sliding-window version. After the key split, the records are further broken down into subsequent windows, according to the order defined by the time attributes of the sliding window records and the user-defined parameters window slack (wsl) and window size (wsi).

3.4.1 Semantic approach

Consider the example PACT program depicted in Figure 3.2. It consists of one data source which is the input for a sliding window Map, followed by a sliding window Reduce. Since we only discuss second-order operator semantics, we do not define the user-implemented first-order functions.

Figure 3.2: Example PACT program using sliding window Reduce (data source → SW Map → SW Reduce → data sink).

Given the available physical computation resources, Stratosphere spans the PACT program into a job graph to be executed by Nephele. One example of such a job graph is depicted in Figure 3.3: There is one data source. The Mapper gets two operator instances allocated that read from the data source in parallel. Stratosphere uses a read strategy such as round robin to distribute the input amongst the two Map instances. Both Mappers are connected to three subsequent sliding window Reduce operator instances. Their output is final and written to the data sinks.

By default, Stratosphere assigns a hash repartitioning strategy to the output of each Map instance. When the PACT program is executed, the strategy chooses a hash function and applies it to the key of each output record of Map: According to the result of the hash function, the records are distributed amongst the subsequent sliding window Reduce instances. Each Reduce instance is assigned a fixed set of keys which is disjoint from the key sets of all other Reduce instances. For this set, it obtains all records with the corresponding keys. The hash-based repartitioning strategy exists for the non-sliding-window Reduce and is used without change for sliding window Reduce.

Figure 3.3: Example job graph with two Map and three Reduce instances (one data source feeding SW Map1 and SW Map2, which are both connected to SW Reduce1, SW Reduce2, and SW Reduce3; each Reduce instance writes to its own data sink).

We assume that the input for a sliding window PACT program is provided sorted by its ordering domain. For reading the input, Stratosphere uses a strategy such as round robin to distribute it amongst the Map instances. We assume that this strategy does not change the order of the tuples. Furthermore, we assume that operators do not change the order of records. In the example, Map could (1) drop a record, (2) return one output record for one input record, or (3) output several records for one input record. It can output records either with instances of the same ordering domain as the input record or assign instances of another ordering domain. For case (3), we require it to output the records in ascending order. This is important for the semantics of subsequent operators.

From the perspective of sliding window Reduce, the input coming from one single Map instance is sorted. The overall input of each instance of sliding window Reduce is merged from all preceding Map instances. This merge is performed transparently by Nephele and cannot be influenced from within PACT. Thus, the overall input of sliding window Reduce is not sorted.

In the following, we introduce an example with input data for one sliding window Reduce instance. Along this example, we develop a second-order function for sliding window Reduce. Assume that the sliding window Reduce in the example PACT program from above calculates a count-based sliding window with a size of 3 and a window slack of 1. Additionally, assume that records arrive at the Reduce instance in the order depicted in Table 3.1: The arrival order of records is given implicitly and only written here for reference. The timestamp and source operator instance ID of each record are given explicitly by the sliding window record. For simplicity, we depict only records for one single key and omit the key in the following. From now on, we use the notation window(x, y, z) for a multiset of sliding window records with timestamps x, y, and z.

If we canonically apply sliding windows to the data in Table 3.1, the first-order function is called for each of the following multisets: window(1,5,2), window(5,2,6), window(2,6,3), and so on. The result depends on the inter-operator arrival times of the records. In every run of the program, the order of incoming records for sliding window Reduce is likely to be different: It depends on external factors like network latency or CPU usage which cannot be influenced within Stratosphere. This leads to indeterministic results, which is not our intention. Thus, this canonical way of creating sliding windows is not sufficient.

Table 3.1: Input example.

arrival order | timestamp of record | source operator instance ID
1 | 1 | Map1
2 | 5 | Map1
3 | 2 | Map2
4 | 6 | Map1
5 | 3 | Map2
6 | end of input | Map2
7 | 7 | Map1
8 | end of input | Map1

One approach for an improvement is to locally sort incoming records by their timestamps. If we consider the arrival order of Table 3.1 together with a local sort, the canonical window split is window(1,2,5), window(2,5,6), window(2,3,6), and so on. The output would still depend on the inter-operator arrival times of the records.

We intend to generate the following output: window(1,2,3), window(2,3,5), window(3,5,6), and so on. In order to obtain this semantic, we use an additional piece of information from each sliding window record: the preceding operator instance ID. Every instance of a preceding operator generates a unique identifier for itself and appends it to each output sliding window record. With this information, sliding window Reduce can decide whether there is a predecessor that could still deliver output within a potential next window: For count-based semantics, the potential next window consists of all locally sorted records between position 0 (the record with the oldest timestamp) and position wsi (the position given by the window size). For time-based sliding windows, the potential next window consists of all records between ttail (the record with the oldest timestamp) and the record with timestamp ttail + wsi. If there is no preceding operator that could deliver records with timestamps within the potential next window, the window can be passed to the first-order function. After processing the window, the slide can be performed according to the window slack (wsl) parameter: For count-based sliding windows, the records between position 0 and wsl are deleted. For time-based semantics, the records with timestamps ttail, ..., ttail + wsl are removed.

We illustrate our approach with an example in Table 3.2: The input is identical to the example of Table 3.1. Again, we consider count-based semantics with window size (wsi) 3 and window slack (wsl) 1. The additional column sorted queue displays the current order of records after applying the local sort. The action column describes the functionality of a sliding window Reduce second-order function: thead (the newest timestamp in the potential next window) is compared with the latest timestamp seen from each input operator instance. If thead is less than or equal to the latest timestamps of all preceding operator instances, there is no more input for this potential next window. This follows from our assumptions that the input data for sliding window PACTs is provided sorted and that each operator instance outputs records in sorted order.


Table 3.2: Example for the sliding window split.

arrival order | timestamp of record | source op. inst. ID | sorted queue | action
1 | 1 | Map1 | (1) | Potential next window (1): window size not reached, no output.
2 | 5 | Map1 | (1,5) | Potential next window (1,5): window size not reached, no output.
3 | 2 | Map2 | (1,2,5) | Potential next window (1,2,5): input from Map1 is complete (latest seen timestamp: 5), waiting for more input from Map2 (latest seen timestamp: 2), no output.
4 | 6 | Map1 | (1,2,5,6) | Potential next window (1,2,5): input from Map1 is complete (latest seen timestamp: 6), waiting for more input from Map2 (latest seen timestamp: 2), no output.
5 | 3 | Map2 | (1,2,3,5,6) | Potential next window (1,2,3): input from Map1 (latest seen timestamp: 6) and Map2 (latest seen timestamp: 3) is complete: calculate window and slide 1 record ahead for the next potential window. Potential next window (2,3,5): input from Map1 (latest seen timestamp: 6) complete, waiting for more input from Map2 (latest seen timestamp: 3), no further output.
6 | end of input | Map2 | (2,3,5,6) | Potential next window (2,3,5): input from Map1 (latest seen timestamp: 6) and Map2 (end of input) is complete: calculate window and slide 1 record ahead. Potential next window (3,5,6): input from Map1 (latest seen timestamp: 6) and Map2 (end of input) is complete: calculate window and slide 1 record ahead. Potential next window (5,6): window size not reached, no further output.
7 | 7 | Map1 | (5,6,7) | Potential next window (5,6,7): input from Map1 (latest seen timestamp: 7) and Map2 (end of input) is complete: calculate window and slide 1 record ahead. Potential next window (6,7): window size not reached, no further output.
8 | end of input | Map1 | (6,7) | Potential next window (6,7): window size not reached, no output.


Table 3.3: Input example for the special case regarding the timely availability of information about the number of predecessors.

arrival order | timestamp of record | source operator instance ID
1 | 1 | Map1
2 | 5 | Map1
3 | 6 | Map1
4 | 7 | Map1
5 | 2 | Map2
6 | 3 | Map2
7 | end of input | Map2
8 | end of input | Map1

Table 3.4: Input example for the special case regarding non-strict total order.

arrival order | timestamp of record | value
1 | 1 | a
2 | 2 | b
3 | 3 | c
4 | 3 | d

One implicit assumption we made in the example is that the number of preceding operator instances is known to the Reduce second-order function from the beginning: Consider the same input data as in the previous Tables 3.1 and 3.2, but with the different record arrival order described in Table 3.3: In the row with arrival order 3, Reduce would erroneously run the first-order function on window(1,5,6) if it did not know about the existence of Map2 at this time. The next windows would be window(5,6,7), window(2,6,7), etc. Thus, sliding window Reduce needs the information about the number of predecessors beforehand, otherwise the semantics become indeterministic.

One detail has to be regarded for non-strict total orders in combination with count-based semantics. Assume that the order of the data is a non-strict total order: the same timestamp can be assigned to more than one sliding window record. Consider the arrival order of records for a sliding window Reduce as depicted in Table 3.4. Again, consider a count-based sliding window with window size 3 and window slack 1. The source operator instance ID is not relevant in this context, so it is not shown in the table. Instead, we add a column "value" in order to emphasize that the records with arrival order numbers 3 and 4 are different. The calculated windows for this input are window(1[a], 2[b], 3[c]) and window(2[b], 3[c], 3[d]). Depending on the implementation of the sorted queue, records 3[c] and 3[d] could also be inserted in reverse order, because they have the same timestamp. The resulting windows would then be window(1[a], 2[b], 3[d]) and window(2[b], 3[d], 3[c]), which is different from the previous result. Thus, if the user decides to use a non-strict total order, he has to be aware of this (indeterministic) semantic.

Table 3.5: Input example for the special case regarding non-strict total order (time-based semantics).

arrival order | timestamp of record | value | source operator instance ID
1 | 1 | a | Map1
2 | 2 | b | Map2
3 | end of input | n.a. | Map2
4 | 3 | c | Map1
5 | 3 | d | Map1
6 | 4 | e | Map1

For time-based semantics together with a non-strict total order, the comparison criterion in the check for pending records for the potential next window is important for deterministic results: Consider the record input example for a sliding window Reduce instance in Table 3.5. The window size is 3 and the window slack is 1. In the row with arrival order number 4, the sliding window Reduce second-order function checks whether the last seen timestamps of all preceding operator instances are higher than 3. If the comparator included equality (≥), window(1[a], 2[b], 3[c]) would be calculated erroneously in this step: The following record with arrival order number 5 again has the timestamp 3. The correct result is window(1[a], 2[b], 3[c], 3[d]) in this case, so the comparator needs to exclude equality (>).


3.4.2 Considerations for implementation

The second-order sliding window Reduce function gets parameters in addition to the already existing key parameter: isCountBased, window size (wsi), and window slack (wsl). By setting isCountBased to true, the wsi and wsl parameters are interpreted with count-based semantics. Count-based semantics yield sliding windows of fixed size. If isCountBased is set to false, the parameters are interpreted with time-based semantics. This leads to windows of variable size, depending on how many records exist for a certain time span. The window slack then also covers a varying number of records.

Each instance of the second-order sliding window Reduce function holds two additional data structures per key k. First, there is a list of the latest timestamp seen from each preceding operator instance M1..Mm: tmax(M1, k), tmax(M2, k), ..., tmax(Mm, k). We choose the character M because, in the previous examples, we used Map as the preceding operator of Reduce. Each tmax value is parameterized with k, because the value is kept per key value k. The second data structure of each sliding window Reduce instance is a sorted queue per key k, Q(k): By sorted queue, we mean a data structure that holds sliding window records ordered by their time attribute. One sliding window second-order function will usually receive input from several instances of preceding operators like sliding window Map. It receives all sliding window records for a set of keys. Each arriving record is inserted in order into the sorted queue of its key k, Q(k). If a record arrives late (its timestamp is older than the tail of the queue), it is discarded without being inserted, because it violates our assumptions and might thus break the window slide logic. We assume that the records from each preceding operator instance are received in order.
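A sketch of this per-key state in Java (illustrative; long-valued timestamps, String operator IDs, and the SlidingWindowRecord sketch from Section 3.2):

import java.util.*;

// Per-key state of one sliding window Reduce instance.
class ReduceState {
    // tmax(M, k): latest timestamp seen from predecessor instance M for key k
    final Map<Object, Map<String, Long>> tMax = new HashMap<>();
    // Q(k): records for key k, kept sorted by timestamp
    final Map<Object, TreeMap<Long, List<SlidingWindowRecord>>> queues = new HashMap<>();

    void insert(Object key, SlidingWindowRecord r) {
        TreeMap<Long, List<SlidingWindowRecord>> q =
                queues.computeIfAbsent(key, k -> new TreeMap<>());
        // Discard late records: older than the current tail (oldest timestamp) of the queue.
        if (!q.isEmpty() && r.getTimestamp() < q.firstKey()) return;
        q.computeIfAbsent(r.getTimestamp(), t -> new ArrayList<>()).add(r);
        tMax.computeIfAbsent(key, k -> new HashMap<>())
            .merge(r.getOperatorId(), r.getTimestamp(), Math::max);
    }
}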

In Figures 3.4 and 3.5, we illustrate sorted queues for time-based and count-based semantics. For time-based semantics, we refer to all records between the tail (oldest timestamp) and the time ttail + wsi as the potential next window. For count-based semantics, the potential next window lies between position 0, which holds the record with ttail, and position wsi. The potential next window is at the tail of the queue (the records with the smallest timestamps) because, by assumption, each preceding operator provides its output in ascending order. Thus, in the sorted queue, the window at the tail will be the first to become complete: A window is complete once, by assumption, there can be no more input within the potential next window. If the potential next window is not complete yet, all subsequent windows (i.e., the window from ttail + wsl to ttail + wsi + wsl, and so on) cannot be complete either.

Figure 3.4: Sorted queue for time-based semantics (time axis from the tail at ttail to the head; the potential next window covers the range ttail to ttail + wsi).

Figure 3.5: Sorted queue for count-based semantics (positions from the tail at position 0 to the head; the potential next window covers positions 0 to wsi).

From the examples in the previous subchapter, we derive second-order function algorithms for time-based and count-based semantics: They perform the slicing of the key partitions according to the user-defined parameters and call the first-order function on the resulting sliding windows. For time-based semantics, the second-order function is depicted in Figure 3.6. The algorithm is called after a record has been inserted into the sorted queue. It uses the tmax values of the preceding operators and the sorted queue, both for the key value of the previously inserted record. In step (a), the algorithm checks whether the potential next window between ttail and ttail + wsi is complete: This is the case once the tmax values of all preceding operators are larger than the largest timestamp in the potential next window, ttail + wsi. The condition is also true for operator instances that have returned all their output and whose tmax is end-of-file (inactive operators). If this is not the case, the algorithm ends and the first-order function is not called. If the condition is true, the algorithm moves to step (b) and calls the first-order function on the current window. The current window contains all records between the tail of the queue ttail and ttail + wsi. Finally, in step (c), the window slides by the window slack parameter: all records between ttail and ttail + wsl are removed from the queue. Since further sliding windows might also be complete, the algorithm is re-started after this step and repeated until it hits a non-complete window in step (a).

Figure 3.6: Reduce algorithm for time-based semantics (flowchart: after every insert, the algorithm is started with the data structures of the current key k; (a) if for all active predecessor operators M: tmax(M, k) > ttail + wsi, then (b) calculate the window with all sliding window records from ttail to ttail + wsi (= call the first-order function) and (c) prune all records from ttail to ttail + wsl from the queue, then repeat; otherwise end).
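A compact Java sketch of this loop (illustrative only; it assumes a timestamp-sorted TreeMap as the sorted queue Q(k) for the current key, that predecessors which already returned end-of-file have been removed from the tmax map, and wsl ≥ 0):

import java.util.*;
import java.util.function.Consumer;

// Time-based sliding window Reduce loop for one key (steps (a)-(c) of Figure 3.6).
class TimeBasedReduce {
    static void onInsert(TreeMap<Long, List<SlidingWindowRecord>> queue,
                         Map<String, Long> tMaxPerPredecessor,
                         long wsi, long wsl,
                         Consumer<List<SlidingWindowRecord>> firstOrderFunction) {
        while (!queue.isEmpty()) {
            long tTail = queue.firstKey();
            long windowEnd = tTail + wsi;
            // (a) complete only if every active predecessor has passed the window end
            boolean complete = tMaxPerPredecessor.values().stream()
                                                 .allMatch(tMax -> tMax > windowEnd);
            if (!complete) return;
            // (b) call the first-order function on all records in [tTail, tTail + wsi]
            List<SlidingWindowRecord> window = new ArrayList<>();
            queue.headMap(windowEnd, true).values().forEach(window::addAll);
            firstOrderFunction.accept(window);
            // (c) slide: prune all records in [tTail, tTail + wsl], then repeat
            queue.headMap(tTail + wsl, true).clear();
        }
    }
}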

For count-based semantics, the second-order function is depicted in Figure 3.7. The algorithm is started once a new record has been inserted into the sorted queue. In step (a), the algorithm checks whether the queue length has reached the window size. This check is necessary because, at the beginning of reading the data, the queue is not filled yet; in that case, the algorithm ends. In step (b), it is checked whether all predecessor operator instances have already delivered a record with a timestamp at least as large as the newest one in the potential window. A special case is end-of-file: For each operator instance that has returned end-of-file, its part of the condition is true. If all preceding operator instances satisfy the condition, the algorithm moves to step (c): The first-order function is called with the window as parameter, because no more input is expected for this window from any preceding operator instance. After the window has been processed, the window slide is performed in step (d): All records between position 0 and wsl are removed. Since the next potential window might already be complete as well, the algorithm is restarted. It ends once the potential next window is not complete yet, either in step (a) or in step (b).

Figure 3.7: Reduce algorithm for count-based semantics (flowchart: after every insert, the algorithm is started with the data structures of the current key k; (a) if the queue length is at least wsi and (b) for all active predecessor operators M: tmax(M, k) ≥ t of the record at position wsi, then (c) calculate the window with all sliding window records from position 0 to position wsi (= call the first-order function) and (d) prune all records from position 0 to position wsl from the queue, then repeat; otherwise end).
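The count-based variant, sketched analogously in Java (illustrative only; the queue is a timestamp-sorted list, end-of-file predecessors are assumed to have been removed from the tmax map, wsl ≥ 1, and, following the worked example of Table 3.2, the completeness check compares the newest timestamp of the potential window against each predecessor's tmax):

import java.util.*;
import java.util.function.Consumer;

// Count-based sliding window Reduce loop for one key (steps (a)-(d) of Figure 3.7).
class CountBasedReduce {
    static void onInsert(List<SlidingWindowRecord> sortedQueue,
                         Map<String, Long> tMaxPerPredecessor,
                         int wsi, int wsl,
                         Consumer<List<SlidingWindowRecord>> firstOrderFunction) {
        while (sortedQueue.size() >= wsi) {                       // (a) window size reached?
            long tHead = sortedQueue.get(wsi - 1).getTimestamp(); // newest record of the window
            // (b) every active predecessor must already be at or beyond tHead
            boolean complete = tMaxPerPredecessor.values().stream()
                                                 .allMatch(tMax -> tMax >= tHead);
            if (!complete) return;
            // (c) call the first-order function on the first wsi records
            firstOrderFunction.accept(new ArrayList<>(sortedQueue.subList(0, wsi)));
            // (d) slide: remove the first wsl records, then repeat
            sortedQueue.subList(0, Math.min(wsl, sortedQueue.size())).clear();
        }
    }
}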


3.4.3 Discussion

For the non-sliding-window Reduce, there is a way to enhance the execution: Reduce can be combined with a Combine contract that operates on sub-multisets of the Parallelization Units for Reduce; Reduce then operates on the output generated by Combine. This does not appear to be compatible with sliding window Reduce and our aim to provide deterministic semantics: Since sliding windows are generated within a stateful second-order sliding window Reduce function, no preceding function is able to produce a partial pre-computation that is useful for sliding window Reduce. Nevertheless, further optimization possibilities apart from Combine exist: Up to now, only keys are used for data parallelization. If we assume that subsequent sliding windows can be calculated by independent instances of first-order functions, sliding windows can be used for data parallelization as well. To accomplish parallel processing of successive sliding windows, the hash repartitioning strategy has to be extended with sliding window semantics. Instead of the second-order function, this new hash repartitioning strategy performs the window split. This enables the use of more sliding window Reduce operator instances in parallel. Furthermore, it removes the statefulness of the second-order function.

The user-implemented first-order functions offer a further optimization possibility. If a first-order function implements an aggregate function and operates on overlapping sliding windows, previously calculated results can potentially be reused in order to speed up the calculation. In the preliminaries (Chapter 2.3.3), we described types of aggregate functions: distributive, algebraic, and holistic functions. Distributive and algebraic functions can be split into an early aggregate function and a consolidating function. Consider a sliding window Reduce first-order function that implements the distributive aggregate function SUM() and operates on overlapping sliding windows. Instead of re-calculating the sum over all records within a window, the sum of the previous window (or a part of it) can be reused. There are several implementation possibilities: First, the previous sum can be updated by subtracting the records that have left the current window and adding the new records within the current window. Second, if the first-order function keeps the sum of the records within the overlapping part of two subsequent sliding windows, the subtraction is not needed any more. Similar optimizations are possible for all distributive and algebraic aggregate functions, because they can be decomposed. If the user decides to apply such an optimization in the first-order function, the first-order function instances become stateful. This is no issue as long as the standard hash-based repartitioning strategy is used, because Stratosphere routes all records for one key to the same sliding window Reduce instance. From a design point of view, however, this is not desirable for Stratosphere: PACTs should not be stateful, in order to enable the highest possible parallelization. With the proposed sliding window repartitioning strategy, such a stateful optimization cannot be used, because subsequent sliding windows are potentially sent to different sliding window Reduce operator instances.
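A minimal sketch of the first variant for count-based windows (our own illustration; it assumes windows of fixed size wsi with slack wsl < wsi, sorted ascending, a numeric value field, and that consecutive calls always receive consecutive windows of the same key):

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Incrementally maintained SUM() first-order function for overlapping
// count-based windows: expired values are subtracted, only the new values are
// added. Keeping this state makes the first-order function instance stateful.
class IncrementalSum {
    private final Deque<Double> currentValues = new ArrayDeque<>(); // values inside the window
    private double runningSum = 0.0;

    double apply(List<SlidingWindowRecord> window, int wsl, int valueFieldIndex) {
        if (currentValues.isEmpty()) {
            // first window: full computation
            for (SlidingWindowRecord r : window) {
                double v = ((Number) r.getField(valueFieldIndex)).doubleValue();
                currentValues.addLast(v);
                runningSum += v;
            }
        } else {
            // subtract the wsl expired values ...
            for (int i = 0; i < wsl; i++) runningSum -= currentValues.removeFirst();
            // ... and add only the wsl newest values of the current window
            for (int i = window.size() - wsl; i < window.size(); i++) {
                double v = ((Number) window.get(i).getField(valueFieldIndex)).doubleValue();
                currentValues.addLast(v);
                runningSum += v;
            }
        }
        return runningSum;
    }
}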

Within the second-order function of sliding window Reduce, we canonically defined that the check for the completeness of the potential next window runs after each insert of a record. This causes overhead for every incoming record, which can be reduced: The window completeness check can, for example, be performed only after a fixed number of incoming records. It is also possible to check for new complete windows independently of the insertion of records, for example at fixed time intervals or when the CPU is idle. A similar approach is applied in Aurora [ACC+03]: Aurora has a queuing thread and a separate thread reading from the queue at fixed time intervals.

One downside of the proposed implementation is that it potentially uses unbounded memory for the sorted queues: Assume that one of two input instances for sliding window Reduce does not emit any records for a long time, while the other instance emits many records. The potential next window is then never complete, because there could still be more input from the blocking instance. If we insist on deterministic results, the problem of unbounded memory usage in sliding window Reduce cannot be solved. If indeterministic results are acceptable in such a case, there is a possibility to unblock operator instances that have not delivered records for a certain period of time: Their tmax value can be set to the newest timestamp in the potential next window. In this case, the algorithm continues with the processing of the potential next window, even though there could be more records from the blocking instance. A similar approach is called "heartbeat" in Li et al. [LTS+08].


3.5 Sliding window Cross

The existing non-sliding-window Cross builds the cartesian product of two multisets of records. The first-order function is called for each element of the cartesian product. For the generation of the cartesian product, no keys have to be regarded. A canonical approach to apply sliding window semantics to Cross is to compute the cartesian product over pairs of sliding windows of the (ordered) multisets.

3.5.1 Semantic approach

Figure 3.8: Sliding window Cross: split of the ordered multisets (inputs A and B) into sliding windows A1, A2, ... and B1, B2, ..., and cartesian products over window pairs 1 and 2 (window A1 × B1 contains the pairs RA1RB1, RA1RB2, RA2RB1, RA2RB2; window A2 × B2 contains RA2RB2, RA2RB3, RA3RB2, RA3RB3).

In the following, we develop a second-order semantic for sliding window Cross. As an example, assume count-based sliding windows with a window size of 2 and a window slack of 1, as depicted in Figure 3.8: Each input A and B is queued and divided into subsequent sliding windows. For each pair of windows (window1(A,B), window2(A,B), ...), the cartesian product is built. The cartesian products for window pairs 1 and 2 are depicted in the lower part of the figure. As in non-sliding-window Cross, the first-order function is called for each element of the cartesian product.

This canonical approach has three disadvantages that we explain in the following: First, it leads to indeterministic results. Second, there are semantic issues with count-based sliding windows that users have to be aware of. Third, it produces duplicates for overlapping sliding windows.

The approach leads to indeterministic results, because the time attribute within the sliding window records is not regarded and because the start of both windows is not synchronized. The order of records within the queues is arbitrary, because the arrival sequence of records at Cross depends on the processing speed of preceding operator instances, network latency, etc. We discussed this issue for sliding window Reduce and solved it by local sorting and the use of operator instance IDs. Besides this ordering issue, there is another reason, particular to all multiple-input contracts including sliding window Cross, that causes indeterministic results: Both windows might start at different instances of time if the window start is not synchronized, as we explain in Example 3.1. If we assume that the synchronization of time is accomplished once at the start of the sliding windows, both windows get out of synchronization again for count-based semantics. We explain this issue in the subsequent paragraph. Since a synchronization cannot persist amongst subsequent window pairs, we propose not to synchronize any window pairs for count-based semantics. With time-based sliding windows, this effect does not occur if we define that each window pair starts at the same instance of time. Since time-based windows contain varying numbers of records, the windows of time-based sliding window Cross stay synchronized.

Example 3.1 (Unsynchronized window start) Assume that the record with the least timestamp of input A has a timestamp of 0, while input B starts at time 4711. Consider a count-based sliding window with a window size of 2 and a window slack of 1. Canonically, the cartesian product for the first window pair is {recordA(t = 0), recordA(t = 1)} × {recordB(t = 4711), recordB(t = 4712)}. A more intuitive semantic is the cartesian product {recordA(t = 4711), recordA(t = 4712)} × {recordB(t = 4711), recordB(t = 4712)}.

The second disadvantage of the canonical semantic approach for sliding window Cross are semantic issues with count-based sliding windows the user has to be aware of. We assume strictly totally ordered input data. This order does not guarantee that a record exists for each successive instance (position) of the ordering domain. In Example 3.2, we show that the timestamps of records within pairs of count-based windows move apart. If the user decides to use count-based semantics for sliding window Cross, he has to be aware of this circumstance. For time-based sliding windows, on the other hand, this effect does not occur: Time-based windows adjust to non-existing records for order instances, because the windows are of variable size.

Example 3.2 (Count-based windows and strict total order) Assume that input A provides records for times 1, 2, 3, etc. and input B skips every second instance, i.e., provides records for times 1, 3, 5, etc. Consider a count-based sliding window with a window size of 2 and a window slack of 1. The successive cartesian products are {recordA(t = 1), recordA(t = 2)} × {recordB(t = 1), recordB(t = 3)} and {recordA(t = 2), recordA(t = 3)} × {recordB(t = 3), recordB(t = 5)}: The sliding windows of input B "run away".

The third issue is not related to timestamps: For overlapping sliding windows, duplicate record pairs are created. For RA2RB2 (Figure 3.8), the first-order function is called twice, because this record pair is computed in window pair 1 and again in window pair 2. We aim to avoid duplicates to achieve a semantic similar to the relational one.


In order to tackle the first problem of indeterministic results, we apply local sorting as discussed for sliding window Reduce: Instead of FIFO queues, sorted queues are used for both inputs. Each record is inserted sorted by its ordering domain into queue A or B, depending on whether it originates from input A or B. In order to synchronize window pairs, the second-order function needs to decide when a potential next window is complete. As soon as a window pair is complete, the cartesian product is computed. Within one sorted queue A or B, a potential next window is complete once no further records are expected from any preceding operator. In the following, we discuss the completeness criteria for count-based and time-based semantics.

We start with count-based sliding windows. In Example 3.3, we discuss requirements for the completeness of the potential next window of one of the two sorted queues. Now consider both sorted queues A and B: If the pair of potential windows from input A and B is complete, the second-order function builds the cartesian product over it, calls the first-order function for each element of the cartesian product, and slides both queues ahead by the user-defined window slack.

Example 3.3 (Completeness of potential next window (count-based)) Assume a PACT program that is compiled into a Nephele job graph. The job graph contains one sliding window Cross operator instance with two source operator instances for input A: MapA1 and MapA2. For each source operator, we assume that its output is sorted by the ordering domain. Let the least timestamp in sorted queue A be 4711 and the succeeding timestamp be 4722. If the window size is 2, the potential next window contains the record at time 4711 and the record at time 4722. If both of these records were retrieved from MapA1 and the last record retrieved from MapA2 had the timestamp 4710, the window might not be complete yet: MapA2 could still deliver records with timestamps between 4711 and 4722. If the succeeding record from MapA2 has timestamp 4712, the window is complete and contains the records for timestamps 4711 and 4712. If, instead of 4712, the succeeding record from MapA2 has timestamp 4723, the window is complete, too: It contains the records for timestamps 4711 and 4722. The same window results if, instead of 4723, the input from MapA2 is end-of-file.

For time-based semantics, we discuss the requirements for completeness of the window pair in Example 3.4. After calculating the cartesian product and running the first-order function on its elements, the windows slide ahead by the window slack, which is added to the least timestamp. In contrast to count-based semantics, the sliding windows stay synchronized in time.

Example 3.4 (Completeness of potential next window (time-based)) Assume a PACT program that is compiled into a Nephele job graph. The job graph contains one sliding window Cross operator instance with two source operator instances for input A, MapA1 and MapA2, and two source operator instances for input B, MapB1 and MapB2. Let the least timestamp of both sorted queues be 4711. Consider a window size of 2: If for all preceding operator instances MapA1, MapA2, MapB1, and MapB2 the timestamp of the last retrieved record is larger than 4711 + 2 = 4713, or the instance has reached end-of-file, the window pair is complete.


The third issue, duplicate record pairs as a result of overlapping sliding windows, can be tackled in several ways: Either the overlapping records are marked ("punctuated" [TM02]), such that subsequent cartesian products exclude pairs of marked records, or duplicate pairs are filtered out after computing them. The filtering approach does not seem feasible: Assume that subsequent sliding windows are computed on different instances of sliding window Cross. A filter would have to be applied to the output of potentially many sliding window Cross operator instances, for example by another new operator. The filtering could only be accomplished by consolidating the input within one filtering operator instance, which runs counter to the intention of a parallelization framework. Since the overhead of any filtering solution appears to be considerably higher than that of the first approach, we follow the punctuation approach.

3.5.2 Considerations for implementation

For the implementation of the described sliding window Cross semantics in Stratosphere, it has to be taken into account that sliding window Cross needs to be executable on parallel, independent operator instances. In contrast to the other multiple-input contracts, sliding window Match and sliding window CoGroup, sliding window Cross has no key that enables the data to be partitioned and processed on independent computing instances. One way to achieve data parallelization apart from keys is to use the window splits of the data. In the following, we describe how this data split can be achieved for time-based disjunctive and for time-based overlapping sliding windows. Subsequently, we discuss the implications of count-based semantics.

For disjunctive sliding windows (window size ≤ window slack), each subsequent window pair of inputs A and B can be processed on a different instance of the sliding window Cross operator. In order to distribute records amongst the operator instances, Stratosphere uses record distribution strategies: The non-sliding-window Cross uses a broadcast strategy for input A and a 1:1 forward strategy for input B if both the preceding operator and Cross have the same number of instances. For input B, it re-uses a data partitioning that was applied by some preceding operator. All currently existing repartitioning strategies of Stratosphere partition the data based on keys, except for round robin. For time-based sliding window semantics, we introduce a new time-based repartitioning strategy for both inputs A and B. Based on a start timestamp, the window size, and the window slack, the strategy uses the modulo function to decide to which instance of sliding window Cross a record is sent. Since disjunctive sliding windows are a special case of overlapping sliding windows, we do not define the details of this function here, but move on to overlapping sliding windows instead.


For overlapping sliding windows, the time-based repartitioning strategy needs to be an algorithm rather than a simple function: Each record within the overlapping part of a window is potentially sent to several sliding window Cross operator instances. The subsequent sliding windows need to be sliced at particular points in time. Consider Figure 3.9: Each tick on the horizontal arrow represents one instance of time. For a window size of 3 and a window slack of 1, the resulting split of the time axis into sliding windows 1 to 12 is depicted. If there are two sliding window Cross instances Cross1 and Cross2 available, the windows could be sent to the instances round robin. It would also be possible to send windows 1 to 6 to Cross1 and windows 7 to 12 to Cross2. By sending subsequent windows to one single operator instance, records of overlapping windows do not need to be transmitted repeatedly. Potentially, the slicing can additionally be used to forward records via local memory rather than network channels. We leave the question of the optimal slicing open to the optimizer, but foresee a packing parameter for the repartitioning algorithm: It defines how many subsequent windows one operator instance obtains.

Figure 3.9: Example 1: Split of the time axis (1, 2, 3, ...) into overlapping sliding windows 1 to 12 (window size 3, window slack 1).

By calculating subsequent overlapping sliding windows on independent operator instances, duplicates are created, such as RA2RB2 in the canonical example (Figure 3.8). It is not sufficient to eliminate duplicates locally, because they arise at different, independent operator instances. Instances do not communicate with each other by design, and we believe that a global duplicate elimination is not feasible. Therefore, we follow the approach of avoiding duplicates by punctuating sliding window records: Each record gets an additional boolean flag indicating whether it belongs to the overlapping section of a sliding window and was already processed in a preceding window. If this punctuation is set to true, the record may only be joined with non-punctuated records and not with other punctuated records from the second input. Consider the example in Figure 3.10: Each odd window is sent to Cross1 and each even window to Cross2.

Figure 3.10: Example 2: Split of the time axis (1, 2, 3, ...) into overlapping sliding windows; window 1 and window 2 overlap at time 3, window 1 is sent to Cross1 and window 2 to Cross2.

Canonically, the cartesian product built within Cross1 contains RecordA(3) × RecordB(3). The same record combination is built again for window 2 in Cross2. In order to avoid this duplicate, the first record within each sliding window is punctuated with a star sign. The first records of window 2 are now RecordA(3)* and RecordB(3)*. All other records are not marked with a star. When generating the cartesian product, pairs of records that both carry a star are excluded. Generally, all overlapping records in the beginning section (the records with the least timestamps) of a sliding window are marked; the overlapping records in the ending section remain unpunctuated. The downside of this method is that the first sliding window produces an incomplete cartesian product: Since it has no preceding window, the combinations of the marked records are not generated.
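A minimal Java sketch of this duplicate check inside the window-pair cartesian product (reusing the punctuation flag of the record sketch from Section 3.2; illustrative only):

import java.util.ArrayList;
import java.util.List;

// Duplicate avoidance via punctuation: pairs in which both records carry the
// star (punctuation flag) were already produced by the preceding window and
// are skipped.
class PunctuatedCross {
    static List<SlidingWindowRecord[]> cartesianProduct(List<SlidingWindowRecord> windowA,
                                                        List<SlidingWindowRecord> windowB) {
        List<SlidingWindowRecord[]> pairs = new ArrayList<>();
        for (SlidingWindowRecord a : windowA) {
            for (SlidingWindowRecord b : windowB) {
                if (a.isPunctuated() && b.isPunctuated()) continue; // already built previously
                pairs.add(new SlidingWindowRecord[] { a, b });
            }
        }
        return pairs;
    }
}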

In the following, we combine both requirements, the distribution of records to sliding window Cross instances and the punctuation, in a dedicated sliding window repartitioning strategy. The parameters of the strategy are the number of sliding window Cross instances (numberOfInstances), the packing parameter (packingParameter), the window size (wsi), and the window slack (wsl). The first two parameters are provided by the optimizer of Stratosphere, while the latter two are defined by the user. The first part of the strategy is an algorithm that decides for each input record to which instances of sliding window Cross it is routed. The second algorithm is called within the first algorithm and decides whether a record for a certain Cross instance is punctuated or not; it conditionally applies the punctuation to the record.

Algorithm 1 Sliding window Cross routing algorithm.

1: wsipacked ← wsi + (packingParameter − 1) ∗ wsl
2: wslpacked ← wsl ∗ packingParameter
3: for currentInstance ← 0 .. (numberOfInstances − 1) do
4:   moduloVal ← (tR − (currentInstance ∗ wslpacked)) mod (numberOfInstances ∗ wslpacked)
5:   if moduloVal > 0 && moduloVal ≤ wsipacked then
6:     CheckAndApplyPunctuation(currentRecord, currentInstance)
7:     RouteRecordToInstance(currentInstance)
8:   end if
9: end for

Algorithm 1 describes, in pseudocode, to which sliding window Cross instances a record is sent. The input is one sliding window record currentRecord with timestamp tR. The algorithm internally operates on a window size and a window slack that are enlarged by the packingParameter (lines 1 and 2): If the parameter is set to 1, the algorithm distributes subsequent single windows to the operator instances round robin. If it is set to 2, the algorithm sends two subsequent windows to one sliding window Cross operator instance, and so on. If the sliding windows are overlapping, a record can potentially be sent to several operator instances. Thus, the algorithm iterates through all operator instances, starting at instance number 0 (line 3). We use modulo to determine whether the current record has to be sent to the current operator instance, because we intend to distribute subsequent (packed) sliding windows to the available instances round robin (line 4). Before applying modulo, we normalize the current timestamp tR: We assume that the first window starts at time 0. This first window is calculated by instance number 0. The following window starts at time wslpacked (to be calculated by instance number 1), the next one at 2 ∗ wslpacked (to be calculated by instance number 2 if it exists, or by instance number 0 if there are only two instances), etc. We subtract currentInstance ∗ wslpacked from tR to obtain a "virtual" timestamp that lies within a multiple of the first window. From this "virtual" timestamp, we calculate the modulo with numberOfInstances ∗ wslpacked, which represents the "total" window slack at which the algorithm switches the routing of the current window from the last calculation instance back to the first one. In line 5, the algorithm checks whether moduloVal > 0 and moduloVal ≤ wsipacked: Because of the previous normalization, moduloVal is within this range exactly if the current record has to be sent to the current operator instance. Before a record is sent to a subsequent operator instance, it is punctuated if necessary. In the following, we give an example of the routing algorithm and subsequently describe the punctuation algorithm.

Example 3.5 (Application of the sliding window Cross routing algorithm) Consider time-based sliding window semantics with a window size (wsi) of 3, a window slack (wsl) of 2, a packing parameter of 2, and 2 operator instances. The algorithm computes wsipacked = 5 and wslpacked = 4. In Table 3.6, each incoming record is represented by one row. The timestamp column shows the timestamp of the incoming record. All other columns show intermediate results of Algorithm 1 when it is applied to records with these timestamps and the window parameters defined above: moduloVal is the result of line 4, and the routing condition is the result of the boolean expression in line 5 of the algorithm. The routing condition for instance 0 shows that records with timestamps between 1 and 5, and again records with timestamps between 9 and 13, are routed to this operator instance (the condition is true). This is correct, because the packing parameter is 2 and thus two subsequent windows together span 5 time instances. The routing condition for instance 1 shows that records with timestamps between 5 and 9 are routed to instance 1. This is also correct, because this represents the second packed sliding window (the packed window slide is 4, so this window starts at a timestamp larger than 4, which is 5 here).
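The routing part can be sketched as a small runnable Java program (our own illustration; the punctuation step of line 6 is omitted here). With the parameters of Example 3.5 it reproduces the routing decisions shown in Table 3.6 below; note that Java's % operator yields exactly the negative remainders listed in the table.

// Sketch of Algorithm 1 for wsi = 3, wsl = 2, packingParameter = 2, 2 instances.
public class CrossRoutingSketch {
    public static void main(String[] args) {
        int numberOfInstances = 2, packingParameter = 2;
        long wsi = 3, wsl = 2;
        long wsiPacked = wsi + (packingParameter - 1) * wsl;   // line 1: 5
        long wslPacked = wsl * packingParameter;               // line 2: 4
        for (long tR = 1; tR <= 12; tR++) {
            StringBuilder routedTo = new StringBuilder();
            for (int instance = 0; instance < numberOfInstances; instance++) {    // line 3
                long moduloVal = (tR - instance * wslPacked)
                        % (numberOfInstances * wslPacked);                        // line 4
                if (moduloVal > 0 && moduloVal <= wsiPacked) {                    // line 5
                    routedTo.append(" Cross").append(instance + 1);               // line 7
                }
            }
            System.out.println("t=" + tR + " ->" + routedTo);
        }
    }
}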

The second part of the repartitioning strategy is the punctuation algorithm (Algorithm 2). The inputs of the algorithm are currentRecord and currentInstance from the routing algorithm. If the windows are not overlapping (wsl ≥ wsi), the check for punctuation is skipped (line 1). In line 2, the window size of a packed window is calculated as in the routing algorithm described above. In line 3, the size of the overlapping part at the beginning of each sliding window is calculated: Assuming that a window starts at time 0, the punctuation is applied up to the time tmax = wsi − wsl. In line 4, the timestamp tR of the current record is normalized to the timestamp of a window starting at time 0. To this end, tR is moved towards 0: If currentInstance is 0, 0 is subtracted from tR; if it is larger than 0, wsipacked minus the overlap tmax, multiplied by currentInstance, is subtracted. Subsequently, modulo is used to normalize the result to a window starting at 0. The normalization makes tRnorm and tmax comparable: If tRnorm is less than or equal to tmax and not 0, the record is within the range that is punctuated.


Table 3.6: Example for the application of the sliding window Cross routing algorithm.

timestamp of record | moduloVal for instance 0 | moduloVal for instance 1 | routing condition for instance 0 | routing condition for instance 1
1 | 1 | -3 | true | false
2 | 2 | -2 | true | false
3 | 3 | -1 | true | false
4 | 4 | 0 | true | false
5 | 5 | 1 | true | true
6 | 6 | 2 | false | true
7 | 7 | 3 | false | true
8 | 0 | 4 | false | true
9 | 1 | 5 | true | true
10 | 2 | 6 | true | false
11 | 3 | 7 | true | false
12 | 4 | 0 | true | false
... | ... | ... | ... | ...

Algorithm 2 Sliding window Cross punctuation algorithm.

1: if wsl < wsi then
2:   wsipacked ← wsi + (packingParameter − 1) ∗ wsl
3:   tmax ← wsi − wsl
4:   tRnorm ← (tR − ((wsipacked − tmax) ∗ currentInstance)) mod wsipacked
5:   if tRnorm > 0 && tRnorm ≤ tmax then
6:     PunctuateRecord(true)
7:   else
8:     PunctuateRecord(false)
9:   end if
10: end if


Figure 3.11: Example PACT program using sliding window Cross (data source A → SW Map → input A, data source B → SW Map → input B; both inputs feed SW Cross, whose output goes to the data sink).

For count-based sliding window semantics, a dedicated repartitioning strategy cannot be used for splitting the data into sliding windows. Assume a PACT program as depicted in Figure 3.11 and a corresponding job graph as depicted in Figure 3.12. Let the consolidated output of both sliding window Map instances, sorted by the ordering domain, be R1, ..., Rn. MapA1 outputs R1, R2, R5, R7, R9 and MapA2 outputs R3, R4, R6, R8. For a window size of 3 and a window slack of 2, the corresponding sliding windows are (R1, R2, R3), (R3, R4, R5), etc. A dedicated count-based repartitioning strategy runs on every instance of sliding window Map independently. In order to route the records of the first window R1, R2, R3 to Cross1, the repartitioning strategy at MapA1 needs information about the existence of R3 from the repartitioning strategy at MapA2. Since the instances are independent, they do not share any information at run-time by design. Thus, a canonical count-based repartitioning strategy is not applicable for count-based sliding window Cross. If the data split of the preceding operator instances is known, i.e., MapA1 only outputs R1, R3, R5, etc. and MapA2 outputs R2, R4, R6, etc., the application of a repartitioning strategy becomes possible. Such a fixed data split can be signalled by applying output contracts to operators [BEH+10]. Since this is beyond the scope of this work, we leave the definition of a dedicated count-based repartitioning strategy, in combination with other changes to Stratosphere, open for future research. Another way to achieve data parallelization for count-based sliding window Cross does not seem to exist. The only way to implement count-based semantics is to run sliding window Cross in one single instance.

We return to time-based semantics. Besides the sliding window repartitioning strategy, a dedicated second-order function is necessary for sliding window Cross. It fulfills three tasks: First, it splits the inputs A and B into subsequent sliding windows. This additional window split in the second-order function is necessary because one operator instance receives several subsequent sliding windows at once if packing is used by the optimizer. Second, it generates the cartesian product from pairs of windows, avoiding duplicates. Third, it calls the first-order function for each element of the cartesian products and hands the resulting records back to Nephele.

The input parameters of the second-order function, in addition to those of the non-sliding-window version, are window size (wsi), window slack (wsl), and isCountBased. If isCountBased is set to true, count-based semantics are applied; otherwise, the semantics are time-based.

Each instance of the second-order sliding window Cross function holds two data structures (Figure 3.13): a list of latest timestamps and a pair of sorted queues.



Figure 3.12: Example job graph of PACT program using sliding window Cross.

The list of latest timestamps contains one entry per preceding operator instance. Let M1..Mm be the preceding operator instances. The list contains the highest timestamp of any record received from each preceding operator instance: tmax(M1), tmax(M2), ..., tmax(Mm). The sorted queues Q(A) and Q(B) hold records from the inputs A and B, respectively, sorted by their timestamp. Both data structures are necessary for deterministic semantics.


Figure 3.13: Sliding window Cross data structures and potential next window for time-based semantics.

On reading a record from one of the inputs, its timestamp tcurrent and its preceding operator ID Mcurrent are read. If tcurrent ≥ tmax(Mcurrent), tmax(Mcurrent) is updated to tcurrent. Otherwise, the record is out of order and discarded. If tmax(Mcurrent) is not yet initialized (no previous record from this operator instance yet), it is set to tcurrent in any case. If the record is not discarded, it is inserted into Q(A) if it originates from input A and into Q(B) otherwise. It is inserted sorted by its timestamp.
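The following Java sketch illustrates the two data structures and the insertion logic just described. Class and field names are illustrative and do not reflect the actual Stratosphere code; timestamps are assumed to be integers, and preceding operator instances are identified by plain strings.

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Per-instance state of the sliding window Cross second-order function (sketch). */
class SlidingWindowCrossState {

    /** Simplified sliding window record: timestamp, source operator instance, payload. */
    static class SWRecord {
        final long timestamp;
        final String sourceInstance;
        final Object payload;
        SWRecord(long timestamp, String sourceInstance, Object payload) {
            this.timestamp = timestamp;
            this.sourceInstance = sourceInstance;
            this.payload = payload;
        }
    }

    /** Highest timestamp seen per preceding operator instance: tmax(M1), ..., tmax(Mm). */
    private final Map<String, Long> maxTimestamps = new HashMap<>();

    /** Sorted queues Q(A) and Q(B); the head is always the record with the least timestamp. */
    private final PriorityQueue<SWRecord> queueA =
            new PriorityQueue<>(Comparator.comparingLong((SWRecord r) -> r.timestamp));
    private final PriorityQueue<SWRecord> queueB =
            new PriorityQueue<>(Comparator.comparingLong((SWRecord r) -> r.timestamp));

    /** Returns true if the record was accepted, false if it was discarded as out of order. */
    boolean insert(SWRecord record, boolean fromInputA) {
        Long tMax = maxTimestamps.get(record.sourceInstance);
        if (tMax == null || record.timestamp >= tMax) {
            // First record from this instance, or in order: update tmax(Mcurrent).
            maxTimestamps.put(record.sourceInstance, record.timestamp);
        } else {
            return false; // out of order with respect to its source instance: discard
        }
        (fromInputA ? queueA : queueB).add(record); // kept ordered by timestamp
        return true;
    }
}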

After a record is inserted into a queue, it is checked whether the potential next window


is complete. For time-based semantics, the potential next window starts with the record at the least timestamp ttail of both queues and ends at the record at ttail + wsi. For each preceding operator M1..Mm, it is checked whether tmax(M) > ttail + wsi. If this is true, the window is complete. If the potential next windows for both inputs A and B are complete, the algorithm computes the cartesian product: it is built over all records from the potential next window of input A and all records in the potential next window of input B. If both records in a pair are marked with a *, the pair is discarded for duplicate avoidance, as described above.

If the packing parameter is larger than 1, this algorithm leads to incorrect results. Consider a packing parameter of 2, a window size of 3, and a window slack of 2. One instance of sliding window Cross gets records for two subsequent sliding windows, followed by another two subsequent windows: R1, R2, R3, R4, R5, R9, R10, R11, R12, R13. Each index of R represents the timestamp. At the arrival of R4, the first window R1, R2, R3 is complete. After calling the first-order function on the cartesian product with the completed sliding window of the other input, the first two records are pruned to perform the window slide. On the arrival of R9, the second window R3, R4, R5 is complete. Performing the window slide as usual, the next two records R3, R4 are pruned. R5 remains in the queue, and on the arrival of R11, the window R5, R9, R10 is considered to be complete, which is an error: the operator instance only calculates two subsequent sliding windows. We can fix the algorithm by counting subsequent windows: if the last window of one window sequence is reached (the second one in the example), the window slide is performed by wsi instead of wsl. This moves on to the next sequence of windows (starting at R9 in this example).

The decision algorithm for the second-order function is depicted in Figure 3.14. The variable windowCounter is initialized with 0. For every record that has been inserted into one of the queues, the algorithm in Figure 3.14 is started. In (a) it is checked whether the potential next window is complete. If tmax > ttail + wsi holds for all preceding operators, the window is complete and the algorithm moves to the next step. In (b), the cartesian product is built over the pair of potential next windows from queues A and B. If a pair of records is punctuated with *, it is discarded. For every other pair, the first-order function is called. In steps (c) and (d), windowCounter is incremented by 1 and compared to packingParameter. If packingParameter is equal to windowCounter, the last window is reached. This leads to a slide by wsi: all records between ttail and ttail + wsi are deleted from both queues, and windowCounter is reset to 0 (step (e)). If packingParameter is not equal to windowCounter (step (f)), the window slide is performed by pruning all records between ttail and ttail + wsl from both queues. In both cases, the algorithm is restarted to check for further completed windows. The algorithm ends once there is an incomplete window in step (a).



Figure 3.14: Sliding window Cross second-order algorithm for time-based semantics.
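A compact Java sketch of this decision loop is given below. It only models the control flow of Figure 3.14 (completeness check, cartesian product, windowCounter, and the two slide widths); records are reduced to their timestamps, the first-order function call is represented by a callback, punctuation is indicated by a comment only, and all names are illustrative rather than taken from the Stratosphere sources.

import java.util.Collection;
import java.util.PriorityQueue;
import java.util.function.BiConsumer;

/** Control flow of the sliding window Cross second-order function (time-based, sketch). */
class SlidingWindowCrossLoop {
    private int windowCounter = 0;

    /**
     * Called after every insertion. queueA/queueB hold record timestamps in ascending
     * order; maxTimestamps holds tmax(M) for every preceding operator instance M.
     */
    void onRecordInserted(PriorityQueue<Long> queueA, PriorityQueue<Long> queueB,
                          Collection<Long> maxTimestamps,
                          long wsi, long wsl, int packingParameter,
                          BiConsumer<Long, Long> firstOrderFunction) {
        while (!queueA.isEmpty() || !queueB.isEmpty()) {
            long tTail = Math.min(
                    queueA.isEmpty() ? Long.MAX_VALUE : queueA.peek(),
                    queueB.isEmpty() ? Long.MAX_VALUE : queueB.peek());
            // (a) the window is complete only if every preceding instance passed tTail + wsi
            boolean complete = !maxTimestamps.isEmpty()
                    && maxTimestamps.stream().allMatch(tMax -> tMax > tTail + wsi);
            if (!complete) {
                return; // wait for more input
            }
            // (b) cartesian product over the two potential next windows [tTail, tTail + wsi]
            for (long a : queueA) {
                if (a > tTail + wsi) continue;
                for (long b : queueB) {
                    if (b > tTail + wsi) continue;
                    firstOrderFunction.accept(a, b); // pairs punctuated with * would be skipped here
                }
            }
            // (c)-(f) advance the window: slide by wsl, or by wsi at the end of a pack
            windowCounter++;
            long slide = (windowCounter == packingParameter) ? wsi : wsl;
            if (windowCounter == packingParameter) {
                windowCounter = 0;
            }
            prune(queueA, tTail + slide);
            prune(queueB, tTail + slide);
        }
    }

    /** Removes all timestamps smaller than bound from the head of the queue. */
    private static void prune(PriorityQueue<Long> queue, long bound) {
        while (!queue.isEmpty() && queue.peek() < bound) {
            queue.poll();
        }
    }
}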


3.5.3 Discussion

In the semantic definition of the operator, we left the question of the optimal packing parameter open to the optimizer. The optimizer needs metrics (costs) in order to derive an optimal decision. One possible metric is a replication factor: If the windows are disjunctive, they can be distributed amongst the sliding window Cross operator instances by any strategy that loads the operators in a fair way, for example round-robin. If the windows are overlapping, the most replication-intensive strategy is to send each subsequent window to the sliding window Cross instances round-robin: It causes the repartitioning strategy to send all overlapping records to at least two operator instances. In order to reduce the replication, the packing parameter can be set to a value larger than 1 by the Stratosphere optimizer. The parameter causes several subsequent sliding windows to be sent to one sliding window Cross instance without re-sending overlapping records. The optimizer has to decide which packing parameter is optimal. One piece of information that can be obtained from the input data and the user-defined parameter window slack is a replication factor. It measures the percentage of records that are replicated. This factor can only be measured with the data from the preceding operators, because time-based sliding window semantics cause the overlapping part of windows to be of variable size. Thus, the factor is not available at the time the optimized plan is calculated. Another way to estimate the replication factor without executing the program is to compute the time overlap: Consider a time-based sliding window of size 3 and slide 2. The time overlap is 1. If the total order is strict, this hints that one record transmission per window could potentially be saved compared to a packing parameter of 1. If the order is non-strict, there might be many records with an equal timestamp falling into the overlapping part of two subsequent windows. Thus, the estimation is less adequate for non-strictly ordered data.
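As a rough illustration of such an estimate, assume a strict total order with exactly one record per time unit (so the time overlap translates directly into record counts) and wsi − wsl ≤ p ∗ wsl for a packing parameter p, so that a record is shared by at most two consecutive packs; these simplifying assumptions are made here for illustration only. A pack of p consecutive windows then spans wsi + (p − 1) ∗ wsl time units, and two consecutive packs overlap by wsi − wsl records, so that for every p ∗ wsl newly arriving records, p ∗ wsl + (wsi − wsl) records are transmitted:

    replication factor ≈ (wsi − wsl) / (p ∗ wsl + (wsi − wsl))

For the example above (wsi = 3, wsl = 2), this yields 1/3 for p = 1 and 1/5 for p = 2, illustrating how a larger packing parameter lowers the share of re-sent records.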

The described second-order algorithm (Figure 3.14) is called on each record insert into one of the sorted queues, which leads to potentially unnecessary overhead. The same issue applies in sliding window Reduce, where we discussed optimization possibilities (Chapter 3.4.3).

The proposed second-order function does not perform a window slide if there could be records within the potential next window from at least one preceding operator instance. If preceding operators do not output any data for this particular sorted queue, it grows unbounded. In Chapter 3.4.3, we describe a (non-deterministic) workaround to unblock operator instances.


3.6 Sliding window Match

The existing non-sliding-window Match has two inputs A and B. Based on a user-defined key, the inputs are partitioned into multisets of records PA(k) and PB(k) sharing the same key value k. For each key, Match builds the cartesian product amongst the multisets: PA(k) × PB(k). Match calls the first-order function on each element (pair of records) within this cartesian product. For sliding window Match, we define sliding windows on top of PA(k) and PB(k), regarding the order of the sliding window records: Figure 3.15 shows the split into subsequent sliding windows. Sliding window Match builds the cartesian product on pairs of sliding windows and calls the first-order function for each record pair within the cartesian product. The semantics are similar to sliding window Cross, but sliding window Match additionally uses keys for the generation of Parallelization Units.


Figure 3.15: Sliding window Match: split of Parallelization Units into sliding windows.


3.6.1 Semantic approach

Consider an example PACT program as depicted in Figure 3.16. The program has two data sources A and B. They are processed by two sliding window Map operators. The sliding window Map operators provide the inputs A and B for the sliding window Match operator. The sliding window Match operator calculates the final output. The first-order semantics are not important in this context.


Figure 3.16: Example PACT program using sliding window Match.

The example PACT program is compiled to a job graph by Stratosphere regarding available computation resources. One example of such a job graph is depicted in Figure 3.17. The data sources A and B are processed by two Map instances each. Sliding window Match gets two operator instances assigned. We assume that both data sources are provided sorted to Stratosphere and share the same ordering domain. The input is distributed amongst the Map instances by a repartitioning strategy like round-robin. We assume that the Map instances do not change the order of records. Stratosphere applies a hash function to the key of each output record of Map. It repartitions the data according to the Match key. Each Match operator instance gets records for a disjunctive set of key values. From the perspective of Match, each preceding operator instance provides records sorted by their ordering domain. This is due to our assumption that each operator instance outputs records sorted by their ordering domain. Since Stratosphere merges the input of preceding operator instances, the inputs A and B are not guaranteed to arrive sorted: The order of the records within inputs A and B in Match is dependent on external factors like network latency.

In order to achieve deterministic operator semantics, we apply the concepts described for the second-order function of sliding window Cross: sorted queues and maximum timestamps. Sorted queues hold the sliding window records sorted by their ordering domain. The sorted queues are defined per key k: Q(A, k) and Q(B, k). Also, the maximum timestamps of each preceding operator instance M1, ..., Mm are defined per key: tmax(M1, k), ..., tmax(Mm, k). On these data structures, we apply the sliding window Cross semantics as described in Chapter 3.5.1.
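A minimal Java sketch of these keyed data structures is given below. It only declares the per-key state and a helper for lazy creation; integer keys, long timestamps, and all names are assumptions for illustration and do not correspond to the actual implementation.

import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Per-instance, per-key state of sliding window Match (sketch). */
class SlidingWindowMatchState<RECORD> {

    /** Q(A, k) and Q(B, k): one timestamp-sorted queue per key and input. */
    final Map<Integer, PriorityQueue<RECORD>> queuesA = new HashMap<>();
    final Map<Integer, PriorityQueue<RECORD>> queuesB = new HashMap<>();

    /** tmax(M, k): highest timestamp per preceding operator instance M, kept per key k. */
    final Map<Integer, Map<String, Long>> maxTimestamps = new HashMap<>();

    /** Returns the queue for input A and key k, creating it on first use. */
    PriorityQueue<RECORD> queueA(int key, Comparator<RECORD> byTimestamp) {
        return queuesA.computeIfAbsent(key, k -> new PriorityQueue<>(byTimestamp));
    }
}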

A dedicated sliding window repartitioning strategy as described for sliding window Cross is not necessarily needed for sliding window Match, because it uses keys for the generation of Parallelization Units. Furthermore, sliding window Cross punctuates records in order to avoid duplicates. For sliding window Match, punctuation is not necessarily needed as



Figure 3.17: Example job graph using sliding window Match.

local duplicate elimination is possible. We discuss both of these aspects in the subsequentimplementation chapter.

3.6.2 Considerations for implementation

In the non-sliding-window Match, data parallelization is achieved by using keys: Splitting the data into multisets of records sharing the same key value enables Stratosphere to execute each partition on a different operator instance independently. The key split is implemented by a hash-based repartitioning strategy. For sliding window Match, either this strategy can be used, or the sliding window repartitioning strategy described for sliding window Cross (Chapter 3.5.2). If the hash-based repartitioning strategy is used, only keys are used for data parallelization. This limits the maximum number of operator instances to the number of keys present in the user-provided data. However, key-based partitioning has the advantage that punctuation is not needed, because duplicate record pairs can be eliminated locally within each second-order function instance. If the sliding window repartitioning strategy is used, the number of instances becomes unlimited. The sliding window repartitioning strategy can be used without change: It routes the records to sliding window Match instances based on timestamps without regarding keys. Each sliding window Match instance distinguishes incoming records by their key and holds sorted queues and maximum timestamps for every key separately. A third possibility is to create a new hybrid strategy that uses both keys and sliding windows for data parallelization. Since the parallelizability achieved with the sliding window repartitioning strategy is already unlimited, this is not an improvement for parallelization. However, it could be used to reduce the memory usage of operator instances: With the sliding window repartitioning strategy, all data structures have to be held for each key in the data. With a hybrid strategy, records with certain keys can be sent to particular operator instances.

In addition to the partitioning strategy, a dedicated second-order function is needed


for the implementation of sliding window Match. Similar to sliding window Cross (Chapter 3.5.2), we introduce two data structures held in every operator instance: maximum timestamps and sorted queues. Both data structures are kept for each key value in the input data. The generation of the cartesian product (including duplicate avoidance) and the algorithm checking for window completeness remain the same as in sliding window Cross, except that they operate on partitions of the data defined by the key.

3.6.3 Discussion

If the sliding window repartitioning strategy is used exclusively, each operator instance potentially has to keep sorted queues for every key. This can lead to a memory overflow if many keys are present in the data. In order to mitigate this risk, either the existing hash-based partitioning or the proposed hybrid partitioning strategy can be used: They assure that each operator instance is assigned a distinct set of keys it receives records for.

If preceding operator instances do not output any data for a long period of time, the second-order algorithm will not consider potential next windows as being complete. Thus, the sorted queues collect potentially many records, so that the memory usage might exceed the available main memory. One possible (non-deterministic) workaround is to unblock preceding operator instances by resetting their latest timestamp for a particular key if they do not deliver records for a certain period of time. For details, please refer to the discussion in Chapter 3.4.3.

3.7 Sliding window CoGroup

The non-sliding-window CoGroup has two inputs A and B. Like the Match operator, CoGroup uses a key to partition the inputs into multisets of records that share the same key value k. We denote these partitions PA(k) and PB(k). Once the partitions are complete for both inputs, the pair of partitions PA(k) and PB(k) is handed to the first-order function for each key value. For sliding window CoGroup, we define sliding windows on top of these partitions (Figure 3.18). The first-order function is called with pairs of sliding windows.

3.7.1 Semantic approach

The semantic definition of sliding window CoGroup is largely identical to sliding window Match. Thus, we only highlight the differences between both operators. Both operators split the input data by keys. We define pairs of sliding windows on top of the data partitions PA(k) and PB(k) sorted by the ordering domain. In contrast to sliding window Match, sliding window CoGroup does not apply the cartesian product amongst the window pairs. The first-order function is directly called with a window pair.

There are two ways to define the semantics of the operator: First, subsequent overlapping sliding windows contain all records of the windows. Thus, records that are in the overlapping part of two subsequent windows are processed repeatedly. Second, subsequent overlapping windows contain only the records that have not yet been returned in another window before.



Figure 3.18: Sliding window CoGroup: split of Parallelization Units into sliding windows.

Note that this second semantic is equal to disjunctive sliding windows. The same functionality can be achieved by setting the window slack greater than or equal to the window size, so we decide to only use the first semantic. This implies that duplicate avoidance is not needed for sliding window CoGroup.

3.7.2 Considerations for implementation

The implementation of sliding window CoGroup is identical to the implementation of sliding window Match, except for two differences. First, for sliding window CoGroup, the first-order function is called with a pair of windows instead of pairs of records. Second, sliding window CoGroup does not use duplicate elimination, so the punctuation algorithm is not called within the routing algorithm as described for sliding window Cross (Algorithm 1).


3.7.3 Discussion

For the discussion, please refer to the discussion of sliding window Match (Chapter 3.6.3).


Chapter 4

Evaluation

In order to evaluate our sliding window PACT definitions, we implement sliding window Reduce in Stratosphere. The implementation is described in the following. For the practical evaluation, we introduce an example calculation using sliding windows on seismographic data. The calculation is implemented twice: First, it is implemented as a PACT program without sliding window Reduce, using compute-intensive workarounds. Second, it is implemented using sliding window Reduce. We compare the performance of both implementations by running them on a compute cluster. Finally, we draw conclusions about the sliding window Reduce implementation in particular and sliding window operators in general.

4.1 Implementation of sliding window Reduce

We start with a brief overview of the system architecture of Stratosphere: We introduce Java classes and their logical relationships in the order of their appearance when implementing and running a PACT program (Figure 4.1). As a convention, we use the prefix SWReduce to indicate that a class belongs to the sliding window Reduce implementation. When implementing a PACT program, a user extends one or more given operator stubs such as SWReduceStub. For each extended stub, he instantiates a corresponding contract such as SWReduceContract. Within the instantiation, the stub is assigned to the contract and connected to its input contract. Finally, the user instantiates a PACT plan, which resembles the PACT program. The plan is instantiated with the last "contract", the data sink, as parameter: Thus, the plan contains the complete structure of previously interconnected contracts. Along with the instantiation of contracts, user parameters and compiler hints can be set. Parameters are, for example, windowSize or windowSlack in the case of sliding window Reduce. We explain them later in this chapter.

Having created a PACT plan, the user executes the PACT program in Stratosphere. Utilizing the PactCompiler class, Stratosphere creates an OptimizedPlan from the PACT plan. It regards given computing resources such as the number of available computing nodes and their main memory. Within the OptimizedPlan, each operator is represented by a node such as SWReduceNode. The nodes are connected via PactConnections.


Plan (SWReduceStub, SWReduceContract) → PactCompiler → OptimizedPlan (SWReduceNode, PactConnection) → JobGraphGenerator → JobGraph (SWReduceTask, SWReduceTaskDataStructure)

Figure 4.1: Overview on Stratosphere classes related to sliding window Reduce.

PactConnections represent a channel and a data shipping strategy: a channel is a physical channel such as memory or network, while a data shipping strategy defines how records are routed to the subsequent node. Strategies are, for example, broadcast, partition by key, or forward. The OptimizedPlan is used by Stratosphere to determine the best alternative: The costs of different possible plans are compared and the best plan is chosen. Subsequently, the plan is transformed for Nephele, the parallel execution engine of Stratosphere. For the execution in Nephele, the OptimizedPlan is converted to a JobGraph utilizing the JobGraphGenerator class. This involves wrapping the user-extended stubs in tasks such as SWReduceTask. User-extended stubs represent the user-implemented first-order functions. The second-order functions are mainly implemented within the tasks. For the logic of the second-order function, the PactConnections and their assignment by the optimizer are also crucial. In the following, we describe the sliding window Reduce implementation within the SWReduceTask in detail.

The SWReduceTask is derived from the existing ReduceTask. Within the ReduceTask, sorting is used to group the records by their key (Figure 4.2). Once all input records are read and sorted, an iterator is called repeatedly. For each "true" it returns, the ReduceStub is invoked. The input parameter of the invocation is the corresponding record set from the iterator with records sharing the same key.

For sliding window semantics as defined in Chapter 3.4.2, the sorting functionality is not needed within the SWReduceTask: We assume that for each input operator instance, records arrive sorted by their ordering domain instances. In SWReduceTask, records are grouped by their key and sorted by their ordering domain within sliding windows. For this functionality, the user-defined parameters windowSize, windowSlack, isTimeBased, and timeRecordPosition are imported in the task: windowSize defines the width of the sliding window. Depending on isTimeBased, this width is either count-based or time-based. Similarly, the windowSlack parameter defines how many records or how much time to move forward for each window slide.


sort input → while KeyGroupedIterator.nextKey() returns true: call ReduceStub.reduce(KeyGroupedIterator.getValues(), output)

Figure 4.2: Reduce second-order function.

Finally, timeRecordPosition defines in which record attribute the time instance can be found.

The grouping and ordering functionality is implemented in the class SWReduceTaskDataStructure. On each SWReduceTask instance, one single instance of SWReduceTaskDataStructure is created. For each key, the SWReduceTaskDataStructure instance holds a sorted queue of records and a list of maximum timestamps for each preceding operator instance as defined in Chapter 3.4.2. These data structures are updated on each record read by the task. After a record is inserted successfully, it is checked whether the potential next window is complete. If this is the case, the user-implemented function is called with the current window of records as parameter. Finally, according to the user parameters windowSize and windowSlack, the window slide is performed by removing records from the sorted queue of the current key.
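The following compact Java sketch illustrates the cycle described above for time-based windows: insert a record into the per-key sorted queue, update the per-instance maximum timestamp, check whether the potential next window is complete, hand the window to the user function, and perform the slide. It is not the actual SWReduceTaskDataStructure code; the names, the generic value type, and the restriction to time-based semantics are assumptions.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Sketch of the per-key sliding window state kept by an SWReduceTask-like task. */
class SWReduceWindowState<V> {
    private final long windowSize;   // wsi
    private final long windowSlack;  // wsl
    // Per key: records sorted by timestamp (a TreeMap stands in for the sorted queue).
    private final Map<Integer, TreeMap<Long, List<V>>> queues = new HashMap<>();
    // Per key: highest timestamp seen from each preceding operator instance.
    private final Map<Integer, Map<String, Long>> maxTimestamps = new HashMap<>();

    SWReduceWindowState(long windowSize, long windowSlack) {
        this.windowSize = windowSize;
        this.windowSlack = windowSlack;
    }

    /** Called for every record read by the task. */
    void onRecord(int key, long timestamp, String sourceInstance, V value,
                  Consumer<List<V>> userFunction) {
        queues.computeIfAbsent(key, k -> new TreeMap<>())
              .computeIfAbsent(timestamp, t -> new ArrayList<>()).add(value);
        maxTimestamps.computeIfAbsent(key, k -> new HashMap<>())
                     .merge(sourceInstance, timestamp, Long::max);

        TreeMap<Long, List<V>> queue = queues.get(key);
        long tTail = queue.firstKey();
        // The potential next window [tTail, tTail + windowSize] is complete once every
        // preceding operator instance has delivered a record beyond its end.
        boolean complete = maxTimestamps.get(key).values().stream()
                                        .allMatch(tMax -> tMax > tTail + windowSize);
        if (complete) {
            List<V> window = new ArrayList<>();
            queue.headMap(tTail + windowSize, true).values().forEach(window::addAll);
            userFunction.accept(window);                        // first-order function call
            queue.headMap(tTail + windowSlack, false).clear();  // window slide by wsl
            // A full implementation would loop until the next window is incomplete (cf. Figure 3.14).
        }
    }
}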

For the list of maximum timestamps in the SWReduceTaskDataStructure, we introduce an operator instance ID. This ID is generated within each task: This involves the newly created SWReduceTask as well as the existing tasks for other Stratosphere operators. Since the tasks do not communicate with each other by design, the ID is generated randomly as a globally unique identifier whenever a task is instantiated. The operator ID is assigned to each output record a stub generates.

The remaining functionality for sliding window Reduce is reused from the existing Reduce implementation without changes. This mainly involves the partitioning strategies associated with the operator. Usually, the Stratosphere optimizer assigns a hash-based repartitioning strategy to the output of the preceding operator instances of Reduce. The hash-based repartitioning strategy implements the data split by key. It assigns distinct sets of keys to the subsequent Reduce instances and routes the records to the Reduce instances based on their key.


Table 4.1: Input example.

node ID   dimension ID   timestamp          measured value
168       0              1324320711878328   -149336
168       1              1324320711878328   -128192
168       2              1324320711878328   -2284136
168       0              1324320711888094   -147688
168       1              1324320711888094   -128992
168       2              1324320711888094   -2285000
168       0              1324320711897860   -148072
168       1              1324320711897860   -128488
168       2              1324320711897860   -2284728
...       ...            ...                ...

If the data is already partitioned by key in another preceding operator, it assigns the forward strategy instead of repartitioning. Since sliding window Reduce also relies on the key split for data parallelization, the existing partitioning strategy assignment is reused for sliding window Reduce.

4.2 Example sliding window calculation

For the evaluation of the sliding window Reduce definition (Chapter 3.4.2) and implementation (Chapter 4.1), we employ an existing sliding window calculation. It originates from the Humboldt Wireless Lab (HWL), which is a project of Humboldt-Universität zu Berlin. Within the project, a large-scale research mesh network is created [HWL12]. Network nodes in a wireless mesh network act as relays for other nodes: Each node is able to forward network packets on behalf of other nodes. This enables each node to send data to arbitrary other nodes, even when they are not able to communicate with each other directly because they are physically too far apart. In addition to the basic network hardware, the networking nodes are equipped with sensors. One of these sensors is an acceleration sensor that measures seismic events in three physical directions at once. If the nodes with the sensors are attached to buildings, the measured data can be used, for example, for structural health monitoring: Changes of the natural frequencies of buildings indicate changes in their structural health [SZS12]. A prominent example of buildings that need structural health monitoring are bridges. Within the HWL project, the software framework ClickWatch is developed, which enables experiments on the mesh network.

The calculation we use for our evaluation is implemented in ClickWatch and called SeismoBusAnalysis. It runs on seismographic data measured by the wireless network nodes. Each atomic record in the data has the following four attributes: node ID, dimension ID, timestamp, and measured value (example in Table 4.1). The raw data is normalized and


transformed by a Fourier Transformation. The normalization is performed by subtracting a moving average from each measured value. The moving average is calculated on a sliding window. Also, the subsequent Fourier Transformation is applied on a sliding window. The result of the calculation highlights the occurrence of relevant frequencies over time. The SeismoBusAnalysis is an introductory experiment for developing subsequent structural health monitoring analyses as described before.

Originally, the computation of the SeismoBusAnalysis takes place in the non-distributed framework ClickWatch. Depending on the number of nodes and the frequency of measurement, the data amount grows up to dozens of gigabytes for short timespans of, e. g., 30 minutes. Running the calculation on such an amount of data takes many hours or even days. The calculation time is not feasible for experiments that involve the analysis of larger data sets. Additionally, an online analysis is far out of reach as long as the calculation time is larger than the measured time. One way to accelerate the calculation is parallelization. The computations can be divided by nodes, dimensions, and/or by time: The data of one dimension of one node can be processed independently from other dimensions of the same node and from data of other nodes. Within the data of one dimension of one node, sliding windows can be processed independently from each other. For example, the block of data ranging from time 0 to N can be processed in parallel to the data ranging from time 1 to N+1, from 2 to N+2, etc.

We utilize Stratosphere to parallelize the analysis of the SeismoBusAnalysis experiment, expecting the calculation time to scale with the number of used computation instances. For the purpose of this work, we describe the calculation only briefly. For a detailed description of the algorithm, please refer to Appendix A.1.

In a first step, each measured value is normalized by subtracting a moving average. The moving average is calculated as the mean over one window: a window includes the current value and preceding measured values. After the normalization, a Fourier Transformation is executed over a predefined window of normalized values. The result of the Fourier Transformation is an array of values. This array is divided into chunks of equal size. The chunk with a previously user-defined number is chosen. It refers to a range of frequencies. The mean of this chunk divided by a standardization constant is returned. Together with this calculated value, the node ID, dimension ID, and a timestamp are returned: The timestamp is the timestamp of the newest record within the window of normalized values.


4.3 Example PACT using workarounds

For performance comparison, we implement the example calculation described in the previous chapter twice: First, we implement it by exclusively using the existing PACT operators. Second, we implement it using the new sliding window Reduce. In the following, we describe the first implementation and explain which workarounds are needed to use the existing operators for a sliding window calculation.

Figure 4.3: Example sliding window calculation PACT program without sliding window Reduce.

Figure 4.3 shows our PACT program for the calculation. Both the normalization and the Fast Fourier Transformation (FFT) are implemented by Map operators. In the data source, the input is tokenized into node ID, dimension ID, timestamp, and measured value. Besides tokenization, the data source fulfills two purposes: First, it prevents the input files from being split. Second, records are created that contain lists (sliding windows) of measured values as opposed to single values. The file split is the default behaviour of Stratosphere: If the input of any PACT program is given in files, Stratosphere might split one file into several distinct data sets that are read by different data source instances, as depicted in Figure 4.4: File 3 is split into two distinct pieces by the default DelimitedInputFormat. Each of our input files contains the measured data for one node and all its dimensions. Since the second function of the source is to create records with lists of subsequent measured values, it has to be stateful to remember previous values that are needed in subsequent windows. If an input file is not read as a whole, incomplete windows are the result, leading to incorrect calculation results. Thus, a SerialDelimitedInputFormat is introduced as shown in Figure 4.5. It prevents the input files from being split.

In Figure 4.6, the window handling performed within the data source (LineInFormat) and the normalization are depicted. For simplification, only records with the same key are contained in the "Input" box. A key consists of a node ID and a dimension ID. In the figure, the key is omitted. In the "Input" box, the numbers 1 to 6 represent the timestamps of the records. The characters a to f represent the measured values. The LineInFormat class reads each record subsequently and stores it in a FIFO queue. As long as the queue contains less than four entries, the class does not return any record. This applies to the first three lines. From the fourth line on, it returns records with the key (node ID and dimension ID), the current measured value, and the last three measured values. The LineInputFormat outputs records with four measured values.



Figure 4.4: DelimitedInputFormat.


Figure 4.5: SerialDelimitedInputFormat.

The reason is the following: Assume that the normalization is to be performed over a count-based window of size three and the Fourier Transformation over a window of size two. Both share the same count-based window slack of one. This means that the LineInFormat needs to return records with four measured values: The subsequent normalization runs on two subsequent windows of size three. It returns a record with two calculated values. The subsequent Fourier Transformation is calculated on these two values.

Input    LineInFormat output    Normalize f(x)            Fourier Transf. g(x)
1 a      ---                    ---                       ---
2 b      ---                    ---                       ---
3 c      ---                    ---                       ---
4 d      4: d c b a             4: f(d,c,b) f(c,b,a)      4: g(f(d,c,b), f(c,b,a))
5 e      5: e d c b             5: f(e,d,c) f(d,c,b)      5: g(f(e,d,c), f(d,c,b))
6 f      6: f e d c             6: f(f,e,d) f(e,d,c)      6: g(f(f,e,d), f(e,d,c))
...      ...                    ...                       ...

Figure 4.6: Workaround: sliding window emulation within records.
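As an illustration of the buffering behaviour described above, the following Java sketch emulates the stateful part of the LineInFormat: it keeps the last four measured values per key in a FIFO queue and only emits a value list once the queue is full. Class and method names are illustrative; the real input format additionally performs tokenization and record handling.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the per-key FIFO buffering done inside the LineInFormat workaround. */
class WindowedLineBuffer {
    private static final int VALUES_PER_RECORD = 4; // normalization window 3, FFT window 2, slack 1
    private final Map<String, Deque<Long>> lastValuesPerKey = new HashMap<>();

    /**
     * Adds a measured value for the given key (node ID + dimension ID) and returns
     * the list of the newest VALUES_PER_RECORD values, or null while the buffer is
     * still filling up (the first three lines in Figure 4.6).
     */
    List<Long> add(String key, long measuredValue) {
        Deque<Long> queue = lastValuesPerKey.computeIfAbsent(key, k -> new ArrayDeque<>());
        queue.addLast(measuredValue);
        if (queue.size() < VALUES_PER_RECORD) {
            return null;
        }
        if (queue.size() > VALUES_PER_RECORD) {
            queue.removeFirst(); // drop the value that left the sliding window
        }
        return new ArrayList<>(queue); // oldest value first, newest last
    }
}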

We call this workaround "horizontal" data redundancy: sliding windows are maintained within potentially long records containing many measured values. The workaround allows for independent calculations on each record in the two subsequent Map operators. It has two downsides implementation-wise: First, for overlapping sliding windows, the same measured values are copied repeatedly in the system, leading to potentially unnecessary data transfer and memory usage overhead. Second, it causes calculation overhead: the calculation of the normalization could be optimized by using a temporary variable for the moving average instead of computing it repeatedly on subsequent sliding windows. This second downside is debatable, because the independence of subsequent windows (single records) allows for high parallelizability. Thus, the calculation might not be performed optimally regarding the usage of computing resources, but optimally regarding speed. One downside from a user's point of view is that the sliding window workaround needs to be implemented intentionally: In this basic example calculation, there are two subsequent sliding windows to be calculated. The user has to calculate the number of needed values within each record from the bottom (the Fourier Transformation in the example) up to the top (the data source) and implement each PACT accordingly. Considering more


complex examples with window slacks other than one and more subsequent operators, such a workaround becomes problematic.

If the user is not aware of this workaround, he has to (re-)invent one. Besides the presented "horizontal" data redundancy, other workarounds are possible as well. As an example, we also examined "vertical" data redundancy, which involves copying records and using keys to generate sliding windows in combination with the Reduce operator. This approach is not promising, because it limits the program's suitability to very small amounts of input data.

4.4 Example PACT using sliding window Reduce

In the following, we describe the implementation of the example calculation of Chapter 4.2 using sliding window Reduce. The corresponding canonical PACT program is depicted in Figure 4.7. Both the normalization and the Fast Fourier Transformation (FFT) are implemented as one SWReduce PACT each. Since the normalization window size differs from the FFT window size, it is not possible to use only one SWReduce PACT. The preceding Map operator is introduced to assign operator IDs to the records. The operator IDs are necessary for the sliding window logic in the sliding window Reduce task. In the future, this manual operator ID assignment can be replaced by using a dedicated sliding window Map. Besides the operator ID assignment, the input is tokenized into node ID, dimension ID, timestamp, and measured value.


Figure 4.7: Example sliding window calculation PACT program using sliding window Reduce.

Since the sliding window logic is implemented in the SWReduce tasks, there is nothing more to do for the user, except to set the sliding window parameters of both SWReduce PACTs according to the experiment.

4.5 Evaluation

For the evaluation of the sliding window Reduce definition and its implementation, we run both the PACT program with workarounds (A) and the one with sliding window Reduce (B) on a compute cluster. We expect A to result in a higher execution time than B when running both programs with the same parameters. We varied the following


parameters: the number of compute nodes (4 and 16), the parallelization degree (1, 2, and 4 times the number of compute instances), the window sizes used for the normalization and the Fourier Transformation (one configuration with a smaller window size of 16 and one with a larger window size of 1024), and the amount of input data (10 GB, 20 GB, ..., 60 GB). The amount of input data ranges linearly from 367 million records (10 GB) to 1751 million records (60 GB). An example of the input data can be found in Chapter 4.2. Each experiment was repeated three times.

We run the evaluation on a compute cluster. For the runs with 4 compute instances, each node consists of 4 Opteron 880 CPUs (2 cores per CPU) and 8 GB memory. For the runs with 16 compute instances, the additional nodes have the following properties: 12 Opteron 6168 CPUs (12 cores per CPU) and 64 GB memory. All nodes are interconnected by a 1 GBit Ethernet connection and share a network file system that is accessible from each node equally. The used nodes are not exclusively reserved for this evaluation. Other experiments run on them in parallel without resource usage restrictions. The time is measured by the Stratosphere command line client: When a PACT program completes its execution, the client returns the execution time in milliseconds.


Figure 4.8: Experiments with 4 calculation nodes and parallelization degree 4.

The result for four compute nodes and a parallelization degree of 4 is depicted in Figure 4.8. On the horizontal axis, the amount of input data is shown in GB. On the vertical axis, the total runtime is shown in seconds. The measurement for each data size is repeated three times for each PACT program A and B. We connected the points for better visibility. In fact, there are no measurements between the points in the graph. As expected, program (B) with sliding window Reduce results in a smaller execution time than program (A) with workarounds. The runtime of each configuration is nearly the same, except for small deviations, e. g., for measurement A at 50 GB. The deviations can


be explained by external factors such as other programs slowing down the compute nodes. We repeat the measurements for parallelization degrees of 8 and 16. For the graphs of these measurements, please refer to Appendix A.2. The results are similar to the presented Figure 4.8.

Regarding the execution times on 16 compute instances (Figure 4.9), PACT program A with workarounds is faster than program B with sliding window Reduce. The reason for this undesired effect is a network communication channel: It is assigned between the normalization and the Fourier Transformation. The optimizer treats sliding window Reduce like the existing Reduce. It is not aware that the key of input and output records of the normalization PACT remains the same. The data partition could be reused, in connection with a more efficient in-memory local forward. The network transmission adds a time penalty that grows with the number of used compute nodes. Thus, the effect can be observed when using more than 4 compute nodes, as shown in this example.


Figure 4.9: Experiments with 16 calculation nodes and parallelization degree 32.

For the graphs of 16 calculation nodes and parallelization degrees 16 and 64, please refer to Appendix A.2.

4.6 Discussion

The evaluation has shown that the optimizer-assigned physical communication channel is crucial for optimal performance. For the definition and implementation of sliding window Reduce, partitioning strategies and their corresponding physical communication channels have to be considered. This involves the introduction of a user-defined hint that informs the optimizer about the output key of sliding window Reduce. Alternatively, code analysis of the user-implemented sliding window Reduce PACT code can be used for the


same purpose. A weak point of the evaluation is that it is statistically not meaningful. In order to obtain a significant sample size, all measurements should be repeated more than three times. Also, a dedicated cluster would be better than shared compute nodes in order to reduce external influences on the calculations.

The sliding window implementation is only a prototype and can be enhanced in several ways. First, for time-based windows we assume time to be represented by an integer. Time may also be represented by real numbers: The use of real numbers enables the insertion of timestamps between two arbitrary existing timestamps. This increases the flexibility of the sliding window Reduce implementation. Second, for the purpose of the prototype, we limit the key class of sliding window Reduce to one single PactInteger. Other key classes and combined keys with more than one key class are to be implemented in the future. Third, we use standard Java data structures like TreeMap and HashMap in the sliding window Reduce implementation for the sliding window generation. We presume that the implementation can be enhanced performance-wise by using dedicated, lighter data structures.

Another enhancement for sliding window support in Stratosphere is a dedicated sliding window record. Currently, additional sliding window information, namely the operator ID and the timestamp, is required to be added to records at fixed attribute positions. A dedicated sliding window record eliminates the manual handling of this information. Furthermore, the random operator ID generation in the tasks as introduced in the sliding window Reduce implementation disappears. For the operator ID, the Nephele-generated operator IDs are more elegant: Nephele defines the number of preceding operator instances of an operator in the first place. In the current sliding window Reduce implementation, the PACT layer finds out about the existence of preceding operator instances iteratively at run-time. The correctness of the detection of the potential next window relies on knowing the complete number of preceding operator instances from the arrival of the first records. So the current implementation can potentially lead to incorrect sliding windows. As a solution, we propose sliding window records that are writable by Nephele. Currently, Nephele is designed to be agnostic of the concrete record contents. Thus, Nephele cannot change records directly. A sliding window record that extends or wraps the current record is one possibility to provide a writable interface to Nephele for this purpose. Another way to perform this record write is to hand the operator ID up to the PACT level, which is currently not possible to our knowledge. Since the implementation of records is tightly connected to the current Stratosphere design, we refrained from this rather radical change.

In Chapter 2.3.3, we introduce properties of aggregate functions. Distributive and algebraic functions can be decomposed into an early aggregate function and a consolidating function. In the case of overlapping sliding windows, partial results of the early aggregate function or the consolidating function can be reused in subsequent windows. This can save compute resources. As an example, consider the distributive function ADD(). If the sum of the overlapping part of the previous window is known, the new sum can be calculated by adding only the new records that come after the overlapping part. If only the overall sum of the previous window is known, the invalidated records are additionally necessary: Besides adding the new records, the invalidated ones have to be subtracted


from the old sum in order to generate the new sum. We call a window that exclusively contains a set of invalidated and another set of new records a differential window. For the implementation, an additional parameter, e. g., overlappingWindowOnlyReturnDifference, can be considered. In contrast to the standard behaviour, the first-order function is then called with an iterator containing deleted and added records only. The downside of this optimization is that sliding window Reduce PACTs become stateful, which limits parallelizability and is not intended by the Stratosphere design.
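A small Java sketch of this idea for the ADD() example: the running sum is updated from a differential window instead of being recomputed over the full window. The class and method names are illustrative and do not correspond to an existing Stratosphere interface.

import java.util.List;

/** Sketch of an incremental (differential-window) sum for overlapping sliding windows. */
class IncrementalSum {
    private long currentSum = 0;

    /**
     * Updates the sum of the current window from a differential window:
     * the records that slid out (invalidated) and the records that slid in (added).
     */
    long onWindowSlide(List<Long> invalidatedRecords, List<Long> addedRecords) {
        for (long invalidated : invalidatedRecords) {
            currentSum -= invalidated; // subtract values that left the window
        }
        for (long added : addedRecords) {
            currentSum += added;       // add values that entered the window
        }
        return currentSum;             // sum of the new window
    }
}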


Chapter 5

Conclusions

5.1 Summary

Stratosphere, as a general-purpose parallelization framework, is an alternative to existing relational database and data stream management systems when analysing sequential data. In this work, we improved Stratosphere's suitability for analyses of sequential data by introducing sliding window operators. With sliding window operators in Stratosphere, users are no longer required to implement sliding window semantics within their PACT programs themselves. We discussed the semantics of the new operators and described how to implement them in Stratosphere. We demonstrated the general ideas behind the operators by implementing the sliding window Reduce operator. We evaluated its performance by comparing it to the performance of a data analysis that does not use the sliding window operator. The performance improves when using the dedicated sliding window Reduce. Further improvements are possible, as we describe in the following.

5.2 Open issues

There are two main fields that we leave open for further research. First, the definition of sliding window operators can be enhanced, and second, the implementation of the operators should be completed. In the definition of sliding window operators, partitioning strategies should be considered more closely. For sliding window Reduce, we found that a strategy (or hint) that reuses data partitions of previous operators enhances performance: If partitions are reused, data can potentially be transferred by a local memory forward rather than an expensive network transfer. Furthermore, sliding windows can be used as parallelization units. This can be achieved by a dedicated sliding window repartitioning strategy as we introduced for sliding window Cross. The same strategy can be considered for sliding window Reduce, Match, and CoGroup. We identified that our definitions of sliding window operators are blocking if there is no input from one previous operator instance for a long time. If we loosen the requirement of deterministic semantics, this problem can be solved by heartbeats, as proposed in Chapter 3.4.3.

The remaining operators sliding window Cross, Match, and CoGroup should be implemented.


Also, the proposed dedicated sliding window record should be implemented. Along with the implementation of the sliding window record, the timestamp should allow for other data types in addition to integer. The implementations should be evaluated with further data analyses apart from the one we used in this work.


Appendix A

Appendix

A.1 SeismoBusAnalysis algorithm

In the following, we describe the SeismoBusAnalysis algorithm we used in the evaluation (Chapter 4.2). We start with a high-level overview of the algorithm and its parameters. In the remainder, we specify the algorithm by pseudocode and explain it line by line.

The input data of the algorithm originates from measurements of sensors and is saved in a data store. The algorithm reads the saved data sequentially. Each line in the data consists of a timestamp, a node ID, a dimension ID, and a measured value. The input is sorted ascending by timestamps. The algorithm normalizes the measured values by subtracting a moving average. The moving average is calculated as the mean over one window: A window includes the current value and preceding values. The parameter removeOffsetWindowSize determines the total number of values in one window. Afterwards, the algorithm executes a Fourier Transformation over a predefined window of normalized values. The size of this window is set by the fftWindowSize parameter. The result of the Fourier Transformation is an array. This array is divided into chunks of equal size. The number of chunks can be influenced with the parameter numberOfBins. This chunking procedure is called binning. The chunk with a pre-defined number is chosen by the parameter chosenBin. It refers to a range of frequencies. The mean of the chosen chunk divided by a standardization constant is returned. The constant is 2^21/1000 and ensures that the values are returned in "milli g": g is a unit of gravitational force.

The SeismoBusAnalysis algorithm is specified in Algorithm 3. To simplify the code, we assume time to be represented by an increasing integer number starting at 0 and incremented by 1 for each subsequent measurement.


- In lines 1 to 8, parameters for the algorithm are defined: sampleRateInHz specifies how many measurements are done per second. fftWindowInSeconds is the window size of the Fast Fourier Transformation (FFT) in seconds. fftWindowSizeOrig holds the number of measurements contained in the window. The window size used for the FFT is saved in fftWindowSize. numberOfBins is the number of chunks the FFT result is divided into. Of these chunks, the chosenBin-th one is chosen (beginning at 0).

- In line 9, the algorithm iterates through the input data line by line.

• In lines 10 to 22, the input is normalized. To this end, all items contained in the current line are read into variables. Node, dimension, and time are only used for the current iteration, so they are read into the dedicated variables node, dimension, and time. The measured value is needed in subsequent iterations as well, so it is read into the three-dimensional array value[node][dimension][time]. In lines 14 to 16, the two-dimensional array sum[node][dimension] is initialized if it has not been initialized before. This array holds the sum for the current combination of node and dimension, which is also needed in subsequent iterations. The sum is updated in lines 17 to 19 by adding the measured value of the current line and subtracting the value that has already left the current sliding window. In line 20, the movingAverage is calculated by dividing the sum by removeOffsetWindowSize. In lines 21 to 22, the normalized value is calculated by subtracting the moving average from the current value. The result is saved into the three-dimensional array normalizedValue[node][dimension][time], because the subsequent Fourier Transformation needs a sliding window consisting of the current normalized value and previous normalized values.

• In line 23, the Fast Fourier Transformation is calculated over the last fftWindowSize normalized values. The result is written to complexFFTResultArray. This variable is an array, because the result of an FFT is an array. In each iteration, the algorithm overwrites this array.

• In line 28, the length of the non-redundant part of complexFFTResultArray is calculated. This length is needed in order to process each result value only once: the FFT result is symmetric and is thus contained twice in complexFFTResultArray.

• In lines 29 to 35, the FFT result is binned: it is split into numberOfBins chunks. The chosenBin-th chunk is selected and all its values are saved into the one-dimensional array fftResultArray. This array is needed because all values in the current bin are used subsequently.

• In lines 36 to 41, the mean is calculated over all values in the chosen bin. The mean is divided by the pre-defined standardization constant and the result is emitted.


Algorithm A.1: SeismoBusAnalysis Algorithm

1: sampleRateInHz ← 100
2: fftWindowInSeconds ← 1
3: fftWindowSizeOrig ← fftWindowInSeconds ∗ sampleRateInHz
4: fftWindowSize ← Integer.highestOneBit(fftWindowSizeOrig) ∗ 2
5: removeOffsetWindowInSec ← 10
6: removeOffsetWindowSize ← removeOffsetWindowInSec ∗ sampleRateInHz
7: numberOfBins ← 10
8: chosenBin ← 2
9: while line ← readInput() do
10:     node ← extractNodeFrom(line)
11:     dimension ← extractDimensionFrom(line)
12:     time ← extractTimeFrom(line)
13:     value[node][dimension][time] ← extractValueFrom(line)
14:     if sum[node][dimension] = undefined then
15:         sum[node][dimension] ← 0
16:     end if
17:     sum[node][dimension] ← sum[node][dimension]
18:         + value[node][dimension][time]
19:         − value[node][dimension][time − removeOffsetWindowSize − 1]
20:     movingAverage ← sum[node][dimension] / removeOffsetWindowSize
21:     normalizedValue[node][dimension][time] ← value[node][dimension][time]
22:         − movingAverage
23:     complexFFTResultArray ← FourierTransformation(
24:         normalizedValue[node][dimension][time − fftWindowSize],
25:         normalizedValue[node][dimension][time − fftWindowSize + 1],
26:         ...,
27:         normalizedValue[node][dimension][time])
28:     resultLength ← (complexFFTResultArray.length / 2) + 1
29:     startIndex ← (resultLength / numberOfBins) ∗ chosenBin
30:     endIndex ← startIndex + (resultLength / numberOfBins)
31:     fftResultArray[] ← new array()
32:     for i = startIndex → endIndex − 1 do
33:         fftResultArray[] ← complexFFTResultArray[i].abs()
34:             / fftWindowSizeOrig
35:     end for
36:     fftSum ← 0
37:     for i = 0 → fftResultArray.length do
38:         fftSum ← fftSum + fftResultArray[i]
39:     end for
40:     result ← (fftSum / fftResultArray.length) / (221/1000)
41:     Emit(result)
42: end while
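To make the pseudocode concrete, the following is a minimal Java sketch of the per-measurement processing for a single combination of node and dimension. It mirrors the pseudocode above rather than the actual evaluation job; the class name, the method names, and the naive DFT helper are our own illustrative choices (a real implementation would call an FFT library instead).

import java.util.ArrayDeque;
import java.util.Deque;

public class SeismoStep {
    // Parameters as in lines 1 to 8 of the pseudocode.
    static final int SAMPLE_RATE_HZ = 100;
    static final int FFT_WINDOW_ORIG = 1 * SAMPLE_RATE_HZ;                    // measurements per FFT window
    static final int FFT_WINDOW = Integer.highestOneBit(FFT_WINDOW_ORIG) * 2; // 64 * 2 = 128
    static final int REMOVE_OFFSET_WINDOW = 10 * SAMPLE_RATE_HZ;              // values per moving-average window
    static final int NUMBER_OF_BINS = 10;
    static final int CHOSEN_BIN = 2;

    private final Deque<Double> rawWindow = new ArrayDeque<>();        // last REMOVE_OFFSET_WINDOW raw values
    private final Deque<Double> normalizedWindow = new ArrayDeque<>(); // last FFT_WINDOW normalized values
    private double sum = 0.0;

    /** Consumes one measurement; returns the binned spectrum mean in "milli g" (NaN while warming up). */
    public double process(double value) {
        // Moving-average normalization (lines 13 to 22).
        rawWindow.addLast(value);
        sum += value;
        if (rawWindow.size() > REMOVE_OFFSET_WINDOW) {
            sum -= rawWindow.removeFirst();
        }
        double movingAverage = sum / REMOVE_OFFSET_WINDOW;
        normalizedWindow.addLast(value - movingAverage);
        if (normalizedWindow.size() > FFT_WINDOW) {
            normalizedWindow.removeFirst();
        }
        if (normalizedWindow.size() < FFT_WINDOW) {
            return Double.NaN; // not enough values yet for a full FFT window
        }

        // Fourier Transformation over the last FFT_WINDOW normalized values (lines 23 to 27).
        double[] magnitudes = dftMagnitudes(
                normalizedWindow.stream().mapToDouble(Double::doubleValue).toArray());

        // Binning (lines 28 to 35): use only the non-redundant half of the symmetric spectrum.
        int resultLength = magnitudes.length / 2 + 1;
        int binSize = resultLength / NUMBER_OF_BINS;
        int start = binSize * CHOSEN_BIN;
        double binSum = 0.0;
        for (int i = start; i < start + binSize; i++) {
            binSum += magnitudes[i] / FFT_WINDOW_ORIG;
        }

        // Mean of the chosen bin and standardization (lines 36 to 41).
        return (binSum / binSize) / (221.0 / 1000.0);
    }

    /** Naive DFT magnitudes; stands in for an FFT library call. */
    private static double[] dftMagnitudes(double[] x) {
        int n = x.length;
        double[] magnitude = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = -2.0 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im += x[t] * Math.sin(angle);
            }
            magnitude[k] = Math.hypot(re, im);
        }
        return magnitude;
    }
}

Note how the parameter derivation of line 4 plays out for the values above: Integer.highestOneBit(100) is 64, so the FFT window is padded to the power of two 128.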


A.2 Experiments

In the following, we list all figures for the measurements explained in Chapter 4.5.

Figure A.1: Time usage with 4 calculation instances for parallelization degree 4 (plot: time in seconds over amount of input in GB; series: with workaround horizontal data redundancy, with sliding window Reduce).


Figure A.2: Time usage with 4 calculation instances for parallelization degree 8 (plot: time in seconds over amount of input in GB; series: with workaround horizontal data redundancy, with sliding window Reduce).

Figure A.3: Time usage with 4 calculation instances for parallelization degree 16 (plot: time in seconds over amount of input in GB; series: with workaround horizontal data redundancy, with sliding window Reduce).


Figure A.4: Time usage with 16 calculation instances for parallelization degree 16 (plot: time in seconds over amount of input in GB; series: with workaround horizontal data redundancy, with sliding window Reduce).

Figure A.5: Time usage with 16 calculation instances for parallelization degree 32 (plot: time in seconds over amount of input in GB; series: with workaround horizontal data redundancy, with sliding window Reduce).


Figure A.6: Time usage with 16 calculation instances for parallelization degree 64 (plot: time in seconds over amount of input in GB; series: with workaround horizontal data redundancy, with sliding window Reduce).


A.3 Data models

In this work, we refer to the relational data model and the SEQ data model. In the following, we review these data models.

A.3.1 Relational data model

The relational data model was first introduced by Codd [Cod70] and was revised over time. The main intention of this model is to achieve a high degree of data independence. It provides well-defined data semantics and assures data consistency.

Relational data is represented by n-ary relations. An n-ary relation is a subset of all possible combinations of n domains. One element of a relation is called an (n-)tuple:

Let D1, D2, ..., Dn be n domains. The relation R is a subset of the Cartesian product over all domains: R ⊆ D1 × D2 × ... × Dn.

SQL, being perceived as the main query language for relational data, in fact deviates from the relational data model. For example, the current SQL standard SQL:2011 allows for duplicate rows in a table, while the relational data model is based on sets.

A.3.2 SEQ data model

The SEQ data model extends the relational data model: each tuple of a relation is augmented with a position in an order domain such as time or linear position. SEQ is intended to be a uniform data model for applications involving sequence data [PSR95]. SEQ serves as the basis for a sequence database system [PSR96].

A sequence Seq is defined as a 3-tuple <S, O, OS>: S is a set of relational tuples of a schema RS, O is a countable, totally ordered domain, and OS is an ordering of S by O. The ordering OS has the following properties: for every tuple in S, there exists a position in the ordering. Not every position in the ordering necessarily has corresponding tuples. Furthermore, there can be more than one tuple associated with one position in the ordering.
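To make the 3-tuple <S, O, OS> concrete, the following small Java sketch represents a sequence as tuples grouped by their position in a totally ordered domain. The types are our own illustration of the definition, not part of SEQ or Stratosphere: positions without tuples simply have no entry, and a single position may hold several tuples.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Illustrative SEQ-style sequence: tuples of type T ordered by positions of type P (the domain O). */
public class Sequence<P extends Comparable<P>, T> {
    // OS: maps every occupied position of the order domain O to its tuples from S.
    private final TreeMap<P, List<T>> ordering = new TreeMap<>();

    /** Associates a tuple with a position; several tuples may share the same position. */
    public void add(P position, T tuple) {
        ordering.computeIfAbsent(position, p -> new ArrayList<>()).add(tuple);
    }

    /** Returns the tuples at a position; the list is empty if the position holds no tuple. */
    public List<T> tuplesAt(P position) {
        return ordering.getOrDefault(position, List.of());
    }
}

Iterating over the entries of ordering yields the tuples in the order imposed by O, which is the property that window-based operators rely on.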

A.4 Join algorithms

In this work, we refer to the symmetric hash-based join. In the following, we introduce this join algorithm. Since it is derived from the Simple Hash Join, we start with that algorithm.

A.4.1 Common / Simple Hash Join

The common hash join or simple hash join [WA93] is a non-pipelined join and consists of two phases (Figure A.7): In the first phase, one operand is read entirely. For each tuple, a hash function is applied. The result is saved in a hash table that ideally fits into the main memory. In the second phase, the tuples of the other operand are read. Each tuple is hashed and probed against the hash table. All matching tuple pairs are returned.
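The two phases can be sketched in a few lines of Java. The sketch assumes both operands are given as in-memory lists and are joined on an integer key; the class, method, and parameter names are illustrative only.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

public class SimpleHashJoin {
    /** Joins two operands on an integer key; output is produced only in phase 2. */
    public static <A, B> List<Object[]> join(List<A> buildInput, List<B> probeInput,
                                             ToIntFunction<A> buildKey, ToIntFunction<B> probeKey) {
        // Phase 1: read the build operand entirely and hash it into a table.
        Map<Integer, List<A>> hashTable = new HashMap<>();
        for (A a : buildInput) {
            hashTable.computeIfAbsent(buildKey.applyAsInt(a), k -> new ArrayList<>()).add(a);
        }
        // Phase 2: read the probe operand, probe the table, and return all matching pairs.
        List<Object[]> output = new ArrayList<>();
        for (B b : probeInput) {
            for (A a : hashTable.getOrDefault(probeKey.applyAsInt(b), List.of())) {
                output.add(new Object[]{a, b});
            }
        }
        return output;
    }
}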


Although the join operation is symmetric, this algorithm treats its operands asymmetrically by exclusively hashing one operand and exclusively probing the other one. Output is only returned in the second phase of the algorithm. In order to produce output earlier during execution, the symmetric (or pipelined) hash join was introduced.

Figure A.7: Hash Join (diagram: phase 1 builds a hash table from input A; phase 2 probes input B against it and produces the output).

Figure A.8: Symmetric Hash Join (diagram: inputs A and B each build their own hash table and probe the other input's table to produce the output).

A.4.2 Symmetric / (Double) Pipelined Hash Join

The Pipelined Hash Join is optimized for a high degree of pipelining [UF01, WA93, IFF+99]. Two hash tables are built, one for each input (Figure A.8). In contrast to the common hash join, the algorithm consists of only one phase. Whenever a new tuple from one of the inputs arrives, it is first inserted into the hash table of this input. Second, the tuple is probed against the hash table of the other input. Each matching tuple pair is returned.
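The insert-then-probe behavior can be sketched as follows in Java; again, the types and method names are illustrative only, and matches are pushed to a caller-supplied consumer as soon as they arise.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

/** Illustrative symmetric (pipelined) hash join on an integer key. */
public class SymmetricHashJoin<A, B> {
    private final Map<Integer, List<A>> tableA = new HashMap<>();
    private final Map<Integer, List<B>> tableB = new HashMap<>();
    private final BiConsumer<A, B> emit; // receives every matching pair as soon as it exists

    public SymmetricHashJoin(BiConsumer<A, B> emit) {
        this.emit = emit;
    }

    /** Called for every arriving tuple of input A: insert into A's table, then probe B's table. */
    public void onTupleA(int key, A a) {
        tableA.computeIfAbsent(key, k -> new ArrayList<>()).add(a);
        for (B b : tableB.getOrDefault(key, List.of())) {
            emit.accept(a, b);
        }
    }

    /** Called for every arriving tuple of input B: insert into B's table, then probe A's table. */
    public void onTupleB(int key, B b) {
        tableB.computeIfAbsent(key, k -> new ArrayList<>()).add(b);
        for (A a : tableA.getOrDefault(key, List.of())) {
            emit.accept(a, b);
        }
    }
}

Because each tuple is inserted before it probes, every matching pair is emitted exactly once, independent of how the two inputs are interleaved.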

Whenever one of the operands is read entirely, the other operand does not need to continue building its hash table, since it will not be used any more. If one of the operands is completely read before the other one, the symmetric hash join degenerates into a common hash join. Compared to the common hash join, the symmetric hash join treats its operands in a symmetric way: for both operands, hash tables are built, and both operands are probed against the hash table of the other operand.

As for the common hash join, the hash tables ideally reside in main memory. Since hash tables are built for both operands, the memory footprint of this join is bigger than that of the common hash join. This limits the use of this algorithm to inputs that are both small enough to fit into the available main memory.


Bibliography

[ABE+10] Alexander Alexandrov, Dominic Battre, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. Massively Parallel Data Analysis with PACTs on Nephele. Proceedings of the VLDB Endowment, 3:1625–1628, September 2010.

[ABW03] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL Continuous Query Language: Semantic Foundations and Query Execution. Technical Report 2003-67, Stanford InfoLab, 2003. An earlier version of this technical report, titled "An Abstract Semantics and Concrete Language for Continuous Queries over Streams and Relations", appears on the same publication server as technical report number 2002-57. A short version of technical report 2002-57 also appears in the proceedings of the 9th International Conference on Data Base Programming Languages (DBPL 2003).

[ABW06] Arvind Arasu, Shivnath Babu, and Jennifer Widom. The CQL continuous query language: semantic foundations and query execution. The VLDB Journal, 15(2):121–142, June 2006.

[ACC+03] Daniel J. Abadi, Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12:120–139, 2003. 10.1007/s00778-003-0095-z.

[AEH+11] Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. MapReduce and PACT - Comparing Data Parallel Programming Models. In Proceedings of the 14th Conference on Database Systems for Business, Technology, and Web (BTW), BTW 2011, pages 25–44, Bonn, Germany, 2011. GI.

[BBD+02] Brian Babcock, Shivnath Babu, Mayur Datar, Rajeev Motwani, and Jennifer Widom. Models and issues in data stream systems. In Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS '02, pages 1–16, New York, NY, USA, 2002. ACM.

[BDD+10] Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renee J. Miller, and Nesime Tatbul. SECRET: a model for analysis of the execution semantics of stream processing systems. Proc. VLDB Endow., 3(1-2):232–243, September 2010.

[BDM02] Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, SODA '02, pages 633–634, Philadelphia, PA, USA, 2002. Society for Industrial and Applied Mathematics.

[BEH+10] Dominic Battre, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. Nephele/PACTs: A Programming Model and Execution Framework for Web-Scale Analytical Processing. In Proceedings of the 1st ACM symposium on Cloud computing, SoCC '10, pages 119–130, New York, NY, USA, 2010. ACM.

[BW01] Shivnath Babu and Jennifer Widom. Continuous queries over data streams. SIGMOD Rec., 30(3):109–120, September 2001.

[CcC+02] Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Monitoring streams: a new class of data management applications. In Proceedings of the 28th international conference on Very Large Data Bases, VLDB '02, pages 215–226. VLDB Endowment, 2002.

[CCD+03] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Samuel R. Madden, Fred Reiss, and Mehul A. Shah. TelegraphCQ: continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, pages 668–668, New York, NY, USA, 2003. ACM.

[CDTW00] Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: a scalable continuous query system for Internet databases. SIGMOD Rec., 29(2):379–390, May 2000.

[CJSS03] Chuck Cranor, Theodore Johnson, Oliver Spataschek, and Vladislav Shkapenyuk. Gigascope: a stream database for network applications. In Proceedings of the 2003 ACM SIGMOD international conference on Management of data, SIGMOD '03, pages 647–651, New York, NY, USA, 2003. ACM.

[Cod70] E. F. Codd. A relational model of data for large shared data banks. Commun. ACM, 13(6):377–387, June 1970.

[CS01] A. S. Chiou and J. C. Sieg. Optimization for queries with holistic functions. In Database Systems for Advanced Applications, 2001. Proceedings. Seventh International Conference on, pages 327–334, 2001.

[DG04] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI, pages 137–150, 2004.


[DJAZ05] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Ugur Cetintemel, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik. The Design of the Borealis Stream Processing Engine. CIDR Conference, January 2005.

[GCB+97] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals. Data Mining and Knowledge Discovery, 1:29–53, 1997.

[GKS01] Johannes Gehrke, Flip Korn, and Divesh Srivastava. On computing correlated aggregates over continual data streams. In Proceedings of the 2001 ACM SIGMOD international conference on Management of data, SIGMOD '01, pages 13–24, New York, NY, USA, 2001. ACM.

[GMLY98] H. Garcia-Molina, W. Labio, and J. Yang. Expiring Data in a Warehouse. Technical Report 1998-35, Stanford InfoLab, 1998.

[GMS93] Ashish Gupta, Inderpal Singh Mumick, and V. S. Subrahmanian. Maintaining views incrementally. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data, SIGMOD '93, pages 157–166, New York, NY, USA, 1993. ACM.

[GO03a] Lukasz Golab and M. Tamer Oezsu. Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams. Technical report, February 2003.

[GO03b] Lukasz Golab and M. Tamer Ozsu. Issues in data stream management. SIGMOD Rec., 32(2):5–14, June 2003.

[GO10] Lukasz Golab and M. Tamer Oezsu. Data Stream Management. Synthesis Lectures on Data Management, 2(1):1–73, 2010.

[Gra93] Goetz Graefe. Query evaluation techniques for large databases. ACM Comput. Surv., 25(2):73–169, June 1993.

[HH99] Peter J. Haas and Joseph M. Hellerstein. Ripple joins for online aggregation. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, SIGMOD '99, pages 287–298, New York, NY, USA, 1999. ACM.

[HWL12] HWL Website. http://hwl.hu-berlin.de/, January 2012.

[IFF+99] Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, and Daniel S. Weld. An adaptive query execution system for data integration. In Proceedings of the 1999 ACM SIGMOD international conference on Management of data, SIGMOD '99, pages 299–310, New York, NY, USA, 1999. ACM.


[JK03] Jaewoo Kang, Jeffrey F. Naughton, and Stratis D. Viglas. Evaluating window joins over unbounded streams. Proc. of the 2003 Intl. Conf. on Data Engineering, March 2003.

[JMS95] H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz. View maintenance issues for the chronicle data model (extended abstract). In Proceedings of the fourteenth ACM SIGACT-SIGMOD-SIGART symposium on Principles of database systems, PODS '95, pages 113–124, New York, NY, USA, 1995. ACM.

[LGO04] Lukasz Golab, Shaveen Garg, and M. Tamer Oezsu. On Indexing Sliding Windows over Online Data Streams. Advances in Database Technology, 2992/2004:547–548, 2004.

[LMT+05] Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. SIGMOD Rec., 34(1):39–44, March 2005.

[LS03] Alberto Lerner and Dennis Shasha. AQuery: query language for ordered data, optimization techniques, and experiments. In Proceedings of the 29th international conference on Very large data bases - Volume 29, VLDB '03, pages 345–356. VLDB Endowment, 2003.

[LTS+08] Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, and David Maier. Out-of-order processing: a new architecture for high-performance stream systems. Proc. VLDB Endow., 1(1):274–288, August 2008.

[MCZ03] Mitch Cherniack, Hari Balakrishnan, Magdalena Balazinska, Don Carney, Ugur Cetintemel, Ying Xing, and Stan Zdonik. Scalable Distributed Stream Processing. Conference on Innovative Data Systems, 2003.

[PSR95] Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. SEQ: A Model for Sequence Databases. Proceedings of the 11th international conference on data engineering, pages 232–239, March 1995.

[PSR96] P. Seshadri, M. Livny, and R. Ramakrishnan. The design and implementation of a sequence database system. Proc. of the 1996 Intl. Conf. on Very Large Data Bases, pages 99–110, September 1996.

[SA85] R. Snodgrass and I. Ahn. A taxonomy of time in databases. ACM SIGMOD Intl. Conf. on Management of Data, pages 236–245, 1985.

[SCS03] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. Proceedings of CIDR Conference, 2003.


[SLR94] Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. Sequence query processing. SIGMOD Rec., 23(2):430–441, May 1994.

[Str12] Stratosphere Website. http://www.stratosphere.eu/, January 2012.

[Sul96] M. Sullivan. Tribeca: A stream database manager for network traffic analysis. Intl. Conf. on Very Large Data Bases, page 594, September 1996.

[SZS+03] Stan Zdonik, Michael Stonebraker, Mitch Cherniack, Ugur Cetintemel, Magdalena Balazinska, and Hari Balakrishnan. The Aurora and Medusa Projects. IEEE Data Engineering Bulletin, 26, 2003.

[SZS12] Markus Scheidgen, Anatoli Zubow, and Robert Sombrutzki. ClickWatch – An Experimentation Framework for Communication Network Test-beds. In IEEE Wireless Communications and Networking Conference, Paris, 2012. IEEE.

[TGNO92] Douglas Terry, David Goldberg, David Nichols, and Brian Oki. Continuous queries over append-only databases. In Proceedings of the 1992 ACM SIGMOD international conference on Management of data, SIGMOD '92, pages 321–330, New York, NY, USA, 1992. ACM.

[TM02] Peter A. Tucker and David Maier. Exploiting Punctuation Semantics in Continuous Data Streams. Intl. Conf. Data Engineering, 2002.

[UF00] Tolga Urhan and Michael J. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, pages 27–33, June 2000.

[UF01] Tolga Urhan and Michael J. Franklin. Dynamic Pipeline Scheduling for Improving Interactive Query Performance. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, pages 501–510, 2001.

[WA93] Annita N. Wilschut and Peter M. G. Apers. Dataflow query execution in a parallel main-memory environment. Distributed and Parallel Databases, 1:103–128, 1993. 10.1007/BF01277522.


Statement of authorship

I declare that I completed this thesis on my own and that information which has been directly or indirectly taken from other sources has been noted as such. Neither this nor a similar work has been presented to an examination committee.

Berlin, March 21, 2013 . . . . . . . . . . . . . . . . . . . . . . . . . .
