Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering
Nithya N. Vijayakumar, Beth Plale
DDE Lab, Indiana University
{nvijayak, plale}@cs.indiana.edu
Project Description
• Provenance collection in stream filtering systems
• Identify the unique challenges that stream filtering systems pose to provenance tracking
• A low overhead data model and collection model that address these challenges
Outline
• Stream filtering systems
• Challenges posed by stream filtering systems
• Current provenance solutions applied to streams
• Proposed provenance data model
• Low overhead provenance collection model
• Calder stream processing system
• Implementation of provenance models in Calder
• Application in LEAD
• Future work
Stream filtering systems
• Data-driven systems that accept events in real time
– appropriate when data is continuously generated
– a data stream is an indefinite sequence of time-ordered events
• Filter (query, user-defined application)
– a processing unit that takes one or more event sequences as input and generates a new event sequence as output
– queries in a well-defined language, or customized application code
– long running and associated with a lifetime
• Applications
– monitoring, stock ticks in financial applications, performance measurements in network monitoring and traffic management, sensor data, scientific datasets
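The filter abstraction above can be sketched in a few lines; the event shape and function name here are invented for illustration and are not Calder's API:

```python
from typing import Dict, Iterable, Iterator

Event = Dict[str, float]  # a time-stamped reading; shape is assumed for this sketch

def temperature_filter(events: Iterable[Event], threshold: float) -> Iterator[Event]:
    """A long-running filter: consumes an indefinite, time-ordered event
    sequence and emits a new (derived) event sequence."""
    for event in events:
        if event["temperature"] > threshold:
            yield event

# In a real deployment the input would be an unbounded pub-sub feed;
# a short list stands in for it here.
feed = [{"timestamp": 1.0, "temperature": 12.5},
        {"timestamp": 2.0, "temperature": 30.1}]
derived = list(temperature_filter(feed, threshold=20.0))
print(derived)  # -> [{'timestamp': 2.0, 'temperature': 30.1}]
```

Because the filter is a generator, it never materializes the whole stream, matching the indefinite-sequence model above.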
Challenges posed by stream filtering systems
• Identifying provenance entities
– what is the atomic unit: event, stream, or source?
• Capturing stream filtering conditions with low overhead
– distributed environment
– environmental and configuration changes
• Maintaining relevance with non-persistent data
– tracing events back to their sources long after they are derived
• Dynamic accuracy estimation
– quality-of-service guarantees for derived streams
– provenance across streams
– deducing the accuracy of derived streams
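For the last challenge, one simple way to deduce a derived stream's accuracy is to combine the accuracies of its inputs with the accuracy of the filter applied to them. The multiplicative rule below is an illustrative assumption, not a model prescribed by this work:

```python
def derived_accuracy(input_accuracies, filter_accuracy=1.0):
    """Combine input-stream accuracies into a derived-stream accuracy.
    The multiplicative rule here is an illustrative assumption only."""
    acc = filter_accuracy
    for a in input_accuracies:
        acc *= a
    return acc

# A derived stream built from inputs at 0.9 and 0.85 accuracy, through a
# sampling approximation that itself retains 0.95 accuracy:
estimate = derived_accuracy([0.9, 0.85], filter_accuracy=0.95)
```

Applied recursively over the provenance tree, such a rule lets the system estimate accuracy for streams derived from other derived streams.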
Current provenance solutions applied to streams: What is the challenge?
• Representing provenance for stream entities using a Virtual Data Grid system
– indefinite sequence of time-ordered datasets
– non-persistent data events
– accountability is needed more than reproducibility
• Provenance collection using PASOA or Karma
– provenance must be collected for each stream and for the filters executed on streams
– communication between components of the stream filtering system is less important than the entities themselves
Current provenance solutions applied to streams (contd.)
• Logging environmental conditions using Log4j
– non-trivial load on the service
– aggregating provenance traces is difficult
• Augmenting accuracy and lineage using Trio
– lineage cannot be associated with datasets
– the accuracy of a set of events must be traceable long after the stream is generated
Provenance data model: What to track?
• Atomic units
– streams generated outside the system (base streams)
– declarative queries or application code that executes continuously (adaptive filters)
– streams generated by executing adaptive filters on base and derived streams (derived streams)
Provenance data model: How to store it?
• Provenance stack
– base provenance information and a list of changes
– the latest information is identified by its timestamp and is current from that point onward
• Provenance tree
– a derived stream refers to the provenance of its input streams (base and derived) plus the adaptive filters
– provenance can refer to annotations outside the system (SAM)
• Store the provenance history (compressed or uncompressed) of streams and filters
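A minimal sketch of the provenance-stack idea, with class and method names invented here: base information plus a time-ordered change list, where the entry with the latest timestamp not after the query time is current:

```python
class ProvenanceStack:
    """Base provenance plus a list of timestamped changes; the latest
    change at or before a query time is current from that point onward.
    (Names are illustrative, not Calder's API.)"""

    def __init__(self, base):
        self.base = dict(base)
        self.changes = []  # (timestamp, change) pairs, appended in time order

    def push(self, timestamp, change):
        self.changes.append((timestamp, change))

    def at(self, timestamp):
        """Reconstruct the provenance current at `timestamp`: base info
        overlaid with every change whose timestamp is <= the query time."""
        info = dict(self.base)
        for ts, change in self.changes:
            if ts <= timestamp:
                info.update(change)
        return info

stack = ProvenanceStack({"rate": "1/min", "source": "B0011"})
stack.push(100.0, {"rate": "1/5min"})  # an environmental change
print(stack.at(50.0)["rate"])   # -> 1/min
print(stack.at(150.0)["rate"])  # -> 1/5min
```

Storing only base information plus deltas is what keeps the history compact enough to retain (compressed or not) for every stream and filter.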
Low overhead provenance collection model
• Base provenance
– collected from the user when a stream/filter is registered
– documents the available information (inputs, filters, rate, sources, etc.)
– system- and user-defined metadata stored as name-value pairs in the base provenance information
– can be updated by the user
• Dynamic provenance
– a subset of a stream is identified by a starting and an ending timestamp
– changes are logged with a starting timestamp and are current from then on
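The dynamic-provenance lookup can be sketched as selecting, for a stream subset bounded by starting and ending timestamps, the logged changes that are current inside that interval. The helper name and log shape are invented here for illustration:

```python
def changes_for_interval(changelog, start, end):
    """Collect the change entries relevant to the stream subset
    [start, end]: every change logged inside the interval, plus the last
    change before it, which is still current when the interval opens.
    (Helper name and log shape are invented for illustration.)"""
    prior, inside = None, []
    for entry in changelog:  # time-ordered entries with a 'timestamp' key
        if entry["timestamp"] <= start:
            prior = entry
        elif entry["timestamp"] <= end:
            inside.append(entry)
    return ([prior] if prior else []) + inside

log = [{"timestamp": 10, "description": "B0011 down"},
       {"timestamp": 40, "description": "B0011 restored"}]
relevant = changes_for_interval(log, 20, 50)  # both entries apply
```

Since each change is current from its starting timestamp onward, the entry just before the interval must be included or the interval's opening state would be lost.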
A simple example
<derivedstream>
  <name>Temperature Feed</name>
  <uniqueID>D0010</uniqueID>
  <queryID>Q0099</queryID>
  <inputstreams>
    <streamID>B0011</streamID>
    <streamID>D0005</streamID>
  </inputstreams>
  <systemmetadata>
    <name>owner</name>
    <value>foo</value>
    <name>permissions</name>
    <value>open to everyone</value>
  </systemmetadata>
  <starttime>
    <timestamp>13:00:00 Feb-10-2006</timestamp>
  </starttime>
  <changelog>
    <event>
      <timestamp>13:34:56 Feb-10-2006</timestamp>
      <description>B0011 down</description>
      <approximation>Sampling</approximation>
      <accuracy>0.85</accuracy>
    </event>
  </changelog>
</derivedstream>
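Because the record is plain XML, the provenance-tree references can be recovered with a standard parser. The snippet below (abridged document, Python's stdlib ElementTree) pulls out the input-stream IDs that a full tree walk would follow recursively:

```python
import xml.etree.ElementTree as ET

# Abridged derived-stream record from the example above.
doc = """<derivedstream>
  <uniqueID>D0010</uniqueID>
  <queryID>Q0099</queryID>
  <inputstreams>
    <streamID>B0011</streamID>
    <streamID>D0005</streamID>
  </inputstreams>
</derivedstream>"""

root = ET.fromstring(doc)
# Each streamID is an edge in the provenance tree; a full trace would
# fetch the record for each referenced stream and recurse.
inputs = [s.text for s in root.find("inputstreams")]
print(inputs)  # -> ['B0011', 'D0005']
```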
Calder stream processing system
• Distributed processing of streams
• Service oriented access to data streams
• SQL-based rule-action support
• Extends OGSA-DAI v6 GDS to streaming resources
• Synchronous and asynchronous data delivery
[Architecture figure: Calder query execution. A data management subsystem (Stream Grid Data Service, Query Planning Service, Stream Rowset Service, Provenance Service, Monitoring Service) mediates between users/applications and computation nodes running queries in the processing engine; data streams enter through the Calder pub-sub system, queries/requests flow in, and result data flows back.]
Provenance collection in Calder
[Architecture figure: provenance collection in Calder. The Query Planner Service sends query execution plan updates and the Monitoring Service sends monitoring updates to the Provenance Service, each subscribing to receive events of interest; computation nodes propagate provenance; the Provenance Service stores records in an XML database and serves provenance queries/updates with provenance results.]
Application in LEAD
• Radar metadata is sent through the pub-sub system
• A user submits a filter query
• Calder executes the filter query on incoming data streams
• Filtered datasets are processed using data mining algorithms (MDA & ADaM)
• Triggers (WS-Notifications) are sent to workflows that invoke forecast models
• Provenance tracking will help in understanding why and when a trigger was sent
Future work
• Complex event processing
– processing multiple streams
– identifying global behavior
• Context management
– informative search based on past usage
– predicting system characteristics
– managing profiles for users and dynamic system configuration