Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering
Nithya N. Vijayakumar, Beth Plale
DDE Lab, Indiana University
{nvijayak, plale}@cs.indiana.edu
Project Description
• Provenance collection in stream filtering systems
• Identify the unique challenges that stream filtering systems pose to provenance tracking
• A low overhead data model and collection model that address these challenges
Outline
• Stream filtering systems
• Challenges posed by stream filtering systems
• Current provenance solutions applied to streams
• Proposed provenance data model
• Low overhead provenance collection model
• Calder stream processing system
• Implementation of provenance models in Calder
• Application in LEAD
• Future work
Stream filtering systems
• Data-driven systems that accept events in real time
– appropriate when data is continuously generated
– a data stream is an indefinite sequence of time-ordered events
• Filter (query, user-defined application)
– a processing unit that takes one or more event sequences as input and generates a new event sequence as output
– queries in a well-defined language, or customized application code
– long running and associated with a lifetime
• Applications
– monitoring, stock ticks in financial applications, performance measurements in network monitoring and traffic management, sensor data, scientific datasets
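The filter abstraction above can be sketched in a few lines; the event shape and function name here are invented for illustration and are not Calder's API:

```python
from typing import Dict, Iterable, Iterator

Event = Dict[str, float]  # a time-stamped reading; shape is assumed for this sketch

def temperature_filter(events: Iterable[Event], threshold: float) -> Iterator[Event]:
    """A long-running filter: consumes an indefinite, time-ordered event
    sequence and emits a new (derived) event sequence."""
    for event in events:
        if event["temperature"] > threshold:
            yield event

# In a real deployment the input would be an unbounded pub-sub feed;
# a short list stands in for it here.
feed = [{"timestamp": 1.0, "temperature": 12.5},
        {"timestamp": 2.0, "temperature": 30.1}]
derived = list(temperature_filter(feed, threshold=20.0))
print(derived)  # -> [{'timestamp': 2.0, 'temperature': 30.1}]
```

Because the filter is a generator, it never materializes the whole stream, matching the indefinite-sequence model above.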
Challenges posed by stream filtering systems
• Identifying provenance entities
– what is the atomic unit: event, stream, or source?
• Capturing stream filtering conditions with low overhead
– distributed environment
– environmental and configuration changes
• Maintaining relevance with non-persistent data
– tracing events back to their sources long after they are derived
• Dynamic accuracy estimation
– quality-of-service guarantees for derived streams
– provenance across streams
– deducing the accuracy of derived streams
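For the last challenge, one simple way to deduce a derived stream's accuracy is to combine the accuracies of its inputs with the accuracy of the filter applied to them. The multiplicative rule below is an illustrative assumption, not a model prescribed by this work:

```python
def derived_accuracy(input_accuracies, filter_accuracy=1.0):
    """Combine input-stream accuracies into a derived-stream accuracy.
    The multiplicative rule here is an illustrative assumption only."""
    acc = filter_accuracy
    for a in input_accuracies:
        acc *= a
    return acc

# A derived stream built from inputs at 0.9 and 0.85 accuracy, through a
# sampling approximation that itself retains 0.95 accuracy:
estimate = derived_accuracy([0.9, 0.85], filter_accuracy=0.95)
```

Applied recursively over the provenance tree, such a rule lets the system estimate accuracy for streams derived from other derived streams.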
Current provenance solutions applied to streams: What is the challenge?
• Representing provenance for stream entities using a Virtual Data Grid system
– indefinite sequence of time-ordered datasets
– non-persistent data events
– accountability is needed more than reproducibility
• Provenance collection using PASOA or Karma
– provenance must be collected for each stream and for the filters executed on streams
– communication between components of the stream filtering system is less important than the entities themselves
Current provenance solutions applied to streams (contd.)
• Logging environmental conditions using Log4j
– non-trivial load on the service
– aggregating provenance traces is difficult
• Augmenting accuracy and lineage using Trio
– lineage cannot be associated with datasets
– the accuracy of a set of events must be traceable long after the stream is generated
Provenance data model: What to track?
• Atomic units
– streams generated outside the system (base streams)
– declarative queries or application code that executes continuously (adaptive filters)
– streams generated by executing adaptive filters on base and derived streams (derived streams)
Provenance data model: How to store it?
• Provenance stack
– base provenance information and a list of changes
– the latest information is identified by its timestamp and is current from that point onward
• Provenance tree
– a derived stream refers to the provenance of its input streams (base and derived) plus the adaptive filters
– provenance can refer to annotations outside the system (SAM)
• Store the provenance history (compressed or uncompressed) of streams and filters
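A minimal sketch of the provenance-stack idea, with class and method names invented here: base information plus a time-ordered change list, where the entry with the latest timestamp not after the query time is current:

```python
class ProvenanceStack:
    """Base provenance plus a list of timestamped changes; the latest
    change at or before a query time is current from that point onward.
    (Names are illustrative, not Calder's API.)"""

    def __init__(self, base):
        self.base = dict(base)
        self.changes = []  # (timestamp, change) pairs, appended in time order

    def push(self, timestamp, change):
        self.changes.append((timestamp, change))

    def at(self, timestamp):
        """Reconstruct the provenance current at `timestamp`: base info
        overlaid with every change whose timestamp is <= the query time."""
        info = dict(self.base)
        for ts, change in self.changes:
            if ts <= timestamp:
                info.update(change)
        return info

stack = ProvenanceStack({"rate": "1/min", "source": "B0011"})
stack.push(100.0, {"rate": "1/5min"})  # an environmental change
print(stack.at(50.0)["rate"])   # -> 1/min
print(stack.at(150.0)["rate"])  # -> 1/5min
```

Storing only base information plus deltas is what keeps the history compact enough to retain (compressed or not) for every stream and filter.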
Low overhead provenance collection model
• Base provenance
– collected from the user when a stream/filter is registered
– documents the available information (inputs, filters, rate, sources, etc.)
– system- and user-defined metadata stored as name-value pairs in the base provenance information
– can be updated by the user
• Dynamic provenance
– a subset of a stream is identified by a starting and an ending timestamp
– changes are logged with a starting timestamp and are current from then on
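The dynamic-provenance lookup can be sketched as selecting, for a stream subset bounded by starting and ending timestamps, the logged changes that are current inside that interval. The helper name and log shape are invented here for illustration:

```python
def changes_for_interval(changelog, start, end):
    """Collect the change entries relevant to the stream subset
    [start, end]: every change logged inside the interval, plus the last
    change before it, which is still current when the interval opens.
    (Helper name and log shape are invented for illustration.)"""
    prior, inside = None, []
    for entry in changelog:  # time-ordered entries with a 'timestamp' key
        if entry["timestamp"] <= start:
            prior = entry
        elif entry["timestamp"] <= end:
            inside.append(entry)
    return ([prior] if prior else []) + inside

log = [{"timestamp": 10, "description": "B0011 down"},
       {"timestamp": 40, "description": "B0011 restored"}]
relevant = changes_for_interval(log, 20, 50)  # both entries apply
```

Since each change is current from its starting timestamp onward, the entry just before the interval must be included or the interval's opening state would be lost.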
A simple example
<derivedstream>
  <name>Temperature Feed</name>
  <uniqueID>D0010</uniqueID>
  <queryID>Q0099</queryID>
  <inputstreams>
    <streamID>B0011</streamID>
    <streamID>D0005</streamID>
  </inputstreams>
  <systemmetadata>
    <name>owner</name>
    <value>foo</value>
    <name>permissions</name>
    <value>open to everyone</value>
  </systemmetadata>
  <starttime>
    <timestamp>13:00:00 Feb-10-2006</timestamp>
  </starttime>
  <changelog>
    <event>
      <timestamp>13:34:56 Feb-10-2006</timestamp>
      <description>B0011 down</description>
      <approximation>Sampling</approximation>
      <accuracy>0.85</accuracy>
    </event>
  </changelog>
</derivedstream>
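Because the record is plain XML, the provenance-tree references can be recovered with a standard parser. The snippet below (abridged document, Python's stdlib ElementTree) pulls out the input-stream IDs that a full tree walk would follow recursively:

```python
import xml.etree.ElementTree as ET

# Abridged derived-stream record from the example above.
doc = """<derivedstream>
  <uniqueID>D0010</uniqueID>
  <queryID>Q0099</queryID>
  <inputstreams>
    <streamID>B0011</streamID>
    <streamID>D0005</streamID>
  </inputstreams>
</derivedstream>"""

root = ET.fromstring(doc)
# Each streamID is an edge in the provenance tree; a full trace would
# fetch the record for each referenced stream and recurse.
inputs = [s.text for s in root.find("inputstreams")]
print(inputs)  # -> ['B0011', 'D0005']
```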
Calder stream processing system
• Distributed processing of streams
• Service oriented access to data streams
• SQL-based rule-action support
• Extends OGSA-DAI v6 GDS to streaming resources
• Synchronous and asynchronous data delivery
[Architecture figure: Calder query execution. A data management subsystem (Stream Grid Data Service, Query Planning Service, Stream Rowset Service, Provenance Service, Monitoring Service) mediates between users/applications and computation nodes running queries in the processing engine; data streams enter through the Calder pub-sub system, queries/requests flow in, and result data flows back.]
Provenance collection in Calder
[Architecture figure: provenance collection in Calder. The Query Planner Service sends query execution plan updates and the Monitoring Service sends monitoring updates to the Provenance Service, each subscribing to receive events of interest; computation nodes propagate provenance; the Provenance Service stores records in an XML database and serves provenance queries/updates with provenance results.]
Application in LEAD
• Radar metadata is sent through the pub-sub system
• A user submits a filter query
• Calder executes the filter query on incoming data streams
• Filtered datasets are processed using data mining algorithms (MDA & ADaM)
• Triggers (WS-Notifications) are sent to workflows that invoke forecast models
• Provenance tracking will help in understanding why and when a trigger was sent
Future work
• Complex event processing
– processing multiple streams
– identifying global behavior
• Context management
– informative search based on past usage
– predicting system characteristics
– managing profiles for users and dynamic system configuration