Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets

Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao, Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein

ICWE - 25 June 2015

Outline

● Introduction to Continuous Queries

● Motivating Example

● Problem Description

● Solution

● Experimental Results

● Conclusions


Introduction

•RDF Stream Processing engines usually register queries and execute them in a continuous fashion.


[Figure: an RDF stream generator emits the stream items S1 to S12 along the time axis t; the query is registered over a time-based sliding window W(ω,β) with width ω and slide β, and is evaluated on each window.]
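To make the time-based sliding window W(ω,β) concrete, the following Python sketch lists which stream items fall into each window. It is only an illustration under stated assumptions: the function name `sliding_windows`, the `(timestamp, item)` tuple layout, the half-open interval `[start, start + width)` and the sample timestamps are all hypothetical and not taken from the slides.

```python
from typing import Iterator, List, Tuple


def sliding_windows(
    stream: List[Tuple[int, str]], width: int, slide: int, start: int = 0
) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (window_start, window_end, items) for a time-based sliding window.

    width plays the role of ω and slide the role of β.
    Assumption: a window covers the half-open interval [start, start + width).
    """
    if not stream:
        return
    last_ts = max(ts for ts, _ in stream)
    w_start = start
    while w_start <= last_ts:
        w_end = w_start + width
        # Keep every item whose timestamp falls inside the current window.
        items = [item for ts, item in stream if w_start <= ts < w_end]
        yield w_start, w_end, items
        w_start += slide


# Hypothetical stream items with made-up timestamps.
example_stream = [(1, "S1"), (2, "S2"), (4, "S3"), (5, "S4"), (7, "S5")]
for w_start, w_end, items in sliding_windows(example_stream, width=4, slide=2):
    print(w_start, w_end, items)
```

Because the slide is smaller than the width in this example, consecutive windows overlap and an item can take part in several evaluations.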


Introduction

•Complex continuous queries combine data streams with remote background data.

[Figure: the stream is joined with background data (SPARQL endpoint).]


Motivating Example: Finding Influential Users

•Influential User: a user who has more than a given number of followers and is mentioned more than a given number of times in a given period (200 seconds).

•Follower number: stored in a remote endpoint.

•Mention number: computed by processing the stream of messages.
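The definition above can be sketched in Python as a join between the mention counts of one window and the follower counts. This is a minimal sketch, not the query used by the authors: the function name `influential_users`, the thresholds, and the dictionary standing in for the remote endpoint are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List


def influential_users(
    mentions: List[str],
    followers: Dict[str, int],
    min_mentions: int,
    min_followers: int,
) -> List[str]:
    """Return users exceeding both thresholds in the current window.

    mentions:  user names mentioned in the messages of one window.
    followers: follower counts (a local stand-in for the remote endpoint).
    """
    counts = Counter(mentions)  # mention number, computed from the stream
    return sorted(
        user
        for user, n in counts.items()
        if n > min_mentions and followers.get(user, 0) > min_followers
    )


# Made-up data for one window.
window_mentions = ["alice", "bob", "alice", "alice", "carol", "bob"]
follower_counts = {"alice": 5000, "bob": 50, "carol": 9000}
print(influential_users(window_mentions, follower_counts, 1, 1000))
```

In this made-up window only `alice` passes both thresholds: `bob` has too few followers and `carol` is mentioned only once.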


Inspired by Chris Testa's SemTech 2011 talk: http://goo.gl/kLSqGo

Investigating the Scenario: Symmetric Hash Join

•Drawbacks:

• Data access constraints.

• Background data is huge and has to be fetched at every evaluation: slow, and a waste of computational and financial resources.





Investigating the Scenario: Nested Loop Join

•Drawbacks:

• One invocation for each mapping from the WINDOW clause evaluation – high number of requests to the server.

• API restrictions (e.g., limited amount of requests over time).





Investigating the Scenario: Local Views

•Challenges:

• Data goes out of date

[Figure: the join is evaluated against a local view.]


Investigating the Scenario: Maintenance Processes

•Maintenance introduces a trade-off between response quality and time.

•We propose to manage this trade-off by fixing the time dimension based on the query constraints and maximizing the freshness of the response.

[Figure: the join reads from a local view that is updated by a maintenance process; freshness decreases over time, and refreshing involves a cost/quality trade-off.]


Problem Description

The maintenance process should identify elements of the local view that maximize response freshness.


Requirements of The Maintenance Process

1. should satisfy the Quality of Service constraints on responsiveness and freshness of the answer;

2. should take into account the change rates of the data elements in the REST API;

3. should consider the dynamicity of the change rate values;

4. may consider the sliding window operator.


Hypotheses

•We formulated the following hypotheses to build the maintenance process

•HP1: the freshness of the answer can increase by maintaining part of the local view involved in the current query evaluation

•HP2: the freshness of the answer increases by refreshing the (possibly) stale local view entries that would remain fresh in a higher number of evaluations


Solution: WSJ+WBM

[Figure: architecture with the Window, the JOIN, the Local View, WSJ (HP1), WBM (HP2), the Refresher and the background data (BKG).]

Terminology

[Figure: time axis t with windows W1 to W4 and time instant τ.]

Best Before Time: the time at which an element will become stale and is defined by:

Mappings from the WINDOW clause

Mappings in the LOCAL VIEW

Compatible mappings


WSJ

•WSJ identifies the candidate set: the possibly stale local view mappings involved in the current evaluation.

•WSJ analyzes the content of the current window evaluation and identifies the compatible mappings in the local view.

•The possibly stale mappings are identified by analyzing the associated best before time.
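The WSJ step described above can be sketched as a filter over the local view. This is a hedged illustration only: the `ViewEntry` class, the function name `wsj_candidates`, the use of a single join key for compatibility, and the rule that an entry is possibly stale once its best before time is not later than the current time are all assumptions, not details given in the slides.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class ViewEntry:
    key: str            # join key, e.g. a user identifier (assumption)
    value: int          # cached background value, e.g. follower count
    best_before: float  # time at which the entry is expected to become stale


def wsj_candidates(
    window_keys: Iterable[str], local_view: Dict[str, ViewEntry], now: float
) -> List[ViewEntry]:
    """Return the possibly stale view entries involved in the current window."""
    candidates = []
    for key in sorted(set(window_keys)):
        entry = local_view.get(key)  # compatible mapping, if any
        # Assumed staleness rule: best before time already reached.
        if entry is not None and entry.best_before <= now:
            candidates.append(entry)
    return candidates


# Made-up local view and window content.
view = {
    "alice": ViewEntry("alice", 5000, best_before=8),
    "bob": ViewEntry("bob", 50, best_before=12),
    "carol": ViewEntry("carol", 9000, best_before=6),
}
print([e.key for e in wsj_candidates(["alice", "bob", "alice"], view, now=10)])
```

Here `carol` is stale but does not appear in the window, and `bob` appears in the window but is still fresh, so only `alice` enters the candidate set.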


WBM

•WBM ranks the candidate set to determine which mappings to update.

•The ranking is based on two values: the renewed best before time and the remaining life time.

•The top k elements are selected to be refreshed. The value k is selected according to the responsiveness constraint.


WBM: renewed best before time

•When would the mappings become stale if refreshed now?

•The renewed best before time V is computed as:

[Figure: time axis t with windows W1 to W4 and time instant τ, next to a table of candidate mappings with columns V, L and Score.]

WBM: remaining life time and score

•In how many future evaluations is the mapping involved?

•The remaining life time L is computed as:

•WBM ranks the mappings by using a score:

Score=min(L,V)

• The mapping with the highest score is selected for the maintenance
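The ranking step can be sketched in Python as follows. The sketch takes V and L as already computed inputs, since their formulas are not reproduced here; the function name `wbm_select`, the tie-breaking on the key, the descending order (higher score first, in line with HP2) and the sample values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def wbm_select(candidates: Dict[str, Tuple[int, int]], k: int) -> List[str]:
    """Pick the k candidates to refresh, ranked by Score = min(L, V).

    candidates maps a mapping key to (V, L):
      V = renewed best before time, L = remaining life time.
    Assumption: higher scores are refreshed first; ties are broken by key.
    """
    scored = [(min(l, v), key) for key, (v, l) in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for _, key in scored[:k]]


# Made-up (V, L) values for three candidate mappings.
example = {"a": (2, 5), "b": (4, 1), "c": (3, 3)}
print(wbm_select(example, k=2))
```

Taking the minimum reflects that a refreshed mapping only helps while it is both still fresh (V) and still involved in the window (L); in the made-up example the scores are 2, 1 and 3, so `c` and `a` are refreshed when the budget is k = 2.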


Experiment: Data Collection

1. Streaming API

a. Twitter stream data for the mention count

2. Twitter APIs to get the number of followers

a. Create snapshots every minute

b. Simulate the changes based on each user’s predefined change rate.

[Figure: the streaming dataset and the snapshots/synthetic data.]


Experimental setup

•We study our hypotheses using a comparative evaluation with

• LRU: use the least recently updated elements for maintenance

• RND: use a random subset of elements for maintenance

•Error measure

• Comparing the differences between consecutive evaluations of the motivating query against the cache and against the real/synthetic dataset.

•HP1: We compared the cumulative staleness with and without WSJ (i.e., GNR) for both baselines.

• GNR: the candidate set is the whole set of view entries.

•HP2: We compared the cumulative staleness of WBM and the improved baselines.
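The two baseline policies named above, LRU and RND, can be sketched as simple selection functions over a candidate set. This is a minimal sketch under assumptions: the function names, the `last_update` dictionary, the tie-breaking on the key and the optional seeded generator are hypothetical and not taken from the experimental code.

```python
import random
from typing import Dict, Iterable, List, Optional


def lru_select(last_update: Dict[str, float], k: int) -> List[str]:
    """LRU baseline: pick the k least recently updated entries."""
    ordered = sorted(last_update, key=lambda key: (last_update[key], key))
    return ordered[:k]


def rnd_select(
    keys: Iterable[str], k: int, rng: Optional[random.Random] = None
) -> List[str]:
    """RND baseline: pick a random subset of at most k entries."""
    rng = rng or random.Random()
    pool = list(keys)
    return rng.sample(pool, min(k, len(pool)))


# Made-up update times for three view entries.
updates = {"a": 5.0, "b": 1.0, "c": 3.0}
print(lru_select(updates, k=2))
print(rnd_select(updates, k=2, rng=random.Random(0)))
```

Passing either the whole view (GNR) or only the WSJ candidate set to these functions gives the plain and the improved variants of each baseline.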


HP1: Maintaining involved entries of local view maximizes response accuracy.

[Figure: results on the synthetic and real datasets.]

As the update budget increases, WSJ improves more than GNR.


HP2: Maintaining possibly stale entries from local view that will stay fresh for a longer time maximizes response accuracy.

[Figure: results on the synthetic and real datasets.]

WBM does not improve as much as WBM*, which shows that the error is caused by a wrong estimation of the best before time (BBT). A more accurate prediction of the BBT should be used.


Conclusions and Future Work

•Conclusions:

• We proposed using the idea of materialization to optimize the processing of continuous queries.

• We proposed a policy to maximize the freshness according to the time constraint of the continuous query.

• We tested our policy against baseline policies (LRU and Random).

•Future Work:

• Extensions of real continuous query processors with the proposed approach

• Measuring the time overhead of maintenance

• Investigating more complex queries that have complicated join patterns between the SERVICE and STREAM clauses.

• Dynamically estimating the change rate of users.


Soheila Dehghanzadeh, Daniele Dell’Aglio, Shen Gao, Emanuele Della Valle, Alessandra Mileo, Abraham Bernstein

http://www.slideshare.net/sallyde
