77
Uncertainty in Querying and Monitoring Streams Themis Palpanas University of Trento

Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertainty in Querying and

Monitoring Streams

Themis Palpanas University of Trento

Page 2: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data Evolution

• structured

• precise

• static (almost)

2

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 3: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data Evolution

• non-structured

• uncertain

• streaming

3

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 4: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Things to Come

• big data

▫ very large collections (i.e., terabytes to exabytes)

▫ scientific, business, user generated

4

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 5: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Things to Come

• big data

• streaming data

▫ sensors, feeds, continuous analytics

▫ response times in seconds to nanoseconds

5

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 6: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Things to Come

• big data

• streaming data

• heterogeneous data

▫ structured, non-structured, text, multimedia

▫ variety of sources, schemas, representations, models

6

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 7: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Things to Come

• big data

• streaming data

• heterogeneous data

• uncertain data

▫ imprecision, inconsistencies, incompleteness, ambiguities, latency, deception, approximations, privacy preserving transformations

▫ process/data/model uncertainty

7

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 8: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Things to Come:

Uncertain Data

• enterprise data are (somewhat) precise

• all other data have some degree of uncertainty

• by 2015 80% of all data will be uncertain

8

Themis Palpanas - INQUEST, Oxford, England - Sep 2012 Quote from Michael Karasick, VP and Director of IBM Research – Almaden

Page 9: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

uncertain data data series similarity

work with:

Michele Dallachiesa, Besmira Nushi, Katsiaryna (Katya) Mirylenka

[PVLDB’12, QUEST’11]

9

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 10: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data series

• Sequence of points ordered along some dimension

v1 v2

time

x1

x2

xn

va

lue

10

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 11: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data series

• Sequence of points ordered along some dimension

11

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Wind speed From ocean observing node project, http://bml.ucdavis.edu/boon/wind.html

Page 12: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data series

• Sequence of points ordered along some dimension

12

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Historical stock quotes From http://money.cnn.com/2012/04/23/markets/walmart_stock/index.htm

Page 13: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data series

• Sequence of points ordered along some dimension

13

Trajectories from GPS logs From http://www.flickr.com/photos/kitepuppet/3604115258

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 14: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Data series

• Sequence of points ordered along some dimension

14

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Projectile points (arrowhead) converted into time series From “Detecting Time Series Motifs Under Uniform Scaling ”, Dragomir Yankov et al., 2007

Page 15: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertain Data Series

• Point value at each timestamp uncertain

15

v1 v2

time

x1

x2

xn

va

lue

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 16: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertain Data Series

• Point value at each timestamp uncertain

16

Tumor detection from noisy brain scan images From “DUST: a generalized notion of similarity between uncertain time series”, Saret al., 2010

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 17: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertain Data Series

• Point value at each timestamp uncertain

17

Ocean temperature at multiple depths From UpTempO project, http://psc.apl.washington.edu/UpTempO/

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 18: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertain Data Series

• Point value at each timestamp uncertain

18

Stock price changes during one year From http://seekingalpha.com/article/661511-fedex-releases-earnings-on-june-19

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 19: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Similarity queries

• Find all objects in a dataset similar to given query object

19

Query

DB

Similar

objects

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 20: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Similarity queries

• base for several analysis and mining tasks

▫ pattern recognition

▫ clustering

▫ classification

▫ motif discovery

▫ outlier identification

▫ ...

20

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 21: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Probabilistic range queries

• Similarity queries for uncertain data:

▫ Q query sequence

▫ T candidate match

▫ ε distance threshold

▫ τ probabilistic threshold

21

Find all uncertain time series T s.t. Pr(dist(Q,T)≤ε)≥τ

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 22: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Outline

• Review state-of-the-art

• Analytical comparison

• Experimental evaluation

• Neighborhood-aware models

• Guidelines for practitioners

MUNICH “Probabilistic Similarity Search for Uncertain Time Series" Aßfalg et al., SSDBM ’09

PROUD “PROUD: a probabilistic approach to processing similarity queries over uncertain data streams” Yeh et al., EDBT ‘09

DUST “DUST: a generalized notion of similarity between uncertain time series” Sarangi et al., KDD ‘10

22

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 23: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertainty models for time series

• Sequence of independent random variables:

▫ xi models point at timestamp i

▫ independency as simplifying assumption

• What do we know about xi?

23

X = <x1,…,xn>

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 24: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Possible-world semantics

• Samples drawn from xi

24

v1 v2

time

x1 x2 … xn

va

lue

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 25: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Distribution-aware model

• A priori knowledge of xi distribution

25

v1 v2

x1

x2

xn

time

va

lue

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 26: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

MUNICH

• Possible-world semantics

• Distance samples from possible world instantiations

• Pr(dist(Q,T)≤ε) as frequency of matching distances

26

v1 v2

time

x1 x2 … xn

Possible world i Possible world i+1

v1 v2

time

x1 x2 … xn

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 27: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

PROUD

• Known mean and variance

• Distance Dist(Q,T) as sum of random variables

• Central Limit Theorem: Dist(Q,T) ~ N(...)

27

v1 v2

time

t1

t2

tn

q1 q2

… qn

va

lue

Q

T

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 28: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

• Full knowledge of point and value distributions

• dust(xi,yi)= √ - log( Pr(dist(xi,yi) = 0) )

• DUST(X,Y)=√ Σidust(xi,yi)2

• New distance measure

DUST

28

v1 v2

x1

x2

xn

time Pdf

va

lue

va

lue

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 29: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Analytical comparison

MUNICH PROUD DUST

Point independence

Uncertainty model Possible-

world semantics

Distribution-aware model

Distribution-aware model

Knowledge assumptions

Samples Mean and variance

Value and error distributions

Mixed error distribution

Distance measures Euclidean,

DTW Euclidean

DUST, DUST@DTW

Similarity queries Probabilistic

range queries

Probabilistic range queries

Distance measure

Quality guarantees

29

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 30: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Experimental setup

• Techniques implemented in C++

• Datasets loaded in main memory

• 17 real datasets from UCR time series archive

▫ on average 500 data series, 300 points each

• Perturbation models uncertainty

▫ error in measurements

30

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 31: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Probabilistic range queries

31

• Similarity queries for uncertain data

▫ Q query sequence

▫ T candidate match

▫ ε distance threshold

▫ τ probabilistic threshold

Find all uncertain time series T s.t. Pr(dist(Q,T)≤ε)≥τ

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 32: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Distance and probability thresholds

• Distance thresholds ε for query Q:

▫ C = 10th nearest neighbor to Q on original dataset

▫ εEucl = EuclideanDist(Q,C) on uncertain dataset

▫ εDUST = DUST(Q,C) on uncertain dataset

• Probabilistic threshold τ

▫ brute-force tuning to maximize accuracy

32

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 33: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Experimental methodology

• Evaluating accuracy

▫ original datasets used as ground truth

▫ techniques evaluated on perturbed datasets

• All time series used as queries on respective datasets

• MUNICH high computational cost

▫ evaluation on single truncated dataset

▫ 60 time series, length 6. Each point 5 samples.

33

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 34: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Results on time performance

• CPU time per query varying normal error

• error variance slightly affects DUST performance

34

Standard deviation

Tim

e (m

s)

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 35: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Results on quality performance

• Truncated dataset, accuracy varying normal error

• MUNICH penalized: constant low number of samples

35

Standard deviation

F1

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 36: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Results on quality performance

• Accuracy with mixed normal error: 20% stddev=1.0, 80% stddev=0.4

• DUST performs better than PROUD,Euclidean

F1

Datasets

37

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 37: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Impact of perturbation on distances

• Accuracy with mixed normal error: 20% stddev=1.0, 80% stddev=0.4

• DUST performs better than PROUD,Euclidean

F1

Datasets

38

Why such a large difference?

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 38: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Impact of perturbation on distances

39

• Histogram of Euclidean distances in original and perturbed datasets

• Adiac Euclidean distances affected more by perturbation

Fre

qu

ency

Euclidean distance Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 39: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Neighborhood-aware models

• Neighboring points can be used for noise reduction.

• Inspired by moving average.

• Contribution of points to distance measure inversely proportional to their standard deviation.

▫ high variance low contribution

40

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 40: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertain Moving Average

• Point at timestamp i defined as:

▫ sj standard deviation of point xj

▫ vj sample of point xj

41

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 41: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Uncertain Exponential Moving Average

• Point at timestamp i defined as:

▫ sj standard deviation of point xj

▫ vj sample of point xj

▫ λ decay constant

42

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 42: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Results on quality performance

• Accuracy with mixed normal error: 20% stddev=1.0, 80% stddev=0.4

• UEMA/UMA accuracy 4-15% better than DUST

• UEMA better accuracy than UMA

F1

Datasets

43

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 43: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Which technique should I choose?

44

• Constant perturbation over time: Euclidean

• Quality guarantees: MUNICH,PROUD

• DUST high accuracy and robust

▫ does not require probabilistic threshold

▫ requires full knowledge of error/value distributions

• No prior distribution knowledge: MUNICH

▫ does not scale

• Highest accuracy: UEMA/UMA

▫ not probabilistic measures

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 44: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Summary

• Uncertain time series processing real problem in many application domains.

• Existing state-of-the-art techniques have limitations

• Simple neighborhood-aware models outperform more sophisticated techniques.

▫ valuable information is conveyed in neighboring points.

45

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 45: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Open problems

• Remove independency assumption

▫ modeling of sequential correlations in similarity measure

• What can we do in case of just few samples?

• Efficient data structures for processing and summarizing uncertain time series

46

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 46: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

uncertain data existential uncertainty

work with:

Michele Dallachiesa, Gabriela Jacques-Silva, Buğra Gedik, Kun-Lung Wu

47

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 47: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Motivating Example

48

• Continuous vibration measurements from machinery monitoring system

• Measurements from different sensors observing same device

• Measurements as samples from distribution of device vibration

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 48: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Motivating Example

• Device vibration modeled by uncertain tuples:

• Stream operators may introduce existential uncertainty, e.g., filters: vibration < 0.2

49

value uncertainty

existential uncertainty

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 49: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

How do we handle existential

uncertainty in stream processing?

• In many application domains data sources can be seen as uncertain

• Need for handling data uncertainty in composition of stream operators

• We focus on sliding windows and similarity joins as driving use case

50

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 50: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Possible world semantics

• Data stream T growing sequence <t1 , ..., tn, ...>

▫ point ti = sequence of samples ti,1, ..., ti,s

▫ sample ti,j occurs with probability Pr(ti,j)

▫ point ti exists with probability Pr(ti) = Σj Pr(ti,j)

• Strong model for uncertainty in databases

▫ Trio system (Stanford)

▫ Orion system (Purdue)

▫ ...

51

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 51: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Sliding windows on uncertain streams

• Given stream T, W(T,w) = T[n-w+1, n]

▫ n current timestamp

▫ w window width

va

lue

time

sn

s1 s2

...

“uncertain” sliding window

52

possible world #1

si

sj

si exists sj exists

existentially uncertain points

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 52: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Sliding windows on uncertain streams

• Given stream T, W(T,w) = T[n-w+1, n]

▫ n current timestamp

▫ w window width

va

lue

time

sn

s1 s2

...

“uncertain” sliding window

53

si

sj

si exists sj does not exist

existentially uncertain points

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

possible world #2

Page 53: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Sliding windows on uncertain streams

• Given stream T, W(T,w) = T[n-w+1, n]

▫ n current timestamp

▫ w window width

va

lue

time

sn

s1 s2

...

“uncertain” sliding window

54

si

sj

si does not exist sj exists

existentially uncertain points

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

possible world #3

Page 54: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Sliding windows on uncertain streams

• Given stream T, W(T,w) = T[n-w+1, n]

▫ n current timestamp

▫ w window width

va

lue

time

sn

s1 s2

...

“uncertain” sliding window

55

si

sj

si does not exist sj does not exist

existentially uncertain points

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

possible world #4

Page 55: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Similarity join without uncertain data v

alu

e

time

s2

...

sn

s1

S sliding window

tn+1+ε

t1

t2 ...

tn+1-ε

56

• Given two input streams S,T find similar points over time:

▫ candidate matches identified by sliding windows

▫ points si,tj similar if dist(si,tj) ≤ ε

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 56: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Similarity join with uncertain data v

alu

e

time

sn

s1

s2

...

t1

t2

...

tn+1+ε

tn+1-ε

S “uncertain” sliding window

57

• Uncertainty dimensions in similarity matches:

▫ sliding window content (existential uncertainty)

▫ points value (value uncertainty)

▫ existential uncertainty can give rise to value uncertainty!

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 57: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Summary

58

• existential uncertainty knocks on our door ▫ value uncertainty leads to existential uncertainty

▫ existential uncertainty leads to value uncertainty

• related to timestamp uncertainty

▫ (caused by clock/synchronization problems)

• we need streaming operators that can efficiently support it

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 58: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

non-structured data subjectivity analysis

work with:

Mikalai (Nick) Tsytsarau, Sihem Amer-Yahia, Kerstin Denecke

[DMKD’12, DiversiWEb’11, WWW’10]

59

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 59: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

60

Motivation

‣There are many opinion sources:

blogs, wikis, forums, social networks, …

‣Sentiment analysis can be used to:

learn customers attitude to a product

(features)

analyze people's reaction to some event

‣Above problems require sentiment

aggregation:

diversity-preserving

time-aware

scalable

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 60: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

61

Some Examples

Google features opinion aggregation in product search

‣ displays the most representative diverse opinions about a product.

‣ summarizes ratings among different reviews

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 61: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

‣ Subjectivity Analysis

‣ Sentiment Analysis

‣Opinion Aggregation

‣ Contradiction Analysis

‣ Streaming Methods

‣ Scalability

‣ Sentiment Spam

‣ Some More...

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

62

Subjectivity Analysis

‣ survey on Sentiment Analysis and Opinion Mining

‣ Tsytsarau and Palpanas DMKD 2012

‣ summarizes about 100 papers from 2001 to 2010

‣ studies trends

Page 62: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

63

Contradictions, what are they?

‣ Contradictions in text are situations where

’two sentences are extremely unlikely to be true together’

‣ Contradictions may be of different types, for example:

antonymy: hot - cold, light - dark, good - bad

negation: nice - not nice

mismatches: the solar system has 8 planets - there are 9 planets

sentiments: I like this book - this book is awful

‣Sentiment Contradictions may occur due to:

diversity of views (i.e., simultaneous contradiction)

change of views (i.e., sentiment shift)

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 63: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

64

Topic1 contr1

contr2

Topic2 contr3

Topic1 statistics

Topic2 statistics

Text topic1

opinion1

topic2

opinion2

Text

topic1

topic2

Contradiction Detection Pipeline

Topic

Identification

Opinion

Extraction

Opinion

Aggregation

Contradiction

Extraction

Text

Text

Extraction

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 64: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

65

Sentiment Contradictions

‣ We calculate contradiction by combining Aggr. Sentiment and Sentiment Variance

‣ If Aggregated Sentiment close to 0, the contradiction is high

‣ The larger the variance, the higher the contradiction

Contradiction

Sentiment Variance

Aggr. Sentiment

Raw Sentiments

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

‣ Detected contradictions not (always) possible to identify by visual inspection

Page 65: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

66

Physical Structure of the Contradiction Storage

‣ Store only the values of first- and second-order moments of sentiments M1 and M2

‣ Include time-tree pointers (black dots ) and contradiction values (gray squares )

‣ Add cross-links for adjacent pages to achieve a fast consequent reading (diamonds )

‣ Reference the rest of the topics by a special pointer to a separate page (white dots )

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 66: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

67

Contradictions Tool

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 67: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

68

‣Described framework for automatic contradiction detection

‣Defined two types of contradictions

‣ Proposed effective criteria to calculate contradictions

‣ Proposed scalable, efficient, incremental storage structures

...

Summary

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 68: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

69

...

Open Questions

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

‣How can we explain contradictions?

‣What news caused them?

‣Which demographics contribute to them?

Page 69: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

non-structured data entity resolution in data spaces

work with:

George Papadakis, Ekaterini Ioannou, Claudia Niederee, Wolfgang Nejdl

[TKDE’13, WSDM’12, JCDL’11]

70

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 70: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

• benefits:

• identifies and aggregates the different objects describing the same entity

• improves data quality and integrity, fosters re-use of existing data sources

• problems: • is an inherently quadratic problem: each object

has to be compared with all others

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

71

Entity Resolution

Page 71: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

• benefits:

• similar entities are grouped into blocks

• comparisons are only executed inside blocks

• problems:

• existing blocking techniques require a fixed, a-priori known schema

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

72

Blocking for Entity Resolution

Page 72: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Highly Heterogeneous

Information Spaces

• Web 2.0, Semantic Web, Dataspaces

• Voluminous, semi-structured datasets DBPedia 3.4: 36,5 million triples and 2,1 million entities

BTC09: 1,15 billion triples and182 million entities

• Users are free to insert not only attribute values but also

attribute names high levels of heterogeneity DBPedia 3.4: 50,000 attribute names

Google Base:100,000 schemata and10,000 entity types

• Large portion of data stemming from automatic

information extraction noise, tag-style values Themis Palpanas - INQUEST, Oxford, England - Sep 2012

73

Page 73: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Problem Definition

given two duplicate-free heterogeneous information spaces (Clean-Clean ER), or a single one that contains duplicates in itself (Dirty ER):

• Problem 1 (Effectiveness)

Develop block building methods that lead to blocks with low number of missed matches (i.e., high recall)

• Problem 2 (Efficiency)

Develop block processing methods that reduce the number of required pair-wise (entity) comparisons

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

74

Page 74: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

Conclusions

• confluence of very large, uncertain, heterogeneous, non-structured, streaming data

• fueling the INnovative

QUErying of STreams

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

75

Page 75: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

References Uncertain Data Series • Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas. Uncertain Time-Series Similarity: Return to the

Basics. Proceedings of the VLDB Endowment (PVLDB) Journal 5(11):1662-1673, 2012.

• Michele Dallachiesa, Besmira Nushi, Katsiaryna Mirylenka, Themis Palpanas. Similarity Matching for Uncertain Time Series: Analytical and Experimental Comparison. ACM SIGSPATIAL International Workshop on Querying and Mining Uncertain Spatio-Temporal Data (QUeST), in conjunction with ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (GIS), Chicago, IL, USA, November 2011.

Subjectivity Analysis • Mikalai Tsytsarau, Themis Palpanas. Survey on Mining Subjective Data on the Web. Data Mining and Knowledge Discovery

(DMKD) Journal, Special Issue on A Decade of Mining the Web, 24(3), 2012

• Mikalai Tsytsarau, Themis Palpanas, Kerstin Denecke. Scalable Detection of Sentiment-Based Contradictions. International Workshop on Knowledge Diversity on the Web (DiversiWeb), in conjunction with the World Wide Web Conference (WWW), Hyberabad, India, March 2011

• Mikalai Tsytsarau, Themis Palpanas, Kerstin Denecke. Scalable Discovery of Contradictions on the Web. International World Wide Web Conference (WWW), Raleigh, NC, USA, April 2010

Entity Resolution in Data Spaces • George Papadakis, Ekaterini Ioannou, Themis Palpanas, Claudia Niederee, Wolfgang Nejdl. A Blocking Framework for Entity

Resolution in Highly Heterogeneous Information Spaces. IEEE Transactions on Knowledge and Data Engineering (TKDE), accepted for publication

• George Papadakis, Ekaterini Ioannou, Claudia Niederee, Themis Palpanas, Wolfgang Nejdl. Beyond 100 Million Entities: Large-Scale Blocking-based Resolution for Heterogeneous Data. ACM International Conference on Web Search and Data Mining (WSDM), Seattle, WA, USA, February 2012

• George Papadakis, Ekaterini Ioannou, Claudia Niederee, Themis Palpanas, Wolfgang Neijdl. Eliminating the Redundancy in Blocking-Based Entity Resolution Methods. International ACM and IEEE Joint Conference on Digital Libraries (JCDL), Ottawa, Canada, June 2011

Themis Palpanas - INQUEST, Oxford, England - Sep 2012

76

Page 76: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

dbTrento: http://db.disi.unitn.eu

77 Themis Palpanas - INQUEST, Oxford, England - Sep 2012

Page 77: Uncertainty in Querying and Monitoring Streams · Things to Come •big data very large collections (i.e., terabytes to exabytes) scientific, business, user generated 4 Themis Palpanas

dbTrento: http://db.disi.unitn.eu

78 Themis Palpanas - INQUEST, Oxford, England - Sep 2012