© 2009 IBM Corporation DUST: A Generalized Notion of Similarity between Uncertain Time Series...

DUST: A Generalized Notion of Similarity between Uncertain Time Series

Smruti R. Sarangi and Karin Murthy
IBM Research Labs, Bangalore, India

Uncertainty in Data

Uncertainty is introduced by the massive amount of sensor data

[Figure: millions of sensors → server → analytics → business decisions]

Privacy preserving techniques

A certain degree of uncertainty is sometimes intentionally introduced

Outline

Motivation

Generalized Distance Measure
– Properties of a Distance Measure
– Algebraic Derivation

DUST Distance
– Computation
– Properties
– Examples

Results
– Setup
– Classification, Motif Detection, 1-NN search

Conclusion

What does Uncertain Data Look Like?

x = r(x) + ε(x)

where x is the observed value, r(x) is the real (original) value, and ε(x) is the error, drawn from an error distribution
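The model x = r(x) + ε(x) can be illustrated with a minimal sketch. The function name `observe`, the sample values, and the choice of a normal error distribution are assumptions made for illustration only; the slides do not fix a particular error distribution at this point.

```python
import random

def observe(real_values, sigma, rng):
    """Return observed values x = r(x) + eps(x), with eps ~ N(0, sigma).

    The normal error model is an assumption for this sketch.
    """
    return [r + rng.gauss(0.0, sigma) for r in real_values]

rng = random.Random(0)                  # fixed seed so the run is repeatable
real = [0.0, 0.5, 1.0, 0.5, 0.0]        # hypothetical real values r(x)
observed = observe(real, sigma=0.1, rng=rng)

# The error term is whatever separates the observed from the real values.
errors = [x - r for x, r in zip(observed, real)]
```

In practice only `observed` and the error distribution are known; `real` and `errors` are available here only because the data is synthetic.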

Uncertain Time Series

Data Mining on Uncertain Time Series

Clustering, classification, pattern discovery, …

We need a distance function to measure the distance between uncertain time series elements: are x and x’ closer than y and y’?

These tasks require at least a partial order on the distances between time series elements

However, a total order on the distances is better
– It ensures that all pairs are comparable
– The distance is easy to store and manage later

Distance between Uncertain Time Series

Is T2 closer to T1, or is T3 closer to T1?

[Figure: three example panels, labeled "Doesn’t matter", "Clearly T3", and "T2 or T3 ???"]

How to Measure the Distance between two Time Series Elements?

Consider two values:

x = r(x) + ε(x)
x’ = r(x’) + ε(x’)

Axiom: The distance between x and x’ should say something about the normal Euclidean distance between r(x) and r(x’)

Prior Approaches

Compute the a priori probability distribution of the random variable X = r(x) – r(x’)

Work with only the mean and standard deviation of X

X is not a distance measure. It is hard to work with probabilities.

Resolving the Question

T2 should be closer to T1 than T3

– This is because it is possible that T2 and T1 are the same time series. T2 just has some additional error.

– T3 and T1 can never be the same time series because the last value has a very large divergence

Euclidean distance (EUCL) and Dynamic Time Warping (DTW): T2 or T3 ???

DUST: T2


Arriving at a Distance Measure

Properties of a Distance Measure

1. Non-negativity: d(A,B) ≥ 0

2. Identity of Indiscernibles: d(A,B) = 0 iff A= B

3. Symmetry: d(A,B) = d(B,A)

4. Triangle Inequality: d(A,B) + d(A,C) ≥ d(B,C)

5. The distance should be similar to EUCL or DTW if the magnitude of the error is small. (Extra condition for an uncertain distance measure)

Extending Prior Work

Two time series are considered similar if: P(DIST(T1,T2) ≤ ε) ≥ τ

DIST(T1, T2) = sqrt(Σi dist(T1[i], T2[i])²)

dist(x,y) = |x-y|
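As a small sketch, the two definitions above translate directly into code. The length check is an added assumption; the slides do not say how series of unequal length are handled.

```python
import math

def dist(x, y):
    """Point-wise distance dist(x, y) = |x - y|."""
    return abs(x - y)

def DIST(t1, t2):
    """DIST(T1, T2) = sqrt(sum_i dist(T1[i], T2[i])^2)."""
    if len(t1) != len(t2):
        # Assumption: only equal-length series are compared.
        raise ValueError("time series must have the same length")
    return math.sqrt(sum(dist(a, b) ** 2 for a, b in zip(t1, t2)))
```

For example, `DIST([0.0, 0.0], [3.0, 4.0])` is the familiar 3-4-5 case.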

Assumption

P(DIST(T1,T2) ≤ ε) = p(DIST(T1,T2) = 0) · ε (irrespective of the size of ε)

Prior Work

-log(φ(|T1[i] – T2[i]|))

Some Algebra

P(DIST(T1,T2) ≤ ε) > P(DIST(T1,T3) ≤ ε)

p(DIST(T1,T2) = 0) > p(DIST(T1,T3) = 0)

Πi p(dist(T1[i], T2[i]) = 0) > Πi p(dist(T1[i], T3[i]) = 0)

Σi –log(p(dist(T1[i], T2[i]) = 0)) ≤ Σi –log(p(dist(T1[i], T3[i]) = 0))

Let φ(x) = p(dist(0,x) = 0)

dist(x,y) depends only on |x-y| (proved in the paper)

Definition: dust(x,y) = sqrt(-log(φ(|x-y|)) + log(φ(0)))

Some Algebra - II

P(DIST(T1,T2) ≤ ε) > P(DIST(T1,T3) ≤ ε)

≈ Σi –log(p(dist(T1[i], T2[i]) = 0)) ≤ Σi –log(p(dist(T1[i], T3[i]) = 0))

Definition: dust(x,y) = sqrt(-log(φ(|x-y|)) + log(φ(0)))

Σi dust(T1[i], T2[i])² ≤ Σi dust(T1[i], T3[i])²

Definition: DUST(T1, T2) = Σi dust(T1[i], T2[i])²

DUST(T1, T2) ≤ DUST(T1, T3)
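The definitions can be sketched in code once some φ is available. Here dust is taken as the square root of -log(φ(|x-y|)) + log(φ(0)), which is the reading under which Σi dust² matches the sum of negative logs in the derivation. The helper names `make_dust` and `light_tailed_phi`, and the particular φ, are hypothetical and chosen only to make the example concrete.

```python
import math

def make_dust(phi):
    """Build dust(x, y) from a caller-supplied phi (phi(d) must be > 0)."""
    log_phi0 = math.log(phi(0.0))

    def dust(x, y):
        value = -math.log(phi(abs(x - y))) + log_phi0
        # Guard against tiny negative values caused by round-off.
        return math.sqrt(max(value, 0.0))

    return dust

def DUST(t1, t2, dust):
    """DUST(T1, T2) = sum_i dust(T1[i], T2[i])^2, as defined above."""
    return sum(dust(a, b) ** 2 for a, b in zip(t1, t2))

def light_tailed_phi(d):
    """Hypothetical phi with lighter-than-normal tails (illustration only)."""
    return math.exp(-d ** 4)

dust = make_dust(light_tailed_phi)
t1 = [0.0, 0.0, 0.0, 0.0]
t2 = [1.0, 1.0, 1.0, 1.0]   # moderate deviation at every point
t3 = [0.0, 0.0, 0.0, 2.0]   # a single large deviation at the end
```

With this φ, `t2` and `t3` are equally far from `t1` in Euclidean terms (both at distance 2), but `DUST(t1, t2, dust)` is smaller than `DUST(t1, t3, dust)`, so the series with the single large divergence is ranked as farther away. A different φ would give a different ranking.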

DUST behaves like a standard distance measure


Computing the DUST Distance

Compute dust(0,Δx):
1. Assume values are independent
2. Use Bayes’ Theorem
3. Arrive at final solution through numerical integration

Inputs: original distribution of data, error distribution

Output: dust(0,Δx)
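The three steps above can be sketched numerically under strong simplifying assumptions: the original distribution of the data is taken to be flat (so Bayes' Theorem reduces the posterior of a real value to the shifted error density), both values share the same error density, and dust is taken as the square root of -log(φ(Δx)) + log(φ(0)). The function names, integration bounds, and step count are hypothetical choices for this sketch.

```python
import math

def normal_pdf(e, sigma):
    """Density of a N(0, sigma) error at e."""
    return math.exp(-0.5 * (e / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def phi(dx, error_pdf, lo=-20.0, hi=20.0, n=4000):
    """Approximate phi(dx) with the trapezoid rule.

    Observed values are 0 and dx. With a flat prior (an assumption), the
    density that both real values equal r is error_pdf(0 - r) * error_pdf(dx - r);
    integrating over r gives phi(dx).
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        r = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0   # trapezoid end-point weights
        total += weight * error_pdf(-r) * error_pdf(dx - r)
    return total * h

def dust0(dx, error_pdf):
    """dust(0, dx) computed from the numerically integrated phi."""
    value = -math.log(phi(abs(dx), error_pdf)) + math.log(phi(0.0, error_pdf))
    return math.sqrt(max(value, 0.0))          # clamp round-off below zero

sigma = 1.0
f = lambda e: normal_pdf(e, sigma)             # assumed error distribution
```

For this normal error model, `dust0(2.0, f)` comes out very close to 2 / (2σ) = 1, i.e. proportional to |Δx|. A non-flat original distribution would add a prior term inside the integral.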

Offline Computation
– Save the values in a lookup table
– Compress it using a piece-wise linear representation

Online Computation
– Check the last segment in the lookup table
– Perform a binary search to find the right segment
– Calculate the value dust(0,Δx)
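The offline/online split can be sketched as a table of breakpoints searched with a binary search. Several details are assumptions: the breakpoints sit on a uniform grid (no compression step is performed here), values beyond the table are extrapolated along the last segment, and the names `build_table` and `lookup` are hypothetical.

```python
import bisect

def build_table(dust0, max_dx, n_points):
    """Offline: tabulate dust(0, dx) at n_points >= 2 uniform breakpoints."""
    xs = [max_dx * i / (n_points - 1) for i in range(n_points)]
    ys = [dust0(x) for x in xs]
    return xs, ys

def lookup(xs, ys, dx):
    """Online: piece-wise linear evaluation of dust(0, dx)."""
    dx = abs(dx)
    if dx >= xs[-1]:
        # Check the last segment first; beyond the table we extrapolate
        # along it (an assumption of this sketch).
        i = len(xs) - 2
    else:
        # Binary search for the segment [xs[i], xs[i + 1]) containing dx.
        i = bisect.bisect_right(xs, dx) - 1
    x0, x1 = xs[i], xs[i + 1]
    y0, y1 = ys[i], ys[i + 1]
    return y0 + (y1 - y0) * (dx - x0) / (x1 - x0)

# Hypothetical curve: dust(0, dx) = dx / 2, used only to fill the table.
xs, ys = build_table(lambda d: d / 2.0, max_dx=5.0, n_points=11)
value = lookup(xs, ys, 1.3)
```

A real table would use the numerically integrated dust(0,Δx) values, and fewer, unevenly spaced breakpoints once compressed.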

The dust Distance

[Figure panels: Normal Distribution; Other Distributions]

The dust distance is exactly the same as Euclidean distance for the Normal distribution

dust ultimately converges with Euclidean distance
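A short worked derivation shows why the normal case reduces to Euclidean distance. It rests on assumptions not stated on the slide: a flat distribution of the original data, the same error density f = N(0, σ²) for both values, φ(Δ) defined as the density that two observations Δ apart share one real value, and dust taken as the square root of -log(φ(Δ)) + log(φ(0)).

```latex
\begin{align*}
\varphi(\Delta)
  &= \int f(-r)\, f(\Delta - r)\, dr
   = \frac{1}{\sqrt{4\pi\sigma^{2}}}
     \exp\!\left(-\frac{\Delta^{2}}{4\sigma^{2}}\right) \\
-\log\varphi(\Delta) + \log\varphi(0)
  &= \frac{\Delta^{2}}{4\sigma^{2}} \\
\mathrm{dust}(0, \Delta)
  &= \sqrt{\frac{\Delta^{2}}{4\sigma^{2}}}
   = \frac{|\Delta|}{2\sigma}
\end{align*}
```

Under these assumptions dust is |Δ| scaled by the constant 1/(2σ), so it orders pairs exactly as Euclidean distance does.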

Combining Multiple Distributions

Let the values in a time series have different error distributions f1 … fn. Let their standard deviations be σ1 … σn.

Let us choose σe = min (σ1, …, σn)/5

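Adjusted distribution f’(x): defined piecewise over the regions x < η1, η1 ≤ x ≤ η2, and x > η2, using N(0, σe)

[Figure: the axis is marked at η1 and η2, with regions labeled "Not interested" and "Interested"; panels for Normal, Uniform, and Exponential distributions]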

Combining Multiple Normal Distributions

Combining multiple normal distributions with different standard deviations

Converge to the same distance function

Results

Classification Accuracy

No Error : 77%, DUST: 72%, Euclidean Distance: 62%

Classification Accuracy: Dynamic Time Warping

No Error : 78%, DUST: 74%, Euclidean Distance: 67%

Top-k Motifs : EEG Dataset

Anomalous Behavior

Superior performance of DUST

[Figure: # of matches vs. standard deviation for k-NN classification, wafer dataset; series: DUST, Euclidean distance]

Conclusions

Uncertainty in data is increasingly prevalent in
– Sensor data
– Privacy preserving techniques

Conventional approaches
– Don’t produce good results when mining uncertain data

Propose novel metric DUST
– Incorporates theoretical measures of similarity
– Easy to compute

DUST makes up for half the accuracy lost due to uncertainty
