Wavelet decomposition of data streams

by Dragana Veljkovic

Motivation

• Continuous data streams arise naturally in:
  • telecommunication and internet traffic
  • retail and banking transactions
  • web server log records, etc.

• Many applications need this data to be processed around the clock (24/7) in only one pass

Motivation cont.

• Usually this data is accumulated and archived for later use, but not always (e.g. network security)

• The ability to make decisions and interpret interesting patterns online can be crucial and has real dollar value for large corporations (e.g. fraud detection)

Our motivation

• We are currently working on data collected from 100 electrodes recording electrical potentials in a monkey brain over long periods of time

• We want to look at this data in real time and seek patterns, trends and surprises

Outline

• Background
  • streams
  • wavelets
  • sketches
  • error analysis

• Results
  • Implementation details
  • Strengths and weaknesses of this approach

Data streams

• A sequence of unbounded, real-time, high-rate data that an application can read only once

• Problems:
  • Unbounded memory requirements
  • High data rate

Underlying signal

• The signal is a one-dimensional function a: [0, …, N-1] → Z+

• A data item that arrives over time is an ordered pair: <domain, value>

Example: voting results <Texas, 60>

Example: phone call records <210-748, 12>

Data model

Two different data models used for rendering the underlying signal:

• Cash register
• Aggregate

Example: cash register model
<210-748, 10>, <210-689, 13>, <210-748, 20>, <210-740, 5>, <210-748, 2>, <210-740, 30>…

where the underlying signal is
<210-748, 32>, <210-689, 13>, <210-740, 35>
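The cash register example can be checked with a few lines of Python. This is only an illustrative sketch: the function name `underlying_signal` is hypothetical, and the stream is the one shown on the slide.

```python
from collections import defaultdict

def underlying_signal(updates):
    """Sum the cash-register updates <domain, value> for each domain."""
    signal = defaultdict(int)
    for domain, value in updates:
        signal[domain] += value   # each update adds to the running total
    return dict(signal)

stream = [("210-748", 10), ("210-689", 13), ("210-748", 20),
          ("210-740", 5), ("210-748", 2), ("210-740", 30)]
signal = underlying_signal(stream)
print(signal)
```

In the aggregate model, by contrast, each domain value would appear only once, already carrying its total.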

Stream format

Two distinct formats for the stream:
• Ordered
• Unordered

Example: Aggregate ordered stream – any time series
Example: Unordered cash-register stream – phone call records

An ordered cash-register stream is trivial to convert to an ordered aggregate stream

Wavelets

• Basis functions of limited duration and average value of zero

• Basis functions are shifted and scaled versions of the original wavelet

Discrete wavelet transform

• Uses only fixed values for wavelet scales based on powers of two

• Wavelet positions are also fixed and non overlapping

• Wavelets form a set of wavelet basis vectors of length N

Example: Haar wavelets on signal of length N = 8

• j = 1,…, log N levels
• k = 0,…, 2^j - 1 spaces for each level

Haar wavelets for signal of size 8

Wavelet decomposition

• Wavelet decomposition can be regarded as projection of the signal on the set of wavelet basis vectors
• Each wavelet coefficient can be computed as the dot product of the signal with the corresponding basis vector

Example:

Table 1. from Gilbert et al. 2003.
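To make the projection view concrete, here is a minimal sketch that builds orthonormal Haar basis vectors for N = 8 and computes each coefficient as a dot product. The indexing by support width, the sign convention (positive first half, negative second half) and the sample signal are assumptions made for illustration, not taken from the paper.

```python
import math

def haar_basis(n):
    """Orthonormal Haar basis vectors for a signal of length n (a power of two)."""
    basis = [[1 / math.sqrt(n)] * n]           # scaling vector (overall average)
    width = n                                   # support length of the wavelet
    while width >= 2:
        for k in range(n // width):             # non-overlapping positions
            v = [0.0] * n
            h = 1 / math.sqrt(width)
            start = k * width
            for i in range(start, start + width // 2):
                v[i] = h                        # first half of the support
            for i in range(start + width // 2, start + width):
                v[i] = -h                       # second half of the support
            basis.append(v)
        width //= 2
    return basis

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def decompose(signal):
    """Each coefficient is the dot product of the signal with a basis vector."""
    return [dot(signal, psi) for psi in haar_basis(len(signal))]

def reconstruct(coeffs):
    """Sum of coefficient * basis vector recovers the signal."""
    basis = haar_basis(len(coeffs))
    return [sum(c * psi[i] for c, psi in zip(coeffs, basis))
            for i in range(len(coeffs))]

signal = [2, 2, 0, 2, 3, 5, 4, 4]               # illustrative values only
coeffs = decompose(signal)
print([round(c, 3) for c in coeffs])
```

Because the basis is orthonormal, the sum of squared coefficients equals the energy of the signal.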

Best B-term decomposition

• The signal can be fully recovered from the wavelet decomposition

• Best B-term decomposition uses only a small number of coefficients, B, that carry the highest energy

• The signal reconstructed using the B-term coefficients and the corresponding vectors is called the best B-term approximation

• Most signals that occur in nature can be well approximated using only a small number of coefficients (5-10).
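A batch (non-streaming) version of the best B-term approximation might look like the sketch below. It uses a pairwise averaging and differencing form of the orthonormal Haar transform; the function names and the test signal are hypothetical.

```python
import math

def haar(signal):
    """Orthonormal Haar transform: [average term, coarse details ... fine details]."""
    approx, details = list(signal), []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details = [(x - y) / math.sqrt(2) for x, y in pairs] + details
        approx = [(x + y) / math.sqrt(2) for x, y in pairs]
    return approx + details

def inverse_haar(coeffs):
    """Invert haar() level by level, from coarse to fine."""
    approx, pos = list(coeffs[:1]), 1
    while pos < len(coeffs):
        d = coeffs[pos:pos + len(approx)]
        approx = [v for s, t in zip(approx, d)
                  for v in ((s + t) / math.sqrt(2), (s - t) / math.sqrt(2))]
        pos += len(d)
    return approx

def best_b_term(signal, b):
    """Keep the b coefficients of largest magnitude and zero out the rest."""
    coeffs = haar(signal)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:b])
    kept = [c if i in keep else 0.0 for i, c in enumerate(coeffs)]
    return kept, inverse_haar(kept)

# A piecewise-constant signal is captured exactly by two coefficients.
kept, approx = best_b_term([1, 1, 1, 1, 5, 5, 5, 5], 2)
print([round(x, 6) for x in approx])
```

For an orthonormal basis, the squared error of the approximation equals the sum of squares of the dropped coefficients, which is why keeping the largest ones is optimal.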

Computing best B-term decomposition in runtime

For the ordered aggregate model:

• Maintain two sets of items:
  • the highest B wavelet basis coefficients for the signal seen so far
  • log N straddling coefficients, one for each level
• When a data item is read, the affected straddling coefficients are updated.
• If a coefficient is no longer straddling, it is compared to the existing highest B coefficients and the set is updated if necessary. A new straddling coefficient is initialized.
• Takes O(B + log N) storage and time for the ordered aggregate model
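The procedure above can be sketched for Haar wavelets as follows. This is an illustration under stated assumptions, not the paper's code: the stream delivers a(0), …, a(N-1) in order, N is a power of two, b is at least 1, and the coefficient labels `(width, position)` and the function name `stream_top_b` are hypothetical.

```python
import heapq
import math

def stream_top_b(stream, n, b):
    """Return the b largest-magnitude Haar coefficients after one ordered pass.

    Labels are (support width, position); (0, 0) marks the scaling coefficient.
    """
    widths = [2 ** j for j in range(1, n.bit_length())]   # 2, 4, ..., n
    straddling = {w: 0.0 for w in widths}   # one partial coefficient per level
    top = []                                # min-heap of (|coef|, label, coef)
    total = 0.0

    def offer(label, coef):
        item = (abs(coef), label, coef)
        if len(top) < b:
            heapq.heappush(top, item)
        elif item > top[0]:                 # beats the smallest kept coefficient
            heapq.heapreplace(top, item)

    for i, value in enumerate(stream):
        total += value
        for w in widths:                    # update the affected straddling coefficients
            sign = 1.0 if i % w < w // 2 else -1.0
            straddling[w] += sign * value / math.sqrt(w)
            if (i + 1) % w == 0:            # support fully seen: no longer straddling
                offer((w, i // w), straddling[w])
                straddling[w] = 0.0         # initialize the next straddling coefficient
    offer((0, 0), total / math.sqrt(n))     # overall average term
    return {label: coef for _, label, coef in top}

result = stream_top_b([1, 1, 1, 1, 5, 5, 5, 5], n=8, b=2)
print(result)
```

Only the heap of B coefficients and the log N partial sums are kept in memory; the stream itself is never stored.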

Sketches

• Sketch is made by projecting a signal onto several different low dimensional spaces which are chosen at random

• Many properties of the signal, such as histograms, can be accurately estimated by looking at the sketch

Definition of a sketch

• An atomic sketch of signal a is the dot product <a, r>, where r is a random vector of ±1-valued random variables

• A sketch of a signal is k independent atomic sketches, each with a different random vector r^j

• Sketch size is small compared to the signal size

Sketches

• Maintaining the sketch is easy as we are receiving the data

• If element <i, a(i)> arrives, add a(i)·r^j_i to the atomic sketch corresponding to random vector r^j, where r^j_i is the i-th entry of r^j

Example: In the cash-register model, on receiving <5, 10> we need to add 10·r^j_5 to each atomic sketch corresponding to the random vector r^j
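A minimal sketch of this update rule is shown below. For illustration the random ±1 vectors are stored explicitly, which is only feasible for small N; the class name `Sketch` and its parameters are hypothetical.

```python
import random

class Sketch:
    """k atomic sketches <a, r^j> maintained under cash-register updates."""

    def __init__(self, n, k, seed=0):
        rng = random.Random(seed)
        # Assumption: vectors are stored explicitly; this does not scale to large n.
        self.r = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]
        self.atomic = [0] * k

    def update(self, i, value):
        """On arrival of <i, value>, add value * r^j_i to every atomic sketch."""
        for j, rj in enumerate(self.r):
            self.atomic[j] += value * rj[i]

sk = Sketch(n=8, k=4, seed=1)
sk.update(5, 10)        # the slide's example: receive <5, 10>
sk.update(2, 3)
sk.update(5, -4)
print(sk.atomic)
```

Because each update is a simple addition, the sketch after any sequence of updates equals the sketch of the accumulated signal, regardless of arrival order.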

Error metrics

• SSE (sum squared error) – if R is a representation of the signal a then SSE is defined as ||a - R||_2^2 = Σ_i (a(i) - R(i))^2

• Pseudoenergy of the representation R is computed as

Query processing

• Batched – queries are posed at certain periodic intervals

• Ad hoc – a query may be posed at any time

Batch query using best B-term approximation for day 0 of call records

Figure 2. from Gilbert et al. 2003.

Batch query using best B-term approximation for all 7 days of call records

Figure 3. from Gilbert et al. 2003.

Estimating a point query

The answer to point query i is a(i).

• Direct point estimate – directly estimating a(i) using the sketch
• Direct wavelet estimate – use the sketch to estimate the wavelet coefficients whose support intersects i and reconstruct a(i) using these coefficients
• Another way is to compute a(i) using only the high wavelet coefficients (like the known B-term approximation) whose support intersects i
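A direct point estimate can be sketched as follows. Since the entries of each random vector are ±1 and independent, the product of an atomic sketch <a, r> with the i-th entry of r has expected value a(i). The plain average used here is an assumption chosen for brevity, and all function names are hypothetical.

```python
import random

def make_vectors(n, k, seed=0):
    """k explicit random vectors of +1/-1 entries (illustration only)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def sketch_of(signal, vectors):
    """Atomic sketches <a, r^j> of a fully known signal."""
    return [sum(a * r for a, r in zip(signal, rj)) for rj in vectors]

def point_estimate(sketch, vectors, i):
    """Estimate a(i) as the average of <a, r^j> * r^j_i over the atomic sketches."""
    return sum(s * rj[i] for s, rj in zip(sketch, vectors)) / len(sketch)

vectors = make_vectors(n=16, k=64, seed=3)
signal = [0] * 16
signal[7] = 40                       # a single spike is recovered exactly
sketch = sketch_of(signal, vectors)
print(point_estimate(sketch, vectors, 7))
```

For a general signal the other entries contribute zero-mean noise, so the estimate is only approximate and improves as more atomic sketches are averaged.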

Using sketches to estimate dot product

• Following parameters characterize how well the sketch does

  • ε – distortion parameter
  • δ – failure probability
  • η – failure threshold

• A sketch of a signal consists of independent atomic sketches, each with a different random vector

• If the cosine between vectors a and b is greater than η, we estimate the dot product to within a factor of (1±ε) with probability at least 1-δ

Sketches and random vectors

• If element <i, a(i)> arrives, add a(i)·r^j_i to the atomic sketch corresponding to random vector r^j

• In order to use the sketches we need to get the elements of r^j quickly.

• r^j is of size N, so it cannot be stored explicitly

Generating random vectors

• The paper shows that r^j_i can be generated by a pseudorandom number generator using a seed s_j of size log^O(1) N

• Generator G is based on second order Reed-Muller codes

• The generator G takes s_j and i and outputs r^j_i = G(s_j, i) quickly
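The idea of regenerating entries on demand from a small seed can be illustrated with a stand-in generator. Note that the code below does not implement the Reed-Muller construction from the paper: it swaps in a cryptographic hash from the Python standard library, which does not come with the same limited-independence guarantees. The names `G` and `update` are hypothetical.

```python
import hashlib

def G(seed, i):
    """Stand-in generator (NOT the paper's Reed-Muller one): maps (seed, i) to +1 or -1."""
    digest = hashlib.blake2b(f"{seed}:{i}".encode(), digest_size=1).digest()
    return 1 if digest[0] & 1 else -1

def update(atomic, seeds, i, value):
    """Add value * G(s_j, i) to atomic sketch j without storing any vector."""
    for j, s in enumerate(seeds):
        atomic[j] += value * G(s, i)

seeds = [11, 22, 33]                 # one small seed per atomic sketch
atomic = [0, 0, 0]
update(atomic, seeds, 5, 10)
print(atomic)
```

Only the seeds and the atomic sketches are kept in memory; any entry of a random vector can be recomputed whenever it is needed.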

Estimation of dot products using sketches

Lemma: Let X be an O(log N/δ)-wise median of O(1/ε^2)-wise means of independent copies of <a, r>·<b, r>

then we have with probability of 1-δ: |X - <a, b>| ≤ ε·||a||_2·||b||_2

Note: use b = a to estimate the energy of a using this lemma

Example: We want to estimate the dot product of vectors a and b with no more than 30% error with probability of 80%, assuming the cosine between these two vectors is greater than 0.25

That is, ε = 0.3, η = 0.25 and δ = 0.2, and for a signal of size N = 1024 we would need about 30 atomic sketches
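A median-of-means estimator for the dot product can be sketched as below. The split into 3 groups of 10 atomic sketches (30 in total) is an assumption made only to match the count in the example, and the explicit random vectors and function names are hypothetical.

```python
import random
import statistics

def atomic_sketches(signal, groups, per_group, seed=0):
    """groups x per_group atomic sketches <signal, r>; same seed gives same vectors."""
    rng = random.Random(seed)
    n = len(signal)
    out = []
    for _ in range(groups):
        row = []
        for _ in range(per_group):
            r = [rng.choice((-1, 1)) for _ in range(n)]
            row.append(sum(a * x for a, x in zip(signal, r)))
        out.append(row)
    return out

def estimate_dot(sk_a, sk_b):
    """Median over groups of the mean of products of matching atomic sketches."""
    means = [statistics.fmean([x * y for x, y in zip(ra, rb)])
             for ra, rb in zip(sk_a, sk_b)]
    return statistics.median(means)

n = 32
a = [0] * n
b = [0] * n
a[2], b[2] = 3, 4                    # both signals are spikes at the same index
sk_a = atomic_sketches(a, groups=3, per_group=10, seed=5)
sk_b = atomic_sketches(b, groups=3, per_group=10, seed=5)
print(estimate_dot(sk_a, sk_b))      # dot product estimate
print(estimate_dot(sk_a, sk_a))      # energy of a (use b = a)
```

Both signals must be sketched with the same random vectors, which is why the same seed is passed twice. Averaging reduces the variance and the median guards against occasional bad groups.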

Theorem

There is a streaming algorithm, A, such that, given a signal a[1,…, N] with energy ||a||_2^2, if there is a B-term representation with energy at least η·||a||_2^2, then, with probability at least (1-δ), A finds a representation of at most B terms with pseudoenergy at least (1-ε)·η·||a||_2^2. If there is no such B-term representation with energy η·||a||_2^2, A reports “no good representation”. In any case A uses

space and per-item time while processing the stream. This holds with both aggregate and cash-register models.

Example: take η = 0.3, δ = 0.2, ε = 0.3 and B = 10. Then, if there exists a 10-term representation of the signal that captures at least 30% of the signal’s energy, the algorithm will, with probability at least 80%, output a 10-term representation with pseudoenergy at least (1-ε)·η = 0.7 × 0.3 = 21% of the signal’s energy

Strengths and weaknesses

• Good example of how to work with cash-register models

• Shows several ways to estimate the signal using a sketch

• Time requirements seem higher than the paper claims

• On-line algorithms do not seem as promising as batch algorithms

References

1. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "One-pass wavelet decomposition of data streams," IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No. 3, May/June 2003.

2. A. C. Gilbert, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "Surfing wavelets on streams: one-pass summaries for approximate aggregate queries," Proceedings of the 27th VLDB Conference, Roma, Italy, 2001.

3. A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan and M. J. Strauss, "Fast, small-space algorithms for approximate histogram maintenance," STOC ’02, May 19-21, 2002, Montreal, Quebec, Canada.

Answering queries on-line

Comparison of sse/energy of top-B wavelets against direct estimates

Table 1. from Gilbert et al. 2003.

Table 2. from Gilbert et al. 2003.

Direct estimates for the top 10 heavy hitters

Figure 6. from Gilbert et al. 2003.

Direct estimates for the top 10 heavy hitters using the greedy algorithm

Figure 7. from Gilbert et al. 2003.

Adaptive greedy pursuit for heavy hitters

• Obtain a very accurate estimate for the first heavy hitter
• Get a new sketch by subtracting this value from the original sketch. This can be done because sketches are linear
• The new sketch is a good estimate of the residual distribution, in which the second heavy hitter is the peak value
• Use the new sketch to estimate the second heavy hitter
• Repeat the procedure for more heavy hitters
• Each estimate introduces an error, and after many iterations the errors tend to overwhelm the benefits
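The greedy pursuit loop can be sketched as follows. This is an illustration under assumptions rather than the paper's procedure: point estimates are plain averages over explicitly stored random vectors, the domain is small enough to scan every index, and the function names are hypothetical.

```python
import random

def make_vectors(n, k, seed=0):
    """k explicit random vectors of +1/-1 entries (illustration only)."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(k)]

def greedy_heavy_hitters(sketch, vectors, count):
    """Repeatedly estimate the peak, then subtract it from the sketch."""
    residual = list(sketch)
    n, k = len(vectors[0]), len(vectors)
    hitters = []
    for _ in range(count):
        # Point estimate of every index from the current residual sketch.
        estimates = [sum(s * rj[i] for s, rj in zip(residual, vectors)) / k
                     for i in range(n)]
        i = max(range(n), key=lambda idx: abs(estimates[idx]))
        hitters.append((i, estimates[i]))
        # Linearity: the sketch of (signal - estimate at i) is sketch - estimate * r^j_i.
        residual = [s - estimates[i] * rj[i] for s, rj in zip(residual, vectors)]
    return hitters, residual

vectors = make_vectors(n=16, k=64, seed=7)
signal = [0] * 16
signal[3] = 50
sketch = [sum(a * r for a, r in zip(signal, rj)) for rj in vectors]
hitters, residual = greedy_heavy_hitters(sketch, vectors, count=1)
print(hitters)
```

With a noisy estimate, the subtraction leaves part of the heavy hitter in the residual, and that leftover error accumulates across iterations.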
