Upload
rhett
View
19
Download
1
Embed Size (px)
DESCRIPTION
Data-Streams and Histograms. Sudipto Guha, Nick Koudas & Kyuseok Shim. Background. Histogram Captures distribution statistics in an efficient manner Applications Query optimization Approximate query answering Data mining (time series in particular) Piecewise transmission of data - PowerPoint PPT Presentation
Citation preview
Data-Streams and Histograms
Sudipto Guha, Nick Koudas & Kyuseok Shim
Background
• Histogram– Captures distribution statistics in an efficient
manner– Applications
• Query optimization• Approximate query answering• Data mining (time series in particular)• Piecewise transmission of data
• EquiWidth, EquiDepth, MHIST, MaxDiff, V-OPT
Background
• Data Stream– An ordered sequence of points that can be
read only once or a small number of times– Applications
• Mission critical network components• Dynamic traffic configuration, fault identification,
troubleshooting
– Performance of algorithm measured by number of passes algorithm must make over the stream
Motivation
• Since the end use of a histogram is to approximate a data distribution, why not use a near-optimal approximation of the best histogram if it means linear time computation?
Motivation
• Approximate V-OPT histograms by improving the dynamic programming solution from quadratic to linear time
• Revised algorithm uses little space, hence suitable for data stream model
• Assumes cost of interval is monotonic under inclusion
Problem Statement
• Given:– non-negative integers v1, ..., vn
– k intervals or buckets to partition the index 1..n
• Constraint:– Minimize k VARk where is the variance of values in
the kth bucket
• Dynamic Programming solution:– OPT[k, n] = min {OPT[k-1, x] + VAR[(x+1)..n]}
– Runs in O(n2k) time with O(n) space
x<n
Intuition of Improvement
• For a x b, – VAR[a..n] VAR[x..n] VAR [b..n] (1) – OPT[a..n] OPT[x..n] OPT[b..n] (2)
• Use this monotonicity property to reduce the search space by settling for an approximation
• Instead of storing the whole OPT function, approximate it by a histogram!
Intuition of Improvement
• For all 1 p k, maintain intervals (a1,b1),…, (al, bl) • Value of bi (1+)ai
• The number of intervals l depends on p• The value for each interval substitutes for each value in
the interval reducing space and time complexity
Results
• Theorem: A (1+) approximation for V-OPT runs in O((k2/)log n) space and time O((nk2/)log n) in the data stream model
Advantages and Disadvantages
• Accuracy/runtime tradeoff can be controlled by the parameter
• For data-stream model, alternatives abound:– Random sampling (simple, assumption of
distribution)– Other histogram techniques (faster, less optimal)– Wavelet (flexibility)– Sliding Windows (later paper)
Improvements
Conclusion
• The authors provided an algorithm for approximating a distribution that runs reasonably fast and with small space requirements
• Proposed solution can be applied to data-stream model because values are not referred to unless they are stored