Algorithms for massive data sets
Lecture 3 (March 2, 2003)
Synopses, Samples & Sketches

Synopses
• Synopsis (from Webster): a condensed statement or outline (as of a narrative or treatise)
• Synopsis (here): a succinct data structure that lets us answer queries efficiently

Typical Queries
Statistics (count, median, variance, aggregates)
Patterns (clustering, associations, classification)
Nearest Neighbors (L1, L2, Hamming norm)
Property Testing (Skewness, Independence)
etc.

Why use Synopses?
• Can’t store the whole data : E.g. Web Data
• Resides in main memory : fast query response. E.g. OLAP Data
• Remote transmission at minimal cost
• Minimal effect on storage cost

Classification of Synopses
• Are they useful for more than one kind of query?
– General purpose: E.g. samples
– Specific purpose: E.g. Distinct Values Estimator
• What granularity?
– One per database: E.g. Sample of the whole relation
– One per distinct value of attribute: E.g. Profiles for customers in a call database

Some Numbers
• AQUA Project (Bell Labs):
– DB Size : 420 MB
– Synopsis Size : 420 KB (0.1%) to 12.5 MB (3%)
– Accuracy : Within 10% for 0.1% of DB size
– Running Time : Less than 0.3% of the time for full query
• Quantile Summary (Khanna et al):
– DB Size : 10^9 tuples
– Synopsis Size : 1249 tuples
– Accuracy : 1%

Synopses need not be fancy!
• Maintaining Mean (μ) of numbers
• What about variance?
$$\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$$
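A minimal sketch of why mean and variance need no fancy synopsis: expanding the square in the variance gives (Σ x_i²)/n − μ², so three running totals are enough. The class name `MomentSynopsis` and its methods are hypothetical, chosen only for illustration; the variance computed is the population variance.

```python
from dataclasses import dataclass


@dataclass
class MomentSynopsis:
    """Hypothetical three-number synopsis: count, sum, and sum of squares."""
    n: int = 0
    s1: float = 0.0   # running sum of x
    s2: float = 0.0   # running sum of x*x

    def add(self, x: float) -> None:
        # O(1) update per inserted number.
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def merge(self, other: "MomentSynopsis") -> "MomentSynopsis":
        # Synopses built on different sites combine by simple addition.
        return MomentSynopsis(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    def mean(self) -> float:
        return self.s1 / self.n

    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2.
        mu = self.mean()
        return self.s2 / self.n - mu * mu


left, right = MomentSynopsis(), MomentSynopsis()
for x in (1.0, 2.0):
    left.add(x)
for x in (3.0, 4.0):
    right.add(x)
both = left.merge(right)
print(both.mean(), both.variance())
```

The sum-of-squares form can lose precision when the mean is large relative to the spread, so a production version might prefer a numerically stabler update.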

Objectives
• Small Size
• Fast Update and Query
• Provable error guarantees (Need not give exact answers)
• Composable : Useful for distributed scenario

A coarse classification
• Sampling based : This lecture
• Sketches
• Histograms

Sampling
• Where and how are samples used
• How are samples maintained
– Single relation
• Types of samples:
– Oblivious
– Value based
• Limitations of oblivious samples

Samples in DSS
• Exact answers NOT always required
– DSS applications usually exploratory: early feedback to help identify “interesting” regions
– Aggregate queries: precision to “last decimal” not needed
• e.g., “What percentage of the US sales are in NJ?” (display as bar graph)
– Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option
[Figure: an SQL query sent to a Decision Support System (DSS) returns an exact answer, but with long response times]

Sampling: Basics
• Idea: A small random sample S of the data often well-represents all the data
– For a fast approx answer, apply the query to S & “scale” the result
– E.g., R.a is {0,1}, S is a 20% sample
select count(*) from R where R.a = 0
select 5 * count(*) from S where S.a = 0
[Figure: the values of column R.a, with the tuples that are in S shown in red]
Est. count = 5*2 = 10, Exact count = 10
• Leverage extensive literature on confidence intervals for sampling
– Actual answer is within the interval [a,b] with a given probability
– E.g., 54,000 ± 600 with prob 90%
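The "apply the query to S and scale" idea can be sketched as follows. This is an illustration under stated assumptions, not Aqua's implementation: the function name `approx_count` is hypothetical, the sample is a coin-flip (Bernoulli) sample, and the interval uses a normal approximation with z = 1.645 for roughly 90% confidence.

```python
import math
import random


def approx_count(rows, predicate, rate, rng, z=1.645):
    """Estimate count(*) under `predicate` from a coin-flip sample taken at `rate`."""
    sample = [r for r in rows if rng.random() < rate]
    hits = sum(1 for r in sample if predicate(r))
    estimate = hits / rate                      # the "scale" step, e.g. 5 * count for 20%
    # Each qualifying row is kept independently with probability `rate`, so
    # Var(estimate) = M * (1 - rate) / rate for true count M; plug in the estimate for M.
    half_width = z * math.sqrt(estimate * (1 - rate) / rate)
    return estimate, half_width


rows = [0] * 10000 + [1] * 40000               # R.a takes values in {0, 1}
est, hw = approx_count(rows, lambda a: a == 0, 0.2, random.Random(1))
print(f"estimate {est:.0f} +/- {hw:.0f}")
```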

The Aqua Architecture
[Figure: a client (browser, Excel) sends SQL query Q over the network to the data warehouse (e.g., Oracle) and receives the result as HTML/XML; the warehouse also receives data updates]
Picture without Aqua:
• User poses a query Q
• Data Warehouse executes Q and returns result
• Warehouse is periodically updated with new data

The Aqua Architecture
Picture with Aqua:
• Aqua is middleware, between the user and the warehouse
• Aqua Synopses are stored in the warehouse
• Aqua intercepts the user query and rewrites it to be a query Q’ on the synopses. Data warehouse returns approximate answer
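[Figure: the same picture with Aqua added. Components shown: client (browser, Excel), network, Rewriter, data warehouse (e.g., Oracle) holding the AQUA synopses, AQUA Tracker, warehouse data updates. The SQL query Q is rewritten to Q’, and the result comes back with error bounds as HTML/XML]
Q: select count(*) from R where R.a = 0
Q’: select 5 * count(*) from S where S.a = 0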

Schema & Queries
• Most queries involve foreign key joins between tables followed by (grouping and) aggregation.
[Figure: schema graph over the tables L, O, PS, P, S, C, N and R, with foreign-key edges labeled order, part, supp, cust and nation]

Example Query
[Figure: example query]

What samples are right?
• Naïve approach : maintain samples of each relation in the schema
• Problem : sample of the join is not a join of the samples, even for foreign key joins
• Example :
[Figure: example with two relations A and B and their join, over the values a1, a2 and b1]

Foreign Key Joins
• Foreign Key Join : Effectively a central “fact” table is appended with columns from the dimension tables.
• Sampling from the join is the same as sampling from the “fact” table itself.
• Synopsis : For every table that may be a “fact” table for some join, sample from the table and join the sample with the dimension tables.
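A small sketch of the join-synopsis idea, assuming tables are lists of dictionaries and each dimension table is indexed by its key. The function name `join_synopsis` and the toy column names are hypothetical. Because every fact row matches exactly one dimension row under a foreign key, sampling the fact table first and joining afterwards yields a sample of the full join.

```python
import random


def join_synopsis(fact_rows, fk, dimension, rate, rng):
    """Coin-flip sample of the fact table, widened with the matching dimension columns."""
    sample = [row for row in fact_rows if rng.random() < rate]
    # A foreign key guarantees exactly one partner per sampled row,
    # so no join result is lost or duplicated by sampling first.
    return [{**row, **dimension[row[fk]]} for row in sample]


orders = [
    {"order_id": 1, "cust": "c1"},
    {"order_id": 2, "cust": "c2"},
    {"order_id": 3, "cust": "c1"},
]
customers = {"c1": {"nation": "US"}, "c2": {"nation": "FR"}}
synopsis = join_synopsis(orders, "cust", customers, 0.5, random.Random(0))
print(synopsis)
```

By contrast, sampling `orders` and `customers` independently at rate p and then joining keeps each join row only with probability p², and the kept rows are no longer independent of one another.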

Synopsis
• For every node in the DAG:
– Maintain a sample corresponding to that table.
– Join the sample with tables corresponding to all its descendants in the graph.
– Maximal join for which the table is a “fact” table.
[Figure: the schema DAG over L, O, PS, P, S, C, N and R, with edges labeled order, part, supp, cust and nation]

Bells and whistles!
• How to allocate memory across samples of different “fact” tables
• Group-By Queries:
– Are uniform samples best or can we do better?
• Aggregate attribute may be skewed
– Are uniform samples best or can we do better?
• We may revisit these issues later
– Have not seen some equations for a while!

How to sample?
• Consider a single table with only insertions
• Want to maintain a sample of this table
• Three semantics of sampling:
– Coin flip
– Fixed size without replacement
– Fixed size with replacement
• First one (coin flip) easy to maintain under insertions
• Exercise : Can we switch between different samples? If so, how?

Reservoir Sampling
• Given : A stream of elements (tuples), viewed as insertions into a relation
• Aim : At every instant maintain a uniform random sample of size n without replacement
• Method : (Accept the first n elements)
– Let t be the number of elements seen so far
– On seeing the (t+1)st element, include it with probability n/(t+1)
– If included, evict one of the elements currently in the sample, chosen uniformly at random
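The method above can be sketched in a few lines. The function name `reservoir_sample` and the explicit `rng` argument are choices made for this illustration; this is the naïve version that draws one random number per element.

```python
import random


def reservoir_sample(stream, n, rng):
    """Uniform sample of size n, without replacement, from a single pass over `stream`."""
    reservoir = []
    for t, item in enumerate(stream):      # t = number of elements seen before `item`
        if t < n:
            reservoir.append(item)          # accept the first n elements
        elif rng.random() < n / (t + 1):    # include the (t+1)st with probability n/(t+1)
            reservoir[rng.randrange(n)] = item   # evict a uniformly chosen resident
    return reservoir


print(reservoir_sample(range(1000), 5, random.Random(42)))
```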

Proof of Correctness
• Easy to see that at every instant the size of the sample is exactly n
• Claim : After seeing t ≥ n elements, every element seen so far belongs to the sample with probability n/t
• Exercise : Using induction, prove the last claim
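One possible sketch of the induction step, for readers who want to check their answer. The base case is t = n, where every element is in the sample with probability 1 = n/n. For the step, the new element enters with probability n/(t+1) by construction, and an older element x must already be in the sample and then survive the possible eviction:

```latex
\Pr[x \in \text{sample after } t+1]
  = \underbrace{\frac{n}{t}}_{\text{in sample after } t}
    \cdot
    \underbrace{\left(1 - \frac{n}{t+1}\cdot\frac{1}{n}\right)}_{\text{not evicted}}
  = \frac{n}{t}\cdot\frac{t}{t+1}
  = \frac{n}{t+1}
```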

Efficiency
• Let N be the number of records seen
• Each record (beyond the first n records) is added to the reservoir with probability n/t, where t is its position in the stream
• The average number of records added is
$$n + \sum_{t=n+1}^{N} \frac{n}{t} = n\,(1 + H_N - H_n) \approx n\,(1 + \ln(N/n))$$
where H_k denotes the k-th harmonic number
• Consider any reservoir sample.
• The t-th element has to be a part of the sample with probability no less than n/t.
• Thus, the quantity above is also a lower bound on the additions made to the reservoir (time spent)

Efficiency
• The naïve algorithm makes N calls to RANDOM() and takes time O(N)
• Consider the following random variable: Let S(n,t) denote the number of elements skipped, where n is the size of the reservoir and t is the number of elements processed so far.
• Aim: Study this random variable and sample from its distribution using O(1) operations.
• Idea : Generate S(n,t) and skip that many records, doing nothing

Observations
• S(n,t) is non-negative
• Let F(s) denote Prob {S(n,t) ≤ s}, for s ≥ 0
$$F(s) = 1 - \frac{t^{\underline{n}}}{(t+s+1)^{\underline{n}}} = 1 - \frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}}$$
where $a^{\underline{b}}$ denotes the falling power a(a-1)(a-2)…(a-b+1) and $a^{\overline{b}}$ denotes the rising power a(a+1)(a+2)…(a+b-1)

Observations
• Subtracting the two terms corresponding to s and s-1, we get the probability distribution function f(s) as
$$f(s) = \frac{n}{t+s+1}\cdot\frac{t^{\underline{n}}}{(t+s)^{\underline{n}}} = \frac{n}{t-n}\cdot\frac{(t-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}}$$
• We can compute the expected value, which is (t-n+1)/(n-1)
• Here is a simple way to sample from the distribution corresponding to S(n,t). We already calculated its CDF, F(s). We generate a random number U between 0 and 1 and find the smallest s such that U ≤ F(s), i.e.
$$\frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}} \le 1 - U$$
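A sketch of the linear-scan way of drawing the skip S(n,t) by inverting the CDF. It relies on Prob{S(n,t) > s} being the product of (t+j-n)/(t+j) for j = 1…s+1, which is the probability that the next s+1 records are all rejected. The function names are hypothetical, and this is the simple O(N) variant, not the optimized algorithm from Vitter's paper.

```python
import random


def skip_count(n, t, u):
    """Smallest s >= 0 with u <= F(s), found by a linear scan."""
    s = 0
    tail = (t + 1 - n) / (t + 1)              # Prob{S > 0}
    while tail > 1.0 - u:                      # equivalent to F(s) < u
        s += 1
        tail *= (t + s + 1 - n) / (t + s + 1)  # extend the product to Prob{S > s}
    return s


def reservoir_sample_skipping(items, n, rng):
    """Reservoir sampling that draws one uniform variate per skip instead of per record."""
    reservoir = list(items[:n])
    t = n                                      # number of records processed so far
    while True:
        t += skip_count(n, t, rng.random())    # pass over S(n, t) records untouched
        if t >= len(items):
            return reservoir
        reservoir[rng.randrange(n)] = items[t] # the record after the skip is inserted
        t += 1


print(reservoir_sample_skipping(list(range(1000)), 5, random.Random(3)))
```

For n = 1 and t = 1 the CDF is F(s) = 1 - 1/(s+2), so U = 0.6 maps to s = 1, which gives a quick hand check of `skip_count`.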

Observations
• Have reduced the number of calls to RANDOM() to optimal : One per insertion into the reservoir
• There are two ways to find the smallest s that satisfies the previous inequality
– Linear scan : Gives O(N) time algorithm
– Binary search/Newton’s interpolation method to get a running time of O(n^2 (1 + log (N/n) log log (N/n)))
• Note: This is still not optimal. Read the paper (Vitter, 1985) for an optimal (up to constants) algorithm.

What have we seen so far?
• How to sample efficiently (Reservoir Sampling)
– A method to sample without replacement by making a single scan
– Optimized the calls to RANDOM()
– Overall processing time can also be optimized
• How samples are used in DSS and what are the different samples that should be kept in order to answer queries
• What next?
– Queries in DSS are not simple counts over the entire relation
– Typically they have grouping followed by aggregation of an attribute that may have high variance

Error using sampling
R = {y1, y2, …, yN}, sample size n
Variance in data values:
$$S = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N-1}}$$
$$\text{Error} = \text{Std Dev} = \sqrt{E[(\mu - \mu^*)^2]} = \frac{S}{\sqrt{n}}\,\sqrt{1 - \frac{n}{N}}$$
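A short sketch of the error formula, assuming a uniform sample of size n drawn without replacement and the mean as the estimated quantity. The function name `standard_error_of_mean` is hypothetical.

```python
import math


def standard_error_of_mean(values, n):
    """Standard deviation of the sample-mean estimate for a size-n sample of `values`."""
    N = len(values)
    mean = sum(values) / N
    # S uses the N-1 denominator, as in the formula for the data variance.
    S = math.sqrt(sum((y - mean) ** 2 for y in values) / (N - 1))
    # The factor sqrt(1 - n/N) is the finite-population correction.
    return S / math.sqrt(n) * math.sqrt(1 - n / N)


data = [1, 2, 3, 4, 5]
for n in (1, 3, 5):
    print(n, standard_error_of_mean(data, n))
```

With n = N the error drops to zero, since the sample is then the whole relation.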

Group-By Queries
• SELECT avg (salary) FROM census GROUP BY state
• Some of the states have very few tuples as compared to others. E.g. CA has 70 times as many people as WY
• If we sample uniformly from the entire relation then there will be very few tuples corresponding to WY and hence a large error in its avg(salary) estimate

Error Metric (Group-By)
• Let c*_i be the true answer (aggregate) corresponding to group i
• Let c_i be the estimate obtained from sample
• The error e_i is the relative error, given by |c*_i – c_i|/|c*_i|
• The cumulative error is the L1, L2, or L∞ norm of the error vector {e_i}

Optimal sampling strategy
• For every group the error is inversely proportional to √n where n is the number of tuples in the sample from this group
• In order to reduce the maximum error among all groups we should have equal number of samples from each group (Senate)
• But this strategy is not optimal if the query does not have a group by and is over the entire relation. In that case a uniform sample of the entire relation is optimal (House)

Basic-Congress Sampling
• Unfortunately, unlike the U.S. Congress, we don’t have room to seat both Senators and House Representatives!
• Hence we do the following:
– Let X be the total seats allotted to Congress
– For a state CA, let CA_S (resp. CA_H) be the seats allotted to it assuming the Congress was made up only of the Senate (resp. House)
– The final seat allocation to each state CA is proportional to max(CA_S, CA_H), subject to total seats being X
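The allocation rule can be sketched as below. This is a simplified illustration with assumptions of its own: the function name `basic_congress` is hypothetical, seats are left fractional (no rounding), and allocations are not capped at the group size.

```python
def basic_congress(group_sizes, total_seats):
    """Sample-size allocation proportional to max(Senate share, House share) per group."""
    total_rows = sum(group_sizes.values())
    num_groups = len(group_sizes)
    # House: seats proportional to group size (a uniform sample of the relation).
    house = {g: total_seats * size / total_rows for g, size in group_sizes.items()}
    # Senate: the same number of seats for every group.
    senate = {g: total_seats / num_groups for g in group_sizes}
    want = {g: max(house[g], senate[g]) for g in group_sizes}
    scale = total_seats / sum(want.values())   # rescale so the total is exactly X
    return {g: scale * want[g] for g in group_sizes}


alloc = basic_congress({"CA": 900, "WY": 100}, total_seats=100)
print(alloc)
```

In this toy input the small group ends up with far more than the 10 seats a uniform sample would give it, while the large group still keeps the majority of the sample.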

Comments
• No error guarantees
– Only a best effort solution
• Cannot use Reservoir sampling anymore
– The full paper talks about one pass algorithms, but admits that they don’t work in all cases
• What if the variance in values (S) is large?
– Outlier indexing

Error using sampling
R = {y1, y2, …, yN}, sample size n
Variance in data values:
$$S = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N-1}}$$
$$\text{Error} = \text{Std Dev} = \frac{S}{\sqrt{n}}\,\sqrt{1 - \frac{n}{N}}$$

Presence of Data Skew
• Outliers (deviant tuples): 9950 tuples with value = 1, 50 tuples with value = 1000
• Exact answer (sum) = 59,950
• Uniform sample of size 100:
– Case 1 (no outlier in the sample): sum estimate = 10,000
– Case 2 (at least one outlier in the sample): sum estimate ≥ 109,900
• Error > 83% in either case
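The numbers on this slide can be re-derived with a few lines of arithmetic. The variable names are illustrative, and the last line (the chance that a uniform sample of 100 misses every outlier) is an extra computation not stated on the slide.

```python
import math

population = [1] * 9950 + [1000] * 50
sample_size = 100
scale = len(population) / sample_size          # each sampled tuple stands for 100 tuples

exact = sum(population)
case1 = scale * (sample_size * 1)              # sample contains no outlier
case2 = scale * ((sample_size - 1) * 1 + 1000) # sample contains exactly one outlier

err1 = abs(case1 - exact) / exact
err2 = abs(case2 - exact) / exact

# Probability that none of the 50 outliers lands in the sample (hypergeometric).
p_no_outlier = math.comb(9950, sample_size) / math.comb(10000, sample_size)

print(exact, case1, case2, round(err1, 3), round(err2, 3), round(p_no_outlier, 3))
```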
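
Outlier Indexing Scheme
[Figure: Preprocessing: the relation R is split into outliers R_O and non-outliers R_NO, and a sample of R_NO is kept. Query: Q is run on R_O to give A1, and Q is run on the sample of R_NO and extrapolated to give A2; the answer is A = A1 + A2]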

Selection of Outlier Index
Objective: Remove at most n outliers such that the non-outliers have the least variance.
Theorem: For a sorted (multi)set of values, the optimal outlier set consists of values from the two ends of the sorted order:
…, v_k, v_{k+1}, v_{k+2}, …, v_{m-1}, v_m, v_{m+1}, …
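A brute-force sketch of outlier selection that uses the structure from the theorem: only sets made of some of the smallest and some of the largest values are tried. The function name `select_outliers` is hypothetical, and recomputing the variance for every split is a simplification for clarity rather than the efficient one-pass method.

```python
import statistics


def select_outliers(values, n):
    """Return (outliers, variance_of_rest) over all end-of-order sets of at most n values."""
    v = sorted(values)
    best = None
    for lo in range(n + 1):                    # number of smallest values removed
        for hi in range(n - lo + 1):           # number of largest values removed
            rest = v[lo:len(v) - hi]
            if not rest:
                continue
            var = statistics.pvariance(rest)
            if best is None or var < best[0]:
                best = (var, v[:lo] + v[len(v) - hi:])
    return best[1], best[0]


outliers, var = select_outliers([1] * 10 + [1000, 1000], n=2)
print(outliers, var)
```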

Comments
• Cannot do reservoir sampling
• One pass algorithm for selection of outliers

Types of Samples
• Oblivious samples: We do not look at the value of attribute while sampling
• Value based sampling : The distinct sampling of Gibbons et al
• Limitations of oblivious sampling:
– Please refer to: Sampling algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar, and D. Sivakumar. STOC 2001.

Summary
• Obvious type of synopsis: samples
• Use of samples in DB, in particular DSS.
– Idea of maintaining the samples of ‘fact’ tables
• How to sample without replacement with a single pass, not knowing the size of the relation a priori
– Reservoir sampling and tricks to make it efficient
• Shortcomings of sampling in DBs
– Group-By queries : Congressional samples
– High Skew in Data : Outlier indexing, stratified sampling

References
• Join Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. SIGMOD 1999.
• Congressional Samples for Approximate Answering of Group-By Queries, S. Acharya, P. Gibbons, and V. Poosala. SIGMOD 2000.
• Overcoming Limitations of Sampling for Aggregation Queries, S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. ICDE 2001.
• A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, S. Chaudhuri, G. Das and V. Narasayya. SIGMOD 2001.
• Random Sampling with a Reservoir, J. S. Vitter. Trans. on Mathematical Software 11(1):37-57 (1985).

Sampling over Sliding windows
• Samples of streaming data
• Need to account for staleness of data
• A data element is fresh if it belongs to the last N elements
• Problem statement : Given a stream of elements maintain a uniform random sample of size

A Simple, Unsatisfying Approach
• Choose a random subset X = {x1, …, xk}, X ⊂ {0,1,…,n-1}
• The sample always consists of the non-expired elements whose indexes are equal to x1, …,xk (modulo n)
• Only uses O(k) memory
• Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic
• Unsuitable for many real applications, particularly those with periodicity in the data

Reservoir Sampling: Why It Doesn’t Work
• Suppose an element in the reservoir expires
• Need to replace it with a randomly-chosen element from the current window
• However, in the data stream model we have no access to past data
• Could store the entire window but this would require O(n) memory

Chain-Sample
• Include each new element in the sample with probability 1/min(i,n)
• As each element is added to the sample, choose the index of the element that will replace it when it expires
• When the ith element expires, the window will be (i+1…i+n), so choose the index from this range
• Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements
• When an element is chosen to be discarded from the sample, discard its “chain” as well
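The steps above can be sketched for a single sample (k = 1) over a window of the last n elements. The class name `ChainSample` and its methods are hypothetical, elements are indexed from 1, and a size-k sample would run k independent copies.

```python
import random
from collections import deque


class ChainSample:
    """One uniform sample over a sliding window of the last n stream elements."""

    def __init__(self, n, rng):
        self.n = n
        self.rng = rng
        self.i = 0                 # index of the most recent element
        self.chain = deque()       # (index, value) pairs; chain[0] is the current sample
        self.next_index = None     # index awaited to extend the chain

    def add(self, value):
        self.i += 1
        i, n = self.i, self.n
        # The head expires once it falls out of the window (i-n+1 .. i).
        if self.chain and self.chain[0][0] <= i - n:
            self.chain.popleft()
        if self.rng.random() < 1.0 / min(i, n):
            # New element becomes the sample; the old sample and its chain are discarded.
            self.chain.clear()
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)
        elif i == self.next_index:
            # The awaited replacement has arrived: store it and pick its own successor.
            self.chain.append((i, value))
            self.next_index = self.rng.randint(i + 1, i + n)

    def sample(self):
        return self.chain[0][1]


cs = ChainSample(n=10, rng=random.Random(7))
picks = []
for idx in range(1, 501):
    cs.add(idx)                    # the value equals its index, for easy checking
    picks.append(cs.sample())
print(picks[-5:])
```

The chain never runs dry: a replacement index is drawn from the n positions after the element, so the replacement has arrived by the time the element expires.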

Example
[Figure: the stream below shown at four successive steps, with the sampled element and its chain of replacements marked]
3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3

Memory Usage of Chain-Sample
• Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x
$$T(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 + \frac{1}{n}\sum_{j<x} T(j) & \text{for } x \ge 1 \end{cases}$$
• The expected length of each chain is less than T(n) ≤ e ≈ 2.718
• Expected memory usage is O(k)

Memory Usage of Chain-Sample
• Chain consists of “hops” with lengths 1…n
• Chain of length j can be represented by partition of n into j ordered integer parts
– j-1 hops with sum less than n plus a remainder
• Each such partition has probability n^{-j}
• Number of such partitions is $\binom{n}{j}$ < (ne/j)^j
• Probability of any such partition is small [O(n^{-c})] when j = O(k log n)
• Uses O(k log n) memory whp

Comparison of Algorithms
• Chain-sample is preferable to oversampling:
– Better expected memory usage: O(k) vs. O(k log n)
– Same high-probability memory bound of O(k log n)
– No chance of failure due to sample size shrinking below k

Algorithm      Expected      High-Probability
Periodic       O(k)          O(k)
Oversample     O(k log n)    O(k log n)
Chain-Sample   O(k)          O(k log n)