1 Algorithms for massive data sets Lecture 3 (March 2, 2003) Synopses, Samples & Sketches


Synopses

• Synopsis (from Webster) : a condensed statement or outline (as of a narrative or treatise)

• Synopsis (here) : A succinct data structure that lets us answer queries efficiently

Typical Queries

Statistics (count, median, variance, aggregates)

Patterns (clustering, associations, classification)

Nearest Neighbors (L1, L2, Hamming norm)

Property Testing (Skewness, Independence)

etc.

Why use Synopses?

• Can’t store the whole data : E.g. Web Data

• Resides in main memory : fast query response. E.g. OLAP Data

• Remote transmission at minimal cost

• Minimal effect on storage cost

Classification of Synopses

• Are they useful for more than one kind of query?

– General purpose: E.g. samples

– Specific purpose: E.g. Distinct Values Estimator

• What granularity?

– One per database: E.g. Sample of the whole relation

– One per distinct value of an attribute: E.g. Profiles for customers in a call database

Some Numbers

• AQUA Project (Bell Labs):

– DB Size : 420 MB

– Synopsis Size : 420 KB (0.1%) to 12.5 MB (3%)

– Accuracy : Within 10% for 0.1% of DB size

– Running Time : Less than 0.3% of the time for full query

• Quantile Summary (Khanna et al):

– DB Size : 10^9 tuples

– Synopsis Size : 1249 tuples

– Accuracy : 1%

Synopses need not be fancy!

• Maintaining the mean (μ) of numbers

• What about the variance?

$$\sigma^2 = \frac{1}{n}\sum_{i} (x_i - \mu)^2$$
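One way to see why such a synopsis is not fancy: the mean and variance can both be recovered from three running totals. The sketch below is an illustration, not taken from the slides; the class name `MomentSynopsis` and its fields are hypothetical, and it assumes the population variance (division by n).

```python
from dataclasses import dataclass


@dataclass
class MomentSynopsis:
    """Hypothetical synopsis: count, sum, and sum of squares."""
    n: int = 0
    s1: float = 0.0  # running sum of x
    s2: float = 0.0  # running sum of x^2

    def add(self, x: float) -> None:
        # Constant-time update per inserted number.
        self.n += 1
        self.s1 += x
        self.s2 += x * x

    def merge(self, other: "MomentSynopsis") -> "MomentSynopsis":
        # Composable: synopses built on different sites simply add up.
        return MomentSynopsis(self.n + other.n,
                              self.s1 + other.s1,
                              self.s2 + other.s2)

    def mean(self) -> float:
        return self.s1 / self.n

    def variance(self) -> float:
        # Population variance: E[x^2] - (E[x])^2.
        m = self.mean()
        return self.s2 / self.n - m * m
```

The sum-of-squares form can lose precision when the variance is tiny relative to the mean; it is used here only because it makes the merge step obvious.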

Objectives

• Small Size

• Fast Update and Query

• Provable error guarantees (Need not give exact answers)

• Composable : Useful for distributed scenario

A coarse classification

• Sampling based : This lecture

• Sketches

• Histograms

Sampling

• Where and how are samples used

• How are samples maintained

– Single relation

• Types of samples :

– Oblivious

– Value based

• Limitations of oblivious samples

Samples in DSS

• Exact answers NOT always required

– DSS applications usually exploratory: early feedback to help identify “interesting” regions

– Aggregate queries: precision to “last decimal” not needed

• e.g., “What percentage of the US sales are in NJ?” (display as bar graph)

– Base data can be remote or unavailable: approximate processing using locally-cached data synopses is the only option

[Figure: SQL Query → Decision Support Systems (DSS) → Exact Answer; Long Response Times!]

Sampling: Basics

• Idea: A small random sample S of the data often well-represents all the data

– For a fast approx answer, apply the query to S & “scale” the result

– E.g., R.a is {0,1}, S is a 20% sample

select count(*) from R where R.a = 0

select 5 * count(*) from S where S.a = 0

R.a = 1 1 0 1 1 1 1 1 0 0 0 0 1 1 1 1 1 0 1 1 1 0 1 0 1 1 0 1 1 0 (tuples in S shown in red)

Est. count = 5*2 = 10, Exact count = 10

• Leverage extensive literature on confidence intervals for sampling

Actual answer is within the interval [a,b] with a given probability

E.g., 54,000 ± 600 with prob 90%
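The scale-up and the interval can be sketched as follows. This is an illustration under stated assumptions: the function name `estimate_count` is hypothetical, the interval uses a normal approximation to the binomial, and `z = 1.645` corresponds to roughly 90% two-sided confidence. The finite-population correction is ignored.

```python
import math


def estimate_count(sample, predicate, sampling_rate, z=1.645):
    """Estimate count(*) over the full relation from a uniform sample.

    Returns (estimate, half_width) so the answer reads estimate ± half_width.
    """
    n = len(sample)
    hits = sum(1 for row in sample if predicate(row))
    scale = 1.0 / sampling_rate          # e.g. 5 for a 20% sample
    estimate = scale * hits
    p_hat = hits / n
    # Std dev of the hit count is about sqrt(n * p * (1 - p)); scale it up.
    half_width = z * scale * math.sqrt(n * p_hat * (1.0 - p_hat))
    return estimate, half_width
```

With six sampled values of R.a, two of them zero, and a 20% sampling rate, the estimate is 5 * 2 = 10, matching the slide. A sample this small gives a very wide interval, which is why the interval is reported alongside the estimate.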

The Aqua Architecture

[Figure: a client (Browser, Excel) sends SQL Query Q over the Network to the Data Warehouse (e.g., Oracle), which returns the Result as HTML/XML; the warehouse receives Warehouse Data Updates]

Picture without Aqua:

• User poses a query Q

• Data Warehouse executes Q and returns result

• Warehouse is periodically updated with new data

The Aqua Architecture

Picture with Aqua:

• Aqua is middleware, between the user and the warehouse

• Aqua Synopses are stored in the warehouse

• Aqua intercepts the user query and rewrites it to be a query Q’ on the synopses. Data warehouse returns approximate answer

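[Figure: a client (Browser, Excel) sends SQL Query Q over the Network; the Rewriter turns Q into Q’, which runs in the Data Warehouse (e.g., Oracle), and the Result (w/ error bounds) is returned as HTML/XML. Other components shown: AQUA Synopses, AQUA Tracker, Warehouse Data Updates]

Q: select count(*) from R where R.a = 0

Q’: select 5 * count(*) from S where S.a = 0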

Schema & Queries

• Most queries involve foreign key joins between tables followed by (grouping and) aggregation.

[Figure: schema DAG over tables L, O, PS, P, S, C, N, R, with foreign-key edges labeled order; part, supp; cust; nation]

Example Query

What samples are right?

• Naïve approach : maintain samples of each relation in the schema

• Problem : sample of the join is not a join of the samples, even for foreign key joins

• Example :

[Figure: relations A and B and their join AB, with tuples a1, a2 and b1]

Foreign Key Joins

• Foreign Key Join : Effectively a central “fact” table is appended with columns from the dimension tables.

• Sampling from the join is the same as sampling from the “fact” table itself.

• Synopsis : For every table that may be a “fact” table for some join, sample from the table and join the sample with the dimension tables.

Synopsis

• For every node in the DAG:

– Maintain a sample corresponding to that table.

– Join the sample with tables corresponding to all its descendants in the graph.

– Maximal join for which the table is a “fact” table.

[Figure: schema DAG over tables L, O, PS, P, S, C, N, R, with foreign-key edges labeled order; part, supp; cust; nation]
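A minimal sketch of building one such join synopsis, assuming in-memory tables held as Python dictionaries. The function `join_synopsis` and its argument layout are hypothetical and cover only one level of dimension tables; deeper descendants in the DAG would be joined the same way.

```python
import random


def join_synopsis(fact_rows, dimensions, sample_size, seed=0):
    """Sample the fact table, then join each sampled row with its dimensions.

    fact_rows:  list of dicts (rows of the "fact" table)
    dimensions: list of (foreign_key_column, {key: dimension_row_dict})
    """
    rng = random.Random(seed)
    sample = rng.sample(fact_rows, min(sample_size, len(fact_rows)))
    joined = []
    for row in sample:
        wide = dict(row)
        for fk_column, table in dimensions:
            # A foreign key matches exactly one dimension row, so the join
            # neither drops nor duplicates sampled fact rows.
            wide.update(table[row[fk_column]])
        joined.append(wide)
    return joined
```

Because each fact row joins with exactly one row per dimension table, the result is a uniform sample of the full join, which is the property the slides rely on.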

Bells and whistles!

• How to allocate memory across samples of different “fact” tables

• Group-By Queries:

– Are uniform samples best or can we do better?

• Aggregate attribute may be skewed

– Are uniform samples best or can we do better?

• We may revisit these issues later

– Have not seen some equations for a while!

How to sample?

• Consider a single table with only insertions

• Want to maintain a sample of this table

• Three semantics of sampling:

– Coin flip

– Fixed size without replacement

– Fixed size with replacement

• First one (coin flip) easy to maintain under insertions

• Exercise : Can we switch between different samples? If so how ?

Reservoir Sampling

• Given : A stream of elements (tuples), viewed as insertions into a relation

• Aim : At every instant maintain a uniform random sample of size n without replacement

• Method : (Accept the first n elements)

– Let t be the number of elements seen so far

– On seeing the (t+1)st element include it with probability n/(t+1)

– If included evict one of the previous elements uniformly at random
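The method above can be sketched in a few lines. This is a plain one-random-draw-per-record version; the function name `reservoir_sample` and the optional `rng` parameter are illustrative choices, not from the slides.

```python
import random


def reservoir_sample(stream, n, rng=random):
    """Maintain a uniform random sample of size n without replacement."""
    reservoir = []
    for t, item in enumerate(stream):   # t = number of elements seen so far
        if t < n:
            reservoir.append(item)      # accept the first n elements
        elif rng.random() < n / (t + 1):
            # The (t+1)st element is included with probability n/(t+1)
            # and evicts a uniformly chosen current member.
            reservoir[rng.randrange(n)] = item
    return reservoir
```

For example, `reservoir_sample(range(1000), 10)` returns 10 distinct values from the stream, and a stream shorter than n is returned unchanged.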

Proof of Correctness

• Easy to see that at every instant the size of the sample is exactly n

• Claim : After seeing t elements, every element belongs to the sample with probability n/t

• Exercise : Using induction prove the last claim

Efficiency

• Let N be the number of records seen

• Each record (beyond the first n records) is added to the reservoir with probability n/t

• The average number of records added is

$$n + \sum_{t=n+1}^{N} \frac{n}{t} = n(1 + H_N - H_n) \approx n(1 + \ln(N/n))$$

where $H_m$ is the m-th harmonic number

• Consider any reservoir sample.

• The t-th element has to be a part of the sample with probability no less than n/t.

• Thus, the quantity above is also a lower bound on the additions made to the reservoir (time spent)
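As a quick numeric check of the expected number of additions, the exact sum can be compared with the logarithmic approximation. The function names below are illustrative only.

```python
import math


def expected_insertions(n, N):
    """Exact value of n + sum_{t=n+1}^{N} n/t."""
    return n + sum(n / t for t in range(n + 1, N + 1))


def approx_insertions(n, N):
    """Approximation n * (1 + ln(N/n))."""
    return n * (1.0 + math.log(N / n))
```

For n = 100 and N = 100,000 the two values differ by less than one record, out of roughly 790 expected additions, so almost all of the 100,000 records are skipped.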

Efficiency

• The naïve algorithm makes N calls to RANDOM() and takes time O(N)

• Consider the following random variable: Let S(n,t) denote the number of elements skipped where n is the size of the reservoir and t is the number of elements processed so far.

• Aim: Study this random variable and sample from its distribution using O(1) operations.

• Idea : Generate S(n,t) and skip that many records, doing nothing

Observations

• S(n,t) is non-negative

• Let F(s) denote Prob {S(n,t) ≤ s}, for s ≥ 0

$$F(s) = 1 - \frac{t^{\underline{n}}}{(t+s+1)^{\underline{n}}} = 1 - \frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}}$$

where $a^{\underline{b}}$ denotes the falling power $a(a-1)(a-2)\cdots(a-b+1)$ and $a^{\overline{b}}$ denotes the rising power $a(a+1)(a+2)\cdots(a+b-1)$

Observations

• Subtracting two terms corresponding to s and s-1 we get the probability mass function f(s) as

$$f(s) = \frac{n}{t+s+1}\cdot\frac{t^{\underline{n}}}{(t+s)^{\underline{n}}} = \frac{n}{t-n}\cdot\frac{(t-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}}$$

• We can compute the expected value, which is (t-n+1)/(n-1)

• Here is a simple way to sample from the distribution corresponding to S(n,t). We already calculated its CDF (F(s)). We generate a random number U between 0 and 1 and find the smallest s such that U ≤ F(s), i.e.

$$\frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}} \le 1 - U$$
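The inverse-CDF idea can be sketched with the linear scan: keep multiplying in one more factor of the rising-power ratio until it drops to 1 - U. The names `draw_skip` and `reservoir_sample_with_skips` are illustrative; the sketch assumes n ≥ 1 and uses one random draw for the skip plus one to choose the evicted slot.

```python
import random


def draw_skip(n, t, rng=random):
    """Draw S(n,t): the smallest s with U <= F(s), found by linear scan."""
    u = rng.random()
    s = 0
    # quot = (t+1-n)^{rising s+1} / (t+1)^{rising s+1} = Prob{S(n,t) > s}
    quot = (t + 1 - n) / (t + 1)
    while quot > 1.0 - u:
        s += 1
        quot *= (t + 1 - n + s) / (t + 1 + s)
    return s


def reservoir_sample_with_skips(stream, n, rng=random):
    """Reservoir sampling that skips S(n,t) records between insertions."""
    it = iter(stream)
    reservoir = []
    for item in it:                      # accept the first n elements
        reservoir.append(item)
        if len(reservoir) == n:
            break
    t = len(reservoir)
    if t < n:
        return reservoir
    while True:
        s = draw_skip(n, t, rng)
        try:
            for _ in range(s):           # skipped records are never examined
                next(it)
            item = next(it)
        except StopIteration:
            return reservoir
        t += s + 1
        reservoir[rng.randrange(n)] = item
```

The scan inside `draw_skip` still does work proportional to the number of skipped records, so total time stays linear in the stream length; only the number of random draws falls.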

Observations

• Have reduced the number of calls to RANDOM() to optimal : One per insertion into the reservoir

• There are two ways to find the smallest s that satisfies the previous inequality

– Linear scan : Gives O(N) time algorithm

– Binary search/Newton’s interpolation method to get a running time of O(n^2 (1 + log(N/n) log log(N/n)))

• Note: This is still not optimal. Read the paper for an optimal (up to constants) algorithm.

What have we seen so far?

• How to sample efficiently (Reservoir Sampling)

– A method to sample without replacement by making a single scan

– Optimized the calls to RANDOM()

– Overall processing time can also be optimized

• How samples are used in DSS and what are the different samples that should be kept in order to answer queries

• What next?

– Queries in DSS are not simple counts over the entire relation

– Typically they have grouping followed by aggregation of an attribute that may have high variance

Error using sampling

R = {y1, y2, …, yN}, sample size n

Variance in data values:

$$S = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N-1}}$$

Error = Std Dev:

$$\sqrt{E(\mu - \mu^*)^2} = \frac{S}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}$$
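The error formula can be evaluated directly when the full relation is available, for instance to see how the error shrinks as n grows. The function name below is an illustrative choice.

```python
import math


def sample_mean_std_error(values, n):
    """Std dev of the mean of a size-n sample drawn without replacement."""
    N = len(values)
    mean = sum(values) / N
    # S^2 uses the N-1 denominator, as in the slide.
    s2 = sum((y - mean) ** 2 for y in values) / (N - 1)
    # (S / sqrt(n)) * sqrt(1 - n/N): the second factor vanishes at n = N.
    return math.sqrt(s2 / n) * math.sqrt(1.0 - n / N)
```

For the values 1 to 100 and n = 10 this gives about 8.70, and it drops to 0 when the sample is the whole relation.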

Group-By Queries

• SELECT avg (salary) FROM census GROUP BY state

• Some of the states have very few tuples as compared to others. E.g. CA has 70 times more people as compared to WY

• If we sample uniformly from the entire relation then there will be very few tuples corresponding to WY and hence a large error in its avg(salary) estimate

Error Metric (Group-By)

• Let c*_i be the true answer (aggregate) corresponding to group i

• Let c_i be the estimate obtained from sample

• The error e_i is given by |c*_i – c_i|/|c_i|

• The cumulative error is the L1, L2, or L∞ norm of the error vector {e_i}

Optimal sampling strategy

• For every group the error is inversely proportional to √n where n is the number of tuples in the sample from this group

• In order to reduce the maximum error among all groups we should have equal number of samples from each group (Senate)

• But this strategy is not optimal if the query does not have a group by and is over the entire relation. In that case a uniform sample of the entire relation is optimal (House)

Basic-Congress Sampling

• Unfortunately, unlike the U.S. Congress, we don’t have room to seat both Senators and House Representatives!

• Hence we do the following:

– Let X be the total seats allotted to Congress

– For a state CA let CA_S (resp. CA_H) be the seats allotted to it assuming the congress was only made of senate (resp. house)

– The final seat allocation to each state CA is proportional to max(CA_S, CA_H), subject to total seats being X
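The allocation rule can be sketched as below. The function name `basic_congress` is hypothetical, seats are left fractional, and the sketch does not cap a group's allocation at its actual size; both would need handling in practice.

```python
def basic_congress(group_sizes, X):
    """Split X sample slots across groups in proportion to max(senate, house).

    group_sizes: {group: number of tuples in the relation}
    """
    total = sum(group_sizes.values())
    m = len(group_sizes)
    # House: uniform sample of the relation, so slots follow group size.
    house = {g: X * size / total for g, size in group_sizes.items()}
    # Senate: equal slots per group.
    senate = {g: X / m for g in group_sizes}
    raw = {g: max(house[g], senate[g]) for g in group_sizes}
    scale = X / sum(raw.values())        # rescale so the total is X again
    return {g: raw[g] * scale for g in group_sizes}
```

With two groups in a 70:1 size ratio and X = 100, the small group gets about 34 slots instead of the 1.4 a uniform sample would give, while the large group still keeps the majority.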

Comments

• No error guarantees

– Only a best effort solution

• Cannot use Reservoir sampling anymore

– The full paper talks about one pass algorithms, but admits that they don’t work in all cases

• What if the variance in values (S) is large ?

– Outlier indexing

Error using sampling

R = {y1, y2, …, yN}, sample size n

Variance in data values:

$$S = \sqrt{\frac{\sum_{i=1}^{N} (y_i - \bar{Y})^2}{N-1}}$$

Error = Std Dev:

$$\frac{S}{\sqrt{n}}\sqrt{1 - \frac{n}{N}}$$

Presence of Data Skew.

Outliers (deviant tuples).

• Relation: 9950 tuples with value = 1 and 50 tuples with value = 1000. Exact answer (sum) = 59,950

• Uniform sample of size 100:

– Case 1 (no outlier in the sample): sum estimate = 10,000

– Case 2 (at least one outlier in the sample): sum estimate ≥ 109,900

• In both cases, error > 83%
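The numbers in this example can be reproduced with a few lines of arithmetic. The probability of case 1 is an added calculation, not a figure from the slide; it assumes the sample is drawn uniformly without replacement.

```python
import math

small, big = 9950, 50            # tuples with value 1 and value 1000
N = small + big
n = 100                          # uniform sample size
scale = N // n                   # each sampled tuple stands for 100 tuples

exact = small * 1 + big * 1000               # 59,950
case1 = scale * (n * 1)                      # no outlier sampled: 10,000
case2 = scale * ((n - 1) * 1 + 1000)         # exactly one outlier: 109,900

err1 = abs(case1 - exact) / exact            # about 0.833
err2 = abs(case2 - exact) / exact            # about 0.833

# Chance that a sample of 100 contains no outlier at all (hypergeometric).
p_case1 = math.comb(small, n) / math.comb(N, n)
```

Case 1 occurs with probability of about 0.6, so the estimate is off by more than 83% whichever way the sample falls.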

Outlier Indexing Scheme.

• Preprocessing: partition R into RO (outliers) and RNO (non-outliers), and keep a sample of RNO

• Query: run Q on RO to get A1; run Q on the sample of RNO and extrapolate to get A2; the answer is A = A1 + A2

Selection of Outlier Index.

Objective: Remove at most n outliers such that the non-outliers have the least variance.

Theorem: For a sorted (multi)set of values, the optimal outlier set is taken from the two ends of the sorted order, so the non-outliers form a contiguous run:

…, v_k, v_{k+1}, v_{k+2}, …, v_{m-1}, v_m, v_{m+1}, …
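Given the theorem, the selection reduces to sliding a window over the sorted values. The sketch below assumes exactly `n_out` values are removed (rather than at most) and that at least one value remains; the function name is hypothetical, and prefix sums keep each window's variance a constant-time computation.

```python
def select_outliers(values, n_out):
    """Return (outliers, non_outliers), removing n_out values from the ends
    of the sorted order so that the remaining window has least variance."""
    v = sorted(values)
    w = len(v) - n_out                       # window (non-outlier) size
    p1, p2 = [0.0], [0.0]                    # prefix sums of x and x^2
    for x in v:
        p1.append(p1[-1] + x)
        p2.append(p2[-1] + x * x)
    best_k, best_var = 0, float("inf")
    for k in range(n_out + 1):               # k outliers taken from the low end
        s1 = p1[k + w] - p1[k]
        s2 = p2[k + w] - p2[k]
        var = s2 / w - (s1 / w) ** 2
        if var < best_var:
            best_k, best_var = k, var
    return v[:best_k] + v[best_k + w:], v[best_k:best_k + w]
```

For twenty 1s plus the values -500, 1000, and 2000 with `n_out = 3`, the three extreme values are returned as outliers and the remaining window has zero variance.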

Comments

• Cannot do reservoir sampling

• One pass algorithm for selection of outliers

Types of Samples

• Oblivious samples: We do not look at the value of attribute while sampling

• Value based sampling : The distinct sampling of Gibbons et al

• Limitations of oblivious sampling:

– Please refer to: Sampling algorithms: lower bounds and applications, Z. Bar-Yossef, S. Ravi Kumar, and D. Sivakumar. STOC 2001.

http://www.almaden.ibm.com/cs/people/siva/papers/sampling.ps

Summary

• Obvious type of synopsis: samples

• Use of samples in DB, in particular DSS.

– Idea of maintaining the samples of ‘fact’ tables

• How to sample without replacement with a single pass, not knowing the size of the relation a priori

– Reservoir sampling and tricks to make it efficient

• Shortcomings of sampling in DB’s

– Group-By queries : Congressional samples

– High Skew in Data : Outlier indexing, stratified sampling

References

• Join Synopses for Approximate Query Answering, S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. SIGMOD 1999.

• Congressional Samples for Approximate Answering of Group-By Queries, S. Acharya, P. Gibbons, and V. Poosala. SIGMOD 2000.

• Overcoming Limitations of Sampling for Aggregation Queries, S. Chaudhuri, G. Das, M. Datar, R. Motwani and V. Narasayya. ICDE 2001.

• A Robust Optimization-Based Approach for Approximate Answering of Aggregate Queries, S. Chaudhuri, G. Das and V. Narasayya. SIGMOD 2001.

• Random Sampling with a Reservoir, J. S. Vitter. Trans. on Mathematical Software 11(1):37-57 (1985).

http://hake.stanford.edu/~datar/courses/cs361a/papers/aqua.pdf

http://citeseer.nj.nec.com/acharya99congressional.html

http://research.microsoft.com/copyright/accept.asp?path=ftp://ftp.research.microsoft.com/users/AutoAdmin/icde01.pdf&pub=IEEE

http://www.acm.org/toms/V11.html

Sampling over Sliding windows

• Samples of streaming data

• Need to account for staleness of data

• A data element is fresh if it belongs to the last N elements

• Problem statement : Given a stream of elements maintain a uniform random sample of size

A Simple, Unsatisfying Approach

• Choose a random subset X = {x1, …, xk}, X ⊆ {0, 1, …, n-1}

• The sample always consists of the non-expired elements whose indexes are equal to x1, …,xk (modulo n)

• Only uses O(k) memory

• Technically produces a uniform random sample of each window, but unsatisfying because the sample is highly periodic

• Unsuitable for many real applications, particularly those with periodicity in the data

Reservoir Sampling: Why It Doesn’t Work

• Suppose an element in the reservoir expires

• Need to replace it with a randomly-chosen element from the current window

• However, in the data stream model we have no access to past data

• Could store the entire window but this would require O(n) memory

Chain-Sample

• Include each new element in the sample with probability 1/min(i,n)

• As each element is added to the sample, choose the index of the element that will replace it when it expires

• When the ith element expires, the window will be (i+1…i+n), so choose the index from this range

• Once the element with that index arrives, store it and choose the index that will replace it in turn, building a “chain” of potential replacements

• When an element is chosen to be discarded from the sample, discard its “chain” as well
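The steps above can be sketched for a single sample (k = 1) over a window of the last n elements; running k independent copies would give a sample of size k with replacement. The class name `ChainSample` and its fields are illustrative, and elements are indexed from 1 as they arrive.

```python
import random
from collections import deque


class ChainSample:
    """One chain: chain[0] is the current sample, the rest are replacements."""

    def __init__(self, n, rng=random):
        self.n = n
        self.rng = rng
        self.i = 0                 # index of the most recent element
        self.chain = deque()       # (index, value) pairs in arrival order
        self.next_index = None     # index awaited as the tail's replacement

    def _pick_successor(self):
        # The replacement comes from the window seen when element i expires.
        self.next_index = self.rng.randint(self.i + 1, self.i + self.n)

    def add(self, value):
        self.i += 1
        if self.rng.random() < 1.0 / min(self.i, self.n):
            self.chain.clear()     # new sample: discard the old chain
            self.chain.append((self.i, value))
            self._pick_successor()
        elif self.i == self.next_index:
            self.chain.append((self.i, value))
            self._pick_successor()
        if self.chain[0][0] <= self.i - self.n:
            self.chain.popleft()   # sample expired; its successor takes over

    def sample(self):
        return self.chain[0][1]
```

The chain is never empty after an expiry, because the replacement index always falls inside the next n arrivals and so has been stored by the time the head leaves the window.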

Example

3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3

[Figure: the stream above shown four times, with the sample and its chain marked at each step]

Memory Usage of Chain-Sample

• Let T(x) denote the expected length of the chain from the element with index i when the most recent index is i+x

• $$T(x) = \begin{cases} 0 & \text{for } x < 0 \\ 1 + \frac{1}{n}\sum_{j<x} T(j) & \text{for } x \ge 1 \end{cases}$$

• The expected length of each chain is less than T(n) ≤ e ≈ 2.718

• Expected memory usage is O(k)

Memory Usage of Chain-Sample

• Chain consists of “hops” with lengths 1…n

• Chain of length j can be represented by partition of n into j ordered integer parts

– j-1 hops with sum less than n plus a remainder

• Each such partition has probability $n^{-j}$

• Number of such partitions is $\binom{n}{j} < (ne/j)^j$

• Probability of any such partition is small [$O(n^{-c})$] when j = O(k log n)

• Uses O(k log n) memory whp

Comparison of Algorithms

• Chain-sample is preferable to oversampling:

– Better expected memory usage: O(k) vs. O(k log n)

– Same high-probability memory bound of O(k log n)

– No chance of failure due to sample size shrinking below k

Algorithm       Expected      High-Probability
Periodic        O(k)          O(k)
Oversample      O(k log n)    O(k log n)
Chain-Sample    O(k)          O(k log n)
