Page 1: Distinct items:

Distinct items:

• Given a stream $\sigma = \langle a_1, \ldots, a_m \rangle$, where $a_i \in [n]$, count the number of distinct items (so we are in the cash register model)

• Example: 3 5 7 4 3 4 3 4 7 5 9

• 5 distinct elements: 3 4 5 7 9 (we only want the count of distinct elements, and not the set of distinct elements)

• In terms of frequency moments estimation, this is the problem of estimating $F_0$

• The easy deterministic solutions use space $O(n)$ and $O(d \log n)$ ($d$ = number of distinct elements)

• A deterministic exact solution requires $\Omega(n)$ space in the worst case

• How about deterministic approximate solutions? And exact randomized?

• Can we do better with randomization and approximation?
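The two easy exact baselines can be sketched as follows. This is a minimal illustration, assuming the items are integers in {1, ..., n}; the function names are made up for this example. The first keeps every distinct item (about $d \log n$ bits), the second keeps one flag per universe element (about $n$ bits).

```python
def count_distinct_set(stream):
    """Exact count by remembering every distinct item seen so far."""
    seen = set()
    for item in stream:
        seen.add(item)
    return len(seen)


def count_distinct_bitmap(stream, n):
    """Exact count with one flag per element of the universe {1, ..., n}."""
    present = bytearray(n + 1)  # one byte per flag here; a real bitmap packs 8 per byte
    for item in stream:
        present[item] = 1
    return sum(present)


stream = [3, 5, 7, 4, 3, 4, 3, 4, 7, 5, 9]
print(count_distinct_set(stream), count_distinct_bitmap(stream, n=9))  # 5 5
```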

Page 2: Distinct items:

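Counting distinct elements (Flajolet-Martin 1985)

• Let $h : [n] \to [0, 1]$ be a random hash function: for each $i \in [n]$, the value $h(i)$ is uniformly distributed in $[0, 1]$

• What is the relation between the minimum of $h(a_1), \ldots, h(a_m)$ and the number of distinct elements $d$? $E[\min_i h(a_i)] = \frac{1}{d+1}$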

(We will do two proofs on the board, one algebraic and one pictorial)

• Moreover, the variance can also be bounded via (Fun problem: I only know an algebraic proof for this, but there could be a pictorial one too given the suggestive-looking rhs)
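A quick numerical sanity check, not a proof: the minimum of $d$ independent uniform values in $[0, 1]$ has mean $1/(d+1)$. The sketch below estimates that mean by simulation; the helper name, the seed, and the trial count are arbitrary choices for illustration.

```python
import random


def mean_min_of_uniforms(d, trials, rng):
    """Average, over many trials, of the minimum of d independent U[0,1] values."""
    total = 0.0
    for _ in range(trials):
        total += min(rng.random() for _ in range(d))
    return total / trials


rng = random.Random(0)
d = 9
est = mean_min_of_uniforms(d, 20000, rng)
print(est, 1 / (d + 1))  # the two numbers should be close
```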

Page 3: Distinct items:

Counting distinct elements

First algorithm
• Pick a random hash function $h : [n] \to [0, 1]$
• Find the minimum $X = \min_i h(a_i)$
• Output $1/X - 1$

• The estimator has high variance. Improving the estimator by averaging:

Second algorithm
• Run $k$ parallel independent copies of the first algorithm
• Set $\bar{X} = \frac{1}{k} \sum_{j=1}^{k} X_j$ ($X_j$ is the minimum found by the $j$th copy)
• Return $1/\bar{X} - 1$
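The two algorithms can be sketched as below. This assumes an idealized random hash function, simulated by lazily drawing one uniform value per item and remembering it in a dictionary, so the sketch is not small-space. It also assumes the second algorithm averages the minima of the copies before inverting. The class and function names are illustrative.

```python
import random


class MinHashCounter:
    """First algorithm: track the minimum hash value X, output 1/X - 1."""

    def __init__(self, rng):
        self.rng = rng
        self.h = {}          # idealized random hash, sampled lazily (NOT small space)
        self.minimum = 1.0

    def update(self, item):
        if item not in self.h:
            self.h[item] = self.rng.random()
        self.minimum = min(self.minimum, self.h[item])

    def estimate(self):
        return 1.0 / self.minimum - 1.0


def averaged_estimate(stream, k, seed=0):
    """Second algorithm: average the k minima, then invert."""
    copies = [MinHashCounter(random.Random(seed + j)) for j in range(k)]
    for item in stream:
        for c in copies:
            c.update(item)
    mean_min = sum(c.minimum for c in copies) / k
    return 1.0 / mean_min - 1.0


stream = [i % 500 for i in range(1500)]  # 500 distinct items, each seen 3 times
print(averaged_estimate(stream, k=300))
```

Averaging the minima rather than the individual outputs matters: a single output $1/X - 1$ is heavy-tailed, because $X$ can be arbitrarily close to 0.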

Page 4: Distinct items:

Counting distinct elements

• Space complexity of the first algorithm: to compute the minimum we just need to keep one real number in memory, but we need to limit its precision

• So the space requirement is $O(\log n)$ bits

• Not quite: also need to account for the memory requirements for a random hash function

• What property of random hash function did we really use?

Page 5: Distinct items:

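Counting distinct elements

• Pick $h$ from a 2-wise independent hash function family mapping $[n] \to \{0, 1, \ldots, p-1\}$ for a prime $p$ ($p$ is chosen large to reduce round-off errors)
• $D$ = set of distinct elements, $d = |D|$
• New estimator: $\hat{d} = p / \min_{a \in D} h(a)$
• No longer clear that $\hat{d}$ is close to $d$ in expectation, but $\hat{d}$ does provide useful information

Lemma  $\Pr[d/6 \le \hat{d} \le 6d] \ge 2/3 - d/p$ (probability is over the random choice of $h$)

Proof (1) First, prove $\Pr[\hat{d} > 6d] \le 1/6 + d/p$:

$\Pr[\min_{a \in D} h(a) < p/(6d)] \le \sum_{a \in D} \Pr[h(a) < p/(6d)] \le d \cdot \frac{\lceil p/(6d) \rceil}{p} \le \frac{1}{6} + \frac{d}{p}$ (union bound)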
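A minimal sketch of the pairwise independent variant, under stated assumptions: the family is taken to be $h(x) = (ax + b) \bmod p$ with $a, b$ uniform in $\{0, \ldots, p-1\}$, the prime is $2^{31} - 1$, and the estimator is $p / \min h(a)$. The experiment counts how often the estimate lands within a factor 6 of the true $d$, the constant used in the standard Chebyshev-style analysis. All names are illustrative.

```python
import random

P = 2_147_483_647  # the Mersenne prime 2^31 - 1 (any prime larger than the universe works)


def sample_pairwise_hash(rng, p=P):
    """Draw h(x) = (a*x + b) mod p with a, b uniform in {0, ..., p-1}."""
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda x: (a * x + b) % p


def min_hash_estimate(stream, h, p=P):
    """Return p / min h(a); the max(..., 1) avoids dividing by a zero hash value."""
    smallest = min(h(x) for x in stream)
    return p / max(smallest, 1)


rng = random.Random(0)
d = 1000
stream = list(range(1, d + 1)) * 2  # d distinct items, each seen twice
trials = 200
hits = 0
for _ in range(trials):
    est = min_hash_estimate(stream, sample_pairwise_hash(rng))
    hits += d / 6 <= est <= 6 * d
print(hits / trials)  # fraction of trials within a factor 6 of d
```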

Page 6: Distinct items:

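Counting distinct elements

(2) Prove $\Pr[\hat{d} < d/6] \le 1/6$:

• For $a \in D$, define the indicator $Y_a = 1$ if $h(a) \le 6p/d$ (this is the good event), and $Y_a = 0$ otherwise
• Let $Y = \sum_{a \in D} Y_a$; then $E[Y_a] \ge 6/d$ and so $E[Y] \ge 6$
• We now upper bound $\Pr[Y = 0]$ by using the pairwise independence of the $Y_a$ and Chebyshev's inequality (proof on the board; also in the book page 297)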

Page 7: Distinct items:

Boosting the success probability

• Take the median of the means estimator
• But this doesn't seem to give a $(1 \pm \varepsilon)$-factor approximation: approximation only within factors $6$ and $1/6$

• A related estimator [BJKST 2004]:

• $\mathcal{H}$: pairwise independent hash function family of functions of type $[n] \to [N]$
• $\Pr[h(a) = h(b)] \le 1/N$ for $a \ne b$, so we can take $N = n^3$, and each $h \in \mathcal{H}$ has an $O(\log n)$-bit description
• So the probability that a random $h \in \mathcal{H}$ is injective is at least $1 - 1/n$

• Maintain the $t$ smallest hash values; let $v$ be the $t$th smallest hash value at the end of the stream. The new estimator (BJKST estimator) is $\hat{d} = tN/v$
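A minimal sketch of an estimator of this type: keep the $t$ smallest hash values in a heap and output $tN/v$, where $v$ is the $t$th smallest. Several choices here are assumptions for illustration, not the slide's parameters: the hash range $N$ is the Mersenne prime $2^{61} - 1$ rather than $n^3$, the hash is $(ax + b) \bmod N$ with $a \ne 0$ (injective, with a negligible deviation from exact pairwise independence), and the class name is made up.

```python
import heapq
import random

N = (1 << 61) - 1  # hash range, chosen here to be a Mersenne prime


class BJKSTSketch:
    def __init__(self, t, rng):
        self.t = t
        self.a = rng.randrange(1, N)   # a != 0 makes h injective on {0, ..., N-1}
        self.b = rng.randrange(N)
        self.heap = []                 # negated values: -heap[0] is the largest kept hash
        self.kept = set()              # hash values currently kept, to ignore repeats

    def update(self, item):
        v = (self.a * item + self.b) % N
        if v in self.kept:
            return
        if len(self.heap) < self.t:
            heapq.heappush(self.heap, -v)
            self.kept.add(v)
        elif v < -self.heap[0]:
            # push the new value and evict the current largest in one O(log t) step
            evicted = -heapq.heappushpop(self.heap, -v)
            self.kept.discard(evicted)
            self.kept.add(v)

    def estimate(self):
        if len(self.heap) < self.t:    # fewer than t distinct items: the count is exact
            return float(len(self.heap))
        v = -self.heap[0]              # t-th smallest hash value
        return self.t * N / max(v, 1)


sketch = BJKSTSketch(t=16, rng=random.Random(1))
for item in [3, 5, 7, 4, 3, 4, 3, 4, 7, 5, 9]:
    sketch.update(item)
print(sketch.estimate())  # 5.0, exact because fewer than t distinct items were seen
```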

Page 8: Distinct items:

Analyzing the BJKST estimator

• Requirements to maintain the BJKST estimator:

– Space: $O(t \log n)$ bits
– Update time: $O(\log t)$

• We assume (satisfied if true for )

• Recall that $D$ is the set of distinct elements in the stream, and $d = |D|$
• We separately upper bound $\Pr[\hat{d} > (1+\varepsilon) d]$ and $\Pr[\hat{d} < (1-\varepsilon) d]$ using the Chebyshev inequality

Page 9: Distinct items:

Analyzing the BJKST estimator

• I.e., contains at least elements less than (using )

• For , define if and otherwise

• For

• , ,


Page 10: Distinct items:

Analyzing the BJKST estimator

• Similarly,

• Thus,

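• And now we can apply the median trick: run $O(\log(1/\delta))$ parallel independent copies of the algorithm to compute $\hat{d}$ and output their median

Theorem  The output of the above algorithm is an $(\varepsilon, \delta)$-approximation of $d$. It uses $O(\varepsilon^{-2} \log n \log(1/\delta))$ space and $O(\log(1/\varepsilon) \log(1/\delta))$ update time per streaming element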
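The median trick itself can be illustrated with a toy stand-in for one copy of the sketch. The function `noisy_estimate` is hypothetical: it is accurate with probability 2/3 and off by a factor of 10 otherwise. The median of many independent copies is then accurate almost always, because it fails only if more than half of the copies err on the same side.

```python
import random
import statistics


def noisy_estimate(truth, eps, rng):
    """Hypothetical single-copy estimator: accurate w.p. 2/3, wildly off otherwise."""
    if rng.random() < 2 / 3:
        return truth * (1 + rng.uniform(-eps, eps))
    return truth * rng.choice([0.1, 10.0])


def median_of_copies(truth, eps, copies, rng):
    """Run independent copies and report the median of their outputs."""
    return statistics.median(noisy_estimate(truth, eps, rng) for _ in range(copies))


def is_good(value, truth, eps):
    return abs(value - truth) <= eps * truth + 1e-6


rng = random.Random(0)
truth, eps = 1000.0, 0.1
single_ok = sum(is_good(noisy_estimate(truth, eps, rng), truth, eps)
                for _ in range(1000)) / 1000
median_ok = sum(is_good(median_of_copies(truth, eps, 101, rng), truth, eps)
                for _ in range(100)) / 100
print(single_ok, median_ok)  # success rate of one copy vs. median of 101 copies
```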

Very powerful: a variant needs 128 bytes for all works of Shakespeare, with relative error $\approx 1/10$ [Durand-Flajolet 2003]

• What streaming model does the above algorithm require?

Page 11: Distinct items:

Counting distinct elements (strict turnstile model)

• What about the strict turnstile model?
• Stream of updates $(a_i, c_i)$ with $c_i$ integers (possibly negative)
• Frequency vector $f$ stays nonnegative
• The previous algorithm requires the cash register model

• A different but closely related algorithm that works in the strict turnstile model

• We will only give the basic idea and not the full details of the proof

Page 12: Distinct items:

Counting distinct elements (strict turnstile model)

• $D$ = set of distinct elements, $d = |D|$
• First reduce the problem to its decision version:
• Input: stream $\sigma$, parameters $\varepsilon, \delta$, and an additional parameter $T$
• Output:

– YES if $d \ge (1+\varepsilon) T$
– NO if $d \le (1-\varepsilon) T$
– Arbitrary otherwise

• A solution of the decision version gives a solution of the general problem with a slight blow-up in space:

• Run parallel versions of the decision problem with $T = 1, (1+\varepsilon), (1+\varepsilon)^2, \ldots, n$
• A total of $O(\log n / \varepsilon)$ copies
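The reduction can be sketched as below, assuming a geometric grid of thresholds $T = 1, (1+\varepsilon), (1+\varepsilon)^2, \ldots$ up to $n$. The decision procedure is replaced by an idealized exact oracle purely for illustration. A real decision sketch may answer arbitrarily in the gap around $T$, which costs only a further $(1 \pm \varepsilon)$ factor. Function names are made up for this example.

```python
def thresholds(n, eps):
    """Geometric grid 1, (1+eps), (1+eps)^2, ... up to n."""
    grid, T = [], 1.0
    while T <= n:
        grid.append(T)
        T *= 1 + eps
    return grid


def estimate_from_decisions(n, eps, decide):
    """Return the largest grid threshold T for which decide(T) says YES (True)."""
    best = 0.0
    for T in thresholds(n, eps):
        if decide(T):
            best = T
    return best


d_true = 1000
est = estimate_from_decisions(n=10**6, eps=0.1, decide=lambda T: d_true >= T)
print(est, len(thresholds(10**6, 0.1)))  # estimate and number of parallel copies
```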

Page 13: Distinct items:

Algorithm for the decision version of counting distinct elements

Basic algorithm

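• Choose a random set $S \subseteq [n]$ by picking each element independently with probability $1/T$:

$\Pr[i \in S] = 1/T$ for all $i \in [n]$

• Maintain $\mathrm{Sum}_S = \sum_{i \in S} f_i$

• Output YES if $\mathrm{Sum}_S > 0$, else output NO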
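A minimal sketch of the basic algorithm, assuming each universe element joins $S$ independently with probability $1/T$ and the stream consists of (item, delta) updates. The set $S$ is stored explicitly here for clarity, which is not space-efficient. Class and function names are illustrative.

```python
import random


class SampledSum:
    def __init__(self, n, T, rng):
        # S: each element of {1, ..., n} is kept independently with probability 1/T
        self.S = {i for i in range(1, n + 1) if rng.random() < 1.0 / T}
        self.total = 0  # sum of the frequencies of the sampled elements

    def update(self, item, delta):
        if item in self.S:
            self.total += delta

    def answer(self):
        return "YES" if self.total > 0 else "NO"


def run(n, T, updates, rng):
    sketch = SampledSum(n, T, rng)
    for item, delta in updates:
        sketch.update(item, delta)
    return sketch.answer()


rng = random.Random(0)
inserts = [(i, +1) for i in range(1, 101)]   # 100 distinct items
deletes = [(i, -1) for i in range(1, 101)]   # ... all deleted again
print(run(200, 50, inserts + deletes, rng))  # NO: every frequency is back to zero
```

Because the sketch only adds signed updates, deletions cancel insertions exactly, which is what makes it usable in the strict turnstile model.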

Page 14: Distinct items:

Decision version of counting distinct elements (analysis idea)

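Lemma  For $T$ large enough and $\varepsilon$ small enough:
$\Pr[\mathrm{Sum}_S = 0] < 1/e - \varepsilon/3$ if $d > (1+\varepsilon) T$
$\Pr[\mathrm{Sum}_S = 0] > 1/e + \varepsilon/3$ if $d < (1-\varepsilon) T$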


Page 15: Distinct items:

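Full algorithm

• Run $k = C \log(1/\delta) / \varepsilon^2$ independent parallel copies of the basic algorithm for a sufficiently large constant $C$: sample $S_1, \ldots, S_k$ independently, and maintain $\mathrm{Sum}_{S_j}$ for each $j$

• $Z_j = 1$ if the $j$'th instance of the basic algorithm gives $\mathrm{Sum}_{S_j} = 0$, and $Z_j = 0$ otherwise

• Output YES (i.e. declare $d \ge (1+\varepsilon) T$) if $\sum_j Z_j < k/e$
• Output NO otherwise

• An application of the Chernoff bound using the independence of the $Z_j$ shows that this provides an $(\varepsilon, \delta)$-approximation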

• Space requirement?
• Use 2-wise independent sampling to choose $S_j$
• Total space requirement is
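The full decision algorithm can be sketched as follows. Assumptions: each of the $k$ sets samples every element with probability $1/T$, $Z$ counts the copies whose sum is zero, and the answer is YES when $Z < k/e$. The sets are stored explicitly with fully independent sampling, not the 2-wise independent sampling mentioned above. The class name and the parameter values in the usage lines are illustrative.

```python
import math
import random


class DistinctDecision:
    def __init__(self, n, T, k, rng):
        self.k = k
        # k independent sample sets; each keeps an element with probability 1/T
        self.sets = [
            frozenset(i for i in range(1, n + 1) if rng.random() < 1.0 / T)
            for _ in range(k)
        ]
        self.sums = [0] * k

    def update(self, item, delta=1):
        for j, S in enumerate(self.sets):
            if item in S:
                self.sums[j] += delta

    def answer(self):
        zero_copies = sum(1 for s in self.sums if s == 0)  # Z = number of zero sums
        return "YES" if zero_copies < self.k / math.e else "NO"


rng = random.Random(0)
sketch = DistinctDecision(n=400, T=100, k=300, rng=rng)
for i in range(1, 201):          # insert 200 distinct items: d = 2T
    sketch.update(i, +1)
print(sketch.answer())
for i in range(1, 151):          # delete 150 of them: d = T/2
    sketch.update(i, -1)
print(sketch.answer())
```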

Page 16: Distinct items:

Counting distinct elements

• Why didn't we just maintain whether or not $S \cap D \neq \emptyset$?

• $\mathrm{Sum}_S$ is a linear sketch
• Allows for negative $c_i$
• So it works in the (strict) turnstile model

• The problem of computing $F_0$ is by now very well understood: space complexity $O(\varepsilon^{-2} + \log n)$ with $O(1)$ update time. This is optimal up to constant factors [Kane et al. 2010]

Page 17: Distinct items:

Document sketching

• Problem: duplicate or near-duplicate identification in a collection of documents
• How to measure the similarity between documents?
• A reasonable (?) candidate: edit distance
• Computationally expensive

• Another measure: resemblance, due to [Broder '97]

Page 18: Distinct items:

Resemblance of documents [Broder '97]

• $r(A, B)$: resemblance between documents $A$ and $B$
• $0 \le r(A, B) \le 1$. Similar means $r(A, B)$ close to $1$
• Convert documents to a set of integers

• A contiguous sequence of length $w$ contained in document $D$ is called a $w$-shingle
• Example: $D$ = (a rose is a rose is a rose)
• $4$-shingles of $D$ are: (a rose is a), (rose is a rose), (is a rose is), (a rose is a), (rose is a rose)
• The set of $4$-shingles of $D$: {(a rose is a), (rose is a rose), (is a rose is)}
• Map shingles to integers in $[n]$ (for some fixed $n$)
• From now on, identify the documents with sets of integers in $[n]$
• Thus a document $D$ is represented as a set of integers $S_D \subseteq [n]$
• $r(A, B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}$ (also known as Jaccard similarity between sets $S_A$ and $S_B$)

• Thus, $r(A, A) = 1$, but $r(A, B) = 1$ does not mean $A = B$
• In practice, $r(A, B)$ is a reasonable approximation of the informal notion of similarity of documents
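The shingling example and the resemblance measure can be reproduced in a few lines. This sketch works on word tuples directly and skips the mapping of shingles to integers; the function names are chosen for this example.

```python
def shingles(words, w):
    """Set of all contiguous length-w word sequences (w-shingles) of a document."""
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(A, B):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two shingle sets."""
    if not A and not B:
        return 1.0
    return len(A & B) / len(A | B)


doc = "a rose is a rose is a rose".split()
S = shingles(doc, 4)
print(len(S))  # 3 distinct 4-shingles

shorter = shingles("a rose is a rose".split(), 4)
longer = shingles("a rose is a rose is a rose is a rose".split(), 4)
print(resemblance(S, shorter))  # 2 shared shingles out of 3 in the union
print(resemblance(S, longer))   # 1.0 although the two documents differ
```

The last line shows why resemblance 1 does not imply equal documents: repeating the phrase adds no new shingles.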

Page 19: Distinct items:

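Estimating resemblance

• Given: $S_A, S_B \subseteq [n]$
• Estimate: $r(A, B)$
• Exact computation of $r(A, B)$ requires time linear in $|S_A| + |S_B|$

• A basic estimator for $r(A, B)$
• $\mathcal{S}_n$: set of permutations of $[n]$
• Choose a random $\pi \in \mathcal{S}_n$; then $\Pr[\min \pi(S_A) = \min \pi(S_B)] = r(A, B)$

• Variance too high. A smaller variance estimator:
• Let $\min_s(W)$ denote the set of $s$ smallest elements of $W$, and if $|W| < s$, then $\min_s(W) = W$
• More generally, for a constant $s$, and uniformly random $\pi \in \mathcal{S}_n$,

$\frac{|\min_s(\pi(S_A \cup S_B)) \cap \min_s(\pi(S_A)) \cap \min_s(\pi(S_B))|}{|\min_s(\pi(S_A \cup S_B))|}$

is an unbiased estimator of $r(A, B)$ (details on the board)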
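A minimal sketch of the bottom-$s$ estimator, assuming documents are sets of integers in $\{0, \ldots, n-1\}$ and the permutation is stored as an explicit list. With $s = 1$ it reduces to the basic estimator that checks whether the two minima agree. Averaging over many random permutations illustrates unbiasedness numerically; names and parameters are illustrative.

```python
import random


def smallest(values, s):
    """The set of the s smallest values (all of them if there are fewer than s)."""
    return set(sorted(values)[:s])


def bottom_s_estimate(A, B, perm, s):
    pa = smallest((perm[x] for x in A), s)
    pb = smallest((perm[x] for x in B), s)
    pu = smallest(pa | pb, s)   # equals the s smallest elements of perm(A ∪ B)
    return len(pu & pa & pb) / len(pu)


def average_estimate(A, B, n, s, trials, rng):
    total = 0.0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)       # uniformly random permutation of {0, ..., n-1}
        total += bottom_s_estimate(A, B, perm, s)
    return total / trials


rng = random.Random(0)
A, B = set(range(0, 60)), set(range(30, 90))   # true resemblance 30/90 = 1/3
print(average_estimate(A, B, n=100, s=10, trials=2000, rng=rng))
```

Note that the estimate needs only the two bottom-$s$ sets, since the $s$ smallest elements of the union can be recovered from them.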

Page 20: Distinct items:

Estimating resemblance

• In fact, this gives us a way of sketching the documents: fix a permutation $\pi$ and a constant $s$. For document $D$, its sketch is $\min_s(\pi(S_D))$

• Now given the sketches of documents $A, B$, computed using the same permutation $\pi$, we can estimate the resemblance of pairs

• Another way of sketching with low variance: sample $k$ random permutations $\pi_1, \ldots, \pi_k$
• Sketch of document $D$ is $(\min \pi_1(S_D), \ldots, \min \pi_k(S_D))$
• Resemblance can be estimated as the fraction of coordinates in which the two sketches agree

• We can estimate within multiplicative error with (for both methods above)

• One problem with this: storing permutations is expensive
• Question: Can we work with a small set of permutations instead of $\mathcal{S}_n$?

• Yes: min-wise independent permutations [Broder et al. '98]
• Can also use 2-wise independent hash functions [Thorup 2013]
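The multi-permutation sketch can be illustrated as below, assuming $k$ explicit uniformly random permutations of a small universe $\{0, \ldots, n-1\}$. Storing them is exactly the expense noted above, so this is a toy illustration and not a practical scheme. Function names and parameters are chosen for this example.

```python
import random


def make_permutations(n, k, rng):
    """k independent uniformly random permutations of {0, ..., n-1}, stored explicitly."""
    perms = []
    for _ in range(k):
        perm = list(range(n))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def sketch(doc_set, perms):
    """One minimum per permutation: the sketch has k coordinates."""
    return tuple(min(perm[x] for x in doc_set) for perm in perms)


def estimate_resemblance(sketch_a, sketch_b):
    """Fraction of coordinates in which the two sketches agree."""
    return sum(u == v for u, v in zip(sketch_a, sketch_b)) / len(sketch_a)


rng = random.Random(0)
perms = make_permutations(n=100, k=1000, rng=rng)
A, B = set(range(0, 60)), set(range(30, 90))   # true resemblance 1/3
print(estimate_resemblance(sketch(A, perms), sketch(B, perms)))
```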
