Approximate Frequency Counts Over Data Streams



    Approximate Frequency Counts over Data Streams

    Gurmeet Singh Manku (Stanford)

    Rajeev Motwani (Stanford)

    Presented by Michal Spivak, November 2003


    The Problem

    Stream

    Identify all elements whose current frequency exceeds support threshold s = 0.1%.


    Related problem

    Stream

    Identify all subsets of items whose current frequency exceeds s=0.1%


    Purpose of this paper

    Present an algorithm for computing frequency counts exceeding a user-specified threshold over data streams, with the following advantages:

    Simple

    Low memory footprint

    Output is approximate, but the error is guaranteed not to exceed a user-specified error parameter

    Can be deployed for streams of singleton items and can handle streams of variable-sized sets of items


    Overview

    Introduction

    Frequency counting applications

    Problem definition

    Algorithm for Frequent Items

    Algorithm for Frequent Sets of Items

    Experimental results

    Summary


    Introduction


    Motivating examples

    Iceberg Query: perform an aggregate function over an attribute and eliminate those below some threshold.

    Association Rules: require computation of frequent itemsets.

    Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.

    Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.


    What's out there today

    Algorithms that compute exact results attempt to minimize the number of data passes (the best algorithms take two passes).

    Problems when adapted to streams:

    Only one pass is allowed.

    Results are expected to be available with short response time.

    They fail to provide any a priori guarantee on the quality of their output.


    Why Streams?

    Streams vs. stored data:

    The volume of a stream over its lifetime can be huge.

    Queries over streams require timely answers; response times need to be small.

    As a result, it is not possible to store the stream in its entirety.


    Frequency counting applications


    Existing applications for the following problems

    Iceberg Query: perform an aggregate function over an attribute and eliminate those below some threshold.

    Association Rules: require computation of frequent itemsets.

    Iceberg Datacubes: group-bys of a CUBE operator whose aggregate frequency exceeds a threshold.

    Traffic measurement: requires identification of flows that exceed a certain fraction of total traffic.


    Iceberg Queries

    Identify aggregates that exceed a user-specified threshold r.

    One of the published algorithms to compute iceberg queries efficiently uses repeated hashing over multiple passes.*

    Basic idea:

    In the first pass, a set of counters is maintained. Each incoming item is hashed to one of the counters, which is incremented.

    These counters are then compressed to a bitmap, with a 1 denoting a large counter value.

    In the second pass, exact frequencies are maintained only for those elements that hash to a counter whose bitmap value is 1.

    This algorithm is difficult to adapt for streams because it requires two passes.

    * M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. Ullman. Computing iceberg queries efficiently. In Proc. of 24th Intl. Conf. on Very Large Data Bases, pages 299-310, 1998.


    Association Rules - Definitions

    Transaction: a subset of items drawn from I, the universe of all items.

    An itemset X ⊆ I has support s if X occurs as a subset in at least a fraction s of all transactions.

    Association rules over a set of transactions are of the form X => Y, where X and Y are subsets of I such that X ∩ Y = ∅ and X ∪ Y has support exceeding a user-specified threshold s.

    The confidence of a rule X => Y is the value support(X ∪ Y) / support(X).


    Example - Market basket analysis

    Transaction Id    Purchased Items
    1                 {1, 2, 3}
    2                 {1, 4}
    3                 {1, 3}
    4                 {2, 5, 6}

    For support = 50%, confidence = 50%, we have the following rules:
    1 => 3 with 50% support and 66% confidence
    3 => 1 with 50% support and 100% confidence


    Reduce to computing frequent itemsets

    TID Items

    1 {1, 2, 3}

    2 {1, 3}

    3 {1, 4}

    4 {2, 5, 6}

    Frequent Itemset Support

    {1} 75%

    {2} 50%

    {3} 50%

    {1, 3} 50%

    For support = 50%, confidence = 50%.
    For the rule 1 => 3:
    Support = Support({1, 3}) = 50%
    Confidence = Support({1, 3}) / Support({1}) = 66%
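    As a sanity check, support and confidence for the toy data above can be computed with a brute-force Python snippet (our own illustration; fine for four transactions, hopeless for a real stream):

        transactions = [{1, 2, 3}, {1, 3}, {1, 4}, {2, 5, 6}]

        def support(itemset):
            # fraction of transactions that contain every item of itemset
            return sum(itemset <= t for t in transactions) / len(transactions)

        def confidence(x, y):
            # confidence(X => Y) = support(X u Y) / support(X)
            return support(x | y) / support(x)

        print(support({1, 3}))       # 0.5
        print(confidence({1}, {3}))  # 0.666...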


    Toivonen's algorithm

    Based on sampling of the data stream.

    Basically, in the first pass, frequencies are computed for a sample of the stream, and in the second pass the validity of these items is determined.

    Can be adapted for data streams.

    Problems:
    - False negatives occur because the error in frequency counts is two-sided.
    - For small values of ε, the number of samples is enormous, ~ 1/ε (100 million samples).


    Network flow identification

    Flow: a sequence of transport-layer packets that share the same source and destination addresses.

    Estan and Varghese proposed an algorithm for identifying flows that exceed a certain threshold. Their algorithm is a combination of repeated hashing and sampling, similar to the algorithms for iceberg queries.

    The algorithm presented in this paper is directly applicable to the problem of network flow identification, and beats their algorithm in terms of space requirements.


    Problem definition


    Problem Definition

    The algorithm accepts two user-specified parameters:
    - support threshold s ∈ (0, 1)
    - error parameter ε ∈ (0, 1)

    N denotes the current length of the stream.


    Approximation guarantees

    1. All item(set)s whose true frequency exceeds sN are output. There are no false negatives.

    2. No item(set) whose true frequency is less than (s - ε)N is output.

    3. Estimated frequencies are less than the true frequencies by at most εN.


    Input Example

    s = 0.1%. As a rule of thumb, ε should be set to one-tenth or one-twentieth of s; here ε = 0.01%.

    As per property 1, ALL elements with frequency exceeding 0.1% will be output.

    As per property 2, NO element with frequency below 0.09% will be output.

    Elements between 0.09% and 0.1% may or may not be output; those that make their way into the output are false positives.

    As per property 3, all estimated frequencies are less than their true frequencies by at most 0.01%.


    Problem Definition cont.

    An algorithm maintains an ε-deficient synopsis if its output satisfies the aforementioned properties.

    Goal: devise algorithms that support an ε-deficient synopsis using as little main memory as possible.


    The Algorithms for Frequent Items

    Sticky Sampling

    Lossy Counting


    Sticky Sampling Algorithm

    Stream

    Create counters by sampling



    Notations

    Data structure S: a set of entries of the form (e, f).

    f estimates the frequency of an element e.

    r: the sampling rate. Sampling an element with rate r means we select the element with probability 1/r.


    Sticky Sampling cont.

    Initially S is empty, r = 1.
    For each incoming element e:
        if (e exists in S)
            increment corresponding f
        else {
            sample element with rate r
            if (sampled)
                add entry (e, 1) to S
            else
                ignore
        }


    The sampling rate

    Let t = (1/ε) · log(1/(sδ)), where δ is the probability of failure.

    The first 2t elements are sampled at rate = 1

    The next 2t elements at rate = 2

    The next 4t elements at rate = 4

    and so on


    Sticky Sampling cont.

    Whenever the sampling rate r changes:
    for each entry (e, f) in S
        repeat {
            toss an unbiased coin
            if (toss is not successful)
                diminish f by one
            if (f == 0) {
                delete entry from S
                break
            }
        } until toss is successful
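    Putting the last few slides together, here is a minimal Python sketch of Sticky Sampling; a sketch under the stated assumptions, with class and method names of our own invention:

        import math
        import random

        class StickySampling:
            # s = support, eps = error, delta = failure probability
            def __init__(self, s, eps, delta):
                self.s, self.eps = s, eps
                self.t = (1.0 / eps) * math.log(1.0 / (s * delta))
                self.S = {}               # element -> estimated frequency f
                self.r = 1                # current sampling rate
                self.n = 0                # number of elements seen so far
                self.limit = 2 * self.t   # first 2t elements at rate 1

            def process(self, e):
                self.n += 1
                if self.n > self.limit:   # rate change: r doubles
                    self.r *= 2
                    self.limit += self.r * self.t
                    self._adjust()
                if e in self.S:
                    self.S[e] += 1
                elif random.random() < 1.0 / self.r:   # sample with rate r
                    self.S[e] = 1

            def _adjust(self):
                # per entry: toss coins until success, decrementing f on failure
                for e in list(self.S):
                    while random.random() < 0.5:       # unsuccessful toss
                        self.S[e] -= 1
                        if self.S[e] == 0:
                            del self.S[e]
                            break

            def output(self):
                # entries with f >= (s - eps) * N
                return [e for e, f in self.S.items()
                        if f >= (self.s - self.eps) * self.n]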


    Sticky Sampling cont.

    The number of unsuccessful coin tosses follows a geometric distribution.

    Effectively, after each rate change S is transformed to exactly the state it would have been in if the new rate had been used from the beginning.

    When a user requests a list of items with threshold s, the output is those entries in S where f ≥ (s - ε)N.


    Theorem 1

    Sticky Sampling computes an ε-deficient synopsis with probability at least 1 - δ, using at most (2/ε) · log(1/(sδ)) expected number of entries.
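    To make the bound concrete (our own worked example, taking natural logarithms and choosing δ = 0.01% for illustration): with s = 1% and ε = 0.1%, the expected number of entries is (2/ε) · log(1/(sδ)) = 2000 · log(10^6) ≈ 27.6K, which matches the 27K worst-case figure for sticky sampling in the comparison table later in this deck.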


    Theorem 1 - proof

    The first 2t elements find their way into S.

    When r ≥ 2: N = rt + rt′ (t′ ∈ [1, t)) => 1/r ≥ t/N.

    An error in the frequency of e corresponds to a sequence of unsuccessful coin tosses during the first few occurrences of e. The probability that this sequence is longer than εN is at most (1 - 1/r)^(εN) ≤ (1 - t/N)^(εN) < e^(-εt).

    The number of elements with frequency exceeding sN is no more than 1/s => the probability that the estimate for any of them is deficient by εN is at most e^(-εt)/s.


    Theorem 1 - proof cont.

    The probability of failure should be at most δ. This yields

    e^(-εt)/s ≤ δ  =>  t ≥ (1/ε) · log(1/(sδ))

    Since the space requirement is 2t, the space bound follows.


    Sticky Sampling - summary

    The algorithm is called sticky sampling because S sweeps over the stream like a magnet, attracting all elements which already have an entry in S.

    The space complexity is independent of N.

    The idea of maintaining samples was first presented by Gibbons and Matias, who used it to solve the top-k problem.

    This algorithm is different in that the sampling rate r increases logarithmically to produce ALL items with frequency exceeding sN, not just the top k.


    Lossy Counting

    bucket 1    bucket 2    bucket 3

    Divide the stream into buckets.
    Keep exact counters for items within a bucket.
    Prune entries at bucket boundaries.


    Lossy Counting cont.

    A deterministic algorithm that computes frequency counts over a stream of single-item transactions, satisfying the guarantees outlined earlier using at most (1/ε) · log(εN) space, where N denotes the current length of the stream.

    The user specifies two parameters:
    - support s
    - error ε


    Definitions

    The incoming stream is conceptually divided into buckets of width w = ceil(1/ε).

    Buckets are labeled with bucket ids, starting from 1.

    Denote the current bucket id by b_current, whose value is ceil(N/w).

    Denote by f_e the true frequency of an element e in the stream seen so far.

    Data structure D is a set of entries of the form (e, f, Δ).


    The algorithm

    Initially D is empty.
    Receive element e:
        if (e exists in D)
            increment its frequency (f) by 1
        else
            create a new entry (e, 1, b_current - 1)

    At a bucket boundary, prune D by the following rule:
    (e, f, Δ) is deleted if f + Δ ≤ b_current.

    When the user requests a list of items with threshold s, output those entries in D where f ≥ (s - ε)N.
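    A minimal Python sketch of Lossy Counting as just described (class and method names are our own):

        import math

        class LossyCounting:
            # s = support threshold, eps = error parameter
            def __init__(self, s, eps):
                self.s, self.eps = s, eps
                self.w = math.ceil(1.0 / eps)   # bucket width
                self.D = {}                     # element -> (f, delta)
                self.N = 0                      # stream length so far

            def process(self, e):
                self.N += 1
                b_current = math.ceil(self.N / self.w)
                if e in self.D:
                    f, delta = self.D[e]
                    self.D[e] = (f + 1, delta)
                else:
                    self.D[e] = (1, b_current - 1)
                if self.N % self.w == 0:        # bucket boundary: prune
                    self.D = {x: (f, d) for x, (f, d) in self.D.items()
                              if f + d > b_current}

            def output(self):
                # entries with f >= (s - eps) * N
                return [x for x, (f, d) in self.D.items()
                        if f >= (self.s - self.eps) * self.N]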


    Some algorithm facts

    For an entry (e, f, Δ), f represents the exact frequency count of e ever since it was inserted into D.

    The value Δ is the maximum number of times e could have occurred in the first b_current - 1 buckets (this value is exactly b_current - 1 at insertion time).

    Once an entry is inserted into D, its value Δ is unchanged.
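    For example (our own illustration), with ε = 0.1% (w = 1000), an element e first seen while bucket 5 is being processed gets the entry (e, 1, 4): the Δ = 4 says that e could already have occurred up to 4 times in buckets 1-4, being inserted and pruned along the way, without the algorithm having any record of it.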


    Lossy counting in action

    Initially D is empty.

    (figure: frequency counts accumulate over the first bucket; at the window boundary, entries with f + Δ ≤ b_current are removed)


    Lossy counting in action cont.

    (figure: frequency counts for the next bucket are added to D; at the window boundary, entries with f + Δ ≤ b_current are removed)


    Lemma 1

    Whenever deletions occur, b_current ≤ εN.

    Proof: Deletions occur at bucket boundaries, where N = b_current · w.
    Since w = ceil(1/ε) ≥ 1/ε, we have εN = ε · b_current · w ≥ b_current.


    Lemma 2

    Whenever an entry (e, f, Δ) gets deleted, f_e ≤ b_current.

    Proof by induction.

    Base case: b_current = 1. (e, f, Δ) is deleted only if f = 1, thus f_e ≤ b_current (here f_e = f).

    Induction step:
    - Consider an entry (e, f, Δ) that gets deleted for some b_current > 1.
    - This entry was inserted when bucket Δ + 1 was being processed.
    - An earlier entry for e could have been deleted as late as the time bucket Δ became full.
    - By induction, the true frequency of e when this entry was inserted was no more than Δ.
    - f is the exact frequency of e since the entry was inserted.
    - Thus f_e ≤ f + Δ; combined with the deletion rule f + Δ ≤ b_current, this gives f_e ≤ b_current.


    Lemma 3

    If e does not appear in D, then f_e ≤ εN.

    Proof: If the lemma is true for an element e whenever its entry gets deleted, it is true at all other times N as well. From Lemmas 1 and 2 we infer that f_e ≤ b_current ≤ εN whenever its entry gets deleted.


    Lemma 4

    If (e, f, Δ) ∈ D, then f ≤ f_e ≤ f + εN.

    Proof: If Δ = 0, then f = f_e. Otherwise, e was possibly deleted in the first Δ buckets.

    From Lemma 2, f_e ≤ f + Δ.

    Δ ≤ b_current - 1 ≤ εN.

    Conclusion: f ≤ f_e ≤ f + εN.


    Lossy Counting cont.

    Lemma 3 shows that all elements whose true frequency exceeds εN have entries in D.

    Lemma 4 shows that the estimated frequencies of all such elements are accurate to within εN.

    => D correctly maintains an ε-deficient synopsis.


    Theorem 2

    Lossy Counting computes an ε-deficient synopsis using at most (1/ε) · log(εN) entries.


    Theorem 2 - proof

    Let B = b_current. Let d_i denote the number of entries in D whose bucket id is B - i + 1 (i ∈ [1, B]).

    An element e corresponding to such an entry must occur at least i times in buckets B - i + 1 through B; otherwise it would have been deleted.

    We get the following constraint:

    (1)   Σ_{i=1..j} i · d_i ≤ j · w    for j = 1, 2, ..., B


    Theorem 2 - proof cont.

    The following inequality can be proved by induction:

    Σ_{i=1..j} d_i ≤ Σ_{i=1..j} w/i    for j = 1, 2, ..., B

    |D| = Σ_{i=1..B} d_i

    From the above inequality, and since the harmonic sum Σ_{i=1..B} 1/i is about log B:

    |D| ≤ Σ_{i=1..B} w/i ≤ (1/ε) · log B = (1/ε) · log(εN)


    Sticky Sampling vs. Lossy Counting

    Support s = 1%, error ε = 0.1%

    (figure: number of entries vs. log10 of N, the stream length)


    Sticky Sampling vs. Lossy Counting cont.

    (figure: number of entries vs. N, the stream length)

    Kinks in the curve for sticky sampling correspond to re-sampling.
    Kinks in the curve for lossy counting correspond to bucket boundaries.


    Sticky Sampling vs. Lossy Counting cont.

    SS - Sticky Sampling, LC - Lossy Counting
    Zipf - Zipfian distribution, Uniq - stream with no duplicates

    ε        s       SS worst  LC worst  SS Zipf  LC Zipf  SS Uniq  LC Uniq
    0.1%     1.0%    27K       9K        6K       419      27K      1K
    0.05%    0.5%    58K       17K       11K      709      58K      2K
    0.01%    0.1%    322K      69K       37K      2K       322K     10K
    0.005%   0.05%   672K      124K      62K      4K       672K     20K


    Sticky Sampling vs. Lossy Counting - summary

    Lossy counting is superior by a large factor.

    Sticky sampling performs worse because of its tendency to remember every unique element that gets sampled.

    Lossy counting is good at pruning low-frequency elements quickly.


    Comparison with alternative approaches

    Toivonen's sampling algorithm for association rules: sticky sampling beats this approach by roughly a factor of 1/ε.


    Comparison with alternative approaches cont.

    [KPS02]: in the first pass, the algorithm maintains 1/ε elements with their frequencies. If a counter exists for an element it is increased; if there is a free counter the element is inserted; otherwise all existing counters are reduced by one.

    It can be used to maintain an ε-deficient synopsis with exactly 1/ε space.

    If the input stream is Zipfian, Lossy Counting takes less than 1/ε space: for ε = 0.01%, roughly 2000 entries, about 20% of 1/ε.
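    For reference, a minimal sketch of this counter-based scheme (the Karp-Papadimitriou-Shenker / Misra-Gries style algorithm); the function name kps_counters is our own:

        import math

        def kps_counters(stream, eps):
            # one pass; maintain at most k = ceil(1/eps) counters
            k = math.ceil(1 / eps)
            counters = {}
            for e in stream:
                if e in counters:
                    counters[e] += 1
                elif len(counters) < k:
                    counters[e] = 1
                else:
                    # no free counter: decrement all, dropping zeros
                    for key in list(counters):
                        counters[key] -= 1
                        if counters[key] == 0:
                            del counters[key]
            return counters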


    Frequent Sets of Items

    From Theory to Practice


    Frequent Sets of Items

    Stream

    Identify all subsets of items whose current frequency exceeds s = 0.1%.


    Frequent itemsets algorithm

    Input: a stream of transactions, where each transaction is a set of items from I.

    N: length of the stream.

    The user specifies two parameters: support s, error ε.

    Challenges:
    - handling variable-sized transactions
    - avoiding explicit enumeration of all subsets of any transaction


    Notations

    Data structure D: a set of entries of the form (set, f, Δ).

    Transactions are divided into buckets.

    w = ceil(1/ε): the number of transactions in each bucket.

    b_current: the current bucket id.

    Transactions are not processed one by one. Main memory is filled with as many transactions as possible, and processing is done on a batch of transactions.

    B: the number of buckets in main memory in the current batch being processed.


    Update D

    UPDATE_SET: for each entry (set, f, Δ) ∈ D, update f by counting the occurrences of set in the current batch. If the updated entry satisfies f + Δ ≤ b_current, delete the entry.

    NEW_SET: if a set set has frequency f ≥ B in the current batch and does not already occur in D, create a new entry (set, f, b_current - B).
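    A minimal sketch of the two batch operations (helper names are ours). Note that it enumerates every subset of every transaction by brute force, which is exactly what the real SetGen module avoids; it is only meant to pin down the semantics of UPDATE_SET and NEW_SET:

        from itertools import combinations

        def count_occurrences(itemset, batch):
            # number of transactions in the batch that contain itemset
            return sum(itemset <= t for t in batch)

        def process_batch(D, batch, b_current, B):
            # UPDATE_SET: refresh entries, prune by f + delta <= b_current
            for s in list(D):
                f, delta = D[s]
                f += count_occurrences(s, batch)
                if f + delta <= b_current:
                    del D[s]
                else:
                    D[s] = (f, delta)
            # NEW_SET: insert subsets occurring at least B times in the batch
            freq = {}
            for t in batch:
                for r in range(1, len(t) + 1):
                    for sub in combinations(sorted(t), r):
                        key = frozenset(sub)
                        freq[key] = freq.get(key, 0) + 1
            for s, f in freq.items():
                if f >= B and s not in D:
                    D[s] = (f, b_current - B)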


    Algorithm facts

    If f_set ≥ εN, then set has an entry in D.

    If (set, f, Δ) ∈ D, then the true frequency f_set satisfies the inequality f ≤ f_set ≤ f + Δ.

    When a user requests a list of items with threshold s, output those entries in D where f ≥ (s - ε)N.

    B needs to be a large number: any subset of I that occurs B + 1 times or more contributes to D.


    Three modules

    BUFFER: repeatedly reads in a batch of transactions into available main memory.

    TRIE: maintains the data structure D.

    SUBSET-GEN: operates on the current batch of transactions.

    Together, these modules implement UPDATE_SET and NEW_SET.


    Module 1 - BUFFER

    bucket 1  bucket 2  bucket 3  bucket 4  bucket 5  bucket 6   (in main memory)

    Reads a batch of transactions into available main memory.

    Transactions are laid out one after the other in a big array.
    A bitmap is used to remember transaction boundaries.
    After reading in a batch, BUFFER sorts each transaction by its item-ids.
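    A tiny sketch of this layout (our own illustration): transactions are packed into one flat array, and a bitmap marks where each transaction begins:

        def pack(batch):
            # flatten sorted transactions into one array; bitmap marks starts
            arr, bitmap = [], []
            for t in batch:
                items = sorted(t)          # BUFFER sorts each transaction
                bitmap += [1] + [0] * (len(items) - 1)
                arr += items
            return arr, bitmap

        arr, bitmap = pack([{3, 1, 2}, {7, 5}])
        print(arr)     # [1, 2, 3, 5, 7]
        print(bitmap)  # [1, 0, 0, 1, 0]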


    Module 2 - TRIE

    (figure: a trie over the item-ids 50, 40, 30, 31, 29, 45, 32, 42, flattened into the array 50 40 30 31 29 45 32 42)

    Sets with frequency counts


    Module 2 - TRIE cont.

    Nodes are labeled {item-id, f, Δ, level}.

    Children of any node are ordered by their item-ids.

    Root nodes are also ordered by their item-ids.

    A node represents an itemset consisting of the item-ids in that node and all of its ancestors.

    The TRIE is maintained as an array of entries of the form {item-id, f, Δ, level} (a pre-order traversal of the trees), equivalent to a lexicographic ordering of the subsets it encodes.

    There are no pointers; the levels compactly encode the underlying tree structure.
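    To illustrate the pointer-free layout, here is a small sketch (with made-up example data) that recovers each node's itemset from the flat pre-order array of {item-id, f, Δ, level} entries:

        def itemsets(trie_array):
            # trie_array: (item_id, f, delta, level) entries in pre-order;
            # a node's itemset is the item-ids on the path from its root to it
            path, result = [], []
            for item_id, f, delta, level in trie_array:
                del path[level:]          # pop back up to this node's depth
                path.append(item_id)
                result.append((list(path), f, delta))
            return result

        # Made-up data: roots 30 and 40; 30 has children 31, 32; 31 has child 35.
        example = [(30, 9, 0, 0), (31, 5, 0, 1), (35, 2, 1, 2),
                   (32, 4, 1, 1), (40, 7, 0, 0)]
        for s, f, d in itemsets(example):
            print(s, f, d)    # [30] 9 0, [30, 31] 5 0, [30, 31, 35] 2 1, ...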


    Module 3 - SetGen

    (figure: SetGen reads the sorted BUFFER and emits frequency counts of subsets in lexicographic order)

    SetGen uses the following pruning rule: if a subset S does not make its way into the TRIE after application of both UPDATE_SET and NEW_SET, then no supersets of S should be considered.


    Overall Algorithm

    (figure: BUFFER feeds SUBSET-GEN; the subset frequencies it emits are merged with the old TRIE to produce a new TRIE)


    Efficient Implementations - BUFFER

    If item-ids are successive integers from 1 through |I|, and I is small enough (less than 1 million items): maintain exact frequency counts for singleton sets, prune away those item-ids whose frequency is less than εN, and then sort the transactions.

    If |I| = 10^5, the array size is 0.4 MB.


    Efficient Implementations - TRIE

    Take advantage of the fact that the sets produced by SetGen arrive in lexicographic order.

    Maintain the TRIE as a set of fairly large chunks of memory instead of one huge array.

    Instead of modifying the original TRIE, create a new TRIE.

    Chunks from the old TRIE are freed as soon as they are no longer required.

    By the time SetGen finishes, the chunks of the original TRIE have been discarded.


    Efficient Implementations - SetGen

    Employs a priority queue called Heap.

    Initially, Heap contains pointers to the smallest item-ids of all transactions in BUFFER.

    Duplicate members are maintained together and constitute a single item in the Heap; all of their pointers are chained together.

    The space for the chains is derived from BUFFER by overwriting item-ids with pointers.


    Efficient Implementations - SetGen cont.

    (figure: the heap holds pointers into the buffer array; pointers to duplicate item-ids, e.g. all occurrences of item-id 1, are chained together)


    Efficient Implementations - SetGen cont.

    Repeatedly process the smallest item-id in Heap to generate singleton sets.

    If the singleton belongs to the TRIE after UPDATE_SET and NEW_SET, try to generate the next set by extending the current singleton set.

    This is done by invoking SetGen recursively with a new Heap created out of the successors of the pointers to the item-ids just processed and removed.

    When the recursive call returns, the smallest entry in Heap is removed and all successors of the currently smallest item-id are added.


    Efficient Implementations - SetGen cont.

    (figure: after the smallest item-id is processed, the heap is updated with the successors of its pointers, e.g. item-ids 2 and 3)


    System issues and optimizations

    BUFFER scans the incoming stream by memory-mapping the input file.

    Standard qsort is used to sort transactions.

    Threading SetGen and BUFFER does not help, because SetGen is significantly slower.

    The rate at which tries are scanned is much smaller than the rate at which sequential disk I/O can be done, so it is possible to maintain the TRIE on disk without loss of performance.


    System issues and optimizations - TRIE on disk: advantages

    The size of the TRIE is not limited by main memory, so the algorithm can function with a low amount of main memory.

    Since most available main memory can be devoted to BUFFER, the algorithm can handle smaller values of ε than other algorithms can.


    Novel features of this technique

    No candidate generation phase.

    The compact disk-based trie is novel.

    Able to compute frequent itemsets under low memory conditions.

    Able to handle smaller values of the support threshold than previously possible.


    Experimental results


    Experimental Results

    IBM synthetic dataset T10.I4.1000K: N = 1 million, avg. transaction size = 10, input size = 49 MB

    IBM synthetic dataset T15.I6.1000K: N = 1 million, avg. transaction size = 15, input size = 69 MB

    Frequent word pairs in 100K web documents: N = 100K, avg. transaction size = 134, input size = 54 MB

    Frequent word pairs in 806K Reuters news reports: N = 806K, avg. transaction size = 61, input size = 210 MB


    What is studied

    Support threshold s

    Number of transactions N

    Size of BUFFER B

    Total time taken t

    with ε set to 0.1 · s


    Varying buffer sizes and support s

    (figures: time taken in seconds vs. support threshold s, for B = 4, 16, 28 and 40 MB; left: IBM test dataset T10.I4.1000K, right: Reuters 806K docs)

    Decreasing s leads to increases in running time.


    Varying support s and buffer size B

    (figures: time taken in seconds vs. BUFFER size in MB; left: IBM test dataset T10.I4.1000K for s = 0.001, 0.002, 0.004, 0.008; right: Reuters 806K docs for s = 0.004, 0.008, 0.012, 0.016, 0.020)

    Kinks occur due to a TRIE optimization on the last batch.


    Varying length N and support s

    (figures: time taken in seconds vs. length of stream in thousands, for s = 0.001, 0.002, 0.004; left: IBM test dataset T10.I4.1000K, right: Reuters 806K docs)

    Running time is linearly proportional to the length of the stream. The curve flattens at the end because processing the last batch is faster.


    Comparison with Apriori

                Apriori             Our Algorithm         Our Algorithm
                                    with 4 MB Buffer      with 44 MB Buffer
    Support     Time    Memory      Time     Memory       Time    Memory
    0.001       99 s    82 MB       111 s    12 MB        27 s    45 MB
    0.002       25 s    53 MB       94 s     10 MB        15 s    45 MB
    0.004       14 s    48 MB       65 s     7 MB         8 s     45 MB
    0.006       13 s    48 MB       46 s     6 MB         6 s     45 MB
    0.008       13 s    48 MB       34 s     5 MB         4 s     45 MB
    0.010       14 s    48 MB       26 s     5 MB         4 s     45 MB

    Dataset: IBM T10.I4.1000K with 1M transactions, average size 10.


    Comparison with Iceberg Queries

    Query: identify all word pairs in 100K web documents which co-occur in at least 0.5% of the documents.

    [FSGM+98] multiple-pass algorithm: 7000 seconds with 30 MB memory

    Our single-pass algorithm: 4500 seconds with 26 MB memory


    Summary

    A novel algorithm for computing approximate frequency counts over data streams.


    Summary - Advantages of the algorithms presented

    Require provably small main memory footprints.

    Each of the motivating examples can now be solved over streaming data.

    Handle smaller values of the support threshold than previously possible.

    Remain practical in environments with moderate main memory.


    Summary cont.

    Give an a priori error guarantee.

    Work for variable-sized transactions.

    Optimized implementation for frequent itemsets.

    For the datasets tested, the algorithm runs in one pass and produces exact results, beating previous algorithms in terms of time.


    Questions?

    More questions/comments can be sent to [email protected]