XY-Sketch: On Sketching Data Streams at Web Scale
Yongqiang Liu
University of Science and Technology of China
Hefei, China
Xike Xie
University of Science and Technology of China
Hefei, China
ABSTRACT
Conventional sketching methods for counting stream item frequencies use hash functions to map data items to a concise structure, e.g., a two-dimensional array, at the expense of overcounting due to hashing collisions. Despite their popularity, the errors accumulated from hashing collisions deteriorate sketching accuracy as data grow rapidly, which poses a great challenge to sketching big data streams at web scale. In this paper, we propose a novel structure, called XY-sketch, which estimates the frequency of a data item by estimating the probability of this item appearing in the data stream. The framework associated with XY-sketch consists of two phases, namely the decomposition and recomposition phases. A data item is split into a set of compactly stored basic elements, which can be strung together in a probabilistic manner for query evaluation during the recomposition phase. Throughout, we conduct optimization under space constraints and detailed theoretical analysis. Experiments on both real and synthetic datasets show superior scalability on sketching large-scale streams. Remarkably, XY-sketch is orders of magnitude more accurate than existing solutions when the space budget is small.
KEYWORDS
Data streams; Sketch; Data structures
1 INTRODUCTION
In many applications, big data streams are continuously and automatically generated, such as web clicks [25], emails [17], financial data trackers [31], sensor networks [16], network traffic [13] and social network interactions [14] [22]. For example, in the U.S., the market of online advertising was reported as 100 billion dollars in 2018 [23], and is likely to expand to 230 billion dollars in the near future [20]. The industry relies on tracking the web-click streams of billions of users, and counting many combinations of events, leading to a blow-up in the number of counting items [11][20]. Hence, it is desirable to have a compact yet scalable structure for supporting emerging applications such as web-scale stream analytics.
Sketches are compact data structures that take small space to support high-quality approximate queries over data streams [7] [5] [9] [19] [6] [30]. Different from conventional database processing that requires multiple passes, data stream processing is often done sequentially in one pass, empowered by sketches. A core task of sketches is to estimate item frequencies of data streams. State-of-the-art solutions include CM-sketch [7], C-sketch [5], CU-sketch [9], A-sketch [19], Cold Filter [30], and so on. They adopt a similar underlying structure, which is essentially a $d \times w$ array of counters for storing item frequencies. Each of the $d$ rows of the array is associated with a hash function for mapping items to $w$ counters, so that different items may collide on the same counter. These collisions pose a great challenge to sketch scalability, especially in the face of big data streams.
Sketch scalability is of paramount importance in big data streaming scenarios, such as online advertising and social network tracking, where data items come with unprecedentedly expanding domains (numbers of distinct items) and growing volumes. Let $f_i$ be the true frequency of item $x_i$ and $\hat{f}_i$ be the estimated frequency of $x_i$. We investigate how the estimation accuracy scales. Equivalently, we investigate how the frequency estimation error, $|\hat{f}_i - f_i|$, scales with respect to the growth of the total number of items $N$ and the number of distinct items $n$, namely $N$-scalability and $n$-scalability.
Table 1: Error Bounds for Sketches (with probability at least $1 - \delta$)
XY-sketch: $\frac{2}{\delta n}\sum_{i=1}^{n} f_i$;  CM-sketch: $\frac{e}{w}\sum_{i=1}^{n} f_i$;  C-sketch: $\frac{8}{\sqrt{w}}\sqrt{\sum_{i=1}^{n} f_i^2}$
We can analyze the scalability based on the estimation error bounds of sketches, as shown in Table 1, which connects the estimation accuracy with the two scalability factors, $n$ and $N$. Of a stream, $N$ refers to the total number of items, and $n$ refers to the number of distinct items. From Table 1, the error bound of C-sketch is $\frac{8}{\sqrt{w}}\sqrt{\sum_{i=1}^{n} f_i^2}$ [5]. By transforming the error bound from the $L_2$ norm to the $L_1$ norm, it can be represented as $\frac{8}{\sqrt{wn}}N$,¹ meaning that the error bound is proportional to $N$ and inversely proportional to $\sqrt{wn}$. For example, if $n$ is increased to $n_1$, the error bound of C-sketch shrinks to $\sqrt{n/n_1}$ of its original bound, resulting in a lower error bound. In contrast, the error bound of XY-sketch shrinks to $n/n_1$ of the original one, which shows better scalability
in terms of $n$.

In this work, we propose XY-sketch, a novel sketching technique that tackles the scalability challenges by adopting a novel decomposition-and-recomposition framework. For the first time, we estimate the probability of a data item appearing in the data stream in order to count the stream item frequency. Then, to count an item of a stream, one can simply multiply the probability by the total number of items $N$. The basic idea is to decompose an item into a sequence of elements that need much smaller storage space. During the query phase, the decomposed elements can be recomposed for frequency estimation in a probabilistic manner. Both the decomposition and recomposition phases are enabled by bijective functions, which ensure that a decomposed item can be uniquely recomposed into its original form. This cannot be done with one-way functions, i.e., hashing functions.
¹According to the inequality $\frac{1}{\sqrt{n}}\sum_{i=1}^{n} f_i \le \sqrt{\sum_{i=1}^{n} f_i^2}$.
For hashing-function-based solutions, an item frequency can be falsely retrieved as the sum of multiple data items due to hashing collisions. Hashing collisions aggravate the situation especially when the space budget is small and the value of $n$ is big, as in web data streaming applications. For XY-sketch, there can be errors caused by the approximation of conditional probabilities in the decomposition-and-recomposition framework. We therefore conduct detailed analysis to gain theoretical confidence in bounding such errors. Extensive experiments on real and synthetic datasets show that our proposals are effective, especially when the space budget is small.
Our contributions can be summarized as follows.
• We propose a novel sketching technique, called XY-sketch, which utilizes the decomposition-and-recomposition framework.
• We conduct detailed theoretical analysis of the estimation error bounds to gain insights into scalability.
• We propose both a basic structure and an extended structure for XY-sketch. We also investigate corresponding optimization techniques for further enhancing the frequency estimation accuracy.
• We conduct extensive experiments on both real and synthetic datasets to evaluate the scalability of XY-sketch.
The rest of this paper is organized as follows. Section 2 discusses related work. Section 3 investigates the decomposition-and-recomposition framework, which formulates the basic structure of XY-sketch. Section 4 provides detailed theoretical analysis of the estimation error bounds. Section 5 proposes the extended structure of XY-sketch together with a series of optimization techniques. Section 6 reports the experimental results. Section 7 concludes the paper.
2 RELATED WORK
CM-sketch [7] consists of $d$ rows, each of which has a hashing function and a set of $w$ counters. When a new item arrives, for each of the $d$ rows, CM-sketch applies the corresponding hashing function to locate a counter and increments it by one; in total, $d$ counters are incremented. To retrieve an item's frequency, CM-sketch applies the $d$ hashing functions to find $d$ counters and reports the minimal value among them. CU-sketch [9] is similar to CM-sketch except that it adopts conservative updating, which only increments the counter(s) with the minimum value among the $d$ mapped counters.
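To make the update and query procedures concrete, the following is a minimal, illustrative CM-sketch in Python (our own sketch, not code from the cited works); the per-row hash here is Python's built-in hash salted with a per-row value, which is an assumption for illustration only, not the hash family used in [7].

```python
import random

class CountMinSketch:
    """Minimal CM-sketch: a d x w array of counters, one hash function per row."""

    def __init__(self, d, w, seed=0):
        self.d, self.w = d, w
        rng = random.Random(seed)
        self.salts = [rng.getrandbits(64) for _ in range(d)]   # one salt per row
        self.counters = [[0] * w for _ in range(d)]

    def _col(self, row, item):
        # Row-specific hash; any pairwise-independent hash family could be used.
        return hash((self.salts[row], item)) % self.w

    def update(self, item, count=1):
        for row in range(self.d):                              # d counters are incremented
            self.counters[row][self._col(row, item)] += count

    def query(self, item):
        # Report the minimum of the d mapped counters (collisions only overestimate).
        return min(self.counters[row][self._col(row, item)] for row in range(self.d))
```

The CU-sketch variant would differ only in update(): it increments just the mapped counter(s) that currently hold the minimum value.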
C-sketch [5] has the same structure as CM-sketch, except that it maintains an extra hashing function for each row, which maps the arriving item to $\{-1, 1\}$. Thus, the extra hashing function helps in determining whether the corresponding counter should be updated positively or negatively. To retrieve an item's frequency, C-sketch reports the median of the $d$ mapped counters. CM-sketch and C-sketch are considered the two basic sketching techniques, of which most existing sketching techniques, such as Bias-sketch [6], A-sketch [19], and Cold Filter [30], are variants.
Bias-sketch [6] improves over C-sketch and CM-sketch by taking extra storage for samples of streaming items in order to avoid biased estimation. Bias-sketch focuses on recording and recovering the entire data stream, whose target differs from the item frequency estimation considered in our work. A-sketch [19] and Cold Filter [30] both use filters as auxiliary structures associated with basic sketches, e.g., CM-sketch. In particular, A-sketch uses filters for high-frequency items, whereas Cold Filter uses filters for low-frequency items. Hence, A-sketch achieves high accuracy in estimating high-frequency items. Cold Filter adopts a well-devised two-layered structure to achieve good accuracy for low-frequency item estimation. Meanwhile, it incurs parameter tuning issues for automatic configuration in practice.
Recently, MV-sketch [21] studies heavy hitter and heavy change queries, which differ from the item frequency estimation considered in this paper. SketchLearn [13] uses adaptive statistical inference to relieve users' approximate measurement burdens. Although it is versatile in addressing many types of queries, its performance on basic queries, e.g., point queries, is limited compared with CM-sketch, C-sketch, and their variants. There also exist other types of sketching techniques for various purposes. For example, OM-sketch [29] and Pyramid sketch [27] avoid counter overflows. Ada-sketch [20] achieves better frequency estimation for recent items than for old ones, by using techniques of digital Dolby noise reduction. Odd sketch [18] is a compact binary sketch for estimating the similarity of two sets, which is relevant in applications such as web duplicate detection and collaborative filtering.
3 BASIC STRUCTURE OF XY-SKETCH
3.1 Preliminaries
We consider a standard model, called the cash register model. Suppose a data stream $S_N$ with $N$ items and $n$ distinct items. The stream $S_N$ can be represented by a sequence $\langle e_1, ..., e_N \rangle$, where each item $e_i$ takes a value from the item set $\mathcal{X}$ ($e_i \in \mathcal{X}$). Notice that items in $\mathcal{X} = \{x_1, ..., x_n\}$ are distinct, i.e., $x_i \neq x_j$. The frequency $f_i$ equals the number of times item $x_i$ appears in the stream $S_N$. Next, we formally define bijective functions and basic elements, which form the foundation for the decomposition-and-recomposition framework.
Table 2: List of notations (Notation — Meaning)

$S_N$ — data stream of $N$ items
$N$ — total number of items
$n$ — total number of distinct items
$\mathcal{X}$ — domain of items, $|\mathcal{X}| = w^d$
$\mathcal{Y}$ — domain of elements, $|\mathcal{Y}| = w$
$x_i \in \mathcal{X}$ — an item in $\mathcal{X}$
$y_i^{(j)} \in \mathcal{Y}$ — $j$-th element of $x_i$
$Y_{d \times w}$ — matrix of $d$ rows and $w$ columns
$\mathbf{Y} = \langle Y^{(1)}, ..., Y^{(d)} \rangle$ — random variables for $d$ elements
$bit(x_i, k)$ — $k$-th bit (from the right) of $x_i$
$f_i$ ($\hat{f}_i$) — (estimated) frequency of $x_i$
$\varpi$ and $\varpi^{-1}$ — bijective function and its inverse
$d$ — an item has $d$ elements
$b$ — an element has $b$ bits
Definition 1 (Bijective Function $\varpi$). $\varpi$ is a bijective function which maps a data item to a sequence of $d$ elements, formally $\varpi: \mathcal{X} \to \mathcal{Y}^d$. For example, given item $x_i \in \mathcal{X}$, we have $\varpi(x_i) = \langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle$ and $\varpi^{-1}(\langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle) = x_i$, where $\{y_i^{(j)} \in \mathcal{Y}\}_{1 \le j \le d}$ are the elements of $x_i$.
An item or an element of the stream can equivalently be viewed as a sequence of bits, or a binary string. We assume that each element has $b$ bits and each item has $d$ elements and thus $d \times b$ bits, so that an item length is an integral multiple, i.e., $d$, of an element length². Essentially, an element of an item is an ordered arrangement of a subset of $b$ bits of the item. Within an item, all the $d$ elements are equal-sized and mutually exclusive.
The bijective function $\varpi$ represents the one-to-one correspondence between the two domains $\mathcal{X}$ and $\mathcal{Y}^d$. Once the bijective function is given, an item can be uniquely identified by its corresponding sequence of elements, and vice versa. We show an example in Figure 1. Given item $x_i = 101101011$, it can be decomposed into three elements $y_i^{(1)}$, $y_i^{(2)}$, and $y_i^{(3)}$ with $\varpi$. Element $y_i^{(1)} = 101$ is derived by taking the 1st, 5th, and 6th bits from $x_i$. Following the corresponding relation shown in Figure 1, $y_i^{(2)} = 011$ and $y_i^{(3)} = 101$ can be obtained similarly. Inversely, we can recompose $x_i$ from $\{y_i^{(j)}\}_{1 \le j \le 3}$ with $\varpi^{-1}$, the inverse function of $\varpi$.
There can be $(d \cdot b)!$ possible ways of mapping between items and elements, if an item has $d$ elements and an element has $b$ bits. Next, we introduce the concept of random permutation. Based on that, we define the random bijective function, which is general and therefore forms the basis for the theoretical analysis of XY-sketch.
Definition 2 (Random Permutation). Given a sequence of integers $\langle 1, 2, ..., d \cdot b \rangle$, a random permutation $\langle q_1, ..., q_s \rangle$ ($s = d \cdot b$) can be obtained by applying a randomly selected permutation of $(d \cdot b)$ to the sequence, or equivalently by choosing a random element from the set of distinct permutations of the sequence.
Given an item $x_i$, let $bit(x_i, k)$ be the $k$-th bit of the binary form of $x_i$, so that $x_i = \sum_{k=1}^{s} bit(x_i, k) \cdot 2^{k-1}$, where $s = d \cdot b$. Then, we formally define the random bijective function with a random permutation and the function $bit(\cdot, \cdot)$ in Definition 3. Since it is easy to prove that the mapping $\varpi^*$ defined by Definition 3 is a bijection, we omit the proof due to page limits.
Definition 3 (Random Bijective Function $\varpi^*$). $\varpi^*$ is a random mapping with a random permutation $Q = \langle q_1, ..., q_s \rangle$. For any $x_i \in \mathcal{X}$, $\varpi^*(x_i) = \langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle$, where $y_i^{(j)} = \sum_{k=1}^{b} bit(x_i, q_{(j-1) \cdot b + k}) \cdot 2^{k-1}$, for $j = 1, ..., d$.
Unless explicitly noted, the random bijective function is used for
item decomposition and recomposition by default.
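As a concrete illustration of Definitions 2 and 3 (a sketch of our own, assuming items are unsigned integers of exactly $s = d \cdot b$ bits and that $bit(x, k)$ denotes the $k$-th bit from the right), the decomposition and its inverse can be realized as follows.

```python
import random

def make_bijection(d, b, seed=0):
    """Return (decompose, recompose) realizing a random bijective function
    over items of s = d*b bits, following Definition 3."""
    s = d * b
    q = list(range(1, s + 1))          # bit positions 1..s
    random.Random(seed).shuffle(q)     # random permutation Q = <q1, ..., qs>

    def bit(x, k):                     # k-th bit from the right, k >= 1
        return (x >> (k - 1)) & 1

    def decompose(x):
        # y^(j) = sum_{k=1..b} bit(x, q_{(j-1)*b + k}) * 2^(k-1)
        return tuple(
            sum(bit(x, q[(j - 1) * b + k - 1]) << (k - 1) for k in range(1, b + 1))
            for j in range(1, d + 1)
        )

    def recompose(elements):
        # Write each element's bits back to the permuted positions of the item.
        x = 0
        for j, y in enumerate(elements, start=1):
            for k in range(1, b + 1):
                if (y >> (k - 1)) & 1:
                    x |= 1 << (q[(j - 1) * b + k - 1] - 1)
        return x

    return decompose, recompose
```

For example, with $d = b = 3$, decompose(0b101101011) yields three 3-bit elements that recompose back to the original item, matching the one-to-one correspondence required of $\varpi^*$.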
3.2 Decomposition and Recomposition
Data Structure. XY-sketch maintains a $d \times w$ matrix $Y_{d \times w}$ of counters. Suppose an item consists of $d$ elements. Each of the $d$ rows corresponds to the position at which an element is ordered within the item. Each of the $w$ columns corresponds to a possible value of an element. Initially, all counters of the matrix are set to zero. The mapping from stream items to matrix counters is implemented in accordance with $\varpi^*$, as detailed below.

²If items are of different lengths, we can round them up to the same length. For example, items of double-precision floating-point format can be rounded up to the length of 64, meaning that they are of 64 bits.

Figure 1: Schematic diagram of the decomposition-and-recomposition framework ($b = 3$, $d = 3$)

Figure 2: Basic Structure
Decomposition. Upon receiving an item from the data stream, XY-sketch decomposes it into a sequence of elements with $\varpi^*$. For each element, the corresponding counter of the matrix is retrieved and incremented by one. The process repeats until all the $d$ elements of the item are processed. Therefore, the time complexity is $O(d)$.
An example is shown in Figure 2, where a $2 \times 4$ matrix is utilized by XY-sketch for handling the data stream $\langle 0, 1, 2, ..., 15, 6, 7 \rangle$. Suppose item 6 is currently to be handled. It is decomposed into two elements 01 and 10 by function $\varpi^*$. The first element updates the counter at (row 1, column 01) by one. The second element updates the counter at (row 2, column 10) by one. After that, the decomposition processing of item 6 is done. Similarly, the next item 7 is decomposed into 01 and 11, and the counters at (row 1, column 01) and (row 2, column 11) are incremented by one, respectively.
We formalize the decomposition phase in Algorithm 1. When data item $x_i$ arrives, XY-sketch first splits $x_i$ into $d$ elements (line 3). Then, the counters of the $d$ elements in $Y_{d \times w}$ are found and incremented by 1 (lines 4-6).

Algorithm 1 Decomposition Phase
1: $Y_{d \times w} \leftarrow 0$
2: while data item $x_i$ in $S_N$ arrives do
3:   $\langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle \leftarrow \varpi^*(x_i)$
4:   for $j = 1$ to $d$ do
5:     $Y_{d \times w}[j, y_i^{(j)}] \leftarrow Y_{d \times w}[j, y_i^{(j)}] + 1$
6:   end for
7: end while
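A direct, illustrative transcription of Algorithm 1 in Python, reusing the make_bijection helper sketched above (an assumption-laden sketch rather than the authors' implementation):

```python
class XYSketchBasic:
    """Illustrative basic XY-sketch: d rows of w = 2**b counters (Y_{d x w})."""

    def __init__(self, d, b, seed=0):
        self.d, self.w = d, 1 << b
        self.decompose, _ = make_bijection(d, b, seed)   # random bijective function
        self.Y = [[0] * self.w for _ in range(d)]        # all counters start at zero
        self.N = 0                                       # total items seen so far

    def update(self, x):
        """Decomposition phase: O(d) counter increments per arriving item."""
        for j, y in enumerate(self.decompose(x)):        # <y^(1), ..., y^(d)>
            self.Y[j][y] += 1
        self.N += 1
```

Storing N explicitly is only a convenience; as noted in the recomposition phase below, it can equally be recovered by summing the counters of any single row.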
Recomposition. The recomposition phase works for query evaluation. In this work, we consider point queries, which are the most basic type of sketch-based queries.

Algorithm 2 Recomposition Phase
Input: A two-dimensional array $Y_{d \times w}$ and item $x_i$
Output: The estimated frequency $\hat{f}_i$ of item $x_i$
1: $\langle y_i^{(1)}, y_i^{(2)}, ..., y_i^{(d)} \rangle \leftarrow \varpi^*(x_i)$
2: $sum \leftarrow \sum_{k=1}^{w} Y_{d \times w}[1, k]$ and $\hat{f}_i \leftarrow sum$
3: for $j = 1$ to $d$ do
4:   $\hat{f}_i \leftarrow \hat{f}_i \times Y_{d \times w}[j, y_i^{(j)}] / sum$
5: end for
6: return $\hat{f}_i$
The process is depicted in Algorithm 2. Upon receiving a point query for the frequency of an item $x_i$, XY-sketch decomposes $x_i$ into a sequence of $d$ elements with $\varpi^*$. For each of the $d$ elements, we first find its corresponding counter in the respective row of the matrix. Let $y_i^{(j)}$ be the $j$-th element of $x_i$. For each row $j$, the probability that the $j$-th element takes the value $y_i^{(j)}$ is estimated by $Y_{d \times w}[j, y_i^{(j)}]/N$.³ The value of $N$ can be derived by summing the counters of any row of the matrix. Finally, we take the product of all the $d$ probabilities and $N$, which equals the estimated frequency $\hat{f}_i$ according to Equation 3. The result is returned as the frequency of item $x_i$ to answer the query (lines 3-6). Since there are $d$ probabilities to be found, the time complexity is $O(d)$. Next, we show an example of the calculation in the recomposition phase.
In Figure 2, there are in total 18 items in the stream, i.e., $N = 18$, which can be calculated by either $4 + 6 + 4 + 4$ or $4 + 4 + 5 + 5$. The corresponding counters of item 7's two elements, 01 and 11, are valued 6 and 5, respectively. The frequency of item 7 can thus be estimated by $18 \times \frac{6}{18} \times \frac{5}{18} \approx 1.67$, while the exact frequency of item 7 is 2. The accuracy for estimating $f_7$ is thus $1.67/2 \approx 84\%$. We will show that the estimation is very effective towards the scalability issues. More details on the error bound analysis are given in Section 4. Next, we explain the rationale behind the calculation of the recomposition phase.
³$Y_{d \times w}[j, y_i^{(j)}]$ records the number of items, among the $N$ data items, whose $j$-th basic element after decomposition is $y_i^{(j)}$. Therefore, the probability that the $j$-th element takes the value $y_i^{(j)}$ is $Y_{d \times w}[j, y_i^{(j)}]/N$.
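As a companion to Algorithm 2, the point query can be sketched as a small function operating on the illustrative XYSketchBasic instance introduced after Algorithm 1 (again our own sketch); on the Figure 2 example it reproduces $18 \times (6/18) \times (5/18) \approx 1.67$.

```python
def xy_query(sketch, x):
    """Recomposition phase (Algorithm 2): estimated frequency of item x, Equation 3."""
    n_total = sum(sketch.Y[0])                 # N equals the sum of any row's counters
    if n_total == 0:
        return 0.0
    est = float(n_total)
    for j, y in enumerate(sketch.decompose(x)):
        est *= sketch.Y[j][y] / n_total        # multiply by Pr(Y^(j) = y^(j))
    return est
```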
3.3 Analysis
The decomposition-and-recomposition framework is conceived and formulated with detailed analysis on balancing the tradeoff between space and accuracy. Let $X$ be the random variable for distinct item values of the stream. The probability $Pr(X = x_i)$ indicates the possibility that an item takes the value $x_i$, satisfying $\sum_{1 \le i \le n} Pr(X = x_i) = 1$. So, the frequency $f_i$ of item $x_i$ can be evaluated by the product of the total number of items $N$ and the probability $Pr(X = x_i)$:

$$f_i = N \cdot Pr(X = x_i) = N \cdot Pr(Y^{(1)} = y_i^{(1)}, Y^{(2)} = y_i^{(2)}, ..., Y^{(d)} = y_i^{(d)}) \quad (1)$$
Here, $\mathbf{Y} = \langle Y^{(1)}, ..., Y^{(d)} \rangle$ represents a sequence of random variables for the distinct element values obtained through the decomposition phase. Then, $Pr(Y^{(j)} = y_i^{(j)})$ is the probability that the $j$-th element of item $x_i$ takes the value $y_i^{(j)}$. By using the chain rule of probability theory, Equation 1 can be expanded as follows.

$$f_i = N \cdot Pr(Y^{(1)} = y_i^{(1)}) \cdot Pr(Y^{(2)} = y_i^{(2)} \mid Y^{(1)} = y_i^{(1)}) \cdots Pr(Y^{(d)} = y_i^{(d)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(d-1)} = y_i^{(d-1)}) \quad (2)$$
However, evaluating the exact conditional probabilities in Equation 2 is costly. For an element of length $b$, there can be $2^b$ different possible values for the element. Then, the calculation of Equation 2 requires storing $O(w^d)$ items, which is unaffordable and violates the compactness requirement of sketching techniques⁴. To this end, we study how to accurately approximate the calculation of the conditional probability, so as to accurately approximate and accelerate the calculation of the item frequency. Theoretically, if the random variable $Y^{(j)}$ is independent of or weakly dependent on the variables $\{Y^{(k)}\}_{k \neq j}$, the conditional probability can be well approximated and thus simplified by its unconditional counterpart. More details are covered in the analysis part (Section 4). The estimation of $f_i$ can thus be written as follows.

$$\hat{f}_i = N \prod_{1 \le j \le d} Pr(Y^{(j)} = y_i^{(j)}), \text{ where } Pr(Y^{(j)} = y_i^{(j)}) = \frac{Y_{d \times w}[j, y_i^{(j)}]}{\sum_{k=1}^{w} Y_{d \times w}[j, k]} = \frac{Y_{d \times w}[j, y_i^{(j)}]}{N} \quad (3)$$

This way, the space complexity is reduced from $O(w^d)$ to $O(wd)$, since only $\{Pr(Y^{(j)} = y_i^{(j)})\}$ is needed for the recomposition phase to evaluate $\hat{f}_i$.
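As a concrete illustration of this reduction (with numbers of our own choosing): with $b = 8$ and $d = 4$, so $w = 256$, the exact conditional form would require $w^d = 256^4 \approx 4.3 \times 10^9$ counters, whereas the approximation of Equation 3 needs only $w \cdot d = 1024$ counters.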
3.4 Discussion
In this section, we introduce the basic structure of XY-sketch, which is easy to implement and deploy. The basic data structure is a two-dimensional matrix $Y_{d \times w}$, where $w = 2^b$. The mechanism of the basic structure follows a decomposition-and-recomposition framework. During the query phase, the desired frequency can be estimated by collecting and evaluating relevant elements in a probabilistic manner.

⁴Intuitively, to calculate the term $Pr(Y^{(d)} = y_i^{(d)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(d-1)} = y_i^{(d-1)})$ in Equation 2, one needs to maintain a one-dimensional array $Y_{w^d}$ with $w^d$ entries, each of which records the exact frequency of one data item, so that $Pr(Y^{(d)} = y_i^{(d)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(d-1)} = y_i^{(d-1)}) = \frac{Y_{w^d}[y_i^{(1)} w^{d-1} + y_i^{(2)} w^{d-2} + \cdots + y_i^{(d)}]}{\sum_{k = y_i^{(1)} w^{d-1} + \cdots + y_i^{(d-1)} w}^{\,y_i^{(1)} w^{d-1} + \cdots + y_i^{(d-1)} w + w - 1} Y_{w^d}[k]}$.
In the extreme case, if $d$ equals 1, the decomposition mapping $\mathcal{X} \to \mathcal{Y}^d$ degenerates into $\mathcal{X} \to \mathcal{Y}$, so that an element degenerates into an item. In this case, $|\mathcal{X}|$ equals $n$, meaning that XY-sketch degenerates into a frequency histogram, which takes $O(n)$ space, although the estimation accuracy can be 100%. Again, the setting of $d = 1$ makes the solution not scalable and thus violates the compactness requirement of sketching. So, in practice, the value of $d$ is often greater than one. Actually, a larger $d$ corresponds to better space efficiency of the structure. The detailed analysis enabling the tuning of the trade-off between space efficiency and estimation accuracy is shown in the subsequent sections.
So far, several questions remain to be answered. 1) How good is the estimation $\hat{f}_i$, compared with state-of-the-art solutions, e.g., CM-sketch? 2) Parameter $w$ is set to be a power of 2; in such settings, the given space may not be fully used. We tackle these challenges in the following sections. In particular, the first question is covered in Section 4, and the second question is covered in Section 5.
4 ANALYSIS
In this section, we show the error bound analysis of XY-sketch. We first derive the general error bound, based on which we analyze the $N$- and $n$-scalability. Then, we study error bounds under uniform and Zipfian distributions, gaining insights into the sketching properties and performance.

4.1 General Error Bound and Scalability Analysis
We hereby investigate a general error bound for the item frequency estimation, as shown in Theorem 1.
Theorem 1. If we randomly select one of the $n$ items, with probability at least $1 - \delta$, we have $|\hat{f}_i - f_i| \le \frac{2}{n\delta}N$ for XY-sketch item frequency estimation.

Proof. Let $\zeta_i = |\hat{f}_i - f_i|$ be the random variable of the estimation error. Since $\zeta_i$, $\hat{f}_i$ and $f_i$ are all positive, it holds that $\zeta_i \le \max\{\hat{f}_i, f_i\}$. Since $\sum_i f_i = N$ and $\sum_i \hat{f}_i \le N$ (Lemma 1), we have $\sum_{i=1}^{n} \hat{f}_i + \sum_{i=1}^{n} f_i \le 2N$. Therefore,

$$\sum_{i=1}^{n} \zeta_i \le \sum_{i=1}^{n} \max\{\hat{f}_i, f_i\} \le \sum_{i=1}^{n} \hat{f}_i + \sum_{i=1}^{n} f_i \le 2N \quad (4)$$

Then, we use reductio ad absurdum to prove the theorem. In the case of $\delta n < 1$, if there exists $i' \in \{1, 2, ..., n\}$ such that $\zeta_{i'} > \frac{2}{n\delta}N$, then we have $\sum_{i=1}^{n} \zeta_i > 2N$, which contradicts Equation 4. In the other case, let $\delta < k < 1$, and suppose that the random variable of the error $\zeta_i$ is greater than $\frac{2}{n\delta}N$ with probability $k$. Thus, there exist distinct indices $i_1, i_2, ..., i_{kn} \in \{1, 2, ..., n\}$ such that $\zeta_{i_j} > \frac{2}{n\delta}N$, where $j = 1, 2, ..., kn$. If so, we get $\sum_{j=1}^{kn} \zeta_{i_j} > 2N$, which contradicts Equation 4. Therefore, the theorem is proved. □
Lemma 1. $\sum_{i=1}^{n} \hat{f}_i \le N$.

Proof. Let $\hat{p}_i$ be $\prod_{1 \le j \le d} Pr(Y^{(j)} = y_i^{(j)})$. Since $\hat{f}_i = N\hat{p}_i$ (Equation 3), we have $\sum_{i=1}^{n} \hat{f}_i = N \sum_{i=1}^{n} \hat{p}_i$. Since $\forall j \in [1, d]$, $\sum_{k=1}^{w} Pr(Y^{(j)} = k) = 1$, we have $\sum_{i=1}^{n} \hat{p}_i \le 1$. Therefore, $\sum_{i=1}^{n} \hat{f}_i \le N$. □
Based on Theorem 1, we can see why the error bound of XY-sketch is tighter than that of CM-sketch if the space budget of both sketches is set to $w \cdot d$. Recall that the error bound of CM-sketch is $\frac{eN}{w}$, with probability at least $1 - \delta$ [7]. The error bound of XY-sketch is $\frac{2N}{n\delta}$, with probability at least $1 - \delta$. With the same probability at least $1 - \delta$, letting $\frac{2}{n\delta} < \frac{e}{w}$, we find that $n > \frac{2w}{e\delta}$, or equivalently $w < \frac{en\delta}{2}$. Therefore, if the number of distinct items $n$ is large (higher than $\frac{2w}{e\delta}$), or the space budget of the sketches is small (lower than $\frac{en\delta}{2}$), the error bound of XY-sketch is tighter than that of CM-sketch. It means that XY-sketch outperforms CM-sketch when the space budget is small, or when the number of distinct items in the stream is large. This conclusion is consistent with the experimental results in Section 6.
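A quick numeric check of the crossover condition (an illustrative example with parameter values we chose, not figures from the paper):

```python
from math import e

delta, w, N = 0.01, 10_000, 10**8
cm_bound = (e / w) * N                 # CM-sketch error bound: (e/w) * N, ~2.7e4
crossover_n = 2 * w / (e * delta)      # XY bound 2N/(n*delta) drops below CM's here, ~7.4e5
n = 10**6                              # one million distinct items, above the crossover
xy_bound = 2 * N / (n * delta)         # ~2.0e4, tighter than CM-sketch's bound
print(cm_bound, crossover_n, xy_bound)
```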
Theorem 1 can also be used for analyzing the scalability of XY-sketch. For example, when $N$ is fixed to a constant, the error bound shrinks as the number of distinct items $n$ increases. It means that XY-sketch has good $n$-scalability.

$n$-scalability. Based on Table 1, after transforming C-sketch's error bound from the $L_2$ norm to the $L_1$ norm, we can see that both C-sketch's and CM-sketch's $n$-scalability are not as good as XY-sketch's. For example, the error bound of C-sketch shrinks to $\sqrt{n/n_1}$ of the original when $n$ increases to $n_1$, while the error bound of XY-sketch shrinks to $n/n_1$ of the original, leading to a lower error bound. Because most existing sketching techniques, such as A-sketch and Cold Filter, are based on CM-sketch or C-sketch, their $n$-scalability is not as good as that of XY-sketch. Remarkably, the error bound of XY-sketch is $\frac{2}{n\delta}N$, which decreases with an increased value of $n$, if $N$ and the space budget are fixed.

$N$-scalability. XY-sketch achieves better $N$-scalability than others under certain conditions. Take CM-sketch as an example. In the case of $n > \frac{2w}{e\delta}$, or equivalently $w < \frac{en\delta}{2}$, the factor $\frac{2}{\delta n}$ is smaller than the factor $\frac{e}{w}$. Therefore, the growth of XY-sketch's error bound is smaller than the growth of CM-sketch's error bound as $N$ increases, meaning that XY-sketch has better $N$-scalability in this case.
4.2 Error Bounds with Detailed Distributions
In this part, we analyze the estimation error bound when items follow uniform and Zipfian distributions.
First of all, for ease of presentation, we define several symbols for representing probabilities. Let $\phi(Y^{(1)} = y_i^{(1)}, Y^{(2)} = y_i^{(2)}, ..., Y^{(j)} = y_i^{(j)})$ be the number of data items in the data stream $S_N$ whose 1st element equals $y_i^{(1)}$, ..., and $j$-th element equals $y_i^{(j)}$. We denote $\phi(Y^{(1)} = y_i^{(1)}, ..., Y^{(j)} = y_i^{(j)})$ as $\phi(y_i^{(1)}, ..., y_i^{(j)})$ and use $CPr(Y^{(j)} = y_i^{(j)})$ to represent the conditional probability $Pr(Y^{(j)} = y_i^{(j)} \mid Y^{(1)} = y_i^{(1)}, ..., Y^{(j-1)} = y_i^{(j-1)})$. Then, we can get the following equations.

$$CPr(Y^{(j)} = y_i^{(j)}) = \frac{\phi(y_i^{(1)}, y_i^{(2)}, ..., y_i^{(j)})}{\phi(y_i^{(1)}, y_i^{(2)}, ..., y_i^{(j-1)})} \quad (5)$$

$$Pr(Y^{(j)} = y_i^{(j)}) = \frac{\phi(y_i^{(j)})}{\sum_{k=0}^{w-1} \phi(Y^{(j)} = k)} = \frac{\phi(y_i^{(j)})}{N} \quad (6)$$
Next, we show the relationship between Equations 5 and 6, which can be used for bounding the estimated probability under various item distributions.

Theorem 2. It holds that $\forall j \in [1, d]$, $\forall x_i \in S_N$ and $\forall y_i^{(j)} \in [0, w - 1]$, $\min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \le Pr(Y^{(j)} = y_i^{(j)}) \le \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$.
Proof. We first show the case $j = 2$, and then generalize to $j > 2$. We use reductio ad absurdum. Suppose the estimated probability $Pr(Y^{(2)} = y_i^{(2)})$ is always greater than the conditional probability $CPr(Y^{(2)} = y_i^{(2)})$. With Equations 5 and 6, we get the following inequality.

$$\frac{\phi(y_i^{(2)})}{N} > \frac{\phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)})}{\phi(Y^{(1)} = k)}, \quad k \in [0, w - 1]$$

By a simple transformation, we have $\phi(y_i^{(2)}) \cdot \phi(Y^{(1)} = k) > \phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)}) \cdot N$. The inequality holds for different values of $k$, ranging from 0 to $w - 1$. Then, we get $w$ inequalities by substituting $k$ with values from 0 to $w - 1$. Based on that, the following inequality can be obtained by summing the left and right sides of the $w$ inequalities, respectively.

$$\phi(y_i^{(2)}) \cdot \sum_{k=0}^{w-1} \phi(Y^{(1)} = k) > N \cdot \sum_{k=0}^{w-1} \phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)})$$

Since all values ($\{0, 1, ..., w-1\}$) of basic elements are taken into account, we get $\sum_{k=0}^{w-1} \phi(Y^{(1)} = k) = N$ and $\sum_{k=0}^{w-1} \phi(Y^{(1)} = k, Y^{(2)} = y_i^{(2)}) = \phi(y_i^{(2)})$. Finally, we have $\phi(y_i^{(2)}) \cdot N > \phi(y_i^{(2)}) \cdot N$, which is a contradiction. Therefore, the value of the estimated probability $Pr(Y^{(2)} = y_i^{(2)})$ is bounded by the minimum and maximum values of the conditional probabilities $CPr(Y^{(2)} = y_{i'}^{(2)})$, where $x_{i'} \in S_N$. Similarly, we can extend the argument to the case $j > 2$. □
Theorem 2 gives the upper and lower bounds for the estimated probability $Pr(Y^{(j)} = y_i^{(j)})$. The bounds are represented in the form of conditional probabilities. However, conditional probabilistic bounds are difficult to derive in practice. To this end, we consider two representative distributions for which closed-form error bounds can be evaluated.

Error Bounds vs. Uniform Distributions. If the data items of the data stream follow a uniform distribution, the counter values $Y_{d \times w}[j, y_i^{(j)}]$ on each row would be close, where $j = 1, 2, ..., d$ and $\forall x_i \in S_N$. Therefore, with large probability, the conditional probability of the most frequent item achieves its maximum value. So, we get the following.

$$\max_{x_i \in S_N}\{f_i\} = N \cdot \prod_{j=1}^{d} \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$$

Similarly, the following holds.

$$\min_{x_i \in S_N}\{f_i\} = N \cdot \prod_{j=1}^{d} \min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$$

Let $R = \max_{x_i \in S_N}\{f_i\} - \min_{x_i \in S_N}\{f_i\}$ be the range of the frequencies of the data items. Then, the estimation error bound for the case of the uniform distribution is shown in Theorem 3.

Theorem 3. It holds that $\forall x_i \in S_N$, $|f_i - \hat{f}_i| \le R$ under the uniform distribution, where $R = \max_{x_i \in S_N}\{f_i\} - \min_{x_i \in S_N}\{f_i\}$.
Proof. For $\forall x_i \in S_N$, $\forall j \in [1, d]$ and $\forall y_i^{(j)} \in [0, w - 1]$, we have $|CPr(Y^{(j)} = y_i^{(j)}) - Pr(Y^{(j)} = y_i^{(j)})| \le |\max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} - \min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}|$, according to Theorem 2. Thus, combining Equations 2 and 3, the exact frequency $f_i$ and the estimated frequency $\hat{f}_i$ are both lower than the product of the maximum conditional probabilities and higher than the product of the minimum conditional probabilities. Thus, we have

$$|f_i - \hat{f}_i| \le \prod_{j=1}^{d} \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \cdot N - \prod_{j=1}^{d} \min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \cdot N,$$

which is equivalent to $|f_i - \hat{f}_i| \le R$. □

From Theorem 3, if data items follow a uniform distribution, the value of the range $R$ will be very small, meaning that the error bound of XY-sketch is small. This conclusion can also be drawn from Theorem 2. Due to the uniform distribution, $\min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$ is close to $\max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$. That is, for each row of XY-sketch, the estimated probability is very close to the true probability. Therefore, the final error of XY-sketch's estimation will also be small.
Error Bounds vs. Skewed Distributions. We now show the error bound of item frequency estimation when the data items in the stream follow a skewed distribution. We use the Zipfian distribution to model skewed distributions, following the setting of existing works [8][4][28]. The Zipfian distribution has a parameter $z$ (setting $z > 1$) and a scaling constant $C_z$. Correspondingly, $f_i$ represents the frequency of the $i$-th most frequent item only when we discuss the skewed distribution. Recall that items are in the range $[1, 2, ..., n]$, where $n$ is the number of distinct items. Then, the constant $C_z$ can be determined by $z$ and $n$, since $\sum_{i=1}^{n} \frac{C_z}{i^z} = 1$. Therefore, the following holds.

$$\int_{k}^{k+1} \frac{C_z}{i^z} \, di \le \frac{C_z}{k^z} \cdot 1 \le \int_{k-1}^{k} \frac{C_z}{i^z} \, di$$

After extending the upper limit of the integration in the above formula, we get the following.

$$\int_{k}^{+\infty} \frac{C_z}{i^z} \, di \le \sum_{i=k}^{n} \frac{C_z}{i^z} \le \int_{k-1}^{+\infty} \frac{C_z}{i^z} \, di$$

With some transformations, we get the following inequality.

$$\frac{C_z k^{1-z}}{z - 1} \le \sum_{i=k}^{n} \frac{C_z}{i^z} \le \frac{C_z (k-1)^{1-z}}{z - 1}$$
With this derivation for the Zipfian distribution, we can derive the error bound for cold items. Here, we define cold items as the set of items excluding the $k$ most frequent items. Let $x_1$ be the most frequent item and $x_2$ be the second most frequent item; the cold items can then be represented as $CI = \{x_{k+1}, x_{k+2}, ..., x_n\}$. We focus on $CI$, which covers the vast majority of data items in the stream, since the frequencies of the top-$k$ data items $\{x_1, ..., x_k\}$ can almost be exactly estimated with little extra space, such as a filter for A-sketch [19] and CM-sketch [7].
According to Theorem 2, it can reasonably be assumed that $Nk \cdot \frac{1}{w^d} \le \sum_{i=1}^{k} \hat{f}_i \le \sum_{i=1}^{k} f_i$, since we have $\frac{1}{w} \le Pr(Y^{(j)} = y_i^{(j)}) \le CPr(Y^{(j)} = y_i^{(j)})$ for any $j \in [1, 2, ..., d]$ and any $i \in [1, 2, ..., k]$. Based on that, we can state the error bound for cold items as follows.

Theorem 4. For a Zipfian distribution with parameter $z$ and scaling constant $C_z$, with probability at least $1 - \delta$, the error bound of cold items is $\frac{N}{(n-k)\delta} \cdot \left(\frac{C_z k^{1-z}}{z-1} + \left(1 - \frac{k}{w^d}\right)\right)$.
Proof. Since the data items follow a Zipfian distribution, we get $\sum_{i=k+1}^{n} \frac{C_z}{i^z} \le \int_{k}^{+\infty} \frac{C_z}{i^z} \, di$. That is, $\sum_{i=k+1}^{n} f_i \le N \frac{C_z k^{1-z}}{z-1}$. Accordingly, we get $\sum_{i=1}^{k} \hat{f}_i \ge \frac{Nk}{w^d}$. Therefore, $\sum_{i=k+1}^{n} \hat{f}_i \le N - \sum_{i=1}^{k} \hat{f}_i \le N \cdot (1 - \frac{k}{w^d})$.
Let $\zeta_i = |\hat{f}_i - f_i|$ be the random variable of the estimation error, where $i \in [k+1, k+2, ..., n]$. Since $\hat{f}_i$ and $f_i$ are both positive, it holds that $\zeta_i \le \max\{\hat{f}_i, f_i\}$. Therefore,

$$\sum_{i=k+1}^{n} \zeta_i \le \sum_{i=k+1}^{n} \hat{f}_i + \sum_{i=k+1}^{n} f_i \le N\left(\frac{C_z k^{1-z}}{z-1} + \left(1 - \frac{k}{w^d}\right)\right)$$

Then, we use reductio ad absurdum to prove the theorem. Suppose $u$ is a probability value satisfying $\delta < u < 1$, and suppose the random variable $\zeta_i$ is greater than $\frac{N}{(n-k)\delta}(\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d}))$ with probability $u$. It means that there exist $\zeta_{i_1}, \zeta_{i_2}, ..., \zeta_{i_{u(n-k)}}$ such that $\zeta_{i_j} > \frac{N}{(n-k)\delta}(\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d}))$, where $j = 1, 2, ..., u(n-k)$. Thus, we get $\sum_{j=1}^{u(n-k)} \zeta_{i_j} > N(\frac{C_z k^{1-z}}{z-1} + (1 - \frac{k}{w^d}))$, which contradicts the inequality above. Therefore, the theorem is proved. □
5 EXTENSIONS
5.1 Extended Structure of XY-sketch
XY-sketch's basic structure is a matrix with $d$ rows and $w$ columns. Let $w_i$ be the number of elements of row $i$. For the basic structure, $w_i$ is limited to be a power of 2 and all $w_i$s are equal, satisfying $\prod_{i \le d} w_i = |\mathcal{X}|$. Due to this setting of $w_i$, an arbitrary amount of space $\beta$ may not be fully occupied by the basic structure, making the allocated space not fully used for sketching stream items. Hereby, we study how to extend the basic structure so that the space can be fully utilized and hence the estimation accuracy can be improved.
In particular, we extend the basic XY-sketch structure by setting the $w_i$s to tunable values. This way, the structure of XY-sketch is no longer a set of rows of equal length in a matrix, but a set of rows, long and short, so that each row may contain a different number of elements. We represent the length (number of bits) of an element on row $i$ as $b_i$, satisfying $\sum_i b_i = b \cdot d$, or equivalently $\prod_{i \le d} w_i = |\mathcal{X}|$, meaning that the domain of items is preserved, and therefore the recomposition of elements from all different rows is capable of recovering the original data stream items.
Heuristics. However, finding the optimal setting of the row lengths $\{w_i\}$ is computationally challenging. In total, there are at most $\sum_{i=0}^{b \cdot d - 1} \binom{b \cdot d - 1}{i} = 2^{b \cdot d - 1}$ possible settings. Thus, we design a greedy algorithm for approaching the optimal space allocation, as shown in Algorithm 3. Initially, the item has $b \cdot d$ bits and the space budget is $\beta$. In the first iteration, we set up row 1, of which each element has $b_1$ bits. Then, for the next round, it is equivalent to solving the same problem with space $(\beta - w_1)$ for a pseudo item of $(b \cdot d - b_1)$ bits. The subroutine is thus invoked recursively.

Algorithm 3 SpaceAlloc (Space Budget $\beta$, Pseudo Item Length $b \cdot d$) // $w$ is a global sequence
1: $r \leftarrow \lfloor \log \beta \rfloor$
2: while $b \cdot d > 0$ do
3:   $w' \leftarrow 2^r$
4:   if $\beta - w' < 2 \times (b \cdot d - r)$ then
5:     $w' \leftarrow 2^{r-1}$
6:   end if
7:   Append $w'$ to $w$
8:   SpaceAlloc($\beta - w'$, $b \cdot d - r$)
9: end while
At iteration $i$, the configuration of $w_i$ should satisfy two conditions. First, the length $r$ allotted to the pseudo item should be maximized (line 1). Recalling Equations 2 and 3, we take the probability $Pr(Y^{(j)} = y_i^{(j)})$ to estimate the exact probability $CPr(Y^{(j)} = y_i^{(j)})$. When $j$ equals 1, the probability $Pr(Y^{(j)} = y_i^{(j)})$ is equal to the probability $CPr(Y^{(j)} = y_i^{(j)})$. It means that the probability logged in the first row of XY-sketch is always the exact probability. So, we prefer to take large enough space to log this exact probability; thus, we let $b_1$ be as large as possible. Then, we view the original item as two parts: part one with these $b_1$ bits, and part two with the remaining $(b \cdot d - b_1)$ bits. We regard the remaining $(b \cdot d - b_1)$ bits as a new item over a new domain. For this new item, we again use the probability $Pr(Y^{(j)} = y_i^{(j)})$ to estimate the probability $CPr(Y^{(j)} = y_i^{(j)})$, and the same situation arises. Therefore, we also set $b_2$ as large as possible for the remaining space.
Second, the remaining space $(\beta - w')$ should be adequate for storing the current pseudo items with $(b \cdot d - r)$ bits (line 4). The correctness is guaranteed by Theorem 6, in which $c$ represents these remaining bits $(b \cdot d - r)$. If the remaining space is not enough to process the remaining bits, it means that the given $r$ is too large; we then continue to the next iteration after adjusting $r$ (line 5).
Therefore, according to Algorithm 3, the leftover space is limited. The leftover space is equally divided into $u$ parts, where $u = \lfloor L_{space}/w_d \rfloor$; the remainder, which must be less than $w_d$, is ignored. Each of the $u$ parts is of size $w_d$. The $u$ parts are then combined with the $d$-th row to construct an equal-width histogram of $u + 1$ buckets. This way, the first $d - 1$ rows represent the item sub-domain $\prod_{i \le d-1} w_i$, and the last row (i.e., the $d$-th row) represents the sub-domain $w_d$, satisfying that the product of the two equals $\prod_{i \le d} w_i = |\mathcal{X}|$.
Theorem 5. The space cost is decreased by the decomposition operation.

Proof. It is equivalent to proving that if $m = w_1 w_2$ with $w_1, w_2 \in \mathbb{N}^+$, $w_1 \ge 2$ and $w_2 \ge 2$, then $m \ge w_1 + w_2$.
Since $m = w_1 w_2$, the problem is to prove $w_1 w_2 \ge w_1 + w_2$, or equivalently $(w_1 - 1) \ge \frac{w_1}{w_2}$. Let $y_1(x) = \frac{w_1}{x}$. Since $y_1(x)$ is a decreasing function of $x$ when $x > 0$, we have $\frac{w_1}{2} \ge \frac{w_1}{x}$ when $x \ge 2$. Let $y_2(x) = x - 1$ and $y_3(x) = \frac{x}{2}$; then $y_2'(x) = 1$ and $y_3'(x) = \frac{1}{2}$, so $y_2(x) \ge y_3(x)$ when $x \ge 2$. Hence $(w_1 - 1) \ge \frac{w_1}{2} \ge \frac{w_1}{w_2}$, and therefore $w_1 w_2 \ge w_1 + w_2$ for $w_1, w_2 \ge 2$. □
Theorem 6. Given a series of pseudo items of $c$ bits, the minimum space required for storing them in the decomposition-and-recomposition framework is $2 \cdot c$.

Proof. Let $m = \prod_{i=1}^{d} w_i$ and $\beta = \sum_{i=1}^{d} w_i$. We first show that $\beta$ attains its minimum value if $w_1 = w_2 = ... = w_d$. Writing $w_d = \frac{m}{\prod_{i=1}^{d-1} w_i}$, we have $\beta = \sum_{i=1}^{d-1} w_i + \frac{m}{\prod_{i=1}^{d-1} w_i}$ and $\frac{\partial \beta}{\partial w_j} = 1 - \frac{m}{\prod_{i \neq j, i < d} w_i} \cdot \frac{1}{w_j^2} = 1 - \frac{w_d}{w_j}$. By solving $\frac{\partial \beta}{\partial w_j} = 0$, we get that $w_j = w_d$ is the valley point. Similarly, $\beta = \sum_{i=1}^{d} w_i$ is minimized when $w_1 = w_2 = ... = w_d$. Then, according to Theorem 5, the value of each $w_i$ should be as small as possible. Thus, $\min(w_i) = 2$, since $w_i = 2^{b_i}$ and $b_i \in \mathbb{N}^+$.
Hence, the minimum space is achieved if the decomposition process is applied to the pseudo items of $c$ bits until $w_i = 2$ for all $i$; the corresponding space cost is $2 \cdot c$. □
Error Bounds. We show that the error bound analysis for the basic structure also works for the extended structure, by Theorems 7 to 9.

Theorem 7. For the extended XY-sketch structure, if we randomly select one of the $n$ items, with probability at least $1 - \delta$, we have $|\hat{f}_i - f_i| \le \frac{2}{n\delta}N$.

Proof. Similar to Theorem 1, we let $\zeta_i = |\hat{f}_i - f_i|$ be the random variable of the error. Since $\sum_{i=1}^{n} \hat{p}_i \le 1$, we have $\sum_{i=1}^{n} \hat{f}_i \le N$. Then, we can use the same reductio ad absurdum as in Theorem 1 to prove this theorem. □
Essentially, the key difference between the basic structure and the extended structure is that the number of counters in each row of the extended structure differs. But this difference does not affect the key property of XY-sketch (Theorem 2) that $\min_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\} \le Pr(Y^{(j)} = y_i^{(j)}) \le \max_{x_{i'} \in S_N}\{CPr(Y^{(j)} = y_{i'}^{(j)})\}$ for $\forall j \in [1, d]$, $\forall x_i \in S_N$ and $\forall y_i^{(j)} \in [0, w_j - 1]$, meaning that the estimated probability is bounded by the minimum and the maximum conditional probabilities. Due to this, the error bounds of XY-sketch under uniform and Zipfian distributions hold for the extended structure.
For the uniform distribution, Theorem 3 still works for the extended structure, because the estimated probability is bounded by the minimum and the maximum conditional probabilities, which reveals that the gap between the estimated and the true probabilities is small. Therefore, we can similarly find that the error bound under the uniform distribution is still limited by the range $R$, as shown in Theorem 8. As for the Zipfian distribution, the key inequality $Nk \cdot \frac{1}{\prod_{i=1}^{d} w_i} \le \sum_{i=1}^{k} \hat{f}_i \le \sum_{i=1}^{k} f_i$ still works, based on the fact that the estimated probability is bounded. Thus, Theorem 9 differs from Theorem 4 only slightly in terms of expression. The proof method is similar to the previous one, so we do not repeat it.

Theorem 8. For the extended XY-sketch structure, we have $|f_i - \hat{f}_i| \le R$ under the uniform distribution, where $R = \max_{x_i \in S_N}\{f_i\} - \min_{x_i \in S_N}\{f_i\}$.

Theorem 9. For the extended XY-sketch structure, given a Zipfian distribution with parameter $z$ and scaling constant $C_z$, with probability at least $1 - \delta$, the error bound of cold items is $\frac{N}{(n-k)\delta} \cdot \left(\frac{C_z k^{1-z}}{z-1} + \left(1 - \frac{k}{\prod_{j=1}^{d} w_j}\right)\right)$.
5.2 Statistics-based Optimization
There exist $(b \cdot d)!$ possible mappings between $b \cdot d$ bits and $d$ elements. Each mapping method uniquely determines a decomposition-and-recomposition procedure. Although all mapping methods comply with the error bounds in Section 4, their performance varies. Hereby, we design a statistics-based method, which takes a small sample of items to obtain statistics that supervise the selection of the mapping method.
The idea is to estimate the distribution over $\{0, 1\}$ for every bit of the data items. The distribution is collected from the first $N_1$ items of the data stream. Let $c_j(0)$ and $c_j(1)$ be the counts of values 0 and 1 for the $j$-th bit, respectively. We can get the probabilities for the $j$-th bit to take values 0 and 1 as follows.

$$p_j(0) = \frac{c_j(0)}{c_j(0) + c_j(1)} \quad \text{and} \quad p_j(1) = \frac{c_j(1)}{c_j(0) + c_j(1)}$$

For every bit of the item, we calculate its entropy by $H(j) = -\sum_{i=0}^{1} p_j(i) \log(p_j(i))$. Based on the entropy, we sort the bits in descending order to formulate a sequence. Then, the sequence is divided into a set of segments with lengths $b_1$, $b_2$, and so on. For the basic structure (Section 3.2), all $b_i$s are equal. For the extended structure (Section 5.1), the $b_i$s are determined by Algorithm 3.
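A minimal sketch of the statistics step (our own illustration, assuming items are integers of num_bits bits): count how often each bit is 1 over the first $N_1$ items, compute each bit's entropy, and return the bit positions sorted by descending entropy, ready to be cut into segments of $b_1$, $b_2$, ... bits.

```python
from math import log2

def order_bits_by_entropy(sample, num_bits):
    """Bit positions (1-indexed from the right), sorted by descending entropy H(j)."""
    ones = [0] * num_bits
    for x in sample:
        for k in range(num_bits):
            ones[k] += (x >> k) & 1

    def entropy(k):
        p1 = ones[k] / len(sample)
        p0 = 1.0 - p1
        return -sum(p * log2(p) for p in (p0, p1) if p > 0.0)

    return sorted(range(1, num_bits + 1), key=lambda j: entropy(j - 1), reverse=True)
```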
The rationale behind the statistics-based method can be explained by the concept of mutual information. Let the mutual information between the $i$-th and $j$-th bits be defined as $I(i, j) = H(i) - H(i \mid j)$. The mutual information $I(i, j)$ represents the amount of information about the $i$-th bit obtained by observing the $j$-th bit.

Theorem 10. $I(i, j) \le \min\{H(i), H(j)\}$.

Proof. Notice that $I(i, j) = H(i) - H(i \mid j)$ and $I(j, i) = H(j) - H(j \mid i)$. Since $I(i, j) = I(j, i)$ and $H(i \mid j), H(j \mid i) \ge 0$, we get $I(i, j) \le \min\{H(i), H(j)\}$. □

Theorem 11. When the sequence is formulated by sorting bits in descending order of entropy, the bound on the mutual information $I(i, j)$ is lowered, where the $i$-th and $j$-th bits come from different rows of XY-sketch.

Proof. According to Theorem 10, the bound on the mutual information $I(i, j)$ depends on the bit with the lower entropy value. Let the $i$-th bit come from the first row of XY-sketch and the $j$-th bit from the second row, and denote the minimum-entropy bit in the first row as the $v$-th bit. When the sequence is formulated by sorting bits in descending order of entropy, the mutual information $I(i, j)$ is bounded by $H(j)$, which is always smaller than $H(v)$. But when this descending condition does not hold, the mutual information $I(i, j)$ may be larger than $H(v)$, which leads to a higher bound. The case where bit $i$ and bit $j$ come from other rows is analogous, and the theorem is proved. □
We focus on the situation where the $i$-th and $j$-th bits come from different rows of XY-sketch, since the estimation error of XY-sketch is associated with the junctions between different rows. Theorem 11 shows that the entropy-descending ordering yields the lower mutual-information bound, which is what we prefer.
The optimization procedure takes $O(b \times d)$ extra space for collecting the statistics. The cost is negligible, since $b \times d$ is small. Notice that the frequency information of the first $N_1$ items cannot be preserved in the optimized XY-sketch, according to the one-pass processing criterion for data streams. In case there exists temporarily allocated space large enough for storing the $N_1$ items, a better optimization can be implemented; the space can be released when the optimization finishes.
It is worth noting that the entropy of each bit obtained through the statistics is related to the arrival order of data stream items. So we employ the random order model [30] [12] [24], which is a general model regardless of the distributions of item sets. The idea of the random order model is that each incoming item in the stream is picked independently and uniformly at random from $\mathcal{X}$, so that it is adaptive to arbitrary frequency distributions over distinct item sets at any time point. Therefore, the entropy of each bit calculated from the first $N_1$ data items can reflect that of the entire data stream.
6 RESULTS
We cover the experimental setup in Section 6.1, and report the results in Section 6.2.

6.1 Setup
Datasets. We use two real datasets for experiments, Kosarak [1] and WebDocs [1]. Kosarak contains 30.5 MB of (anonymized) click-stream data of a Hungarian online news portal. The total number of data items is 8,019,015, while the total number of distinct items is 41,270. This dataset has a skewed distribution similar to a Zipfian distribution with parameter 1.0. WebDocs is a huge real-life transactional dataset, built from a crawled collection of web documents. The dataset has 299,887,139 items and 5,267,656 distinct items. The size of the dataset is 1.37 GB. More detailed information on this dataset can be found in [3]. Let $\rho$ be the ratio of the number of distinct items over the total number of items, $\rho = n/N$. The $\rho$-values of Kosarak and WebDocs are 0.52% and 1.8%, respectively. We also generate a series of 6 synthetic datasets for testing $n$-scalability. In order to facilitate the construction of datasets with different $\rho$-values, these 6 synthetic datasets follow a Normal distribution with $\mu = 5 \times 10^7$. We vary the variance $\sigma^2$ from $5 \times 10^5$ to $2 \times 10^6$ to make the $\rho$-value range from 1.0% to 5.3%. Each of the datasets contains $2 \times 10^8$ data items and takes about 1.67 GB of space.
Metrics. We adopt two commonly accepted metrics for evaluating the accuracy of sketching methods, Average Absolute Error (AAE in short) and Average Relative Error (ARE in short). Formally,

$$AAE = \frac{1}{n}\sum_{i=1}^{n} |\hat{f}_i - f_i|, \quad ARE = \frac{1}{n}\sum_{i=1}^{n} \frac{|\hat{f}_i - f_i|}{f_i} \quad (7)$$
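Computing the two metrics is straightforward; a small illustrative helper (assuming dictionaries mapping each distinct item to its true and estimated frequency) is:

```python
def aae_are(true_freq, est_freq):
    """Average Absolute Error and Average Relative Error over the n distinct items."""
    n = len(true_freq)
    aae = sum(abs(est_freq[x] - f) for x, f in true_freq.items()) / n
    are = sum(abs(est_freq[x] - f) / f for x, f in true_freq.items()) / n
    return aae, are
```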
Baselines. We consider five competitors, CM-sketch [7], C-sketch [5], CU-sketch [9], CU with A-sketch [19] and CU with Cold Filter [30], represented by CM, C, CU, A and CF, respectively. XY-sketch with statistics-based optimization is represented by XY. We also compare the results of XY to XY-sketch without statistics-based optimization. For both A and CF, we use 32-bit Bob Hashing [2], and use CU-sketch for their sketching part, following the setting of [30]. In particular, for A, we set its filter size, i.e., the number of items in the filter, to 32 for Kosarak, and to 1,280 for WebDocs, with which the best performance is achieved. For CF, we follow the default setting in [30]. We use CF40, CF70, and CF90 to represent CF with filter percentages equal to 40%, 70%, and 90%, respectively. For all baselines, the number of hash functions is 3, following the setting in [30][10][15].

Parameters. We select the first $N_1$ items in the data stream to determine the mapping method. By default, $N_1$ is set to 50,000. We also examine the effect of $N_1$ in Section 6.2.
6.2 Experiment Results
N-scalability. We first evaluate the N-scalability of XY-sketch and the baselines. We vary the total number of data items from 100M to 300M for the WebDocs dataset, and from 40M to 200M for the synthetic datasets. In all tests, the space budget is fixed to 1.5MB, so that the effect on scalability can be solely observed.
The results of AAE and ARE on WebDocs are reported in Figures 3 (a) and (b), respectively. We can see that all the AAE and ARE values of the baselines increase significantly with the number of data items. For example, the AAE value of C more than doubles when the data volume increases from 100M to 300M, in Figure 3 (a). This is because more hashing conflicts are incurred as the number of data items grows. Compared to the baselines, XY performs very stably w.r.t. the increase of data stream volumes. It can be observed that XY mostly dominates its competitors on both metrics. Compared to the fast increase of the AAE and ARE values of the baselines, only a slight increase can be observed for XY. In particular, when the number of items increases from 100M
to 300M, AAE of XY only increases from 7.95 to 18.4. In contrast,
AAEs of CM, C, CU, A, CF (CF90) have increased by 106.95, 117.32,
68.98, 68.89 and 83.85, respectively, which are at least 8 times larger
than that of XY. Similar trends can be observed from the results on
synthetic datasets, as shown in Figures 3(c) and (d). In summary, the results in Figure 3 show that XY achieves much better performance in terms of N-scalability.

Figure 3: N-scalability. (a) AAE w.r.t. N (WebDocs); (b) ARE w.r.t. N (WebDocs); (c) AAE w.r.t. N (synthetic); (d) ARE w.r.t. N (synthetic).
n-scalability. We evaluate the n-scalability of XY and its competitors in Figure 4. We use the 6 synthetic datasets so that the AAE and ARE w.r.t. the varying $\rho$ values can be observed. Each dataset contains $2 \times 10^8$ data items. In all tests, the space budget is fixed to 1.5MB. From Figure 4 (a), all methods increase w.r.t. $\rho$, except C and XY, which is consistent with the scalability analysis in Section 4. Among all methods, XY achieves the best n-scalability. The AAE value of XY is an order of magnitude lower than that of C, and two orders of magnitude lower than the others. Similar results are observed for the ARE metric, as shown in Figure 4 (b). The ARE value of XY is orders of magnitude lower than its competitors'. We conclude that XY achieves better n-scalability than its competitors.
Figure 4: n-scalability. (a) AAE w.r.t. $\rho$; (b) ARE w.r.t. $\rho$ ($\rho = n/N$).
Compactness. We test the space efficiency of XY, which is important for the compactness requirement of sketching techniques. The results are collected from experiments on the two real datasets, and are reported in Figure 5.
First, we compare the AAE and ARE metrics for all six sketches on the Kosarak data, by varying the space budget from 80KB to 240KB. In Figure 5(a), all methods achieve smaller AAE with increasing space budget, whereas XY always performs better. In particular, when the space budget equals 80KB, the AAEs of CM, C, CU, A, CF (CF70) are 13.8 times, 20.9 times, 8.4 times, 8.6 times and 14.9 times that of XY, respectively. It means that XY performs better frequency estimation than its competitors with the same amount of space. Similar trends on ARE can be observed in Figure 5(b), where XY always achieves the smallest ARE. Especially when the space budget is small, the improvement of XY in terms of estimation accuracy is significant. For example, when the space budget equals 80KB, the ARE of XY is orders of magnitude lower than the others.
The results on WebDocs are reported in Figures 5(c) and (d), respectively. The performance is tested by varying the space from 1.5MB to 4MB. In Figure 5(c), the value of AAE of each method decreases as the space budget increases. In particular, when the space budget equals 1.5MB, the AAEs of CM, C, CU, A, CF (CF90) are 7.6 times, 8.5 times, 4.8 times, 4.8 times and 4.3 times that of XY, respectively. It is worth noticing that when the space budget is very small, all baselines degrade in the frequency estimation for data items, making them not qualified for data stream sketching. In contrast, XY achieves much better estimation accuracy, especially when the space budget is small. For example, when the space budget equals 1.5MB, the AAE of XY is orders of magnitude lower than that of CF70 and CF90. Similarly, in Figure 5(d), when the space budget is set to 1.5MB, the ARE of XY is orders of magnitude lower than that of the other methods. We argue that sketches are often used in small and fast memory, e.g., L1 or L2 cache [19] [26]. All the experiments in [19] [30] [27] [29] [21] are done with a space budget no larger than 2MB. Therefore, we conclude that XY dominates its competitors in the common range of space budgets.
Efficiency. We test the updating and querying efficiency of all sketches on the two real-world datasets, in Figure 6. In Figure 6 (a), all sketches achieve high throughput in handling item updates. In particular, XY has the highest throughput among all methods on Kosarak. The reason may be that XY-sketch only uses bit operations in the decomposition phase. At the same time, the number of distinct items n of Kosarak is relatively small. Therefore, the number of bits to be decomposed is correspondingly small, leading to higher throughput. Figure 6 (b) reports the query efficiency on Kosarak. We can see that CM, CU, A, CF90 and XY have similar performance in query processing, where CU is the fastest. CF's query efficiency correlates significantly with its parameter settings. In Figures 6 (c) and (d), we report the updating and querying efficiency for WebDocs, respectively. The performance of all methods, except A, is almost on the same level. XY is slightly slower than CM, CU, CF70 and CF90, and is faster than C and A. There exist some small fluctuations for XY, which might be caused by the
recomposition phase incurring computational costs on probability calculations. We argue that it is still worth the effort, given the significant improvement in estimation accuracy achieved by XY.

Figure 5: Compactness. (a) AAE w.r.t. space (Kosarak); (b) ARE w.r.t. space (Kosarak); (c) AAE w.r.t. space (WebDocs); (d) ARE w.r.t. space (WebDocs).

Figure 6: Efficiency (Updating and Querying). Throughput (items/µs) of each sketch: (a) updating (Kosarak); (b) querying (Kosarak); (c) updating (WebDocs); (d) querying (WebDocs).
Figure 7: Effect of Statistics-based Optimization. (a) AAE w.r.t. space; (b) ARE w.r.t. space.
Effect of Statistics-based Optimization. We now examine the effectiveness of the statistics-based optimization proposed in Section 5. We randomly select 100 mapping functions from the random bijective function family $\{\varpi^*\}$. We run experiments with all of the 100 mapping functions on the WebDocs dataset, as shown in Figure 7 (a). For each value on the x-axis, we record the result with the highest estimation error as XYmax, the result with the lowest estimation error as XYmin, and the average of the 100 results as XYavg. Here, we use XY to denote the method with the statistics-based optimization techniques. In Figures 7 (a) and (b), the AAE and ARE values of XYmax, XYavg and XY decrease as the allocated space increases. Also, the AAE and ARE values of XY are almost the
lowest among the 100 tests. It implies that the proposed optimization techniques make XY mostly outperform the randomly selected set of mapping functions.

Figure 8: Effect of N1. (a) AAE w.r.t. N1 (Kosarak); (b) AAE w.r.t. N1 (WebDocs).
Effect of N1. We hereby test the effect of parameter N1 on the performance of the statistics-based optimization. Parameter N1 is used to estimate the entropy of every bit of the input items in order to optimize the setting of the bijective function. The results on Kosarak and WebDocs are reported in Figures 8 (a) and (b), respectively. It shows that a larger value of N1 corresponds to better accuracy. Both AAE and ARE converge after N1 reaches some value. The convergence point for N1 is 50K for Kosarak, and 4.4K for WebDocs, which is quite small compared to the total volume of the data streams. Also, we test the effect of N1 under different amounts of allocated space, represented by the different curves in Figure 8. We can observe that larger allocated space corresponds to smaller estimation errors. On the other hand, the convergence point of N1 is independent of the allocated space, meaning that the setting of N1 is general with respect to space budgets for our sketch.
7 CONCLUSION
In this paper, we study the problem of item frequency estimation for web-scale data streams, by proposing a new sketch, called XY-sketch. XY-sketch follows a novel decomposition-and-recomposition framework on the basis of bijective functions, which converts the problem of item frequency estimation into the problem of probability estimation. XY-sketch can achieve high estimation accuracy with very small space. We conduct detailed error bound analysis to gain theoretical insights on the scalability of the structure. Several optimization techniques are studied for further enhancing the performance. Experimental results on real and synthetic datasets show that XY-sketch outperforms state-of-the-art solutions when the space budget is small.
REFERENCES
[1] Frequent Itemset Mining Dataset Repository. http://fimi.uantwerpen.be/data/.
[2] Hash website. http://burtleburtle.net/bob/hash/evahash.html.
[3] WebDocs: a real-life huge transactional dataset. http://fimi.uantwerpen.be/data/webdocs.pdf.
[4] Lee Breslau, Pei Cao, Li Fan, Graham Phillips, and Scott Shenker. 1999. Web Caching and Zipf-like Distributions: Evidence and Implications. In INFOCOM. 126–134.
[5] Moses Charikar, Kevin C. Chen, and Martin Farach-Colton. 2002. Finding Frequent Items in Data Streams. In ICALP. 693–703.
[6] Jiecao Chen and Qin Zhang. 2017. Bias-Aware Sketches. In PVLDB. 961–972.
[7] Graham Cormode and S. Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. J. Algorithms 55, 1 (2005), 58–75.
[8] Graham Cormode and S. Muthukrishnan. 2005. Summarizing and Mining Skewed Data Streams. In SDM. 44–55.
[9] Cristian Estan and George Varghese. 2002. New directions in traffic measurement and accounting. In SIGCOMM. 323–336.
[10] Amit Goyal, Hal Daumé III, and Graham Cormode. 2012. Sketch Algorithms for Estimating Point Queries in NLP. In EMNLP-CoNLL. 1093–1103.
[11] Thore Graepel, Joaquin Quiñonero Candela, Thomas Borchert, and Ralf Herbrich. 2010. Web-Scale Bayesian Click-Through rate Prediction for Sponsored Search Advertising in Microsoft's Bing Search Engine. In ICML. Omnipress, 13–20.
[12] Sudipto Guha and Andrew McGregor. 2009. Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams. SIAM J. Comput. 38, 5 (2009), 2044–2059.
[13] Qun Huang, Patrick P. C. Lee, and Yungang Bao. 2018. Sketchlearn: relieving user burdens in approximate measurement with automated statistical inference. In SIGCOMM. 576–590.
[14] Mohammad Tanvir Irfan and Tucker Gordon. 2019. The Power of Context in Networks: Ideal Point Models with Social Interactions. In IJCAI. 6176–6180.
[15] Yi Lu, Andrea Montanari, Balaji Prabhakar, Sarang Dharmapurikar, and Abdul Kabbani. 2008. Counter braids: a novel counter architecture for per-flow measurement. In SIGMETRICS. 121–132.
[16] Samuel Madden and Michael J. Franklin. 2002. Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. In ICDE. 555–566.
[17] D. Madigan. 2003. DIMACS working group on monitoring message streams. http://stat.rutgers.edu/madigan/mms/.
[18] Michael Mitzenmacher, Rasmus Pagh, and Ninh Pham. 2014. Efficient estimation for high similarities using odd sketches. In WWW. 109–118.
[19] Pratanu Roy, Arijit Khan, and Gustavo Alonso. 2016. Augmented Sketch: Faster and More Accurate Stream Processing. In SIGMOD. 1449–1463.
[20] Anshumali Shrivastava, Arnd Christian König, and Mikhail Bilenko. 2016. Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams. In SIGMOD. 1417–1432.
[21] Lu Tang, Qun Huang, and Patrick P. C. Lee. 2019. MV-Sketch: A Fast and Compact Invertible Sketch for Heavy Flow Detection in Network Data Streams. In INFOCOM. 2026–2034.
[22] Ramine Tinati, Xin Wang, Ian C. Brown, Thanassis Tiropanis, and Wendy Hall. 2015. A Streaming Real-Time Web Observatory Architecture for Monitoring the Health of Social Machines. In WWW. 1149–1154.
[23] Luca Vassio, Michele Garetto, Carla-Fabiana Chiasserini, and Emilio Leonardi. 2020. User Interaction with Online Advertisements: Temporal Modeling and Optimization of Ads Placement. TOMPECS 5, 2 (2020), 8:1–8:26.
[24] Zhewei Wei, Ge Luo, Ke Yi, Xiaoyong Du, and Ji-Rong Wen. 2015. Persistent Data Sketching. In SIGMOD. 795–810.
[25] Tobias Weller. 2018. Compromised Account Detection Based on Clickstream Data. In WWW. 819–823.
[26] Tong Yang, Haowei Zhang, Hao Wang, Muhammad Shahzad, Xue Liu, Qin Xin, and Xiaoming Li. 2019. FID-sketch: an accurate sketch to store frequencies in data streams. World Wide Web 22, 6 (2019), 2675–2696.
[27] Tong Yang, Yang Zhou, Hao Jin, Shigang Chen, and Xiaoming Li. 2017. Pyramid Sketch: a Sketch Framework for Frequency Estimation of Data Streams. In PVLDB. 1442–1453.
[28] Yue Yang and Jianwen Zhu. 2016. Write Skew and Zipf Distribution: Evidence and Implications. TOS 12, 4 (2016), 21:1–21:19.
[29] Yang Zhou, Peng Liu, Hao Jin, Tong Yang, Shoujiang Dang, and Xiaoming Li. 2017. One Memory Access Sketch: A More Accurate and Faster Sketch for Per-Flow Measurement. In GLOBECOM. 1–6.
[30] Yang Zhou, Tong Yang, Jie Jiang, Bin Cui, Minlan Yu, Xiaoming Li, and Steve Uhlig. 2018. Cold Filter: A Meta-Framework for Faster and More Accurate Stream Processing. In SIGMOD. 741–756.
[31] Yunyue Zhu and Dennis E. Shasha. 2002. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time. In VLDB. 358–369.