Zero-One Frequency Laws
Vladimir (Vova) Braverman
UCLA
Joint work with Rafail Ostrovsky
Plan:
• General method for computing over frequencies with polylog space (Zero-one frequency law)
• Recursive sketching for vectors
[Figure: a stream, the frequencies of its elements, and the resulting frequency vector]
Frequency-Based Functions
$G : \mathbb{N} \to \mathbb{R}$
Frequency vector: 0 0 0 1 2 0 0 0 1 3
Modified vector: 0 0 G(0) G(1) G(2) G(0) G(0) G(0) G(1) G(3)
G-Sum(V) $= \sum_i G(m_i)$
The (Basic) Streaming Model: Formal Definition
The Data
• D is a stream $p_1, \dots, p_m$ where $p_j \in [n]$
• Frequency: $m_i = |\{ j : p_j = i \}|$
The objective function
• Frequency-based function: G-Sum(D) $= \sum_i G(m_i)$
• $F_k$ frequency moment: $G(m_i) = m_i^k$
Limitations
• A single pass over D
• Small (polylog) memory: $\left(\frac{1}{\varepsilon} \log(nm)\right)^{O(1)}$
What is needed
• Output a multiplicative approximation X such that: $P\left(\left|X - \sum_i G(m_i)\right| > \varepsilon \sum_i G(m_i)\right) < 2/3$
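As a concrete reading of these definitions, the following minimal Python sketch computes the frequencies $m_i$ and G-Sum exactly. It stores every frequency, so it is a reference computation and not a small-space streaming algorithm. The stream and the function names are illustrative assumptions, not taken from the talk.

```python
from collections import Counter

def frequencies(stream):
    """Return {i: m_i} for the items that occur in the stream."""
    return Counter(stream)

def g_sum(stream, G):
    """Exact G-Sum(D) = sum_i G(m_i); assumes G(0) = 0, so absent items add nothing."""
    return sum(G(m) for m in frequencies(stream).values())

def is_eps_approximation(X, exact, eps):
    """True when |X - exact| <= eps * exact."""
    return abs(X - exact) <= eps * exact

# Hypothetical stream over [n] = {1, ..., 10}; the nonzero frequencies are 1, 2, 1, 3.
stream = [4, 5, 5, 9, 10, 10, 10]
F2 = g_sum(stream, lambda m: m ** 2)               # second frequency moment
F0 = g_sum(stream, lambda m: 1 if m > 0 else 0)    # number of distinct items
```

A streaming algorithm in the sense above would have to output some X for which `is_eps_approximation(X, F2, eps)` holds, without storing the whole `Counter`.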
Alon, Matias, Szegedy (STOC 1996, JCSS 1999, Gödel Prize 2005)
• Frequency moments $G(x) = x^k$, in particular:
• Polylog-space algorithms for $G(x) = x^0$ and $G(x) = x^2$
• Lower bounds for k > 2
• Algorithms for k > 2 (large but sublinear memory)
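For the case $G(x) = x^2$, the Alon, Matias, Szegedy estimator keeps a single counter $Z = \sum_i s_i m_i$ with random signs $s_i \in \{-1, +1\}$ and outputs $Z^2$. The sketch below only illustrates that $Z^2$ is unbiased for $F_2$, by averaging over every sign vector of a tiny, hypothetical frequency vector. A real implementation would update Z once per stream item and draw the signs from a 4-wise independent hash family; neither is shown here.

```python
from itertools import product

def tug_of_war_estimate(freq, signs):
    """One AMS-style estimate of F2: (sum_i s_i * m_i) ** 2."""
    z = sum(s * m for s, m in zip(signs, freq))
    return z * z

def exact_expectation(freq):
    """Average the estimate over all sign vectors in {-1, +1}^n (tiny n only)."""
    n = len(freq)
    total = sum(tug_of_war_estimate(freq, signs)
                for signs in product((-1, 1), repeat=n))
    return total / 2 ** n

freq = [0, 1, 2, 0, 3]            # hypothetical frequency vector
F2 = sum(m * m for m in freq)     # exact second moment
```

The cross terms $s_i s_j m_i m_j$ cancel in expectation, which is why `exact_expectation(freq)` coincides with `F2`.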
The open question of Alon, Matias, Szegedy (1996)
What is the space complexity of estimating other functions G(x)?
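The Main Result
Our result: G(0) = 0, G is non-decreasing
A function $G : \mathbb{R} \to \mathbb{R}$ is in the STREAM-POLYLOG class if there exists an algorithm A such that, for any data stream D and for any $\varepsilon$, A makes a single pass over D, uses $\left(\frac{1}{\varepsilon} \log(nm)\right)^{O(1)}$ memory bits and outputs X s.t. $P\left(\left|X - \sum_i G(m_i)\right| > \varepsilon \sum_i G(m_i)\right) < 2/3$.
$G : \mathbb{N} \to \mathbb{R}$ is tractable: defined through the quantity $\min\left(x, \min\{ |z| : |G(x+z) - G(x)| > \varepsilon G(x) \}\right)$ [the left-hand side of this definition and the tractability condition itself are missing from the extracted slide]
G is in STREAM-POLYLOG if and only if G is tractable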
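The quantity min(x, min(|z| : |G(x+z) – G(x)| > εG(x))) that enters the definition of tractability can be evaluated by brute force for integer arguments. The sketch below does that; the name `local_jump` is hypothetical (the slide's own symbol did not survive), and restricting z to integers with x + z ≥ 1 is an assumption made for the example.

```python
def local_jump(G, x, eps):
    """min(x, min{|z| : |G(x+z) - G(x)| > eps * G(x)}) over integers z, for x >= 1."""
    gx = G(x)
    for d in range(1, x):            # try |z| = d; values d >= x are capped by x anyway
        for z in (d, -d):
            if abs(G(x + z) - gx) > eps * gx:
                return d
    return x

square = lambda v: v ** 2
power_of_two = lambda v: 2 ** v
```

With ε = 0.5, `local_jump(square, 10, 0.5)` is 3 and `local_jump(square, 100, 0.5)` is 23, while `local_jump(power_of_two, 10, 0.5)` is 1: for the polynomial the value grows with x, for the exponential it does not.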
Related Work (A subset)
Alon, Gibbons, Matias, Szegedy. PODS 1999
Alon, Matias, Szegedy. STOC 1996
Andoni, Krauthgamer, Onak. 2010 (arXiv)
Bar-Yossef, Jayram, Kumar, Sivakumar. JCSS 2004
Bar-Yossef, Jayram, Kumar, Sivakumar, Trevisan. RANDOM 2002
Beame, Jayram, Rudra. STOC 2007
Bhuvanagiri, Ganguly, Kesh, Saha. SODA 2006
Bhuvanagiri, Ganguly. ESA 2006
Chakrabarti, Do Ba, Muthukrishnan. SODA 2007
Chakrabarti, Cormode, McGregor. STOC 2008, SODA 2007
Chakrabarti, Khot, Sun. 2003
Chakrabarti, Regev. STOC 2011
Charikar, Chen, Farach-Colton. Th. Comp. Sc. 2004
Coppersmith, Kumar. SODA 2004
Cormode, Datar, Indyk, Muthukrishnan. VLDB 2002
Cormode, Muthukrishnan. J. Alg. 2005
Feigenbaum, Kannan, Strauss, Viswanathan. FOCS 1999
Flajolet, Martin. JCSS 1985
Ganguly. 2004, 2011
Ganguly, Cormode. RANDOM 2007
Guha, Indyk, McGregor. COLT 2007
Guha, McGregor, Venkatasubramanian. SODA 2006
Harvey, Nelson, Onak. FOCS 2008
Indyk. FOCS 2000
Indyk, Woodruff. FOCS 2003, STOC 2005
Jayram, McGregor, Muthukrishnan, Vee. PODS 2007
Kane, Nelson, Woodruff. PODS 2010, SODA 2010
Kane, Nelson, Porat, Woodruff. STOC 2011
Li. SODA 2009, KDD 2007
McGregor, Indyk. SODA 2009
Monemizadeh, Woodruff. SODA 2010
Muthukrishnan. 2005
Nelson, Woodruff. PODS 2011
Saks, Sun. STOC 2002
Woodruff. SODA 2004
Lower Bounds
• Reduction to the multiparty SET-DISJOINTNESS problem
• The reduction requires monotonicity
• Relatively straightforward (see the paper)
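Lower Bounds (informal)
Assume first that $x = k \cdot y$
Pick $N \sim G(x)/G(y)$
[Figure: the players' inputs drawn as 0/1 vectors (100…1, 010…0, 001…0, …). The Stream: each element i or j of a player's set appears as y copies: i i … i, j j … j]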
Reduction (very informal)
If the sets intersect then, by monotonicity, the value of G-Sum is at least $N G(y) + G(x) \sim 2G(x)$
If they do not intersect then the value is at most $(N+k) G(y) \sim G(x)$
Any constant-factor approximation algorithm for G-Sum MUST recognize the difference
And thus requires $N/k^2$ space ([Chakrabarti, Khot, Sun]), which is larger than any polylog
Thus G is not tractable
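The gap that the reduction exploits can be seen on a toy instance. The sketch below assumes k players whose sets are either pairwise disjoint or share exactly one common element, and a stream in which each player writes y copies of every element of its set. The choice of G, k and y, and all names, are hypothetical.

```python
from collections import Counter

def build_stream(sets, y):
    """Each player contributes y copies of every element of its set."""
    return [e for s in sets for e in s for _ in range(y)]

def g_sum(stream, G):
    """Exact G-Sum of the stream (G(0) = 0 assumed)."""
    return sum(G(m) for m in Counter(stream).values())

G = lambda v: v ** 3          # hypothetical non-decreasing G with G(0) = 0
k, y = 4, 1                   # k players, so x = k * y
x = k * y
N = G(x) // G(y)              # N ~ G(x) / G(y)

# Disjoint case: the k sets partition {0, ..., N-1}.
disjoint = [set(range(p, N, k)) for p in range(k)]
# Intersecting case: the same sets plus one extra element (N) held by every player.
intersecting = [s | {N} for s in disjoint]

low = g_sum(build_stream(disjoint, y), G)         # N * G(y)
high = g_sum(build_stream(intersecting, y), G)    # N * G(y) + G(x)
```

Here `high` is twice `low`, so any algorithm that approximates G-Sum within a small enough constant factor tells the two cases apart.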
Upper Bound: Basic Ideas
• We follow the fundamental idea of Indyk and Woodruff
• First we solve a specific case of G-heavy elements
• Then we show that the general case can be solved by recursive sketching
[Diagram: G is handled by a mimic function F together with a certifier H whose output is 1 or 0]
IF H=1 RETURN F
ELSE RETURN 0
G-heavy elements
[Figure: frequency vector of size n with entries G(1), G(1), G(1), G(10^10), G(1), G(1)]
$G(y_i) \ge 100 \sum_{j \ne i} G(y_j)$
Certifier
If G is “good” then every G-heavy element is also $F_2$-heavy
[Figure: frequencies under $G(x) = x^2$ and $G(x) = x^{3/2}$; panels $G_1$, $G_2$, $G_3$ with bit strings 111001, 1110001, 11100001]
Lemma 0 (very informal)
IF G is tractable then
$G(x) \ge \sum_{i \in [n]} G(y_i)$
implies:
$\exists S \subseteq [n]$, $|S| = O(\log n)$, such that: $x^2 \ge \left(\log(nm)\right)^{-O(1)} \sum_{i \in [n] \setminus S} y_i^2$
Proof for L_p (0<p<2)
$x^p \ge \sum_i y_i^p \;\Rightarrow\; x \ge \left(\sum_i y_i^p\right)^{1/p} \ge \left(\sum_i y_i^2\right)^{1/2}$
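The L_p case rests on the standard fact that, for non-negative entries and 0 < p < 2, $\left(\sum_i y_i^2\right)^{1/2} \le \left(\sum_i y_i^p\right)^{1/p}$, so an element that dominates the p-th powers also dominates the squares. A minimal numerical check, on a hypothetical vector:

```python
def lp_norm(y, p):
    """(sum_i y_i**p) ** (1/p) for a vector with non-negative entries."""
    return sum(v ** p for v in y) ** (1.0 / p)

y = [3.0, 1.0, 2.0, 5.0]      # hypothetical non-negative vector
p = 1.5
x = lp_norm(y, p)             # smallest x with x**p >= sum_i y_i**p

# For every 0 < q < 2 the 2-norm is at most the q-norm.
checks = [lp_norm(y, 2) <= lp_norm(y, q) for q in (0.5, 1.0, 1.5, 1.9)]
```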
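Proof (sketch)
$S_w = \{ i : 2^w \le y_i < 2^{w+1} \}$
$G(x) \ge \sum_i G(y_i) \ge |S_w| \, G(2^w)$
[The remaining lines of this slide, which compare $x^2$ with $\sum_{i \in S_w} y_i^2$, could not be recovered from the extraction]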
Mimic Function
$P(h_i = 1) = P(h_i = -1) = 0.5$
$\left| \sum_{i=1}^{n} h_i y_i \right| \approx y_1$
$G\left( \left| \sum_i h_i y_i \right| \right) \approx G(y_1)$
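Recursive Sketches
Lemma 1
Let $V \in \mathbb{R}^n$ be a vector with non-negative entries, and write $|V| = \sum_i v_i$. Let $H \in \{0,1\}^n$ be a random vector with pairwise-independent uniform entries. Let S be s.t.:
$S \supseteq \{ i : v_i \ge \alpha \sum_{j=1}^{n} v_j \}$
Define
$X = \sum_{i \in S} v_i + 2 \sum_{i \notin S} h_i v_i$
Then
$P\left( |X - |V|| \ge \varepsilon |V| \right) \le \frac{\alpha}{\varepsilon^2}$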
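Lemma 1 can be checked exhaustively on a small vector. The sketch below reads the lemma as: S holds every index with $v_i \ge \alpha |V|$ where $|V| = \sum_i v_i$, $X = \sum_{i \in S} v_i + 2 \sum_{i \notin S} h_i v_i$, and the failure probability is at most $\alpha / \varepsilon^2$ (a Chebyshev bound). It enumerates all of $\{0,1\}^n$, which is fully independent and therefore also pairwise independent. The vector, the parameters and the names are assumptions chosen for the example.

```python
from itertools import product

def heavy_set(V, alpha):
    """Indices i with v_i >= alpha * |V|, where |V| is the sum of the entries."""
    total = sum(V)
    return {i for i, v in enumerate(V) if v >= alpha * total}

def estimate(V, H, S):
    """X = sum_{i in S} v_i + 2 * sum_{i not in S} h_i * v_i."""
    return (sum(V[i] for i in S)
            + 2 * sum(H[i] * V[i] for i in range(len(V)) if i not in S))

def failure_probability(V, alpha, eps):
    """Exact P(|X - |V|| >= eps * |V|) over uniform H in {0,1}^n (tiny n only)."""
    S, total, n = heavy_set(V, alpha), sum(V), len(V)
    bad = sum(abs(estimate(V, H, S) - total) >= eps * total
              for H in product((0, 1), repeat=n))
    return bad / 2 ** n

V = [1] * 8                   # hypothetical vector with no heavy entry for alpha = 0.2
alpha, eps = 0.2, 0.5
p_fail = failure_probability(V, alpha, eps)
```

For this vector the exact failure probability is 74/256, about 0.29, below the bound $\alpha/\varepsilon^2 = 0.8$.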
Hadamard product Had(U,V) of two vectors U and V is a vector with entries $v_i u_i$:
Had(U,V) $= (v_1 u_1, v_2 u_2, \dots, v_n u_n)$
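Lemma 2
$H_1, H_2, \dots, H_t$ are i.i.d. vectors
$V_0 = V$, $V_i = \mathrm{Had}(V_{i-1}, H_i)$
$S_i \supseteq \{ l : v^i_l \ge \alpha \sum_{j=1}^{n} v^i_j \}$
where $v^i_j$ and $h^i_j$ are the j-th entries of $V_i$ and $H_i$
Denote for i = 1, 2, …, t
$X_i = \sum_{j \in S_{i-1}} v^{i-1}_j + 2 \sum_{j \notin S_{i-1}} h^i_j v^{i-1}_j$
Then
$P\left( \bigcup_{i=1}^{t} \left\{ |X_i - |V_{i-1}|| \ge \varepsilon |V_{i-1}| \right\} \right) \le \frac{t\alpha}{\varepsilon^2}$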
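Lemma 3
Denote
$Y_t = |V_t|$, $Y_{i-1} = 2Y_i + \sum_{j \in S_{i-1}} (1 - 2h^i_j) v^{i-1}_j$
Then for $\alpha = O(\varepsilon^2 / t^3)$:
$P\left( |Y_0 - |V|| \ge \varepsilon |V| \right) \le 0.1$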
The general algorithm (informal)
• Maintain $H_1, \dots, H_t$
• We can obtain $V_i$ by dropping all stream elements that are not “sampled”
• For t = O(log(n)), the number of non-zero elements in $V_t$ is constant, with constant probability
• Thus, given an oracle for “heavy” elements, the sum can be approximated using only log(n) calls to the “heavy” elements oracle
$Y_t = |V_t|$, $Y_{i-1} = 2Y_i + \sum_{j \in S_{i-1}} (1 - 2h^i_j) v^{i-1}_j$
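The recursion can be sketched in a few lines, under one reading of the slides: $V_i = \mathrm{Had}(V_{i-1}, H_i)$, $Y_t = |V_t|$, and $Y_{i-1} = 2Y_i + \sum_{j \in S_{i-1}} (1 - 2h^i_j) v^{i-1}_j$, where $S_{i-1}$ holds the heavy entries of $V_{i-1}$. The exact `heavy_indices` function below is a stand-in for the heavy-element oracle, the sign of the correction term follows from expanding the estimator of Lemma 1, and the bits are fully random; all three are assumptions of this example, and the code keeps every level in memory, so it is not a small-space implementation.

```python
import random

def heavy_indices(V, alpha):
    """Stand-in oracle: positive entries with v_j >= alpha * |V|."""
    total = sum(V)
    return [j for j, v in enumerate(V) if v > 0 and v >= alpha * total]

def recursive_sketch(V, t, alpha, rng):
    """Estimate |V| = sum(V) from t levels of subsampling plus heavy entries."""
    levels, samples = [list(V)], []
    for _ in range(t):
        H = [rng.randint(0, 1) for _ in V]                       # sampling bits H_i
        samples.append(H)
        levels.append([h * v for h, v in zip(H, levels[-1])])    # V_i = Had(V_{i-1}, H_i)
    Y = sum(levels[t])                                           # Y_t = |V_t|
    for i in range(t, 0, -1):
        prev, H = levels[i - 1], samples[i - 1]
        S = heavy_indices(prev, alpha)
        Y = 2 * Y + sum((1 - 2 * H[j]) * prev[j] for j in S)     # Y_{i-1}
    return Y
```

When every positive entry is treated as heavy (alpha = 0), the correction term makes each step exact, so `recursive_sketch(V, t, 0.0, rng)` returns `sum(V)` for any random bits; with a larger alpha the result is only an estimate.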
The Algorithm for large Frequency moments (informal)
• The general algorithm works for any “separable” vector, in particular for the frequency moments vector
• Also, such oracles for “heavy” elements exist for frequency moments, e.g., CountSketch by Charikar, Chen, Farach-Colton, 2004
• The final algorithm requires $n^{1-2/k} \log(n) \log(m) \log(\log \dots (\log(nm)))$ memory bits
• Independently, Andoni, Krauthgamer, Onak improved the bound to $n^{1-2/k} \log(n) \log(m)$ (Precision Sampling: Alex’s talk yesterday)
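To make the heavy-element oracle less abstract, here is a toy version of the CountSketch data structure mentioned above: each row hashes an item to one bucket with a random sign, and an item's frequency is estimated by the median over rows. For brevity the hash and sign functions are stored as explicit random tables of size n, which uses linear memory; a real CountSketch would use pairwise-independent hash functions described in small space. Class and parameter names are hypothetical.

```python
import random
from statistics import median

class CountSketch:
    """Toy CountSketch over items 0..n-1 with explicit (non-succinct) hash tables."""

    def __init__(self, n, rows, width, seed=0):
        rng = random.Random(seed)
        self.bucket = [[rng.randrange(width) for _ in range(n)] for _ in range(rows)]
        self.sign = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(rows)]
        self.table = [[0] * width for _ in range(rows)]

    def update(self, item, delta=1):
        """Process one stream element: add its signed count to one bucket per row."""
        for r, row in enumerate(self.table):
            row[self.bucket[r][item]] += self.sign[r][item] * delta

    def estimate(self, item):
        """Median over rows of the signed bucket value for this item."""
        return median(self.sign[r][item] * row[self.bucket[r][item]]
                      for r, row in enumerate(self.table))

cs = CountSketch(n=100, rows=5, width=16)
for _ in range(5):
    cs.update(3)
```

With only item 3 in the stream, every row returns its count exactly; when other items collide with it, individual rows are perturbed and the median is what keeps the estimate of a heavy item close to its true frequency.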
Notes
• We need to overcome additional technical issues
• Heavy elements: from precise values to approximations
Open problems
• Characterize non-monotonic functions (we made some progress)
• Extend the results to sublinear algorithms (o(n) space)
• Other models: deletions, sliding windows, etc.
• Optimal algorithm for large frequency moments
Thank you!