Communication Cost in Big Data Processing
Dan Suciu University of Washington
Joint work with Paul Beame and Paris Koutris
and the Database Group at the UW
1
Queries on Big Data
Big Data processing on distributed clusters • General programs: MapReduce, Spark • SQL: PigLatin, BigQuery, Shark, Myria…
Traditional Data Processing • Main cost = disk I/O Big Data Processing • Main cost = communication
2
This Talk • How much communication is needed to solve
a “problem” on p servers?
• “Problem”: – In this talk = Full Conjunctive Query CQ (≈ SQL) – Future work = extend to linear algebra, ML, etc.
• Some techniques discussed in this talk are implemented in Myria (Bill Howe’s talk)
3
Models of Communication
• Combined communication/computation cost: – PRAM → MRC [Karloff’2010] – BSP [Valiant] → LogP Model [Culler]
• Explicit communication cost – [Hong&Kung’81] red/blue pebble games, for I/O communication complexity → [Ballard’2012] extension to parallel algorithms
– MapReduce model [DasSarma’2013] MPC [Beame’2013,Beame’2014] – this talk
4
Outline
• The MPC Model
• Communication Cost for Triangles
• Communication Cost for CQ’s
• Discussion
5
Massively Parallel Communication Model (MPC) Extends BSP [Valiant]

[Diagram: p servers (Server 1 … Server p) hold an input of size M, O(M/p) per server, and proceed through Rounds 1, 2, 3, …; each server sends/receives at most L bits per round]

Input data = size M bits
Number of servers = p
One round = Compute & communicate
Algorithm = Several rounds
Max communication load per server = L

Ideal: L = M/p
This talk: L = M/p * pε, where ε in [0,1] is called the space exponent
Degenerate: L = M (send everything to one server)

10
Example: Q(x,y,z) = R(x,y) ⋈ S(y,z)

[Diagram: p servers, each holding M/p of the input, exchange O(M/p) data in Round 1]

Input: • R, S uniformly partitioned on p servers • size(R)+size(S) = M

Round 1: each server
• Sends record R(x,y) to server h(y) mod p
• Sends record S(y,z) to server h(y) mod p

Output: each server computes and outputs the local join R(x,y) ⋈ S(y,z)

Assuming no skew: ∀y, degreeR(y), degreeS(y) ≤ M/p
Rounds = 1
Load L = O(M/p) = O(M/p * p0), i.e. space exponent = 0

SQL is embarrassingly parallel!
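The one-round hash join above can be sketched as a single-machine simulation (illustrative; the function name and the use of Python's built-in `hash` are assumptions, not taken from Myria or any other system):

```python
# Simulate the one-round MPC hash join for Q(x,y,z) = R(x,y) |><| S(y,z):
# every tuple is routed to server hash(y) % p, then each server joins locally.
from collections import defaultdict

def hash_join_one_round(R, S, p):
    servers_R = defaultdict(list)
    servers_S = defaultdict(list)
    for (x, y) in R:                      # Round 1: communicate on h(y)
        servers_R[hash(y) % p].append((x, y))
    for (y, z) in S:
        servers_S[hash(y) % p].append((y, z))
    out = []
    for v in range(p):                    # each server computes its local join
        for (x, y1) in servers_R[v]:
            for (y2, z) in servers_S[v]:
                if y1 == y2:
                    out.append((x, y1, z))
    # max load per server, counted in tuples (the quantity L in the model)
    load = max((len(servers_R[v]) + len(servers_S[v]) for v in range(p)),
               default=0)
    return out, load
```

With skew-free data every h(y)-bucket stays near M/p tuples, matching the L = O(M/p) bound on the slide.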
Questions
Fix a query Q and a number of servers p
• What is the communication load L needed to compute Q in one round? (This talk; based on [Beame’2013, Beame’2014])
• Fix a maximum load L (space exponent ε): how many rounds are needed for Q? (Preliminary results in [Beame’2013])
16
Outline
• The MPC Model
• Communication Cost for Triangles
• Communication Cost for CQ’s
• Discussion
17
The Triangle Query
• Input: three tables R(X, Y), S(Y, Z), T(Z, X) size(R)+size(S)+size(T)=M
• Output: compute Q(x,y,z) = R(x,y) ⋈ S(y,z) ⋈ T(z,x)
R(X, Y): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice), …
S(Y, Z): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice), …
T(Z, X): (Fred, Alice), (Jack, Jim), (Fred, Jim), (Carol, Alice), …
18
Triangles in Two Rounds
• Round 1: compute temp(X,Y,Z) = R(X,Y) ⋈ S(Y,Z)
• Round 2: compute Q(X,Y,Z) = temp(X,Y,Z) ⋈ T(Z,X)
Load L depends on the size of temp. Can be much larger than M/p
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) size(R)+size(S)+size(T)=M
Triangles in One Round
• Factorize p = p⅓ × p⅓ × p⅓
• Each server is identified by a triple (i,j,k): place the servers in a cube with side p⅓
[Ganguli’92, Afrati&Ullman’10, Suri’11, Beame’13]
20
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) size(R)+size(S)+size(T)=M
Triangles in One Round
[Diagram: the cube of servers with side p⅓. The sample tables R, S, T from the previous slide are routed to the cube; e.g. with i = h(Fred) and j = h(Jim), the tuple R(Fred, Jim) is sent to all servers (i, j, *)]
Round 1: Send R(x,y) to all servers (h(x),h(y),*) Send S(y,z) to all servers (*, h(y), h(z)) Send T(z,x) to all servers (h(x), *, h(z))
Output: compute locally R(x,y) ⋈ S(y,z) ⋈ T(z,x)
21
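The round-1 routing above can be simulated in a few lines (an illustrative sketch on one machine; `c` plays the role of p⅓ and the hash function is an assumption):

```python
# One-round "cube" algorithm for Q(x,y,z) = R(x,y) |><| S(y,z) |><| T(z,x)
# on p = c^3 simulated servers arranged on a c x c x c grid.
from collections import defaultdict

def triangles_one_round(R, S, T, c):
    h = lambda v: hash(v) % c
    servers = defaultdict(lambda: ([], [], []))   # (R-part, S-part, T-part)
    for (x, y) in R:                              # R(x,y) -> (h(x), h(y), *)
        for k in range(c):
            servers[(h(x), h(y), k)][0].append((x, y))
    for (y, z) in S:                              # S(y,z) -> (*, h(y), h(z))
        for i in range(c):
            servers[(i, h(y), h(z))][1].append((y, z))
    for (z, x) in T:                              # T(z,x) -> (h(x), *, h(z))
        for j in range(c):
            servers[(h(x), j, h(z))][2].append((z, x))
    out = set()
    for (Rv, Sv, Tv) in servers.values():         # local join at each server
        for (x, y) in Rv:
            for (y2, z) in Sv:
                if y == y2 and (z, x) in Tv:
                    out.add((x, y, z))
    return out
```

Each tuple is replicated to c = p⅓ servers, so the load per server is O(M/p⅔), as the next slide states.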
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) size(R)+size(S)+size(T)=M
Upper Bound is Lalgo = O(M/p⅔)
22
Theorem Assume the data has no skew: all degrees ≤ O(M/p⅓). Then the algorithm computes Q in one round, with communication load Lalgo = O(M/p⅔)
Compare to two rounds: • No more large intermediate result • Skew-free up to degrees = O(M/p⅓) (from O(M/p)) • BUT: space exponent increased to ε = ⅓ (from ε = 1)
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
M = size(R)+size(S)+size(T) (in bits)
size(R)+size(S)+size(T)=M
Lower Bound is Llower = M / p⅔

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
M = size(R)+size(S)+size(T) (in bits)

Theorem* Any one-round deterministic algorithm A for Q requires Llower bits of communication per server

Assume that the three input relations R, S, T are stored on disjoint servers#

25
We prove a stronger claim, about any algorithm communicating < Llower bits. Let L be the algorithm’s max communication load per server. We prove: if L < Llower then, over random permutations R, S, T, the algorithm returns at most a fraction (L / Llower)3/2 of the answers to Q

# without this assumption, add a multiplicative factor of 3
* Beame, Koutris, Suciu, Communication Steps for Parallel Query Processing, PODS 2013
Lower Bound is Llower = M / p⅔

Discussion:
• An algorithm with space exponent ε < ⅓ has load L = M/p1-ε and reports only a (L / Llower)3/2 = 1/p(1-3ε)/2 fraction of all answers: fewer as p increases!
• The lower bound holds only for random inputs: it cannot hold for a fixed input R, S, T, since the algorithm may use a short bit encoding to signal this input to all servers
• By Yao’s lemma, the result also holds for randomized algorithms on some fixed input
26
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X) size(R)+size(S)+size(T)=M
[Ballard’2012] consider Strassen’s matrix multiplication algorithm and prove the following theorem for the CDAG model. If each server has M ≤ c · n2 / p2/lg 7 memory, then the communication load per server is

  IO(n, p, M) = Ω( (n/√M)lg 7 · M/p )

Thus, if M = n2 / p2/lg 7, then IO(n, p) = Ω( n2 / p2/lg 7 )
Lower Bound is Llower = M / p⅔
• Stronger than MPC: no restriction on rounds
• Weaker than MPC:
  – Applies only to an algorithm, not to a problem
  – CDAG model of computation vs. bit model
27
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
(the M in our lower bound corresponds to n2 here)
size(R)+size(S)+size(T)=M
Lower Bound is Llower = M / p⅔
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
Proof
28
size(R)+size(S)+size(T)=M
Lower Bound is Llower = M / p⅔
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
Proof
E [#answers to Q] = Σi,j,k Pr [(i,j,k) ∈ R⋈S⋈T] = n3 * (1/n)3 = 1
29
Input: size = M = 3 log n! • R, S, T random permutations on [n]: Pr [(i,j) ∈R] = 1/n, same for S, T
size(R)+size(S)+size(T)=M
Lower Bound is Llower = M / p⅔
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
Proof
E [#answers to Q] = Σi,j,k Pr [(i,j,k) ∈ R⋈S⋈T] = n3 * (1/n)3 = 1
30
One server v ∈[p]: receives L(v) = LR(v) + LS(v) + LT(v) bits • Denote aij = Pr [(i,j)∈R and v knows it]
Then (1) aij ≤ 1/n (obvious) (2) Σi,j aij ≤ LR(v) / log n (formal proof: entropy) • Denote similarly bjk,cki for S and T
Input: size = M = 3 log n! • R, S, T random permutations on [n]: Pr [(i,j) ∈R] = 1/n, same for S, T
size(R)+size(S)+size(T)=M
Lower Bound is Llower = M / p⅔
Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
Proof
E [#answers to Q] = Σi,j,k Pr [(i,j,k) ∈ R⋈S⋈T] = n3 * (1/n)3 = 1
31
One server v ∈[p]: receives L(v) = LR(v) + LS(v) + LT(v) bits • Denote aij = Pr [(i,j)∈R and v knows it]
Then (1) aij ≤ 1/n (obvious) (2) Σi,j aij ≤ LR(v) / log n (formal proof: entropy) • Denote similarly bjk,cki for S and T • E [#answers to Q reported by server v] =
Σi,j,k aij bjk cki ≤ [(Σi,j aij2) * (Σj,k bjk
2) * (Σk,i cki2)]½ [Friedgut]
≤ (1/n) 3/2 * [ LR(v) * LS(v) * LT(v) / log3 n ]½ by (1), (2)
≤ (1/n) 3/2 * [ L(v) / 3* log n ]3/2 geometric/arithmetic = [ L(v) / M ]3/2 expression for M above
Input: size = M = 3 log n! • R, S, T random permutations on [n]: Pr [(i,j) ∈R] = 1/n, same for S, T
size(R)+size(S)+size(T)=M
Lower Bound is Llower = M / p⅔

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X)
size(R)+size(S)+size(T) = M

Proof
Input: R, S, T are random permutations on [n], so Pr[(i,j) ∈ R] = 1/n (same for S, T) and size M = 3 log(n!) bits

E[#answers to Q] = Σi,j,k Pr[(i,j,k) ∈ R⋈S⋈T] = n^3 * (1/n)^3 = 1

One server v ∈ [p] receives L(v) = LR(v) + LS(v) + LT(v) bits
• Denote aij = Pr[(i,j) ∈ R and v knows it]. Then:
  (1) aij ≤ 1/n (obvious)
  (2) Σi,j aij ≤ LR(v) / log n (formal proof: entropy)
• Denote bjk, cki similarly for S and T
• E[#answers to Q reported by server v]
  = Σi,j,k aij bjk cki
  ≤ [ (Σi,j aij^2) * (Σj,k bjk^2) * (Σk,i cki^2) ]^½   [Friedgut]
  ≤ (1/n)^{3/2} * [ LR(v) * LS(v) * LT(v) / log^3 n ]^½   by (1), (2)
  ≤ (1/n)^{3/2} * [ L(v) / (3 log n) ]^{3/2}   by the arithmetic/geometric mean inequality
  = [ L(v) / M ]^{3/2}   using M = 3 log(n!) ≈ 3n log n

E[#answers to Q reported by all servers] ≤ Σv [ L(v) / M ]^{3/2} ≤ p * [ L / M ]^{3/2} = [ L / Llower ]^{3/2}

32
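The starting point of the proof, E[#answers to Q] = 1 for random permutations, is easy to check numerically (an illustrative simulation, not part of the proof; function name and seed are assumptions):

```python
# For random permutations R, S, T on [n], a triangle (i,j,k) exists iff
# pT[pS[pR[i]]] == i, i.e. triangles are fixed points of a random
# permutation, so the expected count is n^3 * (1/n)^3 = 1.
import random

def expected_triangles(n, trials, seed=0):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        pR = list(range(n)); rng.shuffle(pR)   # R = {(i, pR[i])}
        pS = list(range(n)); rng.shuffle(pS)   # S = {(j, pS[j])}
        pT = list(range(n)); rng.shuffle(pT)   # T = {(k, pT[k])}
        total += sum(1 for i in range(n) if pT[pS[pR[i]]] == i)
    return total / trials
```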
Outline
• The MPC Model
• Communication Cost for Triangles
• Communication Cost for CQ’s
• Discussion
33
General Formula
Consider an arbitrary full conjunctive query (CQ):

  Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ)
  size(S1) = M1, size(S2) = M2, …, size(Sℓ) = Mℓ

We will:
• Give a lower bound formula Llower
• Give an algorithm with load Lalgo
• Prove that Llower = Lalgo

34
Cartesian Product

Q(x,y) = S1(x) × S2(y), size(S1) = M1, size(S2) = M2

  Lalgo = Llower = 2 (M1 · M2 / p)^½

Algorithm: factorize p = p1 × p2
• Send S1(x) to all servers (h(x), *); send S2(y) to all servers (*, h(y))
• Load = M1/p1 + M2/p2, minimized when M1/p1 = M2/p2 = (M1M2/p)^½

Lower bound:
• Let L(v) = L1(v) + L2(v) = load at server v
• The p servers must report all answers to Q:
  M1 M2 ≤ Σv L1(v) * L2(v) ≤ ½ Σv (L(v))^2 ≤ ½ p maxv (L(v))^2

37
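The choice of p1 and p2 can be checked with a small script (illustrative; it searches integer factorizations of p and compares the best load with the closed form 2(M1·M2/p)^½):

```python
# Find the factorization p = p1 * p2 minimizing the Cartesian-product
# load M1/p1 + M2/p2, as in the algorithm on this slide.
def best_split(M1, M2, p):
    best = None
    for p1 in range(1, p + 1):
        if p % p1 == 0:
            p2 = p // p1
            load = M1 / p1 + M2 / p2
            if best is None or load < best[0]:
                best = (load, p1, p2)
    return best
```

When M1 = M2, the optimum is the square split p1 = p2 = √p, and the load matches 2(M1·M2/p)^½ exactly.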
A Simple Observation

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(Sj) = Mj

Definition An edge packing for Q is a set of atoms with no common variables:
  U = {Sj1(x̄j1), Sj2(x̄j2), …, Sj|U|(x̄j|U|)},  ∀i, k: x̄ji ∩ x̄jk = ∅

Claim Any one-round algorithm for Q must also compute the cartesian product
  Q’(x̄j1, …, x̄j|U|) = Sj1(x̄j1) ⋈ Sj2(x̄j2) ⋈ … ⋈ Sj|U|(x̄j|U|)
Proof Because the algorithm doesn’t know the other Sj’s. Proof of the bound is similar to the cartesian product on the previous slide:

  Llower = |U| · (Mj1 · Mj2 ··· Mj|U| / p)^{1/|U|}

Assume that the input relations S1, S2, … are stored on disjoint servers#
#otherwise, add another factor |U|
The Lower Bound for CQ

39

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(Sj) = Mj

Definition A fractional edge packing is a set of real numbers u1, u2, …, uℓ s.t.:
  ∀j ∈ [ℓ]: uj ≥ 0
  ∀i ∈ [k]: Σ_{j ∈ [ℓ]: xi ∈ vars(Sj(x̄j))} uj ≤ 1

Theorem* Any one-round deterministic algorithm A for Q requires O(Llower) bits of communication per server:

  Llower = (M1^u1 · M2^u2 ··· Mℓ^uℓ / p)^{1/(u1+u2+···+uℓ)}

Note that we ignore constant factors (like |U| on the previous slide)

*Beame, Koutris, Suciu, Skew in Parallel Data Processing, PODS 2014
Triangle Query Revisited

40

Q(X,Y,Z) = R(X,Y) ⋈ S(Y,Z) ⋈ T(Z,X), size(R) = M1, size(S) = M2, size(T) = M3

Packing (u1, u2, u3)   L = (M1^u1 * M2^u2 * M3^u3 / p)^{1/(u1+u2+u3)}
(½, ½, ½)              (M1 M2 M3)⅓ / p⅔
(1, 0, 0)              M1 / p
(0, 1, 0)              M2 / p
(0, 0, 1)              M3 / p

Llower = the largest of these four values. When M1 = M2 = M3 = M, the largest value is the first: M / p⅔

[Figure: the triangle on variables x, y, z with weight ½ on each edge]
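The table above can be evaluated programmatically (a sketch; the four packings are exactly the ones listed on this slide, and the function names are illustrative):

```python
# L = (M1^u1 * M2^u2 * M3^u3 / p)^(1/(u1+u2+u3)) for one edge packing,
# and Llower = the largest value over the four packings of the triangle.
def packing_load(Ms, us, p):
    num = 1.0
    for M, u in zip(Ms, us):
        num *= M ** u
    return (num / p) ** (1.0 / sum(us))

def llower_triangle(M1, M2, M3, p):
    packings = [(0.5, 0.5, 0.5), (1, 0, 0), (0, 1, 0), (0, 0, 1)]
    return max(packing_load((M1, M2, M3), u, p) for u in packings)
```

With equal sizes M1 = M2 = M3 = M the fractional packing (½, ½, ½) dominates, giving M/p⅔; with one very large relation, a single-atom packing Mj/p takes over.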
An Algorithm for CQ [Afrati&Ullman’2010]

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(Sj) = Mj

Factorize p = p1 × p2 × … × pk; a server is (v1, …, vk) ∈ [p1] × … × [pk]
• Send S1(x11, x12, …) to servers with coordinates h(x11), h(x12), …
• Send S2(x21, x22, …) to servers with coordinates h(x21), h(x22), …
• …
• Compute the query locally at each server
• Load is O(Lalgo), where:

  Lalgo = max_{j ∈ [ℓ]} Mj / Π_{i ∈ [k]: xi ∈ vars(Sj(x̄j))} pi

[A&U] use sum instead of max. With max we can compute the optimal p1, p2, …, pk
Llower = Lalgo

Q(x1, …, xk) = S1(x̄1) ⋈ … ⋈ Sℓ(x̄ℓ), size(Sj) = Mj

  Llower = (Π_j Mj^uj / p)^{1/Σj uj}        Lalgo = max_j Mj / Π_{i: xi ∈ vars(Sj)} pi

Apply log in base p. Denote µj = logp Mj, ei = logp pi

Primal:  minimize λ
         s.t. Σi ei ≤ 1
              ∀j: λ + Σ_{i: xi ∈ vars(Sj)} ei ≥ µj

Dual:    maximize (Σj fj µj) − f0
         s.t. Σj fj ≤ 1
              ∀i: Σ_{j: xi ∈ vars(Sj)} fj ≤ f0

Primal/Dual correspondence: uj = fj / f0, u0 = 1 / f0; at optimality u0 = Σj uj
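The primal LP on this slide can be approximated by brute force for small queries (an illustrative sketch; a real implementation would call an LP solver, but a coarse grid search over the share exponents ei is enough to see the optimum; the function name and grid resolution are assumptions):

```python
# Minimize lambda = max_j (mu_j - sum of e_i over variables of S_j),
# subject to sum_i e_i <= 1 and e_i >= 0, where p_i = p^{e_i}.
from itertools import product as iproduct

def best_shares(atoms, mus, nvars, steps=20):
    """atoms: list of variable-index tuples, one per relation S_j;
    mus: list of mu_j = log_p M_j."""
    grid = [s / steps for s in range(steps + 1)]
    best = None
    for es in iproduct(grid, repeat=nvars):
        if sum(es) > 1 + 1e-9:
            continue
        lam = max(mu - sum(es[i] for i in atom)
                  for atom, mu in zip(atoms, mus))
        if best is None or lam < best[0]:
            best = (lam, es)
    return best
```

For the triangle query with all µj = 1 (Mj = p), the search recovers ei = ⅓ and λ = ⅓, i.e. load p^⅓ · M/p = M/p⅔, matching the earlier slides.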
Outline
• The MPC Model
• Communication Cost for Triangles
• Communication Cost for CQ’s
• Discussion
45
Discussion

Communication costs for multiple rounds
• Lower/upper bounds in [Beame’13] connect the number of rounds to the log of the query diameter
• Weaker communication model: the algorithm sends only tuples, not arbitrary bits

Communication cost for skewed data
• Lower/upper bounds in [Beame’14] have a gap of poly log p

Communication costs beyond full CQ’s
• Open!
46