Farzin Haddadpour
Trading Redundancy for Communication: Speeding up Distributed SGD for
Non-convex Optimization
Joint work with Viveck Cadambe, Mehrdad Mahdavi, and Mohammad Mahdi Kamani
Goal: solving
$\min_x f(x) \triangleq \sum_i f_i(x)$

SGD:
$x^{(t+1)} = x^{(t)} - \eta \, \frac{1}{|\xi^{(t)}|} \, \nabla f(x^{(t)}; \xi^{(t)})$

Parallelization due to computational cost.

Distributed SGD:
$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \, \nabla f(x^{(t)}; \xi_j^{(t)})$

Communication is the bottleneck.
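A minimal NumPy sketch of the synchronous distributed SGD update above. The gradient oracle grad_f, the per-worker mini-batches in "batches", and the serial simulation of the p workers are illustrative assumptions, not part of the talk.

```python
import numpy as np

def distributed_sgd_step(x, grad_f, batches, eta):
    """One synchronous distributed SGD step: each of the p workers computes a
    mini-batch stochastic gradient at the shared iterate x, and the server
    applies the gradients averaged over workers."""
    p = len(batches)
    avg_grad = np.zeros_like(x)
    for xi_j in batches:                                     # worker j's mini-batch xi_j^{(t)}
        g_j = np.mean([grad_f(x, s) for s in xi_j], axis=0)  # (1/|xi_j|) * sum of per-sample gradients
        avg_grad += g_j / p
    return x - eta * avg_grad                                # x^{(t+1)} = x^{(t)} - (eta/p) * sum_j g_j
```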
Communication cost:
- Number of bits per iteration: gradient compression based techniques (a sketch follows below).
- Number of rounds: local SGD with periodic averaging.
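As one hedged illustration of the bits-per-iteration direction (not the method of this talk), the sketch below applies top-k sparsification, a common gradient compression technique; the parameter k and the helper name are hypothetical.

```python
import numpy as np

def top_k_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector, so a
    worker can communicate k (index, value) pairs instead of the full vector."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]  # indices of the k largest |grad_i|
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse, idx, grad[idx]
```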
Local SGD with periodic averaging

$x_j^{(t+1)} = \frac{1}{p} \sum_{j=1}^{p} \left[ x_j^{(t)} - \eta \, \tilde{g}_j^{(t)} \right]$  if $\tau \mid t$  (averaging step (a))
$x_j^{(t+1)} = x_j^{(t)} - \eta \, \tilde{g}_j^{(t)}$  otherwise  (local update (b))

[Figure: workers W1, W2, W3 alternating local updates (b) and averaging steps (a), shown for p = 3 with τ = 1 and τ = 3.]
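A minimal sketch of the update rule above: each worker runs SGD on its own copy of the model and all copies are averaged whenever τ divides the iteration counter. The gradient oracle grad_f, the per-worker datasets, and the serial loop over workers are placeholders for illustration.

```python
import numpy as np

def local_sgd(x0, grad_f, worker_data, eta, tau, T, rng):
    """Local SGD with periodic averaging: worker j updates its local copy x_j
    (local update (b)); every tau iterations all copies are averaged
    (averaging step (a))."""
    p = len(worker_data)
    x = [x0.copy() for _ in range(p)]
    for t in range(1, T + 1):
        for j in range(p):
            sample = worker_data[j][rng.integers(len(worker_data[j]))]
            x[j] = x[j] - eta * grad_f(x[j], sample)   # local update (b)
        if t % tau == 0:                               # averaging step (a)
            avg = sum(x) / p
            x = [avg.copy() for _ in range(p)]
    return sum(x) / p
```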
Convergence analysis of local SGD with periodic averaging

Table 1: Comparison of different SGD-based algorithms.
Strategy        | Convergence error            | Assumptions        | Comm. rounds (T/τ)
SGD             | O(1/√(pT))                   | i.i.d. & b.g.      | T
[Yu et al.]     | O(1/√(pT))                   | i.i.d. & b.g.      | O(p^{3/4} T^{1/4})
[Wang & Joshi]  | O(1/√(pT))                   | i.i.d.             | O(p^{3/2} T^{1/2})
RI-SGD (τ, q)   | O(1/√(pT)) + O((1 − q/p)Δ)   | non-i.i.d. & b.d.  | O(p^{3/2} T^{1/2})

Assumptions: b.g.: bounded gradient, $\|g_i\|_2^2 \le G$; unbiased gradient estimation, $\mathbb{E}[\tilde{g}_j] = g_j$.

Open questions:
A. Residual error is observed in practice, but a theoretical understanding is missing.
B. How can we capture this in the convergence analysis?
C. Is there any solution to improve it?
Insufficiency of the convergence analysis

A. Residual error is observed in practice, but a theoretical understanding is missing: the unbiased gradient estimation assumption does not hold.
B. How can we capture this in the convergence analysis? Our work: an analysis based on biased gradients.
C. Is there any solution to improve it? Our work: redundancy.
Redundancy-infused local SGD (RI-SGD)

The training data is partitioned into chunks: $D = D_1 \cup D_2 \cup D_3$.

[Figure: left, local SGD with p = 3, τ = 3, where workers W1, W2, W3 hold the disjoint chunks D1, D2, D3; right, RI-SGD with q = 2, p = 3, τ = 3, where each worker holds q = 2 of the chunks.]

Explicit redundancy.
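A minimal sketch of the explicit redundancy for the example above (p = 3, q = 2): the data is split into p chunks and each worker is assigned q of them, so every chunk is replicated on q workers. The cyclic chunk-to-worker mapping used here is an assumption for illustration; the paper's exact placement may differ.

```python
def assign_redundant_chunks(chunks, q):
    """RI-SGD style placement (assumed cyclic mapping): with p chunks
    D_1..D_p, worker j receives the q chunks D_j, D_{j+1}, ..., D_{j+q-1}
    with wrap-around, so each chunk lives on q workers."""
    p = len(chunks)
    shards = []
    for j in range(p):
        shard = []
        for r in range(q):
            shard.extend(chunks[(j + r) % p])   # cyclic, wrap-around assignment
        shards.append(shard)
    return shards

# Example with p = 3, q = 2: worker 1 holds D1 and D2, worker 2 holds D2 and D3,
# worker 3 holds D3 and D1.
```

Each worker then runs the local SGD loop from the earlier sketch on its enlarged shard; q = 1 recovers plain local SGD, while q = p replicates the full dataset on every worker.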
Comparing RI-SGD with other schemes

Table 1: Comparison of different SGD-based algorithms.
Strategy        | Convergence error            | Assumptions        | Comm. rounds (T/τ)
SGD             | O(1/√(pT))                   | i.i.d. & b.g.      | T
[Yu et al.]     | O(1/√(pT))                   | i.i.d. & b.g.      | O(p^{3/4} T^{1/4})
[Wang & Joshi]  | O(1/√(pT))                   | i.i.d.             | O(p^{3/2} T^{1/2})
RI-SGD (τ, q)   | O(1/√(pT)) + O((1 − q/p)Δ)   | non-i.i.d. & b.d.  | O(p^{3/2} T^{1/2})

Assumption b.d.: bounded inner product of gradients, $\langle g_i, g_j \rangle \le \Delta$.
The analysis allows explicit redundancy and biased gradients.
q: number of data chunks at each worker node.
Advantages of RI-SGD:
1. Speed-up not only due to a larger effective mini-batch size, but also due to increasing intra-gradient diversity.
2. Fault tolerance.
3. Extension to heterogeneous mini-batch sizes, and possible application to federated optimization.
Faster convergence: experiments on ImageNet (top figures) and CIFAR-100 (bottom figures).
Increasing intra-gradient diversity: experiments on CIFAR-10.
Fault tolerance: experiments on CIFAR-10.
For more details, please come to my poster session: Wed, Jun 12th, 06:30 -- 09:00 PM @ Pacific Ballroom #185.

Thanks for your attention!