Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization


Joint work with Viveck Cadambe, Mehrdad Mahdavi, and Mohammad Mahdi Kamani

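Goal: Solving

$$\min_{x} f(x) \triangleq \sum_{i} f_i(x)$$

SGD:

$$x^{(t+1)} = x^{(t)} - \eta \frac{1}{|\xi^{(t)}|} \nabla f(x^{(t)}; \xi^{(t)})$$

Parallelization due to computational cost

Distributed SGD:

$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \nabla f(x^{(t)}; \xi_j^{(t)})$$

Communication is the bottleneck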
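The SGD and distributed SGD updates on this slide can be illustrated with a toy example. The sketch below is not from the talk: it assumes a scalar least-squares loss, and the names `grad_least_squares`, `sgd_step`, and `distributed_sgd_step` are hypothetical. It shows that when all p workers use equally sized mini-batches, one distributed step matches one SGD step on the union of the mini-batches.

```python
import random

def grad_least_squares(x, batch):
    """Mini-batch gradient of 0.5 * (a*x - b)^2, averaged over the batch.

    The scalar least-squares loss is an illustrative assumption.
    """
    return sum(a * (a * x - b) for a, b in batch) / len(batch)

def sgd_step(x, batch, eta):
    """One SGD step: x - eta * (1/|batch|) * sum of sample gradients."""
    return x - eta * grad_least_squares(x, batch)

def distributed_sgd_step(x, worker_batches, eta):
    """One distributed step: average the p per-worker mini-batch gradients."""
    p = len(worker_batches)
    avg_grad = sum(grad_least_squares(x, b) for b in worker_batches) / p
    return x - eta * avg_grad

random.seed(0)
data = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(12)]
worker_batches = [data[0:4], data[4:8], data[8:12]]  # p = 3 equal-size batches

x0, eta = 0.5, 0.1
x_dist = distributed_sgd_step(x0, worker_batches, eta)
x_big = sgd_step(x0, data, eta)  # single worker, mini-batch of size 12
print(abs(x_dist - x_big))
```

Every such step requires all workers to exchange gradients, which is why the number of steps that involve communication matters.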


Communication

Number of bits per iteration: gradient compression based techniques

Number of rounds: local SGD with periodic averaging


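Local SGD with periodic averaging

$$x_j^{(t+1)} = \begin{cases} \dfrac{1}{p} \sum_{j=1}^{p} \left[ x_j^{(t)} - \eta\, \tilde{g}_j^{(t)} \right] & \text{if } \tau \mid t \quad \text{(averaging step (a))} \\[2ex] x_j^{(t)} - \eta\, \tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(local update (b))} \end{cases}$$

[Figure: update schedules for three workers W1, W2, W3. With p = 3, τ = 1, every iteration is an averaging step (a). With p = 3, τ = 3, workers take local updates (b) and perform an averaging step (a) every third iteration.]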
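The two-case update rule for local SGD with periodic averaging can be sketched as follows. This is a toy simulation under stated assumptions, not the authors' implementation: each worker j minimizes a hypothetical quadratic 0.5 * (x - targets[j])^2 with Gaussian gradient noise, and the function name `local_sgd` is made up for illustration.

```python
import random

def local_sgd(targets, tau, T, eta=0.1, noise=0.01, seed=0):
    """Simulate local SGD with averaging every tau iterations.

    Worker j holds the toy objective 0.5 * (x - targets[j])**2 (assumption).
    Returns the final per-worker models and the number of averaging rounds.
    """
    rng = random.Random(seed)
    p = len(targets)
    x = [0.0] * p
    rounds = 0
    for t in range(1, T + 1):
        # Local update (b): each worker steps along its own noisy gradient.
        x = [x[j] - eta * ((x[j] - targets[j]) + rng.gauss(0.0, noise))
             for j in range(p)]
        if t % tau == 0:
            # Averaging step (a): all workers replace their model by the mean.
            avg = sum(x) / p
            x = [avg] * p
            rounds += 1
    return x, rounds

models, rounds = local_sgd([1.0, 2.0, 3.0], tau=3, T=12)
print(rounds, models)  # rounds is T // tau = 4
```

With tau = 1 the same loop communicates at every iteration, which corresponds to fully synchronous distributed SGD; larger tau reduces the number of communication rounds to T / tau.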


Convergence Analysis of Local SGD with periodic averaging

Table 1: Comparison of different SGD based algorithms.

| Strategy | Convergence error | Assumptions | Com-round (T/τ) |
|---|---|---|---|
| SGD | $O(1/\sqrt{pT})$ | i.i.d. & b.g | $T$ |
| [Yu et al.] | $O(1/\sqrt{pT})$ | i.i.d. & b.g | $O(p^{3/4} T^{1/4})$ |
| [Wang & Joshi] | $O(1/\sqrt{pT})$ | i.i.d. | $O(p^{3/2} T^{1/2})$ |
| RI-SGD (τ, q) | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$ | non-i.i.d. & b.d. | $O(p^{3/2} T^{1/2})$ |

b.g: Bounded gradient, $\|g_i\|_2^2 \le G$

Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$

A. Residual error is observed in practice, but a theoretical understanding is missing.

B. How can we capture this in the convergence analysis?

C. Is there a way to reduce it?
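To get a feel for the Com-round column of Table 1, the sketch below evaluates each expression numerically. It assumes every constant hidden by the O(·) notation equals 1, which the slides do not state, so the numbers only indicate orders of magnitude. The values p = 16 and T = 10^6 and the function name `comm_rounds` are illustrative choices.

```python
def comm_rounds(p, T):
    """Communication rounds from Table 1 with all hidden constants set to 1.

    Setting the constants to 1 is an assumption made only for illustration.
    """
    return {
        "SGD": float(T),
        "Yu et al.": p ** 0.75 * T ** 0.25,
        "Wang & Joshi": p ** 1.5 * T ** 0.5,
        "RI-SGD": p ** 1.5 * T ** 0.5,
    }

rounds_by_scheme = comm_rounds(p=16, T=10**6)
for name, value in rounds_by_scheme.items():
    print(f"{name:>12}: about {value:,.0f} rounds")
```

Under these assumptions, all three local-update schemes need far fewer rounds than the T rounds of fully synchronous SGD.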


Insufficiency of convergence analysis

A. Residual error is observed in practice, but a theoretical understanding is missing.
   Unbiased gradient estimation does not hold.

B. How can this be captured in the convergence analysis?
   Our work: analysis based on biased gradients.

C. Is there a way to reduce it?
   Our work: redundancy.


Redundancy infused local SGD (RI-SGD)

$$D = D_1 \cup D_2 \cup D_3$$

[Figure: two schedules for workers W1, W2, W3 over data chunks D1, D2, D3. Left: local SGD with p = 3, τ = 3. Right: RI-SGD with q = 2, p = 3, τ = 3, where data chunks are replicated across workers (explicit redundancy).]
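One way to picture explicit redundancy is a cyclic placement in which worker j holds chunks j, j+1, ..., j+q-1 (mod p). The slides do not spell out the placement rule, so the cyclic scheme and the names `assign_chunks` and `survives_failures` are assumptions made for this sketch. Under that assumption each chunk lives on q workers, so the data stays fully covered when up to q - 1 workers drop out.

```python
from itertools import combinations

def assign_chunks(p, q):
    """Cyclic placement (assumed): worker j gets chunks j, ..., j+q-1 mod p."""
    return [[(j + k) % p for k in range(q)] for j in range(p)]

def survives_failures(assignment, p, num_failed):
    """Check that every chunk is still held by some surviving worker,
    for every possible set of num_failed failed workers."""
    for failed in combinations(range(p), num_failed):
        covered = {c for j, chunks in enumerate(assignment)
                   if j not in failed for c in chunks}
        if len(covered) < p:
            return False
    return True

assignment = assign_chunks(p=3, q=2)
print(assignment)                            # [[0, 1], [1, 2], [2, 0]]
print(survives_failures(assignment, 3, 1))   # True
print(survives_failures(assignment, 3, 2))   # False
```

Setting q = 1 recovers the disjoint partition used by plain local SGD, where losing any single worker loses its chunk.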


Comparing RI-SGD with other schemes

Table 1: Comparison of different SGD based algorithms.

| Strategy | Convergence error | Assumptions | Com-round (T/τ) |
|---|---|---|---|
| SGD | $O(1/\sqrt{pT})$ | i.i.d. & b.g | $T$ |
| [Yu et al.] | $O(1/\sqrt{pT})$ | i.i.d. & b.g | $O(p^{3/4} T^{1/4})$ |
| [Wang & Joshi] | $O(1/\sqrt{pT})$ | i.i.d. | $O(p^{3/2} T^{1/2})$ |
| RI-SGD (τ, q) | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$ | non-i.i.d. & b.d. | $O(p^{3/2} T^{1/2})$ |

Assumption (b.d): Bounded inner product of gradients, $\langle g_i, g_j \rangle \le \beta$

q: Number of data chunks at each worker node

Slide callouts: Redundancy, Biased gradients
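The residual term O((1 - q/p)β) in the RI-SGD row shrinks linearly as the redundancy q grows. The sketch below evaluates the factor (1 - q/p)β with the hidden constant set to 1. That constant, the value β = 1.0, p = 4, and the name `residual_term` are all illustrative assumptions.

```python
def residual_term(q, p, beta):
    """Residual error factor (1 - q/p) * beta, hidden constant assumed to be 1."""
    return (1 - q / p) * beta

p = 4
values = [residual_term(q, p, beta=1.0) for q in range(1, p + 1)]
print(values)  # [0.75, 0.5, 0.25, 0.0]
```

With q = p every worker holds the full data set and the term vanishes, at the price of maximal storage and computation per worker.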


Advantages of RI-SGD:

1. Speedup comes not only from a larger effective mini-batch size, but also from increased intra-gradient diversity.

2. Fault tolerance.

3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.


Faster convergence: experiments on ImageNet (top figures) and CIFAR-100 (bottom figures)


Increasing intra-gradient diversity: experiments on CIFAR-10


Fault tolerance: experiments on CIFAR-10


For more details, please come to my poster session:

Wed Jun 12th, 06:30 -- 09:00 PM @ Pacific Ballroom #185

Thanks for your attention!