Training RNNs with 16-bit Floating Point. Erich Elsen, Research Scientist


Page 1:

Training RNNs with 16-bit Floating Point

Erich Elsen, Research Scientist

Page 2:

Silicon Valley AI Lab @ Baidu

• State of the Art Speech Recognition Systems
  – Deep Speech 2 (http://arxiv.org/abs/1512.02595)

• Easily adaptable for a variety of languages

• Trained on up to 40,000 hours of speech – 4.5 years!

• Each model requires over 20 exaFLOPs to train
  – Can take two or more weeks using 32 GPUs

• => FP16 is one tool to decrease training time

[Figure 1 (from the Deep Speech 2 paper): Architecture of the DS2 system used to train on both English and Mandarin speech. We explore variants of this architecture by varying the number of convolutional layers from 1 to 3 and the number of recurrent or GRU layers from 1 to 7. Layer labels in the figure: Spectrogram, 1D or 2D Invariant Convolution, Recurrent or GRU (Bidirectional), Fully Connected, Batch Normalization, CTC.]

The two sets of activations are summed to form the output activations for the layer: $h^l = \overrightarrow{h}^l + \overleftarrow{h}^l$. The function $g(\cdot)$ can be the standard recurrent operation

$$\overrightarrow{h}^l_t = f(W^l h^{l-1}_t + \overrightarrow{U}^l \overrightarrow{h}^l_{t-1} + b^l) \qquad (3)$$

where $W^l$ is the input-hidden weight matrix, $\overrightarrow{U}^l$ is the recurrent weight matrix and $b^l$ is a bias term. In this case the input-hidden weights are shared for both directions of the recurrence. The function $g(\cdot)$ can also represent more complex recurrence operations such as the Long Short-Term Memory (LSTM) units [30] and the gated recurrent units (GRU) [11].

After the bidirectional recurrent layers we apply one or more fully connected layers with

$$h^l_t = f(W^l h^{l-1}_t + b^l) \qquad (4)$$

The output layer $L$ is a softmax computing a probability distribution over characters given by

$$p(\ell_t = k \mid x) = \frac{\exp(w^L_k \cdot h^{L-1}_t)}{\sum_j \exp(w^L_j \cdot h^{L-1}_t)} \qquad (5)$$

The model is trained using the CTC loss function [22]. Given an input-output pair $(x, y)$ and the current parameters of the network $\theta$, we compute the loss function $\mathcal{L}(x, y; \theta)$ and its derivative with respect to the parameters of the network $\nabla_\theta \mathcal{L}(x, y; \theta)$. This derivative is then used to update the network parameters through the backpropagation through time algorithm.

In the following subsections we describe the architectural and algorithmic improvements made relative to DS1 [26]. Unless otherwise stated these improvements are language agnostic. We report results on an English speaker held out development set, which is an internal dataset containing 2048 utterances of primarily read speech. All models are trained on datasets described in Section 5. We report Word Error Rate (WER) for the English system and Character Error Rate (CER) for the Mandarin system. In both cases we integrate a language model in a beam search decoding step as described in Section 3.8.

Page 3:

Neural Network Optimization

• A network consists of parameters, x, and we try to minimize the cost J of the network on some data set

• Everybody uses some variant of batch Stochastic Gradient Descent (SGD)
  – Batch = use multiple examples per gradient calculation
  – α often between .01 and .0001

• The ratio of the two terms on the right is very important (see the sketch after the update rule below)


$x_{n+1} = x_n - \alpha \frac{\partial J}{\partial x}$
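As a concrete reference, here is a minimal numpy sketch of that update rule; the quadratic objective and its gradient are hypothetical stand-ins for the real network cost and its backpropagated gradient.

```python
import numpy as np

# Toy objective J(x) = 0.5 * ||x||^2, so dJ/dx = x; a stand-in for the
# network cost and its backpropagated gradient.
def grad_J(x):
    return x

def sgd_step(x, alpha=0.01):
    update = alpha * grad_J(x)   # the second term on the right-hand side
    # The ratio |update| / |x| determines whether the step survives rounding
    # once parameters and updates are stored in a low-precision format.
    return x - update

x = np.ones(1000, dtype=np.float32)
for _ in range(100):
    x = sgd_step(x)
```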

Page 4:

Training Recurrent Neural Networks (RNNs)

• Parameters are W, U
• x, h are the network activations
• Forward pass eqn (sketched in code below):
• Performance of GEMM very important


$h_t = \sigma(W x_t + U h_{t-1})$
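A rough numpy sketch of that forward recurrence follows; the sizes and the tanh nonlinearity are illustrative assumptions. The two matrix products per timestep are the GEMMs whose throughput dominates the run time.

```python
import numpy as np

def rnn_forward(W, U, x_seq, h0, sigma=np.tanh):
    """h_t = sigma(W x_t + U h_{t-1}); every h_t is kept for the backward pass."""
    h, hs = h0, []
    for x_t in x_seq:                    # x_seq: (timesteps, batch, input_dim)
        h = sigma(x_t @ W.T + h @ U.T)   # two GEMMs per timestep
        hs.append(h)
    return np.stack(hs)                  # (timesteps, batch, hidden)

hidden, inp, batch, T = 2048, 2048, 32, 50
rng = np.random.default_rng(0)
W = (0.01 * rng.standard_normal((hidden, inp))).astype(np.float32)
U = (0.01 * rng.standard_normal((hidden, hidden))).astype(np.float32)
x_seq = rng.standard_normal((T, batch, inp)).astype(np.float32)
h_all = rnn_forward(W, U, x_seq, np.zeros((batch, hidden), dtype=np.float32))
```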

Page 5:

Training RNNs Memory Usage

• Must save x, h for the backward pass!
• Most memory is used to store activations (rough estimate below)
• Weights < 10% of allocated memory
• Standard is to use 32-bit floating point for weights and activations


[Diagram: an unrolled recurrence, with the weight matrix U applied to hidden states h of size 2048 across timesteps = 50 - 800; MB = 32 - 256.]
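A back-of-the-envelope version of that claim; the batch size is an assumption for illustration and the figures cover a single recurrent layer.

```python
hidden, timesteps, batch = 2048, 800, 32      # hidden size and timesteps from the slide
bytes_fp32, bytes_fp16 = 4, 2

acts = hidden * timesteps * batch             # one saved h_t per timestep
print("activations, fp32: %.0f MB" % (acts * bytes_fp32 / 2**20))    # 200 MB
print("activations, fp16: %.0f MB" % (acts * bytes_fp16 / 2**20))    # 100 MB

weights = hidden * hidden                     # just the recurrent matrix U
print("U weights,   fp32: %.0f MB" % (weights * bytes_fp32 / 2**20))  # 16 MB
```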

Page 6:

FP16

• By using only 16 bits per number we:
  – can store twice as many numbers
    • Increase mini-batch size!
  – can move twice as many numbers around in the same amount of time
    • All bandwidth bound operations improve
  – Hardware arithmetic units take up less area and are faster
    • New hardware will have twice as many FP16 flops as FP32
    • All compute bound operations will improve

• No Free Lunch
  – Optimization becomes more difficult


Page 7:

FP32 GEMM Performance

Page 8:

Pseudo-FP16 GEMM Performance

• Inputs/Outputs FP16, Internals FP32 (numerics sketched below)

Use today on Maxwell:
• cublasSgemmEx
• Nervana GEMM

2-3x Faster!
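The numerics of a pseudo-FP16 GEMM can be modelled in numpy: fp16 storage for inputs and outputs, fp32 for the multiply-accumulate. This only mimics the precision behaviour of kernels like cublasSgemmEx, not their speed.

```python
import numpy as np

def pseudo_fp16_gemm(A16, B16):
    """Inputs/outputs in fp16, multiply-accumulate carried out in fp32."""
    C32 = A16.astype(np.float32) @ B16.astype(np.float32)   # fp32 internals
    return C32.astype(np.float16)                            # round result to fp16

rng = np.random.default_rng(0)
A16 = rng.standard_normal((256, 512)).astype(np.float16)
B16 = rng.standard_normal((512, 128)).astype(np.float16)
C16 = pseudo_fp16_gemm(A16, B16)
```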

Page 9:

True FP16 GEMM Performance (estimated)

Inputs/Outputs/Internals FP16

Use in the future on Pascal

4-6x Faster!

Optimization problem even more challenging

Page 10:

Number Representation

$N = \pm m \cdot 2^e$

• Total bits = 1 sign bit + bits for m + bits for e

• Intuition – 2^m values between each power of 2

• Imagine 2 bits for m and 2 bits for e
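Enumerating that toy format makes the intuition concrete (sign ignored, and an assumed unbiased exponent range of 0..3): there are 2^m = 4 grid points between consecutive powers of two, and the spacing doubles at every power of two.

```python
m_bits, e_bits = 2, 2

grid = sorted({(1 + frac / 2**m_bits) * 2**e    # value = (1 + f / 2^m) * 2^e
               for e in range(2**e_bits)
               for frac in range(2**m_bits)})
print(grid)
# [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0, 14.0]
```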


Page 11:

FP16 vs FP32

|                                                                                    | FP16     | FP32        |
|------------------------------------------------------------------------------------|----------|-------------|
| Max. value                                                                         | ~61,000  | ~1e38       |
| Grid points between each power of 2                                                | 2048     | ~16,700,000 |
| Smallest number you can add to one and get a different number (ULP relative to 1)  | ~.000489 | ~.00000006  |
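The last row can be checked directly with numpy's half type: an addition to 1.0 only survives round-to-nearest if it is bigger than roughly .000489.

```python
import numpy as np

print(np.float16(1.0) + np.float16(0.00049))   # 1.001 -- big enough to register
print(np.float16(1.0) + np.float16(0.00048))   # 1.0   -- rounds straight back to 1

print(np.finfo(np.float16).max)                # 65504
print(np.finfo(np.float32).max)                # ~3.4e38
```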


Page 12:

Rounding  

• Generally accepted practice (implemented in hardware) is round to nearest even (r2ne)

• Go to the nearest point, and if you're exactly halfway go to the nearest even number (in binary)
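A quick way to see ties-to-even in action: around 2048 the fp16 grid spacing is 2, so adding 1 lands exactly halfway between two representable values and the result goes to the neighbour whose last mantissa bit is even.

```python
import numpy as np

print(np.float16(2048) + np.float16(1))   # 2048.0 -- the tie resolves to "even" 2048
print(np.float16(2050) + np.float16(1))   # 2052.0 -- here the even neighbour is above
```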


Page 13:

Summation and Rounding

• x is updated by adding a sequence of relatively small numbers

• If the updates are too small, we will never make any progress with round to nearest even (demonstrated below)


$x_{n+1} = x_n - \alpha \frac{\partial J}{\partial x} = x_n$
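A minimal demonstration of that stall in fp16: around 100 the grid spacing is 0.0625, so an update of 0.01 is less than half a grid step and vanishes every single time.

```python
import numpy as np

x = np.float16(100.0)
update = np.float16(0.01)    # stands in for alpha * dJ/dx
for _ in range(100):
    x = x - update           # each subtraction rounds straight back to 100.0
print(x)                     # 100.0 -- no progress at all
```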

Page 14:

Saddle Points!

• Local minimum not a problem
• Saddle points are what make optimization hard
  – Flat
  – Small derivative

Image from Wikipedia

Page 15:

In 1-D

$J = -(x - 3)^2 + 3$

$\alpha = .01$

$\frac{\partial J}{\partial x} = -2(x - 3)$

Page 16:

1-D Optimization Problem

Page 17:

1-D ratios of updates to x

Page 18:

Solution 1: Stochastic Rounding

• Round up or down with probability related to the distance to the neighboring grid points

• Example – if the closest grid points are 100 and 101 and the value is 100.01
  – We round up 1% of the time
  – Round down 99% of the time
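A minimal sketch of stochastic rounding onto the fp16 grid (normal range only; subnormals, zeros and exact ties are ignored). It adds uniform noise of up to half a grid spacing before round-to-nearest, which makes the round-up probability equal to the fractional distance to the upper neighbour; this illustrates the idea and is not the kernel used in practice.

```python
import numpy as np

def stochastic_round_fp16(x, rng=np.random.default_rng()):
    """Round to fp16, up or down with probability given by the distance
    to the two neighbouring fp16 grid points (normal range only)."""
    x = np.asarray(x, dtype=np.float64)
    spacing = 2.0 ** (np.floor(np.log2(np.abs(x))) - 10)   # fp16 ULP at |x|
    noise = rng.uniform(-0.5, 0.5, size=x.shape) * spacing
    return (x + noise).astype(np.float16)

# On the fp16 grid, 100.01 sits between 100.0 and 100.0625, so it should
# round up about 16% of the time (0.01 / 0.0625).
samples = stochastic_round_fp16(np.full(10000, 100.01))
print((samples > 100).mean())   # ~0.16
```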


Page 19:

Stochastic Rounding

• After adding .01 to 100, 100 times
  – With r2ne we will still have 100
  – With stochastic rounding we expect to have 101

• Allows us to make optimization progress even when the updates are small
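Repeating the slide's experiment: add 0.01 to 100.0 one hundred times, once with plain fp16 round-to-nearest-even and once with stochastic rounding, inlined here for the scalar case using the same half-spacing noise trick as the sketch on the previous page (the fp16 spacing around 100 is 0.0625).

```python
import numpy as np

rng = np.random.default_rng(0)
x_r2ne = np.float16(100.0)
x_sr = np.float16(100.0)
for _ in range(100):
    x_r2ne = x_r2ne + np.float16(0.01)                       # rounds back to 100.0
    x_sr = np.float16(float(x_sr) + 0.01                     # stochastic rounding:
                      + rng.uniform(-0.03125, 0.03125))      # +/- half a grid step
print(x_r2ne, x_sr)   # 100.0 vs. a value near 101
```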


Page 20:

Solution 2: High Precision Accumulation

• Keep two copies of the weights
  – One in high precision (fp32)
  – One in low precision (fp16)

• Accumulate updates to the high precision copy

• Round the high precision copy to low precision and perform computations
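A minimal sketch of this scheme; the constant gradient is a hypothetical stand-in for the fp16 backward pass. In training, the fp16 copy is what feeds the GEMMs, while only the update touches the fp32 copy.

```python
import numpy as np

class MixedPrecisionWeights:
    """Keep an fp32 master copy for updates and an fp16 copy for compute."""
    def __init__(self, w):
        self.master = np.asarray(w, dtype=np.float32)    # high precision copy
        self.compute = self.master.astype(np.float16)    # low precision copy

    def apply_update(self, grad, alpha):
        self.master -= alpha * grad.astype(np.float32)   # small updates survive in fp32
        self.compute = self.master.astype(np.float16)    # re-round for the next GEMMs

w = MixedPrecisionWeights(np.full(4, 100.0))
for _ in range(100):
    w.apply_update(np.full(4, -1.0, dtype=np.float16), alpha=0.01)
print(w.compute)   # [101. 101. 101. 101.] -- the accumulated 1.0 is not lost to rounding
```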


Page 21:

High precision accumulation

• After adding .01 to 100, 100 times
  – We will have exactly 101 in the high precision weights, which will round to 101 in the low precision weights

• Allows for accurate accumulation while maintaining the benefits of fp16 computation

• Requires more weight storage, but weights are usually a small part of the memory footprint

Page 22:

Batch Normalization Helps

Page 23:

Results  


Page 24:

Conclusion

• Half precision enables bigger, deeper networks

• Half precision enables faster training and evaluation of networks

• Half precision enables better scaling to multiple GPUs

• Training can be tricky; these techniques can help