Training RNNs with 16-bit Floating Point
Erich Elsen, Research Scientist
Silicon Valley AI Lab @ Baidu
• State of the Art Speech Recognition Systems – Deep Speech 2 (http://arxiv.org/abs/1512.02595)
• Easily adaptable for a variety of languages
• Trained on up to 40,000 hours of speech – 4.5 years!
• Each model requires over 20 exaFLOPs to train – can take two or more weeks using 32 GPUs
• => FP16 is one tool to decrease training time
[Figure 1 diagram: spectrogram input → 1D or 2D invariant convolution → recurrent or GRU (bidirectional) layers → fully connected layer → CTC, with batch normalization.]

Figure 1: Architecture of the DS2 system used to train on both English and Mandarin speech. We explore variants of this architecture by varying the number of convolutional layers from 1 to 3 and the number of recurrent or GRU layers from 1 to 7.
The two sets of activations are summed to form the output activations for the layer: $h^l = \overrightarrow{h}^l + \overleftarrow{h}^l$. The function $g(\cdot)$ can be the standard recurrent operation

$$\overrightarrow{h}^l_t = f\left(W^l h^{l-1}_t + \overrightarrow{U}^l \overrightarrow{h}^l_{t-1} + b^l\right) \qquad (3)$$

where $W^l$ is the input-hidden weight matrix, $\overrightarrow{U}^l$ is the recurrent weight matrix and $b^l$ is a bias term. In this case the input-hidden weights are shared for both directions of the recurrence. The function $g(\cdot)$ can also represent more complex recurrence operations such as the Long Short-Term Memory (LSTM) units [30] and the gated recurrent units (GRU) [11].
After the bidirectional recurrent layers we apply one or more fully connected layers with

$$h^l_t = f\left(W^l h^{l-1}_t + b^l\right) \qquad (4)$$
The output layer $L$ is a softmax computing a probability distribution over characters given by

$$p(\ell_t = k \mid x) = \frac{\exp\left(w^L_k \cdot h^{L-1}_t\right)}{\sum_j \exp\left(w^L_j \cdot h^{L-1}_t\right)} \qquad (5)$$
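For readers who want to see these equations end to end, here is a minimal NumPy sketch of one layer of equation (3), followed by (4) and (5). The shapes, variable names, and the choice of f as a ReLU are illustrative assumptions for this sketch, not the actual DS2 implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # stand-in for f (DS2 uses a clipped ReLU)

def bidirectional_recurrent_layer(x, W, U_fwd, U_bwd, b):
    """Eq. (3): forward and backward recurrences share the input-hidden
    weights W; the two activation sequences are summed."""
    T = x.shape[0]
    H = W.shape[0]
    h_fwd = np.zeros((T, H))
    h_bwd = np.zeros((T, H))
    for t in range(T):                              # forward in time
        prev = h_fwd[t - 1] if t > 0 else np.zeros(H)
        h_fwd[t] = relu(W @ x[t] + U_fwd @ prev + b)
    for t in reversed(range(T)):                    # backward in time
        nxt = h_bwd[t + 1] if t + 1 < T else np.zeros(H)
        h_bwd[t] = relu(W @ x[t] + U_bwd @ nxt + b)
    return h_fwd + h_bwd

def fully_connected(h, W, b):
    """Eq. (4), applied independently at every timestep."""
    return relu(h @ W.T + b)

def softmax_output(h, W_out):
    """Eq. (5): per-timestep probability distribution over characters."""
    logits = h @ W_out.T
    logits -= logits.max(axis=1, keepdims=True)     # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```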
The model is trained using the CTC loss function [22]. Given an input-output pair $(x, y)$ and the current parameters of the network $\theta$, we compute the loss function $\mathcal{L}(x, y; \theta)$ and its derivative with respect to the parameters of the network $\nabla_\theta \mathcal{L}(x, y; \theta)$. This derivative is then used to update the network parameters through the backpropagation through time algorithm.
In the following subsections we describe the architectural and algorithmic improvements made relative to DS1 [26]. Unless otherwise stated these improvements are language agnostic. We report results on an English speaker held out development set, which is an internal dataset containing 2048 utterances of primarily read speech. All models are trained on datasets described in Section 5. We report Word Error Rate (WER) for the English system and Character Error Rate (CER) for the Mandarin system. In both cases we integrate a language model in a beam search decoding step as described in Section 3.8.
Neural Network Optimization
• A network consists of parameters, x, and we try to minimize the cost J of the network on some data set
• Everybody uses some variant of Batch Stochastic Gradient Descent (SGD)
  – Batch = use multiple examples per gradient calculation
  – α often between .01 and .0001
• The ratio of the two terms on the right is very important
$$x_{n+1} = x_n - \alpha \frac{\partial J}{\partial x}$$
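As a point of reference, a minimal sketch of this update rule; the ratio diagnostic (update size relative to parameter size) is the quantity the bullet above refers to, and the names are illustrative:

```python
import numpy as np

def sgd_step(x, grad, alpha=0.01):
    """One step of x_{n+1} = x_n - alpha * dJ/dx.
    Also returns |alpha * grad| / |x|, the ratio of the two terms on the
    right-hand side, which decides whether the update survives rounding."""
    update = alpha * grad
    ratio = np.abs(update) / np.maximum(np.abs(x), 1e-12)
    return x - update, ratio
```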
Training Recurrent Neural Networks (RNNs)
• Parameters are W, U
• x, h are the network activations
• Forward pass eqn:
• Performance of GEMM very important
$$h_t = \sigma\left(W x_t + U h_{t-1}\right)$$
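A minimal sketch of this forward pass (batch dimension omitted; names and shapes are illustrative assumptions). Each timestep amounts to two matrix multiplies, which is why GEMM performance dominates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x, W, U, h0):
    """h_t = sigmoid(W x_t + U h_{t-1}); x has shape (timesteps, input_dim).
    The returned activations must be kept around for the backward pass."""
    h = [h0]
    for x_t in x:
        h.append(sigmoid(x_t @ W.T + h[-1] @ U.T))
    return np.stack(h[1:])
```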
Training RNNs Memory Usage
• Must save x, h for the backward pass!
• Most memory is used to store activations
• Weights < 10% of allocated memory
• Standard is to use 32-bit floating point for weights and activations
[Diagram: recurrent layer with hidden size 2048 unrolled over 50-800 timesteps (storage on the order of 32-256 MB); an activation vector h is saved at every timestep.]
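A back-of-the-envelope version of this slide's accounting, using its hidden size and timestep range; the batch size of 64, the assumption that the input dimension equals the hidden size, and FP32 storage are illustrative choices for this sketch:

```python
hidden = 2048
batch = 64
bytes_fp32 = 4

# Weights for one layer (W and U, each hidden x hidden) are fixed in size...
weights_mb = 2 * hidden * hidden * bytes_fp32 / 2**20
# ...while saved activations grow linearly with the number of timesteps.
for timesteps in (50, 800):
    activations_mb = timesteps * batch * hidden * bytes_fp32 / 2**20
    print(f"T={timesteps}: weights ~{weights_mb:.0f} MB, "
          f"saved activations ~{activations_mb:.0f} MB")
```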
FP16
• By using only 16 bits per number we:
  – can store twice as many numbers
    • Increase mini-batch size!
  – can move twice as many numbers around in the same amount of time
    • All bandwidth-bound operations improve
  – hardware arithmetic units take up less area and are faster
    • New hardware will have twice as many FP16 FLOPs as FP32
    • All compute-bound operations will improve
• No free lunch
  – Optimization becomes more difficult
FP32 GEMM Performance
Pseudo-FP16 GEMM Performance
• Inputs/Outputs FP16, Internals FP32
• Use today on Maxwell:
  – cublasSgemmEx
  – Nervana GEMM
• 2-3x faster!
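The precision behaviour of the pseudo-FP16 scheme (FP16 storage, FP32 arithmetic inside the GEMM) can be emulated in NumPy. This sketch only mimics the numerics; the speedup itself comes from kernels such as cublasSgemmEx or the Nervana GEMM, not from NumPy:

```python
import numpy as np

def pseudo_fp16_gemm(a_fp16, b_fp16):
    """Inputs/outputs in FP16, multiply-accumulate carried out in FP32."""
    c_fp32 = a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)
    return c_fp32.astype(np.float16)

a = np.random.randn(256, 256).astype(np.float16)
b = np.random.randn(256, 256).astype(np.float16)
c = pseudo_fp16_gemm(a, b)

# Compare against a float64 reference to see the (small) FP16 output error.
c_ref = a.astype(np.float64) @ b.astype(np.float64)
print(np.abs(c.astype(np.float64) - c_ref).max())
```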
True FP16 GEMM Performance (estimated)
• Inputs/Outputs/Internals FP16
• Use in the future on Pascal
• 4-6x faster!
• Optimization problem becomes even more challenging
Number Representation
• N = ±m · 2^e
• Total bits = 1 sign bit + bits for m + bits for e
• Intuition – 2^(bits of m) values between each power of 2
• Imagine 2 bits for m and 2 bits for e
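A toy enumeration of that intuition with 2 bits for m and 2 bits for e (no sign handling or subnormals; a simplified illustration, not IEEE FP16):

```python
# value = (1 + f/4) * 2**e, with a 2-bit fraction f and a 2-bit exponent e
values = sorted({(1 + f / 4) * 2 ** e for e in range(4) for f in range(4)})
print(values)
# [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0, 14.0]
# 2**2 = 4 grid points between consecutive powers of two; the absolute
# spacing doubles each time the exponent increases.
```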
FP16 vs FP32

                                                     FP16           FP32
Max. value                                           ~65,504        ~3.4e38
Grid points between each power of 2                  1,024          ~8,400,000
Smallest number you can add to 1 and get a
different number (about half an ULP at 1)            ~0.000489      ~0.00000006
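These properties can be checked directly with NumPy (a small verification sketch; printed formatting varies by NumPy version):

```python
import numpy as np

for dtype in (np.float16, np.float32):
    info = np.finfo(dtype)
    print(dtype.__name__, "max:", info.max, "eps:", info.eps)

one = np.float16(1.0)
print(one + np.float16(0.0004) == one)   # True: too small, the add is lost
print(one + np.float16(0.001) == one)    # False: large enough to register
```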
Rounding
• Generally accepted practice (implemented in hardware) is round to nearest even (r2ne)
• Go to the nearest point, and if you're exactly halfway, go to the nearest even number (in binary)
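NumPy's conversion to float16 uses round-to-nearest-even, so halfway cases are easy to demonstrate (the FP16 grid spacing between 2048 and 4096 is exactly 2, which makes odd integers exact ties):

```python
import numpy as np

print(np.float16(2049.0))   # -> 2048.0: tie, rounds to the even neighbour below
print(np.float16(2051.0))   # -> 2052.0: tie, rounds to the even neighbour above
```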
Summation and Rounding
• x is updated by adding a sequence of relatively small numbers
• If the updates are too small, we will never make any progress with round to nearest even:

$$x_{n+1} = x_n - \alpha \frac{\partial J}{\partial x} = x_n$$
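This is easy to reproduce. The sketch below uses 100 and 0.01, the same numbers as the stochastic rounding example later in the talk; the FP16 grid spacing near 100 is 0.0625, so each 0.01 update is below half a grid step and is rounded away:

```python
import numpy as np

x = np.float16(100.0)
for _ in range(100):
    x = x + np.float16(0.01)    # rounds straight back to 100.0 every time
print(x)                        # -> 100.0: no progress in FP16

y = np.float32(100.0)
for _ in range(100):
    y = y + np.float32(0.01)
print(y)                        # -> ~101.0 in FP32
```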
Saddle Points!
• A local minimum is not a problem
• Saddle points are what make optimization hard
  – Flat
  – Small derivative

(Saddle point image from Wikipedia)
In 1-D

$$J = -(x - 3)^2 + 3$$
$$\alpha = 0.01$$
$$\frac{\partial J}{\partial x} = -2(x - 3)$$
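To see how this interacts with FP16, compare the size of the update α·∂J/∂x with the FP16 grid spacing as x approaches the stationary point at 3 (a sketch using the slide's J and α; the sample points are arbitrary choices):

```python
alpha = 0.01
fp16_spacing = 2.0 ** -9        # FP16 grid spacing for values in [2, 4)

def grad(x):
    return -2.0 * (x - 3.0)     # dJ/dx for J = -(x - 3)**2 + 3

for x in (2.0, 2.9, 2.99):
    update = alpha * grad(x)
    lost = abs(update) < fp16_spacing / 2   # below half a step, r2ne drops it
    print(f"x={x}: update={update:+.5f}, lost in fp16: {lost}")
```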
1-D Optimization Problem
1-D ratios of updates to x
Solution 1: Stochastic Rounding
• Round up or down with probability related to the distance to the neighboring grid points
• Example: if the closest grid points are 100 and 101 and the value is 100.01
  – We round up 1% of the time
  – Round down 99% of the time
Stochastic Rounding
• After adding .01 to 100, 100 times:
  – With r2ne we will still have 100
  – With stochastic rounding we expect to have 101
• Allows us to make optimization progress even when the updates are small
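A minimal NumPy implementation of stochastic rounding onto the FP16 grid, rerunning the 100 + 0.01 experiment. The helper name and the use of np.nextafter to find the neighbouring grid point are implementation choices for this sketch, not from the talk:

```python
import numpy as np

def round_fp16_stochastic(value, rng):
    """Round a Python float onto the FP16 grid, picking the upper or lower
    neighbouring grid point with probability proportional to closeness."""
    lo = np.float16(value)                       # round-to-nearest first
    if float(lo) > value:                        # landed above: step down
        hi, lo = lo, np.nextafter(lo, np.float16(-np.inf))
    else:
        hi = np.nextafter(lo, np.float16(np.inf))
    if float(hi) == float(lo):                   # value was exactly on the grid
        return lo
    p_up = (value - float(lo)) / (float(hi) - float(lo))
    return hi if rng.random() < p_up else lo

rng = np.random.default_rng(0)
x = np.float16(100.0)
for _ in range(100):
    x = round_fp16_stochastic(float(x) + 0.01, rng)
print(x)   # typically close to 101, versus exactly 100.0 with r2ne
```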
Solution 2: High Precision Accumulation
• Keep two copies of the weights
  – One in high precision (fp32)
  – One in low precision (fp16)
• Accumulate updates into the high precision copy
• Round the high precision copy to low precision and perform computations
High Precision Accumulation
• After adding .01 to 100, 100 times:
  – We will have exactly 101 in the high precision weights, which will round to 101 in the low precision weights
• Allows for accurate accumulation while maintaining the benefits of fp16 computation
• Requires more weight storage, but weights are usually a small part of the memory footprint
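The same experiment with an FP32 master copy of the weight (a minimal sketch; in practice the weights are arrays, and the FP16 copy is what feeds the GEMMs):

```python
import numpy as np

master_w = np.float32(100.0)      # high precision copy, receives the updates
w_fp16 = np.float16(master_w)     # low precision copy used for computation

for _ in range(100):
    update = np.float32(0.01)     # stands in for -alpha * gradient
    master_w = master_w + update  # accumulate in FP32: nothing is lost
    w_fp16 = np.float16(master_w) # re-round for the next FP16 computation

print(master_w)   # ~101.0
print(w_fp16)     # 101.0: the accumulated change shows up in FP16 as well
```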
Batch Normalization Helps
Results
Conclusion
• Half precision enables bigger, deeper networks
• Half precision enables faster training and evaluation of networks
• Half precision enables better scaling to multiple GPUs
• Training can be tricky; these techniques can help