Training RNNs with 16-bit Floating Point
Erich Elsen, Research Scientist
Silicon Valley AI Lab @ Baidu
• State of the Art Speech Recognition Systems – Deep Speech 2 (http://arxiv.org/abs/1512.02595)
• Easily adaptable for a variety of languages
• Trained on up to 40,000 hours of speech – 4.5 years!
• Each model requires over 20 exaFLOPs to train – can take two or more weeks using 32 GPUs
• => FP16 is one tool to decrease training time
[Figure 1 diagram: spectrogram input → 1D or 2D invariant convolution → recurrent or GRU (bidirectional) layers → fully connected layer → CTC, with batch normalization.]

Figure 1: Architecture of the DS2 system used to train on both English and Mandarin speech. We explore variants of this architecture by varying the number of convolutional layers from 1 to 3 and the number of recurrent or GRU layers from 1 to 7.
The two sets of activations are summed to form the output activations for the layer: $h^l = \overrightarrow{h}^l + \overleftarrow{h}^l$. The function $g(\cdot)$ can be the standard recurrent operation

$$\overrightarrow{h}^l_t = f\left(W^l h^{l-1}_t + \overrightarrow{U}^l \overrightarrow{h}^l_{t-1} + b^l\right) \qquad (3)$$

where $W^l$ is the input-hidden weight matrix, $\overrightarrow{U}^l$ is the recurrent weight matrix and $b^l$ is a bias term. In this case the input-hidden weights are shared for both directions of the recurrence. The function $g(\cdot)$ can also represent more complex recurrence operations such as the Long Short-Term Memory (LSTM) units [30] and the gated recurrent units (GRU) [11].
After the bidirectional recurrent layers we apply one or more fully connected layers with

$$h^l_t = f\left(W^l h^{l-1}_t + b^l\right) \qquad (4)$$
The output layer $L$ is a softmax computing a probability distribution over characters given by

$$p(\ell_t = k \mid x) = \frac{\exp\left(w^L_k \cdot h^{L-1}_t\right)}{\sum_j \exp\left(w^L_j \cdot h^{L-1}_t\right)} \qquad (5)$$
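For readers who want to see these equations end to end, here is a minimal NumPy sketch of one layer of equation (3), followed by (4) and (5). The shapes, variable names, and the choice of f as a ReLU are illustrative assumptions for this sketch, not the actual DS2 implementation.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)          # stand-in for f (DS2 uses a clipped ReLU)

def bidirectional_recurrent_layer(x, W, U_fwd, U_bwd, b):
    """Eq. (3): forward and backward recurrences share the input-hidden
    weights W; the two activation sequences are summed."""
    T = x.shape[0]
    H = W.shape[0]
    h_fwd = np.zeros((T, H))
    h_bwd = np.zeros((T, H))
    for t in range(T):                              # forward in time
        prev = h_fwd[t - 1] if t > 0 else np.zeros(H)
        h_fwd[t] = relu(W @ x[t] + U_fwd @ prev + b)
    for t in reversed(range(T)):                    # backward in time
        nxt = h_bwd[t + 1] if t + 1 < T else np.zeros(H)
        h_bwd[t] = relu(W @ x[t] + U_bwd @ nxt + b)
    return h_fwd + h_bwd

def fully_connected(h, W, b):
    """Eq. (4), applied independently at every timestep."""
    return relu(h @ W.T + b)

def softmax_output(h, W_out):
    """Eq. (5): per-timestep probability distribution over characters."""
    logits = h @ W_out.T
    logits -= logits.max(axis=1, keepdims=True)     # for numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```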
The model is trained using the CTC loss function [22]. Given an input-output pair $(x, y)$ and the current parameters of the network $\theta$, we compute the loss function $\mathcal{L}(x, y; \theta)$ and its derivative with respect to the parameters of the network $\nabla_\theta \mathcal{L}(x, y; \theta)$. This derivative is then used to update the network parameters through the backpropagation through time algorithm.
In the following subsections we describe the architectural and algorithmic improvements made relative to DS1 [26]. Unless otherwise stated these improvements are language agnostic. We report results on an English speaker held out development set, which is an internal dataset containing 2048 utterances of primarily read speech. All models are trained on datasets described in Section 5. We report Word Error Rate (WER) for the English system and Character Error Rate (CER) for the Mandarin system. In both cases we integrate a language model in a beam search decoding step as described in Section 3.8.
Neural Network Optimization
• A network consists of parameters, x, and we try to minimize the cost J of the network on some data set
• Everybody uses some variant of Batch Stochastic Gradient Descent (SGD)
  – Batch = use multiple examples per gradient calculation
  – α often between .01 and .0001
• The ratio of the two terms on the right is very important
$$x_{n+1} = x_n - \alpha \frac{\partial J}{\partial x}$$
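As a point of reference, a minimal sketch of this update rule; the ratio diagnostic (update size relative to parameter size) is the quantity the bullet above refers to, and the names are illustrative:

```python
import numpy as np

def sgd_step(x, grad, alpha=0.01):
    """One step of x_{n+1} = x_n - alpha * dJ/dx.
    Also returns |alpha * grad| / |x|, the ratio of the two terms on the
    right-hand side, which decides whether the update survives rounding."""
    update = alpha * grad
    ratio = np.abs(update) / np.maximum(np.abs(x), 1e-12)
    return x - update, ratio
```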
Training Recurrent Neural Networks (RNNs)
• Parameters are W, U
• x, h are the network activations
• Forward pass eqn:
• Performance of GEMM very important
$$h_t = \sigma\left(W x_t + U h_{t-1}\right)$$
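A minimal sketch of this forward pass (batch dimension omitted; names and shapes are illustrative assumptions). Each timestep amounts to two matrix multiplies, which is why GEMM performance dominates:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(x, W, U, h0):
    """h_t = sigmoid(W x_t + U h_{t-1}); x has shape (timesteps, input_dim).
    The returned activations must be kept around for the backward pass."""
    h = [h0]
    for x_t in x:
        h.append(sigmoid(x_t @ W.T + h[-1] @ U.T))
    return np.stack(h[1:])
```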
Training RNNs Memory Usage
• Must save x, h for the backward pass!
• Most memory is used to store activations
• Weights < 10% of allocated memory
• Standard is to use 32-bit floating point for weights and activations
[Diagram: recurrent layer with hidden size 2048 unrolled over 50-800 timesteps (storage on the order of 32-256 MB); an activation vector h is saved at every timestep.]
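A back-of-the-envelope version of this slide's accounting, using its hidden size and timestep range; the batch size of 64, the assumption that the input dimension equals the hidden size, and FP32 storage are illustrative choices for this sketch:

```python
hidden = 2048
batch = 64
bytes_fp32 = 4

# Weights for one layer (W and U, each hidden x hidden) are fixed in size...
weights_mb = 2 * hidden * hidden * bytes_fp32 / 2**20
# ...while saved activations grow linearly with the number of timesteps.
for timesteps in (50, 800):
    activations_mb = timesteps * batch * hidden * bytes_fp32 / 2**20
    print(f"T={timesteps}: weights ~{weights_mb:.0f} MB, "
          f"saved activations ~{activations_mb:.0f} MB")
```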
FP16
• By using only 16 bits per number we:
  – can store twice as many numbers
    • Increase mini-batch size!
  – can move twice as many numbers around in the same amount of time
    • All bandwidth-bound operations improve
  – hardware arithmetic units take up less area and are faster
    • New hardware will have twice as many FP16 FLOPs as FP32
    • All compute-bound operations will improve
• No free lunch
  – Optimization becomes more difficult
FP32 GEMM Performance
Pseudo-FP16 GEMM Performance
• Inputs/Outputs FP16, Internals FP32
• Use today on Maxwell:
  – cublasSgemmEx
  – Nervana GEMM
• 2-3x faster!
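The precision behaviour of the pseudo-FP16 scheme (FP16 storage, FP32 arithmetic inside the GEMM) can be emulated in NumPy. This sketch only mimics the numerics; the speedup itself comes from kernels such as cublasSgemmEx or the Nervana GEMM, not from NumPy:

```python
import numpy as np

def pseudo_fp16_gemm(a_fp16, b_fp16):
    """Inputs/outputs in FP16, multiply-accumulate carried out in FP32."""
    c_fp32 = a_fp16.astype(np.float32) @ b_fp16.astype(np.float32)
    return c_fp32.astype(np.float16)

a = np.random.randn(256, 256).astype(np.float16)
b = np.random.randn(256, 256).astype(np.float16)
c = pseudo_fp16_gemm(a, b)

# Compare against a float64 reference to see the (small) FP16 output error.
c_ref = a.astype(np.float64) @ b.astype(np.float64)
print(np.abs(c.astype(np.float64) - c_ref).max())
```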
True FP16 GEMM Performance (estimated)
• Inputs/Outputs/Internals FP16
• Use in the future on Pascal
• 4-6x faster!
• Optimization problem becomes even more challenging
Number Representation
• N = ±m · 2^e
• Total bits = 1 sign bit + bits for m + bits for e
• Intuition – 2^(bits of m) values between each power of 2
• Imagine 2 bits for m and 2 bits for e
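A toy enumeration of that intuition with 2 bits for m and 2 bits for e (no sign handling or subnormals; a simplified illustration, not IEEE FP16):

```python
# value = (1 + f/4) * 2**e, with a 2-bit fraction f and a 2-bit exponent e
values = sorted({(1 + f / 4) * 2 ** e for e in range(4) for f in range(4)})
print(values)
# [1.0, 1.25, 1.5, 1.75, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0, 6.0, 7.0, 8.0, 10.0, 12.0, 14.0]
# 2**2 = 4 grid points between consecutive powers of two; the absolute
# spacing doubles each time the exponent increases.
```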
FP16 vs FP32

                                                     FP16           FP32
Max. value                                           ~65,504        ~3.4e38
Grid points between each power of 2                  1,024          ~8,400,000
Smallest number you can add to 1 and get a
different number (about half an ULP at 1)            ~0.000489      ~0.00000006
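These properties can be checked directly with NumPy (a small verification sketch; printed formatting varies by NumPy version):

```python
import numpy as np

for dtype in (np.float16, np.float32):
    info = np.finfo(dtype)
    print(dtype.__name__, "max:", info.max, "eps:", info.eps)

one = np.float16(1.0)
print(one + np.float16(0.0004) == one)   # True: too small, the add is lost
print(one + np.float16(0.001) == one)    # False: large enough to register
```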
Rounding
• Generally accepted practice (implemented in hardware) is round to nearest even (r2ne)
• Go to the nearest point, and if you're exactly halfway, go to the nearest even number (in binary)
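NumPy's conversion to float16 uses round-to-nearest-even, so halfway cases are easy to demonstrate (the FP16 grid spacing between 2048 and 4096 is exactly 2, which makes odd integers exact ties):

```python
import numpy as np

print(np.float16(2049.0))   # -> 2048.0: tie, rounds to the even neighbour below
print(np.float16(2051.0))   # -> 2052.0: tie, rounds to the even neighbour above
```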
Summation and Rounding
• x is updated by adding a sequence of relatively small numbers
• If the updates are too small, we will never make any progress with round to nearest even:

$$x_{n+1} = x_n - \alpha \frac{\partial J}{\partial x} = x_n$$
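This is easy to reproduce. The sketch below uses 100 and 0.01, the same numbers as the stochastic rounding example later in the talk; the FP16 grid spacing near 100 is 0.0625, so each 0.01 update is below half a grid step and is rounded away:

```python
import numpy as np

x = np.float16(100.0)
for _ in range(100):
    x = x + np.float16(0.01)    # rounds straight back to 100.0 every time
print(x)                        # -> 100.0: no progress in FP16

y = np.float32(100.0)
for _ in range(100):
    y = y + np.float32(0.01)
print(y)                        # -> ~101.0 in FP32
```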
Saddle Points!
• A local minimum is not a problem
• Saddle points are what make optimization hard
  – Flat
  – Small derivative

(Saddle point image from Wikipedia)
In 1-D

$$J = -(x - 3)^2 + 3$$
$$\alpha = 0.01$$
$$\frac{\partial J}{\partial x} = -2(x - 3)$$
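To see how this interacts with FP16, compare the size of the update α·∂J/∂x with the FP16 grid spacing as x approaches the stationary point at 3 (a sketch using the slide's J and α; the sample points are arbitrary choices):

```python
alpha = 0.01
fp16_spacing = 2.0 ** -9        # FP16 grid spacing for values in [2, 4)

def grad(x):
    return -2.0 * (x - 3.0)     # dJ/dx for J = -(x - 3)**2 + 3

for x in (2.0, 2.9, 2.99):
    update = alpha * grad(x)
    lost = abs(update) < fp16_spacing / 2   # below half a step, r2ne drops it
    print(f"x={x}: update={update:+.5f}, lost in fp16: {lost}")
```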
1-D Optimization Problem
1-D ratios of updates to x
Solution 1: Stochastic Rounding
• Round up or down with probability related to the distance to the neighboring grid points
• Example: if the closest grid points are 100 and 101 and the value is 100.01
  – We round up 1% of the time
  – Round down 99% of the time
Stochastic Rounding
• After adding .01 to 100, 100 times:
  – With r2ne we will still have 100
  – With stochastic rounding we expect to have 101
• Allows us to make optimization progress even when the updates are small
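A minimal NumPy implementation of stochastic rounding onto the FP16 grid, rerunning the 100 + 0.01 experiment. The helper name and the use of np.nextafter to find the neighbouring grid point are implementation choices for this sketch, not from the talk:

```python
import numpy as np

def round_fp16_stochastic(value, rng):
    """Round a Python float onto the FP16 grid, picking the upper or lower
    neighbouring grid point with probability proportional to closeness."""
    lo = np.float16(value)                       # round-to-nearest first
    if float(lo) > value:                        # landed above: step down
        hi, lo = lo, np.nextafter(lo, np.float16(-np.inf))
    else:
        hi = np.nextafter(lo, np.float16(np.inf))
    if float(hi) == float(lo):                   # value was exactly on the grid
        return lo
    p_up = (value - float(lo)) / (float(hi) - float(lo))
    return hi if rng.random() < p_up else lo

rng = np.random.default_rng(0)
x = np.float16(100.0)
for _ in range(100):
    x = round_fp16_stochastic(float(x) + 0.01, rng)
print(x)   # typically close to 101, versus exactly 100.0 with r2ne
```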
Solution 2: High Precision Accumulation
• Keep two copies of the weights
  – One in high precision (fp32)
  – One in low precision (fp16)
• Accumulate updates into the high precision copy
• Round the high precision copy to low precision and perform computations
High Precision Accumulation
• After adding .01 to 100, 100 times:
  – We will have exactly 101 in the high precision weights, which will round to 101 in the low precision weights
• Allows for accurate accumulation while maintaining the benefits of fp16 computation
• Requires more weight storage, but weights are usually a small part of the memory footprint
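The same experiment with an FP32 master copy of the weight (a minimal sketch; in practice the weights are arrays, and the FP16 copy is what feeds the GEMMs):

```python
import numpy as np

master_w = np.float32(100.0)      # high precision copy, receives the updates
w_fp16 = np.float16(master_w)     # low precision copy used for computation

for _ in range(100):
    update = np.float32(0.01)     # stands in for -alpha * gradient
    master_w = master_w + update  # accumulate in FP32: nothing is lost
    w_fp16 = np.float16(master_w) # re-round for the next FP16 computation

print(master_w)   # ~101.0
print(w_fp16)     # 101.0: the accumulated change shows up in FP16 as well
```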
Batch Normalization Helps
Results
Conclusion
• Half precision enables bigger, deeper networks
• Half precision enables faster training and evaluation of networks
• Half precision enables better scaling to multiple GPUs
• Training can be tricky; these techniques can help