
Page 1

LSTM compression for language modeling

Artem Grachev

CS HSE, Samsung R&D Russia

April 27th, 2017


Page 2

Language modeling

We need to compute the probability of a sentence, i.e. a sequence of words $(w_1, \ldots, w_T)$, in a language $L$:

$$P(w_1, \ldots, w_T) = P(w_1, \ldots, w_{T-1})\, P(w_T \mid w_1, \ldots, w_{T-1}) = \prod_{t=1}^{T} P(w_t \mid w_1, \ldots, w_{t-1})$$

Fix $N$ and try to compute $P(w_t \mid w_{t-N}, \ldots, w_{t-1})$ instead.
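As a small illustration (not part of the original slides), the sketch below shows how this factorized probability, and the perplexity values reported as "Test PP" later in the deck, can be computed from per-token conditional probabilities. The function `cond_prob(word, context)` is a hypothetical stand-in for any language model.

```python
import math

def sentence_log_prob(words, cond_prob, N=2):
    """log P(w_1 .. w_T) = sum_t log P(w_t | w_{t-N} .. w_{t-1}),
    with the history truncated to the last N words."""
    total = 0.0
    for t, w in enumerate(words):
        context = tuple(words[max(0, t - N):t])
        total += math.log(cond_prob(w, context))
    return total

def perplexity(words, cond_prob, N=2):
    """Perplexity = exp(-average log-probability per token)."""
    return math.exp(-sentence_log_prob(words, cond_prob, N) / len(words))

# A uniform "model" over a 10,000-word vocabulary gives perplexity 10,000:
uniform = lambda w, ctx: 1.0 / 10000
print(perplexity("the cat sat on the mat".split(), uniform))   # ~ 10000
```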


Page 3

RNN

RNN – recurrent neural network. $T$ – number of timesteps, $L$ – number of recurrent layers. Each layer can be described as follows:

$$z_l^t = W_l x_{l-1}^t + V_l x_l^{t-1} + b_l \qquad (1)$$
$$x_l^t = \sigma(z_l^t) \qquad (2)$$

where $t \in \{1, \ldots, N\}$, $l \in \{1, \ldots, L\}$. The output of the network is given by

$$y^t = \mathrm{softmax}\left[W_{L+1} x_L^t + b_{L+1}\right]$$

Then define

$$P(w_t \mid w_{t-N}, \ldots, w_{t-1}) = y^t$$
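A minimal NumPy sketch of eqs. (1)-(2) plus the softmax output, assuming tanh for $\sigma$ and arbitrary illustrative dimensions; none of the names below come from the slides.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_layer_step(W, V, b, x_below, x_prev):
    """Eqs. (1)-(2): z_l^t = W_l x_{l-1}^t + V_l x_l^{t-1} + b_l,  x_l^t = sigma(z_l^t)."""
    z = W @ x_below + V @ x_prev + b
    return np.tanh(z)  # sigma is taken to be tanh here

# One recurrent layer unrolled over N timesteps, followed by the output softmax.
n_in, n_hid, vocab, N = 200, 200, 10000, 35
rng = np.random.default_rng(0)
W, V, b = rng.normal(0, 0.1, (n_hid, n_in)), rng.normal(0, 0.1, (n_hid, n_hid)), np.zeros(n_hid)
W_out, b_out = rng.normal(0, 0.1, (vocab, n_hid)), np.zeros(vocab)

x_prev = np.zeros(n_hid)
for x_below in rng.normal(size=(N, n_in)):        # stand-ins for word embeddings x_{l-1}^t
    x_prev = rnn_layer_step(W, V, b, x_below, x_prev)
    y = softmax(W_out @ x_prev + b_out)           # y^t = softmax(W_{L+1} x_L^t + b_{L+1})
```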


Page 4

LSTM

Equations for one LSTM layer:

$$i_l^t = \sigma\left[W_l^i x_{l-1}^t + V_l^i x_l^{t-1} + b_l^i + D(v_l^i)\, c_l^{t-1}\right] \quad \text{(input gate)} \qquad (3)$$
$$f_l^t = \sigma\left[W_l^f x_{l-1}^t + V_l^f x_l^{t-1} + b_l^f + D(v_l^f)\, c_l^{t-1}\right] \quad \text{(forget gate)} \qquad (4)$$
$$c_l^t = f_l^t \cdot c_l^{t-1} + i_l^t \cdot \tanh\left[W_l^c x_{l-1}^t + U_l^c x_l^{t-1} + b_l^c\right] \quad \text{(cell state)} \qquad (5)$$
$$o_l^t = \sigma\left[W_l^o x_{l-1}^t + V_l^o x_l^{t-1} + b_l^o + D(v_l^o)\, c_l^{t-1}\right] \quad \text{(output gate)} \qquad (6)$$
$$x_l^t = o_l^t \cdot \tanh\left[c_l^t\right] \qquad (7)$$

where $t \in \{1, \ldots, N\}$, $l \in \{1, \ldots, L\}$, and $D(v)$ denotes the diagonal matrix with $v$ on its diagonal (peephole connections). The output of the network is given by

$$y^t = \mathrm{softmax}\left[W_{L+1} x_L^t + b_{L+1}\right]$$
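A corresponding NumPy sketch of one step of eqs. (3)-(7); the parameter dictionary and the elementwise treatment of the peephole terms $D(v)\,c$ are my assumptions for illustration, not the authors' code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(p, x_below, x_prev, c_prev):
    """One step of eqs. (3)-(7). The peephole term D(v) c is the elementwise product v * c."""
    i = sigmoid(p["W_i"] @ x_below + p["V_i"] @ x_prev + p["b_i"] + p["v_i"] * c_prev)   # (3)
    f = sigmoid(p["W_f"] @ x_below + p["V_f"] @ x_prev + p["b_f"] + p["v_f"] * c_prev)   # (4)
    c = f * c_prev + i * np.tanh(p["W_c"] @ x_below + p["U_c"] @ x_prev + p["b_c"])      # (5)
    o = sigmoid(p["W_o"] @ x_below + p["V_o"] @ x_prev + p["b_o"] + p["v_o"] * c_prev)   # (6)
    return o * np.tanh(c), c                                                             # (7)

# Illustrative parameters for a 200-unit layer.
n_in = n_hid = 200
rng = np.random.default_rng(0)
mat = lambda *shape: rng.normal(0.0, 0.1, shape)
p = {"W_i": mat(n_hid, n_in), "V_i": mat(n_hid, n_hid), "b_i": np.zeros(n_hid), "v_i": mat(n_hid),
     "W_f": mat(n_hid, n_in), "V_f": mat(n_hid, n_hid), "b_f": np.zeros(n_hid), "v_f": mat(n_hid),
     "W_c": mat(n_hid, n_in), "U_c": mat(n_hid, n_hid), "b_c": np.zeros(n_hid),
     "W_o": mat(n_hid, n_in), "V_o": mat(n_hid, n_hid), "b_o": np.zeros(n_hid), "v_o": mat(n_hid)}

x, c = lstm_step(p, x_below=rng.normal(size=n_in), x_prev=np.zeros(n_hid), c_prev=np.zeros(n_hid))
```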

Page 5

RNN and LSTM unit

Figure 1: Left: RNN unit, Right: LSTM unit


Page 6

Problem
Neural network models can take up a lot of space on disk and in memory. They also need a lot of time for inference.

Let’s compress neural networks


Page 7

Compression methods

1. Pruning
2. Quantization
3. Low-rank decomposition


Page 8

Pruning

Definition
Pruning – a method for reducing the number of parameters in a neural network. All connections with weights below a threshold are removed from the network. After this we retrain the network to learn the final weights for the remaining sparse connections.

Figure 2: Weight distribution before and after pruning
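A hedged sketch of the pruning step described above, applied to a single weight matrix; the percentile-based threshold and the function names are illustrative. During the additional training mentioned above, the mask is kept fixed so pruned connections stay at zero.

```python
import numpy as np

def prune_by_magnitude(W, sparsity=0.9):
    """Zero out the `sparsity` fraction of weights with the smallest magnitude.
    Returns the pruned matrix and a binary mask to keep applying during retraining."""
    threshold = np.percentile(np.abs(W), sparsity * 100)
    mask = (np.abs(W) >= threshold).astype(W.dtype)
    return W * mask, mask

W = np.random.randn(10000, 200)                      # e.g. the output (softmax) layer
W_pruned, mask = prune_by_magnitude(W, sparsity=0.9)
# During additional training, gradient updates are masked so pruned
# connections stay at zero:  W_pruned -= learning_rate * (grad * mask)
```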


Page 9

Quantization

Definition
Quantization – a method for compressing the size of a neural network in memory. We compress each float value to an eight-bit integer representing the closest real number in a linear set of 256 values within the range.

When we use the model for inference we still need to do full-precision computations.

Source: https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
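A minimal sketch of the linear eight-bit scheme just described: map the min/max range of a float array onto 256 levels, then dequantize back to floats for the full-precision inference mentioned above. The function names are illustrative.

```python
import numpy as np

def quantize_uint8(W):
    """Map floats linearly onto 256 levels spanning [W.min(), W.max()]."""
    lo, hi = float(W.min()), float(W.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((W - lo) / scale).astype(np.uint8)
    return q, lo, scale

def dequantize(q, lo, scale):
    """Recover approximate floats for full-precision inference."""
    return q.astype(np.float32) * scale + lo

W = np.random.randn(200, 200).astype(np.float32)
q, lo, scale = quantize_uint8(W)             # 1 byte per weight on disk instead of 4
W_approx = dequantize(q, lo, scale)          # worst-case error is about scale / 2
```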


Page 10

PTB Results

Method                                               Size (Mb)   Test PP
Original 2-layer (each 200) LSTM                     18.6        117.659
Pruning output layer 90%, w/o additional training     5.5        149.310
Pruning output layer 90%, with additional training    5.5        121.123
Quantization                                          4.7        118.232

Table 1: Pruning and quantization results


Page 11

Low-rank factorization

Simple factorization can be done as follows:

$$x_l^t = \sigma\left[W_l^a W_l^b h_{l-1}^t + U_l^a U_l^b h_l^{t-1} + b_l\right]$$

Let's require $W_l^b = U_{l-1}^b$. After this we can rewrite our equation for the RNN:

$$x_l^t = \sigma\left[W_l^a m_{l-1}^t + U_l^a m_l^{t-1} + b_l\right] \qquad (8)$$
$$m_l^t = U_l^b x_l^t \qquad (9)$$
$$y^t = \mathrm{softmax}\left[W_{L+1} m_L^t + b_{L+1}\right] \qquad (10)$$

Note that for LSTM it is essentially the same.
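One common way to obtain the factors $W^a$, $W^b$ for an already trained matrix is a truncated SVD; the sketch below shows the idea behind eqs. (8)-(10), with the rank and the splitting of singular values as illustrative choices.

```python
import numpy as np

def low_rank_factorize(W, r=128):
    """Approximate W (m x n) by W_a (m x r) @ W_b (r x n) via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    W_a = U[:, :r] * sqrt_s                  # m x r
    W_b = (Vt[:r, :].T * sqrt_s).T           # r x n
    return W_a, W_b

W = np.random.randn(600, 600)
W_a, W_b = low_rank_factorize(W, r=128)
# Storage drops from 600*600 to 2*600*128 parameters, and the layer
# computes W_a @ (W_b @ x) instead of W @ x.
```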


Page 12

Low-rank factorization. Visual results

Figure 3: top — original, bottom — compressed


Page 13

Low-rank factorization. Results

Method                                                                            Size (Mb)   Test PP
Plain vanilla model, 2-layer (each 600) LSTM                                      71.1        43.4
SVD for output layer (128)                                                        52.5        43.6
Reduced embedding size (100)                                                      46.3        42.9
Reduced embedding size (100) and SVD for output layer (128)                       27.7        42.9
Reduced embedding size (100) and low-rank factorization (128)                     14.5        45.0
Reduced embedding size (100) and low-rank factorization (128), in half precision   7.3        45.0

Table 2: Low-rank factorization results (not PTB)


Page 14

Tensor Train decomposition [Oseledets, 2011]

TT-format for a tensor $A$:

$$A(i_1, \ldots, i_d) = \sum_{\alpha_0, \ldots, \alpha_d} G_1(\alpha_0, i_1, \alpha_1)\, G_2(\alpha_1, i_2, \alpha_2) \cdots G_d(\alpha_{d-1}, i_d, \alpha_d) \qquad (11)$$

This can be written compactly as a matrix product:

$$A(i_1, \ldots, i_d) = \underbrace{G_1[i_1]}_{1 \times r_1}\, \underbrace{G_2[i_2]}_{r_1 \times r_2} \cdots \underbrace{G_d[i_d]}_{r_{d-1} \times 1} \qquad (12)$$

Terminology:
- $G_i$: TT-cores (collections of matrices)
- $r_i$: TT-ranks
- $r = \max_i r_i$: maximal TT-rank

The TT-format uses $O(dnr^2)$ memory to store $O(n^d)$ elements. It is efficient only if the ranks are small.
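A sketch of eq. (12): one element of a TT-tensor is recovered as a product of core slices. The random cores below only illustrate the shapes, and `tt_element` is an illustrative name, not part of any particular library.

```python
import numpy as np

def tt_element(cores, index):
    """cores[k] has shape (r_k, n_k, r_{k+1}) with r_0 = r_d = 1;
    A(i_1, ..., i_d) = G_1[i_1] G_2[i_2] ... G_d[i_d]  (eq. 12)."""
    result = np.ones((1, 1))
    for G, i in zip(cores, index):
        result = result @ G[:, i, :]         # (1 x r) @ (r x r')
    return result[0, 0]

d, n, r = 4, 10, 3
ranks = [1] + [r] * (d - 1) + [1]            # r_0 = r_d = 1
cores = [np.random.randn(ranks[k], n, ranks[k + 1]) for k in range(d)]
value = tt_element(cores, (2, 5, 0, 7))      # one of the n**d implicitly stored elements
# Memory: d cores of size at most n * r * r, i.e. O(d n r^2) instead of O(n^d).
```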


Page 15

PTB results

Method                   Size (Mb)   Test PP
PTB 200-200              18.6        115.91
PTB 650-650              79.1        82.07
PTB 1500-1500            264.1       78.29
LR PTB 650-650           16.8        92.885
LR PTB 1500-1500         94.9        89.462
TT-LSTM PTB 600-600      50.4        168.639

Table 3: PTB matrix decomposition results


Page 16

References

Song Han, Huizi Mao, and William J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR, 2016.

Zhiyun Lu, Vikas Sindhwani, and Tara N. Sainath. Learning Compact Recurrent Neural Networks. Acoustics, Speech and Signal Processing (ICASSP), 2016.

Yoshua Bengio. Learning Deep Architectures for AI. Foundations and Trends in Machine Learning, 2009.

Ivan V. Oseledets. Tensor-Train Decomposition. SIAM Journal on Scientific Computing, Vol. 33(5), pp. 2295–2317, 2011.

Sepp Hochreiter and Jürgen Schmidhuber. Long Short-Term Memory. Neural Computation, Vol. 9(8), pp. 1735–1780, 1997.

Tomáš Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.


Page 17

Thank you for listening!
