
Page 1:

OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER [1]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean

Google Brain - Jagiellonian University

Presenter: Mohammad Motamedi

Page 2:

• Key contributors to the performance of deep networks:

  • Model size

  • Training data size

• “When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy”

• Current computational infrastructure falls short of meeting these computing demands.


Page 3:

• Obstacles to exploiting sparsity and conditional computation in practice:

  • Inefficiency in the memory system

  • Branching penalties on modern hardware (GPUs handle arithmetic far better than branching)

  • Sophisticated linear algebra libraries that are tuned for dense computation

• “Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off.” [2]


Page 4:


$y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) \in \mathbb{R}^{n}_{+}$

• If $G(x)_i = 0$, we need not compute $E_i(x)$.

• In each forward pass, out of roughly 1000 expert modules, only a handful are active.
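A minimal sketch of this conditional computation, assuming a list of small expert callables and a precomputed sparse gate vector; the names (`moe_forward`, `experts`, `gates`) are illustrative, not from the paper:

```python
import numpy as np

def moe_forward(x, experts, gates):
    """Sparsely-gated mixture of experts: y = sum_i G(x)_i * E_i(x).

    x:       input vector
    experts: list of n callables, E_i(x) -> output vector
    gates:   length-n array G(x); mostly zeros, k nonzero entries
    """
    y = 0.0
    for i, g in enumerate(gates):
        if g == 0.0:          # the key saving: inactive experts are never evaluated
            continue
        y = y + g * experts[i](x)
    return y

# Toy usage: 4 linear experts, only 2 active for this input.
rng = np.random.default_rng(0)
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((3, 3))) for _ in range(4)]
gates = np.array([0.0, 0.7, 0.0, 0.3])   # sparse gate vector, sums to 1
y = moe_forward(rng.standard_normal(3), experts, gates)
```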

Page 5:

• Desired characteristics of the gating network:

  • Sparsity

  • Load balancing

$G(x) = \mathrm{Softmax}(\mathrm{TopK}(H(x), k))$

$\mathrm{TopK}(v, k)_i = \begin{cases} v_i, & v_i \in \mathrm{sorted}(v)[-k:] \\ -\infty, & \text{otherwise} \end{cases}$

$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \ln\!\left(1 + e^{(x \cdot W_{noise})_i}\right)$

The $\ln(1 + e^{z})$ term is the Softplus function; it scales the per-component Gaussian noise added to the gate logits during training.
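A sketch of this noisy top-k gating in NumPy, assuming row-vector inputs; the weight shapes and function names are illustrative:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))          # ln(1 + e^z)

def noisy_topk_gate(x, W_g, W_noise, k, rng):
    """Noisy top-k gating: G(x) = Softmax(TopK(H(x), k))."""
    h = x @ W_g + rng.standard_normal(W_g.shape[1]) * softplus(x @ W_noise)
    masked = np.full_like(h, -np.inf)   # entries outside the top k stay at -inf
    top = np.argsort(h)[-k:]            # indices of the k largest entries
    masked[top] = h[top]
    e = np.exp(masked - h[top].max())   # softmax; exp(-inf) = 0 gives the sparsity
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
G = noisy_topk_gate(x, rng.standard_normal((8, 16)), rng.standard_normal((8, 16)), k=2, rng=rng)
# G has exactly 2 nonzero entries, and they sum to 1.
```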


Page 6:

• Large batch sizes are necessary to achieve high throughput.

  • They amortize the overhead of data transfer.

• Assume the gating network chooses $k$ out of $n$ expert networks. Each expert then sees a much smaller effective batch:

$\frac{k \cdot b}{n} \ll b$

• Model parallelism for the experts over $d$ devices (combined with data parallelism) restores a large expert batch:

$\frac{k \cdot b \cdot d}{n}$
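A quick numeric check of this shrinking-batch problem and its remedy, using illustrative values for $b$, $k$, $n$, and $d$ (not the paper's exact configuration):

```python
b, k, n, d = 1024, 2, 512, 64   # batch size, experts per example, experts, devices

per_expert = k * b / n          # data parallelism only: 4 examples per expert
combined   = k * b * d / n      # data + model parallelism: 256 examples per expert
print(per_expert, combined)     # 4.0 256.0
```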


Page 7:

• The gating network tends to settle on a few favored experts, and the process is self-reinforcing: the selected experts are trained most rapidly and are therefore selected even more often.

• An additional loss term is defined to discourage this behavior for a given batch $X$:

$L_{importance}(X) = w_{importance} \cdot CV\!\left(\sum_{x \in X} G(x)\right)^{2}$

where $CV$ is the coefficient of variation (standard deviation divided by mean) of the per-expert importance values.

• It is still possible to have an imbalance. How? (Experts can receive equal total gate values yet handle very different numbers of examples; the paper addresses this with a second loss term, $L_{load}$.)
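A sketch of the importance loss for a batch of gate vectors; the value of the `w_importance` coefficient here is illustrative, not the paper's:

```python
import numpy as np

def importance_loss(gate_matrix, w_importance=0.1):
    """L_importance = w_importance * CV(sum_x G(x))^2 over a batch.

    gate_matrix: (batch, n_experts) array of gate values G(x).
    """
    importance = gate_matrix.sum(axis=0)        # total gate value per expert
    cv = importance.std() / importance.mean()   # coefficient of variation
    return w_importance * cv ** 2               # zero when all experts matter equally
```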


Page 8:

• In machine learning, perplexity is a measure of prediction error.

$PP = 2^{-\sum_{x \in X} q(x) \log_{2} p(x)}$

where $p$ is the model distribution and $q$ is the empirical distribution of the test set $X$.

• Lower perplexity means the model assigns higher probability to the observed data, i.e., predicts it more strongly.
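As a concrete illustration, when $q$ puts weight $1/|X|$ on each observed token, perplexity reduces to 2 raised to the average negative log-probability; a minimal sketch:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the model's probabilities of the observed tokens.

    Equivalent to 2 ** cross-entropy with the empirical distribution.
    """
    token_probs = np.asarray(token_probs)
    return 2.0 ** (-np.mean(np.log2(token_probs)))

# A model assigning probability 0.25 to every observed token has perplexity 4.
assert np.isclose(perplexity([0.25, 0.25, 0.25]), 4.0)
```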


Page 9:

Page 10:

Page 11:

One Billion Word language modeling benchmark [4]:

                          Perplexity   #Parameters   Training Time          TFLOPS/GPU
Best public result [3]    34.7         151 million   59 hours on 32 K40s    1.09
Proposed                  28.0         4.4 billion   47 hours on 32 K40s    1.56

Page 12:

Page 13:

1. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

2. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

3. Jozefowicz, Rafal, et al. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).

4. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).
