
Page 1:

OUTRAGEOUSLY LARGE NEURAL NETWORKS: THE SPARSELY-GATED MIXTURE-OF-EXPERTS LAYER [1]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean

Google Brain - Jagiellonian University

Presenter: Mohammad Motamedi

Page 2:

• Key contributors to the performance of deep networks:

  • Model size

  • Training data size

• “When datasets are sufficiently large, increasing the capacity (number of parameters) of neural networks can give much better prediction accuracy”

• Current computational infrastructure falls short of meeting these computing demands.


Page 3:

• Obstacles to exploiting sparsity and conditional computation in practice:

  • Inefficiency in the memory system

  • Branching penalties on modern hardware (GPUs handle arithmetic far better than branching)

  • Sophisticated linear algebra libraries that are tuned for dense computation

• “Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off.” [2]


Page 4:


$y = \sum_{i=1}^{n} G(x)_i \, E_i(x), \qquad G(x) \in \mathbb{R}^{n}_{+}$

• If $G(x)_i = 0$, we need not compute $E_i(x)$.

• In each forward pass, out of roughly 1000 expert modules, only a handful are active.
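A minimal sketch of this conditional computation, assuming a list of small expert callables and a precomputed sparse gate vector; the names (`moe_forward`, `experts`, `gates`) are illustrative, not from the paper:

```python
import numpy as np

def moe_forward(x, experts, gates):
    """Sparsely-gated mixture of experts: y = sum_i G(x)_i * E_i(x).

    x:       input vector
    experts: list of n callables, E_i(x) -> output vector
    gates:   length-n array G(x); mostly zeros, k nonzero entries
    """
    y = 0.0
    for i, g in enumerate(gates):
        if g == 0.0:          # the key saving: inactive experts are never evaluated
            continue
        y = y + g * experts[i](x)
    return y

# Toy usage: 4 linear experts, only 2 active for this input.
rng = np.random.default_rng(0)
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((3, 3))) for _ in range(4)]
gates = np.array([0.0, 0.7, 0.0, 0.3])   # sparse gate vector, sums to 1
y = moe_forward(rng.standard_normal(3), experts, gates)
```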

Page 5:

• Desired characteristics of the gating network:

  • Sparsity

  • Load balancing

$G(x) = \mathrm{Softmax}(\mathrm{TopK}(H(x), k))$

$\mathrm{TopK}(v, k)_i = \begin{cases} v_i, & v_i \in \mathrm{sorted}(v)[-k:] \\ -\infty, & \text{otherwise} \end{cases}$

$H(x)_i = (x \cdot W_g)_i + \mathrm{StandardNormal}() \cdot \ln\!\left(1 + e^{(x \cdot W_{noise})_i}\right)$

The $\ln(1 + e^{z})$ term is the Softplus function; it scales the per-component Gaussian noise added to the gate logits during training.
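A sketch of this noisy top-k gating in NumPy, assuming row-vector inputs; the weight shapes and function names are illustrative:

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))          # ln(1 + e^z)

def noisy_topk_gate(x, W_g, W_noise, k, rng):
    """Noisy top-k gating: G(x) = Softmax(TopK(H(x), k))."""
    h = x @ W_g + rng.standard_normal(W_g.shape[1]) * softplus(x @ W_noise)
    masked = np.full_like(h, -np.inf)   # entries outside the top k stay at -inf
    top = np.argsort(h)[-k:]            # indices of the k largest entries
    masked[top] = h[top]
    e = np.exp(masked - h[top].max())   # softmax; exp(-inf) = 0 gives the sparsity
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
G = noisy_topk_gate(x, rng.standard_normal((8, 16)), rng.standard_normal((8, 16)), k=2, rng=rng)
# G has exactly 2 nonzero entries, and they sum to 1.
```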


Page 6:

• Large batch sizes are necessary to achieve high throughput.

  • They amortize the overhead of data transfer.

• Assume the gating network chooses $k$ out of $n$ expert networks. Each expert then sees a much smaller effective batch:

$\frac{k \cdot b}{n} \ll b$

• Model parallelism for the experts over $d$ devices (combined with data parallelism) restores a large expert batch:

$\frac{k \cdot b \cdot d}{n}$
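A quick numeric check of this shrinking-batch problem and its remedy, using illustrative values for $b$, $k$, $n$, and $d$ (not the paper's exact configuration):

```python
b, k, n, d = 1024, 2, 512, 64   # batch size, experts per example, experts, devices

per_expert = k * b / n          # data parallelism only: 4 examples per expert
combined   = k * b * d / n      # data + model parallelism: 256 examples per expert
print(per_expert, combined)     # 4.0 256.0
```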


Page 7:

• The gating network tends to settle on a few favored experts, and the process is self-reinforcing: the selected experts are trained most rapidly and are therefore selected even more often.

• An additional loss term is defined to discourage this behavior for a given batch $X$:

$L_{importance}(X) = w_{importance} \cdot CV\!\left(\sum_{x \in X} G(x)\right)^{2}$

where $CV$ is the coefficient of variation (standard deviation divided by mean) of the per-expert importance values.

• It is still possible to have an imbalance. How? (Experts can receive equal total gate values yet handle very different numbers of examples; the paper addresses this with a second loss term, $L_{load}$.)
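A sketch of the importance loss for a batch of gate vectors; the value of the `w_importance` coefficient here is illustrative, not the paper's:

```python
import numpy as np

def importance_loss(gate_matrix, w_importance=0.1):
    """L_importance = w_importance * CV(sum_x G(x))^2 over a batch.

    gate_matrix: (batch, n_experts) array of gate values G(x).
    """
    importance = gate_matrix.sum(axis=0)        # total gate value per expert
    cv = importance.std() / importance.mean()   # coefficient of variation
    return w_importance * cv ** 2               # zero when all experts matter equally
```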


Page 8:

• In machine learning, perplexity is a measure of prediction error.

$PP = 2^{-\sum_{x \in X} q(x) \log_{2} p(x)}$

where $p$ is the model distribution and $q$ is the empirical distribution of the test set $X$.

• Lower perplexity means the model assigns higher probability to the observed data, i.e., predicts it more strongly.
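As a concrete illustration, when $q$ puts weight $1/|X|$ on each observed token, perplexity reduces to 2 raised to the average negative log-probability; a minimal sketch:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity from the model's probabilities of the observed tokens.

    Equivalent to 2 ** cross-entropy with the empirical distribution.
    """
    token_probs = np.asarray(token_probs)
    return 2.0 ** (-np.mean(np.log2(token_probs)))

# A model assigning probability 0.25 to every observed token has perplexity 4.
assert np.isclose(perplexity([0.25, 0.25, 0.25]), 4.0)
```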


Page 9:

Page 10:

Page 11:

One Billion Word language modeling benchmark [4]:

                          Perplexity   #Parameters   Training Time          TFLOPS/GPU
Best public result [3]    34.7         151 million   59 hours on 32 K40s    1.09
Proposed                  28.0         4.4 billion   47 hours on 32 K40s    1.56

Page 12:

Page 13:

1. Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." arXiv preprint arXiv:1701.06538 (2017).

2. Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.

3. Jozefowicz, Rafal, et al. "Exploring the limits of language modeling." arXiv preprint arXiv:1602.02410 (2016).

4. Chelba, Ciprian, et al. "One billion word benchmark for measuring progress in statistical language modeling." arXiv preprint arXiv:1312.3005 (2013).
