Dell EMC Technical White Paper
Dell EMC Ready Solutions for AI – Deep Learning with Intel
Measuring performance and capability of deep learning use cases
Abstract
The Deep Learning with Intel solution provides a scalable, flexible
platform for training a wide variety of neural network models with
different capabilities and performance characteristics. We tested the
ability of the solution to run three different deep learning use cases in
image classification, machine translation, and product recommendation.
June 2019
H17829
Revisions
Date Description
June 2019 Initial release
Acknowledgements
This paper was produced and reviewed by the following members of the Dell EMC AI Engineering and Dell
EMC Ready Solutions for AI teams:
Author: Lucas A. Wilson, PhD
Support: Srinivas Varadharajan, Pei Yang, and Vineet Gundecha
Other: Phil Hummel
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this
publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
© June 2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Other trademarks may be trademarks of their respective owners.
Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
3 Dell EMC Ready Solutions for AI – Deep Learning with Intel | H17829
Table of contents
Revisions............................................................................................................................................................................. 2
Acknowledgements ............................................................................................................................................................. 2
Table of contents ................................................................................................................................................................ 3
Executive summary ............................................................................................................................................................. 4
1 Solution Design and Considerations ............................................................................................................................ 5
1.1 System Architecture and Design Considerations ............................................................................................... 5
1.1.1 Networking Considerations ................................................................................................................................. 6
1.1.2 Isilon H600 Storage Array (Optional).................................................................................................................. 7
1.1.3 Solution Physical Architecture Layout ................................................................................................................ 7
1.2 Nauta Deep Learning Platform ........................................................................................................................... 8
1.3 Dell EMC Software Additions ............................................................................................................................. 9
2 Containerized Workload Performance ....................................................................................................................... 10
3 Use Case Selection and Considerations .................................................................................................................... 11
3.1 Image Classification Using Convolutional Neural Networks ............................................................................ 11
3.2 Language Translation Using Multi-Head Attention Networks ........................................................................... 11
3.3 Product Recommendations Using Restricted Boltzmann Machines ................................................................ 11
4 Use Case 1: Pathology Classification in Medical Imagery ......................................................................................... 12
4.1 Convolutional Neural Networks (CNNs) ........................................................................................................... 13
4.2 Model Training .................................................................................................................................................. 13
5 Use Case 2: Machine Language Translation ............................................................................................................. 15
5.1 Multi-head Attention-based Neural Networks ................................................................................................... 15
5.2 Model Training Performance ............................................................................................................................ 16
6 Use Case 3: Ratings-based Product Recommendation ............................................................................................ 18
6.1 Restricted Boltzmann Machines (RBMs) .......................................................................................................... 18
6.2 Model Training Performance ............................................................................................................................ 19
7 Conclusions ................................................................................................................................................................ 21
A Configuration Details .................................................................................................................................................. 22
B Additional Performance Data ..................................................................................................................................... 23
C Related resources ...................................................................................................................................................... 26
Executive summary
The Dell EMC Ready Solutions for AI – Deep Learning with Intel is a CPU-based scale-out solution for
training neural network models. The solution uses Nauta, a deep learning training platform built on cloud
native technologies such as Docker and Kubernetes, to provide data scientists with a simplified software
environment that can be easily customized to the data scientist's requirements.
The solution consists of 17 nodes – a master/login node and 16 compute workers – interconnected with dual
Ethernet-based networks. Onboard shared storage in the master node provides an easy on-ramp to use, but
additional network attached storage in the form of Dell EMC Isilon H600 NAS appliances can be used to
independently scale compute and storage.
We measured and analyzed the performance of this solution using three deep learning training use cases
covering image classification, language translation, and product recommendation. In all three use cases, our
tests demonstrated near-linear scaling in performance up to the full size of the solution. Additional tests
performed on analogous hardware in the Zenith cluster show that the solution can scale all tested use cases
beyond 16 compute nodes, allowing customers to scale the solution as their compute requirements grow.
1 Solution Design and Considerations
The Dell EMC Ready Solutions for AI – Deep Learning with Intel (Figure 1) is a scale-out solution for fast
training of deep neural networks. Based on Intel® 2nd Generation Xeon® Scalable processors, the solution
provides a fully-featured, flexible environment for neural network architecture exploration, production training
at scale, and pre-deployment model testing and validation. The solution is intended as a starter configuration
for customers who are at the start of their scalable deep learning journey, with the ability to easily expand the
solution as computational needs grow.
Figure 1: Dell EMC Ready Solutions for AI – Deep Learning with Intel
The solution fully integrates both hardware and software components into a complete customer solution which
is ready for data scientists to use for constructing and training neural network-based models.
1.1 System Architecture and Design Considerations
The solution uses Intel® 2nd Generation Xeon® Scalable processors as the core building block for training
deep neural networks, providing a flexible platform for training networks of various configurations on data
objects of various size. Scalability is the secret to realizing improved time-to-solution with this system, so the
solution is built on top of dense-compute C6420 2-socket compute sleds, which can be fitted in sets of 4 into a
2U C6400 compute chassis (Figure 2).
Figure 2: Dell EMC C6420 Compute Sleds (x4) in C6400 Compute Chassis (Front and Back)
The solution does not rely on 2.5” or 3.5” disk drives for holding the operating system and system containers.
Instead, each compute sled is fitted with a 250GB Boot Optimized Storage Solution (BOSS) M.2 form-factor SSD
within the sled. Each sled contains a 1GbE Ethernet adaptor with combined iDRAC, as well as an Intel® X710
10GbE SFP Ethernet adaptor for internal data movement.
System login and management, as well as initial shared storage, is handled by a single 2-socket Dell EMC
R740xd 2U compute server (Figure 3), outfitted with 12x 12TB HDD in two (2) RAID arrays: (1) a 2-drive,
single-redundancy RAID1 array for the operating system (10TB usable), and (2) a 10-drive RAID5 array for
shared storage, Kubernetes management, and other functions (96TB usable). The login/management node
uses internal RJ45 1GbE connectivity for datacenter/Internet connectivity and a separate RJ45 iDRAC
connection. The management node also contains an Intel® X710 10GbE SFP Ethernet adaptor for
connection to the compute nodes as a means of serving the shared storage array.
Figure 3: Dell EMC R740xd with 12x 3.5" Drive Configuration
1.1.1 Networking Considerations
The choice of 10GbE over other, higher-bandwidth options (including 25/40/100GbE, 100Gbps Intel® Omni-Path Architecture, and 100Gbps Mellanox InfiniBand) was made for the solution to best match the total
anticipated data throughput when training deep neural networks on Intel® Xeon® Scalable processors.
During our in-lab tests on the Zenith supercomputer using both 10GbE and 100Gbps Intel® Omni-path Fabric,
no measurable difference in performance was observed when scaling up to 64 compute nodes (see Appendix
B, Figure 21 and Table 5 for more details).
System components are interconnected using two (2) Dell EMC Open Networking switches. The first top-of-rack
(ToR) switch is a Dell EMC S3048-ON 1U RJ45 switch (Figure 4) which provides management and iDRAC
connectivity between the nodes of the solution, as well as connectivity to the wider data center network and
the Internet. This connection will be used by IT administrators to manage the systems components via
iDRAC, connect to the system remotely, and enable the nodes to download container images from public
container registries, such as Docker Hub.
Figure 4: Dell EMC S3048-ON RJ45 Networking Switch
The second ToR switch is a Dell EMC S4128F-ON 10GbE SFP+ switch (Figure 5) which provides internal
communication over an isolated network for data movement during scale-out training and for loading data
from the shared storage array in the Dell EMC R740xd management node. The switch can also connect an
optional Dell EMC Isilon H600 storage array (see Section 1.1.2) via its two (2) 40GbE QSFP+ ports.
Figure 5: Dell EMC S4128F-ON SFP+ Networking Switch
1.1.2 Isilon H600 Storage Array (Optional)
In situations where a more performant or scalable storage solution than the R740xd master node is desired,
the solution can be optionally connected to a Dell EMC Isilon H600 hybrid flash/HDD storage appliance
(Figure 6). Each Isilon appliance consists of four (4) individual nodes, each capable of providing access to the
entire filesystem. Use of all four nodes allows for system-controlled load balancing of I/O requests across the
array. The Dell EMC Ready Solutions for AI – Deep Learning with Intel provides the ability to connect two (2)
of the Dell EMC Isilon H600 storage appliance nodes to the data network via the two (2) 40GbE QSFP+ ports
on the Dell EMC S4128F-ON switch. This provides sufficient balance between load and bandwidth
capabilities to support data motion for deep neural network training on the solution.
Figure 6: Dell EMC Isilon H600 Hybrid Flash/HDD Storage Appliance
1.1.3 Solution Physical Architecture Layout
Figure 7: Architecture Layout of Deep Learning with Intel solution with optional Isilon
1.2 Nauta Deep Learning Platform
The Dell EMC Ready Solutions for AI – Deep Learning with Intel uses the Nauta deep learning training
platform to coordinate and manage the training of deep neural networks using TensorFlow. Nauta provides
four (4) different modes for enabling this training process:
• Interactive, single-node neural network training using Jupyter notebooks
• Single-node neural network training
• Multi-node neural network training using Distributed TensorFlow
• Multi-node neural network training using Horovod
Nauta also provides capabilities for pre-production testing of trained models via TensorFlow Serving, using both
batch inference (performing inference on a directory of data objects) and streaming inference (a RESTful
web endpoint to which individual data objects can be submitted for inference).
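The multi-node Horovod mode listed above keeps workers in sync by averaging their gradients with an allreduce after each batch. A minimal pure-Python simulation of that averaging step (an illustration only, not Horovod's actual ring-allreduce implementation) might look like:

```python
# Simulate Horovod-style data parallelism: each worker computes a
# gradient on its own data shard, then an allreduce averages the
# gradients so every worker applies the identical weight update.

def allreduce_mean(gradients):
    """Average a list of per-worker gradient vectors element-wise."""
    n_workers = len(gradients)
    length = len(gradients[0])
    return [sum(g[i] for g in gradients) / n_workers for i in range(length)]

# Hypothetical per-worker gradients for a 3-weight model on 4 workers.
worker_grads = [
    [0.10, 0.20, 0.30],
    [0.30, 0.00, 0.10],
    [0.20, 0.40, 0.20],
    [0.40, 0.20, 0.60],
]

avg = allreduce_mean(worker_grads)
print(avg)  # every worker applies this same averaged gradient
```

Because every worker ends each step with the same averaged gradient, the model replicas never diverge, which is what makes this scheme scale out cleanly across nodes.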
Figure 8: Structure of the Nauta deep learning platform. Nauta uses Kubernetes for container orchestration, and Helm-based template packs for scripted job submissions (https://www.intel.ai/introducing-nauta).
1.3 Dell EMC Software Additions
Dell EMC has provided additional software tools, utilities, and templates which provide users of the Dell EMC
Ready Solutions for AI – Deep Learning with Intel with a more seamless access experience. Dell EMC’s
software additions include:
• Administrator scripts to automate the creation of joint Unix/Nauta user accounts and set appropriate
permissions allowing users access to their Nauta input/output directories
• Remote Desktop Protocol (RDP) enabled on the login node which provides customers with a simpler
path to usage via Windows Remote Desktop Connection client
• Template Yet Another Markup Language (YAML)-based Helm charts for providing individual users
access to optional Isilon storage appliances
2 Containerized Workload Performance
Nauta is a Kubernetes-based deep learning training platform which uses Docker containers for executing all
workloads. Nauta provides a single, prebuilt container which can be used for multiple tasks (Jupyter
notebooks, distributed TensorFlow, TensorFlow with Horovod, and TensorFlow Serving) via specialized script
deployment with Helm.
Containers provide a convenient way of packaging and deploying software, especially software with complex
dependencies such as deep learning frameworks. While software and hardware abstractions have
traditionally meant the sacrifice of performance when compared to executing on “bare metal” systems,
modern Docker containers do not pose a performance penalty.
Our tests have shown that the performance of the container in many cases exceeds the performance of an
equivalent “bare metal” provisioned system. Figure 9 shows the performance of our neural machine
translation use case (see Section 5) using the Deep Learning with Intel solution (“Containerized”) and an
equivalent software environment built on an identical hardware system (see Appendix A for a complete
hardware specification).
Figure 9: Containerized workload performance normalized to the same workload executed on a "bare metal" version of the same platform. The above workload is our neural machine translation use case (see Section 5).
It should be noted that these results may not apply to all workloads. However, our tests on the Deep Learning
with Intel solution show that, generally, containerization of deep learning workloads does not impose a
performance penalty.
[Figure 9 chart data – containerized performance normalized to bare metal (1.00): 118.36%, 110.05%, 100.42%, 103.80%, and 101.71% at 1, 2, 4, 8, and 16 nodes, respectively]
3 Use Case Selection and Considerations
Different neural network architectures exhibit different computational workload properties during the training
process. Because of this, we selected three (3) use cases that employ different neural network architectures
on which to evaluate the performance of the Dell EMC Ready Solutions for AI – Deep Learning with Intel.
The computational and workload diversity of these use cases was chosen to
highlight the flexibility of the solution for applications across different customer segments and problem types.
3.1 Image Classification Using Convolutional Neural Networks
Convolutional neural networks for image classification and object detection have been the primary source of
news reports related to AI and deep learning in recent years. We believe that this is the area of investigation
that most customers will begin with when beginning the deep learning phase of their AI journey, and as such
we have built an extensive use case in the health care delivery market segment around image classification.
The type of neural network used for these classification tasks – known as convolutional neural networks, or
CNNs – have a very specific computational profile. These networks consist completely of dense tensor
encodings, which lend themselves to fast, efficient computation on single-instruction, multiple-data (SIMD)
execution units such as the AVX512 vector units within the Intel® Xeon® Scalable processor family.
3.2 Language Translation Using Multi-Head Attention Networks
Multi-head attention networks are – as of the time of this writing – one of the newest trends in neural machine
translation (NMT) research. Current state-of-the-art models in production at Google are built upon this network
type and have shown significantly higher translation quality when compared to more traditional statistical
methods. While cloud service providers and other AI service providers are offering access to pre-built and
pre-trained translation models, we believe that advanced human/computer interaction through voice and
language translation will be a critical component of many companies’ core business innovation, permeating all
aspects from customer service to inter-company communication. When processes become critical to one’s
business, it becomes increasingly risky to offload the work to a third party where access is more difficult to
control and learning from many users is monetized to everyone.
These neural networks tend to have sparser encodings than CNNs, and therefore do not as efficiently use
SIMD instruction sets. The ability to easily traverse the memory address space is therefore more critical to
training these types of neural networks than with CNNs.
3.3 Product Recommendations Using Restricted Boltzmann Machines
Online recommendation engines drive viewership, increase purchases, and improve customer satisfaction at
a variety of content and product providers. Much of this work was traditionally performed using collaborative
filtering and singular value decomposition. However, recent advances in neural network architectures,
especially non-feedforward networks like Restricted Boltzmann Machines (RBMs), have produced
tantalizingly high-quality recommendation models.
Unlike traditional feedforward neural networks, RBMs do not accept data on one layer and produce results on
another. Instead, they have a reflective hidden layer which helps to map an incomplete input to the set of
expected values. In the case of product recommendation, placing a rating for a single product on the visible
layer of an RBM will make the other elements of the visible layer transform into the expected rating by that
person for all the other products in the catalog.
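The reflective up-down pass described above can be sketched in a few lines of NumPy. The weights below are random, untrained toy values (a real model would learn W, for example via contrastive divergence), so the output illustrates only the mechanics of reconstruction, not a meaningful recommendation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

n_visible, n_hidden = 5, 3   # 5 products, 3 hidden units (toy sizes)
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))  # untrained toy weights
b_h = np.zeros(n_hidden)
b_v = np.zeros(n_visible)

# Sparse input: the user has rated only product 0 (rating scaled to [0, 1]).
v = np.array([0.8, 0.0, 0.0, 0.0, 0.0])

# One up-down (reconstruction) pass: visible -> hidden -> visible.
h = sigmoid(W @ v + b_h)          # hidden activations
v_recon = sigmoid(W.T @ h + b_v)  # expected ratings for ALL products

print(v_recon)  # every visible unit now holds a predicted rating
```

After the pass, every element of the visible layer carries a predicted rating, which is exactly the behavior the paragraph above describes.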
4 Use Case 1: Pathology Classification in Medical Imagery
Over the last 18 months, Dell EMC’s AI Engineering team has been working with the ChestXray14 dataset to
develop highly accurate models for pathology classification in frontal chest x-rays. Most of that work has been
done in bare-metal HPC environments to show the power and flexibility of scale-out distributed deep
learning. Scale-out clusters like traditional HPC systems are having a dramatic influence on the way neural
networks are trained. Because this model has become one of our go-to use cases for validating the performance
characteristics of our systems and solutions, we are including it in this paper.
This image classification use case is very well suited to the use of convolutional neural networks (CNNs). It is
this category of neural network architecture that is used for everything from identifying everyday items in
pictures, to performing object detection and identification in video, to determining which shop your bowl of
ramen noodles came from. The widespread use of CNNs has ensured that various hardware vendors and
software developers have tuned and optimized their platforms to execute the training and inference of these
architectures as efficiently as possible, resulting in the ResNet-50 CNN topology becoming the de facto
benchmark for comparing the performance of deep learning systems.
The problem that our use case attempts to solve is the identification of one or more thoracic pathologies from
fourteen (14) different diagnoses identified in the ChestXray14 frontal chest x-ray training data (see Figure
10). This is a multi-label/multi-class problem, meaning that we are attempting to categorize more than one
condition, and each image could exhibit multiple pathologies simultaneously. This contrasts with the typical
benchmark use cases, such as ImageNet, where each image can correctly be mapped to only one label, i.e.,
a picture of a sandwich cannot simultaneously be a car.
Figure 10: The objective of this use case is to train a neural network-based model to correctly identify
thoracic pathologies (total of 14) from frontal chest x-rays. We then measure the performance of the Dell EMC
Ready Solutions for AI – Deep Learning with Intel when training this model.
Dell EMC’s AI Engineering team has successfully completed extensive performance and scalability testing on
this model and data set using up to 256 compute nodes, far exceeding the capacity of this solution under test
and enabling significantly improved time to solution. For more information on this model and the work that has
been done so far at Dell EMC, go to hpcatdell.com.
[Figure 10 diagram: a single patient x-ray (Patient A) mapped to multiple condition labels (Conditions A–E)]
4.1 Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNNs) are the architectural workhorse of AI-based image classification.
CNNs process information by breaking the input object (in many cases an image) into small, overlapping
windows which extract various features from the numerical representation of the input window pixels.
Successive layers of the CNN subsequently extract more complex and abstract features. The network
eventually ends with a flattened, fully-connected section which connects to the network outputs (see Figure
11).
Figure 11: A Typical CNN Structure (from https://commons.wikimedia.org/wiki/File:Typical_cnn.png)
CNNs are trained using combinations of dense matrix-vector product calculations (forward pass), and dense
matrix-matrix product calculations (backward pass). This means that CNNs are well suited to hardware
architectures which can quickly execute the same instruction on sequences of contiguous data, such as SIMD
architectures.
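One way to see why these dense products map so well to SIMD hardware is the standard im2col transformation, which rewrites a convolution as a single dense matrix-matrix product. A simplified single-channel NumPy sketch (toy sizes, random data):

```python
import numpy as np

def im2col(image, k):
    """Unroll each k x k window of a 2-D image into one column."""
    H, W = image.shape
    cols = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            cols.append(image[i:i + k, j:j + k].ravel())
    return np.array(cols).T          # shape: (k*k, num_windows)

rng = np.random.default_rng(42)
image = rng.random((6, 6))           # toy 6x6 single-channel input
filters = rng.random((4, 9))         # 4 filters, each 3x3, flattened

# The whole convolution collapses into one dense matrix-matrix product,
# exactly the contiguous-data operation SIMD units (e.g. AVX-512) excel at.
cols = im2col(image, 3)              # (9, 16)
feature_maps = filters @ cols        # (4, 16): four flattened 4x4 maps
print(feature_maps.shape)
```

Each column of `cols` is one receptive-field window, so one matrix product evaluates every filter at every position in a single sweep of contiguous memory.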
Common CNNs use a combination of convolutional layers and subsampling layers organized into repeating
blocks. Many modern CNN architectures also include residual connections between the blocks to prevent loss
of critical information due to excessive subsampling. The common CNN architecture used for benchmarking,
ResNet50, is a 50-layer neural network consisting of repeating 3-layer blocks with residual connections. The
architecture contains an input layer, an output layer, and 16 repeating 3-layer blocks, for a total of 50 layers.
Other common CNN topologies can be significantly deeper, some with up to 150 layers or more. The
computational workload necessary to perform both the forward pass and the backward pass increases with
the number of layers, so deeper networks require more computational cycles to solve.
4.2 Model Training
We measured the performance of this image classification training use case in the Deep Learning with Intel
solution, using scale-out data parallelism in steps from 1 to 16 compute nodes. Our tests show steady, near-linear
improvement in throughput – measured in images processed per second – as we scale out to 16
compute nodes, or full solution scale (see Figure 12). All model weights are single-precision floating point
numbers (FP32), as 2nd Generation Intel Xeon Scalable processors do not support half-precision (FP16 or
bfloat16) arithmetic.
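Scaling efficiency can be derived directly from the throughput values reported in Figure 12 (images per second at 1 through 16 nodes):

```python
# Throughputs (images/sec) reported in Figure 12 for 1-16 compute nodes.
nodes = [1, 2, 4, 8, 16]
throughput = [1.47, 2.83, 5.52, 10.85, 20.98]

for n, t in zip(nodes, throughput):
    speedup = t / throughput[0]      # relative to the single-node run
    efficiency = speedup / n         # 100% would be perfectly linear
    print(f"{n:2d} nodes: speedup {speedup:5.2f}x, efficiency {efficiency:6.1%}")
```

At 16 nodes the speedup works out to roughly 14.3x, or about 89% parallel efficiency, which is the "near-linear" behavior described above.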
We also trained this use case using up to 16 nodes on the Zenith supercomputer with 10 Gigabit Ethernet
connections and a full 256 nodes using 100 Gigabit Intel Omni-path (OPA) fabric (see Appendix B, Figure 21).
Our tests demonstrate continued near-linear scaling and significant throughput and time-to-solution
improvement as we scaled out the training. Our best tests demonstrate that, with 256 compute nodes, this
use case could be trained to full accuracy in under 12 minutes (see Appendix B, Figure 23).
Figure 12: CNN (ResNet50) Throughput Performance for Medical Imaging Use Case using Dell EMC Ready Solutions for AI - Deep Learning with Intel. Tests were performed using distributed TensorFlow with Horovod
with 4 MPI processes per node.
Figure 13: Scaled speedup of image classification use case training vs ideal.
While these tests are for a specific use case, the results presented would apply to any form of image
classification problem using images of equivalent size – in this case 256 x 256 images in 3-color RGB.
Performance of the solution would be different for topologies of greater depth, or with images of larger
dimensions.
The tests performed for this use case are only image classification. This use case does not investigate other
image-based deep learning capabilities, such as segmentation, detection, or depth analysis. However, these
scenarios use similar neural network topologies to image classification, and as such can be expected to
exhibit similar – if not identical – performance characteristics.
[Figure 12 chart data – images per second: 1.47 on 1 node, 2.83 on 2 nodes, 5.52 on 4 nodes, 10.85 on 8 nodes, 20.98 on 16 nodes]
[Figure 13 chart data – scaled speedup of classification training vs. ideal at 1, 2, 4, 8, and 16 nodes]
5 Use Case 2: Machine Language Translation
Another area where neural networks have transformed the state of the art is machine translation of
human language. While basic dictionary systems and statistical natural language processing (NLP)
techniques were able to build quality translators for simple words and phrases, the ability to generate high
quality translations of entire sentences or documents has only occurred since the adoption of specialized
neural networks for these tasks.
As of the time of this writing, the highest quality language translation models are produced using the multi-
head attention-based models which were first developed and popularized by Google. These neural network
architectures are fundamentally different in both structure and computation from CNNs, and as such provide a
different proving ground for measuring the performance of the system.
Unlike CNNs, approaches such as the transformer model share the weight matrix between the embedding
layer and the linear transformation prior to the softmax layer (see "Positional Encoding" in Figure 15). They must
also ensure that the gradients from these two layers are updated appropriately without causing performance
degradation or out-of-memory (OOM) errors.
When we began evaluating Google’s official transformer model, we discovered issues related to OOM errors
caused by assumed sparsity of the positional encoding layer. This caused excessive memory use and limited
scale for training the transformer models. Once the issue was identified and corrected (see Densifying
Assumed-sparse Tensors), scalable training of transformer models, including on the Deep Learning with Intel
solution, became possible.
This use case has many applications, from automated customer service systems which can handle multiple
languages, to website creation and hosting, to legal document translation. The unique computational
requirements of these types of neural networks also made this an ideal case for testing the solution’s flexibility
in handling many types of deep learning problems.
5.1 Multi-head Attention-based Neural Networks
The transformer architecture is built with variants of the attention mechanism in the encoder-decoder part (see
Figure 14), eliminating the need for traditional Recurrent Neural Networks (RNNs) in the architecture. This
architecture achieves state-of-the-art results on English-German and English-French translation tasks when
compared to RNNs based on Long Short-Term Memory (LSTM) neurons.
Figure 14: Encoder-decoder architecture of language translation models.
Figure 15 illustrates the multi-head attention block used in the transformer model. At a high level, scaled
dot-product attention can be imagined as finding the relevant values (V) based on queries (Q) and keys (K),
and multi-head attention can be thought of as several attention layers running in parallel, each capturing a
distinct aspect of the input.
Figure 15: Full Architecture of Multi-head Attention-based Transformer Neural Network
This scaled dot-product operation is more computationally efficient than LSTM neurons, enabling larger
networks and permitting wider windows of subword tokens – and their associated context – to be processed.
The ability to process wider windows means higher translation quality.
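The scaled dot-product attention described above reduces to a few matrix operations. A simplified single-head NumPy sketch with toy dimensions (the production transformer adds learned projections, masking, and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of each key to each query
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of values

rng = np.random.default_rng(1)
seq_len, d_k = 4, 8                      # toy sequence of 4 tokens
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                         # one context vector per token
```

Because the core of the computation is two dense matrix products rather than a sequential LSTM recurrence, every position in the window can be processed at once, which is the efficiency advantage noted above.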
5.2 Model Training Performance

We evaluated the Deep Learning with Intel solution using Google’s transformer architecture to train a model for translating English to German, using the WMT English/German translation corpus. This corpus contains 4.5 million sentence pairs, each consisting of an English sentence followed by its German translation.
Training time for transformer architectures is incredibly long without parallelization or acceleration. Our single
node, single process experiments would have required approximately 31 days (749 hours) to train to a high-
quality solution (BLEU score ≥ 27.5). This approximation was calculated based on the total throughput in
subword tokens per second and the expected number of subword tokens required in subsequent, parallelized
runs that were needed to achieve the desired solution quality.
Our performance tests on the solution demonstrated near-linear scaling from 1 to 16 nodes, using 4
processes per node (64-way parallelism) when training transformer architectures with the WMT
English/German corpus. When compared to a single node, single process training run, the 64-way run (16
nodes, 4 processes per node) was able to reduce the time to solution from 749 hours to 50 hours, which is a
15x improvement in time to solution.
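As a quick sanity check, the speedup and node-level scaling efficiency follow directly from the reported times (all numbers below are taken from the measurements above):

```python
serial_hours = 749.0    # single node, single process estimate
parallel_hours = 50.0   # 16 nodes x 4 processes (64-way) run
nodes = 16

speedup = serial_hours / parallel_hours
# Ideal speedup relative to the single-node baseline is 16x,
# so efficiency is measured against the node count.
node_efficiency = speedup / nodes

print(f"{speedup:.1f}x speedup, {node_efficiency:.0%} node-level efficiency")
# -> 15.0x speedup, 94% node-level efficiency
```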
Figure 16: Performance (subword tokens per second) of multi-head attention-based Transformer model scaled to full solution size. Performance on the Dell EMC Ready Solutions for AI - Deep Learning with Intel is
near-linear using distributed TensorFlow with Horovod for multi-node training.
Figure 17: Scaled speedup of machine translation use case when compared to ideal.
Subsequent tests performed on the Zenith supercomputer in the Dell EMC HPC and AI Innovation Lab – which is built on the same C6420 compute blocks as the Deep Learning with Intel solution – show that additional scaling of this use case continues to improve time to solution. When using 200 compute nodes (50 C6400 chassis), we trained a high-quality model in just over 6 hours (see Appendix B, Figure 24).
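For context on why this scales, Horovod-style data parallelism averages each worker’s gradients with an allreduce after every training step; near-linear scaling means this communication stays cheap relative to computation. A minimal NumPy sketch of the averaging operation (worker count and gradient values here are illustrative, and Horovod’s real implementation uses an MPI-based ring allreduce rather than this single-process stand-in):

```python
import numpy as np

def allreduce_average(worker_grads):
    """Average one gradient tensor across workers, as Horovod's
    allreduce does after each backward pass."""
    return np.mean(worker_grads, axis=0)

# Four workers, each holding a local gradient for the same 3 parameters.
grads = np.array([
    [0.2, -0.1, 0.4],
    [0.4, -0.3, 0.0],
    [0.0,  0.1, 0.4],
    [0.2, -0.1, 0.0],
])
avg = allreduce_average(grads)  # every worker then applies the same update
```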
There are many other language use cases solved with neural networks, from voice-to-text and text-to-voice to sentiment analysis. Text-to-text translation is one of the newest of these use cases and, as such, uses one of the more complex topologies. Other use cases employ different neural network topologies, and customers should expect different performance characteristics from those workloads than from text-to-text translation.
Chart data for Figure 16 – transformer throughput on the solution:

Compute nodes   Subword tokens per second
1               1,334.84
2               2,408.44
4               4,346.09
8               8,825.18
16              16,984.83
6 Use Case 3: Ratings-based Product Recommendation

Online recommendation engines drive viewership, increase purchases, and improve customer satisfaction for a variety of content and product providers. Much of this work was traditionally performed using collaborative filtering and singular value decomposition. However, recent advances in neural network architectures, especially non-feedforward networks, have produced tantalizingly high-quality recommendation models.
Restricted Boltzmann Machines (RBMs) do not work like traditional feedforward neural networks. Instead of presenting data on an input layer and receiving an answer on an output layer, the input and output appear on the same layer: when a value is set for a single visible neuron, the remaining neurons in the visible layer take on their expected values according to the mapping encoded in the hidden layer.
RBMs provide a means of building cross-element mappings, where the value of one input neuron affects the values of the other neurons. In the case of product or movie recommendation, these mappings represent the likelihoods that – given one rating for a product or a movie – a given user would highly or poorly rate the other products or movies for which the network has been trained.
6.1 Restricted Boltzmann Machines (RBMs)

Restricted Boltzmann Machines (RBMs) belong to a special category of neural networks called generative stochastic artificial neural networks, which also includes general Boltzmann machines, capsule networks, and deep belief networks. These networks are used to learn unspecified probability distributions over a set of inputs.
In our case, the probability distribution that we want the RBM to learn is the expected rating that a customer
would give to a set of products, based on a provided rating for one (or more) products. For this particular use
case, we used the MovieLens data set to train an RBM to provide five-star expected movie ratings for any
customer, given the customer provides a real rating for any one of the 58,000 movies in the dataset. This is
done by building an RBM which contains 58,000 visible neurons and 100 hidden neurons (see Figure 18),
where each hidden neuron can be thought to encode a particular characteristic or feature shared across
movie titles (e.g., film type, lead actor, director, etc.).
Figure 18: Illustration of a Restricted Boltzmann Machine for five-star ratings. This network builds ratings for M products using F characterizations.
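To make the sampling mechanics concrete, here is a minimal single-process sketch of one contrastive-divergence (CD-1) Gibbs step for a binary RBM. The dimensions are toy stand-ins (a full ratings model would use groups of five visible units per movie across the roughly 58,000 titles, plus the 100 hidden units described above), and this does not reproduce the parallel MCMC training approach discussed in this paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions; a real ratings model is far larger.
n_visible, n_hidden = 20, 4

W = rng.normal(scale=0.01, size=(n_visible, n_hidden))  # weights
b_v = np.zeros(n_visible)                               # visible biases
b_h = np.zeros(n_hidden)                                # hidden biases

def gibbs_step(v):
    """One visible -> hidden -> visible sampling pass."""
    p_h = sigmoid(v @ W + b_h)               # hidden activation probabilities
    h = (rng.random(p_h.shape) < p_h) * 1.0  # sample binary hidden states
    p_v = sigmoid(h @ W.T + b_v)             # visible reconstruction probs
    return p_v, p_h

# A batch of 3 sparse binary "rating" vectors.
v0 = (rng.random((3, n_visible)) < 0.2) * 1.0
v1, h0 = gibbs_step(v0)

# CD-1 update: move weights toward data-driven correlations and away
# from reconstruction-driven correlations.
lr = 0.1
h1 = sigmoid(v1 @ W + b_h)
W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
```

The stochastic hidden-state sampling in `gibbs_step` is what makes this training process hard to parallelize naively.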
RBMs are difficult to train, and few, if any, approaches have been proposed for parallelizing the training process for RBMs, due to the stochastic nature of the sampling and training process. Dell EMC’s AI Engineering team has developed a novel approach to training RBMs in parallel using a Markov Chain Monte Carlo (MCMC) approach with Gibbs sampling. We will publish the details of the approach, along with additional performance data, soon. In the meantime, all of the source code is available on the Dell EMC HPC & AI Engineering GitHub page.
6.2 Model Training Performance

When training RBMs for user ratings and recommendations on the Deep Learning with Intel solution, we see near-linear scaling from 1 to 16 compute nodes (see Figure 19 and Figure 20), just as we do with the other two use cases in image classification and language translation.
Figure 19: Performance of RBM training on Dell EMC Ready Solutions for AI - Deep Learning with Intel. Weak scaling results with 32 records per process, 4 processes per node. Distributed training using TensorFlow with
Horovod.
Chart data for Figure 19 – RBM training throughput on the solution:

Compute nodes   Records per second
1               10.18
2               19.61
4               40.23
8               78.90
16              155.18
Figure 20: Scaled speedup of RBM training compared to ideal.
In addition to scaling tests up to 16 nodes performed on the solution, we also performed tests with greater
scale on the Zenith supercomputer, showing that we can indeed realize improved time to solution out to 64
nodes (see Appendix B, Figure 25). As with the other use cases, we see potential for continued scaling
performance and improved time to solution as customers continue to expand the Deep Learning with Intel
solution from 16 compute nodes to greater scale, including potentially out to hundreds of compute nodes.
7 Conclusions

Dell EMC’s Ready Solutions for AI – Deep Learning with Intel is a flexible deep learning training platform designed for future expansion and scale-out as the customer’s computational demands increase. It has demonstrated flexibility across multiple use cases that span market segments and computational workload profiles – ranging from dense convolutional network training, to sparser language translation model training, to emerging non-feedforward training of Restricted Boltzmann Machines (RBMs).
The solution is fully containerized, utilizing Docker containers for all user-facing software and Kubernetes with Nauta automation for container orchestration. We have demonstrated that – in addition to providing greater flexibility for the data scientist training models – the use of containers does not adversely affect the performance of the solution. In fact, in some cases customers can expect better performance from containerized workloads on the solution than from the same hardware deployed in a bare-metal configuration.
The solution uses the Nauta control program to execute TensorFlow training jobs, which can take advantage of Horovod’s distributed training automation to realize scale-out performance increases and improve time to solution. Our tests demonstrate that the solution achieves excellent – near-linear – scaling across the entire range of use case tests. This means customers who wish to expand the solution beyond the base 16 compute nodes can continue to expect improved performance and time to solution as they scale out.
A Configuration Details
Table 1: PowerEdge R740xd hardware components
Component Quantity Description
Processor 2 Intel Xeon Scalable Gold 6230
Memory 12 32 GB 2933 MHz DDR4
Storage 12 12 TB 7,200 RPM HDD
Network adapter 1 Intel X710 10 Gb SFP+ Ethernet adapter
Table 2: RAID configuration for the PowerEdge R740xd master node
Virtual disk Configuration RAID Mounted as
VD0 2 x 12 TB HDD (11 TB usable) RAID-1 /(root)
VD1 10 x 12 TB HDD (97 TB usable) RAID-5 /data (NFS and ETCD)
Table 3: PowerEdge C6420 hardware components
Component Quantity Description
Processor 2 Intel Xeon Scalable Gold 6230
Memory 12 16 GB 2933 MHz DDR4
Storage 1 250 GB M.2 boot-optimized SSD
Network adapter 1 Intel X710 10 Gb SFP+ Ethernet adapter
Table 4: Software components
Component Description
BIOS Dell BIOS 2.1.8
Operating system Red Hat Enterprise Linux (RHEL) 7.6
Docker Community Edition 18.06
Kubernetes 1.10.11
Helm 2.9.1
Nauta 1.0 Enterprise Support
Remote Desktop Protocol (XRDP) 0.9.9-1
B Additional Performance Data
Figure 21: Time to train performance comparison using Intel 100Gb OPA and Intel 10Gb Ethernet on Zenith. Processors are Intel Xeon Scalable Gold 6148.
Table 5: Time to train per epoch (in seconds) using 100 Gigabit Intel Omni-Path and 10 Gigabit Intel X710 Ethernet. Intel Xeon Scalable Gold 6148 processors.

Nodes Intel Omni-Path Architecture (100 Gb) Intel X710 (10 Gigabit Ethernet)
1 9987 10135
2 6425 6328
4 4232 4230
8 2938 2983
16 2308 2236
Figure 22: Throughput performance of image classification training on various topologies on Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on NIH ChestXray14
dataset.
Figure 23: Time-to-solution for image classification training on various topologies on the Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on NIH ChestXray14 dataset.
Chart data for Figure 22 – images per second by configuration:

DenseNet121, P=1, BZ=8               4
DenseNet121, P=64, BZ=64, GBZ=4096   186
VGG16, P=128, GBZ=8192               1,170
ResNet50, P=512, GBZ=8192            5,851
ResNet50, P=800, GBZ=8000            6,750
ResNet50, P=1024, GBZ=8192           8,252

Chart data for Figure 23 – time to solution (seconds) by configuration:

DenseNet121, P=1, BZ=8               386,845
DenseNet121, P=64, BZ=64, GBZ=4096   8,319
VGG16, P=128, GBZ=8192               16,532
ResNet50, P=512, GBZ=4096            1,742
ResNet50, P=512, GBZ=8192            1,362
ResNet50, P=800, GBZ=8000            825
ResNet50, P=1024, GBZ=8192           675
Figure 24: Time to solution of neural machine translation use case on Deep Learning with Intel solution and Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on WMT
English/German corpus.
Figure 25: Time to solution for Restricted Boltzmann Machine (RBM) training on Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on MovieLens dataset.
Chart data for Figure 24 – time to solution in hours (BLEU ≥ 27.5); runs at 1 and 16 nodes on the Deep Learning with Intel solution, runs at 32–200 nodes on Zenith:

Nodes   Hours
1       749.06
16      50.01
32      29.46
64      14.37
128     7.75
200     6.17

Chart data for Figure 25 – RBM time to solution on Zenith:

Nodes   Seconds
1       20,009
2       10,244
4       5,129
8       2,581
16      1,306
32      670
64      346
C Related resources
Dell EMC AI Engineering GitHub
Dell EMC AI Engineering Blogs
ChestXray14 Dataset
ImageNet 2012 Dataset
Google Tensor2Tensor Library
WMT16 Language Translation Datasets
GroupLens Public MovieLens Dataset
Attention is All You Need
Densifying Assumed-sparse Tensors