Dell EMC Technical White Paper
Dell EMC Ready Solutions for AI – Deep Learning with Intel
Measuring performance and capability of deep learning use cases
Abstract
The Deep Learning with Intel solution provides a scalable, flexible
platform for training a wide variety of neural network models with
different capabilities and performance characteristics. We tested the
ability of the solution to run three different deep learning use cases in
image classification, machine translation, and product recommendation.
June 2019
H17829
Revisions
Date Description
June 2019 Initial release
Acknowledgements
This paper was produced and reviewed by the following members of the Dell EMC AI Engineering and Dell
EMC Ready Solutions for AI teams:
Author: Lucas A. Wilson, PhD
Support: Srinivas Varadharajan, Pei Yang, and Vineet Gundecha
Other: Phil Hummel
The information in this publication is provided “as is.” Dell Inc. makes no representations or warranties of any kind with respect to the information in this
publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose.
Use, copying, and distribution of any software described in this publication requires an applicable software license.
© June 2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries.
Other trademarks may be trademarks of their respective owners.
Dell believes the information in this document is accurate as of its publication date. The information is subject to change without notice.
3 Dell EMC Ready Solutions for AI – Deep Learning with Intel | H17829
Table of contents
Revisions............................................................................................................................................................................. 2
Acknowledgements ............................................................................................................................................................. 2
Table of contents ................................................................................................................................................................ 3
Executive summary ............................................................................................................................................................. 4
1 Solution Design and Considerations ............................................................................................................................ 5
1.1 System Architecture and Design Considerations ............................................................................................... 5
1.1.1 Networking Considerations ................................................................................................................................. 6
1.1.2 Isilon H600 Storage Array (Optional).................................................................................................................. 7
1.1.3 Solution Physical Architecture Layout ................................................................................................................ 7
1.2 Nauta Deep Learning Platform ........................................................................................................................... 8
1.3 Dell EMC Software Additions ............................................................................................................................. 9
2 Containerized Workload Performance ....................................................................................................................... 10
3 Use Case Selection and Considerations .................................................................................................................... 11
3.1 Image Classification Using Convolutional Neural Networks ............................................................................ 11
3.2 Language Translation Using Multi-Head Attention Networks ........................................................................... 11
3.3 Product Recommendations Using Restricted Boltzmann Machines ................................................................ 11
4 Use Case 1: Pathology Classification in Medical Imagery ......................................................................................... 12
4.1 Convolutional Neural Networks (CNNs) ........................................................................................................... 13
4.2 Model Training .................................................................................................................................................. 13
5 Use Case 2: Machine Language Translation ............................................................................................................. 15
5.1 Multi-head Attention-based Neural Networks ................................................................................................... 15
5.2 Model Training Performance ............................................................................................................................ 16
6 Use Case 3: Ratings-based Product Recommendation ............................................................................................ 18
6.1 Restricted Boltzmann Machines (RBMs) .......................................................................................................... 18
6.2 Model Training Performance ............................................................................................................................ 19
7 Conclusions ................................................................................................................................................................ 21
A Configuration Details .................................................................................................................................................. 22
B Additional Performance Data ..................................................................................................................................... 23
C Related resources ...................................................................................................................................................... 26
Executive summary
The Dell EMC Ready Solutions for AI – Deep Learning with Intel is a CPU-based scale-out solution for
training neural network models. The solution uses Nauta, a deep learning training platform built on cloud
native technologies such as Docker and Kubernetes, to provide data scientists with a simplified software
environment that can be easily customized to the data scientist's requirements.
The solution consists of 17 nodes – a master/login node and 16 compute workers – interconnected with dual
Ethernet-based networks. Onboard shared storage in the master node provides an easy on-ramp to use, but
additional network attached storage in the form of Dell EMC Isilon H600 NAS appliances can be used to
independently scale compute and storage.
We measured and analyzed the performance of this solution using three deep learning training use cases
covering image classification, language translation, and product recommendation. In all three use cases, our
tests demonstrated near-linear scaling in performance up to the full size of the solution. Additional tests
performed on analogous hardware in the Zenith cluster show that the solution can scale all tested use cases
beyond 16 compute nodes, allowing customers to scale the solution as their compute requirements grow.
1 Solution Design and Considerations
The Dell EMC Ready Solutions for AI – Deep Learning with Intel (Figure 1) is a scale-out solution for fast
training of deep neural networks. Based on Intel® 2nd Generation Xeon® Scalable processors, the solution
provides a fully-featured, flexible environment for neural network architecture exploration, production training
at scale, and pre-deployment model testing and validation. The solution is intended as a starter configuration
for customers who are at the start of their scalable deep learning journey, with the ability to easily expand the
solution as computational needs grow.
Figure 1: Dell EMC Ready Solutions for AI – Deep Learning with Intel
The solution fully integrates both hardware and software components into a complete customer solution which
is ready for data scientists to use for constructing and training neural network-based models.
1.1 System Architecture and Design Considerations
The solution uses Intel® 2nd Generation Xeon® Scalable processors as the core building block for training
deep neural networks, providing a flexible platform for training networks of various configurations on data
objects of various size. Scalability is the secret to realizing improved time-to-solution with this system, so the
solution is built on top of dense-compute C6420 2-socket compute sleds, which can be fitted in sets of 4 into a
2U C6400 compute chassis (Figure 2).
Figure 2: Dell EMC C6420 Compute Sleds (x4) in C6400 Compute Chassis (Front and Back)
The solution does not rely on 2.5” or 3.5” disk drives for holding the operating system and system containers.
Instead, each compute sled is fitted with a 250GB Boot Optimized Storage Solution (BOSS) M.2 form-factor SSD
within the sled. Each sled contains a 1GbE Ethernet adaptor with combined iDRAC, as well as an Intel® X710
10GbE SFP Ethernet adaptor for internal data movement.
System login and management, as well as initial shared storage, is handled by a single 2-socket Dell EMC
R740xd 2U compute server (Figure 3), outfitted with 12x 12TB HDD in two (2) RAID arrays: (1) a 2-drive,
single-redundancy RAID1 array for the operating system (10TB usable), and (2) a 10-drive RAID5 array for
shared storage, Kubernetes management, and other functions (96TB usable). The login/management node
uses internal RJ45 1GbE connectivity for datacenter/Internet connectivity and a separate RJ45 iDRAC
connection. The management node also contains an Intel® X710 10GbE SFP Ethernet adaptor for
connection to the compute nodes as a means of serving the shared storage array.
Figure 3: Dell EMC R740xd with 12x 3.5" Drive Configuration
1.1.1 Networking Considerations
The choice of 10GbE over other, higher-bandwidth options (including 25/40/100GbE, 100Gbps Intel® Omni-Path Architecture, and 100Gbps Mellanox InfiniBand) was made for the solution to best match the total
anticipated data throughput when training deep neural networks on Intel® Xeon® Scalable processors.
During our in-lab tests on the Zenith supercomputer using both 10GbE and 100Gbps Intel® Omni-path Fabric,
no measurable difference in performance was observed when scaling up to 64 compute nodes (see Appendix
B, Figure 21 and Table 5 for more details).
System components are interconnected using two (2) Dell EMC Open Networking switches. The first top-of-rack
(ToR) switch is a Dell EMC S3048-ON 1U RJ45 switch (Figure 4) which provides management and iDRAC
connectivity between the nodes of the solution, as well as connectivity to the wider data center network and
the Internet. This connection will be used by IT administrators to manage the systems components via
iDRAC, connect to the system remotely, and enable the nodes to download container images from public
container registries, such as Docker Hub.
Figure 4: Dell EMC S3048-ON RJ45 Networking Switch
The second ToR switch is a Dell EMC S4128F-ON 10GbE SFP+ switch (Figure 5) which provides internal
communication over an isolated network for data movement during scale-out training and for loading data
from the shared storage array in the Dell EMC R740xd management node. The switch can also connect an
optional Dell EMC Isilon H600 storage array (see Section 1.1.2) via its two (2) 40GbE QSFP+ ports.
Figure 5: Dell EMC S4128F-ON SFP+ Networking Switch
1.1.2 Isilon H600 Storage Array (Optional)
In situations where a more performant or scalable storage solution than the R740xd master node is desired,
the solution can be optionally connected to a Dell EMC Isilon H600 hybrid flash/HDD storage appliance
(Figure 6). Each Isilon appliance consists of four (4) individual nodes, each capable of providing access to the
entire filesystem. Use of all four nodes allows for system-controlled load balancing of I/O requests across the
array. The Dell EMC Ready Solutions for AI – Deep Learning with Intel provides the ability to connect two (2)
of the Dell EMC Isilon H600 storage appliance nodes to the data network via the two (2) 40GbE QSFP+ ports
on the Dell EMC S4128F-ON switch. This provides sufficient balance between load and bandwidth
capabilities to support data motion for deep neural network training on the solution.
Figure 6: Dell EMC Isilon H600 Hybrid Flash/HDD Storage Appliance
1.1.3 Solution Physical Architecture Layout
Figure 7: Architecture Layout of Deep Learning with Intel solution with optional Isilon
1.2 Nauta Deep Learning Platform
The Dell EMC Ready Solutions for AI – Deep Learning with Intel uses the Nauta deep learning training
platform to coordinate and manage the training of deep neural networks using TensorFlow. Nauta provides
four (4) different modes for enabling this training process:
• Interactive, single-node neural network training using Jupyter notebooks
• Single-node neural network training
• Multi-node neural network training using Distributed TensorFlow
• Multi-node neural network training using Horovod
Nauta also provides capabilities for pre-production testing of trained models via TensorFlow Serving, using both
batch inference (performing inference on a directory of data objects) and streaming inference (a RESTful
web endpoint to which individual data objects can be submitted for inference).
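The multi-node Horovod mode listed above keeps workers in sync by averaging their gradients with an allreduce after each batch. A minimal pure-Python simulation of that averaging step (an illustration only, not Horovod's actual ring-allreduce implementation) might look like:

```python
# Simulate Horovod-style data parallelism: each worker computes a
# gradient on its own data shard, then an allreduce averages the
# gradients so every worker applies the identical weight update.

def allreduce_mean(gradients):
    """Average a list of per-worker gradient vectors element-wise."""
    n_workers = len(gradients)
    length = len(gradients[0])
    return [sum(g[i] for g in gradients) / n_workers for i in range(length)]

# Hypothetical per-worker gradients for a 3-weight model on 4 workers.
worker_grads = [
    [0.10, 0.20, 0.30],
    [0.30, 0.00, 0.10],
    [0.20, 0.40, 0.20],
    [0.40, 0.20, 0.60],
]

avg = allreduce_mean(worker_grads)
print(avg)  # every worker applies this same averaged gradient
```

Because every worker ends each step with the same averaged gradient, the model replicas never diverge, which is what makes this scheme scale out cleanly across nodes.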
Figure 8: Structure of the Nauta deep learning platform. Nauta uses Kubernetes for container orchestration, and Helm-based template packs for scripted job submissions (https://www.intel.ai/introducing-nauta).
1.3 Dell EMC Software Additions
Dell EMC has provided additional software tools, utilities, and templates which provide users of the Dell EMC
Ready Solutions for AI – Deep Learning with Intel with a more seamless access experience. Dell EMC’s
software additions include:
• Administrator scripts to automate the creation of joint Unix/Nauta user accounts and set appropriate
permissions allowing users access to their Nauta input/output directories
• Remote Desktop Protocol (RDP) enabled on the login node which provides customers with a simpler
path to usage via Windows Remote Desktop Connection client
• Template Yet Another Markup Language (YAML)-based Helm charts for providing individual users
access to optional Isilon storage appliances
2 Containerized Workload Performance
Nauta is a Kubernetes-based deep learning training platform which uses Docker containers for executing all
workloads. Nauta provides a single, prebuilt container which can be used for multiple tasks (Jupyter
notebooks, distributed TensorFlow, TensorFlow with Horovod, and TensorFlow Serving) via specialized script
deployment with Helm.
Containers provide a convenient way of packaging and deploying software, especially software with complex
dependencies such as deep learning frameworks. While software and hardware abstractions have
traditionally meant the sacrifice of performance when compared to executing on “bare metal” systems,
modern Docker containers do not pose a performance penalty.
Our tests have shown that the performance of the container in many cases exceeds the performance of an
equivalent “bare metal” provisioned system. Figure 9 shows the performance of our neural machine
translation use case (see Section 5) using the Deep Learning with Intel solution (“Containerized”) and an
equivalent software environment built on an identical hardware system (see Appendix A for a complete
hardware specification).
Figure 9: Containerized workload performance normalized to the same workload executed on a "bare metal" version of the same platform. The above workload is our neural machine translation use case (see Section 5).
It should be noted that these results may not apply to all workloads. However, our tests on the Deep Learning
with Intel solution show that, generally, containerization of deep learning workloads does not impose a
performance penalty.
[Figure 9 chart data – containerized performance normalized to bare metal (1.00): 118.36%, 110.05%, 100.42%, 103.80%, and 101.71% at 1, 2, 4, 8, and 16 nodes, respectively]
3 Use Case Selection and Considerations
Different neural network architectures exhibit different computational workload properties during the training
process. Because of this, we selected three (3) use cases that employ different neural network architectures
on which to evaluate the performance of the Dell EMC Ready Solutions for AI – Deep Learning with Intel.
The computational and workload diversity of these use cases was chosen to
highlight the flexibility of the solution for applications across different customer segments and problem types.
3.1 Image Classification Using Convolutional Neural Networks
Convolutional neural networks for image classification and object detection have been the primary source of
news reports related to AI and deep learning in recent years. We believe that this is the area of investigation
that most customers will begin with when beginning the deep learning phase of their AI journey, and as such
we have built an extensive use case in the health care delivery market segment around image classification.
The type of neural network used for these classification tasks – known as convolutional neural networks, or
CNNs – have a very specific computational profile. These networks consist completely of dense tensor
encodings, which lend themselves to fast, efficient computation on single-instruction, multiple-data (SIMD)
execution units such as the AVX512 vector units within the Intel® Xeon® Scalable processor family.
3.2 Language Translation Using Multi-Head Attention Networks
Multi-head attention networks are – as of the time of this writing – one of the newest trends in neural machine
translation (NMT) research. Current state-of-the-art models in production at Google are built upon this network
type and have shown significantly higher translation quality when compared to more traditional statistical
methods. While cloud service providers and other AI service providers are offering access to pre-built and
pre-trained translation models, we believe that advanced human/computer interaction through voice and
language translation will be a critical component of many companies’ core business innovation, permeating all
aspects from customer service to inter-company communication. When processes become critical to one’s
business, it becomes increasingly risky to offload the work to a third party where access is more difficult to
control and learning from many users is monetized to everyone.
These neural networks tend to have sparser encodings than CNNs, and therefore do not as efficiently use
SIMD instruction sets. The ability to easily traverse the memory address space is therefore more critical to
training these types of neural networks than with CNNs.
3.3 Product Recommendations Using Restricted Boltzmann Machines
Online recommendation engines drive viewership, increase purchases, and improve customer satisfaction at
a variety of content and product providers. Much of this work was traditionally performed using collaborative
filtering and singular value decomposition. However, recent advances in neural network architectures,
especially non-feedforward networks like Restricted Boltzmann Machines (RBMs), have produced
tantalizingly high-quality recommendation models.
Unlike traditional feedforward neural networks, RBMs do not accept data on one layer and produce results on
another. Instead, they have a reflective hidden layer which helps to map an incomplete input to the set of
expected values. In the case of product recommendation, placing a rating for a single product on the visible
layer of an RBM will make the other elements of the visible layer transform into the expected rating by that
person for all the other products in the catalog.
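The reflective up-down pass described above can be sketched in a few lines of NumPy. The weights below are random, untrained toy values (a real model would learn W, for example via contrastive divergence), so the output illustrates only the mechanics of reconstruction, not a meaningful recommendation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

n_visible, n_hidden = 5, 3   # 5 products, 3 hidden units (toy sizes)
W = rng.normal(scale=0.1, size=(n_hidden, n_visible))  # untrained toy weights
b_h = np.zeros(n_hidden)
b_v = np.zeros(n_visible)

# Sparse input: the user has rated only product 0 (rating scaled to [0, 1]).
v = np.array([0.8, 0.0, 0.0, 0.0, 0.0])

# One up-down (reconstruction) pass: visible -> hidden -> visible.
h = sigmoid(W @ v + b_h)          # hidden activations
v_recon = sigmoid(W.T @ h + b_v)  # expected ratings for ALL products

print(v_recon)  # every visible unit now holds a predicted rating
```

After the pass, every element of the visible layer carries a predicted rating, which is exactly the behavior the paragraph above describes.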
4 Use Case 1: Pathology Classification in Medical Imagery
Over the last 18 months, Dell EMC’s AI Engineering team has been working with the ChestXray14 dataset to
develop highly accurate models for pathology classification in frontal chest x-rays. Most of that work has been
done in bare-metal HPC environments to show the power and flexibility of scale-out distributed deep
learning. Scale-out clusters like traditional HPC systems are having a dramatic influence on the way neural
networks are trained. Because this model has become one of our go-to use cases for validating the performance
characteristics of our systems and solutions, we are including it in this paper.
This image classification use case is very well suited to the use of convolutional neural networks (CNNs). It is
this category of neural network architecture that is used for everything from identifying everyday items in
pictures, to performing object detection and identification in video, to determining which shop your bowl of
ramen noodles came from. The widespread use of CNNs has ensured that various hardware vendors and
software developers have tuned and optimized their platforms to execute the training and inference of these
architectures as efficiently as possible, resulting in the ResNet-50 CNN topology becoming the de facto
benchmark for comparing the performance of deep learning systems.
The problem that our use case attempts to solve is the identification of one or more thoracic pathologies from
fourteen (14) different diagnoses identified in the ChestXray14 frontal chest x-ray training data (see Figure
10). This is a multi-label/multi-class problem, meaning that we are attempting to categorize more than one
condition, and each image could exhibit multiple pathologies simultaneously. This contrasts with the typical
benchmark use cases, such as ImageNet, where each image can correctly be mapped to only one label, i.e.,
a picture of a sandwich cannot simultaneously be a car.
Figure 10: The objective of this use case is to train a neural network-based model to correctly identify
thoracic pathologies (total of 14) from frontal chest x-rays. We then measure the performance of the Dell EMC
Ready Solutions for AI – Deep Learning with Intel when training this model.
Dell EMC’s AI Engineering team has successfully completed extensive performance and scalability testing on
this model and data set using up to 256 compute nodes, far exceeding the capacity of this solution under test
and enabling significantly improved time to solution. For more information on this model and the work that has
been done so far at Dell EMC, go to hpcatdell.com.
[Figure 10 diagram: a single patient x-ray (Patient A) mapped to multiple condition labels (Conditions A–E)]
4.1 Convolutional Neural Networks (CNNs)
Convolutional neural networks (CNNs) are the architectural workhorse of AI-based image classification.
CNNs process information by breaking the input object (in many cases an image) into small, overlapping
windows which extract various features from the numerical representation of the input window pixels.
Successive layers of the CNN subsequently extract more complex and abstract features. The network
eventually ends with a flattened, fully-connected section which connects to the network outputs (see Figure
11).
Figure 11: A Typical CNN Structure (from https://commons.wikimedia.org/wiki/File:Typical_cnn.png)
CNNs are trained using combinations of dense matrix-vector product calculations (forward pass), and dense
matrix-matrix product calculations (backward pass). This means that CNNs are well suited to hardware
architectures which can quickly execute the same instruction on sequences of contiguous data, such as SIMD
architectures.
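One way to see why these dense products map so well to SIMD hardware is the standard im2col transformation, which rewrites a convolution as a single dense matrix-matrix product. A simplified single-channel NumPy sketch (toy sizes, random data):

```python
import numpy as np

def im2col(image, k):
    """Unroll each k x k window of a 2-D image into one column."""
    H, W = image.shape
    cols = []
    for i in range(H - k + 1):
        for j in range(W - k + 1):
            cols.append(image[i:i + k, j:j + k].ravel())
    return np.array(cols).T          # shape: (k*k, num_windows)

rng = np.random.default_rng(42)
image = rng.random((6, 6))           # toy 6x6 single-channel input
filters = rng.random((4, 9))         # 4 filters, each 3x3, flattened

# The whole convolution collapses into one dense matrix-matrix product,
# exactly the contiguous-data operation SIMD units (e.g. AVX-512) excel at.
cols = im2col(image, 3)              # (9, 16)
feature_maps = filters @ cols        # (4, 16): four flattened 4x4 maps
print(feature_maps.shape)
```

Each column of `cols` is one receptive-field window, so one matrix product evaluates every filter at every position in a single sweep of contiguous memory.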
Common CNNs use a combination of convolutional layers and subsampling layers organized into repeating
blocks. Many modern CNN architectures also include residual connections between the blocks to prevent loss
of critical information due to excessive subsampling. The common CNN architecture used for benchmarking,
ResNet50, is a 50-layer neural network consisting of repeating 3-layer blocks with residual connections. The
architecture contains an input layer, an output layer, and 16 repeating 3-layer blocks, for a total of 50 layers.
Other common CNN topologies can be significantly deeper, some with up to 150 layers or more. The
computational workload necessary to perform both the forward pass and the backward pass increases with
the number of layers, so deeper networks require more computational cycles to solve.
4.2 Model Training
We measured the performance of this image classification training use case in the Deep Learning with Intel
solution, using scale-out data parallelism in steps from 1 to 16 compute nodes. Our tests show steady, near-linear
improvement in throughput – measured in images processed per second – as we scale out to 16
compute nodes, or full solution scale (see Figure 12). All model weights are single-precision floating point
numbers (FP32), as 2nd Generation Intel Xeon Scalable processors do not support half-precision (FP16 or
bfloat16) arithmetic.
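Scaling efficiency can be derived directly from the throughput values reported in Figure 12 (images per second at 1 through 16 nodes):

```python
# Throughputs (images/sec) reported in Figure 12 for 1-16 compute nodes.
nodes = [1, 2, 4, 8, 16]
throughput = [1.47, 2.83, 5.52, 10.85, 20.98]

for n, t in zip(nodes, throughput):
    speedup = t / throughput[0]      # relative to the single-node run
    efficiency = speedup / n         # 100% would be perfectly linear
    print(f"{n:2d} nodes: speedup {speedup:5.2f}x, efficiency {efficiency:6.1%}")
```

At 16 nodes the speedup works out to roughly 14.3x, or about 89% parallel efficiency, which is the "near-linear" behavior described above.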
We also trained this use case using up to 16 nodes on the Zenith supercomputer with 10 Gigabit Ethernet
connections and a full 256 nodes using 100 Gigabit Intel Omni-path (OPA) fabric (see Appendix B, Figure 21).
Our tests demonstrate continued near-linear scaling and significant throughput and time-to-solution
improvement as we scaled out the training. Our best tests demonstrate that, with 256 compute nodes, this
use case could be trained to full accuracy in under 12 minutes (see Appendix B, Figure 23).
Figure 12: CNN (ResNet50) Throughput Performance for Medical Imaging Use Case using Dell EMC Ready Solutions for AI - Deep Learning with Intel. Tests were performed using distributed TensorFlow with Horovod
with 4 MPI processes per node.
Figure 13: Scaled speedup of image classification use case training vs ideal.
While these tests are for a specific use case, the results presented would apply to any form of image
classification problem using images of equivalent size – in this case 256 x 256 images in 3-color RGB.
Performance of the solution would be different for topologies of greater depth, or with images of larger
dimensions.
The tests performed for this use case are only image classification. This use case does not investigate other
image-based deep learning capabilities, such as segmentation, detection, or depth analysis. However, these
scenarios use similar neural network topologies to image classification, and as such can be expected to
exhibit similar – if not identical – performance characteristics.
[Figure 12 chart data – images per second: 1.47 on 1 node, 2.83 on 2 nodes, 5.52 on 4 nodes, 10.85 on 8 nodes, 20.98 on 16 nodes]
[Figure 13 chart data – scaled speedup of classification training vs. ideal at 1, 2, 4, 8, and 16 nodes]
5 Use Case 2: Machine Language Translation
Another area where neural networks have transformed the state of the art is machine translation of
human language. While basic dictionary systems and statistical natural language processing (NLP)
techniques were able to build quality translators for simple words and phrases, the ability to generate high
quality translations of entire sentences or documents has only occurred since the adoption of specialized
neural networks for these tasks.
As of the time of this writing, the highest quality language translation models are produced using the multi-
head attention-based models which were first developed and popularized by Google. These neural network
architectures are fundamentally different in both structure and computation from CNNs, and as such provide a
different proving ground for measuring the performance of the system.
Unlike CNNs, approaches such as the transformer model share the weight matrix between the embedding
layer and the linear transformation prior to the softmax layer (see "Positional Encoding" in Figure 15). They must
also ensure that the gradients from these two layers are updated appropriately without causing performance
degradation or out-of-memory (OOM) errors.
When we began evaluating Google’s official transformer model, we discovered issues related to OOM errors
caused by assumed sparsity of the positional encoding layer. This caused excessive memory use and limited
scale for training the transformer models. Once the issue was identified and corrected (see Densifying
Assumed-sparse Tensors), scalable training of transformer models, including on the Deep Learning with Intel
solution, became possible.
This use case has many applications, from automated customer service systems which can handle multiple
languages, to website creation and hosting, to legal document translation. The unique computational
requirements of these types of neural networks also made this an ideal case for testing the solution’s flexibility
in handling many types of deep learning problems.
5.1 Multi-head Attention-based Neural Networks
The transformer architecture is built with variants of the attention mechanism in the encoder-decoder part (see
Figure 14), eliminating the need for traditional Recurrent Neural Networks (RNNs) in the architecture. This
architecture achieves state-of-the-art results on English-German and English-French translation tasks when
compared to RNNs based on Long Short-Term Memory (LSTM) neurons.
Figure 14: Encoder-decoder architecture of language translation models.
Figure 15 illustrates the multi-head attention block used in the transformer model. At a high level, scaled
dot-product attention can be imagined as finding the relevant values (V) based on queries (Q) and keys (K),
and multi-head attention can be thought of as several attention layers running in parallel, each capturing a
distinct aspect of the input.
Figure 15: Full Architecture of Multi-head Attention-based Transformer Neural Network
This scaled dot-product operation is more computationally efficient than LSTM neurons, enabling larger
networks and permitting wider windows of subword tokens – and their associated context – to be processed.
The ability to process wider windows means higher translation quality.
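The scaled dot-product attention described above reduces to a few matrix operations. A simplified single-head NumPy sketch with toy dimensions (the production transformer adds learned projections, masking, and multiple heads):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stabilized
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of each key to each query
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted mix of values

rng = np.random.default_rng(1)
seq_len, d_k = 4, 8                      # toy sequence of 4 tokens
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out = scaled_dot_product_attention(Q, K, V)
print(out.shape)                         # one context vector per token
```

Because the core of the computation is two dense matrix products rather than a sequential LSTM recurrence, every position in the window can be processed at once, which is the efficiency advantage noted above.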
5.2 Model Training Performance

We evaluated the Deep Learning with Intel solution using Google’s transformer architecture to train a model for translating English to German, using the WMT English/German translation corpus. This corpus contains 4.5 million sentence pairs, each consisting of an English sentence followed by its German translation.
Training time for transformer architectures is incredibly long without parallelization or acceleration. Our single
node, single process experiments would have required approximately 31 days (749 hours) to train to a high-
quality solution (BLEU score ≥ 27.5). This approximation was calculated based on the total throughput in
subword tokens per second and the expected number of subword tokens required in subsequent, parallelized
runs that were needed to achieve the desired solution quality.
Our performance tests on the solution demonstrated near-linear scaling from 1 to 16 nodes, using 4
processes per node (64-way parallelism) when training transformer architectures with the WMT
English/German corpus. When compared to a single node, single process training run, the 64-way run (16
nodes, 4 processes per node) was able to reduce the time to solution from 749 hours to 50 hours, which is a
15x improvement in time to solution.
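As a quick sanity check, the speedup and node-level scaling efficiency follow directly from the reported times (all numbers below are taken from the measurements above):

```python
serial_hours = 749.0    # single node, single process estimate
parallel_hours = 50.0   # 16 nodes x 4 processes (64-way) run
nodes = 16

speedup = serial_hours / parallel_hours
# Ideal speedup relative to the single-node baseline is 16x,
# so efficiency is measured against the node count.
node_efficiency = speedup / nodes

print(f"{speedup:.1f}x speedup, {node_efficiency:.0%} node-level efficiency")
# -> 15.0x speedup, 94% node-level efficiency
```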
Figure 16: Performance (subword tokens per second) of multi-head attention-based Transformer model scaled to full solution size. Performance on the Dell EMC Ready Solutions for AI - Deep Learning with Intel is
near-linear using distributed TensorFlow with Horovod for multi-node training.
Figure 17: Scaled speedup of machine translation use case when compared to ideal.
Subsequent tests performed on the Zenith supercomputer in the Dell EMC HPC and AI Innovation Lab – which is built on the same C6420 compute blocks as the Deep Learning with Intel solution – show that additional scaling of this use case continues to improve time to solution. When using 200 compute nodes (50 C6400 chassis), we trained a high-quality model in just over 6 hours (see Appendix B, Figure 24).
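For context on why this scales, Horovod-style data parallelism averages each worker’s gradients with an allreduce after every training step; near-linear scaling means this communication stays cheap relative to computation. A minimal NumPy sketch of the averaging operation (worker count and gradient values here are illustrative, and Horovod’s real implementation uses an MPI-based ring allreduce rather than this single-process stand-in):

```python
import numpy as np

def allreduce_average(worker_grads):
    """Average one gradient tensor across workers, as Horovod's
    allreduce does after each backward pass."""
    return np.mean(worker_grads, axis=0)

# Four workers, each holding a local gradient for the same 3 parameters.
grads = np.array([
    [0.2, -0.1, 0.4],
    [0.4, -0.3, 0.0],
    [0.0,  0.1, 0.4],
    [0.2, -0.1, 0.0],
])
avg = allreduce_average(grads)  # every worker then applies the same update
```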
There are many other language use cases solved with neural networks, from voice-to-text and text-to-voice to sentiment analysis. Text-to-text translation is one of the newest of these use cases and, as such, uses one of the more complex topologies. Other use cases employ different neural network topologies, and customers should expect different performance characteristics from those workloads than from text-to-text translation.
Chart data for Figure 16 – transformer throughput on the solution:

Compute nodes   Subword tokens per second
1               1,334.84
2               2,408.44
4               4,346.09
8               8,825.18
16              16,984.83
6 Use Case 3: Ratings-based Product Recommendation

Online recommendation engines drive viewership, increase purchases, and improve customer satisfaction for a variety of content and product providers. Much of this work was traditionally performed using collaborative filtering and singular value decomposition. However, recent advances in neural network architectures, especially non-feedforward networks, have produced tantalizingly high-quality recommendation models.
Restricted Boltzmann Machines (RBMs) do not work like traditional feedforward neural networks. Instead of presenting data on an input layer and receiving an answer on an output layer, the input and output appear on the same layer: when a value is set for a single visible neuron, the remaining neurons in the visible layer take on their expected values according to the mapping encoded in the hidden layer.
RBMs provide a means of building cross-element mappings, where the value of one input neuron affects the values of the other neurons. In the case of product or movie recommendation, these mappings represent the likelihoods that – given one rating for a product or a movie – a given user would highly or poorly rate the other products or movies for which the network has been trained.
6.1 Restricted Boltzmann Machines (RBMs)

Restricted Boltzmann Machines (RBMs) belong to a special category of neural networks called generative stochastic artificial neural networks, which also includes general Boltzmann machines, capsule networks, and deep belief networks. These networks are used to learn unspecified probability distributions over a set of inputs.
In our case, the probability distribution that we want the RBM to learn is the expected rating that a customer
would give to a set of products, based on a provided rating for one (or more) products. For this particular use
case, we used the MovieLens data set to train an RBM to provide five-star expected movie ratings for any
customer, given the customer provides a real rating for any one of the 58,000 movies in the dataset. This is
done by building an RBM which contains 58,000 visible neurons and 100 hidden neurons (see Figure 18),
where each hidden neuron can be thought to encode a particular characteristic or feature shared across
movie titles (e.g., film type, lead actor, director, etc.).
Figure 18: Illustration of a Restricted Boltzmann Machine for five-star ratings. This network builds ratings for M products using F characterizations.
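To make the sampling mechanics concrete, here is a minimal single-process sketch of one contrastive-divergence (CD-1) Gibbs step for a binary RBM. The dimensions are toy stand-ins (a full ratings model would use groups of five visible units per movie across the roughly 58,000 titles, plus the 100 hidden units described above), and this does not reproduce the parallel MCMC training approach discussed in this paper:

```python
import numpy as np

rng = np.random.default_rng(42)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions; a real ratings model is far larger.
n_visible, n_hidden = 20, 4

W = rng.normal(scale=0.01, size=(n_visible, n_hidden))  # weights
b_v = np.zeros(n_visible)                               # visible biases
b_h = np.zeros(n_hidden)                                # hidden biases

def gibbs_step(v):
    """One visible -> hidden -> visible sampling pass."""
    p_h = sigmoid(v @ W + b_h)               # hidden activation probabilities
    h = (rng.random(p_h.shape) < p_h) * 1.0  # sample binary hidden states
    p_v = sigmoid(h @ W.T + b_v)             # visible reconstruction probs
    return p_v, p_h

# A batch of 3 sparse binary "rating" vectors.
v0 = (rng.random((3, n_visible)) < 0.2) * 1.0
v1, h0 = gibbs_step(v0)

# CD-1 update: move weights toward data-driven correlations and away
# from reconstruction-driven correlations.
lr = 0.1
h1 = sigmoid(v1 @ W + b_h)
W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
```

The stochastic hidden-state sampling in `gibbs_step` is what makes this training process hard to parallelize naively.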
RBMs are difficult to train, and few, if any, approaches have been proposed for parallelizing the training process for RBMs, due to the stochastic nature of the sampling and training process. Dell EMC’s AI Engineering team has developed a novel approach to training RBMs in parallel using a Markov Chain Monte Carlo (MCMC) approach with Gibbs sampling. We will publish the details of the approach, along with additional performance data, soon. In the meantime, all of the source code is available on the Dell EMC HPC & AI Engineering GitHub page.
6.2 Model Training Performance

When training RBMs for user ratings and recommendations on the Deep Learning with Intel solution, we see near-linear scaling from 1 to 16 compute nodes (see Figure 19 and Figure 20), just as we do with the other two use cases in image classification and language translation.
Figure 19: Performance of RBM training on Dell EMC Ready Solutions for AI - Deep Learning with Intel. Weak scaling results with 32 records per process, 4 processes per node. Distributed training using TensorFlow with
Horovod.
Chart data for Figure 19 – RBM training throughput on the solution:

Compute nodes   Records per second
1               10.18
2               19.61
4               40.23
8               78.90
16              155.18
Figure 20: Scaled speedup of RBM training compared to ideal.
In addition to scaling tests up to 16 nodes performed on the solution, we also performed tests with greater
scale on the Zenith supercomputer, showing that we can indeed realize improved time to solution out to 64
nodes (see Appendix B, Figure 25). As with the other use cases, we see potential for continued scaling
performance and improved time to solution as customers continue to expand the Deep Learning with Intel
solution from 16 compute nodes to greater scale, including potentially out to hundreds of compute nodes.
7 Conclusions

Dell EMC’s Ready Solutions for AI – Deep Learning with Intel is a flexible deep learning training platform designed for future expansion and scale-out as the customer’s computational demands increase. It has demonstrated flexibility across multiple use cases that span market segments and computational workload profiles – ranging from dense convolutional network training, to sparser language translation model training, to emerging non-feedforward training of Restricted Boltzmann Machines (RBMs).
The solution is fully containerized, utilizing Docker containers for all user-facing software and Kubernetes with Nauta automation for container orchestration. We have demonstrated that – in addition to providing greater flexibility for the data scientist training models – the use of containers does not adversely affect the performance of the solution. In fact, in some cases customers can expect better performance from containerized workloads on the solution than from the same hardware deployed in a bare-metal configuration.
The solution uses the Nauta control program to execute TensorFlow training jobs, which can take advantage of Horovod’s distributed training automation to realize scale-out performance increases and improve time to solution. Our tests demonstrate that the solution achieves excellent – near-linear – scaling across the entire range of use case tests. This means customers who wish to expand the solution beyond the base 16 compute nodes can continue to expect improved performance and time to solution as they scale out.
A Configuration Details
Table 1: PowerEdge R740xd hardware components
Component Quantity Description
Processor 2 Intel Xeon Scalable Gold 6230
Memory 12 32 GB 2933 MHz DDR4
Storage 12 12 TB 7,200 RPM HDD
Network adapter 1 Intel X710 10 Gb SFP+ Ethernet adapter
Table 2: RAID configuration for the PowerEdge R740xd master node
Virtual disk Configuration RAID Mounted as
VD0 2 x 12 TB HDD (11 TB usable) RAID-1 /(root)
VD1 10 x 12 TB HDD (97 TB usable) RAID-5 /data (NFS and ETCD)
Table 3: PowerEdge C6420 hardware components
Component Quantity Description
Processor 2 Intel Xeon Scalable Gold 6230
Memory 12 16 GB 2933 MHz DDR4
Storage 1 250 GB M.2 boot-optimized SSD
Network adapter 1 Intel X710 10 Gb SFP+ Ethernet adapter
Table 4: Software components
Component Description
BIOS Dell BIOS 2.1.8
Operating system Red Hat Enterprise Linux (RHEL) 7.6
Docker Community Edition 18.06
Kubernetes 1.10.11
Helm 2.9.1
Nauta 1.0 Enterprise Support
Remote Desktop Protocol (XRDP) 0.9.9-1
B Additional Performance Data
Figure 21: Time to train performance comparison using Intel 100Gb OPA and Intel 10Gb Ethernet on Zenith. Processors are Intel Xeon Scalable Gold 6148.
Table 5: Time to train per epoch (in seconds) using 100 Gigabit Intel Omni-Path and 10 Gigabit Intel X710 Ethernet. Intel Xeon Scalable Gold 6148 processors.

Nodes Intel Omni-Path Architecture (100 Gb) Intel X710 (10 Gigabit Ethernet)
1 9987 10135
2 6425 6328
4 4232 4230
8 2938 2983
16 2308 2236
Figure 22: Throughput performance of image classification training on various topologies on Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on NIH ChestXray14
dataset.
Figure 23: Time-to-solution for image classification training on various topologies on the Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on NIH ChestXray14 dataset.
Chart data for Figure 22 – images per second by configuration:

DenseNet121, P=1, BZ=8               4
DenseNet121, P=64, BZ=64, GBZ=4096   186
VGG16, P=128, GBZ=8192               1,170
ResNet50, P=512, GBZ=8192            5,851
ResNet50, P=800, GBZ=8000            6,750
ResNet50, P=1024, GBZ=8192           8,252

Chart data for Figure 23 – time to solution (seconds) by configuration:

DenseNet121, P=1, BZ=8               386,845
DenseNet121, P=64, BZ=64, GBZ=4096   8,319
VGG16, P=128, GBZ=8192               16,532
ResNet50, P=512, GBZ=4096            1,742
ResNet50, P=512, GBZ=8192            1,362
ResNet50, P=800, GBZ=8000            825
ResNet50, P=1024, GBZ=8192           675
Figure 24: Time to solution of neural machine translation use case on Deep Learning with Intel solution and Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on WMT
English/German corpus.
Figure 25: Time to solution for Restricted Boltzmann Machine (RBM) training on Zenith supercomputer. Tests performed using TensorFlow with Horovod. Model trained on MovieLens dataset.
Chart data for Figure 24 – time to solution in hours (BLEU ≥ 27.5); runs at 1 and 16 nodes on the Deep Learning with Intel solution, runs at 32–200 nodes on Zenith:

Nodes   Hours
1       749.06
16      50.01
32      29.46
64      14.37
128     7.75
200     6.17

Chart data for Figure 25 – RBM time to solution on Zenith:

Nodes   Seconds
1       20,009
2       10,244
4       5,129
8       2,581
16      1,306
32      670
64      346
C Related resources
Dell EMC AI Engineering GitHub
Dell EMC AI Engineering Blogs
ChestXray14 Dataset
ImageNet 2012 Dataset
Google Tensor2Tensor Library
WMT16 Language Translation Datasets
GroupLens Public MovieLens Dataset
Attention is All You Need
Densifying Assumed-sparse Tensors