VelocityAI REFERENCE ARCHITECTURE WHITE PAPER
JULY // 2018
In partnership with
Contents

Introduction
Challenges with Existing AI/ML/DL Solutions
Accelerate AI/ML/DL Workloads with Vexata VelocityAI
VelocityAI Reference Architecture
VelocityAI Network Configuration
DGX-1 NFS Client Configuration
VelocityAI Network Configuration – File Heads
VelocityAI Network Configuration – Storage
Imagenet Benchmark Configuration
VelocityAI Filesystem Configuration
VelocityAI Imagenet Benchmark Tests
Imagenet Benchmark Results for 150 KB and 1 MB File Sizes
Storage Analytics for Small and Large File Sizes
Conclusion
Introduction
ENABLING PREDICTIVE AND COGNITIVE ANALYTICS

Machine Learning (ML) and Deep Learning (DL) workloads are increasing in volume and complexity as organizations look to reduce training and operational timelines for artificial intelligence (AI) use cases. This has given rise to massively parallel GPU servers such as the NVIDIA DGX-1, which deliver the compute power needed to run these machine learning frameworks.
IDC* predicts that by 2019, 40% of digital transformation initiatives will use AI services; by 2021, 75% of commercial enterprise apps will use AI, over 90% of consumers will interact with customer support bots, and over 50% of new industrial robots will leverage AI.
To accelerate training and operational cycles, the storage systems that power these AI/ML/DL pipelines must deliver ultra-low latency, massive ingest bandwidth, and heavy mixed random and sequential read/write handling. Architectures built on direct-attached storage (DAS) limit performance and data mobility, while existing all-flash arrays lack the sustained performance to deliver timely insights at scale.
NVIDIA and Vexata have teamed up to deliver an industry best-of-breed solution for customers moving to predictive, prescriptive, and cognitive analytics. The joint solution comprises two or four NVIDIA DGX-1 servers with the Vexata VX-100FS file storage system. The VX-100FS is pre-configured and tuned, deploys seamlessly in existing or new NFS environments, and can be configured with two or four Network Attached Storage (NAS) file heads plus the accelerated storage system.
This paper discusses the joint solution and its reference architecture. The objective of this document is to outline the configuration and deployment details of the Vexata file storage system, and to capture performance benchmarks from a series of synthetic workloads.
This joint solution will be offered directly to end customers or through qualified business partners.
Challenges with Existing AI/ML/DL Solutions
Existing AI/ML/DL solutions are based on sharding across direct-attached storage architectures, where data locality is a constraint:

• Poor utilization of expensive GPU cycles, because storage I/O is not fast enough to keep the GPUs fed
• Slower model training and inferencing elapsed time translates into lost business opportunity
• All pipeline stages need massively parallel performance to keep up with GPU parallelism
• Deployments based on direct-attached storage, sharding, and moving compute closer to data add complexity
• DAS architectures force data to be staged before computing
• Compute and storage cannot be scaled independently
• Three-way replication-based protection leads to poor storage efficiency
* Source IDC Technology Spotlight - Accelerate and Operationalize AI Deployments Using AI Optimized Infrastructure
Accelerate AI/ML/DL Workloads with Vexata VelocityAI
The Vexata VX-100FS, with its transformative VX-OS, is purpose-built to overcome these machine learning challenges:
Reduce training and inferencing time from days to hours, improving data scientist productivity
• Accelerated data path with deterministic low-latency performance for better GPU utilization
• Faster storage eliminates data locality concerns

Access large training and inferencing data-sets
• Accelerated non-blocking access to NVMe media for large data ingest with low-latency I/O performance

Consolidate and eliminate movement between data pipeline stages
• Shared storage handles all data pipeline stages without performance degradation
• Simultaneously supports small-block random I/O, large-block sequential I/O, and mixed read/write I/O
• In-place data analytics with flexibility of ingest protocols (FC, NVMe-oF, NFS, SMB, S3)

Storage security, protection, and efficiency
• RAID5/6 protection eliminates three copies; compression and always-on encryption
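The efficiency claim above can be illustrated with simple arithmetic: three-way replication stores every byte three times, while RAID6 sacrifices only two parity drives per group. The sketch below is a rough comparison only; the 16-drive group is an illustrative example, not the VX-100FS's actual RAID geometry.

```python
# Rough usable-capacity comparison: 3-way replication vs RAID6.
# Drive counts here are illustrative, not Vexata specifications.

def usable_fraction_replication(copies: int) -> float:
    """Usable fraction of raw capacity with N-way replication."""
    return 1.0 / copies

def usable_fraction_raid6(drives: int) -> float:
    """Usable fraction of raw capacity for RAID6 (2 parity drives per group)."""
    return (drives - 2) / drives

print(f"3-way replication:    {usable_fraction_replication(3):.0%} usable")
print(f"RAID6 over 16 drives: {usable_fraction_raid6(16):.1%} usable")
```

Under these assumptions RAID6 yields well over twice the usable capacity per raw terabyte compared with three-way replication.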
[Figure: AI data pipeline. Stage 1 (sensors, financial systems): ingest of many data types, small and large files, at massive bandwidth. Stage 2 (GPUs): SPARK-based ETL to decode, augment, tensor, and label. Stage 3 (GPUs): build models using neural nets. Stage 4 (GPU cluster): train, infer, predict, and iterate on large data-sets, delivering predictive, prescriptive, and cognitive analytics.]

Use Cases
• Fraud analytics, quant trading
• SAS Analytics, kdb+
• Computer vision
• Speech recognition
• Hyper-spectrometry
• Autonomous vehicles
• Biomedical cancer detection
VelocityAI Reference Architecture

Compute
• Four DGX-1 systems (8 Tesla V100 GPUs and 2x Intel E5-2698 v4 each)
• 4x 100 GbE DGX IB ports configured to run 100 GbE Ethernet
• 4 PFLOPS of deep learning performance
• Container-based NVIDIA GPU Cloud deep learning stack with machine learning frameworks
Networking
• Mellanox SN2700 100 GbE x 32 switch (2 switches)
Storage
• Vexata VX-100FS NVMe-oF scale-out storage system
• 430 TB of fast file tier
• 50 GB/s of bandwidth
• Scale: add DGXs, add head nodes, add arrays
[Figure: Network topology. The four DGX-1 systems (DGX-A through DGX-D) connect through two Mellanox SN2700 switches (SW1, SW2) to the two controllers (C1, C2) of the Vexata VX-100FS, with 8x 100 GbE links on each switch path.]

VEXATA VX-100FS FILE STORAGE SYSTEM

Accelerated storage node
Brand: Vexata
Model: VX-100F
OS: Vexata OS release 3.5.0
Usable capacity: 430 TB
Storage modules (ESM): 16

4 file heads
Processor: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Sockets: 2
Cores per socket: 18
Threads per core: 2
Memory: 512 GB
OS: CentOS v7.4
VelocityAI Network Configuration
DGX-1 NFS CLIENT CONFIGURATION
For load balancing using DNS round robin, configure the following records on the DNS server:

vx-nfs IN A <100GbE interface IP of node1>
vx-nfs IN A <100GbE interface IP of node2>
vx-nfs IN A <100GbE interface IP of node3>
vx-nfs IN A <100GbE interface IP of node4>
Nvidia DGX-1 NFS Client Access
The file system is accessible to the DGX-1 through a single mount point, and the DGX-1 clients can do a simple NFS mount. For example:

mount.nfs -o rw,tcp,hard,intr,rsize=32768,wsize=32768,retry=10000,timeo=600,retrans=5,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,nfsvers=3,sloppy vx-nfs:/tmp/nfs1 /tmp/n1
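To make the mount persistent across reboots, an equivalent /etc/fstab entry can be used. This is a sketch based on the mount options shown above; the export path and mount point are the example values from the text, and should be adjusted per site.

```
# /etc/fstab entry (single line), mirroring the example mount command above:
vx-nfs:/tmp/nfs1  /tmp/n1  nfs  rw,tcp,hard,intr,rsize=32768,wsize=32768,retry=10000,timeo=600,retrans=5,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,nfsvers=3,sloppy  0  0
```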
VelocityAI NETWORK CONFIGURATION – FILE HEADS
The Vexata file storage system has four pre-configured NAS head nodes. These nodes require 16 Ethernet connections and 12 IPs: four IPs for management, four IPs for IPMI, and four client-facing IPs (two bonded connections per NSD):
NODE        ROLE        ETHERNET PORTS  SPEED    IP ADDRESS  NETMASK  GATEWAY
NAS Head 1  Management  1               Auto     ...         ...      ...
NAS Head 1  IPMI        1               Auto     ...         ...      ...
NAS Head 1  NFS         2               100 GbE  ...         ...      ...
NAS Head 2  Management  1               Auto     ...         ...      ...
NAS Head 2  IPMI        1               Auto     ...         ...      ...
NAS Head 2  NFS         2               100 GbE  ...         ...      ...
NAS Head 3  Management  1               Auto     ...         ...      ...
NAS Head 3  IPMI        1               Auto     ...         ...      ...
NAS Head 3  NFS         2               100 GbE  ...         ...      ...
NAS Head 4  Management  1               Auto     ...         ...      ...
NAS Head 4  IPMI        1               Auto     ...         ...      ...
NAS Head 4  NFS         2               100 GbE  ...         ...      ...
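The connection and IP totals quoted above follow directly from the per-head requirements in the table. A trivial arithmetic check, with counts taken from the table:

```python
# Per NAS head (from the table): 1 management port, 1 IPMI port, 2 bonded NFS ports.
heads = 4
ports_per_head = 1 + 1 + 2  # management + IPMI + two bonded NFS ports
ips_per_head = 1 + 1 + 1    # management IP + IPMI IP + one client-facing IP per bonded pair

total_ports = heads * ports_per_head
total_ips = heads * ips_per_head
print(total_ports)  # 16 Ethernet connections
print(total_ips)    # 12 IP addresses
```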
VelocityAI NETWORK CONFIGURATION - STORAGE
The storage node requires six Ethernet connections (four management ports, two IPMI ports) and five IP addresses. All IP addresses can be on the same subnet and can be either static or DHCP-assigned.
INTERFACE           ROLE          ETHERNET PORTS  SPEED  IP ADDRESS  NETMASK  GATEWAY
Primary Management  Virtual IP    0               Auto   ...         ...      ...
Management          Controller 0  1               Auto   ...         ...      ...
Management          Controller 1  1               Auto   ...         ...      ...
IPMI                Controller 0  1               Auto   ...         ...      ...
IPMI                Controller 1  1               Auto   ...         ...      ...
IMAGENET BENCHMARK CONFIGURATION
• VelocityAI bandwidth is equally divided between training/inferencing and ingest/ETL/build
• ImageNet pre-trained models are used for the benchmark; AlexNet is used because it is storage-I/O heavy
• Inception V3, ResNet-50, ResNet-152, AlexNet, VGG16 container images
• Supervised learning with 1.28M labelled images across 1,000 categories as the dataset
• Standard docker image: nvcr.io/nvidia/tensorflow:18.04-py2
• batch_size = 64
VelocityAI FILESYSTEM CONFIGURATION
The following commands show:

• The cluster configuration, with the pagepool and the name of the configured filesystem
• The 16 volumes assigned to the 4 file heads
• The state of the 4 file heads

[Command output not reproduced here.]

The following command shows the filesystem attributes, with the associated inodes and the block size.

[Command output not reproduced here.]
VelocityAI Imagenet Benchmark Tests
In filesystem performance testing, the 143 GB ImageNet dataset took 165 seconds on average to load into memory.
IMAGENET BENCHMARK RESULTS FOR 150 KB AND 1 MB FILE SIZES
TensorFlow benchmarks were run against ImageNet, a large visual database designed for visual object recognition research. The database comprises 1.28M labelled images for supervised learning. Pre-trained convolutional network models were used with the ImageNet dataset to measure storage I/O performance and its ability to keep the GPUs fed during the training phase. AlexNet was used because of its ability to exercise the storage I/O stack. Performance is measured in images/sec.
To emulate a real-world use case comprising ingest, ETL, modeling, training, and inferencing, only half of the available bandwidth of the storage system is used for the training phase. This leaves the remaining bandwidth for the other phases, making the DGX cluster a real-world deep learning solution: ingest, ETL, modeling, training, and inferencing can all run on the same system. This is a unique advantage of VelocityAI, enabled by its transformative VX-OS.
Testing was also conducted on small images (150 KB), to emulate real-world sensor data, in addition to the large images (1 MB). VX-OS again uniquely delivers the same bandwidth whether the workload is small-block or large-block I/O, or mixed read/write I/O running concurrently, across all deep learning phases.
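The images/sec figures in the results are consistent with simple bandwidth arithmetic: divide the bandwidth allocated to training by the image file size. A minimal sketch, using the 1-DGX configuration's 6.25 GB/s training allocation from the results table:

```python
def images_per_sec(train_bandwidth_bytes: float, file_size_bytes: float) -> float:
    """Approximate images/sec when the training phase is storage-bandwidth limited."""
    return train_bandwidth_bytes / file_size_bytes

bw = 6.25e9  # 6.25 GB/s allocated to training/inference in the 1-DGX configuration

small = round(images_per_sec(bw, 150e3))  # 150 KB images
large = round(images_per_sec(bw, 1e6))    # 1 MB images
print(small)  # ~41,667, matching the ~41K in the results table
print(large)  # 6,250, matching the 6.25K in the results table
```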
Test configuration:

• Bandwidth equally divided between training/inferencing and ingest/ETL/build
• ImageNet pre-trained model; AlexNet used because it is storage-I/O heavy
• Inception V3, ResNet-50, ResNet-152, AlexNet, VGG16 container images
• Supervised learning: 1.28M labelled images, 1,000 categories
• Standard docker image: nvcr.io/nvidia/tensorflow:18.04-py2
• batch_size = 64
• Horovod 0.11.3
• VelocityAI: 1 DGX server, 2 file heads, 4 storage blades
• VelocityAI: 2 DGX servers, 2 file heads, 8 storage blades
• VelocityAI: 4 DGX servers, 4 file heads, 16 storage blades
VEXATA NVIDIA SOLUTION: 1 DGX SERVER, 4 BLADES, 2 HEADS
File Size  Available B/W (training/inference)  Images/sec  Remaining B/W
150 KB     6.25 GB/s                           41K         6.25 GB/s
1 MB       6.25 GB/s                           6.25K       6.25 GB/s

VEXATA NVIDIA SOLUTION: 2 DGX SERVERS, 8 BLADES, 2 HEADS
File Size  Available B/W (training/inference)  Images/sec  Remaining B/W
150 KB     12.5 GB/s                           83K         12.5 GB/s
1 MB       12.5 GB/s                           12.5K       12.5 GB/s

VEXATA NVIDIA SOLUTION: 4 DGX SERVERS, 16 BLADES, 4 HEADS
File Size  Available B/W (training/inference)  Images/sec  Remaining B/W
150 KB     25 GB/s                             166K        25 GB/s
1 MB       25 GB/s                             25K         25 GB/s
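The three configurations scale roughly linearly: each step doubles the blade count, and both the training bandwidth and images/sec double with it. A quick consistency check over the 150 KB rows, with values copied from the results above:

```python
# (configuration, training B/W in GB/s, images/sec at 150 KB) from the results.
results = [
    ("1 DGX / 4 blades",   6.25,  41_000),
    ("2 DGX / 8 blades",  12.5,   83_000),
    ("4 DGX / 16 blades", 25.0,  166_000),
]

base_bw, base_imgs = results[0][1], results[0][2]
for name, bw, imgs in results:
    # Bandwidth doubles per step; images/sec should scale by the same factor.
    print(name, bw / base_bw, round(imgs / base_imgs, 2))
```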
STORAGE ANALYTICS FOR SMALL AND LARGE FILE SIZES

[Storage analytics charts not reproduced here.]
Contact Vexata: [email protected]
© 2018 Vexata. All Rights Reserved. All third-party trademarks are the property of their respective companies or their subsidiaries in the U.S. and/or other countries. RA-1020-07302018
Conclusion
The VelocityAI solution clearly demonstrates price/performance leadership when industry best-of-breed systems are combined to jump-start customers' digital transformation journeys. It provides a single, easy-to-use solution that data scientists, chief data officers, and chief analytics officers can leverage to accelerate their data pipelines and host a multitude of data-driven applications.
Data-driven applications run on large data-sets that are continuously accessed by the compute layer for training and inferencing, and a neural net's predictions are only as good as the training data-set behind them. Data is the new oil, and the GPU compute layer needs to be well fed, with hot spots and I/O bottlenecks eliminated at the storage layer.
The Nvidia DGX-1 provides massive parallelism (1 PFLOPS per DGX-1) and consolidation at the compute layer, and Vexata, with its transformative and unique VX-OS and FPGA acceleration, presents the same massive parallelism at the storage layer. The Mellanox 100 GbE fabric removes data locality concerns. With this, VelocityAI is uniquely able to provide the highest throughput at deterministic latencies across all deep learning phases, for all unstructured data types and for mixed workloads.
Acknowledgments
We would like to take this opportunity to sincerely thank our esteemed friends at Nvidia: Darrin Johnson, James Mauro, Tony Paikeday, and Richard Salazar, who spent cycles reviewing this document.
ABOUT VEXATA: Founded on the premise that every business is challenged to deliver cognitive, data-intensive applications, Vexata delivers 10x performance AND efficiency improvements at a fraction of the cost of existing all-flash storage solutions. Learn more at www.vexata.com