VelocityAI REFERENCE ARCHITECTURE WHITE PAPER
JULY // 2018
In partnership with
Contents

Introduction
Challenges with Existing AI/ML/DL Solutions
Accelerate AI/ML/DL Workloads with Vexata VelocityAI
VelocityAI Reference Architecture
VelocityAI Network Configuration
DGX-1 NFS Client Configuration
VelocityAI Network Configuration – File Heads
VelocityAI Network Configuration – Storage
Imagenet Benchmark Configuration
VelocityAI Filesystem Configuration
VelocityAI Imagenet Benchmark Tests
Imagenet Benchmark Results for 150 KB and 1 MB File Sizes
Storage Analytics for Small and Large File Sizes
Conclusion
Introduction
ENABLING PREDICTIVE AND COGNITIVE ANALYTICS

Machine Learning (ML) and Deep Learning (DL) workloads are increasing in volume and complexity as organizations look to reduce training and operational timelines for artificial intelligence (AI) use cases. This has given rise to massively parallel GPU servers such as the NVIDIA DGX-1, which deliver the compute power needed to run these machine learning frameworks.
IDC* predicts that by 2019, 40% of digital transformation initiatives will use AI services; by 2021, 75% of commercial enterprise apps will use AI, over 90% of consumers will interact with customer support bots, and over 50% of new industrial robots will leverage AI.
To accelerate training and operational cycles, the storage systems that power these AI/ML/DL pipelines must deliver ultra-low latency, massive ingest bandwidth, and heavy mixed random and sequential read/write handling. Architectures built on direct-attached storage (DAS) limit performance and data mobility, while existing all-flash arrays lack the sustained performance to deliver timely insights at scale.
NVIDIA and Vexata have teamed up to deliver an industry best-of-breed solution for customers moving to predictive, prescriptive, and cognitive analytics. The joint solution comprises two or four NVIDIA DGX-1 servers with the Vexata VX-100FS file storage system. The VX-100FS is pre-configured and tuned, deploys seamlessly in existing or new NFS environments, and can be configured with two or four Network Attached Storage (NAS) file heads plus the accelerated storage system.
This paper discusses the joint solution and its reference architecture. The objective of this document is to outline the configuration and deployment details of the Vexata file storage system, and to capture performance benchmarks from a series of synthetic workloads.
This joint solution will be offered directly to end customers or through qualified business partners.
Challenges with Existing AI/ML/DL Solutions
Existing AI/ML/DL solutions are based on sharding across direct-attached storage architectures, where data locality is a constraint:

• Poor utilization of expensive GPU cycles, because storage I/O is not fast enough to keep the GPUs fed
• Slower model training and inferencing elapsed time translates into lost business opportunity
• All pipeline stages need massively parallel performance to keep up with GPU parallelism
• Deployments based on direct-attached storage, sharding, and moving compute closer to data add complexity
• DAS architectures force data to be staged before computing
• Compute and storage cannot be scaled independently
• Three-way replication-based protection leads to poor storage efficiency
* Source IDC Technology Spotlight - Accelerate and Operationalize AI Deployments Using AI Optimized Infrastructure
Accelerate AI/ML/DL Workloads with Vexata VelocityAI
The Vexata VX-100FS, with its transformative VX-OS, is purpose-built to overcome these machine learning challenges:
Reduce training and inferencing time from days to hours, improving data scientist productivity
• Accelerated data path with deterministic low-latency performance for better GPU utilization
• Faster storage eliminates data locality concerns

Access large training and inferencing data-sets
• Accelerated non-blocking access to NVMe media for large data ingest with low-latency I/O performance

Consolidate and eliminate movement between data pipeline stages
• Shared storage handles all data pipeline stages without performance degradation
• Simultaneously supports small-block random I/O, large-block sequential I/O, and mixed read/write I/O
• In-place data analytics with flexibility of ingest protocols (FC, NVMe-oF, NFS, SMB, S3)

Storage security, protection, and efficiency
• RAID5/6 protection eliminates three copies; compression and always-on encryption
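The efficiency claim above can be illustrated with simple arithmetic: three-way replication stores every byte three times, while RAID6 sacrifices only two parity drives per group. The sketch below is a rough comparison only; the 16-drive group is an illustrative example, not the VX-100FS's actual RAID geometry.

```python
# Rough usable-capacity comparison: 3-way replication vs RAID6.
# Drive counts here are illustrative, not Vexata specifications.

def usable_fraction_replication(copies: int) -> float:
    """Usable fraction of raw capacity with N-way replication."""
    return 1.0 / copies

def usable_fraction_raid6(drives: int) -> float:
    """Usable fraction of raw capacity for RAID6 (2 parity drives per group)."""
    return (drives - 2) / drives

print(f"3-way replication:    {usable_fraction_replication(3):.0%} usable")
print(f"RAID6 over 16 drives: {usable_fraction_raid6(16):.1%} usable")
```

Under these assumptions RAID6 yields well over twice the usable capacity per raw terabyte compared with three-way replication.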
[Figure: AI data pipeline. Stage 1 (sensors, financial systems): ingest of many data types, small and large files, at massive bandwidth. Stage 2 (GPUs): SPARK-based ETL to decode, augment, tensor, and label. Stage 3 (GPUs): build models using neural nets. Stage 4 (GPU cluster): train, infer, predict, and iterate on large data-sets, delivering predictive, prescriptive, and cognitive analytics.]

Use Cases
• Fraud analytics, quant trading
• SAS Analytics, kdb+
• Computer vision
• Speech recognition
• Hyper-spectrometry
• Autonomous vehicles
• Biomedical cancer detection
VelocityAI Reference Architecture

Compute
• Four DGX-1 systems (8 Tesla V100 GPUs and 2x Intel E5-2698 v4 each)
• 4x 100 GbE DGX IB ports configured to run 100 GbE Ethernet
• 4 PFLOPS of deep learning performance
• Container-based NVIDIA GPU Cloud deep learning stack with machine learning frameworks
Networking
• Mellanox SN2700 100 GbE x 32 switch (2 switches)
Storage
• Vexata VX-100FS NVMe-oF scale-out storage system
• 430 TB of fast file tier
• 50 GB/s of bandwidth
• Scale: add DGXs, add head nodes, add arrays
[Figure: Network topology. The four DGX-1 systems (DGX-A through DGX-D) connect through two Mellanox SN2700 switches (SW1, SW2) to the two controllers (C1, C2) of the Vexata VX-100FS, with 8x 100 GbE links on each switch path.]

VEXATA VX-100FS FILE STORAGE SYSTEM

Accelerated storage node
Brand: Vexata
Model: VX-100F
OS: Vexata OS release 3.5.0
Usable capacity: 430 TB
Storage modules (ESM): 16

4 file heads
Processor: Intel(R) Xeon(R) Gold 6150 CPU @ 2.70GHz
Sockets: 2
Cores per socket: 18
Threads per core: 2
Memory: 512 GB
OS: CentOS v7.4
VelocityAI Network Configuration
DGX-1 NFS CLIENT CONFIGURATION
For load balancing using DNS round robin, configure the following records on the DNS server:

vx-nfs IN A <100GbE interface IP of node1>
vx-nfs IN A <100GbE interface IP of node2>
vx-nfs IN A <100GbE interface IP of node3>
vx-nfs IN A <100GbE interface IP of node4>
Nvidia DGX-1 NFS Client Access
The file system is accessible to the DGX-1 through a single mount point, and the DGX-1 clients can do a simple NFS mount. For example:

mount.nfs -o rw,tcp,hard,intr,rsize=32768,wsize=32768,retry=10000,timeo=600,retrans=5,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,nfsvers=3,sloppy vx-nfs:/tmp/nfs1 /tmp/n1
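To make the mount persistent across reboots, an equivalent /etc/fstab entry can be used. This is a sketch based on the mount options shown above; the export path and mount point are the example values from the text, and should be adjusted per site.

```
# /etc/fstab entry (single line), mirroring the example mount command above:
vx-nfs:/tmp/nfs1  /tmp/n1  nfs  rw,tcp,hard,intr,rsize=32768,wsize=32768,retry=10000,timeo=600,retrans=5,acregmin=3,acregmax=60,acdirmin=30,acdirmax=60,nfsvers=3,sloppy  0  0
```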
VelocityAI NETWORK CONFIGURATION – FILE HEADS
The Vexata file storage system has four pre-configured NAS head nodes. These nodes require 16 Ethernet connections and 12 IPs: four IPs for management, four IPs for IPMI, and four client-facing IPs (two bonded connections per NSD):
NODE        ROLE        ETHERNET PORTS  SPEED    IP ADDRESS  NETMASK  GATEWAY
NAS Head 1  Management  1               Auto     ...         ...      ...
NAS Head 1  IPMI        1               Auto     ...         ...      ...
NAS Head 1  NFS         2               100 GbE  ...         ...      ...
NAS Head 2  Management  1               Auto     ...         ...      ...
NAS Head 2  IPMI        1               Auto     ...         ...      ...
NAS Head 2  NFS         2               100 GbE  ...         ...      ...
NAS Head 3  Management  1               Auto     ...         ...      ...
NAS Head 3  IPMI        1               Auto     ...         ...      ...
NAS Head 3  NFS         2               100 GbE  ...         ...      ...
NAS Head 4  Management  1               Auto     ...         ...      ...
NAS Head 4  IPMI        1               Auto     ...         ...      ...
NAS Head 4  NFS         2               100 GbE  ...         ...      ...
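The connection and IP totals quoted above follow directly from the per-head requirements in the table. A trivial arithmetic check, with counts taken from the table:

```python
# Per NAS head (from the table): 1 management port, 1 IPMI port, 2 bonded NFS ports.
heads = 4
ports_per_head = 1 + 1 + 2  # management + IPMI + two bonded NFS ports
ips_per_head = 1 + 1 + 1    # management IP + IPMI IP + one client-facing IP per bonded pair

total_ports = heads * ports_per_head
total_ips = heads * ips_per_head
print(total_ports)  # 16 Ethernet connections
print(total_ips)    # 12 IP addresses
```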
VelocityAI NETWORK CONFIGURATION - STORAGE
The storage node requires six Ethernet connections (four management ports, two IPMI ports) and five IP addresses. All IP addresses can be on the same subnet and can be either static or DHCP-assigned.
INTERFACE           ROLE          ETHERNET PORTS  SPEED  IP ADDRESS  NETMASK  GATEWAY
Primary Management  Virtual IP    0               Auto   ...         ...      ...
Management          Controller 0  1               Auto   ...         ...      ...
Management          Controller 1  1               Auto   ...         ...      ...
IPMI                Controller 0  1               Auto   ...         ...      ...
IPMI                Controller 1  1               Auto   ...         ...      ...
IMAGENET BENCHMARK CONFIGURATION
• VelocityAI bandwidth is equally divided between training/inferencing and ingest/ETL/build
• ImageNet pre-trained models are used for the benchmark; AlexNet is used because it is storage-I/O heavy
• Inception V3, ResNet-50, ResNet-152, AlexNet, VGG16 container images
• Supervised learning with 1.28M labelled images across 1,000 categories as the dataset
• Standard docker image: nvcr.io/nvidia/tensorflow:18.04-py2
• batch_size = 64
VelocityAI FILESYSTEM CONFIGURATION
The following commands show:

• The cluster configuration, with the pagepool and the name of the configured filesystem
• The 16 volumes assigned to the 4 file heads
• The state of the 4 file heads

[Command output not reproduced here.]

The following command shows the filesystem attributes, with the associated inodes and the block size.

[Command output not reproduced here.]
VelocityAI Imagenet Benchmark Tests
In filesystem performance testing, the 143 GB ImageNet dataset took 165 seconds on average to load into memory.
IMAGENET BENCHMARK RESULTS FOR 150 KB AND 1 MB FILE SIZES
TensorFlow benchmarks were run against ImageNet, a large visual database designed for visual object recognition research. The database comprises 1.28M labelled images for supervised learning. Pre-trained convolutional network models were used with the ImageNet dataset to measure storage I/O performance and its ability to keep the GPUs fed during the training phase. AlexNet was used because of its ability to exercise the storage I/O stack. Performance is measured in images/sec.
To emulate a real-world use case comprising ingest, ETL, modeling, training, and inferencing, only half of the available bandwidth of the storage system is used for the training phase. This leaves the remaining bandwidth for the other phases, making the DGX cluster a real-world deep learning solution: ingest, ETL, modeling, training, and inferencing can all run on the same system. This is a unique advantage of VelocityAI, enabled by its transformative VX-OS.
Testing was also conducted on small images (150 KB), to emulate real-world sensor data, in addition to the large images (1 MB). VX-OS again uniquely delivers the same bandwidth whether the workload is small-block or large-block I/O, or mixed read/write I/O running concurrently, across all deep learning phases.
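The images/sec figures in the results are consistent with simple bandwidth arithmetic: divide the bandwidth allocated to training by the image file size. A minimal sketch, using the 1-DGX configuration's 6.25 GB/s training allocation from the results table:

```python
def images_per_sec(train_bandwidth_bytes: float, file_size_bytes: float) -> float:
    """Approximate images/sec when the training phase is storage-bandwidth limited."""
    return train_bandwidth_bytes / file_size_bytes

bw = 6.25e9  # 6.25 GB/s allocated to training/inference in the 1-DGX configuration

small = round(images_per_sec(bw, 150e3))  # 150 KB images
large = round(images_per_sec(bw, 1e6))    # 1 MB images
print(small)  # ~41,667, matching the ~41K in the results table
print(large)  # 6,250, matching the 6.25K in the results table
```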
Test configuration:

• Bandwidth equally divided between training/inferencing and ingest/ETL/build
• ImageNet pre-trained model; AlexNet used because it is storage-I/O heavy
• Inception V3, ResNet-50, ResNet-152, AlexNet, VGG16 container images
• Supervised learning: 1.28M labelled images, 1,000 categories
• Standard docker image: nvcr.io/nvidia/tensorflow:18.04-py2
• batch_size = 64
• Horovod 0.11.3
• VelocityAI: 1 DGX server, 2 file heads, 4 storage blades
• VelocityAI: 2 DGX servers, 2 file heads, 8 storage blades
• VelocityAI: 4 DGX servers, 4 file heads, 16 storage blades
VEXATA NVIDIA SOLUTION: 1 DGX SERVER, 4 BLADES, 2 HEADS
File Size  Available B/W (training/inference)  Images/sec  Remaining B/W
150 KB     6.25 GB/s                           41K         6.25 GB/s
1 MB       6.25 GB/s                           6.25K       6.25 GB/s

VEXATA NVIDIA SOLUTION: 2 DGX SERVERS, 8 BLADES, 2 HEADS
File Size  Available B/W (training/inference)  Images/sec  Remaining B/W
150 KB     12.5 GB/s                           83K         12.5 GB/s
1 MB       12.5 GB/s                           12.5K       12.5 GB/s

VEXATA NVIDIA SOLUTION: 4 DGX SERVERS, 16 BLADES, 4 HEADS
File Size  Available B/W (training/inference)  Images/sec  Remaining B/W
150 KB     25 GB/s                             166K        25 GB/s
1 MB       25 GB/s                             25K         25 GB/s
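The three configurations scale roughly linearly: each step doubles the blade count, and both the training bandwidth and images/sec double with it. A quick consistency check over the 150 KB rows, with values copied from the results above:

```python
# (configuration, training B/W in GB/s, images/sec at 150 KB) from the results.
results = [
    ("1 DGX / 4 blades",   6.25,  41_000),
    ("2 DGX / 8 blades",  12.5,   83_000),
    ("4 DGX / 16 blades", 25.0,  166_000),
]

base_bw, base_imgs = results[0][1], results[0][2]
for name, bw, imgs in results:
    # Bandwidth doubles per step; images/sec should scale by the same factor.
    print(name, bw / base_bw, round(imgs / base_imgs, 2))
```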
STORAGE ANALYTICS FOR SMALL AND LARGE FILE SIZES

[Storage analytics charts not reproduced here.]
Contact Vexata: [email protected]
© 2018 Vexata. All Rights Reserved. All third-party trademarks are the property of their respective companies or their subsidiaries in the U.S. and/or other countries. RA-1020-07302018
Conclusion
The VelocityAI solution clearly demonstrates price/performance leadership when industry best-of-breed systems are combined to jump-start customers' digital transformation journeys. It provides a single, easy-to-use solution that data scientists, chief data officers, and chief analytics officers can leverage to accelerate their data pipelines and host a multitude of data-driven applications.
Data-driven applications run on large data-sets that are continuously accessed by the compute layer for training and inferencing, and a neural net's predictions are only as good as the training data-set behind them. Data is the new oil, and the GPU compute layer needs to be well fed, with hot spots and I/O bottlenecks eliminated at the storage layer.
The Nvidia DGX-1 provides massive parallelism (1 PFLOPS per DGX-1) and consolidation at the compute layer, and Vexata, with its transformative and unique VX-OS and FPGA acceleration, presents the same massive parallelism at the storage layer. The Mellanox 100 GbE fabric removes data locality concerns. With this, VelocityAI is uniquely able to provide the highest throughput at deterministic latencies across all deep learning phases, for all unstructured data types and for mixed workloads.
Acknowledgments
We would like to take this opportunity to sincerely thank our esteemed friends at Nvidia: Darrin Johnson, James Mauro, Tony Paikeday, and Richard Salazar, who spent cycles reviewing this document.
ABOUT VEXATA: Founded on the premise that every business is challenged to deliver cognitive, data-intensive applications, Vexata delivers 10x performance AND efficiency improvements at a fraction of the cost of existing all-flash storage solutions. Learn more at www.vexata.com