Upload
others
View
15
Download
0
Embed Size (px)
Citation preview
DDN Storage | © 2017 DDN Storage DDN Confidential
Parallel File System for AI and Deep LearningManaging exponential data growth with Lustre File System
Carlos Thomaz – Technical Product Manager
DDN ConfidentialDDN Storage | © 2018 DDN Storage
How Would it Impact Your Business if You Could
Seamlessly Extract a Lot More Answers per Hour From Your Data at Any Scale?
DDN Storage | © 2018 DDN Storage
DDN | Storage Optimized for All Your AI Needs
Leading Storage, AI, Deep Learning Expertise
Highest Performing Analytics > 12X Faster
Scale IOPS, Bandwidth, Capacity
We Deliver the Most Efficient Storage
Solutions to Transform Your Data
Science Into Real Business Benefits
in AI and Deep Learning
Scalable Performance
Scalable Capacity
Media
Large-ScaleRepositories Deep
Learning SupercomputingLarge-Scale HPC
“Enterprise”& Government
Big Data
Securityand CCTV “Research”
Big Data
TechnicalComputing
AI/MachineLearning
DDN Storage | © 2018 DDN Storage DDN Confidential
DDN: Leading Deployments Across All AI & Deep Learning
Data SecurityFraud Detection
Augmented Reality
Healthcare Diagnostics
Personalized Marketing
Scalable storage PlatformsData-at-scale Services
Delivered Real-timeResource Optimized
Flash/GPU/CPU/Network
Line Speed PerformanceTiered Storage At Scale
Extended Flash Life
Natural Language
Massive Scale Out. Real Time Ingest &
Process
Fast Flash High IO/Client
Security
Globally Distributed
Optimized IO
Sensor Data Live Diagnostics Security
Distributed Analytics Resilient
Massive IngestFlash/Disk Large
Datasets
DriverlessCars
DDN Storage | © 2018 DDN Storage
x
Binaries/Libraries
Optimized Containers
DDN DGX-1 Volta Docker Volume
Broad Range of Clients
DDN | Your Data Was Never as Close to Your GPU’s
RDMA Accelerated I/O Delivered Directly to GPU’s
Parallel File System Which is So Much Faster Than NFS!
Leverages a Wide Range of GPU Optimized Applications
DDN Storage Accelerates Your Processors and GPU’s and Optimized
Containers So That You Achieve the Full Business Value of Your AI System
DDN ConfidentialDDN Storage | © 2018 DDN Storage
Big DataNoSQLAnalytics
Machine and DeepLearning
IO Characteristics: Read, Random, High Throughput per Client, File and IO Sizes between a few kb and a few MB
Training Sets typically larger than local cache
DDN ConfidentialDDN Storage | © 2018 DDN Storage
EXASCALER FOR AI AND DEEP LEARNING ENVIRONMENTS (DGX-1 STACK)
► Integrated Flash Parallel File System Access via TCP or IB
►Extreme Data Access Rates for concurrent DGX Containers
x
Binaries/Libraries
Containers
Host OSDGX-1 hostubuntu 16.04kernel 4.4.0-97-genericOFED-internal-4.0-1.0.1
EXAScaler DGX Docker Volume
EXAScaler Client
EXAScaler DGX Docker Volume
▪ Lustre ES3.2 kernel modules compiled for Ubuntu kernel and host’s OFED
▪ Lustre userspace tools▪ scripts for Lustre mount/umount
DDN ConfidentialDDN Storage | © 2018 DDN Storage
WHY LUSTRE FILE SYSTEM – DDN EXASCALER
Feature Importance for AI Others Lustre
Unique Metadata Operations
Medium - depends on Installation Size and Application Workflow ✓Highly scalable
✓Highly scalable with DNE I and II
Shared Metadata Operations
High - training data are usually curated into a single directory ✘ Lower than 10K ✓ Up to 200K
Support for high-performance mmap() I/O Calls
High - many AI applications use mmap() calls ✘ Extremely poor ✓ Strong
Container SupportHigh - most AI applications are
containerized✘ Poor (network complexity
& root issues)✓ Available
Data Isolation for Containers
Medium/High – important for shared environments
✘ Not available or very limited today
✓ Available
Data-on-Metadata (small file support)
Medium/High – depends on data set ✘ DOM limited to tiny files ✓DOM is highly tunable
#open_source #no_hidden_fees #fully_supported
DDN ConfidentialDDN Storage | © 2018 DDN Storage
DDN EXASCALER FEATURE HIGHLIGHTS
Lustre Persistent Client Cache
Lustre Enhancedsecurity
Integrated Policy Engine
#mls
#changelog_audit
#isolation
#kerberos
#encryption_at_rest
#fast_metadata_scan
#no_sql
#lightweight
#scalable
#mds_integrated
#local_lustre_cache
#lustre_aware
#designed_for_ai
#machine_learning_oriented
#read_write #rh_alternative#containers
Features under development
Lustre Native ContainerSupport
#subdirectory_mounts
#docker_ready_containers
#docker_optimizations
#nvidia_DGX
DDN ConfidentialDDN Storage | © 2018 DDN Storage
DDN EXASCALER LUSTRE CONTAINERMINIMAL OVERHEAD
➢SFA7700 delivering 7BG/s+➢Client node: 16 cores, 128 GB RAM, IB 4X FDRSoftware: CentOS 7 (3.10 kernel), Lustre pre-IEEL3.0 (2.7 + patches), OFED 3.18-1, Docker 1.8.2
DDN ConfidentialDDN Storage | © 2018 DDN Storage
IO LATENCY PROFILE FOR RANDOM READ DGXLUSTRE FULL FLASH
0
20000
40000
60000
80000
100000
120000
140000
160000
100 1000 10000
IO C
OU
NT
LATENCY (NANOSECONDS)
Single Container Random IO Profiledataset size ~500GB
4k 8k 16k 32k 64k 128k 256k 512k 1M
DDN ConfidentialDDN Storage | © 2018 DDN Storage
0
20000
40000
60000
80000
100000
120000
140000
160000
180000
0 0.5 1 1.5 2 2.5 3 3.5
IO C
ou
nt
Latency (us)
IO Latency Distribution for Random 32K Read operations from a single DGX container
ramsize 3xramsize 3xramsize directIO
86K IOPs80K IOPs
88K IOPs
SFA SSDsOSS+SFA CacheDGX
Cache
IO LATENCY PROFILE FOR RANDOM READ DGXLUSTRE FULL FLASH
DDN ConfidentialDDN Storage | © 2018 DDN Storage
DGX DDN EXASCALER VOLUME PERFORMANCE
0
2000
4000
6000
8000
10000
12000
0
20000
40000
60000
80000
100000
120000
4k 8k 16k 32k 64k 128k 256k 512k 1m
IOP
S O
R M
B/S
IO SIZE
Single Container Performance (Lustre/SSD)
IOPs BW
Over 100K IOPs and 11GB/s to a single container
DDN ConfidentialDDN Storage | © 2018 DDN Storage
DGX DDN EXASCALER VOLUME PERFORMANCE
0
20000
40000
60000
80000
100000
120000
140000
4k 8k 16k 32k 64k 128k 256k 512k 1m
IOP
S
IO SIZE
Random Read IOPs
1 container 2 containers 4 containers 8 containers
DDN ConfidentialDDN Storage | © 2018 DDN Storage
DATASET 3X DGX RAM CAPACITY RANDOM READ
0
20000
40000
60000
80000
100000
120000
140000
4k 8k 16k 32k 64k 128k 256k 512k 1m
IOP
S
IO SIZE
Random Read IOPs
1 container 2 containers 4 containers 8 containers
0.00
2.00
4.00
6.00
8.00
10.00
12.00
4k 8k 16k 32k 64k 128k 256k 512k 1m
THR
OU
GH
PU
T (G
B/S
)
IO SIZE
Random Read BW
1 container 2 containers 4 containers 8 containers
DDN ConfidentialDDN Storage | © 2018 DDN Storage
0
10
20
30
40
50
60
0.1 0.25 0.5 0.75 1 2 4 10 20 50 100 250 500 750 1000 2000 4000
%A
GE
OF
IOS
IN L
ATE
NC
Y B
UC
KET
LATENCY BUCKET (MICROSECONDS)
IO Distribution 4K Random Read
1 Container 2 Containers 4 Containers 8 Containers
Almost all reads in under 4 microseconds
DATASET 3X DGX RAM CAPACITY RANDOM READ
DDN Storage | © 2018 DDN Storage DDN Confidential
DGX Testing (libaio)32k random read across different dataset sizes from a single DGX container
1
10
100
1000
10000
100000
1000000
10000000
100000000
0.00 0.00 0.01 0.10 1.00 10.00 100.00
IO C
ou
nt
Latency (ms)
DGX <-- Lustre/SSD random 32k read (libaio)
32MB 500GB 1.5TB
DDN Storage | © 2018 DDN Storage DDN Confidential
Autonomous Driving: Customer Benchmark
► Create 3M files, 100 per directory across 3000 directories
► Starting with 40 files per thread, kick off multiples of 10 threads across a number of clients
➢each thread randomly selects 40 files at a time to read sequentially
► Each thread iterates in batches of 40 files.
► A success rate is calculated corresponding to the percentage of files that are read within a set time window (40ms)
► Lustre DGX Docker Volume demonstrates 100% success rate with 900 IO threads
95.50%
96.00%
96.50%
97.00%
97.50%
98.00%
98.50%
99.00%
99.50%
100.00%
100.50%
100 200 300 400 500 600 700 800 900 1800 3200
%A
GE
SUC
CES
S R
ATE
(R
EAD
BEL
OW
40M
S)
NUMBER OF IO THREADS (100-800 THREADS USED 10 CIENTS, 900,1880; 15 CLIENTS, 3200; 20 CLIENTS)
Success Rate for Scale-Out Random Read
NFS HDD (200x R1 1+1) Lustre SSD (8x R5 8+1)
ddn.com©2018 DataDirect Networks, Inc. *Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change.
Thank You!Keep in touch with us.
9351 Deering AvenueChatsworth, CA 91311
1.800.837.22981.818.700.4000
company/datadirect-networks
@ddn_limitless