
Page 1

DDN Storage | © 2017 DDN Storage | DDN Confidential

Parallel File System for AI and Deep Learning
Managing exponential data growth with Lustre File System

Carlos Thomaz – Technical Product Manager

Page 2

DDN Confidential | DDN Storage | © 2018 DDN Storage

How Would It Impact Your Business if You Could Seamlessly Extract a Lot More Answers per Hour From Your Data at Any Scale?

Page 3

DDN | Storage Optimized for All Your AI Needs

Leading Storage, AI, Deep Learning Expertise
Highest Performing Analytics > 12X Faster
Scale IOPS, Bandwidth, Capacity

We Deliver the Most Efficient Storage Solutions to Transform Your Data Science Into Real Business Benefits in AI and Deep Learning

[Figure: market segments plotted by Scalable Performance and Scalable Capacity: Media; Large-Scale Repositories; Deep Learning; Supercomputing; Large-Scale HPC; “Enterprise” & Government Big Data; Security and CCTV; “Research” Big Data; Technical Computing; AI/Machine Learning]

Page 4

DDN: Leading Deployments Across All AI & Deep Learning

Use cases: Data Security; Fraud Detection; Augmented Reality; Healthcare Diagnostics; Personalized Marketing; Natural Language; Driverless Cars

Storage capabilities shown around the use cases: Scalable Storage Platforms; Data-at-Scale Services; Delivered Real-Time; Resource Optimized; Flash/GPU/CPU/Network; Line Speed Performance; Tiered Storage at Scale; Extended Flash Life; Massive Scale Out; Real Time Ingest & Process; Fast Flash; High IO/Client; Security; Globally Distributed; Optimized IO; Sensor Data; Live Diagnostics; Distributed Analytics; Resilient; Massive Ingest; Flash/Disk; Large Datasets

Page 5

DDN | Your Data Was Never as Close to Your GPUs

RDMA Accelerated I/O Delivered Directly to GPUs
Parallel File System Which is So Much Faster Than NFS!
Leverages a Wide Range of GPU Optimized Applications

DDN Storage Accelerates Your Processors, GPUs and Optimized Containers So That You Achieve the Full Business Value of Your AI System

[Figure: stack diagram with the labels Binaries/Libraries, Optimized Containers, DDN DGX-1 Volta Docker Volume, Broad Range of Clients]

Page 6

Big Data, NoSQL, Analytics
Machine and Deep Learning

IO Characteristics: Read, Random, High Throughput per Client, File and IO Sizes between a few KB and a few MB

Training Sets typically larger than local cache
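The access pattern described on this slide (random reads of a few KB to a few MB) can be approximated with a small probe. The sketch below is an illustration only, not the tool used for the measurements in this deck: the function name `random_read_probe`, the chosen IO sizes and the 8 MiB sample file are all assumptions. On a real system the file would have to be larger than the client cache for the "larger than local cache" effect to show.

```python
import os
import random
import tempfile
import time


def random_read_probe(path, io_size, count, seed=0):
    """Issue `count` aligned random reads of `io_size` bytes; return latencies in seconds."""
    rng = random.Random(seed)
    file_size = os.path.getsize(path)
    # Highest block index at which a full read of io_size bytes still fits.
    max_block = (file_size - io_size) // io_size
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(count):
            offset = rng.randint(0, max_block) * io_size
            start = time.perf_counter()
            data = os.pread(fd, io_size, offset)
            latencies.append(time.perf_counter() - start)
            if len(data) != io_size:
                raise IOError(f"short read at offset {offset}")
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    # Hypothetical sample file; a real test would point at a file on the
    # parallel file system that exceeds the client's RAM.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for io_size in (4 * 1024, 64 * 1024, 1024 * 1024):
            lat = random_read_probe(tmp.name, io_size, count=100)
            mean_us = sum(lat) / len(lat) * 1e6
            print(f"{io_size // 1024:>5} KiB reads: mean latency {mean_us:.1f} us")
    finally:
        os.unlink(tmp.name)
```

Because the sample file is tiny, these reads are served from the page cache; the numbers only become meaningful against a data set that does not fit in memory.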

Page 7

EXASCALER FOR AI AND DEEP LEARNING ENVIRONMENTS (DGX-1 STACK)

► Integrated Flash Parallel File System Access via TCP or IB
► Extreme Data Access Rates for concurrent DGX Containers

[Figure: stack diagram with the labels Binaries/Libraries, Containers, EXAScaler DGX Docker Volume, EXAScaler Client, Host OS]

Host OS: DGX-1 host, Ubuntu 16.04, kernel 4.4.0-97-generic, OFED-internal-4.0-1.0.1

EXAScaler DGX Docker Volume
▪ Lustre ES3.2 kernel modules compiled for Ubuntu kernel and host’s OFED
▪ Lustre userspace tools
▪ Scripts for Lustre mount/umount
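The mount/umount scripts themselves are not shown in the deck. As a rough idea of what such a script has to assemble, the sketch below builds the standard Lustre client mount command line (`mount -t lustre <MGS NID>:/<fsname> <mountpoint>`). The helper names, the NID `10.0.0.1@o2ib`, the file system name `exafs` and the mount point are hypothetical; `@o2ib` and `@tcp` are the LNet network types for InfiniBand and TCP, matching the "TCP or IB" access noted above.

```python
import shlex


def lustre_mount_cmd(mgs_nid, fsname, mountpoint, options=None):
    """Build the argv for mounting a Lustre file system on a client."""
    cmd = ["mount", "-t", "lustre"]
    if options:
        cmd += ["-o", ",".join(options)]
    cmd += [f"{mgs_nid}:/{fsname}", mountpoint]
    return cmd


def lustre_umount_cmd(mountpoint):
    """Build the argv for unmounting a Lustre client mount point."""
    return ["umount", mountpoint]


if __name__ == "__main__":
    # All values below are placeholders, not taken from the slides.
    mount = lustre_mount_cmd("10.0.0.1@o2ib", "exafs", "/mnt/exafs", ["flock"])
    print(shlex.join(mount))
    print(shlex.join(lustre_umount_cmd("/mnt/exafs")))
```

The commands are only printed here; a real script would run them with root privileges (for example via `subprocess.run`) after loading the Lustre kernel modules.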

Page 8

WHY LUSTRE FILE SYSTEM – DDN EXASCALER

Feature | Importance for AI | Others | Lustre
Unique Metadata Operations | Medium - depends on installation size and application workflow | ✓ Highly scalable | ✓ Highly scalable with DNE I and II
Shared Metadata Operations | High - training data are usually curated into a single directory | ✘ Lower than 10K | ✓ Up to 200K
Support for high-performance mmap() I/O calls | High - many AI applications use mmap() calls | ✘ Extremely poor | ✓ Strong
Container Support | High - most AI applications are containerized | ✘ Poor (network complexity & root issues) | ✓ Available
Data Isolation for Containers | Medium/High - important for shared environments | ✘ Not available or very limited today | ✓ Available
Data-on-Metadata (small file support) | Medium/High - depends on data set | ✘ DOM limited to tiny files | ✓ DOM is highly tunable

#open_source #no_hidden_fees #fully_supported
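To make the mmap() row concrete: an application that memory-maps a training file reads records by slicing the mapping, and each first touch of a page turns into a read against the underlying file system. The sketch below uses Python's standard `mmap` module; the fixed-size record layout and the helper name `read_records_mmap` are assumptions for illustration, not something the deck describes.

```python
import mmap
import os
import tempfile


def read_records_mmap(path, record_size, indices):
    """Return the fixed-size records at `indices` by slicing a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [mm[i * record_size:(i + 1) * record_size] for i in indices]


if __name__ == "__main__":
    # Hypothetical data file holding four 4-byte records.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"aaaabbbbccccdddd")
    try:
        print(read_records_mmap(tmp.name, 4, [2, 0]))
    finally:
        os.unlink(tmp.name)
```

With random indices, as in shuffled training, the page faults behind these slices are small random reads, which is why the table treats mmap() performance as important for AI workloads.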

Page 9

DDN EXASCALER FEATURE HIGHLIGHTS

Lustre Enhanced Security: #mls #changelog_audit #isolation #kerberos #encryption_at_rest

Integrated Policy Engine: #fast_metadata_scan #no_sql #lightweight #scalable #mds_integrated

Lustre Persistent Client Cache: #local_lustre_cache #lustre_aware #designed_for_ai #machine_learning_oriented #read_write #rh_alternative

Features under development

Lustre Native Container Support: #containers #subdirectory_mounts #docker_ready_containers #docker_optimizations #nvidia_DGX

Page 10

DDN EXASCALER LUSTRE CONTAINER: MINIMAL OVERHEAD

➢ SFA7700 delivering 7 GB/s+
➢ Client node: 16 cores, 128 GB RAM, IB 4X FDR
➢ Software: CentOS 7 (3.10 kernel), Lustre pre-IEEL3.0 (2.7 + patches), OFED 3.18-1, Docker 1.8.2

Page 11

IO LATENCY PROFILE FOR RANDOM READ DGX LUSTRE FULL FLASH

[Figure: Single Container Random IO Profile, dataset size ~500GB. X axis: latency (nanoseconds), ticks at 100, 1000, 10000; Y axis: IO count, 0 to 160000; series: 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1M]

Page 12

IO LATENCY PROFILE FOR RANDOM READ DGX LUSTRE FULL FLASH

[Figure: IO Latency Distribution for Random 32K Read operations from a single DGX container. X axis: latency (us), 0 to 3.5; Y axis: IO count, 0 to 180000; series: ramsize, 3xramsize, 3xramsize directIO; annotations: 86K IOPs, 80K IOPs, 88K IOPs; regions labeled SFA SSDs, OSS+SFA Cache, DGX Cache]

Page 13

DGX DDN EXASCALER VOLUME PERFORMANCE

[Figure: Single Container Performance (Lustre/SSD). X axis: IO size (4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1m); Y axes: IOPS, 0 to 120000, and MB/s, 0 to 12000; series: IOPs, BW]

Over 100K IOPs and 11GB/s to a single container
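The two headline figures sit at opposite ends of the IO-size axis, because bandwidth is simply IOPS multiplied by IO size. The short calculation below illustrates that relationship; it does not state which IO size produced each peak on the chart, and the example sizes (4 KiB and 1 MiB) are assumptions chosen from the axis range.

```python
def bandwidth_mb_s(iops, io_size_bytes):
    """Bandwidth in MB/s (10^6 bytes) delivered by `iops` operations of a given size."""
    return iops * io_size_bytes / 1e6


def iops_for_bandwidth(bandwidth_gb_s, io_size_bytes):
    """IOPS needed to sustain `bandwidth_gb_s` GB/s (10^9 bytes) at a given IO size."""
    return bandwidth_gb_s * 1e9 / io_size_bytes


if __name__ == "__main__":
    # 100K IOPS at 4 KiB moves far less data than 11 GB/s ...
    print(f"100K IOPS x 4 KiB = {bandwidth_mb_s(100_000, 4096):.1f} MB/s")
    # ... while 11 GB/s at 1 MiB needs only about ten thousand IOPS.
    print(f"11 GB/s at 1 MiB = {iops_for_bandwidth(11, 1024 * 1024):.0f} IOPS")
```

This is why small-block results are usually quoted in IOPS and large-block results in GB/s.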

Page 14

DGX DDN EXASCALER VOLUME PERFORMANCE

[Figure: Random Read IOPs. X axis: IO size (4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1m); Y axis: IOPS, 0 to 140000; series: 1 container, 2 containers, 4 containers, 8 containers]

Page 15

DATASET 3X DGX RAM CAPACITY RANDOM READ

[Figure: Random Read IOPs. X axis: IO size (4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1m); Y axis: IOPS, 0 to 140000; series: 1 container, 2 containers, 4 containers, 8 containers]

[Figure: Random Read BW. X axis: IO size (4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k, 1m); Y axis: throughput (GB/s), 0.00 to 12.00; series: 1 container, 2 containers, 4 containers, 8 containers]

Page 16

DATASET 3X DGX RAM CAPACITY RANDOM READ

[Figure: IO Distribution 4K Random Read. X axis: latency bucket (microseconds): 0.1, 0.25, 0.5, 0.75, 1, 2, 4, 10, 20, 50, 100, 250, 500, 750, 1000, 2000, 4000; Y axis: percentage of IOs in latency bucket, 0 to 60; series: 1 Container, 2 Containers, 4 Containers, 8 Containers]

Almost all reads in under 4 microseconds
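A bucketed latency distribution like the one on this slide can be derived from raw per-IO latencies as sketched below. The bucket edges are the ones printed on the chart's X axis; treating each label as the inclusive upper bound of its bucket is an assumption, as are the function name and the sample latencies, which are made up purely to exercise the code.

```python
from bisect import bisect_left

# Bucket labels from the slide, in microseconds; each is assumed to be an
# inclusive upper bound. One extra slot collects anything above 4000 us.
BUCKET_EDGES_US = [0.1, 0.25, 0.5, 0.75, 1, 2, 4, 10, 20, 50,
                   100, 250, 500, 750, 1000, 2000, 4000]


def bucket_percentages(latencies_us):
    """Return the percentage of IOs falling into each latency bucket."""
    if not latencies_us:
        raise ValueError("no latency samples")
    counts = [0] * (len(BUCKET_EDGES_US) + 1)
    for lat in latencies_us:
        # Index of the first edge that is >= lat.
        counts[bisect_left(BUCKET_EDGES_US, lat)] += 1
    total = len(latencies_us)
    return [100.0 * c / total for c in counts]


if __name__ == "__main__":
    sample = [0.8, 1.5, 1.7, 2.2, 3.1, 3.9, 4.0, 12.0]  # invented values
    pct = bucket_percentages(sample)
    labels = [f"<={e}" for e in BUCKET_EDGES_US] + [">4000"]
    for label, p in zip(labels, pct):
        if p:
            print(f"{label:>7} us: {p:5.1f}%")
    under_4 = sum(pct[:BUCKET_EDGES_US.index(4) + 1])
    print(f"share at or under 4 us: {under_4:.1f}%")
```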

Page 17

DGX Testing (libaio): 32k random read across different dataset sizes from a single DGX container

[Figure: DGX <-- Lustre/SSD random 32k read (libaio). X axis: latency (ms), logarithmic, up to 100.00; Y axis: IO count, logarithmic, 1 to 100000000; series: 32MB, 500GB, 1.5TB]

Page 18

Autonomous Driving: Customer Benchmark

► Create 3M files, 100 per directory across 3000 directories
► Starting with 40 files per thread, kick off multiples of 10 threads across a number of clients
➢ Each thread randomly selects 40 files at a time to read sequentially
► Each thread iterates in batches of 40 files
► A success rate is calculated as the percentage of files that are read within a set time window (40 ms)
► Lustre DGX Docker Volume demonstrates a 100% success rate with 900 IO threads
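The benchmark steps above can be sketched on a single machine as follows. This is a scaled-down illustration under stated assumptions, not the customer's benchmark: the file size (4 KiB), the tiny directory tree, the use of threads on one client, and the reading of the 40 ms window as a per-file limit are all assumptions. Only the batch size of 40 and the 40 ms window come from the slide.

```python
import os
import random
import tempfile
import time
from concurrent.futures import ThreadPoolExecutor

BATCH = 40         # files per batch, from the slide
WINDOW_S = 0.040   # 40 ms window, from the slide (assumed to apply per file)


def make_tree(root, n_dirs, files_per_dir, size=4096):
    """Create n_dirs directories with files_per_dir files each; return all paths."""
    paths = []
    for d in range(n_dirs):
        dpath = os.path.join(root, f"dir{d:04d}")
        os.mkdir(dpath)
        for i in range(files_per_dir):
            path = os.path.join(dpath, f"file{i:03d}")
            with open(path, "wb") as f:
                f.write(os.urandom(size))
            paths.append(path)
    return paths


def worker(paths, batches, seed):
    """Read `batches` random batches of BATCH files; return per-file read times."""
    rng = random.Random(seed)
    timings = []
    for _ in range(batches):
        for path in rng.sample(paths, BATCH):
            start = time.perf_counter()
            with open(path, "rb") as f:
                f.read()
            timings.append(time.perf_counter() - start)
    return timings


def success_rate(paths, n_threads, batches):
    """Return (percentage of files read within WINDOW_S, number of files read)."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda s: worker(paths, batches, s), range(n_threads)))
    timings = [t for r in results for t in r]
    ok = sum(1 for t in timings if t <= WINDOW_S)
    return 100.0 * ok / len(timings), len(timings)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        files = make_tree(root, n_dirs=3, files_per_dir=100)
        rate, n = success_rate(files, n_threads=10, batches=2)
        print(f"{n} reads, {rate:.2f}% within {WINDOW_S * 1000:.0f} ms")
```

On local storage with warm caches nearly every read lands inside the window; the benchmark becomes discriminating only at scale, with many clients reading a tree far larger than their caches.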

[Figure: Success Rate for Scale-Out Random Read. X axis: number of IO threads (100, 200, 300, 400, 500, 600, 700, 800, 900, 1800, 3200; 100-800 threads used 10 clients; 900 and 1800 used 15 clients; 3200 used 20 clients); Y axis: percentage success rate (reads below 40 ms), 95.50% to 100.50%; series: NFS HDD (200x R1 1+1), Lustre SSD (8x R5 8+1)]

Page 19

ddn.com | ©2018 DataDirect Networks, Inc. *Other names and brands may be claimed as the property of others. Any statements or representations around future events are subject to change.

Thank You! Keep in touch with us.

9351 Deering Avenue, Chatsworth, CA 91311

1.800.837.2298 | 1.818.700.4000

company/datadirect-networks

@ddn_limitless

[email protected]