HPC AND AI ACCELERATION ON GPU
Yi Cheng (易成), SA HPC & AI, March 2019
2
Artificial Intelligence | Computer Graphics | GPU Computing
NVIDIA: "THE AI COMPUTING COMPANY"
3
RISE OF GPU COMPUTING
[Chart, 1980-2020: GPU-computing performance grows ~1.5X per year, on track for 1000X by 2025, while single-threaded performance has slowed from ~1.5X to ~1.1X per year. Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp.]
APPLICATIONS
SYSTEMS
ALGORITHMS
CUDA
ARCHITECTURE
4
ELEVEN YEARS OF GPU COMPUTING
- 2006: CUDA launched
- 2008: World's first GPU Top500 system
- 2010: Fermi, the world's first HPC GPU
- 2012: Oak Ridge deploys the world's fastest supercomputer with GPUs; AlexNet beats expert code by a huge margin using GPUs
- 2014: Stanford builds an AI machine using GPUs
- 2017: Top 13 greenest supercomputers powered by NVIDIA GPUs
- Further milestones along the way: discovering how H1N1 mutates to resist drugs, the world's first atomic model of the HIV capsid, the world's first 3-D mapping of the human genome, Google outperforming humans on ImageNet, and a GPU-trained AI machine beating the world champion in Go
5
GPU EVOLUTION
[Chart: SGEMM per watt (normalized), 2008-2016, for Tesla products C870, C1060, C1070, C2050, C2070, M2070, M2090, K20, K40, K80, M10, M40, M60, P4, P40, and P100, spanning the Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), and Pascal (Unified Memory, 3D Memory, NVLink) architecture generations.]
6
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTERS
48% More Systems | 22 of Top 25 Greenest
- ORNL Summit (world's fastest): 27,648 GPUs | 144 PF
- LLNL Sierra (world's 2nd fastest): 17,280 GPUs | 95 PF
- Piz Daint (Europe's fastest): 5,704 GPUs | 21 PF
- ABCI (Japan's fastest): 4,352 GPUs | 20 PF
- ENI HPC4 (fastest industrial): 3,200 GPUs | 12 PF
7
NVIDIA POWERS GORDON BELL WINNERS & 5 OF 6 FINALISTS
GPU Acceleration Critical to HPC at Scale Today
- Material science: 300x higher performance
- Genomics: 2.36 ExaFLOPS
- Seismic: first soil-and-structure simulation
- Quantum chromodynamics: <1% uncertainty margin
- Weather: 1.15 ExaFLOPS
8
END-TO-END PRODUCT FAMILY

HPC / Training:
- Desktop: TITAN / GeForce
- Workstation: DGX Station
- Data center: Tesla V100
- Fully integrated AI systems: DGX-1, DGX-2
- Server platform: HGX-1 / HGX-2

Inference:
- Embedded: Jetson AGX Xavier
- Automotive: Drive AGX Pegasus
- Data center: Tesla V100, Tesla P4/T4
- Virtual workstation: Virtual GPU
9
TESLA PRODUCT FAMILY

TESLA V100 (scale-up): Supercomputing | DL Training & Inference | Machine Learning | Video | Graphics
- V100 SXM2 with NVLink
- V100 PCIe, 2-slot
- HGX-2 baseboard: 16x V100 + NVSwitch (V100 and NVSwitch heat sinks included but not shown)

TESLA T4 (scale-out): DL Inference & Training | Machine Learning | Video | Graphics
- T4 PCIe, low profile, 70 W
10
NVIDIA UNIVERSAL ACCELERATION PLATFORM
Single Platform Drives Utilization and Productivity

- Apps & frameworks: +580 applications (e.g. Amber, NAMD)
- NVIDIA SDK & libraries:
  - Machine learning / analytics: cuDF, cuML, cuGRAPH
  - Deep learning: cuDNN, cuBLAS, CUTLASS, NCCL, TensorRT
  - HPC: cuBLAS, cuFFT, OpenACC
  - CUDA
- Tesla GPUs & systems: HGX-2 (scale up, dense compute), T4 (scale out, distributed compute)
- Customer use cases: consumer internet, industrial, and scientific applications, including speech, translation, recommenders, video/images, retail, molecular simulations, weather forecasting, seismic mapping, manufacturing, healthcare, and finance
11
TESLA V100

For NVLink servers (SXM2) vs. for PCIe servers:
- Core: 5,120 CUDA cores, 640 Tensor Cores (both)
- Compute: 7.8 TF DP | 15.7 TF SP | 125 TF DL (NVLink) vs. 7 TF DP | 14 TF SP | 112 TF DL (PCIe)
- Memory: HBM2, 900 GB/s, 32 GB (both)
- Interconnect: NVLink (up to 300 GB/s) + PCIe Gen3 (up to 32 GB/s) vs. PCIe Gen3 (up to 32 GB/s)
- Power: 300 W vs. 250 W
- Available: now (both)
12
TESLA V100: The Fastest and Most Productive GPU for AI and HPC
- Volta architecture: most productive GPU
- Tensor Cores: 125 programmable TFLOPS for deep learning
- Improved SIMT model: enables new algorithms
- Volta MPS: better inference utilization
- Improved NVLink 2.0 (300 GB/s) and HBM2 (900 GB/s): efficient bandwidth
13
INTRODUCING TESLA P100: New GPU Architecture to Enable the World's Fastest Compute Node
- Pascal architecture: highest compute performance
- NVLink (160 GB/s): GPU interconnect for maximum scalability
- CoWoS HBM2 (768 GB/s): unifying compute and memory in a single package
- Page Migration Engine (Unified Memory): simple parallel programming with virtually unlimited memory (see the sketch below)
[Diagram: two CPUs connected to Tesla P100 GPUs through PCIe switches, with a unified memory space spanning CPU and GPU.]
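The Page Migration Engine is what backs the Unified Memory bullet above: a single allocation is addressable from both CPU and GPU, and pages migrate on demand, so a working set can exceed physical GPU memory. A minimal CUDA sketch (illustrative only, not from the deck):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= a;
    }

    int main() {
        const int n = 1 << 20;
        float *x = nullptr;
        cudaMallocManaged(&x, n * sizeof(float));     // one pointer, visible to CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 1.0f;      // first touched on the CPU
        scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);  // pages migrate to the GPU on demand
        cudaDeviceSynchronize();
        printf("x[0] = %f\n", x[0]);                  // pages migrate back on CPU access
        cudaFree(x);
        return 0;
    }

The same pattern lets applications oversubscribe GPU memory on Pascal and later parts; cudaMemPrefetchAsync can stage data ahead of a kernel when the access pattern is known.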
14
GPU PERFORMANCE COMPARISON
- Training acceleration: P100 10 TOPS, V100 125 TOPS (12x)
- Inference acceleration: P100 21 TFLOPS, V100 125 TOPS (6x)
- FP64 / FP32: P100 5/10 TFLOPS, V100 7.8/15.7 TFLOPS (1.5x)
- HBM2 bandwidth: P100 720 GB/s, V100 900 GB/s (1.2x)
- NVLink bandwidth: P100 160 GB/s, V100 300 GB/s (1.9x)
- L2 cache: P100 4 MB, V100 6 MB (1.5x)
- L1 caches: P100 1.3 MB, V100 10 MB (7.7x)
15
VOLTA GV100
- 21B transistors, 815 mm²
- 80 SMs*, 5,120 CUDA Cores, 640 Tensor Cores
- 32 GB HBM2, 900 GB/s
- 300 GB/s NVLink
* The full GV100 chip contains 84 SMs.
16
VOLTA GV100 SM vs. PASCAL GP100 SM (Streaming Multiprocessor)

GV100 SM:
- FP32 units: 64
- FP64 units: 32
- INT32 units: 64
- Tensor Cores: 8
- Register file: 256 KB
- Unified L1 / shared memory: 128 KB
- Active threads: 2,048

GP100 SM:
- SMs per GPU: 56
- FP32 units: 64
- FP64 units: 32
- Tensor Cores: none
- Registers per SM: 256 KB
- Shared memory per SM: 64 KB
- L1 cache: 24 KB
18
TENSOR CORE: Mixed-Precision Matrix Math on 4x4 Matrices

D = A B + C

A and B are 4x4 FP16 matrices; the accumulator matrices C and D are 4x4 and may be FP16 or FP32.
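Written element-wise (the 4x4 layout described above), each Tensor Core operation computes

$$D_{i,j} \;=\; \sum_{k=0}^{3} A_{i,k}\,B_{k,j} \;+\; C_{i,j}, \qquad i,j \in \{0,1,2,3\}$$

with FP16 multiplications and FP16 or FP32 accumulation.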
19
BASIC CONCEPTS: VOLTA TRAINING METHOD
20
USING TENSOR CORES

Volta-optimized frameworks and libraries: NVIDIA cuDNN, cuBLAS, TensorRT

CUDA C++ warp-level matrix operations (the slide elides the fragment template arguments with "…"):

    __device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
    {
        wmma::fragment<matrix_a, …> Amat;           // 16x16 FP16 input tile A
        wmma::fragment<matrix_b, …> Bmat;           // 16x16 FP16 input tile B
        wmma::fragment<matrix_c, …> Cmat;           // accumulator tile
        wmma::load_matrix_sync(Amat, a, 16);        // 16 = leading dimension
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);            // start from a zero accumulator
        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);     // Cmat = Amat * Bmat + Cmat on Tensor Cores
        wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
    }

A compilable variant with the elided arguments filled in follows below.
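Below is a hedged, self-contained variant of the slide's listing with the elided "…" template arguments filled in from the public nvcuda::wmma API (my fill-in, not the original slide code). One warp computes a single 16x16x16 tile; compile for compute capability 7.0 or later (e.g. nvcc -arch=sm_70).

    #include <cstdio>
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp multiplies a pair of 16x16 FP16 tiles and accumulates into FP32.
    __global__ void wmma_16x16x16(float *d, const half *a, const half *b) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> Bmat;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;

        wmma::load_matrix_sync(Amat, a, 16);            // 16 = leading dimension
        wmma::load_matrix_sync(Bmat, b, 16);
        wmma::fill_fragment(Cmat, 0.0f);                // zero accumulator
        wmma::mma_sync(Cmat, Amat, Bmat, Cmat);         // Cmat = Amat * Bmat + Cmat on Tensor Cores
        wmma::store_matrix_sync(d, Cmat, 16, wmma::mem_row_major);
    }

    int main() {
        half *a, *b; float *d;
        cudaMallocManaged(&a, 256 * sizeof(half));
        cudaMallocManaged(&b, 256 * sizeof(half));
        cudaMallocManaged(&d, 256 * sizeof(float));
        for (int i = 0; i < 256; ++i) { a[i] = __float2half(1.0f); b[i] = __float2half(1.0f); }
        wmma_16x16x16<<<1, 32>>>(d, a, b);              // wmma ops are collective across one 32-thread warp
        cudaDeviceSynchronize();
        printf("d[0] = %.1f (expected 16.0)\n", d[0]);  // each element sums sixteen 1*1 products
        return 0;
    }

In production code these warp-level primitives are normally reached through cuDNN, cuBLAS, and TensorRT rather than written by hand.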
21
VOLTA: A GIANT LEAP FOR DEEP LEARNING
[Charts, images per second, P100 vs. V100: ResNet-50 training is 2.4x faster on V100 with Tensor Cores than P100 FP32; ResNet-50 inference (TensorRT, 7 ms latency) is 3.7x faster on V100 with Tensor Cores than P100 FP16. V100 measured on pre-production hardware.]
22
ANNOUNCING TESLA T4: WORLD'S MOST ADVANCED INFERENCE GPU
Universal Inference Acceleration
- 320 Turing Tensor Cores
- 2,560 CUDA cores
- 65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
- 16 GB | 320 GB/s
23
TURING
Turing: Up to 72 Streaming Multiprocessors (SM)
24
TURING
Per Streaming Multiprocessor (SM):
- 64 FP32 lanes
- 2 FP64 lanes
- 64 INT32 lanes
- 16 SFU lanes (transcendentals)
- 32 LD/ST lanes (Gmem/Lmem/Smem)
- 8 Tensor Cores
- 1 RT Core
- 4 TEX lanes

Up to 72 SMs per GPU (T4: 40 SMs).
[Diagram: SMs, each with its own L1, sharing an L2 cache and DRAM.]
NEW TURING TENSOR CORE
MULTI-PRECISION FOR AI INFERENCE
65 TFLOPS FP16 | 130 TeraOPS INT8 | 260 TeraOPS INT4
26
RT CORES
RT Cores accelerate ray tracing
• Hardware accelerated tracing of rays through the scene
• RT Core performance scales up with the Quadro RTX product family
• Applications access capabilities of RT Cores through OptiX, DXR, and Vulkan APIs
27
TESLA PRODUCTS DECODER
- K80: 2x GK210 | FP64 2.9 TFLOPS | FP32 8.7 TFLOPS | FP16 NA | 2x 12 GB GDDR5, 480 GB/s | PCIe Gen3 | ECC: internal + GDDR5 | PCIe dual slot | 300 W
- P100 (SXM2): GP100 | FP64 5.3 | FP32 10.6 | FP16 21.2 TFLOPS | 16 GB HBM2, 732 GB/s | NVLink + PCIe Gen3 | ECC: internal + HBM2 | SXM2 | 300 W
- P100 (PCIe): GP100 | FP64 4.7 | FP32 9.3 | FP16 18.7 TFLOPS | 16/12 GB HBM2, 732/549 GB/s | PCIe Gen3 | ECC: internal + HBM2 | PCIe dual slot | 250 W
- P40: GP102 | FP32 12 TFLOPS | INT8 47 TOPS | 24 GB GDDR5, 346 GB/s | PCIe Gen3 | ECC: GDDR5 | PCIe dual slot | 250 W
- P4: GP104 | FP32 5.5 TFLOPS | INT8 22 TOPS | 8 GB GDDR5, 192 GB/s | PCIe Gen3 | ECC: GDDR5 | PCIe low profile | 50-75 W
- V100 (PCIe): GV100 | FP64 7 | FP32 14 | FP16 112 TFLOPS | 16 GB HBM2, 900 GB/s | PCIe Gen3 | ECC: internal + HBM2 | PCIe dual slot | 250 W
- V100 (SXM2): GV100 | FP64 7.8 | FP32 15.7 | FP16 125 TFLOPS | 16 GB HBM2, 900 GB/s | NVLink + PCIe Gen3 | ECC: internal + HBM2 | SXM2 | 300 W
- V100 (FHHL): GV100 | FP64 6.5 | FP32 13 | FP16 105 TFLOPS | 16 GB HBM2, 900 GB/s | PCIe Gen3 | ECC: internal + HBM2 | PCIe single slot, full height half length | 150 W
28
SPECS OVERVIEW
- Tesla P100: GPU P100 | CC 6.0 | memory bandwidth 732 GB/s | FP32 10.0 TFLOPS | FP64 5.0 TFLOPS | FP16 20.0 TFLOPS | Tensor Cores: none | TDP 300 W
- Tesla P4: GPU P104 | CC 6.1 | 192 GB/s | FP32 5.5 | FP64 0.2 | FP16 0.12 TFLOPS | Tensor Cores: none | TDP 75 W
- Tesla V100: GPU V100 | CC 7.0 | 900 GB/s | FP32 15.5 | FP64 7.8 | FP16 31.1 TFLOPS | Tensor Cores: HMMA 124.5 TFLOPS | TDP 300 W
- Tesla T4: GPU TU104 | CC 7.5 | 320 GB/s | FP32 8.1 | FP64 0.25 | FP16 16.2 TFLOPS | Tensor Cores: HMMA 65 TFLOPS, IMMA 130 INT8 TOPS / 260 INT4 TOPS | TDP 70 W
29
TESLA V100 (32 GB) VS P100
Tesla V100 Tensor Cores with CUDA 9 deliver a 9x performance improvement on GEMM operations (tested on a pre-production Tesla V100 with CUDA 9 software).
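In practice this GEMM speedup is usually reached through cuBLAS rather than hand-written kernels. A hedged sketch (function name and dimensions are illustrative, not from the deck) of an FP16-in / FP32-out GEMM that opts in to Tensor Core math on CUDA 9/10-era cuBLAS:

    #include <cublas_v2.h>
    #include <cuda_fp16.h>

    // C (m x n, FP32) = A (m x k, FP16) * B (k x n, FP16), column-major.
    void gemm_tensor_core(cublasHandle_t handle, int m, int n, int k,
                          const half *A, const half *B, float *C) {
        const float alpha = 1.0f, beta = 0.0f;
        cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);   // allow Tensor Core kernels
        cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                     &alpha, A, CUDA_R_16F, m,
                             B, CUDA_R_16F, k,
                     &beta,  C, CUDA_R_32F, m,
                     CUDA_R_32F,                            // accumulate in FP32
                     CUBLAS_GEMM_DEFAULT_TENSOR_OP);
    }

Tensor Core paths are picked only when the problem shape and data types allow it; dimensions that are multiples of 8 give cuBLAS the best chance of selecting them.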
30
DGX COMPUTING PLATFORM
31
DGX PRODUCT FAMILY
- DGX Station: the fastest personal supercomputer for researchers and data scientists
- DGX-1: the essential instrument of AI research in the data center
- DGX-2: the world's most powerful deep learning system for the most complex deep learning challenges
32
NVIDIA DGX-1 WITH VOLTA: Highest-Performance, Fully Integrated HW System
- 1 PetaFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink hybrid cube mesh
- 2x Xeon | 7 TB SSD RAID 0 | quad IB/Ethernet 100 Gbps, dual 10 GbE | 3U, 3,200 W
33
VOLTA NVLINK: 300 GB/sec
- 50% more links
- 28% faster signaling
34
DRAMATIC BOOST IN ACCURACY WITH LARGER, MORE COMPLEX MODELS
[Chart: V100 (16 GB) fits VGG-16 (16 layers); V100 (32 GB) fits ResNet-152 (152 layers). More complex models now possible, with a dramatic boost in accuracy: 40% reduced error rate.]
SAP Brand Impact on DGX-1 (32 GB) for object detection. Dataset: Winter Sports 2018 campaign; high-definition images (1920x1080).
35
FASTER RESULTS ON COMPLEX DL AND HPC
Up to 50% Faster Results With 2x the Memory

Faster results:
- Neural machine translation (NMT): 1.5x faster language translation (1.2 vs. 0.8 steps/sec)
- 3D FFT (1k x 1k x 1k): 1.5x faster calculations (3.8 vs. 2.5 TF)

Higher accuracy:
- 1.4x lower error rate: 85% accuracy with ResNet-152 (152 layers) vs. 75% with VGG-16 (16 layers)

Higher resolution:
- 4x higher resolution for GAN image-to-image generation (1024x1024 vs. 512x512), e.g. unsupervised image translation that converts an input winter photo to summer

Footnotes:
- Dual E5-2698 v4 server, 512 GB DDR4, Ubuntu 16.04, CUDA 9, cuDNN 7 | NMT is GNMT-like and run with the TensorFlow NGC container 18.01 (batch size 128 for 16 GB, 256 for 32 GB) | FFT is cufftbench 1k x 1k x 1k, comparing 2x V100 16 GB (DGX-1V) vs. 2x V100 32 GB (DGX-1V)
- NVIDIA customer R-CNN for object detection at 1080p with Caffe | V100 16 GB uses VGG-16 | V100 32 GB uses ResNet-152
- GAN by NVIDIA Research (https://arxiv.org/pdf/1703.00848.pdf) | V100 16 GB and V100 32 GB with FP32
36
30% BETTER PERFORMANCE WITH NVLINK THAN PCIE
• Encoder and decoder embedding size of 512
• Batch size of 256 per GPU
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
37
2.54X BETTER PERFORMANCE WITH NVLINK
• Performance benefits increase with increasing encoder/decoder embedding size
• Sockeye neural machine translation single-precision training
• NVIDIA DGX containers version 17.11, processing real data with cuDNN 7.0.4, NCCL 2.1.2
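The NVLink gains in these NMT runs come through NCCL (version 2.1.2 above), which routes collective traffic over the fastest available interconnect. A minimal single-process all-reduce sketch across the GPUs visible to the process (illustrative buffer sizes, not the benchmark code):

    #include <cstdio>
    #include <vector>
    #include <cuda_runtime.h>
    #include <nccl.h>

    int main() {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);

        std::vector<ncclComm_t> comms(ndev);
        std::vector<float*> buf(ndev);
        std::vector<cudaStream_t> streams(ndev);
        const size_t count = 1 << 20;                      // 1M floats per GPU

        ncclCommInitAll(comms.data(), ndev, nullptr);      // one communicator per visible GPU
        for (int i = 0; i < ndev; ++i) {
            cudaSetDevice(i);
            cudaMalloc(&buf[i], count * sizeof(float));
            cudaMemset(buf[i], 0, count * sizeof(float));  // dummy data standing in for gradients
            cudaStreamCreate(&streams[i]);
        }

        // Sum the buffers across all GPUs; NCCL picks NVLink paths where available.
        ncclGroupStart();
        for (int i = 0; i < ndev; ++i)
            ncclAllReduce(buf[i], buf[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
        ncclGroupEnd();

        for (int i = 0; i < ndev; ++i) { cudaSetDevice(i); cudaStreamSynchronize(streams[i]); }
        for (int i = 0; i < ndev; ++i) { ncclCommDestroy(comms[i]); cudaFree(buf[i]); }
        printf("all-reduce across %d GPUs done\n", ndev);
        return 0;
    }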
38
3.1X FASTER ON DGX-1 V100 THAN DGX-1 P100
39
DGX STATION
40
INTRODUCING NVIDIA DGX STATION: Groundbreaking AI at Your Desk
The Fastest Personal Supercomputer for Researchers and Data Scientists
- Revolutionary form factor: designed for the desk, whisper-quiet
- Start experimenting in hours, not weeks, powered by the DGX stack
- Productivity that goes from desk to data center to cloud
- Breakthrough performance and precision, powered by Volta
41
INTRODUCING NVIDIA DGX STATION: The Personal AI Supercomputer for Researchers and Data Scientists

Key features:
1. 4x NVIDIA Tesla V100 GPUs (now 32 GB)
2. 2nd-generation NVLink (4-way)
3. Water-cooled design
4. 3x DisplayPort (4K resolution)
5. Intel Xeon E5-2698 v4, 20-core
6. 256 GB DDR4 RAM
42
NVIDIA DGX STATION SPECIFICATIONS

At a glance:
- GPUs: 4x NVIDIA Tesla V100
- TFLOPS (GPU FP16): 500
- GPU memory: 32 GB per GPU
- NVIDIA Tensor Cores: 2,560 (total)
- NVIDIA CUDA cores: 20,480 (total)
- CPU: Intel Xeon E5-2698 v4, 2.2 GHz (20-core)
- System memory: 256 GB RDIMM DDR4
- Storage: data: 3x 1.92 TB SSD RAID 0; OS: 1x 1.92 TB SSD
- Network: dual 10GBASE-T LAN (RJ45)
- Display: 3x DisplayPort, 4K resolution
- Additional ports: 2x eSATA, 2x USB 3.1, 4x USB 3.0
- Acoustics: < 35 dB
- Maximum power requirements: 1,500 W
- Operating temperature range: 10-30 °C

Software:
- Ubuntu Desktop Linux OS
- DGX Recommended GPU Driver
- CUDA Toolkit
43
NVIDIA DGX-2: THE WORLD'S MOST POWERFUL DEEP LEARNING SYSTEM FOR THE MOST COMPLEX DEEP LEARNING CHALLENGES
- First 2 PFLOPS system
- 16 V100 32GB GPUs, fully interconnected
- NVSwitch: 2.4 TB/s bisection bandwidth
- 24x GPU-GPU bandwidth
- 0.5 TB of unified GPU memory
- 10x deep learning performance
44
DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512 GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
4. Eight EDR InfiniBand/100 GigE: 1,600 Gb/sec total bidirectional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 Gb/sec Ethernet
45
NVSWITCH: WORLD'S HIGHEST BANDWIDTH ON-NODE SWITCH
7.2 Terabits/sec or 900 GB/sec
18 NVLINK ports | 50GB/s per port bi-directional
Fully-connected crossbar
2 billion transistors | 47.5mm x 47.5mm package
46
NVSWITCH ENABLES THE WORLD'S LARGEST GPU
16 Tesla V100 32GB Connected by New NVSwitch
2 petaFLOPS of DL Compute
Unified 512GB HBM2 GPU Memory Space
300GB/sec Every GPU-to-GPU
2.4TB/sec of Total Cross-section Bandwidth
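At the CUDA level the NVSwitch fabric shows up as peer access between devices, so one GPU can read or copy another GPU's memory directly. A hedged sketch of a device-to-device copy (device IDs and sizes are illustrative):

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 256 << 20;                    // 256 MB test buffer
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);         // can GPU 0 reach GPU 1 directly?

        float *src = nullptr, *dst = nullptr;
        cudaSetDevice(0); cudaMalloc(&src, bytes);
        cudaSetDevice(1); cudaMalloc(&dst, bytes);

        if (canAccess) {
            cudaSetDevice(0);
            cudaDeviceEnablePeerAccess(1, 0);              // map GPU 1's memory into GPU 0's address space
        }
        cudaMemcpyPeer(dst, 1, src, 0, bytes);             // goes over NVLink/NVSwitch when peer access is possible
        cudaDeviceSynchronize();
        printf("peer access 0 -> 1: %s\n", canAccess ? "direct" : "staged through host");
        return 0;
    }

With NVSwitch every GPU pair reports peer capability, which is what lets frameworks treat the 16 GPUs as one large memory space.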
47
SOFTWARE SYSTEM
48
Virtual Machine vs. Container (not so similar)
Docker vs. VM
49
COMMON SOFTWARE STACK ACROSS DGX FAMILY
- Single, unified stack for deep learning frameworks
- Predictable execution across platforms
- Pervasive reach
(DGX Station, DGX-1, DGX-2, NVIDIA GPU Cloud, cloud service providers)
50
NGC GPU-OPTIMIZED DEEP LEARNING CONTAINERS
NVCaffe
Caffe2
Microsoft Cognitive Toolkit (CNTK)
DIGITS
MXNet
PyTorch
TensorFlow
Theano
Torch
CUDA (base level container for developers)
NEW! – NVIDIA TensorRT inference accelerator with ONNX support
A Comprehensive Catalog of Deep Learning Software
51
VIRTUAL WORKSTATIONS AND PCs
Overview of common GPU solutions: software-virtualized GPU, GPU pass-through, GPU sharing, and vGPU.
[Diagrams: many user sessions sharing one GPU through a single driver (e.g. Windows Server + XenApp); VMs with their own drivers assigned dedicated GPUs through a hypervisor (pass-through); VMs sharing physical GPUs through a hypervisor; and VMs each given a vGPU slice of one physical GPU through the hypervisor.]
GPU VIRTUALIZATION (vGPU): ONE-TO-MANY PARTITIONING OF A GPU
How is a vGPU sliced on a virtualization platform?
[Diagram: a server with physical GPUs partitioned into virtual GPUs; each VM runs the NVIDIA driver and is assigned one vGPU. Examples show quarter (1/4) and one-eighth (1/8) slicing.]
vGPU slicing characteristics: a single GPU cannot mix multiple slice types, and a VM's vGPU resources are released when the VM is shut down.
HISTORY OF NVIDIA GPU VIRTUALIZATION SOLUTIONS: ADDING VALUE THROUGH SOFTWARE UPDATES
- GRID 1.x (2013.12): Kepler architecture; K1, K2 GPUs; 8:1 vGPU; OGL/DX; XenServer 6.2 SP1
- GRID 2.x (2015.8) and GRID 3.x (2016.4): Maxwell architecture; M6, M60 GPUs; 16:1 vGPU; OGL/DX; VMware vSphere 6.0; Huawei UVP; vApp licensing model introduced (vApp, vPC, vWS, later vWS ext); Windows 10 support; 4K display support; DX12 support
- GRID 4.x (2016.8): Maxwell architecture; M10 GPU; host- and VM-level resource monitoring; Citrix Desktop Director support; Windows Server 2016 VM support; vSphere 6.5 support; XenServer 7.1/7.2 support; RHEL KVM GPU pass-through solution
- Virtual GPU 5.x (2017.12): Pascal architecture, renamed Virtual GPU; P4, P6, P40, P100; CUDA/OGL/DX; 2x performance improvement; 24:1 vGPU; Nutanix KVM; Linux hardware-encoding support; application-level monitoring; licensing model (vApp, vPC, vDWS); VMware vROps support; two new GPU scheduling modes; license HA deployment mode; support for more than 1 TB of memory
- Virtual GPU 6.x (2018.10): Volta architecture; V100 16/32 GB PCIe/SXM2 GPU support; 32:1 vGPU; new support for RHEL 7.5 / RHV 4.2 KVM, Sangfor VMP, and H3C CAS KVM; vGPU motion (live migration); vPC supports 2 GB of frame buffer; vPC supports Linux OS; vPC supports two 4K displays or four HD displays
NVIDIA VIRTUAL GPU 7.1: A PLATFORM FOR GRAPHICS, COMPUTE, AND AI
- NVIDIA Tesla (data center GPUs)
- NVIDIA Virtual GPU software
- Support, updates, and maintenance
VIRTUAL GPU (vGPU) 7.X NEW FEATURES: Unprecedented Performance & Manageability
- Multi-vGPU support: world's most powerful Quadro vDWS
- vMotion support for vGPU: live migration of vGPU-enabled VMs (Quadro vDWS & GRID)
- Tesla T4 GPU support*: latest-generation Turing (Quadro vDWS)
- NGC with vGPU: available with vGPU (Quadro vDWS)
* Tesla T4 support coming with the vGPU software 7.1 release
A BROADER CHOICE OF GPUs
Supports the full Tesla P, V, and T product lines, suited to different user scenarios (performance-optimized, density-optimized, and blade-optimized).

- V100: 1 GPU/board (Volta); 5,120 CUDA cores; 32 GB / 16 GB HBM2; vGPU profiles 1, 2, 4, 8, 16, 32 GB; PCIe 3.0 dual slot & SXM2 (rack servers); 250 W / 300 W; passive
- P100: 1 GPU/board (Pascal); 3,584 CUDA cores; 16 GB HBM2; profiles 1, 2, 4, 8, 16 GB; PCIe 3.0 dual slot (rack servers); 250 W; passive
- P40: 1 GPU/board (Pascal); 3,840 CUDA cores; 24 GB GDDR5; profiles 1, 2, 3, 4, 6, 8, 12, 24 GB; PCIe 3.0 dual slot (rack servers); 250 W; passive
- P4: 1 GPU/board (Pascal); 2,560 CUDA cores; 8 GB GDDR5; profiles 1, 2, 4, 8 GB; PCIe 3.0 single slot (rack servers); 75 W; passive
- T4: 1 GPU/board (Turing); 2,560 CUDA cores; 16 GB GDDR6; profiles 1, 2, 4, 8, 16 GB; PCIe 3.0 single slot (rack servers); 70 W; passive
- M60: 2 GPUs/board (Maxwell); 4,096 CUDA cores (2,048 per GPU); 16 GB GDDR5 (8 GB per GPU); profiles 0.5, 1, 2, 4, 8 GB; PCIe 3.0 dual slot (rack servers); 300 W (225 W opt); active/passive
- M10: 4 GPUs/board (Maxwell); 2,560 CUDA cores (640 per GPU); 32 GB GDDR5 (8 GB per GPU); profiles 0.5, 1, 2, 4, 8 GB; PCIe 3.0 dual slot (rack servers); 225 W; passive
- M6: 1 GPU/board (Maxwell); 1,536 CUDA cores; 8 GB GDDR5; profiles 0.5, 1, 2, 4, 8 GB; MXM (blade servers); 100 W (75 W opt); bare board
- P6: 1 GPU/board (Pascal); 2,048 CUDA cores; 16 GB GDDR5; profiles 1, 2, 4, 8, 16 GB; MXM (blade servers); 90 W; bare board
THANKS