IBM Leading High Performance Computing and Deep Learning Technologies Yubo Li ( 李玉博) Chief Architect, GPU on Cloud IBM Research -- China email: [email protected] QQ: 395238640 GTC China 2016 Sept. 13, 2016


Page 1: IBM Leading High Performance Computing and Deep Learning

IBM Leading High Performance Computing and Deep Learning Technologies

Yubo Li (李玉博)
Chief Architect, GPU on Cloud
IBM Research -- China
email: [email protected] QQ: 395238640

GTC China 2016
Sept. 13, 2016

Page 2: IBM Leading High Performance Computing and Deep Learning

New Generation in IBM’s Eyes

Accelerator on Cloud: Speed Up DL and Cognitive APIs on Cloud
• GPUs available in the datacenter
• Optimized accelerator performance: NVLink

Accelerator Virtualization: Virtualization to Make Accelerators Eligible for Cloud
• GPU passthrough (PCI passthrough)
• Hardware virtualization (NVIDIA GRID)
• Container sharing (nvidia-docker)

Accelerator Management and Monitoring: Optimize Throughput and Operation
• Accelerator management
• Optimized scheduling
• Metric collection
• Monitor and alarm
• Software enhancement
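The metric collection and alarm items could look roughly like the sketch below. It is an illustration only, not the SuperVessel implementation. It parses CSV text in the shape printed by `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The `GpuMetric` type, the function names, the sample readings, and the 90% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GpuMetric:
    index: int
    util_percent: float
    mem_used_mib: float
    mem_total_mib: float

    @property
    def mem_percent(self) -> float:
        # Share of GPU memory currently in use, in percent.
        return 100.0 * self.mem_used_mib / self.mem_total_mib


def parse_metrics(csv_text: str) -> List[GpuMetric]:
    """Parse one 'index, util, mem_used, mem_total' row per GPU."""
    metrics = []
    for line in csv_text.strip().splitlines():
        idx, util, used, total = [field.strip() for field in line.split(",")]
        metrics.append(GpuMetric(int(idx), float(util), float(used), float(total)))
    return metrics


def alarms(metrics: List[GpuMetric], mem_threshold_percent: float = 90.0) -> List[int]:
    """Return the indices of GPUs whose memory use reaches the threshold."""
    return [m.index for m in metrics if m.mem_percent >= mem_threshold_percent]


# Hypothetical readings for two GPUs (not measured data).
SAMPLE = """\
0, 35, 1024, 16384
1, 98, 15800, 16384
"""

if __name__ == "__main__":
    readings = parse_metrics(SAMPLE)
    print("GPUs over threshold:", alarms(readings))
```

On a real host the CSV text would come from running the `nvidia-smi` command above, for example through `subprocess.run`, instead of the hard-coded sample.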


Page 3: IBM Leading High Performance Computing and Deep Learning

POWER8 with NVLink

2.5x Faster CPU-GPU Data Communication via NVLink

[Figure: two server layouts, each with two CPUs and four NVIDIA P100 Pascal GPUs.]
• POWER8 NVLink Server: P8 CPUs connect to GPUs over NVLink at 80 GB/s
• x86 Servers with PCIe: x86 CPUs connect to GPUs over PCIe at 32 GB/s
• No NVLink between CPU & GPU for x86 servers: PCIe bottleneck

822LC Power System for HPC: First Custom-Built GPU Accelerator Server with NVLink

• Custom-built GPU accelerator server
• High-speed NVLink connections between CPUs & GPUs and among GPUs
• Features the novel NVIDIA P100 Pascal GPU accelerator
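The 2.5x figure on this slide follows directly from the two quoted bandwidths (80 GB/s over NVLink versus 32 GB/s over PCIe). The short sketch below works through that arithmetic and applies it to an idealized transfer. The 16 GB payload is a hypothetical example, and the model ignores latency and protocol overhead, so real transfers will be slower than these numbers.

```python
NVLINK_GB_PER_S = 80.0  # CPU-GPU bandwidth quoted on the slide for POWER8 + NVLink
PCIE_GB_PER_S = 32.0    # CPU-GPU bandwidth quoted on the slide for x86 + PCIe


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized transfer time: size divided by peak bandwidth."""
    return size_gb / bandwidth_gb_per_s


speedup = NVLINK_GB_PER_S / PCIE_GB_PER_S

if __name__ == "__main__":
    size_gb = 16.0  # hypothetical payload, chosen only for illustration
    print(f"speedup: {speedup:.1f}x")
    print(f"NVLink: {transfer_seconds(size_gb, NVLINK_GB_PER_S):.2f} s")
    print(f"PCIe:   {transfer_seconds(size_gb, PCIE_GB_PER_S):.2f} s")
```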


Page 4: IBM Leading High Performance Computing and Deep Learning

More Compelling for CPU and GPU Integration
Far easier to create new applications on Tesla P100 + POWER8 with NVLink

NVIDIA Page Migration Engine ensures unified memory space
• Unified memory: address space spans CPU and GPU, 1TB+
• Hardware managed transfers: eliminates explicit data transfers
• Code base stays close to parallel CPU code

POWER8 with NVLink ensures speedy data throughput
• 1TB memory space requires faster CPU:GPU data movement
• Bus masks transfer times

Barriers to Entry Removed
• Too large a memory space required
• Too complicated to move data
• Moves too much data
• Too much custom coding for GPU data movement
• Software UVM feature too limiting
• Requires page faulting support
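To make the "moves too much data" barrier concrete, here is a toy model that compares the bytes moved by an explicit whole-buffer copy with the bytes moved when pages migrate only on first touch. It is a simplified sketch of demand paging in general, not a description of how the Page Migration Engine actually behaves. The 64 KiB page size, the buffer size, and the access pattern are all assumptions.

```python
from typing import Iterable

PAGE_BYTES = 64 * 1024  # assumed page size for the toy model


def explicit_copy_bytes(buffer_bytes: int) -> int:
    """Explicit model: the whole buffer is copied to the GPU before use."""
    return buffer_bytes


def on_demand_bytes(touched_offsets: Iterable[int], page_bytes: int = PAGE_BYTES) -> int:
    """On-demand model: each page moves once, the first time it is touched."""
    touched_pages = {offset // page_bytes for offset in touched_offsets}
    return len(touched_pages) * page_bytes


if __name__ == "__main__":
    buffer_bytes = 1024 * PAGE_BYTES  # 64 MiB buffer
    # Hypothetical sparse access pattern: one read in every 16th page.
    offsets = range(0, buffer_bytes, 16 * PAGE_BYTES)
    print("explicit copy:", explicit_copy_bytes(buffer_bytes), "bytes")
    print("on demand:    ", on_demand_bytes(offsets), "bytes")
```

With this sparse pattern the on-demand model moves one sixteenth of the data. A kernel that touches every page would move the same amount under both models.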


Page 5: IBM Leading High Performance Computing and Deep Learning

SuperVessel: OpenStack Based Cognitive Cloud

SuperVessel provides services at multiple layers.

[Figure: SuperVessel service stack. Labels shown:]
• Layers: Infrastructure as Service, Platform layer service, Super Marketplace (Accelerators, Images, Applications)
• Services: Computing service, Data store service, Network service, Big Data Service, Cloud Data Service, IoT Development Service, Accelerator DevOps Service, Cognitive Computing Service, Accelerator service

VisionBrain: Deep Learning as a Service with NVIDIA GPU

Try SuperVessel: https://www.ptopenlab.com/

Page 6: IBM Leading High Performance Computing and Deep Learning

SuperVessel Cloud Infrastructure

[Figure: SuperVessel cloud infrastructure architecture. Labels shown:]
• Platform Management System: User account & authentication management, User dashboard, Admin dashboard, Virtual point management, Statistic and analysis
• Services for cloud administration (Services layer): System maintenance, System monitoring, Resource metering, System analysis, Baremetal management, Image management, GPU scheduler, Auto Provision
• OpenStack controller (HA): Horizon, Nova, Neutron, Glance, Cinder, HEAT, Senlin, Ironic, Swift, Keystone
• Resource pools: Container pool for POWER7 LPAR, KVM pool for POWER8 LE/BE, Container pool for POWER8 LE/BE, KVM pool for x86, Container pool for x86
• Per-pool stacks (each with Nova, Neutron, Cinder): LxC/Docker, KVM, KVM/GPU Passthrough, Docker/GPU Sharing
• Storage: Distributed file system / shared file system
• Hardware: IBM POWER server, x86 server, GPU


Page 7: IBM Leading High Performance Computing and Deep Learning

Heterogeneous Computing for Cognitive Cloud

[Figure: two-stage workflow. Labels shown:]

Training (development) Stage
• Inputs: Train Data Set, DNN Net File
• Big data platform (Hadoop, Spark)
• Data Management: Data Cleansing, Feature Engineering, Modeling
• Deep Learning platform (Caffe, Torch, Theano, TensorFlow, etc.)
• CPU + GPU cluster
• Output: Trained model, stored in the Model pool

Recognition (deployment) Stage
• Input: Application Data from User
• Deep Learning platform
• Application servers, DB service, messaging, etc.
• CPU + GPU cluster
• Application: Recognition, classification


Page 8: IBM Leading High Performance Computing and Deep Learning

Cognitive Computing on SuperVessel


Try it at: https://dashboard.ptopenlab.com/computing/

• Cognitive Infrastructure Service

• Cognitive Computing Service

• Cognitive Solution and Demo Service

Page 9: IBM Leading High Performance Computing and Deep Learning

GPU Service and GPU Accelerated Deep Learning

SuperVessel provides a GPU sharing service by extending OpenStack and Docker capabilities. It is the first GPU sharing service in the public cloud.

• Users can apply for a Docker instance on SuperVessel

• Users can apply for a deep learning development environment on SuperVessel, e.g. Caffe, Torch, Theano, and TensorFlow.

• Every DL environment is automatically assigned GPU resources for acceleration.
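Automatic GPU assignment with sharing could be sketched as a small allocator like the one below. This is a hypothetical illustration, not the SuperVessel scheduler: the `GpuPool` class, the least-loaded placement policy, and the `max_share` limit per GPU are assumptions chosen to keep the example short.

```python
from typing import Dict, List, Optional


class GpuPool:
    """Toy allocator: each GPU can be shared by up to `max_share` containers."""

    def __init__(self, gpu_ids: List[int], max_share: int = 2) -> None:
        self.max_share = max_share
        self.assignments: Dict[int, List[str]] = {gpu: [] for gpu in gpu_ids}

    def assign(self, container: str) -> Optional[int]:
        """Place the container on the least-loaded GPU, or return None if full."""
        if not self.assignments:
            return None
        # Ties are broken by the lowest GPU id.
        gpu = min(self.assignments, key=lambda g: (len(self.assignments[g]), g))
        if len(self.assignments[gpu]) >= self.max_share:
            return None
        self.assignments[gpu].append(container)
        return gpu

    def release(self, container: str) -> None:
        """Free the slot held by a container, if it holds one."""
        for users in self.assignments.values():
            if container in users:
                users.remove(container)


if __name__ == "__main__":
    pool = GpuPool([0, 1], max_share=2)
    for name in ["caffe-a", "torch-b", "theano-c"]:  # hypothetical container names
        print(name, "->", pool.assign(name))
```

A least-loaded policy spreads containers across GPUs before doubling up. A production scheduler would also need to account for GPU memory and utilization.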


Page 10: IBM Leading High Performance Computing and Deep Learning

Cognitive Innovation Services Exposed to Bluemix China

• SuperVessel team built a new cloud site in 21Vianet

• All the highlighted services run on the cloud with SuperVessel technology

• GPUs are used to accelerate the deep learning service


Page 11: IBM Leading High Performance Computing and Deep Learning

GPU Enhancement on Container Cloud

• GPU support on Mesos/Marathon/Kubernetes
  • GPU scheduler
  • GPU exposure/isolation with containers
  • GPU auto-discovery
  • GPU driver volume injection
  • GPU metrics collection

• Community activities
  • Main contributor for GPU enablement on Mesos/Marathon
  • Demos/Presentations at several conferences
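Auto-discovery, device exposure, and driver volume injection can be illustrated with a small helper that builds a `docker run` command line. This is a sketch under stated assumptions, not the Mesos/Marathon integration itself. It relies on Docker's real `--device` and `-v` flags and on the usual NVIDIA device node names (`/dev/nvidiaN`, `/dev/nvidiactl`, `/dev/nvidia-uvm`). The function names, the image name, the host driver directory, and the `/usr/local/nvidia` mount point are assumptions.

```python
import re
from typing import Iterable, List

_GPU_NODE = re.compile(r"^nvidia(\d+)$")
CONTROL_DEVICES = ["/dev/nvidiactl", "/dev/nvidia-uvm"]


def discover_gpus(dev_entries: Iterable[str]) -> List[int]:
    """Return the GPU indices found among /dev entry names such as 'nvidia0'."""
    found = []
    for name in dev_entries:
        match = _GPU_NODE.match(name)
        if match:
            found.append(int(match.group(1)))
    return sorted(found)


def docker_run_args(image: str, gpus: List[int], driver_dir: str) -> List[str]:
    """Build a docker command that exposes only the selected GPUs."""
    args = ["docker", "run", "--rm"]
    # Isolation: only the listed GPU device nodes are passed into the container.
    for dev in CONTROL_DEVICES + [f"/dev/nvidia{i}" for i in gpus]:
        args += ["--device", dev]
    # Driver injection: mount the host driver files read-only (assumed paths).
    args += ["-v", f"{driver_dir}:/usr/local/nvidia:ro", image]
    return args


if __name__ == "__main__":
    entries = ["nvidia0", "nvidia1", "nvidiactl", "nvidia-uvm", "sda"]  # sample /dev listing
    print(discover_gpus(entries))
    print(" ".join(docker_run_args("caffe:latest", [1], "/opt/nvidia-driver")))
```

On a real host, `discover_gpus(os.listdir("/dev"))` would supply the device list, and the returned argument list could be passed to `subprocess.run`.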


Page 12: IBM Leading High Performance Computing and Deep Learning

Container based Cognitive Service Infrastructure

[Figure: layered architecture. Labels shown, top to bottom:]
• Deep Learning Application: Data/Task Management, User UI, Inference API, Data pre-processing, DL Training, DL Inference
• Management/Interface: User Authentication, Data Persistence, Task Status Monitoring, Cluster Monitoring, Shared FS
• Resource Management/Orchestration: Mesos, Marathon, Docker Container
• Infrastructure: SuperVessel IaaS
• Hardware Resources: Storage (Disk), Compute (CPU, GPU), Memory


Page 13: IBM Leading High Performance Computing and Deep Learning

GPU Accelerated Spark Components (Current and Future)

Spark Infrastructure (1.6.1+), DataFrames Interface

[Figure: six ComputeNodes, each with two GPUs.]

Component areas: Analytics, Machine Learning, Deep Learning, GraphX, Spark SQL
• Logistic Regression, ADMM, Recommendation/ALS, Elastic Net, NNMF and PCA, SVM, Random Forest
• Word2Vec, Nearest Neighbor/LSH, Tensor, Gradient Descent/EAGD
• Spark SQL OLAP, BFS/DFS, Link Prediction

https://github.com/IBMSparkGPU/SparkGPU, CUDA-MLlib

Page 14: IBM Leading High Performance Computing and Deep Learning

Questions?

Contact me on WeChat:
