HETEROGENEOUS ARCHITECTURES: A SURVEY AND OVERVIEW FOR DEVELOPERS
MAZHAR MEMON
CTO, BITFUSION.IO
[Figure: over time, software trends toward abstract and slow while hardware trends toward complex and fast]
Delivering performance and efficiency to today's applications is becoming more difficult

The problem in computing

The software world is increasingly abstract
Transistor scaling is ending
Moore's law slowing -> complexity
Era of frequency → Era of multi-core → Era of many-core
The solution(s)

• Hardware: specialized hardware is required to keep up with the accelerated performance curve; encourage accessibility with low hourly pricing
• Software: abstractions (libraries, APIs, tool chain up to the compiler IR; use translations where possible) and ecosystem (learning materials, user groups, university engagement)
• What makes this happen: developers
The remainder of this talk is about the hardware out there and how to develop for it
Current State of Developer Experience for Accelerators
- Update to the right operating system
- Install vendor tool-flows, which only work on specific operating systems
- Set up the environment and licenses
- Install the board
- Set up the board
- Read numerous pages of documentation

Unhappy developer experience ☹

In many cases developers give up before even starting real work due to this poor developer experience
Overview of available compute devices
…from easiest to hardest
Integrated GPUs

• Architecture: SIMD, shared-resource architecture
• Targeted workloads: medium-sized offloads; latency-sensitive, cost-sensitive, media
• Programming models: OpenCL, DirectCompute, C++ AMP, SPIR, HSAIL
• Ecosystem maturity: high
• Links: https://software.intel.com/en-us/articles/intel-graphics-developers-guides
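The SIMD model these devices expose can be sketched without any GPU at all: the same kernel runs once per element index, and the runtime fans those invocations out across parallel lanes. The pure-Python sketch below is only an analogy for how an OpenCL-style work-item executes; `saxpy_kernel` and `launch` are illustrative names, not a real API.

```python
# Illustrative only: mimics a SIMD "kernel" applied element-wise,
# the way a GPU work-item runs once per index of the input buffer.
def saxpy_kernel(i, a, x, y, out):
    # Each "work-item" i computes one element: out[i] = a * x[i] + y[i]
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    # A real runtime schedules these n invocations across SIMD lanes;
    # here we simply loop to show the programming model.
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0]
```

The point of the exercise: the kernel has no loop of its own, so the same code scales from one lane to thousands.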
Discrete GPUs

• Architecture: SIMD, discrete coprocessor configuration
• Targeted workloads: large offloads; throughput-sensitive, parallel structured
• Programming models: CUDA, OpenCL, DirectCompute, C++ AMP, SYCL, SPIR, HSA
• Ecosystem maturity: high
• Links: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux
MICs

• Architecture: many general-purpose cores, (co)processor configuration
• Targeted workloads: large offloads; throughput-sensitive, generic HPC
• Programming models: OpenCL, OpenMP, MPI, general x86
• Ecosystem maturity: high
• Links: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide
FPGAs

• Architecture: LUTs + HPs + fabric, coprocessor configuration
• Targeted workloads: extreme pipelining or fanout, systolic, fast configuration(?)
• Programming models: VHDL, Verilog, HLS, OpenCL
• Ecosystem maturity: medium
• Links:
• https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.highResolutionDisplay.html
• http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
Automata

• Architecture: NFA with programmable fabric
• Targeted workloads: MISD, pattern matching, parallel unstructured
• Programming models: API, ANML, regexp
• Ecosystem maturity: low
• Links: http://micronautomata.com/
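Since plain regular expressions are one of the listed programming models, a tiny example shows the kind of parallel, unstructured pattern matching these devices target. Here Python's `re` module stands in for the hardware NFA; the patterns and input stream are made up for illustration.

```python
import re

# An NFA-style device matches many patterns against a byte stream at once;
# combining alternatives into one compiled regex is a rough software stand-in.
patterns = re.compile(r"(GET|POST)\s+/\S*|ERROR\s+\d+")

stream = "GET /index.html HTTP/1.1 ... ERROR 404 ..."
matches = [m.group(0) for m in patterns.finditer(stream)]
print(matches)  # ['GET /index.html', 'ERROR 404']
```

On an automata processor every pattern runs concurrently in one pass over the stream, which is what makes the MISD label apt; software regex engines approximate that only for small pattern sets.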
Enabling developers: accessibility is still a problem
Vision

To bring supercomputing to the masses by:
◦ building software to automatically realize the benefits of heterogeneous hardware
Enabling scaling automatically

Horizontal scaling with BF Boost remoting technology
Vertical scaling with BF Boost splitting technology
Heterogeneous scaling with BF Boost interception technology

[Diagram: CPU system connected to GPU system]
3X Machine learning with Caffe, Torch: 2 local vs. 8 remote GPUs
3.5X Rendering with Blender: 1 local vs. 4 remote GPUs
20X Rendering with Blender: 4 remote GPUs
8X Image Processing with ImageMagick: 1 vs. 12 local GPUs
10X Computer Vision (face detect) with OpenCV: 12 CPU cores vs. 4 GPUs
7X Computational Science with NAMD: 2 remote GPUs
Bitfusion Tech: Remote Virtualization
Features
• Scale-out: connect one server to many accelerators to boost performance
• Scale-in: connect many servers to few accelerators to pool resources and lower cost
• Service discovery: local and remote machines can discover each other on demand without complex or time-consuming configuration
• Virtual pools: segment resources by class of users or hardware

Remote virtualization enables varied virtual configurations by combining or sharing the resources of local and remote servers

• Binary-level API interception
• Distribute work across local and remote machines
• Advanced performance features including synchronization elision and data pipelining

[Diagram — system view: an application on a local server dispatches to remote servers; application view: the application sees a single virtual server with combined resources]

• Software sees all new hardware as if it were directly connected
• No change to software required

• Data and compute pipelining
• Advanced caching and data directories
• Auto service discovery, metering
• Function redirection for advanced coprocessors
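The binary-level API interception idea can be sketched in miniature: wrap an existing function so every call passes through a redirection layer before reaching the real implementation, which is how an interception layer can forward accelerator API calls to remote devices without any change to the application. The wrapper below is a hypothetical stand-in, not the actual product mechanism.

```python
import functools

call_log = []  # where an interception layer might record or redirect calls

def intercept(fn):
    # Wrap fn so every call is observed by the interception layer first.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_log.append((fn.__name__, args))
        # A real remoting layer could ship args to a remote server here;
        # this sketch just forwards to the local implementation.
        return fn(*args, **kwargs)
    return wrapper

@intercept
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

result = vector_add([1, 2], [3, 4])
print(result)    # [4, 6]
print(call_log)  # [('vector_add', ([1, 2], [3, 4]))]
```

The application calls `vector_add` exactly as before; only the layer behind it changed, which is the "no change to software required" property claimed above.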
Helping to solve accessibility
[Diagram: inexpensive micro-clients and developer machines share a heterogeneous server or heterogeneous cloud, offering the most affordable scale-out pooling and high-performance developer instances]
SUPERCOMPUTING TO THE MASSES
Quantum computers

• Architecture:
• Targeted workloads:
• Programming models:
• Ecosystem maturity:
Application-specific processors

• Architecture: varied
• Targeted workloads: app-specific: molecular simulations, DNN
• Programming models: API
• Ecosystem maturity: zero-ish