HETEROGENEOUS ARCHITECTURES: A SURVEY AND OVERVIEW FOR DEVELOPERS
MAZHAR MEMON
CTO, BITFUSION.IO
[Figure: over time, software trends toward abstract and slow while hardware trends toward complex and fast]
Delivering performance and efficiency to today's applications is becoming more difficult

The problem in computing

The software world is increasingly abstract
Transistor scaling is ending
Moore's law slowing -> complexity
Era of frequency → Era of multi-core → Era of many-core
The solution(s)

• Hardware: specialized hardware is required to keep up with the accelerated performance curve; encourage accessibility with low hourly pricing
• Software: abstractions (libraries, APIs, tool chain up to the compiler IR; use translations where possible) and ecosystem (learning materials, user groups, university engagement)
• What makes this happen: developers
The remainder of this talk is about the hardware out there and how to develop for it
Current State of Developer Experience for Accelerators
- Update to the right operating system
- Install vendor tool-flows, which only work on specific operating systems
- Set up the environment and licenses
- Install the board
- Set up the board
- Read numerous pages of documentation

Unhappy developer experience ☹

In many cases developers give up before even starting real work due to this poor developer experience
Overview of available compute devices
…from easiest to hardest
Integrated GPUs

• Architecture: SIMD, shared-resource architecture
• Targeted workloads: medium-sized offloads; latency-sensitive, cost-sensitive, media
• Programming models: OpenCL, DirectCompute, C++ AMP, SPIR, HSAIL
• Ecosystem maturity: high
• Links: https://software.intel.com/en-us/articles/intel-graphics-developers-guides
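The SIMD model these devices expose can be sketched without any GPU at all: the same kernel runs once per element index, and the runtime fans those invocations out across parallel lanes. The pure-Python sketch below is only an analogy for how an OpenCL-style work-item executes; `saxpy_kernel` and `launch` are illustrative names, not a real API.

```python
# Illustrative only: mimics a SIMD "kernel" applied element-wise,
# the way a GPU work-item runs once per index of the input buffer.
def saxpy_kernel(i, a, x, y, out):
    # Each "work-item" i computes one element: out[i] = a * x[i] + y[i]
    out[i] = a * x[i] + y[i]

def launch(kernel, n, *args):
    # A real runtime schedules these n invocations across SIMD lanes;
    # here we simply loop to show the programming model.
    for i in range(n):
        kernel(i, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)
print(out)  # [12.0, 24.0, 36.0]
```

The point of the exercise: the kernel has no loop of its own, so the same code scales from one lane to thousands.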
Discrete GPUs

• Architecture: SIMD, discrete coprocessor configuration
• Targeted workloads: large offloads; throughput-sensitive, parallel structured
• Programming models: CUDA, OpenCL, DirectCompute, C++ AMP, SYCL, SPIR, HSA
• Ecosystem maturity: high
• Links: http://docs.nvidia.com/cuda/cuda-getting-started-guide-for-linux
MICs

• Architecture: many general-purpose cores, (co)processor configuration
• Targeted workloads: large offloads; throughput-sensitive, generic HPC
• Programming models: OpenCL, OpenMP, MPI, general x86
• Ecosystem maturity: high
• Links: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-developers-quick-start-guide
FPGAs

• Architecture: LUTs + HPs + fabric, coprocessor configuration
• Targeted workloads: extreme pipelining or fanout, systolic, fast configuration(?)
• Programming models: VHDL, Verilog, HLS, OpenCL
• Ecosystem maturity: medium
• Links:
• https://www.altera.com/products/design-software/embedded-software-developers/opencl/overview.highResolutionDisplay.html
• http://www.xilinx.com/products/design-tools/software-zone/sdaccel.html
Automata

• Architecture: NFA with programmable fabric
• Targeted workloads: MISD, pattern matching, parallel unstructured
• Programming models: API, ANML, regexp
• Ecosystem maturity: low
• Links: http://micronautomata.com/
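Since plain regular expressions are one of the listed programming models, a tiny example shows the kind of parallel, unstructured pattern matching these devices target. Here Python's `re` module stands in for the hardware NFA; the patterns and input stream are made up for illustration.

```python
import re

# An NFA-style device matches many patterns against a byte stream at once;
# combining alternatives into one compiled regex is a rough software stand-in.
patterns = re.compile(r"(GET|POST)\s+/\S*|ERROR\s+\d+")

stream = "GET /index.html HTTP/1.1 ... ERROR 404 ..."
matches = [m.group(0) for m in patterns.finditer(stream)]
print(matches)  # ['GET /index.html', 'ERROR 404']
```

On an automata processor every pattern runs concurrently in one pass over the stream, which is what makes the MISD label apt; software regex engines approximate that only for small pattern sets.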
Enabling developers: accessibility is still a problem
Vision

To bring supercomputing to the masses by:
◦ building software to automatically realize the benefits of heterogeneous hardware
Enabling scaling automatically

Horizontal scaling with BF Boost remoting technology
Vertical scaling with BF Boost splitting technology
Heterogeneous scaling with BF Boost interception technology

[Diagram: CPU system connected to GPU system]
3X Machine learning with Caffe, Torch: 2 local vs. 8 remote GPUs
3.5X Rendering with Blender: 1 local vs. 4 remote GPUs
20X Rendering with Blender: 4 remote GPUs
8X Image Processing with ImageMagick: 1 vs. 12 local GPUs
10X Computer Vision (face detect) with OpenCV: 12 CPU cores vs. 4 GPUs
7X Computational Science with NAMD: 2 remote GPUs
Bitfusion Tech: Remote Virtualization
Features
• Scale-out: connect one server to many accelerators to boost performance
• Scale-in: connect many servers to few accelerators to pool resources and lower cost
• Service discovery: local and remote machines can discover each other on demand without complex or time-consuming configuration
• Virtual pools: segment resources by class of users or hardware

Remote virtualization enables varied virtual configurations by combining or sharing the resources of local and remote servers

• Binary-level API interception
• Distribute work across local and remote machines
• Advanced performance features including synchronization elision and data pipelining

[Diagram — system view: an application on a local server dispatches to remote servers; application view: the application sees a single virtual server with combined resources]

• Software sees all new hardware as if it were directly connected
• No change to software required

• Data and compute pipelining
• Advanced caching and data directories
• Auto service discovery, metering
• Function redirection for advanced coprocessors
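The binary-level API interception idea can be sketched in miniature: wrap an existing function so every call passes through a redirection layer before reaching the real implementation, which is how an interception layer can forward accelerator API calls to remote devices without any change to the application. The wrapper below is a hypothetical stand-in, not the actual product mechanism.

```python
import functools

call_log = []  # where an interception layer might record or redirect calls

def intercept(fn):
    # Wrap fn so every call is observed by the interception layer first.
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        call_log.append((fn.__name__, args))
        # A real remoting layer could ship args to a remote server here;
        # this sketch just forwards to the local implementation.
        return fn(*args, **kwargs)
    return wrapper

@intercept
def vector_add(a, b):
    return [x + y for x, y in zip(a, b)]

result = vector_add([1, 2], [3, 4])
print(result)    # [4, 6]
print(call_log)  # [('vector_add', ([1, 2], [3, 4]))]
```

The application calls `vector_add` exactly as before; only the layer behind it changed, which is the "no change to software required" property claimed above.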
Helping to solve accessibility
[Diagram: inexpensive micro-clients and developer machines share a heterogeneous server or heterogeneous cloud, offering the most affordable scale-out pooling and high-performance developer instances]
SUPERCOMPUTING TO THE MASSES
Quantum computers

• Architecture:
• Targeted workloads:
• Programming models:
• Ecosystem maturity:
Application-specific processors

• Architecture: varied
• Targeted workloads: app-specific: molecular simulations, DNN
• Programming models: API
• Ecosystem maturity: zero-ish