MPI + AmorphOS
Chris Rossbach
cs378 Fall 2018
11/12/2018
Outline for Today
• Questions?
• Administrivia
  • Keep thinking about projects
  • FPGA lab web updates
    • Required input sizes and memory
  • Michael Wei (VMware) visit
• Agenda
  • Exam
  • MPI
  • AmorphOS (if we have time)
Acknowledgements:
• Portions of the lecture slides were adopted from:
  • Argonne National Laboratory, MPI tutorials
  • Lawrence Livermore National Laboratory, MPI tutorials
  • University of Oregon, Introduction to Parallel Computing (IPCC) slides
  • See online tutorial links on the course webpage
• W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, ISBN 0-262-57133-1, 1999.
• W. Gropp, E. Lusk, and R. Thakur, Using MPI-2: Advanced Features of the Message Passing Interface, MIT Press, ISBN 0-262-57132-3, 1999.
Project Proposal
• A very good example
• Weight: 2 labs
• Update MapReduce Lab to use Spark
• Resurrect cs372 PPoPP lab
• Build an STM (Software Transactional Memory)
• Use Intel HTM for something interesting
• Write a thread pool
• Build a lock-free data structure
• Parallelize an algorithm with [Akka, Spark, Hadoop, Flink, Beam, Storm, etc.]
• Build/Measure a TensorFlow Model
• Parallelize something with [Pony, Julia, Chapel, X10, MPI]
• Do another FPGA algorithm
• More GPU parallelization
• Do the MPI lab (#6) at scale
• …

Questions?
Exam
• Graded, entered in Canvas
• There sure are a lot of you!
• Mean: 78.8; Median: 82.9; range: [~36...~93]
[Figure: histogram of exam scores binned from <50 up to 90-95, with C/B-/B/A grade bands marked]
Favorite subjects: Go, GPUs
Least Favorite: GPUs, TM
Exam: 1.1.

Exam: 1.2.

Exam: 1.3.
Wait…why not mutual exclusion? Circular wait can happen without it!
• E.g. can RWLocks deadlock? Yes.
• Can you deadlock on resources other than locks? Yes.
• However, I didn’t take/give points for A

Exam: 1.4.
[Figure: plot of results for 1 to 64 threads, one series per value from 0.1 to 1.0]
Exam: 1.5.
Exam: 2.1.
• Safety
  • Only one thread in the critical region
• Liveness
  • Some thread that enters the entry section eventually enters the critical region
  • Even if other threads take forever in the non-critical region
• Bounded waiting
  • A thread that enters the entry section enters the critical section within some bounded number of operations; the bound is on the number of times other threads can enter the critical section before it
• Failure atomicity
  • It is OK for a thread to die in the critical region
  • Dying in critical region → ops all applied or all rolled back
  • Many techniques do not provide failure atomicity
Exam: 2.4.
• A) In general no, but receiver can poll
• C) Select!
Exam: 3.1.
• A) spin on local go flag
• B) some kind of blocking
• C) barrier doesn’t reset, some strategy to make it reset
Exam: 3.2.
• A) something about futures and promises
• B) pretty much anything with go func()
• C) Channels!
• D) Stack-ripping → some creative responses (next slide)
Stack-Ripping
Stack-based state out of scope! Requests must carry state.
Exam: 3.3.
• A: Commit normally, produce undo list for abort
  • Durability – other transactions see effects that get rolled back
  • Isolation – other transactions see inner effects while outer still in flight
  • Atomicity – effects of txid2 are part of txid1 and they become visible without the rest of txid1
• B: Produce deferred actions, all in-flight txns read from this list
  • Isolation – uncommitted inner txn writes visible to all (e.g. txid3)
  • Not durability, since commit doesn’t occur until outer commits
  • Consistency – can be argued either way. Outer txid1 certainly gets a consistent view. Any txid3 that reads from the list sees an inconsistent view
• C: If inner only reads outer writes, do we need to relax ACID?
  • No, inner sees outer’s writes but would abort if outer aborts → standard flat nesting
  • Caveat: txid1 writes the same data more than once
Last Thoughts on Exam
• Point was to get people to think
• Don’t worry about your grade
  • Everyone is still “above water”
• Feel free to come talk to me about it if I misread/misunderstood…
Faux Quiz Questions
• What is the difference between horizontal and vertical scaling? Describe a setting or application in which horizontal scaling might be preferable and one in which vertical scaling makes more sense.
• What is PGAS?
• What is bisection bandwidth?
• What is a “shared nothing” architecture?
• What is an all-reduce operation? In MPI can it be implemented with other primitives (e.g. send/recv)? Why does MPI support it as a first-class API?
• What is distributed shared memory? Suggest some implementation techniques.
• What is a collective operation? Give an example and explain why it is a collective.
• What is a 3-tier architecture? What application(s) might it be good for and why?
• What are some advantages and disadvantages of distributed memory architectures?
• What is the difference between 1-sided and cooperative MPI operations? Advantages/disadvantages of each?

Scale out vs Scale up
Architects Wanted
Hot Startup Idea: www.purchase-a-pooch.biz
1. User Browses Potential Pets
2. Clicks “Purchase Pooch”
3. Web Server, CGI/EJB + Database complete request
4. Pooch delivered (not shown)
[Figure: “Purchase Pooch” click → Web Server → Pooch Purchasing Application Logic → Database]
How to handle lots and lots of dogs?

3 Tier architecture
• Web Servers (Presentation Tier) and App Servers (Business Tier) scale horizontally
• Database Server scales vertically
• Horizontal Scale → “Shared Nothing”
• Why is this a good arrangement?
[Figure: user request → Web Servers → App Servers → Database Server]
Vertical scale gets you a long way, but there is always a bigger problem size
Goal
Design Space
[Figure: design space quadrants: throughput vs. latency; internet vs. private data center; data-parallel vs. shared memory]
Parallel Architectures of Interest for MPI
Distributed Memory Multiprocessor
• Messaging between nodes
• Massively Parallel Processor (MPP)
  • Many, many processors
Cluster of SMPs
• Shared memory within SMP node
• Messaging between SMP nodes
• Can also be regarded as MPP if processor number is large
[Figure: processors with private memories connected by an interconnection network; cluster of SMPs with per-node network interfaces]
Multicore SMP+GPU Cluster
• Shared mem in SMP node
• Messaging between nodes
• GPU accelerators attached
[Figure: SMP nodes with attached GPUs connected by an interconnection network]
What have we left out?
• Uniprocessors
• CMPs
• Non-GPU Accelerators
• Fused Accelerators
What requires extreme scale?
• Simulations—why?
  • Simulations are sometimes more cost effective than experiments
• Why extreme scale?
  • More compute cycles and more memory lead to faster and/or more accurate simulations
[Figure: climate change, astrophysics, and nuclear reactor simulations; image credit: Prabhat, LBNL]
How big is “extreme” scale?
Measured in FLOPS
• FLoating point Operations Per Second
• 1 GigaFLOPS = 1 billion FLOPS
• 1 TeraFLOPS = 1000 GigaFLOPS
• 1 PetaFLOPS = 1000 TeraFLOPS
  • Most current supercomputers (~200 PFLOPS at the high end)
• 1 ExaFLOPS = 1000 PetaFLOPS
  • Arriving in 2018 (supposedly)
Distributed Memory Multiprocessors… or how to program 10,000,000 cores
• Each processor has a local memory
  • Physically separated memory address space
• Processors communicate to access non-local data
  • Message communication (message passing)
  • Message passing architecture
  • Processor interconnection network
• Parallel applications partitioned across
  • Processors: execution units
  • Memory: data partitioning
• Scalable architecture
  • Incremental cost to add hardware (cost of node)
• Nodes: complete computers
  • Including I/O
• Nodes communicate via network
  • Standard networks (IP)
  • Specialized networks (RDMA, fiber)
[Figure: nodes (processor, cache, memory, network interface) connected by a network]
Performance Metrics: Latency and Bandwidth
• Bandwidth
  • Need high bandwidth in communication
  • Match limits in network, memory, and processor
  • Network interface speed vs. network bisection bandwidth
• Latency
  • Performance affected since processor may have to wait
  • Harder to overlap communication and computation
  • Overhead to communicate is a problem in many machines
• Latency hiding
  • Increases programming system burden
  • Examples: communication/computation overlap, prefetch
Is this different from metrics we’ve cared about so far?

Wait…WTF…bisection bandwidth?
If the network is bisected into two equal halves, the bisection bandwidth is the bandwidth available between the two partitions (conventionally, the minimum over all such cuts).
Ostensible Advantages of Distributed Memory Architectures
• Hardware simpler (especially versus NUMA), more scalable
• Communication explicit, simpler to understand
• Explicit communication → focuses attention on the costly aspect of parallel computation
• Synchronization → naturally associated with sending messages
  • Reduces possibility for errors from incorrect synchronization
• Easier to use sender-initiated communication → some advantages in performance
Can you think of any disadvantages?
Outline
• …
• Practicalities: running on a supercomputer
• Message Passing
• MPI Programming
• MPI Implementation
Running on Supercomputers: Practicalities in a Nutshell
• Sometimes 1 job takes the whole machine
  • These are called “hero runs”…
• Sometimes many smaller jobs
• Supercomputers are used continuously
  • Processors: “scarce resource”
  • Jobs are “plentiful”
• Programmer plans a job; job ==
  • parallel binary program
  • “input deck” (specifies input data)
• Submit job to a queue
• Scheduler allocates resources when
  • resources are available,
  • (or) the job is deemed “high priority”
• Scheduler runs scripts that initialize the environment
  • Typically done with environment variables
• At the end of initialization, it is possible to infer:
  • What the desired job configuration is (i.e., how many tasks per node)
  • What other nodes are involved
  • How your node’s tasks relate to the overall program
• MPI library interprets this information, hides the details
The Message-Passing Model
• Process: a program counter and address space
• A process may contain multiple threads sharing a single address space
• MPI is for communication among processes
  • Not threads
• Inter-process communication consists of
  • Synchronization
  • Data movement
[Figure: processes P1 to P4, each with its own address space (memory) and threads]
How does this compare with CSP?
SPMD
• Data distributed across processes
  • Not shared → shared nothing
• “Owner compute” rule: the process that “owns” the data (local data) performs the computations on that data
[Figure: a single shared program operating on multiple data partitions, one per process]
Message Passing Programming
• Defined by communication requirements
  • Data communication (necessary for algorithm)
  • Control communication (necessary for dependencies)
• Program behavior determined by communication patterns
• MPI: supports most common forms of communication
  • Basic forms provide functional access
    • Can be used most often
  • Complex forms provide higher-level abstractions
    • Serve as basis for extension
    • Example: graph libraries, meshing libraries, …
  • Extensions for greater programming power
• MPI == Message-Passing library (Interface) specification
  • Extended message-passing model
  • Not a language or compiler specification
  • Not a specific implementation or product
• Targeted for parallel computers, clusters, and NOWs
  • NOWs = networks of workstations
• Specified in C, C++, Fortran 77, F90
• Message Passing Interface (MPI) Forum
  • http://www.mpi-forum.org/
  • http://www.mpi-forum.org/docs/docs.html
• Two flavors of communication
  • Cooperative operations
  • One-sided operations
Cooperative Operations
• Data is cooperatively exchanged in message-passing
• Explicitly sent by one process and received by another
• Advantage of local control of memory
  • Change in the receiving process’s memory made with receiver’s explicit participation
• Communication and synchronization are combined
[Figure: Process 0 calls Send(data), Process 1 calls Receive(data); time flows downward]
Familiar argument?
One-Sided Operations
• One-sided operations between processes
  • Include remote memory reads and writes
• Only one process needs to explicitly participate
  • There is still agreement implicit in the SPMD program
• Implication:
  • Communication and synchronization are decoupled
[Figure: Process 0 issues Put(data) into and Get(data) from Process 1’s memory; time flows downward]
Are 1-sided operations better for performance?
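A minimal sketch of one-sided transfer using the MPI-2 RMA calls; the window size, the fence-based synchronization, and the value written are illustrative assumptions, not from the slides:

#include "mpi.h"

int main( int argc, char *argv[] )
{
    int rank, size, local = 0;
    MPI_Win win;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    /* expose one int of local memory to the other processes */
    MPI_Win_create( &local, sizeof(int), sizeof(int),
                    MPI_INFO_NULL, MPI_COMM_WORLD, &win );

    MPI_Win_fence( 0, win );              /* open an access epoch */
    if (size > 1 && rank == 0) {
        int val = 42;
        /* write into rank 1's window; rank 1 never calls a receive */
        MPI_Put( &val, 1, MPI_INT, 1, 0, 1, MPI_INT, win );
    }
    MPI_Win_fence( 0, win );              /* close epoch: the Put is now visible */

    MPI_Win_free( &win );
    MPI_Finalize();
    return 0;
}

Note how only rank 0 names the communication; rank 1 participates implicitly through the collective fences.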
A Simple MPI Program (C)
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
MPI_Init( &argc, &argv );
printf( "Hello, world!\n" );
MPI_Finalize();
return 0;
}
A Simple MPI Program (C++)

#include <iostream>

#include "mpi++.h"

int main( int argc, char *argv[] )

{

    MPI::Init(argc,argv);

    std::cout << "Hello, world!" << std::endl;

    MPI::Finalize();

    return 0;

}
MPI_Init
• Hardware resources allocated
  • MPI-managed ones anyway…
• Start processes on different nodes
  • Where does their executable program come from?
• Give processes what they need to know
  • Wait…what do they need to know?
• Configure OS-level resources
• Configure tools that are running with MPI
• …
MPI_Finalize
• Why do we need to finalize MPI?
• What is necessary for a “graceful” MPI exit?
  • Can bad things happen otherwise?
• Suppose one process exits…
• How do resources get de-allocated?
• How to shut down communication?
• What type of exit protocol might be used?
Executive Summary
• Undo all of init
• Be able to do it on success or failure exit
• By default, an error causes all processes to abort
• The user can cause routines to return (with an error code)
  • In C++, exceptions are thrown (MPI-2)
• A user can also write and install custom error handlers
• Libraries may handle errors differently from applications
Running MPI Programs
• MPI-1 does not specify how to run an MPI program
• Starting an MPI program is implementation dependent
  • Scripts, program arguments, and/or environment variables
• % mpirun -np <procs> a.out
  • For MPICH under Linux
• mpiexec <args>
  • Part of MPI-2, as a recommendation (not a requirement)
  • mpiexec for MPICH (distribution from ANL)
  • mpirun for SGI’s MPI
Finding Out About the Environment
• Two important questions arise in message passing
  • How many processes are being used in the computation?
  • Which one am I?
• MPI provides functions to answer these questions
  • MPI_Comm_size reports the number of processes
  • MPI_Comm_rank reports the rank
    • a number between 0 and size-1
    • identifies the calling process
More “Hello World”
#include "mpi.h"
#include <stdio.h>
int main( int argc, char *argv[] )
{
int rank, size;
MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
printf( "I am %d of %d\n", rank, size );
MPI_Finalize();
return 0;
}
What does this program do?
Comm? → “Communicator”
Some Basic Concepts
• Processes can be collected into groups
• Each message is sent in a context
  • Must be received in the same context!
• A group and context together form a communicator
• A process is identified by its rank
  • With respect to the group associated with a communicator
• There is a default communicator, MPI_COMM_WORLD
  • Contains all initial processes
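MPI_Comm_split (not introduced above) is one way to build additional communicators from MPI_COMM_WORLD; a minimal sketch, with the even/odd split chosen purely for illustration:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, subrank;
    MPI_Comm half;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    /* split MPI_COMM_WORLD into two groups: even ranks and odd ranks */
    MPI_Comm_split( MPI_COMM_WORLD, rank % 2, rank, &half );

    /* each process gets a new rank relative to its sub-communicator's group */
    MPI_Comm_rank( half, &subrank );
    printf( "world rank %d is rank %d in its sub-communicator\n", rank, subrank );

    MPI_Comm_free( &half );
    MPI_Finalize();
    return 0;
}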
MPI Datatypes
• Message data (sent or received) is described by a triple
  • address, count, datatype
• An MPI datatype is recursively defined as:
  • Predefined data type from the language
  • A contiguous array of MPI datatypes
  • A strided block of datatypes
  • An indexed array of blocks of datatypes
  • An arbitrary structure of datatypes
• There are MPI functions to construct custom datatypes
  • Array of (int, float) pairs
  • Row of a matrix stored columnwise

Why? Why not just native types?
• Enables heterogeneous communication
  • Supports communication between processes on machines with different memory representations and lengths of elementary datatypes
  • MPI provides the representation translation if necessary
• Allows application-oriented layout of data in memory
  • Reduces memory-to-memory copies in the implementation
  • Allows use of special hardware (scatter/gather)
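As a hedged sketch of the “row of a matrix stored columnwise” case above: a strided derived datatype describes the row in place, so it can be sent without packing. The matrix size, row index, and ranks are assumptions:

#include "mpi.h"

#define N 4   /* assumed matrix dimension */

int main( int argc, char *argv[] )
{
    double a[N*N];        /* column-major: element (i,j) lives at a[j*N + i] */
    int rank, size, k, i = 2;   /* row i = 2 is an arbitrary choice */
    MPI_Datatype row_t;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    for (k = 0; k < N*N; k++) a[k] = (double)k;

    /* row i is N doubles, one per column, strided N elements apart */
    MPI_Type_vector( N, 1, N, MPI_DOUBLE, &row_t );
    MPI_Type_commit( &row_t );

    if (size > 1) {
        if (rank == 0)
            MPI_Send( &a[i], 1, row_t, 1, 0, MPI_COMM_WORLD );
        else if (rank == 1)
            MPI_Recv( &a[i], 1, row_t, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    }

    MPI_Type_free( &row_t );
    MPI_Finalize();
    return 0;
}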
MPI Tags
• Messages are sent with an accompanying user-defined integer tag
  • Assists the receiving process in identifying the message
• Messages can be screened at the receiving end by specifying a specific tag
  • MPI_ANY_TAG matches any tag in a receive
• Tags are sometimes called “message types”
  • MPI calls them “tags” to avoid confusion with datatypes
In my (limited) experience, the least useful MPI feature
MPI Basic (Blocking) Send
MPI_SEND (start, count, datatype, dest, tag, comm)
• The message buffer is described by:
  • start, count, datatype
• The target process is specified by dest
  • Rank of the target process in the communicator specified by comm
• Process blocks until:
  • Data has been delivered to the system
  • Buffer can then be reused
• Message may not have been received by the target process!
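A hedged sketch of MPI_Send paired with its matching blocking MPI_Recv; the message size and tag value 99 are arbitrary assumptions:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size, buf[10] = {0};
    MPI_Status status;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    if (size > 1 && rank == 0) {
        buf[0] = 42;
        /* returns once buf is safe to reuse -- not necessarily received */
        MPI_Send( buf, 10, MPI_INT, 1, 99, MPI_COMM_WORLD );
    } else if (rank == 1) {
        /* blocks until a matching message (source 0, tag 99) arrives */
        MPI_Recv( buf, 10, MPI_INT, 0, 99, MPI_COMM_WORLD, &status );
        printf( "rank 1 received %d\n", buf[0] );
    }

    MPI_Finalize();
    return 0;
}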
Programming MPI with Only Six Functions
• Many parallel programs can be written using just:
  • MPI_INIT()
  • MPI_FINALIZE()
  • MPI_COMM_SIZE()
  • MPI_COMM_RANK()
  • MPI_SEND()
  • MPI_RECV()
• Why have any other APIs (e.g. broadcast, reduce, etc.)?
  • Point-to-point (send/recv) isn’t always the most efficient...
  • Add more support for communication
Excerpt: Count 3s
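A hedged sketch of a Count 3s kernel written with only the six functions above; the per-rank data and its size are assumptions, not the original excerpt:

#include "mpi.h"
#include <stdio.h>

#define N_LOCAL 1000   /* assumed per-process share of the array */

int main( int argc, char *argv[] )
{
    int rank, size, i, p, local = 0, total, partial;
    int data[N_LOCAL];

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    /* owner-compute: each rank fills and counts its own block */
    for (i = 0; i < N_LOCAL; i++)
        data[i] = (rank + i) % 10;
    for (i = 0; i < N_LOCAL; i++)
        if (data[i] == 3)
            local++;

    if (rank != 0) {
        /* ship the partial count to rank 0 */
        MPI_Send( &local, 1, MPI_INT, 0, 0, MPI_COMM_WORLD );
    } else {
        total = local;
        for (p = 1; p < size; p++) {
            MPI_Recv( &partial, 1, MPI_INT, p, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
            total += partial;
        }
        printf( "count of 3s: %d\n", total );
    }

    MPI_Finalize();
    return 0;
}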
Excerpt: Barnes-Hut
Reduce/Allreduce
[Figure: reduce collects A0+A1+A2, B0+B1+B2, C0+C1+C2 at the root; allreduce delivers the same sums to every process]
MPI_Reduce
• MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
• IN sendbuf (address of send buffer)
• OUT recvbuf (address of receive buffer)
• IN count (number of elements in send buffer)
• IN datatype (data type of elements in send buffer)
• IN op (reduce operation)
• IN root (rank of root process)
• IN comm (communicator)
• MPI_Reduce combines elements specified by send buffer and performs a reduction operation on them.
• There are a number of predefined reduction operations: MPI_MAX, MPI_MIN, MPI_SUM, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, MPI_MINLOC
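A hedged usage sketch of MPI_Reduce with MPI_SUM and root 0; the per-rank value is an assumption:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size, local, sum = 0;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    MPI_Comm_size( MPI_COMM_WORLD, &size );

    local = rank + 1;   /* each rank contributes one value */

    /* combine every rank's value with MPI_SUM; only root 0 gets the result */
    MPI_Reduce( &local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD );

    if (rank == 0)
        printf( "sum of 1..%d = %d\n", size, sum );

    MPI_Finalize();
    return 0;
}

Swapping MPI_Reduce for MPI_Allreduce (same arguments minus the root) delivers the sum to every rank instead of only the root.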
Reduce_scatter/Scan
[Figure: reduce-scatter computes A0+A1+A2, B0+B1+B2, C0+C1+C2 and scatters one result to each process; scan gives process i the prefix sums A0+…+Ai, B0+…+Bi, C0+…+Ci]
MPI_Scan
• MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• IN sendbuf (address of send buffer)
• OUT recvbuf (address of receive buffer)
• IN count (number of elements in send buffer)
• IN datatype (data type of elements in send buffer)
• IN op (reduce operation)
• IN comm (communicator)
• Note: count refers to the total number of elements that will be received into the receive buffer after the operation is complete
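A hedged sketch of an inclusive prefix sum using MPI_Scan; the per-rank value is an assumption:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, val, prefix;

    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );

    val = rank + 1;

    /* inclusive scan: rank i receives val_0 + val_1 + ... + val_i */
    MPI_Scan( &val, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );

    printf( "rank %d prefix sum = %d\n", rank, prefix );

    MPI_Finalize();
    return 0;
}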
To use or not use MPI?
• USE
  • You need a portable parallel program
  • You are writing a parallel library
  • You have irregular or dynamic data relationships
  • You care about performance
• NOT USE
  • You don’t need parallelism at all
  • You can use libraries (which may be written in MPI) or other tools
  • You can use multi-threading in a concurrent environment
  • You don’t need extreme scale
AmorphOS Motivation
• Bigger, faster FPGAs deployed in the cloud
  • Microsoft Catapult/Azure
  • Amazon F1
• FPGAs: Reconfigurable Accelerators
  • ASIC prototyping, video & image processing, DNNs, blockchain
  • Potential solution to the accelerator provisioning challenge
• Our position: FPGAs will be shared
  • Sharing requires protection
  • Abstraction layers provide compatibility
  • Beneficiary: provider → consolidation
FPGA Background
• Field Programmable Gate Array (FPGA)
  • Reconfigurable interconnect → custom data paths
  • FPGAs attached as coprocessors to a CPU over a host bus
• FPGA Build Cycle
  • Synthesis: HDL → Netlist (~seconds)
  • Place and Route: Netlist → Bitstream (~minutes to hours)
  • Reconfiguration/Partial Reconfiguration (PR)
• Production systems: no multi-tenancy
• Emerging/research systems use fixed slots/PR
  • Fixed-sized slots → fragmentation (50% or more)
  • Elastic resource management needed
AmorphOS Goals
• Protected Sharing/Isolation
  • Mutually distrustful applications
• Compatibility / Portability
  • HDL written to AmorphOS interfaces
  • 14 benchmarks run unchanged on Microsoft Catapult and Amazon F1
• Elasticity
  • User logic scales with resource availability
  • Sharing density scales with availability
[Figure: applications App A and App B sharing FPGA fabric, DRAM, and I/O (QSFP, USB, Ethernet, I2C), scaling as resources allow]
AmorphOS Abstractions
• Zone: Allocatable Unit of Fabric
  • 1 global zone
  • N dynamically sized, sub-dividable PR zones
• Hull: OS/Protection Layer
  • Memory protection, I/O mediation
  • Interfaces form a compatibility layer
• Morphlet: Protection Domain
  • Extends the process abstraction
  • Encapsulates user logic on a PR or global zone
• Registry: Bitstream Cache
  • Hides latency of place-and-route (PaR)
[Figure: FPGA fabric divided into a global zone and sub-dividable PR zones hosting Morphlets A, B, C; the Registry maps Morphlet combinations (e.g. <A,B>, <A,B,C>, <B,C>) to cached bitstreams]
Scheduling Morphlets
• Tradeoff
  • Fixed zones + PR → fast, but fragmentation
  • Global zone + PaR → eliminates fragmentation, but slow
• AmorphOS: best of both worlds
  • Low Latency Mode
    • Fixed zones + PR
    • Default Morphlet bitstream
  • High Throughput Mode
    • Combine multiple Morphlets
    • Co-schedule on a global zone
[Figure: timeline T0 to T3 of Morphlets A through D arriving; the system shifts between low-latency mode (one Morphlet per zone) and high-throughput mode (Morphlets combined on a global zone)]
Scheduling Case Study
AmorphOS Hull
• Hardens and extends vendor Shells
  • Microsoft Catapult
  • Amazon F1
• AmorphOS Interfaces
  • Control: CntrlReg
  • Virtual Memory: AMI
  • Bulk Data Transfer: Simple-PCIe
[Figure: a Morphlet sees uniform AmorphOS Hull interfaces (CntrlReg, AMI, PCIe) layered over either the Catapult Shell (8 GB DDR3, SoftReg, PCIe) or the F1 Shell (64 GB DDR4, AXI4, BAR-1 AXI4-Lite, DMA-AXI4)]
AmorphOS Hull
• Hardens and extends vendor Shells
  • Microsoft Catapult → higher level
  • Amazon F1 → lower level
• AmorphOS Interfaces
  • Control: CntrlReg
  • Virtual Memory: AMI
  • Bulk Data Transfer: Simple-PCIe
• Multiplexing of interfaces
  • Isolation/data protection
  • Scalable, 32 accelerators
  • Tree of multiplexers
[Figure: four Morphlets, each with its own CntrlReg/AMI/PCIe interfaces, multiplexed onto one Catapult Shell (8 GB DDR3, SoftReg, PCIe)]
Implementation & Methodology
• Catapult Prototype
• Altera Mt. Granite Stratix V GS 2x4GB DDR3, Windows Server
• Segment-based protection, partial reconfiguration (PR)
• Amazon F1 Prototype
• Xilinx UltraScale+ VU9P, 4x16GB DDR4, CentOS 7.5
• No PR, but much more fabric than Catapult
• Workloads
  • DNNWeaver – DNN inference
  • MemDrive – memory bandwidth
  • Bitcoin – blockchain hashing
  • CHStone – 11 accelerators (e.g. AES, JPEG, etc.)
Scalability
[Figure: system throughput vs. number of Morphlets (1 to 32) for MemDrive, Bitcoin, and DNNWeaver]
• F1: Xilinx UltraScale+ VU9P, 4x16GB DDR4, CentOS 7.5
• Higher is better, homogeneous Morphlets
• MemDrive: bandwidth contention; Bitcoin: compute-bound
• DNNWeaver: 32X density, 23X throughput
Takeaway: Massive throughput/density improvement possible; awareness of contended resources necessary
Throughput
[Figure: end-to-end Bitcoin run time in seconds (log scale) under No Sharing, Two PR Zones, and AmorphOS (HT)]
• 8 Bitcoin Morphlets
• Catapult: Altera Stratix V GS, 2x4GB DDR3, Windows
• Registry pre-populated; context switch ~200ms
• Log scale, lower is better
• No-Sharing: serialized
• Fixed Zones: worse than no sharing due to down-scaling!
Takeaway: Co-scheduling on a global zone can perform better than fixed-sized slots and PR
Partitioning Policies
• Non-Sharing: everything runs serially, single context
• Global Zone: multiple Morphlets, no fixed-size zones
• Single-level zone scheme: two PR zones, one Morphlet each
• Co-schedule: multiple Morphlets in a single PR zone
• Subdivide: divide top-level PR zone into smaller PR zones

Partitioning Policies
• Bitcoin Morphlets
• Catapult: Altera Mt. Granite Stratix V GS, 2x4GB DDR3, Windows
• Registry pre-populated; context switch ~200ms
• Higher is better
• Non-sharing: serialized
• Global: multi-context on global zone
• Single-level: only two fixed slots
• Co-schedule: morph multiple into a fixed slot
• Subdivide: hierarchical partitions
• Single-level partitioning better than recursive subdivision (cause: downscaling)
• Co-schedule on global zone best
Takeaway:
• Hierarchical PR on limited HW not worth it
• See paper for projections on F1
Related Work
• Access to OS-managed resources
  • Borph: So [TECS ’08, Thesis ‘07]
  • Leap: Adler [FPGA ‘11]
  • CoRAM: Chung [FPGA ‘11]
• First-class OS support
  • HThreads: Peck [FPL ’06], ReconOS: Lübbers [TECS ‘09] – extend threading to FPGA SoCs
  • MURAC: Hamilton [FCCM ‘14] – extend process abstraction to FPGAs
• Single-application Frameworks
  • Catapult: Putnam [ISCA ‘14] / Amazon F1
• Fixed-slot + PR
  • OpenStack support: Chen [CF ‘14], Byma [FCCM ’14], Fahmy [CLOUDCOM ‘15]
  • Disaggregated FPGAs: Weerasinghe [UIC-ATC-ScalCom ‘15]
• Overlays
  • Zuma: Brant [FCCM ‘12]
  • Hoplite: Kapre [FPL ‘15]
  • ReconOS+Zuma: [ReConfig ’14]
Conclusions & Future Work
• Compatibility improved
  • Without restricting the programming model
  • Comprehensive set of stable interfaces
  • Port AmorphOS per platform, not each accelerator per platform
• Scalability achieved within and across accelerators
  • AmorphOS transparently scales Morphlets up/down
  • Powerful combination of slots/Partial Reconfiguration and full FPGA bitstreams
• Future work
  • Transparently scale across multiple FPGAs
  • Scale across more than just FPGAs
  • Open source AmorphOS / port to more platforms

Questions?

MPI_Allreduce