1

Using GPCE Principles for Hardware Systems and Accelerators
(bridging the gap to HW design)

Rishiyur S. Nikhil
CTO, Bluespec (www.bluespec.com)

GPCE 09, October 4, 2009

2

Generative and component approaches are revolutionizing software development ... GPCE provides a venue for researchers and practitioners interested in foundational techniques for enhancing the productivity, quality, and time-to-market in software development ... In addition to exploring cutting-edge techniques for developing generative and component-based software, our goal is to foster further cross-fertilization between the software engineering research community and the programming languages community.

This seems to be a conference about improving software development ...

... so why am I here talking about hardware design?

Two reasons ....

3

... Generative Programming (developing programs that synthesize other programs), Component Engineering (raising the level of modularization and analysis in application design), and Domain-Specific Languages (elevating program specifications to compact domain-specific notations that are easier to write, maintain, and analyze) are key technologies for automating program development.... enhancing the productivity, quality, and time-to-market in software development that stems from deploying standard components and automating program generation. ...

Reason (1): you may be interested in seeing how the principles highlighted below ...

... are used with equal capability and effectiveness in HW design

4

Reason (2): I would like to tempt you to upgrade from being only a software engineer (v 1.0) ...

... to “The Compleat Computation-ware Engineere (v 2.0)” ...

... where you think of hardware computation as an important (and easy to use) component in your toolbox when you solve your next problem.

[Figure labels: HW, SW]

5

The traditional HW creation “flow” (early 1990s to present)

Source code (Verilog/VHDL) feeds three paths:
• RTL simulation: run/debug/edit is “instant”
• Traditional ASIC synthesis → gate-level Verilog/VHDL → Place&Route, ..., tape out, ..., manufacture ...: 10s of months, $10M-50M
• Traditional FPGA synthesis* → gate-level Verilog/VHDL → Place&Route, ..., FPGA download: minutes/hours, $100-10K

* “synthesis” is just jargon for a certain kind of compilation

6

New flows (not yet mainstream)

[Flow diagram: Source code (High Level Language) → “High Level” synthesis → the traditional flow (RTL simulation, traditional FPGA synthesis, ..., manufacture ...); the high-level source also supports simulation by compiled execution]

By raising the level of abstraction,
• improve design time by 10x (or more)
• improve expressive power, simulation speed
• with no loss of silicon quality (area, speed, power)
• in fact, sometimes with better silicon quality (because improved flexibility can result in better architectures)

7

Some candidate high level languages

[Flow diagram: Source code (C/C++/SystemC) and Source code (BSV) as alternative starting points for the flow through RTL simulation, ..., manufacture ...]

Source code (C/C++/SystemC):
• Classic limitations of automatic parallelization from sequential codes, cf. “dusty deck Fortran” ca. 1970s

Source code (BSV): Bluespec’s fresh approach, inspired by
• Term Rewriting Systems (parallel atomic transactions) to describe complex concurrent behavior. Related to: UNITY, TLA+, EventB, ...
• Haskell (types, overloading, parameterization, generativity)

8

HW languages have always been “generative”

Example (Verilog):

   module mkM1 (…);
      mkM3 m3b ( … );   // instantiates mkM3
      mkM2 m2  ( … );   // instantiates mkM2
   endmodule

   module mkM2 (…);
      mkM3 m3a ( … );   // instantiates mkM3
   endmodule

   module mkM3 (…);
      …
   endmodule

Two visualizations of the resulting module instance hierarchy:

[Figure: two drawings of the same hierarchy: m1 (instance of mkM1) contains m3b and m2, and m2 contains m3a]
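For readers coming from software, the instance hierarchy can be mimicked with a small Python model. This is only a sketch of the idea, not Verilog semantics: the `Instance` class and the `render` helper are hypothetical names introduced here, and each "module definition" is modeled as a function that instantiates its children when it runs.

```python
class Instance:
    """One node of the module instance hierarchy (hypothetical helper)."""
    def __init__(self, name, definition, children=()):
        self.name = name
        self.definition = definition
        self.children = list(children)

# Each function plays the role of a module definition: running it
# creates an instance and, recursively, the instances inside it.
def mkM3(name):
    return Instance(name, "mkM3")

def mkM2(name):
    return Instance(name, "mkM2", [mkM3("m3a")])

def mkM1(name):
    return Instance(name, "mkM1", [mkM3("m3b"), mkM2("m2")])

def render(inst, depth=0):
    """Return the hierarchy as indented text lines, one per instance."""
    lines = ["  " * depth + f"{inst.name} (instance of {inst.definition})"]
    for child in inst.children:
        lines.extend(render(child, depth + 1))
    return lines

hierarchy = render(mkM1("m1"))
print("\n".join(hierarchy))
```

Running the top-level function once yields a fixed tree; nothing in the tree changes afterwards, which is the sense in which the structure is "generated".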

9

HW languages have long been “generative” (contd.)

[Flow diagram: Source code (BSV) or Source code (C/C++/SystemC) goes through Static Elaboration and then Execution, on the way to RTL simulation, ..., manufacture ...]

Static Elaboration (jargon for “generation”)
• Execute the structural aspects of the program to produce the module hierarchy (structure)

Execution within the fixed structure (behavior)
• Essentially just the execution of a giant FSM

Verilog/VHDL have poor generative capabilities (weak afterthought!):
• Not orthogonal, not reflective, not Turing-complete

10

I’m now going to show you some code examples for some non-trivial HW designs. I hope, at the end of this, you’ll say:

“Hey! I could do that!”

even if you’ve never designed HW before!

11

Verilog/VHDL module interfaces: wire oriented

Example: transferring a datum from one module to another

[Figure: two modules connected by data, RDY, and ENA wires]

• In each module: declare input and output wires; logic for RDY/ENA
• In the enclosing module: declaration of wires; connections to module interface
• Protocol (proper behavior) specified separately using waveforms and English text

Very verbose, very error-prone

12

BSV module interfaces: “transactional” (object-oriented)

   interface Get #(type t);         // polymorphic
      method ActionValue #(t) get();
   endinterface

   interface Put #(type t);
      method Action put (t x);
   endinterface

   module mkConnection #(Get#(t) g, Put#(t) p) (Empty);
      rule connect;
         let x <- g.get();
         p.put (x);
      endrule
   endmodule

These interface definitions are sufficiently useful and reusable that they’re in standard BSV libraries

   Get#(Packet) g1 <- mkM1 (...);
   Put#(Packet) p1 <- mkM2 (...);
   Empty        e  <- mkConnection (g1, p1);

[Figure labels: Get, Put, parameters]
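A rough software model of what the connect rule does may help. The sketch below is Python, not BSV, and it rests on assumptions: the `Fifo` class, its `can_get`/`can_put` readiness checks, and `connect_rule` are hypothetical stand-ins for the RDY/ENA handshakes that the BSV compiler derives for you.

```python
from collections import deque

class Fifo:
    """Bounded buffer with a Get side and a Put side (hypothetical model)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def can_get(self):           # models the Get side's RDY
        return len(self.items) > 0

    def get(self):
        return self.items.popleft()

    def can_put(self):           # models the Put side's RDY
        return len(self.items) < self.capacity

    def put(self, x):
        self.items.append(x)

def connect_rule(g, p):
    """Try to fire the connect rule once: both actions happen, or neither."""
    if g.can_get() and p.can_put():
        p.put(g.get())
        return True
    return False

producer = Fifo(capacity=4)
consumer = Fifo(capacity=2)
for packet in ["a", "b", "c"]:
    producer.put(packet)

# Four firing attempts: the last two fail because the consumer is full.
fired = [connect_rule(producer, consumer) for _ in range(4)]
print(fired, list(producer.items), list(consumer.items))
```

In BSV you write only the body of the rule; the readiness test in `connect_rule` corresponds to logic that is generated rather than hand-written.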

13

Interfaces can be composed

Get/Put pairs are very common, and duals of each other, so the BSV library defines Client/Server interfaces for this purpose

   interface Client #(req_t, resp_t);
      interface Get#(req_t)  request;
      interface Put#(resp_t) response;
   endinterface

   interface Server #(req_t, resp_t);
      interface Put#(req_t)  request;
      interface Get#(resp_t) response;
   endinterface

   module mkConnection #(Client#(req_t,resp_t) c, Server#(req_t,resp_t) s) (Empty);
      mkConnection (c.request,  s.request);
      mkConnection (s.response, c.response);
   endmodule

Note the overloaded mkConnection (BSV uses Haskell’s Typeclass mechanism for user-extensible, recursive, statically typed overloading)

[Figure: a client (Get for req_t, Put for resp_t) wired to a server (Put for req_t, Get for resp_t); each Get and Put carries data, RDY, and ENA signals]

14

Example: a Butterfly cross-bar switch

Basic building blocks:
• buffer (FIFO)
• 2x1 merge
• routing logic

Recursive structure: 1x1, 2x2, 4x4, … NxN

The entire interface can be defined in a few lines (polymorphic in the data type of packets flowing through the switch):

   interface XBar #(type t);
      interface List#(Put#(t)) input_ports;
      interface List#(Get#(t)) output_ports;
   endinterface

15

Butterfly switch: module implementation

   module mkXBar #(Integer n,                                 // size of switch (# of ports)
                   function UInt #(32) destinationOf (t x),   // used by routing logic
                   Module #(Merge2x1 #(t)) mkMerge2x1)        // 2x1 merge module
                 ( XBar #(t) )                                // interface
      …
   endmodule: mkXBar

Module parameters: parameters are static arguments, and so can be of any type, including (unbounded) Integers, functions, modules, etc.

Interface: interfaces represent dynamic communications and can only carry hardware-representable types.

16


   module mkXBar #(...) ( XBar #(t) );
      List #(Put#(t)) iports;
      List #(Get#(t)) oports;

      if (n == 1) begin          // ---- BASE CASE (n = 1)
         FIFO #(t) f <- mkFIFO;  // buffer (FIFO)
         iports = cons (toPut (f), nil);
         oports = cons (toGet (f), nil);
      end
      else begin                 // ---- RECURSIVE CASE (n > 1)
         ...
      end

      interface input_ports  = iports;
      interface output_ports = oports;
   endmodule: mkXBar

17


   module mkXBar #(...) ( XBar #(t) );

      if (n == 1) begin          // ---- BASE CASE (n = 1)
         ...
      end
      else begin                 // ---- RECURSIVE CASE (n > 1)
         XBar#(t) upper <- mkXBar (n/2, destinationOf, mkMerge2x1);
         XBar#(t) lower <- mkXBar (n/2, destinationOf, mkMerge2x1);

         List#(Merge2x1#(t)) merges <- replicateM (n, mkMerge2x1);

         iports = append (upper.input_ports, lower.input_ports);

         function Get#(t) oport_of (Merge2x1#(t) m) = m.oport;
         oports = map (oport_of, merges);

         ... routing behavior ...
      end

   endmodule: mkXBar

18


   module mkXBar #(...) ( XBar #(t) );

      if (n == 1) begin          // ---- BASE CASE (n = 1)
         ...
      end
      else begin                 // ---- RECURSIVE CASE (n > 1)
         ...
         let ps = append (upper.output_ports, lower.output_ports);
         for (Integer j = 0; j < n; j = j + 1)
            rule route;
               let x <- ps[j].get ();
               case (flip (destinationOf (x), j, n)) matches
                  tagged Invalid         : merges [j]       .iport0.put (x);
                  tagged Valid .jFlipped : merges [jFlipped].iport1.put (x);
               endcase
            endrule
      end

   endmodule: mkXBar
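The recursion can be checked with a small functional model in Python. The slides do not define flip, so the version below is an assumption: it stays at position j when the destination lies in the same half as j, and otherwise moves to the matching position in the other half. `route` and `count_parts` are hypothetical helper names, and n is assumed to be a power of two.

```python
def flip(dest, j, n):
    """Assumed behavior of flip. None models 'tagged Invalid' (stay at j);
    an int models 'tagged Valid .jFlipped' (cross to the other half)."""
    half = n // 2
    dest_in_second_half = (dest % n) >= half
    j_in_second_half = j >= half
    if dest_in_second_half == j_in_second_half:
        return None
    return j ^ half

def route(n, src, dest):
    """Output port reached by a packet entering an n x n switch at src."""
    if n == 1:
        return 0                      # base case: a single FIFO
    half = n // 2
    # Inputs 0..half-1 belong to 'upper', the rest to 'lower'.
    local_out = route(half, src % half, dest)
    j = local_out + (half if src >= half else 0)
    flipped = flip(dest, j, n)
    return j if flipped is None else flipped

def count_parts(n):
    """Number of FIFOs and 2x1 merges that elaboration would instantiate."""
    if n == 1:
        return {"fifos": 1, "merges": 0}
    sub = count_parts(n // 2)
    return {"fifos": 2 * sub["fifos"], "merges": 2 * sub["merges"] + n}

all_delivered = all(route(8, s, d) == d for s in range(8) for d in range(8))
parts = count_parts(8)
print(all_delivered, parts)
```

Under this assumed flip, every source reaches every destination in the 8x8 case, and the part count grows as n·log2(n) merges, which is what the recursive module structure suggests.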

19

Butterfly switch: atomicity of rules

   for (Integer j = 0; j < n; j = j + 1)
      rule route;
         let x <- ps[j].get ();                                               // may not be a packet to get
         case (flip (destinationOf (x), j, n)) matches
            tagged Invalid         : merges [j]       .iport0.put (x);        // may not be able to put a packet
            tagged Valid .jFlipped : merges [jFlipped].iport1.put (x);        // (flow control, contention)
         endcase
      endrule

The hardware control logic to manage these complex, dynamic (data-dependent), reactive control conditions is the most tedious and error-prone aspect of designing with RTL (Verilog, VHDL) and even with SystemC.

Creation of this logic is automated (synthesized), based on the atomicity semantics of rules.
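To make "atomic" concrete, here is a hedged Python sketch of one route rule. It models only the flow-control half of the story (a full target blocks the rule); contention between several rules firing in the same clock is not modeled. The `Port` class and `fire_route` function are hypothetical names, not BSV library entities.

```python
from collections import deque

class Port:
    """Bounded queue standing in for a Get or Put port (hypothetical model)."""
    def __init__(self, capacity, items=()):
        self.capacity = capacity
        self.items = deque(items)

    def can_get(self):
        return len(self.items) > 0

    def can_put(self):
        return len(self.items) < self.capacity

def fire_route(source, targets, choose_target):
    """Fire the rule only if every action in it can complete.
    The guards are evaluated without changing any state."""
    if not source.can_get():
        return False                      # no packet to get
    packet = source.items[0]              # peek only
    target = targets[choose_target(packet)]
    if not target.can_put():
        return False                      # data-dependent target is full
    source.items.popleft()                # commit both effects together
    target.items.append(packet)
    return True

source = Port(capacity=4, items=[0, 1, 1])
targets = [Port(capacity=1), Port(capacity=1)]
results = [fire_route(source, targets, lambda pkt: pkt) for _ in range(4)]
print(results, list(source.items))
```

The third attempt fails because the packet's chosen target is full, and the packet stays in the source: a rule never performs its get without its put.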

20

Butterfly switch: summary observations

The core mkXBar module is expressed in ~40-50 lines of code
• Parameterized by packet type, size, routing function, 2x1 merge module
• It’s fully synthesizable (550 MHz using Magma Synthesis, TSMC 0.18 micron libraries)

Static elaboration (“generativity”) has the full power of Haskell evaluation
• Higher-order functions, lists/vectors, recursion, ...

There is no syntactic distinction between the “static elaboration” part and the “dynamic” part of the source code
• An expression “a+b” may be used both for static elaboration and as a dynamic computation (i.e., an adder in the hardware)

2 layers: static elaboration produces a module hierarchy with rules
• The rules are then synthesized according to atomicity semantics into the correct data paths and control logic

21

Example: IFFT in 802.11a wireless transmitter

[Figure: transmitter pipeline: Controller → Scrambler → Encoder → Interleaver → Mapper → IFFT → Cyclic Extend; inputs are headers and data, 24 uncoded bits at a time]

IFFT
• Transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers
• Accounts for 85% of the area

22

The IFFT computation (specification)

[Figure: in0 … in63 pass through three stages, each made of 16 Bfly4 blocks followed by a permutation (Permute_1, Permute_2, Permute_3), producing out0 … out63. Inset: one Bfly4 multiplies its inputs by t0 … t3 and combines them with +, -, and *j]

All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...
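As a sketch of what one radix-4 butterfly computes, the Python below implements an unnormalized 4-point inverse DFT using only additions, subtractions, and a multiply by j, then compares it with the textbook definition. Assumptions to note: the twiddle multiplications (t0 … t3) are left out, Python floating-point complex numbers stand in for the 16-bit fixed-point values, and the exact wiring of the slide's Bfly4 may differ. The names `bfly4_inverse` and `idft4` are introduced here.

```python
import cmath

def bfly4_inverse(x0, x1, x2, x3):
    """Unnormalized 4-point inverse DFT with adds, subtracts, and one *j."""
    a, b = x0 + x2, x0 - x2
    c, d = x1 + x3, 1j * (x1 - x3)
    return (a + c, b + d, a - c, b - d)

def idft4(xs):
    """Reference: y[k] = sum over n of x[n] * exp(+2*pi*i*n*k/4)."""
    return [
        sum(x * cmath.exp(2j * cmath.pi * n * k / 4) for n, x in enumerate(xs))
        for k in range(4)
    ]

xs = (1 + 2j, -0.5 + 0.25j, 3 - 1j, 0.75j)
fast = bfly4_inverse(*xs)
slow = idft4(xs)
max_error = max(abs(f - s) for f, s in zip(fast, slow))
print(max_error < 1e-9)
```

The butterfly needs no general multiplier of its own, which is part of why a 64-point transform is commonly built from three stages of such blocks.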

23

IFFT: the HW implementation space (varying in area, power, clock speed, latency, throughput)

• Direct combinational circuit
• Varying degrees of pipelining
• Iterate 1 stage thrice
• In any stage, use fewer than 16 Bfly4s (serialization, fewer Bfly4s, unserialization)

24

Higher-order functions for building linear pipelines (“linear combinator”)

   module mkLinearPipe #(Integer n_stages,
                         Bool with_registers,
                         function Module #(Pipe#(a,a)) mkStage (Integer stage_j))
                       (Pipe#(a,a));
      ...
   endmodule

[Figure: mkLinearPipe () chains n_stages Pipe stages, numbered 0 to n_stages-1, each with a Put input and a Get output; stage_j is built by calling mkStage ()]
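The shape of the linear combinator carries over directly to software. This Python sketch is an analogy only: stages are plain functions rather than hardware pipes, with_registers is not modeled, and `mk_linear_pipe` and `mk_add_stage` are hypothetical names.

```python
def mk_linear_pipe(n_stages, mk_stage):
    """Build a pipeline by calling the stage generator once per stage index."""
    stages = [mk_stage(j) for j in range(n_stages)]   # "elaboration" time

    def pipe(x):                                      # "execution" time
        for stage in stages:
            x = stage(x)
        return x

    return pipe

def mk_add_stage(j):
    """Example stage generator: stage j adds its own index."""
    return lambda x: x + j

pipe = mk_linear_pipe(3, mk_add_stage)
result = pipe(10)   # 10 + 0 + 1 + 2
print(result)
```

The generator is called while the structure is being built, and the resulting stages are then fixed, which mirrors the split between static elaboration and execution.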

25

Higher-order functions for building looped pipelines (“loop combinator”)

   module mkLoopPipelined #(Integer n,
                            function Module#(PipeF #(Tuple2#(a, UInt#(logn)), a)) mkLoopBody ())
                          (PipeF #(a,a))

[Figure: mkLoopPipe () sends each value through a single loop body n times; the body takes a pair (x,j) of value and iteration index and returns the next value x]
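A software analogue of the loop combinator, again as a hedged Python sketch: one body is reused n times and receives the iteration index alongside the value, as the Tuple2 in the BSV signature suggests. `mk_loop_pipe` and `body` are hypothetical names, and hardware concerns such as multiplexers and loop control are not modeled.

```python
def mk_loop_pipe(n, loop_body):
    """Reuse a single body n times, passing it the value and the index."""
    def pipe(x):
        for j in range(n):
            x = loop_body(x, j)
        return x
    return pipe

def body(x, j):
    """Example loop body: depends on both the value and the iteration."""
    return x * 2 + j

looped = mk_loop_pipe(3, body)
unrolled = body(body(body(5, 0), 1), 2)   # what three separate stages compute
print(looped(5), unrolled)
```

The looped and unrolled forms compute the same function; in hardware they differ in area, clock rate, and throughput, which is the trade-off the architecture space is about.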

26

Generating all versions of IFFT

[Figure: the implementation space again: direct combinational circuit; varying degrees of pipelining; iterate 1 stage thrice; in any stage, use fewer than 16 Bfly4s (serialization, fewer Bfly4s, unserialization)]

Which architecture is “best” depends on the requirements
• Desired latency, throughput
• Area, power, clock speed
• Target silicon technology (FPGA, ASIC 90nm, ASIC 65nm, ...)

“PAClib” (Pipeline Architecture Constructor Library) is a library of such higher-order pipeline combinators. Using PAClib, IFFT can be succinctly expressed in a single source code which, depending on the parameters supplied, will elaborate (unfold) into any one of the possible architectures in the space of architectures illustrated.

PAClib enables a “pipeline DSL”

27

Another important reason for generativity: it enables rapid experimentation to determine the optimal architecture

Architectural effects can be quite unpredictable. E.g.,
• Hypothesis: linear pipe will take more silicon area than looped pipe

But the looped pipe has other silicon costs:
• Needs multiplexers, control logic → area cost
• Needs higher clock speed for same throughput → area cost, power cost
• A kicker: disables some constant propagations → area cost, power cost

(for ASICs, silicon area directly affects price of chip)

Bottom line:
• Need to be able to experiment with different architectures
• Generativity allows scripting the exploration of the space

28

I hope that by now you’re saying:

“Hey! Writing HW programs doesn’t look too hard!”
(Has all the creature comforts of a modern high-level programming language.)

But, so what?
• Why would I want to compute something directly in HW?
• Even if I want to, aren’t the costs and logistics of actually putting something in HW just too high a barrier?

29

Why implement things in HW?

Reason (1): Speed

[Figure: two ways to run application X. Run a fixed machine (e.g., x86, GPGPU, Cell), which interprets instructions (a program) for application X; or run an X-machine (fine-grain parallel) directly]

Direct implementation in HW typically
• removes a layer of interpretation, and interpretation generally costs an order of magnitude in speed
• can exploit more parallelism

Caveat: lots of devils in the details
• Interpretation at GHz may still be faster than direct execution at MHz
• Interpretation with monster memory bandwidth may still be faster than direct execution with anemic memory bandwidth

30

Why implement things in HW?

Reason (2): Power consumption

• Interpretation on fixed computing architectures costs power

[Figure: a fixed machine (e.g., x86, GPGPU, Cell) interpreting instructions (a program) for application X, compared with an X-machine]

Interpret:
• Pay energy cost for X-execution
• Also pay for fetch, decode, register management, cache management, extra data movement, branch misprediction, ...

• Portable devices: battery life
• Server farms/clouds: cost of power supply, air conditioning

31

Opportunity with today’s FPGA technology (Field Programmable Gate Arrays)

FPGA capacity:
• millions of gates

FPGA speeds:
• 100s of MHz

FPGA board costs:
• As low as $100s
• $1K-$10K typical
• $10K-$100K for multi-FPGA boards

FPGA host communication links:
• USB
• 1Gb/10Gb Ethernet
• PCI Express

... new and exciting:
• FPGA-in-processor-socket:
  • AMD Hypertransport bus
  • Intel Front-Side Bus
• FPGA-on-processor-chip:
  • Coming soon

Example of what is possible: a single FPGA can easily run H.264 decoding at VGA resolution (640x480) and, with a good design, at HDTV (1920x1080) resolution

[Figure: your application software on the host, linked to an FPGA subsystem running your computation on the FPGA. Example emulation board: an FPGA device holding an FA626 processor with ICE, L2 cache, AXI interconnect fabric, AXI-AHB bridge, interrupt controller, SRAM and DDR2 controllers, Ethernet GMAC, security engine, RS232 UART, traffic generators, and transactors, with a console, debugger, and co-emulation link to the host (Linux, FA626 ICE, Bluespec Emulation)]

32

Making FPGA acceleration easy and routine

Atop today’s FPGA technology, we provide the communication infrastructure:
• Make it easy for SW to invoke a HW service or vice versa
• Concurrent, pipelined, ...
• Model: Concurrent RPCs (Remote Procedure Calls)
• Auto-generate SW and HW (BSV) stubs from service specs
• (like using IDL to specify distributed client/server communication)

A “Communications Protocol Stack”, mirrored on both sides:
• Software side: SW app / services / SCE-MI / Link layer
• Hardware side: HW app (BSV/RTL) / services / SCE-MI / Link layer. HW agnostic: FPGA (or Bluesim/Verilog sim)
• Physical link: sockets/PCIe/USB/Ethernet/FSB/Hypertransport

Analogy: RPC / socket / TCP/IP / Ethernet

33

Putting it all together:

[Flow diagram: your application is split into a SW part (e.g., C++) and a HW part (BSV), both written against Get/Put/Client/Server interfaces and joined by mkConnection connections. Each side sits on generated services, SCE-MI, and link layers. The SW part goes through gcc and link/load; the HW part goes through BSV synthesis, FPGA synthesis etc., and link/load onto the FPGA]

• BSV applies GPCE concepts to HW design: generation, parameterization, changeability, reusability, easy exploration of architecture space, ...
• FPGAs are compelling due to speed, lower power, low cost, fast communication with host

34

Example: CMU ProtoFlex
http://www.ece.cmu.edu/~protoflex

[Figure: Virtutech Simics connected over Ethernet to a Virtex5 FPGA running the BSV UltraSparc model]

Virtutech Simics: commercial SW simulator for whole systems (OS/devices/apps) (“Virtual Platform” for early SW development, before ASIC is available)
Problem: very clever tricks for fast simulation, but steady slowdown
– for each added thread and core
– for each added bit of instrumentation

CMU ProtoFlex:
– Fully operational model of 16-cpu UltraSPARC III SunFire 3800 Server, running unmodified Solaris 8; running on FPGA at 90 MHz
– Hybrid simulation: continue to use Simics for modeling rest of system (I/O devices, ...)
– Benchmark: TPC-C OLTP on Oracle 10g Enterprise Database Server. Also SPECINT (bzip2, crafty, gcc, gzip, parser, vortex)
– Performance: 10-60 MIPS; 39x faster than Virtutech Simics alone on same system/benchmark
– Written in BSV by 1 graduate student (Eric Chung) in 1 year!

35

Example: Univ. of Glasgow document retrieval experiment

“FPGA-Accelerated Information Retrieval: High-Efficiency Document Filtering”, W. Vanderbauwhede, L. Azzopardi, and M. Moadeli, in Proc. 19th IEEE Intl. Conf. on Field Programmable Logic and Applications (FPL'09), Prague, Czech Republic, Aug 31-Sep 2, 2009

[Figure: a document stream enters the FPGA (match algorithm), which reads search terms from SRAM and emits a score stream]

E.g.,
• find spam in emails
• find similar patents
• find relevant news stories

Experiments on 3 collections, from ~1M to 1.5M documents each

Ran same algorithm on
• 1.6 GHz Itanium-2
• Virtex-4 FPGA

Power consumption: 130 Watts (Itanium), 1.25 Watts (FPGA)

Speedup: ~10x to 20x
• Itanium slows down as profile (search database) size increases
• FPGA does not (parallelism)
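Combining the two figures gives a rough energy-per-job comparison. The arithmetic below is a back-of-the-envelope sketch in Python, assuming the quoted wattages are average power over the run and that energy is simply power times run time; the variable names are introduced here and the result is not a figure reported by the paper.

```python
itanium_watts = 130.0
fpga_watts = 1.25
speedups = (10.0, 20.0)          # FPGA run time is 1/speedup of the Itanium's

# Energy = power x time, so the energy ratio is the power ratio
# multiplied by the speedup.
power_ratio = itanium_watts / fpga_watts
energy_ratios = [power_ratio * s for s in speedups]

print(power_ratio, energy_ratios)
```

Under those assumptions the FPGA draws about 104x less power and, because it also finishes sooner, uses roughly 1000x to 2000x less energy for the same job.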

36

Example: MEMOCODE’08 Design Contest

Goal: Speed up a software reference application running on the PowerPC on Xilinx XUP reference board using SW/HW codesign

The application, on a large db of records in DRAM:
• decrypt
• sort
• re-encrypt

Time allotted: 4 weeks

[Figure: Xilinx XUP board]

http://rijndael.ece.vt.edu/memocontest08/

37

Example: MEMOCODE’08 Design Contest Results

[Figure: contest results chart, with the BSV entry marked]

Reference: http://rijndael.ece.vt.edu/memocontest08/everybodywins/

Records had to be repeatedly streamed through a “merge-sort” block.

Advantage to those who could rapidly generate a variety of merge-sort architectures and find the best one to “fit” into the FPGA

38

In summary

With languages that use GPCE principles, HW design is now ready for incorporation into your programming toolbox!

[Figure: the SW part (e.g., C++) / HW part (BSV) flow diagram from slide 33, repeated]

Thank you for your kind attention!

39

Acknowledgements

James Hoe (MIT/CMU) and Arvind (MIT) for original technology for high-level synthesis from rules to RTL used in BSV today, 1997-2000

Lennart Augustsson (Chalmers/Sandburst) for Haskell-based generative technology used in BSV today, 2000-2003

My colleagues in the engineering teams at Sandburst and Bluespec for continuous and substantial improvements, 2000-2009

Prof. Arvind’s group at MIT for their research and ideas, 2000-2009
