Accelerating Applications using FPGAs Satnam Singh, Microsoft Research, Cambridge UK


A Heterogeneous Future

Example Speedup: DNA Sequence Matching

Why are regular computers not fast enough?

FPGAs are the Lego of Hardware

multiple independent multi-ported memories

fine-grain parallelism and pipelining

hard and soft embedded processors

The heart of an FPGA

LUT4 (OR)

LUT4 (AND)

LUTs are higher order functions

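[Figure: four LUT primitives. lut1 has input i; lut2 has inputs i0, i1; lut3 has inputs i0, i1, i2; lut4 has inputs i0, i1, i2, i3. Each has a single output o.]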

inv = lut1 not

and2 = lut2 (&&)

mux = lut3 (λ s d0 d1 . if s then d1 else d0)
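The idea that a LUT is a higher-order function can be sketched in software. The following is a minimal Python model, not the notation used on the slide: the helper name `lut` and the `table` attribute are assumptions made for illustration. It takes an ordinary Boolean function, tabulates it over every input combination (which is what configuring a LUT amounts to), and returns a new function that only performs the table lookup.

```python
from itertools import product


def lut(n, f):
    """Model an n-input LUT: tabulate f over all 2**n input combinations."""
    table = {bits: bool(f(*bits)) for bits in product((False, True), repeat=n)}

    def circuit(*inputs):
        # Evaluation is a pure table lookup; f is never called again.
        return table[tuple(bool(b) for b in inputs)]

    circuit.table = table  # exposed so the "configuration bits" can be inspected
    return circuit


# The three definitions from the slide, expressed with the hypothetical helper.
inv = lut(1, lambda i: not i)
and2 = lut(2, lambda a, b: a and b)
mux = lut(3, lambda s, d0, d1: d1 if s else d0)
```

A 3-input LUT stores 8 entries and a 4-input LUT stores 16, regardless of how complicated the function passed in is.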

FPGAs as Co-Processors

XD2000i FPGA in-socket accelerator for Intel FSB

XD2000F FPGA in-socket accelerator for AMD socket F

XD1000 FPGA co-processor module for socket 940

What kind of problems fit well on FPGA?

opportunity

scientific computing, data mining, search, image processing, financial analytics

challenge

Fibonacci Example

0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...

entity fib is
  port (signal clk, rst : in bit ;
        signal fibnr : out natural) ;
end entity fib ;

architecture behavioural of fib is
  signal lastFib, currentFib : natural ;
begin

  compute_fibs : process
  begin
    wait until clk'event and clk='1' ;
    if rst = '1' then
      lastFib <= 0 ;
      currentFib <= 1 ;
    else
      currentFib <= lastFib + currentFib ;
      lastFib <= currentFib ;
    end if ;
  end process compute_fibs ;

  fibnr <= currentFib ;

end architecture behavioural ;
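One detail of the VHDL that is easy to miss is that both signal assignments in the else branch read the values held before the clock edge, so the two registers update simultaneously. A rough Python model of that behaviour is sketched below; the function name `fib_circuit` is an assumption for illustration, and a tuple assignment stands in for the simultaneous register update.

```python
def fib_circuit(cycles):
    """Return the fibnr output after reset and after each following clock edge."""
    last_fib, current_fib = 0, 1  # register values loaded while rst = '1'
    outputs = [current_fib]
    for _ in range(cycles):
        # Both right-hand sides use the old register values, as the
        # VHDL signal assignments do within one clock edge.
        last_fib, current_fib = current_fib, last_fib + current_fib
        outputs.append(current_fib)
    return outputs
```

Writing the two updates as separate sequential statements without a temporary would give a different (wrong) sequence, which is the main difference from ordinary software variables.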

demonstration...

data parallel descriptions

FPGA hardware (VHDL)

GPU code (Accelerator)

SMP C++

The Accidental Semi-colon

;

Kiwi

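[Figure: a spectrum of design styles from structural to imperative (C), with parallel imperative in between. Labels along the spectrum: gate-level, VHDL/Verilog, Kiwi, C-to-gates. Illustrations: a gate and set/reset latch schematic, semicolons, and jpeg.c with threads 1, 2 and 3.]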

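[Figure: Kiwi design flow. A program (JPEG.cs) is written against the Kiwi library (Kiwi.cs), which provides the circuit model. Visual Studio is used for multi-thread simulation, debugging and verification. Kiwi Synthesis produces the circuit implementation (JPEG.v).]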

[Figure: a parallel C# program made of several threads. Each thread passes through C-to-gates synthesis to produce a circuit, and the circuits are combined into Verilog for the system.]

Our Implementation

• Use regular Visual Studio technology to generate a .NET IL assembly language file.

• Our system then processes this file to produce a circuit:

– The .NET stack is analyzed and removed.

– The control structure of the code is analyzed and broken into basic blocks which are then composed.

– The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.
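The basic-block step can be illustrated with the textbook "leaders" algorithm. This is only a sketch under stated assumptions: the slides do not say how Kiwi performs the split, and the instruction tuples below use made-up, loosely IL-like opcode names with branch targets given as list indices.

```python
def basic_blocks(instrs):
    """Split a list of (opcode, argument) pairs into basic blocks."""
    leaders = {0} if instrs else set()
    for index, (opcode, arg) in enumerate(instrs):
        if opcode in ("br", "brtrue"):  # hypothetical branch opcodes
            leaders.add(arg)            # a branch target starts a block
            if index + 1 < len(instrs):
                leaders.add(index + 1)  # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [instrs[s:e] for s, e in zip(starts, ends)]


# A small loop: indices 2..5 form the loop, index 6 is the exit.
program = [
    ("ldc", 0),
    ("stloc", "i"),
    ("ldloc", "i"),
    ("brtrue", 6),
    ("add", None),
    ("br", 2),
    ("ret", None),
]
blocks = basic_blocks(program)
```

Each resulting block has a single entry and a single exit, which is what makes it a convenient unit to turn into a piece of circuit and then compose with the others.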

System Composition

• We need a way to separately develop components and then compose them together.

• Don’t invent new language constructs: reuse existing concurrency machinery.

• Adopt single-place channels for the composition of components.

• Model channels with regular concurrency constructs (monitors).

Writing to a Channel

public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }

Reading from a Channel

    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
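For readers who want to experiment with the one-place channel outside .NET, the same monitor pattern can be sketched in Python with `threading.Condition`. This is a software analogue written for illustration, not part of Kiwi; the `demo` function and its producer/doubler threads are assumptions modelled on the producer/consumer style used in these slides.

```python
import threading


class Channel:
    """One-place channel: write blocks while full, read blocks while empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self._empty = True
        self._datum = None

    def write(self, value):
        with self._cond:
            while not self._empty:
                self._cond.wait()
            self._datum = value
            self._empty = False
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while self._empty:
                self._cond.wait()
            self._empty = True
            value = self._datum
            self._cond.notify_all()
            return value


def demo(count=10):
    """Producer -> chan1 -> doubler -> chan2 -> caller."""
    chan1, chan2 = Channel(), Channel()

    def producer():
        for i in range(count):
            chan1.write(i)

    def doubler():
        for _ in range(count):
            chan2.write(2 * chan1.read())

    threads = [threading.Thread(target=producer), threading.Thread(target=doubler)]
    for t in threads:
        t.start()
    results = [chan2.read() for _ in range(count)]
    for t in threads:
        t.join()
    return results
```

The `while` loops around `wait()` matter: a woken thread must re-check the condition, because `notify_all` wakes readers and writers alike.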

systems level concurrency constructs: threads, events, monitors, condition variables

rendezvous, join patterns, transactional memory, data parallelism

user applications

domain specific languages

class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    // Reads each value from chan1 and forwards twice that value on chan2.
    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    // Sends the values 0..9 on chan1.
    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();

        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();

Filter Example

[Figure legend: thread, one-place channel]

// size (the number of filter taps) is not defined in this snippet.
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];

    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;

    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];

        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}

Transposed Filter

static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
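To see why moving the delay registers onto the partial sums gives the same answer as the sequential filter, the two structures can be compared in a few lines of Python. This is a generic sketch of direct-form versus transposed-form FIR filtering, not a model of the exact Kiwi channel wiring; the function names and the list-based registers are assumptions for illustration.

```python
def fir_direct(weights, samples):
    """Direct form: keep a window of past inputs, sum the products each step."""
    window = [0] * len(weights)
    out = []
    for x in samples:
        window = [x] + window[:-1]
        out.append(sum(w * v for w, v in zip(weights, window)))
    return out


def fir_transposed(weights, samples):
    """Transposed form: every tap sees the current x; partial sums are delayed."""
    n = len(weights)
    regs = [0] * n  # regs[k] is the partial sum waiting to be added by tap k
    out = []
    for x in samples:
        out.append(weights[0] * x + regs[0])
        # Tap k adds its product to the partial sum from tap k + 1;
        # the last tap has nothing behind it, hence the trailing 0.
        regs = [weights[k] * x + regs[k] for k in range(1, n)] + [0]
    return out
```

In the transposed form each tap does one multiply and one add per sample with only a neighbour-to-neighbour dependency, which is what makes it a natural fit for one thread (and one small circuit) per tap.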

Inter-thread Communication and Synchronization

// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}

// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate() { Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]); });
    tapThread.Start();
}
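One likely answer to the quiz is closure capture: an anonymous delegate captures the loop variable itself rather than its value at that moment, so a thread that starts running after the loop has advanced would see the wrong index. Python closures behave the same way, which makes the effect easy to reproduce. The sketch below is an analogy, not C# code; the two function names are made up for illustration.

```python
def make_taps_shared(size):
    # Every closure refers to the same variable i, so after the loop
    # finishes they all observe its final value.
    return [lambda: i for i in range(size)]


def make_taps_local(size):
    # Binding the current value to a fresh name per iteration (here via a
    # default argument) plays the role of "int j = i" in the C# loop.
    return [lambda j=i: j for i in range(size)]


shared = [tap() for tap in make_taps_shared(3)]
local = [tap() for tap in make_taps_local(3)]
```

With the shared variable every tap reports the last index; with a per-iteration copy each tap keeps its own.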

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.Research.DataParallelArrays;
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;

namespace ForOxford
{
    class Program
    {
        static void Main(string[] args)
        {
            PA.InitGPU();
            IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
            IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
            IPA is3 = new IPA(4, is1.Shape);
            is3 = PA.Add(is1, is2);
            IPA result = PA.Evaluate(is3);
            int[] ra1;
            PA.ToArray(result, out ra1);
            foreach (int i in ra1)
                Console.Write(i + " ");
            Console.WriteLine("");
        }
    }
}

Example: Bitmap Blur (Using Accelerator v1.1.1)

using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;

float[,] Blur (float[] kernel)
{
    FPA pa = new FPA(bitmap);

    // Convolve in X direction
    FPA resultX = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultX += PA.Shift(pa, 0, i) * kernel[i];
    }

    // Convolve in Y direction.
    FPA resultY = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultY += PA.Shift(resultX, i, 0) * kernel[i];
    }

    float [,] result;
    PA.ToArray (resultY, out result);
    return result;
}

Expression Graphs

FPA pa = new FPA(bitmap);

// Convolve in X direction
FPA rX = new FPA(0, pa.Shape);

for (int i = 0; i < kernel.Length; i++)
{
    rX += PA.Shift(pa, 0, i) * kernel[i];
}

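[Figure: expression graph for rX. pa feeds Shift (0,0) and Shift (0,1); each shifted array is multiplied by k[0] and k[1] respectively, and the products are accumulated through a chain of + nodes into rX, continuing for the remaining kernel elements.]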
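The point of the expression graph is that the loop does no arithmetic on data; it only builds a description of the computation, which is evaluated later on whichever target is chosen. A minimal Python sketch of that idea follows. It is not the Accelerator API: the `Node` class, the helper names, the one-dimensional arrays and the zero fill at the boundary of `shift` are all assumptions made for illustration.

```python
class Node:
    """A node in a lazily built expression graph."""

    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Node("add", self, other)

    def __mul__(self, scalar):
        return Node("scale", self, scalar)


def const(values):
    return Node("const", list(values))


def shift(node, offset):
    return Node("shift", node, offset)


def evaluate(node):
    """Walk the graph and compute the result (a stand-in for a real back end)."""
    if node.op == "const":
        return list(node.args[0])
    if node.op == "add":
        left, right = evaluate(node.args[0]), evaluate(node.args[1])
        return [a + b for a, b in zip(left, right)]
    if node.op == "scale":
        return [a * node.args[1] for a in evaluate(node.args[0])]
    if node.op == "shift":
        data, offset = evaluate(node.args[0]), node.args[1]
        size = len(data)
        # Assumed boundary rule: positions shifted in from outside are zero.
        return [data[i + offset] if 0 <= i + offset < size else 0 for i in range(size)]
    raise ValueError("unknown op: " + node.op)


pa = const([1, 2, 3, 4])
kernel = [1, 2]
rx = const([0, 0, 0, 0])
for i, k in enumerate(kernel):
    rx = rx + shift(pa, i) * k  # builds graph nodes only; nothing is computed yet
```

After the loop `rx` is still just a tree of nodes; calling `evaluate(rx)` is the step that corresponds to handing the graph to a GPU or FPGA back end.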

class Program
{
    static void Main(string[] args)
    {
        IPA.InitGPU();

        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});

        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});

        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply (ipa4, (PA.Subtract (ipa3, PA.Add(ipa1, ipa2))));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}

class Program
{
    static void Main(string[] args)
    {
        IPA.InitFPGA();

        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});

        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});

        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply (ipa4, (PA.Subtract (ipa3, PA.Add(ipa1, ipa2))));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
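As a reference for checking either target, the element-wise expression in these two programs, ipa4 * (ipa3 - (ipa1 + ipa2)), can be worked out directly. The Python below is only a plain-list calculation using the array values from the slides; it does not use or model the Accelerator library.

```python
ipa1 = [1, 2, 3, 4, 5]
ipa2 = [10, 20, 30, 40, 50]
ipa3 = [21, 5, 7, 4, 8]
ipa4 = [4, 1, 7, 2, 5]

# Same shape of expression as PA.Multiply(ipa4, PA.Subtract(ipa3, PA.Add(ipa1, ipa2))),
# applied independently to each element position.
result = [d * (c - (a + b)) for a, b, c, d in zip(ipa1, ipa2, ipa3, ipa4)]
```

Every element is computed independently of the others, which is the property that lets the same description be mapped to GPU lanes or to a pipelined FPGA datapath.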

with addr select
  net_7 <= 10 when 0,
           20 when 1,
           30 when 2,
           40 when 3,
           50 when 4;

process
begin
  wait until clk'event and clk='1' ;
  net_5 <= net_6 + net_7 ;
end process ;

process
  type net_4_delay_type is array (0 to 1) of integer ;
  variable net_4_delayed : net_4_delay_type ;
begin
  wait until clk'event and clk='1' ;
  net_4_delayed(0) := net_4_delayed(1) ;
  net_4_delayed(1) := net_4 ;
  net_3 <= net_4_delayed(0) - net_5 ;
end process ;

8.249ns max delay
3 x DSP48Es
63 slice registers
24 slice LUTs

let rec bfly r n =
  match n with
    1 -> r
  | n -> ilv (bfly r (n-1)) >-> evens r

Cryptol

as = [0x3F 0xE2 0x65 0xCA] # new;
new = [| a ^ b ^ c || a <- as
                   || b <- drop(1,as)
                   || c <- drop(3,as) |];
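The Cryptol definition is recursive: `as` starts with four seed bytes and continues with `new`, while each element of `new` is the XOR of three earlier elements of `as`. Assuming the `||` arms are traversed in parallel (element i of each arm together), the recurrence is as[i+4] = as[i] ^ as[i+1] ^ as[i+3]. A small Python sketch of that reading, with the function name `stream` chosen for illustration:

```python
def stream(n):
    """First n elements of the stream defined by the Cryptol snippet."""
    values = [0x3F, 0xE2, 0x65, 0xCA]  # the four seed bytes
    while len(values) < n:
        i = len(values) - 4
        # a comes from as, b from drop(1, as), c from drop(3, as)
        values.append(values[i] ^ values[i + 1] ^ values[i + 3])
    return values[:n]
```

In hardware this corresponds to a four-stage shift register with an XOR feeding its input, so one new element is produced per clock.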

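[Figure: the stream as begins 3F E2 65 CA; elements of as are combined with XOR (^) to produce new, which is appended to as.]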

Bluespec

rule enqueueSOFData (rx_src_rdy_n_input == 0 &&
                     rx_sof_n_input == 0 &&
                     recv_state == Ready_for_frame) ;
  fifo_in.enq (rx_data_input) ;
  recv_state <= Reading_frame ;
endrule

Esterel

Esterel design

void uart_device_driver (){.....}

uart.c

VHDL, Verilog -> hardware implementation

C -> software implementation

Some Challenges for Spatial Computing

• Language support:

– Specifying resources.

– Specifying memory organization.

– Specifying timing.

– Specifying control.

– Models of computation.

• Co-design and verification.

• System integration (OS APIs).

• AWFUL AWFUL AWFUL vendor tools.

Some Challenges for Heterogeneous Systems

• A single model for programming very different kinds of computational elements?

• Giving up abstractions

– memory

• Constant failure.

– dynamically re-mapping computations

Questions?
