Accelerating Applications using FPGAs
Satnam Singh, Microsoft Research, Cambridge UK
A Heterogeneous Future
Example Speedup: DNA Sequence Matching
Why are regular computers not fast enough?
FPGAs are the Lego of Hardware
• multiple independent multi-ported memories
• fine-grain parallelism and pipelining
• hard and soft embedded processors
The heart of an FPGA
LUT4 (OR)
LUT4 (AND)
LUTs are higher order functions
[Figure: lut1, lut2, lut3 and lut4 blocks, with inputs i (lut1), i0 i1 (lut2), i0 i1 i2 (lut3) and i0 i1 i2 i3 (lut4), each with a single output o]
inv = lut1 not
and2 = lut2 (&&)
mux = lut3 (λ s d0 d1 . if s then d1 else d0)
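The three definitions above treat a LUT as a function that takes another function and returns a circuit element. A minimal Python sketch of that idea follows. The helper name `lut` and its arity argument are assumptions made for illustration; this is not the notation used on the slide. The truth table is precomputed when the LUT is built, which mirrors how a hardware LUT stores its contents.

```python
from itertools import product


def lut(n, f):
    """Build an n-input lookup table from a Boolean function f (hypothetical helper)."""
    # Precompute the truth table once, as a hardware LUT's contents would be.
    table = {bits: bool(f(*bits)) for bits in product((False, True), repeat=n)}

    def lookup(*inputs):
        # Evaluation is a pure table lookup; f is never called again.
        return table[tuple(bool(b) for b in inputs)]

    return lookup


inv = lut(1, lambda a: not a)
and2 = lut(2, lambda a, b: a and b)
mux = lut(3, lambda s, d0, d1: d1 if s else d0)
```

For example, `mux(False, True, False)` selects `d0`, and `mux(True, True, False)` selects `d1`.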
FPGAs as Co-Processors
XD2000i FPGA in-socket accelerator for Intel FSB
XD2000F FPGA in-socket accelerator for AMD socket F
XD1000 FPGA co-processor module for socket 940
What kind of problems fit well on FPGA?
opportunity
• scientific computing
• data mining
• search
• image processing
• financial analytics
challenge
Fibonacci Example
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181, 6765, ...
entity fib is
  port (signal clk, rst : in bit ;
        signal fibnr : out natural) ;
end entity fib ;

architecture behavioural of fib is
  signal lastFib, currentFib : natural ;
begin
  compute_fibs : process
  begin
    wait until clk'event and clk='1' ;
    if rst = '1' then
      lastFib <= 0 ;
      currentFib <= 1 ;
    else
      currentFib <= lastFib + currentFib ;
      lastFib <= currentFib ;
    end if ;
  end process compute_fibs ;
  fibnr <= currentFib ;
end architecture behavioural ;
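The two signal assignments in the `else` branch both read the values from before the clock edge, so `lastFib` receives the old `currentFib` rather than the new sum. A rough Python sketch of that clocked behaviour is shown here. The function name `simulate_fib` and the choice to assert reset only during the first cycle are assumptions for illustration.

```python
def simulate_fib(cycles):
    """Simulate the clocked process, with rst = '1' on the first rising edge only."""
    last_fib, current_fib = 0, 0  # overwritten by the reset on cycle 0
    outputs = []
    for cycle in range(cycles):
        rst = cycle == 0
        # Compute next values from the OLD signal values, as VHDL "<=" does.
        if rst:
            next_last, next_current = 0, 1
        else:
            next_current = last_fib + current_fib
            next_last = current_fib
        # Signals update together after the clock edge.
        last_fib, current_fib = next_last, next_current
        outputs.append(current_fib)  # fibnr follows currentFib
    return outputs
```

Writing the update as two sequential statements on the live variables would give a different sequence, which is why the next values are computed first.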
demonstration...
data parallel descriptions
FPGA hardware (VHDL)
GPU code (Accelerator)
SMP C++
The Accidental Semi-colon
;
Kiwi
[Figure: design styles (structural, imperative (C), parallel imperative) shown alongside gate-level VHDL/Verilog, C-to-gates and Kiwi]
[Figure: Kiwi flow. jpeg.c (thread 1, thread 2, thread 3) and the Kiwi library (Kiwi.cs) give JPEG.cs. In Visual Studio this is the circuit model, used for multi-thread simulation, debugging and verification. Kiwi Synthesis turns it into the circuit implementation, JPEG.v.]
[Figure: a parallel program in C#. Each thread passes through C-to-gates to produce a circuit; together the circuits form the Verilog for the system.]
Our Implementation
• Use regular Visual Studio technology to generate a .NET IL assembly language file.
• Our system then processes this file to produce a circuit:
– The .NET stack is analyzed and removed.
– The control structure of the code is analyzed and broken into basic blocks which are then composed.
– The concurrency constructs used in the program are used to control the concurrency / clocking of the generated circuit.
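To make the basic-block step concrete, here is a small Python sketch of the textbook "leaders" method for splitting a linear instruction list into basic blocks. It is not taken from the Kiwi implementation. The instruction format (an opcode paired with an optional branch target index) and the opcode names `br` and `brtrue` are simplifying assumptions.

```python
def basic_blocks(instrs):
    """Split instrs, a list of (opcode, target_or_None), into basic blocks.

    Returns a list of blocks, each a list of instruction indices.
    """
    if not instrs:
        return []
    leaders = {0}  # the first instruction always starts a block
    for i, (op, target) in enumerate(instrs):
        if op in ("br", "brtrue"):  # hypothetical branch opcodes
            leaders.add(target)  # a branch target starts a block
            if i + 1 < len(instrs):
                leaders.add(i + 1)  # so does the instruction after a branch
    starts = sorted(leaders)
    ends = starts[1:] + [len(instrs)]
    return [list(range(s, e)) for s, e in zip(starts, ends)]


program = [
    ("ldc", None),   # 0
    ("brtrue", 4),   # 1: conditional branch to 4
    ("add", None),   # 2
    ("br", 5),       # 3: unconditional branch to 5
    ("sub", None),   # 4
    ("ret", None),   # 5
]
```

Each block has a single entry and a single exit, which is what makes blocks convenient units to compose into a larger control structure.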
System Composition
• We need a way to separately develop components and then compose them together.
• Don’t invent new language constructs: reuse existing concurrency machinery.
• Adopt single-place channels for the composition of components.
• Model channels with regular concurrency constructs (monitors).
Writing to a Channel
public class Channel<T>
{
    T datum;
    bool empty = true;

    public void Write(T v)
    {
        lock (this)
        {
            while (!empty)
                Monitor.Wait(this);
            datum = v;
            empty = false;
            Monitor.PulseAll(this);
        }
    }
Reading from a Channel
    public T Read()
    {
        T r;
        lock (this)
        {
            while (empty)
                Monitor.Wait(this);
            empty = true;
            r = datum;
            Monitor.PulseAll(this);
        }
        return r;
    }
} // end of class Channel<T>
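The same one-place channel can be sketched in Python, where `threading.Condition` plays the role of the C# monitor (`lock`, `Monitor.Wait`, `Monitor.PulseAll`). The class and method names mirror the slide, and `run_demo` is a hypothetical helper added only to exercise the channel.

```python
import threading


class Channel:
    """One-place channel: a writer blocks while full, a reader blocks while empty."""

    def __init__(self):
        self._cond = threading.Condition()
        self._empty = True
        self._datum = None

    def write(self, value):
        with self._cond:
            while not self._empty:  # wait until the single slot is free
                self._cond.wait()
            self._datum = value
            self._empty = False
            self._cond.notify_all()

    def read(self):
        with self._cond:
            while self._empty:  # wait until a value is present
                self._cond.wait()
            self._empty = True
            value = self._datum
            self._cond.notify_all()
            return value


def run_demo(n=5):
    """Send 0..n-1 through one channel between two threads (illustrative helper)."""
    chan = Channel()
    received = []
    producer = threading.Thread(target=lambda: [chan.write(i) for i in range(n)])
    consumer = threading.Thread(target=lambda: [received.append(chan.read()) for _ in range(n)])
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received
```

The `while` loops around `wait()` matter: a woken thread re-checks the condition, because another thread may have taken the slot first.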
[Figure: layers of concurrency abstractions]
systems level concurrency constructs: threads, events, monitors, condition variables
rendezvous, join patterns, transactional memory
data parallelism
user applications
domain specific languages
class FIFO2
{
    [Kiwi.OutputWordPort("result", 31, 0)]
    public static int result;

    static Kiwi.Channel<int> chan1 = new Kiwi.Channel<int>();
    static Kiwi.Channel<int> chan2 = new Kiwi.Channel<int>();

    public static void Consumer()
    {
        while (true)
        {
            int i = chan1.Read();
            chan2.Write(2 * i);
            Kiwi.Pause();
        }
    }

    public static void Producer()
    {
        for (int i = 0; i < 10; i++)
        {
            chan1.Write(i);
            Kiwi.Pause();
        }
    }

    public static void Behaviour()
    {
        Thread ProducerThread = new Thread(new ThreadStart(Producer));
        ProducerThread.Start();

        Thread ConsumerThread = new Thread(new ThreadStart(Consumer));
        ConsumerThread.Start();
    }
}
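For readers without the Kiwi library, the producer/consumer structure above can be approximated in plain Python. This sketch uses `queue.Queue(maxsize=1)` as a stand-in for a one-place channel and leaves out `Kiwi.Pause()`, which has no software counterpart here. The function name `run_fifo2` is an assumption, as is the extra step of collecting the values from `chan2` so that the result can be inspected.

```python
import queue
import threading


def run_fifo2(count=10):
    """Producer writes 0..count-1 to chan1; consumer doubles each value into chan2."""
    chan1 = queue.Queue(maxsize=1)  # stand-in for a one-place channel
    chan2 = queue.Queue(maxsize=1)

    def producer():
        for i in range(count):
            chan1.put(i)  # blocks while chan1 is full

    def consumer():
        while True:  # runs forever, like the original Consumer
            chan2.put(2 * chan1.get())

    # Daemon threads let the program exit even though the consumer never returns.
    threading.Thread(target=producer, daemon=True).start()
    threading.Thread(target=consumer, daemon=True).start()

    return [chan2.get() for _ in range(count)]
```

Because each queue holds at most one item, the producer can never run more than a step or two ahead of the consumer, which is the property the single-place channels give the composed circuit.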
Filter Example
[Figure legend: thread, one-place channel]
public static int[] SequentialFIRFunction(int[] weights, int[] input)
{
    int[] window = new int[size];
    int[] result = new int[input.Length];
    // Clear the window of x values to all zero.
    for (int w = 0; w < size; w++)
        window[w] = 0;
    // For each sample...
    for (int i = 0; i < input.Length; i++)
    {
        // Shift in the new x value
        for (int j = size - 1; j > 0; j--)
            window[j] = window[j - 1];
        window[0] = input[i];
        // Compute the result value
        int sum = 0;
        for (int z = 0; z < size; z++)
            sum += weights[z] * window[z];
        result[i] = sum;
    }
    return result;
}
Transposed Filter
static void Tap(int i, byte w,
                Kiwi.Channel<byte> xIn,
                Kiwi.Channel<int> yIn,
                Kiwi.Channel<int> yout)
{
    byte x;
    int y;
    while (true)
    {
        y = yIn.Read();
        x = xIn.Read();
        yout.Write(x * w + y);
    }
}
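A single-threaded Python sketch can show that the transposed structure computes the same result as the sequential filter. It models the textbook transposed form with one register of partial sums per tap, not the threaded Kiwi version, and the names `fir_direct` and `fir_transposed` are assumptions. In the transposed form each tap does what `Tap` does: multiply the current x by its weight and add the partial sum arriving from its neighbour.

```python
def fir_direct(weights, samples):
    """Direct form: keep a window of recent inputs and take a dot product."""
    window = [0] * len(weights)
    out = []
    for x in samples:
        window = [x] + window[:-1]  # shift in the new sample
        out.append(sum(w * v for w, v in zip(weights, window)))
    return out


def fir_transposed(weights, samples):
    """Transposed form: every tap sees the same x; partial sums are registered."""
    n = len(weights)
    partial = [0] * n  # partial[k]: registered sum arriving at tap k; last stays 0
    out = []
    for x in samples:
        out.append(weights[0] * x + partial[0])
        # Ascending order reads partial[k + 1] before it is overwritten,
        # so each update uses the previous cycle's register value.
        for k in range(n - 1):
            partial[k] = weights[k + 1] * x + partial[k + 1]
    return out
```

The transposed form has no shared input window, so each tap only talks to its neighbours, which is why it maps naturally onto one thread and one channel per tap.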
Inter-thread Communication and Synchronization
// Create the channels to link together the taps
for (int c = 0; c < size; c++)
{
    Xchannels[c] = new Kiwi.Channel<byte>();
    Ychannels[c] = new Kiwi.Channel<int>();
    Ychannels[c].Write(0); // Pre-populate y-channel registers with zeros
}

// Connect up the taps for a transposed filter
for (int i = 0; i < size; i++)
{
    int j = i; // Quiz: why do we need the local j?
    Thread tapThread = new Thread(delegate()
    {
        Tap(j, weights[j], Xchannels[j], Ychannels[j], Ychannels[j+1]);
    });
    tapThread.Start();
}
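On the quiz in the code above: an anonymous delegate captures the loop variable itself, not its value at that moment, so without the local `j` every tap thread could observe whatever `i` happens to be when the thread finally runs. Python closures have the same late-binding behaviour, which the following sketch uses as an analogy. The function names are hypothetical, and the default-argument idiom is Python's counterpart to the per-iteration local.

```python
def make_taps_shared(size):
    """Each closure refers to the one shared loop variable i."""
    taps = []
    for i in range(size):
        taps.append(lambda: i)  # i is looked up when the closure is called
    return taps


def make_taps_local(size):
    """Each closure gets its own copy, like 'int j = i;' in the C# loop."""
    taps = []
    for i in range(size):
        taps.append(lambda j=i: j)  # default argument is evaluated now
    return taps
```

Calling every closure from `make_taps_shared(3)` after the loop has finished yields the final value of `i` each time, while `make_taps_local(3)` yields a distinct index per tap.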
using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.Research.DataParallelArrays;
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using IPA = Microsoft.Research.DataParallelArrays.IntParallelArray;

namespace ForOxford
{
    class Program
    {
        static void Main(string[] args)
        {
            PA.InitGPU();
            IPA is1 = new IPA(4, new int[] { 1, 2, 3, 4 });
            IPA is2 = new IPA(4, new int[] { 5, 6, 7, 8 });
            IPA is3 = new IPA(4, is1.Shape);
            is3 = PA.Add(is1, is2);
            IPA result = PA.Evaluate(is3);
            int[] ra1;
            PA.ToArray(result, out ra1);
            foreach (int i in ra1)
                Console.Write(i + " ");
            Console.WriteLine("");
        }
    }
}
Example: Bitmap Blur (Using Accelerator v1.1.1)
using PA = Microsoft.Research.DataParallelArrays.ParallelArrays;
using FPA = Microsoft.Research.DataParallelArrays.FloatParallelArray;

float[,] Blur (float[] kernel)
{
    FPA pa = new FPA(bitmap);
    // Convolve in X direction
    FPA resultX = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultX += PA.Shift(pa, 0, i) * kernel[i];
    }
    // Convolve in Y direction.
    FPA resultY = new FPA(0, pa.Shape);
    for (int i = 0; i < kernel.Length; i++)
    {
        resultY += PA.Shift(resultX, i, 0) * kernel[i];
    }
    float [,] result;
    PA.ToArray (resultY, out result);
    return result;
}
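Expression Graphs
FPA pa = new FPA(bitmap);

// Convolve in X direction
FPA rX = new FPA(0, pa.Shape);

for (int i = 0; i < kernel.Length; i++)
{
    rX += PA.Shift(pa, 0, i) * kernel[i];
}

[Figure: expression graph for rX. Each * node multiplies a shifted copy of pa by a kernel element (Shift (0,0) with k[0], Shift (0,1) with k[1], …), and a chain of + nodes accumulates the products into rX.]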
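The idea of an expression graph is that arithmetic on parallel arrays builds a data structure describing the computation, and the work happens only when the result is requested. A toy Python sketch of that deferred evaluation follows. The `Node`, `const` and `evaluate` names are invented for illustration and are not the Accelerator API, and the sample arrays are arbitrary.

```python
import operator


class Node:
    """A node in an expression graph; arithmetic builds the graph lazily."""

    def __init__(self, op, *args):
        self.op = op
        self.args = args

    def __add__(self, other):
        return Node("add", self, other)

    def __sub__(self, other):
        return Node("sub", self, other)

    def __mul__(self, other):
        return Node("mul", self, other)


def const(values):
    """Leaf node wrapping concrete data."""
    return Node("const", list(values))


_OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def evaluate(node):
    """Walk the graph and compute element-wise results only at this point."""
    if node.op == "const":
        return node.args[0]
    left, right = (evaluate(arg) for arg in node.args)
    return [_OPS[node.op](a, b) for a, b in zip(left, right)]


a = const([1, 2, 3, 4, 5])
b = const([10, 20, 30, 40, 50])
c = const([21, 5, 7, 4, 8])
d = const([4, 1, 7, 2, 5])
graph = d * (c - (a + b))  # nothing is computed yet; this is only a graph
```

Because the whole expression is available before any work is done, a back end is free to map it to a different target, which is the property the GPU and FPGA versions of the same program rely on.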
class Program
{
    static void Main(string[] args)
    {
        IPA.InitGPU();

        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});

        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});

        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply (ipa4, (PA.Subtract (ipa3, PA.Add(ipa1, ipa2))));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
class Program
{
    static void Main(string[] args)
    {
        IPA.InitFPGA();

        IPA ipa1 = new IPA(5, new int[] {1, 2, 3, 4, 5});
        IPA ipa2 = new IPA(5, new int[] {10, 20, 30, 40, 50});

        IPA ipa3 = new IPA(5, new int[] {21, 5, 7, 4, 8});
        IPA ipa4 = new IPA(5, new int[] {4, 1, 7, 2, 5});

        IPA ipa5 = new IPA(5, ipa1.Shape);
        ipa5 = PA.Add(ipa1, ipa2);
        IPA result = PA.Multiply (ipa4, (PA.Subtract (ipa3, PA.Add(ipa1, ipa2))));
        int[] ra1;
        PA.ToArray(result, out ra1);
        foreach (int i in ra1)
            Console.Write(i + " ");
        Console.WriteLine("");
    }
}
with addr select
  net_7 <= 10 when 0,
           20 when 1,
           30 when 2,
           40 when 3,
           50 when 4;

process
begin
  wait until clk'event and clk='1' ;
  net_5 <= net_6 + net_7 ;
end process ;

process
  type net_4_delay_type is array (0 to 1) of integer ;
  variable net_4_delayed : net_4_delay_type ;
begin
  wait until clk'event and clk='1' ;
  net_4_delayed(0) := net_4_delayed(1) ;
  net_4_delayed(1) := net_4 ;
  net_3 <= net_4_delayed(0) - net_5 ;
end process ;
• 8.249 ns max delay
• 3 x DSP48Es
• 63 slice registers
• 24 slice LUTs
let rec bfly r n =
  match n with
    1 -> r
  | n -> ilv (bfly r (n-1)) >-> evens r
Cryptol
as = [0x3F 0xE2 0x65 0xCA] # new;
new = [| a ^ b ^ c || a <- as || b <- drop(1,as) || c <- drop(3,as) |];
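[Figure: the stream as begins 3F, E2, 65, CA; XOR (^) nodes combine elements of as to produce new, which is appended to as]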
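The Cryptol definition above is recursive: `as` starts with four seed bytes and continues with `new`, while each element of `new` is the XOR of three elements of `as`. Assuming the `||` arms run in lockstep (so element i of `new` combines elements i, i+1 and i+3 of `as`), the stream can be sketched as a Python generator. The function name `stream_as` is an assumption.

```python
from itertools import islice


def stream_as():
    """Yield the infinite stream: four seed bytes followed by the XOR recurrence."""
    history = [0x3F, 0xE2, 0x65, 0xCA]
    yield from list(history)  # the seed elements come first
    i = 0
    while True:
        # Element i of 'new' is as[i] ^ as[i+1] ^ as[i+3] (lockstep assumption).
        nxt = history[i] ^ history[i + 1] ^ history[i + 3]
        history.append(nxt)
        yield nxt
        i += 1


first_six = list(islice(stream_as(), 6))
```

In hardware this is a shift register with XOR feedback, so the generator's growing `history` list stands in for a fixed window of four registers.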
Bluespec
rule enqueueSOFData (rx_src_rdy_n_input == 0
                     && rx_sof_n_input == 0
                     && recv_state == Ready_for_frame) ;
  fifo_in.enq (rx_data_input) ;
  recv_state <= Reading_frame ;
endrule
Esterel
Esterel design
void uart_device_driver (){.....}
uart.c
VHDL, Verilog -> hardware implementation
C -> software implementation
Some Challenges for Spatial Computing
• Language support:
– Specifying resources.
– Specifying memory organization.
– Specifying timing.
– Specifying control.
– Models of computation.
• Co-design and verification.
• System integration (OS APIs).
• AWFUL AWFUL AWFUL vendor tools.
Some Challenges for Heterogeneous Systems
• A single model for programming very different kinds of computational elements?
• Giving up abstractions
– memory
• Constant failure.
– dynamically re-mapping computations
Questions?