wl 2020 2.1
Custom computing systems
• difference engine: Charles Babbage, 1832 - compute maths tables
• digital orrery: MIT, 1985 - special-purpose engine; found Pluto's motion chaotic
• Splash2: Supercomputing Research Center, 1993 - multi-FPGA engine, for video processing, DNA computing etc.
• Harp1: Oxford University, 1995 - FPGA + microprocessor (transputer)
• SONIC, UltraSonic: Sony + Imperial College, 1999-2002 - multi-FPGA, professional video processing
• MaxWorkstation, MaxNode: 2011; Max5: 2017 - FPGA cards adopted by JP Morgan, Amazon…
wl 2020 2.2
• 1 exaflop = 10^18 FLOPS (TaihuLight: 93 petaflops)
• using processor cores with 8 FLOPS/clock at 2.5 GHz
• 50M CPU cores
• what about power? - assume power envelope of 100W per chip
- Moore's Law scaling: 6 cores today → ~100 cores/chip
- 500k CPU chips
• 50MW (just for CPUs!); 100MW likely
• ‘TaihuLight’ power consumption: 15MW
The Exaflop Supercomputer (2022)
source: Maxeler
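The sizing arithmetic above can be checked with a short script; all figures are taken straight from the slide:

```python
# Back-of-envelope sizing for a 1-exaflop machine (figures from the slide).
PEAK_FLOPS = 1e18          # 1 exaflop
FLOPS_PER_CLOCK = 8        # per core
CLOCK_HZ = 2.5e9           # 2.5 GHz
CORES_PER_CHIP = 100       # Moore's-Law-scaled chip (~100 cores)
WATTS_PER_CHIP = 100       # assumed power envelope

cores = PEAK_FLOPS / (FLOPS_PER_CLOCK * CLOCK_HZ)   # 50M cores
chips = cores / CORES_PER_CHIP                      # 500k chips
cpu_power_mw = chips * WATTS_PER_CHIP / 1e6         # 50 MW, CPUs alone

print(f"{cores:.0e} cores, {chips:.0e} chips, {cpu_power_mw:.0f} MW")
```

The 50 MW figure covers only the CPUs; with memory, interconnect and cooling, 100 MW is likely, against TaihuLight's 15 MW.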
wl 2020 2.3
The Exaflop Supercomputer (2018)
How do we program this?
Who pays for this?
source: Maxeler
wl 2020 2.4
Technology comparison
DSP: Digital Signal Processor; Dedicated HW: ASIC/FPGA
wl 2020 2.5
[Die photo: Intel 6-Core X5680 "Westmere". Per-core regions: execution units; out-of-order scheduling & retirement; L1 data cache; memory ordering and execution; instruction decode and microcode; L2 cache & interrupt servicing; paging; branch prediction; instruction fetch & L1 cache. Uncore: memory controller, shared L3 cache, I/O and QPI. The area marked "Computation" (the execution units) is only a small fraction of each core.]
wl 2020 2.6
• a chip customised for a specific application
• no instructions no instruction decode logic
• no branches no branch prediction
• explicit parallelism no out-of-order scheduling
• data streamed onto-chip no multi-level caches
A special purpose computer
[Diagram: a "MyApplication Chip" connected to (lots of) memory and to the rest of the world]
source: Maxeler
wl 2020 2.7
• but we have more than one application
• impractical to optimise machines for only one application
- need to run many applications in a typical system
A special purpose computer
[Diagram: several "MyApplication Chip"s and an "OtherApplication Chip", each with its own memory, connected by a network to the rest of the world]
source: Maxeler
wl 2020 2.8
• use reconfigurable chip: reprogram at runtime to implement:
- different applications, or
- different versions of the same application
A special purpose computer
[Diagram: one reconfigurable chip with memory and network connections, reloaded at runtime with configurations optimised for applications A, B, C, D and E]
source: Maxeler
wl 2020 2.9
Instruction processors
source: Maxeler
wl 2020 2.10
Dataflow/stream processors
source: Maxeler
wl 2020 2.11
Lines of code
- total application: 1,000,000
- kernel to accelerate: 2,000
- software to restructure: 20,000
Accelerating real applications
• CPUs are good for:
- latency-sensitive, control-intensive, non-repetitive code
• dataflow engines are good for:
- high-throughput, repetitive processing on large data volumes
• a system should contain both
source: Maxeler
wl 2020 2.12
Custom computing in a PC
[Diagram: PC architecture - processor with register file, L1$ and L2$; North/South Bridge; PCI bus; disk; DIMMs]
where is the Custom Architecture?
• on-chip, with access to the register file
• as a co-processor with access to the level-1 cache
• next to the level-2 cache
• in an adjacent processor socket, connected using QPI/HyperTransport
• as the memory controller in the North/South Bridge
• as main memory (DIMMs)
• as a peripheral on the PCI Express bus
• inside a peripheral, e.g. a customisable disk controller
wl 2020 2.13
Embedded systems
• partition programs into software and hardware (custom architecture)
- hardware/software co-design
• System-on-Chip: SoC (cover later)
• custom architecture as extension of the processor instruction set
[Diagram: processor with register file; the custom architecture sits alongside, fed by the data and instruction paths, extending the processor's instruction set]
wl 2020 2.14
• depends on the application
- avoid system bottleneck for the application
• possible bottlenecks
- memory access latency
- memory access bandwidth
- memory size
- processor local memory size
- processor ALU resource
- processor ALU operation latency
- various bus bandwidths
Where to locate custom architecture?
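To judge several of the bottlenecks listed above, a simple roofline-style estimate is useful; every number below is hypothetical:

```python
# Roofline-style check: is a kernel limited by memory bandwidth or by
# compute throughput? All figures here are hypothetical placeholders.
peak_flops = 500e9          # assumed accelerator peak, FLOPS
mem_bw = 50e9               # assumed memory bandwidth, bytes/s

flops_per_byte = 0.25       # kernel's arithmetic intensity (assumed)
attainable = min(peak_flops, mem_bw * flops_per_byte)

print("memory-bound" if attainable < peak_flops else "compute-bound")
```

When the attainable rate is far below peak, moving the custom architecture closer to memory (or raising arithmetic intensity) matters more than adding ALUs.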
wl 2020 2.15
Bottleneck example: Bing page ranking
source: Microsoft
wl 2020 2.16
Reconfigurable computing with FPGAs
[Annotated die photo: Xilinx Virtex-6 FPGA - logic cells (10^5 elements), DSP blocks, block RAM (20 TB/s aggregate bandwidth), I/O blocks]
wl 2020 2.17
• 1U form factor for racks
• DFE: Data Flow Engine
High density compute with FPGAs: examples
source: Maxeler
wl 2020 2.18
• schematic entry of circuits
• hardware description languages - VHDL, Verilog, SystemC
• object-oriented languages - C/C++, Python, Java, and related languages
• dataflow languages: e.g. MaxJ, OpenSPL
• functional languages: e.g. Haskell, Ruby
• high-level interfaces: e.g. Mathematica, MATLAB
• schematic block diagrams, e.g. Simulink
• domain specific languages (DSLs)
How could we program it?
wl 2020 2.19
Accelerator programming models
[Diagram: accelerator programming models by level of abstraction - domain-specific languages (DSLs) targeting possible applications at the top, higher-level libraries beneath, and a flexible compiler system (MaxCompiler/Ruby) at the base]
wl 2020 2.20
Acceleration development flow
[Flowchart: Start → Original Application → Identify code for acceleration and analyze bottlenecks → Transform app, architect and model performance → Write accelerator code → Simulate → Functions correctly? (NO: back to Write accelerator code; YES: Build for Hardware) → Integrate with Host code → Meets performance goals? (NO: back to Transform app, architect and model performance; YES: Accelerated Application)]
source: Maxeler
wl 2020 2.21
Acceleration development flow
[The same flowchart as on the previous slide, annotated "Mainly for project"]
source: Maxeler
wl 2020 2.22
Customisation techniques
• FPGA technology offers customisation opportunities
- some data may remain constant: e.g. algebraic simplification
- adopt different data structures: e.g. number representation
- transform: e.g. enhance parallelism, pipelining, serialisation
• reuse possibilities (more next lecture)
- description: repeating unit, parametrisation
- transforms: patterns, laws, proofs
• example: polynomial evaluation for numbers a_i, x:
y = a0 + a1 x + a2 x^2 + a3 x^3 (repeat many times)
wl 2020 2.23
Performance estimation
• clocked circuit: no combinational loops
• gates have delay, and speed limited by propagation delay through the slowest combinational path
• slowest path: usually carry path
• clock rate: approx. 1/(delay of slowest path), assuming
- edge-triggered design
- register propagation delay, set-up time, clock skew etc. assumed negligible
• lowest level: logic gates, do not worry about transistors
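A minimal sketch of this estimate, using made-up gate delays for three paths:

```python
# Clock rate of a synchronous circuit is limited by the slowest
# register-to-register combinational path. Delays are hypothetical.
path_delays_ns = {"carry_chain": 4.0, "sum_logic": 2.5, "control": 1.0}

slowest_ns = max(path_delays_ns.values())   # usually the carry path
max_clock_mhz = 1e3 / slowest_ns            # f ≈ 1 / T_slowest

print(f"slowest path {slowest_ns} ns -> max clock {max_clock_mhz:.0f} MHz")
```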
wl 2020 2.24
First polynomial evaluator
• compute y = a0 + a1 x + a2 x^2 + a3 x^3
• simplification: assume x constant
• problems: speed? size? repeating units?
[Circuit: cascaded multiply-by-x units form x, x^2 and x^3; these are scaled by a1, a2 and a3 and summed with a0 to produce y]

y = 0 ;
for i = 0 .. 3
  y = y + a_i * x^i ;
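In software, the slide's loop is simply (coefficients here are hypothetical):

```python
# The slide's loop written out: y = sum of a_i * x^i for i = 0..3.
a = [5, 4, 3, 2]    # hypothetical coefficients a0..a3
x = 2

y = 0
for i in range(4):
    y = y + a[i] * x**i

print(y)  # 5 + 4*2 + 3*4 + 2*8 = 41
```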
wl 2020 2.25
Customisation possibilities
1. exploit algebraic properties
2. enhance parallelism
3. pipelining
Other possibilities
• serialisation
• customise data representation
- non-standard word-length, e.g. 18 bits rather than 32 bits
- non-standard arithmetic, e.g. logarithmic, residue…
wl 2020 2.26
1. Algebraic property: Horner’s Rule
• given: a x + b x = (a + b) x
[Circuit law: two multiply-by-x units feeding an adder replaced by one adder feeding a single multiply-by-x]
• then: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x (a1 + x (a2 + a3 x))
[Circuit: the evaluator of the previous slide rewritten as a chain of add-then-multiply-by-x stages, needing fewer multipliers]
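The saving can be checked in a few lines; a reversed fold implements the nested form (coefficients are hypothetical):

```python
# Horner's rule: a0 + a1 x + a2 x^2 + a3 x^3 = a0 + x(a1 + x(a2 + a3 x)).
a = [5, 4, 3, 2]    # hypothetical coefficients a0..a3
x = 2

naive = sum(ai * x**i for i, ai in enumerate(a))

horner = 0
for ai in reversed(a):      # fold from the highest coefficient down
    horner = horner * x + ai

assert naive == horner      # same value, fewer multiplications
print(horner)
```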
wl 2020 2.27
2. Enhance parallelism
[Diagram: the adder/multiplier network restructured with registers R to expose parallel operations]
wl 2020 2.28
3. Pipelining
• split up combinational circuit: add pipeline registers
• shorter cycle time, assembly-line parallelism, lower power
• pipelined design (if regular: systolic array - more later)
- mandatory: same number of additional registers for all inputs
- preferable: balance delay in different stages
- preferable: addition of registers preserves regularity
[Diagram: combinational blocks f, g and h with pipeline registers inserted between stages. Source: M Spivey]
wl 2020 2.29
Horner’s Rule for pipelining?
• given: P and Q are registers, R is a computational component
[Diagram: a retiming law - registers P and Q moved across the component R]
• then
[Diagram: the Horner evaluator pipelined by repeatedly applying the law to push registers through each multiply-add stage]
wl 2020 2.30
module incr_pipe
#(parameter G=4, N=4) // G groups of N bits
(output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);
wire [G:0] carry; // carry chain
wire [G*N-1:0] temp1; // output of upper delay triangle
wire [G*N-1:0] temp2; // output of incrementer stages
genvar i; // loop counter
assign carry[G] = 1; // prime carry input
upper_tri_delay #(G, N) tru (temp1, inp, clk); // upper reg triangle
lower_tri_delay #(G, N) trl (outp, temp2, clk); // lower reg triangle
generate
for (i = 0; i < G; i = i + 1) // for each group generate
begin // 1-stage pipelined incrementer
incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
temp1[(i+1)*N-1:i*N], carry[G-i], clk);
end
endgenerate
endmodule
Pipelined incrementer: Verilog
• parameterize:
- G groups of N bits
- width = G*N
- bits per stage = N
• Verilog implementation:
- decompose into:
• upper register triangle
• chain of incrementers + register (1-stage pipeline)
• lower register triangle
- only top level shown
- need to manage array indices
[Diagram: 16-bit incrementer split into four 4-bit incrementers (a[15..12], a[11..8], a[7..4], a[3..0]) chained cout-to-cin, producing sum[15..12], sum[11..8], sum[7..4], sum[3..0]; a 1-stage pipeline]
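A behavioural sketch of one stage's group-by-group carry chain (a Python model of the structure, not the Verilog itself):

```python
# Behavioural model: split a G*N-bit increment into G groups of N bits
# linked by a carry chain, as one stage of the pipelined incrementer does.
G, N = 4, 4
MASK = (1 << N) - 1

def incr_groups(a):
    carry = 1                          # primed carry input
    result = 0
    for i in range(G):                 # group i holds bits [i*N .. i*N+N-1]
        group = (a >> (i * N)) & MASK
        s = group + carry
        carry = s >> N                 # carry out to the next group
        result |= (s & MASK) << (i * N)
    return result

assert incr_groups(0x000F) == 0x0010
assert incr_groups(0xFFFF) == 0x0000   # wraps around at 2**(G*N)
print(hex(incr_groups(0x12FF)))        # 0x1300
```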
wl 2020 2.31
Concise parametric representation
• given: [P, Q] ; R = R ; Q, where P and Q are registers
[Diagram: the law applied across the pipelined Horner network, as on the previous slide]
• then: [P^n, Q^n] ; rdr^n R = rdr^n (Q^2 ; R)
wl 2020 2.32
Verilog: (code as on the previous slide)
Pipelined incrementer: Verilog vs Ruby
• parameterize:
- G groups of N bits
- width = G*N
- bits per stage = N
[Diagram: 16-bit incrementer split into four 4-bit incrementers (a[15..12], a[11..8], a[7..4], a[3..0]) chained cout-to-cin, producing sum[15..12], sum[11..8], sum[7..4], sum[3..0]]
Ruby:
Pipelined_incrementer G N
= snd (tri G (tri N D)) ;           # upper reg triangle
  row G (row N (halfadd ; snd D)) ; # 1-stage pipelined incrementer
  fst (tri~ G (tri~ N D))           # lower reg triangle
* can generate Verilog or MaxJ!