Custom computing systemswl/teachlocal/cuscomp/notes/cc... · 2020-01-14 · wl 2020 2.12 Custom computing in a PC Processor Register file L1$ L2$ where is the Custom Architecture?

wl 2020 2.1

Custom computing systems

• difference engine: Charles Babbage 1832 - compute maths tables
• digital orrery: MIT 1985 - special-purpose engine, found Pluto's motion chaotic
• Splash2: Supercomputing Research Center 1993 - multi-FPGA engine, for video processing, DNA computing etc.
• Harp1: Oxford University 1995 - FPGA + microprocessor (transputer)
• SONIC, UltraSonic: Sony + Imperial College 1999-2002 - multi-FPGA, professional video processing
• MaxWorkstation, MaxNode: 2011, Max5: 2017 - FPGA cards adopted by JP Morgan, Amazon…

wl 2020 2.2

The Exaflop Supercomputer (2022)

• 1 exaflop = 10^18 FLOPS (TaihuLight: 93 petaflops)
• using processor cores with 8 FLOPS/clock at 2.5 GHz
• 50M CPU cores
• what about power?
  - assume power envelope of 100 W per chip
  - Moore's Law scaling: 6 cores today → ~100 cores/chip
  - 500k CPU chips
• 50 MW (just for CPUs!), 100 MW likely
• 'TaihuLight' power consumption: 15 MW

source: Maxeler
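
The figures on this slide follow from simple division. A minimal sketch of that arithmetic, using only the slide's numbers (the function and parameter names are illustrative, not from the slides):

```python
# Back-of-envelope sizing of an exaflop machine built from CPU cores.
# All default values are the ones quoted on the slide.

def exaflop_estimate(target_flops=10**18, flops_per_clock=8, clock_hz=2.5e9,
                     cores_per_chip=100, watts_per_chip=100):
    flops_per_core = flops_per_clock * clock_hz      # 2 * 10^10 FLOPS per core
    cores = target_flops / flops_per_core            # cores needed
    chips = cores / cores_per_chip                   # chips needed
    power_mw = chips * watts_per_chip / 1e6          # CPU power in megawatts
    return cores, chips, power_mw

cores, chips, power_mw = exaflop_estimate()
print(f"{cores:.0f} cores, {chips:.0f} chips, {power_mw:.0f} MW for CPUs alone")
```

This covers CPU power only; memory, interconnect and cooling are what push the slide's estimate towards 100 MW.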

wl 2020 2.3

The Exaflop Supercomputer (2018)

• 1 exaflop = 10^18 FLOPS
• using processor cores with 8 FLOPS/clock at 2.5 GHz
• 50M CPU cores
• what about power?
  - assume power envelope of 100 W per chip
  - Moore's Law scaling: 6 cores today → ~100 cores/chip
  - 500k CPU chips
• 50 MW (just for CPUs!), 100 MW likely
• 'TaihuLight' power consumption: 15 MW

How do we program this?

Who pays for this?

source: Maxeler

wl 2020 2.4

Technology comparison

DSP: Digital Signal Processor
Dedicated HW = ASIC/FPGA

wl 2020 2.5

Intel 6-Core X5680 "Westmere"

[Figure: die diagram. Core blocks: execution units; out-of-order scheduling & retirement; L1 data cache; memory ordering and execution; instruction decode and microcode; L2 cache & interrupt servicing; paging; branch prediction; instruction fetch & L1 cache. Uncore blocks: memory controller; shared L3 cache; I/O and QPI. Six cores in total. Annotation: "Computation".]

wl 2020 2.6

A special purpose computer

• a chip customised for a specific application
• no instructions → no instruction decode logic
• no branches → no branch prediction
• explicit parallelism → no out-of-order scheduling
• data streamed onto chip → no multi-level caches

[Figure: MyApplication chip connected to (lots of) memory and to the rest of the world]

source: Maxeler

wl 2020 2.7

A special purpose computer

• but we have more than one application
• impractical to optimise machines for only one application
  - need to run many applications in a typical system

[Figure: three MyApplication chips and one OtherApplication chip, each with its own memory, connected by a network to the rest of the world]

source: Maxeler

wl 2020 2.8

A special purpose computer

• use reconfigurable chip: reprogram at runtime to implement:
  - different applications, or
  - different versions of the same application

[Figure: one reconfigurable chip ("Config 1") with memory and network; alternative configurations optimized for Applications A, B, C, D and E]

source: Maxeler

wl 2020 2.9

Instruction processors

source: Maxeler

wl 2020 2.10

Dataflow/stream processors

source: Maxeler

wl 2020 2.11

Accelerating real applications

Lines of code:
  Total application          1,000,000
  Kernel to accelerate           2,000
  Software to restructure       20,000

• CPUs are good for:
  - latency-sensitive, control-intensive, non-repetitive code
• dataflow engines are good for:
  - high throughput repetitive processing on large data volumes

a system should contain both

source: Maxeler

wl 2020 2.12

Custom computing in a PC

where is the Custom Architecture?

• on-chip with access to register file
• co-processor with access to level 1 cache
• next to level 2 cache
• in adjacent processor socket, connected using QPI/HyperTransport
• as Memory Controller not North/South Bridge
• as main memory (DIMMs)
• as a peripheral on PCI Express bus
• inside peripheral, e.g. customizable disk controller

[Figure: PC block diagram showing processor, register file, L1$, L2$, North/South Bridge, PCI bus, disk and DIMMs]

wl 2020 2.13

Embedded systems

• partition programs into software and hardware (custom architecture)
  - hardware/software co-design
• System-on-Chip: SoC (cover later)
• custom architecture as extension of the processor instruction set

[Figure: processor with register file, data and instruction inputs, and an attached custom architecture block]

wl 2020 2.14

Where to locate custom architecture?

• depends on the application
  - avoid system bottleneck for the application
• possible bottlenecks
  - memory access latency
  - memory access bandwidth
  - memory size
  - processor local memory size
  - processor ALU resource
  - processor ALU operation latency
  - various bus bandwidths
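
One rough way to spot which of these resources limits an application is to estimate the time each would take on its own and compare. The sketch below considers only two of the listed bottlenecks (ALU throughput and memory bandwidth); the workload and machine numbers are hypothetical and chosen purely for illustration.

```python
# Crude bottleneck check: the slowest resource bounds the run time.

def resource_times(num_ops, bytes_moved, ops_per_sec, bytes_per_sec):
    """Time each resource would need if it were the only limit."""
    return {
        "ALU throughput": num_ops / ops_per_sec,
        "memory bandwidth": bytes_moved / bytes_per_sec,
    }

def bottleneck(times):
    """Name of the resource with the largest time."""
    return max(times, key=times.get)

# Hypothetical kernel: 10^9 operations touching 8 * 10^9 bytes,
# on a device with 10^11 ops/s and 2 * 10^10 bytes/s of memory bandwidth.
times = resource_times(1e9, 8e9, 1e11, 2e10)
print(times, "->", bottleneck(times))
```

For this made-up workload the memory transfer dominates, which would argue for placing the custom architecture where it sees more memory bandwidth rather than adding more arithmetic units.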

wl 2020 2.15

Bottleneck example: Bing page ranking

source: Microsoft

wl 2020 2.16

Reconfigurable computing with FPGAs

[Figure: Xilinx Virtex-6 FPGA layout, showing DSP blocks, block RAM (20 TB/s), IO blocks and logic cells (10^5 elements)]

wl 2020 2.17

High density compute with FPGAs: examples

• 1U form factor for racks

DFE: Data Flow Engine

source: Maxeler

wl 2020 2.18

How could we program it?

• schematic entry of circuits
• hardware description languages
  - VHDL, Verilog, SystemC
• object-oriented languages
  - C/C++, Python, Java, and related languages
• dataflow languages: e.g. MaxJ, OpenSPL
• functional languages: e.g. Haskell, Ruby
• high level interface: e.g. Mathematica, MATLAB
• schematic block diagram: e.g. Simulink
• domain specific languages (DSLs)

wl 2020 2.19

Accelerator programming models

[Figure: level of abstraction plotted against possible applications; DSLs and higher level libraries are layered on a flexible compiler system (MaxCompiler/Ruby)]

wl 2020 2.20

Acceleration development flow

[Flowchart]
1. Start: original application
2. Identify code for acceleration and analyze bottlenecks
3. Transform app, architect and model performance
4. Write accelerator code
5. Simulate
6. Functions correctly? (YES: continue; NO: loop back)
7. Build for hardware
8. Integrate with host code
9. Meets performance goals? (YES: continue; NO: loop back)
10. Accelerated application

source: Maxeler

wl 2020 2.21

Acceleration development flow

[Figure: the same flowchart as on slide 2.20, with the annotation "Mainly for project"]

source: Maxeler

wl 2020 2.22

Customisation techniques

• FPGA technology offers customisation opportunities

- some data may remain constant: e.g. algebraic simplification

- adopt different data structures: e.g. number representation

- transform: e.g. enhance parallelism, pipelining, serialisation

• reuse possibilities (more next lecture)

- description: repeating unit, parametrisation

- transforms: patterns, laws, proofs

• example: polynomial evaluation for numbers a_i, x: y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 (repeat many times)

wl 2020 2.23

Performance estimation

• clocked circuit: no combinational loops

• gates have delay, and speed limited by propagation delay through the slowest combinational path

• slowest path: usually carry path

• clock rate: approx. 1/(delay of slowest path), assuming
  - edge-triggered design
  - register propagation delay, set-up time, clock skew etc. assumed negligible

• lowest level: logic gates, do not worry about transistors
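
Under the assumptions above, the clock estimate reduces to finding the slowest combinational path and taking its reciprocal. A minimal sketch, with hypothetical path names and delays:

```python
# Clock-rate estimate: f_max is approximately 1 / (delay of slowest path).
# Register overheads and clock skew are ignored, as assumed on the slide.

def max_clock_hz(path_delays_ns):
    """path_delays_ns maps a path name to its combinational delay in ns."""
    slowest_ns = max(path_delays_ns.values())
    return 1e9 / slowest_ns

# Hypothetical delays; the carry path is the slowest, as is usual.
paths = {"carry chain": 4.0, "control": 1.5, "output mux": 2.5}
print(f"estimated clock rate: {max_clock_hz(paths) / 1e6:.0f} MHz")
```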

wl 2020 2.24

First polynomial evaluator

• compute y = a_0 + a_1 x + a_2 x^2 + a_3 x^3
• simplification: assume x constant
• problems: speed? size? repeating units?

[Figure: circuit with six multipliers (x) and three adders (+), combining a_3, a_2, a_1 and a_0 into output y]

y = 0 ;
for i = 0 .. 3
  y = y + a_i × x^i ;
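
To make the "size" question concrete, the direct evaluation can be written out in software with its operations counted. This is only a sketch: `eval_direct` is an illustrative name, and it forms each power x^i by repeated multiplication, which gives six multiplications and three additions for a cubic.

```python
# Direct (term-by-term) polynomial evaluation with operation counts.

def eval_direct(coeffs, x):
    """coeffs = [a_0, a_1, ..., a_n]; returns (y, multiplications, additions)."""
    mults = adds = 0
    y = coeffs[0]
    for i in range(1, len(coeffs)):
        term = coeffs[i]
        for _ in range(i):          # a_i * x * x * ... (i multiplications)
            term *= x
            mults += 1
        y += term
        adds += 1
    return y, mults, adds

print(eval_direct([1, 2, 3, 4], 2))   # 1 + 2*2 + 3*4 + 4*8
```

The multiplication count grows quadratically with the degree, which is the cost that the customisations on the following slides try to reduce.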

wl 2020 2.25

Customisation possibilities

1. exploit algebraic properties

2. enhance parallelism

3. pipelining

Other possibilities

• serialisation

• customise data representation
  - non-standard word-length, e.g. 18 bits rather than 32 bits
  - non-standard arithmetic, e.g. logarithmic, residue…
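
The word-length trade-off can be explored in software before committing to hardware. The sketch below assumes a signed fixed-point format; the `quantize` helper and the particular split between integer and fraction bits are illustrative choices, not taken from the slides.

```python
# Effect of word-length on precision, using signed fixed-point numbers.

def quantize(value, total_bits, frac_bits):
    """Round value to a signed fixed-point number and return it as a float."""
    scale = 1 << frac_bits
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    raw = max(lowest, min(highest, round(value * scale)))   # saturate on overflow
    return raw / scale

x = 0.1
err18 = abs(quantize(x, 18, 16) - x)   # 18-bit word, 16 fraction bits
err32 = abs(quantize(x, 32, 30) - x)   # 32-bit word, 30 fraction bits
print(f"18-bit error: {err18:.3e}, 32-bit error: {err32:.3e}")
```

If the larger error of the narrow format is acceptable for the application, the narrower datapath needs less logic, so more copies fit on the same device.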

wl 2020 2.26

1. Algebraic property: Horner’s Rule

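• given: a x + b x = (a + b) x

• then: a_0 + a_1 x + a_2 x^2 + a_3 x^3 = a_0 + x (a_1 + x (a_2 + a_3 x))

[Figure: the given law drawn as circuits (two multipliers feeding an adder, equal to an adder feeding one multiplier), and the first evaluator with six multipliers transformed into a chain of three multipliers and three adders]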
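
A software rendering of the factored form shows the saving directly. This is a minimal sketch with an illustrative function name; it counts operations so the result can be compared with the six multiplications of the direct form.

```python
# Horner's Rule: a_0 + x (a_1 + x (a_2 + a_3 x)), with operation counts.

def eval_horner(coeffs, x):
    """coeffs = [a_0, a_1, ..., a_n]; returns (y, multiplications, additions)."""
    mults = adds = 0
    y = coeffs[-1]                       # start from the highest coefficient
    for a in reversed(coeffs[:-1]):
        y = y * x + a                    # one multiplier and one adder per stage
        mults += 1
        adds += 1
    return y, mults, adds

print(eval_horner([1, 2, 3, 4], 2))
```

Each loop iteration corresponds to one identical multiply-add stage, so the circuit is both smaller and made of a single repeating unit.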

wl 2020 2.27

2. Enhance parallelism

[Figure: several arrangements of components R]
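
The slide's diagram is not recoverable here, so the following is an illustration only. One standard way to expose parallelism in the running polynomial example is Estrin's scheme, which is not named in the slides: it spends one extra multiplication compared with Horner's Rule so that independent operations can run side by side.

```python
# Estrin's scheme for a cubic: (a_0 + a_1 x) + x^2 (a_2 + a_3 x).
# Operations grouped in one level are independent and could run in parallel.

def eval_estrin_cubic(a0, a1, a2, a3, x):
    # level 1: three independent multiplications
    x2 = x * x
    m1 = a1 * x
    m3 = a3 * x
    # level 2: two independent additions
    low = a0 + m1
    high = a2 + m3
    # level 3: one multiplication
    scaled = high * x2
    # level 4: final addition
    return low + scaled

print(eval_estrin_cubic(1, 2, 3, 4, 2))
```

The critical path is four operations deep (multiply, add, multiply, add), against six sequential operations for the Horner chain, at the cost of four multipliers instead of three.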

wl 2020 2.28

3. Pipelining

• split up combinational circuit: add pipeline registers
• shorter cycle time, assembly-line parallelism, lower power
• pipelined design (if regular: systolic array, more later)
  - mandatory: same number of additional registers for all inputs
  - preferable: balance delay in different stages
  - preferable: addition of registers preserves regularity

[Figure: circuit built from blocks f, g and h]

Source: M Spivey
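
The effect of pipelining on cycle time and latency can be estimated from the stage delays. The sketch below assumes three blocks in series with hypothetical delays and, as in the performance estimation slide, ignores register overhead.

```python
# Cycle time and latency of a chain of blocks, before and after pipelining.

def combinational_timing(stage_delays_ns):
    """No registers between blocks: one long path sets the cycle time."""
    period = sum(stage_delays_ns)
    return period, period                    # (cycle time, latency) in ns

def pipelined_timing(stage_delays_ns):
    """A register after every block: the slowest block sets the cycle time."""
    period = max(stage_delays_ns)
    return period, period * len(stage_delays_ns)

delays = [2.0, 3.0, 2.5]                     # hypothetical delays of three blocks
print("combinational:", combinational_timing(delays))
print("pipelined:    ", pipelined_timing(delays))
```

With these numbers the cycle time drops from 7.5 ns to 3.0 ns while the latency rises from 7.5 ns to 9.0 ns. The extra latency comes from the unbalanced stages, which is why balancing stage delays is listed as preferable.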

wl 2020 2.29

Horner’s Rule for pipelining?

[Figure: a "given" diagram relating registers P and Q to component R, and a "then" diagram applying it to a chain of R components]

P and Q are registers, R is computational component

wl 2020 2.30

Pipelined incrementer: Verilog

• parameterize:
  - G groups of N bits
  - width = G*N
  - bits per stage = N
• Verilog implementation:
  - decompose into:
    • upper register triangle
    • chain of incrementers + register (1-stage pipeline)
    • lower register triangle
  - only top level shown
  - need to manage array indices

    module incr_pipe
      #(parameter G=4, N=4)                          // G groups of N bits
      (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);

      wire [G:0]     carry;                          // carry chain
      wire [G*N-1:0] temp1;                          // output of upper delay triangle
      wire [G*N-1:0] temp2;                          // input of lower delay triangle
      genvar i;                                      // loop counter

      assign carry[G] = 1'b1;                        // prime carry input

      upper_tri_delay #(G, N) tru (temp1, inp, clk); // upper reg triangle
      lower_tri_delay #(G, N) trl (outp, temp2, clk); // lower reg triangle

      generate
        for (i = 0; i < G; i = i + 1)                // for each group generate
        begin : stage                                // 1-stage pipelined incrementer
          incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                                temp1[(i+1)*N-1:i*N], carry[G-i], clk);
        end
      endgenerate
    endmodule

[Figure: chain of four incrementers, each a 1-stage pipeline, taking a[3..0], a[7..4], a[11..8] and a[15..12], with carry passed from cin to cout and outputs sum[3..0] to sum[15..12]]
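
A behavioural model of the decomposition can help when checking the array indices. The sketch below is only a functional reference, not a model of the Verilog timing: it ignores the clock and the register triangles and just shows that G chained N-bit incrementers produce the same result as one G*N-bit incrementer. The function name is illustrative.

```python
# Functional model of an incrementer split into G groups of N bits.

def grouped_increment(value, groups=4, bits=4):
    """Return (value + 1 truncated to groups*bits bits, final carry out)."""
    mask = (1 << bits) - 1
    carry = 1                                  # primed carry input
    result = 0
    for i in range(groups):                    # least significant group first
        chunk = (value >> (i * bits)) & mask   # bits [(i+1)*N-1 : i*N]
        total = chunk + carry
        result |= (total & mask) << (i * bits)
        carry = total >> bits                  # carry into the next group
    return result, carry

print(grouped_increment(0x00FF))
print(grouped_increment(0xFFFF))
```

In the pipelined hardware each group's addition happens one clock cycle after the previous group's, which is why the inputs and outputs need the skewing register triangles.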

wl 2020 2.31

Concise parametric representation

• given: [P, Q] ; R = R ; Q, where P and Q are registers

• then: [nP, Qn] ; rdrn R = rdrn (2Q ; R)

[Figure: the given and then laws drawn as arrangements of registers P, Q and components R]

wl 2020 2.32

Pipelined incrementer: Verilog vs Ruby

• parameterize:
  - G groups of N bits
  - width = G*N
  - bits per stage = N

Verilog:

    module incr_pipe
      #(parameter G=4, N=4)                          // G groups of N bits
      (output [G*N-1:0] outp, input [G*N-1:0] inp, input clk);

      wire [G:0]     carry;                          // carry chain
      wire [G*N-1:0] temp1;                          // output of upper delay triangle
      wire [G*N-1:0] temp2;                          // input of lower delay triangle
      genvar i;                                      // loop counter

      assign carry[G] = 1'b1;                        // prime carry input

      upper_tri_delay #(G, N) tru (temp1, inp, clk); // upper reg triangle
      lower_tri_delay #(G, N) trl (outp, temp2, clk); // lower reg triangle

      generate
        for (i = 0; i < G; i = i + 1)                // for each group generate
        begin : stage                                // 1-stage pipelined incrementer
          incr_stage #(N) istg (carry[G-i-1], temp2[(i+1)*N-1:i*N],
                                temp1[(i+1)*N-1:i*N], carry[G-i], clk);
        end
      endgenerate
    endmodule

Ruby:

    Pipelined_incrementer G N
      = snd (tri G (tri N D)) ;            # upper reg triangle
        row G (row N (halfadd ; snd D) ;   # 1-stage pipelined incrementer
        fst (tri~ G (tri~ N D))            # lower reg triangle

* can generate Verilog or MaxJ!

[Figure: chain of four incrementers taking a[3..0], a[7..4], a[11..8] and a[15..12], with carry passed from cin to cout and outputs sum[3..0] to sum[15..12]]