fpga hpc computing - PowerPoint PPT Presentation
May 10, 2014 R.Innocente 1
Reconfigurable Computing
Roberto Innocente
May 10, 2014 R.Innocente 2
Flexibility
ASIC (Application-Specific Integrated Circuit): very inflexible, designed to solve just 1 problem. Energy, space and time efficient.
GPP (General-Purpose Processor): very flexible, can solve any problem. Energy, space and time inefficient.
In between (?): reconfigurable hardware. Flexible, but still efficient enough in energy, time and space.
May 10, 2014 R.Innocente 3
History
May 10, 2014 R.Innocente 4
Gerald Estrin/1
Gerald Estrin is credited with proposing, in the '60s, the first reconfigurable computer: the fixed plus variable (F+V) structure computer.
Gerald Estrin. ACM 1960. Organization of computer systems: the fixed plus variable structure computer.
May 10, 2014 R.Innocente 5
Gerald Estrin/2
He envisioned that important gains in performance could be achieved when many computations are executed on appropriate problem-oriented configurations.
F+V is made of :
- a high speed general purpose computer (the F part) : initially an IBM 7090
- high speed special structures of various sizes (the V part), problem specific : trigonometric functions, logarithms, exponentials, n-th powers, complex arithmetic, …
V is built on a motherboard with 36 module positions, which can undergo :
- Function reconfiguration : physically changing some modules
- Routing reconfiguration : changing part of the back wiring
The Rammig machine (1977) : investigation of a reconfigurable machine that needs no manual or mechanical intervention
May 10, 2014 R.Innocente 6
Today's reconfigurable hardware
It was born out of the wish to replace assorted logic ICs (Integrated Circuits), and later to rapidly prototype large ASICs (Application Specific ICs) or implement SoCs (Systems on Chip).
The following slides assume readers who work in scientific computing, not electrical engineers.
May 10, 2014 R.Innocente 7
Basic digital circuits
[Figure: AND, OR, inverter, MUX, D-type flip-flop, shift register]
Usually 0 = 0 V, 1 = some positive voltage
May 10, 2014 R.Innocente 8
SSI 74xx IC
May 10, 2014 R.Innocente 9
PLD
Drawbacks of standard discrete logic circuits :
- 14-pin packages holding 4 or 6 logic functions
- often you had to route across the PCB to find a free OR or inverter
- even if you needed only one or two gates, you still had to place an IC with 4 or 6
Hence the idea of the PLD (Programmable Logic Device) :
- SPLD (Simple : PAL/PLA)
- CPLD (Complex)
in which a simple interconnection network can be configured by blowing some internal fuses (fuse technology) to implement combinatorial logic.
May 10, 2014 R.Innocente 10
Disjunctive normal form (aka sum of products)
Every boolean function of some boolean variables can be represented as a sum of minterms (a minterm is a product in which every variable appears once, either plain or complemented).
With 3 boolean vars a, b, c there are $2^3 = 8$ minterms.
E.g. $f(a,b,c) = \bar{a} b c + a \bar{b} \bar{c}$, where $\bar{a} b c$ and $a \bar{b} \bar{c}$ are 2 of the 8 minterms.
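The sum-of-minterms idea can be sketched in C. This is only an illustration: the function names, the bit-mask encoding of the minterm set and the choice of example function, (not a) b c + a (not b) (not c), which is the function the Verilog/2 slide later implements, are assumptions made here and are not part of the original slides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Index of the minterm selected by the inputs; a is the most significant bit. */
unsigned minterm_index(bool a, bool b, bool c) {
    return ((unsigned)a << 2) | ((unsigned)b << 1) | (unsigned)c;
}

/* Evaluate a 3-variable function stored as a set of minterms:
   bit k of `minterms` is 1 when minterm k is part of the sum. */
bool eval_sum_of_minterms(uint8_t minterms, bool a, bool b, bool c) {
    return ((minterms >> minterm_index(a, b, c)) & 1u) != 0;
}

/* The same illustrative function written directly as a sum of products. */
bool example_direct(bool a, bool b, bool c) {
    return (!a && b && c) || (a && !b && !c);
}

/* (not a) b c is minterm 011 = 3, a (not b) (not c) is minterm 100 = 4. */
#define EXAMPLE_MINTERMS ((uint8_t)((1u << 3) | (1u << 4)))

/* Check that both representations agree on all 2^3 input combinations. */
void check_example(void) {
    for (unsigned k = 0; k < 8; k++) {
        bool a = (k >> 2) & 1u, b = (k >> 1) & 1u, c = k & 1u;
        assert(eval_sum_of_minterms(EXAMPLE_MINTERMS, a, b, c) ==
               example_direct(a, b, c));
    }
}
```

Calling `check_example()` walks the whole truth table; any 3-input function can be stored the same way by choosing a different 8-bit mask.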
May 10, 2014 R.Innocente 11
PLA (Programmable Logic Array)
f1=p1+ p2+ p3=x1x2+x1 x3+ x1 x2 x3+x1 x3
May 10, 2014 R.Innocente 12
FPGA
CPLDs also showed their limits, so around 1985/1990 Xilinx introduced a more flexible design, the
FPGA (Field Programmable Gate Array),
in which the interconnection network is much more flexible and onto which sequential circuits can also be easily mapped. We will see that "gate array" is in fact a misnomer today : it is not an array of gates.
May 10, 2014 R.Innocente 13
FPGA idea
1985, Xilinx, Ross Freeman (inventor of the FPGA): “What if we could develop the equivalent of a circuit board full of standard logic parts (like TTL and PAL devices) on a single high density programmable logic chip ?”
- post fabrication programmability by end users
- fabless semiconductor company
May 10, 2014 R.Innocente 14
Today
May 10, 2014 R.Innocente 15
FPGA market
Dominated by 2 players :
- Altera
- Xilinx
Up from 67% in 2010, together they now hold 90% of the market (4.5 billion USD revenues in 2012)
[Figure: market share chart, Xilinx (X) and Altera (A). From sourcetech411 (2010)]
May 10, 2014 R.Innocente 16
An important question: are FPGAs green ?
Virtex-7 2000T (one of the top FPGAs) : ~ 20 W
Xilinx showed 3600 copies of its 8-bit processor nanoblaze running on a Virtex-7, consuming 20 W
CPU : ~ 100 W
Core i7-4770K Haswell (22 nm) 3.5 GHz @ 4 cores 84 W
Core i7-3930K Sandy Bridge-E (32 nm) 3.2 GHz @ 6 cores 130 W
Xeon E7458 Dunnington (45 nm) 2.4 GHz 90 W
Xeon E7460 Dunnington (45 nm) 2.66 GHz 130 W
GPU : ~ 220 W
Nvidia Tesla M2090 225 W
Nvidia Tesla K20X 235 W
This is a partial answer. We need to be able to estimate FPGA performance to give a more useful index.
May 10, 2014 R.Innocente 17
FPGA architecture
From RF and Wireless World
Sea of gates : logic blocks are like islands in a sea of interconnections
May 10, 2014 R.Innocente 18
Virtex family
Year  Family     Process  Clock     Size
1998  Virtex     250 nm   100 MHz   25k-60k cells
2000  Virtex-E   180 nm   300 MHz   1k-70k cells
2000  Virtex II  150 nm   420 MHz   up to 168 multipliers, up to 93k 4-LUTs
2005  Virtex-4   90 nm    500 MHz   up to 200k cells
2007  Virtex-5   65 nm    550 MHz   up to 330k cells
      Virtex-6   40 nm              288-2k DSP, up to 500k 6-LUTs
2010  Virtex-7   28 nm    ~500 MHz  up to 2000k cells
2014  Virtex-US  20 nm              up to 4400k cells
From L Zhuo
Transistor counts :
- Virtex : up to ~ 7 billion transistors
- Intel 2014 15-core Xeon Ivy Bridge-EX : ~ 4.3 billion transistors
- Nvidia 2012 GK110 Kepler : ~ 7 billion transistors
May 10, 2014 R.Innocente 19
FPGA/CPU evolution
May 10, 2014 R.Innocente 20
Virtex-7 is not monolithic
2.5D technology : 4 FPGA tiles on a silicon interposer that provides 10k interconnections between layers
May 10, 2014 R.Innocente 21
Enabling technologies
May 10, 2014 R.Innocente 22
Programming technology/1
Antifuse SRAM
OTP(One time programmable)
Disordered except at very low range
Pass transistor in switch block
May 10, 2014 R.Innocente 23
Programming technology/2
Antifuse
- pros : cheap, small
- cons : requires special processing, one-time programming
SRAM
- pros : can be fabricated with a standard semiconductor process, can be easily reprogrammed
- cons : large area required (6 transistors)
May 10, 2014 R.Innocente 24
Confware
The configuration of an FPGA (which is compiled to a stream of bits) is neither hardware nor software.
Someone coined the neologism
confware :
the configuration of reconfigurable hardware.
May 10, 2014 R.Innocente 25
How do you configure an FPGA ?
The SRAM cells form one long shift register, loaded serially by clocking in the confware.
Virtex-7 2000T = 440 Mbits of SRAM cells
(simplified : large FPGAs can also load the confware in parallel)
May 10, 2014 R.Innocente 26
Logic Blocks/Logic Cells
May 10, 2014 R.Innocente 27
Fine/coarse grain logic blocks
From :
- a single transistor (Crosspoint : went bankrupt)
- a logic gate
To :
- a complete processor (FPNA : field programmable node array)
NB. FPNA also stands for field programmable neural array
May 10, 2014 R.Innocente 28
Homogeneous :
- Logic Cells: 4 input LUT(LookUp Table) + FlipFlop
Heterogeneous(modern development) :
- Logic cells
- DSP (Digital Signal Processing)
- Memory blocks
- I/O blocks
The heterogeneous architecture is prevalent now. The blocks are configured by SRAM bits, usually loaded through serial ports as already pointed out.
CLB (Configurable Logic Blocks)
The differentiation is necessary so that operations like multiplication/addition can be mapped efficiently.
May 10, 2014 R.Innocente 29
Standard Logic Cell
4 input LUT
D type FlipFlop
16 bits of SRAM for conf 1 bit SRAM conf
2:1 Mux
May 10, 2014 R.Innocente 30
standard LUT (Look Up Table)
- 16 x 1 memory
- any boolean function of 4 inputs (Bit 0 .. Bit 3 form the address) :
Dec  Bin   Out
0    0000  0
1    0001  1
2    0010  0
3    0011  0
4    0100  1
5    0101  0
6    0110  1
7    0111  1
..   ..    ..
$f = \bar{x}_3 \bar{x}_2 \bar{x}_1 x_0 + \bar{x}_3 x_2 \bar{x}_1 \bar{x}_0 + \bar{x}_3 x_2 x_1 \bar{x}_0 + \bar{x}_3 x_2 x_1 x_0$
NB. LUT rhymes with nut
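A 4-input LUT can be modelled in C as a 16-bit configuration word indexed by the inputs. This is a minimal sketch: the names `lut4` and `TABLE_CONFIG` are hypothetical, and the configuration assumes that the rows 8..15 elided from the table above all hold 0.

```c
#include <assert.h>
#include <stdint.h>

/* A 4-input LUT is a 16 x 1 memory: the inputs form the address and the
   configuration bit stored at that address is the output. */
unsigned lut4(uint16_t config, unsigned x3, unsigned x2, unsigned x1, unsigned x0) {
    assert(x3 <= 1 && x2 <= 1 && x1 <= 1 && x0 <= 1);
    unsigned address = (x3 << 3) | (x2 << 2) | (x1 << 1) | x0;
    return (config >> address) & 1u;
}

/* Configuration for the table above: Out = 1 at addresses 1, 4, 6 and 7.
   Assumption: the rows not shown (8..15) are all 0. */
#define TABLE_CONFIG ((uint16_t)((1u << 1) | (1u << 4) | (1u << 6) | (1u << 7)))
```

Changing only the 16 configuration bits turns the same circuit into any other 4-input function; for instance `0x8000` gives a 4-input AND and `0xFFFE` a 4-input OR.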
May 10, 2014 R.Innocente 31
Uses of Logic Cell
- 2^4 = 16 x 1 bit memory
- any boolean function of 4 inputs
- 4:1 multiplexer
May 10, 2014 R.Innocente 32
Virtex-7 Logic Block basics
May 10, 2014 R.Innocente 33
Virtex-7 Logic slice
From Xilinx
4 x 32 = 128 bit shift reg
May 10, 2014 R.Innocente 34
Virtex7 CLB slice
- 6-input LUT
- 2 5-input LUTs with same inputs
- 2 arbitrary boolean functions of 3 and 2 inputs or fewer
May 10, 2014 R.Innocente 35
Altera ALM
May 10, 2014 R.Innocente 36
Interconnection network
May 10, 2014 R.Innocente 37
Interconnection network
- Hierarchical routing
- Island type routing (predominant)
- Nearest neighbours
The interconnection network can consume 80% of the area of an FPGA !
May 10, 2014 R.Innocente 38
Programmable switch
May 10, 2014 R.Innocente 39
SRAM routing: coarse/fine grain
5 bit SRAM, 1 bit SRAM
May 10, 2014 R.Innocente 40
Details of island type routing
May 10, 2014 R.Innocente 41
Disjoint/Wilton switch blocks
Disjoint : a wire can only go out on a wire with the same number, which creates routing domains
Wilton : a wire can change domain in at least one direction
May 10, 2014 R.Innocente 42
Channel segments distribution
May 10, 2014 R.Innocente 43
Columnar architecture
7 series Xilinx FPGA
May 10, 2014 R.Innocente 44
DSP blocks & floating point
May 10, 2014 R.Innocente 45
FPGAs floating point in 1994
B. Fagin and C. Renard. Field Programmable Gate Arrays and Floating Point Arithmetic. IEEE Transactions on VLSI Systems, 2(3), September 1994.
Fagin & Renard report that floating point operators can be implemented, but that it is impractical : no
FPGA then in existence could hold even a single multiplier circuit !!
May 10, 2014 R.Innocente 46
FPGA fp in 1995
Shirazi et al., along the same lines as Fagin & Renard, propose 2 custom fp formats of 16 and 18 bits in total :
they provide add, sub, mul and div operators for them
N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1995.
May 10, 2014 R.Innocente 47
FPGA fp in 2002
Belanovic & Leeser present a library of variable-width parameterized floating point operators (a superset of the IEEE formats)
A Library of Parameterized Floating-point Modules and Their Use. Pavle Belanovic and Miriam Leeser, 2002
May 10, 2014 R.Innocente 48
What allowed the breakthrough ?
The addition, by the major vendors, of hardware multipliers (called DSP blocks) to their FPGAs from 2000 on :
- first Xilinx on Virtex II
- soon after Altera on Stratix
Over the last decade this also sparked the interest of the HPC community :
Cray XD1, SGI RASC, Convey HC1
HPRC = High Performance Reconfigurable Computing
May 10, 2014 R.Innocente 49
FPGA MAC operation
May 10, 2014 R.Innocente 50
Virtex-7 DSP48 high level
From Xilinx
1 bit 2 bit
May 10, 2014 R.Innocente 51
DSP48E1 details
May 10, 2014 R.Innocente 52
Altera Stratix V DSP block
4 (*) + 3(+) = 7 flop
May 10, 2014 R.Innocente 53
Data Flow Graphs (DFG)
May 10, 2014 R.Innocente 54
Data flow
A representation of a program as a DG (Directed Graph) in which the nodes are the operations and the edges represent the data dependencies from one operation to the next
May 10, 2014 R.Innocente 55
Control flow/Data Flow
dis2=b**2-4*a*c
If dis2 < 0 complex!
dis=sqrt(dis2)
u1=-b/(2*a)
u2=dis/(2*a)
x1=u1+u2
x2=u1-u2
$x = \frac{-b}{2a} \pm \frac{\sqrt{b^2 - 4ac}}{2a}$
May 10, 2014 R.Innocente 56
A scalar product
Fortran :
acc=0.0
do i=1,4
acc=acc+a(i)*b(i)
enddo
C :
acc=0.0;
for(i=0;i<4;i++){
acc=acc+a[i]*b[i];
}
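The loop above performs its 4 multiply-adds one after the other. On an FPGA the same data flow can instead be laid out in space. The C sketch below only illustrates that idea under assumptions of this note: the function names are hypothetical, and plain C still executes sequentially, so the "levels" in the comments describe the data dependencies rather than actual parallel hardware.

```c
#include <assert.h>

/* Sequential version: one multiply-accumulate per loop iteration,
   mirroring the loop in the text. */
double dot4_sequential(const double a[4], const double b[4]) {
    assert(a && b);
    double acc = 0.0;
    for (int i = 0; i < 4; i++) {
        acc = acc + a[i] * b[i];
    }
    return acc;
}

/* Spatial version: written the way the data flow graph would be laid out,
   4 independent multipliers followed by a 2-level adder tree. */
double dot4_tree(const double a[4], const double b[4]) {
    assert(a && b);
    /* level 1: the four products do not depend on each other */
    double p0 = a[0] * b[0];
    double p1 = a[1] * b[1];
    double p2 = a[2] * b[2];
    double p3 = a[3] * b[3];
    /* level 2: two independent additions */
    double s0 = p0 + p1;
    double s1 = p2 + p3;
    /* level 3: final addition */
    return s0 + s1;
}
```

The tree needs 4 multipliers and 3 adders but only 3 dependent levels, while the loop reuses one multiply-add 4 times in a chain. Note that the two versions add in a different order, so with general floating point inputs their results may differ in the last bits.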
May 10, 2014 R.Innocente 57
Time/Space tradeoffs
May 10, 2014 R.Innocente 58
Systolic array matrix mult
A(n,n) x B(n,n) requires :
- 2n-1 steps for the last elements to enter the array
- n-1 steps to compute the last c(n,n)
- n steps to move the result out
Total = 4n-2 steps
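The step count can be checked with a few lines of C. This is a sketch with hypothetical function names; the comparison with the n^3 multiply-adds of a plain triple loop is added here for context and is not on the slide.

```c
#include <assert.h>

/* Steps for an n x n systolic matrix multiply, summing the three phases
   listed on the slide. */
unsigned systolic_steps(unsigned n) {
    assert(n >= 1);
    unsigned fill = 2 * n - 1;      /* last elements enter the array */
    unsigned compute_last = n - 1;  /* last c(n,n) is computed */
    unsigned drain = n;             /* result is moved out */
    return fill + compute_last + drain;  /* = 4n - 2 */
}

/* Multiply-add count of the sequential triple loop, for comparison. */
unsigned long sequential_macs(unsigned n) {
    return (unsigned long)n * n * n;
}
```

For n = 100 the array finishes in 398 steps against 1,000,000 sequential multiply-adds, at the price of n x n processing elements: a time/space tradeoff.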
May 10, 2014 R.Innocente 59
Codesign
The implementation of algorithms on FPGAs requires a mix of hw and sw design :
Codesign = hw design + sw design
May 10, 2014 R.Innocente 60
How to program FPGAs?
Mainly with an HDL (Hardware Description Language):
- Verilog (initially developed by Gateway Design Automation, now a standard)
- VHDL (produced by a standards committee)
But translators from OpenCL, ImpulseC, SystemC, C, Handel-C .. are also available. Is this a good idea ?
The problem is that those languages were not designed for describing hardware, and the translation usually ends up as an FSM (finite state machine) with 1 state for every statement, which then steps through the states one at a time.
This is not the way someone skilled would program the FPGA.
[Figure: FSM (finite state machine) block diagram: input, next-state logic, state register (D/Q, clocked by clk), output logic, Out]
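To make the "1 state for every statement" remark concrete, here is a small C model of such an FSM for a 4-element scalar product. It is a sketch under assumptions of this note: the state names, the function `dot4_fsm` and the rule of one state transition per clock cycle are illustrative and do not describe the output of any particular translator.

```c
#include <assert.h>

enum state { S_INIT, S_CHECK, S_MAC, S_INC, S_DONE };

/* Software model of a statement-per-state FSM computing a 4-element scalar
   product. Each pass through the while loop stands for one clock cycle;
   *cycles returns the number of cycles needed to reach S_DONE. */
double dot4_fsm(const double a[4], const double b[4], unsigned *cycles) {
    assert(a && b && cycles);
    enum state s = S_INIT;
    double acc = 0.0;
    int i = 0;
    *cycles = 0;
    while (s != S_DONE) {
        switch (s) {
        case S_INIT:  acc = 0.0; i = 0;         s = S_CHECK; break;
        case S_CHECK: s = (i < 4) ? S_MAC : S_DONE;          break;
        case S_MAC:   acc = acc + a[i] * b[i];  s = S_INC;   break;
        case S_INC:   i++;                      s = S_CHECK; break;
        case S_DONE:  break;
        }
        (*cycles)++;
    }
    return acc;
}
```

In this model the 4 multiply-adds take 14 cycles (1 for init, 3 per iteration, 1 for the final check), whereas a hand-designed pipeline or adder tree could keep every multiplier busy on each cycle.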
May 10, 2014 R.Innocente 61
Verilog
May 10, 2014 R.Innocente 62
Using Verilog
You write a functional specification, usually split into modules, that documents the exact behaviour of the system.
Design flow :
- Functional design : HDL (Verilog) -> logic synthesis -> gate netlist
- Physical design : gate netlist -> place & route -> FPGA / ASIC
Simulated annealing is used in place & route !
NB. place and route of a large design can take 1 day on a fast CPU !!
May 10, 2014 R.Innocente 63
Verilog/1
Basic module :
// comments are written this way
module name(input x0, x1, input [3:0] y, output out);
  // x0, x1 are wires, y is a 4-wire bus
  // out is an output wire
  // ports are wires by default, so no separate wire declaration is needed
  // combinational logic uses assign
endmodule
May 10, 2014 R.Innocente 64
Verilog/2
Combinatorial circuit :
// performs (not a) b c + a (not b) (not c)
module dummy(input a, b, c, output y, z);
  // ports declared in the header are already wires
  assign y = ~a & b & c | a & ~b & ~c;
  assign z = ~c;
endmodule
This is not C ! a, b, c, y, z are wires, and y, z change whenever a or b or c change. To avoid this drama in complex circuits we use synchronous logic
(everything is stepped through docking stations = flip-flops)
May 10, 2014 R.Innocente 65
Verilog/3
May 10, 2014 R.Innocente 66
Verilog/4
A sequential circuit :
// a flip-flop described in Verilog
module ff(input d, clk, output reg q, qbar);
  // q and qbar are regs because they are assigned inside an always block
  always @(posedge clk) begin
    q <= d;
    qbar <= ~d;
  end
endmodule
At a rising edge of clk, d is copied to q and the inverse of d to qbar
May 10, 2014 R.Innocente 67
Verilog/5
May 10, 2014 R.Innocente 68
Verilog/6
A more complicated sequential circuit :
// a flip-flop with asynchronous clear/reset in Verilog
module ff(input d, clk, clr, output reg q, qbar);
  always @(posedge clk, posedge clr)
    if (clr) begin
      q <= 0;
      qbar <= 1;
    end else begin
      q <= d;
      qbar <= ~d;
    end
endmodule
At a rising edge of clr, q is set to 0 (and qbar to 1); at a rising edge of clk, d is copied to q and the inverse of d to qbar
May 10, 2014 R.Innocente 69
Verilog/7
May 10, 2014 R.Innocente 70
Power/Energy efficiency
May 10, 2014 R.Innocente 71
Dennard scaling (1974)
With a scaling factor S = 1.4 per generation :
- S^2 = 2x more transistors
- S = 1.4x faster transistors
- S = 1.4x lower capacitance
- scale Vdd by S => S^2 = 2x lower energy
Performance scales as S^3 = 2.8 while power density stays constant across generations
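The bookkeeping behind these factors can be written out in C. The sketch assumes the usual dynamic power model (power per transistor proportional to C * Vdd^2 * f), which the slide does not state explicitly; the struct and function names are hypothetical. The `voltage_scales` flag also covers the case discussed a few slides later, where Vdd can no longer be reduced.

```c
#include <assert.h>

struct scaling {
    double transistors;    /* transistor count per unit area, relative */
    double frequency;      /* switching speed, relative */
    double performance;    /* transistors x frequency */
    double power_density;  /* power per unit area, relative */
};

/* One process generation with scaling factor s (about 1.4).
   Assumed model: power per transistor ~ capacitance * vdd^2 * frequency. */
struct scaling scale_generation(double s, int voltage_scales) {
    assert(s > 0.0);
    double capacitance = 1.0 / s;
    double vdd = voltage_scales ? 1.0 / s : 1.0;
    struct scaling r;
    r.transistors = s * s;
    r.frequency = s;
    r.performance = r.transistors * r.frequency;
    r.power_density = r.transistors * capacitance * vdd * vdd * r.frequency;
    return r;
}
```

With `voltage_scales = 1` the power density stays at 1 while performance grows as s^3 (about 2.8 for s = 1.4); with `voltage_scales = 0` the same generation raises power density by s^2, about 2x.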
May 10, 2014 R.Innocente 72
Every semiconductor generation ~ 2.8 times more performance. Every k generations ~ 2.8^k
The wise said :
“All exponentials must end
or they would eat up the Universe … “
May 10, 2014 R.Innocente 73
Fred Pollack(Intel) famous graph(1999)
Power density increases !!!
In 2004/2005 we hit the power wall => frequency increases stop
“New microarchitecture challenges in the coming generations of CMOS process technology” F.Pollack
May 10, 2014 R.Innocente 74
End of Dennard scaling
- S^2 = 2x more transistors
- S = 1.4x faster transistors
- S = 1.4x lower capacitance
In submicron technology the voltage can no longer be scaled down. Power increases by S^2 = 2x
May 10, 2014 R.Innocente 75
MOS subthreshold current
Scaling down the geometry, you also scale down the drain voltage to avoid high electric fields and to decrease the energy required to switch. You also have to scale down the threshold voltage to sustain the 30% decrease in gate delay. The small voltage swing that remains cannot turn the transistor off completely. Subthreshold leakage, which was ignored in the past, can consume up to ½ of the total power on modern VLSI chips.
May 10, 2014 R.Innocente 76
Subthreshold leakage
May 10, 2014 R.Innocente 77
$V_T$ design tradeoff
[Figure: $\log I_{DS}$ versus $V_{GS}$]
- Low $V_T$ for high ON current : $I_{Dsat} \propto (V_{DD} - V_T)^2$
- High $V_T$ for low OFF current
Phenomenology : 60-200 mV of $V_{GS}$ swing decreases $I_{DS}$ by one order of magnitude. Today's $V_T$ of 0.5-0.2 V doesn't allow the $V_{GS}$ swing needed to shut off the transistor.
Low $V_T$ => high $I_{DS}$ : good for the ON condition
High $V_T$ => low leakage : good for the OFF condition
May 10, 2014 R.Innocente 78
Multicore scaling
65 nm 45 nm 32 nm
4-core 8-core 16-core
Every generation 2x cores, at same or slightly increasing frequency.
May 10, 2014 R.Innocente 79
Multicore scaling at constant frequency
- S^2 = 2x more transistors
- S = 1.4x lower capacitance
=> S = 1.4x lower utilization
We hit the utilization wall => dark silicon
May 10, 2014 R.Innocente 80
End of multicore scaling
Every generation 2x cores at the same or slightly increasing frequency, but only 1.4x of them not dark.
- 65 nm : 4 cores
- 45 nm : 5.7 cores (4*2/1.4 ~ 5.7)
- 32 nm : 8 cores (4*2/1.4*2/1.4 ~ 8)
The rest is dark or dim silicon (“uncore”)
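The arithmetic of this slide, written as a C sketch. The function names are hypothetical, and the dark fraction is a derived quantity added here for illustration; the slide itself only gives the core counts.

```c
#include <assert.h>

/* Cores on the die after `generations` steps: doubling every generation. */
double total_cores(double base_cores, unsigned generations) {
    assert(base_cores > 0.0);
    double total = base_cores;
    for (unsigned g = 0; g < generations; g++) {
        total = total * 2.0;
    }
    return total;
}

/* Cores that can be powered at full speed: per generation the count
   grows only by 2/1.4, as in the slide's 4*2/1.4 ~ 5.7. */
double active_cores(double base_cores, unsigned generations) {
    assert(base_cores > 0.0);
    double active = base_cores;
    for (unsigned g = 0; g < generations; g++) {
        active = active * 2.0 / 1.4;
    }
    return active;
}

/* Fraction of the die that has to stay dark or dim. */
double dark_fraction(double base_cores, unsigned generations) {
    return 1.0 - active_cores(base_cores, generations) /
                 total_cores(base_cores, generations);
}
```

Starting from 4 cores at 65 nm, two generations give 16 cores on the die but only about 8 usable at once, so roughly half of the silicon is dark at 32 nm in this model.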
May 10, 2014 R.Innocente 81
Dark silicon and the end of multicore scaling
Doug Burger (Microsoft) at HiPEAC 2013 :
- until 2004 : each semiconductor generation gave transistors that were smaller, faster and consumed less
- from 2004 to now : we still got smaller transistors, but we could not run them faster (power wall)
- in the future : we will still get smaller transistors, but we will not be able to use all of them together (dark silicon) or at max speed.
http://www.darksilicon.org
May 10, 2014 R.Innocente 82
Scaling the utilization wall
G. Venkatesh, ASPLOS '10 :
“While the area budget continues to increase exponentially, the power budget has become a first-order design constraint in current processors. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve the system performance.”
The Utilization Wall : With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints. [Venkatesh, ASPLOS ‘10]
Single chip heterogeneous computer (E. Chung)
Greater energy efficiency by combining GPPs with unconventional cores (U-cores) : GPU, FPGA, DSP, ASICs ..
May 10, 2014 R.Innocente 83
Future ?
Previous forecasts assume that CMOS/MOSFET technology stays as it is now (“ceteris paribus”).
In fact most MOSFET leakage occurs far below the gate, where the gate cannot fully control the substrate and where the drain competes in creating the electric field. It has been shown that a 3D (instead of planar) gate that wraps around the channel cures most of these problems : FinFET (Chenming Hu, King-Liu, Bokor, UCB), UltraThinBody SOI, 3D (Intel)
May 10, 2014 R.Innocente 84
3D FinFET promise
Below 20 nm the roadmap is to use 3D FinFETs :
- Faster : +37%
- Dynamic power : -50%
- Static power : -90%
KAIST demonstrated a 3 nm FinFET in the lab.
SPICE model available on the Internet (BSIMCMG107)
May 10, 2014 R.Innocente 85
Back of the envelope performance estimation
May 10, 2014 R.Innocente 86
Back of the envelope performance estimation
Given number of
- LUTs
- FFs
- DSPs
offered by an FPGA,
and utilization of resources by operators, estimate the max number of operators that can be implemented on the FPGA
Today FPGA clocks are ~ 500 MHz = 0.5 GHz (unavoidable price for flexibility)
2000 flops per cycle = 1 Teraflops
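The estimation procedure can be sketched in C. This is a minimal illustration: the struct layout and function names are hypothetical, and the model simply takes the tightest resource limit for a single operator type, ignoring routing and timing closure, which in practice reduce the achievable count.

```c
#include <assert.h>
#include <limits.h>

/* Resources offered by a device, or used by one operator instance. */
struct resources {
    unsigned long luts;
    unsigned long ffs;
    unsigned long dsps;
};

/* Largest number of copies of one operator that fits on the device:
   the minimum over the resources the operator actually uses. */
unsigned long max_operators(struct resources avail, struct resources cost) {
    assert(cost.luts > 0 || cost.ffs > 0 || cost.dsps > 0);
    unsigned long n = ULONG_MAX;
    if (cost.luts > 0 && avail.luts / cost.luts < n) n = avail.luts / cost.luts;
    if (cost.ffs > 0 && avail.ffs / cost.ffs < n) n = avail.ffs / cost.ffs;
    if (cost.dsps > 0 && avail.dsps / cost.dsps < n) n = avail.dsps / cost.dsps;
    return n;
}

/* Peak Gflops when every operator delivers one flop per cycle. */
double peak_gflops(unsigned long flops_per_cycle, double clock_ghz) {
    assert(clock_ghz > 0.0);
    return (double)flops_per_cycle * clock_ghz;
}
```

For example, `peak_gflops(2000, 0.5)` gives 1000 Gflops, the 1 Teraflops figure quoted above. Mixing several operator variants (with and without DSPs) so that LUTs, FFs and DSPs run out together requires the kind of allocation shown in the tables that follow.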
May 10, 2014 R.Innocente 87
Xilinx Virtex-7 family
Virtex-7 slices : 4 x 6-input LUTs, 8 FFs
Virtex-7 DSPs : 48 bits pre-adder, 25x18 multiplier, 48 bits accumulator
Virtex LUT ~ 1.6 standard LUT
May 10, 2014 R.Innocente 88
Virtex7 XC7V2000T Custom precision 17/24 bits fp
op   dsp  lut+ff  lut  ff   #     tot dsp  tot lut  tot ff
*    2    103     90   112  1080  2160     208440   232200
*    1    113     97   104  0     0        0        0
*    0    377     336  376  0     0        0        0
+    0    369     301  393  1510  0        1011700  1150620
Tot                         2590  2160     1220140  1382820
Virtex-7 V2000T available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
305400  4            8           2160  1221600      2443200
standard LUTs = 1.6 x 1221600 = 1954560
May 10, 2014 R.Innocente 89
Virtex7 XC7V2000T IEEE single precision (32 bits)
op   dsp  lut+ff  lut  ff   #     tot dsp  tot lut  tot ff
*    3    120     103  105  700   2100     156100   157500
*    2    160     128  160  0     0        0        0
*    1    331     283  331  0     0        0        0
*    0    665     629  669  0     0        0        0
+    2    293     225  327  25    50       12950    15500
+    0    500     407  541  1160  0        1052120  1207560
Tot                         1885  2150     1221170  1380560
Virtex-7 V2000T available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
305400  4            8           2160  1221600      2443200
standard LUTs = 1.6 x 1221600 = 1954560
May 10, 2014 R.Innocente 90
Virtex7 XC7V2000T IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #    tot dsp  tot lut  tot ff
*    11   325     279   421   196  2156     118384   146216
*    10   371     299   456   0    0        0        0
*    9    439     356   510   0    0        0        0
*    0    2361    2317  2418  0    0        0        0
+    3    895     705   945   1    3        1600     1840
+    0    989     794   1029  617  0        1100111  1245106
Tot                           814  2159     1220095  1393162
Virtex-7 V2000T available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
305400  4            8           2160  1221600      2443200
standard LUTs = 1.6 x 1221600 = 1954560
May 10, 2014 R.Innocente 91
Kintex 7 XC7K325T IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #    tot dsp  tot lut  tot ff
*    11   325     279   421   76   836      45904    56696
*    10   371     299   456   0    0        0        0
*    9    439     356   510   0    0        0        0
*    0    2361    2317  2418  0    0        0        0
+    3    895     705   945   1    3        1600     1840
+    0    989     794   1029  87   0        155121   175566
Tot                           164  839      202625   234102
Kintex XC7K325T available resources
slices  LUT x slice  FF x slice  dsp  6 input LUT  ff
50850   4            8           840  203400       406800
standard LUTs = 1.6 x 203400 = 325440
May 10, 2014 R.Innocente 92
Zynq 7000 Z-020 XC7Z020 IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #   tot dsp  tot lut  tot ff
*    11   325     279   421   20  220      12080    14920
*    10   371     299   456   0   0        0        0
*    9    439     356   510   0   0        0        0
*    0    2361    2317  2418  0   0        0        0
+    3    895     705   945   0   0        0        0
+    0    989     794   1029  23  0        41009    46414
Tot                           43  220      53089    61334
Zynq 7000 Z-020 available resources
slices  LUT x slice  FF x slice  dsp  6 input LUT  ff
13300   4            8           220  53200        106400
standard LUTs = 1.6 x 53200 = 85120
May 10, 2014 R.Innocente 93
Zynq 7000 Z045 XC7Z045 IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #    tot dsp  tot lut  tot ff
*    11   325     279   421   81   891      48924    60426
*    10   371     299   456   0    0        0        0
*    9    439     356   510   0    0        0        0
*    0    2361    2317  2418  0    0        0        0
+    3    895     705   945   3    9        4800     5520
+    0    989     794   1029  92   0        164036   185656
Tot                           176  900      217760   251602
Zynq 7000 Z045 available resources
slices  LUT x slice  FF x slice  dsp  6 input LUT  ff
54650   4            8           900  218600       437200
standard LUTs = 1.6 x 218600 = 349760
May 10, 2014 R.Innocente 94
Virtex7 VX690T IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #    tot dsp  tot lut  tot ff
*    11   325     279   421   327  3597     197508   243942
*    10   371     299   456   0    0        0        0
*    9    439     356   510   0    0        0        0
*    0    2361    2317  2418  0    0        0        0
+    3    895     705   945   1    3        1600     1840
+    0    989     794   1029  131  0        233573   264358
Tot                           459  3600     432681   510140
Virtex-7 VX690T available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
108300  4            8           3600  433200       866400
standard LUTs = 1.6 x 433200 = 693120
May 10, 2014 R.Innocente 95
Virtex UltraScale XCVU440 20nm - sampling out. IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #     tot dsp  tot lut  tot ff
*    11   325     279   421   261   2871     157644   194706
*    10   371     299   456   0     0        0        0
*    9    439     356   510   0     0        0        0
*    0    2361    2317  2418  0     0        0        0
+    3    895     705   945   3     9        4800     5520
+    0    989     794   1029  1321  0        2355343  2665778
Tot                           1585  2880     2517787  2866004
Virtex UltraScale available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
314820  8            16          2880  2518560      5037120
For UltraScale, Xilinx publishes Logic Cells = 1.75 x 6-input LUTs
standard LUTs = 1.75 x 2518560 = 4407480
May 10, 2014 R.Innocente 96
Virtex UltraScale XCVU440 20nm - sampling out. IEEE double precision (64 bits)
sqrt(sqr(x1-x2)+sqr(y1-y2)+sqr(z1-z2)), each operator implemented in fp
op    dsp  lut+ff  lut   ff    #     tot dsp  tot lut  tot ff
*     11   325     279   421   3     33       1812     2238
*     10   371     299   456   0     0        0        0
*     9    439     356   510   0     0        0        0
*     0    2361    2317  2418  3     0        14034    14337
sqrt  0    2865    2005  3242  2     0        9740     12214
+     3    895     705   945   5     15       8000     9200
+     0    989     794   1029  5     0        8915     10090
Tot                            18    48       42501    48079
x 59                           1062  2832     2507559  2836661
One function evaluation costs 3 *, 5 +/- and 1 sqrt; the block above (6 *, 10 +, 2 sqrt) holds 2 evaluations, therefore you can implement 59*2 = 118 of them
Virtex UltraScale available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
314820  8            16          2880  2518560      5037120
For UltraScale, Xilinx publishes Logic Cells = 1.75 x 6-input LUTs
standard LUTs = 1.75 x 2518560 = 4407480
May 10, 2014 R.Innocente 97
Kintex UltraScale XCKU040 IEEE double precision (64 bits)
op   dsp  lut+ff  lut   ff    #    tot dsp  tot lut  tot ff
*    11   325     279   421   174  1914     105096   129804
*    10   371     299   456   0    0        0        0
*    9    439     356   510   0    0        0        0
*    0    2361    2317  2418  0    0        0        0
+    3    895     705   945   2    6        3200     3680
+    0    989     794   1029  75   0        133725   151350
Tot                           251  1920     242021   284834
Kintex UltraScale available resources
slices  LUT x slice  FF x slice  dsp   6 input LUT  ff
30300   8            16          1920  242400       484800
For UltraScale, Xilinx publishes Logic Cells = 1.75 x 6-input LUTs
standard LUTs = 1.75 x 242400 = 424200
May 10, 2014 R.Innocente 98
Gflops per Watt
peak nominal double fp performance/TDP :
Intel Q6600 2.4 GHz 38 gflops/105 W = 0.36 gflops/W
Intel Haswell i7-4770K 3.5 GHz 112 gflops/84 W = 1.33 gflops/W
Intel IvyBridge 3770K 3.5 GHz 112 gflops/77 W = 1.45 gflops/W
Nvidia Tesla M2090 666 gflops/225 W = 2.96 gflops/W
Nvidia Tesla K20X 1310 gflops/235 W = 5.57 gflops/W
Xilinx Virtex-US 800 gflops/20 W = 40 gflops/W
FPGA computing = green computing : ~30x the CPUs, ~10x the GPUs
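The efficiency figures are simple ratios, which the following C sketch reproduces. The function names are hypothetical; the inputs are the peak and TDP numbers listed on the slide.

```c
#include <assert.h>

/* Energy efficiency: peak performance divided by thermal design power. */
double gflops_per_watt(double peak_gflops, double tdp_watts) {
    assert(tdp_watts > 0.0);
    return peak_gflops / tdp_watts;
}

/* How many times device A is more efficient than device B. */
double efficiency_ratio(double gflops_a, double watts_a,
                        double gflops_b, double watts_b) {
    return gflops_per_watt(gflops_a, watts_a) /
           gflops_per_watt(gflops_b, watts_b);
}
```

With the slide's numbers, `efficiency_ratio(800, 20, 112, 84)` is about 30 (Virtex-US against the Haswell i7) and `efficiency_ratio(800, 20, 1310, 235)` is about 7 (against the Tesla K20X), in line with the ~30x and ~10x orders of magnitude quoted.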
May 10, 2014 R.Innocente 99
Gigaflops per Watt /2
[Figure: bar chart of Gflops/W (axis 0 to 45) for Intel Q6600 2.4 GHz, Intel 4770K, Intel i7-3770K, Tesla M2090, Tesla K20X and Virtex-7]
May 10, 2014 R.Innocente 100
Top green500 list
green500_rank | total_power | Mflops/Watt | Year | Total Cores | Name | Manufacturer | Country | Computer
1 | 28 | 4,503 | 2013 | 2720 | TSUBAME-KFC | NEC | Japan | LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x
2 | 53 | 3,632 | 2013 | 5120 | Wilkes | Dell | United Kingdom | Dell T620 Cluster, Intel Xeon E5-2630v2 6C 2.600GHz, Infiniband FDR, NVIDIA K20
3 | 79 | 3,518 | 2013 | 4864 | HA-PACS TCA | Cray Inc. | Japan | Cray 3623G4-SM Cluster, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x
4 | 1,754 | 3,186 | 2012 | 115984 | Piz Daint | Cray Inc. | Switzerland | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x
5 | 81 | 3,131 | 2013 | 5720 | romeo | Bull SA | France | Bull R421-E3 Cluster, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR, NVIDIA K20x
6 | 923 | 3,069 | 2013 | 74358 | TSUBAME 2.5 | NEC/HP | Japan | Cluster Platform SL390s G7, Xeon X5670 6C 2.930GHz, Infiniband QDR, NVIDIA K20x
7 | 54 | 2,702 | 2013 | 3080 | - | IBM | United States | iDataPlex DX360M4, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR14, NVIDIA K20x
8 | 270 | 2,629 | 2013 | 15840 | - | IBM | Germany | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
9 | 56 | 2,629 | 2013 | 3264 | - | IBM | United States | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
10 | 71 | 2,359 | 2010 | 4620 | CSIRO GPU Cluster | Xenon Systems | Australia | Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, Nvidia K20m
11 | 179 | 2,351 | 2012 | 38400 | SANAM | Adtech | Saudi Arabia | Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000
12 | 82 | 2,299 | 2011 | 16384 | - | IBM | United States | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
13 | 82 | 2,299 | 2012 | 16384 | Cetus | IBM | United States | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
14 | 82 | 2,299 | 2012 | 16384 | - | IBM | Poland | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
15 | 82 | 2,299 | 2013 | 16384 | - | IBM | United States | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
16 | 82 | 2,299 | 2012 | 16384 | Vesta | IBM | United States | BlueGene/Q, Power BQC 16C 1.60GHz, Custom
17 | 82 | 2,299 | 2012 | 16384 | - | IBM | United States | BlueGene/Q, Power BQC 16C 1.60GHz, Custom
18 | 237 | 2,243 | 2013 | 10920 | HPCC | Hewlett-Packard | United States | Cluster Platform SL250s Gen8, Xeon E5-2665 8C 2.400GHz, Infiniband FDR, Nvidia K20m
May 10, 2014 R.Innocente 101
FPGA lingo
May 10, 2014 R.Innocente 102
Core
A core, in FPGA lingo, is a function ready to be instantiated into your design as a “black box”. It can be supplied as HDL or as a schematic.
It supports design re-use.
May 10, 2014 R.Innocente 103
Soft/hard cores
On FPGAs functional modules can be implemented :
- using standard FPGA resources (logic blocks, DSPs, memory blocks) : soft cores
- as an ASIC on the FPGA : hard cores
When the manufacturer puts a processor on the FPGA as a hard core, it sells the device as a SoC (System on Chip) : dual ARM on the Zynq-7000 chip, PowerPC on Altera FPGA
May 10, 2014 R.Innocente 104
IP/open cores
The soft attribute is implied.
Hardware designs in an HDL (possibly using vendor libraries):
- opensource cores : http://opencores.org/
OpenRISC 1000 architecture from the OpenCores community,
the Lattice Semiconductor LM32, the LEON3 from Aeroflex
Gaisler and the OpenSPARC family from Oracle
- proprietary : IP(Intellectual Property) cores
Floating point operators, fft, matrix computations
May 10, 2014 R.Innocente 105
Commercial offers
May 10, 2014 R.Innocente 106
Picocomputing
SC6 1U : up to 16 FPGAs. SC6 4U : up to 48
EX-600, EX-800
From PICOCOMPUTING
May 10, 2014 R.Innocente 107
Bittware Terabox
16 Altera Stratix-V
From Bittware
May 10, 2014 R.Innocente 108
BeeCUBE
A spinoff of the UCB Wireless center, this company offers:
- reconfigurable platform (scalable, full speed interconnect)
- honeycomb architecture (symmetrical 4-FPGA based module arch)
- Nectar distributed OS
May 10, 2014 R.Innocente 109
DINIGROUP Cluster of 4 Virtex7
From DINIGROUP
May 10, 2014 R.Innocente 110
Dinigroup Cluster 40 Kintex-7
From DINIGROUP
May 10, 2014 R.Innocente 111
Maxeler MPC-X
Daresbury Lab UK :The dataflow supercomputer will feature Maxeler developed MPC-X nodes capable of an equivalent 8.52TFLOPs per 1U and 8.97 GFLOPs/Watt.
May 10, 2014 R.Innocente 112
Convey HC-2 , HC-2ex
May 10, 2014 R.Innocente 113
Cray XT5h
“Cray introduces an hybrid supercomputer that can integrate multiple processor architectures into a single system and accelerate high performance computing (HPC) workflows. The Cray XT5h delivers higher sustained performance, by applying alternative processor architectures across selected applications within an HPC workflow. The Cray XT5h supports a variety of processor technologies, including scalar processors based on AMD Opteron™ dual and quad-core technologies, vector processors, and FPGA accelerators.”
May 10, 2014 R.Innocente 114
CHREC
Center for High Performance Reconfigurable Computing
UF/BYU/GWU/VTECH
May 10, 2014 R.Innocente 115
CHREC Novo-G 384 FPGAs
“Novo-G is the most powerful reconfigurable supercomputer in the known world. This unique machine features 192 top-end, 40nm FPGAs (Altera Stratix-IV E530) and 192 top-end, 65nm FPGAs (Stratix-III E260).”
http://www.chrec.org/
(pronounce it as shreck)
May 10, 2014 R.Innocente 116
CHREC/2
BLAST, like Smith-Waterman, computes the local alignment of 2 sequences :
- Novo-BLAST, the Novo-G/CHREC implementation : faster, same sensitivity
IPC (Isotope Pattern Calculator) of the Protein Identification Algorithm :
- speedup of 52-366 on a single FPGA, 1259 on 4 FPGAs, 3340 on a node (16 FPGAs)
May 10, 2014 R.Innocente 117
References for Applications
May 10, 2014 R.Innocente 118
Linear Algebra for RC
Juan Gonzalez and Rafael C. Núñez. LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators (JP 2009). DOD funded
May 10, 2014 R.Innocente 119
DCT, FFT on FPGAs
Digital Signal Processing with Field Programmable Gate Arrays, 3rd edition (2007)
U. Meyer-Baese, Springer Verlag
May 10, 2014 R.Innocente 120
MD on FPGA
There are many papers about porting Molecular Dynamics algorithms to FPGAs, with substantially positive conclusions from experiments on 1-2 FPGAs. But in recent years there has been an embarrassing comparison with ANTON (Shaw et al.).
We can't forget that ANTON is a really huge machine consuming over 100 kW !!!!
And it is made of 512 dedicated ASICs at 1 GHz !
Comparing it with a few FPGAs consuming 40/60 W is improper.
FPGA-Accelerated Molecular Dynamics (2013), M. A. Khan, M. Chiu, M. C. Herbordt
May 10, 2014 R.Innocente 121
Neural networks on FPGAs
Editors : Omondi, Rajapakse (2006)
FPGA implementation of neural networks
An ANN (Artificial Neural Network) in integer arithmetic performs 40x better than on a GPP (on an old FPGA, 3 generations old)
May 10, 2014 R.Innocente 122
Altera Arria 10
May 10, 2014 R.Innocente 123
Arria10
May 10, 2014 R.Innocente 124
Arria 10 architecture
May 10, 2014 R.Innocente 125
Arria 10 variable precision DSP block
Altera
A
B
CD
A+C*D = 2 flop
May 10, 2014 R.Innocente 126
Arria10 DSP standard precision mode
May 10, 2014 R.Innocente 127
Arria10 DSP High-precision mode
May 10, 2014 R.Innocente 128
Arria10 estimated sp fp performance
- 2 flops per cycle
- 1688 fp single precision DSP (GX660)
1688*2 = 3376 flops per cycle
3376 * 0.5 ghz ~ 1.7 Teraflops in single precision
May 10, 2014 R.Innocente 129
Hard single prec FP on FPGA ?!?
For people that can live with single precision this seems a very attractive new feature.
But many think that it wastes too many generic resources, and claim that what was missing were simpler blocks !
May 10, 2014 R.Innocente 130
BORPH : Berkeley Operating system for ReProgrammable Hardware
PETALINUX : Xilinx Linux for Zynq et al.
May 10, 2014 R.Innocente 131
Borph : Berkeley Operating System
- Idea of a HW unix process : it has a pid and can be killed like a normal unix process, but in fact it is a HW instance on the FPGA
- ioreg Virtual File System interface
May 10, 2014 R.Innocente 132
Xilinx Petalinux
The PetaLinux Software Development Kit (SDK) is a development tool that contains everything necessary to build, develop, test and deploy Embedded Linux systems on : Zynq-7000, Zedboard, Kintex-7 boards.
PetaLinux consists of : pre-configured binary bootable images, fully customizable Linux for the Xilinx device, and PetaLinux SDK which includes tools and utilities to automate complex tasks across configuration, build, and deployment.
PetaLinux is offered under two separate licenses :
No charge Evaluation license or Commercial licenses
May 10, 2014 R.Innocente 133
END