Erik D’Hollander
University of Ghent Belgium
Outline
1. Super desktop GPU/FPGA architecture
2. Programming tool chain
3. FPGA vs. GPU strengths
4. Roofline performance model for FPGA
5. Tuning performance
6. Optimizing compute resources
7. Conclusion
Supercomputing 1969-2018
• 1969: MFlops
• 1985: GFlops
• 1997: TFlops
• 2008: PFlops
• 2018: EFlops?
[Figure: peak performance (10³–10¹⁸ FLOPS) vs. year, 1969–2011: CDC 7600, CDC STAR, Cray X-MP, Cray-2, Fujitsu NWT, Hitachi SR2201, Intel ASCI Red, NEC Earth Simulator, IBM Blue Gene, IBM Roadrunner, Tianhe-I, K computer]
Trendline: MFLOPS(y) = 1.72^(y−1969)
Trendlines
• Supercomputing FLOPS > Moore’s law
• Memory speed increase << Moore’s law
[Figure: log₁₀(FLOPS) vs. year (1960–2020) — Moore's law, MFlops trendline (R² = 0.97), and relative memory speed increase]
PC today
Super desktop with GP-GPU and FPGA
• Host: Supermicro PC
• Accelerators:
– GP-GPU Tesla C2050: highly regular parallel applications
– FPGA board Pico EX500 with 2x M501 (Virtex-6): configurable, massively parallel applications, low power
“GUDI” Tetra project supported by IWT Flanders Belgium, EhB, VUB and UGent
Super desktop with GP-GPU and FPGA
Combining GPU and FPGA strengths
• Image processing + Bio-informatics
• Face recognition + Security
• Audio processing + HMM speech recognition
• Traffic analysis + Neural network control
Super desktop with GP-GPU and FPGA
• Internal architecture and interconnections
Super desktop with GP-GPU and FPGA
• Hybrid system: CPU, 2 FPGAs, GP-GPU
Super desktop with GP-GPU and FPGA
• Internal bandwidths: CPU ↔ memory: 19.2 GB/s; CPU ↔ accelerators: 25.6 GB/s (QPI)
Super desktop with GP-GPU and FPGA
• Internal bandwidths: CPU ↔ FPGAs: 8 GB/s; CPU ↔ GP-GPU: 8 GB/s
Super desktop with GP-GPU and FPGA
• Internal bandwidths, GPU: SMP ↔ global memory: 115.0 GB/s; SMP ↔ shared memory: 73.5 GB/s
Super desktop with GP-GPU and FPGA
• Internal bandwidths, FPGA: DSP/logic ↔ Block RAM: 386 GB/s; DSP/logic ↔ PCIe switch: 4 GB/s; DSP/logic ↔ DDR3 RAM: 3.2 GB/s
Super desktop with GP-GPU and FPGA
• Heterogeneous architecture:
– 3 computing architectures
– non-uniform memories
Programming tool chain
• Algorithm decomposed into GPU, host and FPGA parts
Programming tool chain
• FPGA architecture generated with High Level Synthesis tools (C to VHDL compilers)
Programming tool chain
• Bitstream files = hardware procedure calls
Programming tool chain
• Code executed on combined platform
• Communication via PCIe
Heterogeneous computing
Data transfer
• GPU: AllDataToDev → calculate → AllResultToHost (*)
• FPGA: StreamToDev → calculate → StreamToHost
(*) unless explicit double buffering
[Figure: CPU ↔ GPU via local memory over PCIe; CPU ↔ FPGA via PCIe stream]
Comparison axes
• Speed: computational power
• Communication: bandwidth/latency
• Programmability: IDE efficiency
[Figure: comparison triangle with axes speed, programmability and communication]
Programming environment
• Programming language: C
• GPU: CUDA, OpenCL
– C → PTX (Parallel Thread Execution)
• FPGA: HLS (High Level Synthesis)
– C → VHDL
– History:
• AutoESL (Xilinx) → Vivado HLS
• Catapult C from Mentor Graphics
• C-to-HDL tool from Politecnico di Milano (Italy)
• C-to-Verilog tool from www.c-to-verilog.com
• DIME-C from Nallatech
• Handel-C from Celoxica (defunct)
• HercuLeS (C/assembly-to-VHDL) tool
• Impulse C from Impulse Accelerated Technologies
• Nios II C-to-Hardware Acceleration Compiler from Altera
• ROCCC 2.0 (free and open-source C-to-HDL tool) from Jacquard Computing Inc.
• SPARK (a C-to-VHDL tool) from University of California, San Diego
• SystemC from Celoxica (defunct)
FPGA high level synthesis compilers
• ROCCC (Riverside Optimizing Compiler for Configurable Computing)
– target:
• platform-dependent modules (IP cores) collected into a library
• platform-independent systems use modules as functions: replicate, parallelize and pipeline
– optimizations:
• low level: arithmetic balancing
• high level: loop unrolling, fusion, wavefront, mul/div elimination, subexpression elimination
• data optimizations: streams with smart buffers
– output:
• VHDL design + testbench
• PCore (Xilinx)
FPGA high level synthesis compilers
• AutoESL
– target:
• Xilinx FPGAs
– optimizations:
• code: loop unroll, fusion, pipeline, inline
• data: remap, partition, arrays, reshape, resource, stream
• interface selection: handshake, fifo, bus, register, none, …
– output:
• VHDL design
• performance reports: timing, design and loop latency, utilization, area, power, interface
• design viewer with timeline, registers and interfaces, with links back to the source code
AutoESL programming example: Tuning design for performance
• Simple example: sum of array (N = 1e8)
for (i = 0; i < N; i++) sum += A[i];
• No optimizations: AutoESL reports 2 * N = 2e8 cycles
• AutoESL Designer view: 2 cycles per add
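The baseline loop can be written as plain C; a minimal runnable sketch (N is reduced from 1e8 for illustration):

```c
#include <assert.h>

#define N 1024  /* N = 1e8 in the talk; reduced here so the sketch runs quickly */

/* Baseline sum-of-array loop from the AutoESL example. The loop-carried
 * dependence on 'sum' plus one memory load per iteration is what makes
 * AutoESL schedule roughly 2 cycles per element (2*N cycles total). */
long sum_array(const int A[N]) {
    long sum = 0;
    for (int i = 0; i < N; i++) {
        sum += A[i];
    }
    return sum;
}
```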
Unroll for parallelism
• Unroll 8 times → arithmetic balancing (4 parallel adds)
• AutoESL unroll directive applied
• Designer view: only 2 parallel adds?
Increase # memory ports
• Dual-port memory: only 2 loads at a time!
• I/O bottleneck → increase the number of memory ports
Partition data for // access
• Partition A over 4 memories (= 8 ports, 256 bits)
• 8 loads, 4 parallel adds
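The unroll and partition steps above can be sketched as directives in the C source. The pragma spellings below follow the later Vivado HLS syntax (AutoESL's own directive format differs slightly), and an ordinary C compiler simply ignores them, so the sketch remains runnable:

```c
#include <assert.h>

#define N 1024  /* reduced from 1e8 for illustration */

/* Sum with unrolling and array partitioning. Unrolling by 8 exposes
 * parallel adds (balanced into an adder tree); partitioning A over 4
 * dual-port memories gives 8 read ports, so all 8 loads can issue in
 * the same cycle instead of only 2 at a time. */
long sum_array_opt(const int A[N]) {
#pragma HLS ARRAY_PARTITION variable=A cyclic factor=4
    long sum = 0;
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL factor=8
        sum += A[i];
    }
    return sum;
}
```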
Balance unroll and partitioning
• Impact of unrolling and partitioning (N = 10^8)
• Best result: unroll 64, 32 memory ports, speedup = 16
[Figure: # cycles (0–3·10^8) vs. unroll factor (1, 8, 64, 512), for 2 ports only and for partitioning over 2, 4, 8 and 16 memories (4–32 parallel dual-port streams); performance moves from I/O bound to resource bound]
• Compare lines of C vs. lines of VHDL
• Order-of-magnitude speedup
• VHDL design is correct
Programming Productivity
Code        | Lines of C | AutoESL bare | AutoESL opt | Ratio AutoESL/C
Sum Array   | 16         | 266          | 6,346       | 17–397
Erosion 3x3 | 31         | 195          | 1,067       | 6–34
Gaxpy       | 13         | 374          | 3,904       | 29–300
Performance evaluation: Roofline Performance Model
• What is it?
• Why is it required?
• How is it able to compare both architectures?
Roofline model
• Peak performance (PP) is limited by:
– compute power CP (GFlops/s)
– I/O bandwidth BW (GBytes/s)
– arithmetic intensity AI (Flops/Byte)
• Hardware-limited: PP = CP
• I/O-limited: PP = BW × AI
• PP = min(CP, BW × AI)
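The whole model fits in one expression; a minimal sketch:

```c
#include <assert.h>

/* Roofline model: attainable peak performance PP = min(CP, BW * AI),
 * with CP in Gops/s, BW in GB/s and AI in ops/byte. */
double roofline_pp(double cp, double bw, double ai) {
    double io_roof = bw * ai;           /* I/O-limited performance */
    return io_roof < cp ? io_roof : cp; /* capped by the compute roof */
}
```

With the FPGA numbers from the following slides, `roofline_pp(958.5, 386, 1)` reproduces the 386 Gops/s BRAM limit, while `roofline_pp(958.5, 4, 1)` reproduces the 4 Gops/s single-PCIe-stream limit.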
[Figure: roofline — PP (GOps/s) vs. AI (Ops/Byte); the slanted I/O roof PP = BW × AI meets the flat compute roof PP = CP; at AI = 1, PP = BW]
Roofline model
• Roofline model for FPGA. I/O limit?
BRAM: 386 GB/s → 386 Gops/s @ AI = 1 op/byte
Roofline model
• Roofline model for FPGA. I/O limit?
32 streams @ 4 GB/s → 128 Gops/s @ AI = 1 op/byte
(Pico Computing firmware allows 32 streams)
Roofline model
• Roofline model for FPGA. I/O limit?
1 stream @ 4 GB/s → 4 Gops/s @ AI = 1 op/byte
Roofline model
• Roofline model for FPGA. Computation limit?
• 32-bit addition on Virtex-6: resource consumption

        AVAILABLE | ADD_DSP | ADD_Logic
LUT     98125     | 0       | 32
FF      201715    | 0       | 32
DSP     768       | 1       | 0

Total adders: 768 (DSP) + 3066 (logic) = 3834 @ 250 MHz → 958.5 Gops/s
Roofline model
• Roofline model for FPGA.
Roofline model
• Roofline model for GPU
Roofline model
• Roofline model for GPU and FPGA combined
Experimental Results: FPN
FPN (Fixed Pattern Noise Correction) algorithm: output pixel = f(input pixel, gain, offset, origin)
Requires 4 input bytes to generate 1 output byte → computational intensity = 1/4 (output overlaps with input)
Pico stream = 16 bytes @ 250 MHz = 4 GB/s
One full-duplex stream fits 4 FPN units
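The slides only state that the corrected pixel is a function of the input pixel and three calibration bytes (gain, offset, origin), i.e. 4 input bytes per output byte. One plausible per-pixel form, assumed here purely for illustration, is out = gain·(in − offset) + origin, clamped to 8 bits:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical FPN correction kernel (the exact function f is not given
 * in the slides): out = clamp(gain * (in - offset) + origin, 0, 255).
 * Each call consumes 4 input bytes and produces 1 output byte (AI = 1/4). */
uint8_t fpn_correct(uint8_t in, uint8_t gain, uint8_t offset, uint8_t origin) {
    int v = (int)gain * ((int)in - (int)offset) + (int)origin;
    if (v < 0)   v = 0;
    if (v > 255) v = 255;
    return (uint8_t)v;
}
```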
Experimental Results: FPN
Max number of FPN units? → logic resources
FPGA logic resources allow 96 full-duplex streams
Peak performance = 96 × 4 ops / 4 ns = 96 Gops/s
Experimental Results: FPN
Max number of streams? → available bandwidth
AI = 1/4 (pipelined output overlaps with input)
I/O-limited performance = BW × AI
32 Pico streams: 32 × 4 GB/s × 1/4 = 32 Gops/s
1 PCIe stream: 4 GB/s × 1/4 = 1 Gops/s
Experimental Results: FPN
Max performance on the Pico board (32 PicoStreams)
Experimental Results: FPN
Max performance on the combined platform
Image Erosion 3x3
• Example: 3x3 erosion — pixel(i,j) = min(neighbor pixels) = 1 "operation"
• Handwritten VHDL: 9 cycles for 1 computational block (CB)
• Peak? A Virtex-6 FPGA accommodates 1536 CBs @ 250 MHz clock rate → PP = 42.6 Gops/s
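A scalar C reference for the erosion operation (a sketch: borders are skipped for brevity, whereas the FPGA versions pipeline this at one input byte per cycle):

```c
#include <assert.h>
#include <stdint.h>

/* 3x3 grayscale erosion: each output pixel is the minimum of its 3x3
 * neighborhood, counted as one "operation" in the slides. Border pixels
 * are left untouched here for simplicity. */
void erode3x3(const uint8_t *in, uint8_t *out, int w, int h) {
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            uint8_t m = 255;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    uint8_t p = in[(y + dy) * w + (x + dx)];
                    if (p < m) m = p;
                }
            out[y * w + x] = m;
        }
    }
}
```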
Erosion3x3 on FPGA
The Erosion3x3 operation requires 9 input bytes to generate 1 output byte
Computational intensity = 1/10 (9 bytes in + 1 byte out per operation)
Handwritten VHDL code:
– 1 input byte per clock cycle
– 1 output byte every 9 clock cycles
Performance = 27.77 MPixelOperations/s per block
Erosion3x3 on FPGA
Handwritten VHDL code:
One full-duplex stream fits 16 parallel erosion operations = 1 erosion block:
Experimental Results: Erosion3x3
Max number of erosion blocks? → logic resources
FPGA logic resources allow 96 full-duplex streams
Peak performance = 96 × 16 ops / 36 ns = 42.66 Gops/s
RESOURCE ESTIMATIONS

Logic utilization       | 128x[16x Erosion[128b]]  | 96x[16x Erosion[128b]]
                        | Used / Available / Util. | Used / Available / Util.
Slice Registers         | 214874 / 301440 / 71%    | 174220 / 301440 / 58%
Slice LUTs              | 109095 / 150720 / 72%    | 76423 / 150720 / 51%
Fully used LUT-FF pairs | 49994 / 213650 / 23%     | 33806 / 248902 / 14%
Bonded IOBs             | 81 / 600 / 14%           | 81 / 600 / 14%
Block RAM/FIFO          | 542 / 416 / 130%         | 414 / 416 / 100%
BUFG/BUFGCTRLs          | 7 / 32 / 22%             | 7 / 32 / 22%
DSP48E1s                | 0 / 768 / 0%             | 0 / 768 / 0%
Experimental Results: Erosion3x3
I/O-limited performance? → available bandwidth
AI = 1 result per 9 bytes = 1/9
BRAM BW = 386 GB/s → limit = 42.88 Gops/s
Pico streams BW = 32 GB/s → limit = 3.55 Gops/s
PCIe stream BW = 4 GB/s → limit = 0.44 Gops/s
Hardware peak = 42.66 Gops/s
→ the I/O streams limit performance
Experimental Results: Erosion3x3
HandWritten VHDL code: Measurements
Experimental Results: Erosion3x3
ROCCC
Smart buffers reuse data → only 1 fetch and store per pixel
Impact of the smart buffers on the computational intensity: an improvement of about a factor of (k+1) for larger images, where
H = height of the image, W = width of the image, k² = size of the kernel (mask)
Experimental Results: Erosion3x3
ROCCC
Manual partial loop unrolling increases data reuse with smart buffers:
Experimental Results: Erosion3x3
ROCCC
Loop unrolling increases computational intensity
[Figure: compute intensity (0.00–0.45) vs. image size (32x32 … 1024x1024) for 1x, 2x and 4x pixels in parallel; CI improves by ×2.25, ×2.97 and ×3.60 over the original CI = 0.11]
Experimental Results: Erosion3x3
ROCCC: Measurements
Experimental Results: Erosion3x3
AutoESL
First implementation:
Very similar to the handwritten VHDL code, with the same computational intensity
Experimental Results: Erosion3x3
AutoESL
Partial Loop Unrolling x4:
Experimental Results: Erosion3x3
AutoESL
Partial Loop Unrolling x4:
[Code listing: Erosion 1]
Experimental Results: Erosion3x3
AutoESL
Partial Loop Unrolling x4:
[Code listing: Erosion 2]
Experimental Results: Erosion3x3
AutoESL
Partial Loop Unrolling x4:
[Code listing: Erosion 3]
Experimental Results: Erosion3x3
AutoESL
Partial Loop Unrolling x4
Unrolled loops are pipelined and data is reused → CI increases (fewer bytes fetched per operation):
[Code listing: Erosion 4]
Experimental Results: Erosion3x3
AutoESL: Measurements
Experimental Results: Erosion3x3
Handwritten VHDL code vs ROCCC vs AutoESL
Experimental Results: Erosion3x3
Internal Performance (32 PicoStreams)
[Figure: performance based on the maximum number of streams — GPixelOperations/s and full-duplex streams (0–120) for handwritten VHDL (stream and BRAM versions), ROCCC 4x-parallel (stream and BRAM, default + inline module), and AutoESL stream (pipeline; pipeline + PLU x2, x4, x16); the resource limit caps performance; the handwritten design uses 96 CBs; the highest performance is marked]
[Figure: performance based on the 32 available streams — same implementations; lines mark the maximum number of streams, the resource-limited performance and the bandwidth-limited performance]
Conclusion
• ROCCC delivers the best performance per stream, but is resource hungry.
• AutoESL offers the best trade-off between performance and resource consumption.
• I/O stress: the number of I/O streams is limited, the DDR3 memory is too slow, and PCIe is limited to 8 lanes.
• FPGAs need more HPC tweaking; HLS tools (AutoESL) are productive.