Design Tradeoffs For Hard and Soft FPGA-based Networks-on-Chip
Mohamed ABDELFATTAH, Vaughn BETZ
Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
Outline
1. Why NoCs on FPGAs?
– Motivation
– Previous Work
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
1. Why NoCs on FPGAs?
Motivation
[Figure: FPGA fabric; labels: Interconnect, Logic Blocks, Switch Blocks, Wires]
Hard Blocks:
• Memory
• Multiplier
• Processor
Hard Interfaces: DDR/PCIe ..
Interconnect still the same
[Figure labels: 1600 MHz, 200 MHz, 800 MHz]
Motivation
[Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet; 1600 MHz, 200 MHz, 800 MHz]
1. Bandwidth requirements for hard logic/interfaces
2. Timing closure
3. High interconnect utilization:
– Huge CAD problem
– Slow compilation
– Power/area utilization
4. Wire speed not scaling:
– Delay is interconnect-dominated
5. Low-level interconnect hinders modularity:
– Parallel compilation
– Partial reconfiguration
– Multi-chip interconnect
Keep the “roads”, but add “freeways”.
[Figure: aerial views of Barcelona and Los Angeles; labels: Hard Blocks, Logic Cluster. Source: Google Earth]
FPGA with NoC
[Figure labels: DDR3 PHY and Controller, PCIe Controller, Gigabit Ethernet, NoC, Routers, Links]
• Router forwards data packet
• Router moves data to local interconnect
FPGA with NoC
• High bandwidth endpoints known
• Pre-design NoC to requirements
• NoC links are “re-usable”
• Latency-tolerant communication
• NoC abstraction favors modularity
Implementation options:
• Soft Logic (LUTs, .. )
• Hard Logic (unchangeable)
• Mixed Soft/Hard

Hard vs. Soft
Soft NoC (configurability)         Hard NoC (efficiency)
• Build as needed out of LUTs      • Must build the whole thing
• Tailor to application            • Must be general enough for any application
• Slower, bigger                   • Faster, smaller

Investigate the hard vs. soft tradeoff for NoCs (area/delay)
Previous Work
FPGA-tuned Soft NoCs:
– LiPar (2005), NoCeM (2008), Connect (2012)
Hard NoCs:
– Francis and Moore (2008): Exploring Hard and Soft Networks-on-Chip for FPGAs
Applications that leverage NoCs:
– Chung et al. (2011): CoRAM: An In-Fabric Memory Architecture for FPGA-based Computing
Our Contributions:
1. Quantify area/performance gap of hard and soft NoCs
2. Investigate how this impacts NoC design (hard/soft)
3. Integrate hard NoC with FPGA fabric
Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
– NoC Architecture
– Methodology
– Soft NoC design
– Results: Area/Speed Efficiency Gap
3. Integrating hard NoCs with FPGA
2. Hard/Soft Efficiency
Router Microarchitecture
NoC = Routers + Links
State-of-the-art router architecture from Stanford:
1. The NoC community has excelled at building routers: we just use one
2. To meet FPGA bandwidth requirements: high-performance router
3. A complex router includes a superset of the NoC components that may be used: more complete analysis
Split router into 5 components
Router – 5 Components
• Input Modules (ports 1 to 5)
• Virtual Channel (VC) Allocator
• Switch Allocator
• Crossbar Switch
• Output Modules (ports 1 to 5)
Router – 5 Components: Input Modules
Multi-Queue Buffer = Memory + Control Logic
Parameters:
• Port Width
• Buffer Depth
• Number of VCs
Router – 5 Components: Crossbar
Multiplexers = Logic + crowded interconnect
Parameters:
• Port Width
• Number of Ports
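One rough way to see why the crossbar depends on exactly these two parameters is to count multiplexer inputs. The sketch below assumes a full crossbar built as one P:1 multiplexer per output bit; that structure is an assumption for illustration and is not stated on the slides, and the function name `crossbar_mux_inputs` is hypothetical.

```python
def crossbar_mux_inputs(num_ports: int, port_width: int) -> int:
    """Total multiplexer data inputs in a full crossbar.

    Assumes each output port has one num_ports:1 mux per bit, so the
    count grows quadratically with ports and linearly with width.
    """
    muxes = num_ports * port_width      # one mux per output bit
    return muxes * num_ports            # each mux selects among all input ports


base = crossbar_mux_inputs(num_ports=5, port_width=32)
print(base)                                   # 5 * 5 * 32 = 800
print(crossbar_mux_inputs(10, 32) / base)     # doubling ports -> 4.0
print(crossbar_mux_inputs(5, 64) / base)      # doubling width -> 2.0
```

Under this assumption, doubling the port count quadruples the multiplexing while doubling the width only doubles it.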
Router – 5 Components: Output Modules
Retiming Register = Registers + little control logic
Parameters:
• Port Width
• Number of VCs
Router – 5 Components: Allocators
Arbiters = Logic + Registers
Parameters:
• Number of Ports
• Number of VCs
Design Space
5 Components:
• Input Module
• Crossbar
• VC Allocator
• SW Allocator
• Output Module
4 Parameters:
• Port Width
• Number of Ports
• Number of VCs
• Buffer Depth
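The design space can be written down as a small table of which parameters affect which component, taken from the per-component slides. The sketch below enumerates sweep points per component; the sweep values in `SWEEP` and the helper `design_points` are hypothetical and chosen only for illustration, not the values used in the study.

```python
from itertools import product

# Which parameters affect which component, as listed on the component slides.
COMPONENT_PARAMS = {
    "Input Module": ("port_width", "buffer_depth", "num_vcs"),
    "Crossbar": ("port_width", "num_ports"),
    "VC Allocator": ("num_ports", "num_vcs"),
    "SW Allocator": ("num_ports", "num_vcs"),
    "Output Module": ("port_width", "num_vcs"),
}

# Hypothetical sweep values (assumption, not from the slides).
SWEEP = {
    "port_width": (16, 32, 64),
    "num_ports": (3, 5, 7),
    "num_vcs": (1, 2, 4),
    "buffer_depth": (5, 10, 20),
}


def design_points(component):
    """Yield one dict per parameter combination relevant to `component`."""
    names = COMPONENT_PARAMS[component]
    for values in product(*(SWEEP[name] for name in names)):
        yield dict(zip(names, values))


for component in COMPONENT_PARAMS:
    # Count how many configurations each component needs to be measured at.
    print(component, sum(1 for _ in design_points(component)))
```

Sweeping each component only over the parameters that affect it keeps the number of measured configurations far below a full four-parameter sweep of the whole router.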
Methodology
• Post-routing FPGA (soft) area and delay
• Post-synthesis ASIC (hard) area and delay
• Both in TSMC 65 nm technology (Stratix III)
• Verify results against the previous FPGA:ASIC comparison by Kuon and Rose
• Measured per router component
3 Options for Buffer on FPGA
• Relatively small memories
• Critical component in router design
• 3 options for FPGA:
– Registers: one per LUT
– LUTRAM: 640 bits
– Block RAM: 9 Kbits
[Figure: area of each buffer implementation option, width = 32 bits; annotation: another logic cluster used]
3 Options for Buffer on FPGA
• Relatively small memories
• 3 options for implementation on FPGA:

Option       Size           Density
Registers    One per LUT    0.77 Kbit/mm²
LUTRAM       640 bits       23 Kbit/mm²
Block RAM    9 Kbits        142 Kbit/mm²

• A 16%-utilized BRAM is more area-efficient than a fully used LUTRAM (valid for Stratix III)
• LUTRAM could win for some points in other FPGAs
Soft: Use BRAM for FPGA (soft) implementation
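The 16% figure can be checked from the quoted densities. This is a minimal sketch, assuming the three densities map to Registers, LUTRAM and Block RAM in the order listed on the slide; the names `DENSITY_KBIT_PER_MM2` and `effective_density` are hypothetical.

```python
# Storage densities quoted on the slide (Kbit per mm^2, Stratix III).
# Assumption: they correspond to the three options in listed order.
DENSITY_KBIT_PER_MM2 = {
    "registers": 0.77,
    "lutram": 23.0,
    "bram": 142.0,
}


def effective_density(option: str, utilization: float = 1.0) -> float:
    """Useful bits per mm^2 when only `utilization` of the memory is filled."""
    return DENSITY_KBIT_PER_MM2[option] * utilization


# BRAM utilization at which it matches a fully used LUTRAM.
break_even = DENSITY_KBIT_PER_MM2["lutram"] / DENSITY_KBIT_PER_MM2["bram"]
print(f"{break_even:.1%}")   # 16.2%
```

Any buffer that fills more than about 16% of a BRAM therefore costs less silicon area in BRAM than in LUTRAM, under these density figures.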
Results – High Port Count
Soft: High port count inefficient in soft
[Figure annotations: 24X – 94X, 60X – 170X]
Results – Width
Soft: High port count inefficient in soft; width scales better
[Figure annotations: 26X – 17X, 72X]
Results – Deep Buffers
Soft: Buffer depth is free on FPGAs when using BRAM
[Figure annotation: filling up the BRAM]
Soft Router Design
• Design recommendations based on FPGA silicon area
• Supported by delay measurements
Soft: Use BRAM for FPGA (soft) implementation
Soft: High port count inefficient in soft; width scales better
Soft: Buffer depth is free on FPGAs when using BRAM
Results – Area

Router Component    Mean Area Ratio    LUT:REG
Input Module        17                 --
Crossbar            85                 --
VC Allocator        48                 8:1
Switch Allocator    56                 20:1
Output Module       39                 0.6:1
Router              30

[Slide annotations: Memory; = Logic + Registers]
Results – Delay

Router Component    Mean Delay Ratio
Input Module        2.9
Crossbar            4.4
VC Allocator        3.9
Switch Allocator    3.3
Output Module       3.4
Router              3.6
Outline
1. Why NoCs on FPGAs?
2. Hard/soft efficiency gap
3. Integrating hard NoCs with FPGA
– Hard NoC + FPGA Wiring
– Conclusion
– Future Work
3. Hard NoC with FPGA
What to harden?

Router Component    Area Ratio    Delay Ratio
Input Module        17            2.9
Crossbar            85            4.4
VC Allocator        48            3.9
SW Allocator        56            3.3
Output Module       39            3.4
Router              30            3.6

[Slide annotations: 50% Total Area; Critical Path; 40%; 10%]
• Results suggest hardening Crossbar and Allocators
• Mixed hard/soft implementation
Mixed Implementation
For a typical router:
• 5 ports
• 32 bits wide
• 2 VCs
• 10 buffer words

         Soft             Hard              Mixed
Area     4.1 mm² (1X)     0.14 mm² (30X)    2.3 mm² (1.8X)
Speed    150 MHz (1X)     810 MHz (5X)      390 MHz (2.5X)

Mixed not worth hardening
• How to connect hard and soft?
• How efficient is mixed/hard after doing that?
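The gain factors can be recomputed from the quoted area and frequency figures. This is a sketch using only the slide's numbers; the names `ROUTERS` and `gains_over_soft` are hypothetical. The raw quotients differ slightly from the rounded factors on the slide, presumably because the quoted areas and frequencies are themselves rounded.

```python
# Figures quoted on the slide for a 5-port, 32-bit, 2-VC, 10-word router.
ROUTERS = {
    "soft":  {"area_mm2": 4.1,  "freq_mhz": 150},
    "hard":  {"area_mm2": 0.14, "freq_mhz": 810},
    "mixed": {"area_mm2": 2.3,  "freq_mhz": 390},
}


def gains_over_soft(name: str) -> tuple[float, float]:
    """Return (area reduction, speedup) of `name` relative to the soft router."""
    soft, other = ROUTERS["soft"], ROUTERS[name]
    area_gain = soft["area_mm2"] / other["area_mm2"]
    speed_gain = other["freq_mhz"] / soft["freq_mhz"]
    return area_gain, speed_gain


for name in ("hard", "mixed"):
    area_gain, speed_gain = gains_over_soft(name)
    print(f"{name}: {area_gain:.1f}X smaller, {speed_gain:.1f}X faster")
```

The mixed router recovers well under a tenth of the hard router's area reduction, which is consistent with the slide's verdict that mixed is not worth hardening.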
Integrating a Hard Router
[Figure: hard router (router logic) placed among the logic clusters and connected to the FPGA's programmable interconnect]
• Same I/O mux structure as a logic block – 9X the area
• Conventional FPGA interconnect between routers
[Figure annotations: 730 MHz; 1/9th of FPGA vertically (2.5 mm)]
• Assumed a mesh; can form any topology

         Soft             Hard              Hard (+ interconnect)
Area     4.1 mm² (1X)     0.14 mm² (30X)    0.18 mm² = 9 LABs (22X)
Speed    150 MHz (1X)     810 MHz (5X)      730 MHz (4.7X)

64-node NoC on Stratix V

         Soft            Hard (+ interconnect)
Area     ~12,500 LABs    576 LABs
%LABs    33 %            1.6 %
%FPGA    12 %            0.6 %

• Hard NoC + Soft Interconnect is very compelling
• Provides 47 GB/s peak bisection bandwidth
• Very cheap! Less than the cost of 3 soft nodes
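The 64-node figures can be cross-checked with a few lines of arithmetic. The LAB counts come from the slide. The bandwidth part is a sketch under assumptions the slide does not state: an 8x8 mesh, 32-bit links clocked at 730 MHz, and bisection bandwidth counted in both directions. All variable names are hypothetical.

```python
NODES = 64
LABS_PER_HARD_ROUTER = 9      # slide: 0.18 mm^2 = 9 LABs
SOFT_NOC_LABS = 12_500        # slide: ~12,500 LABs for the soft NoC

hard_noc_labs = NODES * LABS_PER_HARD_ROUTER
labs_per_soft_router = SOFT_NOC_LABS / NODES
print(hard_noc_labs)                          # 576
print(hard_noc_labs / labs_per_soft_router)   # whole hard NoC, in soft routers
print(SOFT_NOC_LABS / hard_noc_labs)          # area improvement over soft

# Assumptions (not on the slide): 8x8 mesh, 32-bit links at 730 MHz,
# links counted in both directions across the bisection cut.
MESH_SIDE = 8
LINK_BITS = 32
FREQ_HZ = 730e6
bisection_links = 2 * MESH_SIDE
bisection_gb_per_s = bisection_links * (LINK_BITS / 8) * FREQ_HZ / 1e9
print(round(bisection_gb_per_s, 1))
```

Under these assumptions the whole hard NoC occupies a little less than three soft routers' worth of LABs, the area ratio comes out near the quoted 22X, and the bandwidth estimate lands close to the quoted 47 GB/s.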
Conclusion
1. Why NoCs on FPGAs?
• Big city needs freeways to handle traffic
• Solve communication problems for a large/heterogeneous FPGA:
– Timing closure
– Interconnect scaling
– Modular design
2. Hard/soft efficiency gap
• A hard NoC is on average 30X smaller and 3.6X faster than soft
• Crossbars and allocators worst; input buffer best
• An efficient soft NoC:
– Uses BRAMs
– Large width, low port count
– Deep buffers
3. Integrating hard NoCs with FPGA
• Mixed implementation does not make sense
• Integrated fully hard NoC with FPGA fabric (for NoC links):
– 22X area improvement over soft
– Reaches max. FPGA frequency (4.7X faster than soft)
– 64-node NoC = 0.6% of total FPGA area (Stratix V)
Future Work
• Power analysis
• More hardening:
– Dedicated inter-router links (hard wires)
– Clock domain crossing hardware
• How do traffic hotspots (DDR/PCIe) influence NoC design?
• Latency-insensitive design methodology that uses the NoC
• CAD tool changes for a NoC-based FPGA
Thank You!