Upload
others
View
2
Download
0
Embed Size (px)
Citation preview
Lecture 10:Router MicroarchitectureTushar Krishna
Assistant ProfessorSchool of Electrical and Computer EngineeringGeorgia Institute of Technology
ECE 8823 A / CS 8803 - ICNInterconnection NetworksSpring 2017http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/
Acknowledgment: Arbiter slides adapted from Univ of Toronto ECE 1749 H (N Jerger)
Network Architecture¡ Topology¡ How to connect the nodes¡ ~Road Network
¡Routing¡ Which path should a message take¡ ~Series of road segments from source to destination
¡ Flow Control¡ When does the message have to stop/proceed¡ ~Traffic signals at end of each road segment
¡Router Microarchitecture¡ How to build the routers¡ ~Design of traffic intersection (number of lanes, algorithm for
turning red/green)
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
2
Router Microarchitecture¡Implementation of routing, flow control, and
switching¡ Impacts per-hop delay and energy
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
3
Virtual Channel Router
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
4
Crossbar SwitchInput Buffers
VC 1VC 2
VC n
VA: VC AllocationInput VCs arbitrate for “output” VCs (Input VCs at next router)
SA: Switch AllocationInput ports arbitrate for output ports
Input Buffers
VC 0VC 1
VC n
BW VA ST LT
VC Allocator
SW Allocator
BW: Buffer Write
ST: Switch Traversal
LT: Link Traversal
RC: Route Compute
RC
Route Compute
SA BR
BR: Buffer Read
VN0
VN 1
FLIT
Router Microarchitecture¡Components¡Virtual Channel Buffers¡Routing Logic¡Allocation¡ Switch Allocation¡ VC Allocation
¡Crossbar Switch¡Link
¡Pipeline¡5-cycle router à 1-cycle router
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
5
Crossbar SwitchInput Buffers
VC 1VC 2
VC n
Input Buffers
VC 0VC 1
VC n
VC Allocator
SW Allocator
Route Compute
VN0
VN 1
Router Microarchitecture¡Components¡Virtual Channel Buffers¡Routing Logic¡Allocation¡ Switch Allocation¡ VC Allocation
¡Crossbar Switch¡Link
¡Pipeline¡5-cycle router à 1-cycle router
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
6
Buffer Organization¡ Why does the router have buffers?¡ To manage contention between shared links
¡ Minimum number of buffers needed?¡ Functionality/Correctness¡ One per Virtual Channel to avoid deadlocks
¡ Messages in two different VCs will never indefinitely block one another¡ If one of the VCs is blocked, the second one can go ahead
¡ How many VCs required to avoid deadlocks?¡ Two kinds of deadlocks: Protocol and Routing (next slide)
¡ Performance (Flow Control)¡ Message going out of congested output port should not block a
message behind it going out of different output port¡ i.e., avoid “Head-of-Line Blocking”
¡ Cover buffer turnaround time to sustain full throughput
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
7
Minimum number of buffers
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
8
Routing Deadlock Avoidance
Protocol Deadlock Avoidance
For Performance- not required
“Virtual Networks”e.g., req and resp
HoL blockingBuffer Turnaround
Multi-flit packets
Physical Channels Virtual Channels For Correctness- required- cannot be shared
VC0
VC1
VC2
VC4
.
.
.- can be shared
Link Router
Virtual Channel Implementation¡State Information¡G (Global): Idle, Routing, waiting for output VC,
waiting for credits in output VC, active¡R (Route): output port for the packet¡O (Output VC): output VC (VC at next router) for
this packet¡C (Credit Count): number of credits (i.e.,
downstream flit buffers) in output VC O at output port R
¡P (pointers): pointers to head and tail flits in buffer pool VCs implemented as shared pool (next slide)
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
9
Virtual Channel Implementation¡Storage¡Private Buffers Per VC¡ n-flit deep FIFO per VC¡ n >= 1, but can be smaller than the size of the packet
Or¡Shared Buffers¡ All VCs share a pool of buffers¡ One reserved buffer per VC
+ Allows variable sized VCs- More complex circuitry§ Pointers for every flit§ Linked list of free buffer slots
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
10
VC 0 tail head
VC 2 tail head
body
VC 1 head_tail
Router Microarchitecture¡Components¡Virtual Channel Buffers¡Routing Logic¡Allocation¡ Switch Allocation¡ VC Allocation
¡Crossbar Switch¡Link
¡Pipeline¡5-cycle router à 1-cycle router
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
11
Routing Logic¡ Source Routing – each packet comes with a fixed
output port¡ Example: (E, E, N, N, N, N, Eject)¡ Each router reads left most entry, and then strips it away for
next hop¡ Pros+Save latency at each hop+Save routing-hardware at each hop+Can reconfigure routes based on faults+Supports irregular topologies
¡ Cons- Overhead to store all routes at NIC- Overhead to carry routing bits in every
packet (3-bits port x max hops)- Cannot adapt based on congestion
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
12
Routing Logic¡Routing Table at every router¡ Packet can index into it via destination bits, or some
static VCid¡ Pros
+ Any routing algorithm can be implemented by reconfiguring the tables
¡ Cons- Latency, Energy, and Area Overhead – not recommended on-chip
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
13
Routing Table for West-first routing in a 3x3 Mesh
Routing Logic¡Combinational Logic - Compute output port at each
router¡ packet carries only destination coordinates, and each router
computes output port based on packet state and router state¡ e.g., deterministic: use remaining hops and direction¡ e.g., oblivious: use remaining hops and direction and some
randomness factor¡ e.g., adaptive: use congestion metrics (such as buffer
occupancy), history, etc.¡ Pros
+ Simple to implement – most common approach in NoCs¡ Cons
- Routing delay is in critical path- Routing algorithm has to be fixed at design time
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
14
Router Microarchitecture¡Components¡Virtual Channel Buffers¡Routing Logic¡Allocation¡ Switch Allocation¡ VC Allocation
¡Crossbar Switch¡Link
¡Pipeline¡5-cycle router à 1-cycle router
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
15
Switch and VC Allocation¡ “Allocator” matches N requests to M resources¡ “Arbiter” matches N requests to 1 resource¡ VC Allocation¡ Requests: Input VCs¡ Resources: Output VCs (i.e., VCs at next router)
¡ Switch Allocation¡ Requests: Input VCs/Input ports¡ Resources: Output ports of the router
¡ Allocator that delivers the highest matching translates to higher network throughput
¡ In most NoCs, the allocation logic determines cycle time¡ Allocators must be fast and/or able to be pipelined
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
16
Fairness¡Intuitively, a fair arbiter is one that provides equal
service to different requesters
¡Weak fairness: Every request is eventuallyserved
¡Strong fairness: Requesters will be served equally often
¡FIFO Fairness: Requesters are served in the order they make their requests
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
17
Locally Fair Example
R3 receives 4 times the bandwidth as r0, even though individual arbiters provide strong fairness
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
18
A0 A1 A2
r3
Destination1.0
0.5
0.5
r2
0.25
0.25
0.125
0.125
r1
r0
Round Robin Arbiter¡Last request serviced given lowest priority
¡Generate next priority vector from current grant vector
¡If no requests, priority is unchanged
¡Exhibits fairness
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
19
Round Robin Arbiter Implementation
Gi granted à next cycle Pi+1 highFebruary 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
20
Grant 0
Grant 1
Grant 2
Next priority 0
Next priority 1
Next priority 2
Priority 0
Priority 1
Priority 2
Matrix Arbiter¡Least recently served priority scheme
¡Triangular array of state bits wij for j < i¡Bit wij indicates request i takes priority over j¡ Each time request k granted, clears all bits in row
k and sets all bits in column k
¡Good for small number of inputs
¡Fast, inexpensive and provides strong fairnessFebruary 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
21
Matrix Arbiter Implementation
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
22
Request 0
01 02
10
Request 1
20 21
12
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
wij: Priority of req i wrt req j
Matrix Arbiter Operation¡Grant Policy¡When a request is asserted, it is AND-ed with the
state bits in its row to disable any lower priority requests
¡Request with highest priority is granted
¡Update Policy¡Each time a request k is granted, the state of the
matrix is updated by clearing all bits in row k and setting all bits in column k.
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
23
Matrix Arbiter Implementation
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
24
Request 0
01 02
10
Request 1
20 21
12
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
1
0
0
Req 2 has higher priority
than Req 0
Req 0 has lower priority than Req 2
Req 1 has lower priority than Req 2
Matrix Arbiter Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
25
Request 0
01 02
10
Request 1
20 01
01
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
0 0
01
1 1
A1A2
B1
C1C2
1 and 2 have priority over 0
2 has priority over 1
Request = 111Grant = 001 (C2)Update = w2* à 0
w*2 à 1
Matrix Arbiter Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
26
Request 0
01 02
10
Request 1
20 01
01
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
0 1
11
0 0
A1A2
B1
C2
1 has priority over 0
0 and 1 have priority over 2
Request = 111Grant = 010 (B1)Update = w1* à 0
w*1 à 1
Matrix Arbiter Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
27
Request 0
01 02
10
Request 1
20 01
01
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
1 1
00
0 1
A1A2
C2
0 and 2 have priority over 1
0 has priority over 2
Request = 101Grant = 100 (A1)Update = w0* à 0
w*0 à 1
Matrix Arbiter Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
28
Request 0
01 02
10
Request 1
20 01
01
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
0 0
01
1 1
A2
C2
1 and 2 have priority over 0
2 has priority over 1
Request = 101Grant = 001 (C2)Update = w2* à 0
w*2 à 1
Matrix Arbiter Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
29
Request 0
01 02
10
Request 1
20 01
01
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
0 1
11
0 0
A2
1 and 2 have priority over 0
2 has priority over 1
Request = 100Grant = 100 (A2)Update = w0* à 0
w*0 à 1
Matrix Arbiter Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
30
Request 0
01 02
10
Request 1
20 01
01
Grant 0
Disable 2Disable 1Disable 0
Request 2
Grant 1
Grant 2
0 0
11
1 0
1 and 2 have priority over 0
1 has priority over 2
Allocators¡Arbiter assigns a single resource to one of a group of
requesters (i.e., N : 1)¡Allocator performs a matching between a group of
resources and a group of requestors (i.e., M : N)¡ Each of which may request one or more resources
¡3 rules¡ A grant can be asserted only if the corresponding
request is asserted¡ At most one grant for each input (requester) may be
asserted¡ At most one grant for each output (resource) can be
asserted
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
31
Allocation Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
32
1 1 11 1 01 0 00 1 0
1 0 00 1 00 0 00 0 0
G1 =
0 0 10 1 01 0 00 0 0
G2 =
Request Matrix, R =
¡ Both G1 an G2 satisfy rules¡ Which is more desirable?¡ G2¡ All three resources assigned to inputs¡ Maximum matching: solution containing maximum possible
number of assignments
Separable Allocator¡Hard to design an allocator that is both fast and
gives high-matching
¡Need pipeline-able allocators for NoCs
¡Separable allocator composed of arbiters¡First stage: select single request at each input port¡Second stage: selects single request for each output
port
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
33
Separable Allocator Example
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
34
12:3 Allocator
Suppose 4 requestors (A, B, C, D) and 3 resources (X, Y, Z)
3:1 arbiter
3:1 arbiter
3:1 arbiter
3:1 arbiter
4:1 arbiter
4:1 arbiter
4:1 arbiter
111
101
001
101
100000000001
Request A:X, Y, Z
Request B:X, Z
Request C:Z
Request D:X, Z
Grant X:A
Grant Y:0
Grant Z:D
Easier to implement in HW but not very efficient
Separable Switch Allocation¡N input ports, v VCs per input port
¡N x v : N Allocator¡Stage 1: At every input port, choose one VC¡N v:1 arbiters
¡Stage 2: At every output port, choose one input port¡N N:1 arbiters
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
35
VC Allocation¡N input ports, v VCs per input port¡Separable VC Allocator¡ Stage 1: At every input port, choose one VC¡ N v:1 arbiters
¡ Stage 2: At every output VC, choose an input VC¡ Nxv N:1 arbiters
¡VC Selection (Kumar et al, ICCD 2007)¡No point in allocating VC before flit wins SA¡Maintain a pool of free VCs at every output port¡ Perform SA only if output port has at least one free VC¡ Winner of SA is granted this VC
February 15, 2017ICN | Spring 2017 | L10: Router uArch © Tushar Krishna, School of ECE, Georgia Tech
36