CS 258 Parallel Computer Architecture
Lecture 4
Network Topologyand
Routing
February 4, 2008Prof John D. Kubiatowicz
http://www.cs.berkeley.edu/~kubitron/cs258
Lec 4.22/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Review: Links and Channels
• transmitter converts stream of digital symbols into signal that is driven down the link
• receiver converts it back– tran/rcv share physical protocol
• trans + link + rcv form Channel for digital info flow between switches
• link-level protocol segments stream of symbols into larger units: packets or messages (framing)
• node-level protocol embeds commands for dest communication assist within packet
• Clock synchronization: Synchronous or Asynchronous
Transmitter
...ABC123 =>
Receiver
...QR67 =>
Lec 4.32/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Review: Store&Forward vs Wormhole
Time: h(n/b + ) vs n/b + h OR(cycles): h(n/w + ) vs n/w + h
• Wormhole vs virtual cut-through.
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1 0
23 1
023
3 1 0
2 1 0
23 1 0
0
1
2
3
23 1 0Time
Store & Forward Routing Cut-Through Routing
Source Dest Dest
Lec 4.42/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Review: Direct vs Indirect
• Direct: Every network node associated with processor
– Examples: Meshes
• Indirect: More Network nodes than processors– Examples: Trees, Butterflies
Lec 4.52/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Review: Linear Arrays and Rings
• Linear Array– Diameter?– Average Distance?– Bisection bandwidth?– Route A -> B given by relative address R = B-A
• Torus?• Examples: FDDI, SCI, FiberChannel Arbitrated
Loop, KSR1
Linear Array
Torus
Torus arranged to use short wires
Lec 4.62/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Review: Multidimensional Meshes and Tori
• d-dimensional array– n = kd-1 X ...X kO nodes
– described by d-vector of coordinates (id-1, ..., iO)
• d-dimensional k-ary mesh: N = kd
– k = dN– described by d-vector of radix k coordinate
• d-dimensional k-ary torus (or k-ary d-cube)?
2D Grid 3D Cube
Lec 4.72/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Embeddings in two dimensions
• Embed multiple logical dimension in one physical dimension using long wires
• When embedding higher-dimension in lower one, either some wires longer than others, or all wires long
6 x 3 x 2
Lec 4.82/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Trees
• Diameter and ave distance logarithmic– k-ary tree, height d = logk N– address specified d-vector of radix k coordinates describing path down
from root
• Fixed degree• Route up to common ancestor and down
– R = B xor A– let i be position of most significant 1 in R, route up i+1 levels– down in direction given by low i+1 bits of B
• H-tree space is O(N) with O(N) long wires• Bisection BW?
Lec 4.92/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Fat-Trees
• Fatter links (really more of them) as you go up, so bisection BW scales with N
Fat Tree
Lec 4.102/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Butterflies
• Tree with lots of roots! • N log N (actually N/2 x logN)• Exactly one route from any source to any dest• R = A xor B, at level i use ‘straight’ edge if ri=0,
otherwise cross edge• Bisection N/2 vs N (d-1)/d
0
1
2
3
4
16 node butterfly
0 1 0 1
0 1 0 1
0 1
building block
Lec 4.112/4/08 Kubiatowicz CS258 ©UCB Spring 2008
k-ary d-cubes vs d-ary k-flies
• degree d • N switches vs N log N switches• diminishing BW per node vs constant• requires locality vs little benefit
to locality
• Can you route all permutations?
Lec 4.122/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Benes network and Fat Tree
• Back-to-back butterfly can route all permutations• What if you just pick a random mid point?
16-node Benes Network (Unidirectional)
16-node 2-ary Fat-Tree (Bidirectional)
Lec 4.132/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Hypercubes
• Also called binary n-cubes. # of nodes = N = 2n.
• O(logN) Hops• Good bisection BW• Complexity
– Out degree is n = logN
correct dimensions in order– with random comm. 2 ports per processor
0-D 1-D 2-D 3-D 4-D 5-D !
Lec 4.142/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Relationship BttrFlies to Hypercubes
• Wiring is isomorphic• Except that Butterfly always takes log n
steps
Lec 4.152/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Topology Summary
• All have some “bad permutations”– many popular permutations are very bad for meshs (transpose)– ramdomness in wiring or routing makes it hard to find a bad one!
Topology Degree Diameter Ave Dist Bisection D (D ave) @ P=1024
1D Array 2 N-1 N / 3 1 huge
1D Ring 2 N/2 N/4 2
2D Mesh 4 2 (N1/2 - 1) 2/3 N1/2 N1/2 63 (21)
2D Torus 4 N1/2 1/2 N1/2 2N1/2 32 (16)
k-ary n-cube 2n nk/2 nk/4 nk/4 15 (7.5) @n=3
Hypercube n =log N n n/2 N/2 10 (5)
Lec 4.162/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Paper Discussion: “Future of Wires”
• “Future of Wires,” Ron Ho, Kenneth Mai, Mark Horowitz
• Fanout of 4 metric (FO4)– FO4 delay metric across technologies roughly constant– Treats 8 FO4 as absolute minimum (really says 16 more
reasonable)
• Wire delay– Unbuffered delay: scales with (length)2
– Buffered delay (with repeaters) scales closer to linear with length
• Sources of wire noise– Capacitive coupling with other wires: Close wires– Inductive coupling with other wires: Can be far wires
Lec 4.172/4/08 Kubiatowicz CS258 ©UCB Spring 2008
“Future of Wires” continued
• Cannot reach across chip in one clock cycle!
– This problem increases as technology scales
– Multi-cycle long wires!
• Not really a wire problem – more of a CAD problem??
– How to manage increased complexity is the issue
• Seems to favor ManyCore chip design??
Lec 4.182/4/08 Kubiatowicz CS258 ©UCB Spring 2008
How Many Dimensions?
• n = 2 or n = 3– Short wires, easy to build– Many hops, low bisection bandwidth– Requires traffic locality
• n >= 4– Harder to build, more wires, longer average length– Fewer hops, better bisection bandwidth– Can handle non-local traffic
• k-ary d-cubes provide a consistent framework for comparison
– N = kd
– scale dimension (d) or nodes per dimension (k)– assume cut-through
Lec 4.192/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Traditional Scaling: Latency(P)
• Assumes equal channel width– independent of node count or dimension– dominated by average distance
0
20
40
60
80
100
120
140
0 5000 10000
Machine Size (N)
Ave L
ate
ncy T
(n=40)
d=2
d=3
d=4
k=2
n/w
0
50
100
150
200
250
0 2000 4000 6000 8000 10000
Machine Size (N)
Ave L
ate
ncy T
(n=140)
Lec 4.202/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Average Distance
• but, equal channel width is not equal cost!• Higher dimension => more channels
0
10
20
30
40
50
60
70
80
90
100
0 5 10 15 20 25
Dimension
Ave D
ista
nce
256
1024
16384
1048576
ave dist = d (k-1)/2
Lec 4.212/4/08 Kubiatowicz CS258 ©UCB Spring 2008
In the 3D world
• For n nodes, bisection area is O(n2/3 )
• For large n, bisection bandwidth is limited to O(n2/3 )
– Bill Dally, IEEE TPDS, [Dal90a]– For fixed bisection bandwidth, low-dimensional k-ary n-
cubes are better (otherwise higher is better)– i.e., a few short fat wires are better than many long
thin wires– What about many long fat wires?
Lec 4.222/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Equal cost in k-ary n-cubes• Equal number of nodes?• Equal number of pins/wires?• Equal bisection bandwidth?• Equal area?• Equal wire length?
What do we know?• switch degree: d diameter = d(k-1)• total links = Nd• pins per node = 2wd• bisection = kd-1 = N/k links in each directions• 2Nw/k wires cross the middle
Lec 4.232/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Latency(d) for P with Equal Width
• total links(N) = Nd
0
50
100
150
200
250
0 5 10 15 20 25
Dimension
Average L
ate
ncy (n =
40,
= 2
)256
1024
16384
1048576
Lec 4.242/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Latency with Equal Pin Count
• Baseline d=2, has w = 32 (128 wires per node)• fix 2dw pins => w(d) = 64/d• distance up with d, but channel time down
0
50
100
150
200
250
300
0 5 10 15 20 25
Dimension (d)
Ave
Lat
ency
T(n
=40B
)
256 nodes
1024 nodes
16 k nodes
1M nodes
0
50
100
150
200
250
300
0 5 10 15 20 25
Dimension (d)
Ave
Lat
ency
T(n
= 14
0 B
)
256 nodes
1024 nodes
16 k nodes
1M nodes
Lec 4.252/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Latency with Equal Bisection Width
• N-node hypercube has N bisection links
• 2d torus has 2N 1/2
• Fixed bisection => w(d) = N 1/d / 2 = k/2
• 1 M nodes, d=2 has w=512!
0
100
200
300
400
500
600
700
800
900
1000
0 5 10 15 20 25
Dimension (d)
Ave L
ate
ncy T
(n=40)
256 nodes
1024 nodes
16 k nodes
1M nodes
Lec 4.262/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Larger Routing Delay (w/ equal pin)
• Dally’s conclusions strongly influenced by assumption of small routing delay
– Here, Routing delay =20
0
100
200
300
400
500
600
700
800
900
1000
0 5 10 15 20 25
Dimension (d)
Ave L
ate
ncy T
(n= 1
40 B
)
256 nodes
1024 nodes
16 k nodes
1M nodes
Lec 4.272/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Latency under Contention
• Optimal packet size? Channel utilization?
0
50
100
150
200
250
300
0 0.2 0.4 0.6 0.8 1
Channel Utilization
Late
ncy
n40,d2,k32
n40,d3,k10
n16,d2,k32
n16,d3,k10
n8,d2,k32
n8,d3,k10
n4,d2,k32
n4,d3,k10
Lec 4.282/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Saturation
• Fatter links shorten queuing delays
0
50
100
150
200
250
0 0.2 0.4 0.6 0.8 1
Ave Channel Utilization
Late
ncy
n/w=40
n/w=16
n/w=8
n/w = 4
Lec 4.292/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Phits per cycle
• higher degree network has larger available bandwidth– cost?
0
50
100
150
200
250
300
350
0 0.05 0.1 0.15 0.2 0.25
Flits per cycle per processor
Late
ncy
n8, d3, k10
n8, d2, k32
Lec 4.302/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Discussion
• Rich set of topological alternatives with deep relationships
• Design point depends heavily on cost model
– nodes, pins, area, ...– Wire length or wire delay metrics favor small
dimension– Long (pipelined) links increase optimal dimension
• Need a consistent framework and analysis to separate opinion from design
• Optimal point changes with technology
Lec 4.312/4/08 Kubiatowicz CS258 ©UCB Spring 2008
The Routing problem: Local decisions
• Routing at each hop: Pick next output port!
Cross-bar
InputBuffer
Control
OutputPorts
Input Receiver Transmiter
Ports
Routing, Scheduling
OutputBuffer
Lec 4.322/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Routing
• Recall: routing algorithm determines – which of the possible paths are used as routes– how the route is determined– R: N x N -> C, which at each switch maps the
destination node nd to the next channel on the route
• Issues:– Routing mechanism
» arithmetic» source-based port select» table driven» general computation
– Properties of the routes– Deadlock free
Lec 4.332/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Routing Mechanism
• need to select output port for each input packet
– in a few cycles
• Simple arithmetic in regular topologies– ex: x, y routing in a grid
» west (-x) x < 0» east (+x) x > 0» south (-y) x = 0, y < 0» north (+y) x = 0, y > 0» processor x = 0, y = 0
• Reduce relative address of each dimension in order
– Dimension-order routing in k-ary d-cubes– e-cube routing in n-cube
Lec 4.342/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Routing Mechanism (cont)
• Source-based– message header carries series of port selects– used and stripped en route– CRC? Packet Format?– CS-2, Myrinet, MIT Artic
• Table-driven– message header carried index for next port at next
switch» o = R[i]
– table also gives index for following hop» o, I’ = R[i ]
– ATM, HPPI, MPLS
P0P1P2P3
Lec 4.352/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Properties of Routing Algorithms• Deterministic
– route determined by (source, dest), not intermediate state (i.e. traffic)
• Adaptive– route influenced by traffic along the way
• Minimal– only selects shortest paths
• Deadlock free– no traffic pattern can lead to a situation where no
packets move forward
Lec 4.362/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Deadlock Freedom
• How can it arise?– necessary conditions:
» shared resource» incrementally allocated» non-preemptible
– think of a channel as a shared resource that is acquired incrementally
» source buffer then dest. buffer» channels along a route
• How do you avoid it?– constrain how channel resources are allocated– ex: dimension order
• How do you prove that a routing algorithm is deadlock free?
Lec 4.372/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Proof Technique
• resources are logically associated with channels
• messages introduce dependences between resources as they move forward
• need to articulate the possible dependences that can arise between channels
• show that there are no cycles in Channel Dependence Graph
– find a numbering of channel resources such that every legal route follows a monotonic sequence
• => no traffic pattern can lead to deadlock
• network need not be acyclic on channel dependence graph
Lec 4.382/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Example: k-ary 2D array
• Thm: Dimension-ordered (x,y) routing is deadlock free
• Numbering– +x channel (i,y) -> (i+1,y) gets i– similarly for -x with 0 as most positive edge– +y channel (x,j) -> (x,j+1) gets N+j– similary for -y channels
• any routing sequence: x direction, turn, y direction is increasing 1 2 3
01200 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
17
18
1916
17
18
Lec 4.392/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Channel Dependence Graph
1 2 3
01200 01 02 03
10 11 12 13
20 21 22 23
30 31 32 33
17
18
1916
17
18
1 2 3
012
1718 1718 1718 1718
1 2 3
012
1817 1817 1817 1817
1 2 3
012
1916 1916 1916 1916
1 2 3
012
Lec 4.402/4/08 Kubiatowicz CS258 ©UCB Spring 2008
More examples:
• Why is the obvious routing on X deadlock free?– butterfly?– tree?– fat tree?
• Any assumptions about routing mechanism? amount of buffering?
• What about wormhole routing on a ring?
012
3
45
6
7
Lec 4.412/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Deadlock free wormhole networks?
• Basic dimension order routing techniques don’t work for k-ary d-cubes– only for k-ary d-arrays (bi-directional)
• Idea: add channels!– provide multiple “virtual channels” to break the
dependence cycle– good for BW too!
– Do not need to add links, or xbar, only buffer resources
• This adds nodes the the CDG, remove edges?
OutputPorts
Input Ports
Cross-Bar
Lec 4.422/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Breaking deadlock with virtual channels
Packet switchesfrom lo to hi channel
Lec 4.432/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Up*-Down* routing
• Given any bidirectional network• Construct a spanning tree• Number of the nodes increasing from
leaves to roots• UP increase node numbers• Any Source -> Dest by UP*-DOWN* route
– up edges, single turn, down edges
• Performance?– Some numberings and routes much better than others– interacts with topology in strange ways
Lec 4.442/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Turn Restrictions in X,Y
• XY routing forbids 4 of 8 turns and leaves no room for adaptive routing
• Can you allow more turns and still be deadlock free
+Y
-Y
+X-X
Lec 4.452/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Minimal turn restrictions in 2D
West-first
north-last negative first
-x +x
+y
-y
Lec 4.462/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Example legal west-first routes
• Can route around failures or congestion• Can combine turn restrictions with virtual
channels
Lec 4.472/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Adaptive Routing
• R: C x N x -> C• Essential for fault tolerance
– at least multipath
• Can improve utilization of the network• Simple deterministic algorithms easily run into
bad permutations
• fully/partially adaptive, minimal/non-minimal• can introduce complexity or anomolies• little adaptation goes a long way!
Lec 4.482/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Summary #1
• Tradeoffs in cost: – Constant N, Bisection BW, etc?– Unconstrained: higher-dimensional networks better– Constrained, somewhat lower is better
Topology Degree Diameter Ave Dist Bisection D (D ave) @ P=1024
1D Array 2 N-1 N / 3 1 huge
1D Ring 2 N/2 N/4 2
2D Mesh 4 2 (N1/2 - 1) 2/3 N1/2 N1/2 63 (21)
2D Torus 4 N1/2 1/2 N1/2 2N1/2 32 (16)
k-ary n-cube 2n nk/2 nk/4 nk/4 15 (7.5) @n=3
Hypercube n =log N n n/2 N/2 10 (5)
Lec 4.492/4/08 Kubiatowicz CS258 ©UCB Spring 2008
Summary #2
• Routing Algorithms restrict the set of routes within the topology
– simple mechanism selects turn at each hop– arithmetic, selection, lookup
• Deadlock-free if channel dependence graph is acyclic
– limit turns to eliminate dependences– add separate channel resources to break dependences– combination of topology, algorithm, and switch design
• Deterministic vs adaptive routing– Adaptive adds more freedom, but causes more
deadlock