EECS 570: Fall 2003 -- rev1 1
Chapter 10: Scalable Interconnection Networks
Goals
• Low latency
  – to neighbors
  – to average/furthest node
• High bandwidth
  – per-node and aggregate
  – bisection bandwidth
• Low cost
  – switch complexity, pin count
  – wiring cost (connectors!)
• Scalability
Requirements from Above

• Communication-to-computation ratio implies bandwidth needs
  – local or global?
  – regular or irregular (arbitrary)?
  – bursty or uniform?
  – broadcasts? multicasts?
• Programming Model
  – protocol structure
  – transfer size
  – importance of latency vs. bandwidth
Basic Definitions
• A switch is a device capable of transferring data from input ports to output ports in an arbitrary pattern.
• A network is a graph of switches/nodes V connected by communication channels, aka links, C ⊆ V x V.
• direct: a node at each switch
• indirect: nodes only on the edge (like the Internet)
• route: the sequence of links a message follows from node A to node B
What characterizes a network?
• Topology
  – interconnection structure of the graph
• Switching Strategy
  – circuit vs. packet switching
  – store-and-forward vs. cut-through
  – virtual cut-through vs. wormhole, etc.
• Routing Algorithm
  – how is the route determined?
• Control Strategy
  – centralized vs. distributed
• Flow Control Mechanism
  – when does a packet (or a portion of it) move along its route?
  – can't send two packets down the same link at the same time
Topologies
Topology determines many critical parameters:
– degree: number of input (output) ports per switch
– distance: number of links in the route from A to B
– average distance: average over all (A, B) pairs
– diameter: max distance between two nodes (using shortest paths)
– bisection: min number of links cut to separate the network into two halves
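These parameters can be made concrete with a small script (not from the slides): given any topology as an adjacency list, breadth-first search yields distances, and from them the diameter and average distance. The 8-node ring below is purely illustrative.

```python
from collections import deque

def distances_from(graph, src):
    """BFS shortest-path hop counts from src to every reachable node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_avg(graph):
    """Diameter (max shortest path) and average distance over all (A, B) pairs."""
    dists = [d for s in graph for t, d in distances_from(graph, s).items() if t != s]
    return max(dists), sum(dists) / len(dists)

# 8-node bidirectional ring: each node links to its two neighbors
ring = {i: [(i - 1) % 8, (i + 1) % 8] for i in range(8)}
diam, avg = diameter_and_avg(ring)   # diameter N/2 = 4, average distance ~N/4
```

The same two functions apply unchanged to any of the topologies in this chapter once the adjacency list is built.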
Bus
• Fully connected network topology
  – bus plus interface logic is a form of switch
• Parameters:
  – diameter = average distance = 1
  – degree = N
  – switch cost O(N)
  – wire cost constant
  – bisection bandwidth constant (or worse)
• Broadcast is free
Crossbar
• Another fully connected network topology
  – one big switch of degree N connects all nodes
• Parameters:
  – diameter = average distance = O(1)
  – degree = N
  – switch cost O(N^2)
  – wire cost 2N
  – bisection bandwidth O(N)
• Most switches in other topologies are crossbars inside
How to build a crossbar
Switches
[Figure: switch internals: input ports and receivers feed input buffers, a crossbar connects them to output buffers, transmitters drive the output ports, and control logic handles routing and scheduling]
Linear Arrays and Rings
• Linear Array
  – Diameter? N - 1
  – Avg. Distance? ~N/3
  – Bisection? 1
• Torus (ring): links may be unidirectional or bidirectional
• Examples: FDDI, SCI, FiberChannel, KSR1
Multidimensional Meshes and Tori
• d-dimensional k-ary mesh: N = k^d nodes, k = N^(1/d)
  – k nodes in each of d dimensions
  – a torus has wraparound links, an array doesn't; "mesh" is ambiguous
• general d-dimensional mesh
  – N = k_(d-1) x ... x k_0 nodes
Properties

(N = k^d, k = N^(1/d))
• Diameter?
  – d(k - 1) for mesh
• Average Distance?
  – ~d x k/3 for mesh
• Degree?
• Switch Cost?
• Wire Cost?
• Bisection?
  – k^(d-1)
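The answers above can be sanity-checked by brute force on small meshes. A sketch for a bidirectional k-ary d-mesh (no wraparound); the helper name is hypothetical, and d x k/3 is the slide's large-k approximation rather than the exact average:

```python
from itertools import product

def mesh_stats(k, d):
    """Exact diameter and average distance of a bidirectional k-ary d-mesh.
    The minimal distance between two nodes is the sum of the per-dimension
    coordinate offsets (dimension-order routing walks exactly that path)."""
    nodes = list(product(range(k), repeat=d))
    dists = [sum(abs(a - b) for a, b in zip(u, v))
             for u in nodes for v in nodes if u != v]
    return max(dists), sum(dists) / len(dists)

k, d = 4, 3
diam, avg = mesh_stats(k, d)
assert diam == d * (k - 1)   # diameter: d(k - 1)
approx = d * k / 3           # the slide's average-distance approximation
```

For k = 4, d = 3 the exact average (~3.81) already sits close to d x k/3 = 4, and the gap shrinks as k grows.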
Hypercubes
Trees
• Usually indirect, occasionally direct
• Diameter and avg. distance are logarithmic
  – k-ary tree, height d = log_k N
• Fixed degree
• Route up to the common ancestor and down
  – benefit for local traffic
• Bisection? 1
Fat-Trees
Fatter links (really more of them) as you go up, so bisection BW scales with N
Butterflies
• A tree with lots of roots!
• N log N switches (actually (N/2) x log N)
• Exactly one route from any source to any destination
• R = A xor B; at level i use the "straight" edge if r_i = 0, otherwise the cross edge
• Bisection: N/2, vs. N^((d-1)/d) for a d-dimensional mesh
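The XOR routing rule can be sketched in a few lines (the helper name and the bit ordering, most significant bit at the first level, are assumptions for illustration):

```python
def butterfly_route(src, dst, levels):
    """Straight/cross choice at each level of a 2-ary butterfly.
    R = src XOR dst; at level i, go straight if bit r_i of R is 0,
    otherwise take the cross edge. Bits are read MSB-first."""
    r = src ^ dst
    route = []
    for i in range(levels):
        bit = (r >> (levels - 1 - i)) & 1
        route.append("cross" if bit else "straight")
    return route

# 8-node butterfly (3 levels): route 0 -> 5, R = 101 in binary
assert butterfly_route(0, 5, 3) == ["cross", "straight", "cross"]
```

Because the route is fully determined by R, there is exactly one path per (source, dest) pair, which is what makes the butterfly fast to route but vulnerable to bad permutations.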
Benes Networks and Fat Trees
• A back-to-back butterfly can route all permutations
  – offline
• What if you just pick a random midpoint?
Relationship of Butterflies to Hypercubes
• Wiring is isomorphic
• Except that the butterfly always takes log N steps
Summary of Topologies

Topology       Degree   Diameter       Ave Dist         Bisection   Diam/Ave Dist (N=1024)
1D Array       2        N-1            N/3              1           1023/341
1D Ring        2        N/2            N/4              2           512/256
2D Array       4        2(N^½ - 1)     (2/3)N^½         N^½         63/21
3D Array       6        3(N^⅓ - 1)     N^⅓              N^⅔         ~30/~10
2D Torus       4        N^½            (1/2)N^½         2N^½        32/16
k-ary n-cube   2n       nk/2           nk/4             nk/4        15/7.5 @ n=3
Hypercube      n=logN   n              n/2              N/2         10/5
2D Tree        3        2log2(N)       ~2log2(N)        1           20/~20
4D Tree        5        2log4(N)       ~2log4(N)-2/3    1           10/~9
2D Fat Tree    4        log2(N)        ~2log2(N)        N           20/~20
2D Butterfly   4        log2(N)        log2(N)          N/2         20/20
Choosing a Topology
Cost vs. performance
For fixed cost, which topology provides the best performance?
• best performance on what workload?
  – message size
  – traffic pattern
• define cost
• target machine size
Simplify the tradeoff to dimension vs. radix:
• restrict to k-ary d-cubes
• what is the best dimension?
How Many Dimensions in a Network?

d = 2 or d = 3
• Short wires, easy to build
• Many hops, low bisection bandwidth
• Benefits from traffic locality
d >= 4
• Harder to build: more wires, longer average length
• Higher switch degree
• Fewer hops, better bisection bandwidth
• Handles non-local traffic better
Effect of # of hops on latency depends on switching strategy...
Store&Forward vs. Cut-Through Routing
• messages typ. fragmented into packets & pipelined• cut-through pipelines on flits
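A first-order unloaded-latency model makes the contrast concrete. This is a sketch, not from the slide: L (packet bytes), b (bytes/cycle of channel bandwidth), h (hops), and delta (per-hop routing delay in cycles) are illustrative parameters.

```python
def store_and_forward(L, b, h, delta):
    """Each of h hops receives the whole L-byte packet at bandwidth b
    before forwarding it, plus a routing delay delta per hop."""
    return h * (L / b + delta)

def cut_through(L, b, h, delta):
    """The header cuts through: pay the per-hop routing delay h times,
    but serialize the packet onto a channel only once."""
    return h * delta + L / b

L, b, delta = 128, 1, 2   # illustrative numbers
for h in (2, 8):
    assert cut_through(L, b, h, delta) < store_and_forward(L, b, h, delta)
```

Under this model, store-and-forward multiplies the serialization time by the hop count, which is why hop count dominates its latency, while cut-through pays it only once.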
Handling Output Contention
What if the output is blocked?
• virtual cut-through
  – a switch with a blocked output buffers the entire packet
  – degenerates to store-and-forward under contention
  – requires lots of buffering in switches
• wormhole
  – leave flits strung out over the network (in buffers)
  – minimal switch buffering
  – one blocked packet can tie up lots of channels
Traditional Scaling: Latency(P)
Assumes equal channel width
• independent of node count or dimension
• dominated by average distance
Average Distance
But equal channel width is not equal cost!
Higher dimension => more channels

average distance = d(k - 1)/2
In the 3-D world
For large n, bisection bandwidth is limited to O(n^(2/3))
• Dally, IEEE TPDS [Dal90a]
• For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
• i.e., a few short fat wires are better than many long thin wires
• What about many long fat wires?
For n nodes, bisection area is O(n^(2/3))
Equal cost in k-ary n-cubes
Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length?

What do we know?
• switch degree: d; diameter = d(k - 1)
• total links = Nd
• pins per node = 2wd
• bisection = k^(d-1) = N/k links in each direction
• 2Nw/k wires cross the middle
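These identities are easy to check mechanically. A sketch for a k-ary d-cube with w-bit channels (the function name and dictionary layout are illustrative):

```python
def cube_costs(k, d, w):
    """Cost accounting for a k-ary d-cube (torus) with w-bit channels."""
    N = k ** d
    return {
        "nodes": N,
        "total_links": N * d,             # one link per node per dimension
        "pins_per_node": 2 * w * d,       # w wires in and out, d dimensions
        "bisection_links": k ** (d - 1),  # = N/k, in each direction
        "bisection_wires": 2 * N * w // k,
    }

c = cube_costs(k=8, d=3, w=16)
assert c["bisection_links"] == c["nodes"] // 8   # k^(d-1) = N/k
```

Holding any one row of this dictionary constant while varying d is exactly the family of "equal cost" comparisons on the following slides.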
Latency(d) for P with Equal Width
• total links(N) = Nd
Latency with Equal Pin Count
Baseline: d = 2 with w = 32 (128 wires per node)
Fix 2dw pins => w(d) = 64/d
Distance goes down with increasing d, but channel time goes up
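The tradeoff can be tabulated with a toy cut-through model under a fixed pin budget. This is a sketch; the packet size, pin budget, unit routing delay, and the hops + serialization split are assumptions for illustration, not the slide's exact parameters.

```python
def latency_equal_pins(d, N=1024, L=128, pin_budget=128, delta=1):
    """Unloaded cut-through latency under a fixed pin budget.
    2*d*w pins per node => w = pin_budget / (2*d) bits per channel.
    Average distance in a k-ary d-cube: d*(k - 1)/2 hops."""
    k = round(N ** (1 / d))            # radix for this dimension
    w = pin_budget // (2 * d)          # channel width in bits
    hops = d * (k - 1) / 2
    return hops * delta + (L * 8) / w  # routing time + serialization time

# More dimensions: fewer hops, but narrower (slower) channels.
low_d, high_d = latency_equal_pins(2), latency_equal_pins(10)
```

With these numbers the d = 10 cube loses: its 6-bit channels make the serialization term dominate, which is the slide's point that distance falls with d while channel time rises.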
Latency with Equal Bisection Width
• N-node hypercube has N bisection links
• 2D torus has 2N^½
• Fixed bisection => w(d) = N^(1/d)/2 = k/2
• 1M nodes, d = 2 has w = 512!
Larger Routing Delay (w/equal pin)
• ratio of routing to channel time is key
Topology Summary
Rich set of topological alternatives with deep relationships
Design point depends heavily on cost model
• nodes, pins, area, ...
• Need for downward scalability tends to fix the dimension
  – high-degree switches are wasted in small configurations
  – grow the machine by increasing nodes per dimension
Need a consistent framework and analysis to separate opinion from design
Optimal point changes with technology
• store-and-forward vs. cut-through
• non-pipelined vs. pipelined signaling
Real Machines
• Wide links, smaller routing delay• Tremendous variation
Typical Packet Format
Two basic mechanisms for abstraction• encapsulation• fragmentation
Routing

Recall: the routing algorithm determines
• which of the possible paths are used as routes
• how the route is determined
• R: N x N → C, which at each switch maps the destination node n_d to the next channel on the route
Issues:
• Routing mechanism
  – arithmetic
  – source-based port select
  – table driven
  – general computation
• Properties of the routes
• Deadlock freedom
Routing Mechanism
Need to select an output port for each input packet
• in a few cycles
Simple arithmetic in regular topologies
• ex: ∆x, ∆y routing in a grid
  – west (-x): ∆x < 0
  – east (+x): ∆x > 0
  – south (-y): ∆x = 0, ∆y < 0
  – north (+y): ∆x = 0, ∆y > 0
  – processor: ∆x = 0, ∆y = 0
Reduce the relative address of each dimension in order
• Dimension-order routing in k-ary d-cubes
• e-cube routing in n-cubes
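The ∆x, ∆y rule is a short cascade of comparisons; a sketch (coordinate convention and port names are illustrative):

```python
def route_port(cur, dst):
    """Dimension-order (x then y) port select in a 2-D grid, per the
    rules above: correct x fully before touching y."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0: return "west"
    if dx > 0: return "east"
    if dy < 0: return "south"   # only reached once dx == 0
    if dy > 0: return "north"
    return "processor"          # arrived: deliver to the local node

# Walk a packet from (0, 0) to (2, 1), one hop per port decision.
STEP = {"west": (-1, 0), "east": (1, 0), "south": (0, -1), "north": (0, 1)}
cur, path = (0, 0), []
port = route_port(cur, (2, 1))
while port != "processor":
    path.append(port)
    step = STEP[port]
    cur = (cur[0] + step[0], cur[1] + step[1])
    port = route_port(cur, (2, 1))
assert path == ["east", "east", "north"]
```

Each switch needs only the destination address and its own coordinates, which is why this fits in a few cycles of simple arithmetic.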
Routing Mechanism (cont)
Source-based
• message header carries a series of port selects
• used and stripped en route
• CRC? header length?
• CS-2, Myrinet, MIT Arctic
Table-driven
• message header carries an index for the next port at the next switch
  – o = R[i]
• table also gives the index for the following hop
  – o, i' = R[i]
• ATM, HIPPI
Properties of Routing Algorithms
Deterministic
• route determined by (source, dest), not intermediate state (i.e., traffic)
Adaptive
• route influenced by traffic along the way
Minimal
• only selects shortest paths
Deadlock free
• no traffic pattern can lead to a situation where no packets move forward
Deadlock Freedom
How can it arise?
• necessary conditions:
  – shared resource
  – incrementally allocated
  – non-preemptible
• think of a channel as a shared resource that is acquired incrementally
  – source buffer, then destination buffer
  – channels along a route
How do you avoid it?
• constrain how channel resources are allocated
• ex: dimension order
How do you prove that a routing algorithm is deadlock free?
Proof Technique
Resources are logically associated with channels
Messages introduce dependences between resources as they move forward
Need to articulate the possible dependences between channels
Show that there are no cycles in the Channel Dependence Graph
• find a numbering of channel resources such that every legal route follows a monotonic sequence
=> no traffic pattern can lead to deadlock
The network need not be acyclic, only the channel dependence graph
Example: k-ary 2D array
Theorem: dimension-order x,y routing is deadlock free
• Numbering:
  – +x channel (i,y) → (i+1,y) gets i
  – similarly for -x, with 0 as the most positive edge
  – +y channel (x,j) → (x,j+1) gets N+j
  – similarly for -y channels
• Any routing sequence (x direction, turn, y direction) is increasing
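The numbering argument can be checked exhaustively for a small array. A sketch for a 4 x 4 array; my reading of "0 as the most positive edge" (number the -x channels so the edge into column 0 gets the largest value) is an assumption:

```python
def channel_num(frm, to, k):
    """Channel numbering for a k x k array: x channels get small numbers,
    y channels are offset by N = k*k, so x-then-y routes only increase."""
    (x0, y0), (x1, y1) = frm, to
    N = k * k
    if x1 == x0 + 1: return x0                 # +x channel (i,y)->(i+1,y) gets i
    if x1 == x0 - 1: return (k - 1) - x0       # -x: edge into column 0 is largest
    if y1 == y0 + 1: return N + y0             # +y channel (x,j)->(x,j+1) gets N+j
    if y1 == y0 - 1: return N + (k - 1) - y0   # -y, mirrored like -x

def dimension_order_route(src, dst):
    """Hop list under dimension-order routing: correct x first, then y."""
    hops, (x, y) = [], src
    while x != dst[0]:
        nxt = (x + (1 if dst[0] > x else -1), y)
        hops.append(((x, y), nxt))
        x, y = nxt
    while y != dst[1]:
        nxt = (x, y + (1 if dst[1] > y else -1))
        hops.append(((x, y), nxt))
        x, y = nxt
    return hops

k = 4
for src in [(0, 0), (3, 2)]:
    for dst in [(2, 3), (1, 0)]:
        nums = [channel_num(f, t, k) for f, t in dimension_order_route(src, dst)]
        assert nums == sorted(nums)   # every legal route is monotonic
```

Extending the two loops to all (src, dst) pairs checks every legal route, which is the whole proof obligation for this numbering.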
Channel Dependence Graph
More examples

Why is the obvious routing on X deadlock free?
• butterfly?
• tree?
• fat tree?
Any assumptions about the routing mechanism? amount of buffering?
What about wormhole routing on a ring?
Deadlock free wormhole networks?
Basic dimension-order routing doesn't work for k-ary d-cubes
• only for k-ary d-arrays (bidirectional, no wrap-around)
Idea: add channels!
• provide multiple "virtual channels" to break the dependence cycle
• good for BW too!
• don't need to add links or crossbars, only buffer resources
This adds nodes to the CDG; does it remove edges?
Breaking deadlock with virtual channels
Turn Restrictions in X, Y
• XY routing forbids 4 of the 8 turns and leaves no room for adaptive routing
• Can you allow more turns and still be deadlock free?
Minimal turn restrictions in 2D
[Figure: minimal turn restrictions in 2D, over the directions -x, +x, -y, +y: the west-first, north-last, and negative-first turn models]
Example legal west-first routes
• Can route around failures or congestion• Can combine turn restrictions with virtual channels
Adaptive Routing

R: C x N x ∑ → C
Essential for fault tolerance
• at least multipath
Can improve utilization of the network
Simple deterministic algorithms easily run into bad permutations
• fully/partially adaptive, minimal/non-minimal
• can introduce complexity or anomalies
• a little adaptation goes a long way!
Contention
Two packets trying to use the same link at the same time
• limited buffering
• drop?
Most parallel machine networks block in place
• link-level flow control
• tree saturation
Closed system: offered load depends on delivered load
Flow Control
What do you do when push comes to shove?
• Ethernet: collision detection and retry after delay
• TCP/IP WAN: buffer, drop, adjust rate
• any solution must adjust to the output rate
Link-level flow control
Example: T3D
• 3D bidirectional torus, dimension-order routing (NIC selected), virtual cut-through, packet switched
• 16 bits x 150 MHz; short, wide, synchronous links
• rotating priority per output
• logically separate request/response
• 3 independent, stacked switches
• 8 16-bit flits on each of 4 VCs in each direction
Routing and Switch Design Summary
Routing algorithms restrict the set of routes within the topology
• simple mechanism selects the turn at each hop
• arithmetic, selection, lookup
Deadlock-free if the channel dependence graph is acyclic
• limit turns to eliminate dependences
• add separate channel resources to break dependences
• combination of topology, algorithm, and switch design
Deterministic vs. adaptive routing
Switch design issues
• input/output/pooled buffering, routing logic, selection logic
Flow control
Real networks are a "package" of design choices
Example: SP
• 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
• packet switched, cut-through, no virtual channels, source-based routing
• variable packets <= 255 bytes, 31-byte FIFO per input, 7 bytes per output, 16-phit links
• 128 8-byte "chunks" in the central queue, LRU per output