EECS 570: Fall 2003 -- rev1

Chapter 10: Scalable Interconnection Networks




Goals

• Low latency
  – to neighbors
  – to average/furthest node
• High bandwidth
  – per-node and aggregate
  – bisection bandwidth
• Low cost
  – switch complexity, pin count
  – wiring cost (connectors!)
• Scalability


Requirements from Above

• Communication-to-computation ratio implies bandwidth needs
  – local or global?
  – regular or irregular (arbitrary)?
  – bursty or uniform?
  – broadcasts? multicasts?
• Programming model
  – protocol structure
  – transfer size
  – importance of latency vs. bandwidth


Basic Definitions

• A switch is a device capable of transferring data from input ports to output ports in an arbitrary pattern
• A network is a graph of nodes V = {switches/nodes} connected by communication channels (aka links) C ⊆ V × V
• direct: a node at each switch
• indirect: nodes only on the edge of the network (like the internet)
• route: the sequence of links a message follows from node A to node B


What characterizes a network?

• Topology
  – interconnection structure of the graph
• Switching strategy
  – circuit vs. packet switching
  – store-and-forward vs. cut-through
  – virtual cut-through vs. wormhole, etc.
• Routing algorithm
  – how is the route determined?
• Control strategy
  – centralized vs. distributed
• Flow control mechanism
  – when does a packet (or a portion of it) move along its route?
  – can't send two packets down the same link at the same time


Topologies

Topology determines many critical parameters:
– degree: number of input (output) ports per switch
– distance: number of links in the route from A to B
– average distance: average over all (A, B) pairs
– diameter: max distance between two nodes (using shortest paths)
– bisection: min number of links cut to separate the network into two halves


Bus

• Fully connected network topology
  – bus plus interface logic is a form of switch
• Parameters:
  – diameter = average distance = 1
  – degree = N
  – switch cost O(N)
  – wire cost constant
  – bisection bandwidth constant (or worse)

• Broadcast is free


Crossbar

• Another fully connected network topology
  – one big switch of degree N connects all nodes
• Parameters:
  – diameter = average distance = O(1)
  – degree = N
  – switch cost O(N^2)
  – wire cost 2N
  – bisection bandwidth O(N)
• Most switches in other topologies are crossbars inside


How to build a crossbar


Switches

[Figure: switch microarchitecture. Input ports feed receivers and input buffers; a crossbar connects them to output buffers, transmitters, and output ports, under control logic that performs routing and scheduling]


Linear Arrays and Rings

• Linear array
  – Diameter? N - 1
  – Avg. distance? ~N/3
  – Bisection? 1
• Torus (ring): links may be unidirectional or bidirectional
• Examples: FDDI, SCI, FiberChannel, KSR1
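The array/ring figures above are easy to check numerically. A minimal sketch (plain breadth-first search over an adjacency list; the helper names are mine, not from the slides) that computes diameter and average distance for both topologies:

```python
from collections import deque

def bfs_dists(adj, src):
    """Hop count from src to every reachable node (breadth-first search)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def diameter_and_avg(adj):
    """Max and mean shortest-path distance over all ordered pairs of distinct nodes."""
    d = [x for s in adj for x in bfs_dists(adj, s).values() if x > 0]
    return max(d), sum(d) / len(d)

def linear_array(n):
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

# Linear array: diameter N-1, average distance close to N/3;
# bidirectional ring: diameter N/2, average distance close to N/4.
```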


Multidimensional Meshes and Tori

• d-dimensional k-ary mesh: N = k^d, k = N^(1/d)
  – k nodes in each of d dimensions
  – a torus has wraparound, an array doesn't; "mesh" is ambiguous
• general d-dimensional mesh
  – N = k_(d-1) × ... × k_0 nodes


Properties

(N = k^d, k = N^(1/d))
• Diameter?
• Average distance?
  – d × k/3 for mesh
• Degree?
• Switch cost?
• Wire cost?
• Bisection?
  – k^(d-1)


Hypercubes


Trees

Usually indirect, occasionally direct

Diameter and avg. distance are logarithmic– k-ary tree, height d = logk N

Fixed degree

Route up to common ancestor and down– benefit for local traffic

Bisection?


Fat-Trees

Fatter links (really more of them) as you go up, so bisection BW scales with N


Butterflies

• Tree with lots of roots!
• N log N switches (actually (N/2) × log N)
• Exactly one route from any source to any destination
• R = A xor B; at level i use the "straight" edge if r_i = 0, otherwise the cross edge
• Bisection N/2 vs. N^((d-1)/d)
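The XOR rule on this slide is directly executable. A sketch (the function name and LSB-first bit order are my assumptions; the slide does not fix a bit order):

```python
def butterfly_route(src, dst, levels):
    """Butterfly routing per the slide: R = src XOR dst; at level i take the
    'straight' edge when bit r_i is 0, the 'cross' edge otherwise."""
    r = src ^ dst
    return ["straight" if (r >> i) & 1 == 0 else "cross" for i in range(levels)]

# Because the choice at every level is fully determined by src and dst,
# there is exactly one route between any source/destination pair.
```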


Benes Networks and Fat Trees

• Back-to-back butterflies can route all permutations
  – offline
• What if you just pick a random midpoint?


Relationship of Butterflies to Hypercubes

• Wiring is isomorphic
• Except that the butterfly always takes log N steps


Summary of Topologies (N = 1024)

Topology       Degree      Diameter         Avg Dist         Bisection   Diam / Avg Dist
1D array       2           N-1              ~N/3             1           1023 / 341
1D ring        2           N/2              N/4              2           512 / 256
2D array       4           2(N^1/2 - 1)     (2/3)N^1/2       N^1/2       63 / 21
3D array       6           3(N^1/3 - 1)     N^1/3            N^2/3       ~30 / ~10
2D torus       4           N^1/2            (1/2)N^1/2       2N^1/2      32 / 16
k-ary n-cube   2n          nk/2             nk/4             nk/4        15 / 7.5 (n=3)
Hypercube      n = log N   n                n/2              N/2         10 / 5
2D tree        3           2 log2 N         ~2 log2 N        1           20 / ~20
4D tree        5           2 log4 N         2 log4 N - 2/3   1           10 / ~9
2D fat tree    4           log2 N           ~2 log2 N        N           20 / ~20
2D butterfly   4           log2 N           log2 N           N/2         20 / 20


Choosing a Topology

Cost vs. performance
For fixed cost, which topology provides the best performance?
• best performance on what workload?
  – message size
  – traffic pattern
• define cost
• target machine size
Simplify the tradeoff to dimension vs. radix
• restrict to k-ary d-cubes
• what is the best dimension?


How Many Dimensions in a Network?

d = 2 or d = 3
• short wires, easy to build
• many hops, low bisection bandwidth
• benefits from traffic locality

d >= 4
• harder to build: more wires, longer average length
• higher switch degree
• fewer hops, better bisection bandwidth
• handles non-local traffic better

The effect of hop count on latency depends on the switching strategy...


Store&Forward vs. Cut-Through Routing

• messages are typically fragmented into packets and pipelined
• cut-through pipelines on flits
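The difference between the two strategies is easiest to see in the standard unloaded-latency models. A sketch (variable names are mine: L is packet length in phits at one phit per cycle, h is the hop count, delta is the per-switch routing delay):

```python
def store_and_forward_latency(L, h, delta):
    """Each switch waits for the whole L-phit packet before forwarding it."""
    return h * (L + delta)

def cut_through_latency(L, h, delta):
    """Only the header pays the per-hop delay; the body pipelines behind it."""
    return h * delta + L

# For a 16-phit packet crossing 8 hops with a 1-cycle routing decision,
# cut-through pays the packet length once rather than once per hop.
```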


Handling Output Contention

What if the output is blocked?
• virtual cut-through
  – a switch with a blocked output buffers the entire packet
  – degenerates to store-and-forward under congestion
  – requires lots of buffering in switches
• wormhole
  – leave flits strung out over the network (in buffers)
  – minimal switch buffering
  – one blocked packet can tie up lots of channels


Traditional Scaling: Latency(P)

Assumes equal channel width
• independent of node count or dimension
• dominated by average distance


Average Distance

But equal channel width is not equal cost!
Higher dimension => more channels

average distance = d(k - 1)/2


In the 3-D world

For large n, bisection bandwidth is limited to O(n^(2/3))
• Dally, IEEE TPDS [Dal90a]
• For fixed bisection bandwidth, low-dimensional k-ary n-cubes are better (otherwise higher is better)
• i.e., a few short fat wires are better than many long thin wires
• What about many long fat wires?

For n nodes, bisection area is O(n^(2/3))


Equal cost in k-ary n-cubes

Equal number of nodes? Equal number of pins/wires? Equal bisection bandwidth? Equal area? Equal wire length?

What do we know?
• switch degree: d
• diameter = d(k - 1)
• total links = Nd
• pins per node = 2wd
• bisection = k^(d-1) = N/k links in each direction
• 2Nw/k wires cross the middle
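The relations under "What do we know?" can be bundled into one helper for comparing design points. A sketch (the function and dict layout are mine; the counts follow the slide's conventions, e.g. bisection links counted per direction):

```python
def kary_dcube_costs(k, d, w):
    """Cost parameters of a k-ary d-cube with channel width w, per the slide."""
    N = k ** d
    return {
        "nodes": N,
        "diameter": d * (k - 1),
        "total_links": N * d,
        "pins_per_node": 2 * w * d,
        "bisection_links": k ** (d - 1),   # = N/k, in each direction
        "bisection_wires": 2 * N * w // k,
    }
```

Holding one of these rows constant (pins per node, bisection wires, ...) while varying d is exactly the equalization performed on the next few slides.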


Latency(d) for P with Equal Width

• total links(N)=Nd


Latency with Equal Pin Count

Baseline: d = 2 has w = 32 (128 wires per node)
Fix 2dw pins => w(d) = 64/d
Distance goes down with increasing d, but channel time goes up


Latency with Equal Bisection Width

• An N-node hypercube has N bisection links
• a 2-d torus has 2N^1/2
• fixed bisection => w(d) = N^(1/d)/2 = k/2
• 1M nodes, d = 2 has w = 512!
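The w = 512 figure follows directly from w(d) = k/2. A quick check (the helper name is mine):

```python
def width_for_fixed_bisection(N, d):
    """Channel width that keeps bisection wiring constant: w = k/2, k = N^(1/d)."""
    k = round(N ** (1 / d))
    return k // 2

# A 1M-node 2-d torus has k = 1024, so each channel must be 512 wires wide,
# while the 20-dimensional hypercube (k = 2) gets 1-wire channels.
```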


Larger Routing Delay (w/equal pin)

• ratio of routing to channel time is key


Topology Summary

Rich set of topological alternatives with deep relationships
Design point depends heavily on the cost model
• nodes, pins, area, ...
• Need for downward scalability leads to fixing the dimension
  – high-degree switches are wasted in small configurations
  – grow the machine by increasing nodes per dimension
Need a consistent framework and analysis to separate opinion from design
Optimal point changes with technology
• store-and-forward vs. cut-through
• non-pipelined vs. pipelined signaling


Real Machines

• Wide links, smaller routing delay
• Tremendous variation


What characterizes a network?

Topology
• interconnection structure of the graph
Switching strategy
• circuit vs. packet switching
• store-and-forward vs. cut-through
• virtual cut-through vs. wormhole, etc.
Routing algorithm
• how is the route determined?
Control strategy
• centralized vs. distributed
Flow control mechanism
• when does a packet (or a portion of it) move along its route?
• can't send two packets down the same link at the same time


Typical Packet Format

Two basic mechanisms for abstraction
• encapsulation
• fragmentation


Routing

Recall: the routing algorithm determines
• which of the possible paths are used as routes
• how the route is determined
• R: N × N → C, which at each switch maps the destination node n_d to the next channel on the route

Issues:
• Routing mechanism
  – arithmetic
  – source-based port select
  – table-driven
  – general computation
• Properties of the routes
• Deadlock freedom


Routing Mechanism

Need to select an output port for each input packet
• in a few cycles
Simple arithmetic in regular topologies
• ex: Δx, Δy routing in a grid
  – west (-x): Δx < 0
  – east (+x): Δx > 0
  – south (-y): Δx = 0, Δy < 0
  – north (+y): Δx = 0, Δy > 0
  – processor: Δx = 0, Δy = 0
Reduce the relative address of each dimension in order
• dimension-order routing in k-ary d-cubes
• e-cube routing in the n-cube
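The Δx, Δy rules above translate line-for-line into a port-select function. A sketch (names are mine; coordinates are (x, y) tuples):

```python
def xy_route_port(cur, dst):
    """Dimension-order port select for a 2-D grid: resolve x first, then y."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx < 0:
        return "west"      # -x
    if dx > 0:
        return "east"      # +x
    if dy < 0:
        return "south"     # dx = 0, -y
    if dy > 0:
        return "north"     # dx = 0, +y
    return "processor"     # dx = 0, dy = 0: deliver locally
```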


Routing Mechanism (cont)

Source-based
• message header carries a series of port selects
• used and stripped en route
• CRC? header length? ...
• CS-2, Myrinet, MIT Arctic

Table-driven
• message header carries an index for the next port at the next switch
  – o = R[i]
• table can also give the index for the following hop
  – o, i' = R[i]
• ATM, HIPPI

[Figure: source-routed header carrying port selects P3 P2 P1 P0]


Properties of Routing Algorithms

Deterministic
• route determined by (source, dest), not by intermediate state (i.e., traffic)
Adaptive
• route influenced by traffic along the way
Minimal
• only selects shortest paths
Deadlock-free
• no traffic pattern can lead to a situation where no packets move forward


Deadlock Freedom

How can it arise?
• necessary conditions:
  – shared resource
  – incrementally allocated
  – non-preemptible
• think of a channel as a shared resource that is acquired incrementally
  – source buffer, then destination buffer
  – channels along a route
How do you avoid it?
• constrain how channel resources are allocated
• ex: dimension order
How do you prove that a routing algorithm is deadlock free?


Proof Technique

Resources are logically associated with channels
Messages introduce dependences between resources as they move forward
Need to articulate the possible dependences between channels
Show that there are no cycles in the Channel Dependence Graph
• find a numbering of channel resources such that every legal route follows a monotonic sequence
=> no traffic pattern can lead to deadlock
The network need not be acyclic, only the channel dependence graph


Example: k-ary 2D array

Theorem: dimension-order x,y routing is deadlock free
• Numbering:
• +x channel (i,y) → (i+1,y) gets i
• similarly for -x, with 0 as the most positive edge
• +y channel (x,j) → (x,j+1) gets N + j
• similarly for -y channels
Any routing sequence (x direction, turn, y direction) is increasing
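The numbering argument can be checked mechanically. A sketch (function names are mine; for brevity it covers only routes that travel in the +x and +y directions, using the slide's numbering):

```python
def channel_number(frm, to, N):
    """Slide's numbering: +x channel (i,y)->(i+1,y) gets i;
    +y channel (x,j)->(x,j+1) gets N + j."""
    (x0, y0), (x1, y1) = frm, to
    if (x1, y1) == (x0 + 1, y0):
        return x0
    if (x1, y1) == (x0, y0 + 1):
        return N + y0
    raise ValueError("not a +x or +y channel")

def xy_route(src, dst):
    """Dimension-order route: all x hops first, then all y hops."""
    hops, (x, y) = [], src
    while x < dst[0]:
        hops.append(((x, y), (x + 1, y)))
        x += 1
    while y < dst[1]:
        hops.append(((x, y), (x, y + 1)))
        y += 1
    return hops

# Every x-then-y route visits strictly increasing channel numbers, so the
# channel dependence graph restricted to these routes has no cycles.
```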


Channel Dependence Graph


More examples

Why is the obvious routing on X deadlock free?
• butterfly?
• tree?
• fat tree?
Any assumptions about the routing mechanism? amount of buffering?
What about wormhole routing on a ring?


Deadlock free wormhole networks?

Basic dimension-order routing doesn't work for k-ary d-cubes
• only for k-ary d-arrays (bidirectional, no wraparound)
Idea: add channels!
• provide multiple "virtual channels" to break the dependence cycle
• good for BW too!
• Don't need to add links or crossbars, only buffer resources
This adds nodes to the CDG; does it remove edges?


Breaking deadlock with virtual channels


Turn Restrictions in X, Y

• XY routing forbids 4 of 8 turns and leaves no room for adaptive routing

• Can you allow more turns and still be deadlock free?


Minimal turn restrictions in 2D

[Figure: the three minimal turn-restriction schemes in 2D (west-first, north-last, negative-first), drawn on the -x/+x, -y/+y axes]


Example legal west-first routes

• Can route around failures or congestion
• Can combine turn restrictions with virtual channels
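Checking whether a route is legal under west-first reduces to one rule: every westward hop must precede hops in any other direction. A sketch (encoding hops as unit (dx, dy) steps is my choice):

```python
def is_legal_west_first(hops):
    """hops: list of unit steps (dx, dy). Legal iff no west (-1, 0) step
    appears after a step in any other direction."""
    seen_other = False
    for step in hops:
        if step == (-1, 0):
            if seen_other:
                return False   # a turn back to the west: forbidden
        else:
            seen_other = True
    return True
```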


Adaptive Routing

R: C × N × Σ → C
Essential for fault tolerance
• at least multipath
Can improve utilization of the network
Simple deterministic algorithms easily run into bad permutations
• fully/partially adaptive, minimal/non-minimal
• can introduce complexity or anomalies
• a little adaptation goes a long way!


Contention

Two packets trying to use the same link at the same time
• limited buffering
• drop?
Most parallel machine networks block in place
• link-level flow control
• tree saturation
Closed system: offered load depends on delivered load


Flow Control

What do you do when push comes to shove?
• ethernet: collision detection and retry after delay
• TCP/WAN: buffer, drop, adjust rate
• any solution must adjust to the output rate

Link-level flow control


Example: T3D

• 3D bidirectional torus; dimension-order routing (NIC selected), virtual cut-through, packet switched
• 16 bits × 150 MHz; short, wide, synchronous links
• rotating priority per output
• logically separate request/response networks
• 3 independent, stacked switches
• 8 16-bit flits on each of 4 virtual channels in each direction


Routing and Switch Design Summary

Routing algorithms restrict the set of routes within the topology
• a simple mechanism selects the turn at each hop
• arithmetic, selection, lookup
Deadlock-free if the channel dependence graph is acyclic
• limit turns to eliminate dependences
• add separate channel resources to break dependences
• combination of topology, algorithm, and switch design
Deterministic vs. adaptive routing
Switch design issues
• input/output/pooled buffering, routing logic, selection logic
Flow control
Real networks are a "package" of design choices


Example: SP

• 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
• packet switched, cut-through, no virtual channels, source-based routing
• variable packets <= 255 bytes, 31-byte FIFO per input, 7 bytes per output, 16-phit links
• 128 8-byte "chunks" in the central queue, LRU per output