CA406 Computer Architecture Networks. Data Flow - Summary Fine-Grain Dataflow Suffered from comms network overload! Coarse-Grain Dataflow Monsoon... Overtaken

CA406

Computer Architecture

Networks

Data Flow - Summary

• Fine-Grain Dataflow
  • Suffered from comms network overload!
• Coarse-Grain Dataflow
  • Monsoon ...
  • Overtaken by commercial technology!!
• A sad “fact-of-life”
  • It’s almost impossible to generate the funds for non-“mainstream” computer architecture research
  • $n x 10^8 required
  • Non-mainstream = interesting!

Data Flow - Summary

• As a software model …
  • Functional languages
    • Dataflow in a different guise!
    • Theoretically important
    • Practically? Inefficient ( = slow!!)
    • ….. Ask your CS colleagues!
  • Cilk - based on C
    • Used on CIIPS Myrmidons
    • Uses a dataflow model
      • Threads become ready for execution when their data is generated
    • Message passing efficiency
      • Without explicit data transfer & synchronisation!
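The firing rule "threads become ready for execution when their data is generated" can be sketched in a few lines of Python. This is only an illustration of the dataflow idea, not Cilk itself: the function `run_dataflow`, the node table layout and the example names are all hypothetical.

```python
from collections import deque

def run_dataflow(nodes, inputs):
    """Fire each node once all of its operands exist (illustrative only).

    nodes:  name -> (function, [operand names])   # hypothetical layout
    inputs: name -> value for the initially available data
    Returns (all values, order in which nodes fired).
    """
    values = dict(inputs)
    consumers = {}   # value name -> nodes waiting for that value
    missing = {}     # node name  -> number of operands not yet produced
    for name, (_, operands) in nodes.items():
        missing[name] = sum(1 for op in operands if op not in values)
        for op in operands:
            consumers.setdefault(op, []).append(name)

    ready = deque(name for name, count in missing.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        func, operands = nodes[name]
        values[name] = func(*(values[op] for op in operands))
        order.append(name)
        # Producing this value may make waiting nodes ready.
        for waiting in consumers.get(name, []):
            missing[waiting] -= 1
            if missing[waiting] == 0:
                ready.append(waiting)
    return values, order

# "prod" is listed first but cannot fire until "sum" has produced its data.
example_nodes = {
    "prod": (lambda x, y: x * y, ["sum", "c"]),
    "sum": (lambda x, y: x + y, ["a", "b"]),
}
example_values, example_order = run_dataflow(example_nodes, {"a": 2, "b": 3, "c": 4})
```

Execution order here is decided by data availability alone; nothing in the node table says "run sum before prod".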

Networks

• Network Topology (or shape)
  • Vital to efficient parallel algorithms
  • Communication is the limiting factor!
• Ideal
  • Cross-bar
    • Any-to-any
    • Non-blocking
      • Except two sources to same receiver
    • Realisable
      • But only for limited order (number of ports)
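The "non-blocking, except two sources to same receiver" property can be modelled for a single transfer cycle as below. This is a minimal sketch: the function name `crossbar_cycle` and the fixed-priority rule (lowest-numbered source wins) are assumptions made for illustration, not a description of any particular router.

```python
def crossbar_cycle(requests, ports=8):
    """Resolve one cycle of requests on a ports x ports crossbar.

    requests: source port -> destination port.
    Returns (granted, blocked). A request is blocked only when an earlier
    (lower-numbered) source already claimed the same destination; that
    priority rule is an assumption of this sketch.
    """
    granted = {}
    blocked = []
    busy_outputs = set()
    for src in sorted(requests):
        dst = requests[src]
        if not (0 <= src < ports and 0 <= dst < ports):
            raise ValueError("port number out of range")
        if dst in busy_outputs:
            blocked.append(src)       # two sources, same receiver
        else:
            busy_outputs.add(dst)
            granted[src] = dst
    return granted, blocked

# Sources 0 and 2 both want output 3: only one of them gets through.
granted, blocked = crossbar_cycle({0: 3, 1: 2, 2: 3})
```

Any set of requests with distinct destinations (a permutation) is granted in full, which is what "non-blocking" means here.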

Networks

• Cross-bars
  • Achilles
    • 8 x 8
    • Full duplex
      • Simultaneous input and output at each port
    • 32 bit data-path
    • Target: 1 Gbyte / second total throughput
  but we needed the 3-D arrangement to achieve
    • bandwidth
    • high order

Networks

• Cross-bars
  • Achilles
    • Hardware almost trivial!
      • Single FPGA on each level
    • Programmable
      • VHDL Models
      • Several topologies
        • Just by changing the software!

Networks - More than 8 PEs

• Simple
  • Use 2 8x8 routers!

but …. This link gets a lot of traffic!
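To see how much traffic the single inter-router link carries, the fraction of messages that must cross it can be counted directly. The sketch assumes uniform random traffic (every PE equally likely to send to every other PE) and 7 PEs on each router, since one port per router is used for the link; both assumptions and the function name are illustrative.

```python
from fractions import Fraction

def inter_router_fraction(pes_per_router=7):
    """Fraction of (source, destination) pairs that cross the link between
    two routers, assuming uniform traffic between distinct PEs."""
    n = 2 * pes_per_router
    total_pairs = n * (n - 1)                                # ordered pairs, src != dst
    local_pairs = 2 * pes_per_router * (pes_per_router - 1)  # both PEs on one router
    return Fraction(total_pairs - local_pairs, total_pairs)

link_share = inter_router_fraction(7)   # Fraction(7, 13), a little over half
```

Under these assumptions more than half of all messages compete for one link of bandwidth b, while each router could otherwise carry several transfers at once.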

Networks - Fat tree

• Problem:
  • High-traffic links between PEs can become a bottleneck
• Solution: Fat-tree
  • Links higher up the tree are “fatter”
  • Sustainable bandwidth between all PEs is the same

Networks - Performance Metrics

• Metrics for comparing network topologies
  • Diameter
    • Maximum distance between any pair of nodes
    • Determines latency
  • Bisection Bandwidth
    • Aggregate bandwidth over any “cut” which divides the network in half
    • Determines throughput
• Crossbar
  • Diameter: 1
    • Every PE is directly connected to the router, so a single “hop” suffices
  • Bisection Bandwidth: b bytes/sec
    • b is the bandwidth of a single link


• Metrics for comparing network topologies
  • To connect n PEs with m x m crossbars
  • Single link bandwidth b bytes/s
• Simple: n = 14 (2 switches)
  • Diameter 3
  • Bisection Bandwidth b


• Fat-tree
  • Diameter: 2 log_m n
    • Height is log_m n
    • Worst case distance - up and down
  • Bisection Bandwidth: b n/2 bytes/sec
    • Links are fatter higher up the tree


• Mesh (square, √n x √n PEs)
  • Diameter: 2√n - 2
  • Bisection Bandwidth: b √n bytes/sec
  • Order: 4


• Hypercube
  • Hypercube of order m
    • Link 2 order m-1 hypercubes with 2^(m-1) links
  • Number of PEs: n = 2^m
  • Order: log_2 n = m

[Figure: two order 2 hypercubes and an order 3 hypercube]
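The two metrics can be tabulated for a given n to compare topologies side by side. The sketch below uses the slides' formulas for the crossbar and fat-tree. Two things are assumptions rather than slide content: the mesh is taken to be square (side x side with n = side^2 PEs, so the diameter is 2(side - 1) and a cut crosses `side` links), and the hypercube figures (diameter log_2 n, bisection b n/2) are the standard textbook results. Function names are illustrative.

```python
import math

def tree_height(n, m):
    """Smallest h with m**h >= n, i.e. log_m n rounded up, using integers."""
    height, reach = 0, 1
    while reach < n:
        reach *= m
        height += 1
    return height

def crossbar(n, b):
    return {"diameter": 1, "bisection": b}

def fat_tree(n, b, m):
    # Worst case: up to the root and down again.
    return {"diameter": 2 * tree_height(n, m), "bisection": b * n / 2}

def mesh(n, b):
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("this sketch assumes a square mesh")
    return {"diameter": 2 * side - 2, "bisection": b * side}

def hypercube(n, b):
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    order = n.bit_length() - 1          # log_2 n
    return {"diameter": order, "bisection": b * n / 2}

# 64 PEs, unit link bandwidth, 8 x 8 routers for the fat-tree.
comparison = {
    "fat-tree": fat_tree(64, 1.0, 8),
    "mesh": mesh(64, 1.0),
    "hypercube": hypercube(64, 1.0),
}
```

For 64 PEs this gives diameters of 4, 14 and 6 and bisection bandwidths of 32b, 8b and 32b respectively, which is the kind of trade-off the metrics are meant to expose.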

Networks - Hypercubes

• Embedding property
  • In an n PE hypercube, we have hypercubes of size n/2, n/4, …
  • Number PEs with binary numbers
    • 000, 001, 010, 011, 100, …
  • Joining two hypercubes
    • add one binary digit to the numbering
  • Each PE is connected to every PE whose index differs in only one bit
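Because neighbours differ in exactly one bit, the connection rule reduces to flipping each bit of the PE index in turn. A minimal sketch (function names are illustrative):

```python
def hypercube_neighbours(pe, order):
    """PEs directly linked to `pe` in a hypercube of the given order:
    flip each of the `order` index bits in turn."""
    if not 0 <= pe < 2 ** order:
        raise ValueError("PE index out of range for this order")
    return [pe ^ (1 << bit) for bit in range(order)]

def hop_distance(a, b):
    """Minimum number of hops between two PEs: the number of differing bits."""
    return bin(a ^ b).count("1")

neighbours_of_000 = hypercube_neighbours(0b000, 3)   # [0b001, 0b010, 0b100]
far_corner_hops = hop_distance(0b000, 0b111)         # 3 hops in an order 3 cube
```

The hop count between opposite corners equals the order of the cube, since every bit has to be flipped once.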

Networks - Hypercubes

• Embedding property
  • Partitioning tasks
    • Allocate to sub-cubes
    • Sub-tasks allocated to sub-cubes of that cube, etc
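One way to carve out sub-cubes is to fix the leading bits of the PE index and let the remaining bits vary. Fixing the leading bits, rather than some other subset, is just a convenient choice for this sketch, and the function name is hypothetical.

```python
def subcube(order, prefix, prefix_bits):
    """PE indices of the sub-cube whose top `prefix_bits` index bits
    equal `prefix`, inside a hypercube of the given order."""
    if not 0 <= prefix_bits <= order:
        raise ValueError("prefix_bits must lie between 0 and order")
    if not 0 <= prefix < 2 ** prefix_bits:
        raise ValueError("prefix does not fit in prefix_bits")
    free_bits = order - prefix_bits
    return [(prefix << free_bits) | low for low in range(2 ** free_bits)]

task_cube = subcube(3, 0b1, 1)        # half of an order 3 cube: PEs 4..7
subtask_cube = subcube(3, 0b10, 2)    # a quarter, nested inside task_cube
```

Each extra fixed bit halves the sub-cube, so a task's sub-tasks land on PEs that are already inside the task's own sub-cube.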

Futures

VLIW - Very Long Instruction Word

• Instruction word: multiple operations
  • n RISC-style instructions
• Architecture: fixed set of functional units

Each FU matched to a “slot” in the instruction

VLIW - Very Long Instruction Word

• Compiler responsible for allocating instructions to words
  • Burden squarely on compiler
    • Needs to produce near optimal schedule
  • Inevitable: large number of empty slots!
    Lower code density
• Similar to superscalar
  • but instruction issue flexibility missing
  • VLIW simpler, faster?
• Re-compilation needed
  • Each new generation will have different functional unit mix
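The compiler's packing job, and why empty slots are hard to avoid, can be illustrated with a toy greedy packer. Everything here is an assumption made for the sketch: the slot layout, single-cycle operations, in-order greedy placement, and the names `pack_vliw` and `empty_slots`. Real VLIW compilers use far more sophisticated scheduling.

```python
def pack_vliw(ops, slots):
    """Greedily pack operations into long instruction words.

    ops:   list of (name, unit, deps) in program order; deps name earlier ops
    slots: functional unit type for each slot of the word
    Each op goes into the earliest word that follows all its dependencies
    and still has a free slot of the right unit type.
    """
    words = []
    placed = {}                      # op name -> index of the word holding it
    for name, unit, deps in ops:
        if unit not in slots:
            raise ValueError(f"no slot for unit {unit!r}")
        w = max((placed[d] + 1 for d in deps), default=0)
        while True:
            while len(words) <= w:
                words.append([None] * len(slots))
            free = [i for i, u in enumerate(slots)
                    if u == unit and words[w][i] is None]
            if free:
                words[w][free[0]] = name
                placed[name] = w
                break
            w += 1                   # slot taken: try the next word
    return words

def empty_slots(words):
    return sum(slot is None for word in words for slot in word)

machine = ["alu", "alu", "mem", "branch"]      # hypothetical FU mix
program = [
    ("load_a", "mem", []),
    ("load_b", "mem", []),
    ("add", "alu", ["load_a", "load_b"]),
    ("store", "mem", ["add"]),
]
schedule = pack_vliw(program, machine)
wasted = empty_slots(schedule)
```

With one memory unit and a dependency chain, four operations occupy four words and leave 12 of the 16 slots empty. Changing `machine` to a different functional unit mix changes the schedule, which is the re-compilation problem in miniature.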

Synchronous Logic Systems

• Clock distribution
  • Major problem for chip architect
  • Clock skews < 100-200ps over whole die
    • 10% of cycle time
  • Small changes: re-engineer whole chip
    • Checking for data hazards & logic races


• Clock distribution
• Power consumption
  • Major problem @ 30W+ per chip
  • CMOS logic consumes power only on switch, but synch systems clock a lot of logic on every cycle
    • Clock is distributed to every subsystem
    • Even if the logic of the subsystem is disabled!
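The observation that CMOS consumes power only when it switches is commonly summarised by the first-order dynamic power relation below. The symbols are not taken from the slides: α is the fraction of the capacitance switched per cycle (activity factor), C the total switched capacitance, V_dd the supply voltage and f the clock frequency.

```latex
P_{\text{dynamic}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

In a synchronous design the clock network itself switches every cycle, so it contributes to this power on every cycle whether or not the logic it drives is doing useful work.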


• Clock distribution
• Power consumption
• Worst case propagation delay
  • Determines maximum clock speed
  • Clock edge must wait until all logic has settled
  • Temperature and process fabrication: even slower clocks
• Design is simpler
  • Logic designers have experience
  • Good tools

Asynchronous Logic Systems

• Clock distribution
  • No longer a problem
    • Synchronisation bundled with data
• Circuits are composable
  • No global clock … No need to re-engineer a whole chip to change one section!
  • Known correct circuits can be combined
• Power consumption
  • Circuits switch only when they’re computing
    Potentially very low power consumption
  • May be the biggest attraction of asynch systems!

Asynchronous Logic Systems

• Clock distribution problem removed
• Circuits are composable
• Power consumption
• Average case propagation delay
  • Completion signal generated when result is available
  • Independent of
    • Temperature and process fabrication
• Design is harder
  • Experience will remove this?
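The difference between worst case and average case timing can be shown with a small calculation. The delay figures are invented for illustration, and the model is deliberately crude: a synchronous system charges every operation a full clock period set by the slowest one, while an asynchronous system with completion signals pays each operation's actual delay.

```python
def synchronous_time(delays):
    """Total time when every operation takes one clock period and the
    period must cover the worst case delay."""
    if not delays:
        return 0
    return len(delays) * max(delays)

def asynchronous_time(delays):
    """Total time when each operation signals completion as soon as
    its result is available."""
    return sum(delays)

operation_delays_ns = [2, 3, 10, 4]   # hypothetical per-operation delays
sync_ns = synchronous_time(operation_delays_ns)     # 4 operations x 10 ns
async_ns = asynchronous_time(operation_delays_ns)   # 2 + 3 + 10 + 4 ns
```

The gap grows with the spread between typical and worst case delays; when all operations take the same time the two models agree.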

Laboratory 1.51

Practical Examinations will be held in this laboratory every afternoon from 1:50pm to 5:30pm next week, June 1 to June 5

The laboratory will be closed to everyone except those in CT105/CLP110 actually taking the exams during these times.

Please consider the students taking the exam by not disturbing them in any way.
