CA406
Computer Architecture
Networks
Data Flow - Summary
• Fine-Grain Dataflow
  • Suffered from comms network overload!
• Coarse-Grain Dataflow
  • Monsoon ...
  • Overtaken by commercial technology!!
• A sad “fact-of-life”
  • It’s almost impossible to generate the funds for non-“mainstream” computer architecture research
  • $n x 10^8 required
  • Non-mainstream = interesting!
Data Flow - Summary
• As a software model …
• Functional languages
  • Dataflow in a different guise!
  • Theoretically important
  • Practically?
    • Inefficient ( = slow!!)
    • ….. Ask your CS colleagues!
• Cilk - based on C
  • Used on CIIPS Myrmidons
  • Uses a dataflow model
    • Threads become ready for execution when their data is generated
  • Message passing efficiency
    • Without explicit data transfer & synchronisation!
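The dataflow rule above (work becomes ready when its data has been generated) can be sketched in a few lines of Python. This is only an illustration of the firing rule, not Cilk itself; the function `run_dataflow` and the node names are hypothetical.

```python
import operator
from collections import deque

def run_dataflow(nodes, initial_tokens):
    """nodes: {name: (function, [input names])}; initial_tokens: {name: value}.

    A node fires as soon as every one of its inputs has a value.
    Returns (values, firing_order).
    """
    values = dict(initial_tokens)
    consumers = {}   # value name -> nodes waiting on it
    missing = {}     # node name -> number of inputs not yet available
    for name, (_, inputs) in nodes.items():
        missing[name] = sum(1 for i in inputs if i not in values)
        for i in inputs:
            consumers.setdefault(i, []).append(name)
    ready = deque(n for n, count in missing.items() if count == 0)
    order = []
    while ready:
        name = ready.popleft()
        func, inputs = nodes[name]
        values[name] = func(*(values[i] for i in inputs))
        order.append(name)
        # Producing this value may make its consumers ready.
        for c in consumers.get(name, []):
            missing[c] -= 1
            if missing[c] == 0:
                ready.append(c)
    return values, order

# (a + b) * (a - b): "prod" is listed first but must wait for its data.
nodes = {
    "prod": (operator.mul, ["sum", "diff"]),
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "b"]),
}
values, order = run_dataflow(nodes, {"a": 5, "b": 3})
print(values["prod"], order)
```

Execution order is decided by data availability, not by the order in which the nodes were written, and no node has to send or wait for an explicit message.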
Networks
• Network Topology (or shape)
  • Vital to efficient parallel algorithms
  • Communication is the limiting factor!
• Ideal
  • Cross-bar
    • Any-to-any
    • Non-blocking
      • Except two sources to same receiver
  • Realisable
    • But only for limited order (number of ports)
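A minimal sketch of the non-blocking property: in one cycle every request goes through unless another source has already claimed the same receiver. The function name `crossbar_grant` and the first-come priority rule are assumptions made for illustration, and each source is assumed to issue at most one request per cycle.

```python
def crossbar_grant(requests):
    """requests: list of (source_port, dest_port) pairs for one cycle.

    Returns (granted, blocked). Requests with distinct destinations all
    go through together; a request is blocked only when an earlier
    request already claimed the same destination.
    """
    claimed = set()
    granted, blocked = [], []
    for src, dst in requests:
        if dst in claimed:
            blocked.append((src, dst))
        else:
            claimed.add(dst)
            granted.append((src, dst))
    return granted, blocked

# Sources 0 and 2 both want receiver 3; source 1 is unaffected.
print(crossbar_grant([(0, 3), (1, 2), (2, 3)]))
```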
Networks
• Cross-bars
  • Achilles
    • 8 x 8
    • Full duplex
      • Simultaneous input and output at each port
    • 32 bit data-path
    • Target: 1 Gbyte / second total throughput
  but we needed the 3-D arrangement to achieve
    • bandwidth
    • high order
Networks
• Cross-bars
  • Achilles
    • Hardware almost trivial!
      • Single FPGA on each level
    • Programmable
      • VHDL Models
      • Several topologies
        • Just by changing the software!
Networks - More than 8 PEs
• Simple
  • Use 2 8x8 routers!
but …. This link gets a lot of traffic!
Networks - Fat tree
• Problem:
  • High-traffic links between PEs can become a bottleneck
• Solution: Fat-tree
  • Links higher up the tree are “fatter”
  • Sustainable bandwidth between all PEs is the same
Networks - Performance Metrics
• Metrics for comparing network topologies
  • Diameter
    • Maximum distance between any pair of nodes
    • Determines latency
  • Bisection Bandwidth
    • Aggregate bandwidth over any “cut” which divides the network in half
    • Determines throughput
• Crossbar
  • Diameter: 1
    • Every PE is directly connected to router so a single “hop” suffices
  • Bisection Bandwidth: b bytes/sec
    • b is the bandwidth of a single link
Networks - Performance Metrics
• Metrics for comparing network topologies
  • To connect n PEs with m x m crossbars
  • Single link bandwidth b bytes/s
• Simple: n = 14 (2 switches)
  • Diameter 3
  • Bisection Bandwidth b
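The diameter of 3 for the two-switch case can be checked with a breadth-first search. The sketch below assumes each 8x8 router uses seven ports for PEs and one for the inter-router link, and counts a hop as one link traversed; the names `R0`, `R1` and `PE0` … `PE13` are made up for the example.

```python
from collections import deque

def hops_from(graph, start):
    """Breadth-first search: number of links from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def pe_diameter(graph, pes):
    """Largest hop count between any pair of PEs."""
    return max(hops_from(graph, a)[b] for a in pes for b in pes)

def two_switch_network(pes_per_switch=7):
    """Two routers joined by one link, each serving pes_per_switch PEs."""
    graph = {"R0": [], "R1": []}
    for i in range(2 * pes_per_switch):
        router = "R0" if i < pes_per_switch else "R1"
        pe = f"PE{i}"
        graph[pe] = [router]
        graph[router].append(pe)
    graph["R0"].append("R1")
    graph["R1"].append("R0")
    return graph

graph = two_switch_network()
pes = [node for node in graph if node.startswith("PE")]
print(len(pes), pe_diameter(graph, pes))  # 14 PEs, diameter 3
```

Cutting the network between the two routers severs exactly one link, which is why the bisection bandwidth is only b.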
Networks - Performance Metrics
• Fat-tree
  • Diameter: 2 log_m n
    • Height is log_m n
    • Worst case distance - up and down
  • Bisection Bandwidth: b n/2 bytes/sec
    • Links are fatter higher up the tree
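A small helper that evaluates the fat-tree figures for given n, m and b. It assumes, as the height formula implies, that every switch fans out to m children; the function names are hypothetical and the height is rounded up when n is not an exact power of m.

```python
def tree_height(n, m):
    """Smallest h with m**h >= n, computed with integers only."""
    height, capacity = 0, 1
    while capacity < n:
        capacity *= m
        height += 1
    return height

def fat_tree_metrics(n, m, b):
    """Diameter in hops and bisection bandwidth in the units of b."""
    height = tree_height(n, m)
    return {
        "height": height,
        "diameter": 2 * height,   # worst case: up to the root and down again
        "bisection": b * n / 2,   # n/2 PEs on each side can all talk across
    }

print(fat_tree_metrics(n=64, m=8, b=1.0))
```

For 64 PEs and 8-way switches this gives a height of 2, a diameter of 4 and a bisection bandwidth of 32 b.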
Networks - Performance Metrics
• Mesh
  • Diameter: 2n-2 (for an n x n mesh, n PEs along each side)
  • Bisection Bandwidth: b n bytes/sec
  • Order: 4
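The mesh figures hold when n is read as the number of PEs along one side of a square mesh without wrap-around links; that reading is an assumption here. Under it, a brute-force check reproduces both numbers (function names are illustrative).

```python
def mesh_diameter(side):
    """Largest hop count between two PEs in a side x side mesh.

    Without wrap-around links the shortest path between two PEs is
    their Manhattan distance.
    """
    cells = [(r, c) for r in range(side) for c in range(side)]
    return max(abs(r1 - r2) + abs(c1 - c2)
               for r1, c1 in cells for r2, c2 in cells)

def mesh_links_cut(side):
    """Horizontal links severed when the mesh is split between its two middle columns."""
    half = side // 2
    return sum(1 for r in range(side) for c in range(side - 1)
               if c < half <= c + 1)

for side in (2, 4, 6):
    print(side, mesh_diameter(side), mesh_links_cut(side))
```

Each line printed shows a diameter of 2n-2 and n severed links, so the bisection bandwidth is n links of b bytes/sec each.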
Networks - Performance Metrics
• Hypercube
  • Hypercube of order m
    • Link 2 order m-1 hypercubes with 2^(m-1) links
  • Number of PEs: n = 2^m
  • Order: log_2 n = m
[Figure: two order 2 hypercubes linked to form an order 3 hypercube]
Networks - Hypercubes
• Embedding property
  • In an n PE hypercube, we have hypercubes of size n/2, n/4, …
  • Number PEs with binary numbers
    • 000, 001, 010, 011, 100, …
  • Joining two hypercubes
    • add one binary digit to the numbering
  • Each PE is connected to every PE whose index differs in only one bit
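With binary numbering, the neighbours of a PE are found by flipping one bit of its index at a time, and the number of links between two PEs is the number of bits in which their indices differ. A short sketch (function names are illustrative):

```python
def hypercube_neighbours(pe, order):
    """PEs whose index differs from pe in exactly one bit."""
    return [pe ^ (1 << bit) for bit in range(order)]

def hop_count(a, b):
    """Links to traverse between a and b: the number of differing bits."""
    return bin(a ^ b).count("1")

print(hypercube_neighbours(0b101, 3))   # [4, 7, 1], i.e. 100, 111, 001
print(hop_count(0b000, 0b111))          # 3: opposite corners of an order 3 cube
```

Each PE has as many neighbours as the cube has dimensions, and the worst-case hop count equals the order of the cube.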
Networks - Hypercubes
• Embedding property
  • Partitioning tasks
    • Allocate to sub-cubes
    • Sub-tasks allocated to sub-cubes of that cube, etc
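One way to picture the partitioning: fixing the leading bits of the PE index selects a sub-cube, and each further split fixes one more bit. The sketch below assumes tasks are given as nested pairs and that the nesting is no deeper than the order of the cube; `sub_cube` and `allocate` are hypothetical names.

```python
def sub_cube(prefix_bits, order):
    """PEs in the sub-cube whose leading index bits equal prefix_bits (a string)."""
    free = order - len(prefix_bits)
    base = int(prefix_bits, 2) << free if prefix_bits else 0
    return [base + i for i in range(1 << free)]

def allocate(task, order, prefix=""):
    """task is a name, or a pair (left, right) of sub-tasks.

    Each split halves the current sub-cube by fixing one more index bit.
    """
    if isinstance(task, str):
        return {task: sub_cube(prefix, order)}
    left, right = task
    result = allocate(left, order, prefix + "0")
    result.update(allocate(right, order, prefix + "1"))
    return result

# Task A gets half of an order 3 cube; B and C share the other half.
print(allocate(("A", ("B", "C")), order=3))
```

Here A receives PEs 0 to 3 (an order 2 cube) while B and C receive PEs 4, 5 and 6, 7 (order 1 cubes), so every task still runs on a complete hypercube.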
Futures
VLIW - Very Long Instruction Word
• Instruction word: multiple operations
  • n RISC-style instructions
• Architecture: fixed set of functional units
  • Each FU matched to a “slot” in the instruction
VLIW - Very Long Instruction Word
• Compiler responsible for allocating instructions to words
  • Burden squarely on compiler
    • Needs to produce near optimal schedule
  • Inevitable: large number of empty slots! => Lower code density
• Similar to superscalar
  • but instruction issue flexibility missing
  • VLIW simpler => faster?
• Re-compilation needed
  • Each new generation will have different functional unit mix
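To see where the empty slots come from, here is a deliberately simple greedy packer. It is not how a production VLIW compiler schedules code; it assumes every operation takes one cycle, considers only data dependencies, and uses a made-up four-slot word (two ALUs, one memory unit, one FP unit).

```python
def pack_vliw(ops, slots):
    """ops: list of (name, unit, deps) in program order.
    slots: functional unit kind of each slot in the instruction word.

    Returns a list of words; each word has one entry per slot,
    either an op name or None for an empty slot.
    """
    finished = set()          # ops placed in earlier words
    remaining = list(ops)
    words = []
    while remaining:
        word = [None] * len(slots)
        placed = set()
        for name, unit, deps in remaining:
            # An op may issue only when all its inputs come from earlier words.
            if not all(d in finished for d in deps):
                continue
            for i, kind in enumerate(slots):
                if kind == unit and word[i] is None:
                    word[i] = name
                    placed.add(name)
                    break
        if not placed:
            raise ValueError("no op can be scheduled: bad dependency or missing unit")
        finished |= placed
        remaining = [op for op in remaining if op[0] not in placed]
        words.append(word)
    return words

slots = ["alu", "alu", "mem", "fp"]
ops = [
    ("load_a", "mem", []),
    ("load_b", "mem", []),
    ("add", "alu", ["load_a", "load_b"]),
    ("fmul", "fp", ["add"]),
    ("store", "mem", ["fmul"]),
]
words = pack_vliw(ops, slots)
empty = sum(word.count(None) for word in words)
print(len(words), "words,", empty, "empty slots")
```

This dependent chain fills only 5 of the 20 available slots. Changing the `slots` list changes the schedule, which is the re-compilation problem in miniature.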
Synchronous Logic Systems
• Clock distribution
  • Major problem for chip architect
  • Clock skews < 100-200ps over whole die
    • 10% of cycle time
  • Small changes => Re-engineer whole chip
    • Checking for data hazards & logic races
Synchronous Logic Systems
• Clock distribution
• Power consumption
  • Major problem @ 30W+ per chip
  • CMOS logic consumes power only on switch, but synch systems clock a lot of logic on every cycle
    • Clock is distributed to every subsystem
    • Even if the logic of the subsystem is disabled!
Synchronous Logic Systems
• Clock distribution
• Power consumption
• Worst case propagation delay
  • Determines maximum clock speed
  • Clock edge must wait until all logic has settled
  • Temperature and process fabrication => Even slower clocks
• Design is simpler
  • Logic designers have experience
  • Good tools
Asynchronous Logic Systems
• Clock distribution
  • No longer a problem
    • Synchronisation bundled with data
• Circuits are composable
  • No global clock … No need to re-engineer a whole chip to change one section!
  • Known correct circuits can be combined
• Power consumption
  • Circuits switch only when they’re computing => Potentially very low power consumption
  • May be the biggest attraction of asynch systems!
Asynchronous Logic Systems
• Clock distribution problem removed
• Circuits are composable
• Power consumption
• Average case propagation delay
  • Completion signal generated when result is available
  • Independent of temperature and process fabrication
• Design is harder
  • Experience will remove this?
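The worst-case versus average-case argument can be put into numbers with a toy model. The delays below are invented for illustration, the `margin` parameter stands in for the extra allowance a clocked design makes for temperature and process variation, and the overhead of generating completion signals is ignored.

```python
def synchronous_time(delays, margin=0.0):
    """Every operation takes one clock period, set by the slowest case."""
    period = max(delays) * (1.0 + margin)
    return period * len(delays)

def asynchronous_time(delays):
    """Each operation signals completion as soon as its own result is ready."""
    return sum(delays)

delays_ns = [2, 3, 2, 8, 2]   # hypothetical propagation delays; one slow case
print(synchronous_time(delays_ns))               # 40.0
print(synchronous_time(delays_ns, margin=0.25))  # 50.0
print(asynchronous_time(delays_ns))              # 17
```

One rare slow operation sets the clock period for all five in the synchronous case, while the asynchronous total follows the average.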
Laboratory 1.51
Practical Examinations will be held in this laboratory
every afternoon from 1:50pm to 5:30pm
next week, June 1 to June 5
The laboratory will be closed to everyone except those in
CT105/CLP110 actually taking the exams
during these times.
Please consider the students taking the exam by not disturbing them in any way.