It’s all about latency
Henk Neefs
Dept. of Electronics and Information Systems (ELIS)
University of Gent
Overview
• Introduction of processor model
• Show importance of latency
• Techniques to handle latency
• Quantify memory latency effect
• Why consider optical interconnects?
• Latency of an optical interconnect
• Conclusions
Out-of-order processor pipeline
[Figure: pipeline block diagram with I-cache, fetch, decode, rename, instruction window, execution units (INT, LD/ST), ‘future’ register file, in-order retirement and architectural register file]
Branch latency
[Figure: the same pipeline with a branch (BR) execution unit next to INT and LD/ST, and an instruction schedule over time (ADD, OR, ST, XOR, LD, BR, ...) with the branch latency marked]
Eliminate branch latency
• By prediction: predict outcome of branch => eliminate dependency (with a high probability)
• By predication: convert control dependency to data dependency => eliminate control dependency
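As a minimal sketch of both ideas: the predictor below is a 2-bit saturating counter, one common textbook scheme and not necessarily the one used in the processor model of these slides. The predication example replaces a branch by arithmetic on the condition. All names (`TwoBitPredictor`, `accuracy`, `select_branchy`, `select_predicated`) are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self):
        self.state = 2  # assumption: start in the "weakly taken" state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes):
    """Fraction of branch outcomes that the predictor guesses correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


def select_branchy(cond, a, b):
    """Control dependency: the result depends on which path is taken."""
    if cond:
        return a
    return b


def select_predicated(cond, a, b):
    """Data dependency: the condition is used as a 0/1 value, no branch needed."""
    c = int(bool(cond))
    return c * a + (1 - c) * b


# A loop branch that is taken 9 times and then falls through once.
loop_branch = [True] * 9 + [False]
print(accuracy(loop_branch))
```

Only the final loop exit is mispredicted in this example, which illustrates the "with a high probability" qualifier: prediction removes the dependency in the common case but not always.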
Load latency
while (pointer != 0)
    pointer = pointer.next;
Loop: LD  R1, R1(32)
      BNE R1, Loop
load latency = 2 cycles, branch latency = 1 cycle
CPI = 2 cycles/2 instructions = 1 cycle/instruction
[Figure: schedule of the LD and BNE instructions on the execution units, in cycles]
When longer load latency
• When L1-cache misses and L2-cache hits:
  load latency = 2+6 cycles, branch latency = 1 cycle
  CPI = 8 cycles/2 instructions = 4 cycles/instruction
• When L2-cache misses and main memory hits:
  load latency = 2+6+60 cycles
  CPI = 34 cycles/instruction
[Figure: schedule of the LD and BNE instructions on the execution units, in cycles, for the longer load latency]
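The CPI figures on these slides follow from one observation: each load needs the result of the previous load, so one loop iteration (LD plus BNE) completes per load latency. A minimal sketch of that arithmetic, assuming the 1-cycle branch always overlaps with the next load; the function name `loop_cpi` is hypothetical:

```python
def loop_cpi(load_latency_cycles, instructions_per_iteration=2):
    """CPI of a loop whose iterations are serialized by a dependent load."""
    return load_latency_cycles / instructions_per_iteration


# Load latencies taken from the slides, in cycles.
cases = {
    "L1-cache hit": 2,
    "L1-cache miss, L2-cache hit": 2 + 6,
    "L2-cache miss, main memory hit": 2 + 6 + 60,
}

for name, latency in cases.items():
    print(f"{name}: load latency {latency} cycles, CPI = {loop_cpi(latency):g}")
```

This reproduces the values 1, 4 and 34 cycles/instruction quoted above.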
Memory hierarchy
[Figure: hierarchy from the register file and execution units through L1 cache, L2 cache and main memory to the hard drive, ordered by storage capacity and latency]
L1 cache latency
[Figure: IPC (0 to 12) versus instruction window size (0 to 300 instructions), one curve each for latency = 2, 3 and 4]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs
Main memory latency
[Figure: IPC (3 to 3.6) versus main memory latency (0 to 100 ns)]
IPC = Instructions Per clock Cycle, 1 GHz processor, SPEC95 programs
Performance and latency
Interconnect type               Sensitivity of performance to latency decrease (% per ns)
Processor core/register file    39
Processor/L1-cache              19
L1-cache/L2-cache               3.0
L2-cache/main memory            0.18
performance change = sensitivity * load latency change
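The formula above can be applied directly to the table. The sketch below assumes the linear relation holds over the latency change considered, which is probably only reasonable for small changes; the names `SENSITIVITY_PCT_PER_NS` and `performance_change_pct` are hypothetical.

```python
# Sensitivities from the table above, in % performance per ns of latency decrease.
SENSITIVITY_PCT_PER_NS = {
    "processor core/register file": 39.0,
    "processor/L1-cache": 19.0,
    "L1-cache/L2-cache": 3.0,
    "L2-cache/main memory": 0.18,
}


def performance_change_pct(interconnect, latency_decrease_ns):
    """performance change = sensitivity * load latency change.

    A positive latency decrease gives a performance gain; a negative value
    (a latency increase) gives a loss.
    """
    return SENSITIVITY_PCT_PER_NS[interconnect] * latency_decrease_ns


for interconnect in SENSITIVITY_PCT_PER_NS:
    change = performance_change_pct(interconnect, 1.0)
    print(f"{interconnect}: 1 ns less latency => {change:.2f} % more performance")
```

The same nanosecond matters roughly 200 times more between core and register file than between L2-cache and main memory.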
Increase performance by
• eliminating/reducing load latency:
  – By prefetching: predict the next miss and fetch the data to e.g. L1-cache
  – By address prediction: address known earlier => load executed earlier => data early in register file
• or reducing sensitivity to load latency:
  – by fine-grain multithreading
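To illustrate why fine-grain multithreading reduces sensitivity to load latency, here is an idealized model that is not taken from the slides: several threads each run the dependent LD/BNE loop from the load latency slides and are interleaved cycle by cycle, so the core stays busy while each thread waits for its own load. The function `multithreaded_cpi` and the `issue_width` parameter are assumptions made for this sketch.

```python
def multithreaded_cpi(load_latency_cycles, threads, issue_width=1,
                      instructions_per_iteration=2):
    """Aggregate CPI of `threads` interleaved copies of a dependent-load loop.

    Assumptions (idealized): every thread finishes one iteration per load
    latency, threads never conflict in the caches, and the core issues at
    most `issue_width` instructions per cycle.
    """
    latency_bound = load_latency_cycles / (instructions_per_iteration * threads)
    issue_bound = 1 / issue_width
    return max(latency_bound, issue_bound)


for threads in (1, 2, 4, 8):
    l2_hit = multithreaded_cpi(2 + 6, threads)
    mem_hit = multithreaded_cpi(2 + 6 + 60, threads)
    print(f"{threads} thread(s): CPI {l2_hit:g} on L2 hit, {mem_hit:g} on memory hit")
```

In this model the latency of a single load is unchanged; only its effect on aggregate throughput shrinks as more threads are available to fill the waiting cycles.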
Some prefetch techniques
• Stride prefetching: search for pattern with constant stride,
  e.g. walking through a matrix (row- or column-order)
  20 31 42 53 64   stride: 11
• Markov prefetching: recurring patterns of misses
  miss history: 10 110 15 12 100 …   prediction: ...
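Both techniques can be sketched in a few lines. This is a simplified illustration, not the mechanism evaluated in the slides: the stride predictor only looks at the last three addresses, and the Markov predictor keeps a first-order table of which miss followed which. The names `predict_stride` and `MarkovPredictor` are hypothetical.

```python
from collections import Counter, defaultdict


def predict_stride(addresses):
    """Return the next address if the last three show a constant non-zero stride."""
    if len(addresses) < 3:
        return None
    last_stride = addresses[-1] - addresses[-2]
    previous_stride = addresses[-2] - addresses[-3]
    if last_stride == previous_stride and last_stride != 0:
        return addresses[-1] + last_stride
    return None


class MarkovPredictor:
    """First-order Markov table: for each miss address, count what missed next."""

    def __init__(self):
        self.successors = defaultdict(Counter)
        self.last_miss = None

    def record_miss(self, address):
        if self.last_miss is not None:
            self.successors[self.last_miss][address] += 1
        self.last_miss = address

    def predict(self, address):
        """Most frequent successor of `address`, or None if it was never seen."""
        seen = self.successors.get(address)
        if not seen:
            return None
        return seen.most_common(1)[0][0]


# Stride example from the slide: 20 31 42 53 64 has stride 11.
print(predict_stride([20, 31, 42, 53, 64]))  # 75

# Markov example with an assumed recurring miss pattern (illustrative data).
markov = MarkovPredictor()
for miss in [10, 110, 15, 10, 110]:
    markov.record_miss(miss)
print(markov.predict(10))  # 110
```

A stride prefetcher needs no history table but only covers regular access patterns; the Markov table can capture irregular but recurring sequences at the cost of storage.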
Stride prefetching
[Figure: IPC (4.9 to 5.2) versus main memory latency (70 to 90 ns), with prefetching and without prefetching]
IPC = Instructions Per clock Cycle, 1 GHz processor, program: compress
Prefetching and sensitivity
Factor by which “performance sensitivity to latency” increases with stride prefetching:
to L1-prefetching: 1.6 (L1-cache/L2-cache), 4.1 (L2-cache/main memory)
to L2-prefetching: 2.5
Latency is important: generalization to other processor architectures
Consider schedule of program:
[Figure: program schedule along a time axis]
Present in every program execution:
• Latency of instruction execution
• Latency of communication
=> latency important whatever processor architecture
Optical interconnects (OI)
• Mature components:
– Vertical-Cavity Surface Emitting Lasers (VCSELs)
– Light Emitting Diodes (LEDs)
• Very high bandwidths
• Are replacing electronic interconnects in telecom and networks
• Useful for short inter-chip and even intra-chip interconnects?
OI in processor context
• At levels close to processor core, latency is very important
  => latency of OI determines how far OI penetrates in the memory hierarchy
• What is the latency of an optical interconnect?
An optical link
Total latency = buffer latency + VCSEL/LED latency + time of flight + receiver latency
[Figure: optical link: buffer/modulation/bias, LED/VCSEL, fiber or light conductor, receiver diode, transimpedance amplifier]
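The total latency is a plain sum of the four terms, and the time-of-flight term can be estimated from the link length. The sketch below assumes a refractive index of 1.5 for the light conductor (a typical value for glass, not given in the slides) and uses the 10 cm length that appears in the latency chart later in the deck. The function names are hypothetical.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def time_of_flight_ns(length_m, refractive_index=1.5):
    """Propagation delay in ns; refractive_index=1.5 is an assumed value."""
    return length_m * refractive_index / SPEED_OF_LIGHT_M_PER_S * 1e9


def total_link_latency_ns(buffer_ns, emitter_ns, tof_ns, receiver_ns):
    """Total latency = buffer + VCSEL/LED + time of flight + receiver latency."""
    return buffer_ns + emitter_ns + tof_ns + receiver_ns


tof = time_of_flight_ns(0.10)
print(f"time of flight over 10 cm: {tof:.2f} ns")
```

Under these assumptions the time of flight over 10 cm is about 0.5 ns, so for short links the electronic and optoelectronic terms can easily dominate the sum.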
VCSEL characteristics
[Figure: optical output (0 to 2 mW) and carrier density versus current (0 to 3 mA), with curves for optical power and carrier density]
• A small semiconductor laser
• Carrier density should be high enough for lasing action
Total VCSEL link latency consists of
• Buffer latency
• Parasitic capacitances and series resistances of VCSEL and pads
• Threshold carrier density build up
• From low optical output to final optical output (intrinsic latency)
• Time of flight (TOF)
• Receiver latency
Total optical link latency
[Figure: stacked bars of latency (0 to 7 ns) at 1 mW for LED and VCSEL links in 0.6 µm and 0.25 µm CMOS, split into buffer, parasitics, threshold, intrinsic, receiver and TOF (10 cm)]
Latency as function of power
[Figure: latency (0 to 8 ns) versus optical output power (0 to 6 mW) for LED (0.6 µm), VCSEL (0.6 µm), LED (0.25 µm) and VCSEL (0.25 µm)]
Conclusions
• When combining performance sensitivity and optical latency we conclude:
  – optical interconnects are feasible to main memory and for multiprocessors
  – for interconnects close to processor core, optical interconnects have too high latency with present (telecom) devices, drivers and receivers
=> but now evolution to lower latency devices, drivers and receivers is taking place...
For more information on the presented results: Henk Neefs, Latentiebeheersing in processors, PhD Universiteit Gent, January 2000
www.elis.rug.ac.be/~neefs