Exposing ILP in the Presence of Loops
Marcos Rubén de Alba Rosano
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Exposing ILP in the Presence of Loops
• To enable wide-issue microarchitectures to obtain high throughput rates, a large window of instructions must be available
• Programs spend 90% of their execution time in 10% of their code (in loops) – [H&P]
• Current compilers can unroll less than 50% of all loops in integer codes – [deAlba 2000]
• Present microarchitectures are not designed to execute loops efficiently [Vajapeyam 1999; Rosner 2001]
• We may need to consider developing customized loop prediction hardware [de Alba 2001]
Exposing ILP in the Presence of Loops
• We need to understand whether entire loop executions can be predicted
– Could expose large amounts of instruction-level parallelism
• If patterns exist in loop execution, we need to build a dynamic profiling system that can capture these patterns
• If we are able to detect patterns through profiling, we can guide aggressive instruction fetch/issue, effectively unrolling multiple iterations of the loop
Outline
• Introduction
• Related work
• Loop terminology
• Loop-based workload characterization
• Loop caching and unrolling hardware
• Experimental methodology
• Results
• Hybrid fetch engine
• Summary and future work
Introduction
• Loops possess high temporal locality and present a good opportunity for caching
– Reduce pressure on other instruction caching structures
– Provide for aggressive runtime unrolling
• Applications that possess large loops tend to be good targets for extracting ILP
– Common in scientific codes (e.g., SPECfp2000)
– Uncommon in integer and multimedia codes (e.g., SPECint2000 and MediaBench)
Introduction
• We propose a path-based, multi-level, hardware-based loop caching scheme that can:
– identify the presence of a loop in the execution stream
– profile loop execution, building loop execution histories
– cache entire unrolled loop execution traces in a dedicated loop cache
– utilize loop history to predict future loop visits at runtime
– combine a loop prediction mechanism with other aggressive instruction fetch mechanisms to improve instruction delivery
Loop cache elements
• Loop cache stack – profiles loops that are presently live, uses a stack structure to accommodate nesting
• Loop table – a first-level table used to identify loops and index into the loop cache
• Loop cache – a second-level table used to hold unrolled loop bodies
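The slides describe these three structures only at a block level; as a rough sketch (class and field names are our own invention, not from the thesis), they might be modeled as:

```python
from dataclasses import dataclass

@dataclass
class LoopTableEntry:
    """First-level table entry: identifies a loop and indexes into the loop cache."""
    controlling_branch: int   # address of the loop's backward branch
    head: int                 # loop head address
    loop_cache_index: int     # where the unrolled body lives in the loop cache

@dataclass
class LiveLoop:
    """A loop that is presently executing; tracked on the loop cache stack."""
    head: int
    iterations: int = 0

class LoopCacheStack:
    """Profiles loops that are presently live; a stack accommodates nesting."""
    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def enter(self, head):
        if len(self.entries) < self.depth:
            self.entries.append(LiveLoop(head))

    def iterate(self, head):
        if self.entries and self.entries[-1].head == head:
            self.entries[-1].iterations += 1

    def exit(self, head):
        if self.entries and self.entries[-1].head == head:
            return self.entries.pop()
        return None

# A nested visit: an inner loop starts and finishes while the outer one is live.
stack = LoopCacheStack()
stack.enter(0x100)        # outer loop becomes live
stack.enter(0x140)        # inner (nested) loop becomes live
stack.iterate(0x140)
stack.iterate(0x140)
inner = stack.exit(0x140)
print(inner.iterations)   # 2
print(len(stack.entries)) # 1 -- the outer loop is still live
```

The second-level loop cache itself would then map a `loop_cache_index` to an unrolled instruction trace.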
Related work
• Software-based
– loop unrolling [Ellis, 1986]
– software pipelining [Lam, 1988]
– loop quantization [Nicolau, 1988]
– static loop characteristics [Davidson, 1995]
• Limitations
– A compiler cannot unroll a loop if:
• the loop body is too large
• the loop induction variable is not an integer
• the loop induction variable is not incremented/decremented by 1
• the increment/decrement value cannot be deduced at compile time
• the loop exit condition is not based on the value of a constant
• there is conditional control flow in the loop body
• More than 50% of the loops present in our workloads could not be unrolled by the compiler
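The conditions above can be read as a single predicate. The sketch below encodes them over a hypothetical loop-metadata record (all field names and the body-size threshold are illustrative, not taken from the compiler study):

```python
def compiler_can_unroll(loop):
    """Apply the static unrollability conditions listed above.
    `loop` is a hypothetical metadata dict; field names are illustrative."""
    if loop["body_size"] > 64:                 # body too large (threshold arbitrary)
        return False
    if loop["induction_type"] != "int":        # induction variable must be an integer
        return False
    if abs(loop["step"]) != 1:                 # must be inc/dec by exactly 1
        return False
    if not loop["step_known_at_compile_time"]: # inc/dec value deducible statically
        return False
    if not loop["exit_bound_is_constant"]:     # exit condition tests a constant
        return False
    if loop["has_conditional_control_flow"]:   # no branches inside the body
        return False
    return True

simple = dict(body_size=8, induction_type="int", step=1,
              step_known_at_compile_time=True, exit_bound_is_constant=True,
              has_conditional_control_flow=False)
branchy = dict(simple, has_conditional_control_flow=True)
print(compiler_can_unroll(simple), compiler_can_unroll(branchy))  # True False
```

The last condition is the one the hardware scheme targets: loops rejected only because of in-body branches are exactly the ones a path-based predictor can still unroll.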
Related work
• Hardware-based
– loop buffers [Thornton, 1964; Anderson, 1967; Hintz, 1972]
– multiple-block-ahead loop prediction [Seznec, 1996]
– trace cache [Rotenberg, 1996]
– loop detection [Kobayashi, 1984; Tubella, 1998; Gonzalez, 1998]
– dynamic vectorization [Vajapeyam, 1999]
– loop termination prediction [Sherwood, 2000]
– loop caches [Texas Inst.; Uh, 1999; Motorola; Vahid, 2002]
– hybrid approaches [Holler, 1997; Hinton, 2001]
• Limitations
– These techniques can effectively cache well-structured loops
– Conditional control flow present in loop bodies can limit the number of loops that can be cached
• The conditional control flow found in loops generates complex, yet predictable, patterns
Loop terminology

Static terms:
[Figure: a straight-line listing of instructions i0–i8 containing branches b1–b4, annotated with the loop head (the first instruction of the body), the loop tail (the backward branch that closes the loop), and the loop body (the instructions between them).]

Dynamic terms:
[Figure: the same listing annotated with the path to the loop and two paths-in-iteration, A (b3 not-taken) and B (b3 taken).]
• loop visit – entering a loop body
• iteration – returning to the loop head before exiting a loop
• path-in-iteration – the sequence of branch outcomes followed within a single iteration
• path-in-loop – the complete set of path-in-iterations for an entire loop visit
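These dynamic terms can be made concrete by counting them over an address trace. The sketch below uses our own counting convention (a visit's first pass through the body counts as its first iteration) and invented addresses:

```python
def loop_stats(trace, head, tail):
    """Count loop visits and per-visit iteration counts from an address trace.
    A visit starts when control reaches `head` from outside the body; an
    iteration ends when the backward branch at `tail` returns to `head`."""
    visits, iterations = 0, []
    in_loop, count, prev = False, 0, None
    for addr in trace:
        if addr == head:
            if in_loop and prev == tail:
                count += 1                  # came back around: one more iteration
            else:
                in_loop, count = True, 1    # entering the loop: a new visit
                visits += 1
        elif in_loop and prev == tail:
            iterations.append(count)        # fell out past the tail: visit ended
            in_loop = False
        prev = addr
    if in_loop:
        iterations.append(count)
    return visits, iterations

# head=10, tail=14: a 3-iteration visit, exit to 20, then a 1-iteration visit
trace = [10, 12, 14, 10, 12, 14, 10, 12, 14, 20, 10, 12, 14, 20]
print(loop_stats(trace, head=10, tail=14))  # (2, [3, 1])
```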
Importance of path-to-loop
[Figure: a control-flow graph over basic blocks b1–b12 with taken (T) / not-taken (NT) edges. One history — blocks b1, b2, b6, b9 with outcomes NT, NT, T, NT — enters the loop, while another — blocks b1, b3, b5, b9 with outcomes T, NT, NT, T — does not. For loop caching to be successful, we must be able to predict b9 very accurately.]
[Figure: the static view of a loop — a listing i1–i19 with three conditional branches (ba zero, A; bb zero, B; bc zero, Top) — next to the dynamic view showing all possible paths during a single loop iteration: depending on the outcomes of branches A and B, the iteration follows one of the traces formed by choosing i8 or i10 after ba and i14 or i16 after bb (e.g., i2…ba i8…bb i14…bc versus i2…ba i10…bb i16…bc).]
For loop caching to be successful, we must be able to predict the path followed on each iteration.
Loop characterization
• It is important to characterize loop behavior in order to guide the design tradeoffs associated with the implementation of the loop cache
• Loops possess a range of characteristics that affect their predictability:
– number of loops
– number of loop visits
– number of iterations per loop visit
– dynamic loop body size
– number of conditional branches found in an iteration
– many more in the thesis
Application of Characterization Study
• The number of loops found in our applications ranged from 21 (g721) to 1266 (gcc), with most applications containing fewer than 100 loops
– Guides the choice of the number of entries in the first-level loop table
• In 9 of the 12 benchmarks studied, more than 40% of the loops were visited only 2–64 times
– Guides the design of the loop cache replacement algorithm
• For more than 80% of all loop visits, the number of iterations executed per visit was less than 17
– Guides the design of the hardware unrolling logic
Application of Characterization Study
• The weighted average number of instructions executed per iteration ranged from 15.87 to 99.83
– Guides the selection of the loop cache line length
• In 10 of the 12 benchmarks studied, the largest loop body size was less than 8192 instructions
– Guides the selection of the loop cache size
• On average, 85% of loop iterations contained 3 or fewer conditional branches in their loop bodies
– Guides the selection of the path-in-iteration pattern history register width
• The maximum level of nesting ranged from 2 to 7 loops
– Guides the selection of the loop cache stack depth
Dynamic loop profiling
• To accurately predict loop execution, we need to:
– dynamically identify loops
– predict loop entry
– select loops to cache
– select loops to unroll
• We utilize a loop cache stack mechanism that builds loop history before entering a loop into the loop cache
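A minimal sketch of that profile-before-cache idea (the promotion threshold and the last-value prediction policy are our own simplifications, not the thesis design):

```python
class LoopProfiler:
    """Build loop history in a first-level table before promoting a loop
    to the loop cache (thresholds here are illustrative)."""
    def __init__(self, promote_after=2):
        self.table = {}            # loop head address -> list of observed visits
        self.promote_after = promote_after
        self.loop_cache = {}       # loop head address -> predicted iteration count

    def record_visit(self, head, iterations):
        hist = self.table.setdefault(head, [])
        hist.append(iterations)
        # Promote only after enough visits to trust a prediction.
        if len(hist) >= self.promote_after:
            self.loop_cache[head] = hist[-1]   # predict the last-seen count

p = LoopProfiler()
p.record_visit(0x400, 8)
print(0x400 in p.loop_cache)   # False -- still profiling
p.record_visit(0x400, 8)
print(p.loop_cache[0x400])     # 8 -- promoted with a prediction
```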
Building Loop Cache Histories

Loop Stack (one entry per live loop, e.g. controlling branch addresses 1000110011001100 and 1000110011001111):
controlling branch address | loop head address | path-to-loop | *path-in-loop

Stack path-in-loop table (pointed to by *path-in-loop):
path              iterations  next path index
1000001111110010  5           2
1000001111110000  17          3
1000001111100000  10          0

Loop Table (first level, filled from the loop stack):
controlling branch address | loop head address | pred-path-to-loop | *pred-path-in-loop

Path-in-loop table (pointed to by *pred-path-in-loop):
path              predicted iterations  next path index  confidence counter
1000001111110010  5                     2                1
1000001111110000  17                    3                2
1000001111100000  10                    0                1
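The path-in-loop table pairs each observed path with a predicted iteration count, a pointer to the expected next path, and a confidence counter. A rough functional model (the saturating-counter update policy is our assumption):

```python
class PathInLoopTable:
    """Per-loop path history: each entry predicts the iteration count for a
    path and chains to the expected next path (a sketch of the structure)."""
    def __init__(self):
        self.entries = {}   # path bits -> [predicted iterations, next path, confidence]

    def record(self, path, iters, next_path):
        e = self.entries.get(path)
        if e is None:
            self.entries[path] = [iters, next_path, 1]
        else:
            # Saturating 2-bit confidence: bump when the outcome repeats, else reset.
            e[2] = min(e[2] + 1, 3) if (e[0], e[1]) == (iters, next_path) else 1
            e[0], e[1] = iters, next_path

    def predict(self, path, threshold=2):
        e = self.entries.get(path)
        if e and e[2] >= threshold:
            return e[0], e[1]   # (predicted iterations, expected next path)
        return None

t = PathInLoopTable()
t.record("1000001111110010", 5, "1000001111110000")
print(t.predict("1000001111110010"))   # None -- confidence still too low
t.record("1000001111110010", 5, "1000001111110000")
print(t.predict("1000001111110010"))   # (5, '1000001111110000')
```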
Dynamic loop caching and unrolling
• Loop unrolling hardware:
– captures loop visits in the loop prediction table
– interrogates the loop predictor to obtain information for future loops
– utilizes loop predictor information to dynamically replicate loop bodies in the loop cache
[Figure: block diagram of the fetch path — the loop predictor supplies loop identification, predicted iterations, and paths-in-iteration to the loop unrolling control, which fills the loop cache with unrolled loop bodies; instructions flow from the I-cache and the loop cache through a queue of speculated instructions to dispatch/decode and on to execution.]
Dynamic loop caching and unrolling
• Loop unrolling hardware:
– uses path-to-loop information to predict a future loop visit
– extracts the number of predicted iterations and uses it as the initial unroll factor (unless the loop cache size is exceeded)
– as long as the number of predicted iterations is larger than 1, unrolls the loop in the loop cache
– uses the paths-in-iteration information to create the correct trace of instructions on every loop iteration
• This information is used to interrogate the loop cache hardware.
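The unrolling steps above can be sketched as trace construction: replicate the loop body once per predicted iteration, picking the instruction sequence that matches each iteration's predicted path (the variant encoding and the capacity limit are illustrative):

```python
def build_unrolled_trace(body_variants, paths_in_iteration, cache_capacity=64):
    """Replicate the loop body once per predicted iteration, selecting the
    instruction sequence that matches each iteration's predicted path."""
    trace = []
    for path in paths_in_iteration:
        body = body_variants[path]
        if len(trace) + len(body) > cache_capacity:
            break   # stop unrolling when the loop cache line fills up
        trace.extend(body)
    return trace

# Two paths through a body with one internal branch (taken / not-taken).
variants = {"T":  ["i2", "ba", "i10", "bb", "i14", "bc"],
            "NT": ["i2", "ba", "i8",  "bb", "i14", "bc"]}
trace = build_unrolled_trace(variants, ["NT", "NT", "T"])
print(len(trace))   # 18: three iterations of six instructions each
```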
Loop prediction table
[Figure: the path-to-loop — the last n branches bn-1, bn-2, …, b0 — is hashed to form a tag and an index into the loop prediction table. Each entry holds a tag, loop head, loop tail, predicted iterations (preditns), and a pointer to a path-in-iteration table of per-path iteration counts (e.g., tag 8, head 60, tail 90, preditns 4, with paths 0110/2, 1111/1, 1001/1). If the tag matches on the last branch and preditns > 1, fetch mode is set to LOOP CACHE: the loop start address indexes the loop cache lookup table (loop head, loop tail), and on a tag match a dynamic trace is built. Otherwise there is no information for this loop, fetch mode stays basic, and fetching proceeds normally.]
Loop cache
[Figure: the loop cache holds an unrolled trace whose iterations differ according to the predicted in-loop paths — e.g., one iteration includes i3 while another skips it. The loop cache control receives the tuple (60, 90, 4, 011, 011, 111, 1001): loop head 60, tail 90, 4 iterations, and the path followed in each iteration. Instructions come from the I-cache (60: i1, 64: i2, 68: b0, 6c: i3, 70: i4, 74: i5, 78: b1, 7c: i6, 80: b2, 84: i7, 88: b3, 8c: i8, 90: b 60, 94: i9).]
If a loop is not in the loop cache, instructions are requested from the I-cache and dynamic traces are built according to the information in the loop table.
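The lookup flow can be summarized as a small decision function (the hash, entry layout, and addresses below are illustrative, not the thesis design):

```python
def choose_fetch_mode(table, branch_history, table_size=512):
    """Hash the recent branch path into the prediction table; on a tag match
    with a predicted iteration count above 1, fetch from the loop cache."""
    index = hash(tuple(branch_history)) % table_size
    tag = branch_history[-1]            # tag on the last (controlling) branch
    entry = table.get(index)
    if entry and entry["tag"] == tag and entry["pred_iterations"] > 1:
        return "LOOP_CACHE", entry["head"]
    return "basic", None                # no information: proceed with normal fetch

history = [0x60, 0x78, 0x90]
table = {hash(tuple(history)) % 512:
         {"tag": 0x90, "head": 0x60, "pred_iterations": 4}}
print(choose_fetch_mode(table, history))                 # ('LOOP_CACHE', 96)
print(choose_fetch_mode(table, [0x10, 0x20, 0x30])[0])   # basic
```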
Experimental methodology
• Modified the SimpleScalar 3.0b Alpha EV6 pipeline to model the following features:
– loop head/tail detection
– loop visit prediction
– loop cache fill/replacement management
– loop stack and first-level table operations
– trace cache model
– hybrid fetch engine operation
[Figure: the Fetch–Dispatch–Issue–Write-Back–Commit pipeline annotated with the loop hardware: the fetch engine performs loop detection, loop table update, and loop cache update/lookup, and supplies the loop cache fetch mode and loop start address; early in-loop branch misprediction detection (and later-stage detection and recovery) stops loop cache fetching.]
Baseline Architecture Parameters
Decode width             16
Commit width             16
Instruction fetch queue  16
Int. functional units    8 (1-cycle latency)
Int. multipliers         2 (7-cycle latency)
FP adders                4 (4-cycle latency)
FP multipliers           2 (4-cycle latency)
FP divide units          2 (12-cycle latency)
FP SQRT units            2 (23-cycle latency)
Branch prediction        bimodal 4096-entry, 2-level adaptive, 8-entry RAS
L1 D-cache               16KB, 4-way set-associative
L1 I-cache               16KB, 4-way set-associative
L1 latency               2 cycles
L2 unified cache         256KB, 4-way set-associative
L2 latency               10 cycles
Memory latency           250 cycles
TLB                      128-entry, 4-way set-associative, 4KB pages, 30-cycle miss penalty
Loop Cache Architecture Parameters
Loop table   512 entries, 4-way, 1-cycle hit latency, 3-cycle penalty, 16-branch path length, up to 16 iterations captured
Loop stack   8 entries, 1-cycle access
Loop cache   8KB, 1-cycle hit latency
Performance gain obtained over a baseline without a loop cache
[Figure: speedup per application; y-axis ranges from 0 to 20.]
Performance speedup using an infinite loop cache
[Figure: speedup per application; y-axis ranges from 0 to 160.]
• Trace caches have been shown to greatly improve fetch efficiency [Rosner 2001]
• Trace caches have not been designed to handle loops effectively:
– replication of loop bodies in the trace cache space
– handling of changing conditional control flow in loop bodies
• We explore how to combine a loop cache with a trace cache to filter out problematic loops
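The replication problem can be illustrated directly: packing consecutive iterations of a short loop into fixed-length trace lines stores the same few instructions many times over (the packing scheme below is deliberately simplified):

```python
def trace_cache_lines(iteration_body, n_iterations, line_len=16):
    """Pack the dynamic instruction stream of an unrolled loop into
    fixed-length trace lines, the way a trace cache would capture it."""
    stream = iteration_body * n_iterations
    return [tuple(stream[i:i + line_len])
            for i in range(0, len(stream), line_len)]

body = ["i2", "ba", "i8", "bb", "i14", "bc"]   # one 6-instruction iteration
lines = trace_cache_lines(body, 8)
total = sum(len(line) for line in lines)
unique = len(set(i for line in lines for i in line))
print(len(lines), total, unique)   # 3 48 6
```

Eight iterations consume 48 trace-line slots to hold only 6 distinct instructions; a loop cache stores the body once and replays it instead.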
Hybrid fetch engine
• Capture all non-loop instructions with a trace cache
• Capture easy-to-predict loops with a trace cache
• Capture complex loops with a loop cache
• Provide new fetch logic to steer instruction fetching to the appropriate source
Hybrid fetch engine strategy
• Trace cache misses occur when:
– branch flags mismatch the multiple-branch predictor
– branches in the trace are mispredicted
– the trace is not found in the trace cache
• Loop cache fetching is triggered when any of these happen
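The triggers above amount to a three-way steering policy, which might be sketched as:

```python
def steer_fetch(trace_cache_hit, trace_branches_ok, loop_cache_hit):
    """Steering policy from the triggers above: fall back to the loop cache
    on any trace-cache problem, and to the L1 I-cache after that."""
    if trace_cache_hit and trace_branches_ok:
        return "trace"
    if loop_cache_hit:
        return "loop cache"
    return "basic"   # fetch from the L1 I-cache

print(steer_fetch(True, True, True))     # trace
print(steer_fetch(True, False, True))    # loop cache (branch flags mismatched)
print(steer_fetch(False, False, False))  # basic
```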
Hybrid fetch engine strategy
[Figure: the hybrid fetching scheme — the fetch mode (trace, loop cache, or basic) enables one of the trace cache, loop cache, or L1 cache through a tri-state bus arbiter; a demux delivers instructions into the fetch queue, with each source filling a fraction (φ, β, or α) of the IFQ width.]
[Figure: % speedup per benchmark (dijkstra, gsm, patricia, fft, epic, g721, gzip, bzip2, parser, twolf) for the LC machine and LC PFBP configurations; y-axis ranges from 0 to 100.]
[Figure: breakdown of committed instructions per benchmark (dijkstra, gsm, patricia, fft, epic, g721, gzip, bzip2, parser, twolf, vpr) by fetch source — L1 instructions, trace cache instructions, and loop cache instructions — from 0% to 100%.]
[Figure: breakdown of committed instructions per benchmark (dijkstra, patricia, epic, gzip, parser, vpr) into non-loop instructions and trace-cache loop instructions, from 0% to 100%.]
[Figure: TC vs. TC + LC — speedup (%) per benchmark (dijkstra, gsm, patricia, fft, epic, g721, gzip, bzip2, parser, twolf, vpr), ranging from about -2% to 12%.]
Publications on Loop Prediction
• M. R. de Alba, D. R. Kaeli, and J. Gonzalez, “Improving the Effectiveness of Trace Caching Using a Loop Cache,” NUCAR technical report.
• M. R. de Alba and D. R. Kaeli, “Characterization and Evaluation of Hardware Loop Unrolling,” 1st Boston Area Architecture Conference, January 2003.
• M. R. de Alba and D. R. Kaeli, “Path-based Hardware Loop Prediction,” Proc. of the 4th International Conference on Control, Virtual Instrumentation and Digital Systems, August 2002.
• A. Uht, D. Morano, A. Khalafi, M. de Alba, and D. R. Kaeli, “Realizing High IPC Using Time-Tagged Resource-Flow Computing,” Proc. of Euro-Par, August 2002.
• M. R. de Alba and D. R. Kaeli, “Runtime Predictability of Loops,” Proc. of the 4th Annual IEEE International Workshop on Workload Characterization, December 2001.
• M. R. de Alba, D. R. Kaeli, and E. S. Kim, “Dynamic Analysis of Loops,” Proc. of the 3rd International Conference on Control, Virtual Instrumentation and Digital Systems, August 2001.
Conclusions
• Branch correlation helps to detect loops in advance
• Loops have patterns of behavior (iterations, dynamic body size, in-loop paths)
• Across the studied benchmarks, on average more than 50% of loops contain branches
• In-loop branches can be predicted and used to guide unrolling
Conclusions
• Dynamic instruction traces are built using loop profiling and prediction
• Multiple loops can be simultaneously unrolled
• By combining a trace cache and a loop cache, more useful and less redundant instruction streams are built
• Performance benefits are gained with a hybrid fetch engine mechanism