Exposing ILP in the Presence of Loops
Marcos Rubén de Alba Rosano
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Exposing ILP in the Presence of Loops
• To enable wide-issue microarchitectures to obtain high throughput rates, a large window of instructions must be available
• Programs spend 90% of their execution time in 10% of their code (in loops) – [H&P]
• Current compilers can unroll less than 50% of all loops in integer codes – [deAlba 2000]
• Present microarchitectures are not designed to execute loops efficiently [Vajapeyam 1999; Rosner 2001]
• We may need to consider developing customized loop prediction hardware [de Alba 2001]
Exposing ILP in the Presence of Loops
• We need to understand whether entire loop executions can be predicted
– Could expose large amounts of instruction-level parallelism
• If patterns exist in loop execution, we need to build a dynamic profiling system that can capture these patterns
• If we are able to detect patterns through profiling, we can guide aggressive instruction fetch/issue, effectively unrolling multiple iterations of the loop
Outline
• Introduction
• Related work
• Loop terminology
• Loop-based workload characterization
• Loop caching and unrolling hardware
• Experimental methodology
• Results
• Hybrid fetch engine
• Summary and future work
Introduction
• Loops possess high temporal locality and present a good opportunity for caching
– Reduce pressure on other instruction caching structures
– Provide for aggressive runtime unrolling
• Applications that possess large loops tend to be good targets for extracting ILP
– Common in scientific codes (e.g., SPECfp2000)
– Uncommon in integer and multimedia codes (e.g., SPECint2000 and MediaBench)
Introduction
• We propose a path-based, multi-level, hardware-based loop caching scheme that can:
– identify the presence of a loop in the execution stream
– profile loop execution, building loop execution histories
– cache entire unrolled loop execution traces in a dedicated loop cache
– utilize loop history to predict future loop visits at runtime
– combine a loop prediction mechanism with other aggressive instruction fetch mechanisms to improve instruction delivery
Loop cache elements
• Loop cache stack – profiles loops that are presently live, uses a stack structure to accommodate nesting
• Loop table – a first-level table used to identify loops and index into the loop cache
• Loop cache – a second-level table used to hold unrolled loop bodies
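The slides describe these three structures only at a block level; as a rough sketch (class and field names are our own invention, not from the thesis), they might be modeled as:

```python
from dataclasses import dataclass

@dataclass
class LoopTableEntry:
    """First-level table entry: identifies a loop and indexes into the loop cache."""
    controlling_branch: int   # address of the loop's backward branch
    head: int                 # loop head address
    loop_cache_index: int     # where the unrolled body lives in the loop cache

@dataclass
class LiveLoop:
    """A loop that is presently executing; tracked on the loop cache stack."""
    head: int
    iterations: int = 0

class LoopCacheStack:
    """Profiles loops that are presently live; a stack accommodates nesting."""
    def __init__(self, depth=8):
        self.depth = depth
        self.entries = []

    def enter(self, head):
        if len(self.entries) < self.depth:
            self.entries.append(LiveLoop(head))

    def iterate(self, head):
        if self.entries and self.entries[-1].head == head:
            self.entries[-1].iterations += 1

    def exit(self, head):
        if self.entries and self.entries[-1].head == head:
            return self.entries.pop()
        return None

# A nested visit: an inner loop starts and finishes while the outer one is live.
stack = LoopCacheStack()
stack.enter(0x100)        # outer loop becomes live
stack.enter(0x140)        # inner (nested) loop becomes live
stack.iterate(0x140)
stack.iterate(0x140)
inner = stack.exit(0x140)
print(inner.iterations)   # 2
print(len(stack.entries)) # 1 -- the outer loop is still live
```

The second-level loop cache itself would then map a `loop_cache_index` to an unrolled instruction trace.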
Related work
• Software-based
– loop unrolling [Ellis, 1986]
– software pipelining [Lam, 1988]
– loop quantization [Nicolau, 1988]
– static loop characteristics [Davidson, 1995]
• Limitations
– A compiler cannot unroll a loop if:
• the loop body is too large
• the loop induction variable is not an integer
• the loop induction variable is not incremented/decremented by 1
• the increment/decrement value cannot be deduced at compile time
• the loop exit condition is not based on the value of a constant
• there is conditional control flow in the loop body
• More than 50% of the loops present in our workloads could not be unrolled by the compiler
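The conditions above can be read as a single predicate. The sketch below encodes them over a hypothetical loop-metadata record (all field names and the body-size threshold are illustrative, not taken from the compiler study):

```python
def compiler_can_unroll(loop):
    """Apply the static unrollability conditions listed above.
    `loop` is a hypothetical metadata dict; field names are illustrative."""
    if loop["body_size"] > 64:                 # body too large (threshold arbitrary)
        return False
    if loop["induction_type"] != "int":        # induction variable must be an integer
        return False
    if abs(loop["step"]) != 1:                 # must be inc/dec by exactly 1
        return False
    if not loop["step_known_at_compile_time"]: # inc/dec value deducible statically
        return False
    if not loop["exit_bound_is_constant"]:     # exit condition tests a constant
        return False
    if loop["has_conditional_control_flow"]:   # no branches inside the body
        return False
    return True

simple = dict(body_size=8, induction_type="int", step=1,
              step_known_at_compile_time=True, exit_bound_is_constant=True,
              has_conditional_control_flow=False)
branchy = dict(simple, has_conditional_control_flow=True)
print(compiler_can_unroll(simple), compiler_can_unroll(branchy))  # True False
```

The last condition is the one the hardware scheme targets: loops rejected only because of in-body branches are exactly the ones a path-based predictor can still unroll.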
Related work
• Hardware-based
– loop buffers [Thornton, 1964; Anderson, 1967; Hintz, 1972]
– multiple-block-ahead loop prediction [Seznec, 1996]
– trace cache [Rotenberg, 1996]
– loop detection [Kobayashi, 1984; Tubella, 1998; Gonzalez, 1998]
– dynamic vectorization [Vajapeyam, 1999]
– loop termination prediction [Sherwood, 2000]
– loop caches [Texas Inst.; Uh, 1999; Motorola; Vahid, 2002]
– hybrid approaches [Holler, 1997; Hinton, 2001]
• Limitations
– These techniques can effectively cache well-structured loops
– Conditional control flow present in loop bodies can limit the number of loops that can be cached
• The conditional control flow found in loops generates complex, yet predictable, patterns
Loop terminology

Static terms:
[Figure: a straight-line listing of instructions i0–i8 containing branches b1–b4, annotated with the loop head (the first instruction of the body), the loop tail (the backward branch that closes the loop), and the loop body (the instructions between them).]

Dynamic terms:
[Figure: the same listing annotated with the path to the loop and two paths-in-iteration, A (b3 not-taken) and B (b3 taken).]
• loop visit – entering a loop body
• iteration – returning to the loop head before exiting a loop
• path-in-iteration – the sequence of branch outcomes followed within a single iteration
• path-in-loop – the complete set of path-in-iterations for an entire loop visit
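These dynamic terms can be made concrete by counting them over an address trace. The sketch below uses our own counting convention (a visit's first pass through the body counts as its first iteration) and invented addresses:

```python
def loop_stats(trace, head, tail):
    """Count loop visits and per-visit iteration counts from an address trace.
    A visit starts when control reaches `head` from outside the body; an
    iteration ends when the backward branch at `tail` returns to `head`."""
    visits, iterations = 0, []
    in_loop, count, prev = False, 0, None
    for addr in trace:
        if addr == head:
            if in_loop and prev == tail:
                count += 1                  # came back around: one more iteration
            else:
                in_loop, count = True, 1    # entering the loop: a new visit
                visits += 1
        elif in_loop and prev == tail:
            iterations.append(count)        # fell out past the tail: visit ended
            in_loop = False
        prev = addr
    if in_loop:
        iterations.append(count)
    return visits, iterations

# head=10, tail=14: a 3-iteration visit, exit to 20, then a 1-iteration visit
trace = [10, 12, 14, 10, 12, 14, 10, 12, 14, 20, 10, 12, 14, 20]
print(loop_stats(trace, head=10, tail=14))  # (2, [3, 1])
```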
Importance of path-to-loop
[Figure: a control-flow graph over basic blocks b1–b12 with taken (T) / not-taken (NT) edges. One history — blocks b1, b2, b6, b9 with outcomes NT, NT, T, NT — enters the loop, while another — blocks b1, b3, b5, b9 with outcomes T, NT, NT, T — does not. For loop caching to be successful, we must be able to predict b9 very accurately.]
[Figure: the static view of a loop — a listing i1–i19 with three conditional branches (ba zero, A; bb zero, B; bc zero, Top) — next to the dynamic view showing all possible paths during a single loop iteration: depending on the outcomes of branches A and B, the iteration follows one of the traces formed by choosing i8 or i10 after ba and i14 or i16 after bb (e.g., i2…ba i8…bb i14…bc versus i2…ba i10…bb i16…bc).]
For loop caching to be successful, we must be able to predict the path followed on each iteration.
Loop characterization
• It is important to characterize loop behavior in order to guide the design tradeoffs associated with the implementation of the loop cache
• Loops possess a range of characteristics that affect their predictability:
– number of loops
– number of loop visits
– number of iterations per loop visit
– dynamic loop body size
– number of conditional branches found in an iteration
– many more in the thesis
Application of Characterization Study
• The number of loops found in our applications ranged from 21 (g721) to 1266 (gcc), with most applications containing fewer than 100 loops
– Guides the choice of the number of entries in the first-level loop table
• In 9 of the 12 benchmarks studied, more than 40% of the loops were visited only 2–64 times
– Guides the design of the loop cache replacement algorithm
• For more than 80% of all loop visits, the number of iterations executed per visit was less than 17
– Guides the design of the hardware unrolling logic
Application of Characterization Study
• The weighted average number of instructions executed per iteration ranged from 15.87 to 99.83
– Guides the selection of the loop cache line length
• In 10 of the 12 benchmarks studied, the largest loop body size was less than 8192 instructions
– Guides the selection of the loop cache size
• On average, 85% of loop iterations contained 3 or fewer conditional branches in their loop bodies
– Guides the selection of the path-in-iteration pattern history register width
• The maximum level of nesting ranged from 2 to 7 loops
– Guides the selection of the loop cache stack depth
Dynamic loop profiling
• To accurately predict loop execution, we need to:
– dynamically identify loops
– predict loop entry
– select loops to cache
– select loops to unroll
• We utilize a loop cache stack mechanism that builds loop history before entering a loop into the loop cache
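A minimal sketch of that profile-before-cache idea (the promotion threshold and the last-value prediction policy are our own simplifications, not the thesis design):

```python
class LoopProfiler:
    """Build loop history in a first-level table before promoting a loop
    to the loop cache (thresholds here are illustrative)."""
    def __init__(self, promote_after=2):
        self.table = {}            # loop head address -> list of observed visits
        self.promote_after = promote_after
        self.loop_cache = {}       # loop head address -> predicted iteration count

    def record_visit(self, head, iterations):
        hist = self.table.setdefault(head, [])
        hist.append(iterations)
        # Promote only after enough visits to trust a prediction.
        if len(hist) >= self.promote_after:
            self.loop_cache[head] = hist[-1]   # predict the last-seen count

p = LoopProfiler()
p.record_visit(0x400, 8)
print(0x400 in p.loop_cache)   # False -- still profiling
p.record_visit(0x400, 8)
print(p.loop_cache[0x400])     # 8 -- promoted with a prediction
```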
Building Loop Cache Histories

Loop Stack (one entry per live loop, e.g. controlling branch addresses 1000110011001100 and 1000110011001111):
controlling branch address | loop head address | path-to-loop | *path-in-loop

Stack path-in-loop table (pointed to by *path-in-loop):
path              iterations  next path index
1000001111110010  5           2
1000001111110000  17          3
1000001111100000  10          0

Loop Table (first level, filled from the loop stack):
controlling branch address | loop head address | pred-path-to-loop | *pred-path-in-loop

Path-in-loop table (pointed to by *pred-path-in-loop):
path              predicted iterations  next path index  confidence counter
1000001111110010  5                     2                1
1000001111110000  17                    3                2
1000001111100000  10                    0                1
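The path-in-loop table pairs each observed path with a predicted iteration count, a pointer to the expected next path, and a confidence counter. A rough functional model (the saturating-counter update policy is our assumption):

```python
class PathInLoopTable:
    """Per-loop path history: each entry predicts the iteration count for a
    path and chains to the expected next path (a sketch of the structure)."""
    def __init__(self):
        self.entries = {}   # path bits -> [predicted iterations, next path, confidence]

    def record(self, path, iters, next_path):
        e = self.entries.get(path)
        if e is None:
            self.entries[path] = [iters, next_path, 1]
        else:
            # Saturating 2-bit confidence: bump when the outcome repeats, else reset.
            e[2] = min(e[2] + 1, 3) if (e[0], e[1]) == (iters, next_path) else 1
            e[0], e[1] = iters, next_path

    def predict(self, path, threshold=2):
        e = self.entries.get(path)
        if e and e[2] >= threshold:
            return e[0], e[1]   # (predicted iterations, expected next path)
        return None

t = PathInLoopTable()
t.record("1000001111110010", 5, "1000001111110000")
print(t.predict("1000001111110010"))   # None -- confidence still too low
t.record("1000001111110010", 5, "1000001111110000")
print(t.predict("1000001111110010"))   # (5, '1000001111110000')
```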
Dynamic loop caching and unrolling
• Loop unrolling hardware:
– captures loop visits in the loop prediction table
– interrogates the loop predictor to obtain information for future loops
– utilizes loop predictor information to dynamically replicate loop bodies in the loop cache
[Figure: block diagram of the fetch path — the loop predictor supplies loop identification, predicted iterations, and paths-in-iteration to the loop unrolling control, which fills the loop cache with unrolled loop bodies; instructions flow from the I-cache and the loop cache through a queue of speculated instructions to dispatch/decode and on to execution.]
Dynamic loop caching and unrolling
• Loop unrolling hardware:
– uses path-to-loop information to predict a future loop visit
– extracts the number of predicted iterations and uses it as the initial unroll factor (unless the loop cache size is exceeded)
– as long as the number of predicted iterations is larger than 1, unrolls the loop in the loop cache
– uses the paths-in-iteration information to create the correct trace of instructions on every loop iteration
• This information is used to interrogate the loop cache hardware.
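The unrolling steps above can be sketched as trace construction: replicate the loop body once per predicted iteration, picking the instruction sequence that matches each iteration's predicted path (the variant encoding and the capacity limit are illustrative):

```python
def build_unrolled_trace(body_variants, paths_in_iteration, cache_capacity=64):
    """Replicate the loop body once per predicted iteration, selecting the
    instruction sequence that matches each iteration's predicted path."""
    trace = []
    for path in paths_in_iteration:
        body = body_variants[path]
        if len(trace) + len(body) > cache_capacity:
            break   # stop unrolling when the loop cache line fills up
        trace.extend(body)
    return trace

# Two paths through a body with one internal branch (taken / not-taken).
variants = {"T":  ["i2", "ba", "i10", "bb", "i14", "bc"],
            "NT": ["i2", "ba", "i8",  "bb", "i14", "bc"]}
trace = build_unrolled_trace(variants, ["NT", "NT", "T"])
print(len(trace))   # 18: three iterations of six instructions each
```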
Loop prediction table
[Figure: the path-to-loop — the last n branches bn-1, bn-2, …, b0 — is hashed to form a tag and an index into the loop prediction table. Each entry holds a tag, loop head, loop tail, predicted iterations (preditns), and a pointer to a path-in-iteration table of per-path iteration counts (e.g., tag 8, head 60, tail 90, preditns 4, with paths 0110/2, 1111/1, 1001/1). If the tag matches on the last branch and preditns > 1, fetch mode is set to LOOP CACHE: the loop start address indexes the loop cache lookup table (loop head, loop tail), and on a tag match a dynamic trace is built. Otherwise there is no information for this loop, fetch mode stays basic, and fetching proceeds normally.]
Loop cache
[Figure: the loop cache holds an unrolled trace whose iterations differ according to the predicted in-loop paths — e.g., one iteration includes i3 while another skips it. The loop cache control receives the tuple (60, 90, 4, 011, 011, 111, 1001): loop head 60, tail 90, 4 iterations, and the path followed in each iteration. Instructions come from the I-cache (60: i1, 64: i2, 68: b0, 6c: i3, 70: i4, 74: i5, 78: b1, 7c: i6, 80: b2, 84: i7, 88: b3, 8c: i8, 90: b 60, 94: i9).]
If a loop is not in the loop cache, instructions are requested from the I-cache and dynamic traces are built according to the information in the loop table.
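The lookup flow can be summarized as a small decision function (the hash, entry layout, and addresses below are illustrative, not the thesis design):

```python
def choose_fetch_mode(table, branch_history, table_size=512):
    """Hash the recent branch path into the prediction table; on a tag match
    with a predicted iteration count above 1, fetch from the loop cache."""
    index = hash(tuple(branch_history)) % table_size
    tag = branch_history[-1]            # tag on the last (controlling) branch
    entry = table.get(index)
    if entry and entry["tag"] == tag and entry["pred_iterations"] > 1:
        return "LOOP_CACHE", entry["head"]
    return "basic", None                # no information: proceed with normal fetch

history = [0x60, 0x78, 0x90]
table = {hash(tuple(history)) % 512:
         {"tag": 0x90, "head": 0x60, "pred_iterations": 4}}
print(choose_fetch_mode(table, history))                 # ('LOOP_CACHE', 96)
print(choose_fetch_mode(table, [0x10, 0x20, 0x30])[0])   # basic
```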
Experimental methodology
• Modified the SimpleScalar 3.0b Alpha EV6 pipeline to model the following features:
– loop head/tail detection
– loop visit prediction
– loop cache fill/replacement management
– loop stack and first-level table operations
– trace cache model
– hybrid fetch engine operation
[Figure: the Fetch–Dispatch–Issue–Write-Back–Commit pipeline annotated with the loop hardware: the fetch engine performs loop detection, loop table update, and loop cache update/lookup, and supplies the loop cache fetch mode and loop start address; early in-loop branch misprediction detection (and later-stage detection and recovery) stops loop cache fetching.]
Baseline Architecture Parameters
Decode width             16
Commit width             16
Instruction fetch queue  16
Int. functional units    8 (1-cycle latency)
Int. multipliers         2 (7-cycle latency)
FP adders                4 (4-cycle latency)
FP multipliers           2 (4-cycle latency)
FP divide units          2 (12-cycle latency)
FP SQRT units            2 (23-cycle latency)
Branch prediction        bimodal 4096-entry, 2-level adaptive, 8-entry RAS
L1 D-cache               16KB, 4-way set-associative
L1 I-cache               16KB, 4-way set-associative
L1 latency               2 cycles
L2 unified cache         256KB, 4-way set-associative
L2 latency               10 cycles
Memory latency           250 cycles
TLB                      128-entry, 4-way set-associative, 4KB pages, 30-cycle miss penalty
Loop Cache Architecture Parameters
Loop table   512 entries, 4-way, 1-cycle hit latency, 3-cycle penalty, 16-branch path length, up to 16 iterations captured
Loop stack   8 entries, 1-cycle access
Loop cache   8KB, 1-cycle hit latency
Performance gain obtained over a baseline without a loop cache
[Figure: speedup per application; y-axis ranges from 0 to 20.]
Performance speedup using an infinite loop cache
[Figure: speedup per application; y-axis ranges from 0 to 160.]
• Trace caches have been shown to greatly improve fetch efficiency [Rosner 2001]
• Trace caches have not been designed to handle loops effectively:
– replication of loop bodies in the trace cache space
– handling of changing conditional control flow in loop bodies
• We explore how to combine a loop cache with a trace cache to filter out problematic loops
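The replication problem can be illustrated directly: packing consecutive iterations of a short loop into fixed-length trace lines stores the same few instructions many times over (the packing scheme below is deliberately simplified):

```python
def trace_cache_lines(iteration_body, n_iterations, line_len=16):
    """Pack the dynamic instruction stream of an unrolled loop into
    fixed-length trace lines, the way a trace cache would capture it."""
    stream = iteration_body * n_iterations
    return [tuple(stream[i:i + line_len])
            for i in range(0, len(stream), line_len)]

body = ["i2", "ba", "i8", "bb", "i14", "bc"]   # one 6-instruction iteration
lines = trace_cache_lines(body, 8)
total = sum(len(line) for line in lines)
unique = len(set(i for line in lines for i in line))
print(len(lines), total, unique)   # 3 48 6
```

Eight iterations consume 48 trace-line slots to hold only 6 distinct instructions; a loop cache stores the body once and replays it instead.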
Hybrid fetch engine
• Capture all non-loop instructions with a trace cache
• Capture easy-to-predict loops with a trace cache
• Capture complex loops with a loop cache
• Provide new fetch logic to steer instruction fetching to the appropriate source
Hybrid fetch engine strategy
• Trace cache misses occur when:
– branch flags mismatch the multiple-branch predictor
– branches in the trace are mispredicted
– the trace is not found in the trace cache
• Loop cache fetching is triggered when any of these happen
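The triggers above amount to a three-way steering policy, which might be sketched as:

```python
def steer_fetch(trace_cache_hit, trace_branches_ok, loop_cache_hit):
    """Steering policy from the triggers above: fall back to the loop cache
    on any trace-cache problem, and to the L1 I-cache after that."""
    if trace_cache_hit and trace_branches_ok:
        return "trace"
    if loop_cache_hit:
        return "loop cache"
    return "basic"   # fetch from the L1 I-cache

print(steer_fetch(True, True, True))     # trace
print(steer_fetch(True, False, True))    # loop cache (branch flags mismatched)
print(steer_fetch(False, False, False))  # basic
```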
Hybrid fetch engine strategy
[Figure: the hybrid fetching scheme — the fetch mode (trace, loop cache, or basic) enables one of the trace cache, loop cache, or L1 cache through a tri-state bus arbiter; a demux delivers instructions into the fetch queue, with each source filling a fraction (φ, β, or α) of the IFQ width.]
[Figure: % speedup per benchmark (dijkstra, gsm, patricia, fft, epic, g721, gzip, bzip2, parser, twolf) for the LC machine and LC PFBP configurations; y-axis ranges from 0 to 100.]
[Figure: breakdown of committed instructions per benchmark (dijkstra, gsm, patricia, fft, epic, g721, gzip, bzip2, parser, twolf, vpr) by fetch source — L1 instructions, trace cache instructions, and loop cache instructions — from 0% to 100%.]
[Figure: breakdown of committed instructions per benchmark (dijkstra, patricia, epic, gzip, parser, vpr) into non-loop instructions and trace-cache loop instructions, from 0% to 100%.]
[Figure: TC vs. TC + LC — speedup (%) per benchmark (dijkstra, gsm, patricia, fft, epic, g721, gzip, bzip2, parser, twolf, vpr), ranging from about -2% to 12%.]
Publications on Loop Prediction
• M. R. de Alba, D. R. Kaeli, and J. Gonzalez, “Improving the Effectiveness of Trace Caching Using a Loop Cache,” NUCAR technical report.
• M. R. de Alba and D. R. Kaeli, “Characterization and Evaluation of Hardware Loop Unrolling,” 1st Boston Area Architecture Conference, January 2003.
• M. R. de Alba and D. R. Kaeli, “Path-based Hardware Loop Prediction,” Proc. of the 4th International Conference on Control, Virtual Instrumentation and Digital Systems, August 2002.
• A. Uht, D. Morano, A. Khalafi, M. de Alba, and D. R. Kaeli, “Realizing High IPC Using Time-Tagged Resource-Flow Computing,” Proc. of Euro-Par, August 2002.
• M. R. de Alba and D. R. Kaeli, “Runtime Predictability of Loops,” Proc. of the 4th Annual IEEE International Workshop on Workload Characterization, December 2001.
• M. R. de Alba, D. R. Kaeli, and E. S. Kim, “Dynamic Analysis of Loops,” Proc. of the 3rd International Conference on Control, Virtual Instrumentation and Digital Systems, August 2001.
Conclusions
• Branch correlation helps to detect loops in advance
• Loops have patterns of behavior (iterations, dynamic body size, in-loop paths)
• Across the studied benchmarks, on average more than 50% of loops contain branches
• In-loop branches can be predicted and used to guide unrolling
Conclusions
• Dynamic instruction traces are built using loop profiling and prediction
• Multiple loops can be simultaneously unrolled
• By combining a trace cache and a loop cache, more useful and less redundant instruction streams are built
• Performance benefits are gained with a hybrid fetch engine mechanism