December 4, 2003 Ilhyun Kim -- MICRO-36 Slide 1 of 23
Macro-op Scheduling: Relaxing Scheduling Loop Constraints
Ilhyun Kim and Mikko H. Lipasti, PHARM Team, University of Wisconsin-Madison
It’s all about granularity
Instruction-centric hardware design:
- HW structures are built to match an instruction’s specifications
- Control occurs at every instruction boundary
Instruction granularity may impose constraints on the hardware design space; those constraints can be relaxed by processing at a different granularity.
[Figure: processing-granularity spectrum from finer to coarser: operand (half-price architecture, ISCA 2003), instruction (conventional), macro-op (coarser-granular architecture).]
Outline
- Scheduling loop constraints
- Overview of coarser-grained scheduling
- Macro-op scheduling implementation
- Performance evaluation
- Conclusions & future work
Scheduling loop constraints
Loops in out-of-order execution:
- Scheduling atomicity (wakeup / select within a single cycle): essential for back-to-back execution of dependent instructions, yet hard to pipeline in conventional designs
- Poor scalability: extractable ILP is a function of window size, but complexity increases exponentially as the window grows
- Increasing pressure from deeper pipelining and a slower memory system
[Figure: pipeline stages Fetch, Decode, Sched, Disp, RF, Exe, WB, Commit, with three loops marked: the scheduling loop (wakeup / select), the execution loop (bypass), and the load latency resolution loop.]
Related work
Scheduling atomicity:
- Speculation & pipelining: grandparent scheduling [Stark], select-free scheduling [Brown]
Poor scalability:
- Low-complexity scheduling logic: FIFO-style windows [Palacharla, H. Kim], data-flow based windows [Canal, Michaud, Raasch, ...]
- Judicious window scaling: segmented windows [Hrishikesh], WIB [Lebeck], ...
- Issue queue entry sharing: AMD K7 (MOPs), Intel Pentium M (uop fusion)
All of these are still based on instruction-centric scheduler designs: they make a scheduling decision at every instruction boundary, and they attack atomicity and scalability only in isolation.
Source of the atomicity constraint
- The minimal execution latency of an instruction: many ALU operations have single-cycle latency, and the schedule must keep up with execution, so 1-cycle instructions need 1-cycle scheduling
- Multi-cycle operations, by contrast, do not need atomic scheduling
Relax the constraint by increasing the size of the scheduling unit: combine multiple instructions into one multi-cycle-latency unit, so that scheduling decisions occur at multiple-instruction boundaries. This attacks both the atomicity and the scalability constraints.
Macro-op scheduling overview
[Figure: macro-op scheduling pipeline. Fetch (I-cache) and decode feed MOP formation at rename, guided by MOP pointers; grouped instructions are inserted into the issue queue, scheduled by pipelined wakeup/select, read from the payload RAM, sequenced back into original instructions, and executed (RF, EXE, MEM with cache ports, WB, commit). MOP detection observes wakeup order and dependence information and writes MOP pointers back. The queue and scheduling stages are coarser, MOP-grained; fetch/decode/rename and RF/EXE/MEM/WB/commit remain instruction-grained.]
MOP scheduling (2x) example
- Pipelined instruction scheduling of multi-cycle MOPs, while still issuing the original instructions consecutively
- A larger effective instruction window: multiple original instructions logically share a single issue queue entry
[Figure: a 16-instruction dependence graph scheduled two ways. Atomic wakeup/select: 9 cycles, 16 queue entries. 2x macro-op scheduling with pipelined select/wakeup: 10 cycles, 9 queue entries.]
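The tradeoff on this slide can be approximated with a toy model of my own (not the authors' simulator): count the cycles needed to issue and execute a chain of n dependent single-cycle instructions under each scheme.

```python
# Toy model: cycles to issue-and-execute a chain of n dependent
# single-cycle instructions under each scheduling scheme.

def atomic_cycles(n):
    # Atomic wakeup/select: each dependent instruction issues the cycle
    # after its producer, so the chain takes n cycles.
    return n

def two_cycle_cycles(n):
    # Pipelined 2-cycle scheduling: a dependent instruction issues two
    # cycles after its producer, adding a bubble per dependence edge.
    return 2 * n - 1

def mop_cycles(n):
    # 2x MOP scheduling on a 2-cycle scheduler: heads of fused pairs issue
    # every other cycle and each tail executes in its head's second cycle,
    # so the chain again takes n cycles without a single-cycle loop.
    return n

for n in (4, 16):
    print(n, atomic_cycles(n), two_cycle_cycles(n), mop_cycles(n))
```

With 16 instructions this gives 16, 31, and 16 cycles respectively, which mirrors the point of the slide: fusing dependent pairs lets a pipelined scheduler sustain back-to-back issue of dependent work. (The slide's own 9-vs-10-cycle figures come from a graph that is not a pure chain.)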
Issues in grouping instructions
- Candidate instructions: single-cycle operations (integer ALU, control, store address generation); multi-cycle instructions (e.g. loads) do not need single-cycle scheduling
- Number of source operands: grouping two dependent instructions involves up to 3 source operands; allow up to 2 (conventional) or place no restriction (wired-OR)
- MOP size: bigger MOPs may be more beneficial; this study groups 2 instructions
- MOP formation scope: instructions are processed in order before being inserted into the issue queue, so candidate instructions must be captured within a reasonable scope
Dependence edge distance (instruction count)
- 73% of value-generating candidates (potential MOP heads) have dependent candidate instructions (potential MOP tails)
- An 8-instruction scope captures many dependent pairs, though distances vary across benchmarks (e.g. gap vs. vortex; remember this)
- Our configuration: group 2 single-cycle instructions within an 8-instruction scope
[Chart: for each SPEC2000 integer benchmark, the distribution of dependence edge distances over total committed value-generating candidate instructions: 1~3 instructions, 4~7, 8+, not a MOP candidate, dynamically dead; the 8-instruction scope is marked. MOP potential as a percentage of total instructions: bzip 49.2, crafty 50.9, eon 27.8, gap 48.7, gcc 37.4, gzip 56.3, mcf 40.2, parser 47.5, perl 42.7, twolf 47.7, vortex 37.6, vpr 44.7.]
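A sketch of how pairing within the 8-instruction scope might work (my own simplification; the data shapes and function names are assumptions, not the paper's structures): each single-cycle candidate may become the tail of at most one MOP whose head is one of its producers no more than 7 instructions earlier.

```python
SCOPE = 8  # head and tail must fall within an 8-instruction window

def find_mop_pairs(insts):
    """insts: list of (is_single_cycle_candidate, producer_indices)."""
    used = set()
    pairs = []
    for tail, (cand, producers) in enumerate(insts):
        if not cand or tail in used:
            continue
        # Prefer the closest (most recent) producer as the head.
        for head in sorted(producers, reverse=True):
            if insts[head][0] and head not in used and tail - head < SCOPE:
                pairs.append((head, tail))
                used.update((head, tail))
                break
    return pairs

# Tiny example: 1 depends on 0 and both are candidates, so they pair up;
# 2 is a load (not a candidate); 3's only producer is already grouped.
example = [(True, []), (True, [0]), (False, [1]), (True, [1])]
print(find_mop_pairs(example))  # → [(0, 1)]
```

Each instruction joins at most one pair, matching the 2-instruction MOP size used in this study.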
MOP detection
Finds groupable instruction pairs:
- Dependence matrix-based detection (detailed in the paper)
- Performance is insensitive to detection latency because pointers are reused repeatedly; even a pessimistic 100-cycle latency loses only 0.22% of IPC
Generates MOP pointers:
- 4 bits per instruction, stored in the L1 instruction cache
- Each MOP pointer represents a groupable instruction pair
MOP detection: avoiding cycle conditions
- Grouping can create cycle conditions, which lead to deadlocks
- Precise cycle detection is hard, since it requires tracking multiple levels of dependences
- Conservative heuristic: assume a cycle whenever both an outgoing and an incoming edge are detected
- The heuristic still captures over 90% of MOP opportunities compared to precise detection
[Figure: example dependence graphs over instructions 1~4 illustrating cycle conditions.]
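The conservative heuristic above can be sketched as follows (my own rendering of the idea; the interface is assumed). Grouping head h with tail t deadlocks if some instruction between them both consumes h's result and feeds t, since that instruction would then depend on the MOP that depends on it. Rather than tracing paths precisely, the heuristic flags a possible cycle whenever h has any outgoing edge into the gap AND t has any incoming edge from the gap.

```python
def may_form_cycle(head, tail, producers):
    """producers[i]: set of indices whose results instruction i reads.
    Conservative: does not check that the two edges actually connect."""
    between = range(head + 1, tail)
    outgoing = any(head in producers[m] for m in between)
    incoming = any(m in producers[tail] for m in between)
    return outgoing and incoming

# Real cycle: 0 -> 1 -> 2, so fusing 0 with 2 would deadlock.
print(may_form_cycle(0, 2, {0: set(), 1: {0}, 2: {1}}))   # → True

# False positive the heuristic tolerates: the edges do not connect,
# so fusing 0 with 3 would actually be safe, but it is rejected.
print(may_form_cycle(0, 3, {0: set(), 1: {0}, 2: set(), 3: {2, 0}}))  # → True
```

The second example shows the cost of being conservative: some safe pairs are rejected, which is consistent with the slide's claim of capturing over 90% (not all) of the opportunities found by precise detection.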
MOP formation
- Locates MOP pairs using the MOP pointers fetched along with instructions
- Converts register dependences into MOP dependences: architected register IDs map to MOP IDs, exactly as in register renaming except that a single ID is assigned to two groupable instructions
- This reflects that the two instructions form one scheduling unit; they are later inserted into one issue queue entry
Scheduling MOPs
- The instructions in a MOP are scheduled as a single unit: from the scheduler's perspective, a MOP is a non-pipelined 2-cycle operation
- A MOP issues when all of its source operands are ready, and incurs only one tag broadcast
Wakeup / select timings:
[Timeline: dependence graph in which instruction 1 feeds 2 and 3, and 3 feeds 4.
Atomic scheduling: cycle n: select 1, wakeup 2, 3; n+1: select 2, 3, wakeup 4; n+2: select 4.
2-cycle scheduling: n: select 1; n+1: wakeup 2, 3; n+2: select 2, 3; n+3: wakeup 4; n+4: select 4.
2-cycle MOP scheduling, with 1 and 3 fused: n: select MOP(1, 3); n+1: wakeup 2, 4; n+2: select 2, 4.]
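The timeline above can be reproduced with a small issue-time model (my own approximation, not the paper's scheduler) for single-cycle instructions under a configurable scheduling-loop length:

```python
def issue_cycles(deps, latency, fused=()):
    """deps[i]: producers of single-cycle instruction i (program order).
    latency: scheduling loop length in cycles (1 = atomic, 2 = pipelined).
    fused: (head, tail) pairs treated as one 2-cycle MOP with a single
    tag broadcast timed off the head (assumes latency == 2 when used)."""
    head_of = {t: h for h, t in fused}
    issue = {}
    for i in sorted(deps):
        if i in head_of:
            # A MOP tail executes in its head's second cycle.
            issue[i] = issue[head_of[i]] + 1
            continue
        # A consumer issues one scheduling-loop length after the broadcast
        # of its producer (the producer's MOP head, if it was fused).
        waits = [issue[head_of.get(p, p)] + latency for p in deps[i]]
        issue[i] = max(waits, default=0)
    return issue

deps = {1: set(), 2: {1}, 3: {1}, 4: {3}}
print(issue_cycles(deps, 1))                   # atomic:   {1: 0, 2: 1, 3: 1, 4: 2}
print(issue_cycles(deps, 2))                   # 2-cycle:  {1: 0, 2: 2, 3: 2, 4: 4}
print(issue_cycles(deps, 2, fused=((1, 3),)))  # MOP(1,3): {1: 0, 2: 2, 3: 1, 4: 2}
```

The fused case matches the slide's timeline: MOP(1, 3) selected at cycle 0, and both 2 and 4 selected at cycle 2, two cycles earlier than plain 2-cycle scheduling manages for instruction 4.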
Sequencing instructions
- A MOP is converted back into its two original instructions: the dual-entry payload RAM supplies both, and they execute sequentially within 2 cycles
- Register values are accessed using physical register IDs
- The ROB commits the original instructions separately and in order, so MOPs do not affect precise exceptions or branch misprediction recovery
Machine parameters
- Simplescalar-Alpha-based 4-wide out-of-order core with speculative scheduling and selective replay; 14 pipeline stages
- Ideally pipelined scheduler: conceptually equivalent to atomic scheduling plus 1 extra stage
- 128-entry ROB; unrestricted or 32-entry issue queue
- 4 ALUs, 2 memory ports, 16K IL1 (2), 16K DL1 (2), 256K L2 (8), memory (100)
- Combined branch prediction; fetch continues until the first taken branch
MOP scheduling configuration:
- 2-cycle (pipelined) scheduling plus the 2x MOP technique
- 2 (conventional) or 3 (wired-OR) source operands
- MOP detection scope: 2 cycles (4-wide x 2 cycles = up to 8 instructions)
Workloads: SPEC2000 integer with reduced input sets; reference inputs for crafty, eon, and gap (up to 3B instructions)
# grouped instructions
[Charts: breakdown of total committed dynamic instructions for the 2-src and 3-src configurations across the SPEC2000 integer benchmarks: MOP, independent MOP, MOP candidate but not grouped, and not a MOP candidate.]
- 28~46% of total instructions are grouped
- 14~23% reduction in the instruction count seen by the scheduler
- Dependent MOPs enable consecutive issue of dependent instructions
MOP scheduling performance (relaxed atomicity constraint only)
- Plain 2-cycle scheduling loses up to ~19% of IPC
- MOP scheduling restores the loss by enabling consecutive issue of dependent instructions: 97.2% of atomic scheduling performance on average
[Chart: IPC of 2-cycle, MOP-2src, and MOP-3src scheduling, normalized to base (atomic) scheduling, for each benchmark; unrestricted issue queue, 128-entry ROB.]
Insight into MOP scheduling
- The performance loss of 2-cycle scheduling is correlated with dependence edge distance
- With short dependence edges (e.g. gap), the instruction window fills with chains of dependent instructions, and a 2-cycle scheduler cannot find enough ready instructions to issue
- MOP scheduling captures exactly those short-distance dependent pairs, which are the ones that matter
- Low MOP coverage caused by long dependence edges is harmless, because in that case the 2-cycle scheduler already finds many instructions to issue (e.g. vortex)
- MOP scheduling thus complements 2-cycle scheduling, and overall performance becomes less sensitive to code layout
MOP scheduling performance (relaxed atomicity + scalability constraints)
- Benefits from both the relaxed atomicity and the relaxed scalability constraints
- Pipelined 2-cycle MOP scheduling performs comparably to or better than atomic scheduling
[Chart: IPC of 2-cycle, MOP-2src, and MOP-3src scheduling, normalized to base scheduling, for each benchmark; 32-entry issue queue, 128-entry ROB.]
Conclusions & future work
- Changing the processing granularity can relax the constraints imposed by instruction-centric designs
- The instruction scheduling loop suffers from two such constraints: scheduling atomicity and poor scalability
- Macro-op scheduling relaxes both constraints by working at a coarser granularity
- Pipelined, 2-cycle macro-op scheduling can perform comparably to, or even better than, atomic scheduling
- Future work: a narrow-bandwidth microarchitecture that extends the MOP idea to the whole pipeline (dispatch, RF, bypass), e.g. achieving 4-wide machine performance with 2-wide bandwidth
Questions??
Select-free (Brown et al.) vs. MOP scheduling
- MOP scheduling achieves 4.1% better IPC on average than select-free-scoreboard (8.3% in the best case)
- Select-free scheduling cannot outperform atomic scheduling: it is speculative and requires recovery operations
- MOP scheduling is non-speculative, which brings many advantages
[Chart: IPC of select-free-squash-dep, select-free-scoreboard, and MOP-wiredOR, normalized to base scheduling, for each benchmark; 32-entry issue queue, 128-entry ROB, no extra stage for MOP formation.]
MOP detection: MOP pointer generation
Finding dependent pairs:
- Dependence matrix-based detection (detailed in the MICRO paper)
- Insensitive to detection latency, since pointers are reused repeatedly; a pessimistic 100-cycle latency loses 0.22% of IPC
- Similar to instruction preprocessing in trace cache lines
MOP pointers are 4 bits per instruction: a control bit (1 bit) captures up to one control discontinuity, and offset bits (3 bits) give the instruction count from head to tail. Example (each pointer shown as control bit, then offset bits):
0 011: add r1 r2, r3
0 000: lw r4 0(r3)
1 010: and r5 r4, r2
0 000: bez r1, 0xff (taken)
0 000: sub r6 r5, 1
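A minimal sketch of this 4-bit layout (the bit positions are my assumption; the slide only specifies one control bit and three offset bits):

```python
def encode_mop_pointer(control, offset):
    """Pack one control-discontinuity bit and a 3-bit head-to-tail offset."""
    assert control in (0, 1) and 0 <= offset <= 7
    return (control << 3) | offset

def decode_mop_pointer(p):
    """Unpack a 4-bit MOP pointer into (control, offset)."""
    return (p >> 3) & 1, p & 0b111

# "1 010" from the example above: the tail lies 2 instructions ahead,
# across one control discontinuity (the taken branch).
p = encode_mop_pointer(1, 0b010)
print(bin(p), decode_mop_pointer(p))  # → 0b1010 (1, 2)
```

An offset of 0 can naturally serve as "no pairing", matching the all-zero pointers in the example.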
MOP formation: MOP dependence translation
- Assigns a single ID to two MOPable instructions, reflecting that the two are grouped into one unit
- The process and the required structure are identical to register renaming
- Register values are still accessed via the original register IDs
[Figure: register rename table vs. MOP translation table for instructions I1~I4. Renaming maps logical registers to distinct physical registers (p3~p8); MOP translation maps them to MOP IDs (m3~m6), with a single MOP ID allocated to each pair of grouped instructions.]
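The translation can be sketched as a rename-like pass (my own rendering with assumed data shapes: each instruction is a destination register plus source registers, and grouped pairs are given as (head, tail) index pairs). The only difference from renaming is that a tail reuses its head's freshly allocated MOP ID instead of getting its own.

```python
def translate_to_mop_ids(insts, pairs):
    """insts: list of (dest_reg, src_regs); pairs: (head, tail) index pairs.
    Returns (mop_id, source MOP IDs) per instruction."""
    head_of = {t: h for h, t in pairs}
    table = {}              # architected register -> producing MOP ID
    next_id = 0
    out = []
    for i, (dest, srcs) in enumerate(insts):
        src_ids = [table.get(r) for r in srcs]
        if i in head_of:
            mop_id = out[head_of[i]][0]   # tail shares the head's MOP ID
        else:
            mop_id = next_id
            next_id += 1
        if dest is not None:
            table[dest] = mop_id
        out.append((mop_id, src_ids))
    return out

# add r1 <- r2, r3 (head) and and r5 <- r1, r4 (tail) grouped as one MOP:
insts = [("r1", ["r2", "r3"]), ("r4", []), ("r5", ["r1", "r4"])]
print(translate_to_mop_ids(insts, {(0, 2)}))
```

Because head and tail share one ID, a consumer of either instruction wakes up on the single tag broadcast of the MOP, which is exactly what lets the pair occupy one issue queue entry.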
Inserting MOPs into the issue queue
Instructions may need to be paired across different insertion groups:
[Figure: issue queue contents over cycles n, n+1, and n+2, with MOP pointers linking pairs across insertion groups and some instructions held pending until their partners arrive.]
Performance considerations
Independent MOPs:
- Group independent instructions that share the same source dependences
- No direct performance benefit, but it reduces issue queue contention
Last-arriving operands in tail instructions:
- A tail operand that arrives late can unnecessarily delay the head instruction
- The MOP detection logic filters out such harmful groupings and creates an alternative pair when one exists
[Figure: example timelines (CLK 10~19) contrasting a grouping whose late-arriving tail operand delays the head with an alternative grouping that avoids the delay.]
[Backup figure: dependence matrix-based MOP detection. Starting from the original data-dependence graph for instructions 1~12, the matrix is processed in three steps over cycles n, n+1, and n+2, marking head/tail candidates, possible cycles, and detected pairs; non-groupable entries are invalidated, a priority decoder picks one pair when several are detected, and the MOP pointers are produced after step 3.]