Block-Precise Processors
Nagesh B Lakshminarayana, Hyesoon Kim
Block-Precise Processors
| Processors designed for low power
| Architectural state is correct at basic block granularity rather than instruction granularity
Outline
| Background
| B-Processor mechanisms
| Results
| Conclusion
Pipeline Designs
| Depending on when instructions read their source operands, two pipeline designs are possible
  Operand values are read before issue
  Operand values are read after issue
| Issue: instruction sent to functional unit for execution
| Dispatch: instruction inserted into instruction scheduler
Operand Values Are Read Before Issue
| Pipeline has a Data-Capture (DC) Scheduler
  DC Scheduler + ARF + ROB with Data: Intel Nehalem, Intel Core
[Figure: pipeline block diagram with Fetch, Decode and Dispatch; Data-Capture Scheduler; Execution Units; ROB/Rename Buffer; ARF; paths labeled Read, Update, and Bypass and Wake up]
DC Scheduler + ARF + ROB with Data
| Results produced by instructions are copied twice
  First to ROB, on instruction completion
  Then to ARF, on instruction commit
| ROB + ARF consume a significant portion of the total core power
  > 10% [Brooks et al. ISCA 2000]
Goal
| Design mechanism(s) to reduce the power consumption of the ROB + ARF
  Reduce the number of writes to these structures
Related Work
| Change the organization of these structures
  Ports, hierarchical organization, banking [MICRO’92, MICRO’94]
| Reduce accesses to these structures
  Register File Caches [Yung et al., ICCD ’95]
  Reduce writes
    Target short-lived variables (mostly VLIW)
Observation
| Many instruction results within a basic block are not visible outside the basic block
  We call such values BB-Internal values
| Values visible outside a basic block are called BB-External values
  The last value written to a register within a basic block is a BB-External value
Basic Block:
  …
  ADD R1, R2, R3
  SUB R4, R1, R6
  …
  MUL R1, R1, R4
  …
  JGZ R10
[Figure labels: Inst-M, Inst-N]
Dependency Distance
| Dependency Distance (Dep-Distance): integer value defined for every instruction
  For instructions producing BB-Internal value(s) only, it is the distance of the last consumer from the instruction
  For instructions producing BB-External value(s), it is infinite
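
The Dep-Distance definition can be illustrated with a small Python sketch. This is only an illustration under stated assumptions, not the authors' compiler pass: the instruction representation `(dest, sources)`, the function name `dep_distances`, the convention that the first operand is the destination, and the treatment of instructions without a register result (given an infinite distance) are all hypothetical choices made here. A value is treated as BB-Internal when its register is overwritten later in the same basic block.

```python
import math
from typing import List, Optional, Tuple

# Hypothetical representation: (destination register or None, source registers).
Instr = Tuple[Optional[str], List[str]]


def dep_distances(block: List[Instr]) -> List[float]:
    """Return the Dep-Distance of every instruction in one basic block."""
    distances: List[float] = []
    for i, (dest, _) in enumerate(block):
        if dest is None:
            # Assumption: no register result, so treat conservatively as infinite.
            distances.append(math.inf)
            continue
        last_consumer = 0   # distance to the last reader seen so far
        overwritten = False
        for j in range(i + 1, len(block)):
            later_dest, later_srcs = block[j]
            if dest in later_srcs:
                last_consumer = j - i
            if later_dest == dest:
                # A later write kills the value: it is BB-Internal.
                overwritten = True
                break
        # Values never overwritten in the block are BB-External (infinite).
        distances.append(float(last_consumer) if overwritten else math.inf)
    return distances


# The example basic block from the Observation slide, with the elided
# instructions omitted (an assumption made for this sketch).
block = [
    ("R1", ["R2", "R3"]),  # ADD R1, R2, R3
    ("R4", ["R1", "R6"]),  # SUB R4, R1, R6
    ("R1", ["R1", "R4"]),  # MUL R1, R1, R4
    (None, ["R10"]),       # JGZ R10
]
print(dep_distances(block))  # -> [2.0, inf, inf, inf]
```

Here the ADD result is BB-Internal with Dep-Distance 2, because MUL is its last consumer and also overwrites R1. The SUB and MUL results are never overwritten in the block, so they are BB-External.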
Dependency Distance
| Many BB-Internal values become dead shortly after being produced
  i.e., all consumers of a BB-Internal value are found within a short distance of the instruction producing it
  >22% of all instructions produce BB-Internal values only, and those values are consumed within 4 instructions of being produced
[Figure: stacked bar chart (y-axis 0 to 100) per benchmark (perlbench, gcc, gobmk, sjeng, h264ref, astar, gamess, zeusmp, cactusADM, namd, soplex, calculix, tonto, wrf). Legend: BB-External, Dep-Distance > 8, Dep-Distance = [5, 8], Dep-Distance = 4, Dep-Distance = 3, Dep-Distance = 2, Dep-Distance = 1]
Mechanisms: Overview
| Instruction results are broadcast over the bypass network
| If we can guarantee that the instructions dependent on the BB-Internal values produced by an instruction have received those values from the bypass network, then we can skip writing the BB-Internal values to the operand store(s)
Mechanisms: Overview
| If the results of an instruction are not being written to the operand stores (Mechanism #1), then we can stop the broadcast of results beyond the first stage of bypass
Eliminating writes to ROB and ARF
| Assistance of the Compiler
| Changes to ISA
| Changes to hardware
Compiler
| Analyze the lifetime of variables and identify the dep-distance of instructions in basic blocks
ISA Extensions
| Add 2 bits to the instruction encoding
  Compiler passes the dep-distance of instructions via this encoding
  Bits can be encoded in several ways
  Example encoding using powers of 2:
    Encoding  Meaning
    00        Dep-Distance is infinite
    01        1 ≤ Dep-Distance < 2^1    [1]
    10        2^1 ≤ Dep-Distance < 2^2  [2-3]
    11        2^2 ≤ Dep-Distance < 2^3  [4-7]
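
The example encoding can be sketched in Python as follows. The function names `encode_dep_distance` and `max_distance` are hypothetical. The slides do not say how distances of 0 or of 8 and above are encoded; this sketch assumes they fall back to 00 (always write), which is the conservative choice.

```python
import math
from typing import Optional


def encode_dep_distance(dep_distance: float) -> int:
    """Map a Dep-Distance to the 2-bit field of the example encoding."""
    if dep_distance == 1:
        return 0b01
    if 2 <= dep_distance <= 3:
        return 0b10
    if 4 <= dep_distance <= 7:
        return 0b11
    # Infinite (BB-External) per the table. Assumption: distances of 0 and
    # of 8 or more are not encodable and are also mapped to 00.
    return 0b00


def max_distance(code: int) -> Optional[int]:
    """Largest Dep-Distance covered by a 2-bit code, or None for 00."""
    if code == 0b00:
        return None
    return (1 << code) - 1  # 01 -> 1, 10 -> 3, 11 -> 7


print(encode_dep_distance(3), max_distance(0b10))  # -> 2 3
print(encode_dep_distance(math.inf))               # -> 0
```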
Changes to Scheduler
| Add a bit-mask (Presence Vector) to track the presence of instructions in the Scheduler
  Bit-mask is the same size as the ROB
  Bit-mask has head and tail pointers
  First 0 (from tail) in the mask is set when a new instruction is dispatched
  First 1 (from head) in the mask is cleared when an instruction is retired
Changes to Scheduler
| When an instruction is issued, check if all dependent instructions have been dispatched
  If the dep-distance is n, check if the nth bit from the bit for this instruction is set
  If set, then do not write to ROB and ARF
[Figure: Scheduler entries (Ia, Ib, Ic, Id, ...) shown next to the Presence Vector PV (0 1 1 1 1 … 0), with an example check for DD = 3 that hits]
Changes to Scheduler
| d1d0: 2-bit encoding for the instruction
| bx bx-1 … b0: Presence Vector
  d1d0 = 00: must write to ROB and ARF
  d1d0 = 01: dep-distance is 1
  d1d0 = 10: dep-distance in [2,3]
  d1d0 = 11: dep-distance in [4,7]
[Figure: check logic with inputs labeled 01, 10, 11 and Dep-Distance]
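
The Presence Vector and the issue-time check can be sketched in Python as below. This is a software illustration, not the hardware design. The class and method names are hypothetical. Two assumptions go beyond the slides: for a code that covers a range, the sketch checks the bit at the largest distance in the range (1, 3, or 7), and it refuses to skip the write when that bit position would wrap around the circular mask onto an older instruction.

```python
from typing import List


class PresenceVector:
    """Circular bit-mask with one bit per ROB entry (illustrative sketch)."""

    def __init__(self, size: int) -> None:
        self.bits: List[int] = [0] * size
        self.head = 0   # slot of the oldest in-flight instruction
        self.tail = 0   # next free slot
        self.count = 0  # number of in-flight instructions

    def dispatch(self) -> int:
        """Set the first 0 bit at the tail and return the slot used."""
        if self.count == len(self.bits):
            raise RuntimeError("presence vector is full")
        slot = self.tail
        self.bits[slot] = 1
        self.tail = (self.tail + 1) % len(self.bits)
        self.count += 1
        return slot

    def retire(self) -> None:
        """Clear the first 1 bit at the head."""
        if self.count == 0:
            raise RuntimeError("presence vector is empty")
        self.bits[self.head] = 0
        self.head = (self.head + 1) % len(self.bits)
        self.count -= 1

    def can_skip_write(self, slot: int, code: int) -> bool:
        """At issue: True if the ROB/ARF write may be skipped for this slot."""
        if code == 0b00:
            return False  # d1d0 = 00: must write to ROB and ARF
        n = (1 << code) - 1  # assumed: check the far end of the encoded range
        size = len(self.bits)
        age = (slot - self.head) % size
        if age + n >= size:
            return False  # assumed guard: the nth bit would alias an older entry
        return self.bits[(slot + n) % size] == 1


pv = PresenceVector(8)
slots = [pv.dispatch() for _ in range(3)]  # slots 0, 1, 2
print(pv.can_skip_write(slots[0], 0b01))   # -> True  (bit 1 is set)
print(pv.can_skip_write(slots[0], 0b10))   # -> False (bit 3 is not set yet)
```

Returning False in the uncertain cases only costs a write that could have been skipped; it never drops a value that a later instruction might still need to read from the ROB or ARF.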
Issues: Supporting Precise Exceptions
| Precise exceptions are not supported
  Many instructions will not update the architectural state as they are supposed to
  But at the end of a basic block, the architectural state matches the state obtained with regular execution
| Solution: check-point the RF at the end of each basic block; whenever there is an exception, roll back to the start of the basic block and execute in instruction-precise mode
  Use a lightweight RF check-pointing mechanism
Check-pointing Mechanism
| ARF → 2 ARF + 1 Dirty Mask + Several State Masks
  Each bit mask is equal to the size of the ARF
  # of state masks is equal to the maximum number of basic blocks supported by the pipeline + 1
[Figure: the ARF replaced by 2 copies of the ARF (ARF-0 and ARF-1) plus the Dirty and State Masks]
Check-pointing Mechanism
| Dirty mask
  Tracks which registers have been written by the current basic block
| State mask
  Holds the current mapping of registers, i.e., whether the latest value of a register is in ARF0 or in ARF1
| First write to a register in a basic block flips the bit in the state mask
  Register value at the end of the last basic block is untouched
  Subsequent writes to the same register use the current mapping
Results
| MacSim Simulator with integrated McPAT-based tool for modeling power
| Nehalem-like core
  4-wide, 128-entry ROB, 36-entry scheduler, 16 IRegs, 32 FRegs
  22nm
Results
| Power savings for ROB + ARF
  15% over baseline, 7% over RFC-32
  FP benchmarks: B-Processor skips writing many results, and the RFC mechanism writes a lot of live values to the ROB
[Figure: bar chart of total power consumption for ROB + ARFs and other data stores relative to Baseline (y-axis 0 to 1), for RFC-32 and B-Processor, per benchmark (perlbench, gcc, gobmk, sjeng, h264ref, astar, gamess, zeusmp, cactusADM, namd, soplex, calculix, tonto, wrf)]
Results
| Power savings for Bypass Network
  Baseline has two levels of bypass
  10% savings on average
[Figure: bar chart of % saving in power over Baseline for the Bypass Network (y-axis 0 to 40), for B-Processor-C, per benchmark (perlbench, gcc, gobmk, sjeng, h264, astar, bwaves, milc, gromacs, leslie3d, dealII, povray, GemsFDTD, lbm, sphinx3, GMean)]
Conclusion
| ROB + ARF contribute a significant fraction of total power
  Propose a mechanism to reduce their power consumption
| For BB-Internal values, if all dependent instructions read the value off the bypass network, then skip writes to ROB and ARF and skip the broadcast beyond the first stage of bypass
| Mechanism results in correct architectural state at basic block granularity
| Mechanism reduces ROB + ARF power consumption by 15% and bypass power consumption by 10% relative to a conventional design
Thank You!