1
Register Write Specialization
Register Read Specialization: a path to complexity-effective wide-issue superscalar processors
André Seznec, Eric Toullec, Olivier Rochecouste
IRISA / INRIA
2
AS-ET-ORCaps Team
Irisa
Why design wide-issue superscalar processors?
SMT superscalar processors!
3
Doubling the issue width
Functional units: silicon area 2x, power consumption 2x, same latency
Register file: silicon area > 8x, power consumption > 4x, access time 1.5x
Wake-up logic entries: each monitors twice as many inputs (area, consumption, response time)
Bypass network: wider multiplexors, > 2x longer communications
4
An unwritten rule applied to all superscalar processor designs
For general-purpose registers:
Any physical register can be the source or the result of any instruction executed on any functional unit
5
The register file issue
6
Silicon area for the physical register file
7
Conventional clustered design
[Figure: clusters C0, C1, C2, C3 and a single register file]
8
Distributed register file
[Figure: clusters C0, C1, C2, C3]
Local register file: shorter read access time but larger silicon area
9
8-way against 4-way (100 nm, 5 GHz)
Register file        Copies               Power            Access time     Silicon area
8-way distributed    4 identical copies   14.5 W (x 4.5)   4 cycles (+1)   256 x 1792 w^2 x W (x 11)
8-way monolithic     -                    16 W (x 5)       5 cycles (+2)   256 x 1120 w^2 x W (x 8)
4-way distributed    2 identical copies   3.1 W            3 cycles        128 x 320 w^2 x W
10
Let us reduce the number of ports on each individual register
11
Register Write Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]
12
Distributed Register File and Register Write Specialization
[Figure: clusters C0, C1, C2, C3]
13
Register Write Specialization
Each cluster writes only a subset of the registers
Fewer write ports on every individual physical register
But allocation to clusters must precede register renaming
4-cluster 8-way distributed register file, 512 entries:
320 w^2 per register bit
3-cycle access time
8.5 W
14
Register Write Specialization and Register Renaming
1: Op R6, R7 -> R5
2: Op R2, R5 -> R6
3: Op R6, R3 -> R4
4: Op R4, R6 -> R2
4 free odd registers, 4 free even registers
4-bit subset target vector
1: Op L6, L7 -> res1
2: Op L2, res1 -> res2
3: Op res2, L3 -> res3
4: Op res3, res2 -> res4
4 new free registers + old map table
1: Op P6, P7 -> RES1
2: Op P2, RES1 -> RES2
3: Op RES2, P3 -> RES3
4: Op RES3, RES2 -> RES4
New map table
15
Register Write Specialization and Register Renaming (2)
Consumes a lot of registers: need for recycling
1: build two lists of registers to be recycled
2: pack both lists
3: concatenate the two lists
4: append to the free list
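The renaming and recycling steps can be sketched as follows. This is only an illustration under several assumptions that are not in the slides: two register subsets (even and odd physical registers), free lists held in Python deques, a hypothetical subset target vector [1, 0, 0, 0], and the guess that the two recycled lists are the registers released at commit and the registers picked but not used. The names rename_group and recycle are made up for this sketch, and source operand renaming is left out.

```python
from collections import deque

GROUP_SIZE = 4  # instructions renamed per cycle, as in the slide example


def rename_group(dests, target_subsets, free_lists, map_table):
    """Rename the destination registers of one group of instructions.

    dests          -- logical destination register of each instruction
    target_subsets -- subset each result must be written to (0 = even, 1 = odd)
    free_lists     -- one deque of free physical registers per subset
    map_table      -- dict logical -> physical register, updated in place
    Returns the physical destinations and the unused registers per subset.
    """
    # Pick GROUP_SIZE free registers from every subset before knowing
    # how many of them each subset really needs.
    picked = [[fl.popleft() for _ in range(GROUP_SIZE)] for fl in free_lists]
    used = [0] * len(free_lists)
    renamed = []
    for dest, subset in zip(dests, target_subsets):
        phys = picked[subset][used[subset]]
        used[subset] += 1
        map_table[dest] = phys
        renamed.append(phys)
    unused = [regs[n:] for regs, n in zip(picked, used)]
    return renamed, unused


def recycle(free_list, released, unused):
    """Return registers to one subset's free list (None marks an empty slot)."""
    packed_released = [r for r in released if r is not None]  # pack list 1
    packed_unused = [r for r in unused if r is not None]      # pack list 2
    free_list.extend(packed_released + packed_unused)         # concatenate, append


# Destinations of the four instructions of the example: R5, R6, R4, R2.
free_lists = [deque(range(32, 48, 2)), deque(range(33, 48, 2))]
map_table = {}
renamed, unused = rename_group([5, 6, 4, 2], [1, 0, 0, 0], free_lists, map_table)
for subset, fl in enumerate(free_lists):
    recycle(fl, released=[], unused=unused[subset])
print(renamed, map_table)
```

Eight registers leave the free lists but only four are consumed, which is why the recycling path is needed on every rename group.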
16
Register Write Specialization and Register Renaming (3)
An alternative:
Compute the number of registers needed in each register subset
Pick the right number of registers from each of the free lists
No need for recycling registers
Think about round-robin distribution!
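A minimal sketch of this alternative, assuming four register subsets and free lists held in deques. The function pick_exact and the register layout are hypothetical and only serve the illustration.

```python
from collections import Counter, deque


def pick_exact(target_subsets, free_lists):
    """Take from each free list exactly the registers its subset needs."""
    need = Counter(target_subsets)  # registers needed per subset
    picked = {s: [free_lists[s].popleft() for _ in range(n)]
              for s, n in need.items()}
    cursor = {s: iter(regs) for s, regs in picked.items()}
    # Hand the registers out in program order.
    return [next(cursor[s]) for s in target_subsets]


def make_free_lists():
    # Hypothetical layout: subset s owns registers s, s + 4, s + 8, s + 12.
    return [deque(range(s, 16, 4)) for s in range(4)]


free_lists = make_free_lists()
unbalanced = pick_exact([1, 1, 0, 1], free_lists)

# Round-robin allocation: instruction n of the group goes to cluster n,
# so every subset always needs exactly one register per group.
round_robin = pick_exact([0, 1, 2, 3], make_free_lists())
print(unbalanced, round_robin)
```

With round-robin allocation the per-subset counts are known in advance, so the counting step disappears altogether.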
17
Performance issues
Register Write Specialization only, with round-robin allocation:
• no extra stage for register renaming
• shorter register access time
Overall shorter pipeline: slightly better performance
18
Register Read Specialization
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1]
19
Register Read Specialization
Limits the number of read ports on each individual register
Puts strong constraints on the allocation of instructions to clusters
Caution: the interconnection topology must ensure that every instruction is executable
Personal opinion: don't use it alone!
20
WSRS architectures
Combining Register Read Specialization and Register Write Specialization
21
4-cluster WSRS architecture
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]
The positions of an instruction's operands determine the execution cluster
22
4-cluster WSRS architecture: allocating instructions to clusters
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]
Op: R6, R7 -> R5 (S1, S2 -> S0)
The first operand determines the top or bottom bicluster
The second operand determines the left or right bicluster
23
4-cluster WSRS architecture: allocating instructions to clusters (2)
$I: S_i, S_j \rightarrow S_k$
$i = 2 i_1 + i_0$
$j = 2 j_1 + j_0$
$k = 2 i_1 + j_0$
Op: R6, R7 -> R5 (S1, S2 -> S0)
The computations of the two bits are independent :-)
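The allocation rule can be sketched in a few lines. The sketch assumes that cluster Ck writes subset Sk, that the high bit of the first operand's subset number selects the top or bottom bicluster, and that the low bit of the second operand's subset number selects the left or right bicluster. This reading agrees with the slide example (S1, S2 -> S0), but the function name and the bit naming are assumptions.

```python
def execution_cluster(first_subset, second_subset):
    """Cluster that must execute an instruction, given its operand subsets."""
    i1 = (first_subset >> 1) & 1   # first operand: top or bottom bicluster
    j0 = second_subset & 1         # second operand: left or right bicluster
    return 2 * i1 + j0


# Example from the slides: operands in S1 and S2, result written in S0.
print(execution_cluster(1, 2))

# Every cluster is reachable from exactly 4 of the 16 operand-subset pairs.
pairs_per_cluster = [0] * 4
for i in range(4):
    for j in range(4):
        pairs_per_cluster[execution_cluster(i, j)] += 1
print(pairs_per_cluster)
```

Each bit of the cluster number depends on a single operand, so the two bits can be computed in parallel.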
24
4-cluster 8-way WSRS architecture: the register file
Each individual physical register:
4 identical copies of (2-read, 3-write) registers
8x smaller than the conventional monolithic approach
12.8x smaller than the conventional distributed approach
WSRS: 512 registers, 6.25 W, 3 cycles
Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)
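A quick consistency check of these ratios, assuming that the figures 1120 w^2 and 1792 w^2 from the 8-way against 4-way comparison are areas per register bit. The resulting WSRS area per bit is inferred here and is not stated on the slides.

```python
import math

MONOLITHIC_BIT_AREA = 1120    # w^2, 8-way monolithic register file
DISTRIBUTED_BIT_AREA = 1792   # w^2, 8-way distributed register file (4 copies)

# Both ratios should point at the same WSRS area per register bit.
wsrs_from_monolithic = MONOLITHIC_BIT_AREA / 8
wsrs_from_distributed = DISTRIBUTED_BIT_AREA / 12.8
assert math.isclose(wsrs_from_monolithic, wsrs_from_distributed)
print(wsrs_from_monolithic)   # 140.0

# WSRS uses 512 registers instead of 256, so the gain on the whole
# register file is half the gain on one register.
total_gain = (256 * MONOLITHIC_BIT_AREA) / (512 * wsrs_from_monolithic)
print(total_gain)

# Power of the conventional register files relative to WSRS (6.25 W).
print(16 / 6.25, 14.5 / 6.25)
```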
25
4-cluster 8-way WSRS architecture: the wake-up logic
The wake-up logic monitors all possible sources for each operand
FUs from only two clusters are possible sources: only 6 possible sources!
8-way WSRS architecture wake-up logic entry complexity = 4-way issue wake-up logic entry complexity
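The count of 6 sources can be reproduced with a small sketch. It assumes that a cluster reads its first operand from the two subsets sharing its high bit and its second operand from the two subsets sharing its low bit, that cluster Ck writes subset Sk, and that each cluster produces up to 3 results per cycle (one per write port of the "3-write" registers). None of these names come from the slides.

```python
RESULTS_PER_CLUSTER = 3  # assumption: one result per write port ("3-write")


def source_clusters(cluster):
    """Clusters whose results can feed each operand of `cluster`."""
    k1, k0 = cluster >> 1, cluster & 1
    first_operand = [2 * k1, 2 * k1 + 1]   # subsets sharing the high bit k1
    second_operand = [k0, 2 + k0]          # subsets sharing the low bit k0
    return first_operand, second_operand


for cluster in range(4):
    first, second = source_clusters(cluster)
    # Two producer clusters per operand, three results each.
    print(cluster, first, second, len(first) * RESULTS_PER_CLUSTER)
```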
26
4-cluster 8-way WSRS architecture: bypass network
Possible sources for each operand: FUs from only two clusters are possible sources
Bypass points: (pipeline length) x (possible FU sources) + register file
8-way distributed: 4 cycles, 49 pos. op.
WSRS: 3 cycles, 19 pos. op.
8-way monolithic: 5 cycles, 61 pos. op.
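The three totals follow from the bypass point formula. In this sketch the 6 FU sources for WSRS come from the wake-up logic slide, while the 12 FU sources for the conventional designs are inferred from the totals (49 and 61) rather than stated on the slides.

```python
def bypass_points(pipeline_length, fu_sources):
    """Possible origins of one operand: in-flight results plus the register file."""
    return pipeline_length * fu_sources + 1


designs = {
    "8-way distributed": (4, 12),
    "WSRS": (3, 6),
    "8-way monolithic": (5, 12),
}
for name, (cycles, sources) in designs.items():
    print(name, bypass_points(cycles, sources))
```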
27
4-cluster WSRS architecture: fast-forwarding
Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle
Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!
Complete fast-forwarding: the consumer is close, so it may be possible to implement!
28
4-cluster WSRS architecture: nothing is entirely free!
Strong constraint on the allocation of instructions to clusters:
The cluster executing a dyadic instruction depends on the position of its operands in the register subsets.
Degrees of freedom:
Monadic instructions can be executed on two clusters
One out of two commutative dyadic instructions can be executed on two clusters
Design clusters able to execute instructions in two forms?
• A-B and -B + A
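The degrees of freedom can be sketched by enumerating the clusters allowed to execute an instruction. The sketch assumes that the first operand's subset fixes the high bit of the cluster number and the second operand's subset fixes the low bit, and that a monadic instruction reads its single operand as a first operand, which leaves the low bit free. The function names are hypothetical, and the sketch makes no claim about how often the two choices differ in real code.

```python
def cluster_of(first_subset, second_subset):
    """Cluster imposed by the operand subsets of a dyadic instruction."""
    return 2 * ((first_subset >> 1) & 1) + (second_subset & 1)


def candidate_clusters(subsets, commutative=False):
    """Clusters allowed to execute an instruction with the given operand subsets."""
    if len(subsets) == 1:
        # Monadic: only the bit fixed by the operand is imposed.
        (i,) = subsets
        high = (i >> 1) & 1
        return [2 * high, 2 * high + 1]
    i, j = subsets
    clusters = {cluster_of(i, j)}
    if commutative:
        # Swapping the operands may select another cluster.
        clusters.add(cluster_of(j, i))
    return sorted(clusters)


print(candidate_clusters([0]))                        # monadic, operand in S0
print(candidate_clusters([0, 2], commutative=True))   # S0 op S2
print(candidate_clusters([1, 2]))                     # non-commutative
```

A load balancing policy would then pick the least loaded cluster among the candidates.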
29
Using monadic instructions for load balancing
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3; label: S0 or S1]
30
Commutativity for load balancing
[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3; label: S0 op S2]
31
4-cluster WSRS architecture: nothing comes for free (2)
Extra free lists and associated logic
Extra pipeline stage(s):
Instructions must be allocated to clusters before the last step in register renaming: +3 cycles
But shorter register access time: -2 cycles
32
Performance issues on 4-way WSRS architectures
Workload may be unbalanced among the clusters: use the degrees of freedom
• monadic instructions
• « commutative » clusters
Higher probability of local consumption of a register
Naive allocation policies on WSRS compete favorably with naive policies on a conventional architecture
33
Summary
Register Write Specialization
limits the number of write ports on each physical register
leads naturally to the use of a distributed register file
keeps power consumption, silicon area and access time under control
But
Some extra complexity in register renaming
34
Summary (2)
Register Write Specialization + Register Read Specialization
Further limits the number of ports on each physical register
keeps power consumption, silicon area and access time under control
Side effects:
• keeps wake-up logic and bypass network complexity under control
But constrains the allocation of instructions to clusters
35
Future work
Intelligent instruction allocation policies
Exploration of other possible interconnections
Use of heterogeneous clusters
SMT mode