
Register Write Specialization

Register Read Specialization

A path to complexity-effective wide-issue superscalar processors

André Seznec, Eric Toullec, Olivier Rochecouste

IRISA/ INRIA

AS-ET-ORCaps Team

Why design wide-issue superscalar processors?

SMT Superscalar Processors !


Doubling the issue width

Functional units:

• Silicon area: 2x
• Power consumption: 2x
• Same latency

Register file:

• Silicon area: > 8x
• Power consumption: > 4x
• Access time: 1.5x

Wake-up logic entries: each monitors twice as many inputs (area, consumption, response time)

Bypass network: wider multiplexors, > 2x longer communications


An unwritten rule applied on all superscalar processor designs

For general purpose registers:

Any physical register can be the source or the result of any instruction executed on any functional unit


The register file issue


Silicon area for the physical register file


Conventional clustered design

[Figure: clusters C0, C1, C2, C3 and the register file]


Distributed register file

[Figure: clusters C0, C1, C2, C3, each with a local register file]

Local register file: shorter read access time but larger silicon area


8-way against 4-way (100 nm, 5 GHz)

Register file        Copies               Power            Access time     Silicon area
4-way distributed    2 identical copies   3.1 W            3 cycles        128 x 320 w² x W
8-way monolithic     -                    16 W (x 5)       5 cycles (+2)   256 x 1120 w² x W (x 8)
8-way distributed    4 identical copies   14.5 W (x 4.5)   4 cycles (+1)   256 x 1792 w² x W (x 11)


Let us reduce the number of ports on each individual register


[Figure: clusters C0, C1, C2, C3 and register subsets S0, S1, S2, S3]


Distributed Register File and Register Write Specialization

[Figure: clusters C0, C1, C2, C3]


Each cluster writes only a subset of the registers

Fewer write ports on every individual physical register

But allocation to clusters must precede register renaming

4-cluster 8-way distributed register file:

• 512 entries
• 320 w² per register bit
• 3 cycles access time


Register Write Specialization and Register Renaming

1: Op R6, R7 -> R5
2: Op R2, R5 -> R6
3: Op R6, R3 -> R4
4: Op R4, R6 -> R2

4 free odd registers, 4 free even registers

4-bit subset target vector

1: Op L6, L7 -> res1
2: Op L2, res1 -> res2
3: Op res2, L3 -> res3
4: Op res3, res2 -> res4

4 new free registers

+ Old map table

1: Op P6, P7 -> RES1
2: Op P2, RES1 -> RES2
3: Op RES2, P3 -> RES3
4: Op RES3, RES2 -> RES4

New map table
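The renaming flow on this slide can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' hardware: the names `Instr` and `rename_group`, the identity initial map table, the physical register numbers, and the target vector value `[0, 1, 0, 1]` are all hypothetical. Only the 4-instruction group, the even/odd free registers, and the one-bit-per-instruction target vector come from the slide.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    src1: int  # logical source register numbers
    src2: int
    dst: int   # logical destination register number

def rename_group(group, targets, map_table, free_even, free_odd):
    """Rename one group of instructions in program order.

    targets[i] selects the subset for the result of instruction i
    (0 = even registers, 1 = odd registers). It is known before
    renaming because cluster allocation precedes renaming.
    """
    n = len(group)
    # Take n candidates from each free list (4 even and 4 odd on the slide).
    candidates = {0: [free_even.pop(0) for _ in range(n)],
                  1: [free_odd.pop(0) for _ in range(n)]}
    new_map = dict(map_table)
    renamed = []
    for instr, subset in zip(group, targets):
        # Sources see the results of earlier instructions in the group.
        p1, p2 = new_map[instr.src1], new_map[instr.src2]
        pd = candidates[subset].pop(0)
        new_map[instr.dst] = pd
        renamed.append((p1, p2, pd))
    unused = candidates[0] + candidates[1]  # registers left to recycle
    return renamed, new_map, unused

# The group from the slide, with an assumed identity map (Ln -> Pn).
group = [Instr(6, 7, 5), Instr(2, 5, 6), Instr(6, 3, 4), Instr(4, 6, 2)]
old_map = {r: r for r in range(8)}
renamed, new_map, unused = rename_group(
    group, [0, 1, 0, 1], old_map, [8, 10, 12, 14], [9, 11, 13, 15])
print(renamed)  # (src1, src2, dst) physical registers per instruction
print(unused)   # candidates that were taken but not consumed
```

In this sketch half of the candidate registers are never used, which is the cost the next slide addresses.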


Register Write Specialization and Register Renaming (2)

Consumes a lot of registers: need for recycling

1: build two lists of registers to be recycled
2: pack both lists
3: concatenate the two lists
4: append to the free list


Register Write Specialization and Register Renaming (3)

An alternative:

• Compute the number of registers needed in each register subset
• Pick the right number of registers from each of the free lists
• No need for recycling registers

Think about round-robin distribution!
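A small Python sketch of this alternative, assuming four subsets with one free list each. The helper names `round_robin_targets` and `allocate_exact` and the register numbering (register number modulo 4 gives the subset) are illustrative assumptions. The point it shows is that with round-robin distribution the demand on each free list is fixed, so exactly the right number of registers can be picked and nothing is left to recycle.

```python
from collections import Counter

def round_robin_targets(first_subset, n_instr, n_subsets=4):
    """Target subset of each instruction under round-robin distribution."""
    return [(first_subset + i) % n_subsets for i in range(n_instr)]

def allocate_exact(targets, free_lists):
    """Pop from each free list exactly the number of registers needed."""
    demand = Counter(targets)  # registers needed per subset
    picked = {s: [free_lists[s].pop(0) for _ in range(k)]
              for s, k in demand.items()}
    # Hand the picked registers out in program order.
    return [picked[s].pop(0) for s in targets]

# Assumed numbering: physical register r belongs to subset r % 4.
free_lists = {s: [s + 4 * k for k in range(8, 12)] for s in range(4)}
targets = round_robin_targets(first_subset=2, n_instr=8)
dests = allocate_exact(targets, free_lists)
print(Counter(targets))  # every subset is asked for the same count
print(dests)
```

For an 8-instruction group over 4 subsets, each free list always supplies 2 registers whatever the starting subset, so the count does not need to be computed at run time.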


Performance issues

Register Write Specialization only, with round-robin allocation:

• no extra stage for register renaming
• shorter register access time

Overall shorter pipeline: slightly better performance


Register Read Specialization

[Figure: clusters C0, C1, C2, C3]


Register Read Specialization

Limits the number of read ports on each individual register

Puts strong constraints on the allocation of instructions to clusters

Caution:

Personal opinion: don’t use it alone!

The interconnection topology must ensure that every instruction is executable


WSRS architectures

Combining Register Read Specialization and Register Write Specialization


4-cluster WSRS architecture

[Figure: clusters and register subsets of the 4-cluster WSRS architecture]

The positions of an instruction's operands determine the execution cluster


4-cluster WSRS architecture: allocating instructions to clusters

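[Figure: clusters and register subsets of the 4-cluster WSRS architecture]

Op: R6, R7 -> R5 (subsets: S1, S2 -> S0)

The first operand determines the top or bottom bicluster

The second operand determines the left or right bicluster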


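4-cluster WSRS architecture: allocating instructions to clusters (2)

[Formula not recoverable from the extracted slide: it involves an instruction I and register subsets S with indices i, j, k]

Op: R6, R7 -> R5 (subsets: S1, S2 -> S0)

The two bits are computed independently :-)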
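The two-bit cluster selection can be illustrated with a short Python sketch. The exact wiring is in a figure that did not survive extraction, so the encoding below is an assumption chosen to reproduce the slide's example (operands in S1 and S2, result in S0): the high bit of the first operand's subset number selects top or bottom, the low bit of the second operand's subset number selects left or right, and cluster Ci writes subset Si. The function names are hypothetical.

```python
def execution_cluster(first_subset: int, second_subset: int) -> int:
    """Cluster index from the subsets holding the two operands.

    Assumed encoding: each bit depends on one operand only, so the
    two bits can be computed independently.
    """
    top_bottom = (first_subset >> 1) & 1  # from the first operand only
    left_right = second_subset & 1        # from the second operand only
    return (top_bottom << 1) | left_right

def readable_subsets(cluster: int):
    """Subsets a cluster reads for its first and second operand."""
    first = [s for s in range(4) if ((s >> 1) & 1) == (cluster >> 1)]
    second = [s for s in range(4) if (s & 1) == (cluster & 1)]
    return first, second

# Slide example: Op R6, R7 -> R5 with operands in S1 and S2.
print(execution_cluster(1, 2))  # cluster 0, which writes subset S0

# Every (first, second) subset pair maps to exactly one cluster.
pairs_per_cluster = [0, 0, 0, 0]
for a in range(4):
    for b in range(4):
        pairs_per_cluster[execution_cluster(a, b)] += 1
print(pairs_per_cluster)
print(readable_subsets(0))
```

Under this assumption each operand position of a cluster reads from only two subsets, and all 16 operand placements remain executable.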


4-cluster 8-way WSRS architecture: the register file

Each individual physical register:

• 4 identical copies of (2-read, 3-write) registers
• 8x smaller than the conventional monolithic approach
• 12.8x smaller than the conventional distributed approach

WSRS: 512 registers, 6.25 W, 3 cycles

Conventional: 256 registers, (16 W, 5 cycles) or (14.5 W, 4 cycles)


4-cluster 8-way WSRS architecture: the wake-up logic

The wake-up logic monitors all possible sources for each operand

FUs from only two clusters are possible sources: only 6 possible sources!

8-way WSRS architecture wake-up logic entry complexity = 4-way issue wake-up logic entry complexity


4-cluster 8-way WSRS architecture: bypass network

Possible sources for each operand: FUs from only two clusters are possible sources

Bypass points: (pipeline length) x (possible FU sources) + register file

Design               Cycles   Possible sources per operand
8-way distributed    4        49
WSRS                 3        19
8-way monolithic     5        61
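The three figures on this slide can be checked against the formula with a few lines of Python. The 6 FU sources for WSRS are stated on the wake-up logic slide. The 12 FU sources for the conventional 8-way designs are not stated in the slides and are inferred here by solving the formula backwards, so treat that value as an assumption. The function name is hypothetical.

```python
def bypass_points(cycles: int, fu_sources: int) -> int:
    """(pipeline length) x (possible FU sources) + register file."""
    return cycles * fu_sources + 1

# fu_sources = 12 for the conventional designs is inferred, not quoted.
designs = {
    "8-way distributed": (4, 12),
    "WSRS": (3, 6),
    "8-way monolithic": (5, 12),
}
for name, (cycles, fu_sources) in designs.items():
    print(name, bypass_points(cycles, fu_sources))
```

The WSRS design gains on both factors at once: fewer cycles to cover and half as many functional units that can feed each operand.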


4-cluster WSRS architecture: fast-forwarding

Local fast-forwarding inside a single cluster: 2 out of 4 consumers are reached on the next cycle

Partial fast-forwarding inside a pair of adjacent clusters: 3 out of 4 consumers are reached on the next cycle!

Complete fast-forwarding: the consumer is close, so it may be possible to implement!


4-cluster WSRS architecture: nothing is entirely free!

Strong constraint on the allocation of instructions to clusters: the cluster executing a dyadic instruction depends on the position of its operands in the register subsets.

Degrees of freedom:

• Monadic instructions can be executed on two clusters
• One out of two commutative dyadic instructions can be executed on two clusters
• Design clusters able to execute instructions in two forms? For example, A - B and -B + A
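A hedged Python sketch of how these degrees of freedom could feed a load-balancing choice. The cluster encoding is the same assumption as elsewhere in this transcript's examples (high bit of the first operand's subset, low bit of the second operand's subset), not the wiring from the lost figure. Treating a monadic operand as a first operand, the helper names, and the load values are also illustrative assumptions.

```python
def execution_cluster(first_subset, second_subset):
    # Assumed encoding: one bit from each operand's subset number.
    return (((first_subset >> 1) & 1) << 1) | (second_subset & 1)

def candidate_clusters(first, second=None, commutative=False):
    """Clusters on which an instruction may execute."""
    if second is None:
        # Monadic: only the top/bottom bit is fixed by the operand.
        return sorted({execution_cluster(first, s) for s in range(4)})
    clusters = {execution_cluster(first, second)}
    if commutative:
        # Swapping the operands may select a different cluster.
        clusters.add(execution_cluster(second, first))
    return sorted(clusters)

def pick_cluster(candidates, load):
    """Choose the least loaded cluster among the legal ones."""
    return min(candidates, key=lambda c: load[c])

load = [5, 1, 2, 0]  # hypothetical pending instructions per cluster
print(candidate_clusters(0))                       # monadic, operand in S0
print(candidate_clusters(0, 2, commutative=True))  # S0 op S2
print(candidate_clusters(1, 2))                    # non-commutative dyadic
print(pick_cluster(candidate_clusters(0, 2, commutative=True), load))
```

A non-commutative dyadic instruction has a single legal cluster, so any balancing has to come from the monadic and commutative cases.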


Using monadic instructions for load balancing

S0 or S1


Commutativity for load balancing

S0 op S2


4-cluster WSRS architecture: nothing comes for free (2)

Extra free lists and associated logic

Extra pipeline stage(s):

• Instructions must be allocated to clusters before the last step in register renaming: + 3 cycles
• But shorter register access time: - 2 cycles


Performance issues on 4-way WSRS architectures

Workload may be unbalanced among the clusters. Use the degrees of freedom:

• monadic instructions
• « commutative » clusters

Higher probability of local consumption of a register

Naive allocation policies on WSRS compete favorably with naive policies on conventional architectures


Summary

Register Write Specialization:

• limits the number of write ports on each physical register
• leads naturally to a distributed register file
• keeps power consumption, silicon area and access time under control

Some extra complexity in register renaming


Summary (2)

Register Write Specialization + Register Read Specialization:

• further limits the number of ports on each physical register
• keeps power consumption, silicon area and access time under control

Side effect: keeps wake-up logic and bypass network complexity under control

But constrains the allocation of instructions to clusters


Future work

Intelligent instruction allocation policies

Exploration of other possible interconnections

Use of heterogeneous clusters

SMT mode
