Register Write Specialization Register Read Specialization: A Path to Complexity-Effective Wide-Issue Superscalar Processors

André Seznec, Eric Toullec, Olivier Rochecouste
{seznec, etoullec, orocheco}@irisa.fr

IRISA/INRIA, Rennes, France

Abstract

With the continuous shrinking of transistor size, processor designers are facing new difficulties in achieving a high clock frequency. The register file read time, the wake-up and selection logic traversal delay, and the bypass network transit delay, together with their respective power consumptions, constitute major difficulties for the design of wide-issue superscalar processors.

In this paper, we show that transgressing a rule that has so far been applied in the design of all superscalar processors reduces these difficulties. Currently used general-purpose ISAs feature a single logical register file (and generally a floating-point register file). Up to now, all superscalar processors have allowed any general-purpose functional unit to read and write any physical general-purpose register.

First, we propose Register Write Specialization, i.e., forcing distinct groups of functional units to write only into distinct subsets of the physical register file, thus limiting the number of write ports on each individual register. Register Write Specialization significantly reduces the access time, the power consumption and the silicon area of the register file without impairing performance.

Second, we propose to combine Register Write Specialization with Register Read Specialization for clustered superscalar processors. This limits the number of read ports on each individual register and simplifies both the wake-up logic and the bypass network. With an 8-way 4-cluster WSRS architecture, the complexities of a wake-up logic entry and of a bypass point are equivalent to those found in a conventional 4-way issue processor. More physical registers are needed in WSRS architectures. Nevertheless, the WSRS architecture allows a dramatic reduction of the total silicon area devoted to the physical register file (by a factor of four to six). Its power consumption is more than halved and its read access time is shortened by one third. Some extra hardware and/or a few extra pipeline stages are needed for register renaming. The WSRS architecture induces constraints on the policy for allocating instructions to clusters. However, the performance of an 8-way 4-cluster WSRS architecture stands comparison with that of a conventional 8-way 4-cluster superscalar processor.

(*) This work was partially supported by an Intel grant.

1 Introduction

The physical register file, the bypass network and the selection and wake-up logic have become a burden for the design of general-purpose dynamically scheduled superscalar processors [14]. Access to the register file is now pipelined across several cycles (e.g., on the Pentium 4 [10]). Pipelining is also considered for the wake-up and selection logic [18], and only limited fast-forwarding capability is implemented on clustered processors, e.g., the Alpha 21264 [11]. At constant issue width, this trend will continue with the advance of integration technology. It will be further emphasized on wider-issue processors.

Unlike some VLIW ISAs (e.g., Multiflow [3]), the ISAs currently used on PCs, workstations and servers feature a single logical general-purpose register file, and generally a second logical register file for floating-point registers. From the code generator perspective, this general-purpose register file is central to the architecture, since the operands of every integer operation are read from this register file and results are written to it. This central view of the general-purpose register file has also been adopted for the hardware implementation of the physical register file in dynamically scheduled superscalar processors. The following unwritten rule has always been applied:
every general-purpose physical register can be the source or the result of any instruction executed on any integer functional unit.

This rule has been applied for implementing processors using a centralized monolithic physical register file as well as for clustered processors implementing a distributed register file, e.g., the Alpha 21264. In the latter case, a copy of the physical register file is associated with each cluster (Figure 1). Each of these copies features only a fraction of the total number of read ports, but features the same number of write ports as in the monolithic register file case.


Figure 1. Monolithic versus clustered register file organization. (a) A monolithic register file shared by all functional units; (b) a clustered register file with one copy per cluster (C0, C1, C2, C3).

In this paper, we show that this unwritten design rule can be transgressed on dynamically scheduled processors. The set of physical registers can be divided into distinct subsets that are read-connected (respectively, write-connected) with only a subset of the entries (respectively, the exits) of the functional units. The number of write and read ports on each individual physical register, and the overall complexities of the physical register file, the bypass network and the wake-up logic, are decreased.

Register write specialization
Distinct clusters of functional units, or distinct pools of functional units, can be forced to write into distinct subsets of the set of physical registers (Figure 2). We refer to this principle as register write specialization.

Whenever an instruction is executed on cluster Ci (or a pool of functional units), its result is written into a physical register from register subset Si. Each individual physical register is write-connected with only a subset of the functional units. Register write specialization significantly reduces the silicon area devoted to the physical register file, decreases its read access time and reduces its power consumption.

Register write specialization links register renaming with the allocation of instructions to clusters. However, for simple policies for allocating instructions to clusters, such as round-robin, it does not impair performance at all, provided that the number of physical registers is sufficiently increased. For more complex dynamic instruction allocation policies, a few extra pipeline stages may be needed for register write specialization.

WSRS architectures
With a clustered superscalar processor, one can also constrain the clusters to read their operands from only a subset of the physical registers, provided that each instruction remains executable by at least one of the clusters. This limits the number of read ports on each individual register.

We refer to this method as register read specialization, and to clustered architectures implementing the combination of register write specialization and register read specialization as WSRS (register Write Specialization, register Read Specialization) architectures. A 4-cluster WSRS architecture is illustrated in Figure 3.

Figure 3. A 4-cluster WSRS architecture. Clusters C0, C1, C2 and C3 write their results into register subsets S0, S1, S2 and S3 respectively; the first and second operands of each cluster are read from fixed pairs of subsets.

The first operand of an instruction executed on cluster C1 is read from a physical register belonging to subset S0 or to subset S1 (i.e., it has been produced by cluster C0 or cluster C1). Therefore, (a) the bypass point associated with the first operand entry of a C1 functional unit only has to be connected with the output buses of the C0 and C1 functional units, and (b) for checking the validity of this operand, the wake-up logic only has to monitor the executions on clusters C0 and C1.


Figure 2. Register Write Specialization. (a) Distinct clusters (C0 to C3) write into distinct register subsets (S0 to S3); (b) pools of functional units (load/store units, simple ALUs, complex ALUs, branch units) write into distinct register subsets.

The complexity of a bypass point (respectively, of a wake-up logic entry) on the 4-cluster WSRS architecture is equivalent to the complexity of a bypass point (respectively, a wake-up logic entry) of a conventional 2-cluster superscalar processor.

Moreover, the physical registers from subset S1 are only read-connected with the first operand entries of the functional units from clusters C0 and C1 and with the second operand entries of the functional units from clusters C1 and C3. Therefore, the register file of the 4-cluster WSRS architecture exhibits reduced silicon area, reduced power consumption and a shorter access time when compared with that of a conventional 4-cluster architecture.

On the other hand, the allocation of instructions to clusters is strongly constrained on this 4-cluster WSRS architecture. This induces extra hardware logic for register renaming and for computing the allocation of instructions to clusters, as well as some extra pipeline stages before register renaming.

Paper organization
The remainder of the paper is organized as follows. Register write specialization is further analyzed in Section 2. Section 3 analyzes the 4-cluster WSRS architecture; in particular, register renaming and instruction allocation policies are detailed.

Section 4 discusses the complexity advantages of WSRS architectures over conventional clustered architectures. Section 5 presents performance results confirming that combining register write specialization and register read specialization does not impair the performance of a 4-cluster processor architecture. Section 6 reviews previous related work aiming to optimize physical register files (e.g., virtual-physical registers [13], register caches [4, 1], read/write port arbitration [1]) or to reduce the critical path [18, 2] or the power consumption [5, 8] of the wake-up and selection logic. On WSRS architectures, all these proposals can be applied at the cluster level. Finally, Section 7 summarizes this study.

2 Register Write Specialization

2.1 Register Write Specialization principle

Figure 2 illustrates register write specialization with a 4-cluster processor and with a processor relying on reservation stations servicing pools of identical functional units. The physical register file is partitioned into several distinct subsets of registers. Each cluster (resp. pool of functional units) can write only into one of the subsets. Assuming 2-way clusters able to produce up to 3 results per cycle (as on the Alpha EV6 [11]), each physical register can be built using four identical (4-read, 3-write) copies, instead of four identical (4-read, 12-write) copies when register write specialization is not implemented.

2.2 Register write specialization and register renaming

When register write specialization is used, register renaming strongly depends on the allocation of instructions to clusters. The cluster (or the pool of functional units) that executes an instruction determines the register subset where the instruction result is written.

In this paper, we assume that instructions are first allocated to clusters and then renamed (1). That is, once an instruction has been allocated to a cluster, register renaming has to take this constraint into account. We propose below two possible implementations of register renaming with register write specialization. Both implementations are derived from a register renaming process using a single set of registers, a map table and a free list. This renaming process (quite similar to the one used in many current processors) can be decomposed into three tasks:

- (A) dependency propagation within the group of instructions to be renamed in parallel,

- (B) assignment of a free register to each instruction producing a register result,

- (C) read and update of the map table: the new map table is built from the old map table and the group of newly picked free registers.

(1) The alternative solution (register renaming first, then instruction allocation to clusters) may lead to very unbalanced workloads on the clusters.

Task (C) depends on Tasks (A) and (B). Tasks (A) and (C) can be directly adapted to register write specialization.
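As an illustration, the sketch below walks through this baseline three-task step for a group of instructions renamed in the same cycle, with a single free list and a single map table. It is our own simplified rendering, not the paper's hardware description; the names MAP, FREE and rename_group are ours.

```python
# Minimal sketch of the baseline renaming step (tasks A, B, C) for a
# group of instructions renamed in the same cycle. Illustrative only.
from collections import deque

MAP = {f"r{i}": f"p{i}" for i in range(32)}          # logical -> physical
FREE = deque(f"p{i}" for i in range(32, 128))        # free physical registers

def rename_group(group):
    """group: list of (dest, [sources]) logical register names."""
    local = {}                                       # (A) intra-group dependencies
    renamed = []
    for dest, srcs in group:
        phys_srcs = [local.get(s, MAP[s]) for s in srcs]
        p = FREE.popleft() if dest else None         # (B) pick a free register
        if dest:
            local[dest] = p
        renamed.append((p, phys_srcs))
    MAP.update(local)                                # (C) update the map table
    return renamed

# Example: two dependent instructions renamed in the same cycle.
print(rename_group([("r1", ["r2", "r3"]), ("r4", ["r1", "r5"])]))
```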

For both implementations, we assume that a list Fi of free registers is maintained for each of the physical register file subsets Si. We also assume that a subset target vector V, representing the allocation of instructions to clusters, is available. That is, the allocation of instructions to the clusters must precede the final pipeline stage of the register renaming process. Note that the instruction allocation policy may induce extra pipeline stages.

2.2.1 First implementation

If a physical register target is systematically assigned to every instruction, Task (B) can be implemented as in a conventional superscalar processor architecture: let N be the number of instructions to be renamed in parallel; N free physical registers P1, ..., PN are picked from the free list, and register Pj is assigned to the j-th instruction of the group to be renamed.

This solution can be adapted to register write specialization as follows.

N free registers P1,i, ..., PN,i are picked from each free list Fi. Then the allocation of instruction J to a cluster is used to select a single target register for the instruction among the registers PJ,i.

A major drawback of this solution is that it "wastes" many free registers. The unused free registers must be recycled. This recycling can be handled in pipelined mode for each of the free lists: 1) build the two lists of registers to be recycled (the list of registers freed by committed instructions and the list of registers that were not attributed to the group of instructions renamed on the previous cycle), 2) independently pack both lists, 3) merge them into a single list and 4) append this list to the free list. A residual problem is that a large number of free registers are not accessible while they are flowing through the recycling pipeline.
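A minimal sketch of this first scheme, under our own naming (FREE for the per-subset free lists, RECYCLE for the input of the recycling pipeline): N candidates are popped from every list, the one matching each instruction's cluster is consumed, and the rest are queued for recycling.

```python
# Sketch of the first implementation: over-provision free registers from
# every subset, keep the one matching the instruction's cluster, and feed
# the unused ones to the recycling pipeline (illustrative names are ours).
from collections import deque

N_CLUSTERS = 4
FREE = [deque(f"p{c}_{i}" for i in range(128)) for c in range(N_CLUSTERS)]
RECYCLE = [deque() for _ in range(N_CLUSTERS)]       # recycling pipeline input

def pick_targets(subset_target_vector):
    """subset_target_vector: cluster index of each instruction in the group."""
    n = len(subset_target_vector)
    # Pop n candidates from *every* free list, whether needed or not.
    candidates = [[FREE[c].popleft() for _ in range(n)] for c in range(N_CLUSTERS)]
    targets = []
    for j, cluster in enumerate(subset_target_vector):
        targets.append(candidates[cluster][j])       # keep the matching register
        for c in range(N_CLUSTERS):
            if c != cluster:                          # unused -> recycle later
                RECYCLE[c].append(candidates[c][j])
    return targets

print(pick_targets([0, 0, 2, 1]))   # e.g. ['p0_0', 'p0_1', 'p2_2', 'p1_3']
```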

2.2.2 Second implementation

The difficulty associated with free register recycling can be eliminated at the cost of a longer pipeline for Task (B).

For the group of N instructions to be renamed in a single cycle, the exact number Ni of registers required from each register subset Si is first computed from the subset target vector V. Then exactly Ni free registers are picked from each free list Fi. These groups of registers are then expanded and merged using the subset target vector.
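A sketch of this second scheme (the function and variable names are ours): the per-subset counts Ni are derived from the subset target vector, exactly Ni registers are popped from each list, and the picked groups are merged back into instruction order.

```python
# Sketch of the second implementation: pick exactly the required number of
# free registers from each subset, then expand/merge them back into the
# order of the renamed group (illustrative only).
from collections import deque

N_CLUSTERS = 4
FREE = [deque(f"p{c}_{i}" for i in range(128)) for c in range(N_CLUSTERS)]

def pick_targets_exact(subset_target_vector):
    # Count how many registers each subset must provide (the Ni values).
    counts = [subset_target_vector.count(c) for c in range(N_CLUSTERS)]
    picked = [deque(FREE[c].popleft() for _ in range(counts[c]))
              for c in range(N_CLUSTERS)]
    # Expand and merge using the subset target vector.
    return [picked[c].popleft() for c in subset_target_vector]

print(pick_targets_exact([0, 0, 2, 1]))   # ['p0_0', 'p0_1', 'p2_0', 'p1_0']
```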

Careful design should limit the extra pipeline length incurred by such a design to 2 or 3 cycles for Task (B). For a static instruction allocation to clusters (e.g., round-robin), the cluster allocation of an instruction is known very early in the pipeline. The final register renaming stage might then not be delayed at all.

2.3 A deadlock issue and its workarounds

A deadlock is possible when register write specialization is used. Suppose that the number of physical registers in some (or all) register subsets is smaller than the number of logical registers in the ISA. When, at a given point of the execution, all the physical registers from one of the register subsets represent an architectural register of the ISA, the logical-to-physical register renaming mechanism cannot rename any new instruction onto the cluster/pool of functional units writing into this register subset. This situation might result in a deadlock.

Such a deadlock cannot occur when each register subset features at least as many registers as the number of logical registers in the ISA. However, for SMT processors or for ISAs featuring very large numbers of registers (e.g., IA-64), this might not be a realistic solution. Two possible workarounds can be considered: (a) the allocation of instructions to clusters may be in charge of avoiding the deadlock, or (b) an exception is raised whenever the deadlock is detected; moves that map some of the logical registers onto the other register subsets are then issued.

2.4 Performance considerations

Pipeline stalls
Depending on the instruction allocation policy to clusters and the size of the register subsets, register write specialization may induce some extra or different pipeline stalls compared with a conventional approach. However, the impact on performance should be very small, provided that the total number of physical registers is sufficiently increased to absorb the slight imbalance that can occur in the local demand for registers.

Let us illustrate this with a 4-cluster processor example. Assume that the ISA features 32 (logical) registers and that each cluster can accept up to 56 in-flight instructions (i.e., a total of 224 instructions). For a conventional approach, using a 256-entry register file guarantees that register renaming is never the stalling factor. When using register write specialization, the same holds if 88 entries per register subset are available: at most 32 registers are mapped to the current architectural registers, and 56 registers then remain available for renaming, i.e., one per possible in-flight instruction of the cluster.
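The sizing bound of this example can be written explicitly (our own formulation of the argument above):

$$N_{conv} \ge 32 + 4 \times 56 = 256, \qquad |S_i| \ge 32 + 56 = 88,$$

i.e., each subset must be able to hold a full architectural mapping (32 registers) plus one renamed destination per in-flight instruction of its own cluster (56), whereas the conventional shared pool must cover the in-flight instructions of all four clusters.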

Pipeline depth
Depending on the instruction allocation policy to clusters, register write specialization may (or may not) induce extra pipeline stages in the register renaming process. With a round-robin or pseudo-random allocation, the read of the free lists can be initiated very early in the pipeline. When using pools of functional units associated with dedicated reservation stations, the allocation of instructions to the pools can be stored in the instruction cache


as predecoded bits. In these cases, we can reasonably assume that no extra pipeline stage will be needed for either of the two register renaming implementations we have proposed.

Other allocation policies, particularly policies using dynamic register dependencies to allocate instructions to clusters [14], will induce extra pipeline stages: the allocation of instructions to clusters must be executed in parallel with dependency propagation, and therefore Task (B) of the register renaming process is delayed.

3 WSRS architecture

In this section, we first analyze the constraints on register renaming and instruction allocation to clusters on the 4-cluster WSRS architecture. Then we present the degrees of freedom that exist in this allocation.

3.1 A 4-cluster WSRS Architecture

A 4-cluster WSRS architecture is illustrated in Figure 3:

- Functional units are grouped into four identical clusters, C0, C1, C2 and C3.
- The set of registers is split into four distinct subsets of physical registers, S0, S1, S2 and S3.
- Register write specialization: any result produced on cluster Ci is written into register subset Si.
- Register read specialization: for any instruction executed on a given cluster, the first (resp. the second) operand is read from a fixed pair of register subsets.

The execution cluster of a dyadic instruction and its register subset target are determined by the register subsets where its operands are located.

3.2 Cluster Allocation and Register renaming

Instruction allocation to clusters and register renaming are strongly linked in the 4-cluster WSRS architecture. Once the register subset target has been determined, register renaming can be handled as described in Section 2.2. A simple rule, illustrated in Figure 3, determines the cluster that executes instruction I: the position of the first operand determines whether instruction I is executed on the top or bottom 2-cluster, and the position of the second operand determines whether instruction I is executed on the left or right 2-cluster.

The computations of the two bits that represent the execution cluster number of instruction I are independent. They can be implemented as follows. At any cycle, two bit vectors f and s represent respectively the first and second bits of the subset numbers of the physical registers currently mapped to the logical registers (i.e., logical register Ri is mapped onto a physical register in subset 2*fi + si). Computing the new values of the vectors f and s is very similar to register renaming. For a group of N instructions to be renamed in parallel, it consists of two phases: (A1) propagation of (pseudo-)dependencies within the group, and (A2) read and update of the vectors f and s.

We assume that step A1 is executed in parallel with dependency propagation in register renaming. The complexity of the second step is equivalent to the complexity of updating a map table.
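The sketch below shows one consistent reading of this rule; it matches Figure 3 and the read connectivity described for cluster C1 in the introduction, but the exact bit convention is our reconstruction. The cluster's first bit is taken from the f-bit of the first operand's subset, its second bit from the s-bit of the second operand's subset, and by register write specialization the result subset equals the cluster number.

```python
# Sketch (our reconstruction) of cluster selection on the 4-cluster WSRS
# architecture: subset numbers are 2*f + s; the execution cluster of a
# dyadic instruction takes its f-bit from the first operand's subset and
# its s-bit from the second operand's subset.
f = {}  # logical register -> first bit of its physical subset number
s = {}  # logical register -> second bit of its physical subset number

def allocate(dest, op1, op2):
    cluster = 2 * f[op1] + s[op2]        # top/bottom bit, then left/right bit
    # Register write specialization: the result lands in subset S_cluster,
    # so the destination's f/s bits are updated (analogous to step A2).
    f[dest], s[dest] = cluster >> 1, cluster & 1
    return cluster

# Example: r2 lives in S0 (f=0, s=0) and r3 in S1 (f=0, s=1).
f.update({"r2": 0, "r3": 0}); s.update({"r2": 0, "r3": 1})
print(allocate("r1", "r2", "r3"))   # cluster C1; r1 will be written into S1
```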

Extra pipeline stages
On a WSRS architecture, for the first register renaming implementation proposed in Section 2.2, register renaming is delayed by step A2 (read and update of the bit vectors f and s). In Section 5, we will assume that this translates into a single extra pipeline stage before renaming.

For the second register renaming implementation proposed in Section 2.2, two other actions that depend on step A2 must also be performed before the late phase of register renaming: 1) computation of the number of registers to pick from each of the free lists, and 2) read of the free lists followed by expansion and merging of the groups of free registers. In Section 5, we will assume that this translates into a total of three extra pipeline stages before renaming.

3.3 Degrees of freedom for allocating instructions to clusters

We list here some degrees of freedom that can be exploited for allocating instructions to the clusters on the 4-cluster WSRS architecture.

Notation
Instructions often use immediate operands. However, this paper is only concerned with dynamic register operands and results. We will therefore refer to an instruction using two register operands as a dyadic instruction and to an instruction using one register operand as a monadic instruction, independently of whether it also uses an immediate operand.

Monadic instructions
A large fraction of the instructions are either monadic or use no register operand at all. Monadic instructions offer a degree of freedom for the distribution of instructions among clusters, since they can be executed by two clusters. However, this may lead to a slight imbalance in the workload: chains of dependent monadic instructions are executed on a single cluster pair (either (C0, C1) or (C2, C3)).

Commutative monadic instructions
Monadic instructions use a single register operand and possibly an immediate as a second operand, e.g., the addition of a register and an immediate. The usual convention is to use the register as the first operand and the immediate as the second operand. If a functional unit is implemented in such a way that it can take its register operand either on the right entry or on the left entry, then commutative monadic instructions can be executed by any of three clusters on the 4-cluster WSRS architecture.

Commutative dyadic instructions
On optimized codes, the compiler tends to keep invariant operands in registers in order to avoid repetitive loads of the same data from the cache. On a 4-cluster WSRS architecture, this may unbalance the workload among the clusters.

This phenomenon can be limited by exploiting the commutativity of many dyadic instructions (add, or,


exclusive-or, ...). Commutative dyadic operations can be executed on two clusters provided that the two operands do not lie in the same register subset. This second degree of freedom can be exploited by swapping the two operands before cluster allocation in the pipeline.

"Commutative" clusters
In order to further increase the degree of freedom provided by commutative dyadic instructions, the functional units can be implemented so that they can execute instructions in two forms, with the operand order inverted (e.g., computing A-B and -A+B). Any dyadic instruction with register operands in two different register subsets can then be executed on two clusters. Moreover, any monadic instruction can be executed on three out of the four clusters.

4 Complexity of the 4-cluster WSRS architecture

In this section, we compare an 8-way 4-cluster WSRS architecture with a conventional 8-way superscalar processor in terms of implementation complexity.

Throughout this section, we use the example of a symmetric 4-cluster architecture. Each cluster is able to issue two instructions per cycle. On any cycle, up to 4 reads and 6 writes (4 ALU results and 2 load results) onto the register file may be generated by each cluster. These parameters are similar to those found on the two-cluster Alpha 21264 [11].

4.1 Constraints and extra hardware

Four identical clusters
The 4-cluster WSRS architecture requires identical clusters of functional units. Each cluster must be able to execute every kind of instruction. This might be an issue for complex integer instructions such as integer division or multiplication. Replicating dividers and multipliers on every cluster might be considered a waste of silicon. As an alternative to complete replication, a divider (resp. multiplier) can be shared between two adjacent clusters. Static arbitration between the two clusters should allow their smooth sharing.

Need for more physical registers
A 4-cluster WSRS architecture needs more physical registers than a conventional 4-cluster architecture, but each of them supports fewer access ports.

More complex register renaming pipeline
As pointed out in Section 3.2, compared with the conventional architecture, the complete renaming process on the WSRS architecture (including the allocation of instructions to clusters) involves 1 or 3 extra pipeline stages (depending on the register renaming implementation). It also requires some extra hardware logic (extra free lists, a register recycling pipeline, etc.).

4.2 Complexity of the register file

Compared with a conventional superscalar architecture (Figure 1), the 4-cluster WSRS architecture presents a major difference: any physical register is connected with only half of the functional unit entries and can be written by only one fourth of the functional units. On a 4-cluster 8-way WSRS architecture, each physical register can be implemented using two (4-read, 3-write) register copies, instead of four (4-read, 3-write) register copies when using register write specialization alone, or four (4-read, 12-write) register copies on a conventional 4-cluster 8-way architecture.

In this section, we try to quantify how this impacts the access time, power consumption and silicon area of the physical register file.

4.2.1 Methodology

Silicon area estimation
The silicon footprint of a multiported register file is dominated by the area devoted to the memory cells [21]. When the number of ports is high, the size of a multiported memory cell is approximately a quadratic function of its number of access ports [19].

For a conventional multiported memory cell featuring $N_{read}$ read ports and $N_{write}$ write ports, $N_{read}$ bitlines, $N_{read}$ wordline wires, $2N_{write}$ bitlines and $N_{write}$ wordline wires must cross the cell [21]. With $w$ the pitch of each wire (i.e., the width of the wire itself plus the distance to the neighboring wire), the area devoted to a register cell is given by:

$$w^2 \, (N_{read} + N_{write}) \, (N_{read} + 2\,N_{write}) \qquad (1)$$

We use Formula (1) to report the silicon area devoted to a single bit of a physical register.
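As a cross-check of Formula (1), the "Reg. bit area" and relative-area rows of Table 1 can be recomputed from the port counts listed there. The sketch below is ours, not part of the original evaluation; the helper name cell_area and the configs dictionary are illustrative.

```python
# Minimal sketch: per-bit register cell area from Formula (1), summed over
# the register copies of each configuration listed in Table 1.
def cell_area(n_read, n_write, w=1.0):
    """Area of one multiported cell, in units of w^2 (Formula (1))."""
    return w * w * (n_read + n_write) * (n_read + 2 * n_write)

# (read ports, write ports) per copy, number of copies, total registers
configs = {
    "noWS-M": ((16, 12), 1, 256),
    "noWS-D": ((4, 12), 4, 256),
    "WS":     ((4, 3),  4, 512),
    "WSRS":   ((4, 3),  2, 512),
    "noWS-2": ((4, 6),  2, 128),
}

for name, ((r, w_ports), copies, nregs) in configs.items():
    bit_area = copies * cell_area(r, w_ports)      # area per register bit
    total = bit_area * nregs                       # whole register file
    print(name, bit_area, total)
# The per-bit areas match Table 1 (1120, 1792, 280, 140, 320), and the
# total-area ratios relative to noWS-2 reproduce its last row:
# noWS-M 7x, noWS-D 11.2x, WS 3.5x, WSRS 1.75x, noWS-2 1x.
```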

Power consumption and access time estimation
In order to evaluate the peak power consumption and the access time of multiported register files in future superscalar processors, we used the CACTI 2.0 package [20]. Since CACTI 2.0 is devoted to evaluating the peak power consumption and access time of caches, we discarded the tag path in the measurements presented here. We also modified CACTI 2.0 in order to take register write specialization into account.

Technology assumptions
Due to the 4-6 year microprocessor design cycle, current research propositions cannot appear in products before 2006-2008. Therefore we present this evaluation for a technology two generations ahead, 0.10 µm CMOS, and a 10 GHz clock (2). If the current trend of increasing clock frequency continues, then one can reasonably expect to achieve frequencies in the 10 GHz range using 0.10 µm CMOS.

Considered configurations
We report estimates for four 8-way issue superscalar architectures and one 4-way issue superscalar architecture.

(2) "The technology scaling in CACTI 2.0 should work well down to below 0.1 µm." Norm Jouppi, private communication.


The four considered 8-way configurations are: (1) noWS-M, a conventional 8-way architecture with a monolithic register file (Figure 1.a); (2) noWS-D, a conventional 4-cluster architecture with a distributed register file (Figure 1.b); (3) WS, a 4-cluster architecture featuring register Write Specialization (Figure 2.a); and (4) WSRS, a 4-cluster WSRS architecture (Figure 3). Estimates for a conventional 2-cluster 4-way architecture, noWS-2, are also presented.

The Alpha 21264 features 80 physical integer registers. As future processors will feature deeper pipelines, we assume 128 physical integer registers for a conventional 4-way processor and twice as many for a conventional 8-way superscalar processor. A total of 512 registers is assumed for the WS and WSRS architectures, since more registers are needed in these architectures.

4.2.2 Estimates

Table 1 reports register file characteristics as well as estimates of access time, power consumption and silicon area for the five considered configurations. We report the following characteristics of the register files: number of copies of each individual register, number of read and write ports on each individual register copy, and total number of register subfiles. We also report an estimate of the size of the register file relative to the size of the register file of the 2-cluster 4-way issue processor.

The number of pipeline stages needed for accessing the register file is also estimated, first assuming a very aggressive 10 GHz clock and then a less aggressive 5 GHz clock. An extra half cycle is assumed in order to drive the data to the functional units. Using this register read pipeline depth and assuming a complete bypass network, we also report the number of possible sources that must be arbitrated by a bypass point (see Section 4.3.1).
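The derived rows of Table 1 (read pipeline depth and bypass sources) follow from the access times and the X*N+1 expression of Section 4.3.1. The sketch below is our reconstruction; in particular, the rounding rule (access time in cycles plus the extra half drive cycle, rounded up) and the per-configuration result-bus counts are our assumptions, chosen because they reproduce the table.

```python
import math

# Sketch (our reconstruction) of the derived rows of Table 1.
# Assumption: pipeline depth X = ceil(access_time * clock + 0.5), i.e. the
# access time in cycles plus the extra half cycle to drive the data.
# Bypass sources per operand = X * N + 1 (Section 4.3.1), where N is the
# number of result buses a bypass point sees: 12 for the 8-way
# configurations, 6 when only two 2-way clusters can feed an entry
# (WSRS) or when the whole processor is 2-cluster 4-way (noWS-2).
access_ns = {"noWS-M": 0.71, "noWS-D": 0.52, "WS": 0.40, "WSRS": 0.35, "noWS-2": 0.34}
result_buses = {"noWS-M": 12, "noWS-D": 12, "WS": 12, "WSRS": 6, "noWS-2": 6}

for ghz in (10, 5):
    cycle_ns = 1.0 / ghz
    for name, t in access_ns.items():
        x = math.ceil(t / cycle_ns + 0.5)        # read pipeline depth
        sources = x * result_buses[name] + 1     # bypass sources per operand
        print(ghz, name, x, sources)
# Reproduces the "Pipeline cycles" and "sources per bypass point" rows,
# e.g. noWS-M at 10 GHz: 8 cycles, 97 sources; WSRS at 10 GHz: 4 cycles, 25.
```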

Analysis
By reducing the number of ports on each individual register, register write specialization alone enables a dramatic reduction of the overall register file complexity in terms of silicon area, power consumption and access time.

Using a WSRS architecture further halves the silicon area and further reduces the access time and the power consumption. Compared with a conventional 4-cluster 8-way architecture (noWS-D), the total silicon area of the physical register file is divided by more than six, despite the fact that the number of physical registers is doubled. Peak power consumption is more than halved and access time is reduced by more than one third. This allows the implementation of a shorter register read pipeline.

Compared with the 2-cluster conventional architecture, the physical register file of the 4-cluster WSRS architecture scales very smoothly: (a) the read access time is in the same range, (b) the total silicon area is only increased by 75%, and (c) the power consumption only doubles.

4.3 Complexity of the bypass network and of the wake-up and selection logic

For any instruction executed on a given cluster, its first (resp. second) operand can have been produced by only two of the four clusters. Therefore the bypass point at the entry of a functional unit is only connected with half of the result buses. The wake-up logic in a given cluster only monitors two clusters as possible producers of its first operands and two clusters as possible producers of its second operands.

4.3.1 Bypass network complexity

The bypass network allows the functional units to use the result of an operation as soon as it has been produced. As the access to the register file will be pipelined in future processor generations, processor performance will depend dramatically on the bypass network. Two distinct issues must be distinguished: first, the ability of the bypass network to forward data to the functional unit entries; second, the fast-forwarding capability, i.e., the ability to use an instruction result as an operand of a dependent instruction on the very next cycle.

Bypass point complexity
A complete bypass network connects all result buses to all functional unit entries. The cost of a complete bypass network is huge: if the read-write pipeline on the register file is X cycles long and if a register can be produced by N possible units, then, for each functional unit entry, up to X*N already-computed results are potentially inaccessible from the register file. The bypass point logic must then choose among X*N+1 possible sources for each operand.

The complexity of a bypass point on a 4-cluster WSRS architecture benefits from two orthogonal factors compared with a conventional architecture. First, the register operand accessed by a functional unit entry can have been produced by only two clusters instead of four. Second, the register read pipeline is shorter on a 4-cluster WSRS architecture. As a consequence, assuming a complete bypass network, a bypass point on a 4-cluster WSRS architecture must choose among the same number of possible sources as a bypass point on a 2-cluster conventional architecture (see Table 1).

Fast-forwarding
Forwarding an operation result in order to use it as an operand on the very next cycle is also a very challenging task. The transit delays between the functional units are becoming long compared with an ALU operation. On the other hand, a systematic one (or more) cycle delay for forwarding a result to a dependent instruction impairs performance [4]. To a first approximation, the fast-forwarding delay increases with the distance between the producer of a register and its consumer. A second-order factor for this delay is the number of entries that have to be fed within the next cycle.

Three possibilities, of increasing hardware complexity, are natural on a 4-cluster architecture:


                                   noWS-M   noWS-D   WS      WSRS    noWS-2
nb of registers                    256      256      512     512     128
register copies                    1        4        4       2       2
(R,W) ports per copy               (16,12)  (4,12)   (4,3)   (4,3)   (4,6)
physical subfiles                  1        4        4       4       2
nJ/cycle                           3.20     2.90     1.70    1.25    0.63
Access time (ns)                   0.71     0.52     0.40    0.35    0.34
Pipeline cycles, 10 GHz            8        6        5       4       4
sources per bypass point, 10 GHz   97       73       61      25      25
Pipeline cycles, 5 GHz             5        4        3       3       3
sources per bypass point, 5 GHz    61       49       37      19      19
Reg. bit area (x w^2)              1120     1792     280     140     320
total area / area(noWS-2)          7        11.2     3.50    1.75    1

Table 1. Estimates for different architecture configurations

- Fast-forwarding inside a single cluster: the WSRS architecture has the advantage that, assuming a random distribution of instructions to clusters (when some freedom is available), statistically two out of four possible consumers of a result will be located on the producer cluster, instead of only one out of four on a conventional architecture.

- Fast-forwarding inside pairs of adjacent clusters: on the WSRS architecture, statistically three out of four possible consumers of a result will be able to capture it on the very next cycle, instead of two out of four on a conventional 4-cluster architecture.

- Complete fast-forwarding: Figure 3 suggests a possible layout of the 4-cluster WSRS architecture where the consumer cluster is always close to (i.e., touches) the producer cluster. Such a layout may allow a simpler implementation of a complete fast-forwarding capability than on a conventional 4-cluster architecture.

4.3.2 Wake-up logic complexity

In an out-of-order execution processor, an instruction cannot be issued before its operands are guaranteed to be valid in time (3). On each cycle, the wake-up logic entry associated with an instruction must monitor every possible source of any of its operands and check it against its register operand numbers: if an instruction features two register operands and if N possible sources can produce these register operands, then each wake-up logic entry implements 2*N comparators. The wake-up logic (and these comparators in particular) is responsible for a significant part of the power consumption of the processor [9, 12]; limiting the number of comparators in each wake-up logic entry is therefore a major challenge.

The wake-up logic response time also increases dramatically when the number of possible sources for an operand is doubled from 4 to 8 (a 46% increase is reported in [14], assuming a 0.18 µm CMOS technology).

(3) For loads, a cache hit is predicted.

For an instruction executed on a given cluster of a 4-cluster WSRS architecture, a given operand can only be produced by two of the four clusters of the processor: a wake-up logic entry of an 8-way 4-cluster WSRS architecture features only the same number of comparators as that of a 4-way issue conventional processor.
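In the terms used above, the count is a short piece of arithmetic (ours, using the per-cluster figure of 3 results per cycle from Section 2.1):

$$2N_{WSRS} = 2 \times (2 \times 3) = 12 = 2N_{4\text{-}way\ conv.}, \qquad 2N_{8\text{-}way\ conv.} = 2 \times (4 \times 3) = 24,$$

i.e., a WSRS wake-up entry monitors the result buses of only two clusters, exactly as an entry of the 2-cluster 4-way processor does.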

5 Performance evaluation

A WSRS architecture features a deeper pipeline than

a conventional architecture. Strong constraints on the instruction allocation policy to clusters are also encountered. Therefore, assuming equal cycle time for a conventional clustered architecture and a 4-cluster WSRS architecture, one might expect some performance loss when using the 4-cluster WSRS architecture.

The preliminary simulations presented in this section show that, contrary to this intuition, the 4-cluster WSRS architecture stands the performance comparison with a conventional 4-cluster architecture.

5.1 Experimental framework

5.1.1 Sparc ISA

For our simulations, we used the Sparc ISA. Instructions using three register operands (e.g., indexed stores) are translated at decode into two microoperations, as suggested in Section 3. The Sparc ISA features register windows. In our simulations, we considered that 4 register windows are mapped in the physical register file at the same time, i.e., a total of 80 logical general-purpose registers are used. An exception is taken on a window overflow.

5.2 General characteristics of the simulated architecture

8-way 4-cluster architectures are considered. All clusters are assumed identical. Each cluster is 2-way issue and features a single load/store unit, a single fully pipelined floating-point unit and two integer ALUs. Instruction latencies


are summarized in Table 2. Fast-forwarding is possible inside a single cluster; a one-cycle delay is needed to forward a result from one cluster to another.

inst.   loads   ALU   mul/div   fadd/fmul   fdiv/fsqrt
lat.    2       1     15        4           15

Table 2. Latencies for principal instructions

Our study focuses on the impact of the WSRS architecture on the performance of the execution core. Therefore, we made some simplifying hypotheses on the instruction fetch front-end of the processor. We assume that the front-end stages of the pipeline, up to the rename stage, deliver eight instructions/microoperations per cycle at a sustained rate. That is, our simulations ignore all the artifacts associated with irregular instruction fetch bandwidth.

Realistic conditional branch prediction was simulated. A very large 2Bc-gskew branch predictor featuring 512 Kbits of memorization was considered [17]. The size and accuracy of this branch predictor are equivalent to those of the branch predictor of the cancelled Alpha EV8 microprocessor [16]. Perfect prediction of branch targets was assumed, since target mispredictions for PC-relative branches can be corrected very early in the pipeline, procedure returns can be predicted almost perfectly with a return stack, and very few indirect jumps were encountered in our benchmark set.

Load/store addresses were computed in order, loads bypassing stores whenever no conflict was encountered. The data memory hierarchy was modelled using the parameters reported in Table 3.

         size     latency     miss pen.   bandwidth
L1 D-$   32 Kb    2 cycles    12 cycles   4 W/cycle
L2 $     512 Kb   12 cycles   80 cycles   16 B/cycle

Table 3. Memory hierarchy characteristics

5.2.1 Simulated configurations

As a comparison point, we used a conventional 4-cluster superscalar processor with a round-robin allocation policy. 256 physical registers are assumed. The current trend for pipeline depth on high-end microprocessors is toward very deep pipelines (14 stages on the EV8, 18 on the Pentium 4). We assume a minimum 17-cycle misprediction penalty for this base case.

A processor featuring only register Write Specialization is also simulated. Round-robin instruction allocation to clusters is assumed as well, with a total of 384 or 512 physical registers. Both register renaming strategies described in Section 2 were simulated, with the minimum misprediction penalty set to 16 cycles: the register read pipeline is one cycle shorter than on the conventional architecture (see Section 4.2).

As the simulation results did not exhibit any significant difference, we only display the simulation results for the second register renaming strategy.

For the 4-cluster WSRS architecture, we also assume a total of 384 or 512 physical registers. For the first register renaming strategy, we set the minimum misprediction penalty to 16 cycles, and for the second strategy to 18 cycles. These misprediction penalties take into account, respectively, one and three extra pipeline stages before renaming (see Section 3.2) and two pipeline stages saved on the register read (see Section 4.2). As with Write Specialization alone, the simulation results for the two register renaming strategies were very close; we therefore only display results for the second strategy.

We simulated two simple allocation policies on the WSRS architecture:

- random monadic, RM: for monadic instructions, the register operand determines the top or bottom bicluster; the left or right bicluster is randomly selected before register renaming.

- random "commutative" cluster, RC: functional units are assumed to be able to execute any instruction in two forms (e.g., A-B and -A+B), taking their first operand either on their left entry port or on their right entry port. The form of the instruction is first randomly selected. Then, for monadic instructions, two clusters are able to execute the instruction and one of them is randomly selected.

5.3 Benchmark selection

We present simulation results for 7 SPECFP2000 benchmarks (wupwise, swim, mgrid, applu, galgel, equake and facerec) and 5 SPECINT2000 benchmarks (gzip, vpr, gcc, mcf and crafty) using the ref input sets.

Codes were compiled with the following options: cc -xO3 -xarch=v8plusa -xCC, c++ -xO3 -xarch=v9, f77 -fast -xarch=v8plusa, f90 -fns=no -fast -xarch=v9.

The initialisation phase of each application was skipped using a fast-forward mode; then the caches and branch prediction structures were warmed for 20 million instructions. A slice of 10 million instructions is then simulated.

5.4 Simulation results

Figure 4 summarizes the performance results on the different benchmarks (measured in instructions per cycle).

5.4.1 Register Write Specialization only

As expected, on integer applications the same level of performance is reached with the conventional architecture and with register write specialization alone.

For floating-point applications, a marginal performance improvement is consistently obtained when using register write specialization. This marginal performance increase is enabled by the larger instruction window allowed by the overall larger register set. This trend is further enhanced by increasing the overall number of physical registers from 384 to 512.


Figure 4. Performance results (IPC) on the integer and floating-point benchmarks for the simulated configurations: conventional round-robin (RR 256), register Write Specialization with round-robin allocation (WSRR 384, WSRR 512), and the 4-cluster WSRS architecture with the RC and RM allocation policies (WSRS RC 384, WSRS RC 512, WSRS RM 512).

5.4.2 WSRS architecture

On all our integer applications, the 4-cluster WSRS architecture performs slightly better than the conventional architecture. On the other hand, the 4-cluster WSRS architecture performed slightly worse than the conventional architecture on most floating-point applications, particularly the applications with relatively high IPCs. Nevertheless, when using the RC instruction allocation policy (WSRS RC), the performance always stays within a 3% margin of the base architecture.

Note that increasing the total number of registers from 384 to 512 has a minor impact on performance.

Analysis
Two phenomena associated with the distribution of instructions to clusters have opposite impacts on performance.

Round-robin allocation of instructions to clusters leads to a better balancing of the workload among the clusters than is achieved by the RM and RC policies on the WSRS architecture. On the other hand, the RM and RC policies statistically distribute the instructions "closer" to the producer(s) of their operand(s) than the round-robin allocation policy.

To characterize the workload imbalance, we split the applications into groups of 128 instructions and measure the ratio of these groups that are unbalanced. We arbitrarily define a group as unbalanced whenever one of the four clusters gets fewer than 24 instructions or more than 40 instructions. We define the unbalancing degree of an application as the ratio of unbalanced instruction groups in the application.
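The metric can be restated as a short sketch (our code, directly following the definition above; the function name and its parameters are ours).

```python
# Sketch of the unbalancing-degree metric: split the dynamic instruction
# stream into groups of 128 and flag a group as unbalanced if any of the
# four clusters gets fewer than 24 or more than 40 of its instructions.
def unbalancing_degree(cluster_trace, group_size=128, low=24, high=40):
    """cluster_trace: cluster index (0..3) of each dynamic instruction."""
    groups = [cluster_trace[i:i + group_size]
              for i in range(0, len(cluster_trace) - group_size + 1, group_size)]
    unbalanced = 0
    for g in groups:
        counts = [g.count(c) for c in range(4)]
        if min(counts) < low or max(counts) > high:
            unbalanced += 1
    return 100.0 * unbalanced / len(groups)

# A perfectly round-robin trace is never unbalanced:
print(unbalancing_degree([i % 4 for i in range(128 * 10)]))   # 0.0
```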

Figure 5 shows the unbalancing degrees on our set of benchmarks. The round-robin policy exhibits perfect balancing. The RM policy uses fewer degrees of freedom than RC; therefore, in most cases, it exhibits the highest unbalancing degree.

Floating-point benchmarks tend to exhibit higher unbalancing degrees than integer benchmarks. For instance, on the two high-IPC benchmarks that exhibit the most

significant performance loss (facerec and wupwise), the unbalancing degree is close to 100%, while on the high-IPC integer benchmarks (gzip and crafty), the unbalancing degree is around 80%.

In future research on allocation policies for WSRS architectures, we plan to study dynamic policies that trade off the allocation of dependent instructions within a cluster against (local) workload balancing between clusters.

6 Related work

VLIW ISAs such as Multiflow [3] or, more recently, LX [6] implement distinct logical register files that are accessed by different clusters of functional units. Whenever the operands of an operation lie in different register files, the compiler is responsible for generating moves between the register files to enable the execution of the operation. This allows the implementation of wide-issue statically scheduled processors using register files with a limited number of read and write ports. Our proposal tackles the implementation of dynamically scheduled wide-issue processors for current general-purpose ISAs featuring a single logical register file.

For an ISA featuring a single logical register file, Farkas et al. [7] proposed the use of two distinct physical register files, each of them associated with a subset of the ISA logical registers. Each physical register file is associated with a cluster of functional units. The main difficulty with this approach is that whenever an instruction uses two logical operands mapped onto the two distinct subsets of the logical register file, moves have to be generated by the hardware between the two physical register files. The load balancing of the clusters is very sensitive to code generation. However, in some sense this work is close to our proposals, since the unwritten rule we cited in the introduction is also transgressed.

Figure 5. Unbalancing degrees (%) of the RM and RC allocation policies on the integer and floating-point benchmarks.

Previous research work on improving the register file access time of out-of-order execution superscalar processors includes virtual-physical registers [13] (limiting the number of physical registers) and register caching [4] (caching the critical registers).

Monreal et al. [13] proposed virtual-physical registers. The allocation of a physical register is delayed until instruction execution or even result write-back. The renaming of registers is replaced by the allocation of a virtual stamp which is not directly associated with any physical location. A physical register is associated with the virtual stamp at instruction execution (or result write-back). This solution reduces the number of required physical registers, and therefore the silicon area of the physical register file and its power consumption. This approach addresses the number of physical registers needed in a superscalar processor.

Cruz et al. [4] remarked that many physical registers have to be accessible on the very next cycle, while many physical registers are never even read, since they are used only once and captured through the bypass network. They proposed to use a register file cache. Only registers likely to be useful in the very next cycles are written into the register file cache. A complete register file copy is maintained, but it can feature a longer access time as well as fewer read ports. This organization results in low-latency register access while supporting a large number of physical registers. As it allows a shorter register read pipeline, it also decreases the number of possible sources at each bypass point. Balasubramonian et al. [1] remarked that a physical register must stay alive long after its last read has been issued, that is, until this last read is validated. The contents of these live critical registers can therefore migrate to an L2 register cache.

As many instructions do not really use the read and write ports of the register file (monadic instructions, instructions with no result, operands captured on the bypass network), Balasubramonian et al. [1] also proposed to arbitrate the register access ports at run time, thus implementing fewer ports on the register file. However, this puts more pressure on another critical path of the processor: the wake-up and selection logic.

Stark et al. [18] proposed to pipeline the wake-up and selection logic to address this electrical critical path. Brown et al. [2] further proposed to optimistically select any instruction that is fireable, removing the selection logic from the critical path. The critical path as well as the power consumption of the wake-up logic are addressed by Ernst and Austin [5] through eliminating the tag comparison for one of the operands. Folegnani and Gonzalez [8] proposed to selectively disable part of the comparators in the wake-up logic.

We would like to point out that all these techniques [13, 4, 1, 18, 2, 5, 8] are orthogonal to WSRS and can be applied at the cluster level to WSRS architectures.

7 Conclusion

Scaling the current superscalar designs to wider-issue

processors would result in a more than quadratic increase of the register file, bypass network and wake-up/selection logic complexities. In this paper, we have shown that these issues may be attacked by transgressing an unwritten rule that has so far been applied to all superscalar processor designs. All currently used general-purpose ISAs feature a single set of logical general-purpose registers. This central view of the general-purpose register file has also been adopted for the hardware implementation of the physical register file, i.e., any general-purpose physical register can be read or written by any integer functional unit.

By using Register Write Specialization, i.e., by forcing each functional unit to write only into a fixed subset of the physical register file, one can dramatically decrease the complexity of the physical register file of a wide-issue superscalar processor. The number of write ports on each physical register is decreased and therefore the silicon area, the power consumption and the access time of the register


file are significantly decreased. Register Write Specialization does not impair performance for static policies of allocating instructions to clusters or functional units.

Second, we have proposed to combine register write specialization and register read specialization in a 4-cluster WSRS architecture. On an 8-way 4-cluster WSRS architecture, the complexities of the bypass points and of the wake-up logic entries are the same as those found on a conventional 4-way superscalar processor. Moreover, the complexity of the physical register file is even further reduced compared with using register write specialization alone (see Table 1 in Section 4).

The 4-cluster WSRS architecture trades the complexity of the register file, the bypass network and the wake-up logic against degrees of freedom for allocating instructions to clusters and a more complex register renaming (at the cost of a few extra pipeline stages in register renaming). The location of the physical register operands restricts the set of clusters that can execute an instruction. However, monadic instructions can be executed on several clusters, and commutative dyadic operations can also be executed on several clusters.

The performance study presented in this paper indicates that, by exploiting these degrees of freedom, simple policies for allocating instructions to clusters are able to reasonably balance the workload among the clusters. This performance study also indicates that, at equal cycle time, the 4-cluster WSRS architecture achieves performance levels in the same range as a conventional 4-cluster architecture using round-robin instruction allocation.

Furthermore, in [15], we have also shown that the WSRS architecture can be extended to a 7-cluster architecture while maintaining the complexities of each individual wake-up logic entry and of each bypass point, and while still using only two (4-read, 3-write) copies of each individual physical register.

References

[1] Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi. Reducing the complexity of the register file in dynamic superscalar processors. In Proceedings of the 34th Annual International Symposium on Microarchitecture, December 2001.

[2] M. Brown, J. Stark, and Y. Patt. Select-free instruction scheduling logic. In Proceedings of the 34th Annual International Symposium on Microarchitecture, December 2001.

[3] Robert P. Colwell, Robert P. Nix, John J. O'Donnell, David B. Papworth, and Paul K. Rodman. A VLIW architecture for a trace scheduling compiler. In Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS II), October 1987.

[4] José-Lorenzo Cruz, Antonio Gonzalez, Mateo Valero, and Nigel Topham. Multiple-banked register file architectures. In Proceedings of the 27th International Symposium on Computer Architecture, June 2000.

[5] D. Ernst and T. Austin. Efficient dynamic scheduling through tag elimination. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.

[6] Paolo Faraboschi, Geoffrey Brown, Joseph A. Fisher, Giuseppe Desoli, and Fred (Mark Owen) Homewood. Lx: A technology platform for customizable VLIW embedded processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000.

[7] K. I. Farkas, P. Chow, N. P. Jouppi, and Z. Vranesic. The multicluster architecture: Reducing cycle time through partitioning. In Proceedings of the 30th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-97), December 1997.

[8] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In Proceedings of the 28th Annual International Symposium on Computer Architecture, June 30-July 4, 2001.

[9] R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, September 1996.

[10] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 2001.

[11] Richard E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24-36, 1999.

[12] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for energy reduction. In Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA-98), June 27-July 1, 1998.

[13] Teresa Monreal, Antonio Gonzalez, Mateo Valero, José Gonzalez, and Victor Vinals. Dynamic register renaming through virtual-physical registers. Journal of Instruction-Level Parallelism, May 2000.

[14] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206-218, 1997.

[15] André Seznec. A path to complexity-effective wide-issue superscalar processors. Technical Report IRISA PI 1411, August 2001.

[16] André Seznec, Stephen Felix, Venkata Krishnan, and Yanos Sazeides. Design tradeoffs for the EV8 branch predictor. In Proceedings of the 29th Annual International Symposium on Computer Architecture, May 2002.

[17] André Seznec and Pierre Michaud. De-aliased hybrid branch predictors. Technical Report RR-3618, Inria, 1999.

[18] J. Stark, M. D. Brown, and Y. N. Patt. On pipelining dynamic instruction scheduling logic. In Proceedings of the 33rd Annual International Symposium on Microarchitecture, December 2000.

[19] Marc Tremblay, Bill Joy, and Ken Shin. A three dimensional register file for superscalar processors. In Proceedings of the 28th Annual Hawaii International Conference on System Sciences, January 1995.

[20] Steven J. E. Wilton and Norman P. Jouppi. CACTI: An enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits, May 1996.

[21] Victor Zyuban and Peter Kogge. The energy complexity of register files. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED-98), August 10-12, 1998.
