SCALABLE LOW ENERGY REGISTER FILE ARCHITECTURES FOR VLIW PROCESSORS

NEERAJ GOEL

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY DELHI

AUGUST 2010

© Indian Institute of Technology Delhi (IITD), New Delhi, 2010. All rights reserved.

SCALABLE LOW ENERGY REGISTER FILE ARCHITECTURES FOR VLIW PROCESSORS

by Neeraj Goel

Department of Computer Science and Engineering

Submitted

in fulfillment of the requirements of the degree of

Doctor of Philosophy

to the

Indian Institute of Technology Delhi

August 2010

Certificate

This is to certify that the thesis titled “SCALABLE LOW ENERGY REGISTER FILE ARCHITECTURES FOR
VLIW PROCESSORS”, being submitted by Neeraj Goel to the Indian Institute of Technology Delhi for
the award of the degree of Doctor of Philosophy, is a record of bona fide research work carried out
by him under our supervision. In our opinion, the thesis has reached the standards fulfilling the
requirements of the regulations relating to the degree.

The results contained in this thesis have not been submitted to any other university or institute for

the award of any degree or diploma.

Anshul Kumar

Professor


Indian Institute of Technology Delhi, New Delhi 110 016

Preeti Ranjan Panda

Professor


Indian Institute of Technology Delhi, New Delhi 110 016

Acknowledgments

I would like to take this opportunity to thank all those who helped me make my PhD dissertation a
success. First, I would like to thank my supervisors, Prof. Anshul Kumar and Dr. Preeti Ranjan
Panda; without their suggestions, technical guidance, and constructive feedback, this thesis would
not have taken its present shape. Moreover, they gave me a friendly environment, the freedom to
work in my own way, and their time whenever it was required, which made my PhD experience a
unique one.

I express my gratitude to Prof. M. Balakrishnan, who encouraged me to pursue a PhD and helped me
throughout my stay at IIT Delhi by providing day-to-day suggestions, feedback, and encouragement.
I would also like to thank Dr. Kolin Paul and Prof. Ranjan Bose, who gave useful feedback on my
SRC presentations.

I am grateful to my seniors, Anup Gangwar, Basant Dwivedi and Satyakiran Munaga for their

encouragement and guidance during PhD. I would like to express my gratitude to Sonali Chouhan,

for her extended support and many informal discussions. Discussions with my PhD colleagues,

Aryabartta Sahu, Anant Vishnoi, Nagaraju Pothineni, Lava Bhargawa, BVN Silpa, G Krishniah and

Vikram Goyal helped me at various points.

During my PhD, I got the opportunity to be closely involved in various B. Tech. and M. Tech.
projects. These projects helped increase the breadth of my understanding. In particular, I would
like to thank Manoj Gupta, Rakesh Nalluri, Devdutt, Ramakrishna, Monika Gupta, and Kiran
Chandramohan for working with me and sharing their thoughts.

I would like to thank the members of the lab staff, Vandana Ahulwalia (Philips VLSI lab) and
Somdutt Sharma (DHD lab), who made their support available in a number of ways.

I am indebted to my father, mother, and sisters for their endless love, immense patience, and
moral support. Support from my wife, Deepika, during my thesis writing stage was crucial.

Last but not the least, I owe my deepest gratitude to Almighty, who makes everything happen,

who brings the ideas in one’s mind, creates environment to cherish those ideas, and motivates one

to implement the ideas.

August 2010 Neeraj Goel

Abstract

Multiported register files (RFs) consume a significant fraction of the energy in VLIW processors.
Owing to their large number of ports, they do not scale well as the number of function units (FUs)
increases. We observe that the bandwidth provided by the RF is not fully utilized by the processor.
Moreover, in most applications many variables are short-lived, i.e., they are produced and consumed
within a short duration.

Based on these observations, we propose a two level register file architecture in which the first
level consists of local buffers associated with each FU and the second level is a monolithic RF. We
explore different architecture options for the local buffer based architecture. As most RF accesses
will be served by the first level, we propose to reduce the number of ports of the second level RF
by sharing them among FUs. However, port sharing among FUs may lead to access conflicts and thus
reduced performance. Further, port sharing may offset some of the energy savings brought by port
reduction. To address these issues, our solution includes a carefully designed RF-FU
interconnection network that permits port sharing with minimum conflicts and energy overheads. To
minimize the performance loss due to conflicts and to maximize energy savings by increasing the
accesses to local buffers, we propose a novel scheduling and binding algorithm.

To estimate the effect of the number of ports on performance and energy, we developed analytical
models. With the help of these models and a number of experiments, we established that the proposed
architecture leads to as much as 74% register file energy savings with not more than 5% loss in
performance for a 4 issue width processor. Experiments on processors of different issue widths
reveal that the proposed architecture is scalable in both performance and energy.

Contents

Acknowledgments
Abstract

1 Introduction
1.1 Motivation
1.2 Previously Proposed RF Architectures
1.2.1 Single Level and Monolithic RF Architecture
1.2.2 Single Level Multibanked RF Architecture
1.2.3 Two Level RF Architectures
1.3 Proposed Solutions: Local Buffers Based RF Architecture
1.4 Thesis Outline

2 Proposed RF Architecture
2.1 Motivation
2.2 Local Buffer Based Architecture
2.2.1 RISO Operand Buffers
2.2.2 SIRO Result Buffers
2.2.3 RIRO Buffers
2.2.4 Qualitative Analysis
2.3 SIRO Buffers and Conventional VLIW Architecture
2.3.1 SIRO Buffers and RF Bypass
2.3.2 Advantages of SIRO Buffers
2.4 Reduced Port Second Level RF
2.4.1 Processor Design with Shared Port RF
2.5 Issues with the Proposed Architectures
2.6 Related Work
2.7 Summary

3 Code Generation for Proposed Architecture
3.1 Motivation
3.2 Scheduling-binding Problem and Methodology
3.3 Proposed Scheduling and Binding Algorithm
3.3.1 Scheduling Priority Function
3.3.2 RF Port Aware Scheduling and Binding
3.3.3 Iterative Schedule Improvement
3.4 Additional Compiler Support
3.4.1 Identifying Global and Local Reads and Writes
3.4.2 Code Generation: Register Renaming
3.5 Summary

4 Performance and Energy Models
4.1 Model for Fixed Issue-width Processor
4.1.1 Performance Model
4.1.2 RF Energy Model
4.2 Modeling of a Generic Processor
4.2.1 Performance Model
4.2.2 RF Energy Model
4.3 Summary

5 Model Validation and Evaluation of the Proposed Architecture
5.1 Implementation Framework
5.1.1 Base Architecture
5.1.2 Benchmarks
5.2 Model Validation
5.2.1 Performance Model for Fixed Issue Width Processor
5.2.2 Performance Model for Generic Processor
5.3 Architecture Evaluation
5.3.1 Area
5.3.2 Number of SIRO Reads
5.3.3 Performance
5.3.4 Direct Interconnect Evaluation
5.3.5 Energy
5.4 Summary

6 Varying Issue Width and Scalability
6.1 RF and Processor Scalability
6.1.1 Related Work in Processor Scalability
6.2 Experimental Setup
6.3 Performance
6.4 Energy
6.5 Clustered VLIW and Scaling
6.6 Summary

7 Conclusions and Future Work
7.1 Contributions and Major Results
7.2 Future Work

References

List of Figures

1.1 Architecture of a VLIW processor.
2.1 Example of short life time of variables.
2.2 Percentage global values accesses in various applications.
2.3 Cumulative number of reads for different number of cycles after write.
2.4 RF read port usage.
2.5 Base local buffer model.
2.6 RISO buffer based VLIW Architecture.
2.7 An example of instructions for RISO based architecture.
2.8 Detailed RISO Architecture.
2.9 SIRO buffer based VLIW Architecture.
2.10 Detailed SIRO Architecture.
2.11 Example instructions for SIRO based architecture.
2.12 SIRO buffers and RF bypass.
2.13 Bypass control for one operand of a functional unit.
2.14 RF-FU interconnection topologies for shared ported register file.
2.15 Direct interconnection.
2.16 Example direct interconnects and corresponding interconnection matrices.
3.1 An abstract view of reservation table.
3.2 Example data-flow graph.
3.3 Operations binding example.
3.4 Schedule for the 4 issue slot VLIW processor with 4 read port and 4 write port RF.
3.5 Example binding conflict graph.
3.6 Example binding conflict graph after binding of OP5 and OP6.
3.7 Schedule for the 4 issue slot VLIW processor with 4 read port and 3 write port RF.
3.8 Example: Register renaming.
4.1 Basic block diagram of the ILP model.
5.1 Experiment framework
5.2 Model validation against simulation results.
5.3 Model validation for different issue width processors and different read write port configurations
5.4 SIRO buffer reads for different issue processors.
5.5 Direct interconnection RF architecture exploration.
5.6 Different direct RF configurations
5.7 Performance evaluation.
5.8 Effectiveness of RPA scheduling algorithm with respect to a naive algorithm (Set II benchmarks)
5.9 Normalized average RF energy for the direct and the complete interconnect topologies.
5.10 Normalized RF energy for different benchmarks.
6.1 Performance for different issue width processors.
6.2 Normalized cycle-delay product for different issue width processors
6.3 Total RF energy for different issue width processors
6.4 Normalized performance for clustered VLIW processors

List of Tables

3.1 Type of operations that can be executed on each function unit
5.1 Function unit positions
5.2 Benchmark characteristics.
5.3 Comparison of processor core area values with/out SIRO buffer information
5.4 Interconnect matrices for different direct RF configurations
6.1 High ILP Benchmark details
6.2 Medium ILP Benchmark details

1 Introduction

1.1 Motivation

In recent years, embedded systems have changed remarkably. There has been a shift from control
dominated applications to computation intensive applications. For example, embedded systems such
as smart phones, music players, and DVD players execute a number of computation intensive
applications. These systems require high performance processors to meet the computation demand.
The processors also need to consume little energy, as most of these systems are battery driven.

Traditional micro-controllers and RISC based processors consume very little energy but often do
not meet the performance requirement. Therefore, they are not preferred for high performance
systems. At the other extreme, processors based on superscalar architectures deliver high
performance but also consume a lot of energy. In superscalar processors, instruction level
parallelism is determined by hardware and multiple instructions are executed concurrently. Finding
parallel instructions in hardware leads to complex logic and high energy consumption. Such
processors are therefore more suitable for systems where the energy constraint is not very
stringent.

Between these two extremes lie very large instruction word (VLIW) processors. In VLIW processors,
instruction level parallelism is determined at compile time. As in superscalar processors, multiple
operations are executed concurrently, but without complex hardware. Therefore, VLIW processors can
meet the requirements of both high performance and low energy.

Various kinds of application specific processors, such as DSP processors, are also used for high
end embedded applications. These often have VLIW-style instruction level parallelism as well.
There are many examples of commercial processors with VLIW architectures, such as
STMicroelectronics' Lx [Faraboschi et al., 2000], Intel's Itanium [McNairy and Soltis, 2003], TI's
320C6x [Seshan, 1998], NXP's Trimedia [van Eijndhoven et al., 1999], and Analog Devices'
TigerSharc [Fridman and Greenfield, 2000].

In recent years, a trend towards multi-core architectures has been observed. Multi-core
architectures also offer the benefits of low energy and higher performance. They exploit the thread
level parallelism available in applications to enhance performance. The VLIW approach is orthogonal
to the multi-core approach: each core can be a VLIW processor, so both instruction level and thread
level parallelism can be exploited to achieve higher performance. Intel's Itanium 3, Fujitsu's
FR1000 [Shiota et al., 2005], SiliconHive's Avispa [sil], and Tilera's TILE64 [til] are a few
examples of multi-core architectures with a VLIW processor as the core. There is also evidence of a
VLIW processor being used as one of the cores in a multi-core design [Stolberg et al., 2005].
Therefore, the study of VLIW architectures is important for high end embedded applications.

Figure 1.1 shows the simplified architectural view of a typical VLIW processor with issue width

N. Issue width of a processor is defined as the maximum number of operations that can be executed

in parallel. An instruction containing N operations is read from the instruction memory in the fetch

stage. In the decode stage, the opcode is decoded and operands are read from the register file (RF).

In the execute stage, the operations are executed. If the operation is a memory operation, in the

memory stage it reads the data from or writes it to the data memory. The results are written back

to the RF in the write back stage. All these stages are pipelined such that a new instruction can be

fetched every cycle. To avoid data hazards, results produced by the FUs are provided at the input of

the FUs via bypass paths. In different commercial VLIW processors, the basic architecture remains

the same, though there may be extra hardware or different number of pipeline stages to improve the

performance or energy of the processor.

[Figure 1.1: Architecture of a VLIW processor. The diagram shows an instruction of N operations
(each with opcode, src1, src2, and dest fields) passing through the Fetch, Decode and Register
Read, Execute, Memory, and Writeback stages, with the instruction memory, FU 1 to FU N, the
register file, the data memory, and the bypass paths.]

When the issue width of a processor is increased for higher performance, the number of ports of
the instruction memory remains the same (only the width of the memory bus increases), and the
number of ports of the data memory increases only if memory function units are added, but the
number of RF ports always increases. In other words, the RF is always affected when the issue width
changes. If one function unit (FU) requires two read ports and one write port, then for N FUs, 2N
read and N write RF ports are usually present, with a 1-to-1 connection between the FU ports and
the RF ports. Zyuban and Kogge [1998] observed that RF power increases super-linearly (N^2 to N^3)
with the number of ports. Rixner et al. [2000] have shown that the area and access time of the RF
increase on the order of N^3 and N^(3/2), respectively. Experimental data also suggest that in VLIW
processors the multiported RF consumes a significant fraction of total processor energy [van de
Waerdt et al., 2005; Lambrechts et al., 2005].

As the number of FUs in a VLIW processor increases, the port requirement of the RF increases. With
the increase in the number of ports, the area, power, and access time of the RF increase
super-linearly. As a result, the RF is the least scalable component in high issue width VLIW
processors. Therefore, there is a need to design an RF that scales better as the number of FUs in a
VLIW processor increases.
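
As a rough numerical illustration of the scaling argument above, the following Python sketch
computes the port count of a conventional RF for a given issue width and the relative growth
implied by the quoted exponents. The function names are hypothetical, constant factors are ignored,
and the exponents are applied to N exactly as quoted in the text; this is only an order-of-magnitude
reading, not a power or area model.

```python
def rf_ports(issue_width, reads_per_fu=2, writes_per_fu=1):
    """Ports of a conventional RF with a 1-to-1 FU-port to RF-port connection."""
    return reads_per_fu * issue_width, writes_per_fu * issue_width


def relative_growth(n_from, n_to, exponent):
    """Growth factor of a quantity that scales as N**exponent (constants ignored)."""
    return (n_to / n_from) ** exponent


if __name__ == "__main__":
    for n in (2, 4, 8):
        reads, writes = rf_ports(n)
        print(f"N={n}: {reads} read ports, {writes} write ports")
    # Exponents as quoted in the text: power N^2 to N^3, area N^3, access time N^(3/2).
    print("power, 4 -> 8:", relative_growth(4, 8, 2), "to", relative_growth(4, 8, 3))
    print("area, 4 -> 8:", relative_growth(4, 8, 3))
    print("access time, 4 -> 8:", relative_growth(4, 8, 1.5))
```

Under these assumptions, doubling the issue width from 4 to 8 multiplies RF power by a factor
between 4 and 8, area by 8, and access time by about 2.8.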

Problem Statement

The objective of this research is to design an RF architecture for VLIW processors that consumes
little energy and is scalable in terms of performance. In VLIW processors, the order of execution
as well as the set of operations that execute in parallel is determined by the compiler. Therefore,
designing the RF architecture of a VLIW processor also involves the associated compiler
development. The compiler algorithms are necessary for the correct functioning of the modified
hardware as well as for enhancing the performance of the processor. In this thesis we focus on the
energy and scalability aspects of the RF for VLIW processors. Scalability of the RF is assessed in
terms of area, execution cycles, execution time, and energy.

A number of RF architectures have been proposed in the past for different processors and with
different motivations. Before going further, we first look at these architectures.

1.2 Previously Proposed RF Architectures

We classify the previously proposed RF architectures into three categories: first, single level
monolithic RF; second, single level multibanked RF; and third, multiple level RF with a monolithic
RF at each level.


1.2.1 Single Level and Monolithic RF Architecture

The monolithic or centralized RF is the RF architecture used in traditional processor designs.
Various techniques have been proposed to optimize the monolithic RF. Sangireddy [2007] suggests
reducing the number of RF ports by having the issue logic select instructions based on their number
of operands. Instructions with two operands are issued to specific slots, and instructions that
require fewer operands are issued to the other slots. Park et al. [2002] further reduce the RF
ports by lowering the operand requirement: values that are available in the bypass paths are not
read from the RF.

Packing more than one variable into a single register has been suggested as a way to increase the
effective number of registers in a given register file. Ergin et al. [2004] suggest using a single
register to store more than one value of smaller bit-width. They also suggest bit-width aware
register allocation in hardware. Kondo and Nakamura [2005] suggest using different banks for the
lower significant bits and the upper significant bits. If one subword has all zero bits, that
subword is released at the writeback stage and can be used by other operands. Gonzalez et al.
[2004a] allocate the same RF space to two variables if their values are the same. In another
approach, bit-width awareness is used to reduce the width of a few RF ports [Aggarwal and Franklin,
2003]. The overall number of ports remains unchanged, but RF energy is saved because some ports are
narrower.

1.2.2 Single Level Multibanked RF Architecture

In this class of architectures there are multiple register files, each with fewer registers and
ports. If each RF bank is connected to all FUs, then it is termed a banked RF in the literature. If
each RF bank is connected to a subset of FUs, it is called a clustered architecture. Both of these
architectures are discussed next.

Banked RF Architecture

The banked RF mimics the behavior of a single RF with a large number of ports. However, there can
be conflicts in accessing an RF bank; e.g., if an RF bank has a single read port, then only one FU
can read the bank in a cycle. Conflicts in accessing banks lead to performance penalties. In
superscalar processors, conflict management is done in hardware, while in VLIW processors the
conflicts are managed at compile time.
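
The bank conflict described above can be made concrete with a small Python sketch that counts how
many reads in one cycle exceed the read ports of their bank. The mapping of a register to a bank by
integer division and the function name are assumptions made for illustration; real banked RFs may
map registers to banks differently.

```python
from collections import Counter


def count_read_conflicts(reads, regs_per_bank, read_ports_per_bank=1):
    """Count reads in one cycle that cannot be served by their bank.

    reads: register numbers read in the same cycle.
    Assumption: register r lives in bank r // regs_per_bank.
    """
    reads_per_bank = Counter(reg // regs_per_bank for reg in reads)
    return sum(max(0, n - read_ports_per_bank) for n in reads_per_bank.values())


if __name__ == "__main__":
    # Registers 0 and 1 share bank 0, so one of the two reads conflicts.
    print(count_read_conflicts([0, 1, 8, 17], regs_per_bank=8))
    # With two read ports per bank the same access pattern is conflict free.
    print(count_read_conflicts([0, 1, 8, 17], regs_per_bank=8, read_ports_per_bank=2))
```

A compiler for a VLIW processor would have to keep this count at zero for every instruction it
emits, whereas a superscalar processor would detect the same condition in hardware.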

To avoid conflicts, Balasubramonian et al. [2001] suggest partial reads, i.e., if one operand is
available from a bank, that operand is read and latched until the other operand is read from the
other RF bank. Tseng and Asanovic [2003] suggest dividing the RF ports into left ports and right
ports, which are connected to the left ports and right ports of the FUs, respectively. This reduces
the size of the crossbar connecting FU ports and RF ports. In their design, port arbitration is
done in a separate pipeline stage. To reduce conflicts, the authors also suggest using values from
the bypass network. Pericas et al. [2004] also suggest resolving conflicts with an arbiter. Ayala
et al. [2004] suggest an approach in which register allocation by the compiler partially controls
the register renaming to reduce conflicts; conflict management is still done in hardware. In all
these techniques [Tseng and Asanovic, 2003; Pericas et al., 2004; Ayala et al., 2004], in case of a
port conflict, the instruction either waits for the port or is killed and reissued.

In RISC and VLIW processors no register renaming is performed in hardware, so the compiler can
allocate registers of different banks to different operands, and no port conflict is encountered in
the hardware. Llosa et al. [1994, 1995] present a compiler based technique of RF banking for VLIW
processors. They present a two bank model in which one bank has a large number of ports and the
other has fewer ports. The registers in the two banks may be mutually exclusive or may be present
in both banks, with consistency managed by the compiler. In a similar effort, Nalluri et al. [2007]
suggest register file banking for a RISC architecture, where the compiler assigns the most accessed
registers to the smallest bank and the rest to the other banks.
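
The idea of placing the most accessed registers in the smallest bank can be sketched as a simple
greedy assignment. This is only an illustration of the principle stated above, not the algorithm of
Nalluri et al. [2007]; the function name, the input format, and the tie-breaking rules are
assumptions.

```python
def assign_banks(access_counts, bank_sizes):
    """Greedily map variables to banks, hottest variables to the smallest bank first.

    access_counts: dict mapping a variable name to its access count.
    bank_sizes: list with the number of registers in each bank.
    Returns a dict mapping each variable to a bank index.
    """
    if len(access_counts) > sum(bank_sizes):
        raise ValueError("more variables than registers")
    # Most accessed first; ties broken by name for a deterministic result.
    hottest_first = sorted(access_counts, key=lambda v: (-access_counts[v], v))
    # Smallest bank first; ties broken by bank index.
    smallest_first = sorted(range(len(bank_sizes)), key=lambda b: (bank_sizes[b], b))
    assignment = {}
    remaining = iter(hottest_first)
    for bank in smallest_first:
        for _ in range(bank_sizes[bank]):
            var = next(remaining, None)
            if var is None:
                return assignment
            assignment[var] = bank
    return assignment


if __name__ == "__main__":
    counts = {"a": 10, "b": 7, "c": 3, "d": 1, "e": 1}
    # Bank 1 holds 2 registers and receives the two hottest variables.
    print(assign_banks(counts, bank_sizes=[4, 2]))
```

The rationale is that a small bank is cheaper to access, so serving most accesses from it lowers
the average access energy.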



Clustered Architectures

In clustered architectures each FU port cannot access all the RF banks. Because of this limited
connectivity, an RF to RF interconnection network is required. Clustered RF architectures have been
used in both superscalar and VLIW processors. Superscalar processors manage inter-RF communication
using hardware mechanisms [Palacharla et al., 1997; Yeager, 1996; Farkas et al., 1997], while in
VLIW architectures the compiler inserts explicit instructions to copy an operand from one RF to
another when required [Capitanio et al., 1992; Seshan, 1998; Faraboschi et al., 2000; Gangwar,
2005].

The Alpha 21264 replicates the registers in the register file of each cluster [Kesseler, 1999]. A
write is broadcast, while a read is done from the local register file. This reduces the read ports
of each RF and also allows a single port to be used for both read and write. Along similar lines,
Gonzalez et al. [2004b] suggest using three register files instead of one, which are used in
different stages of the processor pipeline, with each write broadcast to all the RFs. Owing to the
different usage patterns, the number of ports and the size of each RF are smaller than those of the
centralized RF.
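
The write-broadcast, local-read organization described above can be sketched functionally as
follows. The class and method names are hypothetical, and the sketch models only the data movement;
it ignores timing, port counts, and any delay a broadcast may incur in real hardware.

```python
class ReplicatedRF:
    """Functional sketch of a register file replicated once per cluster."""

    def __init__(self, num_clusters, num_regs):
        # One full copy of the architectural registers per cluster.
        self.copies = [[0] * num_regs for _ in range(num_clusters)]

    def write(self, reg, value):
        # A write is broadcast to the copy in every cluster.
        for copy in self.copies:
            copy[reg] = value

    def read(self, cluster, reg):
        # A read only touches the copy local to the requesting cluster.
        return self.copies[cluster][reg]


if __name__ == "__main__":
    rf = ReplicatedRF(num_clusters=2, num_regs=4)
    rf.write(3, 42)
    print(rf.read(0, 3), rf.read(1, 3))
```

Because every cluster reads only its own copy, each copy needs read ports only for the FUs of that
cluster, at the cost of storing every register once per cluster.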

1.2.3 Two Level RF Architectures

A two level RF, also known as an RF cache or hierarchical RF, reduces the number of registers in
the RF that is directly connected to the FUs, while the number of RF ports remains the same. The
other RF bank may have more registers and a longer access time.

For superscalar architectures, several two level RF organizations have been explored. Cruz et al.
[2000] discuss a two level RF architecture in which operands are read from the first level, while
results are written to both levels. To bring values from level two to level one, the authors
suggest prefetching and caching techniques. Balasubramonian et al. [2001] and Sangireddy [2004]
suggest that only the level one RF be visible to the reorder buffer and rename table (of the
superscalar architecture), and propose hardware that copies values to the level two RF once they
are no longer required except in case of a branch misprediction. Reinman [2005] introduces a
similar idea with two register files, an operand register file and a speculative register file;
the latter is used only in case of a branch misprediction. In VLIW processors, an explicit move
operation is required to copy a register from one RF level to the other [Zalamea et al., 2000].

1.3 Proposed Solutions: Local Buffers Based RF Architecture

We propose an architecture with two levels of register files, where the first level is partitioned
into a number of banks called local buffers, each associated with an FU or issue slot, and the
second level is a monolithic RF. The two level organization reduces the number of accesses to the
level two RF. Local buffers at the first level localize the FU-RF interconnection, avoiding the
disadvantage of the banking solution. Given the reduced number of accesses to the second level RF,
we propose a reduced port architecture for it.

The proposed architecture combines the advantages of most previously proposed architectures.
Because each local buffer is connected to its own FU, no RF-FU interconnect is needed for the level
one RF. The first level RF is scalable because it is distributed. Energy is reduced due to fewer
accesses to the second level RF. Port reduction of the second level RF further reduces the energy
consumed and leads to a scalable RF.

With a two level RF and partitioning at level one, there are several possible interconnections. We
explore these possibilities to arrive at a low energy solution. To further reduce the RF energy, we
reduce the number of ports of the second level RF by sharing RF ports among FU ports. We study
RF-FU interconnects and propose a direct interconnect approach that consumes the least energy while
retaining performance.

In VLIW architectures, the order of execution is planned by the compiler, so the compiler has an
important role in all architectural optimizations. We study the required compiler support and
propose a scheduling and binding algorithm that improves the performance and effectiveness of the
proposed RF architecture. Our scheduling and binding algorithm (1) increases the utilization of
local buffers, (2) minimizes the performance loss due to port conflicts caused by the reduced port
RF, and (3) efficiently binds operations to FUs in the presence of the given RF-FU interconnects.
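
To illustrate what a port conflict on a reduced port RF means for the scheduler, the Python sketch
below checks whether a group of operations can be issued together given the number of shared read
and write ports of the second level RF. It is not the scheduling and binding algorithm of this
thesis: the `Op` record and `fits_in_cycle` are hypothetical names, the RF-FU interconnect topology
is ignored, and reads and writes are counted per instruction even though they occur in different
pipeline stages.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    global_reads: int   # operands fetched from the second level RF
    global_writes: int  # results written to the second level RF


def fits_in_cycle(ops, read_ports, write_ports):
    """True if the ops together need no more RF ports than are available."""
    reads = sum(op.global_reads for op in ops)
    writes = sum(op.global_writes for op in ops)
    return reads <= read_ports and writes <= write_ports


if __name__ == "__main__":
    ops = [Op("add", 2, 1), Op("mul", 1, 0), Op("sub", 2, 1)]
    # Five second level reads exceed four read ports, so one op must be delayed.
    print(fits_in_cycle(ops, read_ports=4, write_ports=3))
    print(fits_in_cycle(ops[:2], read_ports=4, write_ports=3))
```

Operands served from local buffers do not count against the shared ports in this sketch, which is
why increasing local buffer utilization also reduces port conflicts.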

We also propose a theoretical model of the performance of the reduced port RF architecture.

The model takes the characterized application and processor architecture as input and estimates


execution cycles and energy. Issue width, number of RF read ports, and number of RF write ports

describe the processor. The model is validated against the simulation of various benchmarks over

various issue width processors. The model also establishes the scalability aspect of the proposed

RF architectures.

The scheduling and binding algorithm is implemented using the Trimaran compiler framework

[Chakrapani et al., 2005]. Various benchmarks of Mediabench and MiBench are used for experiments. With

experiments we show the RF energy reduction and the scalability of the proposed architecture. The

main contributions of this thesis are the following:

1) Proposed, analyzed and explored a new RF architecture with local buffer at first level and reduced

port RF at second level.

2) Proposed scheduling and binding algorithms for the proposed architecture that optimize energy

and performance of applications.

3) Developed theoretical models for performance and energy estimations of the proposed architec-

ture.

4) Demonstrated scalability of the proposed architecture.

1.4 Thesis Outline

The rest of the thesis is organized as follows. Chapter 2 discusses the proposed RF architectures in

detail. The local buffer based RF architecture and reduced port RF architectures are discussed in

this chapter. Compiler aspects of the proposed RF architecture and scheduling and binding algorithms

are discussed in Chapter 3. Performance and energy models are discussed in Chapter 4. In Chapter

5 we validate our model and evaluate the proposed architecture with experiments on a fixed issue

width processor. In Chapter 6 we vary the issue width of the processor and show the scalability of

the proposed architecture. Conclusions and future work are discussed in Chapter 7.


2 Proposed RF Architecture

A VLIW processor with N issue slots usually has a register file with 2N read ports and N write ports

to support the following:

• Each of the N concurrent operations can simultaneously read two operands and write one

result.

• A value produced by any FU can be read by any FU.

• A value written in any cycle can be read after any amount of delay.

However, the real situation is not as demanding and the design can be simplified with a view to

reduce the chip area and/or power consumption. In this chapter we examine the typical demands

imposed on the VLIW RFs and present an energy-saving architecture.

2.1 Motivation

It has been observed that most values stored in the RF have a short lifetime confined to a basic block.

A ‘use’ of a value is defined as a local read if the value is defined in the same basic block; otherwise it

is called a non-local read. Similarly, if all uses of a definition are in the same basic block then the

definition is termed a local write; otherwise it is a non-local write. For example, Fig. 2.1 shows a

high level statement and the corresponding assembly level code. In this code, variables Aaddr, Baddr,

Caddr, A, B, and C are produced (defined) and consumed (used) and are not used later, and thus have

short lifetimes in the RF. Such variables need not be stored in the RF if a local and temporary storage

is available.

A high level code example:

C[x] = A[y] + B[z]

Assembly level code:

1. Aaddr <- Abase + y
2. Baddr <- Bbase + z
3. Caddr <- Cbase + x
4. A <- load Aaddr
5. B <- load Baddr
6. C <- A + B
7. store C Caddr

Figure 2.1: Example of short life time of variables.
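
The local/non-local classification can be made concrete with a minimal Python sketch applied to the operations of Fig. 2.1. The tuple encoding of operations, the name classify_block, and the live_out parameter are illustrative assumptions and not part of the thesis tool flow. The sketch assumes each name is defined at most once in the block, that the address operand of the store is Caddr, and it approximates a write as non-local only when its destination is listed in live_out.

```python
# Operations of Fig. 2.1 as (destination, sources); this encoding is an illustrative assumption.
FIG_2_1_BLOCK = [
    ("Aaddr", ["Abase", "y"]),
    ("Baddr", ["Bbase", "z"]),
    ("Caddr", ["Cbase", "x"]),
    ("A", ["Aaddr"]),          # load
    ("B", ["Baddr"]),          # load
    ("C", ["A", "B"]),
    (None, ["C", "Caddr"]),    # store: no register result
]


def classify_block(block, live_out=frozenset()):
    """Count (local_reads, nonlocal_reads, local_writes, nonlocal_writes) for one basic block."""
    defined = set()
    local_reads = nonlocal_reads = 0
    for dest, sources in block:
        for src in sources:
            if src in defined:      # value was produced earlier in this block
                local_reads += 1
            else:                   # value comes from outside the block
                nonlocal_reads += 1
        if dest is not None:
            defined.add(dest)
    # A definition is treated as non-local only if it is live after the block.
    nonlocal_writes = sum(1 for name in defined if name in live_out)
    local_writes = len(defined) - nonlocal_writes
    return local_reads, nonlocal_reads, local_writes, nonlocal_writes
```

With an empty live_out set, the sketch counts six local reads (Aaddr, Baddr, A, B, C, Caddr), six non-local reads (the base addresses and indices), and six local writes for this block.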

To generalize this observation, we perform an experiment with a set of Mediabench and MiBench

benchmarks. We use Trimaran [Chakrapani et al., 2005] and its simulator to obtain the information

about local reads and writes. The liveness information present in the compiler is used to mark a

read/write as local or global. From the experiments we observe that on average only 44% of reads

and 26% of writes are global (Fig. 2.2); the remaining reads and writes are local.

Local reads/writes allow the compiler to (a) schedule a read and its write such that the value is available

directly from the output of FUs, and (b) compute the number of cycles between read and write and order reads

and writes. Point (a) leads to chaining of operations in the schedule. Gangwar [2005] observes that there

is a large number of long chains of operations available in most applications that can be mapped to a

set of FUs (called a cluster). To exploit this locality we suggest physical local buffers associated with

each FU. Associating buffers with each FU distributes the storage and leads to a scalable design. Point (b)

helps in designing these buffers such that input or output can be serial, which simplifies the design

of the buffer.

Figure 2.2: Percentage global values accesses in various applications. [Bar chart. x-axis: benchmarks basicmath, dijkstra, bitcount, blowfish, FFT, patricia, qsort, sha, g721encode, g721decode, gsmdecode, gsmencode, unepic, rawcaudio, rawdaudio, pegwitdec, pegwitenc, and Average; y-axis: Percentage global reads/writes, 0% to 100%; series: Global reads, Global writes.]

To find the approximate size of these buffers, we performed another experiment. We calculated

the number of operand reads from RF within n cycles of write for different values of n. To find these

values, benchmarks were compiled, and simulated in Trimaran infrastructure for different issue

widths. The results are shown in Fig. 2.3 as cumulative number of reads represented as fraction of

total reads. The figure shows that beyond two cycles, the increase in the number of reads diminishes. In other

words, if each result value is kept in local buffers for two cycles, most local reads will be from local

buffers. The number of cycles for which a result value remains in a buffer represents the depth of the local

buffer. The experiment also suggests that the depth of the buffer can be made small (for example, 2)

without any significant loss of performance. We redefine local reads as those which are read

from local buffers. A write is redefined as a local write if all its uses are read from local buffers. All

other reads and writes are non-local reads and non-local writes, respectively.
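
The buffer depth experiment can be sketched as follows. This is a minimal Python illustration, assuming a trace of (write cycle, read cycle) pairs is available from simulation; the function names and the example trace are hypothetical, and the trace is made up rather than measured.

```python
from collections import Counter


def cumulative_read_fraction(reads, max_n):
    """reads: (write_cycle, read_cycle) pairs. Returns a list f where f[n] is the
    fraction of reads that happen at most n cycles after the corresponding write."""
    histogram = Counter(read - write for write, read in reads)
    total = len(reads)
    fractions, running = [], 0
    for n in range(max_n + 1):
        running += histogram.get(n, 0)
        fractions.append(running / total if total else 0.0)
    return fractions


def smallest_depth(fractions, target):
    """Smallest n whose cumulative fraction reaches the target, or None."""
    for n, fraction in enumerate(fractions):
        if fraction >= target:
            return n
    return None


# Made-up trace for illustration only (not measured data).
EXAMPLE_READS = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 8), (4, 5), (5, 6), (1, 12)]
```

Plotting the returned list against n gives a curve of the kind shown in Fig. 2.3, and smallest_depth picks the point at which a chosen share of reads is covered.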

Figure 2.3: Cumulative number of reads for different number of cycles after write. [Line plot. x-axis: Cycle after write (n), 0 to 20; y-axis: Cumulative number of reads within n cycles, 0 to 1; series: 16 Issue, 8 Issue, 4 Issue.]

The values which are not read from local buffers (non-local values) are read from second level

register file. Second level register file is a monolithic register file with 2N read ports and N write

ports, as there is a 1-to-1 connection between the FU ports and the RF ports. However, the average

RF port usage per cycle is less than 3N, because some of the operations like mov have only one

operand to read, while in some other operations, immediate operands do not require a register access

and memory store operation does not write back in the RF. The fact that the average instructions

per cycle (IPC) of an application is less than the peak IPC also contributes to lower average usage of RF

ports per cycle. Further, with a local buffer based architecture, the RF read and write traffic is expected to

drop considerably. Figure 2.4 shows the effect of these factors on average RF port usage for

a set of high ILP benchmarks. The benchmarks were executed using Trimaran compiler framework.

Figure 2.4 shows the difference between peak read port requirement and port requirement due to

average parallelism (given by 2*average IPC). Average port usage is even lower due to fewer ports

requirement of certain operations. Average RF port usage further reduces with operands available

in local buffers. For example, for a 6 issue processor, on an average 3 read ports are used though

the maximum port usage is 12 read ports. 12 read ports, in this case, clearly waste the available RF

bandwidth. Therefore, ports of the second level RF can be reduced.
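
The four quantities compared in Fig. 2.4 can be computed from a schedule as in the sketch below. This is a minimal Python illustration; the schedule encoding, the function name read_port_stats, and the example schedule are assumptions made here, and the example data is made up.

```python
def read_port_stats(schedule, issue_width):
    """schedule: list of cycles; each cycle is a list of operations; each operation
    is a list of source-operand kinds: 'rf' (second level RF), 'buf' (local buffer),
    or 'imm' (immediate). Returns per-cycle averages next to the peak requirement."""
    cycles = len(schedule)
    operations = sum(len(cycle) for cycle in schedule)
    kinds = [kind for cycle in schedule for op in cycle for kind in op]
    register_reads = sum(kind in ("rf", "buf") for kind in kinds)
    rf_reads = sum(kind == "rf" for kind in kinds)
    return {
        "peak_read_ports": 2 * issue_width,           # 2N
        "two_times_avg_ipc": 2 * operations / cycles,
        "avg_register_reads": register_reads / cycles,
        "avg_nonlocal_reads": rf_reads / cycles,      # reads that reach the second level RF
    }


# Made-up 2-issue schedule for illustration only.
EXAMPLE_SCHEDULE = [
    [["rf", "rf"], ["rf", "imm"]],
    [["buf", "rf"]],
    [["buf", "buf"], ["buf"]],
    [["buf", "imm"]],
]
```

Each step down from peak_read_ports to avg_nonlocal_reads corresponds to one of the factors discussed above: average parallelism below the peak, operations with fewer register operands, and operands served by local buffers.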

Figure 2.4: RF read port usage. [Line plot. x-axis: Number of FUs (N), 2 to 10; y-axis: RF reads per cycle, 0 to 24; series: Maximum RF reads (2N), 2 * Average IPC, Average RF reads, Average non-local reads.]

In summary, the motivations for a local buffer based architecture are the following:

• Local reads and writes are captured in local buffers.

• Distributed nature of local buffers increases scalability.

• Small size of buffers ensures that the clock period is not stretched.

• The traffic to the second level RF is reduced.

• Second level RF has reduced ports.

Based on these motivations we propose one local buffer to be associated with each input or

output port of an FU. We present various possibilities of the local buffer RF architecture in detail,

and discuss their architectural impacts.

2.2 Local Buffer Based Architecture

We propose one local buffer to be associated with each input or output port of an FU as shown

in Fig. 2.5. Buffers and FUs are connected with an interconnection network which depends on the

input/output behavior of buffers. Buffers associated with FU input ports are called operand buffers

and those associated with FU output ports are called result buffers.

Figure 2.5: Base local buffer model. [Two block diagrams, each showing buffers and an interconnection between buffers placed between 'From FU Outputs' and 'To FU Inputs'. (a) Local buffers associated with output: buffers associated with FU's output ports. (b) Local buffers associated with input: buffers associated with FU's input ports.]

In an operand buffer based architecture, each FU input port is connected to a dedicated operand buffer

and the FU always reads from it. The operand address in these buffers may be fixed or variable. When the

operand address is fixed, all values in a buffer must be read from the same address sequentially,

while in the variable operand address case, any value can be read directly. As values can be written to

any address of the operand buffers, these buffers are referred to as ‘random in sequential out’ (RISO)

and ‘random in random out’ (RIRO) operand buffers, respectively.

Similarly, in the case of result buffers, results are written to fixed register locations of the buffer or to

random locations of the buffer, while values can be read from any location. These buffers are called

‘sequential in random out’ (SIRO) result buffers and ‘random in random out’ (RIRO) result buffers,

respectively.

Among all these possibilities, conceptually there is also a possibility of ‘sequential in sequential

out’ (SISO) operand or result buffers. These buffers may be considered as a perfect FIFO or queue

where results are written in a predetermined order, and operands are read in the same order. As the

order of production and consumption of operands is rarely identical, these buffers are of least use,

though they may be used in some specific application or architecture. For example, in ASIC design

FIFOs have been used for efficient synthesis [Balakrishnan and Khanna, 2000].

Figure 2.6: RISO buffer based VLIW Architecture. [Block diagram; labels: Register File, FU1, FU2, FU3, and one RISO buffer per FU input port, RISO (r1) through RISO (r6).]

Next we discuss the characteristics and behavior of architectures with RISO, SIRO, and RIRO

buffers.

2.2.1 RISO Operand Buffers

In architectures with RISO buffers, one buffer is associated with each FU input port. In each cycle,

the FU reads the operand from a predefined register (usually the first) of its RISO buffer. The remaining

contents of the buffer are shifted by one address location as in a shift register. FUs can write to any

location in the buffer. As the operands are always read from the RISO buffer at a predefined address,

it is not necessary to specify operand addresses in the instruction format. In other words, only the result

(write) address is needed in the instruction. For a three-FU VLIW processor, the datapath with RISO buffers is

shown in Fig. 2.6.

For the correct execution of instructions, FUs write their results to a particular RISO buffer

at a pre-calculated address. To disambiguate, each RISO register gets a unique address and the RISO

address space is different from the RF address space. For a RISO buffer with depth k, the time

difference between production of a result and its consumption by the other FU should be less than

or equal to k. If the time difference is more than k, the values are stored in the second level RF. If there

is a branch operation between the producer and consumer instructions, then too the values are

stored in the second level RF.

Operands that are available in the second level RF are first written to RISO buffers, as FUs read

their operands only from RISO buffers. For reading a value from the register file, a move operation

(movb) is used. We observed that in this architecture, an instruction requires only the result address

field. However, some writes are to RISO buffers, while other writes are to the second level RF.

Therefore, we propose to use two result address fields, one for each kind of write. This instruction format

imposes an additional constraint: when a result value may be used by more than one operation,

the result must likewise be written to the RF and accessed from the RF. The instruction formats of the two

types of operations in the RISO architecture are shown below:

OPCODE <buffer address> <RF address>

MOVB <buffer address> <RF address>

It may be noted that the requirement of movb instructions and the constraint of a single RISO buffer

destination are not characteristics of RISO buffers; they are due to the suggested instruction format.

However, with a different instruction format other issues may arise. For example, multiple RISO

buffer destinations in an instruction will lead to additional instruction bits. Still, how many RISO

buffer destinations are sufficient depends on the application.
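
The rules that decide whether a result can stay in a RISO buffer or must go through the second level RF can be collected in one place. The following Python sketch is only an illustration of the three conditions stated in this section under the suggested instruction format; the function name and parameters are hypothetical.

```python
def riso_result_destination(distance, depth, crosses_branch, num_consumers):
    """Return 'buffer' when a result can stay in a RISO buffer, else 'rf'.

    distance: cycles between production and consumption of the value.
    depth: RISO buffer depth k.
    """
    if crosses_branch:        # producer and consumer separated by a branch
        return "rf"
    if distance > depth:      # value would be shifted out before it is consumed
        return "rf"
    if num_consumers > 1:     # single buffer destination in the instruction format
        return "rf"
    return "buffer"
```

A value routed to the RF by this check later needs a movb operation to bring it back into the consumer's RISO buffer.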

Fig. 2.7 shows the example instructions for a RISO based architecture. These instructions are the

result of ASAP (as soon as possible) scheduling of the assembly code given in Fig. 2.1. Operations

that can be executed in parallel are separated by semicolons. It may be noticed that the 0th instruction

contains movb operations that copy operands into RISO buffers. In the example code, the RISO buffer

address is the concatenation of the buffer name (as shown in Fig. 2.6) and the register number in that buffer,

and the opcode also reflects the FU binding. The value written to register r1_1 in the 0th instruction

is consumed by the first instruction. The value written to register r1_1 by the first instruction

is consumed in the next instruction, and so on. Similarly, in instruction 1 the result of ADD.3 is

written to r6_3, and it is read in instruction 4.

RISO Code
0: MOVB r1_1, Abase; MOVB r2_1, y; MOVB r3_1, Bbase; MOVB r4_1, z; MOVB r5_1, Cbase; MOVB r6_1, x;
1: ADD.1 r1_1, X; ADD.2 r3_1, X; ADD.3 r6_3, X
2: LOAD.1 r1_1, X; LOAD.2 r2_1, X;
3: ADD.1 r5_1, X;
4: STORE.3;

Figure 2.7: An example of instructions for RISO based architecture.
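
The timing in this example can be reproduced with a small behavioral model of a RISO buffer as a shift register. This is a sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, positions are 1-based as in r6_3, and a value written at the end of one instruction becomes readable from the next instruction onward.

```python
class RisoBuffer:
    """Shift-register model of a RISO buffer; position 1 is the one the FU reads."""

    def __init__(self, depth):
        self.regs = [None] * depth

    def write(self, position, value):
        # Random in: any FU may write any position (1-based).
        self.regs[position - 1] = value

    def read_and_shift(self):
        # Sequential out: read position 1, then shift the remaining contents by one.
        head = self.regs[0]
        self.regs = self.regs[1:] + [None]
        return head


def fig_2_7_r6_trace():
    """Values seen at the head of buffer r6 in instructions 2, 3, and 4."""
    r6 = RisoBuffer(depth=3)
    r6.write(3, "Caddr")        # result of ADD.3 at the end of instruction 1
    return [r6.read_and_shift() for _ in range(3)]
```

In this model the value written to position 3 in instruction 1 reaches the head of the buffer in instruction 4, which matches the description of r6_3 above.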

Hardware Implementation of RISO Buffers

There are two possible implementations of a RISO buffer. In one, the values are shifted every cycle.

This implementation is performance effective, but costly in terms of energy, as in every cycle each

buffer location is written.

In the other implementation, the contents of the buffer locations are not shifted every cycle. It uses a

modulo t counter based address generator, where t is the depth of the RISO buffer. Figure 2.8 shows the

detailed view of a RISO buffer of depth 3. A modulo address counter controls Mux2. Mux1 selects

the input from various FUs and the RF. The number of inputs to Mux1 is N + 1 and the number of inputs to Mux2 is equal

to the depth of the RISO buffer, where N is the number of FUs. It may be noticed that Mux2 is not present in the

shift register based implementation of the RISO buffer. Therefore, Mux2 represents a trade-off between

performance and energy efficient implementation.
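
The counter based alternative can be sketched in Python as a circular buffer in which data never moves and only a modulo t read pointer advances. The class name and the mapping of logical positions to physical slots are assumptions made for illustration; clearing a consumed slot is done only so the model behaves like a shift register that shifts in empty values, and real hardware would not need that extra write.

```python
class ModuloRisoBuffer:
    """RISO buffer without data movement: a modulo-t counter selects the read slot."""

    def __init__(self, depth):
        self.depth = depth
        self.regs = [None] * depth
        self.head = 0                      # modulo-t address counter (drives Mux2)

    def write(self, position, value):
        # Logical position p (1-based) maps to physical slot (head + p - 1) mod t.
        self.regs[(self.head + position - 1) % self.depth] = value

    def read_and_advance(self):
        value = self.regs[self.head]
        self.regs[self.head] = None        # modelling convenience only, see lead-in
        self.head = (self.head + 1) % self.depth
        return value
```

Writing to logical position 3 and then reading three times returns the value on the third read, the same behavior as shifting the contents every cycle, but with one storage write per value instead of one per location per cycle.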

The architecture with RISO buffers provides a path from FU outputs to FU inputs. In other words,

there is implicit full bypass present in RISO architecture.

Figure 2.8: Detailed RISO Architecture. [Block diagram; labels: Register File, FU1, FU2, FU3, Mux1, Mux2, Write ports, Address lines.]

2.2.2 SIRO Result Buffers

In a SIRO result buffer based architecture, there is a buffer associated with each FU output port.

The output of an FU is written to a predefined address of the associated buffer. Any FU can read from

any SIRO buffer and values can be read from any location in the buffer.

Similar to the RISO buffer implementation, there can be two implementations of the SIRO buffer:

one in which values are shifted in each cycle, and the other in which there is a modulo k counter for

the address generation for a SIRO buffer of depth k. In the case of an architecture with SIRO buffers, we

prefer the first arrangement, as the pipeline registers of the architecture can be reused. In the processor

pipeline, if there are pipeline stages between the execution stage and the writeback stage, those registers can

be treated as SIRO registers. If the depth of the SIRO buffer is more than the number of available pipeline registers,

then extra registers can be used, which will act like a shift register. The architectural view and

detailed view of the architecture with SIRO buffer is shown in Fig. 2.9 and 2.10, respectively.
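
The shifting arrangement preferred here can be modeled in a few lines of Python. This is a behavioral sketch with hypothetical names: each clock edge pushes the FU result (or nothing) into position 1 and moves older values one position deeper, while reads may address any position.

```python
class SiroBuffer:
    """Shift-register model of a SIRO result buffer (positions are 1-based)."""

    def __init__(self, depth):
        self.regs = [None] * depth

    def clock(self, result=None):
        # Sequential in: the FU result enters position 1; older values move one
        # position deeper and the oldest value is dropped.
        self.regs = [result] + self.regs[:-1]

    def read(self, position):
        # Random out: any FU may read any position.
        return self.regs[position - 1]
```

For example, a value clocked into a depth 3 buffer is visible at position 1 in the next cycle, at position 2 one cycle later, and at position 3 after that, which is why the address of a value held in a SIRO buffer changes with its age.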

As all the FUs can read the values of each SIRO buffer, the structure forms a complete bypass network

with bypass depth of k + 1. The SIRO buffer implementation also suggests a fast implementation

of a bypass network with bypass depth more than one. In the implementation suggested in Fig. 2.10,

the number of inputs of Mux1 is increased by one with respect to a traditional bypass design with bypass depth

one.

Figure 2.9: SIRO buffer based VLIW Architecture. [Block diagram; labels: Register File, FU1, FU2, FU3, SIRO (s1), SIRO (s2), SIRO (s3).]

Figure 2.10: Detailed SIRO Architecture. [Block diagram; labels: Register File, FU1, FU2, FU3, Mux1, Mux2.]


Similar to the RISO buffer architecture, here too the compiler generates operand addresses. A unique

address is assigned to all the elements of SIRO buffers, and the compiler, while scheduling, decides

whether operands are read from SIRO buffers or from the RF.

An FU can read operands from any SIRO buffer or the register file. In other words, each SIRO buffer

provides operands to all FUs. Therefore, the number of read ports in a SIRO buffer is 2N, where N is the number

of function units. It may appear that a SIRO architecture needs no destination address field in the

instruction, but the field is required for values to be written to the register file. It may be noted that

all non-local writes need to be written to the RF. The FU reads operands from either the register

file or from a SIRO buffer. RF and SIRO have different address spaces; thus, separate fields for the RF

address and the SIRO address are not required in the instruction. Due to this, the instruction format of the

architecture with SIRO buffers remains the same as that of conventional architectures.

SIRO Code
0: ADD.1 s1_1, Abase, y; ADD.2 s2_1, Bbase, z; ADD.3 s3_1, Cbase, x;
1: LOAD.1 s1_1, s1_1; LOAD.2 s2_1, s2_1;
2: ADD.1 C, s1_1, s2_1;
3: STORE.1 s1_1, s3_3;

Figure 2.11: Example instructions for SIRO based architecture.

The ASAP schedule corresponding to the assembly code of Fig. 2.1 for a SIRO architecture is

given in Fig. 2.11. We may observe that the instruction format remains the same as for conventional

VLIW architectures. In this example, all the results are written to the first register of the correspond-

ing SIRO buffer because there is no need to write them into the second level register file.

2.2.3 RIRO Buffers

RIRO buffers associated with read and write ports of FUs are called RIRO operand buffers and RIRO

result buffers, respectively. Due to random access for both read and write, these buffers act as small

register files. For a read as well as a write to a RIRO buffer, the compiler must provide the appropriate

address. The addresses of RIRO buffers cannot be generated in hardware as in RISO or SIRO buffers.

In architectures with RIRO operand buffers, a RIRO buffer is associated with each input port of

an FU. The FU can read any register in its RIRO buffer, and can write to any location in any RIRO

buffer. Structurally, an architecture with RIRO operand buffers is similar to architectures with RISO

buffers (Fig. 2.8). The only difference is that the control of Mux2 is generated by the RIRO address

instead of a modulo counter. Unlike a RISO buffer, an operand address is required for a RIRO operand

buffer. Therefore, the instruction format would contain two source operand addresses pointing to

RIRO buffers and two result addresses, one pointing to a RIRO buffer and the other pointing to the RF. If both

result addresses are not needed simultaneously, the cost is only an extra bit.

In architectures with RIRO result buffers, one RIRO buffer is associated with each output port

of the FU. Structurally, the architecture is similar to the architecture with SIRO buffers. There is

no shifting of data required and therefore, pipeline registers cannot be re-used for the design of RIRO

result buffers. Each FU writes to its RIRO result buffer, and if required it writes to the RF as

well. The input operands of FUs are read from either RIRO result buffers or from the RF. Instruction

encoding includes two destination addresses, one for the register file and another for the RIRO buffer.

Because of the random access principle, RIRO buffers do not have the limitation on temporal locality

of definition and use of values. Therefore, conceptually, the architecture with RIRO buffers need not

have a global register file. Without a global register file, the required size of each RIRO buffer would

be large. As the results produced by one FU may be used by more than one FU, it is

essential to have a communication mechanism from one RIRO buffer to another. Access

time would also increase due to both these factors. The other implication of large access time is

the absence of implicit RF bypass, and therefore, an explicit bypass network would be required.

This RIRO architecture without a second level RF is a special case of clustered VLIW architecture

[Capitanio et al., 1992]. For example, in the case of RIRO operand buffers only one FU can read from

a RIRO buffer while all others can write to it, which is similar to a clustered VLIW architecture with

write across interconnects [Gangwar et al., 2007]. Similarly, RIRO result buffers are equivalent to a

read across clustered VLIW architecture.

If we relax certain interconnection constraints then the RIRO architecture is similar to many

previously proposed architectures. For example, if all FUs can read and write to any RIRO buffer,

then it will be like banked-RF architecture with full interconnect [Balasubramonian et al., 2001]. If

a RIRO buffer is shared between multiple FUs, then it is similar to a clustered architecture.

2.2.4 Qualitative Analysis

All the three architectures mentioned above fulfill our goal of capturing temporally local variables

and avoiding RF access in such cases. The architectures also physically distribute the local buffers

in favor of physical routing.

RIRO buffers are the most attractive alternative from a performance point of view. They have

the advantages of RISO/SIRO and, because of random access, they can use their size most efficiently.

Additionally, use of a value in a RIRO buffer is not limited to the boundaries of a basic block.

However, compiler development for a RIRO based architecture is extremely complex, as it involves

simultaneous register allocation, register file partitioning, operation scheduling, and FU binding.

As this is the first effort in the direction of local buffer based architectures, we chose a subset of the

available design space.

RISO and SIRO based architectures are equivalent on most grounds. There are some restrictions

in RISO based architectures which lead to lower performance and lower energy. For example, in

RISO each second level RF read requires a movb operation. Therefore, an additional 30-40% of

original instructions would be required. Also, results which are required by multiple FUs need to

be routed through the second level RF. However, as indicated before, these restrictions are due to

design choices, not fundamental properties of RISO buffers. We prefer to use SIRO architecture

for further exploration because SIRO buffer has the additional advantage that it is most similar to

conventional VLIW architecture with full bypass network.

In the next section we discuss the similarities of SIRO based architecture and conventional VLIW

architecture.

2.3 SIRO Buffers and Conventional VLIW Architecture


2.3.1 SIRO Buffers and RF Bypass

In a conventional pipelined processor RF bypass paths are provided to avoid data hazards. Operands

are read from the RF as well as from the pipeline registers. If a value is available from the pipeline

registers, the RF value is discarded. In other words, RF is bypassed for reading operands which are

available in the pipeline paths. A value is available from the RF only after it is written to the RF.

The number of stages from where bypass paths are provided is known as bypass depth, d.

SIRO based architecture is similar to this classical design of VLIW processor. In SIRO buffer

architecture also, the results of FUs are stored in SIRO buffers and shifted in each cycle. If each

SIRO register is replaced by pipeline register, and depth of SIRO buffer is equal to the bypass depth,

SIRO based architecture has the same datapath as that of a conventional VLIW architecture.

SIRO buffer depth cannot be less than the bypass depth, as it will lead to data hazards. In case the

SIRO buffer depth is more than the bypass depth, additional registers would be required to implement

the SIRO buffer. The number of such registers would be the difference of the buffer depth and the

bypass depth. An example architecture with bypass depth of 2 is shown in Fig. 2.12(a). The FU’s input

multiplexer gets inputs from the RF, the execute stage, and the writeback stage. If a SIRO buffer with depth 4

is to be implemented in this processor, two additional registers (4 − 2) are required, as shown in Fig.

2.12(b). Similar to Fig. 2.10, Mux2 multiplexes outputs from three registers of SIRO buffers.
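
The register count in this example follows directly from the two depths. As a small Python sketch (the function name is hypothetical, and the count is per SIRO buffer):

```python
def extra_siro_registers(buffer_depth, bypass_depth):
    """Registers to add to the existing pipeline registers for one SIRO buffer."""
    if buffer_depth < bypass_depth:
        # The text rules this case out because it would lead to data hazards.
        raise ValueError("SIRO buffer depth must not be less than the bypass depth")
    return buffer_depth - bypass_depth
```

For the configuration of Fig. 2.12, extra_siro_registers(4, 2) gives the two additional registers mentioned above, and a buffer depth equal to the bypass depth needs none.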

2.3.2 Advantages of SIRO Buffers

SIRO based architecture has a number of advantages over conventional VLIW architectures. First,

RF accesses are reduced in redundant cases; second, bypass control becomes simpler; third, multiple

stage bypass gives performance advantages.

Figure 2.12: SIRO buffers and RF bypass. [Two pipeline diagrams spanning the Decode Stage, Execute Stage, and Writeback Stage with the Register File. (a) An architecture with bypass depth 2. (b) An architecture with SIRO buffer of depth 4; additional labels: FU, Mux1, Mux2.]

RF Access Avoidance

In conventional VLIW processors values are always read from RF and are discarded if operands are

available in the bypass network. This happens because information that a value exists in the bypass

network is not available before reading from RF. RF read stage and bypass control computation are

done in parallel. In SIRO based architecture, due to address decoding of SIRO registers, the RF read

is avoided easily.

Each register of SIRO buffer is uniquely addressed. Equivalently, all the pipeline registers from

which bypass can be read are addressed. The address space of the SIRO buffers is also mutually

exclusive to the RF address space. Therefore, when the operand address is a SIRO address, the RF

is not read. Similarly, when the result address is a SIRO address, it is not written to the RF. If the

result address points to the RF, both the RF and the SIRO buffers are written.
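
Because the two address spaces are disjoint, the decision to access the RF reduces to an address range check. The Python sketch below illustrates this; the unified address layout (RF registers first, then SIRO registers grouped per FU), the function names, and the sizes used in the usage note are assumptions made here and are not specified in the text.

```python
def decode_address(addr, rf_size, n_fus, depth):
    """Map an address to ('rf', index) or ('siro', fu, position).

    Assumed layout: addresses 0..rf_size-1 are RF registers, followed by
    depth SIRO registers for each of the n_fus function units.
    """
    if 0 <= addr < rf_size:
        return ("rf", addr)
    offset = addr - rf_size
    if 0 <= offset < n_fus * depth:
        return ("siro", offset // depth + 1, offset % depth + 1)
    raise ValueError(f"address {addr} is outside both address spaces")


def rf_read_enable(addr, rf_size, n_fus, depth):
    """The RF is read only when the operand address lies in the RF address space."""
    return decode_address(addr, rf_size, n_fus, depth)[0] == "rf"


def rf_write_enable(addr, rf_size, n_fus, depth):
    """A result is written to the RF only for RF addresses; it enters the SIRO
    buffer of its FU in either case."""
    return decode_address(addr, rf_size, n_fus, depth)[0] == "rf"
```

With an assumed 32-entry RF, three FUs, and SIRO depth 2, address 37 decodes to position 2 of the third SIRO buffer, so neither an RF read nor an RF write is triggered for it.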

Figure 2.13: Bypass control for one operand of a functional unit. [(a) Original bypass control; labels: Addr 1 through Addr m compared with Addr i, m*N address comparisons, operand selection, m:1 mux, 32-bit data, N such muxes. (b) Modified bypass control; labels: Addr decoder output, RF inhibit bit, m:1 mux, 32-bit data, N such decoders, N such muxes.]

Bypass Control Simplifications

In conventional bypass control circuits, for each FU input, depth ∗ N address comparisons are

required to know if the operand is available in any of the bypass paths of an N issue processor.

Therefore, for an N issue processor the number of comparators required in the bypass control circuit is

2 ∗ depth ∗ N^2. As the issue width increases, the complexity of the bypass circuit also increases. In

general, if the number of possible bypass sources for the ith operand is m, its bypass control is as shown in

Fig. 2.13(a).

In a SIRO buffer based architecture, each pipeline register from which a bypass value is expected is

addressed. The control unit for bypass, in this case, consists of a decoding circuit which expands the

information given for each operand, which makes bypass control simpler. Now instead of 2 ∗ depth ∗

N^2 comparators, 2 ∗ N decoders will be required, as shown in Fig. 2.13(b).
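
The two counts can be tabulated for a few issue widths with a short Python sketch. The function names are hypothetical; the formulas are the ones given above (two operand inputs per FU, N FUs, and depth ∗ N comparisons per input).

```python
def bypass_comparators(issue_width, depth):
    """Comparators in conventional bypass control: 2 * depth * N^2."""
    per_fu_input = depth * issue_width       # comparisons for one FU input
    return 2 * issue_width * per_fu_input    # two inputs per FU, N FUs


def siro_decoders(issue_width):
    """Address decoders in SIRO based bypass control: 2 * N."""
    return 2 * issue_width


def comparison_table(issue_widths, depth):
    """Rows of (N, comparators, decoders) for the given issue widths."""
    return [(n, bypass_comparators(n, depth), siro_decoders(n)) for n in issue_widths]
```

The comparator count grows quadratically with the issue width while the decoder count grows linearly, so the gap widens as the processor scales.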

To inhibit register read/write a signal is generated when an operand is read from bypass. This

signal is generated by a combinational circuit which is based on the address space used by SIRO

registers.



Multi-stage Bypass

We showed that we can increase the SIRO buffer depth with minimal impact on clock cycle delay.

The implementation of the SIRO buffer also suggests a way to realize multi-stage bypass control which

may be helpful in deep processor pipelines.

2.4 Reduced Port Second Level RF

In architectures with local buffers, non-local reads and writes are served by the second level monolithic

RF. Each FU is directly connected to the second level RF; therefore, 2N read and N write

ports are required for N parallel operations. As suggested in the beginning of this chapter, an RF

with such a large bandwidth is not required and we propose to reduce the number of RF ports by

sharing them among FUs.

Reduction of ports may lead to conflicts (we call these port conflicts) causing an increase in the

execution time. In shared port RF architecture, RF-FU connections are no longer 1-to-1. Depending

on the structure of the RF-FU interconnection network, there may be additional conflicts (we call

these path conflicts) due to lack of interconnection paths, even though ports may be available. In

this section we present an approach to keep both kinds of conflicts within acceptable levels,

while keeping the benefits of energy saving.

The RF-FU interconnection problem is independent of the presence of SIRO buffers, as the FU

reads and writes to the second level RF directly. SIRO buffers help in reducing the traffic to the

second level RF.

2.4.1 Processor Design with Shared Port RF

With the shared port register file, only the register read and write stages of the processor pipeline are

modified, while the rest of the processor design remains unaltered. In the conventional VLIW

architecture, FU and RF ports have one-to-one connectivity. In commercial architectures like VelociTI

[Seshan, 1998] and Lx [Faraboschi et al., 2000], the RF ports are shared by a number of function

units in the same issue slot. The function units in a particular issue slot form a composite unit which

can be viewed as a single FU at the architectural level.

In the proposed shared port RF architecture, one RF port may be connected to more than one FU

port. Therefore, an interconnection network is required to map an RF port to FU port and operand

address to RF address port. The mapping logic and the access energy of the RF read/write depend

on the RF-FU interconnection network.

RF-FU Interconnection Network

Figure 2.14: RF-FU interconnection topologies for shared ported register file (bypass paths and SIRO buffers are not shown to simplify the figure). [Three block diagrams, each connecting register file ports 1 to 4 with FU1 to FU4: (a) Complete interconnection. (b) Direct interconnection. (c) Partial interconnection.]

We classify different interconnection topologies into three classes : complete, partial, and direct.

The complete interconnection networks form one extreme, in which every FU port is connected to

every RF port (Fig. 2.14(a)). On the other extreme are direct interconnection topologies in which

each FU port is connected to only one RF port (Fig. 2.14(b)). There are numerous partial interconnection topologies in between these two extremes (Fig. 2.14(c)). The direct interconnection networks require the least multiplexing and offer the lowest RF access energy and access delay. On

the other hand, the complete interconnection eliminates path conflicts by providing all possible

interconnection paths, thus offering the best performance.

In the direct interconnection, each FU port is connected to a single RF port; therefore, the FU binding determines the RF port mapping. In the case of complete interconnection, port mapping information may be generated by the compiler and passed to the hardware, but this requires additional bits in the instruction. The additional bits increase the code size and require significant modification of the fetch stage of the processor. The alternative is to do the port mapping in hardware. In that case, for the complete interconnection the compiler only needs to ensure that the number of RF reads and writes in a cycle is less than or equal to the number of available ports in the RF.

An example of a direct interconnect is shown in Fig. 2.15. Multiple address inputs are multiplexed

in accordance with interconnection topology. Outputs of the RF need no extra multiplexing. The

multiplexing at the input end adds to the delay, but this is more than compensated by the decrease in the

RF access time due to port reduction.

Figure 2.15: Direct interconnection. (Labels in the figure: register file with ports 1 to 4, operand address inputs op1_addr1, op1_addr2, ..., op4_addr1, op4_addr2, function units FU1 to FU4, and SIRO buffers.)

In multi-issue processors, different issue slots are usually homogeneous for the most frequent operations, while less frequent operations are available on fewer function units. For example, in the Lx architecture, integer operations can be performed by all FUs while memory operations can be performed by one in four FUs. Considering the homogeneity of FUs, a careful selection of the interconnection network makes more FUs available as binding options to the scheduler. Therefore, the path conflicts can be kept fairly low, resulting in a performance close to that of complete interconnection, while retaining the advantages of simple hardware and multiplexer-less connections.

Choosing Direct Interconnection Matrix

A direct interconnection matrix can be defined by $P$. $P^1_{ij}$ and $P^2_{ij}$ are one if there is a path from the $i$th RF read port to the first and second inputs of FU $j$, respectively. $P^w_{ij}$ is one if there is a path from the $i$th write port of the RF to the output of FU $j$.

There are $M^N$ possible direct interconnection matrices, where $N$ and $M$ are the number of FU ports and RF ports, respectively. $M^N$ also includes the matrices that use fewer than $M$ RF ports. In an

RF, all ports are symmetrical and every pair is equivalent. This property of RF reduces the number

of candidate interconnection matrices significantly. The number of unique interconnection matrices

for a given M and N is an order of magnitude less than the total number. Further, the choice can be

narrowed down using the following guidelines:

(i) The two read ports of each FU are connected to different RF ports. This is a necessary

condition for a valid interconnection network.

(ii) Each RF port should be connected to an approximately equal number of FU ports such that the

resulting interconnect is balanced. An imbalanced interconnection network leads to more

path conflicts.

(iii) We have observed that, on average, the left port of an FU is used four times more often than the right port. Therefore, we suggest that an RF port be shared between the left read port of one FU and the right read port of another FU for more balanced sharing. In other words, left ports and right ports should separately be distributed among RF ports, as uniformly as possible.
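The three guidelines can be checked mechanically for a candidate direct interconnect. The following is a minimal sketch under stated assumptions: the interconnect is given as two hypothetical lists `left` and `right` (the RF read port wired to the left and right input of each FU), and "approximately equal" is interpreted as port loads that differ by at most one. Neither the representation nor the threshold comes from the thesis.

```python
from collections import Counter

def check_guidelines(left, right, num_rf_ports):
    """left[j], right[j]: RF read port feeding the left/right input of FU j."""
    ports = range(1, num_rf_ports + 1)
    # Guideline (i): the two read ports of an FU use different RF ports.
    g1 = all(l != r for l, r in zip(left, right))

    def balanced(assignments):
        # Assumption: "approximately equal" means loads differ by at most one.
        counts = Counter(assignments)
        loads = [counts.get(p, 0) for p in ports]
        return max(loads) - min(loads) <= 1

    # Guideline (ii): every RF port serves about the same number of FU ports.
    g2 = balanced(list(left) + list(right))
    # Guideline (iii): left ports and right ports are each spread uniformly.
    g3 = balanced(left) and balanced(right)
    return g1, g2, g3

# Hypothetical interconnects for 4 FUs (8 FU read ports) and 4 RF read ports.
good = check_guidelines(left=[1, 2, 3, 4], right=[2, 3, 4, 1], num_rf_ports=4)
bad = check_guidelines(left=[1, 1, 2, 2], right=[3, 3, 4, 4], num_rf_ports=4)
print(good, bad)
```

In the second interconnect every RF port still serves two FU ports, so guideline (ii) holds, but the left ports are concentrated on two RF ports, which is the situation guideline (iii) is meant to exclude.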

The above guidelines assume that the most frequent operations can be executed by all FUs of the processor. It turns out that the number of matrices which satisfy the guidelines is very small. The total number of possible matrices is $M^N$, and the number of those which use all $M$ RF ports is $M!\,S(N,M)$, where $S(N,M)$ is the Stirling number of the second kind. Stirling numbers of the second kind $S(N,M)$ count the number of ways to partition a set of $N$ elements into $M$ nonempty subsets [Baghdadi et al., 2000; D’Antona and Munarini, 2000]. The number of unique interconnection matrices, taking into account the symmetry of the RF ports, is therefore $S(N,M)$. The number of interconnection matrices that follow the first, second, and third guidelines can be found by enumerating the partitions counted by the Stirling numbers of the second kind and applying the given constraints.

For example, for 8 FU ports and 4 RF ports, 65536 interconnection matrices are possible, out of which 40824 use all 4 RF ports and only 1701 ($S(8,4)$) are unique. The matrices that follow the first, the first two, and all three guidelines number 652, 60 and 9, respectively. Any interconnect that follows the

given guidelines leads to less performance impact due to path conflicts. A few examples of such

interconnects are given in Fig. 2.16. In these examples, 4 RF read ports are shared between 8 FU

input ports. Each RF port is connected to two FU input ports; one of them is the left input port and

the other is the right input port, to ensure homogeneity. The figure also shows the corresponding

interconnection matrices. Each row of an interconnection matrix indicates the RF port to which an FU

port is connected.
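The first three counts in this example can be reproduced with a few lines of code. The sketch below computes $S(N,M)$ with the standard recurrence $S(n,m) = m\,S(n-1,m) + S(n-1,m-1)$ and uses the fact that the number of assignments of $N$ FU ports onto all $M$ RF ports is $M!\,S(N,M)$. The function name `stirling2` is ours, and the guideline-filtered counts (652, 60 and 9) are not recomputed here.

```python
from math import factorial

def stirling2(n, m):
    """Stirling number of the second kind S(n, m) via the standard recurrence."""
    # S(n, m) = m * S(n-1, m) + S(n-1, m-1), with S(0, 0) = 1.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for i in range(1, n + 1):
        for k in range(1, min(i, m) + 1):
            table[i][k] = k * table[i - 1][k] + table[i - 1][k - 1]
    return table[n][m]

N, M = 8, 4                                  # FU ports and RF ports
total = M ** N                               # all possible matrices
unique = stirling2(N, M)                     # RF ports treated as interchangeable
using_all_ports = factorial(M) * unique      # every RF port used at least once
print(total, using_all_ports, unique)        # 65536 40824 1701
```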

2.5 Issues with the Proposed Architectures

The proposed architectures are based on the fact that the values available in SIRO buffers are directly

used by the FUs requiring them, bypassing their read and write in the RF. Also, the instruction points

to the registers which are temporary in nature, i.e., shifted to the next register in each cycle. Due to

these properties of our architecture, a few issues arise which are discussed now.

Figure 2.16: Example direct interconnects (a), (b), and (c) and the corresponding interconnection matrices. Each panel connects RF ports 1 to 4 of the register file to FU1 to FU4.

Performance Impact

The SIRO register address space is disjoint from the RF address space. The number of bits allocated to an operand address is fixed in an instruction. Sharing the given address space may reduce the available general purpose registers and thus increase the register pressure, which may lead to more spill code and therefore, performance loss. If SIRO buffer aware register allocation is used, the performance

loss is minimized. Yan and Zhang [2008] suggest such a register allocation approach that considers

the operands available in bypass paths as virtual registers. They show that using virtual registers

leads to a decrease in the register pressure. Therefore, relative performance would increase if local

buffer aware register allocation is used.

Predicate Operations

In VLIW processors predication is used for increasing basic block size and to increase parallelism.

Predicated operations are conditionally executed and the conditions are determined at run time.

Due to non-determinism at compile time, our RISO/SIRO based approach cannot be used with predicated operations. For example, suppose x is a predicated operation and operation y reads its result from the SIRO buffer in the next cycle. If, due to conditional execution, x does not get executed, operation y will not get the correct operand from the SIRO buffer and the execution will not be correct. However, note that in commercial processors predication is not generally used. For example, Analog's TigerSHARC, TI's 320C6X, and Sun MAJC have no predication in their instruction sets. ST's Lx has only partial predication support.

Exceptions

If there is an exception raised by an instruction having operands to be read from SIRO buffers, or

there is an interrupt at this instruction, the operand value will not be in SIRO buffers after returning

from interrupt/exception unless some micro-architectural change is done in the processor design.

Exceptions and interrupts are handled in VLIW processors like in any other pipelined processor, with the difference that we cannot change the addresses of operands in hardware, as all registers are determined at compile time. Ozer et al. [1998] describe how the traditional techniques of exception/interrupt handling can be extended for VLIW processors. They propose to store a copy of the register file in a future file with a reorder buffer, or in a history buffer. Whenever control gets back from an exception, the original state can be retrieved from either the history buffer or the future file with reorder buffer.

In our case any of these methods can be adopted; SIRO buffer registers will be saved in the history buffer or future file like regular registers, and transient values can be retrieved at any instant. Extra logic is required to provide connection and control so that the future file or history buffer can read from (write to) SIRO buffer registers. Quantitative analysis of the power and performance impact of exception handling is not in the scope of this work.

Pipeline stalls

In VLIW processors all the function unit latencies are exposed to the compiler. Thus the schedule

generated by the compiler is executed as it is in the hardware. Caches are an exception to this

rule. In case of memory access the compiler assumes a minimum access latency (assuming a hit).

When there is a data cache miss the whole pipeline is stalled till the data arrives. In the SIRO buffer based architecture, whenever these pipeline bubbles are introduced, the hardware has to make sure that pipeline registers do not lose their values. However, for an instruction cache miss, a pipeline stall usually leads to flushing of the rest of the pipeline. In our case, a signal is required for stalling all the pipeline stages till we get a value from the I-cache. Our approach is not affected by control hazards, as the compiler analysis is done only within basic blocks.

Downward Compatibility

Code generation in the proposed architecture has to be done for a given architecture. Any additional

function unit in the architecture would mean a change in registers for SIRO buffers, which would

necessitate regeneration of code. Also, in case of direct interconnect, any change in number of

issue slots or FU placement inside issue slots may change the interconnection matrix. Changes in

interconnection matrix of direct interconnect would require re-scheduling of operations. Note that

rescheduling is not essential in case of reduced port RF architecture with complete interconnect

if the new architecture has more FUs or issue slots, though rescheduling would result in a better

performance.

2.6 Related Work

Local Buffer Architecture Related Work

Queue based hardware structures have also been used earlier in the datapaths of processors and ASICs. FIFOs are used in RTL datapath synthesis by Balakrishnan and Khanna [2000]. A shift queue based architecture is used in synthesizing ASICs for loop acceleration by Fan et al. [2005] and Schreiber et al. [2002]. In the shift queue architecture, the size, ports and connections of the queue depend on the application loop to be synthesized. Fernandes [1998] proposes FIFO based buffers for VLIW processors. He proposed a modulo scheduling algorithm which allocates variables with equal lifetimes and different scheduling times to one FIFO buffer.

The RISO architecture is also similar to the transport triggered architecture (TTA) [Hoogerbrugge and Corporaal, 1994]. TTA is based on operand transport rather than instruction execution. The basic principle in TTA is that operations occur as by-products of operand transport, while in our architecture, both operand and operation are explicit. In this way the RISO based architecture lies in between TTA and VLIW. In TTA, the register file and functional units are connected by a bus, which may be unscalable, whereas we use point-to-point connections. Another similarity between buffer based architectures and TTA is that both use a reduced port register file [Corporaal, 1999].

Bypass Related Work

Researchers have used the presence of bypass network to save register file energy. The idea here is

to avoid reading operands from the register file which could be obtained from the bypass network.

The opportunities for avoiding register file read may be detected by hardware [Park et al., 2002]

or the compiler [Sami et al., 2002; Asanovic et al., 2002]. The hardware approach requires bypass

control computation before RF read. This may lead to either increased clock period or increase in

number of pipeline stages. Moreover, RF writes cannot be avoided in the hardware approach as the

liveness information of the operands is not available in the hardware.

In another direction of research, researchers tried to reduce the complexity of the bypass network

by reducing the number of bypass paths [Ahuja et al., 1995]. The important issues are the selection of bypass paths that can be removed and code generation to minimize the performance impact due

to missing paths. Fan et al. [2003] and Shrivastava et al. [2005] suggest methodologies to explore

the design space and select bypass paths for superscalar and VLIW architectures, respectively. Park

et al. [2006] and Kudlur et al. [2004] suggest scheduling algorithms to avoid performance impact

due to partial bypass network for superscalar and VLIW architecture, respectively.

Port Sharing

In a single issue processor, RF port sharing among multiple function units is a matter of rule, but

shared port RF architectures for multiple issue processors have received very little attention. In

VLIW processors, typically, port sharing has been used at the level of issue slots, i.e., function units

in a single issue slot share the same read and write ports [Seshan, 1998]. Our approach goes one

step further and permits sharing ports across issue slots. Aditya et al. [1999] proposed an approach

with very limited port sharing. In their approach, an FU having less port requirement shares RF

port with an FU having higher port requirement. Port reduction achieved is quite limited in their

approach.

There are some instances of reduced port RFs in other multi-issue architectures. For example,

in [Park et al., 2002; Kim and Mudge, 2003; Sangireddy, 2007; Sirsi and Aggarwal, 2009], the

authors have suggested port sharing for superscalar processors. The focus here is on port conflict

management in hardware. However, for a VLIW processor, a compiler driven solution is required.

2.7 Summary

In this chapter we proposed local buffer based RF architectures for VLIW processors. The RISO,

SIRO, and RIRO architectures were studied in detail. We studied their architectural implications

critically and further explored the architectures with SIRO buffers. We showed the resemblance of

SIRO architecture with traditional full bypass network and presented mechanisms for avoiding RF

access when SIRO buffers are accessed. SIRO buffers are also a promising choice for large bypass

depth. The SIRO architecture potentially reduces the RF traffic. Based on this we proposed a

reduced port RF architecture. For reduced port RF architecture we explored RF-FU interconnection

network and proposed direct interconnection which is simple and has least hardware cost. For direct

interconnection, we gave guidelines that help in selecting the interconnection matrix.

3 Code Generation for Proposed Architecture

3.1 Motivation

VLIW processors depend on their compilers for extraction of parallelism and resource management.

In other words, the compiler for a VLIW processor must correctly resolve all timing and resource

conflicts, and optimize the code for maximum parallelism.

Correctness is the minimum requirement for a VLIW compiler. For correctness, the detailed

VLIW architecture is made visible to the compiler, and resource constraints are given as inputs to

it. The compiler’s algorithms for operation scheduling, FU binding, and register allocation usually take

care of architecture constraints to produce correct code. For the proposed architecture, in addition

to resource constraints of a conventional VLIW architecture, the following additional constraints

are present:

(i) Operand and result addresses in the instruction should point to correct SIRO/RF address,

(ii) The number of RF reads and writes in an execution cycle should be less than or equal to the number of physical ports

of the register file, and

(iii) RF-FU paths should be available to FU whenever required.

Different constraints of the proposed architecture have an impact on different modules of the compiler. The first constraint affects the register allocation; the second constraint affects the operation scheduling; and the third constraint affects the FU binding. Further, operation scheduling and FU binding affect each other, and thus, there is a phase ordering issue among them. As a workaround, we propose an algorithm which performs operation scheduling and FU binding simultaneously.

Further, several optimizations are possible to improve performance and reduce energy consumption. For example, the number of reads and writes to SIRO buffers can be increased to reduce RF

energy; the operation schedule can be optimized to reduce the impact due to port conflicts and path

conflicts; register allocation can consider SIRO register address space to reduce register pressure.

In our thesis, we focus on operation scheduling and FU binding algorithms which not only meet

the correctness requirement, but also do performance and energy optimizations. Register allocation

is out of scope of this thesis. However, our solution of register allocation ensures correctness.

In the rest of the chapter, Section 3.2 defines the scheduling and binding problem and Section 3.3

discusses the scheduling and binding algorithms in detail. Section 3.4 discusses the other modules of

the compiler assisting in code generation. Section 3.5 summarizes the chapter.

3.2 Scheduling-binding Problem and Methodology

The inputs to the scheduling and binding algorithm are an application program represented as data flow graphs and some parameters representing the architecture. A data-flow graph is a directed acyclic graph $G(V,E)$, where each element of the vertex set $V = \{v_0, v_1, \ldots, v_n\}$ corresponds to an operation, and each edge $e_{ij} \in E$ represents a dependency of $v_j$ on $v_i$. The edge weight $d_{ij}$ corresponding to $e_{ij}$ is the minimum schedule time difference between $v_i$ and $v_j$. Each vertex $v_i$ is also associated with a delay $x_i$ that the operation requires to execute. In this graph, $v_n$ is the sink node that has no outgoing edge and has incoming edges from all other nodes. We assume that each operation $v_i$ may require up to two operands and may produce a result, as denoted by $r^1_i$, $r^2_i$ and $w_i$. The values of $r^1_i$, $r^2_i$ and $w_i$ are one if $v_i$ reads operand 1, reads operand 2 and writes a result, respectively; otherwise they are zero.

The number of issue slots N, the number of RF read ports R, and the number of RF write ports W

are the architecture parameters considered.

The scheduling problem for the reduced port RF architecture is to find an integer labelling of the operations $\varphi : V \to \mathbb{Z}^+$ satisfying the following constraints:

(i) The schedule time satisfies the dependency constraints due to all edges in the graph.

$$\varphi(v_i) \ge \varphi(v_j) + d_{ji} \quad \forall i, j \text{ s.t. } e_{ji} \in E. \tag{3.1}$$

(ii) The total number of operand reads should not exceed the number of read ports of the RF for any schedule time.

$$\sum_{i:\varphi(v_i)=j} \left( r^1_i + r^2_i \right) \le R \quad \forall j. \tag{3.2}$$

(iii) The total number of result writes in any cycle should not exceed the number of write ports of the RF.

$$\sum_{i:\varphi(v_i)+x_i=j} w_i \le W \quad \forall j. \tag{3.3}$$

(iv) The total number of operations scheduled in a cycle should not exceed the issue width of the processor.

$$\sum_{i:\varphi(v_i)=j} 1 \le N \quad \forall j. \tag{3.4}$$
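Constraints (3.1) to (3.4) can be read as a validity check on a candidate schedule. The following is a minimal sketch of such a check, not part of the proposed compiler. The dictionary-based encoding of the graph (`edges[(j, i)]` holding $d_{ji}$, `delay[i]` holding $x_i$) and the function name `check_schedule` are assumptions made for illustration.

```python
from collections import defaultdict

def check_schedule(phi, edges, delay, r1, r2, w, N, R, W):
    """Return the labels of the constraints violated by schedule phi.

    phi[i]: issue cycle of operation i; edges[(j, i)] = d_ji;
    delay[i] = x_i; r1, r2, w are the 0/1 read/write flags of each operation.
    """
    violated = []
    # (3.1) every consumer is issued late enough after its producer.
    if any(phi[i] < phi[j] + d for (j, i), d in edges.items()):
        violated.append("3.1")
    reads, writes, issued = defaultdict(int), defaultdict(int), defaultdict(int)
    for i, t in phi.items():
        reads[t] += r1[i] + r2[i]        # operands read at issue time
        writes[t + delay[i]] += w[i]     # result written after x_i cycles
        issued[t] += 1
    if any(n > R for n in reads.values()):
        violated.append("3.2")
    if any(n > W for n in writes.values()):
        violated.append("3.3")
    if any(n > N for n in issued.values()):
        violated.append("3.4")
    return violated

# Hypothetical three-operation graph: operations 0 and 1 feed operation 2.
edges = {(0, 2): 1, (1, 2): 1}
ones = {0: 1, 1: 1, 2: 1}
tight = check_schedule({0: 0, 1: 0, 2: 1}, edges, ones, ones, ones, ones, N=2, R=2, W=1)
relaxed = check_schedule({0: 0, 1: 1, 2: 2}, edges, ones, ones, ones, ones, N=2, R=2, W=1)
print(tight, relaxed)
```

With two read ports and one write port, issuing both producers in the same cycle violates the port constraints, while spreading them over two cycles yields a valid schedule.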

To reduce the impact on cost, we reduce the demand of RF read and write ports by using the

operands available in the SIRO buffers. As discussed in Chapter 2, if an operand is available in the

SIRO buffers, it is not necessary to read that operand from the RF. Similarly, if all the uses of a result

are read from SIRO buffers, we may avoid writing it in the RF. To decide if an RF read or write can

be avoided, the schedule information $\varphi$ is required. In the graph $G$, each incoming dataflow edge to a vertex corresponds to an RF read, and the outgoing dataflow edges from a vertex correspond to an RF write. An RF read for operation $v_i$, associated with a dataflow edge $e_{ji} = (v_j, v_i)$, can be avoided if

$$\varphi(v_i) - \varphi(v_j) \le d_{ji} + depth - 1, \tag{3.5}$$

where $depth$ is the SIRO buffer depth. If an RF read is avoided, the corresponding value of $r^1_i$ or $r^2_i$ is set to zero. The RF write associated with $v_i$ can be avoided (and $w_i$ is set to zero) if the write is not global and all the RF reads corresponding to dataflow edges starting from node $v_i$ satisfy condition (3.5). Global writes are the essential writes and are determined using global liveness information (this will be discussed in Section 3.4.1).
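Condition (3.5) and the rule for avoiding writes can be sketched as a small post-pass over a given schedule. This is an illustrative sketch only: the edge dictionary, the set `global_writes`, and the function name `avoidable_accesses` are hypothetical, and results without any consumer inside the graph are simply not considered.

```python
def avoidable_accesses(phi, edges, depth, global_writes):
    """edges[(j, i)] = d_ji for dataflow edges; returns (reads, writes) avoided."""
    # Condition (3.5): the consumer is issued while the value is still in SIRO.
    avoided_reads = {
        (j, i) for (j, i), d in edges.items()
        if phi[i] - phi[j] <= d + depth - 1
    }
    producers = {j for (j, _i) in edges}
    # A write is avoidable if it is not global and every use comes from SIRO.
    avoided_writes = {
        j for j in producers
        if j not in global_writes
        and all(e in avoided_reads for e in edges if e[0] == j)
    }
    return avoided_reads, avoided_writes

# Hypothetical schedule: operation 2 consumes the result of 0 too late.
edges = {(0, 1): 1, (0, 2): 1, (1, 3): 1}
phi = {0: 0, 1: 1, 2: 4, 3: 2}
reads, writes = avoidable_accesses(phi, edges, depth=2, global_writes=set())
print(sorted(reads), sorted(writes))
```

Here the result of operation 0 must still be written to the RF because one of its uses falls outside the SIRO window, whereas the write of operation 1 can be dropped.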

3.3 Proposed Scheduling and Binding Algorithm

We use list scheduling as the base algorithm. In standard list scheduling, the scheduler begins with

a ready list, a set of nodes ready to be scheduled, and schedules them in order of their priorities

considering resource constraints. The priority determines which operation should be scheduled

earlier when the number of ready operations is more than the resources available.

To maximize the number of reads from SIRO buffer, we consider a priority function specific to

the proposed architecture.

3.3.1 Scheduling Priority Function

In list scheduling, as soon as an operation executes, its successor operations become ready. If the

number of ready operations is less than or equal to the FUs available, then all the ready operations

get scheduled in the next cycle. If dependencies are due to data-flow edges, at least one operand will

be available from the SIRO buffers, as condition (3.5) will be satisfied. By this argument we can

say that list scheduling provides a good framework for developing a scheduling/binding algorithm

to optimize SIRO usage.

If the number of operations available for scheduling is more than the resources available, the

priority function decides which operation will be scheduled in the current cycle. Scheduling the

ready operations over multiple cycles may increase the difference between the production time and

the consumption time; therefore, usage of SIRO buffers may reduce. To maximize the number

of SIRO reads the priority function should also consider the availability of operands from SIRO

buffers.

In general, the priority function in list scheduling is performance driven. We propose two approaches to modify the current scheduling priority function. The first approach is conservative and the second approach is aggressive in increasing the number of reads from the SIRO buffers.

Two Step Priority Function

In this approach the primary priority function is performance driven. We consider the distance from

the sink node as the primary priority function. A secondary priority criterion is used in case ready

operations have the same primary priority values. We call our secondary priority function the bypass priority. The bypass priority function returns a higher priority for operations getting more input operands from the SIRO buffers. Also, if an operation can get operands from the SIRO buffer in the following cycles, then it has lower priority than operations that can get operands from the SIRO buffers only in the current cycle. The overall bypass priority, $P_{by}$, is defined as

$$P_{by} = \sum_{i \in inputs} P_{by_i} \tag{3.6}$$

where

$$P_{by_i} = \begin{cases} t_c - t_{p_i} + 1 & \text{if } t_c - t_{p_i} < depth \\ 0 & \text{otherwise} \end{cases} \tag{3.7}$$

where $t_c$ and $t_{p_i}$ are the current time step and the time step when the operand $i$ is computed, respectively. This scheduling priority is conservative as it does not affect the original scheduling function.
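The two step priority can be sketched as a sort with a compound key. The code below is a minimal illustration of (3.6) and (3.7), assuming that every input of a ready operation has already been produced ($t_{p_i} \le t_c$). The names `bypass_priority` and `two_step_order` and the layout of the `ready` dictionary are hypothetical.

```python
def bypass_priority(t_c, producer_times, depth):
    """P_by of (3.6)-(3.7); producer_times holds t_p for each input operand."""
    total = 0
    for t_p in producer_times:
        age = t_c - t_p
        if age < depth:          # operand is still in the SIRO buffer
            total += age + 1     # older values are about to be lost
    return total

def two_step_order(ready, t_c, depth):
    """ready maps op -> (primary priority, producer times of its inputs)."""
    return sorted(
        ready,
        key=lambda op: (ready[op][0], bypass_priority(t_c, ready[op][1], depth)),
        reverse=True,
    )

# Hypothetical ready list at time step 4 with a SIRO depth of 2.
ready = {"a": (5, [3]), "b": (5, [3, 3]), "c": (6, [0])}
order = two_step_order(ready, t_c=4, depth=2)
print(order)
```

Operation c is scheduled first because of its higher primary priority, and the tie between a and b is broken in favor of b, which can obtain two operands from the SIRO buffers.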

Modification in Schedule Priority

In this approach, we increment the original schedule priority by δ if there is any possibility of getting

operands from the SIRO buffers. The value of δ is assigned in such a way that it just changes the

order of scheduling. For example, if distance from sink node is the original priority function, then

δ may be equal to minimum execution delay of any FU.

We found through experiments that the above-mentioned small change in priority gives a significant advantage in terms of a decrease in the number of reads from and writes to the second level RF.
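A minimal sketch of this modification is given below. It assumes that "any possibility of getting operands from the SIRO buffers" means that at least one input was produced fewer than depth cycles ago; this interpretation and the function name `modified_priority` are ours.

```python
def modified_priority(base_priority, t_c, producer_times, depth, delta):
    """Add delta when at least one input could still come from a SIRO buffer."""
    can_use_siro = any(0 <= t_c - t_p < depth for t_p in producer_times)
    return base_priority + delta if can_use_siro else base_priority

# Hypothetical values: distance-from-sink priority 5, delta of one cycle.
boosted = modified_priority(5, t_c=4, producer_times=[3], depth=2, delta=1)
unchanged = modified_priority(5, t_c=4, producer_times=[0], depth=2, delta=1)
print(boosted, unchanged)
```

With a small δ, an operation that can use the SIRO buffers overtakes operations of equal or marginally higher original priority, but not operations that are clearly more critical.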

Figure 3.1: Reservation table with three parts, $RT_{FU}$, $RT_R$, and $RT_W$, each spanning the maximum schedule length (Issue, R, and W are the number of issue slots, the number of read ports, and the number of write ports; the maximum schedule length is calculated for each compilation region).

3.3.2 RF Port Aware Scheduling and Binding

The RF port aware scheduling algorithm takes care of the new resource constraints. The concept of reservation tables is used to maintain the list of resources available in each cycle. Beyond this, our scheduling algorithm performs the operation to FU binding explicitly.

Algorithm 1 RPA_sched()
Input: G, M
Output: ϕ
 1: t = 0
 2: RT = init_reservation_table(M)
 3: while all nodes are not scheduled do
 4:     ready_list[] = get_ready_list(G, t)
 5:     priority[] = get_priority(ready_list, G, M)
 6:     bind_set = get_bind_set(ready_list, priority, RT, M, t)
 7:     ϕ[bind_set] = t
 8:     t = t + 1
 9: end while

Algorithm 1 shows the pseudo code of the core RF port aware (RPA) scheduling. The input to the algorithm is an acyclic data flow graph $G$ and the processor model $M$. $M$ includes the number of issue slots, the number of read and write ports of the RF, the RF-FU interconnection network, and the operation mapping to function units, $X$. First we initialize the cycle time $t$ to zero and the reservation table (RT) to the available resources, which include FUs, read ports and write ports (lines 1–2). The reservation table records the availability of resources in any cycle. The structure of the reservation table is shown in Fig. 3.1. It has three parts, $RT_{FU}$, $RT_R$, and $RT_W$. $RT_{FU}[f,t]$, $RT_R[r,t]$, and $RT_W[w,t]$ indicate the availability of FU $f$, read port $r$ and write port $w$, respectively, at time $t$. The bit corresponding to a resource is toggled when the resource is not available. The main loop finds and schedules operations in the current cycle (lines 3–9). For each cycle, a ready list is generated using the get_ready_list() function (line 4). priority is the operation priority corresponding to each operation in the ready list (line 5). The operations that are to be scheduled in the current cycle and their mapping to function units are determined in the get_bind_set() function. The get_bind_set() function (Algorithm 2) takes care of constraints (3.2), (3.3), and (3.4). The operations in the bind set are scheduled at the current cycle (line 7). Note that for the purpose of the algorithm, we are using $\varphi$ as an array rather than a function, with similar semantics. After scheduling operations in the current cycle, $t$ is incremented and the loop is repeated till all the nodes are scheduled.
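The structure of the main loop can be illustrated with a simplified, runnable sketch. It is not the thesis implementation: it assumes homogeneous FUs and a complete interconnection, replaces get_bind_set() by a greedy port count, ignores SIRO read avoidance, and assumes that no single operation needs more than $R$ reads or $W$ writes. All names and the dictionary-based graph encoding are hypothetical.

```python
def rpa_sched(preds, delay, reads, writes, priority, N, R, W):
    """preds[v] = {u: d_uv}; reads[v], writes[v] = RF reads/writes of v."""
    phi = {}
    write_use = {}               # cycle -> write ports already reserved
    t = 0
    while len(phi) < len(preds):
        # Ready list: all predecessors scheduled and their latencies elapsed.
        ready = [v for v in preds if v not in phi and all(
            u in phi and phi[u] + d <= t for u, d in preds[v].items())]
        ready.sort(key=lambda v: priority[v], reverse=True)
        fu_used = read_used = 0
        for v in ready:          # stands in for get_bind_set()
            wt = t + delay[v]
            if (fu_used < N and read_used + reads[v] <= R
                    and write_use.get(wt, 0) + writes[v] <= W):
                phi[v] = t
                fu_used += 1
                read_used += reads[v]
                write_use[wt] = write_use.get(wt, 0) + writes[v]
        t += 1
    return phi

# Hypothetical graph: a and b feed c, which feeds d.
preds = {"a": {}, "b": {}, "c": {"a": 1, "b": 1}, "d": {"c": 1}}
delay = {v: 1 for v in preds}
reads = {"a": 2, "b": 2, "c": 2, "d": 1}
writes = {v: 1 for v in preds}
priority = {"a": 3, "b": 3, "c": 2, "d": 1}
narrow = rpa_sched(preds, delay, reads, writes, priority, N=4, R=2, W=1)
wide = rpa_sched(preds, delay, reads, writes, priority, N=4, R=4, W=2)
print(narrow, wide)
```

With two read ports the operations a and b are serialized, which lengthens the schedule by one cycle compared to the four read port case.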

FU Binding and RF-FU Interconnections

In the underlying architecture we assume that an FU may perform different types of operations; the

type of operations may differ from one FU to other (in other words, we may have heterogeneous

FUs). Further, in direct interconnection, each RF port is shared with a set of specific FU ports,

which implies that only one of these FUs can use this port at one time. Thus, the assignment of

an FU to an operation may lead to non-availability of other FUs due to path conflicts. FU binding

in case of direct interconnection needs to take care of both the heterogeneous FUs and the RF port

sharing, while in case of complete interconnect, FU binding needs to take care of only heterogeneous

FUs.

Let $X$ be the set of all types of operations and $X_i$ be the set of types of operations that can be executed by the $i$th FU. A function $T : V \to X$ defines the types of operations associated with different nodes.

The binding problem can be defined as an integer labeling $\psi : V \to \mathbb{Z}^+$ such that

(i) Each operation can be bound to an FU where it can be executed.

$$T(v_i) \in X_{\psi(v_i)} \quad \forall i. \tag{3.8}$$

(ii) No two operations can have the same schedule time as well as the same binding option, i.e., an FU cannot be used by two operations at the same time.

$$(\varphi(v_i), \psi(v_i)) \ne (\varphi(v_j), \psi(v_j)) \quad \forall i, j,\ i \ne j. \tag{3.9}$$

(iii) No two FUs can access the same RF port in the same cycle:

$$\sum_{i:\varphi(v_i)=k} \left( r^1_i P^1_{j\psi(v_i)} + r^2_i P^2_{j\psi(v_i)} \right) \le 1 \quad \forall j, k. \tag{3.10}$$

$$\sum_{i:\varphi(v_i)+x_i=k} w_i P^w_{j\psi(v_i)} \le 1 \quad \forall j, k. \tag{3.11}$$

To solve these constraints we need a solution that considers all possible OP-FU mappings in a

cycle and then decides the optimal set of mappings. An optimal set of mappings has the maximum

number of operations bound in a cycle. We propose a conflict graph based heuristic, in which

conflicts of all possible mappings are found, and binding is done on the basis of least conflict.

A node in the conflict graph is a tuple containing the operation and the FU slot, $\langle v_i, f_l \rangle$, if $T(v_i) \in X_l$. Edges in the conflict graph represent conflicts. There is an edge between nodes $\langle v_{i1}, f_{l1} \rangle$ and $\langle v_{i2}, f_{l2} \rangle$ if

(i) both operations are mapped to the same FU, i.e., $l1 = l2$,

(ii) both FUs are accessing the same read port, i.e., $\exists j \mid (r^1_{i1} P^1_{j\,l1} \text{ or } r^2_{i1} P^2_{j\,l1}) \text{ and } (r^1_{i2} P^1_{j\,l2} \text{ or } r^2_{i2} P^2_{j\,l2})$,

(iii) both operations write to the RF using the same RF port and at the same time, i.e., $\exists j \mid (w_{i1} P^w_{j\,l1} \text{ and } w_{i2} P^w_{j\,l2} = 1) \text{ and } (x_{i1} = x_{i2})$.
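The three edge conditions can be illustrated with a small sketch that builds the conflict graph for a direct interconnection. The encoding is hypothetical: `p1[l]`, `p2[l]` and `pw[l]` give the RF port wired to the first input, second input and output of FU `l` (a compact form of $P^1$, $P^2$ and $P^w$), and alternative FUs of the same operation are assumed not to conflict with each other.

```python
from itertools import combinations

def build_conflict_graph(ops, fu_types, p1, p2, pw):
    """ops[v] = (type, r1, r2, w, x); fu_types[l] = set of types FU l executes.

    p1[l], p2[l], pw[l]: RF ports wired to FU l (direct interconnection).
    """
    nodes = [(v, l) for v in ops for l in fu_types if ops[v][0] in fu_types[l]]

    def read_ports(v, l):
        _, r1, r2, _, _ = ops[v]
        return ({p1[l]} if r1 else set()) | ({p2[l]} if r2 else set())

    edges = set()
    for (v1, l1), (v2, l2) in combinations(nodes, 2):
        if v1 == v2:
            continue  # assumption: alternatives of one operation do not conflict
        same_fu = l1 == l2                                            # (i)
        same_read_port = bool(read_ports(v1, l1) & read_ports(v2, l2))  # (ii)
        same_write = (ops[v1][3] and ops[v2][3] and pw[l1] == pw[l2]
                      and ops[v1][4] == ops[v2][4])                   # (iii)
        if same_fu or same_read_port or same_write:
            edges.add(((v1, l1), (v2, l2)))
    return nodes, edges

# Hypothetical two-FU machine; "add" takes its second operand from SIRO.
fu_types = {1: {"INT", "MEM"}, 2: {"INT"}}
p1, p2, pw = {1: 1, 2: 2}, {1: 2, 2: 1}, {1: 1, 2: 2}
ops = {"add": ("INT", 1, 0, 1, 1), "load": ("MEM", 1, 0, 1, 2)}
nodes, edges = build_conflict_graph(ops, fu_types, p1, p2, pw)
print(nodes, edges)
```

In this example the only conflict is between mapping both operations to FU 1, so binding the add to FU 2 leaves FU 1 free for the load.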

Using the conflict graph, the binding algorithm binds the least conflicting FU to each operation.

Algorithm 2 get_bind_set()
Input: ready_list, priority, RT, M, t
Output: bind_set, ψ
 1: bind_set = ∅
 2: confl_graph = build_conflict_graph(ready_list, RT, M)
 3: for v ∈ ready_list in priority order do
 4:     for f ∈ FU in increasing order of conflict do
 5:         if resource_available(v, f, RT, t) = 1 then
 6:             ψ[v] = f
 7:             bind_set.add(v)
 8:             update_res_table(v, f, RT, t)
 9:             update_conflict_graph(v, confl_graph)
10:         end if
11:     end for
12: end for

The pseudo code of the binding algorithm (get_bind_set() function) is shown in Algorithm 2. The binding conflict graph is built on the basis of the resource requirements of the operations and the FUs to which operations can be mapped (line 2). Binding of the ready operations is done in the order of priority (lines 3–12). All the FUs where an operation can be mapped are considered in increasing order of conflict (line 4). This set of FUs is computed from the binding conflict graph. If all resources required (RF read and write ports) to execute $v$ on the selected FU are available in the reservation table, $\psi$ is set (lines 5–6), and the operation is added to bind_set (line 7). Availability of the FU, read ports, and write ports is toggled in the reservation tables based on usage (line 8), i.e.,

$$RT_{FU}[f, t] = 1,$$

$$RT_R[j, t] = 1 \mid r^1_v P^1_{jf} \text{ or } r^2_v P^2_{jf} \quad \forall j,$$

$$RT_W[k, t + x_v] = 1 \mid w_v P^w_{kf} \quad \forall k.$$

In the conflict graph update, the nodes related to the selected operation and the conflicting nodes are removed along with their edges to satisfy constraints (3.9), (3.10), and (3.11).
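The binding step can be sketched in a compact, runnable form. The sketch makes several simplifying assumptions that are not part of the algorithm above: conflict values are supplied as a precomputed dictionary and are not updated after each binding, the reservation table is a set of busy (resource, index, cycle) triples, and the inner loop stops as soon as an operation has been bound to one FU. All names are hypothetical.

```python
def get_bind_set(ready, conflict, fu_types, ops, p1, p2, pw, busy, t):
    """Greedy sketch of the binding step; `busy` is the reservation table.

    ready: operations in priority order; conflict[(v, f)]: conflict value;
    ops[v] = (type, r1, r2, w, x); p1/p2/pw: RF ports wired to each FU.
    """
    bind_set, psi = [], {}
    for v in ready:
        op_type, r1, r2, w, x = ops[v]
        candidates = [f for f in fu_types if op_type in fu_types[f]]
        for f in sorted(candidates, key=lambda f: conflict[(v, f)]):
            needed = {("FU", f, t)}
            if r1:
                needed.add(("R", p1[f], t))
            if r2:
                needed.add(("R", p2[f], t))
            if w:
                needed.add(("W", pw[f], t + x))
            if not needed & busy:      # all required resources are free
                psi[v] = f
                bind_set.append(v)
                busy |= needed         # update the reservation table in place
                break                  # one FU per operation
    return bind_set, psi

# Hypothetical two-FU machine with a direct interconnection.
fu_types = {1: {"INT", "MEM"}, 2: {"INT"}}
p1, p2, pw = {1: 1, 2: 2}, {1: 2, 2: 1}, {1: 1, 2: 2}
ops = {"load": ("MEM", 1, 0, 1, 2), "add": ("INT", 1, 0, 1, 1),
       "sub": ("INT", 1, 1, 1, 1)}
conflict = {("load", 1): 2, ("add", 1): 2, ("add", 2): 1,
            ("sub", 1): 2, ("sub", 2): 1}
busy = set()
bind_set, psi = get_bind_set(["load", "add", "sub"], conflict, fu_types,
                             ops, p1, p2, pw, busy, t=0)
print(bind_set, psi)
```

The load takes the only FU that supports memory operations, the add goes to its least conflicting FU, and the sub is deferred to a later cycle because both FUs are already reserved.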

Figure 3.2: Example data-flow graph (operations OP1 to OP14, with the RF reads labeled R1 to R8).

FU 1 (X1)   INT   -       MEM   -
FU 2 (X2)   INT   FLOAT   -     -
FU 3 (X3)   INT   -       MEM   -
FU 4 (X4)   INT   -       -     BRANCH

Table 3.1: Types of operations that can be executed on each function unit

Example

We illustrate our scheduling and binding algorithm with the help of an example data flow graph

(DFG) shown in Fig. 3.2. Each node in the graph represents an operation and different node col-

ors/shades are for different FU types. The edges between the nodes are data dependency or control

dependency edges. For a simplified view of the DFG, memory dependency, output dependency and

anti-dependency edges are not shown. Each edge is associated with an edge weight that signifies the minimum time interval between the schedule times of two nodes. The figure also shows the RF reads for each operation explicitly in circles (labeled as R1, R2, etc.).

Figure 3.3: Operations binding example.

We schedule this graph for three cases:

(i) Architecture with reduced read ports and complete interconnection.

(ii) Architecture with reduced read ports and direct interconnection.

(iii) Architecture with reduced read and write ports, and complete interconnection.

All the above three cases use a 4 issue width processor. The types of operations that can be

performed by each issue slot are shown in Table 3.1.

Architecture with reduced read ports and complete interconnection We schedule the DFG

for a four issue VLIW architecture with a 4 read and 4 write port RF and complete interconnect.

For complete interconnection, the only conflict is due to heterogeneous function units; therefore,

conflict value of a FU-OP tuple is the number of other ready operations that can be mapped to that

FU.
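The conflict value defined above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the FU capabilities are taken from Table 3.1, the operation types of the six ready operations are our reading of Fig. 3.2 (with CMPP assumed to execute on an integer unit), and the function name conflict_values is hypothetical.

```python
# Operation types each FU can execute (Table 3.1).
FU_TYPES = {
    "FU1": {"INT", "MEM"},
    "FU2": {"INT", "FLOAT"},
    "FU3": {"INT", "MEM"},
    "FU4": {"INT", "BRANCH"},
}

# Assumed types of the six ready operations of cycle 1 (read off Fig. 3.2).
READY_OPS = {
    "OP1": "INT",  # ADD
    "OP2": "INT",  # SUB
    "OP3": "MEM",  # LOAD
    "OP4": "MEM",  # LOAD
    "OP5": "MEM",  # LOAD
    "OP6": "INT",  # CMPP, assumed to run on an integer unit
}


def conflict_values(op, ready_ops, fu_types):
    """Conflict value of every FU that can execute `op` (complete interconnect).

    The conflict value of an FU-OP tuple is the number of *other* ready
    operations that can also be mapped to that FU.
    """
    values = {}
    for fu, types in fu_types.items():
        if ready_ops[op] not in types:
            continue  # `op` cannot be mapped to this FU at all
        values[fu] = sum(
            1 for other, op_type in ready_ops.items()
            if other != op and op_type in types
        )
    return values


print(conflict_values("OP1", READY_OPS, FU_TYPES))
# {'FU1': 5, 'FU2': 2, 'FU3': 5, 'FU4': 2}
```

Under these assumptions FU2 and FU4 are the least conflicting units for OP1, because the three memory operations can only compete for FU1 and FU3.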



In the first cycle, six operations are available in the ready list (shown in Fig. 3.3). The figure shows the ready operations in priority order, the conflict values of the various FUs for the highest priority operation ready for scheduling, and the reservation table. Write port resources are not shown in the reservation table, as they are not constrained in the current example.

The ready operations are bound in order of their priority (line 3, get_bind_set()) to the least conflicting FU. Since three of the ready operations are of MEM type, which can be mapped only to FU1 and FU3, the conflict value is high for FU1 and FU3 and low for FU2 and FU4. OP1, being the first ready operation, is bound to the least conflicting unit, FU2, and the resources required for OP1 (FU2 and two read ports) are removed from the reservation table. The conflict graph is also updated after the OP1-FU2 binding. In the same way, OP2 is bound to FU4, and OP3, being a memory operation, is mapped to FU1. None of OP4, OP5, and OP6 can be scheduled in the first cycle, because no read port is left.
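The first-cycle binding described above can be reproduced with a simplified sketch of the loop of Algorithm 2. This is not the thesis implementation: it assumes complete interconnect, models only FU and read port resources, breaks ties by FU name, and uses assumed per-operation read counts (two for OP1 as stated in the text, one each for OP2 and OP3 so that cycle 1 uses four reads).

```python
FU_TYPES = {"FU1": {"INT", "MEM"}, "FU2": {"INT", "FLOAT"},
            "FU3": {"INT", "MEM"}, "FU4": {"INT", "BRANCH"}}

# (name, type, RF reads) in priority order; read counts are assumptions.
READY = [("OP1", "INT", 2), ("OP2", "INT", 1), ("OP3", "MEM", 1),
         ("OP4", "MEM", 1), ("OP5", "MEM", 1), ("OP6", "INT", 2)]


def get_bind_set(ready, fu_types, read_ports):
    """Greedy binding for one cycle: least conflicting free FU first."""
    free_fus = set(fu_types)
    pending = list(ready)          # operations still in the conflict graph
    binding = {}
    for name, op_type, reads in ready:
        candidates = [fu for fu in sorted(free_fus) if op_type in fu_types[fu]]

        def conflict(fu):
            # other pending operations that could also be mapped to `fu`
            return sum(1 for n, t, _ in pending
                       if n != name and t in fu_types[fu])

        # sorted() is stable, so ties go to the first FU by name
        for fu in sorted(candidates, key=conflict):
            if reads <= read_ports:            # resource check (line 5)
                binding[name] = fu             # psi[v] = f (line 6)
                free_fus.remove(fu)            # update reservation (line 8)
                read_ports -= reads
                pending = [p for p in pending if p[0] != name]  # line 9
                break
    return binding


print(get_bind_set(READY, FU_TYPES, read_ports=4))
# {'OP1': 'FU2', 'OP2': 'FU4', 'OP3': 'FU1'}
```

With four read ports, the sketch binds OP1, OP2, and OP3 and leaves OP4, OP5, and OP6 unbound, since all read ports are consumed by the first three operations.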

In cycle 2, due to availability of operands from the SIRO buffers, OP7, OP8, and OP9 do not

require any register file read and therefore, OP7, OP8, OP9, and OP4 are scheduled in this cycle.

Similarly in the third cycle, the available operations are OP10, OP11, OP5, and OP6. Due to

availability of operands in SIRO buffers, the number of read ports required is 4 instead of 7 and all

these 4 operations can be scheduled in this cycle. In the last cycle the remaining three operations

are scheduled. The resulting schedule is shown in Fig. 3.4. We observe that, in the example DFG, a 50% reduction in RF read ports (from 8 to 4) did not lead to performance degradation, due to the availability of operands in the SIRO buffers.

Architecture with reduced read ports and direct interconnection We consider a 4 issue width

processor with 4 read and 4 write ports. The direct interconnect is as shown in Fig. 2.16(a). Consider

the operations of cycle 3 in the scheduled graph shown in Fig. 3.4.

The conflict graph corresponding to the possible mappings is shown in Fig. 3.5. Solid edges in Fig. 3.5(a) show the conflicts due to path conflict, and dotted edges in Fig. 3.5(b) show the conflicts due to FUs. From the scheduled graph in Fig. 3.4, we notice that OP10 and OP5 require only the left operand from the RF, OP11 gets both its operands from SIRO buffers, and OP6 requires both of its operands from the RF. Based on this resource requirement, the resource information given by the interconnection matrix (Fig. 2.16(a)), and the type of operations that can be performed by each FU (Table 3.1), the conflict graph is constructed. The overall binding conflict graph is formed by the superposition of Fig. 3.5(a) and 3.5(b), and the overall conflict at a node is the sum of all its edges.

Figure 3.4: Schedule for the 4 issue slot VLIW processor with 4 read port and 4 write port RF.

Using this conflict graph, operations are bound to FUs on the basis of minimum conflict. In order of priority, OP5 is considered first. For OP5, FU1 and FU3 are the least conflicting function units, and we choose FU1 as it is the first available FU. After this binding, the OP5-FU1 node, all mappings conflicting with it, and the remaining nodes related to OP5 are pruned from the graph, together with their edges. The resulting binding conflict graph is shown in Fig. 3.6(a). In this graph, edges due to path conflicts and FU conflicts are drawn in the same graph.

The next operation in priority is OP6. Fig. 3.6(a) shows that for OP6, FU2 and FU3 have equal priority. FU2 is selected as it is the first available FU, and the graph is pruned as it was done for OP5. The resulting graph is shown in Fig. 3.6(b). Next, OP10 is bound to FU4 and OP11 to FU3 without any conflict.

Figure 3.5: Example binding conflict graph for cycle 3 of scheduled graph in Fig. 3.4. (a) Bind graph due to RF port sharing. (b) Bind graph due to heterogeneous FUs.

Architecture with reduced read and write ports, and complete interconnection In this case we schedule the subject graph for a 4 issue width processor, with 4 read port and 3 write port RF. The resulting schedule (shown in Fig. 3.7) takes 5 cycles instead of 4. The scheduler conservatively assumes that no RF write can be avoided, so it reserves resources for all RF writes. Consequently, in all the schedule cycles, a maximum of three RF writes are scheduled.

Figure 3.6: Example binding conflict graph for cycle 3 of scheduled graph in Fig. 3.4 after binding of OP5 and OP6. (a) Bind graph after binding of OP5. (b) Bind graph after binding of OP6.

3.3.3 Iterative Schedule Improvement

The RPA scheduling described above takes care of all the constraints of read/write port and intercon-

nection. The write port resources of an operation are reserved in the current cycle as write avoidance

can only be determined in future cycles. Therefore, the resulting schedule does not benefit from the

fact that the writes can be avoided due to operand reads from the SIRO buffers.

We reschedule the output of the RPA sched algorithm as resources due to avoided writes are

available only after scheduling. We list the operations that can be scheduled in earlier cycles due

to availability of write ports. An operation is scheduled if all the resources of the operation are

available, and other constraints are satisfied. For example, in Fig. 3.7, in the second clock step,

resources are available to schedule OP4. With the vacancy created by OP4 in the third cycle, OP5

and OP6 are rescheduled to cycle 3. In this way, we observe that the new schedule is the same as

the schedule of Fig. 3.4. In this case we get the optimum schedule in a single iteration. However,

the procedure of improvement can be repeated till we see no further improvement.

Figure 3.7: Schedule for the 4 issue slot VLIW processor with 4 read port and 3 write port RF.

Algorithm 3 is the pseudo code of the Im-RPA scheduling algorithm that does iterative schedule improvement. First, the reservation table of each resource is initialized in accordance with the input scheduled graph (line 1). All the operands for which the write can be avoided are found, and their respective write port resources are freed from the reservation tables (lines 3–5).

For each schedule cycle, moveup_list is calculated (line 7). moveup_list is the list of those operations which can be scheduled in the current cycle. All the operations in moveup_list are checked for resource constraints. If the required resources are available, the operation is scheduled and bound, and the reservation table is updated (lines 10–14). By the end of the loop (lines 6–16), we have a new schedule. If the new schedule has a smaller schedule length, the whole process is repeated, until no further reduction in schedule length is obtained.

Although this algorithm is iterative in nature, convergence is guaranteed, and the maximum number of iterations is less than the initial schedule length achieved by RPA_sched. This can be seen as follows. In the first iteration, the possible changes in cycle 0 of the schedule are finalized, since all possible move-up operations have been considered for scheduling in cycle 0. Similarly, after the (k+1)th iteration, the schedule for cycles 0–k will not change. Thus the maximum number of iterations is always less than the initial schedule length of the graph.

Algorithm 3 Im-RPA_sched()
Input: V, M
Output: V
 1: RT = init_reservation_tables(V, M)
 2: repeat
 3:   for each v ∈ V do
 4:     remove_available_write_port(RT, v)
 5:   end for
 6:   for t = 0 to schedule_length do
 7:     moveup_list = find_ready_moveup_op(V, t)
 8:     for each i ∈ moveup_list do
 9:       fu_pos = get_fu_position(i, V, M, t)
10:       if fu_pos > 0 then
11:         ϕ[i] = t
12:         ψ[i] = fu_pos
13:         update_res_table(i, fu_pos, RT, t)
14:       end if
15:     end for
16:   end for
17: until schedule_length_reduction = 0

3.4 Additional Compiler Support

3.4.1 Identifying Global and Local Reads and Writes

To inhibit RF reads and writes, the compiler, during its analysis phase, identifies the operands which are read from the SIRO buffers and also the writes that can be avoided. This analysis is done in two stages. In the first stage, we mark the results that must be written to the RF. In the second stage, the operands for which the RF read/write is avoided are marked. The analysis is done at the level of the compilation region, which is a basic block in the simplest case.

SIRO Code
0: ADD.1 s1_1, Abase, y; ADD.2 s2_1, Bbase, z; ADD.3 s3_1, Cbase, x;
1: LOAD.1 s1_1, s1_1; LOAD.2 s2_1, s2_1;
2: ADD.1 C, s1_1, s2_1;
3: STORE.1 s1_1, s3_3;

Figure 3.8: Example: Register renaming.

In the first stage we mark the global writes. If a result is read by at least one operation of any other basic block, then it is a global write. All the operands marked global write must be written to the RF. We use the liveness analysis of a standard compiler to determine these essential writes. For each basic block, the liveout operands are available from the liveness analysis. All the operations in the basic block are iterated in reverse order; if the result of an operation is in the liveout list, it is marked as a global write. If an operand in the liveout list is already marked as a global write, other instances of the same operand in the basic block are not marked as global write.
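The reverse traversal described above can be sketched in Python as follows. The data layout (a list of operation name and destination register pairs) and the function name mark_global_writes are illustrative assumptions, not part of the compiler framework; the liveout set is assumed to come from a standard liveness analysis.

```python
def mark_global_writes(block, liveout):
    """Return the operations of a basic block whose results are global writes.

    block   -- list of (op_name, dest_register) in program order;
               dest_register is None for operations without a result
    liveout -- set of registers live at the exit of the basic block
    """
    pending = set(liveout)      # liveout registers not yet attributed
    global_writes = set()
    for op_name, dest in reversed(block):
        if dest in pending:
            # last definition of a liveout register: must reach the RF
            global_writes.add(op_name)
            # earlier definitions of the same register are not marked
            pending.discard(dest)
    return global_writes


block = [("op1", "r1"), ("op2", "r2"), ("op3", "r1"), ("op4", "r3"),
         ("op5", None)]
print(sorted(mark_global_writes(block, liveout={"r1", "r3"})))
# ['op3', 'op4']
```

Here op1 also defines r1, but only the last definition (op3) is marked, and r2 is not liveout, so op2 remains a candidate for write avoidance in the second stage.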

Similarly, global reads may be identified in this phase but that information is not useful for finding

essential reads. The second stage of determining RF reads and writes is done during scheduling as

explained in the previous section.

3.4.2 Code Generation: Register Renaming

For correct code generation registers are renamed to the SIRO registers. Total number of registers

given to register allocation algorithm is the number of registers in the second level RF. After register

allocation and post pass scheduling, registers corresponding to the operands available from SIRO

buffer are renamed with that SIRO register. Note that no change is done in the write register address;

thus register value will be updated in the register file and can be used by another operation in

subsequent cycles. However, when the result of an operation is marked as ‘avoid RF write’, the

address of the destination register is also replaced by the corresponding SIRO register.

An example of code after register renaming is shown in Fig. 3.8 (The figure is reproduced from

Fig. 2.11). In the example the operands that are named with the prefix ‘s’ are SIRO registers.


3.5 Summary

In this chapter we clearly defined the role of the compiler for the proposed architecture. We proposed

a novel scheduling and binding algorithm. Our algorithm maximizes the number of SIRO reads,

and minimizes the performance loss due to reduced port and direct interconnect. Ideas of binding

conflict graph were suggested for FU binding, and iterative scheduling was proposed for reduced

write port RF architectures.


4 Performance and Energy Models

In previous chapters we have proposed an architecture based on local buffers and reduced port RF.

Local buffers save energy by avoiding RF reads and writes without impacting the performance if the

size of the RF is unchanged. Port reduction further saves energy by reducing energy per RF access.

However, the number of execution cycles may increase in this case, leading to some performance

degradation.

In this chapter we theoretically model the performance and energy of applications with local

buffer based reduced port RF architecture. The base architecture is HPL-PD architecture [Kathail

et al., 2000]. Though HPL-PD is a completely parametrized architecture, the only parameters considered by our model are issue width, number of read ports, and number of write ports. Apart from the architecture, the other input to the model is an application. First the application is characterized, then the application characteristics are used for estimating performance and energy for the proposed architecture. The input and output of our modeling framework are shown in Fig. 4.1.

First we describe the performance and energy model of the reduced port architecture for a fixed issue width processor, and in the next section we generalize it for any issue width.

4.1 Model for Fixed Issue-width Processor

In this section we model the performance of VLIW architecture with shared port RF and compute the

additional execution cycles required due to port reduction, keeping the issue width of the processor

fixed. The model needs to account for the fact that the RF port requirement is reduced due to

availability of some values in the SIRO buffers. In a shared port RF architecture, in spite of these reductions in port usage, certain instructions may still require more ports than are available. Such instructions may require additional cycles for getting re-scheduled.

Figure 4.1: Basic block diagram of the ILP model.

4.1.1 Performance Model

For an N issue processor, unconstrained RF has 2N read ports and N write ports. If read and write

ports of the register file in such a processor are limited to k and m respectively, the instructions which

require more than k read ports or m write ports need to be rescheduled such that port constraints are

met. The number of cycles required in scheduling such instructions due to read port constraint can

be estimated as the total number of reads in these instructions divided by k. Similarly, the number of

cycles due to write constraints can be estimated.

For the above computation, we use two vectors R and W of length 2N and N, respectively. $R_i$ (or $W_i$) denotes the number of cycles using i read ports (or i write ports) in the execution of an application on the architecture with unconstrained RF ports. The additional cycles due to the read port constraint ($Cycle^{+}_{read}$) and the write port constraint ($Cycle^{+}_{write}$) can be estimated as:

\[
Cycle^{+}_{read} = \frac{\sum_{j=k+1}^{2N} R_j * j}{k} - \sum_{j=k+1}^{2N} R_j. \tag{4.1}
\]

\[
Cycle^{+}_{write} = \frac{\sum_{j=m+1}^{N} W_j * j}{m} - \sum_{j=m+1}^{N} W_j. \tag{4.2}
\]

The additional cycles due to both the read port constraint and the write port constraint are estimated by adding the two:

\[
Cycle^{+} = Cycle^{+}_{read} + Cycle^{+}_{write}. \tag{4.3}
\]

The total number of cycles is

\[
C = \sum_{j=0}^{2N} R_j + Cycle^{+}. \tag{4.4}
\]

For an application whose characteristic vectors R and W have been computed once, our model can be used to find the approximate execution cycles for a processor of the same issue width with any number of read and write ports in the RF. Note that the model is an approximation of the actual scheduling process and, therefore, it can overestimate or underestimate performance. The actual performance depends on many factors, such as data dependency in the application, quality of scheduling and binding, interconnection topology for the shared port RF, etc.
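Equations (4.1)–(4.4) can be evaluated with a few lines of Python. The sketch below is a direct transcription of the formulas; the profile vectors are hypothetical numbers chosen for illustration (not measured data), and each vector carries an entry at index 0 for cycles with no RF access, as the sum in (4.4) starts at j = 0.

```python
def extra_cycles(hist, ports):
    """Equations (4.1)/(4.2).

    hist[j] -- number of cycles that use j ports with unconstrained RF
    ports   -- number of ports available (k for reads, m for writes)
    """
    over = range(ports + 1, len(hist))        # cycles exceeding the limit
    accesses = sum(hist[j] * j for j in over)
    return accesses / ports - sum(hist[j] for j in over)


def total_cycles(R, W, k, m):
    """Equations (4.3) and (4.4) for k read ports and m write ports."""
    return sum(R) + extra_cycles(R, k) + extra_cycles(W, m)


# Hypothetical profile of a 4 issue processor (N = 4): indices 0..2N and 0..N.
R = [10, 20, 30, 15, 10, 5, 5, 3, 2]   # 100 cycles in total
W = [20, 40, 25, 9, 6]                 # 100 cycles in total

print(extra_cycles(R, 4))        # 8.0 additional cycles with 4 read ports
print(total_cycles(R, W, 4, 3))  # 110.0 cycles for the 4r3w configuration
print(total_cycles(R, W, 8, 4))  # 100.0 cycles with unconstrained ports
```

For this made-up profile, the 15 cycles that need more than four read ports contain 92 reads, which take 23 cycles with four ports, i.e., 8 additional cycles.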

4.1.2 RF Energy Model

Energy consumed by RF is the sum of the dynamic energy and leakage energy spent by it. Dynamic

energy depends on activity and can be calculated by counting the number of RF accesses and energy

per access. Leakage energy depends on leakage current and time of execution. Leakage current is

usually fixed for a supply voltage. Execution time is the number of cycles multiplied by cycle time.

Thus, the RF energy, $E_{rf}$, is

\[
E_{rf} = E_{read} N_{read} + E_{write} N_{write} + P_{rf\_leak} \, t \, C, \tag{4.5}
\]

where $E_{read}$, $E_{write}$, $N_{read}$, $N_{write}$, $P_{rf\_leak}$, $t$, and $C$ are energy per read access, energy per write access, number of RF reads, number of RF writes, RF leakage power, clock period, and number of cycles, respectively.

$N_{read}$ and $N_{write}$ are the number of RF reads and writes after avoiding redundant reads and writes due to operands available in SIRO buffers. They can be calculated from the characteristic vectors R and W:

\[
N_{read} = \sum_{j=1}^{2N} R_j * j. \tag{4.6}
\]

\[
N_{write} = \sum_{j=1}^{N} W_j * j. \tag{4.7}
\]

The total number of operand reads and result writes in an application is fixed. However, the number of operand reads from SIRO buffers may change with the number of read ports and write ports, as the schedule gets elongated with a reduction in ports. This change in $N_{read}$ and $N_{write}$ is considered a second order effect and is ignored in the computation of RF energy.

The total number of cycles, C, can be calculated from equation (4.4). The RF access energy

for each read and write can be calculated using Cacti 4.0 [Tarjan et al., 2006]. Cacti 4.0 is an

analytical model for SRAM which estimates area, power, energy, and access time. As Cacti is also

an analytical model, it is easily integrated into our model.

The RF energy depends on energy per access and number of cycles. Energy per access reduces, whereas the number of execution cycles C increases, with a decrease in the number of ports. Thus, in equation (4.5), the first two terms decrease the RF energy and the third increases it, as the number of ports reduces. Overall, RF energy reduces as the first two terms dominate the third. Also, the scheduler attempts to minimize the increase in the number of cycles in order to minimize the effect of the third term.
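The trade-off between the three terms of equation (4.5) can be made concrete with a small sketch. All numeric values below are hypothetical placeholders in arbitrary units; in particular, the per-access energies are not Cacti outputs, and the cycle counts are assumed values for two port configurations.

```python
def rf_energy(R, W, e_read, e_write, p_leak, t_clk, cycles):
    """Equation (4.5), with N_read and N_write from (4.6) and (4.7)."""
    n_read = sum(j * r for j, r in enumerate(R))    # eq. (4.6)
    n_write = sum(j * w for j, w in enumerate(W))   # eq. (4.7)
    dynamic = e_read * n_read + e_write * n_write
    leakage = p_leak * t_clk * cycles
    return dynamic + leakage


# Hypothetical access profile (index = ports used in a cycle).
R = [10, 20, 30, 15, 10, 5, 5, 3, 2]
W = [20, 40, 25, 9, 6]

# Assumed values: fewer ports give lower energy per access but more cycles.
full_ports = rf_energy(R, W, e_read=10, e_write=12, p_leak=1, t_clk=1, cycles=100)
reduced_ports = rf_energy(R, W, e_read=6, e_write=8, p_leak=1, t_clk=1, cycles=110)
print(full_ports, reduced_ports)
```

With these assumed numbers the dynamic terms shrink by more than the leakage term grows, which is the situation described in the text; with a sufficiently large leakage power or cycle increase the comparison could go the other way.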


4.2 Modeling of a Generic Processor


A generic processor may have any number of function units, any number of read and write ports in

the RF. Performance of an application depends on both the resources present in the architecture as

well as the parallelism in the application. Before moving ahead, we refer to definitions of instruction

level parallelism as given by Gangwar [2005]. The parallelism present in an application without

any hardware or compiler constraints is defined as available instruction level parallelism (ILP).

The parallelism present in the application for a given compiler without hardware constraints is

defined as achievable-S ILP – instruction level parallelism. The parallelism present with no compiler

constraints for a given hardware is achievable-H ILP. The achieved ILP is parallelism with a specific

hardware configuration for a given compiler. Achieved ILP is the result of interaction between

achievable-S ILP and achievable-H ILP.

4.2.1 Performance Model

Jouppi [1989] suggested a first order approximation for calculating the achieved ILP. He suggested

that the achieved ILP in an application is the minimum of achievable-H ILP and achievable-S ILP

as written in (4.8).

Achieved ILP = MIN(Achievable-H ILP, Achievable-S ILP). (4.8)

He further suggested that due to non-uniform parallelism achieved ILP is less than achievable-S

ILP when achievable-H ILP is equal to achievable-S ILP. Noonburg and Shen [1994] gave a theoret-

ical model of the performance based on Jouppi model. Their model uses control and data parallelism

distributions in the application and fetch, issue, and branch parallelism of the architecture to find

achieved ILP. We use Noonburg’s model for deriving performance estimates in a generic VLIW

processor.

Parallelism in an application with a certain input is described by the parallelism vector, x:

\[
x = [\, x_1 \; x_2 \; x_3 \; \ldots \,], \tag{4.9}
\]

where $x_i$ is the number of cycles having parallelism of degree i. Using the parallelism vector, ILP can be calculated as

\[
\text{Achievable-S ILP} = \frac{\sum_i x_i * i}{\sum_i x_i}. \tag{4.10}
\]

Effect of Limiting the Number of FUs

When we limit the number of issue slots to k (we assume k uniform FUs in a k issue width processor),

we assume

• The cycles that have parallelism more than or equal to k, will be limited to parallelism of k,

and the corresponding operations are distributed with parallelism of degree k.

• The cycles that have parallelism less than k will not get affected.

Thus the new elements in the parallelism vector are

\[
x'_i =
\begin{cases}
x_i & \text{if } i < k \\
\dfrac{\sum_{j=k}^{\infty} x_j * j}{k} & \text{if } i = k \\
0 & \text{if } i > k.
\end{cases}
\tag{4.11}
\]

The achievable ILP is calculated from the modified parallelism vector by using equation (4.10). The number of additional cycles due to the issue width limitation is given by $Cycle^{+}_{issue}$:

\[
Cycle^{+}_{issue} = \sum_i x'_i - \sum_i x_i \tag{4.12}
\]

Using equations (4.11) and (4.12), we can write $Cycle^{+}_{issue}$ as:

\[
Cycle^{+}_{issue} = \frac{\sum_{j=k}^{\infty} x_j * j}{k} - \sum_{j=k}^{\infty} x_j \tag{4.13}
\]
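The transformation of the parallelism vector can be checked numerically with the following sketch. It transcribes equations (4.10)–(4.12); the vector x is a hypothetical profile, stored with an unused entry at index 0 so that x[i] is the number of cycles with parallelism of degree i.

```python
def limit_issue_width(x, k):
    """Equation (4.11): fold all cycles with parallelism >= k into degree k."""
    ops_at_or_above_k = sum(x[j] * j for j in range(k, len(x)))
    return list(x[:k]) + [ops_at_or_above_k / k]


def ilp(x):
    """Equation (4.10): operations divided by cycles."""
    return sum(i * xi for i, xi in enumerate(x)) / sum(x)


x = [0, 40, 30, 20, 10, 0, 0, 0, 5]    # hypothetical profile, degrees 0..8
x_lim = limit_issue_width(x, 4)
cycle_plus_issue = sum(x_lim) - sum(x)  # eq. (4.12)

print(x_lim)              # [0, 40, 30, 20, 20.0]
print(cycle_plus_issue)   # 5.0
print(ilp(x), ilp(x_lim))
```

The 15 cycles with parallelism 4 or more contain 80 operations, which need 20 cycles on a 4 issue processor, so 5 cycles are added. This agrees with equation (4.13), and the number of operations (240) is unchanged while the ILP drops.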



The proposed model based on the above assumptions closely follows the simulation results except

for a few cases. The reasons for the anomalies are as follows. First, the number of cycles given in

the vector is distributed and thus accumulated over the whole application, while calculated value of

$x'_k$ assumes them lumped. This may lead to underestimation of cycles and overestimation of ILP.
Second, it is assumed that cycles with parallelism of degree less than k will not be affected. This

may not be true always; the increase in schedule length because of issue width limitation, leads to

increase in slack for operations of low ILP cycles. The available slack may increase the parallelism

and lead to under-estimation of the ILP.

Effect of Limiting the Number of RF ports

To account for the effect of reduced read and write ports, similar parallelism vectors are defined.

We define different read vectors for local and non-local reads. All the reads from local buffers are considered as local reads, while reads from the second level RF are considered as non-local reads. $\mathit{local\_read}_i$ and $\mathit{nonlocal\_read}_i$ denote the number of cycles having i simultaneous reads from SIRO and RF, respectively. Similarly, $\mathit{local\_write}_i$ is the number of cycles in which i simultaneous writes to the RF are avoided due to local buffers, and $\mathit{nonlocal\_write}_i$ is the number of cycles having i simultaneous writes to the second level RF.

Due to port or FU constraint, the schedule length may be longer, and some operands may not be

available in local buffers. This may change some local reads to non-local reads and similarly local

writes to non-local writes. The number of additional non-local reads (writes) is directly proportional

to the total local reads (local writes), and increase in execution time due to port or FU constraint.

Thus:

\[
\mathit{nonlocal\_read}^{+} = \alpha \, Cycle^{+} \sum_{i=0}^{\infty} i * \mathit{local\_read}_i \tag{4.14}
\]

\[
\mathit{nonlocal\_write}^{+} = \beta \, Cycle^{+} \sum_{i=0}^{\infty} i * \mathit{local\_write}_i \tag{4.15}
\]

The proportionality constants α and β have to be determined empirically. $Cycle^{+}$ is the total number of additional cycles due to port or FU constraints. Since we do not yet know the additional cycles due to port constraints, we substitute the value of $Cycle^{+}_{issue}$ as a first approximation. Once we compute the additional cycles due to port constraints (4.20), we use that value to recompute $\mathit{nonlocal\_read}^{+}$ and $\mathit{nonlocal\_write}^{+}$.

The increase in the number of execution cycles due to the read port constraint depends on (a) non-local reads having parallelism more than k, and (b) additional non-local reads. The additional cycles can be calculated as:

\[
Cycle^{+}_{read} = \frac{\sum_{i=k}^{\infty} i * \mathit{nonlocal\_read}_i + \mathit{nonlocal\_read}^{+}}{k} - \sum_{i=k}^{\infty} \mathit{nonlocal\_read}_i \tag{4.16}
\]

Similarly, the additional cycles due to the limited number of write ports are calculated as

\[
Cycle^{+}_{write} = \frac{\sum_{i=m}^{\infty} i * \mathit{nonlocal\_write}_i + \mathit{nonlocal\_write}^{+}}{m} - \sum_{i=m}^{\infty} \mathit{nonlocal\_write}_i \tag{4.17}
\]

Combined effect of port and FU limitation

The number of execution cycles increases when we limit the issue width of the processor to k. A similar increase may occur when we limit read ports to 2k or write ports to k. To account for the additional cycles due to only the read port constraint, $Cycle^{+}_{read\_only}$, we subtract the additional cycles due to the issue width constraint from the additional cycles due to the read port constraint. If $Cycle^{+}_{read}$ is less than $Cycle^{+}_{issue}$, then the number of additional cycles due to only the read port constraint is zero, i.e.,

\[
Cycle^{+}_{read\_only} = \mathrm{MAX}(0, \; Cycle^{+}_{read} - Cycle^{+}_{issue}). \tag{4.18}
\]

Similarly, the number of additional cycles due to only the write port constraint, $Cycle^{+}_{write\_only}$, is

\[
Cycle^{+}_{write\_only} = \mathrm{MAX}(0, \; Cycle^{+}_{write} - Cycle^{+}_{issue}). \tag{4.19}
\]

Total additional cycles, $Cycle^{+}$, is the sum of the additional cycles due to each constraint.

\[
Cycle'^{+} = Cycle^{+}_{issue} + Cycle^{+}_{read\_only} + Cycle^{+}_{write\_only}. \tag{4.20}
\]

As the values of $\mathit{nonlocal\_read}^{+}$ and $\mathit{nonlocal\_write}^{+}$ depend on the additional cycles due to port and FU constraints, the procedure of calculating $Cycle^{+}$ is iterative. We substitute the value of $Cycle'^{+}$ for $Cycle^{+}$ in equations (4.14) and (4.15) and recalculate $Cycle'^{+}$, until $Cycle'^{+}$ is the same as $Cycle^{+}$.
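The iterative procedure of equations (4.14)–(4.20) can be sketched as a fixed-point iteration. The sketch below is an illustration under assumptions: the function names, the convergence tolerance, the iteration cap, and all input numbers are hypothetical, and k and m here denote the number of read and write ports as in (4.16) and (4.17). The iteration converges only when α and β are small enough that each step changes $Cycle^{+}$ by less than the previous one.

```python
def port_extra(nonlocal_hist, added, ports):
    """Equations (4.16)/(4.17): extra cycles for `ports` RF ports."""
    over = range(ports, len(nonlocal_hist))
    accesses = sum(i * nonlocal_hist[i] for i in over) + added
    return accesses / ports - sum(nonlocal_hist[i] for i in over)


def additional_cycles(cycle_issue, nl_read, l_read, nl_write, l_write,
                      k, m, alpha, beta, tol=1e-9, max_iter=1000):
    """Iterate (4.14)-(4.20) until Cycle'+ equals Cycle+ (within tol)."""
    local_reads = sum(i * c for i, c in enumerate(l_read))
    local_writes = sum(i * c for i, c in enumerate(l_write))
    cycle_plus = cycle_issue                  # first approximation
    for _ in range(max_iter):
        read_plus = alpha * cycle_plus * local_reads      # eq. (4.14)
        write_plus = beta * cycle_plus * local_writes     # eq. (4.15)
        read_only = max(0.0, port_extra(nl_read, read_plus, k)
                        - cycle_issue)                    # eq. (4.18)
        write_only = max(0.0, port_extra(nl_write, write_plus, m)
                         - cycle_issue)                   # eq. (4.19)
        new = cycle_issue + read_only + write_only        # eq. (4.20)
        if abs(new - cycle_plus) < tol:
            return new
        cycle_plus = new
    return cycle_plus


# Hypothetical histograms (index = simultaneous accesses in a cycle).
nl_read, l_read = [50, 20, 15, 10, 5, 3, 2], [0, 10, 5]
nl_write, l_write = [60, 30, 10, 5], [0, 8]

print(additional_cycles(1.0, nl_read, l_read, nl_write, l_write,
                        k=4, m=2, alpha=0.001, beta=0.001))
```

With α = β = 0 the iteration stops after the second pass, since the additional non-local accesses no longer depend on $Cycle^{+}$.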

The achieved ILP is calculated as

\[
\text{Achieved ILP} = \frac{\sum_i x_i * i}{\sum_i x_i + Cycle^{+}}. \tag{4.21}
\]

4.2.2 RF Energy Model

We use the energy model discussed in Sec. 4.1.2. In the generic processor model, the number of

RF reads and writes also changes with the read ports and write ports. From the discussions in Sec.

4.2.1 the number of reads and writes for reduced port RF architecture can be calculated as

\[
N_{read} = \mathit{nonlocal\_read}^{+} + \sum_i i * \mathit{nonlocal\_read}_i. \tag{4.22}
\]

\[
N_{write} = \mathit{nonlocal\_write}^{+} + \sum_i i * \mathit{nonlocal\_write}_i. \tag{4.23}
\]

In the case of only an FU constraint, the above equations can be used to calculate $N_{read}$ and $N_{write}$, considering the number of read and write ports as 2N and N for an N issue processor. The number of cycles, C, can be calculated as

\[
C = \sum_i x_i + Cycle^{+}. \tag{4.24}
\]

The RF energy is computed using these values and values from Cacti in equation (4.5).



4.3 Summary

In this chapter we proposed a theoretical model of performance for reduced port RF architecture.

The model used architecture parameters and application characteristics to estimate the performance.

The model for fixed issue processor characterizes the application by executing the application on

a processor of same issue width but without port constraint. The generic model characterizes the

application by executing it on a very high issue width processor. Both models are based on the

fact that there is non-uniform parallelism in the application. Further, RF energy can be modeled by

estimating the number of reads and writes in the RF.

The prediction of the model can be used in various ways. It can be used for predicting the

performance of an architecture without doing compilation and simulation. Thus our model can be

used in early phases of design space exploration. The modeling also gives us the insight into the

behavior of performance of reduced port RF architecture.


5 Model Validation and Evaluation of the Proposed Architecture

In the previous chapters we have proposed an architecture and compiler algorithms for energy and

performance optimization in VLIW processors. This chapter discusses the experiments performed

to substantiate the claims. Section 5.1 discusses the implementation framework and experimental

setup. The rest of the chapter discusses the effect of the proposed architecture and compiler tech-

niques on area, number of avoided RF reads and writes, performance, and energy in Sections 2, 3,

4, and 5 respectively.

5.1 Implementation Framework

The proposed algorithms for operand analysis, scheduling and binding are implemented in Trimaran

compiler framework [Chakrapani et al., 2005]. Trimaran compiler is an open source compiler for

instruction level parallel architectures. The front end of Trimaran is IMPACT [Chang et al., 1991]

developed at University of Illinois, Urbana-Champaign. IMPACT takes a ‘C’ program as input and

performs various high level compiler optimizations to extract parallelism. The back end of Tri-

maran is Elcor, developed at HP Labs. Elcor performs architecture specific compiler optimizations

and performs scheduling and register allocation. The output of Elcor is read by simulator ‘Simu’

which emulates applications on HPL-PD architecture. The simulator gives the number of execution

cycles and other execution statistics. The processor is described in a high level machine description

language (HMDES), which is the input to Elcor as well as Simu.


Figure 5.1: Experiment framework (application, IMPACT, Elcor, machine description, simulator, performance statistics).

We augmented the HPL-PD architecture given in HMDES with the information required for our

proposed compiler algorithms. The additional information provided is the number of read and write

ports, type of RF-FU interconnection, RF-FU interconnection matrix, and depth of SIRO buffers.

The simulator was augmented to provide additional statistics related to RF reads and writes. To

avoid the effect of limited number of registers in register files such as register spilling, we simu-

lated the scheduled code with virtual registers. Also, no memory hierarchy was considered for the

simulations.

5.1.1 Base Architecture

We used a processor with a fixed issue width to understand the effect of SIRO buffers and RF port

sharing and different issue width processors to validate the model. The issue width in commercial

VLIW processor is usually in the range of 2 to 16, with processors using 4 issue width or clusters of

4 issue being the most common (e.g., [Faraboschi et al., 2000; Seshan, 1998]). Therefore, we also

70

5.1 Implementation Framework

Issue slot 1 INT - MEM -Issue slot 2 INT FLOAT - -Issue slot 3 INT - MEM -Issue slot 4 INT - - BRANCH

Table 5.1: Function unit positions

use this range of issue widths; a 4 issue VLIW processor is used as the base processor for the fixed issue width experiments. The 4 issue processor has 4 integer units, 1 floating point unit, 2 memory units, 1 branch unit and a 64 word register file. Function units are placed in different issue slots as shown in Table 5.1. Based on the observation from Fig. 2.3, the SIRO depth is 2 for all experiments. Experiments were performed with different numbers of RF read and write ports. We refer to an RF configuration by its number of read and write ports; e.g., the 4r3w configuration represents a shared port RF with 4 read ports and 3 write ports.
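The naming convention for RF configurations and the function unit placement of the base processor can be written down compactly. The following Python sketch is only illustrative: `RFConfig`, `parse_rf_config` and `BASE_SLOTS` are hypothetical names introduced here and are not part of Trimaran or HMDES.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RFConfig:
    """Shared port RF configuration such as 4r3w (hypothetical helper type)."""
    read_ports: int
    write_ports: int


def parse_rf_config(name: str) -> RFConfig:
    """Parse a label such as '4r3w' into its read and write port counts."""
    match = re.fullmatch(r"(\d+)r(\d+)w", name)
    if match is None:
        raise ValueError(f"not an RF configuration label: {name!r}")
    return RFConfig(int(match.group(1)), int(match.group(2)))


# Function units per issue slot of the 4 issue base processor (Table 5.1).
BASE_SLOTS = {
    1: ("INT", "MEM"),
    2: ("INT", "FLOAT"),
    3: ("INT", "MEM"),
    4: ("INT", "BRANCH"),
}

print(parse_rf_config("4r3w"))
```

Counting the entries of `BASE_SLOTS` gives the 4 integer, 2 memory, 1 floating point and 1 branch units described in the text.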

5.1.2 Benchmarks

We used two sets of benchmarks for the experiments. Set I is composed of Mediabench [Lee et al., 1997] and Mibench [Guthaus et al., 2001], which consist of high end embedded applications. Set II consists of a number of kernels and applications from the embedded systems domain with high instruction level parallelism (ILP). Transformations such as loop unrolling [Davidson and Jinturkar, 1995], constant folding, and tree height reduction [Mahlke et al., 1992] are used to further enhance the ILP of the Set II benchmark applications.

The two sets of benchmarks represent different workload conditions. Set I represents standard embedded applications, while Set II represents applications that either have inherently high ILP or are compiled with aggressive optimizations to extract ILP. The two sets have different resource requirements, such as RF reads, RF writes, and FUs required in a cycle. Therefore, the two sets are suitable for evaluating the shared port RF architecture. Resource requirement can be characterized



Set I                     Set II
Benchmarks     ILP        Benchmarks     ILP
basicmath      1.12       mm_int32       18.29
dijkstra       1.03       dct_int2       13.4
bitcount       1.82       sobel          10.96
blowfish       2.38       convolution    17.48
FFT            1.31       hamm           28.96
patricia       1.13       colorspace     16.76
qsort          1.26       mm_int8        4.63
sha            3.03       dct_int        7.44
g721encode     1.36       susan          4.29
g721decode     1.37       viterbi        4.78
gsmdecode      1.39       rijndael       4.4
gsmencode      2.23
unepic         1.23
rawcaudio      1.44
rawdaudio      1.52
pegwitdec      1.46
pegwitenc      1.42

Table 5.2: Benchmark characteristics.

by the achievable-S ILP in the application. Notice that the achievable-S ILP in an application is independent of the processor and issue width, though it depends on the compiler. We compute the achievable-S ILP of an application by compiling and simulating the benchmark for a processor with a very high issue rate, so that resources are not a constraint. Presently we use a 64 issue processor for this purpose. The ILP value of each benchmark is shown in Table 5.2.


5.2 Model Validation

5.2.1 Performance Model for Fixed Issue Width Processor

Our analytical model estimates the schedule length by considering the application characteristics,

such as parallelism and resource requirement. We validate the model by code generation and simu-

lation for the corresponding architectures.

We compiled and simulated the benchmark applications for different register file configurations and normalized the execution cycles with respect to the cycles of the 8r4w configuration. The normalized cycles were then averaged over all benchmarks. Figure 5.2 shows the average normalized cycles for different RF configurations, as estimated by the performance model (Section 4.1) and as obtained by code generation and simulation for the individual RF configurations.
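The normalization and averaging step can be sketched as follows. This is a minimal illustration, not the scripts used for the thesis; the function name and the cycle counts are made up for the example.

```python
def average_normalized_cycles(cycles, baseline="8r4w"):
    """Average, over benchmarks, of cycles normalized to the baseline config.

    cycles: {benchmark: {rf_config: execution cycles}}
    Returns {rf_config: mean of cycles / baseline cycles}.
    """
    configs = next(iter(cycles.values())).keys()
    averages = {}
    for config in configs:
        ratios = [per_config[config] / per_config[baseline]
                  for per_config in cycles.values()]
        averages[config] = sum(ratios) / len(ratios)
    return averages


# Made-up cycle counts for two benchmarks, for illustration only.
example = {
    "bench_a": {"8r4w": 1000, "4r2w": 1100},
    "bench_b": {"8r4w": 2000, "4r2w": 2100},
}
print(average_normalized_cycles(example))
```

Each benchmark is normalized before averaging, so a long-running benchmark does not dominate the average.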

Figure 5.2(a) shows the performance comparison for Set I benchmark applications. The performance estimate of the model is always within 2% of the performance obtained by simulation. For Set II benchmarks (Fig. 5.2(b)) the difference is within 12%. These results show two things: first, the performance model closely models the behavior of the reduced port RF architecture; second, the proposed scheduling and binding algorithm is effective in optimizing the schedule for the proposed architecture.

5.2.2 Performance Model for Generic Processor

To validate the model proposed in Section 4.2, we simulated high ILP benchmarks for processors with issue widths varying from 2 to 16. For each issue width N, four read/write port RF configurations are considered: the first with 2N read ports and N write ports, the second with N read ports and N write ports, the third with N read ports and 3N/4 write ports, and the fourth with N/2 read ports and N/2 write ports. For each configuration, we estimate the achievable ILP using the model and obtain the ILP value by compilation and simulation.

The results are shown in Fig. 5.3. Lines with solid points represent estimated values while lines

with empty points are values from simulations. Average root mean square (RMS) error for all these

[Figure 5.2 plots. Both panels show average normalized cycles (y-axis) for the RF configurations 2r1w to 8r4w (x-axis), with one curve for the model and one for simulation.]

(a) Normalized average number of cycles for Set I.

(b) Normalized average number of cycles for Set II.

Figure 5.2: Model validation against simulation results.

[Figure 5.3 plot. Achievable ILP (y-axis) against number of issue slots (x-axis), with model and simulation curves for the 2N,N; N,N; N,3/4N; and N/2,N/2 read/write port configurations.]

Figure 5.3: Model validation for different issue width processors and different read write port configurations

configurations was found to be 14.2%. We observe that the error is usually larger for highly constrained architectures, such as those with N/2 read ports and N/2 write ports. Similarly, for the 2 issue width processor the estimate deviates considerably from the simulated values. If we exclude these extreme cases, the average RMS error is 7.6%. The average absolute error for mildly constrained architectures is 5.2%.

5.3 Architecture Evaluation

5.3.1 Area

The proposed architecture affects the area of the processor as well as that of the register file. The important factors are: (a) size of the SIRO buffers, (b) SIRO controller, (c) change in the number of registers in the RF due to SIRO buffers, and (d) RF area savings due to port reduction.

As we have shown in Chapter 2, SIRO buffers with depth equal to the bypass depth have no impact on processor area. The effects of reducing RF size and ports are well known [Wilton and Jouppi, 1996], so we do not focus on them. We study the effect of the proposed architecture on the size of the controller (Section 2.3.2).

To study the area savings due to the SIRO controller, we used a parametrized RTL model of a VLIW processor core. This processor core contains dispatch, decode, and execute units; the register file and caches are not modeled. Using this base model, we modeled a conventional VLIW architecture and a SIRO buffer based VLIW architecture. In the first case, the compiler provides no information about bypass paths, so all operand address comparisons are done in hardware. As discussed in Section 2.3.2, an N issue processor requires N^2 address comparisons (depth being 1 in this case). Each comparison generates one bit, which indicates whether the operand is to be read from that bypass path. Such bits are produced for each bypass path and for each operand, as shown in Fig. 2.13(a). In the second case, the SIRO register address is available in encoded form in the instruction, and a circuit is required to decode it (Fig. 2.13(b)).

To obtain area values, the RTL model of the processor core was synthesized using Synopsys Design Compiler with 0.18 µm UMC libraries. The area reported here is logic area only; routing area is not included.

As discussed earlier, the number of comparisons increases with the number of function units. Thus, the advantage of SIRO addresses is expected to be greater for higher issue width processors. Table 5.3 shows the area of different VLIW processor cores for the two approaches, i.e., the conventional processor and the processor based on SIRO buffers. The table shows area gains of the order of 4% of the total core area, which are due to savings in bypass control overheads.
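The percentage of area saved follows directly from the two area columns of Table 5.3. The short Python sketch below recomputes it from the tabulated areas; because those areas are rounded to three significant digits, the recomputed percentages can differ slightly from the last column of the table. The names used are illustrative only.

```python
# (issue width, conventional core area, SIRO based core area) in um^2,
# copied from Table 5.3.
TABLE_5_3 = [
    (3, 3.28e5, 3.21e5),
    (4, 3.62e5, 3.49e5),
    (5, 6.42e5, 6.10e5),
    (6, 7.29e5, 7.08e5),
    (7, 8.37e5, 8.08e5),
    (8, 9.36e5, 8.99e5),
]


def percent_area_saved(conventional, with_siro):
    """Relative reduction in logic area, in percent."""
    return 100.0 * (conventional - with_siro) / conventional


for width, conventional, with_siro in TABLE_5_3:
    saved = percent_area_saved(conventional, with_siro)
    print(f"issue width {width}: {saved:.2f}% saved")
```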

5.3.2 Number of SIRO Reads

To increase the number of reads from the SIRO buffers, we proposed two variations of the prior-

ity function in the list scheduling algorithm. Effects of the number of SIRO reads will be more

evident on high ILP applications; therefore, we used Set II for the validation. We used VLIW pro-

cessors with issue width varying from 1 to 16. All the applications were compiled for these VLIW



Issue    Conventional        Processor with         % Area
width    processor (µm^2)    SIRO buffers (µm^2)    saved
3        3.28e5              3.21e5                 2.03
4        3.62e5              3.49e5                 3.43
5        6.42e5              6.10e5                 4.9
6        7.29e5              7.08e5                 2.89
7        8.37e5              8.08e5                 3.46
8        9.36e5              8.99e5                 4.0

Table 5.3: Comparison of processor core area values with and without SIRO buffer information

architectures and simulated using Trimaran. Figure 5.4 shows the average ratio of reads from the

SIRO buffer paths to the total operand reads. The ratio of SIRO buffer reads is shown for all three

variations of the list scheduling algorithm.

[Figure 5.4 plot. Average normalized reads from bypass (y-axis) against issue slots (x-axis), with curves for base list scheduling, two step priority, and modified priority.]

Figure 5.4: SIRO buffer reads for different issue processors.

It is clear from the results that the number of SIRO buffer reads increases with the issue width. List scheduling with the proposed modified priority function leads to the maximum number of SIRO buffer reads. For a single issue processor, the increase is more than three times, while for a two issue processor it is approximately two times. Although for higher issue width processors the difference in the number of SIRO buffer reads between the algorithms is marginal, list scheduling with the modified priority function is still better in most cases. However, for issue widths 15 and 16, base list scheduling has 1% more reads from the SIRO buffers.

5.3.3 Performance

5.3.4 Direct Interconnect Evaluation

We evaluated different interconnection matrices to understand the gains due to the guidelines discussed in Section 2.4.1. We used the 4r4w configuration for this experiment, which has the maximum number of write ports and a reduced number of read ports. We used a sample of 24 different interconnection matrices (out of 652) and grouped them by their RF port imbalance factor. For the 4r4w RF configuration, in a completely balanced interconnect each RF read port is connected to two FU ports, one left FU port and one right FU port. If, in an interconnect configuration, an RF port is connected to more FU ports than in the balanced interconnection, the difference is called the port imbalance factor. The RF port imbalance factor is calculated as the sum of the FU port imbalance, the left port imbalance, and the right port imbalance over all RF ports. All configurations with an RF port imbalance of 0 form one group, configurations with an RF port imbalance of 1 form a separate group, and so on. The average percentage increase in the number of cycles for each group, over all benchmarks of Set I and with respect to the 8r4w configuration, is shown in Fig. 5.5. It can be clearly seen that the performance penalty grows with the imbalance. In other words, if an interconnect matrix follows all three guidelines, the performance penalty is the least and may come close to that of the complete interconnection.
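One possible reading of the RF port imbalance factor is sketched below in Python. It is an interpretation of the description above, not the thesis implementation: for every RF read port it counts only the excess over the balanced number of FU ports, left FU ports and right FU ports, and sums these excesses. The function name and the skewed matrix are hypothetical; the balanced matrix uses the read port entries listed for 4r4w in Table 5.4.

```python
from collections import Counter


def rf_port_imbalance(slots, num_read_ports):
    """Sum of FU, left and right port imbalance over all RF read ports.

    slots: one (left_read_port, right_read_port) pair per issue slot.
    Only connections in excess of the balanced count are penalized
    (an assumption made for this sketch).
    """
    left = Counter(l for l, _ in slots)
    right = Counter(r for _, r in slots)
    balanced_side = len(slots) / num_read_ports   # left (or right) FU ports per RF port
    balanced_total = 2 * balanced_side            # all FU ports per RF port
    imbalance = 0
    for port in range(num_read_ports):
        total = left[port] + right[port]
        imbalance += max(0, total - balanced_total)
        imbalance += max(0, left[port] - balanced_side)
        imbalance += max(0, right[port] - balanced_side)
    return imbalance


balanced_4r4w = [(0, 1), (1, 2), (2, 3), (3, 0)]  # read ports of 4r4w, Table 5.4
skewed_4r4w = [(0, 1), (0, 1), (2, 3), (2, 3)]    # hypothetical unbalanced matrix

print(rf_port_imbalance(balanced_4r4w, 4))
print(rf_port_imbalance(skewed_4r4w, 4))
```

In the skewed matrix every read port still serves two FU ports, but always on the same operand side, so the imbalance comes entirely from the left and right terms.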

With this insight, the interconnection matrices for the other RF configurations are selected for minimum port imbalance and are shown in Table 5.4. The three entries in each row of a configuration give the two read ports and the one write port of the RF to which a particular issue slot is connected. Some of these configurations are shown in Fig. 5.6.

[Figure 5.5 plot. Average % increase in execution cycles (y-axis) against RF port imbalance (x-axis).]

Figure 5.5: Direct interconnection RF architecture exploration.

          8r4w     8r3w     8r2w     8r1w     6r4w     6r3w     6r2w     6r1w
Slot 1    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0
Slot 2    2 3 1    2 3 1    2 3 1    2 3 0    2 3 1    2 3 1    2 3 1    2 3 0
Slot 3    4 5 2    4 5 2    4 5 0    4 5 0    4 5 2    4 5 2    4 5 0    4 5 0
Slot 4    6 7 3    6 7 2    6 7 1    6 7 0    5 4 3    5 4 2    5 4 1    5 4 0

          4r4w     4r3w     4r2w     4r1w     2r4w     2r3w     2r2w     2r1w
Slot 1    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0    0 1 0
Slot 2    1 2 1    1 2 1    1 2 1    1 2 0    1 0 1    1 0 1    1 0 1    1 0 0
Slot 3    2 3 2    2 3 2    2 3 0    2 3 0    0 1 2    0 1 2    0 1 0    0 1 0
Slot 4    3 0 3    3 0 2    3 0 1    3 0 0    1 0 3    1 0 2    1 0 1    1 0 0

Table 5.4: Interconnect matrices for different direct RF configurations

Evaluation of Scheduling and Binding Efficiency

To show the effectiveness of our scheduler in performing SIRO architecture specific optimizations,

we compare the proposed scheduling and binding algorithm with the results of applying only RPA

[Figure 5.6 diagrams. Each panel shows the register file and its direct connections to FU1 through FU4.]

(a) 8r4w configuration.

(b) 6r3w configuration.

(c) 4r2w configuration.

(d) 2r1w configuration.

Figure 5.6: Different direct RF configurations

scheduling algorithm without updating the values r_i^1 and r_i^2 due to Eq. (3.5) (referred to as the naive algorithm in Fig. 5.8). Thus, the naive algorithm does everything needed to ensure the correctness of scheduling and binding, but does not use the information that values are available in SIRO buffers. Comparing the naive algorithm with the proposed algorithm for complete interconnection, we observe that as the number of RF read or write ports decreases, the performance of the naive algorithm deteriorates rapidly, while our algorithm is able to cope.

[Figure 5.7 plots. Both panels show average normalized cycles (y-axis) for three groups of RF configurations (x-axis): 2r1w, 4r2w, 6r3w, 8r4w; 2r4w, 4r4w, 6r4w, 8r4w; and 8r1w, 8r2w, 8r3w, 8r4w. Curves are given for complete interconnect and direct interconnect.]

(a) Normalized average number of cycles for Set I benchmarks.

(b) Normalized average number of cycles for Set II benchmarks.

Figure 5.7: Performance evaluation.

[Figure 5.8 plot. Average normalized cycles (y-axis) for the RF configurations 8r4w to 2r1w (x-axis), with curves for the RPA algorithm (complete interconnect) and the naive algorithm.]

Figure 5.8: Effectiveness of RPA scheduling algorithm with respect to a naive algorithm (Set II benchmarks)

The performance predicted by the model is compared with the performance of the complete interconnect architecture, since the performance model does not model the RF-FU interconnections. However, the model may also be used to estimate the performance of direct interconnects, as the performance difference between the complete interconnection and direct interconnection architectures is marginal. To demonstrate this, Fig. 5.7 shows the average normalized cycles for different RF configurations for direct interconnect and complete interconnect.

Figures 5.7(a) and 5.7(b) show the average normalized number of cycles for some selected configurations with direct interconnect and complete interconnect architectures. Both figures have three pairs of curves. The first pair is for a decreasing number of write ports, the second for a decreasing number of read ports, and the third for both read and write ports decreasing. It is clear that complete interconnect always performs better, because of the absence of path conflicts. For Set II benchmarks, the average number of cycles over all benchmarks in the architecture with direct interconnect is within 8% of the number of cycles in the architecture with complete interconnect, for any RF configuration. For Set I benchmarks this figure is 2%.

The first pair of graphs in Fig. 5.7(a) and 5.7(b) shows the variation of average normalized cycles with decreasing write ports. The number of cycles increases by 60% for a single write port, but by only 9% and 20% for the three and two write port cases, respectively. This shows that performance deteriorates rapidly as write ports decrease. The effect of reducing read ports, on the other hand, is smaller than the effect of reducing write ports. The second pair of graphs in the same figure shows the performance variation with the number of read ports, keeping the write ports fixed. There is almost no difference between the performance of the 6 read port and 8 read port configurations. For the 4 read port configuration, the performance loss is marginal at 5%. The increase in the number of cycles is significant only when the read ports are reduced to 2. The third pair of graphs shows the impact on the number of cycles when both read and write ports decrease in the same ratio. For these configurations the increase in the number of cycles is due to both read port reduction and write port reduction. Therefore, the overall increase in the number of cycles for the 2r1w configuration is much higher than the increase due to the individual read and write effects.

There is a marked difference between the performance observed for Set I benchmarks (Fig.

5.7(a)) and Set II benchmarks (Fig. 5.7(b)). The performance penalty due to port and path con-

flicts for Set II is much larger than that for Set I benchmarks because of higher demand of read and

write ports in the applications.

5.3.5 Energy

We have used Cacti 4.0 [Tarjan et al., 2006] to estimate read and write access energy for the direct

interconnect RF. For complete interconnect RF, we modeled interconnects in Cacti. We observe

that a significant fraction of the energy is consumed in multiplexing for complete interconnection;

therefore, energy per access in complete interconnect is always more than the corresponding direct

interconnect RF configuration.



Using equation (4.5), the RF energy of the different configurations is calculated, normalized with respect to the energy of the standard RF, and shown in Fig. 5.9. We assumed 0.13 µm technology, a temperature of 85 °C [Skadron et al., 2004], 1 GHz frequency, and a 64 bit, 64 word RF for calculating the energy. The standard RF is similar to the 8r4w configuration but does not inhibit RF reads and writes when values are available in SIRO buffers. Therefore, the 8r4w configuration in our case saves 40% of the standard RF energy, for Set I as well as Set II benchmarks, by avoiding redundant reads and writes in the RF. In the other configurations, the total energy saving is due both to the avoidance of redundant reads and writes and to the reduced number of RF ports. The total energy may increase because additional execution cycles increase the leakage energy. The direct interconnect topology is always more energy efficient than the full interconnect topology, due to its lower energy per access. For Set I as well as Set II benchmarks, the 2r1w configuration with direct interconnection is the most energy efficient configuration, saving 75% and 66% of the energy, respectively.
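The interplay of access energy and leakage can be illustrated with a simplified model. The Python sketch below is not equation (4.5), which is not reproduced in this chapter; it merely assumes that RF energy is the sum of a dynamic part (accesses times energy per access) and a leakage part (leakage power times execution time at 1 GHz). All per-access energies, the leakage power and the access counts are made-up values.

```python
def rf_energy(rf_reads, rf_writes, cycles, e_read, e_write, p_leak, t_clk=1e-9):
    """Simplified RF energy in joules (assumed model, not equation (4.5)).

    rf_reads, rf_writes: accesses that actually reach the register file
    e_read, e_write:     energy per access in joules
    p_leak:              leakage power in watts
    t_clk:               clock period in seconds (1 ns for 1 GHz)
    """
    dynamic = rf_reads * e_read + rf_writes * e_write
    leakage = p_leak * cycles * t_clk
    return dynamic + leakage


# Standard RF: every operand access goes to the RF (made-up numbers).
standard = rf_energy(1000, 500, 800, e_read=10e-12, e_write=12e-12, p_leak=1e-3)
# Same schedule, but accesses served by SIRO buffers do not reach the RF.
inhibited = rf_energy(400, 150, 800, e_read=10e-12, e_write=12e-12, p_leak=1e-3)

print(inhibited / standard)
```

In this model, a configuration with fewer ports lowers the energy per access but may lengthen the schedule, which raises the leakage term.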

Figure 5.10 shows the normalized RF energy of the 8r4w configuration and of the 4r4w configuration with complete and with direct interconnection, for different benchmarks. The number of reads from SIRO buffers is similar for direct interconnect and complete interconnect. The energy saving in the 8r4w configuration is due to fewer RF reads and writes. Therefore ‘rijndael’, which has the fewest reads and writes from the RF, saves the most energy. For some benchmarks, such as ‘gsmdecode’, the energy saving of the 4r4w configuration with respect to 8r4w is high, while for ‘gsmencode’ it is smaller. The reduction in RF dynamic access energy is similar in both; the difference is due to leakage energy, which depends on the number of cycles.

There can be several metrics for choosing an optimal configuration. For example, if energy is the only criterion, the 2r1w configuration with direct interconnect is the best. If the energy-delay product is considered instead, the 4r4w configuration with direct interconnect is the best for Set II benchmarks, and the 2r1w configuration with direct interconnect is the best for Set I benchmarks. If a maximum acceptable performance loss is the criterion, then some other configuration may be preferred. In all cases, the shared port RF architecture is beneficial in terms of energy.
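How the chosen metric changes the preferred configuration can be sketched in a few lines of Python. The normalized energy and cycle values below are invented for the illustration and are not results from the experiments; delay is taken to be the normalized number of cycles, which is an assumption of this sketch.

```python
def best_config(results, metric):
    """Pick the configuration that minimizes the given metric.

    results: {config: (normalized energy, normalized cycles)}
    metric:  "energy" or "edp" (energy-delay product)
    """
    if metric == "energy":
        def score(config):
            return results[config][0]
    elif metric == "edp":
        def score(config):
            energy, cycles = results[config]
            return energy * cycles
    else:
        raise ValueError(f"unknown metric: {metric}")
    return min(results, key=score)


# Invented values, for illustration only.
results = {
    "8r4w": (0.60, 1.00),
    "4r4w": (0.45, 1.10),
    "2r1w": (0.34, 2.00),
}

print(best_config(results, "energy"))
print(best_config(results, "edp"))
```

With these values the energy criterion selects 2r1w, while the energy-delay product selects 4r4w because the extra cycles of 2r1w outweigh its lower energy.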

[Figure 5.9 plots. Both panels show average normalized RF energy (y-axis) for the RF configurations 2r1w to 8r4w (x-axis).]

(a) Normalized average RF energy for benchmarks in Set I.

(b) Normalized average RF energy for benchmarks in Set II.

Figure 5.9: Normalized average RF energy for the direct and the complete interconnect topologies.

[Figure 5.10 plot. Normalized RF energy (y-axis) for the individual Set I and Set II benchmarks (x-axis), with series for 8r4w, 4r4w with complete interconnect, and 4r4w with direct interconnect.]

Figure 5.10: Normalized RF energy for different benchmarks.

5.4 Summary

In this chapter we discussed our experimental setup and implementation framework. We performed experiments with a fixed issue processor of issue width 4. Our experiments suggest that the proposed architecture can save up to 4% of the processor core area due to simplified bypass control with SIRO buffers. In addition, the register file size decreases significantly with port reduction. The experiments also reveal that, using SIRO buffers, we can avoid around 60% of the RF reads and 70% of the RF writes on average.

Our study shows that complete interconnection is less energy efficient than direct interconnection. Although the direct interconnection imposes more compiler constraints, its performance losses are within 2% of the complete interconnection topology. Compiler support is important for the architecture. Greater port reduction incurs higher performance penalties, though such configurations offer higher energy savings. The shared port architecture leads to more than 60% savings in the RF energy. The number of ports in the RF can hence be selected on the basis of an energy budget or a performance budget.


6 Varying Issue Width and Scalability

In the previous chapter we established with the help of experiments that the proposed architecture

and compiler algorithms perform well and there are significant savings in terms of RF energy. The

experiments in the previous chapter were performed on 4-issue width processors. In this chapter,

we study the performance of the proposed architecture with different issue widths.

6.1 RF and Processor Scalability

Multiple function units (FUs) are used in processors to exploit instruction level parallelism. As the number of FUs increases, the architecture of the processor should ideally scale, that is, area and power should increase linearly and cycle time should remain constant. Studies suggest that in superscalar as well as VLIW architectures, increasing the number of FUs results in poor scaling of the processor [Palacharla et al., 1996; Terechko et al., 2005].

In VLIW processors, the least scalable component is the multi-port register file [Capitanio et al., 1992]. In a classical VLIW design, if a function unit requires 2 read ports and 1 write port, the register file of an N issue processor requires 2N read and N write ports. The area, power and access time of an RF with a large number of ports scale poorly [Rixner et al., 2000].

In all the previous studies of VLIW processors, clustering is accepted as the solution to the register file scalability problem. Commercial VLIW processors with high issue width are clustered as well. For example, Trimedia [van Eijndhoven et al., 1999] is a five issue, two cluster processor, TI's TMS320C6x [Seshan, 1998] is an 8 issue, two cluster processor, and ST's Lx architecture [Faraboschi et al., 2000] has a configurable number of clusters, each with an issue width of 4. Clustering involves multiple register files, each connected to only a subset of the FUs. Accessing a data element in another cluster involves inter-cluster move operations, which increase the number of cycles and affect performance.

Our proposed architecture is a better alternative for RF scalability than clustered architectures

as it consumes lower energy and gives higher performance. As discussed in previous chapters, the

scalability of our proposed architecture is possible because of (a) distributed local buffers, (b) small

depth of local buffers, and (c) reduced port second level RF. Due to distributed nature and small

depth, local buffers do not cause any additional energy and delay over conventional VLIW architec-

ture, while reduced port RF helps in bringing down energy, area and delay cost. In addition to the

architecture, compiler’s scheduling and binding algorithms also help in minimizing performance

loss.

In this chapter we show that processors with our proposed RF architecture are scalable in perfor-

mance and energy with respect to issue width. Finally, we compare the scalability of reduced port

architecture and clustered VLIW architecture.

6.1.1 Related Work in Processor Scalability

Apart from the register file, other unscalable components of a VLIW processor are the FU-FU/FU-RF interconnect and the issue logic [Palacharla et al., 1996; Gangwar et al., 2007; Zhong et al., 2005]. Palacharla et al. [1996] suggest that for high issue widths the FU-FU interconnection network can be in the critical path. Gangwar et al. [2007] showed that FU-RF interconnection and inter-cluster communication do not scale well with bus based interconnects. Their study also suggests that the issue logic will be in the critical path for very high issue processors. The scalability problem of the issue logic is also identified by Zhong et al. [2005], who suggested a distributed issue logic for a clustered VLIW processor.

Scalability issues have also been studied in other architectural paradigms, such as transport triggered architectures [Corporaal, 1999], stream processors [Khailany et al., 2003] and multiprocessors [Taylor et al., 2001].


Benchmark      Description                                 ILP
Matrix32       Matrix multiplication - unroll factor 32    18.29
Convolution    Convolution error control codes             17.48
Hamming        Hamming error control codes                 28.96
DCT2           DCT kernel with unroll factor of 2          13.4
Sobel          A 3x3 edge filter for images                10.96
Colorspace     RGB to YUV conversion for images            16.76

Table 6.1: High ILP Benchmark details

6.2 Experimental Setup

We experimented with processors having different numbers of issue slots, ranging from 2 to 16. A register file of depth 64 is assumed for all experiments. The remaining experimental settings are the same as in Chapter 5. Three reduced port RF configurations are used for each processor: N read, N write; N read, 3N/4 write; and N/2 read, N/2 write port RF. Hereafter, these configurations are called config1, config2, and config3, respectively. They are compared with the full port register file, that is, a 2N read, N write port RF, referred to as config0. Buffer depth is fixed at 2 for all experiments.
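The four port configurations can be generated from the issue width N as in the following Python sketch. The function name is hypothetical, and rounding fractional port counts up is an assumption of this sketch; the text does not state how 3N/4 or N/2 is rounded when it is not an integer.

```python
import math


def rf_port_configs(n):
    """Return {name: (read ports, write ports)} for an n issue processor.

    Fractional port counts are rounded up (an assumption of this sketch).
    """
    return {
        "config0": (2 * n, n),                              # full port RF
        "config1": (n, n),
        "config2": (n, math.ceil(3 * n / 4)),
        "config3": (math.ceil(n / 2), math.ceil(n / 2)),
    }


for width in (4, 8, 16):
    print(width, rf_port_configs(width))
```

For N = 4 this gives the 8r4w, 4r4w, 4r3w and 2r2w configurations in the notation of Chapter 5.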

For this study we chose the high ILP benchmarks (Set II) shown in Table 5.2. These benchmarks are further divided into high ILP and medium ILP benchmarks. Benchmarks with an achievable-S ILP of more than 8 are classified as high ILP benchmarks, and those with an achievable-S ILP of more than 4 but less than 8 are classified as medium ILP benchmarks. This classification is based on the issue widths of the processors we experiment with. High ILP benchmarks have an achievable-S ILP comparable to or higher than the machine parallelism offered by processors of issue width 2 to 16. Details of the benchmarks and their ILP are given in Tables 6.1 and 6.2.
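The classification rule can be applied mechanically to the Set II ILP values of Table 5.2, as in the Python sketch below. The function name is illustrative; values exactly equal to 8, or not above 4, are left unclassified because the text does not cover them.

```python
def classify_ilp(ilp):
    """Classify a benchmark by its achievable ILP."""
    if ilp > 8:
        return "high"
    if 4 < ilp < 8:
        return "medium"
    return None  # boundary cases are not covered by the text


# Set II ILP values copied from Table 5.2.
SET_II = {
    "mm_int32": 18.29, "dct_int2": 13.4, "sobel": 10.96,
    "convolution": 17.48, "hamm": 28.96, "colorspace": 16.76,
    "mm_int8": 4.63, "dct_int": 7.44, "susan": 4.29,
    "viterbi": 4.78, "rijndael": 4.4,
}

high = sorted(name for name, ilp in SET_II.items() if classify_ilp(ilp) == "high")
medium = sorted(name for name, ilp in SET_II.items() if classify_ilp(ilp) == "medium")
print("high:", high)
print("medium:", medium)
```

This yields six high ILP and five medium ILP benchmarks, matching the row counts of Tables 6.1 and 6.2.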



Benchmark    Description                               ILP
Matrix8      Matrix multiplication (unroll factor 8)   4.63
Viterbi      Convolution error control decoder         4.78
Susan        An image processing filter                4.29
DCT          DCT kernel                                7.44
Rijndael     Encryption algorithm                      4.4

Table 6.2: Medium ILP Benchmark details

6.3 Performance

Performance of a processor can be defined in various ways. For example, number of execution

cycles can be one metric while total time spent by the application (number of execution cycles

multiplied by clock period) can be another. Performance in terms of execution cycles does not take

into account clock period; in other words, it assumes clock period is constant.

The processor with a full ported RF does not have any port conflicts and therefore always has a lower number of execution cycles than the reduced port RF configurations. We compare the two to observe the performance loss in terms of the number of cycles. Performance is defined as the inverse of the number of cycles. We normalized the performance by the performance of a single issue processor. The values are averaged over the set of benchmarks.
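The metric just defined can be sketched in Python as follows. The cycle counts are made up, and the helper names are hypothetical; the sketch only mirrors the three steps in the text (inverse of cycles, normalization to the single issue processor, averaging over benchmarks).

```python
def normalized_performance(cycles_by_width):
    """Performance (1 / cycles) normalized to the single issue processor.

    cycles_by_width: {issue width: execution cycles}, must contain width 1.
    """
    single_issue = 1.0 / cycles_by_width[1]
    return {width: (1.0 / cycles) / single_issue
            for width, cycles in cycles_by_width.items()}


def average_performance(benchmarks):
    """Average the normalized performance over benchmarks, per issue width."""
    per_benchmark = [normalized_performance(c) for c in benchmarks.values()]
    widths = per_benchmark[0].keys()
    return {w: sum(p[w] for p in per_benchmark) / len(per_benchmark)
            for w in widths}


# Made-up cycle counts, for illustration only.
benchmarks = {
    "bench_a": {1: 1200, 2: 700, 4: 400},
    "bench_b": {1: 2000, 2: 1000, 4: 800},
}
print(average_performance(benchmarks))
```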

Figure 6.1(a) shows the performance in terms of number of cycles for high ILP benchmarks. We observe that config0, config1 and config2 have very similar performance. Config1 has a performance loss in the range of 1-8%, with the maximum at issue width 6 and the minimum at issue width 16. Config2 has a slightly higher performance loss with respect to config0; for example, it is 1.5% for the 16 issue and 14% for the 6 issue processor. Config3 suffers a higher performance loss as the number of issue slots increases, in the range of 32-53% with respect to config0.


The lower performance loss for high issue processors is attributed to the fact that the percentage of SIRO reads increases with the issue width. The percentage of SIRO reads at a particular issue slot is a function of the ILP present in the applications. Due to the higher ILP in the benchmark applications, the percentage of operands read from SIRO buffers is higher at higher issue widths.

For medium ILP benchmarks, config1 and config2 have their maximum performance loss with respect to config0, 9%, for the 2 issue processor. In the remaining cases config1 and config2 exhibit less than 9% performance loss. Config1 always performs slightly better, but the difference is not significant. For medium ILP benchmarks, config3 suffers a loss of up to 30% for 2 issue and 4 issue processors.

From these experiments we can conclude that our proposed RF architecture may increase the number of execution cycles by 10% if port reduction is not drastic, which is quite acceptable.

We now focus on performance in terms of total execution time. The total time taken by an application is the product of the number of cycles and the cycle time. Cycle time is a complex function of the various pipeline stages of the processor. As the number of issue slots in the processor increases, the register file access time increases. Though cycle time is not determined by a single pipeline stage, we assume that cycle time depends on the RF access time. This assumption gives an approximate picture of the effect of increased RF access time on overall performance.
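Under this assumption, a configuration with more cycles can still finish earlier if its RF is faster. The Python sketch below illustrates the trade-off with invented cycle counts and access times; it is not based on measured values.

```python
def overall_performance(cycles, rf_access_time):
    """Inverse of execution time, taking RF access time as the cycle time."""
    return 1.0 / (cycles * rf_access_time)


# Invented numbers for one wide issue processor, for illustration only.
full_port = overall_performance(cycles=100, rf_access_time=2.0)     # many ports, slow RF
reduced_port = overall_performance(cycles=105, rf_access_time=1.6)  # fewer ports, faster RF

print(reduced_port / full_port)
```

Here the reduced port RF needs 5% more cycles, yet its shorter access time gives it the higher overall performance.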

Overall performance, defined as the inverse of the product of the number of cycles and the RF access time, is shown in Figure 6.2(a) for high ILP benchmarks. RF access time increases almost linearly from the single issue processor to the 16 issue processor; in config0 it is almost double for the 16 issue processor compared with the single issue processor, while for config1 and config2 the increase is smaller. Due to the combined effect of number of cycles and RF access time, the overall performance of config0 saturates after issue width 12 and starts decreasing for issue widths greater than 14. Config1 and config2 perform better than config0 for issue widths greater than 8. Overall, config2 performs the best. Config3, due to its larger increase in cycles, generally does not perform better even when cycle time is taken into account; it performs better than config1 and config2 only for the 16 issue slot processor.

For medium ILP applications, the processor with no port reduction performs the best for 2 issue
width. Config1 performs the best for processors with issue width 4 and 6. For config2 and config3,
performance is similar for the 8 issue width processor. Config3 is the best for issue widths higher
than or equal to 8. This happens because performance in terms of number of cycles (Fig. 6.1) does
not improve beyond the 8 issue processor, and the access time of the RF dominates if the issue
width of the processor is more than 8.

Figure 6.1: Performance for different issue width processors. (a) Performance of high ILP
benchmarks. (b) Performance of medium ILP benchmarks. [Plots: normalized performance (no. of
cycles) versus issue slots; curves: RF:2N,N (Config0), RF:N,N (Config1), RF:N,3/4N (Config2),
RF:N/2,N/2 (Config3).]

Figure 6.2: Normalized cycle-delay product for different issue width processors. (a) Normalized
cycle-delay product of high ILP benchmarks. (b) Normalized cycle-delay product of medium ILP
benchmarks. [Plots: normalized performance (cycles*RF access time) versus issue slots.]

6.4 Energy

RF energy is computed as the read and write energy summed over all RF reads and writes. Leakage
energy also contributes to the total RF energy; it depends on the total execution time of the
application and, therefore, on the number of cycles. Three factors affect the total RF energy:
first, the increase in read and write energy per access; second, the number of SIRO reads and
avoided writes; third, the number of cycles, which affects the leakage energy. Note that the total
number of reads and writes (sum of RF reads and SIRO reads) is constant for a given application.
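The three factors can be combined into a simple accounting sketch. The Python below is a minimal illustration, not the energy model used in the experiments: the parameter names are hypothetical, the energy of the SIRO buffers themselves is ignored, and leakage is assumed to be a constant power multiplied by execution time.

```python
from dataclasses import dataclass


@dataclass
class RFEnergyParams:
    """Hypothetical per-access and leakage parameters for one RF configuration."""
    read_energy: float    # energy per RF read access
    write_energy: float   # energy per RF write access
    leakage_power: float  # leakage energy per unit time
    cycle_time: float     # time per cycle


def total_rf_energy(params, total_reads, total_writes,
                    siro_reads, avoided_writes, cycles):
    """Dynamic energy of the accesses that reach the RF plus leakage energy.

    Reads served by SIRO buffers and writes that are avoided do not
    access the RF, so they are subtracted from the application totals.
    SIRO buffer energy itself is ignored in this sketch.
    """
    rf_reads = total_reads - siro_reads
    rf_writes = total_writes - avoided_writes
    if rf_reads < 0 or rf_writes < 0:
        raise ValueError("SIRO reads / avoided writes exceed the totals")
    dynamic = rf_reads * params.read_energy + rf_writes * params.write_energy
    leakage = params.leakage_power * cycles * params.cycle_time
    return dynamic + leakage


# Hypothetical values, for illustration only.
params = RFEnergyParams(read_energy=2.0, write_energy=3.0,
                        leakage_power=0.5, cycle_time=1.0)
print(total_rf_energy(params, total_reads=100, total_writes=50,
                      siro_reads=40, avoided_writes=10, cycles=20))
```

Because the application totals are fixed, every additional SIRO read or avoided write removes one RF access from the dynamic term, while the per-access energies and the cycle count vary with the port configuration.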

The first factor increases the RF energy, while the second and third factors decrease it, as the
number of issue slots increases. Figure 6.3 shows the normalized RF energy of different issue
width processors for the different configurations. The RF energy corresponding to config0 is less
than that of a conventional RF due to the avoided reads and writes. Config1, config2, and config3
have reduced RF energy per access. Further, the number of RF reads and writes avoided due to SIRO
buffers is higher than for config0: a higher number of SIRO reads is used in the reduced port
configurations to minimize the performance penalty of the reduced ports. Due to the increased SIRO
reads and writes and the reduced RF access energy, the total energy at higher issue widths is much
lower than the config0 energy. The leakage factor is not prominent enough to change the behavior
of the total energy, because of the similar performance of the three types of configurations.

For high ILP benchmarks and medium ILP benchmarks we do not see much difference in behav-

ior of the three reduced port configurations.

The above result does not contradict the result of [Rixner et al., 2000], which says that the
power consumption of a register file increases as N^3 with the number of issue slots N. They
suggest three factors contributing to it: first, the increase in RF energy per access with
increased ports; second, the increase in the number of registers with increased parallelism; and
third, more parallel operations. We assume the same number of registers in the register file for
all issue widths. Thus power will increase with order of only N in our case.

6.5 Clustered VLIW and Scaling

As discussed before, clustering is the default scaling approach in high issue width VLIW
architectures. In this section we show that the reduced port architecture is more effective and
efficient in most cases, both in terms of performance and energy.

For this study we use 4 issue, 8 issue, 12 issue and 16 issue processors with 2 clusters and 4

clusters and compare them with our proposed reduced port configuration, that is, N read ports and

3/4N write ports. FUs in each cluster are assumed to be uniform. The interconnection mechanism

in clustered architecture is bus based. Total number of function units available in any clustered

processor is same as the FUs available in corresponding reduced port configuration or monolithic

RF configuration. The application set is the same as that used for other experiments in this chapter.

As the compiler frameworks for clustered VLIW and the proposed reduced port architecture are
different, we normalized the performance with respect to the number of cycles in the architecture
with a monolithic RF. From Fig. 6.4, it is clear that for high ILP applications, except for the 4
issue and 8 issue processors with 2 clusters, the reduced port architecture performs better than
clustering. In the case of the 2 cluster 8 issue processor, the performance difference between the
reduced port architecture and the clustered architecture is marginal. With 4 clusters, only the 4
issue processor performs better than the reduced port architecture, and that too marginally.

Clustering performed better in the case of the 4 issue width processor due to its approach of
utilizing slack for inter-cluster move operations. In high ILP applications, a large slack is
available for scheduling because fewer resources are available. This slack is utilized for
inter-cluster move operations. Therefore, in spite of 21% of the operations being inter-cluster
move operations, the number of execution cycles increased by only 2%. The reduced port
architecture, which banks on the availability of operands from the SIRO buffers, gets fewer
operands from SIRO buffers when the ILP of the application is large and the available FUs are much
fewer.

Due to the above reason, for medium ILP applications, reduced port architectures always perform
better than clustered RF architectures.

Figure 6.3: Total RF energy for different issue width processors. (a) Total RF energy of high ILP
benchmarks. (b) Total RF energy for medium ILP benchmarks. [Plots: normalized RF energy versus
issue slots.]

Figure 6.4: Normalized performance for clustered VLIW processors. [Two bar charts: normalized
performance for the 4-issue, 8-issue, 12-issue, and 16-issue processor configurations; bars:
Monolithic RF, Proposed RF, 2 cluster, 4 cluster.]

Apart from the performance loss in clustered VLIW processors, there is an increase in the number
of operations due to inter-cluster moves. We observe that the number of inter-cluster move
operations varies from 15 to 35% of total operations. In other words, there is an 18 to 50%
increase in the number of operations. The increase in the number of operations has a direct impact
on the total energy of a VLIW processor. Therefore, with respect to energy as well, the clustered
architectures are not favourable.
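The step from "share of total operations" to "increase in the number of operations" can be made explicit with a short calculation. The sketch below assumes, as a reading of the text rather than a formula stated in it, that the share is measured over the clustered code including the moves; the function name is hypothetical.

```python
def operation_increase(move_fraction):
    """Relative growth in operation count caused by inter-cluster moves.

    move_fraction is the share of moves among all operations of the
    clustered code (moves included). If the original code has n
    operations and m moves are added, move_fraction = m / (n + m), so
    the increase m / n equals move_fraction / (1 - move_fraction).
    """
    if not 0.0 <= move_fraction < 1.0:
        raise ValueError("move_fraction must be in [0, 1)")
    return move_fraction / (1.0 - move_fraction)


for share in (0.15, 0.21, 0.35):
    print(f"moves = {share:.0%} of operations -> "
          f"{operation_increase(share):.1%} more operations")
```

Under this reading, a 15% share corresponds to roughly 18% more operations, and the growth rises faster than the share itself, giving figures of the same order as the range quoted above.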

6.6 Summary

From different experiments we conclude that on the whole, our proposed architecture performs

better than conventional RF architecture for higher issue widths. Increase in the number of cycles

is also within the acceptable limits of 10%. The proposed architecture is also more scalable than

the conventional architecture in terms of energy. In comparison to clustered RF architecture, the

proposed architecture is better in terms of performance as well as energy at higher issue widths.


7 Conclusions and Future Work

7.1 Contributions and Major Results

In this thesis we worked towards energy reduction and scalability of multiported register files in

VLIW processors. We proposed novel architectures for the register files and also proposed compiler

algorithms for these architectures. Contributions of this thesis are summarized below:

• We proposed local buffer based RF architectures that explore the possibilities of RF read and

write reduction. We studied and critically analyzed the architectural implications of RISO,

SIRO, and RIRO based local buffers. We showed that with SIRO architecture, it is easy to

increase the bypass depth without much increase in the delay or effect on clock period. Apart

from saving energy due to fewer accesses to the second level RF, the approach saves up to 4% of

the processor core area due to simpler bypass control logic.

• We proposed a reduced port RF architecture for VLIW processors. We studied different RF-

FU interconnects and argued in favor of the direct interconnections. Though the complete

interconnection topology avoids all path conflicts, its hardware complexity is higher. For

direct interconnection architecture, we showed how to choose an appropriate interconnection

matrix in order to minimize path conflicts.

• We proposed scheduling and binding algorithms that (a) increase the number of reads and

writes from SIRO and reduce reads and writes from second level RF, (b) avoid performance

loss due to reduced port second level RF by considering reduced traffic to the RF, and (c)

avoid performance loss due to direct interconnect by doing intelligent FU binding. It has



been shown in the results that using our approach the number of reads and writes in SIRO

buffers increases significantly. Performance loss due to port reduction and direct interconnect

is within 5% for Mediabench and Mibench set of benchmarks.

• We proposed two theoretical models for predicting the performance and energy of the pro-

posed architecture. One model takes the number of read and write ports of RF as inputs and

assumes issue width of the processor is fixed. The second model takes the number of issue

slots, the number of read ports and write ports of RF as inputs. Both models take application

characteristics as inputs. Along with performance and energy estimates, the models also give

insight into factors that affect performance and energy.

• We implemented the scheduling and binding algorithms in the Trimaran compiler framework.

Our experiments show that 40% of RF energy can be saved due to SIRO buffers, and up to

70% of RF energy can be saved with the reduced port RF architecture in a 4 issue processor. Our

experiments with different issue width processors show that the proposed RF architecture is

scalable in terms of performance as well as energy.

7.2 Future Work

In this thesis we show the scalability of the register file in a high issue VLIW processor. This

work can be extended by considering other components of VLIW processors for scalability. The most

important of them is the interconnect. FU-FU interconnects form the bypass network. In very high

issue VLIW processors they can become the bottleneck. FU-FU interconnects can be localized, i.e.,

only the physically neighboring FUs may be connected to each other. This type of topology is called

partial bypass and has been studied as an architectural constraint. Partial bypass with reduced port

may give higher scalability, and, therefore, can be studied in this new context.

Another possible extension of the work is to combine the reduced port RF architecture with the
clustered RF architecture. In a reduced port clustered architecture, the ports of each RF bank
will be reduced. This architecture will be suitable for scaling the processor beyond 16 issue.


The register allocation algorithm can also be integrated with the proposed scheduling and binding

algorithms for further exploration. Integration of register allocation will be extremely important if

one wants to explore “random in random out”, RIRO based RF architecture.

The techniques proposed in this thesis can be extended to dynamically scheduled multi-issue
processors. In that case, port and path conflict management has to be done by hardware at run
time. It would be interesting to investigate the application of compiler techniques in that case.


References

Silicon Hive. http://www.siliconhive.com.

Tilera. http://www.tilera.com.

S. Aditya, B. R. Rau, and V. Kathail. Automatic architectural synthesis of VLIW and EPIC proces-

sors. In International Symposium on System Synthesis, pages 107–113, 1999.

A. Aggarwal and M. Franklin. Energy efficient asymmetrically ported register files. In International

Conference on Computer Design, pages 2 – 7, 2003.

P. Ahuja, D. Clark, and A. Rogers. The performance impact of incomplete bypassing in processor

pipelines. In Proceedings. 28th Annual International Symposium on Microarchitecture, pages

36–45, 1995.

K. Asanovic, M. Hampton, R. Krashinsky, and E. Witchel. Power Aware Computing. Kluwer

Academic/Plenum Publishers, June 2002.

J. Ayala, M. Lopez-Vallejo, and A. Veidenbaum. A compiler-assisted banked register file architec-

ture. In IEEE Workshop on Application Specific Processors, 2004.

A. Baghdadi, N. Zergainoh, W. Cesario, T. Roudier, and A. Jerraya. Design space exploration

for hardware/software codesign of multiprocessor systems. In Proceedings. 11th International

Workshop on Rapid System Prototyping, pages 8–13, 2000.


M. Balakrishnan and H. Khanna. Allocation of FIFO structures in RTL data paths. ACM Transactions

on Design Automation of Electronic Systems (TODAES), 5(3), 2000.

R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi. Reducing the complexity of the register

file in dynamic superscalar processors. In Proceedings. 34th Annual International Symposium on

Microarchitecture, pages 237 – 248, 2001.

A. Capitanio, N. Dutt, and A. Nicolau. Partitioned register files for VLIWs: A preliminary analysis

of tradeoffs. In Proceedings. 25th Annual International Symposium on Microarchitecture, 1992.

L. N. Chakrapani, J. Gyllenhaal, W.-m. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah.

Lecture Notes in Computer Science, volume 3602/2005, chapter Trimaran: An Infrastructure for

Research in Instruction-Level Parallelism, pages 32 – 41. Springer Berlin / Heidelberg, 2005.

P. P. Chang, S. A. Mahlke, W. Y. Chen, N. J. Warter, and W.-m. W. Hwu. IMPACT: an architectural

framework for multiple-instruction-issue processors. SIGARCH Comput. Archit. News, 19(3):

266–275, 1991. ISSN 0163-5964.

H. Corporaal. TTAs: Missing the ILP complexity wall. Journal of Systems Architecture, 36(12),

1999.

J. L. Cruz, A. Gonzalez, M. Valero, and N. P. Topham. Multiple-banked register file architectures.

In International Symposium on Computer Architecture, pages 316–325, 2000.

O. M. D’Antona and E. Munarini. A combinatorial interpretation of punctured partitions. Journal

of Combinatorial Theory Series A, 91(1-2):264 – 282, 2000.

J. Davidson and S. Jinturkar. Improving instruction-level parallelism by loop unrolling and dynamic

memory disambiguation. In Proceedings. 28th Annual International Symposium on Microarchi-

tecture, 1995.

O. Ergin, D. Balkan, K. Ghose, and D. Ponomarev. Register packing: Exploiting narrow-width


operands for reducing register file pressure. In Proceedings. 37th Annual International Sympo-

sium on Microarchitecture, pages 304– 315, 2004.

K. Fan, N. Clark, M. Chu, K. Manjunath, R. Ravindran, M. Smelyanskiy, and S. Mahlke. Systematic

register bypass customization for application-specific processors. In IEEE 14th International

Conference on Application-specific Systems, Architectures and Processors (ASAP), June 2003.

K. Fan, M. Kudlur, H. Park, and S. Mahlke. Cost sensitive modulo scheduling in a loop accelerator

synthesis system. In Proceedings. 38th Annual International Symposium on Microarchitecture,

2005.

P. Faraboschi, G. Brown, J. A. Fisher, G. Desoli, and F. Homewood. Lx: A technology platform
for customizable VLIW embedded processing. In Proceedings of the 27th International
Symposium on Computer Architecture, pages 203–213, June 2000.

K. Farkas, P. Chow, N. Jouppi, and Z. Vranesic. The multicluster architecture: reducing cycle time

through partitioning. In Proceedings. 30th Annual International Symposium on Microarchitec-

ture, pages 149–159, Dec 1997.

M. Fernandes. A clustered VLIW architecture based on queue register files. PhD Thesis, University

of Edinburgh, 1998.

J. Fridman and Z. Greenfield. The TigerSHARC DSP architecture. IEEE Micro, pages 66–76, Jan-Feb

2000.

A. Gangwar. A Methodology For Exploring Communication Architectures of Clustered VLIW Pro-

cessors. PhD thesis, Department of Computer Science, IIT Delhi, 2005.

A. Gangwar, M. Balakrishnan, and A. Kumar. Impact of intercluster communication mechanisms

on ILP in clustered VLIW architectures. ACM Transactions on Design Automation of Electronic

Systems (TODAES), 12(1), 2007.


R. Gonzalez, A. Cristal, D. Ortega, A. Veidenbaum, and M. Valero. A content aware integer register

file organization. In International Symposium on Computer Architecture, 2004a.

R. Gonzalez, A. Cristal, M. Pericas, A. Veidenbaum, and M. Valero. Scalable distributed register

file. In WCED, 2004b.

M. R. Guthaus, J. S. Ringenberg, D. Ernst, T. M. Austin, T. Mudge, and R. B. Brown. MiBench:

A free, commercially representative embedded benchmark. In IEEE 4th Annual Workshop on

Workload Characterization, Dec. 2001.

J. Hoogerbrugge and H. Corporaal. Register file port requirements of transport triggered archi-

tectures. In Proceedings. 27th Annual International Symposium on Microarchitecture, pages

191–195, 1994.

N. P. Jouppi. The nonuniform distribution of instruction-level and machine parallelism and its effect

on performance. IEEE Trans. Comput., 38(12):1645–1658, 1989.

V. Kathail, M. Schlansker, and B. R. Rau. HPL-PD architecture specification: Version 1.1. Technical

Report HPL-93-80R1, 2000.

R. Kessler. The Alpha 21264 microprocessor. IEEE Micro, pages 24–36, March-April 1999.

B. Khailany, W. Dally, S. Rixner, U. Kapasi, J. Owens, and B. Towles. Exploring the VLSI scala-

bility of stream processors. In International Conference on High Performance Computer Archi-

tecture, 2003.

N. S. Kim and T. Mudge. Reducing register ports using delayed write-back queues and operand

pre-fetch. In International Conference on Supercomputing, pages 172–182, 2003.

M. Kondo and H. Nakamura. A small, fast and low-power register file by bit-partitioning. In

International Symposium on High Performance Computer Architecture, 2005.


M. Kudlur, K. Fan, M. Chu, R. Ravindran, N. Clark, and S. Mahlke. Flash: Foresighted latency-

aware scheduling heuristic for processors with customized datapaths. In CGO ’04: Proceedings

of the international symposium on Code generation and optimization, pages 201 – 212, 2004.

A. Lambrechts, P. Raghavan, A. Leroy, G. Talavera, T. V. Aa, M. Jayapala, F. Catthoor, D. Verk-

est, G. Deconinck, H. Corporaal, F. Robert, and J. Carrabina. Power breakdown analysis for a

heterogeneous NoC platform running a video application. In IEEE International Conference on

Application-Specific Systems, Architecture Processors, pages 179–184, 2005.

C. Lee, M. Potkonjak, and W. H. Mangione-Smith. Mediabench: a tool for evaluating and synthesiz-

ing multimedia and communications systems. In International Symposium on Microarchitecture,

pages 330–335, 1997.

J. Llosa, M. Valero, and E. Ayguade. Non-consistent dual register files to reduce register pressure.

In International Symposium on High-Performance Computer Architecture, page 22, 1995.

J. Llosa, M. Valero, J. Fortes, and E. Ayguade. Using sacks to organize register files in VLIW

machines. In CONPAR, 1994.

S. Mahlke, W. Chen, J. Gyllenhaal, W. Hwu, P. Chang, and T. Kiyohara. Compiler code transfor-

mations for superscalar-based high-performance systems. pages 808–817, Nov 1992.

C. McNairy and D. Soltis. Itanium 2 processor microarchitecture. IEEE Micro, pages 44–55, 2003.

R. Nalluri, R. Garg, and P. R. Panda. Customization of register file banking architecture for low

power. In International Conference on VLSI Design and Embedded Systems, pages 239–244,

2007.

D. B. Noonburg and J. P. Shen. Theoretical modeling of superscalar processor performance. In

Proceedings. 27th Annual International Symposium on Microarchitecture, pages 52–62, 1994.

E. Ozer, S. Sathaye, K. Menezes, S. Banerjia, M. Jennings, and T. Conte. A fast interrupt handling


scheme for VLIW processors. In Proceedings of International Conference on Parallel Architec-

tures and Compilation Techniques, pages 136–141, Oct 1998.

S. Palacharla, N. Jouppi, and J. Smith. Quantifying the complexity of superscalar processors.
Technical Report CS-96-1328, University of Wisconsin-Madison, November 1996.

S. Palacharla, N. P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In Pro-

ceedings. 24th annual International Symposium on Computer Architecture, pages 206–218, 1997.

I. Park, M. D. Powell, and T. N. Vijaykumar. Reducing register ports for higher speed and lower

energy. In Proceedings. 35th Annual International Symposium on Microarchitecture, pages 171–

182, 2002.

S. Park, A. Shrivastava, N. Dutt, A. Nicolau, Y. Paek, and E. Earlie. Bypass aware instruction

scheduling for register file power reduction. In Proceedings of the conference on Language,

compilers, and tool support for embedded systems, pages 173–181, 2006.

M. Pericas, R. Gonzalez, A. Cristal, A. Veidenbaum, and M. Valero. An optimized front-end phys-

ical register file with banking and writeback filtering. In Workshop on Power Aware Computer

System, 2004.

G. Reinman. Using an operand file to save energy and to decouple commit resources. IEE Proceed-

ings of Computer and Digital Techniques, 152(5), 2005.

S. Rixner, W. J. Dally, B. Khailany, P. R. Mattson, U. K. Kapasi, and J. D. Owens. Register or-

ganization for media processing. In International Symposium on High Performance Computer

Architecture, pages 375–386, 2000.

M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, and R. Zafalon. Low-power data forwarding for VLIW

embedded architectures. IEEE Transaction of VLSI Systems, 10(5):614–622, 2002.

R. Sangireddy. Register organization for enhanced on-chip parallelism. In International Conference

on Application-specific Systems, Architectures and Processors (ASAP), 2004.


R. Sangireddy. Register port complexity reduction in wide-issue processors with selective instruc-

tion execution. Microprocessors and Microsystems., 31(1):51–62, 2007.

R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. Rau, D. Cronquist, and M. Sivaraman. PICO-NPA:
High-level synthesis of nonprogrammable hardware accelerators. The Journal of VLSI Signal
Processing, 31(2), 2002.

N. Seshan. High VelociTI processing. IEEE Signal Processing Magazine, 15(2):88–101, 1998.

T. Shiota, K. Kawasaki, Y. Kawabe, W. Shibamoto, A. Sato, T. Hashimoto, F. Hayakawa, S. Tago,

H. Okano, Y. Nakamura, H. Miyake, A. Suga, and H. Takahashi. A 51.2 GOPS 1.0 GB/s-DMA

single-chip multi-processor integrating quadruple 8-way VLIW processors. In IEEE Interna-

tional Solid-State Circuits Conference, volume 1, pages 194 –593, Oct. 2005.

A. Shrivastava, N. Dutt, A. Nicolau, and E. Earlie. PBExplore: A framework for compiler-in-the-

loop exploration of partial bypassing in embedded processors. In DATE ’05: Proceedings of the

conference on Design, Automation and Test in Europe, pages 1264–1269, 2005.

S. Sirsi and A. Aggarwal. Exploring the limits of port reduction in centralized register files. In 22nd

International Conference on VLSI Design and Embedded system, pages 535–540, Jan. 2009.

K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-

aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim., 1(1):

94–125, 2004.

H.-J. Stolberg, M. Berekovic, S. Moch, L. Friebe, M. B. Kulaczewski, S. Flugel, H. Klußmann,

A. Dehnhardt, and P. Pirsch. HiBRID-SoC: A multi-core SoC architecture for multimedia signal

processing. 41(1):9 – 20, August 2005.

D. Tarjan, S. Thoziyoor, and N. P. Jouppi. CACTI 4.0. Technical Report HPL-2006-86, HP Labs,

2006.


M. Taylor, J. Kim, J. Miller, F. Ghodrat, B. Greenwald, P. Johnson, W. Lee, A. Ma, N. Shnidman,

V. Strumpen, D. Wentzlaff, M. Frank, S. Amarasinghe, and A. Agarwal. The raw processor - a

scalable 32-bit fabric for embedded and general purpose computing. In Proceedings of Hot Chips,

August 2001.

A. Terechko, M. Garg, and H. Corporaal. Evaluation of speed and area of clustered VLIW
processors. In International Conference on VLSI Design, 2005.

J. Tseng and K. Asanovic. Banked multi port register file for high frequency superscalar
microprocessors. In Proceedings of 30th International Symposium on Computer Architecture, pages
62–71, June 2003.

J. W. van de Waerdt, S. Vassiliadis, S. Das, S. Mirolo, C. Yen, B. Zhong, C. Basto, J. P. van Itegem,

D. Amirtharaj, K. Kalra, P. Rodriguez, and H. van Antwerpen. The TM3270 media-processor. In

International Conference on Microarchitecture, pages 331–342, 2005.

J. T. J. van Eijndhoven, F. W. Sijstermans, K. A. Vissers, E. Pol, M. I. A. Tromp, P. Struik, R. Bloks,

P. van der Wolf, A. Pimentel, and H. Vranken. TriMedia CPU64 architecture. In International

Conference on Computer Design, pages 586–592, 1999.

S. J. E. Wilton and N. P. Jouppi. CACTI: an enhanced cache access and cycle time model. IEEE

Journal of Solid State Circuits, 31:677–688, 1996.

J. Yan and W. Zhang. Exploiting virtual registers to reduce pressure on real registers. ACM Trans.

Archit. Code Optim., 4(4):1–18, 2008.

K. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2):28–41, Apr 1996.

J. Zalamea, J. Llosa, E. Ayguade, and M. Valero. Two-level hierarchical register file organization for

VLIW processors. In Proceedings. 33rd Annual International Symposium on Microarchitecture,

pages 137–146, 2000.


H. Zhong, K. Fan, S. Mahlke, and M. Schlansker. A distributed control path architecture for VLIW

processors. In International Conference on Parallel Architectures and Compilation Techniques

(PACT), 2005.

V. Zyuban and P. Kogge. The energy complexity of register files. In International Symposium on

Low Power Electronics and Design, pages 305–310, 1998.


List of Publications

• Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda. Power Reduction in VLIW Processor with

Compiler Driven Bypass Network. International Conference on VLSI Design and Embedded

System, pages 233–238, 2007.

• Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda. Shared Port Register File Architecture for

Low Energy VLIW Processors. Under submission.

• Neeraj Goel, Anshul Kumar, Preeti Ranjan Panda. Low Energy and scalable VLIW Processor

with Two Level Register File. Under submission.


Brief Bio-data

Neeraj Goel received the B.Tech. degree in Electronics and Communication from NIT Kurukshetra
in 2002 and the M.Tech. degree in VLSI Design Tools and Technology from IIT Delhi in 2004. His
broad research interests include embedded processors (like VLIWs) and their tools and compilers,
FPGAs, and reconfigurable computing.
