
INSTRUCTION SCHEDULING FOR VLIW PROCESSORS UNDER

VARIATION SCENARIO

A thesis submitted in partial fulfilment of

the requirements for the degree of

Master of Science (by Research)

in

Computer Science and Engineering

by

Mujadiya Nayan Vasantbhai

Roll No: 200605011

nayan [email protected]

Center for VLSI and Embedded System Technologies

INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY

GACHIBOWLI, HYDERABAD, A.P., INDIA - 500 032.

NOVEMBER 2010


INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY

GACHIBOWLI, HYDERABAD, A.P., INDIA - 500 032

CERTIFICATE

It is certified that the work contained in this thesis, titled “Instruction Scheduling for VLIW Processors Under Variation Scenario” by Mujadiya Nayan Vasantbhai, has been carried out under my supervision and is not submitted elsewhere for a degree.

Date                                                    Dr. Madhu Mutyam (Adviser)


Abstract

In Very Long Instruction Word (VLIW) processors, compilers schedule operations onto different functional units based on the instruction-level parallelism available in programs. Assuming that all functional units of the same kind have the same latency, the conventional list-scheduling algorithm selects the first available (free) functional unit to schedule an operation. In advanced process technologies, however, process variation can cause functional units of the same kind to have different latencies, and in such a situation conventional scheduling algorithms may not yield good performance. Process variations in components such as adders and multipliers of different Integer Functional Units (IFUs) in VLIW processors may cause these units to operate at different speeds, resulting in non-uniform latency IFUs. Pessimistic/conservative techniques for dealing with non-uniform latency IFUs may incur significant performance and/or leakage energy loss.

In this work, we propose three process variation-aware compile-time techniques to handle non-uniform latency IFUs. In the first technique, namely ‘turn-off’, we turn off all the high latency IFUs affected by process variation. In the second technique, namely ‘on-demand turn-on’, we use some of the affected high latency IFUs by turning them on whenever there is a requirement. Our experimental results show that with these techniques, non-uniform latency IFUs can be tolerated without much performance penalty. The proposed techniques also achieve a significant reduction in leakage energy consumption, as some of the IFUs are turned off.

Since conventional scheduling algorithms may not yield good performance under variation, we propose a third technique, namely a ‘mobility-list-scheduling’ algorithm, to schedule operations on non-uniform latency functional units, and we compare our algorithm with the conventional list-scheduling algorithm.


Acknowledgments

I would like to thank all the people who helped and inspired me during my master’s study.

I especially want to thank my advisor, Dr. Madhu Mutyam, for his guidance during my research and study at IIIT-H. His perpetual energy and enthusiasm for research motivated all his advisees, including me. In addition, he was always accessible and willing to help his students with their research. I also want to thank him for taking the time to read our lengthy e-mails in spite of his busy schedule at IIT-M, and for taking great care to proofread my research papers. As a result, research life became smooth and rewarding for me.

I was delighted to interact with Prof. R. Govindarajulu by attending his classes. His insights into various topics of computer architecture were eye-openers for me. He also sets an example of a world-class researcher through his rigor and passion for research.

I would like to thank Rodric Rabbah, Nate Clark and Neeraj Goel from the trimaran-users mailing list for their Trimaran-related help.

All my lab buddies at the Intel Multi-core Research Laboratory made it a convivial place to work. In particular, I would like to thank Kalyan for his friendship and help over the past four years. I would also like to thank Abid for his help in my research work. I also want to thank my friends Bagi (Bhargav), Ankur, Rahulji, Shivang, Sachin, Chirag, Mangal, Rachita, Suman, Ayu, Abu, Arpit Joshi, Anil Patelia, Chintan Modi, Shirish Peshwe, Mandar Kale, Akash Agrawal, Akshay Jawa and others for making my stay in and/or out of IIIT-H enjoyable.

My deepest gratitude goes to my family for their unflagging love and support throughout my life; this dissertation would have been impossible without them. I am indebted to my father, Vasant N. Mujadia, for his moral support and for encouraging me to take up studies in my field of interest.


Dedicated to my parents


Table of contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Table of contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

List of tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

List of figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

Chapter

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Parameter variations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.1.1 Process variations . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.2 Voltage variations . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.1.3 Temperature variations . . . . . . . . . . . . . . . . . . . . . . 4

1.1.4 Input variations . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.2 VLIW processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2.1 VLIW scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.3 Contributions of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.4 Outline of the thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Prior work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Modeling process variation and experimental setup . . . . . . . . . 12

3.1 Modeling process variation . . . . . . . . . . . . . . . . . . . . . . . . . 12

3.2 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2.1 Trimaran . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15


3.2.2 HotSpot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.3 Power model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3.2.4 Our framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4 Variation-aware scheduling techniques . . . . . . . . . . . . . . . . . . 20

4.1 Working with non-uniform latency functional units . . . . . . . . . . . 20

4.1.1 Turn-off . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.1.2 On-demand turn-on . . . . . . . . . . . . . . . . . . . . . . . . 24

4.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

5 Mobility-list-scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.2 Working with non-uniform latency functional units . . . . . . . . . . . 34

5.2.1 Mobility-list-scheduling . . . . . . . . . . . . . . . . . . . . . . 37

5.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . 43

List of publications related to the thesis . . . . . . . . . . . . . . . . . 44

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45


List of tables

Table Page

1 Variation map considering 20% variation in Vth. ‘11’/‘10’/‘01’ signifies a high latency FU and ‘00’ signifies clean FU latency. . . . . . . . . . 14

2 Operating parameters for HotSpot. . . . . . . . . . . . . . . . . . . . . 18

3 CPU and memory hierarchy configuration parameters for Trimaran. . 18

4 Benchmark codes and important statistics (based on basic block scheduling). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5 Original VLIW schedule. . . . . . . . . . . . . . . . . . . . . . . . . . 21

6 VLIW schedule with worst-case latencies. . . . . . . . . . . . . . . . . 22

7 Latency map for each component (in cycles). . . . . . . . . . . . . . . 22

8 VLIW schedule after applying ‘turn-off’. . . . . . . . . . . . . . . . . 23

9 Scheduling tables for different basic blocks in different loops after applying ‘on-demand turn-on’. . . . . . . . . . . . . . . . . . . . . . . 23

10 VLIW schedule after applying list-scheduling algorithm for latency pattern 000000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

11 VLIW schedule after applying list-scheduling algorithm for latency pattern 000012. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

12 VLIW schedule after applying list-scheduling algorithm for latency pattern 210000. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

13 VLIW schedule after applying mobility-list-scheduling algorithm for latency pattern 210000. . . . . . . . . . . . . . . . . . . . . . . . . . . 38

14 Mobility information for each operation in graph Gn. . . . . . . . . . 39


List of figures

Figure Page

1 Parameter variations. . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 A generic VLIW architecture. . . . . . . . . . . . . . . . . . . . . . . . 5

3 An example VLIW schedule . . . . . . . . . . . . . . . . . . . . . . . . 7

4 Overview of compilation steps in Trimaran. . . . . . . . . . . . . . . . 15

5 Trimaran-based framework (Shaded blocks are the modified parts). . . 17

6 Floorplan of the chip simulated. . . . . . . . . . . . . . . . . . . . . . 17

7 Benchmark-wise IPC for different techniques. . . . . . . . . . . . . . . 26

8 Leakage energy savings of all the IFUs for different techniques. . . . . 27

9 Peak temperature of IFUs for different techniques. . . . . . . . . . . . 28

10 Average change in IPC for different techniques, for 6 and 4 IFUs with 20% and 40% variation. . . . . . . . . . . . . . . . . . . . . . . . . . . 29

11 Average leakage energy savings for different techniques, for 6 and 4 IFUs with 20% and 40% variation. . . . . . . . . . . . . . . . . . . . . 30

12 Average peak temperature reduction for different techniques, for 6 and 4 IFUs with 20% and 40% variation. . . . . . . . . . . . . . . . . . . . 30

13 Total execution cycles for benchmarks “apsi” and “bmcm” for all possible latency pattern scenarios after applying list-scheduling algorithm. 33

14 Benchmark-wise execution cycles after applying list-scheduling algorithm for all latency patterns of a latency pattern set {4,1,1}. . . . . . 34

15 Dependency graph (Gn) for Basic Block (BBi). . . . . . . . . . . . . . 34

16 Mobility-list-scheduling algorithm. . . . . . . . . . . . . . . . . . . . . 37

17 % IPC degradation w.r.t. the nominal case for benchmark mxm. . . . 40

18 % IPC degradation w.r.t. the nominal case for benchmark tsf. . . . . 40

19 Number of execution cycles for benchmark mxm. . . . . . . . . . . . . 41

20 Number of execution cycles for benchmark tsf. . . . . . . . . . . . . . 41


21 Average % IPC degradation w.r.t. the nominal case over all benchmarks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


CHAPTER 1

Introduction

Embedded systems are pervasive in everyday life. An embedded system is composed of hardware and software components specifically designed to control a given application device.

Embedded systems are a growing phenomenon in the technology world. Telephone and communication devices were among the earliest adopters of embedded systems. Today, governments and industries are investing heavily in the development of new applications and devices for trains, planes, and automobiles. Embedded systems are also found in smart cards and many consumer electronics.

Some new applications include smart medical devices intended to drive down the cost of health care and help track a person’s medical information. Embedded devices are being designed to assist with medicinal dosing and delivery systems to prevent accidents and treatment errors.

Smart houses will make us more energy efficient and lower our carbon footprint. A television commercial shows a woman at the airport turning off electrical devices at home with her smart phone. The futuristic energy grid will talk with your home or office during peak energy use to prevent brownouts and blackouts.

All of this sounds wonderful, with these devices in our everyday gadgets making life easier and more enjoyable. Ford uses Microsoft embedded software in its automobile sound systems, and anti-lock brakes are another embedded software system. However, we have heard before that computers and the Internet would provide society with a technological utopia.

All these multimedia consumer applications require high performance embedded platforms for their intensive processing requirements. In this area of embedded systems, Very Long Instruction Word (VLIW) processors provide a promising solution for achieving a suitable performance-power trade-off.

Figure 1. Parameter variations.

On the other hand, in the never-ending pursuit of faster and denser circuits, more and more transistors are being placed on a single chip by reducing the feature size. However, transistor scaling is accompanied by increasing variability in the process technology. These variations result in large variations in critical properties such as the maximum operating frequency and the power dissipated by the chip. It is shown in [1] that for 130nm technology, variations result in a 30% spread in the maximum allowable operating frequency and up to a fivefold increase in leakage power. In [2], it is predicted that for sub-65nm VLSI technology, the main focus will be on designing variation-tolerant circuits.

1.1 Parameter variations

Parameter variations, which are becoming worse as technology scales, impact the frequency and leakage distribution of fabricated microprocessor dies [3]. Parameter variations encompass a range of variation types, including process variations due to manufacturing phenomena, voltage variations due to manufacturing and runtime phenomena, and temperature variations due to varying activity levels and power dissipation. Figure 1 presents a general classification [4] of parameter variations. Process variations in a device are a result of variations in its manufacturing process, whereas environmental variations are a result of variations in the external conditions in which the device is used.

1.1.1 Process variations

Process variations manifest themselves as die-to-die (DTD) [5] variations and within-die (WID) [6] variations. These variations may result in increased latency and power dissipation [2, 7, 8]. DTD variations affect every element on a chip equally, whereas WID variations produce non-uniform electrical characteristics across the chip. DTD variations are the result of factors such as processing temperature, equipment properties and wafer placement, whereas WID variations are the result of the non-deterministic placement of dopant atoms and the variation of channel length across a single die. These variations limit the operating frequency and increase the power consumption of a chip. Earlier, DTD variations contributed the major portion of the total process variation, but shrinking feature sizes have increased WID variations and made them comparable to DTD variations.

In general, DTD variations are handled using circuit level techniques, whereas WID variations are handled using architectural techniques. These process variations can result in reduced performance and/or increased energy consumption (both dynamic and static, i.e., leakage) in embedded VLIW processors. We focus on WID variations.

1.1.2 Voltage variations

These are variations in the supply voltage (VDD). The voltage across an inductor is directly proportional to its inductance and to the rate of change of current through it, so a sudden change in the current through the inductor causes a large voltage drop across the inductance. This reduces the voltage supplied to the components connected in series with the inductor and may cause them to malfunction. Factors such as package routing and chip pads affect inductance, and these factors do not follow CMOS process scaling trends, which makes processor supply voltage variations significant with technology scaling. These variations are also the result of the different architectural power-saving modes of a device: as the processor switches between idle, sleep and active modes, the current it draws varies, which in turn results in voltage variations. These voltage variations limit the operating frequency of the processor.
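The inductive relation behind this supply droop is the standard circuit identity (a general fact of circuit theory, not a thesis-specific model):

```latex
% Voltage developed across the package/supply inductance L when the
% processor's current draw changes at rate dI/dt:
V_{\mathrm{drop}} = L \, \frac{dI}{dt}
```

A mode switch such as idle to active produces a large dI/dt through the package inductance L, so the components momentarily see VDD - Vdrop instead of the nominal supply.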

To overcome these problems, techniques such as adaptive VDD and decoupling capacitors [9] have been proposed. Decoupling capacitors supply instantaneous charge and reduce voltage variations by about 50%.

1.1.3 Temperature variations

The operating temperature limits a circuit’s performance in several ways. Temperature variations can be classified into two types: spatial and temporal. Spatial variations result when a highly active unit on a chip is placed next to a less active unit; this difference in activity results in differences in transistor performance and leakage across the die. Temporal variations are the variations in the temperature of a unit that occur over time. A processor is active during some periods and idle during the rest; its temperature is high during the active periods and low during the idle ones. This varying temperature causes variations in power consumption and transistor performance. A processor’s cooling system is designed with the peak temperature in mind, so when the temperature is low, the processor runs suboptimally.

To overcome these problems, techniques such as throttling [10] have been proposed. In throttling, when the temperature rises, the processor is made to run at a lower frequency; when the processor cools down, this process reverses.

1.1.4 Input variations

A circuit’s actual delay depends on its inputs: some inputs produce the result earlier than others, and there is a large difference between the average and worst case delays. In spite of this, circuits are designed for the worst case delay, which results in suboptimal processor performance.

As shown in [11], these differences in delay can be exploited to save power by lowering the supply voltage so that the operation still finishes within the worst case delay.

Figure 2. A generic VLIW architecture.

1.2 VLIW processor

Figure 2 shows a generic VLIW architecture. VLIW architectures are especially attractive for embedded systems because instruction (operation) scheduling in these architectures is performed by the compiler and obeyed by the hardware. VLIWs use multiple, independent functional units. Rather than attempting to issue multiple, independent instructions to the units, a VLIW packages the multiple operations into one very long instruction, or requires that the instructions in the issue packet satisfy the same constraints. Intel’s Itanium IA-64 EPIC appears to be the only example of a widely used VLIW architecture.

VLIW refers to a CPU architecture designed to take advantage of instruction level parallelism (ILP). A VLIW processor executes operations in parallel based on a fixed schedule generated by the compiler when the program is compiled. Since the order of execution of operations (including which operations can execute simultaneously) is determined by the compiler, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs. The VLIW hardware is not responsible for discovering opportunities to execute multiple operations concurrently; the long instruction word already encodes the concurrent operations. This is in contrast to superscalar architectures with out-of-order execution, where the hardware decides the final schedule and execution order of instructions. As a result, estimating code performance is easier for VLIWs than for superscalar architectures. Thus, we say that the VLIW processor has static issue capability.

1.2.1 VLIW scheduling

Scheduling is a mapping of program parallelism onto the available parallel resources. In VLIW processors, scheduling takes place at compile-time with the help of a VLIW compiler. Since a VLIW processor consists of several functional units, the VLIW compiler schedules instructions so as to use as many functional units as possible in every cycle. A VLIW compiler reads a program written in a high level programming language and translates the complex operations into micro-operations supported by the processor. The compiler then checks for control and data dependences between these operations to find which operations can be performed in parallel. These micro-operations are then regrouped as VLIWs and saved in memory until they are executed on the processor. VLIW compilers do all of the translation and scheduling at compile-time. We can tabulate the VLIW scheduling information for a basic block (see Figure 3), with rows indicating execution cycles and columns indicating functional units. From Figure 3, we know that FU[2] executes op6 at cycle 2.

Cycles/Units  FU[0]  FU[1]  FU[2]  FU[3]
0             op1    op2    op4    NOP
1             op3    NOP    NOP    op5
2             NOP    NOP    op6    op7
3             op8    NOP    op9    NOP

Figure 3. An example VLIW schedule.

Hence the compiler has full control of instruction scheduling in VLIW architectures. We exploit this advantage of VLIW processors to handle non-uniform latency IFUs due to process variation.

1.3 Contributions of the thesis

In this thesis, we present solutions that minimize the effects of WID process variation by handling non-uniform latency in Integer Functional Units (IFUs). Our techniques exploit the fact that during some parts of a VLIW program’s execution, the maximum number of operations that can be executed per cycle is much smaller than the number of available IFUs [12]. We propose three compile-time variability-aware IFU assignment techniques, namely ‘turn-off’, ‘on-demand turn-on’ and ‘mobility-list-scheduling’. In the first technique, we turn off the high latency IFUs. In the second technique, we use the high latency IFUs only when there is a requirement.

To work with non-uniform latency functional units, we also propose a new scheduling algorithm, mobility-list-scheduling, a modified version of the list-scheduling algorithm that uses mobility [13] information to schedule operations onto non-uniform latency FUs.
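The mobility [13] of an operation is its scheduling slack: its ALAP (as-late-as-possible) start time minus its ASAP (as-soon-as-possible) start time. A minimal sketch of that computation follows, assuming unit-latency operations and an invented toy dependence graph; the actual mobility-list-scheduling algorithm is described in Chapter 5 and may differ in details.

```python
# Mobility sketch: mobility(op) = ALAP(op) - ASAP(op), unit latencies assumed.
def asap(deps):
    """Earliest start time: 0 for sources, else 1 + latest predecessor start."""
    times = {}
    def visit(op):
        if op not in times:
            preds = deps[op]
            times[op] = 0 if not preds else 1 + max(visit(p) for p in preds)
        return times[op]
    for op in deps:
        visit(op)
    return times

def alap(deps, length):
    """Latest start time that still meets the given schedule length."""
    succs = {op: [s for s, ps in deps.items() if op in ps] for op in deps}
    times = {}
    def visit(op):
        if op not in times:
            ss = succs[op]
            times[op] = (length - 1) if not ss else min(visit(s) for s in ss) - 1
        return times[op]
    for op in deps:
        visit(op)
    return times

# Invented dependence graph: f hangs off b but is not on the critical path.
deps = {"a": set(), "b": set(), "c": {"a"}, "d": {"b"},
        "e": {"c", "d"}, "f": {"b"}}
early = asap(deps)
late = alap(deps, length=1 + max(early.values()))
mobility = {op: late[op] - early[op] for op in deps}
# Zero-mobility operations lie on the critical path and must be scheduled
# first; operations with slack (here, f) can absorb a slower IFU's latency.
```

With this information, a mobility-aware scheduler can bind low-mobility operations to fast (clean) IFUs and route slack-rich operations to the variation-affected slow ones.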

1.4 Outline of the thesis

Following this introduction, Chapter 2 reviews related work. In Chapter 3, we model process variation and introduce our simulation framework. In Chapters 4 and 5, we describe our techniques for handling non-uniform latency IFUs. Finally, Chapter 6 summarizes the contributions of this dissertation and outlines future directions for this research.


CHAPTER 2

Prior work

In this chapter, we briefly discuss some of the existing techniques that reduce the effects of process variations.

Process variations, which include DTD [5] and WID [6] variations, have a significant impact on both the performance and the power consumption of a chip [2, 7, 8], which in turn translates to reduced yield. To mitigate the impact of variations, several techniques have been proposed at both the circuit and architectural levels.

Bidirectional adaptive body biasing (ABB) is used in [14] to minimize the effects of die-to-die and within-die process variations on the maximum operating frequency and the leakage energy dissipated. The limits of using forward body biasing to make circuits more robust against threshold voltage variations are studied in [15].

Borkar et al. [2] discuss the effect of parameter variations on system performance and present techniques that use forward body bias (FBB) and reverse body bias (RBB) to reduce their effect on circuits and in turn improve chip yield. Another technique to improve chip yield is proposed in [16], wherein algorithms are developed for gate sizing and for determining the optimal bin boundaries to obtain the maximum benefit from frequency binning. The effects of process variations on SRAM cells are analyzed in [17, 18]. Considering a chip with random and spatial variations, the leakage energy dissipated in the chip is analyzed in [19].

Several techniques to handle variability in cache memories can also be found in the literature. [17] analyzes SRAM cell failures under process variation and proposes a new variation-aware cache architecture suitable for high performance applications; the proposed architecture adaptively resizes the cache to avoid faulty cells, thereby improving yield. In [20], the authors discuss performance losses due to worst-case process variation delay in caches and propose a technique called block rearrangement to minimize the performance loss incurred by process variation. In [21], a variation-tolerant design technique, namely block remap with turnoff (BRT), is proposed to minimize performance loss and leakage energy consumption. In the BRT technique, a few blocks are selectively turned off after rearranging them so that all sets receive almost the same number of process variation affected blocks. By turning off the process variation affected blocks of a set, leakage energy can be minimized and the set can be accessed with low latency, at the cost of reduced set associativity.

In [22], four microarchitectural techniques are proposed to minimize the yield loss due to power and delay violations in the data cache. The first scheme, Yield-Aware Power-Down (YAPD), disables a cache way if it violates the delay or power limits. The second scheme, Horizontal-YAPD (H-YAPD), modifies this approach by turning off horizontal regions of a cache instead of the regular (vertical) ways; due to the spatial correlation of process variations, turning off segments that exhibit similar behavior improves the yield further. The third scheme, VAriable-latency Cache Architecture (VACA), allows different load accesses to complete with varying latencies with the help of special buffers placed before the functional units, so that even if some accesses take longer than the predefined number of cycles, execution is still performed correctly. A final, hybrid scheme combines YAPD and VACA.

The authors of [23] formulate a variation-aware scheduling problem for heterogeneous multi-core MPSoC architectures and propose a statistical scheduling algorithm to mitigate the effects of parameter variations. They propose a new metric, performance yield, which is used at each step of the iterative dynamic priority computation of the task scheduler. An efficient yield computation method for task scheduling and a fast criticality-based routing algorithm are also proposed to improve the performance of the scheduling algorithm.


Huang and Xu [24] present a novel quasi-static scheduling strategy, wherein

a set of variation-aware schedules is synthesized off-line and, at run-time, the

scheduler will select the right one based on the actual variation for each chip,

such that the timing constraint can be satisfied whenever possible.

Raghavan et al. [25] present a complete approach that handles speed variability of the register file (RF), proposing different compile-time and run-time design alternatives for VLIWs. The first alternative extends current RF architectures and

uses a compile-time variability-aware register assignment algorithm. The second

alternative presents a fully-adjustable pure run-time approach, which overcomes

the variability loss as well, but at the extra cost of cycles and area.

Techniques are proposed in [26] to reduce the impact of variability on floating

point units (FPUs) and register read accesses. For pipelined multi-cycle FPUs,

the authors used time borrowing between stages and added an extra set of latches,

which can be used to add an extra stage to the FPU pipeline, and for the RF, the

read accesses are steered such that fast entry/port combinations are prioritized.

In [27], the authors alleviate the impact of slow access latencies for some registers by renaming the destinations of critical instructions to fast registers, and the impact of slow functional units by giving critical instructions priority for the fast functional units. To mitigate the impact of L1 cache frames that are slower than others, they prefetch into small prefetch buffers.

A few mechanisms similar to our technique are described below. Uncriticality-directed instruction scheduling is proposed in [28]. In this technique, only uncritical instructions are issued into and executed on the variable-latency or long-latency ALUs. In [29] and [30], Leakage-Aware Operation to Functional Units Binding

Mechanism (LA-OFBM) and Leakage-Aware Power Gating (LA-PG) considering

temperature and process variations are proposed. In LA-OFBM [29], a leakage

sensor is introduced in each FU and operations are issued to the FUs based on

the leakage information acquired from the sensors. The leakage sensor values are


continuously read and the FU priorities are updated. The main drawback with this

method is the addition of a multiplexor in the critical path of the execution. In

LA-PG [30], a power gating mechanism is used to reduce the leakage energy. In the

first step, the current instruction per cycle (IPC) information is used to find out

how many FUs to power gate and in the second step, information from a leakage

sensor is used to determine which FUs to power gate. These works concentrate on superscalar processors.

However, to the best of our knowledge, there has been no work that addresses

instruction scheduling on non-uniform latency IFUs at compiler level. Since the

compiler has full control of instruction scheduling in VLIW architectures, IFU

assignment can take place at compile-time. We exploit this advantage of VLIW

processors to handle non-uniform latency IFUs.


CHAPTER 3

Modeling process variation and experimental setup

3.1 Modeling process variation

In this work, we consider WID variations only. First we discuss the different

methods of modeling WID variations and later we describe how we model them.

In [31], to model WID process variations, HSPICE [32] is used. HSPICE is a

device level circuit simulation tool. It takes a SPICE file as input and produces

output describing the circuit simulation. Using this tool, Monte-Carlo simulation

is performed. In Monte-Carlo simulation, the type of variation in the inputs is

entered to obtain the variation in the delay and power values.

To model WID process variations, the random and systematic components of five different variation parameters (interconnect metal thickness, inter-layer dielectric thickness, line-width on interconnects, gate length and threshold voltage) are considered in [33]. The statistical computing R-tool [34] is used to model variations. On specifying the required type of distribution, mean and standard deviation

of the distribution, the R-tool generates different values for each parameter. To

model the random variations, random values from a uniform random distribution

are used. To model the systematic component, the range factor (φ) in the 2D

layout of the chip is considered. Each process parameter is considered as a func-

tion of its mean, standard deviation and range factor. The effect of range factor

on correlation factor (Ci) of a parameter is given in Equation 1. Here, di is the

distance between the two points between which the correlation factor is measured.

Ci = 1 − di/φ if di ≤ φ ; else Ci = 0 (1)
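The piecewise correlation model of Equation 1 can be sketched as follows. This is an illustrative helper, not code from the thesis, and it assumes the linear form Ci = 1 − di/φ within the range factor.

```python
# Illustrative sketch (not thesis code): spatial correlation factor of
# Equation 1, assuming the linear form C_i = 1 - d_i/phi for d_i <= phi.
def correlation_factor(d_i: float, phi: float) -> float:
    """Correlation between two points a distance d_i apart, range factor phi."""
    if d_i <= phi:
        return 1.0 - d_i / phi  # full correlation at d_i = 0, none at d_i = phi
    return 0.0                  # points farther apart than phi are uncorrelated
```

Points closer together than the range factor are thus partially correlated, with the correlation decaying linearly with distance.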

The set of values generated for each process parameter is used in the parameterized SPICE models. A set of 2000 chips is simulated. After each simulation, a new set of parameters is chosen. This process simulates the WID


variations for each chip.

We consider only the WID variations in this work. To model WID variations,

we use the R-tool [34]. R-tool is a language and environment for statistical comput-

ing and graphics. It provides a wide variety of statistical and graphical techniques

like linear and nonlinear modeling, statistical tests, time series analysis, classifica-

tion and clustering.

Due to process variations, changes are observed in the width, length, oxide thickness, etc., of a transistor. All these changes can be modeled in terms of variations in

the threshold voltage of the transistor. So, instead of considering all the parameters

separately for each transistor we consider only the variations in threshold voltage.

We consider only the random components.

We assume that all the transistors in an FU component are uniformly affected by process variations. So, the number of random component values generated for the threshold voltage using the R-tool is equal to the total number of FU components in the entire IFU (FU0-FU5). We model the random component as a Gaussian random

variable.

Finally, the values of the threshold voltage for each FU component are obtained by adding their random variation components to their respective mean values. The latency of an FU component is directly proportional to its threshold

voltage. A component of an FU is assumed to have high latency if its threshold

voltage is greater than (µ+ σ) of the threshold voltage in the required technology.
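The classification above can be sketched as follows. The nominal Vth and its standard deviation here are placeholder values, not the parameters used in the thesis.

```python
import random

# Sketch (assumed values, not the thesis parameters): draw a Gaussian random
# Vth component per FU component and flag those above mu + sigma as high latency.
MU_VTH = 0.22      # hypothetical nominal threshold voltage (V)
SIGMA_VTH = 0.022  # hypothetical standard deviation (V)

def classify_components(n_components: int, seed: int = 0):
    """Return 'high' for components whose Vth exceeds mu + sigma, else 'nominal'."""
    rng = random.Random(seed)
    labels = []
    for _ in range(n_components):
        vth = MU_VTH + rng.gauss(0.0, SIGMA_VTH)  # mean plus random component
        labels.append("high" if vth > MU_VTH + SIGMA_VTH else "nominal")
    return labels
```

One label is produced per FU component, so n_components would equal the total number of components across FU0-FU5.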

To determine the leakage power dissipated in the IFU (FU0-FU5), we consider

only the sub-threshold leakage (Isub). The value is calculated using the following

formula [35]

Isub = K1 · W · e^(−Vth/(η·Vθ)) · (1 − e^(−V/Vθ)) (2)

where K1 and η are experimentally determined parameters, V is the supply

voltage, Vθ is 25mV at room temperature, W is width and Vth is threshold voltage


FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]
 10     11     00     00     10     00

Table 1. Variation map considering 20% variation in Vth. ‘11’/‘10’/‘01’ signifies a high latency FU and ‘00’ signifies a clean FU.

of the transistor. Leakage power dissipated by an FU component is calculated

by adding the leakage power dissipated by all the transistors present in that FU

component. Because of process variations, Vth can be low or high. So, the latency

and leakage values of the FU component change accordingly. Hence, the FU com-

ponents in the entire IFU (FU0-FU5) can be characterized in to four types of FU

components (1) FU component with nominal latency and nominal leakage (2) FU

component with nominal latency and high leakage (3) FU component with high

latency and nominal leakage and (4) FU component with high latency and high

leakage.
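Equation 2 can be sketched directly. The values of K1, η and W below are placeholders for illustration, not the experimentally determined parameters used in the thesis.

```python
import math

# Sketch of Equation 2; k1, eta and w are placeholder values, not the
# experimentally determined parameters used in the thesis.
def subthreshold_leakage(vth, v=1.0, vtheta=0.025, k1=1e-6, eta=1.3, w=1.0):
    """Isub = K1 * W * exp(-Vth / (eta * Vtheta)) * (1 - exp(-V / Vtheta))."""
    return k1 * w * math.exp(-vth / (eta * vtheta)) * (1.0 - math.exp(-v / vtheta))
```

Note that components with a higher threshold voltage leak less, which is why the high-Vth (high latency) components can simultaneously be the low-leakage ones.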

When only one value is used to represent an FU, power calculation is done by taking the average of the powers of all the components (adder, multiplier, comparator and divider) present in the FU. This effect of process variation was modeled in the power model [36]. For latency calculation, the maximum latency of all the components in an FU is used as the latency value of the FU. This whole process produces the variation map shown in Table 1, considering 20% variation in Vth. In the table, ‘11’, ‘10’ and ‘01’ signify FUs which have high latency and ‘00’ signifies a clean FU which has normal latency.

3.2 Experimental setup

In order to validate our proposed techniques, we have used the compilation/simulation framework shown in Figure 5, which consists of three parts: the Trimaran infrastructure [37], HotSpot [38], and a power model [36]. We configure the simulator to simulate the floorplan shown in Figure 6 (an Itanium [39]

like architecture). Given below are brief details of the tools, benchmarks and


Figure 4. Overview of compilation steps in Trimaran.

configuration parameters used in evaluation.

3.2.1 Trimaran

Trimaran is an integrated compilation and performance monitoring infrastruc-

ture. The architecture space that Trimaran covers is characterized by HPL-PD, a

parameterized processor architecture supporting novel features such as predication,

control and data speculation, and compiler controlled management of the memory

hierarchy. Trimaran also consists of a full suite of analysis and optimization mod-

ules, as well as a graph-based intermediate language. Optimizations and analysis

modules can be easily added, deleted or bypassed, thus facilitating compiler opti-

mization research. Similarly, computer architecture research can be conducted by

varying the HPL-PD machine via the machine description language HMDES. Tri-

maran also provides a detailed simulation environment and a flexible performance

monitoring environment that automatically tracks the machine as it is varied.

Trimaran comprises three components: the IMPACT compiler, the Elcor

compiler, and the Simu simulator (as shown in Figure 4). Trimaran uses IMPACT

to compile the original source code into an assembly intermediate representation

(IR) called Lcode. The Lcode produced is optimized for ILP, but not for a specific


machine. This code is then passed on to the Elcor compiler, along with a machine

description (MDES) that specifies the target machine. Elcor compiles the code for

the target machine, producing another IR called rebel. The Trimaran simulator

known as Simu consumes the rebel code, executes the code, and gathers execution

statistics.

3.2.2 HotSpot

HotSpot was developed at the University of Virginia. It is an

accurate and fast thermal model suitable for use in architectural studies. It is based

on an equivalent circuit of thermal resistances and capacitances that correspond to

microarchitecture blocks and essential aspects of the thermal package. The model

has been validated using finite element simulation. HotSpot has a simple set of

interfaces and hence can be integrated with most power-performance simulators

like Wattch. The chief advantage of HotSpot is that it is compatible with the

kinds of power/performance models used in the computer-architecture community,

requiring no detailed design or synthesis description. HotSpot makes it possible

to study thermal evolution over long periods of real, full-length applications. One

can simply specify a custom floorplan file with any floorplan and granularity. The format of the floorplan file simply requires, for each block in the floorplan, the width, height, and (x,y) coordinates of the lower-left corner. It is an open-source tool, available for academic research.

3.2.3 Power model

The power model is an architecture-level leakage simulator. The dynamic energy of the functional units, caches, TLBs, pipeline units, latches, the IO unit, and the register files is read from a table that stores the results from HSPICE simulations. The leakage power is expressed as a hierarchical model at the architecture,

circuit and device level. It integrates a temperature estimation tool (HotSpot) to

calculate the leakage power at run-time using temperature feedback.


Figure 5. Trimaran-based framework (Shaded blocks are the modified parts).

Figure 6. Floorplan of the chip simulated.

3.2.4 Our framework

In Figure 5, the IMPACT compiler parses the program, forms the basic blocks,

and performs several machine independent optimizations, while the Elcor compiler

implements several machine-dependent optimizations such as instruction schedul-

ing and register allocation. We modified the instruction scheduling part of the Elcor

to incorporate extra knowledge of variation map. We also simulate the scheduling

algorithms by modifying the default simulator that comes with the Trimaran.

We incorporate the power model (based on architecture similar to the Intel

Itanium IA64) under 65nm technology described in [36] into the HPL-PD archi-

tecture simulator to capture both dynamic and leakage power consumption of each


functional unit during the execution. The enhanced simulator provides the power

consumption of each block to HotSpot. It can estimate the temperature of each

functional unit based on an equivalent circuit of thermal resistances and capac-

itances and on the essential aspects of the thermal package. The temperature

information is then fed back into the power simulator since the leakage power

model is temperature-aware.

Parameter                    Value
Frequency                    1.5GHz
Vdd                          1V
Initial temperature          60°C
Ambient temperature          45°C
Package thermal resistance   0.8K/W
Die                          0.5mm thick, 7.52mm × 7.52mm
Heat spreader                1mm thick, 1cm × 1cm
Heat sink                    7mm thick, 6cm × 6cm

Table 2. Operating parameters for HotSpot.

CPU core:
  General Purpose Static Register File     64
  General Purpose Rotating Register File   64
  Floating Point Register File             64
  Floating Point Rotating Register File    64
  Prediction Register File                 256
  Prediction Rotating Register File        64
  Control Register File                    64
  Control Rotating Register File           64
  Branch Target Register File              16
  Integer Units                            6
  Floating Point Units                     2
  Load/store Units                         2
  Integer ALU Latency (cycles)             1
  Integer Compare Latency (cycles)         1
  Integer Multiply Latency (cycles)        3
  Integer Divide Latency (cycles)          8

Memory hierarchy (Size / Ways / Line / Latency in cycles):
  L1 I- & D-Cache Data Array   16KB / 4 / 512-bit / 1
  L1 I- & D-Cache Tag Array    1K / 4 / - / 1
  L2 Cache Data Array          256KB / 8 / 256-bit / 7
  L2 Cache Tag Array           2K / 4 / - / 7
  L3 Cache Data Array          4MB / 12 / 1024-bit / 35
  L3 Cache Tag Array           16K / 1 / - / 35
  L1 TLB                       32×76
  L2 TLB                       32×55
  L3 TLB                       128×46

Table 3. CPU and memory hierarchy configuration parameters for Trimaran.


Benchmark   Source         # of Nests   # of BBs   % of Int Ops.
adi         Livermore      2            17         72.9
apsi        Perfect Club   3            25         72.1
bmcm        Perfect Club   4            25         62.0
eflux       Perfect Club   2            43         66.9
mxm         Spec           2            17         61.9
tomcatv     Spec           9            51         68.6
tsf         Perfect Club   4            38         62.1
vpenta      Spec           8            41         66.2
wss         Perfect Club   7            39         66.7

Table 4. Benchmark codes and important statistics (based on basic block scheduling).

We employ Power Supply Gating (PSG) to turn off the high latency functional units and use the implementation from [40] to estimate the overhead of turning on/off functional units, in both performance and energy terms. Specifically, the functional unit enable time is 56.89ns (around 85 cycles with the simulated 1.5GHz processor) and the energy overhead is 30.1pJ. In our simulation, both the time overhead and the energy overhead for turning on/off functional units are taken into consideration. To support explicit functional unit turn on/off, we provide a sleep signal (per integer functional unit). The compiler transforms a functional unit from active mode to a leakage control mode, or vice versa, by controlling this sleep signal. This can be achieved by augmenting the ISA.
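A back-of-the-envelope check of when gating pays off can be made with the figures above. The per-cycle leakage value in the example is an assumption for illustration, not a measured number from the thesis.

```python
# Gating an FU pays off only if it stays idle longer than the enable time and
# the leakage saved exceeds the 30.1 pJ on/off energy overhead quoted above.
ENABLE_CYCLES = 85   # 56.89 ns at the simulated 1.5 GHz
OVERHEAD_PJ = 30.1   # energy overhead of a turn on/off pair

def gating_saves_energy(idle_cycles, leakage_pj_per_cycle):
    """leakage_pj_per_cycle is an assumed, illustrative per-FU leakage figure."""
    saved = idle_cycles * leakage_pj_per_cycle
    return idle_cycles > ENABLE_CYCLES and saved > OVERHEAD_PJ
```

This is why turn on/off decisions are made at loop granularity: short idle windows cannot amortize the gating overhead.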

Table 2 and Table 3 give the default simulation parameters used in our experiments, namely, the thermal parameters and the CPU/memory configuration parameters, respectively. Table 4 lists the array-intensive benchmark codes used in our experiments. The third and the fourth columns give the number of loop nests and the number of basic blocks, respectively, for each of the benchmarks. The last column gives the percentage of integer ALU operations with respect to all the operations during dynamic execution.


CHAPTER 4

Variation-aware scheduling techniques 1

In this chapter, we present solutions to minimize the effects of WID process

variation by considering non-uniform latency in Integer Functional Units (IFUs).

Our technique exploits the fact that during some parts of VLIW program execution

the maximum number of operations that can be executed per cycle is much smaller

than the number of available IFUs [12]. We propose two compile-time variability-aware IFU assignment techniques. In the first technique, we turn off the high latency IFUs. In the second technique, we use the high latency IFUs only when required.

4.1 Working with non-uniform latency functional units

In this chapter, we assume that an IFU which is not affected by process variation takes 1 cycle each to perform addition, subtraction, and comparison operations, 3 cycles for a multiplication operation and 8 cycles for a division. We can tabulate VLIW scheduling information for a basic block with rows

sion. We can tabulate VLIW scheduling information for a basic block with rows

indicating the execution cycles and columns indicating the functional units infor-

mation. Table 5 shows one such VLIW schedule. The table shows six integer

FUs (FU0-FU5) and two load-store FUs (FU8-FU9). For illustrative purposes, we consider Add (A), Multiply (M), Compare (C), Load (L) and Store (S) operations only.

We assume that each IFU can perform addition, multiplication, comparison, and

division operations. However, in a particular cycle only one operation can be per-

formed by an IFU. We use the terms “integer functional units” and “functional

units” interchangeably.

Due to variability, it is highly possible to have some IFU components with higher latency than others of the same kind [2, 7, 8]. In order to know

1 This chapter is based on the published paper titled “Instruction scheduling for VLIW processors under variation scenario” [41].


Cycle  FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]  FU[8]  FU[9]
0      A1     A2     A3     A4     A5     A6
1      A7     A8     C1     C2                   L1     L2
2      A9     A10                                L3
3
4                                                L4
5      M1     M2     M3
6
7
8                                                S1     S2

Table 5. Original VLIW schedule.

the latency values, we assume that a speed test is performed on the IFUs during the production test (note that BIST is generally used only for memory structures) and the latency information of each IFU is made available to the processor. Let us

represent this component-wise latency for each IFU in a latency table (as shown

in Table 7). We can observe from Table 1 and Table 7 that FU0, FU1, and FU4 have higher latency compared to the other IFUs.

In order to work with these non-uniform IFUs, a worst-case technique that schedules instructions on all the IFUs by assuming the component-wise maximum latency may result in significant performance loss. For example, Table 6 shows the schedule with worst-case latencies for all the FUs. Consider the multiplication instruction M1: in Table 5, it is scheduled in cycle 5 and completes by cycle 7, whereas in Table 6, it is scheduled in cycle 8 and takes 6 cycles to complete.

In order to handle scheduling on non-uniform integer functional units, we propose two techniques wherein the high latency IFUs are turned off and activated on a need basis, only if there is high instruction-level parallelism to exploit. In the first technique, instructions are scheduled only on clean FUs (FUs not affected by process variation) and the unused clean FUs, along with the high latency FUs, are turned off (using supply gating [40]). This decision is taken based on the IPC (instructions issued per cycle) value. Turning off means that we keep the component in sleep mode. To support explicit functional unit turn


Cycle  FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]  FU[8]  FU[9]
0      A1     A2     A3     A4     A5     A6
1
2      A7     A8     C1     C2                   L1     L2
3
4      A9     A10                                L3
5
6
7                                                L4
8      M1     M2     M3
9-13
14                                               S1     S2

Table 6. VLIW schedule with worst-case latencies.

      Adder  Multiplier  Comparator  Divider
00    1      3           1           8
01    1      4           1           11
10    2      5           2           13
11    2      6           2           16

Table 7. Latency map for each component (in cycles).
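The latency map of Table 7 can be encoded as a simple lookup. The rule that an FU's single latency value is the maximum over its components follows Section 3.1; this is an illustrative sketch, not the simulator's actual data structure.

```python
# Component latencies (cycles) keyed by the two-bit variation code of Table 7.
LATENCY_MAP = {
    "00": {"adder": 1, "multiplier": 3, "comparator": 1, "divider": 8},
    "01": {"adder": 1, "multiplier": 4, "comparator": 1, "divider": 11},
    "10": {"adder": 2, "multiplier": 5, "comparator": 2, "divider": 13},
    "11": {"adder": 2, "multiplier": 6, "comparator": 2, "divider": 16},
}

def fu_latency(code: str) -> int:
    """Single latency value for an FU: the maximum over its components."""
    return max(LATENCY_MAP[code].values())
```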

on/off, we provide a sleep signal per integer functional unit. In the second technique, a high latency FU (a process variation affected FU) is turned on based on the IPC value. Functional unit ‘turn-on’ and ‘turn-off’ instructions (that control the supply gating) are inserted at the beginning and end of the associated loop, considering its IPC and the available integer functional units.

It should be noted that our techniques can work in conjunction with any

other performance-oriented scheduler such as basic block/trace scheduling [42, 43],

superblock scheduling [44] and hyperblock scheduling [45]. We now elaborate our

proposed techniques by considering the variation map (as shown in Table 1) and

the latency table (as shown in Table 7) in the following subsections.


Cycle  FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]  FU[8]  FU[9]
0                    A1     A2            A3
1                    A4     A5            A6     L1     L2
2                    A7     A8            C1
3                    C2     A9            A10    L3
4
5                                                L4
6                    M1     M2            M3
7
8
9                                                S1     S2

Table 8. VLIW schedule after applying ‘turn-off’.

Loop        Cycle  FU[0]  FU[1]  FU[2]  FU[3]  FU[4]  FU[5]  FU[8]  FU[9]
1 (IPC=3)   0                    A1     A2            A3
            1                    M1     M2            A6
            ...
            5                                                S      S
...
k (IPC=6)   0      A4     A5     A1     A2     A6     A3
            ...
            7                                                S      S

Table 9. Scheduling tables for different basic blocks in different loops after applying ‘on-demand turn-on’.

4.1.1 Turn-off

VLIW compilers do all of the translation and scheduling at compile time, so we can use information from the variation map to schedule the instructions only on clean FUs. In the turn-off technique, we turn off high latency FUs and use only clean FUs for scheduling. We also turn off the unused clean FUs so that the leakage power of unused functional units and variation-affected functional units can be greatly reduced. To turn off the unused clean FUs, we use the IPC information, and priority is given to the FUs which consume high leakage.
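The core of the turn-off decision can be sketched as below. turn_off_assignment is a hypothetical helper for illustration, not part of the Elcor scheduler.

```python
# Sketch of the 'turn-off' idea: schedule only on clean FUs ('00' in the
# variation map) and power gate the rest. Hypothetical helper, not Elcor code.
def turn_off_assignment(variation_map):
    """variation_map: list of two-bit codes, one per FU."""
    clean = [i for i, code in enumerate(variation_map) if code == "00"]
    gated = [i for i, code in enumerate(variation_map) if code != "00"]
    return clean, gated

# With the variation map of Table 1, FUs 2, 3 and 5 receive all instructions
# while FUs 0, 1 and 4 are power gated.
clean, gated = turn_off_assignment(["10", "11", "00", "00", "10", "00"])
```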


By considering the variation map (as shown in Table 1) and the latency table

(as shown in Table 7), a schedule table using the ‘turn-off’ technique is shown in

Table 8. Since functional units FU0, FU1 and FU4 have high latency, instructions A1 to A10 are scheduled only on the clean FUs (i.e. FU2, FU3 and FU5).

Similarly, multiply instructions M1, M2, M3 are scheduled on FU2, FU3, FU5,

respectively. It can be observed that compare instructions C1 and C2 are being

scheduled on FU5 and FU2 instead of FU2 and FU3, respectively. With the turn-off technique, apart from achieving improved performance, we can also reduce the leakage energy of the FUs when compared to the worst case.

4.1.2 On-demand turn-on

It is observed in [12] that during some parts of a VLIW program execution the

maximum number of operations that can be executed per cycle is much smaller

than the number of available IFUs. Motivated by this observation, a technique of

IPC (instructions issued per cycle) tuning at loop-level granularity is proposed in

[12]. The basic idea of this technique is to find a suitable IPC for a given loop and

select IPC number of integer functional units for re-scheduling operations and turn

off the remaining integer functional units for reducing the leakage power. Similar to this technique, in our on-demand turn-on technique, we compute a suitable IPC for a given loop and, based on the IPC value, we turn on a high latency FU if required. By default, instructions are scheduled only on clean FUs, and high latency FUs are turned off along with the unused clean FUs. Whenever the loop IPC is greater than the number of clean FUs available, only the required number of high latency FUs are activated, giving priority to those process variation affected FUs which take less latency and consume less leakage.
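The activation rule can be sketched as follows. The interface is assumed for illustration, not taken from the thesis implementation.

```python
# Sketch of the on-demand turn-on decision (assumed interface, not the thesis
# implementation): activate just enough affected FUs when the loop IPC exceeds
# the number of clean FUs, preferring lower latency and lower leakage.
def fus_to_activate(loop_ipc, clean_fus, affected_fus):
    """affected_fus: iterable of (fu_id, latency, leakage) tuples."""
    deficit = loop_ipc - len(clean_fus)
    if deficit <= 0:
        return []  # clean FUs alone satisfy the loop's IPC
    ranked = sorted(affected_fus, key=lambda f: (f[1], f[2]))
    return [fu_id for fu_id, _, _ in ranked[:deficit]]
```

For a loop with IPC 3 and three clean FUs no affected FU is activated, while an IPC of 6 forces all three affected FUs on, matching the two loops of Table 9.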

For example, consider Table 9, in which for loop ‘1’ the IPC is 3, so the high latency FUs are not needed; this low IPC can be satisfied by scheduling instructions only on clean FUs. On the other hand, in loop k the IPC is found to be 6, so there is a need to turn on the high latency FUs.


However, these two techniques have the limitation of requiring recompilation of the sources for every target processor, since the latency of the FUs varies between instances of the target architecture. Additional run-time hardware techniques can be used, in which the latency information of the FUs is stored in the BIOS of the system and loaded at boot-time [25]. Section 4.2 provides a detailed analysis of these two techniques.

4.2 Experimental results

To evaluate the proposed algorithm, we implement and simulate the algorithm

within the Trimaran [37] framework (see Figure 5 in Section 3). We modify the instruction scheduling part of Elcor to incorporate knowledge of the variation map. We configure the simulator to simulate an Itanium [39]-like architecture. Table 3 gives the

default simulation parameters used in our experiments, namely, the CPU/memory

configuration parameters. Table 4 lists the information of array-intensive bench-

mark codes used in our experiments.

We study two variation maps, one with 20% variation (as shown in Table 1)

and another with 40% variation in transistor parameters. To study the effectiveness

of our techniques with these variation maps, we compare them with the best case, the IPC technique [46], the PV-IPC technique and the worst case.

In the best case, all the components in all the IFUs are clean (there is no

variation) so they have normal latency (as shown in Table 3). On the other hand,

in the worst case, all the components of the IFUs take their corresponding worst-case latencies. In the IPC technique proposed in [46], instructions are scheduled based on the IPC value. For example, if the IPC is 3, then only the first 3 IFUs (FU0-FU2) are used. In this technique, all the IFUs are assumed to be clean. PV-IPC implements the IPC technique considering the effect of variations in the IFUs. So, when the IPC is 3, the first three IFUs are used even though they have high latency (are variation affected). In figures, “PV-IPC: 20% variation”, “Turn-off: 20% variation” and

effected). In figures, “PV-IPC: 20% variation”, “Turn-off: 20% variation” and

“On-demand turn-on: 20% variation” indicate the cases of applying ‘PV-IPC’,


‘turn-off’ and ‘on-demand turn-on’ techniques for 20% variation map, respectively.

Similarly, we indicate the different techniques for 40% variation also.

Figure 7. Benchmark-wise IPC for different techniques.

Figure 7 shows the benchmark-wise IPC values for all the above techniques

considering six FUs. We can observe that both ‘turn-off’ and ‘on-demand turn-on’

techniques perform better than the worst-case scenario. For benchmarks like “bmcm”, “mxm” and “tsf”, the ‘turn-off’ technique incurs at most 0.4% performance degradation w.r.t. the best case. For all other benchmarks, because of their high resource requirements, the ‘turn-off’ technique shows performance degradation w.r.t. the best case.

In the case of the ‘on-demand turn-on’ technique, we incur an average performance loss of 1.1% when compared to the best case, for 20% variation. On the other hand, for 40% variation, we incur a loss of 3.6% in performance. In the case of the “wss” benchmark, because of its high resource requirement, we can observe a drastic change in the IPC values for the ‘turn-off’ and ‘on-demand turn-on’ techniques. By comparing the IPC and PV-IPC techniques, we can observe the effect of variations on the IPC value: because functional units FU0 and FU1 are affected and have high latency, the average IPC value for the PV-IPC technique is 6.0% less than that of the simple IPC technique. It can also be noted, due to obvious reasons, that the performance of our proposed techniques decreases with increase in variation.


Figure 8. Leakage energy savings of all the IFUs for different techniques.

Figure 8 shows the benchmark-wise leakage energy savings obtained for the different techniques w.r.t. the worst case. First of all, we would like to point out that the leakage savings for the best case w.r.t. the worst case are small because it uses all the FUs. For the IPC technique, the savings w.r.t. the worst case are more than those of the best case because unused FUs are turned off based on the IPC value. For the PV-IPC technique, because of variations, the savings in leakage energy are less than those of the IPC technique. With our ‘turn-off’ technique, because all the variation-affected FUs are turned off, we have higher leakage energy savings compared to the best-case, IPC and PV-IPC techniques. In the case of the ‘on-demand turn-on’ technique, because we turn on the affected FUs based on the IPC value, we have smaller leakage energy savings when compared to the ‘turn-off’ technique (on average 14.5% less). But for benchmarks “apsi”, “bmcm”, “mxm”, “tsf” and “vpenta” we achieve almost the same leakage energy savings for the ‘turn-off’ and ‘on-demand turn-on’ techniques because of their low IPC (Figure 7). Considering the “wss” benchmark, we can observe that the savings for the ‘turn-off’ technique are much more than for the other techniques because we turn off 3 FUs (process variation affected FUs) even though the IPC value is 4.35 (Figure 7).


Figure 9. Peak temperature of IFUs for different techniques.

For all the techniques, Figure 9 shows the benchmark-wise peak temperatures. For the IPC technique, the peak temperature is higher than that of the best case because more instructions are scheduled on the initial FUs. For PV-IPC, as the variation-affected FUs are used, we can see a drastic increase in the peak temperature w.r.t. the worst case: on average, the peak temperature of the PV-IPC technique is 11.3°C higher than the worst case. Our 'turn-off' technique achieves an average peak temperature reduction of 17.5°C w.r.t. the worst case because the variation-affected FUs are completely turned off. Similarly, the 'on-demand turn-on' technique achieves a 10.0°C reduction in average peak temperature w.r.t. the worst case. In general, we can observe that the peak temperature increases as the variation increases.


Figure 10. Average change in IPC for different techniques, for 6 and 4 IFUs with 20% and 40% variation.

Figures 10, 11, and 12 show the results of a sensitivity analysis with 4 IFUs. For easy comparison, we also show the results when the number of IFUs is 6. Figure 10 shows the impact of our techniques on the IPC value: the average performance degradation compared to the best case and the average performance improvement compared to the worst case, over all the benchmarks. For the IPC technique, we can observe 3% degradation w.r.t. the best case; this degradation further increases for PV-IPC. For our 'turn-off' and 'on-demand turn-on' techniques, the degradation is 14% and 1%, respectively, w.r.t. the best case when 20% variation is considered, and it grows as the variation increases. Similarly, we can observe 20% and 39% improvement w.r.t. the worst case for 20% variation, which reduces as the variation increases. The improvement is negative for 'Turn-off: 40% variation' because 3 out of the 4 FUs are variation-affected.


Figure 11. Average leakage energy savings for different techniques, for 6 and 4 IFUs with 20% and 40% variation.

Figure 11 shows the average leakage energy savings over all the benchmarks. We can observe that more leakage energy is saved by 'turn-off' than by the 'on-demand turn-on', IPC, and PV-IPC techniques. On average, 82% and 52% savings are obtained for 'turn-off' and 'on-demand turn-on', respectively, w.r.t. the best case. When the worst case is considered, we achieve 87% and 65% savings, respectively. As the variation increases, the leakage energy savings decrease.

Figure 12. Average peak temperature reduction for different techniques, for 6 and 4 IFUs with 20% and 40% variation.


Figure 12 shows the average peak temperature reduction over all the benchmarks. We can observe that for the IPC technique the average peak temperature is higher than that of the best case and lower than that of the worst case. For PV-IPC, as instructions are scheduled on variation-affected FUs, we can see a drastic increase in the average peak temperature. For the 'turn-off' and 'on-demand turn-on' techniques, we can observe 12.8% and 5% reductions in average peak temperature compared to the best case, and 17% and 10% reductions compared to the worst case, for 6 and 4 FUs with 20% variation. From Figures 10, 11, and 12 we can observe that as the variation increases, the IPC degradation increases, the leakage energy savings decrease, and the average peak temperature increases.

4.3 Conclusion

We have presented two compile-time techniques, namely 'turn-off' and 'on-demand turn-on', to handle non-uniform latency IFUs and reduce the performance penalty. Apart from achieving nearly the same performance as IFUs without variability, we also achieve nearly a 76.5% reduction in leakage energy consumption along with a 13.3% reduction in the peak temperature of the IFUs as compared to the worst case.


CHAPTER 5

Mobility-list-scheduling

To achieve high performance, VLIW processors use multiple functional units. By exploiting the available instruction-level parallelism in programs, compilers schedule operations on the different functional units of VLIW processors. List-scheduling [43] is commonly used for scheduling operations in VLIW processors to achieve high performance. However, list-scheduling always tends to schedule operations on the first freely available functional unit [46]. As long as functional units of the same kind have the same latency, list-scheduling gives good performance. But functional units of the same kind may have different latencies; this scenario can happen in advanced process technologies due to process variation [26], [47]. In such a situation, list-scheduling may not yield good performance. In order to work with non-uniform latency functional units, we propose a new scheduling algorithm, namely mobility-list-scheduling, a modified version of the list-scheduling algorithm that uses mobility [13] information to schedule operations onto non-uniform latency FUs.

5.1 Motivation

In this chapter, we assume a VLIW processor with six integer functional units (IFUs), where each IFU has either the nominal latency (type-0), 1 cycle of extra latency compared to the nominal latency (type-1), or 2 cycles of extra latency (type-2). Here, an IFU with k cycles of extra latency means that instructions scheduled on that IFU take k extra cycles compared to the nominal latency of the IFU, for all operations. For n IFUs with m possible latency types, we have C(n+m−1, m−1) different latency pattern sets with a total of m^n latency patterns, where a latency pattern set defines the total number of IFUs for each latency type while a latency pattern determines the latency type of each IFU. In other words, a latency pattern


Figure 13. Total execution cycles for benchmarks "apsi" and "bmcm" for all possible latency pattern scenarios after applying the list-scheduling algorithm.

set A is defined as A = {i0, i1, · · · , im−1 | i0 + i1 + · · · + im−1 = n}, where ik is the total number of IFUs with type-k latency and n is the total number of IFUs, while a latency pattern p is defined as p = l(IFU0) l(IFU1) · · · l(IFUn−1), where l(IFUk) is the latency type of IFUk.

So, for 6 IFUs and 3 possible latency types, we have 28 latency pattern sets with a total of 729 (= 3^6) different latency patterns. Considering a compiler which is aware of these latency types, Figure 13 shows the number of execution cycles for all 729 latency patterns for the "apsi" and "bmcm" benchmarks after applying the list-scheduling algorithm. From the figure, it is clear that the total number of execution cycles depends on the latency pattern. For a better understanding, Figure 14 shows the values for 9 benchmarks with all possible latency patterns of the latency pattern set A = {4, 1, 1}. We can observe that for latency pattern p1 = 210000, all benchmarks take more execution cycles than for latency pattern p2 = 000012. From this observation, we can conclude that the position of the high-latency IFUs plays an important role in determining the performance.
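These counts follow directly from the definitions above; a short Python sketch (ours, purely illustrative, not part of the thesis toolchain) reproduces them:

```python
from math import comb

def pattern_counts(n, m):
    """Count latency pattern sets and latency patterns for n IFUs
    with m possible latency types."""
    # A pattern set only records how many IFUs get each latency type,
    # i.e. the number of weak compositions of n into m parts.
    sets = comb(n + m - 1, m - 1)
    # A pattern assigns one of the m types to each of the n IFUs.
    patterns = m ** n
    return sets, patterns

print(pattern_counts(6, 3))  # (28, 729), as stated in the text
```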


Figure 14. Benchmark-wise execution cycles after applying the list-scheduling algorithm for all latency patterns of the latency pattern set {4,1,1}.

Figure 15. Dependency graph (Gn) for Basic Block (BBi).

5.2 Working with non-uniform latency functional units

Figure 15 shows a simple dependency graph Gn, whose n nodes are the operations of a basic block BBi. Gn consists of 10 Add operations (Ai, i ∈ {1, · · · , 10}), 2 Compare operations (C1 and C2), and 3 Multiply operations (M1, M2, and M3).


Cycle IFU[0] IFU[1] IFU[2] IFU[3] IFU[4] IFU[5]

1 A1 A2 C1 A3 M1 A4

2 M2 A5 A6 A7 C2

3 A10

4 M3 A8

5 A9

6

Table 10. VLIW schedule after applying the list-scheduling algorithm for latency pattern 000000.

Cycle IFU[0] IFU[1] IFU[2] IFU[3] IFU[4] IFU[5]

1 A1 A2 C1 A3 M1 A4

2 M2 A5 A6 A7

3 A10

4 M3 C2

5 A9 A8

6

Table 11. VLIW schedule after applying the list-scheduling algorithm for latency pattern 000012.

For illustrative purposes, we consider only these operations. We assume that each IFU can perform Add, Multiply, and Compare operations; however, in a particular cycle only one operation can be performed by an IFU. In advanced process technologies, because of process variation, functional units of the same kind may have different latencies [26], [47]. In this chapter, as described in Section 5.1, we assume that each IFU belongs to one of the three latency types (type-0, type-1, or type-2). We also assume that a type-0 IFU takes 1 cycle to perform an Add or a Compare operation and 3 cycles for a Multiply operation. We tabulate the VLIW scheduling information for a basic block with rows indicating the execution cycles and columns indicating the functional units. Table 10 shows the VLIW schedule for latency pattern 000000 (that is, all the IFUs take the nominal latency) obtained by giving Gn (Figure 15) as input to the list-scheduling algorithm [43].


Cycle IFU[0] IFU[1] IFU[2] IFU[3] IFU[4] IFU[5]

1 A1 A2 C1 A3 M1 A4

2 M2 A6 A7

3 C2

4 A5 A8

5 A9

6

7 A10

8

9

10 M3

11

12

13

14

Table 12. VLIW schedule after applying the list-scheduling algorithm for latency pattern 210000.

Considering a compiler which is aware of these latency types, Table 11 shows the schedule obtained by applying the list-scheduling algorithm to Gn for latency pattern 000012. Note that the schedule lengths (i.e., the number of rows in a schedule table) are the same for both latency patterns 000000 and 000012, as the list-scheduling algorithm has a tendency to schedule instructions on the first freely available IFU and, for the latency pattern considered here, all the high-latency IFUs are towards the end. Now, when latency pattern 210000 is considered, from Table 12 we can see that most of the instructions are scheduled on IFU0, which is a type-2 latency IFU. This results in an increased schedule length (14 cycles) and hence a performance loss as compared to the schedules given in Tables 10 and 11 (6 cycles each). To overcome this problem in the list-scheduling algorithm, we present a modified list-scheduling algorithm, namely mobility-list-scheduling, which uses the mobility [13] information of each operation to schedule the operation on a particular IFU.


MOBILITY-LIST(Gn(V, E), a, m) {
    Compute mobility for all the operations and form mobility classes;
    l = 1;
    repeat {
        for each mobility class k = 0, 1, · · · , t {
            Determine candidate operations Ul,k;
            Sort the operations of Ul,k in ascending order of their latency;
            j = 0;
            repeat {
                Determine unfinished operations Tl,j;
                Select the first Sk ⊆ Ul,k operations, such that |Sk| + |Tl,j| ≤ aj;
                Schedule the Sk operations on IFUs with type-j latency at step l
                    by setting ti = l, ∀i : vi ∈ Sk;
                Ul,k = Ul,k − Sk;
                j = j + 1;
            } until (Ul,k is empty or j == m);
        }
        l = l + 1;
    } until (vn is scheduled);
    return(t);
}

Figure 16. Mobility-list-scheduling algorithm.

5.2.1 Mobility-list-scheduling

The mobility of an operation is the difference between the start times computed by the As-Late-As-Possible (ALAP) and As-Soon-As-Possible (ASAP) algorithms [13]. An operation with zero mobility has to be bound to an IFU with type-0 latency and executed at its earliest start time in order to avoid a performance penalty. On the other hand, a k-mobility operation, k > 0, can be bound to an IFU with type-m latency, where m ≤ k, so that its execution can be postponed by k − m steps. Operations with zero mobility are called critical operations. In Figure 15, operations A1, A2, C1, A5, A6, A10, and M3 are critical operations. In order not to delay these critical operations, whenever possible the mobility-list-scheduling algorithm (shown in Figure 16) avoids scheduling them on IFUs with type-m latency, m > 0. In general, for scheduling k-mobility operations, the mobility-list-scheduling algorithm always gives preference to type-j latency IFUs, where j ≤ k. If such IFUs are not available, the algorithm chooses the next best IFU, which incurs a minimal performance penalty.
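The ASAP/ALAP computation behind these mobility values can be sketched as follows. This is our own minimal Python illustration (the graph encoding and function name are ours, not the thesis implementation); it assumes the dependency graph is given as a map from each operation to its predecessors, with an optional per-operation latency map:

```python
def mobility(preds, latency=None):
    """Mobility = ALAP start time - ASAP start time for each operation.
    preds maps each operation to the list of operations it depends on."""
    latency = latency or {v: 1 for v in preds}   # default: 1-cycle operations
    succs = {v: [] for v in preds}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)

    asap = {}
    def early(v):                                # earliest possible start step
        if v not in asap:
            asap[v] = max((early(p) + latency[p] for p in preds[v]), default=1)
        return asap[v]
    for v in preds:
        early(v)

    length = max(asap[v] + latency[v] - 1 for v in preds)
    alap = {}
    def late(v):                                 # latest start keeping schedule length
        if v not in alap:
            alap[v] = min((late(s) - latency[v] for s in succs[v]),
                          default=length - latency[v] + 1)
        return alap[v]
    for v in preds:
        late(v)

    return {v: alap[v] - asap[v] for v in preds}

# A chain mirroring part of Figure 15 (A1 -> A5 -> A10) plus an
# independent A4: the chain operations are critical (mobility 0).
print(mobility({"A1": [], "A5": ["A1"], "A10": ["A5"], "A4": []}))
# {'A1': 0, 'A5': 0, 'A10': 0, 'A4': 2}
```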


Cycle IFU[0] IFU[1] IFU[2] IFU[3] IFU[4] IFU[5]

1 M1 M2 A1 A2 C1 A3

2 A5 A6 A7 A4

3 A10 C2

4 M3 A8

5 A9

6

Table 13. VLIW schedule after applying the mobility-list-scheduling algorithm for latency pattern 210000.

The input to the mobility-list-scheduling algorithm (shown in Figure 16) is a dependency graph Gn, a latency pattern set A = {a0, a1, · · · , am−1}, and the number of latency types, m. The algorithm selects the set of all operations that can be executed in each schedule step. In each schedule step, it selects operations in increasing order of their mobility. Ul,k is the set of all eligible k-mobility operations that are ready to execute in schedule step l. The operations of Ul,k are sorted in ascending order of their execution latency so that low-latency operations are preferred over high-latency operations. The algorithm then checks for free IFUs in increasing order of their latency types. Tl,j is the set of operations scheduled on IFUs with type-j latency that started earlier and whose execution has not finished by step l. The number of IFUs with type-j latency is denoted by aj. The inner repeat loop in the algorithm explores all IFUs, from low latency to high latency, to schedule the operations of Ul,k.
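The selection performed in one schedule step can be sketched as below. This is a simplified sketch with data shapes of our own choosing (the thesis version runs inside Trimaran's Elcor scheduler): each operation is a (name, latency) pair, ready[k] plays the role of Ul,k, a[j] is the number of type-j IFUs, and busy[j] is the count of earlier, still-running operations occupying type-j IFUs:

```python
def assign_step(ready, a, busy):
    """Map ready operations to IFU latency types for one schedule step:
    lowest-mobility classes first, lowest-latency IFU types first."""
    placed = {}                                   # operation -> latency type j
    free = [a[j] - busy[j] for j in range(len(a))]
    for k in sorted(ready):                       # mobility class 0 first
        # Prefer short operations so they are not delayed behind long ones.
        for op in sorted(ready[k], key=lambda o: o[1]):
            j = next((t for t, f in enumerate(free) if f > 0), None)
            if j is None:                         # no IFU free: op waits
                break
            placed[op] = j
            free[j] -= 1
    return placed

# Critical ops A1, A2 grab the two type-0 IFUs; critical M3 spills to the
# type-1 IFU; the 2-mobility multiply M1 takes the remaining type-2 IFU.
print(assign_step({0: [("A1", 1), ("A2", 1), ("M3", 3)], 2: [("M1", 3)]},
                  a=[2, 1, 1], busy=[0, 0, 0]))
```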

Table 13 shows the schedule obtained by applying our algorithm to Gn with latency pattern 210000. Here, we choose latency pattern 210000 because it gives the worst performance when conventional list-scheduling is applied (see Table 12). From Tables 13 and 14 it is clear that, as A1, A2, and C1 are 0-mobility operations, they are scheduled on the type-0 latency IFUs 2−4. Though A3 is a 1-mobility operation, as there is a free type-0 latency IFU, it is scheduled on IFU 5. As both M1 and M2 are 2-mobility operations, they are scheduled on the available


type-1 and type-2 IFUs, respectively. Though the start time of A4 is 1, because of its 4-mobility its execution is postponed to step 2, giving preference to lower-mobility operations in schedule step 1. In this way, the algorithm completes the schedule with a schedule length of 6 cycles, thus improving the performance as compared to the conventional case (Table 12).

Operation   Start time   Mobility
A1          1            0
A2          1            0
A3          1            1
A4          1            4
A5          2            0
A6          2            0
A7          2            1
A8          2            2
A9          2            2
A10         3            0
C1          1            0
C2          2            4
M1          1            2
M2          1            2
M3          4            0

Table 14. Mobility information for each operation in Graph Gn.

Note that when latency pattern 222222 is considered, our algorithm obviously performs in the same way as the conventional list-scheduling algorithm.

It can be noted that our technique can work in conjunction with any other performance-oriented scheduler, such as basic-block/trace scheduling [42], [43], superblock scheduling [44], and hyperblock scheduling [45]. Our algorithm requires recompilation of the sources once for every target processor (to obtain the latency types of its FUs). However, run-time hardware techniques are available which store the latency information of the FUs in the BIOS of the system [25]; this latency information can be loaded at boot time for use in our algorithm.


Figure 17. % IPC degradation w.r.t. the nominal case for benchmark mxm.

Figure 18. % IPC degradation w.r.t. the nominal case for benchmark tsf.

5.3 Experimental results

To evaluate the proposed algorithm, we implement and simulate it within the Trimaran [37] framework (see Chapter 3). We modify the instruction scheduling part of Elcor to incorporate the required changes. We configure the simulator to simulate an Itanium-like [39] architecture. Table 3 gives the default simulation parameters used in our experiments, namely the CPU/memory configuration parameters. Table 4 lists the array-intensive benchmark codes used in our experiments.

In this section, we study the behavior of the list-scheduling and mobility-list-scheduling algorithms for all 729 latency patterns. Figures 17 and 18 show the % IPC


Figure 19. Number of execution cycles for benchmark mxm.

Figure 20. Number of execution cycles for benchmark tsf.

(instructions per cycle) degradation w.r.t. the nominal case (latency pattern 000000) for benchmarks "mxm" and "tsf", respectively, for all 728 non-nominal latency patterns. From the figures, we can observe that the % IPC degradation of the conventional list-scheduling algorithm is larger than that of the mobility-list-scheduling algorithm.

Figures 19 and 20 give the number of execution cycles for all the latency patterns of the latency pattern set {4,1,1} for benchmarks "mxm" and "tsf", respectively. As list-scheduling has the tendency to schedule instructions on the first freely available functional unit, similar behavior can be observed in both figures.


Figure 21. Average % IPC degradation w.r.t. the nominal case over all benchmarks.

We can observe that there is a significant difference in the number of execution cycles for latency patterns 000012, 010002, and 210000. On the other hand, mobility-list-scheduling requires the least execution time, equal to the lowest number of execution cycles in that particular set.

Figure 21 shows the average % IPC degradation w.r.t. the nominal case over all benchmarks for all 728 non-nominal latency patterns. We can observe that the mobility-list-scheduling algorithm gives a good performance improvement compared to the list-scheduling algorithm. For latency pattern 222222, there are no IFUs with type-0 or type-1 latency; in such a case, our algorithm obviously gives no performance improvement over the conventional list-scheduling algorithm. We observe similar behavior for all the other benchmarks.

5.4 Conclusion

We proposed the mobility-list-scheduling algorithm, a modified version of the list-scheduling algorithm that uses mobility information to schedule operations onto non-uniform latency IFUs. Our experimental evaluation showed that the mobility-list-scheduling technique achieves, on average, a 20.7% performance improvement over conventional list-scheduling when non-uniform latency IFUs are considered.


CHAPTER 6

Conclusion and future work

Due to process variation, components like adders, multipliers, etc., of the different integer functional units (IFUs) in VLIW processors may operate at various speeds, resulting in non-uniform latency IFUs, which can cause performance loss. We have presented two compile-time techniques, namely 'turn-off' and 'on-demand turn-on', to handle these non-uniform latency IFUs and reduce the performance penalty. Apart from achieving nearly the same performance as IFUs without variability, we also achieve nearly a 76.5% reduction in leakage energy consumption along with a 13.3% reduction in the peak temperature of the IFUs as compared to the worst case.

The conventional list-scheduling algorithm schedules instructions on the first freely available IFU, which results in a significant performance loss when the first free IFU is of type-1 or type-2 and critical instructions are scheduled on it. We proposed the mobility-list-scheduling algorithm, a modified version of the list-scheduling algorithm that uses mobility information to schedule operations onto non-uniform latency IFUs. Our experimental evaluation shows that mobility-list-scheduling achieves, on average, a 20.7% performance improvement over conventional list-scheduling when non-uniform latency IFUs are considered.

As future work, one can explore compile-time techniques that work with non-uniform latency clustered VLIW architectures.


List of publications related to the thesis

[a] Nayan V. Mujadiya, "Instruction scheduling for VLIW processors under variation scenario," in International Symposium on Systems, Architectures, Modeling, and Simulation, July 2009, pp. 33–40.

[b] Nayan V. Mujadiya and M. Mutyam, “Instruction Scheduling on Variable La-

tency Functional Units of VLIW Processors” (to be communicated).


References

[1] A. Datta et al., "Speed binning aware design methodology to improve profit under process variations," in Asia and South Pacific Design Automation Conference, Sept. 2004, pp. 712–717.

[2] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parame-ter variations and impact on circuits and microarchitecture,” in DAC ’03: Proceed-ings of the 40th conference on Design automation. New York, NY, USA: ACM,2003, pp. 338–342.

[3] K. A. Bowman, A. R. Alameldeen, S. T. Srinivasan, and C. B. Wilkerson, “Impactof die-to-die and within-die parameter variations on the throughput distributionof multi-core processors,” in ISLPED ’07: Proceedings of the 2007 internationalsymposium on Low power electronics and design. New York, NY, USA: ACM,2007, pp. 50–55.

[4] O. S. Unsal et al., "Impact of parameter variations on circuits and microarchitecture," in IEEE Micro, Nov. 2006, pp. 30–39.

[5] S. Nassif, “Modeling and analysis of manufacturing variations,” Custom IntegratedCircuits, 2001, IEEE Conference on., pp. 223–228, 2001.

[6] S. Nassif, “Within-chip variability analysis,” Electron Devices Meeting, 1998. IEDM’98 Technical Digest., International, pp. 283–286, Dec 1998.

[7] K. Bowman, S. Duvall, and J. Meindl, “Impact of die-to-die and within-die pa-rameter fluctuations on the maximum clock frequency distribution for gigascaleintegration,” Solid-State Circuits, IEEE Journal of, vol. 37, no. 2, pp. 183–190, Feb2002.

[8] P. S. Zuchowski, P. A. Habitz, J. D. Hayes, and J. H. Oppold, "Process and environmental variation impacts on ASIC timing," in ICCAD '04: Proceedings of the 2004 IEEE/ACM International Conference on Computer-Aided Design. Washington, DC, USA: IEEE Computer Society, 2004, pp. 336–342.

[9] T. Rahal-Arabi et al., "Design and validation of the Pentium 3 and Pentium 4 processors power delivery," in IEEE Symposium on VLSI Circuits, 2002, pp. 220–223.

[10] D. Brooks and M. Martonosi, “Dynamic thermal management for high-performancemicroprocessors,” in High-Performance Computer Architecture, 2001, pp. 171–182.

[11] A. Abdollahi et al., "Leakage current reduction in CMOS VLSI circuits by input vector control," in IEEE Transactions on VLSI Systems, 2004, pp. 140–154.

[12] H. S. Kim, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin, “Adapting instruc-tion level parallelism for optimizing leakage in vliw architectures,” SIGPLAN Not.,vol. 38, no. 7, pp. 275–283, 2003.


[13] G. D. Micheli, “Synthesis and optimization of digital circuits,” McGraw-Hill, 1994.

[14] J. W. Tschanz, J. T. Kao, S. G. Narendra, R. Nair, D. A. Antoniadis, A. P. Ch,S. Member, and V. De, “Adaptive body bias for reducing impacts of die-to-die andwithin-die parameter variations on microprocessor frequency and leakage,” in IEEEJournal Of Solid-State Circuits, 2002, pp. 1396–1402.

[15] S. Narendra, A. Keshavarzi, B. Bloechel, S. Borkar, and V. De, “Forward bodybias for microprocessors in 130-nm technology generation and beyond,” Solid-StateCircuits, IEEE Journal of, vol. 38, no. 5, pp. 696–701, May 2003.

[16] A. Datta, S. Bhunia, J. H. Choi, S. Mukhopadhyay, and K. Roy, “Speed binningaware design methodology to improve profit under parameter variations,” in ASP-DAC ’06: Proceedings of the 2006 conference on Asia South Pacific design automa-tion. Piscataway, NJ, USA: IEEE Press, 2006, pp. 712–717.

[17] A. Agarwal, B. Paul, S. Mukhopadhyay, and K. Roy, “Process variation in embeddedmemories: failure analysis and variation aware architecture,” Solid-State Circuits,IEEE Journal of, vol. 40, no. 9, pp. 1804–1814, Sept. 2005.

[18] Q. Chen, H. Mahmoodi, S. Bhunia, and K. Roy, "Modeling and testing of SRAM for new failure mechanisms due to process variations in nanoscale CMOS," in VTS '05: Proceedings of the 23rd IEEE VLSI Test Symposium. Washington, DC, USA: IEEE Computer Society, 2005, pp. 292–297.

[19] H. Chang and S. S. Sapatnekar, “Full-chip analysis of leakage power under processvariations, including spatial correlations,” in DAC ’05: Proceedings of the 42ndannual conference on Design automation. New York, NY, USA: ACM, 2005, pp.523–528.

[20] M. Mutyam and V. Narayanan, “Working with process variation aware caches,” inDATE ’07: Proceedings of the conference on Design, automation and test in Europe.San Jose, CA, USA: EDA Consortium, 2007, pp. 1152–1157.

[21] M. A. Hussain and M. Mutyam, “Block remap with turnoff: a variation-tolerantcache design technique,” in ASP-DAC ’08: Proceedings of the 2008 conference onAsia and South Pacific design automation. Los Alamitos, CA, USA: IEEE Com-puter Society Press, 2008, pp. 783–788.

[22] S. Ozdemir, D. Sinha, G. Memik, J. Adams, and H. Zhou, “Yield-aware cache archi-tectures,” in MICRO 39: Proceedings of the 39th Annual IEEE/ACM InternationalSymposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society,2006, pp. 15–25.

[23] F. Wang, C. Nicopoulos, X. Wu, Y. Xie, and N. Vijaykrishnan, "Variation-aware task allocation and scheduling for MPSoC," in ICCAD '07: Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design. Piscataway, NJ, USA: IEEE Press, 2007, pp. 598–603.


[24] L. Huang and Q. Xu, “Performance yield-driven task allocation and schedulingfor mpsocs under process variation,” in DAC ’10: Proceedings of the 47th DesignAutomation Conference. New York, NY, USA: ACM, 2010, pp. 326–331.

[25] P. Raghavan, J. Ayala, D. Atienza, F. Catthoor, G. De Micheli, and M. Lopez-Vallejo, "Reduction of register file delay due to process variability in VLIW embedded processors," in Circuits and Systems, 2007. ISCAS 2007. IEEE International Symposium on, pp. 121–124, May 2007.

[26] X. Liang and D. Brooks, “Mitigating the impact of process variations on proces-sor register files and execution units,” in MICRO 39: Proceedings of the 39th An-nual IEEE/ACM International Symposium on Microarchitecture. Washington, DC,USA: IEEE Computer Society, 2006, pp. 504–514.

[27] B. F. Romanescu, M. E. Bauer, S. Ozev, and D. J. Sorin, “Reducing the impact ofintra-core process variability with criticality-based resource allocation and prefetch-ing,” in CF ’08: Proceedings of the 2008 conference on Computing frontiers. NewYork, NY, USA: ACM, 2008, pp. 129–138.

[28] T. Sato and S. Watanabe, “Instruction scheduling for variation-originated variablelatencies,” Quality Electronic Design, International Symposium on, vol. 0, pp. 361–364, 2008.

[29] D. Kannan, A. Shrivastava, S. Bhardwaj, and S. Vrudhul, “Power reduction offunctional units considering temperature and process variations,” in VLSID ’08:Proceedings of the 21st International Conference on VLSI Design. Washington,DC, USA: IEEE Computer Society, 2008, pp. 533–539.

[30] D. Kannan, A. Shrivastava, V. Mohan, S. Bhardwaj, and S. Vrudhula, “Tempera-ture and process variations aware power gating of functional units,” in VLSID ’08:Proceedings of the 21st International Conference on VLSI Design. Washington,DC, USA: IEEE Computer Society, 2008, pp. 515–520.

[31] X. Liang, R. Canal, G.-Y. Wei, and D. Brooks, “Process variation tolerant 3T1D-based cache architectures,” Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pp. 15–26, Dec. 2007.

[32] “HSPICE,” http://www.synopsys.com/products/mixedsignal/hspice/hspice.html.

[33] A. Das, S. Ozdemir, G. Memik, and A. Choudhary, “Evaluating voltage islands in CMPs under process variations,” Computer Design, 2007. ICCD 2007. 25th International Conference on, pp. 129–136, Oct. 2007.

[34] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2008, ISBN 3-900051-07-0. [Online]. Available: http://www.R-project.org

[35] N. Kim, T. Austin, D. Baauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, “Leakage current: Moore’s law meets static power,” Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.



[36] Y.-F. Tsai, A. Ankadi, N. Vijaykrishnan, M. Irwin, and T. Theocharides, “ChipPower: an architecture-level leakage simulator,” SOC Conference, 2004. Proceedings. IEEE International, pp. 395–398, Sept. 2004.

[37] L. N. Chakrapani, J. Gyllenhaal, W.-M. W. Hwu, S. A. Mahlke, K. V. Palem, and R. M. Rabbah, “Trimaran: An infrastructure for research in instruction-level parallelism,” in Lecture Notes in Computer Science, 2004.

[38] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, “Temperature-aware microarchitecture,” SIGARCH Comput. Archit. News, vol. 31, no. 2, pp. 2–13, 2003.

[39] J. Huck, D. Morris, J. Ross, A. Knies, H. Mulder, and R. Zahir, “Introducing the IA-64 architecture,” IEEE Micro, vol. 20, no. 5, pp. 12–23, 2000.

[40] Y.-F. Tsai, D. E. Duarte, N. Vijaykrishnan, and M. J. Irwin, “Characterization and modeling of run-time techniques for leakage power reduction,” IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 11, pp. 1221–1232, 2004.

[41] N. V. Mujadiya, “Instruction scheduling for VLIW processors under variation scenario,” in SAMOS ’09: Proceedings of the 9th International Conference on Systems, Architectures, Modeling and Simulation. Piscataway, NJ, USA: IEEE Press, 2009, pp. 33–40.

[42] J. Fisher, “Trace scheduling: A technique for global microcode compaction,” Computers, IEEE Transactions on, vol. C-30, no. 7, pp. 478–490, July 1981.

[43] S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.

[44] W.-M. W. Hwu, S. A. Mahlke, W. Y. Chen, P. P. Chang, N. J. Warter, R. A. Bringmann, R. G. Ouellette, R. E. Hank, T. Kiyohara, G. E. Haab, J. G. Holm, and D. M. Lavery, “The superblock: an effective technique for VLIW and superscalar compilation,” J. Supercomput., vol. 7, no. 1-2, pp. 229–248, 1993.

[45] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann, “Effective compiler support for predicated execution using the hyperblock,” SIGMICRO Newsl., vol. 23, no. 1-2, pp. 45–54, 1992.

[46] M. Mutyam, F. Li, V. Narayanan, M. Kandemir, and M. J. Irwin, “Compiler-directed thermal management for VLIW functional units,” SIGPLAN Not., vol. 41, no. 7, pp. 163–172, 2006.

[47] E. Chun, Z. Chishti, and T. N. Vijaykumar, “Shapeshifter: Dynamically changing pipeline width and speed to address process variations,” in MICRO 41: Proceedings of the 41st Annual IEEE/ACM International Symposium on Microarchitecture. Washington, DC, USA: IEEE Computer Society, 2008, pp. 411–422.
