
Adaptive and Low-Complexity

Microarchitectures for Power Reduction

Jaume Abella Ferrer

2005

A thesis submitted in fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY / DOCTOR PER LA UPC

Departament d’Arquitectura de Computadors
Universitat Politècnica de Catalunya


Adaptive and Low-Complexity

Microarchitectures for Power Reduction

Jaume Abella Ferrer

2005

Advisor: Antonio Gonzalez Colas

A thesis submitted in fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY / DOCTOR PER LA UPC

Departament d’Arquitectura de Computadors
Universitat Politècnica de Catalunya


Hail, daughters of Zeus!
Grant me the spell of your song.

Celebrate the sacred lineage
of the everlasting immortals,

those born of Gaia
and of starry Uranus,

those born of gloomy Night
and those reared by the briny Pontus.

(...)
And inspire me with this, Muses,

you who from the beginning
dwell in the Olympian mansions,

and tell me which of these came first.

Hesiod – "Theogony"


To you who taught me to tie my shoes,
to you who taught me to play chess,

wherever you may be,
thank you


Abstract

Technology and microarchitecture evolution is driving microprocessors towards higher clock frequencies and higher integration scale. These two factors translate into higher power density, which calls for more sophisticated and expensive cooling systems. Reducing power dissipation can be very beneficial not only for cutting cooling costs, but also for saving energy, increasing performance for a given thermal solution, or extending battery life.

Processors are often designed to achieve high performance for a wide range of applications with different resource requirements. Thus, it is often the case that the resources are underutilized, and since idle resources still waste energy, this underutilization is an opportunity for savings. In general, the structures are sized in such a way that making them larger hardly increases performance, while making them smaller may harm performance for some programs or for some parts of some programs. Thus, there is room to dynamically adapt these structures to cut the energy consumption of those parts that do not contribute to performance. Additionally, this type of worst-case design requires complex, power-hungry structures.

This thesis presents new microarchitectural techniques to reduce the energy consumption and complexity of the main microprocessor structures. We propose new cache memory, issue logic, load/store queue and clustered microarchitecture designs, as well as techniques to dynamically resize these structures. We show that the proposals presented in this dissertation significantly reduce dynamic and leakage energy by means of low-complexity structures and resizing mechanisms.


Acknowledgements

It has been a long road, and many of you have helped me along the way. How can I thank you all? A good question, which these lines attempt to answer. First of all, I want to begin by thanking my thesis advisor, Antonio Gonzalez, for all his support during these years and for his patience with my outbursts, especially at the beginning, when the only thing conferences told me was "rejected". I also want to thank my parents and my sister for providing me with that much-needed refuge; I have always had their support and their confidence in me.

Looking back, I remember the day I joined this department, on the MHAOTEU project. There I met my first true friends, the ones who know me better than I know myself and who have looked after me all this time. Xavi has always been a friend and, often, a big brother. I will never be able to thank him enough for everything he has taught me as a person, nor for how much he has helped me professionally. It has always been a pleasure to work with him, and I hope to keep doing so for a long time to come. When I think of Xavi I also think of Nerina, and even if it annoys them, whenever I remember one of them I remember the other. If I can say I truly love anyone, it is these two. Nerina has often been the counterweight that helped me keep my feet on the ground and protected me from myself. She is the friend I turn to when the ground gives way beneath my feet. Once somebody drew up my birth chart and it said I am a Capricorn with Libra rising; as it happens, Xavi is a Capricorn and Nerina a Libra. Perhaps that is why they are two basic pillars of my life. For me they are always there, and I try to help them whenever they let me, even if I am not very good at it. Rest assured that most of the good things in me are their fault.

It would be unfair not to thank Josep Maria for the many conversations on the way to and from the "super", and for a rapport that few people can offer. May Coca-Cola help you stay just as you are for many years!! (although I reserve the right to veto your jokes!!).

Chronologically, it is now the turn of Alex Pajuelo, another Capricorn. Good people, these Capricorns ;-). I believe we may have shared a thousand coffees, and often conversations that have contributed to our friendship.

Here I have also made another friend for life: Fran. He is a Libra... one of life's coincidences? He is "papa pato", and his friends are his ducklings. They sometimes say that true friends can be counted on the fingers of one hand; he is one of my fingers.

And forgive me for merely listing the rest of you, but I would need another thesis to tell you how important you are to me: Xavi Verdu (always radiating warmth!!), German (and his passion for the PP), Oliver (alias transformer), Ayose (what intimate moments in Madrid and Denver!!), Carmelo (in his day), Marco (who at six in the evening still says "pronto"), Ale, Eduard, Enric, Ramon, Pedro, Llorenc, Suso, Pepe, Fernando, Daniel J., Daniel O., Jordi G., Alex A., Raimir, Ruben Gonzalez, Ruben Gran,... and so many others I have left out.

Incredible as it may seem, I am still not done, because there is also life outside the university!! I want to thank Gazmira for teaching me how important the little things are and for opening her heart to me so sincerely. Thanks also to Carla and Fatima for being close to me and listening when I needed it. Thanks to Carmina, Nay and Liz for trusting me so much.

The road is long and this thesis is only one stage of it. The time has come to continue along the road. Here is the fruit of these years of work.

And let me not forget the official acknowledgements: This work has been partially supported by the Ministry of Education and Science under grants AP2002-3677, TIN2004-07739-C02-01 and TIN2004-03072, the CICYT project TIC2001-0995-C02-01, Feder funds, and Intel Corporation. We would like to thank the anonymous reviewers of the papers in this thesis for their comments.


Contents

Abstract viii
Acknowledgments x
Contents xiii
List of Figures xvii
List of Tables xxi

1 Introduction 1
  1.1 Sources of Power Dissipation 3
  1.2 Motivation 4
  1.3 Power Efficiency Metrics 7
  1.4 Cache Memories 7
    1.4.1 Fast and Slow L1 Data Cache 8
    1.4.2 Low Leakage L2 Cache 8
    1.4.3 Heterogeneous Way-Size Caches 8
  1.5 Issue Logic 9
    1.5.1 Adaptive Issue Queue and Register File 9
    1.5.2 Low-Complexity Floating-Point Issue Logic 10
  1.6 Load/Store Queue 10
  1.7 Clustered Microarchitectures 11
  1.8 Organization 12

2 Evaluation Framework 13
  2.1 Benchmarks 15
  2.2 Tools and simulators 17
    2.2.1 Simplescalar 17
    2.2.2 Wattch 18
    2.2.3 CACTI 19

3 Cache Memories 21
  3.1 Related Work 24
    3.1.1 Low Miss Rate Schemes 24
    3.1.2 Pseudo-Associative Caches 24
    3.1.3 Non-Resizing Low Power Schemes 25
    3.1.4 Resizing Low Power Schemes 25
  3.2 Fast and Slow L1 Data Cache 26
    3.2.1 Energy and Delay Models in CMOS Circuits 26
    3.2.2 Criticality 27
    3.2.3 Cache Organizations 28
    3.2.4 Performance Evaluation 30
    3.2.5 Conclusions 39
  3.3 IATAC: Low Leakage L2 Cache 40
    3.3.1 Predictors for L2 Caches 40
    3.3.2 Performance Evaluation 49
    3.3.3 Conclusions 59
  3.4 Heterogeneous Way-Size Caches 60
    3.4.1 Heterogeneous Way-Size Cache (HWS Cache) 60
    3.4.2 HWS Cache Evaluation 65
    3.4.3 Dynamically Adaptive HWS cache (DAHWS cache) 74
    3.4.4 DAHWS Cache Evaluation 78
    3.4.5 Conclusions 85

4 Issue Logic 87
  4.1 Related Work 89
    4.1.1 Basic CAM-based Approaches 90
    4.1.2 Matrix-based Approaches 92
    4.1.3 Issue Logic Based on Dynamic Code Pre-Scheduling 93
    4.1.4 Issue Logic Based on Dependence Tracking 93
  4.2 Adaptive Issue Queue and Register File 94
    4.2.1 Baseline Microarchitecture 94
    4.2.2 Adaptive Schemes 98
    4.2.3 Performance Evaluation 102
    4.2.4 Conclusions 112
  4.3 Low-Complexity Floating-Point Issue Logic 112
    4.3.1 Proposed Issue Logic Design 113
    4.3.2 Performance Evaluation 122
    4.3.3 Conclusions 127

5 Load/Store Queues 129
  5.1 Related Work 131
  5.2 SAMIE-LSQ: A New LSQ Organization for Low Power and Low Complexity 132
    5.2.1 SAMIE-LSQ 133
    5.2.2 Performance Evaluation 141
    5.2.3 Conclusions 150

6 Clustered Microarchitectures 151
  6.1 Related Work 154
  6.2 Ring Clustered Microarchitecture 155
    6.2.1 Ring Clustered Processor 156
    6.2.2 Performance Evaluation 163
    6.2.3 Conclusions 173

7 Conclusions 175
  7.1 Contributions 177
  7.2 Future Work 178

Bibliography 181


List of Figures

1.1 Power evolution for Intel microprocessors 5
1.2 Power evolution for AMD microprocessors 5
1.3 Cooling system cost with respect to the power dissipation [55] 6
1.4 Temperature distribution of a microprocessor [55] 6

2.1 Pipeline for sim-outorder 18

3.1 Power dissipation compared to the baseline technology for different VTH and VDD values 29
3.2 Load criticality distribution for different cache sizes of the 2-way set-associative slow cache configuration 32
3.3 IPC loss of criticality-based cache for the guided and the random versions w.r.t. the baseline 32
3.4 Performance loss of locality-based and criticality-based organizations w.r.t. the baseline 33
3.5 Miss ratio breakdown of critical and non-critical loads for the 16KB baseline cache 34
3.6 Miss ratio breakdown of critical and non-critical loads for the 32KB baseline cache 35
3.7 Miss ratio breakdown of critical and non-critical loads for the 64KB baseline cache 35
3.8 Dynamic energy consumption for a 16K baseline cache 37
3.9 Dynamic energy consumption for a 32K baseline cache 37
3.10 Dynamic energy consumption for a 64K baseline cache 38
3.11 Leakage energy requirements for the different cache organizations and sizes 39
3.12 Average time between hits to cache lines (time between hits) and the average time from the last access to replacement (time before miss) against varying numbers of accesses. The results correspond to four representative programs (note logarithmic scale) 42
3.13 Structures required for the IATAC mechanism for a 4MB (512KB) L2 cache 43
3.14 Algorithm of the IATAC mechanism 44
3.15 Mechanism to update the decay interval for the adaptive mode control 49
3.16 IPC degradation for the different mechanisms for 512KB and 4MB L2 caches 51
3.17 L2 turn off cache line ratio for the different mechanisms for 512KB and 4MB L2 caches 51
3.18 L2 miss ratio for the different mechanisms for 512KB and 4MB L2 caches 52
3.19 IPC loss for the SPEC CPU2000 benchmarks and a 512KB L2 cache 53
3.20 Number of misses for the SPEC CPU2000 benchmarks and a 512KB L2 cache 54
3.21 Decay interval coefficient of variation 55
3.22 Energy consumption for the different mechanisms for 512KB and 4MB L2 caches 58
3.23 EDP for the different mechanisms for 512KB and 4MB L2 caches 58
3.24 ED2P for the different mechanisms for 512KB and 4MB L2 caches 59
3.25 Associativity utilization for the L1 data cache 61
3.26 Associativity utilization for the L1 instruction cache 62
3.27 Associativity utilization for the L2 unified cache 62
3.28 Indexing functions for a conventional cache (left) and a HWS cache (right) 63
3.29 Example of RIT update for a HWS cache 64
3.30 Number of cache configurations for associativity ranging from 2 to 8, and number of different way sizes ranging from 2 to 6 66
3.31 Hit rate for 2-way set-associative L1 Dcaches 67
3.32 Hit rate for 3-way set-associative L1 Dcaches 68
3.33 Hit rate for 4-way set-associative L1 Dcaches 69
3.34 Example of better behavior of HWS cache with respect to a conventional cache 71
3.35 Signature Size resizing algorithm [38] 75
3.36 DAHWS cache resizing algorithm 77
3.37 Miss rate, percentage of active lines, IPC and number of reconfigurations for the different L1 data cache resizing schemes. The number of reconfigurations is split according to the number of ways whose size is changed 81
3.38 Miss rate, percentage of active lines, IPC and number of reconfigurations for the different L1 instruction cache resizing schemes. The number of reconfigurations is split according to the number of ways whose size is changed 82
3.39 Miss rate, percentage of active lines, IPC and number of reconfigurations for the L2 unified maxDAHWS cache. The number of reconfigurations is split according to the number of ways whose size is changed 84

4.1 Issue logic for an entry of a CAM/RAM-array 90
4.2 Multiple-banked issue queue 95
4.3 Scheme of a read operation 96
4.4 Scheme of a write operation 96
4.5 Heuristic to resize the reorder buffer and the issue queue 100
4.6 IPC for different interval lengths 104
4.7 Reorder buffer occupancy reduction for different interval lengths 104
4.8 IPC loss for the different techniques 106
4.9 Issue queue dynamic energy savings 107
4.10 Issue queue leakage energy savings 107
4.11 Dynamic energy savings for the integer register file and rename buffers w.r.t. the baseline 109
4.12 Leakage energy savings for the integer register file and rename buffers w.r.t. the baseline 109
4.13 Dynamic energy savings for the FP register file and rename buffers w.r.t. the baseline 110
4.14 Leakage energy savings for the FP register file and rename buffers w.r.t. the baseline 110
4.15 Reduction in number of dispatched instructions 111
4.16 IPC loss of IssueFIFO technique w.r.t. the unbounded conventional issue queue 115
4.17 Issue time computation for LatFIFO scheme 116
4.18 IPC loss of LatFIFO technique w.r.t. unbounded conventional issue queue for the FP benchmarks 118
4.19 Example of selection 120
4.20 IPC loss of MixBUFF technique w.r.t. unbounded conventional issue queue for the FP benchmarks 121
4.21 Performance for the integer benchmarks 123
4.22 Performance for the FP benchmarks 124
4.23 Energy breakdown for the different schemes 125
4.24 Normalized power dissipation 126
4.25 Normalized energy consumption 126
4.26 Normalized EDP 126
4.27 Normalized ED2P 126

5.1 IPC of ARB with respect to an ideal unbounded LSQ. Configurations with different number of banks and addresses per bank are shown 134
5.2 SAMIE-LSQ organization 135
5.3 Average number of entries occupied in an unbounded SharedLSQ for different configurations of the DistribLSQ 139
5.4 Number of programs that do not use the AddrBuffer during 99% of their execution for a varying number of SharedLSQ entries 140
5.5 IPC loss of SAMIE-LSQ with respect to the 128-entry conventional LSQ 145
5.6 Number of deadlock-avoidance pipeline flushes per million cycles for SAMIE-LSQ 145
5.7 Dynamic energy consumption for the LSQ 146
5.8 Dynamic energy consumption breakdown for the SAMIE-LSQ 147
5.9 Dynamic energy consumption for the L1 data cache 148
5.10 Dynamic energy consumption for the data TLB 148
5.11 Accumulated active area in mm2 for the LSQ 149
5.12 Active area breakdown for the SAMIE-LSQ 149

6.1 Ring clustered microarchitecture 157
6.2 Steering algorithm for the ring clustered microprocessor 158
6.3 Example of the steering algorithm 159
6.4 Placement alternatives for 8 clusters 160
6.5 High level layout for cluster modules 161
6.6 High level layout for cluster modules with integer and FP independent rings 163
6.7 Steering algorithm for the conventional clustered microprocessor 164
6.8 Speedup of Ring over Conv 166
6.9 Average number of communications per instruction 167
6.10 Average distance per communication 168
6.11 Average delay per communication due to bus contention 168
6.12 Workload imbalance using NREADY figure 169
6.13 Distribution of the dispatched instructions across the clusters 170
6.14 Speedup of Ring over Conv for different bus latencies 171
6.15 Simple Steering algorithm for both Ring and Conv processors 171
6.16 Speedup of Ring+SSA over Conv+SSA 172
6.17 Workload imbalance using NREADY figure with the Simple Steering Algorithm 173


List of Tables

2.1 Compile and run commands for the SPEC CPU2000 16
2.2 Fast forwarded instructions for the SPEC CPU2000 17

3.1 Cache sizes used in the comparison 30
3.2 Processor configuration 31
3.3 Processor configuration 50
3.4 Energy model 56
3.5 Energy breakdown for IATAC mechanism 57
3.6 Feasible configurations for a HWS cache with associativity 2 or 3 and capacity ranging from 8KB to 32KB 65
3.7 3-way set-associative L1 Dcaches (conventional caches are represented in italics) 68
3.8 4-way set-associative L1 Dcaches (conventional caches are represented in italics) 70
3.9 3-way and 4-way set-associative L1 Icaches (conventional caches are represented in italics) 72
3.10 3-way and 4-way set-associative L2 caches (conventional caches are represented in italics) 73
3.11 Processor configuration 79

4.1 Delay and energy for the different components of a multiple-banked register file design 97
4.2 Delays and energy for read/write operations in the sequential and parallel schemes 97
4.3 Processor configuration 103
4.4 Reorder buffer size reduction 106
4.5 Summary of results 112
4.6 Processor configuration 114

5.1 Access time of conventional cache accesses and access time when the physical cache line is known for different cache configurations. The number of bytes per line is 32 in all configurations 141
5.2 Processor configuration 142
5.3 SAMIE-LSQ configuration 142
5.4 Energy consumption of the different types of accesses to a 128-entry conventional LSQ 143
5.5 Energy consumption for the different activities of the SAMIE-LSQ 143
5.6 Area of the different components of the conventional LSQ and SAMIE-LSQ 144

6.1 Area of the main cluster’s blocks 161
6.2 Processor configuration 165
6.3 Evaluated configurations 165


CHAPTER 1

INTRODUCTION

Technology and microarchitecture evolution is driving microprocessors towards higher clock frequencies and higher integration scale. These two factors translate into higher power density, which calls for more sophisticated and expensive cooling systems. Reducing power dissipation can be very beneficial not only for cutting cooling costs, but also for saving energy, increasing performance for a given thermal solution, or extending battery life.

The energy consumption can be classified basically into two categories: dynamic energy consumption and leakage energy (or leakage for short). The dynamic energy consumption is produced by the circuit activity (transitions from 1 to 0, and vice versa), whereas leakage is caused by any powered-on circuit due to the intrinsic characteristics of the CMOS technology and its fabrication process.

Temperature is strongly related to energy consumption. Reducing the energy consumption in the hottest spots of the processor is crucial to prevent the chip from reaching excessively high temperatures. If the temperature is too high, the processor may slow down or even stop the execution of programs. Thus, it may be beneficial to sacrifice some performance to keep the temperature at bearable levels and prevent such temperature emergencies from happening. Temperature emergencies can also be prevented by distributing the activity across larger areas. If we succeed in distributing the activity homogeneously, the peak temperature reached in the chip will be lower and fewer stalls will be required.

1.1 Sources of Power Dissipation

The sources of energy consumption on a CMOS chip can be classified as dynamic and static power dissipation. The dominant component of energy consumption in CMOS is dynamic power consumption, caused by changes in the state of the circuit. A first-order approximation of the dynamic power consumption of CMOS circuitry is given by the formula:

$P = C \cdot V_{DD}^2 \cdot f$

where $P$ is the power, $C$ is the effective switched capacitance, $V_{DD}$ is the supply voltage, and $f$ is the frequency of operation. The dynamic power dissipation arises from two main sources:


• The main source corresponds to the charging and discharging of the circuit parasitic capacitances. Every low-to-high and high-to-low logic transition in a node incurs a change of charge in the associated parasitic capacitance, dissipating power, which translates into heat.

• The short circuit power dissipation is caused by short circuit currents. During a transition on the input of a CMOS gate, both p and n channel devices may conduct simultaneously, briefly establishing a short circuit from the supply voltage to ground. The short circuit power dissipation is typically much less significant.
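As a quick sanity check of the first-order formula above, the following sketch plugs illustrative values into it; the numbers are invented for illustration and are not taken from this thesis.

    # First-order CMOS dynamic power: P = C * VDD^2 * f.
    # All parameter values below are illustrative assumptions.
    def dynamic_power(c_eff_farads, vdd_volts, freq_hz):
        """Return dynamic power in watts from the first-order model."""
        return c_eff_farads * vdd_volts ** 2 * freq_hz

    # Example: 30 nF effective switched capacitance, 1.3 V supply, 2 GHz clock.
    print("%.1f W" % dynamic_power(30e-9, 1.3, 2e9))  # 101.4 W
    # The quadratic VDD term is why voltage scaling is so effective: dropping
    # VDD from 1.3 V to 1.0 V alone would cut this figure by about 41%.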

The static energy consumption is caused basically by leakage currents; thus, static energy is often referred to as leakage energy. Leakage power is growing in CMOS chips. Until recently it was a second-order effect; however, the total amount of leakage grows exponentially with every new technology generation. Different studies [112, 120, 125] predict that it may become as significant as the dynamic power in near-future technologies.

1.2 Motivation

Energy consumption is a concern in current and future microprocessors. As technology evolves, the power dissipation, the energy consumption and the chip temperature become more and more critical. Designing low-power, high-performance structures is essential to make processors more powerful. To show how significant the power consumption is, figures 1.1 and 1.2, taken from [106], show the evolution of the power dissipation over recent years for Intel and AMD processors. The power dissipation grows from generation to generation and, hence, the cooling system becomes more expensive, as shown in figure 1.3 [55]. We observe that the Pentium 4 maximum power dissipation is around 100W, which is over the limit of cheap cooling systems. Further increases in power dissipation imply a huge cooling system cost. Thus, saving energy is crucial for future designs. Additionally, saving energy extends the battery life of laptops and embedded systems.

Another issue strongly related to power dissipation is the temperature reached in the different parts of the chip. Dissipating power increases the temperature of the chip, which requires complex cooling systems to keep it at a bearable level. But the temperature is not distributed homogeneously across the chip, as shown in figure 1.4 [55]. Some structures, like the issue logic, reach higher temperatures than others due to their high power density. Thus, reducing the power dissipation of these structures has a deeper impact on the cooling solution than saving energy in other structures.

Summing up, saving energy is beneficial to extend battery life. It also reduces the cooling system cost, especially if the savings target the hottest spots of the chip, since that helps reduce the maximum temperature.


[Figure: "Maximum power for INTEL processors" — bar chart of watts for Pentium P5, Pentium P54, Pentium MMX, Pentium P6, Pentium II, Pentium III and Pentium IV]

Fig. 1.1: Power evolution for Intel microprocessors

[Figure: "Maximum power for AMD processors" — bar chart of watts for AMD K5, AMD K6, AMD K7 and AMD K8]

Fig. 1.2: Power evolution for AMD microprocessors


Fig. 1.3: Cooling system cost with respect to the power dissipation [55]

Fig. 1.4: Temperature distribution of a microprocessor [55]


We can save energy because processors are often designed for (almost) the worst case, so most of the time many resources are overdesigned. In general, the structures are sized in such a way that making them larger hardly increases performance, while making them smaller may harm performance for some programs or for some parts of some programs. Thus, in this thesis we investigate schemes to dynamically adapt these structures to cut down the energy consumption of those parts that do not contribute to performance. The following sections describe how these issues have been addressed for the different structures of the chip whose complexity, dynamic and/or leakage energy consumption are significant.

1.3 Power Efficiency Metrics

Different metrics related to power efficiency have been proposed to compare different schemes [20]. Depending on what the constraints are, different metrics should be used. Power is adequate when heat is the main constraint, whereas energy is used for comparing schemes where battery lifetime is the strongest constraint. Other metrics, like the energy-delay product (EDP) and the energy-delay-squared product (ED2P), are more appropriate when execution time is also important, as it usually is.

Different systems based on a superscalar processor can have different limitations. For instance, laptops have limitations in their cooling systems, so heat is a constraint (strongly related to power). Additionally, battery lifetime (energy) is also a constraint that must be considered in laptop processor design. If the processor is to be used in a desktop or mainframe, then execution time is a significant factor (EDP). For the highest-performance server-class machines, it may be appropriate to weight the delay part even more (ED2P).
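To make these trade-offs concrete, here is a minimal sketch comparing two invented design points under the metrics above; all numbers are hypothetical.

    # EDP and ED2P comparison of two hypothetical design points.
    def edp(energy_j, delay_s):
        return energy_j * delay_s           # energy-delay product

    def ed2p(energy_j, delay_s):
        return energy_j * delay_s ** 2      # energy-delay-squared product

    baseline = (10.0, 1.00)   # (joules, seconds) -- invented
    proposal = (8.0, 1.05)    # 20% energy savings at a 5% slowdown

    for metric in (edp, ed2p):
        ratio = metric(*proposal) / metric(*baseline)
        print("%s: %.3f" % (metric.__name__, ratio))  # edp: 0.840, ed2p: 0.882
    # The proposal wins under both metrics, but by less under ED2P,
    # which penalizes the slowdown more heavily.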

1.4 Cache Memories

The first structures we deal with are the cache memories. First-level (L1 for short) data and instruction caches have a significant dynamic and leakage energy consumption. Their dynamic energy is very important since L1 caches are accessed very frequently; it is not rare for L1 data and instruction caches to be accessed more than once per cycle. Their contribution to the energy consumption of the chip varies from one microarchitecture to another, but as an example, the 21464 Alpha processor [127] was expected to devote 26% of its dynamic energy to caches. Additionally, cache memories are the structures that occupy most of the area of the chip, especially the L2 cache, and hence they contribute significantly to the total leakage of the chip.

We have proposed different mechanisms to adapt the caches to the program requirements. The following subsections describe the contributions of this thesis to reduce the energy consumption of the cache memories.


1.4.1 Fast and Slow L1 Data Cache

As stated before, processors are often designed for (almost) the worst case. For instance, all data cache accesses are served as soon as possible, although some of them can tolerate large latencies. Since fast caches consume significant dynamic and leakage energy, using these caches for all the accesses is a waste of energy. Our contributions [4] to deal with this issue are as follows:

• We propose using different L1 data cache modules with different latency and energy characteristics to serve the accesses depending on their latency tolerance. The organization of the modules is also studied: we propose schemes with both a flat and a hierarchical organization.

• The memory instructions are classified dynamically to make them access the most suitable module. Thus, we have some degree of control over the performance and the energy consumption, because non-critical instructions can be served from a slow module whose energy consumption is low.
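A minimal sketch of one way such a dynamic classification could work, assuming a PC-indexed table of 2-bit saturating counters; this is an illustration of the idea, not the exact mechanism proposed in chapter 3.

    # Hypothetical criticality predictor steering loads to a fast or slow
    # L1 module. Table geometry and update policy are assumptions.
    TABLE_SIZE = 1024
    counters = [2] * TABLE_SIZE      # 2-bit counters, start weakly critical

    def _index(pc):
        return (pc >> 2) % TABLE_SIZE

    def choose_module(pc):
        return "fast" if counters[_index(pc)] >= 2 else "slow"

    def train(pc, was_critical):
        # was_critical: e.g. the load's result stalled dependent instructions.
        i = _index(pc)
        if was_critical:
            counters[i] = min(3, counters[i] + 1)
        else:
            counters[i] = max(0, counters[i] - 1)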

1.4.2 Low Leakage L2 Cache

Most current microprocessors have an on-chip L2 cache. A big budget of transistors is devoted to this structure, which occupies a large area of the chip. Thus, even if low-leakage transistors are used to implement it, its leakage is noticeable. We have studied state-of-the-art techniques that turn off those cache lines whose contents are not expected to be reused; these techniques dynamically adapt the cache size. Previous techniques were proposed for L1 caches and do not work well for L2 caches, because L2 behavior differs greatly from that of L1 caches. Our contributions [9] to save L2 cache leakage are the following:

• We have found that local predictors do not work well to predict when an L2 cache line can be turned off, and hence global prediction is required.

• We have investigated the relation between the number of accesses to a cache line and the time interval after which it can be turned off. The result is a prediction mechanism that effectively predicts when cache lines can be safely turned off.

• We propose an implementation of such a predictor and show that it achieves significant leakage energy savings, close to an oracle mechanism, with negligible performance degradation.
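The following sketch conveys the flavor of such a global predictor: a shared table maps a line's access count to a decay interval, and the interval grows whenever a line is turned off too early. All structures and constants here are invented; the actual IATAC design is detailed in chapter 3.

    # Hypothetical decay predictor: access count -> predicted idle interval.
    DEFAULT_INTERVAL = 100_000       # cycles; assumed starting point
    decay_interval = {}              # global table shared by all cache lines

    def should_turn_off(access_count, idle_cycles):
        limit = decay_interval.get(access_count, DEFAULT_INTERVAL)
        return idle_cycles > limit

    def on_premature_turn_off(access_count):
        # A turned-off line was re-referenced: the interval was too short.
        old = decay_interval.get(access_count, DEFAULT_INTERVAL)
        decay_interval[access_count] = old * 2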

1.4.3 Heterogeneous Way-Size Caches

Most caches are set-associative, which provides a certain degree of associativity for all the sets. However, we experimentally observe that only a few cache sets actually require that associativity. Thus, conventional set-associative caches are overdesigned because their ways have homogeneous sizes. Additionally, this kind of cache lacks the flexibility to be adapted dynamically, since all the ways have to be resized in concert. To tackle these limitations we propose the following contributions [7]:

• We propose a heterogeneous way-size cache design (HWS cache for short) that better fits the program requirements. The HWS cache still provides associativity for all sets, but the sets partially share the required space.

• The HWS cache is shown to perform similarly to a conventional cache of higher associativity and/or greater capacity. Thus, a HWS cache achieves the same performance but requires lower dynamic and/or leakage energy.

• An algorithm is proposed to dynamically resize the HWS cache. It is shown to be much more adaptable than a conventional cache.
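The core indexing idea can be sketched as follows: each way is indexed modulo its own number of sets, so ways of different sizes can coexist. The geometry below is an invented example, not a configuration evaluated in the thesis.

    # Indexing a cache whose ways have heterogeneous sizes.
    LINE_BYTES = 32
    WAY_LINES = [512, 256, 128]      # hypothetical sizes of ways 0..2, in lines

    def set_index(addr, way):
        # Each way uses only as many index bits as its own size requires.
        return (addr // LINE_BYTES) % WAY_LINES[way]

    def lookup(ways, addr):
        # ways[w] maps set index -> stored line address; in hardware all
        # ways are probed in parallel, so a block may hit in any of them.
        line = addr // LINE_BYTES
        for w in range(len(ways)):
            if ways[w].get(set_index(addr, w)) == line:
                return w             # hit in way w
        return None                  # miss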

1.5 Issue Logic

We study the issue logic [1] because it is one of the most important structures in terms of energy and complexity. This structure has a lot of activity, since all the instructions are placed in it and, every time an instruction is issued, all the instructions in this structure have to be checked for dependences on the produced register. The issue logic consumes a lot of energy due to its many comparisons and is one of the main hotspots of superscalar processors. Its contribution to the energy consumption of the chip varies from one microarchitecture to another, but as an example, the 21464 Alpha processor [127] devotes 46% of its dynamic energy to the issue logic and register files.

Furthermore, the register file, whose energy consumption and complexity are also high, is coupled to the issue logic, which increases the energy consumption of this small area of the chip. Thus, reducing the energy consumption and/or complexity of these structures is crucial to reduce the maximum temperature of the chip.

We have designed different mechanisms to adapt the issue logic to the program requirements and reduce its complexity. The following subsections detail the contributions of this thesis in this area.

1.5.1 Adaptive Issue Queue and Register File

The issue queue is sized to provide high ILP for all types of programs. However, a large number of the instructions placed in the issue queue often do not increase performance, because they were dispatched too early. Hence, the issue queue can be dynamically resized to fit the program requirements. The idea of resizing the issue queue is not new: some works have proposed to turn off the empty parts of the queue, or even to turn off parts that would otherwise be used but are not expected to increase performance. However, the state-of-the-art techniques have some limitations that we try to overcome with a new mechanism to dynamically resize the issue queue. The contributions [3, 2] in this area are as follows:


• We show that there is a strong relation between the issue queue and the reorder buffer occupancies that can be exploited to effectively resize the former structure.

• We propose a mechanism based on this relation to resize the issue queue, which is shown to perform better than state-of-the-art approaches in terms of performance and energy.

• Delaying the dispatch of some instructions has a beneficial effect on the register file, since its pressure is reduced. We take advantage of this feature and make the register file also adaptable: it is resized in concert with the issue queue, producing further energy savings.
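A minimal sketch of an occupancy-driven resizing loop in the spirit of this idea; the interval length, thresholds and step size are invented, and the actual heuristic is presented in chapter 4.

    # Hypothetical interval-based issue queue resizing driven by
    # reorder buffer (ROB) occupancy.
    INTERVAL_CYCLES = 10_000            # assumed decision interval
    GROW_AT, SHRINK_AT = 0.90, 0.60     # assumed occupancy thresholds

    def next_iq_size(avg_rob_occupancy, rob_size, iq_size, iq_min, iq_max, step):
        utilization = avg_rob_occupancy / rob_size
        if utilization > GROW_AT and iq_size < iq_max:
            return iq_size + step    # program benefits from more in-flight work
        if utilization < SHRINK_AT and iq_size > iq_min:
            return iq_size - step    # shrink to cut dynamic and leakage energy
        return iq_size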

1.5.2 Low-Complexity Floating-Point Issue Logic

The complexity of the fully-associative issue logic is a concern because it is directly related to its high energy consumption. In the literature there are many attempts to reduce the complexity of this structure by lowering the issue logic associativity. State-of-the-art solutions that effectively reduce the complexity and the energy consumption of the issue logic have been shown to perform well for integer programs, but their performance for FP programs is low.

State-of-the-art mechanisms classify instructions into simple structures based on either their dependences or their latencies. FP programs perform poorly with low-complexity structures based on dependences, but their performance is not much better when using latencies, which require higher-complexity hardware. We contribute [5] to tackle this problem as follows:

• We propose a new issue logic design that fits FP program requirements with low-complexity structures that provide high performance.

• Our solution is based on using both the dependences and the latencies in different stages to achieve a low-complexity design that provides high performance.

• The functional units can be distributed across the small queues of the proposed issue logic, which further reduces the complexity of the design as well as its energy requirements.
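One common low-complexity building block in this design space is dependence-based steering into small in-order FIFOs; the sketch below shows a deliberately simplified version of that general idea with invented structures, not the exact scheme proposed in chapter 4.

    # Simplified dependence-based steering into in-order issue FIFOs.
    def steer(instr, fifos, fifo_of_reg):
        # Prefer a FIFO holding the producer of one of our source registers,
        # so dependent instructions issue in order from the same small queue.
        for src in instr.srcs:
            f = fifo_of_reg.get(src)
            if f is not None and fifos[f]:
                fifos[f].append(instr)
                fifo_of_reg[instr.dest] = f
                return f
        for f, queue in enumerate(fifos):    # otherwise take an empty FIFO
            if not queue:
                queue.append(instr)
                fifo_of_reg[instr.dest] = f
                return f
        return None                          # no FIFO available: dispatch stalls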

1.6 Load/Store Queue

The load/store queue (LSQ for short), similarly to the issue queue, requires many fully-associative lookups to check for dependences. While the issue queue keeps track of register dependences, the LSQ keeps track of memory dependences. The LSQ entries are allocated at dispatch. Every time an address is computed, it is compared with those of many other memory instructions, and the complexity and energy required to perform all these comparisons are significant.


Most state-of-the-art approaches focus on reducing the energy consumption of this structure by filtering accesses or pipelining them. A few works propose alternative approaches that require huge structures to achieve moderate reductions in complexity and dynamic energy. Our contributions [8] in this area are:

• We propose a set-associative design of the LSQ, where the associativity is very low, to reduce the energy consumption and the complexity while the performance remains similar to that of a fully-associative LSQ.

• We enable LSQ entries to hold several memory instructions to further reduce the number of required comparisons.

• Each entry of the LSQ is extended with fields that record information about the position of the data in the L1 data cache and the address translation from the TLB, in order to save energy in these two structures.

• The different components of the new LSQ are dynamically resized to keep the leakage energy requirements low.
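The central trick, searching one small set instead of the whole queue, can be sketched as follows; the geometry is an invented example, not the SAMIE-LSQ configuration evaluated in chapter 5.

    # Set-associative LSQ lookup: hash the address to one small set and
    # compare only within it. NUM_SETS and LINE_BYTES are assumptions.
    NUM_SETS = 16
    LINE_BYTES = 32

    def lsq_set(addr):
        return (addr // LINE_BYTES) % NUM_SETS

    def older_matching_stores(sets, load_addr, load_age):
        # Only one set is searched, which is where the complexity and
        # dynamic energy savings over a fully-associative scan come from.
        return [st for st in sets[lsq_set(load_addr)]
                if st.addr == load_addr and st.age < load_age]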

1.7 Clustered Microarchitectures

Distributing resources is a well-known technique to save energy and reduce complexity and delays. A clustered microarchitecture has a subset of the resources in each cluster. However, conventional clustered microarchitectures must trade workload balance for communications to achieve high performance. The minimum number of communications is achieved by sending all the instructions to the same cluster, but this results in a totally imbalanced scenario. On the other hand, achieving a perfect workload balance requires a very high number of communications, because producers and consumers tend to be in different clusters. Thus, they are conflicting objectives.

Conventional clustered microarchitectures send the instructions to a few clusters to reduce the number of required communications, until some degree of imbalance is reached. Then, instructions are forced to go to other clusters, which results, in general, in extra communications. Thus, the activity is high in a few clusters during some cycles, and some performance is lost due to the forced communications. In this area we contribute [6] as follows:

• We propose organizing the clusters in a ring fashion, in such a way that fast bypasses are set between each cluster and the following one in the ring, instead of forwarding the data to the cluster itself.

• We show how a steering algorithm similar to that of the conventional microarchitecture inherently achieves a better activity distribution for the ring clustered microarchitecture, and increases the performance.

• We study simple steering algorithms and show that the ring clustered microarchitecture dynamically distributes the activity across all clusters with low communication requirements. On the other hand, the conventional microarchitecture is unable to balance the workload and concentrates the activity in a few clusters during long periods, which is very negative in terms of temperature. Thus, the ring microarchitecture succeeds in dynamically distributing the activity while keeping the performance high.
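A minimal sketch of ring-style steering along these lines; the cluster count, occupancy limit and fallback rule are invented for illustration.

    # Hypothetical steering for a ring of clusters: prefer the cluster one
    # hop after the producer's, since the ring gives a fast bypass to it.
    N_CLUSTERS = 4
    OCCUPANCY_LIMIT = 12     # assumed per-cluster issue queue budget

    def steer(producer_cluster, occupancy):
        if producer_cluster is not None:
            target = (producer_cluster + 1) % N_CLUSTERS   # fast ring bypass
            if occupancy[target] < OCCUPANCY_LIMIT:
                return target
        # No in-flight producer (or target full): pick the least loaded cluster.
        return min(range(N_CLUSTERS), key=lambda c: occupancy[c])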

1.8 Organization

The rest of this thesis is organized as follows. Chapter 2 presents the evaluation framework: it details the simulators and tools that have been used and how they have been modified, and it also describes the benchmarks.

Chapter 3 presents the contributions on cache memories; the different approaches as well as the related work are described. Chapter 4 details our proposals on low-power adaptive issue logic schemes. Chapters 5 and 6 present our contributions on load/store queues and clustered microarchitectures, respectively. Finally, chapter 7 presents the main conclusions of this dissertation and points out some ideas for future work.


CHAPTER 2

EVALUATION FRAMEWORK

This chapter describes the evaluation framework that we have used in this thesis. The conclusions and results presented in this dissertation have been obtained with the benchmarks, the simulators and the other tools that are presented in the following sections.

2.1 Benchmarks

The focus of this thesis is reducing the energy consumption and complexity of superscalar processors. Hence, the most suitable benchmarks are those for high-performance systems. We have chosen SPEC CPU2000, an industry-standard CPU-intensive benchmark suite [115]. These benchmarks, developed from real user applications, measure the performance of the processor, memory and compiler on the tested system. The benchmark suite consists of 12 integer programs and 14 floating-point programs. For the sake of generality, we have used both the integer and the FP programs throughout the thesis.

Table 2.1 presents the parameters used to compile each benchmark, as well as the input files used to run them. We have used the ref input data set, but we provide the names of the input files because some of the benchmarks have several inputs. The table is divided into two parts: the first part corresponds to the FP benchmarks, whereas the second part presents the integer benchmarks.

The benchmarks have been compiled using the native HP/Alpha compiler. All flags used to compile are shown in the table except the -non_shared flag, which has been used for all benchmarks. This flag is required to simulate the benchmarks with the Simplescalar simulator [23] used in this thesis.

We have used execution-driven simulations. To simulate significant parts of the programs, we have measured the number of instructions to be skipped in our binaries. The simulations have been run after fast-forwarding a given number of instructions and warming up the caches and tables with 100 million instructions. In the rest of the thesis, the default number of simulated instructions is 100 million unless stated otherwise. Table 2.2 provides the number of fast-forwarded instructions for each benchmark. We skip 200 million instructions for most benchmarks, except those that require larger fast-forwards.


FP programs

Benchmark   Compile options                        Input files
ammp        cc -O3 -lm -DSPEC_CPU2000              ammp.in, init_cond.run.1, init_cond.run.2, init_cond.run.3
applu       f77 -O4                                applu.in
apsi        f77 -O4                                apsi.in
art         cc -O3 -lm                             c756hel.in, a10.img, hc.img
equake      cc -O3 -lm                             inp.in
facerec     f90 -O4                                ref.in, ar1.asc, ar2.asc, pk2.asc, graphPars.dat, imagePars.dat, matchPars.dat, ref-albumPars.dat, ref-probePars.dat, trafoPars.dat
fma3d       f90 -O4                                fma3d.in
galgel      f90 -O4 -fixed                         galgel.in
lucas       f90 -O4                                lucas2.in
mesa        cc -O3 -lm                             mesa.in, numbers
mgrid       f77 -O4                                mgrid.in
sixtrack    f77 -O4                                fort.16, fort.2, fort.3, fort.7, fort.8, inp.in
swim        f77 -O4                                swim.in
wupwise     f77 -O4                                wupwise.in

INT programs

Benchmark   Compile options                        Input files
bzip2       cxx -O3                                input.source
crafty      cc -O3 -lm -DSPEC_CPU2000 -DALPHA      crafty.in
eon         cxx -O2 -lm -I. -DNDEBUG               chair.camera, chair.control.cook, chair.control.kajiya, chair.control.rushmeier, chair.surfaces, eon.dat, materials, spectra.dat
gap         cc -O3 -lm -DSYS_IS_BSD -DSPEC_CPU2000_LP64 -DSYS_HAS_CALLOC_PROTO -DSYS_HAS_MALLOC_PROTO -DSYS_HAS_TIME_PROTO   ref.in, all *.g files
gcc         cxx -O3 -lm -Dalloca=__builtin_alloca  166.i
gzip        cc -O3                                 input.source
mcf         cc -O3 -lm -DWANT_STDC_PROTO           inp.in
parser      cc -O3 -lm -DSPEC_CPU2000              ref.in, 2.1.dict, all words/* files
perlbmk     cxx -O3 -lm -DSPEC_CPU2000_DUNIX       perfect.pl, all lib/* files
twolf       cc -O3 -lm -DSPEC_CPU2000              ref.blk, ref.cel, ref.net, ref.par
vortex      cc -O3 -lm -DSPEC_CPU2000_LP64         lendian.rnv, lendian.wnv, lendian1.raw, lendian2.raw, lendian3.raw, persons.1k
vpr         cxx -O3 -lm -DSPEC_CPU2000             arch.in, net.in, place.in

Table 2.1: Compile and run commands for the SPEC CPU2000


  FP programs                              INT programs
  Benchmark  Fast forwarded                Benchmark  Fast forwarded
             instructions (millions)                  instructions (millions)
  ammp       1,400                         bzip2      200
  applu      200                           crafty     200
  apsi       200                           eon        200
  art        200                           gap        200
  equake     3,400                         gcc        200
  facerec    200                           gzip       200
  fma3d      200                           mcf        3,400
  galgel     200                           parser     200
  lucas      3,400                         perlbmk    200
  mesa       500                           twolf      500
  mgrid      200                           vortex     200
  sixtrack   500                           vpr        200
  swim       500
  wupwise    3,400

Table 2.2: Fast forwarded instructions for the SPEC CPU2000

2.2 Tools and simulators

Three different simulators and tools have been used in this thesis. They are described in the following sections.

2.2.1 Simplescalar

The base simulator that we have used for the different works is the Simplescalar toolset [23]. This simulator has been chosen because it is widely used by the computer architecture community and offers easily modifiable, well-organized source files. Two simulators of the Simplescalar toolset have been used: sim-outorder and sim-cache.

sim-outorder is a detailed performance simulator of modern superscalar microprocessors. We have modified the source code to adapt the simulator to our requirements. The main enhancements that we have made to the baseline microarchitecture are the separation of the reorder buffer and the issue queue, and the modeling of the ports of the register file.

The pipeline for sim-outorder is shown in figure 2.1. Instructions are fetched from the instruction cache, accessing the branch predictor if required. Then, the instructions are decoded and dispatched to the reorder buffer and the issue queue. Load and store instructions are split into an address computation, which is placed in the issue queue and the reorder buffer, and a memory access, which is placed in the load/store queue. Instructions stay in the issue queue until they are issued to the functional units, and in the load/store queue until they commit. When instructions are issued, they wake up the instructions that depend on them. Instructions write back their results when they finish their execution.


[Figure: pipeline diagram with stages Fetch, Dispatch, Scheduler (plus a memory scheduler), Exec, Mem, Writeback and Commit, connected to the Icache, Dcache, ITLB, DTLB, the L2 cache and virtual memory.]

Fig. 2.1: Pipeline for sim-outorder

At this point, unresolved branches find out whether they were mispredicted. In case of misprediction the pipeline is flushed. Finally, the instructions commit and leave the pipeline. Store instructions access memory at this stage.

sim-cache is a cache simulator that provides cache statistics (hits, misses, replacements, etc.), with much faster simulations than sim-outorder, which simulates the whole processor.

2.2.2 Wattch

We have also used the Wattch simulator [21], which is an architecture-level power and performance simulator based on Simplescalar. Wattch adds activity counters to the sim-outorder simulator, and estimates the energy consumption of the different structures using the CACTI tool [111].

The main processor units that Wattch models fall into four categories:

• Array Structures: Data and instruction caches, cache tag arrays, all register files, register alias table, branch predictors, the reorder buffer, and large portions of the issue queue and the load/store queue.

• Fully Associative Content-Addressable Memories: Issue queue wakeup logic, load/store queue address checks, TLBs (if they are configured as fully-associative).

• Combinational Logic and Wires: Functional units, issue queue selection logic, dependency check logic at the decode stage, and result buses.

• Clocking: Clock buffers, clock wires, etc.

We have enhanced Wattch in the same way as the sim-outorder simulator. The power model has been extended accordingly to keep track of the power dissipation of the modified structures.


2.2.3 CACTI

Finally, cache statistics such as delay, energy per access and area have been drawn from the CACTI tool [111], which is a timing, power and area model for cache memories. This tool has also been used to estimate the energy of other structures, such as the register file, by isolating some components and sizing them properly.

The CACTI model has the following parameters:

• Cache size

• Associativity (direct-mapped, set-associative or fully-associative)

• Line size (number of bytes per line)

• Number of ports of each type (read, write and read/write ports)

• Technology

• Number of banks

CACTI presents results in terms of area, energy consumption and delay for the decoders, bitlines, wordlines, comparators, sense amplifiers, routing buses, output drivers, etc.
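
To make the inputs and outputs concrete, they can be grouped as in the following C sketch. The field names are ours, chosen for illustration; they do not correspond to CACTI's actual interface.

/* Illustrative record of a CACTI query and its outputs (names are ours). */
struct cacti_params {
    unsigned size_bytes;      /* total cache size */
    unsigned assoc;           /* associativity; 1 = direct-mapped */
    unsigned line_bytes;      /* bytes per line */
    unsigned read_ports;      /* number of ports of each type */
    unsigned write_ports;
    unsigned rw_ports;
    double   tech_um;         /* technology feature size, e.g. 0.18 */
    unsigned banks;           /* number of banks */
};

struct cacti_results {
    double access_time_ns;    /* delay */
    double energy_per_access; /* dynamic energy per access */
    double area_mm2;          /* area */
};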


CHAPTER 3

CACHE MEMORIES

The relevance of cache memories increases in current and future microprocessors. The latency and energy consumption of caches increase from generation to generation due to different factors.

The fraction of the chip area devoted to caches grows in every new processor generation, mainly due to the L2 cache. It seems reasonable to expect that in a few years L3 caches will often be on-chip. Thus, even though low leakage transistors are used for caches, their large area makes them one of the leakiest structures in the chip.

Cache memories also contribute significantly to the dynamic energy of the chip. The dynamic energy of caches is especially high for L1 data and instruction caches because they are accessed very frequently. For instance, a 4-way superscalar processor may require around one access per cycle to the L1 instruction cache to fetch instructions, and one or two accesses per cycle to the data cache to load or store data, given that around one third of the instructions are typically memory accesses.

Cache latency is another key point. Dynamic and leakage energy of caches can be significantly reduced by adjusting the supply and threshold voltages of the transistors, but then the latency is increased. If we increase latency, some performance is lost and the total processor energy consumption may increase. Thus, voltages must be adjusted carefully if we do not want to produce counterproductive effects. Some cache energy can also be saved if the cache size or associativity is reduced, but the miss rate may increase. In general, the higher the miss rate, the longer the execution time. Hence, reducing the cache size or associativity saves some energy in the cache, but may cause higher energy consumption in the rest of the chip.

Reducing the cache energy consumption without increasing the cache latency or the miss rate is a challenge. In the following sections of this chapter we present the approaches that we have proposed to tackle this problem. In section 3.1 we present state-of-the-art techniques that address cache energy and performance improvement. Then, we present the different proposals for cache memories of this thesis in sections 3.2, 3.3 and 3.4.


3.1 Related Work

This section reviews some related work, which has been classified into different categories for the sake of readability. Literature on cache architectures is very abundant. Here, we just outline some of the works closest to our proposals.

3.1.1 Low Miss Rate Schemes

Several approaches to reducing the miss rate and/or complexity of conventional caches have been proposed. Conventional caches use a subset of bits from the address to index the tag and data arrays, as dictated by the modulo function. Some authors show that other functions provide lower miss rates since they reduce the number of conflicts. Topham et al. [121] propose an implementation of a polynomial modulo function to index the cache. Kharbutli et al. [73] propose index functions based on the use of prime numbers. Both works show a significant reduction in terms of conflict misses with respect to the conventional modulo function, but require some extra hardware and delay to access caches, which may have an impact especially for L1 caches.

Other approaches to reduce the cache miss rate focus on the replacement function. Different replacement functions have been proposed to improve the performance of LRU for L2 caches [130], to take into account the cost of the replacements in L2 caches [67], and to enable the compiler to provide hints to the cache system to replace those cache lines that are not likely to be reused soon [126]. Kim et al. [74] propose a non-uniform cache architecture (NUCA) for wire-delay dominated on-chip caches and a scheme to place the data in this kind of caches to obtain low miss rates and reduce latency. Chishti et al. [36] present an improved version of NUCA.

3.1.2 Pseudo-Associative Caches

Different approaches have been proposed to try to achieve the miss rate of a set-associative cache with the latency and power dissipation of a direct-mapped one. Some implementations focus on reducing the access time [11, 17, 69, 109, 135], while other approaches are based on predicting the way where the data is stored [30, 65, 95]. One way to do it [65] consists of using way predictors to access just one way in set-associative caches. Further work has been done by Powell et al. [95], who propose using way-prediction and selective direct-mapping for non-conflicting accesses. This approach is based on the performance-oriented schemes by Batson and Vijaykumar [17] and by Calder et al. [30].

Power can also be reduced at the expense of latency by accessing the ways sequentially [72].


3.1.3 Non-Resizing Low Power Schemes

Several works [52, 66, 78] have investigated the effects of supply and threshold voltages on performance and power. Heo et al. [59] present a circuit technique to reduce the leakage of bit lines by means of leakage-biased bit lines. Transmission line caches [18] reduce delay and power for L2 caches using on-chip transmission lines instead of conventional RC wires.

There are also different approaches based on changing the cache system organization. Kin et al. [77] propose a small filter cache placed before the conventional L1 cache to serve most of the accesses without accessing the L1 data cache. Several works [51, 64, 105, 122] propose a cache organization with different specialized modules for different types of accesses.

Compression has also been used to save power in caches. Yang and Gupta [131] present a data cache design where frequent values are encoded to reduce the power dissipation. Canal et al. [33] and Villa et al. [124] propose compressing zero-valued bytes. Alameldeen and Wood [13] propose an adaptive compression mechanism for L2 caches.

Physical cache organization and distribution have also been studied. Ghose and Kamble [49] study the effects of using subbanking, multiple line buffers and bitline segmentation to reduce dynamic power dissipation in caches. Su and Despain [119] investigate vertical and horizontal cache partitioning, as well as Gray code addressing, to reduce dynamic power. Hezavei et al. [60] study the effectiveness of different low power SRAM design strategies such as divided bitline, pulsed wordline and isolated bitline.

Other approaches have focused on reducing the number and/or complexity of cache accesses. For instance, Witchel et al. [129] use compile-time information to allow some loads and stores to access the data cache without a tag check whenever it can be guaranteed that the memory access will be to the same line as an earlier access. Memik et al. [82] present filters to detect early whether a cache miss will also miss in the following cache levels. These filters reduce the miss penalty and the power consumption, since the energy spent by the filters is much lower than that of accessing the corresponding cache. Buyuktosunoglu et al. [27] propose an algorithm to gate fetch and issue to reduce the number of accesses to the instruction cache and other structures.

3.1.4 Resizing Low Power Schemes

Energy can be saved by resizing the cache to fit the program requirements. Powell et al. [96] propose to gate the supply voltage (VDD) or the ground path of the cache's SRAM cells whose contents are unlikely to be required. The content of the cache line is lost, but it practically does not leak. Agarwal et al. [10] propose a gated-ground scheme for turning off cache lines while still preserving their contents. This kind of circuit requires only one supply voltage, but the authors highlight that for technologies under 100 nm, small variations in the threshold voltage may destroy the contents of the cell. Thus, if a very high-precision threshold voltage cannot be achieved during the fabrication process, the stability of the cells cannot be guaranteed, making this technique non-viable. Flautner et al. [44] propose to reduce the supply voltage of some cache lines by putting them in drowsy mode (a kind of sleep mode) to reduce their leakage without losing their contents. This kind of circuit requires two supply voltages. To address this limitation, Kim et al. [76] propose super-drowsy caches. They behave similarly to drowsy caches but only one VDD is required. The main drawback of this approach is that cells in the drowsy state are much more susceptible to soft errors [80].

Heuristics that decide when and which cache lines should be turned off make use of these techniques to turn off individual cache lines. Kaxiras et al. [71] and Zhou et al. [137] have recently proposed different techniques to reduce leakage by switching off cache lines whose contents are not expected to be reused, using the gated-VDD approach. Kim et al. [75] present a different heuristic based on drowsy caches. Li et al. [81] have observed that L2 cache subblocks that are present in L1 can be turned off.

Energy savings can also be achieved by dynamically [15, 39] or statically [14, 134, 136] reconfiguring cache characteristics such as cache size, associativity and active ways. Yang et al. [132] study different static and dynamic resizing mechanisms and propose a hybrid mechanism. Dhodapkar and Smith [38] have studied a variety of dynamic cache resizing algorithms. They propose a simple mechanism to effectively resize L1 caches, saving significant power with a low number of reconfigurations.

3.2 Fast and Slow L1 Data Cache

This section proposes different cache organizations that significantly reduce dynamic and leakage energy consumption with a small performance loss. This study aims to guide processor designers in choosing the cache organization with the best trade-off between performance and energy consumption.

This section is organized as follows. Section 3.2.1 introduces the model used to choose supply and threshold voltages. Section 3.2.2 details the definition of criticality that guides some of the evaluated cache systems. Some experimental cache organizations are presented in section 3.2.3 and their results are shown in section 3.2.4. Finally, section 3.2.5 draws the main conclusions of this work.

3.2.1 Energy and Delay Models in CMOS Circuits

CMOS power dissipation [104, 103] is given by equation 3.1, where dynamic power (P_{dyn}) and leakage power (P_{leak}) can be expressed as shown in equations 3.2 and 3.3 respectively.

P = P_{dyn} + P_{leak}   (3.1)

P_{dyn} = p_t \cdot C_L \cdot V_{DD}^2 \cdot f_{CLK}   (3.2)

P_{leak} = I_0 \cdot 10^{-V_{TH}/S} \cdot V_{DD}   (3.3)


In the power equations, p_t is the switching probability, C_L is the load capacitance (wiring and device capacitance), V_{DD} is the supply voltage and f_{CLK} is the clock frequency. I_0 is a function of the reverse saturation current, the diode voltage and the temperature. V_{TH} is the threshold voltage. Finally, S corresponds to the subthreshold slope and is typically about 100 mV/decade. Using equation 3.3, it can be observed that leakage power dissipation decreases by a factor of 10 if V_{TH} increases by 0.1V.

CMOS propagation delay can be approximated by the following simple α-power model [104]¹, where k is a proportionality constant specific to a given technology.

Delay = k \cdot \frac{C_L \cdot V_{DD}}{(V_{DD} - V_{TH})^{\alpha}}   (3.4)

The α power reflects the fact that the transistors may be velocity saturated. α is comprised in the range [1, 2], where α = 1 implies complete velocity saturation and α = 2 implies no velocity saturation. For the 0.18 µm technology assumed in this work, α is typically 1.3.

From equations 3.2 and 3.3, it can be concluded that decreasing V_{DD} reduces both dynamic and leakage power dissipation, and that slightly increasing V_{TH} reduces leakage drastically, but both adjustments increase the propagation delay as equation 3.4 shows. Thus, there is a trade-off between reducing power dissipation and increasing propagation delay with minimum performance loss.

3.2.2 Criticality

In modern superscalar processors, where multiple instructions can be processed in parallel, deciding when a given resource should be assigned to an instruction is a well-known problem. For instance, when two ready instructions require the same functional unit to be executed, only one of them can be chosen. Different policies are used to take these decisions in existing processors, but they are quite simple and inaccurate. Some studies [43, 98, 118] have proposed techniques to heuristically obtain more accurate information and use it to increase performance. Load instructions are especially harmful if they have high latencies and are in the critical path [107]. Thus, the criticality of load instructions is important information to handle them efficiently.

An exact computation of the criticality of each load instruction is not feasible due to its complexity. Thus, we propose an approximation to criticality, and then an accurate predictor of criticality according to our definition. For the proposed criticality-based cache organization, we only need to classify loads into two categories, critical and non-critical, so we need a mechanism to decide when a load can be delayed one or more cycles and when delaying it will significantly degrade performance.

¹ The subthreshold current is considered to be a constant and it is assumed that transistors are in the current saturation mode.


Criticality Estimation

In order to decide whether an instruction is critical or not, we keep track of the number of cycles elapsed from the moment an instruction finishes its execution until it commits. If this number of cycles is greater than a given threshold N, then the instruction is considered non-critical. Intuitively, this criterion indicates that the instruction belongs to a chain of dependent instructions which is not the longest one, or that there is an instruction that stalls the commit process (for instance, a load that misses in the L1 cache), and thus this chain may take some more cycles without performance degradation. In our experiments, after evaluating different values for N, we have observed that N = 4 cycles gives the best results for the chosen cache organizations.

With the previous criterion, the last instruction of every dependence chain is correctly classified, but criticality must also be propagated upwards in the dependence chains. Thus, we consider whether the data produced by an instruction (if any) is immediately used by at least one critical instruction. With this criterion, only those instructions belonging to a chain of dependent instructions that is executed as soon as possible are considered critical.

The criticality predictor has been implemented as a 2048-entry untagged table where each entry is a 2-bit saturating counter whose most significant bit is the prediction. Initially the table indicates that all instructions are critical. The table is updated by every instruction that commits. If the committing instruction has been waiting to commit for less than N cycles (N = 4 in our experiments), or its produced data (if any) is forwarded to another critical instruction through a bypass and the dependent instruction is issued immediately, the corresponding 2-bit counter is incremented; otherwise it is decremented.
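
The following C sketch illustrates this predictor under the assumptions above. The PC-based indexing hash is our own assumption; the thesis specifies only the table size, the 2-bit counters and the update rule.

#include <stdint.h>

#define ENTRIES 2048          /* untagged, PC-indexed */
#define N_THRESHOLD 4         /* commit-wait threshold in cycles */

/* 2-bit saturating counters; initialized to 3 so that every instruction
   starts out predicted critical. */
static uint8_t ctr[ENTRIES];

void init_predictor(void) {
    for (int i = 0; i < ENTRIES; i++) ctr[i] = 3;
}

static unsigned idx(uint64_t pc) { return (pc >> 2) & (ENTRIES - 1); }

/* The prediction is the most significant bit of the counter. */
int predict_critical(uint64_t pc) {
    return (ctr[idx(pc)] >> 1) & 1;
}

/* Called at commit: train towards critical if the instruction waited less
   than N cycles to commit, or if it bypassed its result to a critical
   consumer that issued immediately; otherwise train towards non-critical. */
void train(uint64_t pc, unsigned commit_wait_cycles, int fed_critical_consumer) {
    uint8_t *c = &ctr[idx(pc)];
    if (commit_wait_cycles < N_THRESHOLD || fed_critical_consumer) {
        if (*c < 3) (*c)++;
    } else {
        if (*c > 0) (*c)--;
    }
}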

The evaluation section describes how this criticality predictor has been validated.

3.2.3 Cache Organizations

This section describes different cache organizations that are compared to a baseline monolithic 1-cycle latency L1 cache. Our proposals are based on two L1 cache modules implemented with different technology parameters. One of them is a 1-cycle latency cache implemented with the same technology as the baseline; it will be referred to as the Fast Cache in the rest of this work. The second one is a 2-cycle latency cache implemented with lower VDD and higher VTH than the baseline, in order to reduce both dynamic and leakage power dissipation at the expense of increasing the access time; it will be referred to as the Slow Cache in the rest of this work.

According to the formulas described in section 3.2.1, we are interested in decreasing VDD and increasing VTH as much as possible, with the following limitations: these parameters should be technologically feasible, and the latency should be at most 2 times larger than the latency of the baseline cache.


[Figure: bar chart; y-axis: power dissipation from 0% to 100% of the baseline technology; x-axis: V_{TH}/V_{DD} pairs from 0.53/1.18 to 0.60/1.34; series: P_{dyn} and P_{leak}.]

Fig. 3.1: Power dissipation compared to the baseline technology for different VTH and VDD values

Leakage energy consumption can be analytically estimated, but dynamic energy depends on the program; thus, optimal generic values for VDD and VTH cannot be computed. In order to guide the selection of these values, figure 3.1 shows different valid combinations that can be chosen and the expected dynamic and leakage power dissipation compared to the baseline technology. The assumed parameters for the baseline technology are VDD = 2.0V and VTH = 0.55V [28]. The rest of the parameters are described in section 3.2.1.

We have chosen a VTH = 0.57V and VDD = 1.28V technology for the Slow Cache because it reduces both power dissipation sources to the same percentage.
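
As a sanity check, the chosen point can be verified directly from equations 3.2, 3.3 and 3.4; the technology-specific constants (p_t, C_L, f_{CLK}, I_0, k) cancel in the ratios. The following minimal C program, assuming S = 100 mV/decade and α = 1.3 as stated in section 3.2.1, reproduces the roughly 40% dynamic and leakage power of the chosen point, with a delay increase that stays under the 2x limit.

#include <math.h>
#include <stdio.h>

int main(void) {
    const double vdd0 = 2.0, vth0 = 0.55;   /* baseline technology [28] */
    const double vdd  = 1.28, vth = 0.57;   /* slow cache technology */
    const double S = 0.1, alpha = 1.3;

    double pdyn  = (vdd * vdd) / (vdd0 * vdd0);               /* eq. 3.2 */
    double pleak = pow(10.0, -(vth - vth0) / S) * vdd / vdd0; /* eq. 3.3 */
    double delay = (vdd  / pow(vdd  - vth,  alpha))
                 / (vdd0 / pow(vdd0 - vth0, alpha));          /* eq. 3.4 */

    printf("Pdyn  = %.0f%% of baseline\n", 100.0 * pdyn);   /* ~41%  */
    printf("Pleak = %.0f%% of baseline\n", 100.0 * pleak);  /* ~40%  */
    printf("Delay = %.2fx baseline\n", delay);              /* ~1.6x */
    return 0;
}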

Proposed Cache Organizations

Two different cache organizations are proposed. The first one is a hierarchical locality-based cache system where the fast cache is the first level data cache, the slow cache is the second level cache, and the baseline's second level cache is the third level cache. In this organization the slow cache should be larger than the fast cache to be useful.

The second one is a flat criticality-based cache system with no inclusion (some data contained in the fast cache may not be in the slow cache and vice versa). Both the fast and the slow caches are always accessed in parallel. If a critical load hits in the slow cache and misses in the fast cache, the cache line is copied from the slow to the fast cache. If a critical load misses in both caches, the data fetched from the following cache level is allocated only in the fast cache. If a non-critical load hits in at least one of the two caches, the data is not copied from one cache to the other. If a non-critical load misses in both caches, the data fetched from the following cache level is allocated only in the slow cache. Finally, if a store hits in at least one cache there is no data copy; if it misses, assuming a write-allocate policy, the data is fetched into the fast or the slow cache depending on the criticality of the store instruction.
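
This policy can be summarized with the following C sketch. The function and type names are illustrative stand-ins (the thesis does not define an implementation), and the store path assumes the write-allocate policy mentioned above.

#include <stdbool.h>

typedef struct cache cache_t;           /* opaque L1 module */
extern cache_t fast_cache, slow_cache;
extern bool lookup(cache_t *c, unsigned long line_addr);
extern void copy_line(cache_t *from, cache_t *to, unsigned long line_addr);
extern void fill_from_next_level(cache_t *c, unsigned long line_addr);

void handle_access(unsigned long line_addr, bool critical, bool is_store) {
    bool in_fast = lookup(&fast_cache, line_addr);  /* both caches are     */
    bool in_slow = lookup(&slow_cache, line_addr);  /* probed in parallel  */

    if (in_fast || in_slow) {
        /* Only a critical load promotes a slow-cache hit to the fast cache;
           any other hit in either cache causes no copy. */
        if (critical && !is_store && !in_fast && in_slow)
            copy_line(&slow_cache, &fast_cache, line_addr);
        return;
    }
    /* Miss in both L1 modules: allocate according to criticality
       (write-allocate, so stores follow the same rule as loads). */
    fill_from_next_level(critical ? &fast_cache : &slow_cache, line_addr);
}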


             Hierarchical /       Hierarchical /
  Baseline   Criticality-based    Criticality-based
             (3-way slow)         (2-way slow)
  L1         Fast    Slow         Fast    Slow
  16K        4K      12K          4K      8K
  32K        8K      24K          8K      16K
  64K        16K     48K          16K     32K

Table 3.1: Cache sizes used in the comparison

Another important consideration is the cache size. Most existing processors have data caches whose size is in the range [16KB, 64KB]. Table 3.1 describes the different cache sizes used to compare the different alternatives. All caches described have 32-byte cache lines with 2 read/write ports. The baseline and fast caches are 2-way set-associative.

As table 3.1 shows, there are two versions of both proposals. In the first one, the total size is the same as the baseline (the slow cache is 3-way set-associative), and in the second one the total size is smaller than the baseline but the slow cache has the same associativity as the fast one (the slow cache is 2-way set-associative). The cache parameters for both proposals (the hierarchical and the criticality-based ones) are exactly the same, so their performance and power dissipation are comparable.

3.2.4 Performance Evaluation

This section evaluates the accuracy of the criticality predictor, and the performance and power dissipation of the different cache organizations in a superscalar processor.

Experimental Framework

Our power dissipation and performance results are derived from Wattch [21] as described in section 2.2. The programs used are the whole SPEC CPU2000 benchmark suite [115]. Table 3.2 shows the processor parameters.

Criticality Evaluation

This section describes the experiments done in order to verify the effectiveness of the criticality detection mechanism. Figure 3.2 shows the distribution of critical loads for the criticality-based 2-way set-associative slow cache configuration. The 3-way set-associative slow cache configuration, the locality-based organizations and the baseline show similar results (they differ by less than 3% in all cases), so their critical load distributions are not depicted.


  Parameter                         Value
  Fetch/Decode/Issue/Commit width   4 instructions/cycle
  Issue queue size                  40 entries
  Reorder buffer size               64 entries
  Register file                     80 INT + 80 FP
  Int ALUs                          3 (1 cycle)
  Int Mult/Div                      1 (3-cycle pipelined mult, 20-cycle non-pipelined div)
  FP ALUs                           2 (2 cycles, pipelined)
  FP Mult/Div                       1 (4-cycle pipelined mult, 12-cycle non-pipelined div)
  Memory ports                      2
  Branch predictor                  Hybrid: 2K-entry gshare, 2K-entry bimodal and 1K-entry metatable
  BTB                               2048 entries, 4-way
  L1 Icache                         64KB, 2-way, 32-byte lines, 1 cycle latency
  L1 Dcache                         2-way, 32-byte lines
  L2 unified cache                  512KB, 4-way, 64-byte lines, 10 cycles latency
  Memory                            50 cycles, 2 cycles interchunk
  Data TLB                          128 entries, 30 cycles miss penalty
  Instruction TLB                   128 entries, 30 cycles miss penalty

Table 3.2: Processor configuration

The percentage of critical loads is higher for integer programs because integer applications exhibit less ILP than floating point ones. As the figure shows, SpecINT2000 has nearly twice the percentage of critical loads (around 60%) of SpecFP2000 (around 30%) across all cache configurations.

After studying what percentage of loads is considered critical, the next step is to verify that those loads are really critical. In order to verify that the criticality criterion detects the critical loads, we compare the execution of the criticality-based 2-way slow cache organization versus the baseline in two ways:

• The loads considered as critical or non-critical are treated as critical or non-critical respectively.

• The same percentage of loads that were considered critical in the previous simulation is considered critical, but this time the loads are chosen randomly.

As figure 3.3 shows, the criticality scheme achieves significantly higher performance than the random scheme across all cache sizes. When loads are chosen as critical according to the criticality criterion, the IPC loss is much lower than with the randomly chosen scheme, so it can be concluded that the criticality criterion gives a good classification of loads that can be used to guide the criticality-based cache organization.


[Figure: bar chart; y-axis from 0% to 100%; bars for SPECINT, SPECFP and SPEC at each configuration (4K+8K, 8K+16K, 16K+32K); each bar is split into critical and non-critical loads.]

Fig. 3.2: Load criticality distribution for different cache sizes of the 2-way set-associative slow cache configuration

[Figure: bar chart; y-axis: IPC loss from -1% to 5%; bars for SPECINT, SPECFP and SPEC at each configuration (4K+8K, 8K+16K, 16K+32K); series: criticality scheme and random scheme.]

Fig. 3.3: IPC loss of criticality-based cache for the guided and the random versions w.r.t. the baseline


[Figure: bar chart; y-axis: IPC loss from -1% to 3%; bars for SPECINT, SPECFP and SPEC at each configuration (4K+12K (8K), 8K+24K (16K), 16K+48K (32K)); series: locality 3way, criticality 3way, locality 2way and criticality 2way.]

Fig. 3.4: Performance loss of locality-based and criticality-based organizations w.r.t. the baseline

Cache Organizations Comparison

The locality-based and the criticality-based cache organizations have been compared against the baseline using different metrics: performance (IPC), miss ratio, dynamic energy consumption and leakage energy consumption.

Performance. Figure 3.4 shows the IPC loss for both cache organizations versus the baseline; 2way and 3way stand for the 2-way and 3-way set-associative slow cache respectively. The SpecINT2000, SpecFP2000 and SPEC2000 percentages have been averaged using the harmonic means of the IPCs. It can be observed that the locality-based scheme works better than the criticality-based scheme for SpecINT2000, but the criticality-based scheme achieves better results than the locality-based one for SpecFP2000. We have observed that loads can be classified as critical or non-critical, but it is common that the data fetched by a non-critical load is reused by a critical one and vice versa. Due to this, if there is no capacity limitation in the fast cache, it is better to fetch all data to the fast cache than to fetch some data to the slow cache if it has to be fetched later to the fast cache by a critical load.

In general, integer applications have small working sets, so the locality-based scheme, which always fetches the data to the fast cache, works better than the criticality-based scheme. But for floating point applications with huge working sets, the performance loss due to delaying some critical loads that find their data in the slow cache instead of the fast cache is compensated by retaining for more cycles, in the fast cache, data that will be reused by critical loads, instead of replacing it with data that will only be used by non-critical loads during that period of time.


[Figure: two bar charts, (a) 16KB baseline, critical loads and (b) 16KB baseline, non-critical loads; y-axis from 50% to 100%; bars for baseline, locality 3way, criticality 3way, locality 2way and criticality 2way, grouped by SPECINT, SPECFP and SPEC; each bar is split into hit fast, hit slow and miss.]

Fig. 3.5: Miss ratio breakdown of critical and non-critical loads for the 16KB baseline cache

It can be seen that for FP programs the criticality-based scheme may achieve better results than the baseline due to the beneficial effect of not placing data fetched by non-critical loads in the fast cache.

Another observation is that the larger the caches, the smaller the difference between both cache organizations. In fact, for a 16KB fast cache and a 32KB (or 48KB) slow cache organization, both schemes perform similarly.

Miss Ratios. Figures 3.5, 3.6 and 3.7 show the miss ratios for critical and non-critical loads for different cache sizes. These figures classify loads into L1 hits and misses for the baseline. For the other organizations the loads are classified into three groups: those that hit in the fast cache, those that miss in the fast cache but hit in the slow cache, and those that miss in both L1 caches. Note that the scale of all the figures begins at 50% to better show the hit/miss distribution, because in all cases the fast cache hit ratio is higher than 50%. Hit fast, hit slow and miss stand for hit in the fast cache, miss in the fast cache but hit in the slow cache, and miss in both L1 caches respectively.

We can observe that, in general, for SpecFP2000 the fast cache hit ratio of the critical loads in the criticality-based schemes is slightly higher than the same ratio in the locality-based schemes. For SpecINT2000, higher hit ratios in the fast cache are achieved by the locality-based schemes because the working sets are small and critical loads reuse data fetched by non-critical loads into the fast cache.


[Figure: two bar charts, (a) 32KB baseline, critical loads and (b) 32KB baseline, non-critical loads; same layout as figure 3.5.]

Fig. 3.6: Miss ratio breakdown of critical and non-critical loads for the 32KB baseline cache

[Figure: two bar charts, (a) 64KB baseline, critical loads and (b) 64KB baseline, non-critical loads; same layout as figure 3.5.]

Fig. 3.7: Miss ratio breakdown of critical and non-critical loads for the 64KB baseline cache


The hit ratio of critical loads in the slow cache is higher for both criticality-based schemes: non-critical loads fetch data into the slow cache and critical loads later reuse this data. For the integer applications, this means that some critical loads that in the locality-based schemes find their data in the fast cache find it in the slow cache in the criticality-based schemes, increasing their latency. For the FP programs, this means that some critical loads that in the locality-based schemes do not find their data in the L1 caches (fast and slow) find it in the slow cache in the criticality-based schemes, decreasing their latency. For the FP applications, with large working sets, the data fetched by critical loads stays longer in the fast cache because it is not evicted by data fetched by non-critical loads. Note also that the criticality-based schemes do not enforce inclusion between the fast and slow caches, so more distinct data can be stored in the two caches.

For non-critical loads, it can be observed that criticality-based schemes achieve lower miss ratios in the L1 caches and higher hit ratios in the slow cache, because the classifying mechanism places data fetched by non-critical loads in the slow cache and this data is not evicted by data fetched by critical loads.

After analyzing the miss ratios, it can be understood why the criticality-based scheme does not improve significantly on the locality-based one for either kind of benchmark. Loads can be classified as critical or non-critical, but a critical load can use the data fetched by a non-critical one or vice versa (the same data, or other data contained in the same cache line).

Dynamic Energy Consumption. After understanding the reasons behind the performance differences between the proposed cache organizations, we compare the power requirements of each organization in order to decide which one achieves the best tradeoff between power and performance. Figures 3.8, 3.9 and 3.10 show the percentages of dynamic energy consumption for every scenario. All percentages have been computed with respect to the baseline cache organization. The energy consumption of the locality-based scheme and the criticality-based scheme is broken down into different energy consumption sources: fast cache, slow cache, L2 cache energy increase and, only for the criticality-based scheme, the additional structures used to decide whether a load is critical (the prediction table, and the counters that track whether an instruction that has finished its execution has been in the reorder buffer for more than 4 cycles).

As shown in the figures, criticality-based schemes save nearly 25% of the data cache energy for the 3-way slow cache configuration and nearly 40% for the 2-way slow cache configuration, whereas the locality-based organizations save nearly 60% of the dynamic energy with respect to the baseline. The L2 energy increase becomes negligible for large caches because the miss ratios of the criticality-based and locality-based schemes are very similar to the baseline miss ratio.

Slow cache energy requirements are higher for the criticality-based schemes than for the locality-based ones because the fast and slow caches are accessed in parallel in the criticality-based schemes.


[Figure: bar chart; y-axis from 0% to 100% of the baseline dynamic energy; bars for baseline, locality 3way, criticality 3way, locality 2way and criticality 2way, grouped by SPECINT, SPECFP and SPEC; breakdown: fast cache, slow cache, additional structures, L2 increase and total.]

Fig. 3.8: Dynamic energy consumption for a 16K baseline cache

[Figure: bar chart with the same layout as figure 3.8.]

Fig. 3.9: Dynamic energy consumption for a 32K baseline cache


[Figure: bar chart with the same layout as figure 3.8.]

Fig. 3.10: Dynamic energy consumption for a 64K baseline cache

In order to reduce these energy requirements, we experimented with different cache access policies, such as:

• Accessing only one cache, and accessing the other only in case of a miss.

• Accessing both caches in parallel for critical loads, and only the slow cache for non-critical ones.

• Accessing both caches in parallel for critical loads and, for non-critical loads, first the slow cache followed by the fast cache in case of a miss.

These schemes proved especially harmful for performance because, as shown before, some critical loads miss in the fast cache and hit in the slow cache, so accessing the L1 caches sequentially increases their latency. Additionally, even if load instructions access both caches sequentially, the energy savings are lower than those of the locality-based schemes because, in the criticality-based schemes, store instructions must access both caches in order to maintain cache coherence. In all cases the results did not show a drastic energy reduction, while the performance loss was significant, so we chose the policy with the best performance even though it does not save as much energy as the other organizations.

Finally, as shown in figures 3.8, 3.9 and 3.10, the energy consumption of the additional structures of the criticality-based schemes is quite small.

Leakage Energy Consumption. Figure 3.11 shows the leakage energy consumption of all scenarios with respect to the baseline for the different cache sizes. It can be observed that, for the locality-based and the criticality-based schemes, the fast cache and slow cache leakage energy requirements are the same for each slow cache configuration (2-way or 3-way set-associative).


[Figure: bar chart; y-axis from 0% to 100% of the baseline leakage energy; bars for baseline, locality 3way, criticality 3way, locality 2way and criticality 2way, grouped by cache size (16K, 32K, 64K); breakdown: fast cache, slow cache, additional structures and total.]

Fig. 3.11: Leakage energy requirements for the different cache organizations and sizes

As shown in the figure, the leakage energy consumed by the proposed cache organizations is substantially smaller than that consumed by the baseline architecture, because these organizations use a lower-leakage technology for the slow cache and, in the case of the 2-way set-associative slow cache organizations, have lower total capacity.

3.2.5 Conclusions

This study shows how different L1 multi-banked data cache organizations can obtain performance similar to that of a monolithic cache, while requiring lower dynamic and leakage energy and reducing the cache access time. It has been shown that different technology parameters can be combined to obtain high performance caches with small energy requirements.

Another important conclusion is that the performance improvement that a criticality-based scheme can obtain over a locality-based scheme in some cases does not justify the additional complexity of detecting which instructions are critical, nor the additional energy requirements. The criticality criterion applied to a cache system cannot substantially improve performance because we use it to classify data, whereas criticality is a property of instructions, not of data. Storing some data in a slow cache because a non-critical load fetched it can degrade performance if a critical load requires this data later.

Finally, we can conclude that a locality-based cache organization that combines different supply and threshold voltages can achieve high performance with small energy requirements (both dynamic and leakage), small cache access time and reduced complexity.

3.3 IATAC: Low Leakage L2 Cache

Cache leakage reduction is addressed by several state-of-the-art approaches. These schemes propose to selectively turn off cache lines that are not expected to be needed in the future. Techniques to turn off cache lines are primarily designed for L1 caches. However, L2 caches have significantly different characteristics: L1 caches act as a filter for the L2 cache, such that the L2 cache is only accessed by a relatively small fraction of the total L1 cache accesses. As we show later in the evaluation section, L1-oriented approaches perform poorly when applied to L2 due to this difference in behavior.

This section introduces IATAC (Inter-Access Time per Access Count), a new approach to reducing the leakage of L2 caches by turning off those cache lines that are not expected to be reused. For each cache line, it keeps track of the inter-access time and the number of accesses, providing decay times for each individual line. This mechanism is shown to perform significantly better than all previous approaches in terms of metrics that combine energy and execution time.

The rest of this section is organized as follows. Section 3.3.1 presents our approach and two state-of-the-art mechanisms. The framework used for the evaluation, as well as the results, are presented in section 3.3.2. Finally, section 3.3.3 summarizes the main conclusions of this work.

3.3.1 Predictors for L2 Caches

In this section we introduce IATAC, a new mechanism to turn off L2 cache lines. We assume that cache lines are turned off using the traditional gated-VDD circuit technique [96], which implies losing the contents of the cache lines that are gated. Other techniques that reduce the leakage of cache lines without losing their contents [10, 44, 76] have been introduced in the related work section 3.1, but they introduce soft errors or scalability limitations, and thus we have used gated-VDD. For the purpose of comparison, we also present two state-of-the-art techniques [71, 137] and explain why they do not work for L2 caches. The techniques presented in this section are decay-based techniques. Given a cache line, the decay interval is the number of cycles that have to elapse since the last access before the cache line is turned off.

IATAC

L1 caches filter out most stride 0 accesses. Stride 0 accesses frequently appear where a program variable is repeatedly loaded or stored, which usually happens in loops. These memory references frequently hit the same cache line and rarely incur misses. Thus, the L2 cache has to serve very few references with stride 0.


Furthermore, memory references normally access different L2 cache lines, and the access pattern observed in a given cache line is likely to be observed in other cache lines. Similarly, patterns observed in other cache lines are also likely to be observed, in the future, for this cache line. Hence, local cache line predictors are not expected to work well, as they rely on using the information of a given cache line only to make predictions for that same cache line; no information is shared across different cache lines.

In order to have an accurate prediction, we need a mechanism that performs global prediction of decay intervals. We have observed that patterns can be classified using the number of accesses to cache lines before they are replaced. Figure 3.12 shows the time between hits and the time before miss for four representative programs. Only representative access counts (corresponding to at least 2% of the occurrences) are plotted; other access counts are very rare, so we neglect them. We observe that in some cases the time between hits remains the same for the different patterns (Applu always has a very small time between hits), whereas other programs show quite different times between hits for the different access counts (Facerec, Gcc, Parser). It can also be observed that the time between hits is shorter than the time before miss, which suggests turning off cache lines after waiting the typical time between hits. Another important observation is that the time we have to wait before turning off a cache line varies from one access count to another. For instance, we have to wait at least 500K cycles (the maximum time between hits for access counts larger than 1) before turning off a cache line that has been accessed once in Gcc (time between hits for 2 accesses in the Gcc plot of figure 3.12). After this number of cycles without any access, it is very unlikely that the line will be accessed again. But if the cache line has been accessed 5 times, waiting 175K cycles (the maximum time between hits for access counts larger than 5, which corresponds to the time between hits for 7 accesses in the Gcc plot of figure 3.12) is enough.

Based on these observations, we propose IATAC, a mechanism to turn off L2 cache lines. IATAC observes the inter-access time for each line, classifies this information depending on the number of accesses to the line, and makes predictions based on the global history.

The underlying idea can be illustrated with an example. Let us assume that a given cache line has been accessed for the N-th time. To know whether this is the last access, a decay interval is calculated based upon the inter-access time of those patterns with more than N accesses. Thus, if cache lines with more than N accesses usually have no more than K cycles between consecutive hits, it is convenient to use this value (K) as the decay interval for the accessed line. It is highly likely that either the cache line is accessed again within K cycles or it is not accessed anymore. Thus, it can be safely turned off after K cycles if no further access occurs. We turn off the data part but not the tag, which is always kept on. Clearly, recording all this information has a non-negligible overhead. However, we will show that this overhead pays off due to the high accuracy of the mechanism.

Figure 3.13 depicts the structures required to implement IATAC. For those parameters with two values, the first one corresponds to the configuration for large caches (4MB in our evaluation), while the second one (in parentheses) corresponds to medium-size caches (512KB in our evaluation).


[Figure: four plots (Applu, Facerec, Gcc, Parser); x-axis: number of accesses; y-axis: thousands of cycles (logarithmic scale); series: time between hits and time before miss.]

Fig. 3.12: Average time between hits to cache lines (time between hits) and the average time from the last access to replacement (time before miss) against varying numbers of accesses. The results correspond to four representative programs (note logarithmic scale)


[Figure: per cache line fields: wrong (1 bit), onoff (1 bit), decay (4 bits), thits (13 (10) bits), elapsed (13 (10) bits), counter (6 (5) bits); global structures: acumcounter (32 (16) entries of 10 bits), globaldecay (32 (16) entries of 4 bits) and maxglobaldecay.]

Fig. 3.13: Structures required for the IATAC mechanism for a 4MB (512KB) L2 cache

caches (4MB in our evaluation), while the second one (in parenthesis) correspondsto medium-size caches (512KB in our evaluation). The following are the fields foreach cache line:

• wrong: a field indicating whether the cache line was prematurely turned off.

• onoff: a field indicating whether the cache line is on or off.

• decay: the number of cycles that have to elapse to turn off the line.

• thits: a field to store the maximum time observed between consecutive hits to the cache line.

• elapsed: a counter for the number of cycles elapsed since the line was last accessed.

• counter: a counter to keep track of the number of accesses to the cache line.

Conceptually, only two global structures are required. Both of them need to have as many entries as the maximum number of accesses we want to consider. At this point we assume MAX_ACCESS entries. The global structures are as follows:

• acumcounter: counts how many cache lines have been replaced for each value of the access counter.

• globaldecay: stores the decay required for each value of the access counter.

Figure 3.14 details the mechanism. As it illustrates, different actions are taken depending on whether there is a hit or a miss. Note that if there is an access to a turned-off line and there is a match in the tag, it is considered to be a hit by the algorithm even if the data cannot be served from the cache.


If there is a hit in the tag:

(1)  counter = counter + 1;
(2)  if (elapsed > thits)
         thits = elapsed;
     elapsed = 0;
(3)  if (not onoff) wrong = 1;
(4)  decay = 1;
     for (i = counter+1; i <= MAX_ACCESS; i = i+1)
         if (acumcounter[i] > MIN_COUNT)
             decay = MAX(decay, globaldecay[i]);

If there is a miss in the tag:

(5)  if (thits > globaldecay[counter])
         globaldecay[counter] = globaldecay[counter] * 2;
     else if (thits * 2 < globaldecay[counter])
         globaldecay[counter] = globaldecay[counter] / 2;
(6)  acumcounter[counter] = acumcounter[counter] + 1;
     if (acumcounter[counter] >= MAX_COUNT)
         for (i = 1; i <= MAX_ACCESS; i = i+1)
             acumcounter[i] = acumcounter[i] / 2;
(7)  decay = 1;
     for (i = counter+1; i <= MAX_ACCESS; i = i+1)
         if (acumcounter[i] > MIN_COUNT)
             decay = MAX(decay, globaldecay[i]);
(8)  onoff = 1;
     wrong = 0;
     counter = 1;
     thits = 0;
     elapsed = 0;

Fig. 3.14: Algorithm of the IATAC mechanism

When there is a hit, the line's counter is incremented (1). The maximum time elapsed between hits is also updated (2). If the line was off, the wrong field is set, indicating that there was a misprediction and that the line cannot be turned off until it is replaced (3). Finally, a new decay is provided to the line. Conceptually, in this step the decay is set to the maximum value of globaldecay (4), considering only those positions that correspond to access counts greater than counter, and only if these values are representative; that is, only if these access counts have recently appeared frequently enough.

Different actions are taken when there is a miss. First of all, the inter-access time for cache lines having exactly counter accesses (globaldecay[counter]) is updated. As shown in step (5) of figure 3.14, if thits is greater than the counter position of globaldecay, the inter-access time is multiplied by 2. But if thits is lower than half of the counter position of globaldecay, the inter-access time is divided by 2. This way we update the decay for future predictions, adjusting it to the recent history. The second action (6) is updating acumcounter, which keeps track of which access counts are significant. The counter position is incremented, and if it reaches the maximum value, all positions are divided by 2. This way acumcounter indicates which access counts have recently appeared and how many times with respect to the other access counts. The decay field (7) is updated in the same way as in the case of a hit.


Thus, steps (4) and (7) are exactly the same. Finally, the other cache line fields are initialized (8).

It should be noted that, for any cache line, when elapsed reaches the value stored in decay, if wrong is not set, the cache line is turned off. Finally, the tag array of the cache is always on.
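In pseudo-C, the check performed for each line on every counter tick can be sketched as follows (field names mirror figure 3.13; the code is ours, not the thesis hardware):

    /* Per-line state, mirroring the fields of figure 3.13. */
    struct line_state {
        unsigned char wrong, onoff;
        unsigned decay;    /* decay interval, in ticks */
        unsigned elapsed;  /* ticks since the last access */
    };

    /* Called for every active line on each tick of the global counter. */
    void tick_line(struct line_state *l) {
        if (!l->onoff) return;                    /* data already gated off */
        l->elapsed++;
        if (l->elapsed >= l->decay && !l->wrong)
            l->onoff = 0;  /* turn off the data array; the tag stays on */
    }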

Implementation. Some issues have to be addressed to make the implementation of the IATAC mechanism feasible. We use hierarchical counters [71] to keep track of the cycles elapsed since each cache line was last accessed. Decay intervals are in the range of tens or hundreds of thousands of cycles. Such large decay intervals make it impractical for the counters to count cycles (too many bits would be required). Instead, the counters must tick at a much coarser granularity. Our solution is to use a hierarchical counter mechanism where a single global cycle counter is set up to provide the ticks for smaller cache-line counters. For instance, the global counter may send a tick to the local counters every 1000 cycles, as assumed in our experiments. If we use Gray coding for the local counters, only one bit per counter changes state at any time.
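A minimal sketch of the counter hierarchy, under the 1000-cycle tick assumed above (names and sizes are ours; a real implementation would hold only the Gray-coded value):

    #include <stdint.h>

    #define NUM_LINES   8192   /* 512KB cache, 64-byte lines */
    #define TICK_PERIOD 1000   /* cycles per global tick */

    static uint16_t elapsed_bin[NUM_LINES];   /* conceptual binary count  */
    static uint16_t elapsed_gray[NUM_LINES];  /* what the hardware stores */

    static uint16_t to_gray(uint16_t b) { return b ^ (b >> 1); }

    /* Called once per cycle; forwards a tick to the local counters every
       TICK_PERIOD cycles, so each local increment represents TICK_PERIOD
       cycles. Consecutive Gray codes differ in exactly one bit, so each
       tick flips a single bit of every stored counter. */
    void clock_tick(uint64_t cycle) {
        if (cycle % TICK_PERIOD != 0) return;
        for (int i = 0; i < NUM_LINES; i++) {
            elapsed_bin[i]++;
            elapsed_gray[i] = to_gray(elapsed_bin[i]);
        }
    }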

The thits and elapsed fields should have a maximum value in order to determine the number of bits required. We have observed empirically that values higher than 8192 do not provide any advantage, and even for a medium-size cache, 1024 is enough. Hence, these fields require 13 bits each if the maximum is 8192, or 10 bits if the maximum is 1024. The decay field and the entries of globaldecay do not require as many bits. They can only hold a power of two, since globaldecay positions are set initially to 1 and increased (decreased) by multiplying (dividing) by 2, and decay is set to one of the values stored in globaldecay. There are 13 or 10 different powers of two depending on whether the maximum value is 8192 or 1024, respectively. In any case, 4 bits are enough to encode that number of different values. The maximum number of accesses (MAX_ACCESS in figure 3.14) that is considered sets the length of globaldecay and acumcounter as well as the size of the counter field. Setting MAX_ACCESS to values greater than 32 does not provide higher accuracy. For medium-size caches, this value can be set to 16. Lower values of MAX_ACCESS make IATAC less aggressive and opportunities to save energy are lost. Thus, the length of globaldecay and acumcounter is MAX_ACCESS, and counter requires log2(MAX_ACCESS) bits. The MAX_ACCESS value is strongly related to the L2 cache size.

In the implementation, we use an additional structure that we call maxglobaldecay. It has as many positions as globaldecay. Each position P records the maximum value of globaldecay between positions P+1 and MAX_ACCESS for those positions whose corresponding entry in acumcounter is greater than or equal to MIN_COUNT. Thus, we do not need to compute any maximum when a line is accessed, since it has been previously computed. The experiments show that one position of globaldecay is updated once every 9 misses to the L2 cache, and maxglobaldecay is updated once every 3500 misses.
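Because each position only has to know the filtered maximum of the positions above it, maxglobaldecay can be refreshed with a single backward pass; a sketch (our own code) using the parameter values discussed in this section:

    #define MAX_ACCESS 32   /* 16 for the medium-size cache */
    #define MIN_COUNT  8

    static unsigned globaldecay[MAX_ACCESS + 1];     /* indexed 1..MAX_ACCESS */
    static unsigned acumcounter[MAX_ACCESS + 1];
    static unsigned maxglobaldecay[MAX_ACCESS + 1];

    /* Recompute maxglobaldecay as a suffix maximum over globaldecay,
       considering only access counts that occur frequently enough.
       After the call, maxglobaldecay[P] is the decay that steps (4)
       and (7) of figure 3.14 would compute for counter = P. */
    void update_maxglobaldecay(void) {
        unsigned suffix_max = 1;   /* minimum decay */
        for (int p = MAX_ACCESS; p >= 1; p--) {
            maxglobaldecay[p] = suffix_max; /* max over p+1..MAX_ACCESS */
            if (acumcounter[p] >= MIN_COUNT && globaldecay[p] > suffix_max)
                suffix_max = globaldecay[p];
        }
    }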


The last parameters to be described are MAX_COUNT and MIN_COUNT. MAX_COUNT is the upper bound of the acumcounter entries. While none of the entries of acumcounter reaches this value, the number of occurrences of the different values of counter is recorded. Thus, using large values for MAX_COUNT makes observation periods longer. If these periods are too long, it may be the case that acumcounter stores old records mixed with current ones. Conversely, if these periods are too short, it may be that acumcounter is frequently decreased (dividing all positions by 2) and the counters of frequent values may be set to values under MIN_COUNT. In this case, their inter-access times are ignored when computing the decay, and the mechanism therefore becomes less accurate. Setting MAX_COUNT to 1024 provides a good trade-off. Thus, each position of acumcounter requires 10 bits to store any value between 0 and MAX_COUNT.

We have set MIN_COUNT to 8 in our experiments. That means that only access counts with an acumcounter value equal to or higher than 8 will be considered to make predictions. Lower values make IATAC more conservative, since we always select the maximum decay and the decays of rare access counts are also considered. On the other hand, larger values make IATAC more aggressive because only frequently occurring access counts are considered.

MAX_COUNT and MIN_COUNT must be set in concert. For instance, if we decide to set MAX_COUNT to 16384 (1024 x 16), we must scale MIN_COUNT in the same way (8 x 16) to prevent rare access counts from being considered. If we use a larger value for MAX_COUNT and do not increase MIN_COUNT, acumcounter positions are divided by two less frequently and there is more time for infrequent access counts to reach MIN_COUNT and be considered.

Additional Comments. We have used hierarchical counters for the decay intervals because they require less hardware than conventional counters and expend less energy. The loss of accuracy introduced by this kind of counter is negligible, since the decay intervals are normally some orders of magnitude larger than the update period of the counters (1000 cycles).

The logic to compute the decay interval for a given line may take a few cycles, but it is done after serving the cache access. Thus, the cache latency and access time are not affected. Delaying the computation of the decay interval by a few cycles has little impact on power and performance (decay intervals are normally between 10K and 1000K cycles).

IATAC and the compared mechanisms consider all accesses to the L2 cache, including those due to copying back L1 dirty lines. Whether or not this kind of access is considered does not change the results. We also assume that multiple updates to the structures of IATAC (and the other mechanisms) can be done immediately. If L2 cache accesses are hits, they just update the line's local information. Global structures are affected only by misses. If these updates have to be done sequentially, a small queue with as many entries as outstanding memory misses will suffice to update the predictor properly. Delaying updates by a few cycles has a negligible impact on energy savings, since cache lines are turned off for tens or hundreds of thousands of cycles.

State-of-the-art Approaches

This section describes two state-of-the-art approaches, which also rely on turning off cache lines to save leakage energy. They are extensively evaluated against our proposed technique, IATAC, in section 3.3.2.

Cache Decay. This work [62, 71] relies on the observation that for L1 data caches it is very common to have all the hits to a cache line concentrated in a short period of time. These hits are then followed by a long period without any access to the cache line, which is then finally replaced. Based on this observation they propose two mechanisms.

Fixed Decay Interval. The first mechanism assumes a fixed decay interval for the whole cache and program execution. We will refer to this mechanism as decayN in the rest of this work, where N stands for the decay interval. For instance, if the decay interval is set to 64000 cycles, the mechanism is referred to as decay64K. DecayN provides high power savings with a small performance degradation for L1 data caches, since cache lines are turned off soon after the last hit.

Different programs may achieve maximum leakage reduction with different decay intervals. The reason for this fact is that memory-intensive programs have shorter inter-hit intervals as well as shorter inactive intervals than non-memory-intensive programs. Additionally, different cache lines may require different decay intervals. This fact is particularly significant for irregular applications, where the access patterns of distinct cache lines are quite different.

Due to the much larger size of L2 caches, data take longer to be replaced compared to the L1 data cache. Furthermore, the time elapsed between hits to a given cache line is quite irregular, and it may be even longer than the inactive period after the last hit. Therefore, a long decay interval may not turn off cache lines whose data will not be used again before being replaced. However, a short decay may turn off cache lines too early, increasing the number of misses.

Adaptive Decay Interval. The second mechanism improves upon the first one in the sense that it tries to find individual decay intervals for each cache line, which are adapted dynamically. This mechanism (we will refer to it as decayAdap) tries to choose the smallest possible decay interval for each cache line individually. An L1 cache line normally has a sequence of hits during a brief period, and then remains unused for a long time until it is replaced. DecayAdap increases the decay when a miss occurs soon after turning off a line, as this access would have been a hit if the cache line had not been turned off prematurely. On the other hand, if the miss takes place long after turning off the line, the decay is decreased, as this access would have been a miss anyway. If the time elapsed between turning off the cache line and the corresponding miss is neither too short nor too long, the decay interval remains the same.

This mechanism shows good results for L1 data caches, but it has some drawbacks for L2 caches. First of all, decayAdap uses local history for the predictions. Since L1 caches serve most of the accesses with stride zero, the L2 cache has to serve memory instructions with non-zero strides, which normally access different lines. Hence, the information about the behavior of a particular line is useful for other cache lines if they are accessed by the same memory instructions and, therefore, global history can make much more accurate predictions. Another important drawback of decayAdap resides in the fact that, instead of keeping the tags on to detect whether a cache line has been turned off too early, it tries to infer the cause of misses. If periods between consecutive hits are long enough, cache lines are prematurely turned off and the prediction, which is wrong, is considered successful. Hence, instead of increasing the decay interval to correct the wrong prediction, it is decreased, and the next prediction is also likely to be wrong.

Adaptive Mode Control. The adaptive mode control [137] is also a decay-based technique that uses a global decay interval for the whole cache. This mechanism is similar to decayN, but it allows N to be adapted dynamically depending on the fraction of extra misses introduced by the mechanism. This mechanism does not turn off the tag array, which helps to detect whether or not a miss is incurred by the mechanism itself. Obviously, not turning off the tag array implies that some leakage cannot be saved.

The implementation of this mechanism is similar to that of decayN, but in this case the decay is updated periodically after a certain number of cycles. Figure 3.15 shows how this mechanism works. Ideal misses stands for the number of misses that would have occurred regardless of a line's sleep/active status, whereas sleep misses are those specifically caused by sleep-mode lines. A sleep miss occurs when a matching tag is found but the data portion is in sleep mode. At the end of the period the target error is computed, which corresponds to the ideal number of misses multiplied by a performance factor (PF).

• If the number of sleep misses is higher than 1.5 times the target error, the decay interval (N) is increased in order to be less aggressive. The new decay is 2xN.

• If the number of sleep misses is below 0.5 times the target error, the decay interval is decreased. The new decay interval is N/2.

The performance factor (PF) is set to a power of two (possibly with a negative exponent) and does not change during the program execution. We will refer to this technique as AMCPF in the rest of this work. For instance, if PF is 1/4 we will refer to it as AMC0.25.
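As a sketch (variable names are ours), the periodic update at the end of each adaptation window amounts to:

    /* Adaptive-mode-control-style decay update, run once per window. */
    unsigned long amc_update_decay(unsigned long N, double PF,
                                   unsigned long ideal_misses,
                                   unsigned long sleep_misses) {
        double target_error = PF * (double)ideal_misses;
        if ((double)sleep_misses > 1.5 * target_error)
            N *= 2;    /* too many sleep misses: be less aggressive */
        else if ((double)sleep_misses < 0.5 * target_error)
            N /= 2;    /* few sleep misses: be more aggressive */
        return N;      /* otherwise keep the current decay interval */
    }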

[Fig. 3.15: Mechanism to update the decay interval for the adaptive mode control: the number of total misses (ideal misses + sleep misses) over time is compared against the target error PF*(ideal misses) to decide whether to increase or decrease the decay interval]

This mechanism is shown to be very efficient for L1 caches (both data and instructions), but it has some limitations for the L2 cache. For instance, as we show in section 3.3.2, the decay is not homogeneous across all the cache lines. Thus, assuming the same decay for all the cache lines may miss significant energy savings and/or degrade performance due to extra misses. Another intrinsic problem of this technique resides in the fact that it targets a given error. For some programs, maximum energy savings can be achieved without increasing the miss ratio. Let us assume that the current decay interval is N, PF is 1/2 (or smaller) and the typical period between two consecutive accesses to a given cache line is always in the range (N : 2xN). During this period there will be as many misses as cache accesses, and thus the new decay interval will be set to 2xN. During the following period there will be no sleep misses, making the decay interval decrease back to N. Again there will be as many misses as cache accesses, and the pattern repeats. In the end, during half of the program execution there are as many misses as accesses, which causes significant performance degradation.

3.3.2 Performance Evaluation

This section presents the results for the proposed mechanism as well as for those used for comparison purposes. First we describe the processor configuration. Later we discuss the results in terms of performance and energy.

Experimental Framework

Dynamic power and performance results are derived from CACTI 3.0 [111] and an enhanced version of Wattch [21], as described in section 2.2. We have used the whole SPEC CPU2000 benchmark suite [115]. We have simulated 1 billion instructions for each benchmark instead of the 100M assumed in other works, because the L2 cache needs longer activity periods due to its low number of accesses per cycle. Leakage is modeled taking into consideration the number of bit cells that are active for each mechanism and the 3% area increase [96] of the cache cells due to the logic needed to turn off the cache lines. This increase is assumed to result in 3% additional leakage. The detailed energy model is described later.

Parameter: Value
Fetch, decode, commit width: 8 instructions/cycle
Issue width: 8 INT + 8 FP instructions/cycle
Fetch queue size: 64 entries
Issue queue size: 128 INT + 128 FP entries
Load/store queue size: 128 entries
Reorder buffer size: 256 entries
Register file: 160 INT + 160 FP
IntALUs: 6 (1 cycle)
IntMult/Div: 3 (3-cycle pipelined mult, 20-cycle non-pipelined div)
FP ALUs: 4 (2 cycles, pipelined)
FP Mult/Div: 2 (4-cycle pipelined mult, 12-cycle non-pipelined div)
Memory ports: 4 R/W ports (2 cycles, pipelined)
Branch predictor: hybrid: 2K-entry Gshare, 2K-entry bimodal and 1K-entry metatable
BTB: 2048 entries, 4-way
L1 Icache: 64KB, 2-way, 32-byte lines, 1 cycle latency
L1 Dcache: 8KB, 4-way, 32-byte lines, 4 R/W ports, 2 cycles latency
L2 unified cache: 512KB / 4MB, 4-way, 64-byte lines, 10 cycles latency
Memory: 100 cycles, 2 cycles interchunk
Data TLB: 128 entries, 30 cycles miss penalty
Instruction TLB: 128 entries, 30 cycles miss penalty

Table 3.3: Processor configuration

Table 3.3 shows the processor configuration. Two different sizes (512KB and 4MB) of the L2 cache have been evaluated. Since using different latencies for the two cache sizes does not provide significant insights, we assume the same latency for both of them. We consider a 100-cycle memory latency, since it is expected that future processors will have very large off-chip L3 caches with latencies much lower than main memory [113].

Configurations Evaluated

For the 512KB cache, results are reported for IATAC as well as for different configurations of the related approaches: decay64K, decay128K, decay256K, decay512K, decay1024K, decayAdap, AMC0.5, AMC0.25, AMC0.125, AMC0.0625 and AMC0.03125. We chose these configurations because lower values of N for decayN degrade performance significantly (see figure 3.16), whereas higher values of N make decayN too conservative (see figure 3.17). The situation for AMCPF is different: higher values of PF are excessively aggressive and the performance loss is high (see figure 3.16), but lower values of PF hardly change the performance results (IPC and turn-off ratio stay the same). As stated in section 3.3.1, there are some pathological (and quite frequent) situations where AMCPF loses significant performance whatever the PF value is, and thus lower values of PF do not change the results. For the 4MB cache, we report results for the proposed IATAC mechanism. We also report results for decay512K, decay1024K, decay2048K, decay4096K, decay8192K, decayAdap, AMC0.5, AMC0.25, AMC0.125, AMC0.0625 and AMC0.03125. The baseline starts with the whole cache turned off and it never turns off any cache line once it has been accessed.

[Fig. 3.16: IPC degradation for the different mechanisms for 512KB and 4MB L2 caches]

[Fig. 3.17: L2 turn-off cache line ratio for the different mechanisms for 512KB and 4MB L2 caches]

Performance

In this section we present results showing the IPC degradation, the ratio of cache lines turned off, and the L2 miss ratio for the different mechanisms.

Figure 3.16 shows the performance degradation for both cache sizes for all mechanisms. The performance of individual benchmarks has been averaged using the harmonic mean. Similarly, figure 3.17 shows the turn-off ratios and figure 3.18 the miss ratios for the different mechanisms.

[Fig. 3.18: L2 miss ratio for the different mechanisms for 512KB and 4MB L2 caches]

In figure 3.17, ideal stands for the oracle mechanism that turns off every cache line right after the last hit. We observe that the baseline has a turn-off ratio higher than 0. That is so because some programs (perlbmk and eon, among others) do not fully utilize the L2 cache. Since the simulations start with the whole cache turned off, some cache lines are never accessed and thus never turned on.

We observe that for decayN, increasing N reduces the performance degradation at the expense of having a more conservative mechanism. That translates into a lower miss ratio and a lower turn-off ratio. AMCPF shows a similar trend when PF is decreased, but eventually the results stabilize and lower values of PF hardly change the results. For the 512KB cache, PF values lower than 0.03125 obtain results very close to those of AMC0.03125. The same trend is observed for the 4MB cache. As outlined in section 3.3.1, AMCPF targets a given error and thus, if an optimal situation is achieved (maximum power savings and no extra misses), the mechanism becomes more aggressive, degrading performance significantly most of the time. We also observe that AMC techniques perform relatively close to IATAC for the 4MB cache in terms of turn-off ratio, but not for the 512KB cache. This is so because the 4MB L2 cache is underutilized by many programs, leading to these high turn-off ratios.

For the 512KB cache, it can be seen that other mechanisms produce a low IPC loss, similar to IATAC. However, IATAC achieves a much higher turn-off ratio than all the other mechanisms with an IPC degradation of 1.4%. Only decay1024K loses less performance, but its turn-off ratio is extremely low.

We observe that the performance loss of decayAdap for a 4MB L2 cache is large. As stated by the authors [62]:

the idea is to start with a short decay interval, detect whether it was a mistake, and adjust the decay interval accordingly

For many programs, once the decay interval has elapsed, a significantly longer period of time elapses before there is an access to a particular cache line, which would be a hit if it were not for the fact that the line is turned off. Since the tags are off, the mechanism does not realize that there has been a decay miss; instead, since the time elapsed after turning the line off is quite large, the mechanism decides that it was not a decay miss and tries to make the decay interval shorter. Thus, most of the L2 cache lines are off, but performance is severely harmed. For smaller caches decayAdap does not lose as much performance, because cache lines are reused more frequently and, even if the decay interval is small, it is easy to quickly have an access to the turned-off cache line and decide that it was turned off prematurely. We must state that decayAdap was proposed for L1 caches, and the authors did not evaluate it for L2 caches. We have included this mechanism in the evaluation for the sake of comparison.

[Fig. 3.19: IPC loss for the SPEC CPU2000 benchmarks and a 512KB L2 cache]

It can be observed that IATAC has a similar or even higher miss ratio and lower performance degradation than some versions of decayN. Our approach slightly increases the miss ratio in many programs, which produces a small performance loss in each program. DecayAdap, decayN and AMCPF usually increase the miss ratio significantly for a few programs. That translates into significant performance degradation for these programs and also for the harmonic mean. In order to show this fact, we present detailed results for all benchmarks for decay512K, decayAdap, AMC0.0625 and IATAC. Figure 3.19 shows the IPC degradation, whereas figure 3.20 shows the number of misses (in logarithmic scale). It can be seen that IATAC is the worst behaving mechanism in terms of miss ratio and IPC only for the swim benchmark. Swim is a program with high ILP that has a large number of misses to the L2 cache. Increasing the number of misses has negligible performance impact, as their latency is overlapped with other misses. IATAC also has a larger number of misses for the perlbmk program, but there are only 200 misses. The figures show that there is a high correlation between significantly increasing the number of misses and degrading the performance. For instance, decayAdap is the worst behaving technique for the ammp, crafty, eon, fma3d, gzip, mesa and vortex programs in terms of misses and IPC. In general, as expected, a higher number of misses means a higher performance loss.

[Fig. 3.20: Number of misses for the SPEC CPU2000 benchmarks and a 512KB L2 cache]

For the 4MB L2 cache it can be observed that the turn-off ratio is significantly higher than for the 512KB cache for some versions of the decayAdap, decayN and AMCPF techniques, but the performance loss is also higher. Only decay4096K and decay8192K produce performance degradation slightly lower than IATAC, but at the expense of a much lower turn-off ratio. IATAC is the only mechanism achieving a high turn-off ratio and low performance degradation for medium and large cache sizes.

We have performed an additional experiment to check that IATAC is an accurate mechanism. We have obtained the average number of L2 cache lines turned off for each program when using the IATAC mechanism. Then, we have run the programs keeping this number of L2 cache lines turned off, but every time an L2 cache line is accessed (and turned on) we randomly select a cache line to be turned off. This way the fraction of L2 cache lines turned off is kept constant. The results show, as expected, that the fraction of L2 cache lines turned off is the same as for IATAC, but the performance loss grows from 1.6% (IATAC) to 4.9% (random scheme) for the 512KB L2 cache, and from 2.8% to 6.9% for the 4MB L2 cache. The high performance loss of the random scheme overrides its energy savings, leading to ED2P values worse than those of the baseline. Thus, IATAC is effective at dynamically resizing the L2 cache: it both turns off cache lines and chooses well which cache lines to turn off.

[Fig. 3.21: Decay interval coefficient of variation (generation and cache line CVs) for the SPEC CPU2000 benchmarks]

Dynamic Behavior. In order to show how IATAC adapts decay values dynamically, we present some results describing the predictive behavior of the IATAC scheme for the 512KB L2 cache. Results for the 4MB cache show the same trend. Figure 3.21 shows the standard deviation normalized with respect to the mean (coefficient of variation, or CV for short) for the predicted decay intervals, computed in two different ways (a sketch of the computation follows the list):

• Generation: every time a prediction is performed, the CV of the latest prediction for every cache line is computed. These values are averaged. This gives an indication of the variability of decay intervals across different lines at a given point in time. We have observed that the result is greater than 100% for half of the programs. That confirms that assuming the same decay interval simultaneously for all the cache lines is not worth doing, since different lines require different decay intervals. That explains why decayN and AMCPF do not work well for the L2 cache.

• Cache line: we compute the CV for the predictions of each individual cache line. Hence, we obtain as many values as cache lines. The final result is the average of all these values. This gives an indication of the variability of the decay intervals across time for a given cache line. We have observed that this value is quite high, indicating that using the local history of a given cache line to make new predictions is unreliable. It is not frequently the case that the predictions for a given cache line are constant and, at some point in time, change to a different value that also remains constant. In fact, we have observed that the cache line CV is high because decay intervals for a given cache line vary frequently. This explains why decayAdap performs poorly for the L2 cache. Additionally, decayAdap does not keep the tags on, and tries to infer the cause of misses instead of detecting whether or not they have been produced by an early deactivation of the cache line. While this approach is beneficial for L1 caches, it is inaccurate for the L2 cache.
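For concreteness, both statistics reduce to the same primitive applied to different sets of predictions; a minimal sketch (our own helper) is:

    #include <math.h>

    /* Coefficient of variation (standard deviation over mean) of n values.
       For the generation CV, x holds the latest prediction of every cache
       line at one point in time; for the cache line CV, x holds the
       prediction history of a single line. The per-call results are then
       averaged over time or over lines, respectively. */
    double cv(const double *x, int n) {
        double mean = 0.0, var = 0.0;
        for (int i = 0; i < n; i++) mean += x[i];
        mean /= n;
        for (int i = 0; i < n; i++) var += (x[i] - mean) * (x[i] - mean);
        var /= n;
        return sqrt(var) / mean;
    }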

Energy-Delay Efficiency

We have shown that IATAC can turn off a significant portion of the L2 cache at the expense of a low performance degradation. However, it does require some additional hardware. In this section we evaluate the impact of the different approaches using appropriate power-performance metrics [20]. We consider energy, EDP (energy-delay product) and ED2P (energy-delay-squared product).

The SIA roadmap [112] and Heo et al. [59], among others, estimate that leakage may represent about 50% of the energy consumption of the chip for 70nm technology. Others predict an even higher percentage [120]. In our evaluations we assume 50%. Additionally, we are conservative and assume that L2 cache leakage may represent as much as 50% of the total leakage energy, even though other authors [81] report up to a 75% contribution. Table 3.4 summarizes the energy model that has been considered.

Energy distribution for the whole chip: 50% dynamic + 50% leakage
Fraction of leakage of the L2 cache: 33% (512KB) or 50% (4MB)
Dynamic energy includes:
(1) Dynamic energy increase due to larger execution time for the different mechanisms
(2) Extra L2 misses
(3) Overhead due to additional structures
Leakage energy includes:
(4) Leakage increase due to larger execution time for the different mechanisms
(5) 3% extra leakage overhead for the L2 cache due to the gated-VDD technique
(6) Overhead due to additional structures

Table 3.4: Energy model


                                                   IATAC (512KB L2)   IATAC (4MB L2)
A. Dynamic energy                                        50.0%              50.0%
B. Dynamic overhead due to IPC loss                       0.6%               0.8%
C. Dynamic overhead due to extra L2 misses                0.5%               0.6%
D. Dynamic overhead due to additional structures          0.6%               0.8%
Total dynamic energy                                     51.7%              52.2%

E. Leakage energy                                        50.0%              50.0%
F. Leakage energy savings                               -11.2%             -16.0%
G. Leakage overhead due to IPC loss                       0.5%               0.9%
H. Leakage overhead due to gated-VDD                      0.6%               0.8%
I. Leakage overhead due to additional structures          1.0%               1.6%
Total leakage energy                                     40.9%              37.3%

Total chip energy                                        92.6%              89.5%
Total chip energy savings                                 7.4%              10.5%

Table 3.5: Energy breakdown for the IATAC mechanism

All techniques introduce some extra misses (2) when cache lines are turned off too early. The consequent increase in execution time impacts both dynamic (1) and leakage (4) energy. The larger the execution time, the higher the leakage energy consumed in all the structures except the L2 cache, some of whose lines are turned off. The dynamic energy also grows due to the activity that takes place every cycle, such as selecting instructions from the issue queue. The dynamic energy overhead (3) of the additional structures is the number of bits read and written multiplied by the energy consumption of accessing a bit of the L2 cache. The energy per L2 bit is obtained by dividing the energy consumption of an L2 access by the number of bits accessed. The gated-VDD technique (5) requires a 3% area overhead to be implemented. We assume that increasing the L2 area by 3% increases its leakage by the same percentage. Finally, the leakage overhead (6) of the additional structures is computed in a similar way. We assume that each bit of the additional structures has the same leakage as a bit in the L2 cache. For instance, for the 512KB L2 cache IATAC requires 288 bits for the global structures and 31 bits for each cache line. Each cache line has 560 bits (48 tag + 512 data), and there are 8192 cache lines. Thus, the L2 cache has 4,587,520 bits and the additional structures require 254,240 bits, which is a 5.5% leakage overhead. We also consider the effect of tag leakage for those approaches that do not turn the tags off (AMCPF, IATAC).
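The bit-count arithmetic above can be reproduced with a few lines (a sketch using only the constants quoted in the text):

    #include <stdio.h>

    int main(void) {
        const long lines          = 8192;      /* 512KB / 64-byte lines   */
        const long bits_per_line  = 48 + 512;  /* tag + data = 560 bits   */
        const long global_bits    = 288;       /* 16-entry global tables  */
        const long extra_per_line = 31;        /* per-line IATAC fields   */
        long cache_bits = lines * bits_per_line;                /* 4587520 */
        long extra_bits = lines * extra_per_line + global_bits; /* 254240  */
        printf("leakage overhead = %.1f%%\n",
               100.0 * (double)extra_bits / (double)cache_bits); /* 5.5% */
        return 0;
    }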

We have not taken into account the extra off-chip energy consumption. On the one hand, it is hard to estimate the energy consumption of an off-chip access. On the other hand, even if the extra off-chip energy consumption is higher than the on-chip energy savings, the main benefit of reducing the on-chip energy consumption is that temperature can be reduced and the cooling system can be simpler, or higher performance can be achieved for a given thermal solution.

Table 3.5 shows the energy breakdown for the whole chip when IATAC is used. It can be observed that the energy savings are high and thus the overhead (dynamic and static) of our mechanism is more than compensated. In the table, B (G) stands for the extra dynamic energy (leakage) due to having a larger execution time. C accounts for the extra dynamic energy in the L2 cache to write data that were evicted too early, and to read data that have to be written back to memory again. D is the dynamic energy of the structures required by the IATAC mechanism. F corresponds to the leakage energy savings in the L2 cache. H accounts for the extra leakage energy required by the gated-VDD technique, as stated before. I is the leakage of the extra bits per cache line and the structures required by the IATAC mechanism.

[Fig. 3.22: Energy consumption for the different mechanisms for 512KB and 4MB L2 caches]

[Fig. 3.23: EDP for the different mechanisms for 512KB and 4MB L2 caches]

Figures 3.22, 3.23 and 3.24 show the relative energy, EDP and ED2P required by the different mechanisms. It can be seen that IATAC is clearly the best performing technique in terms of the last two metrics, which combine energy and performance. IATAC has higher accuracy in identifying useless cache lines at the expense of a very small performance degradation. Notice that decay1024K saves similar energy to IATAC for the 4MB cache, but at the expense of high IPC degradation. In terms of ED2P, IATAC performs 7% (6%) better than the baseline and 7% (4%) better than the second best technique for the 512KB (4MB) L2 cache.

[Fig. 3.24: ED2P for the different mechanisms for 512KB and 4MB L2 caches]

To summarize, IATAC achieves the highest energy savings, and its performance loss is very low. Those mechanisms with energy savings close to those of IATAC have higher performance degradation. Overall, in terms of ED2P, IATAC outperforms all other mechanisms. In this work, we focus on reducing L2 leakage, which is 16.7% (25%) of the whole-chip energy for the 512KB (4MB) L2 cache and the assumed configuration. The ideal mechanism that turns off each cache line after its last use saves 13.4% (17.8%) of the whole-chip energy, whereas IATAC saves as much as 7.4% (10.5%), which is more than half of the ideal mechanism's energy savings. IATAC achieves these high energy savings at the expense of a very low performance loss.

3.3.3 Conclusions

IATAC is a new microarchitectural technique to reduce leakage in L2 caches, which are expected to be an important source of energy consumption in future processors. IATAC dynamically adapts the decay interval for each cache line individually based on global history. To the best of our knowledge, this is the first approach based on identifying global access patterns to predict the appropriate decay interval for each individual cache line. We show that considering the number of accesses to the cache lines and the decay intervals together is a promising approach to developing predictors to turn off cache lines. IATAC outperforms all previous approaches, since its turn-off ratio is very high (around 65%) and its performance loss is low. Besides, it comes at the expense of little additional hardware. IATAC provides the best ED2P across all mechanisms for different cache sizes. For instance, IATAC improves ED2P by 7% (6%) for a 512KB (4MB) L2 cache.

There is still some room for improvement. The turn-off ratio achieved by IATAC is close to the ideal, but it comes at the expense of a small performance degradation. Although the performance loss is small, it has a significant impact on the ED2P metric. Thus, new predictors that achieve a similar level of accuracy to IATAC but further reduce the IPC loss will be interesting. Additionally, enabling the compiler to provide some hints to the hardware predictor, such as the expected number of accesses to a cache line, may improve the accuracy of the predictor.

3.4 Heterogeneous Way-Size Caches

Set-associative cache architectures are commonly used. These caches consist of a number of ways, each of the same size. We have observed that the different ways have very different utilization, which motivates the design of caches with heterogeneous way sizes. This can potentially result in higher performance for the same area, better capabilities to implement dynamically adaptive schemes, and more flexibility in choosing the size of the cache.

This section proposes a novel cache architecture in which the different cache ways may have different sizes. This new cache architecture is referred to as the Heterogeneous Way-Size cache (HWS cache). HWS structures can be used for any kind of set-associative memory structure, such as caches, predictor tables, etc. In this section we use HWS structures for L1 and L2 caches. HWS caches are evaluated in depth in terms of hit ratios and energy.

We also study a Dynamically Adaptive version of the HWS cache (DAHWS cache). DAHWS caches are expected to be more adaptive than conventional architectures because of their higher flexibility.

The organization of this section is as follows. Section 3.4.1 presents the HWS cache. The evaluation framework and a performance study of the HWS cache for L1 data, L1 instruction and L2 caches are presented in section 3.4.2. Section 3.4.3 describes how dynamic reconfiguration schemes can be applied to HWS caches. Section 3.4.4 shows the evaluation of the reconfigurable HWS cache schemes. The conclusions of this work are presented in section 3.4.5.

3.4.1 Heterogeneous Way-Size Cache (HWS Cache)

Cache architecture is a major concern in microprocessor design. Fully-associative caches are prohibitive in terms of cycle time and power, whereas direct-mapped caches have high miss rates due to conflicts. Hence, data and instruction caches normally use set-associative organizations.

Some degree of associativity is required to avoid conflicts, but only a few sets make use of this associativity at a given point in time. To show evidence of this, we have studied 4-way set-associative L1 instruction and data caches, and an L2 unified cache. Figures 3.25, 3.26 and 3.27 show the cumulative distribution of the number of sets that have 1, 2, 3 or 4 active lines. An active line is a line that will be used in the future before being replaced. Here we just plot the average for the whole SPEC CPU2000 benchmark suite. Simulation details are given in section 3.4.2. For instance, for the L1 data cache, 90% of the time we have that:

• No more than 4 sets have 4 active lines.

• No more than 11 sets have 3 active lines.

• No more than 17 sets have 2 active lines.

• No more than 20 sets have 1 active line.

• At least 12 sets (the remaining ones, up to 64) have 0 active lines.

Then, 90% of the time the utilization of the ways is 81% for the first way, 50% for the second one, 23% for the third one and 6% for the fourth one.

[Fig. 3.25: Associativity utilization for the L1 data cache (8KB, 4-way, 32B/line)]

Overall, only a small fraction of sets require some associativity most of the time, and the number of sets that makes effective use of a given degree of associativity decreases as associativity increases. This is an important motivation to investigate cache architectures with heterogeneous way sizes.

Set-associative caches have basically three design parameters: the number of ways or associativity (A), the number of cache lines per way or number of sets (S), and the number of bytes per line (B). The total cache size (C) is given by the following formula:

C = A · S · B (3.5)

Feasible configurations are constrained by the fact that both S and B have to be a power of two to simplify the indexing function, and all ways have the same size. For instance, we can design a 20KB 5-way cache, or a 24KB 3-way cache, but not a 20KB 3-way cache.


[Fig. 3.26: Associativity utilization for the L1 instruction cache (32KB, 4-way, 64B/line)]

[Fig. 3.27: Associativity utilization for the L2 unified cache (2MB, 4-way, 64B/line)]


[Fig. 3.28: Indexing functions for a conventional cache (left) and a HWS cache (right). The conventional 2-way cache indexes both ways with bits aj+2 aj+1 aj of the address; the HWS cache indexes the 8-line way with aj+2 aj+1 aj and the 4-line way with aj+1 aj]

We propose the HWS cache to tackle the above issues. A HWS cache does not require the different ways to have the same size. The only constraint is that the number of lines of each way (Si) must be a power of two. For instance, we can design the 20KB 3-way cache if we have two 8KB ways and one 4KB way. The total cache size is given by the following formula:

C = B · Σ(i=1..A) Si    (3.6)

Cache Organization

To access a HWS cache we use the conventional modulo function: a subset of the address bits is used to select a line in each way, but the subset of bits used may be different for each way. Indexing functions other than modulo could be used, but this is out of the scope of this work. Figure 3.28 depicts the indexing functions for 2-way set-associative caches. The conventional cache has the same size for both ways (8 cache lines per way), whereas the HWS cache has 8 lines in one way and 4 lines in the other. The conventional cache uses the 3 least significant bits (ignoring the bits for the word offset) as the index in both ways, whereas the HWS cache uses the 3 and 2 least significant bits for each of the ways, respectively.
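For illustration, a lookup in a HWS cache only changes the per-way index mask and tag width; the following minimal sketch (types and names are ours, and storage allocation is omitted) assumes the 8-line/4-line example of figure 3.28:

    #include <stdint.h>
    #include <stdbool.h>

    #define A 2  /* associativity */

    typedef struct { bool valid; uint32_t tag; } Line;

    /* Lines per way; each must be a power of two. */
    static const uint32_t way_lines[A] = { 8, 4 };
    static Line *way[A];  /* way[i]: array of way_lines[i] lines */

    /* Returns the way that hits, or -1 on a miss. line_addr is the
       address with the byte-offset bits already stripped. */
    int hws_lookup(uint32_t line_addr) {
        for (int i = 0; i < A; i++) {
            uint32_t set = line_addr & (way_lines[i] - 1); /* modulo index */
            uint32_t tag = line_addr / way_lines[i];       /* way-specific tag */
            if (way[i][set].valid && way[i][set].tag == tag)
                return i;
        }
        return -1;
    }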

Replacement Policy Considerations

For conventional caches, the Replacement Information Table (RIT) records, for each set, the information necessary to decide which cache line should be replaced. The RIT has S entries (one per set), and can simply be appended to the tags (its physical location is irrelevant for the discussion here). When there is an access to a set, the corresponding entry of the RIT is read to know the cache line to be replaced.

For a HWS cache, the RIT has as many entries as the largest cache way (MAXi=1,A Si, or Max Si for short). For every reference, the RIT is indexed like the Max Si way. The entry is read to know the cache line to be replaced in case of a miss. A difference between conventional caches and HWS caches is the way the RIT is updated. While conventional caches update just one RIT entry, HWS caches update as many entries as the ratio between Max Si and the Si of the way where the cache line is replaced (in case of a miss) or accessed (in case of a hit). The RIT entries updated are all those entries corresponding to cache lines of the Max Si way whose content could map to the replaced or accessed cache line. Figure 3.29 shows an example.

[Fig. 3.29: Example of RIT update for a HWS cache with way sizes S1 = 8, S2 = 4, S3 = 4 and S4 = 2 lines (Max Si = S1 = 8). For an address whose index bits are 100b: if the referenced cache line is in way 1, RIT entry 100b is updated; if it is in way 2 or way 3, RIT entries X00b (000b and 100b) are updated; if it is in way 4, RIT entries XX0b (000b, 010b, 100b and 110b) are updated]
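The update rule can be sketched as follows for the example of figure 3.29 (update_entry is a hypothetical helper that touches the replacement state of one RIT entry):

    #define MAX_SI 8   /* lines in the largest way (way 1 in figure 3.29) */

    void update_entry(unsigned rit_index, int way);  /* e.g., update LRU bits */

    /* Update the RIT after an access or replacement in 'way', where the
       way holds way_lines lines and 'set' is the line index within it.
       All RIT entries whose index maps onto 'set' in that way are updated,
       i.e., MAX_SI / way_lines entries (the X00b / XX0b patterns). */
    void rit_update(int way, unsigned way_lines, unsigned set) {
        unsigned ratio = MAX_SI / way_lines;
        for (unsigned k = 0; k < ratio; k++)
            update_entry(k * way_lines + set, way);
    }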

Tag Considerations

In HWS caches, the number of address bits that correspond to the tag differs across the cache ways. The solution that we have adopted consists of storing tags of different lengths for each cache way. For instance, suppose a 12KB 2-way HWS cache whose first way is 8KB and whose second way is 4KB. The first way requires K bits for the tag, as if it were one of the ways of a 16KB 2-way conventional cache, whereas the second way requires K+1 bits, as if it were one of the ways of an 8KB 2-way conventional cache.

Flexibility

Cache parameters are usually determined by design constraints, such as power, latency and available area. HWS caches allow the designer to choose among a much richer set of possible configurations. For instance, table 3.6 shows the different HWS cache configurations in comparison with a conventional cache for a small design space. The configurations in which all ways have the same size (4|4, 8|8, 16|16, 4|4|4 and 8|8|8) correspond to conventional caches. We limit the total cache size to the range between 8KB and 32KB, and the set-associativity to 2 or 3. It can be observed that there are 5 conventional cache configurations and 29 extra HWS cache configurations. For higher set-associativity the number of possible HWS cache configurations grows dramatically. For instance, for 4-way set-associative caches, there are 3 conventional configurations and 43 extra HWS configurations.

Assoc  Size   Way sizes (KB)    Assoc  Size   Way sizes (KB)
2      8 KB   4|4               3      13 KB  8|4|1
2      9 KB   8|1               3      14 KB  8|4|2
2      10 KB  8|2               3      16 KB  8|4|4
2      12 KB  8|4               3      17 KB  8|8|1
2      16 KB  8|8               3      18 KB  8|8|2
2      17 KB  16|1              3      20 KB  8|8|4
2      18 KB  16|2              3      18 KB  16|1|1
2      20 KB  16|4              3      19 KB  16|2|1
2      24 KB  16|8              3      20 KB  16|2|2
2      32 KB  16|16             3      21 KB  16|4|1
3      8 KB   4|2|2             3      22 KB  16|4|2
3      9 KB   4|4|1             3      24 KB  8|8|8
3      10 KB  4|4|2             3      24 KB  16|4|4
3      10 KB  8|1|1             3      25 KB  16|8|1
3      11 KB  8|2|1             3      26 KB  16|8|2
3      12 KB  4|4|4             3      28 KB  16|8|4
3      12 KB  8|2|2             3      32 KB  16|8|8

Table 3.6: Feasible configurations for a HWS cache with associativity 2 or 3 and capacity ranging from 8KB to 32KB

The following formula gives the number of possible HWS cache configurations for a given set-associativity (A) and N different way sizes (e.g., if N is 1, all ways have the same size, which corresponds to a conventional cache, and there is just one possible configuration):

Number of configurations = (N + A − 1 choose A) = (N + A − 1)! / (A! · (N − 1)!)    (3.7)

For instance, we have 20 different HWS cache configurations for 3-way set-associative caches (A = 3) and way sizes of 1KB, 2KB, 4KB or 8KB (N = 4). This number of configurations includes those corresponding to conventional caches (there are just N different conventional cache configurations). Figure 3.30 plots the number of configurations for conventional caches and HWS caches, in logarithmic scale, for associativities ranging from 2 to 8 and numbers of different way sizes ranging from 2 to 6. It can be seen that HWS caches allow many more configurations than conventional caches.
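Equation 3.7 counts the multisets of A way sizes drawn from N candidates; a small sanity check (our own code) reproduces the 20 configurations quoted above:

    #include <stdio.h>

    /* Number of HWS configurations: (N + A - 1) choose A,
       computed with the exact multiplicative binomial formula. */
    unsigned long long hws_configs(unsigned N, unsigned A) {
        unsigned long long c = 1;
        for (unsigned i = 1; i <= A; i++)
            c = c * (N + i - 1) / i;
        return c;
    }

    int main(void) {
        printf("%llu\n", hws_configs(4, 3));  /* prints 20 */
        return 0;
    }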

3.4.2 HWS Cache Evaluation

This section presents the hit rates for different configurations of the HWS cache. The analysis is performed for L1 data and instruction caches, and for L2 caches. Besides the flexibility advantage of HWS caches over conventional caches, in this section we show that some HWS cache configurations perform better than conventional caches.


[Plot: number of configurations (log scale, 1 to 10000) vs. number of different way sizes (2 to 6), one group per associativity (2 to 8); series: conv vs. HWS]

Fig. 3.30: Number of cache configurations for associativity ranging from 2 to 8, and number of different way sizes ranging from 2 to 6

Experimental Framework

The HWS cache has been evaluated using sim-cache, which is part of the SimpleScalar toolset [23]. Different cache sizes and associativities are studied. The only fixed parameter is the number of bytes per cache line: we have assumed 32 bytes/line for the L1 Dcache, and 64 bytes/line for the L1 Icache and the unified L2 cache. When simulating the L2 cache we have assumed an 8KB 4-way L1 Dcache and a 32KB 2-way L1 Icache. Dynamic energy results are derived from CACTI 3.0 [111]. Leakage has been assumed to be proportional to the cache size. For this study we have used the whole SPEC CPU2000 benchmark suite [115].

Configurations Evaluated

In this study, L1 cache sizes range from 4KB to 64KB, whereas L2 cache sizes range from 256KB to 4MB. Those HWS cache configurations whose smallest way sizes are extremely small in comparison with their largest way sizes have not been included, both to reduce the number of configurations and because they behave quite poorly. Set-associativities between 2 and 4 have been considered; higher set-associativity dramatically increases the number of available HWS cache configurations.


[Two plots: hit rate vs. cache size (KB); panels "L1 Dcache: 2-way 4KB-16KB" and "L1 Dcache: 2-way 16KB-64KB"; series: HWS vs. Conventional]

Fig. 3.31: Hit rate for 2-way set-associative L1 Dcaches

L1 Data Cache

To better observe the differences, there are two graphs for each set-associativity: one showing cache sizes ranging from 4KB to 16KB, and another from 16KB to 64KB. Figure 3.31 shows the hit rates for 2-way set-associative data caches. It can be seen that both conventional and HWS caches follow the same trend in terms of hit rate, but the HWS cache enables many more configurations.

Figure 3.32 shows the hit rates for 3-way set-associative caches. In addition to the greater number of configurations of the HWS cache, it can be observed that for each conventional cache there is a HWS cache configuration with equal size and set-associativity, and a higher hit rate. Table 3.7 details some interesting HWS configurations that are more advantageous than conventional ones in terms of energy due to their lower size or associativity. Rows marked with an asterisk correspond to conventional cache configurations. For instance, an 18KB 3-way HWS cache has about the same hit rate as a 24KB 3-way conventional cache, but the HWS one reduces the dynamic energy by 8% and the leakage by 25%. A 2-way set-associative HWS cache has about the same hit rate as a 3-way set-associative conventional cache, both with 12KB capacity, but the HWS one reduces the dynamic energy by 22%.

Figure 3.33 shows the hit rates for 4-way set-associative data caches. As for 3-way caches, for each conventional cache there is a HWS cache configuration with the same size and associativity, and a higher hit rate. The improvement is especially significant for small caches.


[Two plots: hit rate vs. cache size (KB); panels "L1 Dcache: 3-way 4KB-16KB" and "L1 Dcache: 3-way 16KB-64KB"; series: HWS vs. Conventional]

Fig. 3.32: Hit rate for 3-way set-associative L1 Dcaches

Assoc  Size   Ways size     Hit Rate  Dyn. Energy  Leakage Energy
                                      Savings      Savings
3      6 KB   2|2|2 *       90.56%    --           --
2      6 KB   4|2           90.24%    24%          0%
3      6 KB   4|1|1         90.59%    0%           0%
3      12 KB  4|4|4 *       92.69%    --           --
2      12 KB  8|4           92.50%    22%          0%
3      11 KB  8|2|1         92.54%    2%           8%
3      12 KB  8|2|2         92.71%    0%           0%
3      24 KB  8|8|8 *       93.48%    --           --
2      24 KB  16|8          93.34%    18%          0%
3      18 KB  8|8|2         93.27%    8%           25%
3      22 KB  16|4|2        93.46%    3%           8%
3      24 KB  16|4|4        93.52%    0%           0%
3      48 KB  16|16|16 *    93.92%    --           --
2      48 KB  32|16         93.89%    11%          0%
3      36 KB  16|16|4       93.77%    13%          25%
3      44 KB  32|8|4        93.92%    3%           8%
3      48 KB  32|8|8        93.97%    -1%          0%

Table 3.7: 3-way set-associative L1 Dcaches (conventional caches are marked with *)


[Two plots: hit rate vs. cache size (KB); panels "L1 Dcache: 4-way 4KB-16KB" and "L1 Dcache: 4-way 16KB-64KB"; series: HWS vs. Conventional]

Fig. 3.33: Hit rate for 4-way set-associative L1 Dcaches

Table 3.8 shows some HWS configurations that outperform conventional caches from the standpoint of energy, hit rate, associativity or capacity. For instance, we observe that a 26KB 4-way HWS cache saves 6% dynamic energy and 19% leakage with respect to a 32KB 4-way conventional cache, while the hit rate is quite close. Similarly, a 32KB 3-way HWS cache saves 14% dynamic energy.

Since the HWS cache area is lower than or equal to that of the conventional cache, the HWS cache delay should be equal or even lower. For instance, a 2·S|S|S 3-way HWS cache can have the same layout as its S|S|S|S 4-way conventional counterpart, but one half of the HWS cache's largest way (the 2·S way) is gated off on each access, selected by the corresponding bit of the address.

From the results for L1 data caches, we make the following observations:

• HWS cache configurations with the same size and set-associativity (above 2, since 2 ways of different size can never have a total capacity equal to a power of 2) as the conventional cache behave better. In particular, if the conventional cache has ways of size S, a 3-way HWS cache configuration with 2·S|S/2|S/2 way sizes performs better than the conventional cache with S|S|S way sizes. For 4 ways, a HWS cache configuration with 2·S|S|S/2|S/2 way sizes performs better than a conventional cache whose way sizes are S|S|S|S.

• HWS cache configurations with the same size and one unit less associativity than conventional caches behave similarly. For 2-way and 3-way caches, the HWS cache and conventional cache configurations with similar performance are 2·S|S and S|S|S respectively. For 3-way and 4-way caches, the corresponding configurations are 2·S|S|S and S|S|S|S respectively.


Assoc  Size   Ways size        Hit Rate  Dyn. Energy  Leakage Energy
                                         Savings      Savings
4      4 KB   1|1|1|1 *        88.97%    --           --
3      4 KB   2|1|1            88.98%    33%          0%
4      4 KB   2|1|1/2|1/2      89.45%    15%          0%
4      8 KB   2|2|2|2 *        91.76%    --           --
3      8 KB   4|2|2            91.68%    19%          0%
4      7 KB   4|2|1/2|1/2      91.53%    1%           13%
4      8 KB   4|2|1|1          91.86%    0%           0%
4      16 KB  4|4|4|4 *        93.22%    --           --
3      16 KB  8|4|4            93.14%    17%          0%
4      13 KB  4|4|4|1          93.02%    4%           19%
4      15 KB  8|4|2|1          93.19%    1%           7%
4      16 KB  8|4|2|2          93.26%    0%           0%
4      32 KB  8|8|8|8 *        93.71%    --           --
3      32 KB  16|8|8           93.70%    14%          0%
4      26 KB  16|4|4|2         93.63%    6%           19%
4      30 KB  16|8|4|2         93.72%    2%           7%
4      32 KB  16|8|4|4         93.75%    0%           0%
4      64 KB  16|16|16|16 *    94.14%    --           --
3      64 KB  32|16|16         94.14%    9%           0%
4      52 KB  32|8|8|4         94.05%    9%           19%
4      60 KB  32|16|8|4        94.13%    3%           7%
4      64 KB  32|16|8|8        94.17%    -1%          0%

Table 3.8: 4-way set-associative L1 Dcaches (conventional caches are marked with *)


There is a reason for this behavior. In a conventional cache, any given cache line can potentially be stored in just one position of each way, so it has A potential locations, A being the associativity. We can observe that the HWS configurations that outperform a conventional cache have a largest way of size 2·S. This means that if we consider the set of all memory blocks that can be placed in A different lines in a conventional cache, they can be stored in A+1 different lines in the HWS cache. For each single memory block there are still just A possible lines where it can be stored, but some blocks that conflict with it in a conventional cache do not conflict with it in the HWS cache. This happens when the block stored in the largest way differs from the conflicting lines in one bit of the index. An example is shown in Figure 3.34. It can be observed that line1 and line2 conflict in a conventional cache, but they may not conflict in the HWS cache. Since a HWS cache normally introduces few extra conflicts in the ways smaller than S, as expected from the study at the beginning of section 3.4.1, it behaves close to a cache with A+1 associativity. In the example, if four different memory blocks conflict in the conventional cache, they will cause misses. In the HWS cache, it is possible to store these four blocks in different lines and not cause any miss. Ways 2 and 3 of the HWS cache are smaller than those of the conventional cache, but this hardly increases the miss rate, since few sets require high associativity.


[Diagram: a 3-way conventional cache with way sizes 4|4|4 indexed by bits a_{j+1} a_j, where line1 and line2 map to the same set in all three ways, versus an HWS cache with way sizes 8|2|2 whose largest way is indexed by a_{j+2} a_{j+1} a_j; line1 (address bits 0 1 0) and line2 (address bits 1 1 0) fall into different lines of the largest way]

Fig. 3.34: Example of better behavior of HWS cache with respect to a conventional cache


L1 Instruction Cache and L2 Unified Cache

The trends observed for the L1 instruction cache and the L2 unified cache are similar to those of the L1 data cache, so we just report the main results in this section.

Tables 3.9 and 3.10 detail the most relevant HWS configurations. We can observe the same trends as for L1 data caches: HWS caches behave better than conventional caches of the same size and associativity, and HWS caches perform similarly to conventional caches of the same size and one unit more associativity. For some L1 instruction and L2 unified cache configurations, HWS caches perform significantly better than conventional caches with higher associativity and the same size. This is a somewhat unexpected result, but there is an explanation for it. Note that for conventional caches, blocks conflicting in one cache way conflict in all of them. For instance, in a 4-way conventional cache (S|S|S|S), if five different blocks that map to the same set are accessed in a round-robin fashion, they will always miss. But for a 3-way HWS cache whose way sizes are 2·S|S|S, it may happen that one block is stored in the largest way and is never replaced (see Figure 3.34). This is the case if the address bit a_{j+2} in the figure has the same value for four of the blocks and a different value for the fifth. Once the block with this different bit is placed in the largest way, the other blocks cannot replace it. Then, a significant fraction of the accesses to these blocks (1 out of every 5 accesses) always hits.

The 3-way HWS cache behaves slightly worse than or equal to the 4-way conventional one for most of the programs. However, in those cases where the working set is a bit larger than the cache size, the above-mentioned effect can normally be observed. For the few programs where this happens, the HWS cache behaves better than the conventional cache and, on average, the HWS cache is a bit better than the conventional cache even though its associativity is one unit lower.


Assoc  Size   Ways size        Hit Rate  Dyn. Energy  Leakage Energy
                                         Savings      Savings
3      6 KB   2|2|2 *          89.51%    --           --
2      6 KB   4|2              89.52%    24%          0%
3      6 KB   4|1|1            90.19%    0%           0%
3      12 KB  4|4|4 *          94.58%    --           --
2      12 KB  8|4              94.71%    22%          0%
3      11 KB  8|2|1            94.90%    2%           8%
3      12 KB  8|2|2            95.47%    0%           0%
3      24 KB  8|8|8 *          98.33%    --           --
2      24 KB  16|8             98.15%    18%          0%
3      22 KB  16|4|2           98.32%    3%           8%
3      24 KB  16|4|4           98.57%    0%           0%
3      48 KB  16|16|16 *       99.48%    --           --
2      48 KB  32|16            99.37%    11%          0%
3      40 KB  16|16|8          99.33%    8%           17%
3      44 KB  32|8|4           99.41%    3%           8%
3      48 KB  32|8|8           99.49%    -1%          0%
4      4 KB   1|1|1|1 *        86.73%    --           --
3      4 KB   2|1|1            86.96%    33%          0%
4      4 KB   2|1|1/2|1/2      87.40%    15%          0%
4      8 KB   2|2|2|2 *        91.96%    --           --
3      8 KB   4|2|2            92.25%    19%          0%
4      7 KB   4|2|1/2|1/2      91.55%    1%           13%
4      8 KB   4|2|1|1          92.63%    0%           0%
4      16 KB  4|4|4|4 *        96.74%    --           --
3      16 KB  8|4|4            97.01%    17%          0%
4      14 KB  8|4|1|1          96.61%    3%           13%
4      15 KB  8|4|2|1          97.02%    1%           7%
4      16 KB  8|4|2|2          97.31%    0%           0%
4      32 KB  8|8|8|8 *        99.07%    --           --
3      32 KB  16|8|8           99.06%    14%          0%
4      26 KB  16|4|4|2         98.86%    6%           19%
4      30 KB  16|8|4|2         99.12%    2%           7%
4      32 KB  16|8|4|4         99.19%    0%           0%
4      64 KB  16|16|16|16 *    99.75%    --           --
3      64 KB  32|16|16         99.73%    9%           0%
4      52 KB  16|16|16|4       99.63%    10%          19%
4      60 KB  32|16|8|4        97.74%    3%           7%
4      64 KB  32|16|8|8        97.77%    -1%          0%

Table 3.9: 3-way and 4-way set-associative L1 Icaches (conventional caches are marked with *)


Assoc  Size     Ways size               Hit Rate  Dyn. Energy  Leakage Energy
                                                  Savings      Savings
3      384 KB   128|128|128 *           68.37%    --           --
2      384 KB   256|128                 68.46%    24%          0%
3      384 KB   256|64|64               69.23%    0%           0%
3      768 KB   256|256|256 *           74.36%    --           --
2      768 KB   512|256                 74.00%    22%          0%
3      704 KB   512|128|64              74.43%    2%           8%
3      768 KB   512|128|128             75.26%    0%           0%
3      1536 KB  512|512|512 *           81.09%    --           --
2      1536 KB  1024|512                80.73%    17%          0%
3      1408 KB  1024|256|128            81.21%    3%           8%
3      1536 KB  1024|256|256            82.07%    0%           0%
3      3072 KB  1024|1024|1024 *        85.46%    --           --
2      3072 KB  2048|1024               85.06%    10%          0%
3      2560 KB  2048|256|256            84.75%    7%           17%
3      2816 KB  2048|512|256            85.31%    3%           8%
3      3072 KB  2048|512|512            85.78%    -1%          0%
4      256 KB   64|64|64|64 *           67.18%    --           --
3      256 KB   128|64|64               67.07%    20%          0%
4      256 KB   128|64|32|32            67.29%    0%           0%
4      512 KB   128|128|128|128 *       70.65%    --           --
3      512 KB   256|128|128             71.20%    19%          0%
4      448 KB   256|128|32|32           70.44%    1%           13%
4      512 KB   256|128|64|64           71.36%    0%           0%
4      1024 KB  256|256|256|256 *       76.38%    --           --
3      1024 KB  512|256|256             77.41%    17%          0%
4      832 KB   512|128|128|64          76.15%    4%           19%
4      896 KB   512|256|64|64           76.84%    2%           13%
4      1024 KB  512|256|128|128         77.72%    0%           0%
4      2048 KB  512|512|512|512 *       83.27%    --           --
3      2048 KB  1024|512|512            83.39%    14%          0%
4      1792 KB  1024|256|256|256        82.98%    4%           13%
4      1920 KB  1024|512|256|128        83.25%    2%           6%
4      2048 KB  1024|512|256|256        83.53%    0%           0%
4      4096 KB  1024|1024|1024|1024 *   87.03%    --           --
3      4096 KB  2048|1024|1024          87.57%    9%           0%
4      3584 KB  2048|1024|256|256       86.80%    5%           13%
4      3712 KB  2048|1024|512|128       86.98%    4%           9%
4      4096 KB  2048|1024|512|512       87.64%    -1%          0%

Table 3.10: 3-way and 4-way set-associative L2 caches (conventional caches are marked with *)


The energy results for the L1 instruction cache are the same as for the L1 data cache for identical configurations. Other interesting results regarding energy are as follows: an 832KB 4-way L2 HWS cache saves 4% dynamic energy and 19% leakage with respect to a 1024KB 4-way L2 conventional cache, and both perform very close in terms of hit ratio. Similarly, a 768KB 2-way L2 HWS cache saves 22% dynamic energy with respect to a 768KB 3-way L2 conventional cache, and they achieve similar hit ratios.

3.4.3 Dynamically Adaptive HWS cache (DAHWS cache)

Caches are designed to provide good average performance across a variety of programs, but individual programs, and individual phases within the same program, may have very different cache requirements. This suggests that a dynamically adaptive architecture may be more effective in terms of power-performance.

There are different families of techniques that save cache power by dynamically turning off parts of the cache. Some techniques work at cache line granularity [71, 75, 81, 137], whereas other mechanisms work at coarser granularities by turning off cache ways [15, 39, 134] and/or resizing all cache ways simultaneously [15, 38, 134].

Previous proposals that work at a coarse granularity can adapt the number of active ways or the size of the cache, but the cache ways that remain active all have the same size. HWS opens the door to new degrees of adaptation by allowing different sizes for each of the cache ways. In this section, we introduce simple algorithms to dynamically adapt HWS caches. The adaptive version of the HWS cache is referred to as the Dynamically Adaptive HWS cache (DAHWS cache). A particular DAHWS cache configuration is characterized by the sizes of its ways, listed from largest to smallest.

In this work we are interested in showing the higher potential of DAHWS caches, compared with conventional organizations, to dynamically adapt their capacity. Other adaptive techniques based on turning off cache ways or individual cache lines are orthogonal and can also be applied to DAHWS caches; they are not evaluated here. We present results for 4-way set-associative caches.

Assuming a copy-back write policy, when the way size is decreased, all dirty data in the part being turned off must be written back to the next level of the memory hierarchy, for both conventional and DAHWS caches. When the way size is increased, some data in the active part should be remapped or, alternatively, evicted to the next memory level. We assume the latter approach due to its greater simplicity.

Resizing Scheme for Conventional Caches

For comparison purposes, we have used the Signature Size scheme proposed by Dhodapkar and Smith [38] for conventional caches, which was shown to be very effective at resizing L1 caches, since it achieves energy savings similar to previous works with lower miss rates and fewer reconfigurations. Here we present an outline of the Signature Size mechanism and refer the interested reader to the original paper [38] for further details. Figure 3.35 shows the algorithm, which is applied at the end of every interval.


    IF (state == STABLE) THEN
        IF (phase change) THEN
            state = UNSTABLE
        ENDIF
    ELSE IF (state == UNSTABLE) THEN
        IF (NO phase change) THEN
            choose cache size
            state = STABLE
        ENDIF
    ENDIF

Fig. 3.35: Signature Size resizing algorithm [38]


The Signature Size scheme uses a bit vector called the signature. The vector is initially reset. For each instruction executed, some bits of the program counter are hashed and the corresponding vector position is set. After a given number of instructions, this vector is compared with the vector produced in the previous interval. The distance between both vectors is computed as:

\delta = \frac{Vector_1 \oplus Vector_2}{Vector_1 + Vector_2} \qquad (3.8)

Note that δ = 0 implies that both vectors are identical, and δ = 1 means that they are completely different. If δ is higher than a given threshold (δ_th), a phase change is assumed. The vector is reset at the beginning of each interval.

In particular, they show that a bit vector of 1024 bits, 100K-instruction intervals and δ_th = 0.5 effectively detect phase changes. The hash function used during simulation is based on the C library functions srand and rand.
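
A minimal sketch of the phase-change test, assuming a 1024-bit signature packed into 64-bit words and reading the sum in Eq. (3.8) as the population count of the bitwise OR (our interpretation; the builtin popcount is GCC/Clang-specific):

    #include <stdint.h>
    #include <stdbool.h>

    #define SIG_WORDS (1024 / 64)

    /* delta = popcount(a XOR b) / popcount(a OR b); a phase change is
     * assumed when delta exceeds the threshold (0.5 in the text). */
    bool phase_change(const uint64_t a[SIG_WORDS],
                      const uint64_t b[SIG_WORDS], double delta_th)
    {
        unsigned x = 0, o = 0;
        for (int i = 0; i < SIG_WORDS; i++) {
            x += __builtin_popcountll(a[i] ^ b[i]);
            o += __builtin_popcountll(a[i] | b[i]);
        }
        return o != 0 && (double)x / o > delta_th;
    }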

To choose the cache size they use the information in the bit vector. The authors claim that this vector is "a lossy-compressed working set representation". In fact, it is a lossy-compressed representation of the footprint, because it keeps track of all the addresses accessed, independently of how many times each one is reused. Only those that are frequently reused belong to the working set, so the working set is a subset of the footprint. In particular, we have observed that using this vector to compute the working set for instructions is quite accurate, whereas it is not for data.

The size (number of ones) of the vector is probabilistically related to the true footprint size. When K random keys are hashed into n buckets, the fraction of buckets filled, f, is given by:

f = 1 - \left(1 - \frac{1}{n}\right)^K \qquad (3.9)

Given the fraction of the vector that has been filled, the footprint size can be estimated using the relation:

K = \frac{\log(1 - f)}{\log\left(1 - \frac{1}{n}\right)} \qquad (3.10)


As a final remark, note that the vector is indexed with program counter bits to compute the instruction footprint, and with data addresses to compute the data footprint.

Baseline Resizing for DAHWS caches

As shown in section 3.4.2, a HWS cache whose way sizes are 2·S|S|S/2|S/2 normally obtains lower miss rates than a conventional cache whose way sizes are S|S|S|S (same total capacity). This is more noticeable for small cache sizes, which are quite frequent when the cache is dynamically resized. Thus, the initial resizing scheme we consider for DAHWS caches, referred to as the baseDAHWS scheme, is the same as Signature Size applied to a conventional cache, but when the cache is resized the way sizes are configured as 2·S|S|S/2|S/2. In case the maximum cache size is desired, the way sizes are S|S|S|S, since no cache way can be made larger than the physical size of the corresponding bank. A minimal sketch of this sizing rule is shown below.
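
The following C sketch states the rule under our own formulation (s is the per-way size of the equivalent conventional configuration and max_way the physical bank size, both in KB; names are illustrative):

    /* baseDAHWS: configure the ways as 2s|s|s/2|s/2 for a target
     * capacity of 4s; fall back to s|s|s|s at the maximum size,
     * since no way can exceed its physical bank. */
    void base_dahws_sizes(unsigned s, unsigned max_way, unsigned size[4])
    {
        if (s >= max_way) {
            size[0] = size[1] = size[2] = size[3] = max_way;
        } else {
            size[0] = 2 * s;
            size[1] = s;
            size[2] = size[3] = s / 2;
        }
    }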

Enhanced Resizing for DAHWS caches

The baseDAHWS resizing scheme can be enhanced by trying to further reduce the size of each cache way every time there is a phase change, to better adapt the DAHWS cache to the program requirements. Figure 3.36 shows the proposed scheme, which is an extension of the Signature Size algorithm that exploits the advantages of DAHWS caches over conventional caches. Like the previous scheme, this algorithm is applied once at the end of every interval.

This new algorithm does not move from the UNSTABLE state to the STABLE state directly; it first goes through a TUNING state, in which the cache configuration is tuned as follows. First, if a phase change is detected while in the TUNING state, the state becomes UNSTABLE again. Otherwise, the miss rate of the second interval is recorded, since the miss rate in the first interval may be highly influenced by the previous phase and thus not be representative (the rate_ok variable keeps track of the current interval). Then, the sizes of the different ways are halved successively in a round-robin fashion, starting with the last one, until an increase in the miss rate greater than a threshold (MAX_INCREASE) is observed. When this happens, the size of the last modified way (say i) is rolled back, and in the following intervals only the ways in the range [i+1, LAST] are candidates for further reduction, starting again with the LAST way. The system moves to the STABLE state when it fails to halve the LAST way. In the reported experiments, the MAX_INCREASE threshold is set to 0.5%. Additionally, when resizing the L2 cache, after N intervals in the STABLE state the system moves to the UNSTABLE state again to start a new tuning. The reason is that the phase detection mechanism may miss some phases for the L2 cache, since it is not accurate enough for very large working sets. In our case, N is set to 50.

Two alternative schemes to choose the initial cache size at the beginning of each phase have been investigated:


    IF (state == STABLE) THEN
        IF (phase change) THEN
            state = UNSTABLE
        ENDIF
    ELSE IF (state == UNSTABLE) THEN
        IF (NO phase change) THEN
            choose cache size
            state = TUNING
            rate_ok = 0
        ENDIF
    ELSE IF (state == TUNING) THEN
        IF (phase change) THEN
            state = UNSTABLE
        ELSE
            IF (rate_ok < 2) THEN
                rate_ok = rate_ok + 1
                save miss rate
            ELSE
                IF (miss rate < saved miss rate + MAX_INCREASE) THEN
                    size of way to decrement is reduced by half
                    select way to decrement in next iteration
                ELSE
                    double size of last way decreased
                    state = STABLE
                ENDIF
            ENDIF
        ENDIF
    ENDIF

Fig. 3.36: DAHWS cache resizing algorithm

• Hits: the signature is used to estimate the total cache size, but that size is distributed across the ways based on some simple statistics. There are as many hit counters as ways. Every time there is a hit, we check the position of the line in the LRU list (we assume Least Recently Used replacement in our experiments) and the corresponding counter is incremented. For instance, if there is a hit to the most recently used line of a set, the first counter is incremented. This scheme is interesting for the L1 instruction cache because, as stated before, the signature accurately estimates the required cache size, which prevents many reconfigurations from happening.

• Max: all cache ways are set to the maximum size at the beginning of every phase. Since the signature fails to accurately estimate the cache size for L1 data caches and L2 caches, a simple solution is to set the cache size to the maximum at each phase change.

For instance, if we have a 32KB 4-way cache, the signature method indicates that only 10KB are required, and the hit counter values are 50, 30, 10 and 10 respectively, then hits initially sets the way sizes to 8KB, 4KB, 1KB and 1KB respectively (e.g., 10KB · 50 / (50+30+10+10), rounded up to the next power of two, for the first way), whereas max initially sets all way sizes to 8KB each (see the sketch below).
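
A minimal sketch of the hits distribution (our own helper names; total_kb is the signature estimate):

    /* Distribute the estimated total size across the ways in
     * proportion to the per-LRU-position hit counters, rounding each
     * share up to the next power of two. */
    static unsigned next_pow2(unsigned x)
    {
        unsigned p = 1;
        while (p < x) p <<= 1;
        return p;
    }

    void hits_sizing(unsigned total_kb, const unsigned hits[],
                     int ways, unsigned size_kb[])
    {
        unsigned sum = 0;
        for (int w = 0; w < ways; w++) sum += hits[w];
        for (int w = 0; w < ways; w++)
            size_kb[w] = next_pow2(total_kb * hits[w] / sum);
    }

    /* total_kb = 10 with counters {50, 30, 10, 10} yields
     * {8, 4, 1, 1} KB, the example above. */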


Enhanced Resizing for Conventional Caches

For the sake of a fair comparison, we have evaluated a scheme for conventional caches similar to our enhanced resizing mechanism for DAHWS caches. The algorithm is the same as our previous mechanism (see Figure 3.36), but instead of halving or doubling the size of a single cache way, we resize all the cache ways simultaneously, because conventional caches do not allow different ways to have different sizes. We refer to this algorithm as Signature Size++.

3.4.4 DAHWS Cache Evaluation

This section presents an evaluation of the different adaptive schemes. Miss rates, percentage of active lines, IPC, and the number and type of reconfigurations are the main metrics used for this analysis. We have evaluated L1 data and instruction caches, and a unified L2 cache. Based on the above discussion, in this section we present performance metrics for baseDAHWS, hitsDAHWS (for the L1 instruction cache) and maxDAHWS (for the L1 data and unified L2 caches). For comparison purposes we use the Signature Size and Signature Size++ schemes applied to a conventional cache. We have also evaluated a non-resizing conventional cache with half the size of the baseline cache.

Experimental Framework

The DAHWS cache has been evaluated using sim-outorder, which is part of the SimpleScalar toolset [23]. Caches have the following parameters: 32KB, 4-way set-associative, 32 bytes/line for the L1 data cache; 32KB, 4-way set-associative, 64 bytes/line for the L1 instruction cache; and 2MB, 4-way set-associative, 64 bytes/line for the unified L2 cache. The replacement policy for all caches is LRU. When resizing the L2 cache we assume that the L1 cache sizes are set to their maximum. Table 3.11 shows the processor configuration.

For this study we have used the SPEC CPU2000 benchmark suite [115]. We have simulated 1 billion instructions for each benchmark after skipping the initialization part and warming up the caches for 100 million instructions.

L1 Data Cache

Figure 3.37 shows the miss rate with respect to a conventional cache with no dynamic resizing (first graph), the percentage of active lines (second graph), the IPC (third graph) and the number of reconfigurations (fourth graph) for the different schemes evaluated. First, it can be observed that the Signature Size algorithm for conventional caches achieves only a small reduction in L1 data cache size. As stated in section 3.4.3, the approach it uses to compute the required cache size is based on observing the footprint of the memory accesses. Therefore, there is practically no cache size reduction for those programs that touch a large set of data. In fact, 16 out of 26 programs show active ratios higher than 99%, and active ratios below 50% are observed for just 2 programs.


Parameter                    Value
Fetch, Decode, Commit width  8 instructions/cycle
Issue width                  8 INT + 8 FP instructions/cycle
Fetch queue size             64 entries
Issue queue size             128 INT + 128 FP entries
Load/store queue size        128 entries
Reorder buffer size          256 entries
Register file                160 INT + 160 FP
IntALUs                      6 (1 cycle)
IntMult/Div                  3 (3 cycles pipelined mult, 20 cycles non-pipelined div)
FP ALUs                      4 (2 cycles pipelined)
FP Mult/Div                  2 (4 cycles pipelined mult, 12 cycles non-pipelined div)
Memory ports                 4 R/W ports (2 cycles pipelined)
Branch predictor             Hybrid: 2K-entry Gshare, 2K-entry bimodal and 1K-entry metatable
BTB                          2048 entries, 4-way
L1 Icache                    32KB 4-way, 64-byte lines, 1 cycle latency
L1 Dcache                    32KB 4-way, 32-byte lines, 4 R/W ports, 2 cycles latency
L2 unified cache             2MB, 4-way, 64-byte lines, 10 cycles latency
Memory                       100 cycles, 2 cycles interchunk
Data TLB                     128 entries, 30 cycles miss penalty
Instruction TLB              128 entries, 30 cycles miss penalty

Table 3.11: Processor configuration


Note that baseDAHWS has exactly the same active ratio as Signature Size, because both algorithms take the same resizing decisions and choose the same cache size. The only difference is the cache organization, since the DAHWS cache can choose a configuration with different sizes for its ways. Because of that, the baseDAHWS miss rate is slightly lower than that of Signature Size. As shown in section 3.4.2, a 4-way conventional cache whose way sizes are S|S|S|S has a higher miss rate than a HWS cache whose way sizes are 2·S|S|S/2|S/2. However, the difference in terms of miss rate is very small, because the cache size is usually set to the maximum capacity and, in that case, both Signature Size and baseDAHWS use exactly the same configuration for all their ways.

It can be observed that the maxDAHWS scheme is more aggressive than the Signature Size and baseDAHWS ones, obtaining much lower active ratios at the expense of slightly higher miss rates (around a 0.35% increase). Similarly, Signature Size++ slightly increases the miss rate, but it is not as efficient as maxDAHWS because conventional caches are less flexible than DAHWS ones. Thus, maxDAHWS reduces the cache size by choosing configurations that cannot be implemented with a conventional cache. In terms of IPC, we can observe that maxDAHWS is much better than Signature Size++. This is because Signature Size++ attempts to reduce the cache size by halving it, which introduces a large number of misses during some intervals before the resizing action is rolled back. Most of the programs do not lose much performance during these intervals, but two of them (lucas and perlbmk) lose around 20% IPC, which is the reason for the high average IPC loss of this technique. On the other hand, maxDAHWS can halve the size of only one way at the end of every interval, which has a lower impact on performance even if the action must be rolled back. We also observe that using a cache with half the size of the baseline one, instead of resizing algorithms, does not reduce the active ratio as much as maxDAHWS, while its performance loss is significantly higher.

It can be seen that Signature Size and the baseDAHWS cache reconfigure the cache very few times, because they hardly decrease the size of the cache, so most of the times a phase change is detected, both the previous cache size and the chosen cache size are the maximum cache size. Although maxDAHWS takes resizing actions more often than the other mechanisms, we observe that most of the time only one way is resized, and thus the overhead of evicting cache lines is low. On average, one way of the cache is resized every 550,000 instructions, which corresponds to resizing the whole cache (4 ways) once every 2,200,000 instructions. Thus, the impact of cache resizing (a few hundred cycles) on performance is negligible. Additionally, this overhead is lower for maxDAHWS than for Signature Size++, since the latter resizes one way of the cache every 440,000 instructions, which corresponds to resizing the whole cache (4 ways) once every 1,760,000 instructions.


[Four panels: "L1 Data Cache - Miss Rate" (5.0%-8.5%), "L1 Data Cache - Active Ratio" (0%-100%), "L1 Data Cache - IPC loss" (0.0%-4.5%) and "L1 Data Cache - Number of Reconfigurations" (0-1600, split into 1, 2, 3 or 4 ways resized); schemes: baseline 32K, Signature Size, baseDAHWS, Signature Size++, maxDAHWS and baseline 16K]

Fig. 3.37: Miss rate, percentage of active lines, IPC and number of reconfigurations for the different L1 Data cache resizing schemes. The number of reconfigurations is split according to the number of ways whose size is changed


[Four panels: "L1 Instruction Cache - Miss Rate" (0.0%-4.0%), "L1 Instruction Cache - Active Ratio" (0%-100%), "L1 Instruction Cache - IPC loss" (0.0%-4.5%) and "L1 Instruction Cache - Number of Reconfigurations" (0-1000, split into 1, 2, 3 or 4 ways resized); schemes: baseline 32K, Signature Size, baseDAHWS, Signature Size++, hitsDAHWS and baseline 16K]

Fig. 3.38: Miss rate, percentage of active lines, IPC and number of reconfigurations for the different L1 instruction cache resizing schemes. The number of reconfigurations is split according to the number of ways whose size is changed


L1 Instruction Cache

Figure 3.38 shows the miss rate (first graph), percentage of active lines (second graph), IPC (third graph) and number of reconfigurations (fourth graph) for the different schemes evaluated. First, it can be observed that the baseDAHWS scheme achieves the same active ratio as the Signature Size scheme for conventional caches, but with a much lower miss rate increase. The size of the L1 instruction cache is significantly reduced, which increases the benefits from the HWS architecture. Similarly, hitsDAHWS behaves much better than Signature Size++ in terms of miss rate and slightly better in terms of active ratio, and its behavior is similar to that of baseDAHWS: hitsDAHWS slightly outperforms baseDAHWS in terms of active ratio, but its miss rate is a bit higher. Using a 16KB cache significantly increases the miss rate, because some programs require larger caches during a fraction of their execution time.

The same trends observed for the miss rates can be observed for the IPC: the baseDAHWS and hitsDAHWS techniques lose less than 1% performance, whereas Signature Size and Signature Size++ lose between 1% and 2% IPC. Finally, the 16KB cache loses more than 4% IPC.

It can be observed that Signature Size and baseDAHWS have a much lower number of reconfigurations than hitsDAHWS, but if we consider the number of ways reconfigured, the difference is not so large. Signature Size resizes the 4 ways 144 times (576 way reconfigurations), whereas hitsDAHWS resizes the 4 ways 34 times, 3 ways 20 times, 2 ways 48 times and 1 way 630 times (922 way reconfigurations). Signature Size++ causes a few more reconfigurations than Signature Size, because the cache size is accurately estimated and few extra reconfigurations are required to find the best cache configuration.

To summarize, hitsDAHWS and baseDAHWS have similar performance, and both are much better than Signature Size and Signature Size++ for L1 instruction caches.

L2 Cache

The Signature Size scheme, as acknowledged by the authors [38], does not work well for large caches. We have corroborated this with various experiments varying the bit vector size and the interval size. For instance, despite augmenting the bit vector size to 4096 and using 4M-instruction intervals, the Signature Size scheme increases the L2 miss rate by 12%. The miss ratio increase is still much higher if shorter intervals are used. Because of this, we do not present detailed results for it in this section. For DAHWS L2 caches, we have considered 1M-, 2M- and 4M-instruction intervals, since 100K-instruction intervals are too short due to the much lower utilization of the L2 cache.

Figure 3.39 shows the miss rate, percentage of active lines, IPC and number of reconfigurations for maxDAHWS with different interval sizes. It can be seen that maxDAHWS significantly reduces the active ratio (to 59% and 72% of the total capacity) at the expense of a small miss rate increase (between 0.37% and 0.58%).


[Four panels: "L2 Cache - Miss Rate" (20.0%-25.0%, with one off-scale label at 31.06%), "L2 Cache - Active Ratio" (0%-100%), "L2 Cache - IPC loss" (0%-7%) and "L2 Cache - Number of Reconfigurations" (0-200, split into 1, 2, 3 or 4 ways resized); schemes: baseline 2MB, maxDAHWS (1M), maxDAHWS (2M), maxDAHWS (4M) and baseline 1MB]

Fig. 3.39: Miss rate, percentage of active lines, IPC and number of reconfigurations for the L2 unified maxDAHWS cache. The number of reconfigurations is split according to the number of ways whose size is changed


We observe that increasing the interval size reduces the active ratio at the expense of a slightly larger miss rate. As expected, the longer the interval, the lower the number of reconfigurations. For 1M-instruction intervals, which is the worst case in terms of reconfigurations, the cache is reconfigured once every 6 million instructions. For 4M-instruction intervals, reconfigurations happen every 17 million instructions. In any case, spending some thousands of cycles reconfiguring the L2 cache has a negligible impact on performance.

If we consider using an L2 cache of half the size of the baseline one, we observe that even though its active ratio is a bit lower than that of maxDAHWS, the miss rate and the IPC are significantly harmed. For instance, we observe that for 4M-instruction intervals maxDAHWS reduces the active ratio by 41%, which is not far from the 50% active ratio reduction of a 1MB L2 cache, but the IPC loss is low for maxDAHWS (1.8%) and quite high for the 1MB L2 cache (6.8%).

3.4.5 Conclusions

We have presented the HWS cache, a new cache architecture where each cache way may have a different size. In general, HWS architectures can be used for any set-associative storage structure, such as caches, branch predictor tables, TLBs, etc. In this work we have focused on L1 data caches, L1 instruction caches, and unified L2 caches.

HWS caches exploit the fact that just a small percentage of the cache sets have multiple active lines at any point in time. Besides, the number of sets that store a given number of active lines decreases as the number of active lines per set increases. A conventional set-associative cache provides separate storage for each cache way of each set, whereas in a HWS cache some sets share the storage devoted to some of their cache ways. Because of its more effective use of the storage, the HWS cache has been shown to outperform conventional caches for L1 data and instruction caches, and for L2 caches. For instance, an 18KB 3-way HWS data cache can achieve 8% dynamic and 25% leakage energy savings with respect to a 24KB 3-way conventional cache, while the hit ratio is practically the same. Additionally, since HWS caches can be smaller, the HWS cache access time is expected to be equal to or lower than that of the conventional cache.

Besides, a HWS cache enables higher flexibility in the design. For instance, for 4-way set-associative caches with 32 bytes/line and cache ways between 2KB and 16KB, there are 4 possible conventional cache configurations and 31 extra HWS cache configurations. This higher flexibility makes HWS caches especially interesting for scenarios where adaptive schemes are applied. We have shown that with minor changes to state-of-the-art cache resizing algorithms, a HWS cache achieves higher energy savings and a lower miss rate than adaptive conventional caches.

Finally, a HWS cache is also advantageous when the hardware budget is constrained. For instance, if we have to design a 2-way set-associative cache no larger than 12KB due to area, power or latency constraints, a conventional design must choose between an 8KB 2-way cache (some capacity is lost) or a 12KB 3-way cache (set-associativity is increased). On the other hand, a 12KB 2-way HWS cache can be designed by having one way with 8KB capacity and the other with 4KB.




CHAPTER 4

ISSUE LOGIC

The issue logic of superscalar processors is one of the main hotspots due to its high power density. Besides, this logic has a significant delay and is difficult to pipeline [22, 90]. Overall, the design of low-latency, power-aware issue logic is an important challenge for continuing to scale up the performance of superscalar processors.

From the energy standpoint, techniques that dynamically adapt the issue queue size to reduce its dynamic and leakage energy are interesting. Such techniques require the issue queue to be implemented as a multiple-banked structure, which makes it possible to turn off entire banks when they are not needed.

The issue logic complexity is also strongly related to its high energy consumption. The issue logic is typically implemented using fully associative schemes for the wake-up process [25, 90]. This kind of scheme requires a mechanism for checking which operands become ready, for all the instructions in the issue queue, every time an instruction is selected for execution. Those instructions whose operands are all ready are considered in the selection process that follows the wake-up. Even though this approach results in high IPC rates, its latency may not be compatible with high clock rates. Additionally, its complexity grows drastically if the issue width or the issue queue size is increased. Large instruction windows are required to increase the opportunities to extract more ILP, which in turn requires wider pipelines.

In the following sections we study the issue queue energy and complexity, and propose some adaptive mechanisms to reduce both. First, in section 4.1 we present state-of-the-art techniques that address the energy and complexity reduction of the issue queue. Then, we present our proposals to tackle this problem in sections 4.2 and 4.3.

4.1 Related Work

This section reviews some related work, which has been classified into different categories for the sake of readability. Some of the techniques are later described in more detail, since they are used for comparison purposes.


[Diagram: one issue-queue entry holding RDY L, OPR, TAG L and TAG R, OPR, RDY R; each broadcast tag TAG 1..TAG IW is compared (=) against both source tags, and the comparator outputs are ORed into the corresponding ready bits]

Fig. 4.1: Issue logic for an entry of a CAM/RAM-array

4.1.1 Basic CAM-based Approaches

One of the most common ways to implement the issue logic is based on CAM/RAM-array structures. These structures can store a number of instructions, which in general is smaller than the total number of in-flight instructions. Each entry holds an instruction that has not been issued, or that has been issued speculatively but not yet validated and thus may need to be re-executed.

In general, each entry stores in RAM cells the operation, the destination operand and flags that indicate whether the source operands are ready. Source operand identifiers (also referred to as tags) are stored in CAM cells. After an instruction is selected for execution, its destination tag is broadcast to all the instructions in the issue queue. Each source tag in the queue is compared with the broadcast one and, if there is a match, the operand is marked as ready. This process is known as wakeup. In a superscalar processor, multiple tags can be broadcast and compared in parallel. Figure 4.1 shows a block diagram of the issue logic associated with one entry of the issue queue.
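
Functionally, the wakeup of one entry can be modeled as follows (a C sketch of Fig. 4.1 with our own names; the hardware performs all the comparisons in parallel):

    /* One issue-queue entry: two source tags with their ready bits. */
    typedef struct {
        unsigned tag_l, tag_r;
        int ready_l, ready_r;
    } iq_entry;

    /* Compare every broadcast destination tag against both source
     * tags; a match sets the corresponding ready bit. */
    void wakeup(iq_entry *e, const unsigned bcast[], int n)
    {
        for (int i = 0; i < n; i++) {
            if (e->tag_l == bcast[i]) e->ready_l = 1;
            if (e->tag_r == bcast[i]) e->ready_r = 1;
        }
    }

    /* The entry becomes a candidate for selection once both
     * ready_l and ready_r are set. */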

The selection process identifies those instructions whose source operands are ready and whose required resources are available, and issues them for execution. When more than one instruction competes for the same resource, the selection logic chooses one among them according to some heuristic [24].

The main source of complexity and power dissipation of the above scheme is the many tag comparisons that must be performed every cycle. The following subsections describe the main approaches that have been proposed to make this scheme more power-efficient. These approaches are classified into static and dynamic ones: the former use fixed structures, whereas the latter dynamically adapt some of the structures according to the properties of the executed code.


Dynamic Approaches

An approach to reducing the power dissipation of the issue logic is based on disabling the wakeup logic for those CAM cells that are either empty or correspond to operands that are already ready (i.e., that were woken up in the past but whose instruction has not been issued yet). This can easily be achieved by gating off the wakeup logic of each cell based on the value of the ready and empty bits [45]. This saves dynamic power.

In addition, leakage energy can also be reduced through a multiple-banked implementation of the issue queue, by turning off entire banks when they are empty.

Alternative resizing schemes have been proposed [26, 45, 94]. Folegnani and Gonzalez [45] propose a scheme that monitors the IPC contributed by the youngest part of the issue queue. If the number of committed instructions that were issued from the youngest part is below a given threshold, the issue queue size is reduced by one bank. Every certain number of cycles, the issue queue size is increased by one bank. Buyuktosunoglu et al. [26] propose a resizing scheme based on the usage of the issue queue entries.

Karkhanis et al. [70] proposed a mechanism for saving energy by means of just-in-time instruction delivery. Their mechanism tries to limit the number of in-flight instructions in order to reduce the activity in different structures such as the Icache, the issue queue and the rename tables. This mechanism saves significant dynamic power in the issue queue at the expense of nearly 3% IPC loss for integer applications (SpecINT2000), but it has not been evaluated for FP applications. The scheme triggers the resizing mechanism when the program enters a new program phase, defined as a given interval of the execution characterized by a particular instruction working set. A program phase transition is detected by a change in the IPC or in the number of executed branches, which is a very simple approximation that works well at coarse granularity but may fail to identify some phase transitions correlated with other factors.

Low-Complexity Approaches

Since it is hard to implement large out-of-order issue queues at high clock rates, some schemes are based on having fast and small issue queues along with other, simpler structures. Only a subset of the in-flight instructions is dispatched to the small and complex issue queues, whereas the rest go to the simpler structures. However, these simpler structures do not allow full out-of-order issue. In this type of scheme, deciding which instructions go to each structure is critical for performance.

For instance, since instructions that depend on a load that misses in cache will not be issued until the miss is serviced, Lebeck et al. [79] propose a mechanism that places the instructions in a conventional issue queue but, when a load misses in cache, moves all the instructions that depend directly or indirectly on the load to a waiting buffer. As soon as the miss is serviced, the instructions are moved back to the issue queue, since the proposed waiting buffer does not have issue capabilities.


In current superscalar processors, instructions that depend on a load are speculatively issued under the assumption that the load hits in cache, since otherwise a significant performance penalty would be paid. These speculatively-issued instructions are usually kept in the issue queue until it is confirmed that the load hits in cache, since they must be reissued if the load misses. This speculative technique obviously increases the required number of entries in the issue queue. Smart techniques can be applied to move the instructions that are issued speculatively as a consequence of a load-hit prediction to another buffer, as proposed by Moreshet and Bahar [84].

Finally, another way to decide which instructions must be placed in the fast and small issue queue is based on estimating the criticality of each instruction. Criticality has been studied in depth [43, 118]. It can be measured as the number of cycles that an instruction can be delayed without affecting the execution time of the program. No feasible method is known to compute the criticality of all the instructions in a real program, so the proposed schemes are based on heuristics to estimate it.

Assuming that instructions can be classified according to their criticality, the issue logic can be implemented with a small and fast CAM-based issue queue for critical instructions and another, slower issue queue for the rest, which can be larger than the CAM-based one, less complex, and dissipate less power [19]. Critical instructions are placed in the fast issue queue, whereas non-critical instructions are placed in the slow one.

Another way to reduce the issue queue complexity is based on the observation that most instructions have one or no non-ready operands at dispatch time. Ernst and Austin [41] propose designing the issue logic with three different queues: one without CAM logic for instructions ready at dispatch, one with CAM logic for just one operand for instructions with only one non-ready operand at dispatch, and a third with CAM logic for both operands for instructions with both operands non-ready at dispatch.

4.1.2 Matrix-based Approaches

An alternative way to implement the issue logic is by means of a bit matrix instead of a CAM structure. This matrix has as many rows as the issue queue has entries, and as many columns as there are physical registers.

When an instruction is dispatched, all the bits in its row except those corresponding to its non-ready input physical registers are cleared. The wakeup process consists in clearing the column corresponding to the output physical register that has been generated. An instruction is ready when all the bits in its row are set to zero. A functional sketch is shown below.
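
A minimal C sketch of this organization (our own sizes and names; the hardware clears a whole column in a single cycle):

    #include <stdbool.h>

    #define IQ_ENTRIES 32
    #define PHYS_REGS  128

    static bool dep[IQ_ENTRIES][PHYS_REGS];  /* 1 = waiting on register */

    /* Dispatch: clear the row, then set only the non-ready sources. */
    void matrix_dispatch(int row, const int src[], int nsrc)
    {
        for (int r = 0; r < PHYS_REGS; r++) dep[row][r] = false;
        for (int i = 0; i < nsrc; i++) dep[row][src[i]] = true;
    }

    /* Wakeup: a newly produced register clears its column. */
    void matrix_wakeup(int preg)
    {
        for (int e = 0; e < IQ_ENTRIES; e++) dep[e][preg] = false;
    }

    /* An instruction is ready when its whole row is zero. */
    bool matrix_ready(int row)
    {
        for (int r = 0; r < PHYS_REGS; r++)
            if (dep[row][r]) return false;
        return true;
    }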

A CAM-based implementation holds the tags, and the resulting complexity of the structure is logarithmic with respect to the number of physical registers, whereas the matrix-based approach holds bit vectors, which results in a linear complexity with respect to the number of physical registers. The weak points of a CAM-based issue queue are the high number of ports of the CAM cells required for the wakeup process (as many as results can be generated in a given cycle), and the high power dissipation.


The weak points of a matrix-based issue queue are the size of the matrix, the size of the decoders that select the column to be cleared in the wakeup process, and the logic to detect when an instruction is ready, since a larger number of bits has to be checked.

Improvements to the matrix-based approach have later been proposed in order to reduce the complexity and size of the matrix. Goshima et al. [53] propose distributing the matrix among integer, floating-point and load/store instructions, and narrowing the matrix based on the observation that most instructions depend on instructions that are close in the reorder buffer.

The matrix-based approach has also been used to implement the selection logic. Brown et al. [22] propose using this organization for detecting which operands are ready and also for detecting which functional units are required by an instruction and which functional units are available. This approach allows pipelining the issue logic.

4.1.3 Issue Logic Based on Dynamic Code Pre-Scheduling

Issue logic schemes [32, 31, 54, 83, 97] based on dynamic code pre-scheduling have been studied as an alternative to conventional issue queues. The target of the pre-scheduling based schemes is to schedule the instructions into an in-order buffer where the instructions are placed according to their expected issue time. The issue time is calculated from the latencies of the producers and their expected issue times. These techniques can be regarded as an approximation of a run-time VLIW organization. The main advantage of this scheme is the elimination of the associative search needed for the wake-up phase.

Estimating the latency of operations is trivial for non-memory instructions, since their latency is fixed, but it is not so easy for memory accesses. The different schemes have similar approaches to dealing with fixed-latency instructions, and their main difference resides in the solution they provide for variable-latency instructions, as well as for their direct or indirect consumers. In these schemes, the wake-up and the select logic are simplified since both the availability of the operands and that of the execution resources are taken into account when the scheduling is done.

4.1.4 Issue Logic Based on Dependence Tracking

The last category of schemes consists of those that reduce the complexity of the issue logic through mechanisms based on tracking the dependences among the instructions and linking producers and consumers in some way. By keeping this explicit relation, these schemes avoid (or reduce) the associative look-up inherent in a conventional issue logic. Most of these schemes exploit the fact that most of the results generated by an instruction have just one consumer [32]. Thus, propagating the results only to the consumers is much more efficient than broadcasting the results to all the instructions in the queue.


In these mechanisms, the associative logic required by conventional schemes is replaced by a direct-access RAM structure. All these mechanisms use a table to keep track of the dependences, whereas instructions are kept in a separate table. When the instructions are ready to execute, they are forwarded to the issue logic.

Two alternative implementations have been proposed, differing in the place where the instructions are kept before they are issued: either in the dependence structure [32, 31, 89] or in a separate structure [63].

The dependence structure is indexed by physical register identifier. In the simplest schemes, each entry keeps just one consumer instruction for the corresponding register, which reduces the amount of parallelism that the issue logic can exploit. A number of approaches to relax this constraint have been proposed.
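A minimal sketch of such a one-consumer dependence table follows (names and sizes are ours; the actual proposals differ in how they handle a second consumer of the same register):

#include <stdbool.h>

#define PHYS_REGS 128
#define NO_CONSUMER (-1)

/* At most one consumer can be linked per physical register; a second
   consumer must fall back to another policy, which is what limits the
   exploitable parallelism. */
static int consumer_of[PHYS_REGS];

bool link_consumer(int src_reg, int consumer_entry) {
    if (consumer_of[src_reg] != NO_CONSUMER)
        return false;               /* slot taken: caller must handle it */
    consumer_of[src_reg] = consumer_entry;
    return true;
}

/* On writeback, wake only the linked consumer instead of broadcasting. */
int wake_consumer(int dest_reg) {
    int c = consumer_of[dest_reg];
    consumer_of[dest_reg] = NO_CONSUMER;
    return c;                       /* NO_CONSUMER if nobody was linked */
}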

A somewhat different approach that can be classified into this category is the one proposed by Palacharla et al. [90]. In this scheme, the issue logic is distributed into several FIFO queues. An instruction is dispatched to a queue whose last instruction is the producer of one of its source operands; if no queue meets this condition, the instruction is sent to an empty queue, and if none is available the dispatch stage is stalled. Placing instructions in this way guarantees that the instructions in a given FIFO must be executed sequentially, and thus only the instruction at the head of each queue must be monitored for potential issue.

4.2 Adaptive Issue Queue and Register File

In this section, we present a novel adaptive microarchitecture to reduce dynamic and leakage energy consumption in both the issue queue and the register file that is based on limiting the number of in-flight instructions. Our proposal is based on observing how much time the instructions spend in the reorder buffer and the issue queue, and taking resizing decisions based on these observations. The proposed technique delays the dispatch of instructions when this is not expected to degrade performance. This results in two types of advantages: a) it increases the number of empty entries in the different structures and opens more opportunities to turn them off, and b) it shortens the time that instructions are in-flight and thus reduces the number of failed issue attempts.

The rest of the section is organized as follows. Section 4.2.1 describes the baseline issue queue, register file and reorder buffer designs that have been assumed. Section 4.2.2 describes the proposed technique and the mechanism used for comparison purposes. Section 4.2.3 evaluates the performance of the proposed approach. Finally, a summary of this work is presented in section 4.2.4.

4.2.1 Baseline Microarchitecture

In this section we describe the baseline microarchitecture, with special emphasis on the structures that are the target of this work: the issue queue, the register file and the reorder buffer.


Fig. 4.2: Multiple-banked issue queue (block diagram: the per-entry Active signals of a bank feed the Turn_Off logic that enables or disables the bank's CAM and RAM arrays; the wakeup logic drives READY signals into the selection logic)

Processor

Two different organizations for the storage of speculative values have been studied. The first one is similar to that of the Alpha 21264 [40] and Pentium 4 [117]: speculative and committed values are stored in a centralized register file. The second one is similar to that of the HP PA8700 [58]: committed values are stored in an architectural register file, whereas speculative values are stored in rename buffers until commit. Integer and FP values are kept in separate files in both cases, so there are two register files for the first organization and two sets of rename buffers for the second one. In the rest of this work, the organization based on a centralized register file will be referred to as RegF, whereas the one based on rename buffers will be referred to as RenB.

Issue Queue

This work is based on a multiple-banked issue queue where the instructions are placed in sequential order. As in previous work [45, 26], no compaction mechanism for the issue queue has been assumed, since compaction results in a significant amount of extra energy consumption [25].

The assumed issue queue can turn off each bank independently, as in [45, 26]. Figure 4.2 shows a block diagram of the issue queue and the logic to turn off a bank. This scheme provides a simple mechanism to turn off at the same time the CAM array, where the tags to be compared are stored, and the RAM array, where the operands of the instructions are stored. The selection logic is always turned on, but its energy consumption is much lower than that of the wakeup logic [90]. The mechanism guarantees that a given bank will not be turned off until all the instructions in this bank are issued and the bank is empty.

In order to avoid the wakeup of empty entries placed in turned-on banks, we assume that the wakeup is gated in each individual entry that is empty [45]. This capability has been assumed for all the compared mechanisms, including the baseline.

Fig. 4.3: Scheme of a read operation (8 banks with 4 entries per bank; reading register #30, 11110b)

Fig. 4.4: Scheme of a write operation (8 banks with 4 entries per bank; writing register #30, 11110b)

Register Files and Rename Buffers

This section describes the implementation assumed for the register file; a similar implementation has been assumed for the rename buffers. The integer and FP register files are identical. A register file is split into banks (8 entries per bank in our experiments). In order to reduce the bank access time, the bank selection logic and the decoding of the entry to be accessed are done in parallel.

Figure 4.3 shows the scheme for a read operation. One entry of each bank is read, and the output logic selects the requested register among those. It can be observed that this scheme overlaps the bank selection with the decoding and reading of each bank. Figure 4.4 illustrates a write operation. The wordlines that select the requested register for every bank are gated by the bank selection logic. In this case, the bank selection is overlapped with just the wordline decoding because the write must be performed only in the proper bank.


Component            Abbrev.   Delay (ps)   Energy (pJ)
Address routing      add       84           1.3
Decode (4 to 16)     4to16     232          3.8
Decode (3 to 8)      3to8      203          1.5 per bank
Wordline + bitline   wlbl      134          5.1 per bank
Data to/from bank    data      104          10.8 per bank
Out driver           out       106          27.6

Table 4.1: Delay and energy for the different components of a multiple-banked register file design

Seq. scheme   Critical path                     Delay (ps)   Energy (pJ)
Read          add+4to16+3to8+wlbl+data+out      863          50.1
Write         add+4to16+3to8+wlbl               653          22.5

Par. scheme   Critical path                     Delay (ps)   Energy (pJ)
Read          add+3to8+wlbl+data+out            631          32.7 + 17.4 x #BanksOn
Write         4to16+ctrl wordlines(=add)+wlbl   450          10.2 + 12.3 x #BanksOn

Table 4.2: Delays and energy for read/write operations in the sequential and parallel schemes

This implementation of the register file reduces its access time at the expense of notably increasing its dynamic energy consumption. If the access time of this structure is not critical, a sequential decoding scheme could be considered.

In this work, the parallel implementation of the multiple-banked register file has been assumed for all the compared mechanisms, including the baseline. This decision is justified by the estimated access time for both schemes. For this purpose, we used CACTI 3.0 [111], with a configuration of 16 banks, 8 registers per bank, 64-bit data width, 0.10 µm technology, and 16 read and 8 write ports. Table 4.1 shows the delays obtained for each component of the register file.

Table 4.2 shows the delay of the critical path for both a read and a write operation in both schemes. The table shows that the parallel scheme reduces the access time by 27% for read operations and 29% for write operations with respect to the sequential scheme. #BanksOn stands for the number of turned-on banks at the operation time.

Turning off unused banks can save leakage energy for both schemes and dynamic energy for the parallel one. A given bank is turned on as soon as at least one of its registers is allocated to an instruction as its destination operand. A given bank is turned off when none of its registers is being used. This scheme can be easily implemented by adding a bit (BusyBit) to every register. This bit is set when a register is allocated to an instruction and is reset when the instruction commits and frees the previous mapping of its destination register. The bank enable/disable signal is a NOR function of its registers' BusyBits.

In order to maximize the number of banks that are turned off, when a free register is requested, the one with the lowest bank identifier is chosen, so that the activity in the register file is concentrated on the banks with lower identifiers.
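Both policies are simple enough to capture in a few lines; the following C sketch (our own illustration, using the bank geometry from table 4.3) shows the BusyBit bookkeeping and the lowest-bank-first allocation:

#include <stdbool.h>

#define NUM_BANKS 14            /* 14 banks x 8 registers, as in table 4.3 */
#define REGS_PER_BANK 8

static bool busy[NUM_BANKS][REGS_PER_BANK];  /* one BusyBit per register */

/* A bank is enabled iff any BusyBit is set; in hardware the disable
   signal is the NOR of the bank's BusyBits. */
bool bank_enabled(int b) {
    for (int r = 0; r < REGS_PER_BANK; r++)
        if (busy[b][r])
            return true;
    return false;
}

/* Always pick the free register with the lowest bank identifier so that
   activity concentrates on low-numbered banks and the rest can stay off. */
int alloc_reg(void) {
    for (int b = 0; b < NUM_BANKS; b++)
        for (int r = 0; r < REGS_PER_BANK; r++)
            if (!busy[b][r]) {
                busy[b][r] = true;
                return b * REGS_PER_BANK + r;
            }
    return -1;                  /* no free register */
}

/* Reset at commit, when the previous mapping of the destination dies. */
void free_reg(int reg) {
    busy[reg / REGS_PER_BANK][reg % REGS_PER_BANK] = false;
}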


Reorder Buffer

Since the reorder buffer does not store register values, its contribution to the total energy consumption is very small and thus reducing its energy consumption is not the objective of this work. However, its occupancy can be limited dynamically to control the number of in-flight instructions and thus the pressure on the issue queue and register files.

4.2.2 Adaptive Schemes

This section describes the proposed mechanism and the mechanism used for comparison purposes [45].

Proposed Mechanism

Underlying Concepts. Superscalar processors try to keep both the reorder buffer and the issue queue full. In general, dispatching¹ instructions as soon as possible is beneficial for performance, but not for power. However, in many cases instructions are held in the issue queue for some cycles before they are finally issued. From the performance standpoint, it is desirable not to delay the issue of any instruction. From the power standpoint, it is desirable that instructions remain in the issue queue for the minimum number of cycles (in this way the number of times that an instruction attempts to issue is minimized). Our proposal tries to achieve these objectives by means of various heuristics:

• The first heuristic tries to reduce the time that instructions spend in the issue queue waiting to be issued. If it is observed that instructions wait too long, the instruction window size (reorder buffer size) is reduced and thus the dispatch of instructions is delayed. Reducing the number of entries in the reorder buffer has a twofold side effect: a) it reduces the number of instructions in the issue queue, and b) it reduces the number of registers in use.

• The second heuristic tries to prevent situations where the limited instruction window size is harming performance. Even if instructions spend too much time in the issue queue, it is not desirable to be so aggressive if there are few instructions in the reorder buffer.

• Finally, there are some events that require an immediate action. In particular, L2 data cache misses, which have a very long latency, stall the commit of instructions for many cycles. Thus, in case of an L2 miss it is worthwhile to increase the instruction window size to allow the processor to process more instructions while the miss is being serviced.

¹We call dispatch the action of sending an instruction to the issue queue, whereas we call issue the action of sending an instruction from the issue queue to execution.


Deciding when instructions spend too long in the issue queue is one of the tricky parts of the mechanism. We are interested in finding the minimum number of cycles that the instructions need to spend in the issue queue without losing significant IPC. In order to obtain this information, we have experimentally observed the behavior of different programs for short intervals of time. If we just consider the intervals of time when a certain IPC is achieved, we can observe some trends: a) the minimum time that instructions spend in the issue queue and the time that they spend in the reorder buffer are correlated; b) this correlation is not linear: the longer the time in the reorder buffer, the longer the time in the issue queue, but the ratio between the latter and the former decreases as the time spent in the reorder buffer increases. As an example, it could happen that if the instructions spend on average 10 cycles in the reorder buffer, they need to spend on average 5 cycles in the issue queue. If these instructions spend 7 instead of 5 cycles, the IPC is not significantly higher, so it is worth maintaining this ratio (5/10 = 0.5). But if the instructions spend 20 cycles in the reorder buffer, they do not need to spend on average 10 cycles in the issue queue: as mentioned before, the longer the time in the reorder buffer, the smaller the ratio between the time spent in the issue queue and the time spent in the reorder buffer. In this example it could happen that the instructions need to spend on average 8 cycles in the issue queue (8/20 = 0.4).

We have studied some benchmarks (applu, apsi, art, gcc, mesa, perlbmk and twolf) and we have observed this trend for all of them. We have simulated these benchmarks tracing the average number of cycles spent in the issue queue and the reorder buffer, as well as the IPC, every 1000-cycle interval (the interval length has been chosen arbitrarily). We have chosen intervals where the IPC is similar to the global IPC for the whole program, and we have observed the mentioned trend.

Implementation of the Mechanism. The first and second heuristics outlined above are based on measuring the number of cycles that instructions spend in the issue queue and in the reorder buffer, as discussed in the previous section. However, an exact computation of these parameters may be quite expensive in hardware (e.g. time stamps for each entry) and consume a non-negligible amount of energy. Little's law [128] says that for queuing systems in which a steady-state distribution exists, the following relation holds:

Lq = λ · Wq (4.1)

Lq, λ and Wq stand for the average queue size, the average number of arrivals per time unit and the average time that a customer spends in the queue, respectively. For the issue queue and the reorder buffer the arrivals (λ) are exactly the same, so instead of counting how many cycles (Wq) every committed instruction spends in the issue queue and the reorder buffer, we count how many instructions (Lq) are in these structures every cycle. This approximation implies that all instructions that arrive at the queues but do not commit are also counted. We have observed in our simulations that counting these instructions or not makes no significant difference.
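The substitution is immediate once the common arrival rate is written out; applying equation (4.1) to both structures (the subscripts IQ and ROB are ours):

W_IQ / W_ROB = (L_IQ / λ) / (L_ROB / λ) = L_IQ / L_ROB

so the ratio of waiting times can be tracked with two per-cycle occupancy counters instead of per-entry time stamps.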


(1) THRESHOLD_LOW  = 1 - ROB_dyn_size / ROB_size
(1) THRESHOLD_HIGH = THRESHOLD_LOW + 1/8
(2) FRACTION = #instr_in_IQ / #instr_in_ROB
(3) if (FRACTION > THRESHOLD_HIGH)
(3)     ROB_dyn_size = max(ROB_dyn_size - 8, 32)
(3) else if (FRACTION < THRESHOLD_LOW)
(3)     ROB_dyn_size = min(ROB_dyn_size + 8, ROB_size)
(3) endif
(4) if (L2 miss during the period)
(4)     ROB_dyn_size = min(ROB_dyn_size + 8, ROB_size)
(4) endif
(5) if (#cycles_disp_stall > IQ_THRESHOLD_HIGH)
(5)     IQ_dyn_size = min(IQ_dyn_size + 8, IQ_size)
(5) else if (#cycles_disp_stall < IQ_THRESHOLD_LOW)
(5)     IQ_dyn_size = max(IQ_dyn_size - 8, 8)
(5) endif

Fig. 4.5: Heuristic to resize the reorder buffer and the issue queue

We have experimentally observed that this relationship between queue size and waiting time holds for the 7 benchmarks previously mentioned. We have studied them in the same way as in the previous section: observing, for every 1000-cycle interval, the average number of cycles spent in the queue and the average queue occupancy, for both the issue queue and the reorder buffer. Both relationships are near-linear, so we can conclude that using occupancy ratios instead of time ratios does not result in significant differences.

In order to take advantage of the relationship between the time spent in the issue queue and the time spent in the reorder buffer, the proposed mechanism uses the ratio between both occupancies (IQ occupancy / ROB occupancy) to take resizing decisions. If this value is higher than a given threshold, the window size is decreased by N instructions (8 instructions in our experiments), and if it is lower than another threshold, the window size is increased by N instructions.

These thresholds are dynamically adapted according to the observations made in the above section, that is, they depend on the reorder buffer size. Figure 4.5 details the approach for resizing the reorder buffer and the issue queue. ROB_size stands for the physical size of the reorder buffer (128 instructions in our evaluation), and ROB_dyn_size stands for the maximum number of instructions allowed to stay in the reorder buffer at a given time (similar definitions apply to IQ_size and IQ_dyn_size). In order to avoid an extremely small reorder buffer, the following constraint is applied: ROB_size/4 ≤ ROB_dyn_size ≤ ROB_size. The thresholds are set according to (1). The fraction of time that instructions spend in the issue queue versus the time that they spend in the reorder buffer is approximated as (2). This parameter is averaged for each interval of time. At the end of each interval, resizing decisions are taken according to the criteria described in (3): the reorder buffer dynamic size is increased by 8 instructions, decreased by 8 instructions, or left unchanged depending on the value of the FRACTION parameter and the thresholds.

Finally, the third heuristic in the above section is implemented as follows. Whenever there is an L2 cache miss, the reorder buffer size is increased, as (4) in figure 4.5 shows. In theory only data misses should be considered, but for the sake of simplicity we do not distinguish between instruction and data misses, since the majority of L2 misses correspond to data.

The register file banks (or rename buffers) are turned off when they are not busy, as explained in section 4.2.1. The issue queue occupancy is further controlled by a mechanism that monitors how many cycles the dispatch is stalled due to unavailable entries in the issue queue. As detailed in part (5) of figure 4.5, if stalls are too frequent, the issue queue size is augmented (#cycles_disp_stall stands for the number of cycles that the dispatch is stalled due to lack of space in the issue queue). If stalls are very rare, the issue queue size is decreased. This simple mechanism, along with the adaptive mechanism to limit the reorder buffer occupancy, achieves a significant reduction in issue queue size and register file pressure with very small performance loss. In our experiments, for an interval size of 128 cycles, different values for the issue queue thresholds were evaluated (2, 4, 8, 16, 32, 64), obtaining significant power savings and small performance degradation for these pairs of values: <16,32> and <16,64>. To simplify the implementation and avoid divisions and multiplications on fractional values, integer arithmetic is used instead of floating-point arithmetic. In particular, the thresholds are scaled as follows:

THRESHOLD_LOW = ROB_size − ROB_dyn_size (4.2)

THRESHOLD_HIGH = THRESHOLD_LOW + ROB_size/8 (4.3)

In our experiments we use a 128-entry reorder buffer, so THRESHOLD_HIGH corresponds to THRESHOLD_LOW + 16. The thresholds are compared with FRACTION, so this parameter is also scaled as follows:

FRACTION = ROB_size × #instr_in_IQ / #instr_in_ROB (4.4)

The multiplication in the above expression is trivial to implement since the reorder buffer size is a power of 2. For the division, the dividend has 11 bits and the divisor has 7 bits, assuming an interval of 128 cycles. This requires rather little hardware. In fact, an iterative divider can be used instead of a parallel one, since delaying the resizing decisions by a few cycles does not have any practical impact.
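For concreteness, a software rendering of the interval-end computation with the scaled integer thresholds might look as follows (a sequential approximation of logic that is trivially parallel in hardware; the occupancy arguments are per-interval accumulations, and the divide-by-zero guard is our addition):

#include <stdbool.h>
#include <stdint.h>

#define ROB_SIZE 128            /* physical reorder buffer size */

static uint32_t rob_dyn_size = ROB_SIZE;

/* iq_occ and rob_occ accumulate the per-cycle occupancies over the
   128-cycle interval; l2_miss records whether an L2 miss occurred. */
void resize_rob(uint32_t iq_occ, uint32_t rob_occ, bool l2_miss) {
    /* scaled thresholds, equations (4.2) and (4.3) */
    uint32_t thr_low  = ROB_SIZE - rob_dyn_size;
    uint32_t thr_high = thr_low + ROB_SIZE / 8;
    /* scaled fraction, equation (4.4) */
    uint32_t fraction = rob_occ ? (ROB_SIZE * iq_occ) / rob_occ : 0;

    if (fraction > thr_high && rob_dyn_size > ROB_SIZE / 4)
        rob_dyn_size -= 8;      /* instructions linger: shrink the window */
    else if (fraction < thr_low && rob_dyn_size < ROB_SIZE)
        rob_dyn_size += 8;      /* instructions flow quickly: grow it */

    if (l2_miss && rob_dyn_size < ROB_SIZE)
        rob_dyn_size += 8;      /* hide the long miss latency */
}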

The energy consumption of the additional hardware is negligible because only three small counters are updated every cycle, and the rest of the structures work only once every interval (128 cycles in our experiments). Assuming that the divider is implemented as a radix-4 divider (2 bits of the quotient are computed each cycle), the total hardware required is one multiplexor and fewer than 20 units (adders, incrementers and comparators) whose inputs always have 11 bits or less. We have experimentally verified that delaying the resizing of the reorder buffer by 2 or 3 cycles to allow for an iterative divider has negligible impact on performance.


The Mechanism Used for Comparison

The proposed mechanism has been compared with the mechanism proposed in [45], which will be referred to as FoGo in the rest of this study. The comparison has been done in two ways: using the parameters that the authors of FoGo reported as the best ones, and using the same resizing interval as our mechanism (128 cycles). The issue queue has the same structure for both the proposed mechanism and the mechanism used for comparison, but the resizing schemes are different.

FoGo reduces power consumption by dynamically resizing the issue queue. The mechanism monitors the youngest bank of the issue queue (8 instructions in their experiments) and measures how much these entries contribute to the IPC. If the contribution is negligible, the issue queue size is reduced (one bank is turned off). On the other hand, the size of the queue is increased periodically if it is smaller than its maximum size.

Every time that an instruction is issued, if it is one of the 8 youngest instructions in the queue, their mechanism sets a bit in the corresponding entry of the reorder buffer. In the commit stage, the number of instructions that have this bit set is accumulated. If fewer than N instructions were issued from the youngest part during an interval of time, the issue queue size is reduced. Their experiments showed that an interval of 1000 cycles and a threshold of 25 instructions save considerable power in the issue queue with a very small performance loss. Every 5 intervals, the issue queue size is increased by one bank.
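The following sketch captures our reading of that policy (an illustration of the description above, not the authors' code; all names, the illustrative bank count and the bookkeeping details are ours):

#include <stdbool.h>

#define FOGO_INTERVAL 1000      /* reported best interval (cycles) */
#define FOGO_THRESHOLD 25       /* reported best threshold (instructions) */
#define GROW_PERIOD 5           /* grow one bank every 5 intervals */

static int iq_banks = 10, iq_max_banks = 10;   /* illustrative sizes */
static int young_commits, cycles, intervals;

/* At commit: 'from_youngest' is the bit set at issue time when the
   instruction was among the 8 youngest in the queue. */
void fogo_commit(bool from_youngest) {
    if (from_youngest)
        young_commits++;
}

/* Once per cycle: at each interval boundary, shrink if the youngest
   bank contributed little, and grow periodically up to the maximum. */
void fogo_cycle(void) {
    if (++cycles < FOGO_INTERVAL)
        return;
    cycles = 0;
    if (young_commits < FOGO_THRESHOLD && iq_banks > 1)
        iq_banks--;
    if (++intervals == GROW_PERIOD) {
        intervals = 0;
        if (iq_banks < iq_max_banks)
            iq_banks++;
    }
    young_commits = 0;
}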

For the comparison presented below, we have chosen the configuration with the parameters that they report as the most appropriate ones (FoGo1000) and the same parameters but with an interval of 128 cycles (FoGo128) and a correspondingly scaled threshold of 3 instructions issued from the youngest part.

4.2.3 Performance Evaluation

In this section we present performance and power results for the proposed mechanism, and compare it with the technique proposed in [45].

Experimental Framework

Power and performance results are derived from Wattch [21]. The model required for multiple-banked structures has been obtained from CACTI 3.0 [111]. For this study we have selected the whole SPEC CPU2000 benchmark suite [115]. Table 4.3 describes the assumed processor configuration.

Interval Length

In order to choose a suitable interval to resize the structures, we have done some experiments to identify a good tradeoff between performance and power savings. Figure 4.6 shows the IPC with respect to the baseline for different interval lengths using 3 integer benchmarks (gap, gzip, twolf) and 3 FP benchmarks (ammp, applu, art).


Parameter                      Value
Fetch, Decode, Issue, Commit width:  8 instructions/cycle
Fetch queue size:              64 entries
Issue queue size:              80 entries
Load/store queue size:         64 entries
Reorder buffer size:           128 entries
RegF microarchitecture:        INT registers: 112 (14 banks x 8)
                               FP registers: 112 (14 banks x 8)
RenB microarchitecture:        INT rename buffers: 80 (10 banks x 8)
                               FP rename buffers: 80 (10 banks x 8)
IntALU's:                      6 (1 cycle)
IntMult/Div:                   3 (3 cycles pipelined mult, 20 cycles non-pipelined div)
FP ALU's:                      4 (2 cycles pipelined)
FP Mult/Div:                   2 (4 cycles pipelined mult, 12 cycles non-pipelined div)
Memory Ports:                  4 R/W ports (2 cycles pipelined)
Branch Predictor:              Hybrid: 2K entry Gshare, 2K entry bimodal and 1K entry metatable
BTB:                           2048 entries, 4-way
L1 Icache size:                64KB 2-way, 32-byte lines, 1 cycle latency
L1 Dcache size:                64KB 4-way, 32-byte lines, 4 R/W ports, 2 cycles
L2 Unified cache:              512KB, 4-way, 64-byte lines, 10 cycles latency
Memory:                        50 cycles, 2 cycles interchunk
Data TLB:                      128 entries, 30 cycles miss penalty
Instruction TLB:               128 entries, 30 cycles miss penalty
Technology:                    0.10 µm

Table 4.3: Processor configuration


Fig. 4.6: IPC for different interval lengths (IPC with respect to baseline, 90%-100%, for interval lengths of 32, 128 and 512 cycles; benchmarks ammp, applu, art, gap, gzip, twolf)

Fig. 4.7: Reorder buffer occupancy reduction for different interval lengths (0%-30% reduction vs. baseline for interval lengths of 32, 128 and 512 cycles; same benchmarks)


These particular benchmarks were chosen due to their different behavior in the main features: memory, issue queue requirements, reorder buffer requirements, branch misprediction rate, etc. It can be seen from figure 4.6 that, in general, longer intervals lose less performance. Figure 4.7 shows the reorder buffer occupancy reduction for different interval lengths. It can be observed that the shorter the interval, the higher the occupancy reduction. A higher occupancy reduction will translate into better opportunities to save power.

Figures 4.6 and 4.7 show that a 32-cycle interval hardly reduces the reorder buffer occupancy with respect to a 128-cycle interval, whereas it results in slightly higher performance degradation. In addition, the shorter the interval, the higher the energy overhead to resize the structures. The 512-cycle interval in general loses very little performance, but it is not as effective at reducing the reorder buffer occupancy. We can conclude that the 128-cycle interval is the best tradeoff between power and performance.

Performance and Power Results

The performance evaluation has been done by comparing two versions of the proposed technique, two versions of FoGo (FoGo128 and FoGo1000, as described above), and a baseline with no adaptive resizing. The two versions of our technique are those corresponding to different values for IQ_THRESHOLD2. We will refer to them as IqRob32 (IQ_THRESHOLD2=32) and IqRob64 (IQ_THRESHOLD2=64) in the rest of this work. The baseline architecture resizes neither the issue queue nor the reorder buffer, but it does have the mechanisms that we have assumed for IqRob and FoGo to turn off unused register file banks (or rename buffer banks) and to avoid the wakeup of empty entries in the issue queue.

Performance. Figure 4.8 shows the IPC loss for the different mechanisms. IqRob32 and IqRob64 have better performance than FoGo1000 and FoGo128, respectively, for the SpecINT2000 and the whole Spec2000, and achieve similar results for the SpecFP2000. On average, the IqRob32 technique loses less than 2% IPC and IqRob64 loses less than 3.5% IPC. The FoGo technique reduces the size of the issue queue when the IPC contribution of the youngest bank is below a fixed threshold. This threshold basically determines the loss of IPC that the mechanism may cause and thus it has a bigger impact for programs with lower IPC, such as some of the SpecINT2000.

Reorder Buffer. The IqRob technique results in a lower reorder buffer occupancy than FoGo. The IqRob technique resizes the reorder buffer in order to stop the dispatch process when it is expected that new instructions will not increase performance. Having fewer instructions in the reorder buffer implies that fewer issue queue entries and registers are used, so more energy is saved. IqRob is significantly more effective than FoGo at reducing reorder buffer occupancy, especially for integer applications due to their lower ILP. IqRob64 reduces reorder buffer occupancy by nearly 20% on average and over 20% for SpecINT2000.


Fig. 4.8: IPC loss for the different techniques (0%-4% IPC loss vs. baseline for SPECINT, SPECFP and SPEC; bars for FoGo1000, FoGo128, IqRob32 and IqRob64)

           SpecINT2000   SpecFP2000   Spec2000
IqRob32    35.4%         23.5%        29.0%
IqRob64    34.2%         22.4%        27.9%

Table 4.4: Reorder buffer size reduction


The additional reduction in occupancy provided by IqRob comes from the scheme that dynamically limits the maximum size of the reorder buffer. Table 4.4 shows the effectiveness of this scheme. On average, the maximum reorder buffer size is set to about 70% of its total capacity. About 45% of the entries are occupied and 25% of the entries are enabled but empty. This is mainly due to sections of code where instructions spend few cycles in the issue queue. The IqRob mechanism tends to increase the reorder buffer size in these situations because these instructions do not waste much power in the issue queue.

Issue Queue. The IqRob and FoGo mechanisms resize the issue queue in different ways: both do so directly (through different heuristics), but in addition IqRob resizes the reorder buffer, which also causes a reduction in the issue queue occupancy as a side effect.


Fig. 4.9: Issue queue dynamic energy savings (0%-40% for SPECINT, SPECFP and SPEC; bars for Baseline, FoGo1000, FoGo128, IqRob32 and IqRob64)

Fig. 4.10: Issue queue leakage energy savings (0%-40% for SPECINT, SPECFP and SPEC; bars for FoGo1000, FoGo128, IqRob32 and IqRob64)


IqRob32 and IqRob64 achieve a higher occupancy reduction in the issue queue than FoGo1000 and FoGo128, respectively. This occupancy reduction allows IqRob64 to turn off 29% of the banks for the whole Spec2000. FoGo128 turns off only about 1% more banks than IqRob32, but it loses twice as much performance when compared with the baseline (FoGo128 loses 3.6% IPC and IqRob32 loses 1.8% IPC). The effect of turning off these banks is a reduction of the dynamic and leakage energy requirements of the issue queue. Additionally, some extra dynamic energy is saved by avoiding the wakeup of empty entries, which is the only source of the dynamic energy savings for the baseline. Figures 4.9 and 4.10 show this effect. It can be observed that significant power savings can be achieved. IqRob32 and IqRob64 outperform FoGo1000 and FoGo128, respectively, in energy savings as well as performance.

Integer Register File and Rename Buffers. As discussed above, reducing the number of in-flight instructions results in a lower number of registers in use. IqRob achieves higher reductions than FoGo due to its higher effectiveness at reducing the reorder buffer size. FoGo1000 reduces the register pressure by 7% and FoGo128 by 15%, whereas IqRob32 and IqRob64 achieve reductions of 18% and 20%, respectively. These register pressure reductions are exactly the same for both architectures (RenB and RegF) since they have been configured with exactly the same number of registers (80 rename buffers + 32 logical registers for RenB, and 112 registers for RegF). Figures 4.11 and 4.12 show that IqRob and FoGo achieve higher dynamic energy savings in the register file and rename buffers than the baseline.

It can be seen that higher energy savings are achieved for the RenB architecture. The main reason is that rename buffers with high indices are freed as soon as the instruction commits, and thus the registers in use correspond almost always to low-index registers. In this way, high-index banks can be turned off in most of the cases when the number of unused registers is larger than the size of a bank. For the RegF architecture it may happen that a register with a high index is allocated to an instruction and remains allocated for a very long period of time after the instruction commits, preventing the corresponding bank from being turned off.

Floating Point Register File and Rename Buffers. FP registers and rename buffers are hardly used by integer programs, so we report energy statistics only for FP programs. Figures 4.13 and 4.14 show dynamic and leakage energy savings. It can be seen that IqRob outperforms FoGo in both dynamic and leakage energy savings for both architectures. Additionally, the FP register requirements are reduced by more than 13% for both IqRob techniques and by less than 10% for the FoGo techniques.

Dispatched Instructions. The IqRob and FoGo mechanisms delay the dispatch of instructions in order to reduce the time that they spend in-flight. Some of these instructions are never dispatched due to branch mispredictions, which results in additional energy savings. Figure 4.15 shows the reduction in dispatched instructions. IqRob32 and IqRob64 outperform FoGo1000 and FoGo128, respectively. The achieved reduction is higher for integer programs because they have many more branches and mispredictions are much more frequent.


Fig. 4.11: Dynamic energy savings for the integer register file and rename buffers w.r.t. the baseline (0%-20% for SPECINT, SPECFP and SPEC, shown for both RenB and RegF; bars for FoGo1000, FoGo128, IqRob32 and IqRob64)

Fig. 4.12: Leakage energy savings for the integer register file and rename buffers w.r.t. the baseline (0%-20%, same organization as figure 4.11)


Fig. 4.13: Dynamic energy savings for the FP register file and rename buffers w.r.t. the baseline (0%-14% for SPECFP, shown for both RenB and RegF; bars for FoGo1000, FoGo128, IqRob32 and IqRob64)

Fig. 4.14: Leakage energy savings for the FP register file and rename buffers w.r.t. the baseline (0%-14%, same organization as figure 4.13)


Fig. 4.15: Reduction in number of dispatched instructions (0%-10% for SPECINT, SPECFP and SPEC; bars for FoGo1000, FoGo128, IqRob32 and IqRob64)


Summary. Table 4.5 summarizes the main performance statistics of the proposed IqRob mechanism with respect to the baseline configuration. The FoGo mechanism is also shown for comparison purposes.

DynEnergy, LeakEnergy, IQ, Reg Pressure, RegF, RenB and DispInstr Reduction stand for dynamic energy savings, leakage energy savings, issue queue, register pressure reduction, register file, rename buffers and reduction in dispatched instructions, respectively.

Table 4.5 shows that IqRob32 saves more dynamic and leakage energy than FoGo1000 in all the studied structures and loses less IPC. On the other hand, IqRob64 outperforms FoGo128 in all metrics. Additionally, IqRob32 performs similarly to FoGo128 with significantly lower IPC degradation.

Moreover, we also evaluated the effect of simply reducing the issue queue size by 20%, which would result in about the same energy savings in the issue queue as the proposed mechanism. However, this resulted in a higher IPC degradation (4.4%) and lower energy savings in the register file or rename buffers (around 4-5% higher energy consumption than IqRob32).


                       FoGo1000   FoGo128   IqRob32   IqRob64
IPC Loss               2.1%       3.6%      1.8%      3.3%
IQ DynEnergy           27.5%      34.2%     34.2%     37.6%
IQ LeakEnergy          13.0%      22.1%     21.1%     25.1%
INT Reg Pressure       7.2%       15.3%     18.1%     19.8%
FP Reg Pressure        5.8%       9.9%      13.0%     15.5%
INT RegF DynEnergy     4.1%       6.2%      8.1%      10.9%
INT RegF LeakEnergy    3.2%       5.4%      9.8%      13.0%
FP RegF DynEnergy      0.8%       2.0%      6.0%      7.4%
FP RegF LeakEnergy     1.0%       2.9%      7.3%      8.0%
INT RenB DynEnergy     6.0%       8.0%      13.0%     15.7%
INT RenB LeakEnergy    3.9%       7.0%      14.2%     18.2%
FP RenB DynEnergy      2.0%       4.5%      9.3%      10.6%
FP RenB LeakEnergy     2.7%       6.2%      10.2%     12.1%
DispInstr reduction    2.5%       3.7%      3.3%      5.1%

Table 4.5: Summary of results

4.2.4 Conclusions

We have presented a novel scheme that dynamically limits the number of in-flight instructions in order to save dynamic and leakage energy in the issue queue and the register file. The proposed mechanism is based on monitoring how much time instructions spend in both the issue queue and the reorder buffer, and limiting their occupancy based on these statistics.

The proposed mechanism has been evaluated in terms of performance, dynamic and leakage energy savings, and reduction in the number of dispatched instructions for the whole SPEC CPU2000. The results have been compared with a state-of-the-art issue queue resizing technique, and it has been shown that the proposed technique outperforms previous work in terms of performance and energy savings. The proposed technique achieves more than 15% dynamic and 18% leakage extra energy savings in the integer rename buffers (45% dynamic and 54% leakage total energy savings with respect to not turning off banks) and more than 10% dynamic and 12% leakage extra energy savings in the FP rename buffers (23% dynamic and 30% leakage total energy savings with respect to not turning off banks). Significant energy savings are also achieved for the register files if they are used instead of rename buffers.

Additionally, the register requirements are reduced by more than 18% for the integer registers and more than 13% for the FP ones.

4.3 Low-Complexity Floating-Point Issue Logic

The complexity of the issue queue is a concern in terms of energy and delay. Different approaches have been proposed for the integer issue queue, but they do not work well for the floating-point one. To tackle this problem we present a low-complexity FP issue logic (MB_distr) that achieves high performance with small energy requirements.


The MB_distr scheme is based on classifying instructions and dispatching them into a set of queues depending on their data dependences. These instructions are selected for issuing based on an estimation of when their operands will be available, which is computed locally at each queue, so the conventional wakeup activity is not required. Additionally, the functional units are distributed across the different queues for further energy savings and to reduce the complexity of the issue logic.

This section is organized as follows. Section 4.3.1 presents the proposed scheme. The performance evaluation is shown in section 4.3.2. Section 4.3.3 summarizes the main conclusions of this work.

4.3.1 Proposed Issue Logic Design

The objective of this work is to reduce the complexity of the issue logic by avoiding (or reducing) the associative look-up inherent in a conventional issue logic. The approach by Palacharla et al. [90] is the one that best achieves this objective, through a mechanism based on tracking dependences among the instructions.

They propose an issue queue design based on a small number of first-in first-out (FIFO) queues. Only the instructions at the head of each FIFO are considered for issue. Since our proposal is partially based on these FIFO queues, we describe this approach in detail. Instructions are dispatched to the FIFOs with the following heuristics:

• If there is a queue whose tail instruction produces the first operand of the instruction being dispatched, the instruction is placed in this queue. If the queue is full and the instruction has only one source operand, then dispatch is stalled.

• If there is a queue whose tail instruction produces the second operand of the instruction being dispatched, the instruction is placed in this queue. If the queue is full, then dispatch is stalled.

• Otherwise the instruction is placed in an empty FIFO. If there are no empty FIFOs, then dispatch is stalled.

These heuristics guarantee that instructions in a given FIFO must be executed sequentially. This mechanism only requires a table that stores, for each register, which queue (if any) has the register's producer at its tail. This table can be implemented in two different ways: storing the mentioned information for each physical register or for each architectural register. If the former is chosen, the table does not have to be modified after a branch misprediction. If the latter is chosen, the table stores wrong information after a branch misprediction, so it has to be regenerated or cleared. We have experimentally observed that clearing the table does not have a significant impact on performance and simplifies the hardware.

A FIFO-based organization does not require the wakeup logic. Every cycle, the instructions at the FIFO heads check whether their operands are ready in a small table that stores just one bit per physical register indicating whether it is available.
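The dispatch heuristics above translate almost directly into code; a sketch under our own naming (the fallthrough when the first-choice queue is full but a second operand exists is our interpretation of the rules):

#include <stdbool.h>

#define NUM_FIFOS 8             /* illustrative */

static int  tail_producer[NUM_FIFOS];  /* phys reg produced by each tail, -1 if none */
static bool fifo_full[NUM_FIFOS], fifo_empty[NUM_FIFOS];

static int find_tail(int reg) {        /* queue whose tail produces reg */
    if (reg < 0)
        return -1;
    for (int q = 0; q < NUM_FIFOS; q++)
        if (tail_producer[q] == reg)
            return q;
    return -1;
}

/* src1/src2 are source physical registers, -1 if absent.
   Returns the selected FIFO, or -1 to stall dispatch. */
int select_fifo(int src1, int src2) {
    int q1 = find_tail(src1);
    if (q1 >= 0) {
        if (!fifo_full[q1])
            return q1;
        if (src2 < 0)
            return -1;          /* full queue, single source: stall */
    }
    int q2 = find_tail(src2);
    if (q2 >= 0)
        return fifo_full[q2] ? -1 : q2;
    for (int q = 0; q < NUM_FIFOS; q++)
        if (fifo_empty[q])
            return q;           /* no producer at any tail: empty FIFO */
    return -1;                  /* no empty FIFO: stall */
}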


Parameter                      Value
Fetch, Decode, Commit width:   8 instructions/cycle
Issue width:                   8 INT + 8 FP instructions/cycle
Fetch queue size:              64 entries
Load/store queue size:         128 entries
Reorder buffer size:           256 entries
Register file:                 160 INT + 160 FP
IntALU's:                      8 (1 cycle)
IntMult/Div:                   4 (3 cycles pipelined mult, 20 cycles non-pipelined div)
FP ALU's:                      4 (2 cycles pipelined)
FP Mult/Div:                   4 (4 cycles pipelined mult, 12 cycles non-pipelined div)
Memory Ports:                  4 R/W ports (2 cycles pipelined)
Branch Predictor:              Hybrid: 2K entry Gshare, 2K entry bimodal and 1K entry metatable
BTB:                           2048 entries, 4-way
L1 Icache size:                64KB 2-way, 32-byte lines, 1 cycle latency
L1 Dcache size:                32KB 4-way, 32-byte lines, 4 R/W ports, 2 cycles
L2 Unified cache:              512KB, 4-way, 64-byte lines, 10 cycles latency
Memory:                        100 cycles, 2 cycles interchunk
Data TLB:                      128 entries, 30 cycles miss penalty
Instruction TLB:               128 entries, 30 cycles miss penalty
Technology:                    0.10 µm

Table 4.6: Processor configuration

A FIFO-based issue queue organization works well for integer applications since, in general, these programs have narrow dependence graphs that fit in a small number of FIFO buffers. Additionally, integer operations have short dependence chains with short latencies, so after being allocated to one dependence chain, a FIFO usually becomes empty quickly, which allows another dependence chain to be placed in it. Since FP programs have wide dependence graphs and long-latency operations, they require a large number of FIFOs. This can be observed in the following experiments. We will refer to this organization as IssueFIFO_AxB_CxD, where A and C correspond to the number of integer and FP queues respectively, and B and D correspond to the size of the integer and FP queues respectively.

Details of the processor configuration can be found in table 4.6. We have used the SPEC CPU2000 benchmark suite [115] and the Simplescalar simulator [23].

In this section we measure the potential of the different issue logic architectures. Thus, the issue queue of the baseline processor has the same size as the reorder buffer (256 entries). This corresponds to an unbounded issue queue, since the dispatch process is never stalled due to lack of entries in the issue queue. Smaller issue queues may be more cost-effective and are considered for the proposed scheme.

Figure 4.16 shows the IPC loss of the IssueFIFO scheme with respect to the baseline for the integer (first plot) and FP programs (second plot). Different configurations varying the number of queues and their sizes have been evaluated. For SpecINT2000 the integer queues are varied, whereas different configurations of the FP queues are explored for the SpecFP2000 benchmarks.


Fig. 4.16: IPC loss of the IssueFIFO technique w.r.t. the unbounded conventional issue queue. (a) Integer benchmarks: 0%-8% IPC loss for IssueFIFO_8x8_16x16, IssueFIFO_8x16_16x16, IssueFIFO_10x8_16x16, IssueFIFO_10x16_16x16, IssueFIFO_12x8_16x16 and IssueFIFO_12x16_16x16. (b) FP benchmarks: 0%-25% IPC loss for IssueFIFO_16x16_8x8, IssueFIFO_16x16_8x16, IssueFIFO_16x16_10x8, IssueFIFO_16x16_10x16, IssueFIFO_16x16_12x8 and IssueFIFO_16x16_12x16.


It can be observed that the IPC loss of IssueFIFO is relatively small for integer benchmarks (3% - 8%) whereas the complexity is reduced significantly. Increasing the number of FIFO queues achieves higher performance since the dispatch stage is stalled less frequently. On the other hand, large queues do not provide significant benefits: our experiments show that increasing the number of queue entries from 8 to 16 improves performance by only 0.1% for 8, 10 and 12 queues.

FP benchmarks show similar trends regarding the number and size of the queues, but it can be seen that these applications lose much more performance (18% - 25%) than integer ones. FP benchmarks have wider DDGs (data dependence graphs) than integer ones, so more queues are required. Increasing the number of queues does not come for free, since:

1. The logic to dispatch instructions to the queues becomes more complex.

2. There are more candidate instructions to be issued at a given cycle, so more instructions must check if their operands are ready.


3. The hardware for issuing the instructions to the functional units is more complex.

IssueCycle = MAX(current_cycle + 1, OpLeftCycle, OpRightCycle)
if (instruction is load)
    IssueCycle = MAX(IssueCycle, AllStoreAddr)
else if (instruction is store)
    AllStoreAddr = MAX(AllStoreAddr, IssueCycle of its address + AddressLatency)
endif
if (instruction has destination register)
    DestCycle = IssueCycle + InstructionLatency
endif

Fig. 4.17: Issue time computation for LatFIFO scheme

We can thus conclude that the IssueFIFO organization is suitable for integer DDGs but not for FP ones. Below, we propose more appropriate approaches for FP codes.

Latency Based Organization

The study of the IssueFIFO organization for FP benchmarks revealed that the dispatch process is stalled very often, but most of the queues store a very small number of instructions. Given that most FP operations have long latencies, interleaving different dependence chains in a single queue could be an interesting approach to reduce complexity with minimal impact on performance. However, it is crucial to interleave these dependence chains in an appropriate way, since instructions of the same queue are issued in the same order as they are placed. Ideally, one would like to place instructions in a given queue in such a way that a new instruction can be issued every cycle. This is what the scheme proposed in this section tries to achieve. For this purpose, we propose to estimate the issue time of each instruction, which depends on its dependences and the latencies of the operations, among other factors. We will refer to this organization as LatFIFO. We assume the same organization as IssueFIFO for the integer queues. However, for the FP ones, instructions are placed in FIFO queues considering the expected time when they will be ready to be issued. The expected issue time is computed at the dispatch stage as shown in figure 4.17.

OpLeftCycle and OpRightCycle stand for the cycles when the instruction's left operand (if any) and right operand (if any) will be available, respectively. We assume that instructions can be issued the cycle right after dispatching them (current_cycle + 1) in case they have all their operands ready. If a larger number of stages is assumed from dispatch to issue, this value must be used instead of "1". AllStoreAddr stands for the first cycle when the addresses of all previous store instructions will be known.


It should be noted that load and store instructions are split into two operations: one for computing the memory address and another for accessing memory. The memory access requires knowing that no conflict with previous stores exists, but the address computation does not. Store instructions update the cycle when the addresses of all store instructions will be known. AddressLatency stands for the number of cycles required to compute the address of a load or store instruction. DestCycle stands for the cycle when the destination operand (if any) will be available. InstructionLatency corresponds to the latency of the corresponding operation. The L1 Dcache hit latency is assumed for loads; we experimentally checked that knowing the exact number of cycles for each memory access has no significant effect on the proposed schemes. We assume that the above computations can be performed in a single cycle, which may be an optimistic assumption.

Each instruction is placed in a queue that is not full and whose last instruction has an estimated issue time at least one cycle earlier than that of the instruction being dispatched. If there is more than one queue that meets these conditions, the one whose last instruction is expected to be issued latest is selected. If no queue meets these conditions, an empty queue (if any) is selected. If no queue can be selected, the dispatch is stalled. Choosing the queue in this way leaves more opportunities for younger instructions to be dispatched without any stall.
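A compact sketch of this selection rule (our own rendering; the queue state and sizes are illustrative):

#include <stdbool.h>

#define NUM_FP_FIFOS 10         /* illustrative */

static int  tail_issue[NUM_FP_FIFOS];   /* expected issue cycle of each tail */
static bool q_full[NUM_FP_FIFOS], q_empty[NUM_FP_FIFOS];

/* Returns the queue for an instruction expected to issue at issue_cycle,
   or -1 to stall dispatch. */
int latfifo_select(int issue_cycle) {
    int best = -1;
    for (int q = 0; q < NUM_FP_FIFOS; q++) {
        /* candidate: not full, tail issues at least one cycle earlier */
        if (q_full[q] || q_empty[q] || tail_issue[q] > issue_cycle - 1)
            continue;
        /* among candidates, prefer the tail expected to issue latest */
        if (best < 0 || tail_issue[q] > tail_issue[best])
            best = q;
    }
    if (best >= 0)
        return best;
    for (int q = 0; q < NUM_FP_FIFOS; q++)
        if (q_empty[q])
            return q;           /* fall back to an empty queue */
    return -1;                  /* stall */
}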

The performance of the LatFIFO scheme for the FP benchmarks is shown in figure 4.18. It can be observed that the performance loss is much smaller than that of the IssueFIFO scheme, but it is still significant (8% - 16%). On average, the performance of LatFIFO is about 10% better than that of IssueFIFO. It can also be observed in figure 4.18 that increasing the size of the queues hardly improves performance.

Mixed Approach

The main reason for the loss of performance of the LatFIFO scheme is that instructions in a given queue must be issued in the same order as they are dispatched. A newly dispatched instruction always has to be placed at the tail of a queue, even if it is expected to be issued between two instructions placed consecutively in a given queue. An alternative could be to use conventional CAM/RAM issue queues, but they dissipate significantly more power and are much slower. In order to avoid the use of CAM cells but still have the flexibility of this kind of queues, the issue queue organization that we propose is based on a RAM structure similar to a register file. The main features of the proposed organization are the following:

• Instructions do not have to be placed in program order in this buffer.

• Only one instruction from each queue can be selected for issuing, so the selection logic is quite simple.

• Instructions do not need to know whether their operands are ready before they are chosen as candidates to be selected, so the wakeup process is not necessary.

• Dependent instructions are placed in the same queue, as the IssueFIFO scheme does. Given that each cycle only one instruction is selected per queue, having dependent instructions in the same queue reduces the probability of having more than one ready instruction in the same queue. Different independent dependence chains of instructions can share the same queue. These dependence chains will be referred to as chains in the rest of this work.


[Figure 4.18 here: bar chart of % IPC loss w.r.t. baseline (0% to 25%) for SPECFP, for configurations LatFIFO_16x16_8x8, LatFIFO_16x16_8x16, LatFIFO_16x16_10x8, LatFIFO_16x16_10x16, LatFIFO_16x16_12x8 and LatFIFO_16x16_12x16]

Fig. 4.18: IPC loss of the LatFIFO technique w.r.t. the unbounded conventional issue queue for the FP benchmarks


• Instructions that are considered for issue for the first time have priority over those that were not issued the first time they were supposed to be ready. This heuristic avoids selecting instructions that depend on either loads that missed in the cache or unfinished instructions of other queues, instead of those instructions whose issue has not been delayed.

• Latencies are considered in order to know when the instructions will be ready for issuing, but this is done locally at each queue so no complex hardware is required.

Implementation. There is a table that maps logical registers to queues. This table is similar to the one used by the IssueFIFO scheme, but in this case it stores the queue identifier and the chain identifier, since each queue can contain different chains. Thus, each queue has its own set of chains, and each entry in the table contains some bits identifying the queue where the operand is mapped and some bits identifying the chain of that queue whose last instruction produces the operand.


The use of the chain number is justified later. At dispatch time each instruction accesses this table to know the mapping of its source operands, and the queue where it will be placed is determined in a similar way as in the IssueFIFO scheme. The only difference is that an instruction is placed in the same queue as its predecessor only if the predecessor is the last instruction of the chain, instead of the last instruction of the queue. If the preferred queue is full or no appropriate queue is found, then a free chain identifier is assigned to the instruction. There are as many chains as the product of the number of queues and the number of chains per queue. In order to balance the number of busy chains per queue, the lowest free chain identifier is assigned. For instance, if there are 2 queues and 3 chains per queue, the chains will be assigned with the following priority order: chain 0 from queue 0, chain 0 from queue 1, chain 1 from queue 0, chain 1 from queue 1, chain 2 from queue 0, and chain 2 from queue 1. When an instruction is dispatched to a queue, the mapping table is updated with both the queue and the chain number.

Each queue has an associated selection logic that every cycle picks just one instruction from those in the queue, and a small table for the chain latencies. This table stores, for each chain, how many cycles the last issued instruction of the chain will take to finish. The table is very small, since it has as many entries as the number of chains per queue, and each entry needs only the number of bits required to encode the largest functional unit latency. Every cycle the entire table is read and written. It is written to decrease all the entries by one using saturating counters, except the entry corresponding to the chain of the instruction being issued (if any), which is updated with the instruction's latency. All the entries are read and their information is compressed and broadcast to all the entries in the queue. An instruction in the queue needs to know whether its predecessor in the chain has finished, is going to finish next cycle, or will take 2 or more cycles to finish. Thus, each entry of the latency table is encoded into 2 bits: 00 if the instruction is going to finish the next cycle, 01 if it has finished, and 11 if it will take 2 or more cycles to finish. Each entry in the queue selects its corresponding pair of bits and concatenates its age identifier to this pair of bits. The age identifier is a field that indicates the older/younger relationship among instructions in flight. It can be implemented by using the reorder buffer position plus one extra bit concatenated on the left that is reset every time that the first position of the reorder buffer is assigned. Combining the bits in this way allows the selection logic to select the oldest instruction among those with higher priority according to the criteria described above, using the same type of hardware as the one used by the baseline scheme. This mechanism is illustrated with an example in figure 4.19.
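
A sketch of this priority scheme follows (our helper names and widths; the hardware evaluates all keys in parallel, while the loop below scans them sequentially):

    #include <limits.h>

    /* 2-bit priority code of a chain, from the cycles its last issued
     * instruction still needs: 00 = finishes next cycle (highest priority),
     * 01 = already finished (its dependents were delayed), 11 = 2+ cycles. */
    static unsigned priority_code(unsigned cycles_left) {
        if (cycles_left == 1) return 0x0;
        if (cycles_left == 0) return 0x1;
        return 0x3;
    }

    /* Pick one instruction of a queue: the smallest (code, age) key wins,
     * i.e., the oldest instruction among those with the highest priority.
     * Returns the winning slot index, or -1 if the queue is empty.        */
    int select_slot(const unsigned *chain_cycles_left, const int *chain_of,
                    const unsigned *age_id, int nslots, int age_bits) {
        unsigned best_key = UINT_MAX;
        int best = -1;
        for (int s = 0; s < nslots; s++) {
            if (chain_of[s] < 0)            /* empty slot */
                continue;
            unsigned code = priority_code(chain_cycles_left[chain_of[s]]);
            unsigned key  = (code << age_bits) | age_id[s];
            if (key < best_key) { best_key = key; best = s; }
        }
        return best;
    }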

Even if the selection hardware is not trivial, it is much simpler than the one required by a conventional issue queue, since it has to select one instruction in each small issue queue instead of the N oldest ready instructions among the whole issue queue.


[Figure 4.19 here: worked example showing the reorder buffer (head, tail, extra bit and age identifiers for instructions i to i+5), the per-queue table of chain cycles, the 2-bit compressed latency information, and the issue queue slots with their chain numbers feeding the SELECT logic]

Fig. 4.19: Example of selection

In this example the instruction i+1 is selected for issuing the next cycle, since it is the oldest one among those with higher priority (those belonging to chains 1 and 2). The selection logic just picks the instruction with the smallest identifier. This example shows that instructions belonging to the same chain have the same most significant pair of bits, so among them the oldest one has the highest priority.

This scheme will be referred to as MixBUFF in the rest of this work. This scheme uses buffers instead of FIFO structures for the FP queues, and both dependence and latency criteria are considered. Its performance has been evaluated assuming that an unbounded number of chains per queue is allowed. As figure 4.20 shows, the performance of this scheme with only 8 queues of 16 entries each is just around 5% lower than that of an unbounded (256 entries) conventional issue queue. MixBUFF's performance is much better than that of the IssueFIFO and LatFIFO schemes. For instance, with 8 FP queues of 16 entries each, the performance loss of MixBUFF is 5.2%, whereas IssueFIFO and LatFIFO lose 24.8% and 15.2% respectively for the same configuration.

It can also be observed that for MixBUFF, increasing the size of the buffers results in more benefits than increasing the number of buffers. Since this mechanism distributes quite effectively the instructions that are ready at a given cycle across all the queues, increasing the number of queues does not provide significant improvements. On the other hand, placing multiple chains in a given queue increases its occupancy, so larger queues reduce the number of dispatch stalls.

Other Observations

Another source of complexity of conventional issue schemes lies in the interconnects that are required to issue an instruction from the issue queue to any functional unit. The above schemes have multiple issue queues, so functional units can be distributed across these FIFO queues or buffers with small impact on performance, while significantly reducing complexity.


[Figure 4.20 here: bar chart of % IPC loss w.r.t. baseline (0% to 25%) for SPECFP, for configurations MixBUFF_16x16_8x8, MixBUFF_16x16_8x16, MixBUFF_16x16_10x8, MixBUFF_16x16_10x16, MixBUFF_16x16_12x8 and MixBUFF_16x16_12x16]

Fig. 4.20: IPC loss of the MixBUFF technique w.r.t. the unbounded conventional issue queue for the FP benchmarks

In order to take advantage of this, the proposed MixBUFF scheme has the functional units distributed across the different queues. The same distribution scheme is assumed for IssueFIFO, which is evaluated for comparison purposes. For both schemes the following configuration has been assumed:

• 8 integer FIFO queues of 8 entries each.

• 8 FP queues (buffers for MixBUFF and FIFO queues for IssueFIFO) of 16 entries each. For MixBUFF a maximum of 8 chains per queue has been assumed.

• 1 integer ALU per integer queue.

• 1 integer mult/div unit per pair of integer queues.

• 1 FP add and 1 FP mult/div per pair of FP queues.

Higher performance could be achieved by increasing the number of queues (both integer and FP), but doing so would increase both power dissipation and complexity.


4.3.2 Performance Evaluation

In this section we present performance and power results for the proposed MixBUFF scheme and compare it with the IssueFIFO scheme.

Experimental Framework

Power and performance results are derived from CACTI 3.0 [111] and Wattch [21]. The processor configuration and evaluated benchmarks are those described in section 4.3.1. The only difference is the issue queue configuration assumed for the baseline. In order to make a reasonable comparison, the size of the issue queue assumed for the baseline scheme is not unbounded. The baseline configuration has two issue queues: one for integer instructions and another for FP ones. They store instructions out of order, as in the P6 family (Pentium Pro, Pentium II and Pentium III) and the Pentium 4 [116], and any instruction in the queue can be issued if its operands are ready and the required resources are available.

Configurations Evaluated

Based on the study presented in section 4.3.1, the configurations that have been chosen are MixBUFF 8x8 8x16 and IssueFIFO 8x8 8x16, both with distributed functional units. For the sake of readability they are referred to as MB distr and IF distr respectively. They have been compared in terms of power and performance with a baseline with 64 entries for the integer issue queue and 64 entries for the FP issue queue, referred to as IQ 64 64. A baseline with the same number of issue queue entries as MB distr and IF distr (64 and 128 entries for the integer and FP queues respectively) has not been considered because it implies higher power dissipation and more complexity than the chosen baseline, and it achieves only 1.0% extra IPC with respect to it.

It has been assumed that the baseline consumes energy for waking up only those CAM cells corresponding to unready operands, as proposed in [45], in order to make it more power efficient. A multiple-banked implementation of the issue queue has also been assumed: each queue consists of 8 banks with 8 entries each. Additionally, the selection logic does not dissipate power if the queue is empty, for both the IQ 64 64 and MB distr schemes (IF distr does not have selection logic).

Performance

Figure 4.21 shows the performance for integer applications. As expected, both the MB distr and IF distr schemes achieve the same performance, except for eon, since it has a significant number of FP instructions. These schemes lose on average 7.7% IPC w.r.t. the baseline, which is a reasonable loss since the complexity of both schemes is quite low for integer queues.

It can be observed in figure 4.22 that IF distr loses significant performance (26.0%) for FP applications, whereas MB distr only loses 7.6% IPC w.r.t. the baseline.


[Figure 4.21 here: IPC (0.0 to 4.0) for the SPECINT benchmarks (bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr) and their harmonic mean, for IQ_64_64, IF_distr and MB_distr]

Fig. 4.21: Performance for the integer benchmarks

MB distr allows several chains to share the same queue in an efficient way, so the dispatch stage is stalled fewer times than with the IF distr scheme. It can also be seen that MB distr outperforms IF distr for all FP benchmarks.

Energy Consumption

This section analyzes where the energy is consumed for each scheme. Figure 4.23 (a) shows the energy breakdown for the baseline (IQ 64 64). Even though only unready operands are woken up, and a multiple-banked implementation is assumed, the wakeup dissipates most of the power. Reading and writing instructions from/into the issue queue (buff) as well as the selection logic (select) dissipate significant power. The logic to drive instructions to the functional units is significant only for integer ALUs (MuxIntALU).

Figures 4.23 (b) and (c) show the energy breakdown for the IF distr and MB distr schemes respectively. It can be observed that integer applications consume 25-30% of the energy in the table that stores the corresponding queue for each logical register (Qrename). Reading and writing instructions from/into the FIFO queues requires around 35% of the energy (fifo). A similar percentage is consumed reading and writing the information regarding which registers are ready (regs ready). Since the functional units have been distributed across the queues, the logic to drive the instructions to the functional units dissipates negligible power.


[Figure 4.22 here: IPC (0.0 to 5.0) for the SPECFP benchmarks (ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise) and their harmonic mean, for IQ_64_64, IF_distr and MB_distr]

Fig. 4.22: Performance for the FP benchmarks

FP benchmarks show similar trends to the integer ones for IF distr, whereas MB distr has other sources of power dissipation. The MB distr scheme places FP instructions into buffers (buff) instead of FIFO queues (fifo). Some energy is spent selecting instructions (select) and managing the information concerning chains' latencies (chains). Finally, the energy required to drive instructions to the functional units and to save the last selected instruction (reg) is negligible.

Power Efficiency

We have compared the different schemes in terms of power, energy, EDP and ED2P. The comparison has been normalized to the baseline configuration.
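
As a reminder of how these standard metrics combine energy and delay (our restatement, with E the energy consumed and D the execution time):

    EDP  = E · D
    ED2P = E · D^2

Both are reported normalized to the baseline, so lower is better.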

Figures 4.24 and 4.25 show the comparison in terms of power and energy respectively for the issue queue. It can be observed that both MB distr and IF distr dissipate much less power and consume much less energy than IQ 64 64. Their power and energy requirements are the same for integer applications, but for FP ones MB distr spends a bit more energy.

Figures 4.26 and 4.27 compare the different schemes using the EDP and ED2P metrics for the whole processor, considering that the issue queue contributes 23% of the total chip power [127]. It can be observed in figure 4.26 that MB distr outperforms both IF distr and the baseline in EDP for FP applications. The poor result of IF distr is basically due to its significant loss in IPC.


[Figure 4.23 here: energy breakdowns (0% to 100%) for SPECINT and SPECFP. (a) IQ_64_64 scheme: wakeup, buff, select, MuxIntALU, MuxIntMUL, MuxFPALU, MuxFPMUL. (b) IF_distr scheme: Qrename, fifo, regs_ready and the Mux components. (c) MB_distr scheme: Qrename, fifo, buff, regs_ready, select, chains, reg and the Mux components]

Fig. 4.23: Energy breakdown for the different schemes


[Figure 4.24 here: normalized issue-queue power for SPECINT and SPECFP; IQ_64_64, IF_distr, MB_distr]

Fig. 4.24: Normalized power dissipation

[Figure 4.25 here: normalized issue-queue energy for SPECINT and SPECFP; IQ_64_64, IF_distr, MB_distr]

Fig. 4.25: Normalized energy consumption

[Figure 4.26 here: normalized EDP for SPECINT and SPECFP; IQ_64_64, IF_distr, MB_distr]

Fig. 4.26: Normalized EDP

[Figure 4.27 here: normalized ED2P for SPECINT and SPECFP; IQ_64_64, IF_distr, MB_distr]

Fig. 4.27: Normalized ED2P


Figure 4.27 shows that MB distr significantly outperforms IF distr and achieves practically the same performance as the baseline in terms of ED2P.

It must be taken into account that the reduced complexity of the issue queue for both the MB distr and IF distr schemes may enable a reduction of the cycle time, which may significantly reduce the execution time for these two schemes and thus significantly improve their EDP and ED2P metrics with respect to the baseline. Measuring the effect of a shorter cycle time requires a detailed circuit analysis of the whole processor, which is out of the scope of this work.

We conclude that MB distr offers the best tradeoff among performance, energy and power.

4.3.3 Conclusions

In this section we have presented a low-complexity FP issue queue organization (MB distr) that achieves high performance with small energy requirements. The MB distr scheme is based on dispatching instructions into a set of multiple queues depending on their data dependences at dispatch time. The selection logic is based on estimating the availability time of each operand, instead of using the complex and power-hungry conventional wakeup logic. Additionally, the proposed issue logic organization distributes the functional units across the different queues. Thus, the complexity of the crossbar from the issue queue to the functional units is significantly reduced.

The energy required by the proposed MB distr scheme is substantially lower than the energy required by a baseline conventional issue queue, even if the baseline has the capability of spending energy to wake up only unready operands. This baseline scheme has a much longer delay than MB distr.

It has been shown that MB distr achieves similar ED2P to the baseline and reduces it by 35% with respect to the IF distr scheme. If the EDP metric is used, the reductions are 5% with respect to the baseline and 18% with respect to the IF distr scheme. The IPC loss of the MB distr scheme for FP applications with respect to the high-complexity baseline is only 7.6%, whereas IF distr loses 26%.



CHAPTER 5

LOAD/STORE QUEUES

The load/store queue (LSQ) of superscalar processors is another of their main hotspots. Besides, the LSQ has a significant delay due to its complexity, and it is difficult to pipeline. Overall, the design of a low-latency and low-power LSQ is an important challenge for continuing to scale up the performance of superscalar processors.

The LSQ is typically implemented using fully-associative schemes to check dependences between load and store instructions. There are different ways to identify memory dependences in the LSQ. The most common one works as follows. When the address of a load instruction is known, it is compared with the addresses of the older in-flight store instructions to identify memory dependences. Similarly, when the address of a store instruction is known, it is compared with the addresses of the younger in-flight load instructions to forward them the right data in case of a match. Even though this approach may help to improve instruction-level parallelism (ILP), its latency may be high and offset its potential benefits. Additionally, its complexity grows drastically if we increase the number of ports or the LSQ size. Large instruction windows are required for augmenting the opportunities to extract more ILP, which in turn requires wider pipelines. Hence, large and highly ported LSQs are desirable in high-performance processors, provided that their latency and power dissipation are reasonable.
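
For concreteness, the check triggered by a load's address computation can be sketched as follows (the entry layout and names are ours; in hardware all address comparisons are performed in parallel):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        bool     valid, is_store, addr_known;
        uint64_t addr;
        unsigned age;  /* smaller = older in program order */
    } lsq_entry;

    /* Find the youngest older store matching a load's address. Returns
     * its index (the load takes its data from that store), or -1 if the
     * load may access the data cache directly. */
    int match_older_store(const lsq_entry *lsq, int n,
                          uint64_t load_addr, unsigned load_age) {
        int match = -1;
        for (int i = 0; i < n; i++)
            if (lsq[i].valid && lsq[i].is_store && lsq[i].addr_known &&
                lsq[i].age < load_age && lsq[i].addr == load_addr &&
                (match < 0 || lsq[i].age > lsq[match].age))
                match = i;
        return match;
    }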

In this chapter we present an effective LSQ organization that reduces the complexity of this structure as well as its energy consumption. First, we present state-of-the-art LSQ organizations in section 5.1. These techniques improve the LSQ performance and reduce its complexity. Then, we present our LSQ scheme in section 5.2.

5.1 Related Work

This section reviews some techniques to increase the performance and/or save energy in the logic devoted to dynamically disambiguating loads and stores.

Some techniques [37, 85, 88, 133] focus on predicting dependences between loads and stores. If the address of a load is known but there are older stores whose addresses are still unknown, they predict whether the load depends on those stores or not. On a misprediction, a significant overhead may be incurred, because the pipeline must be flushed as on a branch misprediction.


Other approaches [48] simplify the logic devoted to memory disambiguation by executing loads without comparing their addresses against store addresses. Loads are later validated by re-executing them right before commit. If there is a mismatch between the data loaded at the execution stage and the data loaded at the re-execution stage, the pipeline is flushed. Different mechanisms [29, 101, 102] have been proposed to filter the number of instructions that need re-execution, since re-executions require memory ports, which are a scarce resource.

Based on the observation that it is usual to have several loads in flight that fetch the same data, Nicolaescu et al. [87] propose forwarding data among loads instead of accessing the same data in cache several times. This technique reduces the L1 data cache energy consumption, but requires that loads can obtain their data forwarded not only from stores but also from other loads.

Sethumadhavan et al. [108] propose using hash encoding of the memory addresses to check dependences between loads and stores. When the filter (a Bloom filter) predicts that a given load or store has no dependences with other memory instructions, the instruction can be executed safely. On the other hand, if the filter predicts that the memory instruction may have a conflict, the associative search in the LSQ must be done to check whether the dependence actually exists. This mechanism saves a significant number of power-hungry associative searches in the LSQ by checking only the low-power filter, but it does not reduce the intrinsic complexity of the LSQ and it introduces indeterminism in the latency to check address dependences. This work also proposes banking the LSQ and the Bloom filter accordingly. Then, only those banks whose Bloom filter predicts a potential conflict must be checked. In the worst case, all banks may be predicted to have potential conflicts, so all of them may have to be checked as in a conventional LSQ. This banked design achieves further energy savings in the LSQ but does not reduce its complexity.
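
An illustrative Bloom-filter membership test in this spirit is sketched below (the filter size and hash functions are arbitrary choices of ours, not those of [108]):

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_BITS 256
    static uint8_t filter[FILTER_BITS / 8];

    static unsigned h1(uint64_t a) { return (unsigned)((a >> 5) % FILTER_BITS); }
    static unsigned h2(uint64_t a) { return (unsigned)((a * 2654435761u) % FILTER_BITS); }

    /* Record an in-flight memory address in the filter. */
    void filter_insert(uint64_t addr) {
        filter[h1(addr) / 8] |= (uint8_t)(1u << (h1(addr) % 8));
        filter[h2(addr) / 8] |= (uint8_t)(1u << (h2(addr) % 8));
    }

    /* false: provably no conflicting in-flight access, so the associative
     * LSQ search can be skipped; true: possible conflict, full search.   */
    bool filter_may_conflict(uint64_t addr) {
        return ((filter[h1(addr) / 8] >> (h1(addr) % 8)) & 1) &&
               ((filter[h2(addr) / 8] >> (h2(addr) % 8)) & 1);
    }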

Park et al. [93] propose a segmented LSQ to reduce its latency, although checking the dependences of a load/store may take several cycles, since LSQ segments are checked sequentially.

Franklin and Sohi [47] propose distributing the LSQ into N banks and classifying the instructions in the banks according to the addresses they access. Each bank has M different addresses, and each address has space for P instructions, where P is the maximum number of in-flight loads/stores allowed. There is space for N·M·P instructions, but only P instructions are allowed in total. This scheme relies on the idea that as we increase N, M can be decreased. As shown later, even if N is large, many programs require M to also be large. Thus, N·M must be large not to lose significant performance, which implies that a lot of space is wasted, and only small benefits are achieved with respect to a conventional LSQ.

5.2 SAMIE-LSQ: A New LSQ Organization for Low Power and Low Complexity

In this section we present a set-associative, multiple-instruction-entry LSQ (SAMIE-LSQ), a new LSQ scheme for low power and low complexity.


The SAMIE-LSQ is much more suitable for high-performance superscalar processors with large instruction windows than conventional LSQs, and it scales much better than a fully-associative LSQ. The SAMIE-LSQ design is based on two observations: first, many in-flight memory instructions access the same cache line, so they can be placed in entries where the cache line address is shared by several load/store instructions; second, in-flight load/store instructions access very few cache lines with the same low-order bits, and thus we can use a set-associative structure, since it produces few conflicts for most programs.

The SAMIE-LSQ achieves significant energy savings with respect to a conventional LSQ. Moreover, since the SAMIE-LSQ entries can hold multiple memory instructions, it enables caching some information, like the location of the cache line in the L1 data cache and the address translation provided by the data TLB. As a consequence, a significant number of load and store instructions need neither check the L1 data cache tags nor access all ways of a set-associative L1 data cache, and the number of data TLB accesses is reduced. This results in significant energy savings in the L1 data cache and the data TLB with negligible performance overhead.

The rest of the section is organized as follows. Section 5.2.1 presents the proposed scheme and section 5.2.2 evaluates its performance. Section 5.2.3 summarizes the main conclusions of this work.

5.2.1 SAMIE-LSQ

As outlined in section 5.1, the ARB [47] reduces the complexity and energy consumption of the LSQ by distributing it into several banks where instructions are allocated depending on the memory address to be accessed. To achieve significant dynamic power savings the ARB requires a high degree of banking of the LSQ. Figure 5.1 shows the performance of the ARB with respect to an unbounded-size LSQ for different configurations of number of banks and addresses per bank. For instance, the configuration 2x64 corresponds to having 2 banks with 64 different addresses each. The processor configuration is detailed in the evaluation section. The most relevant processor parameters are its width (8) and the window size (256 instructions). Looking at figure 5.1 we can see that when the number of banks is very low, the dynamic power savings are very low, since the number of addresses to be compared in a given bank is significant. As we increase the degree of banking and reduce the number of entries per bank, the power savings potential increases but the performance decreases dramatically. The configuration with 64 banks and 2 addresses each loses as much as 28% IPC. Additionally, the ARB requires that each entry has space for an address and, in the worst case, for as many memory instructions as the total number of in-flight instructions. Thus, the leakage of the ARB may be very high. For instance, assuming a maximum of 128 loads and stores in flight, the ARB storage has to be as large as 16384 (128·128) loads and stores. On the other hand, reducing the number of banks, addresses per bank or allowed in-flight memory instructions significantly harms performance. For instance, we can observe in the figure the performance when the number of in-flight memory instructions allowed is reduced to half.


[Figure 5.1 here: IPC of the ARB w.r.t. an unbounded LSQ (0% to 100%) for configurations 1x128, 2x64, 4x32, 8x16, 16x8, 32x4, 64x2 and 128x1 (banks x addresses), for the normal case and for half the allowed in-flight memory instructions]

Fig. 5.1: IPC of the ARB with respect to an ideal unbounded LSQ. Configurations with different numbers of banks and addresses per bank are shown

The performance loss is 16% for the fully-associative configuration (1 bank with 64 addresses).

Our objective is to distribute the LSQ into many small banks to save dynamic power with moderate total LSQ storage requirements. To achieve this, we propose the SAMIE-LSQ, which is an extreme distribution of the LSQ into multiple queues that is based on the load/store addresses and requires very few entries per queue. We add a small queue for instructions that do not have room in their corresponding queue. Our approach is based on the observation that it is very common to have several in-flight instructions that access the same cache line. For instance, we have observed that, on average, 3 or more in-flight memory instructions access the same cache line. The proposed LSQ can hold several instructions that access the same cache line in a single entry, which results in several benefits:

• Loads and stores that are placed in the LSQ compare their address with only a few other cache line addresses, which saves significant energy.

• Once an instruction has accessed the L1 data cache, the LSQ entry records where the cache line is located. Then, further accesses to the same cache line can access the data cache as if it were a direct-mapped cache (just a single bank) even if the cache is set-associative in practice, and it is not necessary to compare the tag. Hence, many cache accesses require little energy and have lower latency.


[Figure 5.2 here: block diagram showing the banked DistribLSQ and the SharedLSQ, whose entries hold a cache line address (@) plus instruction slots (inst. 0 to inst. 3), and the AddrBuffer]

Fig. 5.2: SAMIE-LSQ organization

• Once an instruction has accessed the data TLB to translate its address, the translation can be cached in the LSQ entry. The other instructions of the same LSQ entry do not access the TLB, which results in significant energy savings in the TLB and may reduce the latency of memory instructions.

The following subsections present a detailed description of the SAMIE-LSQ.

Structures

Figure 5.2 shows a block diagram of the SAMIE-LSQ. We observe three main structures: the DistribLSQ, the SharedLSQ and the AddrBuffer.

The DistribLSQ is the banked LSQ (4 banks in the figure). Each bank can hold instructions accessing different data cache lines (2 different cache lines per bank in the example). For each LSQ entry we may have an associated cache line and several instructions. We refer to these parts of an entry as slots. The basic information required for each instruction is its offset within the cache line, its relative age identifier used for data forwarding, and the data loaded or to be stored, if available. Additionally, each instruction needs a bit to know if its data are available, another bit to know if older store addresses are known, and some bits with other instruction information, like the number of bytes to be loaded/stored, the type of instruction (load/store), and the slot of the store that forwards its data (if any) in case this instruction is a load.

Those instructions that do not find an available entry/slot in their corresponding bank of the DistribLSQ are placed in the SharedLSQ, whose entries have the same fields as those of the DistribLSQ. We assume 4 entries in the figure.

Finally, instructions that can be placed neither in the DistribLSQ nor in the SharedLSQ are placed in a waiting buffer called the AddrBuffer. Memory instructions in the AddrBuffer cannot access the cache; they have to go first to the DistribLSQ or the SharedLSQ for disambiguation.


Each entry of this buffer holds the complete address to be accessed (cache line address and offset), the age identifier of the instruction, and the bits indicating whether it is a load or a store and how many bytes must be accessed.

To maintain coherence and avoid issuing loads before the addresses of older stores are known, the reorder buffer is extended with some information. For each entry, there is a bit (readyBit) used for memory disambiguation. If the instruction is a store and its address is known, its readyBit is set. Any load has its bit set only if there are no older stores whose addresses are still unknown. A load cannot access memory until this bit is set. The reorder buffer entries also have a field telling where the instruction is placed (whereLSQ). The whereLSQ field is only relevant for loads. When a load is placed in the DistribLSQ or the SharedLSQ, this field is set with the corresponding location.

Every time a store address is computed, its readyBit is set. If there are no older stores whose addresses are still unknown, it also sets the readyBit of all the following instructions in the reorder buffer until a store with an unknown address is reached. All loads whose readyBits are set during this process are notified by using the whereLSQ field.
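
A sketch of this readyBit ripple (a flat array stands in for the circular reorder buffer, and the names are ours):

    typedef struct {
        int is_store;
        int addr_known; /* for stores: address already computed     */
        int ready_bit;  /* loads may access memory once this is set */
    } rob_entry;

    /* Called when the store at rob[store_idx] computes its address. If no
     * older store still has an unknown address, the readyBit is also set
     * for every younger instruction up to (but not including) the next
     * store whose address is unknown. */
    void on_store_address(rob_entry *rob, int rob_size, int store_idx,
                          int no_older_unknown_store) {
        rob[store_idx].addr_known = 1;
        rob[store_idx].ready_bit  = 1;
        if (!no_older_unknown_store)
            return;
        for (int i = store_idx + 1; i < rob_size; i++) {
            if (rob[i].is_store && !rob[i].addr_known)
                break;            /* disambiguation not possible beyond here  */
            rob[i].ready_bit = 1; /* loads set here are notified via whereLSQ */
        }
    }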

Operation

When the address of a memory instruction is computed, it is forwarded to the LSQ. The DistribLSQ is a set-associative structure where the banks are assigned in a direct-mapped manner based on the effective address, and the entries of each bank are accessed in a fully-associative manner. If the instruction finds its cache line address in any of the entries of the corresponding bank and there is a free slot, the instruction fills this slot. If no entry has the same cache line address, or it is present but without free slots, a free entry is allocated and one of its slots is used.

If an instruction fails to be placed in the DistribLSQ due to lack of space, it is placed in the SharedLSQ, which is a small fully-associative structure. The process is the same: a free slot in an entry with the same cache line address is chosen if available; otherwise, an empty entry is allocated. Both structures are accessed in parallel, so the address of the load/store is compared with every other address in the corresponding bank of the DistribLSQ and all the addresses in the SharedLSQ.

Finally, if neither the DistribLSQ nor the SharedLSQ has room for the instruction, it is placed in the AddrBuffer. The instructions in the AddrBuffer have priority over the ones coming from the functional units when choosing which ones are to be placed in the DistribLSQ or the SharedLSQ.
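
A self-contained sketch of this three-level placement (the sizes follow the configuration used later in this chapter; the entry layout and helper are ours):

    #include <stdbool.h>
    #include <stdint.h>

    #define NBANKS   64 /* DistribLSQ banks            */
    #define NENTRIES 2  /* entries per DistribLSQ bank */
    #define NSLOTS   8  /* instruction slots per entry */
    #define NSHARED  8  /* SharedLSQ entries           */

    typedef struct { bool valid; uint64_t line; int used_slots; } lsq_line;

    static lsq_line distrib[NBANKS][NENTRIES];
    static lsq_line shared_lsq[NSHARED];

    /* Reuse an entry already holding this cache line if it has a free
     * slot; otherwise allocate a free entry. */
    static bool alloc_slot(lsq_line *e, int n, uint64_t line) {
        for (int i = 0; i < n; i++)
            if (e[i].valid && e[i].line == line && e[i].used_slots < NSLOTS) {
                e[i].used_slots++;
                return true;
            }
        for (int i = 0; i < n; i++)
            if (!e[i].valid) {
                e[i].valid = true; e[i].line = line; e[i].used_slots = 1;
                return true;
            }
        return false;
    }

    /* 0 = placed in the DistribLSQ, 1 = SharedLSQ, 2 = waits in the
     * AddrBuffer (a plain FIFO, not modeled here). */
    int place_instruction(uint64_t line) {
        if (alloc_slot(distrib[line % NBANKS], NENTRIES, line)) return 0;
        if (alloc_slot(shared_lsq, NSHARED, line))              return 1;
        return 2;
    }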

Deadlock Avoidance

It may happen that the oldest in-flight instruction is in the AddrBuffer and does not find free space in the LSQ (DistribLSQ and SharedLSQ) because younger instructions have filled the entries where it could be placed. This is easily detected by checking whether the head instruction of the reorder buffer is not placed in the LSQ.


Our evaluations show that properly sizing the different structures makes this happen very rarely (less than once every million instructions). Thus, when this scenario is detected we take an easy solution to avoid deadlocks: we flush the pipeline. Since the oldest instruction will be the first to re-enter the pipeline, it will get an entry in the LSQ, which guarantees forward progress.

There is another situation where the SAMIE-LSQ might require the pipeline to be flushed: when an address computation finishes and the instruction cannot be placed in any of the structures (DistribLSQ, SharedLSQ and AddrBuffer). Sizing the structures properly prevents this from happening. For instance, if the AddrBuffer has as many entries as in-flight memory instructions allowed, this situation will never happen. Note that the AddrBuffer is a simple FIFO structure, so its complexity is rather low (e.g., no associative searches are performed in it). In our simulations, even assuming a smaller AddrBuffer, it never happens. An alternative solution would be not allowing address computations to be executed if they are not guaranteed to have at least one free slot in the AddrBuffer.

SAMIE-LSQ Extensions

The SAMIE-LSQ puts several instructions that access the same cache line in the same entry of the DistribLSQ or the SharedLSQ. We take advantage of this to save energy in the L1 data cache and the data TLB.

We can save L1 data cache (Dcache for short) energy by caching the physical location of the cache line (set and way) in the corresponding LSQ entry once it is accessed, and adding a bit per cache line (presentBit) in the Dcache indicating whether its physical location has been cached in the LSQ or not. When the first instruction in a given entry accesses the Dcache, the physical location (set and way) of the cache line is stored in the LSQ entry and both the cache line and LSQ entry presentBits are set. Any other access to this cache line from this LSQ entry (note that all instructions in the same entry access the same cache line) needs neither to check the tags nor to read all the ways. These low-power accesses read the data from the cache line of the specific way without checking the tag. The storage to hold the physical location of the cache line requires just a few bits. For instance, a 32KB cache with 32 bytes per line has 1024 lines, and thus 10 bits are enough to record the physical location of the cache line. The DistribLSQ may require fewer bits to encode the physical cache line since, for a given DistribLSQ bank, only one or a few sets can be accessed. This simple mechanism saves significant Dcache energy and has two positive side effects: these accesses have lower latency, and we know in advance that they will hit. The benefits of these two effects are not considered in the performance study.
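
A sketch of the presentBit fast path (the cache geometry and names are ours; the point is that the second access pattern reads a single way and compares no tags):

    #include <stdbool.h>
    #include <stdint.h>

    #define NSETS 256 /* e.g., a 32KB, 4-way cache with 32-byte lines */
    #define NWAYS 4

    typedef struct { bool valid; uint64_t tag; } cache_line;
    static cache_line l1[NSETS][NWAYS];

    typedef struct {
        bool present;  /* presentBit: set/way below are valid */
        int  set, way; /* cached physical location in the L1  */
    } lsq_loc;

    /* Conventional access: index the set, compare all NWAYS tags. */
    static bool full_lookup(uint64_t line, int *set, int *way) {
        *set = (int)(line % NSETS);
        for (int w = 0; w < NWAYS; w++)
            if (l1[*set][w].valid && l1[*set][w].tag == line / NSETS) {
                *way = w;
                return true;
            }
        return false;
    }

    /* Access from an LSQ entry: once the first instruction of the entry
     * has located the line, later ones read l1[set][way] directly. */
    bool dcache_access(lsq_loc *loc, uint64_t line) {
        if (loc->present)
            return true;              /* low-power access: one way, no tags */
        if (full_lookup(line, &loc->set, &loc->way)) {
            loc->present = true;      /* cache the set/way in the LSQ entry */
            return true;
        }
        return false;                 /* miss: line filled, then retried    */
    }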

When a cache line is replaced, some LSQ entries in the SharedLSQ and the DistribLSQ may have to reset their presentBit flag. To avoid comparing the cache line address being replaced with the addresses in the LSQ, we use a very simple alternative, which consists of resetting the presentBit flag of all entries that could potentially be affected.



Data TLB (DTLB for short) energy is also saved by keeping the translated address in the LSQ entries. When the first instruction in the entry accesses the data cache, the DTLB is looked up and the address translation is cached in the corresponding entry of the DistribLSQ or SharedLSQ. The other instructions read this information from their LSQ entry. Similarly to the technique applied to save Dcache energy, there are two additional positive side effects whose benefit is not considered in the quantitative evaluation later in this work: the translation has much lower latency, and the translation hit rate may increase since the DTLB is not accessed for many instructions.

Sizing SAMIE-LSQ Structures

We initially experimented with a configuration without the SharedLSQ that places all instructions in the DistribLSQ. We found that different programs (we use the Spec2000 benchmarks [115]) show extremely different address patterns. For instance, integer programs often require few entries in the DistribLSQ, and these entries are distributed across different banks even if the number of banks is low. Hence, they hardly use the SharedLSQ. Most of the FP (floating-point) programs require a lot of entries in the DistribLSQ, but they exhibit different patterns. Some FP programs use the different banks evenly, which is beneficial to save energy in the DistribLSQ, but other FP programs concentrate most of their entries in a few banks even if the number of banks is high. Increasing the number of entries of all the DistribLSQ banks is a waste of space, since only a few banks will use them at any given point in time. Thus, a more cost-effective solution is to use the SharedLSQ to hold the instructions that cannot be placed in the DistribLSQ.

Another important design parameter is the number of slots per entry. A large number benefits energy savings for address comparisons, the Dcache and the DTLB for those programs where the number of in-flight memory instructions accessing the same L1 data cache line is high. On the other hand, there are some programs that do not take advantage of a large number of slots per entry.

Summing up, we need a highly banked DistribLSQ with enough entries to place most memory instructions, and some entries in the SharedLSQ for conflicting addresses. Figure 5.3 shows the average occupancy of the SharedLSQ for different configurations of the DistribLSQ, varying the degree of banking (banks × entries per bank). The SharedLSQ is assumed to be unbounded, and there are 8 slots per entry in both the DistribLSQ and the SharedLSQ. Other configuration details are reported in the evaluation section.

We observe that a configuration with 128 banks of 1 entry each (128x1) requires a significant number of entries in the SharedLSQ for many programs. That means that the SharedLSQ must be quite large and many comparisons will have to be done, since each address is compared with the addresses of the corresponding bank of the DistribLSQ and all the addresses in the SharedLSQ.


[Figure 5.3 here: average number of SharedLSQ entries required (0 to 12) for each SPEC CPU2000 benchmark and the SPEC average, for DistribLSQ configurations 128x1, 64x2 and 32x4]

Fig. 5.3: Average number of entries occupied in an unbounded SharedLSQ for different configurations of the DistribLSQ

Thus, this configuration is too heavily banked. On the other hand, we observe that the SharedLSQ space requirements of the 64x2 DistribLSQ are only slightly higher than those of the 32x4 DistribLSQ. Thus, we select the 64x2 configuration of the DistribLSQ because its banks are small and its SharedLSQ space requirements are low.

Figure 5.4 shows the number of programs that need a given number of SharedLSQ entries in order not to require the AddrBuffer during 99% of the time. It can be seen that 4 entries are enough for 16 out of 26 programs, so 10 programs may lose some performance, whereas 8 entries are enough for 21 programs. If we consider a SharedLSQ with 12 entries, only one more program has enough entries during 99% of its execution time. Hence, an 8-entry SharedLSQ seems a good tradeoff and is what we assume in our experiments.

The number of slots per entry is set to 8. More slots per entry would help to reduce the energy consumption, since more instructions may benefit from power reductions when accessing the Dcache and the TLB. The drawback of increasing the number of slots per entry is that leakage and delay are increased.

Using a lower number of slots per entry would help to save leakage and reduce the delay, but it is counterproductive for some programs whose memory references tend to concentrate in a few cache lines, because the associated LSQ bank or the SharedLSQ would require more entries. This may offset the benefits of reducing the number of slots per entry.


[Figure 5.4 here: stacked number of programs (0 to 26) versus SharedLSQ entries required (0 to 60)]

Fig. 5.4: Number of programs that do not use the AddrBuffer during 99% of their execution for a varying number of SharedLSQ entries

As shown in figure 5.3, some programs require a large number of entries in the SharedLSQ. When they fail to place an instruction in the SharedLSQ, the instruction has to wait in the AddrBuffer. We have observed that an AddrBuffer of 64 entries is always enough for all programs. A few programs, such as ammp and facerec, need more than 32 entries during more than 5% of their execution time. Since the AddrBuffer is a cheap structure in terms of energy and delay, we set its size to 64 entries.

Delay

The delay of the different components has been evaluated using CACTI 3.0 [111] with 0.10 µm technology. The largest delay of the SAMIE-LSQ corresponds to the DistribLSQ (64 banks, 2 entries/bank, 8 slots/entry). We also assume an extra latency to send the addresses to the banks with respect to a conventional LSQ, because it is a larger structure. We assume this additional delay to be equal to the delay of the buses (bitlines and wordlines) of a 128-entry structure with the same total capacity. The maximum delay of the DistribLSQ is the delay to send an address to a bank (0.124 ns) plus the delay of comparing the cache line addresses in such a bank (0.590 ns). Thus, the total DistribLSQ delay is 0.714 ns. The delays for the SharedLSQ (8 entries, 8 slots/entry) and the AddrBuffer (64 slots) are 0.617 ns and 0.319 ns respectively.

The assumed baseline LSQ (128 entries) has a delay of 0.881 ns, which is 23% higher than the delay of the SAMIE-LSQ. We have also found that a conventional LSQ with 16 entries has a delay similar (4% larger) to our SAMIE-LSQ configuration.


Size   Assoc.   Ports   Conventional   Physical line      Improvement
                        delay (ns)     known delay (ns)
8KB    2-way    2       0.865          0.700              19.4%
8KB    2-way    4       1.014          0.875              13.7%
8KB    4-way    2       1.008          0.878              12.9%
8KB    4-way    4       1.307          1.266               3.1%
32KB   2-way    2       1.195          1.092               8.7%
32KB   2-way    4       1.551          1.490               4.0%
32KB   4-way    2       1.194          1.165               2.5%
32KB   4-way    4       1.693          1.693               0.0%

Table 5.1: Access time of conventional cache accesses and access time when the physical cache line is known, for different cache configurations. The number of bytes per line is 32 in all configurations

In terms of delay, we have found that those accesses to the Dcache where the physical cache line to access is known may be done with lower delay than conventional Dcache accesses. Table 5.1 shows the access time of both types of accesses for different cache configurations. It can be observed that most of the configurations have lower access time when the physical cache line is known beforehand. Although in this work we do not take advantage of this lower access time for Dcache accesses, we consider that this feature of the SAMIE-LSQ can provide additional benefits and will be the target of future work.

5.2.2 Performance Evaluation

This section presents the performance and energy statistics for the SAMIE-LSQ and a baseline with a conventional fully-associative LSQ. First, the experimental framework for performance and energy modeling is presented. Then, the impact of the SAMIE-LSQ on performance, dynamic energy and leakage is discussed.

Experimental Framework

The SAMIE-LSQ performance has been evaluated with Simplescalar [23]. Energy results are derived from CACTI 3.0 [111]. For this study we have used the whole SPEC CPU2000 benchmark suite [115]. Table 5.2 shows the processor configuration and table 5.3 shows the SAMIE-LSQ configuration.

Energy Model for the LSQ

The energy and area parameters used are derived from CACTI 3.0 [111]. The baseline LSQ is a conventional fully-associative structure of 128 entries. For the sake of a fair comparison, we assume for the baseline that a load address is only compared with the addresses of the older stores whose addresses are known. On the other hand, a store address is only compared with the addresses of the younger loads whose addresses are known. If there is a match, the matching load's data is forwarded from the store when it is available and the load does not access the Dcache.


Parameter                      Value
Fetch, Decode, Commit width    8 instructions/cycle
Issue width                    8 INT + 8 FP instructions/cycle
Fetch queue size               64 entries
Issue queue size               128 INT + 128 FP entries
Load/store queue size          128 entries for the baseline
Reorder buffer size            256 entries
Register file                  160 INT + 160 FP
IntALUs                        6 (1 cycle)
IntMult/Div                    3 (3 cycles pipelined mult, 20 cycles non-pipelined div)
FP ALUs                        4 (2 cycles pipelined)
FP Mult/Div                    2 (4 cycles pipelined mult, 12 cycles non-pipelined div)
Memory ports                   4 R/W ports (2 cycles pipelined)
Branch predictor               Hybrid: 2K-entry Gshare, 2K-entry bimodal and 1K-entry metatable
BTB                            2048 entries, 4-way
L1 Icache size                 64KB, 2-way, 32-byte lines, 1 cycle latency
L1 Dcache size                 8KB, 4-way, 32-byte lines, 4 R/W ports, 2 cycles
L2 unified cache               512KB, 4-way, 64-byte lines, 10 cycles latency
Memory                         100 cycles, 2 cycles interchunk
Data TLB                       128 entries, 30 cycles miss penalty
Instruction TLB                128 entries, 30 cycles miss penalty
Technology                     0.10 µm

Table 5.2: Processor configuration

Parameter     Value
DistribLSQ    64 banks, 2 entries per bank, 8 slots per entry
SharedLSQ     8 entries, 8 slots per entry
AddrBuffer    64 slots

Table 5.3: SAMIE-LSQ configuration


LSQ                     Energy
Address comparison      452 pJ + 3.53 pJ per address compared
Read/Write an address   57.1 pJ
Read/Write a datum      93.2 pJ

Table 5.4: Energy consumption of the different types of accesses to a 128-entry conventional LSQ

DistribLSQ                             Energy
Address comparison                     4.33 pJ + 2.17 pJ per address compared
Read/Write an address                  4.07 pJ
Age id comparison in one entry         19.4 pJ + 1.21 pJ per age id compared
Read/Write an age id                   1.64 pJ
Read/Write a datum                     10.9 pJ
Read/Write a TLB address translation   6.02 pJ
Read/Write a cache line id             0.236 pJ

Bus to DistribLSQ                      Energy
Send an address                        54.4 pJ

SharedLSQ                              Energy
Address comparison                     22.7 pJ + 2.83 pJ per address compared
Read/Write an address                  6.16 pJ
Age id comparison in one entry         19.4 pJ + 2.43 pJ per age id compared
Read/Write an age id                   1.64 pJ
Read/Write a datum                     10.9 pJ
Read/Write a TLB address translation   8.73 pJ
Read/Write a cache line id             0.342 pJ

AddrBuffer                             Energy
Read/Write a datum                     31.6 pJ
Read/Write an age id                   15.7 pJ

Table 5.5: Energy consumption for the different activities of the SAMIE-LSQ

Table 5.4 details the energy consumption for the different types of accesses.

Our proposed SAMIE-LSQ requires comparing each address with all the in-use addresses (entries) of the corresponding bank of the DistribLSQ and all the in-use addresses of the SharedLSQ. Additionally, the age identifier (implemented as the reorder buffer position plus an extra bit) of the instruction whose address has just been computed is compared with all the age identifiers of the in-use slots in the corresponding bank of the DistribLSQ and all the age identifiers of the SharedLSQ. This way, if it is a load, it will record the slot holding the store that forwards its data. If it is a store, it updates the forwarding information of the loads. The energy consumption for the different activities is shown in table 5.5.

For an 8KB 4-way cache, the energy consumption of a Dcache access is 1009 pJ, whereas it is 276 pJ when only one of the ways is accessed and no address is compared. For the DTLB, the energy of an access is 273 pJ.
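
As a worked illustration with these figures, consider a hypothetical access that hits both the cached set/way and the cached translation, thereby skipping the tag comparison and the DTLB lookup:

    E_saved = (1009 - 276) pJ + 273 pJ = 1006 pJ

i.e., about 78% of the 1282 pJ (1009 + 273) that a conventional access would consume.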

Since CACTI does not estimate leakage, we keep track of the active area for the baseline LSQ and the SAMIE-LSQ, which is closely related to the leakage energy. Both mechanisms are intended to be energy efficient, so we assume that the conventional LSQ has all in-use entries active, plus four extra entries for new instructions.


Conventional LSQ          Type   Area
Address                   CAM    28 µm²
Datum                     RAM    20 µm²

DistribLSQ                Type   Area
Address                   CAM    10 µm²
Age id                    CAM    10 µm²
Datum                     RAM    6 µm²
TLB address translation   RAM    6 µm²
Cache line id             RAM    6 µm²

SharedLSQ                 Type   Area
Address                   CAM    10 µm²
Age id                    CAM    10 µm²
Datum                     RAM    6 µm²
TLB address translation   RAM    6 µm²
Cache line id             RAM    6 µm²

AddrBuffer                Type   Area
Datum                     RAM    20 µm²
Age id                    RAM    20 µm²

Table 5.6: Area of the different components of the conventional LSQ and SAMIE-LSQ

This limitation hardly impacts performance (less than 0.1% IPC loss) and significantly reduces leakage. On the other hand, the SAMIE-LSQ has all in-use entries active, plus one extra entry in each bank of the DistribLSQ and one extra entry in the SharedLSQ. In each entry, the slots in use plus one extra slot are considered to be active. The AddrBuffer has all in-use slots plus four extra slots active. As in the conventional LSQ, the performance degradation of these limitations is negligible.

The types and areas of the different cells are detailed in table 5.6.

Performance

Figure 5.5 presents the IPC loss of the SAMIE-LSQ with respect to the conventional LSQ. We observe that the SAMIE-LSQ loses some performance for ammp, apsi and mgrid. As shown in figure 5.3, these programs would require a large number of SharedLSQ entries. Thus, the SharedLSQ often becomes full and some instructions have to wait in the AddrBuffer, which implies that some instructions that are ready to execute have to wait for an available entry/slot in the proper bank of the DistribLSQ or in the SharedLSQ. Furthermore, since some instructions have to wait in the AddrBuffer, it may happen that the oldest memory instruction can be placed neither in the DistribLSQ nor in the SharedLSQ, firing the deadlock-avoidance scheme (i.e. pipeline flush) described above. Figure 5.6 shows the number of deadlocks per million cycles. It can be seen that ammp is the only program with a significant number of deadlocks.

In figure 5.5 we also observe that some programs, such as facerec and fma3d, perform better with the SAMIE-LSQ than with the conventional LSQ. This is so because these programs have high LSQ pressure and the conventional LSQ can hold at most 128 in-flight memory instructions.


[Figure: % IPC loss of SAMIE-LSQ w.r.t. the conventional LSQ, from -4% to 8%, for each SPEC CPU2000 benchmark and the SPEC average]

Fig. 5.5: IPC loss of SAMIE-LSQ with respect to the 128-entry conventional LSQ

[Figure: deadlocks per 1,000,000 cycles for SAMIE-LSQ, from 0 to 300, for each SPEC CPU2000 benchmark and the SPEC average]

Fig. 5.6: Number of deadlock-avoidance pipeline flushes per million cycles for SAMIE-LSQ


[Figure: dynamic energy (nJ) of the LSQ for the conventional LSQ and the SAMIE-LSQ, per SPEC CPU2000 benchmark]

Fig. 5.7: Dynamic energy consumption for the LSQ

The SAMIE-LSQ, in contrast, can hold many more memory instructions if they are well distributed among the different banks.

On average, the SAMIE-LSQ loses 0.6% IPC with respect to the conventional LSQ. This does not take into account the potential benefits from the fact that the delay of the SAMIE-LSQ is lower than that of the conventional LSQ, as shown in section 5.2.1.

Dynamic Energy

Figure 5.7 shows the dynamic energy consumption of the conventional LSQ and the SAMIE-LSQ. We observe that the SAMIE-LSQ is much more energy-efficient than the conventional LSQ for all but one program. In fact, the programs that have high energy consumption with the SAMIE-LSQ are those with high SharedLSQ requirements. This trend can be seen in figure 5.8, where the energy of the SAMIE-LSQ is broken down. Most of the programs spend the energy in the DistribLSQ and the buses, but ammp, apsi, facerec and mgrid have a significant number of conflicts and require large space in the SharedLSQ and the AddrBuffer. On average, the SAMIE-LSQ saves 82% of the dynamic energy of the conventional LSQ with negligible performance degradation.

As stated in section 5.2.1, the SAMIE-LSQ enables significant energy savings in the L1 data cache and the data TLB by caching the location of the data in the cache and the address translation, respectively. Figure 5.9 shows the energy consumption of the Dcache for both the conventional LSQ and the SAMIE-LSQ.


[Figure: dynamic energy breakdown for the SAMIE-LSQ (Bus, AddrBuffer, SharedLSQ, DistribLSQ), per SPEC CPU2000 benchmark]

Fig. 5.8: Dynamic energy consumption breakdown for the SAMIE-LSQ

The energy savings of the SAMIE-LSQ are consistent across all benchmarks. On average, 42% of the L1 data cache energy can be saved, ammp and swim being the programs with the highest savings (58%) and sixtrack the program with the lowest savings (21%).

Figure 5.10 shows the data TLB energy consumption. In general, those Dcache accesses that do not compare the address and access only one way do not access the DTLB either, because the address translation has also been cached. Thus, the fraction of energy saved in the DTLB is higher than that in the Dcache. On average, 73% of the DTLB energy is saved if we compare the SAMIE-LSQ with a conventional LSQ. The highest savings correspond to ammp (84%) and the lowest to mcf (55%).

Leakage

The SAMIE-LSQ is larger than the conventional LSQ because it has practically the same number of addresses but space for 8 instructions per address, whereas the conventional LSQ only has space for one instruction per address. Nevertheless, the SAMIE-LSQ can work with an active area similar to that of the conventional LSQ, as shown in figure 5.11. We accumulate the area every cycle instead of using the average area in order to take into account the longer or shorter execution times of the different programs. The accumulated active areas of the conventional LSQ and the SAMIE-LSQ are very similar, and slightly favorable to the SAMIE-LSQ (5%). The best scheme in terms of active area depends on the program, some integer programs (bzip2, crafty, gcc, parser, perlbmk) being the worst cases for the SAMIE-LSQ because they have very low LSQ space requirements and the SAMIE-LSQ keeps a larger empty area active than the conventional LSQ.


[Figure: dynamic energy (nJ) of the L1 data cache for the conventional LSQ and the SAMIE-LSQ, per SPEC CPU2000 benchmark]

Fig. 5.9: Dynamic energy consumption for the L1 data cache

[Figure: dynamic energy (nJ) of the data TLB for the conventional LSQ and the SAMIE-LSQ, per SPEC CPU2000 benchmark]

Fig. 5.10: Dynamic energy consumption for the data TLB


[Figure: accumulated active LSQ area for the conventional LSQ and the SAMIE-LSQ, per SPEC CPU2000 benchmark]

Fig. 5.11: Accumulated active area in mm² for the LSQ

[Figure: active area breakdown for the SAMIE-LSQ (AddrBuffer, SharedLSQ, DistribLSQ), per SPEC CPU2000 benchmark]

Fig. 5.12: Active area breakdown for the SAMIE-LSQ



Figure 5.12 shows the area breakdown for the SAMIE-LSQ. The DistribLSQ is the structure with the largest active area, and the SharedLSQ active area is noticeable only in those programs with high SharedLSQ space requirements (ammp, apsi, art, facerec, mgrid).

5.2.3 Conclusions

We have presented the SAMIE-LSQ, a power-aware load/store queue design. The SAMIE-LSQ exploits the fact that many in-flight loads and stores access the same cache line and places these instructions in the same entry. This reduces the number of address comparisons and other activity in the data cache and the TLB. The number of address comparisons is further reduced by using a set-associative organization instead of a fully-associative one.

The SAMIE-LSQ saves 82% of the dynamic energy of the load/store queue, 42% of that of the L1 data cache and 73% of that of the data TLB, with a negligible impact on performance (0.6%).

Additionally, the delay of the SAMIE-LSQ is lower than that of a conventional load/store queue, and the access time for many L1 data cache references is also reduced. This enables further opportunities for optimizations to improve the performance and/or energy requirements, which have not been exploited in this work and will be the target of our future research. Another interesting research direction is coupling the SAMIE-LSQ with the L1 data cache by integrating the DistribLSQ entries and their corresponding cache set(s) in the same physical structure to further reduce the cache access time.


CHAPTER 6

CLUSTERED MICROARCHITECTURES

Clock rates have undergone a continuous increase since the first microprocessor as a result of deeper pipelines and the use of ever smaller and faster transistors. On the other hand, on-chip communications become more critical from generation to generation, since they get slower relative to gate delays [12, 61]. Wire delays are forcing processor designers to devote more effort and resources to techniques that minimize their impact. Clustered microarchitectures are becoming a widely-used approach to tackle this problem.

Clustered processors deal with the wire delay problem by trying to keep most of the communications local and, at the same time, balancing the workload. Conventional clustered processors are laid out in such a way that data/tag forwarding inside a cluster is fast while inter-cluster forwarding is slow. Because of that, minimizing the penalties of wire delays and balancing the workload of the clusters are opposite objectives. The best performance is achieved when the best trade-off between these two factors is identified. Different approaches that search for this trade-off are described in the related work section.

Clustered microarchitectures are also effective at reducing energy consumption [138] because some of the resources are distributed across the different clusters. For instance, the issue logic, the register files and the functional units are distributed. Hence, each cluster has small issue queues, register files with a low number of ports, and few functional units. Thus, the energy consumption as well as the complexity of these distributed structures is low when compared with that of a centralized superscalar processor.

Clustered processors can also be effective at reducing the temperature of the chip through a better distribution of the activity across the whole die. This may translate into significant benefits in performance, by reducing the frequency of thermal emergencies, and in cost, by allowing cheaper cooling solutions for a given performance level. However, conventional clustered microarchitectures tend to concentrate the activity in the minimum number of clusters that can provide the maximum throughput required by each particular code, since spreading the activity across all clusters implies an increase in communication penalties. Thus, conventional clustered microarchitectures distribute the activity across some clusters, but fail to distribute it across all of them so as to prevent temperature emergencies from happening, which may harm performance.

In this chapter we present a new clustered microarchitecture that achieves high performance and distributes the activity better than conventional clustered processors. In section 6.1 we present state-of-the-art clustered microarchitectures and techniques to improve their performance. Then, we present our proposal for clustered microarchitectures in section 6.2.

6.1 Related Work

Conventional clustered architectures have fast interconnects for propagating intra-cluster signals, whereas inter-cluster communications require long and slow wires. These architectures rely on keeping most of the communications local within clusters to achieve high performance. Even if most of the communications are local, inter-cluster communications are required in a non-negligible number of cases (1 communication every 4 instructions may be common in a 4-cluster configuration [91]). Those inter-cluster communications are critical, since increasing their number and their latency degrades performance significantly [16, 34, 35, 91, 92].

Some previous works on clustered microarchitectures focus on 2 clusters [42, 57], but future microprocessors are likely to have a higher degree of clustering. These architectures rely on a mechanism to distribute instructions across clusters that is referred to as steering logic [99]. Some approaches are based on partitioning the code at branch boundaries. Trace processors [46, 100, 114, 123] dynamically partition the code into chunks of consecutive instructions called traces. Each trace is then steered as a unit to a given cluster. Since the traces have similar length, the workload is effectively balanced. However, this scheme may incur a large number of communications. Additionally, if there is parallelism inside traces but not between different traces, trace processors may lose significant performance because only one cluster executes instructions at a time. Moreover, clusters must be wide enough to extract all the parallelism inside traces. Thus, complexity is still a concern for these processors because their clusters must be wide in order to keep performance high.

Multiscalar processors [46, 114] divide the code into tasks. Each task is made up of a set of consecutive instructions and is assigned to a different processing element (PE); the PEs are interconnected through a ring network. PEs have fast internal interconnects and slow connections to other PEs. Thus, communications among tasks use the slow inter-PE connections. Multiscalar steers tasks that are determined at compile time. These tasks are executed speculatively to increase performance.

Other works are based on steering instructions considering data dependences, trying to send dependent instructions to the same cluster without compromising workload balance [34, 35, 42, 86, 90, 91, 92, 110]. The Multicluster architecture [42] partitions the register name space into two subsets; the program is partitioned at compile time by estimating the workload balance and the inter-cluster communications.


Seznec et al. [110] propose a scheme that introduces some constraints for reading and writing registers and distributes the functional units in order to reduce the number of ports required for the register file and the complexity of the crossbar from the register file to the functional units. In some aspects, this approach is similar to a clustered microarchitecture, but it can be applied to any architecture (clustered or non-clustered).

Palacharla et al. [90] propose a dependence-based clustered architecture where each issue queue is a FIFO queue. Instructions are dispatched to the queue whose last instruction is the producer of a source operand; if no queue meets this condition, the instruction is sent to an empty queue, and if none is available the dispatch stage is stalled. Placing instructions in this way guarantees that the instructions in a given FIFO must execute sequentially; thus, only the instruction at the head of each queue must be monitored for potential issue.
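A minimal sketch of this FIFO-based dispatch, assuming hypothetical instruction objects with srcs/dest register fields (the published design [90] includes further refinements):

    # Sketch of dependence-based FIFO steering (after Palacharla et al. [90]).
    def fifo_dispatch(instr, fifos):
        """Append instr to a FIFO whose tail produces one of its sources;
        otherwise use an empty FIFO; otherwise stall dispatch."""
        for q in fifos:
            if q and q[-1].dest in instr.srcs:
                q.append(instr)
                return True
        for q in fifos:
            if not q:
                q.append(instr)
                return True
        return False  # no suitable FIFO: the dispatch stage stalls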

Some other mechanisms [34, 35, 91, 92] have been proposed to deal with the problem of reducing the number of required communications while maximizing workload balance at the same time. These policies are based on a dependence-based steering algorithm plus an additional mechanism to manage workload balance. The most recent of these approaches [92] proposes steering algorithms that are aware of the interconnection network between clusters. The clustered processor has clusters with fast intra-cluster connections and point-to-point buses to communicate data to remote clusters. This processor tries to steer instructions to the cluster where their source operands are mapped but, when communications are required, the steering algorithm chooses the cluster that minimizes the communication delay. This algorithm requires workload imbalance control; the metric used to measure this feature is DCOUNT, which is described in the original paper [92].

6.2 Ring Clustered Microarchitecture

The performance of clustered microarchitectures relies on steering schemes that try to find the best trade-off between workload balance and inter-cluster communication penalties. In previously proposed clustered processors, reducing communication penalties and balancing the workload are opposite targets, since improving one usually implies a detriment to the other. Additionally, conventional clustered microarchitectures tend to concentrate all the activity in just one or a few clusters until a certain level of imbalance is reached; then, they force the instructions to be steered to other clusters. Thus, the activity is not well distributed at fine granularity.

In this work we propose a new clustered microarchitecture that distributes the activity across all clusters at fine and coarse granularity and can minimize communication penalties without compromising workload balance. The key idea is to arrange the clusters in a ring topology in such a way that the results of one cluster can be forwarded to the neighbor cluster with a very short latency. In this way, minimizing communication penalties is favored when the producer of a value and its consumer are placed in adjacent clusters, which also favors workload balance. In particular, the typical bypass network that in conventional designs allows values to be bypassed from the output of a unit to the input of any other unit of the same cluster is replaced by a bypass network that allows values to be quickly bypassed to the next cluster, in a unidirectional ring topology.

Note also that for codes with very small ILP, a conventional clustered processor may choose to execute most of the instructions in just one cluster, whereas the rest remain almost idle, in order to maximize performance. The proposed microarchitecture will still distribute the work evenly across all clusters, even if all have low utilization, since this also minimizes communication penalties. Thus, the proposed microarchitecture distributes the activity in the issue queue, the register files and the functional units much better because dependence chains are spread across different clusters, and it is expected to produce less heat than conventional clustered microarchitectures.

The organization of this section is as follows. Section 6.2.1 presents the proposed scheme. Section 6.2.2 evaluates the performance of the proposed approach and compares it with conventional clustered microarchitectures. Section 6.2.3 summarizes the main conclusions of this work.

6.2.1 Ring Clustered Processor

This section describes the proposed processor organization, which will be referred to as the ring clustered microarchitecture. Whereas conventional clustered processors have fast interconnects between the outputs and the inputs of the functional units inside the same cluster, our approach is based on having these bypasses between the outputs of the functional units of a given cluster and the inputs of the functional units of the following cluster. The clusters are arranged forming a ring in such a way that cluster 0 bypasses its data to cluster 1, cluster 1 to cluster 2, and so on; finally, cluster n-1 bypasses its data to cluster 0, closing the ring. We also assume such fast inter-cluster bypasses for tags in order to perform the wakeup in the following cluster of the ring, instead of waking up instructions in the same cluster. Additionally, there is a set of buses communicating values from one cluster to a cluster other than the following one. They are unidirectional and fully pipelined buses. Such buses can be easily designed, have low latency per hop in comparison with non-pipelined buses, and scale quite well.

Figure 6.1 shows a block diagram of the ring clustered microarchitecture. It can be seen that data produced in one cluster are sent to the next cluster, and there are no bypasses from the outputs of the functional units of a given cluster to the inputs of those same functional units. The register file is distributed across all the clusters. Each register file can be read only from the cluster where it resides, and written from the previous cluster in the ring. This organization allows the processor to issue dependent instructions back to back only if they are sent to contiguous clusters. Instructions issued in a given cluster wake up instructions only in the following cluster of the ring, not in the same cluster.


[Figure: eight clusters (Cluster 0 to Cluster 7) arranged in a unidirectional ring; each cluster contains a register file, an issue queue, a communication queue and functional units]

Fig. 6.1: Ring clustered microarchitecture

In this section we validate to some extent that these assumptions are realistic.

This architecture works as follows. Assume an instruction is issued in cluster i. When it finishes its execution, the output is written into the register file of the following cluster (cluster (i+1) mod N, where N stands for the number of clusters). This datum is also bypassed to the functional units of the following cluster, and the instructions in the issue queue and the communication instructions of the following cluster are woken up. Communication instructions are generated dynamically when an operand is needed in a cluster where it is not present, and they wait in a separate issue queue. When an instruction is dispatched, for each required communication a new communication instruction is created in the producer cluster (the one where the value is stored), and one register is allocated in the consumer cluster for storing the copy of the corresponding value. More details can be found elsewhere [91, 92].
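In code form, the forwarding rule reduces to the following sketch (Python; the cluster objects and their methods are hypothetical): results always target cluster (i+1) mod N, never the producing cluster.

    # Sketch of ring result forwarding; cluster objects are hypothetical.
    def forward_result(producer, reg, value, clusters):
        """A value produced in cluster `producer` is written into the register
        file of the next cluster in the ring and wakes up the instructions
        (including communication instructions) waiting there for `reg`."""
        n = len(clusters)
        consumer = (producer + 1) % n   # unidirectional ring
        clusters[consumer].regfile[reg] = value
        clusters[consumer].issue_queue.wakeup(reg)
        clusters[consumer].comm_queue.wakeup(reg)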

We assume homogeneous clusters, all with the same configuration, although the proposed scheme can also be applied to heterogeneous clusters.

The data cache is centralized, forming a separate cluster. We assume a given delay to send the addresses and the same delay to send the read datum back to the cluster that requested it, in addition to the latency of the cache.

The register file has enough ports to perform all the required accesses in a given cycle. Thus, if each cluster has issue width IW, NumFU functional units and there are B buses, the number of ports is:

Read Ports = 2 · IW + B (6.1)

Write Ports = NumFU + B (6.2)


If the instruction has 0 source operands:
    The cluster with more free registers is chosen.
Else if the instruction has 1 source operand:
    Those clusters where the register is mapped are selected, and the one with
    more free registers among them is chosen.
Else if the instruction has 2 source operands:
    If there is at least one cluster where both operands are mapped:
        Those clusters where both registers are mapped are selected, and the
        one with more free registers among them is chosen.
    Else:
        Those clusters where one operand is mapped are selected. Since one
        communication is required, the one incurring the shortest communication
        distance is chosen. If there is more than one, the one with more free
        registers among them is chosen.
    Endif
Endif
If the chosen cluster is full, then the dispatch stage is stalled.

Fig. 6.2: Steering algorithm for the ring clustered microprocessor

This configuration guarantees that any issued instruction can read up to two operands and, if the buses are idle, that a communication instruction per bus can also be issued. On the other hand, the register file has write ports for the data produced by the functional units of the previous cluster and for the incoming buses. A lower number of register ports may well suffice to provide the same performance, but this analysis is beyond the scope of this work.
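As a worked instance of equations (6.1) and (6.2), consider the per-cluster configuration assumed later in the layout study: an issue width of 1 per register file type, one bus, and two functional units writing each register file (our reading of NumFU, e.g. integer ALU plus integer multiplier for the integer file). This reproduces the 3R + 3W ports mentioned in the layout discussion below.

    # Worked instance of equations (6.1) and (6.2), per register file.
    def rf_ports(issue_width, num_fu, num_buses):
        read_ports = 2 * issue_width + num_buses   # eq. (6.1)
        write_ports = num_fu + num_buses           # eq. (6.2)
        return read_ports, write_ports

    # 1-wide issue per type, 2 functional units writing the file, 1 bus:
    print(rf_ports(issue_width=1, num_fu=2, num_buses=1))  # -> (3, 3), i.e. 3R + 3W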

A ring clustered processor, like conventional clustered processors, copies data from one cluster to another only when needed. A copy flows through a bus until it reaches the destination register file. Multiple copies of a given register can be present in different clusters, and all copies are released at the same time [91, 92]. Alternatively, register copies could be released as soon as they are read, whereas the original copy is released when the instruction that redefines the register commits. This would reduce register pressure at the expense of an increase in the number of copies. In this work we just analyze the former alternative.

Steering Algorithm

We use a dependence-based steering policy that reduces the number and distance of the communications and inherently balances the workload. The algorithm works as shown in figure 6.2.

The algorithm sends the instructions to the clusters considering their dependences. When there is more than one candidate cluster to which an instruction could be dispatched, the one with the most available registers is selected, as in the sketch below.
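Read as code, the policy of figure 6.2 amounts to the following sketch (Python; home maps each register to the clusters where it is, or will be, mapped and free counts free registers per cluster; both are hypothetical helper structures, and the stall on a full cluster is omitted):

    def ring_steer(srcs, home, free, n):
        """Steering of figure 6.2: prefer clusters already holding the
        operands; break ties by the number of free registers."""
        if len(srcs) == 0:
            candidates = set(range(n))
        elif len(srcs) == 1:
            candidates = set(home[srcs[0]])
        else:
            both = set(home[srcs[0]]) & set(home[srcs[1]])
            if both:
                candidates = both
            else:
                # Exactly one communication is needed: for each cluster
                # holding one operand, cost = bus hops to bring in the other.
                def hops(c):
                    other = srcs[1] if c in home[srcs[0]] else srcs[0]
                    return min((c - p) % n for p in home[other])
                holders = set(home[srcs[0]]) | set(home[srcs[1]])
                best = min(hops(c) for c in holders)
                candidates = {c for c in holders if hops(c) == best}
        return max(candidates, key=lambda c: free[c])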

The algorithm is illustrated with the example in figure 6.3, which shows the source code as well as the steering decisions taken for each instruction. It can be observed that, when an instruction is steered to a given cluster, its result value is only available in the following cluster of the ring.



Code:
    I1. R1 = 1
    I2. R2 = R1 + 1
    I3. R3 = R1 + R2
    I4. R4 = R1 + R3
    I5. R5 = R1 x 3

Steering decisions (registers mapped in clusters 0 / 1 / 2 / 3 after each step):
    I1 is sent randomly to 0                                      -  / R1 / -     / -
    I2 is sent to 1 (R1 is local)                                 -  / R1 / R2    / -
    I3 is sent to 2 (R2 is local, R1 requires 1 bus cycle)        -  / R1 / R1,R2 / R3
    I4 is sent to 3 (R3 is local, R1 requires 1 bus cycle from 2) R4 / R1 / R1,R2 / R1,R3
    I5 can go to 1, 2 or 3; cluster 3 has more free registers     R4,R5 / R1 / R1,R2 / R1,R3

Fig. 6.3: Example of the steering algorithm

Whereas a conventional dependence-based clustered architecture partitions the dependence graph vertically, sending dependent instructions to the same cluster if the workload imbalance is not high, the ring clustered processor partitions the dependence graph in a horizontal-like manner. Another interesting feature is that instructions with two operands are always sent to a cluster where at least one of their operands is mapped; thus, an instruction never requires two communications.

It can be observed that the best way to execute instructions back-to-back in the ring microarchitecture is to send them to adjacent clusters. Hence, the activity is distributed across all clusters at fine granularity without requiring explicit workload balance mechanisms.

Layout Considerations

As outlined before, our approach relies on having fast interconnects between one cluster and the following one instead of fast intra-cluster wires. The purpose of this subsection is to verify that this assumption is realistic. In order to do so, we draw a simplified high-level layout to check that neighbor connections are short enough to bypass data with the same or shorter delay than the intra-cluster connections of a conventional clustered microarchitecture.

The cluster placement for an 8-cluster configuration is shown in figure 6.4. It can be observed that two different modules are required for an 8-cluster configuration: straight and corner clusters. For a 4-cluster configuration, only corner clusters are required.


Fig. 6.4: Placement alternatives for 8 clusters

Deducing the layout from a high-level circuit design is necessarily approximate, since the layout strongly depends on the technology and the circuit characteristics. Designing the whole backend at circuit level and doing a full layout is the only way to obtain exact figures for the area and delays of the different components. Such an effort is unaffordable to us and probably unreasonable for a microarchitectural study like this one. Our objective is to check whether the proposed microarchitecture is feasible, and for that we use alternative schemes based on some first-order models.

Based on published models [56], we have estimated the area of the different components of a given cluster. We have used the same parameters as the authors of the model; they can be found in table 6.1. For the sake of simplicity, we have assumed that all components but the queues are square blocks.

Other functional units not detailed in the table can be assumed to be out of the critical path, even if this increases their latency, since they do not execute frequent instructions. The area of a register file cell is based on what the model suggests as the average area after looking at several current microprocessors. This assumption may well be pessimistic for our clusters. For instance, the model [56] reports that a register file with 4 read and 2 write ports has a cell area of 27200 λ². If we consider that each cluster is able to issue 1 INT + 1 FP instruction per cycle and there is one global bus, 3 read and 3 write ports per register file (integer or FP) are enough. Thus, with the same number of ports (3R+3W instead of 4R+2W), we have assumed larger register file cells (40600 λ² instead of 27200 λ²), using the worst-case size reported in [56].

The next step consists in placing the different components of a cluster in such a way that they can be easily connected (the inputs of one cluster are close to the outputs of the previous cluster) and the wires from one cluster to the following one have a length similar to that of an intra-cluster communication in a conventional clustered microarchitecture.


Component            Area per cell (λ²)   Size                      Height/Width (λ/λ)   Total area (λ²)
Issue queue          22.300 (CAM)         16 entries                9.619 / 1.000        9.619.200
                     13.900 (RAM)         12 bits CAM/entry
                                          24 bits RAM/entry
Comm. queue          22.300 (CAM)         16 entries                8.006 / 1.000        8.006.400
                     13.900 (RAM)         6 bits CAM/entry
                                          9 bits RAM/entry
Register file        40.600               48 regs, 64 bits/reg      11.168 / 11.168      124.723.200
Integer ALU          2.410.000            64 bits                   12.419 / 12.419      154.240.000
Integer multiplier   1.840.000            64 bits                   10.852 / 10.852      117.760.000
FP unit (Add+Mult)   4.550.000            64 bits                   17.065 / 17.065      291.200.000

Table 6.1: Area of the main cluster's blocks

[Figure: high-level floorplans of (a) the straight cluster module and (b) the corner cluster module, placing the integer and FP register files, the integer, FP and communication issue queues, the integer ALU, the integer multiplier and the FP unit]

Fig. 6.5: High level layout for cluster modules

Since the largest block is the FP unit, and its height (or width) is around 17.100 λ, the design of both types of cluster modules (straight and corner ones) would require intra-cluster connections of this order of magnitude.

Figure 6.5 shows the proposed design for both types of modules. It can be observed that the maximum length between the input and the output of two cluster modules for integer data is 17.400 λ (17.100 - 10.900 + 11.200), from the output of the integer multiplier of a straight module to the input of any integer functional unit of another straight module. For FP data, the maximum distance occurs when any module is connected to a corner one: 23.300 λ (12.400 + 10.900). Thus, only FP data may have their bypass delay increased. However, if the FP unit could fill the empty space in the middle of the corner module, the distance for the worst case could be reduced to 10.900 λ.

To conclude, it seems feasible to send data from one cluster to the following one with a delay similar to that of an intra-cluster bypass of a conventional clustered processor. Thus, we assume that a given instruction can be issued back to back with its predecessor, which is in the previous cluster.

If shorter inter-cluster connections are desired, the ring-like approach can be designed with two independent rings: one for integer instructions and another for FP ones. Thus, integer and FP clusters can be smaller, and the blocks can be placed in such a way that inter-cluster bypasses are shorter. Instructions with integer inputs and an FP output, or vice versa, are extremely rare, so a bidirectional bus from one of the integer clusters to one of the FP clusters can be assumed for those instructions. The only frequent instructions with integer inputs and FP outputs are FP loads. The address calculation of these instructions is sent to the integer ring and, when the address has been computed, it is sent to the load/store queue. Once the memory access is performed, the datum is sent back to the corresponding cluster of the FP ring. Thus, these instructions work in the same way even if the FP and integer units are in different rings.

If a further reduction in bypass delay is needed, fatter connections and/or repeaters [61] can be used.

The high-level layout for the clusters of the architecture with two independent rings (one for integer instructions and another for FP ones) is shown in figure 6.6. It can be observed that the maximum distance for integer or FP data is 11.200 λ, which corresponds to any module connected to a straight one.

Additional Comments

Some considerations must be taken into account for the design of the ring clustered processor. It is hard to achieve accurate floorplans of the clusters without doing the full detailed layout of the whole processor core, but we believe that this is a first-order approximation that validates the potential of the idea. The objective is to show some evidence that issue queues, register files and functional units can be laid out in such a way that sending the tags/data from one cluster to the following one in the ring has a delay similar to that of the intra-cluster connections of a conventional clustered architecture.

In this work, the distance in time to/from the data cache and the load/store queue has been considered to be the same for all clusters. We have assumed a 1-cycle penalty for sending data to/from these structures to any cluster. In some implementations the cache latency may not be uniform across all the clusters. That would probably degrade performance, but the effect is expected to be the same for both the ring and the conventional clustered architectures. Note also that the cache could be partitioned in a clustered architecture so that each cluster had a local cache that could be accessed very fast; however, this is orthogonal to the main ideas proposed in this work.


[Figure: high-level floorplans of (a) the integer straight, (b) the integer corner, (c) the FP straight and (d) the FP corner cluster modules]

Fig. 6.6: High level layout for cluster modules with integer and FP independent rings


6.2.2 Performance Evaluation

The proposed architecture is evaluated in this section. First, we describe the conventional clustered microarchitecture used for comparison purposes and the experimental framework. Results are then reported.

Microarchitecture Used for Comparison

The proposed ring clustered microarchitecture will be compared with a state-of-the-art clustered microarchitecture [92] with about the same number of resources: number and configuration of clusters and buses, number of bypasses, etc. In the rest of the section, the architecture used for comparison purposes will be referred to as Conv, whereas the ring clustered microarchitecture will be referred to as Ring.

As described in section 6.1, the Conv processor has clusters with fast intra-cluster connections and point-to-point buses to communicate data between different clusters. The steering algorithm is shown in figure 6.7.

It can be observed that the algorithm tries to reduce the number of inter-cluster communications and balance the workload at the same time.


If the workload imbalance is higher than the threshold:
    The least loaded cluster is chosen (the one with the lowest DCOUNT value).
Else:
    If any source operand is not available at dispatch time:
        The cluster(s) where the pending operand(s) are to be produced are selected.
    Else if all source operands are available at dispatch time:
        The cluster(s) that minimize the longest communication distance are selected.
    Else if it has no source operands:
        All clusters are selected.
    Endif
    The least loaded cluster among the selected clusters is chosen.
Endif

Fig. 6.7: Steering algorithm for the conventional clustered microprocessor

The latency to steer instructions to clusters is 1 cycle for both the Conv and Ring architectures. Larger latencies could be considered, but they would have a minor impact on the conclusions, since each additional cycle in the frontend basically increases the branch misprediction penalty by about one cycle in both the proposed and the baseline microarchitectures.

Experimental Framework

Performance results have been obtained with SimpleScalar [23], which has been modified to simulate clustered microarchitectures. Three different configurations have been evaluated: 4 and 8 clusters with an issue width of 2 INT + 2 FP each, and 8 clusters with an issue width of 1 INT + 1 FP each. Table 6.2 describes the assumed processor configuration. We have used the SPEC CPU2000 benchmark suite [115].

We have focused on these numbers of clusters because the performance with fewer clusters is significantly lower, especially for FP programs. For instance, if the issue width per cluster is 1 INT and 1 FP instruction per cycle, the performance when moving from 2 to 4 clusters grows by 41% for FP programs and 19% for INT ones. If we move from 4 to 8 clusters, the speedups are 24% for FP programs and 8% for INT ones.

The 4-cluster configuration does not require a high number of communications, so one unidirectional fully pipelined bus has been assumed (i.e. a datum can be transmitted from every cluster to the following one at the same time). The 8-cluster configuration may require more and farther communications, so either one or two buses have been considered. For the two-bus configuration, Ring has both buses in the same direction, whereas Conv has one bus for each direction in order to reduce the distance of the communications. Table 6.3 details how the different configurations are referred to in the rest of this work.

Performance

Figure 6.8 shows the speedup of Ring over Conv for each configuration. Ring achieves higher performance than Conv for all configurations. It can be observed that the speedup for integer programs is smaller than for FP programs, and even slightly negative for one configuration.


Fetch, Decode, Commit width: 8 instructions/cycle
Fetch queue: 64 entries
Issue queue (4 clusters): 32 INT + 32 FP + 16 comm. entries/cluster
Issue queue (8 clusters): 16 INT + 16 FP + 16 comm. entries/cluster
Load/store queue size: 128 entries
Reorder buffer size: 256 entries
Register file (4 clusters): 64 INT + 64 FP registers/cluster
Register file (8 clusters): 48 INT + 48 FP registers/cluster
INT functional units: ALU (1 cycle), mult/div (3 cycles mult, 20 cycles non-pipelined div)
FP functional units: ALU (2 cycles), mult/div (4 cycles mult, 12 cycles non-pipelined div)
1 INT + 1 FP issue width: 1 unit of each type per cluster
Memory ports: 4 R/W ports (2 cycles pipelined)
Branch predictor: Hybrid: 2K entry Gshare, 2K entry bimodal and 1K entry metatable
BTB: 2048 entries, 4-way
L1 Icache: 64KB 2-way, 32-byte lines, 1 cycle latency
L1 Dcache: 32KB 4-way, 32-byte lines, 4 R/W ports, 2 cycles
Latency to/from L1 Dcache: 1 cycle
L2 unified cache: 512KB, 4-way, 64-byte lines, 10 cycles latency
Memory: 100 cycles, 2 cycles interchunk
Data TLB: 128 entries, 30 cycles miss penalty
Instruction TLB: 128 entries, 30 cycles miss penalty

Table 6.2: Processor configuration

Architecture   Num. clusters   Issue width    Num. buses   Name
Conv           4               2 INT + 2 FP   1            Conv_4clus_1bus_2IW
Conv           8               1 INT + 1 FP   1            Conv_8clus_1bus_1IW
Conv           8               1 INT + 1 FP   2            Conv_8clus_2bus_1IW
Conv           8               2 INT + 2 FP   1            Conv_8clus_1bus_2IW
Conv           8               2 INT + 2 FP   2            Conv_8clus_2bus_2IW
Ring           4               2 INT + 2 FP   1            Ring_4clus_1bus_2IW
Ring           8               1 INT + 1 FP   1            Ring_8clus_1bus_1IW
Ring           8               1 INT + 1 FP   2            Ring_8clus_2bus_1IW
Ring           8               2 INT + 2 FP   1            Ring_8clus_1bus_2IW
Ring           8               2 INT + 2 FP   2            Ring_8clus_2bus_2IW

Table 6.3: Evaluated configurations


[Figure: speedup of Ring over Conv, from -2% to 16%, for AVERAGE, INT and FP, for each Ring configuration]

Fig. 6.8: Speedup of Ring over Conv

Since Ring is much more effective than Conv at reducing the number and the distance of the communications, Ring achieves higher speedups for programs with a larger number of communications, as is the case for FP programs. This becomes even more obvious in the presence of just one bus; in this case, the speedup increases significantly. In order to show why Ring performs better than Conv, we have analyzed the penalties introduced by communications and workload imbalance.

Communications

Figure 6.9 shows the number of communications per instruction. Ring requires fewer communications than Conv because it succeeds in distributing the workload without introducing extra communications. On the other hand, Conv quite often reaches situations where it has to send an instruction to the least loaded cluster, even if that decision introduces more communications. Thus, the workload is balanced at the expense of more communications. It can also be observed that FP programs require more communications than integer ones.

Communication distance must also be studied. The distance of a communication is the number of hops required to copy the data from the source cluster to the destination cluster. Shorter communications are desirable in order to reduce the time that consumer instructions spend waiting for remote data. Figure 6.10 shows the average distance per communication for the different configurations.


[Figure: communications per instruction, from 0.00 to 0.45, for AVERAGE, INT and FP, for all Conv and Ring configurations]

Fig. 6.9: Average number of communications per instruction

It can be observed that Conv covers a distance similar to Ring's when there are two buses, but Ring has much shorter communications when there is one bus.
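For concreteness, here is a minimal sketch of how hops can be counted under the two bus layouts compared here, assuming Ring's buses both run in the same direction and Conv's two buses run in opposite directions, as described in the experimental framework:

    # Hop counts for a copy from cluster src to cluster dst among n clusters.
    def ring_hops(src, dst, n=8):
        return (dst - src) % n      # unidirectional: always travel forward

    def conv_hops_two_buses(src, dst, n=8):
        fwd = (dst - src) % n
        return min(fwd, n - fwd)    # one bus per direction: shorter way wins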

Both the number of communications and their distance determine the bus occupancy and, thus, further delays due to bus contention. Figure 6.11 shows the average number of cycles that a ready communication instruction has to wait until it can access the bus. It can be seen that Conv has much higher contention than Ring, especially if there is only one bus. For both 8-cluster, 1-bus configurations, Conv's contention is larger than 5 cycles for FP instructions.

Workload Imbalance

The workload balance metric used to guide the steering algorithm of Conv is DCOUNT, since it provides better performance than others, but in order to show the effect of workload imbalance on IPC a more suitable metric has been used: NREADY [91, 92]. NREADY accounts for the number of ready instructions that are not issued at a given instant of time because they exceed the issue width of their respective clusters, but that could be issued in other clusters since those have idle functional units.
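A per-cycle sketch of this metric, under our simplified reading of the definition (the exact formulation is given in [91, 92]):

    def nready(ready_per_cluster, issue_width):
        """NREADY for one cycle: ready instructions stranded by their own
        cluster's issue width that idle slots elsewhere could have absorbed."""
        stranded = sum(max(0, r - issue_width) for r in ready_per_cluster)
        idle = sum(max(0, issue_width - r) for r in ready_per_cluster)
        return min(stranded, idle)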

Figure 6.12 quantifies the workload imbalance for the different configurations. It can be observed that the conventional clustered microarchitecture balances the workload somewhat better than the ring clustered processor. However, this small detriment in workload balance is more than offset by the reduction in communication penalties shown before, especially for FP programs, which are more communication intensive.


[Figure: distance per communication, from 0.0 to 4.0 hops, for AVERAGE, INT and FP, for all Conv and Ring configurations]

Fig. 6.10: Average distance per communication

[Figure: bus contention per communication, from 0.0 to 5.5 cycles, for AVERAGE, INT and FP, for all Conv and Ring configurations]

Fig. 6.11: Average delay per communication due to bus contention


[Figure: workload imbalance (NREADY), from 0.0 to 3.0, for AVERAGE, INT and FP, for all Conv and Ring configurations]

Fig. 6.12: Workload imbalance using the NREADY metric

Conv reduces the workload imbalance at the expense of a much larger number of communications, as we have shown, which produces a significant performance degradation.

Conv uses the DCOUNT metric for balancing the workload, so the number of instructions dispatched to each cluster is expected to be approximately the same. On the other hand, Ring does not use any mechanism for balancing the workload, since balance is inherent to its dependence-based steering algorithm. Nevertheless, it is interesting to show how many instructions are dispatched to each cluster. Figure 6.13 shows the percentage of instructions dispatched to each cluster for all benchmarks for the Ring_8clus_1bus_2IW configuration. It can be observed that the percentage of instructions dispatched to each cluster is pretty much the same for all programs. Similar results are achieved for the other configurations.

The configurations with 8 clusters and a 2 INT + 2 FP issue width are especially interesting since they achieve a very good workload balance for both Conv and Ring, and are still suitable for high clock rates since the structures involved in the issue process are quite small: 16-entry issue queues and 48-entry register files.

Scaling Wires

It is well known that wires scale very badly, so future clustered microprocessors are expected to have large latencies for inter-cluster communications. We have assumed buses with a 1-cycle latency per hop, but this may not be feasible for future processors. Thus, it is interesting to analyze how both the Conv and the Ring processors perform with slower buses.


[Figure: stacked bars showing the fraction of instructions dispatched to each of the 8 clusters (cluster 0 to cluster 7), per SPEC CPU2000 benchmark]

Fig. 6.13: Distribution of the dispatched instructions across the clusters

For this purpose, we have evaluated the configurations with 8 clusters and a 2 INT + 2 FP issue width (with 1 and 2 buses) using a 2-cycle latency per hop and fully pipelined buses. Thus, a given bus may be processing 16 communications at a time.

Figure 6.14 shows the speedup of Ring over Conv. For the configuration with one bus, the speedup grows from 8.1% with a 1-cycle latency per hop to 11.8% (19.1% for FP programs) with a 2-cycle latency per hop. A similar trend is observed for 2 buses. Conv loses much more performance than Ring because the former has more and longer communications than the latter.

A Simple Steering Algorithm

The steering algorithm that we have assumed has a complexity similar to the one by Parcerisa et al. [92]. In this section we evaluate the performance of a conventional clustered microarchitecture and the ring one using a simpler steering algorithm, whose complexity is similar to that of the rename logic. Figure 6.15 shows the Simple Steering Algorithm (SSA).

As can be seen, this simple steering algorithm does not include any explicit workload imbalance control.

Figure 6.16 shows the speedup of Ring over Conv when the simple steering algorithm is used. It can be observed that the speedup is huge. For instance, for an 8-cluster, 1-way issue, 2-bus configuration, the speedup of Ring over Conv is 50% on average (1.5X).


[Figure: speedup of Ring over Conv, from 0% to 20%, for AVERAGE, INT and FP, with 1 and 2 buses at 1- and 2-cycle latencies per hop]

Fig. 6.14: Speedup of Ring over Conv for different bus latencies

If the instruction has at least one input operand:
    It is sent to the lowest-index cluster that stores (or will store) its
    leftmost operand.
Else (no input operands):
    It is sent to a cluster in a round-robin fashion.
Endif

Fig. 6.15: Simple Steering algorithm for both Ring and Conv processors
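Read as code, the SSA is only a few lines (Python sketch; home maps each register to the clusters where it is, or will be, mapped, as in the earlier steering sketch):

    import itertools

    rr = itertools.count()  # round-robin source for operand-less instructions

    def simple_steer(srcs, home, n):
        """SSA of figure 6.15: lowest-index cluster holding (or about to
        hold) the leftmost operand; round-robin when there are no operands."""
        if srcs:
            return min(home[srcs[0]])
        return next(rr) % n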


[Figure: speedup of Ring+SSA over Conv+SSA, from 0% to 90%, for AVERAGE, INT and FP, for each Ring configuration]

Fig. 6.16: Speedup of Ring+SSA over Conv+SSA

As observed for the other steering algorithm, the speedup for FP programs (80%) is higher than for integer programs (30%).

The performance drop of Ring+SSA with respect to Ring is between 5% and 14% depending on the configuration. The drop is small because the workload balance is similar, whereas the communication distance increases slightly. The performance drop of Conv+SSA with respect to Conv is between 23% and 42% depending on the configuration. It is so high mainly due to workload imbalance. Ring inherently balances the workload, whereas Conv does not; thus, Conv+SSA tends to concentrate most of the instructions in very few clusters. This concentration reduces the communication penalty but incurs many dispatch stalls because the cluster selected to steer instructions to is full. Additionally, full clusters hold many ready instructions that cannot be issued because they exceed the issue width, whereas other clusters have less workload than they could absorb. In fact, due to this workload imbalance (figure 6.17), Ring+SSA shows a higher speedup with respect to Conv+SSA when the issue width is 1. On the other hand, for the enhanced steering (see figure 6.8) we observed the opposite trend: a lower speedup when the issue width is 1.

Note that the workload imbalance of Ring+SSA is 10% higher than that of Ring (see figure 6.12). In the case of Conv, the workload imbalance increases by between 100% and 300% depending on the configuration.


[Figure: workload imbalance (NREADY) with the Simple Steering Algorithm, from 0.0 to 4.0, for AVERAGE, INT and FP, for all Conv and Ring configurations]

Fig. 6.17: Workload imbalance using the NREADY metric with the Simple Steering Algorithm

6.2.3 Conclusions

We have presented a new clustered microarchitecture for superscalar processors. A distinguishing feature of this microarchitecture is that the schemes that favor hiding wire delays also favor workload balance among the clusters, due to the particular way that clusters are laid out and interconnected. The clusters are arranged in a ring topology, which is not new but, unlike previous proposals, fast interconnects are used for forwarding values among neighbor clusters, whereas internal bypasses are not needed. As a consequence, a dependence-based steering algorithm that attempts to reduce the number and distance of global communications is extremely effective at distributing the workload across all the clusters without requiring any explicit workload balance scheme.

The proposed ring clustered microarchitecture significantly outperforms state-of-the-art clustered organizations. The benefits increase as the number of clusters increases and when the global interconnects are simple, scarce and have long latencies. Thus, the ring clustered microarchitecture is more scalable than conventional ones. For instance, for an 8-cluster configuration and just one fully pipelined unidirectional bus with a latency of 2 cycles per hop, the proposed architecture achieves a 19% speedup over a state-of-the-art, topology-aware conventional clustered architecture for FP programs. If a simpler steering algorithm is used, the speedup can be as large as 50% on average for the whole SPEC2000 benchmark suite.

While conventional clustered architectures tend to steer the instructions to one or a few clusters until a certain workload imbalance is reached, the proposed ring-like architecture distributes the activity across all the clusters at all times, as we have shown. This fact may result in a better temperature distribution across the chip, and thus it is expected to reduce the frequency of temperature emergencies with respect to a conventional clustered microarchitecture.

To conclude, the proposed clustered microarchitecture outperforms state-of-the-art microarchitectures. Besides, it is more scalable than conventional ones, requires fewer communication resources and is more effective at distributing the activity across all the clusters.



CHAPTER 7

CONCLUSIONS

As technology evolves, power density significantly increases and cooling systems become more complex and expensive. Reducing the energy requirements and complexity of microprocessors is key to allowing designers to produce higher-performance processors.

In this dissertation, we have proposed several schemes to dynamically adapt some of the processor structures to the program requirements, as well as alternative designs for such structures to reduce their complexity and energy consumption.

The result of this thesis is a set of solutions for the most relevant microprocessor structures in terms of energy requirements and complexity.

7.1 Contributions

We have proposed solutions for the cache memories, the issue logic, the load/store queue and the whole processor back-end. The main contributions are as follows:

1. New organizations for the L1 data cache have been proposed to optimize performance and energy consumption. We have shown that combining different cache modules with different levels of performance and energy requirements is a good solution for the L1 data cache. Two different organizations have been proposed. In one of them, the cache modules are arranged in parallel and the accesses are classified depending on their criticality. In the other one, the cache modules are arranged hierarchically. The proposed cache microarchitectures have been shown to reduce the dynamic and leakage energy consumption while achieving high performance.

2. L2 caches are the largest and leakiest structures in a chip. State-of-the-art techniques that reduce their leakage by turning off cache lines have been shown to perform poorly. We have proposed IATAC, a new technique to efficiently turn off L2 cache lines. IATAC predicts the decay interval that must elapse before turning off each individual cache line, using a global predictor based on the existing relation between the decay interval and the number of accesses per cache line (a simplified sketch of this idea is given after this list). We have shown that our approach provides significant benefits in terms of energy, EDP and ED2P, and outperforms all previous state-of-the-art techniques.


3. We have relaxed a limitation of set-associative caches: the homogeneity of their way sizes. We have proposed a new design for memory structures that enables cache ways to have heterogeneous sizes (HWS caches). This approach has been shown to be very effective for L1 data and instruction caches and for L2 caches.

4. The HWS caches have many advantages over conventional caches. They can better adapt to the program requirements. We have proposed new resizing algorithms for the HWS cache. Our evaluations have shown that the HWS cache is much more amenable to resizing schemes for L1 and L2 caches than conventional caches.

5. The issue logic has been studied in depth. The issue queue is often designed for almost the worst case, and most of the time its size is larger than needed. We have proposed a new resizing scheme that dynamically adapts the issue queue size to the real program requirements. We show that our approach provides significant energy savings for the issue queue and the register file, outperforming state-of-the-art techniques.

6. We have proposed a FIFO-based issue logic that achieves very high performance for FP programs at a low cost in terms of complexity and energy requirements. By combining dependence and latency information of instructions, this issue logic adapts to the FP program requirements.

7. The load/store queue has high complexity and significant energy requirements. We have proposed SAMIE-LSQ, a new load/store queue design that fits the program requirements, reduces the number of address comparisons, reduces the complexity of the comparisons that must be performed, and saves energy in the load/store queue, the L1 data cache and the data TLB.

8. Finally, we have studied clustered microarchitectures as a solution for superscalar microprocessors because they distribute the resources. We have proposed a new clustered microarchitecture that effectively distributes the workload and the activity across all the clusters and minimizes the number of communications, without having to trade one for the other. The proposed organization is especially interesting because it succeeds in distributing the activity across the clusters, which is beneficial in terms of temperature. The ring microarchitecture proposed in this thesis outperforms state-of-the-art organizations and scales better.
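
As referenced in contribution 2 above, the following C sketch illustrates the decay-interval idea behind IATAC: lines age while untouched and are gated off once their idle time exceeds an interval predicted from their access count. The structures, constants and update policy shown here are simplifying assumptions; the actual predictor is the one described in the corresponding chapter.

#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES    8192
#define MAX_ACCESSES 16    /* per-line access counters saturate here */

typedef struct {
    uint32_t idle_cycles;   /* cycles elapsed since the last access */
    uint32_t access_count;  /* saturating per-line access counter */
    bool     powered;       /* false once the line has been gated off */
} line_state_t;

static line_state_t lines[NUM_LINES];

/* Global predictor: decay interval (in cycles) to wait before turning
   off a line, indexed by the line's access count. Training this table
   from observed line lifetimes is omitted in this sketch. */
static uint32_t decay_interval[MAX_ACCESSES + 1];

/* Called on every access to cache line idx. */
void on_access(int idx)
{
    line_state_t *l = &lines[idx];
    l->idle_cycles = 0;
    l->powered = true;
    if (l->access_count < MAX_ACCESSES)
        l->access_count++;
}

/* Called periodically to age the lines and gate off the decayed ones. */
void decay_tick(uint32_t elapsed_cycles)
{
    for (int i = 0; i < NUM_LINES; i++) {
        line_state_t *l = &lines[i];
        if (!l->powered)
            continue;
        l->idle_cycles += elapsed_cycles;
        if (l->idle_cycles >= decay_interval[l->access_count])
            l->powered = false;  /* predicted dead: turn the line off */
    }
}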

7.2 Future Work

The different techniques presented in this dissertation can be further enhanced or extended. The proposal on multiple-banked L1 data caches, where the different cache modules have different power and performance characteristics, shows that the compiler can provide further benefits. It is hard to classify data dynamically because the processor does not have a global view of the program. Thus, the compiler can either provide hints or classify the data to make the proposed cache organization more effective. Some of these ideas have been studied by Gibert et al. [50], who propose a scheme based on our work where the compiler classifies the data to achieve higher energy savings.

IATAC has been shown to be very effective at turning off L2 cache lines due to the existing relation between the number of accesses to the cache lines and the decay intervals. We believe that this relation can be further exploited to reduce the performance penalty introduced by IATAC, and that the technique can be extended to other cache levels.

Heterogeneous way-size caches have been studied for L1 and L2 caches, but they can be used for any kind of set-associative structure. Further research can be done using HWS designs in structures other than caches, such as branch predictors and TLBs.

We have proposed new techniques to reduce the complexity and energy requirements of the issue logic. Our work has been continued by Jones et al. [68], who propose using the compiler to resize the issue queue instead of doing it in hardware.

SAMIE-LSQ has been shown to be a very effective load/store queue organization, and it reduces the energy consumption and cache access time of many loads and stores. SAMIE-LSQ allows many memory instructions to share the same queue entry; thus, after the first access, the other instructions in the entry know beforehand whether they will hit in the cache and which cache line to access. Techniques that take advantage of this lower latency and of the hit/miss knowledge of the accesses are an interesting area of research, as illustrated by the sketch below.
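
The following C sketch shows one plausible shape for such a shared entry; the field names, sizes and sharing degree are illustrative assumptions, not the SAMIE-LSQ implementation described earlier in the thesis.

#include <stdint.h>
#include <stdbool.h>

#define INSTRS_PER_ENTRY 4  /* illustrative sharing degree */

/* Shared queue entry in the spirit of SAMIE-LSQ: memory instructions
   that access the same cache line share one entry, so once the first
   access resolves, the remaining instructions in the entry already
   know the hit/miss outcome and which line they touch. */
typedef struct {
    uint64_t line_addr;   /* cache-line address shared by the group */
    bool     resolved;    /* the first access has completed */
    bool     line_hits;   /* hit/miss outcome of the first access */
    uint16_t way;         /* cache way holding the line, if it hit */
    uint8_t  n_instrs;    /* memory instructions sharing this entry */
} samie_entry_t;

/* After the first access resolves, later instructions in the entry can
   skip the tag check and go straight to the known way on a hit. */
bool can_skip_tag_check(const samie_entry_t *e)
{
    return e->resolved && e->line_hits;
}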


Bibliography

[1] Jaume Abella, Ramon Canal, and Antonio Gonzalez. Power- and complexity-aware issue queue designs. In IEEE Micro, Special Issue on Power- and Complexity-Aware Design, 2003.

[2] Jaume Abella and Antonio Gonzalez. On reducing register pressure and energy in multiple-banked register files. In 21st International Conference on Computer Design (ICCD'03), 2003.

[3] Jaume Abella and Antonio Gonzalez. Power-aware adaptive issue queue and register file. In International Conference on High Performance Computing (HiPC'03), 2003.

[4] Jaume Abella and Antonio Gonzalez. Power efficient data cache designs. In 21st International Conference on Computer Design (ICCD'03), 2003.

[5] Jaume Abella and Antonio Gonzalez. Low-complexity distributed issue queue. In International Symposium on High-Performance Computer Architecture (HPCA'04), 2004.

[6] Jaume Abella and Antonio Gonzalez. Inherently workload-balanced clustered microarchitecture. In International Parallel and Distributed Processing Symposium (IPDPS'05), 2005.

[7] Jaume Abella and Antonio Gonzalez. Heterogeneous way-size cache. Submitted to the International Conference on Supercomputing (ICS'06), 2006.

[8] Jaume Abella and Antonio Gonzalez. SAMIE-LSQ: Set-associative multiple-instruction entries load/store queue. In International Parallel and Distributed Processing Symposium (IPDPS'06), 2006.

[9] Jaume Abella, Antonio Gonzalez, Xavier Vera, and Michael F.P. O'Boyle. IATAC: A smart predictor to turn off L2 cache lines. In ACM Transactions on Architecture and Code Optimization (TACO), 2005.

[10] A. Agarwal, H. Li, and K. Roy. DRG-Cache: A data retention gated-ground cache for low power. In 39th Design Automation Conference (DAC'02), 2002.


[11] A. Agarwal and S. Pudar. Column associative caches: A technique for reducing miss rate of direct-mapped caches. In 20th International Symposium on Computer Architecture (ISCA'93), 1993.

[12] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger. Clock rate versus IPC: The end of the road for conventional microarchitectures. In 27th International Symposium on Computer Architecture (ISCA'00), 2000.

[13] A.R. Alameldeen and D.A. Wood. Adaptive cache compression for high-performance processors. In 31st International Symposium on Computer Architecture (ISCA'04), 2004.

[14] D.H. Albonesi. Selective cache ways: On-demand cache resource allocation. In 32nd International Symposium on Microarchitecture (MICRO'99), 1999.

[15] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In 33rd International Symposium on Microarchitecture (MICRO'00), 2000.

[16] A. Baniasadi and A. Moshovos. Instruction distribution heuristics for quad-cluster, dynamically-scheduled, superscalar processors. In 33rd International Symposium on Microarchitecture (MICRO'00), 2000.

[17] B. Batson and T.N. Vijaykumar. Reactive-associative caches. In 10th International Conference on Parallel Architectures and Compilation Techniques (PACT'01), 2001.

[18] B.M. Beckmann and D.A. Wood. TLC: Transmission line caches. In 36th International Symposium on Microarchitecture (MICRO'03), 2003.

[19] E. Brekelbaum, J. Rupley II, C. Wilkerson, and B. Black. Hierarchical scheduling windows. In 35th International Symposium on Microarchitecture (MICRO'02), 2002.

[20] D.M. Brooks, P. Bose, S.E. Schuster, H. Jacobson, P.N. Kudva, A. Buyuktosunoglu, J.D. Wellman, V. Zyuban, M. Gupta, and P.W. Cook. Power-aware microarchitecture: Design and modeling challenges for next-generation microprocessors. In IEEE Micro, volume 20, issue 6, 2000.

[21] D.M. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In 27th International Symposium on Computer Architecture (ISCA'00), 2000.

[22] M.D. Brown, J. Stark, and Y.N. Patt. Select-free instruction scheduling logic. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.


[23] D. Burger and T. Austin. The SimpleScalar tool set, version 3.0. Technical report, University of Wisconsin-Madison, 1999.

[24] M. Butler and Y.N. Patt. An investigation of the performance of various dynamic scheduling techniques. In 25th International Symposium on Microarchitecture (MICRO'92), 1992.

[25] A. Buyuktosunoglu, D. Albonesi, P. Bose, P. Cook, and S. Schuster. Tradeoffs in power-efficient issue queue design. In International Symposium on Low Power Electronics and Design (ISLPED'02), 2002.

[26] A. Buyuktosunoglu, D. Albonesi, S. Schuster, D. Brooks, P. Bose, and P. Cook. A circuit level implementation of an adaptive issue queue for power-aware microprocessors. In 11th Great Lakes Symposium on VLSI (GLSVLSI'01), 2001.

[27] A. Buyuktosunoglu, T. Karkhanis, D.H. Albonesi, and P. Bose. Energy efficient co-adaptive instruction fetch and issue. In 30th International Symposium on Computer Architecture (ISCA'03), 2003.

[28] G. Cai and C.H. Lim. Architectural level power/performance optimization and dynamic power estimation. In Cool-Chips Tutorial, held in conjunction with MICRO'99, 1999.

[29] H. Cain and M. Lipasti. Memory ordering: A value based definition. In 31st International Symposium on Computer Architecture (ISCA'04), 2004.

[30] B. Calder, D. Grunwald, and J. Emer. Predictive sequential associative cache. In 2nd International Symposium on High-Performance Computer Architecture (HPCA'96), 1996.

[31] R. Canal and A. Gonzalez. A low-complexity issue logic. In International Conference on Supercomputing (ICS'00), 2000.

[32] R. Canal and A. Gonzalez. Reducing the complexity of the issue logic. In International Conference on Supercomputing (ICS'01), 2001.

[33] R. Canal, A. Gonzalez, and J.E. Smith. Very low power pipelines using significance compression. In 33rd International Symposium on Microarchitecture (MICRO'00), 2000.

[34] R. Canal, J.M. Parcerisa, and A. Gonzalez. A cost-effective clustered architecture. In 8th International Conference on Parallel Architectures and Compilation Techniques (PACT'99), 1999.

[35] R. Canal, J.M. Parcerisa, and A. Gonzalez. Dynamic cluster assignment mechanisms. In 6th International Symposium on High-Performance Computer Architecture (HPCA'00), 2000.


[36] Z. Chishti, M.D. Powell, and T.N. Vijaykumar. Distance associativity for high-performance energy-efficient non-uniform cache architectures. In 36th International Symposium on Microarchitecture (MICRO'03), 2003.

[37] G. Chrysos and J. Emer. Memory dependence prediction using store sets. In 25th International Symposium on Computer Architecture (ISCA'98), 1998.

[38] A.S. Dhodapkar and J.E. Smith. Managing multi-configuration hardware via dynamic working set analysis. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[39] S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, D.H. Albonesi, S. Dwarkadas, G. Semeraro, G. Magklis, and M.L. Scott. Integrating adaptive on-chip storage structures for reduced dynamic power. In 11th International Conference on Parallel Architectures and Compilation Techniques (PACT'02), 2002.

[40] J. Emer. EV8: The post-ultimate Alpha. Keynote at the 10th International Conference on Parallel Architectures and Compilation Techniques (PACT'01), 2001.

[41] D. Ernst and T. Austin. Efficient dynamic scheduling through tag elimination. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[42] K.I. Farkas, P. Chow, N.P. Jouppi, and Z. Vranesic. The multicluster architecture: Reducing cycle time through partitioning. In 30th International Symposium on Microarchitecture (MICRO'97), 1997.

[43] B. Fields, S. Rubin, and R. Bodik. Focusing processor policies via critical-path prediction. In 28th International Symposium on Computer Architecture (ISCA'01), 2001.

[44] K. Flautner, N.S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple techniques for reducing leakage power. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[45] D. Folegnani and A. Gonzalez. Energy-effective issue logic. In 28th International Symposium on Computer Architecture (ISCA'01), 2001.

[46] M. Franklin. The multiscalar architecture, 1993.

[47] M. Franklin and G.S. Sohi. ARB: A hardware mechanism for dynamic reordering of memory references. In IEEE Transactions on Computers, volume 45, issue 5, 1996.

[48] K. Gharachorloo, A. Gupta, and J. Hennessy. Two techniques to enhance the performance of memory consistency models. In International Conference on Parallel Processing (ICPP'91), 1991.


[49] K. Ghose and M.B. Kamble. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation. In International Symposium on Low Power Electronics and Design (ISLPED'99), 1999.

[50] E. Gibert, J. Abella, J. Sanchez, X. Vera, and A. Gonzalez. A heterogeneous multi-module data cache for VLIW processors. 2004.

[51] A. Gonzalez, C. Aliagas, and M. Valero. A data cache with multiple caching strategies tuned to different types of locality. In 9th International Conference on Supercomputing (ICS'95), 1995.

[52] R. Gonzalez, B.M. Gordon, and M.A. Horowitz. Supply and threshold voltage scaling for low power CMOS. In IEEE Journal of Solid-State Circuits (JSSC), volume 32, number 8, 1997.

[53] M. Goshima, K. Nishino, Y. Nakashima, S. Mori, T. Kitamura, and S. Tomita. A high-speed dynamic instruction scheduling scheme for superscalar processors. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.

[54] J.P. Grossman. Cheap out-of-order execution using delayed issue. In 18th International Conference on Computer Design (ICCD'00), 2000.

[55] S.H. Gunther, F. Binns, D.M. Carmean, and J.C. Hall. Managing the impact of increasing power consumption. In Intel Technology Journal, 1st quarter, 2001.

[56] S. Gupta, S.W. Keckler, and D. Burger. Technology independent area and delay estimates for microprocessor building blocks. Technical Report 2000-5, Department of Computer Sciences, University of Texas at Austin, 2000.

[57] L. Gwennap. Digital 21264 sets new standard. In Microprocessor Report, 10(14), 1996.

[58] D. Halperin. PA-RISC 8x00 family of microprocessors with focus on PA-8700. In Hot Chips Conference, 2000.

[59] S. Heo, K. Barr, M. Hampton, and K. Asanovic. Dynamic fine-grain leakage reduction using leakage-biased bitlines. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[60] J. Hezavei, N. Vijaykrishnan, and M.J. Irwin. A comparative study of power efficient SRAM designs. In 10th Great Lakes Symposium on VLSI (GLSVLSI'00), 2000.

[61] R. Ho, K.W. Mai, and M.A. Horowitz. The future of wires. In Proceedings of the IEEE, volume 89, issue 4, 2001.


[62] Z. Hu, S. Kaxiras, and M. Martonosi. Let caches decay: Reducing leakage energy via exploitation of cache generational behavior. In ACM Transactions on Computer Systems, volume 20, issue 2, 2002.

[63] M. Huang, J. Renau, and J. Torrellas. Energy-efficient hybrid wakeup logic. In International Symposium on Low Power Electronics and Design (ISLPED'02), 2002.

[64] M. Huang, J. Renau, S.M. Yoo, and J. Torrellas. L1 data cache decomposition for energy efficiency. In International Symposium on Low Power Electronics and Design (ISLPED'01), 2001.

[65] K. Inoue, T. Ishihara, and K. Murakami. Way-predictive set-associative cache for high performance and low energy consumption. In International Symposium on Low Power Electronics and Design (ISLPED'99), 1999.

[66] K. Itoh, K. Sasaki, and Y. Nakagome. Trends in low-power RAM circuit technologies. In Proceedings of the IEEE, volume 83, number 4, 1995.

[67] J. Jeong and M. Dubois. Cost-sensitive cache replacement algorithms. In 9th International Symposium on High-Performance Computer Architecture (HPCA'03), 2003.

[68] T. Jones, M.F.P. O'Boyle, J. Abella, and A. Gonzalez. Software assisted issue queue power reduction. In 11th International Symposium on High-Performance Computer Architecture (HPCA'05), 2005.

[69] T. Juan, T. Lang, and J.J. Navarro. The difference-bit cache. In 23rd International Symposium on Computer Architecture (ISCA'96), 1996.

[70] T. Karkhanis, J.E. Smith, and P. Bose. Saving energy with just in time instruction delivery. In International Symposium on Low Power Electronics and Design (ISLPED'02), 2002.

[71] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: Exploiting generational behavior to reduce cache leakage power. In 28th International Symposium on Computer Architecture (ISCA'01), 2001.

[72] R.E. Kessler, R. Jooss, A. Lebeck, and M.D. Hill. Inexpensive implementations of set-associativity. In 16th International Symposium on Computer Architecture (ISCA'89), 1989.

[73] M. Kharbutli, K. Irwin, Y. Solihin, and J. Lee. Using prime numbers for cache indexing to eliminate conflict misses. In 10th International Symposium on High-Performance Computer Architecture (HPCA'04), 2004.


[74] C. Kim, D. Burger, and S.W. Keckler. An adaptive non-uniform cache structure for wire-delay dominated on-chip caches. In 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'02), 2002.

[75] N.S. Kim, K. Flautner, D. Blaauw, and T. Mudge. Drowsy instruction caches: Leakage power reduction using dynamic voltage scaling and cache sub-bank prediction. In 35th International Symposium on Microarchitecture (MICRO'02), 2002.

[76] N.S. Kim, K. Flautner, D. Blaauw, and T. Mudge. Single-Vdd and single-Vt super-drowsy techniques for low-leakage high-performance instruction caches. In International Symposium on Low Power Electronics and Design (ISLPED'04), 2004.

[77] J. Kin, M. Gupta, and W.H. Mangione-Smith. The filter cache: An energy efficient memory structure. In 30th International Symposium on Microarchitecture (MICRO'97), 1997.

[78] T. Kuroda, K. Suzuki, S. Mita, T. Fujita, F. Yamane, F. Sano, A. Chiba, Y. Watanabe, K. Matsuda, T. Maeda, T. Sakurai, and T. Furuyama. Variable supply-voltage scheme for low-power high-speed CMOS digital design. In IEEE Journal of Solid-State Circuits (JSSC), volume 33, number 3, 1998.

[79] A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg. A large, fast instruction window for tolerating cache misses. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[80] L. Li, V. Degalahal, N. Vijaykrishnan, M. Kandemir, and M.J. Irwin. Soft error and energy consumption interactions: A data cache perspective. In International Symposium on Low Power Electronics and Design (ISLPED'04), 2004.

[81] L. Li, I. Kadayif, Y-F. Tsai, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, and A. Sivasubramaniam. Leakage energy management in cache hierarchies. In 11th International Conference on Parallel Architectures and Compilation Techniques (PACT'02), 2002.

[82] G. Memik, G. Reinman, and W.H. Mangione-Smith. Just say no: Benefits of early cache miss determination. In 9th International Symposium on High-Performance Computer Architecture (HPCA'03), 2003.

[83] P. Michaud and A. Seznec. Data-flow prescheduling for large instruction windows in out-of-order processors. In 7th International Symposium on High-Performance Computer Architecture (HPCA'01), 2001.


[84] T. Moreshet and R.I. Bahar. Complexity-effective issue queue design under load-hit speculation. In Workshop on Complexity Effective Design (WCED'02), held in conjunction with ISCA'02, 2002.

[85] A. Moshovos, S. Breach, T.N. Vijaykumar, and G.S. Sohi. Dynamic speculation and synchronization of data dependences. In 24th International Symposium on Computer Architecture (ISCA'97), 1997.

[86] R. Nagarajan, K. Sankaralingam, D. Burger, and S.W. Keckler. A design space evaluation of grid processor architectures. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.

[87] D. Nicolaescu, A. Veidenbaum, and A. Nicolau. Reducing data cache energy consumption via cached load/store queue. In 9th International Symposium on Low Power Electronics and Design (ISLPED'03), 2003.

[88] S. Onder and R. Gupta. Memory disambiguation in the presence of out-of-order store issuing. In 32nd International Symposium on Microarchitecture (MICRO'99), 1999.

[89] S. Onder and R. Gupta. Superscalar execution with dynamic data forwarding. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.

[90] S. Palacharla, N.P. Jouppi, and J.E. Smith. Complexity-effective superscalar processors. In 24th International Symposium on Computer Architecture (ISCA'97), 1997.

[91] J.M. Parcerisa and A. Gonzalez. Reducing wire delay penalty through value prediction. In 33rd International Symposium on Microarchitecture (MICRO'00), 2000.

[92] J.M. Parcerisa, J. Sahuquillo, A. Gonzalez, and J. Duato. Efficient interconnects for clustered microarchitectures. In 11th International Conference on Parallel Architectures and Compilation Techniques (PACT'02), 2002.

[93] I. Park, C.L. Ooi, and T.N. Vijaykumar. Reducing design complexity of the load/store queue. In 36th International Symposium on Microarchitecture (MICRO'03), 2003.

[94] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing power requirements of instruction scheduling logic through dynamic allocation of multiple datapath resources. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.

[95] M. Powell, A. Agarwal, T.N. Vijaykumar, B. Falsafi, and K. Roy. Reducing set-associative cache energy via way-prediction and selective direct-mapping. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.


[96] M. Powell, S.H. Yang, B. Falsafi, K. Roy, and T.N. Vijaykumar. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In International Symposium on Low Power Electronics and Design (ISLPED'00), 2000.

[97] S.E. Raasch, N.L. Binkert, and S.K. Reinhardt. A scalable instruction queue design using dependence chains. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[98] R. Rakvic, B. Black, D. Limaye, and J.P. Shen. Non-vital loads. In 8th International Symposium on High-Performance Computer Architecture (HPCA'02), 2002.

[99] N. Ranganathan and M. Franklin. An empirical study of decentralized ILP execution models. In 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'98), 1998.

[100] E. Rotenberg, Q. Jacobson, Y. Sazeides, and J.E. Smith. Trace processors. In 30th International Symposium on Microarchitecture (MICRO'97), 1997.

[101] A. Roth. A high bandwidth low latency load/store unit for single and multi-threaded processors. Technical Report MS-CIS-04-09, University of Pennsylvania, 2004.

[102] A. Roth. Store vulnerability window (SVW): Re-execution filtering for enhanced load/store optimization. Technical Report MS-CIS-04-29, University of Pennsylvania, 2004.

[103] T. Sakurai, H. Kawaguchi, and T. Kuroda. Low-power CMOS design through Vth control and low-swing circuits. In International Symposium on Low Power Electronics and Design (ISLPED'97), 1997.

[104] T. Sakurai and A.R. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. In IEEE Journal of Solid-State Circuits (JSSC), volume 25, number 2, 1990.

[105] J. Sanchez and A. Gonzalez. A locality sensitive multi-module cache with explicit management. In 13th International Conference on Supercomputing (ICS'99), 1999.

[106] Sandpile. http://www.sandpile.org.

[107] M. Schlansker and V. Kathail. Critical path reduction for scalar programs. In 28th International Symposium on Microarchitecture (MICRO'95), 1995.

[108] S. Sethumadhavan, R. Desikan, D. Burger, C.R. Moore, and S.W. Keckler. Scalable hardware memory disambiguation for high ILP processors. In 36th International Symposium on Microarchitecture (MICRO'03), 2003.


[109] A. Seznec. DASC cache. In 1st International Symposium on High-Performance Computer Architecture (HPCA'95), 1995.

[110] A. Seznec, E. Toullec, and O. Rochecouste. Register write specialization register read specialization: A path to complexity-effective wide-issue superscalar processors. In 35th International Symposium on Microarchitecture (MICRO'02), 2002.

[111] P. Shivakumar and N.P. Jouppi. CACTI 3.0: An integrated cache timing, power and area model. Technical Report 2001/2, WRL, Palo Alto, CA (USA), 2001.

[112] Semiconductor Industry Association (SIA). International Technology Roadmap for Semiconductors 2001. http://public.itrs.net/files/2001ITRS/, 2001.

[113] B. Sinharoy. POWER5 architecture and systems. Keynote at the IEEE 10th International Symposium on High-Performance Computer Architecture (HPCA'04), 2004.

[114] G. Sohi, S. Breach, and T.N. Vijaykumar. Multiscalar processors. In 22nd International Symposium on Computer Architecture (ISCA'95), 1995.

[115] SPEC 2000. http://www.specbench.org/osg/cpu2000/, 2000.

[116] E. Sprangle. Personal communication.

[117] E. Sprangle and D. Carmean. Increasing processor performance by implementing deeper pipelines. In 29th International Symposium on Computer Architecture (ISCA'02), 2002.

[118] S.T. Srinivasan, R.D. Ju, A.R. Lebeck, and C. Wilkerson. Locality vs. criticality. In 28th International Symposium on Computer Architecture (ISCA'01), 2001.

[119] C.L. Su and A.M. Despain. Cache design trade-offs for power and performance optimization: A case study. In International Symposium on Low Power Electronics and Design (ISLPED'95), 1995.

[120] SYNOPSYS. Managing power in ultra deep submicron ASIC/IC design. Synopsys white paper, 2002.

[121] N. Topham, A. Gonzalez, and J. Gonzalez. The design and performance of a conflict-avoiding cache. In 30th International Symposium on Microarchitecture (MICRO'97), 1997.

[122] O.S. Unsal, I. Koren, C.M. Krishna, and C.A. Moritz. The minimax cache: An energy-efficient framework for media processors. In 8th International Symposium on High-Performance Computer Architecture (HPCA'02), 2002.


[123] S. Vajapeyam and T. Mitra. Improving superscalar instruction dispatch and issue by exploiting dynamic code sequences. In 24th International Symposium on Computer Architecture (ISCA'97), 1997.

[124] L. Villa, M. Zhang, and K. Asanovic. Dynamic zero compression for cache energy reduction. In 33rd International Symposium on Microarchitecture (MICRO'00), 2000.

[125] Virage Logic Corporation. Power reduction techniques for ultra-low-power solutions. Technical overview, 2004.

[126] Z. Wang, K.S. McKinley, A.L. Rosenberg, and C.C. Weems. Using the compiler to improve cache replacement decisions. In 11th International Conference on Parallel Architectures and Compilation Techniques (PACT'02), 2002.

[127] K. Wilcox and S. Manne. Alpha processors: A history of power issues and a look to the future. In Cool-Chips Tutorial, held in conjunction with MICRO'99, 1999.

[128] W.L. Winston. Operations Research: Applications and Algorithms. Duxbury Press, second edition, 1991.

[129] E. Witchel, S. Larsen, C.S. Ananian, and K. Asanovic. Direct addressed caches for reduced power consumption. In 34th International Symposium on Microarchitecture (MICRO'01), 2001.

[130] W.A. Wong and J-L. Baer. Modified LRU policies for improving second-level cache behavior. In 6th International Symposium on High-Performance Computer Architecture (HPCA'00), 2000.

[131] J. Yang and R. Gupta. Energy efficient frequent value data cache design. In 35th International Symposium on Microarchitecture (MICRO'02), 2002.

[132] S.H. Yang, M.D. Powell, B. Falsafi, and T.N. Vijaykumar. Exploiting choice in resizable cache design to optimize deep-submicron processor energy-delay. In 8th International Symposium on High-Performance Computer Architecture (HPCA'02), 2002.

[133] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan. Speculation techniques for improving load-related instruction scheduling. In 26th International Symposium on Computer Architecture (ISCA'99), 1999.

[134] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. In 30th International Symposium on Computer Architecture (ISCA'03), 2003.

[135] C. Zhang, X. Zhang, and Y. Yan. Two fast and high-associativity cache schemes. In IEEE Micro, volume 17, issue 5, 1997.


[136] W. Zhang, J.S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M.J. Irwin. Compiler-directed instruction cache leakage optimization. In 35th International Symposium on Microarchitecture (MICRO'02), 2002.

[137] H. Zhou, M. Toburen, E. Rotenberg, and T. Conte. Adaptive mode control: A static-power-efficient cache design. In 10th International Conference on Parallel Architectures and Compilation Techniques (PACT'01), 2001.

[138] V. Zyuban. Inherently lower power high performance superscalar architectures, 2000.