
THESIS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Accurate Leakage-Conscious

Architecture-Level Power Estimation

for SRAM-based Memory Structures

MINH QUANG DO

Division of Computer Engineering

Department of Computer Science and Engineering

CHALMERS UNIVERSITY OF TECHNOLOGY

Göteborg, Sweden 2007


Accurate Leakage-Conscious Architecture-Level Power Estimation for SRAM-based Memory Structures

Minh Quang Do

ISBN 978-91-7291-968-6

Copyright © Minh Quang Do, 2007.

Doktorsavhandlingar vid Chalmers tekniska högskola

Ny serie Nr 2649

ISSN 0346-718X

Technical report 31D

Department of Computer Science and Engineering

Embedded and Networked Processor Research Group

Division of Computer Engineering

Chalmers University of Technology

SE-412 96 GÖTEBORG, Sweden

Phone: +46 (0)31-772 10 00

Author e-mail: [email protected]

Printed by Chalmers Reproservice

GÖTEBORG, Sweden 2007


Accurate Leakage-Conscious

Architecture-Level Power Estimation

for SRAM-based Memory Structures

Minh Quang Do

Division of Computer Engineering, Chalmers University of Technology

ABSTRACT

Following Moore’s Law, technology scaling will continue to provide integration capacity of billions of transistors for the IC industry. As transistors keep shrinking in size, leakage power dissipation increases dramatically and is gradually becoming a first-class design constraint. To provide higher performance at lower power and energy for microarchitectures, on-chip caches are growing in size and thus become a major contributor to the total leakage power dissipation in next-generation processors. In these circumstances, accurate leakage power estimation is clearly needed to allow designers to strike a balance between dynamic power and leakage power, and between total power and delay, in on-chip caches.

This dissertation presents a modular, hybrid power modeling methodology capable of accurately capturing both dynamic and leakage power mechanisms for on-chip caches and SRAM arrays. The methodology successfully combines the most valuable advantage of circuit-level power estimation – high accuracy – with the flexibility of higher-level power estimation, while allowing for short component characterization and estimation times. The methodology offers high-level, parameterizable, yet accurate power dissipation estimation models that consist of analytical equations for dynamic power and pre-characterized leakage power values stored in tables.

In addition, a modeling methodology to capture the dependence of leakage power on temperature variation, on supply-voltage scaling, and on the selection of process corners is also presented. This methodology provides an essential extension to the proposed power models.

Keywords: VLSI, CMOS, Deep Submicron, Power Estimation, Cache Power Modeling, SRAM Power Modeling, Power-Performance Estimation Tool, DSP Architecture


Preface

This Ph.D. thesis presents the results of my research work conducted during the period January 2002 to May 2007. It is based on the following seven papers:

⊲ Paper 1: M. Q. Do, P. Larsson-Edefors and L. Bengtsson, “Table-based Total Power Consumption Estimation of Memory Arrays for Architects,” in Proceedings of the 14th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), Isle of Santorini, Greece, Sept. 15–17, 2004, pp. 869–878.

⊲ Paper 2: M. Q. Do, M. Draždžiulis, P. Larsson-Edefors and L. Bengtsson, “Parameterizable Architecture-level SRAM Power Model Using Circuit-simulation Backend for Leakage Calibration,” in Proceedings of the International Symposium on Quality Electronic Design (ISQED), San Jose, CA, USA, March 27–29, 2006, pp. 557–563.

⊲ Paper 3: M. Q. Do, M. Draždžiulis, P. Larsson-Edefors and L. Bengtsson, “Leakage-Conscious Architecture-Level Power Estimation for Partitioned and Power-Gated SRAM Arrays,” in Proceedings of the International Symposium on Quality Electronic Design (ISQED), San Jose, CA, USA, March 26–28, 2007, pp. 185–191 (best-paper-award nominee).

⊲ Paper 4: M. Q. Do, P. Larsson-Edefors and M. Draždžiulis, “Capturing Process-Voltage-Temperature (PVT) Variations in Architectural Static Power Modeling for SRAM Arrays,” Technical Report No. 2007-06, Department of Computer Science & Engineering, School of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, May 2007.

⊲ Paper 5: M. Q. Do, P. Larsson-Edefors and L. Bengtsson, “Leakage-Conscious Architecture-Level Power Estimation Models for On-Chip Caches,” manuscript.

⊲ Paper 6: M. Q. Do, P. Larsson-Edefors and M. Draždžiulis, “Current Probing Methodology for Static Power Extraction in Sub-90nm CMOS Circuits,” Technical Report No. 2007-07, Department of Computer Science & Engineering, School of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, May 2007.

⊲ Paper 7: M. Q. Do, P. Larsson-Edefors and M. Draždžiulis, “High-Accuracy Architecture-Level Power Estimation for Partitioned SRAM Arrays in a 65-nm CMOS BPTM Process,” to appear in Proceedings of the 10th Euromicro Conference on Digital System Design, Architecture, Methods and Tools (DSD 2007), Lübeck, Germany, August 27–31, 2007 (invited paper).

The following related papers are not included in this thesis:

⊲ Paper 8: M. Q. Do, L. Bengtsson and P. Larsson-Edefors, “DSP-PP: A Simulator/Estimator of Power Consumption and Performance for Parallel DSP Architectures,” in Proceedings of the 21st Multiconference in Applied Informatics – Parallel and Distributed Computing and Networks Symposium (PDCN), Innsbruck, Austria, Feb. 10–13, 2003, pp. 767–772.

⊲ Paper 9: M. Q. Do, L. Bengtsson and P. Larsson-Edefors, “Models for Power Consumption Estimation in the DSP-PP Simulator,” in Proceedings of the 1st International Signal Processing Conference (ISPC), Dallas, Texas, USA, March 31–April 3, 2003.

⊲ Paper 10: M. Q. Do and L. Bengtsson, “Analytical Models for Power Consumption Estimation in the DSP-PP Simulator: Problems and Solutions,” Technical Report No. 03-22, Department of Computer Engineering, School of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, August 2003.

⊲ Paper 11: M. Q. Do, P. Larsson-Edefors and L. Bengtsson, “Table-based Total Power Consumption Estimation Approach for Architects,” in Proceedings of the Swedish System-on-Chip Conference, Båstad, Sweden, April 13–14, 2004.

⊲ Paper 12: M. Q. Do, P. Larsson-Edefors and L. Bengtsson, “Towards a Power and Performance Simulation Framework for Parallel DSP Architecture,” in Poster abstracts of the 1st International Summer School on Advanced Computer Architecture and Compilation for Embedded Systems, L’Aquila, Italy, July 24–30, 2005, pp. 161–164.

⊲ Paper 13: M. Q. Do, M. Draždžiulis and P. Larsson-Edefors, “Architecture-Level Power Estimation and Scaling Trends for SRAM Arrays,” in Proceedings of the Swedish System-on-Chip Conference, Kolmården, Sweden, May 4–5, 2006.


Acknowledgments

My life has been characterized by “adventures”, both long and short. This dissertation marks the end of an unforgettable adventure that started 5.5 years ago when I was admitted to the Doctoral Program at the Department of Computer Science and Engineering, Chalmers University of Technology. Although full of toil and sweat, it has never been a lonely journey for me, since I was blessed to have wonderful people as my companions.

First and foremost, I owe my greatest gratitude to two of the wisest men I am fortunate to work with: my research supervisor, Associate Professor Lars Bengtsson, and my research examiner, Professor Per Larsson-Edefors.

I am very grateful to Lars Bengtsson for accepting me as his PhD student, letting me do the research my way, and constantly backing me, encouraging me and giving me invaluable advice, especially in the first half of my PhD studies.

I am also indebted to Per Larsson-Edefors for his enthusiasm, constant support, encouragement and excellent professional advice. I have always been impressed and inspired by his sense of professionalism. For me, Per Larsson-Edefors is not only a research examiner but also an advisor who can work intensively together with his students until very late at night, who always has new ideas to add and knows several ways to realize them. And who is still in love with “hard rock” music at his age today ...

I would like to thank Docent Lars “J” Svensson for being a member of my advising committee and sharing his research experience and profound competence with me, especially on know-how and know-where questions and research issues.

I want to thank Dr. Daniel Eckerbert for his comments and critical research discussions on power consumption estimation methodologies and their classification, as well as for his kind help with the Hspice and Cadence design tools.

Many thanks go to Firas Milh, a master’s thesis worker at the Department of Computer Science and Engineering, for implementing the DSP-PP simulator used in this thesis.

I have also met many other interesting people along the way who deserve special thanks from me:

⊲ All recent and former members of the VLSI Research Group: Daniel Eckerbert, Henrik Eriksson, Mindaugas Draždžiulis, Dainius Ciuplys, Daniel Andersson, Magnus Själander; thank you, my friends, for accepting me as an “unofficial” group member, and for sharing not only your research experience but also your interests in music, fashion, games, entertainment, etc., thus making my life here not just work and work!

⊲ Thank you, Martin Thuresson, for your kind help and critical discussions on research-related topics like C programming, LaTeX, Emacs, UNIX, etc.! I have enjoyed so much your warm friendship and hospitality, and your sharing with me your knowledge of Swedish culture, language, history and society.

⊲ A special thanks goes to Mindaugas Draždžiulis and Egle Reimontaite for their warm friendship to me and my family. An extra thanks is given to Mindaugas for his help and cooperation in doing research.

⊲ An extra thanks goes to Magnus Själander and Martin Thuresson for their help in proof-reading this dissertation.

⊲ A special thanks goes to Jochen Hollmann, Djordje Jeremic, Xiao Ming, Wolfgang John, Raul Barbosa, M. Waliullah, Mafijul Islam, former Ph.D. students (Dr. Fredrik Warg, Dr. Zihuai Lin, Dr. Dhammika Bokolamulla, Dr. Håkan Forsberg, Dr. Kristina Forsberg, Dr. Jim Nilsson, Lic. Eng. Peter “biff”) and other Ph.D. students at the Department of Computer Science and Engineering for their colleagueship, which creates a friendly and inspiring atmosphere at our department.

⊲ I would like to thank Per Waborg for guiding me through university bureaucracy, for being a good, “unbeatable” table-tennis opponent, and for sharing his life experience with me.

⊲ Many thanks to the rest of the colleagues, administrative staff and technical support at the Department of Computer Science and Engineering for creating a nice working environment and providing me assistance in many possible ways.

⊲ I send a lot of thanks to my Vietnamese friends at CTH for their friendship and help, without which my study here would have been much more difficult and less enjoyable.

Last, but most deserved, I am grateful to my parents, who have always trusted and encouraged me to reach the heights I wish; to my brothers and sister for their understanding and encouragement; and especially to my beloved wife, Tran Thi Thu Ha, and my little angel, Ngoc “Candy”, for their incredible patience, unflagging support, great encouragement and endless love, particularly at very difficult moments in my adventures. This work is, therefore, dedicated to them.

Minh Quang Do

Göteborg, May 2007


Contents

Abstract
Preface
Acknowledgments

I Introduction

1 Introduction
  1.1 Technology Scaling and its Induced Problems
  1.2 On-Chip Cache – Trend of Development
  1.3 On-Chip Cache – Leakage Power Estimation
  1.4 Dissertation Objective and Scope
    1.4.1 Objective
    1.4.2 Scope
  1.5 Dissertation Contributions
  1.6 Dissertation Overview
  Bibliography

II Background

2 On-Chip SRAM Cache Architecture
  2.1 Caches for DSP and Embedded Systems
    2.1.1 Basic DSP Architectures
    2.1.2 Cache Architectures in DSP and Embedded Systems
  2.2 Caches for GPP Systems
    2.2.1 Cache System Architecture
  2.3 Cache on GPPs and DSPs: Differences
  2.4 Cache Organization
    2.4.1 Basic Cache Organization
    2.4.2 Memory Partitioning
  Bibliography

3 Power Dissipation in CMOS
  3.1 Mechanisms of Power Dissipation
    3.1.1 Dynamic Power
    3.1.2 Leakage Power
  3.2 Trend of Development and Emerging Issues
  3.3 Leakage Power Reduction Techniques
    3.3.1 Power Cut-off Techniques
    3.3.2 Leakage-Reduction Techniques for SRAM-based Caches
  Bibliography

4 Cache Power Modeling – Tool Perspective
  4.1 A Survey of Existing Power-Performance Tools
  4.2 High-Level Power Estimation Tools for Caches
  4.3 Power Dissipation Estimation Models
    4.3.1 High-level Power Estimation Methodology
    4.3.2 Analytical Models
    4.3.3 Table-based and Equation-based Models
  Bibliography

III Power Modeling for SRAM-based Structures

5 Modular Approach to Power Modeling
  5.1 Analytical Power Modeling Approach & Problems
  5.2 The Proposed Modular Modeling Approach
  5.3 Probing Methodology for Leakage
  5.4 Power Models for On-Chip Caches
    5.4.1 Power Models for Partitioned Data SRAM Arrays
    5.4.2 Power Models for Unpartitioned Data SRAM Arrays
    5.4.3 Power Models for SRAM-based Tag Arrays
  5.5 Validation
    5.5.1 Validation Methodology
    5.5.2 Validation of Power Models for Data SRAM Arrays
    5.5.3 Validation of Power Models for SRAM-based Tag Arrays
  5.6 Thermal and Variability Issues
    5.6.1 Modeling the Dependence of Leakage on Temperature
    5.6.2 Modeling Leakage with Variation in Supply Voltage
    5.6.3 Modeling the Dependence of Leakage on Process Corner
  Bibliography

6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work

IV Appendix

A DSP-PP Simulator
  A.1 Characteristics of DSP Architectures
  A.2 DSP-PP
    A.2.1 Features of the DSP-PP
    A.2.2 Description of the DSP-PP Simulator (Version 2.0)
  Bibliography

List of Figures

1.1 The original Moore’s Law graph. (Source: Intel Museum [2])
1.2 Moore’s Law as illustrated by the transistor count per IC for Intel microprocessors from the 4004 to the Itanium 2 (9 MBytes cache).
1.3 Dynamic and leakage power trend as predicted by ITRS (from [7])
1.4 The die photo of a) Intel’s Madison Processor (374 mm²). b) Intel’s Pentium M Processor (84 mm²). (Source: Intel Pressroom [2])
1.5 Total leakage power as a function of min L and tox for a 6T-SRAM cell (BPTM 32-nm @ Vdd = 1.1 V)
2.1 Basic DSP Architectures: Harvard architecture
2.2 Basic DSP Architectures: a) with a MUX, and b) with a MUX and a small instruction cache
2.3 The most frequently used DSP Architecture
2.4 A VLIW-DSP Architecture with Multiple Datapaths
2.5 A Two-level Cache Architecture (TI TMS320C6211/TI C6x DSP [5])
2.6 A typical memory hierarchy in GPPs
2.7 A two-level cache architecture used in GPPs
2.8 A three-level cache architecture used in GPPs
2.9 Block diagram of the Intel Itanium 2 processor [14]
2.10 Basic organization of a Direct-mapped cache [7]
2.11 Basic organization of a typical SRAM-based cache
3.1 Leakage mechanisms in an off-state NMOS transistor with VG = VS = 0 and VD = Vdd
3.2 EOT and gate leakage density scaling for extended planar bulk CMOS devices (ITRS 2006)
3.3 Scaling in subthreshold leakage for extended planar bulk CMOS devices (ITRS 2006)
3.4 Gate length scaling for extended planar bulk CMOS devices (ITRS 2006)
3.5 Leakage current paths in the SCCMOS technique (from [12])
3.6 Leakage current paths in the ZSCCMOS technique (from [12])
3.7 Leakage current paths in the GSCMOS technique (from [12])
5.1 Subthreshold leakage power at different temperatures for an NMOS transistor (commercial 130-nm process)
5.2 Power modeling methodology: a) Component Characterization Phase, and b) Power Estimation Phase
5.3 Current measurement for MOS transistors used in the Hspice simulator
5.4 Block diagram of a partitioned SRAM array using DWL and DBL techniques
5.5 Organization of a sub-array
5.6 (a) Characterization of a 6T-SRAM cell, (b) Hspice configuration for VLBL estimation
5.7 Subthreshold (green, solid) and gate leakage (red, dotted) currents in a partitioned 6T-SRAM cell
5.8 Characterization of (a) a sense amplifier, (b) a write circuit
5.9 Architecture of an 8-256 row decoder
5.10 The structure of a typical Ntag-bits NOR-based comparator
5.11 Total power dissipation of 8-KB data arrays [blue/grey — 8A, brown/black — 8B, yellow/white — 8C]
5.12 Total power dissipation of 2-KB data arrays [blue/grey — 2A, brown/black — 2B]
5.13 Accuracy in estimating: a) dynamic power, b) leakage power, c) total power for 8-KB data arrays [blue/grey — 8A, brown/black — 8B, yellow/white — 8C]
5.14 Accuracy in estimating: a) dynamic power, b) leakage power, c) total power for 2-KB data arrays [blue/grey — 2A, brown/black — 2B]
5.15 The proportion of dynamic (in brown/black) and leakage (in blue/grey) power in the 8A, 8B and 8C arrays.
5.16 The proportion of dynamic (in yellow) and leakage (in orange) power in the 2A array. The proportion of dynamic (in blue) and leakage (in brown) power in the 2B array.
5.17 Accuracy in estimating: a) dynamic power, b) leakage power for a 2-KB SRAM-based tag array
5.18 Total power dissipation of a 2-KB partitioned SRAM-based tag array (blue/grey) and a 2-KB partitioned data array (brown/black)
5.19 Subthreshold leakage power as a function of temperature for a 6T-SRAM cell (commercial 130-nm)
5.20 Gate and subthreshold leakage power as functions of Vdd for a 6T-SRAM cell (BPTM 65-nm [20])
5.21 The subthreshold leakage power’s dependence on temperature for a 6T-SRAM cell (commercial 130-nm with process corners: SS, TT, FF)
A.1 Block Diagram of the DSP Power Performance Simulator
A.2 Interconnection of components inside a SP of the extended ManArray architecture [3]
A.3 Interconnection of components inside a PE of the extended ManArray architecture [3]
A.4 The GUI of our implemented DSP-PP simulator

List of Tables

4.1 The equations for capacitance of critical nodes
5.1 Organization parameters for partitioned SRAM arrays
5.2 Organization parameters for partitioned SRAM-based tag arrays

Part I

Introduction


1 Introduction

1.1 Technology Scaling and its Induced Problems

In the beginning, complementary metal-oxide-semiconductor (CMOS) technology was chosen because it dissipated much less power than earlier technologies such as transistor-transistor logic (TTL) and emitter-coupled logic (ECL). This was indeed true at the time: when not switching, MOS transistors dissipate negligible power at clock frequencies in the kHz range. However, as device switching frequency and chip integration density keep increasing, power dissipation increases dramatically. Observing the trend of CMOS device integration, Gordon Moore of Intel gave, in 1965, his most famous prediction: that the number of devices on an IC would double every 12 months (later revised to every 24 months) [1], an observation often referred to as Moore’s Law. After several revisions, this observation still largely holds; it has served, and continues to serve, as a driving force for CMOS technology, the silicon industry, and the personal computer (PC) manufacturing industry. Figure 1.1 shows the original graph drawn by Gordon Moore when he published his observation in 1965, whereas Figure 1.2 shows the development of the transistor count for Intel microprocessors from the 4004 to the Itanium 2 (the version with a 9-MByte cache) as an illustration of Moore’s Law at work in real life.

Figure 1.1: The original Moore’s Law graph. (Source: Intel Museum [2])

Along with technology scaling, many new design challenges have emerged; performance and power dissipation are two major issues of computer system design. The latter has been recognized by the processor design community as a first-class architectural design constraint, not only for portable computers and mobile communication devices, but also for high-end systems, e.g. superscalar, single-processor, multiprocessor, multi-core and high-performance embedded processor systems [3].

[Figure 1.2 plots the transistor count per Intel IC (in thousands, log scale) against year, 1970–2005, for the 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, Pentium II, Pentium III, Pentium 4, Itanium, Itanium 2 and Itanium 2 (9 MB cache).]

Figure 1.2: Moore’s Law as illustrated by the transistor count per IC for Intel microprocessors from the 4004 to the Itanium 2 (9 MBytes cache).
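As a rough sanity check of the 24-month doubling rule against the trend in Figure 1.2, one can project the 4004’s device count forward in time. The sketch below uses approximate public transistor counts as illustrative assumptions; they are not data taken from this thesis.

```python
# Sanity-check sketch of the 24-month doubling rule. The transistor
# count for the Intel 4004 is an approximate public figure, used here
# purely as an illustrative assumption.
def projected_count(n0: float, years: float, doubling_years: float = 2.0) -> float:
    """Project a device count forward under exponential doubling."""
    return n0 * 2.0 ** (years / doubling_years)

n_4004_1971 = 2_300  # Intel 4004 (1971), approximate transistor count
projection_2004 = projected_count(n_4004_1971, years=2004 - 1971)

# Doubling every 24 months for 33 years predicts a count on the order
# of 10^8 transistors, the same order of magnitude reached by the
# Itanium-class processors at the right edge of Figure 1.2.
print(f"Projected 2004 count: {projection_2004:.2e}")
```

The projection lands within an order of magnitude of the actual high-end counts of the mid-2000s, which is about as much as a first-order doubling rule can promise.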

While low power is important, achieving the lowest-power solution alone is obviously not the primary goal. First and foremost, the system design must meet the performance and feature requirements of the application. Success ultimately lies in the ability to strike the optimum balance between performance, power and cost. In order to achieve these design goals, there is a need to develop power-performance estimation tools that can aid designers in modeling entire systems as well as every system component, and in performing power-performance evaluation and tradeoff analysis.

Moreover, as a result of CMOS technology scaling, leakage power dissipation has become a significant portion of the total power consumption in deep-submicron VLSI chips [4]. The International Technology Roadmap for Semiconductors (ITRS [5]) predicts that in a few years the total leakage power of a chip may exceed the total dynamic power, and the projected increase in subthreshold leakage (Figure 1.3) shows that it will exceed total dynamic power dissipation as technology drops below the 65-nm feature size [6]. As leakage continues to increase in importance, accurate leakage power estimation is needed to allow designers to make good design trade-offs. This is especially true at higher design levels, which are associated with a higher degree of design freedom, potentially leading to higher power savings.

Figure 1.3: Dynamic and leakage power trend as predicted by ITRS (from [7])


1.2 On-Chip Cache – Trend of Development

Although leakage power dissipation is an issue for all processor circuit components, it is a particularly important problem in on-chip caches, which have large sections that are idle for relatively long periods of time. This is due to three reasons [8]: (i) sub-threshold leakage current increases due to technology scaling; (ii) leakage energy increases with the effective number of transistors in the circuits; and (iii) a large transistor budget is allocated to on-chip caches in current processors.

Figure 1.4: The die photo of a) Intel’s Madison Processor (374 mm²). b) Intel’s Pentium M Processor (84 mm²). (Source: Intel Pressroom [2])

In recent years, in order to minimize latency and improve memory bandwidth, larger L1, L2, and even L3 caches are being integrated on die, thanks to the advanced integration capabilities offered by recent submicron, DSM and VDSM CMOS processes. For example, the Alpha 21464 processor has 128 KBytes of L1 and 1.5 MBytes of L2 cache, Intel’s Madison processor has 1 MByte of L2 and 6 MBytes of L3, Intel’s Pentium M (Centrino) processor has 2 MBytes of L2, and the Dual-Core Multi-Threaded Xeon processor has 2 MBytes of L2 and 16 MBytes of L3 on-chip cache. Figure 1.4 shows die photos of Intel’s Madison and Pentium M (Centrino) processors. In these processors, on-chip caches occupy more than 50% of the die area. This trend eventually makes on-chip caches one of the major contributors to the total leakage power dissipation of microprocessors.

1.3 On-Chip Cache – Leakage Power Estimation

As leakage continues to increase in importance, accurate leakage power estimation is needed to allow designers to strike a balance between dynamic power and leakage power, and between total power and delay, in on-chip caches.

Since all leakage mechanisms are closely related to the physical behavior of MOS transistors, the type of circuitry involved and the process technology parameters, a straightforward way to model them is to use equations and sets of parameters that describe these complex behaviors of MOS transistors. This modeling approach is referred to as the analytical approach. The complexity of the equations determines the accuracy of the leakage power estimate. BSIM4 models leakage mechanisms using very detailed and complex equations [9]; for example, BSIM4 models the sub-threshold leakage current of a MOS transistor using the following equations (refer to [9] for more details):

I_sub = I_0 (1 − e^(−V_ds/V_th)) e^((−V_T − V_off)/(n·V_th))                    (1.1)

where,

V_T = V_TH0 + δ_NP (ΔV_T,body_effect − ΔV_T,charge_sharing − ΔV_T,DIBL
      + ΔV_T,reverse_short_channel + ΔV_T,narrow_width + ΔV_T,small_size
      − ΔV_T,pocket_implant)

I_0 = μ (W/L) V_th² sqrt(q ε_si N_DEP / (2 φ_s));    V_th = k_B T / q          (1.2)


Here, q is the electrical charge, T is the varying temperature, n is the sub-

threshold swing coefficient, kB is the Boltzmann constant, NDEP is the chan-

nel doping concentration, φs is the surface potential, ǫsi is the dielectric con-

stant of silicon, µ is the carrier mobility, Vth is the thermal voltage, Vds is the

drain-source voltage, Voff is the offset voltage, W is the width, and L is the

length. VT is the device threshold voltage defined by a very complex expres-

sion: VTH0 is the threshold voltage of a long-channel device at zero bias, and

∆VT,body_effect, ∆VT,charge_sharing , ∆VT,DIBL, ∆VT,reverse_short_channel,

∆VT,narrow_width, ∆VT,small_size, ∆VT,pocket_implant are body-effect, charge-

sharing, DIBL, reverse-short-channel, narrow-width, small-size, and pocket-

implant effects on VT respectively. δNP is defined as +1 for NMOS and −1 for

PMOS.
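As a rough numerical sketch, Eqs 1.1 - 1.2 can be evaluated directly once parameter values are chosen; every device and process value below is an illustrative assumption, not data from any particular technology:

```python
import math

q      = 1.602e-19   # elementary charge [C]
kB     = 1.381e-23   # Boltzmann constant [J/K]
eps_si = 1.04e-10    # dielectric constant of silicon [F/m]

def subthreshold_current(W, L, mu, NDEP, phi_s, VT, Voff, Vds, n, T):
    """Eq. 1.1: Isub = I0 (1 - exp(-Vds/Vth)) exp(-(VT - Voff)/(n Vth)),
    with I0 and the thermal voltage Vth taken from Eq. 1.2."""
    Vth = kB * T / q                                    # thermal voltage
    I0 = mu * (W / L) * Vth**2 * math.sqrt(q * eps_si * NDEP / (2 * phi_s))
    return I0 * (1 - math.exp(-Vds / Vth)) * math.exp(-(VT - Voff) / (n * Vth))

# Hypothetical 32-nm-class NMOS values (assumed, for illustration only):
I = subthreshold_current(W=64e-9, L=32e-9, mu=0.02, NDEP=1e24,
                         phi_s=0.9, VT=0.30, Voff=-0.08, Vds=1.1, n=1.4, T=300)
print(f"Isub ~ {I:.2e} A")
```

Lowering VT in this sketch increases Isub exponentially, which is the dependence any simplified high-level leakage model must capture.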

Eqs 1.1 - 1.2 show how complicated it is to calculate the value of sub-

threshold leakage current for a MOS transistor analytically. Thus, although

BSIM4 models offer high accuracy in estimating leakage power, they are obviously not suitable for higher-level power estimation: their complex relations and equations require the user to have deep knowledge of device models and access to detailed process parameters.

During the past decade, a fair amount of research effort has been directed

towards developing high-level power-performance tools for on-chip caches. To

avoid the complexity of an analytical approach, there have been efforts to simplify BSIM3 or BSIM4 analytical equations to a degree of complexity acceptable for use in higher-level power estimation tools. However, those simpli-

fied analytical leakage power models still suffer serious drawbacks in estimat-

ing leakage power: inaccuracy and inflexibility. One of the most widely used power estimation tools in the public domain is CACTI [10], which offers analytical

timing and energy models for partitioned caches. In its previous versions 1.0,

2.0 and 3.2, CACTI used only ideal first-order scaling for technology trends.

Further, it did not include any leakage power models. The PRACTICS tool [11]

uses analytical models to determine an optimal design for partitioned caches


by performing an exhaustive comparison of alternative memory configuration

parameters. Although PRACTICS provides more accurate estimates of inter-

connect effects in comparison to CACTI 3.2, it still does not include power

models for leakage estimation.

The recently released CACTI version (4.0 [12]) is updated with respect

to basic circuit structures, to device parameters for an improved technology

scaling, and to leakage models, in that a model based on Hotleakage [13] and

eCACTI [14] is added. However, the added model still fails to accurately ac-

count for small channel effects, for gate leakage, and for terminal voltage de-

pendencies in transistor stacks—the model error in estimating leakage power

dissipation is claimed to be 21.5% [14]. Furthermore, as the concept of a technology node, according to ITRS’05 [5], is gradually abandoned, using a typical process may yield large estimation errors for static-power-dominated memories.

If leakage power models at architectural level are to guide design trade-offs,

they need to be calibrated to one or several target processes.

Figure 1.5: Total leakage power as function of min L and tox for a 6T-SRAM cell (BPTM 32-nm @ Vdd = 1.1 V)


Sub-threshold leakage still remains the main contributor to total leakage; however, other mechanisms such as gate oxide tunneling and junction (BTBT)

leakage are of increasing significance. To predict which leakage mechanism

will dominate in the future is difficult, since there is a complex interaction of

technology and circuit development. In a recent study on power dissipation for

nanometer caches [15], a surprising trend for sub-threshold and gate leakage

power components was outlined: For example, for a 32-nm Berkeley Predictive

Technology Model (BPTM) [16] process, the sub-threshold contribution dom-

inated gate leakage by approximately 30× (Figure 6 in [15]). In Figure 1.5

the dependence of total leakage power, i.e. the sum of gate (Igleak) and sub-

threshold (Isleak) leakage power, on minimum L and tox, respectively, is plot-

ted for a 6T-SRAM cell in that particular 32-nm BPTM process [16]. Assuming

the default minimum L for transistors suggested in a predictive process is clearly a poor design compromise. This example also serves to show how important

it is for leakage estimation to capture not only what general technology is used,

but also which circuit design context is used. Since the relative significance of leakage mechanisms varies with design context, only leakage estimation based

on data calibrated to target libraries can be trusted.

1.4 Dissertation Objective and Scope

1.4.1 Objective

With the motivations mentioned in Sections 1.1 - 1.3 the objective of this dis-

sertation is to solve, partially or completely, the following problems:

1. To provide designers with a modeling methodology to capture accurately

all leakage mechanisms and dynamic power dissipation for on-chip caches

and SRAM arrays. The methodology needs to explore not only the most

valuable advantage of circuit-level power estimation – high accuracy,

but also the flexibility of higher-level power estimation. Moreover, the

methodology needs to be simple and generic enough so that designers


can use it to generate power models for their on-chip caches and SRAM

arrays of interest with different configurations and organizations.

2. To provide users with accurate parameterizable power estimation models

that require low complexity and less computation time in estimating both

leakage and dynamic power dissipation for on-chip caches and SRAM

arrays. These power models need to be extendible and implementable in

architecture-level power estimation tools.

3. To provide compatibility between the proposed power models and other power models implemented in existing power simulation tools, enabling those models to be updated with better and more accurate ones.

4. To outline and design an architecture-level power dissipation estimation

tool for DSP using the new cache and SRAM power models (i.e. the

“DSP-PP”).

1.4.2 Scope

In this dissertation, to limit the scope of the research, CMOS circuit technology

has been assumed, and all on-chip caches are assumed to be implemented using

deep-submicron (DSM) and very-deep-submicron (VDSM) CMOS processes.

Depending on where they are located in the memory hierarchy, on-chip caches have different organizations and organization parameters (e.g. cache size, block size, word size, associativity). For example, an L1 cache often has a small block size, and its word size is usually equal to the width of the data bus, whereas L2/L3

caches tend to have larger block and data-word size, and smaller associativity.

The organization of on-chip caches also depends on what type of applications

and systems they are used for, i.e. the application domain. Caches for DSP and

embedded systems are organized differently in comparison to caches for GPP

microprocessor systems. So, there are more than a few options to choose from when

selecting the cache organization for use in this dissertation. To serve as a basic


platform for modeling power dissipation, direct-mapped L1 caches with small size (i.e. 2 KBytes - 8 KBytes), small block size (i.e. 4) and small data-word length (i.e. 4 Bytes) have been selected. The reason for this selection is that, if the power models for the selected cache are successfully obtained, there would not be any fundamental problems in extending these power models to L1/L2 caches with higher associativity, bigger size and bigger data-word length; in other words, the obtained power models are fully extendible.
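For concreteness, the address breakdown implied by such a small direct-mapped configuration can be sketched as below; interpreting the block size of 4 as 4 words of 4 bytes (16 bytes per block) is an assumption made for this illustration:

```python
import math

def direct_mapped_geometry(cache_bytes, block_bytes, addr_bits=32):
    """Split a byte address into tag / index / block-offset fields
    for a direct-mapped cache."""
    n_blocks    = cache_bytes // block_bytes
    offset_bits = int(math.log2(block_bytes))
    index_bits  = int(math.log2(n_blocks))
    tag_bits    = addr_bits - index_bits - offset_bits
    return {"blocks": n_blocks, "offset_bits": offset_bits,
            "index_bits": index_bits, "tag_bits": tag_bits}

# 2-KB direct-mapped cache, 16-byte blocks (assumed 4 words x 4 bytes):
print(direct_mapped_geometry(2 * 1024, 16))
```

Under these assumptions a 2-KB cache has 128 blocks, so 7 index bits, 4 offset bits and 21 tag bits out of a 32-bit address.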

To further limit the scope of the research, on-chip SRAM-based caches con-

sisting of tag 6T-SRAM-based arrays and data 6T-SRAM-based arrays with

regular structures have been assumed. Both tag and data arrays are physically

partitioned into sub-arrays using divided-bit-line (DBL) and divided-word-line

(DWL) techniques within a memory bank. Partitioning a cache into banks is

done in a higher-level than the physical partitioning, and it is normally applied

for highly-associative caches. According to the discussion given in [12], in

practice most users expect multiported multi-bank caches to first synthesize de-

pendent ports from independent banks, and only multi-port the banks themselves

if the required number of ports exceeds the number of banks. Thus, for simplic-

ity, memory banks with single read/write ports have been assumed to be used

in this dissertation.
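A minimal sketch of the DWL/DBL physical partitioning described above, using a hypothetical 128 x 128 bit array and an assumed 2 x 2 split:

```python
def partition(rows, cols, dwl_segments, dbl_segments):
    """Physically partition an SRAM array with divided word lines (DWL,
    splitting columns) and divided bit lines (DBL, splitting rows)
    into identical sub-arrays."""
    assert cols % dwl_segments == 0 and rows % dbl_segments == 0
    return {"subarrays": dwl_segments * dbl_segments,
            "sub_rows": rows // dbl_segments,
            "sub_cols": cols // dwl_segments}

# Hypothetical 2-KB array laid out as 128 rows x 128 columns, split 2 x 2:
print(partition(128, 128, dwl_segments=2, dbl_segments=2))
```

Each sub-array then drives shorter word lines and bit lines, which is what reduces the switched capacitance per access.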

1.5 Dissertation Contributions

The main contributions of this dissertation are:

1. To propose a modular hybrid power estimation modeling methodology

for on-chip caches and SRAM arrays. The proposed modeling methodol-

ogy is capable of capturing accurately both dynamic and leakage power

mechanisms for on-chip caches and SRAM arrays. Also, the proposed

modeling methodology is simple and straight-forward, allowing for short

component characterization and estimation time. Rather than using only


one technique to estimate power dissipation, the proposed methodology

seeks to find the best match between a particular estimation technique

and a specific cache component. For example, a probabilistic approach

has been used to estimate both dynamic and static power of address de-

coders, an analytical approach has been used to estimate dynamic power

of bitlines and 6T-SRAM cells, sense amplifiers, write circuits, and word-

line drivers, while a circuit-simulation-based modeling backend has been

used to estimate all leakage power mechanisms. Furthermore, the pro-

posed modeling methodology is modular, thus, it can be applied to model

power dissipation for other types of components with regular structures,

e.g. content-addressable-memory (CAM).

The initial idea of the modeling methodology has been discussed in Pa-

per 1, where the White-box Table-based Total Power Consumption es-

timation approach (WTTPC) is introduced. Further development on the

idea of WTTPC approach leads to the formation of the modeling method-

ology for unpartitioned data SRAM arrays that is then fully described

in Paper 2. The modeling methodology for physically partitioned data

SRAM arrays is developed and described in Paper 3. And finally, the

modeling methodology for on-chip caches is invented and described in detail in Paper 5.

2. To offer high-level parameterizable, but still accurate power dissipation

estimation models for on-chip caches and SRAM arrays. For each cache

component, its power model for total power estimation consists of ana-

lytical equations for dynamic power and pre-characterized leakage power

values. Different cache components are characterized by performing

a few simple circuit-level DC simulations using the appropriate probes,

to extract the leakage power from simulation data. Dynamic analytical

power models are derived based on the well-known activity-based switch-

ing power equation, with nodal capacitances extracted using a circuit-

level simulator that establishes the operating point and DC capacitances.

The total leakage power accounts for all types of leakage currents that


are present in the transistor models used by circuit simulators, during

both idle and active cycles. Therefore, the proposed power models offer

much better accuracy and flexibility in estimating both total and leak-

age power dissipation for on-chip caches and SRAM arrays compared to

those high-level analytical power models implemented in existing power

estimation tools.

In Paper 2, the component characterization for leakage power and ca-

pacitance extraction for all components of an unpartitioned data SRAM-

based array is described in detail, and power models are also clearly explained and presented. In Paper 3, power models for components of

a physically partitioned data SRAM-based array are explained and pre-

sented. Power models for on-chip cache components, including tag and

data SRAM-based arrays, are described in Paper 5.
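The hybrid model structure described in this contribution — an analytical, activity-based dynamic term plus a pre-characterized leakage value per component — can be sketched as follows; all numeric values are placeholders, not characterized data:

```python
def dynamic_power(alpha, C_node, Vdd, f_clk):
    """Activity-based switching power: P = alpha * C * Vdd^2 * f."""
    return alpha * C_node * Vdd**2 * f_clk

def total_power(components, f_clk, Vdd):
    """Sum per-component dynamic power plus pre-characterized leakage.

    components: iterable of (alpha, C_node_farads, P_leak_watts) tuples,
    mirroring the hybrid model: analytical dynamic part + table-based leakage.
    """
    return sum(dynamic_power(a, C, Vdd, f_clk) + P_leak
               for a, C, P_leak in components)

# Hypothetical bitline and sense-amplifier entries for a small SRAM array:
parts = [(0.5, 120e-15, 2.0e-6),   # bitline: alpha, C, leakage
         (1.0, 15e-15,  0.4e-6)]   # sense amplifier
print(f"{total_power(parts, f_clk=500e6, Vdd=1.1):.3e} W")
```

The leakage entries would come from the circuit-level characterization tables, while alpha and C feed the analytical dynamic term.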

3. To provide the verification of the proposed power estimation models for

a number of on-chip cache configurations implemented in 0.13-µm and

65-nm CMOS processes. The validation results for on-chip cache is

shown in Paper 5, whereas the validation results for unpartitioned and

partitioned data SRAM-based arrays are given in Paper 2 and Paper 3

and Paper 7, respectively. The accuracy obtained in those validations is high (more than 95%) compared to the power values of circuit-level

simulations.

4. To propose a modeling methodology to capture the dependence of leak-

age power on temperature variation, on supply-voltage scaling, and on the

selection of process corners for accurate architectural-level power estima-

tion of on-chip caches. The modeling methodology extends the obtained

earlier power models for cache components to capture the dependence of

leakage power on variability issues. The proposed modeling methodol-

ogy and power models are described in Paper 4.
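As a crude sketch of why such temperature calibration matters, the dominant exponential temperature dependence of sub-threshold leakage can be illustrated by keeping only the exp(-VT/(n kB T/q)) factor from Eq. 1.1; the VT and n values here are assumptions:

```python
import math

def leakage_scale(T_kelvin, T_ref=300.0, VT=0.3, n=1.4):
    """Relative sub-threshold leakage vs. a reference temperature, keeping
    only the dominant exp(-VT / (n * kB*T/q)) dependence (a crude sketch:
    VT itself also shifts with T, which is ignored here)."""
    kB_over_q = 8.617e-5          # Boltzmann constant / charge [V/K]
    def i(T):
        return math.exp(-VT / (n * kB_over_q * T))
    return i(T_kelvin) / i(T_ref)

for T in (300, 350, 400):
    print(T, round(leakage_scale(T), 2))
```

Even this simplified factor grows several-fold between room temperature and typical hot-spot temperatures, which is the behavior the extended models are calibrated to capture.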

5. To separate all leakage mechanisms existing in on-chip caches and ensure the ability to capture them correctly using an appropriate probing strategy and


a circuit-level simulator. Initially, a description of major leakage mech-

anisms for components of a 6T-SRAM-based data array and the probing

strategy to capture them is given in Paper 2. Later, a methodology for

probing circuits for static current measurements in CMOS circuits during

simulation has been proposed. The methodology is capable of capturing

all leakage mechanisms existing in BSIM4 models, in this case imple-

mented in the Hspice simulator. The full description of the methodology

is given in Paper 6. The proposed probing methodology was used suc-

cessfully to obtain accurate and distinguishable static power constituents

(i.e. gate, subthreshold and total leakage power) for 2-kB unpartitioned

and partitioned SRAM memory arrays implemented in a BPTM 65-nm

process, Paper 7.

6. To create a framework and a design for implementing a cycle-accurate

architecture-level performance-power estimation tool for parallel DSP ar-

chitectures (DSP-PP). The structure and design of DSP-PP simulator is

described in Paper 1.

1.6 Dissertation Overview

The remainder of this dissertation is organized as follows. Chapters 2 - 4 pro-

vide readers with background and theory. Chapter 2 focuses on on-chip SRAM-based cache architectures used in DSP, embedded, and GPP

systems, respectively. Chapter 3 reveals the mechanisms behind power dissi-

pation of MOS transistors, and provides several useful techniques to combat

power dissipation. Chapter 4 provides information about the power estimation

models that are implemented in existing power estimation tools for on-chip

caches and other processor components. This chapter also presents some back-

ground information about power modeling in general, its classification and areas of application.


Chapter 5 accounts for the work done on modeling methodology for on-chip

caches and SRAM arrays. First, a discussion presents in more detail the drawbacks of an analytical approach to power modeling, and the reasons why the table-based, simulation-based power modeling approach has been selected. After that,

our modular hybrid power estimation modeling methodology for on-chip caches

and SRAM data arrays is described in detail. The following section is dedicated

to validation of the obtained power models against circuit-level simulations for

complete on-chip caches and data SRAM arrays. After this section, the mod-

eling methodology to capture the dependence of leakage power on temperature

variation, on supply-voltage scaling, and on the selection of process corners is

presented and discussed in detail.

Appendix A presents work done on the design and implementation of a

cycle-accurate architecture-level performance-power estimation tool for paral-

lel DSP architectures (DSP-PP) as a case study. This serves as an example of an

application where our proposed power modeling methodology and power mod-

els for on-chip caches and data SRAM arrays can be implemented.

The final part of the dissertation ends with conclusions and some ideas for future work on modeling power dissipation for other types of components, e.g. CAM and clocking networks.

Bibliography

[1] G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electron-

ics, vol. 38, no. 8, Apr. 1965.

[2] http://www.intel.com, 2007.

[3] T. Mudge, “Power: A First Class Design Constraint,” IEEE Transaction on Com-

puters, vol. 34, no. 4, pp. 52–58, Apr. 2001.

[4] S. Borkar, “Design Challenges for Technology Scaling,” IEEE Micro, vol. 19, no.

4, pp. 23–29, Aug. 1999.


[5] International Technology Roadmap for Semiconductors, http://public.itrs.net,

ITRS, 2006.

[6] B. Doyle et al., “Transistor Elements for 30-nm Physical Gate Lengths and Be-

yond,” Intel Technology Journal, vol. 6, pp. 42–54, May 2002.

[7] Nam Sung Kim, K. Flautner, D. Blaauw, and T. Mudge, “Circuit and Microarchi-

tectural Techniques for Reducing Cache Leakage Power,” IEEE Transactions on VLSI Systems, vol. 12, no. 2, pp. 167–184, Feb. 2004.

[8] L. Li et al., “Leakage Energy Management in Cache Hierarchies,” in Proceedings

of PACT’02, Sept. 2002, pp. 131–140.

[9] Univ. California Berkeley Device Group, BSIM4.2.1 MOSFET Model: User’s

Manual, Dept. of EECS, Univ. of California, Berkeley, CA 94720, USA, 2002.

[10] S.J.E. Wilton et al., WRL 93/5: An Enhanced Access and Cycle Time Model for

On-chip Caches, WRL, 1994.

[11] A. Y. Zeng et al., “Cache Array Architecture Optimization at Deep Submicron

Technologies,” in ICCD 2004, Oct. 2004, pp. 320–5.

[12] D. Tarjan et al., HPL 2006-86: CACTI4.0, HP, 2006.

[13] Y. Zhang et al., CS 2003-05: HotLeakage : A Temperature-Aware Model of Sub-

threshold and Gate Leakage for Architects, Dept. of CS, Univ. of Virginia, USA,

2003.

[14] M. Mamidipaka et al., CECS 04-28: eCACTI: An Enhanced Power Estimation

Model for On-chip Caches, CECS, Univ. of California, Irvine, USA, 2004.

[15] S. Rodriguez et al., “Energy/Power Breakdown of Pipelined Nanometer Caches

(90nm/65nm/45nm/32nm),” in ISLPED 2006, Oct. 2006, pp. 25–30.

[16] W. Zhao et al., “New generation of Predictive Technology Model for sub-45nm

design exploration,” in ISQED 2006, March 2006, pp. 585–90.

Part II

Background


2On-Chip SRAM Cache Architecture

In this chapter, background information of on-chip SRAM-based cache archi-

tectures is provided. Section 2.1 presents the available on-chip cache architec-

tures used in Digital-Signal-Processing (DSP) and embedded systems, whereas

Section 2.2 focuses on the on-chip cache architectures used in General-Purpose-

Processor (GPP) system.

2.1 Cache Architecture Used in DSP and Embed-

ded Computer Systems

This section starts by presenting some very basic DSP architectures, given in Figs 2.1, 2.2, 2.3 and 2.4. Later, it shows how caches were introduced and integrated into some of those basic DSP architectures, creating high-performance DSP processors capable of meeting the rapidly increasing demands posed by high-performance DSP applications.

2.1.1 Basic DSP Architectures

In the classical von Neumann architecture the arithmetic-logic unit (ALU) and

the control unit (CU) are connected to a single memory that stores both the data

values and program instructions. This architecture is very simple and it was

used when memory was very expensive to build. The main drawback of this

architecture is the bottleneck of the memory system.

Figure 2.1: Basic DSP Architectures: Harvard architecture

Fig. 2.1 shows the classical Harvard architecture. It is an improvement over the von Neumann architecture. Two separate memories are used to store data (i.e. Data Memory - DM) and program (i.e. Program Memory - PM), and two separate busses are used to connect the data and program memories to the Datapath (DP) and to the Instruction Processor (IP), respectively. This

simple architecture is still used in many micro-controllers, but it is not used in

any recent DSPs [1].

Since the most common operation in digital signal processing is the convo-

lution that is implemented by several multiply and add steps, a DSP processor

must be able to efficiently perform multiply-and-accumulate operations, e.g.


by using Multiply-And-aCcumulate (MAC) units. Ideally, each multiply-and-

accumulate operation should be performed in a single instruction cycle, which requires at least two values to be read from and one value to be written to the data memory, while two or more address registers must be updated. Thus, it is obvious that high memory bandwidth is just as important as a fast multiply-and-accumulate operation [2].
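The convolution kernel referred to above reduces to repeated MAC steps, as this small sketch shows:

```python
def fir_convolution(x, h):
    """Convolution via repeated multiply-and-accumulate (MAC) steps,
    the core DSP kernel the text describes."""
    N, M = len(x), len(h)
    y = []
    for n in range(N):
        acc = 0
        for k in range(M):          # each inner step: one multiply, one add
            if 0 <= n - k < N:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y

print(fir_convolution([1, 2, 3, 4], [1, 1]))   # running pair sums
```

Every inner iteration needs two operand reads (h[k] and x[n-k]) and one accumulation, which is exactly why the memory-bandwidth requirement stated above arises.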

Figure 2.2: Basic DSP Architectures: a) with a MUX, and b) with a MUX and a small instruction cache

Fig. 2.2 shows two other DSP architectures that were gradually improved

from the Harvard one to provide multiple accesses to DM. In these architec-

tures, the program memory also can be used as a coefficient (data) memory

when executing a convolution. A multiplexer (MUX) is used to provide ac-

cesses to DM and PM when needed. In the architecture shown in Fig. 2.2b, a small cache is added to store a short program. This cache is used when the PM is required for data accesses, supporting non-overlapping hardware loops that include multiple instructions. However, these architectures require dual- or multiple-

ported program memory, thus raising their design cost. Besides, the clock rate

is limited by the memory access rate and therefore cannot be very high.

Fig. 2.3 shows the most frequently used simple DSP architecture with a sin-

gle DP. Two data memories are used to support convolution and vector-based

algorithms.


Figure 2.3: The most frequently used DSP Architecture

Figure 2.4: A VLIW-DSP Architecture with Multiple Datapaths

Fig. 2.4 shows a typical VLIW architecture, an example of a DSP architec-

ture with multiple datapaths. The VLIW-DSP architecture allows multiple in-

structions to be fetched and executed in parallel. Those instructions are decoded

in the IP and then control signals are supplied to multiple datapaths. Parallel ex-

ecution of multiple arithmetic operations in DP requires multiple DMs to store

coefficients and results. VLIW DSP typically assumes that data dependencies

are known and therefore manages data dependencies during compile time [1].


2.1.2 Cache Architectures in DSP and Embedded Systems

Traditionally, DSP system architectures do not have any caches [3]. Instead,

they rely on multiple banks of fast on-chip addressable SRAM memories and

multiple bus sets to allow for several memory accesses per instruction cycle

(Figs 2.1, 2.2a, and 2.3). The on-chip addressable SRAMs are designed to be

accessible by both the central processing unit (CPU) and the direct memory ac-

cess unit (DMA) [4]. However, caches are increasingly used in DSPs for storing

instructions and data required by large, high-performance and memory-hungry

DSP applications. In the beginning, a small specialized instruction cache was

incorporated in some DSP processors to store instructions of small loops so that the on-chip bus sets could be free to retrieve data (Fig. 2.2b). Later, on-chip

multi-level caches were commonly used on some general purpose DSP fami-

lies, e.g. the Texas Instruments (TI) TMS320C6211 and TI C6x DSP [5]. The

main reasons to have caches in DSP architectures are:

• High-performance DSP applications increasingly require processing ca-

pability from DSP processors which in turn imposes harsh demands of

increased operating frequency and bandwidth on the memory system.

• The frequency of on-chip SRAM memories traditionally used in DSPs

does not scale along with DSP clock rate, and as a result only relatively

small memory sizes are able to meet the frequency goals. This is in di-

rect contrast to the increasing program-size requirements seen by DSP applications, which require even larger on-chip SRAM.

• Advanced process technologies have allowed both the CPU speed to in-

crease and more memory to be integrated on-chip, but the speed of on-chip memory has not increased proportionally. Therefore, the mem-

ory often becomes a processing bottleneck. Besides, large on-chip SRAM

memory is also expensive to build.

• Caching offers a hardware-managed and user-transparent view of a large

address space in a physically small, local SRAM and narrows the per-

formance gap between processor and main memory. Therefore, the in-


troduction of multi-level cache systems to DSP architectures can greatly reduce the CPU-to-memory processing bottleneck while still maintain-

ing the DSP goals of low cost and low power.

Figure 2.5: A Two-level Cache Architecture (TI TMS320C6211/TI C6x DSP [5])

In a multi-level cache system, the level nearest the DSP (level 1) is opti-

mized for the high DSP core clock rate and low access latency. The size of this

level 1 (L1) cache may be constrained by the core clock rate. At the same time,

the outer levels can be optimized for storage density and power. Often the outer

cache levels have multi-cycle access time. The penalty for a miss from the in-

ner cache levels to an outer level that hits in the outer level is normally a small

integer number of clock cycles. Fig. 2.5 shows a two-level cache architecture

used in the TI TMS320C6211 [4] and TI C6x DSP families [5]. In this cache ar-

chitecture, the L1 memories consist of a small direct-mapped instruction cache

(I-cache) and a small two-way set associative data cache (D-cache), while the


level 2 (L2) consists of a relatively larger on-chip unified SRAM memory that

can be partially configured as a four-way set associative cache.

The TI TMS320C6211 uses separate L1 4-KB I-cache and D-cache, and

four 16-KB banks of on-chip SRAM memory that individually can be config-

ured as either local memory or a unified L2 cache. In the TI C6x DSP families,

the L1 cache consists of separate 16-KB I-cache and D-cache, while the L2 is

a 1-MB memory that can be mapped as all SRAM or as a mix of cache (up to

256-KB) and SRAM [5]. There are motivations behind the selection of size and

associativity for both L1 and L2 caches:

• Since most DSP algorithms consist of small, tight loops that execute the

same code on multiple data locations, a direct-mapped cache is suitable

for the L1 I-cache. The size of an L1 I-cache should be large enough

to accommodate multiple DSP kernels simultaneously to ensure a small

number of cache misses [3].

• As mentioned earlier in Section 2.1.1, DSP processors must be able to

efficiently perform each multiply-and-accumulate (MAC) operation, ide-

ally, in a single instruction cycle, which requires at least two values to be read from and one value to be written to the data memory. A two-way set

associative cache is suitable for an L1 D-cache since it keeps both MAC

operands in the cache, allowing simultaneous accesses to both operands

without going to the L2 cache or main memory. The size of the L1 D-

cache should be large enough to keep data for several DSP kernels loaded

simultaneously in the L1 I-cache.

• The size of the L2 memory is designed to be as large as possible because

misses are much less likely to occur. The L2 memory can be config-

ured as a unified on-chip SRAM memory or as a cache entirely, or as a

combination of cache and SRAM. The associativity of the L2 cache is

determined by how many of the banks are configured as caches, allowing

1-, 2-, 3- or 4-way associativity.


An L2 memory can also be further optimized for a particular system by

selecting appropriate parameters such as line size, allocation policies,

replacement policies, pipelining, prefetching, SRAM latency, etc. The L2 mem-

ory interfaces to the DMA controller for cache accesses and DMA transfers.

Data coherency between external memory and the L1 caches is maintained.

The L2 can be programmed to access various memory sizes with various access

latencies and also to allow CPU initiated DMA transfers [5].

Caches present in DSPs are typically adapted to suit DSP needs. For ex-

ample, the DSP may allow the programmer to manually “lock” portions of the

cache, so that performance-critical sections of the software can be guaranteed

to be resident in the cache. This helps to provide easy execution time predic-

tions at the cost of reduced performance for other sections of software that may

need to be fetched from main memory. Normally, DSP vendors are responsible

for providing programmers with tools that enable an accurate determination of

program execution times. These tools are a great help for programmers implementing and optimizing real-time DSP software, thus improving the performance

of DSPs.

2.2 Cache Architectures in GPP Systems

2.2.1 Cache System Architecture

Unlike DSP processors, on-chip caches are already commonly used in general

purpose processors (GPPs). By definition, cache is the name given

to the first level of the memory hierarchy encountered once the address leaves

the CPU [6]. Fig. 2.6 shows a typical memory hierarchy used in embedded,

desktop and server computers [6]. A memory hierarchy takes advantage of

temporal locality by keeping more recently accessed data items closer to the processor, and takes advantage of spatial locality by moving blocks consisting of multiple contiguous words in memory to the upper levels of the hierarchy. Fig. 2.6 also shows that the memory hierarchy uses smaller and faster


memory technologies close to the processor. Therefore, if the hit ratio is high

enough, the memory hierarchy has an effective access time close to that of the

highest (and fastest) level and a size equal to that of the largest (and slowest)

level [7].
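This effective access time can be illustrated with a simple average-memory-access-time calculation. The latencies and hit ratio below are hypothetical example values, not measurements from this thesis:

```python
# Effective (average) access time of a simple two-level hierarchy.
# All latencies and the hit ratio are hypothetical example values.

def effective_access_time(t_hit, t_miss_penalty, hit_ratio):
    """AMAT = t_hit + miss_ratio * miss_penalty."""
    return t_hit + (1.0 - hit_ratio) * t_miss_penalty

# 1-cycle L1, 50-cycle penalty to reach main memory, 98% hit ratio:
amat = effective_access_time(t_hit=1.0, t_miss_penalty=50.0, hit_ratio=0.98)
print(amat)  # 2.0 cycles: close to the fastest level, as stated above
```

With a high enough hit ratio, the result stays close to the 1-cycle latency of the fastest level even though most of the capacity sits in slow main memory.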

Figure 2.6: A typical memory hierarchy in GPPs (CPU and registers, cache, main memory and I/O devices, connected via the memory and I/O buses)

Cache hit ratio and access time are two metrics that determine the perfor-

mance of a cache system. There have been numerous studies on techniques

for achieving fast cache access while maintaining high hit ratios which include

selecting the appropriate cache parameters such as cache size, line size, set

associativity, allocation policies, replacement policies, pipelining, prefetching,

and SRAM latency [8], [9], [10]. Those topics, however, are beyond the scope of this thesis and will therefore not be studied in this dissertation.

Theoretically, a memory hierarchy can consist of N_cachelevels cache levels, where N_cachelevels is an integer (1, 2, 3, etc.). Depending on the performance and cost requirements for the cache system, the number of cache levels must be chosen to minimize the cache access time as well as to maintain high cache hit ratios [8]. In practical designs, N_cachelevels ≤ 4 is seen in most cache hierarchies. As a rule of thumb, the cache at the lowest level

is usually small, fast and often located on-chip while the one at the highest level

is often large, unified and it may or may not be located on-chip. Fig. 2.7 shows

a typical memory system used in GPPs where the cache system consists of two

levels: L1 and L2. The on-chip level 1 cache is split into separate I-cache and

D-cache to support the instruction and data fetch bandwidths of modern GPPs.


The L2 cache is an off-chip unified memory used to store both instructions

and data [9]. The Translation Lookaside Buffer (TLB) is an on-chip cache that stores recently translated addresses, used to translate virtual page addresses into valid physical addresses. In Fig. 2.7, the size of the caches increases

from the level 1 (lower) to the level 2 (higher), but the speed decreases. In other

words, the storage capacity and also the latency of a cache are increasing while

going from a lower to a higher cache level.

Figure 2.7: A two-level cache architecture used in GPPs (on-chip split level-1 I- and D-caches, TLB and register file; an off-chip unified level-2 cache, main memory and I/O devices on the memory and I/O buses)

Figure 2.8: A three-level cache architecture used in GPPs (on-chip split level-1 caches, TLB and register file; a unified level-2 cache; a level-3 cache between the level-2 cache and main memory on the memory bus)

In some recent designs, the L1 and the L2 caches are integrated on-chip, and

there is no L3 cache located between the L2 cache and the main memory, e.g.

the Intel Pentium 4, the Intel Pentium M, the Intel Xeon, the Intel Dual-core

Pentium D [11], and the AMD Dual-core Opteron [12]. Instead of using an


off-chip L3 cache, communications between an on-chip L2 cache and the main

memory are usually done through a memory controller that can be located on-

chip or off-chip. If the memory controller is integrated on-chip, the L2 cache is

connected to the memory controller through a high-speed back-side bus (BSB),

and if it is located off-chip, then the L2 cache is connected to the memory con-

troller through a slower front-side bus (FSB).

In several other designs, there are three levels of caches: An on-chip L1, an

on-chip L2 and an off-chip L3 cache (Fig. 2.8). For example, the IBM multi-

core Power5 microprocessor has a separate on-chip L1 cache (consisting of a

64-KB two-way set-associative I-cache, a 32-KB four-way set-associative D-

cache) for each core; an on-chip ten-way set-associative 1.875-MB L2 cache

shared between two cores; and an off-chip 36-MB L3 cache with an on-chip

directory [13]. The L3 cache, however, is connected directly to the L2 cache through a high-speed back-side bus rather than via the on-chip memory controller.

Figure 2.9: Block diagram of the Intel Itanium 2 processor [14] (on-chip L1 I- and D-caches with TLBs, a quad-port L2 cache and an L3 cache; integer, floating-point and branch units; register files; and the 128-bit, 6.4-GB/s system bus)


In addition, in order to reduce memory traffic in a multiprocessor configuration, Intel offers other versions of the Pentium 4 with much larger on-chip caches: for example, the Intel Xeon MP processor comes with an on-chip L3 of 1 MB, 2 MB or 4 MB, and the Intel Pentium 4 Extreme Edition processor comes with an on-chip L3 of 2 MB [7]. Moreover, the Intel Itanium 2 processor (Fig. 2.9) –

a representative of the Intel’s IA-64 64-bit EPIC processor family – has an L1,

an L2 and an L3 cache integrated on-chip. It has a separate 16-KB four-way

associative L1 D-cache and I-cache, a 256-KB unified eight-way associative L2

cache, and a large unified 24-way set-associative L3 cache of either 3 MB or

6 MB or 9 MB in size [14]. Nevertheless, the Intel Itanium 2 is not the Intel processor with the largest on-chip caches: the Intel Dual-core Itanium 2 processor has a unified 24-MB low-latency L3 cache, and its cache hierarchy totals nearly 27 MB for the entire processor [11].

The above-mentioned examples of Intel processors suggest several trends

of development for cache implementation in recent GPPs: (i) More cache lev-

els are explored and implemented; (ii) Larger and larger caches are integrated

on-chip; (iii) More cores are integrated on a processor chip which requires even

larger on-chip caches and memories to provide necessary instructions and appli-

cation data for cores to run threads in parallel, and to store the obtained results.

2.3 Cache on GPPs and DSPs: Differences

Although both DSP and GPP caches are implemented to bridge the performance

gap between processor and main memory by maintaining fast access time and

high hit ratios, there are still some differences:

1. More levels of on-chip caches are used in GPPs, often three levels, while

a DSP cache system normally consists of two levels thus far. The difference in implemented cache levels may change in future products, but for the time being the cache systems used in DSPs are still one generation

behind the ones used in GPPs [15].


2. In level 1, the cache is usually split into separate I-cache and D-cache in both DSPs and GPPs; however, these caches are larger in GPPs. In DSPs, the I-cache is typically direct-mapped and the D-cache two-way set-associative, whereas in GPPs the I-cache is normally two-way set-associative and the D-cache multiple-way set-associative.

3. In level 2, both GPPs and DSPs use large unified caches; however, in DSPs the L2 can be configured either entirely as a unified on-chip SRAM memory, entirely as a cache, or as a combination of cache and SRAM. This ability has not been seen in any GPP caches. In addition, the L2 caches in GPPs tend to have higher set-associativity than the ones used in DSPs.

4. Compared to the caches used in GPPs, which are generally not visible and

not controlled by the application programmer, the cache systems in DSPs

are both visible to and controlled by the application programmers [3].

Unlike GPPs, DSPs do not generally use dynamic features such as the

branch prediction and the speculative execution. Therefore, predicting

the execution time for a given section of code is fairly easy on a DSP

which allows programmers to confidently push the DSP performance lim-

its [15].

2.4 Cache Organization

2.4.1 Basic Cache Organization

This section briefly describes the organization of a typical SRAM-based cache,

its working principles and circuitry. Detailed descriptions of cache organizations are given in [7] and [16].

Caches are normally organized as two-dimensional arrays. The first dimen-

sion is the set, and the second dimension is set associativity. The set ID is

determined by a function of the address bits of the memory request. The line ID

within a set is determined by matching the address tags in the target set with the


referenced address. Caches with a set associativity of one are commonly referred to as direct-mapped caches, while caches with a set associativity greater than one are referred to as set-associative caches. If there is only one set, the cache is called fully-associative.

Inside a cache, each cache entry consists of data and a tag that identifies

the main memory address of that data. To identify if a cache block has valid

information, a valid bit (V) is added to each cache entry: V = 1 indicates that the tag entry contains a valid memory address; otherwise, the tag entry should be ignored and there cannot be a match for this block. A memory request hits

in the cache when the upper bits of the reference address and the tag are equal,

and the data is supplied to the processor. Otherwise, a miss occurs.

Fig. 2.10 shows the basic organization of a 16-KB direct-mapped cache used

in the Intrinsity FastMATH Adaptive Signal Processor [7] which contains 256

blocks with 16 words (i.e. 512 bits) per block. The byte-offset, the block-offset

and the index fields are two-bit, four-bit and eight-bit wide, respectively, and

the tag field is 18-bit wide. The 8-bit index field defines the number of cache

entries (i.e. 256) whereas the four-bit block-offset field is used to select a word

from a block using a 16-to-1 multiplexor.
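The field extraction described above can be sketched as follows. This is a behavioural illustration only; the shift amounts and masks encode the 2-, 4-, 8- and 18-bit field widths of this particular cache:

```python
# Decompose a 32-bit address into the fields of the 16-KB direct-mapped
# cache described above: a 2-bit byte offset, a 4-bit block offset,
# an 8-bit index (256 entries) and an 18-bit tag.

def split_address(addr):
    byte_offset  = addr & 0x3          # bits [1:0]
    block_offset = (addr >> 2) & 0xF   # bits [5:2], selects 1 of 16 words
    index        = (addr >> 6) & 0xFF  # bits [13:6], selects 1 of 256 entries
    tag          = addr >> 14          # bits [31:14], compared against the stored tag
    return tag, index, block_offset, byte_offset

tag, index, block_offset, byte_offset = split_address(0x0001_2344)
print(tag, index, block_offset, byte_offset)  # tag=4, index=141, block offset=1, byte offset=0
```

On a lookup, `index` selects the cache entry, the stored tag of that entry is compared against `tag`, and on a hit `block_offset` steers the 16-to-1 multiplexor to the requested word.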

There are two parts in a cache access: (i) Accessing the tag array and per-

forming the tag comparison by comparing the memory-requested address with

the stored address tag to determine if the data is in the cache; (ii) Accessing

the data array to bring out the requested data. For a set-associative cache, the

results of the tag comparison are used to select the requested line from within

the set driven out of the data array.

In practice, a cache is divided into two separate arrays: a small tag array,

and a larger SRAM data array. Fig. 2.11 shows the organization of a typical

SRAM-based cache given in CACTI [16]. This organization is used as the ba-

sic organization assumed for power modeling throughout this dissertation.


Figure 2.10: Basic organization of a direct-mapped cache [7] (the address is split into an 18-bit tag, an 8-bit index, a 4-bit block offset and a 2-bit byte offset; 256 entries, each holding a valid bit, an 18-bit tag and 512 bits of data; a comparator produces the hit signal and a multiplexor selects the requested 32-bit word)

The access procedure to the assumed cache (given in Fig. 2.11) consists of

precharge and evaluation phases that can be divided into the following steps:

1. Address decoding: Address bits are inputs to the row and column decoders (shown as a decoder in Fig. 2.11). For each address combination,

the row decoder drives exactly one tag wordline and one data wordline in

the tag and data arrays, respectively, while the column decoder selects a

set of bitline pairs (BL/BL) in the tag and data arrays. Thus, for each


Figure 2.11: Basic organization of a typical SRAM-based cache (a decoder drives the wordlines of the SRAM-based tag and data arrays; column multiplexers, sense amplifiers, tag comparators, MUX drivers and output drivers produce the valid output and the data outputs)


address combination only a set of memory cells in the tag and data arrays

is selected.

2. Bitline precharging: A simple precharge scheme is used that precharges

all bitline pairs in the tag and data arrays to Vdd during the precharge

phase. The precharge scheme is deactivated during the evaluation phase.

3. Selecting memory cells: The evaluation phase starts when the row de-

coder fires, driving a wordline high. Each memory cell in the selected

row pulls down one of its two bitlines; the value stored in the memory

cell determines which bitline goes low.

4. Preparing for sensing: The column decoder fires and connects sense

amplifiers to their selected bitline pairs through a multiplexer (MUX).

This step is needed only if the number of sense amplifiers is less than the

number of bitline pairs, i.e. more than one bitline pair shares a sense amplifier.

5. Sensing: Each sense amplifier (SA) monitors a pair of bitlines and de-

tects when one changes. By detecting which bitline goes low, the sense

amplifier determines the content of the selected memory cell. Voltage

differential sense amplifiers are assumed to be used in both the tag and

data arrays. In order to minimize the sensing time, bitlines of all sense

amplifiers are precharged to high during the precharge phase.

6. Tag comparing: The information read from the tag array is compared

to the address tag bits to determine if the requested block exists in the

cache. The number of comparators needed is equal to the number of ways of the cache's set-associativity, e.g. for a direct-mapped cache only one

comparator is needed.

7. Checking the valid bit and selecting the data: The valid bit is checked

first to know if the entry contains a valid address. If V is set, and if the

tag comparison is successful, which means the requested block is found

in the cache (a cache hit), the MUX drivers are set to select the proper


data from the data array. If V is not set and/or the tag comparison is un-

successful, a cache miss occurs. The processor control unit, together with

a separate controller (neither is shown in Fig. 2.11), is responsible for de-

tecting a miss and processing the miss by fetching the requested data from

a lower-level cache or from the main memory. When the requested data

are available, a write into the cache occurs: (i) putting the requested data

in the data portion of the cache entry; (ii) writing the upper bits of the ad-

dress into the tag field; (iii) turning the valid bit on. On a cache miss, the

processor is simply stalled until the lower-level cache/main memory re-

sponds with the requested data. Then, the stalled cache access is restarted,

this time finding the data in the cache.

8. Driving out data: All output data from the cache (i.e. a valid bit and the

selected data) are driven to the appropriate bus through the total-output

drivers.
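Steps 6 and 7 above (valid-bit check, tag comparison and miss handling) can be sketched behaviourally as follows. This is a toy software model with hypothetical sizes; a real cache performs these steps in parallel hardware:

```python
# Behavioural sketch of a direct-mapped cache lookup (steps 6-7 above):
# check the valid bit, compare the tag, and on a miss fetch the block
# from the next level and fill the entry. Sizes are hypothetical.

NUM_SETS = 256

class DirectMappedCache:
    def __init__(self, next_level):
        self.valid = [False] * NUM_SETS
        self.tags  = [0] * NUM_SETS
        self.data  = [None] * NUM_SETS
        self.next_level = next_level  # lower-level cache / main memory, modelled as a dict

    def read(self, tag, index):
        if self.valid[index] and self.tags[index] == tag:
            return self.data[index], True        # cache hit
        # Cache miss: fetch the block, write it into the data portion,
        # update the tag field, and turn the valid bit on.
        block = self.next_level[(tag, index)]
        self.data[index]  = block
        self.tags[index]  = tag
        self.valid[index] = True
        return block, False

memory = {(4, 141): "block-A"}
cache = DirectMappedCache(memory)
print(cache.read(4, 141))  # a miss on the first access ...
print(cache.read(4, 141))  # ... then a hit on the restarted access
```

The second `read` models the restarted access mentioned above: after the fill, the same request finds the data in the cache.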

Thus, there are two potential critical paths in a cache access: the tag-array-

access and the data-array access. The tag-array-access path consists of (i) reading the tag array; (ii) performing the tag comparison; and (iii) driving the multiplexor select signal. The data-array-access path consists of (i) reading the data array; and (ii) driving the data to the multiplexor. If the tag-array access takes longer than the data-array access, then the tag side is the critical path; otherwise, the data side is the critical path.

In practice, which side is the critical path depends strongly on the cache organization parameters (e.g. cache size, associativity, line size, data word length), process technology parameters, and the types of circuits used to implement the components of the cache. Detailed descriptions of cache components and their circuitry are given in Chapter 5 of this dissertation.

2.4.2 Memory Partitioning

Partitioning is one of the most successful techniques for memory energy optimization, whereby a large memory array is divided into smaller


arrays in such a way that each of these can be independently controlled and ac-

cessed. The aim of the partitioning approach is to find the best balance between

energy savings, delay and area overheads. Partitioning of memory can be at two

levels: logical and physical [17].

Logical partitioning involves creating several smaller memory macros in-

stead of the original single large array, and then synthesizing a control logic to

activate the different memory macros. In this approach, each memory macro

is actually a separate memory array with a smaller size including decoders,

precharge and read/write circuits of its own. Control logic is added on top to

activate one array at a time based on address inputs. Since this scheme requires

extra control circuitry, some extra wiring and multiple decoders, precharge and

read/write circuits, designers always try to strike a balance between the energy

savings from having small arrays and the overhead for supporting them.

Physical partitioning, on the other hand, involves dividing the original array

into several sub-arrays sharing decoders, precharge and read/write circuits, and

then synthesizing internal control circuitry to provide mutually exclusive acti-

vation of the sub-arrays inside the original array. Moreover, the internal con-

trol circuitry is merged with row/column decoders to generate sub-array selec-

tion signals, so the introduction of extra circuitry is effectively limited. In physi-

cal partitioning, memory arrays can be partitioned horizontally using a divided

word-line (DWL) technique proposed by Yoshimoto et al. [18], vertically using

a hierarchical divided bit-line (DBL) technique presented by Karandikar and

Parhi [19], or bidirectionally using a combination of both techniques [17]. Due

to their advantages in energy-efficiency and ease of implementation, physically

partitioned memory arrays are widely used in L1 and/or L2 caches of recent

microprocessors and DSPs.
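The logical-partitioning scheme described above can be sketched as follows. The array and macro sizes are illustrative, not taken from any specific design; the upper address bits play the role of the added control logic that activates exactly one macro per access:

```python
# Logical partitioning sketch: a 4K-word array split into 4 macros of
# 1K words each. The upper address bits select which macro is activated;
# only that macro dissipates access energy. Sizes are illustrative.

NUM_MACROS = 4
WORDS_PER_MACRO = 1024

def decode(addr):
    macro_id   = addr // WORDS_PER_MACRO   # control logic: macro select
    local_addr = addr %  WORDS_PER_MACRO   # row/column address within the macro
    return macro_id, local_addr

print(decode(0))     # word 0 of macro 0
print(decode(2050))  # word 2 of macro 2: only macro 2 is activated
```

The energy saving comes from activating one small macro instead of the whole array, at the cost of the extra select logic and wiring discussed above.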


Bibliography

[1] Dake Liu, Compendium: Design of Embedded DSP Processors, Department of

Electrical Engineering, Linköping University, Linköping, Sweden, second edition,

2004.

[2] L. Wanhammar, DSP Integrated Circuits, Academic Press, 1999.

[3] J. Eyre and J. Bier, “DSP Processors Hit the Mainstream,” IEEE Computer, vol.

31, no. 8, pp. 51–59, Aug. 1998.

[4] S. Agarwala, C. Fuoco, T. Anderson, D. Comisky, and C. Mobley, “A Multi-

level Memory System Architecture for High Performance DSP Applications,” in

Proceedings of International Conference on Computer Design (ICCD), September

2000, pp. 408–413.

[5] S. Agarwala, T. Anderson, A. Hill, M.D Ales, R. Damodaran, P. Wiley,

S. Mullinnix, J. Leach, A. Lell, M. Gill, A. Rajagopal, A. Chachad, M. Agarwala,

J. Apostol, M. Krishnan, Bui Duc, An Quang, N.S. Nagaraj, T. Wolf, and T.T.

Elappuparackal, “A 600-MHz VLIW DSP,” IEEE Journal of Solid-State Circuits,

vol. 37, no. 11, pp. 1532–1544, Nov. 2002.

[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Ap-

proach, Morgan Kaufmann, fourth edition, 2006.

[7] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The

Hardware/Software Interface, Morgan Kaufmann, third edition, 2005.

[8] B. L. Jacob, P. M. Chen, S. R. Silverman, and T. Mudge, “An Analytical Model

for Designing Memory Hierarchies,” IEEE Transactions on Computers, vol. 45, no.

10, pp. 1180–1194, Oct. 1996.

[9] N. P. Jouppi and S. J. E. Wilton, “Tradeoffs in Two-level On-chip Caching,” in

Proceedings of the Annual International Symposium on Computer Architecture, Apr.

1994, pp. 34–45.

[10] J. K. Peir, W. W. Hsu, and A. J. Smith, “Functional Implementation Techniques

for CPU Cache Memories,” IEEE Transactions on Computers, vol. 48, no. 2, pp.

100–110, Feb. 1999.

[11] Intel Pressroom Homepage, http://www.intel.com/pressroom/, 2007.

[12] AMD Homepage, http://www.amd.com, 2007.


[13] B. Sinharoy, R. N. Kalla, J. M. Tendler, R. J. Eickemeyer, and J. B. Joyner, “Power5

system architecture,” IBM Journal of Research and Development, vol. 49, no. 4, pp.

505–521, Sept. 2005.

[14] S. Rusu, H. Muljono, and B. Cherkauer, “Itanium 2 Processor 6M: Higher Fre-

quency and Larger L3 Cache,” IEEE Micro, vol. 24, no. 2, pp. 10–18, Apr. 2004.

[15] J. Eyre, “The digital signal processor derby,” IEEE Spectrum, vol. 38, no. 6, pp.

62–68, June 2001.

[16] S.J.E. Wilton and N.P. Jouppi, WRL Research Report 93/5: An Enhanced Access

and Cycle Time Model for On-chip Caches, Western Research Laboratory, 1994.

[17] P. Sithambaram et al., “Design and Implementation of a Memory Generator for

Low-Energy ASBE SRAMs,” in PATMOS 2005, Sept. 2005, pp. 477–87.

[18] M. Yoshimoto et al., “A divided word-line structure in the static RAM and its

application to a 64K full CMOS RAM,” IEEE JSSC, vol. 18, no. 5, pp. 479–85,

Oct. 1983.

[19] A. Karandikar et al., “Low Power SRAM Design Using Hierarchical Divided Bit-

line Approach,” in ICCD 1998, Oct. 1998, pp. 82–8.


3Power Dissipation in CMOS

The goal of this chapter is to explain the most important mechanisms behind

power dissipation of CMOS circuits. This is essential for the readers who wish

to understand the probing used in the component characterization phase given

in Chapter 5 of this dissertation. Section 3.1 first gives some background in-

formation on mechanisms of power dissipation in CMOS circuits. Then, Sec-

tion 3.2 provides some insights into the trends of leakage power dissipation

in current process technologies, and emerging issues. Finally, Section 3.3 de-

scribes some useful power reduction/cut-off techniques to combat dynamic and

leakage power dissipation in digital circuits, caches and SRAM arrays.


3.1 Mechanisms of Power Dissipation

Based on the behavior of digital CMOS circuits and mechanisms for power

dissipation, total power dissipation of a digital circuit can be decomposed into

two main components: static (the power consumed when the circuit is in the

‘steady state’) and dynamic (the power consumed during switching, when the

circuit is in the ‘active state’).

Figure 3.1: Leakage mechanisms (I1–I8) in an off-state NMOS transistor with VG = VS = 0 and VD = Vdd

While the dynamic type of power dissipation consists of switching power,

glitching power and short-circuit power, the static type includes many more power dissipation mechanisms. Fig. 3.1 illustrates the significant leakage mechanisms that exist in an off-state NMOS transistor [1]: the reverse-bias pn junction leakage (I1), the subthreshold leakage (I2), the Gate-Induced Drain Leakage (I4), the channel punchthrough current (I5), the gate oxide leakage (I7), and the gate current due to hot-carrier injection (I8). I3 and I6 are subthreshold leakage components caused by Drain-Induced Barrier Lowering (DIBL) and by the Short-Channel Effect (SCE) together with the narrow-width effect via VT modulation, respectively. Currents I2, I3, I4, I5 and I6 are off-state leakage mechanisms, while I1 and I7 occur in both the on-state and the off-state. The current


I8 can occur in the off-state, but more typically occurs during the transition of

transistor bias [2].

Of the above-mentioned types of power dissipation, switching power is the one that has, so far, been considered by the high-level power estimation community to be the completely dominating source of power dissipation. The second most significant source is considered to be the subthreshold leakage power. In addition, as technology scales below 70 nm, gate oxide leakage has become one of the significant contributors to the total leakage power dissipation [3]. As technology scaling continues into the very deep submicron regime, pn junction leakage is also receiving a lot of attention and is already considered another significant source of leakage power dissipation for future CMOS processes.

Based on their degree of importance, only four sources of power dissipation will be described in more detail in this section: switching power, subthreshold leakage, gate leakage, and pn junction leakage power. More detailed descriptions of the other sources of power dissipation are given in [2], [4] and [5].

3.1.1 Dynamic Power

Switching Power

Switching power constitutes the major part of the total power dissipation in today's digital CMOS circuits. Although it has been reduced by various techniques such as supply voltage scaling and clock gating, it will still be the dominating source of power dissipation in future technologies.

Switching power is basically the power consumed during the charging and discharging of the capacitances associated with each circuit node, and it can be expressed as:

P_switching = α · C_L · V_dd² · f_clk = α · C_L · ΔV · V_dd · f_clk    (3.1)


Here, C_L is the load capacitance, f_clk is the clock frequency, V_dd is the supply voltage, ΔV is the voltage swing of the node, and α is the node '0→1' transition activity factor, which takes a value between 0 and 1.
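As a numerical illustration of Eq. 3.1, the sketch below evaluates both the full-swing and the reduced-swing form. The capacitance, voltage, frequency and activity values are hypothetical, not taken from any design in this thesis:

```python
# Numerical sketch of Eq. 3.1 with hypothetical example values.

def switching_power(alpha, c_load, v_dd, f_clk, v_swing=None):
    """P = alpha * C_L * dV * Vdd * f_clk; full-swing nodes use dV = Vdd."""
    dv = v_dd if v_swing is None else v_swing
    return alpha * c_load * dv * v_dd * f_clk

# 100-fF node, 1.2-V supply, 1-GHz clock, activity factor 0.1:
p_full = switching_power(0.1, 100e-15, 1.2, 1e9)       # full swing: dV = Vdd
p_low  = switching_power(0.1, 100e-15, 1.2, 1e9, 0.6)  # reduced swing: dV = 0.6 V
print(p_full, p_low)  # 14.4 uW vs 7.2 uW
```

Halving the swing voltage ΔV halves the switching power, which is why reduced-swing bitlines are attractive in SRAM arrays.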

3.1.2 Leakage Power

Subthreshold Leakage

Ideally, CMOS circuits dissipate no static (DC) power since in the steady state

there is no direct path from Vdd to ground. Of course, this assumption can

never be realized in practice since in reality the MOS transistor is not a perfect

switch, which means that there will always be leakage currents even when the

MOS transistors are OFF. The subthreshold leakage power is due to the subthreshold leakage current, i.e. the drain-to-source current running through the channel that occurs, due to the potential difference between source and drain, even when the gate voltage is below the threshold voltage.

In the weak inversion region, the subthreshold current can be calculated by

Eq. 3.2 taken from [5] [6]. From this equation, it is clear that the subthreshold

current depends strongly on different technological parameters, especially the

threshold voltage and temperature.

I_sub = I_0 · (1 − e^(−V_ds / V_th)) · e^((−V_T − V_off) / (n · V_th))    (3.2)

where,

I_0 = µ · (W/L) · V_th² · √(q · ε_si · N_DEP / (2 φ_s));    V_th = k_B · T / q    (3.3)

Here, q is the electron charge, T is the varying temperature, n is the subthreshold swing coefficient, k_B is the Boltzmann constant, N_DEP is the channel doping concentration (similar to N_ch defined in [7]), φ_s is the surface potential, ε_si is the dielectric constant of silicon, µ is the carrier mobility at T_NOM, V_th is the thermal voltage, V_ds is the drain-source voltage, V_off is the


offset voltage, W is the width, L is the length, and VT is the device threshold

voltage.

At the fixed nominal temperature (T_NOM = 27 °C), V_T is defined by a very complex expression (Eq. 3.4) accounting for effects such as the body effect, ∆VT,body_effect; charge sharing, ∆VT,charge_sharing; DIBL, ∆VT,DIBL; the reverse short-channel effect, ∆VT,reverse_short_channel; the narrow-width effect, ∆VT,narrow_width; the small-size effect, ∆VT,small_size; and the pocket implant, ∆VT,pocket_implant. VTH0 is the threshold voltage of a long-channel device at zero bias, and δNP is defined as +1 for NMOS and as −1 for PMOS. For more detailed equations, see [6].

VT = VTH0 + δNP (∆VT,body_effect − ∆VT,charge_sharing − ∆VT,DIBL + ∆VT,reverse_short_channel + ∆VT,narrow_width + ∆VT,small_size − ∆VT,pocket_implant)    (3.4)

Then, the dependence of subthreshold leakage on the varying temperature,

T , is modeled by using temperature-dependent scaling equations:

VT (at T) = VT (at TNOM) + KT · (T / TNOM − 1)    (3.5)

µ (at T) = µ (at TNOM) · (T / TNOM)^UTE    (3.6)

KT = KT1 + KT1L / Leff + KT2 · Vbseff    (3.7)

Here, Vbseff is the effective bulk-source voltage, KT1 is the temperature coefficient of the threshold voltage, KT1L is the channel-length coefficient of the threshold voltage's temperature dependence, KT2 is the bulk-bias coefficient of the threshold voltage's temperature dependence, UTE is the temperature coefficient for the zero-field universal mobility µ0, and Leff is the effective gate length.

Eqs. 3.2–3.7 show how complicated it is to analytically calculate the subthreshold leakage current of even a single MOS transistor. Thus, given the large number of transistors typically found in digital circuits, accurately estimating subthreshold leakage power is obviously a challenging and time-consuming task.
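A rough numerical sketch of Eq. 3.2 follows, with I0 folded into a single prefactor and all device parameters chosen as purely hypothetical illustration values rather than fitted BSIM parameters. The temperature scaling of V_T and µ from Eqs. 3.5–3.7 is deliberately ignored here; only the thermal voltage varies with temperature:

```python
import math

# Numerical sketch of Eq. 3.2 (I0 folded into one prefactor).
# All device parameters are hypothetical illustration values,
# not fitted BSIM parameters for any real process.

K_B = 1.380649e-23     # Boltzmann constant [J/K]
Q   = 1.602176634e-19  # electron charge [C]

def thermal_voltage(temp_k):
    return K_B * temp_k / Q  # Vth = kB * T / q, per Eq. 3.3

def i_sub(i0, v_ds, v_t, v_off, n, temp_k):
    """Eq. 3.2: I_sub = I0 (1 - exp(-Vds/Vth)) exp((-V_T - V_off)/(n Vth))."""
    vth = thermal_voltage(temp_k)
    return i0 * (1.0 - math.exp(-v_ds / vth)) * math.exp((-v_t - v_off) / (n * vth))

# Leakage grows rapidly with temperature (VT lowering per Eq. 3.5 ignored):
cold = i_sub(i0=1e-6, v_ds=1.2, v_t=0.3, v_off=-0.08, n=1.5, temp_k=300.0)
hot  = i_sub(i0=1e-6, v_ds=1.2, v_t=0.3, v_off=-0.08, n=1.5, temp_k=360.0)
print(cold, hot, hot / cold)
```

Even with V_T held constant, a 60 K temperature rise increases the leakage severalfold through the thermal voltage alone; the V_T reduction of Eq. 3.5 makes the real dependence still steeper.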

Gate Oxide Leakage

While subthreshold leakage is still the major source of static power dissipation

in today's technologies, gate leakage is catching up, especially for technologies below 50 nm [3]. For a 2004-generation device, gate leakage power already contributed as much as 15% of the total power dissipation [8]. Therefore, if no solutions emerge to handle the gate leakage problem efficiently (including substantially better high-k materials and gate-leakage-suppression circuit techniques), this scenario will be our reality in less than 10 years' time [3].

The gate leakage is due to the direct tunneling currents that penetrate the thin

gate insulator. Unlike the subthreshold leakage, gate leakage is present in both

off-state and on-state MOS transistors which makes gate leakage more difficult

to control than the subthreshold one. In an on-state transistor, the gate leakage

is the sum of two components: the gate-to-channel and gate-to-source/drain

extension (gate-to-SDE) overlap currents, while in an off-state transistor it is

equal to the edge-direct tunneling (EDT) current [9]. Therefore, gate leakage

strongly depends on the voltage potential on the transistor gate, VG, the gate oxide thickness, Tox, the gate oxide insulator material, K, and the width of the transistor, W, rather than on temperature. The gate leakage current can be approximately described by Eq. 3.8, taken from [10]. It is clear from Eq. 3.8 that gate leakage is reduced if Tox increases. However, this is not a good option, since an increase in Tox also degrades the transistor's effectiveness.

I_{gate} = K \, W \left( \frac{V_{dd}}{T_{ox}} \right)^2 e^{-\alpha \, T_{ox} / V_{dd}}    (3.8)

Here, parameters K and α can be derived experimentally.
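As a sketch, Eq. 3.8 can be evaluated directly once K and α have been fitted to measurements. The constants below are illustrative placeholders, not fitted values; only the qualitative behaviour (super-exponential reduction with thicker oxide) is meaningful:

```python
import math

def gate_leakage(w, t_ox, vdd, k_fit, alpha_fit):
    """Empirical gate leakage model of Eq. 3.8:
    I_gate = K * W * (Vdd / Tox)^2 * exp(-alpha * Tox / Vdd)."""
    return k_fit * w * (vdd / t_ox) ** 2 * math.exp(-alpha_fit * t_ox / vdd)

# Illustrative only: k_fit and alpha_fit stand in for experimentally
# derived constants. A slightly thicker oxide reduces gate leakage
# dramatically, at the cost of degraded transistor performance.
thin  = gate_leakage(w=1.0, t_ox=1.2e-9, vdd=1.1, k_fit=1e-20, alpha_fit=8e9)
thick = gate_leakage(w=1.0, t_ox=1.6e-9, vdd=1.1, k_fit=1e-20, alpha_fit=8e9)
assert thick < thin
```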


Recently, gate leakage power has been intensively studied by many researchers, and some solutions to this problem have been proposed, including the introduction of high-k gate materials [11] and several gate-leakage-suppression circuit techniques [12]. Hence, the overall picture will fortunately not be as bad as predicted in [3].

pn Junction Reverse-Bias Leakage

The pn junction leakage is due to the currents running across the reverse-biased

drain- and source-to-well junctions. It has two major components: (i) minority

carrier diffusion/drift near the edge of the depletion region; (ii) electron-hole

pair generation in the depletion region of the reverse-biased junction [2]. The

junction leakage current is a function of junction area and doping concentration

of the p and n regions. If both p and n regions are heavily doped, band-to-band tunneling (BTBT) dominates the pn junction leakage. In advanced MOS transistors, heavily doped shallow junctions and halo doping are often used to reduce the Short-Channel Effect (SCE), which is why pn junction leakage is usually referred to as BTBT leakage in recent advanced CMOS process technologies.

The BTBT current can be estimated using Eq. 3.9, where m* is the effective mass of the electron; Eg is the energy band gap; Vapp is the applied reverse bias; E is the electric field at the junction (which should exceed 10^6 V/cm); q is the electron charge; h is 1/2π times Planck's constant [2]; Na and Nd are the doping concentrations in the p and n regions, respectively; and Vbi is the built-in voltage across the junction.

J_{BTBT} = A \, \frac{E \, V_{app}}{E_g^{1/2}} \, e^{-B \, E_g^{3/2} / E}    (3.9)

where,

A = \frac{(2m^*)^{1/2} \, q^3}{4 \pi^3 h^2}, \qquad B = \frac{4 \, (2m^*)^{1/2}}{3 \, q \, h}    (3.10)

E = \sqrt{ \frac{2 \, q \, N_a N_d \, (V_{app} + V_{bi})}{\epsilon_{si} \, (N_a + N_d)} }    (3.11)


From Eqs. 3.9–3.11, it is obvious that the BTBT leakage current strongly depends on the doping concentrations and on the total voltage drop across the junction, which must exceed the energy band gap for tunneling to occur.
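This doping sensitivity can be confirmed with a minimal numerical sketch of Eqs. 3.9–3.11 in SI units. The doping levels and effective mass below are illustrative values, not figures from the cited references:

```python
import math

Q = 1.602e-19                  # electron charge, C
HBAR = 1.055e-34               # reduced Planck constant (h/2*pi), J*s
M_E = 9.109e-31                # electron rest mass, kg
EPS_SI = 11.7 * 8.854e-12      # permittivity of silicon, F/m

def btbt_current_density(v_app, e_g_ev, n_a, n_d, v_bi, m_eff=0.2 * M_E):
    """Evaluate Eqs. 3.9-3.11 (J in A/m^2). n_a, n_d in m^-3;
    the band gap e_g_ev is given in eV and converted to joules."""
    e_g = e_g_ev * Q
    # Eq. 3.11: electric field at the reverse-biased junction
    e_field = math.sqrt(2 * Q * n_a * n_d * (v_app + v_bi)
                        / (EPS_SI * (n_a + n_d)))
    # Eq. 3.10: prefactors A and B
    a = math.sqrt(2 * m_eff) * Q ** 3 / (4 * math.pi ** 3 * HBAR ** 2)
    b = 4 * math.sqrt(2 * m_eff) / (3 * Q * HBAR)
    # Eq. 3.9: tunneling current density
    return a * e_field * v_app / math.sqrt(e_g) * math.exp(-b * e_g ** 1.5 / e_field)

# Heavier doping -> higher junction field -> (much) more BTBT leakage.
j_lo = btbt_current_density(1.0, 1.12, 1e24, 1e24, 0.9)
j_hi = btbt_current_density(1.0, 1.12, 5e24, 5e24, 0.9)
assert j_hi > j_lo
```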

3.2 Trend of Development and Emerging Issues

The International Technology Roadmap for Semiconductors (ITRS) [13] predicts the trends of development for future process technologies to meet some key scaling goals: (i) for High Performance (HP) applications, the key target is to maintain the historical 17% per year increase in transistor performance; (ii) for Low Power chips (e.g., mobile applications), the target is specifically a low level of leakage current. For example, for Low STandby Power (LSTP) applications the goal is very low leakage at lower performance, targeting consumer applications, while for Low Operating Power (LOP) applications the target is low dynamic power at relatively higher performance. Recent ITRS main predictions can be briefly summarized as:

1. Technology-node concept: The traditional simple ITRS technology-node concept has rapidly become an oversimplification of the industry's state of the art. This is reflected in the growing confusion of recent years, when a single typical process is used to represent a "technology node" in press releases, conference presentations, publications, etc. The technology-node concept has clearly outlived its usefulness and will therefore be gradually abandoned.

2. Alternative technology: It is expected to become increasingly difficult to effectively scale planar bulk CMOS devices beyond the 65-nm technology generation (with a physical gate length of 25 nm). The major problems are: (i) adequately controlling SCE is projected to become especially problematic; (ii) the channel doping will need to be increased to exceedingly high values, causing reduced mobility and very high BTBT leakage current between drain and body; (iii) the total number of dopants in the channel becomes relatively small, resulting in unacceptably large statistical variation of the threshold voltage. A potential solution is to utilize ultra-thin-body, fully depleted SOI MOSFETs. Single-gate SOI MOSFETs are projected for 2008 for high-performance logic, while more complex and more scalable multiple-gate SOI MOSFETs are projected to be implemented in 2011.

3. Gate oxide scaling: For extended planar bulk CMOS devices, ITRS 2005 projected that high-k gate dielectric and metal gate technology would be required by 2008 to control the leakage. However, the deployment of high-k gate dielectrics and metal gate electrodes has been delayed by two years, until 2010. The Equivalent Oxide Thickness (EOT), defined as Td/(κ/3.9) to represent the relation between a gate dielectric of thickness Td and relative dielectric constant κ, continues to scale, but at a quite slow rate from 2005 through 2007. However, there is a sharp EOT decrease in 2008, when high-k gate dielectrics are assumed to be implemented (Fig. 3.2).

4. Supply voltage: continues to scale (but not very impressively) from

1.1 V in 2007 (for the 65-nm technology generation with the physical

gate length of 25 nm) to 0.9 V in 2013 (for the 32-nm technology gener-

ation with the physical gate length of 13 nm).

5. Major leakage components: The three leakage mechanisms that remain dominant for future processes are subthreshold leakage, gate oxide leakage, and BTBT leakage across the reverse-biased drain- and source-substrate junctions. With technology scaling, each of these components increases drastically, contributing to a dramatic increase in total leakage. Fig. 3.3 shows a near-term prediction of gate and subthreshold leakage for future process technologies until 2012.

6. Emerging research: MOS scaling will likely become ineffective and/or very costly; novel non-CMOS devices, circuits, and architectures are therefore potential solutions.
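The EOT definition Td/(κ/3.9) from item 3 above can be checked with a small worked example. The film thickness and κ value used here are illustrative, not taken from the roadmap tables:

```python
def eot(t_d_nm, kappa):
    """Equivalent Oxide Thickness: EOT = Td / (kappa / 3.9),
    where 3.9 is the relative dielectric constant of SiO2."""
    return t_d_nm / (kappa / 3.9)

# A 3.0-nm high-k film with an assumed kappa of 25 is electrostatically
# equivalent to ~0.47 nm of SiO2, while its much larger physical
# thickness keeps direct-tunneling gate leakage low.
print(round(eot(3.0, 25.0), 3))   # 0.468
```

This is exactly why a sharp EOT decrease is expected once high-k dielectrics are deployed: the electrical thickness can shrink without thinning the physical tunneling barrier.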


Figure 3.2: EOT and gate leakage density scaling for extended planar bulk CMOS devices (ITRS 2006)

Figure 3.3: Scaling in subthreshold leakage for extended planar bulk CMOS devices (ITRS 2006)


Figure 3.4: Gate length scaling for extended planar bulk CMOS devices (ITRS 2006)

3.3 Leakage Power Reduction Techniques

Power-saving techniques are widely used across levels of design abstraction, i.e. software, architecture, circuit, device, and technology [14]. Some approaches utilizing cooperation between different levels of design abstraction have also been reported [15]. Power-saving techniques are usually designed for two different modes of circuit operation: active and sleep. Depending on circuit topology and the associated major sources of power dissipation, different techniques are employed to achieve an efficient reduction in total power dissipation. Architecture-level power-saving techniques include dynamic voltage scaling (DVS), clock gating, frequency-voltage control, and multi-processor design. Circuit-level techniques include selection of logic style, transistor sizing, transistor reordering, logic-gate restructuring, gated clocks, interconnect optimization, layout considerations, power cut-off techniques, and low-power SRAM design with virtual ground [16]. Despite the wide range of power-saving techniques, this section focuses mainly on a survey of the circuit-level power cut-off techniques used for leakage power reduction in digital circuits and SRAM.

Figure 3.5: Leakage current paths in the SCCMOS technique (from [12])

3.3.1 Power Cut-off Techniques

Some power cut-off techniques have mainly targeted subthreshold leakage; these include Super Cut-Off CMOS (SCCMOS) [17], Multi-threshold CMOS (MTCMOS) [18], and Zigzag Super Cut-Off CMOS (ZSCCMOS) [19], an enhanced version of SCCMOS. These techniques suppress subthreshold leakage currents when a logic circuit is not active, i.e. when it is in sleep mode. Since a circuit dissipates leakage power not only in sleep mode, but also in active mode (referred to as active leakage), MTCMOS has recently been used in conjunction with a clock-gating technique to reduce both dynamic and leakage power when the circuit is active [20]. For this type of application, the wake-up time of a power cut-off technique is an important issue. While MTCMOS and SCCMOS have wake-up times of several clock cycles, ZSCCMOS can offer a wake-up time of less than one clock cycle by employing a sophisticated scheme with a virtual ground rail. However, the efficiency of the ZSCCMOS technique is degraded by gate leakage currents [21].


Figure 3.6: Leakage current paths in the ZSCCMOS technique (from [12])

Fig. 3.5 shows the leakage current paths in the SCCMOS technique, where Ig and Isth are the gate and subthreshold leakage currents in inverters A, B, and C, respectively, and Itot is the total leakage current. In ZSCCMOS (Fig. 3.6),

the virtual power rails are connected to the logic transistor nets that are OFF in

sleep mode, while the conducting transistor nets use the external power rails.

This scheme cuts off the subthreshold current paths (dashed arrows); however, the gate leakage paths (solid arrows) from the external supply to the ground rails remain, resulting in a voltage level across the gate insulators close to Vdd. Thus, this technique is inefficient for gate leakage reduction.

The Gate leakage Suppressing CMOS (GSCMOS) technique is shown in

Fig. 3.7, in which an additional virtual supply rail with a separate power switch

was added [12]. The added virtual supply rail 2, connecting to the logic tran-

sistor nets that are conducting while in sleep mode, effectively eliminates all

gate leakage paths. The wake-up time due to the added virtual supply rail can

be limited through some design steps: (i) In sleep mode, the GSCMOS circuit

is forced to the state for which gate and subthreshold leakage components to-

gether exhibit minimal current; (ii) In active mode, the power switches are sized

for equal voltage drops (bounces) on each of the virtual power rails in the worst

case scenario [12].


Figure 3.7: Leakage current paths in the GSCMOS technique (from [12])

Following these design steps, power switches are distributed and sized such that they equalize the charge (discharge) times of the virtual power rails for a transition from sleep to active mode. When the logic circuit operates in active mode, there are gate leakage currents in the on-state power switches; however, this leakage is very small compared to the dynamic currents and is therefore negligible. Due to oxide-stress reliability issues, GSCMOS, like SCCMOS and ZSCCMOS, requires oxide-stress-relaxed level shifters [22] to generate the control voltages (Vgn, Vgp) for the power switches. To force the logic inputs to the required state, GSCMOS must, in the same way as ZSCCMOS, employ flip-flops using a phase-forcing circuit [22]. When GSCMOS enters sleep mode, the voltage of virtual supply rail 2 drops to ~Vdd/2, leaving the internal voltages undefined. Thus, like SCCMOS and ZSCCMOS, GSCMOS must store data in external SRAM cells that are not connected to the virtual power rails [23] before it enters sleep mode.

3.3.2 Leakage-Reduction Techniques for SRAM-based Caches

Leakage-reduction techniques for SRAM-based caches and memories have been studied intensively by many authors. The main techniques include drowsy caches [24], gated-Vdd [25], gated-ground [26], dual-VT [27], MTCMOS [28], dynamic-VT SRAM [29], and reverse/forward body-biased SRAM [30].

The drowsy-cache technique utilizes the dynamic-voltage-scaling (DVS)

principle to reduce leakage power. In active mode, a nominal supply voltage

is provided to memory cells, while in sleep or drowsy mode, a stand-by inter-

mediate voltage level is applied to memory cells to reduce the leakage power.

The stand-by voltage must be higher than the minimum state-preserving voltage

considering process variations in, e.g., transistor VT and channel length [24]. In drowsy mode, accesses to memory cells are not allowed: the voltage level of the bitline pair is higher than that of the cross-coupled inverters inside the cell, so an access could destroy the state information stored in the cell. Moreover, the sense amplifier may not operate properly due to the insufficient driving capability of the accessed cells.

The basic concept of dual-VT is to use low-VT, faster but leakier transistors for circuits in the critical path, and high-VT, slower transistors for the rest of the circuits to suppress unnecessary subthreshold leakage currents. In SRAM-based cache designs, the low-VT transistors are normally used in the peripheral circuits of the caches and in the pass transistors connecting memory cells to the bitlines, while high-VT transistors are used for the memory cells [27]. This technique requires no additional control circuitry and can significantly reduce subthreshold leakage currents compared to an all-low-VT design. Moreover, no data are discarded and no additional cache misses are incurred. However, the technique suffers from longer bitline delay, since the high-VT devices have slower switching speed and lower current drive.

The gated-Vdd and gated-ground techniques reduce leakage power by placing high-VT transistors between the circuit and the power supply rails (Vdd and ground, respectively) to cut off the supply power of a memory cell when the cell is in low-power mode. These high-VT gating transistors effectively reduce the subthreshold leakage power of the memory cell circuit because of the stacking effect and the exponential dependence of subthreshold leakage on VT. The main disadvantage of these techniques is that all state information within the memory cell is lost, which may inflict a significant performance penalty when the memory cell is accessed, and which requires a complex and conservative cache management policy. Furthermore, the gating transistors are in the critical path, resulting in increased cache access time.
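The exponential dependence of subthreshold leakage on VT can be quantified with the standard subthreshold-swing rule of thumb: current changes by one decade for every S millivolts of VT shift. The swing value below is an assumed round number, not a figure from the cited designs:

```python
def leakage_ratio(delta_vt_mv, swing_mv_per_decade=100.0):
    """Ratio of subthreshold leakage currents between two devices whose
    threshold voltages differ by delta_vt_mv, for a subthreshold swing
    S in mV per decade: I_low_vt / I_high_vt = 10^(dVt / S)."""
    return 10.0 ** (delta_vt_mv / swing_mv_per_decade)

# With an assumed swing of 100 mV/decade, a gating transistor whose VT
# is 200 mV higher leaks two orders of magnitude less, which is why a
# single high-VT sleep transistor is so effective.
print(leakage_ratio(200.0))   # 100.0
```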

Leakage currents can also be reduced by dynamically raising the transistor VT through modulation of the back-gate bias voltage [28][29][31]. During normal operation, the memory cell is connected to Vdd and ground, and the back-gate voltages are set to the appropriate power rails. When sleep mode is activated, the p-channel wells are biased using an alternative power-supply voltage, Vdd+, at a higher voltage level than the source terminals, raising the effective VT. All

transistors inside memory cells experience higher VT and therefore the leakage

currents are reduced significantly. The major advantage of MTCMOS is that memory cell values are preserved during sleep mode, whereas the disadvantages include: (i) an additional power-supply voltage that must be distributed throughout the array; (ii) larger electric fields across the transistor gates during sleep, which may affect the reliability of the memory cells; and (iii) a latency penalty to awaken a line in sleep mode before data can be accessed [28].

Among the above-mentioned techniques, drowsy caches have received con-

siderable attention; it was shown in [32] that total cache leakage energy was

reduced by an average of 76% at a wakeup penalty, for a drowsy cache line, of

no more than one cycle. Moreover, drowsy caches can be implemented easily using simple control circuits that assign different voltage levels, called tranquility levels, at different priority levels, based on information from the replacement policy used [33]. These advantages make the drowsy cache one of the most widely used techniques for leakage reduction in caches and SRAM arrays.


Bibliography

[1] A. Keshavarzi, K. Roy, and C. F. Hawkins, "Intrinsic leakage in low power deep submicron CMOS IC's," in Proceedings of International Test Conference (ITC), 1997, pp. 146–155.

[2] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand, "Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits," Proceedings of the IEEE, vol. 91, no. 2, pp. 305–327, Feb. 2003.

[3] D. Helms, E. Schmidt, and W. Nebel, "Leakage in CMOS Circuits – An Introduction," in Proceedings of International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS'04), LNCS 3254, Sept. 2004, pp. 17–35.

[4] A. Agarwal, S. Mukhopadhyay, A. Raychowdhury, K. Roy, and C. H. Kim, "Leakage Power Analysis and Reduction for Nanoscale Circuits," IEEE Micro, vol. 26, no. 2, pp. 68–80, Apr. 2006.

[5] W. Liu, MOSFET Models for SPICE Simulation including BSIM3v3 and BSIM4, John Wiley & Sons, Inc., 2001.

[6] Univ. California Berkeley Device Group, BSIM4.2.1 MOSFET Model: User's Manual, Dept. of EECS, Univ. of California, Berkeley, CA 94720, USA, 2002.

[7] University of California Berkeley Device Group, BSIM3v3.2.2 Manual, Device Research Group of the Dept. of EE and CS, University of California, Berkeley, 1999.

[8] R. M. Rao, J. L. Burns, A. Devgan, and R. B. Brown, "Efficient Techniques for Gate Leakage Estimation," in Proceedings of International Symposium on Low Power Electronics and Design (ISLPED), Sept. 2003, pp. 17–35.

[9] M. Draždžiulis and P. Larsson-Edefors, "A Gate Leakage Reduction Strategy for Future CMOS Circuits," in European Solid-State Circuits Conference (ESSCIRC), 2003, pp. 317–320.

[10] N. S. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin, M. Kandemir, and V. Narayanan, "Leakage Current: Moore's Law Meets Static Power," IEEE Computer, vol. 36, no. 12, pp. 68–75, Dec. 2003.

[11] R. Chau, S. Datta, M. Doczy, J. Kavalieros, and M. Metz, "Gate dielectric scaling for high-performance CMOS: from SiO2 to High-K," in Extended Abstracts of International Workshop on Gate Insulator (IWGI 2003), Nov. 2003, pp. 124–126.

[12] M. Draždžiulis, P. Larsson-Edefors, D. Eckerbert, and H. Eriksson, "A power cut-off technique for gate-leakage suppression," in European Solid-State Circuits Conference, Sept. 2004, pp. 171–174.

[13] International Technology Roadmap for Semiconductors, http://public.itrs.net, ITRS, 2006.

[14] T. Sakurai, "Perspectives on Power-Aware Electronics," in Digest of Technical Papers, IEEE International Solid-State Circuits Conference, 2003, vol. 1, pp. 26–29.

[15] T. Sakurai, "Minimizing Power Across Multiple Technology and Design Levels," in IEEE/ACM International Conference on Computer Aided Design, 2002, pp. 24–27.

[16] V. Venkatachalam and M. Franz, "Power Reduction Techniques for Microprocessor Systems," ACM Computing Surveys, vol. 37, no. 3, pp. 195–237, Sept. 2005.

[17] H. Kawaguchi et al., "A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V Supply Voltage With Picoampere Stand-By Current," IEEE Journal of Solid-State Circuits, vol. 35, no. 10, pp. 1498–1501, Oct. 2000.

[18] S. Mutoh et al., "1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS," IEEE Journal of Solid-State Circuits, vol. 30, no. 8, pp. 847–854, Aug. 1995.

[19] K.-S. Min et al., "Zigzag Super Cut-Off CMOS (ZSCCMOS) Block Activation with Self-Adaptive Voltage Level Controller: An Alternative to Clock-Gating Scheme in Leakage Dominant Era," in International Solid-State Circuits Conference (ISSCC), 2003, pp. 400–402.

[20] J. W. Tschanz et al., "Dynamic Sleep Transistor and Body Bias for Active Leakage Power Control of Microprocessors," IEEE Journal of Solid-State Circuits, vol. 38, no. 11, pp. 1838–1845, Nov. 2003.

[21] M. Draždžiulis and P. Larsson-Edefors, "Evaluation of Power Cut-off Techniques in the Presence of Gate Leakage," in Proceedings of the International Symposium on Circuits and Systems (ISCAS), May 2004, pp. 475–478.

[22] K.-S. Min et al., "Zigzag Super Cut-Off CMOS (ZSCCMOS) Block Activation with Self-Adaptive Voltage Level Controller: An Alternative to Clock-Gating Scheme in Leakage Dominant Era," in Digest of Technical Papers of International Solid-State Circuits Conference, 2003, pp. 400–402.

[23] H. Kawaguchi et al., "A Super Cut-Off CMOS (SCCMOS) Scheme for 0.5-V Supply Voltage With Picoampere Stand-By Current," IEEE Journal of Solid-State Circuits, vol. 35, no. 10, pp. 1498–1501, Oct. 2000.

[24] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge, "Drowsy Caches: Simple Techniques for Reducing Leakage Power," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA), May 2002, pp. 148–157.

[25] M. Powell, S. H. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar, "Gated-Vdd: A Circuit Technique to Reduce Leakage in Deep-Submicron Cache Memories," in Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), July 2000, pp. 90–95.

[26] A. Agarwal, H. Li, and K. Roy, "A Single Vth Low-leakage Gated-ground Cache for Deep Submicron," IEEE Journal of Solid-State Circuits, vol. 38, no. 2, pp. 319–328, Feb. 2003.

[27] F. Hamzaoglu, Y. Ye, A. Keshavarzi, K. Zhang, S. Narendra, S. Borkar, M. Stan, and V. De, "Analysis of dual-VT SRAM cells with full-swing single-ended bit line sensing for on-chip cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 10, no. 2, pp. 91–95, Apr. 2002.

[28] T. Douseki, N. Shibata, and J. Yamada, "A 0.5-1V MTCMOS/SIMOX SRAM Macro with Multi-Vth Memory Cells," in Proceedings of IEEE International SOI Conference, Oct. 2000, pp. 24–25.

[29] C. H. Kim and K. Roy, "Dynamic Vth SRAM: A Leakage Tolerant Cache Memory for Low-voltage Microprocessors," in Proceedings of International Symposium on Low Power Electronics and Design (ISLPED), Aug. 2002, pp. 251–254.

[30] C. H. Kim, J. J. Kim, S. Mukhopadhyay, and K. Roy, "A Forward Body-biased Low-leakage SRAM Cache: Device and Architecture Considerations," in Proceedings of International Symposium on Low Power Electronics and Design (ISLPED), Aug. 2003, pp. 6–9.

[31] K. Nii, H. Makino, Y. Tujihashi, C. Morishima, Y. Hayakawa, H. Nunogami, T. Arakawa, and H. Hamano, "A Low Power SRAM using Auto-backgate-controlled MT-CMOS," in Proceedings of International Symposium on Low Power Electronics and Design (ISLPED), Sept. 1998, pp. 293–298.

[32] K. Flautner et al., "Drowsy Caches: Simple Techniques for Reducing Leakage Power," in Proceedings of the 29th Annual International Symposium on Computer Architecture (ISCA), May 2002, pp. 148–157.

[33] N. Mohyuddin et al., "Controlling Leakage Power with the Replacement Policy in Slumberous Caches," in Proceedings of the Conference on Computing Frontiers (CF), May 2005, pp. 161–170.


4 Cache Power Modeling – Tool Perspective

This chapter provides a review of existing power estimation and performance analysis tools for microprocessors (Section 4.1) and information on some existing power estimation tools for on-chip caches (Section 4.2). Finally, Section 4.3 presents some background on power modeling in general, its classification, and its areas of application. Detailed descriptions of the power models used in several existing power estimation and performance analysis tools are also given in that section.



4.1 Architecture-level Performance Simulators and Power Dissipation Estimators: a Survey of Existing Tools

During the past decade, a fair amount of research effort has been directed towards developing tools for superscalar microprocessors and for multiprocessor systems. Examples of performance analysis tools1 include Simics [1], SimOS [2], SimpleScalar [3], HydraScalar [4], and RSIM [5].

Simics and SimOS are complete system simulation platforms that can func-

tionally model the execution of complex software systems on the instruction-set

architectural abstraction level. They are designed to boot and run commercial

unmodified operating systems, with realistic workloads, and can simulate sev-

eral types of superscalar microprocessors at the instruction-set level, including

the full supervisor state [1].

SimpleScalar is a powerful microarchitectural simulation infrastructure that

has the capability of modeling a whole range of superscalar microarchitec-

tural designs. It can model a variety of platforms ranging from simple un-

pipelined processors to the detailed dynamically scheduled microarchitectures

with multiple-level memory hierarchies. It has fairly small code sizes and offers

a documented and well-structured design [3].

HydraScalar is an expanded version of SimpleScalar (version 2.0) that accurately models a wide-issue, out-of-order-execution, multipath superscalar processor [4].

RSIM is an execution-driven simulator that simulates a variety of shared-memory superscalar multiprocessor (and uniprocessor) architecture configurations. It can model state-of-the-art instruction-level parallelism (ILP) multiprocessors, a high-performance memory system, and a multiprocessor coherence protocol and interconnect, including contention at all resources [5].

1They are also referred to as performance simulators.

Together with these performance analysis tools, several power dissipation es-

timation tools for superscalar processors have also been designed, including

Wattch [6], SimplePower [7], TEM2P2EST [8], AccuPower [9], HotLeakage [10],

and PowerTimer [11].

Wattch is an architecture-level power dissipation estimator whose princi-

ple is based on a suite of parameterizable power models for different hard-

ware structures and on per-cycle resource usage counts generated through a

cycle-level simulator. Basically, Wattch is built upon the SimpleScalar (version

3.0) out-of-order simulator that has been extended conceptually from a 5-stage

pipeline to an 8-stage pipeline [6].

TEM2P2EST is another power dissipation estimator built upon the SimpleScalar (version 2.0) out-of-order simulator. The main difference between the two power dissipation estimators lies in their power models for estimating active power dissipation. Neither simulator estimates static power dissipation; both simply assume that it is about 10% of the active power dissipation [8].

SimplePower is an execution-driven, cycle-accurate RTL power estimation

tool that is used in evaluating algorithmic, architectural and compiler optimiza-

tions. SimplePower is based on the architecture of a simple 5-stage pipelined

datapath and simulates only the integer subset of the instruction set of Sim-

pleScalar [7]. It simulates the executables, which are converted from bench-

mark programs by using the SimpleScalar compiler toolset, providing cycle-

by-cycle energy estimates and switch capacitance statistics for the processor

datapath, memory and on-chip buses [12].


AccuPower is a power estimation tool that combines a true hardware-level, cycle-level microarchitectural simulator with energy/power dissipation coefficients extracted from SPICE data of actual CMOS layouts of critical datapath components, in order to accurately estimate the power dissipation of superscalar microprocessors with several variants of the superscalar datapath. AccuPower is a heavily modified version of the SimpleScalar simulator, in particular of the Register Update Unit in the datapath, designed to mimic an actual hardware implementation of modern superscalar microprocessors [9].

HotLeakage is a micro-architectural simulation tool based on Wattch and

the Cache-decay simulator. In this work, Parikh et al. [10] developed an archi-

tectural model for subthreshold and gate leakage that explicitly captures temper-

ature, voltage, and parameter variations. This was an attempt to further develop

the methodology of Butts and Sohi [13] to address the effect of temperature on

leakage power dissipation.
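The Butts-Sohi methodology that HotLeakage extends models static power at the architecture level as P_static = Vdd · N · k_design · Î_leak, where N is the transistor count of a structure, k_design a design-style factor, and Î_leak a normalized per-transistor leakage current. A minimal sketch of this model follows; the parameter values are purely illustrative, not calibrated to any real process:

```python
def static_power(vdd, n_transistors, k_design, i_leak_per_transistor):
    """Butts-Sohi style architecture-level static power estimate:
    P_static = Vdd * N * k_design * I_leak.
    k_design captures the design style (e.g. SRAM cell vs. logic);
    I_leak captures technology and temperature dependence."""
    return vdd * n_transistors * k_design * i_leak_per_transistor

# Illustrative numbers only: a 32-KB SRAM array of 6T cells,
# i.e. 32 * 1024 * 8 bits * 6 transistors per cell.
n = 32 * 1024 * 8 * 6
p_leak = static_power(vdd=1.1, n_transistors=n, k_design=1.2,
                      i_leak_per_transistor=20e-9)
assert p_leak > 0.0   # on the order of tens of mW with these assumptions
```

HotLeakage's contribution is essentially to make Î_leak a function of temperature, voltage, and process parameters instead of a constant.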

Besides the public tools, there also exist accurate power estimation tools

that are available within the organizations of individual microprocessor ven-

dors for specific architectures. Examples of these tools include IBM’s MET

and its associated power estimation components PowerTimer for a PowerPC

implementation [11], and Compaq’s ASIM for simulating and estimating tran-

sition activity within Alpha processor implementations [14]. Nevertheless, in

the following paragraphs, a brief review of those tools is given to provide a

complete overview of existing power-performance estimation tools.

Based on publicly available documents released by IBM [11][15], PowerTimer is a toolset developed for use in early-stage, microarchitecture-level power-performance analysis of microprocessors. It includes a parameterized microarchitecture evaluation toolset (MET), a cycle-accurate performance simulator (Turandot) within the MET, and research microarchitecture power models (RMAP). For general research studies, Turandot/MET is used to read instructions from a program's executable code, or from its traces, and then simulate the timing flow within the targeted processor. All timing issues, such as pipeline latencies and stall/flush occurrences, are modeled as accurately as possible, enabling Turandot/MET to generate an accurate performance figure (in processor cycles) for the given input program. For the design of a new PowerPC processor, a baseline cycle-accurate performance simulator is selected accordingly. These cycle-accurate performance simulators are the property of IBM and are not publicly available.

The microarchitecture-level energy models used in PowerTimer are derived

based on either (i) energy characterization data obtained by using low-level

circuit- and simulation-based research tools (e.g. circuit-simulation-based, or

RTL-simulation-based, or actual hardware-measurement-based tools) for avail-

able components of previous designs; (ii) analytical models built for charac-

terizing the power on the basis of the implementation structure of each mi-

croarchitectural entity or event (at the gate-level or circuit-level with or without

interconnect effects). Those energy models are implemented in C as energy

functions and called RMAP.

The PowerTimer toolset is currently in use to provide early-stage power

performance analysis and microarchitecture definition of high-end, general pur-

pose IBM PowerPC processors [15]. This toolset is not accessible outside the

IBM corporation.

ASIM is a performance modeling framework used in the Compaq (formerly

Digital) processor design team to simulate and predict performance of Alpha

processors [16]. ASIM consists of collections of modules (implemented in
C++), each of which represents a physical component of a targeted pro-
cessor or captures a hardware algorithm's operation. Each ASIM module is

designed as a software component providing a well-defined interface for users

(or developers) to reuse modules in different contexts or replace them with other

modules implementing a different algorithm for the same function. Each mod-

ule interface uses method calls to communicate between a module and its em-

bedded sub-modules, and uses ports to provide communication and timing be-

tween modules. Using this framework, several performance models for unipro-


cessors, vector processors, chip multiprocessors, etc. have been developed. In
order to obtain the performance figure of a microprocessor, users need to create
a performance model for it and then run ASIM on that model with a program
or benchmark. Since ASIM is used mainly for simulating and estimating the
performance of Alpha processors, it does not provide any information about the
power consumed by those processors; moreover, it is not accessible outside the
Compaq corporation.

All of the above-mentioned performance-power estimation tools are designed
to estimate power dissipation and performance of superscalar single- and
multi-processor architectures, but none of them is dedicated to single or parallel
DSP architectures. Clearly, there is a lack of efficient Power-Performance

Simulator/Estimators for DSP parallel architectures. This is the area in which

a power-performance simulator for DSPs (e.g. the DSP-PP simulator which is

presented in Chapter A) is intended to contribute.

4.2 High-Level Power Estimation Tools for Caches

During the past decade, some research efforts have also been directed towards
developing analytical models for estimating dynamic and static power dissipation
for SRAM-based caches and SRAM arrays at the architecture level; however,
only a few power models have been made publicly available.

CACTI is one of the most widely used power estimation tools in the public

domain [17]. It offers analytical timing and energy models for un-partitioned

and partitioned on-chip caches. In its previous versions 1.0, 2.0 and 3.2, CACTI

used only ideal first-order scaling for technology trends. Further, it did not

include any leakage power models.

The recently released CACTI version (4.0 [18]) is updated with respect

to basic circuit structures, to device parameters for an improved technology

scaling, and to leakage models, in that a model based on Hotleakage [19] and

eCACTI [20] is added. However, the added model still fails to accurately ac-


count for small-channel effects, gate leakage, and terminal voltage dependen-

cies in transistor stacks—the model error in estimating leakage power dissipa-

tion was claimed to be below 21.5% [20].

Zeng et al. developed the Predictor of Access and Cycle Time for Cache

Stack (PRACTICS) tool [21] that uses analytical models (i.e. similar to CACTI)

to determine an optimal design for partitioned caches by exhaustive compari-

son of alternative memory configuration parameters. Although PRACTICS pro-

vides more accurate estimates of interconnect effects in comparison to CACTI 3.2,

it still does not include power models for leakage estimation, and therefore has

limited accuracy in estimating total power dissipation.

4.3 Power Dissipation Estimation Models

4.3.1 High-level Power Dissipation Estimation Methodology

In general, architecture-level power dissipation estimation methods can be clas-

sified into two groups: Analytical (statistical) and Simulation-based.

Analytical power estimation models have been used in several projects, e.g.

[6] [13] [10] and [22]. The advantage of the analytical model is the simplicity

of the formulas used to calculate the dynamic and leakage power dissipation

estimates. These simple formulas allow architects to rapidly obtain the power

estimates and consider power characteristics of alternative designs. However,

analytical models usually offer low accuracy compared to the estimates from

circuit-level power estimation tools like SPICE and its clones. Moreover, due

to the simplicity of the formulas, analytical models may not cover the complete

deep-submicron behavior of MOS transistors and wiring, causing an unaccept-

able decrease in accuracy [23].


In contrast to the analytical approach, simulation-based power estimation
methods offer very accurate power estimates at the price of long estimation
run-times. Simulation-based power estimation methods can be implemented
by either table-based or equation-based power models. The difference between
them is that table-based models store "discrete" tabulated power dissipation
values, while equation-based models are mathematical equations resulting from
a "generalization" of those "discrete" power estimates using curve-fitting
techniques, e.g. linear and non-linear regression. A more detailed description
of the analytical, table-based and equation-based power models is given in the
next section.
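To make this distinction concrete, the two model styles can be sketched as follows; the characterization data below are hypothetical placeholder values, not numbers from any of the cited tools:

```python
import numpy as np

# Hypothetical characterization data: per-access energy (pJ) obtained
# from circuit-level simulation at a few discrete SRAM sizes (kB).
sizes = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
energy = np.array([1.1, 1.9, 3.4, 6.2, 11.5, 21.8])

def energy_table(size_kb):
    """Table-based model: look up the 'discrete' tabulated values
    (with linear interpolation between characterized points)."""
    return float(np.interp(size_kb, sizes, energy))

# Equation-based model: 'generalize' the same discrete estimates into a
# closed-form expression via curve fitting (here, log-log regression).
slope, intercept = np.polyfit(np.log2(sizes), np.log2(energy), deg=1)

def energy_equation(size_kb):
    return float(2.0 ** (slope * np.log2(size_kb) + intercept))
```

At a characterized point such as 8 kB the two models agree closely; between and beyond the characterized points, the table interpolates while the equation extrapolates the fitted trend.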

4.3.2 Analytical Models

Some research efforts have been directed towards developing analytical models for

estimating dynamic and static power dissipation at the architectural design level.

The first two models, the Cai-Lim and the Wattch models, are fundamentally
similar, both relying on activity-based power models to estimate dynamic power
dissipation [24].

Cai-Lim Power Estimation Models

The Cai-Lim power model is an activity-sensitive power model built on the Sim-

pleScalar 2.0 out-of-order simulator [25]. It partitions the basic SimpleScalar

architecture into 17 hardware structures that are further subdivided into a total

of 32 physical blocks. Each physical block is then further divided into power

density and area for both active and inactive contributions from dynamic, static,

programmable logic array (PLA), clock, and memory sections of the block.

Area estimates are based on publicly available designs with additional area al-

located for clocking, interconnects, and power supply. Active circuit power

density is estimated from SPICE simulations of typical designs based on Tai-

wan Semiconductor Manufacturing Corporation (TSMC) 0.25-µm process files.

Then, power density numbers are used as constants in conjunction with the ac-


tivity counters to model power dissipation. The basic power estimation formu-

las are as follows:

Overall Power Dissipation:

$P_c = P_{active} + P_{static} \approx P_{dynamic} + P_{leakage}$

$P_c = \sum_i \Big\{ EAF \cdot \sum_m (EA \cdot APD)_m \Big\}_i + \sum_i \Big\{ (1 - EAF) \cdot \sum_m (EA \cdot IPD)_m \Big\}_i$   (4.1)

Dynamic Power Dissipation:

$P_{dynamic} = \sum_i \{ Power(active)_i \} = \sum_i \Big\{ EAF \cdot \sum_m (EA \cdot APD)_m \Big\}_i$   (4.2)

Static Power Dissipation:

$P_{leakage} = \sum_i \{ Power(inactive)_i \} = \sum_i \Big\{ (1 - EAF) \cdot \sum_m (EA \cdot IPD)_m \Big\}_i$   (4.3)

Here, EAF is the effective activity factor, EA is the effective area, APD is
the active power density, and IPD is the inactive power density; the index i
runs over cycles and m over circuit types.
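As a sketch of how Eqs. 4.1-4.3 combine per-cycle activity with pre-characterized densities (the EA, APD and IPD values below are hypothetical placeholders, not the Cai-Lim model's actual constants):

```python
# Hypothetical per-block characterization data for two circuit types m:
# (effective area EA [mm^2], active power density APD [mW/mm^2],
#  inactive power density IPD [mW/mm^2]).
BLOCKS = [
    (0.50, 120.0, 4.0),  # e.g. a memory section
    (0.20, 300.0, 9.0),  # e.g. a dynamic-logic section
]

def cai_lim_power(eaf_trace):
    """Eqs. 4.1-4.3: sum over cycles i (one EAF value per cycle)
    and over circuit types m."""
    p_dynamic = sum(eaf * ea * apd
                    for eaf in eaf_trace for ea, apd, _ in BLOCKS)
    p_leakage = sum((1.0 - eaf) * ea * ipd
                    for eaf in eaf_trace for ea, _, ipd in BLOCKS)
    return p_dynamic, p_leakage, p_dynamic + p_leakage
```

For a fully active cycle followed by an idle one, `cai_lim_power([1.0, 0.0])` yields 120 mW of dynamic power and 3.8 mW of leakage in this toy setting, illustrating how EAF steers each cycle's contribution between the active and inactive terms.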

The Cai-Lim model tracks how a hardware structure is used by breaking

it down into different types of accesses and then counting each time that type

of access occurs during a cycle. This structural breakdown and its associated
information provide an opportunity for detailed modeling and the ability to track
reductions in dynamic activity [24]. All values for power densities and areas have
been pre-computed and included as part of the source code of the power estimator.
Cai-Lim does not claim any specific accuracy, but in general an accuracy of
75% of layout-level power tools is expected [24].


Wattch Power Estimation Models

Wattch is a collection of power models. Wattch divides the main microprocessor
units into four categories: array structures (including the data and instruction
caches, cache tag arrays, all register files, the register alias table, the branch
predictors, large portions of the instruction window, and the load/store queue),
content-addressable memories (including the instruction window/reorder buffer
wakeup logic, load/store order checks, and TLBs), combinational logic and wires
(including functional units, instruction window logic, and result busses), and
clocking (clock buffers, clock wires and capacitive loads). Wattch uses power
models for these basic components, where one of them is an "all components
always on" model and the remaining three are activity sensitive with varying
degrees of conditional clocking enabled [6]. The basic power estimation formulas
are as follows:

Overall Power Dissipation:

$P_c = P_{active} + P_{static} \approx P_{dynamic} + P_{leakage}$   (4.4)

Dynamic Power Dissipation:

$P_{dynamic} = a \, C \, V_{dd}^2 \, f$   (4.5)

Static Power Dissipation: assumed to be 10% of $P_{dynamic}$

Activity factors a for certain critical sub-circuits are obtained from benchmark
programs using an architectural simulator, SimpleScalar. Otherwise, a = 1 is
used for circuits that precharge and discharge on every cycle, and a = 0.5 for
sub-circuits whose activity cannot be simulated. The supply voltage Vdd and
clock frequency f are taken from the assumed 0.35-µm process technology.

The load capacitance C is estimated based on the circuit and the transistor siz-

ing using the formulas shown in Table 4.1 [6].
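As an illustration, the array-structure wordline entry of Table 4.1 can be plugged into Eq. 4.5; the capacitance and process numbers below are hypothetical placeholders chosen only for the sketch, not Wattch's actual constants:

```python
def wordline_capacitance(c_diff_driver_f, c_gate_cell_f, num_bitlines,
                         c_metal_f_per_um, wordline_len_um):
    """Array-structure wordline capacitance, following the form of
    Table 4.1: Cdiff(WordLineDriver) + Cgate(CellAccess)*NumBitlines
    + Cmetal*WordLineLength. All capacitances in farads."""
    return (c_diff_driver_f
            + c_gate_cell_f * num_bitlines
            + c_metal_f_per_um * wordline_len_um)

def dynamic_power_w(a, c_farads, vdd_volts, f_hz):
    """Eq. 4.5: Pdynamic = a * C * Vdd^2 * f."""
    return a * c_farads * vdd_volts ** 2 * f_hz

# Hypothetical numbers in the spirit of a 0.35-um process:
c_wl = wordline_capacitance(2e-15, 1.5e-15, 256, 0.2e-15, 400.0)
p_wl = dynamic_power_w(a=1.0, c_farads=c_wl, vdd_volts=3.3, f_hz=300e6)
```

With these placeholder values the wordline capacitance comes to 466 fF, giving roughly 1.5 mW of dynamic power for a wordline toggled every cycle.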

Wattch claims an accuracy within 10% of layout-level power tools and provides
validation results that indicate an average accuracy of ±13% when comparing
relative power against known relative powers for implemented architectures
(Pentium Pro and Alpha 21264) [6] and [24]. Wattch uses technology scaling
factors included for processes ranging from 0.1-µm to 0.8-µm in its power
models.

Table 4.1: The equations for capacitance of critical nodes

Array Structure, register-file wordline:
    C = Cdiff(WordLineDriver) + Cgate(CellAccess) × NumBitlines + Cmetal × WordLineLength

Array Structure, register-file bitline:
    C = Cdiff(Precharge) + Cdiff(CellAccess) × NumWordlines + Cmetal × BitLineLength

CAM Structure, CAM tagline:
    C = Cgate(CompareEnable) × NumberTags + Cdiff(CompareDriver) + Cmetal × TagLineLength

CAM Structure, CAM matchline:
    C = 2 × Cdiff(CompareEnable) × TagSize + Cdiff(MatchPrecharge) + Cdiff(MatchOR) + Cmetal × MatchLineLength

Complex Logic Blocks, result bus:
    C = 0.5 × Cmetal × NumALU × ALUHeight + Cmetal × RegisterFileHeight

Both Wattch and Cai-Lim power models are based on the SimpleScalar

toolset that is commonly used to model microarchitectures in educational and

some research environments. They are fairly flexible and acceptably accurate

for process technologies of 0.25-µm and 0.35-µm. However, they still have

some shortcomings: The lack of directly accessible details on scaling factors
limits the Cai-Lim model's ability to directly compute relative contributions
to power from different blocks. The model is also difficult to extend


without examining the original process files to determine how to incorporate

new hardware structures. Wattch provides for greater access to the underlying

details of the models than the Cai-Lim model. Counters for different types of ac-

cesses are employed, but many details are still left out. This lack of granularity

in access counting limits Wattch’s ability to identify activity reduction power

savings. In addition, Wattch does not estimate the inactive power dissipation

due to subthreshold leakage current, but simply assumes that its contribution is

just 10% of the active power. These models, therefore, have limited accuracy

and lack scalability to future technology processes.

Butts-Sohi Static Power Models

Butts and Sohi [13] proposed a generic, high-level model for micro-architecture

components. The model is based on a key design parameter, Kdesign, capturing

device type, device geometry and stacking factors that can be obtained based

on simulations. Its model of subthreshold leakage accurately addresses several
issues affecting static power in a way that makes it easy to reason about
leakage effects at the micro-architectural level. However, it turns out

not to be well suited for some types of SRAM circuits with power-saving and

leakage-reduction techniques like MT-CMOS, Gated-Vdd, and Drowsy Cache.

Also, it was never released as publicly available software.
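A minimal sketch of this style of model follows; only the overall structure Pstatic = Vdd · N · kdesign · Ileak reflects the Butts-Sohi formulation [13], while all numerical values are hypothetical:

```python
def static_power(vdd_volts, num_devices, k_design, i_leak_amps):
    """Butts-Sohi-style static power estimate:
    Pstatic = Vdd * N * kdesign * Ileak, where Ileak is a normalized
    per-device subthreshold leakage current for the technology and
    kdesign captures device type, geometry and stacking factors."""
    return vdd_volts * num_devices * k_design * i_leak_amps

# Hypothetical example: one million devices, kdesign = 5,
# 1 nA of normalized leakage per device at a 1.0 V supply.
p = static_power(1.0, 1_000_000, 5.0, 1e-9)  # -> 5e-3 W (5 mW)
```

The appeal at the architectural level is that N is a simple structure count and kdesign a single empirical factor per circuit style; the drawback noted above is that one fixed kdesign cannot represent circuits augmented with MT-CMOS, Gated-Vdd or Drowsy-Cache techniques.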

A Temperature-Aware Static Power Model (HotLeakage)

Parikh et al. [10] developed an architectural model for subthreshold and gate

leakage that explicitly captures temperature, voltage, and parameter variations.

This model was implemented in the micro-architectural HotLeakage simulation

tool based on Wattch and the Cache-decay simulator. This was an attempt to

develop the methodology of Butts and Sohi to address the effect of temperature

on leakage power dissipation. However, the accuracy of the leakage power

estimation for any complex circuit structures like memory arrays, caches, etc.,

is unknown.


An enhanced CACTI (eCACTI)

Another effort to further develop the methodology of Butts and Sohi is the work
by Mamidipaka et al. given in [22], [26] and [27]. In this work, the authors developed

analytical models parameterized in terms of high-level design parameters to

estimate leakage power in SRAM arrays. An error margin of "less than 23.9%"

compared to HSPICE power values is achieved by this method. These analytical

models are then implemented in an architecture-level power tool for SRAM
arrays, called eCACTI [20].

Research Microarchitecture Power Models (RMAP) of PowerTimer

Microarchitecture-level energy models used in PowerTimer can be derived based

on either (i) energy characterization data obtained by using a low-level circuit-

and simulation-based research tool (i.e. CPAM [28]) for components of previ-

ous designs; (ii) analytical models built for characterizing power on the basis of

the implementation structure of each microarchitectural entity or event.

In practice, RMAP consists of energy models implemented in C, which are

derived by using several methodological paths: (i) model formulation is based

on the unit-level and pipeline stage-level latch counts (called latch-based energy

models) that are estimated either from logic-level bit specifications of individual

functions or from area and latch-density of prior designs; (ii) model formulation

is based on detailed macro-level power simulation data that is available from

prior processor projects, and a utility script used to convert those data into high-

level, unit-specific energy functions; (iii) when detailed circuit schematics are

available, model formulation is based on low-level energy data generated by

CPAM for those circuits. Energy models for each microarchitecture block are

then formulated by collecting and abstracting those obtained energy data.

4.3.3 Table-based and Equation-based Models

Schmidt et al. [29] developed an automatic black box memory-modeling ap-

proach based on nonlinear regression, which intends to combine good model


properties (i.e. accuracy, speed, etc.) with good modeling properties (i.e. au-

tomatism, adaptability to design flow, low overhead and IP protection). Never-

theless, this approach offers advantages at the price of a complex and compu-

tationally expensive model characterization phase. For typical memory arrays

whose regular internal structures are known and can easily be analyzed, a white

box modeling approach (e.g. our approach [30]) can be a good alternative to

the black box one, offering a simpler and faster model characterization phase.

Our approach is described in detail in Chapter 5.

In [31] Eckerbert et al. presented a methodology to accurately estimate to-

tal power dissipation (including static power) at the RT-level using simulation-

based power estimation models. The methodology takes into account the changes

in the component environment, which occur between characterization and esti-

mation. By separating the different power dissipation mechanisms this method-

ology achieves high degrees of accuracy in estimating power dissipation. Al-

though it is a complex and accurate RT-level simulation approach and mainly

focuses on estimating total power dissipation of complex components, such as

arithmetic-logic circuits, it still serves as a good hint for us. Furthermore, this

methodology can be used together with our proposed approach, where neces-

sary (e.g. ALUs, MACs, etc.), providing an architecture-level solution to the

problem of estimating total power dissipation of all processor components.

Bibliography

[1] S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg,

F. Larsson, A. Moestedt, and B. Werner, “SIMICS: A Full System Simulation

Platform,” IEEE Computer, pp. 50–58, Feb. 2002.

[2] M. Rosenblum, E. Bugnion, A. Herrod, and S. Devine, “Using the SimOS Machine

Simulator to Study Complex Computer Systems,” ACM Transactions on Modeling

and Computer Simulation, vol. 7, no. 1, pp. 78–103, Jan. 1997.

[3] T. Austin, E. Larson, and D. Ernst, “SimpleScalar: An Infrastructure for Computer

System Modeling,” IEEE Computer, pp. 59–67, Feb. 2002.


[4] K. Skadron and Pritpal S. Ahuja, “HydraScalar: A Multipath-Capable Simulator,”

Newsletter of the IEEE Technical Committee on Computer Architecture, pp. 65–70,

Jan. 2001.

[5] C. Hughes, V. Pai, P. Ranganathan, and S. Adve, “RSIM: simulating Shared-

Memory Multiprocessors with ILP Processors,” IEEE Computer,

pp. 40–49, Feb. 2002.

[6] D. Brooks, V. Tiwari, and M. Martonosi, “Wattch: A Framework for Architectural-
Level Power Analysis and Optimizations,” in Proceedings of the Annual Interna-
tional Symposium on Computer Architecture, June 2000, pp. 83–94.

[7] N. Vijaykrishnan, M. Kandemir, J. Irwin, H. Kim, and W. Ye, “Energy-Driven
Hardware-Software Optimizations Using SimplePower,” in Proceedings of the
Annual International Symposium on Computer Architecture, June 2000, pp. 95–106.

[8] A. Dhodapkar, C. Lim, G. Cai, and R. Daasch, “TEM2P2EST: A Thermal Enabled

Multi Model Power/ Performance ESTimator,” in Proceedings of the Workshop on

Power-Aware Computer Systems, Nov. 2000, pp. 112–125.

[9] D. Ponomarev, G. Kucuk, and K. Ghose, “AccuPower: An Accurate Power Esti-
mation Tool for Superscalar Microprocessors,” in Proceedings of the 5th Design
Automation and Test in Europe Conference, Mar. 2002, pp. 124–129.

[10] D. Parikh et al., “Comparison of State-Preserving vs. Non-State-Preserving Leak-

age Control in Caches,” in Proceedings of the Workshop on Duplicating, Decon-

structing and Debunking (held in conjunction with ISCA), June 2003, pp. 14–25.

[11] D. Brooks, J. Wellman, P. Bose, and M. Martonosi, “Power-Performance Modeling

and Tradeoff Analysis for a High-End Microprocessor,” in Proceedings of the

Workshop on Power-Aware Computer Systems, Nov. 2000, pp. 126–136.

[12] W. Ye, N. Vijaykrishnan, M. Kandemir, and J. Irwin, “The design and use of
SimplePower: a cycle accurate energy estimation tool,” in Proceedings of the
Design Automation Conference, June 2000, pp. 340–345.

[13] J. A. Butts and G. S. Sohi, “A Static Power Model for Architects,” in Proceedings

of the International Symposium on Micro-architectures, Dec. 2000, pp. 191–201.

[14] The ASIM Manual, Compaq Computer Corporation, 2000.

[15] D. Brooks, P. Bose, V. Srinivasan, M. K. Gschwind, P. G. Emma, and M. G.
Rosenfield, “New methodology for early-stage, microarchitecture-level power-
performance analysis of microprocessors,” IBM Journal of Research and
Development, vol. 47, no. 5, pp. 653–670, Sept. 2003.

[16] J. Emer, P. Ahuja, E. Borch, A. Klauser, C. Luk, S. Manne, S. Mukherjee, H. Patil,

S. Wallace, N. Binkert, R. Espasa, and T. Juan, “ASIM: A Performance Model

Framework,” IEEE Computer, pp. 68–76, Feb. 2002.

[17] S.J.E. Wilton et al., WRL 93/5: An Enhanced Access and Cycle Time Model for

On-chip Caches, WRL, 1994.

[18] D. Tarjan et al., HPL 2006-86: CACTI4.0, HP, 2006.

[19] Y. Zhang et al., CS 2003-05: HotLeakage : A Temperature-Aware Model of Sub-

threshold and Gate Leakage for Architects, Dept. of CS, Univ. of Virginia, USA,

2003.

[20] M. Mamidipaka et al., CECS 04-28: eCACTI: An Enhanced Power Estimation

Model for On-chip Caches, CECS, Univ. of California, Irvine, USA, 2004.

[21] A. Y. Zeng et al., “Cache Array Architecture Optimization at Deep Submicron

Technologies,” in ICCD 2004, Oct. 2004, pp. 320–325.

[22] M. Mamidipaka et al., “IDAP: A Tool for High-Level Power Estimation of Custom

Array Structures,” IEEE Transactions on Computer-Aided Design of Integrated

Circuits and Systems, vol. 23, no. 9, pp. 1361–1369, September 2004.

[23] M. Q. Do and L. Bengtsson, “Analytical Models for Power Consumption Estima-
tion in the DSP-PP Simulator: Problems and Solutions,” Tech. Rep. 03-22,
Department of Computer Engineering, Chalmers University of Technology,
Göteborg, Sweden, 2003.

[24] S. Ghiasi and D. Grunwald, “A Comparison of Two Architectural Power Models,”

in Proceedings of the Workshop on Power-Aware Computer Systems, Nov. 2000,

pp. 137–152.

[25] G. Cai and C. H. Lim, “Architectural Level Power/ Performance Optimization and

Dynamic Power Estimation,” in Proceedings of Cool Chips Tutorial, Nov. 1999,

pp. 90–113.

[26] M. Mamidipaka, K. Khouri, N. Dutt, and M. Abadir, “A methodology for accu-

rate modeling of energy dissipation in array structures,” in Proceedings of 16th

International Conference on VLSI Design, Jan. 2003, pp. 320–325.


[27] M. Mamidipaka et al., “Leakage Power Estimation in SRAMs,” Tech. Rep. 03-32,
Center for Embedded Computer Systems, University of California, Irvine, USA,
2003.

[28] J. S. Neely, H. H. Chen, S. G. Walker, J. Venuto, and T. J. Bucelot, “CPAM:

A common power analysis methodology for high-performance VLSI design,” in

Proceedings of 9th Topical Meeting on Electrical Performance of Electronic Pack-

aging, Oct. 2000, pp. 303–306.

[29] E. Schmidt et al., “Memory Power Models for Multilevel Power Estimation and

Optimization,” IEEE Transactions on VLSI Systems, vol. 10, pp. 106–109, Apr.

2002.

[30] M. Q. Do, P. Larsson-Edefors, and L. Bengtsson, “Table-based Total Power Con-

sumption Estimation of Memory Arrays for Architects,” in Proceedings of Inter-

national Workshop on Power and Timing Modeling, Optimization and Simulation

(PATMOS’04), LNCS 3254, Sept. 2004, pp. 869–878.

[31] D. Eckerbert and P. Larsson-Edefors, “A Deep Submicron Power Estimation

Methodology Adaptable to Variations Between Power Characterization and Es-

timation,” in Proceedings of the 2003 Asia-South Pacific Design Automation Con-

ference, Jan. 2003, pp. 716–719.


Part III

Power Modeling for

SRAM-based Structures


5 Modular Approach to Power Modeling for On-Chip Caches

This chapter describes the work done on power modeling methodology for on-

chip caches. First, Section 5.1 shows in detail the drawbacks of an analytical
approach to power modeling, and the reason why a table-based, simulation-based
power modeling approach has been selected. After that, the proposed modular
hybrid power estimation modeling methodology for on-chip caches and SRAM
data arrays is described in detail in Section 5.2. Section 5.3 is dedicated to
describing a probing methodology to correctly capture the total leakage currents of

sub-90nm logic circuits when circuit simulators, such as Hspice, are employed.

Section 5.4 presents power dissipation estimation models for on-chip caches

including power models for tag SRAM-based and data SRAM arrays. Sec-


tion 5.5 is dedicated to validation of the obtained power models against circuit-
level simulations for a complete on-chip cache, and for a physically partitioned
and an unpartitioned SRAM array. Finally, in Section 5.6, the modeling methodol-

ogy to capture the dependence of leakage power on temperature variation, on

supply-voltage scaling, and on the selection of process corners is presented and

discussed in detail.

5.1 Analytical Approach to Power Modeling and Its Induced Problems

As mentioned earlier in Section 1.3, the analytical approach is a straightforward
way to model the MOS transistor's leakage mechanisms. The complexity of the
equations defines the accuracy of the approach in estimating leakage power. The

BSIM4 models describe leakage mechanisms using very detailed and complex
equations [1]; for example, the BSIM4 models define the subthreshold leakage
current for a single MOS transistor using Eqs. 3.2–3.7 given in Section 3.1.
Although BSIM4 models offer high accuracy in estimating leakage power,
accounting for variations in temperature, threshold voltage, technology-related
parameters, etc., they are obviously not suitable for higher-

level power estimation due to their complex relations and equations that require

the user to have deep knowledge of device models and access to detailed pro-

cess parameters.
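For contrast with the full BSIM4 formulation, even a simplified textbook-style subthreshold expression (not Eqs. 3.2–3.7; I0 and the slope factor n here are generic fitted parameters) already exposes the exponential dependence on threshold voltage and temperature:

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELEMENTARY_CHARGE = 1.602176634e-19  # C

def i_subthreshold(i0_amps, vgs, vth, vds, n=1.5, temp_kelvin=300.0):
    """Simplified subthreshold current:
    Isub = I0 * exp((Vgs - Vth)/(n*vT)) * (1 - exp(-Vds/vT)),
    with thermal voltage vT = kT/q. I0 and n are fitted parameters,
    not physically derived BSIM4 quantities."""
    v_t = BOLTZMANN * temp_kelvin / ELEMENTARY_CHARGE
    return (i0_amps
            * math.exp((vgs - vth) / (n * v_t))
            * (1.0 - math.exp(-vds / v_t)))
```

Raising the temperature (which increases vT) or lowering Vth both increase Isub exponentially, which is one reason leakage must be characterized over temperature and threshold-voltage corners rather than at a single operating point.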

Several studies have been directed to develop analytical architecture-level

leakage power models with support for supply-voltage scaling and tempera-

ture variation, based on a simplified version of BSIM3 and BSIM4 models for

subthreshold leakage current, i.e. [2] and [3], respectively. These represent

attempts to simplify a BSIM model to a less complex model that is intended

for use in high-level power estimation tools, by introducing curve-fitting co-

efficients and circuit-dependent empirical constants fixed for each particular

process technology. The recently released CACTI version (4.0 [4]) has been


updated with a leakage model based on Hotleakage [2] and eCACTI [5] to offer

a rudimentary ability to estimate leakage power with supply-voltage scaling and

temperature variation over a set of typical technology nodes.

Figure 5.1: Subthreshold leakage power at different temperatures for an NMOS
transistor (commercial 130-nm process). [Plot: Isub (A) on a log scale from
10^-9 to 10^-5 versus temperature from 30 to 110 °C, with curves for the BSIM3
and CACTI4 models.]

The concept of a technology node is, however, gradually being abandoned

(ITRS’05 [6]). Already today the notion of having one single typical process

to represent a "technology node" yields large estimation errors for static-power

dominated memories, since process technologies within a classical technology

node can be so different. With further technology scaling, the diversity in pro-

cess technology offerings will probably increase significantly, thus exacerbating

the problem.

Fig. 5.1 shows subthreshold leakage power (log scale) for a minimum-sized

130-nm NMOS transistor for a range of different temperatures. The power val-

ues obtained by using a BSIM3 model (dotted line) are approximately 250×


smaller than the values obtained by using the 130-nm leakage model imple-

mented in CACTI 4.0. This serves to illustrate the drawbacks of simplifying a

set of analytical leakage models: inaccuracy and inflexibility. Clearly, if leak-

age power models at architectural level are to guide design trade-offs, they can

not be based on generic process parameters, but they must be calibrated to the

actual target process(es).

5.2 The Proposed Modular Hybrid Power Estimation Modeling Approach

In general, as mentioned in Section 4.3.1, architecture-level power dissipation

estimation methods can be classified into two groups: Analytical (statistical)

and simulation-based. While the analytical estimation method uses mathemati-

cal formulas, the simulation-based power estimation methods are implemented

by either table-based or equation-based power models.

The proposed power estimation modeling methodology for SRAM-based

caches is a hybrid one, i.e. rather than using only one technique to estimate

power dissipation, the methodology seeks to find the best match between a par-

ticular estimation technique and a specific cache component. Fig. 2.11 shows

the organization of a typical SRAM-based cache that is divided into two ar-

rays: tag and data arrays. The tag array consists of the SRAM-based array, the

column multiplexers, the tag sense amplifiers, the tag writing circuits, the tag

wordline drivers, the comparators, the MUX-drivers, etc. The data array con-

sists of the SRAM data array, the data wordline/bitline drivers, the data sense

amplifiers, the data writing circuits, the data multiplexers, the output drivers,

etc. The row/column decoders are shared between the two arrays. For each type
of cache component, based on its structure (since this is a white-box approach),
an analysis is performed to identify the major mechanisms of power dissipation.

Then, based on the result of this analysis, the appropriate power estimation tech-

niques are selected. For example, a probabilistic approach has been used to esti-


mate both dynamic and static power of address decoders, an analytical approach

has been used to estimate dynamic power of bitlines and 6T-SRAM cells, sense

amplifiers, write circuits, and wordline drivers, while a circuit-simulation-based

modeling backend has been used to estimate all leakage power mechanisms.

Figure 5.2: Power modeling methodology: a) Component Characterization Phase, and

b) Power Estimation Phase

Looked at more closely, the power estimation modeling approach for SRAM-based caches consists of two underlying phases: Component Characterization and Power Estimation (see Figs 5.2a and 5.2b) [7].

1. Component Characterization: takes as inputs the netlist of a typical

cache component, its states (i.e. Read, Write, Leak) and memory-orga-

nization parameters, generates leakage power values by performing a few

simple circuit-level DC simulations using the appropriate probes, and tab-

ulates those values into the pre-characterized leakage tables. The inde-

pendent inputs to the pre-characterized leakage tables are Type of compo-

nent (i.e. type of cache component), Component State (S), Temperature

(T ), Frequency (F ), Threshold Voltage (VT ), Supply Voltage (VDD) and

Process Corner (PV ). The power values in those tables are the per-cycle


leakage power dissipation of that component. In addition, for each cache

component, the nodal capacitances are also extracted using a circuit-level

simulator that establishes the operating point and DC capacitances.

2. Power Estimation: takes as inputs the pre-characterized leakage tables,

states of the component, input traces (i.e. a sequence of accesses like

{Read, Write, Write, Read, Write, Read, Leak, etc.}), and produces

power dissipation estimates in a cycle-by-cycle manner. For each cache

component, its power model for total power estimation consists of ana-

lytical equations for dynamic power and pre-characterized leakage power

values. Dynamic analytical power models are derived based on the well-

known activity-based switching power equation (Eq. 3.1) with nodal ca-

pacitances extracted during the Component Characterization phase. The

total leakage power accounts for all types of leakage currents that are

present in the transistor models used by the circuit-level simulator, dur-

ing both idle and active cycles. Total power dissipation of the component

is the sum of dynamic and leakage power dissipation values.
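The two phases described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the component name, the collapsed single-operating-point leakage table, and all capacitance, voltage and current values are invented placeholders.

```python
# Phase 1 result: a pre-characterized leakage table indexed by the
# independent inputs (component type, state S, T, F, VT, Vdd, corner PV).
# Values are per-cycle leakage power in watts (assumed numbers).
leakage_table = {
    ("6T_cell", "Read",  25, 1e9, 0.3, 1.2, "TT"): 1.5e-9,
    ("6T_cell", "Write", 25, 1e9, 0.3, 1.2, "TT"): 1.5e-9,
    ("6T_cell", "Leak",  25, 1e9, 0.3, 1.2, "TT"): 1.5e-9,
}

def dynamic_power(c_node, dv, vdd, f_clk):
    """Activity-based switching power (the Eq. 3.1 form): P = Vdd * f * C * dV."""
    return vdd * f_clk * c_node * dv

def estimate(trace, key, c_node, dv, vdd, f_clk):
    """Phase 2: cycle-by-cycle total power = dynamic + tabulated leakage."""
    total = 0.0
    for access in trace:
        p_leak = leakage_table[(key[0], access) + key[1:]]
        # Idle ("Leak") cycles dissipate no switching power in this sketch.
        p_dyn = dynamic_power(c_node, dv, vdd, f_clk) if access != "Leak" else 0.0
        total += p_dyn + p_leak
    return total / len(trace)  # average per-cycle power

trace = ["Read", "Write", "Write", "Read", "Leak"]
key = ("6T_cell", 25, 1e9, 0.3, 1.2, "TT")
avg_p = estimate(trace, key, c_node=50e-15, dv=1.2, vdd=1.2, f_clk=1e9)
```

Characterization (filling the table and extracting nodal capacitances) happens once; the estimation loop can then be re-run for any input trace without further circuit simulation.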

For any cache component, the Component Characterization phase is typically performed only once by a cell-library designer. Computer architects with access to the netlist of new components can also perform the characterization of their components and create new tables for later use. Once the characterization is done, the pre-characterized leakage power values and the values

of the extracted nodal capacitances are tabulated for later use in the power es-

timation phase, and no further simulations are needed until the structure of the

component is modified. Therefore, as compared to those high-level analytical

power models implemented in existing power estimation tools, the proposed

power models offer much better accuracy and flexibility in estimating both to-

tal and leakage power dissipation for on-chip caches, requiring much less time

for the component characterization phase. Furthermore, the proposed model-

ing methodology is modular, thus, it can be applied to model power dissipation

for other types of components of regular structures, e.g. content-addressable-

memory (CAM).


5.3 Probing Methodology for Leakage

In submicron CMOS processes, other leakage mechanisms than subthreshold

leakage become significant and, therefore, a systematic probing methodology

is essential to obtain accurate power estimates. The reason why leakage-current

probing of very deep submicron circuits is complex is that currents no longer

only flow through the transistor channel. Rather, probes need to be applied so

that input and output circuit interfaces, through which significant currents flow,

can be captured. Since the proposed power estimation methodology is used

to calculate total power from the power of many regularly assembled memory

cells, the overall accuracy is very dependent on cell interface currents. In this

section, a methodology for probing CMOS circuits for static current measurements during simulation is presented. The methodology is capable

of capturing all leakage mechanisms existing in BSIM4 models, in this case

implemented in the Hspice simulator. The full description of the methodology

together with some illustrative examples and a survey of related works are given

in [8].


Figure 5.3: Current measurement for MOS transistors used in Hspice simulator

For MOS transistors, Hspice provides the ability to capture the Drain (D)

current, the Gate (G) current, the Source (S) current, and the Bulk (B) current

using current probes i1, i2, i3, and i4, respectively. Fig. 5.3(a) shows these cur-

rents and their Hspice-defined conventional directions.


The direction of gate, drain and source currents for MOS transistors is de-

fined by the value of VGS , VGD , and VDS . Fig. 5.3(b) shows the gate and

subthreshold leakage currents (broken lines) for an NMOS transistor in off-

state (i.e. VG = 0), and the gate leakage and Drain-Source currents (solid lines)

for an NMOS transistor in on-state (i.e. VG = Vdd). For a PMOS transistor,

Fig. 5.3(c) shows the gate and subthreshold leakage currents when VG = Vdd

(broken lines), and the gate leakage and Source-Drain currents when VG = 0,

(solid lines).

A number of observations can be made from these figures:

• Gate leakage currents exist in all transistors no matter if these are in

on- or off-state, as long as |VGS |> 0 and |VGD|> 0. If VG = VD = VS ,

there is however no gate leakage.

• When VG = Vdd, the gate leakage current is going into the transistor

through the Gate to either Drain or Source (or both) that have a

voltage potential less than Vdd.

• When VG = 0, gate leakage current is going out from the transistor

through the Gate from either Drain or Source (or both) that have a

voltage potential greater than VG = 0.

• A subthreshold leakage current exists only in those transistors that

are in off-state, and it goes from Drain to Source (NMOS) and from

Source to Drain (PMOS) for |VDS | > 0. If VD = VS , there is no

subthreshold leakage.

• A substrate leakage current exists in all transistors no matter if

these are in the on-state or in the off-state.

• The gate and substrate leakage currents are captured directly by

using probes i2 and i4, respectively, while the subthreshold leakage

current is captured by using either i1 or i3 probes depending on the

direction of the resulting gate leakage current. For example, in the

NMOS transistor shown in Fig. 5.3(b), the subthreshold leakage

current is captured by i1 for VG = Vdd, and by i3 for VG = 0.


The observations above have led to the following methodology to capture

leakage mechanisms using Hspice current probes for static CMOS circuits (re-

ferred to as the circuit in this section):

Capturing Total Leakage:

1. Following Kirchhoff's current law, the sum of all in-going currents to the circuit must equal the sum of all out-going currents from the circuit. The total leakage current in the circuit is thus equal either to the sum of all in-going currents or to the sum of all out-going currents. If several

to the summation of all out-going currents from the circuit. If several

interconnected circuits are analyzed separately and if their total leakage

power is summed up (e.g. to obtain the total leakage power of a system),

then total leakage power for all separately analyzed circuits should be

obtained in the same manner, either by adding all in-going currents or by

adding all out-going currents.

2. The in-going currents to the circuit refer to those currents that go from

the supply voltage source (Vdd) through PMOS transistors that have their

Sources directly connected to Vdd (denoted as M^{pmos}_{Vdd} in Eq. 5.1); and those gate leakage currents that go into the circuit through the Gate of the transistors having V_G = V_{dd} (denoted as M_{V_G=V_{dd}} in Eq. 5.1).

3. The out-going currents from the circuit refer to those currents that go

to the ground (gnd) through NMOS transistors that have their Sources

directly connected to the ground (denoted as M^{nmos}_{gnd} in Eq. 5.2); and those gate leakage currents that go out from the circuit through the Gate of the transistors having V_G = 0 (denoted as M_{V_G=0} in Eq. 5.2).

4. By using Hspice current probes, equations of the total in-going and out-

going currents for the circuit are created:

I^{leak}_{in-going} = \sum_{i} [i_3(M^{pmos}_{V_{dd}})]_i + \sum_{j} [i_2(M_{V_G=V_{dd}})]_j + \sum_{mp} [i_4(M^{pmos})]_{mp}   (5.1)


I^{leak}_{out-going} = \sum_{k} [i_3(M^{nmos}_{gnd})]_k + \sum_{t} [i_2(M_{V_G=0})]_t + \sum_{mn} [i_4(M^{nmos})]_{mn}   (5.2)

Here, i, j, k and t are the number of PMOS transistors that have their

Sources directly connected to Vdd, the number of the transistors that have

VG = Vdd, the number of NMOS transistors that have their Sources di-

rectly connected to the ground and the number of the transistors that have

VG = 0 inside the circuit, respectively. mn is the number of NMOS tran-

sistors, whereas mp is the number of PMOS transistors inside the circuit.

5. Eqs 5.1 and 5.2 are simplified by removing the current probes for those

transistors that have no gate and subthreshold leakage. Either Eq. 5.1 or

Eq. 5.2 represents the total leakage current of the circuit.
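For a concrete, if contrived, illustration, the bookkeeping of Eqs 5.1 and 5.2 can be checked on a hypothetical inverter with its input held at Vdd. All probe currents (in amperes) are invented for illustration and chosen so that Kirchhoff's current law holds; they are not simulation results.

```python
# i2 = Gate current, i3 = Source current, i4 = Bulk current (Hspice probes).
probes = {
    # PMOS, Source tied to Vdd, VG = Vdd (off-state): subthreshold + gate + bulk
    "Mp1": {"i2": 0.10e-9, "i3": 1.00e-9, "i4": 0.05e-9},
    # NMOS, Source tied to gnd, VG = Vdd (on-state)
    "Mn1": {"i2": 0.20e-9, "i3": 1.27e-9, "i4": 0.08e-9},
}

# Eq. 5.1 terms: i3 of PMOS with Source at Vdd, i2 of transistors with
# VG = Vdd (both transistors here), and i4 of every PMOS.
i_in = (probes["Mp1"]["i3"] + probes["Mp1"]["i2"]
        + probes["Mn1"]["i2"] + probes["Mp1"]["i4"])

# Eq. 5.2 terms: i3 of NMOS with Source at gnd, i2 of transistors with
# VG = 0 (none in this input-high example), and i4 of every NMOS.
i_out = probes["Mn1"]["i3"] + probes["Mn1"]["i4"]

# Step 1: in-going and out-going sums must balance; either one is the
# total leakage current of the circuit (step 5).
assert abs(i_in - i_out) < 1e-15
total_leakage = i_in
```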

To Separate Leakage Mechanisms:

1. From Eq. 5.1 (or Eq. 5.2) the total substrate leakage is obtained by sum-

ming up all i4 probes for the PMOS (or NMOS) transistors of the circuit,

i.e. Eq. 5.3:

I^{leak}_{substrate} = \sum_{mp} [i_4(M^{pmos})]_{mp} = \sum_{mn} [i_4(M^{nmos})]_{mn}   (5.3)

2. To capture the total subthreshold leakage current of the circuit, three steps

need to be carried out: (i) In the circuit, for all conduction paths of sub-

threshold leakage currents connecting Vdd to gnd nodes, find the bound-

ary nodes1; (ii) For each conduction path, if the transistor located below

the boundary node is a PMOS, use current probe i3, otherwise use i1

to obtain the subthreshold leakage current for that path; (iii) The total

subthreshold leakage current of the circuit is the summation of currents

obtained in all conduction paths.

1The intermediate connection points between those transistors that have VG = Vdd and those

that have VG = 0.


3. The total gate leakage current of the circuit is obtained by subtracting the

total subthreshold leakage current from the total in-going (or out-going)

leakage current of the circuit.

For each cache component, the probing methodology is applied to capture

not only the total leakage power, but also other leakage components, i.e. the

gate, subthreshold and substrate leakage. In the next section, the detailed prob-

ing strategy for memory cells is shown. For other cache components, probing

schemes are obtained in a similar manner.

5.4 Power Models for On-Chip Caches

In this section, the characterization phase for each cache component is shown

and their obtained power models are described in detail.

5.4.1 Power Models for Partitioned Data SRAM Arrays

Organization Parameters

The assumed organization parameters for partitioned SRAM arrays are defined

in Table 5.1. As mentioned in Section 2.4.2, Wada et al. [9] showed how the

array can be split horizontally and vertically using Ndwl and Ndbl. Increasing

Ndwl and Ndbl, thus, yields shorter wordlines and bitlines, respectively, which

decreases the array access time, but increases the memory footprint. Increasing

Ndbl also increases the number of precharge circuits required, while increasing

Ndwl introduces a need for more wordline drivers. The parameter Nout, to-

gether with the number of available sense amplifiers (NSA) and write circuits

(NWRC ), defines the multiplexing ratio and the size of multiplexors required.

Increasing Nout would result in an increase of the required NSA and NWRC ,

or in an increase of the multiplexor size.

In general, partitioning with a large number of sub-arrays incurs a signif-

icant area overhead due to extra internal control logic. Clearly, determining



Figure 5.4: Block diagram of a partitioned SRAM array using DWL and DBL techniques


Table 5.1: Organization parameters for partitioned SRAM arrays

Symbol            | Meaning                                      | Parameters
N_{addr}          | Address width in bits                        | N_{addr} = N^{rowdec}_{addr} + N^{coldec}_{addr}
N^{rowdec}_{addr} | Number of address bits to the row decoder    | integer (i.e. 1, 2, 3, 4, ...)
N^{coldec}_{addr} | Number of address bits to the column decoder | integer (i.e. 1, 2, 3, 4, ...)
N_{out}           | Output width in bits                         | integer, multiple of 8 (i.e. 8, 16, 32, ...)
N_{dwl}           | Number of segments per wordline              | 1, 2, 4, 8, ...
N_{dbl}           | Number of segments per bitline               | 1, 2, 4, 8, ...
N_{sub-arrays}    | Total number of sub-arrays                   | N_{sub-arrays} = N_{dwl} × N_{dbl}
N_{rows}          | Number of rows                               | N_{rows} = 2^{N^{rowdec}_{addr}}
N_{words}         | Number of addressable words                  | N_{words} = 2^{N^{coldec}_{addr}}
N_{wlength}       | Word length                                  | integer, multiple of 8 (= 8 in this thesis)
N_{columns}       | Number of columns                            | N_{columns} = N_{words} × N_{wlength}

sub-array organization is about striking a good balance between the energy savings and access-time reduction on the one hand, and the overhead of supporting them on the other.
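The dependent parameters of Table 5.1 follow mechanically from the independent ones. A small helper can make that explicit; the example input values below are illustrative, not taken from the thesis.

```python
def array_organization(n_rowdec_addr, n_coldec_addr, ndwl, ndbl, nwlength=8):
    """Derive the dependent Table 5.1 parameters for a partitioned SRAM array."""
    naddr = n_rowdec_addr + n_coldec_addr    # Naddr = Nrowdec_addr + Ncoldec_addr
    nrows = 2 ** n_rowdec_addr               # Nrows = 2^Nrowdec_addr
    nwords = 2 ** n_coldec_addr              # Nwords = 2^Ncoldec_addr
    ncolumns = nwords * nwlength             # Ncolumns = Nwords * Nwlength
    nsubarrays = ndwl * ndbl                 # Nsub-arrays = Ndwl * Ndbl
    return {"Naddr": naddr, "Nrows": nrows, "Nwords": nwords,
            "Ncolumns": ncolumns, "Nsubarrays": nsubarrays}

# Example: 8 row-address bits and 5 column-address bits give a 256 x 256-bit
# (8-KB) array; a 4 x 4 partitioning yields 16 sub-arrays.
org = array_organization(n_rowdec_addr=8, n_coldec_addr=5, ndwl=4, ndbl=4)
```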

Fig. 5.4 shows the block diagram of a partitioned SRAM array using DWL

and DBL techniques; the original array is divided into Ndwl ×Ndbl sub-arrays.

Each sub-array takes as inputs global wordlines (WL) from global WL drivers,

global bitlines (BL), and several control signals from the internal control logic,

e.g. local BL precharge (LBL_PRE), local WL selection (LWL_SEL) and

local BL selection (LBL_SEL) signals. Inside each sub-array (Fig. 5.5), global

WL is AND-ed with LWL_SEL to create local WLs, and local BLs are con-

nected to global BLs through pass transistors controlled by LBL_SEL sig-


nals. For each local WL there is a local WL driver used to drive the WL se-

lection signal to the memory cells. Each local BL has a local precharge circuit

controlled by LBL_PRE. As it is straightforward to implement, the static

pull-up BL precharging scheme is widely used in partitioned SRAM arrays and

caches [10]—this is assumed for the partitioned array configuration used in this

section.


Figure 5.5: Organization of a sub-array

Power Models for Partitioned Data SRAM Arrays

The power models for SRAM memory components are summarized in the fol-

lowing equations:

Total Power_{array} = \sum_{i} (P_{dyn} + P_{leak})_i   (5.4)

where i is the index over the components of a SRAM array, including memory cells, SA, WRC, wordline drivers, decoders, multiplexers and column isolation logic.

With reference to Fig. 5.5, a read or a write is preceded by a precharge

of the selected LBL/LBL to Vdd, and the selection of local row/column by

row/column decoders based on a given address. A local wordline and a local


bitline (or a set of them) are selected by using LWL_SEL and LBL_SEL

signals to read or write memory cell(s). During read, column isolation PMOS

transistors are turned ON to allow the voltage difference between the selected

LBL/LBL (connected to sense amplifiers through GBL/GBL) to develop

to the sensing voltage (Vsense), after which they are turned OFF to isolate

LBL/LBL from sense amplifiers, helping the amplifiers to quickly sense the

data stored in the cells that are accessed. Multiplexing NMOS transistors

(MUXes) are used to connect write circuits to the selected pairs of LBL/LBL

during write cycles only; the write circuits are idle (leaking) during read cycles.

Local bitline precharging uses a static pull-up scheme that statically leaves

them on all the time [10]; precharging turns OFF only in the evaluation phase

of read/write cycles. Bitline precharge time is designed to be partially hid-

den under the address decoding time, to reduce the power dissipated by the

precharge buffers, while still achieving a short read/write time. Drivers of the

write circuits are designed to be powerful enough, so that they can pull down

precharged LBL/LBL (connected to the write circuits through GBL/GBL) to

zero fast. Sense amplifiers (SA) are designed to have Vsense= 200 mV and the

bitlines surrounding the SA are always precharged to Vdd before turning ON

isolation transistors and the wordline for a read. The architectural selection of

SA and the precharging scheme was motivated by the fact that this type of SA

dissipates less short-circuit power than one that requires precharging to Vdd/2.

Memory cells:

In partitioned arrays, the dynamic power of a read operation is due to LBL/

LBL and GBL/GBL discharging currents through the accessed cell, while

write dynamic power is due to discharging currents through the write circuits.

The “passive read” dynamic power is due to LBL/LBL and GBL/GBL dis-

charging currents through the opened pass transistors into cells, which share the

same local wordline with the selected cell, while GBL/GBL are disconnected

from all SAs and write circuits (WRC). The number of "passive read" cells is N^{pass.read}_{mcells} = N_{columns}/N_{dwl} − N_{wlength}, and it decreases with an



Figure 5.6: (a) Characterization of a 6T-SRAM cell, (b) Hspice configuration for VLBL

estimation

increasing Ndwl. Hence, the “passive read” dynamic power is lower in parti-

tioned arrays than in the unpartitioned array [11].

Characterization of a memory cell is done by performing a circuit-level DC

simulation for a single cell connected to a pair of LBL and LBL to quantify

all leakage components (Fig. 5.6a). The dynamic power dissipation can be

accurately estimated using Eq. 5.6, given global and local bitline capacitance

C_{GBL}, C_{LBL}, and global and local bitline voltage swings ∆V_{GBL}, ∆V_{LBL}. The "passive read" power is estimated using Eq. 5.7, where the "passive read" global and local bitline voltage swings ∆V^{pass.read}_{GBL} and ∆V^{pass.read}_{LBL} are obtained using Eqs 5.10 and 5.11.

P^{mcells}_{dyn} = N_{wlength} P^{mcell}_{active} + N^{pass.read}_{mcells} P^{mcell}_{pass.read}   (5.5)

P^{mcell}_{active} = V_{dd} f_{clk} (C_{GBL} ∆V_{GBL} + C_{LBL} ∆V_{LBL})   (5.6)

P^{mcell}_{pass.read} = V_{dd} f_{clk} C_{GBL} ∆V^{pass.read}_{GBL} + V_{dd} f_{clk} C_{LBL} ∆V^{pass.read}_{LBL}   (5.7)

C_{GBL} = N_{dbl} C^{nmos_pass}_{drain} + C^{mux}_{source} + C^{iso}_{drain} + C^{GBL}_{wire}   (5.8)

C_{LBL} = (N_{rows}/N_{dbl}) C_{mcell} + C^{nmos_pass}_{source} + 2 C^{pmos_prech}_{drain} + C^{LBL}_{wire}   (5.9)

∆V^{pass.read}_{LBL} = ∆T_{wordline} · I_{LBL_discharge} / C_{LBL}   (5.10)

∆V^{pass.read}_{GBL} = ‖∆V^{pass.read}_{LBL} − (V_{dd} − V^{initial}_{GBL})‖   (5.11)

P^{mcells}_{leak} = N_{mcells} I^{mcell}_{leak} V_{dd}   (5.12)

Here, C_{mcell}, C^{nmos_pass}_{drain}, C^{nmos_pass}_{source}, C^{mux}_{source}, C^{iso}_{drain}, C^{pmos_prech}_{drain}, C^{GBL}_{wire}, and C^{LBL}_{wire} are a cell's load capacitance onto the bitline, the drain and source capacitances of an NMOS transistor connecting local to global bitlines, the source capacitance of a MUX NMOS transistor, the drain capacitance of an ISO PMOS transistor, the drain capacitance of a precharge PMOS transistor, and the global and local bitline wire capacitances, respectively. ∆T_{wordline} is the time during which a wordline is on, and I_{LBL_discharge} is the local bitline discharging current, which can be obtained by running a circuit-level DC simulation for a stack of the two NMOS transistors (from the SRAM cell) connected between a local bitline (precharged to V_{dd}) and ground (see Fig. 5.6b). V^{initial}_{GBL} is the initial voltage level of the non-selected global bitline when the evaluation cycle starts.

Fig. 5.7 shows the subthreshold and gate leakage currents for a partitioned

6T-SRAM cell. The total leakage power of memory cells is estimated using

Eq. 5.12, where N_{mcells} = 2^{N_{addr}} is the number of memory cells and I^{mcell}_{leak} is the total leakage current for a single memory cell, which is obtained by using the methodology given in Section 5.3 and defined either by Eq. 5.13 or Eq. 5.14:

I^{mcell}_{leak} = i_1(PT1) + i_1(PT2) + i_3(P1) + i_3(P2) + i_4(P1) + i_4(P2)   (5.13)



Figure 5.7: Subthreshold (green, solid) and gate leakage (red, dotted) currents in a

partitioned 6T-SRAM cell

I^{mcell}_{leak} = i_2(PT1) + i_2(PT2) + i_2(PT3) + i_2(PT4) + i_3(N1) + i_3(N2) + i_4(N1) + i_4(N2) + i_4(PT1) + i_4(PT2) + i_4(PT3) + i_4(PT4)   (5.14)
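As a rough numerical walk-through, Eqs 5.5–5.12 can be chained as below. Every capacitance, current and voltage value is an invented placeholder for illustration, not a characterized one.

```python
def mcell_power(vdd, f_clk, c_gbl, c_lbl, dv_gbl, dv_lbl,
                dt_wordline, i_lbl_discharge, v_gbl_initial,
                n_wlength, n_pass_read, n_mcells, i_leak_cell):
    # Eq. 5.10: "passive read" local-bitline voltage swing
    dv_lbl_pr = dt_wordline * i_lbl_discharge / c_lbl
    # Eq. 5.11: "passive read" global-bitline voltage swing
    dv_gbl_pr = abs(dv_lbl_pr - (vdd - v_gbl_initial))
    # Eq. 5.6: dynamic power of an actively accessed cell
    p_active = vdd * f_clk * (c_gbl * dv_gbl + c_lbl * dv_lbl)
    # Eq. 5.7: dynamic power of a "passive read" cell
    p_pass = vdd * f_clk * (c_gbl * dv_gbl_pr + c_lbl * dv_lbl_pr)
    # Eq. 5.5: total dynamic power of the memory cells
    p_dyn = n_wlength * p_active + n_pass_read * p_pass
    # Eq. 5.12: total leakage power of the memory cells
    p_leak = n_mcells * i_leak_cell * vdd
    return p_dyn, p_leak

p_dyn, p_leak = mcell_power(
    vdd=1.2, f_clk=1e9, c_gbl=40e-15, c_lbl=60e-15,
    dv_gbl=0.2, dv_lbl=0.2, dt_wordline=0.2e-9,
    i_lbl_discharge=30e-6, v_gbl_initial=1.2,
    n_wlength=8, n_pass_read=56, n_mcells=8192, i_leak_cell=1.5e-9)
```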

Sense Amplifier:

Power dissipation of a sense amplifier (SA) consists of leakage and dynamic

components. Dynamic power is due to the current that discharges bitlines of the

SA (referred to as BLSA/BLSA) from (Vdd − Vsense) to zero, which can be

estimated using Eq. 5.15, given bitline SA capacitance CBLSA, fclk, bitline SA

voltage swing ∆VBLSA = Vdd−Vsense and Vdd. Leakage power is obtained by

running a circuit-level DC simulation with the appropriate probes for a single

SA with the configuration for characterization shown in Fig. 5.8a. Here, NSA is

the number of sense amplifiers used in the array, and C_{SA}, C^{iso}_{source} and C^{GBL}_{wire} are

the capacitance of a SA, the source capacitance of a PMOS isolation transistor

and the GBL wire capacitance, respectively.

P^{SAs}_{dyn} = V_{dd} f_{clk} C_{BLSA} (V_{dd} − V_{sense})   (5.15)



Figure 5.8: Characterization of (a) a sense amplifier, (b) a write circuit

P^{SAs}_{leak} = N_{SA} I^{SA}_{leak} V_{dd}   (5.16)

C_{BLSA} = C_{SA} + N_{words} C^{iso}_{source} + C^{GBL}_{wire}   (5.17)

Writing logic:

The write circuit (WRC) dissipates power dynamically through the cur-

rent that discharges the selected, precharged LBL/LBL (connected to WRC

through a pair of selected GBL/GBL) from Vdd to zero while driving zero or

one to the selected cell (Fig. 5.8b). This power dissipation is estimated using

Eqs 5.18 – 5.21 for given ∆V^{write}_{GBL}, ∆V^{write}_{LBL}, f_{clk}, V_{dd}, C_{GBL}, C_{LBL}, and C_{BLWRC}, which is calculated using Eq. 5.22. The leakage power is obtained using Eq. 5.23, where I^{WRC}_{leak} is the leakage current obtained by characterization for a single write circuit, applying the probing methodology described in Section 5.3.

P^{WRCs}_{dyn} = N_{WRC} (P^{BLWRC}_{dyn} + P^{GBL}_{dyn} + P^{LBL}_{dyn})   (5.18)

P^{BLWRC}_{dyn} = V^2_{dd} f_{clk} C_{BLWRC}   (5.19)


P^{GBL}_{dyn} = V_{dd} f_{clk} ∆V^{write}_{GBL} C_{GBL}   (5.20)

P^{LBL}_{dyn} = V_{dd} f_{clk} ∆V^{write}_{LBL} C_{LBL}   (5.21)

C_{BLWRC} = C_{WRC} + N_{words} C^{mux}_{source} + C^{mux}_{gate} + C^{GBL}_{wire}   (5.22)

P^{WRCs}_{leak} = N_{WRC} I^{WRC}_{leak} V_{dd}   (5.23)

Here, C_{BLWRC} is the bitline capacitance of a write circuit, C^{mux}_{gate} is the gate capacitance of a MUX NMOS transistor, and ∆V^{write}_{GBL}, ∆V^{write}_{LBL} are the voltage swings of the global and local bitlines in write cycles, respectively.
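The write-circuit model of Eqs 5.18–5.23 can be sketched in the same style; all capacitances, swings and leakage currents below are invented placeholder values.

```python
def wrc_power(vdd, f_clk, n_wrc, c_blwrc, c_gbl, c_lbl,
              dv_gbl_write, dv_lbl_write, i_leak_wrc):
    p_blwrc = vdd ** 2 * f_clk * c_blwrc          # Eq. 5.19
    p_gbl = vdd * f_clk * dv_gbl_write * c_gbl    # Eq. 5.20
    p_lbl = vdd * f_clk * dv_lbl_write * c_lbl    # Eq. 5.21
    p_dyn = n_wrc * (p_blwrc + p_gbl + p_lbl)     # Eq. 5.18
    p_leak = n_wrc * i_leak_wrc * vdd             # Eq. 5.23
    return p_dyn, p_leak

# Full-swing write (dv = Vdd), 8 write circuits; values are illustrative.
p_dyn, p_leak = wrc_power(vdd=1.2, f_clk=1e9, n_wrc=8,
                          c_blwrc=30e-15, c_gbl=40e-15, c_lbl=60e-15,
                          dv_gbl_write=1.2, dv_lbl_write=1.2,
                          i_leak_wrc=5e-9)
```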

Global/Local wordline drivers:

There are 2^{N^{rowdec}_{addr}} global wordline drivers and N_{dwl} × 2^{N^{rowdec}_{addr}} local wordline drivers for a given row decoder with N^{rowdec}_{addr} memory address bits; however, only one global and one local driver are active in each read/write cycle, while

the rest are idle and leaking. Although the total number of wordline drivers

for a partitioned array is increased (with respect to the unpartitioned array), the

size of each global wordline driver is smaller due to smaller driving capaci-

tances. The dynamic power of global and local wordline drivers is estimated

using Eqs 5.24 – 5.27 for a given input capacitance of an AND gate C^{AND}_{gate}, the gate capacitance of the cell's NMOS pass transistor C^{nmos_pass}_{gate}, the output capacitances of the GWL/LWL drivers C^{GWLDrv}_{out} and C^{LWLDrv}_{out}, and the GWL/LWL wire capacitances C^{GWL}_{wire} and C^{LWL}_{wire}, respectively.

P^{GwlDrv}_{dyn} = V^2_{dd} f_{clk} C_{GWL}   (5.24)

P^{LwlDrv}_{dyn} = V^2_{dd} f_{clk} C_{LWL}   (5.25)

C_{GWL} = N_{dwl} C^{AND}_{gate} + C^{GWLDrv}_{out} + C^{GWL}_{wire}   (5.26)

C_{LWL} = (2 N_{columns}/N_{dwl}) C^{nmos_pass}_{gate} + C^{LWLDrv}_{out} + C^{LWL}_{wire}   (5.27)

P^{GwlDrv}_{leak} = N_{rows} I^{GwlDrv}_{leak} V_{dd}   (5.28)

P^{LwlDrv}_{leak} = N_{dwl} N_{rows} I^{LwlDrv}_{leak} V_{dd}   (5.29)


A circuit-level DC simulation for a single global/local wordline driver estab-

lishes the leakage power components for Eqs 5.28 and 5.29 using the probing

methodology described in Section 5.3.

Address Decoders:

Fig. 5.9 shows the architecture of a row/column decoder used in this thesis.

This architecture is similar to the one used in CACTI [12] for cache delay esti-

mation. For a given number of address bits Naddr, the number of 3to8 (=N3to8)

and 2to4 decoders (=N2to4), the number of NOR gates (=Nnor), the number

of inverters (=Ninv) and the number of wordline drivers (=NwlDrv) required

for the implementation of address row decoder (rowdec) and column decoder

(coldec) are given in Eqs 5.30, 5.32 and 5.33, respectively.

N_{addr} = 3 N_{3to8} + 2 N_{2to4}   (5.30)

N_{addr} = N^{rowdec}_{addr} + N^{coldec}_{addr}   (5.31)

N_{rows} = 2^{N^{rowdec}_{addr}} = N^{rowdec}_{nor} = N^{rowdec}_{inv} = N_{wlDrv}   (5.32)

N_{words} = 2^{N^{coldec}_{addr}} = N^{coldec}_{nor} = 0.5 N^{coldec}_{inv}   (5.33)

Here, recall from Table 5.1 that N^{rowdec}_{addr} and N^{coldec}_{addr} are the numbers of row and column address bits, respectively, and N_{words} is the number of addressable words in this memory array.

Each 3to8 and 2to4 decoder is typically implemented using NAND gates

and inverters to complement the address inputs. During each read/write cy-

cle, the decoder-enable signal DecSel triggers the decoder's outputs. Each

NOR gate collects an output from every decoder and then, together with an

inverter and a wordline driver, forms a wordline activation signal. Since the

0→1 nodal transition is considered to be the power-consuming one, all NAND, NOR and inverter gates are active when making 0→1 transitions, and are inactive

and leaking otherwise. The leakage power is obtained by running a circuit-level

DC simulation with appropriate probes for each row or column decoder with



Figure 5.9: Architecture of an 8-to-256 row decoder

no 0→1 transitions in the addresses. A probabilistic method is used to esti-

mate dynamic power dissipation of row and column decoders. Based on the

method described in [13], a transition activity factor α0→1 can be calculated

for each node assuming that all addresses to decoders have equal probability,

and DecSel is turned ON in every read/write cycle. The dynamic and leak-

age power dissipation of a row/column decoder are calculated by Eq. 5.34 and

Eq. 5.35, respectively.

P^{dec}_{dyn} = P^{dec3to8}_{dyn} + P^{dec2to4}_{dyn} + P^{nor}_{dyn} + P^{inv}_{dyn}   (5.34)

P^{dec}_{leak} = P^{dec3to8}_{leak} + P^{dec2to4}_{leak} + P^{nor}_{leak} + P^{inv}_{leak}   (5.35)

where,

P^{dec3to8}_{dyn} = V^2_{dd} f_{clk} (N_{inv} α_{inv} C_{inv} + N_{out} α_{out} C_{out})_{dec3to8}

P^{dec2to4}_{dyn} = V^2_{dd} f_{clk} (N_{inv} α_{inv} C_{inv} + N_{out} α_{out} C_{out})_{dec2to4}

P^{nor}_{dyn} + P^{inv}_{dyn} = V^2_{dd} f_{clk} α_{nor} (C_{nor} + C_{inv})

P^{dec3to8}_{leak} = P^{dec3to8}_{leak_inv} + P^{dec3to8}_{leak_nand}

P^{dec2to4}_{leak} = P^{dec2to4}_{leak_inv} + P^{dec2to4}_{leak_nand}

P^{dec3to8}_{leak_nand} = 8 (1 − α^{dec3to8}_{out}) I^{dec3to8}_{leak_nand} V_{dd}

P^{dec2to4}_{leak_nand} = 4 (1 − α^{dec2to4}_{out}) I^{dec2to4}_{leak_nand} V_{dd}

P^{dec3to8}_{leak_inv} = 3 (1 − α^{dec3to8}_{inv}) I^{dec3to8}_{leak_inv} V_{dd}

P^{dec2to4}_{leak_inv} = 2 (1 − α^{dec2to4}_{inv}) I^{dec2to4}_{leak_inv} V_{dd}

P^{nor}_{leak} + P^{inv}_{leak} = (N − 1) (I^{nor}_{leak} + I^{inv}_{leak}) V_{dd}

Here, α_{inv}, α_{out}, and α_{nor} are the '0→1' transition activity factors for the address inverters, the NAND gates (inside the 3to8 and 2to4 decoders), and the NOR gates, respectively. For equally probable address inputs to the decoders, α_{inv} = 0.25, α^{dec3to8}_{out} = 0.1094, α^{dec2to4}_{out} = 0.1875, and α_{nor} = 1/N, where N = N_{rows} and N = N_{words} for row and column decoders, respectively.
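The quoted activity factors can be reproduced by treating a node's 0→1 transition probability as P(node = 0) · P(node = 1), under the stated assumption of equally probable address inputs:

```python
def alpha_0_to_1(p_one):
    """0->1 transition probability for a node that is '1' with probability p_one."""
    return (1.0 - p_one) * p_one

# A NAND output inside a 3to8 decoder is '0' only for the one selected
# input combination out of 2^3, so P(out = 1) = 7/8.
alpha_dec3to8_out = alpha_0_to_1(7.0 / 8.0)   # 7/64 = 0.1094

# Inside a 2to4 decoder, P(out = 1) = 3/4.
alpha_dec2to4_out = alpha_0_to_1(3.0 / 4.0)   # 3/16 = 0.1875

# An address inverter sees a random bit, so P(out = 1) = 1/2.
alpha_inv = alpha_0_to_1(0.5)                 # 0.25

# A NOR gate drives one of N wordlines, so alpha_nor = 1/N (from the text).
def alpha_nor(n):
    return 1.0 / n
```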

MUX and Isolation logic:

There are 2×Ncolumns NMOS and PMOS transistors used for multiplex-

ing WRCs and SAs to GBL/GBL, respectively. During a read cycle, PMOS

isolation transistors are turned ON, while NMOS transistors are turned OFF,

and during a write cycle PMOS transistors are OFF, while NMOS are ON. The

number of idle NMOS (Nmux) and PMOS (Niso) transistors is inversely pro-

portional to N_{wlength}, i.e. a longer access word length leads to fewer idle MUX and isolation transistors. In the model verification part, since N_{wlength} = 8 bits, there are 2×(N_{columns} − N_{wlength}) = 496 idle transistors for the 8-KB array and 240 for the 2-KB array, hence contributing significantly to leak-

age power. The leakage power in MUXes is due to the leakage currents to the

substrate and through the gate. By running a circuit-level DC simulation for

an off-state NMOS and an off-state PMOS connected between Vdd and ground,

leakage currents for those off-state transistors are captured.

P^{mux}_{dyn} + P^{iso}_{dyn} = N_{mux} I^{nmos}_{dyn} + N_{iso} I^{pmos}_{dyn}   (5.36)

P^{mux}_{leak} + P^{iso}_{leak} = N_{mux} I^{nmos}_{leak} + N_{iso} I^{pmos}_{leak}   (5.37)
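The idle-transistor counts quoted above follow directly from the column count; the 256- and 128-column values below correspond to the 8-KB and 2-KB verification arrays with N_{wlength} = 8 (assumed square arrays, for illustration).

```python
def idle_mux_iso(n_columns, n_wlength=8):
    """Idle MUX NMOS plus isolation PMOS transistors on unselected columns:
    2 x (Ncolumns - Nwlength)."""
    return 2 * (n_columns - n_wlength)

idle_8kb = idle_mux_iso(n_columns=256)   # 8-KB array: 256 x 256 bits
idle_2kb = idle_mux_iso(n_columns=128)   # 2-KB array: 128 x 128 bits
```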


5.4.2 Power Models for Unpartitioned Data SRAM Arrays

Compared to a physically partitioned SRAM array of the same size, an unpartitioned array has a simpler organization: there are no global bitlines, no global wordlines, and neither global wordline drivers nor extra control circuits for sub-array selection. Therefore, power modeling for unpartitioned SRAM arrays is straightforward and simpler than that for partitioned arrays. Component power models of a partitioned array can be reused directly for some components of an unpartitioned array, such as the SAs, the row/column decoders, and the MUX/isolation logic. For the remaining components, some modifications to their power models are required. Eqs 5.38 - 5.46 show the obtained power models for the memory cells, the WRCs, and the wordline drivers, respectively.

Memory cells:

P^{mcells}_{dyn} = V_{dd} f_{clk} C_{BL} (\Delta V_{BL} + \Delta V^{pass.read}_{BL})    (5.38)

C_{BL} = N_{rows} C_{mcell} + 2 C^{pmos\_prech}_{drain} + C^{mux}_{source} + C^{iso}_{drain} + C^{BL}_{wire}    (5.39)

P^{mcells}_{leak} = N_{mcells} I^{mcell}_{leak} V_{dd}    (5.40)

Writing logic:

P^{WRCs}_{dyn} = V^{2}_{dd} f_{clk} C^{WRC}_{BL}    (5.41)

C^{WRC}_{BL} = C_{WRC} + N_{words} C^{mux}_{source} + C^{mux}_{gate} + C^{BL}_{wire}    (5.42)

P^{WRCs}_{leak} = N_{WRC} I^{WRC}_{leak} V_{dd}    (5.43)

Wordline drivers:

P^{wlDrv}_{dyn} = V^{2}_{dd} f_{clk} C_{WL}    (5.44)

C_{WL} = 2 N_{columns} C_{nmos\_passgate} + C^{WLDrv}_{out} + C^{WL}_{wire}    (5.45)

P^{wlDrv}_{leak} = N_{rows} I^{wlDrv}_{leak} V_{dd}    (5.46)


Here, \Delta V_{BL} is the bitline voltage swing and \Delta V^{pass.read}_{BL} is the "passive read" bitline voltage swing. C^{BL}_{wire}, C^{WL}_{wire}, C^{WLDrv}_{out} and C^{iso}_{drain} are the bitline wire capacitance, the wordline wire capacitance, the output capacitance of a wordline driver, and the drain capacitance of an ISO PMOS transistor, respectively.
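Eqs 5.38 - 5.40 can be sketched as straightforward functions; all capacitance and current values below are hypothetical placeholders standing in for pre-characterized table entries.

```python
def bitline_capacitance(n_rows, c_mcell, c_prech_drain,
                        c_mux_source, c_iso_drain, c_bl_wire):
    # Eq. 5.39: total load on one bitline of an unpartitioned array
    return (n_rows * c_mcell + 2.0 * c_prech_drain
            + c_mux_source + c_iso_drain + c_bl_wire)

def mcells_dynamic_power(vdd, f_clk, c_bl, dv_bl, dv_pass_read):
    # Eq. 5.38: read-cycle dynamic power of the memory cells, including
    # the "passive read" swing of the non-accessed bitlines
    return vdd * f_clk * c_bl * (dv_bl + dv_pass_read)

def mcells_leakage_power(n_mcells, i_mcell_leak, vdd):
    # Eq. 5.40: every cell leaks, regardless of switching activity
    return n_mcells * i_mcell_leak * vdd
```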

5.4.3 Power Models for SRAM-based Tag Arrays

Tag Array Organization Parameters

Table 5.2 shows the assumed organization parameters for the partitioned SRAM-based tag arrays. The size of a tag field, N_{tag}, is calculated using the following equation:

N_{tag} = N_{mem\_addr} - N_{index} + \log_2 A - N_{Byte\_offset} - N_{Word\_block}    (5.47)
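Eq. 5.47 can be sketched as below; the example parameters (32-bit addresses, a direct-mapped cache with a 7-bit index, 2-bit byte offset and 1-bit word offset) are one hypothetical combination that happens to yield a 22-bit tag field like the one used in the validation section, not values stated here.

```python
from math import log2

def tag_field_width(n_mem_addr, n_index, assoc, n_byte_offset, n_word_block):
    # Eq. 5.47: address bits left over for the tag. Higher associativity
    # means fewer sets, hence fewer index bits and log2(A) extra tag bits.
    return int(n_mem_addr - n_index + log2(assoc)
               - n_byte_offset - n_word_block)

print(tag_field_width(32, 7, 1, 2, 1))  # -> 22
```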

In order to reduce the total power in a tag array, physical partitioning techniques have also been used. Physically partitioned tag arrays are usually partitioned vertically using the DBL technique [14], but not horizontally, since they are often designed to read a complete tag-line at a time, as fast as possible. A horizontally partitioned tag array may require several clock cycles to read a complete tag-line, thus slowing down the cache. Therefore, in this thesis only vertically partitioned tag arrays are considered, i.e. N_{twl} = 1 is assumed to be constant.

Power Models for Partitioned SRAM-based Tag Arrays

Fig. 2.11 in Section 2.4.1 shows the organization of a typical SRAM-based cache, used as the basic organization assumed for power modeling throughout this work. It is clear from this figure that the tag array has two additional components compared to the data array: the comparators and the MUX drivers. However, the MUX drivers dissipate insignificant power compared to the comparator [12] and are therefore not considered in this work.


Table 5.2: Organization parameters for partitioned SRAM-based tag arrays

Symbol                 Meaning                               Parameter
C                      Cache size                            in Bytes
B                      Block size                            in Bytes
A                      Associativity                         integer (i.e. 1, 2, 3, 4, ...)
N_{twl}                Number of segments per tag wordline   1, 2, 4, 8, ...
N_{tbl}                Number of segments per tag bitline    1, 2, 4, 8, ...
N^{tag}_{sub-arrays}   Total number of tag sub-arrays        N^{tag}_{sub-arrays} = N_{twl} × N_{tbl}
N_{tag}                Size of tag field                     integer (i.e. 1, 2, 3, 4, ...)
N_{mem\_addr}          Memory address width in bits          integer (i.e. 1, 2, 3, 4, ...)
N_{index}              Index in bits                         integer (i.e. 1, 2, 3, 4, ...)
N_{Byte\_offset}       Byte offset in bits                   integer (i.e. 1, 2, 3, 4, ...)
N_{Word\_block}        Word offset in bits                   integer (i.e. 1, 2, 3, 4, ...)

Comparator:

Fig. 5.10 shows the structure of the typical NOR-based comparator assumed for power modeling in this section. This architecture is similar to the one used in CACTI [12]. The outputs from the tag SAs of the tag array are connected to the inputs labeled a_n and ā_n, while the b_n and b̄_n inputs are driven by the tag bits in the address (also referred to as Search Lines, SLs, in Fig. 2.11). Here, the index n = 0, 1, 2, ..., N_{tag}. The node OUT_cmp is the output of the comparator, which is connected to the input of a match-line sense amplifier (MLSA). The output of the MLSA is the match result, denoted ML. The node EVAL is used as a "virtual ground" for the pull-down paths of the comparator. The working principle of a NOR-based comparator consists of three phases [15]:

1. SL precharge: precharge the search lines (b_n/b̄_n) to low

[Circuit schematic: per-bit pull-down pairs M1/M4 and M2/M3 driven by a_n/ā_n and b_n/b̄_n, precharge PMOS M_pre on node OUT_cmp, virtual-ground node EVAL driven from a dummy tag SA, and OUT_cmp feeding the MLSA that produces ML]

Figure 5.10: The structure of a typical N_{tag}-bit NOR-based comparator

2. Match-line precharge: precharge OUT_cmp to high by turning ON the precharge PMOS transistor

3. Match-line evaluation: (i) turn OFF the precharge PMOS transistor; (ii) drive the SLs (b_n/b̄_n) to the tag bits in the address; (iii) drive a_n/ā_n to the outputs from the tag SAs; (iv) perform the comparison and drive OUT_cmp to the MLSA, which in turn generates a match result based on the voltage level it senses.

In the match-line evaluation phase, to ensure that the output OUT_cmp is not discharged before the a_n bits become stable, node EVAL is held high until roughly three inverter delays after the generation of the a_n signals. This is accomplished by a timing chain driven by a tag SA in the tag array; the output of this timing chain is connected to EVAL [12]. For simplicity of power modeling, the SL precharge and match-line precharge phases are combined into one, denoted the Comparator precharge phase.

Applying the methodology given in Section 5.3 to the comparator circuit, some observations can be made:

• For each pair of compared bits a_n and b_n there are two pull-down paths: the first consists of the two NMOS transistors M1 and M4, the second of M2 and M3 (see Fig. 5.10).

• During the comparator precharge phase, the nodes EVAL and OUT_cmp are high, so there is no subthreshold leakage in the comparator circuit. However, there are gate leakage currents flowing out of those MOS transistors that have V_G = 0. For example, the precharge PMOS transistor M_pre has a gate leakage current running from its source (with V_S = V_dd) to its gate, which is captured using the probe i2(M_pre). This leakage current, however, turns out to be negligible.

• In the match-line evaluation phase, i.e. when M_pre is OFF and V_EVAL = 0, there are two possible cases: match or mismatch. A match occurs when either a_n = b_n = V_dd or a_n = b_n = 0, while a mismatch occurs when a_n ≠ b_n. In the match-case, there are no paths connecting OUT_cmp to ground, so there is no dynamic power, only leakage. In the mismatch-case, the dynamic power of the comparator is due to the OUT_cmp discharging current running through a number of pull-down paths and an on-state NMOS transistor of the last-stage inverter of the timing chain to ground. The value of this current depends on the number of mismatched bits. The leakage power in this case is also negligible.

Based on these observations, the power models for a comparator of N_{tag} bits are described by Eqs 5.48 - 5.55:

P^{cmp}_{total} = H_{cache} P^{cmp}_{match} + (1 - H_{cache}) P^{cmp}_{mismatch}
              = H_{cache} V_{dd} I^{cmp}_{leak} + (1 - H_{cache}) P^{cmp}_{dyn}    (5.48)


P^{cmp}_{dyn} = \Delta V^{OUT_{cmp}}_{swing} C_{OUT_{cmp}} V_{dd} f_{clk}    (5.49)

C_{OUT_{cmp}} = N_{tag} C^{nmos}_{drain} + C^{prech\_pmos}_{source} + C_{MLSA}    (5.50)

I^{cmp}_{leak} = N^{a_n=1}_{match} I^{a_n=1}_{a\_bit\_leak} + N^{a_n=0}_{match} I^{a_n=0}_{a\_bit\_leak} + I^{inv}_{leak}    (5.51)

Here, H_{cache} is the cache hit ratio², and P^{cmp}_{match} and P^{cmp}_{mismatch} are the comparator power values in the match-case and mismatch-case, respectively. I^{cmp}_{leak} is the total leakage current of the comparator in the match-case, and P^{cmp}_{dyn} is the total dynamic power of the comparator in the mismatch-case. \Delta V^{OUT_{cmp}}_{swing} is the voltage swing of the node OUT_cmp when a mismatch occurs, and C_{OUT_{cmp}} is the output capacitance of the comparator. C^{nmos}_{drain}, C^{prech\_pmos}_{source} and C_{MLSA} are the drain capacitance of an NMOS transistor, the source capacitance of the PMOS precharge transistor, and the input capacitance of an MLSA, respectively. Using the methodology given in Section 5.3, the total leakage current of a pair of pull-down paths in a match-case with a_n = 1 and with a_n = 0 is obtained by Eq. 5.53 and Eq. 5.55, respectively.

I^{a_n=1}_{a\_bit\_leak} = i2(M3) + i1(M1) + i1(M2)    (5.52)
                       = i2(M2) + i2(M4) + i3(M3) + i3(M4)    (5.53)

I^{a_n=0}_{a\_bit\_leak} = i2(M4) + i1(M2) + i1(M1)    (5.54)
                       = i2(M1) + i2(M3) + i3(M3) + i3(M4)    (5.55)
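Eq. 5.48 weighs the two operating cases by the hit ratio. A minimal sketch (the current and power values in the example are placeholders):

```python
def comparator_power(h_cache, vdd, i_cmp_leak, p_cmp_dyn):
    # Eq. 5.48: on a hit (match-case) the comparator only leaks; on a
    # miss (mismatch-case) it dissipates dynamic power discharging OUT_cmp.
    return h_cache * vdd * i_cmp_leak + (1.0 - h_cache) * p_cmp_dyn

# With a 99% hit ratio the dynamic term is strongly de-weighted:
p = comparator_power(0.99, 1.1, 1e-9, 1e-5)
```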

5.5 Validation

In this section, the validation results of power models for partitioned and un-

partitioned data SRAM arrays, and for a partitioned SRAM-based tag array are

given. Based on the obtained validation results, some analyses and discussions

are presented and conclusions are drawn.

²For L1 and L2 on-chip caches, the hit ratio is intentionally kept high by employing many architecture-level cache management policies and techniques. The average hit ratio is about 99.8% for the I-cache and 98.5% for the D-cache [16].


5.5.1 Validation Methodology

Below, a brief summary of the validation methodology is presented. The methodology has been applied to prove the validity of the obtained power models against circuit-level simulations for several complete physically partitioned and unpartitioned data SRAM arrays with different configurations, and also for a partitioned SRAM-based tag array.

1. Select initial cache/memory organization parameters, e.g. the size, the output width, the access word length, the associativity, etc. Then, use the CACTI 3.2 tool to generate the configuration parameters for all unpartitioned and partitioned data arrays, and for the partitioned tag array.

2. Create netlists of these arrays based on the obtained configuration parameters. Select a typical structure for each array component that is widely used in the literature and the research community.

3. Properly size the netlists of these arrays in some available commercial and predictive CMOS processes. Perform simulations and static timing analyses to ensure proper functionality of each array. In this validation work, a commercial 0.13-µm process (V_{dd} = 1.2 V; normal V_T = V_{TH0} ≈ 0.25 V) and a Berkeley Predictive Technology Model (BPTM) 65-nm process (V_{dd} = 1.1 V; V^{nmos}_T ≈ 0.42 V and |V^{pmos}_T| ≈ 0.36 V) have been used.

4. Select the process-dependent parameters (i.e. threshold voltage V_T, supply voltage V_{DD} and process corner PV) and other parameters (e.g. temperature T, frequency F) to set up the simulation environment for a circuit-level simulator (in this case, Hspice) and perform simulations and analyses.

5. For each array, perform Component Characterization for each array component by running a few simple Hspice DC simulations with the appropriate probes to obtain both the total leakage power value and the power value of each leakage component, i.e. gate, subthreshold and substrate leakage. At the same time, extract the nodal capacitances for each array component. Then, tabulate those leakage power values and nodal capacitances into the pre-characterized leakage tables for use in the next step.

6. For each component, use the proposed power models with the obtained nodal capacitances to calculate the dynamic power dissipation value. The total power dissipation of the component is the sum of the dynamic and leakage power dissipation values.

7. For each array, calculate the total power dissipation value by summing up all component power dissipation values, for each reading and writing state.

8. For each array, perform several Hspice transient analyses to obtain the av-

erage per-cycle total power dissipation values for each reading and writ-

ing state.

9. For each array, compare the power value obtained in step 7 with the power

value obtained in step 8 for each reading and writing state to draw con-

clusions.

10. Repeat from step 4 for any changes in the value of PV , VT , VDD and T .

Repeat from step 1 for any changes in the cache/memory-organization

and configuration parameters.

The random nature of the input addresses to the row/column decoders calls for substantial modifications to steps 6 and 8 above.

• Step 6a: For each address decoder, use the proposed power models with the obtained nodal transition activity factors α_{0→1} to calculate the total dynamic and leakage power dissipation values. The total power dissipation of the decoder is the sum of the total dynamic and leakage power dissipation values.

• Step 8a: For each address decoder, the total per-cycle power value is estimated by running an Hspice transient analysis for a long trace consisting of several hundred random read/write accesses. In this work, a trace of one thousand random read/write accesses has been used.
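Step 9 compares the model estimates against the Hspice references. A minimal sketch of such a comparison, assuming accuracy is reported as one minus the relative error (the exact metric is not spelled out in the text):

```python
def estimation_accuracy(p_model, p_hspice):
    # 1 - |relative error| of the model estimate w.r.t. the simulated value
    return 1.0 - abs(p_model - p_hspice) / p_hspice

# A model estimate within 3% of the simulated value gives ~97% accuracy:
acc = estimation_accuracy(0.97e-3, 1.0e-3)
```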

In this validation work, five data SRAM arrays with different configurations and one SRAM-based tag array have been used. For simplicity of reference, each array is assigned a configuration letter as follows:

• Three 8-KB data SRAM arrays: an unpartitioned data array (referred to as 8A), a partitioned data array with N_{dwl} = 4, N_{dbl} = 16 (referred to as 8B), and a partitioned data array with N_{dwl} = N_{dbl} = 16 (referred to as 8C).

• Two 2-KB data SRAM arrays: an unpartitioned data array (referred to as 2A) and a partitioned data array with N_{dwl} = 4, N_{dbl} = 8 (referred to as 2B).

• A 2-KB partitioned SRAM-based tag array with N_{twl} = 1, N_{tbl} = 8.

All three 8-KB data SRAM arrays have been implemented in a commercial

0.13-µm CMOS process, while one 2-KB data SRAM array (i.e. 2A) has been

implemented in both a commercial 0.13-µm and a 65-nm BPTM bulk CMOS

process. Both the 2-KB partitioned data SRAM array (i.e. 2B) and the 2-KB

partitioned SRAM-based tag array have been implemented in a 65-nm BPTM

bulk CMOS process.

For conventional memory arrays in typical applications, the operational temperature T ranges from 40°C to 110°C, so a nominal middle temperature point, T = 70°C, has been selected. The typical process corner (PV = typical), the normal supply voltage and the normal threshold voltage of both CMOS processes have also been used in most Hspice simulations required for this validation work. The access frequency of each array is defined by the static timing analysis of each read/write cycle of that array. For example, array 2A has f_{clk} = 400 MHz and 512 MHz when implemented in the 0.13-µm and the BPTM 65-nm process, respectively.

Page 137: Accurate Leakage-Conscious Architecture-Level Power

5.5. VALIDATION 115

5.5.2 Validation of Power Models for Data SRAM Arrays

To prove the validity of the proposed power models for data SRAM arrays, the above-presented validation methodology has been applied to several partitioned and unpartitioned arrays of 8 KB and 2 KB in size. The selection of the array sizes is motivated by the fact that 8 KB is the practical size limit of a memory bank, or in other words, the largest allowable size of a single SRAM memory that maintains an acceptable access time without resorting to a memory-banking technique. To further reduce the total power dissipation of an SRAM array, physical partitioning techniques are applied to each separate SRAM bank.

[Bar chart: power (mW) for Mem cells, Sense Amp, Write Circuit, WordLine Drvs, Row/Col Decoders, Internal CTR, and Total Array Power]

Figure 5.11: Total power dissipation of 8-KB data arrays [blue/grey — 8A, brown/black — 8B, yellow/white — 8C]

Fig. 5.11 shows the total power dissipation of the unpartitioned array 8A (in blue/grey), the partitioned array 8B (in brown/black), and the partitioned array 8C (in yellow/white), implemented in a commercial 0.13-µm process, together with their component power values. The most basic observation is that partitioning reduces the total power dissipation of the arrays, mainly by reducing the power dissipation in the memory cells. Although partitioning requires some extra power dissipation in internal control circuits and introduces some delay overhead due to wakeup time, the total power dissipation of a partitioned array is significantly lower than in the unpartitioned case. For example, partitioning an 8-KB array with N_{dbl} = 16, N_{dwl} = 4 (i.e. the array 8B) reduces active power dissipation by 65% and leakage power by 21%, resulting in a 60% total power reduction compared to the unpartitioned array (8A) of the same size. In addition, Fig. 5.11 also shows that the array 8B, whose configuration was optimized for speed and power by CACTI 3.2, has a higher total power dissipation than the array 8C, whose configuration is less optimized for speed.

[Bar chart: power (µW) for Mcells, Sense Amp, Write Circuit, WordLine Drvs, Row/Col Decoders, Internal CTR, and Total Array Power]

Figure 5.12: Total power dissipation of 2-KB data arrays [blue/grey — 2A, brown/black — 2B]

Fig. 5.12 shows the total power dissipation of the 2-KB unpartitioned and partitioned arrays implemented in a 65-nm BPTM process, as well as a power breakdown into individual components. In this case, although partitioning still reduces the total power dissipation of the partitioned array 2B (by only 8.5%) compared to the unpartitioned array 2A, it is obvious that the partitioning configuration suggested by CACTI 3.2 for the array 2B (i.e. N_{dbl} = 8, N_{dwl} = 4) is non-optimal in terms of power reduction. Figs 5.11 and 5.12 also clearly point out the main contributors to total array power dissipation: the memory cells, the write circuits and the row/column decoders.

Fig. 5.13 shows the accuracy in estimating dynamic, leakage, and total power dissipation for the unpartitioned and partitioned 8-KB data arrays. For the memory cells, the main contributor to the total power dissipation of an array, the proposed models achieve very high accuracy in estimating dynamic power (96%), leakage power (94% for unpartitioned and 98% for partitioned arrays), and total power (97%). Although the models achieve lower accuracy in estimating dynamic power for the wordline drivers (85%), and dynamic and leakage power for the write circuits (82%), they still offer very high accuracy in estimating the total power dissipation of all the data arrays (97%).

Fig. 5.14 shows similar accuracy figures for the 2-KB unpartitioned and partitioned arrays. In this case, the accuracy achieved by the proposed models is high for the memory cells, the SAs, and the wordline drivers. The worst case in terms of accuracy is the WRC (as low as 84%). A likely reason for this inaccuracy is the short-circuit power of the WRC, which has not yet been captured in the proposed models.

The proportion of dynamic and leakage power in each array component is interesting information. Figs 5.15 and 5.16 show the proportion of dynamic and leakage power (their sum amounts to 100%) for each component of the 8A, 8B, 8C, 2A and 2B arrays. Physical partitioning reduces the total power dissipation of an array by changing the proportion of dynamic and leakage power, mostly in the memory cells and the wordline drivers. By partitioning the unpartitioned array 8A with N_{dbl} = 16, N_{dwl} = 4 and with N_{dbl} = N_{dwl} = 16, the "passive reading" dynamic power is rapidly reduced, lowering the dynamic power fraction of the memory cells from 94% (in 8A) to 68% (in 8B) and to 22% (in 8C). However, the memory


[Three bar charts, panels a)-c): accuracy (%) for Mem cells, Sense Amp, Write Circuit, WordLine Drvs, Row/Col Decoders, and Total Array Power]

Figure 5.13: Accuracy in estimating: a) dynamic power, b) leakage power, c) total power for 8-KB data arrays [blue/grey—8A, brown/black—8B, yellow/white—8C]


[Three bar charts, panels a)-c): accuracy (%) for Mcells, Sense Amp, Write Circuit, WordLine Drvs, Row/Col Decoders, and the total dynamic, leakage, and array power]

Figure 5.14: Accuracy in estimating: a) dynamic power, b) leakage power, c) total power for 2-KB data arrays [blue/grey—2A, brown/black—2B]


[Stacked bar chart: dynamic/leakage percentage (0-100%) for Mem cells, Sense Amp, Write Circuit, WordLine Drvs, and Row/Col Decoders]

Figure 5.15: The proportion of dynamic (in brown/black) and leakage (in blue/grey) power in the 8A, 8B and 8C arrays.

[Stacked bar chart: dynamic/leakage percentage (0-100%) for Mem cells, Sense Amp, Write Circuit, WordLine Drvs, and Row/Col Decoders]

Figure 5.16: The proportion of dynamic (in yellow) and leakage (in orange) power in the 2A array. The proportion of dynamic (in blue) and leakage (in brown) power in the 2B array.


cell leakage power also rapidly increases, from 6% (in 8A) to 32% (in 8B) and to 78% (in 8C). This trend makes memory-cell leakage more visible in the partitioned arrays.

A similar trend is also shown in Fig. 5.16. By partitioning the 2-KB array with N_{dbl} = 8 and N_{dwl} = 4, the dynamic power, which is dominant in the unpartitioned array 2A, swaps places with the leakage power, which becomes dominant in the partitioned array 2B. Since partitioning requires more global/local wordline drivers, of which an increasing fraction is inactive, the proportion of wordline driver leakage power increases significantly (by 11%). As a result, after partitioning, the leakage power constitutes as much as 45.5% of the array's total power dissipation.

Compared to the unpartitioned array 2A, partitioning reduces active power by 34.5%. However, it also increases leakage power by 75.8%! This result clearly suggests that the partitioning parameters obtained from CACTI 3.2 are not suitable for partitioning the given 2-KB array to reduce the total power dissipation in general, and the leakage power in particular. Furthermore, it can be concluded that the effect of partitioning on an array's power depends strongly on the choice of technology process.

5.5.3 Validation of Power Models for SRAM-based

Tag Arrays

By directly applying the same modeling methodology that was used to obtain power models for data SRAM arrays to SRAM-based tag arrays, power models for the comparator are obtained. Together with the power models of the other array components, the comparator power models are used in this section to provide power dissipation estimates of a partitioned 2-KB SRAM-based tag array for validation against the Hspice-simulated values.


[Two bar charts, panels a)-b): accuracy (%) for Mcells, Sense Amp, Write Circuit, WordLine Drvs, Row/Col Decoders, Comparator, and the total dynamic and leakage power]

Figure 5.17: Accuracy in estimating: a) dynamic power, b) leakage power for a 2-KB SRAM-based tag array

Fig. 5.17 shows the accuracy in estimating dynamic (part a) and leakage (part b) power dissipation for the tag array. As discussed in Section 5.4.3, a tag array comparator dissipates dynamic power only in the mismatch-case; in the match-case it is only "leaking". Therefore, the accuracy values shown in Fig. 5.17a are for the mismatch-case (i.e. when a cache miss occurs), while those shown in Fig. 5.17b are for the match-case only.


[Bar chart: power (mW) for Mcells, Sense Amp, Write Circuit, WordLine Drvs, Row/Col Decoders, Comparator, Internal CTR, and Total Array Power]

Figure 5.18: Total power dissipation of a 2-KB partitioned SRAM-based tag array (blue/grey) and a 2-KB partitioned data array (brown/black)

Fig. 5.18 shows the Hspice-simulated total power dissipation of a 2-KB partitioned SRAM-based tag array (in blue/grey) and a 2-KB partitioned data array (in brown/black), implemented in a 65-nm BPTM bulk CMOS process, together with their component power values. The tag array is physically partitioned with N_{twl} = 1, N_{tbl} = 8, whereas the data array is partitioned with N_{dwl} = 4, N_{dbl} = 8, forming a complete 2-KB SRAM-based cache. The data array has 128 rows of 128 6T-SRAM cells each, while the tag array has 128 rows of 22 6T-SRAM cells each. Although both arrays dissipate nearly the same amount of total power, they have very different power breakdowns. While the major contributors to total array power in the data array are the memory cells and the wordline drivers, in the tag array it is the write circuits. The main reasons for these differences are: (i) the number of write circuits in the tag array (22) is larger than in the data array (8); (ii) the number of memory cells in the tag array (22 × 128) is smaller than in the data array (128 × 128). In addition, the internal control circuits also dissipate a significant amount of power in both arrays.

The proposed models achieve very high accuracy in estimating both dynamic (97%) and leakage (96%) power for the memory cells, the SAs, and the wordline drivers. For the write circuits, the decoders and the comparator, however, the obtained accuracy values are not impressively high. There is more than one reason for this. First, since the modeling methodology that was used to obtain power models for data SRAM arrays is directly applied to the comparator (a dynamic-style circuit), some of its features may not yet be captured in the obtained power models. This problem requires further research. Second, the total power dissipation of an SRAM-based tag array depends strongly on the cache hit ratio H_{cache}, which is normally maintained as high as 99% [16]. Thus, for the comparator the dynamic power is not an issue.

5.6 Thermal and Variability Issues

In the presence of temperature variations, dynamic power dissipation remains unchanged while static power does not; the subthreshold component depends exponentially on temperature. In the presence of supply voltage variations, switching power depends quadratically on voltage, while both the subthreshold and gate leakage components depend exponentially on voltage [3]. Therefore, in a power modeling approach that targets very high accuracy, temperature- and supply-voltage-aware leakage power modeling becomes unavoidable.

5.6.1 Modeling the Dependence of Leakage on Temperature

To avoid the complexity of an analytical approach while maintaining a high degree of accuracy and flexibility, a simulation-based approach to temperature-aware leakage power estimation has been used. At any fixed temperature, the proposed power models offer high accuracy (up to 96% [17]) in estimating total power as well as dynamic and leakage power dissipation for both partitioned and unpartitioned SRAM arrays. To preserve high accuracy in the presence of temperature variations, the power models are extended to a number of temperatures through a systematic extension of simulation points.

The first priority in assuring high accuracy in estimating the total power dissipation is to accurately capture the dependence of leakage on temperature for the memory cells. It is possible to introduce temperature-dependent power models for all other memory blocks too, but since memory-cell leakage power is the main constituent of the total leakage power dissipation of an SRAM array (approximately 78% [11]), it may suffice to consider only the memory-cell model. In the context of temperature-dependent memory-cell power modeling, this translates into a somewhat stricter accuracy requirement on the memory-cell model, Acc^{mcells}_{leak}, which in turn defines the number of temperature points at which the memory-cell model needs to be defined.

To model the dependence of leakage power on temperature for a 6T-SRAM cell, we need to (i) select the temperature range of interest by specifying T_{low} and T_{high}; (ii) obtain two leakage power values by running a short DC simulation for the 6T-SRAM cell at the specified T_{low} and T_{high}; (iii) calculate the number of simulation points using Eqs 5.56 - 5.57 with the given allowable accuracy in estimating leakage power for the memory cells, Acc^{mcells}_{leak}; (iv) obtain leakage power values at the specified temperature points by running a short DC simulation with a temperature sweep.

N_{T.interval} \geq \frac{I_{sub}(\text{at } T_{high}) - I_{sub}(\text{at } T_{low})}{2\, Acc^{mcells}_{leak}\, I_{sub}(\text{at } T_{low})}    (5.56)

N_{simulation\ point} = N_{T.interval} + 1    (5.57)

The number of simulation points is defined by the leakage power accuracy specified for the entire temperature range. An example is shown in Fig. 5.19, where the given temperature range of interest is 50-100 °C, Acc^{mcells}_{leak} = 10%, and I_{sub}(at T_{high})/I_{sub}(at T_{low}) = 5, resulting in N_{simulation\ point} = 22.


126 CHAPTER 5. MODULAR APPROACH TO POWER MODELING

In Eq. 5.56, N_{T.interval} is an integer denoting the number of intervals between simulation points into which the selected temperature range is divided.
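As a minimal sketch (function and variable names are ours, not from the thesis), the simulation-point count of Eqs. 5.56 – 5.57 can be computed as follows:

```python
import math

def num_simulation_points(i_sub_low, i_sub_high, acc_mcells_leak):
    """Smallest simulation-point count satisfying Eqs. 5.56-5.57.

    i_sub_low, i_sub_high: subthreshold leakage currents (A) from short DC
    simulations at T_low and T_high; acc_mcells_leak: allowable relative
    error in memory-cell leakage, e.g. 0.10 for 10%.
    """
    bound = (i_sub_high - i_sub_low) / (2 * acc_mcells_leak * i_sub_low)
    # Eq. 5.56 is a lower bound; take the smallest integer satisfying it
    # (the small epsilon guards against floating-point rounding).
    n_intervals = math.ceil(bound - 1e-9)
    return n_intervals + 1          # Eq. 5.57

# I_sub(T_high)/I_sub(T_low) = 5 at 10% allowable error:
print(num_simulation_points(1.0e-9, 5.0e-9, 0.10))
```

Note that Eq. 5.56 only gives a minimum; any larger point count, such as the 22 points used in Fig. 5.19 for this ratio, also satisfies the inequality.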

[Figure 5.19: Subthreshold leakage power as a function of temperature for a 6T-SRAM cell (commercial 130-nm). The plot shows I_{sub} (A) versus temperature (°C), marking T_{low}, T_{high}, I_{sub}(at T_{low}), I_{sub}(at T_{high}), and the 22 simulation points.]

For a typical SRAM, the leakage power of the other memory components, P^{others}_{leak}, constitutes approximately one fifth of the total array's leakage power, P_{leak} [11]. As mentioned before, temperature modeling for these other components can be omitted to simplify the temperature-modeling approach. In this case, although the accuracy requirement on the memory cells becomes stricter than in the original case, this causes neither fundamental problems nor significant changes in our power models. Eqs. 5.58 – 5.60 define the accuracy requirement on the memory cells, given the allowable accuracy in estimating total leakage power, Acc_{leak}; Ratio^{others}_{leak/total} – the ratio between P^{others}_{leak} and P_{leak}; and Ratio^{mcells}_{leak/total} – the ratio between the leakage power of the memory cells, P^{mcells}_{leak}, and P_{leak}.

$$Acc^{mcells}_{leak} = \frac{Acc_{leak}}{Ratio^{mcells}_{leak/total}} - \frac{Ratio^{others}_{leak/total}}{Ratio^{mcells}_{leak/total}} \qquad (5.58)$$


5.6. THERMAL AND VARIABILITY ISSUES 127

$$Ratio^{mcells}_{leak/total} = \frac{P^{mcells}_{leak}}{P_{leak}} \qquad (5.59)$$

$$Ratio^{others}_{leak/total} = \frac{P^{others}_{leak}}{P_{leak}} \qquad (5.60)$$
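This bookkeeping can be sketched in a few lines (names are ours; the 30% budget below is an illustrative input, not a value from the thesis, and the formula follows Eq. 5.58 as reconstructed above):

```python
def acc_mcells_leak(acc_leak, ratio_mcells, ratio_others):
    """Accuracy requirement on the memory-cell model per Eq. 5.58.

    acc_leak: allowable relative error in total leakage power;
    ratio_mcells, ratio_others: leakage shares per Eqs. 5.59-5.60
    (they sum to 1 when the cells and the 'others' make up the array).
    """
    return acc_leak / ratio_mcells - ratio_others / ratio_mcells

# With memory cells at ~78% of array leakage [11] and the rest at ~22%,
# an illustrative 30% budget on total leakage tightens on the cells:
req = acc_mcells_leak(0.30, 0.78, 0.22)
print(f"{req:.3f}")
```

The stricter requirement on the memory cells then sizes the temperature table via Eqs. 5.56 – 5.57.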

5.6.2 Modeling Leakage with Variation in Supply Voltage

A common method to efficiently reduce total power dissipation is to reduce

the supply voltage, since switching power has a quadratic dependence on Vdd

while leakage power has an exponential one. Several leakage-reduction tech-

niques at the circuit level have been utilized for architecture-level leakage-

control schemes: Either the power to cache lines can be cut off (i.e. “gated-

Vdd” schemes, in which leakage basically is eliminated entirely) or it can be

put at an intermediate voltage level (i.e. “drowsy” schemes) to guarantee mem-

ory data is retained. Drowsy schemes have received considerable attention; it

was shown [18] that total cache leakage energy was reduced by an average of

76% at a wakeup penalty, for a drowsy cache line, of no more than one cycle.

Drowsy caches can be implemented using simple control circuits to assign dif-

ferent voltage levels, called tranquility levels—V drowsytlevel , at different priority

levels, based on information of replacement policies used [19].

To model the leakage power for “drowsy” memories, the dominating leak-

age mechanisms need to be modeled only for the circuits that exhibit static

power in idle mode. Only the SRAM cells need to be driven by the intermedi-

ate voltage level; all other circuits can be power gated completely. Thus, only a

leakage model for the SRAM cell’s dependence on the power supply’s tranquil-

ity level is required.

From Eqs. 3.2 – 3.5 and the BSIM4 equations for threshold voltage [1] it is clear that the subthreshold leakage current depends on supply voltage as e^{V_{dd}}, which, in comparison to its dependence on temperature, is very straightforward. Based on this observation, a physically based analytical approach for modeling the leakage dependence of memory cells on supply voltage is proposed.


For the sake of simplicity, linearly distributed voltages are assumed for the N^{drowsy}_{tlevel} tranquility levels between the lowest possible operating voltage (V^{drowsy}_{min.tlevel} ≈ V_T + 200 mV, representing deep sleep mode) and the full supply voltage. The relation between I^{mcell}_{leak} and V^{drowsy}_{tlevel} is established from the gate and subthreshold leakage currents obtained by running a short DC simulation for a 6T-SRAM cell with the supply voltage varying from V^{drowsy}_{min.tlevel} to the full supply voltage. Then, the total leakage power of the memory cells is expressed as a function of V_{dd}:

$$P^{mcells}_{leak}(V_{dd}) = N_{mcells}\, V_{dd}\, I^{mcell}_{leak}(V_{dd}) \qquad (5.61)$$

$$I^{mcell}_{leak}(V_{dd}) = I^{mcell}_{gleak}(V_{dd}) + I^{mcell}_{subleak}(V_{dd}) \qquad (5.62)$$

[Figure 5.20: Gate and subthreshold leakage power as functions of V_{dd} for a 6T-SRAM cell (BPTM 65-nm [20]). The plot shows I_{sleak}, I_{gleak}, and I_{total} (A) versus V_{dd} (V), with the fitted curves I_{sleak} = 1.45×10^{-10} + 0.99 e^{-18.7 + 0.92 V_{dd}} and I_{gleak} = 2.14×10^{-10} e^{5.23 V_{dd}}.]

Both the gate and subthreshold leakage power of a memory cell have exponential dependencies on V_{dd}. However, since gate leakage power is very sensitive to changes in the transistor gate voltage, it depends strongly on V_{dd}. Subthreshold leakage power, on the other hand, is less sensitive to changes in V_{dd}, which is reflected in a weaker exponential function. Fig. 5.20 shows the dependence of gate and subthreshold leakage power on V_{dd} for a 65-nm BPTM 6T-SRAM cell with the maximum V_{dd} = 0.9 V, V^{drowsy}_{min.tlevel} = 0.6 V, N^{drowsy}_{tlevel} = 8, and T = 70 °C [20].
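Using the fitted curves from Fig. 5.20 together with Eqs. 5.61 – 5.62, the drowsy-mode memory-cell leakage can be sketched as follows. The cell count and the tranquility-level sweep are illustrative, and the fit coefficients are read off the figure, so they apply only to this 65-nm BPTM cell at 70 °C:

```python
import math

def i_subleak(vdd):
    """Subthreshold leakage current (A), Fig. 5.20 curve fit."""
    return 1.45e-10 + 0.99 * math.exp(-18.7 + 0.92 * vdd)

def i_gleak(vdd):
    """Gate leakage current (A), Fig. 5.20 curve fit."""
    return 2.14e-10 * math.exp(5.23 * vdd)

def p_mcells_leak(n_mcells, vdd):
    """Total memory-cell leakage power (W) per Eqs. 5.61-5.62."""
    return n_mcells * vdd * (i_gleak(vdd) + i_subleak(vdd))

# Linearly distributed tranquility levels between V_min = 0.6 V and 0.9 V:
n_tlevels, v_min, v_max = 8, 0.6, 0.9
for k in range(n_tlevels):
    v = v_min + k * (v_max - v_min) / (n_tlevels - 1)
    print(f"V_tlevel = {v:.3f} V -> {p_mcells_leak(16384, v):.2e} W")
```

The sweep makes the asymmetry visible: the gate-leakage term dominates the change with V_{dd}, while the subthreshold term varies only weakly, as discussed above.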

[Figure 5.21: The subthreshold leakage power’s dependence on temperature for a 6T-SRAM cell (commercial 130-nm with process corners: SS, TT, FF). The plot shows I_{sub} (A) versus temperature (°C) for the TT, FF, and SS corners.]

5.6.3 Modeling the Dependence of Leakage on Process Corner

The notion of process corners represents a straightforward way to capture manufacturing-induced variations in device characteristics in simulation. The process corner TT denotes the typical case for both NMOS and PMOS devices. This is the corner all simulations routinely are based on, and so are all power models presented thus far. The corner SS (Slow NMOS and PMOS), on the other hand, assumes the slowest possible devices, leading to the lowest leakage, whereas FF (Fast NMOS and PMOS) conversely yields the highest leakage.

During design exploration of SRAM arrays, evaluation of process corners can prove useful for understanding how device variability impacts the resulting leakage power. Fig. 5.21 shows the subthreshold leakage power of a 130-nm 6T-SRAM cell for the three different corners as a function of temperature. As expected, the different process corners give different memory-cell leakage power; the magnitude varies by as much as 10×.

As shown in Fig. 5.21, for all process corners the obtained memory-cell leakage power has a similar dependence on temperature, and likewise a similar dependence on supply voltage. Not surprisingly, the process corner is just another input dimension, next to e.g. temperature, that can be added to the leakage power tables. Since the proposed approach is fully parameterizable with respect to memory size (one integer defines the row count, while another defines the column count), only one instance of the memory-cell power models is used. Therefore, the complexity of using the proposed method does not increase with the added complexity of the core model of the memory cell.
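To illustrate how the corner becomes just another table dimension next to temperature, here is a sketch of a leakage lookup keyed by (corner, temperature); the numeric values are placeholders, not characterized data:

```python
# Placeholder per-cell leakage table (A), indexed by (corner, temperature in C).
CELL_LEAK_TABLE = {
    ("SS", 50): 2.0e-9, ("SS", 100): 9.0e-9,
    ("TT", 50): 5.0e-9, ("TT", 100): 2.2e-8,
    ("FF", 50): 1.4e-8, ("FF", 100): 6.0e-8,
}

def cell_leak(corner, temp):
    """Linearly interpolate per-cell leakage between the table's two
    temperature points for the given process corner."""
    (t0, i0), (t1, i1) = sorted(
        (t, i) for (c, t), i in CELL_LEAK_TABLE.items() if c == corner)
    return i0 + (i1 - i0) * (temp - t0) / (t1 - t0)

def array_cell_leak_power(rows, cols, vdd, corner, temp):
    """Memory-cell leakage power (W) of a rows x cols array; the row and
    column counts are the two integers that parameterize the model."""
    return rows * cols * vdd * cell_leak(corner, temp)

print(f"{array_cell_leak_power(256, 128, 1.2, 'TT', 75):.2e}")
```

Only the single cell-level table grows with the extra dimension; the array-level model stays a multiplication by the row and column counts.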

Bibliography

[1] Univ. California Berkeley Device Group, BSIM4.2.1 MOSFET Model: User’s

Manual, Dept. of EECS, Univ. of California, Berkeley, CA 94720, USA, 2002.

[2] Y. Zhang et al., CS 2003-05: HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects, Dept. of CS, Univ. of Virginia, USA, 2003.

[3] W. Liao et al., “Temperature and Supply Voltage Aware Performance and Power

Modeling at Microarchitecture Level,” IEEE Trans. on CAD of ICS, vol. 24, no. 7,

pp. 1042–53, July 2005.

[4] D. Tarjan et al., HPL 2006-86: CACTI4.0, HP, 2006.


[5] M. Mamidipaka et al., CECS 04-28: eCACTI: An Enhanced Power Estimation

Model for On-chip Caches, CECS, Univ. of California, Irvine, USA, 2004.

[6] International Technology Roadmap for Semiconductors, http://public.itrs.net,

ITRS, 2006.

[7] M. Q. Do, P. Larsson-Edefors, and L. Bengtsson, “Table-based Total Power Con-

sumption Estimation of Memory Arrays for Architects,” in Proceedings of Inter-

national Workshop on Power and Timing Modeling, Optimization and Simulation

(PATMOS’04), LNCS 3254, Sept. 2004, pp. 869–878.

[8] M. Q. Do, M. Draždžiulis, and P. Larsson-Edefors, Current Probing Methodology for Static Power Extraction in Sub-90nm CMOS Circuits, Technical Report No. 2007-07, Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, 2007.

[9] T. Wada et al., “An Analytical Access Time Model for On-Chip Cache Memories,”

JSSC, vol. 27, no. 8, pp. 1147–56, Aug. 1992.

[10] A. Chandrakasan et al., Design of High-Performance Microprocessor Circuits,

IEEE Press, 2001.

[11] M. Q. Do, M. Draždžiulis, P. Larsson-Edefors, and L. Bengtsson, “Parameterizable Architecture-level SRAM Power Model Using Circuit-simulation

Backend for Leakage Calibration,” in Proceedings of International Symposium

on Quality Electronic Design (ISQED), March 2006, pp. 557–563.

[12] S.J.E. Wilton and N.P. Jouppi, WRL Research Report 93/5: An Enhanced Access

and Cycle Time Model for On-chip Caches, Western Research Laboratory, 1994.

[13] A. P. Chandrakasan and R.W. Brodersen, “Minimizing Power Consumption in

Digital CMOS Circuits,” Proceedings of the IEEE, vol. 83, no. 4, pp. 498–523,

April 1995.

[14] A. Karandikar et al., “Low Power SRAM Design Using Hierarchical Divided Bit-

line Approach,” in ICCD 1998, Oct. 1998, pp. 82–8.

[15] K. Pagiamtzis and A. Sheikholeslami, “Content-Addressable Memory (CAM) Cir-

cuits and Architectures: A Tutorial and Survey,” IEEE Journal of Solid-State Cir-

cuits, vol. 41, no. 3, pp. 712–727, Mar. 2006.

[16] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Ap-

proach, Morgan Kaufmann, fourth edition, 2006.


[17] M. Q. Do, M. Draždžiulis, P. Larsson-Edefors, and L. Bengtsson,

“Leakage-Conscious Architecture-Level Power Estimation for Partitioned and

Power-Gated SRAM Arrays,” in Proceedings of International Symposium on Qual-

ity Electronic Design (ISQED), March 2007.

[18] K. Flautner et al., “Drowsy Caches: Simple Techniques for Reducing Leakage

Power,” in ISCA 2002, May 2002, pp. 148–57.

[19] N. Mohyuddin et al., “Controlling Leakage Power with the Replacement Policy in Slumberous Caches,” in CF 2005, May 2005, pp. 161–70.

[20] Y. Cao et al., “New paradigm of predictive MOSFET and interconnect modeling

for early circuit design,” in CICC 2000, 2000, pp. 201–4.


6 Conclusion and Future Work

In this chapter, conclusions on the presented work are given and future work on

both the power modeling part and on the implementation of the proposed power

models in a high-level power-performance simulator is discussed.

6.1 Conclusion

Following Moore’s Law, the number of transistors that can be integrated on a chip doubles every two years, and leakage power increases exponentially along with it. This increase in transistor count leads to one of the most difficult-to-solve problems for the semiconductor industry: leakage power dissipation. Although sub-threshold leakage still remains the main contributor to total leakage, other mechanisms such as gate oxide tunneling and junction (BTBT) leakage are of increasing significance. When the total leakage power approaches about 50% of the total power, further supply-voltage scaling of conventional MOS transistors will no longer make sense; the problem is that the accompanying scaling of the threshold voltage gives rise to even more leakage power. This puts serious demands on low-power design, on leakage control and reduction techniques, and eventually on leakage power estimation tools. Therefore, accurate leakage power estimation is needed to allow designers to make good design trade-offs at higher, architectural design levels.

Since all leakage mechanisms are closely related to the physical behavior of MOS transistors, circuit-level simulators are needed in order to maintain high accuracy in estimating leakage power dissipation. However, this high accuracy comes at an extremely high cost in the form of computational complexity, since those circuit-level simulators are built on very complex, technology-dependent, and detailed analytical power models, e.g. BSIM3 or BSIM4. Obviously, circuit-level simulation alone is not a viable solution. On the other hand, as shown in Section 1.3 of this dissertation, neither can simplified analytical leakage power models resolve the conflicting requirements placed on leakage power estimation: high accuracy, flexibility, and simplicity. This is the area in which our research work intends to contribute.

This dissertation presents a modular, hybrid power modeling methodology

capable of capturing accurately both dynamic and leakage power mechanisms

for SRAM-based memory structures like on-chip caches and SRAM arrays.

The methodology successfully combines the most valuable advantage of circuit-

level power estimation – high accuracy – with the flexibility of higher-level

power estimation while allowing for short component characterization and es-

timation time. The methodology offers high-level parameterizable, but still ac-

curate power dissipation estimation models that consist of analytical equations

for dynamic power and pre-characterized leakage power values stored in tables.


Through verification for a number of SRAM arrays and on-chip caches with

different configurations implemented in 0.13-µm and 65-nm CMOS processes,

the proposed power models show a high accuracy in estimating both dynamic

and static power for all the SRAM array and cache components.

In order to correctly capture the total leakage currents of sub-90nm logic circuits when circuit simulators, such as HSPICE, are employed, a methodology for probing CMOS circuits for static current measurements during simulation has been proposed. In the power-modeling validation part, the proposed probing methodology has been used successfully to obtain accurate and distinguishable static power constituents (i.e. gate, sub-threshold, and total leakage power) for several unpartitioned and physically partitioned data SRAM arrays and an SRAM-based tag array implemented in a BPTM 65-nm process.

In addition, a modeling methodology to capture the dependence of leakage

power on temperature variation, on supply-voltage scaling, and on the selection

of process corners has also been presented. This methodology provides an es-

sential extension to the proposed power models.

The proposed power modeling methodology and power models, as far as

we know, are the first ones that can offer high-level, parameterizable, relatively

simple and high-accuracy cache power estimation models accounting for both

dynamic and static power consumption.

6.2 Future Work

The following is a list of major tasks that are subject to future work:

1. As mentioned in Section 5.5.3, the obtained power models for a comparator of an SRAM-based tag array do not yet correctly capture all power dissipation mechanisms present in the comparator, and therefore still suffer from low accuracy in estimating leakage power (i.e. 80%). This problem requires some additional research to solve.

2. Thermal management and hot-spot identification are emerging issues due to technology scaling. By knowing the leakage power density within a chip, it is possible to obtain a thermal map of it. Thus, coupling the power models to a thermal map is a good topic for future research.

3. There is a need to implement our proposed power models in an existing power simulator, e.g. CACTI, to improve its power dissipation estimates. This is also a good topic for future research.

4. The proposed power modeling methodology is modular and applicable to any type of component with a regular structure that satisfies the following two main requirements: (i) the number of internal hardware block/cell instances is finite; (ii) the netlists of typical components are provided. Therefore, potential candidate components for future work are Content-Addressable Memories (CAMs) and clocking networks.


Part IV

Appendix


A DSP-PP – A Power Estimation and Performance Analysis Tool for Parallel DSP Architectures

This chapter is devoted to the description of the work done in designing and implementing an architecture-level, cycle-accurate power-performance simulator for parallel DSP architectures (DSP-PP). Section A.1 gives some background information on the special characteristics of DSP architectures. Section A.2 then describes in detail the design of the DSP-PP simulator and its usage in estimating the performance and power consumption of parallel DSP architectures.


A.1 Characteristics of DSP Architectures

Compared to microprocessors, DSP architectures have the following special characteristics:

1. A fixed-point DSP usually has one or more MACs, each of which consists of a single-cycle (or pipelined) multiplier, a fixed-point ALU operating on double-wordlength operands, double-wordlength accumulators, shifters, and registers. A floating-point DSP usually has one or more floating-point MACs used together with one or more fixed-point or floating-point ALUs. A DSP usually provides good support for saturation arithmetic, rounding, and shifting.

2. In order to save cost and reduce energy consumption, DSPs tend to use the shortest data word and to lower the clock frequency to the minimum value that still provides adequate accuracy in the target applications. The data words of most fixed-point DSPs are 16 or 32 bits.

3. A combination of several special-purpose registers (e.g. accumulators) and general-purpose register files is used. The number of registers is usually smaller than in microprocessors.

4. Separate, small, multiple-ported data memories (e.g. two data memories for X and Y operands) and program memories are used. Data memories are usually built in-core with sizes up to 64 KB. Program memory can be built either on-core or off-core, with a size relatively larger than that of the data memories. A small single-level instruction cache can be used to store the program. Some DSPs also use a unified data-program memory structure, but the Harvard memory structure is most common.

5. Multiple busses are used to communicate between the datapaths and the memory subsystem (on-chip or off-chip), between the DSP core and peripherals, etc. A DSP usually has specialized interfaces, e.g. analogue-to-digital and digital-to-analogue converters.


6. A simple instruction pipeline is used. Since a DSP assumes that data dependencies are known and data flow is predictable (which is always possible for DSP applications), there is no out-of-order issue or execution of instructions.

7. A simplified instruction set, consisting mostly of simple instructions for datapath functions, is used (about 80% of the instructions are the most frequently used DSP instructions; the other 20% are multi-clock complex instructions). A DSP has instruction-level functional acceleration (i.e. the ability to accelerate and merge the most frequently used DSP instructions into subroutines); therefore DSP instructions are very efficient and DSP code size is small.

8. Register-register (register direct), memory-memory (memory direct), cir-

cular addressing, immediate data, register indirect, and register-indirect

with post-increment addressing modes are used. These addressing modes

require complex data memory addressing circuits.

9. A hardwired instruction decoder is used, generating control signals for the datapaths.

10. Most DSP applications and algorithms are implemented in assembler, and sometimes in C.

11. Parallel DSPs usually achieve a high level of parallelism by combining multiple cores with the same architecture in one system (e.g. the BOPS ManArray DSP architecture). Communication and data transfer between cores are provided by switching fabrics and system buses that are capable of interconnecting and organizing a set of cores into standard ring, mesh, torus, hypercube, and other organizations. Local parallelism is achieved mainly by using multiple datapaths and multiple resources (e.g. VLIW DSP architectures) and by instruction pipelining.


A.2 DSP-PP

A.2.1 Features of the DSP-PP

The DSP-PP is a cycle-accurate performance simulator and power consumption estimator for parallel DSP architectures. The DSP-PP has been designed using an object-oriented approach and written in C++ with the SystemC library to provide a high level of abstraction and encapsulation, as well as flexibility and extendibility of the simulation program. The block diagram of the DSP-PP simulator/estimator is shown in Fig. A.1.

The first version of the simulator (DSP-PP version 1.0) was implemented using analytical power models that were developed based mainly on the Wattch power models, with added components for leakage power dissipation estimation [1]. However, these modified Wattch power models show large errors (as much as 15–70%) in estimating power dissipation compared to the values obtained using a circuit-level power estimation tool such as HSPICE [2]. Such accuracy does not sufficiently satisfy the architecture-level accuracy requirement, and therefore the modified analytical Wattch power models cannot be used in our DSP-PP simulator. This problem triggered some research ideas, leading us to the introduction of the WTTPC approach and the table-based power dissipation models that are implemented in the current version of the DSP-PP simulator. The implemented simulator (version 2.0) is described in more detail in Section A.2.2 below.

The DSP-PP consists of two components: the Cycle-level Performance Sim-

ulator (CPS) and the Power-Dissipation Estimator (PDE).

Cycle-level Performance Simulator (CPS):

The CPS is an execution-driven cycle-accurate performance simulator. The

main functions of CPS are as follows:

[Figure A.1: Block Diagram of the DSP Power-Performance Simulator. Inputs: a program executable or compiled benchmark, a hardware configuration, process technology parameters, and power models (tables). The Cycle-by-cycle Performance Simulator produces a performance estimate and a trace that the Power Dissipation Estimator combines with the power models to produce a power consumption estimate.]

1. Accepts as input an executable program obtained by compiling the input benchmarks, as well as the PE/DSP configuration (here, PE denotes a processing element).

2. Simulates, cycle-by-cycle, instruction execution and dataflows between

PE components as well as between parallel DSP architecture components.

3. Generates output performance statistics (i.e. program cycle counts) and

cycle-by-cycle traces.

Using object-oriented programming techniques, all components are modeled as objects. Each object accepts a certain type of input data, performs certain functions, and generates defined outputs. Moreover, each object also has a power consumption model and a hardware access count that can be sent directly to the PDE to create the power consumption estimate. The communication between objects, and the order of that communication, is handled by an event scheduler, the Simulator Engine, which is the core module of the DSP-PP simulator.

Power Dissipation Estimator (PDE):

The PDE consists of power consumption models for the DSP components and a total-power-estimation-engine module used to calculate the overall power dissipation of the entire parallel DSP architecture in a cycle-by-cycle manner. These power models include WTTPC tables of power values for memory arrays and similar types of components, and parameter sets for other types of DSP components, such as arithmetic-logic circuits. The main functions of the PDE are as follows:

1. Accepts as input cycle-by-cycle traces from the CPS for the different hardware components of the parallel DSP architecture, as well as the PE/DSP configuration and the configuration of the entire parallel DSP architecture.

2. Generates power estimation values in a cycle-by-cycle manner for the

given configuration.

In order to reduce the number of WTTPC tables created for each component, the PDE is designed, similarly to what was done in [8], so that it can interpolate (using curve-fitting interpolation functions) between component characterization points covering the entire possible design range of that component.
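Such interpolation between characterization points can be sketched as follows (piecewise-linear here for brevity; the actual PDE may use other curve fits, and the data points below are invented):

```python
import bisect

def interp_power(char_points, x):
    """Piecewise-linear interpolation over sorted (parameter, power) pairs,
    e.g. power characterized at a few component sizes; values outside the
    characterized range are clamped to the nearest endpoint."""
    xs = [p for p, _ in char_points]
    i = bisect.bisect_left(xs, x)
    if i == 0:
        return char_points[0][1]
    if i == len(char_points):
        return char_points[-1][1]
    (x0, y0), (x1, y1) = char_points[i - 1], char_points[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

# Invented characterization of a component's energy per access (J) vs. bit width:
points = [(8, 1.0e-12), (16, 2.1e-12), (32, 4.5e-12)]
print(interp_power(points, 24))
```

Only a handful of characterization points per component then need to be stored, with intermediate configurations served by the interpolation function.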

A.2.2 Description of the DSP-PP Simulator (Version 2.0)

The Cycle-level Performance Simulator has been fully implemented for the extended ManArray DSP architecture. The Power Dissipation Estimator is partially implemented, and the power consumption models for all DSP components are still part of our ongoing research. The implementation of DSP-PP was done in a Master's thesis project by Firas Milh [3]. This section gives a brief description of the implemented DSP-PP simulator, version 2.0.

The program code for the simulator is written in Visual C++ 6.0 using the SystemC library. The code is divided into two projects: one contains the simulator and the other contains the graphical user interface. The simulator is divided into a collection of files, where every implemented unit has two files associated with it: one declaration file (*.h) and one implementation file (*.cpp). There are also files associated with the main program, the shared memory, and the different classes used for communication between simulator units. The project for the GUI is a Microsoft Windows project based on dialog boxes.


System Overview

In order to fulfill the design features of the DSP-PP defined in Section A.2.1 above, several changes were made to the ManArray model and the assembly language. Among the most important modifications is the number of units in each PE. Instead of the fixed five units, additional ALUs, MAUs, and DSUs are supported. For each of these three unit types there can be at most 10 units, which, together with the single LU and SU, sums up to a total of 32 execution units per PE in the widest configuration. This change results in additional VLIW memory, memory ports, and a few additional registers to keep track of the status of each unit. Another generalization is the ability to have an arbitrary number of PEs connected to the SP. This flexibility is limited only by the available system resources of the host machine. The Cluster Switch is resized dynamically to accommodate the number of PEs. A limitation of the simulator is the lack of support for some of the instructions in the ManArray instruction set and for the DMA capability [3].

These modifications turn the original ManArray architecture into a very flexible parallel DSP architecture, capable of changing the number of PEs attached to each core as well as the number of execution units inside each SP/PE (up to 32 units), and of reconfiguring the cluster switches to handle connections between any number of PEs and SPs. Therefore, this extended version of the ManArray architecture can serve as a base for other types of DSPs. For example, with only a single active MAC (i.e. an MAU) and with the iVLIW pipeline organization turned off, the architecture resembles a “simple” DSP, while a general-purpose VLIW DSP is resembled by this architecture with all five execution units active and the VLIW pipeline organization turned on.

This extended ManArray architecture allows users to elaborate their ideas using different numbers of functional units, different sizes of register files, different sizes of memories, etc. in their exploration of the DSP architecture design space.


Cycle-Accurate Modeling

The simulator executes the given code cycle by cycle, registering important events and bit transitions within the architecture model. Counter variables built into the simulator keep track of all important accesses, unit activities, and bit flips. At every clock cycle, every type of access is registered both by the counter variables inside the simulator and in files, with one file per Processing Element. The files have the format of comma-separated lists, with one row for each cycle and one entry per counter variable in each row. These files can be read by Microsoft Excel or other spreadsheet software, which makes calculation and manipulation of the statistics rather straightforward. There is also a small program, written to accompany the simulator, that reads the count variables directly from the simulator through shared memory.
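Such comma-separated trace files can be aggregated with a few lines of script; the column names and the header row below are invented for illustration, since the thesis does not fix the file layout:

```python
import csv
import io

# Hypothetical per-PE trace: one row per cycle, one entry per counter variable.
trace = io.StringIO(
    "cycle,alu_active,mem_reads,bit_flips\n"
    "0,1,2,5\n"
    "1,0,1,3\n")

# Sum each counter column over all simulated cycles.
totals = {"alu_active": 0, "mem_reads": 0, "bit_flips": 0}
for row in csv.DictReader(trace):
    for key in totals:
        totals[key] += int(row[key])
print(totals)
```

The same per-cycle rows can of course be fed directly to a power model instead of being summed, which is what the PDE does.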

Active Units

Every unit inside the SP and each PE in the simulator model has an activity counter which keeps track of the number of clock cycles that the unit is active. Every memory read is registered in an access-count structure with three elements for every type of read: the first element holds the number of actual independent reads, the second holds the number of zeroes read, and the third holds the number of ones read. These counters are separate for each SP and PE. Every memory write is also registered in an access-count structure, with five elements for every type of write: the first element holds the number of actual independent writes, the second holds the number of writes where a zero is written to a bit containing a zero, the third holds the number of writes where a zero is written to a bit containing a one, the fourth holds the number of writes where a one is written to a bit containing a zero, and the fifth holds the number of writes where a one is written to a bit containing a one. These counters are also separate for each SP and PE.
bit containing a one. These counter are also separate for each SP and PE.


Implementation Overview

At the topmost level the Sequence Processor (SP), an array of Processing Ele-

ments (PEs), and the cluster switch is declared and connected with appropriate

signals. Each of the units at this level have a main clock signal for synchroniza-

tion purposes. Fig. A.2 and Fig. A.3 shows the interconnection of components

inside a PE and a SP of the simulator, respectively.

[Figure A.2: block diagram omitted. Components shown: VIMs (Shared Memory), IR1, IR2, CF Decode, CF Exec, Branch, EPLoop, and the Instruction Memory.]

Figure A.2: Interconnection of components inside an SP of the extended ManArray architecture [3]

The SP is connected to each of the PEs with a set of instruction-carrying signals. These signals are used by the SP to dispatch instructions directly to the corresponding port of each of the PE's units. There are two sets of signals carrying instructions from the SP to each unit of each PE. Each set is associated with one of the two pipeline modes: Normal Pipeline (NP) and Extended Pipeline (EP). The instruction port set associated with the NP is a single set and is dispatched from the IR1 unit of the SP to the decode stage of each of the execution units of each PE. Since these ports are single, each PE connected to


[Figure A.3: block diagram omitted. Components shown: SU Decode/Exec, LU Decode/Exec, ALU Decode/Exec 0..n, MAU Decode/Exec 0..n, DSU Decode/Exec 0..n, SP Data RF, Shared Memory, and PostCND.]

Figure A.3: Interconnection of components inside a PE of the extended ManArray architecture [3]


these ports sees the same information at every clock cycle. The instruction port set associated with the EP is dispatched from IR2 of the SP and is multiplied by the number of PEs, so that each of the PEs can have an independent array of instructions issued at every clock cycle, which is necessary when executing VLIW instructions.

Each PE has only one Load Unit (LU) and one Store Unit (SU), but may have multiple instances of ALUs, MAUs, and DSUs, all of which have separate ports from IR2 of the SP. There is a separate instruction signal from each of the pipeline modes to the Cluster Switch (CS). The last port of the SP is used for signalling control-flow information to the decode stage of each unit in each PE.

Configuration Files

The information needed for the simulator to run is mainly the number of PEs and the mode of the instruction set. Two modes can be chosen: one is the original instruction set of the BOPS ManArray, and the other is the extended instruction mode, which allows multiple execution units inside the PEs. A main configuration file holds the information needed for the simulation. The file is named config_a.txt and resides in the same location as the executable file of the simulator. An example of a configuration file follows:

3# NUM_PE - number of PEs

1# NUM_ALU - number of ALUs

1# NUM_MAU - number of MAUs

1# NUM_DSU - number of DSUs

0# MULTIPLE UNIT PE

-1# EOF

Output Files

As the simulator executes a program, the access counts for each counter are stored in files corresponding to each of the PEs. These files reside


in the folder 'Stats' and are named stat pe*.csv, where the star is replaced with the number of the PE the file represents. There will be as many files as there are PEs in the simulation. These files are overwritten every time the simulator runs; to save interesting results, the user must copy them after each simulation to avoid loss of data.

The files are formatted as semicolon-separated lists. Each row represents one clock cycle of the simulation and each column represents one access count; the first row holds the names of the counters. These files can easily be opened and manipulated with spreadsheet programs such as Microsoft Excel.

Graphical User Interface

The graphical user interface (GUI) is very simple and offers some possibilities for interaction. Two main actions can be invoked through the GUI: stepping one clock cycle through the program code and running the simulator without interruption. There is also a pull-down menu at the top which lets the user choose which PE's information is displayed in the PE-related areas of the GUI (see Fig. A.4).

The GUI is divided into several areas showing different kinds of information related to the program execution. Two areas show the pipeline state: one shows the hexadecimal values of the instructions and the other shows their mnemonics. These areas show information related to the selected PE only. Below them are areas for the different memories, including the VIM, Special-Purpose Registers, Register File, and RAM. At the bottom there are two areas listing the access-count variables, one for the SP and one for the selected PE. For every clock cycle, the values that have changed are marked with square brackets to make tracing easier. Differences are also marked when the user switches between PEs, so that they are easy to spot [3].
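The square-bracket marking of changed values could be implemented along the following lines (illustrative only; render_value is a hypothetical name):

```cpp
#include <sstream>
#include <string>

// Render one counter value for the GUI: a value that differs from its
// value in the previous cycle (or previously selected PE) is wrapped
// in square brackets so changes are easy to trace.
std::string render_value(long current, long previous) {
    std::ostringstream os;
    if (current != previous)
        os << '[' << current << ']';
    else
        os << current;
    return os.str();
}
```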


Figure A.4: The GUI of our implemented DSP-PP simulator


Current Limitations of Simulator

The main current limitation of the simulator is the incomplete implementation of the instruction set. About 80% of the ManArray instructions were implemented; these are the most frequently used ones, and all the basic functionality is in place. Implementing the remaining instructions is rather straightforward. Another limitation is the lack of support for DMA and interrupts [3].

Bibliography

[1] M. Q. Do, L. Bengtsson, and P. Larsson-Edefors, "Models for Power Consumption Estimation in the DSP-PP Simulator," in Proceedings of the International Signal Processing Conference (ISPC03), Apr. 2003.

[2] M. Q. Do and L. Bengtsson, "Analytical Models for Power Consumption Estimation in the DSP-PP Simulator: Problems and Solutions," Technical Report No. 03-22, Department of Computer Engineering, Chalmers University of Technology, Göteborg, Sweden, 2003.

[3] M. Firas, "Implementation of the DSP-PP Cycle-True Simulator Using SystemC," M.S. thesis, Department of Computer Science and Engineering, Chalmers University of Technology, Göteborg, Sweden, 2004.