Software-level Power-Aware Computing
Lecture 4
Lecture Organization
• Lecture 1
  • Introduction to low-power systems
  • Low-power binary encoding
  • Power-aware compiler techniques
• Lectures 2 & 3
  • Dynamic voltage scaling (DVS) techniques
    – OS-level DVS: inter-task DVS
    – Compiler-level DVS: intra-task DVS
    – Application-level DVS
  • Dynamic power management
• Lecture 4
  • Software power estimation & optimization
  • Low-power techniques for multiprocessor systems
  • Leakage reduction techniques

Low Power SW.4, J. Kim/SNU
Software Power Estimation & Optimization
• Modeling-based Approaches
  • Instruction-level power modeling (V. Tiwari)
    – Employs the base energy cost of each instruction
    – Instruction-level analysis and optimization
  • Component-based power modeling
    – Wattch, SimplePower
• Measurement-based Approaches
  • SES: SNU Energy Scanner
    – Cycle-level power measurement on the target board
  • PowerScope (J. Flinn)
    – Function-level analysis w/ DMM
  • ePRO: energy PRofiler and Optimizer
    – Function-level analysis and optimization w/ DAQ
Instruction-level Power Estimation (1)
• Base energy cost
  • The same instruction (e.g., a long sequence of mov bx, [dx])
    is executed repeatedly in a loop on the target processor; the
    supply current is measured with an ammeter and the average
    value is calculated
  • For variations in base energy costs due to operands and
    addresses, average base cost values are employed
Instruction-level Power Estimation (2)
• Inter-Instruction Effects
  • Effect of circuit state: switching activity in a circuit
    – Avg. overhead determined through extensive experiments
      between pairs of instructions
  • Effect of resource constraints: pipeline stalls and write
    buffer stalls
    – Avg. energy cost of each stall experimentally determined
  • Effect of cache misses
    – Avg. energy penalty of cache miss cycles
  E_p = Σ_i (B_i × N_i) + Σ_(i,j) (O_(i,j) × N_(i,j)) + Σ_k E_k

  E_p     : average energy cost of a program
  B_i     : base cost of instruction i
  N_i     : number of times instruction i is executed
  O_(i,j) : circuit-state overhead for the instruction pair (i, j)
  N_(i,j) : number of times the pair (i, j) is executed
  E_k     : other inter-instruction effects (stalls, cache misses)
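The equation above can be evaluated directly from an instruction trace. A minimal sketch follows; the base costs, pair overheads, and the extra stall/cache term are made-up illustrative numbers, not measured values from the Tiwari framework.

```python
def program_energy(trace, base_cost, pair_overhead, other_effects=0.0):
    """E_p = sum_i B_i*N_i + sum_(i,j) O_(i,j)*N_(i,j) + sum_k E_k."""
    energy = sum(base_cost[ins] for ins in trace)        # base-cost term
    for prev, cur in zip(trace, trace[1:]):              # circuit-state term
        energy += pair_overhead.get((prev, cur), 0.0)
    return energy + other_effects                        # stalls, cache misses

# Illustrative per-instruction costs (arbitrary energy units)
base = {"mov": 1.0, "add": 0.8, "ld": 2.5}
over = {("mov", "add"): 0.2, ("add", "ld"): 0.4}

trace = ["mov", "add", "ld", "mov"]
e = program_energy(trace, base, over, other_effects=1.5)   # 5.3 + 0.6 + 1.5
```

Counting pairs over adjacent trace entries is exactly the N_(i,j) statistic of the model; a real tool would gather the counts from a simulator rather than a literal trace list.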
Instruction-level Power Estimation (3)
• Energy estimation framework for a program
Instruction-level Power Optimization
• Instruction Reordering
  • A technique to reduce the circuit-state overhead
  • Instructions are scheduled to minimize the estimated
    switching in the control path
    – For the 486DX, energy savings only up to 2%, but for a DSP
      up to 33.1%
• Energy-cost-driven code generation
  • Instructions with memory operands draw much higher current
    than those with register operands
  • Better utilization, e.g., optimal global register allocation
    of temporaries and frequently used variables
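The reordering idea can be sketched as a greedy scheduler: among the not-yet-emitted instructions, always pick the one with the smallest circuit-state overhead relative to the previously emitted instruction. This is a simplification that ignores data dependences, and the overhead table is illustrative.

```python
def reorder(instrs, overhead):
    """Greedily order instructions to minimize adjacent-pair overhead."""
    remaining = list(instrs)
    sched = [remaining.pop(0)]              # keep the first instruction fixed
    while remaining:
        prev = sched[-1]
        # pick the cheapest successor; unknown pairs get a default cost
        nxt = min(remaining, key=lambda i: overhead.get((prev, i), 1.0))
        remaining.remove(nxt)
        sched.append(nxt)
    return sched

# Illustrative pairwise switching overheads
ov = {("a", "c"): 0.1, ("c", "b"): 0.1, ("a", "b"): 0.9, ("b", "c"): 0.9}
order = reorder(["a", "b", "c"], ov)        # ['a', 'c', 'b']
```

A production scheduler would restrict the candidate set to dependence-free instructions at each step; the greedy selection itself is the same.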
SES : SNU Energy Scanner (1)
• Overall Structure
  <Figure: host PC (compiler, program loader, user interface,
  energy analysis module with CPU energy analyzer and memory
  energy estimator, profile matching module) connected over a
  HW/SW PCI interface to the energy measurement board (ARM7TDMI)
  with its profile acquisition module, controller, memory, and
  energy estimation module>
• Summary
  • On-board, cycle-level power measurement
  • Source-code-related energy analysis
SES : SNU Energy Scanner (2)
• Energy Measurement H/W Module
SES : SNU Energy Scanner (3)
• Energy Analyzer GUI
SES : SNU Energy Scanner (4)
• Pros
  • No additional measurement device (like a DMM or DAQ) necessary
  • Cycle-level accuracy and timeliness
  • Source-code-related energy analysis
    – C program function or instruction level
• Cons
  • No portability
    – For each processor, new hardware and programs are
      necessary: time, cost, and effort!
  • No exact performance-energy correlation
    – Performance is not measured
PowerScope (1)
• Overall Structure
• Summary
  • Power measurement w/ an external device (DMM)
  • Source-code-related energy analysis
PowerScope (2)
• Pros
  • Portability
    – Employs a Linux LKM (Loadable Kernel Module) and a DMM
  • Moderate accuracy but fast measurement
  • Source-code-related energy analysis
    – C program function level
• Cons
  • Overhead
    – Sampling-trigger feedback between the target board and the
      DMM: varying interrupt handling time
    – Long profile function path
  • No performance-energy correlation
    – Performance is not measured
ePRO : energy PRofiler and Optimizer (1)
• Overall Structure
  <Figure: target system (embedded S/W, kernel-level system
  monitor, CPU, RAM, HDD) connected via a DAQ adapter and an
  external trigger to the collection & analysis system
  (Windows XP): collection monitor, symbol table, energy /
  performance / code-size analyzers, and an Eclipse-based GUI,
  exchanging power samples, PC/PID samples, and performance
  profiles>
• Summary
  • Power measurement w/ an external device (DAQ)
  • Source-code-related energy, performance, and code size
    analysis
  • Automatic compiler-level optimization
ePRO : energy PRofiler and Optimizer (2)
• Overview
  • Automated tool which analyzes and optimizes software energy
    and performance based on measurement
• Function details
  • Performance Analysis
    – Function-level performance indices
  • Energy Analysis
    – Function-level energy consumption
    – Device-level energy consumption
  • Energy Optimization
    – Energy-optimal compiler option selection
  • Integrated Development Environment (IDE)
    – Plug-in of Eclipse
ePRO : energy PRofiler and Optimizer (3)
• Performance Analysis
  • Using the XScale processor's PMU
  • CPI (Cycles Per Instruction), I-cache/D-cache efficiency
  • Instruction fetch latency
  • Data/bus request buffer: D-cache buffer stalls
  • Stall/writeback statistics
  • I-TLB/D-TLB efficiency
ePRO : energy PRofiler and Optimizer (4)
• Function-level Energy Consumption Analysis
• Device-level Energy Consumption Analysis
  • CPU, RAM, FLASH, HDD, etc.
ePRO : energy PRofiler and Optimizer (5)
• Energy Optimization
  • CL-OSE (Compiler-Level Optimal Space Exploration): selects
    the energy-optimal options time-efficiently for the target
    program among all the available compiler options
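The option-selection step can be pictured as a search over flag combinations, each compiled and measured on the target. The sketch below searches exhaustively for clarity (CL-OSE itself prunes the space to stay time-efficient); `fake_measure` and the flag names are stand-ins for a real toolchain and DAQ measurement.

```python
import itertools

def explore(flags, measure_energy):
    """Try every flag combination and return (best_combo, best_energy)."""
    best = None
    for r in range(len(flags) + 1):
        for combo in itertools.combinations(flags, r):
            e = measure_energy(combo)       # would compile & run on the target
            if best is None or e < best[1]:
                best = (combo, e)
    return best

# Toy energy model standing in for a real measurement: each flag helps,
# except loop unrolling, which (here) bloats energy.
def fake_measure(combo):
    return 10.0 - len(combo) + (2.0 if "-funroll-loops" in combo else 0.0)

combo, e = explore(["-O2", "-funroll-loops", "-fomit-frame-pointer"],
                   fake_measure)
```

With the toy model, the best point drops the unrolling flag, illustrating why measured selection can beat "enable everything".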
ePRO : energy PRofiler and Optimizer (6)
• Integrated Development Environment
  • Employs Eclipse's plug-in function
ePRO : energy PRofiler and Optimizer (7)
• Pros
  • Portability
    – Employs a Linux LKM and a DAQ
  • Program function-level energy, performance, and code size
    analysis
  • Automated compiler-level energy optimization
• Cons
  • Overhead
    – System behavior sampling overhead
  • Limited to processors with a PMU (Performance Monitoring
    Unit), e.g., XScale
  • No support for multiple processes so far
3D Graphics Pipeline
• 3D Graphics Pipeline
  • Geometry Calculation
    – Calculation of geometric data of objects
  • Rasterization
    – Converting an object into pixels on the screen
  <Figure: 3D graphics pipeline. Geometry calculation:
  Transformation → Trivial Rejection → Lighting → W Division &
  Viewport Mapping. Rasterization: Triangle Setup → Scan
  Conversion → Texture Mapping → Fragment Operation>
Power Breakdown of 3D Graphics
• Power Consumption of 3D Graphics Application
  <Charts: power consumption (%) per pipeline stage
  (Transformation, Trivial Rejection, Lighting, W Division &
  Viewport, Triangle Setup, Scan Conversion, Texture Mapping,
  Fragment Operation) for three scenes: check image, jelly fish,
  and face model; CPU, memory, and total power shown per stage>
A Low-Power Texture Mapping (1)
• Previous Work
  • "A Low-Power Content-Adaptive Texture Mapping Architecture
    for Real-Time 3D Graphics", Jeongseon Euh et al., PACS'02
    – Adaptive texture mapping
      • Based on a model of human visual perception (HVP)
    – DVS is applied to the interpolation block
  • "Trading Efficiency for Energy in a Texture Cache
    Architecture", Iosif Antochi et al., MPCS'02
    – Mobile devices cannot afford a large texture cache
      • Due to gate count limitations and low power budgets
    – 128~512-byte texture cache between the graphics
      accelerator and texture memory
A Low-Power Texture Mapping (2)
• A Low-Power Texture Mapping Technique for Mobile 3D Graphics
  • A small texture cache can increase the miss ratio
  • A technique to preserve performance is needed
    – Prefetching
    – Victim cache
  <Chart: miss ratio (%) vs. cache size (1K-16K direct-mapped)
  for the jelly fish and check image scenes>
A Low-Power Texture Mapping (3)
• Prefetch techniques
  • Technique 1: Prediction of next texels
    – Division is required due to perspective correction
  • Technique 2: Prediction of next blocks
    – Assuming that derivatives are not changed
    – Division is eliminated
  • Technique 3: Prediction of next blocks based on the
    direction of texture map accesses
    – Simple, but less exact than Technique 2
  • A small fully associative prefetch buffer is used
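Technique 2 can be sketched in a few lines: if the texture-coordinate derivatives (du, dv) are assumed constant along the span, the next block address is a simple extrapolation with no perspective-correcting division. The block size and coordinates below are illustrative integers.

```python
BLOCK = 4  # texels per block side (illustrative)

def block_of(u, v):
    """Map a texel coordinate to its containing block in (u, v) space."""
    return (u // BLOCK, v // BLOCK)

def predict_next_block(u, v, du, dv):
    # Extrapolate one block ahead along the current access direction;
    # with constant derivatives this needs only adds and shifts.
    return block_of(u + du * BLOCK, v + dv * BLOCK)

cur = block_of(10, 5)                   # currently accessed block
nxt = predict_next_block(10, 5, 1, 0)   # block predicted for prefetch
```

Technique 3 would drop the derivatives entirely and step to the neighbor block in the observed access direction, trading accuracy for simplicity.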
Technique 1
  <Figure: along the current span line on the screen, the
  currently accessed texel (●) and the texels predicted to be
  accessed next are marked in the texture image's (u, v) space;
  the predicted texels are prefetched>
Technique 2
  <Figure: as in Technique 1, but the whole block containing the
  predicted texels is prefetched (block prefetch) instead of
  individual texels>
Technique 3
  <Figure: from the currently accessed block, the neighbor block
  in the direction of texture map accesses in (u, v) space is
  prefetched (block prefetch)>
A Low-Power Texture Mapping (4)
• Using a victim cache
  • Sizes of texture images are powers of two
  • Conflict misses can occur between blocks
    – Especially in a small texture cache
    – Blocks are reused when processing the next span line
  • A victim cache can reduce conflict misses
    – The prefetch buffer performs as the victim cache
    – Evicted blocks are moved into the prefetch buffer
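The dual role of the buffer can be sketched with a tiny direct-mapped cache model: evicted blocks go to a small fully associative buffer, and a buffer hit avoids an off-chip texture fetch. Sizes and the block-numbering scheme are illustrative, not the paper's parameters.

```python
from collections import OrderedDict

class TextureCache:
    def __init__(self, sets=4, buffer_entries=2):
        self.sets = sets
        self.lines = {}                    # set index -> resident block
        self.buf = OrderedDict()           # victim/prefetch buffer (FIFO)
        self.buffer_entries = buffer_entries
        self.misses = 0

    def access(self, block):
        idx = block % self.sets
        if self.lines.get(idx) == block:
            return "hit"
        if block in self.buf:              # victim/prefetch buffer hit
            self.buf.pop(block)
            self._fill(idx, block)
            return "buffer-hit"
        self.misses += 1                   # off-chip texture memory fetch
        self._fill(idx, block)
        return "miss"

    def _fill(self, idx, block):
        victim = self.lines.get(idx)
        if victim is not None:             # move the evicted block to the buffer
            self.buf[victim] = True
            if len(self.buf) > self.buffer_entries:
                self.buf.popitem(last=False)
        self.lines[idx] = block

c = TextureCache()
r = [c.access(b) for b in [0, 4, 0]]       # blocks 0 and 4 conflict in set 0
```

Blocks 0 and 4 ping-pong in the same set, but the second access to block 0 is served from the victim buffer, so only two off-chip misses occur instead of three.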
A Low-Power Texture Mapping (5)
• Experimental Results
  • Area Reduction
    <Chart: area (cm^2) of a conventional cache vs. a double-port
    tag cache with prefetch buffer, for cache sizes 1K-16K
    direct-mapped>
  • Prefetch Accuracy
    <Charts: prefetch accuracy of techniques P1, P2, and P3 vs.
    cache size, for the jelly fish and check image scenes>
A Low-Power Texture Mapping (6)
• Miss Ratio and Power Consumption
  <Charts: miss ratio (%) and energy consumption (mJ) of
  No_Prefetch, P1, P2, and P3 vs. cache size (1K-16K
  direct-mapped), for the jelly fish and check image scenes>
Researches on 3D Graphics (1)
• "An Effective Pixel Rasterization Pipeline Architecture for 3D
  Rendering Processors", Woo-Chan Park et al., IEEE Transactions
  on Computers '03
  • Avoid unnecessary texture mapping for obscured fragments
  • Reduce the miss penalties of the pixel cache by a prefetching
    scheme
• "Design and Implementation of Low-Power 3D Graphics SoC for
  Mobile Multimedia Applications", Ramchan Woo, PhD thesis,
  KAIST '04
  • Implementing a full 3D pipeline with texture mapping and
    special effects
Researches on 3D Graphics (2)
• "GraalBench: A 3D Graphics Benchmark Suite for Mobile Phones",
  Iosif Antochi et al., LCTES'04
  • A set of 3D graphics workloads representative of mobile
    devices
• "Power-Aware 3D Computer Graphics Rendering", Jeongseon Euh,
  Journal of VLSI Signal Processing '05
  • Low-power system based on Approximate Graphics Rendering
    (AGR)
  • Power savings are examined for stages
    – Shading
    – Texture mapping
Low-Power Techniques for Multiprocessors
• DVS techniques for multiprocessors
  • Slack reclamation
  • Condition-aware scheduling
  • dist-DVFS
• Power-Aware Parallelism Optimization
  • Static & dynamic optimizing
• Local Memory Management
Slack Reclamation (1)
• Problem definition
  • Greedy slack reclamation
    – Any slack is used to reduce the speed of the next task on
      the same processor
    – It cannot guarantee deadlines
  • Example: 6 tasks on 2 processors
    – Γ = {Ti(WCET, AET)} = {T1(5,2), T2(4,4), T3(3,3), T4(2,2),
      T5(2,2), T6(2,2)}, Deadline = 9
    – T3 uses up its time; T6 misses the deadline
Slack Reclamation (2)
• Shared slack reclamation [Zhu03]
  • Share the slack with other processors
    – Split slack into multiple parts
  • Slack sharing example (see figure (b))
    – Slack1: two time units before T2's finish time (based on
      T2's WCET)
    – Slack2: one time unit after T2's finish time
      • Share slack2 with P2
  • All tasks meet deadlines
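The greedy failure case from the slides can be replayed with a small list-scheduling simulation. The sketch assumes each ready task (in list order) goes to the earliest-free processor and greedily absorbs all slack left there, slowing down to fill WCET plus slack; the task set and deadline are the ones on the slide.

```python
import heapq

def greedy_schedule(tasks, nprocs=2):
    """tasks: list of (name, WCET, AET) in ready order; returns finish times."""
    procs = [(0.0, p, 0.0) for p in range(nprocs)]  # (free_at, id, local slack)
    heapq.heapify(procs)
    finish = {}
    for name, wcet, aet in tasks:
        free_at, pid, slack = heapq.heappop(procs)
        allotted = wcet + slack             # greedy: give the task all the slack
        runtime = aet * allotted / wcet     # it runs slower, so it stretches
        finish[name] = free_at + runtime
        heapq.heappush(procs, (finish[name], pid, allotted - runtime))
    return finish

tasks = [("T1", 5, 2), ("T2", 4, 4), ("T3", 3, 3),
         ("T4", 2, 2), ("T5", 2, 2), ("T6", 2, 2)]
finish = greedy_schedule(tasks)
missed = [t for t, f in finish.items() if f > 9]   # deadline = 9
```

T1's three units of slack all flow into T3, which then occupies its processor until time 8, pushing T6 to finish at 10. Shared slack reclamation [Zhu03] avoids this by splitting the slack across processors so no task stretches beyond the canonical WCET schedule.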
Condition-Aware Scheduling (1)
• Task scheduling for conditional task graphs
  • Conditional Task Graph (CTG)
    – Various task sequences depending on the conditions
    – Requires a power-aware scheduling technique that considers
      conditions
Condition-Aware Scheduling (2)
• Condition-aware scheduling [Shin03]
  • Step 1: Task ordering
    – Use the schedule table: <start time, clock speed>
    – Depending on the condition value, each task has a different
      start time and clock speed
Condition-Aware Scheduling (3)
• Condition-aware scheduling [Shin03]
  • Step 2: Task stretching
    – Use probabilities of conditions from profile information
    – Minimize Σ_τ E(τ) · Prob(τ)
      • Optimize for high-probability conditions
      • e.g., c1 ≫ c2, c3
DVFS in MPSoC (1)
• Local DVFS
  • Decides the frequency of each processor using only local
    information
    – Does not use information from the other processors
  • Higher frequency as more tasks sit in the task queue
• Limitations of local DVFS
  • If a processor runs at a lower frequency, it can hurt the
    performance of the other processors because of dependencies
DVFS in MPSoC (2)
• dist-DVFS [Juang05]
  • Decides the frequency of each processor using global
    information
  • Operation steps
    – Estimate the future task queue occupancy
    – Identify the critical-path tasks (with the highest queue
      occupancy)
    – Decide the frequency of each processor without hurting the
      performance of the critical-path tasks
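The three steps above can be caricatured in a few lines: treat the core with the highest (estimated) queue occupancy as holding the critical path, keep it at full speed, and scale the others down in proportion. The frequency table and the linear scaling policy are illustrative assumptions, not the controller from [Juang05].

```python
FREQS = [200, 400, 600, 800]   # available frequency levels (MHz), illustrative

def assign_freqs(queue_occupancy):
    """Map each core's predicted queue occupancy to a frequency level."""
    critical = max(queue_occupancy)        # critical-path core's occupancy
    freqs = []
    for occ in queue_occupancy:
        # scale the level with relative occupancy; the critical core
        # lands on the top frequency so its performance is untouched
        level = round(occ / critical * (len(FREQS) - 1))
        freqs.append(FREQS[max(level, 0)])
    return freqs

f = assign_freqs([8, 4, 2, 8])   # cores 0 and 3 hold the critical path
```

Cores 0 and 3 stay at 800 MHz while the lightly loaded cores drop, which is the essential energy-saving move of the distributed scheme.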
Optimizing Parallelism (2)
• The number of processors that gives the best execution time
  for each loop nest
  <Table: loop nest vs. number of processors for the best
  execution time>
• Only a small subset of the 8 processors is used
• This is a strong motivation for shutting down unused
  processors
Optimizing Parallelism (3)
• Designing an effective parallelization strategy for an on-chip
  multiprocessor
  • Mechanism
    – Dynamic approach
      • The number of processors for each loop nest is decided at
        run time
    – Static approach
      • The number of processors for each loop nest is decided at
        compile time
  • Policy
    – Criterion to decide the number of processors
      • Execution time, energy, and so on
Optimizing Parallelism (4)
• Procedure
  • Determine the number of processors from the mechanism and
    policy
  • Insert activation/deactivation calls in the code
  • Optimize the code
• Optimization
  • The current active/idle status of processors is maintained as
    much as possible
    – To minimize the overhead of turning processors on/off
  • Processors must be pre-activated
    – If the processor will be used in the next loop
    – So as not to hurt performance
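The call-insertion step with both optimizations can be sketched as follows. Given the processor count chosen per loop nest, the sketch emits activate/deactivate calls, keeping cores on across adjacent nests and pre-activating for the next nest so wake-up latency does not hurt performance; `insert_calls` and the call tuples are hypothetical names for illustration only.

```python
def insert_calls(procs_per_nest):
    """Emit (activate/deactivate/run_nest) calls for a sequence of nests."""
    calls, active = [], 0
    for i, need in enumerate(procs_per_nest):
        nxt = procs_per_nest[i + 1] if i + 1 < len(procs_per_nest) else 0
        target = max(need, nxt)            # pre-activate for the next nest too
        if target > active:
            calls.append(("activate", target - active))
        elif target < active:
            calls.append(("deactivate", active - target))
        active = target
        calls.append(("run_nest", i, need))
    if active:
        calls.append(("deactivate", active))   # shut everything down at the end
    return calls

plan = insert_calls([4, 8, 2])
```

Note that between nests 0 and 1 no call is emitted at all: the status is maintained, and the extra cores for nest 1 were already woken during nest 0.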
Runtime Code Parallelization (1)
• A run-time strategy for determining the best number of
  processors to use [Kandemir03]
  • Dynamic mechanism
  • To minimize energy and execution time
  • Needs some help from the H/W and the compiler
Runtime Code Parallelization (2)
• Parallelization based on training
  • Each dot represents an iteration
  • Training period
    – Find the optimal number of processors
  • The determined number of processors is then used
• Extra Optimization
  • Minimize training iterations based on history
    – Utilize the past history to avoid redundant training
Local Memory Management (1)
• Latency of memory access:
  local memory < local memory of another CPU << off-chip memory
  <Figure: target MPSoC architecture>
• Frequent off-chip memory accesses can be very costly from both
  performance and energy perspectives
• A local memory management scheme for low cost is proposed
  [Chen05]
Local Memory Management (2)
• The access pattern of each data block is analyzed by the
  compiler
• Software-managed memory is used
• When a data block is stored in the local memory of a processor,
  • even if the data block is predicted not to be used any more
    by that processor,
  • if the data block is predicted to be used by another
    processor, keep the data block in the local memory
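The keep/evict decision above boils down to a short predicate over the compiler's predicted future accesses. A minimal sketch, assuming the analysis produces a per-core set of blocks it will still touch; `should_keep` and `future_use` are hypothetical names for illustration.

```python
def should_keep(block, core, future_use):
    """future_use: dict core -> set of blocks it is predicted to access."""
    if block in future_use.get(core, set()):
        return True                        # reused locally: obviously keep
    # keep even without local reuse if a peer core will read it, since a
    # remote local-memory access is far cheaper than going off-chip
    return any(block in blocks
               for c, blocks in future_use.items() if c != core)

fu = {0: {"A"}, 1: {"B"}}
keep_B_on_0 = should_keep("B", 0, fu)   # core 1 will read B: keep it
keep_C_on_0 = should_keep("C", 0, fu)   # nobody needs C: evict it
```

This is exactly the inter-processor data-sharing insight of [Chen05]: a block that looks dead to its owner may still be live for the system.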
References
[Chen05] Guilin Chen, Guangyu Chen, Ozcan Ozturk and Mahmut Kandemir, "Exploiting Inter-Processor Data Sharing for Improving Behavior of Multi-Processor SoCs", In Proc. of Annual Symposium on VLSI, 2005.
[Juang05] Philo Juang and Qiang Wu, "Coordinated, Distributed, Formal Energy Management of Chip Multiprocessors", In Proc. of ISLPED, 2005.
[Kadayif05] Ismail Kadayif, Mahmut Kandemir, Guilin Chen and Ozcan Ozturk, "Optimizing Array-Intensive Applications for On-Chip Multiprocessors", IEEE Trans. on Parallel and Distributed Systems, 2005.
[Kandemir03] M. Kandemir, W. Zhang, M. Karakoy, "Runtime Code Parallelization for On-Chip Multiprocessors", In Proc. of DATE, 2003.
[Shin03] Dongkun Shin and Jihong Kim, "Power-Aware Scheduling of Conditional Task Graphs in Real-Time Multiprocessor Systems", In Proc. of ISLPED, 2003.
[Zhu03] Dakai Zhu, Rami Melhem, and Bruce R. Childers, "Scheduling with Dynamic Voltage/Speed Adjustment Using Slack Reclamation in Multiprocessor Real-Time Systems", IEEE Trans. on Parallel and Distributed Systems, Vol. 14, No. 7, July 2003.
Leakage Current
• Leakage: not activity-based
  • Shutting down inactive parts helps
  (source: Kim et al., IEEE Computer, Dec. 2003)
Subthreshold Leakage, Isub
• I_sub ~ e^(-Vt/Va) × (1 - e^(-V/Va)),
  where Va is the thermal voltage
• How to reduce I_sub
  • Turn off the supply voltage
    (-) loss of state
  • Increase the threshold voltage
    (-) loss of performance
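The exponential dependence on the threshold voltage is worth a quick numeric check. A sketch of the relation on the slide, with illustrative voltages and Va taken as kT/q at room temperature:

```python
import math

def i_sub(vt, v, va=0.026):
    """Relative subthreshold leakage, I_sub ~ e^(-Vt/Va)*(1 - e^(-V/Va))."""
    # va ≈ kT/q ≈ 26 mV at room temperature
    return math.exp(-vt / va) * (1 - math.exp(-v / va))

low_vt = i_sub(0.20, 1.0)    # low-threshold device
high_vt = i_sub(0.30, 1.0)   # same device with Vt raised by 100 mV
ratio = low_vt / high_vt     # ~e^(0.1/0.026): tens of times more leakage
```

A 100 mV threshold increase cuts leakage by roughly a factor of e^(0.1/0.026) ≈ 47, which is why multi-Vt design and body biasing are so effective, and why the performance cost of a higher Vt is the price paid.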
Leakage Power Reduction
• State-Destructive vs. State-Preserving
• Application-Sensitive vs. Application-Insensitive
Dynamic Resizing of Instruction Cache
[Powell, ISLPED00]
Gated-Vdd
• State-Destructive, Application-Insensitive
Drowsy Cache [Flautner, 2002]
• State-Preserving, Application-Insensitive