Runtime Power Measurement/Modeling and Thermal Modeling
Research Seminar
Canturk ISCI
2
MOTIVATION
• Power matters!
  - Performance improves exponentially; SO DOES POWER DENSITY
  - Chip areas increase 7%/year
  - Battery life improves much more slowly
  - Thermal issues follow power density
  - Packaging costs: +$1/W over ~40W
• Need good measurement/modeling techniques for power- and thermally-aware/adaptive systems
  - Using measurement to probe microarchitectural details: CASTLE, data activity experiment
  - Compiler-level power optimizations: SW power profiling and optimization
  - Power-aware OS: power modeling for decision making
  - Dynamic thermal/power management: thermal hotspots & power threshold
3
MOTIVATION
• Power models reflecting modern processors
  - Clock gating, power, voltage regulation, di/dt
• Need for fast, real-time modeling and measurement to observe long time periods
  - Thermal time constants: O(s)
  - Not feasible even with architectural simulators, e.g., 1s of real run ~ 5 x IPC hrs of WATTCH simulation
• Need live, run-time power/thermal measures
  - Dynamic thermal management
  - Power-aware OS & systems control
4
THE BIG PICTURE
To estimate component power & temperature breakdowns for the P4 at runtime…
Bottom line…
[Diagram: Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling]
5
Remainder of Talk
• Related Work
• Performance Monitoring
  - P4 performance counters
  - Performance Reader LKM
• Real Power Measurement
  - P4 power measurement setup
  - Examples
• Power Modeling
  - P4 power model
  - Model + measurement sync setup, verification
• Thermal Modeling
  - Refined thermal model
  - Ex: PPro thermal model
6
RELATED WORK
• Implementing counter readers:
  - PCL [Berrendorf 1998], Intel VTune, Brink & Abyss [Sprunt 2002]
• Using counters for performance:
  - HPC [Crummey 2001], CPU profilers
• Using counters for power:
  - CASTLE [Joseph 2001], power profilers
  - Event-driven OS/cruise control [Bellosa 2000, 2002]
• Real power measurement:
  - Compiler optimizations [Seng 2003]
  - Cycle-accurate measurement with switch caps [Chang 2002]
7
RELATED WORK
• Power management and modeling support:
  - Instruction-level energy [Tiwari 1994]
  - PowerScope: procedure-level energy [Flinn 1999]
  - Event-counter-driven energy coprocessor [Haid 2003]
  - Power-breakdown-driven energy reduction [Huang 2001]
  - Virtual energy counters for mem. [Kadayif 2001]
  - ECOsystem: OS energy accounting [Ellis 2002]
• Thermal management and modeling support:
  - PID-based DTM [Skadron 2002]
  - Architectural thermal model [Skadron 2003]
  - Evaluating DTM techniques [Brooks 2001]
8
Milestone 1: Performance Monitoring
9
Live CPU Performance Monitoring with Hardware Counters
• Most CPUs have hardware performance counters
• P4 performance monitoring HW:
  - 18 event counters
  - 18 counter configuration control registers (CCCRs): configure how to count
  - 45 event selection control registers (ESCRs): configure what to count
  - Additional control registers
10
Counter Overview
• Counting types
  - Non-retirement
  - At-retirement: can count BOGUS vs. NBOGUS, tag uops, etc.
    Mechanisms: front-end tagging, execution tagging, replay tagging, no tags
  - Also: event counting, event-based sampling (EBS), precise EBS
• Event types
  - 59 event classes
  - 100s of events to count
  - Metric classifications:
    General (ex: speculative uops retired)
    Branching (ex: mispredicted conditionals)
    Trace cache and front end (ex: processor N deliver mode)
    Memory (ex: MOB load replays)
    Bus (ex: prefetch bus accesses)
    Characterization (ex: packed SP retired)
    Machine clear (ex: memory order machine clear)
11
Our Event-Counter: Performance Reader
• Performance Reader implemented as a Linux loadable kernel module (LKM)
• Implements 6 syscalls:
  - select_events()
  - reset_event_counter()
  - start_event_counter()
  - stop_event_counter()
  - get_event_counts()
  - set_replay_MSRs()
• User-level interface:
  - Defines the events
  - Starts counters
  - Stops counters
  - Reads counters & TSC
12
Performance Reader: Example Validation
• L1_Dcache benchmark controls cache hit behavior
• Validated against measured cache events
• Hit rate varied from 0 to 100%
[Figure: L1 Hit Rate Experiment. x-axis: desired hit rate (benchmark input), 0.1 to 1; y-axis: acquired hit rates, 0% to 120%; series: ideal hit rate, acquired L1 hit rate, L1 hit rate from L2 access]
13
Milestone 2: Real Power Measurement
14
P4 Power Measuring Setup
• Clamp ammeter on the 12V lines of the measured CPU (1mV/A DC conversion)
• DMM reads the clamp voltages
• Voltage readings sent via RS232 to the logging machine
• Serial reader (PowerMeter, PowerPlotter) converts the readings to power vs. time window
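The conversion chain on this slide can be sketched as follows. This is a minimal illustration only, assuming the clamp outputs 1 mV per ampere as stated and that CPU power is approximated as the 12 V line voltage times the clamp current. The function names are hypothetical (they are not the PowerMeter/PowerPlotter code), and regulator losses between the 12 V lines and the core are ignored.

```python
def clamp_voltage_to_power(v_clamp, supply_volts=12.0, volts_per_amp=1e-3):
    """Convert one DMM reading of the clamp output (volts) to power (watts)."""
    current = v_clamp / volts_per_amp   # 1 mV/A clamp: 0.004 V reads as 4 A
    return supply_volts * current       # P = V * I on the 12 V lines


def power_per_window(readings, samples_per_window):
    """Average consecutive readings into one power value per time window."""
    powers = [clamp_voltage_to_power(v) for v in readings]
    windows = []
    for start in range(0, len(powers), samples_per_window):
        chunk = powers[start:start + samples_per_window]
        windows.append(sum(chunk) / len(chunk))
    return windows
```

For instance, a 4 mV clamp reading corresponds to 4 A, or about 48 W on the 12 V lines.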
15
PowerPlotter: Example
[Figure: PowerPlotter power trace showing initialization and benchmark execution phases for “Fast”; “Branch exercise” (taken rate: 1); “High-Low”; “L1Dcache” with array size 1/100 of L1; “L1Dcache” with array size x25 of L1 (~L2); “L1Dcache” with array size x4 of L2]
16
SPEC Power Examples
• Different programs show very different power characteristics
• Timescale of interest can be huge => inaccessible via simulation
[Figure: Spec GCC (O3) with specrun -a run. Power [W], 0 to 80, vs. time (s), 0 to 200]
[Figure: Spec VPR (O3) with specrun -a run. Power [W], 0 to 60, vs. time (s), 0 to 500]
17
Milestone 3: Power Modeling
18
P4 POWER MODEL
• Define Components: define the components (e.g., L1 cache, BPU, regs) whose powers we’ll model, from the annotated layout
• Define Events: determine the combination of P4 events that best represents component accesses
• Performance Monitoring: gather counter info with minimal power overhead and program interruption
• Power Modeling: convert counter info into component power breakdowns
• Real Power Measurement: verify total power against measured processor power
19
Defining Components
20
Defining Components
21
Defining Events → Access Rates
• We determined 24 events to approximate access rates for 22 components
• Used several heuristics to represent each access rate
• Ex: 2nd-level BPU:
  - Metric 1: Instructions fetched from L2 (predict)
    Event: ITLB_Reference (counts ITLB translations)
    Mask: all hits, misses
  - Metric 2: Branches retired (history update)
    Event: branch_retired (counts branches retired)
    Mask: count all Taken/NT/Predicted/MissP
• Used 15 counters & 4 rotations to collect all event data
22
Access Rates → Component Powers
• We gather counter data at the measured computer via the tiny counter reader
• We send the access rates to the logger machine
  - Don’t want to do any computation at the host
• Logger machine converts access rates to the component power breakdowns
  - Computation done externally, still at runtime
• Access rates used as a proxy for max component power weighting, together with microarchitectural details
  - EX: Trace cache delivers 3 uops/cycle max:
    Power(TC) = Access-Rate(TC)/3 * MaxPower(TC) + Non-gated TC CLK power
23
Generic Equation
Power(Component) = Access-Rate(Component) x Microarchitectural Scaling x MaxPower(Component) + Non-gated component clock power
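The generic equation, Power = Access-Rate x Microarchitectural Scaling x MaxPower + non-gated clock power, can be sketched in a few lines. This is an illustration under stated assumptions: the function name and all numeric values (MaxPower, clock power, access rate) are hypothetical placeholders, not measured P4 figures. Only the trace cache limit of 3 uops/cycle comes from the slides.

```python
def component_power(access_rate, max_power, nongated_clock_power,
                    max_access_rate=1.0):
    """Access-rate-weighted component power.

    The microarchitectural scaling is taken here as 1 / max_access_rate,
    so a trace cache delivering at most 3 uops/cycle uses a scaling of 1/3.
    """
    scaling = 1.0 / max_access_rate
    return access_rate * scaling * max_power + nongated_clock_power


# Trace cache example with placeholder numbers (hypothetical values):
# 1.5 uops/cycle observed, 3 uops/cycle max, 4 W max power, 0.5 W clock power.
tc_power = component_power(access_rate=1.5, max_power=4.0,
                           nongated_clock_power=0.5, max_access_rate=3.0)
```

Summing this quantity over all modeled components gives the modeled total power that is compared against the measured processor power.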
24
Experiment Setup – Recall:
• Clamp ammeter on the 12V lines of the measured CPU (1mV/A DC conversion)
• DMM reads the clamp voltages
• Voltage readings sent via RS232 to the logging machine
• Serial reader (PowerMeter, PowerPlotter) converts the readings to power vs. time window
25
Experiment Setup
• 1mV/A DC conversion
• Voltage readings via RS232 to logging machine
26
Experiment Setup
[Diagram: POWER CLIENT and POWER SERVER]
• 1mV/A DC conversion
• Voltage readings via RS232 to logging machine
• Component access rates sent over Ethernet
• Convert voltage to measured power
• Convert access rates to modeled powers
• Sync together in time window
27
Area Based Power Estimate – Total Power Result
[Figure: measured vs. modeled total power for “Fast”, “Branch exercise” (taken rate: 1), “High-Low” and “L1Dcache” (hit rate: 0.1)]
28
After Tuning?
[Figure: measured vs. modeled total power for “Fast”, “Branch exercise” (taken rate: 1), “High-Low” and “L1Dcache” (hit rate: 0.1)]
29
Component Breakdowns
[Figure: component breakdowns for “branch_exercise”; colors for 4 CPU subsystems, including Issue-Retire and Execution]
30
SPEC Results
[Figure: measured vs. modeled power for Gcc, Gzip, Vpr, Vortex, Gap, Crafty]
31
Milestone 4: Thermal Modeling
32
THERMAL MODELING: A Basic Model
• Based on lumped R-C model from packaging
• Built upon power modeling
• Inputs: sampled component powers; respective component areas; physical processor parameters; packaging; heat transfer
[Figure: DIE with blocks Blki, Blkj, Blkk, Blkl under a HEATSINK, and the equivalent lumped R-C network. Each block node Tb,i (likewise j, k, l) has thermal resistance Rth,i and capacitance Cth,i and is driven by power Pi; the heatsink node Th has Rth,h and Cth,h and is driven by Pi+Pj+Pk+Pl]
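Heat balance for block i:
$$P_i = C_{th,i}\,\frac{dT_i}{dt} + \frac{T_{b,i} - T_h}{R_{th,i}}$$
Final difference equation:
$$\Delta T_i = \frac{P_i\,\Delta t}{C_{th,i}} - \frac{T_i\,\Delta t}{R_{th,i}\,C_{th,i}}$$
Δt: sampling interval
T_i = T_{b,i} - T_h: the temperature difference between block and the heatsink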
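A minimal sketch of how a lumped R-C difference equation of this kind is stepped in time, assuming the forward-difference form ΔT = (P/C_th - T/(R_th C_th)) Δt, where T is the block temperature relative to the heatsink. The function names and the R, C and P values below are hypothetical placeholders chosen for illustration, not PPro or P4 parameters. Only the ~20 ms sampling interval is taken from the simulation slide.

```python
def thermal_step(t_diff, power, r_th, c_th, dt):
    """Advance one block's temperature difference (block minus heatsink) by dt.

    Implements dT = (P / C_th - T / (R_th * C_th)) * dt.
    """
    return t_diff + (power / c_th - t_diff / (r_th * c_th)) * dt


def simulate(powers, r_th, c_th, dt, t_start=0.0):
    """Apply thermal_step once per sampled power value; return the trajectory."""
    trajectory = [t_start]
    for p in powers:
        trajectory.append(thermal_step(trajectory[-1], p, r_th, c_th, dt))
    return trajectory


# Placeholder values: R = 2 K/W and C = 0.5 J/K give a 1 s time constant.
# With a constant 10 W input the difference settles toward P * R = 20 K.
trace = simulate([10.0] * 1000, r_th=2.0, c_th=0.5, dt=0.02)
```

The steady state P x R_th falls out of the equation by setting ΔT to zero, which is a quick sanity check on any parameter set.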
33
Refined Thermal Model
• Steady-state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
• Proceeding to a multilayer thermal model:
  - Active die thickness, metalization/insulation, chip-package interface, package, heatsink
  - Requires searching of several materials/dimensions and thermal properties
  - Multiple layers, multiple T nodes, multiple DEs
• Baseline heat removal structure: heatsink, thermal grease, heat spreader, package, die
34
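Physical Structure vs. Thermal Model
[Figure: physical stack (ambient airflow, heatsink, thermal grease, heat spreader, package, die) beside its thermal R-C network: ambient temperature TA; heatsink node Th with Rh, Ch and R_hXA; heat spreader node Tspr with Rspr, Cspr and R_grXspr; package nodes Tp,i with Rp,i, Cp,i; die nodes Tdie,i with Rdie,i, Cdie,i and power inputs Pi; total power Ptotal]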
35
Analytical Derivation
• 4 nodes, 4 DEs
• 1) Tspr:
$$P_{total} = C_{spr}\,\frac{dT_{spr}}{dt} + \frac{T_{spr} - T_h}{R_{grXspr}}$$
Discretizing time:
$$P_{total} = C_{spr}\,\frac{\Delta T_{spr}}{\Delta t} + \frac{T_{spr} - T_h}{R_{grXspr}}$$
Final difference equation:
$$\Delta T_{spr} = \frac{\Delta t}{C_{spr}}\,P_{total} - \frac{\Delta t}{C_{spr}\,R_{grXspr}}\,(T_{spr} - T_h), \qquad T_{spr} = T_{spr} + \Delta T_{spr}$$
[Figure: thermal R-C network with nodes Th (Rh, Ch, R_hXA to TA), Tspr (Rspr, Cspr, R_grXspr), Tp,i (Rp,i, Cp,i) and Tdie,i (Rdie,i, Cdie,i, Pi); total power Ptotal]
36
EX: PPro Thermal Model
• Use CASTLE [Joseph, 2001] computed component powers
• Determine component areas from die photo
• Determine processor/packaging physical parameters
• Generate numerical thermal model
• Apply component difference equations recursively along power flow
[Diagram: power flow Tdie,i, Tp,i, Tspr, Th; update order: Tdie,i, then Tp,i, then Tspr, then Th]
37
Simulation Outputs
• Thermal nodes updated every Δt ~ 20ms
• Component temperatures build up to ~350K in ~5hrs
• Theatsink moves very slowly, as expected
[Figure: Pentium Pro Thermal Simulation. Bar chart of temperature (C), 0 to 80, at startup and after 5 hours, for Ambient, Heatsink, Heat Spreader, Decode, Issue, Reorder, DMem, IMem, FUs, Other]
38
SUMMARY
[Diagram: Performance Monitoring, Real Power Measurement, Power Modeling, Thermal Modeling]
39
Conclusions
• Contributions:
  - Portable runtime real power measurement system
  - Performance-counter-based runtime power & thermal model, and runtime verification with synchronous real power measurement
  - Thermal model that can be applied to ANY power model, given good physical characterization, as long as physical component-based power breakdowns are used
  - Runtime modeling & measurement system for arbitrarily long timescales!
• Outcomes:
  - We can do reasonably accurate real power measurements at runtime without interfering with HW
  - We can perform runtime power modeling with the tiny performance reader, without inducing any significant overhead to the power profile
40
What to do next?
• Keep tuning for SPECs
• <1st Stop> Try regression at several corners
  - Won’t do well due to clk gating??
  - Get data from Intel?
  - Try runtime self-updating model?
  - Compare all to actual data
  - Experiment with µarch., evaluate several power properties
• <2nd Stop> Add thermal
  - Try to add lateral heat diffusion
  - Get contour results <New bkmrk>
• <3rd Result> P4 thermal monitor stuff
  - Could be played from kernel to modulate clock
  - Can we use with our models to do power savings on REAL HW??
41
42
RELATED WORK – performance monitoring
Implementing counter readers:
• PCL (Performance Counter Library), by Rudolf Berrendorf (University of Applied Sciences Bonn-Rhein-Sieg), Heinz Ziegler, and Bernd Mohr at the Central Institute for Applied Mathematics (ZAM), Research Centre Juelich, Germany
  - Uniform interface for several architectures (Intel Pentium, MMX, Pro, III, 4/Linux; IBM Power3, Power3-II/AIX; etc.)
  - Software library with C, C++, Java & Fortran bindings
  - Kernel patch (Mikael Pettersson), recompile
• PAPI (Performance Application Programming Interface) project, by Jack Dongarra, Kevin London, Shirley Moore, Philip Mucci, etc., at the Innovative Computing Lab, CS Dept., University of Tennessee
  - Standard, simple high-level API and low-level programmable interface
  - Supports Pentium, MMX, Pro, III/Linux, Windows; Power 3,4/AIX; etc.
  - PerfCtr kernel patch (Mikael Pettersson), recompile
43
RELATED WORK – performance monitoring
Implementing counter readers:
• Perfmon performance monitoring tool, by Richard Enbody, Associate Professor, Department of Computer Science and Engineering, Michigan State University
  - For SUN UltraSPARC & PPro
  - Device driver (LKM)
• Rabbit performance counters library, by Don Heller, Scalable Computing Laboratory, Iowa State University
  - For Intel Pentium MMX, Pro, II, III/Linux; AMD/Linux
  - Functions to access from within C
  - Cleanest of all, but still ~30 files & ~50 instructions
  - LKM
• Intel’s VTune performance analyzer
  - Windows & Linux <New>
• IBM’s HPM toolkit
  - Power 3,4/AIX
• Brink and Abyss, Pentium 4 performance counter tools for Linux, by Brinkley Sprunt, Electrical Engineering, Bucknell University
  - brink: high-level Perl script to read experiment/config files
  - abyss: C program to access counters
  - abyss_dev: device driver for counter access
  - EBS kernel patches: to handle PMIs
44
RELATED WORK – performance monitoring
Using counter readers:
• CASTLE project, by Margaret Martonosi and Russ Joseph, Princeton University
  - Acquires PPro counter data to model component power breakdowns
• Frank Bellosa, “Benefits of Event Driven energy Accounting in Power Sensitive Systems”, 9th SIGOPS European workshop, 2000
  - Counters to show power ~ k x instr-ns/cycle (PII)
  - OS power optimizations: throttle down CPU / extend thread time for cache hit / slow down CPU core if main memory is accessed
• Andreas Weissel, Frank Bellosa, “Process Cruise Control: Event driven clock scaling for dynamic power management”, CASES 2002
  - Uses event counter info to scale individual thread frequencies
  - Intel XScale / modified Linux kernel
45
RELATED WORK – performance monitoring
Using counter readers:
• HPC Toolkit, by John Mellor-Crummey, Rob Fowler, CS Dept., Rice University
  - Uses perf counter data for profiling
  - Converts raw profiling information into platform-independent XML formats and produces performance metric correlations from multiple sources
  - Used in compiler optimizations
• Jennifer Anderson, et al., “Continuous Profiling: Where Have All the Cycles Gone?”, ACM Transactions on Computer Systems, Vol. 15, No. 4, November 1997, pp. 357-390
  - Performance analysis example, from DEC
  - Data collection by counter sampling; performance info from program level to individual instructions
46
RELATED WORK – real power
• CASTLE project, by Margaret Martonosi and Russ Joseph, Princeton University
  - Shunt R over PPro power lines to measure total processor power
• John Seng, Dean Tullsen, “Effect of compiler optimizations on Pentium 4 Power consumption”, 7th Annual Workshop on Interaction between Compilers and Computer Architectures, February 2003
  - Shunt R between VRM and CPU
• Marc A. Viredaz, Deborah A. Wallach, “Power Evaluation of Itsy Version 2.3”, tech. note TN-57, WRL, Compaq Computer Corp., 2000
  - Similar series R to estimate battery life of the Itsy pocket computer
47
RELATED WORK – real power
• Frank Bellosa, “Benefits of Event Driven energy Accounting in Power Sensitive Systems”, 9th SIGOPS European workshop, 2000
  - Crude current measurement with DMM for Pentium II to help define per-instruction powers
• Andreas Weissel, Frank Bellosa, “Process Cruise Control: Event driven clock scaling for dynamic power management”, CASES 2002
  - Series sense resistor added to Intel IQ 80310 evaluation platform power supply, to measure energy effect of frequency scaling
• Naehyuck Chang, Kwanho Kim, and Hyun Gyu Lee, "Cycle-Accurate Energy Consumption Measurement and Analysis: Case Study of ARM7TDMI", ISLPED 2000 & IEEE Transactions on VLSI Systems, Vol. 10, pp. 146-154, Apr. 2002
  - Cycle-accurate energy consumption measurement based on charge transfer
  - Inserts switch caps between power supply and processor that switch with the same clock frequency!!
48
RELATED WORK – power model
Simulation tools:
• WATTCH, by David Brooks and Margaret Martonosi, Princeton University, ISCA 2000
  - Architectural power simulator
  - Power models integrated upon SimpleScalar
• SimplePower, by W. Ye, N. Vijaykrishnan, M. Kandemir, Penn-State University, and M. Irwin, “The Design and Use of SimplePower: A cycle-accurate energy estimation tool”, DAC, June 2000
  - Execution-driven, cycle-accurate, RTL power estimation
  - Emulates 5-stage pipe with SimpleScalar’s integer ISA
49
RELATED WORK – power model
Power modeling:
• R. Joseph and M. Martonosi, “Run-Time Power Estimation in High Performance Microprocessors”, International Symposium on Low Power Electronics and Design, 2001
  - Complete CASTLE project: collects PPro counter data and models component power breakdowns, verifying against measured total power
  - Also Wattch simulation vs. counter approximation for SimpleScalar architecture
• Russ Joseph, David Brooks, and Margaret Martonosi, "Live, Runtime Power Measurements as a Foundation for Evaluating Power/Performance Tradeoffs", Workshop on Complexity-Effective Design (WCED, held in conjunction with ISCA-28), 2001
  - Evaluates power vs. performance by measuring total power and acquiring performance data from counters, i.e., cache hit rate, branch prediction, bitline activity
50
RELATED WORK – power model
• H. Zeng, X. Fan, C. Ellis, A. Lebeck, and A. Vahdat, “ECOSystem: Managing Energy as a First Class Operating System Resource”, Proceedings of ASPLOS X, Oct. 2002
  - Uses Currentcy model (fixed power & time budget for a task) for OS-level energy management for battery life
  - ECOsystem is the Linux OS implementation <No counters>
  - Considers CPU ON/OFF; could do better with a power model
• H. Zeng, C. Ellis, A. Lebeck, A. Vahdat, “Currentcy: Unifying Policies for Resource Management”, USENIX 2003 Annual Technical Conference
  - Detailed description of currentcy (OS scheduling, etc.)
• Flinn J., Satyanarayanan, M., “PowerScope: A Tool for Profiling the Energy Usage of Mobile Applications”, Proceedings of the Second IEEE Workshop on Mobile Computing Systems and Applications, February 1999
  - Maps energy to program structure (power profiling, energy-efficient SW design)
  - DMM gets energy for machine
  - Kernel modification (system monitor) gets PIDs for processes and identifies procedures for profiling offline
51
RELATED WORK – power model
• V. Tiwari, S. Malik, and A. Wolfe, “Power analysis of embedded software: A first step towards software power minimization”, International Conference on Computer-Aided Design & IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 1994
  - PIONEER WORK in power measurement/modeling
  - Measure current drawn by an Intel 486DX2 processor and DRAM
  - Generate energy cost table for instructions
  - Identify inter-instruction effects: circuit state overhead, resource constraint effect, cache miss effects
  - (Many similar works on modeling SW energy exist; not listed here)
• Lee, A. Ermedahl, and S. Min, “An accurate instruction-level energy consumption model for embedded risc processors”, ACM SIGPLAN Conf. on Languages, Compilers, and Tools for Embedded Systems (LCTES'01), Jun 2001
  - Derives energy consumption for instructions rather than functional units for the RISC ARM7TDMI processor
  - Uses their cycle-accurate power measurement scheme
  - Black-box approach (similar to F. Bellosa) with linear regression
52
RELATED WORK – power model
• J. Russell and M.F. Jacome, "Software Power Estimation and Optimization for High Performance, 32-bit Embedded Processors", Proc. of ICCD '98
  - Estimates SW energy for i960 family 32-bit embedded RISC processors
  - Uses digitizing oscilloscope/series resistor over processor power lines for measurement
  - Uses const. Pest for processor power and estimates energy based on runtime (won’t work with clock gating!)
• J. Haid, G. Kafer, et al., "Run-Time Energy Estimation in System-On-a-Chip Designs", ASP-DAC 2003
  - Proposes a coprocessor for runtime energy estimation for SoC
  - Defines similar event counters in coprocessor and uses power macro-models
• M. Lajolo, A. Raghunathan, S. Dey, L. Lavagno, and A. Sangiovanni-Vincentelli, “Efficient power estimation techniques for hw/sw systems”, IEEE Proc. VOLTA'99 International Workshop on Low Power Design, pages 191-199, March 1999
  - Power estimation for HW/SW SoC designs
  - RTL HW simulator and instruction set simulator using instruction-level power models
53
RELATED WORK – power model
• M. Huang, J. Renau, and J. Torrellas, “Profile-based energy reduction in high-performance processors”, 4th Workshop on Feedback-Directed and Dynamic Optimization, December 2001
  - Use profiling to determine when to activate/deactivate low-power methods, i.e., DVS, clock gating, etc.
  - Use energy statistics (power breakdowns) from performance counters for profiling (SIM)
• I. Kadayif, T. Chinoda, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, A. Sivasubramaniam, “vEC: virtual energy counters”, Proceedings of the 2001 ACM SIGPLAN-SIGSOFT workshop on Program analysis for software tools and engineering, 2001
  - Uses Perfmon library for UltraSPARC to read SPARC HW perf counters related to memory
  - Converts readings to power using analytical memory energy model
  - Estimates memory system energy consumption
54
RELATED WORK – power model
Luca Benini et al.:
• “System-level power estimation and optimization”, Proceedings 1998 international symposium on Low power electronics and design
• “System-level power optimization: techniques and tools”, Proceedings of international symposium on Low power electronics and design, 1999
  - Tutorial on power-conscious system-level design
  - Memory optimizations, hardware/software partitioning, instruction-level power optimizations, DVS, DPM (allow components to sleep)
• “Supporting system-level power exploration for DSP applications”, Proceedings of the 10th Great Lakes Symposium on VLSI, 2000
  - Modified ARM simulator for instruction-level power estimation
55
RELATED WORK – thermal model
• K. Skadron, T. Abdelzaher, and M. R. Stan, “Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management”, Proc. HPCA-8, pages 17-28, Feb. 2002
  - Single-degree, component-based thermal R-C model for MIPS R10000 scaled to 0.18µm
  - Only die-heatsink thermal conduction, with const. heatsink and Si properties only
  - Power/thermal simulation using Wattch for verification of DTM with PID controller
• Sabry, M.-N.; Bontemps, A.; Aubert, V.; Vahrmann, R., “Realistic and efficient simulation of electro-thermal effects in VLSI circuits”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 5, Issue 3, Sep 1997
  - Transistor level with interdevice thermal resistances
• Szekely, V.; Poppe, A.; Pahi, A.; Csendes, A.; Hajas, G.; Rencz, M., “Electro-thermal and logi-thermal simulation of VLSI designs”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Volume 5, Issue 3, Sep 1997
  - LOGITHERM simulator module for gate-level thermal simulation, by thermal characterization of logic gates
56
RELATED WORK – thermal model
• COSMOS/FloWorks by NIKA
  - Fluid flow and thermal analysis program
  - Heat flow computation based on mesh analysis
• A. Dhodapkar, C. H. Lim, G. Cai, and W. R. Daasch, “TEMPEST: A thermal enabled multi-model power / performance estimator”, Proceedings of Workshop on Power-Aware Computer Systems, Nov. 2000
  - Thermally enabled architectural simulator based on SimpleScalar
  - Single R, C for the whole processor; packaging oriented
• D. Brooks and M. Martonosi, “Dynamic thermal management for high-performance microprocessors”, Proceedings of the Seventh International Symposium on High-Performance Computer Architecture, pages 171-82, Jan. 2001
  - Discusses microarchitectural and scaling DTM mechanisms
  - Uses moving average of power for ~100K cycles of Wattch simulation as a proxy for temperature to detect thermal emergencies for DTM triggering
57
RELATED WORK – thermal model
• Thermal Monitoring, “Intel Architecture SW developer’s Manual vol. 3”
  - Catastrophic shutdown detector: thermal diode resets stop clock duty cycle
  - Automatic thermal monitor: internally modulates stop clock duty cycle
  - Software-controlled clock modulation: SW modulates stop clock duty cycle
• Kevin Skadron et al., “Temperature aware Microarchitecture”, 30th ISCA, 2003
  - HotSpot: architecture-level thermal simulator built upon Wattch
  - Uses multiple-degree thermal R-C model for die, packaging, heatsink and convection to ambient
  - More realistic area estimates based on Alpha 21364
58
59
Counter Access Heuristics
1) BUS CONTROL:
• No 3rd-level cache, so BSQ allocations ~ IOQ allocations
• Metric 1: Bus accesses from all agents
  - Event: IOQ_allocation (counts various types of bus transactions)
  - Should account for BSQ as well; access based rather than duration
  - MASK: default req. type, all read (128B) and write (64B) types, include OWN, OTHER and PREFETCH
• Metric 2: Bus utilization (the % of time the bus is utilized)
  - Event: FSB_data_activity (counts DataReaDY and DataBuSY events on bus)
  - Mask: count when processor or other agents drive/read/reserve the bus
  - Expression: FSB_data_activity x BusRatio / Clocks Elapsed (to account for clock ratios)
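The bus utilization expression, FSB_data_activity x BusRatio / Clocks Elapsed, can be written out as a small helper. This is a sketch under an assumption the slide does not spell out: BusRatio is read here as the number of core clock cycles per bus clock, so the product converts bus-clock event counts into core-clock units before dividing by elapsed core clocks. The function name and the sample counts are hypothetical.

```python
def bus_utilization(fsb_data_activity, bus_ratio, clocks_elapsed):
    """Fraction of time the bus is utilized.

    fsb_data_activity: FSB_data_activity event count over the interval
    bus_ratio:         core clocks per bus clock (assumed interpretation)
    clocks_elapsed:    core clock cycles elapsed over the same interval
    """
    if clocks_elapsed <= 0:
        raise ValueError("clocks_elapsed must be positive")
    return fsb_data_activity * bus_ratio / clocks_elapsed


# Placeholder counts: 1,000,000 events, ratio 4, 40,000,000 core clocks.
utilization = bus_utilization(1_000_000, 4, 40_000_000)
```

With these placeholder counts the bus would be busy 10% of the time.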
60
Counter Access Heuristics
2) L2 Cache:
• Metric: 2nd-level cache references
  - Event: BSQ_cache_reference (counts cache ref-s as seen by bus unit)
  - MASK: all MESI read misses (LD & RFO), 2nd-level WR misses
3) 2nd Level BPU:
• Metric 1: Instructions fetched from L2 (predict)
  - Event: ITLB_Reference (counts ITLB translations)
  - Mask: all hits, misses & UC hits
• Metric 2: Branches retired (history update)
  - Event: branch_retired (counts branches retired)
  - Mask: count all Taken/NT/Predicted/MissP
61
Counter Access Heuristics
4) ITLB & I-Fetch:
etc………
10) FP Execution:
• Metric: FP instructions executed
  - event1: packed_SP_uop (counts packed single-precision uops)
  - event2: packed_DP_uop (counts packed double-precision uops)
  - event3: scalar_SP_uop (counts scalar single-precision uops)
  - event4: scalar_DP_uop (counts scalar double-precision uops)
  - event5: 64bit_MMX_uop (counts MMX uops with 64-bit SIMD operands)
  - event6: 128bit_MMX_uop (counts integer SSE2 uops with 128-bit SIMD operands)
  - event7: x87_FP_UOP (counts x87 FP uops)
  - event8: x87_SIMD_moves_uop (counts x87, FP, MMX, SSE, SSE2 ld/st/mov uops)
62
63
INTRODUCTION to RUNTIME
• What is Runtime Power/Thermal Measurement:
Methodology for measuring CPU power/temperature and component breakdowns
3 alternatives:
1. Measuring power/temperature directly from hardware, i.e., with multimeter probes
   - Impossible with VLSI
   - Runtime speed
2. Simulating processor execution with SW and extracting power/temperature data
   - WATTCH, Tempest, etc.
   - Computation time problems, especially with thermal
   - Cycle-level detail
3. Runtime measurement: getting processor power/thermal data at runtime using both hardware and software
   - Runtime speed and SW support, not cycle detail!
64
INTRODUCTION to RUNTIME
• Why Runtime Power/Thermal Measurement:
Offers a hybrid technique, bridging slow but detailed simulation and crude but fast real-time measurements
Hardware performance counters help extract lots of useful information, both performance and power, on the fly
Can be used for ‘priming’ instead of a long simulation where the last few million instructions are of most interest
65
WHY POWER & THERMAL
• Moore’s Law:
Transistor count x4 / 3 years
DRAM density x4 / 3 years
Performance improves exponentially SO DOES POWER [1]
• Nuclear Core Example:
66
WHY POWER & THERMAL
67
…WHY POWER & THERMAL
• Battery technology increases much slower
• Packaging costs: +$1/W over 35-40W [2]
68
POWER BASICS
• Total Power = Dynamic Power + Static Power + Short Circuit Power
Dynamic Power (switching power): discharging of capacitances when switching occurs (0 → 1); data dependent
$$P_{sw} = \tfrac{1}{2}\,\alpha\,C_L\,V_{dd}^2\,f$$
69
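Derivation of Switching Power
Capacitor relations:
$$i_C = C\,\frac{dV}{dt}, \qquad \text{Power} = V\,i = C\,V\,\frac{dV}{dt}, \qquad \text{Energy} = \tfrac{1}{2}\,C\,V^2$$
At each 0 → 1 transition:
$$\text{Energy} = C_L V_{dd}^2$$
At each 1 → 0 transition: this charge is dissipated.
$$\text{Power} = \frac{\text{Energy}}{\text{time}} = \frac{C_L V_{dd}^2}{\text{clock period}} = C_L V_{dd}^2\,f$$
With switching activity:
$$\frac{\text{Energy}}{\text{transition}} = \frac{\text{Total Energy}}{\text{total transitions}} = \tfrac{1}{2}\,C_L V_{dd}^2$$
$$\text{Power} = \frac{\text{Energy}}{\text{transition}} \cdot \alpha \cdot f = \tfrac{1}{2}\,\alpha\,C_L V_{dd}^2\,f$$
α: switching activity, the probability of switching in a cycle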
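A short numeric sketch of the switching-power expression P = (1/2) α C_L Vdd² f may help make the scaling concrete. The function name is hypothetical, and the activity factor and load capacitance below are placeholder values, not P4 data. Only the 1.7 V supply and 1.4 GHz clock echo figures quoted elsewhere in these slides.

```python
def switching_power(alpha, c_load, vdd, freq):
    """Dynamic (switching) power in watts: 0.5 * alpha * C_L * Vdd^2 * f.

    alpha:  switching activity (probability of switching in a cycle)
    c_load: switched capacitance in farads
    vdd:    supply voltage in volts
    freq:   clock frequency in hertz
    """
    return 0.5 * alpha * c_load * vdd ** 2 * freq


# Placeholder activity and capacitance (hypothetical), 1.7 V at 1.4 GHz.
p_nominal = switching_power(alpha=0.2, c_load=1e-9, vdd=1.7, freq=1.4e9)

# The quadratic dependence on Vdd: halving the supply quarters the power.
p_half_vdd = switching_power(alpha=0.2, c_load=1e-9, vdd=0.85, freq=1.4e9)
```

The quadratic Vdd term is the reason voltage scaling is a stronger lever on dynamic power than frequency scaling alone.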
70
POWER BASICS
Static Power (leakage power): due to leakage through the N channel and through the drain-substrate junctions.
71
POWER BASICS
Short Circuit Power: due to finite rise time of input signal. Generic CMOS feature.
• In comparison:Currently: 80% Sw. + 10% Leak + 10% SC
Future: 45% Sw. + 45% Leak + 10% SC [3]
72
NEED FOR SPEED
WATTCH simulates ~80K instructions/s
SpecINT 164.gzip runs: ~350 s with average UPC ~1.3 on a 1.4 GHz P4, producing ~665 billion uops
WATTCH simulation would take ~100 days
Assuming a 1 GHz machine: 1 s of real run ~ 5 x IPC hrs of WATTCH simulation
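The simulation-time estimate is simple arithmetic on the two numbers quoted on this slide (80K simulated instructions per second, ~665 billion uops). A small sketch, treating uops and instructions as interchangeable for the purpose of the estimate, which is an assumption:

```python
WATTCH_RATE = 80e3   # simulated instructions per second (from the slide)
GZIP_UOPS = 665e9    # uops produced by the gzip run (from the slide)

def sim_days(instructions, rate=WATTCH_RATE):
    """Days of simulation needed for the given instruction count."""
    return instructions / rate / 86400.0

def hours_per_real_second(ipc, clock_hz=1e9):
    """Hours of simulation per second of real execution at the given IPC."""
    return clock_hz * ipc / WATTCH_RATE / 3600.0

days = sim_days(GZIP_UOPS)
print(f"gzip: about {days:.0f} days of simulation")
print(f"1 s at IPC=1 on 1 GHz: about {hours_per_real_second(1.0):.1f} h")
```

With these inputs the per-second figure comes out at a few hours times IPC, the same order of magnitude as the rule of thumb on the slide.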
73
P4 Details
Karelian.ee:
P4, 1.4 GHz
0.18 µm, C4-FC-PGA-423
Heatsink: Folded Fin
M6, Al interconnect
Die Size: 217 mm2
Package Size: 5.34 cm x 5.17 cm
Power: Idle/typ./max = ??/51.8/71 W
D$1 & T$1 / L2: 8K & 12K uops / 256K
Voltage: 1.7/1.75 V
74
P4 Details
1st LKM: <LKM_CPUinfo & UserLevel_CPUinfo>
Implements syscall: getCPUinfo()
Gathers CPU info from:
/asm/processor.h
Intel control registers (CR4)
CPUID instruction
Reveals:
Debug Store mechanism exists for PEBS
TSC exists
MSRs implemented
We can read/write performance counters
EX:
karelian (P4, Willamette): UserLevel_CPUinfo
viale (P4, Northwood): UserLevel_CPUinfo
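As an illustration of what the CPUID check looks for, the sketch below decodes three feature bits from the EDX register returned by CPUID leaf 1 (bit 4 = TSC, bit 5 = MSR, bit 21 = Debug Store). It is not the LKM code from the talk; `decode_edx` and the sample register value are hypothetical, and a real reader would obtain EDX from the CPUID instruction itself.

```python
# CPUID leaf 1, EDX feature-flag bit positions.
EDX_BITS = {"TSC": 4, "MSR": 5, "DS": 21}

def decode_edx(edx):
    """Return which of the three features are flagged in an EDX value."""
    return {name: bool((edx >> bit) & 1) for name, bit in EDX_BITS.items()}

# Hypothetical register value with all three bits set (assumption).
sample_edx = (1 << 4) | (1 << 5) | (1 << 21)
features = decode_edx(sample_edx)
print(features)
```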
75
P4 Detector - Counter Clusters
[Diagram: Event Detectors in the P4 components send EVENTS to the Event Counters over a 4-bit-wide bus]
76
Counters, ESCRs & CCCRs
Simplified Recipe:
1. Select event to count
2. Select a counter (also defines CCCR)
3. Select an ESCR
4. Set ESCR fields
5. Set CCCR fields
6. Enable CCCR
77
Counting Mechanisms
Counting Types
Non-retirement: Events occur any time during execution
At-Retirement: Events at the retirement of instruction
Can count BOGUS vs NBOGUS, tag uops to count, etc.
Terminology
Mechanisms:
Front-end tagging (e.g. LD/ST retired)
Execution tagging (e.g. packed_DP_retired)
Replay tagging (e.g. L1 misses)
No tags (e.g. uops retired)
Also: Event Counting | IEBS | PEBS
78
At Retirement Counting Terminology
BOGUS/NBOGUS (speculative)
Tagging (count uops that encounter event)
Replay (data speculation)
79
Verifying Counter Reader
1) L1Dcache_exercise:
Uses pointer assignment
L1 = 8K, L2 = 256K
Array Size = (L1 Size / Hit Rate)
e.g. for 10% hit rate: 80K → 20K entries
Array Size < L2 size
Array elements ← PRBS of array indices
Bench loop:
new index ← array[old index]
However, gcc puts 5 LDs in the bench loop
4 static → hit rate ~100%
1 our load → our desired hit rate
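The array sizing and the pointer-chasing loop can be sketched as follows. This is not the original benchmark: a seeded random single-cycle permutation stands in for the PRBS sequence, 4-byte entries are assumed (which matches 80K → 20K entries), and Python only illustrates the access pattern, not real cache behaviour.

```python
import random

L1_BYTES = 8 * 1024   # L1 data cache size from the slide
ENTRY_BYTES = 4       # assumed entry size (80K bytes -> 20K entries)

def array_entries(hit_rate):
    """Entries needed so that Array Size = L1 Size / Hit Rate."""
    return round(L1_BYTES / hit_rate) // ENTRY_BYTES

def build_chain(n, seed=0):
    """Random single-cycle permutation: chain[i] is the next index to visit."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    chain = [0] * n
    for a, b in zip(order, order[1:] + order[:1]):
        chain[a] = b
    return chain

n = array_entries(0.10)
chain = build_chain(n)
idx = 0
for _ in range(n):    # bench loop: new index <- array[old index]
    idx = chain[idx]
```

Because the chain is a single cycle, every entry is touched exactly once per pass, so the working set equals the full array size.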
80
…Verifying Counter Reader
1) L1Dcache_exercise results:
[Figure: L1Dcache Experiment. x-axis: Desired Hit Rate (0.04 to 1000); y-axis: Acquired Rates (-20% to 120%); series: Acquired L1 Hit Rate, Our L1 Hit Rate From L2 Accesses]
Ex: L1Dcache_exercise, Hit Rate = 0.25
81
…Verifying Counter Reader
2) branch_exercise:
Uses random number comparison
Assigns 400K PRBS array outside bench loop
To avoid rand() instructions in bench loop
Bench loop:
Compares array element to threshold
Threshold = RAND_MAX*TakenRate
Repeats 1000 times, reseeding each time
However, gcc adds 2 more branches into the bench loop:
Loop exit condition (Prediction ~100%)
Unconditional JMP (Prediction ~100%)
Our branch’s expected mispredict rate: ~(0.5 - |TakenRate - 0.5|)
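The expected-mispredict expression and the threshold comparison can be sketched in a few lines. This is an illustration under assumptions: Python's `random.random()` in [0, 1) replaces `rand()` and `RAND_MAX`, and the function names are hypothetical.

```python
import random

def expected_mispredict(taken_rate):
    """Approximate mispredict rate for a random branch: 0.5 - |rate - 0.5|."""
    return 0.5 - abs(taken_rate - 0.5)

def measured_taken_rate(taken_rate, n=100_000, seed=1):
    """Fraction of comparisons that fall below the threshold (branch taken)."""
    rng = random.Random(seed)
    threshold = taken_rate            # stands in for RAND_MAX * TakenRate
    taken = sum(1 for _ in range(n) if rng.random() < threshold)
    return taken / n

for rate in (0.0, 0.25, 0.5, 0.9):
    print(rate, expected_mispredict(rate), measured_taken_rate(rate))
```

The expression peaks at a taken rate of 0.5, where the outcome is least predictable, and falls to zero when the branch is always or never taken.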
82
…Verifying Counter Reader
2) branch_exercise results:
Ex: branch_exercise, Taken Rate = 0.5
[Figure: Branch Prediction Experiment. x-axis: Desired Taken Rate (0 to 1); y-axis: Acquired Rates (0% to 120%); series: Approximated Mispredict Rate, Our Branch's Taken Rate]
83
P4 POWER MEASUREMENT
Complete Setup:
[Diagram: clamp current probe over the 12V lines (1 mV/A DC conversion) → voltage [V] readings → serial reader → PowerMeter (log voltage readings; convert to instantaneous power: 12 x Vsample x 1000; log power values) → PowerPlotter (plot power values)]
84
MEASUREMENT Method
Select power lines that reflect CPU power
P4 uses 12 V lines
Clamp the current probe over the 12 V lines
1 mV/A (DC) conversion
Connect the clamp to the DMM
Send voltage reading over serial
Log the voltage readings
Convert to instantaneous power as: 12 x Vsample x 1000
Log power values
Plot power values
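The conversion step follows from the probe scale: at 1 mV per ampere, a reading in volts times 1000 is the current in amperes, and multiplying by the 12 V supply gives watts. A minimal sketch, with made-up sample readings:

```python
SUPPLY_VOLTS = 12.0      # the clamped lines are the 12 V lines
AMPS_PER_VOLT = 1000.0   # probe outputs 1 mV per ampere

def clamp_power(v_sample):
    """Instantaneous power in watts from one DMM reading in volts."""
    return SUPPLY_VOLTS * v_sample * AMPS_PER_VOLT

# Hypothetical DMM readings in volts (assumption, not measured data).
readings = [0.0035, 0.0042, 0.0040]
powers = [clamp_power(v) for v in readings]
print(powers)
```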
85
MEASUREMENT Tools
Poll serial port every ~20 ms
Polling quicker is overkill; polling slower overlooks readings
Compute a running-average sample every Δt you select
Easier to sync with Power Model
PowerMeter:
Convert voltage reading to power and log
P = 12 x Vread x 1000
PowerPlotter:
Plot power samples over a sliding time window
100 s history with 1000 samples (Δt = 100 ms)
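The two tools can be approximated with a short sketch: group the ~20 ms polls into one averaged sample per Δt, and keep the most recent samples in a fixed-length window. The class and function names are hypothetical and this is not the PowerMeter/PowerPlotter source.

```python
from collections import deque

def average_polls(polls, per_sample=5):
    """Average consecutive groups of polls, e.g. five 20 ms polls per 100 ms."""
    return [sum(polls[i:i + per_sample]) / per_sample
            for i in range(0, len(polls) - per_sample + 1, per_sample)]

class PowerWindow:
    """Sliding history of the most recent power samples."""
    def __init__(self, size=1000):       # 1000 samples x 100 ms = 100 s
        self.samples = deque(maxlen=size)

    def add(self, watts):
        self.samples.append(watts)       # oldest sample drops out when full

    def mean(self):
        return sum(self.samples) / len(self.samples) if self.samples else 0.0

window = PowerWindow(size=3)
for w in average_polls([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]):
    window.add(w)
```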
86
Current Probe
Fluke i410
Uses Hall voltage to measure current and convert it to voltage: 1 mV/A (DC)
Range: 0.5 – 400 A
Accuracy: 3.5% + 0.5 A
Generated voltage is fed to DMM
Compared against the Ppro Amoeba shunt setup for verification
87
Clamp vs Shunt
[Figures: sampled current for L1Dcache from clamp; sampled current for L1Dcache from shunt; current for grep from shunt; current for grep from clamp. y-axis: current (A); x-axis: samples at 100 ms]
88
DMM
Agilent 34401A
Measurement Motive:
We should sample as quickly as possible (grep case)
Measurement Setup:
Fast 4 digit, Autozero OFF, Display OFF
From [8], 1000 readings/s (x150 faster than fast 6 digit)
Serial Interface:
From [9], 55 ASCII readings/s
Polling serial port faster than 20 ms is overkill
89
P4 Power Lines
Which power lines should we cut / clamp?
[5] shows the power lines:
1 - CPU power connector
13 - System power connector
P1 = 13 & P2 = 1
[6],[7] say P4 uses 12V lines for CPU, rather than 5V lines
Both P1 & P2 have 12, 5 and 3.3 V lines
I run branch_exercise (takenRate=1) and gzip_static to obtain the current variation on the lines
90
Current on Power Lines
[Figures: current I [A] vs. time (s), one panel per line group:
Connector P1 line 7 (12V)
Connector P1 lines 1,3,,6,18,19,20,22 (5V)
Connector lines 11,12,23 (3.3V)
Connector P2 line 1 (3.3V)
Connector P2 line 14 (5V)
Connector P2 line 3 (12V)
Connector P2 line 7 (12V)
Connector P2 line 9 (5V)]
Reveals: the currents on ALL 3 12V lines follow CPU activity
All add to CPU power!
91
Validating with Optimizations
Compare to Optimizations vs Power of [Seng & Tullsen]
[Figure: SPECINT average power vs gcc optimizations. y-axis: Average Power [W] (39 to 53); benchmarks: GZIP, VPR, GCC; series: O0, O1, O2, O3, O3 unroll, O3 unroll ALL]
92
Optimizations
-O0
None at all
-O1 -fomit-frame-pointer
thread-jumps, delayed-branches, defer-pop
-O2 -fomit-frame-pointer
CSE related blocks, jumps, expensive optimizations, reschedule instructions, etc.
-O3 -fomit-frame-pointer
O2 + inline functions heuristically
-O3 -fomit-frame-pointer -funroll-loops
Only for #iterations known at compile/run time
-O3 -fomit-frame-pointer -funroll-all-loops
Do for all loops (usually bad result)
93
GZIP – power vs time
[Figure: Power for GZIP Optimizations. x-axis: time (s), 0 to 900; y-axis: power [W], 0 to 70; series: O0, O1, O2, O3, O3unroll, O3unrollALL]
94
…GZIP – power vs time
All have similar power
Exec. time(O0) ~ x2 Exec. time(O-else)
Different data sets provide different power profiles
95
3 specINT average Power
[Figure: SPECINT average power vs gcc optimizations. y-axis: Average Power [W] (39 to 53); benchmarks: GZIP, VPR, GCC; series: O0, O1, O2, O3, O3 unroll, O3 unroll ALL]
Optimized code runs quicker, and yet with less average power
specFP – art seems to be the exception?
96
About the ripples
Add ripple stuff here…
97
P4 Architecture vs Layout
Components to Model:
1) Bus Control
2) L2 Cache
3) 2nd Level BPU
4) ITLB & Ifetch
5) L1 Cache
6) MOB
7) Mem Control
8) DTLB
9) Int EXE
10) FP EXE
11) Int RF
12) FP RF
13) Decode
14) Trace $
15) 1st Level BPU
16) Microcode ROM
17) Allocation
18) Rename
19) Inst-n Qs
20) Schedule
21) Inst-n Qs
22) Retirement
98
Defining Components
99
Counter Rotations
100
Experiment Setup
[Diagram: POWER CLIENT and POWER SERVER, each showing Component Breakdowns]
102
THERMAL Basics
Duality: heat flow ↔ electrical flow
Thermal Mass (Capacitance):
$C_{th} = c \cdot A \cdot t$ [J/K]
c: Specific heat [J/m3K]
A: Block Area [m2]
t: Wafer thickness [m]
Thermal Resistance:
$R_{th,norm} = \rho \cdot t / A$ [K/W]
ρ: Thermal resistivity [m·K/W]
A: Block Area [m2]
t: Wafer thickness [m]
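A short sketch makes the two definitions concrete and shows that their product, the RC time constant, does not depend on block area. The material values (c = 1e6 J/m3K, ρ = 1e-2 m·K/W, t = 0.1 mm) are the ones used in the talk's quantitative example; the block areas are arbitrary assumptions.

```python
def thermal_rc(area_m2, thickness_m, c_vol, rho):
    """Return (R_th [K/W], C_th [J/K], tau [s]) for one block."""
    c_th = c_vol * area_m2 * thickness_m   # thermal capacitance
    r_th = rho * thickness_m / area_m2     # thermal resistance
    return r_th, c_th, r_th * c_th         # tau = rho * c * t^2

C_VOL = 1e6      # specific heat per volume [J/m3K]
RHO = 1e-2       # thermal resistivity [m.K/W]
THICK = 1e-4     # 0.1 mm thinned wafer [m]

small = thermal_rc(1e-5, THICK, C_VOL, RHO)   # hypothetical 10 mm2 block
large = thermal_rc(4e-5, THICK, C_VOL, RHO)   # hypothetical 40 mm2 block
print(small, large)
```

A larger block has a lower resistance and a higher capacitance in the same proportion, so both blocks share the same 100 µs time constant.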
103
Simplified Thermal Model
Divide the CPU into component blocks
Each block dissipates a different power, $P_{block}$ → reveals different temperature changes, $\Delta T_{block}$
[Diagram: DIE divided into blocks Blk_i, Blk_j, Blk_k, Blk_l. Each block x is an RC node ($T_{b,x}$, $R_{th,x}$, $C_{th,x}$) driven by $P_x$; all blocks connect to the HEATSINK node ($T_h$, $R_{th,h}$, $C_{th,h}$), which carries $P_i + P_j + P_k + P_l$]
For block i:
$P_i = C_{th,i}\,\dfrac{dT_{b,i}}{dt} + \dfrac{T_{b,i} - T_h}{R_{th,i}}$
Final difference equation:
$\Delta T_i = \dfrac{P_i}{C_{th,i}}\,\Delta t - \dfrac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$
Δt: Sampling interval
$T_i$: The temperature difference between block and the heatsink
Δt should be much smaller than the RC time constant, $\tau_{th,i}$
Numerical values: see Quantitative Example
104
QUANTITATIVE EXAMPLE
Use t = 0.1 mm (thinned wafer)
Areas given in table (c = 10^6 [J/m3K] & ρ = 10^-2 [m·K/W])
$\tau_{th} = R_{th} C_{th} = \rho\,c\,t^2 = 10^{-4}$ s = 100 µs, independent of area!
Temperature buildup for Regfile with Δt = 133.4 ns:
$\Delta T_{blk}$ (w.r.t. heatsink) $= \dfrac{P_{blk}}{C_{th,blk}}\,\Delta t - \dfrac{T_{blk}}{R_{th,blk}\,C_{th,blk}}\,\Delta t$
[Worked numeric values on the slide: 21.11, 42.85, 100]
105
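THERMAL FORMULATION
For any block i (node $T_{b,i}$ with $R_{th,i}$, $C_{th,i}$, driven by $P_i$ and connected to $T_h$):
$P_i = C_{th,i}\,\dfrac{dT_{b,i}}{dt} + \dfrac{T_{b,i} - T_h}{R_{th,i}}$
Discretizing time:
$P_i = C_{th,i}\,\dfrac{\Delta T_{b,i}}{\Delta t} + \dfrac{T_{b,i} - T_h}{R_{th,i}}$
Define: $T_i = T_{b,i} - T_h$
Assuming $T_h \approx$ const: $\Delta T_h \approx 0$, so $\Delta T_i = \Delta T_{b,i}$ and
$P_i = C_{th,i}\,\dfrac{\Delta T_i}{\Delta t} + \dfrac{T_i}{R_{th,i}}$
Final difference equation:
$\Delta T_i = \dfrac{P_i}{C_{th,i}}\,\Delta t - \dfrac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$
Δt: Sampling interval
$T_i$: The temperature difference between block and the heatsink
Δt should be much smaller than the RC time constant, $\tau_{th,i}$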
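The final difference equation maps directly onto a loop. The sketch below applies it repeatedly for one block and checks that the temperature difference settles at P·R. The power and resistance are the Regfile figures quoted in the steady-state slide (P = 10, R = 4); the capacitance and step size are assumptions chosen so that Δt is 1% of the time constant.

```python
def simulate_block(power, r_th, c_th, dt, steps, t0=0.0):
    """Iterate dT = (P/C)*dt - (T/(R*C))*dt; T is block minus heatsink [K]."""
    t = t0
    for _ in range(steps):
        t += (power / c_th) * dt - (t / (r_th * c_th)) * dt
    return t

P_BLOCK = 10.0     # [W], Regfile example
R_TH = 4.0         # [K/W], Regfile example
C_TH = 2.5e-5      # [J/K], hypothetical, gives tau = R*C = 100 us
DT = 1e-6          # [s], hypothetical, 1% of tau

t_final = simulate_block(P_BLOCK, R_TH, C_TH, DT, steps=2000)
print(f"after 20 time constants: {t_final:.3f} K above heatsink")
```

With this explicit update the iteration only stays stable while Δt is below twice the time constant, which is one reason the sampling interval has to be much smaller than the RC product.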
106
Refined Thermal Model
Steady-state analysis reveals that the heatsink-die abstraction is not sufficient for real systems
Proceeding to a multilayer thermal model:
Active die thickness → metalization/insulation → chip-package interface → package → heatsink
Requires researching several materials/dimensions and thermal properties
Multiple layers → multiple T nodes → multiple DEs
Baseline heat removal structure:
107
Refined Thermal Model
Need to define the physical structure
All the layers the heat flux propagates through
Corresponding thermal model
Multinode
Different assumptions/decisions
Physical parameters for different elements
Dimensions
Material types
$\rho_{th}$ and $c_{th}$
New set of thermal update DEs
108
Physical Model vs. Thermal Model
[Diagram: thermal network. $P_i$ → die node ($T_{die,i}$, $R_{die,i}$, $C_{die,i}$) → package node ($T_{p,i}$, $R_{p,i}$, $C_{p,i}$) → heat spreader node ($T_{spr}$, $R_{spr}$, $C_{spr}$, carrying $P_{total}$) → R_grXspr → heatsink node ($T_h$, $R_h$, $C_h$) → R_hXA → ambient $T_A$]
109
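Analytical Derivation
4 Nodes → 4 DEs
1) $T_{spr}$:
$P_{total} = C_{spr}\,\dfrac{dT_{spr}}{dt} + \dfrac{T_{spr} - T_h}{R_{spr+gr}}$
Discretizing time:
$P_{total} = C_{spr}\,\dfrac{\Delta T_{spr}}{\Delta t} + \dfrac{T_{spr} - T_h}{R_{spr+gr}}$
$P_{total}\,\Delta t = C_{spr}\,\Delta T_{spr} + \dfrac{1}{R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t$
Final difference equation:
$\Delta T_{spr} = \dfrac{P_{total}}{C_{spr}}\,\Delta t - \dfrac{1}{C_{spr}\,R_{spr+gr}}\,(T_{spr} - T_h)\,\Delta t$
$T_{spr} = T_{spr} + \Delta T_{spr}$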
110
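…Analytical Derivation
2) $T_h$:
$\Delta T_h = \dfrac{T_{spr} - T_h}{R_{spr+gr}}\,\dfrac{\Delta t}{C_h} - \dfrac{1}{C_h\,R_{hxa}}\,(T_h - T_A)\,\Delta t$
$T_h = T_h + \Delta T_h$
3) $T_{die,i}$:
$\Delta T_{die,i} = \dfrac{P_i}{C_{die,i}}\,\Delta t - \dfrac{1}{C_{die,i}\,R_{die,i}}\,(T_{die,i} - T_{p,i})\,\Delta t$
$T_{die,i} = T_{die,i} + \Delta T_{die,i}$
4) $T_{p,i}$:
$\Delta T_{p,i} = \dfrac{T_{die,i} - T_{p,i}}{R_{die,i}}\,\dfrac{\Delta t}{C_{p,i}} - \dfrac{1}{C_{p,i}\,R_{p,i}}\,(T_{p,i} - T_{spr})\,\Delta t$
$T_{p,i} = T_{p,i} + \Delta T_{p,i}$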
111
Temperature Updating and Initial Conditions
D.E.s should be updated along the direction of current (power) flow: Tdie,i → Tp,i → Tspr → Th
It is not reasonable to start from ambient temperatures as initial conditions. Mostly, the processor is already running
TA is given as ~50 °C by Intel Thermal Design Guidelines
Assume idle power (Ppro ~2 W):
Th = TA + 2 W·Rhxa = ~52 °C
Tspr = Th + 2 W·Rspr+gr = ~52 °C
Tp,i = Tdie,i = Tspr = ~52 °C
Update order: Update Tdie,i → Update Tp,i → Update Tspr → Update Th
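The update order and the idle-power initial conditions can be sketched for a single block, so that the total power equals the block power. All resistance and capacitance values below are hypothetical placeholders; only the 323 K ambient, the 2 W idle power, and the 5e-6 s sampling interval come from the talk. The initial state here is the exact steady state of the network rather than the rounded ~52 °C values.

```python
def step(T, P, R, C, t_amb, dt):
    """One update in the order die -> package -> spreader -> heatsink."""
    T = dict(T)
    T["die"] += dt / C["die"] * (P - (T["die"] - T["p"]) / R["die"])
    T["p"] += dt / C["p"] * ((T["die"] - T["p"]) / R["die"]
                             - (T["p"] - T["spr"]) / R["p"])
    T["spr"] += dt / C["spr"] * (P - (T["spr"] - T["h"]) / R["spr_gr"])
    T["h"] += dt / C["h"] * ((T["spr"] - T["h"]) / R["spr_gr"]
                             - (T["h"] - t_amb) / R["hxa"])
    return T

def steady_state(P, R, t_amb):
    """Temperatures at which every update above is zero for constant P."""
    t_h = t_amb + P * R["hxa"]
    t_spr = t_h + P * R["spr_gr"]
    t_p = t_spr + P * R["p"]
    return {"die": t_p + P * R["die"], "p": t_p, "spr": t_spr, "h": t_h}

# Hypothetical parameters (assumptions): resistances in K/W, capacitances in J/K.
R = {"die": 0.5, "p": 0.3, "spr_gr": 0.2, "hxa": 1.0}
C = {"die": 1e-3, "p": 1e-2, "spr": 5.0, "h": 100.0}
T_AMB, P_IDLE, DT = 323.0, 2.0, 5e-6

T0 = steady_state(P_IDLE, R, T_AMB)
T1 = step(T0, P_IDLE, R, C, T_AMB, DT)
```

Starting from the idle steady state means the first updates reflect only the change in workload power, not the slow warm-up of the heatsink.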
112
Steady State Solution
If $R_{th,i} \to R_{th,i} \times 20$, then $T_{ss,i} \to T_{ss,i} \times 20$
Regfile ex. of presentation 1: $P_i = 10$ & $R_{th,i} = 4$ → $T_{i,ss} = 40$ K
[Diagram: DIE divided into blocks Blk_i, Blk_j, Blk_k, Blk_l. Each block x is an RC node ($T_{b,x}$, $R_{th,x}$, $C_{th,x}$) driven by $P_x$; all blocks connect to the HEATSINK node ($T_h$, $R_{th,h}$, $C_{th,h}$), which carries $P_i + P_j + P_k + P_l$]
Final difference equation:
$\Delta T_i = \dfrac{P_i}{C_{th,i}}\,\Delta t - \dfrac{T_i}{R_{th,i}\,C_{th,i}}\,\Delta t$
Steady state solution: $\Delta T_i = 0$
$0 = \dfrac{P_i}{C_{th,i}}\,\Delta t - \dfrac{T_{i,ss}}{R_{th,i}\,C_{th,i}}\,\Delta t$
$T_{i,ss} = P_i\,R_{th,i} = P_i\,\rho\,t / A_i$
Numerically, for decode: $T_{ss,dec}$ = [value not recoverable from the extracted slide] K
113
EX: Ppro Thermal Model
Use CASTLE computed component powers
Select thermal sampling interval
Determine component areas from die photo
Determine processor/packaging physical parameters
Generate numerical thermal model
Apply component difference equations recursively
114
Simulation
ρ and c values hardcoded for materials (except Si)
Areas/relative areas hardcoded for components
Individual R and C computed for components
D.E. loop is re-executed every Δt, in the discussed order
Updated thermal nodes displayed every t ~20 ms
Component temperatures build up to ~350 K in ~5 hrs
Clock temp. shoots up
T_heatsink moves very slowly, as expected
For the complete set of computed numerical simulation results, go to additional slides
115
Simulation Outputs – at Startup
116
Simulation Outputs – After 5 hrs
117
Thermal Model Parameters
BASELINE AMBIENT TEMPERATURE
T_ambient = 323; /* in K */ (Intel Thermal Design Guidelines)
SAMPLING INTERVAL
dt = 5e-6 s (my choice)
Processor Specific Parameters
118
Physical Parameters
15% of heatsink area has fins, 85% doesn’t
Overall Rth estimate: Rfin || Rnofin
119
…Physical Parameters
Temperature assumed uniform along the heat spreader, and therefore above it
120
…Physical Parameters
We don’t use total R & C for the package, as it’s decomposed into component areas in the model
DIE:
Process info scaled from P4 data in [7] using ITRS 1999 & 2001 and interpolating MPU ½ pitch vs. wire pitch
Metal layer & isolation scale factor: 2.15
ITRS FEP Si final device thickness ~100 nm (130 nm tech.)
I used the overall wafer thickness
Temperature-dependent Si thermal conductivity: $k_{Si}(T) = 1.5486 \cdot 10^{2} \cdot (300/T)^{4/3}$ [W/(m·K)]
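The temperature-dependent silicon term can be evaluated as below. The sketch assumes the expression is a thermal conductivity in W/(m·K), which fits its value of about 155 at 300 K, and that a layer's resistance is thickness divided by conductivity times area. The function names and the example block area are hypothetical.

```python
def k_si(temp_k):
    """Assumed Si thermal conductivity [W/(m.K)]: 1.5486e2 * (300/T)^(4/3)."""
    return 1.5486e2 * (300.0 / temp_k) ** (4.0 / 3.0)

def r_si(thickness_m, area_m2, temp_k):
    """Thermal resistance [K/W] of a Si layer at the given temperature."""
    return thickness_m / (k_si(temp_k) * area_m2)

# Hypothetical block: 0.1 mm thick, 20 mm2 area.
for temp in (300.0, 323.0, 350.0):
    print(temp, round(k_si(temp), 2), r_si(1e-4, 2e-5, temp))
```

Conductivity falls as the die heats up, so the same block presents a slightly higher thermal resistance at 350 K than at 300 K.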
121
…Physical Parameters
DIE Rth Estimate:
Rdie = RSi + Rmetal + Rpoly + RSiO2
For 10% die area:
RSi ~ 0.1 K/W
Rmetal ~ 0.0008 K/W
Rpoly ~ single layer, negligible
RSiO2 ~ 0.86 K/W
Rdie ~ RSi + RSiO2
DIE Cth Estimate:
Only Si considered, as the rest is much thinner
122
Numerical Values
123
Computed Thermal values
124
Computed Thermal v.2 values
125
Ppro info & Areas
Complete processor info ([4],[5],[6])
200 MHz
4 metal layers
Package: 387 pin DC-PGA
Package size: 6.76 cm x 6.25 cm
0.35 µm BiCMOS
Die size: 196 mm2 (14 x 14)
Area estimates for die
Scale component areas from [1]:
             [1]        Ours
Clock        150 MHz    200 MHz
Process      0.50 µm    0.35 µm    <process scaling x0.7>
Die size     306 mm2    196 mm2    <area scaling x0.64>
I use x0.64 area scaling and [1]’s breakdowns for component area estimates
126
Component Areas
[Figure: die photo annotated with component area percentages: 3.9%, 11.8%, 7.9%, 4.0%, 4.4%, 4.2%, 7.6%, 8.6%, 14.3%, 4.1%, 2.5%, 2.2%, 4.6%, 1.3%]
Close to Intel data
These areas cover ~81.3% of die
Clock area found from Intel data as:
Aclk = Pclk / PwrDensityclk = 1.7%
127
CASTLE Breakdown Areas
We need to convert the given areas to CASTLE components:
DECODE ← ID + MIS = 11.7%
ISSUE ← RS = 7.6%
REORDER ← RAT + (ROB & RRF) = 8.6%
DMEM ← DCU = 8.6%
IMEM ← IFU = 11.8%
FUNC_UNIT ← AGU + IEU + FEU = 10%
OTHER ← 100 - above = 41.7%
CLOCK ← 1.7%
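Turning the percentage breakdown into absolute areas is a one-line scaling by the 196 mm2 Ppro die size. The sketch below does that for the eight CASTLE components; the dictionary layout is an illustrative choice, not the CASTLE data structure.

```python
DIE_MM2 = 196.0   # Ppro die size from the talk

# Area fractions per CASTLE component, as listed on the slide.
FRACTIONS = {
    "DECODE": 0.117, "ISSUE": 0.076, "REORDER": 0.086, "DMEM": 0.086,
    "IMEM": 0.118, "FUNC_UNIT": 0.10, "OTHER": 0.417, "CLOCK": 0.017,
}

areas_mm2 = {name: frac * DIE_MM2 for name, frac in FRACTIONS.items()}
total_fraction = sum(FRACTIONS.values())

for name, area in areas_mm2.items():
    print(f"{name:10s} {area:6.2f} mm2")
print(f"sum of fractions: {total_fraction:.3f}")
```

The fractions add up to 101.7% rather than 100% because OTHER is computed as the remainder of the six named units, with the clock share listed on top of that.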
128
CASTLE
• Power measurement / profiling tool
• Developed by Prof. Martonosi and Russ
• Implemented on a P6, Linux
• Generates power profiles for benchmarks at runtime
Uses performance counters to gather utilization information
Uses WATTCH’s per-usage wattage values for max power values ([8 p.3])
Uses heuristics to extract usage counts for blocks
Uses register sampling to compute activity factors for single-ended bitlines
Computes total processor power
Uses a digital multimeter for validation
129
Performance Counters
• Exist on most new processors
• Mainly used to track performance-related events
Cache misses
Committed instructions, etc.
• Can be used to gather power-related data
• P6 has 2 performance counters that count 77 events
Can be accessed with:
RDMSR (Read Model Specific Register)
WRMSR (Write Model Specific Register)
RDTSC (Read Time Stamp Counter)
Kernel level (Ring 0) instructions
Exemplary events:
0. TSC: elapsed machine cycles
03. 03H: L1 read misses
44. C0H: instr-ns retired
130
Heuristics
• To extract power related data from performance counters
• Platform Dependent!
131
CASTLE implementation
• Platform: P6, 200 MHz | Linux kernel v2.2.16-3
[Diagram components: HW counters, Kernel Code, Server code, Client Code, Series Resistance, Xmultimeter server]
132
CASTLE Filesystem – User Code
• Client: <cpu-probe>
Includes cpu-monitor & cpu-network
Cpu-monitor:
Provides the X windows for power breakdown bar graphs <gtk and threads>
Acquires power breakdowns from cpu-network
Cpu-network:
Connects to server side through Ethernet <sockets and threads>
Gets event counts and number of elapsed cycles for each tracked event
Constructs component power values from event data using heuristics
133
CASTLE Filesystem – User Code
• Multimeter: <xmmeter>
Real multimeter reads the voltage over series R and sends it over RS232
Xmmeter reads the serial port and converts the voltage reading into power as:
P = (Vread/Rs)·Vdd
X window displays the readings
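The shunt conversion divides the voltage across the series resistor by its resistance to get the supply current, then multiplies by the supply voltage. A minimal sketch with made-up values; the function name and all three numbers are assumptions, not the xmmeter code or its configuration.

```python
def shunt_power(v_read, r_series, v_dd):
    """Power in watts: current through the series resistor times Vdd."""
    return (v_read / r_series) * v_dd

# Hypothetical values: 50 mV across a 10 milliohm resistor on a 3.3 V supply.
p = shunt_power(v_read=0.05, r_series=0.01, v_dd=3.3)
print(f"{p:.1f} W")
```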
134
CASTLE Filesystem – User Code
• Server: <probe-server>
Reads the performance counts every second with the syscall “getglobaleventcount” defined in kernel code
Acquires event counts and elapsed cycles for all events
Sends the event and cycle data to the client as a stream of chars
135
CASTLE Filesystem – Kernel Code
• Required to access counters
• Scattered in:
/usr/src/linux/arch/i386/kernel/entry.S
/usr/src/linux/include/linux/sched.h
/usr/src/linux/kernel/fork.c
/usr/src/linux/kernel/sched.c
• Defines 2 new system calls:
Geteventcount
Getglobaleventcount
• Accesses the counters, gets counter & cycle data
Syscall returns the event and cycle counts to the server as a 2D array
136
CASTLE Details
• In CASTLE code, 12 distinct events are defined
• From [1] and [8], 10 of the events are used:
instructions decoded
instructions executed
instructions retired
floating point operations executed
branches retired
branches decoded
L1 instruction cache accesses
L1 data cache accesses
L2 unified cache accesses
main memory requests
• [1] and [8] suggest a 10 ms sampling period
• Probe-server samples counters every second
137
Power Breakdown Components
CASTLE tracks 12 events
Develops power breakdowns for 8 units:
DECODE
ISSUE
REORDER
DMEM
IMEM
FUNC_UNIT
OTHER
CLOCK
Component powers recomputed every second in CPU-network
138
Thermal Modeling with CASTLE
• Thermal model requires only power and sampling time information
Thermal model can be added at user level, by:
extending cpu-network for temperature updates
extending cpu-monitor for a new thermal X window
• A pitfall is the sampling period
Sampling time should be much smaller than the time constant for reliable modeling (<< 100 µs)
139
EOP