Fifth Workshop on Energy-Efficient Design (WEED 2013), June 2013
Need good name 1
Energy efficient computing in high performance systems
Efraim Rotem, Intel Corporation and Technion, Israel
Ran Ginosar, Technion, Israel
Avi Mendelson, Technion, Israel
Uri Weiser, Technion, Israel
June 2013
Work supported by “ICRI-CI” – Intel Collaborative Research Institute for Computational Intelligence
Compute Performance
[Figure: compute performance over time (log scale, 1 to 10,000), 1978 to 2006, from the 8086, PC-XT and 286 through the 386, 486, Pentium, MMX, P-2 and P-4 to the Core-2 Duo. Source: Dave Patterson]
The best is yet to come
• Clients:
• Create, innovate, collaborate
• Perform complex tasks
• Deliver computational density
• Economy of scale
• Ubiquitous computing
• Lower entry cost
• Rich content compute demand
• Audio visual
• Cognitive computing
• Servers:
• Drive the connected world
• Google, Facebook …
• Perform some of the compute for thin clients
• Large scale computing
• Finance, science, engineering, cloud computing
Moore’s law for power performance
• Theoretical scaling of new process technology:
– Linear dimensions: shrink by 0.7
– Area: shrinks by 0.5
– Capacitance: shrinks by 0.7
– Voltage: scales down by 0.7
– Frequency: scales up by 1/0.7
– Power: scales down by 0.5
• There’s additional transistor and power budget for:
– New features
– Architectural extensions
– Performance improvement
• Half the area, half the power
• Sustainable performance improvement at the same power consumption
Power = C * V^2 * F + Leakage
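Worked example (a sketch, not from the original slides): plugging the theoretical scaling factors above into the dynamic term of the power equation shows where "power scales down by 0.5" comes from. Leakage is ignored, and the function name `dynamic_power_scale` is illustrative only.

```python
def dynamic_power_scale(cap_scale, volt_scale, freq_scale):
    """Relative dynamic power after scaling C, V and F (leakage ignored)."""
    return cap_scale * volt_scale ** 2 * freq_scale


# Ideal scaling factors listed above: C x0.7, V x0.7, F x(1/0.7).
ideal = dynamic_power_scale(0.7, 0.7, 1 / 0.7)
print(f"Ideal dynamic power scale: {ideal:.2f}")
```

The product is 0.7 * 0.49 / 0.7 = 0.49, i.e. roughly half the power for the same transistor count at a higher frequency.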
Recent Reality - Deep in the Power Wall
Practical scaling factors:
• Linear dimensions, area and active capacitance continue to shrink
– Interconnect impact increases
• Voltage: roughly the same!
• Frequency: a design choice between leakage and speed
– Power = C * V^2 * F + Leakage: roughly the same at 1/0.7 higher frequency
– Power = C * V^2 * F + Leakage: 0.7X power at the same frequency and ½ area
• Same transistor count: 0.7-1X power, higher power density
• Same area: 1.5-2X power
The power wall: Moore’s law will continue delivering transistor density, but with tough design and architectural choices:
• Higher power density to enable the frequency speedup
• Harder to cool
• Any architectural additions come at the cost of higher power
• Cdyn/mm^2 increases with process shrink and architectural improvements
• Process energy efficiency break-even: 2X transistors, 1.5X perf.
*Reference: “MultiAmdahl: How Should I Divide My Heterogeneous Chip?”, T. Zidenberg et al.
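Worked example (a sketch under stated assumptions, not from the original slides): with voltage held constant, the same dynamic power term gives the practical factors quoted above. Leakage and interconnect effects are ignored, the 0.5 area factor is taken from the theoretical scaling list, and all names are illustrative.

```python
def relative_dynamic_power(cap_scale, volt_scale, freq_scale):
    """Relative dynamic power C * V^2 * F after a process shrink (leakage ignored)."""
    return cap_scale * volt_scale ** 2 * freq_scale


AREA_SCALE = 0.5  # assumption: the same transistor count fits in half the area

# Voltage no longer scales (x1.0); capacitance still shrinks by 0.7.
same_freq = relative_dynamic_power(0.7, 1.0, 1.0)
higher_freq = relative_dynamic_power(0.7, 1.0, 1 / 0.7)

# Power density is power divided by area.
density_same_freq = same_freq / AREA_SCALE
density_higher_freq = higher_freq / AREA_SCALE

print(f"Same transistor count: {same_freq:.1f}X to {higher_freq:.1f}X power")
print(f"Power density: {density_same_freq:.1f}X to {density_higher_freq:.1f}X")
```

This simple model yields 0.7X to 1X power per transistor count and about 1.4X to 2X power density, in the same range as the 1.5-2X same-area figure above; the small difference at the low end is outside what this sketch models.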
Energy Density Over Time
[Figure: normalized energy density (0.5 to 2.0) vs. technology node (250, 180, 130, 90, 65, 45, 32, 22)]
More than Moore’s Law
• Moore’s law delivers transistor density
• No longer delivers energy efficiency
• At the same area, power and energy consumption increase
• Power and energy efficiency is back in engineering hands
Historical power trends
[Figure: Cdyn [nF] across processor generations (486, Pentium™, P6, Centrino™, Core™): ~ equal Cdyn at smaller area]
• P ~ Cdyn * V^2 * f
• Cdyn trending ~ fixed for a given architecture while core area shrinks
• Dynamic range of power increases (10X on recent products)
Power Management Fundamentals
Maximize user experience under multiple constraints
• User experience (may have different preferences):
– Throughput performance
– Responsiveness - burst performance
– Ergonomics (acoustic noise, skin temp)
– Battery life / energy consumption: on and standby
• Optimizing around constraints to meet user preferences
– Silicon capabilities
– System thermo-mechanical capabilities – short and long
– Power delivery capabilities – from the wall to the transistor
– Workload and usage
– Workload dynamic range
Physical constraints
thermo-mechanical
TDP – Thermal design power
Traditional design approach – worst case design
Max realistic power at steady state for a long period of time
[Figure: ARD application power (5 s peak), on a 25 to 39 axis, for a sorted range of applications and application mixes (from Premiere Pro CS3 combined with games such as UT2004, FarCry and Call of Duty, through 3DMark, PCMark, PowerDirector 7, Prime95, Crysis and Zip workloads, down to Idle), compared against TDP power]
Recent years change
• The rules of the game changed
– Focus on User Experience
• New innovative Form Factors
– High computation low power devices
– Skin temperature sensitive
– Impose changes on system engineering
• New usage models emerge at the data center
– Interactive web services: Google, Facebook *
*Source: “Online data intensive services”, D. Meisner et al.
New Concept: Thermal Capacitance
• Classic model: steady-state thermal resistance
– Design guide for steady state
• New model: steady-state thermal resistance (GPU and CPU sharing) AND dynamic thermal capacitance
– More realistic response to power changes
• PCU manages energy budgets over multiple time constants
[Figure: temperature vs. time for CPU and GPU, comparing the classic model response with the more realistic response to power changes]
Example: Cp_Al ~ 0.9 J/(g*K), 100 g heat sink @ 35 W, 100 s
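Worked example (a sketch, not from the original slides): the numbers above can be turned into a temperature rise with ΔT = E / (m * Cp). This assumes all of the 35 W goes into the heat sink and nothing is dissipated during the 100 s, so it is an upper bound; variable names are illustrative.

```python
# Values taken from the example above.
SPECIFIC_HEAT_AL = 0.9   # J/(g*K), aluminium
MASS_G = 100.0           # heat sink mass in grams
POWER_W = 35.0           # applied power in watts
DURATION_S = 100.0       # duration in seconds

# Assumption: purely capacitive behaviour, no heat leaves the heat sink.
energy_j = POWER_W * DURATION_S                 # energy absorbed, in joules
heat_capacity = MASS_G * SPECIFIC_HEAT_AL       # J/K for the whole heat sink
delta_t = energy_j / heat_capacity              # temperature rise in kelvin

print(f"Energy absorbed: {energy_j:.0f} J")
print(f"Temperature rise: {delta_t:.1f} K")
```

The heat sink absorbs 3500 J and warms by about 39 K, which is why a system can run above its steady-state power for tens of seconds before temperature limits are reached.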
Tablet Thermal example
[Figure: allowed duration (log scale, 10 s to 1000 s) vs. power (1 W to 6 W) for a tablet, bounded by the Tskin=40 system temp limit and the Tj=90 junction limit. Regions: sustained operation (traditional “TDP”), Turbo, Tj (junction) limited operation region, max power limit. Data labels: 7200 s, 1906 s, 355 s, 210 s, ~100 s, 98 s, ~50 s, 14.4 s, 6.4 s]
Mapping the Usages Of Interest
[Figure: usages mapped on power/performance vs. time axes, with “VIRUS” and TDP levels marked. Durations: short (hitting PD and frequency constraints) and long (hitting system power constraints, Tskin). Goals: max performance within system constraints, or meet QoS at minimum energy (BL). Usages shown: idle, AOAC, video playback, MP3 playback, casual game, web surfing, video encode, create PDF, photo editing, file compression, math apps, heavy games]
What is CPU “Turbo”
• P1 is guaranteed frequency
– Wide dynamic power range
• P0 is max possible frequency
– P1 to P0 range is fully H/W controlled
• P1-P0 has significant frequency range (GHz)
– Single thread performance
– Light load performance
• Various possible policies and user preferences
• Pn is the energy efficient point
– Lower than Pn is controlled by T-state
[Figure: voltage and frequency ladder. Labels: P0 1C, P0 2/3/4C (“Turbo”, H/W control); P1 and Pn (OS visible states, OS control); LFM; T-state & throttle]
Intel® Turbo Boost Technology
• Turbo enabled product specifications
[Table: P1 and P0 for the CPU, P1 and P0 for PG, and TDP (total package sustained power)]
Source: http://www.intel.com/Assets/PDF/datasheet/324692.pdf
Power Telemetry
• Power management is based on measurements
• Intel® SoCs implement a power meter
– Used for power management algorithms
– Architecturally exposed to software and system
– For the use of S/W or system embedded controller
Average accuracy: 0.9%; STDEV: 0.6%
[Figure: predicted vs. actual power for CPU, GPU and package (y axis 0 to 45, x axis 0 to 250)]
Intel® Turbo Boost Technology 2.0
[Figure: power vs. time, showing sleep or low power, then C0 (Turbo) above the “TDP” level, then settling at “TDP”]
• Build up thermal budget during idle periods
• After idle periods, the system accumulates “energy budget” and can accommodate high power/performance for a few seconds
• Use accumulated energy budget to enhance user experience
• In steady-state conditions the power stabilizes on TDP
Energy Efficient P-State - optimizing MIPS / Watt
• Voltage scaling is not energy efficient
– Used to get raw performance
– Some applications are less energy efficient than others
– May still be more efficient than bringing up another system
• Not all workloads gain performance from frequency
– For example, many memory accesses mean poor scalability
– “Wait slowly” to accumulate energy headroom
• Continuously generate a “scalability” metric
– Drop frequency (less turbo) if scalability is low
– Save energy OR get more performance at the same energy
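Illustrative sketch (not from the original slides): one simple way to think about a "scalability" metric is the relative performance gain divided by the relative frequency gain. The run-time model below (a CPU-bound part that shrinks with frequency plus a fixed memory-bound part), the function names, and the 0.5 threshold are all assumptions for illustration, not the metric used by the hardware.

```python
def run_time(freq, cpu_time_at_base, mem_time, base_freq=1.0):
    """Assumed model: only the CPU-bound part speeds up with frequency."""
    return cpu_time_at_base * base_freq / freq + mem_time


def scalability(freq_lo, freq_hi, cpu_time_at_base, mem_time):
    """Relative performance gain per relative frequency gain (1.0 = perfect)."""
    perf_lo = 1.0 / run_time(freq_lo, cpu_time_at_base, mem_time)
    perf_hi = 1.0 / run_time(freq_hi, cpu_time_at_base, mem_time)
    return (perf_hi / perf_lo - 1.0) / (freq_hi / freq_lo - 1.0)


def choose_frequency(freq_lo, freq_hi, metric, threshold=0.5):
    """Hypothetical policy: spend turbo only when the workload scales well."""
    return freq_hi if metric >= threshold else freq_lo


compute_bound = scalability(1.0, 1.2, cpu_time_at_base=1.0, mem_time=0.0)
memory_bound = scalability(1.0, 1.2, cpu_time_at_base=0.2, mem_time=0.8)

print(f"Compute-bound scalability: {compute_bound:.2f}")
print(f"Memory-bound scalability:  {memory_bound:.2f}")
print("Memory-bound choice:", choose_frequency(1.0, 1.2, memory_bound))
```

Under this model the compute-bound workload scales almost perfectly, while the memory-bound one gains little from the extra frequency, so the policy keeps it at the lower frequency and saves the energy headroom.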
Intel® Turbo Boost Technology 2.0
[Figure: power vs. time, showing sleep or low power, “Turbo” above the “TDP” level, and the max current limit; thermal budget builds up during idle periods]
Turbo introduces a very high power dynamic range, which stresses the power delivery network
* Source: “Multiple Clock and Voltage Domains for Chip Multi Processors”, Rotem et al.
Power delivery constraints
Georg Simon Ohm, 16 March 1789 – 6 July 1854
Power delivery constraints
• Power Delivery limits
– Wall To System
– System to package
– Inside Package
• DC sustained power/current
• Instantaneous and transients
Mobile platform PDN
• Power supply and battery current feeding the total platform are also limited
[Figure: mobile platform PDN block diagram. Labels: brick/silver box, battery, platform VR, CPU/GT VR, SVID, SoC (IA core, GT, PCU)]
• Space is limited in tablets and SFF
What PDN parameters limit us
• VR max Icc (total package)
– “TDP”: need to sustain forever (thermal limit)
– “Virus”: long time and O.K. to thermal throttle
– Instantaneous: should be treated as “never exceed”
• I*R drop on DC and AC load line
• Load release overshoot
• FET max current and magnetic saturation – technology dependent
• Over current protection
• Battery and brick max Icc (total platform!)
– Battery electrical max current and overheat
– Brick over current protection
Slide annotations: O.K. to apply control; by design or a priori; fast control or a priori
PDN controls in action
[Figure: P-state and power vs. time (C0, P0), showing actual instantaneous power against the hard limit (max Icc), power limit 2, power limit 1, and the PL1 time exponential average]
• Voltage regulator reported capability: CURRENT_CONFIG_CONTROL MSR
• TURBO_POWER_LIMIT control MSR
– Enables and locks
– Package power limit 2 – instantaneous
– Package power limit 1 time interval
– Package power limit 1 clamp bit
– Package power limit 1 – power
• Also:
– Individual power controls available
– Explicit frequency control
• User / OEM / OS preference
– Allow programmability
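Illustrative sketch (not from the original slides, and not the actual hardware algorithm): the interplay of power limit 1, power limit 2 and the PL1 exponential average can be mimicked with a small simulation. The controller rule, the function names, and all numeric values below are assumptions chosen for illustration; no MSR is read or written.

```python
import math


def allowed_power(avg_power, pl1, pl2):
    """Assumed rule: allow bursts up to PL2 while the running average is below PL1."""
    return pl2 if avg_power < pl1 else pl1


def simulate(demand, pl1, pl2, tau, dt=1.0, start_avg=0.0):
    """Return (power, running average) per step for a list of power demands."""
    alpha = 1.0 - math.exp(-dt / tau)  # exponential averaging weight
    avg = start_avg
    trace = []
    for wanted in demand:
        power = min(wanted, allowed_power(avg, pl1, pl2))
        avg += alpha * (power - avg)   # update the exponential moving average
        trace.append((power, avg))
    return trace


# Hypothetical values: 15 W PL1, 25 W PL2, 10 s time constant, 5 W idle history.
trace = simulate([25.0] * 60, pl1=15.0, pl2=25.0, tau=10.0, start_avg=5.0)
burst_steps = sum(1 for power, _ in trace if power > 15.0)

print(f"Burst at PL2 for {burst_steps} steps, then {trace[-1][0]:.0f} W sustained")
```

After an idle period the running average is low, so the sketch runs at PL2 for a few seconds and then settles at PL1, matching the qualitative behaviour shown in the figure.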
Platform Energy
Energy Aware Race to Halt
Source: “Energy Aware Race to Halt: A Down to EARtH Approach for Platform Energy Management”, E. Rotem et al.
Combined platform energy
• Total platform energy may have an optimal freq.
– Possibly within the operation range
• Can we calculate this point at run time?
– Minimize energy to complete a task
[Figure: energy vs. performance. E_CPU ~ f^2 and E_system ~ 1/f combine into a total with a global minimum Pe; a QoS level is marked; policies shown: EE algorithm and race to halt]
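Worked example (a sketch, not from the original slides): if CPU energy grows as f^2 and system energy falls as 1/f, the total E(f) = a*f^2 + b/f has its minimum where dE/df = 2*a*f - b/f^2 = 0, i.e. f_opt = (b / (2*a))^(1/3). The coefficients `a` and `b` and the function names are illustrative assumptions, not values from the EARtH paper.

```python
def total_energy(freq, a, b):
    """Assumed model: CPU energy a*f^2 plus system energy b/f."""
    return a * freq ** 2 + b / freq


def optimal_frequency(a, b):
    """Closed-form minimum of a*f^2 + b/f, from setting the derivative to zero."""
    return (b / (2.0 * a)) ** (1.0 / 3.0)


# Hypothetical coefficients in relative units.
A_CPU, B_SYSTEM = 1.0, 2.0
f_opt = optimal_frequency(A_CPU, B_SYSTEM)

for freq in (0.6, 0.8, f_opt, 1.2, 1.4):
    print(f"f = {freq:.2f}  E = {total_energy(freq, A_CPU, B_SYSTEM):.3f}")
```

Raising `B_SYSTEM` (more platform power) moves the optimum toward higher frequency, the race-to-halt direction, while raising `A_CPU` moves it toward lower frequency.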
The analytical model – run time
Categorizing:
• Fixed platform power
– Continuous: components idle power
– While executing: component active idle
• Fixed platform energy: data transfer cost
• Freq. dependent energy: CPU DVFS
[Figure: power vs. time over tCPU and tMEM, showing platform constant power, platform run time power, CPU active power and CPU idle]
The analytical model – Some algebra
• CPR is a parameter that describes system power characteristics relative to workload power characteristics
• SCA is a characteristic that represents the Amdahl behavior of a workload, i.e. how well performance scales with frequency
Exploring the energy function
• Relative platform energy as a function of freq.
– SCA = 1; different CPR values
– CPU power >> system power favors LFM, and vice versa
[Figure: Platform energy vs. frequency. Relative platform energy (0.60 to 1.80) vs. relative frequency (1.0 to 2.2) for CPR values 0.33, 0.26, 0.20, 0.15, 0.11, 0.08 and 0.06, with the optimal Fc marked]
Aligns well with intuition
Exploring the energy function
• Relative platform energy as a function of freq.
– Fixed CPR; different SCA values
– The lower the SCA, the lower the optimal frequency
[Figure: Platform energy vs. frequency. Relative platform energy (0.60 to 1.40) vs. relative frequency (1.00 to 2.20) for SCA values 1, 0.71, 0.50, 0.35, 0.25 and 0.18, with the optimal Fc marked]
Aligns well with intuition
Heterogeneous computing
Combining a mix of “big” and “small” cores
• High performance high power cores
– For demanding compute tasks
– Excel on single threaded workloads
• Small, energy efficient cores
– Excel on low QoS workloads
– Many of them perform multi threaded workloads efficiently
EARtH: hetero platform energy
• In general, the small core is more energy efficient
• Platform energy may be different
– For some CPR/SCA values, the big core can be more energy efficient
[Figure: Total energy vs. frequency, hybrid core. Relative total energy (0.88 to 1.04) vs. relative frequency (0.6 to 1.6) for SCA values 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90 and 1, for the “BIG” core and the “Small” core, with the optimal Fc marked]
Homogeneous core policy results
• Big core usually gains from low frequency
• Small core usually gains from RtH
• But - not always
– Cannot predict a-priori which works better
[Figure: Standard voltage CPU. Total platform energy savings (0% to 35%) for EARtH over LFM, EARtH over RtH, and EARtH over random]
[Figure: Low voltage CPU. Total platform energy savings (0% to 60%) for EARtH over LFM, EARtH over RtH, and EARtH over random]
Benefit from Hetero
• Energy savings of an asymmetric core compared to a CPU consisting of big cores only, assuming no QoS requirement
[Figure: Asymmetric core energy savings. Energy savings (0% to 35%)]
EARtH benefits on Hetero-CPU
• EARtH policy compared to fixed frequency policy
[Figure: Asymmetric core energy savings. Energy saving (0% to 50%) across workloads (sorted), for S-LFM, S-RtH, B-LFM and B-RtH]
Sensitivity to platform power
• Sensitivity of the energy savings to platform power
• The higher the platform power is, the more scenarios in which the big core benefits
[Figure: Type of core that achieves the lowest energy. Share of workloads (0% to 100%) vs. platform power (35%, 50%, 60%, 70%, 80%, 85%, 90%); series: small core is better, big core is better]
Platform energy and energy “load line”
Energy proportionality
• Data centers are rarely fully utilized
• Energy cost of the data center is significant
• Server platforms attempt to achieve a power load line: a proportional dependency between utilization and power consumption
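Illustrative sketch (not from the original slides): a linear power model makes the idea of a power load line concrete by comparing a server with non-zero idle power against an ideally proportional one. The linear model, the function names, and the 100 W / 200 W figures are assumptions for illustration only.

```python
def server_power(utilization, idle_power, peak_power):
    """Assumed linear model: idle power plus a utilization-dependent part."""
    return idle_power + (peak_power - idle_power) * utilization


def proportional_power(utilization, peak_power):
    """Ideal energy-proportional server: power tracks utilization exactly."""
    return peak_power * utilization


IDLE_W, PEAK_W = 100.0, 200.0  # hypothetical server

for utilization in (0.1, 0.3, 0.5, 1.0):
    actual = server_power(utilization, IDLE_W, PEAK_W)
    ideal = proportional_power(utilization, PEAK_W)
    print(f"utilization {utilization:.0%}: actual {actual:.0f} W, ideal {ideal:.0f} W")
```

In this example the server draws more than twice the ideal power at 30% utilization, which is why low idle power matters for data centers that are rarely fully utilized.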
Energy proportionality
[Figure: energy proportionality chart; label: SSJ Operations]
Rack power delivery optimization
• Applying control vs worst case design
Summary
• Moore’s law drives transistor density
– Does not deliver energy efficiency
– Drives the need for aggressive energy efficient design, architecture and management
• Compute system goodness has many aspects
– Instantaneous and sustained performance
– Managed within multiple physical constraints
Thank You