Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
µDPM: Dynamic Power Management for the Microsecond Era
Chih-Hsun [email protected]
Laxmi N. Bhuyan [email protected]
Daniel [email protected]
HPCA 2019
ns msµs events
Killer
Microsecond
2
Computer systems efficiently support . . .
Masked by microarchitectural
techniques
Masked by OS-level
techniques
HPCA 2019
The Killer Microseconds
3
“System designers can no longer ignore efficient support for microsecond-scale I/O ... Novel microsecond-optimized system stacks are needed”[1]
Killer Microseconds
[1] Luiz Barroso, Mike Marty, David Patterson, and Parthasarathy Ranganathan. Attack of the killer microseconds. Commun. ACM 60, 4 (March 2017), 48-54.
HPCA 2019
Computer systems cannot efficiently support Microsecond-scale service time
4
Request Response
~milliseconds
Traditional Monolithic Services
HPCA 2019
Computer systems cannot efficiently support Microsecond-scale service time
5
Request Response
Emerging Microservices
microseconds
HPCA 2019
Microservice Example
6
Source: Adrian Cockcroft, “Monitoring Microservices and Containers: A Challenge”
~700 microservices
HPCA 2019 7
Implications of Killer Microsecond service time on
Dynamic Power Management?
HPCA 2019
Opportunity for DPM – Latency Slack› Slow down
request processing(DVFS)
› Delay request processing(Sleep)
8
0%
20%
40%
60%
80%
100%
120%
10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Tail
Late
ncy
(% o
f SLA
)
Load (% of peak load)
tail latency SLA
Latency Slack
HPCA 2019
Dynamic Power Management Overview› DVFS (Rubik [MICRO’15], Pegasus [ISCA’14])
› Rubik adjusts f per request
› DVFS + Sleep › SleepScale [ISCA’14] finds optimal frequency & C-state depth for 60s epochs
9
R1R0 R2 R3 R4
R4R0 R2 R3R1Epoch 0 Epoch 1
t
t
f
f
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
R1 arrives
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
R1 arrives
Target Tail Latency
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
R1 arrives
Wake
Target Tail Latency
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
R1 arrives
Wake
Target Tail Latency
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
Wake
R2 arrives
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
Wake
R2 arrives
Target Tail Latency
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
Wake
R2 arrives
Target Tail Latency
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
sleep
time
Wake
R3 arrives
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
Target Tail Latency
sleep
time
Wake
R3 arrives
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
Wake
Target Tail Latency
sleep
time
R3 arrives
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
Wake
Target Tail Latency
sleep
time
R3 arrives
HPCA 2019
Dynamic Power Management Overview› Sleep states (PowerNap[ASPLOS’09], Dreamweaver[ASPLOS’12],
DynSleep[ISLPED’16], CARB[CAL’17])
10
Wakesleep
time
HPCA 2019
DPM ineffective w/ microsecond service time
11P
ower
(W)
28
30.75
33.5
36.25
39
Avg service time(us)10 100 1000
Baseline RubikSleepScale DynSleep
DVFSDeep SleepDVFS+Sleep
Baseline
HPCA 2019
DPM ineffective w/ microsecond service time
11P
ower
(W)
28
30.75
33.5
36.25
39
Avg service time(us)10 100 1000
Baseline RubikSleepScale DynSleep
DPM Ineffective
DVFSDeep SleepDVFS+Sleep
Baseline
HPCA 2019
DPM ineffective w/ microsecond service time
› >250µs: DVFS effective at slowing down request processing
11P
ower
(W)
28
30.75
33.5
36.25
39
Avg service time(us)10 100 1000
Baseline RubikSleepScale DynSleep
DPM Ineffective
DVFSDeep SleepDVFS+Sleep
Baseline
HPCA 2019
DPM ineffective w/ microsecond service time
› >250µs: DVFS effective at slowing down request processing
› <250µs: DPM becomes ineffective
11P
ower
(W)
28
30.75
33.5
36.25
39
Avg service time(us)10 100 1000
Baseline RubikSleepScale DynSleep
DPM Ineffective
DVFSDeep SleepDVFS+Sleep
Baseline
HPCA 2019
DPM ineffective w/ microsecond service time
› >250µs: DVFS effective at slowing down request processing
› <250µs: DPM becomes ineffective
› Surprisingly, sleep-based policiesoutperform DVFS-based policies
11P
ower
(W)
28
30.75
33.5
36.25
39
Avg service time(us)10 100 1000
Baseline RubikSleepScale DynSleep
DPM Ineffective
DVFSDeep SleepDVFS+Sleep
Baseline
HPCA 2019
Fragmented idle periods à Lost Opportunities
12
50% utilization
Longer Service Time
Shorter Service Time
50% utilization
HPCA 2019
Fragmented idle periods à Lost Opportunities
› Short service times fragment idle periods
13
DVFS (500µs)Sleep (500µs)Sleep (200µs)
DVFS (200µs)
HPCA 2019
Fragmented idle periods à Lost Opportunities
› Short service times fragment idle periods
› Sleep states / request delaying canconsolidateidle periods
13
DVFS (500µs)Sleep (500µs)Sleep (200µs)
DVFS (200µs)
HPCA 2019
Significant transition overheads and idle power
14
Baseline
Rubik
Sleepscale
DynSleep
Optimal
Normalized Energy
0.6 0.68 0.76 0.84 0.92 1
Busy Idle C-state tran.VFS tran.
Idleness and transition overhead still account for up to ~25% of energy
HPCA 2019
DPM inefficiencies
15
Tail Service Time = 78µs
C6 Residency Time = 300µs
R1C0C3C6
R1
Arr
ival
Tail Latency Target = 800µs
Wasted Energy
Wasted Energy
DVFS limited in closing Latency Gap tR0
* SPECjbb timing
HPCA 2019
DPM inefficiencies
15
Tail Service Time = 78µs
C6 Residency Time = 300µs
R1C0C3C6
R1
Arr
ival
Tail Latency Target = 800µs
Wasted Energy
Wasted Energy
Solution:Aggressive Deep Sleep
Solution:Request Delaying
DVFS limited in closing Latency Gap tR0
Solution:Coordinate DVFS
* SPECjbb timing
HPCA 2019
Key Insight
16P
ower
(W)
28
30.75
33.5
36.25
39
Avg service time(us)10 100 1000
Baseline RubikSleepScale DynSleepµDPM
Careful coordination of DVFS, Sleep state, and request delaying is the key to effective DPM with microsecond service times
Deep SleepDVFS+SleepDVFSBaseline
HPCA 2019
µDPM› Aggressively Deep Sleep› Delay and slow down request processing to finish just-in-time,
even under microsecond request service times› Carefully coordinating DVFS, Sleep, and request delaying
17
Tail Service Time = 78µs
R1 t
Tail Latency Target = 800µs
Solution:Aggressive Deep Sleep
Solution:Request Delaying
R1
Arr
ival
Residency Time = 300µs
C0C3C6
R0
Solution:Coordinate DVFS
HPCA 2019
Can latency-critical workloads utilize deep sleep states?
18
Memcached
SPECjbb
Xapian
Masstree
Start
Tail Service Time
(95th percentile)
33µs
78µs
250µs
1200µs
HPCA 2019
Can latency-critical workloads utilize deep sleep states?
18
Memcached
SPECjbb
Xapian
Masstree
Start
Target Tail Latency(95
th percentile)150µs
800µs
1100µs
2100µs
Tail Service Time
(95th percentile)
33µs
78µs
250µs
1200µs
HPCA 2019
Can latency-critical workloads utilize deep sleep states?
18
Memcached
SPECjbb
Xapian
Masstree
Start
Target Tail Latency(95
th percentile)150µs
800µs
1100µs
2100µs
Tail Service Time
(95th percentile)
33µs
78µs
250µs
1200µs
Opportunity
HPCA 2019
Aggressive deep sleep and request delaying
› Wakeup after residency time
› Wakeup before residency time if needed to meet tail latency
19
R0Req
R0 Arrival
t
C0C3C6
Residency Time = 300µs
HPCA 2019
Estimating Service time and Latency› Estimate tail service time
› Statistical performance model[2]
› Online periodic resampling (100ms)
20
R0ReqR0 Arrival
t
C0C3C6
Si
[2] Kasture, Harshad, Davide B. Bartolini, Nathan Beckmann, and Daniel Sanchez. "Rubik: Fast analytical power management for latency-critical systems." MICRO 2015.
HPCA 2019
Estimating Service time and Latency› Estimate tail service time
› Statistical performance model[2]
› Online periodic resampling (100ms)
› Estimate Request Tail Latency› L = W + Twake + Tdvfs + Si / f
20
R0ReqR0 Arrival
t
C0C3C6
Si
W Si/fTwake + Tdvfs
[2] Kasture, Harshad, Davide B. Bartolini, Nathan Beckmann, and Daniel Sanchez. "Rubik: Fast analytical power management for latency-critical systems." MICRO 2015.
HPCA 2019
Detecting critical request arrival
› Detecting critical request arrival › If inter-arrival time between 2 consecutive requests are shorter than the
tail service time
21
R0Req
R0
t
C0C3C6
HPCA 2019
Detecting critical request arrival
› Detecting critical request arrival › If inter-arrival time between 2 consecutive requests are shorter than the
tail service time
21
R0Req
R0
tR1*
R1 Arrival – Critical!
C0C3C6
HPCA 2019
Detecting critical request arrival
› Detecting critical request arrival › If inter-arrival time between 2 consecutive requests are shorter than the
tail service time
21
R0Req
R0
tR1*
R1 Arrival – Critical!QoS
Violation!
C0C3C6
HPCA 2019
Detecting critical request arrival
› Detecting critical request arrival › If inter-arrival time between 2 consecutive requests are shorter than the
tail service time
21
R0Req
R0
tR1*
R1 Arrival – Critical!QoS
Violation!
C0C3C6
R1 critical if tR1 – tR0 ≤ Sf
tail
HPCA 2019
Coordinate frequency, sleep, delay
22
R0Req
R0
tR1*
R1QoS
Violation!
C0C3C6
R1 critical if tR1 – tR0 ≤ Sf
tail
HPCA 2019
Coordinate frequency, sleep, delay
22
Req
R0
t
R1
C0C3C6
R1 critical if tR1 – tR0 ≤ Sf
tail
R1R0
f’ = Stail/(TRiTargetCompletion-TRi-1Completion )
HPCA 2019
Coordinate frequency, sleep, delay
22
Req
R0
t
R1
C0C3C6
R1 critical if tR1 – tR0 ≤ Sf
tail
R1R0Increase frequency on wakeup
Can sleep longerdue to higher freq.
f’ = Stail/(TRiTargetCompletion-TRi-1Completion )
HPCA 2019
Coordinate frequency, sleep, delay
› Reset to lowest frequency on wake up › Only increase frequency on reconfiguration
22
Req
R0
t
R1
C0C3C6
R1 critical if tR1 – tR0 ≤ Sf
tail
R1R0Increase frequency on wakeup
Can sleep longerdue to higher freq.
f’ = Stail/(TRiTargetCompletion-TRi-1Completion )
HPCA 2019
Minimize state transition overheads› Calculate criticality score
› Send to core that is least critical
› Minimize state transitions
23
See paper for details!
HPCA 2019
Evaluation› In-House Simulator (similar to BigHouse) › Empirical Power Model
› 10µs DVFS transition, 89µs sleep transition time (double to account for cache flushing)
› Add 25µs to first request service time after idle period for cold miss penalty › Baseline – Linux menu idle governor and intel_pstate driver › Workloads
24
HPCA 2019
Energy Savings
25
Ene
rgy
Sav
ing
(%)
0
10
20
30
40
Load0.1 0.2 0.3 0.4 0.5
Memcached
Ene
rgy
Sav
ing
(%)
0
10
20
30
40
Load
0.1 0.2 0.3 0.4 0.5 0.6 0.7
SPECjbb
Ene
rgy
Sav
ing
(%)
0
5
10
15
20
Load0.1 0.2 0.3 0.4 0.5 0.6
Masstree
Ene
rgy
Sav
ing
(%)
0
2.75
5.5
8.25
11
Load0.1 0.2 0.3 0.4 0.5 0.6
Xapian
Rubik DynSleep Sleepscale µDPM Optimal
HPCA 2019
Energy Savings
25
Ene
rgy
Sav
ing
(%)
0
10
20
30
40
Load0.1 0.2 0.3 0.4 0.5
Memcached
Ene
rgy
Sav
ing
(%)
0
10
20
30
40
Load
0.1 0.2 0.3 0.4 0.5 0.6 0.7
SPECjbb
Ene
rgy
Sav
ing
(%)
0
5
10
15
20
Load0.1 0.2 0.3 0.4 0.5 0.6
Masstree
Ene
rgy
Sav
ing
(%)
0
2.75
5.5
8.25
11
Load0.1 0.2 0.3 0.4 0.5 0.6
Xapian
Rubik DynSleep Sleepscale µDPM Optimal
µDPM typically within 2-3% of OptimalµDPM saves ~2X vs others
HPCA 2019
State transition overhead reduction
26
Baseline
Rubik
Sleepscale
DynSleep
Optimal
µDPM
Normalized Energy
0.6 0.68 0.76 0.84 0.92 1
Busy IdleC-state tran. VFS tran.
HPCA 2019
Under Varying Load
27
HPCA 2019
Sensitivity to target tail latency
28
Ene
rgy
savi
ng (%
)
0
3
6
9
12
Latency contraint (µs)
600 1133 1667 2200
Ene
rgy
savi
ng (%
)0
7.5
15
22.5
30
Latency contraint (µs)80 135 190 245 300
Ene
rgy
savi
ng (%
)
0
2.25
4.5
6.75
9
Latency contraint (µs)
1200 1950 2700 3450 4200
Ene
rgy
savi
ng (%
)
0
7.5
15
22.5
30
Latency contraint (µs)400 700 1000 1300 1600
Memcached SPECjbb
Masstree Xapian
HPCA 2019
Sensitivity to Transition time
5
10
15
20
25
30
0 100 200 300
Ene
rgyr
Sav
ing
(%)
sleep transition time(µsec)
µDPM µDPM w/ criticality-awareness
29
0
10
20
30
0 20 40 60 80 100
Ene
rgy
Sav
ing
(%)
VFS transition time (µsec)
Rubik µDPM µDPM w/ criticality-awareness
HPCA 2019
Conclusion› Microsecond service times present challenges for Dynamic
Power Management
› Careful coordination of DVFS, Sleep and Request delaying can achieve savings with µs service times
› µDPM is able to save ~2x energy compared to state-of-the-art techniques
30
µDPM: Dynamic Power Management for the Microsecond Era
Thank you!
Chih-Hsun [email protected]
Laxmi N. Bhuyan [email protected]
Daniel [email protected]