32
Dynamic Thermal Management for Distributed Systems © Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004 1 TACS’04 Dynamic Thermal Management for Distributed Systems Andreas Weissel • Frank Bellosa Department of Computer Science 4 (Operating Systems) University of Erlangen-Nuremberg Martensstr. 1, 91058 Erlangen, Germany {weissel,bellosa}@cs.fau.de

Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Embed Size (px)

Citation preview

Page 1: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

1

TA

CS

’04

Dynamic Thermal Managementfor

Distributed Systems

Andreas Weissel • Frank BellosaDepartment of Computer Science 4 (Operating Systems)

University of Erlangen-Nuremberg

Martensstr. 1, 91058 Erlangen, Germany

{weissel,bellosa}@cs.fau.de

Page 2: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

2

TA

CS

’04

Benefits of Dynamic Thermal Management

■ Cooling servers, server clusters◆ cooling facilities often dimensioned for worst-case temperatures or

overprovisioned

■ Guarantee temperature limits◆ no need for overprovisioning of cooling units

◆ reduced costs (floor space, energy consumption, maintenance, ...)

■ Increased reliability◆ safe operation in case of cooling unit failure

◆ avoid local hot-spots in the server room

➜ Temperature sensors

Page 3: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

3

TA

CS

’04

Drawbacks of Existing Approaches

■ If critical temperature is reached◆ throttle the CPU:

e.g. halt cycles, reduced duty cycle, reduced speed

■ But: neglect of application-, user- or service-specific requirementsdue to missing online information about◆ the originator of a specific hardware activation and

◆ the amount of energy consumed by that activity

➜ Throttling penalizes all tasks

Page 4: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

4

TA

CS

’04

Outline

■ From events to energy◆ event-monitoring counters

◆ on-line estimation of energy consumption

■ From energy to temperature◆ temperature model

■ Energy Containers◆ accounting of energy consumption

◆ task-specific temperature management

■ Infrastructure for temperature management in distributed systems

Page 5: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

5

TA

CS

’04

Approaches to Energy Characterization

■ Reading of thermal diode embedded in modern CPUs◆ low temporal resolution

◆ significant overhead

➜no information about originator of power consumption

Page 6: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

6

TA

CS

’04

Approaches to Energy Characterization

■ Reading of thermal diode embedded in modern CPUs◆ low temporal resolution

◆ significant overhead

➜no information about originator of power consumption

■ Counting CPU cycles◆ time as an indicator for energy consumption

◆ time as an indicator for contribution to temperature level

◆ throttling according to runtime

➜but: wide variation of the active power consumption

Page 7: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

7

TA

CS

’04

Approaches to Energy Characterization

■ P4 (2 GHz) running compute intensive tasks: CPU load of 100%◆ variation between 30–51 W

alu-

add

alu-

add-

bsw

ap-in

cal

u-m

ov-in

c-ad

dm

em-lo

adm

em-s

tore

push

pop

mem

-load

/sto

rem

em-in

c L1 L2ad

ler

ln2

sha

ripem

d160

wra

pfa

ctor

0

10

20

30

40

50

pow

er c

onsu

mpt

ion

[Wat

t]

idle powerconsumption:12.3 W

Page 8: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

8

TA

CS

’04

From Events to Energy: Event-Monitoring Counters

■ Event counters register energy-critical events in the complete system architecture.◆ several events can be counted simultaneously

◆ low algorithmic overhead

◆ high temporal resolution

◆ fast response

■ Energy estimation

➜correlate a processor-internal event to an amount of energy

◆ select several events and use a linear combination of these event counts to compute the energy consumption

������ ����� ����⋅

∑=

Page 9: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

9

TA

CS

’04

From Events to Energy: Methodology

■ Measure the energy consumption of training applications

■ Find the events with the highest correlation to energy consumption

■ Compute weights from linear combination of event counts and real power measurements of the CPU

➜solve linear optimization problem:find the linear combination of these events that produce the minimum estimation error

➜avoid underestimation of energy consumption

� ����� ���� ��������������–⋅

��������������� � ����� ����⋅

∑≤

Page 10: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

10

TA

CS

’04

From Events to Energy: Methodology

■ Set of events and their weights

■ Limitations of the Pentium 4◆ insufficient events for MMX, SSE & floating point instructions

◆ the case for dedicated Energy Monitoring Counters

event weight [nJ]

time stamp counter 6.17

unhalted cycles 7.12

µop queue writes 4.75

retired branches 0.56

mispred branches 340.46

mem retired 1.73

ld miss 1L retired 13.55

Page 11: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

11

TA

CS

’04

Outline

■ From events to energy◆ event-monitoring counters

◆ on-line estimation of energy consumption

■ From energy to temperature◆ temperature model

■ Energy Containers◆ accounting of energy consumption

◆ task-specific temperature management

■ Infrastructure for temperature management in distributed systems

Page 12: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

12

TA

CS

’04

From Energy to Temperature: Thermal Model

■ CPU and heat sink treated as a black box with energy in- and output

◆ energy input: electrical energy being consumed

◆ energy output: heat radiation and convection

CPUEnergy(#Events)

Thermal Capacity

Heat SinkThermal Resistance

Energy(∆Temp)

Page 13: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

13

TA

CS

’04

From Energy to Temperature: Thermal Model

■ Energy input: energy consumed by the processor

CPUEnergy(#Events)

Thermal Capacity

Heat SinkThermal Resistance

Energy(∆Temp)

�� ����=

� �������������� ���

Page 14: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

14

TA

CS

’04

From Energy to Temperature: Thermal Model

■ Energy output: primarily due to convection

CPUEnergy(#Events)

Thermal Capacity

Heat SinkThermal Resistance

Energy(∆Temp)

�� �� � ��–( )�–=

� � ���� �������

��

�� ����=

� �������������� ���

Page 15: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

15

TA

CS

’04

From Energy to Temperature: Thermal Model

■ Altogether:

◆ energy estimator ➔ power consumption P

◆ time stamp counter ➔ time interval dt

◆ the constants , and have to be determined

�� ��� �� � ��–( )–[ ]�=

�� �� ��

Page 16: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

16

TA

CS

’04

From Energy to Temperature: Thermal Model

■ Altogether:

◆ energy estimator ➔ power consumption P

◆ time stamp counter ➔ time interval dt

◆ the constants , and have to be determined

■ Solving this differential equation yields

�� ��� �� � ��–( )–[ ]�=

�� �� ��

� ( )��–��

------- ��– ����⋅ ���

����---- � ��+⋅+=

dynamic part static part

Page 17: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

17

TA

CS

’04

Thermal Model: Dynamic Part

■ Measurements of the processor temperature◆ on a sudden constant power consumption and

◆ a sudden power reduction to HLT power.

➜fit an exponential function to the data: coefficient = ��

0 1000 2000 3000 4000 5000time [s]

30

35

40

45

50

55

60

tem

pera

ture

[° C

elsi

us]

IdleActiveIdle(50 W) (12 W)

Pentium 4@2GHz with active heat sink

Page 18: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

18

TA

CS

’04

Thermal Model: Static Part

■ Static temperatures and power consumption of the test programs

10 20 30 40 50power consumption [Watt]

30

35

40

45

50

55

60st

atic

tem

pera

ture

[Cel

sius

]

Page 19: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

19

TA

CS

’04

Thermal Model: Static Part

■ Linear function to determine and �� ��

10 20 30 40 50power consumption [Watt]

30

35

40

45

50

55

60st

atic

tem

pera

ture

[Cel

sius

]

Page 20: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

20

TA

CS

’04

Thermal Model: Implementation

■ Linux 2.6 kernel

■ Periodically compute a temperature estimation from the estimated energy consumption

■ Deviation of a few degrees celsius over 24 hours◆ or if ambient temperature changes

■ Re-calibration with measured temperature every few minutes

Page 21: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

21

TA

CS

’04

Thermal Model: Accuracy

0 2000 4000 6000 8000 10000 12000time [s]

35

40

45

50

55

60te

mpe

ratu

re (

cels

ius)

estimated temperaturemeasured temperature

Page 22: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

22

TA

CS

’04

Outline

■ From events to energy◆ event-monitoring counters

◆ on-line estimation of energy consumption

■ From energy to temperature◆ temperature model

■ Energy Containers◆accounting of energy consumption◆ task-specific temperature management

■ Infrastructure for temperature management in distributed systems

Page 23: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

23

TA

CS

’04

Properties of Energy Accounting

■ Accounting to different tasks/activities/clients◆ example: web server serving requests from different client classes

◆ e.g. Internet/Intranet, different service contracts

■ “Resource principal” can change dynamically

■ Client/server relationships between processes◆ account energy consumption of server to client

Page 24: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

24

TA

CS

’04

Energy Containers

■ Resource Containers [OSDI ’99] ➔ Energy Containers◆ separation of protection domain and “resource principal”

■ Container Hierarchy◆ root container (whole system)

◆ processes are attached to containers

◆ this association can be changed dynamically (client/server relationship)

➜energy is automatically accounted to the activity responsible for it

■ Energy shares◆ amount of energy available (depending on energy limit)

◆ periodically refreshed

◆ if a container runs out of energy, its processes are stopped

web server

root

Intranet Internet

80% 20%

131.188.34.*

Page 25: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

25

TA

CS

’04

Energy Containers

■ Example:web server working for two clients with different shares

0 100 200 300time (seconds)

0

10

20

30

40

50

60pow

er

consu

mptio

n (

Watt)

web server (total)client1 (80% share)client2 (20% share)

Start throttling

Page 26: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

26

TA

CS

’04

Task-specific Temperature Management

■ Periodically compute an energy limit for the root container(depending on the temperature limit )

■ Dissolve to ➔

■ Energy budgets of all containers are limited according to their shares

■ Tasks are automatically throttled according to their contribution to the current temperature

■ Throttling is implemented by removing tasks from the runqueue

��

�� ��� ��+ � ��–( )[ ]��� ���� �–≤=

� ��

Page 27: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

27

TA

CS

’04

Temperature Management

■ Example: Enforcing a temperature limit of 45º

0 500 1000 1500 2000 2500time [s]

35

40

45

50

55

tem

pera

ture

(ce

lsiu

s)

estimated temperaturemeasured temperature

temperature limit

Page 28: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

28

TA

CS

’04

Outline

■ From events to energy◆ event-monitoring counters

◆ on-line estimation of energy consumption

■ From energy to temperature◆ temperature model

■ Energy containers◆ accounting of energy consumption

◆ task-specific temperature management

■ Infrastructure for temperature management in distributed systems

Page 29: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

29

TA

CS

’04

Energy Containers

■ Distributed energy accounting

◆ Id transmitted with the network packets (IPv6 extension headers)

◆ receiving process attached to the corresponding energy container

◆ temperature and energy are cluster-wide accounted and limited

◆ transparent to applications and unmodified operating systems

id 1

root

Id 2 id 1

root

id 2 id 1

root

id 2

IPv6 IPv6

Page 30: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

30

TA

CS

’04

Energy Containers

■ Energy accounting across machine boundaries◆ requests from two different clients represented by two containers

◆ web server sends requests to factorization server

➜the energy consumption of the server is correctly accounted to the client

0 500 1000

server #1 (web server)

time [s]0 500 10000

10

20

30

40

50

pow

er c

onsu

mpt

ion

[W]

server #2 (factorization)

time [s]1500 1500

container 1 (id 1)root container (total)

container 2 (id 2)

Page 31: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

31

TA

CS

’04

Infrastructure for DTM in Distributed Systems

■ Distributed energy accounting

■ Foundation for policies managing energy and temperature in server clusters◆ account, monitor and limit energy consumption and temperature of

each node

■ Examples◆ set equal energy/temperature limits for all servers

➜cluster-wide uniform temperature and power densities, no hot spots in the server room

◆ use energy/temperature limits to

➜throttle affected servers in case of a cooling unit failure

➜reduce number of active cooling units in case of low utilization

Page 32: Dynamic Thermal Management for Distributed Systemsskadron/tacs/weiss_slides.pdf · Dynamic Thermal Management for Distributed ... Dynamic Thermal Management for Distributed Systems

Dynamic Thermal Management for Distributed Systems© Andreas Weissel • University of Erlangen-Nuremberg • Computer Science 4 (Operating Systems) • 2004

32

TA

CS

’04

Conclusion

■ Event-monitoring counters enable◆ on-line energy accounting

◆ task-specific temperature management

■ Correctly account client/server relations across machine boundaries

■ Transparent to applications and unmodified operating systems

■ Future directions◆ examine more sophisticated energy models

◆ task-specific frequency scaling to adjust the thermal load