
Modeling Hierarchical Thermal Transport in Electronic Systems

Amip Shah, Rocky Shih, Cullen Bash and Niru Kumari
Hewlett Packard Company, Palo Alto, California 94304, USA

Amip Shah is a Senior Research Scientist, Rocky Shih is a Senior Engineer, Cullen Bash is a Distinguished Technologist, and Niru Kumari is a Research Scientist at Hewlett Packard Company.

10th AIAA/ASME Joint Thermophysics and Heat Transfer Conference, 28 June - 1 July 2010, Chicago, Illinois. AIAA 2010-4778. Copyright © 2010 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.

Electronic systems have an inherent hierarchy due to the underlying architecture that governs the processing, transmission, and storage of data. In the present work, we leverage this hierarchy to develop an approach that enables rapid optimization of the thermal architecture of multi-scale electronic systems. We demonstrate the approach for the case of data centers, which contain a hierarchy that extends from discrete electronic components such as microprocessors and memory to electronic systems such as computer servers and ultimately infrastructures such as racks and data centers.

Nomenclature

A = ratio of power consumed for cooling in computer racks to power consumed by compressor
B = ratio of power consumed by CRAC units to the power consumed by compressor
C = ratio of power consumed by hydraulics and pumps to the power consumed by compressor
COP = Coefficient of Performance
comp = compressor (of chiller in data center)
CRAC = Computer Room Air-Conditioning unit
d = hydraulic diameter
D = ratio of power consumed by cooling tower to the power consumed by compressor
DC = data center
f = friction factor
F = function
in = into control volume
l = laminar
Q = rate of heat transfer
P = pressure
sys = system
t = turbulent
T = absolute temperature
v = velocity
V = volumetric flowrate
W = work consumed by the system
x = distance, in direction of flow
ρ = density

I. Introduction

Electronic systems are designed for a variety of functions. For example, computers generally receive data, store information, process instructions, and transmit data. From a systems architecture point of view, each of these functions is performed at a different level within a 'hierarchy'. As an illustration, consider data centers, the backbone of most information technology (IT) networks that drive global communications today. Data centers are essentially large warehouses containing thousands of servers, storage, and network systems.


At this highest level of the hierarchy, different functions within the data center are distributed by 'function': network systems, such as routers and switches, perform the data transmission; compute servers perform the processing and execution of instruction sets; and storage systems archive information. Within each of these systems, however, the functional hierarchy is repeated: each server contains interconnects via which information is transmitted internal to the system, processors via which instructions are executed, and memory or storage where information is archived.

At each level in the hierarchy, power is consumed in the form of electricity and rejected in the form of heat. Thus, as IT becomes more complex, a hierarchy is also generated in terms of thermal transport. Traditionally, different levels of the hierarchy have been treated independently: first, a component vendor might optimize the design of the microelectronic package; then, at the next level upwards in the hierarchy, the systems designer treats the component as a black box and optimizes the system thermal architecture. Similarly, data center designers treat the system as a black box, and facility designers treat the data center as a black box. In this approach, however, it is difficult to discern how choices at the facility level might cascade downwards to the component or sub-component level, and vice versa. Building a model that includes details of every level in the hierarchy leads to a fairly large design space that requires significant computational capability to solve.

Moreover, the electricity use across the hierarchy (from "chip to chiller") is increasingly becoming a matter of environmental and economic concern. Figure 1 shows such a hierarchy for an enterprise computing infrastructure. The electronic components draw electricity from the utility grid through a power delivery infrastructure, which converts high-voltage AC power to the necessary low-voltage DC power. The economics of enterprise IT are typically most advantageous when systems containing these electronic components are deployed at high densities, most often by mounting a large number of systems in industry-standard computer racks. The power consumed by the electronic components inside the systems is dissipated in the form of heat, which must be removed to maintain the electronic components at feasible operating temperatures.

Figure 1. Schematic of a Typical Raised-Floor Data Center Infrastructure. A data center can broadly be considered as being constructed of three key infrastructures: the IT infrastructure, which performs the computational work; the cooling infrastructure, which removes the heat dissipated by the IT equipment; and the power delivery infrastructure, which provides the electricity required by the IT equipment.


Within computer systems, the heat is most commonly transmitted from the electronic components through a thermomechanical package into the airspace within the computer chassis ('box'). In traditional raised-floor architectures, air is supplied to the computer systems from a pressurized plenum into a 'cold aisle'. System fans draw the cold supply air into the box. The heat picked up by the cold air from the computer systems is exhausted into a 'hot aisle'. This warm exhaust air is returned to a Computer Room Air-Conditioning (CRAC) unit, where the heat is typically rejected to a chilled water stream. The cooled air is returned to the underfloor plenum, from where it flows back to the cold aisle. The warm water that absorbed heat from the return air is returned to a chiller, where a traditional refrigeration cycle supplies chilled water back to the data center. The heat absorbed at the chiller evaporator is then rejected, via the condenser, to a secondary water loop, so that the heat originally dissipated by the electronic components is eventually rejected to the external ambient via a cooling tower.

Each of the systems within the compute, cooling, and power delivery infrastructures requires electricity for operation. As a result, computing infrastructures often consume significant amounts of energy. In 2006, it was estimated that U.S. data centers required nearly 61 billion kWh, or approximately 1.5% of the total electricity generated in the country1, and the energy consumption of data centers has been estimated to double every five years2. Often, as much as half of the total electricity supplied to the data center infrastructure goes towards the auxiliary power delivery and cooling equipment supporting the computer systems. A method to rapidly simulate different 'what if' operational conditions within the data center, for the purpose of reducing this energy use, is therefore desirable.

In this paper, we propose a modeling approach that recognizes and takes advantage of the hierarchy within electronic systems. At each level of the hierarchy, we run a small number of numerical simulations that characterize the thermofluidic behavior along the boundaries of the design space. We then use these simulations to generate a network of 'thermovolume' resistances: expressions that characterize the thermal resistance and flow resistance across each component of the hierarchy. Because this network is closed form, it becomes possible to leverage a set of robust algorithms that can provide an optimal solution for predefined constraints. The approach is demonstrated by finding the most energy-efficient configuration within a data center hierarchy that does not violate any temperature constraints at the component level. We find that the predicted optimal configuration closely mirrors that predicted by more detailed numerical models, while requiring less than 1% of the computational time.

II. Prior Work

Previously, most models of the data center cooling infrastructure have focused on tracing the airflow from the CRAC discharge to the return. Computational Fluid Dynamics (CFD) models are the most common approach21,22, wherein off-the-shelf commercial packages are utilized to generate temperature and flow profiles across the data center. However, the complexity of the iterative numerical approaches, and the corresponding computational time, increases non-linearly with grid size in CFD models. Thus, for large data centers, grid sizes that are reasonable for the airspace are generally too coarse to represent the electronic components and/or the airflow within the computer systems. The computational expense associated with the use of higher-resolution grids is generally too great, with the result that most CFD models simply treat the systems as black boxes. While this approach is reasonable for industry-standard systems, custom-designed systems cannot generally be adequately represented in these models.

Many previous researchers have explored the challenges associated with modeling such multiscale infrastructures. Particularly within the electronics cooling and data center segments, reduced-order models that decompose the numerical dimensionality in terms of a small number of independent variables have proven useful for efficient modeling3-5. In this approach, the authors take a large-scale representation of a given thermofluidic space, such as a matrix containing the thermofluidic characteristics of every node or volume within a CFD mesh, and mathematically reduce the representation to a small set of key parameters, such as a basis set of vectors of the given space. The smaller set of parameters can be manipulated much more rapidly, thus enabling the testing of a large number of variations within the multiscale infrastructure. However, a key assumption of these approaches is that the space is fully defined through a linear combination of its basis vectors, implicitly requiring that the thermal behavior be homogeneous across different design variations. For example, a non-linear change in the heat transfer coefficients would not be explicitly captured by such approaches.

Alternatively, approaches based on machine learning techniques6-8 allow rapid modeling of the design space, but require a fairly sophisticated set of initial 'training' data. More advanced numerical analytics, such as principal component analysis9,10, are useful but depend upon a reinterpretation of the physical phenomena governing the problem space in terms of the formulated constraints. For example, the optimal settings derived from several machine-learning approaches are often physically meaningless and thus require pruning within the feasible design space. Setting up these design constraints is often a time-consuming and non-trivial challenge.


Another approach to solving complex computational thermofluidic problems has been to leverage physics-based approaches to simplify modeling of the problem. However, the accuracy of such approaches is greatly reduced as the parametric design space is widened such that the governing thermofluidic phenomena may change (e.g., from laminar to turbulent flow). As an illustration, within the data center space, an approach based on potential flow has been proposed11,12, but such an approach may generally only work effectively within those areas of the room where boundary surfaces are not critical. For example, such a potential-flow-based model would likely prove insufficient for capturing flow characteristics within electronic systems.

By contrast, the method of thermovolumes13,14 recognizes that different thermofluidic phenomena may govern different regimes of the design, and relies upon an approximate definition of these regimes by the designer to quickly segment the solution space and increase computational speed. First proposed by Patel and Belady13 for heat sinks and demonstrated by Bash and Patel14 for an entire computer system, the method of thermovolumes provides a convenient way to expediently assess the trade-offs between reducing thermal resistance (i.e., increasing thermal performance) and increasing pressure drop (i.e., increasing operational power consumption) of a given computer system thermal solution. Essentially, by treating the entire heat-dissipating body as a control volume of known geometry, relatively straightforward approximations of the fluid flow can be made using equations for flow through a duct:

\frac{\Delta P}{\Delta x} = \frac{1}{2} \frac{f_l}{d} \rho v_{in} + \frac{1}{2} \frac{f_t}{d} \rho v_{in}^2     (1)

where the friction factor f has been defined in terms of a laminar and turbulent component as follows:

f = \frac{f_l}{v_{in}} + f_t     (2)

The foundation behind Eq. 2 is that for laminar flow the friction factor varies inversely with velocity, while for turbulent flow the friction factor is approximately constant15. The advantage of the formulation of Eq. 1 is that the friction factor of a given system is generally easy to approximate (either through experimentation or numerical modeling) for laminar-only or turbulent-only flow, because in these cases the pressure drop is proportional to the velocity and to the square of the velocity, respectively. Using these two components, Eq. 1 provides a reasonable estimate of the pressure drop for cases where both laminar and turbulent flow may be encountered over a characteristic system length. The interested reader is referred to the work of Bash and Patel14 for further details on this approach.
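As a minimal illustration of Eqs. 1 and 2, the short Python sketch below estimates the laminar and turbulent friction-factor components from hypothetical laminar-only and turbulent-only reference points and then evaluates the composite pressure drop at arbitrary approach velocities. All numerical values are illustrative assumptions, not data from the present work; near the reference velocities the composite estimate differs slightly from the single-regime values because both terms contribute.

# Illustrative sketch of Eqs. 1 and 2 (all numbers are assumed, not from the paper).
RHO = 1.2          # air density, kg/m^3
D_H = 0.05         # hydraulic diameter, m
LENGTH = 0.6       # characteristic flow length, m

# Hypothetical reference results (e.g., from coarse CFD or experiment):
# at a low velocity the flow is laminar-only, at a high velocity turbulent-only.
V_LAM, DP_LAM = 0.5, 2.0    # m/s, Pa  -> dP ~ v, so f_l = 2*dP*d/(rho*v*L)
V_TUR, DP_TUR = 5.0, 60.0   # m/s, Pa  -> dP ~ v^2, so f_t = 2*dP*d/(rho*v^2*L)

f_l = 2.0 * DP_LAM * D_H / (RHO * V_LAM * LENGTH)
f_t = 2.0 * DP_TUR * D_H / (RHO * V_TUR**2 * LENGTH)

def pressure_drop(v_in):
    """Composite pressure drop over LENGTH per Eq. 1, using f = f_l/v_in + f_t (Eq. 2)."""
    dp_dx = 0.5 * (f_l / D_H) * RHO * v_in + 0.5 * (f_t / D_H) * RHO * v_in**2
    return dp_dx * LENGTH

for v in (0.5, 1.0, 2.0, 5.0):
    print(f"v_in = {v:.1f} m/s -> estimated dP = {pressure_drop(v):.1f} Pa")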

Thus, for a given approach velocity, and regardless of the type of flow, the pressure drop across a system can be estimated without detailed CFD models containing dense grids that separate the laminar and turbulent regions of the flow. Instead, coarse CFD models can be used to estimate the laminar-only and turbulent-only components of the friction factor, which can then be combined using Eq. 1 to enable a system-scale model with a coarser grid than would otherwise be necessary. The temperature rise for a given heat dissipation can then be estimated quickly. In this manner, a composite representation of the flow and thermal resistance can be obtained, as shown in Fig. 2. In addition to the thermal resistance, an inlet-outlet temperature difference across each curve (i.e., the temperature rise in the airstream for a given flowrate and heat dissipation) can also be specified as part of the thermovolume characteristics.

To use Fig. 2, the maximum temperature rise allowable at a particular heat dissipation (i.e., the desired thermal resistance) of a given system is determined based on the system architecture. Using Fig. 2, the volumetric flowrate required to achieve that thermal resistance can then be determined. If the pressure drop is found to be unacceptably high in the system, the design is iteratively modified (e.g., by changing the air routing or material choice) until the optimal operating point is found.

Figure 2. Conceptual illustration of a thermovolume curve16. For a given approach velocity (and flowrate), the temperature rise and pressure drop can be modeled using CFD. This allows the user to easily specify the work required to remove heat from the system at a given temperature.


Thus, the thermovolume approach can be used to expedite modeling and seek the optimal operating point of a given thermal solution, or to design a thermal solution to meet a particular operating cost target. The time required for iterative design is substantially lower than that required by detailed CFD models, with comparatively small losses in accuracy even for complex geometries. For example, Bash and Patel14 found the difference between thermovolume predictions and experimental measurements of flow and temperature to be less than 10%.
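To make the design loop described above concrete, the following Python sketch searches a tabulated thermovolume curve for the lowest air inlet velocity that meets a target thermal resistance and reports the corresponding pressure drop and ideal flow work. The tabulated points, target resistance, and flow area are illustrative assumptions rather than values taken from Fig. 2.

import numpy as np

# Hypothetical thermovolume curve for one system (assumed values, not from Fig. 2).
velocity = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # air inlet velocity, m/s
r_thermal = np.array([1.6, 0.9, 0.65, 0.52, 0.45])   # thermal resistance, K/W
dp = np.array([2.0, 6.5, 13.0, 22.0, 33.0])          # pressure drop, Pa

FLOW_AREA = 0.01   # assumed free flow area, m^2
R_TARGET = 0.7     # required thermal resistance, K/W

# Thermal resistance decreases monotonically with velocity, so find the lowest
# velocity at which the curve drops to the target (linear interpolation).
v_required = np.interp(R_TARGET, r_thermal[::-1], velocity[::-1])
dp_required = np.interp(v_required, velocity, dp)
flow_work = dp_required * v_required * FLOW_AREA     # ideal flow work, W (= dP * volumetric flow)

print(f"Required velocity: {v_required:.2f} m/s")
print(f"Pressure drop at that point: {dp_required:.1f} Pa")
print(f"Ideal flow work: {flow_work:.2f} W")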

III. Proposed Approach

In the present approach, we apply the method of thermovolumes to capture the hierarchical characteristics of electronic systems. Specifically, we recognize that systems are simply an aggregation of components, each of which necessarily has distinct thermovolume characteristics. Therefore, systems can be modeled as a superposition of the thermovolume characteristics of each component within the system. Similarly, infrastructures like data centers are simply discrete aggregations of systems, each of which again has defined thermovolume characteristics that can be approximated through an appropriate superposition method. Because these thermovolumes all have independent thermal resistance curves (i.e., a known temperature rise for a given heat dissipation) and flow resistance curves (i.e., a known pressure drop for a given air velocity), we model the aggregation of these components using a network resistance model.

To create such a resistance model, we begin by defining independent components upon which a feasible design space can be constructed. For example, in a typical computer system, we identify the processor, memory, heat sink, etc. as independent components. (This is in contrast to previous numerical approaches, which define the independent variables mathematically without due concern for the physical practicality upon which that independence is based.)

Next, for each of these independent components, we perform a detailed numerical simulation, e.g., using finite element or computational fluid dynamics models, to obtain the thermal resistance curves subject to inlet temperature and flow constraints. For example, in the case of a processor, we model the temperature rise across the component for a given heat dissipation (thermal resistance) as a function of different air flowrates. Simultaneously, for each flowrate, we model the pressure drop across the component (flow resistance). Upon combining the two, we obtain a thermovolume curve that provides thermal resistance as well as flow resistance. Figure 4 illustrates such curves for a given component. As discussed earlier, such thermovolume curves allow us to easily identify the optimal operating point at which a desired temperature threshold can be maintained.
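One way to turn a handful of component-level CFD results into continuous thermovolume curves is to fit simple functional forms to the simulated points: Eq. 1 suggests a linear-plus-quadratic form in velocity for the pressure drop, and a constant-plus-inverse form is a common choice for forced-convection thermal resistance. The Python sketch below performs such fits by least squares; the data points and functional forms are illustrative assumptions rather than the specific fits used in this work.

import numpy as np

# Hypothetical component-level CFD results (assumed values).
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])          # approach velocity, m/s
dp = np.array([1.8, 5.9, 12.4, 21.0, 31.8])      # pressure drop, Pa
r_th = np.array([1.55, 0.92, 0.71, 0.60, 0.54])  # thermal resistance, K/W

# Pressure drop: dP = a*v + b*v^2 (same form as Eq. 1 integrated over the length).
A_dp = np.column_stack([v, v**2])
(a, b), *_ = np.linalg.lstsq(A_dp, dp, rcond=None)

# Thermal resistance: R = c + e/v (constant conduction term plus convective term).
A_r = np.column_stack([np.ones_like(v), 1.0 / v])
(c, e), *_ = np.linalg.lstsq(A_r, r_th, rcond=None)

def thermovolume(v_in):
    """Return (thermal resistance [K/W], pressure drop [Pa]) at a given velocity."""
    return c + e / v_in, a * v_in + b * v_in**2

print("dP(v) ~ %.3f*v + %.3f*v^2" % (a, b))
print("R(v)  ~ %.3f + %.3f/v" % (c, e))
print("At 2.5 m/s:", thermovolume(2.5))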

Once such thermovolume curves have been obtained for different components, we then assemble the components through a simple thermovolume network. For example, components that are placed downstream of each other in the direction of the flow will have flow resistances in series, because the pressure drops are additive.

Figure 3. Hierarchy within a computer system. The system shown, a 2U rack-mounted industry-standard server17, contains several discrete components, some of which are highlighted in the diagram (cover, PCI-E expansion slots, DIMM sockets, fans, display, hard drives, USB ports, LEDs, processors, and power supply). The thermovolume characteristics of these components can be superposed to obtain a thermovolume curve for the entire system.

Figure 4. Sample thermal resistance and pressure drop curves for a given component. Thermal resistance [K/W] and pressure drop [10^-1 Pa] are plotted against air inlet velocity (m/s). Note that the scale of the units for each curve has been adjusted for improved visibility.


By contrast, for components stacked on top of each other orthogonal to the direction of flow (such as multiple die in a package, or multiple servers in a rack), the flow resistances will be in parallel. Such a specification allows for a rapid estimation of the layout of the flow, or more specifically, of how the flow will be divided across the diverse array of components that make up a system. Figure 5 illustrates such a network for one of the racks of the IT infrastructure shown earlier in Fig. 1. These racks have a large number of IT systems (servers) stacked vertically inside the rack, while the airflow enters the rack horizontally from a cold aisle and is exhausted through the rear of the rack. In such a configuration, the airflow encounters some resistance as it flows into the system (e.g., through a perforated vent tile). The airflow then splits across the multiple servers stacked vertically in the rack; this is the branching of the network in Fig. 5. Each branch corresponds to flow through a single system. Only three systems are shown for illustration.

In each of the systems, a processor with a heat sink is placed at the system inlet. The air flows past this resistance and is preheated. Downstream within the system, multiple components (memory, chipset, I/O, etc.) are laid out. The air once again gets distributed as it flows across these components, which is why the flow resistances corresponding to these downstream components are in parallel. The airflow then encounters some resistance as it flows out through the system chassis and exits at the system rear. A rack door may provide an additional resistance past this system outlet (not shown).

Figure 5. Example resistance network for a computer rack containing three rack-mounted servers, running from the rack inlet, through the processor resistance and the parallel downstream flowpaths, to the system exit and rack outlet. (In practice, racks often hold up to 42 rack-mounted systems stacked vertically, or even more systems bundled together into discrete chassis.)

Once the flow paths are specified as described above, the preheat occurring in the airstream prior to reaching a given component can be simulated from the inlet-outlet temperature characteristics (available from the thermovolume simulations). The corresponding temperature rise within the component can then be estimated using the component-level thermal resistance curves. For example, consider a given air flowrate to a particular rack. Based on the flow resistance diagram, the total flow resistance of the rack is known and the air velocity entering each system can be calculated. Then, using the thermovolume curves of Fig. 4, the pressure drop corresponding to that inlet velocity can be determined. This gives the total flow work required to cool the system at that flowrate. Simultaneously, the thermal resistance within the system for the given inlet velocity can be inferred from the thermovolume curves. Thus, the component temperature corresponding to a rack-level flowrate can be determined very quickly. Using such a modeling approach, we can generate flow resistance and thermal resistance curves for a variety of different component and system configurations.
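A minimal sketch of such a network calculation is shown below: given a total rack flowrate, it finds the common pressure drop across three parallel server branches (each with a composite laminar-plus-turbulent flow resistance) by bisection, then reports the resulting per-branch flows. The branch coefficients and geometry are assumed for illustration and do not correspond to the systems modeled in this work.

# Sketch of a parallel flow-resistance split for one rack (illustrative values only).
BRANCH_AREA = 0.02   # free flow area per server branch, m^2

# Each branch: dP(v) = a*v + b*v^2 (Eq. 1 form); slightly different coefficients per server.
branches = [(1.5, 1.0), (1.8, 1.2), (1.6, 1.1)]

def branch_flow(dp, coeffs):
    """Invert dP = a*v + b*v^2 for v >= 0 and return the volumetric flow for that branch."""
    a, b = coeffs
    v = (-a + (a * a + 4.0 * b * dp) ** 0.5) / (2.0 * b)
    return v * BRANCH_AREA

def solve_rack(total_flow, tol=1e-9):
    """Bisection on the common pressure drop until the branch flows sum to the rack flow."""
    lo, hi = 0.0, 1e4
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sum(branch_flow(mid, c) for c in branches) > total_flow:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

rack_flow = 0.12   # total rack airflow, m^3/s (roughly 250 cfm, assumed)
dp = solve_rack(rack_flow)
for i, c in enumerate(branches, 1):
    print(f"Server {i}: {branch_flow(dp, c):.4f} m^3/s at dP = {dp:.2f} Pa")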

The flow resistance network described above captures the hierarchy from components to systems to racks. The diagram could be further extended, using the same concepts, to include multiple racks and an entire data center if desired. At that stage, if the network is parameterized in terms of a preset number of variables, it becomes possible to determine the configuration that cools a given infrastructure using the least amount of energy. For example, when optimizing the layout of a particular system, one may choose to parameterize in terms of the distance between components, or in terms of the interconnect length required for the different components to meet a given computer architecture performance target. Similarly, if one wishes to find the most energy-efficient data center configuration (a topic of growing interest), an optimization routine can be set up that solves the closed-form network problem to minimize the flow resistance such that the thermal resistance does not exceed a preset threshold.

The key difference between the above approach and existing techniques is the ability to perform optimization at multiple length scales. For example, one could easily rearrange the configuration of components within a system in the search for energy optimality at the data center level. The next section examines such an example.

IV. Case Study

To illustrate the above approach, we consider the design of a data center consisting of 16 different racks, each with 32 different systems (each of which contains between 10 and 15 different heat-dissipating components). Thus, the model involves a hierarchy of four levels (data center, rack, system, and component), and we are optimizing the placement of about 6,400 different components within the data center subject to thermal and computational performance constraints.



For the given hierarchical infrastructure, we begin by constructing the thermovolume curves of each system, which is comprised of different components. Figures 6 and 7 show the results from this simulation, where the pressure drop and maximum component temperature have been predicted for two different types of computer systems using the thermovolume network of components ("predicted") and then compared to the results obtained from a full k-ε CFD model of the systems ("modeled"). The baseline CFD model (at 30 cfm) consisted of a uniform mesh of approximately 940,000 grid cells, and a convergence criterion of sub-unity residuals was imposed for temperature and pressure. We find that there is good agreement between the predicted and modeled values, with the difference ranging between 2% and 6% for pressure drop and between 1% and 10% for temperature across the range of flowrates considered. The only exception is at extremely low flowrates, where the temperature rise is sufficiently large that the system properties differ significantly from those assumed in the base-level component curves. For electronic systems, such large temperature rises (and low operating flowrates) are unlikely.

Figure 6. Validation of system-level flow resistance curves (pressure drop in Pa versus flowrate in cfm for Systems A and B). There is good agreement between the predicted (based on thermovolume component networks) and modeled (using CFD of the full system model) pressure drop.

Figure 7. Validation of system-level thermal resistance curves (maximum component temperature rise in deg C versus flowrate in cfm for Systems A and B). There is good agreement between the predicted (based on thermovolume component networks) and modeled (using CFD of the full system model) temperature rise across the processor.

Next, we combine the system-level thermovolume models into rack-level models, and the rack-level models into full data center models, by constructing the corresponding flow and thermal resistance networks as discussed earlier. The resulting data-center-level thermovolume resistance network model is then solved to find the most energy-efficient data center configuration. Because the IT infrastructure is required to support the same user-defined workload (i.e., the same computational capacity), we assume that the power drawn by the IT infrastructure is constant. Additionally, the power delivery infrastructure (which transmits power from the utility to the IT system components) is held fixed. Subject to these assumptions, optimizing the energy efficiency of the data center corresponds to minimizing the energy used by the cooling infrastructure. In other words, we seek to maximize the Coefficient of Performance (COP) of the data center, which can be defined for the raised-floor data center architecture of Fig. 1 as follows18:

COP_{dc} = \frac{Q_{dc} / W_{comp}}{1 + A + B + C + D}     (3)

where A is the ratio of the power consumed by the cooling infrastructure across components, systems, and racks to the power consumed in the chiller compressor:



A = \frac{W_{rack} + W_{sys} + W_{components}}{W_{comp}}     (3a)

B is the ratio of the power consumed by the blowers in the CRAC units to the chiller compressor power:

B = \frac{W_{CRAC}}{W_{comp}}     (3b)

C is the ratio of the power consumed by the primary and secondary pumps to the chiller compressor power:

C = \frac{W_{pump}}{W_{comp}}     (3c)

and D is the ratio of the power consumed by the blowers and pumps at the cooling tower to the chiller compressor power:

D = \frac{W_{cooling\ tower}}{W_{comp}}     (3d)
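As a quick numerical illustration of Eq. 3, the short sketch below evaluates the data center COP and the implied total cooling power for one set of assumed values of Qdc, Wcomp, and the ratios A through D; the numbers are hypothetical and are not results from the case study.

# Illustrative evaluation of Eq. 3 (all values assumed, not from the case study).
Q_dc = 200.0e3      # heat dissipated by the IT equipment, W
W_comp = 40.0e3     # chiller compressor power, W
A, B, C, D = 0.15, 0.30, 0.10, 0.08   # power ratios per Eqs. 3a-3d

cop = (Q_dc / W_comp) / (1.0 + A + B + C + D)
total_cooling_power = W_comp * (1.0 + A + B + C + D)   # equals Q_dc / COP

print(f"COP of the data center: {cop:.2f}")
print(f"Total cooling power: {total_cooling_power / 1e3:.1f} kW")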

In Eq. 3, the heat dissipated within the data center (Qdc) is a function of the computational workload that needs to be supported, and can be treated as fixed in the present case study. In addition, we do not alter the cooling infrastructure outside the data center (i.e., the pumps, compressors, and cooling tower), so that the coefficients C and D are fixed to first order. (In practice, the efficiency of this infrastructure may change as a function of the air temperature returned to the CRAC units from the hot aisle19,20, but we ignore these considerations in the present work.) Subject to these assumptions, the problem at hand is how to arrange the 6,400 components within the data center so as to minimize the energy consumption of the component and system fans, the CRAC units, and the compressor in the chiller. That is, our objective function for minimization is the following:

F = \frac{1}{COP} \propto (1 + A + B)\, W_{comp}     (4)

To implement the above optimization scheme, we outline a range of available parameters at each level (e.g., choices regarding the (x,y) location at which components are placed within systems, and the z location of a system within a given rack of the data center). For each resulting configuration, a resistance network is created. For a given parameter setting, this network can be solved for minimal flow work subject to thermal resistance (temperature) constraints, as described in Section III. To find the optimal parameter settings over the range of available inputs, a generalized reduced gradient algorithm is applied to the above closed-form problem.
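The sketch below illustrates the flavor of this constrained minimization on a deliberately simplified stand-in problem: the decision variables are per-rack airflows rather than component placements, the cooling power model is a toy closed-form expression, and scipy's SLSQP solver is used in place of a generalized reduced gradient code (which scipy does not provide). All coefficients are assumptions for illustration only.

import numpy as np
from scipy.optimize import minimize

RHO_CP = 1200.0            # volumetric heat capacity of air, J/(m^3 K), approximate
Q_RACK = np.array([8e3, 12e3, 10e3, 15e3])   # heat load per rack, W (assumed)
DT_MAX = 15.0              # allowable air temperature rise per rack, K (assumed)
K_FAN = 2.0e4              # toy coefficient: fan/CRAC flow work ~ K_FAN * V^3, W per (m^3/s)^3
W_COMP_PER_W = 0.25        # toy coefficient: compressor power per watt of heat removed

def cooling_power(v_flows):
    """Toy closed-form objective, in the spirit of F = (1 + A + B) * W_comp."""
    fan_power = np.sum(K_FAN * v_flows**3)          # rack fans + CRAC blowers
    compressor = W_COMP_PER_W * np.sum(Q_RACK)      # fixed heat load -> fixed compressor term
    return fan_power + compressor

def temp_margin(v_flows):
    """Constraint: air temperature rise in every rack must stay below DT_MAX."""
    return DT_MAX - Q_RACK / (RHO_CP * v_flows)

x0 = np.full(Q_RACK.size, 1.0)                      # initial guess: 1 m^3/s per rack
result = minimize(
    cooling_power, x0, method="SLSQP",
    bounds=[(0.1, 5.0)] * Q_RACK.size,
    constraints=[{"type": "ineq", "fun": temp_margin}],
)

print("Optimal rack airflows (m^3/s):", np.round(result.x, 3))
print("Minimum cooling power (W):", round(result.fun, 1))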

To illustrate the results, Figs. 8(a)-8(c) show three different configurations that were considered in the case study. We find that optimizing the placement of components within each system effectively gives rise to two different types of systems: systems containing components with low thermal dissipation ("low density"), such as memory and hard disk drives, and systems containing components with high thermal dissipation ("high density"), such as processors and chipsets. The average rate of heat dissipation from a high-density system (250 W) is approximately twice that from a low-density system (126 W). In addition to the larger airflow required to remove heat from the high-density systems, the flow work required across these systems is higher due to the presence of heat sinks and other components with a large pressure drop.

Figure 8. Examples of different system layouts: (a) DC1, (b) DC2, (c) DC3. Each panel indicates the placement of racks (or half-racks) populated with low-density systems and with high-density systems.


The thermal architecture of the systems is fixed by the allocation of these high- and low-density components within the systems, since optimization of the thermal management of the components themselves (such as redesigning the heat sink) is outside the scope of the present work. The primary optimization knob available is therefore the placement of these high- and low-density systems within the racks, and the choice of these placements results in different data center configurations. Figure 8(a) shows one configuration (DC1) in which all systems of similar density are placed in a rack, and the downstream rack contains systems of the opposite density. In the configuration DC2 of Fig. 8(b), the lower half of each rack contains systems of one density while the upper half of the rack contains systems of the other. Unlike DC1, however, where the low-density and high-density racks had different levels of heat dissipation, all racks in DC2 have the same heat density. In the configuration DC3 of Fig. 8(c), the low-density racks and high-density racks of DC1 are maintained but arranged differently: all racks of similar density are placed downstream of one another, and racks of differing density are placed adjacent to each other rather than downstream.

Each of the above data center configurations processes the same computational load and thus dissipates the same amount of heat. But as shown in Fig. 9, the efficiency of the cooling infrastructure differs vastly depending on the arrangement within the data center. We find that DC3 is the worst layout, operating at the lowest COP and in fact being generally infeasible to operate at supply air temperatures above 32 °C. The reason is that placing multiple high-density systems back-to-back causes significant preheating as the air flows downstream. As a result, the margin between the downstream system inlet temperature and the maximum allowable temperature in DC3 is quite small, and very high flowrates are required to effectively remove the heat dissipated by the downstream systems. This additional flow work leads to a lower COP.

By contrast, DC2 has a COP that is almost 4X higher and allows energy-efficient operation, without violating thermal constraints, even at elevated inlet temperatures above 35 °C. Given the flexibility of alternating airflow through high-density and low-density systems, we are able to take advantage of the higher allowable temperature differences in the low-density systems without incurring the higher flow work that results from a configuration in which several high-density systems are placed downstream of each other.

Figure 9. COP of the different thermal solutions. DC2 is the optimal layout.

More important than the specific optimal configuration, it is worth noting that once the initial curves had been defined, which took on the order of 24 hours, the entire simulation and optimization routine took on the order of 20 minutes on a standard notebook computer (2.0 GHz dual-core processor with 2 GB of RAM). By comparison, each detailed numerical (CFD) model of just a single system took on the order of 4-6 hours, and over 12 hours was needed to simulate a single facility. Thus, even if it were possible to define an optimization routine (e.g., via design of experiments) with standard CFD simulation capabilities, exploring the entire design space using traditional numerical modeling methods would take on the order of 3-4 weeks instead of the roughly one day required with the proposed methodology. For larger production data center facilities, which are often a factor of 1000X larger than the current system, the computational expense of configuring the optimal data center using CFD across multiple length scales would be prohibitive. Thus, the proposed methodology can potentially enable improved efficiency in the design of large multi-scale data centers.

V. Summary and Future Work

In this paper, we have developed a methodology to computationally simulate, within a relatively short period of time, the thermal behavior of a large number of electronic components that exist within a multi-level hierarchy of commodity computer equipment. The method, built upon the previously proposed notion of thermovolumes, essentially involves establishing a baseline configuration using traditional CFD models and then rapidly iterating perturbed configurations using a small set of control parameters in order to obtain approximate results. The method was demonstrated on the test case of a single data center, with good agreement between the proposed method and traditional full-scale CFD models.

Longer-term, we wish to explore the applicability of the approach to additional types of hierarchical infrastructures, including more traditional thermal architectures. In particular, we are interested in extending the current work to include thermal components within the full-scale data center facility (e.g., heat exchangers in the CRAC units, compressors in the chiller, the cooling tower, etc.).


Another key limitation of the present work is that we have only explored a single heat transfer mechanism (i.e., forced convection) at each level in the hierarchy. While we expect the methodology to adequately handle additional thermal transport mechanisms, this needs to be verified in future work. Ultimately, our goal is to explore the suitability of the thermovolume approach for hierarchies that encompass multiple heat transfer modes within and across the boundaries of the hierarchy, for generalized systems with thermal transport, including but not limited to IT infrastructures such as data centers.

References

1. EPA, Report to Congress on Server and Data Center Efficiency, Public Law 109-431, U.S. Environmental Protection Agency, Washington, D.C., 2007.
2. Koomey, J.G., Estimating Total Power Consumption by Servers in the U.S. and the World, Technical Report, available at http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf, 2008.
3. Samadiani, E., Joshi, Y., Hamann, H., Iyengar, M.K., and Bonilla, M., "Reduced Order Thermal Modeling of Data Centers via Distributed Sensor Data," Proceedings of the ASME-Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS (InterPACK'09), San Francisco, CA, July 2009.
4. Haider, S., Burton, L., and Joshi, Y., "A Proper Orthogonal Decomposition Based System-Level Thermal Modeling Methodology for Shipboard Power Electronics Cabinets," Heat Transfer Engineering, Vol. 29, No. 2, pp. 198-215, 2008.
5. Rambo, J., and Joshi, Y., "Modeling of Data Center Airflow and Heat Transfer: State of the Art and Future Trends," Distributed and Parallel Databases, Vol. 21, No. 2-3, pp. 193-225, 2007.
6. Betul, A.-S., Karlik, B., Bali, T., and Ayhan, T., "Neural Network Methodology for Heat Transfer Enhancement Data," International Journal of Numerical Methods for Heat and Fluid Flow, Vol. 17, No. 8, pp. 788-798, 2007.
7. Cortes, O., Urquiza, G., Hernandez, J.A., and Cruz, M.A., "Artificial Neural Networks for Inverse Heat Transfer Problems," Proceedings of the 2007 IEEE Conference on Electronics, Robotics and Automotive Mechanics (CERMA), Cuernavaca, Mexico, pp. 198-201, September 2007.
8. Shrivastava, S.K., Van Gilder, J.W., and Sammakia, B., "Data Center Cooling Prediction using Artificial Neural Network," Proceedings of the ASME-Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS (InterPACK'07), Vancouver, BC, July 2007.
9. Sharma, R., Marwah, M., Lugo, W., and Bash, C., "Application of Data Analytics to Heat Transfer Phenomena for Optimal Design and Operation of Complex Systems," Proceedings of the ASME Summer Heat Transfer Conference, San Francisco, CA, July 2009.
10. Bautista, L., and Sharma, R., "Analysis of Environmental Data in Data Centers," Technical Report No. HPL-2007-98, Hewlett Packard Laboratories, Palo Alto, CA, June 2007.
11. Toulouse, M., Doljac, G., Carey, V.P., and Bash, C., "Exploration of a Potential-Flow-Based Compact Model of Air-flow Transport in Data Centers," Proceedings of the ASME International Mechanical Engineering Congress and Exposition (IMECE), Lake Buena Vista, FL, November 2009.
12. Hamann, H.F., López, V., and Stephanchuk, A., "Thermal Zones for More Efficient Data Center Energy Management," Proceedings of the 13th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Las Vegas, NV, June 2010.
13. Patel, C., and Belady, C., "Modeling and Metrology in High Performance Heat Sink Design," Proceedings of the 47th IEEE Electronic Components and Technology Conference (ECTC), San Jose, CA, pp. 296-302, 1997.
14. Bash, C., and Patel, C., "Modeling and Metrology for Expedient Analysis and Design of Computer Systems," International Journal of Microcircuits and Electronic Packaging, Vol. 22, No. 2, pp. 157-163, 1999.
15. Moody, L.F., "Friction Factors for Pipe Flow," Trans. ASME, Vol. 66, No. 8, pp. 671-684, 1944.
16. Shah, A.J., and Patel, C.D., "Designing Environmentally Sustainable Electronic Cooling Systems Using Exergo-Thermo-Volumes," International Journal of Energy Research, Vol. 33, pp. 1266-1277, March 2009.
17. Hewlett Packard, HP ProLiant DL385 Generation 5 Overview, available at http://h18004.www1.hp.com/products/quickspecs/13012_na/13012_na.html, 2009.
18. Patel, C.D., Sharma, R.K., Bash, C.E., and Beitelmal, M.H., "Energy Flow in the Information Technology Stack: Introducing the Coefficient of Performance of the Ensemble," Proceedings of the ASME International Mechanical Engineering Congress and Exposition, Chicago, IL, November 2006.
19. Breen, T.J., Walsh, E.J., Punch, J., Shah, A.J., and Bash, C.E., "From Chip to Cooling Tower Data Center Modeling: Part I, Influence of Server Inlet Temperature and Temperature Rise across Cabinet," Proceedings of the 13th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Las Vegas, NV, June 2010.
20. Walsh, E.J., Breen, T.J., Punch, J., Shah, A.J., and Bash, C.E., "From Chip to Cooling Tower Data Center Modeling: Part II, Influence of Chip Temperature Control Philosophy," Proceedings of the 13th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Las Vegas, NV, June 2010.
21. Patel, C., Bash, C., Belady, C., Stahl, L., and Sullivan, D., "Computational Fluid Dynamics Modeling of High Compute Density Data Centers to Assure System Inlet Air Specifications," Proceedings of the ASME-Pacific Rim Electronic Packaging Technical Conference and Exhibition (InterPACK'01), Kauai, HI, July 2001.
22. Schmidt, R., and Cruz, E., "Raised Floor Computer Data Center: Effect on Rack Inlet Temperatures of Chilled Air Exiting Both the Hot and Cold Aisles," Proceedings of the 8th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), San Diego, CA, June 2002.