Modeling Hierarchical Thermal Transport in Electronic
Systems
Amip Shah (Senior Research Scientist), Rocky Shih (Senior Engineer), Cullen Bash (Distinguished Technologist), and Niru Kumari (Research Scientist)
Hewlett Packard Company, Palo Alto, California 94304, USA
AIAA 2010-4778, 10th AIAA/ASME Joint Thermophysics and Heat Transfer Conference, 28 June - 1 July 2010, Chicago, Illinois
Copyright © 2010 by the American Institute of Aeronautics and Astronautics, Inc. All rights reserved.
Electronic systems have an inherent hierarchy due to the underlying architecture that
governs the processing, transmission, and storage of data. In the present work, we leverage
this hierarchy to develop an approach that enables rapid optimization of the thermal
architecture of multi-scale electronic systems. We demonstrate the approach for the case of
data centers, which contain a hierarchy that extends from discrete electronic components
such as microprocessors and memory to electronic systems such as computer servers and
ultimately infrastructures such as racks and data centers.
Nomenclature
A = ratio of power consumed for cooling in computer racks to power consumed by compressor
B = ratio of power consumed by CRAC units to the power consumed by compressor
C = ratio of power consumed by hydraulics and pumps to the power consumed by compressor
COP = Coefficient of Performance
comp = compressor (of chiller in data center)
CRAC = Computer Room Air-Conditioning unit
d = hydraulic diameter
D = ratio of power consumed by cooling tower to the power consumed by compressor
DC = data center
f = friction factor
F = function
in = into control volume
l = laminar
Q = rate of heat transfer
P = pressure
sys = system
t = turbulent
T = absolute temperature
v = velocity
V = volumetric flowrate
W = work consumed by the system
x = distance, in direction of flow
ρ = density
I. Introduction
Electronic systems are designed for a variety of functions. For example, computers generally receive data,
store information, process instructions, and transmit data. From a systems architecture point of view, each of
these functions is performed at different levels within a 'hierarchy'. As illustration, consider data centers – the
backbone of most information technology (IT) networks that drive global communications today. Data centers are
essentially large warehouses containing thousands of servers, storage, and network systems. At this highest level of
the hierarchy, different functions within the data center are distributed by 'function': network systems, such as
routers and switches, perform the data transmission; compute servers perform the processing and execution of
instruction sets; and storage systems archive information. However, within each of these systems, the functional
hierarchy is repeated: each server contains interconnects via which information is transmitted internal to the system;
processors via which instructions are executed; and memory or storage where information is archived.
At each level in the hierarchy, power is consumed in the form of electricity and rejected in the form of heat.
Thus, as IT becomes more complex, a hierarchy is also generated in terms of thermal transport. Traditionally,
different levels of the hierarchy have been independent: first, a component vendor might optimize the design of the
microelectronic package; then, at the next level upwards in the hierarchy, the systems designer treats the component
as a blackbox and optimizes the system thermal architecture. Similarly, data center designers then treat the system as
a blackbox and facility designers treat the data center as a blackbox. But in this approach, it is difficult to discern
how choices at the facility level might cascade downwards to the component or sub-component level and vice versa.
Building a model that involves details of each level in the hierarchy leads to a fairly large design space that requires
significant computational capabilities to solve.
Moreover, the electricity use across the hierarchy (from “chip to chiller”) is increasingly becoming a matter of
environmental and economic concern. Figure 1 shows such a hierarchy for an enterprise computing infrastructure.
The electronic components draw electricity from the utility grid through a power delivery infrastructure which
converts high-voltage AC power to the necessary low-voltage DC power. The economics of enterprise IT are
typically most advantageous when systems containing these electronic components are deployed at high densities,
most often by mounting a large number of systems in industry-standard computer racks. The power from the
electronic components inside the systems is dissipated in the form of heat, which must be removed to maintain the
electronic components at feasible operating temperatures.

Figure 1. Schematic of a Typical Raised-Floor Data Center Infrastructure. A data center can broadly be considered as being constructed of three key infrastructures: the IT infrastructure, which performs the computational work; the cooling infrastructure, which removes the heat dissipated by the IT equipment; and the power delivery infrastructure, which provides the electricity required by the IT equipment.

Within computer systems, the heat is most commonly
transmitted from the electronic components through a thermomechanical package into the airspace within the
computer chassis ('box'). In traditional raised-floor architectures, air is supplied to the computer systems via a
pressurized plenum into a 'cold aisle'. System fans draw the cold supply air into the box. The heat picked up by the
cold air from the computer systems is exhausted into a 'hot aisle'. This warm exhaust air is returned to a Computer
Room Air Conditioning (CRAC) unit, where typically, the heat is rejected to a chilled water stream. The refrigerated
air is returned to the underfloor plenum, from where it flows back to the cold aisle. The warm water which absorbs
heat from the return air is returned to a chiller, where a traditional refrigeration cycle supplies chilled water back to
the data center. The heat absorbed at the chiller evaporator is then rejected, at the condenser, to a secondary water loop, where the heat that was originally
dissipated by the electronic components is eventually rejected to the external ambient via a cooling tower.
Each of the systems within the compute, cooling, and power delivery infrastructures requires electricity for
operation. As a result, computing infrastructures often consume significant amounts of energy. In 2006, it was
estimated that U.S. data centers required nearly 61 billion kWh, or approximately 1.5% of the total electricity
generated in the country [1]; and the energy consumption of data centers has been estimated to double every five
years [2]. Often, as much as half of the total electricity supplied to the data center infrastructure goes towards the
auxiliary power delivery and cooling equipment supporting the computer systems. A method to rapidly simulate
different 'what if' operational conditions within the data center for purposes of reducing the energy use is desirable.
In this paper, we propose a modeling approach that recognizes and takes advantage of the hierarchy within
electronic systems. At each level of the hierarchy, we run a small number of numerical simulations that characterize
the thermofluidic behavior along the boundaries of the design space. We then utilize these simulations to generate a
network of 'thermovolume' resistances – expressions that characterize the thermal resistance and flow resistance
across each component of the hierarchy. Because this network is closed form, it becomes possible to leverage a set
of robust algorithms that can provide an optimal solution for predefined constraints. The approach is demonstrated
for finding the most energy efficient configuration within a data center hierarchy that does not violate any
temperature constraints at the component level. We find that the predicted optimal configuration closely mirrors that
predicted by more detailed numerical models, while requiring less than 1% of the computational time.
II. Prior Work
Previously, most models of the data center cooling infrastructure have focused on tracing the airflow from the
CRAC discharge to return. Computational Fluid Dynamics (CFD) models are the most common approach [21, 22], wherein off-the-shelf commercial packages are utilized to generate temperature and flow profiles across the data
center. However, the complexity of the numerical iterative approaches and corresponding computational time
increases non-linearly with the grid size in CFD models. Thus, for large data centers, grid sizes that are reasonably
sized for the airspace are generally too coarse to represent the electronic components and/or the airflow within the
computer systems. The computational expenses associated with the use of higher-resolution grids are generally too
great, with the result that most CFD models simply treat the systems as black-boxes. While this approach is
reasonable for industry-standard systems, custom-designed systems cannot generally be adequately represented in
these models.
Many previous researchers have explored the challenges associated with the modeling of such multiscale
infrastructures. Particularly within the electronics cooling and data center segments, reduced-order models that
decompose the numerical dimensionality in terms of a small number of independent variables have proven to be
useful for efficient modeling [3-5]. In this approach, the authors take a large-scale representation of a given
thermofluidic space – such as a matrix containing the thermofluidic characteristics of every node or volume within a
CFD mesh – and mathematically reduce the representation to a small set of key parameters, such as a basis set of
vectors of the given space. The smaller set of parameters can be manipulated much more rapidly, thus enabling
testing of a large number of variations within the multiscale infrastructure. However, a key assumption of these
approaches is that the space is fully defined through linear combination of its basis vectors, implicitly requiring that
the thermal behavior be homogeneous across different design variations. For example, a non-linear change in the
heat transfer coefficients would not be explicitly included in such approaches.
Alternatively, approaches based on machine learning techniques [6-8] allow rapid modeling of the design space, but
require a fairly sophisticated set of initial 'training' data. More advanced numerical analytics, such as principal component analysis [9, 10], are useful but dependent upon a reinterpretation of the physical phenomena governing the
problem space in terms of the formulated constraints. For example, the optimal settings derived from several
machine learning-type approaches are often physically meaningless and thus require pruning within the feasible
design space. Often, setting up these design constraints is a time-consuming and non-trivial challenge.
Another approach to solving complex
computational thermofluidic problems has
been to leverage physics-based approaches
to simplify modeling of the problem.
However, the accuracy of such approaches
is greatly reduced as the parametric design
space is widened such that the thermofluidic
phenomena may change (e.g., from laminar to turbulent flow). As illustration, within the data center space, an approach based on potential flow has been proposed [11, 12], but
such an approach may generally only work
effectively within those areas of the room
where boundary surfaces are not as critical.
For example, such a potential-flow based
model would likely prove insufficient for
capturing flow characteristics within
electronic systems.
By contrast, the method of thermovolumes [13, 14] recognizes that different thermofluidic phenomena may govern different regimes of the design, and relies upon an approximate definition of these regimes by the designer to quickly segment the solution space and increase computational speed. First proposed by Patel and Belady [13] for heat sinks and demonstrated by Bash and Patel [14] for
an entire computer system, the method of thermo-volumes provides a convenient way to expediently assess the
trade-offs between reducing thermal resistance (i.e. increasing thermal performance) and increasing pressure drop
(i.e. increasing operational power consumption) of a given computer system thermal solution. Essentially, by
treating the entire heat dissipating body as a control volume of known geometry, relatively straightforward
approximations of the fluid flow can be made using equations for flow through a duct:
\frac{\Delta P}{\Delta x} = \left(\frac{f_l}{v_{in} d} + \frac{f_t}{d}\right) \frac{1}{2}\rho v_{in}^2 \qquad (1)
where the friction factor f has been defined in terms of a laminar and turbulent component as follows:
f = \frac{f_l}{v_{in}} + f_t \qquad (2)
The foundation behind Eq. 2 is that for laminar flow, the friction factor varies inversely with velocity while for
turbulent flow, the friction factor is a constant [15]. The advantage of the formulation of Eq. 1 is that the friction factor
of a given system is generally easy to approximate (either through experimentation or numerical modeling) for
laminar-only or turbulent-only flow, because in these cases the pressure drop is respectively proportional to the
velocity and the square of the velocity. Using these, Eq. 1 allows a reasonable estimate of the pressure drop for cases
where both laminar and turbulent flow may be encountered over a characteristic system length. The interested reader
is referred to the work of Bash and Patel [14] for further details on this approach.
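As a concrete illustration of Eqs. (1) and (2), the short Python sketch below combines laminar-only and turbulent-only friction factors into a composite pressure-gradient estimate. The coefficient values, hydraulic diameter, and air density used here are illustrative assumptions rather than data from this work.

def composite_friction_factor(f_l, f_t, v_in):
    """Eq. (2): f = f_l / v_in + f_t."""
    return f_l / v_in + f_t

def pressure_gradient(f_l, f_t, v_in, d, rho=1.2):
    """Eq. (1): dP/dx = f * (1/2) * rho * v_in**2 / d, with f from Eq. (2)."""
    return composite_friction_factor(f_l, f_t, v_in) * 0.5 * rho * v_in ** 2 / d

if __name__ == "__main__":
    f_l, f_t = 0.05, 0.02    # assumed fits from laminar-only / turbulent-only runs
    d = 0.04                 # assumed hydraulic diameter, m
    for v in (0.5, 1.0, 2.0, 4.0):    # approach velocities, m/s
        print(v, pressure_gradient(f_l, f_t, v, d))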
Thus, for a given approach velocity – regardless of the type of flow – the pressure drop across a system can be
estimated without detailed CFD models containing dense grids that separate the laminar versus turbulent areas of the
flow. Instead, coarse CFD models can be used to estimate the laminar-only and turbulent-only components of the
friction factor, which can then be combined using Eq. 1 to enable a system-scale model with a coarser grid than
would be otherwise necessary. Then, the temperature rise for a given heat dissipation can be estimated quickly. In
this manner, a composite representation of the flow and thermal resistance can be obtained as shown in Fig. 2. In
addition to the thermal resistance, an inlet-outlet temperature difference across each curve (i.e. the temperature rise
in the airstream for a given flowrate and heat dissipation) can also be specified as part of the thermovolume
characteristics.
To use Fig. 2, the maximum allowable temperature rise at a particular heat dissipation (i.e., the desired
thermal resistance) of a given system is determined based on the system architecture. Using Fig. 2, the volumetric
flowrate required to achieve that thermal resistance can be determined. If the pressure drop is found to be
unacceptably high in the system, the design is iteratively modified (e.g., by changing the air routing, or material choice) until the optimal operating point is found.

Figure 2. Conceptual illustration of a thermovolume curve [16]. For a given approach velocity (and flowrate), the temperature rise and pressure drop can be modeled using CFD. This allows the user to easily specify the work required to remove heat from the system at a given temperature.

Thus, the thermo-volume approach
can be used to expedite modeling and seek
the optimal operating point of a given
thermal solution, or to design a thermal
solution to meet a particular operating cost
target. The time required for iterative design
is substantially lower than that required by
detailed CFD models, with comparatively
small losses in accuracy even for complex
geometries. For example, Bash and Patel [14] found the difference between thermo-volume predictions and experimental measurements of flow and temperature to be less than 10%.
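The design loop just described can be sketched as follows in Python: given sampled thermal-resistance and pressure-drop curves for a thermal solution, find the smallest flow that still meets a temperature-rise budget. The sample points and the helper function below are assumptions for illustration, not data from the references.

import numpy as np

v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # inlet velocity, m/s
r_th = np.array([1.6, 0.9, 0.6, 0.45, 0.35])   # thermal resistance, K/W (assumed samples)
dp = np.array([2.0, 7.0, 15.0, 26.0, 40.0])    # pressure drop, Pa (assumed samples)

def min_velocity_for_budget(q_watts, dT_allowed):
    """Smallest inlet velocity whose thermal resistance keeps q * R below dT_allowed."""
    r_needed = dT_allowed / q_watts
    # r_th falls with velocity, so interpolate against the reversed (increasing) arrays
    return float(np.interp(r_needed, r_th[::-1], v[::-1]))

v_req = min_velocity_for_budget(q_watts=80.0, dT_allowed=40.0)
dp_req = float(np.interp(v_req, v, dp))        # flow work then follows from the same curve
print(v_req, dp_req)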
III. Proposed Approach
In the present approach, we apply the
method of thermo-volumes to capture the
hierarchical characteristics of electronic systems. Specifically, we recognize that systems are simply an aggregation
of components, each of which necessarily has distinct thermovolume characteristics. Therefore, systems can be
modeled as a superposition of the thermovolume characteristics of each component within the system. Similarly,
infrastructures like data centers are simply discrete aggregations of systems, each of which again has defined
thermovolume characteristics that can be approximated through an appropriate superposition method. Because these
thermovolumes all have independent thermal resistance (i.e., a known temperature rise for a given heat dissipation)
and flow resistance curves (i.e., a known pressure drop for a given air velocity), we model the aggregation of these
components using a network resistance model.
To create such a resistance model, we begin by defining independent components upon which a feasible design
space can be constructed. For example, in a typical computer system, we identify the processor; memory; heatsink;
etc. as independent components. (This is in contrast to previous numerical approaches, which define the independent
variables mathematically without due concern to the physical practicality upon which that independence is based.)
Next, for each of these independent components, we perform a detailed numerical model – e.g. using finite
element or computational fluid dynamics – to obtain the thermal resistance curves subject to inlet temperature and
flow constraints. For example, in the case of a processor, we model the temperature rise across the component for a
given heat dissipation (thermal resistance) as a function of different air flowrates. Simultaneously, for this flowrate,
we model the pressure drop across the
component (flow resistance). Upon
combining the two, we obtain a
thermovolume curve that provides thermal
resistance as well as flow resistance. Figure
4 illustrates such curves for a given
component. As discussed earlier, such
thermovolume curves can be used to allow
us to easily identify the optimal operating
point within which a desired temperature
threshold can be maintained.
Once such thermovolume curves have
been obtained for different components, we
then assemble the components through a
simple thermovolume network. For
example, components that are placed
downstream of each other in the direction of
the flow will have flow resistances in series,
because the pressure drops are additive. By contrast, for components stacked on top of each other orthogonal to the direction of flow (such as multiple die in a package, or multiple servers in a rack), the flow resistances will be in parallel.

[Figure 3 image: server internals with labeled components including the cover, PCI-E expansion slots, DIMM sockets, fans, display, hard drives, USB ports, LEDs, processors, and power supply.]
Figure 3. Hierarchy within a computer system. The above system, a 2U rack-mounted industry-standard server [17], contains several discrete components (some of which are highlighted in the diagram). The thermovolume characteristics of these components can be superposed to obtain a thermovolume curve for the entire system.

[Figure 4 plot: thermal resistance (K/W) and pressure drop (10^-1 Pa) plotted against air inlet velocity (m/s).]
Figure 4. Sample thermal resistance and pressure drop curves for a given component. Note that the scale of the units for each curve has been adjusted for improved visibility.

Such
specification allows for a rapid estimation of
the layout of the flow, or more specifically,
how the flow will get divided across the
diverse array of components that go into a
system. Figure 5 illustrates such a flow for
one of the racks of the IT infrastructure
shown earlier in Fig. 1. These racks have a
large number of IT systems (servers)
stacked vertically inside the rack, while the
airflow enters the rack horizontally from a
cold aisle and is exhausted through the rear of the rack. In such a configuration, the airflow encounters some
resistance as it flows into the system (e.g., through a perforated vent tile). The airflow then splits across multiple
servers stacked vertically in the rack; this is the branching of the network in Fig. 5. Each branch corresponds to flow
through a single system. Only three systems are shown for illustration.
For all the systems, a processor with heat sink is placed at the system inlet. The air flows past this resistance and
is preheated. Downstream within the system, multiple components (memory, chipset, I/O, etc.) are laid out. The air
once again gets distributed as it flows across these components, which is why the flow resistances corresponding to
these downstream components are in parallel. The airflow then encounters some resistance as it flows out through
the system chassis, and exits the system rear. A single rack door may provide an additional resistance past this
system outlet (not shown).
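A minimal sketch of this series/parallel bookkeeping is given below. For illustration it assumes a quadratic loss law (delta-P = K V^2) for every element; the loss coefficients, flowrate, and helper functions are hypothetical and not taken from the present study, where the resistances would instead come from the component thermovolume curves.

def series(*ks):
    """Same flow passes through each element in turn, so pressure drops (and K's) add."""
    return sum(ks)

def parallel(*ks):
    """Branches see the same pressure drop; volumetric flows add, giving K_eq = (sum K_i**-0.5)**-2."""
    return 1.0 / sum(k ** -0.5 for k in ks) ** 2

# Rack of Fig. 5: inlet resistance, three identical server branches in parallel, outlet resistance.
k_inlet, k_outlet = 5.0, 3.0                   # assumed loss coefficients
k_server = series(8.0,                         # processor + heat sink at the server inlet
                  parallel(20.0, 25.0, 30.0))  # downstream memory / chipset / I/O in parallel
k_rack = series(k_inlet, parallel(k_server, k_server, k_server), k_outlet)

V_rack = 0.5                                   # total rack flowrate, m^3/s (assumed)
dP_rack = k_rack * V_rack ** 2                 # rack pressure drop under the quadratic loss law
V_per_server = V_rack / 3                      # identical branches split the flow evenly
print(k_rack, dP_rack, V_per_server)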
Once the flow paths are specified as described above, the preheat occurring in the airstream prior to reaching a
given component can be simulated from the inlet-outlet temperature characteristics (available from the
thermovolume simulations). The corresponding temperature rise within the component can then be estimated using
the component-level thermal resistance curves. For example, consider a given air flowrate to a particular rack. Based
on the flow resistance diagram, the total flow resistance of the rack is known and the air velocity entering each
system can be calculated. Then, using the thermovolume curves of Fig. 4, the pressure drop corresponding to that
inlet velocity can be determined. This gives the total flow work required to cool the system at that flowrate.
Simultaneously, the thermal resistance within the system for the given inlet velocity can be inferred from the
thermovolume curves. Thus, the component temperature corresponding to a rack-level flowrate can be determined
very quickly. Using such a modeling approach, we can generate flow resistance and thermal resistance curves for a
variety of different component and system configurations.
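The temperature bookkeeping along one flow path might be sketched as follows, assuming a simple sensible-heat preheat estimate (delta-T = Q / (rho * c_p * V)) and placeholder component heat loads and thermal resistances; in the actual approach the resistances would be read from the thermovolume curves at the operating velocity.

RHO, CP = 1.2, 1005.0            # air density (kg/m^3) and specific heat (J/kg-K), assumed

def component_temps(t_supply, vol_flow, path):
    """path: list of (heat_watts, r_th) pairs in flow order; r_th would come from the curves."""
    t_air = t_supply
    temps = []
    for q, r_th in path:
        temps.append(t_air + q * r_th)         # component temperature above the local air
        t_air += q / (RHO * CP * vol_flow)     # sensible preheat seen by everything downstream
    return temps, t_air

# Hypothetical server path: processor + heat sink, then memory, then I/O (loads and R's assumed).
path = [(95.0, 0.35), (20.0, 1.5), (10.0, 2.0)]
temps, t_exhaust = component_temps(t_supply=25.0, vol_flow=0.02, path=path)
print(temps, t_exhaust)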
The flow resistance network described above captures the hierarchy from components to systems to racks. The
diagram could be further extended using the same concepts to include multiple racks and an entire data center if
desired. At that stage, if parameterized in terms of a preset number of variables, it becomes possible to determine the
optimal configuration for cooling a given infrastructure using the lowest amount of energy. For example, if trying to
optimize the layout of a particular system, then one may choose to parameterize in terms of the distance between
components; or the distance in terms of interconnect length required for the different components to meet a given
computer architecture performance target. Similarly, if one wishes to find the most energy efficient data center
configuration (which is a topic of growing interest), an optimization routine can be set up that solves the closed-form
network problem to minimize the flow resistance such that the thermal resistance does not exceed a preset threshold.
The key difference in the above approach relative to existing techniques is the ability to perform optimization at
multiple length scales. For example, one could easily rearrange the configuration of components within a system in
the search for energy optimality at the data center level. The next section examines such an example.
IV. Case Study
To illustrate the above approach, we consider the design of a data center consisting of 16 different racks, each
with 32 different systems (each of which contains between 10 and 15 different heat-dissipating components). Thus,
the model involves a hierarchy of 4 levels (data center; rack; system; and component), and we are optimizing the
placement of about 6400 different components within the data center subject to thermal and computational
performance constraints.
[Figure 5 diagram: resistance network from the rack inlet, through the processor resistance and parallel downstream flowpaths of each server, to the system exit and rack outlet.]
Figure 5. Example resistance network for a computer rack
containing three rack-mounted servers. (In practice, racks
often hold up to 42 rack-mounted systems stacked vertically or
even more systems bundled together into discrete chassis.)
For the given hierarchical infrastructure,
we begin by constructing the thermovolume
curves of each system, comprised of different
components. Figure 6 and Fig. 7 show the
results from this simulation, where the
pressure drop and maximum component
temperature has been predicted for two
different types of computer systems using the
thermovolume network of components
(“predicted”) and then compared to the
pressure drop obtained from a full k-ε CFD
model of the systems (“modeled”). The
baseline CFD model (at 30 cfm) consisted of
a uniform mesh of approximately 940,000
grid cells and a convergence criterion of sub-
unity residuals was imposed for temperature
and pressure. We find that there is good
agreement between the predicted and
modeled values, with the difference ranging
between 2% and 6% for pressure drop and
between 1% and 10% for temperature across
the range of flowrates considered. The only
exception is at extremely low flowrates,
where the temperature rise is sufficiently
large that the system properties differ
significantly from those assumed in the base-
level component curves. For electronic
systems, such large temperature rises (and
low operating flowrates) are unlikely.
Next, we combine the system-level
thermovolume models into rack-level models
and the rack-level models into full data center
models by constructing the corresponding
flow- and thermal resistance networks as
discussed earlier. The resulting datacenter-
level thermovolume resistance network model
is then solved to find the most energy-
efficient data center configuration. Because
the IT infrastructure is required to support the
same user-defined workload (i.e., the same
computational capacity), we assume that the
power drawn from the IT infrastructure is
constant. Additionally, the power delivery
infrastructure (to transmit power from the
utility to the IT system components) is held
fixed. Subject to these assumptions,
optimizing the energy efficiency within the data center corresponds to minimizing the energy used by the cooling
infrastructure. In other words, we seek to maximize the Coefficient of Performance (COP) of the data center, which
can be defined for the raised-floor data center architecture of Fig. 1 as follows [18]:

COP = \frac{Q_{dc}/W_{comp}}{1 + A + B + C + D} \qquad (3)
where A is the ratio of the power consumed by the cooling infrastructure across components, systems, and racks to
the power consumed in the chiller compressor:
A = \frac{W_{rack} + W_{sys} + W_{components}}{W_{comp}} \qquad (3a)

[Figure 6 plot: pressure drop (Pa) versus flowrate (30-100 cfm) for predicted and modeled results for Systems A and B.]
Figure 6. Validation of system-level flow resistance curves. There is good agreement between the predicted (based on thermovolume component networks) and modeled (using CFD of full system model) pressure drop.

[Figure 7 plot: maximum component temperature rise (deg C) versus flowrate (30-100 cfm) for predicted and modeled results for Systems A and B.]
Figure 7. Validation of system-level thermal resistance curves. There is good agreement between the predicted (based on thermovolume component networks) and modeled (using CFD of full system model) temperature rise across the processor.
B is the ratio of power consumed by the blowers in CRAC
units to the chiller compressor power:
B = \frac{W_{CRAC}}{W_{comp}} \qquad (3b)
C is the ratio of the power consumed by primary and
secondary pumps to the chiller compressor power:
C = \frac{W_{pump}}{W_{comp}} \qquad (3c)
and D is the ratio of power consumed by blowers and
pumps at the cooling tower to the chiller compressor
power:
D = \frac{W_{cooling\ tower}}{W_{comp}} \qquad (3d)
In the above equation, the heat dissipated within the
data center (Qdc) is a function of the computational
workload which needs to be supported, and can be treated
as fixed in the present case study. In addition, we do not
alter the cooling infrastructure outside the data center (i.e.,
pumps, compressors and cooling tower) so that the
coefficients C and D are fixed to a first order. (In practice,
the efficiency of the infrastructure may change as a
function of the air temperature returned to the CRAC units
from the hot aisle [19, 20], but we ignore these considerations in the present work.) Subject to these assumptions, the
problem at hand is how to arrange the 6,400 components within the data center to minimize the energy consumption
by the component and system fans, the CRAC units, and the compressor in the chiller. That is, our objective
function for minimization is the following:
F = \frac{1}{COP} \propto \left(1 + A + B\right) W_{comp} \qquad (4)
To implement the above optimization scheme, we outline a range of available parameters within each level (e.g.,
choices regarding the (x,y) location of how components should be placed within systems and the z location of a
system within a given rack of the data center). For each resulting configuration, a resistance network is created. For
a given parameter setting, this network resistance can be solved for minimal flow work subject to thermal resistance
(temperature) constraints as described in Section III. In order to find the optimal parameter settings over the range of
available inputs, a generalized reduced gradient algorithm is used for the above closed-form problem.
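For illustration, the sketch below evaluates the objective of Eq. (4) over a single continuous layout parameter and minimizes it subject to a component-temperature constraint. The toy network model and the use of SciPy's SLSQP solver are stand-ins (assumptions) for the thermovolume network solve and the generalized reduced gradient algorithm described above.

from scipy.optimize import minimize

W_COMP = 40e3      # chiller compressor power, W (assumed)
T_LIMIT = 85.0     # component temperature limit, deg C (assumed)

def network_model(x):
    """Toy stand-in for the thermovolume network solve: returns (A, B, max component temp)."""
    A = 0.10 + 0.30 * x ** 2      # fan power ratio grows with the flow work demanded
    B = 0.20 + 0.05 * x           # CRAC blower power ratio
    t_max = 95.0 - 30.0 * x       # more airflow lowers the peak component temperature
    return A, B, t_max

def objective(xvec):
    A, B, _ = network_model(xvec[0])
    return (1.0 + A + B) * W_COMP               # Eq. (4), with C, D and Q_dc held fixed

def temp_margin(xvec):
    return T_LIMIT - network_model(xvec[0])[2]  # must remain >= 0

res = minimize(objective, x0=[0.5], bounds=[(0.0, 1.0)],
               constraints=[{"type": "ineq", "fun": temp_margin}], method="SLSQP")
print(res.x, res.fun)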
To illustrate the results, we show three different configurations in Fig. 8(a)-8(c) that were considered in the case
study. We find that optimizing the placement of components within each system effectively gives rise to two
different types of systems: a system containing components with low thermal dissipation (“low density”) such as
memory and hard disk drives, and a system containing components with high thermal dissipation (“high density”)
such as processors and chipsets. The average rate of heat dissipation from a high-density system (250 W) is
approximately twice that from the low-density systems (126 W). In addition to the larger airflow required to remove
heat from the high-density systems, the flow work required across these is higher due to the presence of heat sinks
and other components with a large pressure drop.
[Figure 8 panels: layouts of low-density and high-density systems and racks for configurations (a), (b), and (c).]
Figure 8. Examples of different system layouts: (a) DC1, (b) DC2, (c) DC3.
The thermal architecture of the systems is fixed by the allocation
of these high- and low-density components within the systems, since
we do not include optimization of the thermal management of the
components themselves (such as redesigning the heat sink) within the
scope of the present work. Then, the primary optimization knob
available becomes the placement of these high- and low-density
systems within the racks. The choice of these placements will result in
different data center configurations. Figure 8(a) shows one
configuration (DC1) where all systems of similar density are placed in
a rack, and then the downstream rack contains systems of the opposite
type of density. In the configuration DC2 of Fig. 8(b), the lower half
of a rack contains systems of similar density while the upper half of
the rack contains systems of the opposite type. Unlike DC1, however,
where the low-density and high-density racks had different levels of
heat dissipation, all racks in DC2 have the same heat density. In Fig.
8(c), the low-density racks and high-density racks of DC1 are maintained but arranged differently in the sense that
all racks of similar density are placed downstream, and racks of differing density are placed adjacent rather than
downstream.
Each of the above data center configurations processes the same computational load and thus dissipates the same
amount of heat. But as shown in Fig. 9, the efficiency of the cooling infrastructure vastly differs depending on the
arrangement within the data center. We find that DC3 is the worst layout, operating at the lowest COP and in fact
generally infeasible to operate at supply air temperatures above 32 °C. The reason is that placing multiple high-
density systems back-to-back causes significant preheating as the air flows downstream. As a result, the temperature
difference between the downstream system inlet and maximum allowable temperature in DC3 is quite small, and
very high flowrates are required to effectively remove the heat dissipated by the downstream systems. This
additional flow work leads to a lower COP.
By contrast, DC2 has a COP that is almost 4X higher and allows energy-efficient operation without violating
thermal constraints even at elevated inlet temperatures above 35 °C. Given the flexibility of alternating airflow
through high-density and low-density systems, we are able to take advantage of higher allowable temperature
differences in the low-density systems without requiring the higher flow work resulting from a configuration where
several high-density systems are placed downstream from each other.
Beyond the optimal configuration itself, it is worth noting that once the initial curves had been
defined – which took on the order of approximately 24 hours – the entire simulation and optimization routine took
on the order of 20 minutes on a standard notebook computer (2.0 GHz dual-core processor with 2 GB RAM). By
comparison, each detailed numerical (CFD) model of just a single system took on the order of 4-6 hours and over 12
hours to simulate a single facility. Thus, even if it were possible to somehow define an optimization routine (e.g. via
design of experiments) with standard CFD simulation capabilities, exploring the entire design space using traditional
numerical modeling methods would take on the order of about 3-4 weeks instead of the roughly 1 day required with
the proposed methodology. For larger production data center facilities, which are often a factor of 1000X larger than
the current system, the computational expense of configuring the optimal data center using CFD across multiple
length scales would be prohibitive. Thus, the proposed methodology can potentially enable improved efficiency in
the design of large multi-scale data centers.
V. Summary and Future Work
In this paper, we have developed a methodology to rapidly simulate the thermal behavior of the large number of electronic components that exist within the multi-level hierarchy of commodity computer equipment. The method, built upon the previously proposed notion of thermo-volumes,
essentially involves establishing a baseline configuration using traditional CFD models and then iterating perturbed
configurations rapidly using a small set of control parameters in order to obtain approximate results. The method
was demonstrated on the test case of a single data center, with good agreement between the proposed method and
traditional full-scale CFD models.
Longer-term, we wish to explore applicability to additional types of hierarchical infrastructures including more
traditional thermal architectures. Particularly, we are interested in extending the current work to include thermal
components within the full-scale data center facility (e.g., heat exchangers in CRAC units, compressors in the
chiller, cooling tower, etc.). Another key limitation of the present work is that we have only explored a single heat transfer mechanism (i.e., forced convection) at each level in the hierarchy.

Figure 9. COP of different thermal solutions. DC2 is the optimal layout.

While we expect the methodology should
adequately handle additional thermal transport mechanisms, this needs to be verified in future work. Ultimately, our
goal is to explore the suitability of the thermovolume approach for hierarchies that encompass multiple heat transfer
modes within and across the boundaries of the hierarchy for generalized systems with thermal transport, including
but not limited to IT infrastructures such as data centers.
References
[1] EPA, Report to Congress on Server and Data Center Efficiency, Public Law 109-431, U.S. Environmental Protection Agency, Washington D.C., 2007.
[2] Koomey, J.G., Estimating Total Power Consumption by Servers in the U.S. and the World, Technical Report, available http://enterprise.amd.com/Downloads/svrpwrusecompletefinal.pdf, 2008.
[3] Samadiani, E., Joshi, Y., Hamann, H., Iyengar, M.K., and Bonilla, M., "Reduced Order Thermal Modeling of Data Centers via Distributed Sensor Data," Proceedings of the ASME-Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS (InterPACK'09), San Francisco, CA, July 2009.
[4] Haider, S., Burton, L., and Joshi, Y., "A Proper Orthogonal Decomposition Based System-Level Thermal Modeling Methodology for Shipboard Power Electronics Cabinets," Heat Transfer Engineering, Vol. 29, No. 2, pp. 198-215, 2008.
[5] Rambo, J., and Joshi, Y., "Modeling of Data Center Airflow and Heat Transfer: State of the Art and Future Trends," Distributed and Parallel Databases, Vol. 21, No. 2-3, pp. 193-225, 2007.
[6] Betul, A.-S., Karlik, B., Bali, T., and Ayhan, T., "Neural Network Methodology for Heat Transfer Enhancement Data," International Journal of Numerical Methods for Heat and Fluid Flow, Vol. 17, No. 8, pp. 788-798, 2007.
[7] Cortes, O., Urquiza, G., Hernandez, J.A., and Cruz, M.A., "Artificial Neural Networks for Inverse Heat Transfer Problems," Proceedings of the 2007 IEEE Conference on Electronics, Robotics and Automotive Mechanics (CERMA), Cuernavaca, Mexico, pp. 198-201, September 2007.
[8] Shrivastava, S.K., Van Gilder, J.W., and Sammakia, B., "Data Center Cooling Prediction using Artificial Neural Network," Proceedings of the ASME-Pacific Rim Technical Conference and Exhibition on Packaging and Integration of Electronic and Photonic Systems, MEMS, and NEMS (InterPACK'07), Vancouver, BC, July 2007.
[9] Sharma, R., Marwah, M., Lugo, W., and Bash, C., "Application of Data Analytics to Heat Transfer Phenomena for Optimal Design and Operation of Complex Systems," Proceedings of the ASME Summer Heat Transfer Conference, San Francisco, CA, July 2009.
[10] Bautista, L., and Sharma, R., "Analysis of Environmental Data in Data Centers," Technical Report No. HPL-2007-98, Hewlett Packard Laboratories, Palo Alto, CA, June 2007.
[11] Toulouse, M., Doljac, G., Carey, V.P., and Bash, C., "Exploration of a Potential-Flow-Based Compact Model of Air-flow Transport in Data Centers," Proceedings of the ASME International Mechanical Engineering Congress and Exposition (IMECE), Lake Buena Vista, FL, November 2009.
[12] Hamann, H.F., López, V., and Stephanchuk, A., "Thermal Zones for More Efficient Data Center Energy Management," Proceedings of the 13th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Las Vegas, NV, June 2010.
[13] Patel, C., and Belady, C., "Modeling and Metrology in High Performance Heat Sink Design," Proceedings of the 47th IEEE Electronic Components and Technology Conference (ECTC), San Jose, CA, pp. 296-302, 1997.
[14] Bash, C., and Patel, C., "Modeling and Metrology for Expedient Analysis and Design of Computer Systems," International Journal of Microcircuits and Electronic Packaging, Vol. 22, No. 2, pp. 157-163, 1999.
[15] Moody, L.F., "Friction Factors for Pipe Flow," Trans. ASME, Vol. 66, No. 8, pp. 671-684, 1944.
[16] Shah, A.J., and Patel, C.D., "Designing Environmentally Sustainable Electronic Cooling Systems Using Exergo-Thermo-Volumes," International Journal of Energy Research, Vol. 33, pp. 1266-1277, March 2009.
[17] Hewlett Packard, HP Proliant DL385 Generation 5 Overview, available http://h18004.www1.hp.com/products/quickspecs/13012_na/13012_na.html, 2009.
[18] Patel, C.D., Sharma, R.K., Bash, C.E., and Beitelmal, M.H., "Energy Flow in the Information Technology Stack: Introducing the Coefficient of Performance of the Ensemble," Proceedings of the ASME International Mechanical Engineering Congress and Exposition, Chicago, IL, November 2006.
[19] Breen, T.J., Walsh, E.J., Punch, J., Shah, A.J., and Bash, C.E., "From Chip to Cooling Tower Data Center Modeling: Part I, Influence of Server Inlet Temperature and Temperature Rise across Cabinet," Proceedings of the 13th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Las Vegas, NV, June 2010.
[20] Walsh, E.J., Breen, T.J., Punch, J., Shah, A.J., and Bash, C.E., "From Chip to Cooling Tower Data Center Modeling: Part II, Influence of Chip Temperature Control Philosophy," Proceedings of the 13th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), Las Vegas, NV, June 2010.
[21] Patel, C., Bash, C., Belady, C., Stahl, L., and Sullivan, D., "Computational Fluid Dynamics Modeling of High Compute Density Data Centers to Assure System Inlet Air Specifications," Proceedings of the ASME-Pacific Rim Electronic Packaging Technical Conference and Exhibition (InterPACK'01), Kauai, HI, July 2001.
[22] Schmidt, R., and Cruz, E., "Raised Floor Computer Data Center: Effect on Rack Inlet Temperatures of Chilled Air Exiting Both the Hot and Cold Aisles," Proceedings of the 8th Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITHERM), San Diego, CA, June 2002.