
Leveraging Thermal Dynamics in Sensor Placement for Overheating Server Component Detection

Xiaodong Wang†, Xiaorui Wang†, Guoliang Xing‡ and Cheng-Xian Lin∗
†The Ohio State University, USA   ‡Michigan State University, USA   ∗Florida International University, USA
†{wangxi, xwang}@ece.osu.edu   ‡[email protected]   ∗lincx@fiu.edu

Abstract—Server overheating has become a well-known issue in today's data centers that host a large number of high-density servers. The current practice of server overheating detection is to monitor the server inlet temperature with the temperature sensor on the server enclosure, or the CPU temperature with on-die thermal sensors. However, this is in contrast to the fact that different components in a server may have different overheating thresholds, which are closely related to their respective thermal failure rates and expected lifetimes. Moreover, the thermal correlation between the inlet (or CPU) and other server components can be different for every server model. As a result, relying on a single inlet or CPU temperature for server overheating detection is over-simplistic, which may lead to either degraded detection performance or false alarms that can result in excessive cooling power, leading to unnecessarily low inlet temperature.

In this paper, we propose a model-based approach that leverages thermal dynamics to intelligently choose sensor placement locations for precise overheating server component detection. We first formulate the detection problem as a constrained optimization problem. We then adopt Computational Fluid Dynamics (CFD) to establish the thermal model and analyze the thermal status of the server enclosure under various overheating conditions, such as inlet overheating, fan failures and CPU overloading. Based on the CFD analysis, we apply data fusion and advanced optimization techniques to find a near-optimal solution for sensor placement locations, such that the probability of detecting different overheating components is significantly improved. Our empirical results on a real rack server testbed demonstrate the detection performance of our solution. Extensive simulation results also show that the proposed solution outperforms other commonly used overheating monitoring solutions in terms of detection probability and error rate.

I. INTRODUCTION

In recent years, server overheating has become one of the most important concerns in large-scale data centers. Due to considerations such as real estate and integrated management, data centers continue to increase their computing capabilities by deploying high-density servers (e.g., blade servers). As a result, the increasingly high server and thus power densities can lead to some serious problems. First, the reduced server space may result in a greater probability of thermal failures for various components within the servers, such as processors, hard disks, and memories. Such failures may cause undesired server shutdowns and service disruption. Second, even though some components may not fail immediately, their lifetimes may be significantly reduced due to overheating. It is reported in [1][2][3] that the lifetime of an electronic device decreases exponentially with the increase of the operating temperature. Finally, the generated heat dissipation can also lead to negative environmental implications. Therefore, it is important for each server component to run at a temperature below its overheating threshold.

However, in today's data centers, how to precisely detect whether any component in a server is overheating remains an open question. The current practice of detecting and monitoring an overheating server can be divided into two categories. The first category is a coarse-grained approach that only uses the temperature at a proxy component (e.g., the CPU [4]) or at a fixed location (e.g., the server inlet) for server overheating monitoring. This is in contrast to the fact that different components in a server may have different overheating thresholds, which are closely related to their respective thermal failure rates and expected lifetimes. Relying on a single threshold at the server inlet or at the proxy component is therefore over-simplistic, because the thermal correlation between the inlet (or the proxy component) and each server component can be different for every server model. As a result, monitoring only the inlet temperature or a proxy component, such as the CPU, may lead to either missed detection of overheating for components other than the CPU, resulting in degraded system lifetime, or false alarms that result in excessive cooling power to unnecessarily lower the inlet temperature.

The second category of server thermal monitoring approaches assumes that each component has its own built-in thermal sensor. Extensive research [5][6][7][8] on server thermal management has recently been conducted based on this assumption. Unfortunately, today's high-density servers are not equipped with a thermal sensor on every component. In most servers, only the processors have on-die sensors, while some memory chips may also have built-in sensors. Therefore, it is important to provide a mechanism for measuring the temperatures of other components (e.g., hard disks, network chips), such that the previously proposed thermal management schemes can work effectively. More importantly, even if every component has its own thermal sensor, those sensors are used only for the control loops of those components in an isolated way. As a result, they cannot provide a system-level thermal picture that can help the fan system of the server and the cooling systems in the data center to efficiently cool down overheating components. Furthermore, low-end sensors used in server components commonly have measurement noise and hardware biases that may lead to failed detection or false alarms. Recent studies [9][10] have shown that the collaborative data fusion of multiple sensors can significantly improve detection accuracy. Therefore, it is preferable to have server-level thermal monitoring with multiple sensors that can precisely detect overheating components.

978-1-4673-2154-9/12/$31.00 © 2012 IEEE

In this paper, we propose to leverage the thermal dynamics in a server to intelligently place sensors for precise overheating server component detection. Our sensor placement solution features a model-based approach, which adopts Computational Fluid Dynamics (CFD) as a theoretical foundation to establish the thermal model and analyze the thermal status of the server enclosure under various overheating conditions. CFD is a powerful fluid dynamics analysis approach and is widely used to analyze fluid dynamics in various engineering fields, such as aircraft engine design and thermal analysis for buildings. CFD has already been used by computer system packaging designers to make intelligent decisions on server component layout design, but not yet for sensor placement in the server box. While CFD-based thermal monitoring has shown promise, a key limitation of CFD is its high computation overhead. As a result, CFD cannot be effectively used to report thermal emergencies in real time. In this work, we propose to use CFD to analyze the thermal dynamics offline and then optimally place sensors based on the analysis results to conduct online overheating detection. Such an integrated approach enables us to achieve the benefits of both the systematic modeling of thermal dynamics (from CFD) and online measurement calibration with fast responsiveness (from sensors). Our solution provides a way to equip the existing servers deployed in data centers with external sensors for more accurate overheating monitoring. The proposed solution can also be used on future servers to place more sensors on the motherboard during the design phase.

In our integrated thermal monitoring solution, we first use CFD to model the thermal environment of a given rack server box under different overheating conditions, including inlet overheating, fan failure and CPU overloading. We then calculate the most correlated regions in the server box for each specific component by correlation analysis. Accordingly, for a given number of sensors, we seek to place them in the server box such that the overheating components can be detected with the maximum detection probability, while the error rate of the detection is bounded. We formulate this problem as a constrained optimization problem. Based on the CFD analysis, we design a heuristic algorithm to find a near-optimal sensor placement solution. In our algorithm, we apply data fusion techniques to allow sensors to make collaborative detection decisions on server component overheating. Specifically, the contributions of this paper are four-fold.

• While current thermal monitoring solutions rely on simplistic sensor placement, i.e., a single sensor at the inlet or the CPU, we propose a novel sensor placement scheme that intelligently places sensors to maximize the overheating detection probability of each server component of interest.

• We use CFD analysis as a theoretical foundation to design our proposed sensor placement scheme. Our CFD analysis models the thermal dynamics of a rack server box in various overheating scenarios, including inlet overheating, CPU overloading, and fan failure.

• We formulate optimal sensor placement as a constrained optimization problem and propose a heuristic algorithm to find a near-optimal solution. Temperature correlation analysis is conducted to find the most correlated regions for each server component.

• We evaluate our sensor placement scheme in a real-world rack server box. Both our empirical and simulation results demonstrate that our placement solution can significantly improve overheating detection performance.

The remainder of this paper is organized as follows. Section II highlights the distinction of our work by discussing related work. Section III presents the data fusion model, the formulation of the server overheating detection problem, as well as the temperature threshold setting for each component. Section IV introduces the fundamentals of the Computational Fluid Dynamics approach and provides an example of how to model a rack server box. Section V elaborates on how to use the analytical results from CFD in our sensor placement problem and proposes a heuristic algorithm to solve the problem. In Section VI, we introduce our experimental methodology and then evaluate our sensor placement scheme using both simulation and experiments on a hardware testbed. Section VII concludes the paper and discusses possible future work.

II. RELATED WORK

Thermal management for computer systems has been widely studied in the past. Skadron et al. have proposed a temperature-aware microprocessor management tool, HotSpot [11], which uses thermal resistances and capacitances to model the temperature of microprocessors. The performance and thermal behaviors of storage systems are extensively studied in [12], which identifies the knob for temperature optimization of high-speed disks. Lin et al. [8] have proposed a software thermal management scheme for DRAM memory, which has been implemented on real machines. However, few studies address joint thermal monitoring and management across different system components. Jeohwang et al. have modeled the thermal profile of an operating server system and a rack in [13] to provide a bridge between individual component thermal status and the data center thermal profile. Different from all the previous work that addresses a single component individually, our work focuses on the joint thermal monitoring of multiple components in a single rack server system.

Sensors have been deployed to conduct thermal management in computer systems. The existing thermal management with sensors can be categorized into two classes. The first class deploys sensors in server rooms and large data centers for environment temperature monitoring. For example, a hybrid wired and wireless sensor network is used in [14] for data center thermal monitoring. Sensors are also used in [9] to detect overheating servers at the single-system level. The second class deploys sensors inside or around different computer components for component-specific thermal monitoring. For example, current CPU thermal management schemes deploy on-die thermal sensors to monitor the CPU temperature at runtime [15]. Temperature sensor circuits have also been adopted in DRAM designs to provide thermal monitoring for memory chips [16]. Chip-level thermal profiles are also studied in [17] using runtime temperature sensor readings. Our work is different from all the aforementioned research. We use Computational Fluid Dynamics (CFD) and the temperature correlation of different components to guide sensor placement, such that the efficiency of thermal emergency detection can be maximized.

Different sensor deployment approaches for improved monitoring and detection performance have also been studied before. A sensor placement scheme based on the Multivariate Gaussian Process model is proposed in [18]. Though it provides informative monitoring results, an offline training stage before the actual deployment is required. This is not feasible for thermal monitoring of production server systems, because thermal emergencies should not be created just to collect training data. A fast sensor placement approach for fusion-based target detection is also proposed in [10] to minimize the number of deployed sensors while achieving assured detection performance. Different from the aforementioned work, we propose a new model-based sensor deployment approach, which leverages the theoretical computational results from CFD to maximize the detection performance for server component thermal emergencies.

III. OVERHEATING SERVER COMPONENT DETECTION

In this section, we first introduce the detection model for overheating server components. We then formulate overheating server component detection as a constrained optimization problem. Lastly, we introduce how to set the overheating temperature threshold for each component.

A. Overheating Component Detection Model

In the design of a computer system, it is always desirable to optimize the cooling efficiency of the system. However, due to differences in functionality and variance in manufacturing processes, each component in the system usually requires a different safe operating environment temperature. Therefore, in order for the computer system to operate more efficiently and safely, the operating environment temperature of each component should be monitored separately based on its own requirement. Ideally, an individual thermal monitoring and cooling mechanism should be provided for each single component. For example, current CPU designs incorporate on-die thermal sensors, such that the temperature of the CPU chip can be monitored at runtime. Moreover, a heat sink is usually attached on top of the CPU chip to increase the air flow rate over the CPU, such that the cooling efficiency can be improved. Unfortunately, there is usually no such on-die sensor embedded in other components, such as memory chips and network chips. Therefore, new techniques are needed to monitor the operating environments of all the components, such that their overheating conditions can be detected and reported promptly. In this paper, we propose to place additional sensors into the computer system box to monitor the operating environment temperatures of all the components in the computer system.

With all the components and cooling equipment running, the thermal environment inside a computer box is complex, which can cause more noise in the sensor readings. Furthermore, the number of sensors that can be placed into a high-density server box is limited, as one wants to maximize the space utilization for all kinds of server components and avoid complex wiring and costly installation in the already compact server box. Thus, the additional sensor nodes added to the server box should collaborate with each other to maximize their utility. To address these challenges, we adopt data fusion [19], a widely adopted collaborative sensing technique, to jointly process noisy data from multiple sensors.

It is clear that temperatures at locations distant from a component are less likely to be correlated with the ambient temperature of that component. Therefore, we define a fusion region for each monitored component as a disc with a fusion radius R, where the monitored component is located at the center of the disc. The sensors within the fusion region of a monitored component should collaborate to make the overheating detection decision for that component. Moreover, because of the complex air flows inside the system, temperatures at different locations within the fusion region have different correlations with the ambient temperature of the monitored component. For example, based on the air flow direction, temperatures at locations behind the CPU are more correlated with the CPU ambient temperature than temperatures at locations in front of the CPU. Therefore, we further define a correlation threshold Th(i, j) for each pair of location i and component location j. To contribute to the ambient temperature monitoring of component j, a sensor should be placed at a location i within the fusion radius of component j whose correlation value is larger than Th(i, j).

To decide the ambient temperature at the monitored component location, we adopt a data fusion scheme that calculates the average of all the temperatures reported by the sensors that meet the above two criteria. We compare the average temperature value with a detection threshold η_j. If the average temperature is higher than the threshold, the component is deemed to be operating in an overheating environment. The ambient temperature T_j of component j can be derived from the temperature reading T_i at the location (x_i, y_i) of sensor i. The approach we use to derive the temperature T_j is explained in Section V-B. For now, we denote this derivation as T_j = f_j(T_i). Measurement noise is usually included in the sensor readings. Denote the measurement noise strength measured by sensor i as N_i, which follows a zero-mean normal distribution with variance σ², i.e., N_i ∼ N(0, σ²) [18]. We assume that all the temperature sensors are identical, such that they follow the same measurement distribution. The final reported temperature for the location of component j can be presented as

$$T_j = f_j(T(x_i, y_i)) + N_i^2 \quad (1)$$

where N_i² is the noise in energy form. The noise term stays outside the transformation f_j since it is additive to the real temperature reading.
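The two placement criteria and the averaging fusion rule above can be sketched in Python as follows. This is a minimal illustration, not the paper's implementation: the sensor list, the correlation map, and the identity placeholder used for f_j are all hypothetical (the paper derives f_j from CFD analysis in Section V-B).

```python
import math

def detect_overheating(sensors, component, radius, corr, corr_th, eta):
    """Fuse eligible sensor readings and compare their average against
    the detection threshold eta for one monitored component.

    sensors:   list of (x, y, temperature) tuples
    component: (x, y) location of the monitored component
    corr:      dict mapping sensor location -> correlation with the component
    corr_th:   minimum correlation Th(i, j) required to participate
    """
    cx, cy = component
    fused = []
    for (x, y, temp) in sensors:
        # Criterion 1: the sensor lies inside the fusion region (disc of radius R).
        if math.hypot(x - cx, y - cy) > radius:
            continue
        # Criterion 2: its location is sufficiently correlated with the component.
        if corr.get((x, y), 0.0) < corr_th:
            continue
        # f_j maps a sensor reading to the component's ambient temperature;
        # the identity map is used here as a placeholder assumption.
        fused.append(temp)
    if not fused:
        return False  # no eligible sensors, so no detection decision
    return sum(fused) / len(fused) > eta
```

For instance, two nearby well-correlated sensors reading 55 and 58 fuse to 56.5, which trips a threshold of 50 but not one of 60, while a distant sensor is excluded by the fusion radius.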

Assuming there are n_j sensors within the data fusion group of the component at location j, the probability of detecting that the location is overheating can be calculated as

$$P_{D_j} = P\left(\frac{1}{n_j}\sum_{i=1}^{n_j}\left[f_j(T(x_i, y_i)) + N_i^2\right] > \eta_j\right) \quad (2)$$

where η_j is the detection threshold of overheating for the component at location j. Because of the measurement noise from the sensor device, η_j includes both the real temperature threshold for a component, denoted as C_j, and the measurement noise. With a high noise level in the measurements, a detection system is likely to report a false alarm when there is no real event. In our case, we define the false alarm rate, when the environment of the monitored component is actually not overheating, as follows:

$$P_{F_j} = P\left(\frac{1}{n_j}\sum_{i=1}^{n_j}\left(N_i^2 + C_j\right) > \eta_j\right) \quad (3)$$

We assume Gaussian noise, i.e., N_i/σ ∼ N(0, 1). Therefore, Σ_{i=1}^{n_j} (N_i/σ)² follows the chi-square distribution with n_j degrees of freedom, whose CDF is denoted as χ_{n_j}(·). Hence, Equations (2) and (3) can be rewritten as follows:

$$P_{D_j} = 1 - \chi_{n_j}\left(\frac{n_j\eta_j - \sum_{i=1}^{n_j} f_j(T(x_i, y_i))}{\sigma^2}\right) \quad (4)$$

$$P_{F_j} = 1 - \chi_{n_j}\left(\frac{n_j(\eta_j - C_j)}{\sigma^2}\right) \quad (5)$$
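Equations (4) and (5) can be sanity-checked numerically. The sketch below, a hypothetical stand-alone illustration rather than the paper's code, estimates the false alarm rate of Equation (3) by Monte Carlo simulation. For n_j = 4, σ = 1, C_j = 40 and η_j = 40.5, Equation (5) gives P_F = 1 − χ₄(2) ≈ 0.736, which the simulation should approach.

```python
import random

def empirical_false_alarm_rate(n_j, sigma, c_j, eta_j, trials=20000, seed=1):
    """Estimate P_F of Equation (3) by simulating n_j identical sensors whose
    readings contain only the true threshold temperature C_j plus squared
    Gaussian noise N_i^2, with N_i ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    alarms = 0
    for _ in range(trials):
        avg = sum(rng.gauss(0.0, sigma) ** 2 + c_j for _ in range(n_j)) / n_j
        if avg > eta_j:
            alarms += 1
    return alarms / trials
```

Raising η_j drives the estimated false alarm rate down, matching the monotonicity used in the derivation of Equation (8).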

B. Problem Formulation

We assume that there are M components in a computer server whose operating ambient temperatures need to be monitored. Given N sensors (N ≤ M), we need to find the placement of these N sensors such that we can detect an overheating emergency at any of the M locations with the highest possible confidence. We assume N ≤ M because it is preferable to place as few sensors as possible in the server box for thermal monitoring purposes, considering the complexity and high cost of the wiring design on the motherboard. Our goal is to maximize the average detection probability over all the monitored locations:

$$\max \frac{1}{M}\sum_{1 \le j \le M} P_{D_j} \quad (6)$$

subject to the following constraint:

$$P_{F_j} \le \alpha \quad \forall\, 1 \le j \le M \quad (7)$$

where α is the tolerable bound on the detection false alarm rate. We note that the false alarm rate needs to be bounded in many practical scenarios in order to reduce the waste of system resources. For a given sensor placement, P_{F_j} ≤ α is a necessary condition in our problem. Using Equation (5), we convert the constraint in Equation (7) into a constraint on the detection threshold η_j at monitored location j:

$$\eta_j \ge \frac{\sigma^2 \chi_{n_j}^{-1}(1-\alpha)}{n_j} + C_j$$

where χ⁻¹(·) is the inverse function of χ(·). Using this inequality, we can obtain the threshold that satisfies the false alarm rate bound while maximizing the detection probability. From Equation (4) we know that P_{D_j} decreases when η_j increases. Therefore, to maximize the detection probability, we remove the inequality in the constraint and use the lower bound of η_j. Hence, η_j can be calculated as

$$\eta_j = \frac{\sigma^2 \chi_{n_j}^{-1}(1-\alpha)}{n_j} + C_j \quad (8)$$
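Equation (8) can be evaluated without a statistics library by replacing the inverse chi-square CDF χ⁻¹_{n_j}(1 − α) with an empirical quantile. The following Python sketch is an illustration under that substitution; all parameter values are hypothetical.

```python
import random

def detection_threshold(n_j, sigma, c_j, alpha, samples=50000, seed=7):
    """Compute eta_j of Equation (8), approximating the inverse chi-square
    CDF with an empirical (1 - alpha) quantile so that only the standard
    library is needed."""
    rng = random.Random(seed)
    # A sum of n_j squared standard normals is chi-square with n_j dof.
    chi_samples = sorted(
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n_j))
        for _ in range(samples)
    )
    quantile = chi_samples[int((1.0 - alpha) * samples)]
    return sigma ** 2 * quantile / n_j + c_j
```

For example, with n_j = 4, σ = 1, C_j = 40°C and α = 0.05, the 95th percentile of the chi-square distribution with 4 degrees of freedom is about 9.49, giving η_j ≈ 42.4°C.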

C. Component Temperature Threshold

Before solving the problem in Section III-B, we need to set the overheating threshold for each component in the system. Among all the factors that contribute to the lifetime of semiconductor devices, the operating junction temperature, i.e., the highest temperature inside the semiconductor device, is a critical deciding factor. With a higher junction temperature, devices tend to fail sooner. There has been research [11][1] studying the temperature-induced failure mechanisms of semiconductor devices. In most of the models studied, the operating junction temperature has an exponential impact on the failure rate λ of a device:

$$\lambda \propto \exp\left(-\frac{E_a}{k T_J}\right) \quad (9)$$

where k is Boltzmann's constant (8.617 × 10⁻⁵ eV/K), and E_a and T_J are the activation energy of electromigration and the operating junction temperature, respectively. The common activation energy for Al and Al with silicon is 0.6 eV.

Hardware components from manufacturers often come with a warranty period. For example, both Intel and AMD sell their products with a three-year warranty package. Note that this warranty period indicates the time during which the device should work properly without hard intrinsic failures, even when running under extreme conditions within the specification. However, as a common practice, computer systems usually serve for longer than three years, with upgrades to some components, such as adding new disks for larger storage space. To extend the working time, we need to lower the operating ambient temperature threshold of each component. Given the extended lifetime requirement t′ and the lifetime requirement t under warranty, we can use Equation (9) to calculate the new operating junction temperature threshold T′_J as

$$\frac{1}{T'_J} = \frac{k}{E_a}\ln\left(\frac{t'}{t}\right) + \frac{1}{T_J} \quad (10)$$

In this work, we use sensors to monitor the temperature of the operating environment, which is the ambient temperature of a working component. The ambient temperature T_A can be calculated from the junction temperature T_J in Equation (9) as

$$T_A = T_J - P \times \theta_{JA} \quad (11)$$

where P is the operating power of the device and θ_JA is the junction-to-ambient thermal resistance [20].

Fig. 1. The DELL PowerEdge 2950 2U rack server used in our hardware testbed. The yellow boxes are the chips whose operating environment temperatures need to be monitored. The red dashed box in the lower picture highlights the front panel assembly of the server. The red dashed box in the upper picture highlights the temperature sensor used by the DELL server to monitor the temperature at the inlet. Except for the CPU and memory, the chips whose temperatures need to be monitored are indexed and highlighted with yellow boxes.

Based on all the above derivations and the related values from the data sheets of different components, we set the operating environment temperature threshold C_j for component j in our work by one of the following three methods: 1) Directly taken from the datasheet. For some of the components in the computer system, the maximum operating environment temperature is listed in the datasheet or the manual. Figure 1 shows the platform used in our experiments. It is a 2U DELL rack server equipped with an AMD Opteron 2222SE Dual-Core processor. The maximum operating temperature listed on the datasheet for this type of CPU is 69◦C. 2) Converted from the junction temperature threshold. For example, the maximum junction temperature and the junction-to-ambient thermal resistance for the Lattice ispMACH CPLD chip in our system are 75◦C and 41.8◦C/W, respectively. Applying Equations (10) and (11) with a lifetime requirement of 7 years, we get an ambient threshold of 60◦C. 3) For chips of unknown type, or chips whose datasheets are not available, we use 43◦C, the default System Board Ambient Temperature setting required by OpenManage, DELL's server management tool.
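The conversion chain of Equations (10) and (11) can be illustrated with a short calculation. The sketch below is an illustration, not the paper's tooling; since the excerpt does not list the CPLD's operating power P, the ambient step would use a hypothetical value for P.

```python
import math

K_BOLTZMANN = 8.617e-5  # Boltzmann's constant in eV/K

def derated_junction_threshold(t_j_max_c, e_a_ev, t_warranty_yr, t_target_yr):
    """Equation (10): 1/T'_J = (k/E_a) * ln(t'/t) + 1/T_J.
    Temperatures are converted to Kelvin for the reciprocal form."""
    t_j_k = t_j_max_c + 273.15
    inv = (K_BOLTZMANN / e_a_ev) * math.log(t_target_yr / t_warranty_yr) + 1.0 / t_j_k
    return 1.0 / inv - 273.15

def ambient_threshold(t_j_c, power_w, theta_ja):
    """Equation (11): T_A = T_J - P * theta_JA (theta_JA in degC/W)."""
    return t_j_c - power_w * theta_ja
```

With the CPLD numbers from the text (T_J = 75◦C, E_a = 0.6 eV, 3-year warranty extended to 7 years), Equation (10) derates the junction threshold to roughly 61◦C, consistent with the ≈60◦C ambient threshold reported once the (small) P · θ_JA term is subtracted.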

IV. CFD MODELING FOR SERVER BOX AND COMPONENTS

In this section, we first introduce Computational Fluid Dynamics (CFD), the tool we use to analyze the thermal environment inside the server box. We then provide an example to demonstrate how to model a server box and each of its components in practice using Fluent [21], a widely used CFD modeling software package.

A. CFD Modeling

CFD is a fluid mechanics approach that analyzes the properties of fluid flows based on numerical methods and algorithms. The key to CFD modeling is to solve the governing transport equations represented in the following conservation law form:

∂(ρϕ)/∂t + ∂(ρUjϕ)/∂xj = ∂/∂xj (Γϕ,eff ∂ϕ/∂xj) + Sϕ    (12)

where ϕ represents different parameters such as mass, velocity, temperature, or turbulence properties; ρ is the fluid (air) density; t is the time for transient simulations; xj is the coordinate

Fig. 2. Colored temperature map (◦C) of the DELL server running CPU-intensive benchmarks. The small black boxes indicate all the chips whose temperatures need to be monitored. The large box in the middle is the CPU sink. The four vertical short lines in the middle represent the four system fans. The four horizontal thin blocks underneath the CPU sink represent the memory modules. The temperature of the memory module closest to the CPU sink is also required to be monitored. The disk is on the left side of the graph.

variable for x, y, or z with j being 1, 2, or 3; Uj is the velocity in different directions; Γ is the diffusion coefficient; and S is the source term for the particular variable. For example, when ϕ is the air temperature, S stands for the volumetric heat rate from a source component. The four terms of the equation represent the transient, convection, diffusion, and source parts of the transport phenomenon in the spatial domain [22].

The partial differential equations represented by Equation (12) form a coupled system in which all the transport equations need to be solved simultaneously. For a complicated environment, such as a server enclosure, closed-form solutions are hard to find for the airflow and heat transfer of the entire system. Therefore, the most fundamental consideration in CFD is how to treat a continuous fluid in a discretized fashion, such that numerical methods can be applied to find the solutions. Most CFD software packages apply the control volume method to find numerical solutions.
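As a minimal illustration of the control volume idea (not the full 3-D turbulent model solved by Fluent), the following sketch discretizes a 1-D steady reduction of Equation (12), keeping only the diffusion and source terms, and solves the resulting tridiagonal linear system. All parameters are hypothetical.

```python
import numpy as np

# 1-D steady reduction of Equation (12): diffusion plus source only,
#   d/dx (k dT/dx) + S = 0,  with T(0) = T(L) = 20 deg C.
# All parameters below are hypothetical, chosen for illustration.
n = 50                 # interior grid nodes (control volumes)
L = 0.5                # domain length (m)
dx = L / (n + 1)       # uniform grid spacing
k = 1.0                # diffusion coefficient (W/m-K)
S = 200.0              # uniform volumetric source (W/m^3)
T_left = T_right = 20.0

# The control-volume (central-difference) discretization yields a
# tridiagonal system: T[i-1] - 2*T[i] + T[i+1] = -S*dx^2/k.
A = (np.diag(np.full(n, -2.0))
     + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1))
b = np.full(n, -S * dx * dx / k)
b[0] -= T_left        # fold the fixed boundary temperatures
b[-1] -= T_right      # into the right-hand side
T = np.linalg.solve(A, b)   # temperature at the interior nodes
```

In a real CFD package the same assembly is done cell by cell in 3-D, with convection, turbulence, and coupled variables, which is why iterative solvers are used instead of a direct solve.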

B. Example of Server Box CFD Modeling

Using CFD to solve a continuous fluid model requires the discretization of the spatial domain into small cells. One method to perform this discretization is to generate a volumetric grid. After the discretization, the necessary boundary conditions and suitable algorithms need to be applied to solve the above-mentioned transport equations. Several popular software packages, such as Fluent, FLOTHERM, Flovent, and Phoenics, can be used for CFD modeling purposes. In our project, we use Fluent, a widely used CFD software package from ANSYS Inc., to perform the geometry meshing and solution finding.

The CFD model we establish in this example is for the DELL PowerEdge 2950 server box shown in Figure 1. In the first step, we use Gambit, a grid generator, to establish the geometry for this server. Basically, we choose different geometric shapes and perform unification or splitting to establish the geometric model for the entire server based on the real measured dimensions. Then we add different geometric shapes into the server box geometry to model the server components, such as the system fans and the CPU sink, according to their physical locations and corresponding dimensions. After all components are added into the geometric model, we need to specify the different boundary types, such as the server walls, the fans, and the inlets/outlets of the server

Page 6: Leveraging Thermal Dynamics in Sensor Placement for

box. The last step is to divide the entire geometric model into smaller-scale cells by applying geometry meshing in Gambit. The grid size is a user-specified parameter. With a finer grid, more accurate CFD modeling can be achieved. However, a finer grid increases the computational burden in the following stage, when the transport equations are solved by numerical methods. We use 1mm as the grid size to mesh the geometry. Although the CFD geometry model takes some time to generate because of the complicated component layout in the server box, we note that it is a one-time effort that can be reused for the analysis of all different overheating conditions for the same server, which is feasible for an offline sensor placement approach.

After meshing the entire server in Gambit, we export the grid to the second software package, Fluent, to solve the transport equations in Equation (12). Fluent requires all the boundary conditions of our geometric model to be specified. For example, we need to specify the power dissipation of each heat-dissipating component, such as the CPU, memory, disk, and all the other system chips. We also need to specify the inlet temperature and the system fan speed. After all the parameters are set up, the standard k-epsilon two-equation turbulence model is chosen to simulate the turbulent flow. Each simulation of one running condition takes about 20 minutes to finish. Figure 2 shows a colored cross-section temperature map after solving the transport equations in Fluent. This is a scenario in which all the components are running under the power settings specified in their datasheets.

V. CFD-GUIDED SENSOR PLACEMENT

In this section, we introduce how to use the results from the

CFD analysis to guide sensor placement inside the server box, with the goal of maximizing the overheating detection probability for all the components. We then introduce a heuristic algorithm for solving this detection probability maximization problem.

A. Overview of Our Approach

Using CFD tools for our sensor placement in the server

box primarily involves two steps. In the first step, we establish a geometric model for the server box in Gambit, mesh the geometry, and export the grid to Fluent. We then take measurements of the incoming air temperature and air flow rate at the inlet of the server. These measurements, along with the power consumption of each component and the fan speed, are the input parameters to Fluent. We repeat the first step while tuning the actuating parameter of the overheating scenario to obtain multiple CFD analysis results. For example, in an overheating scenario caused by inlet overheating, we change the inlet temperature to several different values and run the CFD analysis. Based on the CFD results with different inlet temperatures, we obtain the temperature correlation between any spatial location, defined by the CFD grid, and each component location. We also use the CFD data to obtain an approximation function for each pair of spatial location and targeted component location, such that the temperature at the targeted location can be calculated from the temperature at any spatial location with a high correlation.

In the second step, we feed the results from the CFD analysis, including the overheating scenario temperature data and the correlation data, to our optimization algorithm to find the best locations for sensor placement. We assume that our sensor placement needs to monitor the temperature of the point above the center of each component's top face. To solve the placement problem efficiently, we develop our algorithm based on the Constrained Simulated Annealing approach [23]. The algorithm is explained in detail in the following sections.

B. Component Ambient Temperature Function and Correlation

In Section III-A, we denote the reported temperature of the component at location j from sensor i by a relationship Tj = fj(Ti). Because of the complex fluid dynamics and thermal distribution in the server box, the temperature at location i can be very different from the temperature at location j, even if the physical distance between the two locations is short. Therefore, we need a function mapping from Ti to Tj such that the temperature reading from the sensor at location i can be used to report the component temperature Tj. We use the CFD analysis results from the last section to derive this relationship mapping. We first repeat the CFD analysis with different parameter settings. For example, in the inlet overheating scenario, the inlet temperature is changed in different runs of the CFD analysis. Based on all the temperature data from the different runs of CFD, we establish a second-order polynomial model to approximate the relationship between any temperature Ti and the component temperature Tj as:

Tj = aj,i Ti² + bj,i Ti + cj,i    (13)
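As an illustration, the coefficients of Equation (13) can be obtained by a least-squares fit over the CFD samples. The temperature values below are hypothetical stand-ins for data from five CFD runs.

```python
import numpy as np

# Hypothetical CFD samples: temperature (deg C) at candidate sensor
# location i and at component location j across five CFD runs with
# different inlet temperature settings.
Ti = np.array([25.0, 30.0, 35.0, 40.0, 45.0])
Tj = np.array([31.25, 39.0, 47.25, 56.0, 65.25])

# Least-squares fit of the second-order model of Equation (13):
#   Tj = a*Ti^2 + b*Ti + c
a, b, c = np.polyfit(Ti, Tj, 2)

# Report the component temperature from a new sensor reading.
t_sensor = 37.0
t_component = a * t_sensor**2 + b * t_sensor + c
```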

We have also introduced in Section III-A that our sensor placement scheme only places sensors at locations that have high temperature correlations with the monitored targets. Therefore, we use the same set of CFD data as used in the above function approximation to calculate the spatial correlation between the temperature Ti and the component temperature Tj. Pearson's correlation is a widely adopted metric [24] that calculates the degree of association between two variables. Assuming that we have n sets of CFD data with different inlet temperature settings, we can calculate Pearson's correlation r(Ti, Tj) by

r(Ti, Tj) = Σ_{k=1..n} (Ti^k − T̄i)(Tj^k − T̄j) / ( √(Σ_{k=1..n} (Ti^k − T̄i)²) · √(Σ_{k=1..n} (Tj^k − T̄j)²) )    (14)
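Equation (14) can be written out directly. The sample temperatures below are hypothetical and reuse the quadratic relationship of the previous example.

```python
import numpy as np

# Hypothetical temperatures (deg C) at a candidate sensor location i
# and a component location j across n = 5 CFD runs.
Ti = np.array([25.0, 30.0, 35.0, 40.0, 45.0])
Tj = np.array([31.25, 39.0, 47.25, 56.0, 65.25])

# Pearson's correlation of Equation (14), written out term by term.
num = np.sum((Ti - Ti.mean()) * (Tj - Tj.mean()))
den = (np.sqrt(np.sum((Ti - Ti.mean()) ** 2))
       * np.sqrt(np.sum((Tj - Tj.mean()) ** 2)))
r = num / den
# A candidate location would be kept only if r is high enough to make
# the reading at i a reliable proxy for the temperature at j.
```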

The polynomial function approximation and correlation values are all inputs to the algorithm in the next section.

C. Sensor Placement Algorithm

Our goal is to find the optimal sensor placement locations in the server box to maximize the average overheating detection probability over all the monitored component locations. We propose to use a nonlinear programming solver based on the Constrained Simulated Annealing (CSA) algorithm [23]. CSA is an extension of the conventional Simulated Annealing algorithm for solving global constrained optimization problems with discrete variables. Theoretically, CSA can reach a global optimal


Fig. 3. Comparison at multiple locations in the server between temperature measurements on the testbed and CFD simulation results. The testbed runs the same CPU-intensive workload as in Figure 2.

Fig. 4. Server temperature map of a partial inlet overheating scenario. The red dashed boxes are the chips whose environment temperatures exceed their individual overheating thresholds. Triangles indicate the sensors placed by the CFD-guided approach when the given sensor number is four. The black crosses indicate the four sensors placed by the baseline Chip Best approach.

Fig. 5. Average detection probability of the proposed CFD-guided solution and the baselines in the inlet overheating case (simulation).

Procedure 1 CFD-GUIDED SENSOR PLACEMENT

Input: sensor number N, component location lists x[K] and y[K], CFD data CFDdata, correlation data rdata, overheating threshold list C[K]
Output: placement solution D
1: for j = 1 to K do
2:    x[j]min = xj − R; x[j]max = xj + R
3:    y[j]min = yj − R; y[j]max = yj + R
4: end for
5: x′min = min(x[K]min); x′max = max(x[K]max)
6: y′min = min(y[K]min); y′max = max(y[K]max)
7: (P, D) =
8:    CSA(N, x′min, x′max, y′min, y′max, C[K], CFDdata, rdata)
9: return D

solution by converging asymptotically to a constrained global optimum with a probability of 1. However, a limitation of CSA is that its computational complexity grows exponentially with respect to the number of variables and the solution search space [23][10]. Therefore, before we apply CSA, we first reduce the search space of the algorithm by calculating the plausible search space according to the component locations. In our sensor placement problem, we propose to utilize the sensors that are within the fusion range of a component location to collaboratively decide whether the operating environment temperature of that component is overheating. Therefore, a sensor location is only plausible for a component if it lies inside the fusion range R of that component. We aggregate the plausible search spaces of all components together by finding the maximum and minimum possible x and y values of a sensor. The aggregated region is then used as the search space for the sensor placement algorithm. The pseudo code of this algorithm is listed in Procedure 1. Lines 1-6 calculate the plausible solution search region. Based on the CFD and correlation analysis, i.e., CFDdata and rdata, lines 7-8 use the CSA solver to find the placement solution D that maximizes the detection probability P. The algorithm outputs the placement solution D.
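The annealing loop behind the search can be sketched as follows. This is a plain, unconstrained simulated-annealing toy rather than the full CSA solver of [23], and the chip coordinates, fusion range, and coverage objective are hypothetical stand-ins for the paper's fusion-based detection probability.

```python
import math
import random

# Simplified simulated-annealing sketch of the placement search.
# Chip coordinates, fusion range R, and the coverage objective are
# hypothetical; the paper maximizes a fused detection probability
# with the constrained CSA solver instead.
random.seed(1)
chips = [(2, 3), (3, 3), (8, 1), (9, 2)]   # monitored component locations
R = 2.0                                     # fusion range

def coverage(placement):
    """Fraction of chips within fusion range R of at least one sensor."""
    covered = sum(
        1 for cx, cy in chips
        if any(math.hypot(cx - sx, cy - sy) <= R for sx, sy in placement))
    return covered / len(chips)

# Search region aggregated from the chips' fusion ranges (lines 1-6
# of Procedure 1).
xs = [c[0] for c in chips]
ys = [c[1] for c in chips]
xmin, xmax = min(xs) - R, max(xs) + R
ymin, ymax = min(ys) - R, max(ys) + R

def neighbor(p):
    """Perturb one randomly chosen sensor, clamped to the region."""
    q = list(p)
    i = random.randrange(len(q))
    q[i] = (min(max(q[i][0] + random.uniform(-1, 1), xmin), xmax),
            min(max(q[i][1] + random.uniform(-1, 1), ymin), ymax))
    return q

# Standard annealing loop over a two-sensor placement.
place = [(xmin, ymin), (xmax, ymax)]
best, best_score = place, coverage(place)
temp = 1.0
for step in range(2000):
    cand = neighbor(place)
    delta = coverage(cand) - coverage(place)
    if delta >= 0 or random.random() < math.exp(delta / temp):
        place = cand                       # accept (possibly downhill) move
    if coverage(place) > best_score:
        best, best_score = place, coverage(place)
    temp *= 0.995                          # geometric cooling schedule
```

CSA extends this loop by also annealing Lagrange multipliers for the constraints, which is what gives its asymptotic-convergence guarantee.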

VI. EVALUATION

In this section, we first validate our CFD model by comparing the CFD analysis results with the real sensor measurements. Then we introduce the experimental setup and the methodology used for the performance evaluation on our hardware testbed. After that, the overheating component detection performance is evaluated in both simulation and hardware testbed experiments in three different overheating scenarios, including inlet overheating, fan failure, and CPU overloading.

A. Model Validation and Experiment Methodology

To validate our server model in the CFD analysis, we place

19 sensors into the server box. The server is placed in an isolated server room with a dedicated air conditioning system. We measure the temperature under a normal server running condition, in which the server is running the SPEC CPU2006 benchmarks at an average inlet temperature of 19.6◦C, with a 0.5◦C fluctuation because of the air conditioning actuation. The measurements are taken when the server is running under a stable thermal status with sensors placed in the closed enclosure. The sensors we use for the real temperature measurements are TelosB sensor motes [25]. We choose this type of sensor because we can collect the temperature readings wirelessly without opening the server enclosure. We note that our approach does not depend on a particular sensor type and can utilize either wired or wireless communications (though wireless sensors can be less intrusive to the already complicated server environment). Figure 3 shows the comparison between the CFD analysis temperature results and the testbed measurement results. We can see that the temperature difference between the CFD analysis and the real measurements is about 6.3% on average, which shows that our computational CFD result is sufficiently close to the real temperature measurements. If a smaller type of sensor were used, the difference could be further reduced.

We evaluate five different sensor placement strategies across all the experiments. CFD-guided sensor placement is the approach we propose in this work, which places sensors based on the analytical results from the CFD analysis. Chip Best is the placement resulting from a best-effort approach. To get this best performance, we first place sensors at all the exact chip locations in the overheating experiment, one for each chip, to collect the temperature data. Then, for a given number of N sensors (less than the number of chips M), we find the combination of N locations that results in the best detection performance among all possible combinations. Note it is infeasible to use


Fig. 6. Average detection error rate of the proposed CFD-guided solution and the baselines in the inlet overheating case (simulation).

Fig. 7. Average detection probability of the proposed CFD-guided solution and the baselines in the inlet overheating case (testbed).

Fig. 8. Average detection error rate of the proposed CFD-guided solution and the baselines in the inlet overheating case (testbed).

Chip Best in a real implementation, because it needs to test all different combinations of sensor/chip pairing and select the best one. Different from Chip Best, Chip Average calculates the average detection performance of all the possible combinations. Random is a simple heuristic strategy that places sensors randomly in the server box; we report the average results from 10 runs of random placements. Uniform Grid divides the server box into uniform-sized grid cells and places one sensor randomly in each cell.

In all of our experiments, we evaluate the average detection probability and the error rate for the different placement approaches. The average detection probability is defined as the number of overheating chips that are detected divided by the total number of overheating chips. The error rate consists of both false alarms and missed detections. For all of our testbed results, we run each overheating experiment 10 times and calculate the average value of each performance metric. There are no averaged results in simulation, since there is no variation in the CFD temperature results when the experiment settings remain the same.
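The two metrics can be stated concretely. The per-chip outcomes below are hypothetical and only illustrate the definitions.

```python
# Hypothetical per-chip outcomes from one overheating experiment:
# each entry is (actually_overheating, reported_overheating).
outcomes = [(True, True), (True, False), (False, False),
            (True, True), (False, True), (False, False)]

overheating = [o for o in outcomes if o[0]]
detected = [o for o in overheating if o[1]]

# Detection probability: detected overheating chips over all
# overheating chips.
detection_prob = len(detected) / len(overheating)

# Error rate: false alarms plus missed detections over all chips.
errors = sum(1 for actual, reported in outcomes if actual != reported)
error_rate = errors / len(outcomes)
```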

B. Inlet Overheating Detection

In this subsection, we evaluate the detection performance

under a partial inlet overheating condition. Partial inlet overheating is often hard to capture with the single inlet temperature sensor on the front panel assembly shown in Figure 1. Ideally, one could adjust the air conditioning system in the room (e.g., reducing its blowing range) to emulate inlet overheating caused by cooling systems. However, due to limited allowed access to the air conditioning system in the room, we use a hair dryer to blow warm air into the server at the lower left corner of the front inlet to emulate partial inlet overheating in our testbed experiment. To calculate the spatial temperature correlation and the target temperature function, CFD analysis is conducted in different scenarios with different inlet overheating temperatures. As a result, the sensor placement solution computed by our algorithm can handle the dynamics in different inlet overheating scenarios, even though we only test a subset of those scenarios. Figure 4 shows the temperature distribution of the server box under the highest partial inlet overheating temperature. We can see that 9 chips (red dashed frames in the figure) out of the total 11 monitored chips are overheating in this scenario.

Figure 5 shows the average detection probability in the partial inlet overheating scenario. We see that the CFD-guided approach has the highest overheating detection probability. Compared with Chip Best, CFD shows a maximum performance advantage of about 22% when the sensor number is 2. This is mainly because when a sensor is placed at the exact location of one chip by Chip Best, it cannot always provide temperature monitoring for other chips, as chips are usually not placed close to each other. Although Chip Best may show some acceptable overheating component detection performance when the number of sensors is large, this performance is actually hard to achieve without testing all the combinations of sensor locations for the given number of sensors. Without exhaustively testing all the combinations, one can choose chip locations randomly, leading to the detection performance of the Chip Average scheme. We see that the CFD-guided placement outperforms Chip Average at all sensor numbers in the experiment, with a highest performance gain of 45% when the sensor number is 2. The other two baselines, Random and Uniform Grid, show significantly worse performance than CFD-guided, Chip Best, and Chip Average, since they are only heuristic approaches. To illustrate the difference between CFD-guided and Chip Best, a placement example with 4 sensors is given in Figure 4. We see that CFD placement does not place sensors on any of the chips. Instead, it places sensors in between chips, such that each sensor can cover more chips, thus leading to better detection results. Figure 6 shows the average error rate in this scenario. We see that the CFD-guided placement shows significantly lower error rates than the other two chip-location placement schemes. This demonstrates that with the analytical results from the CFD analysis, the placement can cover more targets, which leads to fewer missed detections.

Figure 7 and Figure 8 show the detection probability and error rate on the hardware testbed. We extract the sensor placement locations from the simulations and place all the sensors into the server box accordingly. Because of the limited space, we only place up to five sensors into the server box. Since we evaluate three different sensor placement schemes, the maximum number of sensors placed in the server at the same time is 15. From the results we see that the detection probability and detection error performance on the hardware testbed match the simulation results well. Among all the three schemes, CFD-guided shows the best detection performance and Chip Average has the worst performance.

C. Fan Failure Detection

In this experiment, we conduct both simulation and hardware testbed experiments on a fan failure scenario. To ensure the safe operation of the system, we only disable a single fan in the system. To calculate the spatial temperature correlation and the target temperature function, several runs of CFD analysis with different fan speeds are conducted. Similar to


Fig. 9. Average detection probability of the proposed CFD-guided solution and the baselines in the scenario with a single fan failure (simulation).

Fig. 10. Average error rate of the proposed CFD-guided solution and the baselines in the scenario with a single fan failure (simulation).

Fig. 11. Average detection probability of the proposed CFD-guided solution and the baselines in the scenario with a single fan failure (testbed).

Fig. 12. Server temperature map in a scenario with a single fan failure. The red dashed frames are the chips whose environment temperatures exceed their individual operating temperature thresholds. The black solid triangles indicate the sensors placed by the proposed CFD-guided approach when the given sensor number is two. The black crosses indicate the two sensors placed by the baseline Chip Best approach.

Fig. 13. Average error rate of the proposed CFD-guided solution and the baselines in the scenario with a single fan failure (testbed).

the inlet overheating scenario discussed before, our sensor placement solution can handle the dynamics in different fan failure scenarios, because the CFD analysis is conducted with different fan speeds. Figure 12 shows the colored temperature map of the server with a single fan disabled. The missing line at one of the fan positions represents the failed fan. We see that 4 chips (marked with red frames) out of the total 11 monitored chips are operating in an overheating environment.

The average overheating detection probability from simulation is shown in Figure 9. We see that the CFD placement approach only requires two sensors to reach 100% overheating component detection for all the four overheating locations, while Chip Best requires three sensors. The placements with two sensors by these two approaches are marked in Figure 12. We see that CFD placement tries to cover all the overheating chips in the right corner by putting only one sensor in the middle of those chips. Compared with Chip Average, CFD shows significantly better performance, with a 60% higher detection probability. As expected, the Uniform Grid and Random schemes perform much worse than the other placement schemes. Figure 10 shows the average error rate of the fan failure scenario in simulations. We see that despite some random errors, CFD outperforms the other two baseline approaches. Chip Average shows the worst performance among the three approaches.

Figure 11 and Figure 13 show the detection probability

Fig. 14. Server temperature map in the scenario of CPU overloading at 3x the power consumption listed on the datasheet. The red dashed boxes are the chips whose environment temperatures exceed their individual operating temperature thresholds. The black solid triangles indicate the sensors placed by the proposed CFD-guided approach, when the given sensor number is two. The black solid crosses indicate the two sensors placed by the baseline Chip Best approach.

and detection error rate on the hardware testbed based on the sensor placement locations extracted from the simulation. From Figure 11 we see that CFD performs similarly to Chip Best, and both of them still outperform the Chip Average scheme significantly. Figure 13 shows the average error rate in this fan failure case. We see that CFD performs just slightly worse than Chip Best, but still performs much better than Chip Average. The degraded performance in this fan failure scenario is most likely caused by the modeling inaccuracy of the CFD analysis. Disabling a fan makes the thermal fluid dynamics more complex than in other scenarios, leading to an increase in modeling error. Please note again that Chip Best is actually not feasible in a real implementation, because it needs to test all different combinations of sensor/chip pairing and select the best one.

D. CPU Overloading Detection

In this section, we present the simulation results for the overheating scenario induced by CPU overloading. With the widely adopted DVFS technique, CPU power is well known to be a cubic function of CPU frequency [26]. By overclocking the CPU frequency to 1.5x of the maximum value listed on the datasheet, 3x overloaded power consumption can easily be reached. Unfortunately, the platform we use in our hardware experiment does not support CPU overclocking. Therefore, we only show the simulation results in this section for the detection performance under CPU 3x overloading. To calculate the spatial temperature correlation and the target temperature function, several runs of CFD analysis with different CPU power settings are conducted. Note again that our sensor placement solution is designed to handle the dynamics in different CPU overloading scenarios.
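The 3x figure follows directly from the cubic power model. The rated power below is a hypothetical value; only the scaling factor matters.

```python
# Under the cubic DVFS power model [26], power scales with the cube
# of frequency. Overclocking to 1.5x the rated frequency gives
# 1.5^3 = 3.375x power, i.e., roughly the 3x overload simulated here.
rated_power = 95.0          # hypothetical rated CPU power (W)
freq_scale = 1.5            # overclocking factor
overload_power = rated_power * freq_scale ** 3
```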

Figure 14 shows the colored temperature map for the CPU overloading 3x power scenario. Although the color pattern is quite similar to the result in Figure 2, i.e., a normal run with


Fig. 15. Average detection probability in the scenario of CPU overloading 3x power.

benchmark workload, it shows a significantly higher temperature than that in the normal run. The highest temperature can reach up to about 120◦C. Six chips are found to be working under overheating conditions among all the 11 monitored chips. The placement results with three sensors are illustrated in Figure 14 for both CFD-guided placement and Chip Best. We see that CFD placement places sensors in the middle of the cluster of overheating chips, such that more chips can be covered by the limited number of sensors.

Figure 15 shows the average detection probability of this CPU overloading scenario. We can see that CFD placement consistently shows the best detection probability, outperforming both Chip Best and Chip Average. With a sensor number of 2, the performance of CFD reaches twice that of Chip Average. The average error rate of the component overheating detection with CPU overloading is shown in Figure 16. We see that CFD placement outperforms both Chip Best and Chip Average for all numbers of sensors.

VII. CONCLUSIONS

Efficient thermal monitoring is critical for today's server systems to ensure safe operation and continuous service. It is also important for each server component to maintain a desirable lifetime of service. However, the current practice of server thermal monitoring simply relies on either sensors placed at the server inlet or on-die thermal sensors equipped on only some of the components, such as the CPU, memory, or both, which may lead to degraded overheating detection performance for certain components. In this paper, we have presented a novel solution that places additional sensors into the server box for overheating server component detection, based on CFD analysis of the thermal and fluid dynamics inside the server box. Our sensor placement scheme applies the Constrained Simulated Annealing algorithm with a reduced search space to find a sensor placement that maximizes the overheating component detection probability. Our solution also adopts data fusion techniques to collaboratively make the overheating detection decision, resulting in improved detection performance. We evaluate our CFD-based sensor placement strategy with a real-world 2U rack server in different component overheating scenarios. Our results show that the proposed placement strategy achieves significantly better overheating detection performance than several well-designed baselines. Extensive simulation results also demonstrate the effectiveness of our CFD-guided sensor placement scheme.

REFERENCES

[1] J. Srinivasan, S. Adve, P. Bose, and J. Rivers, "Lifetime reliability: toward an architectural solution," IEEE Micro, vol. 25, no. 3, pp. 70–80, 2005.

Fig. 16. Average error rate in the scenario of CPU overloading 3x power.

[2] ——, "The case for lifetime reliability-aware microprocessors," in ISCA, 2004.

[3] F. J. Mesa-Martinez, E. K. Ardestani, and J. Renau, "Characterizing processor thermal behavior," in ASPLOS, 2010.

[4] N. Tolia, Z. Wang, P. Ranganathan, C. Bash, M. Marwah, and X. Zhu, "Unified thermal and power management in server enclosures," in ASME, 2009.

[5] J. Donald and M. Martonosi, "Techniques for multicore thermal management: Classification and new exploration," in ISCA, 2006.

[6] R. Z. Ayoub, K. R. Indukuri, and T. S. Rosing, "Energy efficient proactive thermal management in memory subsystem," in ISLPED, 2010.

[7] S. Gurumurthi and A. Sivasubramaniam, "Thermal issues in disk drive design: Challenges and possible solutions," Trans. Storage, vol. 2, February 2006.

[8] J. Lin, H. Zheng, Z. Zhu, E. Gorbatov, H. David, and Z. Zhang, "Software thermal management of DRAM memory for multicore systems," in SIGMETRICS, 2008.

[9] X. Wang, X. Wang, G. Xing, J. Chen, C.-X. Lin, and Y. Chen, "Towards optimal sensor placement for hot server detection in data centers," in ICDCS, 2011.

[10] Z. Yuan, R. Tan, G. Xing, C. Lu, Y. Chen, and J. Wang, "Fast sensor placement algorithms for fusion-based target detection," in RTSS, 2008.

[11] K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, "Temperature-aware microarchitecture," in ISCA, 2003.

[12] Y. Kim, S. Gurumurthi, and A. Sivasubramaniam, "Understanding the performance-temperature interactions in disk I/O of server workloads," in HPCA, 2006.

[13] J. Choi, Y. Kim, A. Sivasubramaniam, J. Srebric, Q. Wang, and J. Lee, "Modeling and managing thermal profiles of rack-mounted servers with ThermoStat," in HPCA, 2007.

[14] C.-J. M. Liang, J. Liu, L. Luo, A. Terzis, and F. Zhao, "RACNet: a high-fidelity data center sensing network," in SenSys, 2009.

[15] S. Memik, R. Mukherjee, M. Ni, and J. Long, "Optimizing thermal sensor allocation for microprocessors," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 3, pp. 516–527, 2008.

[16] T. Yasuda, "On-chip temperature sensor with high tolerance for process and temperature variation," in ISCAS, 2005.

[17] Y. Zhang, A. Srivastava, and M. Zahran, "Chip level thermal profile estimation using on-chip temperature sensors," in ICCD, 2008.

[18] A. Krause, C. Guestrin, A. Gupta, and J. Kleinberg, "Near-optimal sensor placements: maximizing information while minimizing communication cost," in IPSN, 2006.

[19] P. K. Varshney, Distributed Detection and Data Fusion. Springer-Verlag New York, Inc., 1996.

[20] S. Marsh, "Direct extraction technique to derive the junction temperature of HBTs under high self-heating bias conditions," IEEE Transactions on Electron Devices, vol. 47, Feb 2000.

[21] "CFD flow modeling software and solutions from Fluent," http://www.fluent.com.

[22] S. V. Patankar, Numerical Heat Transfer and Fluid Flow. Hemisphere Publishing Corporation, New York, 1980.

[23] B. W. Wah, Y. Chen, and T. Wang, "Simulated annealing with asymptotic convergence for nonlinear constrained optimization," J. of Global Optimization, vol. 39, 2007.

[24] A. Verma, G. Dasgupta, T. K. Nayak, P. De, and R. Kothari, "Server workload analysis for power minimization using consolidation," in USENIX, 2009.

[25] "MEMSIC, TelosB mote," http://www.memsic.com/products/wireless-sensor-networks/wireless-modules.html.

[26] K. Choi, W. Lee, R. Soma, and M. Pedram, "Dynamic voltage and frequency scaling under a precise energy model considering variable and fixed components of the system power dissipation," in ICCAD, 2004.