
Optimum Cooling of Data Centers

Jun Dai · Michael M. Ohadi · Diganta Das · Michael G. Pecht

Application of Risk Assessment and Mitigation Techniques


Jun Dai, Diganta Das, Michael G. Pecht
Center for Advanced Life Cycle Engineering
University of Maryland, College Park, MD, USA

Michael M. Ohadi
Department of Mechanical Engineering
University of Maryland, College Park, MD, USA

© Springer Science+Business Media New York 2014

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

ISBN 978-1-4614-5601-8    ISBN 978-1-4614-5602-5 (eBook)
DOI 10.1007/978-1-4614-5602-5
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2013950360


Preface

The worldwide energy consumption of data centers increased nearly 56 % between 2005 and 2010, and reached 237 terawatt hours (TWh) in 2010, accounting for about 1.3 % of the world’s electricity usage [1]. In the US, data center energy consumption increased by about 36 % between 2005 and 2010, reaching 76 TWh and accounting for about 2 % of total US electricity consumption in 2010 [1]. Cooling systems (primarily air conditioners) in data centers account for a large part of this energy consumption: in 2009, about 40 % of the energy consumed by data centers was for cooling [2, 3].

A 2012 biannual survey by Emerson Network Power polled members of the Data Center Users’ Group, an association of data center, IT, and facility managers, about the greatest issues facing data centers. Energy efficiency was cited as the primary concern, followed by availability and infrastructure monitoring. The electricity cost to remove the heat generated by the server racks has continued to rise, to the point that the 4-year energy costs of operating many data centers exceed their purchase price.

Given that nearly 40 % of the power supplied to a typical data center is spent on its cooling infrastructure, numerous approaches are underway to realize substantial reductions in the energy consumption of data centers. One such example is "free air cooling," where ambient air under proper temperature and humidity conditions is brought into the data center to cool the equipment directly, thereby reducing the energy consumed in cooling and conditioning. Numerous successful examples of free air cooling have demonstrated substantial energy savings, and some have achieved a power usage effectiveness (PUE) of nearly 1. However, a systematic examination of the impact of free air cooling on the performance and reliability of telecommunication equipment is needed. The implementation of free air cooling changes the operating environment, such as the temperature and humidity, and may have a significant impact on performance and reliability.

Maintaining the high availability of data centers requires reliability methods that provide useful information about impending failures, identify failure locations, and help isolate failure causes, while taking into account the life cycle conditions during system service. Traditional standards-based qualification methods will not work when free air cooling is implemented in data centers already in operation, since it is usually not practical to interrupt equipment service for re-qualification purposes.


The purpose of this book is to provide data center designers and operators with methods by which to assess and mitigate the risks associated with the utilization of optimum cooling solutions. The goal is to provide readers with sufficient knowledge to implement new and emerging measures such as free air cooling and direct liquid immersion cooling properly in data centers, base stations, and server farms. The book addresses the following questions:

• What are the costs and benefits associated with an optimum cooling solution for the given system?

• How could the given optimum cooling method(s) be implemented in the given data center?

• Are the current telecom industry standards sufficient/applicable for the selected optimum cooling method(s)?

• What are the potential risks and failure mechanisms associated with the implementation of the optimum cooling method(s)?

• How can the risks to the performance and reliability of telecom equipment under optimum cooling conditions be assessed?

• How can the associated risks to telecom equipment at multiple life cycle stages (design, test, and operation) be mitigated?

• Why is prognostics and health management (PHM) a proper risk mitigation method for the operation stage?

This book discusses various telecommunication infrastructures, with an emphasis on data centers and base stations. Among the various energy and power management techniques, this book covers the most commonly known, as well as emerging, cooling solutions for data centers. The risks to the electronic equipment fitted in these installations and the methods of risk mitigation are described. The book devotes particular attention to an up-to-date review of emerging cooling methods (such as free air cooling and direct liquid immersion cooling) and of tools and best practices for installation operators. It informs installation designers and manufacturers of the benefits and limitations of the most common existing and emerging cooling methods, and prepares the designers and manufacturers of electronics for these installations to develop and supply products that meet the operators’ availability, reliability, and performance requirements under the optimum cooling regime.

Chapter 1 provides an overview of the global telecom industry based on the current market and predicted future trends. The reasons for reducing energy consumption are also discussed in detail, including energy costs, environmental concerns, and government regulations.

Chapter 2 provides an overview of the main components (power equipment, cooling equipment, and IT equipment) and operating environments in data centers, as well as the energy efficiency metrics by which they are measured. It also introduces the methods for improving energy efficiency in telecom devices and in data centers, which include more efficient technologies for telecom devices, reducing the required computational power by improving application management, improving the efficiency of servers, improving the efficiency of power supplies and distribution, and improving the efficiency of cooling equipment.


Chapter 3 introduces the standards for telecom equipment and data centers, including the environmental qualification standards for telecom equipment and the standards providing data center thermal guidelines and design, installation, and performance requirements. These standards include TL 9000, which can be used to evaluate the quality of telecom equipment and assess the impact of free air cooling on telecom equipment; TIA-942, which focuses on the design and installation of data centers; and the ASHRAE thermal guidelines. The application of these standards under free air cooling conditions is also discussed.

Chapter 4 introduces the principal cooling methods most commonly used, as well as emerging optimum cooling solutions that seek to minimize energy consumption without compromising the integrity of the data and the quality of service of the particular data center. Measures such as air conditioning/cooling with improved power management technologies, liquid cooling, free air cooling, and tower free cooling are covered, along with a comparison of the cooling methods. When applicable, the methods considered are compared in terms of energy efficiency, retrofit cost, and weather dependence. This chapter places a particular focus on free air cooling, its operating principles, and the opportunities and challenges associated with its use. Several data center design case study scenarios of free air cooling are also discussed, along with their potential energy savings, as well as other considerations and findings based on the available data.

Chapter 5 presents the potential risks to telecom equipment under free air cooling conditions due to changes in the operating environment, including temperature, humidity, and contamination, as an example of risk analysis for the optimum cooling methods. Various relevant risk assessment procedures and the associated standards are reviewed in this chapter. The most critical unknown factor that remains in the assessment of reliability is the actual set of operating conditions under free air cooling, since there is not enough publicly available data to determine the actual environmental envelope under free air cooling. In addition, the most significant difference between free air cooling and traditional air conditioning is the diversity among free air cooled data centers, which vary in terms of their location, the specific architecture of the free air cooling, and the inclusion of other power management methods in conjunction with free air cooling.

Chapter 6 presents steps to identify the parts of the telecom equipment with the highest potential risks under optimum cooling conditions and provides a process for assessing whether suitable alternative parts, if available, are qualified for the new environment. If suitable alternatives are not practical or possible, uprating methods are introduced to assess whether the original parts are qualified under optimum cooling conditions. Three uprating methods (parameter re-characterization, parameter conformance, and stress balancing) are presented with examples to show the steps for their implementation. The uprating methods are compared, and methods for selecting an appropriate uprating method are introduced in this chapter.

Chapter 7 presents guidelines for assessing part reliability under optimum cooling conditions. Handbook-based reliability predictions have been used for decades; however, they do not consider the failure mechanisms and provide only limited insight into practical reliability issues. As a result, they cannot offer accurate predictions. This chapter presents several methods to replace the handbook methods at different product life cycle stages. At the design and test stages, the manufacturers can use physics-of-failure (PoF) and accelerated testing to predict part reliability. At the operation stage, when the products are being used in the field, the field data can be analyzed to estimate reliability.

When optimum cooling is implemented in data centers already in operation, traditional reliability assessment methods and current standards-based qualification methods are insufficient to estimate the reliability of telecom equipment. Thus, Chap. 8 introduces prognostics and health management (PHM), which can be applied to identify and mitigate the system-level risks of operating telecom equipment under free air cooling conditions. This chapter provides a basic introduction to PHM, the monitoring techniques for PHM, and PHM approaches. The physics-of-failure approach, the data-driven approach, and a combination of both approaches (the fusion approach) are introduced. Chapter 8 also presents a multi-stage method to identify and mitigate the potential risks to telecommunication equipment under energy conservation measures such as free air cooling, thus providing a case example of how the PHM approaches can be used to mitigate the risks associated with the use of optimum cooling methods.

Chapter 9 presents some common features of next-generation data centers. Emerging trends suggest that next-generation data centers will be more energy efficient, use space more efficiently, use higher-density electronic components, reduce capital and operational costs, use optimized cooling methods, reduce emissions to net-zero, increasingly use hardware and software in the integrated design and operation/management of the center, increasingly use cloud computing, and make continuous progress in the use of risk assessment and mitigation techniques to take advantage of optimum infrastructure design/installation and operation measures.

This book offers information on sustainable design and operating principles that meet the expectations of next-generation data centers. The focus of the book is on optimum cooling and other energy recovery and efficiency improvement measures; thus, it will be useful for stakeholders in both the IT and HVAC industries, including facility developers/designers, HVAC equipment manufacturers, IT and telecom equipment manufacturers, and data center end-users/owners, operators, and energy auditors. The book will also be valuable for researchers and academic communities in their search for future solutions and further enhancements in this growing and promising field. What distinguishes this book from previous books in the field is that it offers a review of the potential risks due to the implementation of optimum cooling methods (free air cooling, as an example) and a set of assessment methods for part performance and reliability. Additionally, for data center and base station designers, this book provides a review of the guidelines and regulations imposed, the goals set by governments, and a review of all variations of optimum cooling techniques. For data center operators, this book provides a prognostics-based assessment to identify and mitigate the risks to telecom equipment under optimum cooling conditions.


The authors wish to thank Dr. Bo Song, who led the writing of Chap. 5. Her efforts are much appreciated. We also thank Dr. Serguei Dessiatoun and Dr. Kyosung Choo of the Smart and Small Thermal Systems Laboratory at the University of Maryland for their contributions to Chaps. 2 and 9. We are grateful to Profs. Avram Bar-Cohen of the University of Maryland and Yogendra Joshi of Georgia Tech for their discussions on contemporary issues and future trends in the thermal packaging of electronics.

References

1. J.G. Koomey, Growth in Data Center Electricity Use 2005 to 2010 (Analytics Press, Oakland, 2011)
2. A. Almoli, A. Thompson, N. Kapur, J. Summers, H. Thompson, G. Hannah, Computational fluid dynamic investigation of liquid rack cooling in data centres. Appl. Energy 89(1), 150–155 (2012)
3. P. Johnson, T. Marker, Data Center Energy Efficiency Product Profile, Pitt & Sherry, report to the Equipment Energy Efficiency Committee (E3) of the Australian Government Department of the Environment, Water, Heritage and the Arts (DEWHA), Apr 2009

Contents

1 The Telecom Industry and Data Centers
  1.1 An Overview of the Telecom Industry Market
    1.1.1 The Global Telecom Market
    1.1.2 The United States Telecom Market
  1.2 Energy Consumption
    1.2.1 The Cost of Energy
    1.2.2 Environmental Issues
    1.2.3 Government Regulations
  1.3 Summary
  References

2 Data Center Energy Flow and Efficiency
  2.1 Data Centers
    2.1.1 Power Equipment
    2.1.2 Cooling Equipment
    2.1.3 IT Equipment
  2.2 Energy Efficiency Metrics
  2.3 Methods to Improve Energy Efficiency
    2.3.1 Efficient Electronics
    2.3.2 Efficient Software Applications
    2.3.3 Efficient Power Supply and Distributions
    2.3.4 Efficient Cooling Systems
  2.4 Case Study Example on Data Center Energy Saving Opportunities
    2.4.1 Analysis of Energy Consumption
    2.4.2 Energy Consumption Simulations
    2.4.3 Energy Conservation Findings
  2.5 Summary
  References

3 Standards Relating to Data Center
  3.1 ASHRAE Thermal Guidelines
  3.2 TIA-942 Data Center Standard
  3.3 Environmental Qualification Standards
    3.3.1 Telcordia GR-63-CORE
    3.3.2 ETSI 300 019
    3.3.3 Use for Data Center Cooling Methods
  3.4 Quality Management Standard: TL 9000
    3.4.1 Metrics in TL 9000
    3.4.2 Use for Data Centers
  3.5 Summary
  References

4 Principal Cooling Methods
  4.1 Principal Cooling Methods
    4.1.1 Air Cooling
    4.1.2 Liquid Cooling
    4.1.3 Liquid Immersion Cooling
    4.1.4 Tower Free Cooling
    4.1.5 Enhanced Cooling Utilizing Power Management Technologies
    4.1.6 Comparison of Principal Cooling Methods
  4.2 Free Air Cooling
    4.2.1 Operation of Airside Economizer
    4.2.2 Operating Environment Setting
    4.2.3 Energy Savings from Free Air Cooling
    4.2.4 Hidden Costs of Free Air Cooling
    4.2.5 Examples of Free Air Cooling
  4.3 Summary
  References

5 Reliability Risks Under Free Air Cooling
  5.1 Failure Causes Under Free Air Cooling
    5.1.1 Increased Temperature and Temperature Variation
    5.1.2 Uncontrolled Humidity
    5.1.3 Contamination
  5.2 Failure Mechanisms Under Free Air Cooling
    5.2.1 Electrostatic Discharge
    5.2.2 Conductive Anodic Filament Formation
    5.2.3 Electrochemical Migration
    5.2.4 Corrosion
  5.3 Testing for Free Air Cooling
    5.3.1 Mixed Flowing Gas (MFG) Test
    5.3.2 Dust Exposure Tests
    5.3.3 Clay Test
    5.3.4 Temperature/Humidity/Bias (THB) Testing
    5.3.5 Salt Spray Testing
    5.3.6 Cyclic Temperature/Humidity Testing
    5.3.7 Water Spray Testing
  5.4 Summary
  References

6 Part Risk Assessment and Mitigation
  6.1 Part Datasheet
    6.1.1 Datasheet Contents
    6.1.2 Understanding the Part Number
    6.1.3 Ratings of an Electronic Part
    6.1.4 Thermal Characteristics
    6.1.5 Electrical Specifications
  6.2 Part Uprating
    6.2.1 Steps of Part Uprating
    6.2.2 Parameter Conformance
    6.2.3 Parameter Re-characterization
    6.2.4 Stress Balancing
    6.2.5 Continuing Steps After Uprating
  6.3 Summary
  References

7 Part Reliability Assessment in Data Centers
  7.1 Part Capability
  7.2 Example Handbook-Based Reliability Prediction Methods
    7.2.1 Mil-hdbk-217
    7.2.2 Telcordia SR-332
    7.2.3 How the Handbook Calculations Work
    7.2.4 How the Operating Environments are Handled
    7.2.5 Insufficiency of the Handbook Methods
  7.3 Prognostics and Health Management Approaches
    7.3.1 Monitoring Techniques for PHM
    7.3.2 Physics-of-Failure Approach
    7.3.3 Data-Driven Approach
    7.3.4 Fusion Approach
    7.3.5 Use for the Efficient Cooling Methods
  7.4 Other Approaches
  7.5 Summary
  References

8 Life Cycle Risk Mitigations
  8.1 Risk Assessment Based on Product Life Cycle Stage
  8.2 Risk Assessment at the Design Stage
    8.2.1 Initial Design
    8.2.2 Part Selection
    8.2.3 Virtual Qualification
    8.2.4 Simulation and Final Design
  8.3 Risk Assessment at the Test Stage
    8.3.1 Standards-Based Assessment
    8.3.2 Uprating Assessment
  8.4 Risk Assessment at the Operation Stage
  8.5 A Case Study of Network Equipment
    8.5.1 Estimation of Operating Conditions
    8.5.2 FMMEA and Identification of Weak Subsystems
    8.5.3 System and Weak Subsystem Monitoring
    8.5.4 Anomaly Detection
    8.5.5 Prognostics
  8.6 Summary
  References

9 Emerging Trends
  9.1 Increased Use of Software Tools for Optimum and Reliable Operation
  9.2 Trends in Development of Energy Efficient Electronics
  9.3 Embedded (Near Source) Cooling
    9.3.1 Enhanced Air Cooling
    9.3.2 CRAC Fan Speed Control
    9.3.3 Direct Liquid Cooling
    9.3.4 Direct Phase-Change Cooling
    9.3.5 Comparison Between Embedded Air, Liquid, and Two-Phase Flow Cooling
  9.4 Net-Zero Emission Data Centers
  9.5 Mission Critical Data Centers
  9.6 Waste Heat Recovery/Chiller-less Cooling
  9.7 Summary
  References

Glossary

Chapter 1
The Telecom Industry and Data Centers

The telecommunications industry encompasses the information and communication technology (ICT) sectors, ranging from fixed telephone, broadband Internet, and mobile wireless applications to cable television. The supporting backbone of the telecom industry’s infrastructure consists of data centers, which comprise the buildings, facilities, and rooms that contain enterprise servers and equipment for communications, cooling, and power distribution and control. This chapter provides an overview of the telecom industry market and the need for energy consumption reduction/optimization in data centers.

1.1 An Overview of the Telecom Industry Market

The telecommunications industry is essential to the global economy and plays a key role in almost every sector of society. Insight Research estimated that global telecom spending as a share of global GDP would be 5.9 % in 2013, up from 4.8 % in 2006 [1]. The transfer of entertainment and information delivery (e.g., movie streams, eBooks, podcasts) through the Internet will continue to increase the demand for reliable, high-bandwidth telecommunication systems.

1.1.1 The Global Telecom Market

In 2010, the global market revenue of the telecommunications industry reached about US $4 trillion, nearly double the US $2.3 trillion of 2002. The market expanded at double-digit annual rates from 2003 to 2008 [2], but this expansion slowed in 2008 and revenue dropped in 2009 due to the economic recession in large parts of the developed world. The Telecommunications Industry Association trade group predicted that global telecommunications spending would reach $4.9 trillion in 2013 [2]. Figure 1.1 shows the telecommunications industry’s global revenue, both past and projected.

Projected opportunities for growth in the telecommunications market are in the Middle East, Africa, Latin America, and Asia. The telecommunications market in the Middle East and Africa is predicted to reach $395 billion in 2013, with a 12.1 % compound annual growth rate. Latin America will be second in growth rate. In Asia, the telecommunications market is expected to grow at an 8.5 % compound annual rate, reaching $1.5 trillion in 2013 compared to $1.1 trillion in 2009. The European telecommunications market is relatively saturated but is projected to rise to $1.31 trillion in 2013 at a 4 % compound annual rate [1].

Among the segments of the telecommunications market, wireless subscribers will play a key role. The number of wireless subscriptions is expected to reach 5.5 billion in 2013, compared to 1.4 billion in 2009. The wireless market in Latin America is projected to increase to $134 billion in 2013, at a 9.3 % compound annual rate, compared to $94 billion in 2009. In Asia, by 2013 the wireless subscribers in China and India are expected to increase by 300 million and 340 million, respectively, and the two countries will contribute about 62 % of all Asia Pacific subscribers [2].

1.1.2 The United States Telecom Market

During the economic recession, the U.S. telecommunications market experienced a drop of about $60 billion in 2009 compared to 2008. However, U.S. telecommunication industry revenue is expected to grow to $1.14 trillion in 2013, at an annual rate of 3.7 %, compared to US $988 billion in 2009 [2].

Fig. 1.1 U.S. wireless revenue by service sector (voice and data revenue in $ billions, 2005–2016) [4]

The Federal Communications Commission (FCC) plan titled "Connecting America: The National Broadband Plan" [3] was required by the U.S. Congress as a part of the "American Recovery and Reinvestment Act of 2009" to improve broadband Internet access in the United States. This plan includes "a detailed strategy for achieving affordability and maximizing use of broadband to advance consumer welfare, civic participation, public safety and homeland security, community development, health care delivery, energy independence and efficiency, education, employee training, private sector investment, entrepreneurial activity, job creation and economic growth, and other national purposes" [3]. This FCC plan proposes six long-term goals for quantitative access speed as well as other policy goals. If adequately funded and implemented by the U.S. Government and industry, this plan will contribute to the growth of the telecommunications industry.

The implementation of the National Broadband Plan could be a stimulus for the growth of the broadband market. The number of household broadband subscribers is estimated to increase from 75.6 million in 2009 to 111 million in 2013; this will represent an increase in the percentage of the public with broadband access from 64.2 % in 2009 to 90.7 % in 2013. The business broadband market was estimated to grow from 5.7 million to 7.5 million between 2009 and 2013 [3].

One of the largest shares of the telecom market is wireless revenue, which is projected to be $205 billion in 2013, a 34.9 % increase compared to $152 billion in 2009 [4]. One driver of wireless market expansion is the rapid growth of high-volume data applications. Data-related spending reached $95 billion in 2012, up from only $43 billion in 2009. Data service revenue will continue to rise and is projected to reach $184 billion in 2016, an increase of 94 % over the four years from 2012 to 2016. It is estimated that data services will account for about 72 % of wireless service revenue in 2016, as shown in Fig. 1.1 [4].

1.2 Energy Consumption

The growth of the telecom industry has resulted in increases in the energy consumed to run the telecom infrastructure. In fact, energy consumption is one of the main contributors to operating expenses for telecom network operators. Reliable access to electricity is limited in many developing countries that are high-growth markets for telecommunications, and any ability to operate with lower energy consumption is a competitive advantage. Some companies have adopted corporate social responsibility initiatives with the goal of reducing their networks’ carbon footprints, and network infrastructure vendors are striving to gain a competitive advantage by reducing the power requirements of their equipment. Chip manufacturers have also taken steps to reduce power consumption, such as migrating to 3-D chips, where chips are stacked in a 3-D arrangement to minimize the interconnect distance, thus reducing energy consumption. More such developments are described in Chap. 9.


1.2.1 The Cost of Energy

Approximately 1.5–3.0 % of the energy produced in industrialized countries is consumed by data centers. Per unit area, data centers can be 10 or more times as energy-intensive as conventional office buildings [5]. A single data center can house thousands of servers, storage devices, and network devices, and continued growth in the number of servers is expected as industries expand data center capabilities.

In 2007, the Environmental Protection Agency (EPA) published its "Report to Congress on Server and Data Center Energy Efficiency, Public Law 109–431" [6] to evaluate the energy consumption of government and commercial data centers in the U.S. The EPA report found that U.S. data centers roughly doubled their energy consumption between 2001 and 2006, using 61 terawatt-hours (TWh) in 2006, at a cost of $4.5 billion [6]. The electricity consumed by data centers in 2006 was equivalent to the energy consumed by 5.8 million average U.S. households and was similar to the amount used by the entire U.S. transportation manufacturing industry, which includes the manufacturing of automobiles, aircraft, trucks, and ships [6]. In 2010, the energy consumed by U.S. data centers increased to 76 TWh [7]. Furthermore, the energy consumption of global data centers reached 237 TWh in 2010, accounting for about 1.3 % of the world’s electricity use [7].

The EPA report [6] includes the results of a 2006 survey of the power consumption of more than 20 data centers. It was found that a data center’s IT equipment, including servers, storage devices, telecom equipment, and other associated equipment, can use from about 10 to almost 100 W/sq. ft, which is over 40 times more than the power consumed by a conventional office building per unit area. For example, the Google data center in Oregon was estimated to have a power demand of 103 MW in 2011 [8]. The power consumption of a single rack of servers can reach 20–25 kW, which is equivalent to the peak electricity demand of about 15 typical California homes [6]. About 50 % of the energy consumed by data centers goes toward the power and cooling infrastructure that supports the electronic equipment [6, 9].

1.2.2 Environmental Issues

The telecom industry accounted for nearly 2 % of global CO2e (CO2 equivalent) emissions (about 830 megatons) in 2007, as per the "Smart 2020 Report" [10] published in 2008 by the Global e-Sustainability Initiative, a partnership of technology firms and industry associations. Even if efficient technology is developed to reduce energy consumption, it was estimated that the gas emissions of the telecom industry will increase at an annual rate of 6 % until 2020, when they will reach 1.43 Gt (gigatons, 10^9 tons) of CO2. About one-quarter of the gas emissions comes from the telecom equipment materials and manufacturing processes; the rest is generated during their operation in the field [10].

Generally, there are three subsectors in the telecom industry: personal computers (PCs) and peripherals (workstations, laptops, desktops, monitors, and printers), data centers, and telecom devices (fixed lines, mobile phones, chargers, Internet Protocol TV (IPTV) boxes, and home broadband routers). The PC population is experiencing explosive growth in some developing countries. The emerging middle classes in China and India are purchasing PCs much as their counterparts in developed countries did before them; given the size of these populations, this will substantially increase gas emissions. The carbon footprint of PCs and monitors is expected to be 643 Mt CO2e in 2020, with an annual growth of 5 % based on 200 Mt CO2e in 2002 (the gas emissions from peripherals will be about 172 Mt CO2e in 2020). The carbon footprint of data centers is projected to reach 259 Mt CO2 by 2020, compared with 76 Mt CO2 in 2002. Telecom device gas emissions are also expected to increase, to 349 Mt CO2 in 2020 [10]. The growth of the three subsectors of the telecom industry is illustrated in Fig. 1.2.

Many companies have made public commitments to reduce energy costs and other environmental impacts (Tables 1.1 and 1.2) [10]. These commitments suggest that corporations are seeking to reduce both their costs and their environmental impact.

Fig. 1.2 The gas emission growth of global telecom industry subsectors (PCs and peripherals, telecom devices, and data centers), in Mt CO2 for 2002, 2007, and 2020 [10]

Table 1.1 Companies’ public commitments to energy reduction [10]

  Intel: Reduce normalized energy use in operations by 4 % pa of 2002 level by 2010, and annually by 5 % of the 2007 level by 2012
  Hewlett-Packard: Reduce energy consumption of desktop and notebook PC families by 25 % (per unit) of 2005 level by 2010
  Nokia Siemens Networks: Reduce energy use of office facilities by 6 % of the 2007 level by 2012
  France Telecom: Reduce energy consumption by 15 % below 2006 level by 2020
  Nokia: Reduce energy consumption of office facilities to 6 % of the 2006 level by 2012


1.2.3 Government Regulations

The average global temperature is predicted to increase by 1.4–5.8 °C between 1990 and 2100 [11]. In order to prevent anthropogenic effects on the climate system due to increasing greenhouse gas concentrations in the atmosphere, the Kyoto Protocol was signed in 1999 and came into force in 2005 [12]. This protocol established commitments for reducing greenhouse gas emissions and required all member countries to enact policies and measures to meet the objectives. The goal is to lower the overall emissions of six greenhouse gases (carbon dioxide, methane, nitrous oxide, sulfur hexafluoride, hydrofluorocarbons, and perfluorocarbons) by an average of 5 % against 1990 levels over the period 2008–2012. National limitations range from 8 % reductions for the European Union (EU), to 7 % for the United States, to 6 % for Japan. The Kyoto Protocol is now a protocol within the United Nations Framework Convention on Climate Change (UNFCCC) [12].

Many energy savings goals are being set around the world to improve energy efficiency and secure energy supplies. The Energy Independence and Security Act of 2007 in the United States requires that all new federal buildings be "carbon-neutral" by 2050 [13]. A general policy was published in June 2005 by the European Commission (EC) to protect the environment and reduce energy waste [14]. This green paper set the goal of saving at least 20 % of the EU’s present energy consumption by 2020. The EC found that 10 % of the savings could be obtained by fully implementing energy saving legislation, while the other 10 % depends on new regulations. Germany plans to cut gas emissions by 40 % of 1990 levels by 2020, while Norway expects to become carbon neutral by 2050.

Table 1.2 Public environmental commitments of companies [10]

  British Telecommunications: Reduce the worldwide CO2 emissions per unit of BT’s contribution to GDP by 80 % of 1996 levels by 2020, and reduce UK CO2 emissions in absolute terms by 80 % of 1996 levels by Dec. 2016
  Microsoft: Every two years through 2012, cut by half the difference between the annual average data center PUE (a) and the ideal PUE (1.0)
  Sun: By 2012, reduce CO2 emissions 20 % from 2002 levels
  Alcatel-Lucent: Reach a 10 % reduction in total CO2 emissions from facilities from the 2007 baseline by the end of 2010
  Dell: Reduce operational carbon intensity (b) by 15 % of the 2007 level by 2012

(a) PUE is a key metric of energy efficiency advocated by Green Grid, a global consortium dedicated to advancing energy efficiency in data centers and business computing ecosystems. The definition of PUE can be found in Chap. 2.
(b) Carbon intensity is the total carbon dioxide emissions from the consumption of energy per dollar of gross domestic product (GDP).

The UK plans to reduce gas emissions 60 % below 1990 levels by 2050 [10]. China’s latest five-year plan (2011–2015) contains 17 % energy-efficiency improvement targets [15]. The telecom industry will need to share the responsibility for improving its operations if nations are to meet their stated goals. Flow-down and independent regulations on the telecom industry are described in the next subsection.

Since data centers are such large contributors to overall energy consumption, some of these requirements are likely to trickle down to data center operators. Some laws are being drafted specifically for data center operators. The "Code of Conduct on Data Center Energy Efficiency—Version 1.0" was enacted by the European Commission in October 2008 [1] and revised in November 2009 [16]. The document serves as an enabling tool for industry to implement cost-effective energy savings and helps determine and accelerate the application of energy-efficient technologies. The aim of the code is to "inform and stimulate data center operators and owners to reduce energy consumption in a cost-effective manner without hampering the mission and critical function of data centers" [1, 16]. The code suggests achieving this by "improving understanding of energy demand within data centers, raising awareness, and recommending energy efficiency best practice and targets."

In order to assist data center operators in identifying and implementing measures to improve the energy efficiency of data centers, a supplementary document to the code of conduct, "Best Practices for EU Code of Conduct on Data Centres," was released in 2008 [17] and revised in 2009 [18]. A broad group of experts, including data center equipment manufacturers, vendors, and consultants, have contributed to and reviewed the document. The document provides a list of best practices for data center operations, including data center management and planning; IT equipment and services (e.g., how to select and deploy new IT equipment and services and manage existing ones); cooling; power equipment (selecting and deploying new power equipment and managing existing equipment); data center construction; and other necessary practices and guidelines in data centers [18].

In the U.S., Public Law 109–431 [19] was enacted to study and promote the energy efficiency of data centers and servers. In August 2007, the U.S. Environmental Protection Agency (EPA) published a report [6] to Congress on server and data center energy efficiency. This report includes the growth trends of energy use associated with data centers and servers in the U.S., potential opportunities, and recommendations for energy savings through improved energy efficiency. One priority identified in the EPA’s report to the U.S. Congress was the development of objective energy performance metrics and ratings for data center equipment. The EPA developed an Energy Star program to identify energy efficient enterprise servers. The program requirements comprise eligibility criteria for qualifying server products, along with general partner commitments (including labeling requirements). The product specifications for Energy Star-qualified servers identify eligible products and the corresponding efficiency requirements for qualifying as an Energy Star product [9].


1.3 Summary

The telecom industry has become more concerned with energy consumption and the associated environmental effects. Since about 40 % of the total energy consumption in the telecom industry is devoted to cooling equipment in data centers, there is a great opportunity to modify cooling methods to improve the energy efficiency of the telecom industry. The benefits are not only in meeting environmental requirements, but also in lowering operating costs.

References

1. The Insight Research Corporation, The 2009 Telecommunications Industry Review: An Anthology of Market Facts and Forecasts, Boonton, NJ, USA, Nov 2008
2. Telecommunications Industry Association (TIA), ICT Market Review and Forecast 2010, Washington, DC, USA (2010)
3. Federal Communications Commission, Connecting America: The National Broadband Plan, March 2010
4. Telecommunications Industry Association (TIA), ICT Market Review and Forecast 2013, Washington, DC, USA (2013)
5. M. Hodes et al., Energy and Power Conversion: A Telecommunications Hardware Vendor’s Prospective, Power Electronics Industry Group (PEIG) Technology Tutorial and CEO Forum, Cork, Ireland, 6 Dec 2007
6. U.S. Environmental Protection Agency Energy Star Program, Report to Congress on Server and Data Center Energy Efficiency, Public Law 109-431, 2 Aug 2007
7. J.G. Koomey, Growth in Data Center Electricity Use 2005 to 2010 (Analytics Press, Oakland, CA, 2011), Aug 2011
8. J. Cradden, The greening of the IT sector, Technology Ireland, Dec 2009
9. P. Johnson, T. Marker, Data Center Energy Efficiency Product Profile, Pitt & Sherry, report to the Equipment Energy Efficiency Committee (E3) of the Australian Government Department of the Environment, Water, Heritage and the Arts (DEWHA), Apr 2009
10. The Climate Group on behalf of the Global eSustainability Initiative (GeSI), SMART 2020: Enabling the Low Carbon Economy in the Information Age, Brussels, Belgium (2008)
11. The Intergovernmental Panel on Climate Change (IPCC), Executive summary, Chapter 9: Projections of future climate change, in Climate Change 2001: The Scientific Basis, ed. by J.T. Houghton et al., http://www.grida.no/climate/ipcc_tar/wg1/339.htm. Accessed 2 Aug 2009
12. The United Nations Framework Convention on Climate Change (UNFCCC or FCCC), Kyoto Protocol, Kyoto, Japan, Dec 1997
13. Energy Independence and Security Act of 2007, U.S., Dec 2007
14. European Commission, Green Paper on Energy Efficiency, Doing More with Less (2005)
15. APCO World Wide, China’s 12th Five-Year Plan (2011), http://www.apcoworldwide.com/content/PDFs/Chinas_12th_Five-Year_Plan.pdf
16. European Commission, Code of Conduct on Data Centres Energy Efficiency—Version 2.0, Nov 2009
17. European Commission, Best Practices for EU Code of Conduct on Data Centres—Version 1.0, Oct 2008
18. European Commission, Best Practices for EU Code of Conduct on Data Centres—Version 2.0, Nov 2009
19. U.S. Public Law 109–431, 109th Congress, Dec 2006

Chapter 2
Data Center Energy Flow and Efficiency

Data centers form the backbone of information management in every sector of the economy, and their energy consumption has been of concern to governments and the telecom industry. This chapter introduces data center energy efficiency, including the main components and operating environments in data centers, as well as the standards, thermal guidelines, and metrics used to quantify energy efficiency. This chapter also presents the major cooling methods used in the industry to improve energy efficiency. A case study is discussed in which the energy consumption of a medium-size primary data center at an academic campus is analyzed and compared with experimental measurements.

2.1 Data Centers

A data center includes four categories of equipment: (1) power equipment, including power distribution units (PDUs), uninterruptible power supply systems (UPSs), switchgears (which, used in association with the electric power system, combine electrical disconnects, fuses, and/or circuit breakers to isolate electrical equipment), generators, and batteries; (2) cooling equipment, such as chillers, computer room air-conditioning (CRAC) units, cooling towers, and automation devices; (3) IT equipment, including servers, network and storage nodes, and supplemental equipment such as keyboards, monitors, workstations, and laptops used to monitor or otherwise control the center; and (4) miscellaneous component loads, such as lighting and fire protection systems [1]. Of these four categories, only the energy consumed by the IT equipment, which is used to manage, process, store, or route data, is considered effective [1]. The energy consumed by the other three categories, which represent the supporting infrastructure, needs to be minimized to improve energy efficiency without compromising data center reliability and performance.

Chapter 2Data Center Energy Flow and Efficiency

J. Dai et al., Optimum Cooling of Data Centers, DOI: 10.1007/978-1-4614-5602-5_2, © Springer Science+Business Media New York 2014

Page 23: Jun Dai Michael M. Ohadi Diganta Das Michael G. Pecht ...arco-hvac.ir/.../04/Optimum-Cooling-of-Data-Centers... · lar focus on free air cooling, its operation principals, opportunities,

10 2 Data Center Energy Flow and Efficiency

a PUE of 3.0 are shown in Fig. 2.1. As seen there, a higher PUE translates to a greater portion of the electrical power coming to the data center spent on the cooling infrastructure and vice versa [2]. The selected PUE of 1.8 in Fig. 2.1, represents the average value reported in a survey of more than 500 data centers as reported by Uptime Institute in 2011. The Energy Star program in the past has reported an aver-age PUE of 1.9 in 2009 based on the data it gathered for more than 100 data centers.

Fig. 2.1 Energy consumption in a typical data center. Power distribution for PUE = 1.8: IT equipment 55 %, chiller 19 %, CRAC/CRAH 13 %, UPS 5 %, humidifier 3 %, PDU 2 %, lighting/auxiliary devices 2 %, switchgear/generator 1 %. Power distribution for PUE = 3.0: IT equipment 34 %, chiller 31 %, CRAC/CRAH 21 %, UPS 6 %, humidifier 3 %, PDU 2 %, lighting/auxiliary devices 2 %, switchgear/generator 1 %


2.1.1 Power Equipment

The power equipment in data centers includes UPSs and PDUs. Barroso and Hozle [3] describe the three typical functions of UPS systems in data centers. The first function is to use a transfer switch to select the active power input to the data center. Data centers usually have two kinds of power input: utility power (regular power) and generator power (alternative power). When the utility power fails, the generator starts and, under the control of the transfer switch, becomes the active power source for the data center. The second function of the UPS system is to use an AC–DC–AC double conversion and batteries (or flywheels) to provide temporary power for the data center until generator power becomes available after a utility power failure. The AC–DC–AC double conversion converts the input AC power to DC, which feeds a UPS-internal DC bus that charges the batteries inside the UPS system. The output of the UPS-internal DC bus is then converted back to AC power for the equipment in the data center. When the utility power fails, this design can sustain the internal DC power of the UPS system until the power supply from the generator is available. The third function of a UPS system is to remove voltage spikes or sags in the incoming power, a byproduct of the AC–DC–AC double conversion. The power consumed by a UPS in operation comes from the inherent losses associated with all of these electrical conversions, and these losses also result in heat dissipation. Because of their large space requirements, and to avoid having to cool them alongside the IT load, UPS systems are usually housed in rooms separate from the IT equipment [3].

The PDUs receive power from the UPS systems and then convert and distribute the higher-voltage power (typically 200–480 V) into many 110 or 220 V circuits that supply the IT equipment in the data center. Each circuit is individually protected by its own dedicated breaker, so if the breaker of a circuit trips due to a ground short, only that circuit (not all the PDUs and UPS systems) is affected. A typical PDU can supply 75–225 kW of power and feed many 110 or 220 V circuits. The PDUs and UPS systems are usually deployed redundantly with a small delay switch, which prevents an interruption of the power supply to the IT equipment in case of UPS or PDU failures [3].

2.1.2 Cooling Equipment

A common cooling method for data centers is the use of computer room air-conditioners (CRACs), which pass cooled air to the IT equipment racks through a raised floor (see Fig. 2.2). The air flows across the IT equipment and then removes the exhausted heat from the back of the rack. To avoid mixing hot and cold air, which reduces the cooling efficiency, the typical practice is to arrange alternating rack rows of "hot aisles" and "cold aisles." Since hot air is lighter than cold air, the hot exhaust air from the IT equipment rises and recirculates into the CRAC, where it is cooled and supplied to the racks again. The hot and cold aisles in modern data centers are physically separated from each other via curtains or hard partitions to prevent the mixing of hot and cold air, thus improving air distribution and energy efficiency.

The warm exhaust air leaving the electronics cabinets is pushed by fans through the coils containing chilled liquid coolant in the CRAC unit, where it exchanges heat and is cooled before returning to the cabinets. The heated cooling fluid leaving the CRAC coils is recirculated by pumps into a secondary-loop chiller or cooling tower(s), where the heat removed from the coolant is expelled to the outside environment. Typically, the coolant temperature is maintained in the range of 12–14 °C, the cool air leaving the CRACs is in the range of 16–20 °C, and the cold aisle is at about 18–22 °C [3].

2.1.3 IT Equipment

The IT equipment in data centers includes servers, storage devices, and telecom equipment (e.g., routers and switches). Storage includes storage area networks, network-attached storage, and external hard disk drive (HDD) arrays [4]. The main functions of data centers are to store data and to provide access to those data when requested. The IT equipment is the primary equipment that performs these functions: the servers and storage units hold and process the data, while the telecom equipment, including routers and switches, provides communication among the equipment inside the data center as well as between the data center and the outside.

Fig. 2.2 Circulation with raised floor and hot–cold aisles (CRAC units supply cold air under the raised floor to the cold aisles; hot exhaust air from the server racks returns to the CRACs through the hot aisles)


2.2 Energy Efficiency Metrics

An appropriate energy efficiency metric is essential for striking a balance between the sophisticated engineering associated with the IT equipment and the engineering associated with the optimum design of the infrastructure that houses it. A variety of metrics are under consideration for quantifying and comparing data center efficiencies. Among the more established are the power usage effectiveness (PUE) metric and its reciprocal, the data center infrastructure efficiency (DCiE). The PUE is defined as the ratio of the total power drawn by a data center facility to the power used by the IT equipment in that facility [5]:

\mathrm{PUE} = \frac{\text{Total facility power}}{\text{IT facility power}} \quad (2.1)

where the total facility power is the total power consumption of the data center, and the IT facility power is the power consumption of the IT equipment in the data center. An alternative to the PUE metric is the DCiE, which is defined as [5]:

\mathrm{DCiE} = \frac{\text{IT facility power}}{\text{Total facility power}} \quad (2.2)

\mathrm{DCiE} = \frac{1}{\mathrm{PUE}} \quad (2.3)

The DCiE is a measure of the overall efficiency of a data center: it indicates the percentage of the total energy drawn by a facility that is used by the IT equipment. Both PUE and DCiE represent the same concept in different formats, but PUE more clearly conveys the penalty one pays for the infrastructure whenever its value exceeds one. In recent years PUE has been used so much more widely than DCiE that the Green Grid has dropped DCiE from its list of recommended metrics for data center energy efficiency.
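To make Eqs. (2.1)–(2.3) concrete, the short Python sketch below computes PUE and DCiE from two metered power readings; the function names and the sample values are illustrative assumptions, not measurements from any particular facility.

# Minimal sketch of the PUE and DCiE definitions in Eqs. (2.1)-(2.3).
# The sample inputs are hypothetical metered power readings in kW.

def pue(total_facility_kw, it_facility_kw):
    """Power usage effectiveness: total facility power over IT facility power."""
    return total_facility_kw / it_facility_kw

def dcie(total_facility_kw, it_facility_kw):
    """Data center infrastructure efficiency: the reciprocal of PUE."""
    return it_facility_kw / total_facility_kw

total_kw = 900.0  # assumed reading at the utility handoff
it_kw = 500.0     # assumed reading at the UPS output
print(f"PUE  = {pue(total_kw, it_kw):.2f}")   # 1.80
print(f"DCiE = {dcie(total_kw, it_kw):.1%}")  # 55.6%

With these sample readings the overhead matches the PUE = 1.8 case of Fig. 2.1, in which roughly 55 % of the incoming power reaches the IT equipment.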

Various industry and government organizations, including the 7 × 24 Exchange, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE), the Green Grid, the Silicon Valley Leadership Group, the U.S. Department of Energy's "Save Energy Now" and Federal Energy Management programs, the U.S. Environmental Protection Agency's Energy Star Program, the U.S. Green Building Council, and the Uptime Institute, met on January 13, 2010 to discuss energy metrics and measurement. As a result, the following agreement emerged: (1) power usage effectiveness (PUE) is the preferred energy efficiency metric for data centers; (2) IT energy consumption should, at a minimum, be measured at the output of the uninterruptible power supply (UPS), but the measurement capabilities should be improved over time to measure the IT consumption directly at the IT load (i.e., servers, storage, and network equipment); and (3) the total energy consumption measurement for a dedicated data center (a facility in which all the spaces and supporting infrastructure, e.g., HVAC and lighting, are directly associated with the operation of the data center) includes all energy sources at the point of utility handoff to the data center operator (for a data center in a mixed-use building, it is all the energy required to operate the data center itself) [6]. The agreement also recommended four measurement categories for the PUE (0, 1, 2, and 3) for a typical power delivery procedure. A typical power delivery procedure is described in Fig. 2.3, and the categories are defined in Table 2.1.

PUE category 0 is an entry-level measurement category that enables operators without consumption metering capability to use demand-based power readings. It represents the peak load during a 12-month measurement period. In this measurement, the IT power is represented by the demand (kW) reading of the UPS system output (or the sum of the outputs if more than one UPS system is installed), as measured during peak IT equipment use. The total data center power is measured at the utility meter(s) and is typically reported as demand kW on the utility bill. Because this category provides only a snapshot measurement, the true impact of fluctuating IT or mechanical loads can be missed. This category can be used only for all-electric data centers, excluding centers that also use other types of energy (e.g., natural gas or chilled water).

Fig. 2.3 A typical power delivery procedure in data centers (power flows from the grid through the UPSs and PDUs to the power supply units of the servers, storage, and network equipment, while the cooling tower, chiller, CRAC units, lighting, and fire protection system draw their share of the total energy in parallel)

Table 2.1 Summary of the four power category measurements [6]

                                 Category 0                  Category 1           Category 2           Category 3
IT energy measurement location   UPS output                  UPS output           PDU output           Server input
Definition of IT energy          Peak IT electric demand     IT annual energy     IT annual energy     IT annual energy
Definition of total energy       Peak total electric demand  Total annual energy  Total annual energy  Total annual energy

PUE category 1, 2, and 3 measurements share the same total energy measurement, which is typically obtained from the utility bills by adding 12 consecutive monthly kWh readings, plus the annual natural gas or other fuel consumption (if applicable). However, the IT energy measurements of PUE categories 1, 2, and 3 differ. For PUE category 1, the IT load is represented by a 12-month total kWh reading of the UPS system output. For PUE category 2, the IT load is represented by a 12-month total kWh reading taken at the output of the power distribution units (PDUs) that support the IT loads. For category 3, the IT load is represented by a 12-month total kWh reading taken at the point of connection of the IT devices to the electrical system. All three are cumulative measurements that require kWh consumption meters at every measurement point. The four PUE categories are summarized in Table 2.1.
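The sketch below is a rough illustration, not part of the reporting agreement itself; it shows how the Table 2.1 categories differ only in where the IT energy is metered, using a hypothetical set of 12-month cumulative kWh readings.

# Illustrative annual PUE for Table 2.1 categories 1-3.
# Category 0 uses peak kW demand readings rather than annual kWh, so it is omitted.
# All kWh figures below are hypothetical 12-month cumulative readings.

annual_kwh = {
    "utility_total": 4_800_000,  # all energy sources at the utility handoff
    "ups_output":    2_600_000,  # category 1 IT metering point
    "pdu_output":    2_500_000,  # category 2 IT metering point
    "server_input":  2_400_000,  # category 3 IT metering point
}

it_meter_by_category = {1: "ups_output", 2: "pdu_output", 3: "server_input"}

for category, meter in it_meter_by_category.items():
    annual_pue = annual_kwh["utility_total"] / annual_kwh[meter]
    print(f"Category {category}: PUE = {annual_pue:.2f} (IT metered at {meter})")

Metering closer to the IT devices (category 3) excludes the PDU and cabling losses from the IT term, so the reported PUE rises slightly even though nothing in the facility has changed.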

2.3 Methods to Improve Energy Efficiency

This section introduces some representative methods to improve the energy efficiency of data centers. It includes measures for energy efficiency improvements of electronics hardware, software applications, power equipment, and cooling systems.

2.3.1 Efficient Electronics

The first and best method for improving energy efficiency in data centers is to develop efficient electronics. For example, for PCs, solid-state hard drives could reduce energy consumption by up to 50 %, and cholesteric LCD screens could reduce monitor energy consumption by up to 20 % [7]. For telecom devices, the overall power consumption per device is expected to decrease steadily between now and 2020 as a result of the high priority placed on adopting efficient technologies. For example, one available efficiency technology for mobile infrastructure is a network optimization package, which can reduce energy consumption by up to 44 % [7].

Manufacturers of servers, processors, and other ICT equipment have set energy efficiency as one of their main goals and have made some notable achievements. There are three major techniques for improving processor efficiency: multi-core processors, low-voltage processors, and smaller chips made with advanced materials. A multi-core processor combines two or more processor cores in a single package and can run all of the cores simultaneously when needed or only one core on demand. Low-voltage processors can offer sufficient performance and high efficiency for many applications. Smaller chips with advanced materials have been designed to utilize new materials (e.g., strained silicon) to reduce leakage current and heat losses [8].


2.3.2 Efficient Software Applications

The number of servers used in data centers has increased, even as the cost of computing power per server has decreased. When servers in data centers are not performing work, they may sit in an idle state and still consume power. An effective way to save energy is therefore to increase the utilization of the servers. Most data centers are designed with more server capacity than necessary in order to compensate for inefficient utilization and to provide space for future growth. As a result, servers are usually operated at less than their full capacity, and their actual efficiency falls below the manufacturers' rated efficiency. It is not uncommon for servers to be operated at utilization rates of 10–15 % of their full capacity. A Hewlett-Packard Labs study of data centers reported that most of the 1,000 servers examined had utilization rates of only 10–25 % of their full capacity [8].

Data center operators are therefore interested in developing software to operate data centers more efficiently and reduce energy consumption. Efficient management supported by software development can reduce the number of working servers while achieving the same functional tasks. Furthermore, if a reduced number of servers can deliver the same performance, the amount of auxiliary equipment, such as power supplies, distribution equipment, and cooling equipment, can also be reduced [8].

One technology for saving energy is virtualization, an approach to the efficient use of computer server resources that reduces the total number of servers or server locations where utilization is low. Virtualization can shift the working load between the data centers of a company and allow underutilized servers to be shut down. This technology represents a radical rethinking of how to deliver data center services and could reduce emissions by up to 27 % [7].
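The following back-of-the-envelope sketch illustrates the consolidation argument; the server count, target utilization, and per-server power draw are assumptions chosen only for illustration, while the starting utilization echoes the 10–25 % range cited above.

import math

# Rough estimate of how many physical servers consolidation could retire.
# All inputs are assumptions, not measurements from a specific data center.
servers = 1000              # installed physical servers
avg_utilization = 0.15      # within the 10-25 % range cited in the text
target_utilization = 0.60   # assumed safe post-consolidation utilization
power_per_server_kw = 0.3   # assumed average draw per server

# The useful work stays constant, so the required server count scales with
# the ratio of current utilization to target utilization.
servers_needed = math.ceil(servers * avg_utilization / target_utilization)
retired = servers - servers_needed
print(f"Servers after consolidation: {servers_needed}")
print(f"Servers retired: {retired}, direct IT power saved: {retired * power_per_server_kw:.0f} kW")

Because every retired server also sheds its share of UPS, PDU, and cooling overhead, the facility-level saving is larger than the direct IT figure by roughly the facility's PUE factor.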

2.3.3 Efficient Power Supply and Distribution

The servers in existing data centers usually have in-box power supply units that operate at 60–70 % efficiency. However, more efficient power supplies have been developed for many servers; their peak efficiency can reach 85–87 % with single 12 V outputs [8]. Beyond the reduced energy consumption of the power supplies themselves, the energy efficiency of the distribution and cooling systems has also been improved. The actual efficiencies of power supplies in data centers are lower than their rated efficiencies, since they seldom operate at the loads for which the rated efficiencies are calculated. New power supply technologies are intended to address this decreased efficiency at lower loads; for example, there are uninterruptible power supply (UPS) systems with higher efficiencies. A typical UPS system efficiency is about 90 % at full load, and some can reach 95 % or higher [8]. The implementation of DC power can help improve UPS efficiency, and the energy consumption of a power distribution system can thereby be reduced by about 7.3 % [9].
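The sketch below chains the stage efficiencies quoted above into an end-to-end delivery efficiency; the distribution-stage value and the IT load are assumptions added for illustration.

# End-to-end power delivery efficiency as the product of stage efficiencies.
# Stage values are assumptions loosely based on the ranges quoted in the text.

def delivered_fraction(stage_efficiencies):
    """Fraction of the input power that survives every conversion stage."""
    fraction = 1.0
    for eta in stage_efficiencies:
        fraction *= eta
    return fraction

legacy = [0.90, 0.95, 0.65]    # UPS ~90 %, distribution ~95 % (assumed), older PSU ~65 %
improved = [0.95, 0.97, 0.86]  # higher-efficiency UPS, distribution, and PSU

it_load_kw = 200.0  # assumed useful IT load downstream of the power supplies
for name, chain in (("legacy", legacy), ("improved", improved)):
    eta = delivered_fraction(chain)
    input_kw = it_load_kw / eta
    print(f"{name}: chain efficiency {eta:.1%}, input {input_kw:.0f} kW, losses {input_kw - it_load_kw:.0f} kW")

Every kilowatt of conversion loss also has to be removed by the cooling system, which is why power-chain upgrades pay off twice.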


2.3.4 Efficient Cooling Systems

Cooling systems provide opportunities to reduce the overall energy consumption of data centers, since roughly 40 % of the total energy is associated with cooling. Data centers tend to overcool in order to prevent equipment downtime, maintaining an operating environment of about 20 °C and 50 % RH [8]. Some "smart" or "adaptive" cooling solutions allow dynamic adjustment of the cooling air flow and temperature set points based on heat load monitoring throughout the data center. These methods avoid the excess energy consumption caused by overcooling and also prevent the formation of hot spots. For example, free air cooling (FAC), or air economization, is one of the simplest and most promising methods of reducing the energy consumed for cooling; FAC uses air from outside the data center to cool the equipment directly instead of using air-conditioning [6, 10]. Large data centers have implemented the "cold aisle/hot aisle" design strategy to separate hot and cold air flows. Most data centers use airflow to cool equipment; however, liquid cooling can offer both greater efficiencies and the capability to handle high-power-density equipment, such as the processors in servers, since liquids have considerably larger heat capacities than air. More comprehensive coverage of the various cooling methods for data centers is given in Chap. 4.

2.4 Case Study Example on Data Center Energy Saving Opportunities

In this study, the PUE of a medium-size primary data center (20,000–249,000 ft2 [11]) on the campus of the University of Maryland was evaluated experimentally and by simulation. A simulation model was developed to investigate the temperature and energy flow mapping of the data center. The energy consumption analysis of this data center, as well as possible energy conservation measures and the respective energy savings, is discussed in the following sections.

2.4.1 Analysis of Energy Consumption

Figure 2.4 shows a schematic diagram of the data center in the current study. It includes three UPS rooms, a data center room, and the supporting infrastructure. A cooling tower and a satellite central utilities building (SCUB) connected to a heat exchanger and pumps supply the chilled water. The UPS rooms contain three CRACs and three UPSs. The 355 m2 data center room contains 59 racks, 14 PDUs, and six CRACs, as shown in Fig. 2.5.


The PUE is defined as the ratio of the total data center input power to the IT power, so the power consumption levels at points (a), (b), and (d) in Fig. 2.6 are needed to obtain it [12]. These points represent the cooling load, the power input to the UPS, and the IT load, respectively:

\mathrm{PUE} = \frac{\text{Total facility power}}{\text{IT facility power}} = \frac{a + b}{d} \quad (2.4)

Fig. 2.4 Schematic diagram of the data center components (the three UPS rooms, with three UPSs and three CRACs, and the data center room, with PDUs, servers, and six CRAC fans, reject their thermal loads through a heat exchanger and pumps to the cooling tower and SCUB)

Fig. 2.5 Schematic floor plan of the data center room (rack rows, PDUs, and CRACs, with the heat dissipation of each rack indicated in bands from 0 to 13,000 W)


where a is the power into the cooling facility, b is the power into the power facility, and d is the power into the IT equipment. The combined thermal load of the data center room and the three UPS rooms comprises the IT load, the UPS and PDU loads, the CRAC fan loads, lighting, and any associated miscellaneous loads.

To obtain the thermal loads of the data center room and the three UPS rooms, the temperatures and velocities at the inlets and outlets of all nine CRACs were measured, as shown in Fig. 2.7. Sixteen temperature points and 24 velocity points were measured at the inlets and outlets of the CRACs, and the average temperature and velocity were used to calculate the total thermal load removed from the rooms:

Q_{\mathrm{Room}} = Q_{\mathrm{IT}} + Q_{\mathrm{UPS,PDU}} + Q_{\mathrm{Light}} = \sum_{i=1}^{9} \dot{m}_i c_p \left( T_{\mathrm{in},i} - T_{\mathrm{out},i} \right) \quad (2.5)

where \dot{m}_i is the air mass flow rate through the ith CRAC, c_p is the specific heat of air, and T_{\mathrm{in},i} and T_{\mathrm{out},i} are the return (inlet) and supply (outlet) air temperatures of the ith CRAC.
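A sketch of the Eq. (2.5) bookkeeping follows; the CRAC face area, the air properties, and the sample temperature and velocity readings are assumed values for illustration and are not the survey data from this case study.

# Room thermal load from CRAC return/supply measurements, following Eq. (2.5).
# Mass flow is estimated from the average face velocity; all inputs are assumed.

RHO_AIR = 1.2    # kg/m^3, approximate density of air
CP_AIR = 1.005   # kJ/(kg*K), specific heat of air
FACE_AREA = 1.5  # m^2, assumed CRAC return-air face area

# (average return temperature degC, average supply temperature degC, average velocity m/s)
crac_readings = [
    (21.4, 9.9, 4.0),
    (19.5, 10.0, 3.8),
    (19.6, 9.2, 3.9),
]

q_room_kw = 0.0
for t_return, t_supply, velocity in crac_readings:
    m_dot = RHO_AIR * velocity * FACE_AREA               # kg/s through this CRAC
    q_room_kw += m_dot * CP_AIR * (t_return - t_supply)  # kW removed by this CRAC

print(f"Estimated room thermal load: {q_room_kw:.0f} kW")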

The thermal loads in the data center room and UPS rooms were 313 and 122 kW, respectively, for a total of 435 kW. The IT load of 210 kW is obtained by measuring the output power at the PDUs, as shown in Table 2.2.

A flowchart of the cooling load is shown in Fig. 2.8. The total thermal load of 435 kW represents the total energy consumed in the data center room and the three associated UPS rooms. The CRACs in the data center room are connected to a heat exchanger located adjacent to the building in which the data center is housed, as shown in Fig. 2.9. The heat exchanger rejects this heat to the satellite central utility building (SCUB), which has a coefficient of performance (COP) of 5.18 and is located at position 4 in Fig. 2.9. The power consumption of the SCUB is obtained from the thermal loads of the rooms and the COP of the SCUB; the energy consumed by the SCUB for data center cooling was determined to be 84 kW.
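For reference, the SCUB electrical power quoted above follows directly from the measured thermal load and the COP:

P_{\mathrm{SCUB}} = \frac{Q_{\mathrm{thermal}}}{\mathrm{COP}} = \frac{435\ \mathrm{kW}}{5.18} \approx 84\ \mathrm{kW}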

The energy consumption results are summarized in Fig. 2.10 and Table 2.3. As seen there, the IT load at point (d) of Fig. 2.6, obtained from Table 2.2, is 210 kW; the combined power input to the UPS rooms and the data center room is 435 kW (313 kW + 122 kW); and the total energy consumption is 541 kW. The power facility, including the UPS and PDUs, consumes 117.3 kW, and the total amount of energy for the cooling load is 156.2 kW, as previously mentioned. Therefore, the IT load, power load, and cooling load consume 38.8, 21.7, and 28.9 % of the total generated electricity, respectively.

Fig. 2.6 Definition of PUE [12] (power flows from the utility through the cooling equipment, UPS, and PDUs to the IT equipment, with measurement points (a) through (f) marked along the delivery path)

The measured PUE of the data center is 2.58, based on the measured results in Table 2.3. In these measurements the power of the server fans was counted as part of the IT load. A higher PUE results if the server fans are instead counted as part of the cooling power, which some believe is how the PUE should actually be calculated.
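Using the breakdown in Fig. 2.10 (server fans of about 21 kW within the 210 kW IT load), the effect of reclassifying the fan power can be made explicit; this re-calculation is illustrative and is not a figure reported by the study:

\mathrm{PUE}_{\text{fans as IT}} = \frac{541}{210} \approx 2.58, \qquad \mathrm{PUE}_{\text{fans as cooling}} = \frac{541}{210 - 21} = \frac{541}{189} \approx 2.86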

The measured PUE of 2.58 for this data center is considerably higher than that of average energy-efficient data centers, which report PUEs of 1.5–2.0 [13–15]. There are several reasons for this, including the fact that the data center spends 28.9 % of its total energy on cooling and 21.7 % on the power facility. In addition, 8.9 % (48.3 kW) of the total energy is lost in the power facility within the data center room (not listed in Fig. 2.10), and 1.7 % of the total energy is consumed by lighting. The case example data center therefore offers many energy conservation opportunities. Five major energy conservation measures (ECMs) were identified and recommended as action steps to improve the energy efficiency of this data center. These are tabulated in Table 2.7 and include: (I) elimination of unnecessary CRACs (computer room air-conditioning units), (II) an increase in the return air temperature at the CRACs, (III) cold aisle containment, (IV) variable-speed fan drives, and (V) free air cooling. Further details and the payback analysis of each of these ECMs are discussed in Sect. 2.4.3.

Fig. 2.7 Velocity and temperature measurement of CRACs

Table 2.2 Thermal load contributions

Thermal load          kW
Data center room      313
Three UPS rooms       122
Total thermal load    435
IT load               210

Fig. 2.8 Flowchart of cooling load (the 435 kW thermal load removed by the CRACs is transferred through the heat exchanger to the SCUB cooling tower, which has a COP of 5.18 and consumes 84 kW)

Fig. 2.9 1 Data center; 2 heat exchanger; 3 building which shares the heat exchanger; 4 cooling tower and SCUB

Fig. 2.10 Power consumption of each facility, in kW: total electricity input 541; data center room thermal load 313 (servers: IT 189 and fans 21; PDUs 12; six CRAC fans 33.5; lighting 9.2); UPS room thermal load 122 (UPSs 105.3; three CRACs 16.7); cooling tower and chiller (SCUB) 84; pumps 22


2.4.2 Energy Consumption Simulations

The thermal and fluid flow characteristics of data centers have been analyzed and evaluated by many researchers using simulation software [17–21]. In the current study, the 6 Sigma commercial simulation program from Future Facility Inc. (San Jose, CA), one of the widely used programs for data center simulations, was used for the simulation analysis. A comparison of the measurements against the simulation results is provided in this chapter.

The average inlet and outlet temperatures of the IT racks and CRACs were measured and compared with the simulation results. As shown in Fig. 2.11, the average inlet and outlet temperatures at the racks are 15 and 24 °C, respectively. The maximum average outlet temperature of the racks was 36.5 °C, with rack R06 registering a hot-spot temperature of 39.4 °C. The maximum allowable rack outlet temperature in a typical data center is around 59 °C [16]. The data center is therefore operating at a much cooler set point than necessary.

Next, the inlet temperature, outlet temperature, and cooling load of each CRAC are shown in Table 2.4. The cooling capacity of each CRAC is 100 kW. As seen there, the cooling contributions of CRACs 3 and 6 are lower than those of the other CRACs. In fact, analysis of the temperature distributions revealed that CRACs 3 and 6 actually had an adverse effect on the data center cooling because they supplied air at an average of 18.6 °C, considerably warmer than the air supplied by the other CRACs. Accordingly, turning off CRACs 3 and 6 will improve the cooling efficiency and energy savings of the data center.

Figure 2.12 shows the 3D simulation model built with the 6 Sigma commercial simulation program [17]. The simulation model reflects the actual data center systems. The height from the raised floor to the ceiling is 3 m, and the heights of the raised floor and the racks are 44 cm and 2 m, respectively. As shown in Fig. 2.13, the location and number of servers in each rack reflect the actual data center arrangement.

Figure 2.14 shows the simulated air temperature distribution at a location 1.8 m above the raised floor. The maximum rack outlet temperature is 39.3 °C, which occurs at Row F. The location and value of this outlet temperature match the measurement results discussed above. In addition, some hot air recirculation at Row R was observed.

Table 2.3 Power consumption of the data center

                                  kW
Data center room thermal load    313
UPS room thermal load            122
Cooling consumption by SCUB       84
Pumps                             22
Total                            541
IT load                          210


Figure 2.15 shows the temperature distribution of the air supplied by the CRACs below the raised floor. The thermally effective area of each CRAC is clearly observed. As previously mentioned, CRACs 3 and 6 have an adverse effect on the uniform cooling of the data center, supplying warmer air than the other CRACs and thus reducing their cooling effect.

Table 2.5 provides a comparison of the cooling loads from the simulation and the measurements. As seen there, the simulation results are on average within ±10 % of the measurement results. Similar to the findings from the measurement analysis, the cooling contributions of CRACs 3, 6, and 8 were lower than those of the other CRACs, although the cooling capacity of each CRAC is 100 kW. Turning off CRACs 3 and 6 will improve the cooling efficiency and energy savings of the data center, consistent with the measurement results.

Figure 2.16 shows the temperature distribution of the air supplied by the CRACs below the raised floor when CRACs 3 and 6 are turned off. Compared to Fig. 2.15, the temperature of the supplied air below the raised floor is decreased and the cooling performance is increased, reflecting the absence of the warm air previously supplied by CRACs 3 and 6 and leading to a further decrease in the supplied air temperature through the tiles in the cold aisles, as shown in Fig. 2.17. Note that although CRACs 3 and 6 are off, the total cooling capacity of the remaining four CRACs in the data center room is 400 kW, which is about 100 kW larger than the total power generated in the room.

Fig. 2.11 Average inlet and outlet temperatures at racks

Table 2.4 Cooling loads of CRACs

          Inlet T (°C)   Outlet T (°C)   ΔT (°C)   Power (kW)
CRAC 3    20.0           20.0            0.0       0.0
CRAC 4    21.4           9.9             11.4      89.1
CRAC 5    19.5           10.0            9.6       83.1
CRAC 6    20.5           17.2            3.3       25.1
CRAC 7    19.6           9.2             10.4      86.1
CRAC 8    20.1           15.9            4.2       29.4
Total measured power                               312.8

Fig. 2.12 3D simulation model using the 6 Sigma program from Future Facility Inc

Fig. 2.13 Location and number of servers in rack at Row R


The air temperature distribution at a height of 1.8 m above the raised floor is shown in Fig. 2.18. The maximum rack outlet temperature was 40.9 °C, at Row F; the difference between the hot spots with all CRACs running and with CRACs 3 and 6 off is therefore only 1.6 °C. The inlet temperature at Row T decreased by 10 °C after the two CRACs were turned off, as shown in Fig. 2.19, and the hot-spot temperature at the outlet of Row R also decreased by 10 °C, as shown in Fig. 2.20. After the monitored humidity levels stabilized, the inlet humidity at Row T was about 60 %RH and the outlet humidity at Row R about 23 %RH. Apart from temperature, data centers must also monitor the humidity level, since inappropriate humidity levels can activate certain failure mechanisms, such as electrostatic discharge (ESD) at very low humidity levels and conductive anodic filament (CAF) formation at high humidity levels; the details are introduced in Chap. 5. Furthermore, as shown in Table 2.6, the cooling loads of CRACs 3 and 6 were redistributed to the remaining four CRACs. A total of 11.2 kW, the power drawn by the fans of CRACs 3 and 6, was saved by turning off these two CRACs.

Fig. 2.14 Temperature distribution of air 1.8 m from the raised floor

Fig. 2.15 Temperature distribution of supplied air from CRACs below the raised floor

2.4.3 Energy Conservation Findings

In this study, the PUE of a medium-size primary data center at the University of Maryland was evaluated experimentally and compared with simulation results. The IT, cooling, and power loads were measured to evaluate the PUE of the data center. The IT load, cooling load, and power load represented 38.8, 28.9, and 21.7 % of the total energy consumption, respectively. Based on this analysis, the PUE of the data center was calculated to be 2.58. Five major energy saving opportunities were identified and recommended as action steps to improve the energy efficiency of this data center. These are tabulated in Table 2.7 and include: (I) elimination of unnecessary CRACs (computer room air-conditioning units), (II) an increase in the return air temperature at the CRACs, (III) cold aisle containment, (IV) variable-speed fan drives, and (V) fresh air cooling. Table 2.7 provides a summary of these ECMs and the payback period for each.

Table 2.5 Comparison of cooling loads between simulation and measurement

          Inlet T (°C)   Outlet T (°C)   ΔT (°C)   Simulation (kW)   Measurement (kW)
CRAC 3    20.0           20.0            0.0       0.0               0.0
CRAC 4    21.4           9.9             11.4      89.1              84.2
CRAC 5    19.5           10.0            9.6       83.1              79.3
CRAC 6    20.5           17.2            3.3       25.1              25.6
CRAC 7    19.6           9.2             10.4      86.1              86.6
CRAC 8    20.1           15.9            4.2       29.4              34.1
Total power                                        312.8             309.8

Fig. 2.16 Temperature distribution of supplied air from CRACs below the raised floor with CRACs 3 and 6 off

Fig. 2.17 Supplied air temperature at the tiles on cold aisles with all CRACs (above) and with CRACs 3 and 6 off

Fig. 2.18 Temperature distribution of air at a height of 1.8 m above the raised floor with CRACs 3 and 6 off


Fig. 2.19 Temperature and relative humidity at the inlet of Row T, monitored from 02/05/13 to 02/06/13

Fig. 2.20 Temperature and relative humidity at the outlet of Row R, monitored from 02/05/13 to 02/06/13

Table 2.6 Cooling loads of CRACs

          Inlet T (°C)   Outlet T (°C)   ΔT (°C)   Simulation (kW)
CRAC 3    –              –               –         –
CRAC 4    21.9           10.4            11.5      89.4
CRAC 5    21.0           10.9            10.1      87.4
CRAC 6    –              –               –         –
CRAC 7    21.6           10.9            10.7      89.0
CRAC 8    21.7           16.6            5.1       36.0
Total power                                        301.8


The results in Table 2.7 indicate an immediate payback (no-cost ECMs) for two of the measures and short-term paybacks for the other two; the free air cooling (ECM 5) payback calculations are still in progress.

Table 2.7 The identified energy conservation measures (ECMs)

                                  Energy savings (MWh/year)   Dollar savings (US$/year)   Ton (CO2)/year   Payback period
ECM 1—Turn off two CRACs          96.4                        10,700                      61.9             Immediate
ECM 2—CRAC set point              111.2–152.4                 12,300–16,800               87.4             Immediate
ECM 3—Closed cold aisles          132.0                       14,600                      65.4–97.3        3.6 months
ECM 4—Variable speed fan drive    113                         12,400                      71.3             3.6 years
ECM 5—Fresh air cooling           770–901                     85,000–98,000               716–836          In progress
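The dollar figures in Table 2.7 are consistent with an electricity tariff of roughly US$0.11/kWh. The sketch below shows the simple-payback arithmetic behind such a table; the tariff and the capital cost are assumptions for illustration, not values reported in the study.

# Simple payback estimate for an energy conservation measure (ECM).
# The tariff and the capital cost are assumptions chosen for illustration.

TARIFF_USD_PER_KWH = 0.11  # roughly consistent with the Table 2.7 dollar savings

def simple_payback_years(energy_saved_mwh_per_year, capital_cost_usd):
    """Years needed for the energy savings to repay the ECM's capital cost."""
    annual_savings_usd = energy_saved_mwh_per_year * 1000 * TARIFF_USD_PER_KWH
    return capital_cost_usd / annual_savings_usd

# Hypothetical example sized like ECM 3 (cold aisle containment, 132 MWh/year).
savings_mwh = 132.0
capital_usd = 4400.0  # hypothetical retrofit cost
print(f"Annual savings: ${savings_mwh * 1000 * TARIFF_USD_PER_KWH:,.0f}")
print(f"Simple payback: {simple_payback_years(savings_mwh, capital_usd) * 12:.1f} months")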

2.5 Summary

This chapter describes the data center energy flow and energy efficiency metrics. The energy consumed by the IT equipment is considered effective energy consumption, whereas the energy consumed by the other equipment (e.g., power and cooling equipment) needs to be minimized to improve the energy efficiency. The major energy efficiency metric for data centers is the power usage effectiveness (PUE), which is the ratio of the total energy consumption to the IT equipment consumption. A case study was presented to discuss the energy conservation opportunities in a medium-size data center.

References

1. D. Dyer, Current trends/challenges in datacenter thermal management—a facilities perspective, in Proceedings of the ITHERM, San Diego, CA, Jun 2006

2. Qpedia Thermal Magazine, Total Cost of Data Center Ownership, vol. V, Issue VIII (Sep 2011)

3. L. Barroso, U. Hozle, in Synthesis Lectures on Computer Architecture, ed. M. Hill. The Datacenter as a Computer: An introduction to the design of warehouse-scale machines, vol 6 (Morgan & Claypool, U.S., 2009)

4. U.S. Environmental Protection Agency Energy Star Program, Report to Congress on Server and Data Center Energy Efficiency Public Law 109-431 (Aug 2007)

5. P. Johnson, T. Marker, Data Centre Energy Efficiency Product Profile, Australian Equipment Energy Efficiency Committee Report (Apr 2009)

6. 7 × 24 Exchange, the Green Grid, et al., Recommendations for Measuring and Reporting Overall Data Center Efficiency Version 1—Measuring PUE at Dedicated Data Centers (Jul 2010)



7. The Climate Group on Behalf of the Global eSustainability Initiative (GeSI), SMART 2020: Enabling the Low Carbon Economy in the Information Age (Brussels, Belgium, 2008)

8. J. Loper, S. Parr, Energy Efficiency in Data Centers: A New Policy Frontier, Alliance to Save Energy White Paper (Jan 2007)

9. M. Ton, B. Fortenbery, W. Tschudi, DC Power for Improved Data Center Efficiency, Lawrence Berkeley National Laboratory project report (Mar 2008)

10. D. Atwood, J.G. Miner, Reducing Data Center Cost with an Air Economizer, IT@Intel Brief, Computer Manufacturing, Energy Efficiency; Intel Information Technology (Aug 2008)

11. Eaton Corporation, Data Center—Unique Characteristics and the Role of Medium Voltage Vacuum Circuit Breakers, White Paper (2009)

12. M.K. Patterson, Energy Efficiency Metrics, ITherm 2012

13. T. Lu, X. Lu, M. Remes, M. Viljanen, Investigation of air management and energy performance in a data center in Finland: Case study. Energy Build. 43, 3360–3372 (2011)

14. H.S. Sun, S.E. Lee, Case study of data centers' energy performance. Energy Build. 38, 522–533 (2006)

15. J. Cho, B.S. Kim, Evaluation of air management system's thermal performance for superior cooling efficiency in high-density data centers. Energy Build. 43, 2145–2155 (2011)

16. U.S. Department of Energy, Federal Energy Management Program, Data Center Rack Cooling with Rear-door Heat Exchanger (Jun 2010)

17. M. Green, S. Karajgikar, P. Vozza, N. Gmitter, D. Dyer, Achieving Energy Efficient Data Centers Using Cooling Path Management Coupled with ASHRAE Standards, in 28th IEEE Semi-therm Symposium, USA, March 18–22, 2012

18. A. Almoli, A. Thompson, N. Kapur, J. Summers, H. Thompson, G. Hannah, Computational fluid dynamic investigation of liquid rack cooling in data centers. Appl. Energy 89, 150–155 (2012)

19. J. Siriwardana, S.K. Halgamuge, T. Scherer, W. Schott, Minimizing the thermal impact of computing equipment upgrades in data centers. Energy Build. 50, 81–92 (2012)

20. J. Cho, T. Lim, B.S. Kim, Measurement and predictions of the air distribution systems in high compute density (internet) data centers. Energy Build. 41, 1107–1115 (2009)

21. W.A. Abdelmaksoud, T.W. Dang, H.E. Khalifa, R.R. Schmidt, M. Iyengar, Perforated tile models for improving data center CFD simulation, in 13th IEEE ITHERM Conference, San Diego, USA, May 29–June 1, 2012

Chapter 3: Standards Relating to Data Centers

Several standards are adopted by the industry on data center design and operating environment requirements and on telecom equipment qualification, quality management, and installation. This chapter reviews some key standards, including the American Society of Heating, Refrigerating, and Air Conditioning Engineers (ASHRAE) Thermal Guidelines (operating environment requirements), TIA Standard 942 (Telecommunication Infrastructure Standard, covering data center design and telecom equipment installation), Telcordia GR-63-CORE (telecom equipment qualification), ETSI 300 019 (telecom equipment qualification), and TL 9000 (telecom equipment quality management).

3.1 ASHRAE Thermal Guidelines

Operating environmental settings directly affect cooling energy efficiency. In most data centers, the operating environment is maintained at fixed air inlet temperatures and narrow humidity ranges. It is estimated that traditional A/C-cooled data centers can save 4–5 % of energy costs for every 1 °C increase in set point temperature [1].
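A small sketch of this rule of thumb follows; the baseline cooling cost is an assumption, and because the source does not say whether the 4–5 % saving compounds or adds linearly across several degrees, both readings are shown.

# Effect of raising the cooling set point, using the 4-5 % per degC rule of thumb.
# The baseline cost, the chosen midpoint, and the set point increase are assumptions.

baseline_cooling_cost_usd = 100_000.0  # assumed annual cooling energy cost
saving_per_degc = 0.045                # midpoint of the quoted 4-5 % range
delta_t_degc = 3                       # assumed set point increase

linear_saving = baseline_cooling_cost_usd * saving_per_degc * delta_t_degc
compound_saving = baseline_cooling_cost_usd * (1 - (1 - saving_per_degc) ** delta_t_degc)

print(f"Linear estimate:     ${linear_saving:,.0f} per year")
print(f"Compounded estimate: ${compound_saving:,.0f} per year")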

The American Society of Heating, Refrigerating, and Air Conditioning Engineers (ASHRAE) published the "Thermal Guidelines for Data Centers and Other Data Processing Environments—Expanded Data Center Classes and Usage Guidance" in 2011, which provides allowable and recommended operating condition limits for data centers, including temperature and humidity [2]. The recommended conditions are designed "to give guidance to data center operators on maintaining high reliability and also operating their data centers in the most energy efficient manner." The allowable conditions are the conditions in which "IT manufacturers test their equipment in order to verify that the equipment will function within those environmental boundaries" [3]. The ASHRAE 2011 thermal guidelines include a list of data center classes to "accommodate different applications and priorities of IT equipment operation" [2]. The environmental specifications of the classes are shown in Table 3.1 and Fig. 3.1. Class A1 is typically a data center with tightly controlled environmental parameters (dew point (DP), temperature, and relative humidity) and mission-critical operations; the types of products designed for this environment are enterprise servers and storage products.

Table 3.1 Environmental specifications for 2011 ASHRAE thermal guideline classes [3]

Product operation:
Class    Temperature (°C)   Humidity range                      Maximum dew point (°C)
Recommended
A1–A4    18–27              5.5 °C DP to 60 % RH and 15 °C DP   –
Allowable
A1       15–32              20–80 % RH                          17
A2       10–35              20–80 % RH                          21
A3       5–40               −12 °C DP and 8 % RH to 85 % RH     24
A4       5–45               −12 °C DP and 8 % RH to 90 % RH     24
B        5–35               8–80 % RH                           28
C        5–40               8–80 % RH                           28

Product power off:
Class    Temperature (°C)   Humidity range    Maximum dew point (°C)
A1       5–45               8–80 % RH         27
A2       5–45               8–80 % RH         27
A3       5–45               8–85 % RH         27
A4       5–45               8–90 % RH         27
B        5–45               8–80 % RH         29
C        5–45               8–80 % RH         29

Fig. 3.1 ASHRAE psychrometric chart for data centers [2]


Class A2 is typically an information technology space or an office or lab environment with some control of environmental parameters (DP, temperature, and relative humidity); the types of products designed for this environment are volume servers, storage products, personal computers, and workstations. Classes A3 and A4 are typically information technology spaces or office or lab environments with some control of environmental parameters (DP, temperature, and relative humidity); the types of products typically designed for these environments are volume servers, storage products, personal computers, and workstations. Class B is typically an office, home, or transportable environment with minimal control of environmental parameters (temperature only); the types of products typically designed for this environment are personal computers, workstations, laptops, and printers. Class C is typically a point-of-sale, light industrial, or factory environment with weather protection and sufficient winter heating and ventilation; the types of products typically designed for this environment are point-of-sale equipment, ruggedized controllers or computers, and personal digital assistants (PDAs).

The ASHRAE 2011 thermal guidelines are an expansion of the ASHRAE 2008 thermal guidelines [3] and the ASHRAE 2004 thermal guidelines [4]. The ASHRAE 2004 thermal guidelines provided the initial recommendations for the data center operating environment, and the 2008 guidelines expanded the limits to improve energy efficiency. The 2008 revision increased both the temperature and moisture ranges recommended for data center equipment, as shown in Table 3.2. (The moisture range is expressed in terms of DP, since research has demonstrated that equipment failure is not necessarily directly related to relative humidity [3] but is strongly related to DP, the temperature at which the air can no longer hold all of its water vapor and some of the water vapor condenses into liquid water.)

As shown in Table 3.2, the recommended temperature limits in the ASHRAE Thermal Guidelines 2008 were extended to 18–27 °C from 20–25 °C in the ASHRAE Thermal Guidelines 2004. This extension was based on a long history of reliable operation of telecom equipment in data centers all over the world, as well as on another generally accepted industry standard for telecom equipment, Telcordia GR-63-CORE [6], which sets recommended temperature limits of 18–27 °C. The expanded recommended temperature limits save energy in data centers. Lowering the low-end recommended temperature extends the control range of economized systems, because hot return air no longer has to be mixed in to maintain the previous 20 °C recommendation. Expanding the high-side temperature limit benefits free air cooling, since it allows more annual operating hours of airside economizers; more details are given in Chap. 4. For non-economizer cooling systems, there is also an energy benefit to increasing the supply air or chilled water temperatures.

Table 3.2 Operating environmental limit comparisons between ASHRAE 2004 and 2008 [3]

                   Recommended limits                       Allowable limits
                   2004 version     2008 version
Low temperature    20 °C            18 °C                   15 °C
High temperature   25 °C            27 °C                   32 °C
Low moisture       40 %RH           5.5 °C DP               20 %RH
High moisture      55 %RH           60 %RH and 15 °C DP     80 %RH

Compared with the recommended moisture limits of 40–55 %RH (relative humidity) in the ASHRAE Thermal Guidelines 2004, the ASHRAE Thermal Guidelines 2008 extended these limits to 5.5 °C DP (dew point) on the low side and 60 %RH and 15 °C DP on the high side, as shown in Table 3.2. The new moisture limits are expressed as a combination of DP and relative humidity, since both affect the failure mechanism of corrosion [3]. Furthermore, there may be a risk of conductive anodic filament (CAF) growth if the relative humidity exceeds 60 % [3]. By extending the moisture limits, the ASHRAE Thermal Guidelines 2008 allow a greater number of operating hours per year in which humidification is not required. However, dryer air carries a greater risk of electrostatic discharge (ESD) than air with more moisture, so the main concern with decreased humidity is that the frequency of ESD events may increase; IT equipment manufacturers have not reported any ESD issues within the 2008 recommended limits [3]. A lower humidity limit based on a minimum DP rather than on a minimum relative humidity was adopted in the 2008 version, since research has demonstrated a stronger correlation between DP and ESD than between relative humidity and ESD.
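Because the 2008 and 2011 limits are expressed partly as dew points, a (temperature, relative humidity) reading usually has to be converted to a dew point before checking compliance. The sketch below does this with the Magnus approximation and checks a reading against the 2008 recommended envelope of Table 3.2; the Magnus coefficients are standard published values, and the sample reading is invented.

import math

# Dew point from dry-bulb temperature and relative humidity (Magnus approximation),
# followed by a check against the ASHRAE 2008 recommended envelope
# (18-27 degC, 5.5 degC DP on the low side, 60 % RH and 15 degC DP on the high side).

def dew_point_c(temp_c, rh_percent):
    """Approximate dew point in degC; Magnus coefficients valid for ordinary room conditions."""
    a, b = 17.62, 243.12
    gamma = math.log(rh_percent / 100.0) + a * temp_c / (b + temp_c)
    return b * gamma / (a - gamma)

def within_2008_recommended(temp_c, rh_percent):
    dp = dew_point_c(temp_c, rh_percent)
    return 18.0 <= temp_c <= 27.0 and 5.5 <= dp <= 15.0 and rh_percent <= 60.0

temp_c, rh = 24.0, 45.0  # invented sample supply-air reading
print(f"Dew point: {dew_point_c(temp_c, rh):.1f} degC")
print(f"Within the 2008 recommended envelope: {within_2008_recommended(temp_c, rh)}")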

3.2 TIA-942 Data Center Standard

In April 2005, the Telecommunications Industry Association (TIA) released standard TIA-942 (Telecommunication Infrastructure Standard) [5], which describes the design, installation, and performance requirements for telecommunication systems in data centers. This standard, intended for use by data center designers during the development process, considers site space and layout, the cabling infrastructure, tiered reliability, and environmental conditions.

The TIA-942 standard recommends specific functional areas for site space and layout in data centers. This design accounts for the future expansion of servers and applications, so that upgrades of the data center can be implemented with minimal downtime and disruption. The functional areas, shown in Fig. 3.2, include the entrance room; the main distribution area (MDA), a centrally located area that houses the main cross-connect as well as the core routers and switches for the LAN (local area network) and SAN (storage area network) infrastructures; the horizontal distribution area (HDA), which serves as the distribution point for horizontal cabling and houses the cross-connects and active equipment for distributing cable to the equipment distribution area (EDA), the location of the equipment cabinets and racks; the zone distribution area (ZDA), an optional interconnection point in the horizontal cabling between the HDA and the EDA; and the backbone and horizontal cabling.

To determine specific data center needs, TIA-942 includes an informative annex on data center availability tiers based on information from the Uptime Institute, a consortium dedicated to providing its members with best practices and benchmark comparisons for improving the design and management of data centers. The standard classifies data centers into four tiers.

A Tier 1 (Basic) data center has an availability requirement of 99.671 % (annual downtime of 28.8 h). It has a single path of power and cooling distribution without redundancy. This tier suffers from interruptions resulting from both planned and unplanned events. It may or may not have a raised floor, UPS, or generator, and it must be shut down completely to perform preventive maintenance.

A Tier 2 (Redundant Components) data center has an availability requirement of 99.741 % (annual downtime of 22.0 h). It has a single path of power and cooling distribution with redundant components. Compared to a Tier 1 data center, this tier suffers less from interruptions resulting from planned and unplanned events. It includes a raised floor, UPS, and generator, but maintenance of the power path or other parts of the infrastructure requires a processing shutdown.

A Tier 3 (Concurrently Maintainable) data center has an availability requirement of 99.982 % (annual downtime of 1.6 h). It has multiple paths of power and cooling distribution with redundant components (but only one path is active). This tier experiences no interruptions from planned activity but can suffer interruptions from unplanned events. It includes a raised floor and sufficient capacity and distribution to carry the load on one path while performing maintenance on another.

A Tier 4 (Fault Tolerant) data center has an availability requirement of 99.995 % (annual downtime of 0.4 h). This tier experiences no critical load interruptions from planned activity and can sustain at least one worst-case unplanned event with no critical load impact.
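The annual downtime quoted for each tier follows from its availability requirement and an 8,760-hour year, as the short sketch below shows; small differences from published tier tables are due to rounding.

# Annual downtime implied by an availability requirement, assuming an 8,760-hour year.

HOURS_PER_YEAR = 8760

def annual_downtime_hours(availability_percent):
    """Hours per year during which the load may be down at the stated availability."""
    return (1.0 - availability_percent / 100.0) * HOURS_PER_YEAR

for tier, availability in [(1, 99.671), (2, 99.741), (3, 99.982), (4, 99.995)]:
    print(f"Tier {tier}: {availability} % -> {annual_downtime_hours(availability):.1f} h/year")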

Fig. 3.2 TIA-942 compliant data center showing specific functional areas [5]: the entrance room (fed by carriers), main distribution area, horizontal distribution areas, zone distribution area, and equipment distribution areas within the computer room, plus the telecommunication room and the offices, operation centers, and support rooms, linked by backbone and horizontal cabling


The environmental considerations in TIA-942 include, but are not limited to, fire suppression, humidity levels, operating temperatures, architecture, electrical (power), and mechanical system specifications. For power equipment, the determination is based on the desired reliability tier and may include two or more power feeds from the utility, UPS, multiple circuits to systems and equipment, and on-site generators. The estimation of power needs must consider both the power required for the existing devices and that for devices anticipated in the future. For cooling equipment, the TIA-942 standard incorporates specifications to encourage airflow and reduce the amount of heat generated by concentrated equipment. It recommends the use of adequate cooling equipment, as well as a raised floor system for flexible cooling. In addition, the standard suggests that cabinets and racks be arranged in alternating patterns to create “hot” and “cold” aisles to keep the hot air from mingling with the cold air.

3.3 Environmental Qualification Standards

Environmental qualification standards are used to assess the quality of telecom equipment at the point of manufacturing. In this section, we describe two widely used environmental qualification standards for telecom equipment: Telcordia GR-63-CORE [6] for the North American market and ETSI 300 019 [7] for the European market. Telcordia1 developed GR-63-CORE in 1995 and revised it in 2006. This standard provides the operating requirements for telecom equipment, including the associated cable layouts, distributing and interconnecting frames, power equipment, operations support systems, and cable entrance facilities. It also provides test methods for telecom equipment in NEBS (Network Equipment-Building System) environments. The European qualification standard ETSI 300 019 was published by the European Telecommunications Standards Institute2 in 1994. It includes more than thirty documents related to qualification testing of telecom equipment.

3.3.1 Telcordia GR-63-CORE

Telcordia GR-63-CORE provides the minimum spatial and environmental requirements for all new telecommunications equipment in data centers and other environmentally-controlled spaces with telecommunications equipment.

1 Telcordia, formerly Bell Communications Research, Inc. or Bellcore, is a telecommunications research and development (R&D) company based in the United States that works with mobile, broadband, and enterprise software and services.

2 European Telecommunications Standards Institute (ETSI) is an independent, non-profit standardization organization in the telecommunications industry in Europe.


Telcordia and industry representatives cooperated to develop these requirements, which are applicable to “switching and transport systems, associated cable distribution systems, distributing and interconnecting frames, power equipment, operations support systems, and cable entrance facilities” [6]. This standard covers all aspects of physical qualification testing for equipment installed in an office building, including tests related to storage, transportation, operating temperature and humidity, vibration, illumination, altitude, acoustics, contamination, fire, and earthquakes.

The Telcordia Generic Requirements GR-63-CORE [6] also provides a recommended operating environment for telecom equipment. The recommended temperature range is the same as that in the 2008 ASHRAE guidelines, but the humidity ranges differ from those in the ASHRAE guidelines, as shown in Table 3.3.

The assessment methods in Telcordia GR-63-CORE can be used only when the operating environment of the telecom equipment is within the limits specified in the standard. When free air cooling is used in data centers, the operating environment may change. If it is still within the standard's limits, the methods in Telcordia GR-63-CORE for assessing the equipment remain valid for free air cooling. However, if the operating environment with free air cooling goes beyond the requirements, assessment according to the standard becomes invalid, and other ways must be found to assess the equipment.

Telcordia GR-63-CORE provides environmental tests for all new telecommunications network systems, including all the associated equipment and facilities. Since the major environmental changes from free air cooling involve operating temperature and relative humidity, we describe the operating temperature and relative humidity test defined in this standard. When the operating environment is within the limits required by Telcordia GR-63-CORE (Table 3.3), the equipment can be evaluated using the operating temperature and relative humidity test. The test lasts about 1 week, and the failure criteria are based on the ability of the equipment to operate throughout the test period: if a product operates properly during the test, it passes. This test is performed for qualification, but its results cannot be used to predict the product's reliability over its expected lifetime. During testing, the controlled conditions are temperature and relative humidity (RH). The temperature ranges from −5 °C to 50 °C, and the relative humidity ranges from less than 15 % to 90 %. The temperature and humidity profiles are shown in Fig. 3.3.

Table 3.3 Operating environmental limits in Telcordia generic requirements [6]

Recommended limits:
Low temperature: 18 °C
High temperature: 27 °C
Low relative humidity: 5 % RH
High relative humidity: 55 % RH


3.3.2 ETSI 300 019

ETSI 300 019 [7] provides environmental requirements for telecom equipment under various environmental conditions: storage, transportation, stationary use at weather-protected locations, stationary use at non-weather-protected locations, ground vehicle installations, ship environments, portable and non-stationary use, and stationary use at underground locations. Within each condition, there may be several classes. For example, stationary use at weather-protected locations includes six classes, whose allowable environments are shown in Table 3.4. These classes also apply to data centers.

ETSI 300 019 specifies environmental classes according to climatic and biological conditions, chemically and mechanically active substances, and mechanical conditions. The main purpose of environmental classification is to establish “standardized,” operational references for a wide range of environmental conditions covering storage, transportation, and use. Details are shown in Table 3.5. In this standard, the three conditions are defined as follows [7]: storage, transportation, and in-use. In the storage condition, equipment is placed at a certain site for a long period but is not intended for use during this period; if the equipment is packaged, the environmental conditions apply to the packaging protecting the equipment. The transportation condition includes the phase during which the equipment is moved from one place to another after being made ready for dispatch. In the in-use condition, equipment is directly operational. Furthermore, the in-use condition includes [7]: (1) stationary use, where equipment is mounted firmly on a structure or on mounting devices, or is permanently placed at a certain site; it is not intended for portable use, but short periods of handling during erection work, downtime, maintenance, and repair at the location are included; (2) mobile use, where equipment is primarily intended to be installed or fixed and operated in, or on, a vehicle or a ship; and (3) portable and non-stationary use, where equipment is frequently moved from place to place, with no special packaging during transfer. The total transfer time may amount to a portion of the equipment's lifetime. The equipment is not permanently mounted on any structure or placed at a fixed site, and it may be operated while in a non-stationary or transfer state.

Fig. 3.3 Operating temperature and humidity test profile in Telcordia GR-63-CORE [6] (plot of temperature, −10 to 60 °C, and relative humidity, 0–100 %, over approximately 200 h)


Table 3.4 ETSI 300 019 allowable environments for stationary use at weather-protected locations [7]

Temperature-controlled locations (class 3.1): a permanently temperature-controlled, enclosed location with usually uncontrolled humidity; a combination of classes 3K3/3Z2/3Z4/3B1/3C2(3C1)/3S2/3M1 in IEC Standard 60721-3-3. Low temperature 5 °C; high temperature 40 °C; low relative humidity 5 % RH; high relative humidity 85 % RH.

Partly temperature-controlled locations (class 3.2): an enclosed location having neither temperature nor humidity control. Low temperature −5 °C; high temperature 45 °C; low relative humidity 5 % RH; high relative humidity 95 % RH.

Non-temperature-controlled locations (class 3.3): a weather-protected location with neither temperature nor humidity control. Low temperature −25 °C; high temperature 55 °C; low relative humidity 10 % RH; high relative humidity 100 % RH.

Sites with a heat-trap (class 3.4): a weather-protected location with neither temperature nor humidity control that is affected by direct solar radiation and heat-trap conditions. Low temperature −40 °C; high temperature 70 °C; low relative humidity 10 % RH; high relative humidity 100 % RH.

Sheltered locations (class 3.5): a shelter where direct solar radiation and heat-trap conditions do not exist. Low temperature −40 °C; high temperature 40 °C; low relative humidity 10 % RH; high relative humidity 100 % RH.

Telecom control room locations (class 3.6): a permanently temperature-controlled, enclosed location, usually without controlled humidity; a combination of classes 3K2/3Z2/3Z4/3B1/3C2(3C1)/3S2/3M1 in IEC Standard 60721-3-3. Low temperature 15 °C; high temperature 75 °C; low relative humidity 10 % RH; high relative humidity 75 % RH.


Table 3.5 Environmental classes in ETSI 300 019 [7]

Storage: class 1.1, weather-protected, partly temperature-controlled storage locations; class 1.2, weather-protected, not temperature-controlled storage locations; class 1.3, non-weather-protected storage locations; class 1.3E, non-weather-protected storage locations—extended.

Transportation: class 2.1, very careful transportation; class 2.2, careful transportation; class 2.3, public transportation.

In-use, stationary use at weather-protected locations: class 3.1, temperature-controlled locations; class 3.2, partly temperature-controlled locations; class 3.3, not temperature-controlled locations; class 3.4, sites with heat-trap; class 3.5, sheltered locations; class 3.6, telecommunications control room locations.

In-use, stationary use at non-weather-protected locations: class 4.1, non-weather-protected locations; class 4.1E, non-weather-protected locations—extended; class 4.2L, non-weather-protected locations—extremely cold; class 4.2H, non-weather-protected locations—extremely warm, dry.

In-use, stationary use at underground locations: class 8.1, partly weather-protected underground locations.

In-use, mobile use in ground vehicle installations: class 5.1, protected installation; class 5.2, partly protected installation.

In-use, mobile use in ship environments: class 6.1, totally weather-protected locations; class 6.2, partly weather-protected locations; class 6.3, non-weather-protected locations.

In-use, portable and non-stationary use: class 7.1, temperature-controlled locations; class 7.2, partly temperature-controlled locations; class 7.3, partly weather-protected and non-weather-protected locations; class 7.3E, partly weather-protected and non-weather-protected locations—extended.


Every condition is categorized into several classes based on the weather protection conditions, and the standard defines each environmental class along with its climatic condition requirements [7]. ETSI 300 019 relies on the test methods defined in other standards for telecom equipment.
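For illustration only (this lookup structure and function are ours, not part of the standard), the class limits in Table 3.4 can be captured in a small data structure that reports which stationary-use classes would cover a measured operating point:

```python
# Allowable envelopes for ETSI 300 019 stationary use at weather-protected locations,
# taken from Table 3.4: class -> (min °C, max °C, min %RH, max %RH).
ETSI_CLASS_LIMITS = {
    "3.1": (5, 40, 5, 85),
    "3.2": (-5, 45, 5, 95),
    "3.3": (-25, 55, 10, 100),
    "3.4": (-40, 70, 10, 100),
    "3.5": (-40, 40, 10, 100),
    "3.6": (15, 75, 10, 75),
}

def classes_covering(temp_c: float, rh_percent: float) -> list[str]:
    """Return the classes whose temperature/humidity envelope contains the operating point."""
    return [cls for cls, (t_min, t_max, rh_min, rh_max) in ETSI_CLASS_LIMITS.items()
            if t_min <= temp_c <= t_max and rh_min <= rh_percent <= rh_max]

# Example: a warm, humid free-air-cooled room (hypothetical reading).
print(classes_covering(35.0, 80.0))  # -> ['3.1', '3.2', '3.3', '3.4', '3.5']
```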

3.3.3 Use for Data Center Cooling Methods

Qualification tests of telecom equipment are usually performed using the test methods provided by industry standards. However, these methods do not address the risks of the cooling methods used in data centers. First, the methods in the standards can be used only when the operating conditions are within the ranges required by the standards. For example, the relative humidity with free air cooling in an Intel data center ranged from 5 % to more than 90 % RH [8], which is beyond the Telcordia GR-63-CORE limits shown in Table 3.3. In fact, the humidity in the Intel case was uncontrolled, which is common in free air cooling applications, since humidifying or dehumidifying requires a large amount of energy that would offset the energy savings from free air cooling. However, uncontrolled humidity falls outside the operating environment requirements of standards such as Telcordia GR-63-CORE, and, therefore, the standard assessment would be invalid for the free air cooled environment.
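A minimal sketch of the kind of envelope check implied here, using the recommended limits from Table 3.3 (the variable names and sample readings are ours):

```python
# Flag logged operating conditions that fall outside the Telcordia GR-63-CORE
# recommended limits of Table 3.3 (18-27 °C, 5-55 %RH). Readings outside this
# envelope mean the standard's assessment methods no longer cover the environment.

RECOMMENDED = {"t_min": 18.0, "t_max": 27.0, "rh_min": 5.0, "rh_max": 55.0}

def out_of_envelope(readings):
    """Yield (temperature °C, %RH) samples that violate the recommended limits."""
    for t_c, rh in readings:
        if not (RECOMMENDED["t_min"] <= t_c <= RECOMMENDED["t_max"]
                and RECOMMENDED["rh_min"] <= rh <= RECOMMENDED["rh_max"]):
            yield (t_c, rh)

# Hypothetical hourly samples from a free-air-cooled room.
samples = [(24.0, 40.0), (31.0, 20.0), (26.0, 88.0)]
print(list(out_of_envelope(samples)))  # -> [(31.0, 20.0), (26.0, 88.0)]
```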

Additionally, the test methods in Telcordia GR-63-CORE and ETSI 300 019 do not address the failure mechanisms associated with the various cooling methods used in data centers. For example, two failure mechanisms that can occur under free air cooling conditions—conductive anodic filament (CAF) formation and electrostatic discharge (ESD)—are not addressed by the operating temperature and humidity test in Telcordia GR-63-CORE. Furthermore, when the standard tests are not passed, the test results cannot identify the failure mechanisms.

Finally, the qualification methods in the standards do not predict the reliability of equipment over the range of targeted operating temperatures. For example, the operating temperature and humidity test in Telcordia GR-63-CORE is used for qualification and is considered a measure of the quality of the equipment, but it cannot predict the reliability of the equipment under usage conditions.

3.4 Quality Management Standard: TL 9000

Reliability goals are often used to define contractual requirements regarding the warranties on equipment and systems. For telecommunications equipment and systems, common metrics include the mean time to failure (MTTF) or mean time between failures (MTBF). Several additional reliability measures are also commonly agreed upon by members of the supply chain or between service providers and government regulators to ensure the quality of equipment. These measures are often based on the TL 9000 standard.


3.4.1 Metrics in TL 9000

TL 9000 is a telecommunications standard published by the Quality Excellence for Suppliers of Telecommunications (QuEST) Forum [9], a consortium of international telecommunications service providers and suppliers. It includes a common set of requirements and measurements for the telecommunications industry, and the measurements it uses to quantify the reliability and performance of products are performance-based. TL 9000 defines reliability metrics, including the initial return rate, 1-year return rate, long-term return rate, and normalized 1-year return rate. These metrics are a measure of the risks to a telecommunications system.

The initial return rate is the return rate of units during the first 6 months after initial shipment, representing the reliability during installation, turn-up (the commissioning and debugging of equipment after it is installed), and testing. This metric is calculated as:

IRR = (Ri/Si) × Afactor × 100 %    (3.1)

where Ri is the number of returns during the initial return rate basis shipping period (months 1 through 6 prior to reporting), Si is the number of field replaceable units shipped during the initial return rate basis shipping period, and Afactor is the annualization factor, which is the number of report periods in a year. If the report period is a calendar month, the Afactor is 12, and if the report period is 2 months, the Afactor is 6.

For the initial return rate calculation, Ri includes the current month's returns in order to alert managers to any developing problems. However, Si does not include the current month's shipments, because most units shipped in the month will not have been placed into operation during that month. For example, when calculating the initial return rate in May 2011, Ri covers November 2010 through May 2011, but Si covers November 2010 through April 2011.

The 1-year return rate is the return rate of units during the first year following the initial return rate period (7 through 18 months after shipment), representing the product's quality in its early life. It is calculated as:

YRR = (Ry/Sy) × Afactor × 100 %    (3.2)

where Ry is the number of returns during the 1-year return rate basis shipping period (the 7th through the 18th month prior to reporting), and Sy is the number of field replaceable units shipped during the 1-year return rate basis shipping period.

The long-term return rate is the return rate of units at any time following the 1-year return rate period (19 months and later following shipment), representing the product's mature quality. It is calculated as:

LTR = (Rt/St) × Afactor × 100 %    (3.3)

where Rt is the number of returns during the long-term return rate basis shipping period (the 19th month and later prior to reporting), and St is the number of field replaceable units shipped during the long-term return rate basis shipping period.


The calculation methods for the 1-year return rate and the long-term return rate are similar to that for the initial return rate. The difference is that, for the initial return rate, the returns cover a different period than the shipments (the returns include the current reporting month), whereas for the 1-year and long-term return rates the returns and shipments cover the same periods.

An example of shipments and returns with different basis shipping periods is shown in Table 3.6, in which the reporting month is December 2008 and data are available from April 2007. In this table, the initial return rate basis shipping period is June 2008 through November 2008, the 1-year return rate basis shipping period is June 2007 through May 2008, and the long-term return rate basis shipping period is April 2007 through May 2007.

Table 3.6 Example shipments and returns for TL 9000 metrics (reporting month: December 2008)

Month (2007):  04  05 | 06  07  08  09  10  11  12
Shipment:      50  40 | 50  50  90  70  60  80  50
Return:         2   1 |  1   0   2   0   1   1   0

Month (2008):  01  02  03  04  05 | 06  07  08  09  10  11 | 12
Shipment:      50  70  90  40  60 | 80  70  70  60  80  90 | 60
Return:         2   0   1   1   2 |  2   1   0   1   3   0 |  2

(St and Rt cover April–May 2007; Sy and Ry cover June 2007–May 2008; Si covers June–November 2008, while Ri also includes December 2008.)

In Table 3.6, the December 2008 returns are included in Ri, but the December 2008 shipments are not included in Si. The initial return rate, 1-year return rate, and long-term return rate can then be calculated as:

IRR = (Ri/Si) × Afactor × 100 % = (2 + 1 + 0 + 1 + 3 + 0 + 2)/(80 + 70 + 70 + 60 + 80 + 90) × 12 × 100 % = 24 %

YRR = (Ry/Sy) × Afactor × 100 % = (1 + 0 + 2 + 0 + 1 + 1 + 0 + 2 + 0 + 1 + 1 + 2)/(50 + 50 + 90 + 70 + 60 + 80 + 50 + 50 + 70 + 90 + 40 + 60) × 12 × 100 % = 17.3 %

LTR = (Rt/St) × Afactor × 100 % = (2 + 1)/(50 + 40) × 12 × 100 % = 40 %
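The same bookkeeping can be written out programmatically. The sketch below is our own illustration (not part of TL 9000); it reproduces the values above from the Table 3.6 data for a December 2008 reporting month. Note that the YRR computed here rounds to 17.4 %, whereas the text quotes the truncated value 17.3 %.

```python
# Monthly shipments and returns from Table 3.6, April 2007 through December 2008.
shipments = [50, 40, 50, 50, 90, 70, 60, 80, 50, 50, 70,
             90, 40, 60, 80, 70, 70, 60, 80, 90, 60]
returns   = [2, 1, 1, 0, 2, 0, 1, 1, 0, 2, 0,
             1, 1, 2, 2, 1, 0, 1, 3, 0, 2]

A_FACTOR = 12        # monthly report periods
N = len(shipments)   # index one past the reporting month (December 2008)

def rate(r_sum: int, s_sum: int) -> float:
    return r_sum / s_sum * A_FACTOR * 100.0

# IRR: shipments from months 1-6 prior to reporting; returns also include the current month.
irr = rate(sum(returns[N - 7:N]), sum(shipments[N - 7:N - 1]))
# YRR: shipments and returns from months 7-18 prior to reporting.
yrr = rate(sum(returns[N - 19:N - 7]), sum(shipments[N - 19:N - 7]))
# LTR: shipments and returns from month 19 prior to reporting and earlier.
ltr = rate(sum(returns[:N - 19]), sum(shipments[:N - 19]))

print(f"IRR = {irr:.1f} %, YRR = {yrr:.1f} %, LTR = {ltr:.1f} %")
# -> IRR = 24.0 %, YRR = 17.4 %, LTR = 40.0 %
```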

The normalized 1-year return rate is the normalized return rate of units during the 1-year return rate period. To calculate the normalized 1-year return rate, returns are aggregated for product normalized units (NUs), which are based on product categories and are defined in the TL 9000 measurement applicability table in the appendix. The numbers of NUs are calculated based on how many NUs can be deployed by the shipped parts. The normalized 1-year return rate is calculated as:

NRY = (Ry/S) × Afactor × 100 %    (3.4)

where Ry is the same as that in Eq. (3.2), and S is the number of normalized units (NUs) shipped during the 1-year return rate basis shipping period. For example, a high bit-rate digital subscriber line (HDSL) transmission system consists of an HDSL central office transceiver unit (HTU-C) and an HDSL remote transceiver unit (HTU-R). The numbers of shipments and returns during a 1-year return rate basis shipping period are shown in Table 3.7.

Table 3.7 Example of normalized 1-year return

                            HTU-C     HTU-R
Shipment                    50,000    60,000
Return                      30        40
HDSL that can be deployed   50,000


Then, the normalized 1-year return rate can be calculated as:

NRY = (30 + 40)/50,000 × 12 × 100 % = 1.68 %

These metrics, including the initial return rate, 1-year return rate, long-term return rate, and normalized 1-year return rate, are applicable to telecommunication systems consisting of field replaceable units, to systems which are themselves field replaceable units, and to individual field replaceable units. In general, the initial return rate may be inaccurate for products warehoused outside the supplier's control, because the product is not used by customers during the warehouse period (this period may be included in the initial use period, especially when the warehouse period is not known to the supplier). The long-term return rate may be inaccurate for old or inexpensive products that are out of service, because users may prefer to buy new products instead of sending the old products to the suppliers for maintenance. Unlike the 1-year return rate, the normalized 1-year return rate allows a comparison between systems with different architectures.

3.4.2 Use for Data Centers

TL 9000 is a standard used to establish and maintain a common set of telecom quality management system requirements that meet the supply chain quality demands of the worldwide telecommunications industry. The industry also uses the metrics in TL 9000 to evaluate the quality of telecom equipment used with various cooling methods in data centers. The possible environmental changes due to the implementation of various cooling methods should have no impact on the initial return rate of telecom equipment, since this metric represents the reliability during installation, turn-up, and testing. The environmental changes also have no significant impact on the 1-year return rate of telecom equipment, which represents the reliability during early life, when environmental changes have little impact on reliability. However, the long-term return rate may be affected by environmental changes because the long-term reliability of telecom equipment may differ under the various cooling methods. Thus, the long-term return rate should be selected as the metric to assess the risks of various cooling methods, although it may become inaccurate for older products as they are taken out of service.

The members of the supply chain and regulatory agencies are often contractually obligated to meet reliability goals in terms of the TL 9000 metrics for the equipment


supplied. The equipment manufacturers want to know ahead of time if the equipment is going to be used in traditionally-cooled installations or with other cooling methods so that they can decide on the level of guarantee to provide for their equipment. The QuEST forum compiles the data on these metrics from the member companies. If the QuEST forum starts differentiating between the data from the traditional and other cooling systems, the industry can make informed judgments about the effects of these new cooling strategies on system reliability and availability.

3.5 Summary

Standards provide rules and regulations for data center design and operation, but some of them (e.g., Telcordia GR-63-CORE and ETSI 300 019) need to be updated for use under the various cooling conditions in data centers. The data center cooling community needs to get involved in the standards and help update them. Contractual relations between equipment manufacturers, data center operators, and regulators need to be updated to allow for the use of various cooling tools. TL 9000 can be used to evaluate the quality of telecom equipment and also to assess the impact of various cooling methods on telecom equipment. TIA-942 focuses on the design and installation of data centers, and some changes may be needed to allow for the use of various cooling methods (e.g., free air cooling). The ASHRAE thermal guidelines considered free air cooling when the recommended temperature range for data centers was extended in 2008, and the extension helps increase the energy savings from cooling.

References

1. R. Miller, Data center cooling set points debated, Data Center Knowledge, Sept 2007

2. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Technical Committee (TC) 9.9, 2011 Thermal Guidelines for Data Processing Environments—Expanded Data Center Classes and Usage Guidance, Atlanta (2011)

3. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Technical Committee (TC) 9.9, Thermal Guidelines for Data Processing Environments, Atlanta (2008)

4. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Technical Committee (TC) 9.9, 2008 ASHRAE Environmental Guidelines for Datacom Equipment, Atlanta (2004)

5. Telecommunications Industry Association (TIA), TIA-942 (Telecommunication Infrastructure Standard for Data Centers), Arlington, April 2005

6. Telcordia, Generic Requirements GR-63-CORE, Network Equipment-Building System (NEBS) Requirements: Physical Protection, Piscataway, March 2006

7. ETS 300 019, Equipment Engineering (EE); Environmental Conditions and Environmental Tests for Telecommunications Equipment (European Telecommunications Standards Institute, Sophia Antipolis, France, 2003)

8. D. Atwood, J.G. Miner, Reducing Data Center Cost with an Air Economizer, IT@Intel Brief: Computer Manufacturing, Energy Efficiency; Intel Information Technology, Aug 2008

9. Quality Excellence for Suppliers of Telecommunications Forum, TL 9000 Quality Management System Measurement Handbook 3.0, TX, Dec 2001


Chapter 4
Principal Cooling Methods

The continuous increase in energy prices and the simultaneous decrease in IT hardware prices are among the chief reasons for the data center industry to treat energy efficiency as a top priority. Since nearly 50 % of the power provided to a data center may go to the cooling infrastructure, it is imperative to develop high-performance, reliable, yet cost-effective cooling solutions. Moreover, it is probable that tighter government regulations will force the data center industry to improve the energy efficiency of its operations. In addition to well-established air cooling methods, several other cooling methods are already in use, for which preliminary reports and research papers have been published, including evaluations of their performance and comparisons with traditional air cooling systems. While this chapter focuses primarily on existing cooling methods, emerging cooling technologies and trends in energy-efficient thermal management for data centers are discussed in Chap. 9.

4.1 Principal Cooling Methods

This section introduces the most commonly used cooling methods for data centers: air cooling, liquid cooling, direct immersion cooling, tower free cooling, and air cooling with power management technologies. The benefits and limitations of these cooling methods are also analyzed.

4.1.1 Air Cooling

The majority of existing data centers use air cooling systems to maintain the desired operating conditions. However, future data centers most likely will use a combination of cooling methods in order to efficiently and most directly remove


heat from the information technology (IT) equipment, use waste heat efficiently, and improve the overall efficiency of the system and the life cycle cost effectiveness.

In a typical air cooling system, heat generated by the processor is conducted to a heat sink and transferred to the chilled air blown into the server. Cooling air typically enters through an under-floor plenum into the cold aisle in front of the racks and exits at the back into the hot aisle. The hot air rises and moves to a computer room air-conditioning (CRAC) unit, where it is cooled by chilled water, which must be maintained at a sub-ambient temperature in order to produce a sufficient heat transfer rate. The heat is then rejected to the ambient at an elevated air temperature through a cooling tower or, for smaller units, an air-cooled condenser, as shown in Fig. 4.1.

As seen in Fig. 4.1, several points of thermal resistance and energy loss are associated with such a cooling system. One major source of thermal resistance is between the heat-generating processor and the heat sink. This thermal contact resistance can be reduced through embedded cooling techniques that utilize advanced heat sinks and high-conductivity thermal substrates for direct heat removal from the electronics. Another major source of thermal resistance is between the heat sink and the air. Various heat transfer augmentation techniques have reduced the air-side thermal resistance; this area remains of active research interest, and the search for innovative cooling solutions on the air side continues.
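To make the role of these series resistances concrete, a junction-temperature estimate of the usual form T_junction = T_inlet + q × (sum of resistances) is sketched below; all power and resistance values are hypothetical and are not taken from a specific server.

```python
# Junction temperature for a chip cooled through a series thermal-resistance chain.
q_watts = 100.0          # chip power dissipation (hypothetical)
r_junction_case = 0.10   # K/W, die to package case (hypothetical)
r_interface = 0.15       # K/W, thermal interface / contact resistance (hypothetical)
r_sink_to_air = 0.40     # K/W, heat sink to cooling air (hypothetical)
t_inlet_air_c = 20.0     # supply air temperature at the server inlet

t_junction_c = t_inlet_air_c + q_watts * (r_junction_case + r_interface + r_sink_to_air)
print(f"Estimated junction temperature: {t_junction_c:.0f} °C")  # 85 °C with these values
```

Reducing either the contact resistance (embedded cooling, better substrates) or the sink-to-air resistance (enhanced air-side heat transfer) lowers the junction temperature for the same dissipated power.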

Fig. 4.1 Traditional air cooling and the various resistances between the source and the sink (schematic of racks in hot/cold aisles over an under-floor plenum, the CRAC unit, and the chilled water, chilled refrigerant, and cooling tower loops)


Chapter 9 provides a more detailed analysis of enhanced air cooling and its comparison with liquid and two-phase cooling.

4.1.2 Liquid Cooling

Liquid cooling (using water, refrigerants, or other coolants) has been used extensively in data centers. Compared to air, the liquids used in cooling have higher heat transport capacity and can remove more heat during heat exchange. With this strong heat removal capability, liquid cooling is a particularly appropriate solution for high-power-density components such as CPUs. However, the implementation of liquid cooling may increase the overall cost and complexity of a data center. Some companies, such as Intel, suggest that the use of liquid cooling be limited to special high-density equipment and components [1]. However, the progression of Moore's Law in electronic technology continues to add higher functionality (e.g., computing power, access speed) to chips, with simultaneous reductions in chip size, resulting in on-chip power dissipation densities well beyond the capability of conventional air-based thermal management techniques. Inadequate cooling has further constrained advances in high-performance electronic systems. Therefore, in the authors' view, liquid cooling, and most likely phase-change cooling, will find its way into data centers for performance and economic gains, at least to augment other thermal management methods.

There are several methods of implementing liquid cooling for server racks. One method involves a liquid-cooled door, which is usually located on the back of the server rack and cools the air flowing from the rack down to (or near) ambient room temperature, removing the heat. Another implementation method is a closed-liquid rack that is sealed and uses a heat exchanger to remove the heat from the airflow fully contained within the rack. The heat exchanger is connected to a liquid cooling system that transfers the heat to the liquid. This design is thermal- and airflow-neutral to the room, and usually also quiet. However, there needs to be a mechanism to open the rack manually to prevent overheating in case of a failure.

Other implementation methods related to the rack cooling strategy are in-row liquid coolers and overhead liquid coolers, which are similar to the liquid-cooled door. These two methods remove the heat very near the heat sources, so the room-level air cooling system is not stressed by the heat load, although there is still local room airflow. The advantage of these two types of coolers is that they are rack-independent and not limited to a specific server or rack manufacturer. However, a disadvantage of both methods is that they occupy a large amount of space [1].

Another liquid cooling technology is offered by Iceotope [2], a manufacturer of liquid cooling equipment. In their method, the server motherboard is completely immersed in an individually sealed bath of an inert liquid coolant. The generated


heat is removed by the coolant from the sensitive electronics to a heat exchanger, which is formed by the wall of the bath. The coolant is continuously recirculated and cooled in the bath. Iceotope claims that this cooling technology can dramatically reduce data center energy costs if IT managers can become comfortable with the idea of liquids in their centers.

One of the main concerns with direct liquid cooling has been the potential risk of fluid leakage from the liquid pipes close to the IT equipment, which could cause system failure. However, this issue has been addressed through several protective measures, such as safe quick connections, system monitoring devices, and a leak detection system included with the liquid cooling implementation. Although these add to the capital and operational costs [1], long-term trends point to increased use of liquid cooling. Successful implementations are already in place, with favorable payback periods in some cases [3, 4].

4.1.3 Liquid Immersion Cooling

Immersing servers in a dielectric liquid (e.g., oil) is used because of its rapid, low-resistance cooling. Liquid immersion cooling is not a new concept, having been used by IBM for over 20 years to cool high-powered chips on multi-chip substrates during electrical testing prior to final module assembly. Recently, liquid immersion cooling has been applied to server cooling in data centers as the power density of data center electronics has drastically increased, since liquid immersion cooling is simpler and less expensive to implement than other pumped liquid cooling techniques [5]. It is an example of passive two-phase cooling, which uses a boiling liquid to remove heat from a surface and then condenses the liquid for reuse, all without a pump, as shown in Fig. 4.2.

The servers are immersed in a non-conductive liquid with a very low boiling point, which easily condenses from gas back to liquid. Typically, mineral oil has been used in immersion cooling because it is not hazardous, is electrically non-conductive, and transfers heat almost as well as water. A number of companies have introduced liquid cooling solutions that immerse servers in fluid. 3M's Novec is the most widely used dielectric coolant for supercomputers and data center cooling. These liquids can hold 1,200 times more heat by volume than plain air and support heat loads of up to 100 kW per 42U rack, far beyond current average heat loads of 4–8 kW per rack and high-density loads of 12–30 kW per rack.

Recently, Intel [6] examined the use of this technique for data centers. The microprocessor company finished a year-long test of mineral-oil server-immersion cooling technology in 2012, and reported that not only does the technology appear safe for server components, but it might also become the norm for anyone needing maximum computing power or building up data center capacity.

The system tested, called CarnotJet, was designed to house servers in a specialized coolant oil that absorbs the heat given off; the oil is then sent to a radiator


where it is cooled before being recycled back into the server housing [6]. Whereas traditional air-cooled server racks often operate at a power usage effectiveness (PUE) rating of about 1.6 (meaning that cooling and other facility overheads add roughly 60 % on top of the power needed for the servers' computing workloads), the oil-immersed servers are reported to operate at a PUE between 1.02 and 1.03. It is possible to achieve similarly low PUE ratings with traditional air and liquid cooling methods, but this requires engineering infrastructure adaptation and corresponding additional capital costs.
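As a quick check of this overhead arithmetic (the IT load value below is hypothetical), PUE is the ratio of total facility power to IT equipment power, so the non-IT overhead is (PUE − 1) times the IT load:

```python
# PUE = total facility power / IT equipment power.
def overhead_kw(pue: float, it_load_kw: float) -> float:
    """Facility power spent outside the IT load (cooling, power delivery, etc.)."""
    return (pue - 1.0) * it_load_kw

it_load_kw = 100.0  # hypothetical IT load
for pue in (1.6, 1.03):
    total = pue * it_load_kw
    print(f"PUE {pue}: total {total:.0f} kW, overhead {overhead_kw(pue, it_load_kw):.0f} kW")
# PUE 1.6  -> 160 kW total, 60 kW overhead (the ~60 % increase cited above)
# PUE 1.03 -> 103 kW total,  3 kW overhead
```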

Intel's research into oil-optimized servers could result in a defined architecture around which server manufacturers could begin building such systems. Most servers follow design principles for optimal airflow and distribution. Liquid/oil immersion cooling could do away with some of the traditional rules to arrive at much more efficient systems. Some immediate steps involve eliminating anything to do with fans, sealing hard drives (or using some form of solid state drives), and replacing any organic materials that might leach into the oil. A redesign of the heat sink will most likely be necessary as well, as would a new architecture for optimum placement of the various components on the motherboard. Oil immersion means there is no need for chillers, raised floors, or other costly measures typically required for air cooling. It is possible that the energy stored in the hot oil could be reused more easily than the warm air servers return today, thus making a data center even more efficient [6]. In its preliminary evaluation, Intel suggested that the cost savings associated with oil immersion might make this technique more commercially feasible in the short to midterm than otherwise perceived. The big hurdle to adoption might be the data center operations staff and managers themselves, whose division or department does not pay the energy bills. This hurdle is especially relevant as companies start building out data center space and are looking to save on construction costs as well as energy bills.

Fig. 4.2 Working principle of immersion cooling system (http://www.allied-control.com/immersion-cooling)


Immersion cooling can produce large savings on infrastructure, allowing users to operate servers without bulky heat sinks or air channels on the hardware, server fans, raised floors, CRAC units, CRAHs, chillers, or hot/cold aisle containment. Using a passive two-phase immersion cooling system also has positive side effects at both the device and facility level. Massive air flow, dust, and noise are eliminated from the facility. A clean and elegant design is possible, since there are no fans, bulky heat sinks, or air channels at the hardware level. Should a board ever have to leave the bath, it will come out dry, not wet, sticky, or oily, and there is no need to keep rubber mats or tissues nearby; the disks in many hard drives and camera lenses likely went through coolant vapor-phase cleaning when they were made. Another benefit of an immersion cooling system is that it maintains lower junction temperatures without temperature swings, hot spots, or server fan failures. In addition, immersion cooling enhances reliability, since the filtered coolant prevents corrosion, electrochemical migration, moisture, and environmental contaminants from accumulating on the electronics. Immersion cooling can fit any form, size, or shape of electronic components and boards and works in confined spaces and extreme environments, since the electronics are isolated from environmental exposure. In the case of the coolant Novec made by 3M, the coolant has zero ozone depletion potential and a very low global warming potential; it is not flammable and provides inherent fire protection.

Green Revolution Cooling has developed a data center cooling system that submerges servers in a liquid similar to mineral oil, as shown in Fig. 4.3. A rack is filled with 250 gallons of dielectric fluid, with servers inserted vertically into slots in the enclosure. The fluid temperature is maintained by a pump with a heat exchanger, which can be connected to a standard commercial evaporative cooling system. The containment is a 3-inch metal wall, made of angle iron, surrounding the tanks and pumping module and sealed to the concrete slab below. The contained area holds significantly more than one rack. Between the tanks, it is possible to place an expanded metal catwalk that sits 3 inches high to allow people to walk around the racks even if the containment area contains coolant. Each tank has two coolant-level detection sensors that tie into the control software and send out instant alerts in the event of a change in coolant level. The unit was installed at the Texas Advanced Computing Center in

Fig. 4.3 A four-rack installation of the Green Revolution liquid cooling solution, which submerges servers in a coolant similar to mineral oil [7]


Austin, home to the Ranger supercomputer. Its enclosures represent a 50 % savings in overall energy costs for the workloads at Midas Networks. The company says the payback on the initial investment in the liquid cooling system ranges from 1 to 3 years. DfR Solutions and Green Revolution Cooling showed that the PUE of the data center using immersion cooling is less than 1.08.

4.1.4 Tower Free Cooling

Tower free cooling (or simply free cooling) is usually implemented with waterside economizers, which are joined to a cooling tower, evaporative cooler, or dry cooler to remove heat from the rooms. A waterside economizer system has cooling coils that cool the room air and carry the heat to a heat exchanger, which is connected to an air-to-liquid heat exchanger that removes the heat from the coolant and discharges it to the environment [8].

Airside economizers are preferred over waterside economizers, since free air cooling can be used in mild conditions, whereas tower free cooling can only be used in cold conditions. Although more complicated, tower free cooling can be used where it may not be practical to create the large floor openings needed in facilities to accommodate the outside air and relief ducts.

Wells Fargo bank introduced tower free cooling for its data center in Minneapolis, Minnesota, in 2005, and achieved energy savings. The added investment due to the implementation of tower free cooling was $1 million, which accounted for about 1 % of the total construction costs. The waterside economizer is used when the outside air temperature drops to about 2 °C, and it can be operated about 4 months a year. The energy savings amounted to $150,000 in 2006 and up to $450,000 per year in subsequent years as the bank continued to expand operations [9].

4.1.5 Enhanced Cooling Utilizing Power Management Technologies

Air-conditioning is the dominant cooling method in data centers, with the room temperature usually set at a fixed value. However, new power measurement and management technologies have been developed to monitor, manage, and improve the energy efficiency of air-conditioning.

One example is air-conditioning equipped with IBM's Measurement and Management Technologies (MMT) [10], a tool set that helps visualize and understand the thermal profile of an existing data center and its power and cooling systems. MMT provides a detailed assessment of the heat distribution throughout the center by creating a three-dimensional chart that pinpoints power and cooling


inefficiencies. After a measurement survey, sensors are installed and coupled to a software system encoded with the survey results to provide ongoing reporting and analysis of the room conditions. Based on in situ monitoring and analysis of the room condition distributions, this tool sets optimal cooling system levels to minimize over-provisioning and over-cooling. In collaboration with IBM, a five-month test of MMT was implemented by Toyota Motor Sales (at its 20,000 ft2 Torrance, California, data center) and Southern California Edison, one of the largest electric utilities in the U.S. [11]. MMT information feedback allowed Toyota to safely shut down two CRACs, resulting in energy and cost savings.

Another example is air-conditioning equipped with Kool-IT™ technology from AFCO Systems [12], which controls the temperature across a data center and thereby helps improve cooling efficiency. This method claims to be able to keep a data center efficiently and reliably cool [12].

4.1.6 Comparison of Principal Cooling Methods

The cooling method must be selected to maximize energy efficiency. To assist with this decision, this section compares the cooling methods discussed earlier and identifies their advantages and disadvantages.

The energy efficiency of air-conditioning with the new power management technologies is only moderate. IBM claims that liquid cooling is very efficient for high-power-density subsystems (e.g., CPUs) due to the high heat transfer coefficients [3], but Intel doubts its efficiency for entire data center implementations, particularly for many low-density units [1]. For the Wells Fargo bank, tower free cooling has also proven to be very energy efficient. A/C with power management technologies can also be an efficient cooling method for data centers [10, 12].

The cost of retrofitting air-conditioning with new power management technologies is moderate. The retrofit costs for liquid cooling are higher than for other cooling methods because the pipes for liquid recirculation must be installed, or sometimes reinstalled. For example, when Iceotope is installed [2], motherboards must be removed from the servers and then completely immersed in sealed baths of coolant, which results in high costs for existing data centers with traditional A/C. Retrofitting tower free cooling entails moderate costs, since airside economizers, waterside economizers, and the associated pumping and filtration equipment are needed, and these are often inexpensive and readily available.

Air-conditioning and liquid cooling with new power management technologies are not weather-dependent, but tower free cooling is highly dependent on the weather. Mild weather conditions can maximize the operating hours of airside economizers, and cold weather conditions can maximize the operating hours of waterside economizers. Table 4.1 compares the cooling methods.

Table 4.1 Comparison of cooling methods

Energy efficiency — A/C with power management technologies: medium; liquid cooling: high for high-power-density subsystems but medium for whole data centers; tower free cooling: high.
Retrofit cost — A/C with power management technologies: medium; liquid cooling: high; tower free cooling: medium.
Weather dependence — A/C with power management technologies: low; liquid cooling: low; tower free cooling: high.


4.2 Free Air Cooling

Free air cooling uses outside ambient air (under prescribed temperature and humidity levels) as the primary cooling medium to reduce the costs of the energy required for cooling. The objective is to use an airside economizer to take advantage of favorable climate conditions. When the outside air is cooler than the return air (the air returned from the data center after it has flushed over the equipment), an airside economizer exhausts the hot return air and replaces it with cooler, filtered outside air, essentially "opening the windows" to cool the data center equipment. Free air cooling has been investigated by companies including Intel [13], Google [14], Microsoft [15], and Vodafone [16].

Intel conducted a 10-month test to evaluate the impact of using only outside air via airside economization to cool a high-density data center in New Mexico, beginning in October 2007 [13]. The center had 900 heavily utilized production servers. In this test, the system provided 100 % air exchange with a supply air temperature varying from 18 °C to more than 32 °C, no humidity control (4–90 % RH), and minimal air filtration. The results showed that about $2.87 million (a 67 % savings in total energy costs) was saved by the new cooling method. Internet giants Google and Microsoft have used free air cooling in their data centers. Google operates a data center in Belgium where the room temperature can be above 27 °C [14], which allows the application of free air cooling during most of the year. Microsoft operates a data center with free air cooling in Dublin, Ireland, in which the room temperature can reach 35 °C [15]. Vodafone now runs its telecom equipment at a standard temperature of 35 °C, rather than the previous norm of 25–30 °C, in order to save energy on cooling [16]. However, these Google, Microsoft, and Vodafone reports do not provide information about humidity in their free air cooling implementations.

Intel [13], Google [14], and Microsoft [15] have claimed reductions in energy consumption and improved efficiency with free air cooling. Similar to tower free cooling, free air cooling has advantages in terms of energy efficiency and retrofit cost, but these advantages depend strongly on the local climate at the data center site. Compared to tower free cooling, free air cooling can be implemented in more regions, since more regions have mild weather than have cold weather. Free air cooling is considered one of the most promising cooling methods for data centers, particularly when both temperature and humidity conditions are favorable.


Free air cooling in many locations is limited by humidity (dew point), not by temperature (dry-bulb temperature).

The following section introduces the implementation of free air cooling with an airside economizer and the key considerations involved. The potential benefits and shortcomings of free air cooling are also discussed, and examples of industry practice are presented.

4.2.1 Operation of Airside Economizer

Various types of airside economizers are available on the market, but they share common core design features [13, 17, 18], as shown in Fig. 4.4 [19]. Generally, an airside economizer consists of sensors, ducts, dampers, and containers that supply the appropriate volume of air at the right temperature to satisfy cooling demands. Before an airside economizer is used, a range for the supply air temperature needs to be set [13]. The outside air is brought into the containers and then distributed to cool the equipment via a series of dampers and fans. The supply air cools the equipment, picks up its heat, and then returns to the containers in the airside economizer. Instead of being recirculated and cooled, the exhaust air is simply directed outside. If the temperature of the outside air is below the set temperature range of the supply air, the economizer must mix the incoming outside air with the exhaust air to ensure that the temperature of the supply air is within the set range. If the conditions achievable by economization and mixing of outside air fall outside the set range, an air-conditioning system is used to bring the supply air conditions within the set range. Thus, the set temperature range determines the operating hours of the airside economizer.
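A minimal sketch of the damper decision logic described above; the setpoints and mode labels are ours and would differ between economizer products:

```python
# Choose the economizer operating mode from the outside-air temperature and the
# supply-air temperature band, following the sequence described in the text.

SUPPLY_MIN_C = 18.0  # assumed lower bound of the supply-air setpoint range
SUPPLY_MAX_C = 27.0  # assumed upper bound

def economizer_mode(outside_air_c: float) -> str:
    if outside_air_c < SUPPLY_MIN_C:
        # Outside air is too cold on its own: temper it by mixing in warm exhaust air.
        return "mix outside air with return/exhaust air"
    if outside_air_c <= SUPPLY_MAX_C:
        # Outside air can be filtered and supplied directly.
        return "100 % outside air"
    # Even mixing cannot bring the supply air into range: fall back on mechanical cooling.
    return "mechanical cooling (A/C) assist"

for t in (10.0, 22.0, 33.0):
    print(f"{t:.0f} °C outside -> {economizer_mode(t)}")
```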

There are exceptions to the approach explained above. For example, one Google data center in Belgium does not use chillers or heating coils at all [14].

Fig. 4.4 Schematic of an airside economizer with the airflow path [19] (intake air passes through a pre-filter, supply fan, final filter, and cooling coil/humidifier before entering the data center as supply air; the return/exhaust fan directs return air either outside or back for mixing)


When the air temperature gets above the set range, the data center redirects the workload, turning off equipment as needed and shifting the computing load to other data centers.

4.2.2 Operating Environment Setting

The operating environment setting is a key factor in free air cooling implementation. The operating environment determines the annual operating hours of the airside economizer, as well as the local operating conditions of the data center equipment. The appropriate operating environment setting must be based on the climate, equipment specifications, standards, and identified hotspots of the particular data center. The climate profile is the most critical factor in selecting the data center's location.

Analysis of historical climate data for a particular location is useful to determine the operational feasibility of a free air-cooled data center. Generally, once the operating environment has been set, the local weather conditions determine the operating hours per year of the cooling coils inside the airside economizers. For example, weather data from two cities (Seattle and Houston) are shown in Tables 4.2 and 4.3. Because temperatures in Houston are higher than in Seattle, Seattle has the greater potential for energy savings when free air cooling is implemented. However, the humidity in Seattle is higher than in Houston, and this can affect the overall system reliability.
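As a rough illustration of how historical weather data translate into economizer operating hours, the sketch below counts the hours in an hourly weather record that fall inside a chosen supply-air envelope. The record format, the envelope limits, and the function name are assumptions; a real feasibility study would use bin or TMY data and would also consider dew point.

```python
def free_cooling_hours(hourly_weather, t_max_c=27.0, rh_max_pct=60.0):
    """Count hours in which outside air could be used directly.

    hourly_weather: iterable of (dry_bulb_c, relative_humidity_pct) tuples,
    one entry per hour of the year. The envelope limits are illustrative.
    """
    return sum(
        1
        for dry_bulb_c, rh_pct in hourly_weather
        if dry_bulb_c <= t_max_c and rh_pct <= rh_max_pct
    )


if __name__ == "__main__":
    # Toy two-day record; a real study would load 8,760 hourly values.
    record = [(12.0, 55.0)] * 30 + [(30.0, 70.0)] * 18
    hours = free_cooling_hours(record)
    print(f"{hours} of {len(record)} hours usable for free cooling")
```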

Recommended operating environments "give guidance to data center operators on maintaining high reliability and also operating their data centers in the most energy efficient manner," and allowable operating environments are the range in which "IT manufacturers test their equipment in order to verify that the equipment will function within those environmental boundaries" [21]. These envelopes have been introduced in Chap. 3.

Table 4.2 Seattle weather averages [20]

Month       Avg min (°C)   Avg max (°C)   Avg (°C)   Relative humidity (%)
January     2              8              5          78
February    3              9              6          75
March       4              12             8          69
April       6              15             11         61
May         9              19             14         60
June        11             21             16         61
July        13             24             19         61
August      13             24             18.5       67
September   11             21             16         70
October     8              16             12         77
November    5              11             8          80
December    3              9              6          81


Telcordia standards GR-63-CORE [22] and GR-3028-CORE [23] state that the recommended operating conditions are 18–27 °C and 5–55 % RH, and the allowable operating conditions are 5–40 °C and 5–85 % RH. ASHRAE recommends that data centers maintain their environments within the recommended envelope for routine operation. According to the ASHRAE Thermal Guidelines [21], exceeding the recommended limits for short periods of time should not cause a problem, but running near the allowable limits for months could negatively impact reliability. These standards-based operating conditions have been generally accepted by the industry, but they may change, since the 2012 European Union (EU) guidelines allow the inlet air temperature and humidity of data centers to be as high as 45 °C and 80 % RH, respectively [24].

When a data center implements free air cooling, the temperature and humidity range settings need to be based on the manufacturers' specifications for the equipment in the data center. The specifications can be found in the datasheets for individual equipment items and need to be confirmed with the equipment manufacturers. For example, the allowable operating temperature range of Cisco 3600 Series routers is specified as 0–40 °C, and the allowable humidity range is 5–95 % RH [25].
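The usable free-air envelope for a data center is bounded by the most restrictive equipment specification installed, as the Cisco example suggests. The sketch below intersects per-device allowable ranges into a single operating window; apart from the Cisco figures quoted above, the device list and numbers are hypothetical placeholders, not vendor data.

```python
def combined_envelope(device_specs):
    """Intersect per-device allowable (temperature, RH) ranges.

    device_specs: dict mapping device name to
    (t_min_c, t_max_c, rh_min_pct, rh_max_pct).
    Returns the tightest window that satisfies every device.
    """
    t_min = max(s[0] for s in device_specs.values())
    t_max = min(s[1] for s in device_specs.values())
    rh_min = max(s[2] for s in device_specs.values())
    rh_max = min(s[3] for s in device_specs.values())
    return t_min, t_max, rh_min, rh_max


if __name__ == "__main__":
    specs = {
        "Cisco 3600 router": (0.0, 40.0, 5.0, 95.0),  # datasheet values cited above
        "example server":    (5.0, 35.0, 8.0, 85.0),  # hypothetical values
    }
    print("Allowable window (Tmin, Tmax, RHmin, RHmax):", combined_envelope(specs))
```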

If the local operating conditions exceed the equipment's allowable temperatures, hot spots may be created that can reduce equipment reliability and cause unscheduled downtime. In some cases, it may be possible to redirect or optimize the airflow to eliminate unwanted hot spots. This kind of optimization may be performed at the system level or on selected server rack configurations. Airflow optimization also helps to identify "weak link" hardware, i.e., hardware that has a lower thermal margin than other pieces of equipment and which, therefore, limits the ability of the data center to function at its maximum targeted temperature.

Table 4.3 Houston weather averages [20]

Month       Avg min (°C)   Avg max (°C)   Avg (°C)   Relative humidity (%)
January     8              17             12.5       65
February    9              18             13.5       54
March       12             22             17         58
April       16             25             21         59
May         20             29             25         58
June        23             32             28         57
July        24             34             29         58
August      24             34             29         58
September   22             31             27         59
October     17             28             23         55
November    12             21             17         57
December    9              18             13.5       64


4.2.3 Energy Savings from Free Air Cooling

In 2008, the Department of Civil and Environmental Engineering at the University of California at Berkeley and the Lawrence Berkeley National Laboratory (LBNL) published a report, "Energy Implications of Economizer Use in California Data Centers," estimating the energy savings possible in several climate zones of California [26]. In order to quantitatively identify the possible energy savings, the report compared free air cooling (with airside economizers) with tower free cooling (with a waterside economizer) and a baseline (with traditional air-conditioning) based on energy models and simulations. The following sections summarize the energy saving benefits of free air cooling in California outlined in the report.

4.2.3.1 Data Center Cooling Scenarios

The baseline for identifying the free air cooling energy savings was traditional computer room air conditioning (CRAC) units placed on the server room floor. Air entered the top of the CRAC units, was cooled as it passed over the cooling coils, and was then discharged into the under-floor plenum. The cold air in the under-floor plenum passed through the perforations in the floor tiles located in front of the server racks and was drawn across the server racks, with the help of the server fans, to remove their heat. The warmed exhaust air exited the back of the server racks and rose to the intake of the CRAC unit. In the baseline scenario, the air circulation was usually internal to the data center. A rooftop air handling unit (AHU) provided a small amount of air to positively pressurize the room and supplied outside air for occupants. The refrigerant in a water-cooled chiller plant cooled the water from the CRAC units of the data center through heat exchangers. The chiller, in turn, used heat exchangers to transfer the waste heat to the condenser water piped in from the cooling towers, where the warm water was cooled by the outside air. This baseline design has been widely used in mid-size and large-size data centers.

In the waterside economizer (WSE) scenario, a computer room air-conditioning unit similar to that of the baseline scenario was used, except that additional heat exchangers were installed between the chilled water supplied to the computer room air-conditioning units and the condenser water in the cooling towers (see Fig. 4.5). When the local climate was cold enough, the chiller plant did not need to be used, because the condenser water in the cooling towers was cold enough to directly cool the chilled water supplied to the computer room air-conditioning units. Since the computer room air-conditioning units and chiller plant were the same as those in the baseline scenario, the energy savings were achieved through the replacement of compressor-driven chilling with fan-driven evaporative cooling.

In the airside economizer (ASE) scenario, there were some differences in air delivery compared to the traditional computer room air-conditioning units used in typical data centers.


Air handling units were placed on the rooftop, outside of the data center room, and ducts were used to deliver air to and from the server racks. The ducted system design could prevent the cold air and warm air from unintentionally mixing in the data center, but it had greater air resistance than a traditional computer room air-conditioning unit. When the outside air temperature was inside the set range, the air handling unit supplied the outside air directly into the data center; the air passed over the servers, and the warmed return air was exhausted outside the room. In this process of 100 % outside air cooling, fans consumed more energy than in the baseline case. However, the economizer design saved the energy of operating the chiller, chilled water pumps, and cooling tower fans. If the outside air temperature was higher than the set range, the chiller needed to be operated as in the baseline case (see Fig. 4.6) [26].

Fig. 4.5 Schematic of the waterside economizer scenario [26]

Fig. 4.6 Schematic of airside economizer scenario [26]


4.2.3.2 Energy Modeling Protocol

The model calculations of energy consumption assumed that each design was implemented in a data center with a size of 30,000 ft2 (2,800 m2). The heat density of the data center was assumed to be about 80 W/ft2 (0.86 kW/m2, 2.4 MW total), which was considered to be low-range to mid-range. Table 4.4 shows the data center's basic properties in all three scenarios. Total energy demand was calculated as the sum of the energy consumption of the servers, chiller use, fan operation, transformer and uninterruptible power supply (UPS) losses, and building lighting [26].

The chiller system included a coolant compressor, chilled water pumps, condensing water pumps, humidification pumps, and cooling tower fans. The energy consumption of the servers, UPS, and lighting was considered constant across the three design scenarios. Humidity was restricted to the conventional ASHRAE 2005 range (40–55 % RH) in the baseline and waterside economizer scenarios, and was typically not restricted in the airside economizer scenario. The airside economizer scenario also had different fan parameters, as listed in Table 4.5.
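The modeling protocol amounts to summing the annual energy of each subsystem and dividing by the server (IT) energy. A minimal sketch of that bookkeeping is shown below; the 2,000 kW rack load comes from Table 4.4, but the overhead breakdown is a hypothetical placeholder (chosen so the result lands near the report's 1.55 baseline ratio), not the report's simulation output.

```python
def annual_energy_kwh(load_kw, hours=8760):
    """Convert a constant electrical load (kW) into annual energy (kWh)."""
    return load_kw * hours


def performance_ratio(it_kwh, overhead_kwh):
    """Ratio of total building energy to server energy (the report's PUE-style metric)."""
    return (it_kwh + overhead_kwh) / it_kwh


if __name__ == "__main__":
    it_kwh = annual_energy_kwh(2000)  # total rack load from Table 4.4
    # Hypothetical overhead loads: chillers, fans, UPS/transformer losses, lights (kW).
    overhead_kwh = sum(annual_energy_kwh(kw) for kw in (700, 250, 130, 30))
    print(f"Performance ratio ~ {performance_ratio(it_kwh, overhead_kwh):.2f}")
```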

4.2.3.3 Energy Consumption Comparison of Cooling Scenarios

This Lawrence Berkeley National Laboratory report [26] considered five cities in California (Sacramento, San Francisco, San Jose, Fresno, and Los Angeles) as data center locations and assumed that a data center was located in each city. The annual energy consumption of each data center was calculated based on the three design scenarios, and the ratio of total data center energy to server energy consumption was also calculated (see Table 4.6). In the baseline scenario, the performance ratio of building energy consumption to server energy consumption was 1.55, which was the same for all five data centers, since operation under this design was practically independent of outdoor weather conditions. The performance ratios of the airside economizer scenario showed that airside economizers can reduce energy consumption compared to the baseline case. The waterside economizer scenario could also save energy compared to the baseline, but the savings would be less than those in the airside economizer scenario.

Table 4.4 Data center characteristics common to many designs [26]

Data center parameters
Floor area                      30,000 ft2
UPS waste heat                  326 kW
Data center lights              30 kW
Total rack load                 2,000 kW
Total internal load             2,356 kW
Average internal load density   79 W/ft2
Minimum ventilation             4,500 ft3/min
Supply air temperature          13 °C
Return air temperature          22 °C
Chiller capacity                1,750 kW
Number of chillers              3



In this report, a small change in the performance ratio represented substantial savings. For example, the performance ratio change from 1.55 to 1.44 in the San Jose data center could save about 1.9 million kWh/year in energy, equivalent to about $130,000/year (assuming $0.07/kWh) [26].
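The San Jose savings follow directly from the performance ratios and the server energy. A quick check of the arithmetic, using the 2,000 kW rack load from Table 4.4 and the $0.07/kWh rate stated in the text:

```python
server_kwh = 2000 * 8760                 # 2,000 kW of rack load, all year
saved_kwh = (1.55 - 1.44) * server_kwh   # performance-ratio improvement
saved_usd = saved_kwh * 0.07             # at $0.07/kWh

print(f"{saved_kwh / 1e6:.1f} million kWh/year, about ${saved_usd:,.0f}/year")
# -> roughly 1.9 million kWh/year and ~$135,000, consistent with the
#    ~$130,000/year figure quoted in the report.
```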

The energy consumption of the five data centers considered in the three design scenarios is shown in Fig. 4.7. The results show that the airside economizer scenario in San Francisco provided the greatest energy savings, while that in Fresno provided the least energy savings. Under the waterside economizer scenario, the data center in Sacramento obtained the greatest benefits, while those in Los Angeles and San Francisco gained minimal energy savings. The San Francisco waterside economizer scenario might be expected to show savings due to the cool climate, but chiller part-load inefficiencies reduced the savings. San Francisco air contains relatively more moisture, which increases the latent cooling load in the model and often reaches the capacity limit of the first chiller plant, so that a second chiller needs to be activated. Since the cooling load is then shared equally between the two chillers, each operates at slightly above half its capacity limit, which is inefficient. To avoid this, the data center with the waterside economizer scenario in San Francisco must model the hour-by-hour chiller load, rather than the peak load, and operate the appropriate number of chillers to keep them near their most efficient operating point at any moment.

Table 4.5 Data center fan properties [26]

                                Baseline and waterside economizer                   Airside economizer
Fan system parameters           MUAH (makeup     Exhaust   Computer room            Supply    Relief
                                air handling)              air-conditioning
Total air flow (cfm)            4,500            4,500     49,500                   437,758   437,758
Fan motor size, nominal (hp)    7.5              3         10                       30        50
Number of fans                  1                1         30                       10        5
Fan efficiency (%)              53.3             44.0      55.6                     63.8      67.5
Fan drive efficiency (%)        95               95        95                       95        95
Fan motor efficiency (%)        89.6             86.2      90.1                     92.5      93.2
VFD efficiency (%)              n/a              n/a       n/a                      98        98
Total static pressure drop      3.5              1         1.6                      2         1

Table 4.6 Ratio of total building energy to computer server energy (PUE) [26]

                        San Jose   San Francisco   Sacramento   Fresno   Los Angeles
Baseline                1.55       1.55            1.55         1.55     1.55
Airside economizer      1.44       1.42            1.44         1.46     1.46
Waterside economizer    1.53       1.54            1.53         1.53     1.54



The annual energy consumption of the five data centers is related to the different humidity restrictions; one example, for the Los Angeles data center, is shown in Fig. 4.8. Among the three cooling scenarios, the baseline and waterside economizer scenarios are generally independent of the humidity restrictions; however, the airside economizer energy consumption increases sharply, and may even exceed that of the other scenarios, when the relative humidity restriction range is narrowed. The humidity range for data centers recommended by the 2004 version of the ASHRAE thermal guidelines is 40–55 % (the humidity ranges in the 2008 and 2011 versions are represented by both relative humidity and dew point), as was discussed in Chap. 3. In order to gain the maximum energy savings, the humidity level in the airside economizer scenario usually goes far beyond the recommended range. This may accelerate some failure mechanisms and pose reliability risks to the equipment in data centers, as will be discussed in Chap. 5.

Fig. 4.7 Energy consumption under economizer scenarios [26]: annual energy use (kWh/ft2) of the San Jose, San Francisco, Sacramento, Fresno, and Los Angeles data centers under the baseline, airside, and waterside scenarios

Fig. 4.8 Energy consumption resulting from humidity restrictions in the Los Angeles data center [26]: annual energy use (kWh/ft2) under the baseline, airside, and waterside scenarios for relative humidity restriction ranges of 10–100 %, 10–90 %, 20–80 %, 30–70 %, and 40–55 %


4.2.4 Hidden Costs of Free Air Cooling

The cost of an airside economizer depends on the materials and installation of the dampers used for closing and opening the economizer windows, filters, fans, actuators, logic modules, and sensors. Though the costs of the logic controller and sensors are, for the most part, independent of the economizer's size, the costs of all other components depend on the size of the data center and the amount of outside air that must be distributed.

The cooling system design affects the energy efficiency, even when data centers are operated within the recommended conditions. For example, some fans inside equipment have multiple speeds, and thermal management algorithms can change fan speed in response to temperature variations. When the ambient temperature increases, the fan speeds are increased to prevent the equipment from becoming hot, which consumes more energy and offsets some of the savings from the increased ambient temperature [21]. Thus, the cooling energy efficiency is determined not only by the operating temperature settings in data centers, but also by the cooling algorithm designs of the equipment. An example is given in [21]: thermal management algorithms keep the fan speed constant up to about 23 °C, and the component temperature increases (roughly) linearly with the ambient temperature below 23 °C. When the ambient temperature increases beyond 23 °C, the fan speed also increases to maintain the component temperature at a relatively constant level. Most IT manufacturers start to increase the fan speed at around 25 °C to improve the cooling of components and offset the increased ambient air temperature. The design of variable fan speeds can minimize the effects of increased ambient temperatures on the reliability of temperature-sensitive components (which are usually the weakest components). But it is estimated that the power increases with the cube of the fan speed [24]. That is, when a fan's speed doubles, it consumes eight times as much energy.
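The cubic relationship between fan speed and fan power cited above can be written out explicitly. The sketch below is a generic fan-affinity-law illustration, not a model of any particular server's thermal algorithm.

```python
def fan_power_ratio(speed_ratio):
    """Fan affinity law: power scales with the cube of the speed ratio."""
    return speed_ratio ** 3


if __name__ == "__main__":
    for r in (1.0, 1.2, 1.5, 2.0):
        print(f"speed x{r:.1f} -> power x{fan_power_ratio(r):.2f}")
    # Doubling the speed (x2.0) gives 8x the fan power, as noted above.
```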

Another hidden cost of free air cooling is the increased leakage power of server chips. Chip leakage power arises from the reverse bias between diffusion regions and wells, and between wells and the substrate, and it does not support the chip's computational workload. The leakage power is usually very small and can be neglected compared with the chip computation power if the chip temperature is below the temperature threshold. But when the chip temperature under free air cooling conditions goes beyond the threshold, the leakage power increases exponentially with further temperature increases; it has reached up to 30 % of the total chip power consumption in the most recent enterprise servers [27]. Implementations of free air cooling need to account for the impact of increased air temperatures on leakage power.
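The exponential dependence of leakage power on chip temperature can be captured with a simple model of the form P_leak(T) = P_ref · exp((T − T_ref)/T0). The reference power, reference temperature, and temperature constant below are illustrative assumptions, not measured server data.

```python
import math


def leakage_power_w(temp_c, p_ref_w=20.0, t_ref_c=60.0, t_const_c=25.0):
    """Illustrative exponential leakage model.

    p_ref_w: assumed leakage at the reference junction temperature t_ref_c.
    t_const_c: temperature rise that multiplies leakage by e (~2.7x).
    All three parameters are placeholders for illustration.
    """
    return p_ref_w * math.exp((temp_c - t_ref_c) / t_const_c)


if __name__ == "__main__":
    for t in (60, 70, 80, 90):
        print(f"{t} °C junction -> ~{leakage_power_w(t):.0f} W leakage")
```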

The implementation of free air cooling may allow gaseous and particulate contaminants such as dust, smoke, and corrosive gases to enter the data center airflow. The impacts of that intake are discussed in Chap. 5.


4.2.5 Examples of Free Air Cooling

Free air cooling has been implemented in data centers in the US, Europe, and Asia (see Table 4.7). Due to the climate diversity of the data center locations and designs, the days per year in which free air cooling can be implemented differ, resulting in a range of energy savings. The economy of free air cooling will depend on the local energy costs, energy sources, and regulations.

Our first case example of free air cooling involves an Intel data center [13], in which typically close to 60 % of the energy is consumed by power and cooling equipment. The design of increasingly complex electronics requires the support of increased computing capacity in data centers, which results in the rapid growth of energy consumption. Intel implemented free air cooling to minimize the energy consumption in its data centers.

Table 4.7 Implementation of FAC by companies [28]

Facebook (Oregon, USA): Facebook's first company-built facility, 147,000 ft2, with a power usage effectiveness (PUE) of 1.15
Microsoft (Chicago, USA): One data center with 700,000 ft2
Citigroup (Frankfurt, Germany): 230,000 ft2, 65 % of days per year with free air cooling
Digital Realty Trust (California, USA): More than 65 % of days per year with free air cooling, annual 3.5 million kWh energy saving ($250,000), with a PUE of 1.31
VMware (Washington, USA): The mechanical system uses hot air/cold air physical separation to extend the operating hours of airside economizers
Microsoft (Dublin, Ireland): 303,000 ft2, Microsoft's first mega data center in Europe
Internet Initiative Japan (Japan): Expected to reduce the cost of cloud service by 40 %, reducing annual CO2 output by about 4,000 tons
Advanced Data Centers (California, USA): 237,000 ft2, use of airside economizers and recycled grey water as a redundant water supply
Google (Brussels, Belgium): Operated below 27 °C, with temperatures above the acceptable range (27 °C) only about seven days per year on average
Weta Digital (Wellington, New Zealand): 10,000 ft2, running full time and often at full capacity, with no air-conditioning
IBM Cloud (North Carolina, USA): More than 100,000 ft2, $362 M, with annual use of FAC for half the year
Fujitsu (Perth, Australia): About 8,000 ft2, potentially decreasing the cooling load by up to 50 %
HP Wynyard (Newcastle, UK): 300,000 ft2, Data Center Leaders' Award for 2008
Verne Global (Keflavik, Iceland): 100 % free cooling utilizing the low ambient temperature



Free air cooling was implemented in one of Intel's data centers in New Mexico for 10 months, as a proof of concept. The blade servers in the data center were utilized to deliver high computing capacity and thus generated a lot of heat. With air-conditioning units, the supply air cooled the servers to 20 °C. After the air passed across the servers, its temperature increased by 32 °C and reached 52 °C. If Intel wanted to recirculate the air, it needed to cool the air by 32 °C, which would consume substantial energy if done with air-conditioning units [13].

In order to avoid equipment downtime due to the severe operating environment, free air cooling was implemented in a trailer that was originally designed to provide temporary additional computing capacity. The trailer was 1,000 ft2 and divided into two approximately equal-size compartments. One compartment, with 448 highly utilized blade servers, was cooled by airside economizers, which were modified from low-cost, warehouse-grade direct expansion air-conditioning equipment. The airside economizers expelled exhaust (hot) air from the servers to the outdoors and supplied outside (colder) air to cool the servers. The other compartment, also with 448 blade servers, was cooled by traditional air-conditioning units in order to identify the impact of free air cooling on reliability. Sensors installed in the two compartments were used to monitor the temperature and humidity [13].

In order to maximize the energy savings, the outside supply air temperature was set to a range from 18 to 32 °C, since the servers can work at temperatures as high as 37 °C, according to the manufacturer's ratings. This temperature range of the supply air was maintained by the air-conditioning units inside the airside economizers. When the outside air exceeded 32 °C, the air-conditioning units would start to cool the supply air to 32 °C. If the temperature of the supply air was below 18 °C, the hot return air from the servers would be mixed with the supply air to reach the set temperature range. There were no controls on the humidity, and filtering was applied only to remove large particles in the supply air.

The Intel test started in October 2007 and ended in August 2008. The servers were used to run a large workload to maintain a utilization rate of about 90 % [13]. The servers with free air cooling were subjected to wide operating condition variations, with average high temperatures ranging from 9 to 33 °C, and average low temperatures ranging from −5 to 18 °C. Due to the slow response of the low-cost air-conditioning units inside the airside economizers, the temperature of the supply air at times slightly exceeded the set range. The records showed that the supply air temperature varied from 17.7 to 33.3 °C. The relative humidity varied from 4 % to more than 90 %, with rapid changes at times. The compartment and the servers with free air cooling were covered with dust [13].

With the use of the economizer, the cooling load of the direct expansion air-conditioning units in the economizer compartment was reduced from 112 to 29 kW, which saved up to 74 % in cooling energy consumption. It was estimated that 67 % of the cooling energy consumption could be saved with 91 % use of airside economizers, which could reduce the annual energy cost by up to $2.87 million in a 10-megawatt (MW) data center. The failure rate in the compartment cooled by the direct expansion air-conditioning units was 2.45 %, while the failure rate in the economizer compartment, with the presence of dust and wider ranges of temperature and humidity, was 4.46 % [13].


However, this case only considered the servers; other major pieces of IT equipment in the data center (e.g., routers and switches) were not included in the failure rate estimation. In addition, the test duration of 10 months was too short (compared to the IT equipment lifetime of 3–5 years) to determine whether the failure rates would increase with time or remain the same.

As a second case example, consider the free air cooling implementation in Dell's 50,000 ft2 data center in Austin, TX, during the first 5 months of 2010. The baseline power consumption without economization was about 5,000 kW, which decreased with economization whenever the outside temperature was lower than 10 °C. This implementation realized a reduction of $179,000 (about 15 %) in overall energy cost in the data center through utilization of free air cooling in the first 4 months of 2010, even though the climate in Austin is hot and not ideal for implementing free air cooling [29].

Dell did not report the reliability information for the Austin data center, but they did perform an experiment to identify the impact of free air cooling on server reliability [26]. That experiment was conducted on servers at 40 °C and 85 % RH for more than 1.5 years. The results showed only a small difference in the number of server failures compared with conditions of 22 °C and 50 % relative humidity. But, as with the Intel case, the Dell case focused mainly on server hardware; telecommunication equipment, such as routers and switches, was not included in the experiment. In addition, the experiment was operated at constant conditions of high temperature and high humidity, whereas the operating conditions under free air cooling are likely to form a thermal cycling environment.

4.3 Summary

Cooling equipment accounts for a major share of data center energy consumption, and thus provides opportunities to improve energy efficiency by modifying or implementing innovative cooling methods. This chapter discussed some cooling methods that can serve to improve the energy efficiency of data centers, including liquid cooling, tower free cooling, and air conditioner cooling with power management technologies, along with the benefits and disadvantages of each.

This chapter focused on free air cooling, which under proper/allowed environmental conditions serves as an accepted approach for efficient cooling in data centers and has been implemented with airside economizers. The energy savings from free air cooling depend on the set operating environment and the local climate. Airflow optimization using simulations can be used to minimize potential hotspots during the free air cooling implementation design process. The Berkeley–LBNL report [26] demonstrated that humidity restrictions also have a significant impact on the energy savings from free air cooling implementation. Chapter 5 discusses the reliability risks in a data center that uses free air cooling.



Free air cooling is a promising alternative cooling method for data centers that are located in favorable climates (with temperature and humidity conditions within the recommended or allowable margins). This method has already been adopted by some leading companies. The approach to improving cooling efficiency and reducing energy consumption will be an "all of the above" approach across the industry, with each location selecting the method best suited to its situation. On the same note, combinations of methods can be selected by operators for a given location and a given data center to achieve year-round optimum cooling.

References

1. M.K. Patterson, D. Fenwick, The state of data center cooling. Intel Corporation white paper (2008)
2. R. Miller, Iceotope: a new take on liquid cooling, Data Center Knowledge, Nov 2009
3. S. O'Donnell, IBM claim that water cooled servers are the future of IT at scale, The Hot Aisle, Jun 3 (2009)
4. R. Mandel, S.V. Dessiatoun, M.M. Ohadi, Analysis of Choice of Working Fluid for Energy Efficient Cooling of High Flux Electronics, Progress Report, Electronics Cooling Consortium, CALCE/S2Ts Lab, Dec 2011
5. Allied-Control, Immersion cooling, http://www.allied-control.com/immersion-cooling. Accessed 25 Aug 2013
6. D. Harris, Intel immerses its servers in oil—and they like it, http://gigaom.com/cloud/intel-immerses-its-servers-in-oil-and-they-like-it. Accessed 31 Aug 2012
7. Green Revolution Cooling, Reduce data center cooling costs by up to 95 %, http://www.grcooling.com/0. Accessed 25 Aug 2013
8. U.S. Environmental Protection Agency, Heat and Cooling, Energy Star Program, Jan 2008
9. R.L. Mitchell, Case study: Wells Fargo's free data center cooling system, Computer World, Nov 2007
10. International Business Machines (IBM), IBM measurement and management technologies (MMT) data center thermal analysis, IBM Systems Lab Services and Training Solutions Brief, Jun 2011
11. International Business Machines (IBM), IBM collaborates with Toyota Motor Sales, U.S.A. Inc. and Southern California Edison to create green data center, Press Release, Oct 2009
12. J. Fulton, Control of server inlet temperatures in datacenters—a long overdue strategy, AFCO Systems white paper, May 2007, http://www.ebookxp.com/e8cd6ce619/Control+of+Server+Inlet+Temperatures+in+Data+Centers.html. Accessed 20 May 2010
13. D. Atwood, J.G. Miner, Reducing data center cost with an air economizer, IT@Intel Brief, Computer Manufacturing, Energy Efficiency, Intel Information Technology, Aug 2008
14. R. Miller, Google's chiller-less data center, Data Center Knowledge, Jul 2009
15. R. Miller, Microsoft's chiller-less data center, Data Center Knowledge, Sep 2009
16. The Economist, Technology Quarterly: Q4 2008, How green is your network? (2008), http://www.economist.com/node/12673321. Accessed 10 Oct 2013
17. D. Pickut, Free cooling: economizers in data centers, Equinix, Inc., Interop presentation, Slideshare, Mar 2008
18. V. Sorell, OA economizers for data centers. ASHRAE J. 49(12), 32–37 (2007)
19. D. Beaty, R. Schmidt, Data center energy efficiency. ASHRAE–Save Energy Now Presentation Series, Jun 2011
20. Climate and Temperature Information, http://www.climatetemp.info. Accessed 26 Dec 2009
21. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Technical Committee (TC) 9.9, 2008 ASHRAE environmental guidelines for datacom equipment (Atlanta, 2008)
22. Telcordia, Generic requirements GR-63-CORE, Network Equipment-Building System (NEBS) requirements: physical protection (Piscataway, 2006)
23. Telcordia, Generic requirements GR-3028-CORE, Thermal management in telecommunications central offices (Piscataway, 2001)
24. P. Bertoldi, The European Programme for Energy Efficiency in Data Centres: the Code of Conduct, European Commission DG JRC Institute for Energy document (2011)
25. Cisco, Cisco 3600 series—modular, high-density access routers, Mar 2002
26. A. Shehabi, S. Ganguly, K. Traber, H. Price, A. Horvath, W.W. Nazaroff, A.J. Gadgil, Energy implications of economizer use in California data centers, ACEEE Conference Proceedings, Monterey, CA, Sep 2008
27. ABB Inc., The hidden cost of free cooling and what you can do, White Paper, http://search.abb.com/library/Download.aspx?DocumentID=3BUS095684&LanguageCode=en&DocumentPartId=&Action=Launch. Accessed 2 July 2013
28. Datacenterdynamics, Free cooling guide, 30 Apr 2010
29. T. Homorodi, J. Fitch, Fresh Air Cooling Research, Dell Techcenter, Jul 2010


Chapter 5
Reliability Risks Under Free Air Cooling

Free air cooling is one of the cooling methods best known for the energy savings it offers in data centers and is increasingly accepted by the industry. Due to the potential energy savings, the "EU Code of Conduct on Data Centers" and the 2010 version of the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) Standard 90.1 recommend free air cooling as the preferred cooling method in data centers. But, under free air cooling, operating environments usually go beyond those in traditional data centers and in standards, which may cause potential reliability risks to equipment in data centers. This chapter summarizes the risks arising from the modified free air cooling environment, and discusses the potential failure mechanisms and test methods for the reliability of equipment in data centers.

5.1 Failure Causes Under Free Air Cooling

Generally, a free air cooled data center is subject to increased operating temperatures and temperature variations, which may affect the lifetime of equipment and result in reliability concerns. The humidity during free air cooling is usually uncontrolled in order to save the energy associated with humidification and dehumidification, but this may cause some failure mechanisms [such as electrostatic discharge (ESD) as a result of very low humidity levels and conductive anodic filament (CAF) formation under high humidity levels] to become more active. In addition, contamination under free air cooling is a potential failure cause.

5.1.1 Increased Temperature and Temperature Variation

A traditional data center uses both A/C units and air flow to adjust the temperature; however, with free air cooling, the temperature is controlled solely by air flow.



When free air cooling is used in data centers, the operating temperatures may rise and exacerbate existing hotspots. Increases in operating temperatures, particularly in power infrastructures and cooling systems, affect the performance of communication equipment and impact the electrical parameters of components and systems. As a result of parameter variations, particularly at hotspots, there is a risk of intermittent behavior beyond the system specifications. This can result in intermittent failure of the electrical product or system, which is the loss of a function or performance characteristic in a product for a limited period of time, followed by recovery of the function. An intermittent failure may not be easily predicted, nor is it necessarily repeatable. However, it is often recurrent, especially during temperature variations. Compared with equipment cooled by A/C units, equipment under free air cooling will experience more intermittent failures due to the increased temperature and temperature variation.

Free air cooling increases the temperature variations and results in additional temperature cycles for the equipment. For example, during a proof-of-concept study performed by Intel, the average diurnal temperature variation ranged from 13 to 17 °C [1]. For a piece of equipment with a lifetime of 5 years, this would result in an additional 1825 temperature cycles, which may accelerate the wear-out of the equipment. The free air cooling method can also accelerate wear-out in cooling equipment such as fans. When the operating temperature is increased, the cooling algorithm may increase the fan speed to offset the temperature increase. This can affect the lifetime and reliability of the fans.
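The added cycle count quoted above is simply one diurnal temperature cycle per day over the equipment's service life; a minimal check:

```python
years_in_service = 5
added_cycles = 365 * years_in_service   # one diurnal temperature cycle per day
print(f"~{added_cycles} additional temperature cycles over {years_in_service} years")
# -> 1825 cycles, the figure quoted above
```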

Temperature has an accelerating effect, through humidity, on corrosion reactions. The relative humidity of the air changes with the temperature. If the temperature drops, the RH will, at some point, reach 100 %, and a layer of water will form on the surfaces of printed circuit assemblies or housings. After the water layer is formed, the speed of corrosion accelerates to several thousand times faster than at the starting point. In addition, an increase in temperature can increase the solubility of some species in the electrolyte formed by the water layer. For instance, oxygen plays a dominant part in the electrochemical reaction for corrosion, and the solubility of oxygen increases as temperature increases. Therefore, temperature variation can increase metal corrosion and degradation in free air cooling data centers.

Increases in operating temperatures can decrease the lifetime and performance of the components used in a data center, such as batteries. The recommended temperature for batteries is usually 25 °C, and the allowable temperature range can be 15–40 °C. A decrease in temperature may cause a drop in battery capacity, where a 1 °C decrease can result in a 1 % drop in battery capacity. A temperature increase may accelerate the corrosion of bipolar plates in the batteries, with more water consumed, thus decreasing the lifetimes of batteries. Generally, the lifetimes of batteries are maximized at around 25 °C, and it is estimated that the expected life may drop by 50 % when the operating temperature increases 50 % [2].

The failure rate for a typical switched mode power supply (SMPS) doubles with every 10–15 °C temperature rise above 25 °C [3]. Increased operating temperatures also result in increased conduction and switching losses in switching transistors.


With a temperature rise, the mobility of charge carriers is also reduced. In switching transistors, such as power MOSFETs, reduced mobility leads to increased device resistance and, hence, increased conduction loss [4]. Reverse bias leakage currents increase exponentially with increases in temperature, and this leads to greater power loss [5]. High operating temperatures lead to an increase in switching turn-off time for power transistors, which in turn results in increased switching loss. Another result of higher operating temperatures is degradation in the gate oxide of switching transistors, which may result in time-dependent dielectric breakdown. High temperatures can lead to activation of the parasitic bipolar junction transistor in a power MOSFET and the parasitic thyristor in an IGBT, causing destruction of the device due to latch-up. In aluminum electrolytic capacitors, increased operating temperature leads to evaporation of the electrolyte. This causes a reduction in the capacitance, an increase in the equivalent series resistance (ESR), and an increase in power dissipation. High operating temperatures also cause structural overstress within the Schottky diode, a semiconductor diode with a low forward voltage drop and a very fast switching action, producing cracks that can propagate into the Schottky contact region and lead to catastrophic failure.
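The SMPS rule of thumb cited above (failure rate doubling for every 10–15 °C above 25 °C) can be expressed as a multiplier of the form 2^((T − 25)/ΔT). The sketch below applies that relationship using the optimistic 10 °C doubling interval; it is an illustration of the stated rule, not a qualified reliability prediction.

```python
def smps_failure_multiplier(temp_c, doubling_interval_c=10.0, t_ref_c=25.0):
    """Failure-rate multiplier: doubles every `doubling_interval_c` above 25 °C."""
    if temp_c <= t_ref_c:
        return 1.0
    return 2.0 ** ((temp_c - t_ref_c) / doubling_interval_c)


if __name__ == "__main__":
    for t in (25, 35, 45, 55):
        print(f"{t} °C -> {smps_failure_multiplier(t):.1f}x the 25 °C failure rate")
```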

5.1.2 Uncontrolled Humidity

Humidity is measured either using RH or absolute humidity (AH). AH is the weight of water vapor per unit volume of air/steam mixture, typically measured in g/m3. RH is the ratio of the actual vapor pressure to the saturated vapor pressure at the same temperature. RH and AH are linked by the following equation [6]:

RH = AH × A × exp(β/T)    (5.1)

where A is the area and β is a constant. When the amount of moisture in the air (AH) remains constant and the temperature increases, the RH decreases. Corrosion is typically linked to RH, whereas the process of moisture diffusion through materials is commonly linked to AH.
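Equation (5.1) is generic; one common concrete way to relate the two quantities is through the saturation vapor pressure, here via a Magnus-type approximation. The sketch below is a standard psychrometric calculation offered for illustration only and does not reproduce the constants A and β of [6].

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Magnus-type approximation of saturation vapor pressure over water (hPa)."""
    return 6.112 * math.exp(17.62 * temp_c / (243.12 + temp_c))


def absolute_humidity_g_m3(temp_c, rh_pct):
    """Absolute humidity (g/m^3) from temperature (°C) and relative humidity (%)."""
    vapor_pressure_hpa = saturation_vapor_pressure_hpa(temp_c) * rh_pct / 100.0
    # 216.7 ~= 100 * M_water / R, converting hPa and K into g/m^3.
    return 216.7 * vapor_pressure_hpa / (temp_c + 273.15)


if __name__ == "__main__":
    # The same absolute humidity corresponds to very different RH at 18 °C vs. 35 °C.
    print(f"~{absolute_humidity_g_m3(18.0, 60.0):.1f} g/m^3 at 18 °C and 60 % RH")
```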

Typical humidity levels in data centers based on ASHRAE guidelines are between 40 and 60 % RH. This range provides protection against a number of corrosion-related failure mechanisms, such as electrochemical migration (ECM) and CAF formation. Uncontrolled humidity in free air cooling can increase reliability risks. Both overly high and overly low humidity can activate failure mechanisms. Different forms of corrosion can be caused by high humidity, while ESD is more common in low humidity environments. These failure mechanisms can result in equipment failure [7].



5.1.3 Contamination

Contamination is another potential risk with the application of free air cooling, since there is little control of dust and gas in some data centers [1]. There have been increased hardware failures observed in data centers, especially in those located near industrial operations and other sources of pollution. Both gaseous and particulate contaminations can cause mechanical and electrical failures of electronics under poorly controlled temperature and humidity conditions.

Gaseous pollutants in a data center can contribute to corrosion. The combustion of fossil fuels can generate the pollutant gases sulfur dioxide (SO2) and nitrogen oxides, and particles such as soot. Various hygroscopic sulfate and nitrate salts are formed by the oxidation of gases. Chloride ions corrode most metals. Hydrogen sulfide (H2S) and hydrogen chloride (HCl) are related to industrial emissions in specific microclimates. H2S corrodes all copper alloys and silver at all air humidity levels.

SO2, a product of the combustion of sulfur-containing fossil fuels, plays an important role in atmospheric corrosion. It is adsorbed on metal surfaces, has high solubility in water, and tends to form sulfuric acid in the presence of surface moisture films. Sulfate ions are formed in the surface moisture layer by the oxidation of SO2 according to the following reaction, in which the required electrons originate from the anodic dissolution reaction:

SO2 + O2 + 2e− → SO42−    (5.2)

Nitrogen compounds, in the form of NOx, also tend to accelerate atmospheric attack. NOx emissions, largely from combustion processes, have been reported to increase relative to SO2 levels [8]. H2S, HCl, and chlorine in the atmosphere can intensify corrosion damage, but they represent special cases of atmospheric corrosion invariably related to industrial emissions in specific microclimates [9]. The corrosive effects of gaseous chlorine and HCl in the presence of moisture tend to be stronger than those of chloride salt anions, due to the acidic character of the former species [9]. There is an important synergistic effect between these gases. For example, H2S alone is not very corrosive to silver, but the combination of H2S and nitrous oxide is highly corrosive [10]. Similarly, neither SO2 nor nitrous oxide alone is corrosive to copper, but together they attack copper at a very fast rate [11]. The anion in a gas dissolved in water is normally more corrosive than that of a water-soluble salt.

Particulate contamination consists of both inorganic and organic substances, although inorganic materials typically outweigh the organic materials. There are many types of inorganic mineral particles in dust, including quartz sand (SiO2), feldspar (KAlSi3O8–NaAlSi3O8–CaAl2Si2O8), calcite (CaCO3), mica (SiO2·Al2O3·K2O·Na2O·H2O), and gypsum (CaSO4·2H2O) [12]. Some of these inorganic substances are water-soluble salts that can absorb moisture from the atmosphere and dissolve into the absorbed water to form a solution. This property is called deliquescence.


Organic substances include carbon black, fibers, and organic ions such as formate (COOH−) and acetate (CH3COO−) [12].

Particulate contaminations have different sizes and are generated in various ways [13]. Particles greater than 2.5 μm in diameter are usually called coarse particles, while particles less than or equal to 2.5 μm in diameter are fine particles. Coarse and fine dust particles are typically generated in different ways. Fine dust particles are normally generated through the condensation of low-volatility gases, followed by the coalescence of a number of these nuclei to create larger particles. Ammonium sulfate and ammonium hydrogen sulfate are examples of this. Coarse dust particles are generated by a variety of mechanical processes, such as wind blowing over soil, the dispersion of sea salt by ocean waves, and industrial machining. Dusts with sodium chloride and mineral particles are normally generated in this manner.

The quantity of readily deliquescent substances present in particulate contamination is of interest when considering its impact on reliability [13]. Deliquescent substances sorb water vapor from the environment and gradually dissolve in the sorbed water to form a solution, thus increasing the conductivity between two adjacent electrodes. The phase transformation from a solid particle to a saline droplet usually occurs spontaneously when the RH in the surrounding atmosphere reaches a certain level, known as the deliquescent RH or critical relative humidity (CRH). The major cations and anions are NH4+, K+, Na+, Ca2+, Mg2+, Cl−, F−, NO3−, and SO42−. In fine dust particles, the ionic components are mainly sulfate and ammonium. The ammonium/sulfate ratio is normally 1:2. The formula can be written as (NH4)2−XHXSO4, where X can be either 0 or 1. In coarse dust particles, sulfate, ammonium, calcium, magnesium, sodium, and chloride are the most prevalent ionic components, with large local variations for magnesium, sodium, and chloride. The ions exist in either pure or mixed-salt forms in the dust. Their CRH values are specific to their respective chemical compositions.

Publications on particulate contamination have documented different failures on connectors, contacts, and printed circuit board (PCB) assemblies [14, 15]. Dust particles can increase friction on sliding contact surfaces, thus promoting third body wear and fretting corrosion, which in turn can increase the contact resistance. Dust particles can act as dielectric materials to induce signal interference in contaminated signal connectors and lines. Dust accumulation on heat sinks, power connectors, or active devices can cause overheating due to physical covering. One of the critical failure modes caused by dust contamination in PCBs is surface insulation resistance (SIR) degradation, or impedance degradation [13]. Impedance degradation can lead to intermittent or permanent failure of PCB assemblies. Hygroscopic dust contamination, followed by rapid increases in RH, has led to failures of telecommunication PCBs in the field. Ionic contamination in dust particles can further lead to electrical short circuiting of closely spaced features by ion migration.

Free air cooled data centers, particularly those located in highly populated municipal or industrial areas, can have harmful contamination arising from the ingress of outdoor particulates and/or gases into the free air cooling system.


However, some data centers may not experience hardware failures due to particulate or gaseous contamination if an air conditioning system with particulate filtration is well designed and put in place.

5.2 Failure Mechanisms Under Free Air Cooling

Due to the failure causes implicit in the implementation of free air cooling, some potential failure mechanisms become more active under the new operating environment. One of the critical failure mechanisms is corrosion, which can take different forms. All metal components in a data center can be affected by corrosion reactions in an uncontrolled temperature and humidity environment with gaseous and particulate contaminations. Printed circuit assemblies, connectors, and sockets could be problematic due to different forms of corrosion. Pore corrosion is a concern for noble-plated contacts, and fretting corrosion can occur between two solid surfaces in contact. High density PCBs with smaller feature size and spacing are vulnerable to ionic migration and creep corrosion. If water layers are formed on critical surfaces and interfaces, they can result in resistance degradation, leading to soft and/or hard equipment failures. In addition, ESD is prone to occur if the RH is too low. These potential failure mechanisms are introduced in the following subsections.

5.2.1 Electrostatic Discharge

If the humidity is too low, data centers can experience ESD, which is the sudden flow of electricity between two objects caused by contact, an electrical short, or dielectric breakdown. ESD can shut down an electronic component or a piece of equipment and possibly damage it. ESD is most often influenced by humidity, but it can also be affected by temperature, pressure, airborne particles, and air recirculation [16]. ESD tends to occur below 20 % RH, as shown by the high voltages attained at 20 % RH in Table 5.1. Static charging persists even at high RH. However, humid conditions can prevent electrostatic charge generation because the thin layer of moisture that accumulates on most surfaces dissipates electric charges [16].

5.2.2 Conductive Anodic Filament Formation

CAF occurs in substrates and PCBs when a copper conductive filament forms in the laminate dielectric material between two adjacent conductors or in plated-through vias under an electrical bias. CAF can be a potentially dangerous source of electrical failures in IC packaging substrates, PCBs, and overall systems (packages, modules).


The trend in the electronics industry to place as many components as possible on minimized PCB real estate has increased the reliability requirements for bare printed wiring boards (PWBs) and raises potential reliability concerns about CAF formation within multilayer structures.

Typical CAF behavior is shown in Fig. 5.1 [16]. A two-step process model was developed to explain filament formation at the resin/fiber interface. The first step is the degradation/delamination of the fiber/epoxy interface due to the coefficient of thermal expansion (CTE) mismatch between the glass fiber CTE (~5.5 ppm/°C) and the epoxy resin CTE (~65 ppm/°C). The second step is the electrochemical corrosion reaction, which involves ionic transport of the metal (copper). When these conductive filaments reach the cathode, CAF is formed and the insulation resistance between the cathode and anode drops. Eventually, an electrical short is created. CAF can take place in plated-through-hole to plated-through-hole (PTH–PTH), PTH-to-plane, and trace-to-trace geometries [16]. The reaction process of CAF occurs due to moisture absorption. Most laminate materials absorb moisture through surface absorption and diffusion into the interior, especially when exposed to high temperature and humidity environments, which accelerate absorption and can result in quicker degradation and path formation. The different moisture absorption rates of resin and glass fiber can also lead to interface stress. Resin swells due to the absorption of moisture, which can lead to debonding at the resin/glass fiber interface [7].
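The first step of this model is driven by the CTE mismatch quoted above (~5.5 ppm/°C for the glass fiber versus ~65 ppm/°C for the epoxy resin). A short calculation shows the thermal strain mismatch a temperature swing imposes on the interface; the 20 °C swing is an illustrative value, not a figure from the text.

```python
cte_glass_ppm = 5.5    # glass fiber CTE, ppm/°C (from the text)
cte_epoxy_ppm = 65.0   # epoxy resin CTE, ppm/°C (from the text)
delta_t_c = 20.0       # illustrative diurnal temperature swing

mismatch_strain = (cte_epoxy_ppm - cte_glass_ppm) * 1e-6 * delta_t_c
print(f"Interface strain mismatch ~ {mismatch_strain:.2e} per {delta_t_c:.0f} °C swing")
# ~1.2e-3 strain per swing; repeated swings fatigue the fiber/epoxy interface
# and help open the path along which the conductive filament later forms.
```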

Table 5.1 Electrostatic discharge and relative humidity [16]

Activity                             Static voltage at 20 % RH   Static voltage at 80 % RH
Walking across vinyl floor           12 kV                       250 V
Walking across synthetic carpet      35 kV                       1.5 kV
Arising from foam cushion            18 kV                       1.5 kV
Picking up polyethylene bag          20 kV                       600 V
Sliding styrene box on carpet        18 kV                       1.5 kV
Removing Mylar tape from PC board    12 kV                       1.5 kV

Fig. 5.1 Conductive anodic filament growth [16]


5.2.3 Electrochemical Migration

ECM is the loss of insulation resistance between two conductors due to the growth of conductive metal filaments on the substrate [17]. The occurrence of electrochemical migration has four prerequisites: a mobile metal, a voltage gradient, a continuous film, and soluble ions [18]. ECM occurs as a result of metallic ions being dissolved into a medium at the anode and plating out at the cathode in needle- or tree-like dendrites. Such migration may reduce isolation gaps and eventually lead to an electrical short that causes catastrophic failure [19]. During the ECM process, the anodically dissolved metal ions can migrate to the cathode, where they are deposited, obtaining electrons and reducing back to metal. This process is applicable to the commonly used metals in electronics, such as silver, copper, lead, and tin [20, 21].

A dendrite grows when an electrolyte bridges two electrodes to form a path. The metal ions are formed by anodic dissolution. Anodic corrosion involves the oxidation of metals to generate cations at the anode. Dissolved ionic contaminants, such as halides, can promote this process. The metal ions then migrate through the electrolyte under the influence of electromotive forces toward the cathode, where the metal finally deposits electrochemically. As more and more neutral metal deposits on the nuclei, dendrites or dendrite-like structures may grow toward the anode. When a dendrite fully spans the gap between adjacent conductors and touches the anode, a short may occur. The current flowing through a dendrite may burn out part of the dendrite due to Joule heating. This phenomenon can lead to intermittent failures, which can be recurrent if re-growth and fusing occur cyclically. If the dendrites are thick enough to withstand the current, a permanent short can result [22].

Under bias voltage, the metal at the anode goes into solution, migrates toward the cathode, and plates out at the cathode. The susceptibility of different metals to ECM is affected by the electrode potential of the metal ions formed by metal dissolution. The standard electrode potentials of the main metals used in electronics are listed in Table 5.2. The metals become more likely to corrode going from the noble metal gold down to nickel. Because gold has a high standard electrode potential, an electroless nickel immersion gold (ENIG) finish has a high resistance to ECM.

Table 5.2 Standard electrode potentials in an aqueous solution at 25 °C

Cathode (reduction) half-reaction        Standard electrode potential E° (volts)
Au3+ (aq) + 3e− → Au(s)                  +1.50
Ag+ (aq) + e− → Ag(s)                    +0.80
Cu2+ (aq) + 2e− → Cu(s)                  +0.34
Sn4+ (aq) + 2e− → Sn2+ (aq)              +0.15
Pb2+ (aq) + 2e− → Pb(s)                  −0.13
Sn2+ (aq) + 2e− → Sn(s)                  −0.14
Ni2+ (aq) + 2e− → Ni(s)                  −0.23


Silver also has a relatively high standard electrode potential but nevertheless has a tendency to form migratory shorts. Silver is more susceptible to migration than other metals because it is anodically very soluble and its hydroxides have good solubility. Moreover, silver is unlikely to form a passivating oxide layer. The relative positions of the materials in Table 5.2 change in other environments, such as in the presence of sea water.
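The nobility ordering discussed here follows directly from Table 5.2. The minimal sketch below simply sorts the tabulated potentials (in volts); a lower value indicates a metal that dissolves anodically more readily.

```python
# Standard electrode potentials from Table 5.2 (V); sorting reproduces the
# gold-to-nickel ordering discussed in the text.
E0 = {
    "Au": 1.50, "Ag": 0.80, "Cu": 0.34,
    "Sn4+/Sn2+": 0.15, "Pb": -0.13, "Sn": -0.14, "Ni": -0.23,
}

for metal, potential in sorted(E0.items(), key=lambda item: item[1], reverse=True):
    print(f"{metal:>9}  {potential:+.2f} V")
```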

Without air-conditioning, gaseous and particulate contamination in the atmosphere has a larger impact on the reliability of electronics in terms of surface insulation resistance degradation due to the formation of ECM. During the ECM process, the electrolyte path is formed by an adsorbed humid layer or by water condensation, which has conductive properties. The conductivity of pure water can be enhanced by the adsorption of gases that form ionic compounds with water.

Despite the abundance of industry standards on temperature/humidity/bias (THB) testing to assess the reliability risks associated with ECM, field performance sometimes differs substantially from testing results for several reasons. First, contamination is often either not considered or improperly accounted for in industry standards testing. For example, GR-1274-CORE provides a test method to simulate the electrical effects of fine mode particulate contamination by reducing SIR. However, the tested PCBs are coated with a surface film (e.g., of a single component hygroscopic solution) that will lower their SIR at high humidity. The compositions of hygroscopic solutions are too simple to represent the complexity of real dust, which usually includes both mineral particles and multiple hygroscopic salts. Furthermore, the test conditions are insufficient to accelerate the ECM process in uncontrolled working environments. Therefore, a test method is needed to evaluate the reliability performance of electronic products under rapidly changing environmental conditions in the presence of particulate contamination.

5.2.4 Corrosion

Another possible failure mechanism from free air cooling is corrosion, which can be accelerated by increased temperatures, uncontrolled humidity levels, and contamination. Various types of corrosion may occur under free air cooling, and this section introduces them in detail.

5.2.4.1 Creep Corrosion

Creep corrosion is a mass-transport process during which solid corrosion products migrate over a surface. The corrosion begins with the growth of dendrites that propagate equally in all directions, unlike potential-driven dendrite growth between the anode and cathode. Creep corrosion is driven by the concentration gradients of chemical species of the corrosion products, so that chemical species move from areas with a higher concentration to areas with a lower concentration [23].


This failure mechanism can result in the malfunction of electrical contacts, connectors, and PCBs. For components with noble metal plating over a base metal such as copper, creep corrosion is a reliability risk for long-term field applications.

The creep corrosion process initiates from exposure of the underlying base metallic material. It can be caused by the absence of plating, poor plating, mishandling of components, plating cracks, and bare edges resulting from the trim and form process or after mounting the component onto a PCB. Depending on the environmental conditions, corrosion products may be continuously generated from the exposed copper sites and diffuse over the lead surface. Corrosion products dissolve in water and can creep long distances from their place of origin. When the water dries, they stay in the new location, from which they can start creeping again when a new water solution forms. The rate of creep corrosion is therefore affected by the wet and dry cycle.

The surface diffusion process depends upon the chemical species of the corrosion products and the properties of the underlying surface. A copper lead frame is chemically active and will oxidize if exposed to air, but this oxide species is not mobile. On the other hand, copper sulfide and chloride have higher surface mobility than copper oxides [24, 25], and can accelerate and regenerate the copper corrosion products. A surface diffusion coefficient is used to quantify the mobility of corrosion products over a surface under given environmental conditions. A high surface diffusion coefficient represents a material with low resistance to creep corrosion. Both palladium and gold have high surface diffusion coefficients [26]. However, while the mechanisms of creep corrosion over palladium and gold surfaces are similar, palladium surfaces tend to have a higher surface resistance to creep corrosion than gold surfaces, because palladium develops a few atomic layers of surface oxide when exposed to an ambient environment. In electronic devices, nickel is used as an intermediate layer under gold, and silver reduces the creep of corrosion products of copper. In highly corrosive conditions, in the presence of SO2 and chlorine, nickel corrodes, and its corrosion products creep more than those of copper. The creep of corrosion products can be demonstrated with humidity and heat tests, clay tests, and mixed flowing gas (MFG) tests. A field failure due to creep corrosion products over the plastic package is shown in Fig. 5.2 [23].

5.2.4.2 Pore Corrosion

Pore corrosion is a form of bimetallic corrosion created at the microscopic pores, holes, and faults in noble metal plating. The less noble base metal corrodes and pushes the corrosion products toward the more noble plating, thus diffusing them on the surface.

Pore corrosion is a concern for noble metal plated contacts. The porosity of noble metal plating exposes the base metal to the environment. The exposed base metal atoms may react with oxygen and gaseous contaminants, such as H2S and SO2.


The corrosion products migrate out of the pores and spread over the noble metal plating. A schematic of the pore corrosion process is illustrated in Fig. 5.3.

The high resistivity of corrosion products increases the contact resistance of, for example, gold-plated contacts. Pore corrosion can also be seen on other noble metals, such as silver and palladium. The risk of pore corrosion for platings less than 1 μm thick is high, but it can be minimized by using a barrier layer of nickel under the gold.

The MFG corrosion test, IEC 60068-2-60 Test Method 1 (H2S + SO2), is applicable to the pore corrosion testing of gold and palladium plating. A corresponding method can be found in ASTM B799-95, Standard Test Method for Porosity in Gold and Palladium Coatings by Sulfurous Acid/Sulfur-Dioxide Vapor. The purpose of the MFG test is to simulate the field-use corrosive environment for electronics due to gaseous pollutants in the atmosphere.

Pore corrosion is sometimes very dangerous since it is difficult to identify from the surface [27]. There may be small cracks or pores in the plating that are only visible in a cross-sectional view, while the plating appears mostly intact; underneath, the base metal is severely corroded. If the pores or cracks are too small to be seen with the naked eye, there is no way of knowing that this severe corrosion problem exists until the contact suddenly fails.

Fig. 5.2 Scanning electron microscope (SEM) image of the growth edge of creep corrosion products over a plastic package [23]

Fig. 5.3 Schematic of the pore corrosion process: corrosion products from the base metal (anode) emerge through pores in the noble metal plating (cathode) and spread over its surface


5.2.4.3 Pitting Corrosion

Pitting corrosion is a form of localized corrosion that creates point- or pit-like indents on the surface of the metal. It does not spread laterally across an exposed surface rapidly, but penetrates into the metal very quickly, usually at a 90° angle to the surface. Solutions containing chloride or chlorine-containing ions (such as sodium chloride in sea water) have strong pitting tendencies. Pitting corrosion can be initiated and accelerated by an increase of corrosive gaseous and particulate contaminants in the environment. It can also be exacerbated by elevated temperature. Pitting corrosion is typical of metals whose corrosion resistance is based on a passive protective layer on the surface. Aluminum and stainless steel are among such metals. Also, noble coatings on a base material, for instance, nickel coating on steel, can create conditions for pitting corrosion if the plating is damaged [9].

Pitting can be separated into two phases: pit initiation and pit growth. Pit initiation is believed to be caused by the breakdown of the passive film on the surface. Initiation can occur at any time, from days to years after exposure, before a rapid growth phase begins. For example, the pitting corrosion of aluminum alloys tends to slow and stop gradually, but if water containing chloride and oxygen is present on the surface, the corrosion may proceed quickly. Some example images of pitting corrosion can be found in [28].

5.2.4.4 Fretting Corrosion

Fretting corrosion is a form of fretting wear in which oxidation plays a role [29]. The micro-motion of fretting can result from mechanical vibration or differential movement caused by temperature fluctuations in the environment due to the different coefficients of thermal expansion (CTEs) of dissimilar materials.

Tin plating is especially sensitive to fretting corrosion because tin is soft and oxidizes easily, rapidly forming a thin oxide that is hard and brittle and penetrates the soft underlying tin during fretting. The sliding movements between the contact surfaces break the tin oxide film on the surface and expose fresh tin to oxidation and corrosion. The accumulation of oxides at the contacting interface due to repetitive sliding movements causes an increase in contact resistance. Figure 5.4 shows a schematic representation of fretting corrosion. Fretting corrosion can lead to intermittent electrical discontinuities with tin-plated contacts [30].

The change in the contact resistance of a connector caused by fretting corrosion due to temperature changes can be estimated using the formula presented for tin plating in [31]. According to the formula, the change in resistance follows a power law in both the temperature fluctuation range and the number of temperature cycles:

∆R = k(∆T)^2.28 C^2    (5.3)


where ΔR is the change in resistance (mΩ), k is an experimentally determined constant of 2.36 × 10^−10 mΩ/K^2.28, ΔT is the temperature fluctuation range (K), and C is the number of temperature fluctuations.
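A quick numerical evaluation of Eq. (5.3) is sketched below; the 30 K temperature swing and 1,000 cycles are illustrative assumptions, not values taken from [31].

```python
# Evaluating Eq. (5.3) with the experimentally determined constant quoted above;
# the temperature swing and cycle count are illustrative assumptions.
k = 2.36e-10       # mOhm / K^2.28
delta_t = 30.0     # K, assumed temperature fluctuation range
cycles = 1000      # assumed number of temperature fluctuations

delta_r = k * delta_t**2.28 * cycles**2
print(f"estimated contact resistance change: {delta_r:.2f} mOhm")   # ~0.55 mOhm
```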

5.3 Testing for Free Air Cooling

This section introduces some accelerated tests that can be used to evaluate the reliability of electronic equipment in a data center using the free air cooling method with uncontrolled environmental conditions. These tests can be conducted by the manufacturers before the equipment is delivered to data centers.

5.3.1 Mixed Flowing Gas (MFG) Test

The MFG test is an accelerated environmental test used to assess degradation in electronic products in which failure is caused by corrosive gases [32]. The MFG test is conducted in a test chamber where the temperature, RH, and concentration of selected gases are carefully controlled and monitored [33]. At least three corrosive gases, H2S, NO2, and Cl2, at various concentration levels are used in an MFG test. Four-gas tests, which include SO2, are the most common [34].

Fig. 5.4 Schematic of fretting corrosion [30]


MFG tests are normally conducted at RH levels in the range of 70–80 % and temperatures in the range of 25–50 °C. MFG studies have been conducted on creep corrosion of components with precious metal pre-plated lead frames [23, 35–37] and of electrical contacts [38, 39].

Test classifications have been defined to simulate corrosion mechanisms and accelerate corrosion processes for electronics in various operational environments. There are many MFG standard tests to choose from, but there is no consensus on which is best. MFG standard tests include those defined by Battelle Labs [19], IBM [38], the International Electro-technical Commission (IEC) [40], the Electronic Industries Alliance (EIA) [41], and Telcordia [42]. Test conditions for the Battelle Class II, III, and IV and the Telcordia Indoor and Outdoor tests are summarized in Table 5.3.

MFG testing can reproduce corrosion and creep corrosion over components with noble metal pre-plated lead frames. A diagram of an MFG chamber is shown in Fig. 5.5 [23]. In the Telcordia Outdoor and Battelle Class III MFG tests, the phenomenon of creep corrosion over the mold compound surface of packages with noble metal pre-plated lead frames was produced within a 10-day exposure period [36].

Table 5.3 Test conditions of mixed flowing gas tests

Condition            Temp (°C)   RH (%)    H2S (ppb)    Cl2 (ppb)   NO2 (ppb)   SO2 (ppb)
Battelle Class II    30 ± 2      70 ± 2    10 +0/−4     10 +0/−2    200 ± 25    –
Battelle Class III   30 ± 2      75 ± 2    100 ± 10     20 ± 5      200 ± 25    –
Battelle Class IV    50 ± 2      75 ± 2    200 ± 10     50 ± 5      200 ± 25    –
Telcordia Indoor     30 ± 1      70 ± 2    10 ± 1.5     10 ± 1.5    200 ± 30    100 ± 15
Telcordia Outdoor    30 ± 1      70 ± 2    100 ± 15     20 ± 3      200 ± 30    200 ± 30

Fig. 5.5 Schematic diagram of MFG testing system [23]


The appearance of creep corrosion under optical microscopy and scanning electron microscopy (SEM) was similar to that seen in field failures, which suggests that the MFG test can be used as an acceptance or qualification approach to assess the creep corrosion risk for components with noble metal pre-plated lead frames.
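For test planning, the conditions in Table 5.3 can be captured as data; the structure below is one possible encoding (nominal values only, temperatures in °C, RH in %, gas concentrations in ppb), not part of any MFG standard.

```python
# Nominal Table 5.3 conditions (tolerances omitted); None means the gas is not
# specified for that test class.
MFG_CLASSES = {
    "Battelle Class II":  {"temp_c": 30, "rh_pct": 70, "H2S": 10,  "Cl2": 10, "NO2": 200, "SO2": None},
    "Battelle Class III": {"temp_c": 30, "rh_pct": 75, "H2S": 100, "Cl2": 20, "NO2": 200, "SO2": None},
    "Battelle Class IV":  {"temp_c": 50, "rh_pct": 75, "H2S": 200, "Cl2": 50, "NO2": 200, "SO2": None},
    "Telcordia Indoor":   {"temp_c": 30, "rh_pct": 70, "H2S": 10,  "Cl2": 10, "NO2": 200, "SO2": 100},
    "Telcordia Outdoor":  {"temp_c": 30, "rh_pct": 70, "H2S": 100, "Cl2": 20, "NO2": 200, "SO2": 200},
}

# Example: list the four-gas classes (those that specify SO2).
four_gas = [name for name, cond in MFG_CLASSES.items() if cond["SO2"] is not None]
print(four_gas)   # ['Telcordia Indoor', 'Telcordia Outdoor']
```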

5.3.2 Dust Exposure Tests

Dust exposure testing is still being formalized and standardized to address the concern of particulate contamination. There is no standard dust chamber available to researchers and electronic product manufacturers. Testing for the effects of dust is conducted by seeding a surface with a measured quantity of dust or by exposing a board to an atmosphere containing a known airborne concentration of particles. Parameters such as leakage current, SIR, moisture uptake/desorption as a function of time and RH, formation of corrosion products, or dielectric breakdown can be measured through the testing.

Researchers have developed different dust exposure test methods and dust chambers to suit their research needs, but each test uses different dust particles. Some tests use artificial dust of known composition, such as hygroscopic salts. Other studies use real dust collected from indoor or outdoor environments. There are also standard dusts (such as Arizona road dust) that can be purchased. However, it has been suggested that the composition of artificial dust particles is too simple to represent the complexity of real dust [43].

Sandroff et al. [44] conducted a hygroscopic dust exposure test of PCBs and found that failures due to hygroscopic dust were related to an SIR below the 10⁶ Ω range. The effects of different salts on PCB insulation resistance were also studied at varied relative humidities. Some of these salts, such as ammonium hydrogen sulfate or ammonium sulfate, can be found in high concentrations in airborne hygroscopic dust. Sodium sulfide provides the highest sensitivity in resistance variation over the humidity range of 30–100 %. Although sodium sulfide is not typically found in airborne hygroscopic dust, it offers a controllable technique to simulate the loss of SIR. Therefore, a 1/10 M sodium sulfide solution was recommended to qualify circuit boards. It was deposited by spin coating at 600 revolutions per minute (rpm). When small surfaces need to be coated, the salt may be deposited by spin coating; however, this technique might be difficult to use for large circuit boards. The salt solution can also be sprayed. Using ultrasonic atomization, a mist of fine droplets of controlled particulate size [25, 26] can be deposited onto a circuit board placed in front of a spray-shaping chamber. The calibration of this technique measures the mass of salt deposited on the surface. A sketch of the hygroscopic salt mist deposition system is depicted in [44].

DeNure et al. [14] conducted a dust test to qualify multicontact circuit board connectors. Hygroscopic salts were used to simulate some of the most severe conditions found in service environments. The salt composition was similar to that of natural dusts, except that, for safety reasons, nitrates were not used.


Hard mineral particles were included to provide a substance with the mechanical strength to hold the contacts apart if dust got into the interface. The mineral particles used were Arizona road dust. A dust chamber was designed and sketched in [14]. The results showed that the connectors can tolerate a large amount of dust with no significant degradation of electrical resistance stability.

Lin et al. [15] conducted dust corrosion tests. The dust samples were collected from three locations—an office area in Beijing, a storehouse in Shanghai, and a workshop in Shanghai. Testing sheets for the experiments were made of phosphor bronze (alloy CA-511: 95.6Cu-4.2Sn-0.2P) coated with nickel and gold on top. The collected dust particles were dissolved in distilled water. The solution was dispersed by an ultrasonic cleaner, heated and naturally cooled down, and filtered using filter paper. The dust solution was dripped onto the test sheet with a burette after each previous droplet of solution had evaporated. (It took about 2 h for one droplet of solution to evaporate at room temperature and 35 % RH.) The procedure continued until corrosion products were observed on the surface. Corrosion products formed on the test sheet at different rates, depending on the compositions of the dust samples collected from the different locations. This experiment demonstrated that water-soluble salts are contained in dust. The resulting solution forms electrolytes and corrodes metals. The corrosion behavior of dust particles was also evaluated in [15] by spreading dust on test sheets and then conducting seven cycles with the temperature varying from 20 to 60 °C and the RH from 40 to 90 %. Each cycle was 17 h. Natural dust particles were spread on testing sheets (Au/Ni/Cu) at an average density of about 3,200 particles/cm² by means of a custom-made dust chamber. A simplified diagram of the dust chamber is shown in [15]. The dust particles were fed into the dust filler. Electrical fans connected in series blew the particles through an airflow pipe into a dust mixing room and finally into the dust deposition room for about 3 min. The fans were then stopped to allow dust particles to fall freely in the dust deposition room for about 30 min. Testing sheets were placed horizontally to accept dust particles. The test confirmed that dust does cause corrosion under uncontrolled temperature and RH levels.
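The cyclic exposure used in [15] can also be written down as a simple chamber profile for reproduction; the sketch below records only the parameters stated above, and the ordering of the temperature and humidity ramps within a cycle is left unspecified.

```python
# Cyclic dust-corrosion exposure from [15]: seven 17-h cycles spanning
# 20-60 degC and 40-90 % RH; within-cycle ramp details are not specified.
CYCLE = {"temp_range_c": (20, 60), "rh_range_pct": (40, 90), "duration_h": 17}
NUM_CYCLES = 7

profile = [{"cycle": n + 1, **CYCLE} for n in range(NUM_CYCLES)]
total_hours = sum(step["duration_h"] for step in profile)
print(f"{len(profile)} cycles, {total_hours} h total")   # 7 cycles, 119 h total
```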

5.3.3 Clay Test

The clay test developed by Schueller et al. [45] uses the high-sulfur clay used in modeling studios to drive creep corrosion. The modeling clay was made by Chavant (type J-525). The goal of the test was to simulate an actual use environment. The clay was heated to working temperature with large heaters and was then wetted with water to smooth the surface. Testing was performed by placing the clay into a plastic container with a clamp-down lid. A small amount of water (1–2 ml) was used to wet the clay. The container with the clay was placed in a microwave oven and heated until the clay started to become soft and workable (≈50 °C). Printed wiring assembly (PWA) samples were placed in a vertical position within the container, and the lid was replaced.


To achieve aggressive creep corrosion, 2–4 pounds of clay were used, and the PWA was cooled in a refrigerator for 5 min prior to placing it in the container (to enhance condensation). The PWA remained in the container at room temperature for 11–13 h, after which the process was repeated (two cycles per day). Creep corrosion on ImAg PWAs was visible after 2 days and became pronounced after 5 days. Creep corrosion was also found on PWAs with an organic solderability preservative (OSP) surface finish, but not on preassembled bare PCBs with OSP coating. Therefore, the degree of creep on OSP-coated PWAs is dependent on the amount of OSP remaining on the pads after assembly, as well as on the concentration of corrosive gases in the environment [45]. Lead-free HASL finish did not experience creep corrosion in this test, but exposed copper on the lead-free HASL boards could lead to creep corrosion.

The sulfur concentration was then reduced by using only 30 g of clay. This test still produced creep corrosion on ImAg PWAs, but took about twice as long to do so. Further reductions in the severity of the test conditions can be achieved by reducing the moisture in the container and reducing the number of heat/condensation cycles.

Zhou et al. [46] described a similar test method. A drawing of the test container is shown in Fig. 5.6. The RH in the container was close to 100 % and the temperature was 29.8 °C, as detected by a hygrometer with a probe. All of the samples were held by clamps, and none directly contacted the clay. The corrosion experiment lasted for 3 days uninterrupted at room temperature. After 3 days, the RH in the container was 75.2 %, and the temperature was 26.2 °C. After the test, corrosion products were observed using an environmental scanning electron microscope (ESEM), and their compositions were determined using an energy dispersive spectrometer (EDS). A testing circuit was built to investigate the impact of creep corrosion products on the degradation of SIR in PCBs.

Fig. 5.6 Corrosion test fixture [46]



5.3.4 Temperature/Humidity/Bias (THB) Testing

THB is the most common test method for assessing the potential for loss-of-SIR failure due to exposure to noncondensing ambient moisture. The common testing conditions are 85 °C and 85 % RH. However, lower temperatures are considered more appropriate, because higher temperatures can actually reduce the propensity for corrosion or introduce a shift in failure mechanisms. At temperatures above 50 °C, residues from no-clean soldering tend to break down, creating a more benign condition. In the meantime, elevated temperature may induce plating dissolution, resulting in changes in the migration mechanism. The RH level is the other condition in THB testing. Since constant-humidity testing is designed to assess reliability in noncondensing environments, there is often a tradeoff between maximizing humidity to induce potential failures and avoiding condensation. This is the primary driver toward setting 93 % RH as the industry standard. RH is often only controllable to ±2 %, and 95 % RH is often considered the maximum controllable RH before condensation within the chamber becomes highly likely. The time-to-failure can change by orders of magnitude with relatively minor changes in RH.

The different industrial specifications tend to use different durations of exposure, with SIR tests extending approximately 4–7 days and electrochemical migration tests extending approximately 500 h (21 days). For products with no conformal coating, a 40 °C/93 % RH exposure of 3–5 days is recommended. When a conformal coating is present, an additional 2–3 days is required to allow for diffusion through the coating.

5.3.5 Salt Spray Testing

The salt spray test is a standardized method used in the industrial sector to check the corrosion resistance of coated surfaces or parts. Because a coating is expected to provide corrosion resistance throughout the intended life of the part, its corrosion resistance must be checked by accelerated means. The salt spray test is an accelerated corrosion test that produces a corrosive attack on the coated samples in order to predict the suitability of the coating as a protective finish. The appearance of corrosion products is evaluated after a period of time. The test duration depends on the corrosion resistance of the coating. Salt spray testing is popular because it is cheap, quick, well-standardized, and repeatable. There is, however, only a weak correlation between the duration of the salt spray test and the expected life of a coating, since corrosion is a very complicated process that can be influenced by many external factors.

In the salt spray test, various standard salt and water solutions are sprayed on the device. The device is kept in humidity for days or weeks in between sprays. Chamber construction, testing procedures, and testing parameters are standardized under national and international standards, such as ASTM B117 and ISO 9227. These standards provide the necessary information to carry out this test.


Testing parameters include temperature, air pressure of the sprayed solution, preparation of the spraying solution, and concentration. ASTM B117-03, Standard Practice for Operating Salt Spray (Fog) Apparatus, describes the apparatus, procedure, and conditions required to create and maintain the salt spray (fog) test environment. It does not prescribe the type of test specimen, the exposure periods to be used for a specific product, or how to interpret the results. ISO 9227, "Corrosion Tests in Artificial Atmospheres—Salt Spray Tests," specifies the apparatus, reagents, and procedure to be used in the neutral salt spray (NSS), acetic acid salt spray (AASS), and copper-accelerated acetic acid salt spray (CASS) tests for assessment of corrosion resistance.

5.3.6 Cyclic Temperature/Humidity Testing

The cyclic temperature/humidity test is also known as the dew point test. The purpose of this test is to assess the ability of a product to operate reliably under condensing conditions (dew point). Figure 5.7 shows the profile of cyclic humidity testing per MIL-STD-810 (95 ± 4 % RH, five cycles) [47].

Fig. 5.7 Profile of cyclic humidity testing per MIL-STD-810 [47]

There are two factors that could affect the cyclic humidity test results. The first is power dissipation. Condensation occurs when the temperature of a product is less than the dew point temperature within the chamber. If the unit is powered, there is the possibility of a temperature rise, depending upon the total power being dissipated. If this temperature rise is high enough, and typically it only needs to be 5 °C higher than ambient, the likelihood of condensation on the board drops. An even higher rise in temperature can induce sufficient heating to inhibit condensation on cabling or on the interior walls of the housing, and therefore prevent subsequent dripping. The second factor is conformal coating, which is designed to be a physical barrier on the surface in order to limit the possibility of failure during condensation. However, conformal coating does not provide definitive protection against condensation. Condensation caused during rapid temperature cycling can induce electrochemical migration over the conformal coating. This failure mechanism cannot be traced to cleanliness, as ion chromatography analysis of the assembly often finds nominal levels of contaminants. For migration to occur under conformal coating, additional time is required to allow for the diffusion of moisture through the conformal coating to the coating/board interface.
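The condensation condition described above can be checked with a standard dew point approximation; the sketch below uses the Magnus formula (not part of MIL-STD-810), and the chamber values and temperature rises are illustrative assumptions.

```python
import math

def dew_point_c(air_temp_c: float, rh_percent: float) -> float:
    """Magnus approximation of the dew point in degC (reasonable for ~0-60 degC)."""
    a, b = 17.62, 243.12  # Magnus coefficients for water vapour over liquid water
    gamma = (a * air_temp_c) / (b + air_temp_c) + math.log(rh_percent / 100.0)
    return (b * gamma) / (a - gamma)

def condenses(surface_temp_c: float, air_temp_c: float, rh_percent: float) -> bool:
    """Condensation is expected when a surface is colder than the air's dew point."""
    return surface_temp_c < dew_point_c(air_temp_c, rh_percent)

# Chamber air ramped to 40 degC at 95 % RH (dew point ~39 degC). A board whose
# thermal lag keeps it at 36 degC collects condensate; the same board running
# about 5 degC warmer from its own dissipation stays above the dew point.
print(condenses(36.0, 40.0, 95.0))        # True  -> condensation likely
print(condenses(36.0 + 5.0, 40.0, 95.0))  # False -> self-heating inhibits it
```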

5.3.7 Water Spray Testing

The water spray test is derived from military specification MIL-STD-810, Section 506.4 [48]. The purpose of this test is to assess the ability of a product to operate reliably either during or after exposure to condensing conditions.

The water spray test consists of three rain-related test procedures with exposure times ranging from 15 to 40 min. Procedure I tests the equipment against rain and blowing rain. It is applicable for materials that will be deployed outdoors unprotected from rain or blowing rain. Procedure II is not intended to simulate natural rainfall, but provides a high degree of confidence in the watertightness of materials. Procedure III is appropriate when a material is normally protected from rain but may be exposed to falling water from condensation or leakage from upper surfaces. Procedure III could be applicable to the equipment in a free air cooling data center where there is concern about water condensation.

Both the water spray procedure III and cyclic humidity tests can be used to assess the robustness of a product in the presence of water condensation.

5.4 Summary

This chapter introduced the possible risks for equipment in data centers when free air cooling is implemented. The changes in temperature, humidity, and contamination levels as a result of free air cooling may make some failure mechanisms more active when compared to those under traditional A/C conditions, thus reducing the reliability of the equipment. The effects may become significant when the data centers are located in areas where the airborne contamination level is high, such as in heavily industrialized metropolitan areas. The associated risks must be carefully analyzed before the use of free air cooling is prescribed for a data center.

The various relevant failure mechanisms and test methods were reviewed in this chapter. The most critical unknown factor that remains for the assessment of reliability is the actual conditions under free air cooling.


The use of free air cooling is relatively new, and there is not enough publicly available data to determine the actual environmental envelope under free air cooling. In addition, there is a large variation among free air cooled data centers in terms of the operating environment, which will depend on the location, the specific architecture of the implementation of free air cooling, and the inclusion of other power management methods in conjunction with free air cooling.

References

1. D. Atwood, J.G. Miner, Reducing data center cost with an air economizer. IT@Intel Brief, Intel Information Technology, 2008
2. X.B. Yu, S. Jin, Research of temperature-raising in base stations room. Telecom Eng. Technics Stand. 12 (2008). ISSN 1008-5599, CN 11-4017/TN (in Chinese)
3. K.B. Pressman, T. Morey, Switching Power Supply Design (McGraw-Hill, New York, 2009)
4. J.B. Haran, D. David, N. Refaeli, B. Fischer, K. Voss, G. Du, M. Heiss, Mapping of single event burnout in power MOSFETs. IEEE Trans. Nucl. Sci. 54, 2488–2494 (2007)
5. P. McCluskey, R. Grzybowski, T. Podlesak, High Temperature Electronics (CRC Press, Boca Raton, 1997)
6. M.P. Garcia, M.R. Cosley, Ambient air cooling of electronics in an outdoor environment, in Proceedings of the 26th Annual International Telecommunications Energy Conference (2004), pp. 437–441
7. W.K.C. Yung, Conductive anodic filament: mechanisms and affecting factors. HKPCA J. 21, 1–6 (2006)
8. V. Kucera, E. Mattson, Atmospheric corrosion, in Corrosion Mechanisms, ed. by F. Mansfeld (Marcel Dekker, New York, 1987)
9. R. Hienonen, R. Lahtinen, Corrosion and climatic effects in electronics (VTT Publications 626, 2007), http://www.vtt.fi/publications/index.jsp
10. L. Volpe, P.J. Peterson, Atmospheric sulfidation of silver in tubular corrosion reactor. Corros. Sci. 29(10), 1179–1196 (1989)
11. L. Johansson, Laboratory study of the influence of NO2 and combination of NO2 and SO2 on the atmospheric corrosion of different metals. Electrochem. Soc. Ext. Abs. 85(2), 221–222 (1985)
12. J.W. Wan, J.C. Gao, X.Y. Lin, J.G. Zhang, Water-soluble salts in dust and their effects on electric contact surfaces, in Proceedings of the International Conference on Electrical Contacts, Electromechanical Components and Their Applications (1999), pp. 37–42
13. B. Song, M.H. Azarian, M.G. Pecht, Effect of temperature and relative humidity on the impedance degradation of dust-contaminated electronics. J. Electrochem. Soc. 160(3), C97–C105 (2013)
14. D.G. DeNure, E.S. Sproles, Dust test results on multicontact circuit board connectors. IEEE Trans. Compon. Hybrids Manuf. Technol. 14(4), 802–808 (1991)
15. X.Y. Lin, J.G. Zhang, Dust corrosion, in Proceedings of the 50th IEEE Holm Conference on Electrical Contacts (2004)
16. B. Sood, M. Pecht, Conductive filament formation in printed circuit boards—effects of reflow conditions and flame retardants, in 35th International Symposium for Testing and Failure Analysis (ISTFA 2009), San Jose, 15–19 November 2009
17. J.Y. Jung, S.B. Lee, H.Y. Lee, Y.C. Joo, Y.B. Park, Electrochemical migration characteristics of eutectic Sn-Pb solder alloy in NaCl and Na2SO4 solutions. J. Electron. Mater. 38(5), 691–699 (2009)
18. D.Q. Yu, W. Jillek, E. Schmitt, Electrochemical migration of Sn-Pb and lead free solder alloys under distilled water. J. Mater. Sci.: Mater. Electron. 17(3), 219–227 (2006)
19. W.H. Abbott, The development and performance characteristics of flowing mixed gas test environments. IEEE Trans. Compon. Hybrids Manuf. Technol. 11(1), 22–35 (1988)
20. A. DerMarderosian, The electrochemical migration of metals, in Proceedings of the International Society of Hybrid Microelectronics (1978), p. 134
21. G. Ripka, G. Harshyi, Electrochemical migration in thick-film ICs. Electrocomponent Sci. Technol. 11, 281 (1985)
22. M.V. Coleman, A.E. Winster, Silver migration in thick-film conductors and chip attachment resins. Microelectron. J. 4, 23 (1981)
23. P. Zhao, M. Pecht, Field failure due to creep corrosion on components with palladium pre-plated leadframes. Microelectron. Reliab. 43, 775–783 (2003)
24. C.J. Weschler, S.P. Kelty, I.E. Lingovsky, The effect of building fan operation on indoor–outdoor dust relationships. J. Air Pollut. Control Assoc. 33, 624–629 (1983)
25. M. Tencer, Deposition of aerosol ("Hygroscopic Dust") on electronics—mechanism and risk. Microelectron. Reliab. 48(4), 584–593 (2008)
26. J.D. Sinclair, Corrosion of electronics. J. Electrochem. Soc. 135, 89C–95C (1988)
27. M. Gedeon, Creep corrosion and pore corrosion. Tech. Tidbits, Brush Wellman 4(5) (2002)
28. Kingston Technical Software, Pitting Corrosion. http://corrosion-doctors.org/Forms-pitting/Pitting.htm. Accessed Mar 2012
29. ASTM B542-99, Standard Terminology Relating to Electrical Contacts and Their Use (ASTM International, 1999)
30. H.Y. Qi, S. Ganesan, M. Pecht, No-fault-found and intermittent failures in electronic products. Microelectron. Reliab. 48(5), 663–674 (2008)
31. R.D. Malucci, Impact of fretting parameters on contact degradation, in Proceedings of the 42nd IEEE Holm Conference joint with the 18th International Conference on Electrical Contacts (1996), pp. 16–20
32. American Society for Testing and Materials, ASTM B827-97: Standard Practice for Conducting Mixed Flowing Gas (MFG) Environmental Tests, 1997
33. American Society for Testing and Materials, ASTM B845-97 (Reapproved 2003): Standard Guide for Mixed Flowing Gas (MFG) Tests for Electrical Contacts, 2003
34. P.G. Slade, Electrical Contacts: Principles and Applications (Marcel Dekker, New York, 1999)
35. R.J. Geckle, R.S. Mroczkowski, Corrosion of precious metal plated copper alloys due to mixed flowing gas exposure. IEEE Trans. Compon. Hybrids Manuf. Technol. 14(1), 162–169 (1991)
36. P. Zhao, M. Pecht, Mixed flowing gas studies of creep corrosion on plastic encapsulated microcircuit packages with noble metal pre-plated lead frames. IEEE Trans. Device Mater. Reliab. 5(2), 268–276 (2005)
37. P. Zhao, M. Pecht, Assessment of Ni/Pd/Au–Pd and Ni/Pd/Au–Ag pre-plated lead frame packages subject to electrochemical migration and mixed flowing gas tests. IEEE Trans. Compon. Packag. Technol. 29(4), 818–826 (2006)
38. P.W. Lees, Qualification testing of automotive terminals for high reliability applications, in Proceedings of the 43rd Electronic Components and Technology Conference (1993), pp. 80–87
39. R. Martens, M. Pecht, An investigation of the electrical contact resistance of corroded pore sites on gold plated surfaces. IEEE Trans. Adv. Packag. 23(3), 561–567 (2000)
40. International Electro-technical Commission, IEC Standard 68-2-60, 2nd edn., Environmental Testing Part 2: Flowing Mixed Gas Corrosion Test (1995)
41. Electronic Industries Alliance, EIA Standard TP-65A: Mixed Flowing Gas (1998)
42. Telcordia Information Management Services, Generic Requirements for Separable Electrical Connectors Used in Telecommunication Hardware (Bellcore TR-NWT-001217, Issue 1, 1992)
43. P.E. Tegehall, Impact of humidity and contamination on surface insulation resistance and electrochemical migration. IVF Industrial Research and Development Corporation, http://www.europeanleadfree.net/
44. F.S. Sandroff, W.H. Burnett, Reliability qualification test for circuit boards exposed to airborne hygroscopic dust, in Proceedings of the 42nd Electronic Components and Technology Conference (1992), pp. 384–389
45. R. Schueller, Creep corrosion on lead-free printed circuit boards in high sulfur environments, in Proceedings of the SMTA International Conference (2007), pp. 643–654
46. Y. Zhou, M. Pecht, Investigation on mechanism of creep corrosion of immersion silver finished printed circuit board by clay tests, in Proceedings of the 55th IEEE Holm Conference on Electrical Contacts (2009), pp. 324–333
47. MIL-STD-810F, Environmental Engineering Considerations and Laboratory Tests, Method 507.4 (2000)
48. MIL-STD-810F, Environmental Engineering Considerations and Laboratory Tests, Method 506.4 (2000)

Chapter 6
Part Risk Assessment and Mitigation

Some of the efficient cooling methods, such as free air cooling, extend the operating environment of telecom equipment, which may impact the performance of electronic parts. Parts located at hotspots may not function as required or may have unacceptable parameter variations resulting in inadequate performance. This chapter introduces the background information and methods necessary to identify the parts at risk.

6.1 Part Datasheet

Electronic part manufacturers convey information about their products through part datasheets. It is necessary for all product developers to evaluate and assess this information source and to identify when there is a possibility that parts may be used beyond their normally expected operating conditions.

6.1.1 Datasheet Contents

The history of part datasheets can be traced back to the U.S. Department of Defense (DoD) standards and specifications for electronic part design, manufacture, test, acceptance, and use. The purpose of these documents was to help select appropriate parts with respect to part quality, environment, inter-operability, and documentation [1]. The U.S. military has templates called Standard Microcircuit Drawings (SMDs) that list the contents of a datasheet for acceptance as a military part.

Part manufacturers provide a great deal of part information in datasheets, from conceptual design through production, but not all of that information is published. The final datasheet is often a snapshot of the part information that a manufacturer chooses to divulge.1 From the manufacturer's point of view, the datasheet serves as marketing literature, a technical fact sheet, and a business document, and may contain disclaimers and limitations on the use of the part.

IEC publication 747-1 [2] lists the information to be included in a part datasheet. It also mentions that it is not compulsory to include data for all the items in the list. The standard published data, per that publication, include: part type and category; information on outlines, terminal identification and connections, case material, and lead finishes; electrical, thermal, and mechanical ratings; electrical and thermal characteristics; mechanical data; environmental and/or reliability data; and graphical representation of characteristics.

The information in the part datasheet may be complemented by supplementary documents, such as application notes and design guidelines. Table 6.1 lists information commonly available in part datasheets and associated documents. It is important for a company to obtain this information for the parts it will use in the product.

6.1.2 Understanding the Part Number

The part number usually provides information on the technology type, functionality, package type, and temperature range for a part. Examples of part numbers from several manufacturers are shown in Figs. 6.1 and 6.2. The examples show that product category (e.g., SmartMotor), device ratings (e.g., a capacitance of 4,000,000 pF), and packaging information (e.g., gull-wing lead) can be obtained from the part numbers. Sometimes the recommended operating conditions can be obtained from the part numbers. For example, the "100" in 405K100CS4G in Fig. 6.2 means that the recommended voltage condition is 100 DC volts.

1 Not all datasheets are public. A part may be built for a specific application and the datasheet for this part may be a proprietary internal document.

Table 6.1 Information commonly available in part datasheets and associated documents

Information in part datasheets       Information in associated documents
Part status                          Definitions of terminology used
Part functionality                   Thermal characteristics
Ratings                              Programming guides
Electrical specifications            Design tips
Packaging information                Handling and assembly guidelines


6.1.3 Ratings of an Electronic Part

Part datasheets provide two types of ratings: absolute maximum ratings and recommended operating conditions. Absolute maximum ratings are provided as a limit for the "reliable" use of parts, and recommended operating conditions are the conditions within which the electrical functionality and specifications given in the part datasheet are guaranteed.

The IEC defines absolute maximum ratings as "limiting values of operating and environmental conditions applicable to any electronic device of a specific type, as defined by its published data, which should not be exceeded under the worst possible conditions. These values are chosen by the device manufacturer to provide acceptable serviceability of the device, taking no responsibility for equipment variations and the effects of changes in operating conditions due to variations in the characteristics of the device under consideration and all other electronic devices in the equipment" [3]. The absolute maximum ratings (AMRs) in the datasheet often include limits on operational and environmental conditions, including power, power derating, supply and input voltages, operating temperature (typically ambient or case), junction temperature, and storage temperature.

Fig. 6.1 Class 5 SmartMotor™ part number format from Animatics (fields include product category, SM: SmartMotor™; frame size; motor; class; connector style, D: D-Sub; and options such as DE: drive enable and AD1: 24 V expansion I/O)

Fig. 6.2 Capacitor part number format from Paktron (fields: capacitance code, where the first two digits are significant figures and the third is the number of zeros, e.g., 405 = 4,000,000 pF = 4.0 µF; capacitance tolerance, e.g., K = ±10 %; DC voltage rating; product type; and lead style or packaging, e.g., G = gull-wing lead)
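As an illustration of how the encoding in Fig. 6.2 can be read programmatically, the sketch below decodes the Paktron-style capacitor part number quoted in the text; the fixed field pattern and the single tolerance code are assumptions made for this example, not a published parsing rule.

```python
import re

TOLERANCE_CODES = {"K": "±10 %"}  # only the code shown in Fig. 6.2

def decode_capacitor_pn(pn: str) -> dict:
    """Decode a Paktron-style capacitor part number (illustrative sketch only)."""
    m = re.fullmatch(r"(\d{3})([A-Z])(\d+)([A-Z]+\d*)([A-Z])", pn)
    if m is None:
        raise ValueError(f"unrecognized part number: {pn}")
    cap_code, tol, volts, product, lead = m.groups()
    # First two digits are significant figures; the third is the number of zeros (pF).
    cap_pf = int(cap_code[:2]) * 10 ** int(cap_code[2])
    return {
        "capacitance_uF": cap_pf / 1e6,
        "tolerance": TOLERANCE_CODES.get(tol, tol),
        "rated_voltage_Vdc": int(volts),
        "product_type": product,
        "lead_style": "gull-wing" if lead == "G" else lead,
    }

print(decode_capacitor_pn("405K100CS4G"))
# {'capacitance_uF': 4.0, 'tolerance': '±10 %', 'rated_voltage_Vdc': 100, ...}
```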


The IEC also states that "the equipment manufacturer should design so that initially and throughout life, no absolute-maximum value for the intended service is exceeded for any device under the worst probable operating conditions with respect to supply voltage variation, equipment component variation, equipment control adjustment, load variations, signal variation, environmental conditions, and variation in characteristics of the device under consideration and of all other electronic devices in the equipment" [3]. In summary, telecom companies that integrate electronic parts into products and systems are responsible for ensuring that the AMR conditions are not exceeded.
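One way an integrator can enforce this responsibility is to compare worst-case operating conditions against the AMR entries for each part; the limits and the operating point in the sketch below are placeholders, not taken from any datasheet.

```python
# Minimal AMR compliance check; limits and worst-case values are placeholders.
AMR = {
    "supply_voltage_V": 3.6,            # single upper limit
    "junction_temp_C": 150.0,           # single upper limit
    "storage_temp_C": (-65.0, 150.0),   # (lower, upper) limits
}

def check_amr(worst_case: dict, amr: dict) -> list:
    """Return the parameters whose worst-case value falls outside its AMR."""
    violations = []
    for name, limit in amr.items():
        value = worst_case.get(name)
        if value is None:
            continue  # parameter not evaluated for this part
        low, high = limit if isinstance(limit, tuple) else (None, limit)
        if (low is not None and value < low) or value > high:
            violations.append(name)
    return violations

print(check_amr({"supply_voltage_V": 3.45, "junction_temp_C": 155.0}, AMR))
# ['junction_temp_C']
```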

Part manufacturers generally state that below the AMR but above the recommended conditions, the performance of the part is not guaranteed to follow the datasheet, but the useful life of the part may not be affected. That is, there are no reliability (e.g., MTBF) concerns below the AMR. But some manufacturers (e.g., Freescale) state that "at or near the AMR," there may be reliability concerns over the long term [4].2

Philips notes [5], “The ‘RATINGS’ table (limiting values in accordance with the Absolute Maximum System—IEC 134) lists the maximum limits to which the device can be subjected without damage. This does not imply that the device will function at these extreme conditions, but rather that, when these conditions are removed and the device operated within the recommended operating conditions, it will still be functional and its useful life will not have been shortened.” That is, the temperature-dependent failure rate will not substantially change.

Almost all datasheets contain some form of warning statement or disclaimer to discourage or prohibit the use of parts at or near absolute maximum ratings. The most common statements used in the warning labels regarding AMRs include "functional operation is not implied," "stresses above these ratings can cause permanent damage to the parts," and "exposure to these conditions for extended periods may affect reliability and reduce useful life."

Part manufacturers guarantee the electrical parameters (typical, minimum, and maximum) of the parts only when the parts are used within the recommended operating conditions. Recommended operating conditions provided by manufacturers can include parameters such as voltage, temperature ranges, and input rise and fall time.

Philips notes, “The recommended operating conditions table (in the Philips datasheet) lists the operating ambient temperature and the conditions under which the limits in the DC characteristics and AC characteristics will be met.” Philips also states, “The table (of recommended operating conditions) should not be seen as a set of limits guaranteed by the manufacturer, but the conditions used to test the devices and guarantee that they will then meet the limits in the DC and AC characteristics table” [5].

2 Some EIA/JEDEC documents refer to absolute maximum ratings as absolute maximum "continuous" ratings. In those documents, transient conditions under which these ratings may be exceeded are defined.


Some recommended operating conditions may not be explicitly marked as such but are listed within the electrical specifications. For example, input logic voltage levels are often listed only in the DC characteristics of the electrical specification section, but the input voltage levels are actually inputs to the part. It is within these voltage levels that the part will meet its output specifications [6]. The AC characteristics of microprocessors consist of output delays, input setup requirements, and input hold requirements. The input setup and input hold requirements are practically "recommended operating conditions" within which other parameters, such as AC switching parameters, will meet specifications [7].

6.1.4 Thermal Characteristics

The thermal characteristics of a part provide information on its power dissipation and ability to dissipate heat. Power dissipation and the total thermal resistance of a package are generally available from the part datasheets or associated documents. When considering the temperature ratings, the thermal characteristics of a part need to be investigated in order to determine if a part will be below or above the ratings specified in the datasheet.

Some part manufacturers provide maximum power dissipation values in the absolute maximum ratings section of the datasheet. This value is usually based on the heat dissipation capacity of the package.3 The power dissipation limit is typically the maximum power that the manufacturer estimates the package can dissipate without resulting in damage to the part or raising the junction temperature above the manufacturer's internal specifications. Thus, it is important that the part be used below this maximum value.

In some cases, safety margins between the actual power dissipation capacity of a part and the rating on the datasheet are given. Also, sometimes the power dissipation level is associated with a junction or case temperature through a "derating" factor; derating is the practice of limiting the thermal, electrical, and mechanical stresses on electronic parts to levels below the manufacturer's specified ratings.

The junction temperature for an integrated circuit is "the temperature of the semiconductor junction in which the majority of the heat is generated. The measured junction temperature is only indicative of the temperature in the immediate vicinity of the element used to measure temperature" [8]. Often, the junction temperature is assumed to be the average temperature of the die surface within the package [9], although the temperature may not be uniform across the die during operation.

The case temperature is "the temperature at a specified, accessible reference point on the package in which a semiconductor die is mounted" [8]. It should not be assumed that the case temperature is defined or measured at the hottest location on the package surface. For example, Intel measures case temperature at the center of the top surface of the package [10], which may not always be the hottest temperature point on the package.

3 Some manufacturers, such as Philips and Freescale, provide supplementary information on how to estimate power dissipation for some of their parts and part families.

The ambient temperature is "the temperature of the specified, surrounding medium (such as air or a liquid) that comes into contact with a semiconductor device being tested for thermal resistance" [8]. The location of the measurement of the ambient temperature at test and system setup should be specified when providing an ambient temperature.

The storage temperature is "the temperature limits to which the device may be subjected in an unpowered condition. No permanent impairment will occur (if used within the storage temperature range), but minor adjustments may be needed to restore performance to normal" [8]. The storage temperature limit, when provided, is listed in the absolute maximum ratings section of the datasheet. The common storage temperature rating ranges for electronics are −65 to 150 °C or −55 to 150 °C.

The lead temperature rating is the maximum allowable temperature on the leads of a part during the soldering process. This rating is usually provided only for surface-mounted parts [11].4 Some companies include other information about soldering conditions in separate documents. For example, Intel provides pre-heat, preflow, and reflow times; temperatures; and ramp rates in its packaging data book [10].

Lead temperature ratings are typically in the 260–300 °C range, with a maximum exposure time of 10 s. The temperature limit and exposure time depend on the thermal inertia of the package under consideration, but often there is no difference in lead temperature ratings for different package types [12]. For example, the Unitrode controller, UC1637, is available in SOIC and PLCC packages, and both package types have the same 300 °C/10 s lead temperature rating.

The thermal resistance is “a measure of the ability of its carrier or package and mounting technique to provide for heat removal from the semiconductor junction” [13]. The thermal resistance is given by the temperature difference between two specified locations per unit of power dissipation and is measured in degree celsius per Watt. The lower the thermal resistance, the better the package is able to dis-sipate heat.

The most commonly used thermal resistance values for electronic parts are junction-to-case thermal resistance and junction-to-ambient thermal resistance [8].

4 Parts usually go through reflow soldering where the whole package is exposed to radiative and/or convective heat. The lead temperature and exposure time limit together provide a safeguard so that the package and the circuitry are not damaged by exposure to high temperatures. For insertion-mount parts, which are usually wave soldered, the part bodies are not exposed to direct heat, and this rating has generally not been considered essential.


Junction-to-ambient thermal resistance (θJA) is the thermal resistance from the semiconductor junction to a specified point in the surrounding ambient atmosphere, while junction-to-case thermal resistance (θJC) is the thermal resistance from the semiconductor junction to a specified point on the exterior of the package.

Part manufacturers determine thermal resistance values for devices and package families primarily through experimentation, thermal simulation, and extrapolation between package families. Part manufacturers generally follow the EIA/JESD Standard 51 and its supplements in determining thermal resistance values [14]. Many companies provide descriptions of their thermal resistance methodology. For example, AMD describes how it follows the EIA/JEDEC standard [15], and Intel describes its in-house method [10].

Thermal resistance data are only valid for a particular test or simulation condition, because they depend on factors such as the thermal conductivity of the printed circuit board, proximity and power dissipation of neighboring devices, speed and pattern of airflow through the system [16], coolant physical properties, and thermal radiation properties of the surrounding surfaces. Relating the thermal resistance data to the actual operating conditions is the responsibility of the product manufacturer. In particular, if a datasheet includes the thermal resistance information in the absolute maximum ratings section, the part must be mounted and operated so that the thermal resistance does not exceed the rated maximum thermal resistance.
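
As a simple illustration of how such data are used, the following sketch estimates a junction temperature from a junction-to-ambient thermal resistance (the relation later formalized as Eq. (6.1)). The θJA, power, and temperature values are assumptions for illustration, not values from any specific part, and a given θJA is only meaningful for the test condition under which it was reported.

# Minimal sketch: junction temperature estimate from thermal resistance data.
# All values are assumed for illustration.
THETA_JA = 45.0           # junction-to-ambient thermal resistance (deg C/W), one test setup
T_AMBIENT = 40.0          # local ambient temperature (deg C)
P_DISSIPATION = 1.2       # estimated power dissipation of the part (W)
T_JUNCTION_LIMIT = 125.0  # manufacturer's junction temperature limit (deg C)

t_junction = T_AMBIENT + P_DISSIPATION * THETA_JA
print(f"Estimated junction temperature: {t_junction:.1f} C "
      f"(margin to limit: {T_JUNCTION_LIMIT - t_junction:.1f} C)")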

In some cases, datasheets specify the mounting conditions—for example, the mounting torque5—necessary to achieve a desired value of thermal resistance. Some manufacturers also provide thermal resistance values from the junction to a specified point on the package besides the case (such as lead or heat sink mounting locations). Several part manufacturers make thermal resistance data available by package type in separate handbooks, websites, or through technical support [17].

6.1.5 Electrical Specifications

The datasheet provides tables for the electrical parameters that are normally specified by part manufacturers for a given set of operational and environmental conditions. Industry standards on electrical parameters, such as voltage, current, and power values, exist for some common mature parts. For example, the IEC [2] and the EIA [18] provide recommendations on these parameter values to part manufacturers for use in semiconductor part specifications.

5 The mounting torque of a screw-mounted device determines the quality of thermal contact between the part and the board. Thus, mounting torque impacts the heat flow from a part to the board.


Some examples of part electrical specifications are shown in Table 6.2.

6.2 Part Uprating

For some efficient cooling methods in data centers, such as free air cooling, there is a need to operate equipment in locations with temperatures higher than those previously encountered. In the case of telecom equipment and servers, free air cooling may increase the operating temperatures from 25 to 40 °C or higher. If the operating conditions of parts under free air cooling are within their manufacturer-specified ratings, it is unnecessary to conduct additional tests on the parts; otherwise, uprating tests are needed to assess whether the parts are reliable under those conditions.

Clear definitions allow a proper understanding of the intent and processes involved in engineering and business decisions. They also help communication across the supply chain and provide clarity [19]. Some key terms used in assessing the risk to telecom equipment in extended temperature ranges are given below.

Uprating is the process of assessing the ability of a part to meet the functionality and performance requirements of an application for operating conditions outside the manufacturers’ recommended range. Thermal uprating is the process of assessing the ability of a part to meet the functionality and performance requirements in an application beyond the manufacturer-specified recommended operating temperature range.

6.2.1 Steps of Part Uprating

This subsection presents the steps for an assessment procedure (uprating) that must be conducted before parts are used at extended environmental temperatures, such as those that might be encountered when increasing the room temperature of a piece of telecom equipment from a maximum of 25 °C to 40 °C or higher. These steps include collecting information about the candidate part, uprating, and managing products with uprated parts.

Table 6.2 Examples of parts specified by ambient temperature

Part number                     Part type         Company             Temperature range (°C)
VSP3100                         DSP               Texas Instruments   0 to 85
Intel® Atom™ processor Z510PT   Microprocessor    Intel               −40 to 85
Ultrastar 7K3000                Microcontroller   Hitachi             −40 to 70
Intel® Atom™ processor Z510P    Microprocessor    Intel               0 to 70
IC+ IP178C                      Chip              IC Plus Corp.       0 to 70


The parts selection and management process helps evaluate the risks inherent in the use of an electronic part. Steps in part selection and management that should be conducted prior to uprating include a manufacturer assessment, a part assessment, and a distributor assessment. Assessing the part manufacturer involves comparing data acquired from the manufacturer to predetermined criteria to determine if the manufacturer’s policies and procedures are consistent with producing reliable parts of acceptable quality. Evaluation of the part involves examining manufacturing data to assess the part’s quality, reliability, and assembly ability, and comparing the data with predetermined criteria to determine if the part will function acceptably in the product. The distributor assessment ensures that the distributors do not create a bottleneck in the supply chain due to delays or mistakes in delivery, and that the actions of the distributor do not compromise the quality and reliability of parts.

For the candidate part, the assessment steps prior to uprating include determining if the actual requirements warrant uprating, finding alternative parts, modifying equipment operation, and utilizing thermal management. These alternatives should be evaluated by the product development team, since decisions regarding the acceptability of any alternative require assessing the technical and financial trade-offs (Fig. 6.3).

There are three methods of uprating: parameter conformance, parameter re-characterization, and stress balancing. The suitability of a particular method depends on various technical, logistical, and cost factors. Table 6.3 provides the selection criteria for the uprating methods, and the following sections provide brief summaries of the three methods.

Fig. 6.3 Assessment steps prior to uprating (flow diagram: the candidate part passes through manufacturer, part, and distributor assessments; if the part is not acceptable and an alternate part is available, the candidate part is rejected, otherwise equipment manufacturer intervention is considered)


6.2.2 Parameter Conformance

Parameter conformance is an uprating process in which a part (device, module, or assembly) is tested to assess whether its functionality and electrical parameters meet the manufacturer’s specifications under the targeted application conditions. Electrical testing verifies compliance with the semiconductor manufacturer’s specified parameter limits and is performed with the test setups given in the semiconductor manufacturer’s datasheet. The tests are often functional “go/no-go” tests conducted without measuring and data-logging the actual parameter values.

Table 6.3 Selection criteria for the uprating methods

Criterion: Parameter conformance / Parameter re-characterization / Stress balancing

Time for assessment: Less than re-characterization but more than stress balancing / Most time consuming / Least time consuming

Required tests: Go/no-go tests; functional test / Electrical test with data logging; functional test / Electrical test

Need for changing the datasheet: No need / May be necessary / Necessary

Costs: Less than re-characterization but more than stress balancing / Has the highest cost / Has the lowest cost

Sample sizes: High sample size is required / Depends on the precision, standard deviation, and confidence level (directly proportional to the square of the standard deviation and inversely proportional to the square of the precision) / Only one sample needs to be tested

Test margins: Larger than the target application conditions / Acceptable margins for the electrical parameters under operational conditions should be established; tests should be performed at some points above and below the target application conditions / A margin can also be added to the application power
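
The sample-size dependence noted in Table 6.3 for parameter re-characterization can be read as a standard mean-estimation relationship: the number of parts needed grows with the square of the parameter’s standard deviation and shrinks with the square of the required precision. A minimal sketch, using illustrative numbers rather than values from any actual test program:

import math

def sample_size(std_dev, precision, z=1.96):
    # Approximate number of parts needed to estimate a parameter mean to
    # within +/- precision at the confidence level implied by z (1.96 ~ 95 %).
    return math.ceil((z * std_dev / precision) ** 2)

# e.g., a propagation delay with an assumed sigma of 0.35 ns, to be estimated
# to within +/- 0.1 ns
print(sample_size(std_dev=0.35, precision=0.1))  # -> 48 parts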


When completed, the electrical parameter specifications in the datasheet are not modified. The steps of parameter conformance are shown in Fig. 6.4.

Fig. 6.4 Steps in parameter conformance (flow diagram: go/no-go tests of the candidate part at nominal conditions and at the extremes of the target application conditions, followed by electrical tests of the final assembly over the application conditions; a failure at any decision point leads to consideration of other alternatives, and passing all of them means the part is uprated through parameter conformance)


Parameter conformance is the least expensive of all the uprating methods. It is well suited for low-complexity parts where functional tests with high coverage can be performed. Parameter conformance characterizes the parameters only at the extremes of the application conditions. It is necessary to test a sample of each incoming part lot for all functions and parameters that are critical to the end product’s performance and operation. If more than a predetermined percentage of parts fail to meet the acceptance criteria, the part should be deemed not uprated by the parameter conformance method.

Tests in parameter conformance are of the “go/no-go” type. All functions and parameters that are critical to end-product performance and operation should be tested, ideally testing all electrical functions and parameters. However, this is not always practical, and may even be impossible. For example, performing a complete functional test on a complicated part such as a microprocessor could take years, even if done by the part manufacturers, and the lack of availability of detailed internal information about the parts can limit the scope of the selected tests. In most cases, it is sufficient to test the functions necessary for implementation of the final product.

Margins can be added to the electrical parameters or to the target temperature extremes. Accordingly, there are two types of tests in parameter conformance. These are shown in Fig. 6.5. In type 1, the test is at the target temperature with margins on the electrical parameter. In type 2, the test is at the electrical parameter specification limit, with margins on the target temperature. Both types of margins may be combined.

The pass-fail criteria for each parameter are based on the datasheet-specified limits. Margins on the datasheet electrical parameters provide confidence in the applicability of the test results to assess all parts and part lots.

Confidence in the test results can be obtained by experimentally and statistically determining the conditions at which parts fail to meet the datasheet specification limits. For example, while uprating for temperature, the test temperature could be incrementally increased (or decreased) beyond the target application temperature range until at least one of the part parameters no longer meets the datasheet performance specifications. This is called the “fallout temperature” for that part.

Fig. 6.5 The two types of go/no-go tests in parameter conformance (electrical parameter, e.g., propagation delay, plotted against temperature: type 1 tests at the target temperature with a margin on the parameter specification limit; type 2 tests at the specification limit with a margin on the target temperature)


Once the fallout temperature for the parts under test has been determined, a distribution can be plotted, with temperature on the x-axis and percentage fallout on the y-axis. The temperature margins can then be determined with statistical confidence levels. Figure 6.6 is a temperature fallout chart that illustrates the relationship among the recommended operating temperature (T-ROC), the target temperature (T-Target), and the temperature fallout distribution (T-first-fallout and T-mean-fallout). However, if the margins are large, then this may not be needed.
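
A minimal sketch of this margin assessment, under assumed fallout data (the fallout temperatures below are hypothetical, and a formal tolerance-interval factor could replace the rough three-sigma bound used here):

import statistics

fallout_temps = [118.0, 121.5, 119.0, 124.0, 122.5, 120.0, 123.0, 119.5]  # deg C, one per tested part
T_TARGET = 85.0  # target application temperature (deg C)

mean_fallout = statistics.mean(fallout_temps)
std_fallout = statistics.stdev(fallout_temps)
# Rough lower bound on where fallout begins (about three sigma below the mean)
t_first_fallout_est = mean_fallout - 3.0 * std_fallout
print(f"Mean fallout: {mean_fallout:.1f} C, estimated first fallout: {t_first_fallout_est:.1f} C")
print(f"Margin above target temperature: {t_first_fallout_est - T_TARGET:.1f} C")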

6.2.3 Parameter Re-characterization

The parameter re-characterization process mimics the characterization process used by the part (device, module, assembly) manufacturer to assess a part’s functionality and electrical parameters over the targeted application conditions. A part uprated by parameter re-characterization will have a modified datasheet of electrical parameters if the environment drives the temperature above the recommended operating temperature. The risks associated with the use of parts uprated by this method are lower than with the other methods of uprating.

Part manufacturers characterize their parts to determine and ensure their electrical parameter limits [20]. They usually test samples of parts at room temperature (typically 25 °C) to determine the normal values of the electrical parameters. They then test the parts at the recommended operating temperature extremes in the datasheet to determine the limiting values of the electrical parameters.

Figure 6.7 shows an example of a characterization curve for a digital logic part [21]. A sample of parts was used to characterize the output rise time of the logic part. The figure shows the mean values of the parameters and standard deviations at each temperature where characterization was conducted. The data were used to determine the electrical parameter values for parts in two temperature ranges: −40 to 85 °C and −40 to 125 °C.

Fig. 6.6 Example chart showing part fallout with temperature (percent of parts failed versus temperature, marking T-ROC, T-Target, T-first-fallout, and T-mean-fallout)


Figure 6.8 shows the flow diagram for the parameter re-characterization process. Electrical testing of the part is required. Testing may be performed in-house or at outside test facilities. When performing re-characterization, new electrical parameter limits may be required.

Functional testing of a part is required for parameter re-characterization, because if the part is operated outside the rated values, it may fail to function, even though it meets all required parameter limits. Functional tests may include test vectors that exercise the part in a manner similar to the application for which it is being considered. Re-characterization of all functions at all conditions may not be required if certain functions are not used.

Ideally, all electrical parameters should be tested in parameter re-characterization. The exclusion of any datasheet parameter in the electrical testing for re-characterization needs to be justified. However, many electrical parameters in a datasheet depend on each other, and the trends for one parameter can be derived from the trends of others. Some electrical parameters may be known to be independent of temperature, and those parameters may be excluded from thermal re-characterization. It may also be possible that electrical characterization data are available for one or more of the electrical parameters for the complete target uprating range, making their re-characterization unnecessary.

The test process estimates how much of a margin exists between the part parameters and the datasheet specifications for the parameters for targeted conditions of use. Unacceptable discontinuities of parameters or failures in functional tests are causes for rejection of a part from further consideration in parameter re-characterization. Figure 6.9 schematically shows a representation of parameter limit modification to account for changes in electrical parameters with temperature.

Fig. 6.7 Fairchild HC244 characterization curve [21] (output rise time versus temperature from −40 to 125 °C; mean values range from 3.01 ns (σ = 0.28 ns) to 4.8 ns (σ = 0.36 ns) across the characterization temperatures, with 6σ spreads shown, against specified limits of 1 ns minimum and 15 ns maximum)


Table 6.4 shows an example of parameter limit modification of a 0–70 °C rated part to a −55 to 125 °C temperature range of operation. In this example, the same margin that was obtained by testing the parts over the 0–70 °C target temperature range is maintained over the −55 to 125 °C temperature range of operation.

Fig. 6.8 Flow diagram for the parameter re-characterization process (parts are electrically tested over conditions extending outside the target application conditions; the flow checks for electrical parameter discontinuities or functional failures, for conformance to the datasheet electrical specifications with acceptable margin, for whether new parameter limits can be created to meet the system electrical requirements at the target application conditions, and for acceptable system-level electrical tests of the final assembly; a failure at any decision point leads to consideration of other alternatives)



After electrical test data are collected over the targeted conditions of use, the margins on the parameters at the uprated points of interest are assessed. This assessment process takes into account the confidence interval on the parameters, the measurement uncertainties, and the spread of the electrical parameters.
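
The limit modification sketched in Fig. 6.9 can be expressed in a few lines. The guard-band choice below (half of a six-sigma spread plus a measurement-uncertainty allowance) and all of the numbers are assumptions for illustration, not a prescribed procedure:

def modified_upper_limit(mean, std_dev, k_spread=6.0, meas_uncertainty=0.1):
    # New upper limit for a "maximum"-type parameter (e.g., propagation delay),
    # set from the distribution measured at the target temperature.
    return mean + 0.5 * k_spread * std_dev + meas_uncertainty

p_spec = 10.0                                         # manufacturer's limit at the rated temperature (ns)
p_new = modified_upper_limit(mean=10.3, std_dev=0.4)  # distribution measured at the target temperature
print(f"Modified limit pNEW = {p_new:.2f} ns (datasheet limit pSPEC = {p_spec:.2f} ns)")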

Fig. 6.9 Schematic representation of parameter limit modification for use at the target temperature range (population distributions of an electrical parameter, e.g., propagation delay time, at the manufacturer’s temperature limit and at the target temperature limit; the manufacturer’s parameter limit pSPEC is shifted to a modified limit pNEW, and the change in the limit may be zero)

Table 6.4 Parameter thermal re-characterization example: TI SN74ALS244 octal buffer

Parameter        Commercial limit   Military limit   Measured value at military limit   Modified limit (calculated)a
tPLH (ns) min    2.00               1.00             5.10                               1.80
tPLH (ns) max    10.00              16.00            12.80                              15.20
tPHL (ns) min    3.00               3.00             6.70                               1.90
tPHL (ns) max    10.00              12.00            10.20                              11.10
VOH (V) min      3.50               3.50             3.75                               3.31
VOL (V) max      0.40               0.40             0.18                               0.42
ICCH (mA) min    9.00               9.00             9.10                               7.65
ICCH (mA) max    17.00              18.00            14.14                              18.60
ICCL (mA) min    15.00              15.00            14.71                              14.50
ICCL (mA) max    24.00              25.00            19.36                              26.00

a The margins at the commercial temperature limit (0–70 °C) are maintained at the military temperature limit (−55 to 125 °C)


6.2.4 Stress Balancing

Stress balancing is a thermal operating method. It is applicable when the part (device, module, or assembly) manufacturer specifies a maximum recommended ambient or case operating temperature. It can be conducted when at least one of the part’s electrical parameters can be modified to reduce heat generation, thereby allowing operation at a higher ambient or case temperature than that specified by the manufacturer.6

For example, in active semiconductor devices:

TJ = TA + P · θJA  (6.1)

where TJ is the junction temperature, TA is the ambient temperature, P is the power dissipation, and θJA is the junction-to-ambient thermal resistance. Equation (6.1) shows that the ambient operating temperature can be increased beyond the manufacturer’s recommended rated values if the power dissipation is reduced while keeping the junction temperature constant. Similarly, Eq. (6.2) shows that the case operating temperature can be increased beyond the manufacturer’s recommended rated values if the power dissipation is reduced while keeping the junction temperature constant:

TJ = TC + P · θJC  (6.2)

where TC is the case temperature and θJC is the junction-to-case thermal resistance.

The trade-off between increased ambient or case temperature and a change in one or more electrical parameters can be made if the power dissipation of the part is found to depend on some electrical parameter(s) (e.g., operating voltage, frequency). If the electrical parameter(s) can be selected to meet the application requirements, then the trade-off can be accepted. Electrical testing of the part is then performed at the worst-case application conditions to ensure operation of the part with the modified electrical parameters.

Stress balancing exploits the trade-off among power, temperature, and electrical parameters. Stress balancing can be performed when the part power dissipation can be reduced by changing some of its electrical parameters and if sufficient reduction in power dissipation is possible to obtain the required increase in ambient temperature. The goal is to assess which electrical parameters can be changed and how much change can be tolerated with respect to the application in order to accommodate ambient or case temperatures greater than the rating on the datasheet.
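
A minimal sketch of this trade-off, following Eqs. (6.1) and (6.2) with assumed values (the thermal resistance, temperatures, and the rough assumption that dynamic power scales about linearly with clock frequency are all illustrative, not part-specific):

THETA_JA = 40.0    # junction-to-ambient thermal resistance (deg C/W), assumed
T_J_HELD = 110.0   # junction temperature to be held constant (deg C)
T_A_RATED = 70.0   # manufacturer's rated ambient temperature (deg C)
T_A_TARGET = 85.0  # target ambient under free air cooling (deg C)

p_rated = (T_J_HELD - T_A_RATED) / THETA_JA    # allowable power at the rated ambient (Eq. 6.1 rearranged)
p_target = (T_J_HELD - T_A_TARGET) / THETA_JA  # allowable power at the higher ambient
ratio = p_target / p_rated
print(f"Power must drop from {p_rated:.2f} W to {p_target:.2f} W "
      f"(~{100 * ratio:.0f} % of rated power, e.g., via a comparable clock reduction)")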

The stress balancing method requires less testing than the parameter conformance and parameter re-characterization uprating methods. Testing is conducted only to check the applicability of the theoretical results. This method can be applied at the module, assembly, or final product level.

6 The junction temperature limit specified in the absolute maximum ratings of the part datasheet cannot be used as an uprating parameter because reliability is not guaranteed.




6.2.5 Continuing Steps After Uprating

Future engineering decisions involving equipment maintenance, part replacement, similar part uprating, part reassessment criteria, and design modifications require good documentation. The documented information needs include the alternatives considered before uprating was performed, the reasons that uprating was chosen in favor of other methods (e.g., using other parts or thermal management), and the rationale for using uprating for the particular application. The relevant part datasheets, application notes, internet documents, and communications with part manufacturers—including, but not limited to, the thermal, electrical, and mechanical data used to make uprating decisions (e.g., thermal resistance data, power dissipation limits)—also need to be documented. In addition, the documented information should cover the standards followed in uprating assessment (e.g., company documentation, industry standards); the details of the parameters tested, test conditions, results, failure analysis, and any anomalies; and statistical data such as the mean, standard deviation, and confidence interval for the electrical tests.

The customers of equipment containing uprated parts should be kept informed of the use of uprated parts. This includes customers of final products, as well as customers making engineering decisions regarding the integration of subsystems containing uprated parts. In these cases, written approval for each instance of the use of uprated parts should be obtained, and analysis and test results associated with uprating should be made available to the customer. In addition, any possible hazards due to the use of uprated parts that an equipment manufacturer knows of, or should have known of, must be communicated to the customer.

Product change notifications from part manufacturers should be monitored and evaluated. Process changes (e.g., a die shrink, a new package, or an improvement in a wafer process) may or may not affect the part datasheet; however, part performance in the extended condition may have changed. The effects of manufacturers’ design and process changes manifest themselves between lots, and the effects of these changes need to be assessed during the quality assurance process for future lots of parts.

Some changes in parts may warrant additional uprating assessment. These changes include any changes in the rating(s), power dissipation, or thermal characteristics of a part, or can be caused by modifications in package type, size, footprint, die size, or material set.

When performing maintenance or repair on equipment requiring the replacement of an uprated part, it is necessary to replace the part with an equivalent uprated part. An identification system for parts that have been uprated is necessary.


6.3 Summary

This chapter introduced methods to identify the parts at performance risk under data center efficient cooling methods that may change the operating environment, and then provided a process to assess whether alternative parts are qualified for the new environment. If appropriate alternatives are not practical or possible, uprating methods can be used to assess whether the original parts are qualified under the data center efficient cooling methods. Three uprating methods (parameter re-characterization, parameter conformance, and stress balancing) were presented with examples to show the steps for their implementation. A comparison of the uprating methods and criteria for selecting the appropriate uprating method were also introduced in this chapter.

References

1. H. Kanter, R. Atta, Integrating Defense into the Civilian Technology and Industrial Base, Office of the Assistant Secretary of Defense for Production and Logistics, Feb. 1993
2. IEC Standard 747-1, Semiconductor Devices—Discrete Devices and Integrated Circuits, Geneva, Switzerland, 1983
3. IEC Standard 60134, Ratings System for Electronic Tubes and Valves and Analogous Semiconductor Devices, Geneva, Switzerland, 1961 (last review date 1994)
4. R. Locher, Introduction to Power MOSFETs and their Applications, National Semiconductor Application Note, vol. 558, Santa Clara, CA, Dec. 1988
5. Philips, Family Specifications: HCMOS Family Characteristics, Mar. 1988
6. Harris Semiconductor, Datasheet of CD54HC00, Melbourne, Florida, 1997
7. AMD, Datasheet of AM486DE2, Sunnyvale, CA, Apr. 1996
8. SEMATECH, SEMATECH Official Dictionary Rev 5.0, Technology Transfer 91010441C-STD, http://www.sematech.org/public/publications/dict/images/dictionary.pdf, 1995, as of Aug. 2002
9. Intel Application Note AP-480, Pentium® Processor Thermal Design Guidelines Rev 2.0, Nov. 1995
10. Intel, Packaging Data Book, Denver, CO, Jan. 1999
11. EIA 583, Packaging Material Standards for Moisture Sensitive Parts, Alexandria, VA, 1991
12. P. McCluskey, R. Munamarty, M. Pecht, Popcorning in PBGA packages during IR reflow soldering. Microelectron. Int. 42, 20–23 (1997)
13. United States Department of Defense, MIL-STD-883: Test Method Standards—Microcircuits, 1996
14. EIA/JEDEC Standard EIA/JESD51, Methodology for the Thermal Measurement of Component Packages (Single Semiconductor Device), Alexandria, VA, Dec. 1995
15. AMD, Packaging Handbook—Chapter 8: Performance Characteristics of IC Packages, Sunnyvale, CA, 1998
16. V. Dutta, Junction-to-case thermal resistance—still a myth? in Proceedings of the 4th IEEE SEMI-THERM Symposium, pp. 8–11 (1988)
17. E.A. Wood, Obsolescence Solutions for 5 Volt Integrated Circuits Beyond 2005, in Proceedings of Commercialization of Military and Space Electronics, pp. 393–405, Los Angeles, CA, January 30–February 2, 2000
18. EIA Standard RS-419-A, Standard List of Values to Be Used in Semiconductor Device Specifications and Registration Formats, Alexandria, VA, Oct. 1980
19. L. Condra, R. Hoad, D. Humphrey, T. Brennom, J. Fink, J. Heebink, C. Wilkinson, D. Marlborough, D. Das, N. Pendsé, M. Pecht, Terminology for use of electronic parts outside the manufacturer’s specified temperature ranges. IEEE Trans. Compon. Packag. Technol. 22(3), 355–356 (1999)
20. N. Pendsé, M. Pecht, Parameter Re-characterization Case Study: Electrical Performance Comparison of the Military and Commercial Versions of an Octal Buffer, Future Circuits International, vol. 6 (Technology Publishing Ltd, London, 2000), pp. 63–67
21. D. Das, N. Pendsé, C. Wilkinson, M. Pecht, Parameter recharacterization: a method of thermal uprating. IEEE Trans. Compon. Packag. Technol. 24(4), 729–737 (2001)


Chapter 7
Part Reliability Assessment in Data Centers

The risks to telecom equipment due to failure and degradation of parts need to be evaluated in order to assess component and system reliability in telecom equipment and data centers. This chapter provides rules to identify the reliability risks in parts under select existing or emerging energy efficient cooling methods, and then discusses handbook-based reliability prediction methods, analyzing their applicability for the cooling methods. This chapter also provides methods to assess the reliability of parts under the cooling conditions.

7.1 Part Capability

When the efficient cooling methods are implemented, the local operating conditions for parts, including temperature, humidity, and contamination, are modified. This change poses risks to the reliability of parts in telecom equipment. However, it is impractical to estimate the impact of the operating condition modifications on all components. Therefore, it is necessary to identify the parts which are at risk, as these parts are likely to fail first under the efficient cooling conditions. This section provides two methods to distinguish the parts that are at risk: analyzing the technology and analyzing the operating environment.

Each component has its own reliability risks, which are affected by continuously changing technology. One of the primary drivers of the electronics industry since the 1970s has been complementary metal–oxide–semiconductor (CMOS) technology. With CMOS’s reduced channel length and high level of performance, the transistor power density, total circuits per chip, and total chip power consumption have all increased, resulting in a decrease in reliability and product lifetime. For example, it is estimated that the expected lifetime of a product with a 180 nm technology node is more than 100 years, but that of a product with a 90/65 nm1 technology node for semiconductor processors is less than 15 years [1].

1 90/65 nm refers to the size of the transistors in a chip, and it also refers to the level of CMOS process technology.




In other words, each technology has its own level of reliability risk under the same operating conditions. Technology analysis can help identify the parts at risk under the efficient cooling conditions.

Information about the operating environment and specifications of parts provided by manufacturers can be used to determine a part’s suitability in any application. In particular, the recommended operating conditions (RoCs) and absolute maximum ratings (AMRs) (introduced in Chap. 6) are valuable guidelines. The commonly understood reliability metrics, such as hazard rate or time-to-failure, are generally not included in the part datasheet. However, part manufacturers guarantee the electrical parameters (typical, minimum, and maximum) of the parts when the parts are used within the RoCs, and therefore below the AMRs. If the environmental conditions are such that the part is operated within the recommended operating conditions, then there should be no impact on system reliability. One assumption is that equipment designs take into account the variability of the part performance parameters shown in their datasheets. Furthermore, when the local operating conditions increase beyond the part’s RoC, its performance variations increase, eventually affecting performance. For example, if a part is used in a system without taking into account the possible variability in performance specified in the datasheet, then an increase in operating temperature might cause a system performance malfunction.

The reliability of parts is generally independent of the operating conditions (e.g., temperature), as long as the part is used within the absolute maximum rating. When a part is used above the AMR for a short time, there should be no significant change in its lifetime. This is already established, because most parts are subject to solder reflow conditions that are beyond their AMR conditions. However, if the part is used above its AMR for an extended period, some failure mechanisms may precipitate more quickly, beyond an acceptable level, compared to when the part is operated within its AMR. These risks can only be determined if there is a failure model for temperature-accelerated life consumption or by conducting accelerated tests at multiple conditions to create such a model.

Based on the discussion above, the enhanced reliability risks of parts under the efficient cooling conditions can be identified, taking into account the local operating temperatures of the part. The local temperature should be measured or can be estimated from prior experience or similar room conditions.
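
As a simple illustration of this screening step, the sketch below flags parts against their datasheet RoC and AMR limits; the part entries and the measured local temperature are hypothetical:

def thermal_risk(local_temp_c, roc_max_c, amr_max_c):
    # Classify a part's thermal risk from its datasheet limits.
    if local_temp_c <= roc_max_c:
        return "within RoC: no additional assessment needed"
    if local_temp_c <= amr_max_c:
        return "above RoC: candidate for uprating assessment"
    return "above AMR: sustained use not acceptable without mitigation"

parts = [
    {"ref": "U12", "roc_max": 70.0, "amr_max": 125.0},  # hypothetical part
    {"ref": "U47", "roc_max": 85.0, "amr_max": 150.0},  # hypothetical part
]
local_temp = 78.0  # deg C, measured or estimated near the parts under free air cooling
for part in parts:
    print(part["ref"], "->", thermal_risk(local_temp, part["roc_max"], part["amr_max"]))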

Under the efficient cooling methods, there can be multiple failure mechanisms for products. It is possible that the dominant failure mechanisms will vary with operating environment changes. When the dominant failure mechanisms change in the efficient cooling conditions, the reliability of the parts will need to be estimated, while also taking competing (and sometimes collaborative) failure mechanisms into consideration. As described in Chap. 5, there are several potentially active failure mechanisms, such as corrosion, electrostatic discharge (ESD), and conductive anodic filament (CAF). These mechanisms should also be considered when identifying the parts at risk under the efficient cooling methods.


7.2 Example Handbook-Based Reliability Prediction Methods

Some handbooks provide methods to predict the part’s reliability under various operating conditions. All handbook-based reliability prediction methods contain one or more of the following types of prediction: (1) tables of operating and/or non-operating constant failure rate values arranged by part type; (2) multiplicative factors for different environmental parameters to calculate the operating or non-operating constant failure rate; and (3) multiplicative factors that are applied to a base operating constant failure rate to obtain the non-operating constant failure rate.

Reliability prediction for electronic equipment using handbooks can be traced back to MIL-HDBK-217, “Reliability Prediction of Electronic Equipment,” published in 1960, which was based on curve fitting a mathematical model to historical field failure data to determine the constant failure rate of parts. Several companies and organizations, such as the Society of Automotive Engineers (SAE) [2], Bell Communications Research (now Telcordia) [3], the Reliability Analysis Center (RAC) [4], the French National Center for Telecommunication Studies (CNET, now France Telecom R&D) [5], Siemens AG [6], Nippon Telegraph and Telephone Corporation (NTT), and British Telecom [7], subsequently decided that it was more appropriate to develop their own “application-specific” prediction handbooks for their products and systems.

In this section, we present two examples of handbook-based reliability prediction methods. MIL-HDBK-217 was selected because it was the first and remains the most well-known handbook, and Telcordia SR-332 was selected because it is used for telecommunication equipment and systems.

7.2.1 MIL-HDBK-217

The MIL-HDBK-217 reliability prediction methodology was developed by the Rome Air Development Center. The last version was MIL-HDBK-217 Revision F Notice 2, which was released on February 28, 1995 [8]. In 2001, the Office of the Secretary of Defense (OSD) stated that “… the Defense Standards Improvement Council (DSIC) made a decision several years ago to let MIL HDBK 217 ‘die a natural death’” [8]. In other words, the OSD will not support any updates/revisions to MIL-HDBK-217.

The stated purpose of MIL-HDBK-217 was “… to establish and maintain consistent and uniform methods for estimating the inherent reliability (i.e., the reliability of a mature design) of military electronic equipment and systems” [8]. The MIL-HDBK-217 method provided a way to predict the reliability of military electronic equipment/systems at the program acquisition stage. The reliability prediction can be used to compare and evaluate the equipment/system reliability with various competitive designs, and then increase the reliability of the equipment being designed [8].



Since MIL-HDBK-217 has been out of date since the 1990s, and the data center sector uses parts from cutting-edge technology, this handbook is not suitable for data center reliability assessment, with or without the efficient cooling methods. The necessity of discussing this document comes only from the fact that it spawned many industry-specific documents with supposedly more current data and methods. One such handbook is Telcordia SR-332 [3].

7.2.2 Telcordia SR-332

Telcordia (previously known as Bellcore) SR-332 is a reliability prediction methodology developed by Bell Communications Research (or Bellcore) primarily for telecommunications companies [3]. Bellcore, which was previously the telecommunications research branch of the Regional Bell Operating Companies (RBOCs), is now known as Telcordia Technologies. The methodology was revised in 2008.

The stated purpose of Telcordia SR-332 is “to document the recommended methods for predicting device and unit hardware reliability [and also] for predicting serial system hardware reliability” [3]. The methodology is based on empirical statistical modeling of commercial telecommunication systems whose physical design, manufacturing, installation, and reliability assurance practices meet the appropriate Telcordia (or equivalent) generic and system-specific requirements. In general, Telcordia SR-332 adapts the equations in MIL-HDBK-217 to represent what telecommunications equipment experience in the field. Results are provided as a constant failure rate, and the handbook provides the upper 90 % confidence-level point estimate for the constant failure rate. Telcordia SR-332 also provides methodologies to incorporate burn-in, field, and laboratory test data, using a Bayesian analysis.

7.2.3 How the Handbook Calculations Work

In most cases, handbooks adapt the MIL-HDBK-217 method of curve-fitting field failure data to a model of the form given in Eq. (7.1):

λP = f(λG, πi)  (7.1)

where λP is the calculated constant part failure rate, λG is an assumed (generic) constant part failure rate, and πi is a set of adjustment factors for the assumed constant failure rates. What all of these handbook methods have in common is that they either provide a constant failure rate or calculate one using one or more multiplicative factors (which may include factors for part quality, temperature, design, and environment) to modify a given constant base failure rate.


The constant failure rate models used in the handbooks are obtained by performing a linear regression analysis on field data. The regression analysis quantifies the expected theoretical relationship between the constant part failure rate and the independent variables. The first step in the analysis is to examine the correlation matrix for all variables, showing the correlation between the dependent variable (the constant failure rate) and each independent variable. The independent variables used in the regression analysis include factors such as the device type, package type, screening level, ambient temperature, and application stresses. The second step is to apply stepwise multiple linear regression to the data, expressing the constant failure rate as a function of the relevant independent variables and their respective coefficients. The constant failure rate is then calculated using the regression formula and the input parameters.
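
A minimal sketch of that regression step, fitting a log-linear constant-failure-rate model by least squares; the data below are synthetic placeholders, not field data, and real handbook development uses far larger data sets and stepwise variable selection:

import numpy as np

# Independent variables per field data set: reciprocal temperature (1/K) and an
# electrical stress ratio; response: natural log of the observed failure rate.
x = np.array([[1 / 313.0, 0.2],
              [1 / 323.0, 0.4],
              [1 / 333.0, 0.5],
              [1 / 343.0, 0.7]])
y = np.array([-16.1, -15.6, -15.2, -14.6])     # ln(lambda), lambda in failures per hour
design = np.hstack([np.ones((len(x), 1)), x])  # add an intercept column
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
print("fitted intercept and coefficients:", coeffs)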

7.2.4 How the Operating Environments are Handled

In this section, the manner in which the operating environment is handled by handbook-based reliability prediction methods is demonstrated using the Telcordia methodology as an example. In this method, the constant failure rate for a part, λss,i, is given by:

λss,i = λGi ΠQi ΠSi ΠTi  (7.2)

where λGi is the generic steady-state failure rate for the ith part, ΠQi is the quality factor for the ith part, ΠSi is the stress factor for the ith part, and ΠTi is the temperature factor for the ith part.

The temperature factor in Telcordia SR-332 is analogous to those in the other handbook methods and follows the Arrhenius relationship. The base temperature is taken as 40 °C. This temperature factor is determined by the operating ambient temperature and the temperature stress curve, which is a number within the range of 1–10 and can be found according to the part type in the handbook. After the temperature stress curve is determined, the temperature factor can be found in the temperature table provided in the handbook for different activation energies and operating temperatures. A part of the temperature table is shown in Table 7.1.
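
A minimal sketch of an SR-332-style calculation in the form of Eq. (7.2). The Arrhenius-type temperature factor below, referenced to the 40 °C base, only illustrates the general shape described above; the actual temperature, quality, and stress factors must be taken from the handbook tables, and every number here is an assumption:

import math

K_B = 8.617e-5  # Boltzmann constant (eV/K)

def pi_temperature(t_ambient_c, activation_energy_ev=0.4, t_base_c=40.0):
    # Arrhenius-type acceleration of the failure rate relative to the 40 deg C base.
    t_op = t_ambient_c + 273.15
    t_base = t_base_c + 273.15
    return math.exp((activation_energy_ev / K_B) * (1.0 / t_base - 1.0 / t_op))

lambda_generic = 20.0  # generic steady-state failure rate in FITs (failures per 1e9 h), assumed
pi_q, pi_s = 1.0, 1.5  # assumed quality and electrical-stress factors
lambda_ss = lambda_generic * pi_q * pi_s * pi_temperature(50.0)
print(f"Steady-state failure rate at 50 C: {lambda_ss:.1f} FITs")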

7.2.5 Insufficiency of the Handbook Methods

The traditional handbook-based reliability prediction methods rely on analysis of failure data collected from the field and assume that the components of a system have inherent constant failure rates that are derived from the collected data. These methods assume that the constant failure rates can be tailored by independent “modifiers” to account for various quality, operating, and environmental conditions, despite the fact that most failures do not occur at constant rates.



Furthermore, none of these handbook prediction methods identify failure modes or mechanisms, nor do they involve any uncertainty analysis. Hence, they offer limited insight into practical reliability issues [9]. A comparison between field failures and handbook-based failure predictions is shown in Table 7.2; this demonstrates the futility of handbook methods through the large differences between the field MTBF (mean time between failures) and the MIL-HDBK-217 predicted MTBF [10].

Table 7.1 Selected temperature factors (ΠT) in Telcordia SR-332 [3]

Operating ambient    Temperature stress curve
temperature (°C)     1    2    3    4    5    6    7    8    9    10
30                   1.0  0.9  0.9  0.8  0.7  0.7  0.6  0.6  0.5  0.4
31                   1.0  0.9  0.9  0.8  0.7  0.7  0.6  0.6  0.5  0.5
32                   1.0  0.9  0.9  0.8  0.8  0.7  0.6  0.6  0.6  0.5
33                   1.0  0.9  0.9  0.9  0.8  0.7  0.7  0.7  0.6  0.6
34                   1.0  0.9  0.9  0.9  0.8  0.8  0.7  0.7  0.7  0.6
35                   1.0  1.0  0.9  0.9  0.9  0.8  0.8  0.8  0.7  0.7
36                   1.0  1.0  0.9  0.9  0.9  0.8  0.8  0.8  0.8  0.7
37                   1.0  1.0  1.0  0.9  0.9  0.9  0.9  0.9  0.8  0.8
38                   1.0  1.0  1.0  1.0  0.9  0.9  0.9  0.9  0.9  0.8
39                   1.0  1.0  1.0  1.0  1.0  1.0  1.0  0.9  0.9  0.9
40                   1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0
41                   1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.1  1.1  1.1
42                   1.0  1.0  1.0  1.0  1.1  1.1  1.1  1.1  1.1  1.2
43                   1.0  1.0  1.0  1.1  1.1  1.1  1.2  1.2  1.2  1.3
44                   1.0  1.0  1.1  1.1  1.1  1.2  1.2  1.2  1.3  1.4
45                   1.0  1.0  1.1  1.1  1.2  1.2  1.3  1.3  1.4  1.5
46                   1.0  1.1  1.1  1.1  1.2  1.3  1.3  1.4  1.5  1.6
47                   1.0  1.1  1.1  1.2  1.3  1.3  1.4  1.4  1.6  1.8
48                   1.0  1.1  1.1  1.2  1.3  1.4  1.4  1.5  1.7  1.9
49                   1.0  1.1  1.1  1.2  1.3  1.4  1.5  1.6  1.8  2.1
50                   1.0  1.1  1.1  1.2  1.4  1.5  1.6  1.7  1.9  2.2

Table 7.2 A comparison between field failures and handbook-based failure predictions [10]

MIL-HDBK-217 MTBF (hours)   Observed MTBF (hours)
7247                        1160
5765                        74
3500                        624
2500                        2174
2500                        51
2000                        1056
1600                        3612
1400                        98
1250                        472


In addition, there are special limitations of the handbook prediction methods for the conditions of some efficient cooling methods in data centers (e.g., free air cooling). The first limitation is the unknown operating condition of the equipment, since different data center operators may set different operating conditions when implementing free air cooling. Without operating condition information, the handbook-based methods cannot be applied to predict the reliability of parts and systems in data centers. Even if the pre-set operating condition ranges during free air cooling are the same, the actual condition of the supply air in different data centers may vary with the local climate. For example, if the supply air temperature range is set as 15–30 °C, the actual supply air temperature may be anywhere between 15 and 30 °C depending on the local weather. It is therefore difficult to identify the exact operating condition of parts and systems under free air cooling, and knowing that condition is a requirement for the handbook-based methods. In addition, these handbook-based methods emphasize steady-state-temperature-dependent failure mechanisms, whereas the temperature of individual components under free air cooling varies cyclically. Therefore, the handbook prediction methods are not valid for free air cooling.

7.3 Prognostics and Health Management Approaches

Prognostics and health management (PHM) is an enabling discipline consisting of technologies and methods to assess the reliability of a product in its actual life cycle conditions to determine the advent of failure and mitigate system risks [11]. It allows the reliability of a deployed product to be assessed by monitoring the environmental, operational, and functional parameters to identify the degradation of the product. PHM monitors performance degradation, such as the variation of performance parameters from their expected values. PHM also monitors physical degradation, such as material cracking, corrosion, interfacial delamination, or increases in electrical resistance or threshold voltage. In addition, PHM monitors changes in the life cycle profile (LCP), such as usage duration and frequency, ambient temperature and humidity, vibration, and shock [12]. PHM provides several benefits, including early warning of failures for a product; a maintenance schedule to avoid/minimize unscheduled downtime; life cycle cost reduction of equipment due to the reduction of inspection costs, downtime, and inventory; and qualification improvement of current equipment and assistance in the design of future equipment [12].

There are three PHM approaches, including the physics-of-failure (PoF) approach, the data-driven approach, and a combination of both (the fusion approach). This section introduces these three approaches and the monitoring techniques for their implementation.


7.3.1 Monitoring Techniques for PHM

Data collection is an essential part of PHM, and appropriately monitoring environmental and operational parameters is one of the key steps in the implementation of PHM. Several techniques are used to monitor the equipment for PHM, including built-in test (BIT), sensors, fuses, and canaries. This chapter introduces sensors as examples.

Sensors are devices that convert a physical parameter into a signal that can be measured electrically, by converting physical, chemical, or biological phenomena into electrical signals [12]. There are several types of sensors: thermal sensors, electrical sensors, mechanical sensors, humidity sensors, biosensors, chemical sensors, optical sensors, and magnetic sensors. Some examples are shown in Table 7.3 [12].

7.3.2 Physics-of-Failure Approach

Physics-of-failure is a PHM approach that utilizes knowledge of a product’s life cycle loading and failure mechanisms to perform reliability assessment [12, 13, 14]. This approach identifies potential failure modes, failure mechanisms, and failure sites of a system with consideration of its specific life cycle loading condition, material, and hardware architecture.

Table 7.3 Examples of sensor measurands for PHM [12]

Domain: Examples

Thermal: Temperature (ranges, cycles, gradients, ramp rates), heat flux, heat dissipation

Electrical: Voltage, current, resistance, inductance, capacitance, dielectric constant, charge, polarization, electric field, frequency, power, noise level, impedance

Mechanical: Length, area, volume, velocity or acceleration, mass flow, force, torque, stress, strain, density, stiffness, strength, direction, pressure, acoustic intensity or power, acoustic spectral distribution

Humidity: Relative humidity, absolute humidity

Biological: pH, concentration of biological molecule, microorganisms

Chemical: Chemical species, concentration, concentration gradient, reactivity, molecular weight

Optical (radiant): Intensity, phase, wavelength, polarization, reflectance, transmittance, refractive index, distance, vibration, amplitude, frequency

Magnetic: Magnetic field, flux density, magnetic moment, permeability, direction, distance, position, flow


7.3.2.1 The Physics-of-Failure Methodology

The physics-of-failure methodology is based on the analysis of failures due to fundamental mechanical, chemical, electrical, thermal, and radiation processes. This approach calculates the cumulative damage resulting from the identified failure mechanisms of a system under its particular operating condition, and then provides early warning of failures. A schematic of the physics-of-failure-based PHM approach (which includes failure modes, mechanisms, and effects analysis (FMMEA)) is shown in Fig. 7.1.

This approach starts with identifying the product material properties and geometries, which can help define items and identify which elements and functions need to be analyzed. The next step is to identify the potential failure modes based on the estimation of the product life cycle loading and monitoring of the life cycle environment and operational loading. Then, the potential failure causes, potential failure mechanisms, and failure models are identified. The failure mechanisms are prioritized based on their frequency of occurrence and consequences, which can be found in maintenance records. The identification process must be documented. The parameters and locations involved in the high-priority failure mechanisms need to be monitored, and then fuse and canary devices, which suffer the same failure mechanisms as the product but fail before the product’s failure, can be used to estimate the product’s remaining useful life (RUL). At the same time, data reduction and load feature extraction can be conducted by monitoring the life cycle environment and operational loading, and then the stress/strain and damage can be calculated to estimate the product RUL.
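
A minimal sketch of the damage-accumulation and RUL step in this flow, combining counted temperature cycles with an assumed cycles-to-failure model through Miner’s rule; the model constants and cycle counts are placeholders, not a validated model for any particular part:

def cycles_to_failure(delta_t_c):
    # Assumed inverse power-law fatigue model: N_f = C * (delta_T)^(-n)
    c, n = 1.0e7, 2.5
    return c * delta_t_c ** (-n)

# (temperature swing in deg C, cycles counted per day) from monitored load data
daily_cycle_counts = [(10.0, 20), (25.0, 4), (40.0, 1)]

daily_damage = sum(count / cycles_to_failure(dt) for dt, count in daily_cycle_counts)
damage_so_far = 0.20  # fraction of life already consumed, from the monitored history
rul_days = (1.0 - damage_so_far) / daily_damage
print(f"Daily damage: {daily_damage:.5f}, estimated remaining useful life: {rul_days:.0f} days")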

(Flowchart summary: define the item and identify the elements and functions to be analyzed; identify potential failure modes, failure causes, and failure mechanisms; identify failure models; prioritize the failure mechanisms; document the process; choose monitoring parameters and locations; place fuse or canary devices; monitor the life cycle environment and operating loading; conduct data reduction and load feature extraction; perform stress/strain and damage calculation; estimate the RUL. Inputs include material properties and product geometries, estimated life cycle loading, and maintenance records.)

Fig. 7.1 Physics-of-failure approach [12]

7.3.2.2 Product Configuration and Materials

One key step to implement the physics-of-failure approach is to characterize the configuration and material information of a product, which includes the product architecture, materials, and manufacturing processes. This information provides fundamental knowledge for the reliability assessment.

A system usually consists of a number of subsystems and components working together to deliver an overall function. The architecture of a product is the physical and functional relationships between the subsystems (down to the component level), and the configuration of a product is the design of the components and subsystems and the product architecture. The effect of the manufacturing process on the product may also be considered [14]. Generally, the hardware of electronic equipment includes electronic parts (e.g., a chip, a resistor, or a capacitor), printed circuit boards, connectors, and enclosures. The configuration of an electronic part includes the part geometry and structure and the connection methods, such as wirebonds or solder balls. The configuration and material information of a printed circuit board usually includes the materials; layer stacks; connections between layers; additions to the layers, such as heat spreaders; and elements such as stiffeners [14].

The materials in a product affect the stress level of the product under external and internal loads and also the damage accumulation process [15, 16]. It is necessary to identify the physical properties of the materials to analyze their impacts on damage accumulation in the product [17, 18]. For example, stress arising from repeated temperature excursions may cause the failure of a solder joint. In such a case, the coefficient of thermal expansion of the solder joint material needs to be identified to determine the cyclic stress state. In another example, a reduction in the contact force between connector elements may cause the failure of a solder joint. In this case, the elastic moduli of the connector elements, the loading elements, and their housings are used to determine the contact force between the connector elements and the solder joint degradation pattern. The properties of some common materials for electronic products can be found in [17–19].
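
The first example above (cyclic stress from repeated temperature excursions) can be made concrete with a rough first-order calculation. The sketch below is an illustrative approximation, not a model from this book: it estimates the cyclic shear strain in a solder joint from the coefficient of thermal expansion (CTE) mismatch between a component and the board, and all property values and dimensions are hypothetical.

```python
# Illustrative sketch (not from the book): first-order estimate of the cyclic
# shear strain in a solder joint caused by CTE mismatch during a temperature
# excursion. All parameter values below are hypothetical.

def cyclic_shear_strain(cte_component_ppm, cte_board_ppm, delta_t_c,
                        distance_to_neutral_point_mm, joint_height_mm):
    """First-order shear strain: gamma = |alpha1 - alpha2| * dT * L / h."""
    delta_cte = abs(cte_component_ppm - cte_board_ppm) * 1e-6  # 1/degC
    return delta_cte * delta_t_c * distance_to_neutral_point_mm / joint_height_mm

# Example: ceramic component (~7 ppm/degC) on an FR-4 board (~17 ppm/degC),
# 40 degC daily swing, 10 mm from the neutral point, 0.1 mm joint height.
gamma = cyclic_shear_strain(7.0, 17.0, 40.0, 10.0, 0.1)
print(f"Estimated cyclic shear strain: {gamma:.3f}")  # ~0.04
```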

A single manufacturing process is usually not sufficient; the final product often requires a sequence of different manufacturing processes to achieve all the required attributes. Some material properties may be changed by the residual stresses produced during these manufacturing processes. For example, the thermo-physical properties of a printed circuit board can be affected by a lead-free reflow profile. As a result, the material information also needs to include characterization of the material property variations caused by different manufacturing processes.

7.3.2.3 Life Cycle Loading Monitoring

The next step in the physics-of-failure approach is to understand the LCP of products. An LCP is a time history of events and conditions associated with a product from the time of its release from manufacturing to its removal from service. The life cycle includes the phases that an item will encounter in its life, such as shipping, handling, and storage prior to use; mission profiles while in use; phases between missions, such as stand-by, storage, and transfer to and from repair sites and alternate locations; geographical locations of expected deployment and maintenance; and maintenance and repair procedures for the system and its components.

The LCP is the basis for selecting physics-of-failure assessment conditions for a product, including load types and severity levels. The major task in understanding the LCP is to characterize the loads applied to a product during its life cycle, because these loads drive the processes that lead to product degradation and failure. The life cycle loads include assembly- and installation-related loads, environmental loads, and operational loads. These loads can be thermal, mechanical, chemical, physical, or operational [20]. Various combinations and levels of these loads can influence the reliability of a product, and the extent and rate of product degradation depend upon the nature, magnitude, and duration of exposure to such loads. The environmental loading for a component should be characterized from its surrounding environment, as well as from within the component, rather than from the system-level environment. For example, when a silicon chip is operating, the temperature and humidity of its local environment affect its function and reliability, as does the heat generated within the chip. However, when the temperature increases in the data center, the temperature variation of the chip is determined by the local operating temperature and the cooling algorithm of the fan, rather than by the room temperature increase [21]. Sensors may therefore be used to monitor the LCP of a product as temperatures increase under the efficient cooling conditions.

Loads can be applied to a product throughout its life cycle, including manufacturing, shipment, storage, handling, and operating and non-operating conditions. Any individual or combined loading can cause performance or physical degradation of the product or can reduce its service life [22]. The product degradation rate depends on the magnitude and duration of exposure (usage rate, frequency, and severity) to the loads. If these load profiles can be monitored in situ, the cumulative degradation can be evaluated based on the load profiles and damage models. Some typical life cycle loads are summarized in Table 7.4.

Table 7.4 Life cycle loads [22]

Thermal: steady-state temperature, temperature ranges, temperature cycles, temperature gradients, ramp rates, heat dissipation
Mechanical: pressure magnitude, pressure gradient, vibration, shock load, acoustic level, strain, stress
Chemical: contamination, ozone, pollution, fuel spills
Physical: radiation, electromagnetic interference, altitude
Electrical: current, voltage, power

Ramakrishnan and Pecht [23] assessed the impact of life cycle usage and environmental loads on electronic structures and components. They introduced the life consumption monitoring (LCM) methodology (Fig. 7.2). Life consumption monitoring can be used to estimate the remaining product life by combining loads measured in situ with physics-based stress and damage models.

7.3.2.4 Failure Causes, Modes, Mechanisms, and Models

The identification and ranking of the failure causes, modes, and mechanisms of a product under its specific operating conditions can help to identify the weakest components involved in the dominant failure mechanisms. A failure cause is defined as "the specific process, design, and/or environmental condition that initiated the failure, the removal of which will eliminate the failure" [14].

(Flowchart summary: (1) conduct failure modes, mechanisms, and effects analysis; (2) conduct a virtual reliability assessment to identify the failure mechanisms with the earliest times-to-failure; (3) monitor appropriate product parameters, both environmental (e.g., shock, vibration, temperature, humidity) and operational (e.g., voltage, power, heat dissipation); (4) conduct data simplification for model input; (5) perform damage assessment and damage accumulation; (6) estimate the remaining life of the product (e.g., data trending, forecasting models, regression analysis). If the remaining life is acceptable, continue monitoring; otherwise, schedule a maintenance action.)

Fig. 7.2 Life consumption monitoring methodology [23]

For a specific product, the identification of failure causes is useful for determining the failure mechanisms leading to product failure.

A failure mode is the effect by which a failure is observed to occur [24]. It can also be defined as "the way in which a component, subsystem, or system fails to meet or deliver the intended function" [14]. Failure modes can be observed by visual inspection, electrical measurement, or other tests and measurements. All possible failure modes need to be identified using numerical stress analysis, appropriate accelerated tests, past experience, product information from similar products in a technology family, and engineering judgment.

Failure mechanisms are "the physical, chemical, thermodynamic, or other processes that result in failure" [14]. Generally, there are two types of failure mechanisms: overstress failure mechanisms, which result in failure due to a single load (stress) condition that exceeds a fundamental material strength [14]; and wear-out failure mechanisms, which result in failure due to cumulative damage from loads (stresses) applied over an extended period of time or number of cycles. PHM can only be applied to wear-out failure mechanisms, since overstress failure mechanisms usually result in the sudden failure of the product and are not caused by accumulated damage. Typical wear-out failure mechanisms for electronics are summarized in Table 7.5 [25].

Failure models are used to evaluate the time-to-failure or the likelihood of failure based on the product information, including geometry, material construction, and environmental and operational conditions. For wear-out mechanisms, failure models use both stress and damage analysis to quantify the product damage accumulation [14].
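
As a hedged illustration of how such a failure model turns load information into a time-to-failure estimate, the sketch below evaluates a Coffin-Manson-type power-law fatigue model of the kind listed in Table 7.5. The coefficient and exponent used here are placeholders, not values from this book or from the cited references.

```python
# Hedged illustration: Coffin-Manson-type fatigue model, N_f = C * (strain range)^(-n).
# The coefficient C and exponent n are placeholders; real values must come from
# test data or the literature for the specific solder alloy and geometry.

def cycles_to_failure_coffin_manson(plastic_strain_range, c=0.5, n=2.0):
    """Cycles to failure for a given cyclic plastic strain range."""
    return c * plastic_strain_range ** (-n)

# Example: compare two thermal-cycling conditions.
for strain_range in (0.01, 0.02):
    nf = cycles_to_failure_coffin_manson(strain_range)
    print(f"strain range {strain_range:.3f} -> approx. {nf:,.0f} cycles to failure")
```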

A product may be operated under several different environments or stress levels, which may activate several failure mechanisms, but generally only a few failure mechanisms are responsible for the majority of failures.

Table 7.5 Examples of failure mechanisms, relevant loads, and models [25]

Fatigue. Failure sites: die attach, wirebond/TAB, solder leads, bond pads, traces, vias/PTHs, interfaces; relevant loads: ΔT, Tmean, dT/dt, dwell time, ΔH, ΔV; failure model: nonlinear power law (Coffin-Manson)
Corrosion. Failure sites: metallization; relevant loads: M, ΔV, T; failure model: Eyring (Howard)
Electromigration. Failure sites: metallization; relevant loads: T, J; failure model: Eyring (Black)
Conductive filament formation. Failure sites: between metallization; relevant loads: M, ∇V; failure model: power law (Rudra)
Stress-driven diffusion voiding. Failure sites: metal traces; relevant loads: S, T; failure model: Eyring (Okabayashi)
Time-dependent dielectric breakdown. Failure sites: dielectric layers; relevant loads: V, T; failure model: Arrhenius (Fowler-Nordheim)

Δ: cyclic range; ∇: gradient; V: voltage; M: moisture; T: temperature; J: current density; S: stress; H: humidity

The failure mechanisms are prioritized based on their occurrence and severity. The components involved in the dominant failure mechanisms with the highest priorities are considered to be the weakest components, and they are the most likely to fail first in the product.

7.3.2.5 Damage Assessment and Remaining Life Calculation

Most failure models define the time-to-failure under a specific loading condition. Most products, however, experience multiple loading conditions in their LCP, which requires methods to evaluate the time-to-failure over multiple loading conditions. One way to do this is to calculate the damage ratio for a specific failure mechanism, expressed as the ratio of the exposure time at a stress condition to the time-to-failure for the component at that stress condition. The damage accumulation for this specific failure mechanism is estimated as the sum of the damage ratios over all loading conditions, and the total damage of the product is the sum of the damage accumulation from all identified failure mechanisms. When the total damage equals one, the product is considered to have failed. The estimated time for the total damage to reach one is also an estimate of the RUL.
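
A minimal sketch of this damage-ratio bookkeeping is shown below, assuming a linear (Miner's-rule-style) accumulation across loading conditions; the loading conditions, exposure times, and times-to-failure are hypothetical.

```python
# Hedged sketch of linear damage accumulation across multiple loading conditions.
# Each condition contributes (time spent) / (time-to-failure at that condition);
# failure is predicted when the summed damage reaches 1. Values are hypothetical.

conditions = [
    # (hours spent so far, hours-to-failure predicted by the failure model)
    {"name": "summer, high inlet temperature", "hours": 2000.0, "ttf_hours": 30000.0},
    {"name": "winter, low inlet temperature",  "hours": 3000.0, "ttf_hours": 80000.0},
]

damage = sum(c["hours"] / c["ttf_hours"] for c in conditions)
print(f"Accumulated damage: {damage:.3f}")

# Rough RUL estimate assuming the future usage mix matches the history so far.
hours_so_far = sum(c["hours"] for c in conditions)
damage_rate_per_hour = damage / hours_so_far
rul_hours = (1.0 - damage) / damage_rate_per_hour
print(f"Estimated remaining useful life: {rul_hours:,.0f} hours")
```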

The life consumption monitoring methodology was applied to conduct a prognostic remaining life assessment of circuit cards inside a space shuttle solid rocket booster (SRB) [26]. Cumulative damage was estimated based on the recorded vibration history of the SRB from the prelaunch stage to splashdown, as well as on physics-based models. With the entire recorded life cycle loading profile of the solid rocket boosters, the RUL values of the components were estimated. The vibration and shock analysis identified significant life loss of the aluminum brackets due to shock loading, which caused unexpected electrical failure of the circuit board cards [26].

7.3.3 Data-Driven Approach

A pure data-driven approach does not consider the failure mechanisms; it utilizes only data analysis to monitor and analyze the trends in product degradation. The monitored data of a system include environmental data, system operating data, and performance data. The data need to be monitored from the beginning of product operation, when the product is considered healthy and has no degradation. These healthy data are later used as a baseline to identify the extent of shifts in the monitored parameters and the extent of degradation of the product. An anomaly is considered to occur when the parameter data are outside the range of these historical data.

The general flowchart of the data-driven approach is shown in Fig. 7.3 [27]. It starts with a functional evaluation of the system under consideration.

After a feasibility study, various data acquisition techniques are reviewed and selected to gather system performance information in practice. From the data gathered by the sensors, a number of features can be extracted after the raw data are cleaned and normalized to reduce noise and scaling effects. These data features can be used to establish the healthy state of the system and also to identify performance or physical degradation due to product wear-out. Threshold limits on these features are set to define system failure. The data trend is monitored over time to perform system prognostics and estimate the RUL based on pre-defined failure criteria.
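
As a minimal illustration of the baseline-and-threshold idea described above (not the specific algorithms used in the cited works), the sketch below builds a healthy baseline from early readings of a hypothetical monitored parameter and flags readings that drift outside a simple 3-sigma band.

```python
# Hedged sketch: build a healthy baseline from early monitoring data and flag
# an anomaly when a monitored parameter drifts outside the baseline range.
# The data, the 3-sigma threshold, and the parameter itself are hypothetical.
import statistics

healthy_window = [45.2, 44.8, 45.5, 45.1, 44.9, 45.3, 45.0]  # e.g., chip temperature, degC
baseline_mean = statistics.mean(healthy_window)
baseline_std = statistics.stdev(healthy_window)

def is_anomalous(value, n_sigma=3.0):
    """Flag values outside mean +/- n_sigma * std of the healthy baseline."""
    return abs(value - baseline_mean) > n_sigma * baseline_std

for reading in (45.4, 46.1, 48.7):
    print(reading, "anomalous" if is_anomalous(reading) else "within baseline")
```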

Torres et al. [28] utilized signal integrity parameters associated with signal distortion, power plane integrity, and signal transmission quality to reduce uncertainty; the signal return paths, power, and ground networks of those signals were monitored to identify deviations from the healthy data baseline. Measuring signal integrity parameters simplifies the PHM monitoring algorithms by limiting deviations from the expected normal data. Signal integrity techniques are well-established in electronic systems for high-speed device-to-system simulation, and signal integrity measurement can provide accurate monitoring for the implementation of PHM.

Orchard and Vachtsevanos [29] proposed an on-line particle-filtering (PF)-based framework for fault diagnosis and failure prognosis in nonlinear, non-Gaussian systems.

(Flowchart summary: functional considerations (system-level priorities; feasibility and limitations; environment and usage conditions; economic justification; parameter selection) → data acquisition (excitation scripts, sensing, data transmission, data storage; data cleansing, normalization, and noise reduction) → data feature extraction (absolute statistics such as mean and standard deviation; relative measures such as error functions and differences in features; sensitivity, dimensionality, and computational requirements) → health estimation (baseline creation; real-time health estimation; past experience; multivariate analysis) → diagnostics (anomaly detection; parameter contributions) → prognostics (fault prediction; remaining useful life estimation; feature trending; prognostic measures).)

Fig. 7.3 Data-driven approach [27]

The results from real-world applications showed that the PF methodology can be used for both fault diagnosis and failure prognosis, providing a smooth transition from detection to prediction by employing the last detection probability density function (PDF) as the initial condition for prognostics. Chen et al. [30, 31] developed an integrated PF framework to perform failure prognostics, in which an adaptive neuro-fuzzy inference system was built as a high-order hidden Markov model to capture the fault degradation characteristics.
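
To make the particle-filtering idea concrete, the sketch below implements a generic bootstrap particle filter that tracks a hypothetical degradation state with a random-walk-with-drift model. It is not the framework of the cited works; the degradation model, noise levels, and failure threshold are assumptions.

```python
# Hedged sketch of a bootstrap particle filter tracking a degradation state x_k
# with a simple random-walk-with-drift model. This is NOT the implementation of
# the cited works; the model, noise levels, and failure threshold are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n_particles = 1000
drift, process_std, meas_std = 0.01, 0.005, 0.02
failure_threshold = 1.0

particles = np.zeros(n_particles)  # initial degradation state
measurements = (np.cumsum(rng.normal(drift, process_std, 50))
                + rng.normal(0, meas_std, 50))  # synthetic sensor data

for z in measurements:
    # Predict: propagate each particle through the degradation model.
    particles += drift + rng.normal(0, process_std, n_particles)
    # Update: weight particles by the likelihood of the measurement.
    weights = np.exp(-0.5 * ((z - particles) / meas_std) ** 2)
    weights /= weights.sum()
    # Resample (bootstrap) according to the weights.
    particles = rng.choice(particles, size=n_particles, p=weights)

estimate = particles.mean()
steps_to_failure = max(failure_threshold - estimate, 0.0) / drift
print(f"Estimated degradation state: {estimate:.3f}; "
      f"approx. {steps_to_failure:.0f} steps until the failure threshold")
```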

Gillblad et al. [32] presented an improved statistical model derived from empirical data to produce satisfactory classification rates. Their model used only a small amount of initial training data supplemented by expert knowledge and examples of actual diagnostic cases. The model used an inconsistency-checking unit to eliminate the effects of input noise and thereby improve the prediction accuracy. Three sets of discrete data and one set of continuous data were monitored in the experiments, and prototypes were constructed for each diagnosis to build a diagnostic system for a particular application. The obtained case data were then used for system diagnosis. If case data are not available, the diagnosis relies solely on prototypical data.

Artificial neural networks (ANNs) are the most widely used data-driven method for fault diagnostics and failure prognostics. An ANN is able to learn unknown nonlinear functions by adjusting its weight values. Depending on the model structure, ANNs are generally classified as feedforward neural networks [33–38], radial basis function neural networks [39–41], or recurrent neural networks [42–44]. Feedforward neural networks have been successfully applied in machine fault diagnostics. More recently, they have also been adopted for failure prognostics. For example, Gebraeel et al. [45] used a set of feedforward neural networks to predict the residual life of bearings. Other researchers [46, 47] have used condition monitoring data as input and life percentage as output to train and validate feedforward neural networks. In Tian et al. [47], both failure and suspension condition monitoring data were used to model the failure process. A radial basis function neural network contains radial basis functions in its hidden nodes, which in some applications results in better failure prediction performance compared to feedforward neural networks. Since recurrent neural networks possess feedback links in the model structure, they are capable of dealing with dynamic processes. This feature is very useful in failure prognostics because prediction is always a dynamic process. For instance, a recurrent neural network was used to forecast the fault trends of various mechanical systems in Tse and Atherton [43]; the results showed that its prediction accuracy was much better than that of a feedforward neural network. Self-organizing map (SOM) neural networks are another type of ANN. Unlike the aforementioned ANNs, a SOM does not need supervised training, and the input data can be automatically clustered into different groups. Huang et al. [48] used a SOM to extract a new feature that shows the deviation of bearing performance from normal conditions; multiple feedforward neural networks were then trained on this feature to estimate the residual life of bearings through a fusion strategy.
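
As a hedged illustration of the feedforward-network formulation described above (condition-monitoring features in, life percentage out), the sketch below trains a small network on synthetic data using scikit-learn. It is not the cited authors' implementation; the features, data, and network size are assumptions.

```python
# Hedged sketch: train a small feedforward neural network to map condition-
# monitoring features to life percentage, in the spirit of the approaches cited
# above. The data are synthetic; this is not the cited authors' code.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 500
life_pct = rng.uniform(0, 100, n)                             # target: life consumed (%)
vibration = 0.1 + 0.002 * life_pct + rng.normal(0, 0.01, n)   # synthetic feature 1
temperature = 40 + 0.05 * life_pct + rng.normal(0, 0.5, n)    # synthetic feature 2
X = np.column_stack([vibration, temperature])

model = make_pipeline(StandardScaler(),
                      MLPRegressor(hidden_layer_sizes=(16, 16),
                                   max_iter=2000, random_state=0))
model.fit(X, life_pct)

# Predict life percentage (and hence remaining life) for a new observation.
new_obs = np.array([[0.25, 44.0]])
predicted_pct = float(model.predict(new_obs)[0])
print(f"Predicted life consumed: {predicted_pct:.1f}% "
      f"(remaining: {100 - predicted_pct:.1f}%)")
```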

One key step in the data-driven approach is to identify failure precursors. A failure precursor is a data event or trend that signifies impending failure [12]. A failure precursor is usually identified by measuring changes in the variables that can be associated with subsequent failure.

For example, a shift in the output voltage of a power supply might suggest impending failure due to a damaged feedback regulator. Failures can then be predicted by using causal relationships between measured variables.

Pecht et al. [49] proposed several possible failure precursor parameters for electronic products, including switching power supplies, cables and connectors, CMOS integrated circuits (ICs), and voltage-controlled high-frequency oscillators (see Table 7.6).

Table 7.6 Potential failure precursors for electronics [49]

Switching power supply: direct current (DC) output (voltage and current levels), ripple current, pulse width duty cycle, efficiency, feedback (voltage and current levels), leakage current, radio frequency (RF) noise
Cables and connectors: impedance changes, physical damage, high-energy dielectric breakdown
CMOS IC: supply leakage current, supply current variation, operating signature, current noise, logic-level variations
Voltage-controlled oscillator: output frequency, power loss, efficiency, phase distortion, noise
Field effect transistor: gate leakage current/resistance, drain-source leakage current/resistance
Ceramic chip capacitor: leakage current/resistance, dissipation factor, RF noise
General purpose diode: reverse leakage current, forward voltage drop, thermal resistance, power dissipation, RF noise
Electrolytic capacitor: leakage current/resistance, dissipation factor, RF noise
RF power amplifier: voltage standing wave ratio (VSWR), power dissipation, leakage current

7.3.4 Fusion Approach

A fusion PHM method combines the merits of both the data-driven and physics-of-failure methods, compensating for the weaknesses of each, and is expected to give better prognostic results than either method alone. The process of the fusion method is illustrated in Fig. 7.4 [50].

The strength of the physics-of-failure method is its ability to identify the root causes and failure mechanisms that contribute to system failure, as well as its ability to give predictions of RUL under different usage loadings even before the device is actually in use. The weakness of the physics-of-failure method is that prognostics is conducted with the assumption that all components and assemblies are manufactured in the same way. An extreme example of this problem is that physics-of-failure cannot predict field failure for a chip package without a die in it.

The advantage of the data-driven method is that it can track any anomaly in the system, no matter what mechanism caused it. The weak point of the data-driven method is that, without knowledge about what mechanism caused the anomaly, it is very difficult to set a threshold to link the level of data anomaly to the failure definition.

A fusion of the two methods can compensate for the disadvantages of each by allowing observation of any unexpected anomaly not listed among the failure mechanisms, while at the same time identifying known mechanisms that can cause it. It can then set a reasonable threshold for warning of impending failure at a certain level of anomalous behavior.

(Flowchart summary: identify parameters → establish a healthy baseline → monitor continuously → anomaly? If no, continue monitoring; if yes, raise an alarm, perform parameter trending and parameter isolation, define failure using historical databases, standards, and POF models, and estimate the remaining useful life.)

Fig. 7.4 Flowchart of the fusion approach with physics-of-failure and data-driven models [50]

Cheng et al. [51] utilized the fusion method for RUL prediction of multilayer ceramic capacitors (MLCCs). This was carried out in nine steps. The first step was parameter identification: FMMEA was used to identify potential failure mechanisms and determine the parameters to be monitored. The MLCCs in the case study underwent temperature-humidity bias (THB) testing. Two possible failure mechanisms were identified: silver migration and overall degradation of the capacitor dielectric. The next steps were parameter monitoring and creation of a healthy baseline from the training data, followed by data trend monitoring using the multivariate state estimation technique (MSET) and sequential probability ratio test (SPRT) algorithms. When an anomaly is detected, the RUL can be predicted at that moment. All three monitored parameters contributed to the anomaly detection. During the anomaly detection, MSET generated residuals for each parameter. The residuals of the three parameters were used to generate a residual vector, which is an indicator of the degradation of the monitored MLCCs relative to the baseline. When an anomaly is detected, the parameters are isolated, and the failure can be defined based on the potential failure mechanism. If physics-of-failure models are available, failure can be defined based on the physics-of-failure model and the corresponding parameters; otherwise, it should be based on historical data. The example given in the article predicted a failure time of between 875 and 920 h; the actual failure occurred at 962 h. Therefore, the fusion PHM method predicted the failure of this capacitor in advance.
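
The SPRT step can be illustrated with a generic Wald sequential probability ratio test applied to residuals between monitored values and a healthy-baseline estimate. The sketch below assumes Gaussian residuals, and the noise level, mean shift, and error rates are hypothetical; it is not the MSET/SPRT implementation of the cited study.

```python
# Hedged sketch of a Wald sequential probability ratio test (SPRT) applied to
# residuals (monitored value minus healthy-baseline estimate). It tests a zero-
# mean "healthy" hypothesis against a shifted-mean "degraded" hypothesis. The
# noise level, mean shift, and error rates are hypothetical.
import math

sigma = 0.05      # assumed residual standard deviation under healthy conditions
mu1 = 0.10        # mean shift to be detected
alpha, beta = 0.01, 0.01                 # false-alarm and missed-alarm rates
upper = math.log((1 - beta) / alpha)     # accept "degraded" above this
lower = math.log(beta / (1 - alpha))     # accept "healthy" below this

def sprt(residuals):
    llr = 0.0
    for k, r in enumerate(residuals, start=1):
        # Log-likelihood ratio increment for N(mu1, sigma) vs N(0, sigma).
        llr += (mu1 / sigma**2) * (r - mu1 / 2.0)
        if llr >= upper:
            return f"anomaly detected at sample {k}"
        if llr <= lower:
            llr = 0.0  # common restart convention after accepting "healthy"
    return "no decision yet"

print(sprt([0.01, -0.02, 0.03, 0.08, 0.12, 0.11, 0.13]))
```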

Patil et al. [52] proposed using the fusion method for remaining life prediction of an insulated gate bipolar transistor (IGBT) power module, but the implementation of the process has not yet been performed. The proposed method is as follows. First, the parameters to be monitored in the application during operation are identified. Examples of parameters include the collector-emitter ON voltage, ambient temperature, and module strains. A baseline for the healthy behavior of the parameters is established. Identified parameters are continuously monitored in the application and compared with the healthy baseline. When anomalies are detected, the parameters contributing to the anomalies are isolated. Using failure thresholds, methods such as regression analysis can be applied to trend the isolated parameters over time. Further, the isolation of the parameters causing anomalous behavior helps identify the critical failure mechanisms in operation. In a power module, a drop in collector-emitter voltage indicates damage in the solder die attach. Trending of the collector-emitter voltage would provide data-driven RUL estimates that could then be fused with the physics-of-failure estimates. The fusion approach can therefore provide an estimate of the RUL of a product based on the combination of information from anomaly detection, parameter isolation, physics-of-failure models, and data-driven techniques.

7.3.5 Use for the Efficient Cooling Methods

PHM approaches can be used in data centers with the efficient cooling methods, since these approaches monitor the equipment health status to provide early warnings of failures independent of the cooling method. There are some special benefits for free air cooling. Traditional reliability assessment methods and current standards-based qualification methods are insufficient to estimate the reliability of telecom equipment when free air cooling is implemented in data centers. If the data centers were not originally designed for free air cooling, it is not practical to take the equipment out of service from installed telecom infrastructures (e.g., data centers) for test purposes. Even if such a task could be undertaken, the tested equipment would need to be sacrificed, since it would lose an unknown amount of useful life during the test process. It would also be impossible to gather the necessary sample size for any equipment-level evaluation from the systems already in operation. Accelerating the life cycle conditions for a complete data center is also not a feasible option; it would be prohibitively expensive and would not result in useful information regarding the reliability of the system. On the other hand, it is risky to take no action and simply track the failure events that occur, since there may not be time to take appropriate action.

It is usually also not practical for the manufacturers to assess the telecom equipment before shipment, since the primary unknown with free air cooling is the operating environment at the equipment level. Generally, this environment depends on the local weather at the telecom infrastructure and usually changes with local seasonal and diurnal weather variations. In addition, the various architectures of data centers may also cause different operating environments for telecom equipment. With an unknown operating environment, it is not possible for the manufacturers to evaluate whether the equipment can be used under free air cooling conditions with high reliability.

PHM can help overcome the above difficulties. PHM uses in situ life cycle load monitoring to identify the onset of abnormal behavior that may lead to either intermittent out-of-specification characteristics or permanent equipment failure. It can provide advance warning of failures, minimize unscheduled maintenance, reduce equipment downtime, and improve product design. This chapter provided a basic introduction to PHM, the monitoring techniques for PHM, and PHM approaches. The implementation of PHM approaches for the efficient cooling methods will be introduced in Chap. 8.

7.4 Other Approaches

There are some other approaches that can be used for part reliability assessment under the efficient cooling conditions. This section introduces accelerated testing as an example.

Many reliability tests are accelerated tests, since a higher level of stress can cause failures within a shorter period of time than the intended life cycle of the product. Accelerated testing allows for reduced test times by providing test conditions that "speed up" the evolution of failures, thus saving time-to-market for a product. Accelerated testing measures the performance of the test product at loads or stresses that are more severe than would normally be encountered in order to enhance the damage accumulation rate within a reduced time period. The failure mechanisms, sites, and modes in the accelerated environment must be the same as (or quantitatively correlated with) those observed, or predicted, under actual usage conditions, and it must be possible to quantitatively extrapolate from the accelerated environment to the usage environment with a reasonable degree of assurance.

The determination of accelerated test conditions considers not only the dominant failure mechanisms, but also the strength limits and margins of products. Strength limits can be obtained by the highly accelerated life test (HALT). The purpose of HALT is to expose design weaknesses by iteratively subjecting the product to increasingly higher levels of stress and then learning which aspects or components should be improved. HALT is the first physical test performed during the product qualification stage [53].

In accelerated testing, HALT can be used to identify the operational and destruct limits and margins, known as the "strength limits," as shown in Fig. 7.5. The limits include the upper and lower specification limits, the upper design margin, the upper operating limit, and the upper destruct limit. The specification limits are usually provided by the manufacturer and determine the load ranges under accelerated testing based on analysis of the experimental equipment capability, expected experimental duration, and other constraints. The design limits are the stress conditions at which the product is designed to survive. The operational limits of the product are reached when the product can no longer function at the accelerated conditions due to a recoverable failure.

(Diagram summary: along a stress axis, the lower destruct margin, lower operating margin, lower design margin, and lower specification limit lie below the use range, and the upper specification limit, upper design margin, upper operating margin, and upper destruct margin lie above it.)

Fig. 7.5 Strength limits and margins diagram [53]

The stress value at which the product fails permanently and catastrophically is identified as the destruct limit. Generally, large margins are desired between the operational and destruct limits, and between the actual performance stresses and the specification limits of the product, to ensure higher inherent reliability [53].

A sufficient number of samples is needed to identify the complete distribution characteristics of the strength limits and margins. The strength limits obtained from HALT can be used to design the accelerated test plan and screening conditions. Generally, accelerated testing is used to assess the wear-out failure mechanisms, and the testing conditions should not exceed the design limits; otherwise, the product may not function during the test. Furthermore, the accelerated test must be conducted within the specification limits, so a primary goal of HALT is to identify these limits and the margins around them. Test conditions are determined based on these limits and test constraints, such as the expected test duration. If the expected test duration is long enough, the test designer can select a relatively low stress load, which provides a relatively accurate result when the reliability at the operating condition is extrapolated from the test result. However, if there is not enough time to conduct the accelerated testing, it is necessary to put the product under a relatively high stress load to observe product failures.

The degree of stress acceleration is usually controlled by an acceleration factor, defined as the ratio of life under normal use conditions to life under the accelerated conditions. To calculate the acceleration factor, a model must exist. The model can be a physics-of-failure model, as described previously, or a curve-fitted model. The latter can be obtained by conducting a series of accelerated tests under various load conditions and then curve-fitting the results. Once a curve-fitting equation is developed, the time-to-failure for the actual use conditions in the field can be estimated by extrapolation of the equation.
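
A hedged numerical example of the acceleration factor is sketched below using an Arrhenius-type temperature model; the activation energy and temperatures are placeholders, not values from the book.

```python
# Hedged sketch: Arrhenius-type temperature acceleration factor,
# AF = exp[(Ea/k) * (1/T_use - 1/T_test)], with temperatures in kelvin.
# The activation energy below is a placeholder, not a value from the book.
import math

BOLTZMANN_EV = 8.617e-5  # eV/K

def arrhenius_af(ea_ev, t_use_c, t_test_c):
    t_use, t_test = t_use_c + 273.15, t_test_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_test))

af = arrhenius_af(ea_ev=0.7, t_use_c=45.0, t_test_c=105.0)
print(f"Acceleration factor: {af:.1f}")

# Life under use conditions ~= AF x life observed under the accelerated test.
test_life_hours = 1000.0
print(f"Projected use-condition life: {af * test_life_hours:,.0f} hours")
```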

7.5 Summary

This chapter introduced methods for part reliability prediction. Handbook-based reliability predictions have been used for decades; however, they do not consider the failure mechanisms and only provide limited insight into practical reliability issues. As a result, they cannot offer accurate predictions. This chapter presented PHM approaches to replace the handbook methods for part reliability assessment under the efficient cooling conditions.

All three PHM approaches (physics-of-failure, data-driven, and fusion) can be used to identify and mitigate the reliability risks of telecom equipment under the efficient cooling conditions. The physics-of-failure approach uses knowledge of a product's life cycle loading and failure mechanisms to perform reliability design and assessment. The data-driven approach uses mathematical analysis of current and historical data to provide signals of abnormal behavior and estimate RUL. The fusion approach combines physics-of-failure and data-driven models for prognostics, overcoming some of the drawbacks of using either approach alone.

The PHM approaches assess the reliability risks without interrupting telecom equipment service, and thus allow the implementation of the efficient cooling methods in data centers that were not originally designed for these cooling methods. More details will be presented in Chap. 8.

Part reliability assessment approaches are available at different product life cycle stages. At the design and test stages, manufacturers can use accelerated testing to predict part reliability. At the operation stage, when the products are being used in the field, the field data can be analyzed to estimate reliability. These two approaches calculate the part reliability based on part samples, which is an "average" reliability of all the part samples. When the part failure times vary widely, some part failures cannot be predicted accurately based on the "average" reliability. PHM approaches monitor an individual part's health condition and can predict when that part will fail.

References

1. M. White, Y. Chen, Scaled CMOS Technology Reliability Users Guide (Jet Propulsion Laboratory Publication, CA, 2008)
2. SAE G-11 Committee, Aerospace information report on reliability prediction methodologies for electronic equipment AIR5286, Draft Report, Jan 1998
3. Telcordia Technologies, Special Report SR-332: Reliability Prediction Procedure for Electronic Equipment, Issue 1 (Telcordia Customer Service, Piscataway, 2001)
4. W. Denson, A tutorial: PRISM. RAC J. 1–6 (1999)
5. Union Technique de L'Electricité, Recueil de données de fiabilité: RDF 2000, "Modèle universel pour le calcul de la fiabilité prévisionnelle des composants, cartes et équipements électroniques" (Reliability Data Handbook: RDF 2000 – A universal model for reliability prediction of electronic components, PCBs, and equipment), July 2000
6. A.G. Siemens, Siemens Company Standard SN29500, Version 6.0, Failure Rates of Electronic Components, Siemens Technical Liaison and Standardization, 9 Nov 1999
7. British Telecom, Handbook of Reliability Data for Components Used in Telecommunication Systems, Issue 4, Jan 1987
8. United States Department of Defense, U.S. MIL-HDBK-217F, Reliability Prediction of Electronic Equipment, Version F, U.S. Government Printing Office, 28 Feb 1995
9. J. Gu, M. Pecht, Health assessment and prognostics of electronic products: an alternative to traditional reliability prediction methods. Electron. Cool. 15(2), 10–16 (2009)
10. M.J. Cushing, D.E. Mortin, T.J. Stadterman, A. Malhotra, Comparison of electronics-reliability assessment approaches. IEEE Trans. Reliab. 42(4), 540–546 (1993)
11. S. Cheng, M. Azarian, M. Pecht, Sensor systems for prognostics and health management. Sensors 10, 5774–5797 (2010)
12. M. Pecht, Prognostics and Health Management of Electronics (Wiley-Interscience, New York, 2008)
13. P. Lall, M. Pecht, M. Cushing, A physics-of-failure approach to addressing device reliability in accelerated testing. Proceedings of the 5th European Symposium on Reliability of Electron Devices, Failure Physics and Analysis, Glasgow, Scotland, Oct 1994
14. J. Gu, M. Pecht, Prognostics-based product qualification. IEEE Aerospace Conference, Big Sky, MT, Mar 2009
15. A. Dasgupta, M. Pecht, Material failure mechanisms and damage models. IEEE Trans. Reliab. 40(5), 531–536 (1991)
16. M. Pecht, Handbook of Electronic Package Design (Marcel Dekker Inc, New York, 1991)
17. M. Pecht, R. Agarwal, P. McCluskey, T. Dishongh, S. Javadpour, R. Mahajan, Electronic Packaging Materials and Their Properties (CRC Press, Boca Raton, 1999)
18. S. Ganesan, M. Pecht, Lead-free Electronics, 2nd edn. (John Wiley & Sons, Inc., New York, 2006)
19. M. Pecht, L. Nguyen, E. Hakim, Plastic Encapsulated Microelectronics: Materials, Processes, Quality, Reliability, and Applications (John Wiley Publishing Co., New York, 1995)
20. P. Lall, M. Pecht, E. Hakim, The Influence of Temperature on Microelectronic Device Reliability (CRC Press, Boca Raton, 1997)
21. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Technical Committee (TC) 9.9, 2008 ASHRAE Environmental Guidelines for Datacom Equipment, Atlanta, GA, 2008
22. N. Vichare, M. Pecht, Prognostics and health management of electronics. IEEE Trans. Compon. Packag. Technol. 29(1), 222–229 (2006)
23. A. Ramakrishnan, M. Pecht, A life consumption monitoring methodology for electronic systems. IEEE Trans. Compon. Packag. Technol. 26(3), 625–634 (2003)
24. M. Pecht, Product Reliability, Maintainability, and Supportability Handbook (CRC Press, New York, 1995)
25. N. Vichare, P. Rodgers, V. Eveloy, M. Pecht, Environment and usage monitoring of electronic products for health (reliability) assessment and product design. IEEE Workshop on Accelerated Stress Testing and Reliability, Austin, TX, Oct 2005
26. S. Mathew, D. Das, M. Osterman, M. Pecht, R. Ferebee, Prognostic assessment of aluminum support structure on a printed circuit board. ASME J. Electron. Packag. 128(4), 339–345 (2006)
27. S. Kumar, M. Torres, M. Pecht, Y. Chan, A hybrid prognostics methodology for electronic systems. Paper presented at the WCCI-IJCNN 2008 Special Session on Computational Intelligence for Anomaly Detection, Diagnosis, and Prognosis, Hong Kong, China, 1–6 June 2008
28. M. Torres, E. Bogatin, Signal integrity parameters for health monitoring of digital electronics. 2008 Prognostics and Health Management International Conference, Denver, CO, 6–9 Oct 2008
29. M. Orchard, G. Vachtsevanos, A particle-filtering approach for on-line fault diagnosis and failure prognosis. Trans. Inst. Meas. Control 31(3/4), 221–246 (2009)
30. C. Chen, B. Zhang, G. Vachtsevanos, M. Orchard, Machine condition prediction based on adaptive neuro-fuzzy and high-order particle filtering. IEEE Trans. Ind. Electron. 58(9), 4353–4364 (2011)
31. C. Chen, G. Vachtsevanos, M. Orchard, Machine remaining useful life prediction based on adaptive neuro-fuzzy and high-order particle filtering. Annual Conference of the Prognostics and Health Management Society, Portland, OR, 10–16 Oct 2010
32. D. Gillblad, R. Steinert, A. Holst, Fault-tolerant incremental diagnosis with limited historical data. Prognostics and Health Management International Conference, Denver, CO, 6–9 Oct 2008
33. M.J. Roemer, C. Hong, S.H. Hesler, Machine health monitoring and life management using finite element-based neural networks. J. Eng. Gas Turbines Power – Trans. ASME 118, 830–835 (1996)
34. B. Li, M.-Y. Chow, Y. Tipsuwan, J.C. Hung, Neural-network-based motor rolling bearing fault diagnosis. IEEE Trans. Ind. Electron. 47, 1060–1069 (2000)
35. Y. Fan, C.J. Li, Diagnostic rule extraction from trained feedforward neural networks. Mech. Syst. Signal Process. 16, 1073–1081 (2002)
36. N. Gebraeel, M. Lawley, R. Liu, V. Parmeshwaran, Residual life prediction from vibration-based degradation signals: a neural network approach. IEEE Trans. Ind. Electron. 51, 694–700 (2004)
37. A.K. Mahamad, S. Saon, T. Hiyama, Predicting remaining useful life of rotating machinery based artificial neural network. Comput. Math. Appl. 60, 1078–1087 (2010)
38. Z. Tian, L. Wong, N. Safaei, A neural network approach for remaining useful life prediction utilizing both failure and suspension histories. Mech. Syst. Signal Process. 24, 1542–1555 (2010)
39. D.K. Ranaweera, N.E. Hubele, A.D. Papalexopoulos, Application of radial basis function neural network model for short-term load forecasting. IEE Proc. Gener. Transm. Distrib. 142, 45–50 (1995)
40. D.C. Baillie, J. Mathew, A comparison of autoregressive modeling techniques for fault diagnosis of rolling element bearings. Mech. Syst. Signal Process. 10, 1–17 (1996)
41. F. Zhao, J. Chen, L. Guo, X. Lin, Neuro-fuzzy based condition prediction of bearing health. J. Vib. Control 15, 1079–1091 (2009)
42. C.J. Li, T.-Y. Huang, Automatic structure and parameter training methods for modeling of mechanical systems by recurrent neural networks. Appl. Math. Model. 23, 933–944 (1999)
43. P. Tse, D. Atherton, Prediction of machine deterioration using vibration based fault trends and recurrent neural networks. J. Vib. Acoust. 121, 355–362 (1999)
44. W. Wang, F. Golnaraghi, F. Ismail, Prognosis of machine health condition using neuro-fuzzy systems. Mech. Syst. Signal Process. 18, 813–831 (2004)
45. N. Gebraeel, M. Lawley, R. Liu, V. Parmeshwaran, Residual life prediction from vibration-based degradation signals: a neural network approach. IEEE Trans. Ind. Electron. 51, 694–700 (2004)
46. A.K. Mahamad, S. Saon, T. Hiyama, Predicting remaining useful life of rotating machinery based artificial neural network. Comput. Math. Appl. 60, 1078–1087 (2010)
47. Z. Tian, L. Wong, N. Safaei, A neural network approach for remaining useful life prediction utilizing both failure and suspension histories. Mech. Syst. Signal Process. 24, 1542–1555 (2010)
48. R. Huang, L. Xi, X. Li, C. Liu, H. Qiu, J. Lee, Residual life predictions for ball bearings based on self-organizing map and back propagation neural network methods. Mech. Syst. Signal Process. 21, 193–207 (2007)
49. M. Pecht, R. Radojcic, G. Rao, Guidebook for Managing Silicon Chip Reliability (CRC Press, Boca Raton, 1999)
50. R. Jaai, M. Pecht, Fusion prognostics. Proceedings of the Sixth DSTO International Conference on Health & Usage Monitoring, Melbourne, Australia, 9–12 Mar 2009
51. S. Cheng, M. Pecht, A fusion prognostics method for remaining useful life prediction of electronic products. 5th Annual IEEE Conference on Automation Science and Engineering, Bangalore, India, 22–25 Aug 2009, pp. 102–107
52. N. Patil, D. Das, C. Yin, H. Lu, C. Bailey, M. Pecht, A fusion approach to IGBT power module prognostics. Thermal, Mechanical and Multi-Physics Simulation and Experiments in Microelectronics and Microsystems Conference, Delft, Netherlands, 27–29 Apr 2009
53. M. Pecht, J. Gu, Prognostics-based product qualification. IEEE Aerospace Conference, Big Sky, MT, 7–14 Mar 2009

Chapter 8
Life Cycle Risk Mitigations

To identify and mitigate performance and reliability risks to data center equipment when subjected to the operating environment changes under the efficient cooling methods introduced in Chap. 2, the design, test, and operation stages need to be considered. In the design stage, a plan is needed to create a product, structure, system, or part. During the test stage, machines, tools, equipment, and experiments produce a product and assess whether it meets its requirements. The operation stage is when the equipment is being used by the end users in a data center. For example, if a router is in its concept phase or its parts are still being selected, it is in the design stage. If the router has been manufactured and is being assessed by the manufacturer before it is shipped to end users, it is in the test stage. If the router is already in place and is being used by end users, then it is in the operation stage. The assessment described in this chapter evaluates whether the equipment and system are expected to be reliable and functional in the efficient cooling operating environment at each of these three stages.

8.1 Risk Assessment Based on Product Life Cycle Stage

Assessment for the three product life cycle stages is shown in Fig. 8.1. The assessment starts with the estimation of operating condition ranges under the alternative cooling methods for data centers. Although ambient air is used to cool the equipment directly, data center operators typically set an environmental range for the supply air. This set environmental range can be based on the recommended operating ranges of published standards, such as Telcordia GR-63-CORE [1], Telcordia GR-3028-CORE [2], European Standard ETSI 300 019 [3], or ASHRAE [4], as discussed in Chap. 3. In order to save as much energy as possible and maximize the operating hours of their air-side economizers, data center operators may further set an operating range that is wider than the recommended ranges published in these standards.

If the outside ambient air conditions are within the recommended ranges, then outside air is brought directly into the data center for cooling via an air-side economizer fan [5]. When the air temperature is beyond the set ranges, there are various options available. Sometimes, the data center uses internally recirculated conditioned air instead of outside air for cooling; in other cases, local cooling or heating of the air may be used. Even after the operating range has been defined for a data center, the outside supply air temperature and humidity will vary by season and from daytime to nighttime. Estimation of the operating condition range is essential for the reliability assessment described in this chapter.
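
A minimal sketch of this economizer decision logic is given below; the temperature and humidity set points are hypothetical and do not come from any of the standards cited above.

```python
# Hedged sketch of the economizer decision logic described above: bring in
# outside air when it is within the data center's set environmental range,
# otherwise fall back to recirculated conditioned air. The set points are
# hypothetical, not values from any cited standard.

SET_RANGE = {
    "temp_c": (18.0, 27.0),           # allowable supply-air temperature
    "rel_humidity_pct": (20.0, 80.0)  # allowable supply-air relative humidity
}

def cooling_mode(outside_temp_c, outside_rh_pct):
    t_lo, t_hi = SET_RANGE["temp_c"]
    h_lo, h_hi = SET_RANGE["rel_humidity_pct"]
    if t_lo <= outside_temp_c <= t_hi and h_lo <= outside_rh_pct <= h_hi:
        return "free air cooling (air-side economizer)"
    return "recirculated conditioned air (or local cooling/heating)"

for conditions in [(22.0, 50.0), (35.0, 40.0), (20.0, 90.0)]:
    print(conditions, "->", cooling_mode(*conditions))
```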

In Sects. 8.2–8.4, evaluation is discussed based on the product's life cycle stage: design, test, or operation. For each stage, the available information and constraints are analyzed, and then the assessment is described.

8.2 Risk Assessment at the Design Stage

During the design stage, the functional requirements of the product are defined. However, the hardware may not yet be finalized. The material, interconnection, and performance information for the potential parts can be used to ensure that they meet the performance requirements and to assess the reliability of the product. Prior experience with the use of similar parts and designs can also be used for part reliability assessment with the efficient cooling methods in data centers. When the equipment is assessed at this stage, an iterative process is followed to finalize the design and the bill of materials.

In the design stage, the assessment includes initial design, part selection, simulation and virtual qualification, and final design. This is similar to product design for reliability, except that the product operating conditions of the efficient cooling methods are considered during the development of the product life cycle.

(Flowchart summary: identify the operating condition range and check whether the operating conditions are within standard requirements; design stage: initial design, parts selection, simulation and virtual qualification, and final design; test stage: standard-based system-level and assembly-level tests and uprating assessment; operation stage: prognostics-based monitoring.)

Fig. 8.1 Schematic of risk mitigation for the efficient cooling methods

When the efficient cooling methods are implemented in different data centers, the operating conditions can be diverse depending on the data center locations. Unless the product is developed for a specific known application, the exact operating conditions are likely unknown, but they can be estimated in several ways. For example, companies can investigate their target market and then develop products for that segment, developing an environmental range for the product used by that defined segment. Companies can also target wide ranges of product operating conditions and attempt to cover the worst possible environmental ranges for product operation.

8.2.1 Initial Design

Initial design is the creation of a product concept and architecture based on an understanding of the operating life cycle and the expected functional and reliability requirements. Other factors that influence the initial design include the nature of the application, the expected life of the equipment, and the energy efficiency requirements of the equipment. Chap. 9 will review some of the factors affecting the design of energy-efficient data centers. We have shown in Chap. 5 that the major environmental changes with the efficient cooling methods (e.g., free air cooling) are wider ranges of temperature and humidity than seen with traditional cooling methods and the possible presence of particulate and gaseous contaminants.

A failure modes, mechanisms, and effects analysis (FMMEA) is critical in the initial design. FMMEA identifies the critical failure mechanisms and their associated failure sites, which are referred to as the critical parts. FMMEA combines traditional failure modes and effects analysis (FMEA) with knowledge of the physics-of-failure [6]. A failure mechanism is the physical phenomenon that causes the onset of failure. Common failure mechanisms in mechanical systems include corrosion, fatigue, and wear [6]. The underlying failure mechanism becomes evident to the user through the failure modes, which are observations of how the system or device has failed. Overheating, unexpected shutdown, and reduced performance are observable failure modes. FMMEA uses a life cycle profile to identify the active stresses and select the potential failure mechanisms. The failure mechanisms are prioritized based on knowledge of the load type, level, and frequency, combined with the failure sites, severity, and likelihood of occurrence. The process is shown in Fig. 8.2 [6].
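To make the prioritization step concrete, the short sketch below ranks a set of failure mechanisms by a simple severity-times-occurrence risk score. It is only an illustration of the bookkeeping; the mechanisms, sites, and scores are hypothetical, and a full FMMEA would also weigh load type, level, and frequency, as shown in Fig. 8.2.

# Minimal sketch of FMMEA-style prioritization: rank failure mechanisms by a
# risk score (severity x likelihood of occurrence). All entries are hypothetical.
failure_mechanisms = [
    # (mechanism, failure site, severity 1-5, occurrence 1-5)
    ("corrosion",            "connector contacts",  4, 3),
    ("solder joint fatigue", "BGA solder joints",   5, 2),
    ("electrolyte aging",    "bulk capacitor",      3, 4),
    ("wear",                 "cooling fan bearing", 2, 5),
]

def risk_score(severity, occurrence):
    # Simple risk priority metric; a full FMMEA may also weight detectability.
    return severity * occurrence

ranked = sorted(failure_mechanisms,
                key=lambda m: risk_score(m[2], m[3]), reverse=True)
for mechanism, site, sev, occ in ranked:
    print(f"{mechanism:20s} at {site:20s} risk = {risk_score(sev, occ)}")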

8.2.2 Part Selection

The general criteria for part selection were introduced in Chap. 6. Part selection is based on the local operating conditions, absolute maximum ratings (AMRs), and recommended operating conditions (RoCs), which can be found in the part datasheet. One of the risks in designing a system for high temperature operation is that the component supplier's design assumptions are not part of the datasheet. For example, the datasheet gives an AMR, but it does not give a maximum acceptable operating time at that AMR. In addition, many component suppliers consider the "design for maximum customer value" point to be proprietary information. Thus, the designer should plan sufficient margins between the expected part temperature in the efficient cooling conditions and the component thermal limits during the part selection process, using the available information.

The part's local operating condition is affected by the selection of the control logic and algorithm of its cooling system (e.g., fan). For example, a case study [7], introduced in Chap. 4, showed that the part temperature increases roughly linearly with the inlet air temperature when the fan speed is constant. However, if the fan speed is dynamic, the part temperature behavior is more complex. Below a fixed fan speed transition point (e.g., 23 °C), which depends on the cooling algorithm, the part temperature again increases roughly linearly with the inlet ambient temperature. When the inlet temperature exceeds the transition point, the fan speed increases in order to offset the increase in the inlet air temperature; the part temperature still increases, but at a slower rate than the rise in inlet ambient temperature. Thus, cooling algorithm selection needs to be considered in the part selection process.
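The fan speed effect described above can be approximated with a simple piecewise-linear model, sketched below. The transition point, slopes, and offset are illustrative placeholders rather than values from the case study in [7]; the point is only that the sensitivity of part temperature to inlet temperature changes once the fan control becomes active.

def part_temperature(inlet_temp_c,
                     transition_c=23.0,   # hypothetical fan-speed transition point
                     slope_fixed=1.0,     # degC of part temp per degC of inlet (fixed fan speed)
                     slope_dynamic=0.4,   # reduced slope once the fan speeds up
                     offset_c=30.0):      # hypothetical part-to-inlet rise at 0 degC inlet
    """Illustrative piecewise-linear part temperature versus inlet air temperature
    for a dynamically controlled fan: linear below the transition point, and a
    smaller slope above it because the fan speed ramps up."""
    if inlet_temp_c <= transition_c:
        return offset_c + slope_fixed * inlet_temp_c
    base = offset_c + slope_fixed * transition_c
    return base + slope_dynamic * (inlet_temp_c - transition_c)

for t in (15, 20, 23, 30, 40):
    print(f"inlet {t:2d} degC -> part {part_temperature(t):5.1f} degC")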

8.2.3 Virtual Qualification

Fig. 8.2 FMMEA methodology [6] (steps: define the system and identify its elements; identify the life cycle profile; identify potential failure modes, failure causes, and failure mechanisms; identify failure models; calculate, analyze, or estimate; prioritize the failure mechanisms; and document the process)

The initial design with the initially selected parts is evaluated and improved by a virtual qualification process during the third step in the design stage. Virtual qualification is used to evaluate the reliability of the parts. The virtual qualification process uses physics-of-failure (PoF) models of the critical failure mechanisms [8–10]. Stress analysis is first performed to determine the local operating conditions of the parts. These stress analysis results are then used as input for the failure models, so that the failure mechanisms can be identified and their impact on the parts can be estimated. This assessment is usually conducted with the help of custom-made, specific software. An example of software used for virtual qualification is CalcePWA, which is "a simulation software which estimates the cycles to failure of components under various loading conditions using Physics-of-Failure (PoF)" [11]. This software can be used to perform thermal analysis, vibration analysis, and failure assessment on printed wiring assemblies. George et al. [11] implemented CalcePWA on communication hardware and predicted its reliability under field conditions. Qi et al. [12] used CalcePWA to assess solder joint reliability under thermal cycling and vibration conditions. Ghosh et al. [13] implemented CalcePWA on a plastic encapsulated DC–DC converter to predict its reliability under field loading conditions.
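As a flavor of what a PoF-based virtual qualification computes, the sketch below evaluates a generic Coffin-Manson-type relation for solder joint thermal fatigue, one of the failure mechanisms such tools model. It is not the CalcePWA implementation; the model constants, temperature swings, and cycling rate are hypothetical and would normally come from the stress analysis and from calibrated model libraries.

def cycles_to_failure(delta_t_c, coeff=1.0e6, exponent=2.0):
    """Generic Coffin-Manson-style relation: N_f = coeff * (delta T)^(-exponent).
    coeff and exponent are hypothetical; calibrated values depend on the solder
    alloy, joint geometry, and the stress analysis results."""
    return coeff * delta_t_c ** (-exponent)

def years_to_failure(delta_t_c, cycles_per_day=2.0):
    # Convert predicted cycles to calendar life for an assumed cycling rate.
    return cycles_to_failure(delta_t_c) / (cycles_per_day * 365.0)

# Wider temperature swings under the efficient cooling methods shorten predicted life.
for dt in (10.0, 20.0, 30.0):
    print(f"delta T = {dt:4.1f} K -> N_f = {cycles_to_failure(dt):9.0f} cycles "
          f"(~{years_to_failure(dt):5.1f} years at 2 cycles/day)")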

Based on the results of virtual qualification, the design is improved or the parts are reselected if necessary. Then, the improved design with the new parts is reevaluated by virtual qualification. This process is repeated until the results show that the design meets the expected requirements under the new environment.

8.2.4 Simulation and Final Design

If the part reliability meets its prescribed requirements according to the virtual qualification, its performance will be assessed at the assembly/subsystem level. During the performance evaluation, the system design is evaluated to determine whether it meets or exceeds the expected functional requirements under the life cycle conditions. An example of a performance simulation tool for semiconductors is the Simulation Program with Integrated Circuit Emphasis (SPICE) [14]. This tool, and its many commercial and academic variations, is used to evaluate the performance of semiconductor parts. SPICE evaluates basic semiconductor parameters, such as carrier concentration and mobility, under various electrical and thermal conditions, and determines their impact on the final circuit parameters, such as output voltage and current. These results can then be used to predict the circuit performance and estimate the system performance based on functional analysis, which will determine whether the design meets the functional requirements.

The design is finalized when it passes the virtual qualification and simulation assessment. The design stage ends with the creation and release of the final design to the product test stage. From this point onwards, additional testing and assessment will continue to ensure the performance and reliability of the manufactured product, as described in Sect. 8.3.


8.3 Risk Assessment at the Test Stage

If a product design is modified during its manufacturing process, the evaluation should be restarted at the design stage. The designs, which took the efficient cooling conditions into consideration in the design stage, will still need to go through the basic assessment steps at the test stage to ensure that the manufactured product will meet its performance and reliability goals. This assessment also needs to account for the fact that any given component in the system may have several different suppliers, and each supplier's component may have slightly different ratings.

The estimation of operating condition ranges under the efficient cooling conditions can be used to determine whether the operating conditions are within a current standard's requirements, which were introduced in Chap. 3 (e.g., GR-63-CORE [1]). If so, the equipment will be evaluated by the test methods provided by the standard. Otherwise, the equipment will be evaluated by an uprating method.
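The decision between the two test-stage paths is a simple range comparison, as sketched below. The limit values are placeholders standing in for the standard ranges discussed in Chap. 3; they are not quoted from GR-63-CORE.

def assessment_path(temp_range_c, rh_range_pct,
                    std_temp_c=(5.0, 40.0),    # placeholder standard limits, not from GR-63-CORE
                    std_rh_pct=(5.0, 85.0)):   # placeholder standard limits
    """Return which test-stage path applies: standards-based testing when the
    estimated operating range lies inside the standard's limits, otherwise
    uprating assessment."""
    temp_ok = std_temp_c[0] <= temp_range_c[0] and temp_range_c[1] <= std_temp_c[1]
    rh_ok = std_rh_pct[0] <= rh_range_pct[0] and rh_range_pct[1] <= std_rh_pct[1]
    return "standards-based assessment" if (temp_ok and rh_ok) else "uprating assessment"

print(assessment_path((18.0, 27.0), (20.0, 60.0)))   # tightly controlled range
print(assessment_path((0.0, 50.0), (5.0, 95.0)))     # wide free air cooling range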

8.3.1 Standards-Based Assessment

This section describes a widely used standard, Telcordia Generic Requirements GR-63-CORE [1], for system-level and subsystem-level assessment. Telcordia GR-63-CORE provides a test method for equipment in a network equipment building, such as a data center. Its operating condition requirements are shown in Table 2.3 in Chap. 2. If the ambient temperature and relative humidity associated with the efficient cooling method are within the required ranges, the tests in Telcordia Generic Requirements GR-63-CORE are valid for the equipment. Because the equipment in GR-63-CORE can refer to "all telecommunication equipment systems used in a telecommunication network system" [1], including "associated cable distribution systems, distributing and interconnecting frames, power equipment, operations support systems, and cable entrance facilities" [1], the standard can be used to test both the whole system and the equipment used in the system. In other words, the operating temperature and humidity test in Telcordia GR-63-CORE, shown in Fig. 3.3 in Chap. 3, can be used to assess the risks of the efficient cooling method when the operating conditions are inside the Telcordia requirements (as shown in Table 3.3 in Chap. 3).

Another standard is the European Standard ETSI 300 019 [3], which was published by the European Telecommunications Standards Institute in 1994 and updated in 2003. The tests in ETSI 300 019 are more complicated than those in GR-63-CORE, since ETSI 300 019 defines various classes of data centers based on the environmental conditions (more details in Sect. 3.3.2), and the tests for each class may differ from those for the others. Qualification testing based on combining GR-63-CORE and ETSI 300 019 has been proposed, but a clear description of how to merge the two operating temperature tests has not yet emerged [15].


Other environmental tests for telecommunication equipment (such as the transportation and storage test, the equipment subsystem fire spread test, and the unpackaged equipment drop test) are not included in this chapter because the conditions for these environmental tests are generally unaffected by the efficient cooling methods.

8.3.2 Uprating Assessment

If the operating conditions with the efficient cooling methods are outside a standard's requirements, then the standards-based method is no longer valid. In this case, some parts may experience hot spots and otherwise be operated beyond their AMR or RoC. Parts with small thermal margins may also face the risk of operating beyond their specified ranges under the efficient cooling conditions. A practical alternative way to evaluate the risks of the efficient cooling conditions is through uprating assessment of parts at the exposed operating condition levels.

IEC Standard 62240 [16] provides uprating tests to ensure that products can meet the functionality requirements of applications outside the manufacturer-specified temperature ranges. The first step in part-level testing is to identify whether the operating temperature for a part exceeds the manufacturer-specified temperature range. The operating temperature of the parts can be obtained from the system-level and subsystem-level testing results and additional analysis. If the operating temperature increases beyond the manufacturer-specified ranges, the uprating process should be performed.

The uprating process starts with a capability assessment, which consists of three steps: package capability assessment, assembly risk assessment, and component reliability assurance. The package capability assessment analyzes the part's qualification test data and other applicable data to ensure that the package and internal construction can undergo the higher temperature from the efficient cooling methods without causing any material properties to change. The assembly risk assessment estimates the ability of a device to perform under the higher temperature from the efficient cooling methods. Component reliability assurance qualifies a part based on the application requirements and performance requirements over the intended range of operating conditions.

Quality assurance secures the ongoing quality of successfully uprated parts by monitoring the part process change notices obtained from the manufacturers. In this assessment, parameter recharacterization testing can be used to assess incoming parts, and change monitoring can be used to warn of a part change that would affect the part's ability to operate under an increased operating temperature.

Uprating is a very expensive process, and the qualification of an uprated part has to be redone if anything about the part manufacturing process changes. With good thermal design at the system level and careful component selection, uprating can usually be avoided.


After the part uprating assessment is completed, the part needs to be assessed at the assembly level to verify whether it can work well at that higher level. The assembly test needs to be conducted throughout the required operating condition range under the efficient cooling conditions. Details of the test procedure can be found in IEC Standard 62240 [16]. Only after both the part uprating test and the assembly level assessment are passed can the part be considered for use under the efficient cooling conditions.

8.4 Risk Assessment at the Operation Stage

When the equipment is already in operation in data centers, it is not practical to take the equipment out of service for testing. If the equipment was not originally designed for the efficient cooling methods, prognostics and health management (PHM) is a retrofitting technique to assess and mitigate the risks for the implementation of the efficient cooling methods.

PHM techniques have been implemented to detect anomalies/faults or predict a product's remaining useful lifetime (RUL) [17–24]. A prognostics-based approach was developed to assess and mitigate risks due to the implementation of the efficient cooling methods, as shown in Fig. 8.3 [25]. This approach starts by identifying the set operating condition range under the efficient cooling conditions. Based on the identified operating condition range, a failure modes, mechanisms, and effects analysis (FMMEA) is conducted to identify the weakest subsystems/components that are the most likely to fail first in the system. Several mechanisms may occur at a higher rate under the efficient cooling conditions due to uncontrolled humidity: electrochemical migration (often occurs in low relative humidity), conductive anodic filament (CAF) formation (often occurs in high humidity), and creep corrosion (often occurs at high humidity in the presence of low levels of sulfur-based pollutants). FMMEA identifies weak subsystems due to damage from these failure mechanisms in a system under the efficient cooling conditions.

Fig. 8.3 A prognostics-based approach to mitigate the risks of the efficient cooling methods [25] (steps: identification of the operating environment; failure modes, mechanisms, and effects analysis; identification of the weakest subsystem; system monitoring and weakest subsystem (part) monitoring, with system functional considerations; anomaly detection; and prognostics)

FMMEA can be further conducted on the weakest subsystems to identify the critical failure mechanisms at that level and the parameters which indicate the degradation trends of the system. Under some circumstances, monitoring and data analysis can also be performed on low-level systems or components [26]. Based on the FMMEA results, the parameters of the system and its weakest subsystems/components (e.g., voltage, current, resistance, temperature, and impedance) will be monitored for risk assessment and mitigation.

In principle, all three PHM approaches (i.e., PoF, data-driven, and fusion) can be used for anomaly detection and prognostics. The PoF approach is usually not practical for complicated systems with a large number of subsystems and components. However, the monitored parameter data allow the use of data-driven PHM at the system level with only a limited need for additional sensing, monitoring, storage, and transmission tools. The data-driven approach detects system anomalies based on system monitoring that covers performance (e.g., uptime, downtime, and quality of service) and other system parameters (e.g., voltage, current, resistance, temperature, humidity, vibration, and acoustic signals). The data-driven approach identifies failure precursor parameters that indicate impending failures based on system performance and collected data. Furthermore, the availability of FMMEA and precursor parameter data for low-level subsystems and components permits a data-driven PHM approach at those levels.

With the implementation of PHM approaches, anomaly detection and prognostics can be conducted to identify equipment anomalies and predict the RUL of equipment, respectively. Based on this information, data center operators can schedule equipment maintenance or replacement to avoid unscheduled downtime of data centers.

8.5 A Case Study of Network Equipment

The network architecture in a data center consists of a set of routers and switches whose function is to send data packets to their intended destinations. The network equipment selected for this study was the power adapter of a Zonet ZFS 3015P switch, which is widely used in offices and small enterprises. This hardware was selected for its well-defined and directly observable failure criteria. In this case study, we implemented a data-driven method to detect anomalies in the power adapter to provide early warning of failure and then mitigate the risks. The power adapter block diagram is shown in Fig. 8.4.

This power adapter is a switched-mode power supply, which incorporates a switching regulator to provide a regulated output voltage. Generally, the unregulated input DC voltage is fed to a high frequency switch that is toggled between the "ON" and "OFF" states (referred to as switching) at a high frequency to control the power supply. When the switch is in the ON state, the unregulated voltage is filtered by circuits and then applied to the output. When the switch is in the OFF state, no voltage is supplied to the output. High frequency switching between the ON and OFF states, and control of the durations of the ON and OFF states, ensure that the average DC output voltage equals the desired output voltage. For this power adapter, the rated output voltage is 9 V. The output voltage drops when the power adapter degrades. The power adapter is considered to have failed when the output voltage drops more than 10 % below the rated value (i.e., below 8.1 V).
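The averaging principle stated above can be expressed in one line. The sketch below uses the idealized duty-cycle relation for a switching regulator; the voltage level and duty cycle are hypothetical, and a real adapter of this type also involves a transformer turns ratio, filtering losses, and closed-loop feedback that are ignored here.

def average_output_voltage(v_during_on, duty_cycle):
    """Idealized switching-regulator relation implied above: the filtered output
    approaches the time average of the switched waveform, i.e. the duty cycle
    times the voltage applied during the ON state (losses and feedback ignored)."""
    return duty_cycle * v_during_on

# Hypothetical numbers: a 24 V switched level chopped at 37.5 % duty cycle
# averages to the 9 V rating of this adapter.
print(average_output_voltage(24.0, 0.375))   # -> 9.0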

8.5.1 Estimation of Operating Conditions

The reliability assessment starts by identifying the operating conditions of the power adapter, which are 0–40 °C and 10–90 % RH. The operating conditions are set by data centers and are usually determined by the amount of energy savings expected from the implementation of the efficient cooling methods. In this case, we assumed that the operating conditions were 0–50 °C and 5–95 % RH in order to maximize energy savings. We used conditions of 95 and 70 % RH in the experiment to increase the rate of degradation. The power adapter was placed inside an environmental chamber and was in operation during the experiment. An Agilent 34970A data acquisition monitor was used to monitor and record the parameter trends of the power adapter.

Fig. 8.4 Power adapter of Zonet ZFS 3015P switches. 1 THX 202H integrated circuit (IC); 2-C1, 2-C2, 2-C3 aluminum electrolytic capacitors; 3 resistor; 4 power transformer; 5 output voltage supply


8.5.2 FMMEA and Identification of Weak Subsystems

The power adapter in this case is a kind of switched-mode power supply (SMPS). FMMEA can identify the critical failure mechanisms, the weakest components involved in those failure mechanisms, and the parameters that indicate power adapter degradation. According to the FMMEA results [27], the critical failure mechanisms are aging of the electrolyte, wire melt due to current overload, thermal fatigue, contact migration, time-dependent dielectric breakdown, and solder joint fatigue. The components with high reliability risks due to the critical failure mechanisms are the aluminum electrolytic capacitor, the diode, the power metal–oxide–semiconductor field-effect transistor (MOSFET), the transformer, and the integrated circuit (IC).

8.5.3 System and Weak Subsystem Monitoring

With consideration of measurement applicability, four parameters of the capacitors and integrated circuit were monitored in this experiment: the voltages of the three capacitors (shown as 2-C1, 2-C2, and 2-C3 in Fig. 8.4) and the output frequency of the THX 202H IC (shown as 1 in Fig. 8.4). In addition, the output voltage across the power adapter was monitored for the power adapter performance trends (shown as 5 in Fig. 8.4). A summary of the monitored parameters is shown in Table 8.1.

The parameters monitored during the experiment are shown in Figs. 8.5, 8.6 and 8.7. The parameter shifts observed indicated the degradation trend of the power adapter.

Fig. 8.5 The IC frequency of the power adapter (frequency in kHz versus time in minutes, with the failure point marked)

Comparisons between the health baselines and the final values are shown in Table 8.2. The power adapter failed at 501 min. The health baseline of every monitored parameter is the average value of its first 20 data points (10 min) in the experiment, considered the healthy data. The final value of a monitored parameter is the average value of the first 20 data points (10 min) after the power adapter failed. The IC frequency and the voltage of capacitor 1 experienced large drops of 93.7 and 53.7 %, respectively, at failure.
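The baseline and shift bookkeeping used for Table 8.2 reduces to a few lines of array arithmetic, as sketched below with a synthetic IC-frequency-like trace; in the experiment the values come from the recorded data acquisition log.

import numpy as np

def baseline_and_shift(series, failure_index, window=20):
    """Health baseline = mean of the first `window` samples; final value = mean of
    the first `window` samples after failure; shift reported in percent of baseline."""
    baseline = np.mean(series[:window])
    final = np.mean(series[failure_index:failure_index + window])
    shift_pct = 100.0 * abs(final - baseline) / baseline
    return baseline, final, shift_pct

# Synthetic trace: healthy near 159 kHz, collapsing at the 501 min failure point.
t = np.arange(0, 700, 0.5)                              # one sample every 30 s
freq = np.where(t < 501, 159.0, 10.4) + np.random.normal(0, 0.5, t.size)
failure_index = np.searchsorted(t, 501.0)

base, final, shift = baseline_and_shift(freq, failure_index)
print(f"baseline {base:.1f} kHz, final {final:.1f} kHz, shift {shift:.1f} %")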

8.5.4 Anomaly Detection

Table 8.1 Monitored subsystems/components and parameters

Monitored subsystem/component    Monitored parameter
Capacitor 1 (V1)                 Voltage
Capacitor 2 (V2)                 Voltage
Capacitor 3 (V3)                 Voltage
THX 202H IC (IC Freq.)           Frequency
Output supply wire (Vout)        Voltage

Fig. 8.6 The voltages across capacitors 1 and 2 of the power adapter

Fig. 8.7 Output voltage and voltage across capacitor 3 of the power adapter

The back propagation neural network is an adaptive statistical model based on an analogy with the structure of the brain, where the output of a neuron is generated by a function of the weighted sum of the inputs plus a bias, as shown in Fig. 8.8. The back propagation neural network is used to detect anomalies because it is applicable to nonlinear statistical modeling, it requires no distribution assumptions or degradation models, and it uses supervised training by mapping the input to the desired output. These features fit the characteristics of the data in this case.

The back propagation neural network uses supervised training, which supplies the neural network with inputs and the desired outputs, and modifies the weights to reduce the difference between the actual and desired outputs [28–30]. The implementation process is shown in Fig. 8.9. The process starts with the preprocessing of the experimental data. The experimental data are normalized as:

A_norm = (A − A_mean) / A_std    (8.1)

where A is the test data of V1, V2, V3, and the IC frequency; A_mean is the mean of the test data; and A_std is the standard deviation of the test data.

Table 8.2 Monitored parameter shifts

Failure time   Parameter         Baseline   Final value   Shift (%)
501 min        V1 (V)            82.1       38.5          53.7
               V2 (V)            148.0      150.5         1.4
               V3 (V)            147.9      150.3         1.4
               Vout (V)          9.34       2.74          70.7
               Frequency (kHz)   159.0      10.4          93.7

Fig. 8.8 Back propagation neural network (input layer x1…xn, hidden layer h1…hr, and output layer y1…ym, connected by weights w_i,j and w_j,k)

The first 20 data points are considered the healthy data and are selected as the training data to train the neural network. The purpose of training is to adjust the weights of the input parameters (IC frequency, V1, V2, and V3 in this case), which ensures that the expected output (Vout in this case) calculated by the neural network is close enough (error within a preset precision) to the actual output under the equipment health conditions. The weights are adjusted by minimizing the error between the expected Vout and the actual Vout. The preset error precision is 10^-8 in this case; that is, the weight adjustment stops when the error between the expected Vout and the actual Vout is equal to or below 10^-8.

The error precision between the expected Vout and the actual Vout is used to detect the anomaly. The expected Vout is calculated based on the relation between the input parameters (IC frequency, V1, V2, and V3) and Vout, which was determined from the healthy data in the training phase. An increase in the error precision indicates that the relation between the input parameters and Vout determined in the training phase is no longer valid. In other words, an anomaly has occurred.
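A minimal sketch of this training step is shown below, using scikit-learn's MLPRegressor as a stand-in for the back propagation network used in the study. The data are synthetic and the network size and training settings are illustrative; only the overall flow (normalize as in Eq. (8.1), train on the first 20 healthy points, and track the prediction error on later data) mirrors the procedure described here.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-ins for the monitored inputs: IC freq (kHz), V1, V2, V3 (V).
n = 200
X = np.column_stack([
    159.0 + rng.normal(0, 0.3, n),
    82.0 + rng.normal(0, 0.2, n),
    148.0 + rng.normal(0, 0.2, n),
    148.0 + rng.normal(0, 0.2, n),
])
y = 9.0 + rng.normal(0, 0.01, n)               # output voltage Vout (V)
y[120:] -= np.linspace(0, 2.0, n - 120)        # injected drift mimicking degradation

# Normalize with the statistics of the data, as in Eq. (8.1).
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Train on the first 20 (healthy) samples only.
net = MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                   max_iter=5000, tol=1e-8, random_state=0)
net.fit(Xn[:20], y[:20])

# Squared prediction error plays the role of the "error precision" tracked above.
errors = (net.predict(Xn) - y) ** 2
print("mean training error:", errors[:20].mean())
print("error on last 20 samples:", errors[-20:].mean())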

Fig. 8.9 Neural network-based anomaly detection (process: data normalization; training data selection (first 20 data points); training error precision selection (10^-8); weight assignment and adjustment, with the expected output and the error between expected and actual output recalculated until the error is below 10^-8; continuous error calculation for incoming experiment data; an anomaly is detected when five consecutive errors reach or exceed 10^-4)

There are no clear and detailed criteria for anomaly detection for this power adapter. In this case, an anomaly is considered to be detected when five consecutive data error precisions reach or go beyond 10^-4, based on the error precision of 10^-8 in the training phase.
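The five-consecutive-points rule can be written as a short scan over the error series, as sketched below; the thresholds are those stated above, while the error trace itself is a synthetic placeholder.

import numpy as np

def detect_anomaly(errors, threshold=1e-4, run_length=5):
    """Return the index at which `run_length` consecutive error values reach or
    exceed `threshold`, or None if no anomaly is flagged."""
    count = 0
    for i, e in enumerate(errors):
        count = count + 1 if e >= threshold else 0
        if count >= run_length:
            return i - run_length + 1
    return None

# Synthetic error-precision trace: healthy noise near 1e-8, then exponential growth.
t = np.arange(0, 600, 0.5)
errors = 1e-8 * (1 + 0.1 * np.random.rand(t.size)) * np.exp(np.maximum(t - 150, 0) / 40)

idx = detect_anomaly(errors)
print("anomaly detected at t =", t[idx] if idx is not None else "none", "min")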

8.5.5 Prognostics

When an anomaly is detected, 30 min of data before the anomaly detection point (including the anomaly detection point) are selected to predict the failure. This case uses the exponential model to predict failure:

y = a·e^(bx) + c·e^(dx)    (8.2)

where y is the error precision, x is the time, and least squares curve fitting is used to determine the model parameters a, b, c, and d.

There are no industry standard failure criteria for the error precision. In this case, a failure is predicted when five consecutive data error precisions reach or go beyond 10^-3, based on the error precision of 10^-8 in the training phase. The process is shown in Fig. 8.10.

The anomaly detection and prognostic results are shown in Fig. 8.11. The anomaly is detected at the 169th minute, and failure is predicted to occur at the 637th minute (the actual failure occurred at the 501st minute). This anomaly detection and prediction can provide early warning of equipment failure.
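A sketch of the fitting and extrapolation step is shown below using SciPy's curve_fit. The error series and the starting parameter guesses are synthetic and illustrative; in practice the 30 min of data preceding the detected anomaly are used, and the double-exponential fit can be sensitive to the starting values.

import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, c, d):
    # Exponential degradation model of Eq. (8.2): y = a*exp(b*x) + c*exp(d*x)
    return a * np.exp(b * x) + c * np.exp(d * x)

# Synthetic error-precision series standing in for the 30 min window (139-169 min)
# preceding the detected anomaly; the parameters are hypothetical.
true_params = (2.0e-9, 0.03, 1.0e-8, 1.0e-3)
x = np.linspace(139.0, 169.0, 60)
noise = 1.0 + 0.02 * np.random.default_rng(1).standard_normal(x.size)
y = model(x, *true_params) * noise

# Least squares fit of the four model parameters (starting guesses required).
params, _ = curve_fit(model, x, y, p0=true_params, maxfev=20000)

# Extrapolate the fitted model until the error precision reaches the 1e-3
# failure criterion used in this case study.
t_future = np.arange(x[-1], 2000.0, 1.0)
crossing = t_future[model(t_future, *params) >= 1e-3]
print("predicted failure at about", crossing[0] if crossing.size else None, "min")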

Fig. 8.10 Prognostics by exponential model (process: when an anomaly is detected, select the data from the 30 min before the anomaly detection point; select the exponential model for prognostics; determine the model parameters from the selected data by least squares curve fitting; predict failure when five consecutive data error precisions reach 10^-3)

Fig. 8.11 Anomaly detection and prediction results (error precision on a logarithmic scale versus time in minutes; the anomaly is detected at 169 min, the actual failure occurs at 501 min, and failure is predicted at 637 min)

8.6 Summary

This chapter presented a multistage process for evaluating the potential risks associated with the efficient cooling methods, including performance and reliability assessment. The assessment identifies the operating conditions with the efficient cooling methods and determines whether they are within the required limits of selected standards, such as Telcordia GR-63-CORE and ETSI 300 019. Traditional reliability evaluations in these standards can be used to assess the risks of the efficient cooling conditions if the system's operating conditions meet the standards' requirements. However, if the operating conditions of the efficient cooling methods go beyond the standards' limits, the methods provided by the standards are no longer valid for reliability assessment.

As an alternative to the standards-based methods, prognostics-based assessment can predict the lifetime of a system. It can identify and address the failure mechanisms associated with the efficient cooling conditions, which otherwise cannot be achieved with the existing standards. This method does not need to run equipment to failure, and it is especially beneficial when the equipment is already in operation in a data center and cannot be taken out of service. In addition, the method can provide a remaining useful life estimate and thereby mitigate the risks of the efficient cooling conditions.

References

1. Telcordia, Generic requirements GR-63-CORE. Network Equipment-Building System (NEBS) Requirements: Physical Protection (Piscataway, NJ, March 2006)

2. Telcordia, Generic requirements GR-3028-CORE. Thermal Management in Telecommunications Central Offices (Piscataway, NJ, December 2001)

3. European Telecommunications Standards Institute (ETSI), Equipment Engineering (EE); Environmental Conditions and Environmental Tests for Telecommunications Equipment. ETS 300 019-1-3 V.2.2.2, Sophia Antipolis Cedex, France (2004)

4. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE) Technical Committee (TC) 9.9, 2011 Thermal Guidelines for Data Processing Environments—Expanded Data Center Classes and Usage Guidance (Atlanta, GA, 2011)


5. D. Atwood, J.G. Miner, Reducing Data Center Cost with an Air Economizer. IT@Intel Brief; Computer Manufacturing; Energy Efficiency; Intel Information Technology (August 2008)

6. W.Q. Wang, M.H. Azarian, M. Pecht, Qualification for product development. 2008 International Conference on Electronic Packaging Technology & High Density Packaging, July 28–31, 2008

7. American Society of Heating, Refrigerating, and Air-Conditioning Engineers (ASHRAE), 2008 ASHRAE Environmental Guidelines for Datacom Equipment (Atlanta, GA, 2008)

8. Joint Electron Devices Engineering Council (JEDEC) Solid State Technology Association, Method for Developing Acceleration Models for Electronic Part Failure Mechanisms (JEDEC91A, Arlington, VA, August 2003)

9. M. Jackson, A. Mathur, M. Pecht, R. Kendall, Part Manufacturer Assessment Process. Qual. Reliab. Eng. Int. 15, 457–468 (1999)

10. M. Jackson, P. Sandborn, M. Pecht, C.H. Davis, P. Audette, A Risk Informed Methodology for Parts Selection and Management. Qual. Reliab. Eng. Int. 15, 261–271 (1999)

11. E. George, D. Das, M. Osterman, Physics of failure based virtual testing of communication hardware. ASME International Mechanical Engineering Congress and Exposition IMECE 2009, Lake Buena Vista, Florida, November 13–19, 2009

12. H. Qi, C. Wilkinison, M. Osterman, M. Pecht, Failure analysis and virtual qualification of PBGA under multiple environmental loadings. Electronic Components and Technology Conference, 54th Proceedings, vol. 1, No. 1–4, June 2004, pp. 413–420

13. K. Ghosh, B. Willner, P. McCluskey, Virtual qualification of a plastic encapsulated DC–DC converter. 2004 IEEE 35th Annual Power Electronics Specialists Conference, vol. 4, Aachen, Germany, June 20–25, 2004, pp. 2578–2582

14. K.S. Kundert, The Designer’s Guide to SPICE and SPECTRE (Kluwer Academic Publishers, Boston, 1998)

15. C. Forbes, Reliability: Combining GR-63-CORE and ETS 300 019 (2002), http://www.ce-mag.com/archive/01/Spring/Forbes.html. Accessed Feb 2009

16. International Electro-technical Commission, IEC Standard 62240, Process Management for Avionics—Use of Semiconductor Devices outside Manufacturers’ Specified Temperature Range (Switzerland, 2005)

17. Z.J. Li, K.C. Kapur, Models and measures for fuzzy reliability and relationship to multi-state reliability. Special issue on multi-state system reliability. Int. J. Perform. Eng. 7(3), 241–251 (2011)

18. D. Wang, Q. Miao, R. Kang, Robust Health Evaluation of Gearbox Subject to Tooth Failure with Wavelet Decomposition. J. Sound Vib. 324(3–5), 1141–1157 (2009)

19. B. Long, S.L. Tian, H.J. Wang, Diagnostics of filtered analog circuits with tolerance based on LS-SVM using frequency features. J. Electron. Test. Theor. Appl. 28(3), 291–300 (2012)

20. W. He, N. Williard, M. Osterman, M. Pecht, Prognostics of lithium-ion batteries based on Dempster–Shafer theory and Bayesian Monte Carlo method. J. Power Sources 196(23), 10314–10321 (2011)

21. W. He, Z.N. Jiang, K. Feng, Bearing fault detection based on optimal wavelet filter and sparse code shrinkage. Measurement 42(7), 1092–1102 (2009)

22. J. Dai, D. Das, M. Pecht, Prognostics-based health management for free air cooling of data centers. IEEE Prognostics and Health Management Conference, Macau, China, Jan 12–14, 2010

23. S.P. Zhu, H.Z. Huang, L. He, Y. Liu, Z.L. Wang, A generalized energy-based fatigue-creep damage parameter for life prediction of turbine disk alloys. Eng. Fract. Mech. 90, 89–100 (2012)

24. S.P. Zhu, H.Z. Huang, R. Smith, V. Ontiveros, L.P. He, M. Modarres, Bayesian framework for probabilistic low cycle fatigue life prediction and uncertainty modeling of aircraft turbine disk alloys. Probab. Eng. Mech. 34, 114–122 (2013)

25. J. Dai, M. Ohadi, M. Pecht, Achieving greener and lower cost data centers through PHM. Electronics Goes Green 2012 + Conference, Berlin, Germany, Sep 9–12, 2012

26. H. Oh, T. Shibutani, M. Pecht, Precursor monitoring approach for reliability assessment of cooling fans. J. Intell. Manuf. (2009). doi: 10.1007/s10845-009-0342-2


27. S. Mathew, M. Alam, M. Pecht, Identification of failure mechanisms to enhance prognostic outcomes. MFPT: The Applied Systems Health Management Conference 2011, Virginia Beach, VA, May 10–12, 2011

28. G. Mirchandani, W. Cao, On hidden nodes for neural nets. IEEE Trans. Circuit Syst. 36(5), 661–664 (1989)

29. J.Y. Audibert, O. Catoni, Robust Linear Least Squares Regression. Ann. Stat. 39(5), 2766–2794 (2011)

30. A. Cottrell, Regression analysis: basic concepts, for the course Econometric Theory and Methods. Department of Economics, Wake Forest University, 2011


Chapter 9
Emerging Trends

The information technology (IT) and telecommunications (TC) industries today are perceived as heavy energy consumption entities, accounting for nearly 2 % of the world energy consumption [1] and with a strong demand-driven upward trend in the years to come. However, in the future the energy efficiency gains from digital processes replacing energy intensive activities may make this industry a major contributor to improved global energy efficiency and an overall reduced carbon footprint. For this to be realized, we must improve IT efficiency and energy management in data centers, computer stations, and portable devices, while utilizing dedicated software and hardware that can synchronize and optimize the operation of the entire data center per its designated mission and functions. Accordingly, the next generation of data centers will employ emerging technologies to further improve energy efficiency and risk management as key components of an optimum operation. This chapter presents some of the key trends, including multi-objective optimization of data centers, a renewed focus on energy management and the need for development of energy-efficient electronics, low resistance cooling methods, utilization of waste heat recovery/chiller-less cooling, thermal storage, and additional measures that can promote reliable and optimum operation of data centers.

9.1 Increased Use of Software Tools for Optimum and Reliable Operation

Advances in computational fluid dynamics (CFD) and heat transfer, multi-objective optimization, and the availability of affordable instrumentation and monitoring (including remote monitoring) have enabled reliable and cost-effective design, operation, and energy resource management of data centers. Unlike traditional models, which include chip-level thermal management, new CFD and other software tools can be used for IT equipment airflow control, room air distribution, room and rack configuration, and equipment deployment and energy management of data centers, including optimum use of dedicated HVAC facilities against forecasted demand and contracted energy rates. In particular, air flow and energy dissipation data at the chip and system level are necessary to improve CFD models for data centers for proper air (or liquid) distribution and energy management. Accordingly, in emerging/modern data centers, there will be greater use of sensors and measurement equipment for continuous energy monitoring, control, and system performance analysis. Since the 2000s, the cost of basic instrumentation and metering/submetering has decreased, making such advancements possible. Once data have been gathered for a data center, the CFD models and software tools can be tailored to reflect real operational conditions. A calibrated CFD model can then be used for diverse design optimization and energy management scenarios in the data center, including failure analysis and life cycle equipment enhancement planning. Figure 9.1 shows the ten disciplines controlled in a custom-designed software suite offered by a commercial vendor [2].

Fig. 9.1 Ten disciplines of data center enterprise management in ABB Inc. Decathlon [2]. The disciplines include: optimization of the energy usage profile against forecasted demand, contracted rates, and alternate energy source rates; remote access so that subject matter experts can service data center operations across one or more data centers in the enterprise; work order ticketing with prescribed work flows for submittal, creation, tracking, expenditures, lessons learned, and spare parts availability; asset identification within the IT and facilities domains, including location, power and cooling needs, and "what-if" scenarios when adding, moving, or changing servers; analysis of CPU utilization and application criticality in conjunction with server temperatures to operate candidate servers more efficiently through reduced power draw; balancing of current and forecasted server, power, and cooling demand along with utility contract rates across multiple data center sites; change tracking in a prescriptive work flow using a common database; facilities management and control of the cooling system, physical security, fire protection, leak detection, and CCTV monitoring; a scalable data repository for the local data center integrated with an enterprise-level historian; and metering and management of power delivery to the servers from primary and alternate sources.

9.2 Trends in Development of Energy Efficient Electronics

The success of the electronics industry since the 1970s can be attributed to the increased number of transistors on the chip (Moore's law), which allowed increased functionality and speed, and to a reduction of the feature size of electronic components, which enabled the miniaturization (Fig. 9.2) of electronics and increased power density at the component (chip) and system (data center) levels, which in turn has required more aggressive thermal management technologies. However, as the feature size decreases, the functionality density on the chip increases (shown in Fig. 9.3), and the energy consumption of the chip escalates dramatically, as illustrated in Fig. 9.4. Therefore, one of the top priorities of chip manufacturers, before they will invest in faster chips, is energy efficiency. In fact, most recent data indicate that Moore's Law progression will be stalled by voltage and feature size limits and that enhanced cooling techniques are needed to restore chip frequency progression.

Fig. 9.2 Continued decrease in semiconductor feature size from 1970 to 2010 [3] (feature size has shrunk by roughly a factor of 0.7 every 2 years, reaching the 65, 45, 32, and 22 nm nodes)

Fig. 9.3 Chip performance with technology node [4] (chip performance in GF versus technology nodes from 65 nm down to 5 nm)

Given that the energy consumption of computations is reduced with decreased interconnect distance (Fig. 9.5), among other benefits, 3-D stacking of chips and increased packaging density will be essential in reducing the energy consumption of chips. However, the thermal management of these 3-D chips introduces new challenges, and advanced thermal management technologies such as embedded cooling may be most effective for next-generation electronics, as discussed in the following sections. Embedded cooling provides a co-design environment, in which the design of chip thermal management is an integral part of the electrical and overall design of the chip, thus enabling optimization of chip performance and functionality.

9.3 Embedded (Near Source) Cooling

Most current thermal management systems rely on heat rejection to a cooling fluid that is located remotely from the source of heat, thus involving thermal conduction and spreading in substrates across multiple material interfaces, each having its own thermal parasitic. This so-called remote cooling has many limitations and accounts for a large fraction of the size, weight, and power (SWaP) requirements of advanced high power electronics, lasers, and computer systems [5, 6]. With a direct or embedded cooling system, it is possible to achieve high levels of cooling while shrinking the dimensions of the thermal management hardware, since the embedded microfluidic-based system delivers the cooling in close proximity to the on-chip heat source. A combination of high conductivity interfacial materials and embedded cooling can lead to at least one order of magnitude improvement in heat removal effectiveness from the system. The Defense Advanced Research Projects Agency (DARPA) is currently supporting research on high conductivity interfacial materials, as well as embedded single- and two-phase cooling solutions for high flux electronics, with applications to military equipment as well as commercial electronics for diverse applications [7, 8].

Fig. 9.4 Energy increase by feature size [4] (chip power in W versus technology node, from 65 nm down to 5 nm)

Fig. 9.5 Energy increase by interconnect distance [4] (energy per bit in pJ versus interconnect distance in cm)

9.3.1 Enhanced Air Cooling

The majority of existing data centers use air cooling systems due to the many advantages air cooling offers in maintaining the desired operating conditions. However, future data centers will most likely use a combination of cooling methods in order to efficiently and most directly remove heat from the IT equipment, use waste heat efficiently, and improve the overall efficiency of the system and the life cycle cost effectiveness.

In a typical air cooling system, heat generated by the processor is conducted to a heat sink and transferred to the chilled air blowing into the server. As described in Chap. 4, current air cooling systems introduce large thermal resistances at various interfaces, including the thermal resistance between the heat generating processor and the heat sink. This thermal contact resistance can be reduced through embedded cooling techniques that utilize advanced heat sinks and high-conductivity thermal substrates. Another major source of thermal resistance is between the heat sink and the air. Improving the air-side heat transfer coefficient, and thereby reducing the thermal resistance between the heat sink and the air, has been of active interest to the technical community for at least the past three decades. Most recently, a joint program has been announced between the National Science Foundation and the Electric Power Research Institute (EPRI) to support research on the development of highly effective dry cooled heat exchangers, with a specific focus on improving the air-side heat transfer coefficient [9]. Any improvement in air cooling heat transfer technologies will directly benefit the electronics thermal packaging industry in its search for further pushing the effectiveness of air cooling equipment at the chip and data center levels.

Computational fluid dynamics (CFD) simulations of a standard air-cooled heat sink on an 85 W source demonstrate that the incoming air has to be cooled to as low as 5 °C to keep the temperature of the CPU below 78 °C (with an air flow rate of 12.7 L/s or 26.7 CFM, ΔP = 2.26 Pa, and pumping power of 29 mW) [10].
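The quoted pumping power is consistent with the ideal hydraulic relation, pumping power = volumetric flow rate × pressure drop (fan efficiency not included), which is easy to check:

def pumping_power_w(flow_m3_per_s, delta_p_pa):
    # Ideal hydraulic pumping power; no fan or pump efficiency is included.
    return flow_m3_per_s * delta_p_pa

# Air-cooled heat sink case from the text: 12.7 L/s at 2.26 Pa.
p = pumping_power_w(12.7e-3, 2.26)
print(f"{p * 1000:.1f} mW")   # roughly 29 mW, matching the quoted value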

Two major drawbacks of air cooling are the bypass that occurs (and thus underutilization of the chilled air) and the uncontrolled mixing of warm and cold air downstream of the heat sink in many traditional data centers. One design improvement to address this issue is cooling of the air inside the door cabinets by chilled water, thus minimizing the possibility of mixing warm and cold air. Another option is the use of curtains and partitions to physically separate hot and cold air, thus avoiding their uncontrolled mixing.

As a remedy to minimize the adverse effects of mixing of hot and cold air in data centers, there has been increased interest in incorporating two-phase evaporators in the cabinet doors of server racks, eliminating the intermediate refrigerant-water heat exchanger. Incorporating the two-phase evaporators in the cabinet further reduces thermal resistance, reduces the total cost of equipment, and eliminates the possibility of water damage to the IT equipment. However, the thermal resistance between the heat sink and the ultimate sink (ambient air) still remains high, as do the associated high levels of air circulation, friction losses, and noise. Example systems of cabinet door cooling can be found in the literature of Motivair cooling solutions [11].


9.3.2 CRAC Fan Speed Control

An important energy conservation measure for air cooled data centers is the use of variable fan drives, by which fan speed can be controlled and programmed for optimum operation based on the data center load. Control methods implemented in most data centers operate CRAC fans at 100 % of their maximum speed. These are paired with a P or PI (proportional or proportional-integral) controller within the CRAC to regulate the flow of chilled water in the cooling coil, based on the supply and return temperatures on the water and air sides. Sundaralingam et al. [12] implemented a server rack heat load-based CRAC fan speed controller to provide a data center with only the necessary cooling.

The impact of power reduction in a CRAC on the power required for the chiller can be determined. For a fan speed of 60 % of full speed, the CRAC draws 2.5 kW, and for a fan speed of 100 %, 5 kW. The coefficient of performance (COP) for the CRAC can be computed as the ratio of the total power removed from the high performance computing (HPC) zone of the data center by the CRAC to the total power of the CRAC unit for a 24 h period. The COP of the building chiller is calculated as the ratio of the total heat being removed from the building to the total power of the chiller compressors, pumps, and fans for a 24 h period. The COP values can be used to compute the required chiller input power. To remove a constant heat load of 110 kW from the HPC zone for a period of 24 h, 950 kWh are required by the setup with the controller, compared to 967 kWh for the setup without the controller. When combined with the savings from the CRAC, there is a savings of 47 kWh for the 24 h period. This is approximately a 6 % savings in the total input power to the CRAC and the chiller, compared to constant speed fan operation [12].
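The bookkeeping behind these figures can be sketched as follows. The COP values and the CRAC fan saving used below are hypothetical placeholders chosen only so that the chiller energies come out consistent with the 950 and 967 kWh quoted above; they are not the measured values from [12].

def chiller_energy_kwh(heat_load_kw, cop, hours=24.0):
    # Chiller input energy needed to remove a constant heat load at a given COP.
    return heat_load_kw / cop * hours

heat_load_kw = 110.0            # HPC zone heat load from the text
cop_with_controller = 2.78      # hypothetical COP, illustration only
cop_without_controller = 2.73   # hypothetical COP, illustration only

e_with = chiller_energy_kwh(heat_load_kw, cop_with_controller)
e_without = chiller_energy_kwh(heat_load_kw, cop_without_controller)
crac_saving_kwh = 30.0          # hypothetical CRAC fan saving over 24 h

total_saving = (e_without - e_with) + crac_saving_kwh
print(f"chiller: {e_with:.0f} vs {e_without:.0f} kWh, "
      f"combined saving {total_saving:.0f} kWh over 24 h")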

9.3.3 Direct Liquid Cooling

The direct liquid-cooled method eliminates two of the major thermal resistances: heat-sink-to-air and air-to-chilled-water. Liquid cooling improves the heat transfer efficiency, decreasing the overall thermal resistance of the heat transfer circuit, the energy consumption, and the size and cost of equipment. An IBM study in 2009 found that liquid cooling can be 3,500 times more efficient than air cooling [13]. Their tests showed a 40 % reduction in total energy usage with liquid cooling. Liquid cooling also improves working conditions for data center personnel by reducing the noise level, since the multiple fans per server used in air cooling can be eliminated. However, there is still no clear working fluid choice for the cooling liquid. Water, which has good thermal properties, can damage electronics if leaks occur. Electronics-friendly dielectric liquids, such as certain refrigerants, have poor thermal properties in the single phase (liquid) and are also costly. A study by Mandel et al. [14] indicates that ammonia may be a good candidate as a working fluid for cooling high flux electronics; however, the use of ammonia carries safety risks and thus is subject to strict regulations in most cases. Accordingly, the search continues to identify a suitable candidate cooling fluid with minimum global warming potential. In the absence of a better fluid, it is possible that the industry will increasingly shift toward the use of water as the working fluid once proper safeguards against possible leakage and the other drawbacks of water cooling are demonstrated.

Similar to the air-cooled heat sink CFD work mentioned in Sect. 9.3.1, simulations using a microchannel cold plate on an 85 W source with both water and FC-72 as cooling fluids showed that the incoming temperature for water could be about 62 °C (flow rate of 18 cm3/s or 0.04 CFM, ΔP = 32 kPa, and pumping power of 57 mW). However, a dielectric fluid, such as R-134a, would need an incoming temperature of −4 °C (flow rate of 18 cm3/s or 0.04 CFM, ΔP = 31 kPa, and pumping power of 56 mW) [10]. These results indicate that if water (or another high performing fluid such as ammonia) is used as the cooling fluid, the server cooling system for the data center could be operated without compressors, thus saving energy as well as reducing capital equipment and life cycle costs. However, if the cooling fluid is limited to dielectric fluids, then compressors will be required, as illustrated in Fig. 9.6b.

Fig. 9.6 Cooling options: (a) water cooling and (b) dielectric fluid cooling

9.3.4 Direct Phase-Change Cooling

Direct two-phase refrigerant cooling for data centers, if properly implemented, can eliminate the use of chilled water and ventilation/air-conditioning (HVAC) equipment, resulting in potential savings in capital equipment, infrastructure cost, and life cycle costs. This approach is gaining momentum as an option for next-generation data centers. However, in order to eliminate the need for subcooling, the overall thermal resistance of the cold plate in the two-phase cooling option should be very low. Reducing the thermal resistance is possible with some of the emerging technologies in embedded phase-change cooling. For example, embedded manifolded microchannel cooling utilizing thin film cooling, shown in Fig. 9.7, yields low thermal resistance between the heat source and the cooling fluid [14, 15]. Thermal resistance as low as 0.04 K/W has been experimentally reported for manifolded microchannel heat sinks for chip cooling applications, as illustrated in Fig. 9.8. Compared to conventional, commercially available off-the-shelf cold plates with thermal resistances of 0.15–0.20 K/W for dielectric fluids, the manifolded microchannel heat sinks provide a several-fold (roughly four- to five-fold) reduction in thermal resistance. That low thermal resistance eliminates the need for the vapor compression cycle of the HVAC equipment. Heat from the chip can be directly removed to the ambient of almost any climatic zone using only a pumped refrigerant loop, with evaporation on the chip side and condensation of the refrigerant in an air-cooled condenser or a much smaller HVAC system than would otherwise be required by conventional systems.
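The practical significance of such low thermal resistance follows from the junction-to-fluid temperature rise, ΔT = R_th × q. Using the 85 W heat source from the earlier single-chip examples (an assumption carried over here only for illustration), the sketch below compares the two cold plate classes:

def temp_rise_k(thermal_resistance_k_per_w, heat_w):
    # Temperature rise from heat source to cooling fluid: dT = R_th * q.
    return thermal_resistance_k_per_w * heat_w

heat_w = 85.0   # heat load from the earlier single-chip examples (assumed here)
for label, r_th in [("manifolded microchannel", 0.04),
                    ("off-the-shelf cold plate", 0.175)]:   # midpoint of 0.15-0.20 K/W
    print(f"{label:25s}: dT = {temp_rise_k(r_th, heat_w):4.1f} K")

The small temperature rise of the manifolded microchannel cold plate is what allows coolant inlet temperatures high enough for compressor-less heat rejection to the ambient.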

The two-phase CFD simulation results demonstrate that an entering liquid temperature as high as 76.5 °C (170 °F) would be sufficient to cool an 85 W CPU at a flow rate of 0.54 g/s or 0.46 cm³/s, ΔP = 5 kPa, and pumping power = 2.3 mW [10]. The fluid exiting these cold plates could be cooled using ambient air due to its elevated temperature. Thermal energy at such temperatures can also be used as a low-grade heat source for heating, or for cooling through heat-activated absorption refrigeration. There are other uses for the waste heat as well, including using warm water for district heating and similar applications.
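The quoted pumping power is consistent with the ideal hydraulic power, i.e., the volumetric flow rate multiplied by the pressure drop. The minimal check below uses the figures above and neglects pump efficiency.

```python
# Ideal hydraulic pumping power: P = V_dot * dP (pump efficiency neglected).
# The flow rate and pressure drop are the two-phase cold plate figures
# quoted above from [10].
V_dot = 0.46e-6   # volumetric flow rate, m^3/s (0.46 cm^3/s)
dP = 5.0e3        # pressure drop, Pa (5 kPa)

P_pump = V_dot * dP   # W
print(f"Pumping power ~ {P_pump * 1e3:.1f} mW")   # ~2.3 mW
```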

Despite its many advantages, there are some challenges in deploying two-phase cooling for servers in data centers. The fluid/vapor handling system must allow individual servers to be swapped in and out of a cabinet. This can be made easier by redesigning server layouts to allow external access to the top surfaces of the CPUs and GPUs. A control system will also be needed to detect, isolate, and stop fluid leaks. These issues have delayed deployment of two-phase cooling in commercial systems and need to be addressed before any major adoption and full-scale technology change takes place.

9.3.5 Comparison between Embedded Air, Liquid, and Two-Phase Flow Cooling

Table 9.1 provides a summary of our simulation analysis of air, liquid, and two-phase cooling. The thermal resistance of liquid cooling is less than one-half that of air cooling due to the high heat capacity of the liquid. However, the pumping power of liquid cooling is twice that of air cooling due to higher pressure drops. In heat transfer design, it is common to accept higher pressure drops in exchange for improved heat transfer performance, as long as the associated pressure drop penalties are not excessive and the result is an overall system improvement.


From the results in Table 9.1, it is clear that two-phase flow cooling provides reduced total thermal resistance, as much as an order of magnitude less than that of air cooling and significantly below that of liquid cooling. The pressure drop associated with conventional two-phase cooling is higher than that of single-phase liquid cooling, due to vapor acceleration and the additional pressure drop losses inherent in the phase-change phenomenon [16–18].

Fig. 9.7 Thin film manifold microchannel cooling [15]


The additional pressure drops associated with conventional phase-change heat transfer lead to higher pumping power penalties, and thus higher operating costs for the cooling system. Therefore, the optimum cooling method for data centers would combine high thermal performance with low pumping power requirements, while meeting all system reliability requirements and competitive market forces. Thin film manifolded microchannel cooling, which yields remarkably low thermal resistance as well as low pumping power, may be a suitable cooling method for next-generation data centers [14, 15].
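One simple way to compare the cooling options on a common basis is the amount of heat removed per unit of pumping power, using the values summarized in Table 9.1. The sketch below is only a back-of-the-envelope comparison; it ignores pump and fan efficiencies and everything downstream of the cold plate.

```python
# Heat removed per watt of pumping power, based on the Table 9.1 values.
# Pump/fan efficiency and the heat rejection side of the loop are ignored.
Q = 85.0   # W, heat load in all cases

pumping_power_mW = {
    "air":                 29.0,
    "water":               57.0,
    "dielectric (FC-72)":  56.0,
    "two-phase (R-245fa)":  2.3,
}

for option, p_mW in pumping_power_mW.items():
    ratio = Q / (p_mW * 1e-3)
    print(f"{option:>20}: ~{ratio:,.0f} W removed per W of pumping power")
```

On this metric, the thin film two-phase approach removes more than an order of magnitude more heat per watt of pumping power than air cooling, and roughly 25 times more than single-phase liquid cooling.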

As shown in Fig. 9.9, low resistance two-phase thin film cooling by force-fed manifolded microchannels has 20 times lower thermal resistance than liquid cooling at 5 times lower pumping power consumption [10]. This represents a major advancement in the development of highly efficient data centers, and contrasts with the generally held view that two-phase cooling systems require higher pumping power than single-phase cooling systems. The main reason for the favorable behavior of force-fed microchannels is that the governing regime is a combination of forced convection boiling and thin film evaporation over high aspect ratio microchannels with limited fluid flow running length. An optimized design will aim for dominance of thin film evaporation by achieving high vapor quality at the exit of the evaporator, while requiring the minimum possible fluid flow in circulation.

Fig. 9.8 Thermal resistance and pressure drop of two-phase thin film cooling [15] (thermal resistance [K/W] and pressure drop [kPa] plotted against heater power [W]; refrigerant: R245fa)

Table 9.1 Comparison between air, liquid, and two-phase cooling

                                 Air            Water          Dielectric fluid (FC-72)   Two-phase flow (R-245fa)
Generated power                  85 W           85 W           85 W                       85 W
Fluid inlet temperature (Tin)    5 °C           62.4 °C        −4 °C                      76.5 °C
Thermal resistance (Rth)         0.4–0.7 K/W    0.15–0.2 K/W   0.15–0.2 K/W               0.038–0.048 K/W
Pumping power (Ppump)            29 mW          57 mW          56 mW                      2.3 mW


The heat transfer coefficients associated with thin film microchannel cooling are often an order of magnitude higher than those reported for conventional two-phase flow systems in similar applications. The magnitude of the heat transfer coefficient is inversely proportional to the film thickness on the wall, and thus an optimized design of the heat transfer surface and the manifold system could yield heat transfer coefficients as much as 1000 kW/(m²·K) or higher. Meanwhile, the corresponding pressure drops are below those of liquid cooling or conventional two-phase flow cooling.
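The inverse dependence of the heat transfer coefficient on film thickness can be illustrated with a simple conduction-across-the-film estimate, h ≈ k_liquid/δ. In the sketch below, both the liquid thermal conductivity assumed for R-245fa and the film thicknesses are approximate, order-of-magnitude values rather than measured data.

```python
# Order-of-magnitude estimate of the thin-film evaporation heat transfer
# coefficient: h ~ k_liquid / film_thickness (conduction across the film).
# k_liquid is an approximate, assumed value for liquid R-245fa.
k_liquid = 0.08   # W/(m*K)

for delta_um in (10.0, 1.0, 0.1):     # assumed film thicknesses, micrometers
    delta = delta_um * 1e-6            # m
    h = k_liquid / delta               # W/(m^2*K)
    print(f"film thickness {delta_um:5.1f} um -> h ~ {h / 1e3:6.0f} kW/(m^2*K)")
```

Sub-micrometer liquid films are therefore what make heat transfer coefficients approaching 1000 kW/(m²·K) plausible.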

A combination of two-phase cooling and advanced, ultra-low resistance cold plates could thus contribute to a thermal management system with significant capital and energy savings. These savings come from the elimination of compressors (since subambient fluids are no longer needed), fans, and heat exchangers; a reduction in the amount of cooling fluid being pumped around the data center; a decrease in the size of other components; and a reduction in the amount of electricity required to operate the cooling system. Thus, a data center using two-phase cooling will have a competitive advantage in the marketplace due to its lower capital and operating costs and the associated energy savings.

9.4 Net-Zero Emission Data Centers

Due to the large amount of energy that data centers consume, their associated CO2 footprint is not sustainable and must be reduced in future data centers. This can best be achieved by optimum thermal management systems that minimize energy consumption while utilizing the produced waste heat for heating and cooling applications. Highly efficient phase-change cooling systems, similar to those discussed in the previous section, can best achieve this.

Fig. 9.9 Comparison of thermal resistance between air, liquid, and two-phase cooling [10] (thermal resistance (Rth) plotted against pumping power [mW] for air cooling, liquid cooling, and two-phase thin film manifold cooling)


District heating (DH) systems that use the generated waste heat for remotely located residential and commercial heating requirements are already used in Europe. While the waste heat from future data centers may not be sufficient for district heating (where a typical radius of several miles may be covered), neighborhood heating (within a radius of about one mile) can supply neighboring buildings for space heating applications. Although district heating systems are not common in the U.S., higher energy prices and the commitment of various state governments to promoting green energy and reducing their carbon footprint may help advance the concept of zero emissions in next-generation data centers.

9.5 Mission Critical Data Centers

Mission critical data centers are created to heighten homeland security, secure valuable information, and enable faster and more accurate communication. Typical applications for these centers include electronic transactions for financial services (e.g., online banking and electronic trading), electronic medical records for healthcare, and public emergency services (e.g., 911 services). These data centers require higher availability, reliability, and disaster recovery capability than traditional centers.

Data centers can experience two types of downtime: (1) unplanned downtime caused by system error, human error, or an external event (e.g., a natural disaster); and (2) planned downtime required for system upgrades, scheduled maintenance, and other data center requirements [14]. The downtime of mission critical data centers should be minimized regardless of the source, as high availability is one of the fundamental requirements of mission critical data centers. System reliability is determined by both hardware reliability and software reliability. Hardware reliability is primarily determined by the reliability of the components; in particular, the weakest and most important components determine the system reliability, and their reliability should be improved as much as possible. At the system level, redundant designs can be used to achieve high reliability. Software is designed with fault tolerant functions in order to maintain high reliability in case of intermittent hardware faults.
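The benefit of redundancy mentioned above can be quantified with the standard parallel-redundancy relation: if each unit has reliability R over the mission period and any one of n independent units is sufficient, the system reliability is 1 − (1 − R)^n. The numbers below are illustrative only, not data from any particular data center.

```python
# Parallel redundancy: the system works if at least one of n independent
# units works, so R_sys = 1 - (1 - R)**n.  R and n are illustrative values.
def redundant_reliability(r_unit: float, n: int) -> float:
    return 1.0 - (1.0 - r_unit) ** n

r_unit = 0.99   # assumed single-unit reliability over the mission period
for n in (1, 2, 3):
    print(f"n = {n}: R_sys = {redundant_reliability(r_unit, n):.6f}")
```

Adding a second unit turns a 1 % chance of failure into 0.01 %, which is why N+1 and 2N designs are common in mission critical facilities.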

It is usually very expensive to eliminate downtime, which requires proprietary servers and specialized support staff. One industry example of minimizing downtime is the HP Data Center Operating Environment (DCOE) [14], which "delivers a rich, flexible set of options for failover and disaster recovery" [19]. With these options, customers can use a secondary application or servers to back up the primary mission critical servers in case of disaster.

Mission critical data centers must also be capable of recovering from a natural disaster. Today, disaster recovery means that there are two or more geographically separated data centers to ensure that resources will remain available in a disaster. In order to save resources, customers can back up primary mission critical servers and their applications using development servers or secondary applications as designated failover resources [19].

9.6 Waste Heat Recovery/Chiller-less Cooling

The benefits of utilizing waste heat in data centers are well analyzed in [20]. In general, low resistance cooling is often the optimum choice for the best utilization of waste heat from microprocessors. This includes high effectiveness phase-change heat removal processes [21]. Garday and Housley [22] describe the use of chilled water storage at an Intel regional hub data center facility. Two 24,000 gallon tanks containing water at 5.6 °C (42 °F) allowed successful operation during an outage lasting several hours in 2006. Once the chillers stopped working, water from the storage tanks was added to the chilled water system to maintain a 12.8 °C (55 °F) water delivery temperature to the CRAC units. The chilled water pumps and CRAC fans were on UPS power for continued operation. The servers continued to operate for more than 15 min following the outage, due to the relatively small IT load on the system during the period, and the cooling continued long enough afterwards to ensure removal of the stored heat. The cost of the storage system was found to be significantly lower than that of putting the chiller on a UPS and standby generator.

Schmidt et al. [23] discussed the infrastructure design for the Power6 575 high performance cluster at the National Center for Atmospheric Research in Boulder, Colorado. Each of the eleven racks in this cluster generates 60 kW. Module-level water cooling and rear door heat exchangers (RDHxs) remove 80 % of the heat generated by each rack, and the remainder is removed by the CRAC units. Two 1,500 gallon thermal storage tanks were employed. The storage system was designed so that the chilled water supply temperature to the cluster did not exceed 17.8 °C (64 °F) for at least 10 min following a chiller failure. The tanks were made of carbon steel and were highly insulated; each was 145 cm in diameter and 2.13 m (7 ft) tall. Schmidt et al. [23] noted that considerable prior literature is available on storage tank design due to its use in other applications, such as solar energy storage. They discussed the importance of stratification effects, which result in the settling of cooler, denser liquid layers near the bottom of the tank and warmer, less dense layers near the top, and suggested that stratification should be addressed in the design of thermal management systems. They also discussed the importance of the aspect ratio of the tank, the ratio of height to diameter. However, the design requirements of chilled water storage for data center cooling differ from those of solar energy storage, and more specific design and calculation studies are needed to avoid unexpected hot spots.
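The sizing logic behind such chilled water storage can be checked with a simple energy balance: the stored cooling capacity is m·cp·ΔT, and dividing by the heat load carried by the chilled water loop gives the ride-through time. The sketch below uses the tank and rack figures cited above; the water properties, the usable temperature rise, and the assumption that the full tank volume is available are simplifications, not values from [23].

```python
# Ride-through time of chilled water storage: t = (m * cp * dT) / Q_load.
# Tank volume, rack power, and the 80 % water-cooled fraction come from the
# text; the usable temperature rise and full tank utilization are assumptions.
GALLON = 3.785e-3                 # m^3 per US gallon

volume = 2 * 1500 * GALLON        # two 1,500 gallon tanks, m^3
rho, cp = 1000.0, 4186.0          # water density (kg/m^3), specific heat (J/(kg*K))
dT = 10.0                         # assumed usable temperature rise of stored water, K
Q_load = 11 * 60e3 * 0.80         # 11 racks x 60 kW, 80 % removed by the water loop, W

energy = volume * rho * cp * dT   # stored cooling capacity, J
t_ride = energy / Q_load          # s
print(f"Stored cooling ~ {energy / 1e6:.0f} MJ, ride-through ~ {t_ride / 60:.0f} min")
```

Even with these rough assumptions, the estimate is comfortably above the 10 min design requirement, consistent with the design intent reported in [23].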

Other trends in data centers will include evolving business application architectures. Companies rely on their business applications, and successful business applications provide instant transactions among internal employees and outside partners. Future data centers must support a wide range of applications efficiently in order to build business advantages in a globally competitive world.


Additionally, diverse media applications, increased online transactions, and the increasing shift from paper to digital storage will increase the demands on bandwidth and capacity in future data centers.

Another possible trend is reducing operational expenditure in data centers. Operational expenditure has become a major cost for data centers, and it will continue to grow rapidly. Creative ways to reduce operational expenditures of data centers without any compromise on stringent requirements for reliability and availability will be an important advantage for data centers in the future.

With increasing amounts of sensitive data stored in data centers and more applications supported by data centers, the security of data will become an important concern in the future. Increasing effort will be needed to protect data centers from new types of cyber-attacks, which surface every day.

Finally, future data centers may increasingly be expected to develop and employ intelligent techniques to analyze massive volumes of data to find trends that are useful in retail and other sectors. These techniques may be able to eliminate data noise and extract useful information from unstructured data, improving the quality of data center services and adding value.

9.7 Summary

This chapter covered some of the common features expected of future data centers, including high performance, energy efficiency, and high reliability and availability. Among these features, energy efficiency is the key trend, since energy consumption is the major challenge of data center development. Therefore, the data center industry needs to find innovative methods to improve energy efficiency. As a result, emerging technologies such as free air cooling and direct liquid immersion may gain renewed implementation strength and, if they can assure cost-effectiveness and reliable operation, are expected to be widely accepted by both new and existing data centers in the future.

References

1. E. Centegen, Force Fed Microchannel High Heat Flux Cooling Utilizing Microgrooved Surfaces. Ph.D. Thesis, University of Maryland, 2010

2. ARC Advisory Group, Enabling Predictive Reliability and Energy Efficiency for Today’s Data Centers, Report to ABB Inc. Data Center Enterprise Management | Decathlon™, Oct 2011

3. A. Shooshtari, R. Mandel, M. Ohadi, Cooling of Next Generation Electronics for Diverse Applications, in Encyclopedia of Energy Engineering and Technology, ed. by S. Anwar (Taylor and Francis, New York, 2012) (in press)

4. S. Borkar, Next Generation Materials, Devices or Packages—Year 2025, InterPACK’11 Panel Session, Portland, OR, 7 Jul 2011

5. A. Bar-Cohen, Gen-3 thermal management technology: role of microchannels and nanostructures in an embedded cooling paradigm. ASME JNEM (in press)


6. A. Bar-Cohen, J.J. Maurer, J.G. Felbinger, Keynote Lecture, "DARPA's Intra/Interchip Enhanced Cooling (ICECool) Program", in Proceedings, IEEE CSMantech, New Orleans, LA, May 2013, pp. 171–174

7. K.P. Bloschock, A. Bar-Cohen, Advanced Thermal Management Technologies for Defense Electronics, in Proceedings, SPIE 8405, Defense Transformation and Net-Centric Systems 2012, Baltimore, MD, May 2012

8. A. Bar-Cohen, B.A. Srivastava, B. Shi, Thermo-Electrical Co-Design of 3D ICs: Challenges and Opportunities. Computational Thermal Sciences, 2013 (in Press)

9. 2013 Joint Electric Power Research Institute (EPRI) and National Science Foundation (NSF) solicitation on Advanced Dry Cooling for Power Plants, Solicitation No. 13–564, National Science Foundation, Washington, DC, May 2013

10. M.M. Ohadi, S.V. Dessiatoun, K. Choo, M. Pecht, Air vs. Liquid and Two-Phase Cooling of Data Centers, in Semi-Therm Proceedings, San Jose, CA, 18–22 Mar 2012

11. Motivaircorp Inc., literature, Amherst, NY, http://www.motivaircorp.com/literature. Accessed 23 Aug 2013

12. S.V. Sundaralingam, P. Kumar, Y. Joshi, Server Heat Load Based CRAC Fan Controller Paired with Rear Door Heat Exchanger, in Proceedings of the ASME 2011 Pacific Rim Technical Conference and Exposition on Packaging and Integration of Electronic and Photonic Systems, InterPACK2011, Portland, OR, 6–8 Jul 2011

13. S. O'Donnell, "IBM Claim that Water Cooled Servers are the Future of IT at Scale", The Hot Aisle, 3 Jun 2009

14. R. Mandel, S.V. Dessiatoun, M.M. Ohadi, "Analysis of Choice of Working Fluid for Energy Efficient Cooling of High Flux Electronics," Progress Report, Electronics Cooling Consortium, CALCE/S2Ts Lab, Dec 2011

15. E. Centegen, Force Fed Microchannel High Heat Flux Cooling Utilizing Microgrooved Surfaces, Ph.D. Thesis, University of Maryland, 2010

16. K.S. Choo, S.J. Kim, Heat transfer and fluid flow characteristics of nonboiling two-phase flow in microchannels. ASME J. Heat Transfer 133, 102901 (2011)

17. S.M. Ghiaasiaan, Two-Phase Flow, Boiling, and Condensation in Conventional and Miniature Systems (Cambridge University Press, Cambridge, 2008)

18. K.S. Choo, S.J. Kim, Heat transfer characteristics of impinging air jets under a fixed pumping power condition. Int. J. Heat Mass Transfer 53, 320–326 (2010)

19. Hewlett-Packard Development Company, “HP-UX Data Center Operating Environment and Integrity Server Blades for the Mission Critical Data Center”, white paper, 2010

20. Y. Joshi, P. Kumar, Energy Efficient Thermal Management of Data Centers (Springer, New York, 2012)

21. J.B. Marcinichen, J.R. Thome, B. Michel, Cooling of microprocessors with microevaporation: a novel two-phase cooling cycle. Int. J. Refrig. 33(7), 1264–1276 (2010)

22. D. Garday, J. Housley, “Thermal Storage System Provides Emergency Data Center Cooling,” White Paper Intel Information Technology, Intel Corporation, Sept 2007

23. R. Schmidt, G. New, M. Ellsworth, M. Iyengar, IBM's Power6 High Performance Water Cooled Cluster at NCAR: Infrastructure Design, in Proceedings of the ASME 2009 InterPACK Conference, IPACK2009, San Francisco, CA, 19–23 Jul 2009


Glossary

Air cooled blade Blade from which heat is removed using air.

Air cooled board Circuit board from which heat is removed using air.

Air cooled chip Chip from which heat is removed using air.

Air cooled equipment Equipment from which heat is removed using air.

Air cooling Removal of heat at its source using air.

Absolute humidity The amount of water vapor in a specific unit volume of air, usually expressed as kilograms per cubic meter.

Ambient temperature The temperature of the specified, surrounding medium (such as air or a liquid) that comes into contact with a semiconductor device being tested for thermal resistance.

AMR Absolute Maximum Ratings, which are the limiting values of operating and environmental conditions applicable to any electronic device of a specific type as defined by its published data, which should not be exceeded under the worst possible conditions.

ANSI American National Standards Institute.

ASHRAE American Society of Heating, Refrigerating, and Air-Conditioning Engineers.

ASHRAE TC9.9 Technical Committee for Facility and Equipment Thermal Guidelines for Data Center and Other Data Processing Environments. This is a consortium of IT users and manufacturers creating common guidelines for the standardization, layout, testing, and reporting of IT rooms and data centers.

BT British Telecommunications.

CAF Conductive Anodic Filament, which occurs in substrates and printed wiring boards (PWBs) when a Cu conductive filament forms in the laminate dielectric material between two adjacent conductors or plated through vias under an electrical bias.

Carbon intensity The total carbon dioxide emissions from the consumption of energy per dollar of gross domestic product (GDP).

Case temperature The temperature at a specified, accessible reference point on the package in which a semiconductor die is mounted.

Chilled water system A type of precision cooling system widely used in mid-sized to large IT environments. A chilled water system uses water as a cooling medium. Cold water is pumped from a chiller to computer room air handlers designed to cool the space. A chilled water air conditioner can be thought of as similar to a car radiator with a fan, with hot air being cooled by being blown through a cool radiator. In a chilled water system cooling an IT facility, the chilled water may be provided as a utility in the building, or special dedicated water chillers may be installed.

Chiller A device used to continuously refrigerate large volumes of water. A chiller uses a refrigeration cycle to produce large volumes of chilled water (typically at 45–48 °F/7–9 °C) that is distributed to Computer Room Air Handler (CRAH) units designed to remove heat from the IT environment.

Clean room A room that is virtually free of dust or bacteria, used in laboratory work and in assembly or repair of precision equipment. Clean rooms usually use precision air conditioning.

Cloud computing IT resources and services that are abstracted from the underlying infrastructure and provided “on-demand” and “at scale” in a multitenant environment.

Cluster Several interconnected servers with common access, which can continue to provide data access in the case of a single server failure; the servers also add computing capability to the network in the case of large numbers of users.

CMOS Complementary Metal-Oxide Semiconductor. A technology for constructing integrated circuits that is used in microprocessors, microcontrollers, static RAM, and other digital logic circuits.

Comfort air conditioning Common air conditioning systems designed for the comfort of people. When compared to computer room air conditioning systems, comfort systems typically remove an unacceptable amount of moisture from the space and generally do not have the capability to maintain the temperature and humidity parameters specified for IT rooms and data centers.

Compressor The compressor is an essential component in the refrigeration cycle that uses mechanical energy to compress or squeeze gaseous refrigerant. This compression process is what allows an air conditioner to absorb heat at one temperature (such as 70 °F/21 °C) and exhaust it outdoors at a potentially higher temperature (such as 100 °F/38 °C).


Condenser coil A condenser coil is one means of heat rejection commonly used in an air conditioning system. It is typically located on an outdoor pad or on a rooftop and looks like an automobile radiator in a cabinet. It is usually hot to the touch (120 °F/49 °C) during normal use. Its function is to transfer heat energy from the refrigerant to the cooler surrounding (usually outdoor) environment. The related Dry Cooler and Fluid Cooler serve the same purpose of heat rejection and physically appear similar, with the difference that the condenser coil uses hot refrigerant which changes from a gas to liquid as it moves through the coil, whereas the fluid cooler uses hot liquid such as water or a water-glycol mix.

Conduction A mode of heat transfer in which heat energy is transferred within an object itself or between objects in contact. When a cold spoon is left in a pot of boiling water, the spoon eventually becomes hot. This is an example of conduction. Conduction is one of the three forms of heat transfer, which also include convection and radiation.

Convection A mode of heat transfer in which heat energy is transferred from an object to moving fluid such as air, water, or refrigerant. The heat sink of a computer processor is an example of heat transfer by convection. Convection is one of the three forms of heat transfer, which also include Conduction and Radiation.

Cooling Removal of heat.

Cooling tower A heat rejection method that transfers heat energy from a data center or IT room to the outside atmosphere via the evaporation of water. In a cooling tower, water is sprayed onto a high surface-area packing material as large volumes of air are drawn through the structure. The net effect of this process is that a small portion of the water circulated through the cooling tower evaporates into the outside atmosphere. The remaining water (now cooler) is collected at the bottom of the cooling tower.

CRAC Computer Room Air Conditioning. A device usually installed in the data center that uses a self-contained refrigeration cycle to remove heat from the room and send it away from the data center through some kind of cooling medium via piping. Must be used with a heat rejection system which then transfers the heat from the data center into the environment. The heat rejection system typically takes one of the following forms: condensing unit, fluid cooler or cooling tower to discharge to the outdoor atmosphere.

CRAH Computer Room Air Handler. A device usually installed in a data center or IT room that uses circulating chilled water to remove heat. Must be used in conjunction with a chiller.

CWR Chilled Water Return. The term used for all piping intended to return chilled water from the computer room air handlers to the chiller.

CWS Chilled Water Supply. The term used for all piping intended to deliver chilled water from the chiller to the computer room air handlers.


Data center Includes all buildings, facilities, and rooms that contain enterprise servers, server communication equipment, and cooling and power equipment, and provide some form of data service.

Data-driven approach One type of prognostics and health management approach which is based exclusively on data analysis to detect anomalies and predict remaining useful life.

DCiE Data Center Infrastructure Efficiency. The ratio of the power drawn by the IT equipment to the total power used by the data center facility: DCiE = IT Equipment Power / Total Facility Power.

Dehumidification The process of removing moisture from air. In a data center or IT room, most dehumidification occurs as moisture-laden air flows across the cold evaporator coil.

Derating The practice of limiting thermal, electrical, and mechanical stresses on electronic parts to levels below the manufacturer’s specified ratings in order to improve the reliability of the part. The objective is to improve equipment reliability by reducing stress or by making allowances for degradation in performance.

Design condition The desired properties for an environment expressed in dry bulb temperature, wet-bulb temperature, and relative humidity. Design conditions are commonly used during the planning stages of a data center or IT room as a basis to aid in the specification of air conditioning systems. Cooling equipment manufacturers normally publish performance data of air conditioning systems at several design conditions.

Downtime A period of time during which a piece of equipment is not operational.

DP Dew point. The temperature at which the air can no longer hold all of its water vapor (that is, it is saturated), and some of the water vapor must condense into liquid water.

Dry bulb temperature The temperature of air shown on a standard thermometer.

DX Direct Expansion. A general term applied to computer room air conditioning systems that have a self-contained refrigeration system and are air-, glycol-, or water-cooled.

Economizer The term applied to an additional cooling coil installed into glycol-cooled computer room air conditioning units to provide free cooling in cold climates. The economizer coil contains cold glycol circulating directly from the fluid cooler when atmospheric conditions allow.

EDA Equipment Distribution Area. Horizontal cables are typically terminated with patch panels in the EDA, the location of equipment, cabinets and racks.

EM Electromigration. The mass transport of a metal wire due to the momentum exchange between the conducting electrons which move in the applied electric field and the metal atoms which make up the interconnecting material.


EPA U.S. Environmental Protection Agency.

ESD Electrostatic Discharge. The sudden and momentary electric current that flows between two objects at different electrical potentials caused by direct contact or induced by an electrostatic field.

ETSI European Telecommunications Standards Institute. Produces globally applicable standards for Information and Communications Technologies (ICT), including fixed, mobile, radio, converged, broadcast, and Internet technologies.

EU European Union. An economic and political union of 27 member states which are located primarily in Europe. Committed to regional integration, the EU was established by the Treaty of Maastricht in 1993 upon the foundations of the European Communities.

Evaporation The process of a liquid becoming a vapor. If a cup of water were boiled for long enough, all the water would be gone: by adding heat, all the water becomes a vapor and mixes with the air.

Evaporator coil The evaporator coil is an essential component used in the refrigeration cycle. It looks like an automobile radiator. This is the part of the system that gets cold to the touch (about 45 °F/7 °C for air conditioning systems) during normal use. It is usually found inside the space that heat needs to be removed from. Cold-feeling air that exits an air conditioner has just transferred some heat energy to the flashing refrigerant as it passed through the evaporator coil.

Facilities equipment (data center) Comprises the mechanical and electrical systems that are required to support the IT equipment and may include power distribution equipment, uninterruptible power supplies (UPS), standby generators, cooling systems (chillers, fans, pumps), lighting, etc.

Free cooling A practice where outside air is used to directly cool an IT room or data center. There are two common types of free cooling. Air-side free cooling introduces cold outside air directly into the IT room or data center when atmospheric conditions allow. Waterside free cooling uses an additional cooling coil containing cold glycol circulating directly from the fluid cooler when atmospheric conditions allow. There are building codes for areas in the Pacific Northwest that mandate free cooling for all data centers.

Fresh air The air outside data centers.

FRU Field Replaceable Unit. A unit that can be replaced in the field.

Fusion approach One type of prognostics and health management approach which combines the merits of both the data-driven and the PoF method, compensating for the weaknesses of each, and is expected to give an even better prognostication than either method alone.

GEIA Government Electronics and Information Technology Association.

Gt Gigatons, 10⁹ tons.


HCI Hot Carrier Injection. A failure mechanism in CMOS technologies. The carriers in a MOSFET’s drain-end gain sufficient energy to inject into the gate oxide and cause degradation of some device parameters.

HDA Horizontal Distribution Area. Serves as the distribution point for horizontal cabling and houses cross-connects and active equipment for distributing cable to the equipment distribution area.

Heat Heat is simply a form of energy. It exists in all matter on earth, in varied quantities and intensities. Heat energy can be measured relative to any reference temperature, body, or environment.

Heat exchanger A heat exchanger allows different fluids to transfer heat energy without mixing. It achieves this by keeping the flowing fluids separated by thin tubes or thin metal plates. Heat exchangers are commonly used in place of condenser coils in water or glycol-cooled air conditioning systems.

Heat pipe Tubular closed chamber containing a fluid in which heating one end of the pipe causes the liquid to vaporize and transfer to the other end, where it condenses and dissipates its heat. The liquid flows back toward the hot end by gravity or by means of a capillary wick. Also defined as a type of heat exchanger.

Heat transfer Heat transfer is the process of an object or fluid losing heat energy while another object or fluid gains heat energy. Heat energy always flows from a higher temperature substance to a lower temperature substance. For example, a cold object placed in a hot room cannot drop in temperature; it can only gain heat energy and rise in temperature. The amount of heat transferred can always be measured over a period of time to establish a rate of heat transfer.

Hot aisle/cold aisle A common arrangement for perforated tiles and datacom equipment. Supply air is introduced into a region called the cold aisle. On each side of the cold aisle, equipment racks are placed with their intake sides facing the cold aisle. A hot aisle is the region between the backs of two rows of racks. The cold air delivered is drawn into the intake side of the racks. This air heats up inside the racks and is exhausted into the hot aisle.

Humidification The process of adding moisture to air. A simple example of the humidification process is when water is boiled and the water vapor produced mixes with the air.

Humidifier The device used to provide humidification in the data center or IT room. Humidifiers either use heat or rapid vibrations to create water vapor. The moisture is usually added to the air stream exiting the air conditioner or air handler.

HVAC Heating, ventilation, and air conditioning. Sometimes an “R” is shown at the end to represent refrigeration.

ICT Information and Communications Technology. Includes fixed telephone, broadband, mobile wireless, information technology, networks, cable TV, etc.


Data centers include the buildings, facilities, and rooms that contain enterprise servers, communication equipment, cooling equipment, and power equipment.

IEC International Electrotechnical Commission. The world’s leading organization that prepares and publishes International Standards for all electrical, electronic, and related technologies.

Inlet air temperature Temperature of the air entering equipment for cooling.

IRR Initial Return Rate. The return rate of units during the first six months after initial shipment (0–6 months after shipment), representing the reliability during installation, turn-up, and testing.

IT equipment (data center) Encompasses all equipment involved with providing the primary data center functions and may include servers, storage devices, networking equipment, and monitoring and control workstations.

Junction temperature The temperature of the semiconductor junction in which the majority of the heat is generated. The measured junction temperature is only indicative of the temperature in the immediate vicinity of the element used to measure temperature.

Lead temperature The maximum allowable temperature on the leads of a part during the soldering process. This rating is usually provided only for surface mounted parts.

Liquid cooled blade Blade from which heat is removed using a liquid.

Liquid cooled board Circuit board from which heat is removed using a liquid.

Liquid cooled chip Chip from which heat is removed using a liquid.

Liquid cooling Removal of heat at its source using a liquid.

LTR Long-Term Return Rate. The return rate of units anytime following YRR (19 months and later after shipment), representing the product’s mature quality.

Make-up air Outside air introduced into IT room or data center. Make-up air is mandated by building codes primarily to ensure that the space is fit for human occupancy.

Manufacturer-specified recommended operating temperature range The operating temperature range over which the part manufacturer guarantees the functionality and the electrical parameters of the part. The part manufacturer may specify the operating temperatures as ambient, case, or junction temperature.

MDA Main Distribution Area. A centrally located area that houses the main cross-connect as well as core routers and switches for LAN and SAN infrastructures.

Microprocessor controller A computer logic-based system found in precision cooling systems that monitors, controls, and reports data on temperature, humidity, component performance, maintenance requirements, and component failure.


MMT Measurement and Management Technologies. A tool set created by IBM to help visualize and understand the thermal profile of existing data centers and IT power and cooling systems. It provides a detailed assessment of the heat distribution throughout a data center and a real-time solution to monitor and manage the cooling and energy efficiency of data centers.

MOSFET Metal-Oxide Semiconductor Field-Effect Transistor, which is a transistor used for amplifying or switching electronic signals.

Mt Megatons (10⁶ tons).

NRY Normalized One-Year Return Rate. The normalized return rate of units during the One-Year Return Rate period.

NU Normalized Units. Based on product categories. They are defined in the TL 9000 measurement applicability table appendix.

OAT Outside air temperature.

OD Outside air damper in air-side economizers.

Outlet air temperature Temperature of air discharged from equipment after cooling.

Parameter conformance An uprating process in which a part (device, module, assembly) is tested to assess whether its functionality and electrical parameters meet the manufacturer’s specifications over the targeted application conditions.

Parameter re-characterization Involves mimicking the characterization process used by the part (device, module, assembly) manufacturer to assess a part’s functionality and electrical parameters over the targeted application conditions.

PHM Prognostics and Health Management. Uses in situ life-cycle load monitoring to identify the onset of abnormal behavior that may lead to either intermittent out-of-specification characteristics or permanent equipment failure.

PoF Physics-of-failure. Utilizes knowledge of a product’s life-cycle loading and failure mechanisms to perform reliability design and assessment.

Power density Electrical power used in a space divided by the area of the space.

Power dissipation The power dissipation limit is typically the maximum power that the manufacturer estimates the package can dissipate without resulting in damage to the part or raising the junction temperature above the manufacturer’s internal specifications. Thus, it is important that the part is used below this maximum value.

Precision air conditioning A term describing air conditioning or air handling systems specifically designed to cool IT equipment in a data center or IT room. Precision air conditioning systems maintain temperature (±1 °F/±0.56 °C) and humidity (±4 %) within much tighter tolerances than regular air conditioning systems. These systems provide high airflow rates (170+ CFM/kW or 4.8 L/s per kW), are designed for continuous usage, and provide high levels of air filtration. Precision air conditioners are also engineered to minimize the amount of moisture removed from the air during the cooling process.

Psychrometric chart The properties of air and the water contained in it at different temperatures, arranged in the form of a chart. In particular, it shows the quantitative interdependence between temperature and humidity. It is useful in the planning, specification, and monitoring of cooling systems.

PUE Power Usage Effectiveness. The ratio of the total power drawn by a data center facility to the power used by the IT equipment in that facility: PUE = Total Facility Power / IT Equipment Power.

QuEST Quality Excellence for Suppliers of Telecommunications. A global communications association comprising a unique partnership of industry service providers and suppliers dedicated to continually improving products and services in the telecom industry.

Rack Structure for housing electronic equipment. Differing definitions exist between the computing industry and the telecom industry. In the computing industry, a rack is an enclosed cabinet housing computer equipment where the front and back panels may be solid, perforated, or open depending on the cooling requirements of the equipment within. In the telecom industry, a rack is a framework consisting of two vertical posts mounted to the floor and a series of open shelves upon which electronic equipment is placed; typically, there are no enclosed panels on any side of the rack.

Redundancy Backups of critical systems or components which are expected to work in case the original system or component fails.

Refrigerant The working fluid used in the refrigeration cycle. Modern systems primarily use fluorinated hydrocarbons that are nonflammable, non-corrosive, nontoxic, and non-explosive. Refrigerants are commonly referred to by their ASHRAE numerical designation. The most commonly used refrigerant in the IT environment is R-22. Environmental concerns of ozone depletion may lead to legislation increasing or requiring the use of alternative refrigerants such as R-134a.

Reliability The probability that a product will perform its expected function under a specific condition for a specific amount of time.

Return air Air returned from the room or building (e.g., data center) which has flushed over the equipment.

RH Relative humidity. The amount of water vapor contained in air relative to the maximum amount the air is capable of holding. Expressed in percentage.

RoC Recommended Operating Conditions. The ratings on a part within which the electrical specifications are guaranteed.


Router A device that connects any number of LANs. Routers use headers and a forwarding table to determine where packets (pieces of data divided up for transit) go, and they use ICMP to communicate with each other and configure the best route between any two hosts. Very little filtering of data is done through routers.

RU Rack Unit. The vertical dimension of a rack-mount server expressed in terms of units. 1 RU represents 1.75 inch (44.45 mm) of vertical height within a rack. Also referred to as U.

Screening A process of separating products with defects from those without defects.

Semiconductor A material that is neither a good conductor nor a good insulator of electricity. The most common semiconductor materials are silicon and germanium. These materials are then doped to create an excess or lack of electrons and used to build computer chips.

Server A computer that provides some service for other computers connected to it via a network. The most common example is a file server, which has a local disk and services requests from remote clients to read and write files on that disk.

Set point User-set or automatic thresholds for heating, cooling, humidification, and dehumidification usually measured in the return air stream of the computer room air conditioner or air handler.

Storage temperature The temperature limits to which the device may be sub-jected in an unpowered condition. No permanent impairment will occur (if used within the storage temperature range), but minor adjustments may be needed to restore performance to normal.

Stress balancing A thermal operating method applicable when a part (device, module, assembly) manufacturer specifies a maximum recommended ambient or case operating temperature. It can be conducted when at least one of the part’s electrical parameters can be modified to reduce heat generation, thereby allowing operation at a higher ambient or case temperature than that specified by the part manufacturer.

Supply air Air entering a room or building (e.g., data center) from an air condi-tioner, economizer, or other facility.

Target temperature range The operating temperature range of the application in which the part is to be used. It may be outside the manufacturer-specified recommended operating temperature range, and it may include temperatures that are higher or lower than the manufacturer-specified temperature range, or both.

TDDB Time-Dependent Dielectric Breakdown. A failure mechanism in CMOS technologies. The electric field applied to the MOSFET gate causes the progressive degradation of the dielectric material, which results in conductive paths in the oxide and shorts the anode and the cathode. This results in an abrupt increase in gate current and a loss of gate voltage controllability over the device current flowing between drain and source.

Telcordia Formerly Bell Communications Research, Inc. or Bellcore. A telecommunications research and development (R&D) company created as part of the 1982 Modification of Final Judgment that broke up American Telephone and Telegraph. It is a global leader in the development of mobile, broadband, and enterprise communications software and services.

Temperature The measurement of heat energy within a body or substance. There are two common scales used to measure temperature: centigrade and Fahrenheit. The centigrade scale (also commonly referred to as Celsius) is widely used internationally while the Fahrenheit scale is commonly used in the United States.

Thermal resistance A measure of the ability of a carrier or package and its mounting technique to provide for heat removal from the semiconductor junction. It is given by the temperature difference between two specified locations per unit of power dissipation and is measured in °C/W. The lower the thermal resistance, the better the package is able to dissipate heat.

Thermal uprating A process for assessing the ability of a part to meet the functionality and performance requirements of the application in which the part is used beyond the manufacturer-specified recommended operating temperature range.

TIA Telecommunications Industry Association. It is the leading trade association representing the global information and communications technology (ICT) industries through standards development, government affairs, business opportunities, market intelligence, certification, and worldwide environmental regulatory compliance.

TWh Terawatt hours (10¹² W h).

Uprating A process for assessing the ability of a part to meet the functionality and performance requirements of the application in which the part is used outside the manufacturers’ recommended operating range.

UPS Uninterruptible Power Supply. An electrical apparatus that provides emergency power to a load when the input power source, typically the utility main, fails.

Upscreening A term used to describe the practice of attempting to create a part equivalent to a higher quality level by additional screening of a part.

VPN Virtual Private Network. The extension of a private network across an insecure/public network (typically the Internet) to provide a secure connection with another server in the private network.

Watt A measurement of power commonly used to quantify electrical and heat loads in data centers and IT rooms. The wattage consumed by the IT equipment, lights, etc. is the amount of heat energy per unit time to be removed from the room by the air conditioning system. This term is becoming more common when specifying cooling systems.

WB Wet bulb temperature. The temperature of air shown on a wet thermometer as water vapor evaporates from it. The difference between wet bulb and dry bulb temperatures is a way historically used to determine humidity. Today direct measurement of humidity using electrical sensors has caused this terminology to become obsolete.

WSE Waterside Economizer. When the outside air’s dry- and wet-bulb temperatures are low enough, waterside economizers use water cooled by a wet cooling tower to cool buildings without operating a chiller.

YRR One-Year Return Rate. The return rate of units during the first year following IRR (7–18 months after shipment), representing the product’s quality in its early period of life.

ZDA Zone Distribution Area. An optional interconnection point in horizontal cabling between the HDA and EDA. It can act as a consolidation point for reconfiguration flexibility or for housing freestanding equipment such as mainframes and servers that cannot accept patch panels.