RISK MANAGEMENT FOR IT INFRASTRUCTURE Executive Handbook, Vol. 1 Series Editors Julian Kudritzki and Matt Stansberry

Series Editors Julian Kudritzki and Matt Stansberry, Uptime Institute

Executive Publisher Martin McCarthy, CEO, 451 Group

Designed by David Wilson

Seattle and New York
Uptime Institute & 451 Group

20 W. 37th Street, 6th Floor

New York, NY 10018

Uptime Institute is an independent division of The 451 Group. Reproduction and distribution of this publication, in whole or in part, in any form without prior written permission is forbidden. The information contained herein has been obtained from sources believed to be reliable and opinions expressed in this publication are solely those of the authors and do not represent the position of Uptime Institute or its Affiliates. Uptime Institute disclaims all warranties as to the accuracy, completeness or adequacy of such information. Although selections of this publication may discuss legal issues related to the information technology business, Uptime Institute does not provide legal advice or services and this publication should not be construed or used as such. Uptime Institute shall have no liability for errors, omissions, or inadequacies in the information contained herein or for interpretations thereof. The reader assumes sole responsibility for the selection of these materials to achieve its intended results.

The opinions expressed herein are subject to change without notice.

ISBN 978-0-9982850-0-9
Printed in the United States of America

Printed on 100% post-consumer waste paper
© 2016 by Uptime Institute, LLC. All Rights Reserved.

CONTENTS

INTRODUCTION
How to overcome the operations risk to IT infrastructure

2016 DATA CENTER INDUSTRY SURVEY
Responses from 1,000 IT and data center end users provide an overview of an industry and profession in transition

DATA CENTER WALKTHROUGH CHECKLIST
Identifying lurking vulnerabilities in even the best-designed data centers

WHY EFFECTIVE GOVERNANCE NEEDS INDUSTRY CERTIFICATIONS
The benchmarks are a function of your business needs, but the award on the wall substantiates Risk Management

AVOIDING DATA CENTER CAPITAL PROJECT FAILURES
Identify and mitigate costly mistakes

BALANCING LIFE SAFETY, INFRASTRUCTURE INVESTMENT, AND DOWNTIME
Due to the uninterruptible nature of IT infrastructure, many organizations allow high-risk maintenance activities

COMPLEX SYSTEMS FAILURE THEORY
Conventional wisdom blames “human error” for the majority of IT outages, but those failures are incorrectly attributed to front-line operator errors, rather than management oversights

APPLY EFFICIENT IT PRINCIPLES TO ADDRESS SUSTAINABILITY RISKS
As Corporate Sustainability programs become increasingly important to C-level execs and investors, IT organizations need to adopt more meaningful KPIs to remain relevant

IT RESILIENCE DURING A NATURAL DISASTER
The most common cause of disruption to IT services during a natural disaster is preventable

A HOLISTIC APPROACH TO VENDOR SELECTION FOR CLOUD AND COLOCATION
As companies rely more on colocation, cloud, and other off-premise computing models, enterprise IT needs to improve how it selects and manages vendors

INTRODUCTION
How to overcome the operations risk to IT infrastructure

IT infrastructure decisions are fraught with risk from almost every angle due to the extensive investment and high stakes of technology deployments.

Organizations are exposed to financial and reputational risk from IT service outages. There are market risks associated with lagging behind industry peers and losing agility and competitiveness. There are operational risks associated with life safety and heavy industrial equipment. Increasingly, companies are facing sustainability and regulatory risks due to the intensity of IT energy use.

All of these concerns require Risk and IT executives to ensure attention to detail at the trade and technician level, while maintaining a holistic view of the overall business goals and challenges. A lack of insight in either area can result in costly organizational blind spots.

In this volume, we have included excerpts of Uptime Institute’s extensive thought leadership in this area. This publication provides senior executives the insight of our leadership team.

Uptime Institute is an unbiased advisory organization focused on improving the performance, efficiency, and reliability of business critical infrastructure through innovation, collaboration, and independent certifications.

Our organization’s tagline is The Global Data Center Authority, and we have assessed and certified over 1,000 IT infrastructure capital projects and operations programs around the globe. Our Network, a user group of professionals, has recorded over two decades’ worth of data and insights into what IT failures happen and why. Our subject matter experts have held leadership positions in IT infrastructure organizations in some of the world’s largest companies.

The keys to risk management for IT infrastructure are identifying the risk factors your organization faces, assessing your organization’s exposure, and ensuring that your processes and procedures are in line with industry recommendations.

To that end, we have included a chapter titled “Theory of Complex Systems Failures” in this volume. We believe it is important for operations risk stakeholders to consider that the reasons for failures are human and addressable.

We have also assembled multiple practical pieces of thought leadership to bring stakeholder attention to life safety, operational upsets, and loss of revenue risks inherent in IT operations. Often, these three are interlaced.

It is historical fact that IT has only reluctantly welcomed outside scrutiny and tends to limit stakeholder input when making strategic decisions. Thus, we have included Industry Survey data to quantify the evolving practices of IT stakeholders, as well as highlight latent issues.

We have also placed IT in the context of natural disaster risk planning, using extensive research into the IT solutions and operations programs that outlasted Superstorm Sandy. Additionally, we call attention to the need for IT to adapt to and adopt the principles of Corporate Sustainability. For many organizations, IT is the leading consumer of resources by street address or headcount. As pressure mounts to prioritize and communicate IT’s resource consumption, a lack of response represents an image and fiduciary risk to enterprises.

We hope that you find this Executive Handbook compelling. This is only an excerpt of our substantial body of thought leadership. We welcome your questions and comments via e-mail below.

As Chief Operating Officer of Uptime Institute, Julian Kudritzki directly leads the strategic corporate and content initiatives with the new Efficient IT and Corporate Governance Advisory Services, standardizing global data center portfolios, and reducing resource consumption in IT infrastructure. After joining Uptime Institute in 2004, he served as one of the architects of the Tier Certification program, which has redefined standardization and accountability within the design-build-operations of capital IT infrastructure projects. Since the inception of this program, he has managed the rapid expansion of Uptime Institute commercial offerings outside the U.S. and formed local teams in Brazil, Latin America, Europe, Middle East, and throughout Asia Pacific.

[email protected]

Matt Stansberry is Senior Director of Content and Publications for the Uptime Institute and also serves as Program Director for the Uptime Institute Symposium, an annual event that brings together 1,500 stakeholders in enterprise IT, data center facilities, and corporate real estate to deal with the critical issues surrounding enterprise computing. He was formerly Editorial Director for TechTarget’s Data Center and Virtualization media group, and was managing editor of Today’s Facility Manager magazine. He has reported on the convergence of IT and facilities for more than a decade.

[email protected]

UPTIME INSTITUTE 2016 DATA CENTER INDUSTRY SURVEY RESULTS
Enterprise IT budgets are shrinking with execs projected to outsource heavily to the cloud in the coming 5 years

EXECUTIVE SYNOPSIS

Many enterprise IT departments are shrinking due to budget pressures, IT hardware advances, and the outsourcing of workloads to cloud and colocation providers. At present, the majority of IT groups maintain a mix of assets across enterprise-owned data centers, colocation partners, and cloud platforms, which is consistent with several years of survey data that suggested the shift to cloud computing would be gradual for conservative enterprise IT organizations. However, this year’s data indicate those assumptions may be incorrect. A major shift in IT’s role in the enterprise is imminent, or has already happened, unbeknownst to the enterprise IT professional.

This survey explores the rapidly growing relationship between enterprise IT and colocation providers and also how enterprise IT can work effectively with business functions outside of its discipline, specifically Corporate Sustainability, in order to drive efficiencies and demonstrate responsible stewardship of resources.

Uptime Institute concludes that IT will need to move away from its role as a slow-moving centralized service provider as IT assets become more distributed across locations and platforms, and instead provide corporate governance across the various business lines by evaluating security, costs, and performance of IT for end users.

DEMOGRAPHICS

The sixth annual Uptime Institute Data Center Industry Survey was conducted via email in February 2016 and includes responses from over 1,000 data center operators and IT practitioners (see Figure 1).

Figure 1: 2016 Survey Respondents. Uptime Institute’s survey respondents include 1,000 data center owners, operators, and IT practitioners from various industries and locations around the world.

Job Function
• Executive 33%
• IT Management 34%
• Facilities Management 33%

Location
• U.S. and Canada 40%
• Europe 22%
• APAC 13%
• Africa and Middle East 12%
• Latin America 10%
• Russia and CIS 3%

Top Verticals
• Colocation or Multi-tenant Data Centers 26%
• Financial 18%
• Telecommunications 14%
• Government 10%
• Manufacturing 6%
• Utilities/Energy 6%

Uptime Institute 2016 Data Center Survey Results

The survey respondents are end users—those responsible for managing infrastructure at the world’s largest IT organizations. The participants represent a wide range of industries, with about a 50-50 split between enterprise IT leaders and service providers—those with operational or executive responsibilities in colocation or cloud computing companies.

The roles of the participants range from IT and Facilities Management to Executive, with senior-level participants at the VP-level and above. Multiple geographic regions are represented, providing a global perspective.

BUDGETS: STABLE OR SHRINKING ENTERPRISE IT?

Is the glass half-full or trending toward empty? For the last five years, around half of enterprise IT departments have faced flat or shrinking overall budgets (combined technology infrastructure of IT and data center facilities). This percentage has held steady in each of Uptime Institute’s surveys and speaks to a close scrutiny of enterprise IT spending. Some enterprise IT organizations are receiving modest budget increases, but fewer than 10% are seeing any significant growth.

Regional Differences Are Minor in a Global Economy

Having conducted this survey for six years, Uptime Institute has noticed less variance in responses between regions. Put another way, an IT director at a bank in São Paulo, Brazil, responds to questions in much the same way as a London-based IT exec in the financial industry. The biggest variances appear to relate to company size and job function and between verticals.

Select regional economies are growing faster and might adopt certain technologies more quickly than others, but these differences have no impact on the purpose of this survey: to examine and evaluate the decision making of enterprise IT and data center leaders.

Defining Enterprise, and Addressing Enterprise Issues

For this survey, Uptime Institute has divided respondents into two categories: enterprise IT and service providers. The enterprise IT category includes government, financial industry, manufacturers, retailers, and any other vertical that deploys IT to serve an internal business function. Service providers include cloud and colocation vendors, meaning any organization that provides IT or infrastructure for customers.

Uptime Institute withheld questions from service providers in several areas of this survey. For example, the survey did not ask service providers about their server hardware footprints or cloud computing adoption plans. The survey is largely focused on how enterprise IT and infrastructure is deployed and managed, both in enterprise-owned data centers and through off-premise computing models.

Lastly, throughout the survey, Uptime Institute uses the generic term colocation. In this survey, colocation applies broadly to any service provider supplying a data center facility, from dedicated facilities to multi-tenant spaces.

This trend is corroborated by 451 Research’s “Voice of the Enterprise Data Centers Q4 2015” report, which found that 48% of budgets are flat or shrinking and that fewer than 6% of respondents will see an annual spending increase over 25%.

Over half of enterprise respondents reported a flat or shrinking server hardware footprint (see Figure 2).

HP ProLiant and Dell PowerEdge servers were listed as the two most critical server platforms for enterprise IT users in 451 Research’s “Voice of the Enterprise, Servers and Converged Infrastructure Q4 2015” report. Over half of the respondents planned to cut spending on HP equipment in 2016, and nearly half planned to cut spending on Dell hardware. Respondents reported plans for increased spending on Cisco’s converged hardware platforms in 2016. But, for companies that supply the x86 hardware that makes up the bulk of data center capacity, dramatic cuts may be coming. Nearly 30% of respondents planned to cut spending on HP server hardware by over 50% in 2016.

The impact of flat enterprise IT budgets and shrinking server hardware footprints is now trickling down to the colocation providers (see Figure 3).

For the last 5 years, colocation providers have experienced massive growth, trying to keep up with demand. Yet the forces shrinking enterprise IT deployments are now impacting the capital project cycle, even for colocation providers. Despite experiencing a slowdown in new capital projects, colocation or multi-tenant data center providers are playing a major role in many enterprise IT teams’ asset mix. According to the survey, a significant portion of an enterprise’s IT workload is deployed in colocation provider sites. The following section will address the drivers and trends for managing IT assets in these third-party service provider sites.

Figure 2. Enterprise IT organizations are looking for ways to decrease spending on telecommunications, staffing, facilities infrastructure, and server hardware.

50% of enterprise budgets flat or shrinking

55% of enterprise server footprints flat or shrinking

Why Are Enterprise Server Footprints Shrinking?
A conversation with 451 Research Director Peter Christy

The number of server hardware units seems to be flat or declining across the enterprise. Uptime Institute first noted the trend in conversations with the people running advanced enterprise IT departments who participated in Uptime Institute’s Server Roundup program two years ago, and now the survey data confirm that the rest of the industry is starting to see something similar. Does this shrinking server footprint fit your view from the market side?

Christy: It does. All the traditional enterprise server suppliers are seeing flat or even declining businesses. New servers have more capacity than the older ones they replace, and server virtualization makes it much easier to refresh the server infrastructure and take advantage of more powerful and cost-effective hardware. All of that fits what you are seeing.

How are these server hardware trends impacting deployment?

Christy: Moore’s Law can’t last forever, but it is still making progress. Intel just introduced its most recent Xeon E5 v4 series, which features a 25% performance bump over last year. There are more cores per processor. The last version had 18; now they’re up to 22. The cores are smaller and offer more throughput.

Also, server virtualization allows higher server utilization; fewer servers are needed to do the same work. Server virtualization also makes it easier to refresh servers with newer, more cost- and space-effective replacements. Server virtualization has happened rapidly by the historical standards of data center change because it could be done by the server team and because it yielded a quick ROI.

Also, server virtualization has made it easier to move workloads to the cloud, and enterprises are starting to outsource elements of IT to the public cloud so that fewer servers are needed in their private enterprise data centers.

Our surveys have shown a conservative adoption rate of cloud computing among enterprise IT groups. But, we also see some indications that the industry is poised to make a major shift. What do you think happens next?

Christy: The shift to the public cloud is fascinating. It’s being driven now by the need for enterprises to be more agile—to respond to issues and opportunities more quickly. IT plays an important role in business agility. If IT isn’t agile, it’s hard for a business to be agile. Public cloud services, in particular the market leader Amazon Web Services (AWS), have played a key role in demonstrating the potential value of agile IT and demonstrating what is possible, at least on a platform like AWS.

Most enterprises would prefer a private alternative to the public cloud, but so far that’s been hard to accomplish. Although IT surveys clearly show this reluctance and would lead you to believe the evolution to the public cloud will be slow, other data suggest otherwise.

For the last year, Amazon has broken out AWS as a separate business, and we see that revenues are approaching a US$10-billion run rate, which is growing at more than 50% year over year with 25% profitability. These are all breathtaking numbers, especially in an IT industry that is at best slowly growing.

The shift is also seen in what Intel reports about server CPU sales. Three years ago, Intel said that the cloud segment was growing more rapidly than the enterprise segment but was still much smaller. More recently, Intel said that 2016 would be the crossover year in which cloud sales would exceed enterprise sales.

Finally, all the traditional enterprise IT suppliers (HP, Dell, and IBM) are struggling just to keep the enterprise business flat and looking for ways they can sell to cloud providers. Although the surveys may show that enterprises don’t want rapid evolution to the cloud, other data suggest the change may happen quickly.

Peter Christy is the Research Director of 451 Research’s Networking Practice. For more than 30 years, Peter has worked with segment leaders in a spectrum of IT and networking technologies. He managed software and system technology for companies including HP, Sun, IBM, Digital Equipment Corp, and Apple.

Figure 3. Enterprise cuts affecting the colocation industry. Despite experiencing a slowdown in new capital projects, colocation or multi-tenant data center providers are playing a major role in many enterprise IT teams’ asset mix.

Colocation budget increases 2014 – 2016
• 2014 86%
• 2015 74%
• 2016 64%

Colocation Builds
• 2014 45%
• 2015 29%
• 2016 24%

Enterprise Builds
• 2014 18%
• 2015 15%
• 2016 15%

Figure 4. Uptime Institute says that these survey numbers may understate the move to off-premises computing by enterprises.

Where are your current IT assets located? Estimate percentages:
• Enterprise-owned Data Center 71%
• Colocation or Multi-tenant Data Center Provider 20%
• Cloud Computing 9%

THIRD-PARTY SERVICE PROVIDER ADOPTION AND MANAGEMENT

For the last 5 years, the majority of survey respondents have reported that some percentage of their IT portfolio resides outside of their enterprise-owned data centers, either in the cloud or in a colocation facility (see Figure 4). In 2016, over 75% of respondents claimed to use some form of off-premise computing. Arguably that number is closer to 100%, as many of the respondents may be unaware of initiatives outside of their purview and end users might even purposely circumvent traditional IT channels and barriers. In the years (2012-2016) that the survey has included this question, these numbers have remained fairly static. Despite massive growth in cloud computing revenues, cloud adoption appears to be conservative. And yet, asking this question in a different way to a different segment of the audience yields results that suggest the industry is poised for a major realignment.

About half of senior executives say they expect the majority of their IT workloads to reside off-premise in cloud or colocation sites in the future. Around 70% of those respondents expect that shift to happen by 2020, and 23% expect that shift to happen by next year.

Uptime Institute saw it coming. “The 2013 Data Center Industry Survey” found that C-level execs were not paying attention to data center infrastructure cost or performance metrics. At that time, Uptime Institute advocated that data center and IT professionals become more effective at articulating their value to the business or risk being outsourced.

At Uptime Institute Symposium that year, an operations director at a very large U.S.-based company said that he was making simple, no-cost changes to the data center that would save his company hundreds of thousands of dollars annually and extend the life of legacy data center assets—offsetting a looming eight-figure capital investment.

To the question “What does your CIO think of your projects?” he responded, “I’ll let you know if I ever meet him.”

Fundamentally this survey takes the pulse of enterprise IT and data center professionals—stakeholders who are not motivated to go to the public cloud, who will attempt to diminish the speed with which it will happen, and who will emphasize the potential problems with cloud computing. While CIOs may be loath to relinquish their empires, other executives in the business lines will demand more scalable, responsive IT on demand.

Survey: Top Drivers for Multi-Tenant Data Center Adoption

In your own words, what are the drivers for colocation adoption?

• Reduce churn of noncritical workloads into critical space

• Mergers/Acquisitions activity

• Disaster recovery site on a separate power grid

• Executive directive to divest owned data center infrastructure

• Global expansion

• Avoid large capital expenses of new site build

• Not core business

• Lack of confidence in staff/resources

To be clear, there will not be an exodus of enterprise data center workloads to the cloud. Sunk investments, human nature, and organizational resistance will sustain many traditional enterprise IT roles into the foreseeable future. And yet, as business lines demand agility and transparency, enterprise IT will need to emulate service providers, as they will increasingly be competing with them. Additionally, IT and data center teams can reorient to provide corporate governance, advising and assisting business lines with service provider procurement, and managing vendor relationships.

Despite major adoption of cloud and colocation, the outsourcing model is not a panacea. According to 2016 Survey Data:

• 40% of enterprise respondents are paying more for colocation contracts than they had initially planned or expected

• Nearly one-third of respondents had experienced an outage at a colocation vendor site

• Over 60% of respondents said the penalty clause in their Service Level Agreement (SLA) would not adequately offset the cost of that outage to the business

Enterprise IT on Colocation Providers

What is the length of the typical contract commitment you make to a colocation or multi-tenant data center provider?

• Under 2 years 12%

• 2-4 years 39%

• 4-5 years 25%

• Over 5 years 24%

How many separate colocation or multi-tenant data center providers is your organization currently using?

• One 30%

• Two to Three 35%

• Three to Five 13%

• Over Five 22%

Does your organization consider third-party certifications such as the Uptime Institute’s Tier Certification and/or M&O Stamp of Approval as part of the vetting process for considering potential colocation candidates?

• Yes 65%

• No 35%

To be fair, customer satisfaction levels are high. Almost half the respondents reported being satisfied or very satisfied with their primary provider, while 7% said they were dissatisfied or very dissatisfied. In 2015, enterprise IT organizations reported experiencing slightly more outages in their enterprise-owned sites over a 2-year period than at their colocation sites.

That said, enterprise IT organizations paying a premium for a third party to deliver data center capacity should hold service providers to higher standards than their own organization. There is significant room for improvement in vetting, negotiating, and managing those relationships.

THE CORPORATE SUSTAINABILITY DEPARTMENT

Uptime Institute has tracked data center trends and sentiments in this survey for 6 years, but the last 3 years of survey responses illustrate a major shift in IT infrastructure efficiency prioritization. The survey results from 2014–16 demonstrate how Corporate Sustainability will have a major impact on IT departments going forward.

In 2014, many enterprise IT organizations were sitting on recently built (last 5 years) data center facilities that were underutilized due to forecasting errors. Senior IT and data center staff relied heavily on skewed PUE metrics as an indicator of success—touting the efficiency of the cooling systems in a partially loaded facility (that in hindsight should never have been built at that scale) as a metric senior management should care about.

Nearly half the respondents were not auditing their sites for comatose server hardware at all, as addressing this issue would only make overall utilization look worse. Yet 80% of respondents reported achieving U.S. Green Building Council’s LEED designation or other green building award, which provided very little environmental or financial return in the context of the data center.

It is not surprising that respondents reported C-suite executives were not interested in data center efficiency or performance.

In 2015, organizations with major IT infrastructure investments began to try to address root problems but struggled to get organizational buy-in. The primary culprits of organizational inefficiency had been ignored for years:

• Poor demand and capacity planning within and across functions

• Significant failings in asset management and utilization

• Lack of financial accountability

These problems stemmed from a disconnect between IT infrastructure costs and the business lines. Accurate forecasting and asset utilization are not prioritized if no one is held accountable for those functions. In many cases, the facility or corporate real estate teams owned an underutilized data center investment that was treated as an undifferentiated cost center, totally unattributed to the IT department or the lines of business. Many IT leaders realized chargeback would address the chronic problems with IT efficiency.

In 2015, less than a third of survey respondents said that their organizations had deployed a chargeback accounting method. In May of that year, Uptime Institute gathered a group of senior stakeholders for the Executive Assembly for Efficient IT. The group comprised leaders from large financial, healthcare, retail, and web-scale IT organizations; the purpose of the meeting was to share experiences, success stories, and challenges to improving IT efficiency.

Nearly every organization in the room had struggled to implement chargeback, but almost all of them had an implementation in progress and realized they needed to reach out to counterparts in other disciplines within the business to make the most of these efforts.

In 2016, infrastructure leaders said that they faced strong internal resistance to addressing chronic inefficiency. In order to implement accountability and efficiency measures, the projects needed senior-level support. These efforts would not succeed as bottom-up initiatives. With that knowledge, infrastructure teams reached out to their counterparts in other disciplines (see Figure 5).

Increasingly, Corporate Sustainability drives decisions at large companies, as this function can affect the investor community, stock price, and capitalization. Many companies meet these challenges by creating sustainability offices that have both C-level visibility and broad staff participation across all business units and facilities, including IT.

Chargeback is a method of charging internal consumers (e.g., departments, functional units) for the IT services they use. Instead of bundling all IT costs under the IT department, a chargeback program allocates the various costs of delivering IT (e.g., services, hardware, software, maintenance) to the business units that consume them (see “IT Chargeback Drives Efficiency,” The Uptime Institute Journal, vol. 6, page 22, and journal.uptimeinstitute.com).
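To make the allocation concrete, the following is a minimal sketch rather than a description of any particular chargeback program. It assumes a single pool of IT cost that is split in proportion to one metered consumption figure per business unit; the function name, the unit names, and the figures are all hypothetical.

```python
def allocate_chargeback(total_cost, consumption):
    """Split total_cost across business units in proportion to consumption.

    consumption maps a business unit name to a metered quantity
    (for example kWh or server-hours); the choice of metric is an assumption.
    """
    total = sum(consumption.values())
    if total <= 0:
        raise ValueError("total consumption must be positive")
    # Each unit pays its share of the pool, rounded to cents.
    return {unit: round(total_cost * used / total, 2)
            for unit, used in consumption.items()}


# Hypothetical figures for illustration only.
bill = allocate_chargeback(1000.0, {"retail": 60, "finance": 30, "hr": 10})
```

A real program would likely allocate several cost categories, each with its own driver, but the principle of attributing cost to the consumer is the same.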

The relationship between Corporate Sustainability and enterprise IT is really just getting started. There are good signs for the potential of this relationship, but also signals that entrenched behaviors and metrics will be difficult to overcome.

The relationship so far, according to the survey stakeholders, has been overwhelmingly positive:

• 73% report that sustainability executives understand the data you provide and use it properly

• 44% report having a beneficial relationship with Corporate Sustainability

• Less than 10% claim that Corporate Sustainability efforts pose a risk to IT performance or availability or create needless work

And yet, the reporting functionality has a long way to go. If IT infrastructure leaders are motivated to improve the accountability and efficiency of their organizations, Corporate Sustainability is a great partner for gaining C-level buy-in and funding for projects.

But in many cases, IT infrastructure teams are still relying on the least meaningful metrics to drive efficiency.

Figure 5. IT reached out to partner with finance, risk, and even Corporate Sustainability to gain executive visibility and traction to address chronic problems.

Which major business functions or departments are consistently absent from major IT infrastructure decisions?

1. Finance
2. Risk
3. Sustainability

Two years ago, a place at the table for sustainability would have been provocative and perhaps evoked derision. In 2015, less than a tenth of enterprise IT stakeholders had confidence in Corporate Sustainability to affect IT efficiency and costs. One short year later, the picture is vastly different, and the data suggests that the time of Corporate Sustainability in IT has arrived: 70% of enterprise IT organizations actively participate in Corporate Sustainability efforts. The influence of an outside party breaks down the ‘thwart by silo’ effect that has frustrated so many well-meaning, and often fruitless, efforts to reshape IT.

As Uptime Institute Chief Operating Officer Julian Kudritzki wrote in Network World in April 2016:

The majority of IT departments position total data center power consumption, LEED certifications, and total data center power usage as primary indications of efficient stewardship of environmental and corporate resources.

Infrastructure leaders should not co-opt the Corporate Sustainability department to continue to perpetuate the fallacy that efficient computer room air conditioning is indicative of an efficient IT organization.

Rather, companies need to use this visibility and executive influence to address the chronic efficiency problems in a holistic manner, not only for the sake of their businesses but also for the very existence of enterprise IT, as increasingly these organizations will be forced to compete with the cloud (See “A Holistic Approach to Reducing Cost and Resource Consumption” The Uptime Institute Journal, vol. 4, page 18 and at journal.uptimeinstitute.com).

CONCLUSIONS

Enterprise IT budgets and server footprints are in decline, and that trend will continue. Outsourcing is rampant in the face of opaque costs and chronically poor capacity planning.

In the face of budget constraints and competition from service providers, leading enterprise IT organizations are trying to drive efficiency and transparency to compete with the cloud.

In the face of these challenges, infrastructure executives have reached out to business stakeholders in other parts of the organization to become more responsive. Most IT organizations are partnering with Corporate Sustainability, but the efforts still focus primarily on the least impactful aspects of efficiency.

As IT assets become more distributed across locations and platforms, IT needs to move away from its role as a slow-moving centralized service provider and instead provide corporate governance across the various lines of business by evaluating the security, costs, efficiency, and performance of IT for end users.

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute

DATA CENTER WALKTHROUGH CHECKLIST

Identifying lurking vulnerabilities in even the best-designed data centers

Even the best-designed data centers have vulnerabilities. Companies with complex IT systems design safeguards against failure with multiple layers of protection and backup. Thus, when IT infrastructure fails, it is not due to a lack of backup systems but rather a failure of management.

Uptime Institute has delivered operations assessments across hundreds of data center facilities and has identified indicators of management shortfalls.

But how do you evaluate management behaviors?

You do not need to be an expert in data center operations to determine whether underlying risk factors are being left untended by management. Use this checklist to ensure processes and documentation are in place—the organization’s responsiveness, familiarity, and adherence to documented procedures are key to evaluating performance.

The majority of IT outages occur for practical and predictable reasons that aren’t sexy and aren’t attended to. Management structures were not in place or were not followed; a lack of processes or enforcement of processes defeated the investment.

Use this checklist to identify areas of improvement and further inquiry for your staff or service provider.

Are there any combustible materials (cardboard, paper, etc.) on the raised floor or in the battery room or electrical rooms? All incoming equipment should be stripped of packaging outside of critical space.

Are unrelated items—office furniture, shelving units, tools—stored in critical space? This is a fire, safety, and contamination issue.

Review fire extinguishers for out-of-date tags.

Ask to see the housekeeping policy and procedure documentation.

If the facility operates a raised floor, review the condition of the underfloor plenum. This area should be cleaned regularly; ask to see the schedule.

How many employees have access to the critical space? Does your organization even have an access policy for staff?

Ask to see the vendor check-in and training requirements; non-vetted individuals should not be allowed in critical areas.

Are panels, switchboards, and valves labeled to indicate “normal” operating positions?

Ensure arc flash labeling is installed on all panels and PDUs. (See “Balancing Life Safety, Infrastructure Investment, and Downtime,” page 32)

Data center cooling practices for over a decade have called for airflow isolation—cool air delivered to the front of a rack of IT equipment and hot air exhausted out the back. In a raised floor environment, rows of equipment are typically arranged in what is called Hot Aisle-Cold Aisle configuration—perforated tiles deliver cool air to the cold aisle or server intakes.

Are any grated or perforated panels in the Hot Aisle? Are there unsealed cutouts in the raised floor?

Are there uncovered gaps in the racks between IT hardware?

All of these are indicators of poor bypass airflow management, which results in cooling inefficiency and wasted money and signals poor adherence to management best practices.

Ask to see records and schedules for maintenance activities on batteries, engine generators, and mechanical systems.

Ask to see staffing documentation—overtime rates greater than 10% can lead to an increase in human error that causes outages. Are roles and responsibilities documented? Are qualifications listed?

Ask to see the list of preventive maintenance activities. Are the activities fully scripted? What is the quality control process?

Who keeps critical documentation on equipment, including warranty info, maintenance records, and performance data?

Ask to see training records, annual budget, and time allocation.

What is the process for keeping the reference library (staffing, equipment, maintenance, procedures, and scripts) up-to-date?

Many data center programs, even successful, rigorous operations, are subject to vulnerabilities and would benefit from continuous improvement. Some of the items on this checklist will raise red flags and identify areas that need attention.

But there are some other symptoms to look for that indicate a crisis in management rigor:

Are data center staff voice mail boxes full, emails not responded to, email inbox size limit exceeded, or meetings missed or routinely cancelled?

Does your data center team report having no time for training? Shortage of qualified staff? Personnel performing work outside their competency? High personnel turnover?

Has Maintenance exceeded its budget? How about energy cost estimates?

Does the back of the server or cable trays look like a spaghetti pot blew up? Is the cabling all correctly labeled? Is there a unique labeling system for equipment? If it looks like a mess, it is a mess.

Successful IT infrastructure teams are preoccupied with failure and display attention to detail and a commitment to process over personality. Use this checklist to discover vulnerabilities in your IT operations and start a conversation with your staff and service providers.

Management and operations have the biggest impact on your IT infrastructure performance, and provide the biggest opportunity for change and improvement.

By Uptime Institute senior technical staff

WHY EFFECTIVE GOVERNANCE NEEDS INDUSTRY CERTIFICATIONS

The benchmarks are a function of your business needs, but the award on the wall substantiates Risk Management

Discussions around standards and certifications tend to take two forms: technical debates on the nuances of criteria and discussions around proof, that is, whether compliance is self-attested or audited by a third party. Both are valid discussions, and Uptime Institute has written a wide-ranging body of content addressing both (see The Uptime Institute Journal, industry publications, and conference archives).

But what is lost in these discussions is the governance imperative of a rigorous certification that considers the special attributes of the outcome being validated.

When evaluating any certification, executives should focus on its governance benefit.

• Will the process of certification fundamentally improve my project outcome?

• Will it have benefit to my organization beyond the testing/evaluation period?

• Is its process perfunctory? An adaptation of accounting and stock-keeping practices?

• Will its accomplishment speak to a ‘through and through’ insight?

• Will my Board understand it?

The certification business is unforgiving in practice. You are only as good as the last thing you certified. Uptime Institute has certified around 1,000 projects in dozens of countries. Our business is to release standards to the industry royalty free, but we exclusively audit and certify. Reserving the audit right is key to consistency, as Uptime Institute experts adjudicate the criteria without conflict of interest or other temptation. And, by maintaining a central auditing resource, we are best able to ensure consistency and accountability.

Our certifications were developed to secure the governance-level requirements of a major capital investment. For example, Tier Certification is a three-sequence process of Design Documents, Constructed Facility, and Operations. This sequence enforces discipline and transparency because each Certification is a prerequisite for the next. If an organization has only the first award, it immediately raises the question of its capability to build what has been designed. On the other hand, each award allows the confident transfer from design to implementation to service.

The need for this rigor is just as potent for the 1,000th Certification as it was for the first. The reason is complex systems failure theory. An application of complex systems failure theory to data center and IT infrastructure may be found in this book (see page 40).

In summary, as the complexity of a project increases and as the number of disciplines involved increases, the opportunity for failure exponentially increases.

Human foible is the thief of success. And as long as complex capital projects require multiple human beings in various trades, the risk will not go away. And, the stature of each engineering, construction, testing, or operations vendor is not a guarantee and should not be mistaken for a Governance decision or Risk Management.

Despite specialization of brands and an industry that is arguably in its late second or early third generation of professionals, the need for Governance through Certification is not abating. We have a unique vantage point as a certification body. And, as we have chosen to certify at the system level, we stand at the intersection, alignment, and battleground of the many trades and disciplines.

When we developed our capital project Assessment & Certification regimen, we based it on technical criteria that would verify each and every technical and engineering detail. But we also instilled criteria that were consistent with the fundamental tenets of a high availability IT investment, regardless of the size, scale, or sophistication of the project. The fundamental tenets include life safety, independent verification of design intent (such as loss of utility power), and proving the site’s investment in functionality.

Over 70% of projects fail at least one of our fundamental tenets. In a majority of assessments, we discovered at least one unidentified incident looming over steady-state operations. In every one of these failed assessments, the owner had offset risk with appropriate capital investment and the team had signed off on the project as ready for service.

The vast majority of capital projects, regardless of location or market maturity, experience one or more of the following:

• Disarray of project milestones
  o Design decisions being made well into construction or even commissioning

• Absence of Owners’ Representation
  o Project schedule is not published and updated
  o Design or other key documentation has revision control issues
  o Project requirements are broadly written or bypassed

• Non-binding KPIs
  o Testing and commissioning regimen is not established
  o Certifications are not communicated and coordinated

EXIT REPORTS FROM RECENT PROJECTS

An exit report from a 2016 capital project in the Americas reads:

• Ordinary Design

• Acceptable Construction

• Minor Testing & Commissioning

• Failed

The review continues in exacting detail, but, at an executive level, the back-up power systems failed during a simulated electrical utility outage. This was an anticipated design condition—and arguably the most rudimentary function—of the new data center. The underlying reason was a ‘feature’ that had been engineered into the back-up power systems, but the owner did not receive training, did not have appropriate knowledge, and had not been informed of its existence, thereby defeating the purpose of the data center.

An exit report from a 2016 capital project in Europe reads:

• Repurposing of an Existing Building in Historical Area

• Well Thought Out Solutions to Limitations of Materials & Configuration

• Minor Testing & Commissioning

• Failed

The review continues in exacting detail, but, at an executive level, the data center posed a threat to life safety. Service work on the power systems necessitated placing a screwdriver on a live 400-volt connection. Additional failures were attributed to incorrect fuse ratings and errors in the building monitoring and automation system. Any of these three issues would have resulted in a service interruption of the new data center.

It is important to note that our assessment was the very last step in the capital project, immediately preceding the new data center being placed into service. All of the capital project stakeholders had signed off on the data center before our assessment began.

The reasons for the high rate of shortfalls are consistent with the social science of disaster and the normalization of deviance. Schedule pressures, cost pressures, and the absence of anyone focused on the comprehensive ‘whole’ lead to these outcomes even when the requisite skills and experience are in place.

These items were discovered because Tier Certification was the barrier to the asset being placed in service. Without a Certification in place, the investor had no recourse. By requiring Certification, the rigor of the project itself was put to the test. And the award on the wall demonstrated compliance at the profound level of the intent of the investment and its core business value.

During the genesis of Tier Certification, it was decided that any project claiming to be of high availability, regardless of Tier, had to follow a high standard of care. The project must demonstrate exactitude and coordination that was responsive to its complexity and the enormity of the investment. This process enforced a clarity and sequencing of the design, construction, and testing. The team that developed Tier Certification understood that upholding the sequence of project milestones was an inherent challenge, and that within each project phase, tens of trades were involved, any of which could single-handedly defeat the outcome.

Industry certifications with a holistic focus rise from technical acumen to the governance level. On numerous occasions, we have been engaged at the Board level in either a protectionist or triage role. These engagements showed the capability of industry certifications to bridge the abstruse nature of technology and the Board’s mandate of fiduciary responsibility and both Risk Mitigation and Risk Management.

By Julian Kudritzki, Chief Operating Officer, Uptime Institute

AVOIDING DATA CENTER CAPITAL PROJECT FAILURES

Identify and mitigate costly mistakes

For enterprise IT organizations embarking on a data center capital project, the stakes are undeniably high. Building a new data center is a massive investment that also enables or hampers an organization’s IT strategy and capability, affecting the organization’s business performance for years to come. As more organizations rely on colocation data center providers, ensuring that the design and construction of these projects meet your requirements is critical as well.

With multiple vendors, subcontractors, and typically more than 50 different disciplines involved in any data center project—structural, electrical, HVAC, plumbing, fuel pumps, networking, and more—it would be remarkable if there were no errors introduced or corners cut during the construction process.
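A rough calculation shows why. Assume, purely for illustration, that each of n disciplines independently introduces at least one error with probability p; neither the independence assumption nor the value of p comes from Uptime Institute data. The probability that a project sees at least one error is then:

```latex
\[
P(\text{at least one error}) = 1 - (1 - p)^{n}
\]
\[
n = 50,\; p = 0.05: \quad 1 - 0.95^{50} \approx 1 - 0.077 = 0.923
\]
```

Even with a hypothetical 5% chance per discipline, an error somewhere in the project is far more likely than not.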

Lapses in construction oversight, planning, and budget can mean that an expensive new data center facility will fail to meet the owner’s requirements, with the end result offering poor performance, limited flexibility, insufficient compute resources, or excess stranded capacity.

Addressing problems as they reveal themselves may delay construction and delay the start of operations and usually requires significant spending. In some cases, the problems continue to hamper operations for the life of the data center and may eventually require the facility to be replaced prematurely. Even if the facility should continue operations for its expected life, it may cost more than expected to operate, suffer more downtime incidents, and complicate efforts to introduce new products and services.

Any data center capital project is subject to complex challenges. Inclement weather, delayed equipment delivery, overwhelmed local resources, slow-moving permitting and approval bureaucracies, lack of availability of public utilities (power, water, gas), a merger or acquisition, or another shift in corporate strategy can delay construction or increase costs. Capital project teams must nevertheless prepare for and resolve the problems that result from such unexpected conditions.

In Uptime Institute’s experience delivering more than 1,000 Certifications across the globe, problems in construction are most often attributed to:

• Poor integration of complex systems

• Lack of thorough commissioning or compressed commissioning schedules

• Unapproved or unexamined design changes

• Substitution of materials or products

In a number of cases, data center failures, delays, or cost overruns occur during the construction phase because of misaligned construction incentives or poor contractor performance. In reality, the seeds of both these issues are sown in the earliest phases of the capital project, when design objectives, budgets, contracts, and schedules are developed; RFPs and RFIs issued; and the construction team assembled.

These issues arise during construction, commissioning, or even after operations have commenced and may impact cost, schedule, or IT operations. These construction problems often occur because of poor change management processes, inexperienced project teams, misaligned objectives of project participants, or lack of third-party verification.

CONTRACTS AND OWNER’S REPRESENTATIVES

At the project outset, all parties should recognize that owner objectives differ greatly from builder objectives. The owner wants a data center that best meets cost, schedule, and overall business needs, including data center availability. The builder wants to meet project budget and schedule requirements while preserving project margin. Data center uptime (availability) and operations considerations are usually outside the builder’s scope and expertise.

Thus, it is imperative that the project owner—or owner’s representatives—devise contract language, processes, and controls that limit the contractors’ ability to change or undermine design decisions while making use of the contractors’ experience in materials and labor costs, equipment availability, and local codes and practices, which can save money and help construction follow the planned timeline without compromising availability and reliability.

Data center owners should appoint an experienced owner’s representative to properly vet contractors. This representative should review contractor qualifications, experience, staffing, leadership, and communications. Less experienced and cheaper contractors can often lead to quality control problems and design compromises.

Data center owners should also gather a design, construction, and project management team with extensive data center experience. If necessary, outside experts may be needed to focus on the owner’s project requirements. Keep in mind that an IT group may not understand schedule risk or the complexity of a project. Experienced teams are more likely to push back on unrealistic schedules or suggestions that would compromise the project.

The owner or owner’s representative must work through all the project requirements and establish an agreed-upon sequence of operations and an appropriate and incentivized construction schedule that includes sufficient time for rigorous and complete commissioning. In addition, the owner’s representative should regularly review the project schedule and apprise team members of the project status to ensure that the time allotted for testing and commissioning is not reduced.

Project managers or contractors looking to keep on schedule may perform tasks out of sequence. Tasks performed out of sequence often have to be reworked to allow access to space allocated to another system or to correct misplaced electrical service, conduits, ducts, etc., which only exacerbates scheduling problems.

Construction delays should not be allowed to compromise commissioning. Incorporating penalties for delays into the construction contract is one solution that should be considered.

CHANGE CONTROL IS CRITICAL TO PROJECT SUCCESS

Once a design has been finalized, change control processes are essential to managing and reducing risk during the construction phase. For various reasons, many builders, and even some owners, may be unfamiliar with the criticality of change control as it relates to data center projects. No project will be completely error free; however, good processes and documentation will reduce the number and severity of errors and sometimes make the errors that do occur easier to fix.

Value Engineering is a common construction practice to reduce the expected cost of building a completed design. The process has its benefits, but it tends to focus just on the first costs of the build. Often conducted by a building contractor, the practice has a poor reputation among designers because it often leads to changes that compromise the design intent. Yet other designers believe that in qualified hands, value engineering can yield savings for the project owner, without affecting reliability, availability, or operations.

Value Engineering needs to be integrated into the change control process. If it is performed without input from Operations and appropriate design review, any initial savings realized from these changes may be far less than the combined cost of the remedial work needed to restore features necessary to achieve Concurrent Maintainability or Fault Tolerance and the increased operating costs over the life of the data center.
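The trade-off described above can be sketched with simple arithmetic. The function and every figure below are hypothetical, and the sketch ignores discounting and the cost of any downtime; it only shows how a first-cost saving can be outweighed over the life of the facility.

```python
def net_effect_of_change(first_cost_saving, remedial_cost, added_annual_opex, years):
    """Return the net financial effect of a value-engineering change.

    A negative result means the change costs more over the period
    than it saved at construction time.
    """
    return first_cost_saving - remedial_cost - added_annual_opex * years


# Hypothetical figures for illustration only.
net = net_effect_of_change(
    first_cost_saving=200_000,
    remedial_cost=150_000,
    added_annual_opex=20_000,
    years=15,
)
```

With these assumed figures the result is negative, which is the situation a design review within the change control process is meant to catch before the change is approved.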

As a result, each and every change must be scrutinized for its effect on the design. Retaining the original design engineer or a project engineer with experience in data centers may reduce the number of inappropriate changes generated during the process. Even so, data center owners should be aware that Uptime Institute personnel have observed that improperly conducted Value Engineering has led to equipment substitutions or systems consolidations that compromised owner expectations of Fault Tolerance or Concurrent Maintainability. Contractors may substitute lower-priced equipment that has different capacity, control methodology, tolerances, or specifications without realizing the effect on reliability.

THE SORRY STATE OF COMMISSIONING

Uptime Institute’s extensive global field experience reveals the vast majority of even the world’s most elite data center projects do not operate as designed/installed on day one. Uptime Institute consultants find system faults in nearly every Tier Certification site visit. Data center owners often comment that Tier Certification demonstrations are more rigorous than their commissioning program.

Commissioning activities represent a unique opportunity for data center owners. The ability to rigorously test the capabilities of the critical infrastructure that supports the data center without any risk to mission critical IT loads is an opportunity that should be capitalized on to the maximum possible extent.

Uptime Institute observes that this critical opportunity is being wasted far too often in data center facilities, with not nearly enough emphasis on the rigor and depth of the commissioning program required for a mission critical facility until critical IT hardware is already connected.

A well-planned and executed commissioning program will help validate the capital investment in the facility to date. It will also put the operations team in a far better position to manage and operate the critical infrastructure for the rest of the data center’s useful life, and ultimately ensure that the facility realizes its full potential.

Construction teams that are insufficiently experienced in the rigors of data center commissioning often underestimate the time required or regard the commissioning period as a kind of buffer that can be accessed when work runs late. For both these reasons, it is important that the owner or owner’s representative take care to schedule adequate time for commissioning and ensure that contractors meet or exceed construction deadlines. One recommendation is to engage the commissioning agent and general contractor early in the process as partners in the development of the project schedule.

In addition, data center capital projects include requirements that might be unfamiliar to teams lacking experience in mission critical environments; these requirements often have budgetary impacts.


For example, owners and owner’s representatives must scrutinize construction bids to ensure that they include funding and time for:

• Factory witness tests of critical equipment

• Extended Level 4 and Level 5 commissioning with vendor support

• Load banks to simulate full IT load within the critical environment

• Diesel fuel to test and verify engine-generator systems

Because experienced teams understand the importance of data center-specific commissioning, the commissioning agent will be able to work more effectively early in the process, setting the stage for the transition to operations.

In addition, Operations should be part of the design and construction team from the start of the project through commissioning and handover. Including Operations in change management gives it the opportunity to share and learn key information about how that data center will run, including set points, equipment rotation, change management, training, and spare inventory. This information will be essential in everyday operations and in dealing with incidents.

THIRD-PARTY CERTIFICATION

Third-party verifications can assure the owner that the project delivered meets the owner’s project requirements. Uptime Institute has witnessed third-party verification improve contractor performance. The verifications motivate the contractors to work more diligently, perhaps because verification increases the likelihood that shortcuts or corner cutting will be found and repaired at the contractor’s expense.

Certifications and verifications are only effective when conducted by an unbiased, vendor-neutral third party. Many certifications in the market fail to meet this threshold. Some certifications and verification processes are little more than a vendor stamp of approval on pieces of equipment. Others take a checklist approach, without examining causes of test failures.

Serious mistakes can take place at almost any time during the construction process, including during the bidding process. In one such instance, an owner’s procurement department tried to maximize a vendor discount for a piece of equipment but failed to order components to connect it.

In another example, a contractor won a bid based on the cost of transporting completely assembled generators on skids for more than 800 miles. When the vendor threatened to void warranty support for this creative use of product, the contractor was forced to absorb the substantial costs of transporting equipment in a more conventional way. In such instances, owners might be wise to watch closely whether the contractor tries to recoup its costs by changing the design or making other equipment substitutions.

In another case, the Uptime Institute team found that the builder implemented a design as it saw fit, without considering maintenance access or labeling of this critical infrastructure. The builder had instead rerouted the bus ducts into a shared compartment and neglected to label any of the conductors.

Many more examples are found in almost every exit report from Uptime Institute Tier Certification engagements.

CONCLUSIONS

Data center capital projects are subject to complex challenges, with multiple stakeholders and contractors coming together across multiple disciplines. To ensure that the infrastructure investment meets an organization’s business requirements, project leaders need to ensure that they have selected the right partners, empowered a competent owner’s representative, and left adequate time for rigorous commissioning and third-party certification.

By Kevin Heslin, Chief Editor, Uptime Institute, with Keith Klesner, Senior Vice President, North America, and input from additional senior and technical staff


BALANCING LIFE SAFETY, INFRASTRUCTURE INVESTMENT, AND DOWNTIME

Due to the uninterruptible nature of IT infrastructure, many organizations allow high-risk maintenance activities


Due to the uninterruptible nature of data center operations, a large percentage of organizations allows maintenance activities on energized electrical equipment. These conditions put personnel at risk of an arc flash accident.

Electrical accidents such as arc flash occur all too often in facility environments that have high-energy use requirements, a multitude of high-voltage electrical systems and components, and frequent maintenance and equipment installation activities.

The U.S. Occupational Safety and Health Administration (OSHA) defines arc flash as “a phenomenon where a flashover of electric current leaves its intended path and travels through the air from one conductor to another, or to ground. The results are often violent and when a human is in close proximity to the arc flash, serious injury and even death can occur.”

When these accidents occur they can derail operations and cause serious harm to workers and equipment. Costs to businesses can include lost work time, downtime, OSHA investigation, fines, medical costs, litigation, lost business, equipment damage, and most tragically, loss of life. According to the Workplace Safety Awareness Council (WPSAC), the average cost of hospitalization for electrical accidents is US$750,000, with many exceeding US$1,000,000.

There are regulations in the U.S. and globally that set safety requirements, but there is wide-ranging industry confusion over how to comply with those regulations and understandable uneasiness with requirements that put personnel at risk.

According to Uptime Institute’s annual data center industry survey, about one-third of organizations allow maintenance activities on energized electrical equipment at voltage levels that could cause health or human-safety consequences (see Figure 1).

Figure 1. Does your organization allow maintenance activities on energized electrical equipment? Yes: 31%; No: 69%. Source: Uptime Institute Annual Data Center Industry Survey 2015

OSHA and the National Fire Protection Association (NFPA) Standard 70E address electrical safety in the workplace and provide guidance and regulations on safety programs, warning labels, personal protective equipment, boundary requirements, and hazard analysis. And yet, there is widespread confusion over how the codes should be applied in the data center industry, as evidenced by the responses from North American data center operators and executives (see Figure 2).

This confusion over how regulations and codes should be applied is clearly a major issue facing this industry.

Even highly informed experts can disagree on how these regulations should be applied. The confusion creates opportunities for accidents and operational exposures to risk that can cause significant injuries and even death.

The most effective way to eliminate the risk of electrical shock or arc flash hazard is to de-energize the equipment. Uptime Institute’s Tier III and Tier IV criteria both require design and installation of systems that enable equipment to be fully de-energized to allow planned activities such as repair, maintenance, replacement, or upgrade without exposing personnel to the risks of working on energized equipment.

INDUSTRY STANDARDS AND REGULATIONS

To prevent these kinds of accidents and injuries, it is imperative that data center operators understand and follow appropriate safety standards for working with electrical equipment. Both the NFPA and OSHA have established standards and regulations that help protect workers against electrical hazards and prevent electrical accidents in the workplace.

OSHA 29 CFR Part 1910, Subpart S and OSHA 29 CFR Part 1926, Subpart K include requirements for electrical installation, equipment, safety-related work practices, and maintenance for general industry and construction workplaces, including data centers.

Figure 2. Do the relevant codes, laws, or regulations in your region prohibit maintenance activities on energized equipment? Yes: 29%; Yes but unenforced: 14%; No: 35%; Don’t know: 22%. North American respondents demonstrate inconsistent understanding and enforcement.

NFPA 70E is a set of detailed standards (issued at the request of OSHA and updated periodically) that address electrical safety in the workplace. It covers safe work practices associated with electrical tasks and for performing other non-electrical tasks that may expose an employee to electrical hazards. OSHA revised its electrical standard to reference NFPA 70E-2000 and continues to recognize NFPA 70E today.

OSHA requires that facilities:

• Provide and be able to demonstrate a safety program with defined responsibilities.

• Calculate the degree of arc flash hazard.

• Use correct personal protective equipment (PPE) for workers.

• Train workers on the hazards of arc flash.

• Use appropriate tools for safe working.

• Provide warning labels on equipment.

NFPA 70E further defines “electrically safe work conditions” to mean that equipment is not and cannot be energized. To ensure these conditions, personnel must identify all power sources, interrupt the load and disconnect power, visually verify that a disconnect has opened the circuit, lock out and tag the circuit, test for absence of voltage, and ground all power conductors, if necessary.
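The sequence above is order-dependent: each step assumes the earlier ones are already complete. As a minimal illustrative sketch (not an NFPA 70E or OSHA artifact), the steps can be modeled as an ordered checklist that reports the first step still outstanding. The step wording is paraphrased from the paragraph above, and the `first_incomplete_step` helper and its `grounding_required` flag are hypothetical names introduced only for illustration.

```python
from typing import Optional

# Steps paraphrased from the paragraph above, in the order given there.
SAFE_WORK_STEPS = [
    "identify all power sources",
    "interrupt the load and disconnect power",
    "visually verify that the disconnect has opened the circuit",
    "lock out and tag the circuit",
    "test for absence of voltage",
    "ground all power conductors",  # the text says "if necessary"
]


def first_incomplete_step(completed: list[str],
                          grounding_required: bool = True) -> Optional[str]:
    """Return the first step not yet done in order, or None if all are done.

    A step counts as done only if every earlier step is also done, so a
    step performed out of sequence does not satisfy the checklist.
    """
    required = SAFE_WORK_STEPS if grounding_required else SAFE_WORK_STEPS[:-1]
    for index, step in enumerate(required):
        if index >= len(completed) or completed[index] != step:
            return step
    return None
```

Calling `first_incomplete_step(SAFE_WORK_STEPS[:3])` returns the lockout and tagging step, which signals that the circuit cannot yet be treated as electrically safe.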

JUSTIFICATION FOR “HOT WORK”

NFPA 70E and OSHA require employers to prove that working in a de-energized state creates more or worse hazards than the risk presented by working on live components or is not practical because of equipment design or operational limitations, for example, when working on circuits that are part of a continuous process that cannot be completely shut down. Other exceptions include situations in which isolating and deactivating system components would create a hazard for people not associated with the work, for example, when working on life-support systems, emergency alarm systems, or ventilation equipment for hazardous locations, or when the work would extinguish illumination for an area.

Figure 3. Are you uncomfortable with maintenance activities on energized electrical equipment? Yes: 59%; No: 41%. Even for organizations with appropriate training, protective equipment, and analysis, many operators are uncomfortable with “hot work.”

In addition, OSHA makes provision for situations in which it would be “infeasible” to shut down equipment. For example, some maintenance and testing operations can only be done on live electric circuits or equipment. The decision to work hot should only be made after careful analysis of what constitutes infeasibility. In recent years, some well-publicized OSHA actions and statements have centered on the matter of how to interpret this term.

ELECTRICAL SAFETY MEASURES IN PRACTICE

Only qualified persons should work on electrical conductors or circuits that have not been put into an electrically safe work condition. A qualified person is one who has received training in and possesses skills and knowledge in the construction and operation of electric equipment and installations and the hazards involved with this type of work. Knowledge or training should encompass the skill to distinguish exposed live parts from other parts of electric equipment, determine the nominal voltage of exposed live parts, and calculate the necessary clearance distances and the corresponding voltages to which a worker will be exposed.

An arc flash hazard analysis for any work must be conducted to determine the appropriate arc flash boundary, the incident energy at the working distance, and the necessary protective equipment for the task.

NFPA 70E outlines strict standards for the type of PPE required for any employees working in areas where electrical hazards are present based on the task, the parts of the body that need protection, and the suitable arc rating to match the potential flash exposure. PPE includes items such as a flash suit, switching coat, mask, hood, gloves, and leather protectors. Flame-resistant clothing underneath the PPE gear is also required.

After an arc flash hazard analysis has been performed, the correct PPE can be selected according to the equipment’s arc thermal performance value (ATPV) and the break open threshold energy rating (EBT). Together, these ratings determine the calculated hazard level that any piece of equipment is capable of protecting a worker from (measured in calories per square centimeter). For example, a hard hat with an attached face shield provides adequate protection for Hazard/Risk Category 2, whereas an arc flash protection hood is needed for a worker exposed to Hazard/Risk Category 4.
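The comparison underlying this selection can be sketched in a few lines: the arc rating of the PPE (ATPV or EBT, in cal/cm²) should be at least the incident energy calculated at the working distance. The sketch below assumes that simple rule only; the function name and the example values are hypothetical, and it does not encode any Hazard/Risk Category thresholds.

```python
def ppe_is_adequate(arc_rating_cal_cm2: float,
                    incident_energy_cal_cm2: float) -> bool:
    """Return True if the PPE arc rating covers the calculated incident energy.

    Both values are in calories per square centimeter (cal/cm^2).
    """
    if arc_rating_cal_cm2 < 0 or incident_energy_cal_cm2 < 0:
        raise ValueError("energies must be non-negative")
    return arc_rating_cal_cm2 >= incident_energy_cal_cm2


# Hypothetical values, for illustration only.
print(ppe_is_adequate(arc_rating_cal_cm2=8.0, incident_energy_cal_cm2=5.2))   # True
print(ppe_is_adequate(arc_rating_cal_cm2=8.0, incident_energy_cal_cm2=12.0))  # False
```

In practice the incident energy comes from the arc flash hazard analysis described above, so the check is only as good as that calculation.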


PPE is the last line of defense in an arc flash incident; it is not intended to prevent all injuries, but to mitigate the impact of a flash, should one occur. In many cases, the use of PPE has saved lives or prevented serious injury.

CONCLUSIONS

It can be argued that some of today’s data center operations approach the status of being “essential” for much of the underlying infrastructure that runs our 24x7 digitized society. Data centers support the functioning of global financial systems, power grids and utilities, air traffic control operations, communication networks, and the information processing that supports vital activities ranging from daily commerce to national security.

Each facility must assess its operations and system capabilities to enable adherence to safe electrical work practices as much as possible without jeopardizing critical mission functions. In many cases, the answer for a specific data center business requirement may come down to a jurisdictional decision.

Balancing the need for appropriate electrical safety measures and compliance with the need to maintain and sustain uninterrupted production capacity in an energy-intensive environment is a challenge.

But it is a challenge the data center industry is perhaps better prepared to meet than many other industry segments. It is apparent that those in the data center industry who subscribe to high-availability concepts such as the Tier Standards: Topology and Operational Sustainability have adopted a rigorous approach to cleaning, maintenance, installation, training, and other tasks that forestall arc flash.

Organizations that subscribe to Tier standards and maintain stringent operational practices are better prepared to take on the challenges of compliance with OSHA and NFPA 70E requirements, in particular the requirements for safely performing work on energized systems, when such work is allowed per the safety standards.

Figure 4. If you had resources, would you invest in infrastructure to prevent the need for maintenance on energized platforms? Yes: 92%. The vast majority of survey respondents would strongly prefer investment in concurrently maintainable topology that would eliminate the need for “hot work.”

No measure will ever completely remove the risk of working on live, energized equipment. In instances where working on live systems is necessary and allowed by NFPA 70E rules, the application of Uptime Institute Tier III and Tier IV criteria can help minimize the risks. Tier III and IV both require the design and installation of systems that enable equipment to be fully de-energized to allow planned activities such as repair, maintenance, replacement, or upgrade without exposing personnel to the risks of working on energized electrical equipment.

ARC FLASH POP QUIZ

• Does this data center [specify] perform site work and maintenance on energized electrical equipment?
  o If no, and you are in a Tier III or IV Certified data center, then, by design and Uptime Institute award, your organization has no reason to risk exposure to hot work.
  o If no, and you are not in a Tier III or IV Certified data center, then you are exposed to the risk of equipment failure due to indefinitely deferred site work and maintenance.
  o If yes, these questions may help you to understand your risk exposure of life safety, unplanned downtime, disrupted business process, code violation and penalty, and/or adverse revenue impact:
    - What is the established corporate policy for performing work on energized electrical equipment?
    - Who has been informed of, and signed off on, this policy?
      · Data Center Operations
      · Maintenance & Site Work Contractors
      · IT Systems
      · Risk/Compliance
      · Life Safety/Health
      · Regulatory/Oversight (3rd-Party or Internal)
    - When was the last time that work was performed on energized electrical equipment?
    - Who was alerted before the work was performed on energized electrical equipment?
      · Data Center Operations
      · IT Systems
      · Risk/Compliance
      · Life Safety/Health
      · Regulatory/Oversight (3rd-Party or Internal)
    - Who and how many performed the work (contractor or employees)?
    - What were the safety precautions?
    - How long was the hot work period scheduled for?
    - How long did it actually take?
    - What was the process of QA/QC before the hot work period concluded and normal operations were restored?
    - What are the scheduled and upcoming hot work periods?
    - As noted in this article, regulations and codes affecting hot work are changing. Who is responsible for checking on the latest impacts to corporate policy and site work activities?
      · How often does that checkup occur?


By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute, and Uptime Institute senior technical staff


EXAMINING AND LEARNING FROM COMPLEX SYSTEMS FAILURES

Conventional wisdom blames “human error” for the majority of outages, but those failures are incorrectly attributed to front-line operator errors, rather than management oversights


Data centers, oil rigs, ships, power plants, and airplanes may seem like vastly different entities, but all are large and complex systems that can be subject to failure—sometimes catastrophic failure. Natural events like earthquakes or storms may initiate a complex system failure. But often blame is assigned to “human error”—front-line operator mistakes combined with a lack of appropriate procedures and resources or compromised structures that result from poor management decisions. Human error is an insufficient and misleading term. The front-line operator’s presence at the site of the incident ascribes responsibility to the operator for failure to rescue the situation. But this masks the underlying causes of an incident. It is more helpful to consider the site of the incident as a spectacle of mismanagement. Responsibility for an incident, in most cases, can be attributed to a senior management decision (e.g., design compromises, budget cuts, staff reductions, vendor selection, and resourcing) seemingly disconnected in time and space from the site of the incident.

What decisions led to a situation where front-line operators were unprepared or untrained to respond to an incident and mishandled it? To safeguard against failures, standards and practices have evolved in many industries that encompass strict criteria and requirements for the design and operation of systems, often including inspection regimens and certifications. Compiled, codified, and enforced by agencies and entities, these programs and requirements help protect the service user from the bodily injuries or financial effects of failures and spur industries to maintain preparedness and best practices.

Twenty years of Uptime Institute research into the causes of data center incidents places predominant accountability for failures at the management level and finds only single-digit percentages of spontaneous equipment failure. This fundamental and permanent truth compelled the Uptime Institute to step further into standards and certifications that were unique to the data center and IT industry. Uptime Institute undertook a collaborative approach with a variety of stakeholders to develop outcome-based criteria that would be lasting and developed by and for the industry. Uptime Institute’s Certifications were conceived to evaluate, in an unbiased fashion, front-line operations within the context of management structure and organizational behaviors.


EXAMINING FAILURES

The sinking of the Titanic. The Deepwater Horizon oil spill. DC-10 air crashes in the 1970s. The failure of New Orleans’ levee system. The Three Mile Island nuclear release. The northeast (U.S.) blackout of 2003. Battery fires in Boeing 787s. The space shuttle Challenger disaster. The Fukushima Daiichi nuclear disaster. The grounding of the Kulluk arctic drilling rig. These are a few of the most infamous, and in some cases tragic, engineering system failures in history. While the examples come from vastly different industries and each story unfolded in its own unique way, they all have something in common with each other and with data centers. All exemplify highly complex systems operating in technologically sophisticated industries.

The hallmarks of so-called complex systems are “a large number of interacting components, emergent properties difficult to anticipate from the knowledge of single components, adaptability to absorb random disruptions, and highly vulnerable to widespread failure under adverse conditions” (Dueñas-Osorio and Vemuru 2009). Additionally, the components of complex systems typically interact in non-linear fashion, operating in large interconnected networks.

Large systems and the industries that use them have many safeguards against failure and multiple layers of protection and backup. Thus, when they fail it is due to much more than a single element or mistake.

John Maclean, author of numerous books analyzing deadly wildfires, including Fire on the Mountain (Morrow 1999), suggests rebranding high-reliability organizations, a fundamental concept for firefighting crews, the military, and the commercial airline industry, as ‘high-risk organizations.’ A high-reliability organization, like a goalkeeper, can only fail, because flawless performance is expected. A high-risk organization is tasked with averting or minimizing impact and may gauge success in a non-binary fashion. It is a recurring theme in Mr. Maclean’s forensic analyses of deadly fires that front-line operators, including those who perished, carry the blame for the outcome, while management shortfalls are far less exposed.

It is a truism that complex systems tend to fail in complex ways. Looking at just a few examples from various industries, again and again we see that it was not a single factor but the compound effect of multiple factors that disrupted these sophisticated systems. Often referred to as “cascading failures,” complex system breakdowns usually begin when one component or element of the system fails, requiring nearby “nodes” (or other components in the system network) to take up the workload or service obligation of the failed component. If this increased load is too great, it can cause other nodes to overload and fail as well, creating a waterfall effect as every component failure increases the load on the other, already stressed components. The following transferable concept is drawn from the power industry:

Power transmission systems are heterogeneous networks of large numbers of components that interact in diverse ways. When component operating limits are exceeded, protection acts to disconnect the component and the component ‘fails’ in the sense of not being available... Components can also fail in the sense of misoperation or damage due to aging, fire, weather, poor maintenance, or incorrect design or operating settings... The effects of the component failure can be local or can involve components far away, so that the loading of many other components throughout the network is increased... the flows all over the network change (Dobson, et al. 2009).
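The load-redistribution mechanism described above can be illustrated with a toy simulation. This is a minimal sketch under a deliberately simple assumption that does not come from the cited work: when nodes fail, their load is shared equally among all surviving nodes, and any survivor pushed past its capacity fails in the next round. The function name, the loads, and the capacities are hypothetical.

```python
def simulate_cascade(loads, capacities, initial_failure):
    """Return the indices of failed nodes in the order they fail.

    Assumed model: load shed by failed nodes is split equally among
    survivors; a survivor whose load exceeds its capacity fails next.
    """
    loads = [float(x) for x in loads]
    failed = [initial_failure]
    newly_failed = [initial_failure]
    while newly_failed:
        survivors = [i for i in range(len(loads)) if i not in failed]
        if not survivors:
            break  # nothing left to absorb the load
        shed = sum(loads[i] for i in newly_failed)
        for i in newly_failed:
            loads[i] = 0.0
        share = shed / len(survivors)
        for i in survivors:
            loads[i] += share
        newly_failed = [i for i in survivors if loads[i] > capacities[i]]
        failed.extend(newly_failed)
    return failed


# Tight margins: one failure takes down every node.
print(simulate_cascade([60, 60, 60, 60], [100, 75, 90, 150], 0))    # [0, 1, 2, 3]
# Uniform headroom: the same initial failure is absorbed.
print(simulate_cascade([60, 60, 60, 60], [100, 100, 100, 100], 0))  # [0]
```

The two runs differ only in spare capacity, which is the point of the passage: whether a single failure stays local depends on how much headroom the remaining components have.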

A component of the network can be a mechanical, structural, or human agent, as front-line operators respond to an emerging crisis. Just as engineering components can fail when overloaded, so can human effectiveness and decision-making capacity diminish under duress. A defining characteristic of a high-risk organization is that it provides structure and guidance despite extenuating circumstances; duress is its standard operating condition.

The sinking of the Titanic is perhaps the most well-known complex system failure in history. This disaster was caused by the compound effect of structural issues, management decisions, and operating mistakes that led to the tragic loss of 1,495 lives. Just a few of the critical contributing factors include design compromises (e.g., reducing the height of the watertight bulkheads that allowed water to flow over the tops and limiting the number of lifeboats for aesthetic considerations), poor discretionary decisions (e.g., sailing at excessive speed on a moonless night despite reports of icebergs ahead), operator error (e.g., the lookout in the crow’s nest had no binoculars—a cabinet key had been left behind in Southampton), and misjudgment in the crisis response (e.g., the pilot tried to reverse thrust when the iceberg was spotted, instead of continuing at full speed and using the momentum of the ship to turn course and reduce impact). And, of course, there was the hubris of believing the ship was unsinkable.

Figure 1a. (Left) NTSB photo of the burned auxiliary power unit battery from a JAL Boeing 787 that caught fire on January 7, 2013 at Boston’s Logan International Airport. Photo credit: By National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons.

Figure 1b. (Right) A side-by-side comparison of an original Boeing Dreamliner (787) battery and a damaged Japan Airlines battery. Photo credit: By National Transportation Safety Board (NTSB) [Public domain], via Wikimedia Commons.

Looking at a more recent example, the issue of battery fires in Japan Airlines (JAL) Boeing 787s, which came to light in 2013 (see Figure 1), was ultimately blamed on a combination of design, engineering, and process management shortfalls (Gallagher 2014). Following its investigation, the U.S. National Transportation Safety Board reported (NTSB 2014):

• Manufacturer errors in design and quality control. The manufacturer failed to adequately account for the thermal runaway phenomenon: an initial overheating of the batteries triggered a chemical reaction that generated more heat, thus causing the batteries to explode or catch fire. Battery “manufacturing defects and lack of oversight in the cell manufacturing process” resulted in the development of lithium mineral deposits in the batteries. Called lithium dendrites, these deposits can cause a short circuit that reacts chemically with the battery cell, creating heat. Lithium dendrites occurred in wrinkles that were found in some of the battery electrolyte material, a manufacturing quality control issue.

• Shortfall in certification processes. The NTSB found shortcomings in U.S. Federal Aviation Administration (FAA) guidance and certification processes. Some important factors were overlooked that should have been considered during safety assessment of the batteries.

• Lack of contractor oversight and proper change orders. A cadre of contractors and subcontractors were involved in the manufacture of the 787’s electrical systems and battery components. Certain entities made changes to the specifications and instructions without proper approval or oversight. When the FAA performed an audit, it found that Boeing’s prime contractor wasn’t following battery component assembly and installation instructions and was mislabeling parts. A lack of “adherence to written procedures and communications” was cited.

How many of these circumstances parallel those that can happen during the construction and operation of a data center? It is all too common to find deviations from as-designed systems during the construction process, inconsistent quality control oversight, and the use of multiple subcontractors. Insourced and outsourced resources may disregard or hurry past written procedures, documentation, and communication protocols (see “Avoiding Data Center Capital Project Failures,” page 24).

THE NATURE OF COMPLEX SYSTEM FAILURES

Large industrial and engineered systems are risky by their very nature. The greater the number of components and the higher the energy and heat levels, velocity, and size and weight of these components, the greater the skill and teamwork required to plan, manage, and operate the systems safely. Between mechanical components and human actions, there are thousands of possible points where an error can occur and potentially trigger a chain of failures.

In his seminal article on the topic of complex system failure, “How Complex Systems Fail”—first published in 1998 and still widely referenced today—Dr. Richard I. Cook identifies and discusses 18 core elements of failure in complex systems:

1. Complex systems are intrinsically hazardous systems.
2. Complex systems are heavily and successfully defended against failure.
3. Catastrophe requires multiple failures – single point failures are not enough.
4. Complex systems contain changing mixtures of failures latent within them.
5. Complex systems run in degraded mode.
6. Catastrophe is always just around the corner.
7. Post-accident attribution to a ‘root cause’ is fundamentally wrong.
8. Hindsight biases post-accident assessments of human performance.
9. Human operators have dual roles: as producers and as defenders against failure.
10. All practitioner actions are gambles.
11. Actions at the sharp end resolve all ambiguity.
12. Human practitioners are the adaptable element of complex systems.
13. Human expertise in complex systems is constantly changing.
14. Change introduces new forms of failure.
15. Views of ‘cause’ limit the effectiveness of defenses against future events.
16. Safety is a characteristic of systems and not of their components.
17. People continuously create safety.
18. Failure-free operations require experience with failure (Cook 1998).

Let’s examine some of these principles in the context of a data center. Certainly high-voltage electrical systems, large-scale mechanical and infrastructure components, high-pressure water piping, power generators, and other elements create hazards [Element 1] for both humans and mechanical systems/structures. Data center systems are defended from failure by a broad range of measures [Element 2], both technical (e.g., redundancy, alarms, and safety features of equipment) and human (e.g., knowledge, training, and procedures). Because of these multiple layers of protection, a catastrophic failure would require the breakdown of multiple systems or multiple individual points of failure [Element 3].
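The logic behind Element 3 can be illustrated with a short calculation. This is a minimal sketch, not a risk model: the layer names and failure probabilities are hypothetical, and it assumes the layers fail independently of one another, an assumption that real incidents (for example, loads that share a single power panel) often violate.

```python
from math import prod

# Hypothetical probability that each defensive layer fails on demand.
# These figures are illustrative assumptions, not measured data.
layer_failure_prob = {
    "redundant_component": 0.01,
    "alarm": 0.05,
    "operator_procedure": 0.10,
}


def prob_all_layers_fail(probs):
    """Probability that every layer fails at once, assuming independence."""
    return prod(probs.values())


single_layer = layer_failure_prob["redundant_component"]
all_layers = prob_all_layers_fail(layer_failure_prob)

print(f"One layer failing:  {single_layer:.4%}")
print(f"All layers failing: {all_layers:.4%}")
```

Under these assumptions, the combined failure is far less likely than any single-layer failure, which is why catastrophe typically requires several defenses to break down together. When layers share a hidden dependency, the independence assumption no longer holds and the real risk can be much higher than this product suggests.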

RUNNING NEAR CRITICAL FAILURE

Complex systems science suggests that most large-scale complex systems, even well-run ones, by their very nature are operating in “degraded mode” [Element 5], i.e., close to the critical failure point. This is due to the progression over time of various factors including steadily increasing load demand, engineering forces, and economic factors.

The enormous investments in data centers and other highly available infrastructure systems perversely create incentives for conditions of elevated risk and a higher likelihood of failure. Maximizing capacity, increasing density, and hastening production from installed infrastructure improve the return on investment (ROI) on these major capital investments. Deferred maintenance, whether due to lack of budget or hands-off periods during heightened production, further pushes equipment toward its performance limits and, ultimately, the breaking point.

The increasing density of data center infrastructure exemplifies the dynamics that continually and inexorably push a system towards critical failure. Server density is driven by a mixture of engineering forces (advancements in server design and efficiency) and economic pressures (demand for more processing capacity without increasing facility footprint). Increased density then necessitates corresponding increases in the number of critical heating and cooling elements. The system now runs at higher risk: it has more components (each subject to individual fault or failure), more power flowing through the facility, and more heat generated.
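The effect of adding components can be sketched numerically. The example below is illustrative only: the per-component failure probability is a hypothetical figure, and the function assumes components fail independently, which is a simplification.

```python
def prob_any_component_fails(p_single, n_components):
    """Chance that at least one of n independent components fails."""
    return 1 - (1 - p_single) ** n_components


# Hypothetical probability that one component fails in a given period.
P_SINGLE = 0.001

# Compare a facility before and after hypothetical increases in density.
for n in (100, 200, 400):
    risk = prob_any_component_fails(P_SINGLE, n)
    print(f"{n} components: chance of at least one failure = {risk:.2%}")
```

Even when each component is individually reliable, the chance that something in the facility fails grows steadily with the component count, which is one way to read the claim that a denser system runs at higher risk.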

This development trajectory demonstrates just a few of the powerful “self-organizing” forces in any complex system. According to Dobson et al. (2009), “these forces drive the system to a dynamic equilibrium that keeps [it] near a certain pattern of operating margins relative to the load. Note that engineering improvements and load growth are driven by strong, underlying economic and societal forces that are not easily modified.”

Because of this dynamic mix of forces, the potential for a catastrophic outcome is inherent in the very nature of complex systems [Element 6]. For large-scale mission critical and business critical systems, the profound implication is that designers, system planners, and operators must acknowledge the potential for failure and build in safeguards.

WHY IS IT SO EASY TO BLAME HUMAN ERROR?

Human error is often cited as the root cause of many engineering system failures, yet it does not often cause a major disaster on its own. Based on analysis of 20 years of data center incidents, Uptime Institute holds that human error must signify management failure to drive change and improvement. Leadership decisions and priorities that result in a lack of adequate staffing and training, an organizational culture that becomes dominated by a fire drill mentality, or budget cutting that reduces preventive/proactive maintenance could result in cascading failures that truly flow from the top down.

Although front-line operator error may sometimes appear to cause an incident, a single mistake (just like a single data center component failure) is not often sufficient to bring down a large and robust complex system unless conditions are such that the system is already teetering on the edge of critical failure and has multiple underlying risk factors.

For example, media reports after the 1989 Exxon Valdez oil spill zeroed in on the fact that the captain, Joseph Hazelwood, was not on the bridge at the time of the accident and accused him of drinking heavily that night. However, more measured assessments of the accident by the NTSB and others found that Exxon had consistently failed to supervise the captain or provide sufficient crew for necessary rest breaks (see Figure 2).

Perhaps even more critical was the lack of essential navigation systems: the tanker’s radar was not operational at time of the accident. Reports indicate that Exxon’s management had allowed the RAYCAS radar system to stay broken for an entire year before the vessel ran aground because it was expensive to operate. There was also inadequate disaster preparedness and an insufficient quantity of oil spill containment equipment in the region, despite the experiences of previous small oil spills. Four years before the accident, a letter written by Captain James Woodle, who at that time was the Exxon oil group’s Valdez port commander, warned upper management, “Due to a reduction in manning, age of equipment, limited training and lack of personnel, serious doubt exists that [we] would be able to contain and clean-up effectively a medium or large size oil spill” (Palast 1999).

As Dr. Cook points out, post-accident attribution to a root cause is fundamentally wrong [Element 7]. Complete failure requires multiple faults, thus attribution of blame to a single isolated element is myopic and, arguably, scapegoating. Exxon blamed Captain Hazelwood for the accident, but the focus on his share of the blame obscures the underlying mismanagement that led to the failure. Inadequate enforcement by the U.S. Coast Guard and other regulatory agencies further contributed to the disaster.

Similarly, the grounding of the oil rig Kulluk was the direct result of a cascade of discrete failures, errors, and mishaps, but the disaster was first set in motion by Royal Dutch Shell’s executive decision to move the rig off of the Alaskan coastline to avoid tax liability, despite high risks (Lavelle 2014). As a result, the rig and its tow vessels undertook a challenging 1,700-nautical-mile journey across the icy and storm-tossed waters of the Gulf of Alaska in December 2012 (Funk 2014).

Figure 2. Shortly after leaving the Port of Valdez, the Exxon Valdez ran aground on Bligh Reef. The picture was taken three days after the vessel grounded, just before a storm arrived. Photo credit: Office of Response and Restoration, National Ocean Service, National Oceanic and Atmospheric Administration [Public domain], via Wikimedia Commons.

There had already been a chain of engineering and inspection compromises and shortfalls surrounding the Kulluk, including the installation of used and uncertified tow shackles, a rushed refurbishment of the tow vessel Discovery, and electrical system issues with the other tow vessel, the Aivik, which had not been reported to the Coast Guard as required. (Discovery experienced an exhaust system explosion and other mechanical issues in the following months. Ultimately the tow company—a contractor—was charged with a felony for multiple violations.)

This journey would be the Kulluk’s last, and it included a series of additional mistakes and mishaps. Gale-force winds put continual stress on the tow line and winches. The tow ship was captained on this trip by an inexperienced replacement, who seemingly mistook tow line tensile alarms (set to go off when tension exceeded 300 tons) for another alarm that was known to be falsely annunciating. At one point the Aivik, in attempting to circle back and attach a new tow line, was swamped by a wave, sending water into the fuel pumps (a problem that had previously been identified but not addressed), which caused the engines to begin to fail over the next several hours (see Figure 3).

Despite harrowing conditions, Coast Guard helicopters were eventually able to rescue the 18 crew members aboard the Kulluk. Valiant last-ditch tow attempts were made by the (repaired) Aivik and Coast Guard tugboat Alert, before the effort had to be abandoned and the oil rig was pushed aground by winds and currents.

Poor management decision making, lack of adherence to proper procedures and safety requirements, taking shortcuts in the repair of critical mechanical equipment, insufficient contractor oversight, lack of personnel training/experience—all of these elements of complex system failure are readily seen as contributing factors in the Kulluk disaster.

EXAMINING DATA CENTER SYSTEM FAILURES

Two recent incidents demonstrate how the dynamics of complex systems failures can quickly play out in the data center environment.

Figure 3. Waves crash over the drilling unit Kulluk where it sits aground on the southeast side of Sitkalidak Island, AK, on January 1, 2013. Photo credit: Petty Officer 3rd Class Jonathan Klingenberg, United States Coast Guard [Public domain], via Wikimedia Commons.

Example A

The data center in this example had been designed appropriately, with fuel pumps and engine-generator controls powered from multiple circuit panels. As built, however, a single panel powered both, whether due to implementation oversight or cost reduction measures. At issue is not the installer, but rather the quality of communications between the implementation team and the operations team.

In the course of operations, technicians had to shut off utility power while performing routine maintenance on electrical switchgear. This meant the building was running on engine-generator sets. However, when the engine-generator sets started to surge due to a clogged fuel line, the UPS automatically switched the facility to battery power. The day tanks for the engine-generator sets were starting to run dry. If quick-thinking operators had not discovered the fuel pump issue in time, there would have been an outage to the entire facility: a cascade of events leading down a rapid pathway from simple routine maintenance activity to complete system failure.
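The design deviation in Example A is the kind of condition that a simple comparison of as-built records against design intent might surface. The following is a minimal sketch under stated assumptions: the load names, the panel name, and the idea of a machine-readable feed list are all hypothetical and are not drawn from the incident itself.

```python
# Hypothetical as-built power feeds: load -> panel that powers it.
as_built = {
    "fuel_pump_1": "panel_A",
    "fuel_pump_2": "panel_A",
    "generator_controls": "panel_A",
}

# Hypothetical design intent: groups of loads that must not all share one panel.
must_be_diverse = [
    {"fuel_pump_1", "fuel_pump_2", "generator_controls"},
]


def find_shared_panels(feeds, groups):
    """Return (loads, panel) pairs where a whole group depends on one panel."""
    findings = []
    for group in groups:
        panels = {feeds[load] for load in group}
        if len(panels) == 1:
            findings.append((sorted(group), panels.pop()))
    return findings


for loads, panel in find_shared_panels(as_built, must_be_diverse):
    print(f"Single point of failure: {panel} feeds all of {loads}")
```

A check like this only helps if the as-built records are accurate and shared with the operations team, which is the communication gap the example highlights.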

Example B

In this example, an enterprise data center shared space with corporate offices in the same building, with a single chilled water plant used to cool both sides of the building. The office air handling units also brought in outside air to reduce cooling costs.

One night, the site experienced particularly cold temperatures and the control system did not switch from outside air to chilled water for office building cooling, which affected data center cooling as well. The freeze stat (a temperature sensing device that monitors a heat exchanger to prevent its coils from freezing) failed to trip; thus the temperature continued to drop and the cooling coil froze and burst, leaking chilled water onto the floor of the data center. There was a limited leak detection system in place and connected, but it had not been fully tested yet. Chilled water continued to leak until pressure dropped and then the chilled water machines started to spin offline in response. Once the chilled water machines went offline neither the office building nor data center had active cooling.

At this point, despite the extreme outside cold, temperatures in the data hall rose through the night. As a result of the elevated indoor temperature conditions, the facility experienced myriad device-level failures (e.g., servers, disk drives, and fans) over the following several weeks. Though a critical shutdown was not the issue, damage to components and systems, along with the cost of cleanup, replacement parts, and labor, was significant. A single initiating factor, a cold night, combined with other elements in a cascade of failures.

In both of these cases, severe disaster was averted, but relying on front-line operators to save the situation is neither robust nor reliable.

PREVENTING FAILURES IN THE DATA CENTER

Facility infrastructure is only one component of failure prevention; how a facility is run and operated on a day-to-day basis is equally critical. As Dr. Cook noted, humans have a dual role in complex systems as both the potential producers (causes) of failure as well as, simultaneously, some of the best defenders against failure [Element 9].

The fingerprints of human error can be seen on the two data center examples. In Example A, the electrical panel was not set up as originally designed. In Example B, the leak detection system, which could have alerted operators to the problem, had not been fully activated.

Dr. Cook also points out that human operators are the most adaptable component of complex systems [Element 12], as they “actively adapt the system to maximize production and minimize accidents.” For example, operators may “restructure the system to reduce exposure of vulnerable parts,” reorganize critical resources to focus on areas of high demand, provide “pathways for retreat or recovery,” and “establish means for early detection of changed system performance in order to allow graceful cutbacks in production or other means of increasing resiliency.” Given the highly dynamic nature of complex system environments, this human-driven adaptability is key.

STANDARDIZATION CAN ADDRESS MANAGEMENT SHORTFALLS

In most of the notable failures in recent decades, there was a breakdown or circumvention of established standards and certifications. It was not a lack of standards but a lack of compliance, or plain sloppiness, that contributed the most to the disastrous outcomes. For example, in the case of the Boeing batteries, the causes were bad design, poor quality inspections, and lack of contractor oversight. In the case of the Exxon Valdez, inoperable navigation systems and inadequate crew manpower and oversight, along with insufficient disaster preparedness, were critical factors. If leadership, operators, and oversight agencies had adhered to their own policies and requirements and had not cut corners for economics or expediency, these disasters might have been avoided.

Ongoing operating and management practices and adherence to recognized standards and requirements, therefore, must be the focus of long-term risk mitigation. In fact, Dr. Cook states that “failure-free operations are the result of activities of people who work to keep the system within the boundaries of tolerable performance.... human practitioner adaptations to changing conditions actually create safety from moment to moment” [Element 17].

This emphasis on human activities as decisive in preventing failures dovetails with Uptime Institute’s advocacy of operational excellence as set forth in the Tier Standard: Operational Sustainability. This was the data center industry’s first standardization, developed by and for data centers, to address the management shortfalls that could unwind the most advanced, complex, and intelligent of solutions. Uptime Institute was compelled by its findings that the vast majority of data center incidents could be attributed to operations, despite advancements in technology, monitoring, and automation.

The Operational Sustainability criteria pinpoint the elements that impact long-term data center performance, encompassing site management and operating behaviors, and documentation and mitigation of site-specific risks. The detailed criteria include personnel qualifications and training and policies and procedures that support operating teams in effectively preventing failures and responding appropriately when small failures occur to avoid having them cascade into large critical failures. As Dr. Cook states, “Failure free operations require experience with failure” [Element 18]. We have the opportunity to learn from the experience of other industries, and, more importantly, from the data center industry’s own experience, as collected and analyzed in Uptime Institute’s Abnormal Incident Reports database. Uptime Institute has captured and catalogued the lessons learned from more than 5,000 errors and incidents over the last 20 years and used that research knowledge base to help develop an authoritative set of benchmarks. It has ratified these with leading industry experts and gained the consensus of global stakeholders from each sector of the industry. Uptime Institute’s Tier Certifications and Management & Operations (M&O) Stamp of Approval provide the most definitive guidelines for and verification of effective risk mitigation and operations management.

Dr. Cook explains, “More robust system performance is likely to arise in systems where operators can discern the ‘edge of the envelope.’ It also depends on calibrating how their actions move system performance towards or away from the edge of the envelope” [Element 18]. Uptime Institute’s deep subject matter expertise, long experience, and evidence-based standards can help data center operators identify and stay on the right side of that edge. Organizations like CenturyLink are recognizing the value of applying a consistent set of standards to ensure operational excellence and minimize the risk of failure in the complex systems represented by their data center portfolio (see the sidebar “CenturyLink and the M&O Stamp of Approval”).

“It is human nature for elected officials and the general public to tend to disregard the fact that the ubiquitous complex systems that we rely on in our daily lives operate at a certain level of risk—a likelihood that they may fail, and that their failures need to be addressed proactively.” (ASME 2011)

CENTURYLINK AND THE M&O STAMP OF APPROVAL

The IT industry has growing awareness of the importance of management-people-process issues. That’s why Uptime Institute’s Management & Operations (M&O) Stamp of Approval focuses on assessing and evaluating both operations activities and management as equally critical to ensuring data center reliability and performance. The M&O Stamp can be applied to a single data center facility, or administered across an entire portfolio to ensure consistency. Recognizing the necessity of making a commitment to excellence at all levels of an organization, CenturyLink is the first service provider to embrace the M&O assessment for all of its data centers. It has contracted Uptime Institute to assess 57 data center facilities across a global portfolio. This decision shows the company is willing to hold itself to a uniform set of high standards and operate with transparency. The company has committed to achieve M&O Stamp of Approval standards and certification across the board, protecting its vital networks and assets from failure and downtime and providing its customers with assurance.

CONCLUSION

Complex systems fail in complex ways, a reality exacerbated by the business need to operate complex systems on the very edge of failure. The highly dynamic environments of building and operating an airplane, ship, or oil rig share many traits with running a high availability data center. The risk tolerance for a data center is similarly very low, and data centers are susceptible to the heroics and missteps of many disciplines. The coalescing element is management, which makes sure that frontline operators are equipped with the hands, tools, parts, and processes they need, along with the unbiased oversight and certifications to identify risks and drive continuous improvement against the continuous exposure to complex failure.

By Julian Kudritzki, COO, Uptime Institute with Uptime Institute editorial and research staff

REFERENCES

ASME (American Society of Mechanical Engineers). 2011. Initiative to Address Complex Systems Failure: Prevention and Mitigation of Consequences. Report prepared by Nexight Group for ASME (June). Silver Spring MD: Nexight Group. http://nexightgroup.com/wp-content/uploads/2013/02/initiative-to-address-complex-systems-failure.pdf

Bassett, Vicki. 1998. “Causes and effects of the rapid sinking of the Titanic,” working paper. Department of Mechanical Engineering, the University of Wisconsin. http://writing.engr.vt.edu/uer/bassett.html#authorinfo

BBC News. 2015. “Safety worries lead US airline to ban battery shipments.” March 3, 2015. http://www.bbc.com/news/technology-31709198

Brown, Christopher and Matthew Mescal. 2014. View From the Field. Webinar presented by Uptime Institute, May 29, 2014. https://uptimeinstitute.com/research-publications/asset/webinar-recording-view-from-the-field

Cook, Richard I. 1998. “How Complex Systems Fail (Being a Short Treatise on the Nature of Failure; How Failure is Evaluated; How Failure is Attributed to Proximate Cause; and the Resulting New Understanding of Patient Safety).” Chicago, IL: Cognitive Technologies Laboratory, University of Chicago. Copyright 1998, 1999, 2000 by R.I. Cook, MD, for CtL. Revision D (00.04.21). http://web.mit.edu/2.75/resources/random/How%20Complex%20Systems%20Fail.pdf

Dobson, Ian, Benjamin A. Carreras, Vickie E. Lynch and David E. Newman. 2009. “Complex systems analysis of a series of blackouts: Cascading failure, critical points, and self-organization.” Chaos: An Interdisciplinary Journal of Nonlinear Science 17: 026103 (published by the American Institute of Physics).

Dueñas-Osorio, Leonard and Srivishnu Mohan Vemuru. 2009. Abstract for “Cascading failures in complex infrastructure systems.” Structural Safety 31 (2): 157-167.

Funk, McKenzie. 2014. “The Wreck of the Kulluk.” New York Times Magazine, December 30, 2014. http://www.nytimes.com/2015/01/04/magazine/the-wreck-of-the-kulluk.html?_r=0

Gallagher, Sean. 2014. “NTSB blames bad battery design—and bad management—in Boeing 787 fires.” Ars Technica, December 2, 2014. http://arstechnica.com/information-technology/2014/12/ntsb-blames-bad-battery-design-and-bad-management-in-boeing-787-fires/

Glass, Robert, Walt Beyeler, Kevin Stamber, Laura Glass, Randall LaViolette, Stephen Contrad, Nancy Brodsky, Theresa Brown, Andy Scholand, and Mark Ehlen. 2005. Simulation and Analysis of Cascading Failure in Critical Infrastructure. Presentation (annotated version), Los Alamos National Laboratory, National Infrastructure Simulation and Analysis Center (Department of Homeland Security), and Sandia National Laboratories, July 2005. New Mexico: Sandia National Laboratories. http://www.sandia.gov/CasosEngineering/docs/Glass_annotatedpresentation.pdf

Kirby, R. Lee. 2012. “Reliability Centered Maintenance: A New Approach.” Mission Critical, June 12, 2012. http://www.missioncriticalmagazine.com/articles/84992-reliability-centered-maintenance--a-new-approach

Klesner, Keith. 2015. “Avoiding Data Center Construction Problems.” The Uptime Institute Journal. 5: Spring 2014: 6-12. https://journal.uptimeinstitute.com/avoiding-data-center-construction-problems/

Lipsitz, Lewis A. 2012. “Understanding Health Care as a Complex System: The Foundation for Unintended Consequences.” Journal of the American Medical Association 308 (3): 243–244. http://jama.jamanetwork.com/article.aspx?articleid=1217248

Lavelle, Marianne. 2014. “Coast Guard blames Shell risk taking in the wreck of the Kulluk.” National Geographic, April 4, 2014. http://news.nationalgeographic.com/news/energy/2014/04/140404-coast-guard-blames-shell-in-kulluk-rig-accident/

New York Times. “Exxon Valdez Oil Spill.” NYTimes.com, last updated August 3, 2010. http://topics.nytimes.com/top/reference/timestopics/subjects/e/exxon_valdez_oil_spill_1989/index.html

NTSB (National Transportation Safety Board). 2014. “Auxiliary Power Unit Battery Fire Japan Airlines Boeing 787-8, JA829J.” Aircraft Incident Report released 11/21/14. Washington, DC: National Transportation Safety Board. http://www.ntsb.gov/Pages/..%5Cinvestigations%5CAccidentReports%5CPages%5CAIR1401.aspx

Palast, Greg. 1999. “Ten Years After But Who Was to Blame?” for Observer/Guardian UK, March 20, 1999. http://www.gregpalast.com/ten-years-after-but-who-was-to-blame/

Pederson, Brian. 2014. “Complex systems and critical missions—today’s data center.” Lehigh Valley Business, November 14, 2014. http://www.lvb.com/article/20141114/CANUDIGIT/141119895/complex-systems-and-critical-missions--todays-data-center

Plsek, Paul. 2003. Complexity and the Adoption of Innovation in Healthcare. Presentation, Accelerating Quality Improvement in Health Care Strategies to Speed the Diffusion of Evidence-Based Innovations, conference in Washington, DC, January 27-28, 2003. Roswell, GA: Paul E Plsek & Associates, Inc. http://www.nihcm.org/pdf/Plsek.pdf

Reason, J. 2000. “Human Errors Models and Management.” British Medical Journal 320 (7237): 768–770. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1117770/

Reuters. 2014. “Design flaws led to lithium-ion battery fires in Boeing 787: U.S. NTSB.” December 2, 2014. http://www.reuters.com/article/2014/12/02/us-boeing-787-battery-idUSKCN0JF35G20141202

Wikipedia, s.v. “Cascading Failure,” last modified April 12, 2015. https://en.wikipedia.org/wiki/Cascading_failure

Wikipedia, s.v. “The Sinking of The Titanic,” last modified July 21, 2015. https://en.wikipedia.org/wiki/Sinking_of_the_RMS_Titanic

Wikipedia, s.v. “SOLAS Convention,” last modified June 21, 2015. https://en.wikipedia.org/wiki/SOLAS_Convention

APPLY EFFICIENT IT PRINCIPLES TO ADDRESS SUSTAINABILITY RISKS

As Corporate Sustainability programs become increasingly important to C-level execs and investors, IT organizations need to adopt more meaningful KPIs to remain relevant

Increasingly, Corporate Sustainability drives decisions at large companies, as this function can affect a company’s standing with the investor community, its stock performance and capitalization.

For many companies, IT infrastructure is the organization’s largest expense, carbon emitter, and environmental liability due to the resource intensiveness of technology operations.

Many of the corporate risk considerations associated with global sustainability concerns have an impact on IT infrastructure decisions, including but not limited to:

• Climate change and other direct environmental impacts

• Government regulation

• Resource scarcity

• Ethical investment and sourcing

• Waste management

Large companies are under pressure from activists, investors, regulators, and customers to adopt sustainable business practices and communicate those methods and metrics to stakeholders. Noncompliance with sustainable practices can result in reputational damage, litigation, and penalties. Starting in 2009, NGOs like Greenpeace pilloried Facebook and other web scale companies for sourcing power from carbon-intensive utility providers. Many companies have significantly shifted utility sourcing in the interim and worked to ensure they are demonstrating efficiency and environmental stewardship, but these companies continue to be closely monitored. More traditional enterprise organizations are also under scrutiny, especially companies with large IT footprints.

Figure 1: Studies find companies that focus on sustainability issues achieve real and quantifiable financial impacts.

While many companies approach sustainability to avoid negative outcomes, the positive benefits of a robust sustainability program may be even more profound (see Figures 1 and 2). The seventh global executive study on corporate sustainability from MIT Sloan Management Review and The Boston Consulting Group (BCG) found that 75% of investors cite improved revenue performance and operational efficiency from sustainability as strong reasons to invest. More than 60% believe that solid sustainability performance reduces a company’s risks. Nearly the same number also strongly believes that it lowers a company’s cost of capital. At the same time, nearly half of investors say that they won’t invest in a company with a record of poor sustainability performance. Some 60% of investment firm board members say they are willing to divest from companies with a poor sustainability footprint.

Other benefits include cost savings, increased employee attraction and retention, and increased customer loyalty.

[Figure labels: Business Drivers for Efficient IT: Innovation, Risk Management, Cost Savings]

Figure 2: Stockholders, financial institutions, and customers increasingly scrutinize a company’s sustainability record before deciding to invest money.

Figure 3: Embracing Efficient IT principles provides an immediate financial return, but also delivers value to risk management and enables business agility.

Yet only 60% of managers in publicly traded companies believe that good sustainability performance is materially important to investors’ investment decisions. For rank-and-file employees across IT and other lines of business, sustainability is rarely a priority. IT has an especially poor track record, tolerating inefficiency and waste in favor of expedience and performance. To implement an effective sustainability program, companies have had to mandate participation at a corporate level.

Most organizations meet these challenges by creating sustainability offices that have both C-level visibility and broad staff participation across all business units and facilities, including IT (see Figure 4).

Two years ago, a place at the IT table for sustainability would have been provocative, and perhaps evoked derision. In 2015, less than a tenth of enterprise IT stakeholders had confidence in corporate sustainability to affect IT efficiency and costs.

IT often stood apart, isolated from the rest of the company because of the perceived complexity of its needs, the robustness of its procedures, and low prioritization of cost and resource savings.

One short year later, 2016 is a vastly different matter, and the data suggest that the time of corporate sustainability in IT is here now: 70% of enterprise IT organizations actively participate in corporate sustainability efforts. The influence of an outside party breaks down the ‘thwart by silo’ effect that has frustrated so many well-meaning, and often fruitless, efforts to reshape IT.

The relationship between Corporate Sustainability and Enterprise IT is really just getting started. There are good signs for the potential of this relationship, but also signals that entrenched behaviors and metrics will be difficult to overcome.

Figure 4: The vast majority of large companies have implemented a formal sustainability program. IT departments are participating in these programs at a nascent level.

The relationship so far, according to the survey stakeholders, has been overwhelmingly positive (see Figure 5).

And yet, looking at the reporting functionality, we still have a long way to go. If IT infrastructure leaders are motivated to improve the accountability and efficiency of their organizations, Corporate Sustainability is a great partner for gaining C-level buy-in and funding for projects.

But in many cases, IT Infrastructure teams are still relying on the least meaningful metrics to drive efficiency. Of the 70% of IT organizations who participate in corporate sustainability, the majority focuses on metrics with the least impact to the cost and carbon picture.

The majority of IT departments submit metrics to sustainability departments that focus wholly on the data center, such as total data center power consumption and Power Usage Effectiveness, or PUE (the ratio of total facility energy to IT equipment energy, which reflects the overhead of cooling and other facility systems).

These KPIs address less than 15% of the opportunity to improve IT efficiency. The data center facility is only the site of resource consumption and can only, to a limited extent, be made more efficient in and of itself.
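As a point of reference, PUE is conventionally computed as total facility energy divided by IT equipment energy. The minimal Python sketch below uses hypothetical annual energy figures (the function names and numbers are illustrative assumptions, not survey data) to show what the ratio does and does not capture.

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


def overhead_share(pue_value: float) -> float:
    """Fraction of facility energy consumed by non-IT systems (cooling, power losses)."""
    return 1.0 - 1.0 / pue_value


# Hypothetical annual figures, for illustration only.
example_pue = pue(total_facility_kwh=1_800_000, it_equipment_kwh=1_000_000)
print(f"PUE: {example_pue:.2f}")
print(f"Facility overhead share: {overhead_share(example_pue):.1%}")
```

Note that nothing in this calculation asks whether the IT equipment is doing useful work: a site full of idle servers can report the same PUE as a fully utilized one.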

The actual decision making that impacts resource consumption generally does not occur at that street address—sometimes not even in that same city, state, or country. Additionally the invoices and/or allocations for those resources consumed are not all sent or charged to the right location or department.

Thus, Efficient IT relies on organizational navigation rather than spot improvements in a specific site.

Figure 5: The relationship between IT and Corporate Sustainability has been positive so far, with limited downside and increased visibility of achievement to senior management.

The problem of comatose servers is a prime example of rampant inefficiency in IT that can be addressed only through the org chart and not the street address. Industry reports suggest server CPU utilization in many enterprises is in single-digit percentages. Furthermore, Uptime Institute research has demonstrated that over 20% of IT equipment serves no business function whatsoever. Hardware long abandoned by application owners and users but still racked and running hides in plain sight within even the most sophisticated IT organizations.

By decommissioning an abandoned server or eliminating the need to deploy a server through asset optimization, kilowatts of power and cooling are saved. This leads to savings in expense and carbon that are appreciated at the data center. If the organization is trying to avoid another data center build or purchase, then the kilowatts that are recovered are all the more precious. This is because the recovered savings are now being thought of as capital cost avoidance (tens of millions) rather than cost savings (tens of thousands).
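The arithmetic behind the operating-cost side of that argument can be sketched roughly as follows. Every input here (server count, average draw, PUE, and electricity price) is a hypothetical assumption chosen for illustration; substitute measured values for a real estimate.

```python
HOURS_PER_YEAR = 8760


def recovered_facility_kw(servers_removed: int, avg_server_watts: float, pue: float) -> float:
    """Facility-level kW freed by removing servers, including cooling and power overhead."""
    it_kw = servers_removed * avg_server_watts / 1000.0
    return it_kw * pue


def annual_energy_cost_savings(facility_kw: float, price_per_kwh: float) -> float:
    """Yearly energy cost avoided for a constant load of facility_kw."""
    return facility_kw * HOURS_PER_YEAR * price_per_kwh


# Hypothetical inputs: 200 comatose servers at 250 W each, PUE of 1.8, $0.10 per kWh.
freed_kw = recovered_facility_kw(servers_removed=200, avg_server_watts=250, pue=1.8)
savings = annual_energy_cost_savings(freed_kw, price_per_kwh=0.10)
print(f"Recovered facility load: {freed_kw:.0f} kW")
print(f"Annual energy cost savings: ${savings:,.0f}")
```

With these assumed inputs the operating savings land in the tens of thousands of dollars per year. The recovered kilowatts become far more valuable when they defer or avoid a new build.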

Yet the decision to deploy or decommission a server is rarely made at the data center street address, but rather in the line of business or IT function outside of the purview of the data center. Understanding how to address the issue of comatose hardware requires navigating the org chart to find owners and executive sponsors to address underlying efficiency problems.

Server optimization and consolidation are only a portion of efficient IT opportunities but serve to illustrate how the big gains are reached by starting with the org chart, not the street address.

CONCLUSIONS

As Corporate Sustainability increasingly exercises influence on IT decision making, the question becomes “How will sustainability affect technology adoption trends?”

Will cloud adoption accelerate as a result?

Enterprise IT’s reputation for low-utilization assets, comatose equipment, and inability to document resource consumption seems retrograde when compared to cloud providers, who are very ready to have the discussion, and to provide facts and figures, about the costs, operational efficiencies, and waste reduction inherent to their service offerings.

The public cloud provides a good story for Corporate Sustainability in its “reveal” of resources consumed.

Extant enterprise IT needs to move beyond emotionally driven efforts to stem or claw back cloud adoption. Instead, IT teams need to collaboratively work with their Corporate Sustainability counterparts to develop business-level KPIs demonstrating “best use of assets.”

Outsourcing is no guarantee of efficiency or fiduciary responsibility—there is no better example than enterprises overprovisioning space or power in a colocation environment. Rather than focusing on the compute venue, IT needs to address the management behaviors that lead to waste.

To assess inherent best use of resources, Corporate Sustainability will have to gain insight and influence into the ways and means of IT provisioning and procurement. This will have to encompass both extant IT infrastructure, which is arguably obscured in terms of cost and utilization, and public cloud, which is believed to be the opposite. The latter will be easier and thus preferred, and that natural bias could be carried forward to more cloud adoption.

Cost and utilization insights into IT are not impossible, and they don’t require investment in a new suite of tools. Plus, IT infrastructure management may want to show their benefits to organizations in ways that are resonant, or even competitive, with public cloud.

Figure 6 lists questions that executives should ask extant IT leadership.

Corporate Sustainability’s mission and outreach will be the impetus for new forms of control even in old systems. But we can’t be surprised if the inflexibility of existing IT, when weighed against the perceived ease of public cloud, persuades Corporate Sustainability to hasten the speed of the transition to off-premise solutions.

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute, and Julian Kudritzki, COO, Uptime Institute

Figure 6. IT sustainability programs can be more effective if designed to address five key questions.

IT RESILIENCE DURING NATURAL DISASTERS

The most common cause of disruption to IT services during a natural disaster is preventable

In late October 2012, Superstorm Sandy tore through the Caribbean and up the east coast of the U.S., killing over a hundred, leaving millions without power, and causing billions of dollars in damage. In the aftermath of the storm, Uptime Institute surveyed data center operators to gather information on how Sandy affected data center and IT operations.

The survey focused primarily on the northeast corridor of the U.S. as the greater New York City area took the brunt of the storm and suffered the most devastating losses.

Uptime Institute examined how facilities fared during the storm and the follow-up actions taken by data center owners and operators to ensure availability and safeguard critical equipment. Of all respondents spread through North America, approximately one-third said they were affected by the storm in some way. The results show that natural disasters can bring unexpected risks but also reveal that planning and preemptive measures can be applied in anticipation of a catastrophic event.

When sites lost their IT computing services, it was largely because critical infrastructure components were located below grade or because the site depended on external resources for resupply of engine-generator fuel. Both outcomes are preventable.

Full preparedness for a natural disaster is not a simple proposition. However, being prepared with a robust infrastructure system, sufficient on-site fuel, available staff, and knowledge of past events will go a long way toward ensuring operational readiness.

IMPACTS

Almost all the respondents in the path of the storm went on engine generators during the course of the storm, with a couple following industry best practice by going on engine generators before losing utility power. About three-quarters of the respondents who turned to engine generators successfully rode out the storm and the days after. The remainder turned to disaster recovery sites or underwent an orderly shutdown. For all who turned to engine-generator power, maintaining sufficient on-site fuel storage was a key to remaining operational.

Utility power outages were widespread due to high winds and flooding. Notably, two sites that had two separate commercial power feeds were not immune. One site lost one utility feed completely, and large voltage swings rendered the other unstable. Thus, the additional infrastructure investment was unusable in these circumstances.

Respondents reported engine-generator runtimes ranging from one hour to eight days. Facilities with engine generators, fuel storage, or fuel pumps in underground basements experienced problems. Flooding affected fuel storage, fuel pumps, or fuel piping distribution for 25% of respondents. Several operators remained on engine-generator power after utility power was restored due to risks of instability in the grid. Additionally, for some respondents, timely delivery of fuel to fill tanks was not available, as fuel shortages caused fuel vendors to prioritize hospitals and other life-safety facilities. In short, fuel delivery SLAs were unenforceable.

WHAT WORKED? PREPARATION

Almost all the respondents reported that they topped off fuel or arranged additional fuel, and one-third made sleeping and food provisions for operators or vendors expected to be on extended shift. About one-quarter of respondents reported checking that business continuity and maintenance actions were up-to-date. Others ensured that roof and other drainage structures were clear and working. And, a handful obtained sandbags.

All respondents reviewed operational procedures with their teams to ensure a thorough understanding of standard and emergency procedures. Several reported that they brought in key vendors to work with their crews on site during the event, which they said proved helpful.

Some firms also had remote operations emergency response teams to relieve the local staff, but one respondent reported that blocked roadways and flight cancellations delayed their arrival for a lengthy period of time.

Multiple respondents said that conducting an in-depth review of emergency procedures in preparation for the storm resulted in the staff being better aware of how to respond to events. Preparations enabled all the operators to continue IT operations during the storm. For example, unexpected water leaks materialized, but precautions such as sandbags and tarps successfully safeguarded the IT gear.

Overwhelmingly, respondents said emergency preparations were valuable and enabled personnel to anticipate and prevent problems. Reviewing rehearsed procedures, load transfer testing to switch the electrical load from utility power to engine generators, and extra attention to fuel storage all provided benefits.

WHAT FAILED? FUEL SUPPLY AND STORAGE

This storm showed that even a backup plan for fuel delivery is no guarantee that fuel will be available to replenish stock. In some cases fuel supplier power outages, fuel shortages, or closed roadways prevented deliveries. In other cases, however, companies that thought they were on a priority list learned hospitals, fire stations, and police had even higher priority.

In addition, fuel suppliers had their own issues remaining operational. Due to the widespread outages, some refineries shut down and some suppliers could not dispense fuel. There were also problems with phone systems, making it difficult for suppliers to run their businesses and communicate with customers.

Respondents also reported lessons about the location of fuel tanks. Fuel storage tanks and fuel pumps should not be located below grade. As indicated, many areas below grade flooded due to either storm water or pipe damage. Respondents experiencing this problem plan to move pumps or other engine-generator support equipment to higher locations.

No one expected that water infiltration would have such an impact. Wind speeds were so extreme that building envelopes were not watertight, with water entering buildings through roofing and entryways.

A few respondents saw a need to move critical facilities away from an area susceptible to a hurricane or flood. While some respondents plan to increase the resiliency of their site infrastructure, they are also evaluating extending the use of existing facilities in other locations to pick up the computing needs during the emergency response period.

CONCLUSIONS

In order to maintain functionality through a region-wide disaster, it is important for executives and infrastructure staff to identify the risks and mitigate them.

Solutions include the following:

• Thoroughly review site location and the elevation of critical components, including fuel storage and fuel pumps, for flooding potential.

• Perform regular testing and maintenance of the infrastructure systems, in particular switching power from utility to engine generator.

• Ensure sufficient duration of engine-generator fuel stored on site.

• Maintain up-to-date disaster recovery, business continuity, and IT load shedding plans. Brief stakeholders on these plans regularly to ensure confidence and common understanding.
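The fuel-duration item in the list above can be checked with simple arithmetic. The sketch below is a minimal illustration under stated assumptions: the tank volume and burn rate are hypothetical, and the 192-hour target is simply the longest engine-generator runtime (eight days) that survey respondents reported, not a recommended standard.

```python
def runtime_hours(usable_fuel_liters: float, burn_rate_liters_per_hour: float) -> float:
    """Hours of engine-generator operation supported by on-site fuel at a constant burn rate."""
    if burn_rate_liters_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return usable_fuel_liters / burn_rate_liters_per_hour


def fuel_shortfall_liters(usable_fuel_liters: float,
                          burn_rate_liters_per_hour: float,
                          target_hours: float) -> float:
    """Additional liters needed to reach the target runtime (0 if already sufficient)."""
    required = burn_rate_liters_per_hour * target_hours
    return max(0.0, required - usable_fuel_liters)


# Hypothetical site: 20,000 L usable fuel, 150 L/h at the expected load.
TARGET_HOURS = 8 * 24  # longest runtime reported in the survey (eight days)
hours = runtime_hours(20_000, 150)
shortfall = fuel_shortfall_liters(20_000, 150, TARGET_HOURS)
print(f"Runtime on stored fuel: {hours:.1f} h")
print(f"Shortfall against {TARGET_HOURS} h target: {shortfall:,.0f} L")
```

Because fuel deliveries proved unreliable during the storm, a calculation like this should count only fuel already stored on site.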

In a major disaster, unforeseen issues can arise. The goal is to reduce potential impacts as much as possible.

By Uptime Institute Senior Technical Staff

A HOLISTIC APPROACH TO VENDOR SELECTION FOR CLOUD AND COLOCATION

As companies rely more on colocation, cloud, and other off-premise computing models, enterprise IT needs to improve how it selects and manages vendors

According to Uptime Institute’s sixth annual Data Center Industry Survey, the majority of IT departments maintain a mix of assets in their own data centers, colocation partners, and cloud platforms.

The percentage of IT processed at in-house sites has remained steady around 70%, but data points to a major shift to colocation and cloud for new workloads in the coming years.

Half of senior IT execs expect the majority of their IT workloads to reside off-premise in the future. Of those, 70% expect that shift to happen by 2020.

It is hard to predict what percentage will go to public cloud, but a significant portion of those workloads will be shifting to colocation providers—companies that provide data center facilities and varying levels of operations management and support.

Many colocation suppliers have been growing rapidly in recent years. Survey respondents listed the following as top drivers for colocation adoption:

• Reduce churn of noncritical workloads into critical space

• Mergers/Acquisitions activity

• Disaster recovery site on a different power grid

• Executive directive to divest owned data center infrastructure

• Global expansion

• Avoid large capital expenses of new site build

• Not core business

• Lack of confidence in staff/resources

Yet, many decisions to deploy IT assets in colocation or cloud computing environments do not holistically view the financial, risk, performance, or other impacts of that decision.

Survey results show executives are not confident in their ability to evaluate deployment alternatives due to the following challenges:

• Incomplete data when evaluating internal assets, such as data center capital costs that aren’t included in TCO calculations for IT projects, or lack of insight into personnel costs associated with providing internal IT services.

• Lack of insight into cloud computing security, pricing models, and reliability data.

• Lack of credible cloud computing case studies.

• Inconsistency in reporting structures across geographies and divisions and between internal resources and colocation providers.

• Difficulty articulating business value for criteria not tied to a specific cost metric, like redundancy or service quality.

• Difficulty connecting IT metrics to business performance metrics.

• Challenge of capacity planning for IT requirements forecast beyond six months due to evolving architecture/application strategy and shifting vendor roadmaps.

• Difficulty collecting information across the various stakeholders, from application development to corporate real estate.

INTRODUCING FORCSS

Uptime Institute developed a system called FORCSS to enable enterprise IT to identify, weigh, and communicate the advantages and risks of IT application deployment options using consistent and relevant criteria based on business drivers and influences.

FORCSS helps the enterprise to overcome this challenge by focusing on the critical selection factors, thereby reducing or eliminating unfounded assumptions and organizational “blind spots.” FORCSS establishes a consistent and repeatable set of evaluation criteria and a structure to communicate the informed decision to stakeholders.

A coherent IT deployment strategy is often difficult because the staff responsible for IT assets and IT services across multiple geographies and multiple operating units are themselves spread over multiple geographies and operating units. The result can be a range of operating goals, modes, and needs that are virtually impossible to incorporate into a single, unified deployment strategy. And when a single strategy is developed from the “top down,” the staff responsible for implementing that strategy often struggles to adapt that strategy to their operational requirements and environments.

FORCSS was developed to provide organizations with the flexibility to respond to varying organizational needs while maintaining a consistent overall strategic approach to IT deployments. FORCSS represents a process a) to apply consistent selection criteria to specific deployment options, and b) to translate the outcome of the key criteria into a concise structure that can be presented to “non-IT” executive management.

The FORCSS system is composed of six necessary and sufficient selection factors supported by three underlying inputs per factor. These six factors, or criteria, provide a holistic evaluation system and drive a succinct decision exercise that avoids analytical paralysis (see Figure 1).

And, by scaling the importance of the criteria within the system, FORCSS allows each organization to align the decision process to organizational needs and business drivers.

FORCSS DEFINITIONS

Financial: The fiscal consequences associated with deployment alternatives.

• Net Revenue Impact: What is the net revenue impact or value of the IT deployment to the business?

• Comparative Cost of Ownership: The identified differential cost of deploying the alternative. A detailed accounting analysis is not necessary to procure services; what matters is the ability to effectively identify and communicate the financial consequences of each alternative.

• Cash and Funding Commitment: Representation of liquidity—cash necessary at appropriate intervals for the projected duration of the business service.

SIX FORCSS FACTORS, 18 INPUTS

• Financial: Net Revenue Impact; Comparative Cost of Ownership; Cash and Funding Commitment

• Opportunity: Time to Value; Scalable Capacity; Business Leverage and Synergy

• Risk: Cost of Downtime vs. Availability; Acceptable Security Assessment; Supplier Flexibility

• Compliance: Government Mandates; Corporate Policies; Compliance & Certifications to Industry Standards

• Sustainability: Carbon and Water Impact; Efficient IT Certification; Sustainability Metrics & Documentation

• Service Quality: Application Availability; Application Performance; End-User Satisfaction

Figure 1. FORCSS helps organizations evaluate IT alternatives by examining 18 relevant inputs.

Opportunity: A deployment alternative’s ability to fulfill compute capacity demand over time.

• Time to Value: The time period from decision to IT service availability. Timeline must include department deployment schedules of IT, facilities, network, and service providers.

• Scalable Capacity: Available capacity for expansion of a given deployment alternative.

• Business Leverage and Synergy: Significant ancillary benefits of a deployment alternative outside of the specific application or business service. For example: Improve economies of scale and pricing for other applications. Or, geographic location of a particular site provides business benefits beyond the scope of a single application.

Risk: A deployment alternative’s potential for negative business impacts.

• Cost of Downtime vs. Availability: Estimated cost of an IT service outage vs. forecasted availability of deployment alternative.

• Acceptable Security Assessment: Internal security staff evaluation of deployment alternative’s physical and data security.

• Supplier Flexibility: Potential “lock-ins” from a technical or contractual standpoint.
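One simple way to frame the Cost of Downtime vs. Availability input is to convert a forecasted availability into expected outage hours per year and multiply by an estimated hourly outage cost. The sketch below is only an illustration of that arithmetic, not a method prescribed by FORCSS. The availability figure and hourly cost are hypothetical assumptions.

```python
HOURS_PER_YEAR = 8760


def expected_downtime_hours(availability: float) -> float:
    """Expected outage hours per year for an availability expressed as a fraction (e.g. 0.999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def expected_annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly outage cost, assuming a constant cost per hour of downtime."""
    return expected_downtime_hours(availability) * cost_per_hour


# Hypothetical alternative: 99.9% forecasted availability, $50,000 per outage hour.
hours_down = expected_downtime_hours(0.999)
annual_cost = expected_annual_downtime_cost(0.999, 50_000)
print(f"Expected downtime: {hours_down:.2f} h/year")
print(f"Expected downtime cost: ${annual_cost:,.0f}/year")
```

Running the same calculation for each deployment alternative gives a comparable figure to set against the cost of achieving higher availability.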

Compliance: Verification, internal and/or third-party, of a deployment alternative’s compliance with regulatory, industry, or other relevant criteria.

• Government Mandates: Legally mandated reporting obligations associated with the application or business service. For example: HIPAA, Sarbanes-Oxley, PCI-DSS.

• Corporate Policies: Internal reporting requirements associated with the application or business service. For example: Data protection and privacy, ethical procurement, Corporate Social Responsibility.

• Compliance & Certifications to Industry Standards: Current or recurring validations achieved by the site or service provider, beyond internal and governmental regulations. For example: SAS 70, SSAE 16, Uptime Institute Tier Certification or M&O Stamp of Approval, ISO.

Sustainability: Environmental consequences of a deployment alternative.

• Carbon and Water Impact: Carbon and water impacts for a given site or service.

• Efficient IT Certifications: Current or recurring validations achieved by the site or service provider of sustainable IT operations practices. For example: Uptime Institute Efficient IT Stamp of Approval.

• Sustainability Metrics: Transparency and accountability demonstrated by meaningful IT efficiency metrics including a documented energy management plan and IT asset utilization program.

Service Quality: A deployment alternative’s capability to meet end-user performance requirements.

• Application Availability: Computing environment uptime at the application or operating system level.

• Application Performance: Evaluation of an application’s functional response and acceptable speeds at the end-user level.

• End-User Satisfaction: Stakeholder response that an application or deployment alternative addresses end-user functional needs. For example: End-user preference for graphical user interfaces or operating/management systems tied to a specific deployment alternative.

Many organizations already perform due diligence that would include most of this process. But this system was validated by thought leaders in the enterprise IT industry to ensure usefulness by those who inform senior-level decision makers.

Uptime Institute acknowledges that there are overlaps and dependencies across all six factors. But, in order to provide a succinct, sufficient process to inform C-level decision makers, categories must be finite and separate to avoid analysis paralysis. The purpose of FORCSS is to identify the business requirements of the IT service and pragmatically evaluate capabilities of potential deployment options as defined. The Uptime Institute FORCSS system provides a system of common criteria agreed upon by an elite group of data center owners and operators from around the world.

THE FORCSS INDEX

The FORCSS Index is the executive summary presentation tool of a FORCSS process. Once the organization has gathered the inputs in the FORCSS process, the Index provides a graphical means to compare the deployment alternatives and the relative impacts of each alternative on each factor (see Figure 2).

In conversations with senior management, the Index will also facilitate discussions of the weighting of each FORCSS Factor in the ultimate decision.

The indicators may be placed in relative positions (high, medium, low) to reflect the advantages or the exposures within any given factor.

The FORCSS Index effectively compares multiple alternatives at the application or physical layer(s). Organizations executing a FORCSS analysis can populate multiple indexes to compare a range of deployment alternatives.

Certain data inputs may be weighted more heavily than others (positively or negatively) in determining the indicator position for a factor. These special considerations are defined as Key Determinants and are specifically labeled in the FORCSS Index output.

Several data inputs may be used to determine the indicator position for one factor, and one data input may affect the placement of indicators in several factors.

The FORCSS Index is designed as a means of relative comparison of any number of alternatives. Although it would seem that the logical extension of this approach would be to assign numerical scores to each data input for each factor, during FORCSS development numerical scoring was found to add unnecessary complexity which can obscure the key determinants. Scoring can also mislead as the score assigned to one factor can numerically erase the score assigned to another (prohibitive) factor, thereby defeating one of the major benefits of the FORCSS process.
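The structure described above can be sketched as a small data model. This is an illustrative sketch only, not an Uptime Institute tool. The class and function names, the indicator positions, and the key determinant text are all hypothetical. The `naive_score` function exists solely to show how summing scores can hide a prohibitive factor.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


FACTORS = ("Financial", "Opportunity", "Risk",
           "Compliance", "Sustainability", "Service Quality")


@dataclass
class ForcssIndex:
    alternative: str
    indicators: Dict[str, Level]                 # one relative position per factor
    key_determinants: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        missing = [f for f in FACTORS if f not in self.indicators]
        if missing:
            raise ValueError(f"missing factors: {missing}")


def comparison_table(indexes: List[ForcssIndex]) -> Dict[str, Dict[str, str]]:
    """Side-by-side indicator positions, factor by factor."""
    return {f: {ix.alternative: ix.indicators[f].name for ix in indexes} for f in FACTORS}


def naive_score(index: ForcssIndex) -> int:
    """Numerical total, shown only to illustrate why scoring can mislead."""
    return sum(level.value for level in index.indicators.values())


# Hypothetical indicator positions for two alternatives.
refurbishment = ForcssIndex(
    "Refurbishment",
    {"Financial": Level.HIGH, "Opportunity": Level.MEDIUM, "Risk": Level.MEDIUM,
     "Compliance": Level.LOW, "Sustainability": Level.HIGH, "Service Quality": Level.MEDIUM},
    key_determinants={"Compliance": "hypothetical: site cannot meet a mandated requirement"},
)
colocation = ForcssIndex(
    "Colocation",
    {"Financial": Level.MEDIUM, "Opportunity": Level.MEDIUM, "Risk": Level.MEDIUM,
     "Compliance": Level.MEDIUM, "Sustainability": Level.MEDIUM, "Service Quality": Level.HIGH},
)

for factor, row in comparison_table([refurbishment, colocation]).items():
    print(f"{factor:16} {row}")
print(naive_score(refurbishment), naive_score(colocation))
```

In this made-up example both alternatives total the same score, yet one carries a Compliance exposure flagged as a key determinant. The factor-by-factor view keeps that exposure visible, while the totals erase it.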

[Figure 2 panels: Alternative 1. Refurbishment; Alternative 2. Build; Alternative 3. Colocation]

Figure 2. The graphical output of a FORCSS evaluation suggests the strengths and weaknesses of three alternative IT plans.

For a full explanation of how to apply FORCSS to a colocation decision, see “How FORCSS Works: Case Study 1” in The Uptime Institute Journal (Vol. 3, page 92 or journal.uptimeinstitute.com).

INSIGHT FROM A FORCSS LEADER:

A large financial organization in North America was an early adopter of FORCSS and has used the system for its deployment decisions. In the excerpt below, a leader from that organization describes his experience using FORCSS in his own words:

“Our organization has historically tried to self-provision first. But, the doors have opened up. The IT departments aren’t holding onto hardware anymore, and the shorter timelines are having a huge impact on how we respond. You have to pick projects that you can do better, and you have to be ready to let go of things you can’t do fast enough. Most builds will take longer than buying the service. An IT organization isn’t linear anymore. There are multiple stakeholders who have different influences and different impacts on decisions. If you’re not thinking 3 to 5 years out, an enterprise organization won’t be able to respond to the business demand.

“Quantifying the opportunity is a difficult, but important, aspect of the FORCSS process. One of the biggest considerations in this decision was the business synergy [providing business continuity infrastructure for an adjacent back-office function], documented in the Opportunity section. You have to be well connected across your organization or you will miss Opportunity.

“Vetting a multi-tenant data center provider required due diligence. We attended site tours. We had them provide single-lines of how the infrastructure would look. We got as close to apples-to-apples as we could get, down to cabinet layout of the room. We had three detailed meetings where my engineering and operations teams sat down with the colocation fulfillment team.

“One of the biggest risks of an outsourcer is not about the immediate contract, but about how you deal with change going forward. How do you handle change, like a new business opportunity, that isn’t in the contracts? How do you deal with non-linear growth? We never got to the point of pulling the trigger, but had a frank discussion with our board about risk associated with outsourcing and they were comfortable with the alternative.

“The Board ultimately funded the Build option. I believe that our FORCSS process was successful with decision makers due to the thoroughness of our preparation. For today’s enterprise, speed is key: speed of decision making; speed of deployment. In this environment, the decision-making methodology must be credible and consistent and timely. We adopted FORCSS because it was thorough, independent, and industry accepted.”

A Holistic Approach to Vendor Selection for Cloud and Colocation

RFPs and SLAs for Colocation

Despite increasing adoption of cloud and colocation, the outsourcing model is not a panacea. According to 2016 survey data:

• 40% of enterprise respondents are paying more for colocation contracts than they had initially planned or expected.

• Nearly one-third of respondents had experienced an outage at a colocation vendor site.

• Over 60% of respondents said the penalty clause in their Service Level Agreement (SLA) would not adequately offset the cost of that outage to the business.

Enterprise IT organizations pay a premium for a third party to deliver data center capacity and should hold service providers to higher standards than their own organization. There is significant room for improvement in vetting, negotiating, and managing those relationships.

Companies need to become much savvier about defining their requirements. The following recommendations can help enterprise IT organizations to improve vendor selection and management:

• The Site: Availability is at the forefront of all colocation discussions. Ask for industry certifications and documentation. The vast majority of major suppliers claim to build to Uptime Institute’s Tier III standard. Which ones can provide verification that the site was built to that specification? If your IT workloads are critical to your business, can you afford to take their word for it? Look for trusted third-party validation: Uptime Institute Tier Certification of Constructed Facility or Management and Operations Stamp of Approval can shorten the due diligence cycle.

• The Operations: The biggest risk of an IT outage comes from operations failures. If you want to see how a data center really runs, ask to review the last 5 years of incident reports. Demand to inspect maintenance records. Ask to see commissioning reports. Negotiate increased control and transparency with your provider to ensure operational excellence.

• The Business: A data center that has been operated by the same team with the same vendors and clients for several years will likely be very stable. But if that provider is bought by another company, or is conducting a consolidation project, changing operations programs, or installing new equipment, you will have a higher risk of failure.

Ask about the current occupancy rate. If you are the first tenant in a shared space, every other person coming in is an opportunity for your equipment to be de-energized or for a technician to make a mistake. Ask about staff turnover and average tenure with the company. Turnover can be a red flag. Is the equipment infrastructure at the end of its lifespan? If so, the company is likely planning upgrades that you should be aware of before signing.

Many problems with colocation providers can be avoided by setting more effective terms up front and by writing better RFPs (requests for proposal) and SLAs (service level agreements).

To that end, Uptime Institute conducted a panel with its user group, the Uptime Institute Network, including participants from large enterprise organizations and colocation vendors, to come up with the following recommendations for writing better RFPs and SLAs:

Don’t use the RFP for due diligence: 20-page RFPs are expensive and time-consuming to write and evaluate. Don’t waste time with a huge document. Focus on your business requirements and must-haves in a short two- to three-page RFP.

Customers are not a substitute for due diligence: The presence of large companies renting space in the facility does not indicate whether the site will meet your business requirements. You have no way to find out whether their workloads are business critical.

Avoid overprovisioning: Common mistakes include relying on IT equipment faceplate data to calculate power draw requirements and underestimating the impact a hardware refresh could have with increasingly efficient equipment.

Worst-case scenario: In the instance of multiple outages, do not focus on increasing financial penalties. SLA penalties will not cover your costs; they will instead make your contract worth less and less to your provider and will likely drive down the service levels you receive. Rather, structure the multiple-outage provisions of your SLA toward an exit. Negotiate your move-out costs and free rent while you find a new space.
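The reasoning behind the worst-case recommendation can be illustrated with simple arithmetic. The Python sketch below is a hypothetical example, not a model of any real contract: the monthly fee, the per-outage credit percentage, the cap of one month's fee on total credits, the cost of an outage to the business, and the two-outage exit trigger are all assumed values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlaTerms:
    """Hypothetical SLA terms; every field is an assumption for illustration."""
    monthly_fee: int         # colocation fee per month, in dollars
    credit_pct: int          # percent of the monthly fee credited per outage
    exit_after_outages: int  # outages in the term that trigger exit rights


def sla_credit(terms, outages):
    """Total SLA credit, capped at one month's fee (an assumed cap)."""
    uncapped = terms.monthly_fee * terms.credit_pct * outages // 100
    return min(terms.monthly_fee, uncapped)


def uncovered_loss(terms, outages, cost_per_outage):
    """Business loss that remains after SLA credits are applied."""
    return max(0, outages * cost_per_outage - sla_credit(terms, outages))


def exit_triggered(terms, outages):
    """True once the outage count reaches the negotiated exit threshold."""
    return outages >= terms.exit_after_outages


if __name__ == "__main__":
    terms = SlaTerms(monthly_fee=50_000, credit_pct=10, exit_after_outages=2)
    cost_per_outage = 250_000  # assumed cost of one outage to the business
    for outages in (1, 2, 3):
        print(f"outages={outages} "
              f"credit=${sla_credit(terms, outages):,} "
              f"uncovered=${uncovered_loss(terms, outages, cost_per_outage):,} "
              f"exit={exit_triggered(terms, outages)}")
```

Under these assumed numbers, the credits stay small relative to the loss no matter how many outages occur, which is why an exit clause with negotiated move-out costs offers more protection than a larger penalty.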

As companies increasingly rely on colocation and other off-premises computing models, enterprise IT and data center staff will need to develop the planning skills, expertise, and coordination to play an important governance role in their organizations going forward.

By Matt Stansberry, Senior Director of Content and Publications, Uptime Institute, and Julian Kudritzki, COO, Uptime Institute