Maximize IT Uptime T5x20

  • 8/3/2019 Maximize IT Uptime T5x20

    MAXIMIZE IT UPTIME

    BY UTILIZING DEPENDABLE SUN SPARC ENTERPRISE T5120

    AND T5220 SERVERS

    White Paper

    February 2009

    Sun Microsystems, Inc.

    Table of Contents

    Executive Summary

    The Critical Role of Information Technology
        Requirements for Dependable IT Services
        Pervasive Demand for Reliability, Availability, and Serviceability Features
        Introducing Sun SPARC Enterprise T5120 and T5220 Servers

    Designed for Reliability, Availability, and Serviceability
        Minimizing Component Count to Achieve Maximum Reliability
        Improving Availability: Part Redundancy, On-line Serviceability, and Self Diagnosis and Correction
        Integrated Lights Out Management for Simplified Remote Serviceability

    Isolating Faults with Sun Virtualization Technologies
        Sun Logical Domains Technology
        Solaris Containers Technology

    Sun's Comprehensive Focus On IT Service Uptime
        Speeding Diagnosis with a Comprehensive Fault Management Architecture
        Reducing Downtime with the Solaris Operating System
        Sun Software for Efficient System Management

    Measuring Server Reliability, Availability, and Serviceability
        Sun Availability Benchmark Framework
        R-Cubed Metrics
        Assessing RAS Levels of Scale-Out Architectures

    Conclusion

    For More Information

    Executive Summary

    IT solutions play a critical role in helping enterprises compete effectively in today's fast-changing global markets. Organizations of every size and type increasingly depend

    upon information technology to execute day-to-day operations, interact with

    customers, and generate revenue. Current trends toward non-stop business operations

    place pressure on IT departments to keep services available around-the-clock. As a

    result, the reliability, availability, and serviceability (RAS) capabilities of systems and

    software applications continue to escalate in importance.

    Reliability, availability, and serviceability are intricately related and each factor is

    equally important to achieving maximum IT service uptime. The reliability features of a

    system work to minimize the frequency of faults and ensure data integrity. Availability

    capabilities support continuous accessibility to IT services despite system faults or error

    events. Serviceability mechanisms foster short service cycles for component upgrades

    or repair. In some cases, designing for all three RAS factors at the same time can pose a

    challenge. For instance, maximizing redundancy can boost system availability.

    However, redundancy also adds to component count, lowering potential reliability

    levels. Determined to continuously improve RAS capabilities, Sun utilizes carefully-

    engineered metrics to aid design efforts to balance and optimize the reliability,

    availability, and serviceability levels of each new generation of servers.

    Sun SPARC Enterprise T5120 and T5220 servers provide excellent RAS characteristics,

    ideal for maximizing the uptime of business-critical IT services. Highly reliable parts and

    a relatively low total component count minimize the opportunity for system errors. In

    addition, these servers include core and thread offlining capabilities, redundant hot-

    swap disks, power supplies, and fans, integrated disk RAID functions, and extensive ECC

    hardware protection. The energy efficiency of Sun SPARC Enterprise T5120 and T5220

    servers minimizes heat generation, helping enterprises avoid susceptibility to faults

    related to environmental factors. An Integrated Lights Out Management service

    processor provides extensive system monitoring and eases administration.

    Taking a comprehensive top-to-bottom approach, Sun focuses on RAS at each and

    every platform layer. Sun's intimate knowledge of technology at the processor, server,

    virtualization, operating system, and system management layers results in delivery of

    tightly integrated, extremely dependable platforms. Combining the highly dependable and

    energy-efficient Sun SPARC Enterprise T5120 and T5220 server platforms with Sun

    Logical Domains technology, the Solaris 10 Operating System (OS), and Sun

    management tools further improves upon platform stability. By taking advantage of

    Sun technology, organizations can create exceptional IT solutions that minimize total

    cost of ownership (TCO), optimize asset utilization, and maximize IT service uptime

    levels.

    Chapter 1

    The Critical Role of Information Technology

    Information technology (IT) plays a critical role in organizations of every kind, from

    large government agencies and Fortune 500 companies to small municipalities and

    start-up ventures. In the extreme, the very existence of some organizations depends on

    information technology and Web connectivity. Electronic storefronts, social networking

    sites, and Voice over IP (VoIP) providers offer just a few examples. Traditional

    businesses also rely heavily on information technology. For example, manufacturers

    often use IT services to communicate and complete business-to-business transactions

    with supply chain partners, functions critical to effective operation, production, and

    distribution of products. Indeed, a multitude of organizations depend on IT services to

    expand revenue opportunities, execute critical business functions, and lower the cost

    of operations.

    As network-centric technologies evolve, the importance of IT continues to escalate.

    Coupled with a rapid increase in new user and device connections, compelling network

    services and the emergence of Web 2.0 collaboration are driving demand for new IT

    services. Furthermore, end-users often expect constant availability, accessibility, and

    responsiveness from these new IT services. As a result, service providers and IT

    departments are under intense pressure to minimize service interruptions.

    Requirements for Dependable IT Services

    Today, pervasive dependence on electronic transactions and communication places IT

    systems in the critical revenue path for most organizations. System downtime can carry

    financial consequences on the order of hundreds of thousands of dollars of lost revenue

    per minute. In addition, trends toward network-facing business operations mean that

    the consequences of downtime can often reach beyond financial loss to include

    damage to brand image, lower levels of customer satisfaction, and heightened

    potential for security risks and compliance failure. With such heavy reliance on IT and

    so much at stake, downtime in many environments is now unacceptable.
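
    To put downtime in concrete terms, the following minimal Python sketch converts an availability level into expected minutes of downtime per year and multiplies by a cost per minute. The function names and the sample figures are illustrative assumptions, not values taken from this paper.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def annual_downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_minute: float) -> float:
    """Expected yearly revenue loss, given a hypothetical cost per minute."""
    return annual_downtime_minutes(availability) * cost_per_minute


if __name__ == "__main__":
    # Hypothetical cost per minute, chosen only to illustrate the arithmetic.
    for level in (0.99, 0.999, 0.9999):
        minutes = annual_downtime_minutes(level)
        cost = annual_downtime_cost(level, 100_000.0)
        print(f"{level:.4%} available: {minutes:,.1f} min/year, ${cost:,.0f}/year")
```

    Each additional "nine" of availability cuts expected downtime by a factor of ten, which is why small availability gains matter when every minute is expensive.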

    Whether specifically stated in service level agreements or simply implicit, IT

    departments and CIOs are often held responsible for meeting availability expectations.

    True availability of an IT service hinges on all contributing elements: processors,

    system platforms, operating systems, management tools, and more. With heightened

    demand for highly reliable, available, and serviceable IT products and solutions across

    every area of the enterprise, organizations are working to minimize both planned and

    unplanned downtime.

    Pervasive Demand for Reliability, Availability, and Serviceability Features

    The reliability, availability, and serviceability (RAS) features of IT systems are more

    critical to a wider range of organizations than ever before. While RAS is important to

    many projects, budgets, workload sizes, and resulting architectural approaches can

    vary greatly. Large, vertically-scalable systems are known for extensive RAS features

    and are suitable for some IT service deployments. However, a monolithic scaling

    strategy is not appropriate or cost effective for all projects. For example, many stateless

    network-centric IT services achieve maximum performance by spreading the workload

    across multiple systems and application instances in a horizontally-scaled architecture.

    While techniques that distribute network-centric applications across multiple systems

    take some pressure off of individual server availability, continuous operation of each

    system remains important.

    The growing trend toward adoption of virtualization technologies in horizontally scaled

    architectures is increasing the importance of RAS features in these environments.

    Virtualization technology helps organizations maximize utilization of IT assets by

    allowing multiple applications to run on each server. Increasingly powerful,

    multithreaded and multicore processors along with fine-grained virtualization

    capabilities help organizations continuously raise the number of applications and

    services hosted on a single server. By combining powerful processors with virtualization

    technologies, enterprises can consolidate multiple horizontally-scaled applications

    onto a minimal number of servers (Figure 1-1). A virtualized, consolidated, scale-out

    architectural approach can maximize performance and efficiency while reducing

    administrative efforts and acquisition costs. As with any consolidation strategy, placing

    more applications on each server increases the importance of system RAS capabilities.

    Figure 1-1. Virtualization technology facilitates radical server consolidation, dramatically increasing

    the importance of platform RAS features.

    Introducing Sun SPARC Enterprise T5120 and T5220 Servers

    Sun SPARC Enterprise T5120 and T5220 servers provide enterprises with an

    energy-efficient, high-performance consolidation platform (Figure 1-2). Based on the second

    generation of Sun's Chip Multithreading Technology (CMT) in the UltraSPARC T2

    processor, these servers offer organizations a unique blend of vertical and horizontal

    scalability. Optimized for the enterprise datacenter, Sun SPARC Enterprise T5120 and

    T5220 servers are ideal for maximizing the uptime of network-centric Web and

    application-tier workloads, as well as on-line transaction databases and other software

    programs that require massive compute power and memory capacity.

    [Figure 1-1 diagram: ten dedicated servers (Portal Web Servers A to C, Portal Application Servers A and B, Inventory Web Servers A to C, and Inventory Application Servers A and B), labeled as Employee Portal and Inventory Tracking Web and Application Tier Servers, shown consolidated into two groups of five virtual machines (A to E), each virtual machine hosting one of the original server roles.]

    Figure 1-2. Sun SPARC Enterprise T5120 and T5220 servers include advanced reliability, availability, and

    serviceability features, helping organizations maximize IT service uptime.

    Sun SPARC Enterprise T5120 and T5220 servers incorporate the following key design

    elements to help organizations improve the dependability of IT services.

    • Reduced parts count contributes to better overall stability and reliability of the
      platform
    • Processor thread and core offlining and built-in RAID capabilities support
      continuous system operations in the face of certain fault conditions
    • Redundancy and hot-swap components lay the foundation for system resiliency
      and increased serviceability
    • Parity protection and error correction capabilities detect and correct errors
      throughout the system and work to ensure data integrity
    • Integrated Lights Out Management (ILOM) service processor eases remote
      management and provides considerable administrative flexibility
    • Superior energy efficiency reduces heat dissipated into the datacenter, helping to
      minimize susceptibility to faults due to thermal conditions
    • Robust virtualization technology facilitates fault isolation between applications
      and contributes additional reliability, availability, and serviceability features to
      further improve IT service uptime
    • Comprehensive fault management provides proactive management of faults and
      error conditions in all major elements including the system, virtualization
      technology, and operating system layers

    Table 1-1 details the features of Sun SPARC Enterprise T5120 and T5220 servers.

    Table 1-1. Overview of Sun SPARC Enterprise T5120 and T5220 server features

    Feature                    Sun SPARC Enterprise T5120 Server        Sun SPARC Enterprise T5220 Server

    Enclosure                  One rack unit                            Two rack units

    Processors                 Four-, six-, or eight-core 1.2 GHz or    Four-, six-, or eight-core 1.2 GHz or
                               1.4 GHz UltraSPARC T2 processor;         eight-core 1.4 GHz UltraSPARC T2
                               up to 64 threads                         processor; up to 64 threads

    Memory                     Up to 128 GB; 1 GB, 2 GB, 4 GB, or       Up to 128 GB; 1 GB, 2 GB, 4 GB, or
                               8 GB FBDIMMs                             8 GB FBDIMMs

    Ethernet                   Four on-board Gigabit Ethernet ports     Four on-board Gigabit Ethernet ports
                               (10/100/1000); two 10 Gb Ethernet        (10/100/1000); two 10 Gb Ethernet
                               ports via XAUI combo slots               ports via XAUI combo slots

    Internal Storage           Up to eight internal drives; 73 GB,      Up to sixteen internal drives; 73 GB,
                               146 GB, or 300 GB 2.5 inch SAS hard      146 GB, or 300 GB 2.5 inch SAS hard
                               drives; RAID 0/1                         drives; RAID 0/1

    Expansion Bus              One eight lane PCI Express; two four     Two eight lane PCI Express; two four
                               lane PCI Express or XAUI combo slots     lane PCI Express; two four lane PCI
                                                                        Express or XAUI combo slots

    Power                      Two hot-swap power supply units; AC      Two hot-swap power supply units; AC
                               720 Watt (Climate Saver [a]) or DC       750 Watt or AC 1100 Watt [b]; N+1
                               660 Watt; N+1 redundancy                 redundancy

    Fans                       Four hot-swap fan trays, with two fans   Three hot-swap fan trays, with two
                               per tray; N+1 redundancy                 fans per tray; N+1 redundancy

    Service Processor          Integrated Lights Out Manager; RJ45      Integrated Lights Out Manager; RJ45
                               serial port and RJ45 Ethernet            serial port and RJ45 Ethernet
                               connectors                               connectors

    Operating System           Solaris 10 Operating System              Solaris 10 Operating System

    Virtualization Technology  Solaris Containers technology; Sun       Solaris Containers technology; Sun
                               Logical Domains technology               Logical Domains technology

    a. For more information on the Climate Savers Computing Initiative, please see
       http://www.climatesaverscomputing.org/
    b. The AC 1100 Watt power supply is required for systems with a 16-disk backplane

    Chapter 2

    Designed for Reliability, Availability, and Serviceability

    Sun SPARC Enterprise T5120 and T5220 servers help organizations maximize the uptime

    of IT services. By minimizing the total number of system components, these servers are

    naturally more reliable since using fewer parts tends to result in lower numbers of

    failures. In addition, inclusion of redundant components and features that automate

    data integrity, isolation, and correction improves the robustness of Sun SPARC

    Enterprise T5120 and T5220 servers. Online maintenance capabilities and simplified

    maintenance procedures help enterprises avoid the need for planned outages.

    Extensive fault management and self healing capabilities reduce unplanned outages

    and speed recovery time.

    Minimizing Component Count to Achieve Maximum Reliability

    Given the detrimental impact of downtime, IT architects are driven to build IT services

    on the most reliable systems available. Sun SPARC Enterprise T5120 and T5220 servers

    contain significantly fewer parts than competitive systems, dramatically reducing the

    potential for service interruptions due to component failure (Figure 2-1). Extensive

    processor and system integration also results in less susceptibility to faults. Highly

    reliable components are integrated whenever redundancy cannot be afforded by

    design or cost, fostering enhanced system stability.
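
    The link between parts count and reliability can be sketched with a simple series model. The Python example below assumes that every part is independent, has the same constant failure rate, and that any single part failure interrupts service. Under those assumptions failure rates add, so doubling the parts count halves the mean time between failures (MTBF). The function name, the FIT value, and the parts counts are hypothetical and are not measurements from this paper.

```python
HOURS_PER_BILLION = 1e9  # FIT = failures per one billion device-hours


def series_mtbf_hours(part_count: int, fit_per_part: float) -> float:
    """MTBF of a series system of identical, independent parts.

    In a series model the system failure rate is the sum of the part
    failure rates, so MTBF = 1e9 / (part_count * fit_per_part).
    """
    if part_count <= 0:
        raise ValueError("part_count must be positive")
    if fit_per_part <= 0:
        raise ValueError("fit_per_part must be positive")
    system_fit = part_count * fit_per_part
    return HOURS_PER_BILLION / system_fit


if __name__ == "__main__":
    # Hypothetical parts counts and failure rate, for illustration only.
    for parts in (2_000, 4_000, 8_000):
        print(f"{parts} parts -> MTBF {series_mtbf_hours(parts, 10.0):,.0f} hours")
```

    Real systems mix parts with very different failure rates and include redundancy, so this model only illustrates the direction of the effect, not its size.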

    Figure 2-1. Built with a streamlined architecture, the Sun SPARC Enterprise T5120 server contains

    dramatically fewer parts than systems based on traditional architecture designs.

    [Figure 2-1 bar chart: parts count in thousands (axis from 0 to 10) for the Dell PowerEdge 2950, IBM System p5 p520, Sun SPARC Enterprise T5120, and HP ProLiant DL585.]

    As the industry's first massively-threaded system-on-a-chip (SoC), the Sun UltraSPARC T2

    processor directly contributes to the low component count of Sun SPARC Enterprise

    T5120 and T5220 servers. Based on a 65 nm manufacturing process, the UltraSPARC T2

    processor combines all major server functions on a single chip die, including a network

    interface unit for 10 Gigabit Ethernet processing, PCI-Express for low latency data

    transfer, and a stream processing unit (SPU) for wire speed cryptography (Figure 2-2).

    Figure 2-2. The system-on-a-chip design of the UltraSPARC T2 processor incorporates massive compute

    power, networking, I/O, and cryptographic capabilities into a single die.

    The Sun UltraSPARC T2 processor includes the following key design elements:

    • Support for up to 64 simultaneous threads
    • Eight cores with two processing pipelines each
    • Eight threads per core
    • Eight floating point units (1 per core)
    • Eight stream processing units (1 per core, acting as cryptographic coprocessors
      operating in parallel with the core)
    • On-chip caches and memory management
    • Two on-chip 10 Gb Ethernet ports
    • On-chip PCI Express I/O
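
    The thread counts in the list above follow from simple multiplication. The short Python sketch below works through that arithmetic for the eight-core part and for the four- and six-core options listed in Table 1-1. The class name is hypothetical and exists only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UltraSparcT2Config:
    """Illustrative model of the per-chip resources listed above."""

    cores: int = 8
    threads_per_core: int = 8
    pipelines_per_core: int = 2

    @property
    def hardware_threads(self) -> int:
        # Total simultaneous threads = cores x threads per core.
        return self.cores * self.threads_per_core

    @property
    def pipelines(self) -> int:
        # Total processing pipelines = cores x pipelines per core.
        return self.cores * self.pipelines_per_core


if __name__ == "__main__":
    for cores in (4, 6, 8):
        cfg = UltraSparcT2Config(cores=cores)
        print(f"{cores} cores: {cfg.hardware_threads} threads, {cfg.pipelines} pipelines")
```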

    [Figure 2-2 block diagram: eight cores (C0 to C7), each with an FPU and an SPU, connected through a crossbar to eight L2 cache banks; four memory control units (MCUs) attached to FB-DIMMs; and a system interface with a network interface unit (two 10 Gigabit Ethernet ports) and PCIe (x8 @ 2.0 GHz).]

    The system-on-a-chip design of the UltraSPARC T2 processor reduces the need for

    additional ASICs to connect on-board components. As a result, Sun SPARC Enterprise

    T5120 and T5220 servers simply contain fewer parts and pins that can fail than many

    traditional servers with multiple single-threaded dual-core or quad-core processors.

    Figure 2-3 illustrates the drastic simplification in system design afforded by the

    UltraSPARC T2 processor.

    Figure 2-3. With eight cores providing 64 threads, the UltraSPARC T2 processor maximizes compute

    power and minimizes system component count at the same time, delivering greater reliability than

    systems with many more processors, multiple system boards, and far more complicated designs.

    Sun estimates that as much as ten percent of all hardware failures are caused by

    excessive heat in the datacenter. The power efficient components within Sun SPARC

    Enterprise T5120 and T5220 servers help keep these systems cool and add to potential

    reliability. Since individual cores in the UltraSPARC T2 processor implement much

    simpler pipelines than single-threaded chips, the processors are substantially cooler

    and require significantly less electrical energy to operate. As in most systems, air flows

    through the front of Sun SPARC Enterprise T5120 and T5220 servers, passing the disk

    drives prior to crossing over the system board and electronics. These systems utilize

    2.5-inch SAS hard disk drives, which draw less than half the power of 3.5-inch SAS hard disk

    drives, providing another important contribution to reducing heat dissipation and

    keeping the platforms cool.

    [Figure 2-3 diagram: the system-on-a-chip design of the UltraSPARC T2 processor (eight cores, memory, and I/O on one chip) shown beside a classic system design built from many separate CPU, memory, I/O, and switch blocks.]

    Improving Availability: Part Redundancy, On-line Serviceability, and Self Diagnosis and Correction

    Minimizing the need for system interruptions in order to correct error conditions

    improves the availability of a system and the hosted IT service. Sun SPARC Enterprise

    T5120 and T5220 servers provide many design features to help enable completion of

    maintenance procedures without impacting continuous system operation.

    Redundant, Hot-Swap Components and RAID Capabilities

    Hot-swap and hot-plug technology improves the serviceability and availability of a

    system by enabling addition and replacement of components without service

    disruption. Sun SPARC Enterprise T5120 and T5220 servers support hot-plug of chassis

    mounted hard drives, and hot-swap of redundant fan units and power supplies. For

    systems configured with redundant components, administrators can utilize software

    commands to remove and replace disks, power supplies, and fan units while the

    system continues to operate.
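
    The availability benefit of redundant units such as power supplies and fans can be estimated with a k-of-n model. The Python sketch below assumes that units fail independently and share the same availability, which is a simplification. The function name and the sample availability value are hypothetical, not figures from this paper.

```python
from math import comb


def k_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n independent, identical units are up."""
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    if not 0.0 <= unit_availability <= 1.0:
        raise ValueError("unit_availability must be between 0 and 1")
    a = unit_availability
    # Sum the binomial probabilities of exactly i units being up, for i = k..n.
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    unit = 0.99  # hypothetical availability of a single power supply
    print("No redundancy (1 of 1):", k_of_n_availability(1, 1, unit))
    print("N+1 redundancy (1 of 2):", k_of_n_availability(1, 2, unit))
```

    With one required unit and one spare, the subsystem is down only when both units are down at once, and hot-swap replacement shortens the window in which the system runs without a spare.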

    Sun SPARC Enterprise T5120 and T5220 servers also support the attachment of an

    optional Sun External I/O Expansion Unit. Taking advantage of the External I/O

    Expansion unit increases the I/O connectivity capabilities of these systems, providing

    the means to create additional I/O path redundancies. The External I/O Expansion Unit

    is a 4U rack-mountable device which accommodates up to 12 additional PCI Express

    slots [1] connected to, and managed by, the Sun SPARC Enterprise server. Attachment

    of the External I/O Expansion unit consumes two host PCI Express slots, resulting in a

    net gain of 10 PCI Express slots. By using cassettes, the expansion unit supports active

    replacement of hot-plug cards. In addition, the External I/O Expansion unit offers

    redundancy and hot-plug capabilities for power supply units, fans, and I/O boats.

    The built-in RAID capabilities of Sun SPARC Enterprise T5120 and T5220 servers provide

    data redundancy and increased performance at no additional cost. These servers

    support on-board hardware RAID 1 to enable mirroring of data across any two internal

    drives. In addition, the RAID capabilities of Sun SPARC Enterprise T5120 and T5220

    servers support creation of two, three, or four disk RAID 0 striped volumes. Sun also

    provides optional PCI-Express cards that offer higher levels of RAID, including RAID 5

    and RAID 6 together with the benefit of battery backed write caches. The optional PCI-

    Express RAID cards can operate on internal disks or external disk chassis.
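
    The capacity trade-off between the two on-board RAID levels can be sketched as follows. This Python example applies the disk-count limits stated above (two disks for RAID 1, two to four disks for RAID 0) and the usual rule that each member contributes only as much space as the smallest disk. The function name and the level strings are hypothetical and do not correspond to any Sun utility.

```python
from typing import Sequence


def usable_capacity_gb(level: str, disk_sizes_gb: Sequence[float]) -> float:
    """Usable capacity of an on-board RAID 0 or RAID 1 volume, in GB."""
    count = len(disk_sizes_gb)
    if count == 0 or min(disk_sizes_gb) <= 0:
        raise ValueError("at least one disk with a positive size is required")
    smallest = min(disk_sizes_gb)
    if level == "raid1":
        # Mirroring: two disks hold identical data, so capacity is one disk.
        if count != 2:
            raise ValueError("RAID 1 mirrors exactly two disks")
        return smallest
    if level == "raid0":
        # Striping: capacity adds up, but there is no redundancy.
        if not 2 <= count <= 4:
            raise ValueError("RAID 0 stripes two, three, or four disks")
        return count * smallest
    raise ValueError(f"unsupported level: {level!r}")


if __name__ == "__main__":
    print("RAID 1, 2 x 146 GB:", usable_capacity_gb("raid1", [146, 146]), "GB")
    print("RAID 0, 4 x 146 GB:", usable_capacity_gb("raid0", [146] * 4), "GB")
```

    RAID 1 gives up half the raw capacity in exchange for surviving a single disk failure, while RAID 0 keeps all capacity and gains performance but loses the volume if any member disk fails.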

    1. Sun SPARC Enterprise T5120 and T5220 servers are not qualified to utilize the PCI-X variant of the

    External I/O Expansion unit.

    Sun SPARC Enterprise T5120 and T5220 servers running the Solaris OS can also take

    advantage of software RAID capabilities. For example, the Solaris Volume Manager and

    Solaris ZFS file system can be used for internal or external storage devices, providing

    flexibility and redundancy beyond the server chassis.

    Hardware Protection: ECC Everywhere

    System self diagnosis, error correction, and parity checking features in high-end

    systems help enterprises maximize IT service uptime. Sun SPARC Enterprise T5120 and

    T5220 servers include mainframe-class processor RAS features that enhance system

    uptime by maintaining data integrity across on-chip memory, a unique capability not

    found in other volume market systems. For example, extended ECC technology helps

    Sun SPARC Enterprise T5120 and T5220 servers withstand multibit memory errors within

    a DRAM device, including failures that cause incorrect data on all data bits of the

    device.

    The extensive data protection of the UltraSPARC T2 processor provides for self diagnosis

    and corrective actions, facilitating continuous system operation in the face of error

    conditions. Parity protection is provided throughout the UltraSPARC T2 processor

    including the following elements:

    • Instruction cache (I-cache)
    • Data cache (D-cache) tags and data
    • Instruction Translation Lookaside Buffer (ITLB)
    • Data Translation Lookaside Buffer (DTLB)
    • Modular arithmetic memory and store buffer addresses

    Utilizing a combination of hardware and software correction flows, Sun SPARC

    Enterprise T5120 and T5220 servers work to maximize data integrity. In the event of an

    error, hardware re-fetch is used for single error correction (SEC) of the I-cache and

    D-cache. ECC protection is provided for the integer and floating-point register files, store buffer data,

    trap stack, and other internal arrays.
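
    The idea of correcting a parity error by re-fetching can be sketched in software. The Python example below is a conceptual model only, not a description of the UltraSPARC T2 hardware. It stores one even-parity bit per cached word, and when a read finds a parity mismatch it discards the cached copy and reloads the word from a backing store that is assumed to hold a clean copy. All class and method names are hypothetical.

```python
from typing import Dict, Tuple


def even_parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


class ParityProtectedCache:
    """Toy cache that detects single-bit errors and corrects them by re-fetch."""

    def __init__(self, backing_store: Dict[int, int]) -> None:
        self._backing = backing_store  # assumed to hold a clean copy of each word
        self._lines: Dict[int, Tuple[int, int]] = {}  # addr -> (data, parity)
        self.refetches = 0

    def _fill(self, addr: int) -> int:
        data = self._backing[addr]
        self._lines[addr] = (data, even_parity(data))
        return data

    def read(self, addr: int) -> int:
        if addr not in self._lines:
            return self._fill(addr)
        data, parity = self._lines[addr]
        if even_parity(data) != parity:
            # Parity mismatch: drop the corrupted copy and fetch it again.
            self.refetches += 1
            return self._fill(addr)
        return data

    def inject_bit_flip(self, addr: int, bit: int) -> None:
        """Simulate a single-bit upset in an already cached word."""
        data, parity = self._lines[addr]
        self._lines[addr] = (data ^ (1 << bit), parity)


if __name__ == "__main__":
    cache = ParityProtectedCache({0x10: 0b1011})
    cache.read(0x10)                # first read fills the cache
    cache.inject_bit_flip(0x10, 2)  # corrupt the cached copy
    print(bin(cache.read(0x10)), "refetches:", cache.refetches)
```

    Parity alone only detects an odd number of flipped bits, and re-fetch only works when a clean copy exists elsewhere, which is why structures that hold the sole copy of data need ECC instead.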

    Sun SPARC Enterprise T5120 and T5220 servers use Content Addressable Memory (CAM)

    technology to implement hardware-based virtual address (VA) lookup, and physical

    address translation is implemented in RAM. Parity protection is provided for both VA

    lookup and physical address translation. While parity protection for the physical

    address translation function is increasingly common, variable page size support makes

    accomplishing parity protection for hardware-based VA lookup much more challenging.

    Sun's unique patented approach to providing parity protection for hardware-based VA

    lookup provides organizations with added reliability not found in other platforms.

    The Robust UltraSPARC T2 Processor

    Sun SPARC Enterprise T5120 and T5220 servers leverage the UltraSPARC T2 processor's

    fault management support for all major elements, including the cores, threads,

    • Voltage conditions
    • Solaris watchdog, boot time-outs, and automatic server restart events

    ILOM provides administrators with the capability to monitor and control Sun SPARC

    Enterprise T5120 and T5220 servers over a dedicated Ethernet connection and supports

    secure shell (SSH), Web, and Intelligent Platform Management Interface (IPMI) access.

    ILOM functions can also be accessed through a dedicated serial port for connection to a

    terminal or terminal server. The ILOM command-line and browser-based interfaces

    simplify remote administration of geographically distributed or physically inaccessible

    machines. In addition, ILOM provides remote execution of diagnostics that generally

    require physical proximity to the server serial port. ILOM can be configured to distribute

    email alerts of hardware failures and warnings, as well as other events related to the

    server.

    Chapter 3

    Isolating Faults with Sun Virtualization Technologies

    Consolidation of many applications on a single server is often a necessity in order to

    economically host and manage the high number of IT services required within an

    enterprise. Sun Logical Domains and Solaris Containers technology provide the ability

    to virtualize the considerable system resources of Sun SPARC Enterprise T5120 and

    T5220 servers. Within the context of a consolidation strategy, virtualization

    technologies from Sun can help prevent individual software application faults from

    impacting other IT services hosted on the same platform. In fact, utilizing Sun

    virtualization technologies can help organizations improve resource utilization while

    reducing downtime.

    Sun Logical Domains Technology

    Any virtualization solution must be carefully examined for its contribution to the

    reliability, availability, and serviceability of key applications and services. Implemented

    both in firmware and hardware, Sun Logical Domains technology fosters greater

    reliability than software-based virtualization solutions. In addition to robust fault

    isolation, Sun Logical Domains technology includes a number of options for creating

    redundancy within configurations, provides proactive fault management capabilities,

    and implements a unique I/O architecture that speeds recovery time of guest logical

    domains.

    Sun Logical Domains technology provides extensive resource isolation through

    firmware and hardware constructs. Each logical domain can be created, destroyed,

    reconfigured, and rebooted independently, without requiring a power cycle of the

    server. Logical domains are also managed as entirely independent machines, with

    localized control of the following resources:

    Kernel, patches, and tuning parameters

    User accounts

    Disks

    Network interfaces

    MAC addresses

    IP addresses

    Sun Logical Domains technology uses a lightweight hypervisor firmware layer provided

    on the UltraSPARC T2 processor to virtualize machine hardware, decoupling the link

    between the operating system and the hardware (Figure 3-1). As such, the number of

    virtual machines that can be created relies upon the capabilities of the hypervisor as

    opposed to the number of physical hardware devices installed in the system. Within

    Sun SPARC Enterprise T5120 and T5220 servers, up to 64 logical domains can be

    established¹. Each logical domain is a full virtual machine that runs an independent

    operating system instance or runtime environment and contains virtualized CPU,

    memory, storage, console, and cryptographic devices.

    All logical domain instances rely upon the same fundamental technology constructs

    just described. However, several different roles exist for logical domains. Based on

    context and use, a single logical domain may function in one or more of the following

    roles.

    Control domain: executes Logical Domains Manager software to govern logical

    domain creation and assignment of physical resources

    Service domain: interfaces with the hypervisor on behalf of a guest domain to

    manage access to hardware resources such as CPU, memory, network, disk, console,

    and cryptographic units

    I/O domain: controls direct physical access to input/output devices, such as PCI

    Express cards, storage units, and network devices

    Guest domain: utilizes virtual devices offered by service and I/O domains and

    operates under the management of the control domain
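
    Because a single logical domain may hold more than one of these roles, the roles can be thought of as combinable flags rather than exclusive types. The following is a minimal sketch of that idea only. The names DomainRole, primary, guest, and owns_physical_io are hypothetical and are not part of the Logical Domains Manager software.

```python
from enum import Flag, auto


class DomainRole(Flag):
    """Illustrative role flags; a domain may combine several of them."""
    CONTROL = auto()
    SERVICE = auto()
    IO = auto()
    GUEST = auto()


# Hypothetical configuration: one domain acting as control, service, and
# I/O domain at the same time, plus one plain guest domain.
primary = DomainRole.CONTROL | DomainRole.SERVICE | DomainRole.IO
guest = DomainRole.GUEST


def owns_physical_io(roles: DomainRole) -> bool:
    """True when the domain holds the I/O role (direct device access)."""
    return DomainRole.IO in roles
```

    In this model a guest domain never reports physical I/O ownership, which mirrors the description above: guests consume virtual devices offered by service and I/O domains.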

    Figure 3-1. The UltraSPARC T2 hypervisor provides isolation of virtual machine operating system

    instances and allocated hardware resources through Sun Logical Domains technology.

    Logical Domain Isolation

    Organizations often utilize virtualization technologies to host multiple business-critical

    applications on the same server. As a result, enterprises need assurance that software

    faults or maintenance events in one virtual machine remain isolated and unable to

    impact the availability of IT services in other partitions.

    1. Though establishing 64 logical domains is possible, this is not a recommended practice.

    [Figure 3-1 diagram: four logical domains (LDom A through LDom D), each with its own operating system, run above the hypervisor, which partitions the platform hardware (CPU, memory, and I/O) among them.]

    Unfortunately, many virtualization solutions require a full system reset in order to

    reboot a control or service domain. In contrast, Sun Logical Domains technology

    supports reboot and reset of any domain, independent of all other domains, even

    domains with direct control over physical hardware resources. Guest logical domains

    can be configured, started, and stopped independently, without power-cycling the

    machine and without impact to continuous operation of other logical domains. In

    addition, virtual I/O interfaces can connect and disconnect as necessary without

    impacting other logical domains on the same platform. Administrators can even

    dynamically add and remove virtual CPUs on a logical domain while the operating

    system instance continues to execute, helping avoid the need for planned downtime.

    Robust Virtualized I/O

    Continuous communication with I/O devices proves critical to delivery of many IT

    services. Within Sun Logical Domains technology, an I/O domain is a logical domain

    that physically connects to I/O devices. I/O domains also generally act as service

    domains, sharing I/O access with other domains in the form of virtual devices. Sun

    Logical Domains technology provides a number of features and architectural elements

    to mitigate the impact of I/O and service domain fault conditions, helping provide

    uninterrupted access to I/O devices.

    Masking I/O Fault Events

    The capabilities of Sun Logical Domains technology mitigate the impacts caused by

    temporary loss of I/O connectivity. In response to a fault or maintenance operation

    on an I/O device, Sun Logical Domains technology interfaces with the Solaris OS to

    gracefully suspend I/O operations to all impacted virtual I/O devices. After recovery

    of the physical I/O device, virtual I/O devices can reconnect and resume I/O

    transactions at the point of suspension.
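
    The suspend-and-resume behavior described above can be pictured as a queue that holds requests while the physical device is unavailable and replays them in order after recovery. The sketch below is a simplified model under that assumption. The class VirtualIODevice and its methods are hypothetical and do not represent the actual virtual device protocol.

```python
from collections import deque


class VirtualIODevice:
    """Toy model: requests are held, not failed, while the backend is down."""

    def __init__(self):
        self.backend_available = True
        self.pending = deque()   # requests waiting for the backend to return
        self.completed = []      # requests serviced, in order

    def submit(self, request):
        if self.backend_available:
            self.completed.append(request)
        else:
            self.pending.append(request)   # suspended, not rejected

    def suspend(self):
        """Called when the physical device faults or enters maintenance."""
        self.backend_available = False

    def resume(self):
        """Called after recovery; replay held requests in arrival order."""
        self.backend_available = True
        while self.pending:
            self.completed.append(self.pending.popleft())
```

    From the point of view of the caller, a request submitted during the outage simply completes later, which is the masking effect the text describes.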

    Redundant I/O Paths

    While I/O suspension provides a means to mask faults, many applications simply

    cannot afford any period of time without I/O service. Employing Solaris I/O multipathing

    software can help ensure continuous access to disk and network services despite loss

    of a single I/O path. For example, an administrator can configure two service

    domains, each providing a unique path to a physical disk subsystem, somewhat

    analogous to employing two host bus adapters (HBAs) in a traditional I/O

    connectivity model. Once initialized within a guest domain, Solaris I/O multipathing

    software provides a mechanism for automatic fail-over between the two paths in the

    event of a failure. In addition, the Solaris I/O multipathing software includes a

    manual path switch-over capability, letting administrators redirect I/O to an

    alternate path during reconfiguration or reboot of the primary I/O domain.
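
    The two behaviors described here, automatic fail-over and manual switch-over, can be illustrated with a small model of a device reachable over two paths. This is a sketch under stated assumptions, not the Solaris I/O multipathing implementation. MultipathDevice, PathFailedError, and make_path are hypothetical names, and each path is represented by a plain Python callable.

```python
class PathFailedError(Exception):
    """Raised by a path that cannot currently deliver I/O."""


def make_path(name, state):
    """Build a fake path; state['up'] controls whether it works."""
    def send(request):
        if not state["up"]:
            raise PathFailedError(name)
        return f"{request} via {name}"
    return send


class MultipathDevice:
    def __init__(self, paths):
        self.paths = dict(paths)              # path name -> callable
        self.active = next(iter(self.paths))  # first path starts active

    def switch_to(self, name):
        """Manual switch-over, e.g. before rebooting an I/O domain."""
        if name not in self.paths:
            raise KeyError(name)
        self.active = name

    def submit(self, request):
        """Try the active path first, then fail over to the others."""
        order = [self.active] + [n for n in self.paths if n != self.active]
        for name in order:
            try:
                result = self.paths[name](request)
            except PathFailedError:
                continue
            self.active = name   # remember the path that worked
            return result
        raise PathFailedError("all paths failed")
```

    The design choice worth noting is that the caller issues the same request regardless of which path is healthy. Only when every path fails does an error surface.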

    Clustering I/O Domains

    Administrators can further increase the availability of virtualized I/O services by

    adopting clustering technology. Sun Logical Domains technology is certified for use

    with Solaris Cluster software and Veritas Cluster Server by Symantec to provide for

    failover of I/O domains. Clustering I/O domains provides for automated failover of

    I/O device services, minimizing the time to recover I/O connectivity and ultimately

    improving overall IT service availability.

    Speeding Boot Times

    After a fault or maintenance event occurs, reducing the time required to bring systems

    back on-line is a key step toward maximizing IT service availability. Hosting an IT service

    in a logical domain can actually speed recovery time. In the event of a planned or

    unplanned reboot of a logical domain, Sun Logical Domains technology allows an

    operating system instance to bypass time-consuming I/O bring-up procedures. The

    majority of logical domains utilize virtual I/O devices, delegating the burden of I/O bus

    ownership, probing buses for devices, and loading device drivers to an I/O domain.

    Logical domains that utilize virtual I/O devices contain no I/O bus topology to probe,

    and no physical connection to I/O devices, speeding recovery by eliminating the need

    for time-consuming I/O initialization steps during boot of an operating system.

    Solaris Containers Technology

    Whether used in a single system image of the Solaris OS or within a logical domain,

    Solaris Containers technology can further isolate software applications and services

    using flexible, software-defined boundaries. A breakthrough approach to virtualization

    and software partitioning, Solaris Containers technology allows creation of many

    private execution environments within a single instance of the Solaris OS (Figure 3-2).

    Figure 3-2. Solaris Containers use flexible software mechanisms to isolate applications.

    [Figure 3-2 diagram: three containers, each with its own users, applications, and allocated resources, run within a single Solaris 10 Operating System instance on a server with Sun SPARC, AMD, or Intel processors.]

    Solaris Containers provide a complete, isolated, secure runtime environment for

    applications and allow for granular management of system resources. Each Solaris

    Container can be managed independently with regard to users, device paths, CPU and

    memory resources, and networking. Dynamic resource reallocation capabilities let

    unused system resources shift among containers as needed, providing high quality of

    service to hosted applications. Resources can also be allocated or reserved for critical

    services to reduce contention for compute power with lower priority workloads. The

    fine-grained resource management provided by Solaris Containers encourages efficient

    use of computing power while maintaining high levels of service availability.

    Applications within containers are isolated, preventing activities in one container from

    monitoring or affecting processes running in another container. Even a superuser

    process cannot view or affect activity in other containers. Software fault and security

    isolation features of Solaris Containers also prohibit poorly behaved applications from

    impacting other containers. By utilizing Solaris Containers and Sun Logical Domains

    technology individually or in combination, organizations can better define and meet

    service levels by dynamically controlling application and resource priorities.

    Chapter 4

    Sun's Comprehensive Focus On IT Service Uptime

    Operating system and management tool choices heavily influence IT service uptime. As

    a part of a dedicated effort toward helping organizations build highly dependable

    solutions, Sun works to integrate reliability, availability, and serviceability at all levels

    of the platform. In fact, the Solaris OS includes a comprehensive fault management

    architecture known as Predictive Self Healing that governs error handling throughout

    the hardware, virtualization, and operating system layers of Sun SPARC Enterprise

    T5120 and T5220 servers. By utilizing the Solaris OS and Sun management tools in

    conjunction with Sun SPARC Enterprise T5120 and T5220 servers, organizations can

    further enhance the availability of hosted IT services.

    Speeding Diagnosis with a Comprehensive Fault Management Architecture

    When faults occur or maintenance is required, the amount of time required to return

    an IT service to an operational state becomes critical. Extending the fault management

    framework first introduced in the Solaris OS, Sun Predictive Self Healing technology

    governs error events across all technology layers within Sun SPARC Enterprise T5120

    and T5220 server platforms, including individual hardware components and Sun Logical

    Domains technology. By proactively diagnosing, isolating, and recovering from both

    hardware and software failures, Sun Predictive Self Healing technology provides

    meaningful information about faults, speeds resumption of IT services, and even helps

    prevent certain system failures.

    Solaris Predictive Self Healing Software

    Solaris Predictive Self Healing software proactively monitors and manages system

    components to help organizations achieve maximum availability of IT services.

    Predictive Self Healing is an innovative capability in the Solaris 10 OS that

    automatically diagnoses, isolates, and recovers from many hardware and application

    faults. As a result, business-critical applications and essential system services can

    continue uninterrupted in the event of software failures, major hardware component

    failures, and even software configuration problems.

    Solaris Fault Manager and Solaris Service Manager are the two main components of

    Predictive Self Healing. Solaris Fault Manager receives data relating to hardware and

    software errors and automatically diagnoses the underlying problem. Once diagnosed,

    Solaris Fault Manager automatically responds by offlining faulty components and

    signaling the error condition to administrators (via console messages and external

    system and component status indicators on the front and rear of the chassis). Solaris

    Service Manager makes services rather than processes into first-class citizens,

    permitting automatic self-healing. Service descriptions for base services of the Solaris

    OS include full dependency information for start, stop, and restart. Configuring user

    applications to run under Solaris Service Manager is relatively simple, helping

    organizations to effectively manage faults for all IT services hosted on Sun SPARC

    Enterprise T5120 and T5220 servers.
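
    To illustrate why full dependency information matters for start, stop, and restart, the sketch below orders a small set of services so that each starts only after the services it depends on, and works out which services are affected when one fails. It is a conceptual model only. The service names and the functions start_order and restart_set are hypothetical and do not reflect actual Solaris service descriptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical services: each maps to the set of services it depends on.
DEPENDENCIES = {
    "network": set(),
    "filesystem": set(),
    "database": {"network", "filesystem"},
    "webapp": {"database"},
}


def start_order(deps):
    """Return an order in which every service follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


def restart_set(deps, failed):
    """Return the failed service plus everything that depends on it,
    directly or indirectly, listed in a valid start order."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for service, needs in deps.items():
            if service not in affected and needs & affected:
                affected.add(service)
                changed = True
    return [s for s in start_order(deps) if s in affected]
```

    With this information available, a failure of the hypothetical database service leads to a restart of the database and the web application only, leaving the network and file system services untouched.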

    Predictive Self Healing for Sun SPARC Enterprise T5120 and T5220

    Servers

    Sun SPARC Enterprise T5120 and T5220 servers leverage a number of Predictive Self

    Healing response agents already in the Solaris OS, including syslog, CPU offline,

    memory page retire, and I/O retire agents. The ILOM software on the service processor

    provides an additional set of response agents. These agents include a dynamic field

    replaceable unit ID (FRUID) agent that updates the faulted FRU with the error event,

    and an LED agent that lights the appropriate system or component (FRU) status

    indicator for specific faults.

    Fault diagnosis and recovery status is synchronized between the logical domain and

    service processor. Fault events on Sun SPARC Enterprise T5120 and T5220 servers

    include the FRU part and serial number of the faulted components. In addition, error

    and fault persistence, common messaging, and knowledge articles specific to the Sun

    SPARC Enterprise T5120 and T5220 servers can be accessed at

    sun.com/msg.

    The following capabilities provide a few examples of the granularity and power of the

    Sun Predictive Self Healing implementation on Sun SPARC Enterprise T5120 and T5220 servers.

    Within the UltraSPARC T2 processor, the L1 and L2 caches, per-thread registers,

    integrated dual 10 Gigabit Ethernet interfaces and cryptographic units include error

    reporting mechanisms that can be diagnosed by Sun Predictive Self Healing

    technology.

    When a correctable memory error is encountered by a Power On Self Test (POST) on

    Sun SPARC Enterprise T5120 and T5220 servers, the error is queued up for Predictive

    Self Healing diagnosis. Memory pages are automatically retired as necessary.

    Faults within the Sun SPARC Enterprise T5120 and T5220 servers governed by Sun

    Predictive Self Healing include direct I/O devices with hardened and non-hardened

    drivers, as well as devices that are virtualized by the hypervisor.

    The PCI Express switches and SAS/SATA controllers utilize hardened device drivers

    that generate reports whenever an error is detected by the error handlers. Errors are

    forwarded for complete analysis and diagnosis.

    Predictive Self Healing for Logical Domains

    In order to execute proper error handling within a virtualized environment, messages

    and alerts must extend to virtual machine instances. In addition to close integration

    with Sun SPARC Enterprise T5120 and T5220 server hardware, Sun Predictive Self

    Healing technology governs fault handling within logical domains. Taking a

    comprehensive system view, error conditions for devices abstracted by the hypervisor

    (processors, memory, console, and cryptographic devices) trap to the hypervisor,

    generating a standard error report that is sent to the service processor via the

    host-to-service-processor mailbox channel. A diagnosis engine provides complete analysis of

    each error report, and the Logical Domains Manager distributes alerts to each

    potentially affected OS instance. A Sun Predictive Self Healing technology knowledge

    base is also available. Administrators can use the knowledge base to correlate error

    messages related to Sun Logical Domains technology execution to documented pre-

    defined corrective actions, speeding time to system recovery.
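
    The final step of this flow, distributing an alert to each potentially affected OS instance, amounts to looking up which logical domains have been assigned the faulted resource. The sketch below shows only that lookup. The function distribute_alerts, the report layout, and the resource and domain names are hypothetical and are not the actual error report format.

```python
def distribute_alerts(error_report, assignments):
    """Return one alert per logical domain that uses the faulted resource.

    error_report: dict with a 'resource' key naming the faulted device.
    assignments:  dict mapping domain name -> set of assigned resources.
    """
    resource = error_report["resource"]
    return {
        domain: f"fault reported on {resource}"
        for domain, resources in assignments.items()
        if resource in resources
    }


# Hypothetical assignment of virtualized resources to domains.
ASSIGNMENTS = {
    "ldom-a": {"cpu0", "cpu1", "crypto0"},
    "ldom-b": {"cpu2", "crypto0"},
    "ldom-c": {"cpu3"},
}
```

    Domains that do not use the faulted resource receive no alert, so administrators of unaffected OS instances are not distracted by faults that cannot reach them.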

    Reducing Downtime with the Solaris Operating System

    Over two decades of investment contribute to making the Solaris OS one of the most

    reliable operating systems in the industry. Key computing elements (operating

    system, networking, and user environment) combine within the Solaris OS to provide

    a stable, high-quality foundation for execution of IT services. In fact, many

    organizations can point to systems running the Solaris OS that execute continuously for

    months or years without need for a restart.

    The Solaris OS is designed for availability. Built with a small, compact kernel, the

    Solaris OS limits the potential for operating system faults and subsequent platform

    downtime. In addition, the Solaris OS establishes a clear distinction between the

    kernel, shared libraries, and applications in order to limit the impact of application

    failures. Furthermore, the ability to install most patches and other incremental

    software updates for the Solaris OS without taking the system offline helps

    organizations increase uptime and ease serviceability. Ease-of-use features including

    Web-based installation and a graphical process manager can also help boost

    availability, by reducing the risk of operator error and by minimizing service times.

    Described in the sections that follow, a number of key features of the Solaris OS can

    help organizations develop, deploy, and manage IT services with extreme reliability,

    relentless availability, and simplified serviceability.

    Solaris Memory Page Retirement

    In many systems, addressing either correctable or uncorrectable permanent memory

    errors requires server downtime. As a part of the Solaris Predictive Self Healing

    technology framework, the Solaris OS memory page retirement (MPR) capability works

    to isolate memory issues without system interruption. Diagnosis software in the

    Predictive Self Healing technology Fault Manager examines memory correctable errors

    and uncorrectable errors detected by underlying hardware on a continual basis. MPR

    retires memory pages containing correctable errors and relocatable clean pages

    containing uncorrectable errors without interrupting user applications. In addition,

    MPR can also isolate relocatable dirty pages containing uncorrectable errors with

    limited impact on affected user processes and avoids forcing an outage of an entire

    system. By utilizing MPR on Sun servers, system interruption rates can be reduced by as

    much as 35 to 40 percent¹.
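
    The cases described in this section can be summarized as a small decision table. The sketch below encodes only what the text states. The names PageError, Action, and mpr_action are hypothetical, and the case of a non-relocatable page with an uncorrectable error is deliberately left open because the text does not describe it.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETIRE_TRANSPARENTLY = "retire page without interrupting applications"
    ISOLATE_LIMITED_IMPACT = "isolate page; only affected processes are impacted"
    NOT_DESCRIBED = "case not described in the text"


@dataclass(frozen=True)
class PageError:
    correctable: bool   # error type reported by the hardware
    relocatable: bool   # whether the page contents can be moved
    dirty: bool         # whether the page holds unsaved modifications


def mpr_action(err: PageError) -> Action:
    """Map an error on a memory page to the response described above."""
    if err.correctable:
        return Action.RETIRE_TRANSPARENTLY
    if err.relocatable and not err.dirty:
        return Action.RETIRE_TRANSPARENTLY
    if err.relocatable and err.dirty:
        return Action.ISOLATE_LIMITED_IMPACT
    return Action.NOT_DESCRIBED
```

    In every case covered by the text, the response stops short of a full system outage, which is the source of the reduced interruption rate.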

    Solaris File Systems

    Reliable data subsystems are critical to creating highly-available IT services.

    Organizations continue to depend on the UNIX File System (UFS) within the Solaris 10

    OS to provide high-resiliency features, such as metadata logging to protect against data

    corruption and to speed recovery in the event of system failure. In addition, the Solaris

    10 OS now also features the Solaris ZFS, a file system that offers a dramatic advance in

    data management with an innovative approach to data integrity.

    Solaris ZFS provides increased protection against administrative error and delivers end-

    to-end data integrity elements, such as 64-bit checksumming and comprehensive data

    updates. To ensure that the data on disk is self-consistent at all times, Solaris ZFS

    combines proven and cutting edge technologies, such as copy-on-write and end-to-end

    checksumming. Data is always written to a new block on disk before changing the

    pointers to the data and committing the write. Since the file system is always

    consistent, time-consuming recovery procedures like fsck are not required if the

    system is shut down in an unclean manner. Copy-on-write also enables administrators

    to take consistent backups or roll data back to a known point in time.

    The Solaris 10 OS with Solaris ZFS is the only known OS designed to provide end-to-end

    checksumming for all data. Solaris ZFS constantly reads and checks data to help ensure

    integrity, and if an error exists in a mirrored pool, the technology can automatically

    repair the corrupt data. This relentless vigilance on behalf of availability protects

    against costly and time-consuming data loss, even previously undetectable silent

    data corruption.
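
    Two of the ideas above, copy-on-write updates and checksummed reads that repair a bad mirror copy, can be demonstrated with a toy in-memory store. This is a conceptual sketch only and does not reflect the Solaris ZFS on-disk format. The class MirroredStore is hypothetical, and SHA-256 from the Python standard library stands in for whatever checksum the file system actually uses.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class MirroredStore:
    """Toy two-way mirror with copy-on-write updates and verified reads."""

    def __init__(self):
        self.sides = [{}, {}]   # one dict per mirror side: block id -> bytes
        self.pointers = {}      # name -> (block id, expected checksum)
        self.next_block = 0

    def write(self, name: str, data: bytes) -> None:
        block = self.next_block
        self.next_block += 1
        for side in self.sides:
            side[block] = data  # step 1: write to a new block, never in place
        # Step 2: commit by switching the pointer; the old block is untouched.
        self.pointers[name] = (block, checksum(data))

    def read(self, name: str) -> bytes:
        block, expected = self.pointers[name]
        for side in self.sides:
            data = side[block]
            if checksum(data) == expected:
                # Repair any mirror copy that fails verification.
                for other in self.sides:
                    if checksum(other[block]) != expected:
                        other[block] = data
                return data
        raise IOError("all copies failed checksum verification")
```

    Because the checksum is stored with the pointer rather than with the data block, a block that was silently corrupted cannot vouch for itself. Because old blocks are never overwritten, the previous version remains intact until the pointer is switched.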

    Solaris Flash and Solaris Live Upgrade

    In many cases, planned downtime accounts for the bulk of system interruptions each

    year. Solaris Flash and Solaris Live Upgrade help organizations decrease requirements

    for planned downtime by providing efficient installation and upgrade operations. By

    employing Solaris Flash and Solaris Live Upgrade, deployment or upgrades of Sun

    SPARC Enterprise T5120 and T5220 servers can complete in minutes.

    The Solaris Flash facility helps IT organizations quickly install and update systems

    with an operating system configuration tailored to enterprise needs. The technology

    in Solaris Flash software provides tools to system administrators for building custom

    rapid-install images, including applications, patches, and parameters, that can

    be installed at a data rate close to the full speed of the hardware.

    1. Assessment of the Effect of Memory Page Retirement on System RAS Against Hardware Faults,

    http://blogs.sun.com/mws/resource/ieee.pdf

    Solaris Live Upgrade software provides mechanisms to upgrade and manage multiple

    on-disk instances of the Solaris OS. This technology facilitates installation of a new

    operating system or patches on a running production system without taking a server

    offline. Downtime is only required to reboot the new configuration. A specific feature

    of Solaris Live Upgrade software provides administrators with the ability to quickly

    roll systems back to the initial state if desired.

    Sun Software for Efficient System Management

    Technologies that automate management procedures help prevent faults and provide

    efficiencies that help system administrators manage a greater number of servers. In

    addition to the ILOM system controller, organizations can take advantage of powerful

    Sun Management Center software and Sun N1 System Manager software. These

    sophisticated tools automate monitoring and administrative functions, lowering the

    administrative burden and reducing the opportunity for common errors. Sun

    Management Center software simplifies administrative tasks and Sun N1 System

    Manager software automates complex software installation and configuration, helping

    to ease operations.

    Sun Management Center Software

    Streamlining system management procedures simplifies operations and can result in

    higher availability levels. Sun Management Center software improves management

    efficiency by providing an aggregate view of the entire network of Sun components,

    from the heart of the datacenter to remote locations at the edge of the network.

    Within Sun Management Center software, a single interface provides IT administrators

    with the ability to proactively manage and monitor remote Sun systems, storage

    components, the Solaris OS, and applications. Remote access services enable system

    administrators, local or remote, to gain protected access through administrative

    networks. From this interface, technicians can monitor system health, perform remote

    bring-up, and restart or take down individual machines.

    Consolidating management views even further, integration of Sun Management Center

    software with Solaris Cluster also allows visibility and control over cluster resources. In

    addition, to support legacy networks and heterogeneous environments, Sun

    Management Center software tightly integrates with all major management

    frameworks, including CA Unicenter TNG, HP OpenView, IBM Tivoli, and BMC Patrol.

    Sun Management Center software is based on a three-tiered design with an agent-

    based framework that provides a single point of management for the enterprise

    (Figure 4-1). The architecture of Sun Management Center software simplifies systems

    management by delivering a higher level of scalability and availability while reducing

    cost and complexity. A major strength of Sun Management Center software is that

    these autonomous agents continue to operate even when contact with the central

    software server is lost. After loss of the central server, remote agents continue to

    collect data and take action if an event is triggered. The agents are reconnected when

    the Sun Management Center server is restarted, perhaps at a backup site. The central

    server collects data on all events that occurred during the downtime, allowing

    administrators to retain historical perspectives.
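
    The behavior described above (agents that keep collecting data and acting on events while the central server is unreachable, then hand over what they recorded) can be sketched as a small store-and-forward model. The class Agent, its threshold rule, and the use of a plain list as the server are hypothetical simplifications and are not the Sun Management Center agent interface.

```python
class Agent:
    """Toy autonomous agent: acts locally, forwards or buffers events."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.server = None        # a list standing in for the central server
        self.backlog = []         # events recorded while disconnected
        self.local_actions = []   # actions taken without server involvement

    def observe(self, metric, value):
        event = (metric, value)
        if value > self.threshold:
            # Local action does not depend on the server being reachable.
            self.local_actions.append(f"alarm:{metric}")
        if self.server is not None:
            self.server.append(event)
        else:
            self.backlog.append(event)

    def connect(self, server):
        """Attach to a server (possibly a backup) and hand over the backlog."""
        self.server = server
        server.extend(self.backlog)
        self.backlog.clear()

    def disconnect(self):
        self.server = None
```

    Events observed during the outage reach whichever server the agent reconnects to, which is how the central server can rebuild a complete history.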

    Figure 4-1. Sun Management Center software three-tier architecture

    Sun N1 System Manager Software

    Providing high availability levels in a rapidly changing IT infrastructure requires the

    ability to provision and reprovision servers efficiently while minimizing the opportunity

    for process errors. Sun N1 System Manager software automates complex software

    installation and configuration across heterogeneous network infrastructures, helping

    to speed and simplify IT service deployments. Sun N1 System Manager software also

    provides a comprehensive solution to simplify system infrastructure life-cycle

    management. By utilizing Sun N1 System Manager software, administrators can

    discover, provision, monitor, update, and manage hundreds of Sun SPARC and x64-based

    servers from a single console, with an innovative and user-friendly hybrid

    graphical and command line interface (CLI). Sun N1 System Manager software also

    provides greater flexibility in allocating and reallocating resources as enterprise

    requirements shift.

    [Figure 4-1 diagram: a console layer (Java Console, Web Console, and command line interface), a server layer (management server with Java technology-based management applications, IPA-based services, Java client API, and trap, topology, and security services), and an agent layer (Sun Management Center agents on Sun systems). Protocol labels shown are RMI, HTTPS, IPv6, and SNMPv3.]

    Designed to simplify datacenter management tasks, Sun N1 System Manager software

    includes features to facilitate remote power control, operating system deployment and

    patching, system BIOS and firmware updates, event logging and notification, and

    hardware and operating system monitoring. In addition, the software features the

    ability to create logical groups of systems and perform actions across these groupings

    as easily as if performing actions on a single node. By providing fast and easy access to

    systems for monitoring and maintenance, Sun N1 System Manager software increases

    operational efficiency and minimizes downtime related to process errors.

    Chapter 5

    Measuring Server Reliability, Availability, and

    Serviceability

    Taking a holistic view of system operation, Sun studies the interrelation between

    reliability, availability, and serviceability and works to optimize all three factors at the

    same time. By utilizing repeatable metrics during the design of new systems, Sun can

    achieve continuous improvements over current and previous generations of Sun servers

    and competitive systems. Given the importance of system availability, Sun continues to

    invest and work actively to help establish industry-wide standards for measurement.

    Sun Availability Benchmark Framework

    Benchmarks often guide research and development efforts to enhance computer

    systems performance. As the reliability, availability, and serviceability features of

    systems increase in importance, there is a natural desire on the part of engineers to

    benchmark availability levels. Unfortunately, industry-accepted standards are still a

    work in progress in this important area. With a goal of adopting a methodical approach

    to improving system dependability and in view of no existing industry standard for

    measurement, Sun created an availability benchmark framework called R-Cubed¹. The

    Sun R-Cubed availability benchmark framework is applicable to a wide range of systems

    and information about the methodology is published openly for reuse by the industry,

    academia, and the technical community at large.

    In addition to traditional measures of server outage minutes per year, Sun's R-Cubed

    availability benchmark framework reaches further to focus on overall dependability:

    the ability of a system to remain available and prevent service interruptions. As such,

    the R-Cubed framework accounts for system reliability, availability, and serviceability

    levels by quantifying three key attributes.

    Fault and maintenance rate: measures the frequency of system faults in a given

    period of time

    Robustness: accounts for the extent to which system operation degrades due to a

    fault, as well as the potential for completing repairs online

    Recovery: examines system characteristics to quantify the effort required to return

    to an operational state after a fault or maintenance event

    While optimizing the three elements of the R-cubed framework results in high levels of

    system dependability, improving all three R-cubed factors at the same time takes a

    careful approach. For example, designing a server with a low component count

    minimizes the fault and maintenance rate, increasing reliability. However, a system

    with a low component count in all probability lacks the robustness provided by

    redundant parts. Without redundancy, single component errors or faults can cause

    system outages, essentially placing reliability at odds with increased availability.

    Utilizing the R-cubed framework helps Sun take a balanced approach to maximizing

    overall system reliability, availability, and serviceability.

    1. R-Cubed (R³): Rate, Robustness, and Recovery - An Availability Benchmark Framework,

    research.sun.com/techrep/2002/abstract-109.html

    R-Cubed Metrics

    The R-cubed Availability framework incorporates the following measures to help system

    designers quantify the dependability of a system design and evaluate potential

    improvements.

    • Fault Robustness Benchmark-A (FRB-A): rewards systems where faults do not cause
      disruption of service. Measurement is indicated by a numeric scalar between 1 and
      100, where 1 means any single failure causes disruption and 100 means no single
      failure causes a disruption. A system scores higher on FRB-A by optimizing cost and
      redundancy trade-offs: less reliable parts are made redundant, and more reliable
      parts are utilized where economical.

    • Maintenance Robustness Benchmark-A (MRB-A): quantifies the ability to perform
      maintenance without system disruption by utilizing a numeric scalar. A score of 1
      means all maintenance actions result in a system outage while a score of 100
      indicates all field replaceable units can be replaced without downtime. MRB-A scores
      are higher for systems with hot-swap capabilities. In general, small form factor
      servers do not score as well as larger servers that are designed for online hardware
      servicing.

    • Mean Time Between Services (MTBS): utilizes mean time between failure (MTBF)
      rates, isolating calculations to only include field replaceable unit (FRU) failures that
      incur a service exception.

    • Mean Time Between System Interrupts (MTBSI): quantifies component failures which
      lead to system downtime and scheduled shutdowns for service actions. MTBSI
      considers both scheduled and unscheduled system interruptions. In some cases, an
      unscheduled interruption results in a degraded mode of system operation and
      requires a scheduled interruption to replace failed components.

    • Unscheduled Mean Time Between System Interrupts (U_MTBSI): measures the rate of
      system interrupts that are caused by component failures. Unscheduled interruptions
      pose the most significant impact to predictable service delivery.

    • Availability: provides a traditional measure of the percentage of uptime achieved by a
      system per year, assuming utilization of SunSpectrum(SM) Gold service plan
      maintenance response times, Sun approved software installation methods, and
      system management best practices.
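
    The MTBS description above can be illustrated with a small sketch. This paper does not
    publish the exact formula, so the code below assumes the common simplification of
    constant, independent failure rates, under which per-FRU rates (1 / MTBF) simply add.
    The FRU names, MTBF figures, and the `incurs_service` flag are hypothetical and are not
    Sun data.

```python
from dataclasses import dataclass


@dataclass
class Fru:
    """A field replaceable unit; every value used below is hypothetical."""
    name: str
    mtbf_hours: float
    incurs_service: bool  # True if a failure of this FRU triggers a service call


def mean_time_between_services(frus: list[Fru]) -> float:
    """Combine FRU MTBFs into a single service interval.

    Assumes constant, independent failure rates, so the rates (1 / MTBF)
    add. Only FRUs whose failure incurs a service action are counted.
    """
    total_rate = sum(1.0 / f.mtbf_hours for f in frus if f.incurs_service)
    if total_rate == 0.0:
        return float("inf")  # nothing in this list ever requires service
    return 1.0 / total_rate


example = [
    Fru("power supply", 400_000, True),
    Fru("fan module", 400_000, True),
    Fru("disk drive", 200_000, True),
    Fru("filler panel", 1_000_000, False),  # excluded: never serviced
]
print(f"MTBS: {mean_time_between_services(example):,.0f} hours")
```

    Under this model, adding serviceable parts always shortens the interval between
    service events, which is one way to see why a low component count reduces the fault
    and maintenance rate.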

    As evidenced by the data in Table 5-1 and the bar chart in Figure 5-1, the Sun SPARC

    Enterprise T5120 and T5220 servers make great strides toward maximizing

    dependability when compared to earlier Sun systems such as the Sun Fire V490 server.

    Table 5-1. Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 servers score significantly
    higher on R-cubed metrics than the Sun Fire V490 server, released only a few years ago.

    System                               FRB-A   MRB-A   U_MTBSI   MTBSI     MTBS      Availability
                                                         (hours)   (hours)   (hours)
    Sun SPARC Enterprise T5120 server    81.2    56.98   409,628   212,737   87,050    0.99999
    Sun SPARC Enterprise T5220 server    80.06   54.52   407,056   212,041   92,282    0.99999
    Sun Fire V490 server                 58.7    32.77   161,804   83,539    77,905    0.99998

    FRB-A: Fault Robustness Benchmark-A. MRB-A: Maintenance Robustness Benchmark-A.

    Figure 5-1. Graphical representation of R-cubed benchmark results for the Sun SPARC Enterprise T5120,
    T5220, and Sun Fire V490 servers.

    [Figure: two bar charts comparing the three servers. The first plots U_MTBSI, MTBSI, and MTBS
    in hours on an axis from 0 to 500,000. The second plots Fault Robustness Benchmark-A and
    Maintenance Robustness Benchmark-A scores on a scale of 1 to 100.]
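
    To put the availability and MTBSI figures from Table 5-1 in more familiar terms, the
    following sketch converts a fractional availability into expected downtime per year and a
    mean interval in hours into an expected yearly event count. It assumes an 8,760-hour year
    and steady long-run rates; the function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year for a fractional availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60


def events_per_year(mean_hours_between_events: float) -> float:
    """Expected yearly event count for a mean interval given in hours."""
    return HOURS_PER_YEAR / mean_hours_between_events


# (availability, MTBSI in hours) as listed in Table 5-1.
servers = {
    "T5120": (0.99999, 212_737),
    "T5220": (0.99999, 212_041),
    "V490": (0.99998, 83_539),
}

for name, (availability, mtbsi_hours) in servers.items():
    print(
        f"{name}: {downtime_minutes_per_year(availability):.1f} min/yr down, "
        f"{events_per_year(mtbsi_hours):.3f} system interrupts/yr"
    )
```

    An availability of 0.99999 corresponds to roughly five minutes of downtime per year,
    while 0.99998 corresponds to roughly ten and a half.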

    Assessing RAS Levels of Scale-Out Architectures

    Network-centric applications often utilize a scale-out approach to achieve desired

    throughput, replicating a single application across a number of rackmount servers. In

    scale-out designs, loss of a single server generally incurs little impact on desired

    application performance or availability. Rather, uptime and effective performance of

    the IT service often hinges on the continuous operation of a subset of the total servers

    deployed. For example, in a deployment of nine servers, only eight servers may be

    required to meet performance goals or contracted service levels. At the same time, loss

    of more than a certain number of servers can impact throughput, degrade

    responsiveness, and even result in denial of service to clients.

    In order to measure the dependability of a scale-out architecture, and to assess its

    ability to meet performance targets, Sun utilizes a metric called performability. After

    calculating the number of required units for each type of server to meet a baseline

    performance level, performability can be measured. The performability calculation

    quantifies the probability that a certain set of servers can be expected to be available to

    deliver baseline performance. The performability data in Table 5-2 for Sun SPARC

    Enterprise T5120, Sun SPARC Enterprise T5220, and Sun Fire V490 servers illustrates

    Sun's commitment to continuously increase availability along with performance. Sun

    SPARC Enterprise T5120 and T5220 servers meet baseline performance with fewer

    systems, lowering the number of service intervals, reducing costs, simplifying

    administration, and minimizing energy consumption, heat dissipation, and space

    requirements.
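
    The idea behind performability can be sketched with a simple k-of-n model: the
    probability that at least k of n servers are up at a given moment. This is an assumption
    made for illustration, not Sun's published calculation. It treats servers as independent
    with identical availability and ignores repair logistics, so it is not expected to
    reproduce the values in Table 5-2. The per-server availability of 0.99 is hypothetical.

```python
from math import comb


def k_of_n_availability(n: int, k: int, availability: float) -> float:
    """Probability that at least k of n independent servers are available.

    Each server is assumed to be up with the same probability, so the
    number of servers that are up follows a binomial distribution.
    """
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    down = 1.0 - availability
    return sum(
        comb(n, up) * availability**up * down ** (n - up)
        for up in range(k, n + 1)
    )


# The nine-server deployment from the text, where eight servers are
# enough to meet performance goals (hypothetical availability of 0.99).
print(f"{k_of_n_availability(9, 8, 0.99):.6f}")

# Without the spare, all eight servers must be up at once.
print(f"{k_of_n_availability(8, 8, 0.99):.6f}")
```

    Comparing the two results shows how a single spare server raises the chance of meeting
    the baseline, which is the effect the N+1 configurations are designed to capture.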

    Table 5-2. Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 servers improve upon the

    performability levels of the Sun Fire V490 server.

    System                               Units   Sockets/Unit                                Performability   Yearly Services
    Sun SPARC Enterprise T5220 server    6+1     One eight-core UltraSPARC T2 processor      0.9999988        0.664
    Sun SPARC Enterprise T5120 server    6+1     One eight-core UltraSPARC T2 processor      0.9999988        0.704
    Sun Fire V490 server                 9+1     Four dual-core UltraSPARC IV+ processors    0.9999893        1.124
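
    The Yearly Services column appears consistent with a simple relationship: total units
    deployed (including the spare) multiplied by hours per year and divided by the MTBS value
    from Table 5-1. The paper does not state this formula, so the sketch below is an inference
    that assumes an 8,760-hour year and reads "6+1" as seven units in total.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def yearly_services(total_units: int, mtbs_hours: float) -> float:
    """Expected service events per year across a group of identical servers."""
    return total_units * HOURS_PER_YEAR / mtbs_hours


# (total units including the spare, MTBS in hours from Table 5-1)
deployments = {
    "T5220": (6 + 1, 92_282),
    "T5120": (6 + 1, 87_050),
    "V490": (9 + 1, 77_905),
}

for name, (units, mtbs_hours) in deployments.items():
    print(f"{name}: {yearly_services(units, mtbs_hours):.3f} services/yr")
```

    Fewer units and a longer MTBS both reduce the expected number of service visits, which
    matches the argument above that meeting baseline performance with fewer systems lowers
    the number of service intervals.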

    Chapter 6

    Conclusion

    Requirements for relentless availability of IT services are increasingly common.

    Deploying Sun SPARC Enterprise T5120 and T5220 servers can help support

    organizational efforts to achieve high IT service uptime goals. A low component count,

    extensive data integrity features, and superior energy efficiency promote reliability.

    Redundant components foster high levels of availability. Integrated Lights Out

    Management and self healing features simplify serviceability.

    As a part of a comprehensive design approach, the processor, virtualization technology,

    operating system, and software tools for Sun SPARC Enterprise T5120 and T5220 servers

    all provide innovative RAS features. Utilizing Sun Logical Domains and Solaris

    Containers technology to virtualize platforms helps enterprises isolate IT services

    against failure while optimizing asset utilization. The Solaris OS enables RAS features

    previously only available in large-scale systems, such as Memory Page Retirement

    (MPR) and Extended ECC protection. The extension of Solaris Predictive Self Healing

    technology to Sun SPARC Enterprise T5120 and T5220 platforms, the UltraSPARC T2

    processor, and Sun Logical Domains technology helps facilitate rapid system recovery

    in the event of a fault. Sun also provides additional tools such as Sun N1 System

    Manager and Sun Management Center software to help organizations prevent

    administrative error, realize greater serviceability, and benefit from faster recovery

    times.

    For over 20 years, Sun has brought enterprise expertise and innovation to the

    development of hardware and software products. With each new generation of

    systems, Sun works to improve platform reliability, availability, and serviceability

    capabilities. As a result, Sun systems provide a strong foundation for organizations

    seeking to support non-stop IT service operations.

    For More Information

    To learn more about Sun products and the benefits of Sun SPARC Enterprise T5120 and

    T5220 servers, contact a Sun sales representative or consult the related documents and

    Web sites listed in Table 6-1 below.

    Table 6-1. Related Web Sites

    Web Site URL                                          Title
    sun.com/coolthreads                                   Sun SPARC Enterprise T5120 and T5220 Servers
    sun.com/processors/UltraSPARC-T2                      Sun UltraSPARC T2 Processor
    opensparc.net/opensparc-t2                            OpenSPARC T2
    sun.com/processors/throughput                         Throughput Computing
    sun.com/servers/coolthreads/overview                  Sun Servers with CoolThreads Technology
    sun.com/servers/coolthreads/ldoms                     Sun Logical Domains
    sun.com/solaris                                       The Solaris Operating System
    sun.com/software/products/system_manage               Sun N1 System Manager
    sun.com/software/products/sunmanagementcenter         Sun Management Center
    research.sun.com/techrep/2002/smli_tr-2002-109.pdf    R-Cubed Availability Framework

    Sun SPARC Enterprise T5120 and T5220 Servers On the Web sun.com/coolthreads

    Sun Microsystems, Inc. 4150 Network Circle, Santa Clara, CA 95054 USA Phone 1-650-960-1300 or 1-800-555-9SUN (9786) Web sun.com

    © 2009 Sun Microsystems, Inc. All rights reserved. Sun, Sun Microsystems, the Sun logo, Java, N1, Sun Fire, SunSpectrum, Solaris, and SPARC Enterprise are trademarks, registered trademarks, or service marks of Sun Microsystems, Inc. in the U.S. and other countries. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. in the U.S. and other countries. Products bearing SPARC trademarks are based upon architecture developed by Sun Microsystems, Inc. Information subject to change without notice. Printed in USA 02/09 SunWIN # 512751