
PATHWAYS TO OPEN PETASCALE COMPUTING
The Sun Constellation System designed for performance

White Paper
November 2009

"Make everything as simple as possible, but not simpler."
Albert Einstein

Table of Contents

Executive Summary
Pathways to Open Petascale Computing
    The Unstoppable Rise of Clusters
    The Importance of a Balanced and Rigorous Design Methodology
    The Sun Constellation System
Fast, Large, and Dense InfiniBand Infrastructure
    The Fabric Challenge
    Sun Datacenter Switches for InfiniBand Fabrics
Deploying Dense and Scalable Modular Compute Nodes
    Compute Node Requirements
    The Sun Blade 6048 Modular System
    Scaling to Multiple Sun Datacenter InfiniBand Switch 648
Scalable and Manageable Storage
    Storage for Clusters
    Clustered Sun Fire X4540 Servers as Data Cache
    The Sun Lustre Storage System
    ZFS and Sun Storage 7000 Unified Storage Systems
    Long-Term Retention and Archive
Sun HPC Software
    Sun HPC Software, Linux Edition
    Seamless and Scalable Integration
    Simplified Cluster Provisioning
Deploying Supercomputing Clusters Rapidly with Less Risk
    Sun Datacenter Express Services
    Sun Architected HPC Systems
    A massive supercomputing cluster at the Texas Advanced Computing Center
Conclusion
    Acknowledgements
    For More Information


Executive Summary

From weather prediction and global climate modeling to minute sub-atomic analysis and other grand-challenge problems, modern supercomputers often provide the key technology for unlocking some of the most critical challenges in science and engineering. These essential scientific, economic, and environmental issues are complex and daunting, and many require answers that can only come from the fastest available supercomputing technology. In the wake of the industry-wide migration to terascale computing systems, an open and predictable path to petascale supercomputing environments has become essential.

Unfortunately, the design, deployment, and management of very large terascale and petascale clusters and grids have remained elusive and complex. While a few organizations have accomplished petascale deployments, those deployments have been largely proprietary in nature and have come at a high cost. In fact, it is often difficult to reach petascale for fundamental reasons: not because of inherent limitations, but because of the practicalities of attempting to scale architectures to their full potential. Seemingly simple concerns such as heat, power, cooling, cabling, and weight are rapidly overloading the vast majority of even the most modern datacenters.

Sun understands that the key to building petascale supercomputers lies in a balanced and systemic infrastructure design approach, along with careful application of the latest technology advancements. Derived from Sun's experience and innovation with very large supercomputing deployments, the Sun Constellation System provides the world's first open petascale computing environment, built entirely with open and standard hardware and software technologies. Cluster architects can use the Sun Constellation System to design and rapidly deploy tightly integrated, efficient, and cost-effective supercomputing clusters that scale predictably from a few teraflops to over a petaflop. With a completely modular approach, processors, memory, interconnect fabric, and storage can all be scaled independently depending on individual needs.

Best of all, the Sun Constellation System is an enterprise-class, Sun-supported offering comprised of general-purpose compute nodes, interconnects, and storage components that can be deployed very rapidly. In fact, existing supercomputing clusters have already been built using the system. For instance, the Texas Advanced Computing Center (TACC) at the University of Texas at Austin partnered with Sun to deploy the Sun Constellation System as their Ranger supercomputing cluster,[1] with a peak performance rating of over 500 teraflops. This document describes the key challenges and constraints involved in the build-out of petascale supercomputing architectures, including network fabrics, multicore modular compute systems, storage, open HPC software, and general-purpose I/O.

1. http://www.tacc.utexas.edu/resources/hpcsystems/#constellation
Chapter 1
Pathways to Open Petascale Computing

Most practitioners in today's high-performance computing (HPC) marketplace would readily agree that the industry is well into the age of terascale systems. Supercomputing systems capable of processing multiple teraflops are becoming commonplace. These systems are readily being built using mostly commercial off-the-shelf (COTS) components with the ability to address terabytes and petabytes of storage and, more recently, terabytes of system memory (generally as distributed shared memory and storage pools, or even as a single system image at the high end).

Only a few years ago, general-purpose terascale computing clusters constructed of COTS components were hard to imagine. Though they were on several industry roadmaps, such systems were widely regarded as impractical due to limitations in the scalability of the interconnects and fabrics that tie disparate systems together. Through competitive innovation and the race to be the fastest, the industry has been driven into the realm of practical and commercially viable terascale systems, and now to the edge of pondering what similar limitations, if any, lie ahead in the design of open petascale systems.

The Unstoppable Rise of Clusters

In the last five years, technologies used to build the world's fastest supercomputers have evolved rapidly. In fact, clusters of smaller interconnected rackmount and blade systems now represent a majority of the supercomputers on the Top500 list of supercomputing sites,[1] steadily replacing the vector supercomputers and other large systems that dominated previously. Figure 1 shows the relative shares of various supercomputing architectures comprising the Top500 list from 1993 through 2009, establishing clear acceptance of clusters as the leading supercomputing technology.

1. www.top500.org

Figure 1. In the last five years, clusters have increasingly dominated the Top500 list architecture share (image courtesy www.top500.org)

Not only have clusters provided access to supercomputing resources for increasingly larger groups of researchers and scientists, but the largest supercomputers in the world are now built using cluster architectures. This trend has been assisted by an explosion in performance, bandwidth, and capacity for key technologies, including:

• Faster processors, multicore processors, and multisocket rackmount and blade systems
• Inexpensive memory and system support for larger memory capacity
• Faster standard interconnects such as InfiniBand
• Higher aggregated storage capacity from inexpensive commodity disk drives

Unfortunately, significant challenges remain that have stifled the growth of true open petascale-class supercomputing clusters. Time-to-deployment constraints have resulted from the complexity of deploying and managing large numbers of compute nodes, switches, cables, and storage systems. The programmability of extremely large clusters remains an issue. Environmental factors, too, are paramount, since deployments must often take place in existing datacenter space with strict constraints on physical footprint, as well as power and cooling.

[Figure 1 chart: number of Top500 systems (0 to 500) by architecture class (MPP, Cluster, SMP, Constellations, Single Processor, Others) for each Top500 release, 06/1993 through 06/2009.]

In addition to these challenges, most petascale computational users also have unique requirements for clustered environments beyond those of less demanding HPC users, including:

• Scalability at the socket and core level. Some have espoused large grids of relatively low-performance systems, but lower performance only increases the number of nodes that are required to solve very large computational problems.
• Density in all things. Density is not just a requirement for compute nodes, but for interconnect fabrics and storage solutions as well.
• A scalable programming and execution model. Programmers need to be able to apply their programmatic challenges to massively scalable computational resources without special architecture-specific coding requirements.
• A lightweight grid model. Demanding applications need to be able to start thousands of jobs quickly, distributing workloads across the available computational resources through highly efficient distributed resource management (DRM) systems.
• Open and standards-based solutions. Programmatic solutions must not require extensive porting efforts or be dedicated to particular proprietary architectures or environments, and datacenters must remain free to purchase the latest high-performance computational gear without being locked into proprietary or dead-end architectures.

The Importance of a Balanced and Rigorous Design Methodology

As anyone who has witnessed prior generations of supercomputing and HPC architectures can attest, scaling gracefully is not simply a matter of accelerating systems that already perform well. Bigger versions of existing technologies are not always better. Regrettably, the pathways to teraflop systems are littered with the products and technologies of dozens of companies that simply failed to adapt along the way.

Many technologies have failed because the fundamental principles that worked in small clusters simply could not scale effectively when re-cast in a run-time environment thousands of times larger or faster than their initial implementations. For example, Ten Gigabit Ethernet, though a significant accomplishment, is known in the supercomputing realm to be fraught with sufficiently variable latency as to make it impractical for situations where low guaranteed latency and throughput dominate performance. Ultimately, building petascale-capable systems is about being willing to fundamentally rethink design, using the latest available components that are capable of meeting or exceeding specified data rates and capacities.

Put simply, getting to petascale requires balance and massive scalability in all dimensions, including scalable tools and frameworks, processors, systems, interconnects, and storage, as well as the ability to accommodate changes that allow software to scale accordingly.


Key challenges for petascale environments include:

• Keeping floating-point operations (FLOPs) to memory bandwidth ratios balanced to minimize the effects of memory latency, with each FLOP representing at least two loads and one store (a rough sizing sketch follows this list)
• Allowing for the practical scaling of the interconnect fabric to permit the connection of tens of thousands of nodes
• Exploiting the considerable investment, momentum, and cost savings of commodity multicore x64 processors, tools, and software
• Overcoming software challenges such as the forward portability of HPC codes to new architectures, scalability limitations, reliability, robustness, and being able to take advantage of multicore, multiprocessor system architectures
• Architecting to account for the opportunity to take advantage of external floating point, vector, and/or general-purpose processing on graphics processing unit (GP/GPU) solutions within a cluster framework
• Designing the highest levels of density into compute nodes, interconnect fabrics, and storage solutions in order to facilitate large and compact clusters
• Building systems with efficient power and cooling to accommodate the broadest range of datacenter facilities and to help ensure the highest levels of reliability
• Architecting clusters such that compute-intensive applications have access to fast cluster scratch storage space for a balanced computational approach
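To put the first point in concrete terms, the short sketch below estimates the memory bandwidth a node would need if every operand actually came from memory, using the two-loads-plus-one-store ratio cited above. The peak FLOP rate used here is a purely hypothetical input, not a figure from this paper.

    # Rough sizing sketch: memory bandwidth needed to balance a peak FLOP rate.
    # Assumption (from the text above): each FLOP implies at least two loads
    # and one store. The peak GFLOP value below is purely illustrative.

    BYTES_PER_OPERAND = 8      # double-precision operands
    ACCESSES_PER_FLOP = 3      # two loads + one store

    def required_bandwidth_gbs(peak_gflops):
        """Worst-case memory bandwidth (GB/s) if no accesses hit cache."""
        return peak_gflops * ACCESSES_PER_FLOP * BYTES_PER_OPERAND

    print(required_bandwidth_gbs(75.0))   # hypothetical 75 GFLOP node: 1800.0 GB/s

In practice caches and registers absorb most of these accesses, which is precisely why it is the ratio of FLOPs to deliverable memory bandwidth, rather than either figure in isolation, that must stay balanced as processors get faster.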

These challenges serve as reminders that the value of genuine innovation in the marketplace must never be underestimated, even as design-cycle times shrink and the pressures of time to market grow with the demand for faster, cheaper, and standards-based solutions.

The Sun Constellation System

Since its inception, Sun has been focused on building balance and even elegance into its system designs. The Sun Constellation System represents a tangible application of this philosophy on a grand scale, in the form of a systematic approach to building terascale and petascale supercomputing clusters. Specifically, the Sun Constellation System delivers an open architecture that is designed to allow organizations to build clusters that scale seamlessly from a few racks to teraflops or petaflops of performance.

With an overall datacenter focus, Sun is free to innovate at all levels of the system, from switching fabric, to core system and storage elements, to HPC and file system software. As a systems company, Sun looks beyond existing technologies toward solutions that optimize the simultaneous equations of cost, space, practicality, and complexity. In the form of the Sun Constellation System, this systemic focus combines a massively scalable InfiniBand interconnect with very dense computational and storage solutions in a single architecture that functions as a cohesive system. Organizations can now obtain all of these tightly integrated building blocks from a single vendor, and benefit from a unified management approach.


Components of the Sun Constellation System include:

• The Sun Datacenter InfiniBand Switch 648, offering up to 648 QDR/DDR ports in a single 11 rack unit (11U) chassis, and supporting clusters of up to 5,184 nodes with multiple switches
• The Sun Datacenter InfiniBand Switch 72, offering up to 72 QDR/DDR ports in a compact 1U form factor, and supporting clusters of up to 576 nodes with multiple switches
• The Sun Datacenter InfiniBand Switch 36, offering 36 QDR/DDR ports in a 1U form factor
• The Sun Blade 6048 Modular System, providing an ultra-dense InfiniBand-connected blade platform with support for up to 48 multiprocessor, multicore Sun Blade 6000 server modules and up to 96 compute nodes in a rack-sized chassis
• Sun Fire X4540 storage clusters, serving as an economical InfiniBand-connected parallel file system building block, with support for up to 48 terabytes in only four rack units and up to 480 terabytes in a single rack
• The Sun Storage 7000 Unified Storage System, integrating enterprise flash technology through ZFS hybrid storage pools and DTrace Analytics to provide economical, scalable, and transparent storage
• The Sun Lustre Storage System, a simple-to-deploy storage environment based on the Lustre parallel file system, Sun Fire servers, and Sun Open Storage platforms
• Sun HPC Software, encompassing integrated developer tools, Sun Grid Engine infrastructure, advanced ZFS and Lustre file systems, provisioning, monitoring, patching, and simplified inventory management, available in both a Linux Edition and a Solaris Operating System (OS) Developer Edition

The Sun Constellation System provides an open systems supercomputer architecture designed for petascale computing as an integrated and Sun-supported product. This holistic approach offers key advantages to those designing and constructing the largest supercomputing clusters:

• Massive scalability in terms of optimized compute, storage, interconnect, and software technologies and services
• Simplified cluster deployment with open HPC software that can rapidly turn bare-metal systems into functioning clusters that are ready to run
• A dramatic reduction in complexity through integrated connectivity and management to reduce start-up, development, and operational costs
• Breakthrough economics from technical innovation that results in fewer, more reliable components and high-efficiency systems in a tightly integrated solution

Along with key technologies and the experience of helping design and deploy some of the world's largest supercomputing clusters, these strengths make Sun an ideal partner for delivering open high-end terascale and petascale architecture.

Chapter 2
Fast, Large, and Dense InfiniBand Infrastructure

Building the largest supercomputing grids presents significant challenges, with fabric technology paramount among them. Sun set out to design an InfiniBand architecture for maximum flexibility and fabric scalability, and to drastically reduce the cost and complexity of delivering large-scale HPC solutions. Achieving these goals required a delicate balancing act, one that weighs the speed and number of nodes along with a sufficiently fast interconnect to provide minimal and predictable levels of latency.

The Fabric Challenge

For many applications, the interconnect fabric is already the element that limits performance. One unavoidable driver is that faster processors require a faster interconnect. Beyond merely employing a fast technology, the fabric must scale effectively with both the speed and number of systems and processors. Interconnect fabrics for large terascale and petascale deployments require:

• Low latency
• High bandwidth
• The ability to handle fabric congestion
• High reliability to avoid interruptions
• Open standards such as OpenFabrics and the OpenMPI software stack

InfiniBand technology has emerged as an attractive fabric for building large supercomputing clusters. As an open standard, InfiniBand presents a compelling choice over proprietary interconnect technologies that depend on the success and innovation of a single vendor. InfiniBand also presents a number of significant technical advantages:

• A switched fabric offers considerable scalability, supporting large numbers of simultaneous collision-free connections with virtually no increase in latency.
• Host channel adaptors (HCAs) with remote direct memory access (RDMA) support offload communications processing from the processor and operating system, leaving more processor resources available for computation.
• Fault isolation and troubleshooting are easier in switched environments, since problems can be isolated to a single connection.
• Applications that rely on bandwidth or quality of service are also well served, since they each receive their own dedicated bandwidth.

Even with these advantages, building the largest InfiniBand clusters and grids has remained complex and expensive, primarily because of the need to interconnect very large numbers of computational nodes. Traditional large clusters require literally thousands of cables and connections and hundreds of individual core and leaf switches, adding considerable expense, weight, cable-management complexity, and consumption of valuable datacenter rack space. It is clear that density, consolidation, and management efficiencies are important not just for computational platforms, but for InfiniBand interconnect infrastructure as well.

Even with very significant accomplishments in terms of processor performance and computational density, large clusters are ultimately constrained by real estate and by the complexities and limitations of interconnect technologies. Cable length limitations constrain how many systems can be connected together in a given physical space while avoiding increased latency. Interconnect topologies play a vital role in determining the properties that clustered systems exhibit. Mesh, torus (or toroidal), and Clos topologies are popular choices for interconnected supercomputing clusters and grids.

Mesh and 3D Torus Topologies

In mesh and 3D torus topologies, each node connects to its neighbors in the x, y, and z dimensions, with six connecting ports per node. Some of the most notable supercomputers based upon torus topologies include IBM's BlueGene and Cray's XT3/XT4 supercomputers. Torus fabrics have had the advantage that they have generally been easier to build than Clos topologies. Unfortunately, torus topologies represent a blocking fabric, where interconnect bandwidth can vary between nodes. Torus fabrics also provide variable latency due to variable hop count, and application deployment for torus fabrics must carefully consider node locality as a result. For some specific applications that express a nearest-neighbor type of communication pattern, torus topologies are a good fit. Computational fluid dynamics (CFD) is one such application.
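To make the variable-latency point concrete, the short sketch below computes the minimum hop count between two nodes in a 3D torus with wraparound links in each dimension. The 8x8x8 grid is a hypothetical example, not a property of any particular machine.

    # Minimum hop count between two nodes of a 3D torus with wraparound links.
    # The grid dimensions are illustrative only.

    DIMS = (8, 8, 8)   # hypothetical 8 x 8 x 8 torus (512 nodes)

    def torus_hops(a, b, dims=DIMS):
        """Sum of per-dimension distances, taking the shorter wraparound direction."""
        hops = 0
        for ai, bi, size in zip(a, b, dims):
            delta = abs(ai - bi)
            hops += min(delta, size - delta)
        return hops

    print(torus_hops((0, 0, 0), (1, 0, 0)))   # adjacent nodes: 1 hop
    print(torus_hops((0, 0, 0), (4, 4, 4)))   # farthest nodes in this grid: 12 hops

Because hop count, and therefore latency, depends on where two communicating nodes sit in the grid, job placement matters on a torus; a Clos fabric, by contrast, interposes the same number of switch stages between any pair of nodes.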

Clos Fat Tree Topologies

First described by Charles Clos in 1953, Clos networks have long formed the basis for practical multistage telephone switching systems. Clos networks utilize a fat tree topology, allowing complex switching networks to be built using many fewer crosspoints than if the entire system were implemented as a single large crossbar switch. Clos switches are typically comprised of multiple tiers and stages (hops), with each tier built from a number of crossbar switches. Connectivity exists only between switch chips on adjacent tiers.

Clos fabrics have the advantage of being non-blocking, in that each attached node has a constant bandwidth. In addition, an equal number of stages between nodes provides for uniform latency. Historically, the disadvantage of large Clos networks was that they required more resources to build.
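The resource trade-off can be made concrete with a small counting sketch for a non-blocking folded three-stage Clos built from fixed-radix crossbar elements, where each leaf element dedicates half its ports to nodes and half to spine elements. This is a simplified model rather than a description of any specific product, but with 36-port elements it reproduces the figures quoted later for the Sun Datacenter InfiniBand Switch 648 (648 ports from 54 switch chips).

    # Counting crossbar elements for a non-blocking folded 3-stage Clos fabric.
    # Simplified model: leaf chips split ports evenly between nodes and spines.

    import math

    def clos_elements(radix, target_ports):
        """Return (leaf_chips, spine_chips, node_ports) for the folded Clos."""
        down = radix // 2                        # leaf ports facing nodes
        leaves = math.ceil(target_ports / down)  # leaf chips needed
        uplinks = leaves * (radix - down)        # leaf ports facing spines
        spines = math.ceil(uplinks / radix)      # spine chips absorb all uplinks
        return leaves, spines, leaves * down

    print(clos_elements(36, 648))   # (36, 18, 648): 54 chips provide 648 ports

A single 648-port crossbar would require on the order of 648 x 648 crosspoints; the Clos arrangement reaches the same non-blocking connectivity with 54 modest 36-port elements, which is exactly the economy Clos originally described.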

Constructing Large Switched Supercomputing Clusters

Constructing very large InfiniBand Clos switches in particular is governed by a number of practical constraints, including the number of ports available in individual switch elements, maximum achievable printed circuit board size, and maximum connector density. Sun has employed considerable innovation in all of these areas, and provides both dual data rate (DDR) and quad data rate (QDR) scalable InfiniBand fabrics. For example, as a part of the Sun Constellation System, Sun InfiniBand infrastructure can provide both QDR Clos clusters that scale up to 5,184 nodes as well as 3D torus configurations.

Sun Datacenter Switches for InfiniBand Fabrics

Recognizing the considerable promise of InfiniBand interconnects, Sun has made InfiniBand connectivity a core competency, and has set out to design scalable and dense switches that avoid many of the conventional limitations. Not content to accept the status quo in terms of available InfiniBand switching, cabling, and host adapters, Sun engineers used their considerable networking and datacenter experience to view InfiniBand technology from a systemic perspective.

Key Technical Innovations for Sun Datacenter InfiniBand Switches

Sun Datacenter InfiniBand Switches 36, 72, and 648 are components of a complete system that is based on multiple technical innovations, including:

• The Sun Datacenter InfiniBand Switch 648 chassis implements a three-stage Clos fabric with up to 54 36-port Mellanox InfiniScale IV switching elements, integrated into a single 11U rackmount enclosure.
• Industry-standard 12x CXP connectors on the Sun Datacenter InfiniBand Switch 72 and 648 consolidate three discrete InfiniBand 4x connectors, resulting in the ability to host 72 4x ports through 24 physical 12x connectors.
• Complementing the 12x CXP connector, a 12x trunking cable carries signals from three servers to a single switch connector, offering a 3:1 cable reduction when used for server trunking and reducing the number of cables needed to support 648 servers to 216 (see the cable-count sketch following this list). A splitter cable that converts one 12x connection to three 4x connections is provided for connectivity to systems and storage that require 4x QSFP connectors.
• A custom-designed double-height Network Express Module (NEM) for the Sun Blade 6048 Modular System provides seamless connectivity to both the Sun Datacenter InfiniBand Switch 648 and 72. Using the same 12x CXP connectors, the Sun Blade 6048 InfiniBand QDR Switched NEM can trunk up to 12 Sun Blade 6000 server modules (up to 24 compute nodes) in a single Sun Blade 6048 Modular System shelf. The NEM together with the 12x CXP cable facilitates connectivity of up to 5,184 servers in a 5-stage Clos topology.
• Fabric topology for forwarding InfiniBand traffic is established by a redundant host-based Subnet Manager. A host-based solution allows the Subnet Manager to take advantage of the full resources of a general-purpose multicore server.
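The cable arithmetic behind the consolidation claims above is straightforward; the sketch below simply restates it, using only figures from this list (three 4x links per 12x trunking cable, 648 server ports per switch, 24 compute nodes per NEM).

    # Cable counts under the 3:1 consolidation of 12x CXP trunking cables.

    LINKS_PER_12X_CABLE = 3    # one 12x CXP cable carries three 4x connections

    def trunking_cables(four_x_links, links_per_cable=LINKS_PER_12X_CABLE):
        """Whole 12x trunking cables needed to carry a number of 4x links."""
        return -(-four_x_links // links_per_cable)   # ceiling division

    print(trunking_cables(648))   # 216 cables to attach 648 servers to one switch
    print(trunking_cables(24))    # 8 uplink cables per 24-node Switched NEM

The second figure matches the eight 12x uplink connectors per NEM cited with Table 1 later in this chapter.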


Massive Switch and Cable Consolidation

Given the scale involved in building supercomputing clusters and grids, cost and complexity figure importantly. Regrettably, traditional approaches to using InfiniBand for massive connectivity have required very large numbers of conventional switches and cables. In these configurations, many cables and ports are consumed redundantly connecting core and leaf switches together, making advertised per-port switch costs relatively meaningless and reducing reliability through extra cabling.

In contrast, the very dense InfiniBand fabric provided by Sun Datacenter InfiniBand switches can potentially eliminate hundreds of switches and thousands of cables, dramatically lowering acquisition costs. In addition, replacing physical switches and cabling with switch chips and traces on printed circuit boards drastically improves reliability. Standard 12x InfiniBand cables and connectors coupled with a specialized Sun Blade 6048 Network Express Module can eliminate thousands of additional cables, providing additional cost, complexity, and reliability improvements. Overall, these switches provide radical simplification of InfiniBand infrastructure. Sun Datacenter Switches are available to support both DDR and QDR data rates, with fabric capacities enumerated in Table 1.

Table 1. Sun Datacenter InfiniBand Switch capacities

InfiniBand Switch | Data Rate (Connectors) | Maximum Supported Nodes per Switch | Maximum Clos Fabric
Sun Datacenter InfiniBand Switch 648 | QDR or DDR (up to 216 12x CXP) | 648 | 5,184 (a)
Sun Datacenter InfiniBand Switch 72 | QDR or DDR (24 12x CXP) | 72 | 576 (a)
Sun Datacenter InfiniBand Switch 36 | QDR or DDR (36 4x QSFP) | 36 | n/a

a. Eight switches are required. The Sun Datacenter InfiniBand Switch 648 is capable of supporting clusters beyond 5,184 servers. The maximum number of nodes is currently determined by the number of uplink ports (eight) provided by the Sun Blade 6048 InfiniBand QDR Switched NEM.

The Sun Datacenter InfiniBand Switch 648

The Sun Datacenter InfiniBand Switch 648 is designed to drastically reduce the cost and complexity of delivering large-scale HPC solutions, such as those scaled for leadership in the Top500 list of supercomputing sites, as well as smaller and moderately sized enterprise and HPC applications in scientific, technical, and financial markets. Each Sun Datacenter InfiniBand Switch 648 provides up to 648 QDR InfiniBand ports in only 11 rack units (11U). Up to eight Sun Datacenter InfiniBand Switch 648 can be combined to


    support up to 5,184 nodes in a single cluster. As shown in Figure 2, the Sun Datacenter

    InfiniBand Switch 648 also provides extensive cable support and management for clean

    and efficient installations.

Figure 2. The Sun Datacenter InfiniBand Switch 648 offers up to 648 QDR/DDR/SDR 4x InfiniBand connections in an 11U rackmount chassis (shown with cable management arms deployed).

The Sun Datacenter InfiniBand Switch 648 is ideal for deploying fast, dense, and compact Clos fabrics when used as a part of the Sun Constellation System. Based on the Mellanox InfiniScale IV 36-port InfiniBand switch device, each switch chassis connects up to 648 nodes using 12x CXP connectors. The switch represents a full three-stage Clos fabric, and up to eight Sun Datacenter InfiniBand Switch 648 can be used to combine up to 54 Sun Blade 6048 chassis in a maximal 5,184-node fabric. Up to three Sun Datacenter InfiniBand Switch 648 (and up to 1,944 QDR ports) can be deployed in a single standard rack (Figure 3).

Figure 3. Up to three Sun Datacenter InfiniBand Switch 648 in a single 19-inch rack deliver 1,944 QDR ports.

The Sun Datacenter InfiniBand Switch 648 is tightly integrated with the Sun Blade 6048 InfiniBand QDR Switched Network Express Module (NEM). 12x cables and CXP connectors provide a 3:1 cable consolidation ratio. Each dual-height NEM connects up to 24 compute nodes in a single Sun Blade 6048 shelf to a QDR InfiniBand fabric. Sun's approach to InfiniBand networking is highly flexible in that both Clos and mesh/torus interconnects can be built using the same components. The Sun Blade 6048 InfiniBand Switched NEM can be used by itself to build mesh and torus fabrics, or in combination with the Sun Datacenter InfiniBand Switch 648 to build Clos InfiniBand fabrics.

The Sun Datacenter InfiniBand Switch 648 employs a passive midplane. Fabric cards install vertically and connect to the midplane from the rear of the chassis. Up to nine line cards install horizontally from the front of the chassis. A three-dimensional perspective of the fabric provided by the switch is shown in Figure 4, with an example route overlaid. With this dense switch configuration, InfiniBand packets traverse only three hops from ingress to egress of the switch, keeping latency very low. The Sun Blade 6048 InfiniBand QDR Switched NEM adds only two hops, for a total of five. All InfiniBand routing is managed using a redundant host-based Subnet Manager.

    Figure 4. A path through a Sun Datacenter InfiniBand Switch 648 core switch

    connects two nodes across horizontal line cards, a vertical fabric card, and the

    passive orthogonal midplane.

The Sun Datacenter InfiniBand Switch 72

The Sun Datacenter InfiniBand Switch 72 leverages many of the innovations found in the Sun Datacenter InfiniBand Switch 648, while offering support for smaller and mid-sized configurations. Like the larger 648-port switch, the Sun Datacenter InfiniBand Switch 72 offers QDR and DDR connectivity, extreme density, and unrivaled cable aggregation for Sun Blade and Sun Fire servers as well as Sun storage solutions.



    Depicted in Figure 5, the Sun Datacenter InfiniBand Switch 72 occupies only one rack

    unit, offering an ultraslim and ultradense complete switch fabric solution for clusters of

    up to 72 nodes.

    Figure 5. The Sun Datacenter InfiniBand Switch 72 offers 72 4x QDR InfiniBand

    ports in a 1U form factor

    When used in conjunction with the Sun Blade 6048 Modular System, up to eight Sun

    Datacenter InfiniBand Switch 72 can be combined to support clusters of up to 576

    nodes. While similar solutions from competitors occupy over 17 rack units, eight 1U Sun

    Datacenter InfiniBand Switch 72 save considerable space, and require roughly one third

    the number of cables. In addition to simplification, this end-to-end supercomputing

    solution offers extremely low latency using industry-standard transport, and

    commodity processors including AMD Opteron, Intel Xeon, and Sun SPARC.

The Sun Datacenter InfiniBand Switch 72 provides the following specifications:

• 72 QDR/DDR/SDR 4x InfiniBand ports (expressed through 24 12x CXP connectors)
• Data throughput of 4.6 Tb/sec (see the throughput arithmetic sketch after this list)
• Port-to-port latency of 300 ns (QDR)
• Eight data virtual lanes
• One management virtual lane
• 4096-byte MTU
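The aggregate throughput figures quoted for these switches follow directly from per-port data rates; the sketch below reproduces them under the assumption that the figures count bidirectional data throughput with the usual 8b/10b encoding of QDR InfiniBand, so that a 4x QDR port carries 32 Gb/s of data in each direction.

    # Aggregate switch throughput from per-port InfiniBand data rates.
    # Assumption: 8b/10b encoding, so 4x QDR = 4 lanes x 10 Gb/s x 0.8 = 32 Gb/s
    # of data per direction.

    QDR_4X_DATA_GBPS = 4 * 10 * 0.8   # per port, per direction

    def switch_throughput_tbps(ports, per_port_gbps=QDR_4X_DATA_GBPS):
        """Bidirectional aggregate data throughput in Tb/s."""
        return ports * per_port_gbps * 2 / 1000.0

    print(switch_throughput_tbps(72))   # ~4.6 Tb/s for the 72-port switch
    print(switch_throughput_tbps(36))   # ~2.3 Tb/s for the 36-port switch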

    Sun Datacenter InfiniBand Switch 36

Leveraging the properties of the InfiniBand architecture, the Sun Datacenter InfiniBand Switch 36 helps organizations deploy smaller high-performance fabrics in demanding high-availability (HA) environments. The switch supports the creation of logically isolated sub-clusters, as well as advanced features for traffic isolation and Quality of Service (QoS) management, preventing faults from causing costly service disruptions. The embedded InfiniBand fabric management module supports active/hot-standby dual-manager configurations, helping to ensure a seamless migration of the fabric management service in the event of a management module failure. The Sun Datacenter InfiniBand Switch 36 is provisioned with redundant power and cooling for high availability in demanding datacenter environments. The Sun Datacenter InfiniBand Switch 36 is shown in Figure 6.

    Figure 6. The Sun Datacenter InfiniBand Switch 36 offers 36 QDR InfiniBand ports

    in a 1U form factor

The Sun Datacenter InfiniBand Switch 36 provides the following specifications:

• 36 QDR/DDR/SDR 4x InfiniBand ports (expressed through 36 4x QSFP connectors)
• Data throughput of 2.3 Tb/sec
• Port-to-port latency of 100 ns (QDR)
• Eight data virtual lanes
• One management virtual lane
• 4096-byte MTU

Chapter 3
Deploying Dense and Scalable Modular Compute Nodes

Implementing terascale and petascale supercomputing clusters depends heavily on having access to large numbers of high-performance systems with large memory support and high memory bandwidth. As a part of the Sun Constellation System, Sun's approach is to combine the considerable and constant performance gains in the standard processor marketplace with the advantages of modular architecture. This approach results in some of the fastest and most dense systems possible, all tightly integrated with Sun Datacenter InfiniBand switches.

Compute Node Requirements

While some supercomputing architectures employ very large numbers of slower proprietary nodes, this approach does not translate easily to petascale. The programmatic implications alone of handling literally millions of nodes are not particularly appealing, much less the physical realities of managing and housing such systems. Instead, building large and open terascale and petascale systems depends on key capabilities for compute nodes, including:

• High Performance. Compute nodes must provide very high peak levels of floating-point performance. Likewise, because floating-point performance is dependent on multiple memory operations, equally high levels of memory bandwidth must be provided. I/O bandwidth is also crucial, yielding high-speed access to storage and other compute nodes.
• Density, Power, and Cooling. The physical requirements of today's ever more expensive datacenter real estate dictate that any viable solution take the best advantage of datacenter floor space while staying within environmental realities. Solutions must be as energy efficient as possible, and must provide effective cooling that fits well with the latest energy-efficient datacenter practices.
• Superior Reliability and Serviceability. Due to their large numbers, computational systems must be as reliable and serviceable as possible. Not only must systems provide redundant hot-swap processing, I/O, power, and cooling modules, but serviceability must be a key component of their design and management. Interconnect schemes must allow systems to be cabled once and reconfigured at will as required.

Blade technology has offered considerable promise in these areas for some time, but has often been constrained by legacy blade platforms that locked adopters into expensive proprietary infrastructure. Power and cooling limitations often meant that processors were limited to less powerful versions. Limited processing power, memory capacity, and I/O bandwidth often severely restricted the applications that could be deployed. Proprietary tie-ins and other constraints in chassis design dictated networking and interconnect topologies, and I/O expansion options were limited to a small number of expensive and proprietary modules.

The Sun Blade 6048 Modular System

To address the shortcomings of earlier blade computing platforms, Sun started with a design point focused on the needs of the datacenter and highly scalable deployments, rather than with preconceptions of chassis design. With this innovative and truly modular approach, the Sun Blade 6048 Modular System offers an ultra-dense, high-performance solution for large HPC clusters. Organizations gain the promised benefits of blades, and can deploy thousands of nodes within the cabling, power, and cooling constraints of existing datacenters. Fully compatible with the Sun Blade 6000 Modular System, the Sun Blade 6048 Modular System provides distinct advantages over other approaches to modular architecture.

    Innovative Chassis Design for Industry-Leading Density and Environmentals

    The Sun Blade 6048 Modular System features a standard rack-size chassis that

    facilitates the deployment of high-density computational environments. By

    eliminating all of the hardware typically used to rackmount individual blade

    chassis, the Sun Blade 6048 Modular System provides 20% more usable space in

    the same physical footprint. Up to 48 Sun Blade 6000 server modules can be

    deployed in a single Sun Blade 6048 Modular System for up to 96 compute nodes

    per rack. Innovative chassis features are carried forward from the Sun Blade 6000

    Modular System.

A Choice of Processors and Operating Systems

Each Sun Blade 6048 Modular System chassis supports up to 48 full-performance, full-featured Sun Blade 6000 series server modules. Server modules based on x86/x64 architectures, and ideal for HPC and supercomputing environments, include:

• The Sun Blade X6440 server module, with four sockets for Six-Core AMD Opteron 8000 Series processors, and support for up to 256 GB of memory
• The Sun Blade X6270 server module, with two sockets for Intel Xeon Processor 5500 Series (Nehalem) CPUs and 144 GB of memory per server module
• The Sun Blade X6275 server module, with two nodes, each with two sockets for Intel Xeon Processor 5500 Series CPUs, 96 GB of memory per node, and an on-board QDR Mellanox InfiniBand host channel adapter (HCA)

Each server module provides significant I/O capacity as well, with up to 32 lanes of PCI Express 2.0 bandwidth delivered from each server module to the multiple available I/O expansion modules (a total of up to 207 Gb/sec supported per server module). To enhance availability, server modules don't have separate power supplies or fans. Some server modules feature up to four hot-swap hard disk drives (HDDs) or solid state drives (SSDs) with hardware RAID options, while others provide on-board flash technologies for fast and reliable I/O. Organizations can deploy server modules based on the processors and operating systems that best serve their applications or environment. Different server modules can be mixed and matched in a single chassis, and deployed and redeployed as needs dictate. The Solaris Operating System (OS), Linux, and Microsoft Windows are all supported.

Complete Separation Between CPU and I/O Modules

The Sun Blade 6048 Modular System design avoids compromises because it provides a complete separation between CPU and I/O modules. Two types of I/O modules are supported:

• Up to two industry-standard PCI Express ExpressModule (EM) slots are dedicated to each server module.
• Up to two Network Express Modules (NEMs) provide bulk I/O for all of the server modules installed in each shelf (four shelves per chassis).

Through this flexible approach, server modules can be configured with different I/O options depending on the applications they host. All I/O modules are hot-plug capable, and customers can choose from Sun-branded or third-party adapters for networking, storage, clustering, and other I/O functions.

    Sun Blade Transparent Management

    Many blade vendors provide management solutions that lock organizations into

    proprietary management tools. With the Sun Blade 6048 Modular System,

    customers have the choice of using their existing management tools or Sun Blade

    Transparent Management. Sun Blade Transparent Management is a standards-

    based cross-platform tool that provides direct management over individual server

modules and direct management of chassis-level modules using Sun Integrated Lights Out Manager (ILOM).

    Within the Sun Blade 6048 Modular System, a chassis monitoring module (CMM)

    works in conjunction with the service processor on each server module to form a

    complete and transparent management solution. Individual server modules

    provide support for IPMI, SNMP, CLI (through serial console or SSH), and HTTP(S)

    management methods. In addition, Sun Ops Center provides discovery,

    aggregated management, and bulk deployment for multiple systems.


    System Overview

The Sun Blade 6048 chassis provides space for up to 12 server modules in each of its four shelves, for up to 48 Sun Blade 6000 server modules in a single chassis. This

    design approach provides considerable density. Front and rear perspectives of the Sun

    Blade 6048 Modular System are provided in Figure 7.

    Figure 7. Front and rear perspectives of the Sun Blade 6048 Modular System

With four self-contained shelves per chassis, the Sun Blade 6048 Modular System houses a wide range of components:

• Up to 48 Sun Blade 6000 server modules insert from the front of the chassis, with 12 modules supported by each shelf.
• A total of eight hot-swap power supply modules insert from the front of the chassis, with two 8,400 Watt 12-volt power supplies (N+N) provided for each shelf. Each power supply module contains a dedicated fan module.
• Up to 96 hot-plug PCI Express ExpressModules (EMs) insert from the rear of the chassis (24 per shelf), supporting industry-standard PCI Express interfaces with two EM slots available for use by each server module.


• Up to four dual-height Sun Blade 6048 InfiniBand NEMs can be installed in a single chassis (one per shelf). Alternately, up to eight single-height Network Express Modules (NEMs) can be inserted from the rear, with two NEM slots serving each shelf of the chassis.
• A chassis monitoring module (CMM) and power interface module are provided for each shelf. The CMM provides transparent management access to individual server modules, while the power interface module provides six plugs for the power supply modules in each shelf.
• Redundant (N+1) fan modules are provided at the rear of the chassis for efficient front-to-back cooling.

Standard I/O Through a Passive Midplane

In essence, the passive midplane in the Sun Blade 6048 Modular System is a collection of wires and connectors between different modules in the chassis. Since there are no active components, the reliability of this printed circuit board is extremely high, in the millions of hours. The passive midplane provides electrical connectivity between the server modules and the I/O modules.

All front and rear modules connect directly to the passive midplane, with the exception of the power supplies and the fan modules. The power supplies connect to the midplane through a bus bar and to the AC inputs via a cable harness. The redundant fan modules plug individually into a set of three fan boards, where fan speed control and other chassis-level functions are implemented. The front fan modules that cool the PCI Express ExpressModules each connect to the chassis via self-aligning, blind-mate connections. The main functions of the midplane include:

• Providing a mechanical connection point for all of the server modules
• Providing 12 VDC from the power supplies to each customer-replaceable module
• Providing 3.3 VDC power used to power the System Management Bus devices on each module, and to power the CMM
• Providing a PCI Express interconnect between the PCI Express root complexes on each server module and the EMs and NEMs installed in the chassis
• Connecting the server modules, CMMs, and NEMs to the management network

Each server module is energized through the midplane from the redundant chassis power grid. The midplane also provides connectivity to the I2C network in the chassis, allowing each server module to directly monitor the chassis environment, including fan and power supply status as well as various temperature sensors. A number of I/O links are also routed through the midplane for each server module. Connection details differ depending on the selected server module and associated NEMs. As an example, Figure 8 illustrates the dual-node Sun Blade X6275 server module configured with the Sun Blade 6048 InfiniBand QDR Switched NEM, with connections that include:

• An x8 PCI Express 2.0 link connecting from each compute node to a dedicated EM
• Two gigabit Ethernet links to the NEM, one from each compute node
• Two 4x QDR InfiniBand connections to the NEM, one from each compute node
• An Ethernet connection from the server module to the CMM for management

    Figure 8. Distribution of communications links from a typical Sun Blade 6000

    server module

Tight Integration with Sun Datacenter InfiniBand Switches

Providing dense connectivity to servers while minimizing cables is one of the issues facing large HPC cluster deployments. The Sun Blade 6048 InfiniBand QDR Switched NEM solves this challenge and improves both density and reliability by integrating connections and switch components into a dual-height NEM form factor for the Sun Blade 6048 chassis. As a part of the Sun Constellation System, the NEM uses common components, cables, connectors, and architecture with the Sun Datacenter InfiniBand Switch 648 and 72.



Where density is key, each Sun Blade X6275 server module features two compute nodes, with each node supporting two sockets for Intel Xeon Processor 5500 Series CPUs and up to 96 GB of memory (Figure 10).

    Figure 10. The Sun Blade X6275 server module provides two compute nodes on a

    single server module.

Figure 11 shows a block-level representation of how the Sun Blade X6275 server module connects to the Sun Blade 6048 InfiniBand QDR Switched NEM. In this configuration, twelve ports from each switch chip (24 total) are used to communicate with the two compute nodes on each Sun Blade X6275 server module, with nine ports used to connect the two switches together. The 30 remaining ports (15 per switch chip) are used as uplinks to either other QDR Switched NEMs or external InfiniBand switches. Sun Blade 6048 InfiniBand QDR Switched NEMs can be connected together directly to provide mesh or 3D torus fabrics. Alternately, one or more Sun Datacenter InfiniBand Switch 648 or 72 can be connected to provide Clos fabric implementations. The external ports use industry-standard CXP connectors that aggregate three 4x ports into a single 12x connector.


    Figure 11. The Sun Blade 6048 InfiniBand QDR Switched NEM connects directly to

    Mellanox HCAs on both nodes of the Sun Blade X6275 server module

In the default configuration, and for clusters that utilize up to four Sun Datacenter InfiniBand Switch 648s, the switch provides a non-blocking fabric. To maintain a non-blocking fabric in configurations of larger than four switches, an external 12x cable can link two of the external CXP connectors (one connected to each internal switch chip) to interconnect the two switches with an additional three 4x connections. This configuration fully meshes the InfiniScale IV chips on the Sun Blade 6048 InfiniBand QDR Switched NEM, with a total of 12 ports communicating between the two Mellanox InfiniScale IV InfiniBand switches, while still leaving 24 4x ports (eight 12x CXP connectors) available as switch uplinks.
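The per-chip port budget behind these two configurations can be tallied directly from the counts above; the small check below simply restates them and applies a simplified non-blocking criterion (external uplink ports on a chip at least matching its node-facing ports).

    # Port budget per 36-port InfiniScale IV chip on the Sun Blade 6048
    # InfiniBand QDR Switched NEM, using the allocations described above.

    PORTS_PER_CHIP = 36

    def roughly_nonblocking(node_ports, inter_switch_ports, uplink_ports):
        # Sanity check: the three roles must account for every port on the chip.
        assert node_ports + inter_switch_ports + uplink_ports == PORTS_PER_CHIP
        # Simplified criterion: uplink capacity should not fall below the
        # node-facing capacity of the same chip.
        return uplink_ports >= node_ports

    print(roughly_nonblocking(12, 9, 15))    # default configuration: True
    print(roughly_nonblocking(12, 12, 12))   # meshed configuration (extra 12x loop): True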

Scaling to Multiple Sun Datacenter InfiniBand Switch 648

Designers need the ability to scale supercomputing deployments without being constrained by arbitrary limitations. The Sun Datacenter InfiniBand Switch 648 lets organizations scale from mid-sized InfiniBand deployments that may populate only a portion of a single Sun Datacenter InfiniBand Switch 648 chassis, to very large deployments built from multiple Sun Datacenter InfiniBand Switch 648 systems.


As with single-switch configurations, a multiswitch system still functions and is managed as a single entity, greatly reducing management complexity.

A single Sun Datacenter InfiniBand Switch 648 can be deployed for configurations that require up to 648 compute nodes. Up to eight Sun Datacenter InfiniBand Switch 648 systems can be configured to serve up to 5,184 compute nodes.

Certain requirements exist for maintaining a non-blocking InfiniBand fabric. Table 2 lists the supported numbers of Sun Blade 6048 chassis, Sun Datacenter InfiniBand Switch 648 systems, line cards, Sun Blade 6048 InfiniBand QDR Switched NEMs, and 12x cables needed to support various numbers of compute nodes via Sun Blade X6275 server modules. All listed configurations are non-blocking.

Table 2. Maximum numbers of Sun Blade X6275 server modules and Sun Blade 6048 Modular Systems supported by various numbers of Sun Datacenter InfiniBand Switch 648 systems.

Sun Blade 6048 Chassis | Sun Datacenter InfiniBand Switch 648 | Line Cards Required | Sun Blade 6048 InfiniBand QDR Switched NEMs | 12x Cables Required | Total Compute Nodes Supported
1 | 1 | 2 | 4 | 32 | 96
2 | 1 | 3 | 8 | 64 | 192
3 | 1 | 4 | 12 | 96 | 288
4 | 1 | 6 | 16 | 128 | 384
5 | 1 | 7 | 20 | 160 | 480
6 | 1 | 8 | 24 | 192 | 576
8 | 2 | 11 | 32 | 256 | 768
10 | 2 | 14 | 40 | 320 | 960
12 | 2 | 16 | 48 | 384 | 1,152
24 | 4 | 32 | 96 | 768 | 2,304
48 | 8 | 64 | 192 | 1,536 | 4,608
54 | 8 | 72 | 216 | 1,728 | 5,184
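The relationships in Table 2 can also be expressed as a rough sizing sketch. The Python fragment below is illustrative only; the per-line-card and per-switch constants are assumptions chosen because they reproduce the table's figures, and the table remains the authoritative reference for supported configurations.

    import math

    # Rough fabric-sizing sketch that reproduces the rows of Table 2
    CXP_PER_LINE_CARD = 24        # assumed 12x connectors terminated per line card
    LINE_CARDS_PER_SWITCH = 9     # assumed line cards per Switch 648

    def size_fabric(chassis):
        nems = 4 * chassis                # QDR Switched NEMs (four per chassis)
        cables_12x = 8 * nems             # eight 12x uplink cables per NEM
        line_cards = math.ceil(cables_12x / CXP_PER_LINE_CARD)
        switches = math.ceil(line_cards / LINE_CARDS_PER_SWITCH)
        nodes = 96 * chassis              # 2 nodes x 12 X6275 modules x 4 shelves
        return switches, line_cards, nems, cables_12x, nodes

    for chassis in (1, 6, 8, 54):
        print(chassis, size_fabric(chassis))
    # 54 chassis -> (8, 72, 216, 1728, 5184), matching the last row of Table 2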


    Chapter 4

    Scalable and Manageable Storage

    Large-scale supercomputing clusters place significant demands on storage systems. The

    enormous computational performance gains that have been realized through

    supercomputing clusters are capable of generating ever-larger quantities of data at very

    high rates. Effective HPC storage solutions must provide cost-effective capacity, and

    throughput must be able to scale along with the performance of cluster compute

    nodes. In addition, users and systems alike need fast access to data and home

    directories, and longer-term retention and archival are increasingly important in HPC

    and supercomputing environments. These diverse demands require a robust range of

    integrated storage offerings.

Storage for Clusters

Along with the general growth in storage capacity requirements and the sheer number

    of files stored, large HPC environments are seeing significant growth in the numbers of

    users needing convenient access to their files. All users want to access their essential

    data quickly and easily without having to perform extraneous steps. Organizations also

    want to get the best utilization possible from their computational systems.

    Unfortunately, storage speeds have seriously lagged behind computational

    performance for years, and HPC users are increasingly concerned about storage

benchmarks, the increasing complexity of the I/O path, and the range of components required to build complete storage solutions.

    Of particular importance, large HPC environments need to be able to effectively

    manage the flow of high volumes of data through their storage infrastructure,

    requiring:

Storage that acts as a resilient compute engine data cache to match the streaming rates of applications running on the compute cluster

Storage that provides longer-term retention and archive to store massive quantities of essential data to tiered disk or tape hierarchies

A range of scalable and parallel file systems and integrated data management software to help migrate file system data from near-term cache to longer-term retention and archive and back on demand

    Even as the capacities of individual disk drives have risen, and prices have fallen, high-

    volume parallel storage systems have remained expensive and complex. With

    experience deploying petabytes of storage into large supercomputing clusters, Sun

understands the key issues in delivering high-capacity, high-throughput storage in

    a cost-effective and manageable fashion. As an example, the Tokyo Institute of


    Technology (TiTech) TSUBAME supercomputing cluster was initially deployed with 1.1

    petabytes of storage provided by clustered Sun Fire X4500 storage servers and the

    Lustre parallel file system.

Clustered Sun Fire X4540 Storage Servers as Data Cache

Ideal for building storage clusters to serve as cluster scratch space or data cache, the

    Sun Fire X4540 storage server defines a new category of system. These innovative

    systems closely couple a general-purpose enterprise-class x64 server with high-density

    storage all in a very compact form factor. Supporting up to 48 terabytes in only four

    rack units, the Sun Fire X4540 storage server also provides considerable compute power

    with dual sockets for Third-Generation Quad-Core and enhanced Quad-Core AMD

    Opteron processors. The server can also be configured for high-throughput InfiniBand

    networking allowing it to be connected directly to Sun InfiniBand switches. With

    support for up to 48 internal 500 GB or 1 TB disk drives, the Sun Fire X4540 storage

    server is ideal for large cluster deployments running the Linux OS and the Lustre

    parallel file system.

    Figure 12. The Sun Fire X4540 storage server provides up to 48 terabytes of

compact storage in only four rack units, ideal for configuration as cluster

    scratch space using the Lustre parallel file system.

The Sun Fire X4540 storage server represents an innovative design that provides high throughput and high-speed access to the 48 directly attached, hot-plug Serial ATA (SATA) disk drives. Designed for datacenter deployment, the efficient system is cooled from front to back across the components and disk drives. Each Sun Fire X4540 storage

    server provides:

Minimal cost per gigabyte utilizing SATA II storage and software RAID 6, with six SATA II storage controllers connecting to 48 high-performance SATA disk drives (see the capacity sketch after this list)

    High performance from an industry-standard x64 server based on two Quad-Core or

    enhanced Quad-Core AMD Opteron processors



    Maximum memory and bandwidth scaling from embedded single-channel DDR

    memory controllers on each processor, delivering up to 64 GB of capacity

High-performance I/O from two PCI-X slots that deliver over 8.5 gigabits per second of plug-in I/O bandwidth, including support for InfiniBand HCAs

Easy maintenance and overall system reliability and availability from redundant hot-pluggable disks, power supply units, fans, and I/O
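As a back-of-the-envelope illustration of the capacity implications of the software RAID 6 configuration noted above, the sketch below computes raw and usable space. The grouping of the 48 drives into six 8-drive RAID-6 sets (one per controller) is purely an assumption for illustration, not a statement of the product's actual layout.

    # Illustrative capacity arithmetic for a Sun Fire X4540 with software RAID 6
    DRIVES = 48
    DRIVE_TB = 1.0                # 1 TB drives (500 GB drives are also supported)

    def usable_tb(groups=6, drives_per_group=8):
        # RAID 6 reserves two drives' worth of capacity per group for parity
        return groups * (drives_per_group - 2) * DRIVE_TB

    print(DRIVES * DRIVE_TB)      # raw capacity: 48.0 TB
    print(usable_tb())            # usable under the assumed layout: 36.0 TB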

    Parallel file systems are required for moving massive amounts of data through

    supercomputing clusters. Given its strengths, the Sun Fire X4540 storage server is now a

    standard component of many large supercomputing cluster deployments around the

    world. Large grids and clusters need high-performance heterogeneous access to data,

and the Sun Fire X4540 storage server provides both the high throughput and the essential scalability that allow parallel file systems to perform at their best. Together with the Lustre parallel file system and the Linux OS, the Sun Fire X4540 storage server also serves as the key component of the Sun Lustre Storage System (described in the next section).

The Sun Lustre Storage System

The Lustre file system is a software-only architecture that supports a number of

different hardware implementations. Lustre's state-of-the-art object-based storage

    architecture provides ground-breaking I/O and metadata throughput, with considerable

    reliability, scalability, and performance advantages. The Lustre file system currently

    scales to thousands of nodes and hundreds of terabytes of storage with the potential

    to support tens of thousands of nodes and petabytes of data.

    Building on the strengths of the Lustre parallel file system, the Sun Lustre Storage

System is architected using Sun Open Storage systems that deliver exceptional

    performance and provide additional value. The main components of a typical Lustre

    architecture include:

    Lustre file system clients (Lustre clients)

Metadata Servers (MDS)

    Object Storage Servers (OSS)

    Metadata Servers and Object Storage Servers implement the file system and

    communicate with the Lustre clients. The MDS manages and stores metadata, such as

    file names, directories, permissions and file layout. Configurations also require one or

    more Lustre Object Storage Server (OSS) modules, which provide scalable I/O

    performance and storage capacity.
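The way object-based striping spreads I/O across servers can be pictured with a short conceptual sketch. The Python fragment below is not Lustre code; the 1 MB stripe size and four-OST layout are assumptions chosen only to illustrate how a file's byte range maps round-robin onto objects held by the OSS modules.

    # Conceptual model of Lustre-style object striping (illustrative only)
    STRIPE_SIZE = 1 << 20     # assumed 1 MB stripes
    STRIPE_COUNT = 4          # assumed: file striped across four OSTs

    def locate(offset):
        # Map a file offset to (OST index, offset within that OST's object)
        stripe_index = offset // STRIPE_SIZE
        ost = stripe_index % STRIPE_COUNT
        object_offset = (stripe_index // STRIPE_COUNT) * STRIPE_SIZE + offset % STRIPE_SIZE
        return ost, object_offset

    # A 4 MB write touches all four OSTs, which is why aggregate throughput
    # scales as more OSS modules (and their OSTs) are added
    for offset in range(0, 4 * STRIPE_SIZE, STRIPE_SIZE):
        print(offset, locate(offset))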

Building on these standard components, all Sun Lustre Storage System configurations include

    a High Availability Lustre Metadata Server (HA MDS) module that provides failover. For

    maximum flexibility, the Sun Lustre Storage System defines two OSS modules: a


    Standard OSS module for greatest density and economy, and an HA OSS module that

    provides OSS failover for environments where automated recovery from OSS failure is

    important (Figure 13).

    Figure 13. High availability metadata servers (HA MDS) and high availability

object storage servers (HA OSS) allow for file system failover in Lustre

    configurations.

    HA MDS Module

Designed to meet the critical requirement of high availability, the HA MDS module is common to all Sun Lustre Storage System configurations. This module

    includes a pair of Sun Fire X4270 servers with an attached Sun Storage J4200 array

    acting as shared storage. Internal boot drives in the Sun Fire X4270 server are

    mirrored for added protection. The Sun Fire X4270 server features two quad-core

    Intel Xeon Processor 5500 Series (Nehalem) CPUs and is configured with

    24 GB RAM.



    Standard OSS Module

    The Sun Fire X4540 server was chosen for use as the Standard OSS module. As

    discussed, the Sun Fire X4540 server features an innovative architecture that

combines a high-performance server, high I/O bandwidth, and very high-density storage in a single integrated system.

    HA OSS Module

    Each HA OSS module includes two Sun Fire X4270 servers and four Sun Storage

    J4400 arrays. Sun Fire X4270 servers were chosen for the HA OSS module, because

    with six PCI Express slots, the Sun Fire X4270 server has the ability to drive the

    high throughput required in Sun Lustre Storage System environments. The Sun

    Storage J4400 array was chosen for the HA OSS module because it offers

    compelling storage density, connectivity, higher availability and very low price per

    gigabyte. With redundant SAS I/O Modules and front-serviceable disk drives, the

    Sun Storage J4400 array helps the Sun Lustre Storage System deliver price/

    performance advantages without sacrificing RAS features.

    Sun can reference many storage installations that have achieved impressive scalability

results. One such reference is the Texas Advanced Computing Center's Ranger System

    (see http://www.tacc.utexas.edu/resources/hpcsystems/#ranger), where Sun has

    demonstrated near-linear scalability in a configuration encompassing fifty similar

    previous-generation Sun OSS modules with a single HA MDS module supporting a file

    system that was 1.2 petabytes in size. Figure 14 shows the Lustre file system

    throughput at TACC where throughput rates of 45 GB/sec with peaks approaching 50

GB/sec have been observed. In addition, TACC has experienced near-linear throughput on a single application's use of the Lustre file system at 35 GB/sec.

Figure 14. Lustre parallel file system performance at TACC, plotting write speed (GB/sec) against the number of writing clients for stripe counts of 1 and 4: $SCRATCH file system throughput and $SCRATCH application performance.

    More information on implementing the Lustre parallel file system can be found in the

Sun BluePrints article titled "Solving the HPC I/O Bottleneck: Sun Lustre Storage System"

    (http://wikis.sun.com/display/BluePrints/Solving+the+HPC+IO+Bottleneck+-

    +Sun+Lustre+Storage+System).



ZFS and Sun Storage 7000 Unified Storage Systems

While high-throughput cluster scratch space is critical, clusters also need storage that

serves other needs. Some of an organization's most important data includes completed simulations and key source data. Clusters need scalable, reliable, and robust storage for tier-1 data archival and users' home directories.

    To address this need, Sun Storage 7000 Unified Storage Systems incorporate an open-

    source operating system, commodity hardware, and industry-standard technologies.

These systems represent low-cost, fully functional network attached storage (NAS) devices designed around the following core technologies:

General-purpose x64-based servers (that function as the NAS head) and Sun Storage products, proven high-performance commodity hardware solutions with compelling price-performance points

The ZFS file system, the world's first 128-bit file system with unprecedented availability and reliability features

    A high-performance networking stack using IPv4 or IPv6

DTrace Analytics, which provides dynamic instrumentation for real-time performance

    analysis and debugging

    Sun Fault Management Architecture (FMA) for built-in fault detection, diagnosis, and

    self-healing for common hardware problems

    A large and adaptive two-tiered caching model, based on DRAM and enterprise-class

    solid state devices (SSDs)

To meet varied needs for capacity, reliability, performance, and price, the product family includes several models: the Sun Storage 7110, 7210, 7310, and 7410 Unified Storage Systems (Figure 15). Configured with appropriate data processing and

    storage resources, these systems can support a wide range of requirements in HPC

    environments.


    Figure 15. Sun Storage 7000 Unified Storage Systems

    Sun Storage 7110 Unified Storage System

    Sun Storage 7210 Unified Storage System

    Sun Storage 7410 Unified Storage System

    Sun Storage 7310 Unified Storage System


    Tight integration of the ZFS scalable file system

Sun Storage 7000 Unified Storage Systems are powered by the ZFS scalable file system.

    ZFS offers a dramatic advance in data management with an innovative approach to

    data integrity, tremendous performance improvements, and a welcome integration of

    both file system and volume management capabilities. A true 128-bit file system, ZFS

    removes all practical limitations for scalable storage, and introduces pivotal new

    concepts such as hybrid storage pools that de-couple the file system from physical

    storage. This radical new architecture optimizes and simplifies code paths from the

    application to the hardware, producing sustained throughput at near platter speeds.

    New block allocation algorithms accelerate write operations, consolidating what would

    traditionally be many small random writes into a single, more efficient write sequence.

    Silent data corruption is corruption that goes undetected, and for which no error

messages are generated. This particular form of data corruption is of special concern to HPC applications since they typically generate, store, and archive significant amounts of data. In fact, a study by CERN1 has shown that silent data corruption, including disk errors, RAID errors, and memory errors, is much more common than previously imagined. ZFS provides end-to-end checksumming for all data, greatly reducing the risk of silent data corruption.

1. Silent Corruptions, Peter Kelemen, CERN After C5, June 1st, 2007
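The end-to-end principle can be pictured with a short conceptual sketch. The Python fragment below is not ZFS code; it simply shows the idea of keeping a checksum apart from each data block and verifying it on every read, so that a silently flipped bit is reported rather than handed back to the application.

    import hashlib

    # Conceptual sketch of end-to-end checksumming (not ZFS code)
    class ChecksummedStore:
        def __init__(self):
            self.blocks = {}      # block id -> stored data
            self.sums = {}        # block id -> digest kept apart from the data

        def write(self, block_id, data):
            self.blocks[block_id] = bytearray(data)
            self.sums[block_id] = hashlib.sha256(data).hexdigest()

        def read(self, block_id):
            data = bytes(self.blocks[block_id])
            if hashlib.sha256(data).hexdigest() != self.sums[block_id]:
                raise IOError("silent corruption detected in block %r" % block_id)
            return data

    store = ChecksummedStore()
    store.write("b0", b"simulation results")
    store.blocks["b0"][0] ^= 0x01             # simulate a bit flipped on disk
    try:
        store.read("b0")
    except IOError as err:
        print(err)                            # corruption is caught, not returned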

    Sun Storage 7000 Unified Storage Systems rely heavily on ZFS for key functionality such

    as Hybrid Storage Pools. By automatically allocating space from pooled storage when

    needed, ZFS simplifies storage management and gives organizations the flexibility to

    optimize data for performance. Hybrid Storage Pools also effectively combine the

    strengths of system memory, flash memory technology in the form of enterprise solid

    state drives (SSDs), and conventional hard disk drives (HDDs).

    Key capabilities of ZFS related to Hybrid Storage Pools include:

Virtual storage pools: Unlike traditional file systems that require a separate volume manager, ZFS integrates volume management functions.

Data integrity: ZFS uses several techniques, such as copy-on-write and end-to-end checksumming, to keep on-disk data self-consistent and eliminate silent data corruption.

High performance: ZFS simplifies the code paths from the application to the hardware, delivering sustained throughput at near platter speeds.

Simplified administration: ZFS automates many administrative tasks to speed performance and eliminate common errors.

    Sun Storage 7000 Unified Storage Systems utilize ZFS Hybrid Storage Pools to

    automatically provide data placement, data protection, and data services such as RAID,

    error correction, and system management. By placing data on the most appropriate

    storage media, Hybrid Storage Pools help to optimize performance and contain costs.

    Sun Storage 7000 Unified Storage Systems feature a common, easy-to-use management



    interface, along with a comprehensive analytics environment to help isolate and

    resolve issues. The systems support NFS, CIFS, and iSCSI data access protocols, mirrored

    and parity-based data protection, local point-in-time (PIT) copy, remote replication, data

    checksum, data compression, and data reconstruction.

Long-Term Retention and Archive

Staging, storing, and maintaining HPC data requires a massive repository of on-line and

    near-line storage to support data retention and archival needs. High-speed data

    movement must be provided between computational and archival environments. The

    Sun Constellation System addresses this need by integrating with a wealth of

    sophisticated Sun StorageTek options, including:

    Sun StorageTek SL8500 and SL500 Modular Library Systems

    Sun StorageTek 6540 and 6140 Modular Arrays

    High-speed data movers

    Sun StorageTek 5800 system fixed-content archive

    The comprehensive Sun StorageTek software offering is key to facilitating seamless

migration of data between cache and archive.

    Sun StorageTek QFS

    Sun StorageTek QFS software provides high-performance heterogeneous shared

    access to data over a storage area network (SAN). Users across the enterprise get

    shared access to the same large files or data sets simultaneously, speeding time to

    results. Up to 256 systems running Sun StorageTek QFS technology can have

shared access to the same data while maintaining file integrity. Data can be written and accessed at device-rated speeds, providing superior application I/O

    rates. Sun StorageTek QFS software also provides heterogeneous file sharing using

    NFS, CIFS, Apple Filing Protocol, FTP, and Samba.

    Sun StorageTek Storage Archive Manager (SAM) software

    Large HPC installations must manage the considerable storage required by

    multiple projects running large-scale computational applications on very large

    datasets. Solutions must provide a seamless and transparent migration for

    essential archival data between disk and tape storage systems. Sun StorageTek

    Storage Archive Manager (SAM) addresses this need by providing data

    classification and policy-driven data placement across tiers of storage.

    Organizations can benefit from data protection as well as long-term retention and

    data recovery to match their specific needs.

Chapter 6 provides additional detail and a graphical depiction of how caching file

    systems such as the Lustre parallel file system combine with SAM-QFS in a real-world

    example to provide data management in large supercomputing installations.


    Chapter 5

    Sun HPC Software

    As clusters move from the realm of supercomputing to the enterprise, cluster software

    has never been more important. Organizations deploying clusters at all levels need

    better ways to control and monitor often expansive cluster deployments in ways that

    benefit their users and applications. Unfortunately, collecting, assembling, testing, and

    patching all of the requisite software components for effective cluster operation has

    proved challenging, to say the least.

Available in both a Linux Edition and a Solaris Developer Edition, Sun HPC Software is

    designed to address these needs. Sun HPC Software, Linux Edition is detailed in the

    sections that follow. For more information on Sun HPC Software, Solaris Developer

Edition, please see http://wikis.sun.com/display/hpc/Sun+HPC+Software,+Solaris+Developer+Edition+1.0+Beta+1

Sun HPC Software, Linux Edition

Many HPC customers are demanding Linux-based HPC solutions with open source

    components. To answer these demands, Sun has introduced Sun HPC Software, Linux

Edition, an integrated solution for Linux HPC clusters based on Sun hardware. More

    than a mere collection of software components, Sun HPC Software simplifies the entire

    process of deploying and managing large-scale Linux HPC clusters, providing

    considerable potential savings in maintenance time and expense.

From its inception, the project's goals were to provide an open product, one that

    uses as much open source software as possible, and one that depends on and enhances

    the community aspects of software development and consolidation. The ongoing goals

    for Sun HPC software are to:

    Provide simple, scalable provisioning of bare-metal systems into a running HPC

    cluster

    Validate configurations

    Dramatically reduce time-to-results

    Offer integrated management and monitoring of the cluster

    Employ a community-driven process

Seamless and Scalable Integration

Sun HPC Software, Linux Edition covers the entire cluster life-cycle. The software

    provides everything needed to provision the cluster nodes, verify that the software and

hardware are working correctly, manage the cluster, and monitor the cluster's

    performance and health. All of the components have been fully tested on Sun HPC

    hardware, so that the likelihood of post-installation integration problems is significantly

    reduced.


    Because clusters can vary widely in size, Sun HPC software is designed to be scalable,

    and all of the components are selected with large numbers of nodes in mind. For

    example, the Lustre parallel file system and OneSIS provisioning software are both well

known for working well with clusters comprised of thousands of nodes. Tools that provision, verify, manage, and monitor the cluster were likewise selected for their

    scalability to reduce the management cost as clusters grow.

    Sun HPC software, Linux Edition is built to be completely modular so that organizations

    can customize it according to their own preferences and requirements. The modular

    framework provides a ready-made stack that contains the components required to

    deploy an HPC cluster. Add-on components let organizations make specific choices

    beyond the core software installed. Figure 16 provides a high-level perspective of Sun

    HPC Software, Linux Edition. For more specific information on the components provided

    at each level, please see www.sun.com/software/products/hpcsoftware, or send an e-

mail to [email protected] to join the community.

    Figure 16. Sun HPC Software stack, Linux Edition

    Sun HPC Software, Linux Edition 2.0.1 contains the components listed in Table 3.

    Table 3. Sun HPC Software 2.0.1 components

Category | Sun HPC Software, Linux Edition 1.2
Operating System and kernel | Red Hat Enterprise Linux, CentOS Linux, Lustre parallel file system, perfctr, SuSE Linux Enterprise Server
User space library | Allinea DDT, Env-switcher, genders, git, Heartbeat, Intel Compiler, Mellanox firmware tools, Modules, MVAPICH and MVAPICH2, OFED, OpenMPI, PGI compiler, RRDtool, Sun Studio, Sun HPC ClusterTools, TotalView
Verification | HPCC Bench Suite, Lustre IOkit, IOR, LNET selftest, NetPIPE
Schedulers | Sun Grid Engine, PBS, LSF, SLURM, MOAB, MUNGE
Monitoring | Ganglia
Provisioning | OneSIS, Cobbler
Management | CFEngine, Conman, FreeIPMI, gtdb, IBSRM, IPMItool, lshw, OpenSM, pdsh, Powerman, Sun Ops Center


Simplified Cluster Provisioning

Sun HPC Software, Linux Edition is specifically designed to simplify the complex task of

    provisioning systems as a part of a clustered environment. For example, the software

    stack includes the OneSIS open source software tool developed at Sandia National

    Laboratory. This tool is specifically designed to ease system administration in large-

    scale Linux cluster environments.

    The software stack itself can be downloaded from the Web, and is designed to fit onto a

    single DVD. While installing the first system in a cluster might take place from the DVD,

    it goes without saying that installing an entire large cluster in this fashion would

    consume unacceptable amounts of time, not to mention the additional time to

    maintain and update individual system images. With OneSIS, administrators can create

    system images that define the behavior of the entire computing infrastructure.

    A typical installation process approximates the following:

The software is first installed onto a management node; installing the system locally via DVD typically takes about 20 minutes from bare metal to a login prompt

    Configuring the system requires another 20 minutes at most

    Other cluster systems are then booted onto the master image

    The cluster can be running and ready to accept jobs in as little as 50 minutes.


    Chapter 6

    Deploying Supercomputing Clusters Rapidly

    with Less Risk

    Sun has considerable experience helping organizations deploy supercomputing clusters

    specific to their computational, storage, and collaborative requirements.

    Complementing the compelling capabilities of the Sun Constellation System, Sun

    provides a range of services that are specifically aimed at delivering results for HPC-

focused organizations. Sun's partnership with the Texas Advanced Computing Center

    (TACC) at the University of Texas at Austin to deliver the Sun Constellation System in the

    3,936-node Ranger supercomputing cluster is one such example.

Sun Datacenter Express Services

Sun's new Datacenter Express Services provide a comprehensive, al