
Dell | Terascala HPC Storage Solution (DT-HSS3)

A Dell Technical White Paper

David Duncan


    THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL

    ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR

    IMPLIED WARRANTIES OF ANY KIND.

    © 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without

    the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.

    Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.

    Other trademarks and trade names may be used in this document to refer to either the entities

    claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in

    trademarks and trade names other than its own.


Contents

Figures
Tables
Introduction
Lustre Object Storage Attributes
Dell | Terascala HPC Storage Solution Description
    Metadata Server Pair
    Object Storage Server Pair
    Scalability
    Networking
Managing the Dell | Terascala HPC Storage Solution
Performance Studies
    N-to-N Sequential Reads / Writes
    Random Reads and Writes
    IOR N-to-1 Reads and Writes
    Metadata Testing
Conclusions
Appendix A: Benchmark Command Reference
Appendix B: Terascala Lustre Kit
References


Figures

Figure 1: Lustre Overview
Figure 2: Dell PowerEdge R710
Figure 3: Metadata Server Pair
Figure 4: OSS Pair
Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration
Figure 6: MDS Cable Configuration
Figure 7: OSS Scalability
Figure 8: Cable Diagram for OSS
Figure 9: Terascala Management Console
Figure 10: Test Cluster Configuration
Figure 11: Sequential Reads / Writes DT-HSS3
Figure 12: N-to-N Random Reads and Writes
Figure 13: N-to-1 Read / Write
Figure 14: Metadata Create Comparison
Figure 15: Metadata File / Directory Stat
Figure 16: Metadata File / Directory Remove


Tables

Table 1: Test Cluster Details
Table 2: DT-HSS3 Configuration
Table 3: Metadata Server Configuration


    Introduction

In high-performance computing (HPC), efficient delivery of data to and from compute nodes is critical. Given the rate at which research workloads consume and produce data, storage is fast becoming a bottleneck, and managing complex storage systems is a growing burden on storage administrators and researchers. Data requirements are increasing rapidly in terms of both performance and capacity, and increasing the throughput and scalability of the storage supporting the compute nodes typically requires a great deal of planning and configuration.

The Dell | Terascala HPC Storage Solution (DT-HSS) is designed for researchers, universities, HPC users, and enterprises looking to deploy a fully supported, easy-to-use, scale-out, and cost-effective parallel file system storage solution. DT-HSS is a scale-out storage solution that provides high-throughput storage as an appliance. An intelligent yet intuitive management interface simplifies managing and monitoring all of the hardware and file system components. The solution is easy to scale in capacity, performance, or both, providing a clear path for future growth. The storage appliance uses Lustre®, the leading open source parallel file system.

The storage solution is delivered as a pre-configured, ready-to-go appliance and is available with full hardware and software support from Dell and Terascala. Built on enterprise-ready Dell PowerEdge™ servers and PowerVault™ storage products, the latest generation Dell | Terascala HPC Storage Solution, referred to as DT-HSS3 in the rest of this paper, delivers a strong combination of performance, reliability, ease of use, and cost-effectiveness.

Because of the changes made in this third generation of the Dell | Terascala High-Performance Computing Storage Solution, it is important to evaluate the performance of the solution and establish what should be expected from it. This paper describes this generation of the solution and outlines some of its performance characteristics.

Many details of the Lustre filesystem and its integration with Platform Computing's Platform Cluster Manager (PCM) remain unchanged from the previous generation solution, DT-HSS2. Readers may refer to Appendix B for instructions on integrating with PCM.

Lustre Object Storage Attributes

Lustre is a clustered file system, offering high performance through parallel access and distributed locking. In the DT-HSS family, access to the storage is provided via a single namespace. This single namespace is easily mounted from PCM and managed through the Terascala Management Console.

A parallel file system such as Lustre delivers its performance and scalability by distributing data (a technique called "striping") across multiple access points, allowing multiple compute engines to access data simultaneously. A Lustre installation consists of three key systems: the metadata system, the object storage system, and the compute clients.

The metadata system comprises the Metadata Target (MDT) and the Metadata Server (MDS). The MDT stores all metadata for the file system, including file names, permissions, time stamps, and the locations of data objects within the object storage system. The MDS is the dedicated server that manages the MDT. For the Lustre version used in this testing, an active/passive pair of MDS servers provides a highly available metadata service to the cluster.

    The object storage system is comprised of the Object Storage Target (OST) and the Object Storage

    Server (OSS). The OST provides storage for file object data, while the OSS manages one or more OSTs.


There are typically several active OSSs at any time. Lustre delivers increased throughput as the number of active OSSs grows: each additional OSS increases both the maximum network throughput and the storage capacity. Figure 1 shows the relationship of the OSS and OST components of the Lustre filesystem.

    Figure 1: Lustre Overview

The Lustre client software is installed on the compute nodes to gain access to data stored in the Lustre clustered filesystem. To the clients, the filesystem appears as a single branch of the filesystem tree. This single directory provides a simple starting point for application data access and ensures access via tools native to the operating system for ease of administration.

Lustre also includes a sophisticated storage network protocol layer, referred to as LNET, that can leverage features of certain types of networks; the DT-HSS3 utilizes InfiniBand in support of LNET's capabilities. When LNET is used with InfiniBand, Lustre can take advantage of the RDMA capabilities of the InfiniBand fabric to provide more rapid I/O transport than can be achieved by typical networking protocols.

The elements of the Lustre clustered file system are as follows:

Metadata Target (MDT) – Tracks the location of "stripes" of data
Object Storage Target (OST) – Stores the stripes or extents on a filesystem
Lustre Clients – Access the MDS to determine where files are located, then access the OSSs to read and write data
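As a concrete illustration of this division of labor, the layout of any file can be inspected from a Lustre client with the standard lfs utility (the mount point and file name below are hypothetical):

lfs getstripe /mnt/lustre/results/run01.dat

The output reports the stripe count, stripe size, and the OST index (obdidx) of each object, which is exactly the MDT-recorded mapping that clients follow when they read and write data directly from the OSSs.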

Typically, Lustre deployments and configurations are considered complex and time consuming. Lustre installation and administration are performed via a command line interface, which may prevent a systems administrator unfamiliar with Lustre from performing an installation, possibly keeping his or her organization from experiencing the benefits of a clustered filesystem. The Dell | Terascala HPC Storage Solution removes the complexities of installation and minimizes Lustre deployment and configuration time. This decrease in time and effort leaves more opportunity for filesystem testing and speeds the general preparation for production use.



Dell | Terascala HPC Storage Solution Description

The DT-HSS3 solution provides a pre-configured storage solution consisting of Lustre Metadata Servers (MDS), Lustre Object Storage Servers (OSS), and the associated storage. The application software images have been modified to support the PowerEdge R710 as a standards-based replacement for the custom-fabricated servers previously used as the Object Storage and Metadata Servers. This substitution, shown in Figure 2, significantly enhances the performance and serviceability of these solution components while decreasing the overall complexity of the solution.

    Figure 2: Dell PowerEdge R710

    Metadata Server Pair

In the latest DT-HSS3 solution, the Metadata Server (MDS) pair comprises two Dell PowerEdge R710 servers configured as an active/passive cluster. Each server is directly attached to a Dell PowerVault MD3220 storage array housing the Lustre MDT. The MD3220 is fully populated with 24 500GB, 7.2K RPM, 2.5" near-line SAS drives configured as a RAID 10, for a total available storage of 6TB. In this single Metadata Target (MDT), the DT-HSS3 provides 4.8TB of formatted space for filesystem metadata. The MDS is responsible for handling file and directory requests and routing tasks to the appropriate Object Storage Targets (OSTs) for fulfillment. With this single MDT, the maximum number of files served will be in excess of 1.45 billion. Storage requests are handled across a single 40Gb QDR InfiniBand connection via LNET.

Figure 3: Metadata Server Pair


    Object Storage Server Pair

    Figure 4: OSS Pair

The hardware improvements also extend to the Object Storage Server (OSS). The previous generation of the OSS was built on a custom-fabricated server; the PowerEdge R710 is now the standard server for this solution. In the DT-HSS3, Object Storage Servers are arranged in two-node high availability (HA) clusters providing active/active access to two Dell PowerVault MD3200 storage arrays. Each MD3200 array is fully populated with 12 2TB 3.5" 7.2K RPM near-line SAS drives. The capacity of each MD3200 array can be extended with up to seven additional MD1200 enclosures. Each OSS pair provides raw storage capacity ranging from 48TB up to 384TB.

Object Storage Servers (OSSs) are the building blocks of the DT-HSS solutions. Using the two 6Gb SAS controllers in each PowerEdge R710, the two servers are redundantly connected to each of the two MD3200 storage arrays. The MD3200 storage arrays are extended with SAS-attached MD1200 enclosures to provide additional capacity.

The 12 7.2K RPM 2TB near-line SAS drives in each PowerVault MD3200 or MD1200 enclosure provide a total of 24TB of raw storage capacity. The storage in each enclosure is divided equally into two RAID 5 (five data and one parity disk) virtual disks, yielding two Object Storage Targets (OSTs) per enclosure. Each OST provides 9TB of formatted object storage space, and the OSS pair is expanded by four OSTs with each stage of expansion. The OSTs are served over LNET via 40Gb QDR InfiniBand connections.

When viewed from any compute node equipped with the Lustre client, the entire namespace can be reviewed and managed like any other filesystem, but with enhancements for Lustre management.
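For example, the formatted capacities described above can be confirmed from any mounted client with the standard lfs utility (a simple check, not specific to the appliance):

lfs df -h

Each OST appears as a roughly 9TB volume and the MDT as a 4.8TB volume, and the totals grow as OSTs are added to the filesystem.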


    Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration

    Scalability

Providing the Object Storage Servers in active/active cluster configurations yields greater throughput and product reliability. This is also the recommended configuration for decreasing maintenance requirements and, consequently, potential downtime.

The PowerEdge servers provide greater performance and simplify maintenance by eliminating specialty hardware. This third-generation solution continues to scale from 48TB up to 384TB of raw storage for each OSS pair. The DT-HSS3 solution leverages QDR InfiniBand interconnects for high-speed, low-latency storage transactions. The third-generation upgrade to the PowerEdge R710 takes advantage of the PCIe Gen2 interface for QDR InfiniBand, helping achieve higher network throughput per OSS. The Terascala Management Console offers the same level of management and monitoring as before, and the Terascala Lustre Kit for PCM 2.0.1 remains unchanged at version 1.1. This combination of components, from storage to client access, is formally offered as the DT-HSS3 appliance.

An example 96TB configuration built with the PowerEdge R710 is shown in Figure 5.


    Figure 6: MDS Cable Configuration

    Figure 7: OSS Scalability

Scaling the DT-HSS3 is achieved in two ways. The first method, demonstrated in Figure 8, involves simply expanding the storage capacity of a single OSS pair by adding MD1200 storage arrays in pairs. This increases the volume of storage available while maintaining a consistent maximum network throughput. As shown in Figure 7, the second method adds additional OSS pairs, increasing both the total network throughput and the storage capacity at once.

See Figure 8 for detail of the cable configuration and of expanding the DT-HSS3 storage by increasing the number of storage arrays.


    Figure 8: Cable Diagram for OSS

    Networking

    Management Network

    The private management network provides a communication infrastructure for Lustre and Lustre HA

    functionality as well as storage configuration and maintenance. This network creates the segmentation

    required to facilitate day-to-day operations and to limit the scope of troubleshooting and maintenance.

    Access to this network is provided via the Terascala Management Server, the TMS2000.

The TMS2000 exposes a single communications port for the external management network, and all user-level access is via this device. The TMS2000 is responsible for user interaction as well as system health management and monitoring. While the TMS2000 is responsible for collecting data and for management, it does not play an active role in the Lustre filesystem and can be serviced without requiring downtime on the filesystem. The TMS2000 presents the collected data and provides management via an interactive GUI; users interact with it through the Terascala Management Console.

    Data Network

The Lustre filesystem is served via a preconfigured LNET implementation over InfiniBand. The InfiniBand network provides fast transfer speeds with low latency, and LNET leverages RDMA for rapid file and metadata transfer from MDTs and OSTs to the clients. The OSS and MDS servers take advantage of the full QDR InfiniBand fabric with single-port ConnectX-2 40Gb adapters. The QDR InfiniBand controllers can be easily integrated into existing DDR networks using QDR-to-DDR crossover cables if necessary.
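For reference, this preconfiguration amounts to pointing LNET at the InfiniBand interface through the Lustre kernel module options on each server and client, which can then be verified with lctl. The interface name below is an assumption:

In /etc/modprobe.conf (Lustre 1.8 era): options lnet networks="o2ib0(ib0)"
To verify, on any node with the Lustre modules loaded: lctl list_nids

The first line binds LNET to the RDMA-capable o2ib driver on ib0; the second prints the node's LNET identifiers so the binding can be confirmed.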

Managing the Dell | Terascala HPC Storage Solution

The Terascala Management Console (TMC) takes the complexity out of administering a Lustre file system by providing a centralized GUI for management purposes. The TMC can be used as a tool to standardize the following actions: mounting and unmounting the file system, initiating failover of the file system from one node to another, and monitoring the performance of the file system and the status of its components. Figure 9 illustrates the TMC main interface.


The TMC is a Java-based application that can be run from any computer equipped with a Java JRE to remotely manage the solution (assuming all security requirements are met). It provides a comprehensive view of the hardware and filesystem while allowing monitoring and management of the DT-HSS solution.

Figure 9 shows the initial window view of a DT-HSS3 system. The left pane of the window lists all the key hardware elements of the system; each element can be selected to drill down for additional information. The center pane presents a view of the system from the Lustre perspective, showing the status of the MDS and the various OSS nodes. The right pane is a message window that highlights any conditions or status changes, and the bottom pane displays a view of the PowerVault storage arrays.

Using the TMC, many tasks that once required complex CLI instructions can now be completed easily with a few mouse clicks. The TMC can be used to shut down a file system, initiate a failover from one MDS to another, monitor the MD32XX arrays, and so on.
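For comparison, the operation that the appliance and TMC ultimately coordinate on each compute node is the standard Lustre client mount; the MGS node name, filesystem name, and mount point below are hypothetical:

mount -t lustre mds1@o2ib0:/lustre /mnt/lustre

Everything beyond this single command (failover, health monitoring, and array management) is what the TMC layers on top.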

    Figure 9: Terascala Management Console

Performance Studies

The goal of the performance studies in this paper is to profile the capabilities of a selected DT-HSS3 configuration: to identify points of peak performance and the most appropriate methods for scaling to a variety of use cases. The test bed is a Dell HPC compute cluster with the configuration described in Table 1.

A number of performance studies were executed, stressing a 192TB DT-HSS3 configuration with different types of workloads to determine the limits of its performance and how well that performance is sustained.


    Table 1: Test Cluster Details

Compute Nodes: Dell PowerEdge M610, 16 each in 4 PowerEdge M1000e chassis
Server BIOS: 3.0.0
Processors: Intel Xeon X5650
Memory: 6 x 4GB 1333 MHz RDIMMs
Interconnect: InfiniBand – Mellanox MT26428 [ConnectX-2 IB QDR, Gen-2 PCIe]
IB Switch: Mellanox M3601Q QDR blade chassis I/O switch module
Cluster Suite: Platform PCM 2.0.1
OS: Red Hat Enterprise Linux 5.5 x86_64 (2.6.18-194.el5)
IB Stack: OFED 1.5.1

Testing focused on three key performance markers:

Throughput (data transferred, in MiB/s)
I/O operations per second (IOPS)
Metadata operations per second

The goal is a broad but accurate review of the computing platform.

There are two types of file access methods used in these benchmarks. The first is N-to-N, where every thread of the benchmark (N clients) writes to a different file (N files) on the storage system; IOzone and IOR can both be configured to use the N-to-N file-access method. The second is N-to-1, where every thread writes to the same file (N clients, 1 file). IOR can use MPI-IO, HDF5, or POSIX to run N-to-1 file-access tests. N-to-1 testing determines how the file system handles the overhead introduced when multiple clients (threads) write concurrently to the same file; this overhead comes from threads dealing with Lustre's file locking and serialized writes. See Appendix A for examples of the commands used to run these benchmarks.

The number of threads or tasks used in each test is chosen around the test cluster configuration. The thread count corresponds to the number of hosts up to 64; that is, one thread per host is used until 64 hosts are reached. Thread counts above 64 are achieved by increasing the number of threads per client across all clients. IOzone limits the number of threads to 255, so for the final IOzone data point one client runs three threads while the other 63 clients run four.
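The thread placement described above is controlled by the host list passed to IOzone with -+m (see Appendix A). Each line of that file names a client, the working directory on that client, and the path to the IOzone executable; listing a host more than once runs additional threads on it. The host names and working directory below are hypothetical:

compute-0-1 /mnt/lustre/iozone /root/iozone/iozone
compute-0-2 /mnt/lustre/iozone /root/iozone/iozone
compute-0-3 /mnt/lustre/iozone /root/iozone/iozone

Once all 64 clients are listed, repeating the client names adds a second, third, and fourth thread per host, up to IOzone's 255-thread limit.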

    Figure 10 shows a diagram of the cluster configuration used for this study.



    Figure 10: Test Cluster Configuration

(Diagram: the head node and compute nodes / Lustre clients in four PowerEdge M1000e chassis with M3601Q InfiniBand switch modules, an external 36-port QDR InfiniBand switch, the Terascala management network with the TMS2000, the MDS servers, the OSS pair, and MD1200 enclosures for additional storage.)

    The test environment for the DT-HSS3 has a single MDS Pair and a single OSS Pair. There is a total of

    192TB of raw disk space for the performance testing. The OSS pair contains two PowerEdge R710s, each

    with 24GB of memory, two 6Gbps SAS controllers and a single Mellanox ConnectX-2 QDR HCA. The MDS

    has 48GB of memory, a single 6Gbps SAS controller and a single Mellanox ConnectX-2 QDR HCA.

The InfiniBand fabric was configured as a 50% blocking fabric. It was created by populating mezzanine slot B on all blades, using a single InfiniBand switch in I/O module slot B1 of each chassis, and using one external 36-port switch. Eight external ports on each internal switch were connected to the external 36-port switch. As the total throughput to and from any chassis exceeds the throughput to the OSS servers by 4:1, this fabric configuration exceeded the bandwidth requirements necessary to fully utilize the OSS pair and MDS pair for all testing.
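The blocking and bandwidth ratios quoted above follow directly from the port counts; as a rough check (not part of the original study):

16 blades per chassis : 8 uplinks per chassis switch = 2:1, i.e. a 50% blocking fabric
8 QDR uplinks per chassis : 2 QDR links into the OSS pair = 4:1 chassis-to-storage bandwidth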


    Table 2: DT-HSS3 Configuration

    DT-HSS3 Configuration

    Configuration Size 192TB Medium

    Lustre Version 1.8.4

    # of Drives 96 x 2TB NL SAS drives.

    OSS Nodes 2 x PowerEdge R710 Servers with 24GB Memory

OSS Storage Arrays 2 x MD3200, each with 3 x MD1200 expansion enclosures

    Drives in OSS Storage Arrays 96 3.5" 2TB 7200RPM Nearline SAS

    MDS Nodes 2 x PowerEdge R710 Servers with 48GB Memory

    MDS Storage Array 1 x MD3220

    Drives in MDS Storage Array 24 2.5" 500GB 7200RPM Nearline SAS

    Infiniband

    PowerEdge R710 Servers Mellanox ConnectX-2 QDR HCA

    Compute Nodes Mellanox ConnectX-2 QDR HCA - Mezzanine Card

    External QDR IB Switch Mellanox 36 Port is5030

    IB Switch Connectivity QDR Cables

All tests are performed with a cold cache established using the following technique. After each test, the filesystem is unmounted from all clients and then unmounted from the metadata server. Once the metadata server is unmounted, a sync is performed and the kernel is instructed to drop its caches with the following commands:

    sync

    echo 3 > /proc/sys/vm/drop_caches
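A minimal sketch of this cold-cache sequence, assuming passwordless ssh, a hypothetical client host list, and hypothetical server names and mount point (on the DT-HSS3 itself, unmounting and remounting is normally driven from the TMC):

for c in $(cat ./client_list); do ssh $c umount /mnt/lustre; done
for s in mds1 mds2 oss1 oss2; do ssh $s "sync; echo 3 > /proc/sys/vm/drop_caches"; done
for c in $(cat ./client_list); do ssh $c mount -t lustre mds1@o2ib0:/lustre /mnt/lustre; done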

To further guard against client cache effects, the compute node responsible for writing a sequential file was never the same node used for reading it back. A file size of 64GB was chosen for sequential writes. This is larger than the combined 48GB of memory in the OSS pair (two PowerEdge R710s with 24GB each), so all sequential reads and writes have an aggregate sample size of 64GB multiplied by the number of threads, up to 64GB x 255 threads, or more than 15TB. The block size for IOzone was set to 1MB to match the 1MB request size.

    In measuring the performance of the DT-HSS3, all tests were performed within similar environments.

    The filesystem was configured to be fully functional and the targets tested were emptied of files and

    directories prior to each test.

    N-to-N Sequential Reads / Writes

The sequential testing was done with IOzone version 3.347. Throughput results are reported by IOzone in KB/sec and are converted to MiB/s for consistency. N-to-N testing describes the technique of generating a separate file for each process thread spawned by IOzone. The file size determined to be appropriate for testing is 64GB; each individual file written was large enough to saturate the total available cache on each OSS/MD3200 pair. Files written were distributed across the


    client LUNs evenly. This was to prevent uneven I/O load on any single SAS connection in the same way

    that a user would expect to balance a workload.

    Figure 11: Sequential Reads / Writes DT-HSS3

Figure 11 demonstrates the performance of the DT-HSS3 192TB test configuration. Write performance peaks at 4.1GB/sec at sixteen concurrent threads and continues to exceed 3.0GB/sec as each of the 64 clients writes up to four files at a time; with four files per client, sixteen files are being written simultaneously on each OST. Read performance exceeds 6.2GB/sec for sixteen simultaneous processes and continues to exceed 3.5GB/sec while four threads per compute node access individual OSTs. Write and read performance rise steadily as the number of process threads increases up to sixteen; this is the result of increasing the number of OSTs utilized along with the number of processes and compute nodes. When the client-to-OST ratio (CO) exceeds one, the write load is distributed between multiple files on each OST, consequently increasing seek and Lustre callback time. By careful crafting of the IOzone hosts file, each added thread is balanced between blade chassis, compute nodes, and SAS controllers, and adds a single file to each of the OSTs, allowing a consistent increase in the number of files written while creating minimal locking contention between the files. As the number of files written increases beyond one per OST, throughput declines. This is most likely the result of the larger number of requests per OST: throughput drops as the number of clients making requests to each OST increases, and positional delays and other overhead are expected as a result of the increased randomization of client requests. To maintain the higher throughput for a greater number of files, increasing the number of OSTs should help. A review of the storage array performance from the Dell PowerVault Modular Disk Storage Manager, or with the SMcli performance monitor, was done to independently confirm the throughput values produced by the benchmarking tools.

    Random Reads and Writes

    The same tool, IOzone, was used to gather random reads and writes metrics. In this case, we have

    chosen to write a total of 64GB during any one execution. This total 64GB size is divided amongst the



threads evenly. The IOzone hosts are arranged to distribute the workload evenly across the compute nodes. The storage is addressed as a single volume with an OST count of 16 and a stripe size of 4MB. Because performance is measured in operations per second (OPS), a 4KB request size is used; this aligns with the 4KB filesystem block size of this Lustre version.

Figure 12 shows that random writes saturate at sixteen threads and around 5,200 IOPS. At sixteen threads, the client-to-OST ratio (CO) is one. As the number of IOzone threads increases, the CO rises to a maximum of 4:1 at 64 threads. Reads increase rapidly from sixteen to 32 threads and then continue to increase at a relatively steady rate. Because writes require a file lock per OST accessed, saturation is not unexpected; reads take advantage of Lustre's ability to grant overlapping read extent locks for part or all of a file. Increasing the number of disks in the single OSS pair, or adding OSS pairs, can increase the throughput.
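As a rough back-of-the-envelope conversion (not part of the original measurements), the saturated random-write rate corresponds to only a small fraction of the sequential bandwidth, which is expected at a 4KB request size:

5,200 IOPS x 4KB per request ≈ 20MB/sec of random-write throughput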

    Figure 12: N-to-N Random reads and writes.


    IOR N-to-1 Reads and Writes

Performance of the DT-HSS3 with reads and writes to a single file was reviewed with the IOR benchmarking tool. IOR is written to accommodate MPI communications for parallel operations and has support for manipulating Lustre striping. IOR allows several different I/O interfaces for working with files, but this testing used the POSIX interface to exclude more advanced features and their associated overhead. This gives an opportunity to review the filesystem and hardware performance independent of those additional enhancements; in addition, most applications have adopted a one-file-per-processor approach to disk I/O rather than using parallel I/O APIs.

    IOR benchmark version 2.10.3 was used in this study, and is available at

    http://sourceforge.net/projects/ior-sio/. The MPI stack used for this study was Open MPI version

    1.4.1, and was provided as part of the Mellanox OFED kit.


The configuration for writing included a directory set with striping characteristics designed to stripe across all sixteen OSTs with a stripe chunk size of 4MB; the file to which all clients write is thus evenly striped across all OSTs. The request size for Lustre is 1MB, but in this test a transfer size of 4MB was used to match the stripe size.
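A directory with these striping characteristics could be prepared from a client with the standard lfs utility before running IOR; the directory path is hypothetical:

lfs setstripe -c -1 -s 4m /mnt/lustre/ior_test

Here -c -1 stripes every file created in the directory across all available OSTs (sixteen in this configuration, so -c 16 would be equivalent) and -s 4m sets the 4MB stripe size.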

    Figure 13: N-to-1 Read / Write

With the 4MB transfer size set for IOR, reads have the advantage and peak at greater than 5.5GB per second, a value close to the sequential N-to-N performance. Write performance peaks within 10% of the sequential writes, at 3.9GB per second, but as the client-to-OST ratio (CO) increases, write performance settles at a sustained rate of around 2.6GB per second. Reads are less affected, since read locks for clients are allowed to overlap some or all of a given file.

    Metadata Testing

Metadata testing measures the time to complete certain file and directory operations that return attributes. mdtest is an MPI-coordinated benchmark that performs create, stat, and delete operations on files and directories and reports the rate, in operations per second, at which they complete. Because mdtest is MPI enabled, mpirun is used to launch test threads from each included compute node. mdtest can be configured to compare metadata performance for directories and files.

    This study used mdtest version 1.8.3, which is available at http://sourceforge.net/projects/mdtest/.

    The MPI stack used for this study was Open MPI version 1.4.1, and was provided as part of the Mellanox

    OFED kit. The Metadata Server internal details, including processor and memory configuration, are

    outlined in Table 3 for comparison.

    Table 3: Metadata Server Configuration

    Processor 2 x Intel Xeon E5620 2.4Ghz, 12M Cache, Turbo

    Memory 48GB Memory (12x4GB), 1333MHz Dual Ranked RDIMMs



    Controller 2 x SAS 6Gbps HBA

    PCI Riser Riser with 2 PCIe x8 + 2 PCIe x4 Slot

On a Lustre file system, directory metadata operations are all performed on the MDT. The OSTs are queried for object identifiers in order to allocate or locate the extents associated with the metadata operations, so most metadata operations indirectly involve the OSTs. As a result of this dependency, Lustre favors metadata operations on files with lower stripe counts: the greater the stripe count, the more operations are required to complete the OST object allocations or calculations.

The first test stripes files in 1MB stripes over all available OSTs; the next test repeats the same operations with files on a single OST for comparison. Metadata operations on a single OST and the same operations with single-stripe files spread evenly over sixteen OSTs yield very similar results.

    Figure 14: Metadata create comparison

    Preallocating from sixteen OSTs or a single OST does not seem to change the rate at which files are

    created. The performance of the MDT and MDS are more important to the create operation.

(Chart: operations per second versus tasks (nodes), comparing file creates striped over 16 OSTs, file creates on 1 OST, and directory creates over 16 OSTs.)


    Figure 15: Metadata File / Directory Stat

The mdtest benchmark for stat uses the native stat() call, rather than invoking the shell's stat utility, to retrieve file and directory metadata. Directory stats appear to peak around 26,000 operations per second, but the increased number of OSTs makes a significant difference in the number of calls handled at one task and above. If files are striped across more than one OST, all of those OSTs must be contacted in order to accurately calculate the file size. The file stats indicate that beyond 8 tasks the advantage lies in distributing the requests across more than a single OST, but the increase is not significant, as the requests to any one OST represent no more than a fraction of the data requested by the test.

Directory creates do not require the same draw from the prefetch pool of OST object IDs as file creates do. The result is fewer operations to completion and therefore a faster time to completion.

Removal of files requires less communication between the compute nodes (Lustre clients) and the MDS than removing directories, since directory removal requires verification from the MDS that the directory is empty prior to execution. At 32 tasks, a break occurs in the file removal results between the single-OST case and the sixteen-OST case. The metadata operations for directories across multiple OSTs appear to gain a slight advantage once the number of tasks exceeds 32, and for file removal there is a definite increase in performance when the operations are executed across multiple OSTs.



    Figure 16: Metadata File / Directory Remove

(Chart: operations per second versus tasks (nodes), comparing file and directory removes for the 16-OST and 1-OST cases.)


Conclusions

There is a well-known need for scalable, high-performance clustered filesystem solutions, and many customized storage offerings attempt to address it. The third generation of the Dell | Terascala HSS, the DT-HSS3, addresses this need with the added benefit of standards-based, proven Dell PowerEdge and PowerVault products and Lustre, the leading open source clustered filesystem. The Terascala Management Console (TMC) unifies the enhancements and optimizations included in the DT-HSS3 into a single control and monitoring panel for ease of use.

The latest generation Dell | Terascala HPC Storage Solution continues to offer the same advantages as the previous generation solutions, but with greater processing power, more memory, and greater I/O bandwidth. Raw storage scaling from 48TB up to 384TB per Object Storage Server pair, with up to 6.2GB/sec of throughput in a packaged component, is consistent with the needs of high-performance computing environments, and the DT-HSS3 scales horizontally as easily as it scales vertically.

The performance studies demonstrate an increase in throughput for both reads and writes over previous generations for N-to-N and N-to-1 file access. Results from mdtest show a generous increase in metadata operation rates. With the PCIe 2.0 interface, the controllers and HCAs will excel in both high-IOPS and high-bandwidth applications.

The continued use of generally available, industry-standard tools like IOzone, IOR, and mdtest provides an easy way to match current and expected growth against the performance outlined for the DT-HSS3. The profiles reported by each of these tools provide sufficient information to align the configuration of the DT-HSS3 with the requirements of many applications or groups of applications.

In short, the Dell | Terascala HPC Storage Solution delivers all the benefits of scale-out, parallel file system based storage for your high-performance computing needs.


Appendix A: Benchmark Command Reference

This section describes the commands used to benchmark the Dell | Terascala HPC Storage Solution 3.

IOzone

IOzone Sequential Writes –
/root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts

IOzone Sequential Reads –
/root/iozone/iozone -i 1 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts

IOzone IOPS Random Reads / Writes –
/root/iozone/iozone -i 2 -w -O -r 4k -s $SIZE -t $THREAD -I -+n -+m ./hosts

The O_DIRECT command line parameter ("-I") allows IOzone to bypass the cache on the compute nodes where the IOzone threads are running.

IOR

IOR Writes –
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -w -o $IORFILE -s 1 -b $SIZE -t 4m

IOR Reads –
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -r -o $IORFILE -s 1 -b $SIZE -t 4m

mdtest

Metadata Create Files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -C

Stat files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -R -T

Remove files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -r
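The $THREAD variable in the commands above is swept across the thread counts reported in the figures; one possible wrapper, not taken from the original test harness, is:

for THREAD in 1 2 4 8 16 32 64 128 255; do
    /root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts
done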


Appendix B: Terascala Lustre Kit

PCM software uses the following terminology to describe software provisioning concepts:

1. Installer Node – Runs services such as DNS, DHCP, HTTP, TFTP, etc.
2. Components – RPMs
3. Kits – Collections of components
4. Repositories – Collections of kits
5. Node Groups – Allow association of software with a particular set of compute nodes

The following is an example of how to integrate a Dell | Terascala HPC Storage Solution into an existing PCM cluster:

1. Access a list of existing repositories on the head node. The key here is the Repo name: line.
2. Add the Terascala kit to the cluster.
3. Confirm the kit has been added.
4. Associate the kit with the compute node group on which the Lustre client should be installed:
   a. Launch ngedit at the console: # ngedit
   b. On the Node Group Editor screen, select the compute node group to which the Lustre client will be added.
   c. Select the Edit button at the bottom of the screen.
   d. Accept all default settings until you reach the Components screen.
   e. Using the down arrow, select the Terascala Lustre kit.
   f. Expand and select the Terascala Lustre kit component.
   g. Accept the default settings, and on the Summary of Changes screen, accept the changes and push the packages out to the compute nodes.

On the front-end node, there is now a /root/terascala directory that contains sample directory setup scripts. There is also a /home/apps-lustre directory that contains Lustre client configuration parameters. This directory contains a file that the Lustre file system startup script uses to optimize clients for Lustre operations.


References

Dell | Terascala HPC Storage Solution Brief
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-hpcstorage-solution-brief.pdf

Dell | Terascala HPC Storage Solution Whitepaper
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-dt-hss2.pdf

Dell PowerVault MD3200 / MD3220
http://www.dell.com/us/en/enterprise/storage/powervault-md3200/pd.aspx?refid=powervault-md3200&s=biz&cs=555

Dell PowerVault MD1200
http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=555&l=en&oc=MLB1218&s=biz

Lustre Home Page
http://wiki.lustre.org/index.php/Main_Page

Transitioning to 6Gb/s SAS
http://www.dell.com/downloads/global/products/pvaul/en/6gb-sas-transition.pdf

Dell HPC Solutions Home Page
http://www.dell.com/hpc

Dell HPC Wiki
http://www.HPCatDell.com

Terascala Home Page
http://www.terascala.com

Platform Computing
http://www.platform.com

Mellanox Technologies
http://www.mellanox.com