
Dell | Terascala HPC Storage Solution (DT-HSS3)

A Dell Technical White Paper

David Duncan


    THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL

    ERRORS AND TECHNICAL INACCURACIES. THE CONTENT IS PROVIDED AS IS, WITHOUT EXPRESS OR

    IMPLIED WARRANTIES OF ANY KIND.

    © 2011 Dell Inc. All rights reserved. Reproduction of this material in any manner whatsoever without

    the express written permission of Dell Inc. is strictly forbidden. For more information, contact Dell.

    Dell, the DELL logo, and the DELL badge, PowerConnect, and PowerVault are trademarks of Dell Inc.

    Other trademarks and trade names may be used in this document to refer to either the entities

    claiming the marks and names or their products. Dell Inc. disclaims any proprietary interest in

    trademarks and trade names other than its own.


Contents

Figures
Tables
Introduction
Lustre Object Storage Attributes
Dell | Terascala HPC Storage Solution Description
    Metadata Server Pair
    Object Storage Server Pair
    Scalability
    Networking
Managing the Dell | Terascala HPC Storage Solution
Performance Studies
    N-to-N Sequential Reads / Writes
    Random Reads and Writes
    IOR N-to-1 Reads and Writes
    Metadata Testing
Conclusions
Appendix A: Benchmark Command Reference
Appendix B: Terascala Lustre Kit
References


Figures

Figure 1: Lustre Overview
Figure 2: Dell PowerEdge R710
Figure 3: Metadata Server Pair
Figure 4: OSS Pair
Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration
Figure 6: MDS Cable Configuration
Figure 7: OSS Scalability
Figure 8: Cable Diagram for OSS
Figure 9: Terascala Management Console
Figure 10: Test Cluster Configuration
Figure 11: Sequential Reads / Writes DT-HSS3
Figure 12: N-to-N Random Reads and Writes
Figure 13: N-to-1 Read / Write
Figure 14: Metadata Create Comparison
Figure 15: Metadata File / Directory Stat
Figure 16: Metadata File / Directory Remove


Tables

Table 1: Test Cluster Details
Table 2: DT-HSS3 Configuration
Table 3: Metadata Server Configuration


    Introduction

In high-performance computing (HPC), efficient delivery of data to and from compute nodes is critical. Given the rate at which research workloads consume and produce data, storage is fast becoming a bottleneck, and managing complex storage systems is a growing burden on storage administrators and researchers. Data requirements are increasing rapidly in terms of both performance and capacity, and increasing the throughput and scalability of the storage supporting the compute nodes typically requires a great deal of planning and configuration.

The Dell | Terascala HPC Storage Solution (DT-HSS) is designed for researchers, universities, HPC users, and enterprises looking to deploy a fully supported, easy-to-use, scale-out, and cost-effective parallel file system storage solution. DT-HSS is a scale-out storage solution that provides high-throughput storage as an appliance. An intelligent yet intuitive management interface simplifies managing and monitoring all of the hardware and file system components. The solution is easy to scale in capacity, performance, or both, providing a clear path for future growth. The storage appliance uses Lustre®, the leading open source parallel file system.

The storage solution is delivered as a pre-configured, ready-to-go appliance and is available with full hardware and software support from Dell and Terascala. Built on enterprise-ready Dell PowerEdge™ servers and PowerVault™ storage products, the latest generation Dell | Terascala HPC Storage Solution, referred to as DT-HSS3 in the rest of this paper, delivers a strong combination of performance, reliability, ease of use, and cost-effectiveness.

Because of the changes made in this third generation of the Dell | Terascala High-Performance Computing Storage Solution, it is important to evaluate the performance of the solution and establish what should be expected from it. This paper describes this generation of the solution and outlines some of its performance characteristics.

Many details of the Lustre filesystem and its integration with Platform Computing's Platform Cluster Manager (PCM) remain unchanged from the previous generation solution, DT-HSS2. Readers may refer to Appendix B for instructions on integrating with PCM.

Lustre Object Storage Attributes

Lustre is a clustered file system, offering high performance through parallel access and distributed locking. In the DT-HSS family, access to the storage is provided via a single namespace. This single namespace is easily mounted from PCM and managed through the Terascala Management Console.

A parallel file system such as Lustre delivers its performance and scalability by distributing data (a technique called "striping") across multiple access points, allowing multiple compute engines to access data simultaneously. A Lustre installation consists of three key systems: the metadata system, the object storage system, and the compute clients.

The metadata system comprises the Metadata Target (MDT) and the Metadata Server (MDS). The MDT stores all metadata for the file system, including file names, permissions, time stamps, and the locations of data objects within the object storage system. The MDS is the dedicated server that manages the MDT. For the Lustre version used in this testing, an active/passive pair of MDS servers provides a highly available metadata service to the cluster.

    The object storage system is comprised of the Object Storage Target (OST) and the Object Storage

    Server (OSS). The OST provides storage for file object data, while the OSS manages one or more OSTs.


There are typically several active OSSs at any time. Lustre delivers increased throughput as the number of active OSSs grows: each additional OSS increases both the maximum network throughput and the storage capacity. Figure 1 shows the relationship of the OSS and OST components of the Lustre filesystem.

    Figure 1: Lustre Overview

The Lustre client software is installed on the compute nodes to gain access to data stored in the Lustre clustered filesystem. To the clients, the filesystem appears as a single branch of the filesystem tree. This single directory provides a simple starting point for application data access and ensures access via tools native to the operating system for ease of administration.

Lustre also includes a sophisticated storage network protocol layer, referred to as LNET, that can leverage features of certain types of networks; the DT-HSS3 utilizes InfiniBand in support of LNET's capabilities. When LNET is used with InfiniBand, Lustre can take advantage of the RDMA capabilities of the InfiniBand fabric to provide more rapid I/O transport than can be achieved by typical networking protocols.

The elements of the Lustre clustered file system are as follows:

Metadata Target (MDT) – Tracks the location of "stripes" of data
Object Storage Target (OST) – Stores the stripes or extents on a filesystem
Lustre Clients – Access the MDS to determine where files are located, then access the OSSs to read and write data
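As a concrete illustration of this division of labor, the layout of any file can be inspected from a Lustre client with the standard lfs utility (the mount point and file name below are hypothetical):

lfs getstripe /mnt/lustre/results/run01.dat

The output reports the stripe count, stripe size, and the OST index (obdidx) of each object, which is exactly the MDT-recorded mapping that clients follow when they read and write data directly from the OSSs.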

Typically, Lustre deployments and configurations are considered complex and time consuming. Lustre installation and administration are performed via a command line interface, which may prevent a systems administrator unfamiliar with Lustre from performing an installation, possibly keeping his or her organization from experiencing the benefits of a clustered filesystem. The Dell | Terascala HPC Storage Solution removes the complexities of installation and minimizes Lustre deployment and configuration time. This decrease in time and effort leaves more opportunity for filesystem testing and speeds the general preparation for production use.



Dell | Terascala HPC Storage Solution Description

The DT-HSS3 solution provides a pre-configured storage solution consisting of Lustre Metadata Servers (MDS), Lustre Object Storage Servers (OSS), and the associated storage. The application software images have been modified to support the PowerEdge R710 as a standards-based replacement for the custom-fabricated servers previously used as the Object Storage and Metadata Servers. This substitution, shown in Figure 2, significantly enhances the performance and serviceability of these solution components while decreasing the overall complexity of the solution.

    Figure 2: Dell PowerEdge R710

    Metadata Server Pair

In the latest DT-HSS3 solution, the Metadata Server (MDS) pair comprises two Dell PowerEdge R710 servers configured as an active/passive cluster. Each server is directly attached to a Dell PowerVault MD3220 storage array housing the Lustre MDT. The MD3220 is fully populated with 24 500GB, 7.2K RPM, 2.5" near-line SAS drives configured as a RAID 10, for a total available storage of 6TB. In this single Metadata Target (MDT), the DT-HSS3 provides 4.8TB of formatted space for filesystem metadata. The MDS is responsible for handling file and directory requests and routing tasks to the appropriate Object Storage Targets (OSTs) for fulfillment. With this single MDT, the maximum number of files served will be in excess of 1.45 billion. Storage requests are handled across a single 40Gb QDR InfiniBand connection via LNET.

Figure 3: Metadata Server Pair


    Object Storage Server Pair

    Figure 4: OSS Pair

The hardware improvements also extend to the Object Storage Server (OSS). The previous generation of the OSS was built on a custom-fabricated server; the PowerEdge R710 is now the standard server for this solution. In the DT-HSS3, Object Storage Servers are arranged in two-node high availability (HA) clusters providing active/active access to two Dell PowerVault MD3200 storage arrays. Each MD3200 array is fully populated with 12 2TB 3.5" 7.2K RPM near-line SAS drives. The capacity of each MD3200 array can be extended with up to seven additional MD1200 enclosures. Each OSS pair provides raw storage capacity ranging from 48TB up to 384TB.

Object Storage Servers (OSSs) are the building blocks of the DT-HSS solutions. Using the two 6Gb SAS controllers in each PowerEdge R710, the two servers are redundantly connected to each of the two MD3200 storage arrays. The MD3200 storage arrays are extended with SAS-attached MD1200 enclosures to provide additional capacity.

The 12 7.2K RPM 2TB near-line SAS drives in each PowerVault MD3200 or MD1200 enclosure provide a total of 24TB of raw storage capacity. The storage in each enclosure is divided equally into two RAID 5 (five data and one parity disk) virtual disks, yielding two Object Storage Targets (OSTs) per enclosure. Each OST provides 9TB of formatted object storage space, and the OSS pair is expanded by four OSTs with each stage of expansion. The OSTs are served over LNET via 40Gb QDR InfiniBand connections.

When viewed from any compute node equipped with the Lustre client, the entire namespace can be reviewed and managed like any other filesystem, but with enhancements for Lustre management.
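For example, the formatted capacities described above can be confirmed from any mounted client with the standard lfs utility (a simple check, not specific to the appliance):

lfs df -h

Each OST appears as a roughly 9TB volume and the MDT as a 4.8TB volume, and the totals grow as OSTs are added to the filesystem.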


    Figure 5: 96TB Dell | Terascala HPC Storage Solution Configuration

    Scalability

Providing the Object Storage Servers in active/active cluster configurations yields greater throughput and product reliability. This is also the recommended configuration for decreasing maintenance requirements and, consequently, potential downtime.

The PowerEdge servers provide greater performance and simplify maintenance by eliminating specialty hardware. This third-generation solution continues to scale from 48TB up to 384TB of raw storage for each OSS pair. The DT-HSS3 solution leverages QDR InfiniBand interconnects for high-speed, low-latency storage transactions. The third-generation upgrade to the PowerEdge R710 takes advantage of the PCIe Gen2 interface for QDR InfiniBand, helping achieve higher network throughput per OSS. The Terascala Management Console offers the same level of management and monitoring as before, and the Terascala Lustre Kit for PCM 2.0.1 remains unchanged at version 1.1. This combination of components, from storage to client access, is formally offered as the DT-HSS3 appliance.

An example 96TB configuration built with the PowerEdge R710 is shown in Figure 5.


    Figure 6: MDS Cable Configuration

    Figure 7: OSS Scalability

Scaling the DT-HSS3 is achieved in two ways. The first method, demonstrated in Figure 8, involves simply expanding the storage capacity of a single OSS pair by adding MD1200 storage arrays in pairs. This increases the volume of storage available while maintaining a consistent maximum network throughput. As shown in Figure 7, the second method adds additional OSS pairs, increasing both the total network throughput and the storage capacity at once.

See Figure 8 for detail of the cable configuration and of expanding the DT-HSS3 storage by increasing the number of storage arrays.


    Figure 8: Cable Diagram for OSS

    Networking

    Management Network

    The private management network provides a communication infrastructure for Lustre and Lustre HA

    functionality as well as storage configuration and maintenance. This network creates the segmentation

    required to facilitate day-to-day operations and to limit the scope of troubleshooting and maintenance.

    Access to this network is provided via the Terascala Management Server, the TMS2000.

The TMS2000 exposes a single communications port for the external management network, and all user-level access is via this device. The TMS2000 is responsible for user interaction as well as system health management and monitoring. While the TMS2000 is responsible for collecting data and for management, it does not play an active role in the Lustre filesystem and can be serviced without requiring downtime on the filesystem. The TMS2000 presents the collected data and provides management via an interactive GUI; users interact with it through the Terascala Management Console.

    Data Network

The Lustre filesystem is served via a preconfigured LNET implementation over InfiniBand. The InfiniBand network provides fast transfer speeds with low latency, and LNET leverages RDMA for rapid file and metadata transfer from MDTs and OSTs to the clients. The OSS and MDS servers take advantage of the full QDR InfiniBand fabric with single-port ConnectX-2 40Gb adapters. The QDR InfiniBand controllers can be easily integrated into existing DDR networks using QDR-to-DDR crossover cables if necessary.
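For reference, this preconfiguration amounts to pointing LNET at the InfiniBand interface through the Lustre kernel module options on each server and client, which can then be verified with lctl. The interface name below is an assumption:

In /etc/modprobe.conf (Lustre 1.8 era): options lnet networks="o2ib0(ib0)"
To verify, on any node with the Lustre modules loaded: lctl list_nids

The first line binds LNET to the RDMA-capable o2ib driver on ib0; the second prints the node's LNET identifiers so the binding can be confirmed.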

Managing the Dell | Terascala HPC Storage Solution

The Terascala Management Console (TMC) takes the complexity out of administering a Lustre file system by providing a centralized GUI for management purposes. The TMC can be used as a tool to standardize the following actions: mounting and unmounting the file system, initiating failover of the file system from one node to another, and monitoring the performance of the file system and the status of its components. Figure 9 illustrates the TMC main interface.


The TMC is a Java-based application that can be run from any computer equipped with a Java JRE to remotely manage the solution (assuming all security requirements are met). It provides a comprehensive view of the hardware and filesystem while allowing monitoring and management of the DT-HSS solution.

Figure 9 shows the initial window view of a DT-HSS3 system. The left pane of the window lists all the key hardware elements of the system; each element can be selected to drill down for additional information. The center pane presents a view of the system from the Lustre perspective, showing the status of the MDS and the various OSS nodes. The right pane is a message window that highlights any conditions or status changes, and the bottom pane displays a view of the PowerVault storage arrays.

Using the TMC, many tasks that once required complex CLI instructions can now be completed easily with a few mouse clicks. The TMC can be used to shut down a file system, initiate a failover from one MDS to another, monitor the MD32XX arrays, and so on.
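For comparison, the operation that the appliance and TMC ultimately coordinate on each compute node is the standard Lustre client mount; the MGS node name, filesystem name, and mount point below are hypothetical:

mount -t lustre mds1@o2ib0:/lustre /mnt/lustre

Everything beyond this single command (failover, health monitoring, and array management) is what the TMC layers on top.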

    Figure 9: Terascala Management Console

Performance Studies

The goal of the performance studies in this paper is to profile the capabilities of a selected DT-HSS3 configuration: to identify points of peak performance and the most appropriate methods for scaling to a variety of use cases. The test bed is a Dell HPC compute cluster with the configuration described in Table 1.

A number of performance studies were executed, stressing a 192TB DT-HSS3 configuration with different types of workloads to determine the limits of its performance and how well that performance is sustained.


    Table 1: Test Cluster Details

Compute Nodes: Dell PowerEdge M610, 16 each in 4 PowerEdge M1000e chassis
Server BIOS: 3.0.0
Processors: Intel Xeon X5650
Memory: 6 x 4GB 1333 MHz RDIMMs
Interconnect: InfiniBand – Mellanox MT26428 [ConnectX-2 IB QDR, Gen-2 PCIe]
IB Switch: Mellanox M3601Q QDR blade chassis I/O switch module
Cluster Suite: Platform PCM 2.0.1
OS: Red Hat Enterprise Linux 5.5 x86_64 (2.6.18-194.el5)
IB Stack: OFED 1.5.1

Testing focused on three key performance markers:

Throughput (data transferred, in MiB/s)
I/O operations per second (IOPS)
Metadata operations per second

The goal is a broad but accurate review of the computing platform.

There are two types of file access methods used in these benchmarks. The first is N-to-N, where every thread of the benchmark (N clients) writes to a different file (N files) on the storage system; IOzone and IOR can both be configured to use the N-to-N file-access method. The second is N-to-1, where every thread writes to the same file (N clients, 1 file). IOR can use MPI-IO, HDF5, or POSIX to run N-to-1 file-access tests. N-to-1 testing determines how the file system handles the overhead introduced when multiple clients (threads) write concurrently to the same file; this overhead comes from threads dealing with Lustre's file locking and serialized writes. See Appendix A for examples of the commands used to run these benchmarks.

The number of threads or tasks used in each test is chosen around the test cluster configuration. The thread count corresponds to the number of hosts up to 64; that is, one thread per host is used until 64 hosts are reached. Thread counts above 64 are achieved by increasing the number of threads per client across all clients. IOzone limits the number of threads to 255, so for the final IOzone data point one client runs three threads while the other 63 clients run four.
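The thread placement described above is controlled by the host list passed to IOzone with -+m (see Appendix A). Each line of that file names a client, the working directory on that client, and the path to the IOzone executable; listing a host more than once runs additional threads on it. The host names and working directory below are hypothetical:

compute-0-1 /mnt/lustre/iozone /root/iozone/iozone
compute-0-2 /mnt/lustre/iozone /root/iozone/iozone
compute-0-3 /mnt/lustre/iozone /root/iozone/iozone

Once all 64 clients are listed, repeating the client names adds a second, third, and fourth thread per host, up to IOzone's 255-thread limit.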

    Figure 10 shows a diagram of the cluster configuration used for this study.



    Figure 10: Test Cluster Configuration

(Diagram: the head node and compute nodes / Lustre clients in four PowerEdge M1000e chassis with M3601Q InfiniBand switch modules, an external 36-port QDR InfiniBand switch, the Terascala management network with the TMS2000, the MDS servers, the OSS pair, and MD1200 enclosures for additional storage.)

    The test environment for the DT-HSS3 has a single MDS Pair and a single OSS Pair. There is a total of

    192TB of raw disk space for the performance testing. The OSS pair contains two PowerEdge R710s, each

    with 24GB of memory, two 6Gbps SAS controllers and a single Mellanox ConnectX-2 QDR HCA. The MDS

    has 48GB of memory, a single 6Gbps SAS controller and a single Mellanox ConnectX-2 QDR HCA.

The InfiniBand fabric was configured as a 50% blocking fabric. It was created by populating mezzanine slot B on all blades, using a single InfiniBand switch in I/O module slot B1 of each chassis, and using one external 36-port switch. Eight external ports on each internal switch were connected to the external 36-port switch. As the total throughput to and from any chassis exceeds the throughput to the OSS servers by 4:1, this fabric configuration exceeded the bandwidth requirements necessary to fully utilize the OSS pair and MDS pair for all testing.
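The blocking and bandwidth ratios quoted above follow directly from the port counts; as a rough check (not part of the original study):

16 blades per chassis : 8 uplinks per chassis switch = 2:1, i.e. a 50% blocking fabric
8 QDR uplinks per chassis : 2 QDR links into the OSS pair = 4:1 chassis-to-storage bandwidth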


    Table 2: DT-HSS3 Configuration

    DT-HSS3 Configuration

    Configuration Size 192TB Medium

    Lustre Version 1.8.4

    # of Drives 96 x 2TB NL SAS drives.

    OSS Nodes 2 x PowerEdge R710 Servers with 24GB Memory

OSS Storage Arrays 2 x MD3200, each with 3 x MD1200 expansion enclosures

    Drives in OSS Storage Arrays 96 3.5" 2TB 7200RPM Nearline SAS

    MDS Nodes 2 x PowerEdge R710 Servers with 48GB Memory

    MDS Storage Array 1 x MD3220

    Drives in MDS Storage Array 24 2.5" 500GB 7200RPM Nearline SAS

    Infiniband

    PowerEdge R710 Servers Mellanox ConnectX-2 QDR HCA

    Compute Nodes Mellanox ConnectX-2 QDR HCA - Mezzanine Card

    External QDR IB Switch Mellanox 36 Port is5030

    IB Switch Connectivity QDR Cables

All tests are performed with a cold cache established using the following technique. After each test, the filesystem is unmounted from all clients and then unmounted from the metadata server. Once the metadata server is unmounted, a sync is performed and the kernel is instructed to drop its caches with the following commands:

    sync

    echo 3 > /proc/sys/vm/drop_caches
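A minimal sketch of this cold-cache sequence, assuming passwordless ssh, a hypothetical client host list, and hypothetical server names and mount point (on the DT-HSS3 itself, unmounting and remounting is normally driven from the TMC):

for c in $(cat ./client_list); do ssh $c umount /mnt/lustre; done
for s in mds1 mds2 oss1 oss2; do ssh $s "sync; echo 3 > /proc/sys/vm/drop_caches"; done
for c in $(cat ./client_list); do ssh $c mount -t lustre mds1@o2ib0:/lustre /mnt/lustre; done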

To further guard against client cache effects, the compute node responsible for writing a sequential file was never the same node used for reading it back. A file size of 64GB was chosen for sequential writes. This is larger than the combined 48GB of memory in the OSS pair (two PowerEdge R710s with 24GB each), so all sequential reads and writes have an aggregate sample size of 64GB multiplied by the number of threads, up to 64GB x 255 threads, or more than 15TB. The block size for IOzone was set to 1MB to match the 1MB request size.

    In measuring the performance of the DT-HSS3, all tests were performed within similar environments.

    The filesystem was configured to be fully functional and the targets tested were emptied of files and

    directories prior to each test.

    N-to-N Sequential Reads / Writes

The sequential testing was done with IOzone version 3.347. Throughput results are reported by IOzone in KB/sec and are converted to MiB/s for consistency. N-to-N testing describes the technique of generating a separate file for each process thread spawned by IOzone. The file size determined to be appropriate for testing is 64GB; each individual file written was large enough to saturate the total available cache on each OSS/MD3200 pair. Files written were distributed across the


    client LUNs evenly. This was to prevent uneven I/O load on any single SAS connection in the same way

    that a user would expect to balance a workload.

    Figure 11: Sequential Reads / Writes DT-HSS3

Figure 11 demonstrates the performance of the DT-HSS3 192TB test configuration. Write performance peaks at 4.1GB/sec at sixteen concurrent threads and continues to exceed 3.0GB/sec as each of the 64 clients writes up to four files at a time; with four files per client, sixteen files are being written simultaneously on each OST. Read performance exceeds 6.2GB/sec for sixteen simultaneous processes and continues to exceed 3.5GB/sec while four threads per compute node access individual OSTs. Write and read performance rise steadily as the number of process threads increases up to sixteen; this is the result of increasing the number of OSTs utilized along with the number of processes and compute nodes. When the client-to-OST ratio (CO) exceeds one, the write load is distributed between multiple files on each OST, consequently increasing seek and Lustre callback time. By careful crafting of the IOzone hosts file, each added thread is balanced between blade chassis, compute nodes, and SAS controllers, and adds a single file to each of the OSTs, allowing a consistent increase in the number of files written while creating minimal locking contention between the files. As the number of files written increases beyond one per OST, throughput declines. This is most likely the result of the larger number of requests per OST: throughput drops as the number of clients making requests to each OST increases, and positional delays and other overhead are expected as a result of the increased randomization of client requests. To maintain the higher throughput for a greater number of files, increasing the number of OSTs should help. A review of the storage array performance from the Dell PowerVault Modular Disk Storage Manager, or with the SMcli performance monitor, was done to independently confirm the throughput values produced by the benchmarking tools.

    Random Reads and Writes

    The same tool, IOzone, was used to gather random reads and writes metrics. In this case, we have

    chosen to write a total of 64GB during any one execution. This total 64GB size is divided amongst the



threads evenly. The IOzone hosts are arranged to distribute the workload evenly across the compute nodes. The storage is addressed as a single volume with an OST count of 16 and a stripe size of 4MB. Because performance is measured in operations per second (OPS), a 4KB request size is used; this aligns with the 4KB filesystem block size of this Lustre version.

Figure 12 shows that random writes saturate at sixteen threads and around 5,200 IOPS. At sixteen threads, the client-to-OST ratio (CO) is one. As the number of IOzone threads increases, the CO rises to a maximum of 4:1 at 64 threads. Reads increase rapidly from sixteen to 32 threads and then continue to increase at a relatively steady rate. Because writes require a file lock per OST accessed, saturation is not unexpected; reads take advantage of Lustre's ability to grant overlapping read extent locks for part or all of a file. Increasing the number of disks in the single OSS pair, or adding OSS pairs, can increase the throughput.
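As a rough back-of-the-envelope conversion (not part of the original measurements), the saturated random-write rate corresponds to only a small fraction of the sequential bandwidth, which is expected at a 4KB request size:

5,200 IOPS x 4KB per request ≈ 20MB/sec of random-write throughput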

    Figure 12: N-to-N Random reads and writes.


    IOR N-to-1 Reads and Writes

Performance of the DT-HSS3 with reads and writes to a single file was reviewed with the IOR benchmarking tool. IOR is written to accommodate MPI communications for parallel operations and has support for manipulating Lustre striping. IOR allows several different I/O interfaces for working with files, but this testing used the POSIX interface to exclude more advanced features and their associated overhead. This gives an opportunity to review the filesystem and hardware performance independent of those additional enhancements; in addition, most applications have adopted a one-file-per-processor approach to disk I/O rather than using parallel I/O APIs.

    IOR benchmark version 2.10.3 was used in this study, and is available at

    http://sourceforge.net/projects/ior-sio/. The MPI stack used for this study was Open MPI version

    1.4.1, and was provided as part of the Mellanox OFED kit.


The configuration for writing included a directory set with striping characteristics designed to stripe across all sixteen OSTs with a stripe chunk size of 4MB; the file to which all clients write is thus evenly striped across all OSTs. The request size for Lustre is 1MB, but in this test a transfer size of 4MB was used to match the stripe size.
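A directory with these striping characteristics could be prepared from a client with the standard lfs utility before running IOR; the directory path is hypothetical:

lfs setstripe -c -1 -s 4m /mnt/lustre/ior_test

Here -c -1 stripes every file created in the directory across all available OSTs (sixteen in this configuration, so -c 16 would be equivalent) and -s 4m sets the 4MB stripe size.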

    Figure 13: N-to-1 Read / Write

With the 4MB transfer size set for IOR, reads have the advantage and peak at greater than 5.5GB per second, a value close to the sequential N-to-N performance. Write performance peaks within 10% of the sequential writes, at 3.9GB per second, but as the client-to-OST ratio (CO) increases, write performance settles at a sustained rate of around 2.6GB per second. Reads are less affected, since read locks for clients are allowed to overlap some or all of a given file.

    Metadata Testing

Metadata testing measures the time to complete certain file and directory operations that return attributes. mdtest is an MPI-coordinated benchmark that performs create, stat, and delete operations on files and directories and reports the rate, in operations per second, at which they complete. Because mdtest is MPI enabled, mpirun is used to launch test threads from each included compute node. mdtest can be configured to compare metadata performance for directories and files.

    This study used mdtest version 1.8.3, which is available at http://sourceforge.net/projects/mdtest/.

    The MPI stack used for this study was Open MPI version 1.4.1, and was provided as part of the Mellanox

    OFED kit. The Metadata Server internal details, including processor and memory configuration, are

    outlined in Table 3 for comparison.

    Table 3: Metadata Server Configuration

    Processor 2 x Intel Xeon E5620 2.4Ghz, 12M Cache, Turbo

    Memory 48GB Memory (12x4GB), 1333MHz Dual Ranked RDIMMs



    Controller 2 x SAS 6Gbps HBA

    PCI Riser Riser with 2 PCIe x8 + 2 PCIe x4 Slot

On a Lustre file system, directory metadata operations are all performed on the MDT. The OSTs are queried for object identifiers in order to allocate or locate the extents associated with the metadata operations, so most metadata operations indirectly involve the OSTs. As a result of this dependency, Lustre favors metadata operations on files with lower stripe counts: the greater the stripe count, the more operations are required to complete the OST object allocations or calculations.

The first test stripes files in 1MB stripes over all available OSTs; the next test repeats the same operations with files on a single OST for comparison. Metadata operations on a single OST and the same operations with single-stripe files spread evenly over sixteen OSTs yield very similar results.

    Figure 14: Metadata create comparison

    Preallocating from sixteen OSTs or a single OST does not seem to change the rate at which files are

    created. The performance of the MDT and MDS are more important to the create operation.

(Chart: operations per second versus tasks (nodes), comparing file creates striped over 16 OSTs, file creates on 1 OST, and directory creates over 16 OSTs.)


    Figure 15: Metadata File / Directory Stat

The mdtest benchmark for stat uses the native stat() call, rather than invoking the shell's stat utility, to retrieve file and directory metadata. Directory stats appear to peak around 26,000 operations per second, but the increased number of OSTs makes a significant difference in the number of calls handled at one task and above. If files are striped across more than one OST, all of those OSTs must be contacted in order to accurately calculate the file size. The file stats indicate that beyond 8 tasks the advantage lies in distributing the requests across more than a single OST, but the increase is not significant, as the requests to any one OST represent no more than a fraction of the data requested by the test.

Directory creates do not require the same draw from the prefetch pool of OST object IDs as file creates do. The result is fewer operations to completion and therefore a faster time to completion.

Removal of files requires less communication between the compute nodes (Lustre clients) and the MDS than removing directories, since directory removal requires verification from the MDS that the directory is empty prior to execution. At 32 tasks, a break occurs in the file removal results between the single-OST case and the sixteen-OST case. The metadata operations for directories across multiple OSTs appear to gain a slight advantage once the number of tasks exceeds 32, and for file removal there is a definite increase in performance when the operations are executed across multiple OSTs.



    Figure 16: Metadata File / Directory Remove

(Chart: operations per second versus tasks (nodes), comparing file and directory removes for the 16-OST and 1-OST cases.)


Conclusions

There is a well-known need for scalable, high-performance clustered filesystem solutions, and many customized storage offerings attempt to address it. The third generation of the Dell | Terascala HSS, the DT-HSS3, addresses this need with the added benefit of standards-based, proven Dell PowerEdge and PowerVault products and Lustre, the leading open source clustered filesystem. The Terascala Management Console (TMC) unifies the enhancements and optimizations included in the DT-HSS3 into a single control and monitoring panel for ease of use.

The latest generation Dell | Terascala HPC Storage Solution continues to offer the same advantages as the previous generation solutions, but with greater processing power, more memory, and greater I/O bandwidth. Raw storage scaling from 48TB up to 384TB per Object Storage Server pair, with up to 6.2GB/sec of throughput in a packaged component, is consistent with the needs of high-performance computing environments, and the DT-HSS3 scales horizontally as easily as it scales vertically.

The performance studies demonstrate an increase in throughput for both reads and writes over previous generations for N-to-N and N-to-1 file access. Results from mdtest show a generous increase in metadata operation rates. With the PCIe 2.0 interface, the controllers and HCAs will excel in both high-IOPS and high-bandwidth applications.

The continued use of generally available, industry-standard tools like IOzone, IOR, and mdtest provides an easy way to match current and expected growth against the performance outlined for the DT-HSS3. The profiles reported by each of these tools provide sufficient information to align the configuration of the DT-HSS3 with the requirements of many applications or groups of applications.

In short, the Dell | Terascala HPC Storage Solution delivers all the benefits of scale-out, parallel file system based storage for your high-performance computing needs.


Appendix A: Benchmark Command Reference

This section describes the commands used to benchmark the Dell | Terascala HPC Storage Solution 3.

IOzone

IOzone Sequential Writes –
/root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts

IOzone Sequential Reads –
/root/iozone/iozone -i 1 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts

IOzone IOPS Random Reads / Writes –
/root/iozone/iozone -i 2 -w -O -r 4k -s $SIZE -t $THREAD -I -+n -+m ./hosts

The O_DIRECT command line parameter ("-I") allows IOzone to bypass the cache on the compute nodes where the IOzone threads are running.

IOR

IOR Writes –
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -w -o $IORFILE -s 1 -b $SIZE -t 4m

IOR Reads –
mpirun -np $i --hostfile hosts $IOR -a POSIX -i 2 -d 32 -e -k -r -o $IORFILE -s 1 -b $SIZE -t 4m

mdtest

Metadata Create Files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -C

Stat files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -R -T

Remove files –
mpirun -np $THREAD --nolocal --hostfile $HOSTFILE $MDTEST -d $FILEDIR -i 6 -b $DIRS -z 1 -L -I $FTP -y -u -t -F -r
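The $THREAD variable in the commands above is swept across the thread counts reported in the figures; one possible wrapper, not taken from the original test harness, is:

for THREAD in 1 2 4 8 16 32 64 128 255; do
    /root/iozone/iozone -i 0 -c -e -w -r 1024k -s 64g -t $THREAD -+n -+m ./hosts
done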


Appendix B: Terascala Lustre Kit

PCM software uses the following terminology to describe software provisioning concepts:

1. Installer Node – Runs services such as DNS, DHCP, HTTP, TFTP, etc.
2. Components – RPMs
3. Kits – Collections of components
4. Repositories – Collections of kits
5. Node Groups – Allow association of software with a particular set of compute nodes

The following is an example of how to integrate a Dell | Terascala HPC Storage Solution into an existing PCM cluster:

1. Access a list of existing repositories on the head node. The key here is the Repo name: line.
2. Add the Terascala kit to the cluster.
3. Confirm the kit has been added.
4. Associate the kit with the compute node group on which the Lustre client should be installed:
   a. Launch ngedit at the console: # ngedit
   b. On the Node Group Editor screen, select the compute node group to which the Lustre client will be added.
   c. Select the Edit button at the bottom of the screen.
   d. Accept all default settings until you reach the Components screen.
   e. Using the down arrow, select the Terascala Lustre kit.
   f. Expand and select the Terascala Lustre kit component.
   g. Accept the default settings, and on the Summary of Changes screen, accept the changes and push the packages out to the compute nodes.

On the front-end node, there is now a /root/terascala directory that contains sample directory setup scripts. There is also a /home/apps-lustre directory that contains Lustre client configuration parameters. This directory contains a file that the Lustre file system startup script uses to optimize clients for Lustre operations.


References

Dell | Terascala HPC Storage Solution Brief
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-hpcstorage-solution-brief.pdf

Dell | Terascala HPC Storage Solution Whitepaper
http://i.dell.com/sites/content/business/solutions/hpcc/en/Documents/Dell-terascala-dt-hss2.pdf

Dell PowerVault MD3200 / MD3220
http://www.dell.com/us/en/enterprise/storage/powervault-md3200/pd.aspx?refid=powervault-md3200&s=biz&cs=555

Dell PowerVault MD1200
http://configure.us.dell.com/dellstore/config.aspx?c=us&cs=555&l=en&oc=MLB1218&s=biz

Lustre Home Page
http://wiki.lustre.org/index.php/Main_Page

Transitioning to 6Gb/s SAS
http://www.dell.com/downloads/global/products/pvaul/en/6gb-sas-transition.pdf

Dell HPC Solutions Home Page
http://www.dell.com/hpc

Dell HPC Wiki
http://www.HPCatDell.com

Terascala Home Page
http://www.terascala.com

Platform Computing
http://www.platform.com

Mellanox Technologies
http://www.mellanox.com