Corralling Big Data at TACC

In this presentation from the DDN User Meeting at SC13, Tommy Minyard from the Texas Advanced Computing Center describes TACC's storage systems, including the new Stockyard global filesystem. Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/

Corralling Big Data at TACC

Tommy Minyard

Texas Advanced Computing Center

DDN User Group Meeting

November 18, 2013

TACC Mission & Strategy

The mission of the Texas Advanced Computing Center is to enable scientific discovery and enhance society through the application of advanced computing technologies.

To accomplish this mission, TACC:

– Evaluates, acquires & operates advanced computing systems

– Provides training, consulting, and documentation to users

– Collaborates with researchers to apply advanced computing techniques

– Conducts research & development to produce new computational technologies

[Slide diagram: Resources & Services; Research & Development]

TACC Storage Needs

• Cluster-specific storage
– High performance (tens to hundreds of GB/s of bandwidth)
– Large capacity (~2TB per teraflop), purged frequently
– Highly scalable, to thousands of clients

• Center-wide persistent storage
– Global filesystem available on all systems
– Very large capacity, quota-enabled
– Moderate performance, very reliable, highly available

• Permanent archival storage
– Maximum capacity (tens of PB)
– Slower performance: tape-based offline storage with a spinning-disk cache

History of DDN at TACC

• 2006 – Lonestar 3 with DDN S2A9500 controllers and 120TB of disk

• 2008 – Corral with DDN S2A9900 controller and 1.2PB of disk

• 2010 – Lonestar 4 with DDN SFA10000 controllers and 1.8PB of disk

• 2011 – Corral upgrade with DDN SFA10000 controllers and 5PB of disk

Global Filesystem Requirements

• User requests for persistent storage available on all production systems
– Corral limited to UT System users only

• RFP issued for storage system capable of:
– At least 20PB of usable storage
– At least 100GB/s aggregate bandwidth
– High availability and reliability

• DDN solution selected for project

Stockyard: Design and Setup

• A Lustre 2.4.1-based global filesystem, with scalability for future upgrades

• Scalable Unit (SU): 16 OSS nodes providing access to 168 OSTs of RAID6 arrays from two SFA12K couplets, corresponding to 5PB capacity and 25+ GB/s throughput per SU

• Four SUs provide 20PB with 100GB/s now

• 16 initial LNET routers configured for external mounts

[Slide photo: one SU – a server rack with two DDN SFA12K couplet racks]
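For orientation, a client mounts such a Lustre filesystem against the MGS failover pair. A minimal sketch, assuming the MGS NIDs that appear in the failover logs later in this deck; the fsname "gsfs" and the mount point are assumptions, not published Stockyard values:

    # Mount the global filesystem on a client, listing both MGS NIDs so the
    # mount survives an MGS failover (NIDs from the logs below; fsname assumed).
    mount -t lustre 192.168.200.10@o2ib100:192.168.200.11@o2ib100:/gsfs /stockyard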

SU Hardware Details

• SFA12K rack: 50U rack with 8x L6-30p power

• SFA12K couplet with 16 IB FDR ports (direct attachment to the 16 OSS servers)

• 84-slot SS8460 drive enclosures (10 per rack, 20 enclosures per SU)

• 4TB 7200RPM NL-SAS drives
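These counts are consistent with the 5PB-per-SU figure if each OST is an 8+2 RAID6 array (the 8+2 geometry is an assumption; the slides only say RAID6):

    # 20 enclosures x 84 slots = 1680 drives = 168 OSTs x 10 drives (8 data + 2 parity)
    echo $((168 * 8 * 4))   # 8 data drives x 4TB per OST -> 5376TB, i.e. ~5PB usable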

Stockyard Logical Layout

[Slide diagram: Stockyard logical layout]

Stockyard: Capabilities and Features

• 20PB usable capacity with 100+ GB/s aggregate bandwidth

• Client systems can bring their own LNET router set to connect to the Stockyard core IB switches, or connect to the built-in LNET routers using either IB (FDR14) or TCP (10GigE); a configuration sketch follows this list

• HSM potential to the Ranch tape archival system
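A routed client reaches the Stockyard servers through an LNET route declared in its module configuration. A minimal sketch; the network names and router NIDs are illustrative, not Stockyard's actual values:

    # /etc/modprobe.d/lustre.conf on a hypothetical routed client: the client
    # sits on o2ib0 and reaches the Stockyard fabric (o2ib100) via 16 routers.
    options lnet networks="o2ib0(ib0)" routes="o2ib100 192.168.100.[1-16]@o2ib0"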

Capabilities and Features (cont’d)

• Metadata performance enhancement possible with DNE (phase 1)

• NRS (Network Request Scheduler) evaluation: characteristics of the different ost_io.nrs_policies settings, particularly CRR-N ("crrn", client round-robin over NIDs) under contention dominated by a few jobs; see the sketch below
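The NRS policy is selected per service with lctl; a sketch of switching the OSS I/O service to CRR-N, using the parameter name given above:

    # Inspect, then change, the NRS policy on the OSS ost_io service:
    lctl get_param ost.OSS.ost_io.nrs_policies
    lctl set_param ost.OSS.ost_io.nrs_policies="crrn"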

Stockyard: Numbers So Far

• 16 LNET routers configured as direct clients (within the Stockyard fabric) can push 25GB/s to one SU

• With two SUs the same set of clients can achieve 50GB/s, and 75GB/s with three SUs

• With four SUs we hit the 16-client limit: no improvement beyond 75GB/s (corresponding to ~4.7GB/s from each client)

Numbers So Far (Single Client)

• Single-thread write performance with Lustre 2.4.1 is ~770MB/s
– a big improvement over 2.1.x at about 500MB/s

• Multi-threaded writes from a single client saturate around 4.7GB/s (with credits=256 on both servers and clients; see the sketch below)
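Assuming "credits" refers to the ko2iblnd module parameter (the number of concurrent sends on the InfiniBand LND), the setting would look like this on servers and clients alike; a sketch, not the exact Stockyard configuration:

    # /etc/modprobe.d/lustre.conf fragment: raise concurrent sends on the IB LND.
    options ko2iblnd credits=256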

Numbers So Far (Aggregate)

• Performance numbers with 16 LNET routers: 75GB/s from 16 direct clients

• Numbers from Stampede compute clients: 65GB/s with 256 clients (IOR, POSIX, file-per-process, with 8 tasks per node; an example invocation follows this list)

• Saturation point for Stampede clients: 65GB/s

• N.B. credits=64 on Stampede client nodes
– A quick test on an interactive 2.1.x node with a higher credit count gives the expected boost
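A hypothetical IOR invocation matching the described setup (POSIX API, file-per-process, 8 tasks on each of 256 nodes); block and transfer sizes and the output path are illustrative:

    # 256 nodes x 8 tasks = 2048 MPI ranks; -F writes one file per process.
    mpirun -np 2048 ior -a POSIX -F -w -b 4g -t 1m -o /stockyard/ior_test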

Numbers So Far (Failover Tests)

• OSS failover test setup and results (a command-level sketch follows this list)

• Procedure:
– Identify the OSTs for the test pair
– Initiate dd processes targeted at those OSTs, each about 67GB in size so that it does not finish before the failover
– Interrupt one of the OSS servers with a shutdown via ipmitool
– Record the individual dd process outputs as well as server- and client-side Lustre messages
– Compare and confirm the recovery and operation of the failover pair with 21 OSTs

• All I/O completes within 2 minutes of failover
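A command-level sketch of one iteration; the OST index, file path, BMC hostname, and credentials are hypothetical:

    # Pin a file to one OST of the failover pair, then start a long write:
    lfs setstripe -c 1 -i 42 /stockyard/ft_ost42
    dd if=/dev/zero of=/stockyard/ft_ost42 bs=1M count=64000 &   # ~67GB write
    # Trigger a shutdown of one OSS of the pair through its BMC:
    ipmitool -I lanplus -H oss-bmc.example -U admin -P secret power soft
    # Watch Lustre recovery messages on the surviving server and the clients:
    dmesg | grep -i lustre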

Failover Testing (cont’d)

• Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log shows the recovery:

    Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000 x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to 1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
    Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre: 13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar message
    Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at MGC192.168.200.10@o2ib100_1) after server handle changed from 0xb9929a99b6d258cd to 0x6282da9e97a66646
    Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100: Connection restored to MGS (at 192.168.200.11@o2ib100)

Automated Failover

• The tests used an artificial setup to simplify tracking of I/O completion on the clients; the shutdown and failover mounts were performed manually

• Corosync and Pacemaker are being set up to automate the process (see the sketch below)
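A minimal sketch of what the Pacemaker side could look like, using the stock ocf:heartbeat:Filesystem agent to manage a Lustre target mount; the resource name, device, and mount point are hypothetical, and this is not TACC's published configuration:

    # Define a Lustre OST mount as a cluster-managed resource (pcs syntax):
    pcs resource create stockyard-ost0 ocf:heartbeat:Filesystem \
        device=/dev/mapper/ost0 directory=/mnt/ost0 fstype=lustre \
        op monitor interval=30s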

Routed Clients

• We monitor routerstat output on the attached routers and the differences between two timestamps, focusing on the even distribution of request streams

• Contrary to the expectation that "auto_down" may suffice, Lustre clients need "check_routers_before_use=1" for automatic updates of router status (see the sketch below)
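Both knobs are standard lnet module parameters, and routerstat ships with Lustre; a sketch with illustrative values:

    # Client side: mark failed peers down and ping routers before first use.
    options lnet auto_down=1 check_routers_before_use=1
    # On a router node: print LNET router statistics every 5 seconds.
    routerstat 5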

Routed Clients (cont’d)

• Even with automatic router checks, clients cannot detect every non-functional router: a router that is alive only on the client side will still be assumed active by clients

• Clients then encounter timeouts due to these non-functional routers

• Resolution: separate router checks on the router nodes were added

Stockyard: Looking Ahead

• Deploy as a global $WORK space for TACC resources, which will push the client count to all TACC resources

• Evaluation of Lustre 2.5.0 before full production, for HSM functionality and compatibility with SAMFS on Ranch

• Quota management (different on 2.4+; see the example below)

• Integrated monitoring setup

• Security evaluation
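For reference, checking a user's usage against quota on the global filesystem would look like this (the username and mount point are hypothetical):

    # Report block and inode usage and limits for one user:
    lfs quota -u jdoe /stockyard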

Summary

• Storage capacity and performance needs are growing at an exponential rate

• High-performance, reliable filesystems are critical for HPC productivity

• The benefits of large parallel filesystems outweigh the system administration overhead

• The current best solution for cost, performance, and scalability is a Lustre-based filesystem