
Page 1: Univ. of North Texas: Solving the HPC Data Deluge

High Performance Computing: Talon 2.0

Dr. Scott Yockel, Manager of HPC Services, Academic and User Services

Page 2: Univ. of North Texas: Solving the HPC Data Deluge

•  Located in Denton, TX, at the northern edge of the Dallas/Ft. Worth Metroplex.

•  The nation's 26th largest public university, with an enrollment of over 36,000 students.

•  97 bachelor's, 81 master's, and 35 doctoral degree programs, many nationally recognized.

•  One of the top programs in Computational Chemistry – Electronic Structure.

•  The university has been named one of America's 100 Best College Buys® for 17 consecutive years and is listed as a “Best in the West” college by The Princeton Review.

Page 3: Univ. of North Texas: Solving the HPC Data Deluge

Growth of HPC Research since FY’10

•  The number of PIs utilizing HPC Services has grown five-fold to 55, with over 350 users.

•  The number of departments has expanded from 4 to 14, across 5 colleges.

•  11 in Physics, 8 in Chemistry, 8 in Mathematics, 6 in Materials Science, 5 in Computer Science, 4 in Biology, 3 in Electrical Engineering, 3 at the Health Science Center, 2 in Mechanical & Energy Engineering, 1 in Engineering Technology, 1 in Sociology, 1 in Educational Psychology, 1 in Kinesiology, Health Promotion, and Recreation, and 1 at UNT-Dallas (Chemistry).

•  5% of all faculty utilize HPC Services and bring in 10% of all externally funded grants at UNT.

•  Projected growth for FY’13-FY’16:

   •  ~20 new faculty hires are planned in S.T.E.M. areas of research.

   •  The total number of researchers could easily grow to 500 users by FY’16.

   •  It is anticipated that 10-15 new faculty with data-intensive research will be hired by FY’16, and as many as 40 existing faculty across the entire academic spectrum may also join HPC computing.

Page 4: Univ. of North Texas: Solving the HPC Data Deluge

Talon 2.0 Specs - Overview

•  Talon 2.0 represents only a 42% increase in equipment cost over the original Talon, yet it provides:

   •  5 times the processing power, with 4,096 computing cores in 250 Dell R420-series servers.

   •  10 times the high-performance storage, at 1.5 PB with the Dell | Terascala HSS4.5 storage appliance.

   •  3 times the interconnect speed, at 56 Gb/s with Mellanox FDR InfiniBand.

   •  Twice the available memory, with up to 512 GB in a single server.

•  The new system aims to address sophisticated problems requiring massive parallelization, large memory arrays, and/or a large volume of high-performance storage, capabilities crucial to the research goals of 55 research groups at UNT. (A back-of-the-envelope cost/capability sketch follows this list.)
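As a back-of-the-envelope illustration of the point above, the short Python sketch below (written for this summary, not part of the original deck) divides each quoted capability multiplier by the quoted 1.42x cost increase:

```python
# A rough illustration of the cost/capability claim on this slide:
# divide each quoted capability multiplier by the quoted 1.42x cost factor.

cost_increase = 1.42          # Talon 2.0 equipment cost relative to the original Talon
gains = {
    "processing power":         5.0,   # 4,096 cores in 250 Dell R420-series servers
    "high-performance storage": 10.0,  # 1.5 PB Dell | Terascala HSS4.5
    "interconnect speed":       3.0,   # 56 Gb/s Mellanox FDR InfiniBand
    "available memory":         2.0,   # up to 512 GB in a single server
}

for capability, multiplier in gains.items():
    per_dollar = multiplier / cost_increase
    print(f"{capability}: {multiplier:.0f}x capability for 1.42x cost "
          f"-> {per_dollar:.1f}x per equipment dollar")
```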

Page 5: Univ. of North Texas: Solving the HPC Data Deluge

Talon 2.0 Topology

[Topology diagram; components shown:]

•  Login nodes (cluster controller) and infrastructure layer: vis-login (3 × X11), talon2 (/home), hpc-01 (remote, research), hpc-02 (/admin, /share)

•  Compute nodes: c32 (160 × 16 cores), c64 (64 × 16 cores), c512 (8 × 32 cores), g64 (16 × 16 cores, 1024 GPUs) (core tally sketched below)

•  Lustre storage: MDS-01/02 and OSS-03/04/05/06 serving the Terascala scratch filesystem (1.4 PB)

•  Research SAN (200 TB)

•  Interconnect fabric: Mellanox FDR InfiniBand
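To make the diagram concrete, the small Python tally below (written for this summary) sums the node counts and cores-per-node listed above; it reproduces the 4,096-core total quoted on the specs slide:

```python
# Tally of CPU cores across the compute-node classes shown in the topology
# diagram; node counts and cores-per-node are taken from the diagram.

node_classes = {
    "c32":  {"nodes": 160, "cores_per_node": 16},
    "c64":  {"nodes": 64,  "cores_per_node": 16},
    "c512": {"nodes": 8,   "cores_per_node": 32},
    "g64":  {"nodes": 16,  "cores_per_node": 16},  # GPU-equipped nodes (1024 GPUs per the diagram)
}

total_nodes = sum(c["nodes"] for c in node_classes.values())
total_cores = sum(c["nodes"] * c["cores_per_node"] for c in node_classes.values())

print(f"compute nodes: {total_nodes}")  # 248, close to the ~250 servers on the specs slide
print(f"CPU cores:     {total_cores}")  # 4096, matching the specs slide
```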

Page 6: Univ. of North Texas: Solving the HPC Data Deluge

Talon 2.0 Specs - Storage
Dell | Terascala High-Performance Storage Solution DT-HSS4.5


Figure 2: HSS4.5 Components Overview

The appliance software images have been modified to support the Dell PowerEdge R620 as the Object Storage and Metadata Servers in the configuration. The PowerEdge R620, shown in Figure 3, allows for a significant improvement in server density, performance, and serviceability of these solution components, along with a decrease in the overall complexity of the solution itself.

Figure 3: Dell PowerEdge R620

3.1 Management Module

The Management Module is a single server connected to the rest of the HSS servers via an internal 1 GbE network. The server can be either a PowerEdge R210-II for small clusters or a PowerEdge R720-XD for large clusters, since client health information can also be monitored (via optional software) and the historical data grows in proportion to the client cluster size.

The management server is responsible for user interaction as well as system health management and monitoring. All user-level access to the HSS4.5 appliance is via this device. While the management server is responsible for collecting data and management, it does not play an active role in the Lustre file system itself and can be serviced without requiring downtime on the file system. The management server presents the collected data and provides management through an interactive Java GUI called the Terascala Management Console. Alternatively, a new and more powerful web GUI named TeraView is also available.

[Figure callouts: IB and SAS connections; front and rear views of the server]

•  Eight enclosures of 60 × 4 TB drives (480 drives), for 1.92 PB raw (see the capacity sketch after this list)

•  RAID 6 (8+2) – 48 LUNs

•  Lustre 2.1.5 parallel filesystem

•  1.4 PB usable storage

•  Initial read/write testing: 13 GB/s read, 8 GB/s write
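The capacity figures on this slide fit together arithmetically; the Python sketch below (written for this summary) works through them. The 8/10 data fraction follows from the RAID 6 (8+2) layout; attributing the remaining difference down to 1.4 PB usable to formatting and Lustre overhead is an assumption, not a number from the slide:

```python
# Capacity arithmetic for the DT-HSS4.5 figures on this slide. Drive counts,
# LUN count, and the 1.4 PB usable figure come from the bullets; the 8/10
# data fraction is standard RAID 6 (8+2) parity math. Attributing the
# remaining gap to formatting/Lustre overhead is an assumption.

enclosures = 8
drives_per_enclosure = 60
drive_size_tb = 4

raw_tb = enclosures * drives_per_enclosure * drive_size_tb
print(f"raw capacity:    {raw_tb / 1000:.2f} PB")         # 1.92 PB

# RAID 6 in an 8+2 layout keeps 8 data drives out of every 10.
after_raid_tb = raw_tb * 8 / 10
print(f"after RAID 6:    {after_raid_tb / 1000:.2f} PB")  # ~1.54 PB

luns, drives_per_lun = 48, 10
print(f"drives in LUNs:  {luns * drives_per_lun}")        # 480 = 8 x 60, consistent

usable_tb = 1400                                          # quoted usable capacity
print(f"usable (quoted): {usable_tb / 1000:.2f} PB "
      f"({usable_tb / after_raid_tb:.0%} of post-RAID capacity)")
```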