
Glenn K. Lockwood, Ph.D., Advanced Technologies Group

Simplifying data management through storage system design

May 1, 2019

NERSC

U.S. Department of Energy, Office of Science

Outline / Charge

What are our issues and challenges around data storage? (deletion/retention policy and practice, cost, duplication, hot/cold storage, types of storage)

I. Storage architecture has gotten very complicated
II. NERSC wants to deploy systems that accelerate scientific discovery by simplifying data management through innovations in system hardware and software
III. We need to manage complexity to simplify user interfaces


NERSC: the mission HPC facility for the U.S. Department of Energy Office of Science

• 7,000 users
• 800 projects
• 700 applications
• 2,000 publications per year

Photo credit: CAMERA

NERSC's dual mission: advance science and advance the state-of-the-art in supercomputing

Advancing Science
• Provide a productive computing environment for science applications
  – software/programming environment
  – advanced app/system performance expertise
• Tightly couple with workflows of DOE's experimental and observational facilities

Advancing HPC
• Collaborate with HPC vendors years before system delivery to deploy advanced systems/new capabilities at scale

Storage architecture is getting complicated

[Figure: today's four-tier hierarchy, ordered from performance to capacity: Burst Buffer, Scratch, Project, Archive]

More info: G. K. Lockwood et al., "Storage 2020: A Vision for the Future of HPC Storage," Berkeley, CA, 2017.

Addressing complexity of data management

I. Change user applications
II. Better middleware and tools
III. Smarter system software
IV. Simpler storage system hardware architecture

Simplifying storage system architecture in 2020

[Figure: a simplified three-tier hierarchy, ordered from performance to capacity: Scratch, Community, Archive]

2025 requires smarter storage system software

[Figure: the 2020 hierarchy (Scratch, Community, Archive) collapses into a 2025 hierarchy of a Hot Tier and a Cold Tier, with a software abstraction mapping each tier onto the underlying media: nonvolatile RAM, NAND (flash), hard disk drives, and tapes]
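The hot/cold split implies a placement policy that the software abstraction must enforce. As a hedged illustration (this is not NERSC's actual system software; the function name and the one-week threshold are invented for the sketch), a purely recency-based policy might look like:

```python
"""Illustrative sketch of a hot/cold tiering decision: keep files on the
hot (NVRAM/flash) tier while actively used, demote to the cold
(disk/tape) tier once idle.  All names and thresholds are hypothetical."""
import time

HOT_IDLE_LIMIT_S = 7 * 24 * 3600  # hypothetical: demote after a week idle


def choose_tier(last_access_epoch, now=None):
    """Return which tier a file should live on, based only on recency."""
    now = time.time() if now is None else now
    idle_seconds = now - last_access_epoch
    return "hot" if idle_seconds < HOT_IDLE_LIMIT_S else "cold"


# A file touched an hour ago stays hot; one idle for 30 days goes cold.
now = 1_700_000_000.0
print(choose_tier(now - 3600, now))             # hot
print(choose_tier(now - 30 * 24 * 3600, now))   # cold
```

A real policy would also weigh file size, owner quotas, and predicted reuse, which is exactly why the slide argues the 2025 hierarchy needs smarter system software rather than a fixed rule like this.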

Towards intelligent data management systems through quantitative analysis

[Figure: the Burst Buffer, Scratch, Project, and Archive tiers, instrumented to feed monitoring data into an Elastic data store]

G. K. Lockwood, N. J. Wright, S. Snyder, P. Carns, G. Brown, and K. Harms, "TOKIO on ClusterStor: Connecting Standard Tools to Enable Holistic I/O Performance Analysis," in Proceedings of the 2018 Cray User Group, 2018.

State of the art in understanding I/O today

[Figure: each tier (Burst Buffer, Scratch, Project, Archive) is covered by its own disjoint monitoring tools, including Darshan, LDMS, LMT, mmperfmon, collectd, SeaStream, Elastic, ENOSPC alerts, and assorted ad hoc Perl and Python scripts. Expert knowledge is required to turn this data into answers.]

TOKIO to give insight into data movement

[Figure: the Total Knowledge of I/O (TOKIO) framework layered over the same per-tier monitoring tools, unifying their data into a single view]
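The unifying step behind a framework like TOKIO is aligning data from monitors that sample on different clocks (e.g. Darshan on the compute side, LMT on the Lustre servers) onto a common time base so they can be compared. A stdlib-only sketch of that alignment step, not the actual pytokio API:

```python
"""Toy illustration of holistic I/O analysis: component monitors sample at
different rates, so the first step to a unified view is resampling every
series onto one shared time grid.  Data and rates here are invented."""
from bisect import bisect_right


def align(samples, grid):
    """For each grid timestamp, take the most recent sample at or before it.

    samples: list of (epoch_seconds, value) pairs, sorted by time
    grid:    list of epoch_seconds defining the common time base
    """
    times = [t for t, _ in samples]
    out = []
    for t in grid:
        i = bisect_right(times, t)          # number of samples at or before t
        out.append(samples[i - 1][1] if i else None)
    return out


# Two monitors sampling at different rates, aligned to a 10-second grid:
app_gbs = [(0, 1.2), (7, 3.4), (18, 0.9)]            # e.g. per-job I/O rate
server_cpu = [(0, 20), (5, 85), (10, 90), (20, 30)]  # e.g. server CPU %
grid = [0, 10, 20]
print(align(app_gbs, grid))      # [1.2, 3.4, 0.9]
print(align(server_cpu, grid))   # [20, 90, 30]
```

Once both series share a clock, a spike in server CPU can be placed next to a dip in application bandwidth, which is the kind of cross-component correlation the patchwork of standalone tools cannot produce on its own.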

What TOKIO has told us so far: Why is I/O slow?

[Figure: UMAMI view relating application I/O performance (GB/s) to contextual metrics: bandwidth contention, data server load (CPU %), metadata server load (CPU %), metadata ops (MOps), and file system fullness (%)]

Top: G. K. Lockwood, S. Snyder, T. Wang, S. Byna, P. Carns, and N. J. Wright, "A Year in the Life of a Parallel File System," in Proceedings of SC'18, 2018, pp. 74:1–74:13.
Left: G. K. Lockwood et al., "UMAMI: A Recipe for Generating Meaningful Metrics through Holistic I/O Performance Analysis," in Proceedings of PDSW-DISCS'17, 2017, pp. 55–60.
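One way to answer "why is I/O slow?" from metrics like these is to compare each contextual metric for a slow run against that metric's own recent history, in the spirit of UMAMI. A minimal sketch with made-up numbers (the z-score framing here is an illustration, not the paper's exact method):

```python
"""Sketch of history-relative anomaly scoring for I/O diagnosis: a metric
whose value during a slow run sits far outside its own history is a likely
culprit.  Metric names follow the slide; the numbers are invented."""
from statistics import mean, pstdev


def zscore(history, latest):
    """How many standard deviations `latest` sits from its history."""
    mu, sigma = mean(history), pstdev(history)
    return 0.0 if sigma == 0 else (latest - mu) / sigma


# Hypothetical history of each metric vs. its value during the slow run:
metrics = {
    "bandwidth contention":  ([0.10, 0.20, 0.10, 0.15], 0.80),
    "fs fullness (%)":       ([60, 62, 61, 63], 64),
}
for name, (history, latest) in metrics.items():
    print(f"{name}: z = {zscore(history, latest):+.1f}")
```

In this toy data, contention scores far higher than fullness, so a human (or an adaptive system) would look at contention first; that triage step is what otherwise demands the expert knowledge called out earlier.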

What TOKIO can tell us going forward: How does data move during workflows?

[Figure: data movement among NERSC, ALCF, and ESnet on January 28, 2019; flows binned at ≤1 TiB, ≤10 TiB, ≤50 TiB, ≤100 TiB, and ≤500 TiB]

Outlook

Managing complexity in data management requires increasingly intelligent storage systems to offset increasing architectural complexity.
• A simpler, all-flash storage hierarchy is a viable short-term solution
• Holistic understanding of the entire storage system is becoming essential
• TOKIO is a software foundation for holistic understanding and for working towards adaptive storage

The Total Knowledge of I/O (TOKIO) Project, a collaboration between Lawrence Berkeley National Laboratory and Argonne National Laboratory, is supported by the U.S. Department of Energy, Office of Science, under contracts DE-AC02-05CH11231 and DE-AC02-06CH11357.

https://www.nersc.gov/research-and-development/tokio/

"Any opinions, findings, conclusions or recommendations

expressed in this material are those of the author(s) and do not

necessarily reflect the views of the Networking and Information

Technology Research and Development Program."

