
Page 1: Memory-Conscious Collective I/O for Extreme Scale HPC Systems

Yin Lu, Yong Chen, Texas Tech University
Rajeev Thakur, Argonne National Laboratory
Yu Zhuang, Texas Tech University

ROSS 2013, Eugene, Oregon, 6/10/13

Page 2: Introduction – Data Intensive HPC Simulations/Applications

  A wide range of HPC applications, simulations, and visualizations[1]

  Many applications are increasingly data intensive[2]


Figure: example application domains — Molecular Science, Weather & Climate Forecasting, Earth Science, Astronomy, Health, Astrophysics, Life Science, Materials.

1. William D. Gropp. Simulation at Extreme Scale. Invited presentation at Big Data Science: A Symposium in Honor of Martin Schultz, New Haven, Connecticut, October 26, 2012.

2. J. Dongarra, P. H. Beckman, et al. The International Exascale Software Project Roadmap. IJHPCA 25(1): 3–60 (2011).

Page 3: Introduction – Data Intensive HPC Simulations/Applications

PI | Project | On-Line Data | Off-Line Data
Lamb, Don | FLASH: Buoyancy-Driven Turbulent Nuclear Burning | 75 TB | 300 TB
Fischer, Paul | Reactor Core Hydrodynamics | 2 TB | 5 TB
Dean, David | Computational Nuclear Structure | 4 TB | 40 TB
Baker, David | Computational Protein Structure | 1 TB | 2 TB
Worley, Patrick H. | Performance Evaluation and Analysis | 1 TB | 1 TB
Wolverton, Christopher | Kinetics and Thermodynamics of Metal and | 5 TB | 100 TB
Washington, Warren | Climate Science | 10 TB | 345 TB
Tsigelny, Igor | Parkinson's Disease | 2.5 TB | 50 TB
Tang, William | Plasma Microturbulence | 2 TB | 10 TB
Sugar, Robert | Lattice QCD | 1 TB | 44 TB
Siegel, Andrew | Thermal Striping in Sodium Cooled Reactors | 4 TB | 8 TB
Roux, Benoit | Gating Mechanisms of Membrane Proteins | 10 TB | 10 TB

Data requirements for representative INCITE applications at ALCF

Source: R. Ross et al., Argonne National Laboratory

Many simulations/applications process O(1 TB–100 TB) in a single run
Application teams are projected to manipulate O(1 PB–10 PB) on exascale systems


Page 4: Motivation – Decreased Memory & Bandwidth per Core at Exascale

  Neither available memory capacity nor memory bandwidth scales by the same factor as the total concurrency

The disparity in growth between memory and concurrency means that the average memory capacity and bandwidth per core will actually drop in exascale systems (a back-of-the-envelope illustration follows)
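For intuition, a back-of-the-envelope illustration with purely hypothetical numbers (not taken from the slide's table): if total concurrency grows by a factor of 1000 while aggregate memory capacity grows by only a factor of 100, then memory per core shrinks to 100/1000 = 1/10 of its current value, and the same reasoning applies to memory bandwidth per core.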

Expected Exascale Architecture Parameters and Comparison with Current Hardware[3]


3. S. Ahern, A. Shoshani, K.-L. Ma, et al. Scientific Discovery at the Exascale. Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis, and Visualization, February 2011.

Page 5: Motivation – Increased Available Memory Variance

Available memory exhibits imbalance among compute nodes
Available memory per node can vary significantly at extreme scale
These projected constraints present challenges for I/O solutions at exascale, including collective I/O

Figures: memory usage of 815 compute nodes at one point in time; memory usage of a single compute node over one month; memory usage of four compute nodes over one hour.

Page 6: Motivation – Collective I/O Performance with Limited Memory

Collective I/O optimizes I/O accesses by merging small, noncontiguous requests into large, contiguous ones, removing overlaps, and reducing the number of I/O calls
Remains critical for extreme-scale HPC systems
Performance can be significantly affected under memory pressure (see the MPI-IO sketch below)
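For context, a minimal MPI-IO sketch (not from the slides) of how a conventional collective write is issued; the cb_* ROMIO hints are the standard knobs that bound how much memory each aggregator uses in the two-phase exchange. The file name, block size, and hint values are illustrative assumptions.

    #include <mpi.h>
    #include <stdlib.h>

    /* Minimal sketch (assuming a ROMIO-based MPI-IO, as in MPICH): each process
     * writes one contiguous 4 MB block collectively; the cb_* hints bound how
     * much memory each aggregator may use for the two-phase exchange. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int block = 4 * 1024 * 1024;                 /* 4 MB per process */
        char *buf = calloc(block, 1);

        MPI_Info info;
        MPI_Info_create(&info);
        MPI_Info_set(info, "cb_buffer_size", "4194304");   /* 4 MB aggregator buffer */
        MPI_Info_set(info, "cb_nodes", "2");                /* two aggregator nodes   */

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);
        /* Two-phase collective write: data is shuffled to the aggregators
         * (communication phase) before being written to the file system (I/O phase). */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf, block, MPI_BYTE,
                              MPI_STATUS_IGNORE);
        MPI_File_close(&fh);

        MPI_Info_free(&info);
        free(buf);
        MPI_Finalize();
        return 0;
    }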


Figure: Performance of Collective I/O for Various Memory Sizes — write bandwidth (MB/s) vs. memory size (MB, from 128 down to 2) with 120 cores, for 16 MB, 8 MB, and 2 MB messages per process.

Figure: two-phase collective I/O — processes p0–p5 on Compute Node 0 and Compute Node 1 ship data to Aggregator 0 and Aggregator 1 in the data communication phase, which then access the parallel file system in the I/O phase.

Page 7: Memory-Conscious Collective I/O

Objective: design and develop collective I/O that is aware of memory capacity, memory variance, and off-chip bandwidth

Contributions
•  Identified the performance and scalability constraints imposed by memory pressure
•  Proposed a memory-conscious strategy that conducts collective I/O with memory-aware data partitioning and aggregation
•  Prototyped and evaluated the proposed strategy with benchmarks
•  A memory-conscious strategy can be important given the significance of collective I/O and the substantial memory pressure at extreme scale
•  A step toward addressing the challenges of an exascale I/O system


Page 8: Memory-Conscious Collective I/O (cont.)

Contains four major components
•  Aggregation Group Division divides the I/O requests into separate groups
•  I/O Workload Partition partitions the aggregate access file region into contiguous file domains
•  Workload Portion Remerging rearranges the file domains considering the memory usage of physical nodes
•  Aggregators Location determines the placement of aggregators for each file domain

Works across applications, the I/O library, and parallel file systems


Figure: software stack — Applications and a High-Level I/O Library sit above the MPI-IO Middleware, which contains the Memory-Conscious Collective I/O components (Aggregation Group Division, I/O Workload Partition, Workload Portion Remerging, Aggregators Location) alongside Independent I/O and Performance Model Directed Data Sieving, all layered on top of Parallel File Systems.

Page 9: Aggregation Group

To avoid global aggregation and reduce memory requirements
The global data-shuffling traffic increases the memory pressure on aggregators and leads to off-chip memory bandwidth contention
Divides the I/O workload into non-overlapping chunks guided by the optimal group message size Msg_group
Aggregation groups perform their own aggregation in parallel
Each node is limited to one group to reduce communication (a sketch of the grouping follows)
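A minimal sketch (my own illustration, not the authors' code) of how per-process requests might be packed into non-overlapping aggregation groups bounded by a target group message size; MSG_GROUP and the request sizes are hypothetical stand-ins for Msg_group and the real workload.

    #include <stdio.h>

    /* Hypothetical threshold standing in for the optimal group message size
     * Msg_group; real values would come from the authors' performance model. */
    #define MSG_GROUP 64

    /* Greedily packs consecutive per-process message sizes into non-overlapping
     * groups whose aggregate size stays at or below MSG_GROUP. Writes the group
     * id of each process into group_of[] and returns the number of groups. */
    static int divide_into_groups(const int *msg_size, int nprocs, int *group_of) {
        int group = 0, in_group = 0;
        for (int p = 0; p < nprocs; p++) {
            if (in_group > 0 && in_group + msg_size[p] > MSG_GROUP) {
                group++;                 /* current group is full: start a new one */
                in_group = 0;
            }
            group_of[p] = group;
            in_group += msg_size[p];
        }
        return group + 1;
    }

    int main(void) {
        int msg_size[9] = { 16, 16, 32, 8, 8, 24, 40, 16, 8 };  /* made-up sizes */
        int group_of[9];
        int ngroups = divide_into_groups(msg_size, 9, group_of);
        printf("%d groups\n", ngroups);
        for (int p = 0; p < 9; p++)
            printf("p%d -> group %d\n", p, group_of[p]);
        return 0;
    }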


Figure: processes p0–p8 on Compute Node 0, Compute Node 1, and Compute Node 2 form Aggregation Group One, sized by the optimal group message size Msg_group.

Page 10: I/O Workload Partition within Aggregation Group

Analyzes all I/O accesses within each aggregation group
The workload is dynamically partitioned into distinct domains (a runnable sketch follows the pseudocode below)


Dynamical Workload Partition Algorithm

Bisect {
    Compute root weight Root_wgh;
    If Root_wgh > Msg_ind
        Bisect_tree(root);
}

Bisect_tree(vertex) {
    Create two children for the vertex;
    Split the region belonging to this vertex in half;
    Partition the compute nodes with associated messages in this region accordingly into two sets;
    Assign each set to one child;
    For each child {
        Compute child weight Child_wgh;
        If Child_wgh > Msg_ind
            Bisect_tree(child);
    }
}
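A compilable sketch of the same recursive bisection, written by me for illustration: Region, region_weight, and MSG_IND are hypothetical stand-ins (MSG_IND is chosen so the run reproduces the weights in the slide's example tree), and the real algorithm would compute weights from the actual message sizes of the compute nodes touching each region.

    #include <stdio.h>

    #define MSG_IND 12   /* hypothetical per-domain message-size threshold Msg_ind */

    typedef struct {
        long offset;     /* start of the file region covered by this vertex */
        long length;     /* extent of the region                            */
        long weight;     /* aggregate message size mapped onto this region  */
    } Region;

    /* Placeholder for the weight computation: the real algorithm would sum the
     * message sizes of the compute nodes whose accesses fall inside the child
     * region. Here we simply scale the parent weight by the split ratio. */
    static long region_weight(const Region *parent, long offset, long length) {
        (void)offset;
        return parent->weight * length / parent->length;
    }

    static void bisect_tree(Region r, int depth) {
        printf("%*sdomain [%ld, %ld) weight=%ld\n", depth * 2, "",
               r.offset, r.offset + r.length, r.weight);
        if (r.weight <= MSG_IND)
            return;                        /* small enough: becomes a file domain */

        long half = r.length / 2;          /* split the region in half            */
        Region left  = { r.offset,        half,            0 };
        Region right = { r.offset + half, r.length - half, 0 };
        left.weight  = region_weight(&r, left.offset, left.length);
        right.weight = r.weight - left.weight;

        bisect_tree(left,  depth + 1);     /* recurse while a child exceeds       */
        bisect_tree(right, depth + 1);     /* the threshold Msg_ind               */
    }

    int main(void) {
        Region root = { 0, 64, 49 };       /* mirrors the slide's example: weight 49 */
        bisect_tree(root, 0);
        return 0;
    }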

Figure: example bisection tree — a root workload of weight 49 is split into 25 and 24, then into 13, 12, 12, and 12, and the 13 is further split into 6 and 7.

Page 11: Workload Portions Remerging

Reorganizes the file domains considering the memory consumed by the aggregation
A file domain is merged with a neighboring domain, so the result is still a contiguous file domain
Aggregates I/O requests based on available memory and saturates the bandwidth (a sketch follows)


Figure: remerging example — neighboring file domains labeled A, B, and C are coalesced into fewer, larger contiguous domains.

Page 12: An Example and Comparison

Aggregation and file domain partition with the memory-conscious strategy
Compared against the conventional even partitioning
Avoids multiple iterations of collective I/O


Figure: example — processes p0–p8 on Compute Node 0, Compute Node 1, and Compute Node 2 are divided into Aggregation Group One and Aggregation Group Two.

Page 13: Aggregators Location

Limits the number of aggregators in a node
Identifies a node with the required available memory and minimizes communication and bandwidth requirements (see the flowchart summary and the sketch below)


Figure: aggregator placement flowchart — detect the physical nodes of all processes within one file domain; if a node has fewer than N_ah aggregators, add it to the candidate list, otherwise evaluate the next node; once all nodes have been checked, identify the candidate node with the maximum available memory Mem_avl; if Mem_avl is larger than the minimum memory requirement, select the corresponding process as the aggregator for this file domain, otherwise coalesce the file domain with the neighboring file domain.
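A compact sketch of the same selection logic, written for illustration: Node, N_AH, and min_mem_required are assumed stand-ins for the flowchart's per-node aggregator cap N_ah, available memory Mem_avl, and minimum memory requirement.

    #include <stddef.h>

    #define N_AH 2                       /* max aggregators allowed per node (assumed) */

    typedef struct {
        int    id;
        int    num_aggregators;          /* aggregators already placed on this node */
        size_t mem_avail;                /* currently available memory (bytes)      */
    } Node;

    /* Returns the index of the node that should host the aggregator for one file
     * domain, or -1 if no node qualifies (the caller would then coalesce the
     * domain with a neighboring file domain, as in the flowchart). */
    int select_aggregator_node(const Node *nodes, int n, size_t min_mem_required) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (nodes[i].num_aggregators >= N_AH)
                continue;                            /* node already saturated */
            if (best < 0 || nodes[i].mem_avail > nodes[best].mem_avail)
                best = i;                            /* track maximum Mem_avl  */
        }
        if (best >= 0 && nodes[best].mem_avail >= min_mem_required)
            return best;                             /* place the aggregator here   */
        return -1;                                   /* signal: coalesce the domain */
    }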

Page 14: Evaluation and Performance Analysis

Experimental Environment
•  640-node Linux-based cluster test bed with a 600 TB Lustre file system
•  Each node contains two Intel Xeon 2.8 GHz 6-core processors with 24 GB main memory
•  Nodes connected with a DDR InfiniBand interconnect
•  Prototyped with the MPICH2-1.0.5p3 library
Three well-known MPI-IO benchmarks selected for evaluation and comparison against normal collective I/O
•  coll_perf from the ROMIO software package
•  IOR, developed at Lawrence Livermore National Laboratory
•  MPI-IO Test, developed at Los Alamos National Laboratory


Page 15: Evaluation and Performance Analysis

Experimental Results of the coll_perf Benchmark
•  120 MPI processes used to write and read a 32 GB file on the Lustre file system
•  Cached data evicted by flushing memory after the write phase
•  Average performance improvements for the write and read tests were 34.2% and 22.9%, respectively


Figures: coll_perf write and read bandwidth (MB/s) vs. memory size (MB, 128 down to 2) with 120 cores, comparing Normal Collective I/O with the Memory-Conscious strategy; the relative improvement is plotted on a secondary axis.

Page 16: Evaluation and Performance Analysis

Experimental Results of the IOR Benchmark
•  Tests carried out with 120 and 1080 processes
•  Maximum write and read improvements of up to 121.7% and 89.1%, respectively
•  Write-test performance improvements were more sensitive to the new strategy
•  Average performance improvements for the write and read tests were 24.3% and 57.8%, respectively


Figures: IOR write and read bandwidth (MB/s) vs. memory size (MB, 128 down to 2) with 120 and 1080 cores, comparing Normal Collective I/O with the Memory-Conscious strategy; the relative improvement is plotted on a secondary axis.

Page 17: Evaluation and Performance Analysis

Experimental Results of the mpi-io-test Benchmark
•  Performance increased at a relatively moderate rate compared with IOR
•  Average performance improvements for the read and write tests were 32.9% and 14.6% at 120 cores
•  Average performance improvements for the read and write tests were 29.8% and 24.1% at 1080 cores


Figures: MPI-IO Test write and read bandwidth (MB/s) vs. memory size (MB, 128 down to 2) with 120 and 1080 cores, comparing Normal Collective I/O with the Memory-Conscious strategy; the relative improvement is plotted on a secondary axis.

Page 18: Conclusion

Exascale HPC systems are on the horizon
•  Decreased memory capacity per core, increased memory variance, and decreased bandwidth per core are critical challenges for collective I/O
Studied these constraints for projected exascale systems
Proposed a new memory-conscious collective I/O strategy
•  Restricts aggregation data traffic within groups
•  Determines I/O aggregation dynamically
•  Uses memory-aware data partitioning and aggregation
Experiments performed on MPICH2 with Lustre
Evaluation results confirmed that the proposed strategy outperforms existing collective I/O under memory constraints


Page 19: Conclusion and Future Work

An I/O system aware of memory constraints can be critical on current petascale and projected exascale systems
The memory-conscious collective I/O strategy
•  Retains the benefits of collective I/O
•  Reduces memory pressure
•  Alleviates off-chip bandwidth contention
Plan to further investigate and reduce communication costs
Plan to investigate leveraging SCM, burst buffers, and caching


Page 20: Thank You

Thank You. For more information please visit

http://discl.cs.ttu.edu/


Any Questions?
