
Page 1

Context · Approach · Evaluation · Memory Partitioning · Conclusion

I/O optimizations with Data Aggregation

Florin Isaila‡∗, Emmanuel Jeannot†, Preeti Malakar∗, François Tessier∗, Venkatram Vishwanath∗

∗Argonne National Laboratory, USA; †Inria Bordeaux Sud-Ouest, France; ‡University Carlos III, Spain

October 8, 2018

1 / 17

Page 2

Data Movement at Scale

- JLESC project started early 2015
- Focus from flops to bytes:
  - allocate data, move data, store data
  - bring data to the right place at the right time
- Computer simulations: climate simulation, heart or brain modelling, cosmology, etc.


Page 3

Data Movement at Scale

- Large needs in terms of I/O: high resolution, high fidelity

Table: Examples of large simulations' I/O from diverse papers

Scientific domain     Simulation    Data size
Cosmology             Q Continuum   2 PB / simulation
High-Energy Physics   Higgs Boson   10 PB / year
Climate / Weather     Hurricane     240 TB / simulation

- Growth of supercomputers to meet the performance needs, but with increasing gaps

Figure: Ratio of I/O bandwidth (TB/s) to compute performance (TF/s), in percent, for the #1 Top 500 system from 1997 to 2017. Computing capability has grown at a faster rate than the I/O performance of supercomputers.


Page 4

Complex Architectures

- Complex network topologies tending to reduce the distance between the data and the storage (multidimensional tori, dragonfly, ...)
- Partitioning of the architecture to avoid I/O interference (IBM BG/Q with I/O nodes (Figure), Cray with LNET nodes)
- New tiers of storage/memory for data staging (MCDRAM in KNL, NVRAM, burst buffer nodes)

Figure: Mira's I/O architecture. Compute nodes (PowerPC A2, 16 cores, 16 GB of DDR3) on a 5D torus network (2 GBps per link); bridge nodes (2 per I/O node, Pset of 128 nodes) linked at 2 GBps per link; I/O nodes (I/O forwarding daemon, GPFS client) linked at 4 GBps per link; storage (GPFS filesystem) reached through a QDR Infiniband switch. Mira: 49,152 nodes / 786,432 cores, 768 TB of memory, 27 PB of storage at 330 GB/s (GPFS), 5D torus network, peak performance 10 PetaFLOPS.


Page 5

Two-phase I/O

- Present in MPI I/O implementations like ROMIO
- Optimizes collective I/O performance by reducing network contention and increasing I/O bandwidth
- Chooses a subset of processes to aggregate data before writing it to the storage system

Limitations:
- Better for large messages (from experiments)
- No really efficient aggregator placement policy
- Information about upcoming data movements could help

Figure: Two-phase I/O mechanism. Processes 0-3 hold interleaved X, Y, Z data; aggregators gather contiguous pieces of the file during the aggregation phase, then write them to the file during the I/O phase.
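The mechanism in the figure can be sketched as a toy simulation (plain Python rather than MPI; the rank count, variable names, and aggregator count are illustrative):

```python
# Toy simulation of two-phase collective I/O (no real MPI).
# Each rank holds one X, Y, Z triple; the target file layout is
# X X X X | Y Y Y Y | Z Z Z Z, so a naive independent write
# would issue many small, strided requests.

def two_phase_write(ranks_data, n_aggregators):
    """Aggregate data into contiguous file regions, then 'write'."""
    # Target file layout: all X, then all Y, then all Z.
    file_layout = []
    for var in ("X", "Y", "Z"):
        for rank, data in enumerate(ranks_data):
            file_layout.append((rank, var, data[var]))

    # Aggregation phase: each aggregator gathers a contiguous slice.
    chunk = -(-len(file_layout) // n_aggregators)  # ceil division
    buffers = [file_layout[i * chunk:(i + 1) * chunk]
               for i in range(n_aggregators)]

    # I/O phase: each aggregator issues one large contiguous write.
    out = []
    for buf in buffers:
        out.extend(value for _, _, value in buf)
    return out

ranks_data = [{"X": f"x{r}", "Y": f"y{r}", "Z": f"z{r}"} for r in range(4)]
print(two_phase_write(ranks_data, 2))
# 12 small strided requests become 2 large contiguous writes
```

With 4 ranks and 2 aggregators, each aggregator writes one contiguous half of the file instead of every rank writing three scattered pieces.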


Page 6

Outline

1 Context

2 Approach

3 Evaluation

4 Number of Aggregators

5 Conclusion and Perspectives


Page 7

Approach

- Relevant aggregator placement, taking into account:
  - The topology of the architecture
  - The data pattern
- Efficient implementation of the two-phase I/O scheme:
  - I/O scheduling with the help of information about the upcoming reads and writes
  - Pipelining the aggregation and I/O phases to optimize data movements
  - One-sided communications and non-blocking operations to reduce synchronizations
- TAPIOCA (Topology-Aware Parallel I/O: Collective Algorithm):
  - MPI-based library for two-phase I/O
  - Portable through an abstracted representation of the machine
  - Generalizes toward data movement at scale on HPC facilities
  - Results published in Cluster 2017
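The gain from pipelining the aggregation and I/O phases can be illustrated with a back-of-the-envelope model (a sketch with made-up per-round timings, not TAPIOCA's actual implementation):

```python
# Double-buffered ("pipelined") aggregation: while one buffer is
# flushed to storage, the other keeps receiving data. Timings
# below are illustrative, in arbitrary units.

def sequential_time(n_rounds, t_aggr, t_io):
    """Without pipelining, each round aggregates then writes."""
    return n_rounds * (t_aggr + t_io)

def pipelined_time(n_rounds, t_aggr, t_io):
    """Round 0 must be aggregated before anything can be written;
    afterwards, aggregation of round k+1 overlaps the write of round k."""
    return t_aggr + (n_rounds - 1) * max(t_aggr, t_io) + t_io

print(sequential_time(8, 1.0, 2.0))  # 24.0
print(pipelined_time(8, 1.0, 2.0))   # 1 + 7*2 + 2 = 17.0
```

The slower of the two phases becomes the bottleneck; the faster one is almost entirely hidden behind it.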


Page 8

Aggregator Placement

- Goal: find a compromise between aggregation and I/O costs
- Four tested strategies:
  - Shortest path: smallest distance to the I/O node
  - Longest path: longest distance to the I/O node
  - Greedy: lowest rank in partition (comparable to an MPICH strategy)
  - Topology-aware

How to take the topology into account in aggregator mapping?

Figure: Data aggregation for I/O: simple partitioning and aggregator election on a grid of compute nodes, with a bridge node and the storage system. Elected aggregators per partition: S = shortest path, L = longest path, G = greedy, T = topology-aware.


Page 9

Aggregator Placement - Topology-aware strategy

- ω(u, v): amount of data exchanged between nodes u and v
- d(u, v): number of hops from node u to node v
- l: the interconnect latency
- B_{i→j}: the bandwidth from node i to node j
- V_C: compute nodes; IO: I/O node; A: aggregator

C1 = max_{i ∈ V_C} ( l × d(i, A) + ω(i, A) / B_{i→A} )

C2 = l × d(A, IO) + ω(A, IO) / B_{A→IO}

Objective function:

TopoAware(A) = min (C1 + C2)

- Computed by each process independently in O(n), n = |V_C|
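A minimal sketch of this objective function (the hop counts, latency, and bandwidth values below are assumed for illustration, not measured on any machine):

```python
# Sketch of the TopoAware objective: pick the aggregator A among the
# compute nodes that minimizes C1 + C2. Uniform bandwidth is assumed.

LATENCY = 1e-6      # l: interconnect latency (s), assumed
BANDWIDTH = 2e9     # B_{i->j}: 2 GBps per link, uniform here

def cost_c1(candidate, compute_nodes, dist, data):
    # C1 = max over compute nodes i of l*d(i, A) + w(i, A)/B
    return max(LATENCY * dist[i][candidate] + data[i] / BANDWIDTH
               for i in compute_nodes)

def cost_c2(candidate, io_node, dist, total_data):
    # C2 = l*d(A, IO) + w(A, IO)/B
    return LATENCY * dist[candidate][io_node] + total_data / BANDWIDTH

def topo_aware(compute_nodes, io_node, dist, data):
    total = sum(data[i] for i in compute_nodes)
    return min(compute_nodes,
               key=lambda a: cost_c1(a, compute_nodes, dist, data)
                             + cost_c2(a, io_node, dist, total))

# Toy 4-node example: compute nodes 0-2, I/O node 3; node 0 is
# adjacent to every other node. Hop counts are made up.
dist = {
    0: {0: 0, 1: 1, 2: 1, 3: 1},
    1: {0: 1, 1: 0, 2: 2, 3: 2},
    2: {0: 1, 1: 2, 2: 0, 3: 2},
}
data = {0: 1e6, 1: 1e6, 2: 1e6}   # bytes sent to the aggregator
print(topo_aware([0, 1, 2], 3, dist, data))  # node 0 minimizes C1 + C2
```

The single loop over candidates (with the inner max over |V_C| nodes folded into the cost) is what makes the election cheap enough for every process to compute it independently.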

Page 10

Micro-benchmark - Placement strategies

- Evaluation on Mira (BG/Q), 512 nodes, 16 ranks/node
- Each rank sends an amount of data distributed randomly between 0 and 2 MB
- Write to /dev/null of the I/O node (performance of just the aggregation and I/O phases)
- Aggregation settings: 16 aggregators, 16 MB buffer size

Table: Impact of the aggregator placement strategy

Strategy         I/O Bandwidth (MBps)   Aggr. Time/round (ms)
Topology-Aware   2638.40                310.46
Shortest path    2484.39                327.08
Longest path     2202.91                370.40
Greedy           1927.45                421.33


Page 11

Micro-benchmark - Theta

- 10 PetaFLOPS Cray XC40 supercomputer
- 512 Theta nodes, 16 ranks per node
- 48 aggregators, 8 MB aggregation buffer size

Figure: TAPIOCA vs MPI I/O bandwidth (GBps) as a function of the data size per rank (0 to 4 MB), alongside Theta's architecture: Intel KNL 7250 compute nodes (36 tiles of 2 cores with shared L2, 16 GB MCDRAM, 192 GB DDR4, 128 GB SSD), 4 nodes per Aries router; dragonfly network with a 2D all-to-all structure of 96 routers per 2-cabinet group (16 x 6 routers hosted), 9 groups / 18 cabinets, electrical links at 14 GBps (levels 1 and 2), optical links at 12.5 GBps (all-to-all at level 3); service nodes (LNET, gateway, ..., irregular mapping) connected to the Lustre filesystem over IB FDR at 56 GBps.


Page 12

HACC-IO – MIRA (BG/Q)

- I/O part of a large-scale cosmological application simulating the mass evolution of the universe with particle-mesh techniques
- Each process manages particles defined by 9 variables (XX, YY, ZZ, VX, VY, VZ, phi, pid and mask)

Figure: Data layouts implemented in HACC for processes 0-3. Array of structures: variables interleaved per particle (X Y Z X Y Z ...); structure of arrays: variables grouped across particles (X X X X Y Y Y Y Z Z Z Z).
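The two layouts can be sketched as follows (toy particle data using 3 of the 9 variables; values and variable names are illustrative):

```python
# AoS vs SoA file layouts for HACC-like particle data.
# Each particle is a record of variables; the real code uses
# 9 variables per particle (XX, YY, ZZ, VX, VY, VZ, phi, pid, mask).

def aos_layout(particles):
    """Array of structures: X0 Y0 Z0 X1 Y1 Z1 ..."""
    return [p[v] for p in particles for v in ("X", "Y", "Z")]

def soa_layout(particles):
    """Structure of arrays: X0 X1 ... Y0 Y1 ... Z0 Z1 ..."""
    return [p[v] for v in ("X", "Y", "Z") for p in particles]

parts = [{"X": 1, "Y": 2, "Z": 3}, {"X": 4, "Y": 5, "Z": 6}]
print(aos_layout(parts))  # [1, 2, 3, 4, 5, 6]
print(soa_layout(parts))  # [1, 4, 2, 5, 3, 6]
```

With many processes, SoA means each variable forms one contiguous region per process in the shared file, while AoS interleaves all variables, which changes the access pattern the aggregators must reassemble.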


Page 13

HACC-IO – MIRA (BG/Q)


Figure: TAPIOCA vs MPI I/O bandwidth (GBps) for the AoS and SoA layouts as a function of the data size per rank (0 to 4 MB).

- BG/Q (Mira), one file per Pset
- 4096 nodes (16 ranks/node)
- TAPIOCA: 16 aggregators per Pset, 16 MB aggregator buffer size
- Very poor performance from MPI I/O on AoS
- Up to 2.5x faster than MPI I/O on SoA


Page 14

HACC-IO – Theta (Cray XC40)


Figure: TAPIOCA vs MPI I/O bandwidth (GBps) for the AoS and SoA layouts as a function of the data size per rank (0 to 4 MB).

- Theta, from 2048 nodes (16 ranks/node)
- Lustre: 48 OSTs, 16 MB stripe size
- TAPIOCA: 384 aggregators, 16 MB aggregator buffer size
- AoS, 3.6 MB: 4 times faster than MPI I/O


Page 15

Number of aggregators

Figure: Array of structures vs structure of arrays for processes 0-3: the AoS file layout interleaves X Y Z per particle, while the SoA layout groups all X, then all Y, then all Z.

Trade-off:
- Many aggregators: lots of I/O rounds
- Few aggregators: I/O latency dominates


Page 16

Model Simulation

Figure: AoS and SoA read time vs number of aggregators (NB_Agg from 1 to 4096, log scale).

- 4096 nodes
- 9 x 1 MB structures
- 16 ranks/node
- 56 I/O streams


Page 17

Conclusion and Perspectives

Conclusion
- I/O library based on the two-phase scheme, developed to optimize data movements
  - Topology-aware aggregator placement
  - Optimized buffering (two pipelined buffers, one-sided communications, block-size awareness)
- Very good performance at scale, outperforming standard approaches
- On the I/O part of a cosmological application, up to 15x improvement
- Started modeling I/O performance
