
Tanzima Z. Islam, Saurabh Bagchi, Rudolf Eigenmann – Purdue University

Kathryn Mohror, Adam Moody, Bronis R. de Supinski – Lawrence Livermore National Lab.

mcrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression


Background

Checkpoint-restart is widely used

MPI applications take globally coordinated checkpoints

Application-level checkpoints use a high-level I/O format: HDF5, ADIOS, netCDF, etc.

Checkpoint writing: Application -> I/O Library (data-format API: NetCDF, HDF5) -> checkpoint file

Application data:

    struct ToyGrp {
        float Temperature[1024];
        short Pressure[20][30];
    };

Resulting HDF5 checkpoint:

    HDF5 checkpoint {
    Group "/" {
      Group "ToyGrp" {
        DATASET "Temperature" {
          DATATYPE H5T_IEEE_F32LE
          DATASPACE SIMPLE { (1024) / (1024) }
        }
        DATASET "Pressure" {
          DATATYPE H5T_STD_U8LE
          DATASPACE SIMPLE { (20,30) / (20,30) }
        }
      }
    }
    }

Writing to the Parallel File System (PFS):

N->1: not scalable

N->N: easiest, but causes contention on the PFS

N->M: best compromise, but complex
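The trade-off between the three transfer patterns can be made concrete by counting how many writers reach the PFS in one checkpoint. The sketch below is only an illustration: the function name `files_written` is hypothetical, and it assumes one file per writer and rank-order groups of equal size.

```python
import math

def files_written(num_processes: int, pattern: str, group_size: int = 1) -> int:
    """Number of files that reach the PFS in one checkpoint (one file per writer)."""
    if pattern == "N->1":
        return 1                       # every process writes into one shared file
    if pattern == "N->N":
        return num_processes           # every process writes its own file
    if pattern == "N->M":
        return math.ceil(num_processes / group_size)  # one file per aggregator
    raise ValueError(f"unknown pattern: {pattern}")

for pattern in ("N->1", "N->N", "N->M"):
    print(pattern, files_written(8192, pattern, group_size=32))
```

With 8192 processes and groups of 32, N->M reduces the number of concurrent writers from 8192 to 256.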

Tanzima Islam ([email protected]) mcrEngine: Data-aware Aggregation & Compression


Impact of Load on PFS at Large Scale

IOR benchmark
  78MB of data per process
  N->N checkpoint transfer

Observations:
  (-) Large average write time leads to less frequent checkpointing
  (-) Large average read time leads to poor application performance

[Figure: average write time (s) and average read time (s) versus number of processes (N), from 128 to 8192.]



What is the Problem?

Today’s checkpoint-restart systems will not scale
  Increasing number of concurrent transfers
  Increasing volume of checkpoint data



Our Contributions

Data-aware aggregation
  Reduces the number of concurrent transfers
  Improves compressibility of checkpoints

Data-aware compression
  Reduces data almost 2x more than simply concatenating checkpoints and compressing them

Design and develop mcrEngine
  N->M checkpointing system
  Decouples checkpoint transfer logic from applications
  Improves application performance



Overview

Background

Problem

Data aggregation & compression

Evaluation



Data-Agnostic Schemes

Agnostic scheme: concatenate checkpoints
  C1 | C2 -> Gzip -> PFS

Agnostic-block scheme: interleave fixed-size blocks
  C1[1-B] | C2[1-B] | C1[B+1-2B] | C2[B+1-2B] -> Gzip -> PFS

Observations:
  (+) Easy
  (−) Low compression ratio
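The two data-agnostic merges can be sketched in a few lines. This is an illustrative sketch, not the mcrEngine code: the function names are hypothetical, checkpoints are treated as opaque byte strings, and Python's `gzip` module stands in for the Gzip stage.

```python
import gzip

def agnostic(checkpoints: list[bytes]) -> bytes:
    """Concatenate whole checkpoints: C1 | C2 | ..."""
    return b"".join(checkpoints)

def agnostic_block(checkpoints: list[bytes], block: int) -> bytes:
    """Interleave fixed-size blocks: C1[0:B] | C2[0:B] | C1[B:2B] | C2[B:2B] | ..."""
    out = bytearray()
    longest = max(len(c) for c in checkpoints)
    for start in range(0, longest, block):
        for c in checkpoints:
            out += c[start:start + block]   # a short checkpoint contributes fewer bytes
    return bytes(out)

c1 = b"AAAABBBB"
c2 = b"aaaabbbb"
print(agnostic([c1, c2]))                  # b'AAAABBBBaaaabbbb'
print(agnostic_block([c1, c2], block=4))   # b'AAAAaaaaBBBBbbbb'
compressed = gzip.compress(agnostic([c1, c2]))  # then written to the PFS
```

Neither merge looks inside the checkpoint, so bytes of different variables and data types end up next to each other, which is consistent with the low compression ratio noted above.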



Aware Scheme

Identify similar variables across processes
  P0 and P1 each checkpoint the same group:
    Group ToyGrp { float Temperature[1024]; int Pressure[20][30]; };
  Checkpoints before merging: C1.T C1.P | C2.T C2.P
  Meta-data used for matching:
    1. Name
    2. Data-type
    3. Class: Array, Atomic

Concatenating similar variables
  Merged output: C1.T C2.T | C1.P C2.P
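A minimal sketch of the aware merge, assuming each checkpoint has already been parsed into records of (name, data-type, class, payload). That record layout and the name `aware_merge` are hypothetical; the slides only state that name, data-type, and class are the matching meta-data.

```python
from collections import defaultdict

# Hypothetical parsed form of one checkpoint: (name, dtype, var_class, payload) records.
Variable = tuple[str, str, str, bytes]

def aware_merge(checkpoints: list[list[Variable]]) -> dict[tuple[str, str, str], bytes]:
    """Concatenate variables that share name, data-type, and class."""
    merged: dict[tuple[str, str, str], bytearray] = defaultdict(bytearray)
    for checkpoint in checkpoints:                 # process order: C1, C2, ...
        for name, dtype, var_class, payload in checkpoint:
            merged[(name, dtype, var_class)] += payload
    return {key: bytes(buf) for key, buf in merged.items()}

c1 = [("Temperature", "float32", "Array", b"T1"), ("Pressure", "int32", "Array", b"P1")]
c2 = [("Temperature", "float32", "Array", b"T2"), ("Pressure", "int32", "Array", b"P2")]
merged = aware_merge([c1, c2])
print(merged[("Temperature", "float32", "Array")])   # b'T1T2'
print(merged[("Pressure", "int32", "Array")])        # b'P1P2'
```

Each merged buffer holds a single data type, which is what lets a type-specific compressor be applied to it afterwards.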



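Aware-Block Scheme

Interleaving similar variables
  Meta-data used for matching (P0, P1):
    1. Name
    2. Data-type
    3. Class: Array, Atomic
  Take the first ‘B’ bytes of Temperature from each process, then the next ‘B’ bytes of Temperature, and so on
  Interleave Pressure in the same way
  Checkpoints before merging: C1.T C1.P | C2.T C2.P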
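The aware-block merge can be sketched as block interleaving applied per variable rather than per checkpoint. In this sketch the names are hypothetical, variables are matched by name only for brevity, and every process is assumed to hold the same set of variables.

```python
def interleave_blocks(parts: list[bytes], block: int) -> bytes:
    """Take B bytes from each part in turn until all parts are consumed."""
    out = bytearray()
    longest = max(len(p) for p in parts)
    for start in range(0, longest, block):
        for p in parts:
            out += p[start:start + block]
    return bytes(out)

def aware_block_merge(checkpoints: list[dict[str, bytes]], block: int) -> dict[str, bytes]:
    """For each variable, interleave B-byte blocks across processes."""
    names = list(checkpoints[0])   # assumption: all processes hold the same variables
    return {name: interleave_blocks([c[name] for c in checkpoints], block)
            for name in names}

c1 = {"Temperature": b"t1t1T1T1", "Pressure": b"p1p1"}
c2 = {"Temperature": b"t2t2T2T2", "Pressure": b"p2p2"}
merged = aware_block_merge([c1, c2], block=4)
print(merged["Temperature"])   # b't1t1t2t2T1T1T2T2'
print(merged["Pressure"])      # b'p1p1p2p2'
```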



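Data-Aware Aggregation & Compression

Aware scheme: concatenate similar variables
Aware-block scheme: interleave similar variables

First phase: data-type aware compression (FPC, Lempel-Ziv) of each merged variable (T, P, H, D) into an output buffer

Second phase: Gzip over the output buffer, then write to the PFS

[Diagram: merged variables (C1.T C2.T, C1.P C2.P) -> data-type aware compression -> output buffer -> Gzip -> PFS]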
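The two compression phases can be sketched as a per-type dispatch followed by a general-purpose pass. This is a sketch under stated assumptions, not the mcrEngine implementation: FPC, fpzip, and the Lempel-Ziv codec are not available in the Python standard library, so an XOR-with-previous-value transform plus `zlib` stands in for the type-aware compressors, and all function names are hypothetical.

```python
import gzip
import struct
import zlib

def xor_delta_doubles(raw: bytes) -> bytes:
    """XOR each 64-bit word with its predecessor so slowly varying values
    leave many zero bytes. A crude stand-in for FPC's value prediction."""
    words = struct.unpack(f"<{len(raw) // 8}Q", raw)   # assumes len(raw) % 8 == 0
    prev, out = 0, []
    for w in words:
        out.append(w ^ prev)
        prev = w
    return struct.pack(f"<{len(out)}Q", *out)

def first_phase(dtype: str, raw: bytes) -> bytes:
    """Pick a compressor by data type (stand-ins, not the real FPC/fpzip/LZ)."""
    if dtype == "float64":
        return zlib.compress(xor_delta_doubles(raw))
    return zlib.compress(raw, 1)    # fast LZ-style pass for all other types

def two_phase(merged: dict[str, tuple[str, bytes]]) -> bytes:
    """merged maps variable name -> (dtype, bytes merged from all processes)."""
    output_buffer = b"".join(first_phase(dtype, raw) for dtype, raw in merged.values())
    return gzip.compress(output_buffer)   # second phase: general-purpose compression

temperature = struct.pack("<4d", 1.0, 1.0, 1.0, 1.0)
pressure = b"\x01\x00\x00\x00" * 8
blob = two_phase({"T": ("float64", temperature), "P": ("int32", pressure)})
```

A real implementation would also record per-variable lengths and types in the output buffer so that restart can split and decode it; that framing is omitted here.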



How mcrEngine Works

CNC: compute node component
ANC: aggregator node component
Setup shown: rank-order groups, group size = 4, N->M checkpointing

1. The CNCs in a group send their meta-data to the group's ANC
2. The ANC identifies “similar” variables
3. The ANC requests similar variables from the CNCs (first T and P, then H and D)
4. The ANC applies data-aware aggregation and compression, then Gzip
5. The ANC writes the result to the PFS
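Rank-order grouping with a fixed group size can be sketched as integer division on the rank. The function names are hypothetical, and the sketch says nothing about which node hosts the ANC, since the slides do not specify it.

```python
def group_of(rank: int, group_size: int) -> int:
    """Index of the group (and hence the aggregator) that serves this rank."""
    return rank // group_size

def rank_order_groups(num_processes: int, group_size: int) -> list[list[int]]:
    """Consecutive ranks form a group; the last group may be smaller."""
    return [list(range(start, min(start + group_size, num_processes)))
            for start in range(0, num_processes, group_size)]

groups = rank_order_groups(12, 4)
print(groups)            # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(group_of(6, 4))    # 1
```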



Overview

Background

Problem

Data aggregation & compression

Evaluation



Evaluation

Applications
  ALE3D: 4.8GB per checkpoint set
  Cactus: 2.41GB per checkpoint set
  Cosmology: 1.1GB per checkpoint set
  Implosion: 13MB per checkpoint set

Experimental test-bed
  LLNL’s Sierra: 261.3 TFLOP/s Linux cluster
  15,408 cores, 1.3 Petabyte Lustre file system

Compression algorithms
  FPC [1] for double-precision floats
  fpzip [2] for single-precision floats
  Lempel-Ziv for all other data types
  Gzip for general-purpose compression



Evaluation Metrics

Effectiveness of data-aware compression
  What is the benefit of multiple compression phases?
  How does group size affect compression ratio?
  How does compression ratio change as a simulation progresses?

Performance of mcrEngine
  Overhead of the checkpointing phase
  Overhead of the restart phase

Compression ratio = Uncompressed size / Compressed size
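The compression ratio metric is a direct quotient of sizes, so a value above 1 means the data shrank. A small sketch, using `gzip` on made-up input purely for illustration:

```python
import gzip

def compression_ratio(uncompressed: bytes, compressed: bytes) -> float:
    """Uncompressed size divided by compressed size."""
    return len(uncompressed) / len(compressed)

data = bytes(1000)                       # 1000 zero bytes, highly compressible
ratio = compression_ratio(data, gzip.compress(data))
print(ratio > 1.0)                       # True
```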



Multiple Phases of Data-Aware Compression are Beneficial

Data-type aware compression improves compressibility
  First phase changes underlying data format

Data-agnostic double compression is not beneficial
  Because the data format is non-uniform and uncompressible

[Figure: compression ratio (0 to 4) after the first and second phases, data-aware versus data-agnostic, for ALE3D, Cactus, Cosmology, and Implosion. Callout: no benefit with data-agnostic double compression.]



Impact of Group Size on Compression Ratio

Different merging schemes are better for different applications
Larger group size is beneficial for certain applications
  ALE3D: improvement of 8% from group size 2 to 32

[Figure: compression ratio versus group size for the Aware-Block and Aware schemes. Panels: ALE3D, Cactus, Cosmology, Implosion.]



Data-Aware Technique Always Wins over Data-Agnostic

Data-aware technique always yields better compression ratio than data-agnostic technique

[Figure: compression ratio versus group size for the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes. Panels: ALE3D, Cactus, Cosmology, Implosion. Callout: 98-115%.]



Compression Ratio Follows Course of Simulation

Data-aware technique always yields better compression

[Figure: compression ratio versus simulation time-steps for the Aware-Block, Aware, Agnostic-Block, and Agnostic schemes. Panels: Cactus, Cosmology, Implosion.]



Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application   Total Size (GB)   Aware-Block (%)   Aware (%)
ALE3D         4.8               6.6 - 27.7        6.6 - 12.7
Cactus        2.41              10.7 - 11.9       98 - 115
Cosmology     1.1               20.1 - 25.6       20.6 - 21.1
Implosion     0.013             36.3 - 38.4       36.3 - 38.8
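The slides do not spell out how relative improvement is computed. One plausible reading, shown here as an assumption, is the percentage gain of the data-aware compression ratio over the data-agnostic ratio. The function name and the sample ratios are made up for illustration and are not measurements from the evaluation.

```python
def relative_improvement(aware_ratio: float, agnostic_ratio: float) -> float:
    """Percentage gain of the data-aware ratio over the data-agnostic one.
    Assumed definition; the slides do not state the formula."""
    return (aware_ratio - agnostic_ratio) / agnostic_ratio * 100.0

# Hypothetical ratios: aware compresses 3.0x, agnostic 2.0x.
print(relative_improvement(3.0, 2.0))   # 50.0
```

Under this reading, an improvement above 100% means the data-aware compression ratio is more than twice the data-agnostic one.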



Impact of Aggregation on Scalability

Used IOR
  N->N: each process transfers 78MB
  N->M: group size 32, 1.21GB per aggregator

[Figure: average write time (sec) and average read time (sec) versus number of processes (N), from 128 to 8192, comparing N->N and N->M writes and reads.]



Impact of Data-Aware Compression on Scalability

IOR with N->M transfer, groups of 32 processes
  Data-aware: 1.2GB, data-agnostic: 2.4GB

Data-aware compression improves I/O performance at large scale
  Improvement during write: 43% - 70%
  Improvement during read: 48% - 70%

[Figure: average transfer time (sec) versus number of processes (N), from 128 to 28672, for Agnostic-Read, Agnostic-Write, Aware-Read, and Aware-Write.]



End-to-End Checkpointing Overhead

15,408 processes
Group size of 32 for N->M schemes
Each process takes a checkpoint

Converts a network-bound operation into a CPU-bound one

[Figure: total checkpointing overhead (sec), split into CPU overhead and transfer overhead to PFS, for ALE3D and Cactus. Configurations: no compression with N->N, individual compression, no compression with N->M, Agnostic+Agg, Aware+Agg. Callouts: reduction in checkpointing overhead of 87% and 51%.]



End-to-End Restart Overhead

Reduced overall restart overhead
Reduced network load and transfer time

[Figure: total recovery overhead (sec), split into CPU overhead and transfer overhead to PFS, for ALE3D and Cactus. Configurations: no compression with N->N, individual compression, no compression with N->M, Agnostic+Agg, Aware+Agg. Callouts: reduction in I/O overhead of 43% and 71%; reduction in recovery overhead of 62% and 64%.]



Conclusion

Developed data-aware checkpoint compression technique
  Relative improvement in compression ratio up to 115%

Investigated different merging techniques
Evaluated effectiveness using real-world applications

Designed and developed a scalable framework
  Implements N->M checkpointing
  Improves application performance
  Transforms checkpointing into a CPU-bound operation



Contact Information

Tanzima Islam ([email protected])
Website: web.ics.purdue.edu/~tislam

Acknowledgement

Purdue:
  Saurabh Bagchi ([email protected])
  Rudolf Eigenmann ([email protected])

Lawrence Livermore National Laboratory:
  Kathryn Mohror ([email protected])
  Adam Moody ([email protected])
  Bronis R. de Supinski ([email protected])



Backup Slides



[Backup Slide] Failures in HPC

Source: “A Large-scale Study of Failures in High-performance Computing Systems”, by Bianca Schroeder and Garth Gibson

[Figures: breakdown of root causes of failures; breakdown of downtime into root causes.]



Future Work

Analytical solution to group size selection?
Better way than rank-order grouping?
Variable streaming?

Engineering challenge



References

1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”.

2. P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”.

3. L. Reinhold, “QuickLZ”.

