"mcrEngine": a Scalable Checkpointing System using Data-Aware Aggregation and Compression

Tanzima Z. Islam, Saurabh Bagchi, Rudolf Eigenmann (Purdue University)
Kathryn Mohror, Adam Moody, Bronis R. de Supinski (Lawrence Livermore National Lab.)


Background

- Checkpoint-restart is widely used
  - MPI applications
    - Take globally coordinated checkpoints
- Application-level checkpoints use a high-level I/O format
  - HDF5, Adios, netCDF, etc.

Checkpoint writing: Application -> I/O library (data-format API: NetCDF, HDF5) -> Parallel File System (PFS)

Example application data:

    struct ToyGrp {
        float Temperature[1024];
        short Pressure[20][30];
    };

Corresponding HDF5 checkpoint:

    HDF5 checkpoint {
      Group "/" {
        Group "ToyGrp" {
          DATASET "Temperature" {
            DATATYPE H5T_IEEE_F32LE
            DATASPACE SIMPLE { (1024) / (1024) }
          }
          DATASET "Pressure" {
            DATATYPE H5T_STD_U8LE
            DATASPACE SIMPLE { (20,30) / (20,30) }
          }
        }
      }
    }

Checkpoint transfer patterns to the PFS:
- N->1: not scalable
- N->N: easiest, but contention on the PFS
- N->M: best compromise, but complex

Tanzima Islam ([email protected]) mcrEngine: Data-aware Aggregation & Compression

Impact of Load on PFS at Large Scale

- IOR
  - 78MB of data per process
  - N->N checkpoint transfer
- Observations:
  (-) Large average write time, hence less frequent checkpointing
  (-) Large average read time, hence poor application performance

[Figure: average write time (s) and average read time (s) versus # of processes (N), for N = 128 to 8192]

What is the Problem?

- Today's checkpoint-restart systems will not scale
  - Increasing number of concurrent transfers
  - Increasing volume of checkpoint data

Our Contributions

- Data-aware aggregation
  - Reduces the number of concurrent transfers
  - Improves compressibility of checkpoints
- Data-aware compression
  - Reduces data almost 2x more than simply concatenating checkpoints and compressing them
- Design and development of mcrEngine
  - N->M checkpointing system
  - Decouples checkpoint transfer logic from applications
  - Improves application performance

Overview

- Background
- Problem
- Data aggregation & compression
- Evaluation

Data-Agnostic Schemes

- Agnostic scheme: concatenate checkpoints
    C1 | C2 -> Gzip -> PFS
- Agnostic-block scheme: interleave fixed-size blocks
    C1[1-B] | C2[1-B] | C1[B+1-2B] | C2[B+1-2B] -> Gzip -> PFS
- Observations:
  (+) Easy
  (-) Low compression ratio

[Diagram label: First Phase]
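The two data-agnostic merges can be sketched as below. This is a minimal illustration, not mcrEngine's code: checkpoints are assumed to be plain byte strings, and the function names `concatenate`, `interleave_blocks` and `compress_for_pfs` are hypothetical.

```python
import gzip


def concatenate(checkpoints):
    """Agnostic scheme: place whole checkpoints back to back."""
    return b"".join(checkpoints)


def interleave_blocks(checkpoints, block_size):
    """Agnostic-block scheme: take block_size bytes from each checkpoint in turn."""
    merged = bytearray()
    longest = max((len(c) for c in checkpoints), default=0)
    for offset in range(0, longest, block_size):
        for ckpt in checkpoints:
            # Slicing past the end yields b"", so shorter checkpoints simply drop out.
            merged += ckpt[offset:offset + block_size]
    return bytes(merged)


def compress_for_pfs(merged):
    """Gzip the merged buffer before it is written to the PFS."""
    return gzip.compress(merged)


c1, c2 = b"AAAA", b"BBBB"
print(concatenate([c1, c2]))           # b'AAAABBBB'
print(interleave_blocks([c1, c2], 2))  # b'AABBAABB'
```

Neither merge looks at what the bytes mean, which is why both are easy to implement. A real restart path would also need the original checkpoint sizes to undo the interleaving; that bookkeeping is left out here.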

Identify Similar Variables Across Processes

P0:

    Group ToyGrp {
        float Temperature[1024];
        int Pressure[20][30];
    };

P1:

    Group ToyGrp {
        float Temperature[100];
        int Pressure[10][50];
    };

Meta-data:
1. Name
2. Data-type
3. Class: Array, Atomic

Aware scheme: concatenating similar variables

    Before: C1.T | C1.P | C2.T | C2.P
    After:  C1.T | C2.T | C1.P | C2.P
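A minimal sketch of the aware scheme follows. It assumes each checkpoint is a list of variable records with hypothetical fields `name`, `dtype`, `cls` and `data`; these mirror the three meta-data items on the slide but are not taken from mcrEngine itself.

```python
def variable_key(var):
    """Variables are treated as similar when name, data-type and class all match."""
    return (var["name"], var["dtype"], var["cls"])


def aware_merge(checkpoints):
    """Aware scheme: concatenate the data of similar variables across checkpoints."""
    groups = {}
    for ckpt in checkpoints:
        for var in ckpt:
            groups.setdefault(variable_key(var), []).append(var["data"])
    return {key: b"".join(chunks) for key, chunks in groups.items()}


# Toy checkpoints from two processes; array lengths differ, as on the slide.
p0 = [
    {"name": "Temperature", "dtype": "float", "cls": "Array", "data": b"T0T0"},
    {"name": "Pressure", "dtype": "int", "cls": "Array", "data": b"P0"},
]
p1 = [
    {"name": "Temperature", "dtype": "float", "cls": "Array", "data": b"T1"},
    {"name": "Pressure", "dtype": "int", "cls": "Array", "data": b"P1P1"},
]

merged = aware_merge([p0, p1])
print(merged[("Temperature", "float", "Array")])  # b'T0T0T1'
print(merged[("Pressure", "int", "Array")])       # b'P0P1P1'
```

Array dimensions are not part of the key, so `Temperature[1024]` on P0 and `Temperature[100]` on P1 still land in the same group.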

Aware-Block Scheme

P0:

    Group ToyGrp {
        float Temperature[1024];
        int Pressure[20][30];
    };

P1:

    Group ToyGrp {
        float Temperature[100];
        int Pressure[10][50];
    };

Meta-data:
1. Name
2. Data-type
3. Class: Array, Atomic

Interleaving similar variables (input: C1.T | C1.P | C2.T | C2.P):
- First 'B' bytes of Temperature
- Next 'B' bytes of Temperature
- Interleave Pressure
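The aware-block scheme can be sketched by interleaving 'B'-byte blocks within each group of similar variables. As before, the record fields (`name`, `dtype`, `cls`, `data`) and the function names are illustrative assumptions rather than mcrEngine's interface.

```python
def interleave(buffers, block_size):
    """Take block_size bytes from each buffer in turn until all are exhausted."""
    merged = bytearray()
    longest = max((len(b) for b in buffers), default=0)
    for offset in range(0, longest, block_size):
        for buf in buffers:
            merged += buf[offset:offset + block_size]  # b"" once buf is used up
    return bytes(merged)


def aware_block_merge(checkpoints, block_size):
    """Aware-block scheme: group similar variables, then interleave their blocks."""
    groups = {}
    for ckpt in checkpoints:
        for var in ckpt:
            key = (var["name"], var["dtype"], var["cls"])
            groups.setdefault(key, []).append(var["data"])
    return {key: interleave(bufs, block_size) for key, bufs in groups.items()}


p0 = [{"name": "Temperature", "dtype": "float", "cls": "Array", "data": b"aaaa"}]
p1 = [{"name": "Temperature", "dtype": "float", "cls": "Array", "data": b"bb"}]

result = aware_block_merge([p0, p1], block_size=2)
print(result[("Temperature", "float", "Array")])  # b'aabbaa'
```

With `block_size=2`, the first two bytes of each process's Temperature come first, followed by the next two bytes of whichever buffers still have data.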

Data-Aware Aggregation & Compression

- Aware scheme: concatenate similar variables
- Aware-block scheme: interleave similar variables

Pipeline:

    merged variables (C1.T | C2.T | C1.P | C2.P)
      -> First Phase: data-type aware compression (FPC, Lempel-Ziv) of each variable (T, P, H, D)
      -> output buffer
      -> Second Phase: Gzip
      -> PFS
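The two-phase pipeline can be sketched as follows. Important assumption: FPC and fpzip are not in the Python standard library, so `zlib` is used here purely as a stand-in for every first-phase codec. The sketch shows only the dispatch-by-data-type structure and the second Gzip pass, and the length-prefixed framing is an invented detail for the round trip.

```python
import gzip
import struct
import zlib


def standin_codec(data):
    """Placeholder for a data-type specific compressor (FPC, fpzip, Lempel-Ziv)."""
    return zlib.compress(data)


# Hypothetical dispatch table: data-type name -> first-phase compressor.
FIRST_PHASE_CODECS = {
    "double": standin_codec,  # the slides use FPC here
    "float": standin_codec,   # the slides use fpzip here
}


def first_phase(variables):
    """Compress each (dtype, data) pair separately into one output buffer."""
    out = bytearray()
    for dtype, data in variables:
        codec = FIRST_PHASE_CODECS.get(dtype, standin_codec)  # default: Lempel-Ziv stand-in
        payload = codec(data)
        out += struct.pack("<I", len(payload)) + payload  # length-prefixed frame
    return bytes(out)


def two_phase_compress(variables):
    """Second phase: general-purpose Gzip over the whole output buffer."""
    return gzip.compress(first_phase(variables))


def two_phase_decompress(blob):
    """Undo both phases and return the per-variable byte strings in order."""
    buf = gzip.decompress(blob)
    pos, result = 0, []
    while pos < len(buf):
        (size,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        result.append(zlib.decompress(buf[pos:pos + size]))
        pos += size
    return result


variables = [("float", b"\x00" * 64), ("int", b"abc" * 10)]
blob = two_phase_compress(variables)
restored = two_phase_decompress(blob)
```

Because the stand-in codec is the same for every type, this sketch will not reproduce the compression gains reported in the slides; swapping real type-specific compressors into `FIRST_PHASE_CODECS` is where those gains would come from.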

How mcrEngine Works

- CNC: compute node component
- ANC: aggregator node component
- Rank-order groups, group size = 4, N->M checkpointing

Steps shown in the diagram (three groups of four CNCs, one ANC per group):
1. The CNCs in a group send their meta-data to the group's ANC
2. The ANC identifies "similar" variables
3. The ANC requests variables from its CNCs (Request T, P; then Request H, D)
4. The ANC applies data-aware aggregation and compression, followed by Gzip
5. The ANC writes the result to the PFS
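Rank-order grouping can be sketched in a few lines. The function name `rank_order_groups` and the choice of integer group ids are assumptions for illustration; the slides state only that groups are formed in rank order with a group size of 4.

```python
def rank_order_groups(num_ranks, group_size):
    """Assign consecutive ranks to the same group; one aggregator serves each group."""
    groups = {}
    for rank in range(num_ranks):
        groups.setdefault(rank // group_size, []).append(rank)
    return groups


# Twelve compute node components with group size 4, as in the diagram.
groups = rank_order_groups(12, 4)
print(groups)
# {0: [0, 1, 2, 3], 1: [4, 5, 6, 7], 2: [8, 9, 10, 11]}
```

Each key corresponds to one aggregator node component, so N = 12 writers become M = 3 writers to the PFS.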

Overview

- Background
- Problem
- Data aggregation & compression
- Evaluation

Evaluation

- Applications
  - ALE3D: 4.8GB per checkpoint set
  - Cactus: 2.41GB per checkpoint set
  - Cosmology: 1.1GB per checkpoint set
  - Implosion: 13MB per checkpoint set
- Experimental test-bed
  - LLNL's Sierra: 261.3 TFLOP/s Linux cluster
  - 15,408 cores, 1.3 Petabyte Lustre file system
- Compression algorithms
  - FPC [1] for double-precision float
  - Fpzip [2] for single-precision float
  - Lempel-Ziv for all other data-types
  - Gzip for general-purpose compression

Evaluation Metrics

- Effectiveness of data-aware compression
  - What is the benefit of multiple compression phases?
  - How does group size affect compression ratio?
  - How does compression ratio change as a simulation progresses?
- Performance of mcrEngine
  - Overhead of the checkpointing phase
  - Overhead of the restart phase

Compression ratio = Uncompressed size / Compressed size
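The compression ratio metric is a direct transcription of the formula above. The helper name `compression_ratio` is hypothetical, and the sample data is made up.

```python
import gzip


def compression_ratio(uncompressed_size, compressed_size):
    """Compression ratio = uncompressed size / compressed size (higher is better)."""
    if compressed_size <= 0:
        raise ValueError("compressed_size must be positive")
    return uncompressed_size / compressed_size


# Example with made-up, highly repetitive data.
data = b"temperature" * 1000
ratio = compression_ratio(len(data), len(gzip.compress(data)))
print(ratio > 1.0)  # True
```

A ratio below 1 means the "compressed" output is larger than the input, which can happen when already-compressed data is compressed again.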

Multiple Phases of Data-Aware Compression are Beneficial

[Figure: compression ratio (0 to 4) after the first phase and the second phase for ALE3D, Cactus, Cosmology and Implosion; series: Data-Aware, Data-Agnostic]

- Multiple phases of data-aware compression are beneficial
  - Data-type aware compression improves compressibility
  - First phase changes the underlying data format
- No benefit with data-agnostic double compression
  - The data format is non-uniform and incompressible

Impact of Group Size on Compression Ratio

[Figure: compression ratio versus group size, one panel each for ALE3D, Cactus, Cosmology and Implosion; series: Aware-Block, Aware]

- Different merging schemes are better for different applications
- Larger group size is beneficial for certain applications
  - ALE3D: improvement of 8% from group size 2 to 32

Data-Aware Technique Always Wins over Data-Agnostic

[Figure: compression ratio versus group size, one panel each for ALE3D, Cactus, Cosmology and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic; annotation: 98-115%]

- Data-aware technique always yields a better compression ratio than the data-agnostic technique

Compression Ratio Follows Course of Simulation

[Figure: compression ratio versus simulation time-steps, one panel each for Cactus, Cosmology and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic]

- Data-aware technique always yields better compression

Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme

Application   Total Size (GB)   Aware-Block (%)   Aware (%)
ALE3D         4.8               6.6 - 27.7        6.6 - 12.7
Cactus        2.41              10.7 - 11.9       98 - 115
Cosmology     1.1               20.1 - 25.6       20.6 - 21.1
Implosion     0.013             36.3 - 38.4       36.3 - 38.8
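The slides do not spell out how the relative improvement is computed. One plausible reading, assumed here, is the percentage gain of the data-aware compression ratio over the data-agnostic one. The function name and the sample ratios below are illustrative only.

```python
def relative_improvement(aware_ratio, agnostic_ratio):
    """Percentage gain in compression ratio over the data-agnostic scheme (assumed definition)."""
    return (aware_ratio - agnostic_ratio) / agnostic_ratio * 100.0


# Made-up ratios: doubling the compression ratio is a 100% relative improvement.
print(relative_improvement(3.0, 1.5))  # 100.0
```

Under this definition, an improvement of about 100% means the data-aware output is roughly half the size of the data-agnostic output, which would be consistent with the "almost 2x" claim in the contributions and the 98 - 115% range reported for Cactus.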

Impact of Aggregation on Scalability

- Used IOR
  - N->N: each process transfers 78MB
  - N->M: group size 32, 1.21GB per aggregator

[Figure: average write time (sec) and average read time (sec) versus # of processes (N), for N = 128 to 8192; series: N->N Write, N->M Write, N->N Read, N->M Read]
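The effect of aggregation on the number of concurrent writers is simple arithmetic, sketched below. The helper name `n_to_m` is hypothetical; the inputs (78MB per process, group size 32, up to 8192 processes) come from the slides.

```python
import math


def n_to_m(num_processes, group_size, mb_per_process):
    """Return (number of aggregators, uncompressed MB gathered per full group)."""
    aggregators = math.ceil(num_processes / group_size)
    mb_per_aggregator = group_size * mb_per_process
    return aggregators, mb_per_aggregator


print(n_to_m(8192, 32, 78))  # (256, 2496)
```

For 8192 processes this gives 256 writers instead of 8192, each holding 2496MB (about 2.4GB) of uncompressed data. That lines up with the 2.4GB data-agnostic figure quoted for groups of 32; how the 1.21GB per aggregator on this slide was obtained is not stated.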

Impact of Data-Aware Compression on Scalability

- IOR with N->M transfer, groups of 32 processes
  - Data-aware: 1.2GB, data-agnostic: 2.4GB
- Data-aware compression improves I/O performance at large scale
  - Improvement during write: 43% - 70%
  - Improvement during read: 48% - 70%

[Figure: average transfer time (sec, 0 to 300) versus # of processes (N), for N = 128 to 28672; series: Agnostic-Read, Agnostic-Write, Aware-Read, Aware-Write]

End-to-End Checkpointing Overhead

- 15,408 processes
- Group size of 32 for N->M schemes
- Each process takes a checkpoint
- Converts a network-bound operation into a CPU-bound one

[Figure: total checkpointing overhead (sec, 0 to 350) for ALE3D and Cactus, split into CPU overhead and transfer overhead to PFS; configurations include No Comp.+N->N, individual compression, No Comp.+N->M, Agnostic+Agg and Aware+Agg; annotated reduction in checkpointing overhead: 87% and 51%]

End-to-End Restart Overhead

- Reduced overall restart overhead
- Reduced network load and transfer time

[Figure: total recovery overhead (sec, 0 to 600) for ALE3D and Cactus, split into CPU overhead and transfer overhead to PFS; configurations include No Comp.+N->N, individual compression, No Comp.+N->M, Agnostic+Agg and Aware+Agg; annotations: 43% and 71% (reduction in I/O overhead), 62% and 64% (reduction in recovery overhead)]

Conclusion

- Developed a data-aware checkpoint compression technique
  - Relative improvement in compression ratio up to 115%
- Investigated different merging techniques
- Evaluated effectiveness using real-world applications
- Designed and developed a scalable framework
  - Implements N->M checkpointing
  - Improves application performance
  - Transforms checkpointing into a CPU-bound operation

Contact Information

- Tanzima Islam ([email protected])
- Website: web.ics.purdue.edu/~tislam

Acknowledgement

- Purdue:
  - Saurabh Bagchi ([email protected])
  - Rudolf Eigenmann ([email protected])
- Lawrence Livermore National Laboratory:
  - Kathryn Mohror ([email protected])
  - Adam Moody ([email protected])
  - Bronis R. de Supinski ([email protected])

Backup Slides

[Backup Slide] Failures in HPC

Source: "A Large-scale Study of Failures in High-performance Computing Systems", by Bianca Schroeder and Garth Gibson

[Figures: breakdown of root causes of failures; breakdown of downtime into root causes]

Future Work

- Analytical solution to group size selection?
- Better way than rank-order grouping?
- Variable streaming?
- Engineering challenge

References

1. M. Burtscher and P. Ratanaworabhan, "FPC: A High-speed Compressor for Double-Precision Floating-Point Data".
2. P. Lindstrom and M. Isenburg, "Fast and Efficient Compression of Floating-Point Data".
3. L. Reinhold, "QuickLZ".