Tanzima Z. Islam, Saurabh Bagchi, Rudolf Eigenmann – Purdue University
Kathryn Mohror, Adam Moody, Bronis R. de Supinski – Lawrence Livermore National Lab.
mcrEngine: A Scalable Checkpointing System Using
Data-Aware Aggregation and Compression
Background
Checkpoint-restart is widely used
MPI applications take globally coordinated checkpoints
Application-level checkpoints use a high-level I/O format
  HDF5, ADIOS, netCDF, etc.
Checkpoint writing
Application -> I/O library (data-format API: netCDF, HDF5)

Application data:
  struct ToyGrp {
      float Temperature[1024];
      short Pressure[20][30];
  };

Resulting HDF5 checkpoint:
  HDF5 checkpoint {
    Group "/" {
      Group "ToyGrp" {
        DATASET "Temperature" {
          DATATYPE H5T_IEEE_F32LE
          DATASPACE SIMPLE { (1024) / (1024) }
        }
        DATASET "Pressure" {
          DATATYPE H5T_STD_U8LE
          DATASPACE SIMPLE { (20,30) / (20,30) }
        }
      }
    }
  }
Checkpoint transfer to the Parallel File System (PFS)
  N->1: Not scalable
  N->N: Easiest, but causes contention on the PFS
  N->M: Best compromise, but complex
Tanzima Islam ([email protected]) mcrEngine: Data-aware Aggregation & Compression
Impact of Load on PFS at Large Scale
IOR benchmark, 78MB of data per process, N->N checkpoint transfer
Observations:
(-) Large average write time leads to less frequent checkpointing
(-) Large average read time leads to poor application performance
[Figure: average write time (s) and average read time (s) vs. number of processes (N), for N = 128 to 8192]
What is the Problem?
Today's checkpoint-restart systems will not scale
  Increasing number of concurrent transfers
  Increasing volume of checkpoint data
Our Contributions
Data-aware aggregation
  Reduces the number of concurrent transfers
  Improves compressibility of checkpoints
Data-aware compression
  Reduces data almost 2x more than simply concatenating the checkpoints and compressing them
Design and development of mcrEngine
  N->M checkpointing system
  Decouples checkpoint transfer logic from applications
  Improves application performance
Overview
Background
Problem
Data aggregation & compression
Evaluation
Data-Agnostic Schemes
Agnostic scheme: concatenate checkpoints
  C1 | C2 -> Gzip -> PFS
Agnostic-block scheme: interleave fixed-size blocks
  C1[1..B] | C2[1..B] | C1[B+1..2B] | C2[B+1..2B] -> Gzip -> PFS
Observations:
(+) Easy
(-) Low compression ratio
First Phase
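The two data-agnostic schemes can be sketched in a few lines. This is a minimal illustration, not the mcrEngine implementation: the function names and the toy byte strings are assumptions, and Python's `gzip` module stands in for the Gzip step on the slide.

```python
import gzip


def agnostic_merge(checkpoints):
    """Agnostic scheme: concatenate whole checkpoints back to back."""
    return b"".join(checkpoints)


def agnostic_block_merge(checkpoints, block_size):
    """Agnostic-block scheme: interleave fixed-size blocks round-robin."""
    out = bytearray()
    longest = max(len(c) for c in checkpoints)
    for start in range(0, longest, block_size):
        for c in checkpoints:
            # The slice is empty once a shorter checkpoint is exhausted.
            out += c[start:start + block_size]
    return bytes(out)


# Toy checkpoints from two processes (hypothetical contents).
c1 = b"AAAABBBB"
c2 = b"aaaabbbb"

merged = agnostic_merge([c1, c2])
interleaved = agnostic_block_merge([c1, c2], block_size=4)

# Either merged buffer is then compressed before being written to the PFS.
compressed = gzip.compress(interleaved)
```

Neither function looks inside the checkpoints, which is what makes the schemes easy to apply but also limits how well the merged stream compresses.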
Identify Similar Variables Across Processes
P0:
  Group ToyGrp {
      float Temperature[1024];
      int Pressure[20][30];
  };
P1:
  Group ToyGrp {
      float Temperature[100];
      int Pressure[10][50];
  };
Meta-data used to match variables:
  1. Name
  2. Data-type
  3. Class: Array, Atomic
Aware scheme: concatenating similar variables
  Checkpoints as written: C1.T | C1.P | C2.T | C2.P
  After merging:          C1.T | C2.T | C1.P | C2.P
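The matching step can be illustrated with a small sketch, assuming each checkpoint has already been parsed into a list of variables. The tuple layout `(name, dtype, var_class, payload)` and the function name `aware_merge` are hypothetical; the slide only states that name, data-type, and class are the meta-data used.

```python
def aware_merge(checkpoints):
    """Aware scheme: concatenate the payloads of variables whose
    (name, data-type, class) meta-data match across processes.

    checkpoints: one list per process of (name, dtype, var_class, payload).
    Returns a dict mapping the meta-data key to the concatenated payload.
    """
    groups = {}
    for variables in checkpoints:
        for name, dtype, var_class, payload in variables:
            groups.setdefault((name, dtype, var_class), []).append(payload)
    return {key: b"".join(parts) for key, parts in groups.items()}


# Toy checkpoints for P0 and P1 (payload bytes are placeholders).
p0 = [("Temperature", "float", "Array", b"T0"),
      ("Pressure", "int", "Array", b"P0")]
p1 = [("Temperature", "float", "Array", b"T1"),
      ("Pressure", "int", "Array", b"P1")]

merged = aware_merge([p0, p1])
```

Note that the array dimensions differ between P0 and P1 in the slide's example; the sketch groups on name, data-type, and class only, so variables of different lengths still land in the same group.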
Aware-Block Scheme
P0:
  Group ToyGrp {
      float Temperature[1024];
      int Pressure[20][30];
  };
P1:
  Group ToyGrp {
      float Temperature[100];
      int Pressure[10][50];
  };
Meta-data used to match variables:
  1. Name
  2. Data-type
  3. Class: Array, Atomic
Interleaving similar variables (input: C1.T | C1.P | C2.T | C2.P)
  First 'B' bytes of Temperature
  Next 'B' bytes of Temperature
  Interleave Pressure
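A minimal sketch of the aware-block idea, assuming the similar variables have already been grouped by their meta-data. The names `interleave_blocks` and `aware_block_merge`, the dict layout, and the toy payloads are assumptions for illustration only.

```python
def interleave_blocks(parts, block_size):
    """Interleave 'block_size'-byte blocks of the same variable taken
    from each process in turn."""
    out = bytearray()
    longest = max(len(p) for p in parts)
    for start in range(0, longest, block_size):
        for p in parts:
            # Shorter payloads simply contribute nothing once exhausted.
            out += p[start:start + block_size]
    return bytes(out)


def aware_block_merge(groups, block_size):
    """Aware-block scheme: interleave blocks within each group of
    similar variables. 'groups' maps a variable key to its per-process
    payloads."""
    return {key: interleave_blocks(parts, block_size)
            for key, parts in groups.items()}


# Toy input: Temperature and Pressure payloads from two processes.
groups = {
    "Temperature": [b"ttttTTTT", b"xxxxXXXX"],
    "Pressure": [b"pp", b"qq"],
}
merged = aware_block_merge(groups, block_size=4)
```

With a block size of 4, the first four bytes of Temperature from each process come first, followed by the next four bytes from each, and Pressure is then handled the same way.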
Data-Aware Aggregation & Compression
Aggregation
  Aware scheme: concatenate similar variables
  Aware-block scheme: interleave similar variables
First phase
  Data-type aware compression (FPC, Lempel-Ziv) of the merged variables T, P, H, D into an output buffer
Second phase
  Gzip the output buffer and write it to the PFS
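The two-phase structure can be sketched as a dispatch on data type followed by a general-purpose pass. FPC and fpzip are not available in the Python standard library, so this sketch swaps in `lzma` and `zlib` as stand-ins; the dispatch table, the data-type labels, and the length-prefixed buffer layout are all assumptions made for illustration and do not describe mcrEngine's actual format.

```python
import gzip
import lzma
import struct
import zlib

# Hypothetical dispatch table. lzma is only a stand-in for the
# floating-point compressors (FPC, fpzip) named on the slides.
FLOAT_CODECS = {
    "float64": lzma.compress,
    "float32": lzma.compress,
}


def first_phase(merged_vars):
    """Compress each merged variable with a codec chosen by data type.

    merged_vars: list of (dtype, payload) pairs.
    Returns one output buffer of length-prefixed compressed chunks.
    """
    out = bytearray()
    for dtype, payload in merged_vars:
        # zlib stands in for the Lempel-Ziv codec used for other types.
        codec = FLOAT_CODECS.get(dtype, zlib.compress)
        chunk = codec(payload)
        out += struct.pack("<I", len(chunk)) + chunk
    return bytes(out)


def second_phase(output_buffer):
    """General-purpose pass over the whole output buffer."""
    return gzip.compress(output_buffer)


temperature = b"\x00" * 64   # placeholder float64 payload
pressure = b"\x01" * 64      # placeholder int32 payload
buf = first_phase([("float64", temperature), ("int32", pressure)])
final = second_phase(buf)
```

The point of the structure is that the first phase sees each variable type separately, so a specialised codec can be applied before the general-purpose compressor runs over the combined buffer.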
How mcrEngine Works
CNC: compute node component
ANC: aggregator node component
Rank-order groups, group size = 4, N->M checkpointing

[Figure: three groups of CNCs, each served by one ANC that writes to the PFS]
Steps shown within each group:
  1. CNCs send meta-data to the ANC
  2. The ANC identifies "similar" variables
  3. The ANC requests variables T, P from the CNCs
  4. The ANC applies data-aware aggregation and compression
  5. The ANC requests variables H, D from the CNCs
  6. The ANC Gzips the merged output (T, P, H, D) and writes it to the PFS
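Rank-order grouping can be sketched as a simple partition of consecutive MPI ranks. The function name is hypothetical, and the sketch says nothing about where the ANC of each group runs, since the slides do not specify that.

```python
def rank_order_groups(num_ranks, group_size):
    """Partition ranks 0..num_ranks-1 into consecutive groups.

    The last group may be smaller when num_ranks is not a multiple
    of group_size.
    """
    return [list(range(start, min(start + group_size, num_ranks)))
            for start in range(0, num_ranks, group_size)]


# Group size 4, as in the diagram; 8 ranks is an illustrative choice.
groups = rank_order_groups(8, 4)

# The same partition applied to 15,408 processes with group size 32.
large = rank_order_groups(15408, 32)
num_aggregated_files = len(large)
```

Under this partition, 15,408 processes with a group size of 32 form 482 groups, the last of which holds 16 ranks, so M is roughly N divided by the group size.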
Evaluation
Applications
  ALE3D: 4.8GB per checkpoint set
  Cactus: 2.41GB per checkpoint set
  Cosmology: 1.1GB per checkpoint set
  Implosion: 13MB per checkpoint set
Experimental test-bed
  LLNL's Sierra: 261.3 TFLOP/s Linux cluster
  15,408 cores, 1.3 Petabyte Lustre file system
Compression algorithms
  FPC [1] for double-precision float
  Fpzip [2] for single-precision float
  Lempel-Ziv for all other data-types
  Gzip for general-purpose compression
Evaluation Metrics
Effectiveness of data-aware compression
  What is the benefit of multiple compression phases?
  How does group size affect compression ratio?
  How does compression ratio change as a simulation progresses?
Performance of mcrEngine
  Overhead of the checkpointing phase
  Overhead of the restart phase
Compression ratio = Uncompressed size / Compressed size
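The metric is straightforward to compute. The sketch below applies it to a toy buffer compressed with Python's `gzip` module; the buffer contents and the function name are illustrative assumptions, not data from the evaluation.

```python
import gzip


def compression_ratio(uncompressed_size, compressed_size):
    """Compression ratio = uncompressed size / compressed size."""
    return uncompressed_size / compressed_size


# Toy data: a highly repetitive buffer compresses well.
data = b"\x00" * 10000
ratio = compression_ratio(len(data), len(gzip.compress(data)))

# A ratio above 1 means the compressed form is smaller than the input.
example = compression_ratio(4800, 1600)
```

A ratio of 3.0, as in the second call, means the compressed data occupies one third of the original size.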
Multiple Phases of Data-Aware Compression are Beneficial
No Benefit with Data-Agnostic Double Compression
[Figure: compression ratio (0 to 4) after the first phase and after the second phase for ALE3D, Cactus, Cosmology, and Implosion; series: Data-Aware, Data-Agnostic]
Data-type aware compression improves compressibility
  The first phase changes the underlying data format
Data-agnostic double compression is not beneficial
  Because the data format is non-uniform and incompressible
Impact of Group Size on Compression Ratio
[Figure: compression ratio vs. group size for ALE3D, Cactus, Cosmology, and Implosion; series: Aware-Block, Aware]
Different merging schemes are better for different applications
Larger group size is beneficial for certain applications
  ALE3D: improvement of 8% from group size 2 to 32
Data-Aware Technique Always Wins over Data-Agnostic
[Figure: compression ratio vs. group size for ALE3D, Cactus, Cosmology, and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic; annotation: 98-115%]
The data-aware technique always yields a better compression ratio than the data-agnostic technique
Compression Ratio Follows Course of Simulation
The data-aware technique always yields better compression
[Figure: compression ratio vs. simulation time-steps for Cactus, Cosmology, and Implosion; series: Aware-Block, Aware, Agnostic-Block, Agnostic]
Relative Improvement in Compression Ratio Compared to Data-Agnostic Scheme
Application   Total Size (GB)   Aware-Block (%)   Aware (%)
ALE3D         4.8               6.6 - 27.7        6.6 - 12.7
Cactus        2.41              10.7 - 11.9       98 - 115
Cosmology     1.1               20.1 - 25.6       20.6 - 21.1
Implosion     0.013             36.3 - 38.4       36.3 - 38.8
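The slides do not spell out how relative improvement is computed. A common definition, assumed here, is the percentage gain of the data-aware compression ratio over the data-agnostic one. The function name and the input ratios below are hypothetical and are not measurements from the talk.

```python
def relative_improvement(aware_ratio, agnostic_ratio):
    """Percentage improvement of the data-aware compression ratio
    over the data-agnostic ratio (assumed definition)."""
    return (aware_ratio - agnostic_ratio) / agnostic_ratio * 100.0


# Hypothetical ratios chosen only to show the scale of the numbers.
doubled = relative_improvement(2.0, 1.0)
upper_end = relative_improvement(2.15, 1.0)
```

Under this definition, an improvement of 100% means the data-aware ratio is twice the data-agnostic one, so the 98 - 115% range reported for Cactus would correspond to roughly 2x to 2.15x.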
Impact of Aggregation on Scalability
Used IOR
  N->N: each process transfers 78MB
  N->M: group size 32, 1.21GB per aggregator
[Figure: average write time (sec) and average read time (sec) vs. number of processes (N), for N = 128 to 8192; series: N->N Write, N->M Write, N->N Read, N->M Read]
Impact of Data-Aware Compression on Scalability
IOR with N->M transfer, groups of 32 processes
Data-aware: 1.2GB, data-agnostic: 2.4GB
Data-aware compression improves I/O performance at large scale
  Improvement during write: 43% - 70%
  Improvement during read: 48% - 70%
[Figure: average transfer time (sec) vs. number of processes (N), for N = 128 to 28672; series: Agnostic-Read, Agnostic-Write, Aware-Read, Aware-Write]
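As a rough consistency check on the sizes quoted for this experiment, the arithmetic below multiplies the 78MB per-process IOR payload by the group size of 32. Treating 1GB as 1024MB is an assumption, and reading the 2.4GB data-agnostic figure as the aggregate for one group is an interpretation of the slide, not something it states.

```python
PER_PROCESS_MB = 78   # per-process payload from the IOR runs
GROUP_SIZE = 32       # processes per aggregator

# Data gathered by one aggregator from its group.
aggregated_mb = PER_PROCESS_MB * GROUP_SIZE
aggregated_gb = aggregated_mb / 1024  # assumes 1GB = 1024MB

# Ratio between the two per-aggregator sizes quoted on the slide.
size_reduction = 2.4 / 1.2
```

The product is 2496MB, about 2.4GB, which appears consistent with the data-agnostic size, and the data-aware size of 1.2GB is about half of that.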
End-to-End Checkpointing Overhead
15,408 processes
Group size of 32 for N->M schemes
Each process takes a checkpoint
Converts a network-bound operation into a CPU-bound one
[Figure: total checkpointing overhead (sec, 0 to 350), split into CPU overhead and transfer overhead to PFS, for ALE3D and Cactus; configurations: No Comp.+N->N, Indiv. Comp., No Comp.+N->M, Agnostic+Agg, Aware+Agg]
Reduction in checkpointing overhead: 87%, 51%
End-to-End Restart Overhead
Reduced overall restart overhead
Reduced network load and transfer time
[Figure: total recovery overhead (sec, 0 to 600), split into CPU overhead and transfer overhead to PFS, for ALE3D and Cactus; configurations: No Comp.+N->N, Indiv. Comp., No Comp.+N->M, Agnostic+Agg, Aware+Agg; annotations in slide order: 43%, 71%; Reduction in I/O overhead; 62%, 64%; Reduction in recovery overhead]
Conclusion
Developed a data-aware checkpoint compression technique
  Relative improvement in compression ratio of up to 115%
Investigated different merging techniques
Evaluated effectiveness using real-world applications
Designed and developed a scalable framework
  Implements N->M checkpointing
  Improves application performance
  Transforms checkpointing into a CPU-bound operation
Contact Information
Tanzima Islam ([email protected])
Website: web.ics.purdue.edu/~tislam
Acknowledgement
Purdue:
  Saurabh Bagchi ([email protected])
  Rudolf Eigenmann ([email protected])
Lawrence Livermore National Laboratory:
  Kathryn Mohror ([email protected])
  Adam Moody ([email protected])
  Bronis R. de Supinski ([email protected])
[Backup Slide] Failures in HPC
Source: "A Large-Scale Study of Failures in High-Performance Computing Systems" by Bianca Schroeder and Garth Gibson
[Figures: breakdown of root causes of failures; breakdown of downtime into root causes]
Future Work
Analytical solution to group size selection?
Better way than rank-order grouping?
Variable streaming?
Engineering challenge
References
1. M. Burtscher and P. Ratanaworabhan, “FPC: A High-speed Compressor for Double-Precision Floating-Point Data”.
2. P. Lindstrom and M. Isenburg, “Fast and Efficient Compression of Floating-Point Data”.
3. L. Reinhold, “QuickLZ”.