Parallelization of HIRLAM
Model Parallelization and Asynchronous I/O
ECMWF Workshop 2004 - Parallelisation of HIRLAM, 2004/10/27
HIRLAM Optimizations
• Two major projects:
– Model scalability improvement
– Asynchronous I/O optimization
• Both projects funded by DMI and NEC
HIRLAM Parallelization
Message Passing Optimization
HIRLAM Scalability Optimization
• Methods
• Implementation
• Performance
Optimization Focus
• Data transposition
– from 2D to FFT distribution and reverse
– from FFT to TRI distribution and reverse
• Exchange of halo points (see the sketch below)
– between north and south
– between east and west
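To make the halo exchange concrete, here is a minimal north-south swap in C with non-blocking MPI. It is a sketch only, not HIRLAM code: the buffer layout, names, and tags are illustrative assumptions, and the east-west swap is analogous.

```c
/* Minimal north-south halo swap sketch (not HIRLAM code; names, tags
   and buffer layout are illustrative).  "north"/"south" are the ranks
   of the neighbouring sub grids, or MPI_PROC_NULL at a domain edge,
   which MPI turns into a no-op. */
#include <mpi.h>

void swap_ns_halo(double *north_halo, double *south_halo,
                  double *north_edge, double *south_edge,
                  int count, int north, int south, MPI_Comm comm)
{
    MPI_Request req[4];

    /* Post receives first so the matching sends can complete early. */
    MPI_Irecv(north_halo, count, MPI_DOUBLE, north, 0, comm, &req[0]);
    MPI_Irecv(south_halo, count, MPI_DOUBLE, south, 1, comm, &req[1]);

    /* Our southern edge fills the southern neighbour's northern halo
       (tag 0), and our northern edge its northern neighbour's southern
       halo (tag 1). */
    MPI_Isend(south_edge, count, MPI_DOUBLE, south, 0, comm, &req[2]);
    MPI_Isend(north_edge, count, MPI_DOUBLE, north, 1, comm, &req[3]);

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}
```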
Approach
• First attempt: straightforward conversion from SHMEM to MPI-2 put/get calls
– It works, but with too much overhead due to the fine granularity
• Original plan shattered, now what?
– Panic (mild form)
• More to do: in-depth analysis of how things really work
• Human memory: retention at least 6 years
Approach
• Redesign of the transposition, based on the method used in met.no's PARLAM initiative (Dag Bjørge and Roar Skålin, 1995)
• Redesign of the halo swap routines
• Benefits:
– fewer and larger messages
– independent message-passing process groups
2D Sub Grids
• HIRLAM sub grid definition in TWOD data distribution
• Processors:
nproc = nprocx · nprocy
[Figure: 3 x 4 processor grid, ranks 0-11 laid out row-major over the longitude-latitude plane; each processor holds all vertical levels of its sub grid]
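The figure's numbering implies a simple row-major mapping between grid coordinates and ranks. Below is a small self-contained check in C; the helper names are illustrative, not HIRLAM identifiers.

```c
/* Row-major rank layout of the TWOD distribution, as in the figure:
   nprocx = 3, nprocy = 4, ranks 0..11.  Helper names are illustrative. */
#include <assert.h>

static int rank_of(int px, int py, int nprocx) { return py * nprocx + px; }
static int px_of(int rank, int nprocx) { return rank % nprocx; }
static int py_of(int rank, int nprocx) { return rank / nprocx; }

int main(void)
{
    const int nprocx = 3;
    assert(rank_of(2, 1, nprocx) == 5);               /* second row, third column */
    assert(px_of(7, nprocx) == 1 && py_of(7, nprocx) == 2);
    return 0;
}
```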
Original FFT Sub Grids
• HIRLAM sub grid definition in FFT data distribution
• Each processor handles slabs of full longitude lines
2D→FFT Redistribution
Sub grid data to be distributed to all processors:
nproc² send-receive pairs
2D→FFT Redistribution
• Sub grids in the east-west direction form full longitude lines
• nprocy independent sets of nprocx² send-receive pairs
• Total number of pairs: nprocy · nprocx², or nproc² / nprocy
• nprocy times fewer messages (see the sketch below)
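The independent process groups can be realised with a single MPI_Comm_split over the 2D communicator: all ranks in the same grid row get one communicator, so each 2D→FFT redistribution stays within its nprocx ranks. A sketch, assuming the row-major rank layout shown earlier; the names are illustrative, not HIRLAM's.

```c
/* Sketch: build nprocy independent row communicators so the 2D->FFT
   redistribution runs as nprocy separate sets of nprocx^2 send-receive
   pairs instead of one global set of nproc^2 pairs. */
#include <mpi.h>

MPI_Comm make_row_comm(MPI_Comm comm2d, int nprocx)
{
    int rank;
    MPI_Comm row;

    MPI_Comm_rank(comm2d, &rank);

    /* colour = grid row, key = position within the row */
    MPI_Comm_split(comm2d, rank / nprocx, rank % nprocx, &row);
    return row;
}
```

Within each row communicator, the transfer itself can then use any of the methods discussed on the MPI Methods slide.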
Transpositions 2D→FFT→TRI
[Figure: ranks 0-11 shown in each of the three distributions (2D, FFT, TRI) over the longitude, latitude, and level axes]
MPI Methods
• Three transfer methods tried:
– Remote Memory Access: mpi_put, mpi_get
– Async Point-to-Point: mpi_isend, mpi_irecv
– All-to-All: mpi_alltoallv, mpi_alltoallw
• Buffering vs. direct:
– Explicit buffering
– MPI derived types
• Method selection by environment variables (see the sketch below)
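A minimal sketch of runtime method selection via an environment variable. The variable and value names are assumptions for illustration; the slides do not state the actual names used in HIRLAM.

```c
/* Sketch: pick the transfer method at runtime from an environment
   variable.  HIRLAM_MP_METHOD and its values are hypothetical names. */
#include <stdlib.h>
#include <string.h>

enum mp_method { MP_P2P, MP_RMA, MP_ALLTOALL };

enum mp_method select_method(void)
{
    const char *s = getenv("HIRLAM_MP_METHOD");    /* hypothetical */
    if (s == NULL)                  return MP_P2P;       /* default */
    if (strcmp(s, "rma") == 0)      return MP_RMA;       /* mpi_put/mpi_get */
    if (strcmp(s, "alltoall") == 0) return MP_ALLTOALL;  /* mpi_alltoallv/w */
    return MP_P2P;                                       /* mpi_isend/mpi_irecv */
}
```

On the SX-6 the plain point-to-point path turned out to be the best default (see the observations below).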
SHMEM Support
• Integrated in the new design
• Not much specific SHMEM code left: the same buffer structures are used for MPI and SHMEM
• Very similar to the MPI-2 RMA part of the code
• New SHMEM code ported to the SHMEM architecture by Ole Vignes, met.no
Performance
Test grid details:
Longitude: 602 points
Latitude: 568 points
Vertical: 60 levels
NSTOP: 40 steps
Initialization: none
Time step: 180 seconds
Parallel Speedup on NEC SX-6
• Cluster of 8 NEC SX-6 nodes at DMI
• Up to 60 processors:
– 7 nodes with 8 processors per node
– 1 node with 4 processors
• Parallel efficiency 78% on 60 processors
[Plot: parallel speedup (0-8) vs. number of processors (0-60), with curves for ideal, original, and optimized code]
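For reference, parallel efficiency here is speedup over ideal speedup. The speedup axis tops out at 8, which suggests the curves are normalised to a single 8-processor node; under that assumption (not stated on the slide), the quoted 78% works out as follows:

```latex
E(p) = \frac{S(p)}{S_{\mathrm{ideal}}(p)}, \qquad
S_{\mathrm{ideal}}(60) = \frac{60}{8} = 7.5, \qquad
S(60) \approx 0.78 \times 7.5 \approx 5.9
```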
Performance – Observations on SX-6
• New data redistribution method much more efficient (78% vs. 45% on 60 processors)
• No performance advantage with RMA (one-sided MP) or All-to-All over plain Point-to-Point method
• MPI derived types give elegant code, but explicit buffering is faster
Performance – Observations on SGI Origin
• Tests executed at met.no with grid size 468x378x40 points on 196 processors
• Time step times:
– Old SHMEM: 1.4 s
– New SHMEM: 1.15 s
– MPI: 0.85 s
• Conclusions regarding SHMEM:
– New SHMEM code faster than old SHMEM code
– New MPI code faster than new SHMEM code
– End of the SHMEM version?
HIRLAM GRIBfile Server
Asynchronous I/O
HGS Overview
HIRLAM GRIBfile Server
• Purpose: take both input and output processing out of the time-stepping loop
• The server is a virtual concept, not a separate application or system
• Integrated in the model source
• Enabled/disabled at runtime
HGS Overview
Brief history
1. HGS originally written for a shared memory architecture (IPC) (Sun, Jan Boerhout, 2001)
2. Similar initiative for a distributed memory architecture (MPI), output only (CSC, Jussi Heikonen, and FMI, Kalle Eerola, 2001)
3. HGS rewrite for MPI, based on ideas from 1 and 2 (met.no, Ole Vignes, 2002)
4. Recently optimized for faster data exchange (NEC, Jan Boerhout, 2004; funded by DMI)
HIRLAM I/O Activity - Sequential
• 3-hour forecast
• 610 x 568 x 40 points
• 15 SX-6 processors (8 GF peak)
• Hourly input and output
• Time step 360 seconds
• Graph shows the function call stack on a time line
[Figure: time line of the sequential run; model time steps interleaved with input and output phases]
HIRLAM I/O Activity - Asynchronous
[Figure: time line with HGS and Model rows; disk input and disk output run on the HGS row, while model input and model output appear as short exchanges on the Model row]
HIRLAM I/O Activity - Asynchronous
[Figure: time line, markers A, 1 and 2]
HGS:
1. Receives the pre-read request and reads the first input file
2. Reads the second input file
Model:
A. Sends HGS a request to pre-read the first two input files and waits for the HGS availability confirmation (a sketch of this handshake follows)
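A sketch of the model-side pre-read handshake in step A. The tag values and request payload are illustrative assumptions; the slides describe the protocol, not the message format.

```c
/* Sketch of the model-side pre-read request (step A).  Tags and the
   payload are hypothetical; only the protocol shape - request, then
   wait for the availability confirmation - follows the slide. */
#include <mpi.h>

#define TAG_PREREAD 100   /* hypothetical */
#define TAG_READY   101   /* hypothetical */

void model_request_preread(int hgs_rank, int nfiles, MPI_Comm comm)
{
    int ready = 0;

    /* Ask HGS to pre-read the first nfiles input files... */
    MPI_Send(&nfiles, 1, MPI_INT, hgs_rank, TAG_PREREAD, comm);

    /* ...and block until HGS confirms availability (steps 1-2 run on
       the HGS side in the meantime). */
    MPI_Recv(&ready, 1, MPI_INT, hgs_rank, TAG_READY, comm,
             MPI_STATUS_IGNORE);
}
```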
HIRLAM I/O Activity - Asynchronous
[Figure: time line, markers B and 3]
HGS:
3. Sends the data from the first two input files
Model:
B. Receives the data from the first two input files
HIRLAM I/O Activity - Asynchronous
[Figure: time line, markers C, D and 4]
HGS:
4. Collects the data for the first output files
Model:
C. First time step
D. Sends the data for the first three output files
HIRLAM Model and HGS Animation
• Demonstration of output (input is very similar)
– Four model sub grids (Grid 0 to Grid 3)
– One HGS process
– Buffer block size: 4 fields (in reality: 100)
• Model running...
• Next slides: the model writes output, HGS collects and stores the data
HGS Collects Output
[Animation, four frames: the model sub grids send output fields to the GRIBfile Server, which collects them in its buffer]
HIRLAM I/O Activity - Output
[Figure: time line of the output exchange between Model and HGS]
Output: short model communication time
Will be even shorter with DMI's new MSLP algorithm
HIRLAM I/O Activity - Asynchronous
[Figure: time line, markers E and 5]
HGS:
5. Writes the output files to disk
Model:
E. Sends requests to store output and pre-read the next input, then continues
Model continues while HGS Writes to Disk
[Animation, nine frames: the model carries on with time stepping while the GRIBfile Server writes the collected fields to disk]
HIRLAM I/O Activity - Asynchronous
[Figure: time line, markers F, G and 6-7]
HGS:
6. Reads the next input file
7. Sends the input data
Model:
F. Sends requests to receive the input data
G. Next time step
HIRLAM I/O Activity - Input
[Figure: time line of the input exchange between Model and HGS]
Input: very short model communication time
HIRLAM I/O Activity - Asynchronous
[Figure: time line, markers H and 8, with D and 4 recurring]
HGS:
8. Collects output data (repeat of 4)
Model:
H. Sends output data (repeat of D)
Multiple HGS Tasks Supported
• Multiple HGS tasks: 3 HGS, 12 model processes (1 shown); see the sketch below
[Figure: time line with one Model row and three HGS rows handling input and output]
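One plausible way to set up independent model and HGS process groups is a single MPI_Comm_split over MPI_COMM_WORLD. A sketch, with the "last n_hgs ranks are servers" convention as an assumption, not the documented HGS layout:

```c
/* Sketch: split MPI_COMM_WORLD into model ranks and HGS server ranks,
   e.g. 12 model + 3 HGS.  The "last n_hgs ranks are servers" rule is
   an assumption for illustration. */
#include <mpi.h>

/* Returns 1 for an HGS rank, 0 for a model rank; *comm becomes the
   communicator of this rank's own group. */
int split_model_hgs(int n_hgs, MPI_Comm *comm)
{
    int rank, size, is_hgs;

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    is_hgs = (rank >= size - n_hgs);
    MPI_Comm_split(MPI_COMM_WORLD, is_hgs, rank, comm);
    return is_hgs;
}
```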
HGS Documentation
• More detailed descriptions in HGS documentation
• To be published as HIRLAM Newsletter, see: http://hirlam.knmi.nl
• Graphs generated with Vftrace, a free tool for NEC SX users; see http://vftrace.jboerhout.nl
[Figure 4.4 - Model sends output data - overview: annotated time line with model markers A-O and HGS markers 1-7]
Questions?
• Thank you!